Uber's Distributed Deep Learning Journey
Alex Sergeev, Machine Learning Platform, Uber Engineering
@alsrgv
Deep Learning
Credit: Andrew Ng, https://www.slideshare.net/ExtractConf
Deep Learning @ Uber
● Self-Driving Vehicles
● Trip Forecasting
● Fraud Detection
● … and much more!
How does Deep Learning work?
How does Deep Learning training work?
TensorFlow
● Most popular open source framework for deep learning
● Combines high performance with the ability to tinker with
low-level model details
● Has end-to-end support from research to production
Going Distributed
● Train very large models
● Speed up model training
Model Parallelism Data Parallelism
Going Distributed Cont.
● Modern GPUs have a lot of
RAM
● Vast majority of use cases are
data-parallel
● Facebook demonstrated
training ResNet-50 on
ImageNet in 1 hour
(arxiv.org/abs/1706.02677)
Parameter Server Technique
tf.train.Server()
tf.train.ClusterSpec()
tf.train.replica_device_setter()
tf.train.SyncReplicasOptimizer()
Parameter Server
Worker GPU Towers
Parameter Server Technique - Example Script
Image Source: TensorFlow -- https://www.tensorflow.org/deploy/distributed
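The example script on this slide is only an image in the deck; the snippet below is a rough, hedged reconstruction of the parameter-server pattern using the TF 1.x APIs listed on the previous slide. Host names, ports, and the task index are hypothetical placeholders, and the model itself is elided just as in the other examples.

import tensorflow as tf

# Hypothetical cluster layout: one parameter server, two workers.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0:2222"],
    "worker": ["worker0:2222", "worker1:2222"],
})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Variables are placed on the parameter server, compute ops on the local worker.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    loss = ...  # build model
    opt = tf.train.AdagradOptimizer(0.01)
    # Aggregate gradients from all workers before applying an update.
    opt = tf.train.SyncReplicasOptimizer(opt, replicas_to_aggregate=2)
    global_step = tf.train.get_or_create_global_step()
    train_op = opt.minimize(loss, global_step=global_step)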
Parameter Server Technique - Performance
With the ImageNet dataset of 1.3M images, this allows ResNet-101 to be trained for one
epoch in 3.5 minutes. Scaling efficiency on 128 GPUs is only 42%, however.
How Can We Improve?
● Re-think necessary complexity for data-parallel case
● Improve communication algorithm
● Use RDMA-capable networking (InfiniBand, RoCE)
Meet Horovod
● Distributed training framework for TensorFlow
● Inspired by HPC techniques and work of Baidu,
Facebook, et al.
● Uses bandwidth-optimal communication protocols
○ Makes use of RDMA (InfiniBand, RoCE) if available
● Seamlessly installs on top of TensorFlow via
pip install horovod
● Named after a traditional Russian folk dance in which
participants dance in a circle with linked hands
Horovod Technique
Patarasuk, P., & Yuan, X. (2009). Bandwidth optimal all-reduce algorithms for clusters of workstations.
Journal of Parallel and Distributed Computing, 69(2), 117-124. doi:10.1016/j.jpdc.2008.09.002
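The reference above describes the ring-allreduce algorithm Horovod builds on. The snippet below is an illustrative single-process NumPy simulation of that algorithm (a scatter-reduce pass followed by an allgather pass), not Horovod's actual MPI/NCCL implementation.

import numpy as np

def ring_allreduce(worker_tensors):
    """Simulate ring-allreduce: every worker ends up with the element-wise sum."""
    n = len(worker_tensors)
    # Each worker splits its tensor into n chunks.
    chunks = [list(np.array_split(t.astype(float), n)) for t in worker_tensors]
    # Scatter-reduce: in step s, worker i sends chunk (i - s) % n to worker (i + 1) % n.
    for s in range(n - 1):
        sends = [(i, (i - s) % n, chunks[i][(i - s) % n].copy()) for i in range(n)]
        for i, c, data in sends:
            chunks[(i + 1) % n][c] += data
    # Worker i now holds the fully reduced chunk (i + 1) % n.
    # Allgather: in step s, worker i sends chunk (i + 1 - s) % n to worker (i + 1) % n.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n, chunks[i][(i + 1 - s) % n].copy()) for i in range(n)]
        for i, c, data in sends:
            chunks[(i + 1) % n][c] = data
    return [np.concatenate(c) for c in chunks]

# Example: 4 workers, each holding a gradient vector; all end up with the sum.
grads = [np.full(8, rank, dtype=float) for rank in range(4)]
print(ring_allreduce(grads)[0])  # [6. 6. 6. 6. 6. 6. 6. 6.]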
Horovod Stack
● Plugs into TensorFlow via custom op mechanism
● Uses MPI for worker discovery and reduction coordination
● Uses NVIDIA NCCL for actual reduction on the server and across servers
Horovod Example
import tensorflow as tf
import horovod.tensorflow as hvd
# Initialize Horovod
hvd.init()
# Pin GPU to be used
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
# Build model...
loss = ...
opt = tf.train.AdagradOptimizer(0.01)
# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)
# Add hook to broadcast variables from rank 0 to all other processes during initialization.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
# Make training operation
train_op = opt.minimize(loss)
# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing when done
# or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir="/tmp/train_logs",
                                       config=config, hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        # Perform synchronous training.
        mon_sess.run(train_op)
Horovod Example - Keras
import keras
from keras import backend as K
import tensorflow as tf
import horovod.keras as hvd
# Initialize Horovod.
hvd.init()
# Pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())
K.set_session(tf.Session(config=config))
# Build model…
model = ...
opt = keras.optimizers.Adadelta(1.0)
# Add Horovod Distributed Optimizer.
opt = hvd.DistributedOptimizer(opt)
model.compile(loss=keras.losses.categorical_crossentropy, optimizer=opt, metrics=['accuracy'])
# Broadcast initial variable states from rank 0 to all other processes.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(x_train, y_train,
          callbacks=callbacks,
          epochs=10,
          validation_data=(x_test, y_test))
Horovod Example - Estimator API
import tensorflow as tf
import horovod.tensorflow as hvd
# Initialize Horovod
hvd.init()
# Pin GPU to be used
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
# Build model...
def model_fn(features, labels, mode):
    loss = …
    opt = tf.train.AdagradOptimizer(0.01)
    # Add Horovod Distributed Optimizer
    opt = hvd.DistributedOptimizer(opt)
    train_op = opt.minimize(loss=loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
# Add hook to broadcast variables from rank 0 to all other processes during initialization.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
# Create the Estimator
mnist_classifier = tf.estimator.Estimator(
    model_fn=model_fn, model_dir="/tmp/mnist_convnet_model",
    config=tf.estimator.RunConfig(session_config=config))
mnist_classifier.train(input_fn=train_input_fn, steps=100, hooks=hooks)
Running Horovod
● MPI takes care of launching processes on all machines
● Run on a 4 GPU machine (Open MPI 3.0.0):
○ $ mpirun -np 4 \
      -H localhost:4 \
      -bind-to none -map-by slot \
      -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
      python train.py
● Run on 4 machines with 4 GPUs (Open MPI 3.0.0):
○ $ mpirun -np 16 \
      -H server1:4,server2:4,server3:4,server4:4 \
      -bind-to none -map-by slot \
      -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
      python train.py
● Boilerplate mpirun arguments are easily hidden in a convenience script
Debugging - Horovod Timeline
● Discovered that ResNet-152 has a lot of tiny tensors
● Added Tensor Fusion - smart batching that yields large gains
(bigger gains on less optimized networks)
Horovod Performance
With Horovod, the same ResNet-101 can be trained for one epoch on ImageNet in 1.5 minutes.
Scaling efficiency improves to 88%, making it twice as efficient as standard distributed TF.
Horovod Performance Cont.
RDMA further helps to improve efficiency - by 30% for VGG-16.
Practical Aspects - Initialization
● Use the broadcast operation to make sure all workers start
with the same weights
● Otherwise, the averaged gradient
will not point towards the minimum
(shown in red)
Practical Aspects - Data Partitioning
● Shuffle the dataset
● Partition records among workers
● Train by sequentially reading the partition
● After epoch is done, reshuffle and partition again
NOTE: make sure that all
partitions contain the
same number of batches,
otherwise the training will
deadlock
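A minimal sketch of this partitioning scheme using tf.data and Horovod's rank/size follows; the file names, shuffle buffer, and batch size are hypothetical placeholders.

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Hypothetical list of training files.
filenames = ["train-00000.tfrecord", "train-00001.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)
# Give each worker its own partition of the records.
dataset = dataset.shard(hvd.size(), hvd.rank())
# Reshuffle each epoch; equal-sized partitions keep the per-worker
# batch count identical and avoid the deadlock noted above.
dataset = dataset.shuffle(buffer_size=10000).batch(64).repeat()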
Practical Aspects - Random Sampling
● Shuffle the dataset
● Train by randomly reading data from whole dataset
● After epoch is done, reshuffle
Practical Aspects - Data
● Random sampling may cause some records to be read
multiple times in a single epoch, while others are not read at all
● In practice, both approaches typically yield the same results
● Conclusion: use the most convenient option for your case
● Remember: validation can also be distributed, but make sure
to average validation results from all the workers
when using learning rate schedules that depend on validation
○ Horovod comes with MetricAverageCallback for Keras
Practical Aspects - Learning Rate Adjustment
● Facebook, in the paper “Accurate, Large Minibatch SGD:
Training ImageNet in 1 Hour” (arxiv.org/abs/1706.02677),
recommends linear scaling of the learning rate:
○ LR_N = LR_1 * N
○ Requires smooth warmup during
first K epochs, as shown below
○ Works up to batch size 8192
● Horovod comes with
LearningRateWarmupCallback for Keras
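In Keras terms, the linear scaling rule plus warmup might look like the sketch below. The base learning rate of 0.0125 and warmup_epochs=5 follow the Facebook recipe mentioned above; the callback arguments reflect Horovod's early Keras examples and may differ across versions.

import keras
import horovod.keras as hvd

hvd.init()

# Linear scaling rule: LR_N = LR_1 * N, where N = number of workers.
opt = keras.optimizers.SGD(lr=0.0125 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)

callbacks = [
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    # Average validation metrics across workers before LR schedules use them.
    hvd.callbacks.MetricAverageCallback(),
    # Smoothly ramp the learning rate up over the first few epochs.
    hvd.callbacks.LearningRateWarmupCallback(warmup_epochs=5, verbose=1),
]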
Practical Aspects - Learning Rate Adjustment Cont.
● Yang You, Igor Gitman, and Boris Ginsburg, in the paper “Large
Batch Training of Convolutional Networks” (arxiv.org/abs/1708.03888),
demonstrated scaling to a batch of 32K examples
○ Use per-layer adaptive learning rate scaling
● Google published a paper “Don't Decay the Learning Rate,
Increase the Batch Size” (arxiv.org/abs/1711.00489) arguing
that typical learning rate decay can be replaced with an
increase of the batch size
Practical Aspects - Checkpointing & Logs
● Typically, a server would have multiple GPUs
● To avoid clashes, write checkpoints, TensorBoard logs,
and other artifacts only on worker 0:
○ if hvd.rank() == 0:
# write checkpoint
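With MonitoredTrainingSession this is commonly done by passing a checkpoint directory only on rank 0, roughly as sketched below; the path is hypothetical and config, hooks, and train_op come from the earlier Horovod example.

# Only worker 0 writes checkpoints and TensorBoard logs; other ranks pass None.
checkpoint_dir = "/tmp/train_logs" if hvd.rank() == 0 else None

with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       config=config, hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        mon_sess.run(train_op)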
Practical Results at Uber
● Used Facebook’s learning rate adjustment technique
● Trained convolutional networks and LSTMs in hours
instead of days or weeks with the same final accuracy
● You can do that, too!
Giving Back
Horovod is available on GitHub:
https://github.com/uber/horovod
Thank you!
Horovod on our Eng Blog: https://eng.uber.com/horovod
Michelangelo on our Eng Blog: https://eng.uber.com/michelangelo
ML at Uber on YouTube: http://t.uber.com/ml-meetup
Proprietary and confidential © 2017 Uber Technologies, Inc. All rights reserved. No part of this
document may be reproduced or utilized in any form or by any means, electronic or mechanical,
including photocopying, recording, or by any information storage or retrieval systems, without
permission in writing from Uber. This document is intended only for the use of the individual or entity
to whom it is addressed and contains information that is privileged, confidential or otherwise exempt
from disclosure under applicable law. All recipients of this document are notified that the information
contained herein includes proprietary and confidential information of Uber, and recipient may not
make use of, disseminate, or in any way disclose this document or any of the enclosed information to
any person other than employees of addressee to the extent necessary for consultations with
authorized personnel of Uber.