SlideShare ist ein Scribd-Unternehmen logo
1 von 40
Downloaden Sie, um offline zu lesen
Syed Nasar, 2018
Spark and Deep Learning frameworks
with distributed workloads
Strata Data, New York September 2018
Syed Nasar | @techmazed
Syed Nasar, 2018 2
Agenda
• Machine Learning and Deep Learning on a scalable infrastructure
• Comparative look - CPU vs CPU+GPU vs GPU
• Challenges with distributing ML pipelines and algorithms
• Frameworks that support distributed scaling
• Spark/YARN on GPUs vs CPUs
Syed Nasar, 2018 3
Machine Learning and Deep Learning on a scalable infrastructure
Syed Nasar, 2018 4
• Single machine multiple CPU cores
• Training distributed across machines - still mostly using CPUs
• For Deep Learning - Sometimes requires GPUs
• Distributed across machines
• Mixed setup (CPU+GPU)
Machine Learning on a scalable infrastructure
Single
Machine
(GPUs)
Single
Machine
(CPUs)
Multiple
Machines
(CPUs)
Multiple
Machines
(CPUs +
GPUs)
Multiple
Machines
(GPUs)
Syed Nasar, 2018 5
• Single machine - multiple GPUs
• Distributed deep learning across machines - sometimes inevitable
Deep Learning on a scalable infrastructure
Model parallelization, Data parallelization
Single
Machine
(GPUs)
Multiple
Machines
(CPUs +
GPUs)
Multiple
Machines
(GPUs)
Syed Nasar, 2018 6
• Memory in neural networks - to store input data, weight parameters and
activations as an input propagates through the network.
• GPUs' reliance on dense vectors - fill SIMD compute engines
• CPU/GPU intensive - matrix multiplications - weights x activations.
Deep Learning - why resource hungry
Model parallelization challenges
Using 32-bit floating-point -
parallelise training data
7.5 GB of
local DRAM
Mini-batch of
32
Example: 50-Layer ResNet
Syed Nasar, 2018 7
CPU vs CPU+GPU vs GPU
Syed Nasar, 2018 8
CPU vs GPU
CPU
• Few very complex cores
• Single-thread performance
optimization
• Transistor space dedicated to
complex ILP
• Few die surface for integer and fp
units
• Has other memory types but they
are provisioned only as caches, not
directly accessible to the
programmer
GPU
• Hundreds of simpler cores
• Thousands of concurrent hardware
threads
• Maximize floating-point throughput
• Most die surface for integer and fp
units
• Has several forms of memory of
varying speeds and capacities -
memory types are exposed to the
programmer
Syed Nasar, 2018 9
• Hardware configurations offer
large aggregate IO bandwidth
• Spark’s optimizer allows many
workloads - avoiding significant
disk IO
• Spark’s shuffle subsystem,
serialization & hashing key -
bottlenecks
• Spark constrained by CPU
efficiency and memory pressure
rather than IO.
Why is CPU the new bottleneck?
Image Source: https://devblogs.nvidia.com/bidmach-machine-learning-limit-gpus/
Syed Nasar, 2018 10
Challenges with distributing ML & DL pipelines, algorithms &
processes
Syed Nasar, 2018 11
Training
• Compute heavy
• Requires synchronization across network of machines
• GPUs often necessary/beneficial
Syed Nasar, 2018 12
Inference
• Embarrassingly parallel
• Varying latency requirements
• Can be mostly done on CPUs - GPUs often unnecessary
Syed Nasar, 2018 13
Parameters that define the model architecture are referred to as
hyperparameters.
HPO (hyperparameter tuning) is a process of searching for the ideal
model architecture.
• HPO is a necessary part of all model training
• It is embarrassingly parallel
• Can avoid distributed training
Scenario:
• Run 4 HP configurations, 1/gpu, in parallel vs. 4 HP
configurations, 1/4gpu, in serial
Hyper-parameter optimization
Hyper-Opt
Sckit-optimize
Spearmint
MOE
Grid-Search
Randomized Search
Tools
Optimization
Methods
Syed Nasar, 2018 14
Transfer learning
Transfer learning - machine learning where
“knowledge” learned on one task is applied to
another, related task.
• Take publicly available models and
re-purpose them for your task
• Leverage the work from hundreds of GPUs that is
already baked in
• Train a small percentage of the model for
your task
• Greatly reduced computational requirements
mean you may not need GPUs or may not
need distributed architecture
Source task /
domain
Target task /
domain
Model Model
Knowledge
Update
target
model
reuse
Initial
training
Apply
Customized
model
Syed Nasar, 2018 15
Current Landscape
• Machine Learning Frameworks
• Deep learning Frameworks
• Distributed Frameworks
• Spark based Frameworks
Syed Nasar, 2018 16
Distributed
Machine Learning
Frameworks
Syed Nasar, 2018 17
• Spark stores the model parameters in the driver
• Workers communicate with the driver to update the
parameters after each iteration.
• For large scale machine learning deployments, the
model parameters may not fit into the driver node
and they would need to be maintained as an RDD.
• Drawback -
• This introduces a lot of overhead because a new
RDD will need to be created in each iteration to hold
the updated model parameters.
• Since updating the model usually involves shuffling
data across machines, this limits the scalability of
Spark.
Machine learning on Spark
ML Frameworks
worker
nodes
master
node
Syed Nasar, 2018 18
• Both data and workloads are
distributed over worker nodes
• server nodes maintain globally shared
parameters
• represented as dense or sparse vectors
and matrices
• The framework manages
asynchronous data communication
between nodes
• Flexible consistency models
• Elastic scalability
• Continuous fault tolerance.
Using PMLS Parameter-Server framework
ML Frameworks
Image Source: https://cse.buffalo.edu/~demirbas/publications/DistMLplat.pdf
DistBelief - Google
PMLS
Parameter Server
Frameworks
PMLS(Bosen)Architecture
Syed Nasar, 2018 19
Distributed Deep
Learning
Frameworks
Syed Nasar, 2018 20
• Tensorflow distributed relies on
master, worker, parameter server
processes
• Provides fine-grained control, you can
place individual tensors and
operations
• Relies on a cluster specification to be
passed to each process
• Need to take care of booting each
process and syncing them up
somehow
• Uses Google RPC protocol
Distributed TensorFlow
DL Frameworks
Source - https://www.tensorflow.org/deploy/distributed
Tasks - workers
Client
job parameter server
TFCluster
Tasks - workers
TF Server
tensorflow:Session
RPC
Syed Nasar, 2018 21
Distributed TensorFlow
tf.train.ClusterSpec({
"worker": [
"worker0.example.com:2222",
"worker1.example.com:2222",
"worker2.example.com:2222"
],
"ps": [
"ps0.example.com:2222",
"ps1.example.com:2222"
]})
# In task 0:
cluster = tf.train.ClusterSpec({"local": ["localhost:2222",
"localhost:2223"]})
server = tf.train.Server(cluster, job_name="local", task_index=0)
# In task 1:
cluster = tf.train.ClusterSpec({"local": ["localhost:2222",
"localhost:2223"]})
server = tf.train. Server(cluster, job_name="local", task_index=1)
tf.train.ClusterSpec
tf.train.Server
Syed Nasar, 2018 22
• Has a distributed package that
provides MPI style primitives for
distributing work
• Has interface for exchanging tensor
data across multi-machine networks
• Currently supports four backends
(tcp, gloo, mpi, nccl - CPU/GPU)
• Only recently incorporated
• Not a lot of documentation
PyTorch (torch.distributed)
DL Frameworks import torch
import torch.distributed as dist
dist.init_process_group(backend="nccl",
init_method="file:///distributed_test",
world_size=2,
rank=0)
tensor_list = []
for dev_idx in range(torch.cuda.device_count()):
tensor_list.append(torch.FloatTensor([1]).cuda(dev_idx
))
dist.all_reduce_multigpu(tensor_list)Multi-GPU
collective functions
Syed Nasar, 2018 23
import mxnet.ndarray as nd
X = nd.zeros((10000, 40000), mx.cpu(0))
#Allocate an array to store 1000 datapoints (of 40k
dimensions) that lives on the CPU
W1 = nd.zeros(shape=(40000, 1024), mx.gpu(0))
#Allocate a 40k x 1024 weight matrix on GPU for the 1st
layer of the net
W2 = nd.zeros(shape=(1024, 10), mx.gpu(0))
#Allocate a 1024 x 1024 weight matrix on GPU for the 2nd
layer of the net
• Parameter server/worker
architecture
• Must compile from source to use
• Provide built-in “launchers”, e.g.
YARN, Kubernetes, but still
cumbersome
• Data Loading(IO) - Efficient
distributed data loading and
augmentation.
• Can specify the context of the
function to be executed within -
that tells if it should be run on
CPU or GPU
Apache MXNet
DL Frameworks
Allocating parameters and data points
into GPU memory
Syed Nasar, 2018 24
Horovod is a distributed training framework for TensorFlow, Keras, and PyTorch.
• Released by Uber to make data parallel deep learning using TF easier
• Introduces a ring all-reduce pattern to eliminate need for parameter servers
• Uses MPI
• Require less code changes than the Distributed TensorFlow with parameter
servers
Horovod
DL Frameworks
To run on 4 machines
with 4 GPUs each
Syed Nasar, 2018 25
Kubeflow
DL Frameworks
Machine Learning Toolkit for Kubernetes
• Custom resources for deploying
ML tasks on Kubernetes
Currently basic support for TF
only
• Boots up your TF processes
• Can place on GPUs, automatic
restarts, etc…
• Configures Kubernetes services
for you
• Still need to know lots of
Kubernetes, plus KSonnet!
KubeFlow distributed training job
Image source: https://github.com/kubeflow/kubeflow/issues/33
Syed Nasar, 2018 26
ParallelWrapper wrapper = new
ParallelWrapper.Builder(model)
.prefetchBuffer(24)
.workers(2)
.averagingFrequency(3)
.reportScoreAfterAveraging(true)
.build();
• Deep learning in Java
• Comes with Spark support for data parallel
architecture
• Also takes advantage of Hadoop
• Works with multi-GPUs
• Also has feature engineering/preprocessing
tools, model serving tools, etc...
DL4J
DL Frameworks
ParallelWrapper to load
balance between GPUs
Syed Nasar, 2018 27
• Built for Spark, only works with Spark
• No GPUs!! CPUs are fast too, Intel says
• Good option if no GPU and tons of
commodity CPUs
• Whole team of devs from Intel working on
it
• If you have a Spark cluster and want to do
DL and are not super concerned with
performance - then consider BigDL
BigDL
DL Frameworks
$SPARK_HOME/bin/spark-submit 
--deploy-mode cluster --class
com.intel.analytics.bigdl.models.lenet.Train
--master
k8s://https://<k8s-apiserver-host>:<k8s-apis
erver-port> --kubernetes-namespace default
--conf spark.executor.instances=4
...
$BIGDL_HOME/lib/bigdl-0.4.0-SNAPSHOT-jar-wit
h-dependencies.jar -f
hdfs://master:9000/mnist 
-b 128 -e 2 --checkpoint /tmp
BigDL on
Kubernetes
Syed Nasar, 2018 28
• Lightweight wrapper that boots up
distributed Tensorflow processes in
Spark executors
• Potentially high serialization costs
• Not super active development
• Supports training (CPU/GPU) and
inference
• Easy to integrate with other Spark
stages/algos
TensorflowOnSpark
DL Frameworks
Image source: https://goo.gl/CX8N9B
ML Pipeline with multiple
programs on separate
clusters
Syed Nasar, 2018 29
• Released by Databricks for doing Transfer Learning and Inference as a Spark
pipeline stage
• Good for simple use cases
• Relies on Databricks’ Tensorframes
Spark DL Pipelines
Tensorframes
featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features",
modelName="InceptionV3")
lr = LogisticRegression(maxIter=20, regParam=0.05, elasticNetParam=0.3, labelCol="label")
p = Pipeline(stages=[featurizer, lr])
For technical preview only
Syed Nasar, 2018 30
On YARN
• GPU on YARN
• Spark on YARN
• TensorFlow on YARN
Syed Nasar, 2018 31
• YARN is the resource
management layer for the
Apache Hadoop ecosystem.
• Pre Hadoop 3.1 had CPU and
memory hard-coded as the only
available types of consumable
resources.
• With Hadoop 3.1 YARN is
declaratively configurable - can
create GPU type resources for
which YARN will track
consumption.
Hadoop YARN
Support for GPUs
Master host with ResourceManager and Worker
hosts with NodeManager
Syed Nasar, 2018 32
<configuration>
<property>
<name>yarn.resource-types</name>
<value>yarn.io/gpu</value>
</property>
</configuration>
• As of now, only Nvidia GPUs are
supported by YARN
• YARN node managers have to be
pre-installed with Nvidia drivers.
• When Docker is used as
container runtime context,
nvidia-docker 1.0 needs to be
installed
• One can set additional
configurations to allow admins
leverage specialized
requirements.
GPU On YARN
Hadoop 3.1.x
Source - https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/UsingGpus.html
<property>
<name>yarn.nodemanager.resource-plugins</name>
<value>yarn.io/gpu</value>
</property>
GPU scheduling -
yarn-site.xml
GPU isolation -
yarn-site.xml
Syed Nasar, 2018 33
• A new layer in Hadoop for
launching, distributing and
executing Deep Learning
workloads
• Leverage and support existing
Deep Learning engines
(TensorFlow, Caffe, MXNet)
• Extend and enhance YARN to
support the desired scheduling
capabilities (for FPGA, GPU)
Deep Learning on Hadoop
HDL
HDL Architecture
Source - https://github.com/Intel-bigdata/HDL
Syed Nasar, 2018 34
TensorFlow on YARN
Toolkit to enable Hadoop users an easy way to run
TensorFlow applications in distributed pattern and
accomplish tasks including model management
and serving inference.
• One YARN cluster can run multiple
TensorFlow clusters
• The tasks from the same and different
sessions can run in the same node
Source: https://github.com/Intel-bigdata/TensorFlowOnYARN
Syed Nasar, 2018 35
Spark on YARN (with GPU)
On Hadoop 3.1.0
• First class GPU support
• YARN clusters can schedule and
use GPU resources
• To get GPU isolation - and to pool
GPUs Hadoop YARN cluster
should be Docker enabled.
Source - https://issues.apache.org/jira/browse/YARN-6223
SPARK
YARN
GPUs
TF
Caffe
MXNet
Caffe
TF
SPARK
MXNet
non-GPU
machines
Syed Nasar, 2018 36
Other frameworks on YARN
Work in progress
• CaffeOnYARN
• Caffe on YARN is a project to support running Caffe on YARN, based on CaffeOnSpark
from yahoo to rebase on YARN by removing Spark dependency. It's a part of Deep
Learning on Hadoop (HDL).
• Note - Current project is a prototype with limitation and is still under development.
• MXNetOnYARN
• MXNet on YARN is a project based on dmlc-core and MXNet, aiming at running MXNet
on YARN with high efficiency and flexibility. It's an important part of Deep Learning on
Hadoop (HDL).
• Note - both the codebase and documentation are work in progress. They may not be the
final version.
Sources
● CaffeOnYARN - https://github.com/Intel-bigdata/CaffeOnYARN
● MXNetOnYARN - https://github.com/Intel-bigdata/MXNetOnYARN
Syed Nasar, 2018 37
CDH 6.x
• YARN Improvements
Syed Nasar, 2018 38
CDH 6.0
Work in progress
• Ability to add arbitrary consumable resources (via YARN configuration
files) and have the FairScheduler schedule based on those consumable
resources.
• Some examples of consumable resources that can be used are GPU and
FPGA.
• Support for MapReduce
Syed Nasar, 2018 39
CDH 6.1
Upcoming features
• Boolean (i.e. non-consumable) Resource Types that can be used for
labeling nodes and scheduling based on those.
• Some examples are nodes that have a specific license installed or nodes
that have a specific OS version.
• Support for Spark
• CM UI to be able to configure Resource Types
• Preemption capabilities with arbitrary Resource Types
Syed Nasar, 2018 40
THANK YOU
Machine Learning Presentations: cloudera.com/ml
Syed Nasar
@techmazed
https://www.linkedin.com/in/knownasar/

Weitere ähnliche Inhalte

Was ist angesagt?

A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersA Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersDataWorks Summit/Hadoop Summit
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Data Con LA
 
Low latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache KuduLow latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache KuduDataWorks Summit
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
 
Get most out of Spark on YARN
Get most out of Spark on YARNGet most out of Spark on YARN
Get most out of Spark on YARNDataWorks Summit
 
Azure Custom Backup Solution for SAP NetWeaver
Azure Custom Backup Solution for SAP NetWeaverAzure Custom Backup Solution for SAP NetWeaver
Azure Custom Backup Solution for SAP NetWeaverGary Jackson MBCS
 
Elastify Cloud-Native Spark Application with Persistent Memory
Elastify Cloud-Native Spark Application with Persistent MemoryElastify Cloud-Native Spark Application with Persistent Memory
Elastify Cloud-Native Spark Application with Persistent MemoryDatabricks
 
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.J On The Beach
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopHortonworks
 
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...DataWorks Summit
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)Todd Lipcon
 
Supporting Over a Thousand Custom Hive User Defined Functions
Supporting Over a Thousand Custom Hive User Defined FunctionsSupporting Over a Thousand Custom Hive User Defined Functions
Supporting Over a Thousand Custom Hive User Defined FunctionsDatabricks
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 
Tuning Apache Ambari performance for Big Data at scale with 3000 agents
Tuning Apache Ambari performance for Big Data at scale with 3000 agentsTuning Apache Ambari performance for Big Data at scale with 3000 agents
Tuning Apache Ambari performance for Big Data at scale with 3000 agentsDataWorks Summit
 
Leveraging GPU-Accelerated Analytics on top of Apache Spark with Todd Mostak
Leveraging GPU-Accelerated Analytics on top of Apache Spark with Todd MostakLeveraging GPU-Accelerated Analytics on top of Apache Spark with Todd Mostak
Leveraging GPU-Accelerated Analytics on top of Apache Spark with Todd MostakDatabricks
 
Deep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDeep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDataWorks Summit
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkVenkata Naga Ravi
 

Was ist angesagt? (20)

A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersA Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Low latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache KuduLow latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache Kudu
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Get most out of Spark on YARN
Get most out of Spark on YARNGet most out of Spark on YARN
Get most out of Spark on YARN
 
Azure Custom Backup Solution for SAP NetWeaver
Azure Custom Backup Solution for SAP NetWeaverAzure Custom Backup Solution for SAP NetWeaver
Azure Custom Backup Solution for SAP NetWeaver
 
Elastify Cloud-Native Spark Application with Persistent Memory
Elastify Cloud-Native Spark Application with Persistent MemoryElastify Cloud-Native Spark Application with Persistent Memory
Elastify Cloud-Native Spark Application with Persistent Memory
 
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
 
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)
 
Supporting Over a Thousand Custom Hive User Defined Functions
Supporting Over a Thousand Custom Hive User Defined FunctionsSupporting Over a Thousand Custom Hive User Defined Functions
Supporting Over a Thousand Custom Hive User Defined Functions
 
Rapids: Data Science on GPUs
Rapids: Data Science on GPUsRapids: Data Science on GPUs
Rapids: Data Science on GPUs
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Tuning Apache Ambari performance for Big Data at scale with 3000 agents
Tuning Apache Ambari performance for Big Data at scale with 3000 agentsTuning Apache Ambari performance for Big Data at scale with 3000 agents
Tuning Apache Ambari performance for Big Data at scale with 3000 agents
 
Leveraging GPU-Accelerated Analytics on top of Apache Spark with Todd Mostak
Leveraging GPU-Accelerated Analytics on top of Apache Spark with Todd MostakLeveraging GPU-Accelerated Analytics on top of Apache Spark with Todd Mostak
Leveraging GPU-Accelerated Analytics on top of Apache Spark with Todd Mostak
 
Deep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDeep Learning with Spark and GPUs
Deep Learning with Spark and GPUs
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
 
Yarns About Yarn
Yarns About YarnYarns About Yarn
Yarns About Yarn
 

Ähnlich wie Spark and Deep Learning frameworks with distributed workloads

Choose Your Weapon: Comparing Spark on FPGAs vs GPUs
Choose Your Weapon: Comparing Spark on FPGAs vs GPUsChoose Your Weapon: Comparing Spark on FPGAs vs GPUs
Choose Your Weapon: Comparing Spark on FPGAs vs GPUsDatabricks
 
High Performance Deep learning with Apache Spark
High Performance Deep learning with Apache SparkHigh Performance Deep learning with Apache Spark
High Performance Deep learning with Apache SparkRui Liu
 
Distributed DNN training: Infrastructure, challenges, and lessons learned
Distributed DNN training: Infrastructure, challenges, and lessons learnedDistributed DNN training: Infrastructure, challenges, and lessons learned
Distributed DNN training: Infrastructure, challenges, and lessons learnedWee Hyong Tok
 
Democratizing machine learning on kubernetes
Democratizing machine learning on kubernetesDemocratizing machine learning on kubernetes
Democratizing machine learning on kubernetesDocker, Inc.
 
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Lablup Inc.
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networksinside-BigData.com
 
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and AlluxioAdvancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and AlluxioAlluxio, Inc.
 
Using Deep Learning Toolkits with Kubernetes clusters
Using Deep Learning Toolkits with Kubernetes clustersUsing Deep Learning Toolkits with Kubernetes clusters
Using Deep Learning Toolkits with Kubernetes clustersJoy Qiao
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetupGanesan Narayanasamy
 
Toronto meetup 20190917
Toronto meetup 20190917Toronto meetup 20190917
Toronto meetup 20190917Bill Liu
 
Distributed Tensorflow with Kubernetes - data2day - Jakob Karalus
Distributed Tensorflow with Kubernetes - data2day - Jakob KaralusDistributed Tensorflow with Kubernetes - data2day - Jakob Karalus
Distributed Tensorflow with Kubernetes - data2day - Jakob KaralusJakob Karalus
 
RAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceRAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceData Works MD
 
Predicting Flights with Azure Databricks
Predicting Flights with Azure DatabricksPredicting Flights with Azure Databricks
Predicting Flights with Azure DatabricksSarah Dutkiewicz
 
GPU and Deep learning best practices
GPU and Deep learning best practicesGPU and Deep learning best practices
GPU and Deep learning best practicesLior Sidi
 
Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep...
 Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep... Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep...
Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep...Databricks
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudRose Toomey
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudDatabricks
 
AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parall...
AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parall...AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parall...
AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parall...Amazon Web Services
 

Ähnlich wie Spark and Deep Learning frameworks with distributed workloads (20)

Choose Your Weapon: Comparing Spark on FPGAs vs GPUs
Choose Your Weapon: Comparing Spark on FPGAs vs GPUsChoose Your Weapon: Comparing Spark on FPGAs vs GPUs
Choose Your Weapon: Comparing Spark on FPGAs vs GPUs
 
High Performance Deep learning with Apache Spark
High Performance Deep learning with Apache SparkHigh Performance Deep learning with Apache Spark
High Performance Deep learning with Apache Spark
 
Distributed DNN training: Infrastructure, challenges, and lessons learned
Distributed DNN training: Infrastructure, challenges, and lessons learnedDistributed DNN training: Infrastructure, challenges, and lessons learned
Distributed DNN training: Infrastructure, challenges, and lessons learned
 
Democratizing machine learning on kubernetes
Democratizing machine learning on kubernetesDemocratizing machine learning on kubernetes
Democratizing machine learning on kubernetes
 
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networks
 
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and AlluxioAdvancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
 
Using Deep Learning Toolkits with Kubernetes clusters
Using Deep Learning Toolkits with Kubernetes clustersUsing Deep Learning Toolkits with Kubernetes clusters
Using Deep Learning Toolkits with Kubernetes clusters
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup
 
Toronto meetup 20190917
Toronto meetup 20190917Toronto meetup 20190917
Toronto meetup 20190917
 
Distributed Tensorflow with Kubernetes - data2day - Jakob Karalus
Distributed Tensorflow with Kubernetes - data2day - Jakob KaralusDistributed Tensorflow with Kubernetes - data2day - Jakob Karalus
Distributed Tensorflow with Kubernetes - data2day - Jakob Karalus
 
RAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceRAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data Science
 
Predicting Flights with Azure Databricks
Predicting Flights with Azure DatabricksPredicting Flights with Azure Databricks
Predicting Flights with Azure Databricks
 
Vinetalk: The missing piece for cluster managers to enable accelerator sharing
Vinetalk: The missing piece for cluster managers to enable accelerator sharingVinetalk: The missing piece for cluster managers to enable accelerator sharing
Vinetalk: The missing piece for cluster managers to enable accelerator sharing
 
GPU and Deep learning best practices
GPU and Deep learning best practicesGPU and Deep learning best practices
GPU and Deep learning best practices
 
Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep...
 Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep... Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep...
Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep...
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
Typesafe spark- Zalando meetup
Typesafe spark- Zalando meetupTypesafe spark- Zalando meetup
Typesafe spark- Zalando meetup
 
AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parall...
AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parall...AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parall...
AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parall...
 

Kürzlich hochgeladen

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 

Kürzlich hochgeladen (20)

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 

Spark and Deep Learning frameworks with distributed workloads

  • 1. Syed Nasar, 2018 Spark and Deep Learning frameworks with distributed workloads Strata Data, New York September 2018 Syed Nasar | @techmazed
  • 2. Syed Nasar, 2018 2 Agenda • Machine Learning and Deep Learning on a scalable infrastructure • Comparative look - CPU vs CPU+GPU vs GPU • Challenges with distributing ML pipelines and algorithms • Frameworks that support distributed scaling • Spark/YARN on GPUs vs CPUs
  • 3. Syed Nasar, 2018 3 Machine Learning and Deep Learning on a scalable infrastructure
  • 4. Syed Nasar, 2018 4 • Single machine multiple CPU cores • Training distributed across machines - still mostly using CPUs • For Deep Learning - Sometimes requires GPUs • Distributed across machines • Mixed setup (CPU+GPU) Machine Learning on a scalable infrastructure Single Machine (GPUs) Single Machine (CPUs) Multiple Machines (CPUs) Multiple Machines (CPUs + GPUs) Multiple Machines (GPUs)
  • 5. Syed Nasar, 2018 5 • Single machine - multiple GPUs • Distributed deep learning across machines - sometimes inevitable Deep Learning on a scalable infrastructure Model parallelization, Data parallelization Single Machine (GPUs) Multiple Machines (CPUs + GPUs) Multiple Machines (GPUs)
  • 6. Syed Nasar, 2018 6 • Memory in neural networks - to store input data, weight parameters and activations as an input propagates through the network. • GPUs' reliance on dense vectors - fill SIMD compute engines • CPU/GPU intensive - matrix multiplications - weights x activations. Deep Learning - why resource hungry Model parallelization challenges Using 32-bit floating-point - parallelise training data 7.5 GB of local DRAM Mini-batch of 32 Example: 50-Layer ResNet
  • 7. Syed Nasar, 2018 7 CPU vs CPU+GPU vs GPU
  • 8. Syed Nasar, 2018 8 CPU vs GPU CPU • Few very complex cores • Single-thread performance optimization • Transistor space dedicated to complex ILP • Few die surface for integer and fp units • Has other memory types but they are provisioned only as caches, not directly accessible to the programmer GPU • Hundreds of simpler cores • Thousands of concurrent hardware threads • Maximize floating-point throughput • Most die surface for integer and fp units • Has several forms of memory of varying speeds and capacities - memory types are exposed to the programmer
  • 9. Syed Nasar, 2018 9 • Hardware configurations offer large aggregate IO bandwidth • Spark’s optimizer allows many workloads - avoiding significant disk IO • Spark’s shuffle subsystem, serialization & hashing key - bottlenecks • Spark constrained by CPU efficiency and memory pressure rather than IO. Why is CPU the new bottleneck? Image Source: https://devblogs.nvidia.com/bidmach-machine-learning-limit-gpus/
  • 10. Syed Nasar, 2018 10 Challenges with distributing ML & DL pipelines, algorithms & processes
  • 11. Syed Nasar, 2018 11 Training • Compute heavy • Requires synchronization across network of machines • GPUs often necessary/beneficial
  • 12. Syed Nasar, 2018 12 Inference • Embarrassingly parallel • Varying latency requirements • Can be mostly done on CPUs - GPUs often unnecessary
  • 13. Syed Nasar, 2018 13 Parameters that define the model architecture are referred to as hyperparameters. HPO (hyperparameter tuning) is a process of searching for the ideal model architecture. • HPO is a necessary part of all model training • It is embarrassingly parallel • Can avoid distributed training Scenario: • Run 4 HP configurations, 1/gpu, in parallel vs. 4 HP configurations, 1/4gpu, in serial Hyper-parameter optimization Hyper-Opt Sckit-optimize Spearmint MOE Grid-Search Randomized Search Tools Optimization Methods
  • 14. Syed Nasar, 2018 14 Transfer learning Transfer learning - machine learning where “knowledge” learned on one task is applied to another, related task. • Take publicly available models and re-purpose them for your task • Leverage the work from hundreds of GPUs that is already baked in • Train a small percentage of the model for your task • Greatly reduced computational requirements mean you may not need GPUs or may not need distributed architecture Source task / domain Target task / domain Model Model Knowledge Update target model reuse Initial training Apply Customized model
  • 15. Syed Nasar, 2018 15 Current Landscape • Machine Learning Frameworks • Deep learning Frameworks • Distributed Frameworks • Spark based Frameworks
  • 16. Syed Nasar, 2018 16 Distributed Machine Learning Frameworks
  • 17. Syed Nasar, 2018 17 • Spark stores the model parameters in the driver • Workers communicate with the driver to update the parameters after each iteration. • For large scale machine learning deployments, the model parameters may not fit into the driver node and they would need to be maintained as an RDD. • Drawback - • This introduces a lot of overhead because a new RDD will need to be created in each iteration to hold the updated model parameters. • Since updating the model usually involves shuffling data across machines, this limits the scalability of Spark. Machine learning on Spark ML Frameworks worker nodes master node
  • 18. Syed Nasar, 2018 18 • Both data and workloads are distributed over worker nodes • server nodes maintain globally shared parameters • represented as dense or sparse vectors and matrices • The framework manages asynchronous data communication between nodes • Flexible consistency models • Elastic scalability • Continuous fault tolerance. Using PMLS Parameter-Server framework ML Frameworks Image Source: https://cse.buffalo.edu/~demirbas/publications/DistMLplat.pdf DistBelief - Google PMLS Parameter Server Frameworks PMLS(Bosen)Architecture
  • 19. Syed Nasar, 2018 19 Distributed Deep Learning Frameworks
  • 20. Syed Nasar, 2018 20 • Tensorflow distributed relies on master, worker, parameter server processes • Provides fine-grained control, you can place individual tensors and operations • Relies on a cluster specification to be passed to each process • Need to take care of booting each process and syncing them up somehow • Uses Google RPC protocol Distributed TensorFlow DL Frameworks Source - https://www.tensorflow.org/deploy/distributed Tasks - workers Client job parameter server TFCluster Tasks - workers TF Server tensorflow:Session RPC
  • 21. Syed Nasar, 2018 21 Distributed TensorFlow tf.train.ClusterSpec({ "worker": [ "worker0.example.com:2222", "worker1.example.com:2222", "worker2.example.com:2222" ], "ps": [ "ps0.example.com:2222", "ps1.example.com:2222" ]}) # In task 0: cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]}) server = tf.train.Server(cluster, job_name="local", task_index=0) # In task 1: cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]}) server = tf.train. Server(cluster, job_name="local", task_index=1) tf.train.ClusterSpec tf.train.Server
  • 22. Syed Nasar, 2018 22 • Has a distributed package that provides MPI style primitives for distributing work • Has interface for exchanging tensor data across multi-machine networks • Currently supports four backends (tcp, gloo, mpi, nccl - CPU/GPU) • Only recently incorporated • Not a lot of documentation PyTorch (torch.distributed) DL Frameworks import torch import torch.distributed as dist dist.init_process_group(backend="nccl", init_method="file:///distributed_test", world_size=2, rank=0) tensor_list = [] for dev_idx in range(torch.cuda.device_count()): tensor_list.append(torch.FloatTensor([1]).cuda(dev_idx )) dist.all_reduce_multigpu(tensor_list)Multi-GPU collective functions
  • 23. Syed Nasar, 2018 23 import mxnet.ndarray as nd X = nd.zeros((10000, 40000), mx.cpu(0)) #Allocate an array to store 1000 datapoints (of 40k dimensions) that lives on the CPU W1 = nd.zeros(shape=(40000, 1024), mx.gpu(0)) #Allocate a 40k x 1024 weight matrix on GPU for the 1st layer of the net W2 = nd.zeros(shape=(1024, 10), mx.gpu(0)) #Allocate a 1024 x 1024 weight matrix on GPU for the 2nd layer of the net • Parameter server/worker architecture • Must compile from source to use • Provide built-in “launchers”, e.g. YARN, Kubernetes, but still cumbersome • Data Loading(IO) - Efficient distributed data loading and augmentation. • Can specify the context of the function to be executed within - that tells if it should be run on CPU or GPU Apache MXNet DL Frameworks Allocating parameters and data points into GPU memory
  • 24. Syed Nasar, 2018 24 Horovod is a distributed training framework for TensorFlow, Keras, and PyTorch. • Released by Uber to make data parallel deep learning using TF easier • Introduces a ring all-reduce pattern to eliminate need for parameter servers • Uses MPI • Require less code changes than the Distributed TensorFlow with parameter servers Horovod DL Frameworks To run on 4 machines with 4 GPUs each
  • 25. Syed Nasar, 2018 25 Kubeflow DL Frameworks Machine Learning Toolkit for Kubernetes • Custom resources for deploying ML tasks on Kubernetes Currently basic support for TF only • Boots up your TF processes • Can place on GPUs, automatic restarts, etc… • Configures Kubernetes services for you • Still need to know lots of Kubernetes, plus KSonnet! KubeFlow distributed training job Image source: https://github.com/kubeflow/kubeflow/issues/33
  • 26. Syed Nasar, 2018 26 ParallelWrapper wrapper = new ParallelWrapper.Builder(model) .prefetchBuffer(24) .workers(2) .averagingFrequency(3) .reportScoreAfterAveraging(true) .build(); • Deep learning in Java • Comes with Spark support for data parallel architecture • Also takes advantage of Hadoop • Works with multi-GPUs • Also has feature engineering/preprocessing tools, model serving tools, etc... DL4J DL Frameworks ParallelWrapper to load balance between GPUs
  • 27. Syed Nasar, 2018 27 • Built for Spark, only works with Spark • No GPUs!! CPUs are fast too, Intel says • Good option if no GPU and tons of commodity CPUs • Whole team of devs from Intel working on it • If you have a Spark cluster and want to do DL and are not super concerned with performance - then consider BigDL BigDL DL Frameworks $SPARK_HOME/bin/spark-submit --deploy-mode cluster --class com.intel.analytics.bigdl.models.lenet.Train --master k8s://https://<k8s-apiserver-host>:<k8s-apis erver-port> --kubernetes-namespace default --conf spark.executor.instances=4 ... $BIGDL_HOME/lib/bigdl-0.4.0-SNAPSHOT-jar-wit h-dependencies.jar -f hdfs://master:9000/mnist -b 128 -e 2 --checkpoint /tmp BigDL on Kubernetes
  • 28. Syed Nasar, 2018 28 • Lightweight wrapper that boots up distributed Tensorflow processes in Spark executors • Potentially high serialization costs • Not super active development • Supports training (CPU/GPU) and inference • Easy to integrate with other Spark stages/algos TensorflowOnSpark DL Frameworks Image source: https://goo.gl/CX8N9B ML Pipeline with multiple programs on separate clusters
  • 29. Syed Nasar, 2018 29 • Released by Databricks for doing Transfer Learning and Inference as a Spark pipeline stage • Good for simple use cases • Relies on Databricks’ Tensorframes Spark DL Pipelines Tensorframes featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features", modelName="InceptionV3") lr = LogisticRegression(maxIter=20, regParam=0.05, elasticNetParam=0.3, labelCol="label") p = Pipeline(stages=[featurizer, lr]) For technical preview only
  • 30. Syed Nasar, 2018 30 On YARN • GPU on YARN • Spark on YARN • TensorFlow on YARN
  • 31. Syed Nasar, 2018 31 • YARN is the resource management layer for the Apache Hadoop ecosystem. • Pre Hadoop 3.1 had CPU and memory hard-coded as the only available types of consumable resources. • With Hadoop 3.1 YARN is declaratively configurable - can create GPU type resources for which YARN will track consumption. Hadoop YARN Support for GPUs Master host with ResourceManager and Worker hosts with NodeManager
  • 32. Syed Nasar, 2018 32 <configuration> <property> <name>yarn.resource-types</name> <value>yarn.io/gpu</value> </property> </configuration> • As of now, only Nvidia GPUs are supported by YARN • YARN node managers have to be pre-installed with Nvidia drivers. • When Docker is used as container runtime context, nvidia-docker 1.0 needs to be installed • One can set additional configurations to allow admins leverage specialized requirements. GPU On YARN Hadoop 3.1.x Source - https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/UsingGpus.html <property> <name>yarn.nodemanager.resource-plugins</name> <value>yarn.io/gpu</value> </property> GPU scheduling - yarn-site.xml GPU isolation - yarn-site.xml
  • 33. Syed Nasar, 2018 33 • A new layer in Hadoop for launching, distributing and executing Deep Learning workloads • Leverage and support existing Deep Learning engines (TensorFlow, Caffe, MXNet) • Extend and enhance YARN to support the desired scheduling capabilities (for FPGA, GPU) Deep Learning on Hadoop HDL HDL Architecture Source - https://github.com/Intel-bigdata/HDL
  • 34. Syed Nasar, 2018 34 TensorFlow on YARN Toolkit to enable Hadoop users an easy way to run TensorFlow applications in distributed pattern and accomplish tasks including model management and serving inference. • One YARN cluster can run multiple TensorFlow clusters • The tasks from the same and different sessions can run in the same node Source: https://github.com/Intel-bigdata/TensorFlowOnYARN
  • 35. Syed Nasar, 2018 35 Spark on YARN (with GPU) On Hadoop 3.1.0 • First class GPU support • YARN clusters can schedule and use GPU resources • To get GPU isolation - and to pool GPUs Hadoop YARN cluster should be Docker enabled. Source - https://issues.apache.org/jira/browse/YARN-6223 SPARK YARN GPUs TF Caffe MXNet Caffe TF SPARK MXNet non-GPU machines
  • 36. Syed Nasar, 2018 36 Other frameworks on YARN Work in progress • CaffeOnYARN • Caffe on YARN is a project to support running Caffe on YARN, based on CaffeOnSpark from yahoo to rebase on YARN by removing Spark dependency. It's a part of Deep Learning on Hadoop (HDL). • Note - Current project is a prototype with limitation and is still under development. • MXNetOnYARN • MXNet on YARN is a project based on dmlc-core and MXNet, aiming at running MXNet on YARN with high efficiency and flexibility. It's an important part of Deep Learning on Hadoop (HDL). • Note - both the codebase and documentation are work in progress. They may not be the final version. Sources ● CaffeOnYARN - https://github.com/Intel-bigdata/CaffeOnYARN ● MXNetOnYARN - https://github.com/Intel-bigdata/MXNetOnYARN
  • 37. Syed Nasar, 2018 37 CDH 6.x • YARN Improvements
  • 38. Syed Nasar, 2018 38 CDH 6.0 Work in progress • Ability to add arbitrary consumable resources (via YARN configuration files) and have the FairScheduler schedule based on those consumable resources. • Some examples of consumable resources that can be used are GPU and FPGA. • Support for MapReduce
  • 39. Syed Nasar, 2018 39 CDH 6.1 Upcoming features • Boolean (i.e. non-consumable) Resource Types that can be used for labeling nodes and scheduling based on those. • Some examples are nodes that have a specific license installed or nodes that have a specific OS version. • Support for Spark • CM UI to be able to configure Resource Types • Preemption capabilities with arbitrary Resource Types
  • 40. Syed Nasar, 2018 40 THANK YOU Machine Learning Presentations: cloudera.com/ml Syed Nasar @techmazed https://www.linkedin.com/in/knownasar/