SlideShare ist ein Scribd-Unternehmen logo
1 von 33
Downloaden Sie, um offline zu lesen
Distributed Tensorflow with Kubernetes
Jakob Karalus, @krallistic,
1
Training Neural Networks
•First steps are quick and easy.
•Single Node Neural Networks
•We want:
•More Data!
•Deeper Models!
•Wider Model!
•Higher Accuracy!
•(Single Node) Compute cant keep up
•Longer Trainingstime -> Longer Cycles -> Lower Productivity
2
Distributed & Parallel
•We need to distribute and train in parallel to be efficient
•Data Parallelism
•Model Parallelsim
•Grid Search
•Predict
•-> Build in TF
•How can we deploy that to a cluster
•Schedule TF onto Nodes
•Service Discovery
•GPUs
•-> Kubernetes
3
Requirements & Content
•Basic Knowledge of Neural Networks
•Knowledge of Tensorflow
•Basic Docker/Kubernetes knowledge
•(Docker) Containers: Mini VM (Wrong!)
•Kubernetes: Schedulurer/Orchestration Tool for Containers
•Only focus on the task of parallel and/or distributed training
•We will not look at architectures like CNN/LTSM etc
4
Tensorflow on a single node
5
•Build your Graph
•Define which Parts on each device
•TF places data
•DMA for coordination/communication
• Define loss, accuracy etc
•Create Session for training
•Feed Data into Session
•Retrieve results
cpu:0 gpu:0
/job:worker/task:0/
Client Code
Parameter Server & Worker Replicas
•Client: Code that builds the Graph, communicates with cluster, builds the session
•Cluster: Set of nodes which have jobs (roles)
•Jobs
•Worker Replicas: compute intensive part
•Parameter Servers(ps): Holds model state and reacts to updates
•Each job can hold 0..* task
•Task
•The actual server process
•Worker Task 0 is by default the chief worker
•Responsible for checkpointing, initialising and health checking
•CPU 0 represents all CPUs on the Node
6
Client Code
Session
Graph
Graph
cpu:0 gpu:0
/job:worker/task:0/
cpu:0
/job:ps/task:0/
In Graph Replication
•Split up input into equal chunks,
•Loops over workers and assign a chunk
•collect results and optimise
•Not the recommended way
•Graph get big, lot of communication overhead
•Each device operates on all data
7
cpu:0 gpu:0
/job:worker/task:0/
cpu:0 gpu:0
/job:worker/task:0/
cpu:0
/job:ps/task:0/
Client Code
Between Replication
•Recommend way of doing replication
•Similiar to MPI
•Each device operates on a partition
•Different Client Program on each worker
•Assign itself to local resources
•Small graph independently
8
cpu:0gpu:0
/job:worker/task:0/
cpu:0 gpu:0
/job:worker/task:0/
cpu:0
/job:ps/task:0/
Client CodeClient Code
Variable Placement
•How to place the Variable onto different devices
•Manual Way
•Easy to start, full flexibility
•Gets annoying soon
•Device setter
•Automatic assign variables to ps and ops to workers
•Simple round robin by default
•Greedy Load Balancing Strategy
•Partitioned Values
•Needed for really large variables (often used in text embeddings)
•Splits variables between multiple Parameter Server
9
Training Modes
Syncronous
Replication
Every Instances reads the same
values for current parameters,
computes the gradient in parallel
and the app them together.
Asycronoues
Replication
Independent training loop in every
Instance, without coordination.
Better performance but lower
accuracy.
10
How to update the parameters between
instances?
Synchronous Training
11
Parameter Server
Add
Update
P
Model Model
Input Input
ΔP ΔP
Asyncronous Training
12
Parameter Server
Update
P
Model Model
Input Input
Update
• Each Updates Independently
• Nodes can read stale nodes from PS
• Possible: Model dont converge
ΔP ΔP
1. Define the Cluster
•Define ClusterSpec
•List Parameter Servers
•List Workers
•PS & Worker are called Jobs
•Jobs can contain one ore more Tasks
•Create Server for every Task
13
2. Assign Operation to every Task
•Same on every Node for In-Graph
•Different Devices for Between-Graph
•Can also be used to set parts to GPU and parts to CPU
14
3. Create a Training Session
•tf.train.MonitoredTrainingSession or tf.train.Supervisor for Asyncronous Training
•Takes care of initialisation
•Snapshotting
•Closing if an error occurs
•Hooks
•Summary Ops, Init Ops
•tf.train.SyncReplicaOptimizer for synchronous training:
•Also create a supervisor that takes over the role a a master between workers.
15
All Together - Server Init
16
All Together - Building Graph
17
All Together - TrainingOP
18
All Together - Session & Training
19
20
Deployment & Distribution
Packaging
•The Application (and all its) dependencies needs to be packaged into a deployable
•Wheels
•Code into deployable Artefact with defined dependencies
•Dependent on runtime
•Container
•Build Image with runtime, dependencies and code
•Additional Tooling for building and running required (Docker)
21
GPU Support
Alpha since 1.6
(experimental before)
Huge Community
One of the fastest
growing community
Auto Scaling
Build in Auto Scaling Feature
based on Ultitisation
Developer friendly API
Quick deployments
through simple and
flexible API.
Open Source
Open Sourced by Google, now
member of Cloud Computing
Foundation.
Bin Packing
Efficient resource utilisation
Kubernetes is a the leading Container Orchestration.
22
Kubernetes
Pods
Pods can be 1 or
more Container
grouped together,
smallest
scheduling Unit..
API First Deployments Service Discovery
Everything is a Object
inside the Rest API.
Declarative
Configuration with
YAML files.
Higher Level
Abstraction to say
run Pod X Times.
Services are used to
make Pods discovery
each other.
Kubernetes in 30 Seconds
23
The Basic you need to know for the Rest of the Talk
—KUBELET FLAG
—feature-gates=
"Accelerators=true"
Out of Scope for a Data Conference
24
How to enable GPU in your K8S cluster?
•Install Docker, nvidia-docker-bridge, cuda
Single Worker Instance
• Prepare our Docker Image
•Use prebuild Tensorflow image and add additional Libraries & Custom Code (gcr.io/tensorflow/tensorflow)
•special images form cpu/gpu builds, see docker hub tags
•Build & Push to a Registry
25
Write Kubernetes Pod Deployment
•Tell kubernetes to use GPU Resource
•Mount NVIDIA Libraries from Host
26
Full Pod Yaml
27
Distributed Tensorflow - Python Code
•Add clusterSpec and server information to code
•Use Flags/Envirmoent Variable to inject dynamically this information
•Write your TF Graph
•Either Manual Placement or automatic
•Dockerfile stays same/similiar
28
Distributed Tensorflow - Kubernetes Deployment
•Slightly different deployments for worker and ps nodes
•Service for each woker/ps task
•Job Name/worker index by flags
29
Distributed Kubernetes - Parameter Server
30
Automation - Tensorflow Operator
•Boilerplate code for larger cluster
•Official Documentation: Jinja Templating
•Tensorflow Operator:
•Higher level description, creates lower level objects.
•Still in the Kubernetes API (though CustomResourceDefinition)
•kubectl get tensorflow
•Comping Soon: https://github.com/krallistic/tensorflow-operator
31
Additional Stuff
•Tensorboard:
•Needs a global shared filesystem
•Instances write into subfolder
•Tensorboard Instances reads full folder
•Performance
•Scales amount of Parameter Servers
•Many CPU nodes can be more cost efficient
32
codecentric AG
Gartenstraße 69a
76135 Karlsruhe
E-Mail: jakob.karalus@codecentric.de
Twitter: @krallistic
Github: krallistic
www.codecentric.de
Address
Contact Info
Questions?
33

Weitere ähnliche Inhalte

Was ist angesagt?

DevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
DevOpsDays Taipei 2019 - Mastering IaC the DevOps WayDevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
DevOpsDays Taipei 2019 - Mastering IaC the DevOps Waysmalltown
 
Kafka Summit SF 2017 - Kafka Stream Processing for Everyone with KSQL
Kafka Summit SF 2017 - Kafka Stream Processing for Everyone with KSQLKafka Summit SF 2017 - Kafka Stream Processing for Everyone with KSQL
Kafka Summit SF 2017 - Kafka Stream Processing for Everyone with KSQLconfluent
 
Spark day 2017 - Spark on Kubernetes
Spark day 2017 - Spark on KubernetesSpark day 2017 - Spark on Kubernetes
Spark day 2017 - Spark on KubernetesYousun Jeong
 
Better Kafka Performance Without Changing Any Code | Simon Ritter, Azul
Better Kafka Performance Without Changing Any Code | Simon Ritter, AzulBetter Kafka Performance Without Changing Any Code | Simon Ritter, Azul
Better Kafka Performance Without Changing Any Code | Simon Ritter, AzulHostedbyConfluent
 
DockerDay2015: Getting started with Google Container Engine
DockerDay2015: Getting started with Google Container EngineDockerDay2015: Getting started with Google Container Engine
DockerDay2015: Getting started with Google Container EngineDocker-Hanoi
 
Data(?)Ops with CircleCI
Data(?)Ops with CircleCIData(?)Ops with CircleCI
Data(?)Ops with CircleCIJinwoong Kim
 
Cloud Native 오픈소스 서비스 소개 및 Serverless로 실제 게임 개발하기
Cloud Native 오픈소스 서비스 소개 및 Serverless로 실제 게임 개발하기Cloud Native 오픈소스 서비스 소개 및 Serverless로 실제 게임 개발하기
Cloud Native 오픈소스 서비스 소개 및 Serverless로 실제 게임 개발하기Jinwoong Kim
 
CDK Meetup: Rule the World through IaC
CDK Meetup: Rule the World through IaCCDK Meetup: Rule the World through IaC
CDK Meetup: Rule the World through IaCsmalltown
 
Integrating Apache Kafka and Elastic Using the Connect Framework
Integrating Apache Kafka and Elastic Using the Connect FrameworkIntegrating Apache Kafka and Elastic Using the Connect Framework
Integrating Apache Kafka and Elastic Using the Connect Frameworkconfluent
 
A Primer on Kubernetes and Google Container Engine
A Primer on Kubernetes and Google Container EngineA Primer on Kubernetes and Google Container Engine
A Primer on Kubernetes and Google Container EngineRightScale
 
Better Kafka Performance Without Changing Any Code | Simon Ritter, Azul
Better Kafka Performance Without Changing Any Code | Simon Ritter, AzulBetter Kafka Performance Without Changing Any Code | Simon Ritter, Azul
Better Kafka Performance Without Changing Any Code | Simon Ritter, AzulHostedbyConfluent
 
Scylla Summit 2022: What’s New in ScyllaDB Operator for Kubernetes
Scylla Summit 2022: What’s New in ScyllaDB Operator for KubernetesScylla Summit 2022: What’s New in ScyllaDB Operator for Kubernetes
Scylla Summit 2022: What’s New in ScyllaDB Operator for KubernetesScyllaDB
 
Distributed Kafka Architecture Taboola Scale
Distributed Kafka Architecture Taboola ScaleDistributed Kafka Architecture Taboola Scale
Distributed Kafka Architecture Taboola ScaleApache Kafka TLV
 
Monitoring, the Prometheus Way - Julius Voltz, Prometheus
Monitoring, the Prometheus Way - Julius Voltz, Prometheus Monitoring, the Prometheus Way - Julius Voltz, Prometheus
Monitoring, the Prometheus Way - Julius Voltz, Prometheus Docker, Inc.
 
Kubernetes for Serverless - Serverless Summit 2017 - Krishna Kumar
Kubernetes for Serverless  - Serverless Summit 2017 - Krishna KumarKubernetes for Serverless  - Serverless Summit 2017 - Krishna Kumar
Kubernetes for Serverless - Serverless Summit 2017 - Krishna KumarCodeOps Technologies LLP
 
2016 08-30 Kubernetes talk for Waterloo DevOps
2016 08-30 Kubernetes talk for Waterloo DevOps2016 08-30 Kubernetes talk for Waterloo DevOps
2016 08-30 Kubernetes talk for Waterloo DevOpscraigbox
 
Serverless Workflow: New approach to Kubernetes service orchestration | DevNa...
Serverless Workflow: New approach to Kubernetes service orchestration | DevNa...Serverless Workflow: New approach to Kubernetes service orchestration | DevNa...
Serverless Workflow: New approach to Kubernetes service orchestration | DevNa...Red Hat Developers
 
Cloud Native User Group: Shift-Left Testing IaC With PaC
Cloud Native User Group: Shift-Left Testing IaC With PaCCloud Native User Group: Shift-Left Testing IaC With PaC
Cloud Native User Group: Shift-Left Testing IaC With PaCsmalltown
 
Ofir Makmal - Intro To Kubernetes Operators - Google Cloud Summit 2018 Tel Aviv
Ofir Makmal - Intro To Kubernetes Operators - Google Cloud Summit 2018 Tel AvivOfir Makmal - Intro To Kubernetes Operators - Google Cloud Summit 2018 Tel Aviv
Ofir Makmal - Intro To Kubernetes Operators - Google Cloud Summit 2018 Tel AvivOfir Makmal
 

Was ist angesagt? (19)

DevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
DevOpsDays Taipei 2019 - Mastering IaC the DevOps WayDevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
DevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
 
Kafka Summit SF 2017 - Kafka Stream Processing for Everyone with KSQL
Kafka Summit SF 2017 - Kafka Stream Processing for Everyone with KSQLKafka Summit SF 2017 - Kafka Stream Processing for Everyone with KSQL
Kafka Summit SF 2017 - Kafka Stream Processing for Everyone with KSQL
 
Spark day 2017 - Spark on Kubernetes
Spark day 2017 - Spark on KubernetesSpark day 2017 - Spark on Kubernetes
Spark day 2017 - Spark on Kubernetes
 
Better Kafka Performance Without Changing Any Code | Simon Ritter, Azul
Better Kafka Performance Without Changing Any Code | Simon Ritter, AzulBetter Kafka Performance Without Changing Any Code | Simon Ritter, Azul
Better Kafka Performance Without Changing Any Code | Simon Ritter, Azul
 
DockerDay2015: Getting started with Google Container Engine
DockerDay2015: Getting started with Google Container EngineDockerDay2015: Getting started with Google Container Engine
DockerDay2015: Getting started with Google Container Engine
 
Data(?)Ops with CircleCI
Data(?)Ops with CircleCIData(?)Ops with CircleCI
Data(?)Ops with CircleCI
 
Cloud Native 오픈소스 서비스 소개 및 Serverless로 실제 게임 개발하기
Cloud Native 오픈소스 서비스 소개 및 Serverless로 실제 게임 개발하기Cloud Native 오픈소스 서비스 소개 및 Serverless로 실제 게임 개발하기
Cloud Native 오픈소스 서비스 소개 및 Serverless로 실제 게임 개발하기
 
CDK Meetup: Rule the World through IaC
CDK Meetup: Rule the World through IaCCDK Meetup: Rule the World through IaC
CDK Meetup: Rule the World through IaC
 
Integrating Apache Kafka and Elastic Using the Connect Framework
Integrating Apache Kafka and Elastic Using the Connect FrameworkIntegrating Apache Kafka and Elastic Using the Connect Framework
Integrating Apache Kafka and Elastic Using the Connect Framework
 
A Primer on Kubernetes and Google Container Engine
A Primer on Kubernetes and Google Container EngineA Primer on Kubernetes and Google Container Engine
A Primer on Kubernetes and Google Container Engine
 
Better Kafka Performance Without Changing Any Code | Simon Ritter, Azul
Better Kafka Performance Without Changing Any Code | Simon Ritter, AzulBetter Kafka Performance Without Changing Any Code | Simon Ritter, Azul
Better Kafka Performance Without Changing Any Code | Simon Ritter, Azul
 
Scylla Summit 2022: What’s New in ScyllaDB Operator for Kubernetes
Scylla Summit 2022: What’s New in ScyllaDB Operator for KubernetesScylla Summit 2022: What’s New in ScyllaDB Operator for Kubernetes
Scylla Summit 2022: What’s New in ScyllaDB Operator for Kubernetes
 
Distributed Kafka Architecture Taboola Scale
Distributed Kafka Architecture Taboola ScaleDistributed Kafka Architecture Taboola Scale
Distributed Kafka Architecture Taboola Scale
 
Monitoring, the Prometheus Way - Julius Voltz, Prometheus
Monitoring, the Prometheus Way - Julius Voltz, Prometheus Monitoring, the Prometheus Way - Julius Voltz, Prometheus
Monitoring, the Prometheus Way - Julius Voltz, Prometheus
 
Kubernetes for Serverless - Serverless Summit 2017 - Krishna Kumar
Kubernetes for Serverless  - Serverless Summit 2017 - Krishna KumarKubernetes for Serverless  - Serverless Summit 2017 - Krishna Kumar
Kubernetes for Serverless - Serverless Summit 2017 - Krishna Kumar
 
2016 08-30 Kubernetes talk for Waterloo DevOps
2016 08-30 Kubernetes talk for Waterloo DevOps2016 08-30 Kubernetes talk for Waterloo DevOps
2016 08-30 Kubernetes talk for Waterloo DevOps
 
Serverless Workflow: New approach to Kubernetes service orchestration | DevNa...
Serverless Workflow: New approach to Kubernetes service orchestration | DevNa...Serverless Workflow: New approach to Kubernetes service orchestration | DevNa...
Serverless Workflow: New approach to Kubernetes service orchestration | DevNa...
 
Cloud Native User Group: Shift-Left Testing IaC With PaC
Cloud Native User Group: Shift-Left Testing IaC With PaCCloud Native User Group: Shift-Left Testing IaC With PaC
Cloud Native User Group: Shift-Left Testing IaC With PaC
 
Ofir Makmal - Intro To Kubernetes Operators - Google Cloud Summit 2018 Tel Aviv
Ofir Makmal - Intro To Kubernetes Operators - Google Cloud Summit 2018 Tel AvivOfir Makmal - Intro To Kubernetes Operators - Google Cloud Summit 2018 Tel Aviv
Ofir Makmal - Intro To Kubernetes Operators - Google Cloud Summit 2018 Tel Aviv
 

Ähnlich wie Distributed Tensorflow with Kubernetes - data2day - Jakob Karalus

Using Deep Learning Toolkits with Kubernetes clusters
Using Deep Learning Toolkits with Kubernetes clustersUsing Deep Learning Toolkits with Kubernetes clusters
Using Deep Learning Toolkits with Kubernetes clustersJoy Qiao
 
Democratizing machine learning on kubernetes
Democratizing machine learning on kubernetesDemocratizing machine learning on kubernetes
Democratizing machine learning on kubernetesDocker, Inc.
 
Kubernetes and CoreOS @ Athens Docker meetup
Kubernetes and CoreOS @ Athens Docker meetupKubernetes and CoreOS @ Athens Docker meetup
Kubernetes and CoreOS @ Athens Docker meetupMist.io
 
Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications OpenEBS
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetupGanesan Narayanasamy
 
Kubernetes at NU.nl (Kubernetes meetup 2019-09-05)
Kubernetes at NU.nl   (Kubernetes meetup 2019-09-05)Kubernetes at NU.nl   (Kubernetes meetup 2019-09-05)
Kubernetes at NU.nl (Kubernetes meetup 2019-09-05)Tibo Beijen
 
Expanding HPCC Systems Deep Neural Network Capabilities
Expanding HPCC Systems Deep Neural Network CapabilitiesExpanding HPCC Systems Deep Neural Network Capabilities
Expanding HPCC Systems Deep Neural Network CapabilitiesHPCC Systems
 
OS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of MLOS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of MLNordic APIs
 
DevOps Days Boston 2017: Real-world Kubernetes for DevOps
DevOps Days Boston 2017: Real-world Kubernetes for DevOpsDevOps Days Boston 2017: Real-world Kubernetes for DevOps
DevOps Days Boston 2017: Real-world Kubernetes for DevOpsAmbassador Labs
 
To Build My Own Cloud with Blackjack…
To Build My Own Cloud with Blackjack…To Build My Own Cloud with Blackjack…
To Build My Own Cloud with Blackjack…Sergey Dzyuban
 
Containerized architectures for deep learning
Containerized architectures for deep learningContainerized architectures for deep learning
Containerized architectures for deep learningAntje Barth
 
Kubernetes Internals
Kubernetes InternalsKubernetes Internals
Kubernetes InternalsShimi Bandiel
 
MongoDB Ops Manager and Kubernetes - James Broadhead
MongoDB Ops Manager and Kubernetes - James BroadheadMongoDB Ops Manager and Kubernetes - James Broadhead
MongoDB Ops Manager and Kubernetes - James BroadheadMongoDB
 
Distributed DNN training: Infrastructure, challenges, and lessons learned
Distributed DNN training: Infrastructure, challenges, and lessons learnedDistributed DNN training: Infrastructure, challenges, and lessons learned
Distributed DNN training: Infrastructure, challenges, and lessons learnedWee Hyong Tok
 
Container orchestration and microservices world
Container orchestration and microservices worldContainer orchestration and microservices world
Container orchestration and microservices worldKarol Chrapek
 
Kubernetes presentation
Kubernetes presentationKubernetes presentation
Kubernetes presentationGauranG Bajpai
 
Database as a Service (DBaaS) on Kubernetes
Database as a Service (DBaaS) on KubernetesDatabase as a Service (DBaaS) on Kubernetes
Database as a Service (DBaaS) on KubernetesObjectRocket
 

Ähnlich wie Distributed Tensorflow with Kubernetes - data2day - Jakob Karalus (20)

Using Deep Learning Toolkits with Kubernetes clusters
Using Deep Learning Toolkits with Kubernetes clustersUsing Deep Learning Toolkits with Kubernetes clusters
Using Deep Learning Toolkits with Kubernetes clusters
 
Democratizing machine learning on kubernetes
Democratizing machine learning on kubernetesDemocratizing machine learning on kubernetes
Democratizing machine learning on kubernetes
 
Kubernetes 101
Kubernetes 101Kubernetes 101
Kubernetes 101
 
Kubernetes and CoreOS @ Athens Docker meetup
Kubernetes and CoreOS @ Athens Docker meetupKubernetes and CoreOS @ Athens Docker meetup
Kubernetes and CoreOS @ Athens Docker meetup
 
Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup
 
Introduction to Kubernetes
Introduction to KubernetesIntroduction to Kubernetes
Introduction to Kubernetes
 
Kubernetes at NU.nl (Kubernetes meetup 2019-09-05)
Kubernetes at NU.nl   (Kubernetes meetup 2019-09-05)Kubernetes at NU.nl   (Kubernetes meetup 2019-09-05)
Kubernetes at NU.nl (Kubernetes meetup 2019-09-05)
 
Expanding HPCC Systems Deep Neural Network Capabilities
Expanding HPCC Systems Deep Neural Network CapabilitiesExpanding HPCC Systems Deep Neural Network Capabilities
Expanding HPCC Systems Deep Neural Network Capabilities
 
OS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of MLOS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of ML
 
DevOps Days Boston 2017: Real-world Kubernetes for DevOps
DevOps Days Boston 2017: Real-world Kubernetes for DevOpsDevOps Days Boston 2017: Real-world Kubernetes for DevOps
DevOps Days Boston 2017: Real-world Kubernetes for DevOps
 
To Build My Own Cloud with Blackjack…
To Build My Own Cloud with Blackjack…To Build My Own Cloud with Blackjack…
To Build My Own Cloud with Blackjack…
 
Containerized architectures for deep learning
Containerized architectures for deep learningContainerized architectures for deep learning
Containerized architectures for deep learning
 
Kubernetes Internals
Kubernetes InternalsKubernetes Internals
Kubernetes Internals
 
MongoDB Ops Manager and Kubernetes - James Broadhead
MongoDB Ops Manager and Kubernetes - James BroadheadMongoDB Ops Manager and Kubernetes - James Broadhead
MongoDB Ops Manager and Kubernetes - James Broadhead
 
Distributed DNN training: Infrastructure, challenges, and lessons learned
Distributed DNN training: Infrastructure, challenges, and lessons learnedDistributed DNN training: Infrastructure, challenges, and lessons learned
Distributed DNN training: Infrastructure, challenges, and lessons learned
 
HPC in the Cloud
HPC in the CloudHPC in the Cloud
HPC in the Cloud
 
Container orchestration and microservices world
Container orchestration and microservices worldContainer orchestration and microservices world
Container orchestration and microservices world
 
Kubernetes presentation
Kubernetes presentationKubernetes presentation
Kubernetes presentation
 
Database as a Service (DBaaS) on Kubernetes
Database as a Service (DBaaS) on KubernetesDatabase as a Service (DBaaS) on Kubernetes
Database as a Service (DBaaS) on Kubernetes
 

Kürzlich hochgeladen

Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...amitlee9823
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...only4webmaster01
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsJoseMangaJr1
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Pooja Nehwal
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 

Kürzlich hochgeladen (20)

Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 

Distributed Tensorflow with Kubernetes - data2day - Jakob Karalus

  • 1. Distributed Tensorflow with Kubernetes Jakob Karalus, @krallistic, 1
  • 2. Training Neural Networks •First steps are quick and easy. •Single Node Neural Networks •We want: •More Data! •Deeper Models! •Wider Model! •Higher Accuracy! •(Single Node) Compute cant keep up •Longer Trainingstime -> Longer Cycles -> Lower Productivity 2
  • 3. Distributed & Parallel •We need to distribute and train in parallel to be efficient •Data Parallelism •Model Parallelsim •Grid Search •Predict •-> Build in TF •How can we deploy that to a cluster •Schedule TF onto Nodes •Service Discovery •GPUs •-> Kubernetes 3
  • 4. Requirements & Content •Basic Knowledge of Neural Networks •Knowledge of Tensorflow •Basic Docker/Kubernetes knowledge •(Docker) Containers: Mini VM (Wrong!) •Kubernetes: Schedulurer/Orchestration Tool for Containers •Only focus on the task of parallel and/or distributed training •We will not look at architectures like CNN/LTSM etc 4
  • 5. Tensorflow on a single node 5 •Build your Graph •Define which Parts on each device •TF places data •DMA for coordination/communication • Define loss, accuracy etc •Create Session for training •Feed Data into Session •Retrieve results cpu:0 gpu:0 /job:worker/task:0/ Client Code
  • 6. Parameter Server & Worker Replicas •Client: Code that builds the Graph, communicates with cluster, builds the session •Cluster: Set of nodes which have jobs (roles) •Jobs •Worker Replicas: compute intensive part •Parameter Servers(ps): Holds model state and reacts to updates •Each job can hold 0..* task •Task •The actual server process •Worker Task 0 is by default the chief worker •Responsible for checkpointing, initialising and health checking •CPU 0 represents all CPUs on the Node 6 Client Code Session Graph Graph cpu:0 gpu:0 /job:worker/task:0/ cpu:0 /job:ps/task:0/
  • 7. In Graph Replication •Split up input into equal chunks, •Loops over workers and assign a chunk •collect results and optimise •Not the recommended way •Graph get big, lot of communication overhead •Each device operates on all data 7 cpu:0 gpu:0 /job:worker/task:0/ cpu:0 gpu:0 /job:worker/task:0/ cpu:0 /job:ps/task:0/ Client Code
  • 8. Between Replication •Recommend way of doing replication •Similiar to MPI •Each device operates on a partition •Different Client Program on each worker •Assign itself to local resources •Small graph independently 8 cpu:0gpu:0 /job:worker/task:0/ cpu:0 gpu:0 /job:worker/task:0/ cpu:0 /job:ps/task:0/ Client CodeClient Code
  • 9. Variable Placement •How to place the Variable onto different devices •Manual Way •Easy to start, full flexibility •Gets annoying soon •Device setter •Automatic assign variables to ps and ops to workers •Simple round robin by default •Greedy Load Balancing Strategy •Partitioned Values •Needed for really large variables (often used in text embeddings) •Splits variables between multiple Parameter Server 9
  • 10. Training Modes Syncronous Replication Every Instances reads the same values for current parameters, computes the gradient in parallel and the app them together. Asycronoues Replication Independent training loop in every Instance, without coordination. Better performance but lower accuracy. 10 How to update the parameters between instances?
  • 12. Asyncronous Training 12 Parameter Server Update P Model Model Input Input Update • Each Updates Independently • Nodes can read stale nodes from PS • Possible: Model dont converge ΔP ΔP
  • 13. 1. Define the Cluster •Define ClusterSpec •List Parameter Servers •List Workers •PS & Worker are called Jobs •Jobs can contain one ore more Tasks •Create Server for every Task 13
  • 14. 2. Assign Operation to every Task •Same on every Node for In-Graph •Different Devices for Between-Graph •Can also be used to set parts to GPU and parts to CPU 14
  • 15. 3. Create a Training Session •tf.train.MonitoredTrainingSession or tf.train.Supervisor for Asyncronous Training •Takes care of initialisation •Snapshotting •Closing if an error occurs •Hooks •Summary Ops, Init Ops •tf.train.SyncReplicaOptimizer for synchronous training: •Also create a supervisor that takes over the role a a master between workers. 15
  • 16. All Together - Server Init 16
  • 17. All Together - Building Graph 17
  • 18. All Together - TrainingOP 18
  • 19. All Together - Session & Training 19
  • 21. Packaging •The Application (and all its) dependencies needs to be packaged into a deployable •Wheels •Code into deployable Artefact with defined dependencies •Dependent on runtime •Container •Build Image with runtime, dependencies and code •Additional Tooling for building and running required (Docker) 21
  • 22. GPU Support Alpha since 1.6 (experimental before) Huge Community One of the fastest growing community Auto Scaling Build in Auto Scaling Feature based on Ultitisation Developer friendly API Quick deployments through simple and flexible API. Open Source Open Sourced by Google, now member of Cloud Computing Foundation. Bin Packing Efficient resource utilisation Kubernetes is a the leading Container Orchestration. 22 Kubernetes
  • 23. Pods Pods can be 1 or more Container grouped together, smallest scheduling Unit.. API First Deployments Service Discovery Everything is a Object inside the Rest API. Declarative Configuration with YAML files. Higher Level Abstraction to say run Pod X Times. Services are used to make Pods discovery each other. Kubernetes in 30 Seconds 23 The Basic you need to know for the Rest of the Talk
  • 24. —KUBELET FLAG —feature-gates= "Accelerators=true" Out of Scope for a Data Conference 24 How to enable GPU in your K8S cluster? •Install Docker, nvidia-docker-bridge, cuda
  • 25. Single Worker Instance • Prepare our Docker Image •Use prebuild Tensorflow image and add additional Libraries & Custom Code (gcr.io/tensorflow/tensorflow) •special images form cpu/gpu builds, see docker hub tags •Build & Push to a Registry 25
  • 26. Write Kubernetes Pod Deployment •Tell kubernetes to use GPU Resource •Mount NVIDIA Libraries from Host 26
  • 28. Distributed Tensorflow - Python Code •Add clusterSpec and server information to code •Use Flags/Envirmoent Variable to inject dynamically this information •Write your TF Graph •Either Manual Placement or automatic •Dockerfile stays same/similiar 28
  • 29. Distributed Tensorflow - Kubernetes Deployment •Slightly different deployments for worker and ps nodes •Service for each woker/ps task •Job Name/worker index by flags 29
  • 30. Distributed Kubernetes - Parameter Server 30
  • 31. Automation - Tensorflow Operator •Boilerplate code for larger cluster •Official Documentation: Jinja Templating •Tensorflow Operator: •Higher level description, creates lower level objects. •Still in the Kubernetes API (though CustomResourceDefinition) •kubectl get tensorflow •Comping Soon: https://github.com/krallistic/tensorflow-operator 31
  • 32. Additional Stuff •Tensorboard: •Needs a global shared filesystem •Instances write into subfolder •Tensorboard Instances reads full folder •Performance •Scales amount of Parameter Servers •Many CPU nodes can be more cost efficient 32
  • 33. codecentric AG Gartenstraße 69a 76135 Karlsruhe E-Mail: jakob.karalus@codecentric.de Twitter: @krallistic Github: krallistic www.codecentric.de Address Contact Info Questions? 33