An Introduction to TensorFlow architecture

Introduces the internals of TensorFlow and dives deep into the distributed version of TensorFlow. See https://github.com/manigoswami/tensorflow-examples for examples.

Published in: Technology

  2. BEFORE WE START…
     • Please understand that TensorFlow differs from most data engines out there, for obvious reasons.
     • TensorFlow differs from batch dataflow systems in two respects:
        • The model supports multiple concurrent executions on overlapping subgraphs of the overall graph.
        • Individual vertices may have mutable state that can be shared between different executions of the graph.
     • Some references (taken from the OSDI '16 conference paper):
        • The principal limitation of a batch dataflow system is that it requires the input data to be immutable, and all of the sub-computations to be deterministic, so that the system can re-execute sub-computations when machines in the cluster fail.
        • For example, the SparkNet system for training deep neural networks on Spark takes 20 seconds to broadcast weights and collect updates from five workers [55]. As a result, in these systems, each model update step must process larger batches, slowing convergence [8]. We show in Subsection 6.3 that TensorFlow can train larger models on larger clusters with step times as short as 2 seconds.
  3. WHAT IS TENSORFLOW?
     Here is the formal definition from https://www.tensorflow.org/:
     TensorFlow is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google's Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research.
  4. WHAT IS A DATA FLOW GRAPH?
     Consider a typical linear equation: y = W * x + b, where W is the weight, x is an example and b is the bias. This linear equation can be represented as an acyclic graph.
     [Figure: dataflow graph with nodes Biases, Weight, Examples, MatMul, Add, Relu, Gradients, and Updated Weights and Biases]
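The graph idea on this slide can be sketched in plain Python. This is only an illustration and not the TensorFlow API: the `Node` class and the `run`, `matmul`, `add` and `relu` helpers are hypothetical names, and the numeric values are made up. The point it shows is that the graph is first built as data and only evaluated afterwards.

```python
"""Toy dataflow graph for y = relu(W * x + b); illustrative only, not TensorFlow."""


class Node:
    """A graph vertex: an operation plus the nodes that feed it."""

    def __init__(self, name, op=None, inputs=(), value=None):
        self.name = name
        self.op = op            # callable applied to the input values
        self.inputs = inputs    # upstream nodes (the graph edges)
        self.value = value      # set only for leaf nodes that hold data


def run(node, cache=None):
    """Evaluate `node` by recursively evaluating its inputs first."""
    cache = {} if cache is None else cache
    if node.name not in cache:
        if node.op is None:
            cache[node.name] = node.value
        else:
            args = [run(i, cache) for i in node.inputs]
            cache[node.name] = node.op(*args)
    return cache[node.name]


def matmul(w, x):
    """Matrix (list of rows) times vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def add(a, b):
    """Element-wise vector addition."""
    return [ai + bi for ai, bi in zip(a, b)]


def relu(v):
    """Element-wise max(0, v)."""
    return [max(0.0, vi) for vi in v]


# Construction: nothing is computed here, only the graph is described.
W = Node("Weight", value=[[1.0, -2.0], [0.5, 0.5]])
x = Node("Examples", value=[3.0, 4.0])
b = Node("Biases", value=[1.0, 1.0])
mm = Node("MatMul", matmul, (W, x))
s = Node("Add", add, (mm, b))
y = Node("Relu", relu, (s,))

# Execution: evaluating y pulls in only the nodes it depends on.
result = run(y)
print(result)
```

The gradient and update nodes from the figure are left out to keep the sketch short.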
  5. GENERALIZING THE DATAFLOW GRAPH
     [Figure: generalized dataflow graph in three groups: Variables & Constants, Operations, and Updating of Variables. Node labels: Biases, …, Learning Rate, Add, Mul, -=, Update Biases, …, gradient computation]
  6. LAYERED VIEW
     [Figure: layered view. Labels: Network Layer, Device Layer, Kernel Execution Layer, Distributed Master, Data Flow Controller, API Layer, Client Layer, Libraries (Training/Inference Libs)]
  7. TENSORFLOW'S DEVICE INTERACTION VIEW
     TensorFlow uses CUDA and cuDNN to control GPUs and boost performance.
     [Figure: TensorFlow on top of cuDNN and CUDA, driving CPU, GPU #0 and GPU #1]
  8. EXECUTION PHASES
     • By deferring execution until the entire program is available, TensorFlow optimizes the execution phase by using global information about the computation.
     • Example: TensorFlow achieves high GPU utilization by using the graph's dependency structure to issue a sequence of kernels to the GPU without waiting for intermediate results.
     • TensorFlow uses deferred execution via the dataflow graph to offload larger chunks of work to accelerators.
     [Figure: Construction Phase, Execution Phase, Client, Workers]
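Because the whole graph is known before anything runs, a valid kernel order can be derived once from the dependency structure. The sketch below uses Kahn's topological sort in plain Python to show that idea; it is not TensorFlow's scheduler, and `launch_order` and the `deps` dictionary are hypothetical names chosen for illustration.

```python
from collections import deque


def launch_order(deps):
    """Kahn's algorithm: return ops so that every op follows all of its inputs.

    deps maps an op name to the list of op names it consumes.
    """
    nodes = set(deps)
    for inputs in deps.values():
        nodes.update(inputs)

    remaining = {n: len(deps.get(n, ())) for n in nodes}   # unmet inputs per op
    consumers = {n: [] for n in nodes}
    for op, inputs in deps.items():
        for i in inputs:
            consumers[i].append(op)

    # Sorting only makes the output deterministic; any ready op could go first.
    ready = deque(sorted(n for n in nodes if remaining[n] == 0))
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for c in sorted(consumers[n]):
            remaining[c] -= 1
            if remaining[c] == 0:
                ready.append(c)

    if len(order) != len(nodes):
        raise ValueError("graph has a cycle")
    return order


# Hypothetical graph for relu(W * x + b).
deps = {
    "MatMul": ["Weight", "Examples"],
    "Add": ["MatMul", "Biases"],
    "Relu": ["Add"],
}
order = launch_order(deps)
print(order)
```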
  9. WORKER'S DEVICE INTERACTIONS
     • The worker service in each task:
        • handles requests from the master,
        • schedules the execution of the kernels for the operations that comprise a local subgraph,
        • mediates direct communication between tasks.
     • It is optimized for running large graphs with low overhead.
     • It dispatches kernels to local devices and runs kernels in parallel when possible, for example by using multiple CPU cores or GPU streams.
     [Figure: Client, Session, Master, Worker, CPU #0, GPU #1, GPU #2]
  10. WORKER'S SCHEDULING & PLACEMENT ALGORITHM
     • Uses a cost model to determine placement. The cost model:
        • contains estimates of the sizes of the input and output tensors for each graph node
        • contains estimates of the computation time required for each node
        • is statically estimated based on heuristics associated with different operation types
        • can also use metrics collected from the placement decisions of earlier executions of the graph
     • The placement algorithm first runs a simulated execution of the graph.
     • For each node, the feasible devices are determined.
     • When multiple devices are eligible to execute a node:
        • the algorithm uses a greedy heuristic that examines the effect on completion time using the cost model
        • the device where the node's operation would finish soonest is generally selected
     • Applies constraints such as colocation requirements.
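The greedy heuristic described on this slide can be sketched as a simulated execution in plain Python. This is a simplified illustration under stated assumptions, not TensorFlow's placer: the function name `place`, the flat `transfer_time` penalty for cross-device inputs, and all the cost numbers are hypothetical.

```python
def place(nodes, devices, compute_time, transfer_time):
    """Greedy placement sketch driven by a simple cost model.

    nodes: list of (name, inputs, feasible_devices) in topological order.
    compute_time[(name, device)]: estimated run time of the op on that device.
    transfer_time: assumed flat cost of moving one input across devices.
    """
    placement = {}
    finish = {}                              # node -> simulated finish time
    free_at = {d: 0.0 for d in devices}      # device -> time it becomes idle

    for name, inputs, feasible in nodes:
        best = None
        for dev in feasible:
            # Inputs living on another device pay the transfer penalty.
            ready = max(
                (finish[i] + (0.0 if placement[i] == dev else transfer_time)
                 for i in inputs),
                default=0.0,
            )
            start = max(ready, free_at[dev])
            end = start + compute_time[(name, dev)]
            if best is None or end < best[0]:
                best = (end, dev)
        if best is None:
            raise ValueError(f"no feasible device for {name}")
        end, dev = best
        placement[name] = dev                # pick the device that finishes soonest
        finish[name] = end
        free_at[dev] = end
    return placement, finish


# Hypothetical three-node graph and cost estimates.
devices = ["cpu:0", "gpu:0"]
nodes = [
    ("read", [], ["cpu:0"]),
    ("matmul", ["read"], ["cpu:0", "gpu:0"]),
    ("add", ["matmul"], ["cpu:0", "gpu:0"]),
]
compute_time = {
    ("read", "cpu:0"): 1.0,
    ("matmul", "cpu:0"): 10.0, ("matmul", "gpu:0"): 1.0,
    ("add", "cpu:0"): 0.2, ("add", "gpu:0"): 0.1,
}
placement, finish = place(nodes, devices, compute_time, transfer_time=0.5)
print(placement)
```

Here `add` stays on the GPU even though it is cheap on the CPU, because moving the `matmul` result back would cost more than it saves. Colocation constraints would be applied by narrowing each node's feasible device list before the loop.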
  11. SINGLE MACHINE VS DISTRIBUTED SYSTEM STRUCTURE
     The client creates the computation graph during the construction phase. It creates a session to the master and sends the constructed graph for execution. Finally, when the client evaluates one or more nodes in the graph, the master starts the execution by distributing subgraphs to the workers.
     [Figure: single-process version (client and master in one process; session run; execute sub-graph on GPU0, GPU1, … GPUn) versus distributed version (client process and master; session run; execute sub-graph on worker processes 1, 2 and 3, each with CPU0, GPU0, GPU1, … GPUn)]
  12. KERNEL EXECUTION
     • TF manages two types of thread pools on each device to parallelize operations: inter-op and intra-op thread pools.
     • The inter-op pool is a regular thread pool, used when two or more operations are scheduled on the same device.
     • In a few cases an operation has a multi-threaded kernel; such operations use the intra-op thread pool.
     [Figure: CPU #0 and CPU #1 running operations A, B, C, D, E and F, with an inter-op pool and an intra-op pool]
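The split between the two pools can be imitated with two standard-library thread pools. This is only an analogy in plain Python (where the GIL limits real CPU parallelism), not how TensorFlow implements its pools; `parallel_sum`, `op_a` and `op_b` are hypothetical stand-ins for a multi-threaded kernel and two independent operations.

```python
from concurrent.futures import ThreadPoolExecutor

inter_op_pool = ThreadPoolExecutor(max_workers=2)  # runs independent ops side by side
intra_op_pool = ThreadPoolExecutor(max_workers=4)  # splits the work inside one op


def parallel_sum(values, chunks=4):
    """A 'multi-threaded kernel': partial sums run on the intra-op pool."""
    size = max(1, (len(values) + chunks - 1) // chunks)
    parts = [values[i:i + size] for i in range(0, len(values), size)]
    return sum(intra_op_pool.map(sum, parts))


def op_a():
    return parallel_sum(list(range(1000)))


def op_b():
    return parallel_sum(list(range(1000, 2000)))


# A and B do not depend on each other, so both go to the inter-op pool.
future_a = inter_op_pool.submit(op_a)
future_b = inter_op_pool.submit(op_b)
result_a, result_b = future_a.result(), future_b.result()
print(result_a, result_b)

inter_op_pool.shutdown()
intra_op_pool.shutdown()
```

Keeping the pools separate means an operation waiting on its own partial results never blocks the threads that those partial results need.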
  13. SESSION ON A SINGLE PROCESS
     [Figure: a tf.Session with devices CPU: 0 and GPU: 0]
     with tf.Session() as sess:
         sess.run(init_op)
         for _ in range(STEPS):
             sess.run(train)
  14. CROSS-DEVICE COMMUNICATION
     [Figure: graph for s += w * x + b on one worker. Labels: CPU, +=, s, w, b, Add, MatMul, x, GPU #0, Worker]
  16. CREATING A CLUSTER
     cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
     server = tf.train.Server(cluster, job_name="worker", task_index=0)
     [Figure: a tf.Session (CPU: 0, GPU: 0) connected to several tf.train.Server instances (CPU: 0, GPU: 0)]
  17. DISTRIBUTED COMMUNICATION (DATA PARALLELISM & REPLICATION)
     • The master decides the subgraph for each worker; in this case the model parameters are given to the PS task.
     • Each worker is responsible for placing the nodes of its subgraph on devices.
     • Nodes are executed on multiple GPUs/CPU cores simultaneously, subject to dependency resolution.
     [Figure: Device 1 (PS) holds +=, s, w, b on its CPU; Worker #0 and Worker #1 each run MatMul, x, Add on GPU #0]
  18. DISTRIBUTED COMMUNICATION (DATA PARALLELISM)
     • Transfers between local CPU and GPU devices use the cudaMemcpyAsync() API to overlap computation and data transfer.
     • Transfers between two local GPUs use peer-to-peer DMA, to avoid an expensive copy via the host CPU.
     • Transfers between tasks use RDMA over Converged Ethernet, or else gRPC over TCP.
     [Figure: Device 1 (PS) holds +=, s, w, b on its CPU; Worker #0 (is_chief=true) and Worker #1 each run MatMul, x, Add on GPU #0; SEND and RECV nodes connect the PS and the workers over RDMA]
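The SEND and RECV nodes in the figure sit on edges that cross a device boundary. A minimal sketch of that rewriting step, in plain Python, is shown below. It assumes the graph has already been placed; the function name `partition`, the device strings and the `AssignAdd` node name are hypothetical, and the transport (RDMA, gRPC, DMA) is deliberately not modeled.

```python
def partition(edges, placement):
    """Split a placed graph into per-device subgraphs.

    edges: list of (producer, consumer) pairs.
    placement: node -> device name.
    Each cross-device edge is replaced by a Send on the producer's device
    and a matching Recv on the consumer's device.
    """
    subgraphs = {dev: [] for dev in set(placement.values())}
    for src, dst in edges:
        src_dev, dst_dev = placement[src], placement[dst]
        if src_dev == dst_dev:
            subgraphs[src_dev].append((src, dst))
        else:
            subgraphs[src_dev].append((src, f"Send({src}->{dst_dev})"))
            subgraphs[dst_dev].append((f"Recv({src}<-{src_dev})", dst))
    return subgraphs


# Hypothetical placement for s += w * x + b: parameters on the PS, math on the worker.
placement = {
    "w": "ps/cpu:0", "b": "ps/cpu:0", "AssignAdd": "ps/cpu:0",
    "x": "worker/gpu:0", "MatMul": "worker/gpu:0", "Add": "worker/gpu:0",
}
edges = [
    ("w", "MatMul"), ("x", "MatMul"), ("MatMul", "Add"),
    ("b", "Add"), ("Add", "AssignAdd"),
]
subgraphs = partition(edges, placement)
for device, sub in sorted(subgraphs.items()):
    print(device, sub)
```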
  20. DISTRIBUTED COMMUNICATION (MODEL PARALLELISM)
     • In model parallelism, the graph's operations are distributed across the cluster.
     [Figure: Device 1 (PS) holds +=, s, w, b on its CPU; Device 2 (worker) with Worker #0 and Worker #1, GPU #0 each, running MatMul, x, Add]
  21. DISTRIBUTED COMMUNICATION (MODEL PARALLELISM)
     • Transfers between local CPU and GPU devices use the cudaMemcpyAsync() API to overlap computation and data transfer.
     • Transfers between two local GPUs use peer-to-peer DMA, to avoid an expensive copy via the host CPU.
     • Transfers between tasks use RDMA over Converged Ethernet, or else gRPC over TCP.
     [Figure: Device 1 (PS) holds +=, s, w, b on its CPU; Worker #0 (is_chief = True) and Worker #1 run *, x, + on GPU #0; SEND/RECV pairs over RDMA with destinations worker#1 GPU #0, worker#0 GPU #0, worker#1 GPU #0 and worker#0 CPU #0]
  22. CHIEF WORKER
     • The chief is a task that is assigned some additional responsibilities in the cluster.
     • Its responsibilities:
        • Checkpointing
           • Saves graph state in a configured store such as HDFS.
           • Runs at a configurable frequency.
        • Maintaining summaries
           • Runs all summary operations.
        • Saving models
        • Step counters
           • Keeps track of the total steps taken.
        • Recovery
           • Restores the graph from the most recent checkpoint and resumes training where it stopped.
        • Initializing all the variables in the graph
     • Can be monitored through TensorBoard.
  23. PARAMETER TASKS VS WORKER TASKS
     • In TensorFlow the workload is distributed in the form of PS tasks and worker tasks.
     • PS tasks hold:
        • Variables
        • Update operations
     • Worker tasks hold:
        • Pre-processing
        • Loss calculation
        • Backpropagation
     • Multiple worker and PS tasks can run simultaneously, but TF ensures that the PS is sharded so that each variable has exactly one physical copy. Various algorithms support distributing variables across PS tasks, taking load and network into account.
     • It also allows partitioning large variables (~10x GBs) across multiple PS tasks.
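Two simple ways of sharding variables across PS tasks are sketched below in plain Python: round-robin, which ignores size, and a greedy strategy that balances bytes per task. Both are illustrative assumptions rather than TensorFlow's own strategies; the function names and the variable sizes are hypothetical. In both cases every variable ends up on exactly one PS task.

```python
import heapq


def round_robin(variables, num_ps):
    """Assign variables to PS tasks in turn, ignoring their size."""
    return {name: i % num_ps for i, (name, _) in enumerate(variables)}


def greedy_by_size(variables, num_ps):
    """Place each variable (largest first) on the currently least-loaded PS task."""
    heap = [(0, ps) for ps in range(num_ps)]      # (bytes held, ps index)
    heapq.heapify(heap)
    assignment = {}
    for name, size in sorted(variables, key=lambda v: -v[1]):
        load, ps = heapq.heappop(heap)
        assignment[name] = ps
        heapq.heappush(heap, (load + size, ps))
    return assignment


def load_per_ps(variables, assignment, num_ps):
    """Total size held by each PS task under a given assignment."""
    loads = [0] * num_ps
    for name, size in variables:
        loads[assignment[name]] += size
    return loads


# Hypothetical variables as (name, size) pairs.
variables = [("embedding", 800), ("w1", 100), ("w2", 300), ("b1", 1)]

rr = round_robin(variables, num_ps=2)
greedy = greedy_by_size(variables, num_ps=2)
print(load_per_ps(variables, rr, 2))
print(load_per_ps(variables, greedy, 2))
```

With these made-up sizes, round-robin leaves one task holding almost everything, while the greedy strategy spreads the load more evenly.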
  24. TYPES OF TRAINING REPLICATION
     • In-graph replication
        • A single client connects to a master and requests that the replicated graph, along with the data, be distributed across all available workers.
        • Works well for a small workload but does not scale well beyond that.
     • Between-graph replication (recommended approach)
        • Multiple clients take part in replication.
        • Each machine has a client that talks to the local master and gives it the cluster information, the graph and the data to be executed.
        • The master ensures that PS tasks are shared based on the cluster and schedules tasks on the local worker.
        • The worker handles all communication and synchronization.
     • Between-graph replication can be of two types:
        • Synchronous
        • Asynchronous
  25. ASYNCHRONOUS VS SYNCHRONOUS REPLICATION
     [Figure: two panels. Synchronous data parallelism: Devices 1 to 3 each hold the model and an input; their results are combined by Add into a single Update of P on the PS server. Asynchronous data parallelism: Devices 1 to 3 each hold the model and an input and each sends its own Update of P to the PS server.]
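The difference between the two panels can be shown with a single scalar parameter in plain Python. This is a toy model under stated assumptions, not TensorFlow code: each replica minimizes 0.5 * (p - target)**2 for its own hypothetical target, and the asynchronous case assumes the worst-case staleness where every replica read p before any update arrived.

```python
def grad(p, target):
    """Gradient of 0.5 * (p - target)**2 with respect to p."""
    return p - target


def synchronous_step(p, targets, lr):
    """All replicas read the same p; the PS applies the averaged gradient once."""
    grads = [grad(p, t) for t in targets]
    return p - lr * sum(grads) / len(grads)


def asynchronous_step(p, targets, lr):
    """Each replica's update is applied as soon as it arrives at the PS.

    Assumption: all replicas read p before any update landed, so every
    gradient after the first is computed from a stale value.
    """
    stale = p
    for t in targets:
        p = p - lr * grad(stale, t)
    return p


targets = [1.0, 2.0, 3.0]      # one hypothetical target per replica
sync_p = synchronous_step(0.0, targets, lr=0.5)
async_p = asynchronous_step(0.0, targets, lr=0.5)
print(sync_p, async_p)
```

The synchronous step moves p once by the averaged gradient, while the asynchronous step applies three separate updates and therefore moves further, which is the kind of noise stale gradients introduce.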
  26. OPTIMIZATIONS
     • Common subexpression elimination.
     • Schedules tasks so that the time window during which intermediate results must be stored is reduced.
     • Using an ASAP/ALAP calculation, the critical path of the graph is determined to estimate when to start the Receive nodes. This reduces the chance of a sudden spike in I/O.
     • Non-blocking kernels.
     • Lossy compression of higher-precision internal representations when sending data between devices.
     • XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear algebra that optimizes TensorFlow computations.
     • Tensors also enable other optimizations for memory management and communication, such as RDMA and direct GPU-to-GPU transfer.
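The ASAP/ALAP calculation mentioned above can be sketched in a few lines of plain Python. ASAP is the earliest time a node can start; ALAP is the latest time it can start without delaying the whole graph. Nodes where the two coincide are on the critical path, and any other node (such as a Receive) can be postponed until its ALAP time. The function name, node names and costs below are hypothetical.

```python
def asap_alap(cost, deps):
    """Return (asap, alap, total) start times for a DAG.

    cost[n]: estimated run time of node n; keys must be in topological order.
    deps[n]: nodes that must finish before n can start.
    """
    order = list(cost)

    # Forward pass: earliest start is when the slowest input has finished.
    asap = {}
    for n in order:
        asap[n] = max((asap[d] + cost[d] for d in deps.get(n, ())), default=0)
    total = max(asap[n] + cost[n] for n in order)

    # Backward pass: latest start that still lets every consumer start on time.
    succs = {n: [] for n in order}
    for n, ds in deps.items():
        for d in ds:
            succs[d].append(n)
    alap = {}
    for n in reversed(order):
        latest_finish = min((alap[s] for s in succs[n]), default=total)
        alap[n] = latest_finish - cost[n]
    return asap, alap, total


# Hypothetical graph: two Receive nodes feeding MatMul and Add.
cost = {"recv_w": 1, "matmul": 5, "recv_b": 1, "add": 1}
deps = {"matmul": ["recv_w"], "add": ["matmul", "recv_b"]}

asap, alap, total = asap_alap(cost, deps)
critical = [n for n in cost if asap[n] == alap[n]]
print(asap, alap, total, critical)
```

In this example `recv_b` could start at time 0 but is not needed until time 5, so delaying it keeps its result in memory for a shorter window.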
  27. FAULT TOLERANCE
     • Checkpointing ensures that the latest state is always available.
     • If a non-supervisor worker gets killed:
        • Since workers are stateless, when the cluster manager brings the worker back up it simply contacts the PS tasks to get the updated parameters and resumes.
     • If a PS task fails:
        • The chief/supervisor is responsible for noting the failure.
        • The chief/supervisor interrupts training on all workers and restores all PS tasks from the last checkpoint.
     • If the chief itself fails:
        • Training is interrupted; when the chief comes back up, it restores from a checkpoint.
     • MonitoredTrainingSession allows automating the recovery.
     • Another approach could be to use ZooKeeper for chief election and pass …
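The checkpoint-and-resume behaviour described on this slide can be imitated in plain Python with a JSON file. This is a minimal sketch under stated assumptions, not TensorFlow's checkpoint format or API: the function names, the file name and the single parameter `w` are hypothetical, and a training step is reduced to incrementing a number.

```python
import json
import os
import tempfile


def save_checkpoint(path, step, params):
    """Write state to a temp file, then rename, so a crash never leaves a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "params": params}, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return (step, params) from the last checkpoint, or a fresh start."""
    if not os.path.exists(path):
        return 0, {"w": 0.0}
    with open(path) as f:
        state = json.load(f)
    return state["step"], state["params"]


def train(path, total_steps, every, crash_at=None):
    """Resume from the checkpoint, train, and checkpoint every `every` steps."""
    step, params = load_checkpoint(path)
    while step < total_steps:
        if crash_at is not None and step == crash_at:
            raise RuntimeError("simulated failure")
        params["w"] += 1.0              # stand-in for one training step
        step += 1
        if step % every == 0:
            save_checkpoint(path, step, params)
    return step, params


ckpt = os.path.join(tempfile.mkdtemp(), "model.ckpt.json")
try:
    train(ckpt, total_steps=10, every=3, crash_at=7)
except RuntimeError:
    pass                                # progress after the last checkpoint is lost

resumed_from = load_checkpoint(ckpt)
final_step, final_params = train(ckpt, total_steps=10, every=3)
print(resumed_from, final_step, final_params)
```

The checkpoint frequency sets how much work is repeated after a failure: here the run crashes after step 7 and restarts from the checkpoint written at step 6.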
  28. SERVING THE MODEL
     • TensorFlow's recommended way to serve a model in production is TF Serving.
     • Advantages:
        • Supports both online and batch modes
        • Supports both the hosted and the library approach
        • Supports multiple models in a single process
        • Supports Docker and Kubernetes
  29. BENCHMARKS
     • Instance type: NVIDIA® DGX-1™
     • GPU: 8x NVIDIA® Tesla® P100
     • OS: Ubuntu 16.04 LTS with tests run via Docker
     • CUDA / cuDNN: 8.0 / 5.1
     • TensorFlow GitHub hash: b1e174e
     • Benchmark GitHub hash: 9165a70
     • Build command: bazel build -c opt --copt=-march="haswell" --config=cuda //tensorflow/tools/pip_package:build_pip_package
  30. REFERENCES & FURTHER READING
     • Paper on Large-Scale Machine Learning on Heterogeneous Distributed Systems
     • TensorFlow documentation
     • TensorFlow tutorials
     • Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron
  31. THANK YOU!