
TensorFlow Study (Part II) for GPU


This slide deck introduces BFC, the algorithm that plays the central role in TensorFlow's GPU memory management (allocation/deallocation). It also covers the StreamExecutor and how it takes part in allocating GPU memory for an output tensor.

Published in: Technology


  1. ITRI CONFIDENTIAL DOCUMENT. DO NOT COPY OR DISTRIBUTE.
     TensorFlow Study (Part II): GPU part
     Danny Liu (劉得彥)
     Information and Communications Research Laboratories (ICL)
  2. GPU Options
     • We can change the GPU options as follows:
     • message GPUOptions
       ▪ double per_process_gpu_memory_fraction = 1;
       ▪ string allocator_type = 2;
         a. "BFC": a "best-fit with coalescing" algorithm
       ▪ int64 deferred_deletion_bytes = 3;
         a. Delay deletion of up to this many bytes to reduce the number of interactions with GPU driver code.
       ▪ bool allow_growth = 4;
       ▪ string visible_device_list = 5;
         a. For instance:
            » import os
            » os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
       ▪ int32 polling_active_delay_usecs = 6;
         a. In the event polling loop, sleep this many microseconds between PollEvents calls when the queue is not empty.
       ▪ int32 polling_inactive_delay_msecs = 7;
         a. In the event polling loop, sleep this many milliseconds between PollEvents calls when the queue is empty.
       ▪ bool force_gpu_compatible = 8;
         a. Force all tensors to be gpu_compatible. On a GPU-enabled TensorFlow, enabling this option forces all CPU tensors to be allocated with CUDA pinned memory.
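The options above can be sketched in a short script. Setting CUDA_VISIBLE_DEVICES through `os.environ` is standard-library Python and mirrors `visible_device_list`; the ConfigProto part is shown only in comments, since it requires a TensorFlow 1.x install.

```python
import os

# visible_device_list has an environment-variable counterpart,
# CUDA_VISIBLE_DEVICES. It must be set before TensorFlow (or any
# CUDA library) initializes the driver, or it has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # expose only GPUs 0 and 1

# In TF 1.x the remaining GPUOptions fields are set through a ConfigProto,
# e.g. (not executed here; requires TensorFlow):
#   config = tf.ConfigProto(gpu_options=tf.GPUOptions(
#       per_process_gpu_memory_fraction=0.5,  # field 1
#       allocator_type="BFC",                 # field 2
#       allow_growth=True))                   # field 4
#   sess = tf.Session(config=config)
print(os.environ["CUDA_VISIBLE_DEVICES"])  # → 0,1
```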
  3. BFC (best-fit with coalescing)
     • Chunks point to memory.
       ▪ Their prev/next pointers form a doubly linked list ordered by base address; neighboring chunks within a region cover contiguous memory.
       ▪ Chunks record whether they are in use or free, and hold a pointer to the bin they belong to.
     [Diagram: an AllocationRegion of GPU memory divided into chunks; each chunk stores size, requested_size, allocation_id, prev/next chunk handles, bin_num, and in_use; free chunks within a bin are ordered by size.]
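The chunk list described above can be sketched in a few lines. This is an illustrative simplification, not TensorFlow code: the field names loosely follow BFCAllocator::Chunk, and the merge step is the "coalescing" that gives BFC its name.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Chunk:
    ptr: int                          # base offset within the region
    size: int                         # bytes covered by this chunk
    in_use: bool = False
    prev: Optional["Chunk"] = None    # address-order neighbor (lower)
    next: Optional["Chunk"] = None    # address-order neighbor (higher)

def free_and_coalesce(c: Chunk) -> Chunk:
    """Mark c free and merge it with any free address-neighbors."""
    c.in_use = False
    # Absorb the next neighbor if it is free (memory is contiguous).
    if c.next is not None and not c.next.in_use:
        n = c.next
        c.size += n.size
        c.next = n.next
        if n.next:
            n.next.prev = c
    # Let a free previous neighbor absorb us.
    if c.prev is not None and not c.prev.in_use:
        p = c.prev
        p.size += c.size
        p.next = c.next
        if c.next:
            c.next.prev = p
        c = p
    return c
```

Freeing the middle chunk of a free/used/free triple collapses all three into one chunk covering the whole span, which is exactly why adjacent chunks must be contiguous in memory.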
  4. BFC (best-fit with coalescing)
     • The BFC concept:
       ▪ Operations for a bin: Search, Insert, and Delete.
       ▪ Bins are sized in powers of two: bin 0 holds chunks of 256 * 2^0 = 256 bytes, bin 1 of 256 * 2^1, bin 2 of 256 * 2^2, ..., bin 20 of 256 * 2^20 = 256 MB.
       ▪ A RegionManager manages all AllocationRegions (regions_); chunks_ and free_chunks_list_ track the chunks for free-memory management.
       ▪ Allocation/extension flow:
         1. BFC tries to allocate memory; if it cannot find a chunk in the bins, it calls Extend().
         2. If curr_region_allocation_bytes_ is smaller than the allocation size, it is doubled (multiplied by 2) until it is sufficient.
         3. If the device allocation fails, the size is reduced by a factor of 0.9 and retried.
         4. One large chunk is created for the whole newly allocated memory space; it will be split into smaller chunks later.
     http://blog.csdn.net/qq_33096883/article/details/77479647
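The sizing rules above can be expressed directly. This is a sketch of the arithmetic only, under the assumption that bin k is the smallest bin whose chunk size 256 * 2^k can hold the request; function names are illustrative, not TensorFlow's.

```python
import math

MIN_ALLOCATION = 256   # bin 0 chunk size in bytes
NUM_BINS = 21          # bins 0..20, up to 256 * 2^20 = 256 MB

def bin_index(size: int) -> int:
    """Smallest bin whose chunk size (256 * 2^k) can hold `size` bytes."""
    k = max(0, math.ceil(math.log2(size / MIN_ALLOCATION)))
    return min(k, NUM_BINS - 1)

def extend_bytes(curr_region_allocation_bytes: int, rounded_bytes: int) -> int:
    """Step 2: double the planned region size until it covers the request."""
    while curr_region_allocation_bytes < rounded_bytes:
        curr_region_allocation_bytes *= 2
    return curr_region_allocation_bytes

def backoff(bytes_requested: int) -> int:
    """Step 3: after a failed device allocation, retry with 90% of the size."""
    return int(bytes_requested * 0.9)
```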
  5. When an output tensor is allocated
     • A customized operation (Op) wants to allocate memory for its output tensor.
  6. Tensor GPU memory allocation
     • When an operation creates an output tensor during its computation, a GPU memory allocation for that tensor takes place.
     • GPUBFCAllocator is a BFC implementation class.
     [Diagram: Tensor A holds an Allocator* and a Buffer*. The allocation call chain is BFCAllocator::AllocateRaw() → retry_helper_.AllocateRaw() → GPUBFCAllocator::AllocateInternal() → BFCAllocator::Extend() → suballocator_->Alloc(). That last call is where the memory allocation actually happens, because GPUMemAllocator inherits from SubAllocator: it calls stream_exec_->AllocateArray().opaque(), i.e. StreamExecutor::Allocate() → CUDAExecutor → CUDADriver::DeviceAllocate() → GPU memory, and the result is wrapped via DeviceMemory::MakeFromByteSize().]
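The two-level structure of that call chain can be sketched as follows. This is plain illustrative Python, not TensorFlow code: the sub-allocator stands in for GPUMemAllocator / StreamExecutor::Allocate(), and the free-chunk lookup here is a first-fit scan over a flat list, whereas the real allocator does best-fit through the bins.

```python
class SubAllocator:
    """Stand-in for GPUMemAllocator: the layer that actually obtains memory."""
    def __init__(self):
        self.device_bytes = 0  # pretend device break pointer
    def alloc(self, num_bytes: int) -> int:
        # Real code: stream_exec_->AllocateArray(...).opaque(), which ends
        # in CUDADriver::DeviceAllocate(). Here we just bump an offset.
        base = self.device_bytes
        self.device_bytes += num_bytes
        return base

class BFCAllocatorSketch:
    def __init__(self, sub: SubAllocator):
        self.sub = sub
        self.free_chunks = []  # (ptr, size) pairs; bins in the real allocator
    def allocate_raw(self, num_bytes: int) -> int:
        # 1) Try to find a free chunk (bin search in the real allocator).
        for i, (ptr, size) in enumerate(self.free_chunks):
            if size >= num_bytes:
                del self.free_chunks[i]
                return ptr
        # 2) No chunk found: Extend() asks the sub-allocator for a new region.
        return self.sub.alloc(num_bytes)
```

The point of the split is visible even in this toy: BFC only talks to the device driver (via the sub-allocator) when its own free lists cannot satisfy the request.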
  7. StreamExecutor Runtime Library
     • A unified wrapper around the CUDA and OpenCL host-side programming models (runtimes).
     • Supports cuBLAS and cuDNN (tensorflow/stream_executor/blas.h and dnn.h).
     • It lets host code target either CUDA or OpenCL devices with identically functioning data-parallel kernels.
  8. StreamExecutor Runtime Library
     • Contrast with OpenMP:
       ▪ OpenMP generates both the kernel code that runs on the device and the host-side code needed to launch the kernel.
       ▪ StreamExecutor only generates the host-side code.
     [Diagram: class relationship among StreamExecutor, StreamExecutorImpl, and StreamExecutorInterface.]
