“GPU With CUDA Architecture”
Presented By-
Dhaval Kaneria (13014061010)
Guided By-
Mr. Rajesh k Navandar
Table Of Contents
• Introduction of GPU
• Performance Factors Of GPU
• GPU Pipeline
• Block Diagram Of Pipeline Process Flow
• Introduction Of CUDA
• Thread Batching
• Simple Processing Flow
• CUDA C/C++
• Applications
• The Future Scope Of CUDA Technology
• Conclusion
• References
Introduction of GPU
• A Graphics Processing Unit (GPU) is a microprocessor that has been designed specifically for processing 3D graphics.
• The processor is built with integrated transform, lighting, triangle setup/clipping, and rendering engines, capable of handling millions of math-intensive operations per second.
• GPUs form the heart of modern graphics cards, relieving the CPU (central processing unit) of much of the graphics processing load. They allow products such as desktop PCs, portable computers, and game consoles to process real-time 3D graphics that only a few years ago were available only on high-end workstations.
• Used primarily for 3D applications, a GPU is a single-chip processor that creates lighting effects and transforms objects every time a 3D scene is redrawn. These are mathematically intensive tasks which would otherwise put quite a strain on the CPU; lifting this burden from the CPU frees up cycles that can be used for other jobs.
Performance Factors Of GPU
• Fill Rate:
The number of pixels or texels (textured pixels) the GPU can render to memory per second. It shows the true throughput of the GPU; modern GPUs have fill rates as high as 3.2 billion pixels per second. The fill rate can be increased by raising the clock given to the GPU. (A rough worked example follows this list.)
• Memory Bandwidth:
The data transfer speed between the graphics chip and its local frame buffer. More bandwidth usually gives better performance when the image to be rendered is of high quality and at very high resolution.
• Memory Management:
The performance of the GPU also depends on how efficiently memory is managed, because memory bandwidth can become the sole bottleneck if it is not managed properly.
• Hidden Surface Removal:
Reducing overdraw when rendering a scene by not rendering surfaces that are not visible. This helps a lot in increasing GPU performance.
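To make the fill-rate and bandwidth numbers concrete, the small C sketch below estimates the pixel throughput and color-write bandwidth needed for an assumed 1920x1080 display at 60 Hz with an average overdraw factor of 3; these figures are illustrative assumptions, not values taken from the slides.

#include <stdio.h>

int main(void) {
    const double width = 1920.0, height = 1080.0;
    const double refresh_hz = 60.0;      // frames per second
    const double overdraw = 3.0;         // each pixel shaded ~3 times per frame (assumed)
    const double bytes_per_pixel = 4.0;  // 32-bit color

    double pixels_per_frame = width * height;                     // ~2.07 Mpixels
    double fill_rate = pixels_per_frame * refresh_hz * overdraw;  // pixels per second
    double bandwidth = fill_rate * bytes_per_pixel;               // bytes per second (color writes only)

    printf("Required fill rate   : %.2f Gpixels/s\n", fill_rate / 1e9);
    printf("Color write bandwidth: %.2f GB/s\n", bandwidth / 1e9);
    return 0;
}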
GPU Pipeline
• The GPU receives geometry information from the CPU as input and produces a picture as output.
• The host interface is the communication bridge between the CPU and the GPU.
• It receives commands from the CPU and also pulls geometry information from system memory.
• It outputs a stream of vertices in object space with all their associated information (normals, texture coordinates, per-vertex color, etc.).
• The vertex processing stage receives vertices from the host interface in object space and outputs them in screen space (a sketch of such a transformation follows the stage list below).
• This may be a simple linear transformation, or a complex operation involving morphing effects.
host interface → vertex processing → triangle setup → pixel processing → memory interface
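As a hedged illustration of the "simple linear transformation" case above, the C sketch below shows one way a vertex could be carried from object space to screen space with a 4x4 model-view-projection matrix, a perspective divide, and a viewport mapping. The function name, row-major matrix layout, and viewport convention are assumptions for illustration, not the pipeline's actual implementation.

typedef struct { float x, y, z, w; } Vec4;

Vec4 transform_vertex(const float mvp[16], Vec4 v, float vp_width, float vp_height) {
    Vec4 clip;  // object space -> clip space via the combined model-view-projection matrix
    clip.x = mvp[0]*v.x  + mvp[1]*v.y  + mvp[2]*v.z  + mvp[3]*v.w;
    clip.y = mvp[4]*v.x  + mvp[5]*v.y  + mvp[6]*v.z  + mvp[7]*v.w;
    clip.z = mvp[8]*v.x  + mvp[9]*v.y  + mvp[10]*v.z + mvp[11]*v.w;
    clip.w = mvp[12]*v.x + mvp[13]*v.y + mvp[14]*v.z + mvp[15]*v.w;

    Vec4 screen;  // perspective divide, then map normalized device coords to the viewport
    screen.x = (clip.x / clip.w * 0.5f + 0.5f) * vp_width;
    screen.y = (clip.y / clip.w * 0.5f + 0.5f) * vp_height;
    screen.z = clip.z / clip.w;  // depth, kept for the z-buffer test
    screen.w = clip.w;           // w is kept for perspective-correct interpolation later
    return screen;
}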
Cont..
• A fragment is generated if and only if its center is inside the triangle.
• Every fragment generated has its attributes computed as the perspective-correct interpolation of the three vertices that make up the triangle (a small sketch of this follows).
• Each fragment provided by triangle setup is fed into fragment processing as a set of attributes (position, normal, texcoord, etc.), which are used to compute the final color for this pixel.
• Before the final write occurs, some fragments are rejected by the z-buffer, stencil and alpha tests.
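A minimal sketch of perspective-correct interpolation of one vertex attribute, assuming b0, b1, b2 are the fragment's barycentric weights inside the triangle, a0–a2 the per-vertex attribute values, and w0–w2 the clip-space w of the three vertices (all names are illustrative assumptions):

float interpolate_perspective(float b0, float b1, float b2,
                              float a0, float a1, float a2,
                              float w0, float w1, float w2) {
    // Interpolate attribute/w and 1/w linearly in screen space...
    float num = b0 * (a0 / w0) + b1 * (a1 / w1) + b2 * (a2 / w2);
    float den = b0 * (1.0f / w0) + b1 * (1.0f / w1) + b2 * (1.0f / w2);
    // ...then divide to recover the perspective-correct attribute at this fragment.
    return num / den;
}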
Block Diagram Of Pipeline Process Flow
Cont..
• Allows a shader to be applied to each vertex: transformation and other per-vertex operations
• Allows the vertex shader to fetch texture data
• Cull/clip: per-primitive operations and data preparation for rasterization
• Rasterization: primitive-to-pixel mapping
• Z culling: quick pixel elimination based on depth (sketched below)
• Fragment: a candidate pixel
• Varying number of pixel pipelines
• SIMD processing hides texture fetch latency
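A hedged sketch of the idea behind Z culling and the depth test: a fragment whose depth is not closer than the value already stored in the z-buffer is discarded before any expensive shading work is done. The function and variable names are illustrative assumptions.

int depth_test_passes(float fragment_depth, float *zbuffer, int pixel_index) {
    if (fragment_depth >= zbuffer[pixel_index])
        return 0;                              // occluded: cull the fragment early
    zbuffer[pixel_index] = fragment_depth;     // visible so far: record the new nearest depth
    return 1;                                  // continue on to fragment shading
}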
Introduction Of CUDA
• CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model, created by NVIDIA, in which general-purpose computations are executed on the graphics processing unit.
CUDA Programming Model:
A Highly Multithreaded Coprocessor
• The GPU is viewed as a compute device that:
 Is a coprocessor to the CPU or host
 Has its own DRAM (device memory)
 Runs many threads in parallel
• Data-parallel portions of an application are executed on the device as kernels which
run in parallel on many threads
• Differences between GPU and CPU threads
 GPU threads are extremely lightweight
 Very little creation overhead
 GPU needs 1000s of threads for full efficiency
 Multi-core CPU needs only a few
Thread Batching: Grids and Blocks
•A kernel is executed as a grid of thread
blocks
–All threads share data memory
space
•A thread block is a batch of threads that
can cooperate with each other by:
–Synchronizing their execution
•For hazard-free shared memory
accesses
–Efficiently sharing data through a
low latency shared memory
•Two threads from two different blocks
cannot cooperate
[Figure: the host launches Kernel 1 as Grid 1 and Kernel 2 as Grid 2 on the device; each grid is a 2D array of thread blocks (Block (0, 0) … Block (2, 1)), and each block, e.g. Block (1, 1), is a 2D array of threads (Thread (0, 0) … Thread (4, 2)). Courtesy: NVIDIA]
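A hedged sketch of block-level cooperation (an assumed kernel, not from the slides): threads of one block stage data in low-latency shared memory, synchronize with __syncthreads() for hazard-free access, and then read it back in reversed order. Threads belonging to different blocks could not cooperate this way.

__global__ void reverse_block(int *data) {
    __shared__ int tile[256];                       // per-block shared memory (assumes blockDim.x <= 256)
    int t = threadIdx.x;
    tile[t] = data[blockIdx.x * blockDim.x + t];    // each thread stages one element
    __syncthreads();                                // wait for all writes before any reads
    data[blockIdx.x * blockDim.x + t] = tile[blockDim.x - 1 - t];
}
// Possible launch: reverse_block<<<numBlocks, 256>>>(devData);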
Block and Thread IDs
•Threads and blocks have IDs
–So each thread can decide what data to
work on
–Block ID: 1D or 2D
–Thread ID: 1D, 2D, or 3D
•Simplifies memory addressing when processing multidimensional data (see the sketch after the figure)
–Image processing
–Solving PDEs on volumes
[Figure: the same device/grid/block/thread hierarchy as above, annotated with block IDs (Block (0, 0) … Block (2, 1)) and thread IDs (Thread (0, 0) … Thread (4, 2)). Courtesy: NVIDIA]
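A minimal sketch of how block and thread IDs map onto multidimensional data, in the spirit of the image-processing bullet above (the kernel name and brightness operation are assumptions):

__global__ void brighten(unsigned char *img, int width, int height, int delta) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column computed from block/thread IDs
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row computed from block/thread IDs
    if (x < width && y < height) {                   // guard: the grid may overhang the image
        int i = y * width + x;
        int v = img[i] + delta;
        img[i] = (unsigned char)(v < 0 ? 0 : (v > 255 ? 255 : v));
    }
}
// Possible launch:
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// brighten<<<grid, block>>>(devImg, width, height, 40);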
CUDA Device Memory Space Overview
•Each thread can:
–R/W per-thread registers
–R/W per-thread local memory
–R/W per-block shared memory
–R/W per-grid global memory
–Read only per-grid constant memory
–Read only per-grid texture memory
[Figure: the device memory hierarchy — per-thread registers and local memory, per-block shared memory, and per-grid global, constant and texture memory shared by all blocks of the grid.]
•The host can R/W global, constant, and texture memories
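A hedged sketch of where variables live in the memory spaces listed above (the names and kernel body are illustrative assumptions; texture memory, which is read through the texture fetch API, is omitted here):

__constant__ float coeff[16];      // per-grid constant memory: read-only inside kernels
__device__   float table[1024];    // per-grid global memory: R/W by every thread

__global__ void demo(float *out) {
    __shared__ float tile[128];    // per-block shared memory (assumes blockDim.x <= 128)
    float tmp;                     // held in a per-thread register (or local memory if spilled)
    int t = threadIdx.x;
    tile[t] = table[t] * coeff[t % 16];
    __syncthreads();                           // needed because a neighbour's value is read next
    tmp = tile[(t + 1) % blockDim.x];
    out[blockIdx.x * blockDim.x + t] = tmp;    // result written back to global memory
}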
Global, Constant, and Texture Memories
•Global memory
–Main means of communicating R/W data between host and device
–Contents visible to all threads
•Texture and Constant Memories
–Constants initialized by the host (a short host-side sketch follows the figure)
–Contents visible to all threads
[Figure: the same device memory hierarchy as on the previous slide, with the host connected to the global, constant and texture memories. Courtesy: NVIDIA]
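A hedged host-side sketch (assumed names) of how these per-grid memories are written from the CPU: global memory through cudaMemcpy, constant memory through cudaMemcpyToSymbol.

#include <cuda_runtime.h>

__constant__ float devCoeff[16];

void upload(const float *hostData, float *devGlobal, const float *hostCoeff, size_t n) {
    // Global memory: the main host <-> device communication path
    cudaMemcpy(devGlobal, hostData, n * sizeof(float), cudaMemcpyHostToDevice);
    // Constant memory: initialized by the host, then read-only for all threads
    cudaMemcpyToSymbol(devCoeff, hostCoeff, 16 * sizeof(float));
}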
Simple Processing Flow
1. Copy input data from CPU memory to GPU memory
2. The CPU instructs the GPU to process the data (launches a kernel)
3. The GPU loads the program and executes it, caching data on chip for performance
4. Copy results from GPU memory back to CPU memory
(A minimal code sketch of these steps follows.)
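A hedged sketch of the four steps above around a simple vector-add kernel; names such as vecAdd, h_a and d_a are illustrative assumptions.

#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

void run(const float *h_a, const float *h_b, float *h_c, int N) {
    float *d_a, *d_b, *d_c;
    size_t bytes = N * sizeof(float);
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);

    // 1. Copy input data from CPU memory to GPU memory
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // 2./3. The CPU launches the kernel; the GPU executes it
    vecAdd<<<(N + 255) / 256, 256>>>(d_a, d_b, d_c, N);

    // 4. Copy results from GPU memory back to CPU memory
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
}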
CUDA C/C++
• CUDA Language:
C with Minimal Extensions
• Philosophy: provide minimal set of extensions necessary to expose power
• Declaration specifiers to indicate where things live
__global__ void KernelFunc(...); // kernel function, runs on device
__device__ int GlobalVar; // variable in device memory
__shared__ int SharedVar; // variable in per-block shared memory
• Extend function invocation syntax for parallel kernel launch
KernelFunc<<<500, 128>>>(...); // launch 500 blocks w/ 128 threads each
• Special variables for thread identification in kernels
dim3 threadIdx; dim3 blockIdx; dim3 blockDim; dim3 gridDim;
• Intrinsics that expose specific operations in kernel code
__syncthreads(); // barrier synchronization within kernel
Applications
•Military (lots)
•Mine planning
•Molecular dynamics
•MRI reconstruction
•Network processing
•Neural network
•Protein folding
•Quantum chemistry
•Ray tracing
•Radar
•Reservoir simulation
•Robotic vision/AI
•Robotic surgery
•Satellite data analysis
•Seismic imaging
•Surgery simulation
•3D image analysis
•Adaptive radiation therapy
•Astronomy
•Automobile vision
•Bioinformatics
•Biological simulation
•Broadcast
•Computational Fluid Dynamics
•Computer Vision
•Cryptography
•CT reconstruction
•Data Mining
•Electromagnetic simulation
•Equity trading
•Financial - lots of areas
•Mathematics research
Simulation Result
•If the CUDA software is installed and configured correctly, the output of the deviceQuery sample should look similar to the listing shown on this slide (the detected GPU and its properties).
•Valid results from the bandwidthTest CUDA sample.
•Create an array of size BLOCKS, allocate space for it on the device, and call generateArray<<<BLOCKS,1>>>( deviceArray );
•The kernel then runs as BLOCKS parallel thread blocks (one thread each), creating the entire array in a single call (a hedged sketch follows).
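A hedged sketch of the generateArray example described above; the slide does not show the kernel body, so here it is assumed simply to write each block's index into the array.

#include <cuda_runtime.h>
#define BLOCKS 64

__global__ void generateArray(int *deviceArray) {
    deviceArray[blockIdx.x] = blockIdx.x;   // one block (of one thread) fills one element
}

int main(void) {
    int hostArray[BLOCKS];
    int *deviceArray;
    cudaMalloc(&deviceArray, BLOCKS * sizeof(int));   // space for the array on the device
    generateArray<<<BLOCKS, 1>>>(deviceArray);        // BLOCKS blocks, 1 thread each
    cudaMemcpy(hostArray, deviceArray, BLOCKS * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(deviceArray);
    return 0;
}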
The Future Scope Of CUDA Technology
• Much current research targets general-purpose computation on GPUs. Because GPUs offer highly efficient and flexible parallel programmability, a growing number of researchers and business organizations have begun using them for non-graphical calculations, creating a new field of study: GPGPU (General-Purpose computation on GPU), whose objective is to apply the GPU to broader scientific computing. GPGPU has been used successfully in algebra, fluid simulation, database applications, spectrum analysis, and other non-graphical applications.
• Region-based Software Virtual Memory (RSVM): a software virtual memory running on both CPU and GPU in a distributed and cooperative way.
• Size reduction
• Cooling techniques
Conclusion
• CUDA is a powerful parallel programming model
Heterogeneous - mixed serial-parallel programming
Scalable - hierarchical thread execution model
Accessible - minimal but expressive changes to C
• CUDA on GPUs can achieve great results on data-parallel computations with a few simple performance optimization strategies (the sketch below illustrates the shared-memory and divergence points):
• Structure your application and select execution configurations to maximize exploitation of the GPU’s parallel capabilities.
• Minimize CPU ↔ GPU data transfers.
• Coalesce global memory accesses.
• Take advantage of shared memory.
• Minimize divergent warps.
• Minimize use of low-throughput instructions.
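A hedged sketch, in the spirit of reference [6], of a block-level sum reduction that follows several of the strategies above: consecutive threads load consecutive elements (coalesced global access), partial sums are combined in shared memory, and the stride-halving loop lets whole warps drop out together instead of branching divergently. The kernel name and block size are assumptions.

__global__ void blockSum(const float *in, float *blockResults, int n) {
    __shared__ float sdata[256];                 // assumes blockDim.x == 256
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;         // coalesced: consecutive threads read
    sdata[t] = (i < n) ? in[i] : 0.0f;           // consecutive global addresses
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (t < s) sdata[t] += sdata[t + s];     // active threads stay contiguous
        __syncthreads();
    }
    if (t == 0) blockResults[blockIdx.x] = sdata[0];   // one partial sum per block
}
// Possible launch: blockSum<<<numBlocks, 256>>>(d_in, d_partial, n);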
References
1. Xiao Yang, Shamik K. Valia, Michael J. Schulte, Ruby B. Lee, "Exploration and Evaluation of PLX Floating-point Instructions and Implementations for 3D Graphics", IEEE, 2004.
2. Lei Wang, Yong-zhong Huang, Xin Chen, Chun-yan Zhang, "Task Scheduling of Parallel Processing in CPU-GPU Collaborative Environment", CSIT, 2008.
3. Feng Ji, Heshan Lin, Xiaosong Ma, "RSVM: A Region-based Software Virtual Memory for GPU", IEEE, 2013.
4. "CUDA Architecture Overview" by Nathan Whitehead, Alex Fit-Florea, NVIDIA Corporation.
5. "CUDA C/C++ Basics" by Cyril Zeller, NVIDIA Corporation.
6. "Optimizing Parallel Reduction in CUDA" by Mark Harris, NVIDIA Developer Technology.
Thank You
Editor's Notes
• Global, constant, and texture memory spaces are persistent across kernels called by the same application.