Parallel Computing on the GPU
Tilani Gunawardena
Goals
• How to program heterogeneous parallel
computing systems and achieve
– High performance and energy efficiency
– Functionality and maintainability
– Scalability across future generations
• Technical subjects
– Principles and patterns of parallel algorithms
– Programming API, tools and techniques
Tentative Schedule
– Introduction
– GPU Computing and CUDA Intro
– CUDA threading model
– CUDA memory model
– CUDA performance
– Floating Point Considerations
– Application Case Study
Recommended Textbook/Notes
• D. Kirk and W. Hwu, “Programming Massively
Parallel Processors: A Hands-on Approach”
• http://www.nvidia.com/ (Communities →
CUDA Zone)
• Would you rather plow a field with two strong
oxen or 1024 chickens?
How to Dig a Hole Faster?
1. Dig faster
2. Buy a more productive shovel
3. Hire more diggers → best approach
Problems:
1. How do we manage them?
2. Will they get in each other’s way?
3. Will more diggers help dig the hole deeper, instead of
just wider?
1. Dig faster: the processor runs with a faster clock to
spend less time on each step of a
computation (limit: power consumption on the chip;
increasing the clock speed increases power consumption)
2. Buy a more productive shovel: the processor does more work
on each clock cycle (how much instruction-level
parallelism per clock cycle)
3. Hire more diggers → best approach: parallelism
Parallelism
• Solve large problems by breaking them into
small pieces
• Then run the smaller pieces at the same time
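A minimal sketch of this idea in Python (hypothetical example, standard library only): split a large sum into chunks and run the chunks concurrently. A thread pool keeps the sketch simple; a real CPU-bound job would use processes or a GPU for true parallel execution.

```python
# Split a big problem (summing a list of numbers) into small pieces,
# then run the pieces at the same time on a pool of workers.
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    return sum(chunk)

def parallel_sum(data, workers=4):
    # Break the problem into one chunk per worker ...
    size = (len(data) + workers - 1) // workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # ... run the small pieces concurrently, then combine the results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))

print(parallel_sum(list(range(1_000_000))))  # 499999500000
```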
Modern GPU
• 1000’s of ALUs
• 100’s of processors
• Tens of thousands of concurrent threads
• Modern GPU
– Ex: GeForce GTX Titan X
CUDA cores: 3072
8 billion transistors
12 GB GDDR5 memory
Memory bandwidth: 336 GB/s
~65,000 concurrent threads
Feature size of processors over time
As feature size decreases, transistors
• get smaller
• run faster
• use less power
• and more of them fit on a chip
• As transistors improved, processor designers
increased the clock rates of processors,
running them faster and faster every year
• Why don’t we keep increasing clock speed?
• Have transistors stopped getting smaller and faster?
– Problem: heat
• Even though transistors continue to get smaller, faster, and more
energy-efficient per transistor, running billions of transistors generates a lot of
heat, and we cannot keep such processors cool
• We cannot make a single processor faster and faster (we end up with processors we can’t
keep cool)
• Processor designers therefore build
– Smaller, more power-efficient processors
– A larger number of efficient processors
(rather than fewer, faster, less efficient processors)
• What kind of processors should we build?
• CPU
– Complex control hardware
– Flexibility in performance
– Expensive in terms of power
• GPU
– Simpler control hardware
– More hardware for computation
– Potentially more power efficient
– More restrictive programming
model
Latency vs Throughput
• Latency: the amount of time to complete a
task (time, e.g. seconds)
• Throughput: tasks completed per unit
time (e.g. jobs/hour)
Your goals are not aligned with the post
office’s goals:
Your goal: optimize for latency
(you want to spend as little time as possible)
Post office: optimize for throughput
(the number of customers served per
day)
CPU: optimized for latency (minimize the time
elapsed for one particular task)
GPU: chooses to optimize for throughput
Bandwidth
• How fast a device can send data over a single
cable
Bandwidth vs Throughput vs Latency
– Bandwidth is the maximum amount of data that
can travel through a 'channel'.
– Throughput is how much data actually does travel
through the 'channel' successfully.
– Latency is how long it takes data
to travel all the way from the start point to the
end point.
Latency vs Bandwidth
• Drive from Colombo to Kandy (100 km)
– Car (5 people, 60 km/h)
– Bus (60 people, 20 km/h)
• Calculate
– Latency?
– Throughput?
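A worked version of this exercise (assuming each vehicle makes a single, fully loaded trip):

```python
# Latency: time for one trip; throughput: people delivered per hour.
def trip_stats(distance_km, speed_kmh, passengers):
    latency_h = distance_km / speed_kmh   # time to complete one trip
    throughput = passengers / latency_h   # people delivered per hour
    return latency_h, throughput

car = trip_stats(100, 60, 5)    # ≈ (1.67 h, 3 people/h)
bus = trip_stats(100, 20, 60)   # (5.0 h, 12 people/h)
# The car wins on latency; the bus wins on throughput,
# mirroring the CPU (latency) vs GPU (throughput) trade-off.
```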
GPUs from the point of view of the
software developer ?
• The importance of programming in parallel
– 8-core Ivy Bridge processor (Intel)
– 8-wide AVX vector operations per core
– 2 threads per core (hyper-threading)
= 128-way parallelism
On this processor, if you run a completely serial C
program with no parallelism at all, you are going
to use less than 1% of the capability of the
machine.
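The arithmetic behind that claim:

```python
cores = 8             # 8-core Ivy Bridge
avx_lanes = 8         # 8-wide AVX vector operations per core
threads_per_core = 2  # hyper-threading

parallelism = cores * avx_lanes * threads_per_core  # 128-way
serial_utilization = 1 / parallelism                # one lane of one thread
print(parallelism, f"{serial_utilization:.2%}")     # 128 0.78%
```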
Introduction
• Microprocessors based on a single CPU drove rapid performance
increases and cost reductions in computer applications for
more than two decades.
– Users demand even more improvements once they become
accustomed to them, creating a positive cycle for the
computer industry.
• This drive has slowed since 2003 due to power consumption
issues that limit the increase of the clock frequency and
the level of productive activity that can be performed in
each clock period within a single CPU.
– All microprocessor vendors have switched to multi-core and many-
core models, where multiple processing units are used in each chip to
increase the processing power.
• The vast majority of software applications are written as sequential programs
– The expectation was that programs would run faster with each new
generation of microprocessors. This is no longer valid.
– No performance improvement
– Reducing the growth opportunities of the computer industry
• Software applications will continue to enjoy performance improvement as
parallel programs, in which multiple threads of
execution cooperate to achieve the functionality faster.
• Parallel programming is by no means new
– The HPC community has been developing parallel
programs for decades.
– But these programs ran on large-scale, expensive
computers, and only a few elite applications justified the
cost, in practice limiting parallel
programming to a small number of application
developers.
• Now that all new microprocessors are parallel
computers, the number of applications that need
to be developed as parallel programs has
increased.
GPU as Parallel Computers
• Since 2003, a class of many-core processors
called GPUs has led the race for floating-
point performance.
While the performance of general-purpose
microprocessors has slowed, GPUs
have continued to improve.
Many application developers are motivated to move the computationally intensive
parts of their software to the GPU for execution.
Why Is There This Large Gap?
• The answer lies in the differences in the
fundamental design philosophies of
the two types of processors:
CPU: latency-oriented
cores
GPU: throughput-
oriented cores
CPU: Latency Oriented Design
• CPU is optimized for sequential code
performance
• Large caches
– Convert long latency memory accesses to short
latency cache accesses
• Sophisticated control
– Branch prediction for reduced branch latency
– Data forwarding for reduced data latency
• Powerful ALU
– Reduced operation latency
GPU: Throughput Oriented Design
• GPU is optimized for the execution of a massive
number of threads.
• Small caches
– To boost memory throughput
• Simple control
– No branch prediction
– No data forwarding
• Energy-efficient ALUs
– Many, long-latency but heavily pipelined for high
throughput
• Requires a massive number of threads to tolerate
latencies
Winning Applications Use Both CPU
and GPU
• CPUs for sequential parts where latency
matters
– CPUs can be 10+X faster than GPUs for sequential
code
• GPUs for parallel parts where throughput wins
– GPUs can be 10+X faster than CPUs for parallel
code
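As an illustration of why winning applications use both, here is a hypothetical Amdahl-style sketch; the 10x figure and the 90% parallel fraction are assumptions for the example, not measurements:

```python
# Overall speedup when the parallel fraction p of a program runs on a
# GPU (gpu_speedup times faster) and the sequential rest stays on the CPU.
def heterogeneous_speedup(p, gpu_speedup):
    return 1 / ((1 - p) + p / gpu_speedup)

# A program that is 90% parallel, with a GPU 10x faster on that part:
print(heterogeneous_speedup(0.9, 10))  # ≈ 5.26x overall
```

Even a very fast GPU cannot help the sequential 10%, which is why that part belongs on a latency-oriented CPU.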
Applications

ClimART Action | eTwinning Project
 
Multi Domain Alias In the Odoo 17 ERP Module
Multi Domain Alias In the Odoo 17 ERP ModuleMulti Domain Alias In the Odoo 17 ERP Module
Multi Domain Alias In the Odoo 17 ERP Module
 
week 1 cookery 8 fourth - quarter .pptx
week 1 cookery 8  fourth  -  quarter .pptxweek 1 cookery 8  fourth  -  quarter .pptx
week 1 cookery 8 fourth - quarter .pptx
 
Grade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptxGrade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptx
 
Mattingly "AI & Prompt Design: Large Language Models"
Mattingly "AI & Prompt Design: Large Language Models"Mattingly "AI & Prompt Design: Large Language Models"
Mattingly "AI & Prompt Design: Large Language Models"
 

Parallel Computing on the GPU

  • 1. Parallel Computing on the GPU Tilani Gunawardena
  • 2. Goals • How to program heterogeneous parallel computing systems and achieve – High performance and energy efficiency – Functionality and maintainability – Scalability across future generations • Technical subjects – Principles and patterns of parallel algorithms – Programming API, tools and techniques
  • 3. Tentative Schedule – Introduction – GPU Computing and CUDA Intro – CUDA threading model – CUDA memory model – CUDA performance – Floating Point Considerations – Application Case Study
  • 4. Recommended Textbook/Notes • D. Kirk and W. Hwu, “Programming Massively Parallel Processors – A Hands-on Approach,” • http://www.nvidia.com/(Communities CUDA Zone)
  • 5. • Would you rather plow a field with two strong oxen or 1024 chickens??
  • 6. How to Dig a Hole Faster?? 1. Dig faster 2. Buy a more productive shovel 3. Hire more diggers → best approach Problems: 1. How do we manage them? 2. Will they get in each other’s way? 3. Will more diggers dig the hole deeper instead of just wider? 1. Dig faster: the processor runs with a faster clock to spend less time on each step of a computation (limit: power consumption on a chip — increasing clock speed increases power consumption) 2. Buy a more productive shovel: the processor does more work on each clock cycle (how much instruction-level parallelism per clock cycle) 3. Hire more diggers → best approach
  • 7. Parallelism • Solve large problems by breaking them into small pieces • Then run the smaller pieces at the same time Modern GPU • 1000’s of ALUs • 100’s of processors • Tens of thousands of concurrent threads • Modern GPU – Ex: GeForce GTX Titan X CUDA cores: 3072 8 billion transistors 12 GB GDDR5 memory Memory bandwidth: 336 GB/s 65,000 concurrent threads
  • 8. Feature size of Processors over time As feature size decrease Transistors • get smaller • run faster • use less power • put more of them on a chip
  • 9. • As transistors improved, processor designers would then increase clock rates of processors, running them faster and faster every year
  • 10. • Why don’t we keep increasing clock speed? • Have transistors stopped getting smaller + faster? – Problem: heat • Even though transistors are continuing to get smaller and faster and consume less energy per transistor, the problem is that running billions of transistors generates a lot of heat, and we cannot keep all these processors cool • We cannot make a single processor faster and faster (processors that we can’t keep cool) • Processor designers – Smaller, more power-efficient processors – A larger number of efficient processors (rather than fewer, faster, less efficient processors) • What kind of processors do we build? • CPU – Complex control hardware – Flexibility in performance – Expensive in terms of power • GPU – Simpler control hardware – More hardware for computation – Potentially more power efficient – More restrictive programming model
  • 11. Latency vs Throughput • Latency – amount of time to complete a task (time, seconds) • Throughput – tasks completed per unit time (jobs/hour) Your goals are not aligned with the post office’s goals Your goal: optimize for latency (you want to spend as little time as possible) Post office: optimize for throughput (the number of customers they serve per day) CPU: optimized for latency (minimize the elapsed time of one particular task) GPU: chooses to optimize for throughput
  • 12. Bandwidth • How fast a device can send data over a single cable
  • 13. Bandwidth vs Throughput vs Latency – Bandwidth is the maximum amount of data that can travel through a 'channel'. – Throughput is how much data actually does travel through the 'channel' successfully. – Latency is a function of how long it takes the data to get sent all the way from the start point to the end
  • 14. Latency vs Bandwidth • Drive from Colombo to Kandy (100 km) – Car (5 people, 60 km/h) – Bus (60 people, 20 km/h) • Calculate – Latency? – Throughput?
  • 15. GPUs from the point of view of the software developer? • The importance of programming in parallel – 8-core Ivy Bridge processor (Intel) – 8-wide AVX vector operations/core – 2 threads/core (Hyper-Threading) 128-way parallelism On this processor, if you run a completely serial C program with no parallelism at all, you are going to use less than 1% of the capability of this machine.
  • 16. Introduction • Microprocessors based on a single CPU drove rapid performance increases and cost reductions in computer applications for more than two decades. – Users demand even more improvements once they become accustomed to them, creating a positive cycle for the computer industry. • This drive has slowed since 2003 due to power-consumption issues that limit the increase of the clock frequency and the level of productive activity that can be performed in each clock period within a single CPU. – All microprocessor vendors have switched to multi-core and many-core models, where multiple processing units are used in each chip to increase processing power.
  • 17. • The vast majority of software applications are written as sequential programs – The expectation is that programs run faster with each new generation of microprocessors. This is no longer valid from this day onward. – No performance improvement – Reducing the growth opportunities of the computer industry. • Software applications will continue to enjoy performance improvements as parallel programs, in which multiple threads of execution cooperate to achieve the functionality faster.
  • 18. • Parallel programming is by no means new – The HPC community has been developing parallel programs for decades. – But these programs ran on large-scale, expensive computers, and only a few elite applications justified those costs, in practice limiting parallel programming to a small number of application developers. • Now that all new microprocessors are parallel computers, the number of applications that need to be developed as parallel programs has increased.
  • 19. GPUs as Parallel Computers • Since 2003 a class of many-core processors called GPUs has led the race for floating-point performance. While the performance of general-purpose microprocessors has slowed, GPUs have continued to improve. Many application developers are motivated to move the computationally intensive parts of their software to the GPU for execution.
  • 20. Why Is There This Large Gap? • The answer lies in the differences in the fundamental design philosophies between the two types of processors: latency-oriented cores vs throughput-oriented cores
  • 21. CPU: Latency Oriented Design • CPU is optimized for sequential code performance • Large caches – Convert long latency memory accesses to short latency cache accesses • Sophisticated control – Branch prediction for reduced branch latency – Data forwarding for reduced data latency • Powerful ALU – Reduced operation latency
  • 22. • The GPU is optimized for the execution of a massive number of threads. • Small caches – To boost memory throughput • Simple control – No branch prediction – No data forwarding • Energy-efficient ALUs – Many, long-latency but heavily pipelined for high throughput • Requires a massive number of threads to tolerate latencies GPU: Throughput Oriented Design
  • 23. Winning Applications Use Both CPU and GPU • CPUs for sequential parts where latency matters – CPUs can be 10+X faster than GPUs for sequential code • GPUs for parallel parts where throughput wins – GPUs can be 10+X faster than CPUs for parallel code

Editor's notes

  1. Introduction to Parallel Programming
  2. API: application programming interface
  3. Today the world is moving into parallel computing: from mobile devices to supercomputers. Higher performance than traditional computers.
  4. A few more-powerful processors are like the oxen. Modern computing products are like the chickens: they have 100’s of processing units. Harnessing that power requires a different way of thinking than programming a single processor. In this class I will teach how to program a GPU, and how to think about programming with a parallel lens.
  5. Modern computing products are like chickens: they have hundreds of processors that can each run a piece of your problem in parallel.
  6. Feature size is the minimum size of a transistor or wire on a chip, measured in nanometers.
  7. For many years clock speeds went up; however, over the last decade clock speeds have essentially remained constant.
  8. Answer: No. Power is the most important factor in modern processor design at all scales, from the mobile phone you keep in your pocket to the largest supercomputers.
  9. Ex: a municipal council or government office. This is a very frustrating experience: you wait in lines a lot. This is not necessarily the fault of the post office, though. In computer graphics we care more about pixels per second than the latency of any particular pixel.
  10. Cars can use all the lanes and move along at full speed
  11. Latency lags throughput.
  12. Hyper-Threading is Intel's proprietary simultaneous multithreading (SMT) implementation, used to improve parallelization of computations (doing multiple tasks at once) performed on x86 microprocessors. AVX: Advanced Vector Extensions.
  13. The CPU optimizes the latency of executing instructions. GPUs are designed with throughput in mind: the goal is to maximize the throughput of instructions, executing as many instructions as possible at the same time. CPU: optimize latency. GPU: maximize throughput.
  14. Sophisticated = complex. Several important design techniques, all oriented towards reducing the latency of executing various instructions. Large caches: the goal is to keep as many data elements as possible in the cache, so that most of the time when we need data we find it in the cache instead of going to DRAM (dynamic random access memory), which has long latency; we can get data from the cache very quickly, with short latency. Control mechanisms: Branch prediction logic reduces branch latency. Branches correspond to the decisions we make in the source code, and whenever we make a decision it may take a long time for the data needed for that decision to become available. Branch prediction logic allows the hardware to immediately predict what will execute after the branch instruction, reducing the effective latency of branch instructions. Data forwarding: immediately makes the output result of one arithmetic unit or memory-access unit available to another execution unit. Whenever an execution unit generates a result, it may take some time for the result to go back to the register file and then come out of the register file for use by another operation. With data forwarding we immediately use the result produced by one instruction, so it can be used by another instruction in the very next clock cycle. Powerful ALU: the hardware produces results very quickly. Ex: floating-point multiplication — when we use many transistors to implement a multiplier array, we can produce multiplication results in a very small number of clock cycles, whereas with less hardware we may have to go through a long sequence of clock cycles before producing the floating-point multiplication result.
  15. Small caches: these caches are not meant to keep data around for a long time to reduce latency; rather, they are designed as staging/consolidation stations, where the memory accesses made by a large number of parallel threads can be consolidated into fewer memory accesses, reducing the pressure (traffic) on the DRAM system. Simple control: so these instructions take a long time to execute.
  16. From mobile phones to supercomputers.