2. Goals
• How to program heterogeneous parallel
computing systems and achieve
– High performance and energy efficiency
– Functionality and maintainability
– Scalability across future generations
• Technical subjects
– Principles and patterns of parallel algorithms
– Programming APIs, tools, and techniques
3. Tentative Schedule
– Introduction
– GPU Computing and CUDA Intro
– CUDA threading model
– CUDA memory model
– CUDA performance
– Floating Point Considerations
– Application Case Study
4. Recommended Textbook/Notes
• D. Kirk and W. Hwu, “Programming Massively
Parallel Processors – A Hands-on Approach,”
• http://www.nvidia.com/ (Communities > CUDA Zone)
5. • Would you rather plow a field with two strong
oxen or 1024 chickens??
6. How to Dig a Hole Faster??
1. Dig faster
2. Buy a more productive shovel
3. Hire more diggers (the best approach)
Problems:
1. How do we manage them?
2. Will they get in each other’s way?
3. Will more diggers help dig the hole deeper, instead of just wider?
1. Dig faster: the processor runs with a faster clock, spending a shorter amount of time on each step of a computation (limit: power consumption on the chip; increasing clock speed increases power consumption).
2. Buy a more productive shovel: the processor does more work on each clock cycle (how much instruction-level parallelism there is per clock cycle).
3. Hire more diggers: the best approach.
7. Parallelism
• Solve large problems by breaking them into small pieces
• Then run the smaller pieces at the same time (see the sketch below)
Modern GPU
• 1000s of ALUs
• 100s of processors
• Tens of thousands of concurrent threads
• Example: GeForce GTX Titan X
– CUDA cores: 3072
– 8 billion transistors
– 12 GB GDDR5 memory
– Memory bandwidth: 336 GB/s
– 65,000 concurrent threads
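As a concrete illustration of breaking a large problem into small pieces that run at the same time, here is a minimal CUDA vector-addition sketch (the kernel name, array size, and launch configuration are illustrative assumptions, not from the slides):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread handles one "small piece": a single element of the sum.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;                 // one million elements
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;                     // threads per block
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(a, b, c, n);   // thousands of threads at once
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);           // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Each thread computes one element, so a million-element problem becomes a million tiny, independent pieces.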
8. Feature Size of Processors over Time
As feature size decreases, transistors:
• get smaller
• run faster
• use less power
• and more of them fit on a chip
9. • As transistors improved, processor designers would increase the clock rates of processors, running them faster and faster every year.
10. • Why don’t we keep increasing clock speed?
• Have transistors stopped getting smaller and faster?
– Problem: heat
• Even though transistors continue to get smaller, faster, and more energy efficient per transistor, running billions of transistors generates a lot of heat, and we cannot keep all of these processors cool.
• We cannot keep making a single processor faster and faster (we end up with processors we cannot keep cool).
• Processor designers instead build
– Smaller, more power-efficient processors
– A larger number of efficient processors
(rather than faster, less efficient processors)
• What kind of processors do we build?
• CPU
– Complex control hardware
– Flexibility in performance
– Expensive in terms of power
• GPU
– Simpler control hardware
– More hardware for computation
– Potentially more power efficient
– More restrictive programming model
11. Latency vs Throughput
• Latency: the amount of time to complete a task (time, e.g. seconds)
• Throughput: tasks completed per unit time (e.g. jobs/hour)
Your goals are not aligned with the post office's goals.
Your goal: optimize for latency (you want to spend as little time as possible)
Post office: optimize for throughput (the number of customers served per day)
CPU: optimizes for latency (minimizes the time elapsed for one particular task)
GPU: chooses to optimize for throughput
13. Bandwidth vs Throughput vs Latency
– Bandwidth is the maximum amount of data that
can travel through a 'channel'.
– Throughput is how much data actually does travel
through the 'channel' successfully.
– Latency is how long it takes data to travel all the way from the start point to the end point.
14. Latency vs Bandwidth
• Drive from Colombo to Kandy (100 km)
– Car (5 people, 60 km/h)
– Bus (60 people, 20 km/h)
• Calculate
– Latency?
– Throughput?
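A worked solution (treating latency as total trip time and throughput as people delivered per hour):
– Car: latency = 100 km ÷ 60 km/h ≈ 1.67 h; throughput = 5 people ÷ 1.67 h = 3 people/hour
– Bus: latency = 100 km ÷ 20 km/h = 5 h; throughput = 60 people ÷ 5 h = 12 people/hour
The car wins on latency, the bus on throughput: the same trade-off CPUs and GPUs make.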
15. GPUs from the Point of View of the Software Developer?
• The importance of programming in parallel:
– 8-core Ivy Bridge processor (Intel)
– 8-wide AVX vector operations per core
– 2 threads per core (Hyper-Threading)
8 cores × 8 lanes × 2 threads = 128-way parallelism
If you run a completely serial C program with no parallelism at all on this processor, you are going to use less than 1% of the capability of this machine.
16. Introduction
• Microprocessors based on a single CPU drove rapid performance increases and cost reductions in computer applications for more than two decades.
– Users demand even more improvements once they become accustomed to them, creating a positive cycle for the computer industry.
• This drive has slowed since 2003 due to power consumption issues that limited the increase of clock frequency and the level of productive activities that can be performed in each clock period within a single CPU.
– All microprocessor vendors have switched to multi-core and many-core models, where multiple processing units are used in each chip to increase processing power.
17. • The vast majority of software applications are written as sequential programs.
– The expectation is that programs run faster with each new generation of microprocessors. This is no longer valid from this day onward:
– No performance improvement
– Reducing the growth opportunities of the computer industry.
• Software applications will continue to enjoy performance improvements as parallel programs, in which multiple threads of execution cooperate to achieve the functionality faster.
18. • Parallel programming is by no means new.
– The HPC community has been developing parallel programs for decades.
– But these programs ran on large-scale, expensive computers, and only a few elite applications justified those costs, in practice limiting parallel programming to a small number of application developers.
• Now that all new microprocessors are parallel
computers, the number of applications that need
to be developed as parallel programs has
increased.
19. GPUs as Parallel Computers
• Since 2003, a class of many-core processors called GPUs has led the race for floating-point performance.
• While the performance of general-purpose microprocessors has slowed, GPUs have continued to improve.
Many application developers are motivated to move the computationally intensive parts of their software to the GPU for execution.
20. Why Is There This Large Gap?
• The answer lies in the differences in the fundamental design philosophies between the two types of processors:
– CPU: latency-oriented cores
– GPU: throughput-oriented cores
21. CPU: Latency Oriented Design
• CPU is optimized for sequential code
performance
• Large caches
– Convert long latency memory accesses to short
latency cache accesses
• Sophisticated control
– Branch prediction for reduced branch latency
– Data forwarding for reduced data latency
• Powerful ALU
– Reduced operation latency
22. GPU: Throughput Oriented Design
• The GPU is optimized for the execution of a massive number of threads.
• Small caches
– To boost memory throughput
• Simple control
– No branch prediction
– No data forwarding
• Energy-efficient ALUs
– Many; long latency, but heavily pipelined for high throughput
• Requires a massive number of threads to tolerate latencies (see the sketch below)
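A minimal sketch of this throughput-oriented philosophy (the kernel, array size, and launch configuration are assumed for illustration): the launch creates far more threads than there are physical cores, so while some threads stall on long-latency memory accesses, the hardware switches to others that are ready.

```cuda
#include <cuda_runtime.h>

// Each thread issues a long-latency global-memory load; the GPU hides
// that latency by keeping many other threads ready to run.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;   // load from DRAM, multiply, store back
}

int main() {
    const int n = 1 << 22;               // ~4 million elements (assumed size)
    float *d;
    cudaMallocManaged(&d, n * sizeof(float));
    for (int i = 0; i < n; ++i) d[i] = 1.0f;
    // Launch tens of thousands of threads, far more than physical cores,
    // so stalled threads can be swapped for ready ones.
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```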
23. Winning Applications Use Both CPU
and GPU
• CPUs for sequential parts where latency
matters
– CPUs can be 10+X faster than GPUs for sequential
code
• GPUs for parallel parts where throughput wins
– GPUs can be 10+X faster than CPUs for parallel
code
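A hedged sketch of that division of labor (function names and sizes are illustrative assumptions): the sequential, latency-sensitive setup runs on the CPU, and the throughput-bound loop is offloaded to the GPU.

```cuda
#include <cmath>
#include <cuda_runtime.h>

__global__ void saxpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];   // parallel part: throughput wins
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));

    // Sequential part on the CPU: setup where latency matters.
    for (int i = 0; i < n; ++i) { x[i] = std::sin((float)i); y[i] = 1.0f; }

    // Parallel part on the GPU.
    saxpy<<<(n + 255) / 256, 256>>>(2.0f, x, y, n);
    cudaDeviceSynchronize();

    cudaFree(x); cudaFree(y);
    return 0;
}
```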
Today the world is moving to parallel computing, from mobile devices to supercomputers, with higher performance than traditional computers.
A few very powerful processors are like the oxen.
Modern computing products are like the chickens: they have hundreds of processing units that can each run a piece of your problem in parallel, and harnessing that power requires a different way of thinking than programming a single processor.
In this class I will teach you how to program a GPU, and how to think about programming through a parallel lens.
Feature size: the minimum size of a transistor or wire on a chip, measured in nanometers.
For many years clock speeds went up; over the last decade, however, clock speeds have essentially remained constant.
Answer?? No: transistors have not stopped getting smaller and faster.
Power is the most important factor in modern processor design at all scales, from the mobile phone you keep in your pocket to the largest supercomputers.
Ex: a municipal council or government office. This is a very frustrating experience: you wait in lines a lot. This is not necessarily the post office's fault, though.
In computer graphics we care more about pixels per second than about the latency of any particular pixel.
Cars can use all the lanes and move along at full speed.
Latency lags throughput.
Hyper-Threading is Intel's proprietary simultaneous multithreading (SMT) implementation, used to improve parallelization of computations (doing multiple tasks at once) on x86 microprocessors.
AVX: Advanced Vector Extensions.
CPU: optimized for the latency of executing instructions.
GPU: designed with throughput in mind; the goal is to maximize the throughput of instructions, executing as many instructions as possible at the same time.
In short: the CPU optimizes latency; the GPU maximizes throughput.
"Sophisticated" means complex: the CPU uses several important design techniques, all oriented towards reducing the latency of executing various instructions.
Large caches: the goal is to keep as many data elements as possible in the cache, so that much of the time, when we need data, we find it in the cache instead of going to DRAM (dynamic random access memory), which has long latency. We can get data from the cache very quickly, with short latency.
Control mechanisms:
Branch prediction logic reduces branch latency. Branches correspond to the decisions we make in the source code, and whenever we make a decision it may take a long time for the data needed for that decision to become available. Branch prediction logic allows the hardware to immediately predict what will execute after the branch instruction, reducing the effective latency of branch instructions.
Data forwarding immediately makes the output result of one arithmetic or memory access unit available to another execution unit. Whenever an execution unit generates a result, it may take some time for that result to go back to the register file and then come out of the register file for use by another operation. With data forwarding, the result produced by one instruction can be used by another instruction in the immediately following clock cycle.
Powerful ALUs: the hardware produces results very quickly. For example, in floating-point multiplication, if we use a lot of transistors to implement a multiplier array, we can produce the multiplication result in a very small number of clock cycles; if we use less hardware, we may have to go through a long sequence of clock cycles before we can produce the floating-point multiplication result.
Small caches: these caches are not meant to keep data cached for a long time to reduce latency; rather, they are designed as staging and consolidation stations, where the memory accesses made by a large number of parallel threads can be consolidated into fewer memory accesses, reducing the pressure and traffic on the DRAM system.
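To make the consolidation idea concrete, here is a hedged CUDA sketch (kernel names, sizes, and the stride are illustrative assumptions): in the first kernel, consecutive threads touch consecutive addresses, so the hardware can merge a warp's accesses into a few DRAM transactions; in the second, a strided pattern defeats that consolidation and multiplies DRAM traffic.

```cuda
#include <cuda_runtime.h>

// Coalesced: thread i reads element i; a warp's 32 accesses fall in one
// contiguous region and can be consolidated into a few DRAM transactions.
__global__ void coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread i reads element i*stride; the warp's accesses scatter
// across memory and cannot be merged, increasing DRAM traffic.
__global__ void strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    coalesced<<<(n + 255) / 256, 256>>>(in, out, n);      // fast pattern
    strided<<<(n + 255) / 256, 256>>>(in, out, n, 32);    // slow pattern
    cudaDeviceSynchronize();
    cudaFree(in); cudaFree(out);
    return 0;
}
```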
Simple control: without branch prediction and data forwarding, individual instructions take a long time to execute; the GPU instead relies on a massive number of threads to tolerate those latencies.