IEEE Hot Chips 2005
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelism than we can handle?
David B. Kirk
Outline
Types of Parallelism
Processor / System Parallelism
Application / Problem Parallelism
CPU Multi-core Roadmaps
GPU Evolution and Performance Growth
Programming Models
Mapping Application Parallelism onto Processors
The Problem of Parallelism
Processor / System Parallelism
Single- vs. Multi-core
Fine- vs. Coarse-grained
Single- vs. Multi-pipeline
Vector vs. Scalar math
Data- vs. Thread- vs. Instruction-level parallel
Not mutually exclusive
Single- vs. Multi-threaded processor
Message-passing vs. Shared Memory communication
SISD, SIMD, MIMD…
Tightly- vs. Loosely-connected cores & threads
Application / Problem Parallelism
If there is no parallelism in the workload, processor/system parallelism doesn’t matter!
Large problems can (often) be more easily parallelized
“Good” parallel behavior
Many inputs/results
Parallel structure – many similar computation paths
Little interaction between data/threads
Data parallelism easy to map to machine “automagically”
Task parallelism requires programmer forethought
General Purpose Processors
Single-core…
Dual-core…
Multi-core
Limited multi-threading
No special support for parallel programming
Coarse-grained
Thread-based
Requires programmer awareness
Processor die photo courtesy of AMD
The Move to Intel Multi-core
[Roadmap chart. Columns: Current, 2005, 2006, 2007. Legend: Single core, Dual-core, >= 2 cores.]
Rows (Platform): Desktop Client, Mobile Client, MP Server, DP Server / WS, Itanium® processor MP, Itanium® processor DP
Products and code names shown: Pentium® 4 processor, Cedar Mill, Presler, Pentium D Processor, Pentium® Processor Extreme Edition, Pentium® M processor, Yonah, Merom, Conroe, Woodcrest, Whitefield, 64-bit Intel® Xeon™ processor MP, Intel® Xeon™ Processor MP, Paxville, Tulsa, Dempsey, 64-bit Intel® Xeon™ Processor w/ 2MB cache, Itanium® 2 Processor, Itanium® 2 Processor - 3M (Fanwood), Montecito, Montvale, Millington, DP Montvale, Dimona, Tukwila, Poulson
All products and dates are preliminary and subject to change without notice.
Source: Intel Corporation
CPU Approach to Parallelism
Multi-core
Limited multi-threading
Coarse-grained
Scalar (usually)
Explicitly Threaded
Application parallelism must exist at coarse scale
Programmer writes independent program thread for each core
Multi-Core Application Improvement:
General Purpose CPU
Single-thread Applications:
0% speedup
Multi-thread Applications:
Up to X times speedup, where X is # of cores * hardware multi-threading
Potential for cache problems / data sharing & migration
% of Applications that are Multi-threaded
Essentially 0 
But… HUGE potential 
Multi-threaded OS’s will benefit
Just add software for apps…
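As a rough illustration of the bound stated on this slide, the sketch below returns the best-case speedup factor: 1x (that is, 0% speedup) for a single-threaded application, and cores times hardware threads per core for a multi-threaded one. The function name and its parameters are illustrative assumptions, not anything from the talk, and the result is an upper bound only; the cache and data-sharing problems the slide mentions are not modeled.

```python
def max_speedup(cores: int, hw_threads_per_core: int, multithreaded: bool) -> int:
    """Best-case speedup factor per the slide's rule of thumb (hypothetical helper)."""
    if cores < 1 or hw_threads_per_core < 1:
        raise ValueError("cores and hw_threads_per_core must be >= 1")
    # A single-threaded application cannot use extra cores or hardware threads.
    if not multithreaded:
        return 1
    # Upper bound only: ignores cache contention, data sharing and migration.
    return cores * hw_threads_per_core


if __name__ == "__main__":
    # Example: a dual-core part with two hardware threads per core (assumed values).
    print(max_speedup(2, 2, multithreaded=False))
    print(max_speedup(2, 2, multithreaded=True))
```

Real applications would normally land below this bound, since only part of a program is usually parallel.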
Special-purpose Parallel CPUs
Cell Processor die photo courtesy of IBM
Cell Processor
What is the programming model?
Can you expect programmers to explicitly schedule work / data flow?
Cell Processor Approach to Parallelism
Multi-core
Coarse-grained
Vector
Explicitly Threaded
Programmer writes independent program thread code for each core
Explicit Data Sharing
Programmer must copy or DMA data between cores
Multi-Core Application Improvement:
Special Purpose CPU (Cell)
Single-thread Applications:
0% speedup
Multi-thread Applications:
Up to X times speedup, where X is # of cores
Up to Y times speedup, where Y is vector width
Explicit software management of cache / data sharing & migration
% of Applications that are Multi-threaded
Essentially 100% (all applications are written custom)
But… HUGE software development effort
GeForce 7800 GTX:
Most Capable Graphics
Processor Ever Built
302M Transistors
+ XBOX GPU (60M)
+ PS2 Graphics Synthesizer (43M)
+ Game Cube Flipper (51M)
+ Game Cube Gekko (21M)
+ XBOX Pentium3 CPU (9M)
+ PS2 Emotion Engine (10.5M)
+ Athlon FX 55 (105.9M)
300.4M
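The comparison above is simple addition, and a short sketch makes it easy to check. The figures are the ones listed on the slide, in millions of transistors; the dictionary layout and variable names are illustrative assumptions.

```python
# Transistor counts in millions, exactly as listed on the slide.
components = {
    "XBOX GPU": 60.0,
    "PS2 Graphics Synthesizer": 43.0,
    "Game Cube Flipper": 51.0,
    "Game Cube Gekko": 21.0,
    "XBOX Pentium3 CPU": 9.0,
    "PS2 Emotion Engine": 10.5,
    "Athlon FX 55": 105.9,
}

GEFORCE_7800_GTX = 302.0  # millions of transistors, from the slide

# Sum of all seven chips, to compare against the single GPU.
total = sum(components.values())

if __name__ == "__main__":
    print(f"Seven chips combined: {total:.1f}M")
    print(f"GeForce 7800 GTX:     {GEFORCE_7800_GTX:.1f}M")
```

The combined total comes out just below the 302M quoted for the GeForce 7800 GTX, which is the point the slide is making.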
The Life of a Triangle
Host / Front End / Vertex Fetch: process commands, convert to FP
Vertex Processing: transform vertices to screen-space
Primitive Assembly, Setup: generate per-triangle equations
Rasterize & Zcull: generate pixels, delete pixels that cannot be seen
Pixel Shader, Register Combiners: determine the colors, transparencies and depth of the pixel
Pixel Engines (ROP): do final hidden surface test, blend and write out color and new depth
[Pipeline diagram; also shows Texture and FrameBufferController blocks]
GeForce 7800 Parallelism
[Block diagram. Blocks: Host / FW / VTF; vertex units; Cull / Clip / Setup; Z-Cull; Shader Instruction Dispatch; pixel shading engines, organized in quads; L2 Tex; Fragment Crossbar; ROP units; four Memory Partitions, each with DRAM(s)]
Detail of a single pixel shader pipeline
[Block diagram. Blocks: Input Fragment Data; Texture Data; L2 Texture Cache; L1 Texture Cache; FP Texture Processor; FP32 Shader Unit 1; FP32 Shader Unit 2; two Mini-ALUs; Branch Processor; Fog ALU; Output Shaded Fragments]
SIMD Architecture
Dual Issue / Co-Issue
FP32 Computation
“Unlimited” Program Length
Arithmetic Density
Delivered 32-bit Floating Point performance
Big GFlop #’s are nice…
…but what can you actually measure from a basic test program?
GPU                 Clock (MHz)  vec4 MAD Ginstructions  Gflops
GeForce 7800 GTX    430          20.6331                 165.0648
GeForce 6800 Ultra  425          6.7568                  54.0544
Using a test pixel shader program that simply measures how many 4-component MAD instructions can be executed per second.
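The Gflops figures on this slide follow from the measured instruction rates if each 4-component MAD is counted as 8 floating-point operations (4 multiplies plus 4 adds). The sketch below reproduces that arithmetic and also derives MADs per clock; treating the clock values as MHz is an assumption, and the function and variable names are illustrative.

```python
FLOPS_PER_VEC4_MAD = 4 * 2  # 4 components, each a multiply plus an add

# name: (clock in MHz [assumed unit], measured vec4 MAD Ginstructions per second)
measurements = {
    "GeForce 7800 GTX": (430, 20.6331),
    "GeForce 6800 Ultra": (425, 6.7568),
}


def gflops(ginstructions: float) -> float:
    """Convert vec4 MAD Ginstructions/s into Gflops."""
    return ginstructions * FLOPS_PER_VEC4_MAD


def mads_per_clock(clock_mhz: float, ginstructions: float) -> float:
    """Average number of vec4 MADs retired per clock cycle."""
    return (ginstructions * 1e9) / (clock_mhz * 1e6)


if __name__ == "__main__":
    for name, (clock_mhz, ginstr) in measurements.items():
        print(name, round(gflops(ginstr), 4), round(mads_per_clock(clock_mhz, ginstr), 1))
```

Under those assumptions the two parts retire roughly 48 and 16 vec4 MADs per clock respectively, which is one way to see where the gap in delivered Gflops comes from at nearly identical clock rates.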
GPU Approach to Parallelism
Single-core
Multi-pipeline
Multi-threaded
Fine-grained
Vector
Explicitly and Implicitly Threaded
Programmer writes sequential program thread code for shader processors
Thread instances are spawned automatically
Data Parallel
Threads don’t communicate, except at start/finish
Multi-Pipeline Application
Improvement: GPU
Multi-thread Applications:
Up to X times speedup, where X is # of pipelines
Exploits x4 vector FP MADs
Very little software management of cache / data sharing & migration
% of Applications that are Multi-threaded
Essentially 100% (all applications are written custom)
Again… HUGE software development effort
Limited by CPU throughput
SLI – Dual GPUs in a Single System
CPU Limitedness – Exploitation of GPU Parallelism limited by CPU
[Chart. Vertical axis: 0 to 1.2. Settings: 10x7, 10x7 4x/8x, 16x12, 16x12 4x/8x.]
Series: HL2 Prison 05rev7, HL2 Coast 05rev7, UT AS-Convoy botmatch, COD, RTCW, Far Cry Research, Halo, Doom3, Counter Strike Source, NBA Live, Vampire Masquerade, World of Warcraft, Raw bandwidth, Raw Pixel
System: P4 3.7 GHz, 1 GB RAM
Game Performance Benefits of Dual-core vs. GPU Upgrade
[Bar chart. Horizontal axis: 0% to 500%.]
Baseline: Single Core 3.0 GHz P4 CPU + GeForce 6600
Configurations compared: 3.2GHz Single Core, GeForce 6600 GT, 3.4GHz Single Core, GeForce 6800 GT, 3.2GHz EE Dual Core, GeForce 7800 GTX SLI
Upgrade cost labels: $900 - $1100, $50 - $100, $25 - $50
Performance improvement based on Doom3, 1024x768x32, 4x AA, 8x Aniso filtering
GPU Programming Languages
DX9 (DirectX)
assembly coding for vertex and pixel
HLSL (High-Level Shading Language)
OpenGL 1.3+
assembly coding for vertex and pixel
GLSLang
Cg
Brook for GPUs (Stanford)
HLL for GPGPU layered on DX/OGL
http://graphics.stanford.edu/projects/brookgpu/
SH for GPUs (Waterloo)
Metaprogramming Language for GPGPU
http://libsh.sourceforge.net/docs.html
Others...
Importance of Data Parallelism for GPUs
GPUs are designed for graphics
Highly parallel tasks
GPUs process independent vertices & fragments
Temporary registers are zeroed
No shared or static data
No read-modify-write buffers
Opportunity exists when # of independent results is large
Data-parallel processing
GPU architecture is ALU-heavy
Multiple vertex & pixel pipelines, multiple ALUs per pipe
Large memory latency, but HUGE memory bandwidth
Hide memory latency (with more computation)
Language Support for Parallelism
Most widely-used programming languages are
terrible at exposing potential parallelism
for (int i = 0; i < 100; i++) {
    // compute i'th element ...
}
LISP and other functional languages are marginally
better
(+ (func 1 4) (func 23 9) (func 6 2))
Some direct support: Fortran 90, HPF
Not in common use for development anymore
Research in true parallel languages has stagnated
Early 1990’s: lots of research in C*, DPCE, etc.
Late 1990’s on: research moved to JVM, managed code, etc
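The for loop on this slide imposes an order on iterations even when each one is independent. A minimal sketch of the alternative, written here in Python rather than the slide's C-style syntax, expresses the same work as a map over indices so that a runtime is free to schedule the elements in any order. `compute_element` is a hypothetical stand-in for "compute i'th element", and the worker count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_element(i: int) -> int:
    """Hypothetical stand-in for 'compute i'th element'; depends only on i."""
    return i * i


def compute_serial(n: int) -> list[int]:
    # The loop form: iteration order is fixed, independence is implicit.
    results = []
    for i in range(n):
        results.append(compute_element(i))
    return results


def compute_parallel(n: int, workers: int = 4) -> list[int]:
    # The map form: each element is declared independent of the others.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, whatever order they ran in.
        return list(pool.map(compute_element, range(n)))


if __name__ == "__main__":
    print(compute_parallel(100) == compute_serial(100))
```

In standard CPython builds, threads will not speed up CPU-bound work like this; the sketch is about how independence is expressed, not about measured performance.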
Parallel Programming: Not just for GPUs
CPUs benefit, too
SSE, SSE2, MMX, etc.
Hyperthreading
Multi-core processors announced from Intel, AMD, etc.
PlayStation 2 Emotion Engine
MPI, OpenMP packages for coarse grain communication
Efficient execution of:
Parallel code on a serial processor: EASY
Serial code on a parallel processor: HARD
The impact of power consumption further justifies
more research in this area – parallelism is the future
PC Graphics growth (225%/yr)
Sustainable Growth on Capability Curve
Season            tech   #trans  Gflop*  Mpix     Mpoly  Mvector
...
spring/00         0.18   25M     35      800      30M    50M
fall/00           0.15   50M     150     1.2G     75M    100M
spring/01         0.15   55M     180     2.0G     100M   200M
fall/01           0.13   100M    280     4.0G     200M   400M
...
spring/03 (NV30)  0.13   140M    500     8.0G**   300M   300M
spring/04 (NV40)  0.13   220M    1000    25.6G**  600M   600M
* Special purpose math, not all general purpose programmable math
** Samples (multiple color values within a pixel, for smooth edges)
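The slide does not say how the 225%/yr figure was derived. One plausible reading, sketched below, is a compound annual growth factor computed from the first and last rows of the table; the four-year span (spring/00 to spring/04) and the helper name are assumptions made for illustration.

```python
def annual_growth_factor(start: float, end: float, years: float) -> float:
    """Compound annual growth factor between two values (hypothetical helper)."""
    if start <= 0 or end <= 0 or years <= 0:
        raise ValueError("start, end and years must be positive")
    return (end / start) ** (1.0 / years)


YEARS = 4  # spring/00 to spring/04, assumed to be four full years

# First and last table rows: Gflop column and transistor count.
gflop_growth = annual_growth_factor(35, 1000, YEARS)
transistor_growth = annual_growth_factor(25e6, 220e6, YEARS)

if __name__ == "__main__":
    print(f"Gflop growth per year:      {gflop_growth:.2f}x")
    print(f"Transistor growth per year: {transistor_growth:.2f}x")
```

Under these assumptions the Gflop column grows by a little over 2.3x per year, close to the quoted 225%, while transistor count grows noticeably more slowly.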
GPUs Continue to Accelerate
above Moore’s Law, but that’s not all...
As pixel/vertex/triangle growth slows and plateaus...
Other performance factors increase
Number of color samples per pixel (Anti-aliasing)
Number of calculations per pixel/vertex
Flexibility of programming model
Looping, branching, multiple I/O access, multiple math ops/clock
High-level language programmability
Number of “General Purpose” programmable 32bit
Gigaflops per pixel – demand grows without bounds
GPUs become General Purpose parallel processors
What happens if you compare GPUs to microprocessors?
Sustained SP MAD GFLOPS
[Chart. Vertical axis: GFLOPS.]
CPU / GPU Design Strategies /
Tactics
CPU Strategy: Make the workload (one compute
thread) run as fast as possible
Tactics
Caching
Instruction/Data Prefetch
“hyperthreading”
Speculative Execution
Limited by “perimeter” – communication bandwidth
Multi-core will help... a little
GPU Strategy: Make the workload (as many threads
as possible) run as fast as possible
Tactics
Parallelism (1000s of threads)
Pipelining
Limited by “area” – compute capability
Application Matches for GPU:
Any Large-Scale, Parallel, Feed-forward,
Math- and Data-intensive Problem
Real-time Graphics (of course!)
Image Processing and Analysis
Telescope, Surveillance, Sensor Data
Volume Data
Correlation - Radio Telescope, SETI
Monte Carlo Simulation - Neutron Transport
Neural Networks
Speech Recognition
Handwriting Recognition
Ray Tracing
Physical Modeling and Simulation
Video Processing
Example: Black-Scholes options pricing
Widely-used model for pricing call/put options
Implemented in ~15 lines of Cg, using combinations of input parameters as separate simulations (fragments)
Performance:
Fast (~3GHz) P4, good C++: ~3.0 MBSOPS (1X)
Quadro FX 3000, Cg: ~2.8 MBSOPS (~0.9X)
Quadro FX 4400, Cg: ~14.4 MBSOPS (4.8X)
Quadro FX 4400, Cg, 100 runs: ~176.0 MBSOPS (59X)
(removes test/data transfer bandwidth overhead)
How?
CPU: ~11GFLOPS, slow exp(), log(), sqrt(), fast mem access
GPU: ~65GFLOPS, fast exp(), log(), sqrt(), slow mem access
Black-Scholes has high ratio of math to memory access
GPU has Parallelism
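For readers unfamiliar with the model, the sketch below shows the standard closed-form Black-Scholes price for a European call and put. It is a plain Python illustration, not the Cg shader from the talk, and the function and parameter names are assumptions. It shows why the workload suits a GPU: each parameter combination is priced independently, and the work is dominated by exp(), log() and sqrt().

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes(spot: float, strike: float, rate: float,
                  vol: float, years: float) -> tuple[float, float]:
    """Return (call, put) prices for a European option with no dividends."""
    sqrt_t = math.sqrt(years)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * years) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    discount = math.exp(-rate * years)
    call = spot * norm_cdf(d1) - strike * discount * norm_cdf(d2)
    put = strike * discount * norm_cdf(-d2) - spot * norm_cdf(-d1)
    return call, put


def price_all(params: list[tuple[float, float, float, float, float]]) -> list[tuple[float, float]]:
    # Each entry is independent of the others, like one fragment per simulation.
    return [black_scholes(*p) for p in params]


if __name__ == "__main__":
    call, put = black_scholes(100.0, 100.0, 0.05, 0.2, 1.0)
    print(f"call={call:.4f} put={put:.4f}")
```

Because no entry in `price_all` reads another entry's result, the list comprehension could be replaced by any parallel map without changing the answers.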
So, What’s the Problem?
The Good News:
CPUs and GPUs are increasingly parallel
GPUs are already highly parallel
Workloads – Graphics and GP – are highly parallel
Moore’s Law and the “capability curve” continue to be our
friends
The Not-so-Good News:
Parallel programming is hard
Language and Tool support for parallelism is poor
Computer Science Education is not focused on parallel
programming
We Are Approaching a Crisis: a Lack of Programming Skills
Intel, AMD, IBM/Sony/Toshiba (Cell), Sun (Niagara) have all announced Multi- or Many-core roadmaps
NVIDIA and other GPUs are already “Multi-core”
Less of a crisis, due to GPU threaded programming model
Analysts predict > 50% of processors shipped in the next 5 years will have more than one core
Who will program these devices?
How will the value of multi-core and multi-threading be exploited?
Call to Action
Research
Explore new ways of Parallel Programming
Explore new Threading models
Make parallelism easier to express/exploit
Industry (processor vendors)
Make exploitation of Multi-core easier
Explore “transparent” application speedups
Consequences of Failure to Act
Multi-core not valued by market
Questions?

Weitere ähnliche Inhalte

Was ist angesagt?

Intel Knights Landing Slides
Intel Knights Landing SlidesIntel Knights Landing Slides
Intel Knights Landing SlidesRonen Mendezitsky
 
Threading Game Engines: QUAKE 4 & Enemy Territory QUAKE Wars
Threading Game Engines: QUAKE 4 & Enemy Territory QUAKE WarsThreading Game Engines: QUAKE 4 & Enemy Territory QUAKE Wars
Threading Game Engines: QUAKE 4 & Enemy Territory QUAKE Warspsteinb
 
Methods and practices to analyze the performance of your application with Int...
Methods and practices to analyze the performance of your application with Int...Methods and practices to analyze the performance of your application with Int...
Methods and practices to analyze the performance of your application with Int...Intel Software Brasil
 
Create a Scalable and Destructible World in HITMAN 2*
Create a Scalable and Destructible World in HITMAN 2*Create a Scalable and Destructible World in HITMAN 2*
Create a Scalable and Destructible World in HITMAN 2*Intel® Software
 
Forts and Fights Scaling Performance on Unreal Engine*
Forts and Fights Scaling Performance on Unreal Engine*Forts and Fights Scaling Performance on Unreal Engine*
Forts and Fights Scaling Performance on Unreal Engine*Intel® Software
 
Accelerate Game Development and Enhance Game Experience with Intel® Optane™ T...
Accelerate Game Development and Enhance Game Experience with Intel® Optane™ T...Accelerate Game Development and Enhance Game Experience with Intel® Optane™ T...
Accelerate Game Development and Enhance Game Experience with Intel® Optane™ T...Intel® Software
 
The Tofu Interconnect D for the Post K Supercomputer
The Tofu Interconnect D for the Post K SupercomputerThe Tofu Interconnect D for the Post K Supercomputer
The Tofu Interconnect D for the Post K Supercomputerinside-BigData.com
 
Getting Space Pirate Trainer* to Perform on Intel® Graphics
Getting Space Pirate Trainer* to Perform on Intel® GraphicsGetting Space Pirate Trainer* to Perform on Intel® Graphics
Getting Space Pirate Trainer* to Perform on Intel® GraphicsIntel® Software
 
Early Benchmarking Results for Neuromorphic Computing
Early Benchmarking Results for Neuromorphic ComputingEarly Benchmarking Results for Neuromorphic Computing
Early Benchmarking Results for Neuromorphic ComputingDESMOND YUEN
 
Difference between i3 and i5 and i7 and core 2 duo
Difference between i3 and i5 and i7 and core 2 duoDifference between i3 and i5 and i7 and core 2 duo
Difference between i3 and i5 and i7 and core 2 duoShubham Singh
 
Presentation Thesis - Convolutional net on the Xeon Phi using SIMD - Gaurav R...
Presentation Thesis - Convolutional net on the Xeon Phi using SIMD - Gaurav R...Presentation Thesis - Convolutional net on the Xeon Phi using SIMD - Gaurav R...
Presentation Thesis - Convolutional net on the Xeon Phi using SIMD - Gaurav R...Gaurav Raina
 
F9: A Secure and Efficient Microkernel Built for Deeply Embedded Systems
F9: A Secure and Efficient Microkernel Built for Deeply Embedded SystemsF9: A Secure and Efficient Microkernel Built for Deeply Embedded Systems
F9: A Secure and Efficient Microkernel Built for Deeply Embedded SystemsNational Cheng Kung University
 
Thesis Report - Gaurav Raina MSc ES - v2
Thesis Report - Gaurav Raina MSc ES - v2Thesis Report - Gaurav Raina MSc ES - v2
Thesis Report - Gaurav Raina MSc ES - v2Gaurav Raina
 
Industrial trends in heterogeneous and esoteric compute
Industrial trends in heterogeneous and esoteric computeIndustrial trends in heterogeneous and esoteric compute
Industrial trends in heterogeneous and esoteric computePerry Lea
 

Was ist angesagt? (20)

Ibm cell
Ibm cell Ibm cell
Ibm cell
 
Intel Knights Landing Slides
Intel Knights Landing SlidesIntel Knights Landing Slides
Intel Knights Landing Slides
 
Threading Game Engines: QUAKE 4 & Enemy Territory QUAKE Wars
Threading Game Engines: QUAKE 4 & Enemy Territory QUAKE WarsThreading Game Engines: QUAKE 4 & Enemy Territory QUAKE Wars
Threading Game Engines: QUAKE 4 & Enemy Territory QUAKE Wars
 
Methods and practices to analyze the performance of your application with Int...
Methods and practices to analyze the performance of your application with Int...Methods and practices to analyze the performance of your application with Int...
Methods and practices to analyze the performance of your application with Int...
 
Intel python 2017
Intel python 2017Intel python 2017
Intel python 2017
 
Asus Tinker Board
Asus Tinker BoardAsus Tinker Board
Asus Tinker Board
 
Create a Scalable and Destructible World in HITMAN 2*
Create a Scalable and Destructible World in HITMAN 2*Create a Scalable and Destructible World in HITMAN 2*
Create a Scalable and Destructible World in HITMAN 2*
 
Forts and Fights Scaling Performance on Unreal Engine*
Forts and Fights Scaling Performance on Unreal Engine*Forts and Fights Scaling Performance on Unreal Engine*
Forts and Fights Scaling Performance on Unreal Engine*
 
Accelerate Game Development and Enhance Game Experience with Intel® Optane™ T...
Accelerate Game Development and Enhance Game Experience with Intel® Optane™ T...Accelerate Game Development and Enhance Game Experience with Intel® Optane™ T...
Accelerate Game Development and Enhance Game Experience with Intel® Optane™ T...
 
The Tofu Interconnect D for the Post K Supercomputer
The Tofu Interconnect D for the Post K SupercomputerThe Tofu Interconnect D for the Post K Supercomputer
The Tofu Interconnect D for the Post K Supercomputer
 
I3 Vs I5 Vs I7
I3 Vs I5 Vs I7I3 Vs I5 Vs I7
I3 Vs I5 Vs I7
 
Getting Space Pirate Trainer* to Perform on Intel® Graphics
Getting Space Pirate Trainer* to Perform on Intel® GraphicsGetting Space Pirate Trainer* to Perform on Intel® Graphics
Getting Space Pirate Trainer* to Perform on Intel® Graphics
 
Masked Occlusion Culling
Masked Occlusion CullingMasked Occlusion Culling
Masked Occlusion Culling
 
Early Benchmarking Results for Neuromorphic Computing
Early Benchmarking Results for Neuromorphic ComputingEarly Benchmarking Results for Neuromorphic Computing
Early Benchmarking Results for Neuromorphic Computing
 
Implement Runtime Environments for HSA using LLVM
Implement Runtime Environments for HSA using LLVMImplement Runtime Environments for HSA using LLVM
Implement Runtime Environments for HSA using LLVM
 
Difference between i3 and i5 and i7 and core 2 duo
Difference between i3 and i5 and i7 and core 2 duoDifference between i3 and i5 and i7 and core 2 duo
Difference between i3 and i5 and i7 and core 2 duo
 
Presentation Thesis - Convolutional net on the Xeon Phi using SIMD - Gaurav R...
Presentation Thesis - Convolutional net on the Xeon Phi using SIMD - Gaurav R...Presentation Thesis - Convolutional net on the Xeon Phi using SIMD - Gaurav R...
Presentation Thesis - Convolutional net on the Xeon Phi using SIMD - Gaurav R...
 
F9: A Secure and Efficient Microkernel Built for Deeply Embedded Systems
F9: A Secure and Efficient Microkernel Built for Deeply Embedded SystemsF9: A Secure and Efficient Microkernel Built for Deeply Embedded Systems
F9: A Secure and Efficient Microkernel Built for Deeply Embedded Systems
 
Thesis Report - Gaurav Raina MSc ES - v2
Thesis Report - Gaurav Raina MSc ES - v2Thesis Report - Gaurav Raina MSc ES - v2
Thesis Report - Gaurav Raina MSc ES - v2
 
Industrial trends in heterogeneous and esoteric compute
Industrial trends in heterogeneous and esoteric computeIndustrial trends in heterogeneous and esoteric compute
Industrial trends in heterogeneous and esoteric compute
 

Ähnlich wie Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelism than we can handle?

Parallelism Processor Design
Parallelism Processor DesignParallelism Processor Design
Parallelism Processor DesignSri Prasanna
 
Intro to Cell Broadband Engine for HPC
Intro to Cell Broadband Engine for HPCIntro to Cell Broadband Engine for HPC
Intro to Cell Broadband Engine for HPCSlide_N
 
Architecting Solutions for the Manycore Future
Architecting Solutions for the Manycore FutureArchitecting Solutions for the Manycore Future
Architecting Solutions for the Manycore FutureTalbott Crowell
 
Intel new processors
Intel new processorsIntel new processors
Intel new processorszaid_b
 
Power 7 Overview
Power 7 OverviewPower 7 Overview
Power 7 Overviewlambertt
 
Gpu and The Brick Wall
Gpu and The Brick WallGpu and The Brick Wall
Gpu and The Brick Wallugur candan
 
Accelerators: the good, the bad, and the ugly
Accelerators: the good, the bad, and the uglyAccelerators: the good, the bad, and the ugly
Accelerators: the good, the bad, and the uglyIntel IT Center
 
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsHPCC Systems
 
Parallel Futures of a Game Engine (v2.0)
Parallel Futures of a Game Engine (v2.0)Parallel Futures of a Game Engine (v2.0)
Parallel Futures of a Game Engine (v2.0)Johan Andersson
 
Cell Today and Tomorrow - IBM Systems and Technology Group
Cell Today and Tomorrow - IBM Systems and Technology GroupCell Today and Tomorrow - IBM Systems and Technology Group
Cell Today and Tomorrow - IBM Systems and Technology GroupSlide_N
 
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...In-Memory Computing Summit
 
Heterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of SystemsHeterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of SystemsAnand Haridass
 
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Intel® Software
 

Ähnlich wie Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelism than we can handle? (20)

Parallelism Processor Design
Parallelism Processor DesignParallelism Processor Design
Parallelism Processor Design
 
Webinaron muticoreprocessors
Webinaron muticoreprocessorsWebinaron muticoreprocessors
Webinaron muticoreprocessors
 
Intro to Cell Broadband Engine for HPC
Intro to Cell Broadband Engine for HPCIntro to Cell Broadband Engine for HPC
Intro to Cell Broadband Engine for HPC
 
Architecting Solutions for the Manycore Future
Architecting Solutions for the Manycore FutureArchitecting Solutions for the Manycore Future
Architecting Solutions for the Manycore Future
 
Intel new processors
Intel new processorsIntel new processors
Intel new processors
 
No[1][1]
No[1][1]No[1][1]
No[1][1]
 
Open power ddl and lms
Open power ddl and lmsOpen power ddl and lms
Open power ddl and lms
 
Power 7 Overview
Power 7 OverviewPower 7 Overview
Power 7 Overview
 
Gpu and The Brick Wall
Gpu and The Brick WallGpu and The Brick Wall
Gpu and The Brick Wall
 
Accelerators: the good, the bad, and the ugly
Accelerators: the good, the bad, and the uglyAccelerators: the good, the bad, and the ugly
Accelerators: the good, the bad, and the ugly
 
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC Systems
 
Exascale Capabl
Exascale CapablExascale Capabl
Exascale Capabl
 
Parallel Futures of a Game Engine (v2.0)
Parallel Futures of a Game Engine (v2.0)Parallel Futures of a Game Engine (v2.0)
Parallel Futures of a Game Engine (v2.0)
 
Cell Today and Tomorrow - IBM Systems and Technology Group
Cell Today and Tomorrow - IBM Systems and Technology GroupCell Today and Tomorrow - IBM Systems and Technology Group
Cell Today and Tomorrow - IBM Systems and Technology Group
 
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...
 
NWU and HPC
NWU and HPCNWU and HPC
NWU and HPC
 
TensorFlow for HPC?
TensorFlow for HPC?TensorFlow for HPC?
TensorFlow for HPC?
 
Heterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of SystemsHeterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of Systems
 
Current Trends in HPC
Current Trends in HPCCurrent Trends in HPC
Current Trends in HPC
 
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
 

Mehr von Slide_N

SpursEngine A High-performance Stream Processor Derived from Cell/B.E. for Me...
SpursEngine A High-performance Stream Processor Derived from Cell/B.E. for Me...SpursEngine A High-performance Stream Processor Derived from Cell/B.E. for Me...
SpursEngine A High-performance Stream Processor Derived from Cell/B.E. for Me...Slide_N
 
Parallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdf
Parallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdfParallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdf
Parallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdfSlide_N
 
Experiences with PlayStation VR - Sony Interactive Entertainment
Experiences with PlayStation VR  - Sony Interactive EntertainmentExperiences with PlayStation VR  - Sony Interactive Entertainment
Experiences with PlayStation VR - Sony Interactive EntertainmentSlide_N
 
SPU-based Deferred Shading for Battlefield 3 on Playstation 3
SPU-based Deferred Shading for Battlefield 3 on Playstation 3SPU-based Deferred Shading for Battlefield 3 on Playstation 3
SPU-based Deferred Shading for Battlefield 3 on Playstation 3Slide_N
 
Filtering Approaches for Real-Time Anti-Aliasing
Filtering Approaches for Real-Time Anti-AliasingFiltering Approaches for Real-Time Anti-Aliasing
Filtering Approaches for Real-Time Anti-AliasingSlide_N
 
Chip Multiprocessing and the Cell Broadband Engine.pdf
Chip Multiprocessing and the Cell Broadband Engine.pdfChip Multiprocessing and the Cell Broadband Engine.pdf
Chip Multiprocessing and the Cell Broadband Engine.pdfSlide_N
 
New Millennium for Computer Entertainment - Kutaragi
New Millennium for Computer Entertainment - KutaragiNew Millennium for Computer Entertainment - Kutaragi
New Millennium for Computer Entertainment - KutaragiSlide_N
 
Sony Transformation 60 - Kutaragi
Sony Transformation 60 - KutaragiSony Transformation 60 - Kutaragi
Sony Transformation 60 - KutaragiSlide_N
 
Sony Transformation 60
Sony Transformation 60 Sony Transformation 60
Sony Transformation 60 Slide_N
 
Moving Innovative Game Technology from the Lab to the Living Room
Moving Innovative Game Technology from the Lab to the Living RoomMoving Innovative Game Technology from the Lab to the Living Room
Moving Innovative Game Technology from the Lab to the Living RoomSlide_N
 
The Technology behind PlayStation 2
The Technology behind PlayStation 2The Technology behind PlayStation 2
The Technology behind PlayStation 2Slide_N
 
Cell Technology for Graphics and Visualization
Cell Technology for Graphics and VisualizationCell Technology for Graphics and Visualization
Cell Technology for Graphics and VisualizationSlide_N
 
Industry Trends in Microprocessor Design
Industry Trends in Microprocessor DesignIndustry Trends in Microprocessor Design
Industry Trends in Microprocessor DesignSlide_N
 
Translating GPU Binaries to Tiered SIMD Architectures with Ocelot
Translating GPU Binaries to Tiered SIMD Architectures with OcelotTranslating GPU Binaries to Tiered SIMD Architectures with Ocelot
Translating GPU Binaries to Tiered SIMD Architectures with OcelotSlide_N
 
Cellular Neural Networks: Theory
Cellular Neural Networks: TheoryCellular Neural Networks: Theory
Cellular Neural Networks: TheorySlide_N
 
Network Processing on an SPE Core in Cell Broadband EngineTM
Network Processing on an SPE Core in Cell Broadband EngineTMNetwork Processing on an SPE Core in Cell Broadband EngineTM
Network Processing on an SPE Core in Cell Broadband EngineTMSlide_N
 
Deferred Pixel Shading on the PLAYSTATION®3
Deferred Pixel Shading on the PLAYSTATION®3Deferred Pixel Shading on the PLAYSTATION®3
Deferred Pixel Shading on the PLAYSTATION®3Slide_N
 
Developing Technology for Ratchet and Clank Future: Tools of Destruction
Developing Technology for Ratchet and Clank Future: Tools of DestructionDeveloping Technology for Ratchet and Clank Future: Tools of Destruction
Developing Technology for Ratchet and Clank Future: Tools of DestructionSlide_N
 
NVIDIA Tesla Accelerated Computing Platform for IBM Power
NVIDIA Tesla Accelerated Computing Platform for IBM PowerNVIDIA Tesla Accelerated Computing Platform for IBM Power
NVIDIA Tesla Accelerated Computing Platform for IBM PowerSlide_N
 
The Visual Computing Revolution Continues
The Visual Computing Revolution ContinuesThe Visual Computing Revolution Continues
The Visual Computing Revolution ContinuesSlide_N
 

Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelism than we can handle?

  • 1. IEEE Hot Chips 2005 Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelism than we can handle? David B. Kirk
  • 2. IEEE Hot Chips 2005 Outline Types of Parallelism Processor / System Parallelism Application / Problem Parallelism CPU Multi-core Roadmaps GPU Evolution and Performance Growth Programming Models Mapping Application Parallelism onto Processors The Problem of Parallelism
  • 3. IEEE Hot Chips 2005 Processor / System Parallelism Single- vs. Multi- core Fine- vs. Coarse- grained Single- vs. Multi- pipeline Vector vs. Scalar math Data- vs. Thread- vs. Instruction-level- parallel Not mutually exclusive Single- vs. Multi-threaded processor Message-passing vs. Shared Memory communication SISD, SIMD, MIMD… Tightly- vs. Loosely-connected cores & threads
  • 4. IEEE Hot Chips 2005 Application / Problem Parallelism If there is no parallelism in the workload, processor/system parallelism doesn’t matter! Large problems can (often) be more easily parallelized “Good” parallel behavior Many inputs/results Parallel structure – many similar computation paths Little interaction between data/threads Data parallelism easy to map to machine “automagically” Task parallelism requires programmer forethought
  • 5. IEEE Hot Chips 2005 General Purpose Processors Single-core… Dual-core… Multi-core Limited multi-threading No special support for parallel programming Coarse-grained Thread-based Requires programmer awareness Processor die photo courtesy of AMD
  • 6. The Move to Intel Multi-core [roadmap chart, labels as extracted] Timeline: Current, 2005, 2006, 2007. Legend: Single core; Dual-core; >= 2 cores. Labels: Desktop Client; Pentium® 4 processor; Cedar Mill; Presler; Pentium D Processor; Pentium® Processor Extreme Edition; Mobile Client; Pentium® M processor; Yonah; Platform; 64-bit Intel® Xeon™ processor MP; Intel® Xeon™ Processor MP; MP Server; DP Server / WS; Paxville; Tulsa; Dempsey; 64-bit Intel® Xeon™ Processor w/ 2MB cache; Itanium® processor MP; Itanium® 2 Processor; Montecito; Montvale; Tukwila; Poulson; Itanium® 2 Processor - 3M (Fanwood); Itanium® processor DP; Millington; DP Montvale; Dimona; Conroe; Woodcrest; Merom; Whitefield. Source: Intel Corporation. All products and dates are preliminary and subject to change without notice.
  • 7. IEEE Hot Chips 2005 Source: Intel Corporation
  • 8. IEEE Hot Chips 2005 CPU Approach to Parallelism Multi-core Limited multi-threading Coarse-grained Scalar (usually) Explicitly Threaded Application parallelism must exist at coarse scale Programmer writes independent program thread for each core
  • 9. IEEE Hot Chips 2005 Multi-Core Application Improvement: General Purpose CPU Single-thread Applications: 0% speedup Multi-thread Applications: Up to X times speedup, where X is # of cores * hardware multi-threading Potential for cache problems / data sharing & migration % of Applications that are Multi-threaded: Essentially 0 ☹ But… HUGE potential ☺ Multi-threaded OSes will benefit Just add software for apps…
  • 10. IEEE Hot Chips 2005 Special-purpose Parallel CPUs Cell Processor die photo courtesy of IBM
  • 11. IEEE Hot Chips 2005 Cell Processor What is the programming model? Can you expect programmers to explicitly schedule work / data flow?
  • 12. IEEE Hot Chips 2005 Cell Processor Approach to Parallelism Multi-core Coarse-grained Vector Explicitly Threaded Programmer writes independent program thread code for each core Explicit Data Sharing Programmer must copy or DMA data between cores
  • 13. IEEE Hot Chips 2005 Multi-Core Application Improvement: Special Purpose CPU (Cell) Single-thread Applications: 0% speedup Multi-thread Applications: Up to X times speedup, where X is # of cores Up to Y times speedup, where Y is vector width Explicit software management of cache / data sharing & migration % of Applications that are Multi-threaded: Essentially 100% ☺ (all applications are written custom) But… HUGE software development effort ☹
  • 14. IEEE Hot Chips 2005 GeForce 7800 GTX: Most Capable Graphics Processor Ever Built 302M Transistors, versus the combined total of: XBOX GPU (60M) + PS2 Graphics Synthesizer (43M) + Game Cube Flipper (51M) + Game Cube Gekko (21M) + XBOX Pentium3 CPU (9M) + PS2 Emotion Engine (10.5M) + Athlon FX 55 (105.9M) = 300.4M
  • 16. IEEE Hot Chips 2005 The Life of a Triangle Blocks: Host / Front End / Vertex Fetch; Vertex Processing; Primitive Assembly, Setup; Rasterize & Zcull; Pixel Shader; Texture; Register Combiners; Pixel Engines (ROP); FrameBuffer Controller. Stage descriptions, in pipeline order: process commands; convert to FP; transform vertices to screen-space; generate per-triangle equations; generate pixels, delete pixels that cannot be seen; determine the colors, transparencies and depth of the pixel; do final hidden surface test, blend and write out color and new depth
  • 17. IEEE Hot Chips 2005 L2 Tex GeForce 7800 Parallelism Cull / Clip / Setup Shader Instruction Dispatch Fragment Crossbar Memory Partition Memory Partition Memory Partition Memory Partition Z-Cull DRAM(s) DRAM(s) DRAM(s) DRAM(s) Host / FW / VTF vertex units Pixel shading engines, organized in quads ROP units
  • 18. IEEE Hot Chips 2005 Detail of a single pixel shader pipeline FP Texture Processor L1 Texture Cache Branch Processor FP32 Shader Unit 1 FP32 Shader Unit 2 Input Fragment Data Output Shaded Fragments Fog ALU Texture Data L2 Texture Cache SIMD Architecture Dual Issue / Co-Issue FP32 Computation “Unlimited” Program Length Mini-ALU Mini-ALU
  • 19. IEEE Hot Chips 2005 Arithmetic Density Delivered 32-bit Floating Point performance Big GFlop #’s are nice… …but what can you actually measure from a basic test program? GeForce 6800 Ultra: Clock 425 MHz, 6.7568 vec4 MAD Ginstructions, 54.0544 Gflops. GeForce 7800 GTX: Clock 430 MHz, 20.6331 vec4 MAD Ginstructions, 165.0648 Gflops. Using a test pixel shader program that simply measures how many 4-component MAD instructions can be executed per second.
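
The Gflops figures on slide 19 appear to follow directly from the measured instruction rates, if one assumes that a 4-component MAD counts as eight floating-point operations (one multiply and one add per component). That assumption is not stated on the slide; the sketch below applies it to the two measured rates, 6.7568 and 20.6331 vec4 MAD Ginstructions per second. The function name is illustrative only.

```python
# Assumption (not stated on the slide): a vec4 MAD is 4 components,
# each performing one multiply and one add, i.e. 8 flops per instruction.
COMPONENTS_PER_VEC4 = 4
FLOPS_PER_MAD_COMPONENT = 2
FLOPS_PER_VEC4_MAD = COMPONENTS_PER_VEC4 * FLOPS_PER_MAD_COMPONENT


def gflops_from_vec4_mads(ginstructions_per_second):
    """Convert a vec4 MAD rate (in Ginstructions/s) to Gflops."""
    return ginstructions_per_second * FLOPS_PER_VEC4_MAD


# Measured rates as listed on slide 19.
measured = {
    "GeForce 6800 Ultra": 6.7568,
    "GeForce 7800 GTX": 20.6331,
}

if __name__ == "__main__":
    for name, rate in measured.items():
        print(f"{name}: {gflops_from_vec4_mads(rate):.4f} Gflops")
```

Under this assumption the conversion gives 54.0544 and 165.0648 Gflops, matching the slide's Gflops column.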
  • 20. IEEE Hot Chips 2005 GPU Approach to Parallelism Single-core Multi-pipeline Multi-threaded Fine-grained Vector Explicitly and Implicitly Threaded Programmer writes sequential program thread code for shader processors Thread instances are spawned automatically Data Parallel Threads don’t communicate, except at start/finish
  • 21. IEEE Hot Chips 2005 Multi-Pipeline Application Improvement: GPU Multi-thread Applications: Up to X times speedup, where X is # of pipelines Exploits x4 vector FP MADs Very little software management of cache / data sharing & migration % of Applications that are Multi-threaded: Essentially 100% ☺ (all applications are written custom) Again… HUGE software development effort ☹ Limited by CPU throughput ☹
  • 22. IEEE Hot Chips 2005 SLI – Dual GPUs in a Single System
  • 23. IEEE Hot Chips 2005 CPU Limitedness – Exploitation of GPU Parallelism limited by CPU [Chart; axis 0 to 1.2; settings: 10x7, 10x7 4x/8x, 16x12, 16x12 4x/8x; entries: HL2 Prison 05rev7, HL2 Coast 05rev7, UT AS-Convoy botmatch, COD, RTCW, Far Cry Research, Halo, Doom3, Counter Strike Source, NBA Live, Vampire Masquerade, World of Warcraft, Raw bandwidth, Raw Pixel; system: P4 3.7 GHz, 1 GB RAM]
  • 24. IEEE Hot Chips 2005 Game Performance Benefits of Dual-core vs. GPU Upgrade [Chart; axis 0% to 500%] Baseline: Single Core 3.0 GHz P4 CPU + GeForce 6600. Configurations shown: 3.2GHz Single Core; GeForce 6600 GT; 3.4GHz Single Core; GeForce 6800 GT; 3.2GHz EE Dual Core; GeForce 7800 GTX SLI. Upgrade cost labels: $900 - $1100; $50 - $100; $25 - $50. Performance improvement based on Doom3 1024x768x32, 4x AA, 8x Aniso filtering
  • 25. IEEE Hot Chips 2005 GPU Programming Languages DX9 (Direct X) assembly coding for vertex and pixel HLSL (High-Level Shading Language) OpenGL 1.3+ assembly coding for vertex and pixel GLSLang Cg Brook for GPUs (Stanford) HLL for GPGPU layered on DX/OGL http://graphics.stanford.edu/projects/brookgpu/ SH for GPUs (Waterloo) Metaprogramming Language for GPGPU http://libsh.sourceforge.net/docs.html Others...
  • 26. IEEE Hot Chips 2005 Importance of Data Parallelism for GPUs GPUs are designed for graphics Highly parallel tasks GPUs process independent vertices & fragments Temporary registers are zeroed No shared or static data No read-modify-write buffers Opportunity exists when # of independent results is large Data-parallel processing GPU architecture is ALU-heavy Multiple vertex & pixel pipelines, multiple ALUs per pipe Large memory latency, but HUGE memory bandwidth Hide memory latency (with more computation)
  • 27. IEEE Hot Chips 2005 Language Support for Parallelism Most widely-used programming languages are terrible at exposing potential parallelism:
      for (int i=0; i<100; i++) { /* compute i'th element ... */ }
    LISP and other functional languages are marginally better:
      (+ (func 1 4) (func 23 9) (func 6 2))
    Some direct support: Fortran 90, HPF Not in common use for development anymore Research in true parallel languages has stagnated Early 1990s: lots of research in C*, DPCE, etc. Late 1990s on: research moved to JVM, managed code, etc.
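
To make slide 27's point concrete, here is a minimal sketch in Python (chosen for brevity; it is not a language the deck discusses). The `for` loop imposes an iteration order even though each element is independent, whereas a `map`-style call states only that a function applies to every index and leaves scheduling to the runtime. `compute_element` is a hypothetical stand-in for the slide's "compute i'th element" body.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_element(i):
    """Hypothetical stand-in for the slide's 'compute i'th element'."""
    return i * i


def sequential(n):
    # The loop spells out an order (0, 1, 2, ...) that the problem
    # itself does not require; the independence is hidden from tools.
    results = []
    for i in range(n):
        results.append(compute_element(i))
    return results


def data_parallel(n, workers=4):
    # map() says only "apply this to every index"; the executor decides
    # how to schedule the calls. Results come back in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(compute_element, range(n)))


if __name__ == "__main__":
    print(sequential(100) == data_parallel(100))
```

This illustrates how independence can be expressed, not a speedup: in CPython, threads generally do not accelerate CPU-bound work like this, so any real gain would depend on the runtime and hardware underneath.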
  • 28. IEEE Hot Chips 2005 Parallel Programming: Not just for GPUs CPUs benefit, too SSE, SSE2, MMX, etc. Hyperthreading Multi-core processors announced from Intel, AMD, etc. PlayStation 2 Emotion Engine MPI, OpenMP packages for coarse grain communication Efficient execution of: Parallel code on a serial processor: EASY Serial code on a parallel processor: HARD The impact of power consumption further justifies more research in this area – parallelism is the future
  • 29. IEEE Hot Chips 2005 PC Graphics growth (225%/yr) Sustainable Growth on Capability Curve
      Season            tech  #trans  Gflop*  Mpix     Mpoly  Mvector
      ...
      spring/00         0.18  25M     35      800      30M    50M
      fall/00           0.15  50M     150     1.2G     75M    100M
      spring/01         0.15  55M     180     2.0G     100M   200M
      fall/01           0.13  100M    280     4.0G     200M   400M
      ...
      spring/03 (NV30)  0.13  140M    500     8.0G**   300M   300M
      spring/04 (NV40)  0.13  220M    1000    25.6G**  600M   600M
    * Special purpose math, not all general purpose programmable math
    ** Samples (multiple color values within a pixel, for smooth edges)
  • 30. IEEE Hot Chips 2005 GPUs Continue to Accelerate above Moore’s Law, but that’s not all... As pixel/vertex/triangle growth slows and plateaus... Other performance factors increase Number of color samples per pixel (Anti-aliasing) Number of calculations per pixel/vertex Flexibility of programming model Looping, branching, multiple I/O access, multiple math ops/clock High-level language programmability Number of “General Purpose” programmable 32bit Gigaflops per pixel – demand grows without bounds GPUs become General Purpose parallel processors What happens if you compare GPUs to microprocessors?
  • 31. IEEE Hot Chips 2005 Sustained SP MAD GFLOPS GFLOPS
  • 32. IEEE Hot Chips 2005 CPU / GPU Design Strategies / Tactics CPU Strategy: Make the workload (one compute thread) run as fast as possible Tactics: Caching, Instruction/Data Prefetch, “hyperthreading”, Speculative Execution → limited by “perimeter” – communication bandwidth Multi-core will help... a little GPU Strategy: Make the workload (as many threads as possible) run as fast as possible Tactics: Parallelism (1000s of threads), Pipelining → limited by “area” – compute capability
  • 33. IEEE Hot Chips 2005 Application Matches for GPU: Any Large-Scale, Parallel, Feed-forward, Math- and Data-intensive Problem Real-time Graphics (of course!) Image Processing and Analysis Telescope, Surveillance, Sensor Data Volume Data Correlation - Radio Telescope, SETI Monte Carlo Simulation - Neutron Transport Neural Networks Speech Recognition Handwriting Recognition Ray Tracing Physical Modeling and Simulation Video Processing
  • 34. IEEE Hot Chips 2005 Example: Black-Scholes options pricing Widely-used model for pricing call/put options Implemented in ~15 lines of Cg, use combinations of input parameters as separate simulations (fragments) Performance: Fast (~3GHz) P4, good C++: ~3.0 MBSOPS (1X) Quadro FX 3000, Cg: ~2.8 MBSOPS (~.9X) Quadro FX 4400, Cg: ~14.4 MBSOPS (4.8X) Quadro FX 4400, Cg, 100 runs: ~176.0 MBSOPS (59X) (remove test/data transfer bandwidth overhead) How? CPU: ~11GFLOPS, slow exp(), log(), sqrt(), fast mem access GPU: ~65GFLOPS, fast exp(), log(), sqrt(), slow mem access Black-Scholes has high ratio of math to memory access GPU has Parallelism
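
The Cg source behind slide 34 is not included in the deck. As an illustration of why the workload suits a data-parallel machine, the sketch below uses the standard closed-form Black-Scholes price for a European call, written in Python rather than Cg. Each parameter combination is priced independently, which corresponds to the slide's "separate simulations (fragments)", and the arithmetic is dominated by `exp`, `log`, and `sqrt` with almost no memory traffic. All function and parameter names are illustrative, not the author's.

```python
import math
from itertools import product


def norm_cdf(x):
    """Standard normal cumulative distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes_call(spot, strike, rate, vol, years):
    """Closed-form Black-Scholes price of a European call option."""
    sqrt_t = math.sqrt(years)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * years) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    return spot * norm_cdf(d1) - strike * math.exp(-rate * years) * norm_cdf(d2)


def price_grid(spots, strikes, rate, vol, years):
    """Price every (spot, strike) combination independently.

    No evaluation reads another's result, so on a GPU each combination
    could be handled by its own fragment/thread.
    """
    return {
        (s, k): black_scholes_call(s, k, rate, vol, years)
        for s, k in product(spots, strikes)
    }


if __name__ == "__main__":
    grid = price_grid([90.0, 100.0, 110.0], [95.0, 105.0], 0.05, 0.2, 1.0)
    for (s, k), price in sorted(grid.items()):
        print(f"spot={s:6.1f} strike={k:6.1f} call={price:8.4f}")
```

Because each grid entry depends only on its own inputs, the ratio of math to memory access is high, which is the property the slide credits for the GPU's advantage.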
  • 35. IEEE Hot Chips 2005 So, What’s the Problem? The Good News: CPUs and GPUs are increasingly parallel GPUs are already highly parallel Workloads – Graphics and GP – are highly parallel Moore’s Law and the “capability curve” continue to be our friends The Not-so-Good News: Parallel programming is hard Language and Tool support for parallelism is poor Computer Science Education is not focused on parallel programming
  • 36. IEEE Hot Chips 2005 We Are Approaching a Crisis in Programming Skills (lack) Intel, AMD, IBM/Sony/Toshiba (Cell), Sun (Niagara) have all announced Multi- or Many-core roadmaps NVIDIA and other GPUs are already “Multi-core” → Less of a crisis, due to GPU threaded programming model Analysts predict > 50% of processors shipped in next 5 years will be >1 core Who will program these devices? How will the value of multi-core and multi-threading be exploited?
  • 37. IEEE Hot Chips 2005 Call to Action Research Explore new ways of Parallel Programming Explore new Threading models Make parallelism easier to express/exploit Industry (processor vendors) Make exploitation of Multi-core easier Explore “transparent” application speedups Consequences of Failure to Act Multi-core not valued by market
  • 38. IEEE Hot Chips 2005 Questions?