Parallel Programming Concepts
OpenHPI Course
Week 4 : Accelerators
Unit 4.1: Accelerate Now!
Frank Feinbube + Teaching Team
Summary: Week 3
■ Short overview of shared memory parallel programming ideas
■ Different levels of abstractions
□ Process model, thread model, task model
■ Threads for concurrency and parallelization
□ Standardized POSIX interface
□ Java / .NET concurrency functionality
■ Tasks for concurrency and parallelization
□ OpenMP for C / C++, Java, .NET, Cilk, …
■ Functional language constructs for implicit parallelism
■ PGAS languages for NUMA optimization
Specialized languages help the programmer to achieve speedup.
What about accordingly specialized parallel hardware?
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
What is specialized hardware?
■ Graphics cards: speed up rendering tasks
■ Sound cards: play sounds
■ Physics cards?
What else could be sped up by hardware?
■ Encryption
■ Compression
■ Codec parsing
■ Protocols and formats: XML
■ Text, regular expression matching
■ …
Wide Variety of Accelerators
■ Best supercomputer: Tianhe-2 (MilkyWay-2)
■ 2nd: Titan
Wide Variety of Applications
Examples: Fluids, NBody, RadixSort
Wide Variety of Application Domains
BioInformatics, Computational Chemistry, Computational Finance, Computational Fluid Dynamics, Computational Structural Mechanics, Data Science, Defense, Electronic Design Automation, Imaging & Computer Vision, Medical Imaging, Numerical Analytics, Weather and Climate
http://www.nvidia.com/object/gpu-applications-domain.html
What’s in it for me?
Short Term View: Cheap Performance
■ Cheap to buy and to maintain
■ GFLOPS per watt: Fermi 1.5 / Kepler 5 / Maxwell 15 (2014)
[Chart: execution time in milliseconds (lower means faster) over problem size (number of Sudoku places, up to 50,000) for an Intel E8500 CPU, an AMD R800 GPU, and an NVIDIA GT200 GPU]
GPU: Graphics Processing Unit (the processor of a graphics card)
A single Maxwell GPU will have more performance than the fastest supercomputer of 2001; thanks to GPUs, that performance sits in your computer or notebook.
What is 15 GFLOPS? 15,000,000,000 floating-point operations per second.
Middle Term View:
Even More Performance
Long Term View:
Acceleration Everywhere
Dealing with massively multi-core:
■ Accelerators (APUs) that accompany common general-purpose CPUs (hybrid systems)
Hybrid Systems (Accelerators + CPUs)
■ GPU compute devices: High Performance Computing (the top 2 supercomputers are accelerator-based!), business servers, home/desktop computers, mobile and embedded systems
■ Special-purpose accelerators: (de)compression, XML parsing, (en|de)cryption, regular expression matching
How do they get so fast?
Three Ways Of Doing Anything Faster
[Pfister]
■ Work harder (clock speed)
→ Power wall problem
→ Memory wall problem
■ Work smarter (optimization, caching)
→ ILP wall problem
→ Memory wall problem
■ Get help (parallelization)
□ More cores per single CPU
□ Software needs to exploit them in the right way
→ Memory wall problem
Accelerators bypass the walls
■ CPUs are general purpose
□ Need to support all types of software / programming models
□ Need to support a large variety of legacy software
→ That makes it hard to do something about the walls
■ Accelerators are special purpose
□ Need to support only a limited subset of software / programming models
□ Legacy software is usually even more limited and strict
→ That lessens the impact of the walls (memory & power wall)
→ Stronger parallelization and better speedup possible
CPU and Accelerator trends
CPU
→ Evolving towards throughput computing
→ Motivated by energy-efficient performance
Accelerator
→ Evolving towards general-purpose computing
→ Motivated by higher-quality graphics and data-parallel programming
[Diagram: CPUs evolve from multi-threading over multi-core to many-core; accelerators evolve from fixed function over partially programmable to fully programmable (e.g. NVIDIA Kepler, Intel MIC); both converge on throughput performance and programmability]
Task Parallelism and Data Parallelism
[Diagram: input data → parallel processing → result data, contrasting "CPU-style" task parallelism with "GPU-style" data parallelism]
Flynn's Taxonomy (1966)
■ Classifies parallel hardware architectures according to their capabilities in the instruction and data processing dimension:
□ Single Instruction, Single Data (SISD): one instruction stream processes one data item per step
□ Single Instruction, Multiple Data (SIMD): one instruction stream processes multiple data items per step
□ Multiple Instruction, Single Data (MISD): multiple instruction streams process one data item per step
□ Multiple Instruction, Multiple Data (MIMD): multiple instruction streams process multiple data items per step
Parallel Programming Concepts
OpenHPI Course
Week 4 : Accelerators
Unit 4.2: Accelerator Technology
Frank Feinbube + Teaching Team
Why SIMD?
History of GPU Computing
■ Fixed Function Graphic Pipelines: 1980s-1990s; configurable, not programmable; first APIs (DirectX, OpenGL); vertex processing
■ Programmable Real-Time Graphics: since 2001; APIs for vertex shading, pixel shading, and access to textures; DirectX 9
■ Unified Graphics and Computing Processors: 2006; NVIDIA's G80; unified processor arrays; three programmable shading stages; DirectX 10
From Fixed Function Pipeline
To Programmable Shading Stages
GPUs pre 2006: separate Vertex Shader and Pixel Shader stages
DirectX 10 Geometry Shader idea: Vertex Shader, Geometry Shader, and Pixel Shader as separate stages ("we would need a bigger card")
NVIDIA G80 solution: one unified Programmable Shader (Vertex/Geometry/Pixel); this design allows a smaller card with more power
■ General Purpose GPU (GPGPU): compute problems expressed as native graphics operations; algorithms as shaders; data in textures
General Purpose GPU (GPGPU)
Example: for every texture element B, evaluate B > (1/n) * Σ_{i=1..n} x_i (is the element larger than the mean?)
■ Input data is stored in a texture, e.g.
  2.7 1.8 2.8 1.8
  2.8 4.5 9.0 4.5
■ The formula is implemented as a shader running on the Programmable Shader (Vertex/Geometry/Pixel) of the Graphics Processing Unit (GPU)
■ The output is written to a result texture:
  0 0 0 0
  0 1 1 1
■ GPU Computing: programming with CUDA; programmable shaders; load and store instructions; barriers; atomics
What’s the difference to CPUs?
CPU vs. GPU Architecture
CPU ("multi-core"):
■ A few heavyweight threads
■ Branch prediction
GPU ("many-core"):
■ 1000+ lightweight threads
■ Memory latency hiding
[Diagram: the CPU die devotes much area to control logic and cache beside a few processing elements; the GPU die consists almost entirely of processing elements; both connect to DRAM]
GPU Threads are different
GPU threads:
■ Execute exactly the same instruction (line of code)
□ They share the program counter
→ No branching!
Memory latency hiding:
■ The GPU holds thousands of threads
■ If some access the slow memory, others will run while they wait
■ No context switch
□ Enough registers to store all the threads all the time
What does it look like?
GPU Hardware in Detail: GF100
[Figure series: NVIDIA GF100 die shots in increasing detail — many multiprocessors arranged around a shared L2 cache, each containing many cores]
It’s a jungle out there!
GPU Computing Platforms (Excerpt)
■ AMD: R700, R800, R900, HD 7000, HD 8000, Rx 200
■ NVIDIA: G80, G92, GT200, GF100, GK110 (sold as GeForce, Quadro, Tesla, ION)
Compute Capability by version
Feature                        1.0    1.1    1.2    1.3    2.x      3.0         3.5
double precision floating
point operations               No     No     No     Yes    Yes      Yes         Yes
caches                         No     No     No     No     Yes      Yes         Yes
max # concurrent kernels       1      1      1      1      8        8           8 + Dynamic Parallelism
max # threads per block        512    512    512    512    1024     1024        1024
max # warps per MP             24     24     32     32     48       64          64
max # threads per MP           768    768    1024   1024   1536     2048        2048
register count (32 bit)        8192   8192   16384  16384  32768    65536       65536
max shared mem per MP          16KB   16KB   16KB   16KB   16/48KB  16/32/48KB  16/32/48KB
# shared memory banks          16     16     16     16     32       32          32
Plus: varying amounts of cores, global memory sizes, bandwidth, clock speeds (core, memory), bus width, memory access penalties, …
Intel Xeon Phi: Hardware
60 cores based on the P54C architecture (Pentium)
■ > 1.0 GHz clock speed; 64-bit x86 instructions + SIMD
■ 25 MB L2 cache in total (512 KB per core) + 64 KB L1 (cache coherency)
■ 8 (to 32) GB of GDDR5
■ 4 hardware threads per core (240 logical cores)
□ Not multi-core / Hyper-Threading; think graphics-card hardware threads
□ Only one runs at a time = memory latency hiding
□ Switched after each instruction!
→ use 120 or 240 threads for the 60 cores
■ 512-bit wide VPU with the new KCi ISA
■ No support for MMX, SSE, or AVX
■ Can handle 8 double-precision or 16 single-precision floats
■ Always structured in vectors with 16 elements
Intel Xeon Phi: Operating System
■ Minimal, embedded Linux
■ Linux Standard Base (LSB) core libraries
■ BusyBox minimal shell environment
Parallel Programming Concepts
OpenHPI Course
Week 4 : Accelerators
Unit 4.3: Open Compute Language (OpenCL)
Frank Feinbube + Teaching Team
Simple Example: Vector Addition
c = (a1, a2, a3, a4) + (b1, b2, b3, b4) = ?
Open Compute Language (OpenCL)
■ Apple: was tired of recoding for many-core CPUs and GPUs; pushed vendors to standardize; wrote a draft "straw man" API
■ AMD / ATI: merged, needed commonality across products
■ NVIDIA: GPU vendor; wants to steal market share from the CPU
■ Intel: CPU vendor; wants to steal market share from the GPU
■ The Khronos Compute Group was formed, joined by Ericsson, Nokia, IBM, Sony, Blizzard, Texas Instruments, …
OpenCL Platform Model
■ OpenCL exposes CPUs, GPUs, and other Accelerators as “devices”
■ Each “device” contains one or more “compute units”, i.e. cores, SMs,...
■ Each “compute unit” contains one or more SIMD “processing elements”
Terminology
                  CPU                              OpenCL
Platform level:   SMP system                       Compute Device
                  Processor                        Compute Unit
                  Core                             Processing Element
Memory level:     Main Memory                      Global and Constant Memory
                  -                                Local Memory
                  Registers, Thread Local Storage  Registers, Private Memory
Execution level:  Process                          Index Range (NDRange)
                  -                                Work Group
                  Thread                           Work Items (+ Kernels)
The BIG idea behind OpenCL
OpenCL execution model: execute a kernel at each point in a problem domain. E.g., process a 1024 × 1024 image with one kernel invocation per pixel, i.e. 1024 × 1024 = 1,048,576 kernel executions. Traditional loops become data-parallel OpenCL kernels.
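The loop-to-kernel idea can be sketched in plain C (an illustrative emulation of the dispatch, not the OpenCL API; all function names below are made up for the example):

```c
#include <stddef.h>

/* Traditional loop: one thread walks the whole index space itself. */
void scale_loop(const float *in, float *out, int n, float factor) {
    for (int i = 0; i < n; i++)
        out[i] = in[i] * factor;
}

/* OpenCL style: the loop body becomes a kernel that the runtime
 * instantiates once per point of the index space. In OpenCL C this
 * function would be marked __kernel and obtain its index via
 * get_global_id(0); here the index is passed in explicitly. */
void scale_kernel_body(const float *in, float *out, float factor, int global_id) {
    out[global_id] = in[global_id] * factor;
}

/* What a kernel launch conceptually does for a 1-D range: run the
 * kernel body once per global ID (on a device, in parallel). */
void run_ndrange(const float *in, float *out, float factor, int global_size) {
    for (int id = 0; id < global_size; id++)
        scale_kernel_body(in, out, factor, id);
}
```

Both paths compute the same result; the kernel form merely exposes every index as an independent unit of work.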
Building and Executing OpenCL Code
[Diagram: the code of one or more kernels is compiled per device, yielding a GPU binary representation and a CPU binary representation (kernel → program → device)]
OpenCL code must be prepared to deal with much greater hardware diversity (features are optional and may not be supported on all devices) → compile code that is tailored to the device configuration.
OpenCL Execution Model
An OpenCL kernel is executed by an array of work items.
■ All work items run the same code (SPMD)
■ Each work item has an index that it uses to compute memory
addresses and make control decisions
[Figure: an array of work items (threads) with ids 0 … 7]
Work Groups: Scalable Cooperation
Divide monolithic work item array into work groups
■ Work items within a work group cooperate via shared
memory, atomic operations and barrier synchronization
■ Work items in different work groups cannot cooperate
[Figure: 16 work items (threads, ids 0 … 15) divided into work group 0 and work group 1]
OpenCL Execution Model
■ Parallel work is submitted to devices by launching kernels
■ Kernels run over global dimension index ranges (NDRange), broken up
into “work groups”, and “work items”
■ Work items executing within the same work group can synchronize with
each other with barriers or memory fences
■ Work items in different work groups can’t sync with each other, except
by launching a new kernel
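As a sketch of such cooperation, the following plain C emulates one work group summing eight values through local memory; each inner loop over `lid` stands for "all work items execute this phase", and the comments mark where `barrier(CLK_LOCAL_MEM_FENCE)` would separate phases on a real device (`work_group_sum` and `GROUP_SIZE` are illustrative names, not OpenCL API):

```c
#define GROUP_SIZE 8

float work_group_sum(const float *input) {
    float local_mem[GROUP_SIZE];          /* the group's __local memory */

    for (int lid = 0; lid < GROUP_SIZE; lid++)
        local_mem[lid] = input[lid];      /* each work item loads one value */
    /* barrier(CLK_LOCAL_MEM_FENCE); */

    /* Tree reduction: in each phase, the first `stride` work items
     * add their partner's value. */
    for (int stride = GROUP_SIZE / 2; stride > 0; stride /= 2) {
        for (int lid = 0; lid < stride; lid++)
            local_mem[lid] += local_mem[lid + stride];
        /* barrier(CLK_LOCAL_MEM_FENCE); */
    }
    return local_mem[0];                  /* work item 0 holds the group sum */
}
```

Without the barriers, a real device could let one work item race ahead and read a partner's value before it was written.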
OpenCL Execution Model
[Figure: an example NDRange index space showing work items, their global IDs, and their mapping onto pairs of work-group and local IDs]
OpenCL Memory Architecture
■ Private memory: per work item
■ Local memory: shared within a work group
■ Global / constant memory: visible to all work groups
■ Host memory: on the CPU
OpenCL Memory Architecture
■ Memory management is explicit:
you must move data from host → global → local… and back
Memory Type       Keyword     Description / Characteristics
Global Memory     __global    Shared by all work items; read/write; may be cached (modern GPUs), else slow; huge
Private Memory    __private   For local variables; per work item; may be mapped onto global memory (arrays on GPUs)
Local Memory      __local     Shared between work items of a work group; may be mapped onto global memory (non-GPU), else fast; small
Constant Memory   __constant  Read-only, cached; GPUs add a special kind: texture memory
OpenCL Work Item Code
A subset of ISO C99, without some C99 features:
■ headers, function pointers, recursion, variable-length arrays, and bit fields
A superset of ISO C99 with additions for:
■ work items and work groups
■ vector types (2, 4, 8, 16 elements): endian-safe, aligned at vector length
■ image types mapped to texture memory
■ synchronization
■ address space qualifiers
Also includes a large set of built-in functions for image manipulation, work-item manipulation, specialized math routines, vectors, etc.
Vector Addition: Kernel
■ The kernel body is instantiated once for each work item; each gets a unique index
□ Code that actually executes on target devices
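The kernel itself is not reproduced in this export; the canonical OpenCL vector addition looks as follows, paired here with a plain-C emulation of its per-work-item semantics so the sketch can be executed without a device:

```c
#include <stddef.h>

/* The kernel source as it would be handed to clCreateProgramWithSource.
 * Each work item adds exactly one pair of elements. */
const char *vecadd_source =
    "__kernel void vecadd(__global const float *a,\n"
    "                     __global const float *b,\n"
    "                     __global float *c)\n"
    "{\n"
    "    int i = get_global_id(0); /* unique index of this work item */\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

/* Host-side emulation of the device semantics: the kernel body runs
 * once per global ID. */
void vecadd_emulated(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```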
Vector Addition: Host Program [5]
[Code figure: the full host program; most of it is "standard" overhead that every OpenCL program repeats]
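The host program figure is not reproduced here; as a rough sketch, the "standard" overhead follows this call sequence (real OpenCL 1.x API names, arguments elided, kernel name "vecadd" chosen for illustration):

```text
clGetPlatformIDs(...)            // 1. discover OpenCL platforms
clGetDeviceIDs(...)              // 2. select a compute device
clCreateContext(...)             // 3. create a context for it
clCreateCommandQueue(...)        // 4. create a command queue
clCreateBuffer(...)              // 5. allocate device buffers for a, b, c
clCreateProgramWithSource(...)   // 6. hand over the kernel source string
clBuildProgram(...)              // 7. compile it for this device
clCreateKernel(..., "vecadd")    // 8. pick the kernel
clSetKernelArg(...)              // 9. bind the buffers as arguments
clEnqueueWriteBuffer(...)        // 10. copy input data to the device
clEnqueueNDRangeKernel(...)      // 11. launch over the index range
clEnqueueReadBuffer(...)         // 12. copy the result back
clRelease*(...)                  // 13. release all objects
```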
Development Support
Software development kits: NVIDIA and AMD; Windows and Linux
Special libraries: AMD Core Math Library, BLAS and FFT libraries by NVIDIA, OpenNL for numerics and CULA for linear algebra; NVIDIA Performance Primitives library: a collection of common GPU-accelerated algorithms
Profiling and debugging tools:
■ NVIDIA's Parallel Nsight for Microsoft Visual Studio
■ AMD's ATI Stream Profiler
■ AMD's Stream KernelAnalyzer: displays GPU assembler code, detects execution bottlenecks
■ gDEBugger (platform-independent)
Big knowledge bases with tutorials, examples, articles, show cases, and developer forums
[Screenshot: NVIDIA Parallel Nsight]
Parallel Programming Concepts
OpenHPI Course
Week 4 : Accelerators
Unit 4.4: Optimizations
Frank Feinbube + Teaching Team
The Power of GPU Computing
[Chart: execution time in milliseconds (less is better) over problem size (number of Sudoku places, up to 50,000) for an Intel E8500 CPU, an AMD R800 GPU, and an NVIDIA GT200 GPU]
big performance gains for small problem sizes
The Power of GPU Computing
[Chart: the same measurement for problem sizes up to 600,000 Sudoku places]
small/moderate performance gains for large problem sizes → further optimizations needed
Best Practices for Performance Tuning
■ Algorithm Design: asynchronous, recompute, simple
■ Memory Transfer: chaining, overlap transfer & compute
■ Control Flow: divergent branching, predication
■ Memory Types: local memory as cache, rare resource
■ Memory Access: coalescing, bank conflicts
■ Sizing: execution size, evaluation
■ Instructions: shifting, fused multiply, vector types
■ Precision: native math functions, build options
Divergent Branching and Predication
Divergent Branching
■ Flow control instructions (if, switch, do, for, while) can result in different execution paths
→ With data-parallel execution, varying execution paths will be serialized
→ Threads converge back to the same execution path after completion
Branch Predication
■ Instructions are associated with a per-thread condition code (predicate)
□ All instructions are scheduled for execution
□ Predicate true: executed normally
□ Predicate false: do not write results, do not evaluate addresses, do not read operands
■ The compiler may use branch predication for if or switch statements
■ Unroll loops yourself (or use #pragma unroll for NVIDIA)
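A minimal sketch of the difference in plain C (`clamp_branch` and `clamp_predicated` are illustrative names; on a GPU the compiler performs this transformation itself):

```c
/* Branchy version: within a warp, threads taking different sides of
 * the if would be serialized. */
int clamp_branch(int x, int lo) {
    if (x < lo)
        return lo;
    return x;
}

/* Predicated version: both candidate results exist and the predicate
 * selects one, so there is no control flow to diverge on. This mirrors
 * what the compiler emits for short conditional bodies. */
int clamp_predicated(int x, int lo) {
    int p = (x < lo);              /* per-thread predicate: 0 or 1 */
    return p * lo + (1 - p) * x;   /* select without branching     */
}
```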
Use Caching: Local, Texture, Constant
Local Memory
■ Memory latency roughly 100x lower than global memory latency
■ Small, no coalescing problems, prone to memory bank conflicts
Texture Memory
■ 2-dimensionally cached, read-only
■ Can be used to avoid uncoalesced loads from global memory
■ Used with the image data type
Constant Memory
■ Linear cache, read-only, 64 KB
■ As fast as reading from a register for the same address
■ Can be used for big lists of input arguments
[Figure: a 2D tile of a row-major array (elements 0-3, 64-67, 128-131, 192-195) covered by the texture cache]
Sizing:
What is the right execution layout?
■ The local work item count should be a multiple of the native execution size (NVIDIA: 32, AMD: 64, MIC: 16), but not too big
■ The number of work groups should be a multiple of the number of multiprocessors (hundreds or thousands of work groups)
■ Can be configured in a 1-, 2- or 3-dimensional layout: consider access patterns and caching
■ Balance between latency hiding and resource utilization
■ Experimenting is required! [4]
Instructions and Precision
■ Single precision floats provide best performance
■ Use shift operations to avoid expensive division and modulo calculations
■ Special compiler flags
■ AMD has native vector type implementation; NVIDIA is scalar
■ Use the native math library whenever speed trumps precision
Functions                                                            Throughput
single-precision floating-point add, multiply, and multiply-add      8 operations per clock cycle
single-precision reciprocal, reciprocal square root, native_logf(x)  2 operations per clock cycle
native_sin, native_cos, native_exp                                   1 operation per clock cycle
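For example, with unsigned values and power-of-two divisors, the shift/mask replacements are exact (a small sketch; `div_by_32` and `mod_by_32` are illustrative names):

```c
/* For unsigned operands and power-of-two divisors, division and
 * modulo can be replaced by a shift and a mask, each a single cheap
 * instruction on GPU cores. */
unsigned div_by_32(unsigned x) { return x >> 5;  }  /* same as x / 32 */
unsigned mod_by_32(unsigned x) { return x & 31u; }  /* same as x % 32 */
```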
Coalesced Memory Accesses
Simple access pattern
■ Can be fetched in a single 64-byte transaction
■ Could also be permuted *
Sequential but misaligned access
■ If it falls into a single 128-byte segment: one 128-byte transaction; else: one 64-byte transaction + one 32-byte transaction *
Strided accesses
■ Depending on the stride, from 1 (stride 1) up to 16 transactions *
* 16 transactions with compute capability 1.1 [6]
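A toy model in C of why the stride matters, counting the 64-byte segments touched by a 16-thread half-warp (an illustrative simplification of the compute-capability-1.x rules, not the exact hardware algorithm):

```c
/* Count the distinct 64-byte segments touched when 16 threads each
 * read one 4-byte float at element index thread_id * stride; every
 * distinct segment costs one memory transaction on early hardware. */
int memory_transactions(int stride_in_elements) {
    int transactions = 0;
    long previous_segment = -1;
    for (int tid = 0; tid < 16; tid++) {
        long byte_addr = (long)tid * stride_in_elements * 4;
        long segment = byte_addr / 64;
        if (segment != previous_segment) {  /* new segment -> new transaction */
            transactions++;
            previous_segment = segment;
        }
    }
    return transactions;
}
```

Stride 1 keeps the half-warp inside one segment; a stride of 16 elements scatters every access into its own segment.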
Intel Xeon Phi: OpenCL
■ What's different to GPUs?
□ Thread granularity is bigger + no need for local shared memory
■ Work groups are mapped to threads
□ 240 OpenCL hardware threads handle work groups
□ More than 1000 work groups recommended
□ Each thread executes one work group
■ Implicit vectorization by the compiler of the innermost loop
□ 16 elements per vector → dimension zero must be divisible by 16, otherwise scalar execution (good work group size: 16)
A __kernel ABC() is executed by one thread as:
for (int i = 0; i < get_local_size(2); i++)
  for (int j = 0; j < get_local_size(1); j++)
    for (int k = 0; k < get_local_size(0); k++)  // dimension zero of the NDRange
      Kernel_Body;
Intel Xeon Phi: OpenCL
■ Non-uniform branching in a work group has significant overhead
■ Non-vector-size-aligned memory access has significant overhead
■ Non-linear access patterns have significant overhead
→ Especially for dimension zero
■ Manual prefetching can be advantageous
■ No hardware support for barriers
■ Further reading: Intel SDK for OpenCL Applications XE 2013 Optimization Guide for Linux OS
Parallel Programming Concepts
OpenHPI Course
Week 4 : Accelerators
Unit 4.5: Future Trends
Frank Feinbube + Teaching Team
Towards new Platforms
WebCL [Draft] http://www.khronos.org/webcl/
■ JavaScript binding to OpenCL
■ Heterogeneous parallel computing (CPUs + GPUs) within web browsers
■ Enables compute-intensive programs like physics engines, video editing, …
■ Currently only available with add-ons (Node.js, Firefox, WebKit)
Android installable client driver extension (ICD)
■ Enables OpenCL implementations to be discovered and loaded as a shared object on Android systems
Towards new Applications: Dealing with Unstructured Grids
[Figures: a simulation grid that is too coarse vs. one that is too fine; a fixed grid vs. a dynamic grid]
Towards new Applications: Dynamic Parallelism
[Figure: without dynamic parallelism, the CPU manages kernel execution; with it, the GPU manages execution and can launch kernels itself]
Towards new Programming Models: OpenACC GPU Computing
■ Copy arrays to the GPU
■ Parallelize loops with GPU kernels
■ Data is automatically copied back at the end of the region
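A minimal OpenACC sketch matching these steps (SAXPY as a stand-in example; with a compiler that does not know OpenACC the pragma is ignored and the loop simply runs sequentially on the CPU):

```c
/* SAXPY with an OpenACC directive. An OpenACC compiler copies x and y
 * to the accelerator (copyin/copy clauses), runs the loop iterations
 * as GPU kernels, and copies y back at the end of the region. */
void saxpy(int n, float a, const float *x, float *y) {
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```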
Hybrid System
[Diagram: two multi-core CPUs connected via QPI and attached to RAM, with a MIC and several GPUs (each with their own RAM) as accelerator devices]
Summary: Week 4
■ Accelerators promise big speedups for data-parallel applications
□ SIMD execution model (no branching)
□ Memory latency hiding with 1000s of lightweight threads
■ Enormous diversity → OpenCL as a standard programming model for all
□ Uniform terms: Compute Device, Compute Unit, Processing Element
□ Idea: loop parallelism with index ranges
□ Kernels are written in C; compiled at runtime; executed in parallel
□ Complex memory hierarchy, overhead to copy data from the CPU
■ Getting fast is easy, getting faster is hard
□ Best practices for accelerators
□ Knowledge about hardware characteristics necessary
■ Future: faster, more features, more platforms, better programmability
Multi-core, Many-Core, ..
What if my computational problem still demands more power?
Time space trade off
 
Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer
 
Machine Learning on Your Hand - Introduction to Tensorflow Lite Preview
Machine Learning on Your Hand - Introduction to Tensorflow Lite PreviewMachine Learning on Your Hand - Introduction to Tensorflow Lite Preview
Machine Learning on Your Hand - Introduction to Tensorflow Lite Preview
 
Just-In-Time GPU Compilation for Interpreted Languages with Partial Evaluation
Just-In-Time GPU Compilation for Interpreted Languages with Partial EvaluationJust-In-Time GPU Compilation for Interpreted Languages with Partial Evaluation
Just-In-Time GPU Compilation for Interpreted Languages with Partial Evaluation
 

Andere mochten auch

FAST MAP PROJECTION ON CUDA.ppt
FAST MAP PROJECTION ON CUDA.pptFAST MAP PROJECTION ON CUDA.ppt
FAST MAP PROJECTION ON CUDA.pptgrssieee
 
Computación paralela con gp us cuda
Computación paralela con gp us cudaComputación paralela con gp us cuda
Computación paralela con gp us cudaJavier Zarco
 
GPU Computing with Ruby
GPU Computing with RubyGPU Computing with Ruby
GPU Computing with RubyShin Yee Chung
 
A comparison of molecular dynamics simulations using GROMACS with GPU and CPU
A comparison of molecular dynamics simulations using GROMACS with GPU and CPUA comparison of molecular dynamics simulations using GROMACS with GPU and CPU
A comparison of molecular dynamics simulations using GROMACS with GPU and CPUAlex Camargo
 
GPU - An Introduction
GPU - An IntroductionGPU - An Introduction
GPU - An IntroductionDhan V Sagar
 
Introduction to CUDA
Introduction to CUDAIntroduction to CUDA
Introduction to CUDARaymond Tay
 

Andere mochten auch (9)

FAST MAP PROJECTION ON CUDA.ppt
FAST MAP PROJECTION ON CUDA.pptFAST MAP PROJECTION ON CUDA.ppt
FAST MAP PROJECTION ON CUDA.ppt
 
Joel Falcou, Boost.SIMD
Joel Falcou, Boost.SIMDJoel Falcou, Boost.SIMD
Joel Falcou, Boost.SIMD
 
Equipo 2 gpus
Equipo 2 gpusEquipo 2 gpus
Equipo 2 gpus
 
Computación paralela con gp us cuda
Computación paralela con gp us cudaComputación paralela con gp us cuda
Computación paralela con gp us cuda
 
GPU Computing with Ruby
GPU Computing with RubyGPU Computing with Ruby
GPU Computing with Ruby
 
Cuda Architecture
Cuda ArchitectureCuda Architecture
Cuda Architecture
 
A comparison of molecular dynamics simulations using GROMACS with GPU and CPU
A comparison of molecular dynamics simulations using GROMACS with GPU and CPUA comparison of molecular dynamics simulations using GROMACS with GPU and CPU
A comparison of molecular dynamics simulations using GROMACS with GPU and CPU
 
GPU - An Introduction
GPU - An IntroductionGPU - An Introduction
GPU - An Introduction
 
Introduction to CUDA
Introduction to CUDAIntroduction to CUDA
Introduction to CUDA
 

Ähnlich wie OpenHPI - Parallel Programming Concepts - Week 4

"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese..."Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...Edge AI and Vision Alliance
 
LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLinaro
 
Mauricio breteernitiz hpc-exascale-iscte
Mauricio breteernitiz hpc-exascale-iscteMauricio breteernitiz hpc-exascale-iscte
Mauricio breteernitiz hpc-exascale-isctembreternitz
 
Utilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapUtilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapGeorge Markomanolis
 
Deep Learning on ARM Platforms - SFO17-509
Deep Learning on ARM Platforms - SFO17-509Deep Learning on ARM Platforms - SFO17-509
Deep Learning on ARM Platforms - SFO17-509Linaro
 
OpenCL & the Future of Desktop High Performance Computing in CAD
OpenCL & the Future of Desktop High Performance Computing in CADOpenCL & the Future of Desktop High Performance Computing in CAD
OpenCL & the Future of Desktop High Performance Computing in CADDesign World
 
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsHPCC Systems
 
01 intro-bps-2011
01 intro-bps-201101 intro-bps-2011
01 intro-bps-2011mistercteam
 
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...Anne Nicolas
 
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER TutorialSCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER TutorialGanesan Narayanasamy
 
Labview1_ Computer Applications in Control_ACRRL
Labview1_ Computer Applications in Control_ACRRLLabview1_ Computer Applications in Control_ACRRL
Labview1_ Computer Applications in Control_ACRRLMohammad Sabouri
 
Evaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI SupercomputerEvaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI SupercomputerGeorge Markomanolis
 
Amd accelerated computing -ufrj
Amd   accelerated computing -ufrjAmd   accelerated computing -ufrj
Amd accelerated computing -ufrjRoberto Brandao
 
Lecture2 cuda spring 2010
Lecture2 cuda spring 2010Lecture2 cuda spring 2010
Lecture2 cuda spring 2010haythem_2015
 
01 high bandwidth acquisitioncomputing compressionall in a box
01 high bandwidth acquisitioncomputing compressionall in a box01 high bandwidth acquisitioncomputing compressionall in a box
01 high bandwidth acquisitioncomputing compressionall in a boxYutaka Kawai
 
Compute API –Past & Future
Compute API –Past & FutureCompute API –Past & Future
Compute API –Past & FutureOfer Rosenberg
 
The Past, Present, and Future of OpenACC
The Past, Present, and Future of OpenACCThe Past, Present, and Future of OpenACC
The Past, Present, and Future of OpenACCinside-BigData.com
 
GTC15-Manoj-Roge-OpenPOWER
GTC15-Manoj-Roge-OpenPOWERGTC15-Manoj-Roge-OpenPOWER
GTC15-Manoj-Roge-OpenPOWERAchronix
 

Ähnlich wie OpenHPI - Parallel Programming Concepts - Week 4 (20)

"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese..."Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
 
LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience Report
 
Mauricio breteernitiz hpc-exascale-iscte
Mauricio breteernitiz hpc-exascale-iscteMauricio breteernitiz hpc-exascale-iscte
Mauricio breteernitiz hpc-exascale-iscte
 
Utilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapUtilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmap
 
Deep Learning on ARM Platforms - SFO17-509
Deep Learning on ARM Platforms - SFO17-509Deep Learning on ARM Platforms - SFO17-509
Deep Learning on ARM Platforms - SFO17-509
 
OpenCL & the Future of Desktop High Performance Computing in CAD
OpenCL & the Future of Desktop High Performance Computing in CADOpenCL & the Future of Desktop High Performance Computing in CAD
OpenCL & the Future of Desktop High Performance Computing in CAD
 
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC Systems
 
01 intro-bps-2011
01 intro-bps-201101 intro-bps-2011
01 intro-bps-2011
 
2014/07/17 Parallelize computer vision by GPGPU computing
2014/07/17 Parallelize computer vision by GPGPU computing2014/07/17 Parallelize computer vision by GPGPU computing
2014/07/17 Parallelize computer vision by GPGPU computing
 
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
 
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER TutorialSCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
 
Getting started with AMD GPUs
Getting started with AMD GPUsGetting started with AMD GPUs
Getting started with AMD GPUs
 
Labview1_ Computer Applications in Control_ACRRL
Labview1_ Computer Applications in Control_ACRRLLabview1_ Computer Applications in Control_ACRRL
Labview1_ Computer Applications in Control_ACRRL
 
Evaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI SupercomputerEvaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI Supercomputer
 
Amd accelerated computing -ufrj
Amd   accelerated computing -ufrjAmd   accelerated computing -ufrj
Amd accelerated computing -ufrj
 
Lecture2 cuda spring 2010
Lecture2 cuda spring 2010Lecture2 cuda spring 2010
Lecture2 cuda spring 2010
 
01 high bandwidth acquisitioncomputing compressionall in a box
01 high bandwidth acquisitioncomputing compressionall in a box01 high bandwidth acquisitioncomputing compressionall in a box
01 high bandwidth acquisitioncomputing compressionall in a box
 
Compute API –Past & Future
Compute API –Past & FutureCompute API –Past & Future
Compute API –Past & Future
 
The Past, Present, and Future of OpenACC
The Past, Present, and Future of OpenACCThe Past, Present, and Future of OpenACC
The Past, Present, and Future of OpenACC
 
GTC15-Manoj-Roge-OpenPOWER
GTC15-Manoj-Roge-OpenPOWERGTC15-Manoj-Roge-OpenPOWER
GTC15-Manoj-Roge-OpenPOWER
 

Mehr von Peter Tröger

WannaCry - An OS course perspective
WannaCry - An OS course perspectiveWannaCry - An OS course perspective
WannaCry - An OS course perspectivePeter Tröger
 
Cloud Standards and Virtualization
Cloud Standards and VirtualizationCloud Standards and Virtualization
Cloud Standards and VirtualizationPeter Tröger
 
Distributed Resource Management Application API (DRMAA) Version 2
Distributed Resource Management Application API (DRMAA) Version 2Distributed Resource Management Application API (DRMAA) Version 2
Distributed Resource Management Application API (DRMAA) Version 2Peter Tröger
 
OpenSubmit - How to grade 1200 code submissions
OpenSubmit - How to grade 1200 code submissionsOpenSubmit - How to grade 1200 code submissions
OpenSubmit - How to grade 1200 code submissionsPeter Tröger
 
Design of Software for Embedded Systems
Design of Software for Embedded SystemsDesign of Software for Embedded Systems
Design of Software for Embedded SystemsPeter Tröger
 
Humans should not write XML.
Humans should not write XML.Humans should not write XML.
Humans should not write XML.Peter Tröger
 
What activates a bug? A refinement of the Laprie terminology model.
What activates a bug? A refinement of the Laprie terminology model.What activates a bug? A refinement of the Laprie terminology model.
What activates a bug? A refinement of the Laprie terminology model.Peter Tröger
 
Dependable Systems - Summary (16/16)
Dependable Systems - Summary (16/16)Dependable Systems - Summary (16/16)
Dependable Systems - Summary (16/16)Peter Tröger
 
Dependable Systems - Hardware Dependability with Redundancy (14/16)
Dependable Systems - Hardware Dependability with Redundancy (14/16)Dependable Systems - Hardware Dependability with Redundancy (14/16)
Dependable Systems - Hardware Dependability with Redundancy (14/16)Peter Tröger
 
Dependable Systems - System Dependability Evaluation (8/16)
Dependable Systems - System Dependability Evaluation (8/16)Dependable Systems - System Dependability Evaluation (8/16)
Dependable Systems - System Dependability Evaluation (8/16)Peter Tröger
 
Dependable Systems - Structure-Based Dependabiilty Modeling (6/16)
Dependable Systems - Structure-Based Dependabiilty Modeling (6/16)Dependable Systems - Structure-Based Dependabiilty Modeling (6/16)
Dependable Systems - Structure-Based Dependabiilty Modeling (6/16)Peter Tröger
 
Dependable Systems -Software Dependability (15/16)
Dependable Systems -Software Dependability (15/16)Dependable Systems -Software Dependability (15/16)
Dependable Systems -Software Dependability (15/16)Peter Tröger
 
Dependable Systems -Reliability Prediction (9/16)
Dependable Systems -Reliability Prediction (9/16)Dependable Systems -Reliability Prediction (9/16)
Dependable Systems -Reliability Prediction (9/16)Peter Tröger
 
Dependable Systems -Fault Tolerance Patterns (4/16)
Dependable Systems -Fault Tolerance Patterns (4/16)Dependable Systems -Fault Tolerance Patterns (4/16)
Dependable Systems -Fault Tolerance Patterns (4/16)Peter Tröger
 
Dependable Systems - Introduction (1/16)
Dependable Systems - Introduction (1/16)Dependable Systems - Introduction (1/16)
Dependable Systems - Introduction (1/16)Peter Tröger
 
Dependable Systems -Dependability Means (3/16)
Dependable Systems -Dependability Means (3/16)Dependable Systems -Dependability Means (3/16)
Dependable Systems -Dependability Means (3/16)Peter Tröger
 
Dependable Systems - Hardware Dependability with Diagnosis (13/16)
Dependable Systems - Hardware Dependability with Diagnosis (13/16)Dependable Systems - Hardware Dependability with Diagnosis (13/16)
Dependable Systems - Hardware Dependability with Diagnosis (13/16)Peter Tröger
 
Dependable Systems -Dependability Attributes (5/16)
Dependable Systems -Dependability Attributes (5/16)Dependable Systems -Dependability Attributes (5/16)
Dependable Systems -Dependability Attributes (5/16)Peter Tröger
 
Dependable Systems -Dependability Threats (2/16)
Dependable Systems -Dependability Threats (2/16)Dependable Systems -Dependability Threats (2/16)
Dependable Systems -Dependability Threats (2/16)Peter Tröger
 
Verteilte Software-Systeme im Kontext von Industrie 4.0
Verteilte Software-Systeme im Kontext von Industrie 4.0Verteilte Software-Systeme im Kontext von Industrie 4.0
Verteilte Software-Systeme im Kontext von Industrie 4.0Peter Tröger
 

Mehr von Peter Tröger (20)

WannaCry - An OS course perspective
WannaCry - An OS course perspectiveWannaCry - An OS course perspective
WannaCry - An OS course perspective
 
Cloud Standards and Virtualization
Cloud Standards and VirtualizationCloud Standards and Virtualization
Cloud Standards and Virtualization
 
Distributed Resource Management Application API (DRMAA) Version 2
Distributed Resource Management Application API (DRMAA) Version 2Distributed Resource Management Application API (DRMAA) Version 2
Distributed Resource Management Application API (DRMAA) Version 2
 
OpenSubmit - How to grade 1200 code submissions
OpenSubmit - How to grade 1200 code submissionsOpenSubmit - How to grade 1200 code submissions
OpenSubmit - How to grade 1200 code submissions
 
Design of Software for Embedded Systems
Design of Software for Embedded SystemsDesign of Software for Embedded Systems
Design of Software for Embedded Systems
 
Humans should not write XML.
Humans should not write XML.Humans should not write XML.
Humans should not write XML.
 
What activates a bug? A refinement of the Laprie terminology model.
What activates a bug? A refinement of the Laprie terminology model.What activates a bug? A refinement of the Laprie terminology model.
What activates a bug? A refinement of the Laprie terminology model.
 
Dependable Systems - Summary (16/16)
Dependable Systems - Summary (16/16)Dependable Systems - Summary (16/16)
Dependable Systems - Summary (16/16)
 
Dependable Systems - Hardware Dependability with Redundancy (14/16)
Dependable Systems - Hardware Dependability with Redundancy (14/16)Dependable Systems - Hardware Dependability with Redundancy (14/16)
Dependable Systems - Hardware Dependability with Redundancy (14/16)
 
Dependable Systems - System Dependability Evaluation (8/16)
Dependable Systems - System Dependability Evaluation (8/16)Dependable Systems - System Dependability Evaluation (8/16)
Dependable Systems - System Dependability Evaluation (8/16)
 
Dependable Systems - Structure-Based Dependabiilty Modeling (6/16)
Dependable Systems - Structure-Based Dependabiilty Modeling (6/16)Dependable Systems - Structure-Based Dependabiilty Modeling (6/16)
Dependable Systems - Structure-Based Dependabiilty Modeling (6/16)
 
Dependable Systems -Software Dependability (15/16)
Dependable Systems -Software Dependability (15/16)Dependable Systems -Software Dependability (15/16)
Dependable Systems -Software Dependability (15/16)
 
Dependable Systems -Reliability Prediction (9/16)
Dependable Systems -Reliability Prediction (9/16)Dependable Systems -Reliability Prediction (9/16)
Dependable Systems -Reliability Prediction (9/16)
 
Dependable Systems -Fault Tolerance Patterns (4/16)
Dependable Systems -Fault Tolerance Patterns (4/16)Dependable Systems -Fault Tolerance Patterns (4/16)
Dependable Systems -Fault Tolerance Patterns (4/16)
 
Dependable Systems - Introduction (1/16)
Dependable Systems - Introduction (1/16)Dependable Systems - Introduction (1/16)
Dependable Systems - Introduction (1/16)
 
Dependable Systems -Dependability Means (3/16)
Dependable Systems -Dependability Means (3/16)Dependable Systems -Dependability Means (3/16)
Dependable Systems -Dependability Means (3/16)
 
Dependable Systems - Hardware Dependability with Diagnosis (13/16)
Dependable Systems - Hardware Dependability with Diagnosis (13/16)Dependable Systems - Hardware Dependability with Diagnosis (13/16)
Dependable Systems - Hardware Dependability with Diagnosis (13/16)
 
Dependable Systems -Dependability Attributes (5/16)
Dependable Systems -Dependability Attributes (5/16)Dependable Systems -Dependability Attributes (5/16)
Dependable Systems -Dependability Attributes (5/16)
 
Dependable Systems -Dependability Threats (2/16)
Dependable Systems -Dependability Threats (2/16)Dependable Systems -Dependability Threats (2/16)
Dependable Systems -Dependability Threats (2/16)
 
Verteilte Software-Systeme im Kontext von Industrie 4.0
Verteilte Software-Systeme im Kontext von Industrie 4.0Verteilte Software-Systeme im Kontext von Industrie 4.0
Verteilte Software-Systeme im Kontext von Industrie 4.0
 

Kürzlich hochgeladen

THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Celine George
 
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...JojoEDelaCruz
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Seán Kennedy
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4JOYLYNSAMANIEGO
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptxiammrhaywood
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfErwinPantujan2
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)lakshayb543
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYKayeClaireEstoconing
 

Kürzlich hochgeladen (20)

Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
 

OpenHPI - Parallel Programming Concepts - Week 4

  • 1. Parallel Programming Concepts OpenHPI Course Week 4 : Accelerators Unit 4.1: Accelerate Now! Frank Feinbube + Teaching Team
  • 2. Summary: Week 3 ■ Short overview of shared memory parallel programming ideas ■ Different levels of abstractions □ Process model, thread model, task model ■ Threads for concurrency and parallelization □ Standardized POSIX interface □ Java / .NET concurrency functionality ■ Tasks for concurrency and parallelization □ OpenMP for C / C++, Java, .NET, Cilk, … ■ Functional language constructs for implicit parallelism ■ PGAS languages for NUMA optimization 2 Specialized languages help the programmer to achieve speedup. What about accordingly specialized parallel hardware? OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 3. What is specialized hardware? OpenHPI | Parallel Programming Concepts | Frank Feinbube 3
  • 4. OpenHPI | Parallel Programming Concepts | Frank Feinbube 4 What is specialized hardware? ■ Graphics cards: speed up rendering tasks ■ Sound cards: play sounds ■ Physics cards ? What else could be sped up by hardware? ■ Encryption ■ Compression ■ Codec parsing ■ Protocols and formats: XML ■ Text, regular expression matching ■ …
  • 5. Wide Variety of Accelerators OpenHPI | Parallel Programming Concepts | Frank Feinbube 5 Best Super Computer: Tianhe-2 (MilkyWay-2) 2nd Titan
  • 6. Wide Variety of Applications OpenHPI | Parallel Programming Concepts | Frank Feinbube 6 Fluids NBody RadixSort
  • 7. Wide Variety of Application Domains OpenHPI | Parallel Programming Concepts | Frank Feinbube 7 BioInformatics Computational Chemistry Computational Finance Computational Fluid Dynamics Computational Structural Mechanics Data Science Defense Electronic Design Automation Imaging & Computer Vision Medical Imaging Numerical Analytics Weather and Climate http://www.nvidia.com/object/gpu-applications-domain.html
8
Short Term View: Cheap Performance
Energy / Price
■ Cheap to buy and to maintain
■ GFLOPS per watt: Fermi 1.5 / Kepler 5 / Maxwell 15 (2014)
Performance
[Chart: execution time in milliseconds over problem size (number of Sudoku places) for an Intel E8500 CPU, an AMD R800 GPU and an NVIDIA GT200 GPU; lower means faster]
GPU: Graphics Processing Unit (CPU of a graphics card)
OpenHPI | Parallel Programming Concepts | Frank Feinbube
9
What is 15 GFlops?
15 000 000 000 floating point operations / s
■ A single Maxwell GPU will have more performance than the fastest supercomputer of 2001
■ Your computer / notebook (thanks to GPUs)
OpenHPI | Parallel Programming Concepts | Frank Feinbube
10
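To make numbers like 15 GFLOPS tangible, the following sketch (an illustration, not part of the slides; the function name and iteration count are arbitrary choices) estimates the floating point throughput of a plain Python loop. The gap of several orders of magnitude between this and a GPU's rate is exactly the point of the comparison.

```python
import time

def measure_flops(n=10_000_000):
    """Estimate floating point operations per second with a simple
    multiply-add loop (2 floating point operations per iteration)."""
    x = 1.0
    start = time.perf_counter()
    for _ in range(n):
        x = x * 1.0000001 + 0.0000001  # one multiply + one add
    elapsed = time.perf_counter() - start
    return (2 * n) / elapsed

if __name__ == "__main__":
    print(f"Pure Python achieves roughly {measure_flops() / 1e6:.0f} MFLOPS")
```

Even optimized native code reaches only a few GFLOPS per CPU core, while GPUs of the Kepler/Maxwell generation deliver teraflops of aggregate throughput.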
  • 11. Middle Term View: Even More Performance OpenHPI | Parallel Programming Concepts | Frank Feinbube 11
  • 12. Middle Term View: Even More Performance OpenHPI | Parallel Programming Concepts | Frank Feinbube 12
  • 13. Long Term View: Acceleration Everywhere Dealing with massively multi-core: ■ Accelerators (APUs) that accompany common general purpose CPUs (Hybrid Systems) Hybrid Systems (Accelerators + CPUs) ■ GPU Compute Devices: High Performance Computing (top 2 supercomputers are accelerator-based!), Business Servers, Home/Desktop Computers, Mobile and Embedded Systems ■ Special-Purpose Accelerators: (de)compression, XML parsing, (en|de)cryption, regular expression matching OpenHPI | Parallel Programming Concepts | Frank Feinbube 13
  • 14. How do they get so fast? OpenHPI | Parallel Programming Concepts | Frank Feinbube 14
  • 15. Three Ways Of Doing Anything Faster [Pfister] ■ Work harder (clock speed) → power wall problem → memory wall problem ■ Work smarter (optimization, caching) → ILP wall problem → memory wall problem ■ Get help (parallelization) □ More cores per single CPU □ Software needs to exploit them in the right way → memory wall problem [Figure: one problem distributed across multiple CPU cores] 15 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 16. Accelerators bypass the walls ■ CPUs are general purpose □ Need to support all types of software / programming models □ Need to support a large variety of legacy software → that makes it hard to do something against the walls ■ Accelerators are special purpose □ Need to support only a limited subset of software / programming models □ Legacy software is usually even more limited and strict → that lessens the impact of the walls (memory & power wall) → stronger parallelization and better speedup possible OpenHPI | Parallel Programming Concepts | Frank Feinbube 16
  • 17. CPU and Accelerator trends CPU → evolving towards throughput computing → motivated by energy-efficient performance Accelerator → evolving towards general-purpose computing → motivated by higher quality graphics and data-parallel programming OpenHPI | Parallel Programming Concepts | Frank Feinbube 17 [Figure: CPU and accelerator converge on the throughput–programmability spectrum — CPUs: multi-threading → multi-core → many-core; accelerators: fixed function → partially programmable → fully programmable; examples: NVIDIA Kepler, Intel MIC]
  • 18. Task Parallelism and Data Parallelism OpenHPI | Parallel Programming Concepts | Frank Feinbube 18 Input Data Parallel Processing Result Data „CPU-style“ „GPU-style“
  • 19. OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger Flynn‘s Taxonomy (1966) ■ Classify parallel hardware architectures according to their capabilities in the instruction and data processing dimension Single Instruction, Single Data (SISD) Single Instruction, Multiple Data (SIMD) 19 Processing Step Instruction Data Item Output Processing Step Instruction Data Items Output Multiple Instruction, Single Data (MISD) Processing Step Instructions Data Item Output Multiple Instruction, Multiple Data (MIMD) Processing Step Instructions Data Items Output
  • 20. Parallel Programming Concepts OpenHPI Course Week 4 : Accelerators Unit 4.2: Accelerator Technology Frank Feinbube + Teaching Team
  • 21. OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger Flynn‘s Taxonomy (1966) ■ Classify parallel hardware architectures according to their capabilities in the instruction and data processing dimension Single Instruction, Single Data (SISD) Single Instruction, Multiple Data (SIMD) 21 Processing Step Instruction Data Item Output Processing Step Instruction Data Items Output Multiple Instruction, Single Data (MISD) Processing Step Instructions Data Item Output Multiple Instruction, Multiple Data (MIMD) Processing Step Instructions Data Items Output
  • 22. Why SIMD? OpenHPI | Parallel Programming Concepts | Frank Feinbube 22
  • 23. History of GPU Computing • 1980s-1990s; configurable, not programmable; first APIs (DirectX, OpenGL); Vertex Processing Fixed Function Graphic Pipelines • Since 2001: APIs for Vertex Shading, Pixel Shading and access to texture; DirectX9 Programmable Real- Time Graphics • 2006: NVIDIAs G80; unified processors arrays; three programmable shading stages; DirectX10 Unified Graphics and Computing Processors OpenHPI | Parallel Programming Concepts | Frank Feinbube 23
  • 24. From Fixed Function Pipeline To Programmable Shading Stages GPUs pre-2006: separate Vertex Shader and Pixel Shader stages. DirectX 10 idea: add a Geometry Shader → with fixed stages, we would need a bigger card :/ NVIDIA G80 solution: one unified array of Programmable Shaders (Vertex/Geometry/Pixel). This design allows: smaller card / more power OpenHPI | Parallel Programming Concepts | Frank Feinbube 24
  • 25. History of GPU Computing • 1980s-1990s; configurable, not programmable; first APIs (DirectX, OpenGL); Vertex Processing Fixed Function Graphic Pipelines • Since 2001: APIs for Vertex Shading, Pixel Shading and access to texture; DirectX9 Programmable Real- Time Graphics • 2006: NVIDIAs G80; unified processors arrays; three programmable shading stages; DirectX10 Unified Graphics and Computing Processors • compute problem as native graphic operations; algorithms as shaders; data in textures General Purpose GPU (GPGPU) OpenHPI | Parallel Programming Concepts | Frank Feinbube 25
  • 26. General Purpose GPU (GPGPU) OpenHPI | Parallel Programming Concepts | Frank Feinbube 26 Example formula: B > (1/n) · Σᵢ₌₁ⁿ xᵢ — here each texel value is tested against the mean of all values. Data texture: 2.7 1.8 2.8 1.8 2.8 4.5 9.0 4.5 → the formula runs as a Programmable Shader (Vertex/Geometry/Pixel) on the Graphics Processing Unit (GPU) → result texture: 0 0 0 0 0 1 1 1
  • 27. History of GPU Computing • 1980s-1990s; configurable, not programmable; first APIs (DirectX, OpenGL); Vertex Processing Fixed Function Graphic Pipelines • Since 2001: APIs for Vertex Shading, Pixel Shading and access to texture; DirectX9 Programmable Real- Time Graphics • 2006: NVIDIAs G80; unified processors arrays; three programmable shading stages; DirectX10 Unified Graphics and Computing Processors • compute problem as native graphic operations; algorithms as shaders; data in textures General Purpose GPU (GPGPU) • Programming CUDA; shaders programmable; load and store instructions; barriers; atomicsGPU Computing OpenHPI | Parallel Programming Concepts | Frank Feinbube 27
  • 28. What’s the difference to CPUs? OpenHPI | Parallel Programming Concepts | Frank Feinbube 28
  • 29. Task Parallelism and Data Parallelism OpenHPI | Parallel Programming Concepts | Frank Feinbube 29 Input Data Parallel Processing Result Data „CPU-style“ „GPU-style“
  • 30. CPU vs. GPU Architecture CPU („multi-core“): ■ A few heavyweight threads ■ Branch prediction GPU („many-core“): ■ 1000+ light-weight threads ■ Memory latency hiding OpenHPI | Parallel Programming Concepts | Frank Feinbube 30 [Figure: CPU die dominated by control logic and cache around a few processing elements vs. GPU die packed with many small processing elements; both attached to DRAM]
  • 31. GPU Threads are different GPU threads ■ Execute exactly the same instruction (line of code) □ They share the program counter  No branching! Memory Latency Hiding ■ GPU holds thousands of threads ■ If some access the slow memory, others will run while they wait ■ No context switch □ Enough registers to store all the threads all the time OpenHPI | Parallel Programming Concepts | Frank Feinbube 31
  • 32. What does it look like? OpenHPI | Parallel Programming Concepts | Frank Feinbube 32
  • 33. GPU Hardware in Detail: the NVIDIA GF100 OpenHPI | Parallel Programming Concepts | Frank Feinbube 33
  • 34. GF100 OpenHPI | Parallel Programming Concepts | Frank Feinbube 34 GF100 L2 Cache
  • 35. GF100 OpenHPI | Parallel Programming Concepts | Frank Feinbube 35 GF100
  • 36. GF100 OpenHPI | Parallel Programming Concepts | Frank Feinbube 36 GF100
  • 37. GF100 OpenHPI | Parallel Programming Concepts | Frank Feinbube 37 … GF100
  • 38. GF100 OpenHPI | Parallel Programming Concepts | Frank Feinbube 38 … GF100
  • 39. It’s a jungle out there! OpenHPI | Parallel Programming Concepts | Frank Feinbube 39
  • 40. GPU Computing Platforms (Excerpt) AMD R700, R800, R900, HD 7000, HD 8000, Rx 200 NVIDIA G80, G92, GT200, GF100, GK110 Geforce, Quadro, Tesla, ION OpenHPI | Parallel Programming Concepts | Frank Feinbube 40
  • 41. Compute Capability by version Plus: varying amounts of cores, global memory sizes, bandwidth, clock speeds (core, memory), bus width, memory access penalties … OpenHPI | Parallel Programming Concepts | Frank Feinbube 41 ■ double precision floating point operations: No (1.0–1.2), Yes (1.3+) ■ caches: No (1.x), Yes (2.x+) ■ max # concurrent kernels: 1 (1.x), 8 (2.x+); Dynamic Parallelism (3.5) ■ max # threads per block: 512 (1.x), 1024 (2.x+) ■ max # warps per MP: 24 (1.0/1.1), 32 (1.2/1.3), 48 (2.x), 64 (3.x) ■ max # threads per MP: 768 (1.0/1.1), 1024 (1.2/1.3), 1536 (2.x), 2048 (3.x) ■ register count (32 bit): 8192 (1.0/1.1), 16384 (1.2/1.3), 32768 (2.x), 65536 (3.x) ■ max shared mem per MP: 16 KB (1.x), 16/48 KB (2.x), 16/32/48 KB (3.x) ■ # shared memory banks: 16 (1.x), 32 (2.x+)
  • 42. Intel Xeon Phi: Hardware 60 cores based on the P54C architecture (Pentium) ■ > 1.0 GHz clock speed; 64-bit x86 instructions + SIMD ■ 30 MB L2 cache in total (= 512 KB per core) + 64 KB L1 (cache coherency) ■ 8 (to 32) GB of GDDR5 ■ 4 hardware threads per core (240 logical cores) □ Not multi-core / Hyper-Threading □ Think graphics-card hardware threads □ Only one runs at a time = memory latency hiding □ Switched after each instruction!! → use 120 or 240 threads for the 60 cores ■ 512-bit wide VPU with the new vector ISA (KCi) ■ No support for MMX, SSE or AVX ■ Can handle 8 double-precision / 16 single-precision floats ■ Always structured in vectors with 16 elements OpenHPI | Parallel Programming Concepts | Frank Feinbube 42
  • 43. Intel Xeon Phi: Operating System ■ Runs a minimal, embedded Linux ■ Provides the Linux Standard Base (LSB) core libraries ■ Implements a BusyBox minimal shell environment OpenHPI | Parallel Programming Concepts | Frank Feinbube 43
  • 44. Parallel Programming Concepts OpenHPI Course Week 4 : Accelerators Unit 4.3: Open Compute Language (OpenCL) Frank Feinbube + Teaching Team
  • 45. Simple Example: Vector Addition OpenHPI | Parallel Programming Concepts | Frank Feinbube 45 c = (a1, a2, a3, a4) + (b1, b2, b3, b4) = ?
  • 46. Open Compute Language (OpenCL) OpenHPI | Parallel Programming Concepts | Frank Feinbube 46 ■ AMD / ATI: merged, needed commonality across products ■ NVIDIA: GPU vendor — wants to steal market share from the CPU ■ Apple: was tired of recoding for many-core and GPUs; pushed vendors to standardize; wrote a draft straw man API ■ Intel: CPU vendor — wants to steal market share from the GPU ■ Khronos Compute Group formed, joined by Ericsson, Nokia, IBM, Sony, Blizzard, Texas Instruments, …
  • 47. OpenCL Platform Model ■ OpenCL exposes CPUs, GPUs, and other Accelerators as “devices” ■ Each “device” contains one or more “compute units”, i.e. cores, SMs,... ■ Each “compute unit” contains one or more SIMD “processing elements” OpenHPI | Parallel Programming Concepts | Frank Feinbube 47
  • 48. Terminology OpenHPI | Parallel Programming Concepts | Frank Feinbube 48 Mapping CPU terms to OpenCL terms (platform level | memory level | execution level): ■ SMP system | Main Memory | Process ↔ Compute Device | Global and Constant Memory | Index Range (NDRange) ■ Processor | – | – ↔ Compute Unit | Local Memory | Work Group ■ Core | Registers, Thread Local Storage | Thread ↔ Processing Element | Registers, Private Memory | Work Items (+Kernels)
  • 49. The BIG idea behind OpenCL OpenCL execution model … execute a kernel at each point in a problem domain. E.g., process a 1024 x 1024 image with one kernel invocation per pixel or 1024 x 1024 = 1,048,576 kernel executions OpenHPI | Parallel Programming Concepts | Frank Feinbube 49 Traditional Loops Data Parallel OpenCL
  • 50. Building and Executing OpenCL Code OpenHPI | Parallel Programming Concepts | Frank Feinbube 50 The code of one or more kernels is compiled at runtime into a binary representation for each target device (GPU binary, CPU binary, …). OpenCL code must be prepared to deal with much greater hardware diversity (features are optional and may not be supported on all devices) → compile code that is tailored to the device configuration
  • 51. OpenCL Execution Model An OpenCL kernel is executed by an array of work items. ■ All work items run the same code (SPMD) ■ Each work item has an index that it uses to compute memory addresses and make control decisions OpenHPI | Parallel Programming Concepts | Frank Feinbube 51 Work item ids (threads): 0 1 2 3 4 5 6 7
  • 52. Work Groups: Scalable Cooperation Divide the monolithic work item array into work groups ■ Work items within a work group cooperate via shared memory, atomic operations and barrier synchronization ■ Work items in different work groups cannot cooperate OpenHPI | Parallel Programming Concepts | Frank Feinbube 52 Work item ids (threads): 0–7 in work group 0, 8–15 in work group 1
  • 53. OpenCL Execution Model ■ Parallel work is submitted to devices by launching kernels ■ Kernels run over global dimension index ranges (NDRange), broken up into “work groups”, and “work items” ■ Work items executing within the same work group can synchronize with each other with barriers or memory fences ■ Work items in different work groups can’t sync with each other, except by launching a new kernel OpenHPI | Parallel Programming Concepts | Frank Feinbube 53
  • 54. OpenCL Execution Model An example of an NDRange index space showing work-items, their global IDs and their mapping onto the pair of work-group and local IDs. OpenHPI | Parallel Programming Concepts | Frank Feinbube 54
  • 55. Terminology OpenHPI | Parallel Programming Concepts | Frank Feinbube 55 Mapping CPU terms to OpenCL terms (platform level | memory level | execution level): ■ SMP system | Main Memory | Process ↔ Compute Device | Global and Constant Memory | Index Range (NDRange) ■ Processor | – | – ↔ Compute Unit | Local Memory | Work Group ■ Core | Registers, Thread Local Storage | Thread ↔ Processing Element | Registers, Private Memory | Work Items (+Kernels)
  • 56. OpenCL Memory Architecture ■ Private: per work item ■ Local: shared within a work group ■ Global / Constant: visible to all work groups ■ Host memory: on the CPU OpenHPI | Parallel Programming Concepts | Frank Feinbube 56
  • 57. Terminology OpenHPI | Parallel Programming Concepts | Frank Feinbube 57 Mapping CPU terms to OpenCL terms (platform level | memory level | execution level): ■ SMP system | Main Memory | Process ↔ Compute Device | Global and Constant Memory | Index Range (NDRange) ■ Processor | – | – ↔ Compute Unit | Local Memory | Work Group ■ Core | Registers, Thread Local Storage | Thread ↔ Processing Element | Registers, Private Memory | Work Items (+Kernels)
  • 58. OpenCL Memory Architecture ■ Memory management is explicit: you must move data from host → global → local… and back OpenHPI | Parallel Programming Concepts | Frank Feinbube 58 ■ Global Memory (__global): shared by all work items; read/write; may be cached (modern GPUs), else slow; huge ■ Private Memory (__private): for local variables; per work item; may be mapped onto global memory (arrays on GPUs) ■ Local Memory (__local): shared between work items of a work group; may be mapped onto global memory (non-GPU), else fast; small ■ Constant Memory (__constant): read-only, cached; GPUs add a special kind: texture memory
  • 59. OpenCL Work Item Code A subset of ISO C99 - without some C99 features ■ headers, function pointers, recursion, variable length arrays, and bit fields A superset of ISO C99 with additions for ■ Work-items and workgroups ■ Vector types (2,4,8,16): endian safe, aligned at vector length ■ Image types mapped to texture memory ■ Synchronization ■ Address space qualifiers Also includes a large set of built-in functions for image manipulation, work-item manipulation, specialized math routines, vectors, etc. OpenHPI | Parallel Programming Concepts | Frank Feinbube 59
  • 60. Vector Addition: Kernel ■ Kernel body is instantiated once for each work item; each getting an unique index » Code that actually executes on target devices OpenHPI | Parallel Programming Concepts | Frank Feinbube 60
  • 61. Vector Addition: Host Program OpenHPI | Parallel Programming Concepts | Frank Feinbube 61 [5]
  • 62. Vector Addition: Host Program OpenHPI | Parallel Programming Concepts | Frank Feinbube 62 [5] „standard“ overhead for an OpenCL program
  • 63. Development Support Software development kits: NVIDIA and AMD; Windows and Linux Special libraries: AMD Core Math Library, BLAS and FFT libraries by NVIDIA, OpenNL for numerics and CULA for linear algebra; NVIDIA Performance Primitives library: a collection of common GPU-accelerated algorithms Profiling and debugging tools: ■ NVIDIA's Parallel Nsight for Microsoft Visual Studio ■ AMD's ATI Stream Profiler ■ AMD's Stream KernelAnalyzer: displays GPU assembler code, detects execution bottlenecks ■ gDEBugger (platform-independent) Big knowledge bases with tutorials, examples, articles, show cases, and developer forums OpenHPI | Parallel Programming Concepts | Frank Feinbube 63
  • 64. Nsight OpenHPI | Parallel Programming Concepts | Frank Feinbube 64
  • 65. Parallel Programming Concepts OpenHPI Course Week 4 : Accelerators Unit 4.4: Optimizations Frank Feinbube + Teaching Team
  • 66. The Power of GPU Computing [Chart: execution time in milliseconds over problem size (number of Sudoku places) for Intel E8500 CPU, AMD R800 GPU and NVIDIA GT200 GPU — less is better] OpenHPI | Parallel Programming Concepts | Frank Feinbube 66 → big performance gains for small problem sizes
  • 67. The Power of GPU Computing [Chart: execution time in milliseconds over larger problem sizes (number of Sudoku places) for Intel E8500 CPU, AMD R800 GPU and NVIDIA GT200 GPU — less is better] OpenHPI | Parallel Programming Concepts | Frank Feinbube 67 → small/moderate performance gains for large problem sizes → further optimizations needed
  • 68. Best Practices for Performance Tuning ■ Algorithm Design: Asynchronous, Recompute, Simple ■ Memory Transfer: Chaining, Overlap Transfer & Compute ■ Control Flow: Divergent Branching, Predication ■ Memory Types: Local Memory as Cache, rare resource ■ Memory Access: Coalescing, Bank Conflicts ■ Sizing: Execution Size, Evaluation ■ Instructions: Shifting, Fused Multiply, Vector Types ■ Precision: Native Math Functions, Build Options OpenHPI | Parallel Programming Concepts | Frank Feinbube 68
  • 69. Divergent Branching and Predication Divergent Branching ■ Flow control instructions (if, switch, do, for, while) can result in different execution paths → under data-parallel execution, varying execution paths are serialized → threads converge back to the same execution path after completion Branch Predication ■ Instructions are associated with a per-thread condition code (predicate) □ All instructions are scheduled for execution □ Predicate true: executed normally □ Predicate false: do not write results, do not evaluate addresses, do not read operands ■ The compiler may use branch predication for if or switch statements ■ Unroll loops yourself (or use #pragma unroll for NVIDIA) OpenHPI | Parallel Programming Concepts | Frank Feinbube 69
  • 70. Use Caching: Local, Texture, Constant Local Memory ■ Memory latency roughly 100x lower than global memory latency ■ Small, no coalescing problems, prone to memory bank conflicts Texture Memory ■ 2-dimensionally cached, read-only ■ Can be used to avoid uncoalesced loads from global memory ■ Used with the image data type Constant Memory ■ Linear cache, read-only, 64 KB ■ As fast as reading from a register for the same address ■ Can be used for big lists of input arguments OpenHPI | Parallel Programming Concepts | Frank Feinbube 70 [Figure: memory address layout 0–3, 64–67, 128–131, 192–195, …]
  • 71. Sizing: What is the right execution layout? ■ Local work item count should be a multiple of native execution size (NVIDIA 32, AMD 64, MIC 16), but not too big ■ Number of work groups should be multiple of the number of multiprocessors (hundreds or thousands of work groups) ■ Can be configured in 1-, 2- or 3-dimensional layout: consider access patterns and caching ■ Balance between latency hiding and resource utilization ■ Experimenting is required! OpenHPI | Parallel Programming Concepts | Frank Feinbube 71 [4]
  • 72. Instructions and Precision ■ Single precision floats provide best performance ■ Use shift operations to avoid expensive division and modulo calculations ■ Special compiler flags ■ AMD has a native vector type implementation; NVIDIA's is scalar ■ Use the native math library whenever speed trumps precision OpenHPI | Parallel Programming Concepts | Frank Feinbube 72 Throughput per clock cycle: ■ single-precision floating-point add, multiply, and multiply-add: 8 operations ■ single-precision reciprocal, reciprocal square root, and native_logf(x): 2 operations ■ native_sin, native_cos, native_exp: 1 operation
  • 73. Coalesced Memory Accesses Simple access pattern ■ Can be fetched in a single 64-byte transaction (red rectangle) ■ Could also be permuted * Sequential but misaligned access ■ If it falls into a single 128-byte segment: one 128-byte transaction, else: one 64-byte transaction + one 32-byte transaction * Strided accesses ■ Depending on the stride, from 1 (here) up to 16 transactions * (* 16 transactions with compute capability 1.1) OpenHPI | Parallel Programming Concepts | Frank Feinbube 73 [6]
  • 74. Intel Xeon Phi: OpenCL ■ What's different from GPUs? □ Bigger thread granularity + no need for local shared memory ■ Work groups are mapped to threads □ 240 OpenCL hardware threads handle the work groups □ More than 1000 work groups recommended □ Each thread executes one work group ■ Implicit vectorization by the compiler of the innermost loop (dimension zero of the NDRange) □ 16 elements per vector → dimension zero must be divisible by 16, otherwise scalar execution (good work group size = 16) OpenHPI | Parallel Programming Concepts | Frank Feinbube 74 Conceptually: __kernel ABC() { for (int i = 0; i < get_local_size(2); i++) for (int j = 0; j < get_local_size(1); j++) for (int k = 0; k < get_local_size(0); k++) Kernel_Body; }
  • 75. Intel Xeon Phi: OpenCL ■ Non uniform branching in a work group has significant overhead ■ Non vector size aligned memory access has significant overhead ■ Non linear access patterns have significant overhead ■ Manual prefetching can be advantageous ■ No hardware support for barriers ■ Further reading: Intel® SDK for OpenCL Applications XE 2013 Optimization Guide for Linux OS OpenHPI | Parallel Programming Concepts | Frank Feinbube 75 Especially for dimension zero
  • 76. Parallel Programming Concepts OpenHPI Course Week 4 : Accelerators Unit 4.5: Future Trends Frank Feinbube + Teaching Team
  • 77. Towards new Platforms WebCL [Draft] http://www.khronos.org/webcl/ ■ JavaScript binding to OpenCL ■ Heterogeneous parallel computing (CPUs + GPUs) within web browsers ■ Enables compute-intensive programs like physics engines, video editing… ■ Currently only available with add-ons (Node.js, Firefox, WebKit) Android installable client driver extension (ICD) ■ Enables OpenCL implementations to be discovered and loaded as a shared object on Android systems. OpenHPI | Parallel Programming Concepts | Frank Feinbube 77
  • 78. Towards new Applications: Dealing with Unstructured Grids Too coarse Too fine OpenHPI | Parallel Programming Concepts | Frank Feinbube 78 vs.
  • 79. Towards new Applications: Dealing with Unstructured Grids Fixed Grid Dynamic Grid OpenHPI | Parallel Programming Concepts | Frank Feinbube 79
  • 80. Towards new Applications: Dynamic Parallelism OpenHPI | Parallel Programming Concepts | Frank Feinbube 80 CPU manages execution GPU manages execution
  • 81. OpenHPI | Parallel Programming Concepts | Frank Feinbube 81
  • 82. Towards new Programming Models: OpenACC GPU Computing OpenHPI | Parallel Programming Concepts | Frank Feinbube 82 Copy arrays to GPU Parallelize loop with GPU kernels Data automatically copied back at end of region
  • 83. Towards new Programming Models: OpenACC GPU Computing OpenHPI | Parallel Programming Concepts | Frank Feinbube 83
  • 84. Hybrid System OpenHPI | Parallel Programming Concepts | Frank Feinbube 84 [Figure: hybrid node — two multi-core CPUs with their own RAM connected via QPI, plus GPU and MIC accelerator cards, each with dedicated RAM]
  • 85. Summary: Week 4 ■ Accelerators promise big speedups for data parallel applications □ SIMD execution model (no branching) □ Memory latency hiding with 1000s of light-weight threads ■ Enormous diversity → OpenCL as a standard programming model for all □ Uniform terms: Compute Device, Compute Unit, Processing Element □ Idea: loop parallelism with index ranges □ Kernels are written in C; compiled at runtime; executed in parallel □ Complex memory hierarchy, overhead to copy data from the CPU ■ Getting fast is easy, getting faster is hard □ Best practices for accelerators □ Knowledge about hardware characteristics necessary ■ Future: faster, more features, more platforms, better programmability 85 Multi-core, many-core, … What if my computational problem still demands more power? OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger