Bolt C++ Standard Template Library for HSA by Ben Sander, AMD
2. BOLT: A C++ TEMPLATE LIBRARY FOR HSA
Ben Sander
AMD
Senior Fellow
3. MOTIVATION
§ Improve developer productivity
– Optimized library routines for common GPU operations
– Works with open standards (OpenCL™ and C++ AMP)
– Distributed as open source
§ Make GPU programming as easy as CPU programming
– Resemble familiar C++ Standard Template Library
– Customizable via C++ template parameters
– Leverage high-performance shared virtual memory
§ Optimize for HSA
– Single source base for GPU and CPU
– Platform Load Balancing
3 | BOLT | June 2012
4. AGENDA
§ Introduction and Motivation
§ Bolt Code Examples for C++ AMP and OpenCL™
§ ISV Proof Point
§ Single source code base for CPU and GPU
§ Platform Load Balancing
§ Summary
5. SIMPLE BOLT EXAMPLE
#include <bolt/sort.h>
#include <vector>
#include <algorithm>
#include <cstdlib>
int main()
{
// generate random data (on host)
std::vector<int> a(1000000);
std::generate(a.begin(), a.end(), rand);
// sort, run on best device
bolt::sort(a.begin(), a.end());
}
§ Interface similar to familiar C++ Standard Template Library
§ No explicit mention of C++ AMP or OpenCL™ (or GPU!)
– More advanced use cases allow the programmer to supply a kernel in C++ AMP or OpenCL™
§ Direct use of host data structures (e.g. std::vector)
§ bolt::sort implicitly runs on the platform
– Runtime automatically selects CPU or GPU (or both)
7. BOLT FOR C++ AMP : LEVERAGING C++11 LAMBDA
#include <bolt/transform.h>
#include <vector>
int main()
{
const float a=100;
std::vector<float> x(1000000); // initialization not shown
std::vector<float> y(1000000); // initialization not shown
std::vector<float> z(1000000);
// saxpy with C++ Lambda
bolt::transform(x.begin(), x.end(), y.begin(), z.begin(),
[=] (float xx, float yy) restrict(cpu, amp) {
return a * xx + yy;
});
}
§ Functor (“a * xx + yy”) now specified inline
§ Can capture variables from surrounding scope (“a”) – eliminate boilerplate class
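For comparison, the same saxpy can be written against the standard library alone. The sketch below is a CPU-only illustration, not Bolt code: the helper name saxpy_cpu is hypothetical, and it uses only std::transform. It is meant to show how closely the bolt::transform call shape follows the STL one, including the lambda that captures “a”.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical CPU-only helper (not a Bolt API): z = a * x + y, element-wise.
std::vector<float> saxpy_cpu(float a, const std::vector<float>& x,
                             const std::vector<float>& y) {
    assert(x.size() == y.size());  // both inputs must have the same length
    std::vector<float> z(x.size());
    // Binary std::transform: two input ranges, one output range, one functor.
    // The lambda captures "a" by value, so no separate functor class is needed.
    std::transform(x.begin(), x.end(), y.begin(), z.begin(),
                   [=](float xx, float yy) { return a * xx + yy; });
    return z;
}
```

Calling saxpy_cpu(100.0f, x, y) returns a new vector holding 100 * x[i] + y[i] for each element.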
8. BOLT FOR OPENCL™
#include <clbolt/sort.h>
#include <vector>
#include <algorithm>
#include <cstdlib>
int main()
{
// generate random data (on host)
std::vector<int> a(1000000);
std::generate(a.begin(), a.end(), rand);
// sort, run on best device
clbolt::sort(a.begin(), a.end());
}
§ Interface similar to familiar C++ Standard Template Library
§ clbolt uses OpenCL™ below the API level
– Host data copied or mapped to the GPU
– First call to clbolt::sort will generate and compile a kernel
§ More advanced use cases allow the programmer to supply a kernel in OpenCL™
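The "compile on first call" behaviour can be pictured as a small cache keyed by kernel name. This is a minimal sketch under stated assumptions, not clbolt's implementation: KernelCache and CompiledKernel are hypothetical names, and the "compile" step is a stand-in that only stores the source text and counts how often it runs.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in for a compiled kernel; a real OpenCL program object would go here.
struct CompiledKernel {
    std::string source;
};

// Hypothetical cache (not a Bolt API): builds each kernel once, on first use.
class KernelCache {
public:
    const CompiledKernel& get(const std::string& name, const std::string& source) {
        auto it = cache_.find(name);
        if (it == cache_.end()) {
            assert(!source.empty());  // nothing to compile otherwise
            ++compile_count_;         // the expensive step happens only here
            it = cache_.emplace(name, CompiledKernel{source}).first;
        }
        return it->second;
    }

    int compile_count() const { return compile_count_; }

private:
    std::map<std::string, CompiledKernel> cache_;
    int compile_count_ = 0;
};
```

With this shape, a second request for the same kernel name returns the cached entry, so only the first call pays the compile cost.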
9. BOLT FOR OPENCL™ : USER-SPECIFIED FUNCTOR
#include <clbolt/transform.h>
#include <vector>
BOLT_FUNCTOR(SaxpyFunctor,
struct SaxpyFunctor
{
float _a;
SaxpyFunctor(float a) : _a(a) {};
float operator() (const float &xx, const float &yy)
{
return _a * xx + yy;
};
};
);
void main2() {
SaxpyFunctor s(100);
std::vector<float> x(1000000); // initialization not shown
std::vector<float> y(1000000); // initialization not shown
std::vector<float> z(1000000);
clbolt::transform(x.begin(), x.end(), y.begin(), z.begin(), s);
};
§ Challenge: OpenCL™ split-source model
– Host code in C or C++
– OpenCL™ code specified in strings
§ Solution:
– BOLT_FUNCTOR macro creates both host-side and string versions of “SaxpyFunctor” class definition
§ Class name (“SaxpyFunctor”) stored in TypeName trait
§ OpenCL™ kernel code (SaxpyFunctor class def) stored in ClCode trait.
– Clbolt function implementation
§ Can retrieve traits from class name
§ Uses TypeName and ClCode to construct a customized transform kernel
§ First call to clbolt::transform compiles the kernel
– Advanced users can directly create ClCode trait
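One way a macro can produce both a host-side class and a string copy of it is preprocessor stringification. The sketch below is an illustration of that technique only; it is not the real BOLT_FUNCTOR definition. SKETCH_FUNCTOR and make_transform_source are hypothetical names, the TypeName and ClCode templates here are simplified stand-ins named after the traits on the slide, and the "kernel" text is a placeholder comment.

```cpp
#include <cassert>
#include <string>

// Simplified stand-ins for the TypeName and ClCode traits (assumed shape).
template <typename T> struct TypeName { static std::string get() { return std::string(); } };
template <typename T> struct ClCode   { static std::string get() { return std::string(); } };

// Hypothetical macro: emits the class for the host compiler, then records
// its name and its source text as trait specializations.
#define SKETCH_FUNCTOR(T, ...)                                                      \
    __VA_ARGS__                                                                     \
    template <> struct TypeName<T> { static std::string get() { return #T; } };     \
    template <> struct ClCode<T> { static std::string get() { return #__VA_ARGS__; } };

SKETCH_FUNCTOR(SaxpyFunctor,
struct SaxpyFunctor {
    float _a;
    SaxpyFunctor(float a) : _a(a) {}
    float operator()(const float& xx, const float& yy) const { return _a * xx + yy; }
};
)

// A library function can look the traits up from the functor type alone and
// assemble the source it would hand to the OpenCL compiler.
template <typename F>
std::string make_transform_source() {
    assert(!TypeName<F>::get().empty());  // functor must be declared via the macro
    return ClCode<F>::get() + "\n/* transform kernel using " + TypeName<F>::get() + " goes here */\n";
}
```

The functor stays usable as ordinary host code (SaxpyFunctor s(100); s(1, 2)), while ClCode<SaxpyFunctor>::get() returns the same definition as text.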
10. BOLT: C++ AMP VS. OPENCL™
BOLT for C++ AMP
§ C++ template library for HSA
– Developer can customize data types and operations
– Provide library of optimized routines for AMD GPUs.
§ C++ Host Language
§ Kernels marked with “restrict(cpu, amp)”
§ Kernels written in C++ AMP kernel language
– Restricted set of C++
§ Kernels compiled at compile-time
§ C++ Lambda Syntax Supported
§ Functors may contain array_view
§ Parameters can use host data structures (e.g. std::vector)
§ Parameters can be array or array_view types
§ Use “bolt” namespace
BOLT for OpenCL™
§ C++ template library for HSA
– Developer can customize data types and operations
– Provide library of optimized routines for AMD GPUs.
§ C++ Host Language
§ Kernels marked with “BOLT_FUNCTOR” macro
§ Kernels written in OpenCL™ kernel language
– Subset of C99, with extensions (e.g. vectors, builtins)
§ Kernels compiled at runtime, on first call
– Some compile errors shown on first call
§ C++11 Lambda Syntax NOT supported
§ Functors may not contain pointers
§ Parameters can use host data structures (e.g. std::vector)
§ Parameters can be cl::Buffer or cl_buffer types
§ Use “clbolt” namespace
11. BOLT : WHAT’S NEW?
§ Optimized template library routines for common GPU functions
– For OpenCL™ and C++ AMP, across multiple platforms
§ Direct interfaces to host memory structures (e.g. std::vector)
– Leverage HSA unified address space and zero-copy memory
– C++ AMP array and cl::Buffer also supported if memory already on device
§ Bolt submits to the entire platform rather than a specific device
– Runtime automatically selects the device
– Provides opportunities for load-balancing
– Provides optimal CPU path if no GPU is available.
– Overriding the selection to target a specific accelerator is supported
– Enables developers to fearlessly move to the GPU
§ Bolt will contain new APIs optimized for HSA Devices
– Multi-device bolt::pipeline, bolt::parallel_filter
12. EXEMPLARY ISV PROOF-POINT
§ “Hessian” kernel from “MotionDSP Ikena”
– Commercially available video enhancement software
– Optimized for CPU and GPU
§ Basic Hessian Algorithm
– Two input images I and W
– Transform, followed by reduce (“transform_reduce”)
§ For each pixel in image, compute 14 float coefficients
§ Sum the coefficients for all the pixels – final result is 14 floats
– Complex, computationally intense, real-world algorithm
§ Developed multiple implementations of Hessian kernel
– CPU, GPU, Bolt
Hessian Algorithm Pseudo Code:
// x,y are coordinates of pixel to transform
// Pixel difference:
It = W(y, x) - I(y, x);
// average left/right pixels:
Ix = 0.5f * ( W(y, x+1) - W(y, x-1) );
// average top/bottom pixels:
Iy = 0.5f * ( W(y+1, x) - W(y-1, x) );
X = x dist of this pixel from center
Y = y dist of this pixel from center
…
// Compute for each pixel:
H[ 0] = (Ix*X+Iy*Y) * (Ix*X+Iy*Y)
H[ 1] = (Ix*X-Iy*Y) * (Ix*X+Iy*Y)
H[ 2] = (Ix*X-Iy*Y) * (Ix*X-Iy*Y)
H[ 3] = (Ix ) * (Ix*X+Iy*Y)
H[ 4] = (Ix ) * (Ix*X-Iy*Y)
H[ 5] = (Ix ) * (Ix )
H[ 6] = (Iy ) * (Ix*X+Iy*Y)
H[ 7] = (Iy ) * (Ix*X-Iy*Y)
H[ 8] = (Iy ) * (Ix )
H[ 9] = (Iy ) * (Iy )
H[10] = (It ) * (Ix*X+Iy*Y)
H[11] = (It ) * (Ix*X-Iy*Y)
H[12] = (It ) * (Ix )
H[13] = (It ) * (Iy )
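The transform-then-reduce shape of the Hessian kernel can be sketched in plain serial C++. This is not the MotionDSP or Bolt implementation. It follows only the pseudocode lines shown on the slide, and it makes several assumptions that are labeled in the comments: the Image type is hypothetical, "center" is taken to be the geometric center of the image, border pixels are skipped so the neighbor reads stay in bounds, and whatever the slide elides with "…" is not reproduced.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Coeffs = std::array<float, 14>;

// Hypothetical row-major image type, indexed as img(y, x) like the pseudocode.
struct Image {
    int width;
    int height;
    std::vector<float> px;
    float operator()(int y, int x) const {
        return px[static_cast<std::size_t>(y) * width + x];
    }
};

// Transform step: the 14 coefficients for one pixel.
Coeffs hessian_at(const Image& I, const Image& W, int y, int x) {
    const float It = W(y, x) - I(y, x);
    const float Ix = 0.5f * (W(y, x + 1) - W(y, x - 1));
    const float Iy = 0.5f * (W(y + 1, x) - W(y - 1, x));
    // Assumption: distance from the geometric center of the image.
    const float X = x - 0.5f * (W.width - 1);
    const float Y = y - 0.5f * (W.height - 1);
    const float p = Ix * X + Iy * Y;  // (Ix*X+Iy*Y)
    const float m = Ix * X - Iy * Y;  // (Ix*X-Iy*Y)
    return Coeffs{{p * p,  m * p,  m * m,  Ix * p,  Ix * m,  Ix * Ix, Iy * p,
                   Iy * m, Iy * Ix, Iy * Iy, It * p,  It * m,  It * Ix, It * Iy}};
}

// Reduce step: sum the per-pixel coefficients into 14 floats.
// Assumption: border pixels are skipped because they lack a full neighborhood.
Coeffs hessian_sum(const Image& I, const Image& W) {
    assert(I.width == W.width && I.height == W.height);
    Coeffs sum{};
    for (int y = 1; y + 1 < W.height; ++y) {
        for (int x = 1; x + 1 < W.width; ++x) {
            const Coeffs h = hessian_at(I, W, y, x);
            for (std::size_t k = 0; k < sum.size(); ++k) sum[k] += h[k];
        }
    }
    return sum;
}
```

The split into hessian_at (per-pixel transform) and hessian_sum (reduction) mirrors the “transform_reduce” pattern named above; a parallel library would keep the first function and replace the loops.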
13. LINES-OF-CODE AND PERFORMANCE FOR DIFFERENT PROGRAMMING MODELS
(Exemplary ISV “Hessian” Kernel)
[Chart: lines of code (LOC axis, 0 to 350) and relative performance (axis, 0 to 35.00) for Serial CPU, TBB, Intrinsics+TBB, OpenCL™-C, OpenCL™-C++, C++ AMP, and HSA Bolt. Legend: Copy-back, Algorithm, Launch, Copy, Compile, Init, Performance.]
13 | The Programmer’s Guide to a Universe of Possibility | June 12, 2012
14. PERFORMANCE PORTABILITY - INTRODUCTION
§ For many algorithms, core operation same between CPU and GPU
– See sort, saxpy, hessian examples
– Same Core Operation
– Differences in how data is routed to the core operation
§ Bolt hides the device-specific routing details inside the library function implementation
– GPU implementations:
§ GPU-friendly data strides
§ Launch enough threads to hide memory latency
§ Group Memory and work-group communication
– CPU implementations:
§ CPU-friendly data strides
§ Launch enough threads to use all cores
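The idea of "same core operation, different routing" can be sketched with two loops that apply one functor but visit the data in different orders. This is an illustration under stated assumptions, not Bolt's implementation: transform_chunked and transform_strided are hypothetical names, and the "workers" are emulated one after another in a single thread so the access patterns are easy to see.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// CPU-style routing: each worker walks one contiguous chunk of the input.
template <typename Op>
void transform_chunked(const std::vector<float>& in, std::vector<float>& out,
                       std::size_t workers, Op op) {
    assert(workers > 0 && in.size() == out.size());
    const std::size_t chunk = (in.size() + workers - 1) / workers;
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t end = std::min(in.size(), (w + 1) * chunk);
        for (std::size_t i = w * chunk; i < end; ++i) out[i] = op(in[i]);
    }
}

// GPU-style routing: neighbouring workers touch neighbouring elements,
// so worker w handles elements w, w + workers, w + 2 * workers, ...
template <typename Op>
void transform_strided(const std::vector<float>& in, std::vector<float>& out,
                       std::size_t workers, Op op) {
    assert(workers > 0 && in.size() == out.size());
    for (std::size_t w = 0; w < workers; ++w) {
        for (std::size_t i = w; i < in.size(); i += workers) out[i] = op(in[i]);
    }
}
```

Both functions take the same op and produce the same output; only the index pattern differs, which is the part a library can hide from the caller.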
15. PERFORMANCE PORTABILITY – RESULTS
[Chart: CPU performance vs. programming model (Exemplary ISV “Hessian” kernel). Relative performance axis from 0.00 to 4.50; categories: Serial CPU, TBB CPU, OpenCL (CPU), HSA Bolt (CPU).]
16. PERFORMANCE PORTABILITY – WHAT’S NEW?
§ New GPU programming models are close to CPU programming models
– C++ AMP : Single-source, (restricted) C++11 kernel language, high-quality debugger/profiler, etc
§ Shared Virtual Memory in HSA
– Removes tedious copies between address spaces
– Will allow use of complex pointer-containing data structures
§ Fewer performance cliffs in modern GPU architectures (e.g. AMD GCN)
– Reduce need for GPU-specific optimizations in core operation
– Example: 14:7:1 Bandwidth Ratio for Group:Cache:Global Memory
§ Autovectorization
– Modern compilers include auto-vectorization support
– Restrictions of GPU programming models facilitate vectorization
§ Finally, Bolt functors can provide device-specific implementations if needed
17. HSA LOAD BALANCING : KEY FEATURES AND OBSERVATIONS
§ High-performance shared virtual memory
– Developers no longer have to worry about data location (i.e. device vs. host)
§ HSA platforms have tightly integrated CPU and GPU
– GPU better at wide vector parallelism, extracting memory bandwidth, latency hiding
– CPU better at fine-grained vector parallelism, cache-sensitive code, control-flow
§ Bolt Abstractions
– Provides insight into the characteristics of the algorithm
§ Reduce vs Transform vs parallel_filter
– Abstraction above the details of a “kernel launch”
§ Don’t need to specify device, workgroup shape, work-items, number of kernels, etc
§ Runtime may optimize these for the platform
§ Bolt has access to both optimized CPU and GPU implementations, at the same time
– Let’s use both!
18. EXAMPLES OF HSA LOAD-BALANCING
§ Data Size
– Description: Run large data sizes on GPU, small on CPU.
– Exemplary use cases: Same call-site used for varying data sizes.
§ Reduction
– Description: Run initial reduction phases on GPU, run final stages on CPU.
– Exemplary use cases: Any reduction operation.
§ Border/Edge Optimization
– Description: Run wide center regions on GPU, run border regions on CPU.
– Exemplary use cases: Image processing.
§ Platform Super-Device
– Description: Distribute workgroups to available processing units on the entire platform.
– Exemplary use cases: Kernel has similar performance/energy on CPU and GPU.
§ Heterogeneous Pipeline
– Description: Run a pipelined series of user-defined stages. Stages can be CPU-only, GPU-only, or CPU or GPU.
– Exemplary use cases: Video processing pipeline.
§ Parallel_filter
– Description: GPU scans all candidates and rejects early mismatches; CPU more deeply evaluates the survivors.
– Exemplary use cases: Haar detector, word search, audio search.
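The Reduction case above splits one reduction into a wide first phase and a short final phase. The sketch below shows that two-phase structure for a sum, in serial C++. It is an assumption-laden illustration rather than Bolt code: partial_sums and final_sum are hypothetical names, and the group size stands in for whatever work-group shape a runtime would pick.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Phase 1 (the wide phase, suited to the GPU): one partial sum per group.
std::vector<float> partial_sums(const std::vector<float>& data, std::size_t group_size) {
    assert(group_size > 0);
    std::vector<float> partials;
    for (std::size_t start = 0; start < data.size(); start += group_size) {
        const std::size_t end = std::min(data.size(), start + group_size);
        partials.push_back(std::accumulate(
            data.begin() + static_cast<std::ptrdiff_t>(start),
            data.begin() + static_cast<std::ptrdiff_t>(end), 0.0f));
    }
    return partials;
}

// Phase 2 (the short tail, suited to the CPU): combine the few partial results.
float final_sum(const std::vector<float>& partials) {
    return std::accumulate(partials.begin(), partials.end(), 0.0f);
}
```

Because phase 1 shrinks the data by a factor of group_size, phase 2 works on a small array that is a natural fit for the CPU.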
19. HETEROGENEOUS PIPELINE
§ Mimics a traditional manufacturing assembly line
– Developer supplies a series of pipeline stages
– Each stage processes its input token, passes an output token to the next stage
– Stages can be either CPU-only, GPU-only, or CPU/GPU
§ CPU/GPU tasks are dynamically scheduled
– Use queue depth and estimated execution time to drive scheduling decision
– Adapt to variation in target hardware or system utilization
– Data location not an issue in HSA
– Leverage single source code
§ GPU kernels scheduled asynchronously
– Completion invokes next stage of the pipeline
§ Simple Video Pipeline Example:
– Video Decode (CPU-only) → Video Processing (CPU/GPU) → Video Render (GPU-only)
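A scheduling decision driven by queue depth and estimated execution time could look like the sketch below. This is one possible reading of the bullet above, not the Bolt or HSA runtime policy. DeviceLoad, estimated_finish, and pick_device are hypothetical names, and the cost model assumes every queued task on a device takes the same estimated time.

```cpp
#include <cassert>

enum class Device { Cpu, Gpu };

// Hypothetical per-device bookkeeping for one CPU/GPU stage.
struct DeviceLoad {
    int queue_depth;          // tasks already waiting on this device
    double est_task_seconds;  // estimated execution time of one task there
};

// Estimated time until a newly queued task would finish on this device.
double estimated_finish(const DeviceLoad& d) {
    assert(d.queue_depth >= 0 && d.est_task_seconds >= 0.0);
    return (d.queue_depth + 1) * d.est_task_seconds;
}

// Send the next token to whichever device is expected to finish it sooner.
Device pick_device(const DeviceLoad& cpu, const DeviceLoad& gpu) {
    return estimated_finish(gpu) <= estimated_finish(cpu) ? Device::Gpu : Device::Cpu;
}
```

Under this model a fast GPU with a long queue can lose to an idle CPU, which is the kind of adaptation to system utilization the slide describes.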
20. CASCADE DEPTH ANALYSIS
[Chart: cascade depth, axis from 0 to 25, with legend bins 0-5, 5-10, 10-15, 15-20, 20-25.]
21. PARALLEL_FILTER
§ Target applications with a “Filter” pattern
– Filter out a small number of results from a large initial pool of candidates
– Initial phases best run on GPU:
§ Large data sets (too big for caches), wide vector, high-bandwidth
– Tail phases best run on CPU
§ Smaller data sets (may fit in cache), divergent control flow, fine-grained vector width
– Examples: Haar detector, word search, acoustic search
§ Developer specifies:
– Execution Grid
– Iteration state type and initial value
– Filter function
§ Accepts a point to process and the current iteration state
§ Return True to continue processing or False to exit
§ BOLT / HSA Runtime
– Automatically hands off work between CPU and GPU
– Balances work by adjusting the split point between GPU and CPU
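The developer-facing pieces listed above (grid, iteration state, filter function) and the GPU-to-CPU hand-off can be sketched serially. This is not the bolt::parallel_filter API, whose signature the slides do not show. filter_sketch and FilterOutcome are hypothetical names, the grid is assumed to be one-dimensional, the split point is passed in by hand instead of being tuned by a runtime, and both phases run on the host here.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct FilterOutcome {
    std::size_t handed_off;     // candidates that survived the first phase
    std::vector<int> accepted;  // candidates that survived every iteration
};

// filter(point, state) returns true to continue processing, false to exit.
template <typename State, typename Filter>
FilterOutcome filter_sketch(int grid_size, State initial, int max_iters,
                            int split, Filter filter) {
    assert(0 <= split && split <= max_iters);

    // Phase 1 (the "GPU" phase): first `split` iterations on every point.
    std::vector<std::pair<int, State>> survivors;
    for (int p = 0; p < grid_size; ++p) {
        State s = initial;
        bool alive = true;
        for (int i = 0; i < split && alive; ++i) alive = filter(p, s);
        if (alive) survivors.emplace_back(p, s);  // state travels with the point
    }

    // Phase 2 (the "CPU" phase): remaining iterations on the survivors only.
    FilterOutcome out{survivors.size(), {}};
    for (auto& ps : survivors) {
        bool alive = true;
        for (int i = split; i < max_iters && alive; ++i) alive = filter(ps.first, ps.second);
        if (alive) out.accepted.push_back(ps.first);
    }
    return out;
}
```

Moving the split point later shifts more iterations into the first phase and leaves fewer survivors for the second, which is the knob the slide says the runtime adjusts to balance work.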
22. SUMMARY
§ Bolt: C++ Template Library
– Optimized GPU and HSA Library routines
– Customizable via templates
– For both OpenCL™ and C++ AMP
§ Enjoy the unique advantages of the HSA Platform
– High-performance shared virtual memory
– Tightly integrated CPU and GPU
§ Enable advanced HSA features
– A single source base for CPU and GPU
– Platform load balancing across CPU and GPU