This document discusses designing architecture-aware libraries using Boost.Proto. It describes how the NT2 scientific computing library was redesigned using Boost.Proto to make it more extensible and able to better support new hardware architectures. The redesign segmented the evaluation of expressions into phases. Boost.Proto transforms are used in each phase to advance code generation. Hardware specifications influence function overloads through generalized tag dispatching, allowing the best function implementation to be selected for a given hardware architecture. This makes it possible to more easily add support for new optimization schemes and hardware targets to the library.
1. Designing Architecture-aware Library using Boost.Proto
Joel Falcou, Mathias Gaunard
Pierre Esterie, Eric Jourdanneau
LRI - University Paris Sud XI - CNRS
13/03/2012
6. Context
In Scientific Computing ...
there is Scientific
Applications are domain driven
Users = Developers
Users are reluctant to change
there is Computing
Computing requires performance ...
... which implies architecture-specific tuning
... which requires expertise
... which may or may not be available
The Problem
People using computers to do science want to do science first.
7. The Problem and how to solve it
The Facts
The "Library to bind them all" doesn't exist (or we would have it already)
The performance of new architectures is underused due to their inherent complexity
Few people are both experts in parallel programming and SomeOtherField
The Ends
Find a way to shield users from parallel architecture details
Find a way to shield developers from parallel architecture details
Isolate software from hardware evolution
The Means
Generic Programming
Template Meta-Programming
Embedded Domain Specific Languages
9. What's NT2?
A Scientific Computing Library
Provides a simple, Matlab-like interface for users
Provides high-performance computing entities and primitives
Easily extendable
A Research Platform
Simple framework for adding new optimization schemes
Test bench for EDSL development methodologies
Test bench for Generic Programming in real-life projects
13. The NT2 API
Principles
table<T,S> is a simple, multidimensional array object that exactly mimics Matlab array behavior and functionality
300+ functions usable directly either on table or on any scalar values, as in Matlab
How does it work
Take a .m file, copy it to a .cpp file
Add #include <nt2/nt2.hpp> and do cosmetic changes
Compile the file and link with libnt2.a
14. Matlab you said?

R = I(:,:,1);
G = I(:,:,2);
B = I(:,:,3);

Y = min(abs(0.299.*R + 0.587.*G + 0.114.*B), 235);
U = min(abs(-0.169.*R - 0.331.*G + 0.5.*B), 240);
V = min(abs(0.5.*R - 0.419.*G - 0.081.*B), 240);
15. Now with NT2

table<double> R = I(_,_,1);
table<double> G = I(_,_,2);
table<double> B = I(_,_,3);
table<double> Y, U, V;

Y = min(abs(0.299*R + 0.587*G + 0.114*B), 235);
U = min(abs(-0.169*R - 0.331*G + 0.5*B), 240);
V = min(abs(0.5*R - 0.419*G - 0.081*B), 240);
16. Now with NT2

table<float, settings(shallow_, of_size_<N,M>)> R = I(_,_,1);
table<float, settings(shallow_, of_size_<N,M>)> G = I(_,_,2);
table<float, settings(shallow_, of_size_<N,M>)> B = I(_,_,3);
table<float, of_size_<N,M>> Y, U, V;

Y = min(abs(0.299*R + 0.587*G + 0.114*B), 235);
U = min(abs(-0.169*R - 0.331*G + 0.5*B), 240);
V = min(abs(0.5*R - 0.419*G - 0.081*B), 240);
18. NT2 before Boost.Proto
EDSL Core
Code base was 11 KLOC
8 KLOC was dedicated to the various Expression Template glue
3 KLOC of actual smart stuff
Architectural support
Altivec and SSE2 extension support
Some vague pthread support
Efforts were stagnant: adding a simple feature required multiple KLOC of code to change
20. Embedded Domain Specific Languages
What's an EDSL?
DSL = Domain Specific Language
Declarative language, easy to use, fitting the domain
EDSL = DSL within a general-purpose language
EDSL in C++
Relies on operator overloading abuse (Expression Templates)
Carries semantic information around code fragments
Generic implementations become self-aware of optimizations
21. Expression Templates

matrix x(h,w), a(h,w), b(h,w);
x = cos(a) + (b*a);

The right-hand side is captured as a meta-AST (a tree with root assign, leaves x, a, b, a) encoded as the type:

expr< assign
    , expr<matrix&>
    , expr< plus
          , expr<cos, expr<matrix&>>
          , expr<multiplies, expr<matrix&>, expr<matrix&>>
          >
    >(x,a,b);

Arbitrary Transforms applied on the meta-AST then generate code such as:

#pragma omp parallel for
for(int j=0; j<h; ++j)
{
  for(int i=0; i<w; ++i)
  {
    x(j,i) = cos(a(j,i)) + ( b(j,i) * a(j,i) );
  }
}
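The mechanism on this slide can be condensed into a compilable toy. This is a minimal sketch, not NT2's or Boost.Proto's actual implementation; the names (vec, plus_expr, cos_, is_expr) are made up for illustration. Each operator returns a lightweight node instead of a value, so the whole right-hand side becomes a type that is evaluated in one fused loop at assignment time:

```cpp
#include <cmath>
#include <cstddef>
#include <type_traits>
#include <vector>

// Expression nodes: hold references to operands, evaluate lazily per element.
template <class L, class R>
struct plus_expr {
    L const& l; R const& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

template <class L, class R>
struct mul_expr {
    L const& l; R const& r;
    double operator[](std::size_t i) const { return l[i] * r[i]; }
};

template <class E>
struct cos_expr {
    E const& e;
    double operator[](std::size_t i) const { return std::cos(e[i]); }
};

// Terminal: an actual container.
struct vec {
    std::vector<double> data;
    explicit vec(std::size_t n, double v = 0.0) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }

    // Assignment from any expression: ONE fused loop, no temporaries.
    template <class E>
    vec& operator=(E const& e) {
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
        return *this;
    }
};

// Trait so the operators below only match our expression types.
template <class T> struct is_expr : std::false_type {};
template <> struct is_expr<vec> : std::true_type {};
template <class L, class R> struct is_expr<plus_expr<L, R>> : std::true_type {};
template <class L, class R> struct is_expr<mul_expr<L, R>>  : std::true_type {};
template <class E> struct is_expr<cos_expr<E>> : std::true_type {};

template <class L, class R,
          class = std::enable_if_t<is_expr<L>::value && is_expr<R>::value>>
plus_expr<L, R> operator+(L const& l, R const& r) { return {l, r}; }

template <class L, class R,
          class = std::enable_if_t<is_expr<L>::value && is_expr<R>::value>>
mul_expr<L, R> operator*(L const& l, R const& r) { return {l, r}; }

template <class E, class = std::enable_if_t<is_expr<E>::value>>
cos_expr<E> cos_(E const& e) { return {e}; }   // cos_ avoids clashing with std::cos
```

With this in place, `x = cos_(a) + b * a;` builds exactly the kind of nested `expr<...>` type shown above and evaluates it in a single loop.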
22. Boost.Proto
What's Boost.Proto?
An EDSL for defining EDSLs in C++
Generalizes Expression Templates
An easy way to define and test EDSLs
EDSL = some Grammar Rules + some Semantics
Boost.Proto Benefits
Fast development process
Compiler guys are at ease: Grammar + Semantics + Code Generation process
Easily extendable through Transforms
Easy to handle: MPL- and Fusion-compatible
23. Why Boost.Proto?
Benefits
Scalability: keeps EDSL glue code size low
Extensibility: simplifies the design of new optimizations
Debugging: allows for meaningful error reporting
Remaining Challenges
Provide a generic "compilation" process
Select the best function definition w.r.t. the hardware
Allow for non-intrusive new pass definition
24. State of the art in EDSLs
Usual approaches and results
Eigen: supports SIMD arithmetic, threaded algorithms only
uBLAS: parallelism comes from potential back-ends
What's the problem?
Hardware considerations come after the ET code
Retro-fitting parallelism is hard
Need for a hardware-aware EDSL
Back to Generative Programming
25. Principles of Generative Programming
[Diagram: a Domain Specific Application Description is fed to a Translator, which assembles Parametric Sub-components from a Generative Component into a Concrete Application.]
27. Our Approach
Principles
Segment EDSL evaluation into phases
Each phase uses Proto transforms to advance code generation
Hardware specifications are active elements in function overloading
Use Generalized Tag Dispatching (GTD) to select the proper function implementation
Advantages
Hardware can influence part or all of the evaluation process
Proto transform syntax is easy to use
GTD increases opportunities for optimization
30. Generalized Tag Dispatching
Principles
Tag dispatching uses type tags to overload functions
Classical example: standard iterators
We augment the system with:
A tag computed from the function identifier
A tag computed from the hardware configuration

namespace tag { struct plus_ : elementwise_<plus_> {}; }

functor<tag::plus_> callee;
callee(some_a, some_b);
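The "classical example: standard iterators" named above is worth spelling out, since GTD builds on exactly this mechanism. Here is a self-contained std::advance-style sketch (my_advance is a made-up name): the iterator's category tag, a type, selects the best overload at compile time. GTD augments the same idea with a tag for the function itself and a tag for the target hardware:

```cpp
#include <forward_list>
#include <iterator>
#include <vector>

// Generic O(n) fallback: any forward iterator can be incremented.
template <class It>
void my_advance_impl(It& it, int n, std::forward_iterator_tag) {
    while (n-- > 0) ++it;
}

// O(1) specialization: random-access iterators support +=.
template <class It>
void my_advance_impl(It& it, int n, std::random_access_iterator_tag) {
    it += n;
}

// Entry point: builds the tag from the iterator type and dispatches.
template <class It>
void my_advance(It& it, int n) {
    my_advance_impl(it, n,
        typename std::iterator_traits<It>::iterator_category{});
}
```

A vector iterator takes the O(1) path, a forward_list iterator the O(n) path, with no runtime test; the slides' `dispatch<Site, Tag(Args...)>` plays the role of the tag parameter here.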
31. Generalized Tag Dispatching
Dispatching combines the hardware site, the function tag and the argument hierarchies:

auto functor<Tag, Site>::operator()(A&&... args)
{
  dispatch< hierarchy_of<Site>::type
          , hierarchy_of<Tag>::type( hierarchy_of<A>::type... )
          > callee;
  return callee( args... );
}
32. Generalized Tag Dispatching

template<class A0>
struct dispatch< tag::sse_
               , tag::plus_( simd_<real_<float_<A0>>>
                           , simd_<real_<float_<A0>>>
                           )
               >
{
  A0 operator()(A0&& a, A0&& b)
  {
    return _mm_add_ps(a, b);
  }
};
33. Generalized Tag Dispatching

template<class S, class R, class L, class N, class A0>
struct dispatch< openmp_<S>, reduction_<R,L,N>( ast_<A0> ) >
{
  A0::value_type operator()(A0&& ast)
  {
    auto result = functor<N,S>()( as_<A0::value_type>() );
    #pragma omp parallel
    {
      auto local = functor<N,S>()( as_<A0::value_type>() );
      #pragma omp for nowait
      for(int i=0; i<size(ast); ++i)
        functor<R,S>()( local, ast(i) );

      #pragma omp critical
      functor<L,S>()( result, local );
    }

    return result;
  }
};
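The dispatch above decomposes a reduction into three functors: N builds the neutral element, R folds one value into a thread-local accumulator, and L merges a local accumulator into the shared result (the step guarded by `#pragma omp critical`). A plain-C++ sketch of that decomposition, with sequential chunks standing in for OpenMP threads and the name reduce_nrl invented here:

```cpp
#include <cstddef>
#include <vector>

// N/R/L reduction sketch: each chunk builds its own neutral accumulator,
// reduces its elements into it, then links the partial result into the
// shared total. Under OpenMP each chunk would run on its own thread.
template <class N, class R, class L>
int reduce_nrl(std::vector<int> const& data, N neutral, R red, L link) {
    int result = neutral();                       // shared result
    auto chunk = [&](std::size_t lo, std::size_t hi) {
        int local = neutral();                    // per-"thread" accumulator
        for (std::size_t i = lo; i < hi; ++i)
            red(local, data[i]);                  // element-wise reduce (R)
        link(result, local);                      // merge partial result (L)
    };
    std::size_t mid = data.size() / 2;
    chunk(0, mid);                                // each chunk = one thread
    chunk(mid, data.size());
    return result;
}
```

Splitting the operation this way is what lets one generic dispatch implement sum, product, min, etc.: only the three functors change.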
37. Multipass EDSL Code Generation
Principles
Don't walk the AST directly
Capture where hardware introduces extension points
Use GTD to select the proper phase implementation
Structure
Optimize: AST-to-AST transforms
Schedule: AST-to-Forest-of-Code-Generators transforms
Run: Code-Generator-to-Code transforms
40. Multipass EDSL Code Generation
Optimize
AST pattern matching using Proto grammars
Replace large sequences of operations by equivalent, faster kernels
E.g.: a+b*c to gemm or fma, x = inv(a)*b to solve(a,x,b)
Schedule
Segment the AST into loop-compatible loop functors
Use the AST node hierarchy (and GTD) for splitting
Run
Generate code for each AST in the scheduled forest
Potentially generate runtime calls for runtime optimization
Classical EDSL code generation happens here, on the correct hardware
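The Optimize phase's a+b*c-to-fma rewrite can be sketched without Boost.Proto. This is a hypothetical hand-written analogue (node names plus_node, times_node, fma_node are invented; NT2 expresses the same pattern as a Proto grammar rule, not as overloads): an AST-to-AST transform that matches the shape plus(x, times(y, z)) and returns a fused node:

```cpp
#include <cmath>

// Tiny static AST: terminals and two binary node shapes.
struct term { double v; double eval() const { return v; } };

template <class L, class R>
struct plus_node  { L l; R r; double eval() const { return l.eval() + r.eval(); } };

template <class L, class R>
struct times_node { L l; R r; double eval() const { return l.eval() * r.eval(); } };

// Fused node: evaluates b*c + a in one step via std::fma.
template <class A, class B, class C>
struct fma_node {
    A a; B b; C c;
    double eval() const { return std::fma(b.eval(), c.eval(), a.eval()); }
};

// Default transform: leave the tree unchanged.
template <class T>
T optimize(T t) { return t; }

// Matched pattern: plus(x, times(y, z))  becomes  fma(x, y, z).
// Overload resolution picks this more specialized version when it fits.
template <class X, class Y, class Z>
fma_node<X, Y, Z> optimize(plus_node<X, times_node<Y, Z>> e) {
    return {e.l, e.r.l, e.r.r};
}
```

The key point carried over from the slides: the match happens on the type of the tree at compile time, so the rewrite costs nothing at runtime.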
41. The new NT2 Structure
State of the code base
Expression Template handling: 200-300 LOC
GTD handling: 1 KLOC
Actual features and function implementations: 10 KLOC
Evolution process
Adding a new site: 100-500 LOC
Adding a function: 10-100 LOC
Time for complete new architecture support: 1 to 3 weeks
43. Some results
Support for OpenMP
1 build system script
4 new overloads for the run phase
1 new allocator to prevent false sharing
Support for Next-Gen SIMD
AVX: 2 weeks of work, functional
AVX2: prototype ready, works in simulator
LarBn: prototype ready, works in simulator
45. OpenCL in action
Principle
Standard for manycore/multicore systems
Uses runtime kernels and JIT compilation
Example

const char* OpenCLSource[] = {
  "__kernel void VectorAdd(__global int* c, __global int* a, __global int* b, unsigned int nb)",
  "{",
  "  // Index of the elements to add \n",
  "  unsigned int n = get_global_id(0);",
  "  // Sum the n'th element of vectors a and b and store in c \n",
  "  c[n] = 0;",
  "  for(int i = 0; i < nb; i++) c[n] += (a[n] + b[n]);",
  "}"
};
47. From Proto to OpenCL
The Schedule pass
Split the AST as usual on the function hierarchy
Replace each AST by a functor generating the equivalent OpenCL
Compute a static hash of the tree to remember already-compiled expressions
The Run pass
Stream data at the beginning of each kernel
Handle generic calls and retrieval of compiled kernels
Fire asynchronous kernel calls over the forest
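The "static hash of the tree" works because, with expression templates, every distinct AST shape is a distinct C++ type. A hypothetical sketch of the resulting kernel cache (get_kernel, compile_kernel and the kernel struct are invented here; compile_kernel stands in for the real OpenCL JIT build step):

```cpp
#include <string>
#include <typeinfo>
#include <unordered_map>

struct kernel { std::string binary; };

// Counts how many times the (expensive) JIT step actually ran.
inline int& compile_count() { static int n = 0; return n; }

// Stand-in for clBuildProgram and friends.
inline kernel compile_kernel(std::string const& source) {
    ++compile_count();
    return kernel{"binary-of:" + source};
}

// One cache entry per expression TYPE: the typeid of the AST type plays
// the role of the static hash, so re-evaluating the same expression
// shape finds its already-compiled kernel.
template <class Expr>
kernel const& get_kernel(Expr const&, std::string const& source) {
    static std::unordered_map<std::string, kernel> cache;
    std::string key = typeid(Expr).name();
    auto it = cache.find(key);
    if (it == cache.end())                        // first run: compile
        it = cache.emplace(key, compile_kernel(source)).first;
    return it->second;                            // later runs: cache hit
}
```

This is why the "GPU First Run" timing on the results slide is two orders of magnitude slower than subsequent runs: only the first evaluation of an expression shape pays the JIT cost.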
48. Support for OpenCL
NT2 Example

int main()
{
  int nbpoints = 512000;
  table<float, settings(_1D)> xx( ofSize(nbpoints) );
  table<float, settings(_1D)> yy( ofSize(nbpoints) );

  for(int i = 1; i <= nbpoints; i++)
  {
    xx(i) = -nbpoints/2 + i;
    yy(i) = 0.0;
  }
  yy = 100*sin(8*3.14*xx/512000);

  std::cout << xx(1) << " " << yy(1);
}
49. Support for OpenCL
NT2 Example

int main()
{
  int nbpoints = 512000;
  table<float, settings(_1D, nt2::gpu)> xx( ofSize(nbpoints) );
  table<float, settings(_1D, nt2::gpu)> yy( ofSize(nbpoints) );

  for(int i = 1; i <= nbpoints; i++)
  {
    xx(i) = -nbpoints/2 + i;
    yy(i) = 0.0;
  }
  yy = 100*sin(8*3.14*xx/512000);

  std::cout << xx(1) << " " << yy(1);
}
50. Preliminary results for OpenCL
Generating some signals

for(int i = 1; i <= nbpoints; i++)
{
  xx(i) = -nbpoints/2 + i;
  yy(i) = 0.0;
}

yy = 100*sin(8*3.14*xx/512000);
yy = yy + 0.0001*xx;

for(int ii = 1; ii < 512000; ii += 15000)
{
  addGaussian(xx, yy, (float)ii, 1000);
  yy = 1 - yy;
}

yy = 1 - yy;
yy = yy*yy;
yy = yy*cos(xx);
51. Preliminary results for OpenCL
Generating some signals

template<typename A>
void addGaussian(A& xx, A& yy, float centre, float largeur)
{
  yy = yy + ( 100000/(val(largeur)*sqrt(2*3.1415)) )
          * exp( -(xx - val(centre))*(xx - val(centre))
                 /(2*val(largeur)*val(largeur))
               );
}
52. Preliminary results for OpenCL
Generating some signals
CPU 1 core: 520.2 ms
CPU 4 cores: 141.3 ms
CPU 4 cores + SIMD: 36.1 ms
GPU first run: 878.2 ms
GPU other runs: 13.1 ms
GPU manual OpenCL: 12.9 ms
Compares favorably to ViennaCL and AccelerEyes
55. Let's round this up!
Parallel Computing for Scientists
Parallel programming tools are required
EDSLs are an efficient way to design such tools
But hardware has to stay in the loop
Our claims
Software libraries built as Generic and Generative components can solve a large chunk of parallelism-related problems while being easy to use.
Like regular languages, EDSLs need information about the hardware system
Integrating hardware descriptions as Generic components increases tool portability and re-targetability
57. Current and Future Works
Recent activity
A public release of NT2 is planned "soon"
A Matlab-to-NT2 compiler has been designed
Prototype for single-source CUDA support in NT2
PhD started on MPI support in NT2
What we'll be cooking in the future
std::future is the future
Experimental research on NT2 for embedded systems
Toward a global generic approach to parallelism