This document discusses designing architecture-aware libraries using Boost.Proto. It describes how the NT2 scientific computing library was redesigned using Boost.Proto to make it more extensible and able to better support new hardware architectures. The redesign segmented the evaluation of expressions into phases. Boost.Proto transforms are used in each phase to advance code generation. Hardware specifications influence function overloads through generalized tag dispatching, allowing the best function implementation to be selected for a given hardware architecture. This makes it possible to more easily add support for new optimization schemes and hardware targets to the library.
1. Designing Architecture-aware Library using Boost.Proto
Joel Falcou, Mathias Gaunard
Pierre Esterie, Eric Jourdanneau
LRI - University Paris Sud XI - CNRS
13/03/2012
6. Context
In Scientific Computing ...
there is Scientific
Applications are domain driven
Users = Developers
Users are reluctant to change
there is Computing
Computing requires performance ...
... which implies architecture-specific tuning
... which requires expertise
... which may or may not be available
The Problem
People using computers to do science want to do science first.
7. The Problem and how to solve it
The Facts
The "Library to bind them all" doesn't exist (or we would have it already)
The performance of new architectures is underused due to their inherent complexity
Few people are both experts in parallel programming and SomeOtherField
The Ends
Find a way to shield users from parallel architecture details
Find a way to shield developers from parallel architecture details
Isolate software from hardware evolution
The Means
Generic Programming
Template Meta-Programming
Embedded Domain Specific Languages
9. What's NT2?
A Scientific Computing Library
Provides a simple, Matlab-like interface for users
Provides high-performance computing entities and primitives
Easily extendable
A Research Platform
Simple framework for adding new optimization schemes
Test bench for EDSL development methodologies
Test bench for Generic Programming in real-life projects
13. The NT2 API
Principles
table<T,S> is a simple, multidimensional array object that exactly mimics Matlab array behavior and functionality
300+ functions usable directly either on table or on any scalar values, as in Matlab
How does it work
Take a .m file, copy it to a .cpp file
Add #include <nt2/nt2.hpp> and do cosmetic changes
Compile the file and link with libnt2.a
14. Matlab you said?

R = I(:,:,1);
G = I(:,:,2);
B = I(:,:,3);

Y = min(abs(0.299.*R + 0.587.*G + 0.114.*B), 235);
U = min(abs(-0.169.*R - 0.331.*G + 0.5.*B), 240);
V = min(abs(0.5.*R - 0.419.*G - 0.081.*B), 240);
15. Now with NT2

table<double> R = I(_,_,1);
table<double> G = I(_,_,2);
table<double> B = I(_,_,3);
table<double> Y, U, V;

Y = min(abs(0.299*R + 0.587*G + 0.114*B), 235);
U = min(abs(-0.169*R - 0.331*G + 0.5*B), 240);
V = min(abs(0.5*R - 0.419*G - 0.081*B), 240);
16. Now with NT2

table<float, settings(shallow_, of_size_<N,M>)> R = I(_,_,1);
table<float, settings(shallow_, of_size_<N,M>)> G = I(_,_,2);
table<float, settings(shallow_, of_size_<N,M>)> B = I(_,_,3);
table<float, of_size_<N,M>> Y, U, V;

Y = min(abs(0.299*R + 0.587*G + 0.114*B), 235);
U = min(abs(-0.169*R - 0.331*G + 0.5*B), 240);
V = min(abs(0.5*R - 0.419*G - 0.081*B), 240);
18. NT2 before Boost.Proto
EDSL Core
Code base was 11 KLOC
8 KLOC was dedicated to the various Expression Template glue
3 KLOC of actual smart stuff
Architectural support
Altivec and SSE2 extension support
Some vague pthread support
Efforts were stagnant: adding a simple feature required multiple KLOC of code to change
20. Embedded Domain Specific Languages
What's an EDSL?
DSL = Domain Specific Language
Declarative language, easy to use, fitting the domain
EDSL = DSL within a general-purpose language
EDSL in C++
Relies on operator overloading abuse (Expression Templates)
Carries semantic information around code fragments
Generic implementations become self-aware of optimizations
21. Expression Templates

matrix x(h,w), a(h,w), b(h,w);
x = cos(a) + (b*a);

The right-hand side is captured as a meta-AST (a tree with root assign, leaves x, a, b, a) encoded as the type:

expr< assign
    , expr<matrix&>
    , expr< plus
          , expr<cos, expr<matrix&>>
          , expr<multiplies, expr<matrix&>, expr<matrix&>>
          >
    >(x,a,b);

Arbitrary Transforms applied on the meta-AST then generate code such as:

#pragma omp parallel for
for(int j=0; j<h; ++j)
{
  for(int i=0; i<w; ++i)
  {
    x(j,i) = cos(a(j,i)) + ( b(j,i) * a(j,i) );
  }
}
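The mechanism on this slide can be condensed into a compilable toy. This is a minimal sketch, not NT2's or Boost.Proto's actual implementation; the names (vec, plus_expr, cos_, is_expr) are made up for illustration. Each operator returns a lightweight node instead of a value, so the whole right-hand side becomes a type that is evaluated in one fused loop at assignment time:

```cpp
#include <cmath>
#include <cstddef>
#include <type_traits>
#include <vector>

// Expression nodes: hold references to operands, evaluate lazily per element.
template <class L, class R>
struct plus_expr {
    L const& l; R const& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

template <class L, class R>
struct mul_expr {
    L const& l; R const& r;
    double operator[](std::size_t i) const { return l[i] * r[i]; }
};

template <class E>
struct cos_expr {
    E const& e;
    double operator[](std::size_t i) const { return std::cos(e[i]); }
};

// Terminal: an actual container.
struct vec {
    std::vector<double> data;
    explicit vec(std::size_t n, double v = 0.0) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }

    // Assignment from any expression: ONE fused loop, no temporaries.
    template <class E>
    vec& operator=(E const& e) {
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
        return *this;
    }
};

// Trait so the operators below only match our expression types.
template <class T> struct is_expr : std::false_type {};
template <> struct is_expr<vec> : std::true_type {};
template <class L, class R> struct is_expr<plus_expr<L, R>> : std::true_type {};
template <class L, class R> struct is_expr<mul_expr<L, R>>  : std::true_type {};
template <class E> struct is_expr<cos_expr<E>> : std::true_type {};

template <class L, class R,
          class = std::enable_if_t<is_expr<L>::value && is_expr<R>::value>>
plus_expr<L, R> operator+(L const& l, R const& r) { return {l, r}; }

template <class L, class R,
          class = std::enable_if_t<is_expr<L>::value && is_expr<R>::value>>
mul_expr<L, R> operator*(L const& l, R const& r) { return {l, r}; }

template <class E, class = std::enable_if_t<is_expr<E>::value>>
cos_expr<E> cos_(E const& e) { return {e}; }   // cos_ avoids clashing with std::cos
```

With this in place, `x = cos_(a) + b * a;` builds exactly the kind of nested `expr<...>` type shown above and evaluates it in a single loop.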
22. Boost.Proto
What's Boost.Proto?
An EDSL for defining EDSLs in C++
Generalizes Expression Templates
An easy way to define and test EDSLs
EDSL = some Grammar Rules + some Semantics
Boost.Proto Benefits
Fast development process
Compiler guys are at ease: Grammar + Semantics + Code Generation process
Easily extendable through Transforms
Easy to handle: MPL- and Fusion-compatible
23. Why Boost.Proto?
Benefits
Scalability: keeps EDSL glue code size low
Extensibility: simplifies the design of new optimizations
Debugging: allows for meaningful error reporting
Remaining Challenges
Provide a generic "compilation" process
Select the best function definition w.r.t. the hardware
Allow for non-intrusive new pass definition
24. State of the art in EDSLs
Usual approaches and results
Eigen: supports SIMD arithmetic, threaded algorithms only
uBLAS: parallelism comes from potential back-ends
What's the problem?
Hardware considerations come after the ET code
Retro-fitting parallelism is hard
Need for a hardware-aware EDSL
Back to Generative Programming
25. Principles of Generative Programming
[Diagram: a Domain Specific Application Description is fed to a Translator, which assembles Parametric Sub-components from a Generative Component into a Concrete Application.]
27. Our Approach
Principles
Segment EDSL evaluation into phases
Each phase uses Proto transforms to advance code generation
Hardware specifications are active elements in function overloading
Use Generalized Tag Dispatching (GTD) to select the proper function implementation
Advantages
Hardware can influence part or all of the evaluation process
Proto transform syntax is easy to use
GTD increases opportunities for optimization
30. Generalized Tag Dispatching
Principles
Tag dispatching uses type tags to overload functions
Classical example: standard iterators
We augment the system with:
A tag computed from the function identifier
A tag computed from the hardware configuration

namespace tag { struct plus_ : elementwise_<plus_> {}; }

functor<tag::plus_> callee;
callee(some_a, some_b);
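The "classical example: standard iterators" named above is worth spelling out, since GTD builds on exactly this mechanism. Here is a self-contained std::advance-style sketch (my_advance is a made-up name): the iterator's category tag, a type, selects the best overload at compile time. GTD augments the same idea with a tag for the function itself and a tag for the target hardware:

```cpp
#include <forward_list>
#include <iterator>
#include <vector>

// Generic O(n) fallback: any forward iterator can be incremented.
template <class It>
void my_advance_impl(It& it, int n, std::forward_iterator_tag) {
    while (n-- > 0) ++it;
}

// O(1) specialization: random-access iterators support +=.
template <class It>
void my_advance_impl(It& it, int n, std::random_access_iterator_tag) {
    it += n;
}

// Entry point: builds the tag from the iterator type and dispatches.
template <class It>
void my_advance(It& it, int n) {
    my_advance_impl(it, n,
        typename std::iterator_traits<It>::iterator_category{});
}
```

A vector iterator takes the O(1) path, a forward_list iterator the O(n) path, with no runtime test; the slides' `dispatch<Site, Tag(Args...)>` plays the role of the tag parameter here.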
31. Generalized Tag Dispatching
Dispatching combines the hardware site, the function tag and the argument hierarchies:

auto functor<Tag, Site>::operator()(A&&... args)
{
  dispatch< hierarchy_of<Site>::type
          , hierarchy_of<Tag>::type( hierarchy_of<A>::type... )
          > callee;
  return callee( args... );
}
32. Generalized Tag Dispatching

template<class A0>
struct dispatch< tag::sse_
               , tag::plus_( simd_<real_<float_<A0>>>
                           , simd_<real_<float_<A0>>>
                           )
               >
{
  A0 operator()(A0&& a, A0&& b)
  {
    return _mm_add_ps(a, b);
  }
};
33. Generalized Tag Dispatching

template<class S, class R, class L, class N, class A0>
struct dispatch< openmp_<S>, reduction_<R,L,N>( ast_<A0> ) >
{
  A0::value_type operator()(A0&& ast)
  {
    auto result = functor<N,S>()( as_<A0::value_type>() );
    #pragma omp parallel
    {
      auto local = functor<N,S>()( as_<A0::value_type>() );
      #pragma omp for nowait
      for(int i=0; i<size(ast); ++i)
        functor<R,S>()( local, ast(i) );

      #pragma omp critical
      functor<L,S>()( result, local );
    }

    return result;
  }
};
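The dispatch above decomposes a reduction into three functors: N builds the neutral element, R folds one value into a thread-local accumulator, and L merges a local accumulator into the shared result (the step guarded by `#pragma omp critical`). A plain-C++ sketch of that decomposition, with sequential chunks standing in for OpenMP threads and the name reduce_nrl invented here:

```cpp
#include <cstddef>
#include <vector>

// N/R/L reduction sketch: each chunk builds its own neutral accumulator,
// reduces its elements into it, then links the partial result into the
// shared total. Under OpenMP each chunk would run on its own thread.
template <class N, class R, class L>
int reduce_nrl(std::vector<int> const& data, N neutral, R red, L link) {
    int result = neutral();                       // shared result
    auto chunk = [&](std::size_t lo, std::size_t hi) {
        int local = neutral();                    // per-"thread" accumulator
        for (std::size_t i = lo; i < hi; ++i)
            red(local, data[i]);                  // element-wise reduce (R)
        link(result, local);                      // merge partial result (L)
    };
    std::size_t mid = data.size() / 2;
    chunk(0, mid);                                // each chunk = one thread
    chunk(mid, data.size());
    return result;
}
```

Splitting the operation this way is what lets one generic dispatch implement sum, product, min, etc.: only the three functors change.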
37. Multipass EDSL Code Generation
Principles
Don't walk the AST directly
Capture where hardware introduces extension points
Use GTD to select the proper phase implementation
Structure
Optimize: AST-to-AST transforms
Schedule: AST-to-Forest-of-Code-Generators transforms
Run: Code-Generator-to-Code transforms
40. Multipass EDSL Code Generation
Optimize
AST pattern matching using Proto grammars
Replace large sequences of operations by equivalent, faster kernels
E.g.: a+b*c to gemm or fma, x = inv(a)*b to solve(a,x,b)
Schedule
Segment the AST into loop-compatible loop functors
Use the AST node hierarchy (and GTD) for splitting
Run
Generate code for each AST in the scheduled forest
Potentially generate runtime calls for runtime optimization
Classical EDSL code generation happens here, on the correct hardware
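The Optimize phase's a+b*c-to-fma rewrite can be sketched without Boost.Proto. This is a hypothetical hand-written analogue (node names plus_node, times_node, fma_node are invented; NT2 expresses the same pattern as a Proto grammar rule, not as overloads): an AST-to-AST transform that matches the shape plus(x, times(y, z)) and returns a fused node:

```cpp
#include <cmath>

// Tiny static AST: terminals and two binary node shapes.
struct term { double v; double eval() const { return v; } };

template <class L, class R>
struct plus_node  { L l; R r; double eval() const { return l.eval() + r.eval(); } };

template <class L, class R>
struct times_node { L l; R r; double eval() const { return l.eval() * r.eval(); } };

// Fused node: evaluates b*c + a in one step via std::fma.
template <class A, class B, class C>
struct fma_node {
    A a; B b; C c;
    double eval() const { return std::fma(b.eval(), c.eval(), a.eval()); }
};

// Default transform: leave the tree unchanged.
template <class T>
T optimize(T t) { return t; }

// Matched pattern: plus(x, times(y, z))  becomes  fma(x, y, z).
// Overload resolution picks this more specialized version when it fits.
template <class X, class Y, class Z>
fma_node<X, Y, Z> optimize(plus_node<X, times_node<Y, Z>> e) {
    return {e.l, e.r.l, e.r.r};
}
```

The key point carried over from the slides: the match happens on the type of the tree at compile time, so the rewrite costs nothing at runtime.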
41. The new NT2 Structure
State of the code base
Expression Template handling: 200-300 LOC
GTD handling: 1 KLOC
Actual features and function implementations: 10 KLOC
Evolution process
Adding a new site: 100-500 LOC
Adding a function: 10-100 LOC
Time for complete new architecture support: 1 to 3 weeks
43. Some results
Support for OpenMP
1 build system script
4 new overloads for the run phase
1 new allocator to prevent false sharing
Support for Next-Gen SIMD
AVX: 2 weeks of work, functional
AVX2: prototype ready, works in simulator
LarBn: prototype ready, works in simulator
45. OpenCL in action
Principle
Standard for manycore/multicore systems
Uses runtime kernels and JIT compilation
Example

const char* OpenCLSource[] = {
  "__kernel void VectorAdd(__global int* c, __global int* a, __global int* b, unsigned int nb)",
  "{",
  "  // Index of the elements to add \n",
  "  unsigned int n = get_global_id(0);",
  "  // Sum the n'th element of vectors a and b and store in c \n",
  "  c[n] = 0;",
  "  for(int i = 0; i < nb; i++) c[n] += (a[n] + b[n]);",
  "}"
};
47. From Proto to OpenCL
The Schedule pass
Split the AST as usual on the function hierarchy
Replace each AST by a functor generating the equivalent OpenCL
Compute a static hash of the tree to remember already-compiled expressions
The Run pass
Stream data at the beginning of each kernel
Handle generic calls and retrieval of compiled kernels
Fire asynchronous kernel calls over the forest
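The "static hash of the tree" works because, with expression templates, every distinct AST shape is a distinct C++ type. A hypothetical sketch of the resulting kernel cache (get_kernel, compile_kernel and the kernel struct are invented here; compile_kernel stands in for the real OpenCL JIT build step):

```cpp
#include <string>
#include <typeinfo>
#include <unordered_map>

struct kernel { std::string binary; };

// Counts how many times the (expensive) JIT step actually ran.
inline int& compile_count() { static int n = 0; return n; }

// Stand-in for clBuildProgram and friends.
inline kernel compile_kernel(std::string const& source) {
    ++compile_count();
    return kernel{"binary-of:" + source};
}

// One cache entry per expression TYPE: the typeid of the AST type plays
// the role of the static hash, so re-evaluating the same expression
// shape finds its already-compiled kernel.
template <class Expr>
kernel const& get_kernel(Expr const&, std::string const& source) {
    static std::unordered_map<std::string, kernel> cache;
    std::string key = typeid(Expr).name();
    auto it = cache.find(key);
    if (it == cache.end())                        // first run: compile
        it = cache.emplace(key, compile_kernel(source)).first;
    return it->second;                            // later runs: cache hit
}
```

This is why the "GPU First Run" timing on the results slide is two orders of magnitude slower than subsequent runs: only the first evaluation of an expression shape pays the JIT cost.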
48. Support for OpenCL
NT2 Example

int main()
{
  int nbpoints = 512000;
  table<float, settings(_1D)> xx( ofSize(nbpoints) );
  table<float, settings(_1D)> yy( ofSize(nbpoints) );

  for(int i = 1; i <= nbpoints; i++)
  {
    xx(i) = -nbpoints/2 + i;
    yy(i) = 0.0;
  }
  yy = 100*sin(8*3.14*xx/512000);

  std::cout << xx(1) << " " << yy(1);
}
49. Support for OpenCL
NT2 Example

int main()
{
  int nbpoints = 512000;
  table<float, settings(_1D, nt2::gpu)> xx( ofSize(nbpoints) );
  table<float, settings(_1D, nt2::gpu)> yy( ofSize(nbpoints) );

  for(int i = 1; i <= nbpoints; i++)
  {
    xx(i) = -nbpoints/2 + i;
    yy(i) = 0.0;
  }
  yy = 100*sin(8*3.14*xx/512000);

  std::cout << xx(1) << " " << yy(1);
}
50. Preliminary results for OpenCL
Generating some signals

for(int i = 1; i <= nbpoints; i++)
{
  xx(i) = -nbpoints/2 + i;
  yy(i) = 0.0;
}

yy = 100*sin(8*3.14*xx/512000);
yy = yy + 0.0001*xx;

for(int ii = 1; ii < 512000; ii += 15000)
{
  addGaussian(xx, yy, (float)ii, 1000);
  yy = 1 - yy;
}

yy = 1 - yy;
yy = yy*yy;
yy = yy*cos(xx);
51. Preliminary results for OpenCL
Generating some signals

template<typename A>
void addGaussian(A& xx, A& yy, float centre, float largeur)
{
  yy = yy + ( 100000/(val(largeur)*sqrt(2*3.1415)) )
          * exp( -(xx - val(centre))*(xx - val(centre))
                 /(2*val(largeur)*val(largeur))
               );
}
52. Preliminary results for OpenCL
Generating some signals
CPU 1 core: 520.2 ms
CPU 4 cores: 141.3 ms
CPU 4 cores + SIMD: 36.1 ms
GPU first run: 878.2 ms
GPU other runs: 13.1 ms
GPU manual OpenCL: 12.9 ms
Compares favorably to ViennaCL and AccelerEyes
55. Let's round this up!
Parallel Computing for Scientists
Parallel programming tools are required
EDSLs are an efficient way to design such tools
But hardware has to stay in the loop
Our claims
Software libraries built as Generic and Generative components can solve a large chunk of parallelism-related problems while being easy to use.
Like regular languages, EDSLs need information about the hardware system
Integrating hardware descriptions as Generic components increases tool portability and re-targetability
57. Current and Future Works
Recent activity
A public release of NT2 is planned "soon"
A Matlab-to-NT2 compiler has been designed
Prototype for single-source CUDA support in NT2
PhD started on MPI support in NT2
What we'll be cooking in the future
std::future is the future
Experimental research on NT2 for embedded systems
Toward a global generic approach to parallelism