SlideShare ist ein Scribd-Unternehmen logo
1 von 58
Downloaden Sie, um offline zu lesen
Designing Architecture-aware Library using Boost.Proto




                              Joel Falcou, Mathias Gaunard
                                    Pierre Esterie, Eric
                                       Jourdanneau

                              LRI - University Paris Sud XI - CNRS


                                        13/03/2012
Context

In Scientific Computing ...
   there is Scientific




 2 of 34
Context

In Scientific Computing ...
   there is Scientific
      Applications are domain driven
      Users = Developers
      Users are reluctant to changes




 2 of 34
Context

In Scientific Computing ...
   there is Scientific
      Applications are domain driven
      Users = Developers
      Users are reluctant to changes
   there is Computing




 2 of 34
Context

In Scientific Computing ...
   there is Scientific
      Applications are domain driven
      Users = Developers
      Users are reluctant to changes
   there is Computing
      Computing requires performance ...
      ... which implies architectures specific tuning
      ... which requires expertise
      ... which may or may not be available




 2 of 34
Context

In Scientific Computing ...
   there is Scientific
      Applications are domain driven
      Users = Developers
      Users are reluctant to changes
   there is Computing
      Computing requires performance ...
      ... which implies architectures specific tuning
      ... which requires expertise
      ... which may or may not be available


The Problem
People using computers to do science want to do science first.

 2 of 34
The Problem and how to solve it
The Facts
  The ”Library to bind them all” doesn’t exist (or we should have it already )
  New architectures performances are underused due to their inherent complexity
  Few people are both experts in parallel programming and SomeOtherField


The Ends
  Find a way to shield users from parallel architecture details
  Find a way to shield developpers from parallel architecture details
  Isolate software from hardware evolution


The Means
  Generic Programming
  Template Meta-Programming
  Embedded Domain Specific Languages
3 of 34
Talk Layout


Introduction

NT2

Expression Templates v2.0

Supporting GPUs

Conclusion



 4 of 34
What’s NT2 ?


A Scientific Computing Library
  Provide a simple, Matlab-like interface for users
  Provide high-performance computing entities and primitives
  Easily extendable


A Research Platform
  Simple framework to add new optimization schemes
  Test bench for EDSL development methodologies
  Test bench for Generic Programming in real life projects




5 of 34
The NT2 API


Principles
   table<T,S> is a simple, multidimensional array object that exactly mimics
   Matlab array behavior and functionalities
   300+ functions usable directly either on table or on any scalar values as in
   Matlab




 6 of 34
The NT2 API


Principles
   table<T,S> is a simple, multidimensional array object that exactly mimics
   Matlab array behavior and functionalities
   300+ functions usable directly either on table or on any scalar values as in
   Matlab


How does it works
   Take a .m file, copy to a .cpp file




 6 of 34
The NT2 API


Principles
   table<T,S> is a simple, multidimensional array object that exactly mimics
   Matlab array behavior and functionalities
   300+ functions usable directly either on table or on any scalar values as in
   Matlab


How does it works
   Take a .m file, copy to a .cpp file
   Add #include <nt2/nt2.hpp> and do cosmetic changes




 6 of 34
The NT2 API


Principles
   table<T,S> is a simple, multidimensional array object that exactly mimics
   Matlab array behavior and functionalities
   300+ functions usable directly either on table or on any scalar values as in
   Matlab


How does it works
   Take a .m file, copy to a .cpp file
   Add #include <nt2/nt2.hpp> and do cosmetic changes
   Compile the file and link with libnt2.a



 6 of 34
Matlab you said ?



1   R = I (: ,: ,1) ;
2   G = I (: ,: ,2) ;
3   B = I (: ,: ,3) ;
4
5   Y = min ( abs (0.299.* R +0.587.* G +0.114.* B ) ,235) ;
6   U = min ( abs ( -0.169.* R -0.331.* G +0.5.* B ) ,240) ;
7   V = min ( abs (0.5.* R -0.419.* G -0.081.* B ) ,240) ;




     7 of 34
Now with NT2



1   table < double >   R = I (_ ,_ ,1) ;
2   table < double >   G = I (_ ,_ ,2) ;
3   table < double >   B = I (_ ,_ ,3) ;
4   table < double >   Y, U, V;
5
6   Y = min ( abs (0.299* R +0.587* G +0.114* B ) ,235) ;
7   U = min ( abs ( -0.169* R -0.331* G +0.5* B ) ,240) ;
8   V = min ( abs (0.5* R -0.419* G -0.081* B ) ,240) ;




     8 of 34
Now with NT2



1   table < float , settings ( shallow_ , of_size_ <N ,M >) > R = I (_ ,_ ,1) ;
2   table < float , settings ( shallow_ , of_size_ <N ,M >) > G = I (_ ,_ ,2) ;
3   table < float , settings ( shallow_ , of_size_ <N ,M >) > B = I (_ ,_ ,3) ;
4   table < float , of_size_ <N ,M > > Y , U , V ;
5
6   Y = min ( abs (0.299* R +0.587* G +0.114* B ) ,235) ;
7   U = min ( abs ( -0.169* R -0.331* G +0.5* B ) ,240) ;
8   V = min ( abs (0.5* R -0.419* G -0.081* B ) ,240) ;




     8 of 34
Some real life applications




9 of 34
NT2 before Boost.Proto

EDSL Core
   Code base was 11 KLOC
   8KLOC was dedicated to the various Expression Template glue
   3KLOC of actual smart stuff

Architectural support
   Altivec and SSE2 extension support
   Some vague pthread support
   Efforts were stagnant: adding a simple feature required multipel KLOC of code
   to change


10 of 34
Talk Layout


Introduction

NT2

Expression Templates v2.0

Supporting GPUs

Conclusion



11 of 34
Embedded Domain Specific Languages


What’s an EDSL ?
   DSL = Domain Specific Language
   Declarative language, easy-to-use, fitting the domain
   EDSL = DSL within a general purpose language

EDSL in C++
   Relies on operator overload abuse (Expression Templates)
   Carry semantic information around code fragment
   Generic implementation become self-aware of optimizations




12 of 34
Expression Templates

           matrix x(h,w),a(h,w),b(h,w);           =
           x = cos(a) + (b*a);
                                              x          +
           expr<assign
               ,expr<matrix&>
               ,expr<plus                          cos           *
                    , expr<cos
                          ,expr<matrix&>
                          >
                                                   a         b        a
                    , expr<multiplies
                          ,expr<matrix&>
                          ,expr<matrix&>
                          >                #pragma omp parallel for
                    >(x,a,b);              for(int j=0;j<h;++j)
                                           {
                                             for(int i=0;i<w;++i)
                                             {
           Arbitrary Transforms applied
                                               x(j,i) = cos(a(j,i))
                 on the meta-AST
                                                      + ( b(j,i)
                                                         * a(j,i)
                                                      );
                                             }
                                           }




13 of 34
Boost.Proto

What’s Boost.Proto
   EDSL for defining EDSLs in C++
   Generalize Expression Templates
   Easy way to define and test EDSL
   EDSL = some Grammar Rules + some Semantic

Boost.Proto Benefits
   Fast development process
   Compiler guys are at ease : Grammar + Semantic + Code Generation process
   Easily extendable through Transforms
   Easy to handle : MPL and Fusion compatible



14 of 34
Why Boost.Proto ?


Benefits
   Scalability: Keep EDSL glue code size low
   Extensibility: Simplify new optimizations design
   Debugging: Allow for meaningful error reporting


Remaining Challenges
   Provide a generic ”compilation” process
   Select the best function definition w/r to hardware
   Allow for non-intrusive new pass definition




15 of 34
State of the art EDSL

Usual approaches and results
   Eigen : supports SIMD arithmetic, thread only algorithms
   uBLAS : parallelism comes from potential back-end

What’s the problem ?
   Hardware consideration comes after ET code
   Retro-fitting parallelism is hard
   Need of a hardware-aware EDSL
   Back to Generative Programming


16 of 34
Principles of Generative Programming

              Domain Specific         Generative Component   Concrete Application
           Application Description



                                         Translator



                                         Parametric
                                       Sub-components




17 of 34
Generative Programming v2




18 of 34
Our Approach

Principles
   Segment EDSL evaluation in phases
   Each phases use Proto transforms to advance code generation
   Hardware specification are active element in function overload
   Use Generalized Tag Dispatching (GTD) to select proper function
   implementation

Advantages
   Hardware can influence part or whole evaluation process
   Proto transforms syntax are easy to use
   GTD increases opportunity for optimizations
19 of 34
Generalized Tag Dispatching


Principles
   Tag dispatching use types tag to overload functions
   Classical example : standard Iterators




20 of 34
Generalized Tag Dispatching


Principles
   Tag dispatching use types tag to overload functions
   Classical example : standard Iterators
   We augment the system with:
           Tag computed from the function identifier
           Tag computed from the hardware configuration




20 of 34
Generalized Tag Dispatching

    Principles
       Tag dispatching use types tag to overload functions
       Classical example : standard Iterators
       We augment the system with:
               Tag computed from the function identifier
               Tag computed from the hardware configuration


1   namespace tag { struct plus_ : elementwise_ < plus_ > {}; }
2
3   functor < tag :: plus_ > callee ;
4   callee ( some_a , some_b ) ;




    20 of 34
Generalized Tag Dispatching
    Principles
       Tag dispatching use types tag to overload functions
       Classical example : standard Iterators
       We augment the system with:
               Tag computed from the function identifier
               Tag computed from the hardware configuration

1   auto functor < Tag , Site >:: operator () ( A && args ...)
2   {
3     dispatch < hierarchy_of < Site >:: type
4              , hierarchy_of < Tag >:: type ( hierarchy_of <A >:: type ...)
5                > callee ;
6     return callee ( args ... ) ;
7   }


    20 of 34
Generalized Tag Dispatching


 1   template < class AO >
 2   strutc dispatch < tag :: sse_
 3                     , tag :: plus_ ( simd_ < real_ < float_ < A0 > > >
 4                                    , simd_ < real_ < float_ < A0 > > >
 5                                    )
 6                     >
 7   {
 8      A0 operator ( A0 && a , A0 && b )
 9      {
10        return _mm_add_ps ( a , b ) ;
11      }
12   };




     21 of 34
Generalized Tag Dispatching
 1   template < class S , class R , class L , class N , class AO >
 2   strutc dispatch < openmp_ <S > , reduction_ <R ,L ,N >( ast_ < A0 >) >
 3   {
 4     A0 :: value_type operator ( A0 && ast )
 5     {
 6        auto result = functor <N ,S >( as_ < A0 :: value_type >() ) ;
 7        # pragma omp parallel
 8        {
 9           auto local = functor <N ,S >( as_ < A0 :: value_type >() ) ;
10           # pragma omp for nowait
11           for ( int i =0; i < size ( ast ) ;++ i )
12              functor <R ,S >() ( local , ast ( i ) ) ;
13
14                # pragma omp critial
15                functor <L ,S >() ( result , local ) ;
16            }
17
18            return result ;
19        }
20   };
     21 of 34
Multipass EDSL Code Generation


Principles
   Don’t walk the AST directly
   Capture where hardware introduces extension points
   Use GTD to select proper phase implementation




22 of 34
Multipass EDSL Code Generation


Principles
   Don’t walk the AST directly
   Capture where hardware introduces extension points
   Use GTD to select proper phase implementation

Structure
   Optimize : AST to AST transforms




22 of 34
Multipass EDSL Code Generation


Principles
   Don’t walk the AST directly
   Capture where hardware introduces extension points
   Use GTD to select proper phase implementation

Structure
   Optimize : AST to AST transforms
   Schedule : AST to Forest of Code Generator transforms



22 of 34
Multipass EDSL Code Generation


Principles
   Don’t walk the AST directly
   Capture where hardware introduces extension points
   Use GTD to select proper phase implementation

Structure
   Optimize : AST to AST transforms
   Schedule : AST to Forest of Code Generator transforms
   Run : Code Generator to Code transforms


22 of 34
Multipass EDSL Code Generation
Optimize
   AST Pattern Matching using Proto grammar
   Replace large sequence of operation by equivalent, faster kernels
   E.g: a+b*c to gemm or fma, x = inv(a)*b to solve(a,x,b)




23 of 34
Multipass EDSL Code Generation
Optimize
   AST Pattern Matching using Proto grammar
   Replace large sequence of operation by equivalent, faster kernels
   E.g: a+b*c to gemm or fma, x = inv(a)*b to solve(a,x,b)


Schedule
   Segment AST into loop-compatible loop functors
   Use AST node hierarchy (and GTD) for splitting




23 of 34
Multipass EDSL Code Generation
Optimize
   AST Pattern Matching using Proto grammar
   Replace large sequence of operation by equivalent, faster kernels
   E.g: a+b*c to gemm or fma, x = inv(a)*b to solve(a,x,b)


Schedule
   Segment AST into loop-compatible loop functors
   Use AST node hierarchy (and GTD) for splitting


Run
   Generate code for each AST int he scheduled forest
   Potentially generate runtime calls for runtime optimization
   Classical EDSL code generation happen here on the correct hardware
23 of 34
The new NT2 Structure

State of the code base
   Expression Template handling 200-300LOC
   GTD handling : 1KLOC
   Actual features and function implementation : 10KLOC

Evolution process
   Adding a new site 100-500LOC
   Adding a function 10-100 LOC
   Time for complete new architecture support : 1 to 3 weeks


24 of 34
The new NT2 Structure




24 of 34
Some results

Support for OpenMP
   1 build system script
   4 new overloads for the run phase
   1 new allocator to prevent false sharing

Support for Next Gen SIMD
   AVX : 2 weeks of work, functionnal
   AVX2 : prototype ready, work in simulator
   LarBn : prototype ready, work in simulator


25 of 34
Talk Layout


Introduction

NT2

Expression Templates v2.0

Supporting GPUs

Conclusion



26 of 34
OpenCL in action

     Principle
        Standard for manycore/multicore systems
        Use runtime kernel and JIT Compilation

     Example
 1   const char * OpenCLSource [] = {
 2   " __kernel void VectorAdd ( __global int * c , __global int * a , __global int * b , unsigned
              int nb ) " ,
 3   "{",
 4   " // Index of the elements to add  n " ,
 5   " unsigned int n = get_global_id (0) ; " ,
 6   " // Sum the n ’ th element of vectors a and b and store in c  n " ,
 7   " c [ n ]=0; " ,
 8   " for ( int i =0; i < nb ; i ++) c [ n ] += ( a [ n ]+ b [ n ]) ; " ,
 9   "}"
10   };



     27 of 34
From Proto to OpenCL


The Schedule Pass
   Split AST as usual on function hierarchy
   Replace each AST by a functor generating the equivalent OpenCL
   Compute a static hash of the tree to remember already compiled expressions




28 of 34
From Proto to OpenCL


The Schedule Pass
   Split AST as usual on function hierarchy
   Replace each AST by a functor generating the equivalent OpenCL
   Compute a static hash of the tree to remember already compiled expressions


The Run pass
   Stream data at beginning of each kernel
   Handle generic calls and retrieval of compiled kernel
   Fire asynchronous kernel calls over the forest




28 of 34
Support for openCL


     NT2 Example
 1   int main ()
 2   {
 3     int nbpoints =512000;
 4     table < float , settings ( _1D ) > xx ( ofSize ( nbpoints ) ) ;
 5     table < float , settings ( _1D ) > yy ( ofSize ( nbpoints ) ) ;
 6
 7       for ( int i =1; i <= nbpoints ; i ++)
 8       {
 9          xx ( i ) = - nbpoints /2+ i ;
10          yy ( i ) = 0.0;
11       }
12       yy =100* sin (8*3.14* xx /512000) ;
13
14       std :: cout < < xx (1) <<" " << yy (1) ;
15   }




      29 of 34
Support for openCL


     NT2 Example
 1   int main ()
 2   {
 3     int nbpoints =512000;
 4     table < float , settings ( _1D , nt2 :: gpu ) > xx ( ofSize ( nbpoints ) ) ;
 5     table < float , settings ( _1D , nt2 :: gpu ) > yy ( ofSize ( nbpoints ) ) ;
 6
 7       for ( int i =1; i <= nbpoints ; i ++)
 8       {
 9          xx ( i ) = - nbpoints /2+ i ;
10          yy ( i ) = 0.0;
11       }
12       yy =100* sin (8*3.14* xx /512000) ;
13
14       std :: cout < < xx (1) <<" " << yy (1) ;
15   }




      29 of 34
Preliminary results for openCL


     Generating some signals
 1   for ( int i =1; i <= nbpoints ; i ++)
 2   {
 3     xx ( i ) = - nbpoints /2+ i ;
 4     yy ( i ) = 0.0;
 5   }
 6
 7   yy =100* sin (8*3.14* xx /512000) ;
 8   yy = yy +0.0001* xx ;
 9
10   for ( int ii =1; ii <512000; ii +=15000)
11   {
12     addGaussian ( xx , yy ,( float ) ii ,1000) ;
13     yy =1 - yy ;
14   }
15
16   yy =1 - yy ;
17   yy = yy * yy ;
18   yy = yy * cos ( xx ) ;




      30 of 34
Preliminary results for openCL




    Generating some signals
1   template < typename A >
2   void addGaussian ( A & xx , A & yy , float centre , float largeur )
3   {
4     yy = yy + ( 100000/( val ( largeur ) * sqrt (2*3.1415) ) )
5                       * exp ( -( xx - val ( centre ) ) *( xx - val ( centre ) )
6                                  /(2* val ( largeur ) * val ( largeur )
7                                  )
8                       );
9   }




     30 of 34
Preliminary results for openCL


Generating some signals

                 CPU 1 core             520.2 ms
                 CPU 4 cores            141.3 ms
                 CPU 4 cores+SIMD        36.1 ms
                 GPU First Run          878.2 ms
                 GPU Other Run           13.1 ms
                 GPU Manual OCL          12.9 ms
   Compares favorably to ViennaCL and   Accelereyes




30 of 34
Talk Layout


Introduction

NT2

Expression Templates v2.0

Supporting GPUs

Conclusion



31 of 34
Let’s round this up!

Parallel Computing for Scientist
   Parallel programmign tools are required
   EDSL are an efficient way to design such tools
   But hardware has to stay in the loop




32 of 34
Let’s round this up!

Parallel Computing for Scientist
   Parallel programmign tools are required
   EDSL are an efficient way to design such tools
   But hardware has to stay in the loop


Our claims
   Software Libraries built as Generic and Generative components can solve a large
   chunk of parallelism related problems while being easy to use.
   Like regular language, EDSL needs informations about the hardware system
   Integrating hardware descriptions as Generic components increases tools
   portability and re-targetability


32 of 34
Current and Future Works

Recent activity
   A public release of NT2 is planned ”soon”
   A Matlab to NT2 compiler has been designed
   Prototype for single source CUDA support in NT2
   PhD started on MPI support in NT2




33 of 34
Current and Future Works

Recent activity
   A public release of NT2 is planned ”soon”
   A Matlab to NT2 compiler has been designed
   Prototype for single source CUDA support in NT2
   PhD started on MPI support in NT2

What we’ll be cooking in the future
   std::future is the future
   Experimental research on NT2 for embedded systems
   Toward a global generic approach to parallelism

33 of 34
Thanks for your attention

Weitere ähnliche Inhalte

Was ist angesagt?

Algorithm & data structure lec2
Algorithm & data structure lec2Algorithm & data structure lec2
Algorithm & data structure lec2Abdul Khan
 
(chapter 5) A Concise and Practical Introduction to Programming Algorithms in...
(chapter 5) A Concise and Practical Introduction to Programming Algorithms in...(chapter 5) A Concise and Practical Introduction to Programming Algorithms in...
(chapter 5) A Concise and Practical Introduction to Programming Algorithms in...Frank Nielsen
 
Java cơ bản java co ban
Java cơ bản java co ban Java cơ bản java co ban
Java cơ bản java co ban ifis
 
Compiler Construction | Lecture 12 | Virtual Machines
Compiler Construction | Lecture 12 | Virtual MachinesCompiler Construction | Lecture 12 | Virtual Machines
Compiler Construction | Lecture 12 | Virtual MachinesEelco Visser
 
Archi Modelling
Archi ModellingArchi Modelling
Archi Modellingdilane007
 
19. Data Structures and Algorithm Complexity
19. Data Structures and Algorithm Complexity19. Data Structures and Algorithm Complexity
19. Data Structures and Algorithm ComplexityIntro C# Book
 
(chapter 3) A Concise and Practical Introduction to Programming Algorithms in...
(chapter 3) A Concise and Practical Introduction to Programming Algorithms in...(chapter 3) A Concise and Practical Introduction to Programming Algorithms in...
(chapter 3) A Concise and Practical Introduction to Programming Algorithms in...Frank Nielsen
 
Why we cannot ignore Functional Programming
Why we cannot ignore Functional ProgrammingWhy we cannot ignore Functional Programming
Why we cannot ignore Functional ProgrammingMario Fusco
 
High-Level Synthesis with GAUT
High-Level Synthesis with GAUTHigh-Level Synthesis with GAUT
High-Level Synthesis with GAUTAdaCore
 

Was ist angesagt? (20)

C Programming - Refresher - Part III
C Programming - Refresher - Part IIIC Programming - Refresher - Part III
C Programming - Refresher - Part III
 
Lazy java
Lazy javaLazy java
Lazy java
 
Algorithm & data structure lec2
Algorithm & data structure lec2Algorithm & data structure lec2
Algorithm & data structure lec2
 
(chapter 5) A Concise and Practical Introduction to Programming Algorithms in...
(chapter 5) A Concise and Practical Introduction to Programming Algorithms in...(chapter 5) A Concise and Practical Introduction to Programming Algorithms in...
(chapter 5) A Concise and Practical Introduction to Programming Algorithms in...
 
Oop l2
Oop l2Oop l2
Oop l2
 
Java cơ bản java co ban
Java cơ bản java co ban Java cơ bản java co ban
Java cơ bản java co ban
 
Cbse marking scheme 2006 2011
Cbse marking scheme 2006  2011Cbse marking scheme 2006  2011
Cbse marking scheme 2006 2011
 
Learn Matlab
Learn MatlabLearn Matlab
Learn Matlab
 
Compiler Construction | Lecture 12 | Virtual Machines
Compiler Construction | Lecture 12 | Virtual MachinesCompiler Construction | Lecture 12 | Virtual Machines
Compiler Construction | Lecture 12 | Virtual Machines
 
Archi Modelling
Archi ModellingArchi Modelling
Archi Modelling
 
19. Data Structures and Algorithm Complexity
19. Data Structures and Algorithm Complexity19. Data Structures and Algorithm Complexity
19. Data Structures and Algorithm Complexity
 
(chapter 3) A Concise and Practical Introduction to Programming Algorithms in...
(chapter 3) A Concise and Practical Introduction to Programming Algorithms in...(chapter 3) A Concise and Practical Introduction to Programming Algorithms in...
(chapter 3) A Concise and Practical Introduction to Programming Algorithms in...
 
Why we cannot ignore Functional Programming
Why we cannot ignore Functional ProgrammingWhy we cannot ignore Functional Programming
Why we cannot ignore Functional Programming
 
Chapter 02 functions -class xii
Chapter 02   functions -class xiiChapter 02   functions -class xii
Chapter 02 functions -class xii
 
Functional programming
Functional programmingFunctional programming
Functional programming
 
High-Level Synthesis with GAUT
High-Level Synthesis with GAUTHigh-Level Synthesis with GAUT
High-Level Synthesis with GAUT
 
OOP and FP
OOP and FPOOP and FP
OOP and FP
 
05slide
05slide05slide
05slide
 
Introduction to HDLs
Introduction to HDLsIntroduction to HDLs
Introduction to HDLs
 
Chapter03
Chapter03Chapter03
Chapter03
 

Andere mochten auch

Importance of software architecture
Importance of software architectureImportance of software architecture
Importance of software architectureHimanshu
 
Software Architecture: Introduction
Software Architecture: IntroductionSoftware Architecture: Introduction
Software Architecture: IntroductionHenry Muccini
 
Renaissance- architects, influences, works
Renaissance- architects, influences, worksRenaissance- architects, influences, works
Renaissance- architects, influences, worksJohn Marion Palameña
 
Architectural Design Concept
Architectural Design ConceptArchitectural Design Concept
Architectural Design ConceptNiteesh Patel
 
Fundamentals Of Software Architecture
Fundamentals Of Software ArchitectureFundamentals Of Software Architecture
Fundamentals Of Software ArchitectureMarkus Voelter
 
Generating Architectural Concepts & Design Ideas
Generating Architectural Concepts & Design IdeasGenerating Architectural Concepts & Design Ideas
Generating Architectural Concepts & Design IdeasWan Muhammad / Asas-Khu™
 
Architecture design in software engineering
Architecture design in software engineeringArchitecture design in software engineering
Architecture design in software engineeringPreeti Mishra
 

Andere mochten auch (11)

Importance of software architecture
Importance of software architectureImportance of software architecture
Importance of software architecture
 
Software Architecture: Introduction
Software Architecture: IntroductionSoftware Architecture: Introduction
Software Architecture: Introduction
 
Renaissance- architects, influences, works
Renaissance- architects, influences, worksRenaissance- architects, influences, works
Renaissance- architects, influences, works
 
05 architectural design
05 architectural design05 architectural design
05 architectural design
 
Architectural Design Concept
Architectural Design ConceptArchitectural Design Concept
Architectural Design Concept
 
Architectural Design
Architectural DesignArchitectural Design
Architectural Design
 
Fundamentals Of Software Architecture
Fundamentals Of Software ArchitectureFundamentals Of Software Architecture
Fundamentals Of Software Architecture
 
Software Testing
Software TestingSoftware Testing
Software Testing
 
Documenting Software Architectures
Documenting Software ArchitecturesDocumenting Software Architectures
Documenting Software Architectures
 
Generating Architectural Concepts & Design Ideas
Generating Architectural Concepts & Design IdeasGenerating Architectural Concepts & Design Ideas
Generating Architectural Concepts & Design Ideas
 
Architecture design in software engineering
Architecture design in software engineeringArchitecture design in software engineering
Architecture design in software engineering
 

Ähnlich wie Designing Architecture-aware Library using Boost.Proto

Object Oriented Technologies
Object Oriented TechnologiesObject Oriented Technologies
Object Oriented TechnologiesUmesh Nikam
 
03 Synthesis (1).ppt
03 Synthesis  (1).ppt03 Synthesis  (1).ppt
03 Synthesis (1).pptShreyasMahesh
 
1183 c-interview-questions-and-answers
1183 c-interview-questions-and-answers1183 c-interview-questions-and-answers
1183 c-interview-questions-and-answersAkash Gawali
 
Georgy Nosenko - An introduction to the use SMT solvers for software security
Georgy Nosenko - An introduction to the use SMT solvers for software securityGeorgy Nosenko - An introduction to the use SMT solvers for software security
Georgy Nosenko - An introduction to the use SMT solvers for software securityDefconRussia
 
C++ Interview Question And Answer
C++ Interview Question And AnswerC++ Interview Question And Answer
C++ Interview Question And AnswerJagan Mohan Bishoyi
 
C++ questions And Answer
C++ questions And AnswerC++ questions And Answer
C++ questions And Answerlavparmar007
 
Standardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for PythonStandardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for PythonRalf Gommers
 
Madeo - a CAD Tool for reconfigurable Hardware
Madeo - a CAD Tool for reconfigurable HardwareMadeo - a CAD Tool for reconfigurable Hardware
Madeo - a CAD Tool for reconfigurable HardwareESUG
 
Data Structures and Algorithm Analysis
Data Structures  and  Algorithm AnalysisData Structures  and  Algorithm Analysis
Data Structures and Algorithm AnalysisMary Margarat
 
Ehsan parallel accelerator-dec2015
Ehsan parallel accelerator-dec2015Ehsan parallel accelerator-dec2015
Ehsan parallel accelerator-dec2015Christian Peel
 
Learning to Spot and Refactor Inconsistent Method Names
Learning to Spot and Refactor Inconsistent Method NamesLearning to Spot and Refactor Inconsistent Method Names
Learning to Spot and Refactor Inconsistent Method NamesDongsun Kim
 
Julia - Easier, Better, Faster, Stronger
Julia - Easier, Better, Faster, StrongerJulia - Easier, Better, Faster, Stronger
Julia - Easier, Better, Faster, StrongerKenta Sato
 
Preprocessor directives
Preprocessor directivesPreprocessor directives
Preprocessor directivesVikash Dhal
 
Ida python intro
Ida python introIda python intro
Ida python intro小静 安
 
Optimization Software and Systems for Operations Research: Best Practices and...
Optimization Software and Systems for Operations Research: Best Practices and...Optimization Software and Systems for Operations Research: Best Practices and...
Optimization Software and Systems for Operations Research: Best Practices and...Bob Fourer
 
NVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA Japan
 
MuVM: Higher Order Mutation Analysis Virtual Machine for C
MuVM: Higher Order Mutation Analysis Virtual Machine for CMuVM: Higher Order Mutation Analysis Virtual Machine for C
MuVM: Higher Order Mutation Analysis Virtual Machine for CSusumu Tokumoto
 

Ähnlich wie Designing Architecture-aware Library using Boost.Proto (20)

Object Oriented Technologies
Object Oriented TechnologiesObject Oriented Technologies
Object Oriented Technologies
 
03 Synthesis (1).ppt
03 Synthesis  (1).ppt03 Synthesis  (1).ppt
03 Synthesis (1).ppt
 
1183 c-interview-questions-and-answers
1183 c-interview-questions-and-answers1183 c-interview-questions-and-answers
1183 c-interview-questions-and-answers
 
Georgy Nosenko - An introduction to the use SMT solvers for software security
Georgy Nosenko - An introduction to the use SMT solvers for software securityGeorgy Nosenko - An introduction to the use SMT solvers for software security
Georgy Nosenko - An introduction to the use SMT solvers for software security
 
C++ Interview Question And Answer
C++ Interview Question And AnswerC++ Interview Question And Answer
C++ Interview Question And Answer
 
C++ questions And Answer
C++ questions And AnswerC++ questions And Answer
C++ questions And Answer
 
Interm codegen
Interm codegenInterm codegen
Interm codegen
 
Standardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for PythonStandardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for Python
 
Madeo - a CAD Tool for reconfigurable Hardware
Madeo - a CAD Tool for reconfigurable HardwareMadeo - a CAD Tool for reconfigurable Hardware
Madeo - a CAD Tool for reconfigurable Hardware
 
Data Structures and Algorithm Analysis
Data Structures  and  Algorithm AnalysisData Structures  and  Algorithm Analysis
Data Structures and Algorithm Analysis
 
Ehsan parallel accelerator-dec2015
Ehsan parallel accelerator-dec2015Ehsan parallel accelerator-dec2015
Ehsan parallel accelerator-dec2015
 
Synthesis of Attributed Feature Models From Product Descriptions
Synthesis of Attributed Feature Models From Product DescriptionsSynthesis of Attributed Feature Models From Product Descriptions
Synthesis of Attributed Feature Models From Product Descriptions
 
Learning to Spot and Refactor Inconsistent Method Names
Learning to Spot and Refactor Inconsistent Method NamesLearning to Spot and Refactor Inconsistent Method Names
Learning to Spot and Refactor Inconsistent Method Names
 
Julia - Easier, Better, Faster, Stronger
Julia - Easier, Better, Faster, StrongerJulia - Easier, Better, Faster, Stronger
Julia - Easier, Better, Faster, Stronger
 
Preprocessor directives
Preprocessor directivesPreprocessor directives
Preprocessor directives
 
Ida python intro
Ida python introIda python intro
Ida python intro
 
Optimization Software and Systems for Operations Research: Best Practices and...
Optimization Software and Systems for Operations Research: Best Practices and...Optimization Software and Systems for Operations Research: Best Practices and...
Optimization Software and Systems for Operations Research: Best Practices and...
 
NVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読み
 
Joel Falcou, Boost.SIMD
Joel Falcou, Boost.SIMDJoel Falcou, Boost.SIMD
Joel Falcou, Boost.SIMD
 
MuVM: Higher Order Mutation Analysis Virtual Machine for C
MuVM: Higher Order Mutation Analysis Virtual Machine for CMuVM: Higher Order Mutation Analysis Virtual Machine for C
MuVM: Higher Order Mutation Analysis Virtual Machine for C
 

Designing Architecture-aware Library using Boost.Proto

  • 1. Designing Architecture-aware Library using Boost.Proto Joel Falcou, Mathias Gaunard Pierre Esterie, Eric Jourdanneau LRI - University Paris Sud XI - CNRS 13/03/2012
  • 2. Context In Scientific Computing ... there is Scientific 2 of 34
  • 3. Context In Scientific Computing ... there is Scientific Applications are domain driven Users = Developers Users are reluctant to changes 2 of 34
  • 4. Context In Scientific Computing ... there is Scientific Applications are domain driven Users = Developers Users are reluctant to changes there is Computing 2 of 34
  • 5. Context In Scientific Computing ... there is Scientific Applications are domain driven Users = Developers Users are reluctant to changes there is Computing Computing requires performance ... ... which implies architectures specific tuning ... which requires expertise ... which may or may not be available 2 of 34
  • 6. Context In Scientific Computing ... there is Scientific Applications are domain driven Users = Developers Users are reluctant to changes there is Computing Computing requires performance ... ... which implies architectures specific tuning ... which requires expertise ... which may or may not be available The Problem People using computers to do science want to do science first. 2 of 34
  • 7. The Problem and how to solve it The Facts The ”Library to bind them all” doesn’t exist (or we should have it already ) New architectures performances are underused due to their inherent complexity Few people are both experts in parallel programming and SomeOtherField The Ends Find a way to shield users from parallel architecture details Find a way to shield developpers from parallel architecture details Isolate software from hardware evolution The Means Generic Programming Template Meta-Programming Embedded Domain Specific Languages 3 of 34
  • 8. Talk Layout Introduction NT2 Expression Templates v2.0 Supporting GPUs Conclusion 4 of 34
  • 9. What’s NT2 ? A Scientific Computing Library Provide a simple, Matlab-like interface for users Provide high-performance computing entities and primitives Easily extendable A Research Platform Simple framework to add new optimization schemes Test bench for EDSL development methodologies Test bench for Generic Programming in real life projects 5 of 34
  • 10. The NT2 API Principles table<T,S> is a simple, multidimensional array object that exactly mimics Matlab array behavior and functionalities 300+ functions usable directly either on table or on any scalar values as in Matlab 6 of 34
  • 11. The NT2 API Principles table<T,S> is a simple, multidimensional array object that exactly mimics Matlab array behavior and functionalities 300+ functions usable directly either on table or on any scalar values as in Matlab How does it works Take a .m file, copy to a .cpp file 6 of 34
  • 12. The NT2 API Principles table<T,S> is a simple, multidimensional array object that exactly mimics Matlab array behavior and functionalities 300+ functions usable directly either on table or on any scalar values as in Matlab How does it works Take a .m file, copy to a .cpp file Add #include <nt2/nt2.hpp> and do cosmetic changes 6 of 34
  • 13. The NT2 API Principles table<T,S> is a simple, multidimensional array object that exactly mimics Matlab array behavior and functionalities 300+ functions usable directly either on table or on any scalar values as in Matlab How does it works Take a .m file, copy to a .cpp file Add #include <nt2/nt2.hpp> and do cosmetic changes Compile the file and link with libnt2.a 6 of 34
  • 14. Matlab you said ? 1 R = I (: ,: ,1) ; 2 G = I (: ,: ,2) ; 3 B = I (: ,: ,3) ; 4 5 Y = min ( abs (0.299.* R +0.587.* G +0.114.* B ) ,235) ; 6 U = min ( abs ( -0.169.* R -0.331.* G +0.5.* B ) ,240) ; 7 V = min ( abs (0.5.* R -0.419.* G -0.081.* B ) ,240) ; 7 of 34
  • 15. Now with NT2 1 table < double > R = I (_ ,_ ,1) ; 2 table < double > G = I (_ ,_ ,2) ; 3 table < double > B = I (_ ,_ ,3) ; 4 table < double > Y, U, V; 5 6 Y = min ( abs (0.299* R +0.587* G +0.114* B ) ,235) ; 7 U = min ( abs ( -0.169* R -0.331* G +0.5* B ) ,240) ; 8 V = min ( abs (0.5* R -0.419* G -0.081* B ) ,240) ; 8 of 34
  • 16. Now with NT2 1 table < float , settings ( shallow_ , of_size_ <N ,M >) > R = I (_ ,_ ,1) ; 2 table < float , settings ( shallow_ , of_size_ <N ,M >) > G = I (_ ,_ ,2) ; 3 table < float , settings ( shallow_ , of_size_ <N ,M >) > B = I (_ ,_ ,3) ; 4 table < float , of_size_ <N ,M > > Y , U , V ; 5 6 Y = min ( abs (0.299* R +0.587* G +0.114* B ) ,235) ; 7 U = min ( abs ( -0.169* R -0.331* G +0.5* B ) ,240) ; 8 V = min ( abs (0.5* R -0.419* G -0.081* B ) ,240) ; 8 of 34
  • 17. Some real life applications 9 of 34
  • 18. NT2 before Boost.Proto EDSL Core Code base was 11 KLOC 8KLOC was dedicated to the various Expression Template glue 3KLOC of actual smart stuff Architectural support Altivec and SSE2 extension support Some vague pthread support Efforts were stagnant: adding a simple feature required multipel KLOC of code to change 10 of 34
  • 19. Talk Layout Introduction NT2 Expression Templates v2.0 Supporting GPUs Conclusion 11 of 34
  • 20. Embedded Domain Specific Languages What’s an EDSL ? DSL = Domain Specific Language Declarative language, easy-to-use, fitting the domain EDSL = DSL within a general purpose language EDSL in C++ Relies on operator overload abuse (Expression Templates) Carry semantic information around code fragment Generic implementation become self-aware of optimizations 12 of 34
  • 21. Expression Templates matrix x(h,w),a(h,w),b(h,w); = x = cos(a) + (b*a); x + expr<assign ,expr<matrix&> ,expr<plus cos * , expr<cos ,expr<matrix&> > a b a , expr<multiplies ,expr<matrix&> ,expr<matrix&> > #pragma omp parallel for >(x,a,b); for(int j=0;j<h;++j) { for(int i=0;i<w;++i) { Arbitrary Transforms applied x(j,i) = cos(a(j,i)) on the meta-AST + ( b(j,i) * a(j,i) ); } } 13 of 34
  • 22. Boost.Proto What’s Boost.Proto EDSL for defining EDSLs in C++ Generalize Expression Templates Easy way to define and test EDSL EDSL = some Grammar Rules + some Semantic Boost.Proto Benefits Fast development process Compiler guys are at ease : Grammar + Semantic + Code Generation process Easily extendable through Transforms Easy to handle : MPL and Fusion compatible 14 of 34
  • 23. Why Boost.Proto ? Benefits Scalability: Keep EDSL glue code size low Extensibility: Simplify new optimizations design Debugging: Allow for meaningful error reporting Remaining Challenges Provide a generic ”compilation” process Select the best function definition w/r to hardware Allow for non-intrusive new pass definition 15 of 34
  • 24. State of the art EDSL Usual approaches and results Eigen : supports SIMD arithmetic, thread only algorithms uBLAS : parallelism comes from potential back-end What’s the problem ? Hardware consideration comes after ET code Retro-fitting parallelism is hard Need of a hardware-aware EDSL Back to Generative Programming 16 of 34
  • 25. Principles of Generative Programming Domain Specific Generative Component Concrete Application Application Description Translator Parametric Sub-components 17 of 34
  • 27. Our Approach Principles Segment EDSL evaluation in phases Each phases use Proto transforms to advance code generation Hardware specification are active element in function overload Use Generalized Tag Dispatching (GTD) to select proper function implementation Advantages Hardware can influence part or whole evaluation process Proto transforms syntax are easy to use GTD increases opportunity for optimizations 19 of 34
  • 28. Generalized Tag Dispatching Principles Tag dispatching use types tag to overload functions Classical example : standard Iterators 20 of 34
  • 29. Generalized Tag Dispatching Principles Tag dispatching use types tag to overload functions Classical example : standard Iterators We augment the system with: Tag computed from the function identifier Tag computed from the hardware configuration 20 of 34
  • 30. Generalized Tag Dispatching Principles Tag dispatching use types tag to overload functions Classical example : standard Iterators We augment the system with: Tag computed from the function identifier Tag computed from the hardware configuration 1 namespace tag { struct plus_ : elementwise_ < plus_ > {}; } 2 3 functor < tag :: plus_ > callee ; 4 callee ( some_a , some_b ) ; 20 of 34
  • 31. Generalized Tag Dispatching Principles Tag dispatching use types tag to overload functions Classical example : standard Iterators We augment the system with: Tag computed from the function identifier Tag computed from the hardware configuration 1 auto functor < Tag , Site >:: operator () ( A && args ...) 2 { 3 dispatch < hierarchy_of < Site >:: type 4 , hierarchy_of < Tag >:: type ( hierarchy_of <A >:: type ...) 5 > callee ; 6 return callee ( args ... ) ; 7 } 20 of 34
  • 32. Generalized Tag Dispatching 1 template < class AO > 2 strutc dispatch < tag :: sse_ 3 , tag :: plus_ ( simd_ < real_ < float_ < A0 > > > 4 , simd_ < real_ < float_ < A0 > > > 5 ) 6 > 7 { 8 A0 operator ( A0 && a , A0 && b ) 9 { 10 return _mm_add_ps ( a , b ) ; 11 } 12 }; 21 of 34
  • 33. Generalized Tag Dispatching 1 template < class S , class R , class L , class N , class AO > 2 strutc dispatch < openmp_ <S > , reduction_ <R ,L ,N >( ast_ < A0 >) > 3 { 4 A0 :: value_type operator ( A0 && ast ) 5 { 6 auto result = functor <N ,S >( as_ < A0 :: value_type >() ) ; 7 # pragma omp parallel 8 { 9 auto local = functor <N ,S >( as_ < A0 :: value_type >() ) ; 10 # pragma omp for nowait 11 for ( int i =0; i < size ( ast ) ;++ i ) 12 functor <R ,S >() ( local , ast ( i ) ) ; 13 14 # pragma omp critial 15 functor <L ,S >() ( result , local ) ; 16 } 17 18 return result ; 19 } 20 }; 21 of 34
  • 34. Multipass EDSL Code Generation Principles Don’t walk the AST directly Capture where hardware introduces extension points Use GTD to select proper phase implementation 22 of 34
  • 35. Multipass EDSL Code Generation Principles Don’t walk the AST directly Capture where hardware introduces extension points Use GTD to select proper phase implementation Structure Optimize : AST to AST transforms 22 of 34
  • 36. Multipass EDSL Code Generation Principles Don’t walk the AST directly Capture where hardware introduces extension points Use GTD to select proper phase implementation Structure Optimize : AST to AST transforms Schedule : AST to Forest of Code Generator transforms 22 of 34
  • 37. Multipass EDSL Code Generation Principles Don’t walk the AST directly Capture where hardware introduces extension points Use GTD to select proper phase implementation Structure Optimize : AST to AST transforms Schedule : AST to Forest of Code Generator transforms Run : Code Generator to Code transforms 22 of 34
  • 38. Multipass EDSL Code Generation Optimize AST Pattern Matching using Proto grammar Replace large sequence of operation by equivalent, faster kernels E.g: a+b*c to gemm or fma, x = inv(a)*b to solve(a,x,b) 23 of 34
  • 39. Multipass EDSL Code Generation Optimize AST Pattern Matching using Proto grammar Replace large sequence of operation by equivalent, faster kernels E.g: a+b*c to gemm or fma, x = inv(a)*b to solve(a,x,b) Schedule Segment AST into loop-compatible loop functors Use AST node hierarchy (and GTD) for splitting 23 of 34
  • 40. Multipass EDSL Code Generation Optimize AST Pattern Matching using Proto grammar Replace large sequence of operation by equivalent, faster kernels E.g: a+b*c to gemm or fma, x = inv(a)*b to solve(a,x,b) Schedule Segment AST into loop-compatible loop functors Use AST node hierarchy (and GTD) for splitting Run Generate code for each AST int he scheduled forest Potentially generate runtime calls for runtime optimization Classical EDSL code generation happen here on the correct hardware 23 of 34
  • 41. The new NT2 Structure State of the code base Expression Template handling 200-300LOC GTD handling : 1KLOC Actual features and function implementation : 10KLOC Evolution process Adding a new site 100-500LOC Adding a function 10-100 LOC Time for complete new architecture support : 1 to 3 weeks 24 of 34
  • 42. The new NT2 Structure 24 of 34
  • 43. Some results Support for OpenMP 1 build system script 4 new overloads for the run phase 1 new allocator to prevent false sharing Support for Next Gen SIMD AVX : 2 weeks of work, functionnal AVX2 : prototype ready, work in simulator LarBn : prototype ready, work in simulator 25 of 34
  • 44. Talk Layout Introduction NT2 Expression Templates v2.0 Supporting GPUs Conclusion 26 of 34
  • 45. OpenCL in action Principle Standard for manycore/multicore systems Use runtime kernel and JIT Compilation Example 1 const char * OpenCLSource [] = { 2 " __kernel void VectorAdd ( __global int * c , __global int * a , __global int * b , unsigned int nb ) " , 3 "{", 4 " // Index of the elements to add n " , 5 " unsigned int n = get_global_id (0) ; " , 6 " // Sum the n ’ th element of vectors a and b and store in c n " , 7 " c [ n ]=0; " , 8 " for ( int i =0; i < nb ; i ++) c [ n ] += ( a [ n ]+ b [ n ]) ; " , 9 "}" 10 }; 27 of 34
  • 46. From Proto to OpenCL The Schedule Pass Split AST as usual on function hierarchy Replace each AST by a functor generating the equivalent OpenCL Compute a static hash of the tree to remember already compiled expressions 28 of 34
  • 47. From Proto to OpenCL The Schedule Pass Split AST as usual on function hierarchy Replace each AST by a functor generating the equivalent OpenCL Compute a static hash of the tree to remember already compiled expressions The Run pass Stream data at beginning of each kernel Handle generic calls and retrieval of compiled kernel Fire asynchronous kernel calls over the forest 28 of 34
  • 48. Support for openCL NT2 Example 1 int main () 2 { 3 int nbpoints =512000; 4 table < float , settings ( _1D ) > xx ( ofSize ( nbpoints ) ) ; 5 table < float , settings ( _1D ) > yy ( ofSize ( nbpoints ) ) ; 6 7 for ( int i =1; i <= nbpoints ; i ++) 8 { 9 xx ( i ) = - nbpoints /2+ i ; 10 yy ( i ) = 0.0; 11 } 12 yy =100* sin (8*3.14* xx /512000) ; 13 14 std :: cout < < xx (1) <<" " << yy (1) ; 15 } 29 of 34
  • 49. Support for openCL NT2 Example 1 int main () 2 { 3 int nbpoints =512000; 4 table < float , settings ( _1D , nt2 :: gpu ) > xx ( ofSize ( nbpoints ) ) ; 5 table < float , settings ( _1D , nt2 :: gpu ) > yy ( ofSize ( nbpoints ) ) ; 6 7 for ( int i =1; i <= nbpoints ; i ++) 8 { 9 xx ( i ) = - nbpoints /2+ i ; 10 yy ( i ) = 0.0; 11 } 12 yy =100* sin (8*3.14* xx /512000) ; 13 14 std :: cout < < xx (1) <<" " << yy (1) ; 15 } 29 of 34
  • 50. Preliminary results for openCL Generating some signals 1 for ( int i =1; i <= nbpoints ; i ++) 2 { 3 xx ( i ) = - nbpoints /2+ i ; 4 yy ( i ) = 0.0; 5 } 6 7 yy =100* sin (8*3.14* xx /512000) ; 8 yy = yy +0.0001* xx ; 9 10 for ( int ii =1; ii <512000; ii +=15000) 11 { 12 addGaussian ( xx , yy ,( float ) ii ,1000) ; 13 yy =1 - yy ; 14 } 15 16 yy =1 - yy ; 17 yy = yy * yy ; 18 yy = yy * cos ( xx ) ; 30 of 34
  • 51. Preliminary results for openCL Generating some signals 1 template < typename A > 2 void addGaussian ( A & xx , A & yy , float centre , float largeur ) 3 { 4 yy = yy + ( 100000/( val ( largeur ) * sqrt (2*3.1415) ) ) 5 * exp ( -( xx - val ( centre ) ) *( xx - val ( centre ) ) 6 /(2* val ( largeur ) * val ( largeur ) 7 ) 8 ); 9 } 30 of 34
  • 52. Preliminary results for openCL Generating some signals CPU 1 core 520.2 ms CPU 4 cores 141.3 ms CPU 4 cores+SIMD 36.1 ms GPU First Run 878.2 ms GPU Other Run 13.1 ms GPU Manual OCL 12.9 ms Compares favorably to ViennaCL and Accelereyes 30 of 34
  • 53. Talk Layout Introduction NT2 Expression Templates v2.0 Supporting GPUs Conclusion 31 of 34
  • 54. Let’s round this up! Parallel Computing for Scientist Parallel programmign tools are required EDSL are an efficient way to design such tools But hardware has to stay in the loop 32 of 34
  • 55. Let’s round this up! Parallel Computing for Scientist Parallel programmign tools are required EDSL are an efficient way to design such tools But hardware has to stay in the loop Our claims Software Libraries built as Generic and Generative components can solve a large chunk of parallelism related problems while being easy to use. Like regular language, EDSL needs informations about the hardware system Integrating hardware descriptions as Generic components increases tools portability and re-targetability 32 of 34
  • 56. Current and Future Works Recent activity A public release of NT2 is planned ”soon” A Matlab to NT2 compiler has been designed Prototype for single source CUDA support in NT2 PhD started on MPI support in NT2 33 of 34
  • 57. Current and Future Works Recent activity A public release of NT2 is planned ”soon” A Matlab to NT2 compiler has been designed Prototype for single source CUDA support in NT2 PhD started on MPI support in NT2 What we’ll be cooking in the future std::future is the future Experimental research on NT2 for embedded systems Toward a global generic approach to parallelism 33 of 34
  • 58. Thanks for your attention