SlideShare ist ein Scribd-Unternehmen logo
1 von 50
Downloaden Sie, um offline zu lesen
Costless Software Abstractions For Parallel Hardware System

Joel Falcou
LRI - INRIA - MetaScale SAS

Maison de la Simulation 04/03/2014
Context
In Scientific Computing ...
there is Scientific
Applications are domain driven
Users = Developers
Users are reluctant to changes
there is Computing
Computing requires performance ...
... which implies architectures specific tuning
... which requires expertise
... which may or may not be available

The Problem
People using computers to do science want to do science first.
2 of 34
The Problem – and how we want to solve it
The Facts
The ”Library to bind them all” doesn’t exist (or we should have it already )
All those users want to take advantage of new architectures
Few of them want to actually handle all the dirty work

The Ends
Provide a ”familiar” interface that let users benefit from parallelism
Helps compilers to generate better parallel code
Increase sustainability by decreasing amount of code to write

The Means
Parallel Abstractions: Skeletons
Efficient Implementation: DSEL
The Holy Glue: Generative Programming
3 of 34
Expressivité

Efficient or Expressive – Choose one

Matlab
SciLab

C++
JAVA

C, FORTRAN

Efficacité

4 of 34
Talk Layout

Introduction
Abstractions
Efficiency
Tools
Conclusion

5 of 34
Talk Layout

Introduction
Abstractions
Efficiency
Tools
Conclusion

6 of 34
Spotting abstraction when you see one

Processus 3

F2
Processus 1

Processus 2

Processus 4

Processus 6

Processus 7

F1

Distrib.

F2

Collect.

F3

Processus 5

F2

7 of 34
Spotting abstraction when you see one

Pipeline

Farm

Processus 3

F2
Processus 1

Processus 2

Processus 4

Processus 6

Processus 7

F1

Distrib.

F2

Collect.

F3

Processus 5

F2

7 of 34
Parallel Skeletons in a nutshell
Basic Principles [COLE 89]
There are patterns in parallel applications
Those patterns can be generalized in Skeletons
Applications are assembled as combination of such patterns

8 of 34
Parallel Skeletons in a nutshell
Basic Principles [COLE 89]
There are patterns in parallel applications
Those patterns can be generalized in Skeletons
Applications are assembled as combination of such patterns

Functionnal point of view
Skeletons are Higher-Order Functions
Skeletons support a compositionnal semantic
Applications become composition of state-less functions

8 of 34
Classic Parallel Skeletons
Data Parallel Skeletons
map: Apply a n-ary function in SIMD mode over subset of data
fold: Perform n-ary reduction over subset of data
scan: Perform n-ary prefix reduction over subset of data

9 of 34
Classic Parallel Skeletons
Data Parallel Skeletons
map: Apply a n-ary function in SIMD mode over subset of data
fold: Perform n-ary reduction over subset of data
scan: Perform n-ary prefix reduction over subset of data

Task Parallel Skeletons
par: Independant task execution
pipe: Task dependency over time
farm: Load-balancing

9 of 34
Why using Parallel Skeletons

Software Abstraction
Write without bothering with parallel details
Code is scalable and easy to maintain
Debuggable, Provable, Certifiable

10 of 34
Why using Parallel Skeletons

Software Abstraction
Write without bothering with parallel details
Code is scalable and easy to maintain
Debuggable, Provable, Certifiable

Hardware Abstraction
Semantic is set, implementation is free
Composability ⇒ Hierarchical architecture

10 of 34
Talk Layout

Introduction
Abstractions
Efficiency
Tools
Conclusion

11 of 34
Generative Programming
Domain Specific
Application Description

Generative Component

Translator

Parametric
Sub-components

12 of 34

Concrete Application
Generative Programming as a Tool

Available techniques
Dedicated compilers
External pre-processing tools
Languages supporting meta-programming

13 of 34
Generative Programming as a Tool

Available techniques
Dedicated compilers
External pre-processing tools
Languages supporting meta-programming

13 of 34
Generative Programming as a Tool

Available techniques
Dedicated compilers
External pre-processing tools
Languages supporting meta-programming

Definition of Meta-programming
Meta-programming is the writing of computer programs that analyse,
transform and generate other programs (or themselves) as their data.

13 of 34
From Generative to Meta-programming
Meta-programmable languages
template Haskell
metaOcaml
C++

14 of 34
From Generative to Meta-programming
Meta-programmable languages
template Haskell
metaOcaml
C++

14 of 34
From Generative to Meta-programming
Meta-programmable languages
template Haskell
metaOcaml
C++

C++ meta-programming
Relies on the C++ template sub-language
Handles types and integral constants at compile-time
Proved to be Turing-complete

14 of 34
Domain Specific Embedded Languages
What’s an DSEL ?
DSL = Domain Specific Language
Declarative language, easy-to-use, fitting the domain
DSEL = DSL within a general purpose language

EDSL in C++
Relies on operator overload abuse (Expression Templates)
Carry semantic information around code fragment
Generic implementation become self-aware of optimizations

Exploiting static AST
At the expression level: code generation
At the function level: inter-procedural optimization
15 of 34
Expression Templates
=

matrix x(h,w),a(h,w),b(h,w);
x = cos(a) + (b*a);

+

x
expr<assign
,expr<matrix&>
,expr<plus
, expr<cos
,expr<matrix&>
>
, expr<multiplies
,expr<matrix&>
,expr<matrix&>
>
>(x,a,b);

Arbitrary Transforms applied
on the meta-AST

16 of 34

*

cos

a

b

#pragma omp parallel for
for(int j=0;j<h;++j)
{
for(int i=0;i<w;++i)
{
x(j,i) = cos(a(j,i))
+ ( b(j,i)
* a(j,i)
);
}
}

a
Embedded Domain Specific Languages

EDSL in C++
Relies on operator overload abuse
Carry semantic information around code fragment
Generic implementation become self-aware of optimizations

Advantages
Allow introduction of DSLs without disrupting dev. chain
Semantic defined as type informations means compile-time resolution
Access to a large selection of runtime binding

17 of 34
Embedded Domain Specific Languages

18 of 34
Talk Layout

Introduction
Abstractions
Efficiency
Tools
Conclusion

19 of 34
Different Strokes
Objectives
Apply DSEL generation techniques for different kind of hardware
Demonstrate low cost of abstractions
Demonstrate applicability of skeletons

Contributions
Extend DEMRAL into AA-DEMRAL
Boost.SIMD
NT2

20 of 34
Architecture Aware Generative Programming

21 of 34
Boost.SIMD
Generic Programming for portable SIMDization

SIMD computation C++ Library
Built with modern C++ style for modern C++ usage
Easy to extend
Easy to adapt to new CPUs

22 of 34
Boost.SIMD
Generic Programming for portable SIMDization

SIMD computation C++ Library
Built with modern C++ style for modern C++ usage
Easy to extend
Easy to adapt to new CPUs

Goals:
Integrate SIMD computations in standard C++
Make writing of SIMD code easier
Promotes Generic Programming
Provides performance out of the box

22 of 34
Boost.SIMD
Principles
pack<T,N> encapsulates the best hardware register for N elements of type T
pack<T,N> provides a classical value semantic
Operations onpack<T,N> maps to proper intrinsics
Support for SIMD standard algortihms

23 of 34
Boost.SIMD
Principles
pack<T,N> encapsulates the best hardware register for N elements of type T
pack<T,N> provides a classical value semantic
Operations onpack<T,N> maps to proper intrinsics
Support for SIMD standard algortihms

How does it works
Code is written as for scalar values
Code can be applied as polymorphic functor over data
Expression Templates optimize operations

23 of 34
Boost.SIMD
Current Support
OS : Linux, Windows, iOS, Android
Architecture: x86, ARM, PowerPC
Compilers: g++, clang, icc, MSVC
Extensions:
x86: SSE2,SSE3,SSSE3,SSE4.x,AVX,XOP,FMA3/4,AVX2
PPC: VMX
ARM: NEON

In progress:
Xeon Phi MIC
VSX, QPX
NEON2
24 of 34
Boost.SIMD - Mandelbrot
pack < int > julia ( pack < float > const & a , pack < float > const & b )
{
pack < int > i_t ;
std :: size_t i = 0;
pack < int > iter ;
float x , y ;
do
{
pack < float > x2 =
pack < float > y2 =
pack < float > xy =
x = x2 - y2 + a ;
y = xy + b ;
pack < float > m2 =
iter = selinc ( m2
i ++;

x * x;
y * y;
2 * x * y;

x2 + y2 ;
< 4 , iter ) ;

}
while ( any ( mask ) && i < 256) ;
return iter ;
}

25 of 34
Boost.SIMD - Mandelbrot
pack < int > julia ( pack < float > const & a , pack < float > const & b )
{
pack < int > i_t ;
std :: size_t i = 0;
pack < int > iter ;
float x , y ;
do
{
pack < float > x2 =
pack < float > y2 =
pack < float > xy =
x = x2 - y2 + a ;
y = xy + b ;
pack < float > m2 =
iter = selinc ( m2
i ++;

x * x;
y * y;
2 * x * y;

x2 + y2 ;
< 4 , iter ) ;

}
while ( any ( mask ) && i < 256) ;
return iter ;
}

25 of 34
Boost.SIMD - Mandelbrot
pack < int > julia ( pack < float > const & a , pack < float > const & b )
{
pack < int > i_t ;
std :: size_t i = 0;
pack < int > iter ;
float x , y ;
do
{
pack < float > x2 =
pack < float > y2 =
pack < float > xy =
x = x2 - y2 + a ;
y = xy + b ;
pack < float > m2 =
iter = selinc ( m2
i ++;

x * x;
y * y;
2 * x * y;

x2 + y2 ;
< 4 , iter ) ;

}
while ( any ( mask ) && i < 256) ;
return iter ;
}

25 of 34
NT2

A Scientific Computing Library
Provide a simple, Matlab-like interface for users
Provide high-performance computing entities and primitives
Easily extendable

Components
Use Boost.SIMD for in-core optimizations
Use recursive parallel skeletons for threading
Code is made independant of architecture and runtime

26 of 34
Expressivité

The Numerical Template Toolbox

NT2

Matlab
SciLab

C++
JAVA

C, FORTRAN

Efficacité

27 of 34
The Numerical Template Toolbox
Principles
table<T,S> is a simple, multidimensional array object that exactly mimics
Matlab array behavior and functionalities
500+ functions usable directly either on table or on any scalar values as in
Matlab

28 of 34
The Numerical Template Toolbox
Principles
table<T,S> is a simple, multidimensional array object that exactly mimics
Matlab array behavior and functionalities
500+ functions usable directly either on table or on any scalar values as in
Matlab

How does it works
Take a .m file, copy to a .cpp file

28 of 34
The Numerical Template Toolbox
Principles
table<T,S> is a simple, multidimensional array object that exactly mimics
Matlab array behavior and functionalities
500+ functions usable directly either on table or on any scalar values as in
Matlab

How does it works
Take a .m file, copy to a .cpp file
Add #include <nt2/nt2.hpp> and do cosmetic changes

28 of 34
The Numerical Template Toolbox
Principles
table<T,S> is a simple, multidimensional array object that exactly mimics
Matlab array behavior and functionalities
500+ functions usable directly either on table or on any scalar values as in
Matlab

How does it works
Take a .m file, copy to a .cpp file
Add #include <nt2/nt2.hpp> and do cosmetic changes
Compile the file and link with libnt2.a

28 of 34
Performances - Mandelbrodt

29 of 34
Performances - LU Decomposition
8000 × 8000 LU decomposition

Median GFLOPS

150

100

50
NT2
Plasma

0
0

10

20

30

Number of cores
30 of 34

40

50
Performances - linsolve

nt2::tie(x,r) = linsolve(A,b);

Scale
1024 × 1024
2048 × 2048
12000 × 1200

31 of 34

C LAPACK
85.2
350.7
735.7

C MAGMA
85.2
235.8
1299.0

NT2 LAPACK
83.1
348.2
734.1

NT2 MAGMA
85.1
236.0
1300.1
Talk Layout

Introduction
Abstractions
Efficiency
Tools
Conclusion

32 of 34
Let’s round this up!
Parallel Computing for Scientist
Software Libraries built as Generic and Generative components can solve a large
chunk of parallelism related problems while being easy to use.
Like regular language, EDSL needs informations about the hardware system
Integrating hardware descriptions as Generic components increases tools
portability and re-targetability

33 of 34
Let’s round this up!
Parallel Computing for Scientist
Software Libraries built as Generic and Generative components can solve a large
chunk of parallelism related problems while being easy to use.
Like regular language, EDSL needs informations about the hardware system
Integrating hardware descriptions as Generic components increases tools
portability and re-targetability

Recent activity
Follow us on http://www.github.com/MetaScale/nt2
Prototype for single source GPU support
Toward a global generic approach to parallelism

33 of 34
Thanks for your attention

Weitere ähnliche Inhalte

Mehr von Joel Falcou

Designing C++ portable SIMD support
Designing C++ portable SIMD supportDesigning C++ portable SIMD support
Designing C++ portable SIMD support
Joel Falcou
 

Mehr von Joel Falcou (11)

Designing C++ portable SIMD support
Designing C++ portable SIMD supportDesigning C++ portable SIMD support
Designing C++ portable SIMD support
 
The Goal and The Journey - Turning back on one year of C++14 Migration
The Goal and The Journey - Turning back on one year of C++14 MigrationThe Goal and The Journey - Turning back on one year of C++14 Migration
The Goal and The Journey - Turning back on one year of C++14 Migration
 
HDR Defence - Software Abstractions for Parallel Architectures
HDR Defence - Software Abstractions for Parallel ArchitecturesHDR Defence - Software Abstractions for Parallel Architectures
HDR Defence - Software Abstractions for Parallel Architectures
 
Lattice Boltzmann sur architecture multicoeurs vectorielle - Une approche de ...
Lattice Boltzmann sur architecture multicoeurs vectorielle - Une approche de ...Lattice Boltzmann sur architecture multicoeurs vectorielle - Une approche de ...
Lattice Boltzmann sur architecture multicoeurs vectorielle - Une approche de ...
 
Boost.SIMD
Boost.SIMDBoost.SIMD
Boost.SIMD
 
Automatic Task-based Code Generation for High Performance DSEL
Automatic Task-based Code Generation for High Performance DSELAutomatic Task-based Code Generation for High Performance DSEL
Automatic Task-based Code Generation for High Performance DSEL
 
Software Abstractions for Parallel Hardware
Software Abstractions for Parallel HardwareSoftware Abstractions for Parallel Hardware
Software Abstractions for Parallel Hardware
 
(Costless) Software Abstractions for Parallel Architectures
(Costless) Software Abstractions for Parallel Architectures(Costless) Software Abstractions for Parallel Architectures
(Costless) Software Abstractions for Parallel Architectures
 
Boost.Dispatch
Boost.DispatchBoost.Dispatch
Boost.Dispatch
 
Designing Architecture-aware Library using Boost.Proto
Designing Architecture-aware Library using Boost.ProtoDesigning Architecture-aware Library using Boost.Proto
Designing Architecture-aware Library using Boost.Proto
 
Generative and Meta-Programming - Modern C++ Design for Parallel Computing
Generative and Meta-Programming - Modern C++ Design for Parallel ComputingGenerative and Meta-Programming - Modern C++ Design for Parallel Computing
Generative and Meta-Programming - Modern C++ Design for Parallel Computing
 

Kürzlich hochgeladen

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Kürzlich hochgeladen (20)

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 

Costless Software Abstractions For Parallel Hardware System

  • 1. Costless Software Abstractions For Parallel Hardware System Joel Falcou LRI - INRIA - MetaScale SAS Maison de la Simulation 04/03/2014
  • 2. Context In Scientific Computing ... there is Scientific Applications are domain driven Users = Developers Users are reluctant to changes there is Computing Computing requires performance ... ... which implies architectures specific tuning ... which requires expertise ... which may or may not be available The Problem People using computers to do science want to do science first. 2 of 34
  • 3. The Problem – and how we want to solve it The Facts The ”Library to bind them all” doesn’t exist (or we should have it already ) All those users want to take advantage of new architectures Few of them want to actually handle all the dirty work The Ends Provide a ”familiar” interface that let users benefit from parallelism Helps compilers to generate better parallel code Increase sustainability by decreasing amount of code to write The Means Parallel Abstractions: Skeletons Efficient Implementation: DSEL The Holy Glue: Generative Programming 3 of 34
  • 4. Expressivité Efficient or Expressive – Choose one Matlab SciLab C++ JAVA C, FORTRAN Efficacité 4 of 34
  • 7. Spotting abstraction when you see one Processus 3 F2 Processus 1 Processus 2 Processus 4 Processus 6 Processus 7 F1 Distrib. F2 Collect. F3 Processus 5 F2 7 of 34
  • 8. Spotting abstraction when you see one Pipeline Farm Processus 3 F2 Processus 1 Processus 2 Processus 4 Processus 6 Processus 7 F1 Distrib. F2 Collect. F3 Processus 5 F2 7 of 34
  • 9. Parallel Skeletons in a nutshell Basic Principles [COLE 89] There are patterns in parallel applications Those patterns can be generalized in Skeletons Applications are assembled as combination of such patterns 8 of 34
  • 10. Parallel Skeletons in a nutshell Basic Principles [COLE 89] There are patterns in parallel applications Those patterns can be generalized in Skeletons Applications are assembled as combination of such patterns Functionnal point of view Skeletons are Higher-Order Functions Skeletons support a compositionnal semantic Applications become composition of state-less functions 8 of 34
  • 11. Classic Parallel Skeletons Data Parallel Skeletons map: Apply a n-ary function in SIMD mode over subset of data fold: Perform n-ary reduction over subset of data scan: Perform n-ary prefix reduction over subset of data 9 of 34
  • 12. Classic Parallel Skeletons Data Parallel Skeletons map: Apply a n-ary function in SIMD mode over subset of data fold: Perform n-ary reduction over subset of data scan: Perform n-ary prefix reduction over subset of data Task Parallel Skeletons par: Independant task execution pipe: Task dependency over time farm: Load-balancing 9 of 34
  • 13. Why using Parallel Skeletons Software Abstraction Write without bothering with parallel details Code is scalable and easy to maintain Debuggable, Provable, Certifiable 10 of 34
  • 14. Why using Parallel Skeletons Software Abstraction Write without bothering with parallel details Code is scalable and easy to maintain Debuggable, Provable, Certifiable Hardware Abstraction Semantic is set, implementation is free Composability ⇒ Hierarchical architecture 10 of 34
  • 16. Generative Programming Domain Specific Application Description Generative Component Translator Parametric Sub-components 12 of 34 Concrete Application
  • 17. Generative Programming as a Tool Available techniques Dedicated compilers External pre-processing tools Languages supporting meta-programming 13 of 34
  • 18. Generative Programming as a Tool Available techniques Dedicated compilers External pre-processing tools Languages supporting meta-programming 13 of 34
  • 19. Generative Programming as a Tool Available techniques Dedicated compilers External pre-processing tools Languages supporting meta-programming Definition of Meta-programming Meta-programming is the writing of computer programs that analyse, transform and generate other programs (or themselves) as their data. 13 of 34
  • 20. From Generative to Meta-programming Meta-programmable languages template Haskell metaOcaml C++ 14 of 34
  • 21. From Generative to Meta-programming Meta-programmable languages template Haskell metaOcaml C++ 14 of 34
  • 22. From Generative to Meta-programming Meta-programmable languages template Haskell metaOcaml C++ C++ meta-programming Relies on the C++ template sub-language Handles types and integral constants at compile-time Proved to be Turing-complete 14 of 34
  • 23. Domain Specific Embedded Languages What’s an DSEL ? DSL = Domain Specific Language Declarative language, easy-to-use, fitting the domain DSEL = DSL within a general purpose language EDSL in C++ Relies on operator overload abuse (Expression Templates) Carry semantic information around code fragment Generic implementation become self-aware of optimizations Exploiting static AST At the expression level: code generation At the function level: inter-procedural optimization 15 of 34
  • 24. Expression Templates = matrix x(h,w),a(h,w),b(h,w); x = cos(a) + (b*a); + x expr<assign ,expr<matrix&> ,expr<plus , expr<cos ,expr<matrix&> > , expr<multiplies ,expr<matrix&> ,expr<matrix&> > >(x,a,b); Arbitrary Transforms applied on the meta-AST 16 of 34 * cos a b #pragma omp parallel for for(int j=0;j<h;++j) { for(int i=0;i<w;++i) { x(j,i) = cos(a(j,i)) + ( b(j,i) * a(j,i) ); } } a
  • 25. Embedded Domain Specific Languages EDSL in C++ Relies on operator overload abuse Carry semantic information around code fragment Generic implementation become self-aware of optimizations Advantages Allow introduction of DSLs without disrupting dev. chain Semantic defined as type informations means compile-time resolution Access to a large selection of runtime binding 17 of 34
  • 26. Embedded Domain Specific Languages 18 of 34
  • 28. Different Strokes Objectives Apply DSEL generation techniques for different kind of hardware Demonstrate low cost of abstractions Demonstrate applicability of skeletons Contributions Extend DEMRAL into AA-DEMRAL Boost.SIMD NT2 20 of 34
  • 29. Architecture Aware Generative Programming 21 of 34
  • 30. Boost.SIMD Generic Programming for portable SIMDization SIMD computation C++ Library Built with modern C++ style for modern C++ usage Easy to extend Easy to adapt to new CPUs 22 of 34
  • 31. Boost.SIMD Generic Programming for portable SIMDization SIMD computation C++ Library Built with modern C++ style for modern C++ usage Easy to extend Easy to adapt to new CPUs Goals: Integrate SIMD computations in standard C++ Make writing of SIMD code easier Promotes Generic Programming Provides performance out of the box 22 of 34
  • 32. Boost.SIMD Principles pack<T,N> encapsulates the best hardware register for N elements of type T pack<T,N> provides a classical value semantic Operations onpack<T,N> maps to proper intrinsics Support for SIMD standard algortihms 23 of 34
  • 33. Boost.SIMD Principles pack<T,N> encapsulates the best hardware register for N elements of type T pack<T,N> provides a classical value semantic Operations onpack<T,N> maps to proper intrinsics Support for SIMD standard algortihms How does it works Code is written as for scalar values Code can be applied as polymorphic functor over data Expression Templates optimize operations 23 of 34
  • 34. Boost.SIMD Current Support OS : Linux, Windows, iOS, Android Architecture: x86, ARM, PowerPC Compilers: g++, clang, icc, MSVC Extensions: x86: SSE2,SSE3,SSSE3,SSE4.x,AVX,XOP,FMA3/4,AVX2 PPC: VMX ARM: NEON In progress: Xeon Phi MIC VSX, QPX NEON2 24 of 34
  • 35. Boost.SIMD - Mandelbrot pack < int > julia ( pack < float > const & a , pack < float > const & b ) { pack < int > i_t ; std :: size_t i = 0; pack < int > iter ; float x , y ; do { pack < float > x2 = pack < float > y2 = pack < float > xy = x = x2 - y2 + a ; y = xy + b ; pack < float > m2 = iter = selinc ( m2 i ++; x * x; y * y; 2 * x * y; x2 + y2 ; < 4 , iter ) ; } while ( any ( mask ) && i < 256) ; return iter ; } 25 of 34
  • 36. Boost.SIMD - Mandelbrot pack < int > julia ( pack < float > const & a , pack < float > const & b ) { pack < int > i_t ; std :: size_t i = 0; pack < int > iter ; float x , y ; do { pack < float > x2 = pack < float > y2 = pack < float > xy = x = x2 - y2 + a ; y = xy + b ; pack < float > m2 = iter = selinc ( m2 i ++; x * x; y * y; 2 * x * y; x2 + y2 ; < 4 , iter ) ; } while ( any ( mask ) && i < 256) ; return iter ; } 25 of 34
  • 37. Boost.SIMD - Mandelbrot pack < int > julia ( pack < float > const & a , pack < float > const & b ) { pack < int > i_t ; std :: size_t i = 0; pack < int > iter ; float x , y ; do { pack < float > x2 = pack < float > y2 = pack < float > xy = x = x2 - y2 + a ; y = xy + b ; pack < float > m2 = iter = selinc ( m2 i ++; x * x; y * y; 2 * x * y; x2 + y2 ; < 4 , iter ) ; } while ( any ( mask ) && i < 256) ; return iter ; } 25 of 34
  • 38. NT2 A Scientific Computing Library Provide a simple, Matlab-like interface for users Provide high-performance computing entities and primitives Easily extendable Components Use Boost.SIMD for in-core optimizations Use recursive parallel skeletons for threading Code is made independant of architecture and runtime 26 of 34
  • 39. Expressivité The Numerical Template Toolbox NT2 Matlab SciLab C++ JAVA C, FORTRAN Efficacité 27 of 34
  • 40. The Numerical Template Toolbox Principles table<T,S> is a simple, multidimensional array object that exactly mimics Matlab array behavior and functionalities 500+ functions usable directly either on table or on any scalar values as in Matlab 28 of 34
  • 41. The Numerical Template Toolbox Principles table<T,S> is a simple, multidimensional array object that exactly mimics Matlab array behavior and functionalities 500+ functions usable directly either on table or on any scalar values as in Matlab How does it works Take a .m file, copy to a .cpp file 28 of 34
  • 42. The Numerical Template Toolbox Principles table<T,S> is a simple, multidimensional array object that exactly mimics Matlab array behavior and functionalities 500+ functions usable directly either on table or on any scalar values as in Matlab How does it works Take a .m file, copy to a .cpp file Add #include <nt2/nt2.hpp> and do cosmetic changes 28 of 34
  • 43. The Numerical Template Toolbox Principles table<T,S> is a simple, multidimensional array object that exactly mimics Matlab array behavior and functionalities 500+ functions usable directly either on table or on any scalar values as in Matlab How does it works Take a .m file, copy to a .cpp file Add #include <nt2/nt2.hpp> and do cosmetic changes Compile the file and link with libnt2.a 28 of 34
  • 45. Performances - LU Decomposition 8000 × 8000 LU decomposition Median GFLOPS 150 100 50 NT2 Plasma 0 0 10 20 30 Number of cores 30 of 34 40 50
  • 46. Performances - linsolve nt2::tie(x,r) = linsolve(A,b); Scale 1024 × 1024 2048 × 2048 12000 × 1200 31 of 34 C LAPACK 85.2 350.7 735.7 C MAGMA 85.2 235.8 1299.0 NT2 LAPACK 83.1 348.2 734.1 NT2 MAGMA 85.1 236.0 1300.1
  • 48. Let’s round this up! Parallel Computing for Scientist Software Libraries built as Generic and Generative components can solve a large chunk of parallelism related problems while being easy to use. Like regular language, EDSL needs informations about the hardware system Integrating hardware descriptions as Generic components increases tools portability and re-targetability 33 of 34
  • 49. Let’s round this up! Parallel Computing for Scientist Software Libraries built as Generic and Generative components can solve a large chunk of parallelism related problems while being easy to use. Like regular language, EDSL needs informations about the hardware system Integrating hardware descriptions as Generic components increases tools portability and re-targetability Recent activity Follow us on http://www.github.com/MetaScale/nt2 Prototype for single source GPU support Toward a global generic approach to parallelism 33 of 34
  • 50. Thanks for your attention