State of programming models and code
transformations on heterogeneous
platforms
Boyana Norris
norris@mcs.anl.gov
- Computer Scientist, Mathematics and
Computer Science Division, Argonne
National Laboratory
- Senior Fellow, Computation Institute,
University of Chicago
Before there were computers…
Jacquard Loom, invented in 1801
 Programming was
– Parallel
– Pattern-based
– Multithreaded
(Possibly) the first heterogeneous computer(s)
Outline, goals
 Parallel programming for heterogeneous architectures
– Challenges
– Example approaches
 Help set the stage for subsequent panel discussions w.r.t.
issues related to programming heterogeneous architectures
– Need your input, please do interrupt
Heterogeneity
 Hardware heterogeneity (different devices with different
capabilities), e.g.:
– Multicore x86 CPUs with GPUs
– Multicore x86 CPUs with Intel Phi accelerators
– big.LITTLE (coupling slower, low-power ARM cores with faster, power-hungry ARM cores)
– A cluster with different types of nodes
– x86 CPU with FPGAs (e.g., Convey)
– …
 Software heterogeneity (e.g., OS, languages)
– Not part of this talk
Similarities among heterogeneous platforms
 Typically each processor has several, and sometimes many, execution units
– NVIDIA Fermi GPUs have 16 Streaming Multiprocessors (SMs);
– AMD GPUs have 20 or more SIMD units;
– Intel Phi has >50 x86 cores
 Each execution unit typically has SIMD or vector execution.
– NVIDIA GPUs execute threads in SIMD-like groups of 32 (what NVIDIA
calls warps);
– AMD GPUs execute in wavefronts that are 64 threads wide;
– Intel Phi has 512-bit wide SIMD instructions (16 floats or 8 doubles).
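These widths drive how inner loops should be written: a 512-bit vector unit retires 16 single-precision or 8 double-precision operations per instruction, a warp executes 32 threads in lockstep, and a wavefront 64. A minimal C sketch (the routine and its arguments are illustrative, not from the slides) using the standard OpenMP simd directive so a compiler can map iterations onto SIMD lanes:

    /* a*x + y over n elements; with OpenMP SIMD support enabled the compiler
       packs consecutive iterations into vector lanes, e.g. 16 floats per
       512-bit instruction on Intel Phi. */
    void saxpy(int n, float a, const float *restrict x, float *restrict y)
    {
        #pragma omp simd
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }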
Many scales
Parallel programming models
 Bulk synchronous parallelism (BSP)
 Stream processing
 Algorithmic skeletons (e.g., master-worker)
 Workflow/dataflow
 Remote method invocation
 Distributed objects
 Components
 Functional
 …
Parallel programming models (cont.)
 Parallel process interaction
– Distributed data, exchanged through explicit messages (e.g., MPI)
– Shared/global memory (e.g., PGAS)
 Work parallelism
– SPMD
– Dataflow
– Task-based
– Streaming
– …
 Heterogeneous resources
– Host-directed execution with selected kernels offloaded to a co-processor, e.g., MPI + CUDA/OpenCL
– “Symmetric”, e.g., MPI on x86/Phi systems
Example: Host-directed MPI+X model
Image by Yili Zheng, LBL
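The slide above is an architecture diagram; as a minimal hedged sketch of the same host-directed pattern in C, each MPI rank runs on a host core and offloads one kernel to its local accelerator, with OpenACC directives standing in for the "X" (the array size, initialization, and reduction kernel are illustrative assumptions):

    #include <mpi.h>
    #include <stdlib.h>

    #define N 1000000

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *x = malloc(N * sizeof *x);
        for (int i = 0; i < N; ++i) x[i] = rank + i * 1e-6;

        /* Host-directed: this loop is offloaded to the local accelerator. */
        double local = 0.0;
        #pragma acc parallel loop reduction(+:local) copyin(x[0:N])
        for (int i = 0; i < N; ++i)
            local += x[i] * x[i];

        /* Results are combined across nodes with ordinary MPI on the host. */
        double global = 0.0;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        free(x);
        MPI_Finalize();
        return 0;
    }

The same structure applies when the offloaded kernel is written in CUDA or OpenCL instead of directives.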
Challenges
 Managing data
– Data distribution, movement, replication
– Load balancing
 Different processing capabilities (FPUs, clock rates, vector
units)
 Different instruction sets
Software developer’s point of view
 Important considerations, tradeoffs
– Initial investment
• learning curve, reimplementation
– Ongoing costs
• Maintainability, portability
– Performance
• Real time, within power constraints,…
– Life expectancy
• Architectures, software dependencies
– Suitability for particular goals
• Embedded system vs petaflop machine
– Agility
• Ability to exploit new architectures
– …
Programming model implementations
 Established:
– Parallelism expressed through message-passing, thread-based shared
memory, PGAS languages
– High-level languages or libraries with APIs that can map to different
models, e.g., MPI
– General-purpose languages with compiler support for exploiting
hybrid architectures
– Small language extensions or annotations embedded in GPLs with
compiler or source transformation tool support, e.g., CUDA Fortran
– Streaming, e.g., CUDA
 More recent
 Extinct, e.g., HPF
Tradeoffs
Figure: development productivity (low to high) plotted against scalability (low to high), placing low-level languages or APIs with fully explicit parallelism control, libraries and frameworks, high-level parallel languages, and sequential GPLs and high-level DSLs along the tradeoff.
Source transformations
 Typically multiple levels of abstraction and programming
models are used simultaneously
 Goal is to express algorithms at the highest level appropriate
for the functionality being implemented
 A single language or library is unlikely to be best for any given
application on all possible hardware
 One approach (sketched below):
– Define algorithms using high-level abstractions
– Provide tools to translate these into lower-level, possibly architecture-specific implementations
 Most programming on heterogeneous platforms involves
source transformation
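As a purely hypothetical illustration of that approach (the annotation syntax below is invented, not Orio's or any other tool's), a source transformation tool could take a plainly written loop plus a transformation request and emit a tuned, lower-level variant that computes the same result:

    /* Input: high-level form with a hypothetical transformation request. */
    /* transform: unroll(factor=4) */
    void scale(int n, double a, double *x)
    {
        for (int i = 0; i < n; ++i)
            x[i] *= a;
    }

    /* Output: generated variant, unrolled by 4 with a cleanup loop. */
    void scale_unrolled(int n, double a, double *x)
    {
        int i;
        for (i = 0; i + 3 < n; i += 4) {
            x[i]     *= a;
            x[i + 1] *= a;
            x[i + 2] *= a;
            x[i + 3] *= a;
        }
        for (; i < n; ++i)   /* remainder iterations */
            x[i] *= a;
    }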
Example: Annotation-based approaches
 Pros: low-effort, minimal changes
 Cons: limited expressivity, performance
 Examples:
– MPI + OpenACC directives in a GPL
– Some embedded DSLs (e.g., as supported by Orio)
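A minimal hedged example of the directive style (the routine, sizes, and data clauses are illustrative, not taken from the talk): the code remains ordinary C, an OpenACC-aware compiler offloads the annotated loops, and other compilers simply ignore the pragmas.

    /* Dense matrix-vector product y = A*x; builds and runs unchanged
       without OpenACC support, in which case the pragmas are ignored. */
    void matvec(int n, const double *A, const double *x, double *y)
    {
        #pragma acc parallel loop copyin(A[0:n*n], x[0:n]) copyout(y[0:n])
        for (int i = 0; i < n; ++i) {
            double sum = 0.0;
            #pragma acc loop reduction(+:sum)
            for (int j = 0; j < n; ++j)
                sum += A[i * n + j] * x[j];
            y[i] = sum;
        }
    }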
Current limitations
 Minimally intrusive approaches typically don’t result in the
best performance possible, e.g., OpenACC annotations
without code restructuring
 A number of single-platform solutions are provided by vendors (e.g., Intel, NVIDIA); portability and performance on other platforms are not guaranteed
General-purpose programming languages
 GPLs for parallel, possibly heterogeneous architectures
– UPC, CAF, Chapel, X10
 Pros:
– Robustness (e.g., type safety, memory consistency)
– Tools (e.g., debugging, performance analysis)
 Cons:
– Manual reimplementation required in most cases
– Hard to balance user control with resource management automation
– Interoperability
Recall host-directed MPI+X model
Image by Yili Zheng, LBL
PGAS model
Image by Yili Zheng, LBL
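PGAS languages such as UPC or CAF expose a partitioned global address space in which any process can read or write remote data directly, without a matching receive on the other side. The slide does not prescribe an implementation; a rough C sketch of the same one-sided style using MPI-3 RMA windows (the block size and neighbor pattern are arbitrary choices for illustration):

    #include <mpi.h>

    #define LOCAL_N 1024

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each rank contributes one block of a logically global array. */
        double block[LOCAL_N];
        for (int i = 0; i < LOCAL_N; ++i) block[i] = rank;

        MPI_Win win;
        MPI_Win_create(block, LOCAL_N * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        /* One-sided access: write a value into the right neighbor's block. */
        double val = 100.0 + rank;
        int neighbor = (rank + 1) % size;
        MPI_Win_fence(0, win);
        MPI_Put(&val, 1, MPI_DOUBLE, neighbor, 0 /* displacement */,
                1, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }

A PGAS language expresses the same pattern more directly, e.g., UPC's shared arrays and upc_forall loops.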
High-level frameworks and libraries
 Domain-specific problem-solving environments and
mathematical libraries can encapsulate the specifics of
mapping to heterogeneous architectures (e.g., PETSc, Trilinos, Cactus; see the sketch below)
 Advantages
– Efficient implementations of common functionality
– Different levels of APIs to hide or expose different levels of the
implementation and runtime (unlike pure language approaches)
– Relatively rapid support of new hardware
 Disadvantages
– Learning curves, deep software dependencies
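As a hedged illustration of the encapsulation point above: in PETSc the same vector code can run on a CPU or an accelerator because the storage backend is chosen at run time; the sizes here are arbitrary and the option shown (-vec_type cuda, available in recent PETSc releases) is an assumption about current PETSc rather than something stated in the talk. Error checking is omitted for brevity.

    #include <petscvec.h>

    int main(int argc, char **argv)
    {
        PetscInitialize(&argc, &argv, NULL, NULL);

        Vec x;
        PetscReal norm;

        VecCreate(PETSC_COMM_WORLD, &x);
        VecSetSizes(x, PETSC_DECIDE, 1000000);
        /* Backend selected from the command line, e.g. "-vec_type cuda";
           the application code itself does not change. */
        VecSetFromOptions(x);

        VecSet(x, 1.0);
        VecNorm(x, NORM_2, &norm);
        PetscPrintf(PETSC_COMM_WORLD, "||x|| = %g\n", (double)norm);

        VecDestroy(&x);
        PetscFinalize();
        return 0;
    }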
Ongoing efforts attempting to balance
scalability with productivity
 DOE X-Stack program pursues fundamental advances in
programming models, languages, compilers, runtime systems
and tools to support the transition of applications to exascale
platforms
– DEGAS (Dynamic, Exascale Global Address Space): a PGAS approach
– SLEEC (Semantics-rich Libraries for Effective Exascale Computation):
annotations and cost models to compile into optimized low-level
implementations
– X-Tune: model-based code generation and optimization of algorithms
written in GPLs
– D-TEC: compilers for new general-purpose languages and for DSLs embedded in other languages
Summary
 Many traditional programming models can be used on
heterogeneous architectures, with vendor support for
compilers, libraries and runtimes
 No clear multi-platform winner among programming models/languages/frameworks
 Many new efforts aim at deepening the software stack to enable a better balance of programmability, performance, and portability
Editor's notes
  1. Inspired the punched cards used in Charles Babbage’s analytical engine (conceived in 1834)
  2. Atanasoff–Berry Computer (ABC), 1937–42; Mark 2
  3. sneetah