DASH is a realization of the PGAS (partitioned global address space) programming model in the form of a C++ template library. It provides a multidimensional array abstraction which is typically used as an underlying container for stencil and dense matrix operations.
The efficiency of operations on a distributed multi-dimensional array depends strongly on the distribution of its elements to processes and on the communication strategy used to propagate values between them. Locality can only be improved by employing an optimal distribution that is specific to the implementation of the algorithm, to run-time parameters such as node topology, and to numerous additional aspects. Application developers are typically unaware of these implications, which may also change in future releases of DASH.
In the following, we identify fundamental properties of distribution patterns that are prevalent in existing HPC applications.
We describe a classification scheme of multi-dimensional distributions based on these properties and demonstrate how distribution patterns can be optimized for locality and communication avoidance automatically and, to a great extent, at compile time.
3. Background
Expressing and Exploiting Multi-Dimensional Locality in DASH
DASH
• Vision: “C++ standard template library for HPC”.
• Provides n-dim array abstraction for stencil and dense matrix
operations.
• Realization of the PGAS (partitioned global address space)
programming model.
4. Background
PGAS and Locality
• Combine distributed memory into virtual global memory space.
• Strong sense of data ownership:
private, shared local, shared global
int p = 42;
5. Background
PGAS and Locality
• Combine distributed memory into virtual global memory space.
• Strong sense of data ownership:
private, shared local, shared global
int p = 42;
dash::Array<int> a;
a.local[4] = p;
6. Background
PGAS and Locality
• Combine distributed memory into virtual global memory space.
• Strong sense of data ownership:
private, shared local, shared global
int p;
dash::Array<int> a;
p = a[40];
7. Background
PGAS and Locality
• Locality (access distance to data) is the predominant factor for efficiency.
L = (local accesses) / (total accesses)
• Access pattern on data depends on implementation of algorithm.
• The complexity of maintaining locality increases exponentially with the
number of data dimensions.
8. Objective and Approach
Objective
Portable efficiency by automatic deduction of optimal data distribution.
Approach
1. Identify distribution properties that allow well-defined specification of
any data distribution.
2. Let algorithms specify soft / hard constraints on distribution properties.
3. Derive optimal distribution for a given set of constraints.
9. Distribution Properties
Property Categories
Mappings in data distribution can be categorized by their stages:
Partitioning Decomposing the index domain to blocks
Mapping Assigning blocks to units
Layout Storage order of block elements in units’ local memory
10. Distribution Properties
Example: Morton Order Distribution
Partitioning: balanced, regular, rectangular
Mapping: balanced, minimal, neighbor
Layout: blocked, linear, canonical
11. Use Cases
Automatic Deduction of Optimal Data Distribution
“Find a data distribution that fulfills a set of properties.”
// Deduces pattern type, initializes pattern instance:
auto pattern =
  make_pattern<
    partitioning_properties< balanced, regular >,  // -- compile-time deduction
    mapping_properties< neighbor >,                //    via C++11 generic
    layout_properties< blocked, row_major >        //    template metaprogramming
  >(Size<2>(10000, 10000),                         // -- run-time deduction
    Team<2>(24, 24));
12. Use Cases
Automatic Deduction of Optimal Data Distribution
“Find a data distribution that is optimal for a given algorithm.”
// Deduce pattern from algorithm constraints:
auto pattern = dash::make_pattern< dash::summa_pattern_constraints >(
Size<2>(10000,10000),
Team<2>(24,24));
dash::Matrix<double, 2> matrix_a(pattern);
dash::Matrix<double, 2> matrix_b(pattern);
dash::Matrix<double, 2> matrix_c(pattern);
dash::summa(matrix_a, matrix_b, matrix_c);
13. Use Cases
Automatic Deduction of Optimal Algorithm
“Find algorithm variant that is optimal for a given data distribution.”
// Specify how data is distributed in global memory:
auto pattern = dash::TilePattern<2>(10000,10000, TILED(100,100));
dash::Matrix<double, 2> matrix_a(pattern);
dash::Matrix<double, 2> matrix_b(pattern);
dash::Matrix<double, 2> matrix_c(pattern);
// Selects matrix product algorithm variant that is optimal for the given
// pattern:
dash::multiply(matrix_a, matrix_b, matrix_c);
14. Use Cases
Automatic Deduction of Optimal Algorithm
“Find data distribution for the most efficient algorithm variant.”
// Use constraints of most efficient algorithm, usually SUMMA for DGEMM:
auto pattern = dash::make_pattern< dash::multiply_pattern_constraints >(
Size<2>(10000,10000),
Team<2>(24,24));
dash::Matrix<double, 2> matrix_a(pattern);
dash::Matrix<double, 2> matrix_b(pattern);
dash::Matrix<double, 2> matrix_c(pattern);
// Calls dash::summa
dash::multiply(matrix_a, matrix_b, matrix_c);
15. Evaluation: DGEMM
MKL multithreaded vs. DASH MPI (GFLOP/s)
DASH: automatic distribution of matrix elements to MPI processes,
each using serial MKL for block matrix multiplication (SUMMA).
MKL: OpenMP threads, matrix initialization in master thread.
16. Evaluation: DGEMM
MKL multithreaded vs. DASH MPI (Speedup)
DASH: high locality due to optimal data distribution, but massive
communication overhead (MPI, no shared-memory windows).
MKL: low locality (first-touch issues), no communication.
DASH beats MKL for larger N and higher degrees of parallelism.
Speedup = GFLOP/s (DASH) / GFLOP/s (MKL)
17. Evaluation: SGEMM
MKL multithreaded vs. DASH MPI (GFLOP/s)
DASH: automatic distribution of matrix elements to MPI processes,
each using serial MKL for block matrix multiplication (SUMMA).
MKL: OpenMP threads, matrix initialization in master thread.
18. Evaluation: SGEMM
MKL multithreaded vs. DASH MPI (Speedup)
DASH: high locality due to optimal data distribution, but massive
communication overhead (MPI, no shared-memory windows).
MKL: low locality (first-touch issues), no communication.
DASH beats MKL for larger N and higher degrees of parallelism.
Speedup = GFLOP/s (DASH) / GFLOP/s (MKL)
19. Summary
Summary
• Optimal distribution of n-dim data depends on an unmanageable multitude
of factors (topology, access pattern, data flow, …).
• We defined a universal classification of distribution properties.
• Property system allows automatic deduction of optimal data distribution
and algorithm variants at compile time and run time.
Works with any C++11 compiler (tested: Intel 14.0+, gcc 4.7+, clang).
• Work in progress: optimal data distribution for data flows.