DASH is a realization of the PGAS (partitioned global address space) programming model in the form of a C++ template library. It provides a multidimensional array abstraction which is typically used as an underlying container for stencil and dense matrix operations.
The efficiency of operations on a distributed multi-dimensional array depends strongly on the distribution of its elements to processes and on the communication strategy used to propagate values between them. Locality can only be improved by employing an optimal distribution that is specific to the implementation of the algorithm, to run-time parameters such as node topology, and to numerous additional aspects. Application developers are typically unaware of these implications, which may also change in future releases of DASH.
In the following, we identify fundamental properties of distribution patterns that are prevalent in existing HPC applications.
We describe a classification scheme of multi-dimensional distributions based on these properties and demonstrate how distribution patterns can be optimized for locality and communication avoidance automatically and, to a great extent, at compile time.
3. Background
Expressing and Exploiting Multi-Dimensional Locality in DASH
DASH
• Vision: “C++ standard template library for HPC”.
• Provides n-dim array abstraction for stencil and dense matrix
operations.
• Realization of the PGAS (partitioned global address space)
programming model.
4. Background
PGAS and Locality
• Combine distributed memory into virtual global memory space.
• Strong sense of data ownership:
private, shared local, shared global
int p = 42;
5. Background
PGAS and Locality
• Combine distributed memory into virtual global memory space.
• Strong sense of data ownership:
private, shared local, shared global
int p = 42;
dash::Array<int> a;
a.local[4] = p;
6. Background
PGAS and Locality
• Combine distributed memory into virtual global memory space.
• Strong sense of data ownership:
private, shared local, shared global
int p;
dash::Array<int> a;
p = a[40];
7. Background
PGAS and Locality
• Locality (access distance to data) is the predominant factor for efficiency.
L = (local accesses) / (total accesses)
• Access pattern on data depends on implementation of algorithm.
• The complexity of maintaining locality increases exponentially with the
number of data dimensions.
8. Objective and Approach
Objective
Portable efficiency by automatic deduction of optimal data distribution.
Approach
1. Identify distribution properties that allow well-defined specification of
any data distribution.
2. Let algorithms specify soft / hard constraints on distribution properties.
3. Derive optimal distribution for a given set of constraints.
9. Distribution Properties
Property Categories
Mappings in data distribution can be categorized by their stages:
Partitioning Decomposing the index domain to blocks
Mapping Assigning blocks to units
Layout Storage order of block elements in units’ local memory
10. Distribution Properties
Example: Morton Order Distribution
Partitioning: balanced, regular, rectangular
Mapping: balanced, minimal, neighbor
Layout: blocked, linear, canonical
11. Use Cases
Automatic Deduction of Optimal Data Distribution
“Find a data distribution that fulfills a set of properties.”
// Deduces pattern type, initializes pattern instance:
auto pattern =
  make_pattern<
    partitioning_properties< balanced, regular >,  // -- compile-time deduction
    mapping_properties< neighbor >,                //    via C++11 generic
    layout_properties< blocked, row_major >        //    template metaprogramming
  >(Size<2>(10000, 10000),                         // -- run-time deduction
    Team<2>(24, 24));
12. Use Cases
Automatic Deduction of Optimal Data Distribution
“Find a data distribution that is optimal for a given algorithm.”
// Deduce pattern from algorithm constraints:
auto pattern = dash::make_pattern< dash::summa_pattern_constraints >(
Size<2>(10000,10000),
Team<2>(24,24));
dash::Matrix<double, 2> matrix_a(pattern);
dash::Matrix<double, 2> matrix_b(pattern);
dash::Matrix<double, 2> matrix_c(pattern);
dash::summa(matrix_a, matrix_b, matrix_c);
13. Use Cases
Automatic Deduction of Optimal Algorithm
“Find algorithm variant that is optimal for a given data distribution.”
// Specify how data is distributed in global memory:
auto pattern = dash::TilePattern<2>(10000,10000, TILED(100,100));
dash::Matrix<double, 2> matrix_a(pattern);
dash::Matrix<double, 2> matrix_b(pattern);
dash::Matrix<double, 2> matrix_c(pattern);
// Selects matrix product algorithm variant that is optimal for the given
// pattern:
dash::multiply(matrix_a, matrix_b, matrix_c);
14. Use Cases
Automatic Deduction of Optimal Algorithm
“Find data distribution for the most efficient algorithm variant.”
// Use constraints of most efficient algorithm, usually SUMMA for DGEMM:
auto pattern = dash::make_pattern< dash::multiply_pattern_constraints >(
Size<2>(10000,10000),
Team<2>(24,24));
dash::Matrix<double, 2> matrix_a(pattern);
dash::Matrix<double, 2> matrix_b(pattern);
dash::Matrix<double, 2> matrix_c(pattern);
// Calls dash::summa
dash::multiply(matrix_a, matrix_b, matrix_c);
15. Evaluation: DGEMM
MKL multithreaded vs. DASH MPI (GFLOP/s)
DASH: automatic distribution of matrix elements to MPI processes,
each using serial MKL for block matrix multiplication (SUMMA).
MKL: OpenMP threads, matrix initialization in master thread.
16. Evaluation: DGEMM
MKL multithreaded vs. DASH MPI (Speedup)
DASH: high locality due to optimal data distribution, but massive
communication overhead (MPI, no shared-memory windows).
MKL: low locality (first-touch issues), no communication.
DASH beats MKL for larger N and higher degrees of parallelism.
Speedup = GFLOP/s (DASH) / GFLOP/s (MKL)
17. Evaluation: SGEMM
MKL multithreaded vs. DASH MPI (GFLOP/s)
DASH: automatic distribution of matrix elements to MPI processes,
each using serial MKL for block matrix multiplication (SUMMA).
MKL: OpenMP threads, matrix initialization in master thread.
18. Evaluation: SGEMM
MKL multithreaded vs. DASH MPI (Speedup)
DASH: high locality due to optimal data distribution, but massive
communication overhead (MPI, no shared-memory windows).
MKL: low locality (first-touch issues), no communication.
DASH beats MKL for larger N and higher degrees of parallelism.
Speedup = GFLOP/s (DASH) / GFLOP/s (MKL)
19. Summary
Summary
• Optimal distribution of n-dim data depends on an unmanageable multitude
of factors (topology, access pattern, data flow, …).
• We defined a universal classification of distribution properties.
• Property system allows automatic deduction of optimal data distribution
and algorithm variants at compile time and run time.
Works with any C++11 compiler (tested: Intel 14.0+, gcc 4.7+, clang).
• Work in progress: optimal data distribution for data flows.