2. OPENSHMEM
PREREQUISITES
Knowledge of C/Fortran
Familiarity with parallel computing
Linux/UNIX command-line
Useful for hands-on
◦ 64-bit Linux (native, VM or remote)
E.g. Fedora, CentOS, ...
http://www.virtualbox.org/
◦ Download of GASNet, OpenSHMEM library,
test-suite & demo programs
http://gasnet.lbl.gov/#download
◦ Installation of GASNet and the OpenSHMEM library
3. Introducing OpenSHMEM
Supported by
◦ Oak Ridge National Laboratory
◦ University of Houston
◦ Open Source Software Solutions
◦ Department of Defense
◦ U.S. Department of Energy
◦ Extreme Scale Systems Center
SHared MEMory
SHMEM is a 1-sided communications library
◦ C and Fortran PGAS programming model
◦ Point-to-point and collective routines
◦ Synchronizations
◦ Atomic operations
Can take advantage of hardware offload
◦ Performance benefits
5. Introduction
Address Spaces
◦ Global vs. distributed
◦ OpenMP has global (shared) space
◦ MPI has partitioned space
Private data exchanged via messages
◦ SHMEM is “partitioned global address space”
PGAS
◦ Has private and shared data
Shared data accessible directly by other processes
Used for programs that
◦ perform computations in separate address spaces and
explicitly pass data to and from different processes in the
program.
The processes participating in shared memory
applications are referred to as processing elements
(PEs).
6. Introduction
All processors see symmetric variables
◦ Global Address Space
All processors have their own view of symmetric variables
◦ Partitioned Global Address Space
OpenSHMEM supports the Single Program Multiple Data
(SPMD) style of programming.
The OpenSHMEM Specification is an effort to create a
standardized SHMEM library API for C, C++, and
Fortran
SGI’s SHMEM API is the baseline for OpenSHMEM
Specification 1.0
The specification is open to the community for reviews
and contributions.
8. OpenSHMEM Features
Use of symmetric variables and point-to-point
“put” and “get” operations.
Remote direct memory access enables one-
sided operations, leading to performance
benefits.
A standard API for different hardware provides
network hardware independence and renders
support to different network layer technologies.
Along with data transfer operations the library
provides synchronization mechanisms (barrier,
fence, quiet, wait), collective operations
(broadcast, collection, reduction), and atomic
memory operations (swap, add, increment).
Open to the community for reviews and
contributions.
9. OpenSHMEM HISTORY AND
IMPLEMENTATIONS
Cray
◦ SHMEM first introduced by Cray Research Inc. in
1993 for Cray T3D
◦ Platforms: Cray T3D, T3E, PVP, XT series
SGI
◦ Owns the “rights” for SHMEM
◦ Baseline for OpenSHMEM development (Altix)
Quadrics (company out of business)
◦ Optimized API for QsNet
◦ Platform: Linux cluster with QsNet interconnect
Others
◦ HP SHMEM, IBM SHMEM
◦ GPSHMEM (cluster with ARMCI & MPI support, old)
10. Symmetric Variables
Arrays or variables that exist with the
same size, type, and relative address
on all PEs.
◦ The following kinds of data objects are
symmetric:
Fortran data objects in common blocks or with
the SAVE attribute.
Non-stack C and C++ variables.
Fortran arrays allocated with shpalloc
C and C++ data allocated by shmalloc
16. FREQUENTLY USED
OPENSHMEM API (C, C++)
OpenSHMEM Library Call: Functionality
start_pes(): Initializes the OpenSHMEM library
my_pe(): Returns an integer identifying the calling PE
num_pes(): Returns the number of PEs executing the application
shmem_TYPE_get(*dest, *src, size_t len, int pe): Returns after the
TYPE values of the symmetric src on the remote PE have been
copied into the symmetric dest on the local PE
shmem_TYPE_put(*dest, *src, size_t len, int pe): Returns when the
TYPE values of the symmetric src on the calling PE are on their
way to the symmetric dest on the remote PE
shmem_barrier_all(): Returns when all PEs have reached the same
barrier call and have completed all outstanding communication
17. Execution Model
1. Initialization
◦ start_pes()
Sets up the symmetric heap
Assigns PE numbers (identifiers of the PEs)
2. Data transfer
3. Query routines
4. Finalization
18. Communication Operations
Data Transfers
a. One-sided puts
b. One-sided gets
Synchronization Mechanisms
a. Fence
b. Quiet
c. Barrier
Collective Communication
a. Broadcast
b. Collection
c. Reduction
Address Manipulation
a. Allocating and deallocating memory blocks in the symmetric space.
Locks
a. Implementation of mutual exclusion
Atomic Memory Operations
a. Swap, Conditional Swap, Add and Increment
Data Cache control
19. OpenSHMEM
ATOMIC OPERATIONS
What does “atomic” mean anyway?
◦ Indivisible operation on a symmetric variable
◦ No other operation can interpose during the
update
◦ But what does “no other operation” actually mean?
No other atomic operation
Nothing can be done about other mechanisms
interfering
E.g. a thread outside of the SHMEM program
A non-atomic SHMEM operation
Why this restriction?
It permits implementation in hardware
20. OpenSHMEM
ATOMIC OPERATIONS
Atomic Swap
◦ Unconditional
◦ Conditional
Arithmetic
◦ Increment
◦ Fetch-and-increment & fetch-and-add
◦ Return the previous value of the target on the remote PE
Locks
◦ Symmetric variables
◦ Acquired and released to define mutual-exclusion
execution regions
◦ Initialize the lock to 0; after that it is managed by the lock API
◦ Can be used for updating distributed data structures
21. OpenSHMEM
LOOKING FOR OVERLAPS
How to identify overlap opportunities
◦ A put is not an indivisible operation
Its stages: source sent, local buffer reusable, data on the
wire, data stored remotely
Useful work can be done on other data in between
◦ General principle:
Identify independent tasks/data
Initiate action as early as possible
Put/barrier/collective
Interpose independent work
Synchronize as late as possible
Divide the application into distinct communication and
computation phases to minimize synchronization
points
Prefer point-to-point synchronization over
collective synchronization
22. Benchmarks and Tools
OpenSHMEM NPB: the NAS Parallel
Benchmarks ported to use the OpenSHMEM library.
The OpenSHMEM benchmarks’ performance was
evaluated against their MPI versions
on three distinctly different platforms with
different OpenSHMEM/SHMEM library
implementations.
On mature library implementations with
hardware support for RMA (SGI), the
benchmarks using the OpenSHMEM library
show comparable or better performance than
MPI.
The OpenSHMEM Analyzer (OSA), developed
in collaboration with ORNL, allows for better
error checking at compile time.
24. COMPARING MPI 2 AND
OpenSHMEM
MPI-2 window semantics
◦ All processes that intend to use the window must
participate in window creation
◦ Many or all of the local allocations/objects should be
coalesced into a single window creation
SHMEM semantics
◦ All global and static data are by default accessible
to all processes
◦ Local allocations/objects can be made SHMEM-accessible
by using shmalloc instead of malloc
25. COMPARING MPI 2 AND
OpenSHMEM
MPI_Win_fence
◦ Fence is a collective call
◦ Two fence calls are needed: one to separate access
epochs, another to complete them
◦ So it mostly functions like a barrier
shmem_fence
◦ shmem_fence is meant only for ordering of puts
◦ It neither separates the processes nor implies
completion
◦ Ensures there are no pending puts to be delivered
to the same target before the next put
26. COMPARING MPI 2 AND
OpenSHMEM
MPI-2 point-to-point synchronization
◦ The sender does Start and waits for Post from the
receiver
◦ The receiver does Post and waits for the data
◦ The sender puts the data and signals completion to
the receiver
OpenSHMEM point-to-point synchronization
◦ The receiver can directly wait for the data using
shmem_wait on an event flag
◦ The sender puts the data and sets the event flag to
signal the receiver
◦ Both post and complete are implicit in the wait and
put operations
27. COMPARING MPI 2 AND
OpenSHMEM
MPI-2 locks
◦ No mutual exclusion
◦ Lock is not a real lock, but “begin RMA”
◦ Unlock means “end RMA”
◦ Only the source calls lock
OpenSHMEM locks
◦ Enforce mutual exclusion
◦ The PE that acquires the lock does the put
◦ Waiting PEs get the lock on a first-come,
first-served basis
28. OpenSHMEM in use
Uses GASNet, so it can run on many kinds of
devices (including mobile)
It has been tested with standard libraries in
controlled settings
No test results are available for grid environments
It is comparable with CUDA, MPI and
OpenMP
29. GASNet
GASNet is a language-independent, low-level networking layer that provides network-
independent, high-performance communication primitives tailored for implementing
parallel global address space SPMD languages and libraries such as UPC, Co-Array
Fortran, SHMEM, Cray Chapel, and Titanium.
The interface is primarily intended as a compilation target and for use by runtime library
writers (as opposed to end users), and the primary goals are high performance,
interface portability, and expressiveness. GASNet stands for "Global-Address Space
Networking".
The design of GASNet is partitioned into two layers to maximize porting ease without
sacrificing performance:
◦ The lower-level GASNet core API, whose design is based heavily on Active Messages and is
implemented directly on top of each individual network architecture.
◦ The upper level GASNet extended API, which provides high-level operations such as remote
memory access and various collective operations.
Operating systems: Linux, FreeBSD, NetBSD, Tru64, AIX, IRIX, HPUX, Solaris,
MSWindows-Cygwin, MacOSX, Unicos, SuperUX, Catamount, BLRTS, MTX
Architectures: x86, Itanium, Opteron, Athlon, Alpha, PowerPC, MIPS, PA-RISC, SPARC,
Cray T3E, Cray X-1, Cray XT, Cray XE, Cray XK, Cray XC30, SX-6, IBM BlueGene/L, IBM
BlueGene/P, IBM BlueGene/Q, IBM Power 775, SiCortex, PlayStation3
Compilers: GCC, Portland Group C, Pathscale C, Intel C, SunPro C, Compaq C, HP C,
MIPSPro C, IBM VisualAge C, Cray C, NEC C, MTA C, LLVM Clang, Open64
[System diagram showing GASNet layers]
30. Hardware Offload
host-side network adapter that offloads most
of the networking “work”
◦ server’s main CPU(s) don’t have to
◦ OS bypass techniques
you can have dedicated (read: very
fast/optimized) hardware do the heavy lifting
◦ rest of the server’s resources are free
it’s not just processor cycles that are saved
◦ Caches — both instruction and data — are likely
not to be thrashed
◦ Interrupts may be fired less frequently
◦ There may be (slightly) less data transferred
across internal buses
31. References
Using OpenCL: Programming Massively Parallel
Computers, edited by Janusz Kowalik and Tadeusz
Puźniakowski
OpenSHMEM Application Programming
Interface, Version 1.0 FINAL
OpenSHMEM Tutorial, presenters: Tony
Curtis and Swaroop Pophale
OpenSHMEM Performance and Potential: A NPB
Experimental Study, Swaroop Pophale
OpenSHMEM BOF, SC2011 TCC-303, T. Curtis,
S. Poole
Editor’s notes
OpenMP is global shared space (PVM & MPI are SPMD and partitioned space)
ARMI, an adaptive, platform-independent communication library; InfiniBand
Data transfer in OpenSHMEM is possible through several one-sided put (for write) and get
(for read) operations, as well as various collective routines such as broadcasts and reductions.
Query routines are available to gather information about the execution. OpenSHMEM also
provides synchronization routines to coordinate data transfers and other operations.
The difference between barrier and fence is that a barrier synchronizes every work item in the group, while a fence applies only to the specified memory operations
The BT benchmark has I/O-intensive subtypes[4]. Block Tridiagonal (BT) and Scalar Pentadiagonal (SP):
these benchmarks solve a synthetic system of nonlinear PDEs using three different algorithms involving block-tridiagonal, scalar-pentadiagonal and symmetric successive over-relaxation (SSOR) solver kernels, respectively
Network offload is typically most beneficial with long messages — sending a short message in its entirety can frequently cost exactly the same as starting a communication action and then polling it for completion later.