SlideShare ist ein Scribd-Unternehmen logo
1 von 31
OpenSHMEM
Ehsan alirezaei
Ehsan_alirezaii@ut.ac.ir
130 pages
OPENSHMEM
PREREQUISITES
 Knowledge of C/Fortran
 Familiarity with parallel computing
 Linux/UNIX command-line
 Useful for hands-on
◦ 64-bit Linux (native, VM or remote)
 E.g. Fedora, CentOS, ...
 http://www.virtualbox.org/!
◦ Download of GASNet, OpenSHMEM library,
test-suite & demo programs
 http://gasnet.lbl.gov/#download
◦ Installation of GASNet,
2
Introducing OpenSHMEM
 Supported by
◦ Oak Ridge National Laboratory
◦ University of Houston
◦ Open Source Software Solutions
◦ Department of Defense
◦ U.S. Department of Energy
◦ Extreme Scale Systems Center
 SHared MEMory
 SHMEM is a 1-sided communications library
◦ C and Fortran PGAS programming model
◦ Point-to-point and collective routines
◦ Synchronizations
◦ Atomic operations
 Can take advantage of hardware offload
◦ Performance benefits
3
One-sided communication
4
Introduction
 Address Spaces
◦ Global vs. distributed
◦ OpenMP has global (shared) space
◦ MPI has partitioned space
 Private data exchanged via messages
◦ SHMEM is “partitioned global address space”
 PGAS
◦ Has private and shared data
 Shared data accessible directly by other processes
 Used for programs that
◦ perform computations in separate address spaces and
explicitly pass data to and from different processes in the
program.
 The processes participating in shared memory
applications are referred to as processing elements
(PEs).
5
Introduction
 All processors see symmetric variables
◦ Global Address Space
 All processors have own view of symmetric variables
◦ Partitioned Global Address Space
 OpenSHMEM supports Single Program Multiple Data
(SPMD) style of programming.
 The OpenSHMEM Specification is an effort to create a
standardized SHMEM library API for C, C++, and
Fortran
 SGI’s SHMEM API is the baseline for OpenSHMEM
Specification 1.0
 The specification is open to the community for reviews
and contributions.
6
Implementation layers
7
OpenSHMEM Features
 Use of symmetric variables and point to point
“put” and “get” operations.
 Remote direct memory access enables one-
sided operations, leading to performance
benefits.
 A standard API for different hardware provides
network hardware independence and renders
support to different network layer technologies.
 Along with data transfer operations the library
provides synchronization mechanisms (barrier,
fence, quiet, wait), collective operations
(broadcast, collection, reduction), and atomic
memory operations (swap, add, increment).
 Open to the community for reviews and
contributions.
8
OpenSHMEM HISTORY AND
IMPLEMENTATIONS
 Cray
◦ SHMEM first introduced by Cray Research Inc. in
1993 for Cray T3D
◦ Platforms: Cray T3D, T3E, PVP, XT series
 SGI
◦ Owns the “rights” for SHMEM
◦ Baseline for OpenSHMEM development (Altix)
 Quadrics (company out of business)
◦ Optimized API for QsNet
◦ Platform: Linux cluster with QsNet interconnect
 Others
◦ HP SHMEM, IBM SHMEM
◦ GPSHMEM (cluster with ARMCI & MPI support, old)
9
Symmetric Variables
 Arrays or variables that exist with the
same size, type, and relative address
on all PEs.
◦ The following kinds of data objects are
symmetric:
 Fortran data objects in common blocks or with
the SAVE attribute.
 Non-stack C and C++ variables.
 Fortran arrays allocated with shpalloc
 C and C++ data allocated by shmalloc
10
Memory Model
11
Memory Model
12
Memory Model
13
Memory Model
14
Memory Model
15
FREQUENTLY USED
OPENSHMEM API (C, C++)
OpenSHMEM Library Call Functionality
start pes() Initializes the OpenSHMEM library
my pe() Returns an integer identification of the PE
num pes() Returns the number of PEs executing the application
shmem type get(*dest, *src, size t len, int pe)
Returns with the same type value of the symmetric
src from the remote PE to the symmetric dest on the
local PE
shmem type put(*dest, *src, size t len, int pe)
Returns when the same type value of the symmetric
src on the calling PE is on its way to the symmetric
dest on the remote PE
shmem barrier all()
Returns when all PEs reach the same barrier call
and have completed all outstanding communication
16
Execution Model
1. Initializing
◦ Start_pes()
 Setup symmetric heap
 Creating PE numbers (identifiers to PEs)
 Data transfer
 Query routines
 Finalization
17
Communication Operations
 Data Transfers
a. One-sided puts
b. One-sided gets
 Synchronization Mechanisms
a. Fence
b. Quiet
c. Barrier
 Collective Communication
a. Broadcast
b. Collection
c. Reduction
 Address Manipulation
a. Allocating and deallocating memory blocks in the symmetric space.
 Locks
a. Implementation of mutual exclusion
 Atomic Memory Operations
a. Swap, Conditional Swap, Add and Increment
 Data Cache control
18
OpenSHMEM
ATOMIC OPERATIONS
 What does “atomic” mean anyway?
◦ Indivisible operation on symmetric variable
◦ No other operation can interpose during
update
◦ But “no other operation” actually means…?
 No other atomic operation
 Can’t do anything about other mechanisms
interfering
 E.g. thread outside of SHMEM program
 Non-atomic SHMEM operation
 Why this restriction?
 Implementation in hardware
19
OpenSHMEM
ATOMIC OPERATIONS
 Atomic Swap
◦ Unconditional
◦ Conditional
 Arithmetic
◦ Increment
◦ Fetch-and-increment & fetch-and-add value
◦ Return previous value at target on PE
 Locks
◦ Symmetric variables
◦ Acquired and released to define mutual-exclusion
execution regions
◦ Initialize lock to 0. After that managed by above API
◦ Can be used for updating distributed data structures
20
OpenSHMEM
LOOKING FOR OVERLAPS
 How to identify overlap opportunities
◦ Put is not an indivisible operation
 Send local, reuse local, on-wire, stored
 Can do useful work on other data in between
◦ General principle:
 Identify independent tasks/data
 Initiate action as early as possible
 Put/barrier/collective
 Interpose independent work
 Synchronize as late as possible
 Divide application into distinct communication and
computation phases to minimize synchronization
points
 Use of point-to-point synchronization as opposed to
collective synchronization
21
Benchmarks and Tools
 OpenSHMEM NPB: the NAS Parallel
benchmarks for use the OpenSHMEM library.
 The OpenSHMEM benchmarks’ performance
evaluated with respect to their MPI versions
on three distinctly different platforms with
different OpenSHMEM/SHMEM library
implementations
 On mature library implementations with
hardware support for RMA (SGI), the
benchmarks using the OpenSHMEM library
have comparable or better performance than
MPI.
 The OpenSHMEM Analyzer (OSA) developed
in collaboration with ORNL allows for better
error checking during compile time. 22
Performance of Implementing BT
and SP benchmarks
23
COMPARING MPI 2 AND
OpenSHMEM
 MPI Window
semantics
◦ All process which intend
to use the window must
participate in window
creation
◦ Many or All the local
allocations/objects
should be coalesced
within a single window
creation.
 SHMEM semantics
◦ All global and static
data are by default
accessible to all
process.
◦ Local
allocations/objects can
be made shmem
accessible using
shmalloc instead of
malloc
24
COMPARING MPI 2 AND
OpenSHMEM
 MPI_Win_fence
◦ Fence is a collective call.
◦ Need 2 fence calls, one
to separate and another
one to complete.
◦ So it mostly functions like
barrier
 shmem_fence
◦ shmem fence is just
meant for ordering of puts.
◦ It does not separate the
processes neither does it
mean completion
◦ Ensures there are no
pending puts to be
delivered to the same
target before the next put
25
COMPARING MPI 2 AND
OpenSHMEM
 Point to point
synchronization.
◦ Sender does Start and
waits for Post from
receiver
◦ The receiver does
Post and waits for the
data.
◦ The sender Puts the
data and signals
completion to receiver
 Point to point
synchronization.
◦ The receiver can
directly wait for the
data using
shmem_wait on a
event flag.
◦ The sender puts the
data and sets the
event flag to signal the
receiver.
◦ Both post and
complete are implicit
inside the wait and put
operation.
26
COMPARING MPI 2 AND
OpenSHMEM
 No mutual
exclusion
 Lock is not real
lock, but begin
RMA
 Unlock means end
RMA
 Only the source
calls lock
 Enforces mutual
exclusion
 The PE which
acquires lock doe
put
 The waiting PE
gets the lock on
first come first
serve basis
27
OpenShmem in use
 Uses GASNET for any devices (also
mobile)
 It has been tested with standard
libraries on controlled situations
 No test results available for GRID
 It is comparable with CUDA, MPI and
OPENMP
28
GASNet
 GASNet is a language-independent, low-level networking layer that provides network-
independent, high-performance communication primitives tailored for implementing
parallel global address space SPMD languages and libraries such as UPC, Co-Array
Fortran, SHMEM, Cray Chapel, and Titanium.
 The interface is primarily intended as a compilation target and for use by runtime library
writers (as opposed to end users), and the primary goals are high performance,
interface portability, and expressiveness. GASNet stands for "Global-Address Space
Networking".
The design of GASNet is partitioned into two layers to maximize porting ease without
sacrificing performance:
◦ lower level GASNet core API - the design is based heavily on Active Messages, and is implemented
directly on top of each individual network architecture.
◦ The upper level GASNet extended API, which provides high-level operations such as remote
memory access and various collective operations.
 Operating systems: Linux, FreeBSD, NetBSD, Tru64, AIX, IRIX, HPUX, Solaris,
MSWindows-Cygwin, MacOSX, Unicos, SuperUX, Catamount, BLRTS, MTX
 Architectures: x86, Itanium, Opteron, Athlon, Alpha, PowerPC, MIPS, PA-RISC, SPARC,
Cray T3E, Cray X-1, Cray XT, Cray XE, Cray XK, Cray XC30, SX-6, IBM BlueGene/L, IBM
BlueGene/P, IBM BlueGene/Q, IBM Power 775, SiCortex, PlayStation3
 Compilers: GCC, Portland Group C, Pathscale C, Intel C, SunPro C, Compaq C, HP C,
MIPSPro C, IBM VisualAge C, Cray C, NEC C, MTA C, LLVM Clang, Open64
System diagram
showing GASNet
layers 29
hardware offload
 host-side network adapter that offloads most
of the networking “work”
◦ server’s main CPU(s) don’t have to
◦ OS bypass techniques
 you can have dedicated (read: very
fast/optimized) hardware do the heavy lifting
◦ rest of the server’s resources are free
 it’s not just processor cycles that are saved
◦ Caches — both instruction and data — are likely
not to be thrashed
◦ Interrupts may be fired less frequently
◦ There may be (slightly) less data transferred
across internal buses
30
Refrences
 Using OpenCL: Programming Massively Parallel
Computers, edited by Janusz Kowalik, Tadeusz
Puźniakowski
 OpenSHMEM,Application Programming
Interface,Version 1.0 FINAL
 OpenSHMEM TUTORIAL, Presenters: Tony
Curtis and Swaroop Pophale
 OpenSHMEM Performance and Potential: A NPB
Experimental Study, Swaroop Pophale,
 OPENSHMEM BOF SC2011 TCC-303 ,T. Curtis,
S. Poole
31

Weitere ähnliche Inhalte

Was ist angesagt?

Asymmetric Multiprocessing - Kynetics ELC 2018 portland
Asymmetric Multiprocessing - Kynetics ELC 2018 portlandAsymmetric Multiprocessing - Kynetics ELC 2018 portland
Asymmetric Multiprocessing - Kynetics ELC 2018 portlandNicola La Gloria
 
Recent advance in netmap/VALE(mSwitch)
Recent advance in netmap/VALE(mSwitch)Recent advance in netmap/VALE(mSwitch)
Recent advance in netmap/VALE(mSwitch)micchie
 
Network Stack in Userspace (NUSE)
Network Stack in Userspace (NUSE)Network Stack in Userspace (NUSE)
Network Stack in Userspace (NUSE)Hajime Tazaki
 
Introduction to DPDK
Introduction to DPDKIntroduction to DPDK
Introduction to DPDKKernel TLV
 
FreeBSD and Drivers
FreeBSD and DriversFreeBSD and Drivers
FreeBSD and DriversKernel TLV
 
Measuring a 25 and 40Gb/s Data Plane
Measuring a 25 and 40Gb/s Data PlaneMeasuring a 25 and 40Gb/s Data Plane
Measuring a 25 and 40Gb/s Data PlaneOpen-NFP
 
Playing BBR with a userspace network stack
Playing BBR with a userspace network stackPlaying BBR with a userspace network stack
Playing BBR with a userspace network stackHajime Tazaki
 
Library Operating System for Linux #netdev01
Library Operating System for Linux #netdev01Library Operating System for Linux #netdev01
Library Operating System for Linux #netdev01Hajime Tazaki
 
Linux Kernel Library - Reusing Monolithic Kernel
Linux Kernel Library - Reusing Monolithic KernelLinux Kernel Library - Reusing Monolithic Kernel
Linux Kernel Library - Reusing Monolithic KernelHajime Tazaki
 
VLANs in the Linux Kernel
VLANs in the Linux KernelVLANs in the Linux Kernel
VLANs in the Linux KernelKernel TLV
 
Introduction to eBPF
Introduction to eBPFIntroduction to eBPF
Introduction to eBPFRogerColl2
 
LibOS as a regression test framework for Linux networking #netdev1.1
LibOS as a regression test framework for Linux networking #netdev1.1LibOS as a regression test framework for Linux networking #netdev1.1
LibOS as a regression test framework for Linux networking #netdev1.1Hajime Tazaki
 
Bypassing ASLR Exploiting CVE 2015-7545
Bypassing ASLR Exploiting CVE 2015-7545Bypassing ASLR Exploiting CVE 2015-7545
Bypassing ASLR Exploiting CVE 2015-7545Kernel TLV
 
Introduction to RCU
Introduction to RCUIntroduction to RCU
Introduction to RCUKernel TLV
 
CETH for XDP [Linux Meetup Santa Clara | July 2016]
CETH for XDP [Linux Meetup Santa Clara | July 2016] CETH for XDP [Linux Meetup Santa Clara | July 2016]
CETH for XDP [Linux Meetup Santa Clara | July 2016] IO Visor Project
 
Linux rumpkernel - ABC2018 (AsiaBSDCon 2018)
Linux rumpkernel - ABC2018 (AsiaBSDCon 2018)Linux rumpkernel - ABC2018 (AsiaBSDCon 2018)
Linux rumpkernel - ABC2018 (AsiaBSDCon 2018)Hajime Tazaki
 
Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Hajime Tazaki
 
Performance challenges in software networking
Performance challenges in software networkingPerformance challenges in software networking
Performance challenges in software networkingStephen Hemminger
 

Was ist angesagt? (20)

Asymmetric Multiprocessing - Kynetics ELC 2018 portland
Asymmetric Multiprocessing - Kynetics ELC 2018 portlandAsymmetric Multiprocessing - Kynetics ELC 2018 portland
Asymmetric Multiprocessing - Kynetics ELC 2018 portland
 
Recent advance in netmap/VALE(mSwitch)
Recent advance in netmap/VALE(mSwitch)Recent advance in netmap/VALE(mSwitch)
Recent advance in netmap/VALE(mSwitch)
 
Network Stack in Userspace (NUSE)
Network Stack in Userspace (NUSE)Network Stack in Userspace (NUSE)
Network Stack in Userspace (NUSE)
 
Introduction to DPDK
Introduction to DPDKIntroduction to DPDK
Introduction to DPDK
 
FreeBSD and Drivers
FreeBSD and DriversFreeBSD and Drivers
FreeBSD and Drivers
 
Measuring a 25 and 40Gb/s Data Plane
Measuring a 25 and 40Gb/s Data PlaneMeasuring a 25 and 40Gb/s Data Plane
Measuring a 25 and 40Gb/s Data Plane
 
Playing BBR with a userspace network stack
Playing BBR with a userspace network stackPlaying BBR with a userspace network stack
Playing BBR with a userspace network stack
 
Library Operating System for Linux #netdev01
Library Operating System for Linux #netdev01Library Operating System for Linux #netdev01
Library Operating System for Linux #netdev01
 
Linux Kernel Library - Reusing Monolithic Kernel
Linux Kernel Library - Reusing Monolithic KernelLinux Kernel Library - Reusing Monolithic Kernel
Linux Kernel Library - Reusing Monolithic Kernel
 
VLANs in the Linux Kernel
VLANs in the Linux KernelVLANs in the Linux Kernel
VLANs in the Linux Kernel
 
mTCP使ってみた
mTCP使ってみたmTCP使ってみた
mTCP使ってみた
 
Introduction to eBPF
Introduction to eBPFIntroduction to eBPF
Introduction to eBPF
 
Memory model
Memory modelMemory model
Memory model
 
LibOS as a regression test framework for Linux networking #netdev1.1
LibOS as a regression test framework for Linux networking #netdev1.1LibOS as a regression test framework for Linux networking #netdev1.1
LibOS as a regression test framework for Linux networking #netdev1.1
 
Bypassing ASLR Exploiting CVE 2015-7545
Bypassing ASLR Exploiting CVE 2015-7545Bypassing ASLR Exploiting CVE 2015-7545
Bypassing ASLR Exploiting CVE 2015-7545
 
Introduction to RCU
Introduction to RCUIntroduction to RCU
Introduction to RCU
 
CETH for XDP [Linux Meetup Santa Clara | July 2016]
CETH for XDP [Linux Meetup Santa Clara | July 2016] CETH for XDP [Linux Meetup Santa Clara | July 2016]
CETH for XDP [Linux Meetup Santa Clara | July 2016]
 
Linux rumpkernel - ABC2018 (AsiaBSDCon 2018)
Linux rumpkernel - ABC2018 (AsiaBSDCon 2018)Linux rumpkernel - ABC2018 (AsiaBSDCon 2018)
Linux rumpkernel - ABC2018 (AsiaBSDCon 2018)
 
Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014
 
Performance challenges in software networking
Performance challenges in software networkingPerformance challenges in software networking
Performance challenges in software networking
 

Andere mochten auch

Course timetabling Project
Course timetabling ProjectCourse timetabling Project
Course timetabling ProjectEhsan Alirezaei
 
How to choose right agent based methodology
How to choose right agent based methodologyHow to choose right agent based methodology
How to choose right agent based methodologyEhsan Alirezaei
 
How to choose right methodology
How to choose right methodologyHow to choose right methodology
How to choose right methodologyEhsan Alirezaei
 
Methodology construction
Methodology construction Methodology construction
Methodology construction Ehsan Alirezaei
 
Macro and micro aspects in RE for AOSE
Macro and micro aspects in RE for AOSEMacro and micro aspects in RE for AOSE
Macro and micro aspects in RE for AOSEEhsan Alirezaei
 

Andere mochten auch (7)

Course timetabling Project
Course timetabling ProjectCourse timetabling Project
Course timetabling Project
 
دولت دانا
دولت دانادولت دانا
دولت دانا
 
How to choose right agent based methodology
How to choose right agent based methodologyHow to choose right agent based methodology
How to choose right agent based methodology
 
How to choose right methodology
How to choose right methodologyHow to choose right methodology
How to choose right methodology
 
ebXML
ebXMLebXML
ebXML
 
Methodology construction
Methodology construction Methodology construction
Methodology construction
 
Macro and micro aspects in RE for AOSE
Macro and micro aspects in RE for AOSEMacro and micro aspects in RE for AOSE
Macro and micro aspects in RE for AOSE
 

Ähnlich wie Open shmem

2023comp90024_workshop.pdf
2023comp90024_workshop.pdf2023comp90024_workshop.pdf
2023comp90024_workshop.pdfLevLafayette1
 
Performance improvement techniques for software distributed shared memory
Performance improvement techniques for software distributed shared memoryPerformance improvement techniques for software distributed shared memory
Performance improvement techniques for software distributed shared memoryZongYing Lyu
 
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated SystemsPetapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated Systemsdairsie
 
Ucx an open source framework for hpc network ap is and beyond
Ucx  an open source framework for hpc network ap is and beyondUcx  an open source framework for hpc network ap is and beyond
Ucx an open source framework for hpc network ap is and beyondinside-BigData.com
 
Parallelization using open mp
Parallelization using open mpParallelization using open mp
Parallelization using open mpranjit banshpal
 
Parallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPParallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPAnil Bohare
 
parallel programming models
 parallel programming models parallel programming models
parallel programming modelsSwetha S
 
AMP Kynetics - ELC 2018 Portland
AMP  Kynetics - ELC 2018 PortlandAMP  Kynetics - ELC 2018 Portland
AMP Kynetics - ELC 2018 PortlandKynetics
 
Icg hpc-user
Icg hpc-userIcg hpc-user
Icg hpc-usergdburton
 
Directive-based approach to Heterogeneous Computing
Directive-based approach to Heterogeneous ComputingDirective-based approach to Heterogeneous Computing
Directive-based approach to Heterogeneous ComputingRuymán Reyes
 
Parallel computing in india
Parallel computing in indiaParallel computing in india
Parallel computing in indiaPreeti Chauhan
 
slides8 SharedMemory.ppt
slides8 SharedMemory.pptslides8 SharedMemory.ppt
slides8 SharedMemory.pptaminnezarat
 
Hs java open_party
Hs java open_partyHs java open_party
Hs java open_partyOpen Party
 
Lec+3-Introduction-to-Distributed-Systems.pdf
Lec+3-Introduction-to-Distributed-Systems.pdfLec+3-Introduction-to-Distributed-Systems.pdf
Lec+3-Introduction-to-Distributed-Systems.pdfsamaghorab
 

Ähnlich wie Open shmem (20)

2023comp90024_workshop.pdf
2023comp90024_workshop.pdf2023comp90024_workshop.pdf
2023comp90024_workshop.pdf
 
Performance improvement techniques for software distributed shared memory
Performance improvement techniques for software distributed shared memoryPerformance improvement techniques for software distributed shared memory
Performance improvement techniques for software distributed shared memory
 
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated SystemsPetapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
 
Ucx an open source framework for hpc network ap is and beyond
Ucx  an open source framework for hpc network ap is and beyondUcx  an open source framework for hpc network ap is and beyond
Ucx an open source framework for hpc network ap is and beyond
 
Parallelization using open mp
Parallelization using open mpParallelization using open mp
Parallelization using open mp
 
PROSE
PROSEPROSE
PROSE
 
Parallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPParallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMP
 
Underlying principles of parallel and distributed computing
Underlying principles of parallel and distributed computingUnderlying principles of parallel and distributed computing
Underlying principles of parallel and distributed computing
 
parallel programming models
 parallel programming models parallel programming models
parallel programming models
 
Amoeba1
Amoeba1Amoeba1
Amoeba1
 
AMP Kynetics - ELC 2018 Portland
AMP  Kynetics - ELC 2018 PortlandAMP  Kynetics - ELC 2018 Portland
AMP Kynetics - ELC 2018 Portland
 
Icg hpc-user
Icg hpc-userIcg hpc-user
Icg hpc-user
 
Directive-based approach to Heterogeneous Computing
Directive-based approach to Heterogeneous ComputingDirective-based approach to Heterogeneous Computing
Directive-based approach to Heterogeneous Computing
 
Lecture5
Lecture5Lecture5
Lecture5
 
Introduction to OpenMP
Introduction to OpenMPIntroduction to OpenMP
Introduction to OpenMP
 
Parallel computing in india
Parallel computing in indiaParallel computing in india
Parallel computing in india
 
slides8 SharedMemory.ppt
slides8 SharedMemory.pptslides8 SharedMemory.ppt
slides8 SharedMemory.ppt
 
25-MPI-OpenMP.pptx
25-MPI-OpenMP.pptx25-MPI-OpenMP.pptx
25-MPI-OpenMP.pptx
 
Hs java open_party
Hs java open_partyHs java open_party
Hs java open_party
 
Lec+3-Introduction-to-Distributed-Systems.pdf
Lec+3-Introduction-to-Distributed-Systems.pdfLec+3-Introduction-to-Distributed-Systems.pdf
Lec+3-Introduction-to-Distributed-Systems.pdf
 

Kürzlich hochgeladen

Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendArshad QA
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 

Kürzlich hochgeladen (20)

Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 

Open shmem

  • 2. OPENSHMEM PREREQUISITES  Knowledge of C/Fortran  Familiarity with parallel computing  Linux/UNIX command-line  Useful for hands-on ◦ 64-bit Linux (native, VM or remote)  E.g. Fedora, CentOS, ...  http://www.virtualbox.org/! ◦ Download of GASNet, OpenSHMEM library, test-suite & demo programs  http://gasnet.lbl.gov/#download ◦ Installation of GASNet, 2
  • 3. Introducing OpenSHMEM  Supported by ◦ Oak Ridge National Laboratory ◦ University of Houston ◦ Open Source Software Solutions ◦ Department of Defense ◦ U.S. Department of Energy ◦ Extreme Scale Systems Center  SHared MEMory  SHMEM is a 1-sided communications library ◦ C and Fortran PGAS programming model ◦ Point-to-point and collective routines ◦ Synchronizations ◦ Atomic operations  Can take advantage of hardware offload ◦ Performance benefits 3
  • 5. Introduction  Address Spaces ◦ Global vs. distributed ◦ OpenMP has global (shared) space ◦ MPI has partitioned space  Private data exchanged via messages ◦ SHMEM is “partitioned global address space”  PGAS ◦ Has private and shared data  Shared data accessible directly by other processes  Used for programs that ◦ perform computations in separate address spaces and explicitly pass data to and from different processes in the program.  The processes participating in shared memory applications are referred to as processing elements (PEs). 5
  • 6. Introduction  All processors see symmetric variables ◦ Global Address Space  All processors have own view of symmetric variables ◦ Partitioned Global Address Space  OpenSHMEM supports Single Program Multiple Data (SPMD) style of programming.  The OpenSHMEM Specification is an effort to create a standardized SHMEM library API for C, C++, and Fortran  SGI’s SHMEM API is the baseline for OpenSHMEM Specification 1.0  The specification is open to the community for reviews and contributions. 6
  • 8. OpenSHMEM Features  Use of symmetric variables and point to point “put” and “get” operations.  Remote direct memory access enables one- sided operations, leading to performance benefits.  A standard API for different hardware provides network hardware independence and renders support to different network layer technologies.  Along with data transfer operations the library provides synchronization mechanisms (barrier, fence, quiet, wait), collective operations (broadcast, collection, reduction), and atomic memory operations (swap, add, increment).  Open to the community for reviews and contributions. 8
  • 9. OpenSHMEM HISTORY AND IMPLEMENTATIONS  Cray ◦ SHMEM first introduced by Cray Research Inc. in 1993 for Cray T3D ◦ Platforms: Cray T3D, T3E, PVP, XT series  SGI ◦ Owns the “rights” for SHMEM ◦ Baseline for OpenSHMEM development (Altix)  Quadrics (company out of business) ◦ Optimized API for QsNet ◦ Platform: Linux cluster with QsNet interconnect  Others ◦ HP SHMEM, IBM SHMEM ◦ GPSHMEM (cluster with ARMCI & MPI support, old) 9
  • 10. Symmetric Variables  Arrays or variables that exist with the same size, type, and relative address on all PEs. ◦ The following kinds of data objects are symmetric:  Fortran data objects in common blocks or with the SAVE attribute.  Non-stack C and C++ variables.  Fortran arrays allocated with shpalloc  C and C++ data allocated by shmalloc 10
  • 16. FREQUENTLY USED OPENSHMEM API (C, C++) OpenSHMEM Library Call Functionality start pes() Initializes the OpenSHMEM library my pe() Returns an integer identification of the PE num pes() Returns the number of PEs executing the application shmem type get(*dest, *src, size t len, int pe) Returns with the same type value of the symmetric src from the remote PE to the symmetric dest on the local PE shmem type put(*dest, *src, size t len, int pe) Returns when the same type value of the symmetric src on the calling PE is on its way to the symmetric dest on the remote PE shmem barrier all() Returns when all PEs reach the same barrier call and have completed all outstanding communication 16
  • 17. Execution Model 1. Initializing ◦ Start_pes()  Setup symmetric heap  Creating PE numbers (identifiers to PEs)  Data transfer  Query routines  Finalization 17
  • 18. Communication Operations  Data Transfers a. One-sided puts b. One-sided gets  Synchronization Mechanisms a. Fence b. Quiet c. Barrier  Collective Communication a. Broadcast b. Collection c. Reduction  Address Manipulation a. Allocating and deallocating memory blocks in the symmetric space.  Locks a. Implementation of mutual exclusion  Atomic Memory Operations a. Swap, Conditional Swap, Add and Increment  Data Cache control 18
  • 19. OpenSHMEM ATOMIC OPERATIONS  What does “atomic” mean anyway? ◦ Indivisible operation on symmetric variable ◦ No other operation can interpose during update ◦ But “no other operation” actually means…?  No other atomic operation  Can’t do anything about other mechanisms interfering  E.g. thread outside of SHMEM program  Non-atomic SHMEM operation  Why this restriction?  Implementation in hardware 19
  • 20. OpenSHMEM ATOMIC OPERATIONS  Atomic Swap ◦ Unconditional ◦ Conditional  Arithmetic ◦ Increment ◦ Fetch-and-increment & fetch-and-add value ◦ Return previous value at target on PE  Locks ◦ Symmetric variables ◦ Acquired and released to define mutual-exclusion execution regions ◦ Initialize lock to 0. After that managed by above API ◦ Can be used for updating distributed data structures 20
  • 21. OpenSHMEM LOOKING FOR OVERLAPS  How to identify overlap opportunities ◦ Put is not an indivisible operation  Send local, reuse local, on-wire, stored  Can do useful work on other data in between ◦ General principle:  Identify independent tasks/data  Initiate action as early as possible  Put/barrier/collective  Interpose independent work  Synchronize as late as possible  Divide application into distinct communication and computation phases to minimize synchronization points  Use of point-to-point synchronization as opposed to collective synchronization 21
  • 22. Benchmarks and Tools  OpenSHMEM NPB: the NAS Parallel benchmarks for use the OpenSHMEM library.  The OpenSHMEM benchmarks’ performance evaluated with respect to their MPI versions on three distinctly different platforms with different OpenSHMEM/SHMEM library implementations  On mature library implementations with hardware support for RMA (SGI), the benchmarks using the OpenSHMEM library have comparable or better performance than MPI.  The OpenSHMEM Analyzer (OSA) developed in collaboration with ORNL allows for better error checking during compile time. 22
  • 23. Performance of Implementing BT and SP benchmarks 23
  • 24. COMPARING MPI 2 AND OpenSHMEM  MPI Window semantics ◦ All process which intend to use the window must participate in window creation ◦ Many or All the local allocations/objects should be coalesced within a single window creation.  SHMEM semantics ◦ All global and static data are by default accessible to all process. ◦ Local allocations/objects can be made shmem accessible using shmalloc instead of malloc 24
  • 25. COMPARING MPI 2 AND OpenSHMEM  MPI_Win_fence ◦ Fence is a collective call. ◦ Need 2 fence calls, one to separate and another one to complete. ◦ So it mostly functions like barrier  shmem_fence ◦ shmem fence is just meant for ordering of puts. ◦ It does not separate the processes neither does it mean completion ◦ Ensures there are no pending puts to be delivered to the same target before the next put 25
  • 26. COMPARING MPI 2 AND OpenSHMEM  Point to point synchronization. ◦ Sender does Start and waits for Post from receiver ◦ The receiver does Post and waits for the data. ◦ The sender Puts the data and signals completion to receiver  Point to point synchronization. ◦ The receiver can directly wait for the data using shmem_wait on a event flag. ◦ The sender puts the data and sets the event flag to signal the receiver. ◦ Both post and complete are implicit inside the wait and put operation. 26
  • 27. COMPARING MPI 2 AND OpenSHMEM  No mutual exclusion  Lock is not real lock, but begin RMA  Unlock means end RMA  Only the source calls lock  Enforces mutual exclusion  The PE which acquires lock doe put  The waiting PE gets the lock on first come first serve basis 27
  • 28. OpenShmem in use  Uses GASNET for any devices (also mobile)  It has been tested with standard libraries on controlled situations  No test results available for GRID  It is comparable with CUDA, MPI and OPENMP 28
  • 29. GASNet  GASNet is a language-independent, low-level networking layer that provides network- independent, high-performance communication primitives tailored for implementing parallel global address space SPMD languages and libraries such as UPC, Co-Array Fortran, SHMEM, Cray Chapel, and Titanium.  The interface is primarily intended as a compilation target and for use by runtime library writers (as opposed to end users), and the primary goals are high performance, interface portability, and expressiveness. GASNet stands for "Global-Address Space Networking". The design of GASNet is partitioned into two layers to maximize porting ease without sacrificing performance: ◦ lower level GASNet core API - the design is based heavily on Active Messages, and is implemented directly on top of each individual network architecture. ◦ The upper level GASNet extended API, which provides high-level operations such as remote memory access and various collective operations.  Operating systems: Linux, FreeBSD, NetBSD, Tru64, AIX, IRIX, HPUX, Solaris, MSWindows-Cygwin, MacOSX, Unicos, SuperUX, Catamount, BLRTS, MTX  Architectures: x86, Itanium, Opteron, Athlon, Alpha, PowerPC, MIPS, PA-RISC, SPARC, Cray T3E, Cray X-1, Cray XT, Cray XE, Cray XK, Cray XC30, SX-6, IBM BlueGene/L, IBM BlueGene/P, IBM BlueGene/Q, IBM Power 775, SiCortex, PlayStation3  Compilers: GCC, Portland Group C, Pathscale C, Intel C, SunPro C, Compaq C, HP C, MIPSPro C, IBM VisualAge C, Cray C, NEC C, MTA C, LLVM Clang, Open64 System diagram showing GASNet layers 29
  • 30. hardware offload  host-side network adapter that offloads most of the networking “work” ◦ server’s main CPU(s) don’t have to ◦ OS bypass techniques  you can have dedicated (read: very fast/optimized) hardware do the heavy lifting ◦ rest of the server’s resources are free  it’s not just processor cycles that are saved ◦ Caches — both instruction and data — are likely not to be thrashed ◦ Interrupts may be fired less frequently ◦ There may be (slightly) less data transferred across internal buses 30
  • 31. Refrences  Using OpenCL: Programming Massively Parallel Computers, edited by Janusz Kowalik, Tadeusz Puźniakowski  OpenSHMEM,Application Programming Interface,Version 1.0 FINAL  OpenSHMEM TUTORIAL, Presenters: Tony Curtis and Swaroop Pophale  OpenSHMEM Performance and Potential: A NPB Experimental Study, Swaroop Pophale,  OPENSHMEM BOF SC2011 TCC-303 ,T. Curtis, S. Poole 31

Hinweis der Redaktion

  1. OpenMP is global shared space (PVM & MPI Are SPMD and partioned space)
  2. ARMI an adaptive, platform independent communication library, Infini BAnd
  3. Data transfer in OpenSHMEM is possible through several one-sided put (for write) and get (for read) operations, as well as various collective routines such as broadcasts and reductions. Query routines are available to gather information about the execution. OpenSHMEM also provides synchronization routines to coordinate data transfers and other operations.
  4. Differences between barrier and fence is that the barrier synchronizes every work item in the group and fence Is only on a specified memory operation
  5. The BT benchmark has I/O-intensive subtypes[4] Block Tridiagonal Scalar Pentadiagonal Solve a synthetic system of nonlinear PDEs using three different algorithms involving block tridiagonal, scalar pentadiagonal and symmetric successive over-relaxation (SSOR) solver kernels, respectively
  6. هم آوایی
  7. Network offload is typically most beneficial with long messages — sending a short message in its entirety can frequently cost exactly the same as starting a communication action and then polling it for completion later.