Open MPI State of the Union X
Community Meeting SC'16
George Bosilca
Yossi Itigin, Nathan Hjelm
Perry Schmidt, Jeff Squyres
Ralph Castain
Open MPI State of the Union X
Community Meeting SC'16
10 years of SC Open MPI BOFs!
Public service announcement
www.mcs.anl.gov/eurompi2017/
• CFP online at web site
• Location: Argonne National Laboratory
(Chicago, Illinois, USA)
• Full paper submission deadline: 1st May 2017
Github / Community Update
Github: we used to have 2 repos
ompi ompi-release
Github: now we have just one repo
ompi
Contribution policy
• For 10+ years, we
have required a
signed contribution
agreement for
“sizable” code
contributions
The Open MPI Project
Software Grant and Corporate Contributor License Agreement (“Agreement”)
http://www.open-mpi.org/community/contribute/
(v1.6)

Thank you for your interest in The Open MPI Project (the “Project”). In order to clarify the intellectual property license granted with Contributions from any person or entity, the Open MPI Project copyright holders (the “Copyright Holders”) must have a Contributor License Agreement (CLA) on file that has been signed by each Contributor, indicating agreement to the license terms below. This license is for your protection as a Contributor as well as the protection of the Copyright Holders and their users; it does not change your rights to use your own Contributions for any other purpose.

This version of the Agreement allows an entity (the "Corporation") to submit Contributions to the Copyright Holders, to authorize Contributions submitted by its designated employees to the Copyright Holders, and to grant copyright and patent licenses thereto.

If you have not already done so, please complete and send a signed Agreement to ompi-contributors@lists.open-mpi.org. Please read this document carefully before signing and keep a copy for your records.

(Form fields follow: Corporation name, Corporation address, Point of Contact, E-Mail, Telephone, Fax.)

You accept and agree to the following terms and conditions for Your present and future Contributions submitted to the Copyright Holders. In return, the Copyright Holders shall not use Your Contributions in a way that is contrary to the public benefit. Except for the license granted herein to the Copyright Holders and recipients of software distributed by the Copyright Holders, You reserve all right, title, and interest in and to Your Contributions.
Contribution policy
• This is no longer
necessary
Contribution policy
• Instead, we now require a “Signed-off-by”
token in commit messages
• Can automatically be added by
“git commit -s”
Some awesome new feature
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Signed-off-by
• Intent: make it easier for individuals and
organizations to contribute to Open MPI
• “Signed-off-by” means agreement to the
Open MPI Contributor’s Declaration
§ See the full definition here
§ This is common in many open source projects
Contributor’s Declaration
Open MPI versioning
Open MPI versioning
• Open MPI will (continue to) use a “A.B.C”
version number triple
• Each number now has a specific meaning:
A: this number changes when backwards compatibility breaks
B: this number changes when new features are added
C: this number changes for all other releases
Definition
• Open MPI vY is backwards compatible with
Open MPI vX (where Y>X) if:
§ Users can compile a correct MPI / OSHMEM
program with vX
§ Run it with the same CLI options and MCA
parameters using vX or vY
§ The job executes correctly
What does that encompass?
• “Backwards compatibility” covers several
areas:
§ Binary compatibility, specifically the MPI /
OSHMEM API ABI
§ MPI / OSHMEM run time system
§ mpirun / oshrun CLI options
§ MCA parameter names / values / meanings
What does that not encompass?
• Open MPI only supports running exactly
the same version of the runtime and MPI /
OSHMEM libraries in a single job
§ If you mix-n-match vX and vY in a single job…
ERROR
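To see which library a job is actually using (and to catch an accidental vX / vY mix before it turns into the error above), each rank can report its MPI library at startup. A minimal sketch, assuming an Open MPI build (the OMPI_*_VERSION macros come from Open MPI's mpi.h; MPI_Get_library_version() is standard MPI-3):

    /* Print, per rank, the compile-time and run-time MPI library versions. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        char lib[MPI_MAX_LIBRARY_VERSION_STRING];
        int rank, len;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_library_version(lib, &len);   /* run-time library identification */
    #ifdef OMPI_MAJOR_VERSION
        printf("rank %d: compiled against Open MPI %d.%d.%d, running with: %s\n",
               rank, OMPI_MAJOR_VERSION, OMPI_MINOR_VERSION, OMPI_RELEASE_VERSION, lib);
    #else
        printf("rank %d: running with: %s\n", rank, lib);
    #endif
        MPI_Finalize();
        return 0;
    }

If different ranks print different versions, that is exactly the unsupported mix-n-match case.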
Current version series
• v1.10.x series
§ Older, stable, rapidly hitting end of life
• v2.0.x series
§ Current stable series
• v2.x series
§ Upcoming series
v1.10.x Roadmap
v1.10.x release manager
• Ralph Castain, Intel
v1.10 series
• Soon to be end of life
• One more release expected: v1.10.5
§ Bug fixes only – no new features
§ Do not have a specific timeframe
If you are still running v1.10.x,
please start migrating to v2.0.x
v2.0.x and v2.x
v2 Roadmap
v2.x release managers
• Howard Pritchard, Los Alamos National Lab
• Jeff Squyres, Cisco Systems, Inc.
v2 versions
(Branch diagram) Master development → v2.x branch → v2.0.x branch (bug fixes only).
Releases on v2.0.x: v2.0.0, v2.0.1, v2.0.2 (expected Dec 2016).
v2.1.0: expected Q1 2017.
v2.1.x will get its own branch
(bug fixes only)
(Branch diagram) v2.x branch → v2.0.x branch (v2.0.0, v2.0.1, v2.0.2) and v2.1.x branch (v2.1.0, v2.1.1).
v3.x will definitely (eventually) happen
(Branch diagram) As above, plus a future v3.x.
Is it worthwhile to make an
intermediate v2.2.x?
(Branch diagram) As above, with a possible v2.2.x series between v2.1.x and v3.x.
v2.2.x: pros and cons
PRO
• The v2.x branch is a good
stable starting point
• Easy to maintain
backwards compatibility
with all v2.x series
CON
• Difficulty in porting new
features from master
branch
§ Due to drift from master
• Consumes developer resources and pushes back the v3.x release
§ And therefore some “bigger” features
This is an open question in the developer community
Should we do a v2.2.x series?
Please let us know your opinion!
www.open-mpi.org/sc16/
Some lesser-known Open MPI
features you may not be aware of
Random sample of
v2.x features / work
Singularity
• Containers are of growing interest
§ Packaged applications
§ Portability
• Some issues remain
§ Cross-container boundary interactions for MPI
wireup, resource manager interactions
§ Properly handling “containers” as “apps”
Leads: Ralph Castain (Intel), Greg Kurtzer (LBNL)
Open MPI Singularity Support
• PMIx support
§ Cross-version compatibility
§ Standardized protocol across environments
• Auto-detection of containers
§ Identify that app is a Singularity container
§ Do the Right Things to optimize behavior
• Auto-packaging of Open MPI apps
§ Singularity detects Open MPI app and
automatically includes all required libs
ORTE Distributed Virtual Machine
• Original goal
§ Circumvent Cray’s single job per node limit
§ Enable new programming model
• RADICAL-Pilot
§ Decompose a large parallel problem into a Bag of
MPI Tasks
§ Decoupled, can be executed in parallel
§ Faster convergence to solution
Leads: Mark Santcroos (Rutgers), Ralph Castain (Intel)
Role of DVM
• Launch/wireup time dominates execution
§ DVM instantiated once
§ Tasks highly asynchronous
§ Run tasks in parallel, share cpu cycles
• CLI interface
• Python language bindings through CFFI
• Result
§ Improved concurrency (~16k concurrent tasks)
§ Improved throughput (100 tasks/s)
ORTE-DVM + RADICAL-Pilot
(Plot: #tasks = 3 × #nodes, 64 s tasks, SSH launch.)
Future Work
• Bulk interface to orte_submit()
• OFI-based (libfabric) inter-ORTE daemon
communication
• Optimize ORTE communication topology
• Topology aware task placement
Open MPI I/O (”OMPIO”)
• Highly modular architecture for parallel I/O
§ Separate implementation from ROMIO
• Default parallel I/O library in Open MPI
§ For all file systems starting with the v2.0.x
release, with the exception of Lustre
Lead: Edgar Gabriel (U. Houston)
OMPIO key features
• Tightly integrated with the Open MPI
architecture
§ Frameworks/modules, derived datatype
handling, progress engine, etc.
• Support for multiple collective I/O
algorithms
• Automatic adjustments of # of aggregators
• Multiple mechanisms available for shared
file pointer operations
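For readers less familiar with MPI I/O, the kind of call OMPIO services is a collective file operation such as the one below; this is a minimal illustrative sketch (file name and sizes are made up), not OMPIO-specific code:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, i, buf[1024];
        MPI_File fh;
        MPI_Offset offset;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (i = 0; i < 1024; i++) buf[i] = rank;

        /* Each rank writes its block at a distinct offset; the collective call
         * lets the io framework (OMPIO by default in v2.x, except on Lustre)
         * choose aggregators and a collective algorithm. */
        MPI_File_open(MPI_COMM_WORLD, "testfile.out",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        offset = (MPI_Offset)rank * 1024 * (MPI_Offset)sizeof(int);
        MPI_File_write_at_all(fh, offset, buf, 1024, MPI_INT, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Finalize();
        return 0;
    }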
OMPIO ongoing work
• Enhance support for the Lustre file system
in the v2.1.x release series
• Support for hybrid / multi-threaded
applications
• Enhance support for GPFS
§ Collaboration with HLRS
• Enhance out-of-the-box performance of
OMPIO
OMPIO
AWS scale testing
• EC2 donation for Open MPI and PMIx
scalability
§ Access to larger resources than individual
organizations have
• Both science and engineering
§ Data-driven analysis
§ Used for regression testing
• Early days: no results to share yet
Leads: Jeff Squyres (Cisco), Peter Gottesman, Brian Barrett (AWS)
Exciting new capabilities in
Open MPI
George Bosilca
Exascale Computing Project
and Open MPI
• DOE program to help develop software
stack to enable application development
for a wide range of exascale class systems
• Open MPI for Exascale (OMPI-X) was one
of 35 proposals selected for funding:
§ Joint proposal involving ORNL,
LANL, SNL, UTK, and LLNL
§ 3 year time frame
Exascale Computing Project
and Open MPI Goals
• Work with MPI Forum on extending the MPI standard to
better support exascale applications
§ Endpoints, Finepoints, Sessions, Resilience
• Improved interoperability of OMPI with other
programming models (MPI+X)
§ Process placement, thread marshaling, resource sharing
• Scalability and Performance
§ Memory footprint, Collective communications, Message
Matching, PMIx
• Resilience/fault tolerance improvements
§ Integration with C/R libraries (FTI, SCR), in-place restart, ULFM
• Enhancing MPI Tools interface (network performance
counters, better RMA support, etc.)
MPI_THREAD_MULTIPLE
• The transition to full threading support is mostly complete
§ We also worked on performance (fine-grained locks)
§ Supports all threading models from a single library
• All atomic accesses are protected
§ Allows asynchronous progress
• Complete redesign of request completion
§ Everything goes through requests (pt2pt, collectives, RMA*, I/O, *)
§ Threads are not competing for resources; instead they collaboratively progress
• A thread will wait until all expected requests have been completed or an error has been raised
• Fewer synchronizations, less overhead (better latency and injection rate)
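A minimal sketch of the usage pattern these changes target (illustrative only): the application requests MPI_THREAD_MULTIPLE and several threads each drive their own requests, separated by tag.

    #include <mpi.h>
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static int rank, size;

    static void *worker(void *arg)
    {
        int tid = (int)(size_t)arg;
        int peer = (rank + 1) % size;
        int sendbuf = rank * 100 + tid, recvbuf = -1;
        MPI_Request reqs[2];

        /* Each thread posts and completes its own requests; the tag keeps the
         * per-thread traffic separate. */
        MPI_Irecv(&recvbuf, 1, MPI_INT, MPI_ANY_SOURCE, tid, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(&sendbuf, 1, MPI_INT, peer, tid, MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        printf("rank %d thread %d received %d\n", rank, tid, recvbuf);
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int provided, i;
        pthread_t threads[NTHREADS];

        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE)
            MPI_Abort(MPI_COMM_WORLD, 1);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        for (i = 0; i < NTHREADS; i++)
            pthread_create(&threads[i], NULL, worker, (void *)(size_t)i);
        for (i = 0; i < NTHREADS; i++)
            pthread_join(threads[i], NULL);

        MPI_Finalize();
        return 0;
    }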
MPI_THREAD_MULTIPLE
• Messages per second injection rate (bigger is better)
• Each process bound to a NUMA node
§ 4 threads per process
§ Distributed environment: TCP (IPoIB)
§ Vader (shared memory)
• All BTLs show similar results
Asynchronous progress
• The BTLs can either have their own progress thread (such as TCP
and usnic) or take advantage of the async progress provided by
OPAL
§ Adaptive Frequency: varies with the expected load (posted or ongoing
data movements)
(Plots: shared memory and TCP over IPoIB.)
Impact on applications
• MADNESS - Multiresolution Adaptive Numerical Environment for
Scientific Simulation
Threading: network support
It's complicated! (only listing components that are in release branches)

Name       Owner          Status        Thread support
--- BTL ---
OpenIB     Chelsio        maintenance   Done
Portals4   SNL            maintenance   Not done
Scif       LANL           maintenance   Not done
self       UTK            active        Done
sm         UTK            active        Done
smcuda     NVIDIA/UTK     active        Done
tcp        UTK            active        Done
uGNI       LANL           active        Done
usnic      CISCO          active        Done
vader      LANL           active        Done
--- PML ---
Yalla      Mellanox       active        In progress
UCX        Mellanox/UTK   active        Done
--- MTL ---
MXM        Mellanox       active        In progress
OFI        Intel          active        In progress
Portals4   SNL            active        In progress
PSM        Intel          active        In progress
PSM2       Intel          active        In progress
Collective Communications
• Architecture aware: reshape “tuned” to account for process placement and node architecture
• Classical two-level (inter- and intra-node) composition of collective algorithms
§ Pipelining is possible, but no other algorithmic composition
(Plot: Bcast, 32 procs, 4x8 cores, OpenIB.)
Collective Communications
• Dataflow collectives: different algorithms compose naturally (using a dynamic granularity for the pipelining fragments)
§ Async collectives start as soon as the process learns that a collective is progressing on a communicator (somewhat similar to unexpected collectives)
§ The algorithm automatically adapts to network conditions
§ Resistant to system noise
(Plot: Ibcast, 16 nodes, computation/communication overlap.)
Collective Communications
• Dataflow collectives: different algorithms compose naturally (using a dynamic granularity for the pipelining fragments)
§ The algorithm automatically adapts to network conditions
§ Resistant to system noise
(Plot: slowdown of broadcast with noise injected on one node.)
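The overlap these collectives exploit is exposed to applications through the standard non-blocking collective interface; a minimal usage sketch (illustrative, not the internal dataflow implementation; the compute function is a stand-in):

    #include <mpi.h>
    #include <stdlib.h>

    /* Stand-in for application work that does not touch 'data'. */
    static void do_independent_work(void) { /* ... */ }

    int main(int argc, char **argv)
    {
        const int count = 1 << 20;
        int rank, i;
        double *data = malloc(count * sizeof(double));
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            for (i = 0; i < count; i++) data[i] = (double)i;

        /* Non-blocking broadcast: the library can progress it while the
         * application computes, which is where the measured overlap comes from. */
        MPI_Ibcast(data, count, MPI_DOUBLE, 0, MPI_COMM_WORLD, &req);
        do_independent_work();
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        free(data);
        MPI_Finalize();
        return 0;
    }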
CUDA Support
• Multi-level coordination
protocol based on the location
of the source and destination
memory
§ Support for GPUDirect
• Delocalize part of the datatype
engine into the GPU
§ Driven by the CPU
§ Provide a different datatype
representation (avoid
branching in the code)
• Deeply integrated support for
OpenIB and shared memory
§ BTL independent support
available for everything else
(Plot: ping-pong benchmark with a matrix in shared memory, single node, intra-GPU; T and V variants with 1 GPU, Open MPI vs. MVAPICH 2.2-GDR; Ivy Bridge E5-2690 v2 @ 3.00GHz, 2 sockets × 10 cores, 4 K40/node.)
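From the application side, CUDA support means device buffers can be handed directly to MPI calls. A minimal sketch, assuming a CUDA-aware Open MPI build (typically configured with --with-cuda) and at least two ranks:

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        const int n = 1 << 20;
        int rank;
        float *dbuf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        cudaMalloc((void **)&dbuf, n * sizeof(float));   /* device memory */
        cudaMemset(dbuf, 0, n * sizeof(float));

        /* The device pointer goes straight to MPI; the library decides whether
         * to use GPUDirect, pipelining through host memory, or staging. */
        if (rank == 0)
            MPI_Send(dbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(dbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        cudaFree(dbuf);
        MPI_Finalize();
        return 0;
    }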
CUDA Support
(Plot: ping-pong benchmark with a matrix in shared memory, single node, inter-GPU; T and V variants with 2 GPUs, Open MPI vs. MVAPICH 2.2-GDR; same system as above.)
CUDA Support
(Plot: ping-pong benchmark with a matrix over InfiniBand, distributed, intra-GPU; T and V variants, Open MPI vs. MVAPICH 2.2-GDR; same system as above.)
CUDA Support
• Architecture-aware collective support
§ Dataflow algorithm
§ Node-level algorithms take advantage of the bidirectional
bandwidth of PCIe
(Plots: Bcast and Reduce latency on 6 nodes (4 GPUs/node) vs. message size up to 34 MB, Open MPI vs. MVAPICH 2.2-GDR; Ivy Bridge E5-2690 v2 @ 3.00GHz, 2 sockets × 10 cores, 4 K40/node.)
User Level Failure Mitigation
§ Open MPI implementation
updated in-sync with Open MPI
2.x
§ Scalable fault tolerant algorithms
demonstrated in practice for
revoke, agreement, and failure
detection (SC’14, EuroMPI’15,
SC’15, SC’16)
(Plot: Early Returning Agreement (ERA) performance depending on the tree topology.)
Practical scalable consensus: fault tolerant agreement costs approximately 2x Allreduce.
(Plots: shared-memory ping-pong latency and bandwidth, Open MPI vs. FT Open MPI, message sizes 1 B to 1 MB; point-to-point performance is unchanged with FT enabled.)
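For orientation, a hedged sketch of the ULFM usage model: errors are returned instead of aborting, the survivors run a fault-tolerant agreement, and the communicator can be shrunk. The MPIX_* names follow the ULFM proposal and are assumed to be provided by the fault-tolerant build (e.g. through mpi-ext.h); they are not part of the base MPI standard.

    #include <mpi.h>
    #include <mpi-ext.h>   /* assumption: ULFM prototype extensions (MPIX_*) */

    int main(int argc, char **argv)
    {
        int rc, flag = 1;
        MPI_Comm shrunk;

        MPI_Init(&argc, &argv);
        /* Ask for errors to be returned rather than aborting the whole job. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        rc = MPI_Barrier(MPI_COMM_WORLD);          /* may fail if a peer died */
        if (rc != MPI_SUCCESS) {
            flag = 0;
            MPIX_Comm_revoke(MPI_COMM_WORLD);      /* interrupt the other survivors */
        }
        /* Fault-tolerant agreement: all survivors learn whether anyone saw a failure. */
        MPIX_Comm_agree(MPI_COMM_WORLD, &flag);
        if (!flag) {
            MPIX_Comm_shrink(MPI_COMM_WORLD, &shrunk);  /* continue without failed ranks */
            /* ... a real application would switch its work to 'shrunk' ... */
        }

        MPI_Finalize();
        return 0;
    }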
Open MPI and Fujitsu
Fujitsu Limited
Fujitsu MPI
with Open MPI Community
• Fujitsu MPI is based on Open MPI
§ Running on K computer and its commercial/
successor machines for over 5 years.
§ For Post-K, also Open MPI based.
• Collaboration plan with OMPI community
§ ARM support
§ PMIx integration
§ Reduced memory footprint
§ New MPI version support
§ Thread parallelization extension etc.
Fujitsu Contribution Examples
• Thread parallelization in the MPI library
• Statistical information for application tuning
(Heat map: speed-up ratio of MPI_Pack compared to a single thread, as a function of the number of blocks (2 to 1Mi) and the byte size of a block (2 to 1Mi).)
See https://github.com/open-mpi/ompi/wiki/Meeting-2016-02-attachments/Fujitsu-OMPI-Dev-Meeting-2016-Feb.pdf
RMA Update
Nathan Hjelm
Los Alamos National Laboratory
v2.x osc/pt2pt
• Fully supports MPI-3.1 RMA
• Full support for MPI datatypes
• Emulates one-sided operation using point-to-point
components (PML) for communication
• Improved lock-all scaling
• Improved support for MPI_THREAD_MULTIPLE
• Caveat: asynchronous progress support lacking
• Targets must enter MPI to progress any one-sided
data movement!
• Doesn’t really support passive-target
v2.x osc/rdma
• Fully supports MPI-3.1 RMA
• Full support for MPI datatypes
• Fully supports passive target RMA operations
• Uses network RMA and atomic operation
support through Byte Transport Layer (BTL)
• Supports Infiniband, Infinipath/Omnipath**, and
Cray Gemini/Aries
• Additional networks can be supported
• Requirements: put, get, fetch-and-add, and
compare-and-swap
v2.x osc/rdma
• Improved support for
MPI_THREAD_MULTIPLE
• Improved memory scaling
• Support for hardware atomics
• Supports MPI_Fetch_and_add and
MPI_Compare_and_swap
• Supports 32 and 64 bit integer and floating point
values
• Accelerated MPI_Ops varies by hardware
• Set osc_rdma_acc_single_intrinsic MCA variable
to true to enable
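A minimal sketch of the accelerated path: a shared 64-bit counter updated with MPI_Fetch_and_op inside a lock-all epoch. The mpirun line in the comment shows how the MCA variable named on the slide would be set; treat the exact command line as illustrative.

    /* Example run (illustrative):
     *   mpirun -n 8 --mca osc rdma --mca osc_rdma_acc_single_intrinsic true ./counter
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        long long counter = 0, one = 1, prev = -1;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Rank 0 hosts the counter; everyone atomically increments it. */
        MPI_Win_create(&counter, (rank == 0) ? sizeof(counter) : 0,
                       sizeof(counter), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_lock_all(0, win);
        MPI_Fetch_and_op(&one, &prev, MPI_LONG_LONG, 0, 0, MPI_SUM, win);
        MPI_Win_flush(0, win);            /* the fetched value is now valid */
        MPI_Win_unlock_all(win);

        printf("rank %d fetched previous counter value %lld\n", rank, prev);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }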
v2.x RMA Performance
• IMB Truly Passive Put on Cray XC-40
(Plot: IMB truly passive put overlap percentage vs. message size (2^0 to 2^20 bytes), osc/pt2pt vs. osc/rdma.)
v2.x RMA Performance
• Contended MPI_Fetch_and_op performance
(Plot: MPI_Fetch_and_op latency with contention on a Cray XC40 vs. number of MPI processes (10 to 1000); osc/pt2pt vs. osc/rdma with LOCK, AMO, and CAS variants.)
v2.x RMA Performance
• osc_put_latency with MPI_Win_flush on XC-40
(Plot: OSU put latency with MPI_Win_flush on a Cray XC40 vs. message size (2^0 to 2^20 bytes), osc/pt2pt vs. osc/rdma, both with flush.)
IBM Spectrum MPI
Perry Schmidt
IBM Spectrum MPI
• IBM Spectrum MPI is a pre-built, pre-packaged version of
Open MPI plus IBM value add components.
• Supports both PowerPC and x86
• Includes many of the popular Open MPI components
selectable at runtime
§ E.g. MXM, usNIC, Omnipath
• Spectrum MPI 10.1.0.2 is scheduled for release in December 2016.
§ First release (Spectrum MPI 10.1) was July of 2016.
• Part of IBM Spectrum HPC Software Stack
Evaluation Downloads for Spectrum MPI:
https://www-01.ibm.com/marketing/iwm/iwm/web/reg/pick.do?source=swerpsysz-lsf-
3&S_PKG=mpi101&S_TACT=000002PW&lang=en_US
IBM Value Add Components
• PAMI PML and OSC
§ Improved point-to-point and one-sided performance for Mellanox
InfiniBand
§ Includes support for Nvidia GPU buffers
§ Additional performance optimizations
• IBM Collective Library
§ Significant performance improvements for blocking and non-
blocking collectives over OMPI collectives.
§ Dynamic collective algorithm selection.
• GPU support
§ CUDA-Aware support for Nvidia GPU cards.
§ Adding GPU RDMA Direct / Async in future release.
IBM Testing and Support
• Extensive level of testing for IBM releases
§ Standard Open MPI release testing…
§ …Plus Platform MPI test suites
§ …Plus HPC stack integration testing
• IBM Customer Support
§ For customers running a licensed copy of IBM Spectrum MPI
§ IBM will work with customers and partners to resolve issues in
non IBM-owned components
Community support
• Actively participating in the Open MPI community
• MTT and Jenkins testing on IBM PowerPC servers
• New features that will be contributed back
§ Improved LSF support
§ -aff: easy to use affinity controls
§ -prot: protocol connectivity report
§ -entry: dynamic layering of multiple PMPI libraries
§ …and more…
• PMIx improvements (go to their BOF…)
§ Focus on CORAL-sized clusters
• Bug fixes, bug fixes, bug fixes…
Mellanox Community Efforts
Yossi Itigin
The Power of Community
Compels Us
• Engaged in multiple open source efforts enabling
exascale MPI and PGAS applications
§ UCX
• Open source
• Strong vendor ecosystem
• Near bare metal performance across a range of fabrics
• InfiniBand, uGNI, RoCE, shared memory
§ PMIx (PMI eXascale)
• Open source
• Exascale job launch
• Supported by SLURM, LSF, PBS
Exascale Enabled
Out-of-the-Box
• UCX
§ UCX PML starting from v1.10 (MPI)
§ UCX SPML starting from v1.10 (OSHMEM)
§ Support for advanced PMIx features
• Direct modex
• Non-blocking fence
• Eliminate the barrier in initialization
• OSHMEM
§ Open MPI v2.1.0 is (will be) OSHMEM v1.3 compliant!
OMPI-UCX Performance Data
• Mellanox system
§ Switch: SX6518
§ InfiniBand ConnectX-4
§ CPU: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
§ Red Hat Enterprise Linux Server release 7.2
§ Open MPI/SHMEM v2.0.2a1
§ UCX version:
# UCT version=1.0.2129
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --
prefix=/hpc/local/benchmarks/hpcx_install_Wednesday/hpcx-icc-redhat7.2/ucx --with-
knem=/hpc/local/benchmarks/hpcx_install_Wednesday/hpcx-icc-redhat7.2/knem
§ Benchmark: OSU v5.1
MPI UCX PML Point-to-Point
Latency
MPI UCX PML Point-to-Point
Bandwidth
MPI UCX PML Point-to-Point
Message Rate
OSHMEM + UCX SPML
Atomic Rates
OSU OpenSHMEM atomic operation rate, ConnectX-4 HCA
(millions of atomic operations per second)

Operation               RC Transport   DC Transport
shmem_int_fadd          14.03          15.6
shmem_int_finc          23.03          22.55
shmem_int_add           87.73          115.75
shmem_int_inc           81.13          122.92
shmem_int_cswap         23.14          22.74
shmem_int_swap          23.17          22.26
shmem_longlong_fadd     23.24          22.87
shmem_longlong_finc     23.15          22.83
shmem_longlong_add      80.08          91.22
shmem_longlong_inc      76.13          95.61
shmem_longlong_cswap    15.18          22.7
shmem_longlong_swap     22.79          22.84
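For context, the kind of kernel behind these numbers is a simple atomic update loop; below is a minimal hedged sketch using the OpenSHMEM 1.3 C API that Open MPI's OSHMEM layer targets (values are illustrative):

    #include <shmem.h>
    #include <stdio.h>

    int main(void)
    {
        static int counter = 0;    /* symmetric variable: one copy per PE */
        int me, npes, old;

        shmem_init();
        me = shmem_my_pe();
        npes = shmem_n_pes();

        /* Every PE atomically fetch-and-adds 1 into PE 0's copy of the counter. */
        old = shmem_int_fadd(&counter, 1, 0);

        shmem_barrier_all();
        if (me == 0)
            printf("PE 0: counter = %d after %d PEs, my fetched value was %d\n",
                   counter, npes, old);

        shmem_finalize();
        return 0;
    }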
OSHMEM + UCX SPML One Sided
Message Rate (osu_oshm_put_mr)
Mapping, Ranking, Binding:
Oh My!
Ralph H Castain
Intel, Inc.
Why bind?
• In Open MPI early days
§ No default binding
§ Simple bind-to-core, bind-to-socket options
• Motivators
§ Competitor “out-of-the-box” comparisons
• Bound by default
§ Researchers
• Fine-grained positioning, binding
§ More complex chips
Terminology
• Slots
§ How many processes allowed on a node
• Oversubscribe: #processes > #slots
§ Nothing to do with hardware!
• System admins often configure a linkage
• Processing element (“PE”)
§ Smallest atomistic processor
§ Frequently core or HT (tile?)
§ Overload: more than one process bound to PE
Three Phases
• Mapping
§ Assign each process to a location
§ Determines which processes run on what nodes
§ Detects oversubscription
• Ranking
§ Assigns MPI_COMM_WORLD rank to each process
• Binding
§ Binds processes to specific processing elements
§ Detects overload
Examples: notation
Legend: one core, one chip package, one hyperthread.
Example system: Node 1 and Node 2, slots = 3/node.
Mapping
Example: #procs = 5, two nodes, slots = 3/node.
Map-by options: slot, node, NUMA, socket, cache (L1,2,3), core, hwthread; modifiers: oversubscribe / nooversubscribe, span, PE.
(Diagrams show example placements of processes A-E across the two nodes for different mapping policies: ABC DE; ACE BD; AC B D E; AE B C D.)
Ranking
Example: #procs = 5, slots = 3/node, map-by socket:span (placement AE B C D).
Rank-by options: slot, node, NUMA, socket, cache (L1,2,3), core, hwthread; modifiers: span, fill.
(Diagrams show the resulting MPI_COMM_WORLD ranks for different rank-by choices: 0,1 2 3 4 (default); 0,2 4 1 3; 0,2 1 3 4; 0,4 1 2 3.)
Binding
Example: #procs = 5, slots = 3/node, map-by socket:span, rank-by socket:span (ranks 0,4 1 2 3).
Bind-to options: NUMA, socket, cache (L1,2,3), core, hwthread.
(Diagrams show where ranks 0, 4, 1, 2, 3 end up bound on the two nodes.)
Defaults
Node 1 and Node 2, slots = 3/node.
#procs <= 2: map-by core, rank-by slot, bind-to core (ranks 0, 1).
#procs > 2: map-by slot, rank-by slot, bind-to NUMA (ranks 0, 1, 2 on node 1; 3, 4 on node 2).
Mapping: PE Option
Example: #procs = 3, two nodes, slots = 3/node.
Mapping with PE=n assigns n processing elements to each process; with rank-by slot this implies bind-to core, or bind-to hwthread when hwthreads-as-cpus is used.
(Diagrams show the resulting placements for PE=2 and PE=3, with and without hwthreads-as-cpus.)
Conclusion
• bind-to defaults
§ map-by specified → bind to that level
§ map-by not specified
• np <= 2: bind-to core
• np > 2: bind-to NUMA
§ PE > 1 → bind to PE
• Map, rank, bind
§ Separate dimensions
§ Considerable flexibility
Right combination is highly application specific!
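A practical way to see what a given map/rank/bind combination did is to have each rank report where it landed. A minimal Linux-specific sketch (illustrative; option spellings follow the discussion above and may vary between releases):

    /* Example run (illustrative):
     *   mpirun -np 5 --map-by socket:span --rank-by slot --bind-to core ./report
     */
    #define _GNU_SOURCE
    #include <mpi.h>
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank, i, n = 0;
        char host[64], cpus[256] = "";
        cpu_set_t mask;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        gethostname(host, sizeof(host));

        /* Which hardware threads is this process allowed to run on? */
        sched_getaffinity(0, sizeof(mask), &mask);
        for (i = 0; i < CPU_SETSIZE && n < (int)sizeof(cpus) - 8; i++)
            if (CPU_ISSET(i, &mask))
                n += snprintf(cpus + n, sizeof(cpus) - n, "%d ", i);

        printf("rank %d on %s bound to cpus: %s\n", rank, host, cpus);
        MPI_Finalize();
        return 0;
    }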
Wrap up
Where do we need help?
• Code
§ Any bug that bothers you
§ Any feature that you can add
• User documentation
• Testing
• Usability
• Release engineering
Come join us!
George Bosilca
Yossi Itigin, Nathan Hjelm
Perry Schmidt, Jeff Squyres
Ralph Castain

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

gRPC: The Story of Microservices at Square
gRPC: The Story of Microservices at SquaregRPC: The Story of Microservices at Square
gRPC: The Story of Microservices at Square
 
Cleaning Up the Dirt of the Nineties - How New Protocols are Modernizing the Web
Cleaning Up the Dirt of the Nineties - How New Protocols are Modernizing the WebCleaning Up the Dirt of the Nineties - How New Protocols are Modernizing the Web
Cleaning Up the Dirt of the Nineties - How New Protocols are Modernizing the Web
 
SFO15-110: Toolchain Collaboration
SFO15-110: Toolchain CollaborationSFO15-110: Toolchain Collaboration
SFO15-110: Toolchain Collaboration
 
Bringing Learnings from Googley Microservices with gRPC - Varun Talwar, Google
Bringing Learnings from Googley Microservices with gRPC - Varun Talwar, GoogleBringing Learnings from Googley Microservices with gRPC - Varun Talwar, Google
Bringing Learnings from Googley Microservices with gRPC - Varun Talwar, Google
 
Generating Unified APIs with Protocol Buffers and gRPC
Generating Unified APIs with Protocol Buffers and gRPCGenerating Unified APIs with Protocol Buffers and gRPC
Generating Unified APIs with Protocol Buffers and gRPC
 
Debugging of (C)Python applications
Debugging of (C)Python applicationsDebugging of (C)Python applications
Debugging of (C)Python applications
 
new ifnet(9) KPI for FreeBSD network stack
new ifnet(9) KPI for FreeBSD network stacknew ifnet(9) KPI for FreeBSD network stack
new ifnet(9) KPI for FreeBSD network stack
 
lwref: insane idea on reference counting
lwref: insane idea on reference countinglwref: insane idea on reference counting
lwref: insane idea on reference counting
 
gRPC: Beyond REST
gRPC: Beyond RESTgRPC: Beyond REST
gRPC: Beyond REST
 
HKG15-301: OVS implemented via ODP & vendor SDKs
HKG15-301: OVS implemented via ODP & vendor SDKsHKG15-301: OVS implemented via ODP & vendor SDKs
HKG15-301: OVS implemented via ODP & vendor SDKs
 
gRPC
gRPCgRPC
gRPC
 
Netty Cookbook - Chapter 2
Netty Cookbook - Chapter 2Netty Cookbook - Chapter 2
Netty Cookbook - Chapter 2
 
New sendfile
New sendfileNew sendfile
New sendfile
 
HKG15-110: ODP Project Update
HKG15-110: ODP Project UpdateHKG15-110: ODP Project Update
HKG15-110: ODP Project Update
 
gRPC and Microservices
gRPC and MicroservicesgRPC and Microservices
gRPC and Microservices
 
SFO15-102:ODP Project Update
SFO15-102:ODP Project UpdateSFO15-102:ODP Project Update
SFO15-102:ODP Project Update
 
Power-up services with gRPC
Power-up services with gRPCPower-up services with gRPC
Power-up services with gRPC
 
OFI Overview 2019 Webinar
OFI Overview 2019 WebinarOFI Overview 2019 Webinar
OFI Overview 2019 Webinar
 
Debugging Python with gdb
Debugging Python with gdbDebugging Python with gdb
Debugging Python with gdb
 
Introduction to gRPC
Introduction to gRPCIntroduction to gRPC
Introduction to gRPC
 

Andere mochten auch

MPI Raspberry pi 3 cluster
MPI Raspberry pi 3 clusterMPI Raspberry pi 3 cluster
MPI Raspberry pi 3 cluster
Arafat Hussain
 
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big ComputingEuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing
Jonathan Dursi
 

Andere mochten auch (14)

Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen IIPorting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
 
MPI and Distributed Applications
MPI and Distributed ApplicationsMPI and Distributed Applications
MPI and Distributed Applications
 
MPI Raspberry pi 3 cluster
MPI Raspberry pi 3 clusterMPI Raspberry pi 3 cluster
MPI Raspberry pi 3 cluster
 
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big ComputingEuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing
 
Ejclase mpi
Ejclase mpiEjclase mpi
Ejclase mpi
 
CUDA-Aware MPI
CUDA-Aware MPICUDA-Aware MPI
CUDA-Aware MPI
 
MPI message passing interface
MPI message passing interfaceMPI message passing interface
MPI message passing interface
 
FDA Focus on Design Controls
FDA Focus on Design Controls FDA Focus on Design Controls
FDA Focus on Design Controls
 
Data dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLData dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNL
 
FireWorks overview
FireWorks overviewFireWorks overview
FireWorks overview
 
HPC Best Practices: Application Performance Optimization
HPC Best Practices: Application Performance OptimizationHPC Best Practices: Application Performance Optimization
HPC Best Practices: Application Performance Optimization
 
ISBI MPI Tutorial
ISBI MPI TutorialISBI MPI Tutorial
ISBI MPI Tutorial
 
High Performance Computing - The Future is Here
High Performance Computing - The Future is HereHigh Performance Computing - The Future is Here
High Performance Computing - The Future is Here
 
High Performance Computing using MPI
High Performance Computing using MPIHigh Performance Computing using MPI
High Performance Computing using MPI
 

Ähnlich wie Open MPI State of the Union X SC'16 BOF

Language agnostic technologies introduced in pi web-agent 0.4rc2
Language agnostic technologies  introduced in pi web-agent 0.4rc2Language agnostic technologies  introduced in pi web-agent 0.4rc2
Language agnostic technologies introduced in pi web-agent 0.4rc2
Andreas Galazis
 
"The OpenCV Open Source Computer Vision Library: What’s New and What’s Coming...
"The OpenCV Open Source Computer Vision Library: What’s New and What’s Coming..."The OpenCV Open Source Computer Vision Library: What’s New and What’s Coming...
"The OpenCV Open Source Computer Vision Library: What’s New and What’s Coming...
Edge AI and Vision Alliance
 
Concerto conmoto
Concerto conmotoConcerto conmoto
Concerto conmoto
mskmoorthy
 

Ähnlich wie Open MPI State of the Union X SC'16 BOF (20)

FOSSology & GSOC Journey
FOSSology & GSOC JourneyFOSSology & GSOC Journey
FOSSology & GSOC Journey
 
Php Dependency Management with Composer ZendCon 2016
Php Dependency Management with Composer ZendCon 2016Php Dependency Management with Composer ZendCon 2016
Php Dependency Management with Composer ZendCon 2016
 
BigBlueButton Platform Components
BigBlueButton Platform ComponentsBigBlueButton Platform Components
BigBlueButton Platform Components
 
Language agnostic technologies introduced in pi web-agent 0.4rc2
Language agnostic technologies  introduced in pi web-agent 0.4rc2Language agnostic technologies  introduced in pi web-agent 0.4rc2
Language agnostic technologies introduced in pi web-agent 0.4rc2
 
Openesb past present_future
Openesb past present_futureOpenesb past present_future
Openesb past present_future
 
Openesb past present_future
Openesb past present_futureOpenesb past present_future
Openesb past present_future
 
Openesb past present and future
Openesb past present and futureOpenesb past present and future
Openesb past present and future
 
Mobile, Open Source, and the Drive to the Cloud
Mobile, Open Source, and the Drive to the CloudMobile, Open Source, and the Drive to the Cloud
Mobile, Open Source, and the Drive to the Cloud
 
Mobile, Open Source, & the Drive to the Cloud
Mobile, Open Source, & the Drive to the CloudMobile, Open Source, & the Drive to the Cloud
Mobile, Open Source, & the Drive to the Cloud
 
The new WPE API
The new WPE APIThe new WPE API
The new WPE API
 
"The OpenCV Open Source Computer Vision Library: What’s New and What’s Coming...
"The OpenCV Open Source Computer Vision Library: What’s New and What’s Coming..."The OpenCV Open Source Computer Vision Library: What’s New and What’s Coming...
"The OpenCV Open Source Computer Vision Library: What’s New and What’s Coming...
 
Concerto conmoto
Concerto conmotoConcerto conmoto
Concerto conmoto
 
Evolution ofversioncontrolinopensource
Evolution ofversioncontrolinopensourceEvolution ofversioncontrolinopensource
Evolution ofversioncontrolinopensource
 
How to migrate SourcePro apps from Solaris to Linux
How to migrate SourcePro apps from Solaris to LinuxHow to migrate SourcePro apps from Solaris to Linux
How to migrate SourcePro apps from Solaris to Linux
 
Interop 2018 - Understanding Kubernetes - Brian Gracely
Interop 2018 - Understanding Kubernetes - Brian GracelyInterop 2018 - Understanding Kubernetes - Brian Gracely
Interop 2018 - Understanding Kubernetes - Brian Gracely
 
There is no such thing as “Vanilla Kubernetes”
There is no such thing as “Vanilla Kubernetes”There is no such thing as “Vanilla Kubernetes”
There is no such thing as “Vanilla Kubernetes”
 
OpenShift with Eclipse Tooling - EclipseCon 2012
OpenShift with Eclipse Tooling - EclipseCon 2012OpenShift with Eclipse Tooling - EclipseCon 2012
OpenShift with Eclipse Tooling - EclipseCon 2012
 
Vs code extensions required for blockchain development
Vs code extensions required for blockchain developmentVs code extensions required for blockchain development
Vs code extensions required for blockchain development
 
Evolution of Version Control In Open Source
Evolution of Version Control In Open SourceEvolution of Version Control In Open Source
Evolution of Version Control In Open Source
 
Varnish : Advanced and high-performance HTTP caching
Varnish : Advanced and high-performance HTTP cachingVarnish : Advanced and high-performance HTTP caching
Varnish : Advanced and high-performance HTTP caching
 

Mehr von Jeff Squyres

Cisco EuroMPI'13 vendor session presentation
Cisco EuroMPI'13 vendor session presentationCisco EuroMPI'13 vendor session presentation
Cisco EuroMPI'13 vendor session presentation
Jeff Squyres
 

Mehr von Jeff Squyres (16)

MPI Sessions: a proposal to the MPI Forum
MPI Sessions: a proposal to the MPI ForumMPI Sessions: a proposal to the MPI Forum
MPI Sessions: a proposal to the MPI Forum
 
MPI Fourm SC'15 BOF
MPI Fourm SC'15 BOFMPI Fourm SC'15 BOF
MPI Fourm SC'15 BOF
 
Cisco's journey from Verbs to Libfabric
Cisco's journey from Verbs to LibfabricCisco's journey from Verbs to Libfabric
Cisco's journey from Verbs to Libfabric
 
(Very) Loose proposal to revamp MPI_INIT and MPI_FINALIZE
(Very) Loose proposal to revamp MPI_INIT and MPI_FINALIZE(Very) Loose proposal to revamp MPI_INIT and MPI_FINALIZE
(Very) Loose proposal to revamp MPI_INIT and MPI_FINALIZE
 
2014 01-21-mpi-community-feedback
2014 01-21-mpi-community-feedback2014 01-21-mpi-community-feedback
2014 01-21-mpi-community-feedback
 
Cisco usNIC: how it works, how it is used in Open MPI
Cisco usNIC: how it works, how it is used in Open MPICisco usNIC: how it works, how it is used in Open MPI
Cisco usNIC: how it works, how it is used in Open MPI
 
Cisco EuroMPI'13 vendor session presentation
Cisco EuroMPI'13 vendor session presentationCisco EuroMPI'13 vendor session presentation
Cisco EuroMPI'13 vendor session presentation
 
Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)
Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)
Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)
 
MPI History
MPI HistoryMPI History
MPI History
 
MOSSCon 2013, Cisco Open Source talk
MOSSCon 2013, Cisco Open Source talkMOSSCon 2013, Cisco Open Source talk
MOSSCon 2013, Cisco Open Source talk
 
Ethernet and TCP optimizations
Ethernet and TCP optimizationsEthernet and TCP optimizations
Ethernet and TCP optimizations
 
Friends don't let friends leak MPI_Requests
Friends don't let friends leak MPI_RequestsFriends don't let friends leak MPI_Requests
Friends don't let friends leak MPI_Requests
 
MPI-3 Timer requests proposal
MPI-3 Timer requests proposalMPI-3 Timer requests proposal
MPI-3 Timer requests proposal
 
MPI_Mprobe is good for you
MPI_Mprobe is good for youMPI_Mprobe is good for you
MPI_Mprobe is good for you
 
The Message Passing Interface (MPI) in Layman's Terms
The Message Passing Interface (MPI) in Layman's TermsThe Message Passing Interface (MPI) in Layman's Terms
The Message Passing Interface (MPI) in Layman's Terms
 
What is [Open] MPI?
What is [Open] MPI?What is [Open] MPI?
What is [Open] MPI?
 

Kürzlich hochgeladen

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Kürzlich hochgeladen (20)

Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 

Open MPI State of the Union X SC'16 BOF

  • 1. Open MPI State of the Union X Community Meeting SC‘16 George Bosilca Yossi Itigin Nathan Hjelm Perry SchmidtJeff Squyres Ralph Castain
  • 2. Open MPI State of the Union X Community Meeting SC‘16 10 years of SC Open MPI BOFs!
  • 4. www.mcs.anl.gov/eurompi2017/ • CFP online at web site • Location: Argonne National Laboratory (Chicago, Illinois, USA) • Full paper submission deadline: 1st May 2016
  • 6. Github: we used to have 2 repos ompi ompi-release
  • 7. Github: we used to have 2 repos ompi ompi-release
  • 8. Github: now we have just one repo ompi
  • 9. Contribution policy • For 10+ years, we have required a signed contribution agreement for “sizable” code contributions The   Open   MPI   Project  Software   Grant   and   Corporate   Contributor   License   Agreement   (“Agreement”)  http://www.open­mpi.org/community/contribute/  (v1.6)      Thank   you   for   your   interest   in   The   Open   MPI   Project   (the   “Project”).   In   order   to   clarify  the   intellectual   property   license   granted   with   Contributions   from   any   person   or   entity,   the  Open   MPI   Project   copyright   holders   (the   “Copyright   Holders”)   must   have   a   Contributor  License   Agreement   (CLA)   on   file   that   has   been   signed   by   each   Contributor,   indicating  agreement   to   the   license   terms   below.   This   license   is   for   your   protection   as   a   Contributor  as   well   as   the   protection   of   the   Copyright   Holders   and   their   users;   it   does   not   change   your  rights   to   use   your   own   Contributions   for   any   other   purpose.    This   version   of   the   Agreement   allows   an   entity   (the   "Corporation")   to   submit  Contributions   to   the   Copyright   Holders,   to   authorize   Contributions   submitted   by   its  designated   employees   to   the   Copyright   Holders,   and   to   grant   copyright   and   patent  licenses   thereto.    If   you   have   not   already   done   so,   please   complete   and   send   a   signed   Agreement   to  ompi­contributors@lists.open­mpi.org .      Please   read   this   document   carefully   before  signing   and   keep   a   copy   for   your   records.      Corporation   name: ________________________________________________    Corporation   address: ________________________________________________    ________________________________________________    ________________________________________________    Point   of   Contact: ________________________________________________    E­Mail: ________________________________________________    Telephone: _____________________ Fax:   ____________________      You   accept   and   agree   to   the   following   terms   and   conditions   for   Your   present   and   future  Contributions   submitted   to   the   Copyright   Holders.   In   return,   the   Copyright   Holders   shall  not   use   Your   Contributions   in   a   way   that   is   contrary   to   the   public   benefit.   Except   for   the  license   granted   herein   to   the   Copyright   Holders   and   recipients   of   software   distributed   by  the   Copyright   Holders,   You   reserve   all   right,   title,   and   interest   in   and   to   Your  Contributions. 
  • 10. Contribution policy • This is no longer necessary The   Open   MPI   Project  Software   Grant   and   Corporate   Contributor   License   Agreement   (“Agreement”)  http://www.open­mpi.org/community/contribute/  (v1.6)      Thank   you   for   your   interest   in   The   Open   MPI   Project   (the   “Project”).   In   order   to   clarify  the   intellectual   property   license   granted   with   Contributions   from   any   person   or   entity,   the  Open   MPI   Project   copyright   holders   (the   “Copyright   Holders”)   must   have   a   Contributor  License   Agreement   (CLA)   on   file   that   has   been   signed   by   each   Contributor,   indicating  agreement   to   the   license   terms   below.   This   license   is   for   your   protection   as   a   Contributor  as   well   as   the   protection   of   the   Copyright   Holders   and   their   users;   it   does   not   change   your  rights   to   use   your   own   Contributions   for   any   other   purpose.    This   version   of   the   Agreement   allows   an   entity   (the   "Corporation")   to   submit  Contributions   to   the   Copyright   Holders,   to   authorize   Contributions   submitted   by   its  designated   employees   to   the   Copyright   Holders,   and   to   grant   copyright   and   patent  licenses   thereto.    If   you   have   not   already   done   so,   please   complete   and   send   a   signed   Agreement   to  ompi­contributors@lists.open­mpi.org .      Please   read   this   document   carefully   before  signing   and   keep   a   copy   for   your   records.      Corporation   name: ________________________________________________    Corporation   address: ________________________________________________    ________________________________________________    ________________________________________________    Point   of   Contact: ________________________________________________    E­Mail: ________________________________________________    Telephone: _____________________ Fax:   ____________________      You   accept   and   agree   to   the   following   terms   and   conditions   for   Your   present   and   future  Contributions   submitted   to   the   Copyright   Holders.   In   return,   the   Copyright   Holders   shall  not   use   Your   Contributions   in   a   way   that   is   contrary   to   the   public   benefit.   Except   for   the  license   granted   herein   to   the   Copyright   Holders   and   recipients   of   software   distributed   by  the   Copyright   Holders,   You   reserve   all   right,   title,   and   interest   in   and   to   Your  Contributions. 
  • 11. Contribution policy • Instead, we now require a “Signed-off-by” token in commit messages • Can automatically be added by “git commit –s” Some awesome new feature Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
  • 12. Signed-off-by • Intent: make it easier for individuals and organizations to contribute to Open MPI • “Signed-off-by” means agreement to the Open MPI Contributor’s Declaration § See the full definition here § This is common in many open source projects
  • 15. Open MPI versioning • Open MPI will (continue to) use a “A.B.C” version number triple • Each number now has a specific meaning: This number changes when backwards compatibility breaks This number changes when new features are added This number changes for all other releases A B C 15
  • 16. Definition • Open MPI vY is backwards compatible with Open MPI vX (where Y>X) if: § Users can compile a correct MPI / OSHMEM program with vX § Run it with the same CLI options and MCA parameters using vX or vY § The job executes correctly 16
  • 17. What does that encompass? • “Backwards compatibility” covers several areas: § Binary compatibility, specifically the MPI / OSHMEM API ABI § MPI / OSHMEM run time system § mpirun / oshrun CLI options § MCA parameter names / values / meanings 17
  • 18. What does that not encompass? • Open MPI only supports running exactly the same version of the runtime and MPI / OSHMEM libraries in a single job § If you mix-n-match vX and vY in a single job… 18 ERROR
  • 19. Current version series • v1.10.x series § Older, stable, rapidly hitting end of life • v2.0.x series § Current stable series • v2.x series § Upcoming series
  • 21. v1.10.x release manager • Ralph Castain, Intel
  • 22. v1.10 series • Soon to be end of life • One more release expected: v1.10.5 § Bug fixes only – no new features § Do not have a specific timeframe If you are still running v1.10.x, please start migrating to v2.0.x
  • 24. v2.x release managers • Howard Pritchard, Los Alamos National Lab • Jeff Squyres, Cisco Systems, Inc.
  • 25. v2 versions Master development v2.0.0 v2.x branch v2.0.x branch Bug fixes only v2.0.1 v2.0.2 Expected Dec 2016 v2.1.0 Expected Q1 2017
  • 26. v2.1.x will get its own branch (bug fixes only) v2.x v2.0.0 v2.0.x v2.0.1 v2.0.2 v2.1.0 v2.1.x v2.1.1
  • 27. v3.x will definitely (eventually) happen v2.x v2.0.0 v2.0.x v2.0.1 v2.0.2 v2.1.0 v2.1.x v2.1.1 v3.x
  • 28. Is it worthwhile to make an intermediate v2.2.x? v2.x v2.0.0 v2.0.x v2.0.1 v2.0.2 v2.1.0 v2.1.x v2.1.1 v2.2.x v3.x
  • 29. v2.2.x: pros and cons PRO • The v2.x branch is a good stable starting point • Easy to maintain backwards compatibility with all v2.x series CON • Difficulty in porting new features from master branch § Due to drift from master • Consumes developer resources and pushes back 3.x release § And therefore some ”bigger” features This is an open question in the developer community
  • 30. Should we do a v2.2.x series? Please let us know your opinion! www.open-mpi.org/sc16/
  • 31. Some lesser-known Open MPI features you may not be aware of Random sample of v2.x features / work
  • 32. Singularity • Containers are of growing interest § Packaged applications § Portability • Some issues remain § Cross-container boundary interactions for MPI wireup, resource manager interactions § Properly handling “containers” as “apps” Leads: Ralph Castain (Intel), Greg Kurtzer (LBNL)
  • 33. Open MPI Singularity Support • PMIx support § Cross-version compatibility § Standardized protocol across environments • Auto-detection of containers § Identify that app is a Singularity container § Do the Right Things to optimize behavior • Auto-packaging of Open MPI apps § Singularity detects Open MPI app and automatically includes all required libs
  • 34. ORTE Distributed Virtual Machine • Original goal § Circumvent Cray’s single-job-per-node limit § Enable new programming models • RADICAL-Pilot § Decompose a large parallel problem into a bag of MPI tasks § Decoupled, can be executed in parallel § Faster convergence to solution Leads: Mark Santcroos (Rutgers), Ralph Castain (Intel)
  • 35. Role of DVM • Launch/wireup time dominates execution § DVM instantiated once § Tasks highly asynchronous § Run tasks in parallel, share CPU cycles • Command-line interface • Python language bindings through CFFI • Result § Improved concurrency (~16k concurrent tasks) § Improved throughput (100 tasks/s)
  • 36. ORTE-DVM + RADICAL-Pilot #tasks = 3 * #nodes, 64s tasks (SSH)
  • 37. Future Work • Bulk interface to orte_submit() • OFI-based (libfabric) inter-ORTE daemon communication • Optimize ORTE communication topology • Topology aware task placement
  • 38. Open MPI I/O (“OMPIO”) • Highly modular architecture for parallel I/O § Separate implementation from ROMIO • Default parallel I/O library in Open MPI § For all file systems starting from the v2.0.x release series, with the exception of Lustre Lead: Edgar Gabriel (U. Houston)
  • 39. OMPIO key features • Tightly integrated with the Open MPI architecture § Frameworks/modules, derived datatype handling, progress engine, etc. • Support for multiple collective I/O algorithms • Automatic adjustments of # of aggregators • Multiple mechanisms available for shared file pointer operations
  • 40. OMPIO ongoing work • Enhance support for the Lustre file system in the v2.1.x release series • Support for hybrid / multi-threaded applications • Enhance support for GPFS § Collaboration with HLRS • Enhance out-of-the-box performance of OMPIO
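  For reference, a minimal collective write through the standard MPI I/O interface; whether OMPIO or ROMIO services the call follows the defaults above, and OMPIO can be requested explicitly with the MCA parameter shown in the comment (the file name is just an illustration):

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    /* Run e.g.:  mpirun --mca io ompio -np 4 ./a.out
       to explicitly select the OMPIO component. */
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int data[4] = { rank, rank, rank, rank };
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes its block at a disjoint offset; the collective
       variant lets the I/O component apply collective I/O algorithms. */
    MPI_Offset offset = (MPI_Offset)rank * sizeof(data);
    MPI_File_write_at_all(fh, offset, data, 4, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```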
  • 41. OMPIO
  • 42. AWS scale testing • EC2 donation for Open MPI and PMIx scalability § Access to larger resources than individual organizations have • Both science and engineering § Data-driven analysis § Used for regression testing • Early days: no results to share yet Leads: Jeff Squyres (Cisco), Peter Gottesman, Brian Barrett (AWS)
  • 43. Exciting new capabilities in Open MPI George Bosilca
  • 44. Exascale Computing Project and Open MPI • DOE program to help develop software stack to enable application development for a wide range of exascale class systems • Open MPI for Exascale (OMPI-X) was one of 35 proposals selected for funding: § Joint proposal involving ORNL, LANL, SNL, UTK, and LLNL § 3 year time frame
  • 45. Exascale Computing Project and Open MPI Goals • Work with MPI Forum on extending the MPI standard to better support exascale applications § Endpoints, Finepoints, Sessions, Resilience • Improved interoperability of OMPI with other programming models (MPI+X) § Process placement, thread marshaling, resource sharing • Scalability and Performance § Memory footprint, Collective communications, Message Matching, PMIx • Resilience/fault tolerance improvements § Integration with C/R libraries (FTI, SCR), in-place restart, ULFM • Enhancing MPI Tools interface (network performance counters, better RMA support, etc.)
  • 46. MPI_THREAD_MULTIPLE • The transition to full threading support is now mostly complete § We also worked on performance (fine-grained locks) § Supports all threading models from a single library • All atomic accesses are protected § Allows asynchronous progress • Complete redesign of request completion § Everything goes through requests (pt2pt, collectives, RMA, I/O, …) § Threads do not compete for resources; instead they collaboratively progress • A thread will wait until all expected requests have completed or an error has been raised • Fewer synchronizations, less overhead (better latency and injection rate)
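  A minimal sketch of the usage pattern this work targets: several threads per process each posting and completing their own requests concurrently. The two-rank setup and per-thread tags are illustrative assumptions; run with exactly two ranks:

```c
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

/* Each thread exchanges one message with the peer rank, using the
   thread id as the tag to pair up the threads across ranks. */
static void *worker(void *arg)
{
    int tid = (int)(long)arg;
    int rank, peer, sendval, recvval;
    MPI_Request reqs[2];

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = (rank == 0) ? 1 : 0;           /* assumes exactly 2 ranks */
    sendval = rank * 100 + tid;

    MPI_Irecv(&recvval, 1, MPI_INT, peer, tid, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&sendval, 1, MPI_INT, peer, tid, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    return NULL;
}

int main(int argc, char **argv)
{
    int provided, i;
    pthread_t threads[NTHREADS];

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)(long)i);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);

    MPI_Finalize();
    return 0;
}
```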
  • 47. MPI_THREAD_MULTIPLE • Messages per second injection rate (bigger is better) • Each process bound to a NUMA node § 4 threads per process § Distributed environment TCP (ipoib)
  • 48. MPI_THREAD_MULTIPLE • Messages per second injection rate (bigger is better) • Each process bound to a NUMA node § 4 threads per process § Distributed environment TCP (ipoib) § Vader (shared memory) • All BTLs show similar results
  • 49. Asynchronous progress • The BTLs can either have their own progress thread (such as TCP and usnic) or take advantage of the async progress provided by OPAL § Adaptive frequency: varies with the expected load (posted or ongoing data movements) (Charts: shared memory and TCP over IPoIB)
  • 50. Impact on applications • MADNESS - Multiresolution Adaptive Numerical Environment for Scientific Simulation
  • 51. Network support: thread-support status (it’s complicated! Only listing components that are in release branches)
Framework  Name      Owner         Status       Thread support
BTL        OpenIB    Chelsio       maintenance  Done
BTL        Portals4  SNL           maintenance  Not done
BTL        Scif      LANL          maintenance  Not done
BTL        self      UTK           active       Done
BTL        sm        UTK           active       Done
BTL        smcuda    NVIDIA/UTK    active       Done
BTL        tcp       UTK           active       Done
BTL        uGNI      LANL          active       Done
BTL        usnic     Cisco         active       Done
BTL        vader     LANL          active       Done
PML        Yalla     Mellanox      active       In progress
PML        UCX       Mellanox/UTK  active       Done
MTL        MXM       Mellanox      active       In progress
MTL        OFI       Intel         active       In progress
MTL        Portals4  SNL           active       In progress
MTL        PSM       Intel         active       In progress
MTL        PSM2      Intel         active       In progress
  • 52. Collective Communications • Architecture aware: reshape the tuned component to account for process placement and node architecture • Classical two-level (inter- and intra-node) decision on the composition of collective algorithms § Pipelining possible, but no other algorithmic composition (Chart: Bcast, 32 procs, 4x8 cores, OpenIB)
  • 53. Collective Communications § Async collectives start as soon as the process learns that a collective is progressing on a communicator (somewhat similar to unexpected collectives) § The algorithm automatically adapts to network conditions § Resistant to system noise • Dataflow collectives: different algorithms compose naturally (using a dynamic granularity for the pipelining fragments) (Chart: Ibcast overlap, 16 nodes)
  • 54. Collective Communications • Dataflow collectives: different algorithms compose naturally (using a dynamic granularity for the pipelining fragments) § The algorithm automatically adapts to network conditions § Resistant to system noise (Chart: slowdown of broadcast with noise injected on one node)
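  A minimal sketch of the overlap these non-blocking collectives are designed to exploit: post MPI_Ibcast, perform unrelated local work, then complete the collective. The buffer size and the compute placeholder are illustrative assumptions:

```c
#include <mpi.h>

/* Placeholder for application work that does not depend on the
   broadcast payload (illustrative only). */
static double local_work(int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += (double)i * 0.5;
    return s;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double buf[1 << 16];
    int rank;
    MPI_Request req;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        for (int i = 0; i < (1 << 16); i++) buf[i] = (double)i;

    /* Start the broadcast, overlap it with independent computation,
       then wait for completion. */
    MPI_Ibcast(buf, 1 << 16, MPI_DOUBLE, 0, MPI_COMM_WORLD, &req);
    double s = local_work(1000000);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    (void)s;
    MPI_Finalize();
    return 0;
}
```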
  • 55. CUDA Support • Multi-level coordination protocol based on the location of the source and destination memory § Support for GPUDirect • Delocalize part of the datatype engine into the GPU § Driven by the CPU § Provides a different datatype representation (avoids branching in the code) • Deeply integrated support for OpenIB and shared memory § BTL-independent support available for everything else (Chart: ping-pong benchmark with matrix in shared memory, single node, intra-GPU; Ivy Bridge E5-2690 v2 @ 3.00GHz, 2 sockets x 10 cores, 4 K40/node; compared against MVAPICH 2.2-GDR)
  • 56. CUDA Support (same design points as the previous slide) (Chart: ping-pong benchmark with matrix in shared memory, single node, inter-GPU; same test system, compared against MVAPICH 2.2-GDR)
  • 57. CUDA Support (same design points as slide 55) (Chart: ping-pong benchmark with matrix over InfiniBand, distributed, intra-GPU; same test system, compared against MVAPICH 2.2-GDR)
  • 58. CUDA Support • Architecture-aware collective support § Dataflow algorithm § Node-level algorithm takes advantage of the bidirectional bandwidth of the PCI bus (Charts: Bcast and Reduce on 6 nodes, 4 GPUs/node, latency vs. message size, OMPI vs. MVAPICH 2.2-GDR; Ivy Bridge E5-2690 v2 @ 3.00GHz, 2 sockets x 10 cores, 4 K40/node)
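  To illustrate the user-facing side: with a CUDA-aware Open MPI build, device pointers can be handed directly to MPI calls. A minimal two-rank sketch; the buffer size and the kernel-free initialization are illustrative assumptions:

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    const int N = 1 << 20;
    int rank;
    double *dbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Device allocation; with a CUDA-aware build the library moves the
       data without an explicit cudaMemcpy staging step in the application. */
    cudaMalloc((void **)&dbuf, N * sizeof(double));

    if (rank == 0) {
        cudaMemset(dbuf, 0, N * sizeof(double));
        MPI_Send(dbuf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(dbuf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}
```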
  • 59. User Level Failure Mitigation § Open MPI ULFM implementation updated in sync with Open MPI 2.x § Scalable fault-tolerant algorithms demonstrated in practice for revoke, agreement, and failure detection (SC’14, EuroMPI’15, SC’15, SC’16) § Fault-tolerant agreement costs approximately 2x an Allreduce (ERA, early-returning agreement) § Point-to-point performance unchanged with FT enabled (Charts: ERA agreement performance vs. tree topology; shared-memory ping-pong bandwidth and latency, Open MPI vs. Open MPI with FT)
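  A hedged sketch of the ULFM usage pattern, assuming an Open MPI build that exposes the ULFM extensions through <mpi-ext.h>; the MPIX_* names below follow the ULFM proposal and are not part of standard MPI:

```c
#include <mpi.h>
#include <mpi-ext.h>   /* ULFM extensions (assumed available in this build) */

int main(int argc, char **argv)
{
    MPI_Comm world = MPI_COMM_NULL, shrunk = MPI_COMM_NULL;
    int rc, flag = 1;

    MPI_Init(&argc, &argv);
    MPI_Comm_dup(MPI_COMM_WORLD, &world);

    /* Return errors instead of aborting so failures can be handled. */
    MPI_Comm_set_errhandler(world, MPI_ERRORS_RETURN);

    rc = MPI_Barrier(world);
    if (rc != MPI_SUCCESS) {
        /* A peer failed: propagate the failure, agree on the outcome,
           and rebuild a working communicator from the survivors. */
        MPIX_Comm_revoke(world);
        MPIX_Comm_agree(world, &flag);
        MPIX_Comm_shrink(world, &shrunk);
        MPI_Comm_free(&world);
        world = shrunk;
    }

    MPI_Comm_free(&world);
    MPI_Finalize();
    return 0;
}
```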
  • 60. Open MPI and Fujitsu Fujitsu Limited
  • 61. Fujitsu MPI with Open MPI Community • Fujitsu MPI is based on Open MPI § Running on the K computer and its commercial/successor machines for over 5 years § For Post-K, also Open MPI based • Collaboration plan with the OMPI community § ARM support § PMIx integration § Reduced memory footprint § New MPI version support § Thread parallelization extension, etc.
  • 62. Fujitsu Contribution Examples • Thread parallelization in the MPI library • Statistical information for application tuning (Chart: speed-up ratio of MPI_Pack compared to a single thread, as a function of the number of blocks and block size) See https://github.com/open-mpi/ompi/wiki/Meeting-2016-02-attachments/Fujitsu-OMPI-Dev-Meeting-2016-Feb.pdf
  • 63. RMA Update Nathan Hjelm Los Alamos National Laboratory
  • 64. v2.x osc/pt2pt • Fully supports MPI-3.1 RMA • Full support for MPI datatypes • Emulates one-sided operation using point-to-point components (PML) for communication • Improved lock-all scaling • Improved support for MPI_THREAD_MULTIPLE • Caveat: asynchronous progress support lacking • Targets must enter MPI to progress any one-sided data movement! • Doesn’t really support passive-target
  • 65. v2.x osc/rdma • Fully supports MPI-3.1 RMA • Full support for MPI datatypes • Fully supports passive target RMA operations • Uses network RMA and atomic operation support through Byte Transport Layer (BTL) • Supports Infiniband, Infinipath/Omnipath**, and Cray Gemini/Aries • Additional networks can be supported • Requirements: put, get, fetch-and-add, and compare-and-swap
  • 66. v2.x osc/rdma • Improved support for MPI_THREAD_MULTIPLE • Improved memory scaling • Support for hardware atomics • Supports MPI_Fetch_and_op and MPI_Compare_and_swap • Supports 32- and 64-bit integer and floating-point values • Accelerated MPI_Ops vary by hardware • Set the osc_rdma_acc_single_intrinsic MCA variable to true to enable
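  A minimal passive-target sketch of the kind of operation the accelerated path covers: every rank atomically increments a 64-bit counter on rank 0 with MPI_Fetch_and_op. The mpirun line in the comment uses the MCA variable named above:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* e.g.  mpirun --mca osc_rdma_acc_single_intrinsic true -np 8 ./a.out */
    MPI_Win win;
    long long *base, one = 1, prev = 0;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Win_allocate(sizeof(long long), sizeof(long long),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);
    *base = 0;
    MPI_Barrier(MPI_COMM_WORLD);

    /* Passive-target epoch: atomic fetch-and-add on rank 0's window. */
    MPI_Win_lock_all(0, win);
    MPI_Fetch_and_op(&one, &prev, MPI_LONG_LONG, 0, 0, MPI_SUM, win);
    MPI_Win_flush(0, win);
    MPI_Win_unlock_all(win);

    printf("rank %d saw counter value %lld\n", rank, prev);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```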
  • 67. v2.x RMA Performance • IMB Truly Passive Put on Cray XC-40 (Chart: overlap % vs. message size, osc/pt2pt vs. osc/rdma)
  • 68. v2.x RMA Performance • Contended MPI_Fetch_and_op performance (Chart: latency vs. number of MPI processes on XC40, osc/pt2pt vs. osc/rdma with LOCK, AMO, and CAS variants)
  • 69. v2.x RMA Performance • OSU put latency with MPI_Win_flush on Cray XC-40 (Chart: latency vs. message size, osc/pt2pt vs. osc/rdma, both with flush)
  • 71. IBM Spectrum MPI • IBM Spectrum MPI is a pre-built, pre-packaged version of Open MPI plus IBM value-add components • Supports both PowerPC and x86 • Includes many of the popular Open MPI components, selectable at runtime § E.g. MXM, usNIC, Omni-Path • Spectrum MPI 10.1.0.2 to be released in December 2016 § First release (Spectrum MPI 10.1) was July 2016 • Part of the IBM Spectrum HPC software stack • Evaluation downloads for Spectrum MPI: https://www-01.ibm.com/marketing/iwm/iwm/web/reg/pick.do?source=swerpsysz-lsf-3&S_PKG=mpi101&S_TACT=000002PW&lang=en_US
  • 72. IBM Value Add Components • PAMI PML and OSC § Improved point-to-point and one-sided performance for Mellanox InfiniBand § Includes support for NVIDIA GPU buffers § Additional performance optimizations • IBM Collective Library § Significant performance improvements for blocking and non-blocking collectives over OMPI collectives § Dynamic collective algorithm selection • GPU support § CUDA-aware support for NVIDIA GPU cards § Adding GPU RDMA Direct / Async in a future release
  • 73. IBM Testing and Support • Extensive level of testing for IBM releases § Standard Open MPI release testing… § …Plus Platform MPI test suites § …Plus HPC stack integration testing • IBM Customer Support § For customers running a licensed copy of IBM Spectrum MPI § IBM will work with customers and partners to resolve issues in non IBM-owned components
  • 74. Community support • Actively participating in the Open MPI community • MTT and Jenkins testing on IBM PowerPC servers • New features that will be contributed back § Improved LSF support § -aff: easy-to-use affinity controls § -prot: protocol connectivity report § -entry: dynamic layering of multiple PMPI libraries § …and more… • PMIx improvements (go to their BOF…) § Focus on CORAL-sized clusters • Bug fixes, bug fixes, bug fixes…
  • 76. The Power of Community Compels Us • Engaged in multiple open source efforts enabling exascale MPI and PGAS applications § UCX • Open source • Strong vendor ecosystem • Near bare-metal performance across a range of fabrics • InfiniBand, uGNI, RoCE, shared memory § PMIx (PMI eXascale) • Open source • Exascale job launch • Supported by SLURM, LSF, PBS
  • 77. Exascale Enabled Out-of-the-Box • UCX § UCX PML starting from v1.10 (MPI) § UCX SPML starting from v1.10 (OSHMEM) § Support for advanced PMIx features • Direct modex • Non-blocking fence • Eliminate the barrier in initialization • OSHMEM § Open MPI v2.1.0 is (will be) OSHMEM v1.3 compliant!
  • 78. OMPI-UCX Performance Data • Mellanox system § Switch: SX6518 § InfiniBand ConnectX-4 § CPU: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz § Red Hat Enterprise Linux Server release 7.2 § Open MPI/SHMEM v2.0.2a1 § UCX version: UCT 1.0.2129, configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/hpc/local/benchmarks/hpcx_install_Wednesday/hpcx-icc-redhat7.2/ucx --with-knem=/hpc/local/benchmarks/hpcx_install_Wednesday/hpcx-icc-redhat7.2/knem § Benchmark: OSU v5.1
  • 79. MPI UCX PML Point-to-Point Latency
  • 80. MPI UCX PML Point-to-Point Bandwidth
  • 81. MPI UCX PML Point-to-Point Message Rate
  • 82. OSHMEM + UCX SPML Atomic Rates • OSU OpenSHMEM atomic operation rate, ConnectX-4 HCA (millions of atomic operations per second)
Operation              RC Transport  DC Transport
shmem_int_fadd         14.03         15.6
shmem_int_finc         23.03         22.55
shmem_int_add          87.73         115.75
shmem_int_inc          81.13         122.92
shmem_int_cswap        23.14         22.74
shmem_int_swap         23.17         22.26
shmem_longlong_fadd    23.24         22.87
shmem_longlong_finc    23.15         22.83
shmem_longlong_add     80.08         91.22
shmem_longlong_inc     76.13         95.61
shmem_longlong_cswap   15.18         22.7
shmem_longlong_swap    22.79         22.84
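  For reference, a minimal sketch of how the fetch-and-add atomic measured above is exercised from OpenSHMEM 1.3 C code; the symmetric counter and the choice of PE 0 as the target are illustrative:

```c
#include <shmem.h>
#include <stdio.h>

/* Symmetric variable: same address on every PE. */
static int counter = 0;

int main(void)
{
    shmem_init();
    int me = shmem_my_pe();

    /* Every PE atomically fetch-and-adds 1 into the counter on PE 0
       (OpenSHMEM 1.3 interface; renamed shmem_int_atomic_fetch_add later). */
    int prev = shmem_int_fadd(&counter, 1, 0);
    (void)prev;

    shmem_barrier_all();
    if (me == 0)
        printf("final counter on PE 0: %d\n", counter);

    shmem_finalize();
    return 0;
}
```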
  • 83. OSHMEM + UCX SPML One Sided Message Rate (osu_oshm_put_mr)
  • 84. Mapping, Ranking, Binding: Oh My! Ralph H Castain Intel, Inc.
  • 85. Why bind? • In Open MPI early days § No default binding § Simple bind-to-core, bind-to-socket options • Motivators § Competitor “out-of-the-box” comparisons • Bound by default § Researchers • Fine-grained positioning, binding § More complex chips
  • 86. Terminology • Slots § How many processes allowed on a node • Oversubscribe: #processes > #slots § Nothing to do with hardware! • System admins often configure a linkage • Processing element (“PE”) § Smallest atomistic processor § Frequently core or HT (tile?) § Overload: more than one process bound to PE
  • 87. Three Phases • Mapping § Assign each process to a location § Determines which processes run on what nodes § Detects oversubscription • Ranking § Assigns MPI_COMM_WORLD rank to each process • Binding § Binds processes to specific processing elements § Detects overload
  • 88. Examples: notation (legend for the following diagrams: one core, one chip package, one hyperthread)
  • 89-93. Mapping (diagrams): two nodes, slots = 3/node, #procs = 5. Map-by options shown: slot, node, NUMA, socket, cache (L1/2/3), core, hwthread; modifiers: oversubscribe / no-oversubscribe, span, PE. Successive diagrams show the placements of processes A-E produced by different map-by choices.
  • 94-97. Ranking (diagrams): same setup (two nodes, slots = 3/node, #procs = 5, map-by socket:span). Diagrams show how MPI_COMM_WORLD ranks 0-4 are assigned to the mapped processes under the default ranking and the span / fill options.
  • 98-99. Binding (diagrams): map-by socket:span, rank-by socket:span; diagrams show ranks 0-4 bound to processing elements (NUMA, socket, cache, core, hwthread levels) on the two nodes.
  • 100. Defaults (two nodes, slots = 3/node) • #procs <= 2: map-by core, rank-by slot, bind-to core (ranks 0 and 1) • #procs > 2: map-by slot, rank-by slot, bind-to NUMA (ranks 0,1,2 on node 1; ranks 3,4 on node 2)
  • 101-104. Mapping: PE option (diagrams): two nodes, slots = 3/node, #procs = 3. With PE=2, rank-by slot implies bind-to core; with hwthreads-as-cpus, PE=2 or PE=3 combined with rank-by slot implies bind-to hwthread.
  • 105. Conclusion • bind-to defaults § map-by specified → bind to that level § map-by not specified • np <= 2: bind-to core • np > 2: bind-to NUMA § PE > 1 → bind to PE • Map, rank, bind § Separate dimensions § Considerable flexibility The right combination is highly application specific!
  • 107. Where do we need help? • Code § Any bug that bothers you § Any feature that you can add • User documentation • Testing • Usability • Release engineering
  • 108. Come join us! George Bosilca Yossi Itigin Nathan Hjelm Perry Schmidt Jeff Squyres Ralph Castain