SlideShare ist ein Scribd-Unternehmen logo
1 von 42
MPI Requirements
of the Network Layer
Presented to the OpenFabrics libfabric Working Group
January 28, 2014
Community feedback assembled
by Jeff Squyres, Cisco Systems
OpenFabrics
Open Frameworks
• Effort started / led by Sean Hefty, Intel
• Successor to the Linux verbs API
•
•
•
•

Fix some of the well-known problems from verbs
Match modern hardware / not be tied to InfiniBand
Serve MPI’s needs better
Serve other consumers better (e.g., PGAS)

• Not verbs 2.0: a whole new API

Slide 2
Many thanks to MPI contributors
(in no particular order)
•

ETZ Zurich

•

• Torsten Hoefler

•

•
•
•
•
•

Sandia National Labs
• Ron Brightwell
• Brian Barrett
• Ryan Grant

•

IBM
•
•
•
•

Chulho Kim
Carl Obert
Michael Blocksome
Perry Schmidt

Cisco Systems

•

Jeff Squyres
Dave Goodell
Reese Faucette
Cesare Cantu
Upinder Malhi

Oak Ridge National Labs
• Scott Atchley
• Pavel Shamis

•

Argonne National Labs
• Jeff Hammond

Slide 3
Many thanks to MPI contributors
(in no particular order)
• Intel
• Sayantan Sur
• Charles Archer

• AMD
• Brad Benton

• Microsoft

• Cray
• Krishna Kandalla

• Fab Tillier

• U. Edinburgh / EPCC

• Mellanox
• Devendar Bureddy

• Dan Holmes

• U. Alabama Birmingham

• SGI

• Tony Skjellum
• Amin Hassani
• Shane Farmer

• Michael Raymond

Slide 4
Quick MPI overview
• High-level abstraction API
• No concept of a connection

• All communication:
• Is reliable
• Has some ordering rules
• Is comprised of typed messages

• Peer address is (communicator, integer) tuple
• I.e., virtualized
• Specifies a process, not a server / network
endpoint
Slide 5
Quick MPI overview
• Communication modes
• Blocking and non-blocking (polled completion)
• Point-to-point: two-sided and one-sided
• Collective operations:
broadcast, scatter, reduce, …etc.
• …and others, but those are the big ones

• Async. progression is required/strongly desired

• Message buffers are provided by the application
• They are not “special” (e.g., registered)
Slide 6
Quick MPI overview
• MPI specification
• Governed by the MPI Forum standards body
• Currently at MPI-3.0

• MPI implementations
• Software + hardware implementation of the spec
• Some are open source, some are closed source
• Generally don’t care about interoperability (e.g., wire
protocols)

Slide 7
MPI is a large community
• Community feedback represents union of:
• Different viewpoints
• Different MPI implementations
• Different hardware perspectives

• …and not all agree with each other
• For example…

Slide 8
Different MPI camps
Those who want
high level interfaces

Those who want
low level interfaces

• Do not want to see
memory registration

• Want to have good
memory registration
infrastructure

• Want tag matching

• Want direct access to
hardware capabilities

• E.g., PSM
• Trust the network layer to
do everything well under
the covers

• Want to fully implement
MPI interfaces themselves
• Or, the MPI implementers
are the kernel / firmware
/hardware developers
Slide 9
Be careful what you ask for…
• …because you just got it

• Members of the MPI
Forum would like to be
involved in the libfabric
design on an ongoing
basis
• Can we get an MPI
libfabric listserv?

Slide
10
Basic things MPI needs
• Messages (not streams)
• Efficient API
• Allow for low latency / high bandwidth
• Low number of instructions in the critical path
• Enable “zero copy”

• Separation of local action initiation and completion
• One-sided (including atomics) and two-sided
semantics
• No requirement for communication buffer alignment
Slide
11
Basic things MPI needs
• Asynchronous progress
independent of API calls
• Including asynchronous
progress from multiple
consumers (e.g., MPI and
PGAS in the same
process)
• Preferably via dedicated
hardware
Progress
of these

Slide
12

Also causes
progress
of these
Basic things MPI needs
• Scalable communications with millions of peers
• With both one-sided and two-sided semantics
• Think of MPI as a fully-connected model
(even though it usually isn’t implemented that way)

• Today, runs with 3 million MPI processes in a job

Slide
13
Things MPI likes in verbs
• (all the basic needs from previous slide)
• Different modes of communication
• Reliable vs. unreliable
• Scalable connectionless communications (i.e., UD)

• Specify peer read/write address (i.e., RDMA)
• RDMA write with immediate (*)
• …but we want more (more on this later)

Slide
14
Things MPI likes in verbs
• Ability to re-use (short/inline) buffers immediately
• Polling and OS-native/fd-based blocking QP
modes
• Discover devices, ports, and their capabilities (*)
• …but let’s not tie this to a specific hardware model

• Scatter / gather lists for sends

• Atomic operations (*)
• …but we want more (more on this later)
Slide
15
Things MPI likes in verbs
• Can have multiple
consumers in a single
process
• API handles are
independent of each other

Slide
16
Things MPI likes in verbs
• Verbs does not:
• Require collective initialization across multiple
processes
• Require peers to have the same process image
• Restrict completion order vs. delivery order
• Restrict source/target address region
(stack, data, heap)
• Require a specific wire protocol (*)
• …but it does impose limitations, e.g., 40-byte GRH UD
header
Slide
17
Things MPI likes in verbs
• Ability to connect to “unrelated” peers
• Cannot access peer (memory) without permission
• Ability to block while waiting for completion
• ...assumedly without consuming host CPU cycles

• Cleans up everything upon process termination
• E.g., kernel and hardware resources are released

Slide
18
Other things MPI wants
(described as verbs
improvements)
• MTU is an int (not an enum)
• Specify timeouts to connection requests
• …or have a CM that completes connections
asynchronously

• All operations need to be non-blocking, including:
• Address handle creation
• Communication setup / teardown
• Memory registration / deregistration
Slide
19
Other things MPI wants
(described as verbs
improvements)
• Specify buffer/length as function parameters
• Specified as struct requires extra memory accesses
• …more on this later

• Ability to query how many credits currently
available in a QP
• To support actions that consume more than one
credit

• Remove concept of “queue pair”
• Have standalone send channels and receive
channels
Slide
20
Other things MPI wants
(described as verbs
improvements)
• Completion at target for an RDMA write
• Have ability to query if loopback communication is
supported
• Clearly delineate what functionality must be
supported vs. what is optional
• Example: MPI provides (almost) the same
functionality everywhere, regardless of hardware /
platform
• Verbs functionality is wildly different for each provider
Slide
21
Other things MPI wants
(described as verbs
improvements)
• Better ability to determine causes of errors
• In verbs:
• Different providers have different (proprietary)
interpretations of various error codes
• Difficult to find out why ibv_post_send() or
ibv_poll_cq() failed, for example

• Perhaps a better strerr() type of functionality (that
can also obtain provider-specific strings)?

Slide
22
Other things MPI wants:
Standardized high-level interfaces
• Examples:
•
•
•
•
•

Tag matching
MPI collective operations (TBD)
Remote atomic operations
…etc.
The MPI community wants input in the design of these
interfaces

• Divided opinions from MPI community:
• Providers must support these interfaces, even if
emulated
Slide
• Run-time query to see which interfaces are supported
23
Other things MPI wants:
Vendor-specific interfaces
• Direct access to vendor-specific features
• Lowest-common denominator API is not always
enough
• Allow all providers to extend all parts of the API

• Implies:
• Robust API to query what devices and providers are
available at run-time (and their various versions, etc.)
• Compile-time conventions and protections to allow
for safe non-portable codes

• This is a radical difference from verbs
Slide
24
…and much more
• Don’t have the time to go through all of this in a
slidecast
• The rest of the slides are attached, for those
interested

• The collaboration is ongoing
• This is a long process
• But both exciting and promising

Slide
25
Core libfabric functionality

Direct function
calls to libfabric

Slide
26
Example options for direct access
to vendor-specific functionality

Example 1:
Access to
provider A
extensions
without going
through libfabric
core

Slide
27
Example options for direct access
to vendor-specific functionality

Example 2:
Access to provider B
extensions via “pass
through” functionality
in libfabric

Slide
28
Other things MPI wants:
Regarding memory registration
• Run-time query: is memory registration is
necessary?
• I.e., explicit or implicit memory registration

• If explicit
• Need robust notification of involuntary memory deregistration (e.g., munmap)

• If the cost of de/registration were “free”, much of
this debate would go away 

Slide
29
Other things MPI wants:
Regarding fork() behavior
• In child:
• All memory is accessible (no side effects)
• Network handles are stale / unusable
• Can re-initialize network API (i.e., get new handles)

• In parent:
• All memory is accessible
• Network layer is still fully usable
• Independent of child process effects

Slide
30
Other things MPI wants
• If network header knowledge is required:
• Provide a run-time query
• Do not mandate a specific network header
• E.g., incoming verbs datagrams require a GRH
header

• Request ordered vs. unordered delivery
• Potentially by traffic type (e.g., send/receive vs.
RDMA)

• Completions on both sides of a remote write
Slide
31
Other things MPI wants
• Allow listeners to request a specific network
address
• Similar to TCP sockets asking for a specific port

• Allow receiver providers to consume buffering
directly related to the size of incoming messages
• Example: “slab” buffering schemes

Slide
32
Other things MPI wants
• Generic completion types. Example:
• Aggregate completions
• Vendor-specific events

• Out-of-band messaging

Slide
33
Other things MPI wants
• Noncontiguous sends, receives, and RDMA opns.
• Page size irrelevance
• Send / receive from memory, regardless of page size

• Access to underlying performance counters
• For MPI implementers and MPI-3 “MPI_T” tools

• Set / get network quality of service

Slide
34
Other things MPI wants:
More atomic operations
• Datatypes (minimum): int64_t, uint64_t, int32_t, uint32_t
• Would be great: all C types (to include double complex)
• Would be ok: all <stdint.h> types
• Don’t require more than natural C alignment

• Operations (minimum)
• accumulate, fetch-and-accumulate, swap, compare-and-swap

• Accumulate operators (minimum)
• add, subtract, or, xor, and, min, max

• Run-time query: are these atomics coherent with the host?
• If support both, have ability to request one or the other
Slide
35
Other things MPI wants:
MPI RMA requirements
• Offset-based communication (not address-based)
• Performance improvement: potentially reduces
cache misses associated with offset-to-address
lookup

• Programmatic support to discover if VA based
RMA performs worse/better than offset based
• Both models could be available in the API
• But not required to be supported simultaneously

• Aggregate completions for MPI Put/Get operations
• Per endpoint
• Per memory region

Slide
36
Other things MPI wants:
MPI RMA requirements
• Ability to specify remote keys when registering
• Improves MPI collective memory window allocation
scalability

• Ability to specify arbitrary-sized atomic ops
• Run-time query supported size

• Ability to specify/query ordering and ordering limits
of atomics
• Ordering mode: rar, raw, war and waw
• Example: “rar” – reads after reads are ordered
Slide
37
“New,” but becoming important
• Network topology discovery and awareness
• …but this is (somewhat) a New Thing
• Not much commonality across MPI implementations

• Would be nice to see some aspect of libfabric
provide fabric topology and other/meta information
• Need read-only access for regular users

Slide
38
API design considerations
• With no tag matching, MPI frequently sends /
receives two buffers
• (header + payload)
• Optimize for that

• MPI sometimes needs thread safety, sometimes
not
• May need both in a single process

• Support for checkpoint/restart is desirable
• Make it safe to close stale handles, reclaim
resources
Slide
39
API design considerations
• Do not assume:
•
•
•
•
•

Max size of any transfer (e.g., inline)
The memory translation unit is in network hardware
All communication buffers are in main RAM
Onload / offload, but allow for both
API handles refer to unique hardware resources

• Be “as reliable as sockets” (e.g., if a peer
disappears)
• Have well-defined failure semantics
• Have ability to reclaim resources on failure
Slide
40
Conclusions
• Many different requirements
• High-level, low-level, and vendor-specific interfaces

• The MPI community would like to continue to
collaborate
• Tag matching is well-understood, but agreeing on a
common set of interfaces for them will take work
• Creating other high-level MPI-friendly interfaces
(e.g., for collectives) will take additional work

Slide
41
Thank you!

Weitere ähnliche Inhalte

Was ist angesagt?

Java Presentation
Java PresentationJava Presentation
Java PresentationAmr Salah
 
Calling The Notes C Api From Lotus Script
Calling The Notes C Api From Lotus ScriptCalling The Notes C Api From Lotus Script
Calling The Notes C Api From Lotus Scriptdominion
 
.Net programming with C#
.Net programming with C#.Net programming with C#
.Net programming with C#NguynSang29
 
Chatbot Tutorial - Create your first bot with Xatkit
Chatbot Tutorial - Create your first bot with Xatkit Chatbot Tutorial - Create your first bot with Xatkit
Chatbot Tutorial - Create your first bot with Xatkit Jordi Cabot
 
.NET MeetUp Brno - Challenges of Managing CoreFX repo -- Karel Zikmund
.NET MeetUp Brno - Challenges of Managing CoreFX repo -- Karel Zikmund.NET MeetUp Brno - Challenges of Managing CoreFX repo -- Karel Zikmund
.NET MeetUp Brno - Challenges of Managing CoreFX repo -- Karel ZikmundKarel Zikmund
 
Programming Languages #devcon2013
Programming Languages #devcon2013Programming Languages #devcon2013
Programming Languages #devcon2013Iván Montes
 
Building apps with common code for windows 8 and windows phone 8 (WP8)
Building apps with common code for windows 8 and windows phone 8 (WP8)Building apps with common code for windows 8 and windows phone 8 (WP8)
Building apps with common code for windows 8 and windows phone 8 (WP8)Tamir Dresher
 
3978 Why is Java so different... A Session for Cobol/PLI/Assembler Developers
3978   Why is Java so different... A Session for Cobol/PLI/Assembler Developers3978   Why is Java so different... A Session for Cobol/PLI/Assembler Developers
3978 Why is Java so different... A Session for Cobol/PLI/Assembler Developersnick_garrod
 
Dd13.2013.milano.open ntf
Dd13.2013.milano.open ntfDd13.2013.milano.open ntf
Dd13.2013.milano.open ntfUlrich Krause
 
BP207 - Meet the Java Application Server You Already Own – IBM Domino
BP207 - Meet the Java Application Server You Already Own – IBM DominoBP207 - Meet the Java Application Server You Already Own – IBM Domino
BP207 - Meet the Java Application Server You Already Own – IBM DominoSerdar Basegmez
 
Calling all modularity solutions
Calling all modularity solutionsCalling all modularity solutions
Calling all modularity solutionsSangjin Lee
 
HFM, Workspace, and FDM – Voiding your warranty
HFM, Workspace, and FDM – Voiding your warrantyHFM, Workspace, and FDM – Voiding your warranty
HFM, Workspace, and FDM – Voiding your warrantyCharles Beyer
 

Was ist angesagt? (17)

Java Presentation
Java PresentationJava Presentation
Java Presentation
 
Calling The Notes C Api From Lotus Script
Calling The Notes C Api From Lotus ScriptCalling The Notes C Api From Lotus Script
Calling The Notes C Api From Lotus Script
 
.Net programming with C#
.Net programming with C#.Net programming with C#
.Net programming with C#
 
.Net Standard 2.0
.Net Standard 2.0.Net Standard 2.0
.Net Standard 2.0
 
Chatbot Tutorial - Create your first bot with Xatkit
Chatbot Tutorial - Create your first bot with Xatkit Chatbot Tutorial - Create your first bot with Xatkit
Chatbot Tutorial - Create your first bot with Xatkit
 
P1 2018 python
P1 2018 pythonP1 2018 python
P1 2018 python
 
.NET MeetUp Brno - Challenges of Managing CoreFX repo -- Karel Zikmund
.NET MeetUp Brno - Challenges of Managing CoreFX repo -- Karel Zikmund.NET MeetUp Brno - Challenges of Managing CoreFX repo -- Karel Zikmund
.NET MeetUp Brno - Challenges of Managing CoreFX repo -- Karel Zikmund
 
Programming Languages #devcon2013
Programming Languages #devcon2013Programming Languages #devcon2013
Programming Languages #devcon2013
 
Building apps with common code for windows 8 and windows phone 8 (WP8)
Building apps with common code for windows 8 and windows phone 8 (WP8)Building apps with common code for windows 8 and windows phone 8 (WP8)
Building apps with common code for windows 8 and windows phone 8 (WP8)
 
JAVA INTRODUCTION - 1
JAVA INTRODUCTION - 1JAVA INTRODUCTION - 1
JAVA INTRODUCTION - 1
 
3978 Why is Java so different... A Session for Cobol/PLI/Assembler Developers
3978   Why is Java so different... A Session for Cobol/PLI/Assembler Developers3978   Why is Java so different... A Session for Cobol/PLI/Assembler Developers
3978 Why is Java so different... A Session for Cobol/PLI/Assembler Developers
 
P1 2017 python
P1 2017 pythonP1 2017 python
P1 2017 python
 
Java
JavaJava
Java
 
Dd13.2013.milano.open ntf
Dd13.2013.milano.open ntfDd13.2013.milano.open ntf
Dd13.2013.milano.open ntf
 
BP207 - Meet the Java Application Server You Already Own – IBM Domino
BP207 - Meet the Java Application Server You Already Own – IBM DominoBP207 - Meet the Java Application Server You Already Own – IBM Domino
BP207 - Meet the Java Application Server You Already Own – IBM Domino
 
Calling all modularity solutions
Calling all modularity solutionsCalling all modularity solutions
Calling all modularity solutions
 
HFM, Workspace, and FDM – Voiding your warranty
HFM, Workspace, and FDM – Voiding your warrantyHFM, Workspace, and FDM – Voiding your warranty
HFM, Workspace, and FDM – Voiding your warranty
 

Ähnlich wie MPI Requirements of the Network Layer

CPSeis &amp; GeoCraft
CPSeis &amp; GeoCraftCPSeis &amp; GeoCraft
CPSeis &amp; GeoCraftbillmenger
 
Linux Distribution Collaboration …on a Mainframe!
Linux Distribution Collaboration …on a Mainframe!Linux Distribution Collaboration …on a Mainframe!
Linux Distribution Collaboration …on a Mainframe!All Things Open
 
Architectural Decisions: Smoothly and Consistently
Architectural Decisions: Smoothly and ConsistentlyArchitectural Decisions: Smoothly and Consistently
Architectural Decisions: Smoothly and ConsistentlyComsysto Reply GmbH
 
Architectural Decisions: Smoothly and Consistently
Architectural Decisions: Smoothly and ConsistentlyArchitectural Decisions: Smoothly and Consistently
Architectural Decisions: Smoothly and ConsistentlyComsysto Reply GmbH
 
Open MPI SC'15 State of the Union BOF
Open MPI SC'15 State of the Union BOFOpen MPI SC'15 State of the Union BOF
Open MPI SC'15 State of the Union BOFJeff Squyres
 
Software Architecture and Architectors: useless VS valuable
Software Architecture and Architectors: useless VS valuableSoftware Architecture and Architectors: useless VS valuable
Software Architecture and Architectors: useless VS valuableComsysto Reply GmbH
 
AD1545 - Extending the XPages Extension Library
AD1545 - Extending the XPages Extension LibraryAD1545 - Extending the XPages Extension Library
AD1545 - Extending the XPages Extension Librarypaidi_ed
 
An Open Source Workbench for Prototyping Multimodal Interactions Based on Off...
An Open Source Workbench for Prototyping Multimodal Interactions Based on Off...An Open Source Workbench for Prototyping Multimodal Interactions Based on Off...
An Open Source Workbench for Prototyping Multimodal Interactions Based on Off...Jean Vanderdonckt
 
Dev buchan leveraging the notes c api
Dev buchan leveraging the notes c apiDev buchan leveraging the notes c api
Dev buchan leveraging the notes c apiBill Buchan
 
Embedded Systems: Lecture 8: The Raspberry Pi as a Linux Box
Embedded Systems: Lecture 8: The Raspberry Pi as a Linux BoxEmbedded Systems: Lecture 8: The Raspberry Pi as a Linux Box
Embedded Systems: Lecture 8: The Raspberry Pi as a Linux BoxAhmed El-Arabawy
 
Why Plone Will Die
Why Plone Will DieWhy Plone Will Die
Why Plone Will DieAndreas Jung
 
State of Puppet - Puppet Camp Silicon Valley 2014
State of Puppet - Puppet Camp Silicon Valley 2014State of Puppet - Puppet Camp Silicon Valley 2014
State of Puppet - Puppet Camp Silicon Valley 2014Puppet
 
Reviewing CPAN modules
Reviewing CPAN modulesReviewing CPAN modules
Reviewing CPAN modulesneilbowers
 
Putting Compilers to Work
Putting Compilers to WorkPutting Compilers to Work
Putting Compilers to WorkSingleStore
 
#SPSToronto 2018 migrate you custom development to the SharePoint Framework
#SPSToronto 2018 migrate you custom development to the SharePoint Framework#SPSToronto 2018 migrate you custom development to the SharePoint Framework
#SPSToronto 2018 migrate you custom development to the SharePoint FrameworkVincent Biret
 
ch4-Software is Everywhere
ch4-Software is Everywherech4-Software is Everywhere
ch4-Software is Everywheressuser06ea42
 
10 clues showing that you are doing OSGi in the wrong manner - Jerome Moliere
10 clues showing that you are doing OSGi in the wrong manner - Jerome Moliere10 clues showing that you are doing OSGi in the wrong manner - Jerome Moliere
10 clues showing that you are doing OSGi in the wrong manner - Jerome Molieremfrancis
 
Realtime traffic analyser
Realtime traffic analyserRealtime traffic analyser
Realtime traffic analyserAlex Moskvin
 

Ähnlich wie MPI Requirements of the Network Layer (20)

CPSeis &amp; GeoCraft
CPSeis &amp; GeoCraftCPSeis &amp; GeoCraft
CPSeis &amp; GeoCraft
 
Linux Distribution Collaboration …on a Mainframe!
Linux Distribution Collaboration …on a Mainframe!Linux Distribution Collaboration …on a Mainframe!
Linux Distribution Collaboration …on a Mainframe!
 
Architectural Decisions: Smoothly and Consistently
Architectural Decisions: Smoothly and ConsistentlyArchitectural Decisions: Smoothly and Consistently
Architectural Decisions: Smoothly and Consistently
 
Architectural Decisions: Smoothly and Consistently
Architectural Decisions: Smoothly and ConsistentlyArchitectural Decisions: Smoothly and Consistently
Architectural Decisions: Smoothly and Consistently
 
Open MPI SC'15 State of the Union BOF
Open MPI SC'15 State of the Union BOFOpen MPI SC'15 State of the Union BOF
Open MPI SC'15 State of the Union BOF
 
Software Architecture and Architectors: useless VS valuable
Software Architecture and Architectors: useless VS valuableSoftware Architecture and Architectors: useless VS valuable
Software Architecture and Architectors: useless VS valuable
 
AD1545 - Extending the XPages Extension Library
AD1545 - Extending the XPages Extension LibraryAD1545 - Extending the XPages Extension Library
AD1545 - Extending the XPages Extension Library
 
An Open Source Workbench for Prototyping Multimodal Interactions Based on Off...
An Open Source Workbench for Prototyping Multimodal Interactions Based on Off...An Open Source Workbench for Prototyping Multimodal Interactions Based on Off...
An Open Source Workbench for Prototyping Multimodal Interactions Based on Off...
 
Dev buchan leveraging the notes c api
Dev buchan leveraging the notes c apiDev buchan leveraging the notes c api
Dev buchan leveraging the notes c api
 
Embedded Systems: Lecture 8: The Raspberry Pi as a Linux Box
Embedded Systems: Lecture 8: The Raspberry Pi as a Linux BoxEmbedded Systems: Lecture 8: The Raspberry Pi as a Linux Box
Embedded Systems: Lecture 8: The Raspberry Pi as a Linux Box
 
Why Plone Will Die
Why Plone Will DieWhy Plone Will Die
Why Plone Will Die
 
State of Puppet - Puppet Camp Silicon Valley 2014
State of Puppet - Puppet Camp Silicon Valley 2014State of Puppet - Puppet Camp Silicon Valley 2014
State of Puppet - Puppet Camp Silicon Valley 2014
 
Reviewing CPAN modules
Reviewing CPAN modulesReviewing CPAN modules
Reviewing CPAN modules
 
Putting Compilers to Work
Putting Compilers to WorkPutting Compilers to Work
Putting Compilers to Work
 
Stackato
StackatoStackato
Stackato
 
#SPSToronto 2018 migrate you custom development to the SharePoint Framework
#SPSToronto 2018 migrate you custom development to the SharePoint Framework#SPSToronto 2018 migrate you custom development to the SharePoint Framework
#SPSToronto 2018 migrate you custom development to the SharePoint Framework
 
ch4-Software is Everywhere
ch4-Software is Everywherech4-Software is Everywhere
ch4-Software is Everywhere
 
10 clues showing that you are doing OSGi in the wrong manner - Jerome Moliere
10 clues showing that you are doing OSGi in the wrong manner - Jerome Moliere10 clues showing that you are doing OSGi in the wrong manner - Jerome Moliere
10 clues showing that you are doing OSGi in the wrong manner - Jerome Moliere
 
Web development post io2016
Web development post io2016Web development post io2016
Web development post io2016
 
Realtime traffic analyser
Realtime traffic analyserRealtime traffic analyser
Realtime traffic analyser
 

Mehr von inside-BigData.com

Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...inside-BigData.com
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networksinside-BigData.com
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...inside-BigData.com
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...inside-BigData.com
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...inside-BigData.com
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networksinside-BigData.com
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoringinside-BigData.com
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecastsinside-BigData.com
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Updateinside-BigData.com
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19inside-BigData.com
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuninginside-BigData.com
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODinside-BigData.com
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Accelerationinside-BigData.com
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficientlyinside-BigData.com
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Erainside-BigData.com
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computinginside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Clusterinside-BigData.com
 

Mehr von inside-BigData.com (20)

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 

Kürzlich hochgeladen

Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 

Kürzlich hochgeladen (20)

Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 

MPI Requirements of the Network Layer

  • 1. MPI Requirements of the Network Layer Presented to the OpenFabrics libfabric Working Group January 28, 2014 Community feedback assembled by Jeff Squyres, Cisco Systems
  • 2. OpenFabrics Open Frameworks • Effort started / led by Sean Hefty, Intel • Successor to the Linux verbs API • • • • Fix some of the well-known problems from verbs Match modern hardware / not be tied to InfiniBand Serve MPI’s needs better Serve other consumers better (e.g., PGAS) • Not verbs 2.0: a whole new API Slide 2
  • 3. Many thanks to MPI contributors (in no particular order) • ETZ Zurich • • Torsten Hoefler • • • • • • Sandia National Labs • Ron Brightwell • Brian Barrett • Ryan Grant • IBM • • • • Chulho Kim Carl Obert Michael Blocksome Perry Schmidt Cisco Systems • Jeff Squyres Dave Goodell Reese Faucette Cesare Cantu Upinder Malhi Oak Ridge National Labs • Scott Atchley • Pavel Shamis • Argonne National Labs • Jeff Hammond Slide 3
  • 4. Many thanks to MPI contributors (in no particular order) • Intel • Sayantan Sur • Charles Archer • AMD • Brad Benton • Microsoft • Cray • Krishna Kandalla • Fab Tillier • U. Edinburgh / EPCC • Mellanox • Devendar Bureddy • Dan Holmes • U. Alabama Birmingham • SGI • Tony Skjellum • Amin Hassani • Shane Farmer • Michael Raymond Slide 4
  • 5. Quick MPI overview • High-level abstraction API • No concept of a connection • All communication: • Is reliable • Has some ordering rules • Is comprised of typed messages • Peer address is (communicator, integer) tuple • I.e., virtualized • Specifies a process, not a server / network endpoint Slide 5
  • 6. Quick MPI overview • Communication modes • Blocking and non-blocking (polled completion) • Point-to-point: two-sided and one-sided • Collective operations: broadcast, scatter, reduce, …etc. • …and others, but those are the big ones • Async. progression is required/strongly desired • Message buffers are provided by the application • They are not “special” (e.g., registered) Slide 6
  • 7. Quick MPI overview • MPI specification • Governed by the MPI Forum standards body • Currently at MPI-3.0 • MPI implementations • Software + hardware implementation of the spec • Some are open source, some are closed source • Generally don’t care about interoperability (e.g., wire protocols) Slide 7
  • 8. MPI is a large community • Community feedback represents union of: • Different viewpoints • Different MPI implementations • Different hardware perspectives • …and not all agree with each other • For example… Slide 8
  • 9. Different MPI camps Those who want high level interfaces Those who want low level interfaces • Do not want to see memory registration • Want to have good memory registration infrastructure • Want tag matching • Want direct access to hardware capabilities • E.g., PSM • Trust the network layer to do everything well under the covers • Want to fully implement MPI interfaces themselves • Or, the MPI implementers are the kernel / firmware /hardware developers Slide 9
  • 10. Be careful what you ask for… • …because you just got it • Members of the MPI Forum would like to be involved in the libfabric design on an ongoing basis • Can we get an MPI libfabric listserv? Slide 10
  • 11. Basic things MPI needs • Messages (not streams) • Efficient API • Allow for low latency / high bandwidth • Low number of instructions in the critical path • Enable “zero copy” • Separation of local action initiation and completion • One-sided (including atomics) and two-sided semantics • No requirement for communication buffer alignment Slide 11
  • 12. Basic things MPI needs • Asynchronous progress independent of API calls • Including asynchronous progress from multiple consumers (e.g., MPI and PGAS in the same process) • Preferably via dedicated hardware Progress of these Slide 12 Also causes progress of these
  • 13. Basic things MPI needs • Scalable communications with millions of peers • With both one-sided and two-sided semantics • Think of MPI as a fully-connected model (even though it usually isn’t implemented that way) • Today, runs with 3 million MPI processes in a job Slide 13
  • 14. Things MPI likes in verbs • (all the basic needs from previous slide) • Different modes of communication • Reliable vs. unreliable • Scalable connectionless communications (i.e., UD) • Specify peer read/write address (i.e., RDMA) • RDMA write with immediate (*) • …but we want more (more on this later) Slide 14
  • 15. Things MPI likes in verbs • Ability to re-use (short/inline) buffers immediately • Polling and OS-native/fd-based blocking QP modes • Discover devices, ports, and their capabilities (*) • …but let’s not tie this to a specific hardware model • Scatter / gather lists for sends • Atomic operations (*) • …but we want more (more on this later) Slide 15
  • 16. Things MPI likes in verbs • Can have multiple consumers in a single process • API handles are independent of each other Slide 16
  • 17. Things MPI likes in verbs • Verbs does not: • Require collective initialization across multiple processes • Require peers to have the same process image • Restrict completion order vs. delivery order • Restrict source/target address region (stack, data, heap) • Require a specific wire protocol (*) • …but it does impose limitations, e.g., 40-byte GRH UD header Slide 17
  • 18. Things MPI likes in verbs • Ability to connect to “unrelated” peers • Cannot access peer (memory) without permission • Ability to block while waiting for completion • ...assumedly without consuming host CPU cycles • Cleans up everything upon process termination • E.g., kernel and hardware resources are released Slide 18
  • 19. Other things MPI wants (described as verbs improvements) • MTU is an int (not an enum) • Specify timeouts to connection requests • …or have a CM that completes connections asynchronously • All operations need to be non-blocking, including: • Address handle creation • Communication setup / teardown • Memory registration / deregistration Slide 19
  • 20. Other things MPI wants (described as verbs improvements) • Specify buffer/length as function parameters • Specified as struct requires extra memory accesses • …more on this later • Ability to query how many credits currently available in a QP • To support actions that consume more than one credit • Remove concept of “queue pair” • Have standalone send channels and receive channels Slide 20
  • 21. Other things MPI wants (described as verbs improvements) • Completion at target for an RDMA write • Have ability to query if loopback communication is supported • Clearly delineate what functionality must be supported vs. what is optional • Example: MPI provides (almost) the same functionality everywhere, regardless of hardware / platform • Verbs functionality is wildly different for each provider Slide 21
  • 22. Other things MPI wants (described as verbs improvements) • Better ability to determine causes of errors • In verbs: • Different providers have different (proprietary) interpretations of various error codes • Difficult to find out why ibv_post_send() or ibv_poll_cq() failed, for example • Perhaps a better strerr() type of functionality (that can also obtain provider-specific strings)? Slide 22
  • 23. Other things MPI wants: Standardized high-level interfaces • Examples: • • • • • Tag matching MPI collective operations (TBD) Remote atomic operations …etc. The MPI community wants input in the design of these interfaces • Divided opinions from MPI community: • Providers must support these interfaces, even if emulated Slide • Run-time query to see which interfaces are supported 23
  • 24. Other things MPI wants: Vendor-specific interfaces • Direct access to vendor-specific features • Lowest-common denominator API is not always enough • Allow all providers to extend all parts of the API • Implies: • Robust API to query what devices and providers are available at run-time (and their various versions, etc.) • Compile-time conventions and protections to allow for safe non-portable codes • This is a radical difference from verbs Slide 24
  • 25. …and much more • Don’t have the time to go through all of this in a slidecast • The rest of the slides are attached, for those interested • The collaboration is ongoing • This is a long process • But both exciting and promising Slide 25
  • 26. Core libfabric functionality Direct function calls to libfabric Slide 26
  • 27. Example options for direct access to vendor-specific functionality Example 1: Access to provider A extensions without going through libfabric core Slide 27
  • 28. Example options for direct access to vendor-specific functionality Example 2: Access to provider B extensions via “pass through” functionality in libfabric Slide 28
  • 29. Other things MPI wants: Regarding memory registration • Run-time query: is memory registration is necessary? • I.e., explicit or implicit memory registration • If explicit • Need robust notification of involuntary memory deregistration (e.g., munmap) • If the cost of de/registration were “free”, much of this debate would go away  Slide 29
  • 30. Other things MPI wants: Regarding fork() behavior • In child: • All memory is accessible (no side effects) • Network handles are stale / unusable • Can re-initialize network API (i.e., get new handles) • In parent: • All memory is accessible • Network layer is still fully usable • Independent of child process effects Slide 30
  • 31. Other things MPI wants • If network header knowledge is required: • Provide a run-time query • Do not mandate a specific network header • E.g., incoming verbs datagrams require a GRH header • Request ordered vs. unordered delivery • Potentially by traffic type (e.g., send/receive vs. RDMA) • Completions on both sides of a remote write Slide 31
  • 32. Other things MPI wants • Allow listeners to request a specific network address • Similar to TCP sockets asking for a specific port • Allow receiver providers to consume buffering directly related to the size of incoming messages • Example: “slab” buffering schemes Slide 32
  • 33. Other things MPI wants • Generic completion types. Example: • Aggregate completions • Vendor-specific events • Out-of-band messaging Slide 33
  • 34. Other things MPI wants • Noncontiguous sends, receives, and RDMA opns. • Page size irrelevance • Send / receive from memory, regardless of page size • Access to underlying performance counters • For MPI implementers and MPI-3 “MPI_T” tools • Set / get network quality of service Slide 34
  • 35. Other things MPI wants: More atomic operations • Datatypes (minimum): int64_t, uint64_t, int32_t, uint32_t • Would be great: all C types (to include double complex) • Would be ok: all <stdint.h> types • Don’t require more than natural C alignment • Operations (minimum) • accumulate, fetch-and-accumulate, swap, compare-and-swap • Accumulate operators (minimum) • add, subtract, or, xor, and, min, max • Run-time query: are these atomics coherent with the host? • If support both, have ability to request one or the other Slide 35
  • 36. Other things MPI wants: MPI RMA requirements • Offset-based communication (not address-based) • Performance improvement: potentially reduces cache misses associated with offset-to-address lookup • Programmatic support to discover if VA based RMA performs worse/better than offset based • Both models could be available in the API • But not required to be supported simultaneously • Aggregate completions for MPI Put/Get operations • Per endpoint • Per memory region Slide 36
  • 37. Other things MPI wants: MPI RMA requirements • Ability to specify remote keys when registering • Improves MPI collective memory window allocation scalability • Ability to specify arbitrary-sized atomic ops • Run-time query supported size • Ability to specify/query ordering and ordering limits of atomics • Ordering mode: rar, raw, war and waw • Example: “rar” – reads after reads are ordered Slide 37
  • 38. “New,” but becoming important • Network topology discovery and awareness • …but this is (somewhat) a New Thing • Not much commonality across MPI implementations • Would be nice to see some aspect of libfabric provide fabric topology and other/meta information • Need read-only access for regular users Slide 38
  • 39. API design considerations • With no tag matching, MPI frequently sends / receives two buffers • (header + payload) • Optimize for that • MPI sometimes needs thread safety, sometimes not • May need both in a single process • Support for checkpoint/restart is desirable • Make it safe to close stale handles, reclaim resources Slide 39
  • 40. API design considerations • Do not assume: • • • • • Max size of any transfer (e.g., inline) The memory translation unit is in network hardware All communication buffers are in main RAM Onload / offload, but allow for both API handles refer to unique hardware resources • Be “as reliable as sockets” (e.g., if a peer disappears) • Have well-defined failure semantics • Have ability to reclaim resources on failure Slide 40
  • 41. Conclusions • Many different requirements • High-level, low-level, and vendor-specific interfaces • The MPI community would like to continue to collaborate • Tag matching is well-understood, but agreeing on a common set of interfaces for them will take work • Creating other high-level MPI-friendly interfaces (e.g., for collectives) will take additional work Slide 41

Hinweis der Redaktion

  1. Added “ability for multiple consumer async progress” sub-bullet (from feedback in slide 35)
  2. Added “both 1-sided and 2-sided” to scalability point in this slide (Vs. in a later slide)
  3. Feedback from MPI community:We like RDMA write with immediate… but 4 bytes isn’t enoughRe-state last bullet better: want completion on peer for an RDMA write
  4. Replaced “inline messages” with “ability to re-use buffer immediately”
  5. Replaced “MPI + PGAS” example with “API handles are independent of each other”Split off into its own slide
  6. Specifically mentioned the 40-byte GHR UD header
  7. Translated “Multiple PDs per process” to “ability to connect to ‘unrelated’ peers”Per feedback on slide 32, add point about being able to block (without consuming CPU, even though that’s not actually specified by verbs)
  8. After MPI community feedback:- Added 1st bullet as reaction to feedback from slide 13
  9. This is a new slide from after the MPI community feedback webex
  10. MPI community feedback:- Separate distinction between vendor-specific optimization exposure vs. common functionality that should be standardized (e.g., tag matching should be standardized)This is a new slide that specifically mentions the common high-level interfacesI added the ability to run-time query which interfaces are available (e.g., for providers who do not want to provide these high-level interfaces)The next slide is now (pretty much) the original slide that talks about low-level vendor-specific functionality
  11. MPI community feedback:- Separate distinction between vendor-specific optimization exposure vs. common functionality that should be standardized (e.g., tag matching should be standardized)This slide is pretty much the same as it was; the standardization of common interfaces is now a separate slide (the one before this one)
  12. MPI community feedback:- Fix first bullet with: memory reg may not be necessary- Here’s what we want out of mr: 1) decouple registration and pinning. 2) here’s what we’re going to do in MPI- Different requirements: whether mr is implict or explicit (depends on OS, too)- Perhaps choose which to use at runtime? (explicit or implicit)- There are two different camps in the MPI community - If cost of reg/dereg was “free”, much of this debate goes awayI separated memory registration out into its own slide
  13. MPI community feedback:After fork in child: be able to re-init network layerslab: instead, say amount of receive buffering used directly related to incoming messageI split fork into its own slide
  14. MPI community feedback:Datagram bullet is redundant with prior slideLast bullet seems redundant with prior slide, tooOrdering: potentially on a per-message basis? (e.g., send/recv ordered and RDMA unordered)Changed “query for datagram payload offset” to be an exampleI added sub-bullet about ordered vs. unorderedLast bullet (completions for remote write) is not redundant; I left it
  15. MPI community feedback:slab: instead, say amount of receive buffering used directly related to incoming messageChanged 2nd bullet / sub-bullet to be in the form of a requirement, and just used “slab” as an example
  16. MPI community feedback:Little confusing that I mention vendor specified again hereScalable: point is that MPI is a fully connected model. Support it however you want. Today we run on M’s of processes.I didn’t find the vendor-specific events confusing…?I moved the “scalable to millions of peers” up to slide 12
  17. MPI Community feedback:Hints to device about power? (e.g., MPI knows I’m not going to do anything for a while… but it doesn’t) And network power hints may not be valuable over time…?On prior slide: have efficient mechanism to block in libfabric calls (e.g., poll hard in lower layer, sleep for a while, …etc.)There was later consensus that power hints are not likely useful, particularly regarding the network device(s)
  18. Per Jeff Hammond: Add [u]int32_tAdd &lt;stdint.h&gt; types…
  19. This is a new slide from the MPI Forum RMA WG
  20. This is a new slide from the MPI Forum RMA WG
  21. Added: need read-only access for regular users (vs. IB, which requires root access)
  22. MPI community feedback:Data point: if we provide a tag matching, then everyone should provide it even if you have to emulate it (this is two opinions)Expose OOB capabilities from the networkI moved the tag matching data point up to slide 21Moved OOB capabilities point up to slide 31
  23. MPI community feedback:We need async progress (from the perspective of MPI looking down). MPI-3 demands it.Pair this with multiple consumers of the interface both being able to have “async progress” from a single place (E.g., MPI and PGAS playing nicely together)Failures: well-defined a way to reclaim resources from that layer and belowBoth async progress points now listed as a “need” in slide 11Added point about reclaiming resources