SlideShare ist ein Scribd-Unternehmen logo
1 von 17
PMIx: Process Management for
Exascale Environments
Ralph H. Castain, David Solt, Joshua Hursey, Aurelien Bouteiller
EuroMPI/USA 2017, Chicago, IL
OMPI
Spectrum
OSHMEM
SOS
PGAS
others
What is PMIx?
PMI-1 PMI-2
wireup support
dynamic spawn
keyval publish/lookup
MPICH
years go by…
SLURM
ALPS
RM
PGAS
others
2015
Exascale systems
on horizon
Launch times long
New paradigms
2016
Exascale launch
in < 10s
Orchestration
PMIx v1.2
SLURM
JSM
RM
OMPI
Spectrum
OSHMEM
2017
Exascale launch
in < 30s
PMIx v2.x
SLURM
JSM
others
RM
Three Distinct Entities
• PMIx Standard
 Defined set of APIs, attribute strings
 Nothing about implementation
• PMIx Reference Library
 A full-featured implementation of the Standard
 Intended to ease adoption
• PMIx Reference Server
 Full-featured “shim” to a non-PMIx RM
The Community
https://pmix.github.io/pmix
https://github.com/pmix
Job
Script WLM WLM RM
Launch
Cmd
Spawn
Procs
GO
Global
Xchg
Proc
Fabric
NIC
Proc
NIC
Proc
Barrier
FS
Traditional Launch Sequence
Wait for files
& libs
Topo Topo Topo
Fabric
NIC
Fabric
Pro
c
Pro
c
Pro
c
Job
Script WLM WLM RM
Launch
Cmd
Spawn
Procs
GO
Global
Xchg
Proc
Fabric
NIC
Proxy
Proc
Fabric
NIC
Proxy
Proc
Proxy
Barrier
FS
Newer Launch Sequence
Wait for files
& libs
Topo Topo
Fabric
NIC
Topo
PMIx-SMS Interactions
RM
PMIx
Client
FS
Fabric
RAS
APP
Orchestration
Requests
Responses
NIC
Fabric
Mgr
PMIx
Server
MPI
OpenMP
Job
Script
System
Management Stack
Tool Support
PMIx Launch Sequence
*RM daemon, mpirun-daemon, etc.
PMIx/SLURM*
#nodes
MPI_Init(sec)
*LANL/Buffy cluster, 1ppn
PRS**
**PMIx Reference Server v2.0, direct-fetch/async
srun/PMI2
Performance papers coming in 2018!
Similar Requirements
• Notifications/response
 Errors, resource changes
 Negotiated response
• Request allocation changes
 shrink/expand
• Workflow management
 Steered/conditional execution
• QoS requests
 Power, file system, fabric
Multiple,
use-
specific
libs?
(difficult for RM
community to
support)
Single,
multi-
purpose
lib?
PMIx “Standards” Process
• Modifications/additions
 Proposed as RFC
 Include prototype implementation
• Pull request to reference library
 Notification sent to mailing list
• Reviews conducted
 RFC and implementation
 Continues until consensus emerges
• Approval given
 Developer telecon (weekly)
Standards Doc
under
development!
Philosophy
• Generalized APIs
 Few hard parameters
 “Info” arrays to pass information, specify directives
• Easily extended
 Add “keys” instead of modifying API
• Async operations
• Thread safe
• SMS always has right to say “not supported”
 Allow each backend to evaluate what and when to
support something
• Generalized APIs
 Few hard parameters
 “Info” arrays to pass information, specify directives
• Easily extended
 Add “keys” instead of modifying API
• Async operations
• Thread safe
• SMS always has right to say “not supported”
 Allow each backend to evaluate what and when to
support something
Messenger not Doer
APPSMS
Tool
Current Support
• Typical startup operations
 Put, get, commit, barrier,
spawn, [dis]connect,
publish/lookup
• Tool connections
 Debugger, job submission,
query
• Generalized query
support
 Job status, layout, system
data, resource availability
• Event notification
 App, system generated
 Subscribe, chained
 Pre-emption, failures,
timeout warning, …
• Logging (job record)
 Status reports, error output
• Flexible allocations
 Release resources, request
resources
Event Notification Use Case
• Fault detection and reporting
w/ULFM MPI
 ULFM MPI is a fault tolerant
flavor of Open MPI
• Failures may be detected from
the SMS, RAS, or directly by
MPI communications
• Components produce a PMIx
event when detecting an error
• Fault Tolerant components
register for the fault event
• Components propagate fault
events which are then
delivered to registered clients
MPI MPI
PMIx
Server
PMIx
Server
RAS
PMIx
In Pipeline
• Network support
 Security keys, pre-spawn local
driver setup, fabric topology
and status, traffic reports,
fabric manager interaction
• Obsolescence protection
 Automatic cross-version
compatibility
 Container support
• Job control
 Pause, kill, signal, heartbeat,
resilience support
• Generalized data store
• File system support
 Dependency detection
 Tiered storage caching strategies
• Debugger/tool support++
 Automatic rendezvous
 Single interface to all launchers
 Co-launch daemons
 Access fabric info, etc.
• Cross-library interoperation
Summary
We now have an interface library RMs will
support for application-directed requests
Need to collaboratively define
what we want to do with it
Project: https://pmix.github.io/pmix
Reference Implementation: https://github.com/pmix/pmix
Reference Server: https://github.com/pmix/pmix-reference-server

Weitere ähnliche Inhalte

Was ist angesagt?

OSMC 2015: The Assimilation Project by Alan Robertson
OSMC 2015: The Assimilation Project by Alan RobertsonOSMC 2015: The Assimilation Project by Alan Robertson
OSMC 2015: The Assimilation Project by Alan Robertson
NETWAYS
 
Multi Layer Monitoring V1
Multi Layer Monitoring V1Multi Layer Monitoring V1
Multi Layer Monitoring V1
Lahav Savir
 
5 things you didn't know you could do with security policy management
5 things you didn't know you could do with security policy management5 things you didn't know you could do with security policy management
5 things you didn't know you could do with security policy management
AlgoSec
 

Was ist angesagt? (20)

OSMC 2015: The Assimilation Project by Alan Robertson
OSMC 2015: The Assimilation Project by Alan RobertsonOSMC 2015: The Assimilation Project by Alan Robertson
OSMC 2015: The Assimilation Project by Alan Robertson
 
how to simulate ACI
how to simulate ACIhow to simulate ACI
how to simulate ACI
 
Multi Layer Monitoring V1
Multi Layer Monitoring V1Multi Layer Monitoring V1
Multi Layer Monitoring V1
 
Create and Manage a Micro-Segmented Data Center – Best Practices
Create and Manage a Micro-Segmented Data Center – Best PracticesCreate and Manage a Micro-Segmented Data Center – Best Practices
Create and Manage a Micro-Segmented Data Center – Best Practices
 
Best Practices for Workload Security: Securing Servers in Modern Data Center ...
Best Practices for Workload Security: Securing Servers in Modern Data Center ...Best Practices for Workload Security: Securing Servers in Modern Data Center ...
Best Practices for Workload Security: Securing Servers in Modern Data Center ...
 
Migrating Application Connectivity and Network Security to AWS
Migrating Application Connectivity and Network Security to AWSMigrating Application Connectivity and Network Security to AWS
Migrating Application Connectivity and Network Security to AWS
 
5 things you didn't know you could do with security policy management
5 things you didn't know you could do with security policy management5 things you didn't know you could do with security policy management
5 things you didn't know you could do with security policy management
 
Reaching PCI Nirvana: Ensure a Successful Audit & Maintain Continuous Compliance
Reaching PCI Nirvana: Ensure a Successful Audit & Maintain Continuous ComplianceReaching PCI Nirvana: Ensure a Successful Audit & Maintain Continuous Compliance
Reaching PCI Nirvana: Ensure a Successful Audit & Maintain Continuous Compliance
 
Embracing the Rise of SecDevOps
Embracing the Rise of SecDevOpsEmbracing the Rise of SecDevOps
Embracing the Rise of SecDevOps
 
SevOne Scalability
SevOne ScalabilitySevOne Scalability
SevOne Scalability
 
SDN's managing security across the virtual network final
SDN's managing security across the virtual network finalSDN's managing security across the virtual network final
SDN's managing security across the virtual network final
 
Tying cyber attacks to business processes, for faster mitigation
Tying cyber attacks to business processes, for faster mitigationTying cyber attacks to business processes, for faster mitigation
Tying cyber attacks to business processes, for faster mitigation
 
Drone Hijacking
Drone HijackingDrone Hijacking
Drone Hijacking
 
Cisco Firepower Migration | Cisco and AlgoSec Joint Webinar
Cisco Firepower Migration | Cisco and AlgoSec Joint WebinarCisco Firepower Migration | Cisco and AlgoSec Joint Webinar
Cisco Firepower Migration | Cisco and AlgoSec Joint Webinar
 
Accelerate Application Deployment Across Cisco ACI Fabric, On-Premise Firewal...
Accelerate Application Deployment Across Cisco ACI Fabric, On-Premise Firewal...Accelerate Application Deployment Across Cisco ACI Fabric, On-Premise Firewal...
Accelerate Application Deployment Across Cisco ACI Fabric, On-Premise Firewal...
 
Vulnerability Discovery in the Cloud
Vulnerability Discovery in the CloudVulnerability Discovery in the Cloud
Vulnerability Discovery in the Cloud
 
Perforce on Tour 2015 - Securing the Helix Platform at Citrix
Perforce on Tour 2015 - Securing the Helix Platform at CitrixPerforce on Tour 2015 - Securing the Helix Platform at Citrix
Perforce on Tour 2015 - Securing the Helix Platform at Citrix
 
2018 07-24 network security at the speed of dev ops - webinar
2018 07-24 network security at the speed of dev ops - webinar2018 07-24 network security at the speed of dev ops - webinar
2018 07-24 network security at the speed of dev ops - webinar
 
Sasa milic, cisco advanced malware protection
Sasa milic, cisco advanced malware protectionSasa milic, cisco advanced malware protection
Sasa milic, cisco advanced malware protection
 
Spirent: The Internet of Things: The Expanded Security Perimeter
Spirent: The Internet of Things:  The Expanded Security Perimeter Spirent: The Internet of Things:  The Expanded Security Perimeter
Spirent: The Internet of Things: The Expanded Security Perimeter
 

Ähnlich wie EuroMPI 2017 PMIx presentation

Pm ix tutorial-june2019-pub (1)
Pm ix tutorial-june2019-pub (1)Pm ix tutorial-june2019-pub (1)
Pm ix tutorial-june2019-pub (1)
ewerkboy
 
Event Bus as Backbone for Decoupled Microservice Choreography (JFall 2017)
Event Bus as Backbone for Decoupled Microservice Choreography (JFall 2017)Event Bus as Backbone for Decoupled Microservice Choreography (JFall 2017)
Event Bus as Backbone for Decoupled Microservice Choreography (JFall 2017)
Lucas Jellema
 
Take control of your DevOps Dumping Ground; Melissa Sussmann
Take control of your DevOps Dumping Ground; Melissa SussmannTake control of your DevOps Dumping Ground; Melissa Sussmann
Take control of your DevOps Dumping Ground; Melissa Sussmann
Puppet
 

Ähnlich wie EuroMPI 2017 PMIx presentation (20)

PMIx Tiered Storage Support
PMIx Tiered Storage SupportPMIx Tiered Storage Support
PMIx Tiered Storage Support
 
SC'17 BoF Presentation
SC'17 BoF PresentationSC'17 BoF Presentation
SC'17 BoF Presentation
 
PMIx: Bridging the Container Boundary
PMIx: Bridging the Container BoundaryPMIx: Bridging the Container Boundary
PMIx: Bridging the Container Boundary
 
SC'16 PMIx BoF Presentation
SC'16 PMIx BoF PresentationSC'16 PMIx BoF Presentation
SC'16 PMIx BoF Presentation
 
SC'18 BoF Presentation
SC'18 BoF PresentationSC'18 BoF Presentation
SC'18 BoF Presentation
 
Live Application and Infrastructure Monitoring and Root Cause Log Analysis wi...
Live Application and Infrastructure Monitoring and Root Cause Log Analysis wi...Live Application and Infrastructure Monitoring and Root Cause Log Analysis wi...
Live Application and Infrastructure Monitoring and Root Cause Log Analysis wi...
 
SC15 PMIx Birds-of-a-Feather
SC15 PMIx Birds-of-a-FeatherSC15 PMIx Birds-of-a-Feather
SC15 PMIx Birds-of-a-Feather
 
20140708 - Jeremy Edberg: How Netflix Delivers Software
20140708 - Jeremy Edberg: How Netflix Delivers Software20140708 - Jeremy Edberg: How Netflix Delivers Software
20140708 - Jeremy Edberg: How Netflix Delivers Software
 
Threat intelligence solution
Threat intelligence solutionThreat intelligence solution
Threat intelligence solution
 
Netflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open SourceNetflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open Source
 
Exploring the Final Frontier of Data Center Orchestration: Network Elements -...
Exploring the Final Frontier of Data Center Orchestration: Network Elements -...Exploring the Final Frontier of Data Center Orchestration: Network Elements -...
Exploring the Final Frontier of Data Center Orchestration: Network Elements -...
 
Microservices
MicroservicesMicroservices
Microservices
 
Event Bus as Backbone for Decoupled Microservice Choreography - Lecture and W...
Event Bus as Backbone for Decoupled Microservice Choreography - Lecture and W...Event Bus as Backbone for Decoupled Microservice Choreography - Lecture and W...
Event Bus as Backbone for Decoupled Microservice Choreography - Lecture and W...
 
Pm ix tutorial-june2019-pub (1)
Pm ix tutorial-june2019-pub (1)Pm ix tutorial-june2019-pub (1)
Pm ix tutorial-june2019-pub (1)
 
Music city data Hail Hydrate! from stream to lake
Music city data Hail Hydrate! from stream to lakeMusic city data Hail Hydrate! from stream to lake
Music city data Hail Hydrate! from stream to lake
 
2008-03-06 Harris Corp Security Seminar
2008-03-06 Harris Corp Security Seminar2008-03-06 Harris Corp Security Seminar
2008-03-06 Harris Corp Security Seminar
 
Event Bus as Backbone for Decoupled Microservice Choreography (JFall 2017)
Event Bus as Backbone for Decoupled Microservice Choreography (JFall 2017)Event Bus as Backbone for Decoupled Microservice Choreography (JFall 2017)
Event Bus as Backbone for Decoupled Microservice Choreography (JFall 2017)
 
Netflix Cloud Platform and Open Source
Netflix Cloud Platform and Open SourceNetflix Cloud Platform and Open Source
Netflix Cloud Platform and Open Source
 
Monitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInDataMonitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInData
 
Take control of your DevOps Dumping Ground; Melissa Sussmann
Take control of your DevOps Dumping Ground; Melissa SussmannTake control of your DevOps Dumping Ground; Melissa Sussmann
Take control of your DevOps Dumping Ground; Melissa Sussmann
 

Kürzlich hochgeladen

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Kürzlich hochgeladen (20)

Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 

EuroMPI 2017 PMIx presentation

  • 1. PMIx: Process Management for Exascale Environments Ralph H. Castain, David Solt, Joshua Hursey, Aurelien Bouteiller EuroMPI/USA 2017, Chicago, IL
  • 2. OMPI Spectrum OSHMEM SOS PGAS others What is PMIx? PMI-1 PMI-2 wireup support dynamic spawn keyval publish/lookup MPICH years go by… SLURM ALPS RM PGAS others 2015 Exascale systems on horizon Launch times long New paradigms 2016 Exascale launch in < 10s Orchestration PMIx v1.2 SLURM JSM RM OMPI Spectrum OSHMEM 2017 Exascale launch in < 30s PMIx v2.x SLURM JSM others RM
  • 3. Three Distinct Entities • PMIx Standard  Defined set of APIs, attribute strings  Nothing about implementation • PMIx Reference Library  A full-featured implementation of the Standard  Intended to ease adoption • PMIx Reference Server  Full-featured “shim” to a non-PMIx RM
  • 5. Job Script WLM WLM RM Launch Cmd Spawn Procs GO Global Xchg Proc Fabric NIC Proc NIC Proc Barrier FS Traditional Launch Sequence Wait for files & libs Topo Topo Topo Fabric NIC Fabric
  • 6. Pro c Pro c Pro c Job Script WLM WLM RM Launch Cmd Spawn Procs GO Global Xchg Proc Fabric NIC Proxy Proc Fabric NIC Proxy Proc Proxy Barrier FS Newer Launch Sequence Wait for files & libs Topo Topo Fabric NIC Topo
  • 8. PMIx Launch Sequence *RM daemon, mpirun-daemon, etc.
  • 9. PMIx/SLURM* #nodes MPI_Init(sec) *LANL/Buffy cluster, 1ppn PRS** **PMIx Reference Server v2.0, direct-fetch/async srun/PMI2 Performance papers coming in 2018!
  • 10. Similar Requirements • Notifications/response  Errors, resource changes  Negotiated response • Request allocation changes  shrink/expand • Workflow management  Steered/conditional execution • QoS requests  Power, file system, fabric Multiple, use- specific libs? (difficult for RM community to support) Single, multi- purpose lib?
  • 11. PMIx “Standards” Process • Modifications/additions  Proposed as RFC  Include prototype implementation • Pull request to reference library  Notification sent to mailing list • Reviews conducted  RFC and implementation  Continues until consensus emerges • Approval given  Developer telecon (weekly) Standards Doc under development!
  • 12. Philosophy • Generalized APIs  Few hard parameters  “Info” arrays to pass information, specify directives • Easily extended  Add “keys” instead of modifying API • Async operations • Thread safe • SMS always has right to say “not supported”  Allow each backend to evaluate what and when to support something
  • 13. • Generalized APIs  Few hard parameters  “Info” arrays to pass information, specify directives • Easily extended  Add “keys” instead of modifying API • Async operations • Thread safe • SMS always has right to say “not supported”  Allow each backend to evaluate what and when to support something Messenger not Doer APPSMS Tool
  • 14. Current Support • Typical startup operations  Put, get, commit, barrier, spawn, [dis]connect, publish/lookup • Tool connections  Debugger, job submission, query • Generalized query support  Job status, layout, system data, resource availability • Event notification  App, system generated  Subscribe, chained  Pre-emption, failures, timeout warning, … • Logging (job record)  Status reports, error output • Flexible allocations  Release resources, request resources
  • 15. Event Notification Use Case • Fault detection and reporting w/ULFM MPI  ULFM MPI is a fault tolerant flavor of Open MPI • Failures may be detected from the SMS, RAS, or directly by MPI communications • Components produce a PMIx event when detecting an error • Fault Tolerant components register for the fault event • Components propagate fault events which are then delivered to registered clients MPI MPI PMIx Server PMIx Server RAS PMIx
  • 16. In Pipeline • Network support  Security keys, pre-spawn local driver setup, fabric topology and status, traffic reports, fabric manager interaction • Obsolescence protection  Automatic cross-version compatibility  Container support • Job control  Pause, kill, signal, heartbeat, resilience support • Generalized data store • File system support  Dependency detection  Tiered storage caching strategies • Debugger/tool support++  Automatic rendezvous  Single interface to all launchers  Co-launch daemons  Access fabric info, etc. • Cross-library interoperation
  • 17. Summary We now have an interface library RMs will support for application-directed requests Need to collaboratively define what we want to do with it Project: https://pmix.github.io/pmix Reference Implementation: https://github.com/pmix/pmix Reference Server: https://github.com/pmix/pmix-reference-server