SlideShare ist ein Scribd-Unternehmen logo
1 von 74
HPC Controls: View to the Future
Ralph H. Castain
June, 2015
This Presentation: A Caveat
• One person’s view into the future
o Where I am leading the open source community
o Compiled from presentations and emails with that
community, the national labs, various corporations
o Spans last 20+ years
• Not what I am expecting any particular entity to do
Controls Overview: A Definition
• Resource Manager
o Workload manager
o Run-time environment
• Monitoring
• Resiliency/Error Management
• Overlay Network
• Pub-sub Network
• Console
Controls Overview: A Definition
• Resource Manager
o Workload manager
o Run-time environment
• Monitoring
• Resiliency/Error
Management
• Overlay Network
• Pub-sub Network
• Console
Open Ecosystems Today
• Every element is independent
island, each with its own
community
• Multiple programming languages
• Conflicting licensing
• Cross-element interactions, where
they exist, are via text-based
messaging
General Requirements
• Scalable to exascale
levels & beyond
– Better-than-linear scaling
– Constrained memory
footprint
• Dynamically
configurable
– Sense and adapt, user-
directable
– On-the-fly updates
• Open source (non-copy-left)
• Maintainable, flexible
– Single platform that can be
utilized to build multiple
tools
– Existing ecosystem
• Resilient
– Self-heal around failures
– Reintegrate recovered
resources
5
Chosen Software Platform
• Demonstrated scalability
• Established community
• Clean non-copy-left licensing
• Compatible architecture
• …
6
Open Resilient Cluster Manager (ORCM)
[for reference implementation]
2003 - present
7
Open MPI
OpenRTE
Cisco Intel
10s-100s of Knodes
(80K in production)
EMC
Enterprise
Router
Cluster
Monitor SCON/RM
OMPI/ORTE
ORCM/SCON
FDDP
1994-2003
Abstractions
• Divide functional blocks into abstract frameworks
o Standardize the API for each identified functional area
o Can be used external, or dedicated for internal use
o Single-select: pick active component at startup
o Multi-select: dynamically decide for each execution
• Multiple implementations
o Each in its own isolated plugin
o Fully implement the API
o Base “component” holds common functions
Example: SCalable Overlay Network (SCON)
RMQ ZMQ BTL OOB
SCON-MSG
TCPUDPIB UGNI SM USNIC TCP CUDAPortals4
SCIF
Send/Recv
Example: SCalable Overlay Network (SCON)
10
RMQ ZMQ BTL OOB
SCON-MSG
TCPUDPIB UGNI SM USNIC TCP CUDAPortals4
SCIF
Send/Recv
Inherit
ORCM and Plug-ins
• Plug-ins are shared libraries
o Central set of plug-ins in installation tree
o Users can also have plug-ins under $HOME
o Proprietary binary plugins picked up at runtime
• Can add / remove plug-ins after install
o No need to recompile / re-link apps
o Download / install new plug-ins
o Develop new plug-ins safely
• Update “on-the-fly”
o Add, update plug-ins while running
o Frameworks “pause” during update
Controls Overview: A Definition
• Resource Manager
o Workload manager
o Run-time environment
• Monitoring
• Resiliency/Error Management
• Overlay Network
• Pub-sub Network
• Console
Definition: RM
• Scheduler/Workload Manager
o Allocates resources to session
o Interactive and batch
• Run-Time Environment
o Launch and monitor applications
o Support inter-process communication wireup
o Serve as intermediary between applications and WM
• Dynamic resource requests
• Error notification
o Implement failure policies
13
Breaking it Down
• Workload Manager
o Dedicated framework
o Plugins for two-way integration to external WM (Moab, Cobalt)
o Plugins for implementing internal WM (FIFO)
• Run-Time Environment
o Broken down into functional blocks, each with own framework
• Loosely divided into three general categories
• Messaging, launch, error handling
• One or more frameworks for each category
o Knitted together via “state machine”
• Event-driven, async
• Each functional block can be separate thread
• Each plugin within each block can be separate thread(s)
Key Objectives for Future
• Orchestration via integration
o Monitoring system, file system, network, facility, console
• Power control/management
• “Instant On”
• Application interface
o Request RM actions
o Receive RM notifications
o Fault tolerance
• Advanced workload management algorithms
o Alternative programming model support (Hadoop, Cloud)
o Anticipatory scheduling
RM as Orchestrator
Monitoring
Console
DB
File
System
Network
Resource
Manager
Overlay
Network
Pub-
Sub
SCON
Prov.
Agent
Hierarchical Design
orcmd
CN
orcmd
CN
B
M
C
Rack Controller
orcmd
RACK
DB
Row
Ctlr
orcmd
CN
orcmd
CN
B
M
C
Rack Controller
orcmd
RACK
Hierarchical Design
orcmd
CN
orcmd
CN
B
M
C
Rack Controller
orcmd
RACK
DB
Row
Ctlr
orcmd
CN
orcmd
CN
B
M
C
Rack Controller
orcmd
RACK
Flexible Architecture
• Each tool built on top of same plugin system
o Different combinations of frameworks
o Different plugins activated to play different roles
o Example: orcmd on compute node vs on rack/row controllers
• Designed for distributed, centralized, hybrid operations
o Centralized for small clusters
o Hybrid for larger clusters
o Example: centralized scheduler, distributed “worker-bees”
• Accessible to users for interacting with RM
o Add shim libraries (abstract, public APIs) to access framework APIs
o Examples: SCON, pub-sub, in-flight analytics
Code Reuse: Scheduler
sched
moab cobalt FIFO
orcmd
DB
db
pwmgt
cfgi
pvn
File System Integration
• Input
o Current data location, time to retrieve and position
o Data the application intends to use
• Job submission, dynamic request (via RTE), persistence across jobs and sessions
o What OS/libraries application requires
• Workload Manager
o Scheduling algorithm factors
• Current provisioning map, NVM usage patterns/requests, persistent data/ckpt location
• Data/library locality, loading time
• Output
o Pre-location requirements to file system (e.g., SPINDLE cache tree)
o Hot/warm/cold data movement, persistence directives to RTE
o NVM & Burst Buffer allocations (job submission, dynamic)
Provisioning Integration
• Support multiple provisioning agents
o Warewulf, xCAT, ROCKS
• Support multiple provisioning modes
o Bare metal, Virtual Machines (VMs)
• Job submission includes desired provisioning
o Scheduler can include provisioning time in allocation decision, anticipate
re-use of provisioned nodes
o Scheduler notifies provisioning agent of what nodes to provision
o Provisioning agent notifies RTE when provisioning complete so launch
can proceed
Network Integration
• Quality of service allocations
o Bandwidth, traffic priority, power constraints
o Specified at job submission, dynamic request (via RTE)
• Security requirements
o Network domain definitions
• State-of-health
o Monitored by monitoring system
o Reported to RM for response
• Static endpoint
o Allocate application endpoints prior to launch
• Minimize startup time
o Update process location upon fault recovery
Power Control/Management
• Power/heat-aware scheduling
o Specified cluster-level power cap, ramp up/down rate limits
o Specified thermal limits (ramp up/down rates, level)
o Node-level idle power, shutdown policies between sessions/time-of-day
• Site-level coordination
o Heat and power management subsystem
• Consider system capacity in scheduling
• Provide load anticipation levels to site controllers
o Coordinate ramps (job launch/shutdown)
• Within cluster, across site
o Receive limit (high/low) updates
o Direct RTE to adjust controls while maintaining sync across cluster
“Instant On”
• Objective
o Reduce startup from minutes to seconds
o 1M procs, 50k nodes thru MPI_Init
• On track: 20 sec
• 2018 target: 5 sec
• What we require
o Use HSN for communication during launch
o Static endpoint allocation prior to launch
o Integrate file system with RM for prepositioning, with scheduler for
anticipatory locations
25
Today: ~15-30 min
“Instant On” Value-Add
• Requires two things
o Process can compute endpoint of any remote process
o Hardware can reserve and translate virtual endpoints to local ones
• RM knows process map
o Assign endpoint info for each process
o Communicate map to local processes at initialization
• Programming libraries
o Compute connection info from map
• Eliminates costly sharing of endpoint info
Application Interface: PMIx
• Current PMI implementations are limited
o Only used for MPI wireup
o Don’t scale adequately
• Communication required for every piece of data
• All blocking operations
o Licensing and desire for standalone client library
• Increasing requests for app-RM interactions
o Job spawn, data pre-location, power control
o Current approach is fragmented
• Every RM creating its own APIs
Application Interface: PMIx
• Current PMI implementations are limited
o Only used for MPI wireup
o Don’t scale adequately
• Communication required for every piece of data
• All blocking operations
o Licensing and desire for standalone client library
• Increasing requests for app-RM interactions
o Job spawn, data pre-location, power control
o Current approach is fragmented
• Every RM creating its own APIs
PMIx Approach
• Ease adoption
o Backwards compatible to PMI-1/2 APIs
o Standalone client library
o Server convenience library
• BSD-licensed
• Replace blocking with non-blocking operations for scalability
• Add APIs to support
o New use-cases: IO, power, error notification, checkpoint, …
o Programming models beyond MPI
First time
PMIx: Fault Tolerance
• Notification
o App can register for error notifications, incipient faults
o RM will notify when app would be impacted
• Notify procs, system monitor, user as directed (e.g., email, tweet)
o App responds with desired action
• Terminate/restart job, wait for checkpoint, etc.
• Checkpoint support
o Timed or on-demand, binary or SCR
o BB allocations, bleed/storage directives across hierarchical storage
• Restart support
o From remote NVM checkpoint, relocate checkpoint, etc.
PMIx: Status
• Version 1.0 release
o Preliminary version for commentary
• Version 1.1 release
o Production version
o Scheduled for release by Supercomputing 2015
• Server integrations underway
o SLURM
o ORCM
• PGAS/GASNet integration underway
o Extend support for that programming model
Alternative Models: Supporting Hadoop & Cloud
• HPC
o Optimize versus performance, single-tenancy
• Hadoop support
o Optimize versus data locality
o Multi-tenancy
o Pre-location of data to NVM, pre-staging from data store
o Dynamic allocation requests
• Cloud = capacity computing
o Optimize versus cost
o Multi-tenancy
o Provisioning and security support for VMs
Workload Manager: Job Description Language
• Complexity of describing job is growing
o Power, file/lib positioning
o Performance vs capacity, programming model
o System, project, application-level defaults
• Provide templates?
o System defaults, with modifiers
• --hadoop:mapper=foo,reducer=bar
o User-defined
• Application templates
• Shared, group templates
o Markup language definition of behaviors, priorities
30 yrs
Anticipatory Scheduling: Key to Success
• Current schedulers are reactionary
o Compute next allocation when prior one completes
o Mandates fast algorithm to reduce dead time
o Limit what can be done in terms of pre-loading etc. as they add to dead time
• Anticipatory scheduler
o Look ahead and predict range of potential schedules
o Instantly select “best” when prior one completes
o Support data and binary prepositioning in anticipation of most likely option
o Improved optimization of resources as pressure on algorithm computational
time is relieved
34
Anticipatory Scheduler
Console
File
System
Resource
Manager
Overlay
Network
Prov.
Agent
Retrieve reqd
files from
backend storage
Cache images to
staging points
Projected
schedule
Controls Overview: A Definition
• Resource Manager
o Workload manager
o Run-time environment
• Monitoring
• Resiliency/Error Management
• Overlay Network
• Pub-sub Network
• Console
Key Functional Requirements
• Support all available data collection sensors
o Environmental (junction temps, power)
o Process usage statistics (cpu, memory, disk, network)
o MCA, BMC events
• Support variety of backend databases
• Admin configuration
o Select which sensors, how often sampled, how often reported, when and
where data is stored
o Severity/priority/definition of events
o Local admin customizes config, on-the-fly changes
Monitoring System
sensors diag
orcmd
Data
Inventory Diagnostics
B
M
C
Compute Node
Launches/Monitors
Apps
sensors diag
orcmd
S
C
O
N
Rack Controller
Data
Inventory
db
DB
SCON Agg Data
IO/Pvn
Caching
Sensor Framework
• Each sensor can operate in its own time base
o Sample at independent rate
o Output collected in common framework-level bucket
o Can trigger immediate send of bucket for critical events
• Separate reporting time base
o Send bucket to aggregator node at scheduled intervals
o Aggregator receives bucket in sensor/base, extracts contributions from each
sensor, passes to that sensor module for unpacking/recording
Freq JnTemp Power Base
Sensor
BMC/IPMI
Inventory Collection/Tracking
• Performed at boot
o Update on-demand, warm change detected
• Data obtained from three sources
o HWLOC topology collection (processors, memory, disks, NICs)
o Sensor components (BMC/IPMI)
o Network fabric managers (switches/routers)
• Store current inventory and updates
o Track prior locations of each FRU
o Correlate inventory to RAS events
o Detect inadvertent return-to-service of failed units
Database Framework
• Database commands go to base “stub” functions
o Cycle across active modules until one acknowledges handling request
o API provides “attributes” to specify database, table, etc.
• Modules check attributes to determine ability to handle
o Plugins call schema framework to format data
• Multi-threading supported
o Each command can be processed by separate thread
o Each database module can process requests using multiple threads
postgres ODBC Base
db
db2lite
Diagnostics: Another Framework
• Test the system on demand
o Checks for running jobs and returns error
o Execute at boot, when hardware changed, if errors reported, …
• Each plugin tests specific area
o Data reported to database
o Tied to FRUs as well as overall node
cpu eth Base
diag
mem
Controls Overview: A Definition
• Resource Manager
o Workload manager
o Run-time environment
• Monitoring
• Resiliency/Error Management
• Overlay Network
• Pub-sub Network
• Console
Planned Progression
Monitoring
Fault detection
(RAS event generation)
Fault
diagnosis
Fault
prediction
Planned Progression
Monitoring
Fault detection
(RAS event generation)
Fault
diagnosis
Fault
prediction
FDDP
Action Response
Analytics Workflow Concept
I
n
p
u
t
M
o
d
u
l
e
Generalized
Format
Sensors
Other
Workflows
Other
Workflow
RAS
Event
Available in SCON as well
Pub-Sub
Database
Workflow Elements
• Average (window, running, etc.)
• Rate (convert incoming data to events/sec)
• Threshold (high, low)
• Filter
o Selects input values based on provided params
• RAS event
o Generates a RAS event corresponding to input description
• Publish data
Analytics
• Execute on aggregator nodes for in-flight reduction
o Sys admin defines, user can define (if permitted)
• Event-based state machine
o Each workflow in own thread, own instance of each plugin
o Branch and merge of workflow
o Tap stream between workflow steps
o Tap data streams (sensors, others)
• Event generation
o Generate events/alarms
o Specify data to be included (window)
48
Distributed Architecture
• Hierarchical, distributed approach for unlimited scalability
o Utilize daemons on rack/row controllers
• Analysis done at each level of the hierarchy
o Support rapid response to critical events
o Distribute processing load
o Minimize data movement
• RM’s error manager framework controls response
o Based on specified policies
Fault Diagnosis
• Identify root cause and location
o Sometimes obvious – e.g., when direct measurement
o Other times non-obvious
• Multiple cascading impacts
• Cause identified by multi-sensor correlations (indirect measurement)
• Direct measurement yields early report of non-root cause
• Example: power supply fails due to borderline cooling + high load
• Estimate severity
o Safety issue, long-term damage, imminent failure
• Requires in-depth understanding of hardware
Fault Prediction: Methodology
• Exploit access to internals
o Investigate optimal location, number of sensors
o Embed intelligence, communications capability
• Integrate data from all available sources
o Engineering design tests
o Reliability life tests
o Production qualification tests
• Utilize learning algorithms to improve performance
o Both embedded, post process
o Seed with expert knowledge
Fault Prediction: Outcomes
• Continuous update of mean-time-to-preventative-maintenance
o Feed into projected downtime planning
o Incorporate into scheduling algo
• Alarm reports for imminent failures
o Notify impacted sessions/applications
o Plan/execute preemptive actions
• Store predictions
o Algorithm improvement
Error Manager
• Log errors for reporting, future analysis
• Primary responsibility: fault response
o Contains defined response for given types of faults
o Responds to faults by shifting resources, processes
• Secondary responsibility: resilience strategy
o Continuously update and define possible response options
o Fault prediction triggers pre-emptive action
• Select various response strategies via component
o Run-time, configuration, or on-the-fly command
Example: Network Communications
Fault tolerant
• Modify run-time to avoid
automatic abort upon loss of
communication
• Detect failure of any on-going
communication
• Reconnect quickly
• Resend lost messages from me
• Request resend of lost
messages to me
Network interface card repeatedly
loses connection or shows data loss
Resilient
• Estimate probability of failure
during session
– Failure history
– Monitor internals
• Temperature, …
• Take preemptive action
– Reroute messages via
alternative transports
– Coordinated move of processes
to another node
Node fails during execution of high-
priority application
Example: Node Failure
Fault tolerant
• Detect failure of process(es) on
that node
• Find last checkpoint
• Restart processes on another
node at the checkpoint state
– May (likely) require restarting
all processes at same state
• Resend any lost messages
Resilient
• Monitor state-of-health of node
– Temperature of key
components, other signatures
• Estimate probability of failure
during future time interval(s)
• Take preemptive action
– Direct checkpoint/save
– Coordinate move with
application
• Avoids potential need to reset
ALL processes back to earlier
state!
Controls Overview: A Definition
• Resource Manager
o Workload manager
o Run-time environment
• Monitoring
• Resiliency/Error Management
• Overlay Network
• Pub-sub Network
• Console
Definition: Overlay Network
• Messaging system
o Scalable/resilient communications
o Integration-friendly with user applications, system management software
o Quality of service support
• In-flight analytics
o Insert analytic workflows anywhere in the data stream
o Tap data stream at any point
o Generate RAS events from data stream
57
Requirements
• Scalable to exascale levels
– Better-than-linear scaling of
broadcast
• Resilient
– Self-heal around failures
– Reintegrate recovered
resources
– Support quality of service
(QoS) levels
• Dynamically configurable
– Sense and adapt, user-
directable
– On-the-fly updates
• In-flight analytics
– User-defined workflows
– Distributed, hierarchical
analysis of sensor data to
identify RAS events
– Used by error manager
• Multi-fabric
– OOB Ethernet
– In-band fabric
– Auto-switchover for
resilience, QoS
• Open source (non-copy-left)
58
High-Level Architecture
RMQ ZMQ BTL OOB
SCON-MSG
TCPUDPIB UGNI SM USNIC TCP CUDA
SCON-ANALYTICS
FILTER AVG THRESHOLD
Portals4
SCIF
Send/Recv
WorkflowQoS
ACK NACK HYBRID
Inherit
Messaging System
• Message management level
o Manages assignment of messages (and/or fragments) to transport layers
o Detect/redirects messages upon transport failure
o Interfaces to transports
o Matching logic
• Transport level
o Byte-level message movement
o May fragment across wires within transport
o Per-message selection policies (system default, user specified, etc)
60
Quality of Service Controls
• Plugin architecture
o Selected per transport, requested quality of service
• ACK-based (cmd/ctrl)
o ACK each message, or window of messages, based on QoS
o Resend or return error – QoS specified policy and number of retries before giving up
• NACK-based (streaming)
o NACK if message sequence number is out of order indicating lost message(s)
o Request resend or return error, based on QoS
o May ACK after N messages
• Security level
o Connection authorization, encryption, …
Controls Overview: A Definition
• Resource Manager
o Workload manager
o Run-time environment
• Monitoring
• Resiliency/Error Management
• Overlay Network
• Pub-sub Network
• Console
Pub-Sub: Two Models
• Client-Server (MQ)
o Everyone publishes data to one or more servers
o Clients subscribe to server
o Server pushes data matching subscription to clients
o Option: server logs and provides history prior to subscription
• Broker (MRNet)
o Publishers register data with server
o Clients log request with server
o Server connects client to publisher
o Publisher directly provides data to clients
o Up to publisher to log, provide history
Pub-Sub: Two Models
• Client-Server (MQ)
o Everyone publishes data to one or more servers
o Clients subscribe to server
o Server pushes data matching subscription to clients
o Option: server logs and provides history prior to subscription
• Broker (MRNet)
o Publishers register data with server
o Clients log request with server
o Server connects client to publisher
o Publisher directly provides data to clients
o Up to publisher to log, provide history
Pub-Sub: Two Models
• Client-Server (MQ)
o Everyone publishes data to one or more servers
o Clients subscribe to server
o Server pushes data matching subscription to clients
o Option: server logs and provides history prior to subscription
• Broker (MRNet)
o Publishers register data with server
o Clients log request with server
o Server connects client to publisher
o Publisher directly provides data to clients
o Up to publisher to log, provide history
ORCM: Support Both Models
ORCM Pub-Sub Architecture
• Abstract, extendable APIs
o Utilize “attributes” to specify desired data and other parameters
o Allows each component to determine ability to support request
• Internal lightweight version
o Support for smaller systems
o When “good enough” is enough
o Broker and client-server architectures
• External heavyweight versions supported via plugins
o RabbitMQ
o MRNet
ORCM Pub/Sub API
• Required functions
o Advertise available event/data (publisher)
o Publish event/data (publish)
o Get catalog of published events (subscriber)
o Subscribe for event(s)/data (subscriber asynchronous callback based and
polling based)
o Unsubscribe for event/data (subscriber)
• Security – Access control for requested event/data
• Possible future functions
o QoS
o Statistics
Controls Overview: A Definition
• Resource Manager
o Workload manager
o Run-time environment
• Monitoring
• Resiliency/Error Management
• Overlay Network
• Pub-sub Network
• Console
Basic Approach
• Command-line interface
o Required for mid- and high-end systems
o Control on-the-fly adjustments, update/customize config
o Custom code
• GUI
o Required for some market segments
o Good for visualizing cluster state, data trends
o Provide centralized configuration management
o Integrate with existing open source solution
Configuration Management
• Central interface for all configuration
o Eliminate frustrating game of “whack-a-mole”
o Ensure consistent definition across system elements
o Output files/interface to controllers for individual software packages
• RM (queues), network, file system
• Database backend
o Provide historical tracking
o Multiple methods for access, editing
• Command line and GUI
o Dedicated config management tools
HPC Controls

Weitere ähnliche Inhalte

Was ist angesagt?

SERENE 2014 Workshop: Paper "Automatic Generation of Description Files for Hi...
SERENE 2014 Workshop: Paper "Automatic Generation of Description Files for Hi...SERENE 2014 Workshop: Paper "Automatic Generation of Description Files for Hi...
SERENE 2014 Workshop: Paper "Automatic Generation of Description Files for Hi...SERENEWorkshop
 
Row #9: An architecture overview of APNIC's RDAP deployment to the cloud
Row #9: An architecture overview of APNIC's RDAP deployment to the cloudRow #9: An architecture overview of APNIC's RDAP deployment to the cloud
Row #9: An architecture overview of APNIC's RDAP deployment to the cloudAPNIC
 
DPDK Architecture Musings - Andy Harvey
DPDK Architecture Musings - Andy HarveyDPDK Architecture Musings - Andy Harvey
DPDK Architecture Musings - Andy Harveyharryvanhaaren
 
Barak Perlman, ConteXtream - SFC (Service Function Chaining) Using Openstack ...
Barak Perlman, ConteXtream - SFC (Service Function Chaining) Using Openstack ...Barak Perlman, ConteXtream - SFC (Service Function Chaining) Using Openstack ...
Barak Perlman, ConteXtream - SFC (Service Function Chaining) Using Openstack ...Cloud Native Day Tel Aviv
 
Apache NiFi SDLC Improvements
Apache NiFi SDLC ImprovementsApache NiFi SDLC Improvements
Apache NiFi SDLC ImprovementsBryan Bende
 
Recent Advances in Machine Learning: Bringing a New Level of Intelligence to ...
Recent Advances in Machine Learning: Bringing a New Level of Intelligence to ...Recent Advances in Machine Learning: Bringing a New Level of Intelligence to ...
Recent Advances in Machine Learning: Bringing a New Level of Intelligence to ...Brocade
 
L4-L7 services for SDN and NVF by Youcef Laribi
L4-L7 services for SDN and NVF by Youcef LaribiL4-L7 services for SDN and NVF by Youcef Laribi
L4-L7 services for SDN and NVF by Youcef Laribibuildacloud
 
Architecture of OpenFlow SDNs
Architecture of OpenFlow SDNsArchitecture of OpenFlow SDNs
Architecture of OpenFlow SDNsUS-Ignite
 
Developing Enterprise Applications for the Cloud, from Monolith to Microservice
Developing Enterprise Applications for the Cloud, from Monolith to MicroserviceDeveloping Enterprise Applications for the Cloud, from Monolith to Microservice
Developing Enterprise Applications for the Cloud, from Monolith to MicroserviceJack-Junjie Cai
 
Generic Resource Manager - László Vadkerti, András Kovács
Generic Resource Manager - László Vadkerti, András KovácsGeneric Resource Manager - László Vadkerti, András Kovács
Generic Resource Manager - László Vadkerti, András Kovácsharryvanhaaren
 
LCA14: LCA14-209: ODP Project Update
LCA14: LCA14-209: ODP Project UpdateLCA14: LCA14-209: ODP Project Update
LCA14: LCA14-209: ODP Project UpdateLinaro
 
DEVNET-1175 OpenDaylight Service Function Chaining
DEVNET-1175	OpenDaylight Service Function ChainingDEVNET-1175	OpenDaylight Service Function Chaining
DEVNET-1175 OpenDaylight Service Function ChainingCisco DevNet
 
Dynamic Service Chaining
Dynamic Service Chaining Dynamic Service Chaining
Dynamic Service Chaining Tail-f Systems
 
Accelerating apache spark with rdma
Accelerating apache spark with rdmaAccelerating apache spark with rdma
Accelerating apache spark with rdmainside-BigData.com
 
Hotplug and Virtio - Tetsuya Mukawa
Hotplug and Virtio - Tetsuya MukawaHotplug and Virtio - Tetsuya Mukawa
Hotplug and Virtio - Tetsuya Mukawaharryvanhaaren
 
Learn more about the tremendous value Open Data Plane brings to NFV
Learn more about the tremendous value Open Data Plane brings to NFVLearn more about the tremendous value Open Data Plane brings to NFV
Learn more about the tremendous value Open Data Plane brings to NFVGhodhbane Mohamed Amine
 
Bringing Real-Time to the Enterprise with Hortonworks DataFlow
Bringing Real-Time to the Enterprise with Hortonworks DataFlowBringing Real-Time to the Enterprise with Hortonworks DataFlow
Bringing Real-Time to the Enterprise with Hortonworks DataFlowDataWorks Summit
 
Introduction to Beryllium release of OpenDaylight
Introduction to Beryllium release of OpenDaylightIntroduction to Beryllium release of OpenDaylight
Introduction to Beryllium release of OpenDaylightSDN Hub
 

Was ist angesagt? (20)

SDN Project PPT
SDN Project PPTSDN Project PPT
SDN Project PPT
 
SERENE 2014 Workshop: Paper "Automatic Generation of Description Files for Hi...
SERENE 2014 Workshop: Paper "Automatic Generation of Description Files for Hi...SERENE 2014 Workshop: Paper "Automatic Generation of Description Files for Hi...
SERENE 2014 Workshop: Paper "Automatic Generation of Description Files for Hi...
 
Row #9: An architecture overview of APNIC's RDAP deployment to the cloud
Row #9: An architecture overview of APNIC's RDAP deployment to the cloudRow #9: An architecture overview of APNIC's RDAP deployment to the cloud
Row #9: An architecture overview of APNIC's RDAP deployment to the cloud
 
FD.io - The Universal Dataplane
FD.io - The Universal DataplaneFD.io - The Universal Dataplane
FD.io - The Universal Dataplane
 
DPDK Architecture Musings - Andy Harvey
DPDK Architecture Musings - Andy HarveyDPDK Architecture Musings - Andy Harvey
DPDK Architecture Musings - Andy Harvey
 
Barak Perlman, ConteXtream - SFC (Service Function Chaining) Using Openstack ...
Barak Perlman, ConteXtream - SFC (Service Function Chaining) Using Openstack ...Barak Perlman, ConteXtream - SFC (Service Function Chaining) Using Openstack ...
Barak Perlman, ConteXtream - SFC (Service Function Chaining) Using Openstack ...
 
Apache NiFi SDLC Improvements
Apache NiFi SDLC ImprovementsApache NiFi SDLC Improvements
Apache NiFi SDLC Improvements
 
Recent Advances in Machine Learning: Bringing a New Level of Intelligence to ...
Recent Advances in Machine Learning: Bringing a New Level of Intelligence to ...Recent Advances in Machine Learning: Bringing a New Level of Intelligence to ...
Recent Advances in Machine Learning: Bringing a New Level of Intelligence to ...
 
L4-L7 services for SDN and NVF by Youcef Laribi
L4-L7 services for SDN and NVF by Youcef LaribiL4-L7 services for SDN and NVF by Youcef Laribi
L4-L7 services for SDN and NVF by Youcef Laribi
 
Architecture of OpenFlow SDNs
Architecture of OpenFlow SDNsArchitecture of OpenFlow SDNs
Architecture of OpenFlow SDNs
 
Developing Enterprise Applications for the Cloud, from Monolith to Microservice
Developing Enterprise Applications for the Cloud, from Monolith to MicroserviceDeveloping Enterprise Applications for the Cloud, from Monolith to Microservice
Developing Enterprise Applications for the Cloud, from Monolith to Microservice
 
Generic Resource Manager - László Vadkerti, András Kovács
Generic Resource Manager - László Vadkerti, András KovácsGeneric Resource Manager - László Vadkerti, András Kovács
Generic Resource Manager - László Vadkerti, András Kovács
 
LCA14: LCA14-209: ODP Project Update
LCA14: LCA14-209: ODP Project UpdateLCA14: LCA14-209: ODP Project Update
LCA14: LCA14-209: ODP Project Update
 
DEVNET-1175 OpenDaylight Service Function Chaining
DEVNET-1175	OpenDaylight Service Function ChainingDEVNET-1175	OpenDaylight Service Function Chaining
DEVNET-1175 OpenDaylight Service Function Chaining
 
Dynamic Service Chaining
Dynamic Service Chaining Dynamic Service Chaining
Dynamic Service Chaining
 
Accelerating apache spark with rdma
Accelerating apache spark with rdmaAccelerating apache spark with rdma
Accelerating apache spark with rdma
 
Hotplug and Virtio - Tetsuya Mukawa
Hotplug and Virtio - Tetsuya MukawaHotplug and Virtio - Tetsuya Mukawa
Hotplug and Virtio - Tetsuya Mukawa
 
Learn more about the tremendous value Open Data Plane brings to NFV
Learn more about the tremendous value Open Data Plane brings to NFVLearn more about the tremendous value Open Data Plane brings to NFV
Learn more about the tremendous value Open Data Plane brings to NFV
 
Bringing Real-Time to the Enterprise with Hortonworks DataFlow
Bringing Real-Time to the Enterprise with Hortonworks DataFlowBringing Real-Time to the Enterprise with Hortonworks DataFlow
Bringing Real-Time to the Enterprise with Hortonworks DataFlow
 
Introduction to Beryllium release of OpenDaylight
Introduction to Beryllium release of OpenDaylightIntroduction to Beryllium release of OpenDaylight
Introduction to Beryllium release of OpenDaylight
 

Ähnlich wie HPC Controls Future

3.2 Streaming and Messaging
3.2 Streaming and Messaging3.2 Streaming and Messaging
3.2 Streaming and Messaging振东 刘
 
HPC Resource Management: Futures
HPC Resource Management: FuturesHPC Resource Management: Futures
HPC Resource Management: Futuresrcastain
 
ONOS Platform Architecture
ONOS Platform ArchitectureONOS Platform Architecture
ONOS Platform ArchitectureOpenDaylight
 
Building Modern Digital Services on Scalable Private Government Infrastructur...
Building Modern Digital Services on Scalable Private Government Infrastructur...Building Modern Digital Services on Scalable Private Government Infrastructur...
Building Modern Digital Services on Scalable Private Government Infrastructur...Andrés Colón Pérez
 
Adding Real-time Features to PHP Applications
Adding Real-time Features to PHP ApplicationsAdding Real-time Features to PHP Applications
Adding Real-time Features to PHP ApplicationsRonny López
 
Realtime traffic analyser
Realtime traffic analyserRealtime traffic analyser
Realtime traffic analyserAlex Moskvin
 
Building high performance microservices in finance with Apache Thrift
Building high performance microservices in finance with Apache ThriftBuilding high performance microservices in finance with Apache Thrift
Building high performance microservices in finance with Apache ThriftRX-M Enterprises LLC
 
haproxy-150423120602-conversion-gate01.pdf
haproxy-150423120602-conversion-gate01.pdfhaproxy-150423120602-conversion-gate01.pdf
haproxy-150423120602-conversion-gate01.pdfPawanVerma628806
 
Apache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseApache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseNick Dimiduk
 
Load Balancing
Load BalancingLoad Balancing
Load Balancingoptalink
 
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUpStrimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUpJosé Román Martín Gil
 
PMIx: Bridging the Container Boundary
PMIx: Bridging the Container BoundaryPMIx: Bridging the Container Boundary
PMIx: Bridging the Container Boundaryrcastain
 
PMIx Updated Overview
PMIx Updated OverviewPMIx Updated Overview
PMIx Updated OverviewRalph Castain
 
Apache Thrift, a brief introduction
Apache Thrift, a brief introductionApache Thrift, a brief introduction
Apache Thrift, a brief introductionRandy Abernethy
 
Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013
Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013
Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013Amazon Web Services
 
Introduction to Apache Mesos and DC/OS
Introduction to Apache Mesos and DC/OSIntroduction to Apache Mesos and DC/OS
Introduction to Apache Mesos and DC/OSSteve Wong
 
SDN Architecture & Ecosystem
SDN Architecture & EcosystemSDN Architecture & Ecosystem
SDN Architecture & EcosystemKingston Smiler
 

Ähnlich wie HPC Controls Future (20)

3.2 Streaming and Messaging
3.2 Streaming and Messaging3.2 Streaming and Messaging
3.2 Streaming and Messaging
 
HPC Resource Management: Futures
HPC Resource Management: FuturesHPC Resource Management: Futures
HPC Resource Management: Futures
 
ONOS Platform Architecture
ONOS Platform ArchitectureONOS Platform Architecture
ONOS Platform Architecture
 
Building Modern Digital Services on Scalable Private Government Infrastructur...
Building Modern Digital Services on Scalable Private Government Infrastructur...Building Modern Digital Services on Scalable Private Government Infrastructur...
Building Modern Digital Services on Scalable Private Government Infrastructur...
 
Adding Real-time Features to PHP Applications
Adding Real-time Features to PHP ApplicationsAdding Real-time Features to PHP Applications
Adding Real-time Features to PHP Applications
 
Scalable Web Apps
Scalable Web AppsScalable Web Apps
Scalable Web Apps
 
Realtime traffic analyser
Realtime traffic analyserRealtime traffic analyser
Realtime traffic analyser
 
Building high performance microservices in finance with Apache Thrift
Building high performance microservices in finance with Apache ThriftBuilding high performance microservices in finance with Apache Thrift
Building high performance microservices in finance with Apache Thrift
 
haproxy-150423120602-conversion-gate01.pdf
haproxy-150423120602-conversion-gate01.pdfhaproxy-150423120602-conversion-gate01.pdf
haproxy-150423120602-conversion-gate01.pdf
 
HAProxy
HAProxy HAProxy
HAProxy
 
Apache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseApache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBase
 
Software Defined Networking: Primer
Software Defined Networking: Primer Software Defined Networking: Primer
Software Defined Networking: Primer
 
Load Balancing
Load BalancingLoad Balancing
Load Balancing
 
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUpStrimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
 
PMIx: Bridging the Container Boundary
PMIx: Bridging the Container BoundaryPMIx: Bridging the Container Boundary
PMIx: Bridging the Container Boundary
 
PMIx Updated Overview
PMIx Updated OverviewPMIx Updated Overview
PMIx Updated Overview
 
Apache Thrift, a brief introduction
Apache Thrift, a brief introductionApache Thrift, a brief introduction
Apache Thrift, a brief introduction
 
Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013
Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013
Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013
 
Introduction to Apache Mesos and DC/OS
Introduction to Apache Mesos and DC/OSIntroduction to Apache Mesos and DC/OS
Introduction to Apache Mesos and DC/OS
 
SDN Architecture & Ecosystem
SDN Architecture & EcosystemSDN Architecture & Ecosystem
SDN Architecture & Ecosystem
 

Mehr von rcastain

PMIx Tiered Storage Support
PMIx Tiered Storage SupportPMIx Tiered Storage Support
PMIx Tiered Storage Supportrcastain
 
SC'16 PMIx BoF Presentation
SC'16 PMIx BoF PresentationSC'16 PMIx BoF Presentation
SC'16 PMIx BoF Presentationrcastain
 
EuroMPI 2017 PMIx presentation
EuroMPI 2017 PMIx presentationEuroMPI 2017 PMIx presentation
EuroMPI 2017 PMIx presentationrcastain
 
SC'17 BoF Presentation
SC'17 BoF PresentationSC'17 BoF Presentation
SC'17 BoF Presentationrcastain
 
PMIx: Debuggers and Fabric Support
PMIx: Debuggers and Fabric SupportPMIx: Debuggers and Fabric Support
PMIx: Debuggers and Fabric Supportrcastain
 
SC'18 BoF Presentation
SC'18 BoF PresentationSC'18 BoF Presentation
SC'18 BoF Presentationrcastain
 

Mehr von rcastain (6)

PMIx Tiered Storage Support
PMIx Tiered Storage SupportPMIx Tiered Storage Support
PMIx Tiered Storage Support
 
SC'16 PMIx BoF Presentation
SC'16 PMIx BoF PresentationSC'16 PMIx BoF Presentation
SC'16 PMIx BoF Presentation
 
EuroMPI 2017 PMIx presentation
EuroMPI 2017 PMIx presentationEuroMPI 2017 PMIx presentation
EuroMPI 2017 PMIx presentation
 
SC'17 BoF Presentation
SC'17 BoF PresentationSC'17 BoF Presentation
SC'17 BoF Presentation
 
PMIx: Debuggers and Fabric Support
PMIx: Debuggers and Fabric SupportPMIx: Debuggers and Fabric Support
PMIx: Debuggers and Fabric Support
 
SC'18 BoF Presentation
SC'18 BoF PresentationSC'18 BoF Presentation
SC'18 BoF Presentation
 

Kürzlich hochgeladen

Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 

Kürzlich hochgeladen (20)

Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 

HPC Controls Future

  • 1. HPC Controls: View to the Future Ralph H. Castain June, 2015
  • 2. This Presentation: A Caveat • One person’s view into the future o Where I am leading the open source community o Compiled from presentations and emails with that community, the national labs, various corporations o Spans last 20+ years • Not what I am expecting any particular entity to do
  • 3. Controls Overview: A Definition • Resource Manager o Workload manager o Run-time environment • Monitoring • Resiliency/Error Management • Overlay Network • Pub-sub Network • Console
  • 4. Controls Overview: A Definition • Resource Manager o Workload manager o Run-time environment • Monitoring • Resiliency/Error Management • Overlay Network • Pub-sub Network • Console Open Ecosystems Today • Every element is independent island, each with its own community • Multiple programming languages • Conflicting licensing • Cross-element interactions, where they exist, are via text-based messaging
  • 5. General Requirements • Scalable to exascale levels & beyond – Better-than-linear scaling – Constrained memory footprint • Dynamically configurable – Sense and adapt, user- directable – On-the-fly updates • Open source (non-copy-left) • Maintainable, flexible – Single platform that can be utilized to build multiple tools – Existing ecosystem • Resilient – Self-heal around failures – Reintegrate recovered resources 5
  • 6. Chosen Software Platform • Demonstrated scalability • Established community • Clean non-copy-left licensing • Compatible architecture • … 6 Open Resilient Cluster Manager (ORCM) [for reference implementation]
  • 7. 2003 - present 7 Open MPI OpenRTE Cisco Intel 10s-100s of Knodes (80K in production) EMC Enterprise Router Cluster Monitor SCON/RM OMPI/ORTE ORCM/SCON FDDP 1994-2003
  • 8. Abstractions • Divide functional blocks into abstract frameworks o Standardize the API for each identified functional area o Can be used external, or dedicated for internal use o Single-select: pick active component at startup o Multi-select: dynamically decide for each execution • Multiple implementations o Each in its own isolated plugin o Fully implement the API o Base “component” holds common functions
  • 9. Example: SCalable Overlay Network (SCON) RMQ ZMQ BTL OOB SCON-MSG TCPUDPIB UGNI SM USNIC TCP CUDAPortals4 SCIF Send/Recv
  • 10. Example: SCalable Overlay Network (SCON) 10 RMQ ZMQ BTL OOB SCON-MSG TCPUDPIB UGNI SM USNIC TCP CUDAPortals4 SCIF Send/Recv Inherit
  • 11. ORCM and Plug-ins • Plug-ins are shared libraries o Central set of plug-ins in installation tree o Users can also have plug-ins under $HOME o Proprietary binary plugins picked up at runtime • Can add / remove plug-ins after install o No need to recompile / re-link apps o Download / install new plug-ins o Develop new plug-ins safely • Update “on-the-fly” o Add, update plug-ins while running o Frameworks “pause” during update
  • 12. Controls Overview: A Definition • Resource Manager o Workload manager o Run-time environment • Monitoring • Resiliency/Error Management • Overlay Network • Pub-sub Network • Console
  • 13. Definition: RM • Scheduler/Workload Manager o Allocates resources to session o Interactive and batch • Run-Time Environment o Launch and monitor applications o Support inter-process communication wireup o Serve as intermediary between applications and WM • Dynamic resource requests • Error notification o Implement failure policies 13
  • 14. Breaking it Down • Workload Manager o Dedicated framework o Plugins for two-way integration to external WM (Moab, Cobalt) o Plugins for implementing internal WM (FIFO) • Run-Time Environment o Broken down into functional blocks, each with own framework • Loosely divided into three general categories • Messaging, launch, error handling • One or more frameworks for each category o Knitted together via “state machine” • Event-driven, async • Each functional block can be separate thread • Each plugin within each block can be separate thread(s)
  • 15. Key Objectives for Future • Orchestration via integration o Monitoring system, file system, network, facility, console • Power control/management • “Instant On” • Application interface o Request RM actions o Receive RM notifications o Fault tolerance • Advanced workload management algorithms o Alternative programming model support (Hadoop, Cloud) o Anticipatory scheduling
  • 19. Flexible Architecture • Each tool built on top of same plugin system o Different combinations of frameworks o Different plugins activated to play different roles o Example: orcmd on compute node vs on rack/row controllers • Designed for distributed, centralized, hybrid operations o Centralized for small clusters o Hybrid for larger clusters o Example: centralized scheduler, distributed “worker-bees” • Accessible to users for interacting with RM o Add shim libraries (abstract, public APIs) to access framework APIs o Examples: SCON, pub-sub, in-flight analytics
  • 20. Code Reuse: Scheduler sched moab cobalt FIFO orcmd DB db pwmgt cfgi pvn
  • 21. File System Integration • Input o Current data location, time to retrieve and position o Data the application intends to use • Job submission, dynamic request (via RTE), persistence across jobs and sessions o What OS/libraries application requires • Workload Manager o Scheduling algorithm factors • Current provisioning map, NVM usage patterns/requests, persistent data/ckpt location • Data/library locality, loading time • Output o Pre-location requirements to file system (e.g., SPINDLE cache tree) o Hot/warm/cold data movement, persistence directives to RTE o NVM & Burst Buffer allocations (job submission, dynamic)
  • 22. Provisioning Integration • Support multiple provisioning agents o Warewulf, xCAT, ROCKS • Support multiple provisioning modes o Bare metal, Virtual Machines (VMs) • Job submission includes desired provisioning o Scheduler can include provisioning time in allocation decision, anticipate re-use of provisioned nodes o Scheduler notifies provisioning agent of what nodes to provision o Provisioning agent notifies RTE when provisioning complete so launch can proceed
  • 23. Network Integration • Quality of service allocations o Bandwidth, traffic priority, power constraints o Specified at job submission, dynamic request (via RTE) • Security requirements o Network domain definitions • State-of-health o Monitored by monitoring system o Reported to RM for response • Static endpoint o Allocate application endpoints prior to launch • Minimize startup time o Update process location upon fault recovery
  • 24. Power Control/Management • Power/heat-aware scheduling o Specified cluster-level power cap, ramp up/down rate limits o Specified thermal limits (ramp up/down rates, level) o Node-level idle power, shutdown policies between sessions/time-of-day • Site-level coordination o Heat and power management subsystem • Consider system capacity in scheduling • Provide load anticipation levels to site controllers o Coordinate ramps (job launch/shutdown) • Within cluster, across site o Receive limit (high/low) updates o Direct RTE to adjust controls while maintaining sync across cluster
  • 25. “Instant On” • Objective o Reduce startup from minutes to seconds o 1M procs, 50k nodes thru MPI_Init • On track: 20 sec • 2018 target: 5 sec • What we require o Use HSN for communication during launch o Static endpoint allocation prior to launch o Integrate file system with RM for prepositioning, with scheduler for anticipatory locations 25 Today: ~15-30 min
  • 26. “Instant On” Value-Add • Requires two things o Process can compute endpoint of any remote process o Hardware can reserve and translate virtual endpoints to local ones • RM knows process map o Assign endpoint info for each process o Communicate map to local processes at initialization • Programming libraries o Compute connection info from map • Eliminates costly sharing of endpoint info
  • 27. Application Interface: PMIx • Current PMI implementations are limited o Only used for MPI wireup o Don’t scale adequately • Communication required for every piece of data • All blocking operations o Licensing and desire for standalone client library • Increasing requests for app-RM interactions o Job spawn, data pre-location, power control o Current approach is fragmented • Every RM creating its own APIs
  • 28. Application Interface: PMIx • Current PMI implementations are limited o Only used for MPI wireup o Don’t scale adequately • Communication required for every piece of data • All blocking operations o Licensing and desire for standalone client library • Increasing requests for app-RM interactions o Job spawn, data pre-location, power control o Current approach is fragmented • Every RM creating its own APIs
  • 29. PMIx Approach • Ease adoption o Backwards compatible to PMI-1/2 APIs o Standalone client library o Server convenience library • BSD-licensed • Replace blocking with non-blocking operations for scalability • Add APIs to support o New use-cases: IO, power, error notification, checkpoint, … o Programming models beyond MPI First time
  • 30. PMIx: Fault Tolerance • Notification o App can register for error notifications, incipient faults o RM will notify when app would be impacted • Notify procs, system monitor, user as directed (e.g., email, tweet) o App responds with desired action • Terminate/restart job, wait for checkpoint, etc. • Checkpoint support o Timed or on-demand, binary or SCR o BB allocations, bleed/storage directives across hierarchical storage • Restart support o From remote NVM checkpoint, relocate checkpoint, etc.
  • 31. PMIx: Status • Version 1.0 release o Preliminary version for commentary • Version 1.1 release o Production version o Scheduled for release by Supercomputing 2015 • Server integrations underway o SLURM o ORCM • PGAS/GASNet integration underway o Extend support for that programming model
  • 32. Alternative Models: Supporting Hadoop & Cloud • HPC o Optimize versus performance, single-tenancy • Hadoop support o Optimize versus data locality o Multi-tenancy o Pre-location of data to NVM, pre-staging from data store o Dynamic allocation requests • Cloud = capacity computing o Optimize versus cost o Multi-tenancy o Provisioning and security support for VMs
  • 33. Workload Manager: Job Description Language • Complexity of describing job is growing o Power, file/lib positioning o Performance vs capacity, programming model o System, project, application-level defaults • Provide templates? o System defaults, with modifiers • --hadoop:mapper=foo,reducer=bar o User-defined • Application templates • Shared, group templates o Markup language definition of behaviors, priorities 30 yrs
  • 34. Anticipatory Scheduling: Key to Success • Current schedulers are reactionary o Compute next allocation when prior one completes o Mandates fast algorithm to reduce dead time o Limit what can be done in terms of pre-loading etc. as they add to dead time • Anticipatory scheduler o Look ahead and predict range of potential schedules o Instantly select “best” when prior one completes o Support data and binary prepositioning in anticipation of most likely option o Improved optimization of resources as pressure on algorithm computational time is relieved 34
  • 35. Anticipatory Scheduler Console File System Resource Manager Overlay Network Prov. Agent Retrieve reqd files from backend storage Cache images to staging points Projected schedule
  • 36. Controls Overview: A Definition • Resource Manager o Workload manager o Run-time environment • Monitoring • Resiliency/Error Management • Overlay Network • Pub-sub Network • Console
  • 37. Key Functional Requirements • Support all available data collection sensors o Environmental (junction temps, power) o Process usage statistics (cpu, memory, disk, network) o MCA, BMC events • Support variety of backend databases • Admin configuration o Select which sensors, how often sampled, how often reported, when and where data is stored o Severity/priority/definition of events o Local admin customizes config, on-the-fly changes
  • 38. Monitoring System sensors diag orcmd Data Inventory Diagnostics B M C Compute Node Launches/Monitors Apps sensors diag orcmd S C O N Rack Controller Data Inventory db DB SCON Agg Data IO/Pvn Caching
  • 39. Sensor Framework • Each sensor can operate in its own time base o Sample at independent rate o Output collected in common framework-level bucket o Can trigger immediate send of bucket for critical events • Separate reporting time base o Send bucket to aggregator node at scheduled intervals o Aggregator receives bucket in sensor/base, extracts contributions from each sensor, passes to that sensor module for unpacking/recording Freq JnTemp Power Base Sensor BMC/IPMI
  • 40. Inventory Collection/Tracking • Performed at boot o Update on-demand, warm change detected • Data obtained from three sources o HWLOC topology collection (processors, memory, disks, NICs) o Sensor components (BMC/IPMI) o Network fabric managers (switches/routers) • Store current inventory and updates o Track prior locations of each FRU o Correlate inventory to RAS events o Detect inadvertent return-to-service of failed units
  • 41. Database Framework • Database commands go to base “stub” functions o Cycle across active modules until one acknowledges handling request o API provides “attributes” to specify database, table, etc. • Modules check attributes to determine ability to handle o Plugins call schema framework to format data • Multi-threading supported o Each command can be processed by separate thread o Each database module can process requests using multiple threads postgres ODBC Base db db2lite
  • 42. Diagnostics: Another Framework • Test the system on demand o Checks for running jobs and returns error o Execute at boot, when hardware changed, if errors reported, … • Each plugin tests specific area o Data reported to database o Tied to FRUs as well as overall node cpu eth Base diag mem
  • 43. Controls Overview: A Definition • Resource Manager o Workload manager o Run-time environment • Monitoring • Resiliency/Error Management • Overlay Network • Pub-sub Network • Console
  • 44. Planned Progression Monitoring Fault detection (RAS event generation) Fault diagnosis Fault prediction
  • 45. Planned Progression Monitoring Fault detection (RAS event generation) Fault diagnosis Fault prediction FDDP Action Response
  • 47. Workflow Elements • Average (window, running, etc.) • Rate (convert incoming data to events/sec) • Threshold (high, low) • Filter o Selects input values based on provided params • RAS event o Generates a RAS event corresponding to input description • Publish data
  • 48. Analytics • Execute on aggregator nodes for in-flight reduction o Sys admin defines, user can define (if permitted) • Event-based state machine o Each workflow in own thread, own instance of each plugin o Branch and merge of workflow o Tap stream between workflow steps o Tap data streams (sensors, others) • Event generation o Generate events/alarms o Specify data to be included (window) 48
  • 49. Distributed Architecture • Hierarchical, distributed approach for unlimited scalability o Utilize daemons on rack/row controllers • Analysis done at each level of the hierarchy o Support rapid response to critical events o Distribute processing load o Minimize data movement • RM’s error manager framework controls response o Based on specified policies
  • 50. Fault Diagnosis • Identify root cause and location o Sometimes obvious – e.g., when direct measurement o Other times non-obvious • Multiple cascading impacts • Cause identified by multi-sensor correlations (indirect measurement) • Direct measurement yields early report of non-root cause • Example: power supply fails due to borderline cooling + high load • Estimate severity o Safety issue, long-term damage, imminent failure • Requires in-depth understanding of hardware
  • 51. Fault Prediction: Methodology • Exploit access to internals o Investigate optimal location, number of sensors o Embed intelligence, communications capability • Integrate data from all available sources o Engineering design tests o Reliability life tests o Production qualification tests • Utilize learning algorithms to improve performance o Both embedded, post process o Seed with expert knowledge
  • 52. Fault Prediction: Outcomes • Continuous update of mean-time-to-preventative-maintenance o Feed into projected downtime planning o Incorporate into scheduling algo • Alarm reports for imminent failures o Notify impacted sessions/applications o Plan/execute preemptive actions • Store predictions o Algorithm improvement
  • 53. Error Manager • Log errors for reporting, future analysis • Primary responsibility: fault response o Contains defined response for given types of faults o Responds to faults by shifting resources, processes • Secondary responsibility: resilience strategy o Continuously update and define possible response options o Fault prediction triggers pre-emptive action • Select various response strategies via component o Run-time, configuration, or on-the-fly command
  • 54. Example: Network Communications Fault tolerant • Modify run-time to avoid automatic abort upon loss of communication • Detect failure of any on-going communication • Reconnect quickly • Resend lost messages from me • Request resend of lost messages to me Network interface card repeatedly loses connection or shows data loss Resilient • Estimate probability of failure during session – Failure history – Monitor internals • Temperature, … • Take preemptive action – Reroute messages via alternative transports – Coordinated move of processes to another node
  • 55. Node fails during execution of high- priority application Example: Node Failure Fault tolerant • Detect failure of process(es) on that node • Find last checkpoint • Restart processes on another node at the checkpoint state – May (likely) require restarting all processes at same state • Resend any lost messages Resilient • Monitor state-of-health of node – Temperature of key components, other signatures • Estimate probability of failure during future time interval(s) • Take preemptive action – Direct checkpoint/save – Coordinate move with application • Avoids potential need to reset ALL processes back to earlier state!
  • 56. Controls Overview: A Definition • Resource Manager o Workload manager o Run-time environment • Monitoring • Resiliency/Error Management • Overlay Network • Pub-sub Network • Console
  • 57. Definition: Overlay Network • Messaging system o Scalable/resilient communications o Integration-friendly with user applications, system management software o Quality of service support • In-flight analytics o Insert analytic workflows anywhere in the data stream o Tap data stream at any point o Generate RAS events from data stream 57
  • 58. Requirements • Scalable to exascale levels – Better-than-linear scaling of broadcast • Resilient – Self-heal around failures – Reintegrate recovered resources – Support quality of service (QoS) levels • Dynamically configurable – Sense and adapt, user- directable – On-the-fly updates • In-flight analytics – User-defined workflows – Distributed, hierarchical analysis of sensor data to identify RAS events – Used by error manager • Multi-fabric – OOB Ethernet – In-band fabric – Auto-switchover for resilience, QoS • Open source (non-copy-left) 58
  • 59. High-Level Architecture RMQ ZMQ BTL OOB SCON-MSG TCPUDPIB UGNI SM USNIC TCP CUDA SCON-ANALYTICS FILTER AVG THRESHOLD Portals4 SCIF Send/Recv WorkflowQoS ACK NACK HYBRID Inherit
  • 60. Messaging System • Message management level o Manages assignment of messages (and/or fragments) to transport layers o Detect/redirects messages upon transport failure o Interfaces to transports o Matching logic • Transport level o Byte-level message movement o May fragment across wires within transport o Per-message selection policies (system default, user specified, etc) 60
  • 61. Quality of Service Controls • Plugin architecture o Selected per transport, requested quality of service • ACK-based (cmd/ctrl) o ACK each message, or window of messages, based on QoS o Resend or return error – QoS specified policy and number of retries before giving up • NACK-based (streaming) o NACK if message sequence number is out of order indicating lost message(s) o Request resend or return error, based on QoS o May ACK after N messages • Security level o Connection authorization, encryption, …
  • 62. Controls Overview: A Definition • Resource Manager o Workload manager o Run-time environment • Monitoring • Resiliency/Error Management • Overlay Network • Pub-sub Network • Console
  • 63. Pub-Sub: Two Models • Client-Server (MQ) o Everyone publishes data to one or more servers o Clients subscribe to server o Server pushes data matching subscription to clients o Option: server logs and provides history prior to subscription • Broker (MRNet) o Publishers register data with server o Clients log request with server o Server connects client to publisher o Publisher directly provides data to clients o Up to publisher to log, provide history
  • 64. Pub-Sub: Two Models • Client-Server (MQ) o Everyone publishes data to one or more servers o Clients subscribe to server o Server pushes data matching subscription to clients o Option: server logs and provides history prior to subscription • Broker (MRNet) o Publishers register data with server o Clients log request with server o Server connects client to publisher o Publisher directly provides data to clients o Up to publisher to log, provide history
  • 65. Pub-Sub: Two Models • Client-Server (MQ) o Everyone publishes data to one or more servers o Clients subscribe to server o Server pushes data matching subscription to clients o Option: server logs and provides history prior to subscription • Broker (MRNet) o Publishers register data with server o Clients log request with server o Server connects client to publisher o Publisher directly provides data to clients o Up to publisher to log, provide history ORCM: Support Both Models
  • 66. ORCM Pub-Sub Architecture • Abstract, extendable APIs o Utilize “attributes” to specify desired data and other parameters o Allows each component to determine ability to support request • Internal lightweight version o Support for smaller systems o When “good enough” is enough o Broker and client-server architectures • External heavyweight versions supported via plugins o RabbitMQ o MRNet
  • 67. ORCM Pub/Sub API • Required functions o Advertise available event/data (publisher) o Publish event/data (publish) o Get catalog of published events (subscriber) o Subscribe for event(s)/data (subscriber asynchronous callback based and polling based) o Unsubscribe for event/data (subscriber) • Security – Access control for requested event/data • Possible future functions o QoS o Statistics
  • 68. Controls Overview: A Definition • Resource Manager o Workload manager o Run-time environment • Monitoring • Resiliency/Error Management • Overlay Network • Pub-sub Network • Console
  • 69. Basic Approach • Command-line interface o Required for mid- and high-end systems o Control on-the-fly adjustments, update/customize config o Custom code • GUI o Required for some market segments o Good for visualizing cluster state, data trends o Provide centralized configuration management o Integrate with existing open source solution
  • 70. Configuration Management • Central interface for all configuration o Eliminate frustrating game of “whack-a-mole” o Ensure consistent definition across system elements o Output files/interface to controllers for individual software packages • RM (queues), network, file system • Database backend o Provide historical tracking o Multiple methods for access, editing • Command line and GUI o Dedicated config management tools
  • 71.
  • 72.
  • 73.