Bright Cluster Manager is a comprehensive, integrated management solution for parallel computing resources, today and in the future. It provisions, monitors, and manages heterogeneous computing resources, including systems, storage, and interconnects. It provides a graphical user interface and a command-line shell, and a single GUI can manage multiple clusters and clouds simultaneously. It simplifies development by providing tools, libraries, and workload management. It integrates Intel Xeon Phi coprocessors by packaging all necessary software and allowing them to be configured, controlled, and monitored through the management interface. It also runs health checks on Xeon Phi coprocessors and schedules jobs only to nodes that pass them.
Bright Cluster Manager: A Comprehensive, Integrated Management Solution for Parallel Universes Today and Tomorrow
1. Bright Cluster Manager
A Comprehensive, Integrated Management Solution for Parallel Universes Today and Tomorrow
Ian Lumb
Bright Evangelist
3. In My Parallel Universe …
In my parallel universe, parallel computing at extreme scale is easy!
• Scientists focus on science, engineers on engineering
• No problem is out of computational reach
• Coding has been deprecated!
– Problems are stated in the natural language of the discipline
» Implementation suggestions/guidelines are optional
– 'Heuristic algorithms' take care of the implementation specifics (i.e., the coding)
• Resources are plentiful!
– Physical constraints (e.g., power, cooling & space) have been eliminated
– Generic processors to specialized coprocessors are readily available
– Resource management is completely transparent
4. Parallel Computing via Bright Cluster Manager
Provisions, monitors and manages all neo-heterogeneous resources
• Systems, storage, interconnects, etc.
Management, parallelized
• Adaptive provisioning in real time
• Topologically based monitoring
• Fault tolerance via high availability
• One GUI for multiple clusters and clouds
Development simplified
• Tools and libraries available
• Workloads managed
6. Bright Cluster Manager — Elements
[Architecture diagram; component labels only]
• Interfaces: Cluster Management GUI, User Portal, Cluster Management Shell
• Communication and security: SSL / SOAP / X509 / IPtables
• Cluster Management Daemon: Provisioning, Monitoring, Automation, Health Checks, Management
• Workload managers: SLURM, Torque/Maui, Torque/MOAB, PBS Pro, Grid Engine, LSF
• Development tools: Compilers, Libraries, Debuggers, Profilers
• Hardware: PDU, IPMI/iLO, Interconnect, Ethernet, Disk, Memory, MIC, CPU
• Operating systems: SLES / RHEL / CentOS / SL
• ScaleMP vSMP
7. Management Interface
Graphical User Interface (GUI)
• Offers administrator full cluster control
• Standalone desktop application
• Manages multiple clusters simultaneously
• Runs natively on Linux, Windows and MacOS
Cluster Management Shell (CMSH)
• All GUI functionality also available through Cluster Management Shell
• Interactive and scriptable in batch mode
9. Intel Xeon Phi Integration
Everything needed to enable Xeon Phi on a cluster is packaged as easy-to-install Bright packages:
• Xeon Phi driver
• Xeon Phi runtime
• Xeon Phi SDK
• Xeon Phi OFED
• Xeon Phi flash utilities
Environment modules ensure that user environment is set up perfectly (PATH, LD_LIBRARY_PATH, ...)
Xeon Phi driver recompiled automatically against running kernel at boot-time
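To illustrate what "setting up the user environment" amounts to, here is a minimal Python sketch of the effect of loading an environment module: tool and library directories are prepended to PATH and LD_LIBRARY_PATH. The install prefix `/opt/example/mic` and the function names are hypothetical, chosen only for illustration; they are not the paths or tooling that Bright ships.

```python
import os

# Hypothetical install prefix, used only for illustration.
MIC_PREFIX = "/opt/example/mic"


def prepend_path(env, var, directory):
    """Return a copy of env with directory placed first in the path list var."""
    # Drop empty entries and any earlier copy of the directory to avoid duplicates.
    parts = [p for p in env.get(var, "").split(os.pathsep) if p and p != directory]
    updated = dict(env)
    updated[var] = os.pathsep.join([directory] + parts)
    return updated


def load_mic_module(env):
    """Mimic loading a module: prepend the tool and library directories."""
    env = prepend_path(env, "PATH", f"{MIC_PREFIX}/bin")
    env = prepend_path(env, "LD_LIBRARY_PATH", f"{MIC_PREFIX}/lib")
    return env


if __name__ == "__main__":
    base = {"PATH": "/usr/bin", "LD_LIBRARY_PATH": ""}
    for name, value in load_mic_module(base).items():
        print(f"{name}={value}")
```

Because earlier copies of a directory are removed before prepending, loading the same module twice leaves the environment unchanged.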
10. Intel Xeon Phi Integration
Set-up wizard takes care of initial Xeon Phi configuration (e.g. creating bridge interfaces, assigning IP addresses)
Xeon Phi appears as a first-class device type in cluster management infrastructure
Xeon Phi can be configured, controlled and monitored through CMSH and CMGUI
Xeon Phi is automatically added to the workload management system as a consumable resource
Compute jobs may request Xeon Phi resource in job script
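The idea of a consumable resource can be sketched in a few lines of Python: each node advertises a count of coprocessors, a job request reduces the free count, and finishing a job returns the cards. The `Node` class, the first-fit policy and the function names are assumptions made for this illustration; real workload managers implement this accounting internally and with their own policies.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    mic_total: int        # coprocessors installed in the node
    mic_in_use: int = 0   # coprocessors currently held by jobs

    @property
    def mic_free(self):
        return self.mic_total - self.mic_in_use


def allocate(nodes, mic_requested):
    """Reserve coprocessors on the first node with enough free; return it or None."""
    for node in nodes:
        if node.mic_free >= mic_requested:
            node.mic_in_use += mic_requested
            return node
    return None


def release(node, mic_count):
    """Return coprocessors to the pool when a job finishes."""
    node.mic_in_use = max(0, node.mic_in_use - mic_count)


if __name__ == "__main__":
    cluster = [Node("node001", 1), Node("node002", 2)]
    for request in (1, 2, 1):
        placed = allocate(cluster, request)
        print(request, "->", placed.name if placed else "queued")
```

A request that cannot be satisfied returns `None`, which stands in for the job waiting in the queue until a coprocessor is released.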
16. Architecture — Monitoring
[Diagram of the Bright Cluster monitoring architecture. Labels: compute nodes node001, node002 and node003, each with a BMC, send metrics to CMDaemon on the head node; the head node holds raw data and consolidated data; the Cluster Management GUI, Cluster Management Shell, Web-Based User Portal and Third-Party Applications receive data from CMDaemon.]
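The slide distinguishes raw data from consolidated data without saying how one becomes the other. As a hedged sketch, assume consolidation means averaging raw samples over fixed-width time intervals; the `consolidate` function and the averaging policy are assumptions for illustration, not a description of CMDaemon internals.

```python
from collections import defaultdict


def consolidate(samples, interval):
    """Average raw (timestamp, value) samples into fixed-width time buckets.

    Returns a list of (bucket_start, mean_value) pairs sorted by bucket start.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Integer division maps each timestamp to the start of its bucket.
        bucket_start = (timestamp // interval) * interval
        buckets[bucket_start].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


if __name__ == "__main__":
    raw = [(0, 1.0), (30, 3.0), (60, 5.0), (90, 7.0), (120, 9.0)]
    print(consolidate(raw, 60))
```

Averaging trades resolution for storage: five samples at 30-second spacing collapse to three points at 60-second spacing.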
18. Cluster Health Management
Goal: provide a problem-free environment for running jobs
Regular health checks
• Actions that return PASS, FAIL or UNKNOWN
• Can be associated with a settable severity and a message
• Can launch an action based on any response value
Pre-job health checks
16 Xeon Phi health checks included by default
Jobs will only be scheduled to nodes where Xeon Phi is working properly (as determined by health checks)
Intel Cluster Checker included to verify that the cluster is set up properly
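The health check model described on this slide can be sketched as follows: a check returns PASS, FAIL or UNKNOWN, carries a severity and a message, may trigger an action for any result, and the scheduler only considers nodes on which every check passes. All class, field and function names here are hypothetical; this is an illustration under those assumptions, not Bright's implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Optional


class Result(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    UNKNOWN = "UNKNOWN"


@dataclass
class HealthCheck:
    name: str
    probe: Callable[[str], Result]   # takes a node name, returns a Result
    severity: int = 10               # settable severity
    message: str = ""                # message reported with the result
    actions: Optional[Dict[Result, Callable[[str], None]]] = None

    def run(self, node):
        """Run the probe, map a crashing probe to UNKNOWN, fire any matching action."""
        try:
            result = self.probe(node)
        except Exception:
            result = Result.UNKNOWN
        action = (self.actions or {}).get(result)
        if action is not None:
            action(node)
        return result


def schedulable_nodes(nodes, checks):
    """Keep only nodes on which every check returns PASS.

    all() stops at the first non-PASS result, so later checks are skipped for that node.
    """
    return [n for n in nodes if all(c.run(n) is Result.PASS for c in checks)]
```

As a usage example, a probe that fails on one node, combined with an action registered for `Result.FAIL` that records the node for draining, leaves only the healthy nodes in the list returned by `schedulable_nodes`.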
19. Intel Xeon Phi Workload Management
Three ways to run Xeon Phi jobs:
• Offload (i.e. Xeon Phi is used as coprocessor from host)
• Native (i.e. job executes entirely on Xeon Phi)
• Symmetric (i.e. communicating processes on both host and Xeon Phi)
Offload: Xeon Phi represented as consumable resource in workload management system
Native: Ported Slurm to Xeon Phi
Symmetric: work in progress, will require some changes to workload managers
Additional work in progress: make sure Xeon Phi is not used in multiple modes simultaneously
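The slide lists mode exclusivity as work in progress and gives no design. Purely as an illustration of the constraint, the sketch below admits a job onto a coprocessor only when the card is idle or already running in the requested mode. The `Coprocessor` class is hypothetical, and allowing several jobs to share a card in the same mode is an assumption of this sketch.

```python
from enum import Enum


class Mode(Enum):
    OFFLOAD = "offload"
    NATIVE = "native"
    SYMMETRIC = "symmetric"


class Coprocessor:
    def __init__(self, name):
        self.name = name
        self.mode = None    # None means the card is idle
        self.jobs = set()   # identifiers of jobs currently using the card

    def acquire(self, job_id, mode):
        """Admit the job only if the card is idle or already in the same mode."""
        if self.mode is not None and self.mode is not mode:
            return False
        self.mode = mode
        self.jobs.add(job_id)
        return True

    def release(self, job_id):
        """Remove the job; the card becomes idle once its last job leaves."""
        self.jobs.discard(job_id)
        if not self.jobs:
            self.mode = None
```

With this guard, a native job is refused while offload jobs hold the card and is admitted once the last of them has released it.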