Bright Cluster Manager
A Comprehensive, Integrated Management Solution for
Parallel Universes Today and Tomorrow

Ian Lumb
Bright Evangelist
In My Parallel Universe …

▪ In my parallel universe, parallel computing at extreme scale is easy!
  • Scientists focus on science, engineers on engineering
  • No problem is out of computational reach
  • Coding has been deprecated!
    – Problems are stated in the natural language of the discipline
      » Implementation suggestions/guidelines are optional
    – "Heuristic algorithms" take care of the implementation specifics (i.e., the coding)
  • Resources are plentiful!
    – Physical constraints (e.g., power, cooling & space) have been eliminated
    – Generic processors to specialized coprocessors are readily available
    – Resource management is completely transparent
Parallel Computing via Bright Cluster Manager

▪ Provisions, monitors and manages all neo-heterogeneous resources
  • Systems, storage, interconnects, etc.

▪ Management, parallelized
  • Adaptive provisioning in real time
  • Topologically based monitoring
  • Fault tolerance via high availability
  • One GUI for multiple clusters and clouds

▪ Development simplified
  • Tools and libraries available
  • Workloads managed
Architecture
[Diagram. Labels: Bright Cluster; CMDaemon; head node; node001, node002, node003; Cluster Management GUI; Cluster Management Shell; Web-Based User Portal; Third-Party Applications; SOAP+SSL connections.]
Bright Cluster Manager — Elements
[Diagram: software stack. Recoverable labels, grouped:]
▪ Cluster Management GUI, User Portal, Cluster Management Shell
▪ SSL / SOAP / X509 / IPtables
▪ Cluster Management Daemon
▪ Provisioning, Monitoring, Automation, Health Checks, Management
▪ SLURM, Torque/Maui, Torque/MOAB, PBS Pro, Grid Engine, LSF
▪ Compilers, Libraries, Debuggers, Profilers
▪ PDU, IPMI/iLO, Interconnect, Ethernet, Disk, Memory, MIC, CPU
▪ SLES / RHEL / CentOS / SL
▪ ScaleMP vSMP
Management Interface

Graphical User Interface (GUI)
▪ Offers administrator full cluster control
▪ Standalone desktop application
▪ Manages multiple clusters simultaneously
▪ Runs natively on Linux, Windows and MacOS

Cluster Management Shell (CMSH)
▪ All GUI functionality also available through Cluster Management Shell
▪ Interactive and scriptable in batch mode
Intel Xeon Phi Integration

▪ Everything needed to enable Xeon Phi on a cluster is packaged as easy-to-install Bright packages:
  • Xeon Phi driver
  • Xeon Phi runtime
  • Xeon Phi SDK
  • Xeon Phi OFED
  • Xeon Phi flash utilities

▪ Environment modules ensure that user environment is set up perfectly (PATH, LD_LIBRARY_PATH, ...)
▪ Xeon Phi driver recompiled automatically against running kernel at boot-time
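
As a rough illustration of what an environment module does to PATH and LD_LIBRARY_PATH, here is a minimal Python sketch. It is not Bright's or Environment Modules' implementation; the function names and the `/opt/example/...` directories are hypothetical and chosen only for the example.

```python
import os

def prepend_path(env, var, directory):
    """Return a copy of env with directory placed first in the path-list variable var."""
    updated = dict(env)
    current = updated.get(var, "")
    # Drop empty entries and any earlier copy of the directory to avoid duplicates.
    parts = [p for p in current.split(os.pathsep) if p and p != directory]
    updated[var] = os.pathsep.join([directory] + parts)
    return updated

def load_module(env, module):
    """Apply every (variable, directory) pair of a module description to env."""
    for var, directory in module:
        env = prepend_path(env, var, directory)
    return env

# Hypothetical install locations, for illustration only.
MIC_SDK_MODULE = [
    ("PATH", "/opt/example/mic-sdk/bin"),
    ("LD_LIBRARY_PATH", "/opt/example/mic-sdk/lib"),
]

base_env = {"PATH": "/usr/bin" + os.pathsep + "/bin"}
user_env = load_module(base_env, MIC_SDK_MODULE)
```

The sketch returns a new dictionary instead of mutating the caller's environment, so loading a module never alters `base_env`.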
Intel Xeon Phi Integration

▪ Set-up wizard takes care of initial Xeon Phi configuration (e.g. creating bridge interfaces, assigning IP addresses)
▪ Xeon Phi appears as a first-class device type in cluster management infrastructure
▪ Xeon Phi can be configured, controlled and monitored through CMSH and CMGUI
▪ Xeon Phi is automatically added to the workload management system as a consumable resource
▪ Compute jobs may request Xeon Phi resource in job script
Architecture — Monitoring
[Diagram. Labels: Bright Cluster; CMDaemon; head node; node001, node002 and node003, each with a BMC; metrics; data; raw data; consolidated data; Cluster Management GUI; Cluster Management Shell; Web-Based User Portal; Third-Party Applications.]
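
The monitoring diagram distinguishes raw data from consolidated data. The slides do not say how consolidation is done, so the following Python sketch assumes one common approach: averaging raw samples over fixed time windows. The function name, the window length and the sample values are all illustrative.

```python
from statistics import mean

def consolidate(samples, window):
    """Average (timestamp, value) samples into fixed windows of `window` seconds.

    Returns a list of (window_start, average) tuples sorted by window_start.
    """
    buckets = {}
    for timestamp, value in samples:
        start = (timestamp // window) * window  # start of the window this sample falls in
        buckets.setdefault(start, []).append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]

# Made-up raw samples: (seconds since start, metric value).
raw = [(0, 40.0), (30, 42.0), (60, 50.0), (90, 54.0), (120, 61.0)]
per_minute = consolidate(raw, window=60)
```

Keeping only the consolidated series for older data trades resolution for storage; a minimum or maximum could be substituted for the mean where peaks matter more than averages.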
Cluster Health Management
▪ Goal: provide a problem-free environment for running jobs
▪ Regular health checks
  • Actions that return PASS, FAIL or UNKNOWN
  • Can be associated with a settable severity and a message
  • Can launch an action based on any response value
▪ Pre-job health checks
▪ 16 Xeon Phi health checks included by default
▪ Jobs will only be scheduled to nodes where Xeon Phi is working properly (as determined by health checks)
▪ Intel Cluster Checker included to verify that the cluster is set up properly
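
The health-check model on this slide (a check returns PASS, FAIL or UNKNOWN, carries a severity and a message, and can trigger an action on any result) can be sketched in Python as follows. This is an illustration of the idea only, not Bright's API: the class names, the severity scale, the `mic_present` check and the node list are assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class Result(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    UNKNOWN = "UNKNOWN"

@dataclass
class HealthCheck:
    name: str
    probe: Callable[[str], Result]  # takes a node name, returns a Result
    severity: int = 10              # settable severity (scale is an assumption)
    message: str = ""

def run_checks(node, checks, actions):
    """Run every check on node and fire the action registered for each result."""
    results = {}
    for check in checks:
        try:
            result = check.probe(node)
        except Exception:
            result = Result.UNKNOWN  # a probe that cannot run is not a PASS
        results[check.name] = result
        action = actions.get(result)
        if action is not None:
            action(node, check)
    return results

def schedulable(results):
    """A node is eligible for jobs only if every check passed."""
    return all(r is Result.PASS for r in results.values())

# Hypothetical scenario: the coprocessor on node002 is not responding.
broken_cards = {"node002"}
drained = []
checks = [
    HealthCheck(
        name="mic_present",
        probe=lambda n: Result.FAIL if n in broken_cards else Result.PASS,
        severity=50,
        message="coprocessor not responding",
    )
]
actions = {Result.FAIL: lambda node, check: drained.append((node, check.name))}

eligible = [
    n for n in ["node001", "node002", "node003"]
    if schedulable(run_checks(n, checks, actions))
]
```

Treating UNKNOWN as not schedulable is the conservative choice here; a site could just as well map UNKNOWN to a warning-only action.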
Intel Xeon Phi Workload Management
▪ Three ways to run Xeon Phi jobs:
  • Offload (i.e. Xeon Phi is used as coprocessor from host)
  • Native (i.e. job executes entirely on Xeon Phi)
  • Symmetric (i.e. communicating processes on both host and Xeon Phi)
▪ Offload: Xeon Phi represented as consumable resource in workload management system
▪ Native: Ported Slurm to Xeon Phi
▪ Symmetric: work in progress, will require some changes to workload managers
▪ Additional work in progress: make sure Xeon Phi is not used in multiple modes simultaneously
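
To make "consumable resource" concrete, the Python sketch below counts free coprocessors per node and refuses requests that exceed the count. It also enforces the constraint the slide lists as work in progress, that a node's coprocessors are not used in two modes at once. The class and its methods are hypothetical and do not correspond to any workload manager's interface.

```python
class CoprocessorPool:
    """Tracks free coprocessors per node and the mode each busy node is used in."""

    def __init__(self, cards_per_node):
        self.total = dict(cards_per_node)
        self.free = dict(cards_per_node)
        self.mode = {}  # node -> mode string while any of its cards are allocated

    def allocate(self, node, count, mode):
        """Reserve count cards on node for mode; return True on success."""
        if self.free.get(node, 0) < count:
            return False  # not enough cards left: the resource is consumed
        if self.mode.get(node, mode) != mode:
            return False  # node is already in use in a different mode
        self.free[node] -= count
        self.mode[node] = mode
        return True

    def release(self, node, count):
        """Return count cards to node; clear the mode once the node is idle."""
        self.free[node] = min(self.total[node], self.free[node] + count)
        if self.free[node] == self.total[node]:
            self.mode.pop(node, None)

pool = CoprocessorPool({"node001": 2})
first = pool.allocate("node001", 1, "offload")     # succeeds, one card left
conflict = pool.allocate("node001", 1, "native")   # refused: mode mismatch
too_many = pool.allocate("node001", 2, "offload")  # refused: only one card free
pool.release("node001", 1)                         # node idle again, mode cleared
second = pool.allocate("node001", 2, "native")     # succeeds
```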
Cherry Creek
Bright Cluster Manager makes it easy to install, manage and use clusters with Intel Xeon Phi coprocessors.
Questions?
