Monitoring Alerts and Metrics on
Large Power Systems Clusters
Marcelo Perazolo
Cognitive Systems Architect
IBM Systems
mperazolo@us.ibm.com
Nuremberg, Nov 4-7, 2019
http://osmc.de
Agenda
• Introduction
• CORAL & Summit Supercomputer case
• Power Firmware Monitoring – The CRASSD open source project
• Power-Ops open source project – an open source collaboration
• Demo
• Conclusion
Why Power/OpenPOWER is popular for certain Workloads
• Open Hardware Architecture
• Multiple vendors
• OpenPOWER Foundation
Case Study: The Summit Supercomputer
• CORAL: Collaboration of Oak Ridge, Argonne and Lawrence Livermore
• Summit is located at Oak Ridge National Laboratory and is used for civilian research
• Sister project: Sierra supercomputer at Lawrence Livermore (nuclear weapons research)
• First supercomputer to reach exaOps performance
• Interconnected by ~185 miles of fiber optic cables
• ~5,600 sq ft of data center floor space
• ~340 tons of hardware and overhead infrastructure
• ~13 MW power consumption
• 4,608 Power9 AC922 22-core systems
• 27,648 NVIDIA GPUs (6 per node)
• 250 petabytes of storage
• 200 Gbps InfiniBand bandwidth between nodes
• Delivers up to 200 petaFLOPS / 3 exaOps
• Helps researchers with AI, big data, analytics and HPC workloads
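As a quick sanity check on the Summit figures, the GPU count follows directly from the node count, and dividing peak FLOPS by power draw gives a rough peak-efficiency figure. This is back-of-the-envelope arithmetic using only the numbers quoted on the slide; a measured Green500 result comes from a benchmark run and will differ.

```python
# Back-of-the-envelope check of the Summit figures quoted on the slide.
NODES = 4_608
GPUS_PER_NODE = 6
PEAK_FLOPS = 200e15   # "up to 200 petaFLOPS"
POWER_WATTS = 13e6    # "~13 MW power consumption"

total_gpus = NODES * GPUS_PER_NODE
peak_gflops_per_watt = PEAK_FLOPS / POWER_WATTS / 1e9

print(f"GPUs: {total_gpus:,}")                                  # GPUs: 27,648
print(f"Peak efficiency: {peak_gflops_per_watt:.1f} GFLOPS/W")  # 15.4 GFLOPS/W
```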
Summit: The Most Energy-Efficient Supercomputer
“The world’s smartest supercomputer is sharing data with its cooling
plant, reducing energy consumption and cost”
• “Summit is also the most energy-efficient supercomputer in
its Green500 class—based on gigaflops per watt—outranking systems a 10th as
fast.”
• “We wanted to couple Summit’s mechanical cooling system with its
computational workload to optimize efficiency, which can translate to significant
cost savings for a system of this size.”
• “We’ve developed the infrastructure architecture to scale to millions of events
per second using containerized microservices and popular enterprise open-
source software.”
• “On each Summit node OpenBMC provides real-time data readings from dozens
of sensors totaling more than 460,000 metrics per second that describe power
consumption, temperature, and performance for the entire supercomputer.”
• “Facility staff can now visualize Summit behavior across all 4,608 nodes with a
temperature heat map, a power consumption map, and power and consumption
data broken down by CPUs and GPUs.”
• “Capturing all possible data in real time allows operators and researchers to
gain powerful insights into job behavior, machine performance, and cooling
response.”
*** Quoted from: https://www.hpcwire.com/off-the-wire/olcf-and-providentia-worldwide-build-intelligence-system-for-supercomputer-cooling-plant/
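To put the quoted telemetry volume in perspective, the cluster-wide rate can be broken down per node and per day. The sketch below uses only the figures from the quote; since the quote says "more than 460,000", the results are lower bounds.

```python
# Rough breakdown of the quoted OpenBMC telemetry rate.
NODES = 4_608
METRICS_PER_SECOND = 460_000   # cluster-wide lower bound, as quoted
SECONDS_PER_DAY = 86_400

per_node_rate = METRICS_PER_SECOND / NODES
samples_per_day = METRICS_PER_SECOND * SECONDS_PER_DAY

print(f"{per_node_rate:.1f} metrics/s per node")                # 99.8 metrics/s per node
print(f"{samples_per_day / 1e9:.1f} billion samples per day")   # 39.7 billion samples per day
```

Roughly 100 readings per node per second is consistent with "dozens of sensors" sampled about once a second or faster.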
Summit: High Level Hardware/Architecture View
CRASSD: Open tooling for Power Firmware Monitoring
Firmware alerts and telemetry from Power nodes flow to CRASSD servers and then to open visualization tools
such as Grafana and the Elastic Stack. Data includes power consumption, frequencies, cooling, etc.
CRASSD Facts
▪ CORAL required telemetry data for all
nodes/layers in the Power Cluster
▪ Proposed RAS architecture had flaws:
▪ No method existed to route errors from the BMC
▪ Built CRASSD as an open tool:
– Collects error events and sorts them using policy tables
– Extended the daemon to gather sensor readings to
fulfill ORNL telemetry requirements
– Provides an API that makes it easy to develop plug-ins
using various open source monitoring tools
▪ The results have been impressive, and many
more use cases are being developed
▪ CRASSD is currently being incorporated into
other solutions where the same requirements
exist, e.g. the Power-Ops stack.
Available at: https://github.com/open-power-ref-design-toolkit/ibm-crassd
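The idea of sorting error events with a policy table can be sketched as a simple lookup followed by a sort. This is an illustration only, not CRASSD's actual table format or API: the event IDs, fields and severity names below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    severity: str
    serviceable: bool
    message: str


# Hypothetical policy table; IDs and fields are illustrative, not CRASSD's format.
POLICY_TABLE = {
    "FAN_FAILURE": Policy("critical", True, "Fan failed"),
    "PSU_REDUNDANCY_LOST": Policy("warning", True, "Power supply redundancy lost"),
    "THERMAL_THROTTLE": Policy("warning", False, "CPU throttled due to temperature"),
}
DEFAULT_POLICY = Policy("info", False, "Unrecognized event")
SEVERITY_ORDER = {"critical": 0, "warning": 1, "info": 2}


def classify(event):
    """Annotate a raw event dict with the fields of its matching policy."""
    policy = POLICY_TABLE.get(event["id"], DEFAULT_POLICY)
    return {**event, "severity": policy.severity,
            "serviceable": policy.serviceable, "message": policy.message}


def sort_events(events):
    """Classify events and order them most severe first (stable within a level)."""
    classified = [classify(e) for e in events]
    return sorted(classified, key=lambda e: SEVERITY_ORDER[e["severity"]])


events = [
    {"id": "THERMAL_THROTTLE", "node": "node-a"},
    {"id": "SOMETHING_NEW", "node": "node-b"},
    {"id": "FAN_FAILURE", "node": "node-c"},
]
for e in sort_events(events):
    print(e["severity"], e["node"], e["message"])
```

Keeping the policy in a table rather than in code means new event types can be handled by adding a row, which is presumably what makes such a design easy to extend.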
Motivations
• Replace legacy tools and solutions with modern/open alternatives for Power clusters
• Monitoring for x86 is feature-rich and commoditized with extensive support
• Not so much for Power, e.g.: Elastic on Power still on v5.x; new v7.x now has binaries (x86 only)
• Power users often need to port / build / configure these tools from scratch !!
➔ May influence cost of maintenance, and thus the decision to use Power at all
• Automate a complete ecosystem of tools that fit all needs of a modern Ops stack
• types of data: logs/alerts vs. telemetry
• analysis: historical vs. real-time
• multi-layer aggregation: firmware, OS, services, etc.
• single system or cluster-wide
➔ Popular stacks use Grafana & Prometheus, ELK, Nagios / Icinga / Zabbix, Netdata, etc.
and are deployed/configured by tools such as Ansible, Terraform, Salt, Puppet, etc.
Proposal: Build & curate a key set of modern open tools for Power systems, engage Power systems
users and open source monitoring/ops community
Value 1: reduce cost of modernizing Operations for existing Power clusters (legacy → open)
Value 2: enable adding Power nodes easily into data centers that already use modern Ops tooling
Value 3: reduce entry cost of operations for new solutions interested in Power’s advantages
Beyond Power Firmware Monitoring: Power-Ops project
Power-Ops: Open tooling for Power Cluster Operations
Power-Ops Facts
▪ Management stack runs on Power LE architecture
▪ Managed endpoints supported are Power Linux
(could also be easily used on x86):
▪ RedHat family of OSs
▪ Debian/Ubuntu family of OSs
▪ AIX (limited, starting to be supported as endpoints)
▪ Composed of automation components using
Ansible playbooks
▪ 3 Main goals:
▪ Bring-up and pre-configure target platforms
(Bare-Metal, Virtual Machines, Containers*)
▪ Build components not currently available on the
Power platform
▪ Deploy and Configure tooling and start-up dashboards that
work off-the-shelf with Power
▪ Growing community of interested end-users
Power-Ops: Bring-Up
The Bring-Up Process
▪ DevOps professional triggers process on
CI/CD platform
▪ CI/CD tools invoke Ansible
▪ Ansible Playbooks interact with IaaS of choice
▪ Nodes are brought up targeted for different roles:
–Builders
–Controllers
–Endpoints
▪ Bring-up includes powering-up (if needed) and
laying down pre-requisites for building or
deployment
–OS
–Packages & Libraries
–Access configuration
–Software configuration
[Diagram: devops, CI/CD, builders, controllers, endpoints]
The IaaS of choice could be one of several options, e.g.
- Bare-Metal
- Hypervisors on Power
- Power Hyperconverged Infrastructure
- Containers on OpenShift, etc.
(integrations are easy, just drop in a playbook)
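The three node roles used during bring-up map naturally onto Ansible inventory groups. The sketch below renders an INI-style inventory with one group per role; the host names are hypothetical, and the real Power-Ops playbooks may organize their inventory differently.

```python
ROLES = ("builders", "controllers", "endpoints")


def render_inventory(nodes_by_role):
    """Render an INI-style Ansible inventory with one group per role."""
    lines = []
    for role in ROLES:
        lines.append(f"[{role}]")
        lines.extend(sorted(nodes_by_role.get(role, [])))
        lines.append("")  # blank line between groups
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical hosts, for illustration only.
nodes = {
    "builders": ["builder-01"],
    "controllers": ["controller-01"],
    "endpoints": ["endpoint-02", "endpoint-01"],
}
inventory = render_inventory(nodes)
print(inventory)
```

A CI/CD job could write this text to a file and pass it to `ansible-playbook -i`, so that each playbook targets only the group it is meant for.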
Power-Ops: Build
The Build Process
▪ Many components are already available on Power,
but there are exceptions
▪ CRASSD: source on github
▪ Build process generates packages for Debian, RedHat
▪ Go Lang
▪ Go Daemon binary must be recompiled on Power
▪ Elastic Stack
–Up to v5.x code is implemented in Java
–Newer releases include binaries (not yet supported)
–Beats must be re-packaged for Debian, RedHat
▪ All relevant packages are then stored on a
local repository
▪ Doesn’t have to run frequently
–DevOps orgs could automate upstream integration
[Diagram: devops, CI/CD, builders, libs, repo]
Generates binaries/packages for Power that are not yet widely available on public repos
Long-term goal is to integrate Power packages into upstream repositories
Power-Ops: Deploy
The Deploy Process
▪ Choose deployment topology
▪ Where each component is deployed to
▪ How they interconnect with each other
▪ Deploy tooling to nodes
▪ Elastic Stack, Netdata, CRASSD go to Controller nodes
▪ Beats (FileBeat, MetricBeat) go to Endpoint nodes
▪ Deploy configuration & visualizations/dashboards
▪ Crassd is configured to collect firmware data:
Telemetry data goes to Netdata
Alerting data goes to Logstash
▪ FileBeat collects logs and sends to Logstash
▪ MetricBeat collects telemetry and sends to Elasticsearch
▪ Visualizations/Dashboards are deployed to Netdata and
Kibana
▪ Operators can then access User Interfaces from
Kibana and Netdata
[Diagram: devops, CI/CD, repo, CRASSD]
Flexible deployment to both controllers and endpoints
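The deployment topology and data flows described on this slide can be summarized as two small lookup tables. This is a sketch of the slide's description only, not the format of the actual Power-Ops configuration; the dictionary layout and function names are assumptions.

```python
# Data flows from the Deploy slide: (source, kind of data) -> destination.
ROUTES = {
    ("crassd", "telemetry"): "netdata",
    ("crassd", "alerts"): "logstash",
    ("filebeat", "logs"): "logstash",
    ("metricbeat", "telemetry"): "elasticsearch",
}

# Where each component is deployed, per the slide's bullets.
PLACEMENT = {
    "controllers": ["elastic-stack", "netdata", "crassd"],
    "endpoints": ["filebeat", "metricbeat"],
}


def destination(source, kind):
    """Return where a given kind of data from a source is sent."""
    try:
        return ROUTES[(source, kind)]
    except KeyError:
        raise ValueError(f"no route for {kind!r} from {source!r}") from None


def sources_feeding(dest):
    """Return the sorted list of components that send data to a destination."""
    return sorted({src for (src, _kind), d in ROUTES.items() if d == dest})


print(destination("crassd", "alerts"))   # logstash
print(sources_feeding("logstash"))       # ['crassd', 'filebeat']
```

Writing the flows down this way makes the split visible: real-time firmware telemetry goes to Netdata, while alerts, logs and long-term metrics end up in the Elastic Stack.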
Demo Overview
[Diagram of the demo environment; labels recovered from the figure:]
- Controllers: wmdepos (P8 bare metal), launchgr01 (P9 bare metal)
- Endpoints: pops-ubuntu-ept (VM), pops-redhat-ept (VM), pops-aix-ept (VM), bos-1 (P9)
- Other components: Marcelo’s Laptop, github repos, deployment playbooks, crassd, firmware (192.168.10.25)
- Flows: deploy, f/w alerts, telemetry + logs, IPMI, OBMC telemetry
- Dashboards:
  - F/W Alerts (Kibana)
  - Logs/Infrastructure (Kibana)
  - Cluster Metrics (Kibana)
  - OS & F/W Metrics (Netdata)
(*) F/W data supported on Power9 systems
DEMO / Walk-through
Next Steps
Grow the community
1. Engage with traditional Power systems users (e.g. AIX, legacy Power) promoting modernization
2. Engage with the Power Linux community and promote sharing solutions for everybody’s benefit
3. Engage with Open Source communities, promote support of Power out of the box (when such doesn’t yet exist)
4. Use as a catalyst for monitoring of new large Power clusters (taking advantage of lower cost of entry on Power)
Enhance the Operational Stack
• Add Call Home support to CRASSD
• Support more deployment use cases, such as:
• Containers (development under way)
• Broader integration targeting other IaaS/PaaS solutions (e.g. OpenShift clusters)
• Support additional tools, such as:
• Prometheus / Grafana (development planned)
• Zabbix and/or Nagios / Icinga, others… (feel free to suggest / collaborate !!!)
• Support additional hardware, such as:
• Support other/newer BMC Firmware interfaces such as Redfish
• Monitor GPUs, Networking & Storage equipment
• More Power / OpenPOWER system models
• Currency work to support and maintain newer releases of tooling, e.g.
• Migrate to Elastic Stack v7.x (needs automation)
• Add support for more Beats
• More AIX support
Q&As
Backup
Kibana: Dashboard for Power Firmware events (fed from CRASSD Alerts)
Kibana: Dashboard for Power Infrastructure logs (fed from FileBeat)
Kibana: Multiple Dashboards for Long-Term Power metrics
(fed from MetricBeat and kept on Elasticsearch)
+ more
Netdata: Dashboards for Real-Time Power Firmware metrics (fed from CRASSD)
and Power Infrastructure metrics (fed from other Netdata plugins)

Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 

OSMC 2019 | Monitoring Alerts and Metrics on Large Power Systems Clusters by Marcelo Perazolo

  • 1. Monitoring Alerts and Metrics on Large Power Systems Clusters. Marcelo Perazolo, Cognitive Systems Architect, IBM Systems, mperazolo@us.ibm.com. Nuremberg, Nov 4-7, 2019. http://osmc.de
  • 2. Agenda
      • Introduction
      • CORAL & Summit Supercomputer case
      • Power Firmware Monitoring – The CRASSD open source project
      • Power-Ops open source project – an open source collaboration
      • Demo
      • Conclusion
  • 3. Why Power/OpenPOWER is popular for certain Workloads
      • Open Hardware Architecture
      • Multiple vendors
      • OpenPOWER Foundation
  • 4. Case Study: The Summit Supercomputer
      • CORAL: Collaboration of Oak Ridge, Argonne and Lawrence Livermore
      • Summit is located at the Oak Ridge National Laboratory, used for civilian research
      • Sister project: Sierra supercomputer at Lawrence Livermore (nuclear weapons research)
      • First supercomputer to reach exaOps performance
      • Interconnected by ~185 miles of fiber optic cables
      • ~5,600 sq ft of data center floor space
      • ~340 tons of hardware and overhead infrastructure
      • ~13 MW power consumption
      • 4,608 Power9 AC922 22-core systems
      • 27,648 NVIDIA GPUs (6 per node)
      • 250 petabytes of storage
      • 200 Gbps InfiniBand bandwidth between nodes
      • Delivers up to 200 petaFLOPS / 3 exaOps
      • Helps researchers with AI / BigData / Analytics, HPC capabilities
  • 5. Summit: The Most Energy-Efficient Supercomputer
      “The world’s smartest supercomputer is sharing data with its cooling plant, reducing energy consumption and cost”
      • “Summit is also the most energy-efficient supercomputer in its Green500 class—based on gigaflops per watt—outranking systems a 10th as fast.”
      • “We wanted to couple Summit’s mechanical cooling system with its computational workload to optimize efficiency, which can translate to significant cost savings for a system of this size.”
      • “We’ve developed the infrastructure architecture to scale to millions of events per second using containerized microservices and popular enterprise open-source software.”
      • “On each Summit node OpenBMC provides real-time data readings from dozens of sensors totaling more than 460,000 metrics per second that describe power consumption, temperature, and performance for the entire supercomputer.”
      • “Facility staff can now visualize Summit behavior across all 4,608 nodes with a temperature heat map, a power consumption map, and power and consumption data broken down by CPUs and GPUs.”
      • “Capturing all possible data in real time allows operators and researchers to gain powerful insights into job behavior, machine performance, and cooling response.”
      *** Quoted from: https://www.hpcwire.com/off-the-wire/olcf-and-providentia-worldwide-build-intelligence-system-for-supercomputer-cooling-plant/
  • 6. Summit: High Level Hardware/Architecture View. Firmware alerts and telemetry from Power nodes flow to CRASSD servers and then to open visualization tools such as Grafana and the Elastic Stack. Data includes power consumption, frequencies, cooling, etc.
  • 7. CRASSD: Open tooling for Power Firmware Monitoring
      CRASSD Facts
      ▪ CORAL required telemetry data for all nodes/layers in the Power Cluster
      ▪ Proposed RAS architecture had flaws:
          ▪ No method existed to route errors from the BMC
      ▪ Built CRASSD as an open tool:
          – To collect error events and sort them using policy tables
          – Extended the daemon to gather sensor readings to fulfill ORNL telemetry requirements
          – Provides an API that makes it easy to develop plug-ins using various Open Source monitoring tools
      ▪ The results have been impressive, and many more use cases are being developed
      ▪ CRASSD is currently being incorporated into other solutions where the same requirements exist, e.g. the Power-Ops stack
      Available at: https://github.com/open-power-ref-design-toolkit/ibm-crassd
  • 8. Beyond Power Firmware Monitoring: Power-Ops project
      Motivations
      • Replace legacy tools and solutions with modern/open alternatives for Power clusters
          • Monitoring for x86 is feature-rich and commoditized with extensive support
          • Not so much for Power, e.g.: Elastic on Power still on v5.x; new v7.x now has binaries (x86 only)
          • Power users often need to port / build / configure these tools from scratch !!
          ➔ May influence the cost of maintenance, and thus the decision to use Power at all
      • Automate a complete ecosystem of tools that fits all needs of a modern Ops stack
          • types of data: logs/alerts vs. telemetry
          • analysis: historical vs. real-time
          • multi-layer aggregation: firmware, OS, services, etc.
          • single system or cluster-wide
          ➔ Popular stacks use Grafana & Prometheus, ELK, Nagios / Icinga / Zabbix, Netdata, etc. and are deployed/configured by tools such as Ansible, Terraform, Salt, Puppet, etc.
      Proposal: Build & curate a key set of modern open tools for Power systems, engage Power systems users and the open source monitoring/ops community
      Value 1: reduce the cost of modernizing Operations for existing Power clusters (legacy → open)
      Value 2: enable adding Power nodes easily into data centers that already use modern Ops tooling
      Value 3: reduce the entry cost of Operations for new solutions interested in Power advantages
  • 9. Power-Ops: Open tooling for Power Cluster Operations
      Power-Ops Facts
      ▪ Management stack runs on Power LE architecture
      ▪ Managed endpoints supported are Power Linux (could also be easily used on x86):
          ▪ RedHat family of OSs
          ▪ Debian/Ubuntu family of OSs
          ▪ AIX (limited, starting to be supported as endpoints)
      ▪ Composed of automation components using Ansible playbooks
      ▪ 3 main goals:
          ▪ Bring up and pre-configure target platforms (Bare-Metal, Virtual Machines, Containers*)
          ▪ Build components not currently available on the Power platform
          ▪ Deploy and configure tooling and start-up dashboards that work off-the-shelf with Power
      ▪ Growing community of interested end-users
  • 10. Power-Ops: Bring-Up
      The Bring-Up Process
      ▪ DevOps professional triggers process on CI/CD platform
      ▪ CI/CD tools invoke Ansible
      ▪ Ansible Playbooks interact with IaaS of choice
      ▪ Nodes are brought up targeted for different roles:
          – Builders
          – Controllers
          – Endpoints
      ▪ Bring-up includes powering-up (if needed) and laying down pre-requisites for building or deployment
          – OS
          – Packages & Libraries
          – Access configuration
          – Software configuration
      [Diagram labels: devops, CI/CD, builders, controllers, endpoints]
      This could be one of several choices, e.g.
          - Bare-Metal
          - Hypervisors or Power
          - Power Hyperconverged Infrastructure
          - Containers on OpenShift, etc.
      (integrations are easy, just drop playbook)
  • 11. Power-Ops: Build
      The Build Process
      ▪ Many components are already available on Power, but there are exceptions
          ▪ CRASSD: source on github
              ▪ Build process generates packages for Debian, RedHat
          ▪ Go Lang
              ▪ Go Daemon binary must be recompiled on Power
          ▪ Elastic Stack
              – Up to v5.x code is implemented in Java
              – Newer releases include binaries (not yet supported)
              – Beats must be re-packaged for Debian, RedHat
      ▪ All relevant packages are then stored on a local repository
      ▪ Doesn’t have to run frequently
          – DevOps orgs could automate upstream integration
      [Diagram labels: devops, CI/CD, builders, libs, repo]
      Generates binaries/packages for Power not yet widely available on public repos
      Long-term goal is to integrate Power packages onto upstream repositories
  • 12. Power-Ops: Deploy
      The Deploy Process
      ▪ Choose deployment topology
          ▪ Where each component is deployed to
          ▪ How they interconnect with each other
      ▪ Deploy tooling to nodes
          ▪ Elastic Stack, Netdata, Crassd go to Controller nodes
          ▪ Beats (FileBeat, MetricBeat) go to Endpoint nodes
      ▪ Deploy configuration & visualizations/dashboards
          ▪ Crassd is configured to collect firmware data: telemetry data goes to Netdata, alerting data goes to Logstash
          ▪ FileBeat collects logs and sends them to Logstash
          ▪ MetricBeat collects telemetry and sends it to Elasticsearch
          ▪ Visualizations/Dashboards are deployed to Netdata and Kibana
      ▪ Operators can then access User Interfaces from Kibana and Netdata
      [Diagram labels: devops, CI/CD, repo, CRASSD]
      Flexible deployment to both controllers and endpoints
  • 13. Demo Overview
      [Diagram: demo topology. Labels: Marcelo’s Laptop; github repos; deployment playbooks (*); crassd; firmware, 192.168.10.25, IPMI, OBMC; flows: deploy, f/w alerts, telemetry, telemetry + logs]
      Controllers: wmdepos (P8 bare metal), launchgr01 (P9 bare metal)
      Endpoints: pops-ubuntu-ept (VM), pops-redhat-ept (VM), pops-aix-ept (VM), bos-1 (P9)
      Dashboards:
          - F/W Alerts (Kibana)
          - Logs/Infrastructure (Kibana)
          - Cluster Metrics (Kibana)
          - OS & F/W Metrics (Netdata)
      (*) F/W data supported on Power9 systems
  • 15. Next Steps
      Grow the community
          1. Engage with traditional Power systems users (e.g. AIX, legacy Power), promoting modernization
          2. Engage with the Power Linux community, foster sharing of solutions for everybody’s benefit
          3. Engage with Open Source communities, promote support of Power out of the box (where it doesn’t yet exist)
          4. Use as a catalyst for monitoring of new large Power clusters (taking advantage of lower cost of entry on Power)
      Enhance the Operational Stack
          • Add Call Home support to CRASSD
          • Support more deployment use cases, such as:
              • Containers (development under way)
              • Broader integration targeting other IaaS/PaaS solutions (e.g. OpenShift clusters)
          • Support additional tools, such as:
              • Prometheus / Grafana (development planned)
              • Zabbix and/or Nagios / Icinga, others… (feel free to suggest / collaborate !!!)
          • Support additional hardware, such as:
              • Other/newer BMC Firmware interfaces such as Redfish
              • Monitor GPUs, Networking & Storage equipment
              • More Power / OpenPOWER system models
          • Currency work to support and maintain newer releases of tooling, e.g.
              • Migrate to Elastic Stack v7.x (needs automation)
              • Add support for more Beats
          • More AIX support
  • 16. Q&As
  • 18. Kibana: Dashboard for Power Firmware events (fed from CRASSD Alerts)
  • 19. Kibana: Dashboard for Power Infrastructure logs (fed from FileBeat)
  • 20. Kibana: Multiple Dashboards for Long-Term Power metrics (fed from MetricBeat and kept on Elasticsearch) + more
  • 21. Netdata: Dashboards for Real-Time Power Firmware metrics (fed from CRASSD) and Power Infrastructure metrics (fed from other Netdata plugins)
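
As a quick sanity check on the scale figures quoted on slides 4 and 5, the arithmetic can be worked through in a few lines of Python. The 460,000 figure is quoted as "more than", so the derived rates are lower bounds; the per-node average and per-day total are derived here and do not appear on the slides.

```python
# Figures taken from slides 4 and 5 of the talk.
NODES = 4_608                 # Power9 AC922 systems
GPUS_PER_NODE = 6
METRICS_PER_SECOND = 460_000  # "more than 460,000", so treat as a lower bound

SECONDS_PER_DAY = 24 * 60 * 60

# GPU count implied by the node count (the slide states 27,648).
total_gpus = NODES * GPUS_PER_NODE

# Derived values, not stated in the slides.
metrics_per_node_per_second = METRICS_PER_SECOND / NODES
metrics_per_day = METRICS_PER_SECOND * SECONDS_PER_DAY

print(total_gpus)                             # 27648
print(round(metrics_per_node_per_second, 1))  # 99.8
print(metrics_per_day)                        # 39744000000
```

Roughly 100 readings per node per second is consistent with the "dozens of sensors" per node mentioned in the quote.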
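
Slide 7 says CRASSD collects error events and sorts them using policy tables. The sketch below illustrates the general idea of a policy-table lookup only; the event IDs, the `Policy` fields and the function names are hypothetical and are not taken from the ibm-crassd code base or its file formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    severity: str  # how serious the event is
    action: str    # what a downstream plug-in should do with it


# Hypothetical policy table: event ID -> policy. The real CRASSD tables
# are not reproduced here.
POLICY_TABLE = {
    "FAN_FAILURE": Policy(severity="critical", action="notify"),
    "TEMP_WARNING": Policy(severity="warning", action="log"),
}

# Fallback for events that the table does not know about.
DEFAULT_POLICY = Policy(severity="unknown", action="log")


def classify(event_id: str) -> Policy:
    """Look up the policy for one event, falling back to the default."""
    return POLICY_TABLE.get(event_id, DEFAULT_POLICY)


def sort_events(event_ids: list[str]) -> dict[str, list[str]]:
    """Group event IDs by the severity their policy assigns."""
    buckets: dict[str, list[str]] = {}
    for event_id in event_ids:
        buckets.setdefault(classify(event_id).severity, []).append(event_id)
    return buckets


if __name__ == "__main__":
    print(sort_events(["FAN_FAILURE", "MYSTERY_EVENT", "TEMP_WARNING"]))
```

Keeping the table as data rather than code means new event types can be handled by adding a row, which is presumably part of the appeal of a policy-table design.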
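
The data flows described on slide 12 can be summarized as a small routing table. This is a minimal sketch of that topology as plain data; the lowercase identifiers and the helper functions are illustrative assumptions, not part of the Power-Ops playbooks.

```python
# (collector, kind of data) -> destination, as described on slide 12.
ROUTES = {
    ("crassd", "telemetry"): "netdata",
    ("crassd", "alerts"): "logstash",
    ("filebeat", "logs"): "logstash",
    ("metricbeat", "telemetry"): "elasticsearch",
}


def destination(collector: str, kind: str) -> str:
    """Return where a collector sends a given kind of data."""
    try:
        return ROUTES[(collector, kind)]
    except KeyError:
        raise ValueError(f"no route for {collector!r} / {kind!r}") from None


def collectors_feeding(dest: str) -> list[str]:
    """List the collectors that send data to a destination, sorted by name."""
    return sorted({c for (c, _kind), d in ROUTES.items() if d == dest})


if __name__ == "__main__":
    print(destination("crassd", "alerts"))  # logstash
    print(collectors_feeding("logstash"))   # ['crassd', 'filebeat']
```

Read this way, Logstash is the single entry point for alert and log data, while telemetry is split between Netdata (real-time firmware metrics) and Elasticsearch (long-term metrics), matching the dashboards on slides 18 to 21.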