SlideShare a Scribd company logo
1 of 21
Visualization of Monitoring
Data at the NASA Advanced
Supercomputing Facility
Janice Singh
janice.s.singh@nasa.gov
NASA Advanced Supercomputing Facility
• Pleiades
– 11,136-node SGI ICE supercluster
162,496 cores total (32,768 additional GPU cores)
• Tape Storage
– pDMF cluster
• NFS servers
– home filesystems on computing systems
• Lustre Filesystems
– each filesystem is multiple servers
• PBS (Portable Batch System)
– job scheduler that runs on computing systems
Ref: http://www.nas.nasa.gov/hecc/
Janice Singh - janice.s.singh@nasa.gov 2
Why Visualization is Needed
• 24 x 7 Help Desk
– need a quick overview of system status
• but still more specific than nagios visualization
– not just single status per host, but sub-groups per host
– they assess situations before calling next level of support
• automatic alerts from nagios are not as selective
– interrelated issues
• allows us to see how many systems affected by one issue
Janice Singh - janice.s.singh@nasa.gov 3
Heads Up Display – For Staff
Janice Singh - janice.s.singh@nasa.gov 4
Heads Up Display – For Staff (details)
Janice Singh - janice.s.singh@nasa.gov 5
Heads Up Display – for Users
Janice Singh - janice.s.singh@nasa.gov 6
All the Parts
• nagios
• nrpe
• nsca
• datagg (in-house software)
• apache
• perl/cgi-bin
Janice Singh - janice.s.singh@nasa.gov 7
Data flow
network firewall (The Enclave)
nsca
nsca
nrpe
ssh
Cluster
Compute Node
Remote Node
Web Server
nagios
nsca nagios.cmd
datagg
nagios2.cmd
nagios
nagios
nagios web
interface
HUD orange - pipe file
green - flat file
purple - web site
HUD buffer
nrpe nrpe
Dedicated Nagios Node
Janice Singh - janice.s.singh@nasa.gov 8
Nagios (and add-ons) setup
The Basics
•the webserver and the main nagios server are the same machine
•there is a network firewall called “the enclave”
– most compute nodes are inside the enclave
– the webserver can only receive data from the enclave, not
send
•the servers within the enclave send nagios data to the webserver
via nsca
•for the servers outside the enclave, nrpe is used
•plugins written in Perl using Nagios::Plugin
Janice Singh - janice.s.singh@nasa.gov 9
Nagios (and add-ons) setup, cont.
Versions
•when I inherited the systems, they were all using nagios 2.10
– most systems have been upgraded to nagios 3.4+
– all new systems have nagios 3.5
– webserver is still using 2.10
•nsca 2.7.2 across the board
Janice Singh - janice.s.singh@nasa.gov 10
Nagios (and add-ons) setup, cont.
Clusters within the enclave
•there is one host that is considered the Dedicated Nagios Server
•the rest of the hosts are monitored using nrpe
•exceptions on Pleiades:
– there are many hosts monitored that get reimaged often
• difficult to administer nrpe
• use check_by_ssh
– ssh is flaky under nagios 3, so it still uses nagios 2.10
• will randomly give the error: Could not open pipe: /usr/bin/ssh
– use 2 Dedicated Nagios Servers
• so many checks that there was unacceptable latency
– tests that should run every 2 mins were running every 30
mins
Janice Singh - janice.s.singh@nasa.gov 11
datagg (DATa AGGregator)
• why it was needed (i.e. what nagios couldn’t do for us)
– error summaries of nagios problems
• the nagios webpage does tell number of alerts per service
group, but that cannot be leveraged via API
Janice Singh - janice.s.singh@nasa.gov 12
datagg (DATa AGGregator), cont.
• why it was needed (i.e. what nagios couldn’t do for us)
– parsing data about Portable Batch System (PBS)
– piecing together large outputs from NSCA
• current output from PBS: 404,937 characters
Janice Singh - janice.s.singh@nasa.gov 13
datagg (DATa AGGregator), cont.
• why it was needed (i.e. what nagios couldn’t do for us)
– mapping service nodes to the appropriate Lustre filesystem
• they are in a servicegroup (but not available via API)
Janice Singh - janice.s.singh@nasa.gov 14
datagg (DATa AGGregator), cont.
• in-house written perl script
– it reads the command file (pipe) that nsca creates on the
webserver
– the nagios configuration on the webserver also writes to the
nsca command file
– it aggregates the data that it reads in from the pipe and writes
it out to a flat file referred to as the “HUD buffer”
• The data read in is in the format:
[$timestamp] PROCESS_SERVICE_CHECK_RESULT;
$hostname; $service_description; $state; $output
Janice Singh - janice.s.singh@nasa.gov 15
HUD Buffer
• is a Windows-style .ini file
– sections
• used to group together the boxes (or sub-boxes)
• ex: [pleiades daemons]
– keys
• name=value
– this is where we put the nagios state and output of
plugin
– every section also has the key Error Summary
Janice Singh - janice.s.singh@nasa.gov 16
Data flow
network firewall (The Enclave)
nsca
nsca
nrpe
ssh
Cluster
Compute Node
Remote Node
Web Server
nagios
nsca nagios.cmd
datagg
nagios2.cmd
nagios
nagios
nagios web
interface
HUD orange - pipe file
green - flat file
purple - web site
HUD buffer
nrpe nrpe
Dedicated Nagios Node
Janice Singh - janice.s.singh@nasa.gov 17
Displays
• Two versions
– Internal (aka Staff HUD) on the internal network
• Staff HUD and nagios web interface
• clicking on Staff HUD goes to the nagios web page
– for the service checks for the host or service group
– the nagios web interface gives more details on the
plugin output than is displayed on the HUD
– the nagios web interface is used to suspend/restart
notifications
– External (aka miniHUD)
– Permissions set in the Apache config file
• Both written in Perl using cgi-bin
Janice Singh - janice.s.singh@nasa.gov 18
Future Plans
• use a database to collect data
– this will also allow us to have historical data beyond what is in
the logs, which can be used in graphing
– will eliminate the need for a flat file
Janice Singh - janice.s.singh@nasa.gov 19
Future Plans
• things that would be great to see in nagios 4
– an API
– make a nagios check run “on demand”
– no more random “could not open pipe” errors
– lower latency (we’re having problems with less than 600
checks!)
– error summaries
– way to send large amounts of data via nsca
– a way to see the exact command that nagios ran on the
nagios webpage
Janice Singh - janice.s.singh@nasa.gov 20
Questions?
Janice Singh - janice.s.singh@nasa.gov 21

More Related Content

What's hot

HBaseCon 2013: How (and Why) Phoenix Puts the SQL Back into NoSQL
HBaseCon 2013: How (and Why) Phoenix Puts the SQL Back into NoSQLHBaseCon 2013: How (and Why) Phoenix Puts the SQL Back into NoSQL
HBaseCon 2013: How (and Why) Phoenix Puts the SQL Back into NoSQLCloudera, Inc.
 
Near-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBaseNear-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBasedave_revell
 
SQL for Everything at CWT2014
SQL for Everything at CWT2014SQL for Everything at CWT2014
SQL for Everything at CWT2014N Masahiro
 
Testing Rolling Roots
Testing Rolling RootsTesting Rolling Roots
Testing Rolling RootsAPNIC
 
"How about no grep and zabbix?". ELK based alerts and metrics.
"How about no grep and zabbix?". ELK based alerts and metrics."How about no grep and zabbix?". ELK based alerts and metrics.
"How about no grep and zabbix?". ELK based alerts and metrics.Vladimir Pavkin
 
Taking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionTaking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionSplunk
 
Beyond unit tests: Deployment and testing for Hadoop/Spark workflows
Beyond unit tests: Deployment and testing for Hadoop/Spark workflowsBeyond unit tests: Deployment and testing for Hadoop/Spark workflows
Beyond unit tests: Deployment and testing for Hadoop/Spark workflowsDataWorks Summit
 
Treasure Data and AWS - Developers.io 2015
Treasure Data and AWS - Developers.io 2015Treasure Data and AWS - Developers.io 2015
Treasure Data and AWS - Developers.io 2015N Masahiro
 
Ansible for large scale deployment
Ansible for large scale deploymentAnsible for large scale deployment
Ansible for large scale deploymentKarthik .P.R
 
stream-processing-at-linkedin-with-apache-samza
stream-processing-at-linkedin-with-apache-samzastream-processing-at-linkedin-with-apache-samza
stream-processing-at-linkedin-with-apache-samzaAbhishek Shivanna
 
RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...
RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...
RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...InfluxData
 
fluentd -- the missing log collector
fluentd -- the missing log collectorfluentd -- the missing log collector
fluentd -- the missing log collectorMuga Nishizawa
 
How to Make Norikra Perfect
How to Make Norikra PerfectHow to Make Norikra Perfect
How to Make Norikra PerfectSATOSHI TAGOMORI
 
Mozilla - Anurag Phadke - Hadoop World 2010
Mozilla - Anurag Phadke - Hadoop World 2010Mozilla - Anurag Phadke - Hadoop World 2010
Mozilla - Anurag Phadke - Hadoop World 2010Cloudera, Inc.
 
Using HAProxy to Scale MySQL
Using HAProxy to Scale MySQLUsing HAProxy to Scale MySQL
Using HAProxy to Scale MySQLBill Sickles
 
Apache Apex & Bigtop
Apache Apex & BigtopApache Apex & Bigtop
Apache Apex & BigtopApache Apex
 
Native container monitoring
Native container monitoringNative container monitoring
Native container monitoringRohit Jnagal
 
PLNOG 18 - Paweł Małachowski - Spy hard czyli regexpem po pakietach
PLNOG 18 - Paweł Małachowski - Spy hard czyli regexpem po pakietachPLNOG 18 - Paweł Małachowski - Spy hard czyli regexpem po pakietach
PLNOG 18 - Paweł Małachowski - Spy hard czyli regexpem po pakietachPROIDEA
 
Moving Graphs to Production At Scale
Moving Graphs to Production At ScaleMoving Graphs to Production At Scale
Moving Graphs to Production At ScaleNeo4j
 
Open Source Monitoring Tools
Open Source Monitoring ToolsOpen Source Monitoring Tools
Open Source Monitoring Toolsm_richardson
 

What's hot (20)

HBaseCon 2013: How (and Why) Phoenix Puts the SQL Back into NoSQL
HBaseCon 2013: How (and Why) Phoenix Puts the SQL Back into NoSQLHBaseCon 2013: How (and Why) Phoenix Puts the SQL Back into NoSQL
HBaseCon 2013: How (and Why) Phoenix Puts the SQL Back into NoSQL
 
Near-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBaseNear-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBase
 
SQL for Everything at CWT2014
SQL for Everything at CWT2014SQL for Everything at CWT2014
SQL for Everything at CWT2014
 
Testing Rolling Roots
Testing Rolling RootsTesting Rolling Roots
Testing Rolling Roots
 
"How about no grep and zabbix?". ELK based alerts and metrics.
"How about no grep and zabbix?". ELK based alerts and metrics."How about no grep and zabbix?". ELK based alerts and metrics.
"How about no grep and zabbix?". ELK based alerts and metrics.
 
Taking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionTaking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout Session
 
Beyond unit tests: Deployment and testing for Hadoop/Spark workflows
Beyond unit tests: Deployment and testing for Hadoop/Spark workflowsBeyond unit tests: Deployment and testing for Hadoop/Spark workflows
Beyond unit tests: Deployment and testing for Hadoop/Spark workflows
 
Treasure Data and AWS - Developers.io 2015
Treasure Data and AWS - Developers.io 2015Treasure Data and AWS - Developers.io 2015
Treasure Data and AWS - Developers.io 2015
 
Ansible for large scale deployment
Ansible for large scale deploymentAnsible for large scale deployment
Ansible for large scale deployment
 
stream-processing-at-linkedin-with-apache-samza
stream-processing-at-linkedin-with-apache-samzastream-processing-at-linkedin-with-apache-samza
stream-processing-at-linkedin-with-apache-samza
 
RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...
RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...
RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...
 
fluentd -- the missing log collector
fluentd -- the missing log collectorfluentd -- the missing log collector
fluentd -- the missing log collector
 
How to Make Norikra Perfect
How to Make Norikra PerfectHow to Make Norikra Perfect
How to Make Norikra Perfect
 
Mozilla - Anurag Phadke - Hadoop World 2010
Mozilla - Anurag Phadke - Hadoop World 2010Mozilla - Anurag Phadke - Hadoop World 2010
Mozilla - Anurag Phadke - Hadoop World 2010
 
Using HAProxy to Scale MySQL
Using HAProxy to Scale MySQLUsing HAProxy to Scale MySQL
Using HAProxy to Scale MySQL
 
Apache Apex & Bigtop
Apache Apex & BigtopApache Apex & Bigtop
Apache Apex & Bigtop
 
Native container monitoring
Native container monitoringNative container monitoring
Native container monitoring
 
PLNOG 18 - Paweł Małachowski - Spy hard czyli regexpem po pakietach
PLNOG 18 - Paweł Małachowski - Spy hard czyli regexpem po pakietachPLNOG 18 - Paweł Małachowski - Spy hard czyli regexpem po pakietach
PLNOG 18 - Paweł Małachowski - Spy hard czyli regexpem po pakietach
 
Moving Graphs to Production At Scale
Moving Graphs to Production At ScaleMoving Graphs to Production At Scale
Moving Graphs to Production At Scale
 
Open Source Monitoring Tools
Open Source Monitoring ToolsOpen Source Monitoring Tools
Open Source Monitoring Tools
 

Viewers also liked

How to mitigate risk in the age of the cloud
How to mitigate risk in the age of the cloudHow to mitigate risk in the age of the cloud
How to mitigate risk in the age of the cloudJames Sankar
 
The Pacific Research Platform
The Pacific Research PlatformThe Pacific Research Platform
The Pacific Research PlatformLarry Smarr
 
2011-13CENICAR
2011-13CENICAR2011-13CENICAR
2011-13CENICARJanis Cortese
 
Spain and the Canaries
Spain and the CanariesSpain and the Canaries
Spain and the Canariesmrfserafin
 
GOLE Next Generation Architecture Task Force
GOLE Next Generation Architecture Task ForceGOLE Next Generation Architecture Task Force
GOLE Next Generation Architecture Task ForceCANARIE Inc.
 
CANARIE - What Do I Need to Connect with eduroam and Shibboleth
CANARIE - What Do I Need to Connect with eduroam and ShibbolethCANARIE - What Do I Need to Connect with eduroam and Shibboleth
CANARIE - What Do I Need to Connect with eduroam and ShibbolethChris Phillips
 
30 best Creative, Design & Marketing 
Quotes
30 best Creative, Design & Marketing 
Quotes30 best Creative, Design & Marketing 
Quotes
30 best Creative, Design & Marketing 
QuotesMike Hendrixen
 

Viewers also liked (7)

How to mitigate risk in the age of the cloud
How to mitigate risk in the age of the cloudHow to mitigate risk in the age of the cloud
How to mitigate risk in the age of the cloud
 
The Pacific Research Platform
The Pacific Research PlatformThe Pacific Research Platform
The Pacific Research Platform
 
2011-13CENICAR
2011-13CENICAR2011-13CENICAR
2011-13CENICAR
 
Spain and the Canaries
Spain and the CanariesSpain and the Canaries
Spain and the Canaries
 
GOLE Next Generation Architecture Task Force
GOLE Next Generation Architecture Task ForceGOLE Next Generation Architecture Task Force
GOLE Next Generation Architecture Task Force
 
CANARIE - What Do I Need to Connect with eduroam and Shibboleth
CANARIE - What Do I Need to Connect with eduroam and ShibbolethCANARIE - What Do I Need to Connect with eduroam and Shibboleth
CANARIE - What Do I Need to Connect with eduroam and Shibboleth
 
30 best Creative, Design & Marketing 
Quotes
30 best Creative, Design & Marketing 
Quotes30 best Creative, Design & Marketing 
Quotes
30 best Creative, Design & Marketing 
Quotes
 

Similar to Nagios Conference 2013 - Janice Singh - Visualization of Monitoring Data at the NASA Advanced Supercomputing Facilityuting facility

Nagios Conference 2014 - Janice Singh - Real World Uses for Nagios APIs
Nagios Conference 2014 - Janice Singh - Real World Uses for Nagios APIsNagios Conference 2014 - Janice Singh - Real World Uses for Nagios APIs
Nagios Conference 2014 - Janice Singh - Real World Uses for Nagios APIsNagios
 
Monitoring at/with SUSE 2015
Monitoring at/with SUSE 2015Monitoring at/with SUSE 2015
Monitoring at/with SUSE 2015Lars Vogdt
 
Janice Singh - Writing Custom Nagios Plugins
Janice Singh - Writing Custom Nagios PluginsJanice Singh - Writing Custom Nagios Plugins
Janice Singh - Writing Custom Nagios PluginsNagios
 
Nagios Conference 2014 - Leland Lammert - Distributed Heirarchical Nagios
Nagios Conference 2014 - Leland Lammert - Distributed Heirarchical NagiosNagios Conference 2014 - Leland Lammert - Distributed Heirarchical Nagios
Nagios Conference 2014 - Leland Lammert - Distributed Heirarchical NagiosNagios
 
DNS_Tutorial 2.pptx
DNS_Tutorial 2.pptxDNS_Tutorial 2.pptx
DNS_Tutorial 2.pptxviditsir
 
Zabbix Monitoring Platform
Zabbix Monitoring Platform Zabbix Monitoring Platform
Zabbix Monitoring Platform Seyedmajid Etehadi
 
Spy hard, challenges of 100G deep packet inspection on x86 platform
Spy hard, challenges of 100G deep packet inspection on x86 platformSpy hard, challenges of 100G deep packet inspection on x86 platform
Spy hard, challenges of 100G deep packet inspection on x86 platformRedge Technologies
 
Nagios Conference 2014 - Rob Hassing - How To Maintain Over 20 Monitoring App...
Nagios Conference 2014 - Rob Hassing - How To Maintain Over 20 Monitoring App...Nagios Conference 2014 - Rob Hassing - How To Maintain Over 20 Monitoring App...
Nagios Conference 2014 - Rob Hassing - How To Maintain Over 20 Monitoring App...Nagios
 
Open Cloud BBQ - Nano Server
Open Cloud BBQ - Nano ServerOpen Cloud BBQ - Nano Server
Open Cloud BBQ - Nano ServerTomica Kaniski
 
Infrastructure Around Hadoop
Infrastructure Around HadoopInfrastructure Around Hadoop
Infrastructure Around HadoopDataWorks Summit
 
MAGPI: Advanced Services: IPv6, Multicast, DNSSEC
MAGPI: Advanced Services: IPv6, Multicast, DNSSECMAGPI: Advanced Services: IPv6, Multicast, DNSSEC
MAGPI: Advanced Services: IPv6, Multicast, DNSSECShumon Huque
 
OSMC 2010 | Monitoring mit Icinga by Icinga Team
OSMC 2010 | Monitoring mit Icinga by Icinga TeamOSMC 2010 | Monitoring mit Icinga by Icinga Team
OSMC 2010 | Monitoring mit Icinga by Icinga TeamNETWAYS
 
Webinar - DreamObjects/Ceph Case Study
Webinar - DreamObjects/Ceph Case StudyWebinar - DreamObjects/Ceph Case Study
Webinar - DreamObjects/Ceph Case StudyCeph Community
 
MariaDB with SphinxSE
MariaDB with SphinxSEMariaDB with SphinxSE
MariaDB with SphinxSEColin Charles
 
Gpfs introandsetup
Gpfs introandsetupGpfs introandsetup
Gpfs introandsetupasihan
 
Membase Intro from Membase Meetup San Francisco
Membase Intro from Membase Meetup San FranciscoMembase Intro from Membase Meetup San Francisco
Membase Intro from Membase Meetup San FranciscoMembase
 
Nagios Conference 2014 - Frank Pantaleo - Nagios Monitoring of Netezza Databases
Nagios Conference 2014 - Frank Pantaleo - Nagios Monitoring of Netezza DatabasesNagios Conference 2014 - Frank Pantaleo - Nagios Monitoring of Netezza Databases
Nagios Conference 2014 - Frank Pantaleo - Nagios Monitoring of Netezza DatabasesNagios
 
Alexander Naydenko - Nagios to Zabbix Migration | ZabConf2016
Alexander Naydenko - Nagios to Zabbix Migration | ZabConf2016Alexander Naydenko - Nagios to Zabbix Migration | ZabConf2016
Alexander Naydenko - Nagios to Zabbix Migration | ZabConf2016Zabbix
 
(ATS6-PLAT06) Maximizing AEP Performance
(ATS6-PLAT06) Maximizing AEP Performance(ATS6-PLAT06) Maximizing AEP Performance
(ATS6-PLAT06) Maximizing AEP PerformanceBIOVIA
 

Similar to Nagios Conference 2013 - Janice Singh - Visualization of Monitoring Data at the NASA Advanced Supercomputing Facilityuting facility (20)

Nagios Conference 2014 - Janice Singh - Real World Uses for Nagios APIs
Nagios Conference 2014 - Janice Singh - Real World Uses for Nagios APIsNagios Conference 2014 - Janice Singh - Real World Uses for Nagios APIs
Nagios Conference 2014 - Janice Singh - Real World Uses for Nagios APIs
 
Monitoring at/with SUSE 2015
Monitoring at/with SUSE 2015Monitoring at/with SUSE 2015
Monitoring at/with SUSE 2015
 
Janice Singh - Writing Custom Nagios Plugins
Janice Singh - Writing Custom Nagios PluginsJanice Singh - Writing Custom Nagios Plugins
Janice Singh - Writing Custom Nagios Plugins
 
Nagios Conference 2014 - Leland Lammert - Distributed Heirarchical Nagios
Nagios Conference 2014 - Leland Lammert - Distributed Heirarchical NagiosNagios Conference 2014 - Leland Lammert - Distributed Heirarchical Nagios
Nagios Conference 2014 - Leland Lammert - Distributed Heirarchical Nagios
 
DNS_Tutorial 2.pptx
DNS_Tutorial 2.pptxDNS_Tutorial 2.pptx
DNS_Tutorial 2.pptx
 
Zabbix Monitoring Platform
Zabbix Monitoring Platform Zabbix Monitoring Platform
Zabbix Monitoring Platform
 
Spy hard, challenges of 100G deep packet inspection on x86 platform
Spy hard, challenges of 100G deep packet inspection on x86 platformSpy hard, challenges of 100G deep packet inspection on x86 platform
Spy hard, challenges of 100G deep packet inspection on x86 platform
 
Nagios Conference 2014 - Rob Hassing - How To Maintain Over 20 Monitoring App...
Nagios Conference 2014 - Rob Hassing - How To Maintain Over 20 Monitoring App...Nagios Conference 2014 - Rob Hassing - How To Maintain Over 20 Monitoring App...
Nagios Conference 2014 - Rob Hassing - How To Maintain Over 20 Monitoring App...
 
Open Cloud BBQ - Nano Server
Open Cloud BBQ - Nano ServerOpen Cloud BBQ - Nano Server
Open Cloud BBQ - Nano Server
 
M3 nfs fs-3.2.1
M3 nfs fs-3.2.1M3 nfs fs-3.2.1
M3 nfs fs-3.2.1
 
Infrastructure Around Hadoop
Infrastructure Around HadoopInfrastructure Around Hadoop
Infrastructure Around Hadoop
 
MAGPI: Advanced Services: IPv6, Multicast, DNSSEC
MAGPI: Advanced Services: IPv6, Multicast, DNSSECMAGPI: Advanced Services: IPv6, Multicast, DNSSEC
MAGPI: Advanced Services: IPv6, Multicast, DNSSEC
 
OSMC 2010 | Monitoring mit Icinga by Icinga Team
OSMC 2010 | Monitoring mit Icinga by Icinga TeamOSMC 2010 | Monitoring mit Icinga by Icinga Team
OSMC 2010 | Monitoring mit Icinga by Icinga Team
 
Webinar - DreamObjects/Ceph Case Study
Webinar - DreamObjects/Ceph Case StudyWebinar - DreamObjects/Ceph Case Study
Webinar - DreamObjects/Ceph Case Study
 
MariaDB with SphinxSE
MariaDB with SphinxSEMariaDB with SphinxSE
MariaDB with SphinxSE
 
Gpfs introandsetup
Gpfs introandsetupGpfs introandsetup
Gpfs introandsetup
 
Membase Intro from Membase Meetup San Francisco
Membase Intro from Membase Meetup San FranciscoMembase Intro from Membase Meetup San Francisco
Membase Intro from Membase Meetup San Francisco
 
Nagios Conference 2014 - Frank Pantaleo - Nagios Monitoring of Netezza Databases
Nagios Conference 2014 - Frank Pantaleo - Nagios Monitoring of Netezza DatabasesNagios Conference 2014 - Frank Pantaleo - Nagios Monitoring of Netezza Databases
Nagios Conference 2014 - Frank Pantaleo - Nagios Monitoring of Netezza Databases
 
Alexander Naydenko - Nagios to Zabbix Migration | ZabConf2016
Alexander Naydenko - Nagios to Zabbix Migration | ZabConf2016Alexander Naydenko - Nagios to Zabbix Migration | ZabConf2016
Alexander Naydenko - Nagios to Zabbix Migration | ZabConf2016
 
(ATS6-PLAT06) Maximizing AEP Performance
(ATS6-PLAT06) Maximizing AEP Performance(ATS6-PLAT06) Maximizing AEP Performance
(ATS6-PLAT06) Maximizing AEP Performance
 

More from Nagios

Nagios XI Best Practices
Nagios XI Best PracticesNagios XI Best Practices
Nagios XI Best PracticesNagios
 
Jesse Olson - Nagios Log Server Architecture Overview
Jesse Olson - Nagios Log Server Architecture OverviewJesse Olson - Nagios Log Server Architecture Overview
Jesse Olson - Nagios Log Server Architecture OverviewNagios
 
Trevor McDonald - Nagios XI Under The Hood
Trevor McDonald  - Nagios XI Under The HoodTrevor McDonald  - Nagios XI Under The Hood
Trevor McDonald - Nagios XI Under The HoodNagios
 
Sean Falzon - Nagios - Resilient Notifications
Sean Falzon - Nagios - Resilient NotificationsSean Falzon - Nagios - Resilient Notifications
Sean Falzon - Nagios - Resilient NotificationsNagios
 
Marcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise Edition
Marcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise EditionMarcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise Edition
Marcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise EditionNagios
 
Dave Williams - Nagios Log Server - Practical Experience
Dave Williams - Nagios Log Server - Practical ExperienceDave Williams - Nagios Log Server - Practical Experience
Dave Williams - Nagios Log Server - Practical ExperienceNagios
 
Mike Weber - Nagios and Group Deployment of Service Checks
Mike Weber - Nagios and Group Deployment of Service ChecksMike Weber - Nagios and Group Deployment of Service Checks
Mike Weber - Nagios and Group Deployment of Service ChecksNagios
 
Mike Guthrie - Revamping Your 10 Year Old Nagios Installation
Mike Guthrie - Revamping Your 10 Year Old Nagios InstallationMike Guthrie - Revamping Your 10 Year Old Nagios Installation
Mike Guthrie - Revamping Your 10 Year Old Nagios InstallationNagios
 
Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...
Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...
Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...Nagios
 
Matt Bruzek - Monitoring Your Public Cloud With Nagios
Matt Bruzek - Monitoring Your Public Cloud With NagiosMatt Bruzek - Monitoring Your Public Cloud With Nagios
Matt Bruzek - Monitoring Your Public Cloud With NagiosNagios
 
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.Nagios
 
Eric Loyd - Fractal Nagios
Eric Loyd - Fractal NagiosEric Loyd - Fractal Nagios
Eric Loyd - Fractal NagiosNagios
 
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...Nagios
 
Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...
Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...
Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...Nagios
 
Nagios World Conference 2015 - Scott Wilkerson Opening
Nagios World Conference 2015 - Scott Wilkerson OpeningNagios World Conference 2015 - Scott Wilkerson Opening
Nagios World Conference 2015 - Scott Wilkerson OpeningNagios
 
Nrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios Core
Nrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios CoreNrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios Core
Nrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios CoreNagios
 
Nagios Log Server - Features
Nagios Log Server - FeaturesNagios Log Server - Features
Nagios Log Server - FeaturesNagios
 
Nagios Network Analyzer - Features
Nagios Network Analyzer - FeaturesNagios Network Analyzer - Features
Nagios Network Analyzer - FeaturesNagios
 
Nagios Conference 2014 - Dorance Martinez Cortes - Customizing Nagios
Nagios Conference 2014 - Dorance Martinez Cortes - Customizing NagiosNagios Conference 2014 - Dorance Martinez Cortes - Customizing Nagios
Nagios Conference 2014 - Dorance Martinez Cortes - Customizing NagiosNagios
 
Nagios Conference 2014 - Mike Weber - Nagios Rapid Deployment Options
Nagios Conference 2014 - Mike Weber - Nagios Rapid Deployment OptionsNagios Conference 2014 - Mike Weber - Nagios Rapid Deployment Options
Nagios Conference 2014 - Mike Weber - Nagios Rapid Deployment OptionsNagios
 

More from Nagios (20)

Nagios XI Best Practices
Nagios XI Best PracticesNagios XI Best Practices
Nagios XI Best Practices
 
Jesse Olson - Nagios Log Server Architecture Overview
Jesse Olson - Nagios Log Server Architecture OverviewJesse Olson - Nagios Log Server Architecture Overview
Jesse Olson - Nagios Log Server Architecture Overview
 
Trevor McDonald - Nagios XI Under The Hood
Trevor McDonald  - Nagios XI Under The HoodTrevor McDonald  - Nagios XI Under The Hood
Trevor McDonald - Nagios XI Under The Hood
 
Sean Falzon - Nagios - Resilient Notifications
Sean Falzon - Nagios - Resilient NotificationsSean Falzon - Nagios - Resilient Notifications
Sean Falzon - Nagios - Resilient Notifications
 
Marcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise Edition
Marcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise EditionMarcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise Edition
Marcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise Edition
 
Dave Williams - Nagios Log Server - Practical Experience
Dave Williams - Nagios Log Server - Practical ExperienceDave Williams - Nagios Log Server - Practical Experience
Dave Williams - Nagios Log Server - Practical Experience
 
Mike Weber - Nagios and Group Deployment of Service Checks
Mike Weber - Nagios and Group Deployment of Service ChecksMike Weber - Nagios and Group Deployment of Service Checks
Mike Weber - Nagios and Group Deployment of Service Checks
 
Mike Guthrie - Revamping Your 10 Year Old Nagios Installation
Mike Guthrie - Revamping Your 10 Year Old Nagios InstallationMike Guthrie - Revamping Your 10 Year Old Nagios Installation
Mike Guthrie - Revamping Your 10 Year Old Nagios Installation
 
Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...
Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...
Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...
 
Matt Bruzek - Monitoring Your Public Cloud With Nagios
Matt Bruzek - Monitoring Your Public Cloud With NagiosMatt Bruzek - Monitoring Your Public Cloud With Nagios
Matt Bruzek - Monitoring Your Public Cloud With Nagios
 
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.
 
Eric Loyd - Fractal Nagios
Eric Loyd - Fractal NagiosEric Loyd - Fractal Nagios
Eric Loyd - Fractal Nagios
 
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...
 
Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...
Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...
Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...
 
Nagios World Conference 2015 - Scott Wilkerson Opening
Nagios World Conference 2015 - Scott Wilkerson OpeningNagios World Conference 2015 - Scott Wilkerson Opening
Nagios World Conference 2015 - Scott Wilkerson Opening
 
Nrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios Core
Nrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios CoreNrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios Core
Nrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios Core
 
Nagios Log Server - Features
Nagios Log Server - FeaturesNagios Log Server - Features
Nagios Log Server - Features
 
Nagios Network Analyzer - Features
Nagios Network Analyzer - FeaturesNagios Network Analyzer - Features
Nagios Network Analyzer - Features
 
Nagios Conference 2014 - Dorance Martinez Cortes - Customizing Nagios
Nagios Conference 2014 - Dorance Martinez Cortes - Customizing NagiosNagios Conference 2014 - Dorance Martinez Cortes - Customizing Nagios
Nagios Conference 2014 - Dorance Martinez Cortes - Customizing Nagios
 
Nagios Conference 2014 - Mike Weber - Nagios Rapid Deployment Options
Nagios Conference 2014 - Mike Weber - Nagios Rapid Deployment OptionsNagios Conference 2014 - Mike Weber - Nagios Rapid Deployment Options
Nagios Conference 2014 - Mike Weber - Nagios Rapid Deployment Options
 

Recently uploaded

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 

Recently uploaded (20)

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 

Nagios Conference 2013 - Janice Singh - Visualization of Monitoring Data at the NASA Advanced Supercomputing Facilityuting facility

  • 1. Visualization of Monitoring Data at the NASA Advanced Supercomputing Facility Janice Singh janice.s.singh@nasa.gov
  • 2. NASA Advanced Supercomputing Facility • Pleiades – 11,136-node SGI ICE supercluster 162,496 cores total (32,768 additional GPU cores) • Tape Storage – pDMF cluster • NFS servers – home filesystems on computing systems • Lustre Filesystems – each filesystem is multiple servers • PBS (Portable Batch System) – job scheduler that runs on computing systems Ref: http://www.nas.nasa.gov/hecc/ Janice Singh - janice.s.singh@nasa.gov 2
  • 3. Why Visualization is Needed • 24 x 7 Help Desk – need a quick overview of system status • but still more specific than nagios visualization – not just single status per host, but sub-groups per host – they assess situations before calling next level of support • automatic alerts from nagios are not as selective – interrelated issues • allows us to see how many systems affected by one issue Janice Singh - janice.s.singh@nasa.gov 3
  • 4. Heads Up Display – For Staff Janice Singh - janice.s.singh@nasa.gov 4
  • 5. Heads Up Display – For Staff (details) Janice Singh - janice.s.singh@nasa.gov 5
  • 6. Heads Up Display – for Users Janice Singh - janice.s.singh@nasa.gov 6
  • 7. All the Parts • nagios • nrpe • nsca • datagg (in-house software) • apache • perl/cgi-bin Janice Singh - janice.s.singh@nasa.gov 7
  • 8. Data flow network firewall (The Enclave) nsca nsca nrpe ssh Cluster Compute Node Remote Node Web Server nagios nsca nagios.cmd datagg nagios2.cmd nagios nagios nagios web interface HUD orange - pipe file green - flat file purple - web site HUD buffer nrpe nrpe Dedicated Nagios Node Janice Singh - janice.s.singh@nasa.gov 8
  • 9. Nagios (and add-ons) setup The Basics •the webserver and the main nagios server are the same machine •there is a network firewall called “the enclave” – most compute nodes are inside the enclave – the webserver can only receive data from the enclave, not send •the servers within the enclave send nagios data to the webserver via nsca •for the servers outside the enclave, nrpe is used •plugins written in Perl using Nagios::Plugin Janice Singh - janice.s.singh@nasa.gov 9
  • 10. Nagios (and add-ons) setup, cont. Versions •when I inherited the systems, they were all using nagios 2.10 – most systems have been upgraded to nagios 3.4+ – all new systems have nagios 3.5 – webserver is still using 2.10 •nsca 2.7.2 across the board Janice Singh - janice.s.singh@nasa.gov 10
  • 11. Nagios (and add-ons) setup, cont. Clusters within the enclave •there is one host that is considered the Dedicated Nagios Server •the rest of the hosts are monitored using nrpe •exceptions on Pleiades: – there are many hosts monitored that get reimaged often • difficult to administer nrpe • use check_by_ssh – ssh is flaky under nagios 3, so it still uses nagios 2.10 • will randomly give the error: Could not open pipe: /usr/bin/ssh – use 2 Dedicated Nagios Servers • so many checks that there was unacceptable latency – tests that should run every 2 mins were running every 30 mins Janice Singh - janice.s.singh@nasa.gov 11
  • 12. datagg (DATa AGGregator) • why it was needed (i.e. what nagios couldn’t do for us) – error summaries of nagios problems • the nagios webpage does tell number of alerts per service group, but that cannot be leveraged via API Janice Singh - janice.s.singh@nasa.gov 12
  • 13. datagg (DATa AGGregator), cont. • why it was needed (i.e. what nagios couldn’t do for us) – parsing data about Portable Batch System (PBS) – piecing together large outputs from NSCA • current output from PBS: 404,937 characters Janice Singh - janice.s.singh@nasa.gov 13
  • 14. datagg (DATa AGGregator), cont. • why it was needed (i.e. what nagios couldn’t do for us) – mapping service nodes to the appropriate Lustre filesystem • they are in a servicegroup (but not available via API) Janice Singh - janice.s.singh@nasa.gov 14
  • 15. datagg (DATa AGGregator), cont. • in-house written perl script – it reads the command file (pipe) that nsca creates on the webserver – the nagios configuration on the webserver also writes to the nsca command file – it aggregates the data that it reads in from the pipe and writes it out to a flat file referred to as the “HUD buffer” • The data read in is in the format: [$timestamp] PROCESS_SERVICE_CHECK_RESULT; $hostname; $service_description; $state; $output Janice Singh - janice.s.singh@nasa.gov 15
  • 16. HUD Buffer • is a Windows-style .ini file – sections • used to group together the boxes (or sub-boxes) • ex: [pleiades daemons] – keys • name=value – this is where we put the nagios state and output of plugin – every section also has the key Error Summary Janice Singh - janice.s.singh@nasa.gov 16
  • 17. Data flow network firewall (The Enclave) nsca nsca nrpe ssh Cluster Compute Node Remote Node Web Server nagios nsca nagios.cmd datagg nagios2.cmd nagios nagios nagios web interface HUD orange - pipe file green - flat file purple - web site HUD buffer nrpe nrpe Dedicated Nagios Node Janice Singh - janice.s.singh@nasa.gov 17
  • 18. Displays • Two versions – Internal (aka Staff HUD) on the internal network • Staff HUD and nagios web interface • clicking on Staff HUD goes to the nagios web page – for the service checks for the host or service group – the nagios web interface gives more details on the plugin output than is displayed on the HUD – the nagios web interface is used to suspend/restart notifications – External (aka miniHUD) – Permissions set in the Apache config file • Both written in Perl using cgi-bin Janice Singh - janice.s.singh@nasa.gov 18
  • 19. Future Plans • use a database to collect data – this will also allow us to have historical data beyond what is in the logs, which can be used in graphing – will eliminate the need for a flat file Janice Singh - janice.s.singh@nasa.gov 19
  • 20. Future Plans • things that would be great to see in nagios 4 – an API – make a nagios check run “on demand” – no more random “could not open pipe” errors – lower latency (we’re having problems with less than 600 checks!) – error summaries – way to send large amounts of data via nsca – a way to see the exact command that nagios ran on the nagios webpage Janice Singh - janice.s.singh@nasa.gov 20
  • 21. Questions? Janice Singh - janice.s.singh@nasa.gov 21