Building Bridges
The System Administration Tools and
Techniques Used to Deploy Bridges
Richard Underwood
HPC System Administrator
richardu@psc.edu
© 2010 Pittsburgh Supercomputing Center
© 2017 Pittsburgh Supercomputing Center
What is Bridges?
• NSF Funded
– NSF Award #1445606
– Omni-Path Connected
– 908 Nodes
    • 752 Compute
    • 48 GPU
    • 42 3TB Large Memory
    • 4 12TB Large Memory
    • 42 Administration and Application Servers
    • 20 Storage Servers
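As a quick sanity check, the per-type counts listed above should add up to the stated total. A minimal sketch; the dictionary keys are only labels chosen here for illustration, the numbers come from the slide:

```python
# Node counts as listed on the slide; key names are illustrative labels.
node_counts = {
    "compute": 752,
    "gpu": 48,
    "large_memory_3tb": 42,
    "large_memory_12tb": 4,
    "admin_and_application": 42,
    "storage": 20,
}

total_nodes = sum(node_counts.values())
print(total_nodes)  # 908, matching the "908 Nodes" bullet
```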
What is Bridges? (cont)
• Focuses on new user communities
– Groups and disciplines new to supercomputing
• Wide variety of computing programs and paradigms
– MPI, Hadoop, Virtual Machines, science gateways, etc
• Bridges Virtual Tour
– https://www.psc.edu/bridges-virtual-tour
Scheduling - Slurm
• Simple Linux Utility for Resource Management
• Replaces PBS as the standard for many large installations
these days
– XSEDE especially
• Topology Aware
– Bridges has islands of high connectivity with weak overall
connectivity
• Command line tools similar to PBS (qstat vs. squeue and sinfo)
• Provides QoS tools that let us enforce policies such as
anti-stuffing and debugging queues
Booting - OpenStack Ironic
• OpenStack Ironic is OpenStack’s service for booting physical
hardware from OpenStack’s own images
• All Bridges nodes (except 3 bootstrap nodes and
the 12TB nodes) are booted through Ironic
• Benefits
– Fast deployment
– Heterogeneity
– Fast redeployment
• Shortcomings
– We were one of the first large deployments at this scale
– PXE boot only (just recently got local disk boot to work)
Monitoring and Reporting Workflow
Configuration Management
Puppet – Configuration Management
• Flexible configuration management tool
• Bridges uses Puppet to configure and customize most of the
system
– packages, Omni-Path, file system mounts, swap, utilities (NTP,
Duo), Naemon
• Handles a plethora of machine types
– Administration, compute (3 flavors), GPU, OpenStack hypervisors,
storage nodes, and data transfer nodes, among others
– Done through node type inheritance and overrides
• Allows for quick rebuilds and quick repurposing
• Puppet is extended by a number of tools: puppetdb,
puppetboard, mcollective
PuppetDB
• Puts a database backend behind Puppet to keep reports on the
state of the machines under its purview
• Fully searchable through a RESTful interface
– Everything from reports to facts (per-system variables) about
each system
• Allows for exported resources (a way to pool resources from
many hosts onto one host)
– ssh keys from each host to a specific master host
– monitoring instructions (Nagios/Naemon)
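The RESTful search mentioned above can be sketched as follows. This only builds a query URL for the PuppetDB v4 facts endpoint and does not contact a server; the host name `puppetdb.example.org`, the port, and the fact name are placeholders, not values from the Bridges deployment:

```python
import json
from urllib.parse import urlencode


def facts_query_url(base_url, fact_name):
    """Build a PuppetDB v4 query URL asking for one fact across all nodes."""
    # PuppetDB queries are JSON arrays in prefix form: [operator, field, value].
    query = ["=", "name", fact_name]
    return base_url + "/pdb/query/v4/facts?" + urlencode({"query": json.dumps(query)})


# Hypothetical server address and fact name, for illustration only.
url = facts_query_url("http://puppetdb.example.org:8080", "kernelrelease")
print(url)
```

Fetching that URL with any HTTP client should return a JSON list with one entry per node, which is what makes PuppetDB convenient as an inventory source for scripts.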
Puppetboard
• Web frontend for PuppetDB
– Puppet Enterprise has an equivalent, but it is not free
• Makes PuppetDB data easily viewable
• Full searching of facts and reports
• Easy to see how machines are reacting to Puppet
• Status only: nothing can be changed from the website
Puppetboard – Cont.
mCollective
• Marionette Collective
• Controls Puppet agents from the mco command line tool
• Uses a middleware layer: ActiveMQ
• Allows you to update multiple hosts at once, e.g. to push out an
important update
• mco is extensible: nrpe, rpc, services, facts
Monitoring
Monitoring - Historical
• Monitoring at PSC
– Done with Mon in the past
    • Easy to configure
    • Works with pagers
    • Doesn’t break down
    • Not good for large systems with lots of parts
• Nagios
– Standard monitoring tool
– highly extensible
– harder to configure (than mon)
– scales decently (can slow down with really large instances)
– Community version no longer being developed
– Can be automatically configured with Puppet / Puppetdb
Monitoring - Naemon
• Fork of Nagios
– Created by developers and users of Nagios
– Forked after work on Nagios XI halted development of the
community edition
• Similar to MySQL -> MariaDB
– Many other forks exist (Icinga, Centreon, Shinken)
• Benefits we liked
– input engine parallelized
– enhanced grouping
– drop in replacement for nagios
– LiveStatus database-like interface to current and historical data
• Allows for complicated node-up introspection scripts
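To give a feel for the LiveStatus "database-like interface", here is a minimal sketch that only builds a query in Livestatus Query Language (LQL). It does not open the LiveStatus socket, and the table, columns, and filter are chosen purely for illustration:

```python
def livestatus_query(table, columns, filters=()):
    """Build an LQL request: a GET line, header lines, and a blank-line terminator."""
    lines = ["GET " + table, "Columns: " + " ".join(columns)]
    lines += ["Filter: " + condition for condition in filters]
    lines.append("OutputFormat: json")
    # A LiveStatus request is terminated by an empty line.
    return "\n".join(lines) + "\n\n"


# Example: names of hosts whose state is 1 (DOWN in Nagios/Naemon host states).
query = livestatus_query("hosts", ["name", "state"], ["state = 1"])
print(query)
```

A node-up script could write this text to the LiveStatus socket and parse the JSON reply; where that socket lives depends on the local Naemon configuration.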
Puppet – Naemon Integration
• Puppet defines naemon rules for each host
• When Puppet runs on a host, these rules are loaded into
PuppetDB (exported resources)
• When the naemon server runs puppet, these rules are
combined and if there’s a change in the rules, naemon
restarts
• Allows for automatic deployment of naemon rules through
puppet.
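The collect-and-restart logic described above can be illustrated with a small sketch. This is not how Puppet implements exported resources; it only mimics the idea that per-host rule fragments are combined in a stable order and the monitoring daemon restarts only when the combined result changes. The host names and rule text are made up:

```python
import hashlib
from typing import Dict


def combine_rules(fragments: Dict[str, str]) -> str:
    """Concatenate per-host rule fragments in sorted host order (stable output)."""
    return "\n".join(fragments[host] for host in sorted(fragments))


def needs_restart(current_config: str, new_config: str) -> bool:
    """Report whether the combined configuration changed since the last run."""
    def digest(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()
    return digest(current_config) != digest(new_config)


# Hypothetical fragments exported by two hosts.
fragments = {
    "r002": "define host { host_name r002 }",
    "r001": "define host { host_name r001 }",
}
combined = combine_rules(fragments)
print(needs_restart(combined, combined))  # False: nothing changed, no restart
```

Sorting by host name matters here: without a stable order, the same set of rules could produce different files on successive runs and trigger needless restarts.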
Naemon – Checking In through Passive Checks
• Lets the host check in with the naemon master instead of
naemon directing the check
• Useful in cases where the host under test may be under load
• Allows hosts to check in at their leisure
• In the past this was done through NSCA
– Nagios Service Check Acceptor
– Not good for parallelization
– Not maintained
• New method: NRDP - Nagios Remote Data Processor
– Uses Apache over HTTPS to send check results RESTfully
– multiple formats - including JSON and XML
– easier to maintain
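A passive check submission over NRDP might look roughly like the sketch below. It only builds the form body a host would POST to the NRDP URL and sends nothing. The form parameter names (`token`, `cmd`, `json`) and the JSON field names are assumptions based on the NRDP JSON format and should be checked against the NRDP version actually installed; the token, host, and service names are invented:

```python
import json
from urllib.parse import urlencode


def nrdp_form_body(token, hostname, servicename, state, output):
    """Build the URL-encoded form body for one passive service check result."""
    payload = {
        "checkresults": [
            {
                # checktype "1" marks the result as passive (assumed field layout).
                "checkresult": {"type": "service", "checktype": "1"},
                "hostname": hostname,
                "servicename": servicename,
                "state": str(state),  # 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
                "output": output,
            }
        ]
    }
    return urlencode({"token": token, "cmd": "submitcheck", "json": json.dumps(payload)})


# Hypothetical values for illustration.
body = nrdp_form_body("example-token", "r001", "load", 0, "OK - load average 0.50")
print(body)
```

Because the host decides when to send this, a busy compute node can report at its own pace instead of being polled by the monitoring server.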
Monitoring – InfluxDB
• Tool for long term state data retention and visualization
– Lets you make pretty graphs
• Most long term state data tools use RRDs
– Round Robin Databases
– great for looking at trends
– Do not save all data over time
    • older data is pared down to averages over time
• Originally we used OpenTSDB
– did not scale well to a large cluster or even a small cluster for that
matter
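The point about RRDs paring data down to averages can be shown with a toy rollup. This is not RRDtool's actual consolidation code, only a sketch of the effect, with invented sample values:

```python
def consolidate(samples, bucket_size):
    """Average consecutive samples in fixed-size buckets, as an RRD-style rollup does."""
    buckets = [samples[i:i + bucket_size] for i in range(0, len(samples), bucket_size)]
    return [sum(bucket) / len(bucket) for bucket in buckets]


# Six raw samples with a short spike to 10.0 in the first half.
raw = [1.0, 1.0, 10.0, 2.0, 2.0, 2.0]
print(consolidate(raw, 3))  # [4.0, 2.0]
```

After consolidation the spike is no longer visible, only a raised average, which is why a time series database that retains the raw points is attractive for debugging.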
InfluxDB - Cont
• Each host runs a local client that collects the data and pushes it
up to the InfluxDB server
• Very similar to Naemon Passive checks
• We would like to use RabbitMQ to handle all messages
from hosts
– Push logs to ELK stack
– Push host status to Nagios
– Push state data to InfluxDB
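What a host-side client pushes to InfluxDB is typically text in InfluxDB's line protocol. The sketch below formats one data point and sends nothing; it skips the escaping that real tag and field values with spaces or commas would need, and the measurement, tag, and field names are invented:

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as: measurement,tag=value field=value timestamp."""
    def format_field(value):
        if isinstance(value, bool):  # check bool first: bool is a subclass of int
            return "true" if value else "false"
        if isinstance(value, int):
            return str(value) + "i"  # integers carry an "i" suffix
        if isinstance(value, float):
            return repr(value)
        return '"' + str(value).replace('"', '\\"') + '"'

    tag_part = "".join(",{}={}".format(key, tags[key]) for key in sorted(tags))
    field_part = ",".join("{}={}".format(k, format_field(v)) for k, v in fields.items())
    return "{}{} {} {}".format(measurement, tag_part, field_part, timestamp_ns)


# Hypothetical load sample from a compute node.
line = to_line_protocol(
    "load", {"host": "r001", "type": "compute"}, {"load1": 0.5, "procs": 12},
    1500000000000000000,
)
print(line)  # load,host=r001,type=compute load1=0.5,procs=12i 1500000000000000000
```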
Grafana
• Visualization Front end for InfluxDB
• Allows for multiple graphs to be tracked over time
Grafana
Grafana Cont
Logging
• rsyslog
– Standard on most Linux systems
• Great for retention
• Bad for Searching
• Bad for Analyzing Patterns
• Splunk
– Uses its own proprietary indexing and search engine
• Great for searching
• Great for Analyzing
• Expensive
• Data potentially stored in the cloud
Logging – ELK Stack
• Replacement for Splunk
– Harder to set up than Splunk
– Free
    • Apache 2.0 License
– Web GUI front end
• ELK stands for
– Elasticsearch
    • Search Engine for Logging
    • Hadoop Extensions
    • Relevance Algorithm
– Logstash
– Kibana
ELK Stack - Cont
• Elasticsearch
– Search Engine for Logging
– Hadoop Extensions – Parallelizable
– Relevance Algorithm
• Logstash
– Replacement for rsyslog – Stores all events as objects
– Hadoop Extensions – Parallelizable
• Kibana
– Web Front End for ELK stack
– Powerful search
– Easy to visualize patterns, and display graphs of data
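"Stores all events as objects" is the key difference from plain rsyslog files. The sketch below is not Logstash or its grok filter; it is a small stand-in that shows the idea of turning a traditional syslog line into a structured record that a search engine can index by field. The log line and the `parse_failure` tag are made up for illustration:

```python
import re

# Traditional syslog layout: "Mon DD HH:MM:SS host program[pid]: message"
SYSLOG_PATTERN = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<program>[^\[:]+)(?:\[(?P<pid>\d+)\])?: "
    r"(?P<message>.*)$"
)


def parse_syslog(line):
    """Return the line as a dict of named fields, or a tagged raw event on failure."""
    match = SYSLOG_PATTERN.match(line)
    if match is None:
        return {"message": line, "tags": ["parse_failure"]}
    return match.groupdict()


event = parse_syslog("Jul 11 09:15:02 r001 sshd[2211]: Accepted publickey for alice")
print(event["host"], event["program"], event["pid"])  # r001 sshd 2211
```

Once events are objects like this, questions such as "all sshd messages from r001 in the last hour" become field queries rather than text searches over flat files.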
Kibana - Cont
Questions?
• I’d love to collaborate with my colleagues at other sites, and
discuss what works for them as well.

PEARC17: Building bridges - The System Administration Tools and Techniques Used to Deploy Bridges
