Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak Ridge National Laboratory) Kafka Summit 2020

ORNL is managed by UT-Battelle LLC for the US Department of Energy
Enabling Insight to Support World-Class
Supercomputing
Stefan Ceballos
Oak Ridge National Laboratory
National Center for Computational Sciences
Notice: This work was supported by the Oak Ridge Leadership Computing Facility (OLCF) at Oak Ridge National
Laboratory (ORNL) for the Department of Energy (DOE) under Prime Contract Number DE-AC05-00OR-22725
2
Topics
• Overview of Oak Ridge National Laboratory (ORNL)
– National Center for Computational Sciences (NCCS) overview
– History and future of high-performance computing (HPC) at ORNL
– External customers as an org, internal customers as a team
• Operational data challenges in HPC
• New team formation
• Lessons learned
• Current use-cases
3
Our vision: Sustain ORNL’s leadership and scientific
impact in computing and computational sciences
• Provide the world’s most powerful open
resources for:
– Scalable computing and simulation
– Data and analytics at any scale
– Scalable cyber-secure infrastructure for science
• Follow a well-defined path for maintaining
world leadership in these critical areas
• Deliver leading-edge science relevant to missions
of DOE and key federal and state agencies
• Build and exploit cross-cutting partnerships
• Attract the brightest talent
• Invest in education and training
4
National Center for Computational Sciences
Home to the OLCF, including Summit
• 65,000 ft² of DC space
– Distributed across 4 controlled-access areas
– 31,000 ft²: Very high load bearing (≥625 lb/ft²)
• 40 MW of high-efficiency highly reliable power
– Tightly integrated into TVA’s
161 kV transmission system
– Diverse medium-voltage distribution system
– High-efficiency 3.0/4.0 MVA transformers
• 6,600 tons of chilled water
– 20 MW cooling capacity at 70 °F
– 7,700 tons evaporative cooling
• Expansion potential: Power infrastructure
to 60 MW, cooling infrastructure to 70 MW
5
ORNL has systematically delivered a series
of leadership-class systems
On scope • On budget • Within schedule
Project phases: OLCF-1, OLCF-2, OLCF-3
1000-fold improvement in 8 years
• 2004: Cray X1E Phoenix, 18.5 TF
• 2005: Cray XT3 Jaguar, 25 TF
• 2006: Cray XT3 Jaguar, 54 TF
• 2007: Cray XT4 Jaguar, 62 TF
• 2008: Cray XT4 Jaguar, 263 TF
• 2008: Cray XT5 Jaguar, 1 PF
• 2009: Cray XT5 Jaguar, 2.3 PF
• 2012: Cray XK7 Titan, 27 PF
Titan turned 6 years old in October 2018
and continued to deliver world-class
science research in support of our user
community until it was decommissioned
on August 2nd 2019
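The "1000-fold improvement in 8 years" label can be sanity-checked with a few lines of arithmetic. This is a minimal sketch, assuming the endpoints are Phoenix at 18.5 TF (2004) and Titan at 27 PF (2012) as the chart labels suggest; the function names are illustrative and not part of the talk.

```python
TF = 1e12  # floating-point operations per second in one teraflop
PF = 1e15  # floating-point operations per second in one petaflop


def fold_improvement(start_flops: float, end_flops: float) -> float:
    """Return how many times faster the end system is than the start system."""
    return end_flops / start_flops


def annual_growth(fold: float, years: int) -> float:
    """Return the constant yearly multiplier that compounds to `fold` over `years`."""
    return fold ** (1.0 / years)


# Assumed endpoints, read off the chart labels.
phoenix_2004 = 18.5 * TF
titan_2012 = 27 * PF

fold = fold_improvement(phoenix_2004, titan_2012)
rate = annual_growth(fold, 2012 - 2004)
print(f"{fold:.0f}x overall, about {rate:.2f}x per year")
```

Under these assumptions the overall ratio comes out above 1000, which is consistent with the slide's claim, and the yearly multiplier shows how steep the sustained growth had to be.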
6
We are building on this record of success
to enable exascale in 2021
• 2012: Cray XK7 Titan, 27 PF
• 2018: IBM Summit, 200 PF (OLCF-4)
• 2021: Frontier, 1.5 EF (OLCF-5)
7
Summit system overview
System performance
• Peak performance of 200 petaflops for M&S; 3.3 exaops (FP16) for data analytics and AI
• Launched in June 2018; ranked #1 on TOP500 list
System elements
• 4608 nodes
• Dual-rail Mellanox EDR InfiniBand network
• 250 PB IBM Spectrum Scale file system transferring data at 2.5 TB/s
Node elements
• 2 IBM POWER9 processors
• 6 NVIDIA Tesla V100 GPUs
• 608 GB of fast memory (96 GB HBM2 + 512 GB DDR4)
• 1.6 TB of non-volatile memory
Summit is the fastest
supercomputer in the world and
has been #1 on the TOP500 list
since its launch in June 2018
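The per-node figures on this slide can be rolled up into system-wide totals. The sketch below is illustrative arithmetic only, assuming decimal units (1 PB = 1,000 TB = 1,000,000 GB) and a file system that sustains its quoted 2.5 TB/s, which real workloads may not reach; the variable names are hypothetical.

```python
# Figures quoted on the slide.
NODES = 4608
GPUS_PER_NODE = 6
HBM2_GB_PER_NODE = 96
DDR4_GB_PER_NODE = 512
FS_CAPACITY_PB = 250
FS_BANDWIDTH_TB_PER_S = 2.5

# Fast memory per node: should match the 608 GB stated on the slide.
node_memory_gb = HBM2_GB_PER_NODE + DDR4_GB_PER_NODE

# System-wide totals.
total_gpus = NODES * GPUS_PER_NODE
total_memory_pb = NODES * node_memory_gb / 1_000_000  # GB -> PB (decimal)

# Time to stream the whole file system once at the quoted bandwidth.
fill_seconds = FS_CAPACITY_PB * 1000 / FS_BANDWIDTH_TB_PER_S
fill_hours = fill_seconds / 3600

print(f"{total_gpus} GPUs, {total_memory_pb:.2f} PB of fast memory")
print(f"{fill_hours:.1f} hours to write {FS_CAPACITY_PB} PB at peak bandwidth")
```

Totals like these give a rough sense of how much telemetry-producing hardware a single system contributes to the operational data discussed later in the talk.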
8
Rhea and Data Transfer Nodes Cluster system overview
CPU Nodes
• 512 compute nodes
• Dual 8-core Xeon (Sandy Bridge) processors
• 128 GB RAM
GPU and Large Mem Nodes
• 9 GPU/LargeMem nodes
• Dual 14-core Xeon processors
• 1 TB RAM
• 2 NVIDIA K80 GPUs
Cluster elements
• Mellanox FDR 4x InfiniBand (56 Gb/sec/direction)
• Mounts Alpine center-wide GPFS file system
• Slurm resource manager
Rhea is used for analysis, visualization, and postprocessing
The Data Transfer Nodes (DTNs) provide interactive and scheduled data transfer services
9
Frontier Overview
Partnership between ORNL, Cray, and AMD
The Frontier system will be delivered in 2021
Peak Performance greater than 1.5 EF
Composed of more than 100 Cray Shasta cabinets
– Connected by Slingshot™ interconnect with adaptive routing, congestion control, and quality of service
Node Architecture:
– An AMD EPYC™ processor and four Radeon Instinct™ GPU accelerators
purpose-built for exascale computing
– Fully connected with high speed AMD Infinity Fabric links
– Coherent memory across the node
– 100 GB/s injection bandwidth
– Near-node NVM storage
Researchers will harness Frontier to advance science in such applications as systems biology,
materials science, energy production, additive manufacturing and health data science.
10
National Climate-Computing Research Center
(NCRC)
• Strategic Partnership Project, currently in year 9
• 5-year periods. Current IAA effective through FY20
• Within ORNL’s National Center for Computational Sciences (NCCS)
• Service provided - DOE-titled equipment
• Secure network enclave; Department of Commerce access policies
• Agreement between NOAA and DOE’s Oak
Ridge National Laboratory for HPC services and
climate modeling support
11
Gaea system evolution, 2010–2019
[Chart: timeline of Gaea system configurations]
• 2010: C1MS, Cray XT6
• 2011: C1MS + C2, Cray XE6
• 2012: C2 + C1, Cray XE6
• 2016: C3, Cray XC40
• 2017–2019: C3 + C4, Cray XC40
Performance labels on the chart: 260 TF, 980 TF, 1.1 PF, 1.77 PF, 4.02 PF, 4.27 PF, 4.78 PF, 5.29 PF
Storage capability labels on the chart: 4.2 PB, 6.4 PB, 38 PB
ORNL has successfully led the effort
to procure, configure, and transition
to operation sustained petaflops-scale
computing, including infrastructure
for storage and networking, in support
of the NOAA mission.
12
US Air Force: ORNL is supporting the 557th Weather Wing
[Diagram: system architecture. Components shown:]
• Existing Core Services: RSA, LDAP, DNS, License, Monitoring, …
• Data Transfer Nodes
• USAF 557th WW Secure Infrastructure
• Secure WAN
• Offeror Solution
• Sys11 (Hall A)
• Sys11 (Hall B)
• HPC11 Services: Front End Nodes, Scheduler, Management, …
• TDS
• Development Filesystem(s)
• Production Filesystem(s)
Highly reliable computing resources that deliver timely and frequent weather products,
supporting Air Force mission requirements worldwide
13
Summary
• Scientific data
– Data generated from scientific computing jobs
• Operational data
– Machine data such as metrics, logs, and user info
• A variety of machines that produce operational data
– A revolving door of supercomputers and supporting infrastructure
14
Operational Data Challenges in HPC
• Large amounts of machine data generated with possible bursts
• Long-term data retention for analytics
• Siloed data limits holistic view
• No standard approach to monitoring in HPC
– Each org has their own unique challenges
15
New Team Formation
• Increased organizational efforts for data-driven decision-making
• Expansion of an initially small Kafka project
• New Elasticsearch cluster for telemetry data
• Need in organization for centralized single source of truth
• Increased usage of big data tools
16
NCCS Ops Big Data Platform
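[Diagram: platform architecture. Components shown:]
• Data sources: Clusters Data, Security Data, File Systems Data, Application Data, Facilities Data, Network Data
• Message bus: Kafka
• Collection and processing: Telegraf, Logstash, Faust, User-chosen Exporters
• Storage and search: Elasticsearch, Splunk, Prometheus, Lightweight Time Series Database, User DB/App, Long-term Data Storage (GPFS, HPSS)
• Visualization and alerting: Kibana, Grafana, Nagios
• Other: Internal CRM Software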
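The diagram places Kafka between many data sources and many downstream tools. One property that makes this arrangement workable is that each consumer reads the same stream independently, at its own pace. The sketch below illustrates that idea with an in-memory stand-in; it is not the Kafka client API, and the class, topic, group, and sensor names are all hypothetical.

```python
from collections import defaultdict


class InMemoryBus:
    """Toy stand-in for a message bus: append-only topic logs, per-group offsets."""

    def __init__(self):
        self._logs = defaultdict(list)    # topic -> ordered list of records
        self._offsets = defaultdict(int)  # (group, topic) -> next offset to read

    def produce(self, topic, record):
        """Append a record to the end of the topic's log."""
        self._logs[topic].append(record)

    def poll(self, group, topic, max_records=100):
        """Return unread records for this group and advance only its offset."""
        start = self._offsets[(group, topic)]
        batch = self._logs[topic][start:start + max_records]
        self._offsets[(group, topic)] = start + len(batch)
        return batch


bus = InMemoryBus()
bus.produce("facilities.metrics", {"sensor": "chiller-1", "temp_f": 70.2})
bus.produce("facilities.metrics", {"sensor": "chiller-2", "temp_f": 69.8})

# Two downstream consumers read the same records without affecting each other.
search_batch = bus.poll("search-indexer", "facilities.metrics")
archive_batch = bus.poll("archiver", "facilities.metrics")
print(len(search_batch), len(archive_batch))
```

Because offsets are tracked per consumer group, adding another sink later (for example a new dashboard) would not require any change to the producers.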
17
Lessons Learned
• Build it and they will come
– Scale as needed
• Work closely with a few initial
users who share vision
• Provide users with topic visibility
for quicker debugging
• A data dictionary offers users
potential new data streams to
explore
• Backbone of big data platform
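The talk does not describe how the data dictionary is implemented. As one possible shape, the sketch below models it as a small searchable catalog of topics; the entry fields, topic names, and owners are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TopicEntry:
    """One catalog entry describing a data stream (all fields are assumptions)."""
    topic: str
    description: str
    owner: str
    fields: dict = field(default_factory=dict)  # field name -> short type/unit note


class DataDictionary:
    """Searchable catalog so users can discover streams they did not know about."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        """Add or overwrite the entry for a topic."""
        self._entries[entry.topic] = entry

    def search(self, keyword):
        """Return sorted topic names whose name, description, or field names match."""
        kw = keyword.lower()
        return sorted(
            e.topic
            for e in self._entries.values()
            if kw in e.topic.lower()
            or kw in e.description.lower()
            or any(kw in name.lower() for name in e.fields)
        )


catalog = DataDictionary()
catalog.register(TopicEntry(
    topic="facilities.cooling",
    description="Cooling plant sensor readings",
    owner="facilities-team",
    fields={"supply_temp_f": "float, degrees Fahrenheit", "timestamp": "ISO 8601"},
))
catalog.register(TopicEntry(
    topic="summit.jobs",
    description="Job allocation events",
    owner="hpc-ops",
    fields={"job_id": "str", "node_count": "int"},
))

print(catalog.search("temp"))
print(catalog.search("job"))
```

Searching on field names as well as descriptions is a deliberate choice here: a user looking for temperature data can find a stream even if its description never uses that word.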
18
Use Case – Summit Cooling Intelligence
• Intelligent real-time resource
control based on situational
awareness of the data center
• Automated control system is
monitored and controlled by our
facility engineers at a macro
scale
19
Use Case – Summit Cooling Intelligence
[Diagram: data flow. Components shown:]
• Data sources:
– Weather: Wetbulb Forecast (NOAA)
– Cooling Plant: MTW PLC Data (C-TECH)
– Cooling Plant: MTW PLC Outlet
– Summit: OpenBMC Telemetry Streaming (IBM)
– Summit: Job Scheduler LSF Job Allocation (IBM)
• Message Bus (Kafka)
• Snapshot Generator: State Snapshot, Histogram Snapshot
• Lightweight Time Series Database
• Human Engineer Dashboard
• Data Exporters: Serialization, Compression
• Long-term Data Storage (GPFS, HPSS)
• Elasticsearch
• Training, Learning, Model Artifact
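The diagram names a "Histogram Snapshot" without defining it. A minimal sketch of one plausible reading follows, assuming a snapshot is a count of telemetry readings per fixed-width bin over some window; the function name, bin width, and sample temperatures are hypothetical and not taken from the talk.

```python
from collections import Counter


def histogram_snapshot(readings, bin_width):
    """Count readings per bin; each bin is keyed by its lower edge."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    # Floor-divide each value to find its bin, then scale back to the lower edge.
    counts = Counter((value // bin_width) * bin_width for value in readings)
    return dict(sorted(counts.items()))


# Made-up node temperatures (degrees Celsius) for one time window.
node_temps_c = [41.2, 43.9, 44.1, 47.5, 52.0, 52.3]
snapshot = histogram_snapshot(node_temps_c, bin_width=5)
print(snapshot)
```

Summarizing thousands of per-node readings into a handful of bin counts keeps the record small enough to store or display at every interval, at the cost of losing per-node detail.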
20
Use Case – Summit I/O Metrics
21
Use Case – Lustre RPC Trace Data
Questions?
ceballossl@ornl.gov
Migrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlue von ShapeBlue
Migrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlueMigrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlue
Migrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlue
ShapeBlue96 views
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P... von ShapeBlue
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...
ShapeBlue82 views
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ... von Jasper Oosterveld
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue von ShapeBlue
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlueVNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue
ShapeBlue85 views
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti... von ShapeBlue
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...
ShapeBlue46 views
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit... von ShapeBlue
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...
ShapeBlue57 views
Five Things You SHOULD Know About Postman von Postman
Five Things You SHOULD Know About PostmanFive Things You SHOULD Know About Postman
Five Things You SHOULD Know About Postman
Postman40 views
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue von ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlueCloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
ShapeBlue46 views
"Surviving highload with Node.js", Andrii Shumada von Fwdays
"Surviving highload with Node.js", Andrii Shumada "Surviving highload with Node.js", Andrii Shumada
"Surviving highload with Node.js", Andrii Shumada
Fwdays40 views
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院 von IttrainingIttraining
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院
Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda... von ShapeBlue
Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda...Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda...
Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda...
ShapeBlue63 views
Elevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlue von ShapeBlue
Elevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlueElevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlue
Elevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlue
ShapeBlue96 views
Business Analyst Series 2023 - Week 4 Session 7 von DianaGray10
Business Analyst Series 2023 -  Week 4 Session 7Business Analyst Series 2023 -  Week 4 Session 7
Business Analyst Series 2023 - Week 4 Session 7
DianaGray1080 views
Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ... von ShapeBlue
Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ...Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ...
Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ...
ShapeBlue77 views

Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak Ridge National Laboratory) Kafka Summit 2020

  • 1. ORNL is managed by UT-Battelle LLC for the US Department of Energy Enabling Insight to Support World-Class Supercomputing Stefan Ceballos Oak Ridge National Laboratory National Center for Computational Sciences Notice: This work was supported by the Oak Ridge Leadership Computing Facility (OLCF) at Oak Ridge National Laboratory (ORNL) for the Department of Energy (DOE) under Prime Contract Number DE-AC05-00OR-22725
  • 2. Topics • Overview of Oak Ridge National Laboratory (ORNL) – National Center for Computational Sciences (NCCS) overview – History and future of high-performance computing (HPC) at ORNL – External customers as an org, internal customers as a team • Operational data challenges in HPC • New team formation • Lessons learned • Current use-cases
  • 3. Our vision: Sustain ORNL’s leadership and scientific impact in computing and computational sciences • Provide the world’s most powerful open resources for: – Scalable computing and simulation – Data and analytics at any scale – Scalable cyber-secure infrastructure for science • Follow a well-defined path for maintaining world leadership in these critical areas • Deliver leading-edge science relevant to missions of DOE and key federal and state agencies • Build and exploit cross-cutting partnerships • Attract the brightest talent • Invest in education and training
  • 4. National Center for Computational Sciences: Home to the OLCF, including Summit • 65,000 ft² of DC space – Distributed across 4 controlled-access areas – 31,000 ft²: Very high load bearing (≥625 lb/ft²) • 40 MW of high-efficiency, highly reliable power – Tightly integrated into TVA’s 161 kV transmission system – Diverse medium-voltage distribution system – High-efficiency 3.0/4.0 MVA transformers • 6,600 tons of chilled water – 20 MW cooling capacity at 70 °F – 7,700 tons evaporative cooling • Expansion potential: Power infrastructure to 60 MW, cooling infrastructure to 70 MW
  • 5. ORNL has systematically delivered a series of leadership-class systems: On scope • On budget • Within schedule. 1000-fold improvement in 8 years. Timeline (program phases shown: OLCF-1, OLCF-2, OLCF-3): 2004 Cray X1E Phoenix, 18.5 TF • 2005 Cray XT3 Jaguar, 25 TF • 2006 Cray XT3 Jaguar, 54 TF • 2007 Cray XT4 Jaguar, 62 TF • 2008 Cray XT4 Jaguar, 263 TF • 2008 Cray XT5 Jaguar, 1 PF • 2009 Cray XT5 Jaguar, 2.3 PF • 2012 Cray XK7 Titan, 27 PF. Titan turned 6 years old in October 2018 and continued to deliver world-class science research in support of our user community until it was decommissioned on August 2, 2019.
  • 6. We are building on this record of success to enable exascale in 2021. Timeline: 2012 Cray XK7 Titan, 27 PF • 2018 IBM Summit (OLCF-4), 200 PF • 2021 Frontier (OLCF-5), 1.5 EF
  • 7. Summit system overview. System performance • Peak performance of 200 petaflops for modeling and simulation (M&S); 3.3 exaops (FP16) for data analytics and AI • Launched in June 2018; ranked #1 on TOP500 list. System elements • 4608 nodes • Dual-rail Mellanox EDR InfiniBand network • 250 PB IBM Spectrum Scale file system transferring data at 2.5 TB/s. Node elements • 2 IBM POWER9 processors • 6 NVIDIA Tesla V100 GPUs • 608 GB of fast memory (96 GB HBM2 + 512 GB DDR4) • 1.6 TB of non-volatile memory. Summit is the fastest supercomputer in the world and has been #1 on the TOP500 list since its launch in June 2018.
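As a quick consistency check on the node figures above, the following sketch recomputes the per-node fast memory and scales the node counts to the full system. The constants come from the slide; the per-GPU HBM2 value is derived by dividing 96 GB across 6 GPUs, and treating GB as a plain integer unit is a simplifying assumption.

```python
# Figures taken from the Summit system overview slide.
NODES = 4608
GPUS_PER_NODE = 6
HBM2_GB_PER_NODE = 96
DDR4_GB_PER_NODE = 512


def node_fast_memory_gb() -> int:
    """Fast memory per node: HBM2 on the GPUs plus DDR4 on the CPUs."""
    return HBM2_GB_PER_NODE + DDR4_GB_PER_NODE


def system_totals() -> dict:
    """Scale the per-node figures to the whole machine."""
    return {
        "gpus": NODES * GPUS_PER_NODE,
        "hbm2_gb_per_gpu": HBM2_GB_PER_NODE // GPUS_PER_NODE,
        "fast_memory_gb": NODES * node_fast_memory_gb(),
    }


if __name__ == "__main__":
    print(node_fast_memory_gb())
    print(system_totals())
```

The per-node sum matches the 608 GB quoted on the slide, and the system-wide figures work out to 27,648 GPUs and roughly 2.8 PB of fast memory.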
  • 8. Rhea and Data Transfer Nodes: cluster system overview. CPU nodes • 512 compute nodes • Dual 8-core Xeon (Sandy Bridge) processors • 128 GB RAM. GPU and large-memory nodes • 9 GPU/LargeMem nodes • Dual 14-core Xeon processors • 1 TB RAM • 2 NVIDIA K80 GPUs. Cluster elements • Mellanox FDR 4x InfiniBand (56 Gb/s per direction) • Mounts Alpine center-wide GPFS file system • Slurm resource manager. Rhea is used for analysis, visualization, and postprocessing. The Data Transfer Cluster (DTNs) provides interactive and scheduled data transfer services.
  • 9. Frontier Overview. Partnership between ORNL, Cray, and AMD. The Frontier system will be delivered in 2021. Peak performance greater than 1.5 EF. Composed of more than 100 Cray Shasta cabinets – Connected by Slingshot™ interconnect with adaptive routing, congestion control, and quality of service. Node architecture: – An AMD EPYC™ processor and four Radeon Instinct™ GPU accelerators purpose-built for exascale computing – Fully connected with high-speed AMD Infinity Fabric links – Coherent memory across the node – 100 GB/s injection bandwidth – Near-node NVM storage. Researchers will harness Frontier to advance science in such applications as systems biology, materials science, energy production, additive manufacturing, and health data science.
  • 10. National Climate-Computing Research Center (NCRC) • Strategic Partnership Project, currently in year 9 • 5-year periods. Current IAA effective through FY20 • Within ORNL’s National Center for Computational Sciences (NCCS) • Service provided - DOE-titled equipment • Secure network enclave; Department of Commerce access policies • Agreement between NOAA and DOE’s Oak Ridge National Laboratory for HPC services and climate modeling support
  • 11. Gaea system evolution, 2010–2019. [Chart; systems by year: 2010 C1MS, Cray XT6 • 2011 C1MS + C2, Cray XE6 • 2012 C2 + C1, Cray XE6 • 2016 C3, Cray XC40 • 2017 C3 + C4, Cray XC40 (shown twice) • 2018 C3 + C4, Cray XC40 • 2019 C3 + C4, Cray XC40. Performance labels: 260 TF, 980 TF, 1.1 PF (four times), 1.77 PF, 4.02 PF, 4.27 PF, 4.78 PF, 5.29 PF. Storage capability labels: 4.2 PB, 6.4 PB, 38 PB.] ORNL has successfully led the effort to procure, configure, and transition to operation sustained petaflops-scale computing, including infrastructure for storage and networking, in support of the NOAA mission.
  • 12. US Air Force: ORNL is supporting the 557th Weather Wing. Highly reliable computing resources that deliver timely and frequent weather products, supporting Air Force mission requirements worldwide. [Diagram labels: Existing Core Services: RSA LDAP DNS License Monitoring …; Data Transfer Nodes; USAF 557th WW Secure Infrastructure; Secure WAN; Offeror Solution; Sys11 Hall A; HPC11; Services: Front End Nodes Scheduler Management …; Sys11 Hall B; TDS; Development Filesystem(s); Production Filesystem(s)]
  • 13. Summary • Scientific data – Data generated from scientific computing jobs • Operational data – Machine data such as metrics, logs, and user info • A variety of machines that produce operational data – A revolving door of supercomputers and supporting infrastructure
  • 14. Operational Data Challenges in HPC • Large amounts of machine data generated with possible bursts • Long-term data retention for analytics • Siloed data limits holistic view • No standard approach to monitoring in HPC – Each org has its own unique challenges
  • 15. New Team Formation • Increased organizational efforts for data-driven decision-making • Expansion of initially small Kafka project • New Elasticsearch cluster for telemetry data • Need in organization for centralized single source of truth • Increased usage of big data tools
  • 16. NCCS Ops Big Data Platform. [Architecture diagram labels: Kafka Clusters; Data; Security Data; File Systems Data; Application Data; Facilities Data; Network Data; Logstash; Elasticsearch; User-chosen Exporters; Long-term Data Storage (GPFS, HPSS); Telegraf; Splunk; Kibana; Grafana; Prometheus; User DB/App; Faust; Lightweight Time Series Database; Internal CRM Software; Nagios]
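The platform diagram suggests a fan-out pattern: data sources publish to Kafka topics, and each downstream tool consumes only the topics it needs. The sketch below illustrates that pattern in memory, without any Kafka client. The `Exporter` class, the `fan_out` function, and all topic names are hypothetical and are not taken from the talk.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Exporter:
    """Hypothetical stand-in for a downstream consumer such as an indexer."""
    name: str
    topics: set
    received: list = field(default_factory=list)

    def handle(self, topic: str, record: dict) -> None:
        # A real exporter would write to its backend; here we only collect.
        self.received.append((topic, record))


def fan_out(messages, exporters) -> int:
    """Deliver each (topic, record) pair to every exporter subscribed to it."""
    subscribers = defaultdict(list)
    for exporter in exporters:
        for topic in exporter.topics:
            subscribers[topic].append(exporter)

    delivered = 0
    for topic, record in messages:
        for exporter in subscribers.get(topic, []):
            exporter.handle(topic, record)
            delivered += 1
    return delivered


if __name__ == "__main__":
    # Topic names below are illustrative assumptions.
    search = Exporter("search-index", {"facilities.metrics", "network.logs"})
    archive = Exporter("long-term-archive", {"facilities.metrics"})
    stream = [
        ("facilities.metrics", {"sensor": "supply_temp", "value": 21.4}),
        ("network.logs", {"host": "node01", "msg": "link up"}),
    ]
    print(fan_out(stream, [search, archive]))
```

Because producers and consumers only share topic names, a new consumer can be added without touching the data sources, which is one way to read the "single source of truth" goal from the previous slide.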
  • 17. Lessons Learned • Build it and they will come – Scale as needed • Work closely with a few initial users who share the vision • Provide users with topic visibility for quicker debugging • A data dictionary offers users potential new data streams to explore • Backbone of big data platform
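To make the data dictionary idea concrete, here is a minimal sketch of what one entry and a keyword search over entries could look like. The talk does not describe the dictionary's actual schema, so the `TopicEntry` fields, the `search` helper, and the example topics are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicEntry:
    """Hypothetical data dictionary record describing one topic."""
    topic: str
    description: str
    owner: str
    fields: tuple


def search(dictionary, keyword: str) -> list:
    """Return topic names whose description or field names mention the keyword."""
    needle = keyword.lower()
    return [
        entry.topic
        for entry in dictionary
        if needle in entry.description.lower()
        or any(needle in name.lower() for name in entry.fields)
    ]


if __name__ == "__main__":
    # Example entries are invented for this sketch.
    catalog = [
        TopicEntry("facilities.cooling", "Cooling plant sensor readings",
                   "facilities-team", ("timestamp", "sensor", "temperature_c")),
        TopicEntry("scheduler.jobs", "Job allocation events",
                   "hpc-ops", ("timestamp", "job_id", "nodes")),
    ]
    print(search(catalog, "temperature"))
```

A searchable catalog like this lets a user find a stream by what it contains rather than by knowing its topic name in advance.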
  • 18. Use Case – Summit Cooling Intelligence • Intelligent real-time resource control based on situational awareness of the data center • Automated control system is monitored and controlled by our facility engineers at a macro scale
  • 19. Use Case – Summit Cooling Intelligence. [Data-flow diagram labels: Weather Wetbulb Forecast (NOAA); Cooling Plant MTW PLC Data (C-TECH); Cooling Plant MTW PLC Outlet; Summit OpenBMC Telemetry Streaming (IBM); Summit Job Scheduler LSF Job Allocation (IBM); Message Bus (Kafka); Human Engineer Dashboard; State Snapshot; Histogram Snapshot Generator; Lightweight Time Series Database; Data Exporters; Serialization; Compression; Long-term Data Storage (GPFS, HPSS); Elasticsearch; Training; Learning Model Artifact]
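The diagram names a histogram snapshot step but does not say how it works. One plausible reading is that a window of streaming telemetry readings is reduced to fixed-width bin counts before being stored or shown on a dashboard. The sketch below shows that reduction under this assumption; the function name, bin layout, and sample values are hypothetical.

```python
def histogram_snapshot(readings, low: float, high: float, bins: int) -> list:
    """Count readings into `bins` equal-width buckets covering [low, high].

    Readings outside the range are skipped; a reading equal to `high`
    is counted in the last bucket.
    """
    if bins <= 0 or high <= low:
        raise ValueError("need bins > 0 and high > low")

    counts = [0] * bins
    width = (high - low) / bins
    for value in readings:
        if value < low or value > high:
            continue
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts


if __name__ == "__main__":
    # Invented temperature readings in degrees Celsius for one window.
    window = [20.0, 21.5, 22.0, 29.9, 30.0, 35.0]
    print(histogram_snapshot(window, low=20.0, high=30.0, bins=5))
```

Summarizing a window this way keeps the stored snapshot at a fixed size regardless of how many readings arrive, which matters when telemetry volume is bursty.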
  • 20. Use Case – Summit I/O Metrics
  • 21. Use Case – Lustre RPC Trace Data
  • 22. Questions? ceballossl@ornl.gov