Pro Tips: Building for Hyperscale
William Wu
Director of Product Management, Penguin Computing
Penguin Computing Theater SC18
© 2018 Penguin Computing
How I Define Hyperscale Computing
▪ A distributed infrastructure
▪ Designed to handle increased demand for computing resources
▪ Without requiring additional physical space, cooling, or electrical power
▪ Often features standardization, automation, redundancy, and high availability (HA)
Why You Should Care
▪ Designing for hyperscale is NOT the same as designing for a regular data center or HPC
▪ Hyperscale is designed to be cost-effective at thousands of instances/petabytes of data
❑ Deployment should, in theory, be able to scale indefinitely
❑ Sections should be able to scale independently (unlike hyperconvergence)
▪ BUT the same design concepts also help improve the efficiency of smaller systems
What to Look for in a Hyperscale Partner
▪ Technology focus in all key areas: storage, HPC, and networking
▪ Experience with open solutions, such as the Open Compute Project (OCP)
❑ Standardization is required at hyperscale; without it, systems are too costly to build
▪ For AI, also look for vendors who have not only built but also manage large AI systems
❑ Running AI infrastructure teaches you a lot about how to build it
© 2018 Penguin Computing
Application of HPC Discipline
5
▪ Example Solution Blocks
❑ Relion XO1132g – Intel® Xeon® Scalable Processors
❑ Altus XO1132g – AMD EPYC™
❑ Relion XO1114GT – Intel® Xeon® Scalable Processors, 4x GPU – PCIe Form Factor
❑ Relion XO1114GTS – Intel® Xeon® Scalable Processors, 4x GPU – SXM2 w/ NVLINK2
❑ Altus XO1114GT – AMD EPYC™, 4x GPU – PCIe Form Factor
▪ Penguin Computing HPC Discipline
❑ Skylake UPI Connections
❑ PLX Topology (Single, Dual Root Complex)
❑ Network Latency
❑ PCIe Gen3 vs PCIe Gen4
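To put the PCIe Gen3 vs PCIe Gen4 item in numbers, here is a minimal sketch of the theoretical link bandwidth. The per-lane rates (8 GT/s and 16 GT/s) and the 128b/130b encoding come from the PCIe specifications, not from the slides; the function name is hypothetical, and real throughput will be lower once protocol overhead is included.

```python
# Per-lane raw signalling rates in GT/s (PCIe 3.0 and 4.0 specifications).
GT_PER_S = {"gen3": 8.0, "gen4": 16.0}

# Both generations use 128b/130b line encoding.
ENCODING_EFFICIENCY = 128 / 130


def pcie_bandwidth_gb_s(gen: str, lanes: int = 16) -> float:
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    # GT/s * encoding efficiency gives usable Gbit/s per lane; divide by 8 for bytes.
    return GT_PER_S[gen] * ENCODING_EFFICIENCY * lanes / 8


if __name__ == "__main__":
    for gen in ("gen3", "gen4"):
        print(f"{gen} x16: {pcie_bandwidth_gb_s(gen):.2f} GB/s per direction")
```

Under these assumptions a Gen4 x8 link carries the same theoretical bandwidth as a Gen3 x16 link, which matters when lanes are shared behind a PCIe switch.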
Application of HPC Discipline
▪ Example Solution Blocks
❑ Training – NVIDIA® Tesla® V100-PCIe, GeForce Titan V
❑ Inference – NVIDIA® Tesla® V100-PCIe, Tesla P4
❑ Inference/Edge Computing – Jetson, Drive PX
❑ Solution Appliance - NVIDIA® DGX-1™, DGX-2, ODM HGX-2
▪ Penguin Computing HPC Discipline
❑ Application Accelerated Specification
■ Tensor Core, FP64/FP32, HBM2, NVLINK, NVSwitch, PCIe
■ PLX Topology, P2P
❑ Inference technology in scale out solutions
❑ Network Latency
❑ Network Topology (Scale out, expansion)
Application of HPC Discipline
▪ Compute & GPU Accelerator
❑ Fan Speed Profile, Liquid Cooling (fan monitoring), airflow simulation, power spikes, power shelf slew rate
❑ GPU monitor (or lack thereof w/ GTX)
❑ Liquid Cooling (fan monitor), power delivery (v1)
▪ Infrastructure
❑ 12V DC Pass Through (OPA Switch)
❑ Open Bridge Rack (v1 OCP)
❑ Angus Rack (v1 OCP and EIA)
▪ Network Topology
❑ Network Configuration
❑ Subscription Rate
❑ Expansion and Scale Out (Redundancy)
[Figure: network topology with core switches (Core 1, Core 2, Core 3, … Core X) above edge switches (Edge 1, Edge 2, Edge 3, … Edge X), with GPU nodes attached below]
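The subscription rate of an edge switch in a core/edge topology like this one can be estimated as the ratio of node-facing bandwidth to core-facing bandwidth. The sketch below assumes that definition; the function name, port counts, and link speeds are hypothetical and do not come from the slides.

```python
def oversubscription_ratio(
    downlink_ports: int,
    downlink_gbps: float,
    uplink_ports: int,
    uplink_gbps: float,
) -> float:
    """Ratio of node-facing to core-facing bandwidth on one edge switch.

    1.0 means non-blocking; 2.0 means a 2:1 oversubscribed edge.
    """
    uplink_total = uplink_ports * uplink_gbps
    if uplink_total <= 0:
        raise ValueError("an edge switch needs some uplink bandwidth")
    return (downlink_ports * downlink_gbps) / uplink_total


if __name__ == "__main__":
    # Hypothetical edge switch: 32 node ports and 16 uplinks, all 100 Gb/s.
    print(oversubscription_ratio(32, 100, 16, 100))
    # Hypothetical non-blocking split: 24 node ports and 24 uplinks.
    print(oversubscription_ratio(24, 100, 24, 100))
```

Expansion then becomes a question of how many edge switches the core layer can absorb at the chosen ratio.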
Solution Validation Tree Diagram
[Tree diagram: Server System at the root, with HW and FW/SW branches. Node labels: CPU, BMC, Memory/NVDIMM, HDD/SSD/M.2, MB, Mezz. / FPGA, Add-on card, GPU Card, BIOS/uEFI, Driver/Firmware, Applications]
Test lists shown on the diagram:
• Functional Test / BVT, FVT
• Stability – Stress, Cycling
• Performance
• FW Flash, Driver Update
• Failure Detection & Recovery
and
• Feature / Functional Test
• Stability – Stress, Cycling
• Performance
• Power consumption
• Failure Detection
Test Coverage
Pre-Run
• Check Rack PDU/Switch
• Update Unit Firmware
• Check Unit Configuration
• Check Unit Component Status
• Check Unit OS / BMC Log
Run-in
• Unit Component Stress Test
• Unit Idle Test
• Unit System Stress Test
• Rack Stress Test
Final
• Check Firmware
• Check Unit Configuration
• Check SMART Info
• Check Unit OS / BMC Log
• Reset Unit / Switch to Default Configuration
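The three phases above can be sketched as an ordered pipeline that stops before the next phase if any check fails. This is only an illustration: the phase and check names are taken from the slide, while `run_phases`, `always_pass`, and the stop-on-failure rule are assumptions, and the checks are placeholders rather than real hardware tests.

```python
from typing import Callable, List, Tuple

Check = Callable[[], bool]
Phase = Tuple[str, List[Tuple[str, Check]]]


def always_pass() -> bool:
    """Placeholder check; a real one would query the rack, BMC, or OS."""
    return True


def run_phases(phases: List[Phase]) -> List[Tuple[str, str, bool]]:
    """Run every check in each phase in order; skip later phases after a failure."""
    results = []
    for phase_name, checks in phases:
        phase_ok = True
        for check_name, check in checks:
            ok = check()
            results.append((phase_name, check_name, ok))
            phase_ok = phase_ok and ok
        if not phase_ok:
            break  # assumed policy: do not start the next phase on a failed unit
    return results


PHASES: List[Phase] = [
    ("Pre-Run", [
        ("Check Rack PDU/Switch", always_pass),
        ("Update Unit Firmware", always_pass),
        ("Check Unit Configuration", always_pass),
        ("Check Unit Component Status", always_pass),
        ("Check Unit OS / BMC Log", always_pass),
    ]),
    ("Run-in", [
        ("Unit Component Stress Test", always_pass),
        ("Unit Idle Test", always_pass),
        ("Unit System Stress Test", always_pass),
        ("Rack Stress Test", always_pass),
    ]),
    ("Final", [
        ("Check Firmware", always_pass),
        ("Check Unit Configuration", always_pass),
        ("Check SMART Info", always_pass),
        ("Check Unit OS / BMC Log", always_pass),
        ("Reset Unit / Switch to Default Configuration", always_pass),
    ]),
]

if __name__ == "__main__":
    for phase, check, ok in run_phases(PHASES):
        print(f"{phase:8} {check:45} {'PASS' if ok else 'FAIL'}")
```

Keeping each check as a plain callable makes it easy to swap a placeholder for a real probe without changing the phase ordering.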
Want More Info?
To learn more, go to: www.penguincomputing.com/ai-practice
PENGUIN COMPUTING
@PenguinHPC
www.penguincomputing.com
1-888-PENGUIN

Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 

Pro Tips: Building for Hyperscale

  • 1. Pro Tips: Building for HyperScale. William Wu, Director of Product Management, Penguin Computing. Penguin Computing Theater, SC18
  • 2. How I Define Hyperscale Computing (© 2018 Penguin Computing)
      ▪ A distributed infrastructure
      ▪ Designed to handle increased demand for computing resources
      ▪ With no additional physical space, cooling, or electrical power
      ▪ Often features standardization, automation, redundancy, and high availability (HA)
  • 3. Why You Should Care
      ▪ Designing for hyperscale is NOT the same as designing for a regular data center or HPC
      ▪ Hyperscale is designed to be cost-effective at thousands of instances/petabytes of data
          ❑ Deployment should theoretically be able to scale indefinitely
          ❑ Sections should be able to scale independently (unlike hyperconvergence)
      ▪ BUT the same design concepts also help improve the efficiency of smaller systems
  • 4. What to Look for in a Hyperscale Partner
      ▪ Technology focus in all key areas: storage, HPC, and networking
      ▪ Experience with open solutions, such as the Open Compute Project (OCP)
          ❑ Standardization is required at hyperscale; without it, systems are too costly to build
      ▪ For AI, also look for vendors who have not just built but also manage large AI systems
          ❑ Running AI infrastructure teaches you a lot about how to build it
  • 5. Application of HPC Discipline
      ▪ Example Solution Blocks
          ❑ Relion XO1132g – Intel® Xeon® Scalable Processors
          ❑ Altus XO1132g – AMD EPYC™
          ❑ Relion XO1114GT – Intel® Xeon® Scalable Processors, 4x GPU – PCIe Form Factor
          ❑ Relion XO1114GTS – Intel® Xeon® Scalable Processors, 4x GPU – SXM2 w/ NVLINK2
          ❑ Altus XO1114GT – AMD EPYC™, 4x GPU – PCIe Form Factor
      ▪ Penguin Computing HPC Discipline
          ❑ Skylake UPI Connections
          ❑ PLX Topology (Single, Dual Root Complex)
          ❑ Network Latency
          ❑ PCIe Gen3 vs PCIe Gen4
  • 6. Application of HPC Discipline
      ▪ Example Solution Blocks
          ❑ Training – NVIDIA® Tesla® V100-PCIe, GeForce Titan V
          ❑ Inference – NVIDIA® Tesla® V100-PCIe, Tesla P4
          ❑ Inference/Edge Computing – Jetson, Drive PX
          ❑ Solution Appliance – NVIDIA® DGX-1™, DGX-2, ODM HGX-2
      ▪ Penguin Computing HPC Discipline
          ❑ Application Accelerated Specification
              ■ Tensor Core, FP64/FP32, HBM2, NVLINK, NVSwitch, PCIe
              ■ PLX Topology, P2P
          ❑ Inference technology in scale out solutions
          ❑ Network Latency
          ❑ Network Topology (Scale out, expansion)
  • 7. Application of HPC Discipline
      ▪ Compute & GPU Accelerator
          ❑ Fan Speed Profile, Liquid Cooling (fan monitoring), air flow simulation, power spikes, power shelf slew rate
          ❑ GPU monitor (or lack thereof w/ GTX)
          ❑ Liquid Cooling (fan monitor), power delivery (v1)
      ▪ Infrastructure
          ❑ 12V DC Pass Through (OPA Switch)
          ❑ Open Bridge Rack (v1 OCP)
          ❑ Angus Rack (v1 OCP and EIA)
      ▪ Network Topology
          ❑ Network Configuration
          ❑ Subscription Rate
          ❑ Expansion and Scale Out (Redundancy)
      [Figure: network topology diagram with core switches (Core 1, Core 2, Core 3, … Core X), edge switches (Edge 1, Edge 2, Edge 3, … Edge X), and GPU nodes]
  • 8. Solution Validation Tree Diagram
      ▪ CPU, Server System, BMC
          ❑ Functional Test / BVT/FVT
          ❑ Stability – Stress, Cycling
          ❑ Performance
          ❑ FW Flash, Driver Update
          ❑ Failure Detection & Recovery
      ▪ Memory, NVDIMM, HDD/SSD/M.2
          ❑ Feature / Functional Test
          ❑ Stability – Stress, Cycling
          ❑ Performance
          ❑ Power consumption
          ❑ Failure Detection
      ▪ Other node labels in the diagram: FW/SW, BIOS/uEFI, Driver/Firmware, Applications, MB, Mezz. / FPGA, Add-on card, HW, GPU Card
  • 9. Test Coverage
      ▪ Pre-Run
          ❑ Check Rack PDU/Switch
          ❑ Update Unit Firmware
          ❑ Check Unit Configuration
          ❑ Check Unit Component Status
          ❑ Check Unit OS / BMC Log
      ▪ Run-in
          ❑ Unit Component Stress Test
          ❑ Unit Idle Test
          ❑ Unit System Stress Test
          ❑ Rack Stress Test
      ▪ Final
          ❑ Check Firmware
          ❑ Check Unit Configuration
          ❑ Check SMART Info
          ❑ Check Unit OS / BMC Log
          ❑ Reset Unit / Switch to Default Configuration
  • 10. Want More Info? To learn more, go to: www.penguincomputing.com/ai-practice. Penguin Computing, @PenguinHPC
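
Slide 7 lists "Subscription Rate" as a network topology consideration without defining it. As a rough illustration only, the sketch below uses the common definition of an edge switch's oversubscription ratio: total downlink bandwidth toward the GPU nodes divided by total uplink bandwidth toward the core switches. The function names and the port counts in the usage lines are hypothetical and do not come from the slides.

```python
from fractions import Fraction


def oversubscription_ratio(downlinks: int, downlink_gbps: int,
                           uplinks: int, uplink_gbps: int) -> Fraction:
    """Return downlink bandwidth divided by uplink bandwidth for one edge switch.

    Hypothetical helper: the slides do not give a formula, so this follows
    the commonly used definition. A result of 1 means non-blocking.
    """
    if uplinks <= 0 or uplink_gbps <= 0:
        raise ValueError("an edge switch needs at least one uplink")
    if downlinks < 0 or downlink_gbps < 0:
        raise ValueError("downlink values must not be negative")
    return Fraction(downlinks * downlink_gbps, uplinks * uplink_gbps)


def format_ratio(ratio: Fraction) -> str:
    """Render a ratio in the usual 'N:M' notation."""
    return f"{ratio.numerator}:{ratio.denominator}"


# Hypothetical edge switch: 32 node-facing ports and 8 core-facing ports,
# all at the same line rate.
ratio = oversubscription_ratio(32, 100, 8, 100)
print(format_ratio(ratio))
```

A lower ratio generally leaves more headroom for node-to-node traffic across edge switches, at the cost of more core ports and cabling per rack.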