This document summarizes benchmarking of MPI applications in Singularity containers on traditional HPC and cloud infrastructures. Key findings include:
1) Singularity containers performed comparably to native environments on an HPC cluster but with some overhead on cloud due to heterogeneous infrastructure.
2) On Azure cloud with GPUs, updating the CUDA drivers in the container from version 9 to 10 may improve TensorFlow time to solution.
3) HPC portability is partially broken in containers due to requirements for compatible host infrastructure and MPI implementations.
Benchmarking MPI Applications in Singularity Containers on Traditional HPC and Cloud Infrastructures
1. ||ID | SIS
2019 hpc-ch Forum – Cloud and Containers
Andrei Plamadă, Jarunan Panyasantisuk
ETH Zürich – Scientific IT Services
16.05.2019
2. ||ID | SIS
Outline
§ Motivation
§ User experience:
  § Traditional HPC vs HPC in the Public Cloud
  § Singularity v2.6
§ Benchmarking MPI Applications
  § OSU Micro-Benchmarks
  § Machine Learning: TensorFlow
3. ||ID | SIS
Motivation – Public Cloud is growing rapidly
§ 2018-2022: 20.2% CAGR for IaaS (see Forbes – Gartner)
§ Expectations
  § More competitive prices
  § More regions
  § More heterogeneous
§ Available in Switzerland
  § 2019-03-12 Google Cloud Platform in Zurich
§ Announced in Switzerland
  § 2018-03-14 Azure Switzerland North and West
Worldwide Public Cloud SaaS and IaaS Revenue Forecast (Billions of U.S. Dollars)
Year   SaaS    IaaS
2018    80.0   30.5
2019    94.8   38.9
2020   110.5   49.1
2021   126.7   61.9
2022   143.7   76.7
7. ||ID | SIS
Motivation – Singularity as the container solution for HPC
§ Containers improve portability and can address the reproducibility issue in research (EnhanceR Survey - Science IT Consultants)
§ EnhanceR Survey - Infrastructure Providers for Container Use
§ Singularity:
  § Developed initially at LBL - Berkeley Lab - for the HPC use case (multi-tenancy)
  § Open source with the standard BSD 3-Clause license https://github.com/sylabs/singularity
  § Under active development: 12 contributors with more than 100 commits
  § Also available with commercial support: Singularity Pro
  § Used worldwide and recommended by vendors, e.g. NVIDIA, Azure Batch
  § Big worldwide community (Google Groups, Slack)
  § Swiss community - EnhanceR
8. ||ID | SIS
Motivation – Singularity as the container solution for HPC
§ Main idea
Native:
• Host OS+Drivers+Middleware (OSDM)
• MPI: mpirun, MPI Library
• SSH Server
• App: Shared MPI Library
Container:
• Host OS+Drivers+Middleware (OSDM)
• MPI: mpirun
• SSH Server
• Container OSDM: MPI, App, Shared MPI Library
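The container layout in the diagram above, where the host keeps `mpirun` and the container carries its own MPI library and application, can be sketched as a small Python helper that assembles the host-side command line. This is a minimal sketch under stated assumptions: the helper name, the image name `osu.simg`, and the path `/opt/osu/osu_bw` are hypothetical and do not come from the talk; only the general `mpirun -np N singularity exec IMAGE APP` pattern is assumed.

```python
import shlex


def build_hybrid_launch(num_ranks, image, app, app_args=()):
    """Assemble the host-side command for a containerized MPI run.

    The host's mpirun starts num_ranks processes; each one is a
    `singularity exec` that runs the application inside the image.
    """
    if num_ranks < 1:
        raise ValueError("num_ranks must be at least 1")
    return ["mpirun", "-np", str(num_ranks),
            "singularity", "exec", image, app, *app_args]


# Hypothetical image and benchmark path, for illustration only.
cmd = build_hybrid_launch(2, "osu.simg", "/opt/osu/osu_bw")
print(shlex.join(cmd))
```

Keeping the launcher on the host is what makes the host and container MPI libraries need to agree with each other, which is the compatibility requirement the later slides return to.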
9. ||ID | SIS
Traditional HPC vs HPC in the Public Cloud
§ Traditional HPC (ETH – SIS – HPC)
  § Euler IV:
    § 2x18 core Intel Xeon Gold 6150 (2.7-3.7 GHz)
    § All cores available
    § HT available
    § 7.4 GB/core Memory
    § 100 Gbps InfiniBand
§ Public Cloud - Azure
  § HC-Series (in preview) – Standard_HC44rs
    § 2x24 core Intel Xeon Platinum 8168 (2.7-3.7 GHz)?
    § 2x2 cores used by the hypervisor?
    § HT disabled?
    § 8.0 GB/core Memory
    § 100 Gbps InfiniBand
10. ||ID | SIS
User Experience – Traditional HPC vs HPC in the Public Cloud
§ Traditional HPC (ETH – SIS – HPC)
  § Ready to be used (LSF)
  § No maintenance / set-up
  § Login and Compute Nodes
  § Moderate flexibility regarding the software stack
  § Queue
  § It generally works as expected
§ Public Cloud - Azure
  § Needs to be set up (Slurm Cluster) via CycleCloud
  § As admin, fully responsible
  § Master and Execute Nodes
  § High flexibility (as the admin), e.g. OpenMPI, MPICH, MVAPICH2, Intel MPI
  § Queue (as admin, high availability)
  § Auto-scaling
  § https://github.com/Azure/cyclecloud-slurm/issues
11. ||ID | SIS
User Experience on CentOS 7 – Singularity v2.6
Create
• Docker
• root access
• on your PC
Run
• Singularity
• on your PC or HPC infrastructure
§ Multi-node: MPICH ABI Compatibility Initiative
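The multi-node case depends on the host and container MPI libraries sharing a binary interface. A minimal sketch of that rule in Python follows. The function name and the family set are assumptions made for illustration: the set lists only the MPICH-derived implementations named in this talk (MPICH, MVAPICH2, Intel MPI), and the check ignores the minimum versions that the real compatibility initiative specifies.

```python
# Implementations assumed to share the MPICH ABI.
# Illustrative and not exhaustive; version requirements are ignored.
MPICH_ABI_FAMILY = {"mpich", "mvapich2", "intel mpi"}


def same_abi(host_mpi, container_mpi):
    """Return True if host and container MPI are expected to interoperate.

    Two libraries are treated as compatible when they are the same
    implementation or when both belong to the MPICH ABI family.
    """
    host = host_mpi.strip().lower()
    container = container_mpi.strip().lower()
    if host == container:
        return True
    return host in MPICH_ABI_FAMILY and container in MPICH_ABI_FAMILY


print(same_abi("MPICH", "MVAPICH2"))    # both MPICH-derived
print(same_abi("OpenMPI", "MVAPICH2"))  # different ABI families
```

A check like this only captures the library family; the interconnect drivers inside the container still have to match the host hardware for full bandwidth.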
12. ||ID | SIS
OSU Micro-Benchmarks – osu_bw (Gbps), 1000 iterations
Bytes  EN m2 v2.2  EC m2 v2.2  EC m2 v2.3  AN m2 v2.3  AC m2 v2.3
8            0.16        0.15        0.16        0.16        0.08
64           1.30        1.27        1.29        1.28        1.25
512          8.27        8.21        8.14        7.87        7.65
4K          37.41       37.65       37.42       37.23       36.54
32K         88.89       89.25       89.43       83.50       82.47
2M          94.75       94.59       95.19       94.25       94.30
16M         94.95       94.75       95.50       91.49       89.99
Abbreviations: Azure (A), Euler (E), MVAPICH2 (m2), Native (N), Container (C)
§ Naive MPICH v3.3 in the container (EC/AC) works, but only up to 10 Gbps on Euler and 4 Gbps on Azure (no InfiniBand)
§ Azure host with MPICH v3.3 and container with MVAPICH2 v2.3: same results as AC m2 v2.3, up to 100 Gbps
§ OpenMPI is not compatible with MPICH-derived MPI implementations, so mixing them does not work
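To make the native versus container comparison in the table above easier to read, the following Python sketch computes the relative bandwidth difference per message size. The numbers are copied from the table; the helper name and the choice of column pairs (EN against EC for MVAPICH2 v2.2 on Euler, AN against AC for v2.3 on Azure) are assumptions about which columns are meant to be compared.

```python
# osu_bw results in Gbps, copied from the table above.
SIZES = ["8", "64", "512", "4K", "32K", "2M", "16M"]
BW = {
    "EN m2 v2.2": [0.16, 1.30, 8.27, 37.41, 88.89, 94.75, 94.95],
    "EC m2 v2.2": [0.15, 1.27, 8.21, 37.65, 89.25, 94.59, 94.75],
    "EC m2 v2.3": [0.16, 1.29, 8.14, 37.42, 89.43, 95.19, 95.50],
    "AN m2 v2.3": [0.16, 1.28, 7.87, 37.23, 83.50, 94.25, 91.49],
    "AC m2 v2.3": [0.08, 1.25, 7.65, 36.54, 82.47, 94.30, 89.99],
}


def relative_difference(native, container):
    """Percent change of container bandwidth relative to native.

    Negative values mean the container run was slower.
    """
    return [100.0 * (c - n) / n for n, c in zip(native, container)]


euler = relative_difference(BW["EN m2 v2.2"], BW["EC m2 v2.2"])
azure = relative_difference(BW["AN m2 v2.3"], BW["AC m2 v2.3"])

for size, e, a in zip(SIZES, euler, azure):
    print(f"{size:>4}  Euler {e:+7.2f}%  Azure {a:+7.2f}%")
```

For the smallest messages the table has only two decimal places, so the percentages there are dominated by rounding and should be read with caution; the large-message rows are the more meaningful comparison.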
18. ||ID | SIS
Conclusion
§ User experience on Azure - HPC in the cloud is catching up:
  § CycleCloud Slurm Cluster with compute-intensive VMs + 100 Gbps InfiniBand in preview
  § Big machine learning VMs (up to 8 x Tesla V100 NVLINK) in preview
§ Singularity Containers:
  § When the host is similar to the container, we did not observe any overhead
  § HPC partially breaks the portability of containers
    § The container should be compatible with the host infrastructure and the host MPI implementation
  § Updating CUDA drivers (9 to 10) might improve the time to solution
19. ||ID | SIS
Contact
ETH Zürich
Andrei Plamadă
Scientific IT Services
Weinbergstrasse 11
8092 Zürich
Acknowledgements
§ SIS colleagues
  § Thomas Wüst
  § Urban Borstnik
  § Samuel Fux
§ EnhanceR colleagues
  § Alexander Kashev (UniBe)
§ Microsoft / Azure
  § Lukasz Miroslaw
  § Andy Howard
EnhanceR Survey - Infrastructure Providers for Container Use
https://forms.gle/JBW78qDPWabd4GDR8