Akihiro Nomura, 2019-06-17, ISC'19, Frankfurt, Germany
Global Scientific Information and Computing Center, Tokyo Institute of Technology
Introducing Container Technology to TSUBAME3.0 Supercomputer
Part of this work was supported by JST CREST Grant Number JPMJCR1501, Japan.
Part of this work was conducted as research activities of the AIST - Tokyo Tech
Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL).
What is TSUBAME3.0?
• TSUBAME: supercomputer series at Tokyo Tech
• TSUBAME1.0: ClearSpeed
• TSUBAME1.2: NVIDIA Tesla S1070
• The first supercomputer to use GPUs as compute units (2008.11, T1.2)
• TSUBAME2.0: NVIDIA Tesla M2050 → TSUBAME2.5: K20X
• Operated for 7 years: 2010.11 – 2017.10
• TSUBAME-KFC(/DL): Oil-submerged supercomputer testbed
• #1 in Green500 (2013.11, 2014.06)
• TSUBAME3.0: NVIDIA Tesla P100
• #1 in Green500 (2017.06)
• Currently #25 in TOP500 (2019.06)
• Operation started on 2017.08
• ~1,500 users from academia and industry, not only from Tokyo Tech
• Various application domains, levels of expertise, and software (ISV, self-written, serial, parallel)
• Important to keep each user's research confidential from other users
Experience in 7-year operation of
TSUBAME 2 supercomputer
• Our previous TSUBAME2 operated from 2010.11 to 2017.10
• We faced many problems during long-term operation:
• Resource separation is required
• How to keep software up to date
• How to stop users from over-consuming resources on shared nodes
• How to make the system energy efficient
• How to cope with decaying network cables (SIAM PP18)
• Other problems that I cannot disclose, or do not want to remember again :(
Why resource separation?
• TSUBAME2's 1,408 compute nodes were too fat
• To exploit the 3 GPU + 2 CPU per-node configuration, users need to program with CUDA/OpenACC for the GPUs, OpenMP for intra-node parallelism, and MPI for inter-node communication: just too hard for most users
• Three types of users (or workloads):
• Expert, guru: fully utilize the 3 GPU + 2 CPU per-node configuration
• GPU user: uses 1-3 GPUs, but not many CPU threads
• CPU user: doesn't use the GPUs at all
• Assigning a full node to every user is just a waste of resources
Resource separation accomplished in T2
(in 2010)
• VM (KVM)-based approach
• Run a CPU VM inside GPU-job nodes
• GPUs couldn't be virtualized
• Network performance was limited due to IPoIB
• Nice usability
• Users can ssh into both the GPU part and the CPU part for debugging / monitoring
• Many TSUBAME1 users did this during their jobs
• Good isolation
• The GPU user cannot see what's going on in the CPU part, and vice versa
• Bad flexibility
• We could not dynamically change the number of nodes split into the two parts
[Node diagram: a T2 node (2 CPUs, 3 GPUs, 2 IB HCAs) is split into an 8-core bare-metal GPU part (G) and a 4-core CPU VM part (U/V), connected via IP over IB.]
What happens to the SW environment
if we operate one system for a long time
• Everything gets stale
• System software compatibility problem
• GPU and Lustre drivers won't support a 5-year-old OS distro
• OS support problem
• OS vendors drop support for a 5-year-old distro
• ISV software compatibility problem
• Some newer version won’t work in old OS
• Some stable version isn’t verified in new OS
• Library version hell
• Upgrading to a newer OS version is painful
• Everything must be validated again, especially ISV software
• We did it once (SLES11 → SLES11 SP3, 2014.08), at large cost
When I tried to install Caffe on T2.5
(2015.05)
• SLES11 SP3, two years after release, <1 year after the system update
• SP4 appeared just after verification and installation
• Got a request from a user on Friday evening, thought it would be easy
• Experienced library hell; it took 3 days to install it
• Lots of missing libraries
• >20 Python packages, gflags, glog, leveldb, lmdb, protobuf, snappy, cuDNN, OpenCV
• GCC is too old, let's install it…
• Ah, I need to recompile everything with the new GCC…
• Also tried TensorFlow later, but I abandoned it
• Some binary-shipped parts required a newer glibc :(
∴ Introducing bleeding-edge software onto an old system is quite painful
Our expectations of container technology
for the upcoming TSUBAME3 (as of late 2016)
• We just wanted something that lets us:
• Decouple the OS kernel version from the userland version
• Provide new system software and libraries at minimal cost
• Provide an old userland if necessary (see the sketch after this list)
• Then we can skip re-validating every ISV package in a newer environment
• Also (partially) useful for replaying old experiments later
• Split resources (CPU, GPU, memory, NW) without a performance drawback
• Secure isolation between separated partitions
• Dynamic partitioning
• Allow users to do what they did in previous systems
• In our case, SSH to compute node while a job is running
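A minimal sketch of the "old userland" idea above, assuming a Docker-style runtime; the image name and paths are hypothetical, and the actual TSUBAME3 integration (described later) goes through the scheduler rather than a raw docker command:

# Run a legacy ISV application inside an older userland image while the
# host kernel stays current, without giving the user root in the container.
# (Image name and paths are made up for illustration.)
docker run --rm \
    --user "$(id -u):$(id -g)" \
    --volume "/home/$USER:/home/$USER" \
    sles11sp3-userland \
    "/home/$USER/bin/legacy_isv_app"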
Our choices for resource separation
(again, as of late 2016)
• VM and Docker were the available choices
• Other container technologies (Shifter, Singularity, …) were not mature yet
VM vs. Docker container, by metric:
• Performance
• VM: GPUs can be virtualized; IB supports SR-IOV, but OmniPath does not
• Docker: almost no overhead
• Usability
• VM: SSH is not a problem
• Docker: SSH into the container requires some integration
• Isolation
• VM: isolated without problems
• Docker: OK if cgroups work well
• Userland virtualization
• VM: hard to deploy an OS dynamically
• Docker: the userland can be chosen
• Flexibility
• VM: turning VMs on/off is costly
• Docker: the container itself won't be a problem
We didn't specify VM or Docker explicitly,
but requested the functionality in the procurement.
The vendor chose Docker.
What a TSUBAME3 node looks like
• The node is larger than a T2 node
• 28 CPU cores
• 4 GPUs
• 4 Omni-Path HFIs
• Too big for most users
• Expert, Guru
• GPU user
• CPU user
• We expect most users to split the node
How we separate the node physically
• Separate the node hierarchically (a sketch follows the diagram note below)
• Inspired by the buddy system in the Linux kernel's memory allocator
• Less flexible, because the memory/CPU and memory/GPU ratios are fixed
• Better for scheduling: minimizes scattered resources
[Node diagram: a T3 node (2 CPUs / 28 cores, 4 GPUs, 4 Omni-Path HFIs) is carved hierarchically into partition types such as H (14 cores), Q (7 cores), G (2 cores), and C4 (4 cores).]
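A hedged illustration of the fixed hierarchy: the core counts per type follow the diagram, while the GPU counts and the concrete CPU/GPU numbering are assumptions made only for this sketch.

# Illustrative mapping from partition type to a CPU/GPU slice.
# Core counts follow the slide; GPU counts and device numbering are assumed.
case "$PARTITION_TYPE" in
    F)  CPUS=0-27 ; GPUS=0,1,2,3 ;;  # full node: 28 cores, 4 GPUs
    H)  CPUS=0-13 ; GPUS=0,1     ;;  # half node: 14 cores, 2 GPUs (assumed)
    Q)  CPUS=0-6  ; GPUS=0       ;;  # quarter node: 7 cores, 1 GPU (assumed)
    G)  CPUS=0-1  ; GPUS=0       ;;  # small GPU slice: 2 cores + 1 GPU (assumed)
    C4) CPUS=0-3  ; GPUS=""      ;;  # CPU-only slice: 4 cores
esac
echo "cpuset=$CPUS gpus=$GPUS"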
Resource Utilization in TSUBAME3
(2019.04)
• ~70% of jobs (weighted by vnode×time) run on a partitioned node rather than a full node
• The total vnode×time exceeded 540 nodes × 30 days in busy months, i.e., more than the machine's 540 physical nodes could have delivered as full nodes
• We could not have served these jobs without partitioning
How we separate the node logically
• Integration by HPE (primary vendor) and UNIVA (scheduler vendor)
• Just using cgroups (a rough sketch follows this list)
• To achieve the minimal goal of resource separation in a short development time
• Userland virtualization is not urgent; it should be implemented by the time the initial userland becomes obsolete
• SSH to (part of) a compute node is desirable, but not required
• Using Docker, integrated with the scheduler
• To achieve the full goal, including the goals triaged out of the cgroup implementation
• Multi-node Docker integration was challenging; there was no predecessor at that time
• It took almost two years to put the Docker part into service
• The integration broke scheduling priorities etc. in specific situations :(
• Finally started the Docker-based service in 2019.04
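A rough sketch of what cgroup-only confinement of a CPU-only slice could look like, assuming cgroup v1 and the libcgroup tools; the job name and core/memory values are hypothetical, and in reality this work is done inside the Univa scheduler integration, not by hand:

# Confine a job to a 7-core CPU-only slice (run as root by the scheduler).
JOB=job12345
cgcreate -g cpuset,memory,devices:/$JOB
cgset -r cpuset.cpus=0-6 $JOB                # 7 CPU cores for this job
cgset -r cpuset.mems=0 $JOB                  # pin to one NUMA node
cgset -r memory.limit_in_bytes=60G $JOB      # illustrative memory cap
cgset -r devices.deny="c 195:* rwm" $JOB     # hide NVIDIA GPUs (char major 195)
cgexec -g cpuset,memory,devices:$JOB ./users_application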
Our requirements for container technology
• DO NOT PASS root TO USER
• We use several filesystems in our network
• Cluster NFS for home storage
• Lustre for high-speed shared storage
• (local SSD + BeeOND)
• We MUST prevent users from accessing other users' data (see the example after this list)
→ We decided NOT to allow users to bring their own images
• In Docker, root inside the container is (a sometimes-restricted) root in the host OS
• We cannot reliably filter out malicious images that allow escaping from the jail
• Files with the setuid bit set, local vulnerability exploits, …
• Drawback: users cannot bring their own images
• We initially thought that was not a problem, or an inevitable compromise
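A hedged illustration of why this matters on shared filesystems: by default the user who controls a Docker image runs as root inside the container, and container root can read any bind-mounted path regardless of file ownership (the image name and paths below are made up):

# Container root reads another user's files on a bind-mounted shared FS.
# (Illustrative only; "untrusted/image" and the paths are placeholders.)
docker run --rm -v /lustre:/lustre untrusted/image \
    cat /lustre/another_user/secret_results.dat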
Time flies like an arrow, in just 2 years
• During the introduction and preparation, container technology evolved rapidly and we fell out of sync
• What users expected from containers was not what we had planned to do with containers
• Lots of application containers appeared, including HPC apps
Pics from
http://www.projectcartoon.com
Other container choices: Singularity
• Docker was a general-purpose container runtime
• Not designed to be used by untrusted users
• HPC-aware containers were being implemented
• Shifter
• Prevents users inside the container image from getting root
• Singularity (example after this list)
• Runs containers without root (except for startup, cgroups, and FS mounts)
• There is a security document describing the setuid-related implementation!
• Can we accept user-brought container images using Singularity?
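A hedged sketch of what a user-brought image looks like with Singularity; the image and command are arbitrary examples, using the Singularity 2.6 syntax that TSUBAME3 initially deployed:

# A user pulls a public image and runs it without needing root on their side.
singularity pull --name ubuntu1804.simg docker://ubuntu:18.04
singularity exec ubuntu1804.simg cat /etc/os-release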
Introducing Singularity to TSUBAME3.0
(2018.08-09)
• The request came from a user, with a pointer to the security considerations document
• Checked the Singularity source code (the setuid-related part) with multiple staff members
• Discussion in the research computer system audit board
• Not the usual path for ordinary software, but Singularity requires a setuid binary
• Finally installed Singularity 2.6
• Singularity 3.2.1 is also available, as of last week :)
• We did the same setuid-related code check again, since the implementation had changed
Pros and cons for Docker and Singularity
• Note: this is just TSUBAME3's case; the assessment will vary across supercomputer sites
Docker in TSUBAME3 vs. Singularity in TSUBAME3, by metric:
• Usability
• Docker: can SSH into the container; an IP address is assigned
• Singularity: running a daemon inside the container is not supported; no IP address is assigned
• Isolation
• Docker: already integrated
• Singularity: needs to be done from the outside, but possible
• Userland virtualization
• Docker: the userland can be chosen, but only by system admins
• Singularity: users can bring arbitrary images
• Service start
• Docker: delayed to 2019.04
• Singularity: 2018.09
So HPC containers started working with Singularity; is that all?
• Unfortunately no, not for MPI apps
• MPI requires integration of both kernel/host-level drivers and userland libraries
• Also, the process launch must be done on the host side, not from inside the container (see the example after this list)
• mpirun …… singularity exec …… path/to/mpiapp
• Many container implementations have mechanisms to bridge NVIDIA GPU driver version differences
• NVIDIA-docker, the --nv option of Singularity, …
• Yes, TSUBAME3 is NVIDIA GPU Cloud ready :)
• TSUBAME3 uses OmniPath, while other HPC sites often use InfiniBand (or Tofu, Aries, …)
• Users (except gurus) don't care what the underlying interconnect is
• Unlike with accelerators, where users don't expect CUDA to work on an FPGA
• However, the system software required inside the container differs per interconnect
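A hedged, concrete form of the launch pattern above; the rank count, image name, and binary path are placeholders:

# Ranks are launched by the host-side MPI; each rank then enters the
# container. --nv maps the host's NVIDIA driver libraries into the container.
mpirun -np 8 singularity exec --nv myapp.simg /opt/app/bin/mpiapp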
What we expect from an MPI-implementation-independent container
• An MPI equivalent of the --nv option?
• Automatically injects MPI-related system software
• Requires MPI ABI compatibility at some level
• MPI ABI compatibility initiatives
• libfabric
• Recompile MPI apps against a specific MPI when the image is built for a specific system
• Fat container images that choose the MPI library dynamically? (see the sketch below)
[Diagram: a fat container image holding the App together with one MPI build for InfiniBand and one for OmniPath.]
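A hedged sketch of the fat-image idea: the image bundles one MPI build per interconnect and an entrypoint picks one at start-up. The detection logic and paths are made up, and it assumes the application was built against an ABI shared by all bundled MPI builds (e.g., the MPICH ABI):

#!/bin/sh
# Hypothetical entrypoint of a "fat" image bundling several MPI builds.
if ls /dev/hfi1_* >/dev/null 2>&1; then
    MPI_HOME=/opt/mpi/omnipath        # Omni-Path HFI device present
elif [ -d /sys/class/infiniband ]; then
    MPI_HOME=/opt/mpi/infiniband      # InfiniBand HCA present
else
    MPI_HOME=/opt/mpi/tcp             # fall back to TCP
fi
export PATH="$MPI_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$MPI_HOME/lib:$LD_LIBRARY_PATH"
exec "$@"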
Wrap up
• We tried to introduce Docker to TSUBAME3.0 in order to implement resource separation and flexible userland updates; containers were not the goal, just a tool
• However, users' expectations of containers were different from what we had thought
• Trying to reach the full goal at once with Docker was too adventurous and took a very long time to get into service, but it is now working well :)
• It is sometimes important to change our minds during system operation; opinions from users matter
• For system administrators, security documentation is very important
• Running massively parallel applications everywhere with containers still has several problems to solve
• I believe I did (and am still doing) some things poorly, due to historical reasons or simply not knowing the appropriate technology
• Your input is always welcome
Acknowledgement
• TSUBAME3 operation working group members
• ~15 faculty and other staff members
• HPE and UNIVA engineers, who finally realized the container-based TSUBAME3.0 system with a lot of effort
• We expect to upgrade to SLES15 in 2020.03
• Many container vendors, for formal and informal discussions
• And users, especially those who requested bleeding-edge software
• TSUBAME Computing Services: https://www.t3.gsic.titech.ac.jp/en/