Invited Talk in ISC High Performance 2019 Focus Session "Containers for Acceleration and Accessibility in HPC and Cloud Ecosystems" https://2019.isc-program.com/presentation/?id=inv_sp183&sess=sess177
Introducing Container Technology to TSUBAME3.0 Supercomputer
1. Akihiro Nomura, 2019-06-17, ISC'19, Frankfurt, Germany
Global Scientific Information and
Computing Center
Introducing Container Technology to
TSUBAME3.0 Supercomputer
Part of this work was supported by JST CREST Grant Number JPMJCR1501, Japan
Part of this work was conducted as research activities of AIST - Tokyo Tech
Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL)
2. What is TSUBAME3.0?
• TSUBAME: Supercomputer series in Tokyo Tech
• TSUBAME1.0: ClearSpeed
• TSUBAME1.2: NVIDIA Tesla S1070
• The first supercomputer using GPUs as compute units (2008.11, T1.2)
• TSUBAME2.0: NVIDIA Tesla M2050 → TSUBAME2.5: K20X
• Operated for 7 years: 2010.11 – 2017.10
• TSUBAME-KFC(/DL): Oil-submerged supercomputer testbed
• #1 in Green500 (2013.11, 2014.06)
• TSUBAME3.0: NVIDIA Tesla P100
• #1 in Green500 (2017.06)
• Currently #25 in TOP500 (2019.06)
• Operation started on 2017.08
• ~1500 users from academia + industry, not only from Tokyo Tech
• Various application domains, expertise, and software (ISV, self-made, serial, parallel)
• Important to keep each user's research confidential from other users
3. Experience from 7 years of operating the TSUBAME2 supercomputer
• Our previous TSUBAME2 operated from 2010.11 to 2017.10
• We faced many problems during long-term operation
• Resource separation is required
• How to keep software up to date
• How to stop users from over-consuming resources on shared nodes
• How to make the system energy efficient
• How to cope with decaying network cables
(SIAM PP18)
• Other problems that I cannot disclose, or
do not want to remember again
4. Why resource separation?
• TSUBAME2's 1408 compute nodes were too fat
• To utilize the 3 GPU + 2 CPU per node configuration, users need to program with
CUDA/OpenACC for GPUs, OpenMP for intra-node parallelism, and MPI for inter-node
communication… just too hard for most users
• Three types of users (or workloads)
• Expert, Guru: fully utilize the 3 GPU + 2 CPU per node configuration
• GPU user: use 1-3 GPUs, but not many CPU threads
• CPU user: don't use GPUs at all
• Assigning a full node to every user is just a waste of resources
5. Resource separation accomplished in T2
(in 2010)
• VM (KVM) based approach
• Run a CPU VM inside GPU-job nodes
• GPUs couldn't be virtualized
• Network performance was limited due to IPoIB
• Nice usability
• Users can SSH into both the GPU part and the CPU part for debugging / monitoring
• Many TSUBAME1 users did this during their jobs
• Good isolation
• GPU users cannot see what's going on in the CPU part, and vice versa
• Bad flexibility
• We cannot dynamically change the number of nodes that are split into two parts
[Figure: a T2 node split into a bare-metal GPU part (8 cores, 3 GPUs, IB HCA) and a KVM-based CPU VM part (4 cores, IB HCA), communicating via IP over IB]
6. What happens to the SW environment if we operate one system for a long time
• Everything gets stale
• System software compatibility problems
• GPU and Lustre drivers won't support a 5-year-old OS distro
• OS support problems
• OS vendors drop support for 5-year-old distros
• ISV software compatibility problems
• Some newer versions won't work on the old OS
• Some stable versions aren't verified on the new OS
• Library version hell
• Upgrading to a newer OS version is painful
• Everything must be validated again, esp. ISV software
• We did it once (SLES11 → SLES11 SP3, 2014.08), at a large cost
7. When I tried to install Caffe on T2.5 (2015.05)
• SLES11 SP3, two years after release, <1 year after the system update
• SP4 appeared just after verification and installation
• Got a request from a user on Friday evening and thought it would be easy
• Experienced library hell; it took 3 days to install
• Lots of missing libraries
• >20 Python packages, gflags, glog, leveldb, lmdb, protobuf, snappy, cuDNN, OpenCV
• GCC was too old, so let's install a new one…
• Ah, now I need to recompile everything with the new GCC…
• Also tried TensorFlow later, but abandoned it
• Some binary-distributed parts required a newer glibc
∴ Introducing bleeding-edge software to an old system is quite painful
8. Our expectations of container technology for the upcoming TSUBAME3 (as of late 2016)
• We just wanted something that would let us
• Make the OS kernel version and the userland version independent
• Provide new system software and libraries at the least cost
• Provide the old userland if necessary
• Then we can skip re-validating all ISV software in the newer environment
• Also (partially) useful for replaying old experiments later
• Split resources (CPU, GPU, memory, network) without a performance penalty
• Secure isolation between separated partitions
• Dynamic partitioning
• Allow users to do what they did on previous systems
• In our case, SSH into a compute node while a job is running
9. Our choices for resource separation
(again, as of late 2016)
• VM and Docker were the available choices
• Other container technologies (Shifter, Singularity, …) were not mature
VM | Metric | Docker container
GPU: virtualized; interconnect: IB supports SR-IOV, OmniPath: no support | Performance | Almost no overhead
SSH is not a problem | Usability | SSH into the container requires some integration
Isolated w/o problem | Isolation | OK if cgroups work well
Hard to deploy OS dynamically | Userland virtualization | Userland can be chosen
VM on/off is costly | Flexibility | Container itself won't be a problem
We didn’t specify VM or Docker explicitly,
but requested functionality in procurement
The vendor choose Docker
10. How TSUBAME3 node looks like
• The node is larger than in T2
• 28 CPU cores
• 4 GPUs
• 4 Omni-Path HFIs
• Too large for most users
• Expert, Guru
• GPU user
• CPU user
• We expect most users to split the node
11. How we separate the node physically
• Separate the node hierarchically
• Inspired by the buddy system in the Linux kernel's memory allocator
• Less flexible because of the fixed memory/CPU and memory/GPU ratios
• Better scheduling to minimize scattered resources (example job requests are sketched below the figure)
[Figure: a T3 node (2 CPUs / 28 cores, 4 GPUs, 4 OPA HFIs) split hierarchically into H (14-core half), Q (7-core quarter), G, and 2- or 4-core CPU-only (C4) slices]
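The partitions appear to users as fixed-size virtual node types requested from the scheduler. Below is a minimal sketch of what such requests could look like with a Grid Engine-style qsub; the resource-type names and sizes are patterned on the H/Q/C4 labels in the figure and are illustrative, not the exact TSUBAME3 keywords.

  # Illustrative requests for hierarchical node slices (names/sizes are assumptions)
  qsub -l f_node=1 -l h_rt=8:00:00 full_job.sh     # whole node: 28 cores, 4 GPUs
  qsub -l h_node=1 -l h_rt=8:00:00 half_job.sh     # half node (H): 14 cores, 2 GPUs
  qsub -l q_node=1 -l h_rt=8:00:00 quarter_job.sh  # quarter node (Q): 7 cores, 1 GPU
  qsub -l q_core=1 -l h_rt=8:00:00 cpu_job.sh      # small CPU-only slice (C4), no GPU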
12. Resource Utilization in TSUBAME3
(2019.04)
• ~70% of jobs (weighted by vnode×time) run on separated nodes rather than full nodes
• The sum of the vnode×time product exceeded 540 nodes × 30 days in busy months
• We could not have served these jobs without partitioning
13. How we separate the node logically
• Integration by HPE (primary vendor) and UNIVA (scheduler vendor)
• Just using cgroups (sketched at the end of this slide)
• To achieve the minimal goal of resource separation in a short development time
• Userland virtualization is not urgent; it should be implemented by the time the initial userland becomes obsolete
• SSH to (part of) compute nodes is desirable, but not required
• Using Docker, integrated with the scheduler
• To achieve the full goal, including the goals triaged out of the cgroup implementation
• Multi-node Docker integration was challenging, with no predecessor at that time
• It took almost two years to bring the Docker part into service
• The integration broke scheduling priorities etc. in specific situations
• Finally started the Docker-based service on 2019.04
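To give a feel for the cgroup-based separation, here is a minimal sketch (cgroup v1 with the libcgroup tools; the group name, core range, and sizes are hypothetical) of what the scheduler integration effectively does for each job; the actual HPE/UNIVA implementation differs in detail.

  # Confine one job to a quarter-node slice: 7 cores, a memory cap, and a single GPU
  cgcreate -g cpuset,memory,devices:/t3/job12345
  cgset -r cpuset.cpus=0-6 t3/job12345                 # 7 CPU cores
  cgset -r cpuset.mems=0 t3/job12345                   # memory from NUMA node 0
  cgset -r memory.limit_in_bytes=60G t3/job12345       # memory cap for the slice
  cgset -r devices.deny="c 195:* rwm" t3/job12345      # hide all NVIDIA GPUs (major 195)...
  cgset -r devices.allow="c 195:0 rwm" t3/job12345     # ...then re-expose only GPU 0
  cgset -r devices.allow="c 195:255 rwm" t3/job12345   # keep nvidiactl visible as well
  cgexec -g cpuset,memory,devices:/t3/job12345 ./user_job.sh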
14. Our requirement to container technology
• DO NOT PASS root TO USER
• We use several filesystems in our network
• Cluster NFS for home storage
• Lustre for high-speed shared storage
• (local SSD + BeeOND)
• We MUST prevent users from accessing other users' data
→ We decided NOT to allow users to bring their own images
• In Docker, root in the container is (a sometimes restricted) root on the host OS (illustrated below)
• We cannot filter out malicious images that allow escaping from the jail
• Files with the setuid bit, local vulnerability exploits, …
• Drawback: users cannot bring their own images
• We initially thought that was not a problem, or an inevitable compromise
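A small illustration of the risk behind this decision (paths and image name are hypothetical): because root inside a Docker container is root on any bind-mounted shared filesystem, a user who controls the image, or simply runs as root in it, could read other users' data.

  # Hypothetical example: container root reads another user's file on bind-mounted Lustre
  docker run --rm -v /lustre:/lustre centos:7 cat /lustre/other_user/secret_data.txt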
15. Time flies like an arrow, in just 2 years
• During introduction and preparation, container technology evolved rapidly and we got out of sync
• What users expected from containers was not what we planned to do with containers
• Lots of application containers appeared, including HPC apps
Pics from http://www.projectcartoon.com
16. Other container choices: Singularity
• Docker was a general-purpose container runtime
• Not designed to be used by untrusted users
• HPC-aware containers were being implemented
• Shifter
• Prevents users in the container image from getting root
• Singularity
• Runs containers without root (except for startup, cgroups, and FS mounts); see the sketch below
• There is a security document describing the setuid-related implementation!
• Can we accept user-provided container images using Singularity?
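A minimal usage sketch (the image URI is just an example): with Singularity the container process runs as the invoking user, so files on the shared filesystems keep their normal ownership and permissions.

  # Pull and run a Docker Hub image without root; id reports the calling user's uid
  singularity exec docker://ubuntu:18.04 id
  # Files written to the bind-mounted home directory stay owned by the user, not root
  singularity exec docker://ubuntu:18.04 touch $HOME/created_from_container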
17. Introducing Singularity to TSUBAME3.0
(2018.08-09)
• The request came from a user, with a pointer to the security consideration document
• Checked the source code of Singularity (the setuid-related parts) with multiple staff members
• Discussion in the research computer system audit board
• Not the usual path for ordinary software, but Singularity requires a setuid binary
• Finally installed Singularity 2.6
• Singularity 3.2.1 is also available, as of last week
• Did the same setuid-related code check, since the implementation changed
18. Pros and cons for Docker and Singularity
• Note: this is just TSUBAME3's case; the metrics will vary at different supercomputer sites
Docker in TSUBAME3 | Metric | Singularity in TSUBAME3
Can SSH into the container; an IP address is assigned | Usability | Running daemons inside the container is not supported; no IP address is assigned
Already integrated | Isolation | Needs to be done from outside, but possible
Userland can be chosen, only by system admins | Userland virtualization | Users can bring arbitrary images
Delayed to 2019.04 | Service start | 2018.09
19. So, HPC containers started working with Singularity; is that all?
• Unfortunately NO for MPI apps
• They require integration of both kernel (host)-level drivers and userland libraries
• Also, the process launch must be done on the host side, not from the container
• mpirun …… singularity exec …… path/to/mpiapp (sketched at the end of this slide)
• Many container implementations have mechanisms to bridge NVIDIA GPU driver version differences
• NVIDIA Docker, Singularity's --nv option, …
• Yes, TSUBAME3 is NVIDIA GPU Cloud ready
• TSUBAME3 uses OmniPath, while other HPC sites often use InfiniBand (or Tofu, Aries, …)
• Users (except gurus) don't care what the underlying interconnect is
• Unlike with accelerators: users don't expect CUDA to work on an FPGA
• However, the system software required inside the container differs
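A minimal sketch of the host-side launch pattern mentioned above (module and image names are assumptions): the host MPI, built against the Omni-Path drivers, launches the ranks, and each rank executes the application inside the container.

  # Host-side MPI starts the ranks; Singularity runs each rank inside the image
  module load singularity openmpi                        # site-provided, interconnect-aware MPI
  mpirun -np 8 -npernode 4 \
      singularity exec --nv app.sif /opt/app/bin/mpiapp  # --nv injects the NVIDIA driver libs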
20. What we expect for an MPI-implementation-independent container
• An MPI equivalent of the --nv option?
• Automatically injects MPI-related system software
• Requires MPI ABI compatibility at some level
• MPI ABI compatibility initiatives
• libfabric
• Recompile MPI apps against a specific MPI when the image is built for a specific system
• Fat container images that choose the MPI library dynamically? (sketched below the figure)
[Figure: a "fat" container holding the App together with both an MPI built for InfiniBand and an MPI built for OmniPath]
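A speculative sketch of the fat-image idea from the figure: a wrapper script inside the image picks an MPI build at run time based on which interconnect device is visible. The paths, device checks, and the assumption of ABI-compatible application builds are all hypothetical.

  #!/bin/bash
  # Entry script inside a fat image that ships two MPI builds
  if [ -e /dev/hfi1_0 ]; then                       # Omni-Path HFI present
      export LD_LIBRARY_PATH=/opt/mpi/omnipath/lib:$LD_LIBRARY_PATH
  elif [ -e /dev/infiniband/uverbs0 ]; then         # InfiniBand verbs device present
      export LD_LIBRARY_PATH=/opt/mpi/infiniband/lib:$LD_LIBRARY_PATH
  fi
  exec "$@"                                         # run the MPI application binary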
21. Wrap up
• We tried to introduce Docker to TSUBAME3.0 in order to implement resource separation and flexible userland updates; containers were not the goal, just a tool
• However, users' expectations of containers were different from what we thought
• Pursuing the full goal at once with Docker was too adventurous and took a very long time to bring into service, but it is now working well
• It is sometimes important to change your mind during system operation; opinions from users matter
• For system administrators, security documentation is very important
• Running massively parallel applications everywhere with containers still has several problems to solve
• I believe I did (and am still doing) some things wrong, due to historical reasons or simply not knowing the appropriate technology
• Your input is always welcome
22. Acknowledgement
• TSUBAME3 operation working group members
• ~15 faculty and other staff members
• HPE and UNIVA engineers, who finally realized the container-based TSUBAME3.0 system with a lot of effort
• We expect to upgrade to SLES15 in 2020.03
• Many container vendors, for formal and informal discussions
• And users, especially those who requested bleeding-edge software
• TSUBAME Computing Services: https://www.t3.gsic.titech.ac.jp/en/