Toward a practical “HPC Cloud”: Performance tuning of a virtualized HPC cluster
1. Toward a practical “HPC Cloud”:
Performance tuning of a virtualized HPC cluster
Ryousei Takano, Tsutomu Ikegami,
Takahiro Hirofuchi, Yoshio Tanaka
Information Technology Research Institute,
National Institute of Advanced Industrial Science and Technology (AIST),
Japan
CUTE2011@Seoul, Dec.15 2011
2. Background
• Cloud computing is attracting increasing attention from the High Performance Computing community.
  – e.g., Amazon EC2 Cluster Compute Instances
• Virtualization is a key technology.
  – Providers rely on virtualization to consolidate computing resources.
• Virtualization provides not only opportunities but also challenges for HPC systems and applications.
  – Concern: performance degradation due to the overhead of virtualization
3. Contribution
• Goal:
  – To realize a practical HPC Cloud whose performance is close to that of bare metal (i.e., non-virtualized) machines
• Contributions:
  – A feasibility study evaluating the HPC Challenge benchmark suite on a 16-node InfiniBand cluster
  – An evaluation of the effect of three performance tuning techniques:
    • PCI passthrough
    • NUMA affinity
    • VMM noise reduction
6. Toward a practical HPC Cloud
[Figure: from the current KVM-based HPC Cloud, whose performance is poor and unstable, to a "true" HPC Cloud whose performance is close to that of bare metal machines. The VM is a QEMU process whose VCPU threads run on the host Linux kernel; three tunings reduce the overhead:
• Use PCI passthrough – the guest OS accesses the NIC directly through its physical driver.
• Set NUMA affinity – pin the VCPU threads of the VM (QEMU process) to CPU sockets.
• Reduce VMM noise – reduce the overhead of interrupt virtualization and disable unnecessary services on the host OS (e.g., ksmd).]
7. IO architectures of VMs
[Figure: two VM I/O architectures side by side.
• IO emulation: the guest driver in each VM talks to a virtual switch (vSwitch) in the VMM, and the VMM's physical driver accesses the NIC. Performance degrades due to the overhead of VMM processing.
• PCI passthrough: the guest OS holds the physical driver and accesses the NIC directly, bypassing the VMM. This achieves performance comparable to bare metal machines; a hedged configuration sketch follows.]
VMM: Virtual Machine Monitor
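As a concrete illustration of the passthrough path, the sketch below assigns a host PCI device (e.g., the InfiniBand HCA) to a running guest through libvirt's hostdev interface. It is a minimal sketch, not the configuration used in the talk: the connection URI, the domain name "hpc-vm1", and the PCI address 0000:06:00.0 are placeholders to adapt to the actual host.

/* Hypothetical sketch: assign a host PCI device to a guest via libvirt
 * PCI passthrough (hostdev).  Look up the real PCI address with lspci.
 * Compile: gcc passthrough.c -o passthrough -lvirt */
#include <stdio.h>
#include <stdlib.h>
#include <libvirt/libvirt.h>

int main(void)
{
    const char *hostdev_xml =
        "<hostdev mode='subsystem' type='pci' managed='yes'>"
        "  <source>"
        "    <address domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>"
        "  </source>"
        "</hostdev>";

    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (conn == NULL) {
        fprintf(stderr, "failed to connect to the hypervisor\n");
        return EXIT_FAILURE;
    }

    /* "hpc-vm1" is a placeholder domain name. */
    virDomainPtr dom = virDomainLookupByName(conn, "hpc-vm1");
    if (dom == NULL) {
        virConnectClose(conn);
        return EXIT_FAILURE;
    }

    /* Attach the device so the guest driver talks to the NIC directly,
     * bypassing the VMM's emulated I/O path. */
    if (virDomainAttachDevice(dom, hostdev_xml) < 0)
        fprintf(stderr, "device attach failed\n");

    virDomainFree(dom);
    virConnectClose(conn);
    return EXIT_SUCCESS;
}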
8. NUMA affinity
[Figure: a bare metal node. Application threads run on Linux; numactl and the process scheduler place them on physical CPUs P0–P3, grouped into CPU sockets, each with its own local memory.]
On NUMA systems, memory affinity is an important performance factor: local memory accesses are faster than remote memory accesses. To avoid inter-socket memory transfers, binding a thread to a CPU socket can be effective (a programmatic sketch follows).
NUMA: Non-Uniform Memory Access
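A minimal sketch of what the numactl binding in the figure amounts to, written against libnuma. It is illustrative only; the talk uses the numactl command itself, and node 0 and the buffer size here are arbitrary choices.

/* Roughly the programmatic equivalent of
 * "numactl --cpunodebind=0 --membind=0 ./app".
 * Compile: gcc numa_bind.c -o numa_bind -lnuma */
#include <stdio.h>
#include <stdlib.h>
#include <numa.h>

int main(void)
{
    size_t i, n = 1024 * 1024;

    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return EXIT_FAILURE;
    }

    /* Restrict this thread to the CPUs of NUMA node 0 ... */
    numa_run_on_node(0);

    /* ... and allocate its working set from node 0's local memory,
     * avoiding slower inter-socket (remote) memory accesses. */
    double *buf = numa_alloc_onnode(n * sizeof(double), 0);
    if (buf == NULL)
        return EXIT_FAILURE;

    for (i = 0; i < n; i++)
        buf[i] = (double)i;

    numa_free(buf, n * sizeof(double));
    return EXIT_SUCCESS;
}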
9. NUMA affinity: KVM
[Figure: bare metal vs. KVM. On bare metal, numactl and the Linux process scheduler bind threads to physical CPUs P0–P3. On KVM, the VM is a QEMU process whose VCPU threads V0–V3 run on the host's Linux kernel: taskset pins each VCPU to a physical CPU (Vn = Pn), and inside the guest OS numactl binds the application threads to the corresponding vSocket. A libvirt-based sketch of the pinning step follows.]
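The host-side pinning step (done with taskset in the figure) can also be expressed through the libvirt API. The following is a hedged sketch rather than the authors' actual script: the domain name "hpc-vm1" is a placeholder, and the 8 vCPUs mirror the VM configuration on slide 13.

/* Pin each virtual CPU n to physical CPU n (Vn = Pn) so the guest's
 * vSocket layout matches the physical NUMA topology.
 * Compile: gcc pin_vcpus.c -o pin_vcpus -lvirt */
#include <stdio.h>
#include <string.h>
#include <libvirt/libvirt.h>

#define NCPUS 8   /* vCPUs in the VM and pCPUs used on the host */

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    virDomainPtr dom = conn ? virDomainLookupByName(conn, "hpc-vm1") : NULL;
    if (dom == NULL)
        return 1;

    int maplen = VIR_CPU_MAPLEN(NCPUS);
    unsigned int vcpu;
    for (vcpu = 0; vcpu < NCPUS; vcpu++) {
        unsigned char cpumap[VIR_CPU_MAPLEN(NCPUS)];
        memset(cpumap, 0, maplen);
        VIR_USE_CPU(cpumap, vcpu);   /* allow only physical CPU n */
        if (virDomainPinVcpu(dom, vcpu, cpumap, maplen) < 0)
            fprintf(stderr, "failed to pin vCPU %u\n", vcpu);
    }

    virDomainFree(dom);
    virConnectClose(conn);
    return 0;
}

Inside the guest, the application threads are then bound to the corresponding vSocket with numactl, exactly as on bare metal.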
10. NUMA affinity: Xen
[Figure: bare metal vs. Xen. On Xen, the guest (DomU) runs VCPU threads V0–V3, which the Xen hypervisor's domain scheduler pins to physical CPUs (Vn = Pn); the management VM (Dom0) runs alongside. numactl cannot run on the guest OS, because Xen does not disclose the physical NUMA topology to it.]
11. VMM noise
• OS noise is a well-known problem for large-scale system scalability.
  – OS activities and daemon programs take up CPU time, consume cache and TLB entries, and delay the synchronization of parallel processes.
• VMM-level noise, which we call VMM noise, can cause the same problem for a guest OS.
  – The overhead of interrupt virtualization, which results in VM exits (i.e., VM-to-VMM switches)
  – Unnecessary services on the host OS (e.g., ksmd)
• At present we do not address VMM noise; see the sketch below for the ksmd case.
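Although VMM noise reduction is left as an open issue in this work, slide 6 names one concrete step: disabling unnecessary host-OS services such as ksmd. A minimal sketch of that step on a Linux/KVM host, equivalent to running `echo 0 > /sys/kernel/mm/ksm/run` as root:

/* Stop the ksmd (kernel same-page merging) scanner on the host. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/sys/kernel/mm/ksm/run", "w");
    if (f == NULL) {
        perror("open /sys/kernel/mm/ksm/run");
        return 1;
    }
    /* 0 = stop the ksmd scanner; 1 would re-enable it. */
    fputs("0\n", f);
    fclose(f);
    return 0;
}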
13. Experimental setting
Evaluation of the HPC Challenge benchmark on a 16-node InfiniBand cluster.

Hardware (per blade):
  Blade server  Dell PowerEdge M610
  CPU           Intel quad-core Xeon E5540 / 2.53 GHz x2
  Chipset       Intel 5520
  Memory        48 GB DDR3
  InfiniBand    Mellanox ConnectX (MT26428)
  Blade switch  Mellanox M3601Q (InfiniBand QDR, 16 ports)

Host machine environment:
  OS            Debian 6.0.1
  Linux kernel  2.6.32-5-amd64
  KVM           0.12.50
  Xen           4.0.1
  Compiler      gcc/gfortran 4.4.5
  MPI           Open MPI 1.4.2

VM environment:
  VCPU          8
  Memory        45 GB
  Only 1 VM runs on each host.
14. HPC Challenge Benchmark Suite
We measure spatial and temporal locality boundaries by evaluating the HPC Challenge benchmark suite.
[Figure: "Motivation of the HPCC design" – the HPCC benchmarks (HPL, DGEMM, PTRANS, STREAM, FFT, RandomAccess) plotted along spatial- and temporal-locality axes, with Mission Partner Applications spanning the middle. HPL is compute intensive, PTRANS and STREAM are memory intensive, and FFT and RandomAccess are communication intensive.]
From: Piotr Luszczek, et al., "The HPC Challenge (HPCC) Benchmark Suite," SC2006 Tutorial.
15. HPC Challenge: Result
[Figure: radar chart of HPCC results normalized to BMM (bare metal machine), radial scale 0–1.4, for six configurations: BMM, BMM+pin, KVM, KVM+pin+bind, Xen, Xen+pin. Axes: HPL(G) (compute intensive), PTRANS(G) (memory intensive), STREAM(EP), RandomAccess(G) and FFT(G) (communication intensive), Random Ring Bandwidth, and Random Ring Latency.]
Comparing Xen and KVM, the performance is almost the same.
G: Global, EP: Embarrassingly Parallel. Higher is better, except for Random Ring Latency.
16. HPC Challenge: Result
[Figure: two radar charts of HPCC results over the same axes, radial scale 0–1.2: one comparing BMM, Xen, and Xen+pin; the other comparing BMM, KVM, and KVM+pin+bind.]
NUMA affinity is important even on a VM, but the effect of VCPU pinning is uncertain.
G: Global, EP: Embarrassingly Parallel. Higher is better, except for Random Ring Latency.
17. HPL: High Performance LINPACK
• BMM (bare metal machine): the LINPACK efficiency is 57.7% on 16 nodes (63.1% on a single node).
• On both BMM and KVM, setting NUMA affinity is effective.
• The virtualization overhead is 6 to 8%.

Performance in GFLOPS (ratio to BMM in parentheses):
Configuration      1 node          16 nodes
BMM                50.24 (1.00)    706.21 (1.00)
BMM + bind         51.07 (1.02)    747.88 (1.06)
Xen                49.44 (0.98)    700.23 (0.99)
Xen + pin          49.37 (0.98)    698.93 (0.99)
KVM                48.03 (0.96)    671.97 (0.95)
KVM + pin + bind   49.33 (0.98)    684.96 (0.97)
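For reference, the LINPACK efficiency quoted above is the measured performance relative to the theoretical peak; assuming the usual 4 double-precision flops per cycle for these Xeon E5540 cores:

\[
\mathrm{Efficiency} = \frac{R_{\mathrm{max}}}{R_{\mathrm{peak}}}, \qquad
R_{\mathrm{peak}} = n_{\mathrm{cores}} \times f_{\mathrm{clock}} \times \mathrm{flops/cycle}.
\]

For 16 nodes, R_peak = 16 x 8 x 2.53 GHz x 4 ≈ 1295 GFLOPS, so the tuned bare-metal result of 747.88 GFLOPS gives 747.88 / 1295 ≈ 57.7%, matching the efficiency quoted above.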
18. Discussion
• The performance of the global benchmarks, except for FFT(G), is almost comparable to that of bare metal machines.
  – FFT(G) performance decreases by 11% to 20% due to virtualization overhead related to inter-node communication and/or VMM noise.
  – PCI passthrough brings MPI communication throughput close to that of bare metal machines, but interrupt injection, which results in VM exits, can still disturb application execution.
19. Discussion (cont.)
• The performance of Xen is marginally better than that of KVM, except for Random Ring Bandwidth.
  – The bandwidth decreases by 4% on KVM but by 20% on Xen.
• KVM: the performance of STREAM(EP) decreases by 27%.
  – Heavy memory contention among processes (TLB misses) may occur. This is the worst case for EPT (Extended Page Tables), because an EPT page walk takes more time than a shadow-page-table walk (a rough cost model follows below). This suggests a virtual machine is more sensitive to memory contention than a bare metal machine.
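A rough, standard cost model (not measured in this talk) makes the EPT point concrete: with an n-level guest page table and an m-level extended page table, each guest page-table reference must itself be translated through the EPT, so a worst-case TLB miss costs

\[
N_{\mathrm{refs}} = (n+1)(m+1) - 1
\]

memory references, i.e. 24 for the four-level tables used on x86-64 (n = m = 4), versus at most 4 on bare metal. This is why TLB-miss-heavy memory access patterns are hit hardest under virtualization.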
21. Summary
HPC Cloud is promising!
• The performance of coarse-grained parallel applications is comparable to that of bare metal machines.
• We plan to adopt these performance tuning techniques in our private cloud service, "AIST Cloud."
• Open issues:
  – VMM noise reduction
  – Live migration with VMM-bypass devices