Toward a practical “HPC Cloud”: Performance tuning of a virtualized HPC cluster
1. Toward a practical “HPC Cloud”:
Performance tuning of a virtualized HPC cluster
Ryousei Takano, Tsutomu Ikegami,
Takahiro Hirofuchi, Yoshio Tanaka
Information Technology Research Institute,
National Institute of Advanced Industrial Science and Technology (AIST),
Japan
CUTE2011@Seoul, Dec.15 2011
2. Background
• Cloud computing is attracting increasing attention from the High Performance Computing community.
  – e.g., Amazon EC2 Cluster Compute Instances
• Virtualization is a key technology.
  – Providers rely on virtualization to consolidate computing resources.
• Virtualization provides not only opportunities but also challenges for HPC systems and applications.
  – Concern: performance degradation due to the overhead of virtualization
3. Contribution
• Goal:
  – To realize a practical HPC Cloud whose performance is close to that of bare metal (i.e., non-virtualized) machines
• Contributions:
  – A feasibility study evaluating the HPC Challenge benchmark suite on a 16-node InfiniBand cluster
  – An evaluation of the effect of three performance tuning techniques:
    • PCI passthrough
    • NUMA affinity
    • VMM noise reduction
6. Toward a practical HPC Cloud
[Figure: from the current KVM-based HPC Cloud, whose performance is poor and unstable, to a "true" HPC Cloud whose performance is close to that of bare metal machines. The VM is a QEMU process whose VCPU threads run on the host Linux kernel; three tunings reduce the overhead:
• Use PCI passthrough – the guest OS accesses the NIC directly through its physical driver.
• Set NUMA affinity – pin the VCPU threads of the VM (QEMU process) to CPU sockets.
• Reduce VMM noise – reduce the overhead of interrupt virtualization and disable unnecessary services on the host OS (e.g., ksmd).]
7. IO architectures of VMs
[Figure: two VM I/O architectures side by side.
• IO emulation: the guest driver in each VM talks to a virtual switch (vSwitch) in the VMM, and the VMM's physical driver accesses the NIC. Performance degrades due to the overhead of VMM processing.
• PCI passthrough: the guest OS holds the physical driver and accesses the NIC directly, bypassing the VMM. This achieves performance comparable to bare metal machines; a hedged configuration sketch follows.]
VMM: Virtual Machine Monitor
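As a concrete illustration of the passthrough path, the sketch below assigns a host PCI device (e.g., the InfiniBand HCA) to a running guest through libvirt's hostdev interface. It is a minimal sketch, not the configuration used in the talk: the connection URI, the domain name "hpc-vm1", and the PCI address 0000:06:00.0 are placeholders to adapt to the actual host.

/* Hypothetical sketch: assign a host PCI device to a guest via libvirt
 * PCI passthrough (hostdev).  Look up the real PCI address with lspci.
 * Compile: gcc passthrough.c -o passthrough -lvirt */
#include <stdio.h>
#include <stdlib.h>
#include <libvirt/libvirt.h>

int main(void)
{
    const char *hostdev_xml =
        "<hostdev mode='subsystem' type='pci' managed='yes'>"
        "  <source>"
        "    <address domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>"
        "  </source>"
        "</hostdev>";

    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (conn == NULL) {
        fprintf(stderr, "failed to connect to the hypervisor\n");
        return EXIT_FAILURE;
    }

    /* "hpc-vm1" is a placeholder domain name. */
    virDomainPtr dom = virDomainLookupByName(conn, "hpc-vm1");
    if (dom == NULL) {
        virConnectClose(conn);
        return EXIT_FAILURE;
    }

    /* Attach the device so the guest driver talks to the NIC directly,
     * bypassing the VMM's emulated I/O path. */
    if (virDomainAttachDevice(dom, hostdev_xml) < 0)
        fprintf(stderr, "device attach failed\n");

    virDomainFree(dom);
    virConnectClose(conn);
    return EXIT_SUCCESS;
}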
8. NUMA affinity
[Figure: a bare metal node. Application threads run on Linux; numactl and the process scheduler place them on physical CPUs P0–P3, grouped into CPU sockets, each with its own local memory.]
On NUMA systems, memory affinity is an important performance factor: local memory accesses are faster than remote memory accesses. To avoid inter-socket memory transfers, binding a thread to a CPU socket can be effective (a programmatic sketch follows).
NUMA: Non-Uniform Memory Access
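A minimal sketch of what the numactl binding in the figure amounts to, written against libnuma. It is illustrative only; the talk uses the numactl command itself, and node 0 and the buffer size here are arbitrary choices.

/* Roughly the programmatic equivalent of
 * "numactl --cpunodebind=0 --membind=0 ./app".
 * Compile: gcc numa_bind.c -o numa_bind -lnuma */
#include <stdio.h>
#include <stdlib.h>
#include <numa.h>

int main(void)
{
    size_t i, n = 1024 * 1024;

    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return EXIT_FAILURE;
    }

    /* Restrict this thread to the CPUs of NUMA node 0 ... */
    numa_run_on_node(0);

    /* ... and allocate its working set from node 0's local memory,
     * avoiding slower inter-socket (remote) memory accesses. */
    double *buf = numa_alloc_onnode(n * sizeof(double), 0);
    if (buf == NULL)
        return EXIT_FAILURE;

    for (i = 0; i < n; i++)
        buf[i] = (double)i;

    numa_free(buf, n * sizeof(double));
    return EXIT_SUCCESS;
}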
9. NUMA affinity: KVM
[Figure: bare metal vs. KVM. On bare metal, numactl and the Linux process scheduler bind threads to physical CPUs P0–P3. On KVM, the VM is a QEMU process whose VCPU threads V0–V3 run on the host's Linux kernel: taskset pins each VCPU to a physical CPU (Vn = Pn), and inside the guest OS numactl binds the application threads to the corresponding vSocket. A libvirt-based sketch of the pinning step follows.]
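The host-side pinning step (done with taskset in the figure) can also be expressed through the libvirt API. The following is a hedged sketch rather than the authors' actual script: the domain name "hpc-vm1" is a placeholder, and the 8 vCPUs mirror the VM configuration on slide 13.

/* Pin each virtual CPU n to physical CPU n (Vn = Pn) so the guest's
 * vSocket layout matches the physical NUMA topology.
 * Compile: gcc pin_vcpus.c -o pin_vcpus -lvirt */
#include <stdio.h>
#include <string.h>
#include <libvirt/libvirt.h>

#define NCPUS 8   /* vCPUs in the VM and pCPUs used on the host */

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    virDomainPtr dom = conn ? virDomainLookupByName(conn, "hpc-vm1") : NULL;
    if (dom == NULL)
        return 1;

    int maplen = VIR_CPU_MAPLEN(NCPUS);
    unsigned int vcpu;
    for (vcpu = 0; vcpu < NCPUS; vcpu++) {
        unsigned char cpumap[VIR_CPU_MAPLEN(NCPUS)];
        memset(cpumap, 0, maplen);
        VIR_USE_CPU(cpumap, vcpu);   /* allow only physical CPU n */
        if (virDomainPinVcpu(dom, vcpu, cpumap, maplen) < 0)
            fprintf(stderr, "failed to pin vCPU %u\n", vcpu);
    }

    virDomainFree(dom);
    virConnectClose(conn);
    return 0;
}

Inside the guest, the application threads are then bound to the corresponding vSocket with numactl, exactly as on bare metal.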
10. NUMA affinity: Xen
[Figure: bare metal vs. Xen. On Xen, the guest (DomU) runs VCPU threads V0–V3, which the Xen hypervisor's domain scheduler pins to physical CPUs (Vn = Pn); the management VM (Dom0) runs alongside. numactl cannot run on the guest OS, because Xen does not disclose the physical NUMA topology to it.]
11. VMM noise
• OS noise is a well-known problem for large-scale system scalability.
  – OS activities and daemon programs take up CPU time, consume cache and TLB entries, and delay the synchronization of parallel processes.
• VMM-level noise, which we call VMM noise, can cause the same problem for a guest OS.
  – The overhead of interrupt virtualization, which results in VM exits (i.e., VM-to-VMM switches)
  – Unnecessary services on the host OS (e.g., ksmd)
• At present we do not address VMM noise; see the sketch below for the ksmd case.
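Although VMM noise reduction is left as an open issue in this work, slide 6 names one concrete step: disabling unnecessary host-OS services such as ksmd. A minimal sketch of that step on a Linux/KVM host, equivalent to running `echo 0 > /sys/kernel/mm/ksm/run` as root:

/* Stop the ksmd (kernel same-page merging) scanner on the host. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/sys/kernel/mm/ksm/run", "w");
    if (f == NULL) {
        perror("open /sys/kernel/mm/ksm/run");
        return 1;
    }
    /* 0 = stop the ksmd scanner; 1 would re-enable it. */
    fputs("0\n", f);
    fclose(f);
    return 0;
}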
13. Experimental setting
Evaluation of the HPC Challenge benchmark on a 16-node InfiniBand cluster.

Hardware (per blade):
  Blade server  Dell PowerEdge M610
  CPU           Intel quad-core Xeon E5540 / 2.53 GHz x2
  Chipset       Intel 5520
  Memory        48 GB DDR3
  InfiniBand    Mellanox ConnectX (MT26428)
  Blade switch  Mellanox M3601Q (InfiniBand QDR, 16 ports)

Host machine environment:
  OS            Debian 6.0.1
  Linux kernel  2.6.32-5-amd64
  KVM           0.12.50
  Xen           4.0.1
  Compiler      gcc/gfortran 4.4.5
  MPI           Open MPI 1.4.2

VM environment:
  VCPU          8
  Memory        45 GB
  Only 1 VM runs on each host.
14. HPC Challenge Benchmark Suite
We measure spatial and temporal locality boundaries by evaluating the HPC Challenge benchmark suite.
[Figure: "Motivation of the HPCC design" – the HPCC benchmarks (HPL, DGEMM, PTRANS, STREAM, FFT, RandomAccess) plotted along spatial- and temporal-locality axes, with Mission Partner Applications spanning the middle. HPL is compute intensive, PTRANS and STREAM are memory intensive, and FFT and RandomAccess are communication intensive.]
From: Piotr Luszczek, et al., "The HPC Challenge (HPCC) Benchmark Suite," SC2006 Tutorial.
15. HPC Challenge: Result
[Figure: radar chart of HPCC results normalized to BMM (bare metal machine), radial scale 0–1.4, for six configurations: BMM, BMM+pin, KVM, KVM+pin+bind, Xen, Xen+pin. Axes: HPL(G) (compute intensive), PTRANS(G) (memory intensive), STREAM(EP), RandomAccess(G) and FFT(G) (communication intensive), Random Ring Bandwidth, and Random Ring Latency.]
Comparing Xen and KVM, the performance is almost the same.
G: Global, EP: Embarrassingly Parallel. Higher is better, except for Random Ring Latency.
16. HPC Challenge: Result
[Figure: two radar charts of HPCC results over the same axes, radial scale 0–1.2: one comparing BMM, Xen, and Xen+pin; the other comparing BMM, KVM, and KVM+pin+bind.]
NUMA affinity is important even on a VM, but the effect of VCPU pinning is uncertain.
G: Global, EP: Embarrassingly Parallel. Higher is better, except for Random Ring Latency.
17. HPL: High Performance LINPACK
• BMM (bare metal machine): the LINPACK efficiency is 57.7% on 16 nodes (63.1% on a single node).
• On both BMM and KVM, setting NUMA affinity is effective.
• The virtualization overhead is 6 to 8%.

Performance in GFLOPS (ratio to BMM in parentheses):
Configuration      1 node          16 nodes
BMM                50.24 (1.00)    706.21 (1.00)
BMM + bind         51.07 (1.02)    747.88 (1.06)
Xen                49.44 (0.98)    700.23 (0.99)
Xen + pin          49.37 (0.98)    698.93 (0.99)
KVM                48.03 (0.96)    671.97 (0.95)
KVM + pin + bind   49.33 (0.98)    684.96 (0.97)
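For reference, the LINPACK efficiency quoted above is the measured performance relative to the theoretical peak; assuming the usual 4 double-precision flops per cycle for these Xeon E5540 cores:

\[
\mathrm{Efficiency} = \frac{R_{\mathrm{max}}}{R_{\mathrm{peak}}}, \qquad
R_{\mathrm{peak}} = n_{\mathrm{cores}} \times f_{\mathrm{clock}} \times \mathrm{flops/cycle}.
\]

For 16 nodes, R_peak = 16 x 8 x 2.53 GHz x 4 ≈ 1295 GFLOPS, so the tuned bare-metal result of 747.88 GFLOPS gives 747.88 / 1295 ≈ 57.7%, matching the efficiency quoted above.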
18. Discussion
• The performance of the global benchmarks, except for FFT(G), is almost comparable to that of bare metal machines.
  – FFT(G) performance decreases by 11% to 20% due to virtualization overhead related to inter-node communication and/or VMM noise.
  – PCI passthrough brings MPI communication throughput close to that of bare metal machines, but interrupt injection, which results in VM exits, can still disturb application execution.
19. Discussion (cont.)
• The performance of Xen is marginally better than that of KVM, except for Random Ring Bandwidth.
  – The bandwidth decreases by 4% on KVM but by 20% on Xen.
• KVM: the performance of STREAM(EP) decreases by 27%.
  – Heavy memory contention among processes (TLB misses) may occur. This is the worst case for EPT (Extended Page Tables), because an EPT page walk takes more time than a shadow-page-table walk (a rough cost model follows below). This suggests a virtual machine is more sensitive to memory contention than a bare metal machine.
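A rough, standard cost model (not measured in this talk) makes the EPT point concrete: with an n-level guest page table and an m-level extended page table, each guest page-table reference must itself be translated through the EPT, so a worst-case TLB miss costs

\[
N_{\mathrm{refs}} = (n+1)(m+1) - 1
\]

memory references, i.e. 24 for the four-level tables used on x86-64 (n = m = 4), versus at most 4 on bare metal. This is why TLB-miss-heavy memory access patterns are hit hardest under virtualization.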
21. Summary
HPC Cloud is promising!
• The performance of coarse-grained parallel applications is comparable to that of bare metal machines.
• We plan to adopt these performance tuning techniques in our private cloud service, "AIST Cloud."
• Open issues:
  – VMM noise reduction
  – Live migration with VMM-bypass devices