VMworld 2013
Bhavesh Davda, VMware
Josh Simons, VMware
1. Silent Killer: How Latency Destroys
Performance...And What to Do About It
Bhavesh Davda, VMware
Josh Simons, VMware
VSVC5187
#VSVC5187
2.
Agenda
Introduction
• Definitions
• Effects
• Sources
Mitigation
• BIOS settings
• CPU scheduling and over-commitment
• Memory over-commitment and MMU virtualization
• NUMA and vNUMA
• Guest OS
• Storage
• Networking
3.
What is Latency?
Examples in computing environments:
• Signal propagation within a microprocessor
• Memory access from cache, from local memory, from non-local memory
• PCI I/O data transfers
• Data access within rotating media
• Operating system scheduling
• Network communication, local and wide area
• Application logic
Typically reported as average latency
Latency is a measure of time delay experienced in a system,
the precise definition of which depends on the system and
the time being measured. (Wikipedia)
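Because latency is typically reported as an average, it is worth looking at percentiles too: a small tail of slow operations can be an order of magnitude worse than the mean without moving it much. A minimal Python sketch using synthetic numbers (not measurements):

```python
import random
import statistics

# Illustrative latency samples in microseconds (not measurements):
# mostly ~100 us, with a 2% tail of ~2000 us outliers.
random.seed(42)
samples = [random.gauss(100, 5) for _ in range(980)] + \
          [random.gauss(2000, 100) for _ in range(20)]

def percentile(data, p):
    """Nearest-rank percentile of a list of samples."""
    s = sorted(data)
    k = min(len(s) - 1, max(0, round(p / 100 * len(s)) - 1))
    return s[k]

# The average looks healthy while the tail is ~20x worse.
print("mean  :", round(statistics.mean(samples), 1), "us")
print("median:", round(percentile(samples, 50), 1), "us")
print("p99   :", round(percentile(samples, 99), 1), "us")
```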
6.
What is Jitter?
Examples in computing environments
• Unpredictable response times in financial trading applications
• Stalling, stuttering audio and video in telecommunication applications
• Reduced performance of distributed parallel computing applications
• Measurable variations in run times for long-running jobs
Jitter is variation in latency that causes non-deterministic
performance in seemingly deterministic workloads
“Insanity: doing the same thing over and
over again and expecting different results.”
Albert Einstein
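Jitter is commonly quantified as the standard deviation or peak-to-peak spread of a latency trace: two workloads can share the same average latency yet behave very differently. A small sketch with illustrative values:

```python
import statistics

# Two synthetic latency traces (ms) with the same mean but very
# different jitter; values are illustrative only.
steady  = [10.0, 10.1, 9.9, 10.0, 10.1, 9.9] * 100
jittery = [2.0, 18.0, 5.0, 15.0, 9.0, 11.0] * 100

for name, trace in (("steady ", steady), ("jittery", jittery)):
    print(name,
          "mean:", round(statistics.mean(trace), 2),
          "stdev:", round(statistics.stdev(trace), 2),
          "peak-to-peak:", round(max(trace) - min(trace), 2))
```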
7.
Agenda
Introduction
• Definitions
• Effects
• Sources
Mitigation
• BIOS settings
• CPU scheduling and over-commitment
• Memory over-commitment and MMU virtualization
• NUMA and vNUMA
• Guest OS
• Storage
• Networking
8.
Effects of Latency and Jitter on VoIP Audio Quality
Audio samples (original vs. 5% and 20% packet drop): http://www.voiptroubleshooter.com/sound_files/
[Figure: packets 1-6 pass through a de-jitter buffer and are played out after a fixed play-out latency; packets arriving too late (here 3 and 6) are dropped. The ITU-T G.114 latency recommendation maps one-way delay to Mean Opinion Score (MOS) bands from 2.6-3.1 up to 4.3-5.0; higher is better.]
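De-jitter buffering can be sketched as a fixed playout deadline per packet: anything arriving later than the playout delay is dropped (here packets 3 and 6, matching the figure). Delay values are illustrative:

```python
# Minimal sketch of a fixed de-jitter (playout) buffer: VoIP packets
# are sent on a fixed cadence, arrive with variable network delay,
# and must reach the buffer within the playout delay or be dropped
# (the decoder then conceals the gap). Delays are illustrative.

def play_out(network_delays_ms, playout_delay_ms=60):
    played, dropped = [], []
    for seq, delay in enumerate(network_delays_ms, start=1):
        if delay <= playout_delay_ms:   # arrived before its deadline
            played.append(seq)
        else:                           # too late for its playout slot
            dropped.append(seq)
    return played, dropped

# Packets 3 and 6 hit delay spikes larger than the playout delay:
played, dropped = play_out([30, 40, 95, 25, 35, 80])
print("played :", played)
print("dropped:", dropped)
```

A larger playout delay drops fewer packets but adds latency to every packet, which is exactly the latency/quality tradeoff the G.114 recommendation addresses.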
9.
The Case of the Missing Supercomputer Performance
The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q, Petrini, F., Kerbyson, D., Pakin, S., Proceedings of the 2003 ACM/IEEE Conference on Supercomputing
Peer-to-peer parallel (MPI) application performance degrades as scale increases – up to 2X worse than predicted by model
No obvious explanations, initially
Noise – extraneous daemons, kernel timers, etc. – indicted as the problem
Jittered arrival times at application synchronization points resulted in significant overall slowdowns
[Figures: measured vs. modeled performance at increasing scale; lower is better]
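The amplification mechanism Petrini et al. describe can be illustrated with a toy model: a barrier completes only when the slowest rank finishes, so a delay that is rare on any one node hits almost every step once the node count is large. Numbers below are illustrative, not ASCI Q data:

```python
import random

# Toy model of OS-noise amplification at synchronization points:
# each rank independently loses noise_ms to an OS interruption with
# small probability, and the barrier waits for the SLOWEST rank.

def barrier_step_ms(n_ranks, rng, compute_ms=1.0,
                    noise_ms=5.0, noise_prob=0.01):
    return max(compute_ms + (noise_ms if rng.random() < noise_prob else 0.0)
               for _ in range(n_ranks))

rng = random.Random(0)
for n in (1, 64, 4096):
    total = sum(barrier_step_ms(n, rng) for _ in range(1000))
    print(f"{n:5d} ranks: {total:6.0f} ms for 1000 barrier steps")
```

At 4096 ranks the probability that *no* rank is interrupted in a given step is vanishingly small, so nearly every step pays the full noise penalty.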
13.
Agenda
Introduction
• Definitions
• Effects
• Sources
Mitigation
• BIOS settings
• CPU scheduling and over-commitment
• Memory over-commitment and MMU virtualization
• NUMA and vNUMA
• Guest OS
• Storage
• Networking
14.
Network Latency in Bare Metal Environments
Message copy from application to OS (kernel)
OS (network stack) + NIC driver queues packet for NIC
NIC DMAs packet and transmits on the wire
[Diagram: server (CPUs, RAM, interconnect, NIC, disk) connected to a network switch]
15.
Network Latency in Virtual Environments
Message copy from application to GOS (kernel)
GOS (network stack) + vNIC driver queues packet for vNIC
VM exit to VMM/Hypervisor
vNIC implementation emulates DMA from VM, sends to vSwitch
vSwitch queues packet for pNIC
pNIC DMAs packet and transmits on the wire
[Diagram: VMs, management agents, and background tasks running on the ESXi hypervisor, connected through a virtual switch and physical NIC to a network switch]
16.
Network Storage: Small I/O Case Study
Rendering applications
• 1.4X – 3X slowdown seen initially
Customer NFS stress test
• 10K files
• 1K random reads/file
• 1-32K bytes
• 7X slowdown
Single change
• Disable LRO (Large Receive Offload) within the guest to avoid coalescing of small messages upon arrival
• See KB 1027511: Poor TCP Performance can occur in Linux virtual machines with LRO enabled
Final application performance
• 1 – 5% slower than native
[Diagram: application in a Guest OS on ESXi, reading from an NFS server]
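A toy model (illustrative numbers, not the customer's measurements) of why coalescing small responses inflates sequential small-I/O time: each read waits for its response before issuing the next, so any per-response aggregation delay adds up linearly.

```python
# Toy model of receive coalescing (LRO) vs. small sequential reads:
# a coalescing vNIC holds an isolated small response until an
# aggregation window expires, and a synchronous client pays that
# window on every read. Timings are illustrative.

def run_sequential_reads_us(n_reads, wire_us, coalesce_window_us):
    # Each read = request + response on the wire, plus any
    # coalescing delay before the response reaches the stack.
    per_read = 2 * wire_us + coalesce_window_us
    return n_reads * per_read

fast = run_sequential_reads_us(1000, 50, 0)     # LRO disabled
slow = run_sequential_reads_us(1000, 50, 300)   # LRO holding packets
print(f"LRO off: {fast/1e6:.2f} s   LRO on: {slow/1e6:.2f} s   "
      f"slowdown: {slow/fast:.1f}x")
```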
18.
Agenda
Introduction
• Definitions
• Effects
• Sources
Mitigation
• BIOS settings
• CPU scheduling and over-commitment
• Memory over-commitment and MMU virtualization
• NUMA and vNUMA
• Guest OS
• Storage
• Networking
19.
General Guidelines about Tuning for Latency
vSphere ESXi is designed for high performance and fairness
• Maximizes overall performance of all VMs without unfairly penalizing any VM
• Defaults are carefully tuned for high throughput
Tunable settings should be thoroughly vetted in a test environment before deployment
Tunings should be applied individually to study their effects on performance
• Maintain good change control practices
Certain tunables for lowest latency can negatively affect throughput and efficiency, so consider tradeoffs
• Consider isolating latency-sensitive VMs on dedicated hosts
• DRS host groups can be used to manage groups of hosts supporting latency-sensitive VMs
20.
Optimizing for Latency-sensitive Workloads (1 of 3)
Power Management
• Set at both BIOS and hypervisor levels: Max performance / Static High
• Hyperthreading may cause jitter due to pipeline sharing
• Intel Turbo Boost may cause runtime jitter
CPU and memory over-commitment
• Transparent page sharing may cause jitter due to non-deterministic share-breaking on writes. To disable: sched.mem.pshare.enable = FALSE
• Memory compression. To disable: Mem.MemZipEnable = 0
• Better to avoid over-subscription of resources
Memory virtualization
• Hardware memory virtualization can sometimes be slower than software approaches. For shadow page tables (i.e., the software approach): monitor.virtual_mmu = software
22.
NUMA and vNUMA
[Diagram: an application's processes spread across the sockets and memory of a multi-socket host, with the hypervisor exposing virtual NUMA nodes]
Making virtual NUMA nodes visible within the Guest OS allows ESXi to respect GOS process placement and memory allocation decisions, which can lead to significant performance increases
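In a Linux guest, whether vNUMA topology is actually visible can be checked from sysfs (the same information `numactl --hardware` reports). A small sketch:

```python
import glob
import os

# Check which NUMA topology a Linux guest sees: with vNUMA exposed,
# /sys/devices/system/node contains one nodeN directory per virtual
# NUMA node; a lone node0 means the guest sees flat (UMA) memory.

def visible_numa_nodes():
    return sorted(os.path.basename(p)
                  for p in glob.glob("/sys/devices/system/node/node[0-9]*"))

nodes = visible_numa_nodes()
print("NUMA nodes seen by guest:", nodes if nodes else "(sysfs not available)")
```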
23.
Optimizing for Latency-sensitive Workloads (2 of 3)
NUMA
• ESXi optimally allocates CPU and memory
• NUMA node affinity can be set manually: numa.nodeAffinity = X
• Exposing NUMA topology to wide guests (vNUMA) can be very important. Automatic for #vCPU > 8 and can be forced otherwise: numa.vcpu.min = N (< #vCPUs)
• NUMA scheduler does not include HT by default. Can be overridden to prevent a VM split across NUMA nodes: numa.vcpu.preferHT = "1"
24.
vNUMA Performance Study: SpecOMP (Lower is Better)
Performance Evaluation of HPC Benchmarks on VMware's ESX Server, Ali, Q., Kiriansky, V., Simons, J., Zaroo, P., 5th Workshop on System-level Virtualization for High Performance Computing, 2011
25.
Optimizing for Latency-sensitive Workloads (2 of 3)
VM scheduling optimizations
• e.g., suppress descheduling when the guest idles: monitor_control.halt_desched = FALSE
Guest OS choice
• Later distributions are usually better (tickless kernel, etc.)
• RHEL 6+, SLES 11+, etc. (2.6.32+ kernel)
• Windows Server 2008+
26.
Optimizing for Latency-sensitive Workloads (3 of 3)
Storage
• Storage stack already tuned for small block transfers
• iSCSI and NAS (host and guest) affected by network tuning parameters
• Local Flash memory's much lower latency exposes overheads in the software stack that we are working to address
Networking
• Interrupt coalescing should be disabled
  • vNIC: ethernetX.coalescingScheme = "disabled"
  • pNIC: via esxcli module parameter (driver-specific)
• Jumbo frames may interfere with low-latency traffic
• Disable Large Receive Offload (LRO) for TCP (including NAS)
• Polling for I/O completion rather than using interrupts (DPDK, RDMA poll mode)
• Passthrough / direct assignment for lowest I/O latencies
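The interrupt-vs-polling tradeoff above can be sketched in Python, though CPython's GIL and scheduler make the absolute numbers unrepresentative; real poll-mode stacks (DPDK, RDMA poll mode) spin in native code on a dedicated core:

```python
import threading
import time

# Contrast an interrupt-style blocking wait with a busy-poll loop for
# an I/O "completion": blocking yields the CPU but pays scheduler
# wakeup latency; polling burns a core but notices completion quickly.

def timed_wait(waiter, signaler):
    seen = {}
    t = threading.Thread(target=lambda: seen.setdefault("t", waiter()))
    t.start()
    time.sleep(0.05)                  # let the waiter start waiting
    t0 = time.perf_counter()
    signaler()                        # the "I/O completion" event
    t.join()
    return (seen["t"] - t0) * 1e6     # wakeup latency in microseconds

ev = threading.Event()
blocking_us = timed_wait(lambda: (ev.wait(), time.perf_counter())[1], ev.set)

done = [False]
def poll():
    while not done[0]:                # burn CPU instead of sleeping
        pass
    return time.perf_counter()
polling_us = timed_wait(poll, lambda: done.__setitem__(0, True))

print(f"blocking wait: {blocking_us:8.1f} us")
print(f"busy polling : {polling_us:8.1f} us")
```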
30.
New Features Planned for Upcoming vSphere ESXi Releases
New virtual machine property: “Latency sensitivity”
• High => lowest latency
Exclusively assign physical CPUs to virtual CPUs of “Latency
Sensitivity = High” VMs
• Physical CPUs not used for scheduling other VMs or ESXi tasks
Idle in Virtual Machine monitor (VMM) when Guest OS is idle
• Lowers latency to wake up the idle Guest OS, compared to idling in ESXi
vmkernel
Disable vNIC interrupt coalescing
For DirectPath I/O, optimize interrupt delivery path for lowest
latency
Make ESXi vmkernel more preemptible
• Reduces jitter due to long-running kernel code
31.
Summary
Virtualization does add some latency over bare metal
vSphere is generally tuned for throughput and fairness
• Tunables exist at the host, VM, and guest level to improve latency
• This will become more automatic in subsequent releases
ESXi is a good hypervisor for virtualizing an increasingly broad array of applications, including latency-sensitive applications such as Telco, Financial, and some HPC workloads
When observing application performance degradation in the future, we hope you will think about the "silent killer" and try some of the techniques we've described here
32.
Resources
Best Practices for Performance Tuning of Latency-Sensitive Workloads in vSphere VMs
http://www.vmware.com/resources/techresources/10220
Network I/O Latency in vSphere 5
http://www.vmware.com/resources/techresources/10256
Deploying Extremely Latency-Sensitive Applications in vSphere 5.5
http://www.vmware.com/files/pdf/techpaper/deploying-latency-sensitive-apps-vSphere5.pdf
RDMA Performance in Virtual Machines Using QDR InfiniBand on VMware vSphere 5
http://labs.vmware.com/academic/publications/ib-researchnote-apr2012
33.
Other VMworld Activities Related to This Session
HOL:
HOL-SDC-1304
vSphere Performance Optimization
Session:
VSVC5596
Extreme Performance Series: Network Speed Ahead