VMworld 2013
Bhavesh Davda, VMware
Josh Simons, VMware
1. Silent Killer: How Latency Destroys
Performance...And What to Do About It
Bhavesh Davda, VMware
Josh Simons, VMware
VSVC5187
#VSVC5187
2.
Agenda
Introduction
• Definitions
• Effects
• Sources
Mitigation
• BIOS settings
• CPU scheduling and over-commitment
• Memory over-commitment and MMU virtualization
• NUMA and vNUMA
• Guest OS
• Storage
• Networking
3.
What is Latency?
Examples in computing environments:
• Signal propagation within a microprocessor
• Memory access from cache, from local memory, from non-local memory
• PCI I/O data transfers
• Data access within rotating media
• Operating system scheduling
• Network communication, local and wide area
• Application logic
Typically reported as average latency
Latency is a measure of time delay experienced in a system,
the precise definition of which depends on the system and
the time being measured. (Wikipedia)
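Because latency is typically reported as an average, it is worth looking at percentiles too: a small tail of slow operations can be an order of magnitude worse than the mean without moving it much. A minimal Python sketch using synthetic numbers (not measurements):

```python
import random
import statistics

# Illustrative latency samples in microseconds (not measurements):
# mostly ~100 us, with a 2% tail of ~2000 us outliers.
random.seed(42)
samples = [random.gauss(100, 5) for _ in range(980)] + \
          [random.gauss(2000, 100) for _ in range(20)]

def percentile(data, p):
    """Nearest-rank percentile of a list of samples."""
    s = sorted(data)
    k = min(len(s) - 1, max(0, round(p / 100 * len(s)) - 1))
    return s[k]

# The average looks healthy while the tail is ~20x worse.
print("mean  :", round(statistics.mean(samples), 1), "us")
print("median:", round(percentile(samples, 50), 1), "us")
print("p99   :", round(percentile(samples, 99), 1), "us")
```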
6.
What is Jitter?
Examples in computing environments
• Unpredictable response times in financial trading applications
• Stalling, stuttering audio and video in telecommunication applications
• Reduced performance of distributed parallel computing applications
• Measurable variations in run times for long-running jobs
Jitter is variation in latency that causes non-deterministic
performance in seemingly deterministic workloads
“Insanity: doing the same thing over and
over again and expecting different results.”
Albert Einstein
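Jitter is commonly quantified as the standard deviation or peak-to-peak spread of a latency trace: two workloads can share the same average latency yet behave very differently. A small sketch with illustrative values:

```python
import statistics

# Two synthetic latency traces (ms) with the same mean but very
# different jitter; values are illustrative only.
steady  = [10.0, 10.1, 9.9, 10.0, 10.1, 9.9] * 100
jittery = [2.0, 18.0, 5.0, 15.0, 9.0, 11.0] * 100

for name, trace in (("steady ", steady), ("jittery", jittery)):
    print(name,
          "mean:", round(statistics.mean(trace), 2),
          "stdev:", round(statistics.stdev(trace), 2),
          "peak-to-peak:", round(max(trace) - min(trace), 2))
```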
7.
Agenda
Introduction
• Definitions
• Effects
• Sources
Mitigation
• BIOS settings
• CPU scheduling and over-commitment
• Memory over-commitment and MMU virtualization
• NUMA and vNUMA
• Guest OS
• Storage
• Networking
8.
Effects of Latency and Jitter on VoIP Audio Quality
Audio samples (original vs. 5% and 20% packet drop): http://www.voiptroubleshooter.com/sound_files/
[Figure: packets 1-6 pass through a de-jitter buffer and are played out after a fixed play-out latency; packets arriving too late (here 3 and 6) are dropped. The ITU-T G.114 latency recommendation maps one-way delay to Mean Opinion Score (MOS) bands from 2.6-3.1 up to 4.3-5.0; higher is better.]
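De-jitter buffering can be sketched as a fixed playout deadline per packet: anything arriving later than the playout delay is dropped (here packets 3 and 6, matching the figure). Delay values are illustrative:

```python
# Minimal sketch of a fixed de-jitter (playout) buffer: VoIP packets
# are sent on a fixed cadence, arrive with variable network delay,
# and must reach the buffer within the playout delay or be dropped
# (the decoder then conceals the gap). Delays are illustrative.

def play_out(network_delays_ms, playout_delay_ms=60):
    played, dropped = [], []
    for seq, delay in enumerate(network_delays_ms, start=1):
        if delay <= playout_delay_ms:   # arrived before its deadline
            played.append(seq)
        else:                           # too late for its playout slot
            dropped.append(seq)
    return played, dropped

# Packets 3 and 6 hit delay spikes larger than the playout delay:
played, dropped = play_out([30, 40, 95, 25, 35, 80])
print("played :", played)
print("dropped:", dropped)
```

A larger playout delay drops fewer packets but adds latency to every packet, which is exactly the latency/quality tradeoff the G.114 recommendation addresses.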
9.
The Case of the Missing Supercomputer Performance
The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q, Petrini, F., Kerbyson, D., Pakin, S., Proceedings of the 2003 ACM/IEEE Conference on Supercomputing
Peer-to-peer parallel (MPI) application performance degrades as scale increases – up to 2X worse than predicted by model
No obvious explanations, initially
Noise – extraneous daemons, kernel timers, etc. – indicted as the problem
Jittered arrival times at application synchronization points resulted in significant overall slowdowns
[Figures: measured vs. modeled performance at increasing scale; lower is better]
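The amplification mechanism Petrini et al. describe can be illustrated with a toy model: a barrier completes only when the slowest rank finishes, so a delay that is rare on any one node hits almost every step once the node count is large. Numbers below are illustrative, not ASCI Q data:

```python
import random

# Toy model of OS-noise amplification at synchronization points:
# each rank independently loses noise_ms to an OS interruption with
# small probability, and the barrier waits for the SLOWEST rank.

def barrier_step_ms(n_ranks, rng, compute_ms=1.0,
                    noise_ms=5.0, noise_prob=0.01):
    return max(compute_ms + (noise_ms if rng.random() < noise_prob else 0.0)
               for _ in range(n_ranks))

rng = random.Random(0)
for n in (1, 64, 4096):
    total = sum(barrier_step_ms(n, rng) for _ in range(1000))
    print(f"{n:5d} ranks: {total:6.0f} ms for 1000 barrier steps")
```

At 4096 ranks the probability that *no* rank is interrupted in a given step is vanishingly small, so nearly every step pays the full noise penalty.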
13.
Agenda
Introduction
• Definitions
• Effects
• Sources
Mitigation
• BIOS settings
• CPU scheduling and over-commitment
• Memory over-commitment and MMU virtualization
• NUMA and vNUMA
• Guest OS
• Storage
• Networking
14.
Network Latency in Bare Metal Environments
Message copy from application to OS (kernel)
OS (network stack) + NIC driver queues packet for NIC
NIC DMAs packet and transmits on the wire
[Diagram: server (CPUs, RAM, interconnect, NIC, disk) connected to a network switch]
15.
Network Latency in Virtual Environments
Message copy from application to GOS (kernel)
GOS (network stack) + vNIC driver queues packet for vNIC
VM exit to VMM/Hypervisor
vNIC implementation emulates DMA from VM, sends to vSwitch
vSwitch queues packet for pNIC
pNIC DMAs packet and transmits on the wire
[Diagram: VMs, management agents, and background tasks running on the ESXi hypervisor, connected through a virtual switch and physical NIC to a network switch]
16.
Network Storage: Small I/O Case Study
Rendering applications
• 1.4X – 3X slowdown seen initially
Customer NFS stress test
• 10K files
• 1K random reads/file
• 1-32K bytes
• 7X slowdown
Single change
• Disable LRO (Large Receive Offload) within the guest to avoid coalescing of small messages upon arrival
• See KB 1027511: Poor TCP Performance can occur in Linux virtual machines with LRO enabled
Final application performance
• 1 – 5% slower than native
[Diagram: application in a Guest OS on ESXi, reading from an NFS server]
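A toy model (illustrative numbers, not the customer's measurements) of why coalescing small responses inflates sequential small-I/O time: each read waits for its response before issuing the next, so any per-response aggregation delay adds up linearly.

```python
# Toy model of receive coalescing (LRO) vs. small sequential reads:
# a coalescing vNIC holds an isolated small response until an
# aggregation window expires, and a synchronous client pays that
# window on every read. Timings are illustrative.

def run_sequential_reads_us(n_reads, wire_us, coalesce_window_us):
    # Each read = request + response on the wire, plus any
    # coalescing delay before the response reaches the stack.
    per_read = 2 * wire_us + coalesce_window_us
    return n_reads * per_read

fast = run_sequential_reads_us(1000, 50, 0)     # LRO disabled
slow = run_sequential_reads_us(1000, 50, 300)   # LRO holding packets
print(f"LRO off: {fast/1e6:.2f} s   LRO on: {slow/1e6:.2f} s   "
      f"slowdown: {slow/fast:.1f}x")
```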
18.
Agenda
Introduction
• Definitions
• Effects
• Sources
Mitigation
• BIOS settings
• CPU scheduling and over-commitment
• Memory over-commitment and MMU virtualization
• NUMA and vNUMA
• Guest OS
• Storage
• Networking
19.
General Guidelines about Tuning for Latency
vSphere ESXi is designed for high performance and fairness
• Maximizes overall performance of all VMs without unfairly penalizing any VM
• Defaults are carefully tuned for high throughput
Tunable settings should be thoroughly vetted in a test environment before deployment
Tunings should be applied individually to study their effects on performance
• Maintain good change control practices
Certain tunables for lowest latency can negatively affect throughput and efficiency, so consider tradeoffs
• Consider isolating latency-sensitive VMs on dedicated hosts
• DRS host groups can be used to manage groups of hosts supporting latency-sensitive VMs
20.
Optimizing for Latency-sensitive Workloads (1 of 3)
Power Management
• Set at both BIOS and hypervisor levels: Max performance / Static High
• Hyperthreading may cause jitter due to pipeline sharing
• Intel Turbo Boost may cause runtime jitter
CPU and memory over-commitment
• Transparent page sharing may cause jitter due to non-deterministic share-breaking on writes. To disable: sched.mem.pshare.enable = FALSE
• Memory compression. To disable: Mem.MemZipEnable = 0
• Better to avoid over-subscription of resources
Memory virtualization
• Hardware memory virtualization can sometimes be slower than software approaches. For shadow page tables (i.e., the software approach): monitor.virtual_mmu = software
22.
NUMA and vNUMA
[Diagram: an application's processes spread across the sockets and memory of a multi-socket host, with the hypervisor exposing virtual NUMA nodes]
Making virtual NUMA nodes visible within the Guest OS allows ESXi to respect GOS process placement and memory allocation decisions, which can lead to significant performance increases
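In a Linux guest, whether vNUMA topology is actually visible can be checked from sysfs (the same information `numactl --hardware` reports). A small sketch:

```python
import glob
import os

# Check which NUMA topology a Linux guest sees: with vNUMA exposed,
# /sys/devices/system/node contains one nodeN directory per virtual
# NUMA node; a lone node0 means the guest sees flat (UMA) memory.

def visible_numa_nodes():
    return sorted(os.path.basename(p)
                  for p in glob.glob("/sys/devices/system/node/node[0-9]*"))

nodes = visible_numa_nodes()
print("NUMA nodes seen by guest:", nodes if nodes else "(sysfs not available)")
```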
23.
Optimizing for Latency-sensitive Workloads (2 of 3)
NUMA
• ESXi optimally allocates CPU and memory
• NUMA node affinity can be set manually: numa.nodeAffinity = X
• Exposing NUMA topology to wide guests (vNUMA) can be very important. Automatic for #vCPU > 8 and can be forced otherwise: numa.vcpu.min = N (< #vCPUs)
• NUMA scheduler does not include HT by default. Can be overridden to prevent a VM split across NUMA nodes: numa.vcpu.preferHT = "1"
24.
vNUMA Performance Study: SpecOMP (Lower is Better)
Performance Evaluation of HPC Benchmarks on VMware's ESX Server, Ali, Q., Kiriansky, V., Simons, J., Zaroo, P., 5th Workshop on System-level Virtualization for High Performance Computing, 2011
25.
Optimizing for Latency-sensitive Workloads (2 of 3)
VM scheduling optimizations
• e.g., suppress descheduling when the guest idles: monitor_control.halt_desched = FALSE
Guest OS choice
• Later distributions are usually better (tickless kernel, etc.)
• RHEL 6+, SLES 11+, etc. (2.6.32+ kernel)
• Windows Server 2008+
26.
Optimizing for Latency-sensitive Workloads (3 of 3)
Storage
• Storage stack already tuned for small block transfers
• iSCSI and NAS (host and guest) affected by network tuning parameters
• Local Flash memory's much lower latency exposes overheads in the software stack that we are working to address
Networking
• Interrupt coalescing should be disabled
  • vNIC: ethernetX.coalescingScheme = "disabled"
  • pNIC: via esxcli module parameter (driver-specific)
• Jumbo frames may interfere with low-latency traffic
• Disable Large Receive Offload (LRO) for TCP (including NAS)
• Polling for I/O completion rather than using interrupts (DPDK, RDMA poll mode)
• Passthrough / direct assignment for lowest I/O latencies
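The interrupt-vs-polling tradeoff above can be sketched in Python, though CPython's GIL and scheduler make the absolute numbers unrepresentative; real poll-mode stacks (DPDK, RDMA poll mode) spin in native code on a dedicated core:

```python
import threading
import time

# Contrast an interrupt-style blocking wait with a busy-poll loop for
# an I/O "completion": blocking yields the CPU but pays scheduler
# wakeup latency; polling burns a core but notices completion quickly.

def timed_wait(waiter, signaler):
    seen = {}
    t = threading.Thread(target=lambda: seen.setdefault("t", waiter()))
    t.start()
    time.sleep(0.05)                  # let the waiter start waiting
    t0 = time.perf_counter()
    signaler()                        # the "I/O completion" event
    t.join()
    return (seen["t"] - t0) * 1e6     # wakeup latency in microseconds

ev = threading.Event()
blocking_us = timed_wait(lambda: (ev.wait(), time.perf_counter())[1], ev.set)

done = [False]
def poll():
    while not done[0]:                # burn CPU instead of sleeping
        pass
    return time.perf_counter()
polling_us = timed_wait(poll, lambda: done.__setitem__(0, True))

print(f"blocking wait: {blocking_us:8.1f} us")
print(f"busy polling : {polling_us:8.1f} us")
```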
30.
New Features Planned for Upcoming vSphere ESXi Releases
New virtual machine property: “Latency sensitivity”
• High => lowest latency
Exclusively assign physical CPUs to virtual CPUs of “Latency
Sensitivity = High” VMs
• Physical CPUs not used for scheduling other VMs or ESXi tasks
Idle in Virtual Machine monitor (VMM) when Guest OS is idle
• Lowers latency to wake up the idle Guest OS, compared to idling in ESXi
vmkernel
Disable vNIC interrupt coalescing
For DirectPath I/O, optimize interrupt delivery path for lowest
latency
Make ESXi vmkernel more preemptible
• Reduces jitter due to long-running kernel code
31.
Summary
Virtualization does add some latency over bare metal
vSphere is generally tuned for throughput and fairness
• Tunables exist at the host, VM, and guest level to improve latency
• This will become more automatic in subsequent releases
ESXi is a good hypervisor for virtualizing an increasingly broad array of applications, including latency-sensitive applications such as Telco, Financial, and some HPC workloads
When observing application performance degradation in the future, we hope you will think about the "silent killer" and try some of the techniques we've described here
32.
Resources
Best Practices for Performance Tuning of Latency-Sensitive Workloads in vSphere VMs
http://www.vmware.com/resources/techresources/10220
Network I/O Latency in vSphere 5
http://www.vmware.com/resources/techresources/10256
Deploying Extremely Latency-Sensitive Applications in vSphere 5.5
http://www.vmware.com/files/pdf/techpaper/deploying-latency-sensitive-apps-vSphere5.pdf
RDMA Performance in Virtual Machines Using QDR InfiniBand on VMware vSphere 5
http://labs.vmware.com/academic/publications/ib-researchnote-apr2012
33.
Other VMworld Activities Related to This Session
HOL:
HOL-SDC-1304
vSphere Performance Optimization
Session:
VSVC5596
Extreme Performance Series: Network Speed Ahead