
Virtualization overheads


Talk given at 3rd CMG Pune Event: 26th April 2014, Capgemini

  1. Performance overheads of Virtualization. Sandeep Joshi, Principal SDE, Storage Startup. Computer Measurement Group, India, 26 April 2014. www.cmgindia.org
  2. Contents: 1. Hypervisor classification and overview 2. Overheads introduced 3. Benchmarking 4. Analytic models 5. Conclusion
  3. Not covered in this talk • Mobile virtualization - the Motorola Evoke QA4 was the first phone to run two OSes; new Hyp mode in the ARM Cortex-A15 processor. • Nested virtualization - running one hypervisor on top of another. • Network virtualization - SDN, OpenFlow. • Containers (aka OS-level virtualization) - Solaris Zones, LXC, OpenVZ. • Older hypervisors that did binary translation.
  4. Classification • Image: blog.technet.com/b/chenley
  5. VMWare ESX • Image: blog.vmware.com
  6. VMWare ESX • Each virtual machine has multiple worlds (threads), some of which correspond to guest CPUs while others are dedicated to device processing (run "esxtop" on the host). • Monolithic kernel: hardware support is limited to the drivers installed in the hypervisor.
  7. KVM Used in Google cloud, Eucalyptus, and most OpenStack clouds. • Image: Red Hat Summit, June 2013
  8. KVM Linux is the hypervisor. Leverages Linux features (device drivers, NAPI, CPU and IO schedulers, cgroups, madvise, NUMA, etc.) • The guest OS sits inside a Linux process running the QEMU shell; each virtual CPU is a thread inside this process. • Uses QEMU for device virtualization. QEMU in one guest is not aware of QEMU running in another guest.
  9. Microsoft HyperV Used in the Microsoft Azure cloud
  10. Xen When you use Amazon or Rackspace, you are using Xen.
  11. Contents: 1. Hypervisor classification and overview 2. Overheads introduced 3. Benchmarking 4. Analytic models 5. Conclusion
  12. Overheads introduced 1. CPU: nested scheduling triggers the lock preemption problem (use gang scheduling); VM exits are costly. 2. Memory: nested page tables, NUMA topology. 3. Disk: nested filesystems, page cache, IO schedulers, interrupt delivery, DMA. 4. Network: DMA, interrupt delivery. • The next few slides cover hardware assists, nested filesystems, nested IO schedulers and the benefits of IO paravirtualization.
  13. Hardware assists Hardware assists have considerably eased many virtualization overheads: 1. CPU: binary translation was replaced by extra CPU modes: root and guest mode, each with 4 rings. 2. MMU: the shadow page table in software was replaced by EPT/nested page tables. 3. IOMMU: during DMA, it translates Guest Physical Addresses to Machine Physical Addresses. 4. IO-APIC: interrupt delivery is done directly to the guest using the IDT. 5. SR-IOV: virtual functions implemented in the NIC (SR-IOV is also defined for storage adapters but not yet implemented). Benefits: hardware assistance reduces CPU cache contention as well as "Service Demand" on the VM (Service Demand = CPU Utilization/Throughput). Higher throughput is obtained for lower CPU utilization.
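The Service Demand law above can be illustrated with a tiny calculation; the utilization and throughput numbers below are hypothetical, chosen only to show how hardware assistance lowers per-request CPU cost:

```python
# Service Demand law: D = U / X (CPU utilization divided by throughput).
# All numbers below are hypothetical, for illustration only.
def service_demand(cpu_utilization, throughput):
    """Average CPU time consumed per completed request (seconds)."""
    return cpu_utilization / throughput

# Software-only virtualization: 80% CPU busy to sustain 2000 req/s.
d_soft = service_demand(0.80, 2000)   # 0.4 ms of CPU per request

# With hardware assist: only 40% CPU busy for the same 2000 req/s.
d_hw = service_demand(0.40, 2000)     # 0.2 ms of CPU per request

print(f"software-only: {d_soft*1000:.2f} ms/req, assisted: {d_hw*1000:.2f} ms/req")
```

The same throughput at lower utilization means lower service demand, which is exactly the benefit claimed for the hardware assists.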
  14. Hardware assists [Images: IOMMU (intel.com), APIC (virtualizationdeepdive.wordpress.com)]
  15. How much does hardware assist help? Ganesan et al (2013) ran microbenchmarks on a 2-core Intel Xeon, comparing native vs Xen, with and without hardware assistance (SR-IOV, IOMMU, Intel VT). Finding: network throughput is near-native with SR-IOV but CPU utilization still remains high (possibly because interrupt processing still triggers numerous guest VM-hypervisor transitions?). Max throughput with iPerf: Native - 940 Mbps, Dom0 CPU NA, VM CPU 16.68; SR-IOV - 940 Mbps, Dom0 CPU 20, VM CPU 65 (high); No SR-IOV - 192 Mbps, Dom0 CPU 82, VM CPU 39.
  16. How much does hardware assist help? Further results from Ganesan et al (2013). Disk throughput was tested using RUBiS (disk+network intensive) and BLAST (disk intensive). Finding: disk IO is not yet benefiting from hardware assists. Most of the RUBiS improvement comes from SR-IOV rather than IOMMU; similar finding with BLAST.
  17. Nested IO scheduling The VM and the hypervisor are both running IO scheduling algorithms (and so is the disk drive). IO requests are rearranged and merged by the IO scheduler (the scheduler can be set in Xen Dom0 or the KVM host, but not in ESX). On Linux there are 4 schedulers - CFQ, NOOP, Anticipatory, Deadline - and each block device can have a different scheduler. Image: dannykim.me
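On Linux the per-device scheduler mentioned above is exposed through sysfs, where the active elevator is shown in brackets (e.g. `noop anticipatory [deadline] cfq` in `/sys/block/sda/queue/scheduler`). A minimal sketch of parsing that format; the sample line is an assumption, not output from a real host:

```python
# Parse the Linux sysfs scheduler line; the active elevator is the
# token wrapped in brackets, e.g. "noop [deadline] cfq" -> "deadline".
def parse_active_scheduler(sysfs_line):
    for token in sysfs_line.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None  # no bracketed token found

# Hypothetical content of /sys/block/sda/queue/scheduler:
line = "noop anticipatory [deadline] cfq"
print(parse_active_scheduler(line))  # -> deadline
```

Reading the file for each block device (and writing a scheduler name back to it) is how the per-device choice described on this slide is made in practice.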
  18. Nested IO scheduling Results of Boutcher and Chandra • The best combination of schedulers depends on the workload. • There is a tradeoff between fairness (across VMs) and throughput. • The scheduler closest to the workload has the most impact. • NOOP has the best throughput in the hypervisor but is least fair by Jain's fairness measure. • In the guest VM, CFQ is 17% better than Anticipatory for the FFSB benchmark, but for Postmark, Anticipatory is 18% better than CFQ. • On the hypervisor, NOOP is 60% better than CFQ for FFSB and 72% better for Postmark.
  19. Nested IO scheduling Boutcher's numbers for FFSB on Xen 3.2, 128 threads, each VM allocated a contiguous 120GB space on a 500GB SATA drive. Columns are the scheduler in the VM; rows are the scheduler in the hypervisor. Numbers are approximate transactions/sec on Xen (converted from a bar graph):

      Hyp \ VM       Anticipatory   CFQ   Deadline   NOOP
      Anticipatory            200   260        175    240
      CFQ                     260   240        155    160
      Deadline                315   360        250    255
      NOOP                    320   370        245    255
  20. Sequential IO becomes random • Sequential IO issued from multiple VMs to the same block device becomes random when aggregated in the hypervisor. • Set a longer disk queue length in the hypervisor to enable better aggregation; on VMWare you can set disk.SchedReqNumOutstanding=NNN. • Use PCI flash or SSDs to absorb random writes.
  21. Nested filesystems and page cache A filesystem in the VM can map to a flat file on the underlying filesystem, a raw block device (local or iSCSI), or NFS. A flat file on the host FS is preferred for ease of management; it can be in raw, qcow2, vmdk or vhd format. [Diagram: guest filesystems /dev/sda (VM1) and /dev/sdb (VM2) map to files in /vmfs on the hypervisor's /dev/sdc] • Flat files introduce another performance overhead (next slide). • KVM has four caching modes (none, writeback, writethru, unsafe) which can disable/enable either cache. • In Xen Dom0, the page cache comes into play when the file-storage option is in use.
  22. Nested filesystems and page cache Le et al (2012) ran FileBench and "fio" on KVM, trying 42 different combinations of guest and host file systems, and found a worst-case 67% degradation. Their conclusions: • Read-dominated workloads benefit from readahead. • Avoid journaled filesystems for write-dominated workloads. • Latency goes up 10-30% even in the best case. • The host FS should act like a dumb disk or VirtFS; it should not make placement decisions over what the guest FS has decided. • Jannen (2013) found that overheads are worse for filesystems on SSD; on HDD the overheads are masked by rotational latency.
  23. Nested filesystems and page cache Le et al (2012) - random file write test using "fio". Rows are the host file system; columns are the guest file system. Throughput in MB/sec:

      Host \ Guest   ext2   ext3   reiser   xfs   jfs
      ext2             60     55       65    80    95
      ext3             60     55       65    80    75
      ext4             60     55       55    70    95
      reiser           60     55       65    80   100
      xfs              60     40       60    70    65
      jfs              60     50       65    80   105
  24. Nested IO stacks: use paravirtualization The hypervisor exposes a virtual NIC or storage HBA written in software to the VM. An IO request issued by the VM travels to the bottom of the stack before it is repackaged and reissued by the hypervisor. Paravirtualization traps the IO request and uses shared memory to route it faster to the hypervisor. 1. VMWare: install "VMWare tools" and select the "Paravirtual SCSI controller" for storage and the "vmxnet" driver for networking. VMWare claims PVSCSI offers a 12% throughput improvement with 18% less CPU cost at 8KB block size (blogs.vmware.com). 2. KVM: use the newer in-kernel "vhost-net" for networking and the "virtio-scsi" or "virtio-blk-data-plane" drivers for storage. 3. Xen: the split driver is used for PVM guests, while HVM guests use QEMU or StubDom. HVM can also use PV drivers.
  25. Xen: PVM and HVM difference HVM is 3-6% better than a PV guest for CPU+RAM intensive tests, for 1 VM with 4 vCPUs and a 6GB JVM heap (Yu 2012).
  26. Xen: PVM and HVM difference Here HVM was using the PV driver; it outperforms PV by 39% for a disk-intensive test running on SSD (Yu 2012).
  27. Contents: 1. Hypervisor classification and overview 2. Overheads introduced 3. Benchmarking 4. Analytic models 5. Conclusion
  28. Virtual Machine Benchmarking Two aspects: 1. Performance: compare performance of a consolidated server to a non-virtualized OS running on bare-metal hardware. 2. Isolation: does overloading one VM bring down the performance of other VMs running on the same node? Impact of factors: • Virtualization-sensitive instructions issued by the guest. • VM exits and interrupt delivery. • Architectural choices made within the hypervisor. • Interference between VMs due to shared resources (visible and invisible).
  29. Testing isolation capability of a hypervisor 1. Run the application on the cloud with collocated VMs. 2. Then run it in isolation with no collocated VMs to find the gaps. 3. Then run it in a controlled environment, gradually adding collocated VMs which create CPU, disk or network load, until you can simulate the behaviour seen in the cloud.
  30. CPU isolation on Xen (Barker and Shenoy study, 2010) Find the variation in completion times for a single thread running a floating-point operations test periodically over a few hours. • Amazon EC2 small instance: average completion time was 500 ms, but there was significant jitter; some runs even took an entire second. • Local setup: the same test on a local Xen server showed almost NO variation in completion time. • Conclusion: the CPU scheduler on Xen does not provide perfect isolation. • Further tests were done to narrow down the problem in the CPU scheduler.
  31. Xen's credit scheduler for CPU (Barker and Shenoy study, 2010) Xen has 2 main CPU schedulers - EDF (realtime) and Credit (default). Each VM runs on one or more virtual CPUs (vCPUs); the hypervisor maps vCPUs to physical CPUs (floating or pinned). For each VM, you can define (weight, cap): 1. Weight = proportion of CPU allocated. 2. Cap = max limit or ceiling on CPU time. The credit scheduler periodically issues 30 ms of credit to each vCPU, decremented in 10 ms intervals. When its credits expire, a VM must wait until the next 30 ms cycle. If a VM receives an interrupt, it gets a "Boost" which inserts it at the top of the vCPU queue, provided it has not exhausted its credits. The scheduler also has a work-conserving mode which transfers unused capacity to the VMs that need it.
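The 30 ms / 10 ms credit accounting described above can be sketched as a toy simulation; this is a deliberately simplified model (the real scheduler also handles priorities, Boost and multiple run queues), and the class and method names are mine:

```python
# Toy model of Xen credit-scheduler accounting: each vCPU receives
# 30 ms of credit per cycle, debited in 10 ms ticks; once credits are
# exhausted, the vCPU must wait for the next refill.
class VCpu:
    def __init__(self, name):
        self.name = name
        self.credit_ms = 0

    def refill(self):
        self.credit_ms = 30          # fresh 30 ms allocation per cycle

    def tick(self):
        """Consume one 10 ms tick; return True if the vCPU could run."""
        if self.credit_ms >= 10:
            self.credit_ms -= 10
            return True
        return False                 # credits expired: wait for refill

v = VCpu("vm1-vcpu0")
v.refill()
runs = [v.tick() for _ in range(4)]
print(runs)  # [True, True, True, False] - three 10 ms ticks, then it waits
```

This makes the Barker and Shenoy jitter plausible: a vCPU that exhausts its credits sits idle until the next cycle, regardless of pending work.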
  32. CPU isolation on Xen (Barker and Shenoy study, 2010) On the local setup, tied two VMs to the same physical core and varied the (weight, cap) of the foreground VM while keeping the background VM busy. 1. First test: keep the weights of both VMs equal. Observed jitter as seen in the EC2 test. 2. Second test: vary "weight" while keeping "cap" constant. Weight does not directly correspond to CPU time: a weight ratio of 1:1000 only translates into an actual CPU ratio of 1:1.5 (33% more). 3. Third test: vary "cap" on both VMs. CPU allocation of the foreground VM was in proportion to the "cap" (even when the background VM was idle). Conclusion: strong isolation requires pinning a VM to a core or setting the "cap".
  33. Disk isolation on Xen Test jitter for small random or large streaming IO to simulate game servers and media servers (Barker, 2010). Amazon EC2: found significant variation in completion time for reads and writes; write bandwidth can vary up to 50% from the mean. Read bandwidth variation can be due to caching side-effects. Isolated local setup: completion times are consistent if there is no other VM on the Xen node. Introduce a background VM: run the same tests with another background VM doing heavy IO, using the CFQ IO scheduler in Dom0 and NOOP in the guest VM. Finding: Xen has no native disk isolation mechanism to identify per-VM disk flows. Throughput of the foreground VM dropped by 65-75% and latency increased by 70-80%. The degradation is limited by the round-robin policy of the Xen Dom0 driver.
  34. Network isolation on Xen (Barker 2010) 1. Measure "ping" time to the next hop. 2. Measure the sum of "ping" times to the first three hops. 3. Measure the time to transfer a 32KB block between a local host and an EC2 instance. Pop quiz: what is the point of conducting these three tests?
  35. Network isolation on Xen (Barker 2010) 1. Measure "ping" time to the next hop. 2. Measure the sum of "ping" times to the first three hops. 3. Measure the time to transfer a 32KB block between a local host and an EC2 instance. Purpose: a) the first measurement captures jitter of the network interface; b) the second captures jitter in routers inside the Amazon data center; c) the third captures Internet WAN transfer rate and jitter. Results: 1. No jitter in the first measurement. 2. Significant variation in the second: most pings took 5 ms but a significant number took an order of magnitude longer. 3. The third test showed regular variation (related to peak hours) typical of most WAN applications.
  36. Network isolation on Xen Network latency tests on a game server and a media server on a local Xen cloud (Barker 2010). Found that "tc" defines per-VM flows using IP addresses and provides good isolation. Two ways to allocate bandwidth using the Linux "tc" tool: 1. Dedicated: divide bandwidth between competing VMs and prevent any VM from using more (i.e. weight + cap). 2. Shared: divide bandwidth but allow VMs to draw more if required (i.e. weight + work-conserving). In both the game and media server tests, results are consistent: "dedicated" mode produced lower latency while "shared" mode produced lower jitter. Interference: Dedicated - mean 23.6 ms, std deviation 29.6; Shared - mean 33.9 ms, std deviation 16.9.
  37. Long latency tails on EC2 (Xu et al, Bobtail, 2013) Initial observations: 1. Median RTTs within EC2 are up to 0.6 ms, but the 99.9th percentile RTT on EC2 is 4 times longer than that seen in dedicated data centers. (In other words, a few packets see much longer delays than normal.) 2. Small instances are most susceptible to the problem. 3. Measured RTT between node pairs in the same AZ; the pattern is not symmetric, hence the long tail is not caused by the location of the host on the network. 4. RTT between good and bad nodes in an AZ can differ by an order of magnitude. 5. One AZ which had newer CPU models did not return that many bad nodes.
  38. Long latency tails on EC2 (Xu et al, Bobtail, 2013) Experimental setup: on a 4-core Xen server, dedicate 2 cores to Dom0; the remaining 2 cores are shared between 5 VMs with a 40% share each. Vary the mix of latency-sensitive versus CPU-intensive VMs:

      Latency-sensitive   CPU-intensive   RTT
              5                 0          1 ms
              4                 1          1 ms
              3                 2         <10 ms
              2                 3         ~30 ms
              1                 4         ~30 ms

  The long tail emerges when CPU-intensive VMs exceed the number of cores.
  39. Long latency tails on EC2 (Xu et al, Bobtail, 2013) Hypothesis: do all CPU-intensive VMs cause a problem? Test: vary the CPU usage of the CPU-intensive VM to find out. The long tail occurs when a competing VM does not use all of its CPU allocation: the Boost mechanism for quickly scheduling latency-sensitive VMs fails against such VMs.
  40. Long latency tails on EC2 (Xu et al, Bobtail, 2013) 1. Latency-sensitive VMs cannot respond in a timely manner because they are starved of CPU by other VMs. 2. The VMs which starve them are those that are CPU-intensive but are not using 100% of their allocation within 30 ms. 3. The Boost mechanism in the Xen scheduler runs in FIFO manner and treats these two types of VMs equally instead of prioritizing the latency-sensitive VM. The authors designed "Bobtail" to select the EC2 instance on which to place a latency-sensitive VM (see paper).
  41. EC2 Xen settings Tested for small instances (Varadarajan, 2012): 1. EC2 uses the Xen credit scheduler in non-work-conserving mode, which reduces efficiency but improves isolation. 2. It allows vCPUs to float across cores instead of pinning them to a core. 3. Disk and network scheduling is work-conserving, but network scheduling has a max cap of 300 Mbps.
  42. Know your hypervisor: Xen CPU has two schedulers: Credit(2) and EDF. • The credit scheduler keeps a per-VM (weight, cap) and can be work-conserving or not. Work-conserving means "distribute any idle time to other running processes"; otherwise the total CPU quantum is capped. • I/O intensive VMs benefit from Boost, which bumps a vCPU to the head of the queue when it receives an interrupt, provided it has not exhausted its credits. Device scheduler: • Disk and network IO goes through Domain 0, which schedules it in batches in round-robin fashion. To control network bandwidth, use Linux tools to define per-VM flows. Best practice: increase the CPU weight of Dom0 to be proportional to the amount of IO; dedicate core(s) to it; dedicate memory and prevent ballooning.
  43. Know your hypervisor: KVM • QEMU originally started as a complete machine emulator [Bellard, 2005]. Code emulation is done by TCG (tiny code generator), originally called "dyngen". KVM was later added as another code accelerator into the QEMU framework. • Only one IO thread; the big QEMU driver lock is held in many IO functions. • A Red Hat "fio" benchmark in Sept 2012 reported 1.4M IOPS with 4 guests, but this was using passthrough IO (i.e. bypassing QEMU). • Similar numbers were reported in Mar 2013, this time using an experimental virtio-dataplane feature which uses dedicated per-device threads for IO. • Performance of an RTOS (as a guest OS) in KVM also suffers when it comes in contact with QEMU [Kiszka].
  44. Tile-based benchmarking to test consolidation Traditional benchmarks are designed for individual servers; for virtualization, tiles of virtual machines that mimic actual consolidation are used. 1. SPECvirt sc2013 (supersedes SPECvirt sc2010). 2. vConsolidate (Intel): a tile consisting of SPECjbb, Sysbench, Webbench and a mail server. 3. VMmark (VMWare): Exchange mail server, standby system, Apache server, database server. SPECvirt sc2013: • Runs for 3 hours on a single node to stress CPU, RAM, disk, and network. • Incorporates four workloads: a web server, 4 Java app servers connected to a backend database server (to test multiple vCPUs on SMP), a mail server and a batch server. • Keep adding additional sets of virtual machines (tiles) until overall throughput reaches a peak; all VMs must continue to meet the required QoS (spec.org/virt_sc2013).
  45. SPECvirt sc2013
  46. Contents: 1. Hypervisor classification and overview 2. Overheads introduced 3. Benchmarking 4. Analytic models 5. Conclusion
  47. Value of analytic model Benchmarks have to: produce repeatable results; be easily comparable across architectures and platforms; have predictive power (extrapolation). There is a tension between realism and reproducibility: macrobenchmarks simulate real-world conditions but are not comparable and lack extrapolation, while microbenchmarks determine the cost of primitive operations. An analytic model is needed to tie benchmarks to prospective application use. Seltzer proposed three approaches: 1. Vector-based: combine a system vector with an application vector. 2. Trace-based: generate the workload from a trace to capture the dynamic sequence of requests. 3. Hybrid: a combination of both. (Mogul 1999; Seltzer et al 1999)
  48. Analytic models for virtualization 1. Layered queuing network (Menasce; Benevenuto 2006). 2. Factor graphs to determine per-VM utilization (Lu 2011). 3. Trace-based approach (Wood et al). 4. VMBench (Moller @ Karlsruhe). 5. Equations for cache and core interference (Apparao et al). 6. Machine learning.
  49. Layered Queueing network (for Xen) [Diagram: requests flow IN through the VM, then Domain 0, then the Disk, and OUT, all within Xen]
  50. Layered Queueing network (for Xen) Total response time R = R(VM) + R(dom0) + R(disk). For M/M/1 with feedback, the response time of one resource is R = D / (1 - U), where U = utilization = λ * D (arrival rate * service demand), 0 <= U < 1, and D = service demand = total time taken by one request. D(resource by VM) = D(bare) * Slowdown(resource) / P(VM); D(resource by Dom0) = D(VM) * Cost(Dom0) / P(IDD), where P = speedup of the VM's hardware compared to bare-metal, Cost(Dom0) = BusyTime(Dom0) / BusyTime(VM), and Slowdown(resource) = BusyTime(virtual) / BusyTime(bare).
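Plugging numbers into the M/M/1 formulas above makes the model concrete; the arrival rate and service demands below are hypothetical, for illustration only:

```python
# M/M/1-style response time per resource: R = D / (1 - U), with U = lambda * D.
# All numbers below are hypothetical, for illustration only.
def utilization(arrival_rate, demand):
    return arrival_rate * demand

def response_time(demand, util):
    assert 0 <= util < 1, "model is valid only for U < 1"
    return demand / (1 - util)

lam = 50.0                                    # requests/sec
d_vm, d_dom0, d_disk = 0.004, 0.002, 0.008    # service demands (seconds)

# Total response time R = R(VM) + R(dom0) + R(disk)
r_total = sum(response_time(d, utilization(lam, d))
              for d in (d_vm, d_dom0, d_disk))
print(f"total response time: {r_total*1000:.2f} ms")
```

Note how the disk term dominates: at U = 0.4 its response time is already two thirds more than its raw demand, which is the queueing effect the model captures.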
  51. Factor graphs Resource utilization at the guest VMs is known and aggregate utilization at the hypervisor is known. How do we determine the function which defines per-VM utilization of each resource? This can be modeled as a "source separation problem" studied in digital signal processing. Measurements inside a VM and on the hypervisor can differ: 1. Disk IO inside a VM can be higher than on the hypervisor due to merging of IOs in the hypervisor. 2. CPU utilization inside a VM can be half of that at the hypervisor, because Xen issues per-VM IO through Dom0 (seen via "xentop").
  52. Factor graphs [Diagram: factor graph linking host-level CPU/Disk/Net/Mem observations through factors h1-h4, f1-f4 and g1-g4 to the per-resource utilizations of VM1 and VM2]
  53. Trace-based approach How to model the migration from bare-metal to a virtual environment? 1. Create platform profiles to measure the cost of primitive operations: run the same microbenchmarks on the native (bare-metal) and virtualized platforms. 2. Relate native and virtualized: formulate a set of equations which relate native metrics to virtualized ones. 3. Capture a trace of the application which is to be migrated: determine how many primitive operations it uses and plug that in. (The actual process employs statistical methods and is more complicated.)
  54. Trace-based approach How to model the migration from bare-metal to a virtual environment? Step 1: create platform profiles. Run carefully chosen CPU-, disk- and network-intensive microbenchmarks on both the bare-metal and virtual environments. Measure key metrics for each benchmark: a) CPU - percentage of time spent in user, kernel and iowait; b) Network - read and write packets/sec and bytes/sec; c) Disk - read and write blocks/sec and bytes/sec. Example (CPU user / CPU sys / CPU iowait): BareMetal 23 / 13 / 3; Virtual 32 / 20 / 8.
  55. Trace-based approach How to model the migration from bare-metal to a virtual environment? Step 2: relate native and virtualized. Formulate a set of equations which relate native metrics to virtualized ones, e.g. Util(cpu on VM) = c0 + c1*M1 + c2*M2 + ... + cn*Mn, where Mk = metric gathered from the native microbenchmark. Solve for the model coefficients using least-squares regression. The coefficients c_k capture the relation between the native and virtualized platforms, e.g. c0=4, c1=19, c2=23, etc.
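Step 2 can be sketched with NumPy's least-squares solver. The data below is synthetic (generated from coefficients c0=2, c1=1.5, c2=1.0 purely for illustration, not the paper's values), so the regression recovers them exactly:

```python
import numpy as np

# Fit Util(cpu on VM) = c0 + c1*M1 + c2*M2 from paired observations.
# Synthetic data: each row is [1, M1, M2] for one microbenchmark run;
# the leading column of ones yields the intercept c0.
M = np.array([[1.0, 10.0,  5.0],
              [1.0, 20.0,  9.0],
              [1.0, 30.0, 14.0],
              [1.0, 40.0, 18.0]])
util_vm = np.array([22.0, 41.0, 61.0, 80.0])  # observed VM CPU util (%)

# Least-squares solution of the (possibly overdetermined) system.
coeffs, *_ = np.linalg.lstsq(M, util_vm, rcond=None)
print("coefficients:", np.round(coeffs, 2))   # -> [2.0, 1.5, 1.0]
```

With real profiles there are many more runs than coefficients, so the system is overdetermined and `lstsq` returns the best fit rather than an exact solution.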
  56. Trace-based approach How to model the migration from bare-metal to a virtual environment? Step 3: capture a trace of the application which is to be migrated, find the new metrics Mk to plug into the above equation, then solve it. Voila! Util(cpu on VM) = 4 + 19*(M1) + 23*(M2) + ... Recap: 1. Create platform profiles for native and virtual. 2. Find the coefficients which relate native and virtual. 3. Capture the application trace and apply the equation. Their findings: 1. A single model is not applicable to both Intel and AMD, since CPU utilization varies. 2. A feedback loop within the application can distort the performance prediction.
  57. Conclusion "All problems in CS can be solved by another level of indirection." - David Wheeler (1927-2004, first PhD in Computer Science) ... and the performance problems introduced by indirection require caching, inter-layer cooperation and hardware assists (e.g. the TLB cache, EPT, paravirtualization). "Virtual machines have finally arrived. Dismissed for a number of years as merely academic curiosities, they are now seen as cost-effective techniques for organizing computer systems resources to provide extraordinary system flexibility and support for certain unique applications." [Goldberg, Survey of Virtual Machine Research, 1974]
  58. References 1. Ganesan et al. Empirical study of performance benefits of hardware assisted virtualization, 2013. 2. Boutcher and Chandra. Does virtualization make disk scheduling passé? 3. Le et al. Understanding Performance Implications of Nested File Systems in a Virtualized Environment. 4. Jannen. Virtualize storage, not disks. 5. Yu. Xen PV Performance Status and Optimization Opportunities. 6. Barker and Shenoy. Empirical evaluation of latency-sensitive application performance in the cloud. 7. Xu et al. Bobtail: Avoiding long tails in the cloud. 8. Varadarajan et al. Resource-freeing attacks. 9. Bellard. QEMU, a fast and portable dynamic translator. 10. Kiszka. Using KVM as a realtime hypervisor. 11. Mogul. Brittle metrics in operating system research. 12. Seltzer et al. The Case for Application-Specific Benchmarking.
  59. References 1. Menasce. Virtualization: Concepts, Applications, and Performance Modeling. 2. Benevenuto et al. Performance Models for Virtualized Applications. 3. Lu et al. Untangling Mixed Information to Calibrate Resource Utilization in Virtual Machines, 2011. 4. Wood et al. Profiling and Modeling Resource Usage of Virtualized Applications.
  60. BACKUP SLIDES
  61. Classification • OS-level virtualization: does not run any intermediary hypervisor; the OS is modified to support namespaces for networking, processes and the filesystem. • Paravirtualization: the guest OS is modified and is aware that it is running inside a hypervisor. • Full virtualization: the guest OS runs unmodified; the hypervisor emulates hardware devices.
  62. NUMA/SMP • If you run a monster server VM with many vCPUs, you may have to worry about NUMA scaling. Depending on the NUMA ratio, accessing remote memory carries a 30-40% higher cost (latency and throughput). • The hypervisor must be able to: 1. manually pin a vCPU to a core; 2. export the NUMA topology to the guest OS; 3. do automatic NUMA-aware scheduling of all guest VMs. • VMWare introduced vNUMA in vSphere 5. • On Xen, pin Dom0 to a core; in case of NUMA, put the frontend and backend drivers on the same core. • KVM exports the NUMA topology to the VM but still lags on automatic scheduling.
  63. NUMA/SMP • Depending on the NUMA ratio, accessing remote memory carries a 30-40% higher cost (latency and throughput). • The hypervisor must support pinning a vCPU to a core, and also allocating memory from a specific NUMA node. • The hypervisor must export the NUMA topology (ACPI tables) so the guest OS can do its job. • The hypervisor should do automatic NUMA-aware scheduling of all guest VMs. • VMWare introduced vNUMA in vSphere 5. • On Xen, pin Dom0 to a core; in case of NUMA, put the frontend and backend drivers on the same core. • KVM exports the NUMA topology to the VM but still lags on automatic scheduling. • Cross-call overhead: on an SMP machine, when a semaphore is released by one thread, it issues a cross-call (inter-processor interrupt) if the waiting threads are sleeping on another core. On a VM, the cross-call becomes a costly privileged op (Akther). Interrupt delivery may also trigger a cross-call.
  64. Nested CPU scheduling • Each guest OS runs on one or more virtual CPUs. The hypervisor schedules virtual CPUs on its run queue, and then each guest OS decides which task to run on that virtual CPU. • This introduces the lock preemption problem: a process in the guest OS may get scheduled out by the hypervisor while holding a spin lock, delaying other processes waiting for that spin lock. • The guest OS would never schedule out a process holding a spin lock, but the hypervisor is unaware of processes within the guest OS. • The solution is some form of co-scheduling or "gang scheduling". VMWare actively seeks to reduce "skew" between multiple vCPUs of the same VM.
  65. Nested Page tables • A page fault in the VM may occur because the hypervisor has not allocated RAM to the VM. • Guest page table: guest virtual address -> guest physical address. • Hypervisor page table: guest physical address -> actual RAM. • Earlier, hypervisors would maintain a "shadow page table" for each guest OS. This function has now been moved into hardware by both Intel and AMD; it's called "nested page tables". • Nested page tables require a costly two-dimensional page walk: each step that is resolved in the guest table requires a lookup in the host table. • The overhead can be alleviated by using "huge pages" and per-VM tags in the TLB cache.
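The cost of the two-dimensional walk can be quantified with the standard worst-case count: each of the n guest page-table steps, plus the final guest-physical address, must itself be translated through the m-level host table, giving (n+1)(m+1)-1 memory references:

```python
# Worst-case memory references for a 2-D (nested) page walk with
# n-level guest tables and m-level host tables:
# (n + 1) * (m + 1) - 1  =  n*m + n + m
def nested_walk_refs(guest_levels, host_levels):
    return (guest_levels + 1) * (host_levels + 1) - 1

print(nested_walk_refs(4, 4))  # 24 references, vs 4 for a native 4-level walk
print(nested_walk_refs(3, 3))  # 15 - huge pages shorten both dimensions
```

This is why huge pages and TLB tagging matter: they cut walk levels and let translations survive VM switches, avoiding the 24-reference worst case.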
  66. Memory overheads & solutions Balloon driver: take back memory from the guest. • VMWare • KVM (see virtio_balloon.c in linux_kernel/drivers/virtio) • HyperV calls it Dynamic Memory • Xen: Transcendent Memory. Memory deduplication: • Present in System/370 (Smith & Nair) • VMWare calls it Transparent Page Sharing (patented) • KVM uses KSM (which calls Linux madvise()) • Xen uses KSM in HVM mode only.
  67. Quantifying isolation • Deshane et al (2007) defined BenchVM to test isolation. • Run normal VMs alongside an overloaded VM and test whether the normal VMs remain responsive. • On the overloaded VM, run various stress tests: 1. CPU stress test. 2. Memory stress test: calloc and touch memory without free(). 3. Network: threaded UDP send and receive. 4. Disk: IOzone. 5. Fork bomb: test fast process creation and scheduling. Their conclusion: full virtualization provides better isolation than container-based virtualization. Their other results may be outdated due to advances in virtualization.
  68. VM exits are costly • Interrupt processing causes context switches between the VM and the hypervisor. • KVM EOI optimization: the guest IDT (interrupt descriptor table) is shadowed. • VMWare detects clusters of instructions that can cause guest exits. • Use a combination of polling and interrupts (NAPI).
  69. mClock • Disk capacity varies dynamically and cannot be statically allocated like CPU or RAM. • A proportional-sharing algorithm is needed to reserve disk capacity. • Gulati et al propose a dynamic algorithm which interleaves two schedulers and uses three tags with every IO request.
  70. Hadoop benchmark • VMWare: • HyperV (conducted on HDInsight - Microsoft's version of Hadoop): • KVM: • Xen: (Netflix runs map-reduce on AWS)
  71. HPC/Scientific benchmark • VMWare paper: SPEC MPI and SPEC OMP. • Xen: Jackson et al (2010) ran NERSC workloads on Amazon EC2: six times slower than a Linux cluster and 20 times slower than a modern HPC system; the EC2 interconnect severely limits performance. They could not use processor-specific compiler options because of the heterogeneous mix of CPUs across nodes. • In Jun 2010, Amazon launched "Cluster Compute Nodes", which are basically nodes running Xen in HVM mode connected via 10G Ethernet (no Infiniband yet). • KVM and OpenVZ: Regola (2010) ran NPB on these nodes.
  72. Realtime benchmark • In order to minimize jitter and limit the worst-case latency, a realtime system must provide mechanisms for resource reservation, process preemption and prevention of priority inversion. • Soft realtime (VoIP) vs hard realtime: soft means 20 ms jitter between packets is acceptable. • RT-XEN. • Kiszka: KVM and the QEMU driver lock.
  73. Layered Queueing network (Xen) Total response time R = R(vcpu) + R(dom0_cpu) + R(disk). Response time = Demand / [1 - Utilization]: R(vcpu) = D(vcpu) / [1 - U(vcpu)]; R(dom0_cpu) = D(dom0_cpu) / [1 - U(dom0_cpu)]; R(disk) = D(disk) / [1 - U(disk)]. Util = λ * D = arrival rate * demand. D(vm_cpu) = D(isol_cpu) * S(cpu)/P(vm), where S = slowdown and P = speedup. D(dom0_cpu) = D(vm_cpu) * Cost(dom0_vm)/P(dom0). Cost(dom0_vm) = B(dom0_cpu)/B(vm_cpu), where B = busy time. Slowdown(cpu) = B(vm_cpu)/B(bare_cpu); Slowdown(disk) = B(vm_disk)/B(bare_disk).