This document discusses power improvements for the Xen hypervisor. It begins with background on the large power consumption gap between native operating systems and virtualized environments using Xen. Several fixes are described to close this gap for both client and server workloads. For clients, optimizations reduced the idle power gap from 40% to 5% by improving LCD brightness controls, I/O power management, graphics power management, and other areas. For servers, proposed optimizations focus on timer alignment, power-aware scheduling, and reducing periodic tasks to increase idle time and power savings. Overall, the document outlines ongoing work to optimize Xen's power efficiency.
4. Room to save POWER
• Ideal/standard: native OS power consumption
• Reality: hypervisor power consumption
• Large delta (~40% for client at the start)
5. Client architecture
[Diagram: client Xen configuration — a Linux Dom0 VM and Win7 DomU VMs running on the Xen hypervisor, which runs directly on the hardware.]
6. Goal
• Native OS power efficiency
• Close the power gap with native Win7
[Diagram: iterative cycle — identify gap → root-cause → fix code → code drop.]
7. Current results
• ~40% idle power gap 2 years ago
• ~5% idle power gap now
[Chart: idle power gap over time, dropping from ~40% at project start to ~5% at project end.]
• More savings?
• Increasingly harder to extract
8. LCD brightness control
LCD display
− ~20% of idle power
− Broken brightness controls (Win7 guest → Dom0)
Fix:
− Added emulation of the ACPI video extension
− Specifically, the brightness control methods _BCL, _BCM, and _BQC
− Added to the VM guest ACPI BIOS
− Pass the control-knob output through to Dom0, which takes the platform action
− Verify that Dom0 LCD brightness control actually works
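The flow above can be sketched as a small model: the guest's emulated _BCL reports supported levels, _BCM clamps a request to the nearest level and forwards it to Dom0, and _BQC reports the current level. This is an illustrative sketch, not the actual Xen emulation code; all class and callback names are made up.

```python
# Illustrative model of emulated ACPI brightness methods (_BCL/_BCM/_BQC).
# Names here are hypothetical; only the three ACPI method names come from
# the ACPI video extension.

class BrightnessEmu:
    def __init__(self, levels, forward_to_dom0):
        # _BCL: the brightness levels the emulated firmware reports.
        self.levels = sorted(levels)
        self.current = self.levels[-1]
        self.forward_to_dom0 = forward_to_dom0  # callback into Dom0

    def bcl(self):
        """_BCL: report the supported brightness levels."""
        return list(self.levels)

    def bcm(self, requested):
        """_BCM: set brightness, clamped to the nearest supported level,
        then ask Dom0 to take the real platform action."""
        self.current = min(self.levels, key=lambda l: abs(l - requested))
        self.forward_to_dom0(self.current)
        return self.current

    def bqc(self):
        """_BQC: query the current brightness level."""
        return self.current
```

A guest request for 60% would be clamped to the nearest supported level before Dom0 is asked to act on it.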
9. Runtime IO power management
Dysfunctional I/O power management
• ~15% of idle power
• First available in the 2.6.32 kernel, but:
− not functioning correctly
Fix:
• Enable energy-saving states at run time and auto-suspend devices when idle
• Gap dropped from ~25% to 6.8% after the fix
− Measured on an HP 8440p mobile platform based on a Nehalem processor
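On Linux, runtime PM is enabled per device by writing "auto" to the device's `power/control` sysfs attribute. A minimal sketch of doing this for every PCI device follows; the sysfs root is a parameter so the function can be exercised against a scratch directory instead of a live system.

```python
# Sketch: enable Linux runtime PM (autosuspend when idle) for all PCI
# devices by writing "auto" to each device's power/control attribute.
import os

def enable_runtime_pm(sysfs_root="/sys"):
    devdir = os.path.join(sysfs_root, "bus", "pci", "devices")
    enabled = []
    for dev in sorted(os.listdir(devdir)):
        ctl = os.path.join(devdir, dev, "power", "control")
        if os.path.exists(ctl):
            with open(ctl, "w") as f:
                f.write("auto")   # allow the device to autosuspend when idle
            enabled.append(dev)
    return enabled
```

Writing "on" instead of "auto" would forbid autosuspend again, which is the pre-fix behavior the slide describes.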
10. ATA link power
Static ATA link power setting
− ~6% of idle power in max_performance
− But performance suffers with min_power
− Even worse: all SCSI hosts stay active, with or without attached devices
Fix:
− Runtime updates to the ATA link power setting
− Toggle min_power / max_performance as needed
− Disable clocks on deviceless ports
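The runtime policy above can be sketched as two small decisions: pick the link power policy from the observed load, and list the ports whose clocks can be gated because no device is attached. The two policy strings are the real values accepted by the libata `link_power_management_policy` sysfs attribute; the threshold and function names are illustrative.

```python
# Sketch of the runtime ATA link power policy. Only the two policy
# strings are real sysfs values; everything else is illustrative.

MIN_POWER = "min_power"
MAX_PERFORMANCE = "max_performance"

def pick_link_policy(io_requests_per_sec, idle_threshold=10):
    """Idle system -> min_power; busy system -> max_performance."""
    if io_requests_per_sec < idle_threshold:
        return MIN_POWER
    return MAX_PERFORMANCE

def ports_to_disable(ports):
    """Return the ports whose clocks can be gated: those with no device.

    `ports` maps a port name to whether a device is attached."""
    return [name for name, has_device in ports.items() if not has_device]
```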
11. Network power
Wired and Wi-Fi
− ~16% of idle power (650 mW)
− Many interrupts break deep C-states during idle (Win7 guest → Dom0)
Fix:
− Enable Wi-Fi and E1000 power-saving modes in Dom0
− Add a Win7 power-management PV driver to pass control settings to Dom0
12. GFX power management
iGFX power management inactive
− ~16% of idle power (650 mW)
− VT-d requires a device reset
− Reset clears all registers, including the BIOS-enabled power-management registers
− Disables: RC6 (render standby), turbo, and GPMT (Graphics Power Modulation Technology)
[Diagram: VT-d operation timeline — PM is on after BIOS setup, the reset turns PM off, and a save/restore around the reset brings PM back on.]
Fix:
− Save/restore the PM registers around FLR (function-level reset)
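The save/restore fix is simple to model: snapshot the BIOS-programmed PM registers, perform the reset (which clears everything), then write the snapshot back. This is a toy model, not the driver code; the register names in the usage below are made up.

```python
# Toy model of save/restore around a function-level reset (FLR).

def flr_with_pm_save_restore(regs, pm_reg_names, do_flr):
    """regs: mutable register-name -> value map.
    pm_reg_names: the PM registers to preserve across the reset.
    do_flr: the reset operation, which clears the device state."""
    saved = {name: regs[name] for name in pm_reg_names}  # save before FLR
    do_flr(regs)                                         # reset clears all regs
    regs.update(saved)                                   # restore PM state
    return regs
```

After the reset, the PM registers (and hence RC6, turbo, and GPMT) come back, while the rest of the device state is the expected post-reset state.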
13. Client summary
• Started with a ~40% gap
• Ended with ~5% gap
• Greatly improved and got close to the goal
14. Server power savings --
increasing idle time
• Timer alignment
• Power aware scheduling
• Reducing periodic tasks
16. Timer alignment
• Proposal
• A configurable timer consolidation window, e.g. 50 ns
• Compute each timer's interrupt moment
• Shift the timer-handling moment to the next consolidation boundary
• Benefit
• Fewer interrupts → longer idle time → power savings
• Challenges
• Guest scheduling impact → performance impact
• Cross-CPU timer synchronization
• IPI frequency and synchronization
17. Timer alignment
[Diagram: CPU0 and CPU1 each alternate between idle and busy on their own timer interrupts; once CPU1's interrupt is shifted to match CPU0's, both CPUs idle at the same time and the socket gains C-state residency.]
• Shifting CPU1’s interrupt to match CPU0’s yields a nice gain in socket C-state residency
• Repeated over and over, the savings add up
18. Power aware scheduling
• ACPI modes
− Performance: power-hungry mode
− Energy: power-saving mode
− Balanced
• Mapping modes to scheduling
− Performance:
− Schedule vCPUs one per physical core before pairing
− Energy:
− Schedule vCPUs one per logical core
− power down more cores
− power down more sockets
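The two placement policies can be sketched against a simple topology model (a list of cores, each a list of sibling logical-CPU ids): "performance" spreads vCPUs one per physical core before using hyperthread siblings, while "energy" packs siblings first so whole cores and packages stay idle. This is an illustrative sketch, not the Xen scheduler.

```python
# Sketch of the two vCPU placement policies. Topology is modeled as a
# list of cores, each core a list of its logical CPU ids (HT siblings).

def place_vcpus(num_vcpus, cores, mode):
    if mode == "performance":
        # Breadth-first: one vCPU per core before doubling up on siblings.
        order = [t for siblings in zip(*cores) for t in siblings]
    else:  # "energy"
        # Depth-first: fill each core's threads before touching the next,
        # so unused cores (and packages) can power down.
        order = [t for core in cores for t in core]
    return order[:num_vcpus]
```

With 4 two-way HT cores and 3 vCPUs, "performance" lands them on three different cores, while "energy" keeps them on just two cores of one package.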
19. Power saving scheduler
[Diagram: two packages, each with two hyper-threaded cores (CPUs 0–7). The power-aware scheduler packs the running vCPUs (vcpu0–vcpu2) onto one package, leaving the other package's CPUs idle in a deep C-state.]
20. Reduce periodic activity
• Power-unfriendly RTC emulation:
− The VMM updates the RTC clock twice per second
− Solution
− Update the RTC clock only on read
("If a clock ticks where no one can see it, does the time change?")
• Frequent wake-ups to check buffered I/O:
− Wake up multiple times a second (polling model)
− Solution (push model)
− Use an event channel to notify of buffered-I/O status changes
− No more polling
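The read-driven RTC can be modeled in a few lines: instead of a twice-a-second update timer, keep a base (RTC value, host timestamp) pair and derive the current value only when the guest actually reads the register. A minimal sketch with hypothetical names:

```python
# Toy model of lazy RTC emulation: no periodic update timer; the value
# is computed from elapsed host time on each guest read.

class LazyRTC:
    def __init__(self, base_seconds, now_fn):
        self.base = base_seconds
        self.base_ts = now_fn()   # host time when the base was captured
        self.now_fn = now_fn

    def read(self):
        """Guest RTC read: derive the value from elapsed host time."""
        elapsed = int(self.now_fn() - self.base_ts)
        return self.base + elapsed
```

An idle guest that never reads the RTC therefore costs no periodic wakeups at all, which is exactly the point of the fix.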
LCD brightness is controlled by ACPI. When the laptop hotkey is pressed to decrease the brightness, it triggers an ACPI event, and the event handler calls the control methods to take the corresponding action. The guest's ACPI table lacked these control methods, so we added them to the guest ACPI BIOS; when the guest invokes them, the request is forwarded to Dom0, which does the actual work.
Basically, SCSI hosts with no attached devices should be turned off to save power. Previous clients did not do this and wasted power; our solution brings down every SCSI host with no attached device. Also, in the previous client the ATA link power policy could only be set statically, to either min_power or max_performance. We added a dynamic solution: check the system load at run time, and set min_power when idle and max_performance otherwise.
FLR means function-level reset: it resets the whole device to its initial (power-on) state. The issue is that FLR is required when passing the GFX device through to a guest, and it clears all the power-management registers set up by the BIOS. The solution is to save the PM registers before FLR and restore them afterward. RC6 (render standby) is a GPU technology that allows the GPU to enter a very-low-power state when idle, analogous to CPU C-states. Turbo is Intel Turbo Boost; see en.wikipedia.org/wiki/Intel_Turbo_Boost for details. GPMT (Graphics Power Modulation Technology) is a method for saving power in the graphics adapter while it continues to display and process data, dynamically switching the render frequency and/or render voltage between the higher- and lower-power states supported on the platform, based on render-engine workload.
A process is an OS-schedulable task/entity.
Actually, we do not know the proper value for the consolidation window; the 50 ns is just a guess. Different guests and different workloads have different requirements, so it is hard to fix a single value, and finding one would take many experiments we did not have time for. We also did not have time to implement timer alignment in Xen; we implemented it in KVM, but the idea is the same for both.
1. Not the VM: the RTC is emulated by the hypervisor, and it is the emulation logic in the hypervisor that was wrong, not the usage inside the VM. 2. An event channel is a mechanism for notifying events between the hypervisor and VMs. Previously, the device model polled the buffered I/O several times a second, and mostly no new data had arrived. Now, when the hypervisor writes data to the buffered-I/O page, it issues an event to notify the device model that new data is arriving, and the device model wakes up to fetch it. This eliminates the needless wakeups of the device model just to check the buffered I/O.