1. The On-going Evolutions of Power Management in Xen
Kevin Tian
Open Source Technology Center (OTC)
Intel Cooperation
2. Agenda
• Brief history
• Evolutions of Idle power management
• Evolutions of run-time power management
• Tools
• Experimental data about Xen power efficiency
• Power impact from VM
Intel Confidential
2
3. Brief History
Enhanced green computing
Improved stability
Jan. 2008 Jun. 2008
Better usability
Jul. 2007 Xen 3.2 Mature deep C-
…
released states support
Host S3
Xen 3.1 Sep. 2007 May. 2007 Aug. 2008
Dom0 Preliminary Xen 3.3
controlled cpufreq and released
freq/vol cpuidle support
scaling in Xen
Intel Confidential
3
5. Xen Summit Boston 2008
Dom0
ACPI Parser
External
Control
Schedulers
Registration Enter Idle
Hypercall Ladder Halt (C1)
CpuIdle driver Dynamic
Timekeeping
Tick
C1 C2
Mwait/IO
Xen
Intel Confidential
5
6. Enhanced C-states support
Dom0 noPM x ps p1 hv m
noC3 x ps p1 hv m
120%
C3 x ps p1 hv m
100% 100%
100%
ACPI Parser 80% 102.60%
Percentage
External 91.40%
60%
107.70%
Control
82.30%
40%
20%
Schedulers
0%
id le (W att) SPEC p o w e r
Registration Enter Idle (Sco r e )
Hypercall Ladder Halt (C1)
‘noPM’ has both cpuidle and cpufreq
CpuIdle driver disabled, and vice versa for other two
Dynamic
Timekeeping
Tick
cases. Compared to ‘C3’, noC3 has
C1 C2
Mwait/IO Deep C-states
maximum C-state limited to C2
Timer
C1E Xentrace
For idle watt, lower value means
greener. For SPECpower score, higher
value indicates more power efficient
Xen
Intel Confidential
6
7. TSC freeze
TSC freeze
Ideal TSC
Unsynchronized TSC Time went backwards warning
Dom0
CPU0
Actual TSC
Lots of lost ticks
Fluctuating TSC scale
factor
ACPI Parser
CPU1 Faster ToD
Xen system time skew
External
Control 0 1 sec 1 sec Scale error
…
Platform counter
…
TSC Restore TSC upon elapsed platform
counter since current entry
Schedulers
Percpu platform to TSC scale
…
TSC
Registration Enter Idle Restore TSC upon elapsed platform
Software compensation counter since last calibration
according to elapsed Percpu platform to TSC scale
…
counter of platform timer TSCLadder
Hypercall Halt (C1)
… Restore TSC upon elapsed platform
counter since power on
CpuIdle driver Global platform to TSC scale Dynamic
Timekeeping
… Tick
TSC
C1 C2
Mwait/IO Deep C-states
Timer
C1E TSC
Xentrace
save/restore
Hardware enhancement to have TSC
Always never stopped (e.g. by Intel Core-i7)
running TSC
Platform counter
TSC
Xen
Intel Confidential
7
8. APIC timer freeze
Dom0 Timer heap
Nearest deadline Reprogram local APIC
T count-down timer
T T
APIC timer
ACPI Parser T T T T freeze interrupt when
… count down to
zero
External Scan/execute
Control expired timers (Delayed)
Timer softirq handler Local APIC ISR
Schedulers
Registration Enter Idle
Solution
use platform timer Halt (C1)
(PIT/HPET) to carry nearest
Hypercall Ladder
deadline, when APIC timer is halted in deep C-
states:
CpuIdle driver Dynamic
Timekeeping
Tick
C1 C2
Mwait/IO Deep C-states is required since number of platform
Broadcast
Timer
timer source is less than number of CPUs
C1E PIT/HPET TSC
Xentrace
broadcast save/restore
will come soon to have more
MSI based HPET interrupt
platform timer sources with reduced broadcast traffic
Always
running TSC
will be supported in new CPU
Always running APIC timer
soon
Xen
Intel Confidential
8
9. Menu governor
Ladder
Dom0
Expected minimal residency at Cn
Expected minimal residency at Cn+1
C0
ACPI Parser
Cn
External
Control Cn+1
Inefficient
Promotion if continuous N Cn
Schedulers
residencies > expectation Demotion if current Cn+1
residency < expectation
Registration Enter Idle
less idle watt consumed (-5.2%)
Higher SPECpower score (+1.6%)
Hypercall Ladder
Menu (HVM WinWPsp1)
CpuIdle driver
Menu Halt (C1)
C1 C2
Mwait/IO Deep C-states Nearest timer deadline
C1, 1ns, 20w
Dynamic
PICK
Timekeeping
Last unexpected break event
Tick
C1E C2, 10ns, 15w
PIT/HPET TSC
broadcast save/restore
Timer
C3, 100ns, 5w Latency/power requirement
…
Always
Xentrace
running TSC
To be further tuned!
Xen
Intel Confidential
9
10. Range timer
Dom0
0(ms) 1 2 3 4 5 6
For each timer, it accepts a
range for expiration now:
C0
ACPI Parser
[expiration, expiration +
Cn
Frequent C-state
External
timer_slop]
Control Entry/exit may (default 50us for timer_slop)
Instead consume
0(ms) 1 2 3 4 5 6
More power Overlapping ranges can be
merged to reduce timer
Schedulers
interrupt count
C0
Registration Enter Idle
Cn
Hypercall Ladder
Menu Halt (C1)
CpuIdle driver Dynamic
Timekeeping
Tick
C1 C2
Mwait/IO Deep C-states Range
Timer
timer
C1E PIT/HPET TSC
Xentrace
broadcast save/restore
Always
running TSC
Xen
Intel Confidential
10
11. Range timer effect
One UP HVM RHEL5u1 Multiple idle UP HVM RHEL5u1
10000
range=50us
9000
range=1ms
timer interrupt/second
8000
7000
6000
-7.5% +1.2%
5000
4000
3000
2000
1000
Idle(watt) SPECpower(score)
0
100% 100%
50us
1HVM 2HVM 4HVM
92.50% 101.20%
1ms
Collected on a two-cores mobile platform
Intel Confidential
11
12. Current picture
Dom0
Xenpm
ACPI Parser
External
Control
Power
Schedulers
Aware
Registration Enter Idle
Hypercall Statistics Ladder
Menu Halt (C1)
CpuIdle driver Dynamic
Timekeeping
Tick
C1 C2
Mwait/IO Deep C-states Range
Timer
timer
C1E PIT/HPET TSC
Xentrace
broadcast save/restore
Always
running TSC
Xen
Intel Confidential
12
14. Xen Summit Boston 2008
Dom0
Linux
PM
Tools
On User
Others
Demand Space
Cpufreq core
Power ACPI
Others ACPI Parser
Now-K8 cpufreq
Linux Cpufreq External
Control
Query
Idle
State
Registration
Schedulers
On
Demand
Tiny cpufreq core
Enable
MSR ACPI
Cpufreq
Access
(IA32)
Perm
Xen
Intel Confidential
14
15. Current picture
Dom0
Linux
PM Xenpm
?
Tools
On User
Others
Demand Space
Cpufreq core
Power ACPI
Others ACPI Parser
Now-K8 cpufreq
Linux Cpufreq External
Control
Query
Idle
More
State
governors
Registration
Schedulers Turbo mode
On User Perfor Power
Enhanced
Demand Space mance Save
Control
User
Tiny cpufreq core Interface
Statistics
Enable
Cpu offline
MSR ACPI ACPI Power
/ online
Cpufreq Cpufreq Now
Access
(IA32) (IA64) K8
Perm
More
Xen drivers
Intel Confidential
15
17. Dom0
Retrieve run-time
Xenpm
statistics about Xen
cpuidle and cpufreq
Apply user policy on
exposed control
knobs of Xen cpufreq
(governor, set freq,
etc)
More capabilities to
be added later, e.g.
profile…
Log every state change for Xen cpuidle and cpufreq:
Xentrace
CPU0 391365842416 (+ 21204) cpu_idle_entry [ idle to state 2 ]
CPU0 391375951050 (+10108634) cpu_idle_exit [ return from state 2 ]
Raw data could be further processed by other scripts
Xen
Intel Confidential
17
19. • All data shown in this section:
– For reference only and not guaranteed
– On a two-core mobile platform, with one HVM guest created
• Server consolidation effect with multiple VMs/workloads are in progress
• Improvement when Xen cpuidle and cpufreq are enabled
– SPECpower score is normalized (100% noPM score is 1000 ssj_ops / watt)
– Similarly, consumed watt is also normalized (idle noPM watt is 10w)
1500
25
noPM-score
PM-score
normalized score (ssj_ops/watt)
noPM-watt
PM-watt 20
Normalized power (watt)
Reduced
1000
Power! 15
10
500
Improved
Efficiency
5
0
0
0 10 20 30 40 50 60 70 80 90 100
SPECpower workload (%)
Intel Confidential
19
20. • Below is a more attracting comparison
– Native WinXPsp1’s idle watt and SPECpower score are both normalized to 100% as the base
native xpsp1
140% native rhel5u1
HZ=1000 in
xen xpsp1 hvm
RHEL5u1 incurs high
timer interrupt kvm xpsp1 hvm
130%
xen rhel5u1 hvm
kvm rhel5u1 hvm
120%
Normalized percentage (%)
110%
100%
Xen is slightly
more power
90%
efficient than KVM
Similar Idle power
80% consumption for
Xen and KVM
70%
60%
Native Native
50%
idle (Watt) SPECpow er (ssj_ops/w att)
Intel Confidential
20
22. • VMM shouldn’t be blamed as only reason for high power consumption!
• A ‘bad’ VM could eat power
– Just like what ‘bad’ application could do in a native OS
– Cause high break events (e.g. timer interrupts) with short C-state residency
• Which parts in VM could draw high power?
– ‘Bad’ applications hurt just as what they could on native
– How guest OS is implemented also matters
• Periodic tick frequency – HZ
• Timer usage in drivers
• Time sub-system implementation
• …
• Green guest OS wins!
– Smaller HZ, idle tickles or fully dynamic tick, range timer, etc.
Intel Confidential
22
23. Idle power consumption
120.00% 4
115.00% 3.5
‘Bad’
guest
110.00% 3
Average residency (ms)
Normalized watt
105.00% 2.5
100.00% 2
95.00% 1.5
Green Green
90.00% 1
85.00% 0.5
80.00% 0
bare Dom0 PV HVM Winxp HZ=1000 HZ=250 HZ=100 tickless
Idle power consumption
Average C-state residency
Intel Confidential
23