Minimizing I/O Latency in Xen-ARM




       Seehwan Yoo, Chuck Yoo
            and OSVirtual
           Korea University
I/O Latency Issues in Xen-ARM
•  Guaranteed I/O latency is essential to Xen-ARM
   –  For mobile communication,
      •  Broken/missed phone calls, long network detection times
      •  AP and CP are used for guaranteed communication
•  Virtualization promises transparent execution
   –  Not only instruction execution: access isolation
   –  But also performance: performance isolation

•  In practice, I/O latency is still an obstacle
   –  Lack of scheduler support
      •  Hierarchical scheduling nature,
         i.e., the hypervisor cannot impose task scheduling inside a guest OS
      •  The credit scheduler is a poor match for time-sensitive applications
   –  Latency due to the split driver model
      •  Enhances reliability, but degrades performance (w.r.t. I/O latency)
      •  Inter-domain communication between the IDD and the user domain

•  We investigate these issues and present possible remedies



Related Work
•  Credit schedulers in Xen
   –  Task-aware scheduler (Hwanjoo et al., VEE '08)
      •  Inspects task execution at the guest OS
      •  Adaptively uses the boost mechanism according to VM workload characteristics
   –  Laxity-based soft real-time scheduler (Min Lee et al., VEE '10)
      •  Laxity calculation from execution profiles
      •  Assigns priority based on the remaining time to deadline (laxity)
   –  Dynamic core allocation (Y. Hu et al., HPDC '10)
      •  Core allocation by workload characteristics (driver core, fast/slow tick core)
•  Mobile/embedded hypervisors
   –  OKL4 microvisor: microkernel-based hypervisor
      •  Has a verified kernel (from seL4)
      •  Presents good performance; commercially successful
      •  Real-time optimizations: slice donation, direct switching, reflective
         scheduling, threaded interrupt handling, etc.
   –  VirtualLogix VLX: mobile hypervisor with real-time support
      •  Shared driver model: device sharing among an RT guest OS and
         non-RT guest OSs
      •  Good PV performance



Background – the Credit Scheduler in Xen

•  Weighted round-robin based fair scheduler
•  Priorities: BOOST, UNDER, OVER
   –  Basic principle: preserve fairness among all VCPUs
      •  Each vcpu receives credit periodically
      •  Credit is debited as the vcpu consumes execution time
   –  Priority assignment (sketched in the code below)
      •  Remaining credit <= 0 → OVER (lowest)
      •  Remaining credit > 0 → UNDER
      •  Event-pending VCPU → BOOST (highest)
   –  BOOST provides low response time
      •  Allows immediate preemption of the currently running vcpu
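
A minimal, self-contained C sketch of these accounting and priority rules follows. The struct layout, credit units, and function names are illustrative assumptions, not Xen's actual source.

    /* Sketch of the Credit scheduler's priority rules (illustrative only). */
    #include <stdio.h>

    enum prio { PRI_BOOST = 0, PRI_UNDER = 1, PRI_OVER = 2 }; /* lower = higher */

    struct vcpu {
        int credit;        /* remaining credit (arbitrary units)        */
        int event_pending; /* woken by an I/O event and not yet running */
    };

    /* Periodic accounting: every vcpu receives a fixed credit allotment. */
    static void refill_credit(struct vcpu *v, int allotment) { v->credit += allotment; }

    /* Credit is debited as the vcpu consumes execution time. */
    static void burn_credit(struct vcpu *v, int consumed) { v->credit -= consumed; }

    /* Priority assignment exactly as listed on this slide. */
    static enum prio vcpu_priority(const struct vcpu *v)
    {
        if (v->event_pending)
            return PRI_BOOST;              /* highest: may preempt the current vcpu */
        return (v->credit > 0) ? PRI_UNDER /* still has credit                      */
                               : PRI_OVER; /* credit exhausted: lowest              */
    }

    int main(void)
    {
        struct vcpu v = { .credit = 0, .event_pending = 0 };
        refill_credit(&v, 300);
        burn_credit(&v, 400);                    /* ran past its fair share */
        printf("prio=%d\n", vcpu_priority(&v));  /* 2 (OVER)  */
        v.event_pending = 1;
        printf("prio=%d\n", vcpu_priority(&v));  /* 0 (BOOST) */
        return 0;
    }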

Fallacies about BOOST in the Credit Scheduler
•  Fallacy 1) A VCPU is always boosted by an I/O event
   –  In fact, BOOST is sometimes ignored, because
      a VCPU is boosted only when boosting does not break fairness
      •  'Not-boosted vcpus' are observed when the vcpu is already in execution
   –  Example) (modeled in the code below)
      •  If a user domain has a CPU job and is waiting for execution,
      •  it is not boosted, since it will be executed soon anyway, and a tentative
         BOOST could easily break fairness

•  Fallacy 2) BOOST always prioritizes the VCPU
   –  In fact, BOOST is easily negated, because
      multiple vcpus can be boosted simultaneously
      •  'Multi-boost' happens quite often in the split driver model
   –  Example 1)
      •  The driver domain has to be boosted, and then
      •  the user domain also needs to be boosted
   –  Example 2)
      •  The driver domain has multiple packets destined for multiple user domains, so
      •  all the designated user domains are boosted
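
The toy C model below shows how both fallacies fall out of a fairness-preserving wake-up rule. The condition in wake() is an assumption modeled loosely on the Credit scheduler's behavior, not actual Xen code.

    #include <stdbool.h>
    #include <stdio.h>

    enum prio { PRI_BOOST, PRI_UNDER, PRI_OVER };

    struct vcpu {
        enum prio pri;
        bool      runnable; /* already on the run queue (e.g. has a CPU job) */
    };

    /* Fallacy 1: an I/O event boosts a vcpu only if it is not already
     * runnable; a vcpu waiting with a CPU job keeps its priority, so its
     * I/O handling waits for the fair scheduler to reach it. */
    static void wake(struct vcpu *v)
    {
        if (!v->runnable && v->pri == PRI_UNDER)
            v->pri = PRI_BOOST;
        v->runnable = true;
    }

    int main(void)
    {
        struct vcpu dom0 = { PRI_UNDER, false }; /* driver domain, blocked  */
        struct vcpu dom1 = { PRI_UNDER, false }; /* user domain, blocked    */
        struct vcpu busy = { PRI_UNDER, true  }; /* user domain, CPU-loaded */

        wake(&dom0); /* packet arrives: IDD boosted               */
        wake(&dom1); /* packet forwarded: user domain boosted too */
        wake(&busy); /* already runnable: NOT boosted (fallacy 1) */

        /* Fallacy 2: dom0 and dom1 are now both BOOST, so BOOST no longer
         * singles either out; they round-robin against each other. */
        printf("%d %d %d\n", dom0.pri, dom1.pri, busy.pri); /* 0 0 1 */
        return 0;
    }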


Xen-ARM's Latency Characteristics
•  I/O latency is measured throughout the interrupt path
   –  Preemption latency:
      until the code is preemptible
   –  VCPU dispatch latency:
      until the designated VCPU is scheduled
   –  Intra-VM latency:
      until I/O completion

   (Figure: interrupt path. A physical interrupt is handled by the Xen-ARM
   interrupt handler (preemption latency); the Dom0 vcpu is then dispatched
   (vcpu dispatch latency) and the Dom0 I/O task runs (intra-VM latency);
   the same dispatch and intra-VM stages repeat for DomU. The end-to-end
   latency is the sum of these five components.)
Observed Latency through the Interrupt Path
•  I/O latency measured throughout the interrupt path
   –  Ping requests are sent from an external server to dom1
   –  VM settings
      •  Dom0 : the driver domain
      •  Dom1 : 20% CPU load + ping receiver
      •  Dom2 : 100% CPU load (CPU-burning workload)

   –  Xen-to-netback latency
      •  Large worst-case latency
      •  = Dom0 vcpu dispatch latency + intra-dom0 latency
   –  Netback-to-domU latency
      •  Large average latency
      •  = Dom1 vcpu dispatch latency

Observed VCPU Dispatch Latency: Not-Boosted VCPUs
•  Experimental setting
   –  Dom1: varying CPU workload + ping receiver
   –  Dom2: CPU-burning workload
   –  An external host sends pings to Dom1
•  Not-boosted vcpus affect the I/O latency distribution
   –  Dom1 CPU workload 20%: almost 90% of ping requests are handled within 1 ms
   –  Dom1 CPU workload 40%: 75% of ping requests are handled within 1 ms
   –  Dom1 CPU workload 60%: 65% of ping requests are handled within 1 ms
   –  Dom1 CPU workload 80%: only 60% of ping requests are handled within 1 ms

   (Figure: cumulative latency distribution (%) vs. latency (ms, 0–14),
   one curve per 20%, 40%, 60%, and 80% CPU load at dom1.)

•  The more CPU load Dom1 has, the larger its I/O latency
   → self-disturbance caused by its own not-boosted vcpu
Observed Multi-Boost
•  At the hypervisor, we counted the number of "schedule out" events
   of the driver domain (see the sketch below):

      vcpu state   priority   sched-out count
      Blocked      BOOST              275
      Blocked      UNDER               13
      Unblocked    BOOST          664,190
      Unblocked    UNDER               49

•  A multi-boost is, specifically, when
   –  the current VCPU is in the BOOST state, and it is unblocked
   –  implying that there is another BOOST vcpu, in whose favor
      the current vcpu is scheduled out

•  A large number of multi-boosts is observed
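
The counters above can be gathered with instrumentation along the lines of the C sketch below; the hook name and its placement in the schedule-out path are assumptions for illustration.

    #include <stdio.h>

    enum prio { PRI_BOOST, PRI_UNDER, NR_PRIO };

    /* sched_out[blocked?][priority], as in the table above */
    static unsigned long sched_out[2][NR_PRIO];

    /* Call from wherever the scheduler deschedules the driver domain's vcpu.
     * Unblocked + BOOST means the vcpu still wanted the CPU but lost it to
     * another BOOST vcpu: a multi-boost event. */
    static void account_sched_out(int blocked, enum prio pri)
    {
        sched_out[!!blocked][pri]++;
    }

    int main(void)
    {
        account_sched_out(0, PRI_BOOST); /* one simulated multi-boost */
        printf("multi-boosts: %lu\n", sched_out[0][PRI_BOOST]);
        return 0;
    }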

Intra-VM Latency
•  Latency from usb_irq to netback
   –  Schedule-outs occur during dom0 execution

   (Figure: cumulative distribution (0.8–1.0) of Xen–usb_irq @ dom0,
   Xen–netback @ dom0, and Xen–icmp @ dom1 latencies (0–80 ms);
   Dom0: IDD, Dom1: CPU 20% + ping receiver, Dom2: CPU 100%.)

•  Reasons
   –  Dom0 does not always have the highest priority
   –  Asynchronous I/O handling: bottom halves, softirqs, tasklets, etc.

   (Figure: the same distributions with Dom1 at CPU 80% + ping receiver.)

Virtual Interrupt Preemption at the Guest OS
•  Interrupt enable/disable is not physically operated within a guest OS
   –  local_irq_disable() disables only virtual interrupts
•  Physical interrupts can still occur
   –  and might trigger inter-VM scheduling

   <At the driver domain>
   void default_idle(void) {
       local_irq_disable();      /* virtual interrupts disabled */
       if (!need_resched())
           arch_idle();          /* blocks this domain */
       local_irq_enable();
   }

   A hardware interrupt arriving while virtual interrupts are disabled only
   sets a pending bit, so the driver domain can be scheduled out with the
   IRQ still pending.

•  Virtual interrupt preemption
   –  Similar to lock-holder preemption
   –  A driver function runs with interrupts disabled
   –  A physical timer interrupt triggers inter-VM scheduling
   –  The virtual interrupt can be received only after the domain is
      scheduled again (which can take tens of ms)

   Note that the driver domain performs extensive I/O operations with
   interrupts disabled; the sketch below shows why the disable is only
   virtual.
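
A paravirtual local_irq_disable() only flips a flag in memory shared with the hypervisor; it cannot mask physical interrupts. The two field names below follow Xen's public vcpu_info structure, but the surrounding code is an illustrative assumption.

    #include <stdio.h>

    /* Per-vcpu info shared between Xen and the guest (field names as in
     * Xen's public headers; the rest of this sketch is illustrative). */
    struct vcpu_info {
        unsigned char evtchn_upcall_pending; /* set by Xen: virtual intr pending */
        unsigned char evtchn_upcall_mask;    /* set by guest: virtual intrs off  */
    };

    /* The guest's local_irq_disable() merely writes this flag; the physical
     * CPSR is untouched, so physical interrupts still reach the hypervisor. */
    static void guest_local_irq_disable(struct vcpu_info *vi)
    {
        vi->evtchn_upcall_mask = 1;
    }

    int main(void)
    {
        struct vcpu_info vi = { 0, 0 };
        guest_local_irq_disable(&vi);
        vi.evtchn_upcall_pending = 1; /* Xen: an event arrived meanwhile */
        /* A physical timer tick can now deschedule this domain; the pending
         * event is delivered only when the domain runs again (tens of ms). */
        printf("pending=%u masked=%u\n", vi.evtchn_upcall_pending,
               vi.evtchn_upcall_mask);
        return 0;
    }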
Resolving the I/O Latency Problems for Xen-ARM
1.  Fixed priority assignment (see the sketch below)
    –  Let the driver domain always run with DRIVER_BOOST,
       the highest priority
       •  regardless of its CPU workload
       •  Resolves not-boosted VCPUs and multi-boost
    –  RT_BOOST above BOOST for real-time I/O domains
2.  Virtual FIQ support
    –  ARM-specific interrupt optimization
    –  Higher priority than a normal IRQ interrupt
    –  Uses the virtual PSR (program status register)
3.  Do not schedule out the driver domain while it has virtual
    interrupts disabled
    –  It will finish soon, so the hypervisor should give the driver
       domain a chance to run
       •  Resolves virtual interrupt preemption
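
A compact C sketch of remedies 1 and 3 together follows. The priority names match the slide; the data structures and the preemption hook are illustrative assumptions about where such checks would sit in a scheduler.

    #include <stdbool.h>
    #include <stdio.h>

    enum prio { PRI_DRIVER_BOOST, PRI_RT_BOOST, PRI_BOOST, PRI_UNDER, PRI_OVER };

    struct vcpu {
        bool is_driver_domain;
        bool is_rt_domain;
        bool event_pending;
        bool has_credit;
        bool virq_disabled;  /* guest ran local_irq_disable() */
    };

    static enum prio vcpu_priority(const struct vcpu *v)
    {
        if (v->is_driver_domain) return PRI_DRIVER_BOOST;        /* remedy 1 */
        if (v->is_rt_domain && v->event_pending) return PRI_RT_BOOST;
        if (v->event_pending) return PRI_BOOST;
        return v->has_credit ? PRI_UNDER : PRI_OVER;
    }

    /* Remedy 3: never deschedule the driver domain inside a virtual
     * critical section; it will re-enable interrupts shortly. */
    static bool may_preempt(const struct vcpu *cur, const struct vcpu *next)
    {
        if (cur->is_driver_domain && cur->virq_disabled)
            return false;
        return vcpu_priority(next) < vcpu_priority(cur); /* lower = higher */
    }

    int main(void)
    {
        struct vcpu dom0 = { .is_driver_domain = true, .virq_disabled = true };
        struct vcpu dom1 = { .is_rt_domain = true, .event_pending = true };
        printf("preempt dom0? %d\n", may_preempt(&dom0, &dom1)); /* 0 */
        return 0;
    }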



Enhanced Latency: No Dom1 VCPU Dispatch Latency

   (Figure: four cumulative latency distributions (0.4–1.0, latency
   0–80 ms) of Xen–netback @ dom0 and Xen–icmp @ dom1, one panel per
   Dom1 CPU load of 20%, 40%, 60%, and 80%; in all panels Dom0 is the
   IDD, Dom1 additionally receives pings, and Dom2 runs at 100% CPU
   load.)
Enhanced Interrupt Latency @ the Driver Domain
•  Vcpu dispatch latency
   –  From the interrupt handler @ Xen to the hard-IRQ handler @ IDD
   –  Fixed-priority dom0 vcpu
   –  Virtual FIQ
   –  No latency is observed!

•  Intra-VM latency
   –  From the ISR to netback @ IDD
   –  No virtual interrupt preemption (dom0 has the highest priority)

•  Among 13 M interrupts,
   –  56 K were caught where virtual interrupt preemption had happened
   –  8.5 M preemptions occurred with the FIQ optimization

   (Figure: cumulative distribution (0.95–1.0, latency 0–80 ms) of
   Xen–usb_irq and Xen–netback latencies, original vs. new scheduler;
   Dom0: IDD, Dom1: CPU 80% + ping receiver, Dom2: CPU 100%.)
Enhanced End-User Latency: Overall Result
•  Over 1 million ping tests,
   –  the fixed priority lets the driver domain run without additional
      latency from inter-VM scheduling
      •  Largely reduces overall latency
   –  99% of interrupts are handled within 1 ms

   (Figure: cumulative latency distribution (%, 80–100) over latencies of
   0–60 ms, Xen-domU enhanced vs. original; the enhanced curve shows the
   reduced latency.)
Conclusion and Possible Future Work
•  We analyzed I/O latency in the Xen-ARM virtual machine
   –  throughout the interrupt handling path of the split driver model
•  Two main reasons for long latency
   –  Limitations of 'BOOST' in Xen-ARM's Credit scheduler
      •  Not-boosted vcpus
      •  Multi-boost
   –  The driver domain's virtualized interrupt handling
      •  Virtual interrupt preemption (akin to lock-holder preemption)
•  Achieved sub-millisecond latency for 99% of network packet interrupts
   –  DRIVER_BOOST: the highest priority, reserved for the driver domain
   –  Modified the scheduler so that the driver domain is not scheduled out
      while its virtual interrupts are disabled
   –  Further optimizations (incl. virtual FIQ mode, softirq awareness)
•  Possible future work
   –  Multi-core consideration/extension (core allocation, etc.)
      •  Integration with other schedulers
   –  Tight latency guarantees for real-time guest OSs
      •  The remaining 1% holds the key


Thanks for your attention




Credits to OSvirtual @ oslab, KU



Appendix. Native Comparison
•  Comparison with the native system – no CPU workload
   –  Slightly reduced handling latency, largely reduced maximum

             Native   Orig. dom0   Orig. domU   New sched. dom0   New sched. domU
   Min          375          459          575               444               584
   Avg       532.52       821.48       912.42            576.54            736.06
   Max       107456       100782       100964              1883              2208
   Stdev    1792.34      4656.26      4009.78             41.84             45.95
   (latencies in microseconds)

   (Figure: cumulative latency distribution, 300–1000 us, of native ping
   (eth0), orig. dom0, orig. domU, new sched. dom0, and new sched. domU.)
Appendix. Fairness?
•  Fairness is still good, and high utilization is achieved

   (Figure: CPU-burning jobs' utilization of dom1 and dom2 under
   "NO I/O (ideal case)", the original Credit, and the new scheduler;
   plus normalized throughput = measured throughput / ideal throughput
   for the original Credit vs. the new scheduler.)

   * Setting
   Dom1: 20, 40, 60, 80, 100% CPU load + ping receiver
   Dom2: 100% CPU load
   Note that the Credit scheduler is work-conserving.
Appendix. Latency in a Multi-OS Environment
•  3-domain case with differentiated service
   –  Added an RT_BOOST priority between DRIVER_BOOST and BOOST
   –  Dom1 assumed to be RT
   –  Dom2 and Dom3 are general-purpose

   (Figure, left: the Credit scheduler's latency distribution for 3 domains
   over 10 K ping tests (interval 10–40 ms). Figure, right: the enhanced
   latency distribution for the same setup. Both plot cumulative percentage
   (0–100%) against latency from 0 to 120,000 us for Dom1, Dom2, and Dom3.)
Appendix. Latency in a Multi-OS Environment (cont.)
•  3-domain case with differentiated service
   –  Added an RT_BOOST priority between DRIVER_BOOST and BOOST
   –  Dom1 assumed to be RT
   –  Dom2 and Dom3 are general-purpose

   (Figure: the two latency distributions overlaid, Enh. Dom1/Dom2/Dom3
   vs. Credit Dom1/Dom2/Dom3, cumulative percentage 0–100%.)

Weitere ähnliche Inhalte

Was ist angesagt?

Track A-Shmuel Panijel, Windriver
Track A-Shmuel Panijel, WindriverTrack A-Shmuel Panijel, Windriver
Track A-Shmuel Panijel, Windriver
chiportal
 
Apache Hadoop India Summit 2011 talk "Profiling Application Performance" by U...
Apache Hadoop India Summit 2011 talk "Profiling Application Performance" by U...Apache Hadoop India Summit 2011 talk "Profiling Application Performance" by U...
Apache Hadoop India Summit 2011 talk "Profiling Application Performance" by U...
Yahoo Developer Network
 
Service Assurance for Virtual Network Functions in Cloud-Native Environments
Service Assurance for Virtual Network Functions in Cloud-Native EnvironmentsService Assurance for Virtual Network Functions in Cloud-Native Environments
Service Assurance for Virtual Network Functions in Cloud-Native Environments
Nikos Anastopoulos
 
vSphere APIs for performance monitoring
vSphere APIs for performance monitoringvSphere APIs for performance monitoring
vSphere APIs for performance monitoring
Alan Renouf
 
Windows server 8 hyper v networking (aidan finn)
Windows server 8 hyper v networking (aidan finn)Windows server 8 hyper v networking (aidan finn)
Windows server 8 hyper v networking (aidan finn)
hypervnu
 
Impact 2009 1783 Achieving Availability With W A Sz User Experience
Impact 2009 1783  Achieving  Availability With  W A Sz   User ExperienceImpact 2009 1783  Achieving  Availability With  W A Sz   User Experience
Impact 2009 1783 Achieving Availability With W A Sz User Experience
Elena Nanos
 
Multi-threaded Performance Pitfalls
Multi-threaded Performance PitfallsMulti-threaded Performance Pitfalls
Multi-threaded Performance Pitfalls
Ciaran McHale
 

Was ist angesagt? (20)

Link Virtualization based on Xen
Link Virtualization based on XenLink Virtualization based on Xen
Link Virtualization based on Xen
 
Linux PV on HVM
Linux PV on HVMLinux PV on HVM
Linux PV on HVM
 
Nakajima hvm-be final
Nakajima hvm-be finalNakajima hvm-be final
Nakajima hvm-be final
 
Session 7362 Handout 427 0
Session 7362 Handout 427 0Session 7362 Handout 427 0
Session 7362 Handout 427 0
 
Track A-Shmuel Panijel, Windriver
Track A-Shmuel Panijel, WindriverTrack A-Shmuel Panijel, Windriver
Track A-Shmuel Panijel, Windriver
 
XS Japan 2008 Services English
XS Japan 2008 Services EnglishXS Japan 2008 Services English
XS Japan 2008 Services English
 
Apache Hadoop India Summit 2011 talk "Profiling Application Performance" by U...
Apache Hadoop India Summit 2011 talk "Profiling Application Performance" by U...Apache Hadoop India Summit 2011 talk "Profiling Application Performance" by U...
Apache Hadoop India Summit 2011 talk "Profiling Application Performance" by U...
 
Service Assurance for Virtual Network Functions in Cloud-Native Environments
Service Assurance for Virtual Network Functions in Cloud-Native EnvironmentsService Assurance for Virtual Network Functions in Cloud-Native Environments
Service Assurance for Virtual Network Functions in Cloud-Native Environments
 
Linux PV on HVM
Linux PV on HVMLinux PV on HVM
Linux PV on HVM
 
vSphere APIs for performance monitoring
vSphere APIs for performance monitoringvSphere APIs for performance monitoring
vSphere APIs for performance monitoring
 
Munich 2016 - Z011601 Martin Packer - Parallel Sysplex Performance Topics topics
Munich 2016 - Z011601 Martin Packer - Parallel Sysplex Performance Topics topicsMunich 2016 - Z011601 Martin Packer - Parallel Sysplex Performance Topics topics
Munich 2016 - Z011601 Martin Packer - Parallel Sysplex Performance Topics topics
 
Linux Foundation Collaboration Summit 13 :10 years of Xen and Beyond
Linux Foundation Collaboration Summit 13 :10 years of Xen and BeyondLinux Foundation Collaboration Summit 13 :10 years of Xen and Beyond
Linux Foundation Collaboration Summit 13 :10 years of Xen and Beyond
 
XS Oracle 2009 Intro Slides
XS Oracle 2009 Intro SlidesXS Oracle 2009 Intro Slides
XS Oracle 2009 Intro Slides
 
Mobile Virtualization using the Xen Technologies
Mobile Virtualization using the Xen TechnologiesMobile Virtualization using the Xen Technologies
Mobile Virtualization using the Xen Technologies
 
Toward a practical “HPC Cloud”: Performance tuning of a virtualized HPC cluster
Toward a practical “HPC Cloud”: Performance tuning of a virtualized HPC clusterToward a practical “HPC Cloud”: Performance tuning of a virtualized HPC cluster
Toward a practical “HPC Cloud”: Performance tuning of a virtualized HPC cluster
 
Windows server 8 hyper v networking (aidan finn)
Windows server 8 hyper v networking (aidan finn)Windows server 8 hyper v networking (aidan finn)
Windows server 8 hyper v networking (aidan finn)
 
Impact 2009 1783 Achieving Availability With W A Sz User Experience
Impact 2009 1783  Achieving  Availability With  W A Sz   User ExperienceImpact 2009 1783  Achieving  Availability With  W A Sz   User Experience
Impact 2009 1783 Achieving Availability With W A Sz User Experience
 
Multi-threaded Performance Pitfalls
Multi-threaded Performance PitfallsMulti-threaded Performance Pitfalls
Multi-threaded Performance Pitfalls
 
XS Boston 2008 Self IO Emulation
XS Boston 2008 Self IO EmulationXS Boston 2008 Self IO Emulation
XS Boston 2008 Self IO Emulation
 
prezentációt
prezentációtprezentációt
prezentációt
 

Ähnlich wie Minimizing I/O Latency in Xen-ARM

Perfomance tuning on Go 2.0
Perfomance tuning on Go 2.0Perfomance tuning on Go 2.0
Perfomance tuning on Go 2.0
Yogi Kulkarni
 
Buiding a better Userspace - The current and future state of QEMU and KVM int...
Buiding a better Userspace - The current and future state of QEMU and KVM int...Buiding a better Userspace - The current and future state of QEMU and KVM int...
Buiding a better Userspace - The current and future state of QEMU and KVM int...
aliguori
 
Blue host openstacksummit_2013
Blue host openstacksummit_2013Blue host openstacksummit_2013
Blue host openstacksummit_2013
Jun Park
 
Blue host using openstack in a traditional hosting environment
Blue host using openstack in a traditional hosting environmentBlue host using openstack in a traditional hosting environment
Blue host using openstack in a traditional hosting environment
OpenStack Foundation
 

Ähnlich wie Minimizing I/O Latency in Xen-ARM (20)

Real Time Support For Xen
Real Time Support For XenReal Time Support For Xen
Real Time Support For Xen
 
Introduction to Cloud Data Center and Network Issues
Introduction to Cloud Data Center and Network IssuesIntroduction to Cloud Data Center and Network Issues
Introduction to Cloud Data Center and Network Issues
 
Xen arm
Xen armXen arm
Xen arm
 
Xen arm
Xen armXen arm
Xen arm
 
Shalini xs10
Shalini xs10Shalini xs10
Shalini xs10
 
Perfomance tuning on Go 2.0
Perfomance tuning on Go 2.0Perfomance tuning on Go 2.0
Perfomance tuning on Go 2.0
 
Improving Xen idle power efficiency
Improving Xen idle power efficiencyImproving Xen idle power efficiency
Improving Xen idle power efficiency
 
XPDDS18: Real Time in XEN on ARM - Andrii Anisov, EPAM Systems Inc.
XPDDS18: Real Time in XEN on ARM - Andrii Anisov, EPAM Systems Inc.XPDDS18: Real Time in XEN on ARM - Andrii Anisov, EPAM Systems Inc.
XPDDS18: Real Time in XEN on ARM - Andrii Anisov, EPAM Systems Inc.
 
Bluetube
BluetubeBluetube
Bluetube
 
Tunning mobicent-jean deruelle
Tunning mobicent-jean deruelleTunning mobicent-jean deruelle
Tunning mobicent-jean deruelle
 
COLO: COarse-grain LOck-stepping Virtual Machines for Non-stop Service
COLO: COarse-grain LOck-stepping Virtual Machines for Non-stop ServiceCOLO: COarse-grain LOck-stepping Virtual Machines for Non-stop Service
COLO: COarse-grain LOck-stepping Virtual Machines for Non-stop Service
 
VMworld 2013: Silent Killer: How Latency Destroys Performance...And What to D...
VMworld 2013: Silent Killer: How Latency Destroys Performance...And What to D...VMworld 2013: Silent Killer: How Latency Destroys Performance...And What to D...
VMworld 2013: Silent Killer: How Latency Destroys Performance...And What to D...
 
Coupling Facility CPU
Coupling Facility CPUCoupling Facility CPU
Coupling Facility CPU
 
”Bare-Metal Container" presented at HPCC2016
”Bare-Metal Container" presented at HPCC2016”Bare-Metal Container" presented at HPCC2016
”Bare-Metal Container" presented at HPCC2016
 
Buiding a better Userspace - The current and future state of QEMU and KVM int...
Buiding a better Userspace - The current and future state of QEMU and KVM int...Buiding a better Userspace - The current and future state of QEMU and KVM int...
Buiding a better Userspace - The current and future state of QEMU and KVM int...
 
Blue host openstacksummit_2013
Blue host openstacksummit_2013Blue host openstacksummit_2013
Blue host openstacksummit_2013
 
Blue host using openstack in a traditional hosting environment
Blue host using openstack in a traditional hosting environmentBlue host using openstack in a traditional hosting environment
Blue host using openstack in a traditional hosting environment
 
VM Live Migration Speedup in Xen
VM Live Migration Speedup in XenVM Live Migration Speedup in Xen
VM Live Migration Speedup in Xen
 
Scaling habits of ASP.NET
Scaling habits of ASP.NETScaling habits of ASP.NET
Scaling habits of ASP.NET
 
Project ACRN CPU sharing BVT scheduler in ACRN hypervisor
Project ACRN CPU sharing BVT scheduler in ACRN hypervisorProject ACRN CPU sharing BVT scheduler in ACRN hypervisor
Project ACRN CPU sharing BVT scheduler in ACRN hypervisor
 

Mehr von The Linux Foundation

Mehr von The Linux Foundation (20)

ELC2019: Static Partitioning Made Simple
ELC2019: Static Partitioning Made SimpleELC2019: Static Partitioning Made Simple
ELC2019: Static Partitioning Made Simple
 
XPDDS19: How TrenchBoot is Enabling Measured Launch for Open-Source Platform ...
XPDDS19: How TrenchBoot is Enabling Measured Launch for Open-Source Platform ...XPDDS19: How TrenchBoot is Enabling Measured Launch for Open-Source Platform ...
XPDDS19: How TrenchBoot is Enabling Measured Launch for Open-Source Platform ...
 
XPDDS19 Keynote: Xen in Automotive - Artem Mygaiev, Director, Technology Solu...
XPDDS19 Keynote: Xen in Automotive - Artem Mygaiev, Director, Technology Solu...XPDDS19 Keynote: Xen in Automotive - Artem Mygaiev, Director, Technology Solu...
XPDDS19 Keynote: Xen in Automotive - Artem Mygaiev, Director, Technology Solu...
 
XPDDS19 Keynote: Xen Project Weather Report 2019 - Lars Kurth, Director of Op...
XPDDS19 Keynote: Xen Project Weather Report 2019 - Lars Kurth, Director of Op...XPDDS19 Keynote: Xen Project Weather Report 2019 - Lars Kurth, Director of Op...
XPDDS19 Keynote: Xen Project Weather Report 2019 - Lars Kurth, Director of Op...
 
XPDDS19 Keynote: Unikraft Weather Report
XPDDS19 Keynote:  Unikraft Weather ReportXPDDS19 Keynote:  Unikraft Weather Report
XPDDS19 Keynote: Unikraft Weather Report
 
XPDDS19 Keynote: Secret-free Hypervisor: Now and Future - Wei Liu, Software E...
XPDDS19 Keynote: Secret-free Hypervisor: Now and Future - Wei Liu, Software E...XPDDS19 Keynote: Secret-free Hypervisor: Now and Future - Wei Liu, Software E...
XPDDS19 Keynote: Secret-free Hypervisor: Now and Future - Wei Liu, Software E...
 
XPDDS19 Keynote: Xen Dom0-less - Stefano Stabellini, Principal Engineer, Xilinx
XPDDS19 Keynote: Xen Dom0-less - Stefano Stabellini, Principal Engineer, XilinxXPDDS19 Keynote: Xen Dom0-less - Stefano Stabellini, Principal Engineer, Xilinx
XPDDS19 Keynote: Xen Dom0-less - Stefano Stabellini, Principal Engineer, Xilinx
 
XPDDS19 Keynote: Patch Review for Non-maintainers - George Dunlap, Citrix Sys...
XPDDS19 Keynote: Patch Review for Non-maintainers - George Dunlap, Citrix Sys...XPDDS19 Keynote: Patch Review for Non-maintainers - George Dunlap, Citrix Sys...
XPDDS19 Keynote: Patch Review for Non-maintainers - George Dunlap, Citrix Sys...
 
XPDDS19: Memories of a VM Funk - Mihai Donțu, Bitdefender
XPDDS19: Memories of a VM Funk - Mihai Donțu, BitdefenderXPDDS19: Memories of a VM Funk - Mihai Donțu, Bitdefender
XPDDS19: Memories of a VM Funk - Mihai Donțu, Bitdefender
 
OSSJP/ALS19: The Road to Safety Certification: Overcoming Community Challeng...
OSSJP/ALS19:  The Road to Safety Certification: Overcoming Community Challeng...OSSJP/ALS19:  The Road to Safety Certification: Overcoming Community Challeng...
OSSJP/ALS19: The Road to Safety Certification: Overcoming Community Challeng...
 
OSSJP/ALS19: The Road to Safety Certification: How the Xen Project is Making...
 OSSJP/ALS19: The Road to Safety Certification: How the Xen Project is Making... OSSJP/ALS19: The Road to Safety Certification: How the Xen Project is Making...
OSSJP/ALS19: The Road to Safety Certification: How the Xen Project is Making...
 
XPDDS19: Speculative Sidechannels and Mitigations - Andrew Cooper, Citrix
XPDDS19: Speculative Sidechannels and Mitigations - Andrew Cooper, CitrixXPDDS19: Speculative Sidechannels and Mitigations - Andrew Cooper, Citrix
XPDDS19: Speculative Sidechannels and Mitigations - Andrew Cooper, Citrix
 
XPDDS19: Keeping Coherency on Arm: Reborn - Julien Grall, Arm ltd
XPDDS19: Keeping Coherency on Arm: Reborn - Julien Grall, Arm ltdXPDDS19: Keeping Coherency on Arm: Reborn - Julien Grall, Arm ltd
XPDDS19: Keeping Coherency on Arm: Reborn - Julien Grall, Arm ltd
 
XPDDS19: QEMU PV Backend 'qdevification'... What Does it Mean? - Paul Durrant...
XPDDS19: QEMU PV Backend 'qdevification'... What Does it Mean? - Paul Durrant...XPDDS19: QEMU PV Backend 'qdevification'... What Does it Mean? - Paul Durrant...
XPDDS19: QEMU PV Backend 'qdevification'... What Does it Mean? - Paul Durrant...
 
XPDDS19: Status of PCI Emulation in Xen - Roger Pau Monné, Citrix Systems R&D
XPDDS19: Status of PCI Emulation in Xen - Roger Pau Monné, Citrix Systems R&DXPDDS19: Status of PCI Emulation in Xen - Roger Pau Monné, Citrix Systems R&D
XPDDS19: Status of PCI Emulation in Xen - Roger Pau Monné, Citrix Systems R&D
 
XPDDS19: [ARM] OP-TEE Mediator in Xen - Volodymyr Babchuk, EPAM Systems

Minimizing I/O Latency in Xen-ARM

  • 5. Fallacies for the BOOST in the Credit
    • Fallacy 1) A VCPU is always boosted by an I/O event
      – In fact, BOOST is sometimes ignored: a VCPU is boosted only when boosting does not break fairness
      – 'Not-boosted' VCPUs are observed while the VCPU is in execution
      – Example 1) If a user domain has a CPU job and is waiting for execution, it is not boosted, since it will be executed soon anyway and a tentative BOOST would easily break fairness
    • Fallacy 2) BOOST always prioritizes the VCPU
      – In fact, BOOST is easily negated because multiple VCPUs can be boosted simultaneously
      – 'Multi-boost' happens quite often in the split driver model
      – Example 1) The driver domain has to be boosted, and then the user domain also needs to be boosted
      – Example 2) The driver domain has multiple packets destined for multiple user domains; all the designated user domains are boosted
    Operating Systems Lab. http://os.korea.ac.kr 5
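  Both fallacies trace back to the Credit scheduler's wake-up path. The following is a minimal C sketch of that logic, modeled on the behavior of Xen's sched_credit.c but with placeholder names (vcpu_is_running, on_runq, and runq_insert are stand-ins, not the real API):

    enum pri { BOOST, UNDER, OVER };
    struct vcpu { enum pri pri; };

    /* Placeholder helpers standing in for the real scheduler internals. */
    int  vcpu_is_running(struct vcpu *v);
    int  on_runq(struct vcpu *v);
    void runq_insert(struct vcpu *v);

    void credit_vcpu_wake(struct vcpu *v)
    {
        /* Fallacy 1: a VCPU that is already running or already queued is
         * not boosted; it will run soon anyway, and a tentative BOOST
         * would easily break fairness. */
        if (vcpu_is_running(v) || on_runq(v))
            return;

        /* Only a blocked VCPU with remaining credit is promoted. */
        if (v->pri == UNDER)
            v->pri = BOOST;   /* Fallacy 2: every event-pending VCPU gets
                               * this same top priority, so the IDD and
                               * user domains can be boosted at once
                               * ("multi-boost"). */
        runq_insert(v);
    }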
  • 6. Xen-ARM’s Latency Characteristic
    • I/O latency measured throughout the interrupt path
      – Preemption latency: until the code is preemptible
      – VCPU dispatch latency: until the designated VCPU is scheduled
      – Intra-VM latency: until I/O completion
    [Figure: interrupt path. A physical interrupt enters Xen-ARM's interrupt handler, then the Dom0 I/O task, then the DomU I/O task; preemption latency, VCPU dispatch latency, and intra-VM latency accrue at each stage for Dom0 and DomU]
    Operating Systems Lab. http://os.korea.ac.kr 6
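  These three components can be measured by timestamping the same interrupt at each hop and subtracting. A minimal sketch, assuming a platform cycle counter behind read_cycles() and a trace record shared across the hops (both are assumptions, not the deck's actual measurement code):

    struct irq_trace {
        unsigned long t_phys;      /* physical interrupt taken by Xen-ARM */
        unsigned long t_preempt;   /* code became preemptible             */
        unsigned long t_dispatch;  /* designated VCPU actually dispatched */
        unsigned long t_complete;  /* I/O task completed inside the guest */
    };

    /* Assumed cycle-counter read; each t_* field above is filled with
     * read_cycles() at the corresponding hop. */
    unsigned long read_cycles(void);

    void report(const struct irq_trace *tr)
    {
        unsigned long preemption_lat = tr->t_preempt  - tr->t_phys;
        unsigned long dispatch_lat   = tr->t_dispatch - tr->t_preempt;
        unsigned long intra_vm_lat   = tr->t_complete - tr->t_dispatch;
        /* Total I/O latency = preemption + VCPU dispatch + intra-VM. */
        (void)preemption_lat; (void)dispatch_lat; (void)intra_vm_lat;
    }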
  • 7. Observed Latency through the Interrupt Path
    • I/O latency measured throughout the interrupt path
      – A ping request is sent from an external server to dom1
      – VM settings: Dom0, the driver domain; Dom1, 20% CPU load + ping recv.; Dom2, 100% CPU load (CPU-burning workload)
      – Xen-netback latency: large worst-case latency (dom0 VCPU dispatch latency + intra-dom0 latency)
      – Netback-domU latency: large average latency (dom1 VCPU dispatch latency)
    Operating Systems Lab. http://os.korea.ac.kr 7
  • 8. Observed VCPU Dispatch Latency: Not-boosted VCPU
    • Experimental setting
      – Dom1: varying CPU workload; Dom2: CPU-burning workload
      – Another external host sends ping to Dom1
    • Not-boosted VCPUs affect the I/O latency distribution
      – Dom1 at 20% CPU load: almost 90% of ping requests are handled within 1 ms
      – At 40%: 75% within 1 ms
      – At 60%: 65% within 1 ms
      – At 80%: only 60% within 1 ms
    • The more CPU load Dom1 has, the larger its I/O latency (self-disturbance by the not-boosted VCPU)
    [Figure: cumulative latency distribution (%) over latency (ms), for 20/40/60/80% CPU load @ dom1]
    Operating Systems Lab. http://os.korea.ac.kr 8
  • 9. Observed Multi-boost
    • At the hypervisor, we counted the driver domain's "schedule out" events:

      VCPU state   Priority   Sched. out count
      Blocked      BOOST      275
      Blocked      UNDER      13
      Unblocked    BOOST      664,190
      Unblocked    UNDER      49

    • Multi-boost is specifically when the current VCPU is in the BOOST state and is unblocked
      – This implies that another BOOST VCPU exists, and the current one is scheduled out anyway
    • A large number of multi-boosts is observed
    Operating Systems Lab. http://os.korea.ac.kr 9
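  Counts like these can be gathered with a small hook in the hypervisor's schedule-out path. A hedged sketch (the hook and helper names are hypothetical, reusing enum pri and struct vcpu from the earlier sketch):

    /* Driver-domain schedule-out counters, indexed [blocked?][boosted?]. */
    static unsigned long sched_out_cnt[2][2];

    int vcpu_is_driver_domain(struct vcpu *v);   /* assumed helper */
    int vcpu_blocked(struct vcpu *v);            /* assumed helper */

    void on_schedule_out(struct vcpu *prev)
    {
        if (!vcpu_is_driver_domain(prev))
            return;
        int blocked = vcpu_blocked(prev) ? 1 : 0;
        int boosted = (prev->pri == BOOST) ? 1 : 0;
        sched_out_cnt[blocked][boosted]++;
        /* The unblocked + BOOST entries are the multi-boost cases: another
         * BOOST VCPU must have pushed this one off the CPU. */
    }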
  • 10. Intra-VM Latency
    • Latency from usb_irq to netback
      – Schedule-outs occur during dom0 execution
    • Reasons
      – Dom0 does not always have the highest priority
      – Asynchronous I/O handling: bottom halves, softirqs, tasklets, etc. (see the sketch below)
    [Figures: cumulative distribution of latency (ms) for Xen-usb_irq @ dom0, Xen-netback @ dom0, Xen-icmp @ dom1; Dom0: IDD, Dom2: CPU 100%, Dom1: CPU 20% / 80% + ping recv]
    Operating Systems Lab. http://os.korea.ac.kr 10
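  The deferred-handling point is worth making concrete. In a Linux guest, the hard IRQ handler typically only acknowledges the device and defers the real work to a tasklet or softirq, so intra-VM latency keeps accruing after the hard IRQ returns. A minimal sketch using the classic (pre-5.9) tasklet API; the driver names are hypothetical:

    #include <linux/interrupt.h>

    static void rx_tasklet_fn(unsigned long data)
    {
        /* The heavy lifting (e.g., feeding packets to netback) runs here,
         * in softirq context, possibly long after the hard IRQ returned. */
    }
    static DECLARE_TASKLET(rx_tasklet, rx_tasklet_fn, 0);

    static irqreturn_t usb_hard_irq(int irq, void *dev_id)
    {
        /* The hard IRQ handler acknowledges the device, then defers
         * everything else to the tasklet. */
        tasklet_schedule(&rx_tasklet);
        return IRQ_HANDLED;
    }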
  • 11. Virtual Interrupt Preemption @ Guest OS
    • Interrupt enable/disable is not physically operated within a guest OS
      – local_irq_disable() disables only the virtual interrupt
    • Physical interrupts can still occur
      – and might trigger inter-VM scheduling
    • Virtual interrupt preemption
      – Similar to lock-holder preemption
      – The guest performs driver functions with interrupts disabled
      – A physical timer interrupt triggers inter-VM scheduling
      – The virtual interrupt can be received only after the domain is scheduled again (as much as tens of ms later)

      /* At the driver domain */
      void default_idle(void)
      {
          local_irq_disable();    /* disables only the virtual interrupt; a
                                     hardware interrupt arriving now merely
                                     sets a pending bit */
          if (!need_resched())
              arch_idle();        /* blocks this domain; it can be scheduled
                                     out with the IRQ still pending */
          local_irq_enable();
      }

    • Note that the driver domain performs extensive I/O operations that disable interrupts
    Operating Systems Lab. http://os.korea.ac.kr 11
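  For context, in a paravirtualized guest local_irq_disable() does not touch the hardware at all: it sets a per-VCPU mask bit in memory shared with the hypervisor. The sketch below follows the field name in Xen's public interface (vcpu_info.evtchn_upcall_mask); the exact Xen-ARM PV layout may differ, and the barrier macro here is just a compiler barrier added for the sketch:

    struct vcpu_info { unsigned char evtchn_upcall_mask; /* ... */ };
    #define barrier() asm volatile("" ::: "memory")

    /* Guest side: "disabling interrupts" is a single store, no trap. */
    static inline void pv_local_irq_disable(struct vcpu_info *vi)
    {
        vi->evtchn_upcall_mask = 1;   /* events are now only marked pending */
        barrier();
    }

    static inline void pv_local_irq_enable(struct vcpu_info *vi)
    {
        barrier();
        vi->evtchn_upcall_mask = 0;
        /* A real implementation re-checks for pending events here. */
    }

    /* Hypervisor side: the same bit tells whether this VCPU currently runs
     * with virtual interrupts disabled (exploited by remedy 3 on the next
     * slide). */
    static inline int guest_virq_disabled(const struct vcpu_info *vi)
    {
        return vi->evtchn_upcall_mask != 0;
    }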
  • 12. Resolving I/O Latency Problems for Xen-ARM
    1. Fixed priority assignment
      – Let the driver domain always run with DRIVER_BOOST, the highest priority, regardless of its CPU workload
      – Resolves the not-boosted VCPU and multi-boost problems
      – RT_BOOST and BOOST remain for real-time I/O domains
    2. Virtual FIQ support
      – ARM-specific interrupt optimization
      – Higher priority than normal IRQ interrupts
      – Uses the virtual PSR (program status register)
    3. Do not schedule out the driver domain while it has virtual interrupts disabled
      – The critical section will finish soon, so the hypervisor should give the driver domain a chance to run
      – Resolves virtual interrupt preemption
    A sketch of remedies 1 and 3 follows below.
    Operating Systems Lab. http://os.korea.ac.kr 12
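  A minimal sketch of how remedies 1 and 3 could be wired into a Credit-style pick-next path. DRIVER_BOOST and RT_BOOST are the priorities named on this slide; the helpers (runq_highest_priority, vcpu_is_driver_domain, vcpu_virq_disabled) are assumptions, and the real modification to Xen's scheduler is organized differently:

    /* Priority order, highest first. */
    enum csched_pri {
        PRI_DRIVER_BOOST,   /* driver domain: fixed, workload-independent */
        PRI_RT_BOOST,       /* real-time I/O domain (cf. slides 20-21)    */
        PRI_BOOST,          /* event-pending VCPU                         */
        PRI_UNDER,          /* credit remaining                           */
        PRI_OVER            /* credit exhausted                           */
    };

    struct vcpu;
    struct runq;
    int vcpu_is_driver_domain(struct vcpu *v);           /* assumed */
    int vcpu_virq_disabled(struct vcpu *v);              /* assumed: reads the
                                                            guest's virtual
                                                            interrupt mask */
    struct vcpu *runq_highest_priority(struct runq *rq); /* assumed */

    struct vcpu *csched_pick_next(struct runq *rq, struct vcpu *curr)
    {
        /* Remedy 3: never preempt the driver domain while its virtual
         * interrupts are disabled; the critical section is short, and
         * preempting here causes virtual interrupt preemption with
         * tens-of-ms penalties. */
        if (vcpu_is_driver_domain(curr) && vcpu_virq_disabled(curr))
            return curr;

        /* Remedy 1: otherwise pick strictly by priority; the driver domain,
         * at PRI_DRIVER_BOOST, wins whenever it is runnable. */
        return runq_highest_priority(rq);
    }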
  • 13. Enhanced Latency: No dom1 VCPU Dispatch Latency
    [Figures: cumulative distributions of Xen-netback @ dom0 and Xen-icmp @ dom1 latency (ms), four panels; Dom0: IDD, Dom2: CPU 100%, Dom1: CPU 20% / 40% / 60% / 80% + ping recv]
    Operating Systems Lab. http://os.korea.ac.kr 13
  • 14. Enhanced Interrupt Latency @ Driver Domain
    • VCPU dispatch latency
      – From the interrupt handler @ Xen to the hard IRQ handler @ IDD
      – With the fixed-priority dom0 VCPU and virtual FIQ, no dispatch latency is observed
    • Intra-VM latency
      – From the ISR to netback @ IDD
      – No virtual interrupt preemption (dom0 has the highest priority)
    • Among 13M interrupts,
      – 56K were caught where virtual interrupt preemption would have happened
      – 8.5M preemptions occurred with the FIQ optimization
    [Figure: cumulative distribution of latency (ms) for Xen-usb_irq and Xen-netback, orig. vs. new; Dom0: IDD, Dom1: CPU 80% + ping recv, Dom2: CPU 100%]
    Operating Systems Lab. http://os.korea.ac.kr 14
  • 15. Enhanced End-user Latency: Overall Result
    • Over 1 million ping tests,
      – the fixed priority lets the driver domain run without additional latency from inter-VM scheduling
      – this largely reduces overall latency: 99% of interrupts are handled within 1 ms
    [Figure: cumulative latency distribution (%), Xen-domU enhanced vs. original, over latencies 0 to 60 ms]
    Operating Systems Lab. http://os.korea.ac.kr 15
  • 16. Conclusion and Possible Future Work
    • We analyzed I/O latency in the Xen-ARM virtual machine
      – throughout the interrupt handling path of the split driver model
    • Two main reasons for long latency
      – Limitations of BOOST in Xen-ARM's Credit scheduler: not-boosted VCPUs and multi-boost
      – The driver domain's virtualized interrupt handling: virtual interrupt preemption (akin to lock-holder preemption)
    • Achieved sub-millisecond latency for 99% of network packet interrupts
      – DRIVER_BOOST: the highest priority, reserved for the driver domain
      – Modified the scheduler so the driver domain is not scheduled out while its virtual interrupts are disabled
      – Further optimizations (incl. virtual FIQ mode, softirq awareness)
    • Possible future work
      – Multi-core consideration/extension (core allocation, etc.); other scheduler integration
      – Tight latency guarantees for real-time guest OSs: the remaining 1% holds the key
    Operating Systems Lab. http://os.korea.ac.kr 16
  • 17. Thanks for your attention
    Credits to OSVirtual @ OSLab, KU 17
  • 18. Appendix. Native Comparison
    • Comparison with the native system (no CPU workload)
      – Slightly reduced handling latency; largely reduced maximum

      Latency (µs)   Native    Orig. dom0   Orig. domU   New sched. dom0   New sched. domU
      Min            375       459          575          444               584
      Avg            532.52    821.48       912.42       576.54            736.06
      Max            107456    100782       100964       1883              2208
      Stdev          1792.34   4656.26      4009.78      41.84             45.95

    [Figure: cumulative distribution of ping latency, 300 to 1000 µs: native ping (eth0), orig. dom0, orig. domU, new sched. dom0, new sched. domU]
    Operating Systems Lab. http://os.korea.ac.kr 18
  • 19. Appendix. Fairness?
    • Fairness is still good, and the scheduler achieves high utilization
      – Setting: Dom1: 20, 40, 60, 80, 100% CPU load + ping receiver; Dom2: 100% CPU load
      – Note that the Credit scheduler is work-conserving
      – Normalized throughput = measured throughput / ideal throughput
    [Figures: CPU-burning jobs' utilization (dom1, dom2) under no I/O (ideal case), the original Credit, and the new scheduler; normalized throughput of the original Credit vs. the new scheduler]
    Operating Systems Lab. http://os.korea.ac.kr 19
  • 20. Appendix. Latency in multi-OS env.
    • 3-domain case with differentiated service
      – Added an RT_BOOST priority between DRIVER_BOOST and BOOST
      – Dom1 assumed RT; Dom2 and Dom3 for GP (general purpose)
    [Figures: Credit's latency distribution vs. the enhanced latency distribution, 3 domains, 10K ping tests (interval 10 to 40 ms); per-domain cumulative percentage over latencies 0 to 120,000]
    Operating Systems Lab. http://os.korea.ac.kr 20
  • 21. Appendix. Latency in multi-OS env.
    • 3-domain case with differentiated service
      – Added an RT_BOOST priority between DRIVER_BOOST and BOOST
      – Dom1 assumed RT; Dom2 and Dom3 for GP (general purpose)
    [Figure: combined cumulative latency distributions: Enh. Dom1/Dom2/Dom3 vs. Credit Dom1/Dom2/Dom3]
    Operating Systems Lab. http://os.korea.ac.kr 21