6. Improving scalability in a 10G network
[Figure: UDP/RX scalability, without vs. with the netback-duplication hack. X-axis: VM# (10, 20, 40, 60); left Y-axis: CPU util%; right Y-axis: bandwidth (Mbps); series: dom0%, vm%, BW.]
• Issue: total throughput is limited by the netback driver bottleneck in dom0: a single pair of TX/RX tasklets serves all netfront interfaces.
• Hack: duplicate the netback driver 10 times, each copy serving the VNIFs of domX (X = domid % 10); see the sketch below.
• Benefit: reaches near-10G aggregate throughput, at the cost of high CPU utilization.
• Ongoing: a proper multiple TX/RX tasklet implementation.
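A minimal sketch of the hack's dispatch rule, assuming kernel-style C and a per-group netback array; names such as netbk_group() and the struct fields are illustrative, not the actual patch:

    /* Illustrative only: duplicate the netback state NR_NETBK times, each copy
     * with its own TX/RX tasklet pair, and hash a frontend's domain id to pick
     * the copy that serves its VNIFs. */
    #define NR_NETBK 10

    static inline unsigned int netbk_group(domid_t domid)
    {
        return domid % NR_NETBK;        /* X = domid % 10 */
    }

    /* On VNIF connect: queue the interface on its group's schedule list, so only
     * that group's tasklets (not one global pair) ever service it. */
    static void netbk_add_netif(struct xen_netif *netif)
    {
        struct xen_netbk *netbk = &xen_netbk[netbk_group(netif->domid)];

        list_add_tail(&netif->list, &netbk->net_schedule_list);
    }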
7. Notification frequency reduction
[Figure: CPU utilization (dom0%, vm%) for UDP/RX and TCP/RX with 1, 3, and 9 VMs, comparing the original setup ("orig") with the reduced-notification-frequency setup ("FE"); Y-axis: CPU utilization, 0%-250%.]
• Issue: event channel frequency ≈ physical NIC interrupt frequency × n (n = number of VNIFs sharing the NIC). Yet, per our tests, a 1 kHz event channel frequency already sustains 1G throughput (TCP/UDP).
• Solution: add an event channel frequency control policy inside 1) netback or 2) netfront, plus an interface to userspace; a netback-side sketch follows below.
• Benefit: reduces the guests' CPU utilization.
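One possible shape of such a policy on the netback side, as a rough sketch only: it caps notifications at a configurable rate. The max_evtchn_freq knob and the last_notify field are assumptions, not the actual interface:

    /* Illustrative only: suppress an event channel notification if the previous
     * one was sent less than 1/max_evtchn_freq seconds ago; completed requests
     * are still posted to the ring and get flushed by a later notification. */
    static unsigned int max_evtchn_freq = 1000;     /* e.g. 1 kHz, exported to userspace */

    static void netbk_maybe_notify(struct xen_netif *netif)
    {
        unsigned long min_gap = HZ / max_evtchn_freq;

        if (min_gap == 0)
            min_gap = 1;                            /* at least one jiffy apart */

        if (time_after_eq(jiffies, netif->last_notify + min_gap)) {
            notify_remote_via_irq(netif->irq);      /* kick the frontend */
            netif->last_notify = jiffies;
        }
        /* else: stay silent; a timer or the next eligible kick flushes the batch */
    }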
9. Network virtualization overhead with Direct I/O - 10G NIC
* ‘%’ here means system-wide CPU utilization normalized to 100%; 4 CPUs are used in the test.
• Interrupt frequency is very high in a 10G network: 4 kHz / 8 kHz per TX/RX queue, with 8 TX/RX queues in total.
• APIC access VMExits cause the most overhead; within them, EOI accesses account for 90% (TX) and 50% (RX) of the cost.
• Interrupt VMExits (including external interrupts and IPIs) cause the second-most overhead; within them, IPIs account for more than 50%.
10. APIC access VMExit optimization - vEOI
APIC access VMExit handling stages: VMExit → instruction fetch → instruction emulation → vAPIC emulation → VMEntry.
• EOI usage
  • Software writes ‘0’ to the EOI register to signal completion of interrupt servicing;
  • Upon receiving an EOI, the LAPIC clears the highest-priority bit in the ISR and dispatches the next-priority interrupt.
  • Most OSes (Windows and Linux) use only a ‘mov’ for this write, which has no side effect on other CPU state.
• Benefit: by bypassing the instruction fetch/emulation stages, each vEOI's cost drops from 7.6k to 3.1k cycles, and CPU utilization can drop by 15+% in the 10G NIC case (12k intr/s per TX/RX queue pair × 8 queue pairs × 4.5k cycles saved / 2.8 GHz ≈ 15%); a sketch of the fast path follows below.
If a guest uses a more complex instruction (e.g. stos) for the EOI access, there can be side effects inside that guest, but the host is unaffected.
A better / complete solution is a PV approach or virtualizing the x2APIC (MSR-based access).
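A rough sketch of the fast path inside the hypervisor's APIC-access exit handler; vlapic_EOI_set() and vcpu_vlapic() are Xen vLAPIC routines, while the plain-mov check and the eip update are simplified placeholders:

    /* Illustrative only: if the faulting access is a plain 'mov' store of 0 to
     * the EOI register, there is no other architectural side effect, so skip
     * the instruction fetch/decode/emulation stages entirely. */
    if ( apic_page_offset == APIC_EOI && access_is_plain_mov_write )
    {
        vlapic_EOI_set(vcpu_vlapic(current));  /* clear highest ISR bit, dispatch next vIntr */
        update_guest_eip();                    /* advance RIP past the 'mov' (assumed available from the VMCS) */
        return;                                /* straight back to VMEntry, no emulation */
    }
    /* Otherwise fall back to the original fetch + emulate path. */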
11. vIntr delivery - before optimization
0. The VT-d device's pIntr affinity is set to where vCPU0 runs (pCPU0) at boot time; the guest OS sets vCPU2 to receive the vIntr.
1. When the device generates the pIntr, it is delivered to pCPU0.
2. Xen sends an IPI to where vCPU0 currently runs (pCPU2) for vIntr delivery.
3. Before vCPU0's VMentry, Xen delivers the vIntr by sending an IPI to where vCPU2 runs (pCPU3).
4. On vCPU2's VMentry, the vIntr is injected into the guest.
12. vIntr delivery - after optimization
0. Most of the work is done before the pIntr happens (see the sketch below):
  * When the guest OS sets vCPU2 to receive the vIntr, Xen sets the corresponding pCPU (pCPU3) to receive the pIntr.
  * When vCPU2 migrates, the pIntr migrates with it.
1. When the device generates the pIntr, it is delivered to pCPU3 directly.
2. On vCPU2's VMentry, the vIntr is injected into the guest.
Benefit: all IPIs for vIntr delivery (~4%-9% CPU overhead) are removed.
Limit: for simplicity, our first patch handles only MSI (no IOAPIC) and the single vIntr destination case, which covers the typical usage.
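A rough sketch of the affinity-following idea for the MSI, single-destination case; pirq_set_affinity() exists in Xen, but the surrounding helper and call sites are illustrative:

    /* Illustrative only: whenever the guest (re)programs the vIntr destination,
     * or the destination vCPU migrates, re-point the physical MSI at the pCPU
     * that vCPU currently runs on, so no relaying IPIs are needed later. */
    static void sync_pirq_to_dest_vcpu(struct domain *d, int pirq, struct vcpu *dest)
    {
        cpumask_t mask;

        cpumask_clear(&mask);
        cpumask_set_cpu(dest->processor, &mask);   /* pCPU currently hosting the vCPU */
        pirq_set_affinity(d, pirq, &mask);         /* reprogram the MSI destination */
    }
    /* Call sites: (a) the guest writes the vIntr destination, (b) vCPU migration,
     * so step 0 already holds by the time the pIntr fires. */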
14. SR-IOV vs. VNIF scalability
[Figure: UDP/RX scalability of domU + VF (SR-IOV) vs. domU + VNIF (with the netback hack). X-axis: VM# (10, 20, 40, 60); left Y-axis: CPU util%; right Y-axis: bandwidth (Mbps); series: dom0%, vm%, BW.]
• SR-IOV retains a significant advantage over the VNIF solution even after the tasklet bottleneck is fixed, in terms of:
  • lower CPU utilization
  • more stable latency
15. Interrupt coalescing optimization
• Interrupt coalescing is more critical in a VM than on native hardware
  • Interrupt handling inside a VM carries more overhead.
• The native interrupt coalescing policy is not efficient under SR-IOV, because the bandwidth of one VF can be
  • << line speed, e.g. several VFs within one physical port competing at the same time;
  • >> line speed, e.g. inter-VF communication within one port.
• Proposal: adaptive interrupt coalescing (AIC), based on the packet generation rate over the last few seconds and the buffer size in the system; a sketch follows below.
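A sketch of what an AIC decision function in the VF driver could look like; the thresholds, rates, and function name are purely illustrative:

    /* Illustrative only: choose the interrupt throttle rate from the packet rate
     * observed over the last sampling period and the free space left in the RX
     * ring, instead of assuming every VF runs at line speed. */
    static unsigned int aic_pick_intr_rate(unsigned long pkts_per_sec,
                                           unsigned int rx_ring_free)
    {
        if (rx_ring_free < 64)          /* buffers nearly exhausted: drain fast */
            return 20000;               /* ~20K interrupts/s */
        if (pkts_per_sec > 500000)      /* >> line-speed share, e.g. inter-VF traffic */
            return 8000;
        if (pkts_per_sec > 50000)       /* several VFs competing for one port */
            return 4000;
        return 1000;                    /* low rate: coalesce aggressively */
    }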
17. Issues
• Interrupt vectors in Xen were global
  • One vector number was shared by all pCPUs
• Fewer than 200 vectors were available for devices in total
  • Vector range is 0-255
  • Lowest 32 vectors reserved by the x86 SDM
  • 16 vectors reserved for the legacy PIC
  • Highest 16 vectors reserved for high-priority cases
  • Special case: 0x82
• In the SR-IOV case, vectors are easily exhausted
  • E.g. one Niantic NIC can consume 384 vectors = 2 (ports) × 64 (VFs + PF) × 3 (TX/RX/mailbox).
18. Solution
• Back-port the per-CPU vector solution from the Linux kernel
• After the change, max vector# = 256 × pCPU#
• Code changes (a simplified sketch follows below)
  • Precondition: all interrupts must be indexed by ‘irq’
  • Vector allocation / management functions and related structures…
  • Interrupt migration needs special care: after migration, the vector# may differ.
• Evaluation
  • With per-CPU vectors, vectors can be assigned to all Niantic VFs successfully.
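A simplified sketch of the per-CPU vector idea; the real change back-ports Linux's assign_irq_vector(), and names like FIRST_DYNAMIC_VECTOR follow Xen conventions, but this body is illustrative only:

    /* Illustrative only: a vector is now identified by the pair (pCPU, vector)
     * rather than a single global number, so the same vector value can be
     * reused on different pCPUs and the limit becomes 256 * pCPU#. */
    typedef int vector_irq_t[NR_VECTORS];          /* vector -> irq, one table per pCPU */
    static DEFINE_PER_CPU(vector_irq_t, vector_irq);

    static int assign_irq_vector(unsigned int irq, const cpumask_t *mask)
    {
        unsigned int cpu, vector;

        for_each_cpu ( cpu, mask )
        {
            for ( vector = FIRST_DYNAMIC_VECTOR; vector <= LAST_DYNAMIC_VECTOR; vector++ )
            {
                if ( per_cpu(vector_irq, cpu)[vector] != -1 )
                    continue;                      /* already in use on this pCPU */
                per_cpu(vector_irq, cpu)[vector] = irq;
                /* The irq descriptor must also record (cpu, vector); after an
                 * interrupt migration the vector# may change, so the old slot
                 * has to be torn down carefully. */
                return vector;
            }
        }
        return -ENOSPC;                            /* still possible, but far less likely */
    }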