Extending I/O scalability in Xen

 Dong Eddie, Zhang Xiantao, Xu Dongxiao, Yang Xiaowei


               Xen Summit Asia 2009
Legal Information

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS.
NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL
PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT.  EXCEPT AS PROVIDED IN INTEL'S
TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY
WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO
SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING
TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY
PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Intel may make changes to specifications and product descriptions at any time, without
notice.

All products, computer systems, dates, and figures specified are preliminary based on
current expectations, and are subject to change without notice.

Performance tests and ratings are measured using specific computer systems and/or
components and reflect the approximate performance of Intel products as measured by
those tests.  Any difference in system hardware or software design or configuration may
affect actual performance. 

Intel is a trademark of Intel Corporation in the U.S. and other countries.

*Other names and brands may be claimed as the property of others.

Copyright © 2009, Intel Corporation. All rights reserved.




Agenda

•Scalability challenges of I/O virtualization
•VNIF optimizations
•VT-d overhead reduction
•SR-IOV
•Per CPU vector




Scalability challenges of I/O virtualization

• Para-virtualized device driver
  • Software bottleneck; high CPU utilization
• Direct I/O
  • Interrupt overhead; device# limit
• SR-IOV
  • Driver not optimal
• Limitation inside VMM




VNIF optimization




Improving scalability in 10G network
[Figure: two charts, "UDP/RX" and "UDP/RX (w/ hack)". X-axis: VM# (10/20/40/60); left Y-axis: CPU util %; right Y-axis: Bandwidth (Mbps); series: dom0%, vm%, BW.]

           •Issue: the total throughput is limited by a bottleneck in dom0's netback driver - a single TX/RX tasklet serves all netfront interfaces.
           •Hack: duplicate the netback driver into 10 copies; copy X serves the VNIFs of domains with domid % 10 == X (see the tasklet sketch below).
           •Benefit: near-10G throughput is reached, at the cost of high CPU utilization.
           •Ongoing: multiple TX/RX tasklets.
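
As a rough illustration of the "multiple TX/RX tasklets" direction, the sketch below keeps one tasklet pair per netback group and hashes each guest to a group by domid, generalizing the duplicate-driver hack. The names (netbk_group, NETBK_GROUPS, netbk_init_groups) are illustrative assumptions, not the actual patch.

#include <linux/interrupt.h>
#include <linux/list.h>
#include <linux/spinlock.h>

#define NETBK_GROUPS 10

struct netbk_group {
        struct tasklet_struct tx_tasklet;
        struct tasklet_struct rx_tasklet;
        struct list_head      schedulable_ifs;  /* netfronts served by this group */
        spinlock_t            lock;
};

static struct netbk_group netbk_groups[NETBK_GROUPS];

/* Pick the group that serves a given guest: domX goes to group X % 10,
 * mirroring the duplicate-netback hack described above. */
static inline struct netbk_group *netbk_group_for(unsigned int domid)
{
        return &netbk_groups[domid % NETBK_GROUPS];
}

static void netbk_init_groups(void (*tx_action)(unsigned long),
                              void (*rx_action)(unsigned long))
{
        int i;

        for (i = 0; i < NETBK_GROUPS; i++) {
                struct netbk_group *grp = &netbk_groups[i];

                INIT_LIST_HEAD(&grp->schedulable_ifs);
                spin_lock_init(&grp->lock);
                /* Each group has its own TX/RX tasklets, so work for
                 * different guests can run concurrently on different CPUs
                 * instead of serializing behind one global tasklet. */
                tasklet_init(&grp->tx_tasklet, tx_action, (unsigned long)grp);
                tasklet_init(&grp->rx_tasklet, rx_action, (unsigned long)grp);
        }
}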



Notification frequency reduction
[Figure: two charts, "UDP/RX" and "TCP/RX", showing CPU utilization (dom0%, vm%) for 1/3/9-VM configurations, original vs. modified netfront/netback variants.]



  •                Issue: evtchn frequency ~= physical NIC's interrupt frequency * n (n = number of VNIFs sharing the NIC).
                   Per our tests, a 1K Hz evtchn frequency already sustains 1G throughput (TCP/UDP).
  •                Solution: add an evtchn frequency-control policy inside 1) netback or 2) netfront, plus an interface to userspace (a rate-limit sketch follows).
  •                Benefit: reduced guest CPU utilization.
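
A minimal sketch of such a rate-limiting policy in netback is shown below. The structure and field names are assumptions, and notify_remote_via_irq() stands in for however the backend kicks the frontend's event channel.

#include <linux/jiffies.h>
#include <xen/events.h>

struct vnif_notify_policy {
        unsigned long window_start;  /* jiffies at the start of the current window */
        unsigned int  count;         /* notifications sent in this window          */
        unsigned int  max_hz;        /* e.g. 1000: 1K Hz already sustains 1G       */
};

/* Call this instead of an unconditional notify when responses are queued. */
static void vnif_maybe_notify(struct vnif_notify_policy *p, int irq)
{
        unsigned long now = jiffies;

        if (time_after(now, p->window_start + HZ)) {
                /* New 1-second window: reset the notification budget. */
                p->window_start = now;
                p->count = 0;
        }

        if (p->count < p->max_hz) {
                p->count++;
                notify_remote_via_irq(irq);
        }
        /* Otherwise skip this event: the peer picks up the pending ring
         * entries on its next event or poll, trading a little latency for
         * fewer interrupts (and less CPU) in the guest. */
}

The max_hz knob is what the proposed userspace interface would tune per VNIF.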



VT-d overhead reduction




Network virtualization overhead with Direct I/O -
10G NIC




      * '%' here means system-wide CPU utilization normalized to 100%; 4 CPUs are used in the test.

•Interrupt frequency is very high with a 10G network: 4K/8K Hz for each TX/RX queue, with 8 TX/RX queues in total.
•APIC-access VMExits cause the most overhead; within that, EOI accesses account for 90%/50% in the TX/RX cases respectively.
•Interrupt VMExits (including external interrupts and IPIs) cause the second-most overhead; within that, IPIs account for >50%.




APIC access VMExit optimization - vEOI
            VMExit -> Instruction fetch -> Instruction emulation -> vAPIC emulation -> VMEntry

                            APIC access VMExit handling stages

• EOI usage
    • Software writes "0" to the EOI register to signal completion of interrupt servicing;
    • Upon receiving an EOI, the LAPIC clears the highest-priority bit in the ISR and
      dispatches the next-highest-priority interrupt.
• Most OSes (Windows and Linux) use only 'mov' for this purpose, which has no side
  effects on other CPU state.
• Benefit: by bypassing the instruction fetch/emulation stages, each vEOI's cost drops
  from 7.6k to 3.1k cycles; CPU utilization can drop by 15+% in the 10G NIC case
  (12k Hz * 8 queues * 4.5k cycles saved per EOI / 2.8 GHz ≈ 15%).

 If a guest uses a more complex instruction (e.g. stos) for the EOI access, there can be
 side effects inside that guest, but the host is unaffected.
 A better / complete solution is PV or virtualizing x2APIC (MSR-based EOI). A fast-path
 sketch follows.
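
The sketch below illustrates the fast path, assuming a toy vLAPIC that holds only an ISR bitmap (the real Xen vlapic has far more state, and emulate_apic_access_slow() is a labeled stand-in for the full decode-and-emulate path). On a write to the EOI offset it completes the EOI directly, skipping instruction fetch and decode, which is safe for guests that only ever 'mov' 0 into EOI.

#include <stdint.h>

#define APIC_EOI_OFFSET 0xB0

/* Toy virtual LAPIC state: just an in-service register (ISR) bitmap. */
struct vlapic {
        uint32_t isr[8];                    /* 256 bits, one per vector */
};

/* Clear the highest-priority bit currently set in the ISR, which is what a
 * real LAPIC does when it receives an EOI. */
static void vlapic_eoi(struct vlapic *vlapic)
{
        int word, bit;

        for (word = 7; word >= 0; word--) {
                if (!vlapic->isr[word])
                        continue;
                bit = 31 - __builtin_clz(vlapic->isr[word]);
                vlapic->isr[word] &= ~(1u << bit);
                /* A full implementation would now dispatch the next pending
                 * interrupt from the IRR, if any. */
                return;
        }
}

/* Stand-in for the slow path: fetch + decode + emulate the faulting instruction. */
int emulate_apic_access_slow(struct vlapic *vlapic, unsigned int offset);

/* Fast path: if the APIC-access VMExit is a write to the EOI register, assume
 * a plain "mov" of 0 (what Windows and Linux issue) and emulate only the EOI
 * side effect, skipping instruction fetch/emulation entirely. */
int handle_apic_access_vmexit(struct vlapic *vlapic, unsigned int offset,
                              int is_write)
{
        if (is_write && offset == APIC_EOI_OFFSET) {
                vlapic_eoi(vlapic);
                return 0;                   /* resume the guest (VMEntry) */
        }
        return emulate_apic_access_slow(vlapic, offset);
}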



vIntr delivery - before optimization




0. The VT-d device's pIntr affinity is set at boot time to the pCPU where vCPU0 runs (pCPU0);
the guest OS sets vCPU2 to receive the vIntr.
1. When the device generates the pIntr, it is delivered to pCPU0.
2. Xen sends an IPI to the pCPU where vCPU0 currently runs (pCPU2) for vIntr delivery.
3. Before vCPU0's VMEntry, Xen delivers the vIntr by sending an IPI to the pCPU where vCPU2
runs (pCPU3).
4. On vCPU2's VMEntry, the vIntr is injected into the guest.



vIntr delivery - after optimization




0. Most of the work is done before the pIntr happens:
     * When the guest OS sets vCPU2 to receive the vIntr, Xen sets the corresponding pCPU
(pCPU3) to receive the pIntr.
     * When vCPU2 migrates, the pIntr migrates with it.
1. When the device generates the pIntr, it is delivered directly to pCPU3.
2. On vCPU2's VMEntry, the vIntr is injected into the guest.

Benefit: all IPIs for vIntr delivery (~4%-9% CPU overhead) are removed.
Limit: for simplicity, our first patch handles only MSI (no IOAPIC) and the single-vIntr-
destination case, which covers typical usage. A sketch of the retargeting idea follows.
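
The following is a minimal, hypothetical C sketch of that retargeting idea; the structures and the hw_set_msi_affinity() helper are illustrative stand-ins, not Xen's actual passthrough-interrupt code. Whenever the guest reprograms the vIntr destination, or the destination vCPU migrates, the physical MSI is pointed at the pCPU that vCPU currently runs on, so no forwarding IPIs are needed.

struct vcpu {
        int processor;                   /* pCPU this vCPU currently runs on */
};

struct passthrough_irq {
        int          pirq;               /* physical MSI/IRQ number          */
        struct vcpu *dest_vcpu;          /* guest-chosen vIntr destination   */
};

/* Stand-in for the hypervisor primitive that rebinds a physical MSI's
 * destination (e.g. by rewriting its MSI address / remapping entry). */
void hw_set_msi_affinity(int pirq, int pcpu);

/* Call when the guest writes the virtual MSI address, choosing a new
 * destination vCPU for the vIntr. */
static void pt_irq_set_guest_dest(struct passthrough_irq *pt, struct vcpu *v)
{
        pt->dest_vcpu = v;
        hw_set_msi_affinity(pt->pirq, v->processor);
}

/* Call from the scheduler when dest_vcpu migrates to a new pCPU, so the
 * pIntr keeps following the vCPU that will consume it. */
static void pt_irq_follow_vcpu(struct passthrough_irq *pt, int new_pcpu)
{
        hw_set_msi_affinity(pt->pirq, new_pcpu);
}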




SR-IOV




SR-IOV vs. VNIF scalability
[Figure: two charts, "domU+VF" and "domU+VNIF (w/ hack)". X-axis: VM# (10/20/40/60); left Y-axis: CPU util %; right Y-axis: Bandwidth (Mbps); series: dom0%, vm%, BW.]



       •SR-IOV has a significant advantage over the VNIF solution, even when the tasklet
       bottleneck is fixed, in terms of:
                • Lower CPU utilization
                • Stable latency




Interrupt coalescing optimization




•    Interrupt coalescing is more critical for a VM than for native:
      •   Interrupt handling inside a VM has more overhead.
•    The native interrupt-coalescing policy is not efficient with SR-IOV, as the bandwidth of one VF can be:
      •   << line speed, e.g. several VFs within one physical port competing at the same time;
      •   >> line speed, e.g. inter-VF communication within one port.
•    Proposal: adaptive interrupt coalescing (AIC), based on the packet generation rate over the
     last few seconds and the buffer size in the system (see the sketch below).
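
A minimal sketch of what an AIC policy could look like is given below; the thresholds, the packets-per-interrupt target, and the simple linear rule are illustrative assumptions, not the proposed algorithm's actual parameters.

#include <stdint.h>

#define ITR_MIN_HZ   1000u      /* never throttle below 1K interrupts/s  */
#define ITR_MAX_HZ  20000u      /* never exceed 20K interrupts/s         */
#define PKTS_PER_IRQ   32u      /* aim for ~32 packets handled per IRQ   */

/* pkt_rate:     packets/s observed in the last sampling interval
 * buf_free_pct: fraction of RX buffer currently free, in percent (0..100)
 * returns:      interrupts/s to program into the VF's moderation register */
static uint32_t aic_pick_itr(uint32_t pkt_rate, uint32_t buf_free_pct)
{
        uint32_t itr = pkt_rate / PKTS_PER_IRQ;

        /* If the buffer is nearly full, interrupt more often so the VM
         * drains it before packets are dropped. */
        if (buf_free_pct < 25)
                itr *= 2;

        if (itr < ITR_MIN_HZ)
                itr = ITR_MIN_HZ;
        if (itr > ITR_MAX_HZ)
                itr = ITR_MAX_HZ;
        return itr;
}

A slow VF (bandwidth far below line speed) thus gets a low interrupt rate, while a fast inter-VF path or a backlogged buffer raises it, which is the adaptivity the proposal describes.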




Per CPU vector




Issues

•Interrupt vectors in Xen were global
 • One vector number was shared by all pCPUs
•Fewer than 200 vectors were available for devices in total
 • The vector range is 0-255
 • The lowest 32 vectors are reserved by the x86 SDM
 • 16 vectors are reserved for legacy PIC high-priority cases
 • The highest 16 vectors are reserved for high-priority cases
 • Special case: 0x82
•In the SR-IOV case, vectors are easily exhausted
 • E.g. one Niantic NIC can use 384 vectors = 2 (ports) * 64 (VFs+PF)
   * 3 (TX/RX/mailbox).




Solution

•Back-port the per-CPU vector solution from the Linux kernel
•After the change, max vector# = 256 * pCPU#
•Code changes
 • Precondition: all interrupts must be indexed by 'irq'
 • Vector allocation / management functions, related structures, ...
 • Interrupt migration needs special care: after migration, the vector#
   may differ.
• Evaluation
 • With per-CPU vectors, vectors can be assigned to all Niantic VFs
   successfully (a bookkeeping sketch follows).
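
For illustration, the sketch below shows the core bookkeeping of a per-CPU vector scheme in the style of the Linux code the slides refer to. Names and the first-fit search are simplified assumptions; the real code also handles priority levels, affinity masks, and post-migration vector cleanup.

#include <stdint.h>

#define NR_CPUS                64
#define NR_VECTORS             256
#define FIRST_DYNAMIC_VECTOR   0x20  /* 0x00-0x1f reserved by the x86 SDM     */
#define LAST_DYNAMIC_VECTOR    0xef  /* top 16 vectors kept for high priority */
#define IRQ_UNASSIGNED         (-1)

/* Per-CPU table: which irq owns each vector on that CPU.  Because the same
 * vector number can be reused on every pCPU, the usable total grows from
 * fewer than 200 to roughly 256 * pCPU#. */
static int per_cpu_vector_irq[NR_CPUS][NR_VECTORS];

struct irq_cfg {
        int cpu;        /* CPU this irq's vector lives on */
        int vector;     /* vector number on that CPU      */
};

static void init_vector_tables(void)
{
        int cpu, vec;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
                for (vec = 0; vec < NR_VECTORS; vec++)
                        per_cpu_vector_irq[cpu][vec] = IRQ_UNASSIGNED;
}

/* Allocate a free vector for 'irq' on 'cpu'; returns 0 on success. */
static int assign_irq_vector(int irq, int cpu, struct irq_cfg *cfg)
{
        int vector;

        for (vector = FIRST_DYNAMIC_VECTOR; vector <= LAST_DYNAMIC_VECTOR;
             vector++) {
                if (per_cpu_vector_irq[cpu][vector] != IRQ_UNASSIGNED)
                        continue;
                per_cpu_vector_irq[cpu][vector] = irq;
                cfg->cpu = cpu;
                cfg->vector = vector;
                return 0;
        }
        return -1;      /* this CPU's vectors are exhausted */
}

On migration, the irq gets a fresh (cpu, vector) pair from the target CPU's table and the old entry is released, which is why code that assumed a globally unique vector per irq needs the special care noted above.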




