Xen and XenServer Storage Performance
Low Latency Virtualisation Challenges

Dr Felipe Franciosi
XenServer Engineering Performance Team
e-mail: felipe.franciosi@citrix.com
freenode: felipef #xen-api
twitter: @franciozzy

Agenda
• Where do Xen and XenServer stand?
  ๏ When is the virtualisation overhead most noticeable?
• Current implementation:
  ๏ blkfront, blkback, blktap2+tapdisk, blktap3, qemu-qdisk
• Measurements over different back ends
  ๏ Throughput and latency analysis
  ๏ Latency breakdown: where are we losing time when virtualising?
• Proposals for improving

Where do Xen and XenServer Stand?
When is the virtualisation overhead noticeable?

Where do Xen and XenServer stand?
• What kind of throughput can I get from dom0 to my device?
  ๏ Using 1 MiB reads, this host reports 118 MB/s from dom0

Where do Xen and XenServer stand?
• What kind of throughput can I get from domU to my device?
  ๏ Using 1 MiB reads, this host reports 117 MB/s from a VM
IMPERCEPTIBLE virtualisation overhead

Where do Xen and XenServer stand?
• That’s not always the case...
  ๏ Same test on different hardware (from dom0)
"my disks can do 700 MB/s !!!!"

Where do Xen and XenServer stand?
• That’s not always the case...
  ๏ Same test on different hardware (from domU)
VISIBLE virtualisation overhead
"why is my VM only doing 300 MB/s ???"

Current Implementation
How we virtualise storage with Xen

Current Implementation (bare metal)
• How does that compare to storage performance again?
  ๏ There are different ways a user application can do storage I/O
• We will use simple read() and write() libc wrappers as examples (a compilable version of this sequence is sketched below):
  1. char buf[4096];
  2. int fd = open("/dev/sda", O_RDONLY | O_DIRECT);
  3. read(fd, buf, 4096);
  4. buf now has the data!
[Diagram: the call travels from the user process through libc into the kernel (sys_read, vfs_read, f_op->read), then through the block layer and device driver to the block device (BD); a HW interrupt signals completion.]

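For reference, here is a minimal, self-contained sketch of the four steps above (not part of the original deck). One practical detail the slide glosses over: O_DIRECT requires the buffer and I/O size to be suitably aligned, so the sketch uses posix_memalign() instead of a plain stack buffer, and it needs root to read the raw device.

```c
/* Minimal sketch of the read() path used as the running example.
 * Assumes Linux; reading /dev/sda directly requires root. */
#define _GNU_SOURCE          /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    void *buf;
    if (posix_memalign(&buf, 4096, 4096)) {          /* 4 KiB aligned buffer */
        perror("posix_memalign");
        return 1;
    }

    int fd = open("/dev/sda", O_RDONLY | O_DIRECT);  /* bypass the page cache */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    ssize_t n = read(fd, buf, 4096);                 /* libc wrapper -> sys_read */
    if (n < 0)
        perror("read");
    else
        printf("read %zd bytes from /dev/sda\n", n); /* buf now has the data */

    close(fd);
    free(buf);
    return 0;
}
```
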
Current Implementation (Xen)
• The "Upstream Xen" use case
  ๏ The virtual device in the guest is implemented by blkfront
  ๏ Blkfront connects to blkback, which handles the I/O in dom0
[Diagram: in domU, the user process calls through libc into the kernel's block layer, which is backed by blkfront; blkfront talks to blkback in dom0 over Xen's blkif protocol; blkback submits the I/O to dom0's block layer and device driver, reaching the block device (BD) that backs the VDI.]

Current Implementation (Xen)
• The XenServer 6.2.0 use case
  ๏ XenServer provides thin provisioning, snapshots, clones, etc. (hello VHD)
  ๏ This is easily implemented in user space (hello TAPDISK)
[Diagram: as before, blkfront in domU connects to blkback in dom0 over Xen's blkif protocol; blkback hands requests to blktap2, whose user-space tapdisk2 process stores the data in VHD and issues aio syscalls through libaio back into dom0's block layer, device driver and block device (BD).]

Current Implementation (Xen)
• The blktap3 and qemu-qdisk use case
  ๏ Have the entire back end in user space
[Diagram: blkfront in domU speaks Xen's blkif protocol directly to a user-space tapdisk3 / qemu-qdisk process in dom0, which reaches grants and event channels through the gntdev and evtchn devices; data is stored in VHD and submitted through libaio (aio syscalls) to dom0's block layer, device driver and block device (BD).]

Measurements Over Different Back Ends
Throughput and latency analysis

Measurement Over Different Back Ends
• Same host, different RAID0 logical volumes on a PERC H700
  ๏ All have 64 KiB stripes, adaptive read-ahead and write-back cache enabled

• dom0 had:
  • 4 vCPUs pinned
  • 4 GB of RAM
• It becomes visible that certain back ends cope much better with larger block sizes.
• This controller supports up to 128 KiB per request. Above that, the Linux block layer splits the requests.

• Seagate ST (SAS)
• blkback is slower, but it catches up with big enough requests.

• Seagate ST (SAS)
• User space back ends are so slow they never catch up, even with bigger requests.
• This is not always true: if the disks were slower, they would catch up.

• Intel DC S3700 (SSD)
• When the disks are really fast, none of the technologies catch up.

Measurement Over Different Back Ends
• There is another way to look at the data (a quick worked example follows):
  1 / Throughput (data/time) = Latency (time/data)

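Applying this reciprocal view to the throughput figures quoted earlier gives approximate per-request service times for 1 MiB reads (values derived here for illustration, not taken from the deck):

  dom0, slower disks: 1 MiB / 118 MB/s ≈ 8.9 ms per request
  domU, slower disks: 1 MiB / 117 MB/s ≈ 9.0 ms per request
  dom0, faster disks: 1 MiB / 700 MB/s ≈ 1.5 ms per request
  domU, faster disks: 1 MiB / 300 MB/s ≈ 3.5 ms per request

On the slower device the extra latency is lost in the device time; on the faster one it accounts for more than half of each request's service time, which is what the following breakdown sets out to explain.
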
• Intel DC S3700 (SSD)
• The question now is: where is time being spent?
• Compare time spent in:
  • dom0
  • blkback
  • qdisk

Measurement Over Different Back Ends
• Inserted trace points using RDTSC
  ๏ TSC is consistent across cores and domains
• Trace points along the read path (a sketch of the timestamp read follows the list):
  1. Just before issuing read()
  2. On SyS_read()
  3. On blkfront's do_blkif_request()
  4. Just before notify_remote_via_irq()
  5. On blkback's xen_blkif_be_int()
  6. On blkback's xen_blkif_schedule()
  7. Just before blk_finish_plug()
  8. On end_block_io_op()
  9. Just before notify_remote_via_irq()
  10. On blkif_interrupt()
  11. Just before __blk_end_request_all()
  12. Just after returning from read()
[Diagram: points 1-4 and 10-12 are taken in domU (user process, libc, block layer, blkfront); points 5-9 are taken in dom0 (blkback, block layer, device driver, BD).]

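The deck does not show the trace-point helper itself. On the user-space side (points 1 and 12), the timestamp read could look like the sketch below, which assumes x86 and the GCC/Clang __rdtsc() intrinsic; the kernel-side points would use the kernel's own TSC helpers instead, and the real patches also recorded which of the twelve points fired.

```c
/* Illustrative RDTSC-based trace point (not the actual patch).
 * Because the TSC is consistent across cores and domains, timestamps
 * taken in domU and dom0 can be compared directly. */
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() */

static inline uint64_t trace_now(void)
{
    return __rdtsc();    /* raw cycle counter */
}

/* Example use around points 1 and 12 of the list above:
 *   uint64_t t1  = trace_now();
 *   read(fd, buf, 4096);
 *   uint64_t t12 = trace_now();
 */
```
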
Measurement Over Different Back Ends
• Initial tests showed there is a "warm up" time
• Simply inserting printk()s affects the hot path
• Used a trace buffer instead
  ๏ In the kernel, trace_printk()
  ๏ In user space, a hacked-up buffer and a signal handler to dump its contents
• Ran 100 read requests
  ๏ One immediately after the other (requests were sequential)
  ๏ Used IO Depth = 1 (only one in-flight request at a time)
• Sorted the times, removed the 10 (10%) fastest and slowest runs (see the sketch below)
• Repeated the experiment 10 times and averaged the results

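A sketch of the trimming step described above, for one batch of samples: sort, drop the 10 fastest and 10 slowest, and average the remaining 80. This is illustrative only; the original post-processing scripts are not part of the deck.

```c
/* Trimmed mean over one batch of latency samples (cycles or ns). */
#include <stdint.h>
#include <stdlib.h>

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* n = 100 and trim = 10 in the experiments described above. */
static double trimmed_mean(uint64_t *samples, size_t n, size_t trim)
{
    qsort(samples, n, sizeof(*samples), cmp_u64);   /* sort the batch */
    double sum = 0.0;
    for (size_t i = trim; i < n - trim; i++)        /* drop extremes */
        sum += (double)samples[i];
    return sum / (double)(n - 2 * trim);            /* average the rest */
}
```
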
Measurements on 3.11.0 without Persistent Grants
[Latency breakdown chart. Annotations:]
• Time spent on device
• Actual overhead of mapping and unmapping and transferring data back to user space at the end

Measurements on 3.11.0 with Persistent Grants
[Latency breakdown chart. Annotations:]
• Time spent on device
• Time spent copying data out of persistently granted memory

Measurement Over Different Back Ends
• The penalty of copying can be worth taking depending on other factors:
  ๏ Number of dom0 vCPUs (TLB flushes)
  ๏ Number of concurrent VMs performing IO (contention on grant tables)
• Ideally, blkfront should support both data paths
• Administrators can profile their workloads and decide what to provide

Proposals for Improving
What else can we do to minimise the overhead?

New Ideas
• How can we reduce the processing required to virtualise I/O?

New Ideas
• Persistent Grants
  ๏ Issue: grant mapping (and unmapping) is expensive
  ๏ Concept: the back end keeps grants, the front end copies I/O data to those pages (a schematic sketch of the two data paths follows below)
  ๏ Status: currently implemented in 3.11.0 and already supported in qemu-qdisk
• Indirect I/O
  ๏ Issue: the blkif protocol limits requests to 11 segments of 4 KiB per I/O ring
  ๏ Concept: use segments as indirect mappings of other segments
  ๏ Status: currently implemented in 3.11.0

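To make the persistent-grant trade-off concrete, here is a heavily simplified, front-end-side view of the two data paths. This is not the blkfront implementation; the structure and helper names are invented for illustration, and only the local grant_ref_t typedef stands in for the real Xen type.

```c
/* Schematic sketch of the persistent-grant write path (illustrative only). */
#include <string.h>

typedef unsigned int grant_ref_t;    /* stand-in for Xen's grant reference type */

struct persistent_gnt {
    grant_ref_t ref;                 /* granted to the back end once, at ring setup */
    void       *page;                /* front end keeps its own mapping of the page */
};

/* Classic path: grant the I/O buffer itself, so the back end must map and
 * later unmap it for every request (the expensive part, with TLB flushes).
 * Persistent-grant path: reuse a pre-granted page and pay a memcpy instead. */
static grant_ref_t prepare_write_segment(struct persistent_gnt *pg,
                                         const void *data, size_t len)
{
    memcpy(pg->page, data, len);     /* copy replaces per-request map/unmap */
    return pg->ref;                  /* the ring request carries the same ref every time */
}
```
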
New Ideas
• Avoid TLB flushes altogether (Malcolm Crossley)
  ๏ Issue:
    • When unmapping, we need to flush the TLB on all cores
    • But in the storage data path, the back end doesn't really access the data
  ๏ Concept: check whether granted pages have been accessed
  ๏ Status: early prototypes already developed
• Only grant map pages in the fault handler (David Vrabel)
  ๏ Issue:
    • Mapping/unmapping is expensive
    • But in the storage data path, the back end doesn't really access the data
  ๏ Concept: have a fault handler for the mapping, triggered only when needed
  ๏ Status: idea still being conceived

New Ideas
• Split Rings
  ๏ Issue:
    • the blkif protocol supports 32 slots per ring for requests or responses
    • this limits the total number of outstanding requests (unless multi-page rings are used)
    • this makes inefficient use of CPU caching (both domains write to the same page)
  ๏ Concept: use one ring for requests and another for responses (see the layout sketch below)
  ๏ Status: early prototypes already developed
• Multi-queue Approach
  ๏ Issue: Linux's block layer does not scale well across NUMA nodes
  ๏ Concept: allow device drivers to register a request queue per core
  ๏ Status: scheduled for kernel 3.12

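A purely illustrative layout for the split-ring idea (not the actual blkif headers): giving each direction its own page means the front end mostly writes the request ring and the back end mostly writes the response ring, instead of both domains updating indices in the same shared page.

```c
/* Illustrative split-ring layout; field names and slot sizes are invented. */
#include <stdint.h>

#define RING_SLOTS 32                    /* slots per ring, as in the classic protocol */

struct req_ring {                        /* written mainly by the front end */
    uint32_t req_prod, req_event;
    uint8_t  slots[RING_SLOTS][64];      /* request descriptors (size illustrative) */
};

struct rsp_ring {                        /* written mainly by the back end */
    uint32_t rsp_prod, rsp_event;
    uint8_t  slots[RING_SLOTS][64];      /* response descriptors (size illustrative) */
};
/* Each structure would live in its own granted page. */
```
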
e-mail: felipe.franciosi@citrix.com
freenode: felipef #xen-api
twitter: @franciozzy

