Today Xen schedules guest virtual cpus on all available physical cpus independently of each other. Recent security issues on modern processors (e.g. L1TF) require hyperthreading to be turned off for best security, in order to avoid leaking information from one hyperthread to the other. One way to avoid having to turn off hyperthreading is to only ever schedule virtual cpus of the same guest on one physical core at the same time. This is called core scheduling.
This presentation shows results from the effort to implement core scheduling in the Xen hypervisor. The basic modifications in Xen are presented and performance numbers with core scheduling active are shown.
Today: Cpu scheduling
• On each physical cpu the scheduler decides which vcpu is to be scheduled next
• Other physical cpus are taken into account only via the overall system load
• Each vcpu can run on any physical cpu, within some constraints (cpupools, pinning), as sketched below
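A minimal sketch of this model, for illustration only (for_each_runnable_vcpu is a hypothetical iterator; cpumask_test_cpu, cpu_hard_affinity and idle_vcpu exist in Xen):

    /* Each physical cpu runs the scheduler independently and picks any
     * runnable vcpu its constraints (pinning, cpupool) allow. */
    struct vcpu *pick_next(unsigned int pcpu)
    {
        struct vcpu *v;

        for_each_runnable_vcpu ( v )
            if ( cpumask_test_cpu(pcpu, v->cpu_hard_affinity) )
                return v;               /* first vcpu allowed on this cpu */

        return idle_vcpu[pcpu];         /* nothing runnable: go idle */
    }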
Core scheduling
• The scheduler no longer acts on (v)cpus, but on (v)cores
• All siblings (threads) of a core are scheduled together; scheduling of all siblings of a single core is synchronized
• The relation between vcores and vcpus is fixed, in contrast to “core aware scheduling”, where it might change
• Pinning and cpupools affect whole cores (so e.g. pinning a vcpu to a specific physical cpu will pin all vcpus of the same vcore)
Cpu bugs
• Several cpu bugs (e.g. L1TF, MDS) enable side channel attacks stealing data from threads of the same core
• Core scheduling prevents cross-domain side channel attacks
• This lays the groundwork for safe operation with SMT enabled
Fairness of accounting
• Threads running on the same core share multiple resources (execution units, TLB, caches), so they influence each other’s performance
• With cpu scheduling a guest’s cpu performance depends on the host load, not on the guest’s own load
• If the owner of the guest has to pay for the consumed cpu time, the price will depend on host load
Guest side optimizations
• The guest might decide to run only one thread of a core, making all resources of that core available to a single thread
• Some threads might be able to benefit from shared resources when running on the same core
• The guest might want to mitigate cpu bugs via core-aware scheduling of its own
Decoupling scheduling from cpus
• In the schedulers switch:
‒ vcpu → sched_unit
‒ pcpu → sched_resource
• Scheduling decisions are the same as before
• The amount of needed changes in sched_*.c is rather high, but mostly mechanical
• schedule.c acts as an abstraction layer for the rest of the hypervisor
• A sched_resource can be a cpu, a core or a socket (see the sketch below)
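A simplified sketch of the two new entities (the field layout is illustrative rather than the real Xen structs; cpumask_t is Xen’s cpu bitmap type):

    struct vcpu;
    struct domain;

    /* The unit the scheduler places: all vcpus of a guest that must run
     * on the same physical core (or other resource) at the same time. */
    struct sched_unit {
        struct domain *domain;       /* all vcpus belong to one guest     */
        struct vcpu *vcpu_list;      /* vcpus scheduled together          */
        unsigned int unit_id;
    };

    /* The entity units are placed on: one or more physical cpus acting
     * as a single scheduling target. */
    struct sched_resource {
        cpumask_t cpus;              /* a single cpu, a core or a socket  */
        unsigned int master_cpu;     /* cpu driving scheduling decisions  */
        struct sched_unit *curr;     /* unit currently running here       */
    };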
Syncing of context switches
• When switching vcpus on a cpu, all other vcpus of the same sched_unit must be switched on the other cpus of the sched_resource
• Syncing is done in 2 steps:
‒ After the decision is made to switch, all other cpus of the sched_resource must rendezvous
‒ The context switch is performed on all affected cpus in parallel; after that all cpus rendezvous again before they proceed
• At no time are two vcpus of different sched_units running in guest mode on the same sched_resource
Syncing of context switches
1. Schedule event on one cpu
2. Take the schedule lock, call the scheduler to select the next sched_unit to run
3. If no change: drop the lock and exit; otherwise signal the other cpus of the sched_resource to process a schedule_slave event, then drop the lock and wait for the others to join
4. The last one to join switches the sched_unit on the sched_resource and frees the others to continue
5. On each cpu of the sched_resource the context is switched to the new vcpu
6. Wait on each cpu until all context switches are done, then leave schedule handling (see the sketch below)
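A hedged sketch of the two rendezvous phases, using C11 atomics for illustration (the real Xen code uses counters in struct sched_resource plus softirqs; all names below are illustrative, and the counters must be preset by the cpu raising the schedule event):

    #include <stdatomic.h>

    struct rendezvous {
        atomic_uint in;     /* preset to ncpus: cpus still to arrive      */
        atomic_uint out;    /* preset to 0: set once the unit is switched */
    };

    /* Run on every cpu of the sched_resource once a switch was decided. */
    void sync_context_switch(struct rendezvous *r, unsigned int ncpus,
                             void (*switch_unit)(void),
                             void (*switch_vcpu)(void))
    {
        /* Phase 1: rendezvous until all sibling cpus have arrived. */
        if ( atomic_fetch_sub(&r->in, 1) == 1 )
        {
            switch_unit();                  /* last to join switches unit */
            atomic_store(&r->out, ncpus);   /* free the others            */
        }
        else
            while ( atomic_load(&r->out) == 0 )
                ;                           /* wait for the unit switch   */

        /* Phase 2: each cpu switches to its own new vcpu's context. */
        switch_vcpu();

        /* Rendezvous again: nobody proceeds before all are done. */
        atomic_fetch_sub(&r->out, 1);
        while ( atomic_load(&r->out) != 0 )
            ;
    }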
Idle vcpus
• A guest vcpu becoming idle results in the idle vcpu being scheduled
• A synchronized context switch is only needed if the scheduler decides to switch sched_units
• Switching between idle and guest vcpus without a sched_unit switch requires no change of address space (no change on x86)
• An idle vcpu running in a guest sched_unit won’t run tasklets or do livepatching, in order to avoid doing work for other guests (see the guard sketched below)
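A sketch of that guard in the idle loop (illustrative: is_real_idle_unit and wait_for_interrupt are hypothetical names here, while do_tasklet and check_for_livepatch_work exist in Xen):

    void idle_loop(struct vcpu *v)
    {
        for ( ; ; )
        {
            /* Only a genuine idle unit may do work on behalf of others;
             * an idle vcpu merely filling a slot of a guest sched_unit
             * must not run activities of other guests on this core. */
            if ( is_real_idle_unit(v) )
            {
                do_tasklet();
                check_for_livepatch_work();
            }
            wait_for_interrupt();
        }
    }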
Cpupools
• Only complete sched_resources can be moved from/to cpupools (see the check sketched below)
• For easy support of cpu hotplug, cpus not in any cpupool are not handled in units of sched_resources, but individually
• At system boot cpupool0 is only created after all cpus have been brought online, as otherwise the number of cpus per sched_resource isn’t known yet
• Cpus not in any pool are no longer handled by the default scheduler, but by the new idle_scheduler
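The constraint on moving cpus could be expressed as follows (cpupool_move_allowed is a hypothetical helper; cpumask_subset and cpumask_intersects exist in Xen):

    /* A cpu may only change cpupool together with all of its siblings,
     * i.e. as a complete sched_resource: either all of the resource's
     * cpus move, or none of them. */
    bool cpupool_move_allowed(const cpumask_t *moving_cpus,
                              const cpumask_t *resource_cpus)
    {
        return cpumask_subset(resource_cpus, moving_cpus) ||
               !cpumask_intersects(resource_cpus, moving_cpus);
    }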
Cpu hotplug
• SMT on/off switching at runtime is disabled while core scheduling is active
• Offlining a cpu will now first remove the related sched_resource from cpupool0 if necessary
• Onlining a cpu will add it to cpupool0 only once the complete sched_resource is online
Test basics
• All tests were done by Dario Faggioli (SUSE)
• Test machine was a 4-core system with HT (8 cpus)
• Dom0 always with 8 vcpus, HVM domU with 4 or 8 vcpus
• Scenarios (all results compared to “without patches, HT on”; positive numbers are better):
‒ Without patches (HT on/off)
‒ sched-gran=cpu (HT on/off)
‒ sched-gran=core
• Benchmarks:
‒ Stream (memory benchmark, 4 tasks in parallel)
‒ Kernbench (kernel build with 2, 4, 8 or 16 threads)
‒ Hackbench (communication via pipes, machine saturated)
‒ Mutilate (load generator for memcached)
‒ Netperf (TCP/UDP/UNIX, two communicating tasks)
‒ Pgioperf (postgres micro-benchmark)
Patches already committed
• Removing cpu on/offlining hooks in schedule.c and cpupool.c for suspend/resume handling
• Small correction in sched_credit2.c for SMT-aware scheduling (needed for core scheduling)
• Inline wrappers for calling per-scheduler functions from schedule.c
• Test for mandatory per-scheduler functions instead of ASSERT()
• Interface change of the sched_switch_sched() per-scheduler function, avoiding code duplication
Patches in review
• V1 of the (rest-)series, currently 57 patches
• 40 files changed, 3704 insertions(+), 2299 deletions(-)
• Only small parts of V1 have been reviewed so far (thanks to all reviewers!)
• All comments on RFC-V1 and RFC-V2 have been addressed; some required major reworks (partially due to renaming requests, partially conceptual ones)
Future plans
• Rework of the scheduler related files (move to common/sched/, making sched-if.h really scheduler private)
• ARM support
• Support of per-cpupool scheduling granularity
• Support of per-cpupool SMT setting
• Sane topology reporting to the guests
• Add hypercall syncing between threads for full L1TF/MDS mitigation (will probably kill performance)