Today Xen schedules guest virtual cpus on all available physical cpus independently of each other. Recent security issues on modern processors (e.g. L1TF) require hyperthreading to be turned off for best security, in order to avoid leaking information from one hyperthread to the other. One way to avoid having to turn off hyperthreading is to only ever schedule virtual cpus of the same guest on one physical core at the same time. This is called core scheduling.
This presentation shows results from the effort to implement core scheduling in the Xen hypervisor. The basic modifications in Xen are presented and performance numbers with core scheduling active are shown.
Today: Cpu scheduling
• On each physical cpu the scheduler decides which vcpu is to be scheduled next
• Other physical cpus are taken into account only via the overall system load
• Each vcpu can run on any physical cpu, within some constraints (cpupools, pinning), as sketched below
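A minimal sketch of this model, for illustration only (for_each_runnable_vcpu is a hypothetical iterator; cpumask_test_cpu, cpu_hard_affinity and idle_vcpu exist in Xen):

    /* Each physical cpu runs the scheduler independently and picks any
     * runnable vcpu its constraints (pinning, cpupool) allow. */
    struct vcpu *pick_next(unsigned int pcpu)
    {
        struct vcpu *v;

        for_each_runnable_vcpu ( v )
            if ( cpumask_test_cpu(pcpu, v->cpu_hard_affinity) )
                return v;               /* first vcpu allowed on this cpu */

        return idle_vcpu[pcpu];         /* nothing runnable: go idle */
    }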
Core scheduling
• The scheduler no longer acts on (v)cpus, but on (v)cores
• All siblings (threads) of a core are scheduled together; scheduling of all siblings of a single core is synchronized
• The relation between vcores and vcpus is fixed, in contrast to “core aware scheduling”, where it might change
• Pinning and cpupools affect whole cores (so e.g. pinning a vcpu to a specific physical cpu will pin all vcpus of the same vcore)
Cpu bugs
• Several cpu bugs (e.g. L1TF, MDS) enable side channel attacks stealing data from threads of the same core
• Core scheduling prevents cross-domain side channel attacks
• This lays the groundwork for safe operation with SMT enabled
Fairness of accounting
• Threads running on the same core share multiple resources (execution units, TLB, caches), so they influence each other’s performance
• With cpu scheduling a guest’s cpu performance depends on the host load, not on the guest’s own load
• If the owner of the guest has to pay for the consumed cpu time, the price will depend on host load
Guest side optimizations
• The guest might decide to run only one thread of a core, making all resources of that core available to a single thread
• Some threads might be able to benefit from shared resources when running on the same core
• The guest might want to mitigate cpu bugs via core-aware scheduling of its own
Decoupling scheduling from cpus
• In the schedulers switch:
‒ vcpu → sched_unit
‒ pcpu → sched_resource
• Scheduling decisions are the same as before
• The amount of needed changes in sched_*.c is rather high, but mostly mechanical
• schedule.c acts as an abstraction layer for the rest of the hypervisor
• A sched_resource can be a cpu, a core or a socket (see the sketch below)
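A simplified sketch of the two new entities (the field layout is illustrative rather than the real Xen structs; cpumask_t is Xen’s cpu bitmap type):

    struct vcpu;
    struct domain;

    /* The unit the scheduler places: all vcpus of a guest that must run
     * on the same physical core (or other resource) at the same time. */
    struct sched_unit {
        struct domain *domain;       /* all vcpus belong to one guest     */
        struct vcpu *vcpu_list;      /* vcpus scheduled together          */
        unsigned int unit_id;
    };

    /* The entity units are placed on: one or more physical cpus acting
     * as a single scheduling target. */
    struct sched_resource {
        cpumask_t cpus;              /* a single cpu, a core or a socket  */
        unsigned int master_cpu;     /* cpu driving scheduling decisions  */
        struct sched_unit *curr;     /* unit currently running here       */
    };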
Syncing of context switches
• When switching vcpus on a cpu, all other vcpus of the same sched_unit must be switched on the other cpus of the sched_resource
• Syncing is done in 2 steps:
‒ After the decision is made to switch, all other cpus of the sched_resource must rendezvous
‒ The context switch is performed on all affected cpus in parallel; after that all cpus rendezvous again before they proceed
• At no time are two vcpus of different sched_units running in guest mode on the same sched_resource
Syncing of context switches
1. Schedule event on one cpu
2. Take the schedule lock, call the scheduler to select the next sched_unit to run
3. If no change: drop the lock and exit; otherwise signal the other cpus of the sched_resource to process a schedule_slave event, then drop the lock and wait for the others to join
4. The last one to join switches the sched_unit on the sched_resource and frees the others to continue
5. On each cpu of the sched_resource the context is switched to the new vcpu
6. Wait on each cpu until all context switches are done, then leave schedule handling (see the sketch below)
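A hedged sketch of the two rendezvous phases, using C11 atomics for illustration (the real Xen code uses counters in struct sched_resource plus softirqs; all names below are illustrative, and the counters must be preset by the cpu raising the schedule event):

    #include <stdatomic.h>

    struct rendezvous {
        atomic_uint in;     /* preset to ncpus: cpus still to arrive      */
        atomic_uint out;    /* preset to 0: set once the unit is switched */
    };

    /* Run on every cpu of the sched_resource once a switch was decided. */
    void sync_context_switch(struct rendezvous *r, unsigned int ncpus,
                             void (*switch_unit)(void),
                             void (*switch_vcpu)(void))
    {
        /* Phase 1: rendezvous until all sibling cpus have arrived. */
        if ( atomic_fetch_sub(&r->in, 1) == 1 )
        {
            switch_unit();                  /* last to join switches unit */
            atomic_store(&r->out, ncpus);   /* free the others            */
        }
        else
            while ( atomic_load(&r->out) == 0 )
                ;                           /* wait for the unit switch   */

        /* Phase 2: each cpu switches to its own new vcpu's context. */
        switch_vcpu();

        /* Rendezvous again: nobody proceeds before all are done. */
        atomic_fetch_sub(&r->out, 1);
        while ( atomic_load(&r->out) != 0 )
            ;
    }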
Idle vcpus
• A guest vcpu becoming idle results in the idle vcpu being scheduled
• A synchronized context switch is only needed if the scheduler decides to switch sched_units
• Switching between idle and guest vcpus without a sched_unit switch requires no change of address space (no change on x86)
• An idle vcpu running in a guest sched_unit won’t run tasklets or do livepatching, in order to avoid doing work for other guests (see the guard sketched below)
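A sketch of that guard in the idle loop (illustrative: is_real_idle_unit and wait_for_interrupt are hypothetical names here, while do_tasklet and check_for_livepatch_work exist in Xen):

    void idle_loop(struct vcpu *v)
    {
        for ( ; ; )
        {
            /* Only a genuine idle unit may do work on behalf of others;
             * an idle vcpu merely filling a slot of a guest sched_unit
             * must not run activities of other guests on this core. */
            if ( is_real_idle_unit(v) )
            {
                do_tasklet();
                check_for_livepatch_work();
            }
            wait_for_interrupt();
        }
    }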
Cpupools
• Only complete sched_resources can be moved from/to cpupools (see the check sketched below)
• For easy support of cpu hotplug, cpus not in any cpupool are not handled in units of sched_resources, but individually
• At system boot cpupool0 is only created after all cpus have been brought online, as otherwise the number of cpus per sched_resource isn’t known yet
• Cpus not in any pool are no longer handled by the default scheduler, but by the new idle_scheduler
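The constraint on moving cpus could be expressed as follows (cpupool_move_allowed is a hypothetical helper; cpumask_subset and cpumask_intersects exist in Xen):

    /* A cpu may only change cpupool together with all of its siblings,
     * i.e. as a complete sched_resource: either all of the resource's
     * cpus move, or none of them. */
    bool cpupool_move_allowed(const cpumask_t *moving_cpus,
                              const cpumask_t *resource_cpus)
    {
        return cpumask_subset(resource_cpus, moving_cpus) ||
               !cpumask_intersects(resource_cpus, moving_cpus);
    }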
Cpu hotplug
• SMT on/off switching at runtime is disabled while core scheduling is active
• Offlining a cpu will now first remove the related sched_resource from cpupool0 if necessary
• Onlining a cpu will add it to cpupool0 only once the complete sched_resource is online
Test basics
• All tests were done by Dario Faggioli (SUSE)
• Test machine was a 4-core system with HT (8 cpus)
• Dom0 always with 8 vcpus, HVM domU with 4 or 8 vcpus
• Scenarios (all results compared to “without patches, HT on”; positive numbers are better):
‒ Without patches (HT on/off)
‒ sched-gran=cpu (HT on/off)
‒ sched-gran=core
• Benchmarks:
‒ Stream (memory benchmark, 4 tasks in parallel)
‒ Kernbench (kernel build with 2, 4, 8 or 16 threads)
‒ Hackbench (communication via pipes, machine saturated)
‒ Mutilate (load generator for memcached)
‒ Netperf (TCP/UDP/UNIX, two communicating tasks)
‒ Pgioperf (postgres micro-benchmark)
Patches already committed
• Removing cpu on/offlining hooks in schedule.c and cpupool.c for suspend/resume handling
• Small correction in sched_credit2.c for SMT-aware scheduling (needed for core scheduling)
• Inline wrappers for calling per-scheduler functions from schedule.c
• Test for mandatory per-scheduler functions instead of ASSERT()
• Interface change of the sched_switch_sched() per-scheduler function, avoiding code duplication
Patches in review
• V1 of the (rest-)series, currently 57 patches
• 40 files changed, 3704 insertions(+), 2299 deletions(-)
• Only small parts of V1 have been reviewed so far (thanks to all reviewers!)
• All comments on RFC-V1 and RFC-V2 have been addressed; some required major reworks (partially due to renaming requests, partially conceptual ones)
Future plans
• Rework of the scheduler related files (move to common/sched/, making sched-if.h really scheduler private)
• ARM support
• Support of per-cpupool scheduling granularity
• Support of per-cpupool SMT setting
• Sane topology reporting to the guests
• Add hypercall syncing between threads for full L1TF/MDS mitigation (will probably kill performance)