The document discusses power-efficient scheduling in the Linux kernel. It proposes moving power management capabilities from cpufreq and cpuidle into the scheduler to allow it to make more informed decisions. Key points include:
- The scheduler currently lacks power/energy information to optimize task placement.
- Cpufreq and cpuidle are not well coordinated with the scheduler.
- A power driver would provide power/topology data to the scheduler.
- Feedback from a kernel summit highlighted the need for use cases and benchmarks to evaluate proposals.
- Patches have been prepared to implement task placement based on CPU suitability.
2. Topics Overview
▪ Timeline
▪ Towards a unified scheduler driven power policy
▪ Task placement based on CPU suitability
▪ Kernel Summit Feedback
▪ Status
▪ Questions?
3. Timeline
▪ May – Ingo's response to the task packing patches from VincentG reignited discussions on power-aware scheduling
▪ Early July – Posted proposed patches for a power-aware scheduler based on a power driver running in conjunction with the current scheduler
  ▪ Avoid big changes to the already complex current scheduler
  ▪ Migrate functionality back into the scheduler once the kinks are worked out
▪ Sept – At Plumbers there was relatively broad agreement with the approach
▪ October – Morten reposts the patchset with refined APIs between the power driver and the scheduler
▪ LKS – Reopened the discussion. More on this later
4. Unified scheduler driven power policy – Why?
▪ big.LITTLE MP patches are tested, stable and performant
  ▪ Take the principles learnt during the implementation and apply them to an upstream solution
▪ Existing power management frameworks (cpufreq, cpuidle) are not coordinated with the scheduler
  ▪ E.g. the scheduler decides which cpu to wake up or idle without any knowledge of C-states; cpuidle is left to do its best based on these uninformed choices.
▪ The scheduler is the most obvious place to coordinate power management, as it has the best view of the overall system load.
  ▪ The scheduler knows when tasks are scheduled and decides the load balance. cpufreq has to wait until it can see the result of the scheduler's decisions before it can react.
▪ Task packing in the scheduler needs P- and C-state information to make informed decisions.
5. Existing Power Policies
▪ Frequency scaling: cpufreq
  ▪ Generic governor + platform-specific driver
  ▪ Decides the target frequency based on overall cpu load.
▪ Idle state selection: cpuidle
  ▪ Generic governor + platform-specific driver
  ▪ Attempts to predict idle time when cpus enter idle.
▪ Scheduler:
  ▪ Completely generic and unaware of cpufreq and cpuidle policies.
  ▪ Determines when and where a task runs, i.e. on which cpu.
▪ Task placement considering CPU suitability is required.
6. Existing Power Policies

[Diagram: per-CPU run-queues (cpu0, cpu1) with three independent policies acting on them: the scheduler policy (load balance, idle), the cpufreq policy (frequency selection from load) and the cpuidle policy (idle state selection); the load input to cpufreq is shown for pre-3.11 and 3.11 kernels.]

▪ No coordination between power policies to avoid conflicting/suboptimal decisions.
▪ Is it a problem?
7. Issues
▪ Scheduler->cpufreq->scheduler cpu load feedback loop
  ▪ From 3.11 the scheduler uses tracked load for load-balancing.
  ▪ Tracked load is affected by frequency scaling: a lower frequency leads to a higher tracked load for the same task.
▪ Hindering new power-aware scheduling features
  ▪ Task packing: needs feedback from cpufreq to determine when cpus are full.
  ▪ Topology-aware task placement: needs topology information inside the scheduler to determine the optimal cpus to use when the system is partially loaded.
  ▪ Heterogeneous systems (big.LITTLE): need topology information and accurate load tracking.
▪ Thermal also needs to be considered
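The feedback loop above can be made concrete: a task that runs at half frequency takes twice as long, so its tracked runnable time (and hence its load) doubles even though its real demand is unchanged. A minimal user-space sketch of frequency-invariant load scaling, which breaks the loop by normalizing tracked load to the maximum frequency (the function name and units are illustrative, not the kernel's actual API):

```c
#include <stdint.h>

/* Scale a raw load contribution by cur_freq/max_freq so that the
 * tracked load becomes invariant to cpufreq decisions. At half
 * frequency the same task accrues twice the runnable time; scaling
 * by cur/max cancels that out. */
static uint64_t scale_load_by_freq(uint64_t raw_contrib,
                                   uint64_t cur_freq_khz,
                                   uint64_t max_freq_khz)
{
        return raw_contrib * cur_freq_khz / max_freq_khz;
}
```

With this normalization, cpufreq dropping the frequency no longer inflates the load signal that the scheduler's load balancer then reacts to.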
8. Power scheduler proposal

[Diagram: proposed architecture in three parts.]
▪ Scheduler (fair.c): sched_domain hierarchy (generic topology, + new generic info for packing, heterogeneous, ...), load balance algorithms (+ packing, + P- and C-state aware, + heterogeneous), load tracking (+ scale invariant), "important tasks" cgroup
▪ Power framework (power.c): performance state selection, sleep state selection, existing policy algorithms, helper function library, driver registration, abstract power driver/topology interface
▪ Power driver (drivers/*/?.c): platform HW driver, detailed platform topology, platform perf. and energy monitoring; library (drivers/power/?.c)
9. Task placement based on CPU suitability
▪ Part of the power scheduler proposal
  ▪ sched_domain hierarchy
  ▪ Load balance algorithm (heterogeneous)
▪ Existing big.LITTLE MP patches
  ▪ Definition: CFS scheduler optimization for heterogeneous platforms. Attempts to select task affinity to optimize power and performance based on task load and CPU type
  ▪ Hosted at http://git.linaro.org/gitweb?p=arm/big.LITTLE/mp.git
  ▪ Co-exists with existing (CFS) scheduler code
    ▪ Guarded by CONFIG_SCHED_HMP
    ▪ Sets up HMP domains as a dependency of the topology code
▪ Implement big.LITTLE MP functionality inside scheduler mainline code
10. Task placement scheduler architectural bricks
1) Additional sched domain data structures
2) Specify sched domain level for task placement
3) Unweighted instantaneous load signal
4) Task placement hook in select task
5) Task placement hook in load balance
6) Task placement idle pull
11. Brick 1: Additional sched domain data structures
▪ big.LITTLE MP:
  ▪ struct hmp_domain

    struct hmp_domain {
            struct cpumask cpus;
            struct cpumask possible_cpus;
            struct list_head hmp_domains;
    };

▪ Task placement based on CPU suitability:
  ▪ Use the existing sched groups at the CPU sched domain level
  ▪ Add task load ranges into CPU, sched domain and group
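Each hmp_domain describes one cluster of CPUs of the same type (e.g. big or LITTLE). A simplified user-space sketch of the membership test, using a plain bitmask in place of the kernel's cpumask/list_head types (the names and representation here are illustrative only):

```c
#include <stdbool.h>

/* Simplified stand-in for struct hmp_domain: each bit in `cpus`
 * marks a CPU belonging to this cluster (e.g. big or LITTLE). */
struct hmp_domain_sketch {
        unsigned int cpus;          /* bit i set => CPU i is in the domain */
        unsigned int possible_cpus; /* CPUs that may ever be in the domain */
};

/* Check whether a CPU currently belongs to a given domain. */
static bool hmp_cpu_in_domain(const struct hmp_domain_sketch *d, int cpu)
{
        return (d->cpus >> cpu) & 1;
}
```

In the real patches the domains live on a linked list (the hmp_domains member above), typically ordered fastest cluster first, so migration decisions can walk from big to little.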
12. Brick 2: Specify sched domain level
▪ big.LITTLE MP:
  ▪ No additional sched domain flag
  ▪ Deletes the SD_LOAD_BALANCE flag at CPU level
▪ Task placement based on CPU suitability:
  ▪ Adds an SD_SUITABILITY flag at CPU level
13. Brick 3: Unweighted instantaneous load signal
▪ big.LITTLE MP & task placement based on CPU suitability:
  ▪ For sched entity and cfs_rq

    struct sched_avg {
            u32 runnable_avg_sum, runnable_avg_period;
            u64 last_runnable_update;
            s64 decay_count;
            unsigned long load_avg_contrib;
            unsigned long load_avg_ratio;
    };

  ▪ sched entity: runnable_avg_sum * NICE_0_LOAD / (runnable_avg_period + 1)
  ▪ cfs_rq: set in [update/enqueue/dequeue]_entity_load_avg()
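The per-entity formula above can be sketched directly. NICE_0_LOAD is 1024 in the kernel, and the +1 in the divisor avoids a division by zero before the first tracking period completes; because the result ignores the task's nice weight, a fully runnable entity saturates near 1024 regardless of priority:

```c
#include <stdint.h>

#define NICE_0_LOAD 1024UL

/* Unweighted instantaneous load: the fraction of the tracked period
 * the entity was runnable, scaled to NICE_0_LOAD independently of the
 * entity's nice level. Mirrors the formula on the slide. */
static unsigned long load_avg_ratio(uint32_t runnable_avg_sum,
                                    uint32_t runnable_avg_period)
{
        return (unsigned long)runnable_avg_sum * NICE_0_LOAD
                / (runnable_avg_period + 1);
}
```

A task runnable half the time yields a ratio near 512; this unweighted signal is what the big.LITTLE MP thresholds compare against, since the weighted load_avg_contrib would conflate demand with priority.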
14. Brick 4: Task placement hook in select task
▪ big.LITTLE MP:
  ▪ Force new non-kernel tasks onto big CPUs until the load stabilises
  ▪ The least loaded CPU of the big cluster is used
▪ Task placement based on CPU suitability:
  ▪ Use the task load ranges of the previous CPU and the (initialized) task load ratio to set the new CPU
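The big.LITTLE MP half of this brick, picking the least loaded CPU of the big cluster for a new task, can be sketched as a simple scan. This is an illustrative user-space version (bitmask for the cluster, array of per-CPU loads), not the kernel code:

```c
/* Return the least loaded CPU among those set in big_mask,
 * or -1 if the mask is empty. Illustrative sketch of the
 * "least loaded CPU of the big cluster" selection. */
static int select_least_loaded(const unsigned long *cpu_load,
                               unsigned int big_mask, int nr_cpus)
{
        int best = -1;
        unsigned long best_load = ~0UL;

        for (int cpu = 0; cpu < nr_cpus; cpu++) {
                if (!((big_mask >> cpu) & 1))
                        continue;       /* not in the big cluster */
                if (cpu_load[cpu] < best_load) {
                        best_load = cpu_load[cpu];
                        best = cpu;
                }
        }
        return best;
}
```

Starting new tasks big and letting them migrate down once their tracked load stabilises avoids penalising bursty tasks whose demand is not yet known.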
15. Brick 5: Task placement hook in load balance
▪ big.LITTLE MP:
  ▪ Completely bypasses load_balance() at CPU level
  ▪ hmp_force_up_migration() in run_rebalance_domains()
    ▪ Calls hmp_up_migration() for migration to a faster CPU
    ▪ Calls hmp_offload_down() for using little CPUs when idle
  ▪ Does not use env->imbalance or an equivalent
▪ Task placement based on CPU suitability:
  ▪ Happens inside load_balance()
    ▪ Find the most unsuitable queue (i.e. find the source run-queue)
    ▪ Move unsuitable tasks (counterpart to load balance)
    ▪ Move one unsuitable task (counterpart to active load balance)
  ▪ Cannot use env->imbalance to control load balance
    ▪ Uses grp_load_avg_ratio/(NICE_0_LOAD * sg->group_weight) <= THRESHOLD
    ▪ Falls back to 'mainline load balance' if the condition is not met (destination group is already overloaded)
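The destination-group condition quoted above can be sketched as an overload check: the group's aggregate unweighted load ratio, normalized by its capacity (NICE_0_LOAD per CPU), must stay below a threshold. The threshold value here is purely illustrative; the slide does not state what THRESHOLD is:

```c
#include <stdbool.h>

#define NICE_0_LOAD   1024UL
#define THRESHOLD_PCT 80        /* illustrative value, expressed in percent */

/* Sketch of the check grp_load_avg_ratio/(NICE_0_LOAD * group_weight)
 * <= THRESHOLD, rearranged to integer arithmetic to avoid the
 * truncating division. */
static bool group_has_capacity(unsigned long grp_load_avg_ratio,
                               unsigned int group_weight)
{
        unsigned long capacity = NICE_0_LOAD * group_weight;

        return grp_load_avg_ratio * 100 <= THRESHOLD_PCT * capacity;
}
```

Only when this check fails, i.e. the destination group is already overloaded, does suitability-based placement fall back to the mainline load-balance path.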
16. Brick 6: Task placement idle pull
▪ big.LITTLE MP:
  ▪ A big CPU pulls a running task above the threshold from a little CPU
▪ Task placement based on CPU suitability:
  ▪ Not necessary, because idle_balance()->load_balance() is not suppressed at CPU level by a missing SD_LOAD_BALANCE flag
  ▪ Idle pull happens inside load_balance()
17. Kernel Summit Feedback
▪ Good to get active discussion
  ▪ First time with everybody in the same room
  ▪ LWN article: "The power-aware scheduling mini-summit"
▪ Key points made
  ▪ Power benchmarks are needed for evaluation.
  ▪ Use-case descriptions are needed to define common ground.
  ▪ The scheduler needs energy/power information to make power-aware scheduling decisions.
  ▪ Power-awareness should be moved into the scheduler.
  ▪ cpufreq is not fit for its purpose and should go away.
  ▪ cpuidle will be integrated into the scheduler, possibly supported by new per-task properties such as latency constraints.
  ▪ Are there ways to replay energy scenarios?
    ▪ Linsched or perf sched
18. Kernel Summit feedback observations
▪ All part of the open-source process
  ▪ Discussions have raised awareness of the issues
  ▪ Maintainers recognise the need for improved power management
  ▪ An iterative approach is necessary, but the steps are clear
▪ Maintainers have a clear server/desktop background
  ▪ The ARM community can help educate this audience on embedded requirements
▪ Benchmarking for power could be hard to do in a simple way
  ▪ Cyclictest- and sysbench-type tests are unlikely to yield realistic results on real systems
  ▪ However, full accuracy is not required
▪ Power models are necessarily complex and often closely guarded secrets
  ▪ Collection and reporting of meaningful metrics is probably sufficient
19. Status
▪ Latest power-aware scheduling patches on LKML
  ▪ https://lkml.org/lkml/2013/10/11/547
▪ Task placement based on CPU suitability patches prepared
  ▪ Proof of concept done
  ▪ Waiting for the right time to post to the lists
▪ Feedback from the Linux Kernel Summit needs to be discussed