Energy Aware Scheduling
Currently, the kernel is not great at making scheduling decisions that keep energy consumption to a minimum. It cannot tell whether increasing the P-state is better than waking up another core and migrating a task there, because no reliable information is available that would allow the scheduler to make sound energy decisions. When conserving energy (or minimizing heat), both throughput and task latency must still be taken into consideration.
Ideas:
- Valid benchmark tools will be accepted into tools/
- Power policies/awareness in the (core) scheduler, possibly tied to sched domains
- Integration of cpufreq and cpuidle (rather: cpufreq must die) into the core scheduler
- When adjusting the P-state, think voltage instead of frequency
- Task packing
- Hardware-level power management
Challenges:
- Correctly measuring the energy consumption for a given workload is a challenge
- HW vendors are reluctant to release actual numbers regarding power usage
- Knowing what the energy cost is for moving between C-states (or P-states)
- Agreement on performance/power/latency tradeoffs
- Avoiding excessive knobs in the kernel
Different architectures have different needs, and the core scheduler is a carefully tuned piece of machinery. A set of tools and expected results should be made available so that changes can be tested and verified by anyone, and regressions tracked down quickly.
Some tools that can be hacked/adjusted:
- extended rt-app
- ftrace / trace-cmd
List of patches submitted to LKML, short description and current state (ACK/NACK):
- to be filled in
(short summary in progress, Henrik is reading/eating bytes)
Added to Ingo's tree as of May 7 2014
An excellent summary by LWN: Another attempt at power-aware scheduling
Currently being discussed; the patch with the most discussion is "arm: topology: Define TC2 sched energy and provide it to scheduler"
Shameless rip from Energy-aware scheduling use-cases and scheduler issues
Description: The current mainline scheduler has no power topology information available to enable it to make energy-aware decisions. The energy cost of running a cpu at different frequencies and the energy cost of waking up another cpu are needed.
Proposed solution: Represent the energy cost of each P-state and C-state in the topology to enable the scheduler to estimate the energy cost of its scheduling decisions. Coupled with P-state awareness, that would allow the scheduler to avoid expensive high P-states.
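To make the proposal above concrete, the sketch below shows what a per-cpu table of P-state costs could look like, and how the scheduler might estimate the power draw of placing a given load on that cpu. All names and numbers here are illustrative assumptions, not kernel API; the structure is only meant to mirror the "capacity plus power per P-state" idea described in the proposal.

```c
#include <stddef.h>

/*
 * Hypothetical per-cpu energy model: one entry per P-state, giving the
 * compute capacity available at that state and the active power cost.
 * The numbers are made up for illustration.
 */
struct pstate_cost {
	unsigned long capacity;	/* compute capacity at this P-state */
	unsigned long power;	/* active power (mW) at this P-state */
};

static const struct pstate_cost table[] = {
	{ .capacity =  256, .power =  50 },	/* lowest P-state */
	{ .capacity =  512, .power = 140 },
	{ .capacity = 1024, .power = 450 },	/* highest P-state */
};

/*
 * Estimate the power cost of a given utilization: pick the lowest
 * P-state whose capacity covers the load, then scale its active power
 * by the busy fraction (util / capacity).
 */
static unsigned long estimate_power(unsigned long util)
{
	size_t i, n = sizeof(table) / sizeof(table[0]);

	for (i = 0; i < n; i++) {
		if (table[i].capacity >= util)
			return table[i].power * util / table[i].capacity;
	}
	/* Load exceeds max capacity: cpu is fully busy at the top P-state. */
	return table[n - 1].power;
}
```

With such a table per cpu, comparing estimate_power() results for candidate placements is what would let the scheduler see when a high P-state on one cpu costs more than waking another.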
Description: While performance is non-deterministic with the mainline scheduler (described in issue 6), this also leads to non-deterministic energy consumption. The first step is to get performance right, but if we don't keep energy in mind, heterogeneous systems will end up with high performance and high energy consumption.
Proposed solution: We know that a simple heuristic that controls task placement based on tracked load works rather well for most smartphone workloads. However, realistic patterns exist that defeat this heuristic.
Description: Currently, the CFS scheduler has no knowledge about frequency scaling. Frequency scaling governors generally try to match the frequency to the load, which means that the idle time has no absolute meaning. The potential spare cpu capacity may be much higher than indicated by the idle time if the cpu is running at a low P-state.
Description: The cost of waking up a cpu, in terms of latency and energy, depends on the idle state the cpu is in. Deeper idle states typically affect more than a single cpu. Waking up a single cpu from such a state is more expensive, as it also affects the idle states of its related cpus.
Proposed solution: Make the scheduler idle-state aware by either moving idle handling into the scheduler, or letting the idle framework (cpuidle) maintain a cpumask of the cheapest cpus to wake up that is accessible to the scheduler. Status:
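A minimal sketch of the "cheapest cpu to wake" idea, assuming cpuidle exposed each cpu's current idle state to the scheduler. The per-state wake-up costs, the cpu count, and the helper name are all illustrative assumptions; a real implementation would use cpuidle's exit latency/energy data and a cpumask rather than a linear scan over a static array.

```c
#include <limits.h>

#define NR_CPUS 4

/* Illustrative wake-up cost per idle state: deeper states cost more. */
static const unsigned int wakeup_cost[] = { 0, 10, 150, 800 };

/* Hypothetical per-cpu snapshot of the current idle state index. */
static const int cpu_idle_state[NR_CPUS] = { 3, 1, 2, 1 };

/*
 * Return the cpu whose current idle state is cheapest to leave.
 * Ties go to the lowest-numbered cpu.
 */
static int cheapest_cpu_to_wake(void)
{
	int cpu, best = -1;
	unsigned int best_cost = UINT_MAX;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		unsigned int cost = wakeup_cost[cpu_idle_state[cpu]];

		if (cost < best_cost) {
			best_cost = cost;
			best = cpu;
		}
	}
	return best;
}
```

Maintaining this as a cpumask updated on idle entry/exit, instead of scanning on every wake-up, is what the cpuidle-side variant of the proposal amounts to.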
Description: For energy-aware task placement decisions, the scheduler would need to estimate the energy impact of scheduling a specific task on a specific cpu. Depending on the resulting P-state, it may be more energy efficient to wake up another cpu (see system 1 in mail 11 for an energy efficiency example).
Proposed solution: Frequency invariance has been proposed before, where the task load is scaled by the cur/max frequency ratio. Another possibility is to use hardware counters, if such are available on the platform. Status:
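The frequency-invariance scaling mentioned above reduces to a one-line calculation: a load tracked while running at a reduced frequency is scaled down by the cur/max ratio, so "50% busy at half speed" reads as 25% of full capacity. The helper below is an illustrative sketch of that arithmetic, not the kernel's actual load-tracking code.

```c
/*
 * Scale a raw tracked load by the current/max frequency ratio, so load
 * is expressed relative to the cpu's full capacity regardless of the
 * P-state it was measured at. Illustrative helper; names are made up.
 */
static unsigned long scale_load_freq(unsigned long raw_load,
				     unsigned long cur_freq,
				     unsigned long max_freq)
{
	return raw_load * cur_freq / max_freq;
}
```

For example, a task that appears 50% busy (raw load 512 out of 1024) on a cpu running at 500 MHz out of 1000 MHz scales to an invariant load of 256, i.e. 25% of full capacity.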
Description: The current mainline scheduler doesn't give optimum performance on heterogeneous systems for workloads with few tasks (#tasks <= #cpus). Using cpu_power (in its current form) to inform the scheduler about the relative compute capacity of the cpus is not sufficient:
- cpu_power is not used on wake-up which means that new tasks may end up anywhere.
- Using cpu_power to represent the relative performance of the cpus, leads to undesirable task balance in common scenarios.
Proposed solution: Status:
Common web browsing use-cases (no embedded videos, but dynamic content is ok) typically exhibit three distinct modes of operation depending on what the browser is doing in relation to the user:
- Page load and rendering
- Display website (user reading, no user interaction)
- Page scrolling
Audio playback is a low load periodic application that has little/no variation in period and load over time. It consists of tasks involved in decoding the audio stream and communicating with audio frameworks and drivers.
Depending on the platform hardware, video is a low to medium load periodic application. There may be some variation in the load depending on the video codec, content, and resolution. The load pattern is roughly synchronized to the video frame-rate (typically 30 FPS). Video playback also includes audio playback as part of the workload.
This description is based on one particular Android game, but similar patterns have been observed for a number of games. Overall, 10+ threads are active and context switches happen very often. Key game engine tasks and graphics driver tasks are scheduled ~200-700 times per second.
Most modern systems use DVFS to save power by slowing down computation throughput when less performance is necessary. The power/performance relation is platform specific. Some platforms may have better energy savings (energy per instruction) than others at low frequencies.
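The DVFS relation above, and the earlier note to "think voltage instead of frequency", come down to the standard dynamic power model P = C * V^2 * f: lowering frequency alone also lowers throughput, so energy per unit of work is roughly unchanged, while a frequency drop that permits a voltage drop cuts energy per cycle quadratically. The toy functions below just illustrate that arithmetic; the capacitance and voltage values are made up.

```c
/*
 * Dynamic power model: P = C_eff * V^2 * f.
 * All parameters are illustrative; real platforms also have static
 * (leakage) power, which this toy model ignores.
 */
static double dynamic_power(double cap_eff, double volt, double freq)
{
	return cap_eff * volt * volt * freq;
}

/*
 * Energy per cycle = P / f = C_eff * V^2: independent of frequency,
 * which is why voltage, not frequency, is what determines how much
 * energy a given amount of work costs.
 */
static double energy_per_cycle(double cap_eff, double volt)
{
	return cap_eff * volt * volt;
}
```

This is also why platforms differ in energy savings at low frequencies: the benefit depends on how much the voltage can be reduced along with the frequency on that particular silicon.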