mirror of
https://github.com/torvalds/linux.git
synced 2026-06-07 22:14:04 +02:00
Merge branch eas-dev into experimental/android-4.19
Bug: 118439987 Bug: 120440300 Change-Id: I46a509df5e3bcb5253717d083f90679e7a72d378 Signed-off-by: Alistair Strachan <astrachan@google.com>
commit
409c3ce064
413
Documentation/scheduler/sched-tune.txt
Normal file
|
|
@ -0,0 +1,413 @@
|
|||
Central, scheduler-driven, power-performance control
|
||||
(EXPERIMENTAL)
|
||||
|
||||
Abstract
|
||||
========
|
||||
|
||||
The topic of a single simple power-performance tunable, that is wholly
|
||||
scheduler centric, and has well defined and predictable properties has come up
|
||||
on several occasions in the past [1,2]. With techniques such as a scheduler
|
||||
driven DVFS [3], we now have a good framework for implementing such a tunable.
|
||||
This document describes the overall ideas behind its design and implementation.
|
||||
|
||||
|
||||
Table of Contents
|
||||
=================
|
||||
|
||||
1. Motivation
|
||||
2. Introduction
|
||||
3. Signal Boosting Strategy
|
||||
4. OPP selection using boosted CPU utilization
|
||||
5. Per task group boosting
|
||||
6. Per-task wakeup-placement-strategy Selection
|
||||
7. Questions and Answers
|
||||
- What about "auto" mode?
|
||||
- What about boosting on a congested system?
|
||||
- How are multiple groups of tasks with different boost values managed?
|
||||
8. References
|
||||
|
||||
|
||||
1. Motivation
|
||||
=============
|
||||
|
||||
Sched-DVFS [3] was a new event-driven cpufreq governor which allowed the
|
||||
scheduler to select the optimal DVFS operating point (OPP) for running a task
|
||||
allocated to a CPU. Later, the cpufreq maintainers introduced a similar
|
||||
governor, schedutil. The introduction of schedutil also enables running
|
||||
workloads at the most energy efficient OPPs.
|
||||
|
||||
However, sometimes it may be desirable to intentionally boost the performance of
|
||||
a workload even if that could imply a reasonable increase in energy
|
||||
consumption. For example, in order to reduce the response time of a task, we
|
||||
may want to run the task at a higher OPP than the one that is actually required
|
||||
by its CPU bandwidth demand.
|
||||
|
||||
This last requirement is especially important if we consider that one of the
|
||||
main goals of the utilization-driven governor component is to replace all
|
||||
currently available CPUFreq policies. Since sched-DVFS and schedutil are event
|
||||
based, as opposed to the sampling driven governors we currently have, they are
|
||||
already more responsive at selecting the optimal OPP to run tasks allocated to
|
||||
a CPU. However, just tracking the actual task load demand may not be enough
|
||||
from a performance standpoint. For example, it is not possible to get
|
||||
behaviors similar to those provided by the "performance" and "interactive"
|
||||
CPUFreq governors.
|
||||
|
||||
This document describes an implementation of a tunable, stacked on top of the
|
||||
utilization-driven governors which extends their functionality to support task
|
||||
performance boosting.
|
||||
|
||||
By "performance boosting" we mean the reduction of the time required to
|
||||
complete a task activation, i.e. the time elapsed from a task wakeup to its
|
||||
next deactivation (e.g. because it goes back to sleep or it terminates). For
|
||||
example, if we consider a simple periodic task which executes the same workload
|
||||
for 5[s] every 20[s] while running at a certain OPP, a boosted execution of
|
||||
that task must complete each of its activations in less than 5[s].
|
||||
|
||||
A previous attempt [5] to introduce such a boosting feature has not been
|
||||
successful mainly because of the complexity of the proposed solution. Previous
|
||||
versions of the approach described in this document exposed a single simple
|
||||
interface to user-space. This single tunable knob allowed the tuning of
|
||||
system wide scheduler behaviours ranging from energy efficiency at one end
|
||||
through to incremental performance boosting at the other end. This first
|
||||
tunable affects all tasks. However, that is not useful for Android products
|
||||
so in this version only a more advanced extension of the concept is provided
|
||||
which uses CGroups to boost the performance of only selected tasks while using
|
||||
the energy efficient default for all others.
|
||||
|
||||
The rest of this document introduces, in more detail, the proposed solution
|
||||
which has been named SchedTune.
|
||||
|
||||
|
||||
2. Introduction
|
||||
===============
|
||||
|
||||
SchedTune exposes a simple user-space interface provided through a new
|
||||
CGroup controller 'schedtune' which provides two power-performance tunables
|
||||
per group:
|
||||
|
||||
/<stune cgroup mount point>/schedtune.prefer_idle
|
||||
/<stune cgroup mount point>/schedtune.boost
|
||||
|
||||
The CGroup implementation permits arbitrary user-space defined task
|
||||
classification to tune the scheduler for different goals depending on the
|
||||
specific nature of the task, e.g. background vs interactive vs low-priority.
|
||||
|
||||
More details are given in section 5.
|
||||
|
||||
2.1 Boosting
|
||||
============
|
||||
|
||||
The boost value is expressed as an integer in the range [-100..0..100].
|
||||
|
||||
A value of 0 (default) configures the CFS scheduler for maximum energy
|
||||
efficiency. This means that sched-DVFS runs the tasks at the minimum OPP
|
||||
required to satisfy their workload demand.
|
||||
|
||||
A value of 100 configures the scheduler for maximum performance, which translates
|
||||
to the selection of the maximum OPP on that CPU.
|
||||
|
||||
A value of -100 configures the scheduler for minimum performance, which translates
|
||||
to the selection of the minimum OPP on that CPU.
|
||||
|
||||
Intermediate values between -100 and 100 can be set to suit other scenarios.
|
||||
For example, to satisfy interactive response or depending on other system events
|
||||
(battery level etc).
|
||||
|
||||
The overall design of the SchedTune module is built on top of "Per-Entity Load
|
||||
Tracking" (PELT) signals and sched-DVFS by introducing a bias on the Operating
|
||||
Performance Point (OPP) selection.
|
||||
|
||||
Each time a task is allocated on a CPU, cpufreq is given the opportunity to tune
|
||||
the operating frequency of that CPU to better match the workload demand. The
|
||||
selection of the actual OPP being activated is influenced by the boost value
|
||||
for the task CGroup.
|
||||
|
||||
This simple biasing approach leverages existing frameworks, which means minimal
|
||||
modifications to the scheduler, and yet it makes it possible to achieve a range of
|
||||
different behaviours all from a single simple tunable knob.
|
||||
|
||||
In EAS schedulers, we use boosted task and CPU utilization for energy
|
||||
calculation and energy-aware task placement.
|
||||
|
||||
2.2 prefer_idle
|
||||
===============
|
||||
|
||||
This is a flag which indicates to the scheduler that userspace would like
|
||||
the scheduler to focus on energy or to focus on performance.
|
||||
|
||||
A value of 0 (default) signals to the CFS scheduler that tasks in this group
|
||||
can be placed according to the energy-aware wakeup strategy.
|
||||
|
||||
A value of 1 signals to the CFS scheduler that tasks in this group should be
|
||||
placed to minimise wakeup latency.
|
||||
|
||||
The value is combined with the boost value - task placement will not be
|
||||
boost aware; however, CPU OPP selection is still boost aware.
|
||||
|
||||
Android platforms typically use this flag for application tasks which the
|
||||
user is currently interacting with.
|
||||
|
||||
|
||||
3. Signal Boosting Strategy
|
||||
===========================
|
||||
|
||||
The whole PELT machinery works based on the value of a few load tracking signals
|
||||
which basically track the CPU bandwidth requirements for tasks and the capacity
|
||||
of CPUs. The basic idea behind the SchedTune knob is to artificially inflate
|
||||
some of these load tracking signals to make a task or RQ appear more demanding
|
||||
than it actually is.
|
||||
|
||||
Which signals have to be inflated depends on the specific "consumer". However,
|
||||
independently from the specific (signal, consumer) pair, it is important to
|
||||
define a simple and possibly consistent strategy for the concept of boosting a
|
||||
signal.
|
||||
|
||||
A boosting strategy defines how the "abstract" user-space defined
|
||||
sched_cfs_boost value is translated into an internal "margin" value to be added
|
||||
to a signal to get its inflated value:
|
||||
|
||||
margin := boosting_strategy(sched_cfs_boost, signal)
|
||||
boosted_signal := signal + margin
|
||||
|
||||
Different boosting strategies were identified and analyzed before selecting the
|
||||
one found to be most effective.
|
||||
|
||||
Signal Proportional Compensation (SPC)
|
||||
--------------------------------------
|
||||
|
||||
In this boosting strategy the sched_cfs_boost value is used to compute a
|
||||
margin which is proportional to the complement of the original signal.
|
||||
When a signal has a maximum possible value, its complement is defined as
|
||||
the delta from the actual value and its possible maximum.
|
||||
|
||||
Since the tunable implementation uses signals which have SCHED_LOAD_SCALE as
|
||||
the maximum possible value, the margin becomes:
|
||||
|
||||
margin := sched_cfs_boost * (SCHED_LOAD_SCALE - signal)
|
||||
|
||||
Using this boosting strategy:
|
||||
- a 100% sched_cfs_boost means that the signal is scaled to the maximum value
|
||||
- each value in the range of sched_cfs_boost effectively inflates the signal in
|
||||
question by a quantity which is proportional to its distance from the maximum value.
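The margin computation above can be sketched in C as follows. This is a minimal illustration only, assuming that the boost is expressed as a percentage in the range [0..100] and that SCHED_LOAD_SCALE is 1024. The helper names spc_margin() and spc_boost() are hypothetical, and the sketch uses a plain division rather than whatever arithmetic the actual implementation uses. Negative boost values are not covered here.

```c
#include <assert.h>

/* Assumed maximum value of a load tracking signal. */
#define SCHED_LOAD_SCALE 1024UL

/*
 * Hypothetical helper: margin := boost% * (SCHED_LOAD_SCALE - signal).
 * boost_pct must be in [0, 100] and signal must not exceed the maximum.
 */
static unsigned long spc_margin(unsigned int boost_pct, unsigned long signal)
{
	assert(boost_pct <= 100);
	assert(signal <= SCHED_LOAD_SCALE);

	return (SCHED_LOAD_SCALE - signal) * boost_pct / 100;
}

/* Hypothetical helper: boosted_signal := signal + margin. */
static unsigned long spc_boost(unsigned int boost_pct, unsigned long signal)
{
	return signal + spc_margin(boost_pct, signal);
}
```

With these assumptions, a signal of 512 boosted by 50% gets a margin of 256, and any signal boosted by 100% saturates at SCHED_LOAD_SCALE.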
|
||||
|
||||
For example, by applying the SPC boosting strategy to the selection of the OPP
|
||||
to run a task it is possible to achieve these behaviors:
|
||||
|
||||
- 0% boosting: run the task at the minimum OPP required by its workload
|
||||
- 100% boosting: run the task at the maximum OPP available for the CPU
|
||||
- 50% boosting: run at the half-way OPP between minimum and maximum
|
||||
|
||||
Which means that, at 50% boosting, a task will be scheduled to run at half of
|
||||
the maximum theoretically achievable performance on the specific target
|
||||
platform.
|
||||
|
||||
A graphical representation of an SPC boosted signal is shown in the
|
||||
following figure where:
|
||||
a) "-" represents the original signal
|
||||
b) "b" represents a 50% boosted signal
|
||||
c) "p" represents a 100% boosted signal
|
||||
|
||||
|
||||
^
|  SCHED_LOAD_SCALE
+-----------------------------------------------------------------+
|pppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
|
|                                             boosted_signal
|                                          bbbbbbbbbbbbbbbbbbbbbbbb
|
|                                            original signal
|                  bbbbbbbbbbbbbbbbbbbbbbbb+----------------------+
|                                          |
|bbbbbbbbbbbbbbbbbb                        |
|                                          |
|                                          |
|                                          |
|                  +-----------------------+
|                  |
|                  |
|                  |
|------------------+
|
|
+----------------------------------------------------------------------->
|
||||
|
||||
The plot above shows a ramped load signal (titled 'original signal') and its
|
||||
boosted equivalent. For each step of the original signal the boosted signal
|
||||
corresponding to a 50% boost is midway from the original signal and the upper
|
||||
bound. Boosting by 100% generates a boosted signal which is always saturated to
|
||||
the upper bound.
|
||||
|
||||
|
||||
4. OPP selection using boosted CPU utilization
|
||||
==============================================
|
||||
|
||||
It is worth calling out that the implementation does not introduce any new load
|
||||
signals. Instead, it provides an API to tune existing signals. This tuning is
|
||||
done on demand and only in scheduler code paths where it is sensible to do so.
|
||||
The new API calls are defined to return either the default signal or a boosted
|
||||
one, depending on the value of sched_cfs_boost. This is a clean and non-invasive
|
||||
modification of the existing code paths.
|
||||
|
||||
The signal representing a CPU's utilization is boosted according to the
|
||||
previously described SPC boosting strategy. To sched-DVFS, this allows a CPU
|
||||
(i.e. CFS run-queue) to appear more used than it actually is.
|
||||
|
||||
Thus, with the sched_cfs_boost enabled we have the following main functions to
|
||||
get the current utilization of a CPU:
|
||||
|
||||
cpu_util()
|
||||
boosted_cpu_util()
|
||||
|
||||
The new boosted_cpu_util() is similar to the first but returns a boosted
|
||||
utilization signal which is a function of the sched_cfs_boost value.
|
||||
|
||||
This function is used in the CFS scheduler code paths where sched-DVFS needs to
|
||||
decide the OPP to run a CPU at.
|
||||
For example, this allows selecting the highest OPP for a CPU which has
|
||||
the boost value set to 100%.
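As a rough illustration of how a boosted utilization can drive OPP selection, the sketch below picks the lowest OPP whose capacity covers the utilization it is given. The function name pick_opp() and the per-OPP capacity table are hypothetical and not taken from the scheduler code. The sketch also ignores any headroom a real governor may add.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: return the index of the lowest OPP whose capacity
 * covers util. caps[] holds one capacity per OPP, sorted in ascending
 * order. When util exceeds every capacity, the highest OPP is returned.
 */
static size_t pick_opp(const unsigned long *caps, size_t nr_opps,
		       unsigned long util)
{
	size_t i;

	assert(nr_opps > 0);

	for (i = 0; i < nr_opps; i++)
		if (caps[i] >= util)
			return i;

	return nr_opps - 1;
}
```

Under these assumptions, a CPU whose utilization has been boosted to the maximum signal value always maps to the last entry of the table, which matches the 100% boost behaviour described above.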
|
||||
|
||||
|
||||
5. Per task group boosting
|
||||
==========================
|
||||
|
||||
On battery powered devices there usually are many background services which are
|
||||
long running and need energy efficient scheduling. On the other hand, some
|
||||
applications are more performance sensitive and require an interactive
|
||||
response and/or maximum performance, regardless of the energy cost.
|
||||
|
||||
To better service such scenarios, the SchedTune implementation has an extension
|
||||
that provides a more fine grained boosting interface.
|
||||
|
||||
A new CGroup controller, namely "schedtune", can be enabled which allows one to
|
||||
define and configure task groups with different boosting values.
|
||||
Tasks that require special performance can be put into separate CGroups.
|
||||
The value of the boost associated with the tasks in this group can be specified
|
||||
using a single knob exposed by the CGroup controller:
|
||||
|
||||
schedtune.boost
|
||||
|
||||
This knob allows the definition of a boost value that is to be used for
|
||||
SPC boosting of all tasks attached to this group.
|
||||
|
||||
The current schedtune controller implementation is really simple and has these
|
||||
main characteristics:
|
||||
|
||||
1) It is only possible to create hierarchies that are one level deep
|
||||
|
||||
The root control group defines the system-wide boost value to be applied
|
||||
by default to all tasks. Its direct subgroups are named "boost groups" and
|
||||
they define the boost value for a specific set of tasks.
|
||||
Further nested subgroups are not allowed since they do not have a sensible
|
||||
meaning from a user-space standpoint.
|
||||
|
||||
2) It is possible to define only a limited number of "boost groups"
|
||||
|
||||
This number is defined at compile time and by default configured to 16.
|
||||
This is a design decision motivated by two main reasons:
|
||||
a) In a real system we do not expect utilization scenarios with more than a few
|
||||
boost groups. For example, a reasonable collection of groups could be
|
||||
just "background", "interactive" and "performance".
|
||||
b) It simplifies the implementation considerably, especially for the code
|
||||
which has to compute the per CPU boosting once there are multiple
|
||||
RUNNABLE tasks with different boost values.
|
||||
|
||||
Such a simple design should allow servicing the main utilization scenarios identified
|
||||
so far. It provides a simple interface which can be used to manage the
|
||||
power-performance of all tasks or only selected tasks.
|
||||
Moreover, this interface can be easily integrated by user-space run-times (e.g.
|
||||
Android, ChromeOS) to implement a QoS solution for task boosting based on task
|
||||
classification, which has been a long standing requirement.
|
||||
|
||||
Setup and usage
|
||||
---------------
|
||||
|
||||
0. Use a kernel with CONFIG_SCHED_TUNE support enabled
|
||||
|
||||
1. Check that the "schedtune" CGroup controller is available:
|
||||
|
||||
root@linaro-nano:~# cat /proc/cgroups
|
||||
#subsys_name hierarchy num_cgroups enabled
|
||||
cpuset 0 1 1
|
||||
cpu 0 1 1
|
||||
schedtune 0 1 1
|
||||
|
||||
2. Mount a tmpfs to create the CGroups mount point (Optional)
|
||||
|
||||
root@linaro-nano:~# sudo mount -t tmpfs cgroups /sys/fs/cgroup
|
||||
|
||||
3. Mount the "schedtune" controller
|
||||
|
||||
root@linaro-nano:~# mkdir /sys/fs/cgroup/stune
|
||||
root@linaro-nano:~# sudo mount -t cgroup -o schedtune stune /sys/fs/cgroup/stune
|
||||
|
||||
4. Create task groups and configure their specific boost value (Optional)
|
||||
|
||||
For example, here we create a "performance" boost group configured to boost
|
||||
all its tasks to 100%
|
||||
|
||||
root@linaro-nano:~# mkdir /sys/fs/cgroup/stune/performance
|
||||
root@linaro-nano:~# echo 100 > /sys/fs/cgroup/stune/performance/schedtune.boost
|
||||
|
||||
5. Move tasks into the boost group
|
||||
|
||||
For example, the following moves the task with PID $TASKPID (and all its
|
||||
threads) into the "performance" boost group.
|
||||
|
||||
root@linaro-nano:~# echo $TASKPID > /sys/fs/cgroup/stune/performance/cgroup.procs
|
||||
|
||||
This simple configuration allows only the threads of the $TASKPID task to run,
|
||||
when needed, at the highest OPP in the most capable CPU of the system.
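The boost value can also be set from a program rather than from the shell. The following is a minimal sketch, assuming the attribute accepts a decimal integer in the range [-100..100] as described in section 2.1. The helper name write_boost() is hypothetical. The path is passed in by the caller, for example /sys/fs/cgroup/stune/performance/schedtune.boost from the steps above.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: write a boost value to a schedtune.boost file.
 * Returns 0 on success, -1 if the value is out of range or the write fails.
 */
static int write_boost(const char *path, int boost)
{
	FILE *f;
	int ok;

	assert(path != NULL);

	if (boost < -100 || boost > 100)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fprintf(f, "%d\n", boost) > 0;
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

Checking the range before opening the file keeps an invalid request from touching the attribute at all.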
|
||||
|
||||
|
||||
6. Per-task wakeup-placement-strategy Selection
|
||||
===============================================
|
||||
|
||||
Many devices have a number of CFS tasks in use which require an absolute
|
||||
minimum wakeup latency, and many tasks for which wakeup latency is not
|
||||
important.
|
||||
|
||||
For touch-driven environments, removing additional wakeup latency can be
|
||||
critical.
|
||||
|
||||
When you use the SchedTune CGroup controller, you have access to a second
|
||||
parameter which allows a group to be marked such that energy-aware task
|
||||
placement is bypassed for tasks belonging to that group.
|
||||
|
||||
prefer_idle=0 (default - use energy-aware task placement if available)
|
||||
prefer_idle=1 (never use energy-aware task placement for these tasks)
|
||||
|
||||
Since the regular wakeup task placement algorithm in CFS is biased for
|
||||
performance, this has the effect of restoring minimum wakeup latency
|
||||
for the desired tasks whilst still allowing energy-aware wakeup placement
|
||||
to save energy for other tasks.
|
||||
|
||||
|
||||
7. Questions and Answers
|
||||
========================
|
||||
|
||||
What about "auto" mode?
|
||||
-----------------------
|
||||
|
||||
The 'auto' mode as described in [5] can be implemented by interfacing SchedTune
|
||||
with some suitable user-space element. This element could use the exposed
|
||||
system-wide or cgroup based interface.
|
||||
|
||||
How are multiple groups of tasks with different boost values managed?
|
||||
---------------------------------------------------------------------
|
||||
|
||||
The current SchedTune implementation keeps track of the boosted RUNNABLE tasks
|
||||
on a CPU. The CPU utilization seen by the scheduler-driven cpufreq governors
|
||||
(and used to select an appropriate OPP) is boosted with a value which is the
|
||||
maximum of the boost values of the currently RUNNABLE tasks in its RQ.
|
||||
|
||||
This allows cpufreq to boost a CPU only while there are boosted tasks ready
|
||||
to run and switch back to the energy efficient mode as soon as the last boosted
|
||||
task is dequeued.
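The bookkeeping described above can be sketched as a per-CPU table of boost groups, each with a count of RUNNABLE tasks. This is an illustration under stated assumptions, not the actual implementation: the structure and function names are hypothetical, the table size of 16 is taken from the default number of boost groups mentioned in section 5, and a CPU with no RUNNABLE boosted tasks is assumed to report a boost of 0.

```c
#include <assert.h>

/* Assumed compile-time limit on the number of boost groups. */
#define BOOSTGROUPS_COUNT 16

/* Hypothetical per-CPU view of one boost group. */
struct boost_group {
	int boost;          /* schedtune.boost value, -100..100 */
	unsigned int tasks; /* RUNNABLE tasks of this group on the CPU */
};

struct cpu_boost_groups {
	struct boost_group group[BOOSTGROUPS_COUNT];
};

/* Account for a task of group idx becoming RUNNABLE on this CPU. */
static void boost_enqueue(struct cpu_boost_groups *bg, int idx)
{
	assert(idx >= 0 && idx < BOOSTGROUPS_COUNT);
	bg->group[idx].tasks++;
}

/* Account for a task of group idx being dequeued from this CPU. */
static void boost_dequeue(struct cpu_boost_groups *bg, int idx)
{
	assert(idx >= 0 && idx < BOOSTGROUPS_COUNT);
	assert(bg->group[idx].tasks > 0);
	bg->group[idx].tasks--;
}

/* Maximum boost among groups with RUNNABLE tasks; 0 if there are none. */
static int cpu_boost_max(const struct cpu_boost_groups *bg)
{
	int i, max = 0, found = 0;

	for (i = 0; i < BOOSTGROUPS_COUNT; i++) {
		if (bg->group[i].tasks == 0)
			continue;
		if (!found || bg->group[i].boost > max) {
			max = bg->group[i].boost;
			found = 1;
		}
	}

	return max;
}
```

Because the maximum is recomputed from the task counts, the reported boost drops back as soon as the last task of the most boosted group is dequeued.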
|
||||
|
||||
|
||||
8. References
|
||||
=============
|
||||
[1] http://lwn.net/Articles/552889
|
||||
[2] http://lkml.org/lkml/2012/5/18/91
|
||||
[3] http://lkml.org/lkml/2015/6/26/620
|
||||
|
|
@ -42,6 +42,7 @@ cpu0: cpu@0 {
|
|||
cci-control-port = <&cci_control1>;
|
||||
cpu-idle-states = <&CLUSTER_SLEEP_BIG>;
|
||||
capacity-dmips-mhz = <1024>;
|
||||
dynamic-power-coefficient = <990>;
|
||||
};
|
||||
|
||||
cpu1: cpu@1 {
|
||||
|
|
@ -51,6 +52,7 @@ cpu1: cpu@1 {
|
|||
cci-control-port = <&cci_control1>;
|
||||
cpu-idle-states = <&CLUSTER_SLEEP_BIG>;
|
||||
capacity-dmips-mhz = <1024>;
|
||||
dynamic-power-coefficient = <990>;
|
||||
};
|
||||
|
||||
cpu2: cpu@2 {
|
||||
|
|
@ -60,6 +62,7 @@ cpu2: cpu@2 {
|
|||
cci-control-port = <&cci_control2>;
|
||||
cpu-idle-states = <&CLUSTER_SLEEP_LITTLE>;
|
||||
capacity-dmips-mhz = <516>;
|
||||
dynamic-power-coefficient = <133>;
|
||||
};
|
||||
|
||||
cpu3: cpu@3 {
|
||||
|
|
@ -69,6 +72,7 @@ cpu3: cpu@3 {
|
|||
cci-control-port = <&cci_control2>;
|
||||
cpu-idle-states = <&CLUSTER_SLEEP_LITTLE>;
|
||||
capacity-dmips-mhz = <516>;
|
||||
dynamic-power-coefficient = <133>;
|
||||
};
|
||||
|
||||
cpu4: cpu@4 {
|
||||
|
|
@ -78,6 +82,7 @@ cpu4: cpu@4 {
|
|||
cci-control-port = <&cci_control2>;
|
||||
cpu-idle-states = <&CLUSTER_SLEEP_LITTLE>;
|
||||
capacity-dmips-mhz = <516>;
|
||||
dynamic-power-coefficient = <133>;
|
||||
};
|
||||
|
||||
idle-states {
|
||||
|
|
|
|||
|
|
@ -2,6 +2,12 @@ CONFIG_SYSVIPC=y
|
|||
CONFIG_NO_HZ=y
|
||||
CONFIG_HIGH_RES_TIMERS=y
|
||||
CONFIG_CGROUPS=y
|
||||
CONFIG_CGROUP_SCHED=y
|
||||
CONFIG_FAIR_GROUP_SCHED=y
|
||||
CONFIG_CGROUP_FREEZER=y
|
||||
CONFIG_CPUSETS=y
|
||||
CONFIG_PROC_PID_CPUSET=y
|
||||
CONFIG_SCHED_AUTOGROUP=y
|
||||
CONFIG_BLK_DEV_INITRD=y
|
||||
CONFIG_EMBEDDED=y
|
||||
CONFIG_PERF_EVENTS=y
|
||||
|
|
@ -116,6 +122,7 @@ CONFIG_PCI_ENDPOINT=y
|
|||
CONFIG_PCI_ENDPOINT_CONFIGFS=y
|
||||
CONFIG_PCI_EPF_TEST=m
|
||||
CONFIG_SMP=y
|
||||
CONFIG_SCHED_MC=y
|
||||
CONFIG_NR_CPUS=16
|
||||
CONFIG_SECCOMP=y
|
||||
CONFIG_ARM_APPENDED_DTB=y
|
||||
|
|
@ -124,10 +131,10 @@ CONFIG_KEXEC=y
|
|||
CONFIG_EFI=y
|
||||
CONFIG_CPU_FREQ=y
|
||||
CONFIG_CPU_FREQ_STAT=y
|
||||
CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
|
||||
CONFIG_CPU_FREQ_GOV_POWERSAVE=m
|
||||
CONFIG_CPU_FREQ_GOV_USERSPACE=m
|
||||
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m
|
||||
CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL=y
|
||||
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
|
||||
CONFIG_CPU_FREQ_GOV_USERSPACE=y
|
||||
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y
|
||||
CONFIG_CPU_FREQ_GOV_SCHEDUTIL=y
|
||||
CONFIG_CPUFREQ_DT=y
|
||||
CONFIG_ARM_IMX6Q_CPUFREQ=y
|
||||
|
|
@ -137,6 +144,7 @@ CONFIG_ARM_CPUIDLE=y
|
|||
CONFIG_ARM_ZYNQ_CPUIDLE=y
|
||||
CONFIG_ARM_EXYNOS_CPUIDLE=y
|
||||
CONFIG_KERNEL_MODE_NEON=y
|
||||
CONFIG_ENERGY_MODEL=y
|
||||
CONFIG_NET=y
|
||||
CONFIG_PACKET=y
|
||||
CONFIG_UNIX=y
|
||||
|
|
|
|||
|
|
@ -30,9 +30,15 @@ const struct cpumask *cpu_coregroup_mask(int cpu);
|
|||
/* Replace task scheduler's default frequency-invariant accounting */
|
||||
#define arch_scale_freq_capacity topology_get_freq_scale
|
||||
|
||||
/* Replace task scheduler's default max-frequency-invariant accounting */
|
||||
#define arch_scale_max_freq_capacity topology_get_max_freq_scale
|
||||
|
||||
/* Replace task scheduler's default cpu-invariant accounting */
|
||||
#define arch_scale_cpu_capacity topology_get_cpu_scale
|
||||
|
||||
/* Enable topology flag updates */
|
||||
#define arch_update_cpu_topology topology_update_cpu_topology
|
||||
|
||||
#else
|
||||
|
||||
static inline void init_cpu_topology(void) { }
|
||||
|
|
|
|||
|
|
@ -99,6 +99,7 @@ A72_0: cpu@0 {
|
|||
clocks = <&scpi_dvfs 0>;
|
||||
cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
|
||||
capacity-dmips-mhz = <1024>;
|
||||
dynamic-power-coefficient = <450>;
|
||||
};
|
||||
|
||||
A72_1: cpu@1 {
|
||||
|
|
@ -116,6 +117,7 @@ A72_1: cpu@1 {
|
|||
clocks = <&scpi_dvfs 0>;
|
||||
cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
|
||||
capacity-dmips-mhz = <1024>;
|
||||
dynamic-power-coefficient = <450>;
|
||||
};
|
||||
|
||||
A53_0: cpu@100 {
|
||||
|
|
@ -133,6 +135,7 @@ A53_0: cpu@100 {
|
|||
clocks = <&scpi_dvfs 1>;
|
||||
cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
|
||||
capacity-dmips-mhz = <485>;
|
||||
dynamic-power-coefficient = <140>;
|
||||
};
|
||||
|
||||
A53_1: cpu@101 {
|
||||
|
|
@ -150,6 +153,7 @@ A53_1: cpu@101 {
|
|||
clocks = <&scpi_dvfs 1>;
|
||||
cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
|
||||
capacity-dmips-mhz = <485>;
|
||||
dynamic-power-coefficient = <140>;
|
||||
};
|
||||
|
||||
A53_2: cpu@102 {
|
||||
|
|
@ -167,6 +171,7 @@ A53_2: cpu@102 {
|
|||
clocks = <&scpi_dvfs 1>;
|
||||
cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
|
||||
capacity-dmips-mhz = <485>;
|
||||
dynamic-power-coefficient = <140>;
|
||||
};
|
||||
|
||||
A53_3: cpu@103 {
|
||||
|
|
@ -184,6 +189,7 @@ A53_3: cpu@103 {
|
|||
clocks = <&scpi_dvfs 1>;
|
||||
cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
|
||||
capacity-dmips-mhz = <485>;
|
||||
dynamic-power-coefficient = <140>;
|
||||
};
|
||||
|
||||
A72_L2: l2-cache0 {
|
||||
|
|
|
|||
|
|
@ -98,6 +98,7 @@ A57_0: cpu@0 {
|
|||
clocks = <&scpi_dvfs 0>;
|
||||
cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
|
||||
capacity-dmips-mhz = <1024>;
|
||||
dynamic-power-coefficient = <530>;
|
||||
};
|
||||
|
||||
A57_1: cpu@1 {
|
||||
|
|
@ -115,6 +116,7 @@ A57_1: cpu@1 {
|
|||
clocks = <&scpi_dvfs 0>;
|
||||
cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
|
||||
capacity-dmips-mhz = <1024>;
|
||||
dynamic-power-coefficient = <530>;
|
||||
};
|
||||
|
||||
A53_0: cpu@100 {
|
||||
|
|
@ -132,6 +134,7 @@ A53_0: cpu@100 {
|
|||
clocks = <&scpi_dvfs 1>;
|
||||
cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
|
||||
capacity-dmips-mhz = <578>;
|
||||
dynamic-power-coefficient = <140>;
|
||||
};
|
||||
|
||||
A53_1: cpu@101 {
|
||||
|
|
@ -149,6 +152,7 @@ A53_1: cpu@101 {
|
|||
clocks = <&scpi_dvfs 1>;
|
||||
cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
|
||||
capacity-dmips-mhz = <578>;
|
||||
dynamic-power-coefficient = <140>;
|
||||
};
|
||||
|
||||
A53_2: cpu@102 {
|
||||
|
|
@ -166,6 +170,7 @@ A53_2: cpu@102 {
|
|||
clocks = <&scpi_dvfs 1>;
|
||||
cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
|
||||
capacity-dmips-mhz = <578>;
|
||||
dynamic-power-coefficient = <140>;
|
||||
};
|
||||
|
||||
A53_3: cpu@103 {
|
||||
|
|
@ -183,6 +188,7 @@ A53_3: cpu@103 {
|
|||
clocks = <&scpi_dvfs 1>;
|
||||
cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
|
||||
capacity-dmips-mhz = <578>;
|
||||
dynamic-power-coefficient = <140>;
|
||||
};
|
||||
|
||||
A57_L2: l2-cache0 {
|
||||
|
|
|
|||
|
|
@ -19,9 +19,13 @@ CONFIG_BLK_CGROUP=y
|
|||
CONFIG_CGROUP_PIDS=y
|
||||
CONFIG_CGROUP_HUGETLB=y
|
||||
CONFIG_CPUSETS=y
|
||||
CONFIG_CGROUPS=y
|
||||
CONFIG_FAIR_GROUP_SCHED=y
|
||||
CONFIG_CGROUP_SCHED=y
|
||||
CONFIG_CGROUP_DEVICE=y
|
||||
CONFIG_CGROUP_CPUACCT=y
|
||||
CONFIG_CGROUP_PERF=y
|
||||
CONFIG_CGROUP_FREEZER=y
|
||||
CONFIG_USER_NS=y
|
||||
CONFIG_SCHED_AUTOGROUP=y
|
||||
CONFIG_BLK_DEV_INITRD=y
|
||||
|
|
@ -101,13 +105,16 @@ CONFIG_XEN=y
|
|||
CONFIG_COMPAT=y
|
||||
CONFIG_HIBERNATION=y
|
||||
CONFIG_WQ_POWER_EFFICIENT_DEFAULT=y
|
||||
CONFIG_ENERGY_MODEL=y
|
||||
CONFIG_SCHED_TUNE=y
|
||||
CONFIG_ARM_CPUIDLE=y
|
||||
CONFIG_CPU_FREQ=y
|
||||
CONFIG_CPU_FREQ_STAT=y
|
||||
CONFIG_CPU_FREQ_GOV_POWERSAVE=m
|
||||
CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL=y
|
||||
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
|
||||
CONFIG_CPU_FREQ_GOV_USERSPACE=y
|
||||
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
|
||||
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m
|
||||
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y
|
||||
CONFIG_CPU_FREQ_GOV_SCHEDUTIL=y
|
||||
CONFIG_CPUFREQ_DT=y
|
||||
CONFIG_ACPI_CPPC_CPUFREQ=m
|
||||
|
|
|
|||
|
|
@ -42,9 +42,15 @@ int pcibus_to_node(struct pci_bus *bus);
|
|||
/* Replace task scheduler's default frequency-invariant accounting */
|
||||
#define arch_scale_freq_capacity topology_get_freq_scale
|
||||
|
||||
/* Replace task scheduler's default max-frequency-invariant accounting */
|
||||
#define arch_scale_max_freq_capacity topology_get_max_freq_scale
|
||||
|
||||
/* Replace task scheduler's default cpu-invariant accounting */
|
||||
#define arch_scale_cpu_capacity topology_get_cpu_scale
|
||||
|
||||
/* Enable topology flag updates */
|
||||
#define arch_update_cpu_topology topology_update_cpu_topology
|
||||
|
||||
#include <asm-generic/topology.h>
|
||||
|
||||
#endif /* _ASM_ARM_TOPOLOGY_H */
|
||||
|
|
|
|||
|
|
@ -219,4 +219,5 @@ source "drivers/siox/Kconfig"
|
|||
|
||||
source "drivers/slimbus/Kconfig"
|
||||
|
||||
source "drivers/energy_model/Kconfig"
|
||||
endmenu
|
||||
|
|
|
|||
|
|
@ -157,6 +157,8 @@ obj-$(CONFIG_REMOTEPROC) += remoteproc/
|
|||
obj-$(CONFIG_RPMSG) += rpmsg/
|
||||
obj-$(CONFIG_SOUNDWIRE) += soundwire/
|
||||
|
||||
obj-$(CONFIG_ENERGY_MODEL) += energy_model/
|
||||
|
||||
# Virtualization drivers
|
||||
obj-$(CONFIG_VIRT_DRIVERS) += virt/
|
||||
obj-$(CONFIG_HYPERV) += hv/
|
||||
|
|
|
|||
|
|
@ -15,8 +15,11 @@
|
|||
#include <linux/slab.h>
|
||||
#include <linux/string.h>
|
||||
#include <linux/sched/topology.h>
|
||||
#include <linux/cpuset.h>
|
||||
|
||||
DEFINE_PER_CPU(unsigned long, freq_scale) = SCHED_CAPACITY_SCALE;
|
||||
DEFINE_PER_CPU(unsigned long, max_cpu_freq);
|
||||
DEFINE_PER_CPU(unsigned long, max_freq_scale) = SCHED_CAPACITY_SCALE;
|
||||
|
||||
void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
|
||||
unsigned long max_freq)
|
||||
|
|
@ -26,8 +29,29 @@ void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
|
|||
|
||||
scale = (cur_freq << SCHED_CAPACITY_SHIFT) / max_freq;
|
||||
|
||||
for_each_cpu(i, cpus)
|
||||
for_each_cpu(i, cpus) {
|
||||
per_cpu(freq_scale, i) = scale;
|
||||
per_cpu(max_cpu_freq, i) = max_freq;
|
||||
}
|
||||
}
|
||||
|
||||
void arch_set_max_freq_scale(struct cpumask *cpus,
|
||||
unsigned long policy_max_freq)
|
||||
{
|
||||
unsigned long scale, max_freq;
|
||||
int cpu = cpumask_first(cpus);
|
||||
|
||||
if (cpu >= nr_cpu_ids)
|
||||
return;
|
||||
|
||||
max_freq = per_cpu(max_cpu_freq, cpu);
|
||||
if (!max_freq)
|
||||
return;
|
||||
|
||||
scale = (policy_max_freq << SCHED_CAPACITY_SHIFT) / max_freq;
|
||||
|
||||
for_each_cpu(cpu, cpus)
|
||||
per_cpu(max_freq_scale, cpu) = scale;
|
||||
}
|
||||
|
||||
static DEFINE_MUTEX(cpu_scale_mutex);
|
||||
|
|
@ -47,6 +71,9 @@ static ssize_t cpu_capacity_show(struct device *dev,
|
|||
return sprintf(buf, "%lu\n", topology_get_cpu_scale(NULL, cpu->dev.id));
|
||||
}
|
||||
|
||||
static void update_topology_flags_workfn(struct work_struct *work);
|
||||
static DECLARE_WORK(update_topology_flags_work, update_topology_flags_workfn);
|
||||
|
||||
static ssize_t cpu_capacity_store(struct device *dev,
|
||||
struct device_attribute *attr,
|
||||
const char *buf,
|
||||
|
|
@ -72,6 +99,8 @@ static ssize_t cpu_capacity_store(struct device *dev,
|
|||
topology_set_cpu_scale(i, new_capacity);
|
||||
mutex_unlock(&cpu_scale_mutex);
|
||||
|
||||
schedule_work(&update_topology_flags_work);
|
||||
|
||||
return count;
|
||||
}
|
||||
|
||||
|
|
@ -96,6 +125,25 @@ static int register_cpu_capacity_sysctl(void)
|
|||
}
|
||||
subsys_initcall(register_cpu_capacity_sysctl);
|
||||
|
||||
static int update_topology;
|
||||
|
||||
int topology_update_cpu_topology(void)
|
||||
{
|
||||
return update_topology;
|
||||
}
|
||||
|
||||
/*
|
||||
* Updating the sched_domains can't be done directly from cpufreq callbacks
|
||||
* due to locking, so queue the work for later.
|
||||
*/
|
||||
static void update_topology_flags_workfn(struct work_struct *work)
|
||||
{
|
||||
update_topology = 1;
|
||||
rebuild_sched_domains();
|
||||
pr_debug("sched_domain hierarchy rebuilt, flags updated\n");
|
||||
update_topology = 0;
|
||||
}
|
||||
|
||||
static u32 capacity_scale;
|
||||
static u32 *raw_capacity;
|
||||
|
||||
|
|
@ -201,6 +249,7 @@ init_cpu_capacity_callback(struct notifier_block *nb,
|
|||
|
||||
if (cpumask_empty(cpus_to_visit)) {
|
||||
topology_normalize_cpu_scale();
|
||||
schedule_work(&update_topology_flags_work);
|
||||
free_raw_capacity();
|
||||
pr_debug("cpu_capacity: parsing done\n");
|
||||
schedule_work(&parsing_done_work);
|
||||
|
|
|
|||
|
|
@ -24,6 +24,7 @@
|
|||
#include <linux/cpufreq.h>
|
||||
#include <linux/cpumask.h>
|
||||
#include <linux/cpu_cooling.h>
|
||||
#include <linux/energy_model.h>
|
||||
#include <linux/export.h>
|
||||
#include <linux/module.h>
|
||||
#include <linux/mutex.h>
|
||||
|
|
@ -456,6 +457,7 @@ static int get_cluster_clk_and_freq_table(struct device *cpu_dev,
|
|||
/* Per-CPU initialization */
|
||||
static int bL_cpufreq_init(struct cpufreq_policy *policy)
|
||||
{
|
||||
struct em_data_callback em_cb = EM_DATA_CB(of_dev_pm_opp_get_cpu_power);
|
||||
u32 cur_cluster = cpu_to_cluster(policy->cpu);
|
||||
struct device *cpu_dev;
|
||||
int ret;
|
||||
|
|
@ -487,6 +489,14 @@ static int bL_cpufreq_init(struct cpufreq_policy *policy)
|
|||
policy->cpuinfo.transition_latency =
|
||||
arm_bL_ops->get_transition_latency(cpu_dev);
|
||||
|
||||
ret = dev_pm_opp_get_opp_count(cpu_dev);
|
||||
if (ret <= 0) {
|
||||
dev_dbg(cpu_dev, "OPP table is not ready, deferring probe\n");
|
||||
return -EPROBE_DEFER;
|
||||
}
|
||||
|
||||
em_register_perf_domain(policy->cpus, ret, &em_cb);
|
||||
|
||||
if (is_bL_switching_enabled())
|
||||
per_cpu(cpu_last_req_freq, policy->cpu) = clk_get_cpu_rate(policy->cpu);
|
||||
|
||||
|
|
|
|||
|
|
@ -16,6 +16,7 @@
|
|||
#include <linux/cpu_cooling.h>
|
||||
#include <linux/cpufreq.h>
|
||||
#include <linux/cpumask.h>
|
||||
#include <linux/energy_model.h>
|
||||
#include <linux/err.h>
|
||||
#include <linux/module.h>
|
||||
#include <linux/of.h>
|
||||
|
|
@ -152,6 +153,7 @@ static int resources_available(void)
|
|||
|
||||
static int cpufreq_init(struct cpufreq_policy *policy)
|
||||
{
|
||||
struct em_data_callback em_cb = EM_DATA_CB(of_dev_pm_opp_get_cpu_power);
|
||||
struct cpufreq_frequency_table *freq_table;
|
||||
struct opp_table *opp_table = NULL;
|
||||
struct private_data *priv;
|
||||
|
|
@ -160,7 +162,7 @@ static int cpufreq_init(struct cpufreq_policy *policy)
|
|||
unsigned int transition_latency;
|
||||
bool fallback = false;
|
||||
const char *name;
|
||||
int ret;
|
||||
int ret, nr_opp;
|
||||
|
||||
cpu_dev = get_cpu_device(policy->cpu);
|
||||
if (!cpu_dev) {
|
||||
|
|
@ -237,6 +239,7 @@ static int cpufreq_init(struct cpufreq_policy *policy)
|
|||
ret = -EPROBE_DEFER;
|
||||
goto out_free_opp;
|
||||
}
|
||||
nr_opp = ret;
|
||||
|
||||
if (fallback) {
|
||||
cpumask_setall(policy->cpus);
|
||||
|
|
@ -280,6 +283,8 @@ static int cpufreq_init(struct cpufreq_policy *policy)
|
|||
policy->cpuinfo.transition_latency = transition_latency;
|
||||
policy->dvfs_possible_from_any_cpu = true;
|
||||
|
||||
em_register_perf_domain(policy->cpus, nr_opp, &em_cb);
|
||||
|
||||
return 0;
|
||||
|
||||
out_free_cpufreq_table:
|
||||
|
|
|
|||
|
|
@ -25,6 +25,7 @@
|
|||
#include <linux/kernel_stat.h>
|
||||
#include <linux/module.h>
|
||||
#include <linux/mutex.h>
|
||||
#include <linux/sched/cpufreq.h>
|
||||
#include <linux/slab.h>
|
||||
#include <linux/suspend.h>
|
||||
#include <linux/syscore_ops.h>
|
||||
|
|
@ -158,6 +159,12 @@ __weak void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
|
|||
}
|
||||
EXPORT_SYMBOL_GPL(arch_set_freq_scale);
|
||||
|
||||
__weak void arch_set_max_freq_scale(struct cpumask *cpus,
|
||||
unsigned long policy_max_freq)
|
||||
{
|
||||
}
|
||||
EXPORT_SYMBOL_GPL(arch_set_max_freq_scale);
|
||||
|
||||
/*
|
||||
* This is a generic cpufreq init() routine which can be used by cpufreq
|
||||
* drivers of SMP systems. It will do following:
|
||||
|
|
@ -2243,6 +2250,8 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
|
|||
policy->max = new_policy->max;
|
||||
trace_cpu_frequency_limits(policy);
|
||||
|
||||
arch_set_max_freq_scale(policy->cpus, policy->max);
|
||||
|
||||
policy->cached_target_freq = UINT_MAX;
|
||||
|
||||
pr_debug("new min and max freqs are %u - %u kHz\n",
|
||||
|
|
@ -2277,6 +2286,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
|
|||
ret = cpufreq_start_governor(policy);
|
||||
if (!ret) {
|
||||
pr_debug("cpufreq: governor change\n");
|
||||
sched_cpufreq_governor_change(policy, old_gov);
|
||||
return 0;
|
||||
}
|
||||
cpufreq_exit_governor(policy);
|
||||
|
|
|
|||
|
|
@ -12,6 +12,7 @@
|
|||
#include <linux/cpufreq.h>
|
||||
#include <linux/cpumask.h>
|
||||
#include <linux/cpu_cooling.h>
|
||||
#include <linux/energy_model.h>
|
||||
#include <linux/export.h>
|
||||
#include <linux/module.h>
|
||||
#include <linux/pm_opp.h>
|
||||
|
|
@ -103,13 +104,42 @@ scmi_get_sharing_cpus(struct device *cpu_dev, struct cpumask *cpumask)
|
|||
return 0;
|
||||
}
|
||||
|
||||
static int __maybe_unused
|
||||
scmi_get_cpu_power(unsigned long *power, unsigned long *KHz, int cpu)
|
||||
{
|
||||
struct device *cpu_dev = get_cpu_device(cpu);
|
||||
unsigned long Hz;
|
||||
int ret, domain;
|
||||
|
||||
if (!cpu_dev) {
|
||||
pr_err("failed to get cpu%d device\n", cpu);
|
||||
return -ENODEV;
|
||||
}
|
||||
|
||||
domain = handle->perf_ops->device_domain_id(cpu_dev);
|
||||
if (domain < 0)
|
||||
return domain;
|
||||
|
||||
/* Get the power cost of the performance domain. */
|
||||
Hz = *KHz * 1000;
|
||||
ret = handle->perf_ops->est_power_get(handle, domain, &Hz, power);
|
||||
if (ret)
|
||||
return ret;
|
||||
|
||||
/* The EM framework specifies the frequency in KHz. */
|
||||
*KHz = Hz / 1000;
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
static int scmi_cpufreq_init(struct cpufreq_policy *policy)
|
||||
{
|
||||
int ret;
|
||||
int ret, nr_opp;
|
||||
unsigned int latency;
|
||||
struct device *cpu_dev;
|
||||
struct scmi_data *priv;
|
||||
struct cpufreq_frequency_table *freq_table;
|
||||
struct em_data_callback em_cb = EM_DATA_CB(scmi_get_cpu_power);
|
||||
|
||||
cpu_dev = get_cpu_device(policy->cpu);
|
||||
if (!cpu_dev) {
|
||||
|
|
@ -142,6 +172,7 @@ static int scmi_cpufreq_init(struct cpufreq_policy *policy)
|
|||
ret = -EPROBE_DEFER;
|
||||
goto out_free_opp;
|
||||
}
|
||||
nr_opp = ret;
|
||||
|
||||
priv = kzalloc(sizeof(*priv), GFP_KERNEL);
|
||||
if (!priv) {
|
||||
|
|
@ -171,6 +202,9 @@ static int scmi_cpufreq_init(struct cpufreq_policy *policy)
|
|||
policy->cpuinfo.transition_latency = latency;
|
||||
|
||||
policy->fast_switch_possible = true;
|
||||
|
||||
em_register_perf_domain(policy->cpus, nr_opp, &em_cb);
|
||||
|
||||
return 0;
|
||||
|
||||
out_free_priv:
|
||||
|
|
|
|||
|
|
@ -23,6 +23,7 @@
|
|||
#include <linux/cpufreq.h>
|
||||
#include <linux/cpumask.h>
|
||||
#include <linux/cpu_cooling.h>
|
||||
#include <linux/energy_model.h>
|
||||
#include <linux/export.h>
|
||||
#include <linux/module.h>
|
||||
#include <linux/of_platform.h>
|
||||
|
|
@ -98,11 +99,12 @@ scpi_get_sharing_cpus(struct device *cpu_dev, struct cpumask *cpumask)
|
|||
|
||||
static int scpi_cpufreq_init(struct cpufreq_policy *policy)
|
||||
{
|
||||
int ret;
|
||||
int ret, nr_opp;
|
||||
unsigned int latency;
|
||||
struct device *cpu_dev;
|
||||
struct scpi_data *priv;
|
||||
struct cpufreq_frequency_table *freq_table;
|
||||
struct em_data_callback em_cb = EM_DATA_CB(of_dev_pm_opp_get_cpu_power);
|
||||
|
||||
cpu_dev = get_cpu_device(policy->cpu);
|
||||
if (!cpu_dev) {
|
||||
|
|
@ -135,6 +137,7 @@ static int scpi_cpufreq_init(struct cpufreq_policy *policy)
|
|||
ret = -EPROBE_DEFER;
|
||||
goto out_free_opp;
|
||||
}
|
||||
nr_opp = ret;
|
||||
|
||||
priv = kzalloc(sizeof(*priv), GFP_KERNEL);
|
||||
if (!priv) {
|
||||
|
|
@ -170,6 +173,9 @@ static int scpi_cpufreq_init(struct cpufreq_policy *policy)
|
|||
policy->cpuinfo.transition_latency = latency;
|
||||
|
||||
policy->fast_switch_possible = false;
|
||||
|
||||
em_register_perf_domain(policy->cpus, nr_opp, &em_cb);
|
||||
|
||||
return 0;
|
||||
|
||||
out_free_cpufreq_table:
|
||||
|
|
|
|||
|
|
@ -221,7 +221,7 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
|
|||
}
|
||||
|
||||
/* Take note of the planned idle state. */
|
||||
sched_idle_set_state(target_state);
|
||||
sched_idle_set_state(target_state, index);
|
||||
|
||||
trace_cpu_idle_rcuidle(index, dev->cpu);
|
||||
time_start = ns_to_ktime(local_clock());
|
||||
|
|
@ -235,7 +235,7 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
|
|||
trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
|
||||
|
||||
/* The cpu is no longer idle or about to enter idle. */
|
||||
sched_idle_set_state(NULL);
|
||||
sched_idle_set_state(NULL, -1);
|
||||
|
||||
if (broadcast) {
|
||||
if (WARN_ON_ONCE(!irqs_disabled()))
|
||||
|
|
|
|||
16
drivers/energy_model/Kconfig
Normal file
|
|
@ -0,0 +1,16 @@
|
|||
config LEGACY_ENERGY_MODEL_DT
|
||||
bool "Legacy DT-based Energy Model of CPUs"
|
||||
default n
|
||||
help
|
||||
The Energy Aware Scheduler (EAS) used to rely on Energy Models
|
||||
(EMs) statically defined in the Device Tree. More recent
|
||||
versions of EAS now rely on the EM framework to get the power
|
||||
costs of CPUs.
|
||||
|
||||
This driver reads old-style static EMs in DT and feeds them into
|
||||
the EM framework, which allows EAS to be used on platforms with
|
||||
old DT files. Since EAS now uses only the active costs of CPUs,
|
||||
the cluster-related costs and idle-costs of the old EM are
|
||||
ignored.
|
||||
|
||||
If in doubt, say N.
|
||||
3
drivers/energy_model/Makefile
Normal file
|
|
@ -0,0 +1,3 @@
|
|||
# SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
obj-$(CONFIG_LEGACY_ENERGY_MODEL_DT) += legacy_em_dt.o
|
||||
193
drivers/energy_model/legacy_em_dt.c
Normal file
|
|
@ -0,0 +1,193 @@
|
|||
// SPDX-License-Identifier: GPL-2.0
|
||||
/*
|
||||
* Legacy Energy Model loading driver
|
||||
*
|
||||
* Copyright (C) 2018, ARM Ltd.
|
||||
* Written by: Quentin Perret, ARM Ltd.
|
||||
*/
|
||||
|
||||
#define pr_fmt(fmt) "legacy-dt-em: " fmt
|
||||
|
||||
#include <linux/cpufreq.h>
|
||||
#include <linux/cpumask.h>
|
||||
#include <linux/cpuset.h>
|
||||
#include <linux/energy_model.h>
|
||||
#include <linux/gfp.h>
|
||||
#include <linux/init.h>
|
||||
#include <linux/of.h>
|
||||
#include <linux/printk.h>
|
||||
#include <linux/slab.h>
|
||||
|
||||
static cpumask_var_t cpus_to_visit;
|
||||
|
||||
static DEFINE_PER_CPU(unsigned long, nr_states) = 0;
|
||||
|
||||
struct em_state {
|
||||
unsigned long frequency;
|
||||
unsigned long power;
|
||||
unsigned long capacity;
|
||||
};
|
||||
static DEFINE_PER_CPU(struct em_state*, cpu_em) = NULL;
|
||||
|
||||
static void finish_em_loading_workfn(struct work_struct *work);
|
||||
static DECLARE_WORK(finish_em_loading_work, finish_em_loading_workfn);
|
||||
|
||||
static DEFINE_MUTEX(em_loading_mutex);
|
||||
|
||||
/*
|
||||
* Callback given to the EM framework. All this does is browse the table
|
||||
* created by init_em_dt_callback().
|
||||
*/
|
||||
static int get_power(unsigned long *mW, unsigned long *KHz, int cpu)
|
||||
{
|
||||
unsigned long nstates = per_cpu(nr_states, cpu);
|
||||
struct em_state *em = per_cpu(cpu_em, cpu);
|
||||
int i;
|
||||
|
||||
if (!nstates || !em)
|
||||
return -ENODEV;
|
||||
|
||||
for (i = 0; i < nstates - 1; i++) {
|
||||
if (em[i].frequency > *KHz)
|
||||
break;
|
||||
}
|
||||
|
||||
*KHz = em[i].frequency;
|
||||
*mW = em[i].power;
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
static int init_em_dt_callback(struct notifier_block *nb, unsigned long val,
|
||||
void *data)
|
||||
{
|
||||
struct em_data_callback em_cb = EM_DATA_CB(get_power);
|
||||
unsigned long nstates, scale_cpu, max_freq;
|
||||
struct cpufreq_policy *policy = data;
|
||||
const struct property *prop;
|
||||
struct device_node *cn, *cp;
|
||||
struct em_state *em;
|
||||
int cpu, i, ret = 0;
|
||||
const __be32 *tmp;
|
||||
|
||||
if (val != CPUFREQ_NOTIFY)
|
||||
return 0;
|
||||
|
||||
mutex_lock(&em_loading_mutex);
|
||||
|
||||
/* Do not register an energy model twice */
|
||||
for_each_cpu(cpu, policy->cpus) {
|
||||
if (per_cpu(nr_states, cpu) || per_cpu(cpu_em, cpu)) {
|
||||
pr_err("EM of CPU%d already loaded\n", cpu);
|
||||
ret = -EEXIST;
|
||||
goto unlock;
|
||||
}
|
||||
}
|
||||
|
||||
max_freq = policy->cpuinfo.max_freq;
|
||||
if (!max_freq) {
|
||||
pr_err("No cpuinfo.max_freq for CPU%d\n", policy->cpu);
|
||||
ret = -EINVAL;
|
||||
goto unlock;
|
||||
}
|
||||
|
||||
cpu = cpumask_first(policy->cpus);
|
||||
cn = of_get_cpu_node(cpu, NULL);
|
||||
if (!cn) {
|
||||
pr_err("No device_node for CPU%d\n", cpu);
|
||||
ret = -ENODEV;
|
||||
goto unlock;
|
||||
}
|
||||
|
||||
cp = of_parse_phandle(cn, "sched-energy-costs", 0);
|
||||
if (!cp) {
|
||||
pr_err("CPU%d node has no sched-energy-costs\n", cpu);
|
||||
ret = -ENODEV;
|
||||
goto unlock;
|
||||
}
|
||||
|
||||
prop = of_find_property(cp, "busy-cost-data", NULL);
|
||||
if (!prop || !prop->value) {
|
||||
pr_err("No busy-cost-data for CPU%d\n", cpu);
|
||||
ret = -ENODEV;
|
||||
goto unlock;
|
||||
}
|
||||
|
||||
nstates = (prop->length / sizeof(u32)) / 2;
|
||||
em = kcalloc(nstates, sizeof(*em), GFP_KERNEL);
|
||||
if (!em) {
|
||||
ret = -ENOMEM;
|
||||
goto unlock;
|
||||
}
|
||||
|
||||
/* Copy the capacity and power cost to the table. */
|
||||
for (i = 0, tmp = prop->value; i < nstates; i++) {
|
||||
em[i].capacity = be32_to_cpup(tmp++);
|
||||
em[i].power = be32_to_cpup(tmp++);
|
||||
}
|
||||
|
||||
/* Get the CPU capacity (according to the EM) */
|
||||
scale_cpu = em[nstates - 1].capacity;
|
||||
if (!scale_cpu) {
|
||||
pr_err("CPU%d: capacity cannot be 0\n", cpu);
|
||||
kfree(em);
|
||||
ret = -EINVAL;
|
||||
goto unlock;
|
||||
}
|
||||
|
||||
/* Re-compute the intermediate frequencies based on the EM. */
|
||||
for (i = 0; i < nstates; i++)
|
||||
em[i].frequency = em[i].capacity * max_freq / scale_cpu;
|
||||
|
||||
/* Assign the table to all CPUs of this policy. */
|
||||
for_each_cpu(i, policy->cpus) {
|
||||
per_cpu(nr_states, i) = nstates;
|
||||
per_cpu(cpu_em, i) = em;
|
||||
}
|
||||
|
||||
pr_info("Registering EM of %*pbl\n", cpumask_pr_args(policy->cpus));
|
||||
em_register_perf_domain(policy->cpus, nstates, &em_cb);
|
||||
|
||||
/* Finish the work when all possible CPUs have been registered. */
|
||||
cpumask_andnot(cpus_to_visit, cpus_to_visit, policy->cpus);
|
||||
if (cpumask_empty(cpus_to_visit))
|
||||
schedule_work(&finish_em_loading_work);
|
||||
|
||||
unlock:
|
||||
mutex_unlock(&em_loading_mutex);
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
static struct notifier_block init_em_dt_notifier = {
|
||||
.notifier_call = init_em_dt_callback,
|
||||
};
|
||||
|
||||
static void finish_em_loading_workfn(struct work_struct *work)
|
||||
{
|
||||
cpufreq_unregister_notifier(&init_em_dt_notifier,
|
||||
CPUFREQ_POLICY_NOTIFIER);
|
||||
free_cpumask_var(cpus_to_visit);
|
||||
|
||||
/* Let the scheduler know the Energy Model is ready. */
|
||||
rebuild_sched_domains();
|
||||
}
|
||||
|
||||
static int __init register_cpufreq_notifier(void)
|
||||
{
|
||||
int ret;
|
||||
|
||||
if (!alloc_cpumask_var(&cpus_to_visit, GFP_KERNEL))
|
||||
return -ENOMEM;
|
||||
|
||||
cpumask_copy(cpus_to_visit, cpu_possible_mask);
|
||||
|
||||
ret = cpufreq_register_notifier(&init_em_dt_notifier,
|
||||
CPUFREQ_POLICY_NOTIFIER);
|
||||
|
||||
if (ret)
|
||||
free_cpumask_var(cpus_to_visit);
|
||||
|
||||
return ret;
|
||||
}
|
||||
core_initcall(register_cpufreq_notifier);
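
The table handling in legacy_em_dt.c above can be illustrated outside the kernel. The following is a minimal user-space sketch, assuming a table of (capacity, power) pairs sorted by ascending capacity, as the driver expects from "busy-cost-data". The names `em_state_sketch`, `fill_frequencies` and `lookup_state` are hypothetical and are not part of the patch; the arithmetic mirrors the driver's `capacity * max_freq / scale_cpu` step and the strict `>` comparison in get_power().

```c
#include <stddef.h>

/* Hypothetical user-space stand-in for the driver's struct em_state. */
struct em_state_sketch {
	unsigned long frequency; /* kHz, derived from capacity */
	unsigned long power;     /* mW */
	unsigned long capacity;  /* capacity value read from DT */
};

/*
 * Derive each state's frequency from its capacity, scaling by the
 * capacity of the last (highest) state. Returns 0 on success, -1 if
 * the table is empty or the highest capacity is 0.
 */
static int fill_frequencies(struct em_state_sketch *em, size_t n,
			    unsigned long max_freq)
{
	unsigned long scale_cpu;
	size_t i;

	if (!em || !n)
		return -1;

	scale_cpu = em[n - 1].capacity;
	if (!scale_cpu)
		return -1;

	for (i = 0; i < n; i++)
		em[i].frequency = em[i].capacity * max_freq / scale_cpu;

	return 0;
}

/*
 * Return the first state whose frequency is strictly greater than khz,
 * or the last state if there is none. Because the comparison is strict,
 * a request that exactly matches a state selects the next one up.
 */
static const struct em_state_sketch *
lookup_state(const struct em_state_sketch *em, size_t n, unsigned long khz)
{
	size_t i;

	for (i = 0; i + 1 < n; i++) {
		if (em[i].frequency > khz)
			break;
	}

	return &em[i];
}
```

With made-up capacities 256, 512 and 1024 and a maximum frequency of 2000000 kHz, the derived frequencies are 500000, 1000000 and 2000000 kHz, and a request for 600000 kHz resolves to the middle state.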
|
||||
|
|
@ -427,6 +427,33 @@ static int scmi_dvfs_freq_get(const struct scmi_handle *handle, u32 domain,
|
|||
return ret;
|
||||
}
|
||||
|
||||
static int scmi_dvfs_est_power_get(const struct scmi_handle *handle, u32 domain,
|
||||
unsigned long *freq, unsigned long *power)
|
||||
{
|
||||
struct scmi_perf_info *pi = handle->perf_priv;
|
||||
struct perf_dom_info *dom;
|
||||
unsigned long opp_freq;
|
||||
int idx, ret = -EINVAL;
|
||||
struct scmi_opp *opp;
|
||||
|
||||
dom = pi->dom_info + domain;
|
||||
if (!dom)
|
||||
return -EIO;
|
||||
|
||||
for (opp = dom->opp, idx = 0; idx < dom->opp_count; idx++, opp++) {
|
||||
opp_freq = opp->perf * dom->mult_factor;
|
||||
if (opp_freq < *freq)
|
||||
continue;
|
||||
|
||||
*freq = opp_freq;
|
||||
*power = opp->power;
|
||||
ret = 0;
|
||||
break;
|
||||
}
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
static struct scmi_perf_ops perf_ops = {
|
||||
.limits_set = scmi_perf_limits_set,
|
||||
.limits_get = scmi_perf_limits_get,
|
||||
|
|
@ -437,6 +464,7 @@ static struct scmi_perf_ops perf_ops = {
|
|||
.device_opps_add = scmi_dvfs_device_opps_add,
|
||||
.freq_set = scmi_dvfs_freq_set,
|
||||
.freq_get = scmi_dvfs_freq_get,
|
||||
.est_power_get = scmi_dvfs_est_power_get,
|
||||
};
|
||||
|
||||
static int scmi_perf_protocol_init(struct scmi_handle *handle)
|
||||
|
|
|
|||
|
|
@ -778,3 +778,44 @@ struct device_node *dev_pm_opp_get_of_node(struct dev_pm_opp *opp)
|
|||
return of_node_get(opp->np);
|
||||
}
|
||||
EXPORT_SYMBOL_GPL(dev_pm_opp_get_of_node);
|
||||
|
||||
int of_dev_pm_opp_get_cpu_power(unsigned long *mW, unsigned long *KHz, int cpu)
|
||||
{
|
||||
unsigned long mV, Hz, MHz;
|
||||
struct device *cpu_dev;
|
||||
struct dev_pm_opp *opp;
|
||||
struct device_node *np;
|
||||
u32 cap;
|
||||
u64 tmp;
|
||||
|
||||
cpu_dev = get_cpu_device(cpu);
|
||||
if (!cpu_dev)
|
||||
return -ENODEV;
|
||||
|
||||
np = of_node_get(cpu_dev->of_node);
|
||||
if (!np)
|
||||
return -EINVAL;
|
||||
|
||||
if (of_property_read_u32(np, "dynamic-power-coefficient", &cap))
|
||||
return -EINVAL;
|
||||
|
||||
Hz = *KHz * 1000;
|
||||
opp = dev_pm_opp_find_freq_ceil(cpu_dev, &Hz);
|
||||
if (IS_ERR(opp))
|
||||
return -EINVAL;
|
||||
|
||||
mV = dev_pm_opp_get_voltage(opp) / 1000;
|
||||
dev_pm_opp_put(opp);
|
||||
if (!mV)
|
||||
return -EINVAL;
|
||||
|
||||
MHz = Hz / 1000000;
|
||||
tmp = (u64)cap * mV * mV * MHz;
|
||||
do_div(tmp, 1000000000);
|
||||
|
||||
*mW = (unsigned long)tmp;
|
||||
*KHz = Hz / 1000;
|
||||
|
||||
return 0;
|
||||
}
|
||||
EXPORT_SYMBOL_GPL(of_dev_pm_opp_get_cpu_power);
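
The power estimate computed by of_dev_pm_opp_get_cpu_power() above reduces to `coefficient * mV * mV * MHz / 10^9`, with the product widened to 64 bits before the division. A minimal user-space sketch of that arithmetic follows. The helper name `est_power_mw` is hypothetical, and the input values used in the note below are made up for illustration rather than taken from any real device tree.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not a kernel API: estimate dynamic power in mW
 * from a dynamic-power-coefficient, a voltage in mV and a frequency in
 * MHz. The multiplication is done in 64 bits so that realistic inputs
 * do not overflow before the division by 10^9.
 */
static unsigned long est_power_mw(uint32_t coefficient, unsigned long mv,
				  unsigned long mhz)
{
	uint64_t tmp = (uint64_t)coefficient * mv * mv * mhz;

	return (unsigned long)(tmp / 1000000000ULL);
}
```

For example, a coefficient of 100 at 1000 mV and 1000 MHz gives 100 * 10^6 * 10^3 / 10^9 = 100 mW. Note that the kernel function also truncates the voltage to mV and the frequency to MHz before multiplying, so small OPP differences can be lost to integer division.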
|
||||
|
|
|
|||
|
|
@ -31,6 +31,7 @@
|
|||
#include <linux/slab.h>
|
||||
#include <linux/cpu.h>
|
||||
#include <linux/cpu_cooling.h>
|
||||
#include <linux/energy_model.h>
|
||||
|
||||
#include <trace/events/thermal.h>
|
||||
|
||||
|
|
@ -48,19 +49,6 @@
|
|||
* ...
|
||||
*/
|
||||
|
||||
/**
|
||||
* struct freq_table - frequency table along with power entries
|
||||
* @frequency: frequency in KHz
|
||||
* @power: power in mW
|
||||
*
|
||||
* This structure is built when the cooling device registers and helps
|
||||
* in translating frequency to power and vice versa.
|
||||
*/
|
||||
struct freq_table {
|
||||
u32 frequency;
|
||||
u32 power;
|
||||
};
|
||||
|
||||
/**
|
||||
* struct time_in_idle - Idle time stats
|
||||
* @time: previous reading of the absolute time that this cpu was idle
|
||||
|
|
@ -82,7 +70,7 @@ struct time_in_idle {
|
|||
* frequency.
|
||||
* @max_level: maximum cooling level. One less than total number of valid
|
||||
* cpufreq frequencies.
|
||||
* @freq_table: Freq table in descending order of frequencies
|
||||
* @em: Reference on the Energy Model of the device
|
||||
* @cdev: thermal_cooling_device pointer to keep track of the
|
||||
* registered cooling device.
|
||||
* @policy: cpufreq policy.
|
||||
|
|
@ -98,7 +86,7 @@ struct cpufreq_cooling_device {
|
|||
unsigned int cpufreq_state;
|
||||
unsigned int clipped_freq;
|
||||
unsigned int max_level;
|
||||
struct freq_table *freq_table; /* In descending order */
|
||||
struct em_perf_domain *em;
|
||||
struct thermal_cooling_device *cdev;
|
||||
struct cpufreq_policy *policy;
|
||||
struct list_head node;
|
||||
|
|
@ -111,26 +99,6 @@ static LIST_HEAD(cpufreq_cdev_list);
|
|||
|
||||
/* Below code defines functions to be used for cpufreq as cooling device */
|
||||
|
||||
/**
|
||||
* get_level: Find the level for a particular frequency
|
||||
* @cpufreq_cdev: cpufreq_cdev for which the property is required
|
||||
* @freq: Frequency
|
||||
*
|
||||
* Return: level corresponding to the frequency.
|
||||
*/
|
||||
static unsigned long get_level(struct cpufreq_cooling_device *cpufreq_cdev,
|
||||
unsigned int freq)
|
||||
{
|
||||
struct freq_table *freq_table = cpufreq_cdev->freq_table;
|
||||
unsigned long level;
|
||||
|
||||
for (level = 1; level <= cpufreq_cdev->max_level; level++)
|
||||
if (freq > freq_table[level].frequency)
|
||||
break;
|
||||
|
||||
return level - 1;
|
||||
}
|
||||
|
||||
/**
|
||||
* cpufreq_thermal_notifier - notifier callback for cpufreq policy change.
|
||||
* @nb: struct notifier_block * with callback info.
|
||||
|
|
@ -184,105 +152,52 @@ static int cpufreq_thermal_notifier(struct notifier_block *nb,
|
|||
return NOTIFY_OK;
|
||||
}
|
||||
|
||||
#ifdef CONFIG_ENERGY_MODEL
|
||||
/**
|
||||
* update_freq_table() - Update the freq table with power numbers
|
||||
* @cpufreq_cdev: the cpufreq cooling device in which to update the table
|
||||
* @capacitance: dynamic power coefficient for these cpus
|
||||
* get_level: Find the level for a particular frequency
|
||||
* @cpufreq_cdev: cpufreq_cdev for which the property is required
|
||||
* @freq: Frequency
|
||||
*
|
||||
* Update the freq table with power numbers. This table will be used in
|
||||
* cpu_power_to_freq() and cpu_freq_to_power() to convert between power and
|
||||
* frequency efficiently. Power is stored in mW, frequency in KHz. The
|
||||
* resulting table is in descending order.
|
||||
*
|
||||
* Return: 0 on success, -EINVAL if there are no OPPs for any CPUs,
|
||||
* or -ENOMEM if we run out of memory.
|
||||
* Return: level corresponding to the frequency.
|
||||
*/
|
||||
static int update_freq_table(struct cpufreq_cooling_device *cpufreq_cdev,
|
||||
u32 capacitance)
|
||||
static unsigned long get_level(struct cpufreq_cooling_device *cpufreq_cdev,
|
||||
unsigned int freq)
|
||||
{
|
||||
struct freq_table *freq_table = cpufreq_cdev->freq_table;
|
||||
struct dev_pm_opp *opp;
|
||||
struct device *dev = NULL;
|
||||
int num_opps = 0, cpu = cpufreq_cdev->policy->cpu, i;
|
||||
int i;
|
||||
|
||||
dev = get_cpu_device(cpu);
|
||||
if (unlikely(!dev)) {
|
||||
dev_warn(&cpufreq_cdev->cdev->device,
|
||||
"No cpu device for cpu %d\n", cpu);
|
||||
return -ENODEV;
|
||||
for (i = cpufreq_cdev->max_level - 1; i >= 0; i--) {
|
||||
if (freq > cpufreq_cdev->em->table[i].frequency)
|
||||
break;
|
||||
}
|
||||
|
||||
num_opps = dev_pm_opp_get_opp_count(dev);
|
||||
if (num_opps < 0)
|
||||
return num_opps;
|
||||
|
||||
/*
|
||||
* The cpufreq table is also built from the OPP table and so the count
|
||||
* should match.
|
||||
*/
|
||||
if (num_opps != cpufreq_cdev->max_level + 1) {
|
||||
dev_warn(dev, "Number of OPPs not matching with max_levels\n");
|
||||
return -EINVAL;
|
||||
}
|
||||
|
||||
for (i = 0; i <= cpufreq_cdev->max_level; i++) {
|
||||
unsigned long freq = freq_table[i].frequency * 1000;
|
||||
u32 freq_mhz = freq_table[i].frequency / 1000;
|
||||
u64 power;
|
||||
u32 voltage_mv;
|
||||
|
||||
/*
|
||||
* Find ceil frequency as 'freq' may be slightly lower than OPP
|
||||
* freq due to truncation while converting to kHz.
|
||||
*/
|
||||
opp = dev_pm_opp_find_freq_ceil(dev, &freq);
|
||||
if (IS_ERR(opp)) {
|
||||
dev_err(dev, "failed to get opp for %lu frequency\n",
|
||||
freq);
|
||||
return -EINVAL;
|
||||
}
|
||||
|
||||
voltage_mv = dev_pm_opp_get_voltage(opp) / 1000;
|
||||
dev_pm_opp_put(opp);
|
||||
|
||||
/*
|
||||
* Do the multiplication with MHz and millivolt so as
|
||||
* to not overflow.
|
||||
*/
|
||||
power = (u64)capacitance * freq_mhz * voltage_mv * voltage_mv;
|
||||
do_div(power, 1000000000);
|
||||
|
||||
/* power is stored in mW */
|
||||
freq_table[i].power = power;
|
||||
}
|
||||
|
||||
return 0;
|
||||
return cpufreq_cdev->max_level - i - 1;
|
||||
}
|
||||
|
||||
|
||||
static u32 cpu_freq_to_power(struct cpufreq_cooling_device *cpufreq_cdev,
|
||||
u32 freq)
|
||||
{
|
||||
int i;
|
||||
struct freq_table *freq_table = cpufreq_cdev->freq_table;
|
||||
|
||||
for (i = 1; i <= cpufreq_cdev->max_level; i++)
|
||||
if (freq > freq_table[i].frequency)
|
||||
for (i = cpufreq_cdev->max_level - 1; i >= 0; i--) {
|
||||
if (freq > cpufreq_cdev->em->table[i].frequency)
|
||||
break;
|
||||
}
|
||||
|
||||
return freq_table[i - 1].power;
|
||||
return cpufreq_cdev->em->table[i + 1].power;
|
||||
}
|
||||
|
||||
static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
|
||||
u32 power)
|
||||
{
|
||||
int i;
|
||||
struct freq_table *freq_table = cpufreq_cdev->freq_table;
|
||||
|
||||
for (i = 1; i <= cpufreq_cdev->max_level; i++)
|
||||
if (power > freq_table[i].power)
|
||||
for (i = cpufreq_cdev->max_level - 1; i >= 0; i--) {
|
||||
if (power > cpufreq_cdev->em->table[i].power)
|
||||
break;
|
||||
}
|
||||
|
||||
return freq_table[i - 1].frequency;
|
||||
return cpufreq_cdev->em->table[i + 1].frequency;
|
||||
}
|
||||
|
||||
/**
|
||||
|
|
@ -332,6 +247,7 @@ static u32 get_dynamic_power(struct cpufreq_cooling_device *cpufreq_cdev,
|
|||
raw_cpu_power = cpu_freq_to_power(cpufreq_cdev, freq);
|
||||
return (raw_cpu_power * cpufreq_cdev->last_load) / 100;
|
||||
}
|
||||
#endif
|
||||
|
||||
/* cpufreq cooling device callback functions are defined below */
|
||||
|
||||
|
|
@ -374,6 +290,30 @@ static int cpufreq_get_cur_state(struct thermal_cooling_device *cdev,
|
|||
return 0;
|
||||
}
|
||||
|
||||
static unsigned int get_state_freq(struct cpufreq_cooling_device *cpufreq_cdev,
|
||||
unsigned long state)
|
||||
{
|
||||
struct cpufreq_policy *policy;
|
||||
unsigned long idx;
|
||||
|
||||
#ifdef CONFIG_ENERGY_MODEL
|
||||
/* Use the Energy Model table if available */
|
||||
if (cpufreq_cdev->em) {
|
||||
idx = cpufreq_cdev->max_level - state;
|
||||
return cpufreq_cdev->em->table[idx].frequency;
|
||||
}
|
||||
#endif
|
||||
|
||||
/* Otherwise, fall back on the CPUFreq table */
|
||||
policy = cpufreq_cdev->policy;
|
||||
if (policy->freq_table_sorted == CPUFREQ_TABLE_SORTED_ASCENDING)
|
||||
idx = cpufreq_cdev->max_level - state;
|
||||
else
|
||||
idx = state;
|
||||
|
||||
return policy->freq_table[idx].frequency;
|
||||
}
|
||||
|
||||
/**
|
||||
* cpufreq_set_cur_state - callback function to set the current cooling state.
|
||||
* @cdev: thermal cooling device pointer.
|
||||
|
|
@ -398,7 +338,7 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
|
|||
if (cpufreq_cdev->cpufreq_state == state)
|
||||
return 0;
|
||||
|
||||
clip_freq = cpufreq_cdev->freq_table[state].frequency;
|
||||
clip_freq = get_state_freq(cpufreq_cdev, state);
|
||||
cpufreq_cdev->cpufreq_state = state;
|
||||
cpufreq_cdev->clipped_freq = clip_freq;
|
||||
|
||||
|
|
@ -407,6 +347,7 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
|
|||
return 0;
|
||||
}
|
||||
|
||||
#ifdef CONFIG_ENERGY_MODEL
|
||||
/**
|
||||
* cpufreq_get_requested_power() - get the current power
|
||||
* @cdev: &thermal_cooling_device pointer
|
||||
|
|
@ -497,7 +438,7 @@ static int cpufreq_state2power(struct thermal_cooling_device *cdev,
|
|||
struct thermal_zone_device *tz,
|
||||
unsigned long state, u32 *power)
|
||||
{
|
||||
unsigned int freq, num_cpus;
|
||||
unsigned int freq, num_cpus, idx;
|
||||
struct cpufreq_cooling_device *cpufreq_cdev = cdev->devdata;
|
||||
|
||||
/* Request state should be less than max_level */
|
||||
|
|
@ -506,7 +447,8 @@ static int cpufreq_state2power(struct thermal_cooling_device *cdev,
|
|||
|
||||
num_cpus = cpumask_weight(cpufreq_cdev->policy->cpus);
|
||||
|
||||
freq = cpufreq_cdev->freq_table[state].frequency;
|
||||
idx = cpufreq_cdev->max_level - state;
|
||||
freq = cpufreq_cdev->em->table[idx].frequency;
|
||||
*power = cpu_freq_to_power(cpufreq_cdev, freq) * num_cpus;
|
||||
|
||||
return 0;
|
||||
|
|
@ -553,14 +495,6 @@ static int cpufreq_power2state(struct thermal_cooling_device *cdev,
|
|||
return 0;
|
||||
}
|
||||
|
||||
/* Bind cpufreq callbacks to thermal cooling device ops */
|
||||
|
||||
static struct thermal_cooling_device_ops cpufreq_cooling_ops = {
|
||||
.get_max_state = cpufreq_get_max_state,
|
||||
.get_cur_state = cpufreq_get_cur_state,
|
||||
.set_cur_state = cpufreq_set_cur_state,
|
||||
};
|
||||
|
||||
static struct thermal_cooling_device_ops cpufreq_power_cooling_ops = {
|
||||
.get_max_state = cpufreq_get_max_state,
|
||||
.get_cur_state = cpufreq_get_cur_state,
|
||||
|
|
@ -569,32 +503,27 @@ static struct thermal_cooling_device_ops cpufreq_power_cooling_ops = {
|
|||
.state2power = cpufreq_state2power,
|
||||
.power2state = cpufreq_power2state,
|
||||
};
|
||||
#endif
|
||||
|
||||
/* Bind cpufreq callbacks to thermal cooling device ops */
|
||||
|
||||
static struct thermal_cooling_device_ops cpufreq_cooling_ops = {
|
||||
.get_max_state = cpufreq_get_max_state,
|
||||
.get_cur_state = cpufreq_get_cur_state,
|
||||
.set_cur_state = cpufreq_set_cur_state,
|
||||
};
|
||||
|
||||
/* Notifier for cpufreq policy change */
|
||||
static struct notifier_block thermal_cpufreq_notifier_block = {
|
||||
.notifier_call = cpufreq_thermal_notifier,
|
||||
};
|
||||
|
||||
static unsigned int find_next_max(struct cpufreq_frequency_table *table,
|
||||
unsigned int prev_max)
|
||||
{
|
||||
struct cpufreq_frequency_table *pos;
|
||||
unsigned int max = 0;
|
||||
|
||||
cpufreq_for_each_valid_entry(pos, table) {
|
||||
if (pos->frequency > max && pos->frequency < prev_max)
|
||||
max = pos->frequency;
|
||||
}
|
||||
|
||||
return max;
|
||||
}
|
||||
|
||||
/**
|
||||
* __cpufreq_cooling_register - helper function to create cpufreq cooling device
|
||||
* @np: a valid struct device_node to the cooling device device tree node
|
||||
* @policy: cpufreq policy
|
||||
* Normally this should be same as cpufreq policy->related_cpus.
|
||||
* @capacitance: dynamic power coefficient for these cpus
|
||||
* @try_model: true if a power model should be used
|
||||
*
|
||||
* This interface function registers the cpufreq cooling device with the name
|
||||
* "thermal-cpufreq-%x". This api can support multiple instances of cpufreq
|
||||
|
|
@ -606,12 +535,12 @@ static unsigned int find_next_max(struct cpufreq_frequency_table *table,
|
|||
*/
|
||||
static struct thermal_cooling_device *
|
||||
__cpufreq_cooling_register(struct device_node *np,
|
||||
struct cpufreq_policy *policy, u32 capacitance)
|
||||
struct cpufreq_policy *policy, bool try_model)
|
||||
{
|
||||
struct thermal_cooling_device *cdev;
|
||||
struct cpufreq_cooling_device *cpufreq_cdev;
|
||||
char dev_name[THERMAL_NAME_LENGTH];
|
||||
unsigned int freq, i, num_cpus;
|
||||
unsigned int i, num_cpus;
|
||||
int ret;
|
||||
struct thermal_cooling_device_ops *cooling_ops;
|
||||
bool first;
|
||||
|
|
@ -645,54 +574,36 @@ __cpufreq_cooling_register(struct device_node *np,
|
|||
/* max_level is an index, not a counter */
|
||||
cpufreq_cdev->max_level = i - 1;
|
||||
|
||||
cpufreq_cdev->freq_table = kmalloc_array(i,
|
||||
sizeof(*cpufreq_cdev->freq_table),
|
||||
GFP_KERNEL);
|
||||
if (!cpufreq_cdev->freq_table) {
|
||||
cdev = ERR_PTR(-ENOMEM);
|
||||
goto free_idle_time;
|
||||
}
|
||||
#ifdef CONFIG_ENERGY_MODEL
|
||||
if (try_model) {
|
||||
struct em_perf_domain *em = em_cpu_get(policy->cpu);
|
||||
|
||||
if (!em || !cpumask_equal(policy->cpus, to_cpumask(em->cpus))) {
|
||||
cdev = ERR_PTR(-EINVAL);
|
||||
goto free_idle_time;
|
||||
}
|
||||
cpufreq_cdev->em = em;
|
||||
cooling_ops = &cpufreq_power_cooling_ops;
|
||||
} else
|
||||
#endif
|
||||
cooling_ops = &cpufreq_cooling_ops;
|
||||
|
||||
ret = ida_simple_get(&cpufreq_ida, 0, 0, GFP_KERNEL);
|
||||
if (ret < 0) {
|
||||
cdev = ERR_PTR(ret);
|
||||
goto free_table;
|
||||
goto free_idle_time;
|
||||
}
|
||||
cpufreq_cdev->id = ret;
|
||||
|
||||
snprintf(dev_name, sizeof(dev_name), "thermal-cpufreq-%d",
|
||||
cpufreq_cdev->id);
|
||||
|
||||
/* Fill freq-table in descending order of frequencies */
|
||||
for (i = 0, freq = -1; i <= cpufreq_cdev->max_level; i++) {
|
||||
freq = find_next_max(policy->freq_table, freq);
|
||||
cpufreq_cdev->freq_table[i].frequency = freq;
|
||||
|
||||
/* Warn for duplicate entries */
|
||||
if (!freq)
|
||||
pr_warn("%s: table has duplicate entries\n", __func__);
|
||||
else
|
||||
pr_debug("%s: freq:%u KHz\n", __func__, freq);
|
||||
}
|
||||
|
||||
if (capacitance) {
|
||||
ret = update_freq_table(cpufreq_cdev, capacitance);
|
||||
if (ret) {
|
||||
cdev = ERR_PTR(ret);
|
||||
goto remove_ida;
|
||||
}
|
||||
|
||||
cooling_ops = &cpufreq_power_cooling_ops;
|
||||
} else {
|
||||
cooling_ops = &cpufreq_cooling_ops;
|
||||
}
|
||||
|
||||
cdev = thermal_of_cooling_device_register(np, dev_name, cpufreq_cdev,
|
||||
cooling_ops);
|
||||
if (IS_ERR(cdev))
|
||||
goto remove_ida;
|
||||
|
||||
cpufreq_cdev->clipped_freq = cpufreq_cdev->freq_table[0].frequency;
|
||||
cpufreq_cdev->clipped_freq = get_state_freq(cpufreq_cdev, 0);
|
||||
cpufreq_cdev->cdev = cdev;
|
||||
|
||||
mutex_lock(&cooling_list_lock);
|
||||
|
|
@ -709,8 +620,6 @@ __cpufreq_cooling_register(struct device_node *np,
|
|||
|
||||
remove_ida:
|
||||
ida_simple_remove(&cpufreq_ida, cpufreq_cdev->id);
|
||||
free_table:
|
||||
kfree(cpufreq_cdev->freq_table);
|
||||
free_idle_time:
|
||||
kfree(cpufreq_cdev->idle_time);
|
||||
free_cdev:
|
||||
|
|
@ -732,7 +641,7 @@ __cpufreq_cooling_register(struct device_node *np,
|
|||
struct thermal_cooling_device *
|
||||
cpufreq_cooling_register(struct cpufreq_policy *policy)
|
||||
{
|
||||
return __cpufreq_cooling_register(NULL, policy, 0);
|
||||
return __cpufreq_cooling_register(NULL, policy, false);
|
||||
}
|
||||
EXPORT_SYMBOL_GPL(cpufreq_cooling_register);
|
||||
|
||||
|
|
@ -760,7 +669,6 @@ of_cpufreq_cooling_register(struct cpufreq_policy *policy)
|
|||
{
|
||||
struct device_node *np = of_get_cpu_node(policy->cpu, NULL);
|
||||
struct thermal_cooling_device *cdev = NULL;
|
||||
u32 capacitance = 0;
|
||||
|
||||
if (!np) {
|
||||
pr_err("cpu_cooling: OF node not available for cpu%d\n",
|
||||
|
|
@ -769,10 +677,7 @@ of_cpufreq_cooling_register(struct cpufreq_policy *policy)
|
|||
}
|
||||
|
||||
if (of_find_property(np, "#cooling-cells", NULL)) {
|
||||
of_property_read_u32(np, "dynamic-power-coefficient",
|
||||
&capacitance);
|
||||
|
||||
cdev = __cpufreq_cooling_register(np, policy, capacitance);
|
||||
cdev = __cpufreq_cooling_register(np, policy, true);
|
||||
if (IS_ERR(cdev)) {
|
||||
pr_err("cpu_cooling: cpu%d is not running as cooling device: %ld\n",
|
||||
policy->cpu, PTR_ERR(cdev));
|
||||
|
|
@ -814,7 +719,6 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
|
|||
thermal_cooling_device_unregister(cpufreq_cdev->cdev);
|
||||
ida_simple_remove(&cpufreq_ida, cpufreq_cdev->id);
|
||||
kfree(cpufreq_cdev->idle_time);
|
||||
kfree(cpufreq_cdev->freq_table);
|
||||
kfree(cpufreq_cdev);
|
||||
}
|
||||
EXPORT_SYMBOL_GPL(cpufreq_cooling_unregister);
|
||||
|
|
|
|||
|
|
@ -9,6 +9,7 @@
|
|||
#include <linux/percpu.h>
|
||||
|
||||
void topology_normalize_cpu_scale(void);
|
||||
int topology_update_cpu_topology(void);
|
||||
|
||||
struct device_node;
|
||||
bool topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu);
|
||||
|
|
@ -32,4 +33,12 @@ unsigned long topology_get_freq_scale(int cpu)
|
|||
return per_cpu(freq_scale, cpu);
|
||||
}
|
||||
|
||||
DECLARE_PER_CPU(unsigned long, max_freq_scale);
|
||||
|
||||
static inline
|
||||
unsigned long topology_get_max_freq_scale(struct sched_domain *sd, int cpu)
|
||||
{
|
||||
return per_cpu(max_freq_scale, cpu);
|
||||
}
|
||||
|
||||
#endif /* _LINUX_ARCH_TOPOLOGY_H_ */
|
||||
|
|
|
|||
|
|
@ -21,6 +21,10 @@ SUBSYS(cpu)
|
|||
SUBSYS(cpuacct)
|
||||
#endif
|
||||
|
||||
#if IS_ENABLED(CONFIG_SCHED_TUNE)
|
||||
SUBSYS(schedtune)
|
||||
#endif
|
||||
|
||||
#if IS_ENABLED(CONFIG_BLK_CGROUP)
|
||||
SUBSYS(io)
|
||||
#endif
|
||||
|
|
|
|||
|
|
@ -955,6 +955,8 @@ extern unsigned int arch_freq_get_on_cpu(int cpu);
|
|||
|
||||
extern void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
|
||||
unsigned long max_freq);
|
||||
extern void arch_set_max_freq_scale(struct cpumask *cpus,
|
||||
unsigned long policy_max_freq);
|
||||
|
||||
/* the following are really really optional */
|
||||
extern struct freq_attr cpufreq_freq_attr_scaling_available_freqs;
|
||||
|
|
|
|||
|
|
@ -219,7 +219,7 @@ static inline void cpuidle_use_deepest_state(bool enable)
|
|||
#endif
|
||||
|
||||
/* kernel/sched/idle.c */
|
||||
extern void sched_idle_set_state(struct cpuidle_state *idle_state);
|
||||
extern void sched_idle_set_state(struct cpuidle_state *idle_state, int index);
|
||||
extern void default_idle_call(void);
|
||||
|
||||
#ifdef CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED
|
||||
|
|
|
|||
189
include/linux/energy_model.h
Normal file
|
|
@ -0,0 +1,189 @@
|
|||
/* SPDX-License-Identifier: GPL-2.0 */
|
||||
#ifndef _LINUX_ENERGY_MODEL_H
|
||||
#define _LINUX_ENERGY_MODEL_H
|
||||
#include <linux/cpumask.h>
|
||||
#include <linux/jump_label.h>
|
||||
#include <linux/kobject.h>
|
||||
#include <linux/rcupdate.h>
|
||||
#include <linux/sched/cpufreq.h>
|
||||
#include <linux/sched/topology.h>
|
||||
#include <linux/types.h>
|
||||
|
||||
#ifdef CONFIG_ENERGY_MODEL
|
||||
/**
|
||||
* em_cap_state - Capacity state of a performance domain
|
||||
* @frequency: The CPU frequency in KHz, for consistency with CPUFreq
|
||||
* @power: The power consumed by 1 CPU at this level, in milli-watts
|
||||
* @cost: The cost coefficient associated with this level, used during
|
||||
* energy calculation. Equal to: power * max_frequency / frequency
|
||||
*/
|
||||
struct em_cap_state {
|
||||
unsigned long frequency;
|
||||
unsigned long power;
|
||||
unsigned long cost;
|
||||
};
|
||||
|
||||
/**
|
||||
* em_perf_domain - Performance domain
|
||||
* @table: List of capacity states, in ascending order
|
||||
* @nr_cap_states: Number of capacity states
|
||||
* @kobj: Kobject used to expose the domain in sysfs
|
||||
* @cpus: Cpumask covering the CPUs of the domain
|
||||
*
|
||||
* A "performance domain" represents a group of CPUs whose performance is
|
||||
* scaled together. All CPUs of a performance domain must have the same
|
||||
* micro-architecture. Performance domains often have a 1-to-1 mapping with
|
||||
* CPUFreq policies.
|
||||
*/
|
||||
struct em_perf_domain {
|
||||
struct em_cap_state *table;
|
||||
int nr_cap_states;
|
||||
struct kobject kobj;
|
||||
unsigned long cpus[0];
|
||||
};
|
||||
|
||||
#define EM_CPU_MAX_POWER 0xFFFF
|
||||
|
||||
struct em_data_callback {
|
||||
/**
|
||||
* active_power() - Provide power at the next capacity state of a CPU
|
||||
* @power : Active power at the capacity state in mW (modified)
|
||||
* @freq : Frequency at the capacity state in kHz (modified)
|
||||
* @cpu : CPU for which we do this operation
|
||||
*
|
||||
* active_power() must find the lowest capacity state of 'cpu' above
|
||||
* 'freq' and update 'power' and 'freq' to the matching active power
|
||||
* and frequency.
|
||||
*
|
||||
* The power is the one of a single CPU in the domain, expressed in
|
||||
* milli-watts. It is expected to fit in the [0, EM_CPU_MAX_POWER]
|
||||
* range.
|
||||
*
|
||||
* Return 0 on success.
|
||||
*/
|
||||
int (*active_power)(unsigned long *power, unsigned long *freq, int cpu);
|
||||
};
|
||||
#define EM_DATA_CB(_active_power_cb) { .active_power = &_active_power_cb }
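The active_power() contract described above can be illustrated with a small userspace sketch. Everything here is hypothetical: the demo_opp table, its values, and demo_active_power() are illustration names, not part of the kernel API. The sketch also reads "above 'freq'" as "at or above 'freq'", which is an assumption about the intended semantics.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical operating points of one CPU: frequency in kHz, power in mW. */
struct demo_opp {
	unsigned long khz;
	unsigned long mw;
};

/* Illustrative values only, sorted by ascending frequency. */
static const struct demo_opp demo_opps[] = {
	{  500000,  100 },
	{ 1000000,  300 },
	{ 2000000, 1000 },
};

/*
 * Same signature as the active_power() callback: find the lowest state
 * at or above *freq, then overwrite *power and *freq with that state.
 * Returns 0 on success, -EINVAL when no state is high enough.
 */
static int demo_active_power(unsigned long *power, unsigned long *freq, int cpu)
{
	size_t i;

	(void)cpu; /* a single table serves every CPU in this sketch */
	assert(power && freq);

	for (i = 0; i < sizeof(demo_opps) / sizeof(demo_opps[0]); i++) {
		if (demo_opps[i].khz >= *freq) {
			*power = demo_opps[i].mw;
			*freq = demo_opps[i].khz;
			return 0;
		}
	}
	return -EINVAL;
}
```

A caller could enumerate the table by starting at *freq = 0 and asking for the returned frequency plus one on each subsequent call, stopping at the first non-zero return.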
|
||||
|
||||
struct em_perf_domain *em_cpu_get(int cpu);
|
||||
int em_register_perf_domain(cpumask_t *span, unsigned int nr_states,
|
||||
struct em_data_callback *cb);
|
||||
|
||||
/**
|
||||
* em_pd_energy() - Estimates the energy consumed by the CPUs of a perf. domain
|
||||
* @pd : performance domain for which energy has to be estimated
|
||||
* @max_util : highest utilization among CPUs of the domain
|
||||
* @sum_util : sum of the utilization of all CPUs in the domain
|
||||
*
|
||||
* Return: the sum of the energy consumed by the CPUs of the domain assuming
|
||||
* a capacity state satisfying the max utilization of the domain.
|
||||
*/
|
||||
static inline unsigned long em_pd_energy(struct em_perf_domain *pd,
|
||||
unsigned long max_util, unsigned long sum_util)
|
||||
{
|
||||
unsigned long freq, scale_cpu;
|
||||
struct em_cap_state *cs;
|
||||
int i, cpu;
|
||||
|
||||
/*
|
||||
* In order to predict the capacity state, map the utilization of the
|
||||
* most utilized CPU of the performance domain to a requested frequency,
|
||||
* like schedutil.
|
||||
*/
|
||||
cpu = cpumask_first(to_cpumask(pd->cpus));
|
||||
scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
|
||||
cs = &pd->table[pd->nr_cap_states - 1];
|
||||
freq = map_util_freq(max_util, cs->frequency, scale_cpu);
|
||||
|
||||
/*
|
||||
* Find the lowest capacity state of the Energy Model above the
|
||||
* requested frequency.
|
||||
*/
|
||||
for (i = 0; i < pd->nr_cap_states; i++) {
|
||||
cs = &pd->table[i];
|
||||
if (cs->frequency >= freq)
|
||||
break;
|
||||
}
|
||||
|
||||
/*
|
||||
* The capacity of a CPU in the domain at that capacity state (cs)
|
||||
* can be computed as:
|
||||
*
|
||||
*   cs->cap = (cs->freq * scale_cpu) / cpu_max_freq                (1)
|
||||
*
|
||||
* So, ignoring the costs of idle states (which are not available in
|
||||
* the EM), the energy consumed by this CPU at that capacity state is
|
||||
* estimated as:
|
||||
*
|
||||
*   cpu_nrg = (cs->power * cpu_util) / cs->cap                     (2)
|
||||
*
|
||||
* since 'cpu_util / cs->cap' represents its percentage of busy time.
|
||||
*
|
||||
* NOTE: Although the result of this computation actually is in
|
||||
* units of power, it can be manipulated as an energy value
|
||||
* over a scheduling period, since it is assumed to be
|
||||
* constant during that interval.
|
||||
*
|
||||
* By injecting (1) in (2), 'cpu_nrg' can be re-expressed as a product
|
||||
* of two terms:
|
||||
*
|
||||
*   cpu_nrg = (cs->power * cpu_max_freq / cs->freq) * (cpu_util / scale_cpu)   (3)
|
||||
*
|
||||
* The first term is static, and is stored in the em_cap_state struct
|
||||
* as 'cs->cost'.
|
||||
*
|
||||
* Since all CPUs of the domain have the same micro-architecture, they
|
||||
* share the same 'cs->cost', and the same CPU capacity. Hence, the
|
||||
* total energy of the domain (which is the simple sum of the energy of
|
||||
* all of its CPUs) can be factorized as:
|
||||
*
|
||||
*   pd_nrg = (cs->cost * \Sum cpu_util) / scale_cpu                (4)
|
||||
*/
|
||||
return cs->cost * sum_util / scale_cpu;
|
||||
}
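The estimate in em_pd_energy() can be reproduced outside the kernel to check the arithmetic. The sketch below is a userspace re-implementation under stated assumptions: demo_cap_state, demo_fill_costs() and demo_pd_energy() are hypothetical names, the capacity is passed in directly instead of being read through arch_scale_cpu_capacity(), and the table values used in the note are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct em_cap_state. */
struct demo_cap_state {
	unsigned long frequency; /* kHz */
	unsigned long power;     /* mW, for one CPU */
	unsigned long cost;      /* power * max_frequency / frequency */
};

/* Same arithmetic as map_util_freq(): 1.25 * freq * util / cap. */
static unsigned long em_demo_map_util_freq(unsigned long util,
					   unsigned long freq,
					   unsigned long cap)
{
	return (freq + (freq >> 2)) * util / cap;
}

/* Precompute the static term of equation (3) for every state. */
static void demo_fill_costs(struct demo_cap_state *table, size_t nr)
{
	unsigned long fmax;
	size_t i;

	assert(nr > 0);
	fmax = table[nr - 1].frequency; /* table is in ascending order */
	for (i = 0; i < nr; i++)
		table[i].cost = table[i].power * fmax / table[i].frequency;
}

/* Equation (4): cost of the selected state, scaled by the summed utilization. */
static unsigned long demo_pd_energy(const struct demo_cap_state *table,
				    size_t nr, unsigned long scale_cpu,
				    unsigned long max_util,
				    unsigned long sum_util)
{
	const struct demo_cap_state *cs;
	unsigned long freq;
	size_t i;

	assert(nr > 0 && scale_cpu > 0);

	/* Map the busiest CPU's utilization to a requested frequency. */
	cs = &table[nr - 1];
	freq = em_demo_map_util_freq(max_util, cs->frequency, scale_cpu);

	/* Pick the lowest state at or above the request, else the highest. */
	for (i = 0; i < nr; i++) {
		cs = &table[i];
		if (cs->frequency >= freq)
			break;
	}

	return cs->cost * sum_util / scale_cpu;
}
```

With states (500000 kHz, 100 mW), (1000000 kHz, 300 mW) and (2000000 kHz, 1000 mW), the costs come out as 400, 600 and 1000. For scale_cpu = 1024, max_util = 256 and sum_util = 512, the requested frequency is 625000 kHz, the 1000000 kHz state is selected, and the estimate is 600 * 512 / 1024 = 300.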
|
||||
|
||||
/**
|
||||
* em_pd_nr_cap_states() - Get the number of capacity states of a perf. domain
|
||||
* @pd : performance domain for which this must be done
|
||||
*
|
||||
* Return: the number of capacity states in the performance domain table
|
||||
*/
|
||||
static inline int em_pd_nr_cap_states(struct em_perf_domain *pd)
|
||||
{
|
||||
return pd->nr_cap_states;
|
||||
}
|
||||
|
||||
#else
|
||||
struct em_perf_domain {};
|
||||
struct em_data_callback {};
|
||||
#define EM_DATA_CB(_active_power_cb) { }
|
||||
|
||||
static inline int em_register_perf_domain(cpumask_t *span,
|
||||
unsigned int nr_states, struct em_data_callback *cb)
|
||||
{
|
||||
return -EINVAL;
|
||||
}
|
||||
static inline struct em_perf_domain *em_cpu_get(int cpu)
|
||||
{
|
||||
return NULL;
|
||||
}
|
||||
static inline unsigned long em_pd_energy(struct em_perf_domain *pd,
|
||||
unsigned long max_util, unsigned long sum_util)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
static inline int em_pd_nr_cap_states(struct em_perf_domain *pd)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
#endif
|
||||
|
||||
#endif
|
||||
|
|
@ -301,6 +301,7 @@ int dev_pm_opp_of_get_sharing_cpus(struct device *cpu_dev, struct cpumask *cpuma
|
|||
struct device_node *dev_pm_opp_of_get_opp_desc_node(struct device *dev);
|
||||
struct dev_pm_opp *of_dev_pm_opp_find_required_opp(struct device *dev, struct device_node *np);
|
||||
struct device_node *dev_pm_opp_get_of_node(struct dev_pm_opp *opp);
|
||||
int of_dev_pm_opp_get_cpu_power(unsigned long *mW, unsigned long *KHz, int cpu);
|
||||
#else
|
||||
static inline int dev_pm_opp_of_add_table(struct device *dev)
|
||||
{
|
||||
|
|
@ -343,6 +344,10 @@ static inline struct device_node *dev_pm_opp_get_of_node(struct dev_pm_opp *opp)
|
|||
{
|
||||
return NULL;
|
||||
}
|
||||
static inline int of_dev_pm_opp_get_cpu_power(unsigned long *mW, unsigned long *KHz, int cpu)
|
||||
{
|
||||
return -ENOTSUPP;
|
||||
}
|
||||
#endif
|
||||
|
||||
#endif /* __LINUX_OPP_H__ */
|
||||
|
|
|
|||
|
|
@ -2,6 +2,7 @@
|
|||
#ifndef _LINUX_SCHED_CPUFREQ_H
|
||||
#define _LINUX_SCHED_CPUFREQ_H
|
||||
|
||||
#include <linux/cpufreq.h>
|
||||
#include <linux/types.h>
|
||||
|
||||
/*
|
||||
|
|
@ -20,6 +21,20 @@ void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
|
|||
void (*func)(struct update_util_data *data, u64 time,
|
||||
unsigned int flags));
|
||||
void cpufreq_remove_update_util_hook(int cpu);
|
||||
|
||||
static inline unsigned long map_util_freq(unsigned long util,
|
||||
unsigned long freq, unsigned long cap)
|
||||
{
|
||||
return (freq + (freq >> 2)) * util / cap;
|
||||
}
|
||||
#endif /* CONFIG_CPU_FREQ */
|
||||
|
||||
#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
|
||||
void sched_cpufreq_governor_change(struct cpufreq_policy *policy,
|
||||
struct cpufreq_governor *old_gov);
|
||||
#else
|
||||
static inline void sched_cpufreq_governor_change(struct cpufreq_policy *policy,
|
||||
struct cpufreq_governor *old_gov) { }
|
||||
#endif
|
||||
|
||||
#endif /* _LINUX_SCHED_CPUFREQ_H */
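The map_util_freq() helper added in this header requests 1.25 times the frequency that would be strictly proportional to utilization, so the maximum frequency is reached at roughly 80% utilization rather than 100%. The userspace sketch below copies that arithmetic to make the headroom visible. demo_map_util_freq() and demo_saturation_util() are hypothetical names used only for illustration.

```c
#include <assert.h>

/* Same arithmetic as map_util_freq(): (freq + freq / 4) * util / cap. */
static unsigned long demo_map_util_freq(unsigned long util,
					unsigned long freq,
					unsigned long cap)
{
	assert(cap > 0);
	return (freq + (freq >> 2)) * util / cap;
}

/*
 * Hypothetical helper: smallest utilization whose mapped request
 * reaches max_freq, found by a linear scan over [0, cap].
 */
static unsigned long demo_saturation_util(unsigned long max_freq,
					  unsigned long cap)
{
	unsigned long util;

	for (util = 0; util <= cap; util++) {
		if (demo_map_util_freq(util, max_freq, cap) >= max_freq)
			return util;
	}
	return cap;
}
```

For a maximum frequency of 2000000 kHz and a capacity of 1024, a utilization of 512 maps to a request of 1250000 kHz, and the request first reaches the maximum frequency at a utilization of 820, which is about 80% of the capacity.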
|
||||
|
|
|
|||
|
|
@ -22,6 +22,8 @@ enum { sysctl_hung_task_timeout_secs = 0 };
|
|||
|
||||
extern unsigned int sysctl_sched_latency;
|
||||
extern unsigned int sysctl_sched_min_granularity;
|
||||
extern unsigned int sysctl_sched_sync_hint_enable;
|
||||
extern unsigned int sysctl_sched_cstate_aware;
|
||||
extern unsigned int sysctl_sched_wakeup_granularity;
|
||||
extern unsigned int sysctl_sched_child_runs_first;
|
||||
|
||||
|
|
@ -83,4 +85,11 @@ extern int sysctl_schedstats(struct ctl_table *table, int write,
|
|||
void __user *buffer, size_t *lenp,
|
||||
loff_t *ppos);
|
||||
|
||||
#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
|
||||
extern unsigned int sysctl_sched_energy_aware;
|
||||
extern int sched_energy_aware_handler(struct ctl_table *table, int write,
|
||||
void __user *buffer, size_t *lenp,
|
||||
loff_t *ppos);
|
||||
#endif
|
||||
|
||||
#endif /* _LINUX_SCHED_SYSCTL_H */
|
||||
|
|
|
|||
|
|
@ -23,10 +23,10 @@
|
|||
#define SD_BALANCE_FORK 0x0008 /* Balance on fork, clone */
|
||||
#define SD_BALANCE_WAKE 0x0010 /* Balance on wakeup */
|
||||
#define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */
|
||||
#define SD_ASYM_CPUCAPACITY 0x0040 /* Groups have different max cpu capacities */
|
||||
#define SD_SHARE_CPUCAPACITY 0x0080 /* Domain members share cpu capacity */
|
||||
#define SD_ASYM_CPUCAPACITY 0x0040 /* Domain members have different CPU capacities */
|
||||
#define SD_SHARE_CPUCAPACITY 0x0080 /* Domain members share CPU capacity */
|
||||
#define SD_SHARE_POWERDOMAIN 0x0100 /* Domain members share power domain */
|
||||
#define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */
|
||||
#define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share CPU pkg resources */
|
||||
#define SD_SERIALIZE 0x0400 /* Only a single load balancing instance */
|
||||
#define SD_ASYM_PACKING 0x0800 /* Place busy groups earlier in the domain */
|
||||
#define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling domain */
|
||||
|
|
@ -202,6 +202,17 @@ extern void set_sched_topology(struct sched_domain_topology_level *tl);
|
|||
# define SD_INIT_NAME(type)
|
||||
#endif
|
||||
|
||||
#ifndef arch_scale_cpu_capacity
|
||||
static __always_inline
|
||||
unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
|
||||
{
|
||||
if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
|
||||
return sd->smt_gain / sd->span_weight;
|
||||
|
||||
return SCHED_CAPACITY_SCALE;
|
||||
}
|
||||
#endif
|
||||
|
||||
#else /* CONFIG_SMP */
|
||||
|
||||
struct sched_domain_attr;
|
||||
|
|
@ -217,6 +228,14 @@ static inline bool cpus_share_cache(int this_cpu, int that_cpu)
|
|||
return true;
|
||||
}
|
||||
|
||||
#ifndef arch_scale_cpu_capacity
|
||||
static __always_inline
|
||||
unsigned long arch_scale_cpu_capacity(void __always_unused *sd, int cpu)
|
||||
{
|
||||
return SCHED_CAPACITY_SCALE;
|
||||
}
|
||||
#endif
|
||||
|
||||
#endif /* !CONFIG_SMP */
|
||||
|
||||
static inline int task_node(const struct task_struct *p)
|
||||
|
|
|
|||
|
|
@ -34,6 +34,7 @@
|
|||
struct wake_q_head {
|
||||
struct wake_q_node *first;
|
||||
struct wake_q_node **lastp;
|
||||
int count;
|
||||
};
|
||||
|
||||
#define WAKE_Q_TAIL ((struct wake_q_node *) 0x01)
|
||||
|
|
@ -45,6 +46,7 @@ static inline void wake_q_init(struct wake_q_head *head)
|
|||
{
|
||||
head->first = WAKE_Q_TAIL;
|
||||
head->lastp = &head->first;
|
||||
head->count = 0;
|
||||
}
|
||||
|
||||
extern void wake_q_add(struct wake_q_head *head,
|
||||
|
|
|
|||
|
|
@ -91,6 +91,8 @@ struct scmi_clk_ops {
|
|||
* to sustained performance level mapping
|
||||
* @freq_get: gets the frequency for a given device using sustained frequency
|
||||
* to sustained performance level mapping
|
||||
* @est_power_get: gets the estimated power cost for a given performance domain
|
||||
* at a given frequency
|
||||
*/
|
||||
struct scmi_perf_ops {
|
||||
int (*limits_set)(const struct scmi_handle *handle, u32 domain,
|
||||
|
|
@ -110,6 +112,8 @@ struct scmi_perf_ops {
|
|||
unsigned long rate, bool poll);
|
||||
int (*freq_get)(const struct scmi_handle *handle, u32 domain,
|
||||
unsigned long *rate, bool poll);
|
||||
int (*est_power_get)(const struct scmi_handle *handle, u32 domain,
|
||||
unsigned long *rate, unsigned long *power);
|
||||
};
|
||||
|
||||
/**
|
||||
|
|
|
|||
|
|
@ -572,6 +572,423 @@ TRACE_EVENT(sched_wake_idle_without_ipi,
|
|||
|
||||
TP_printk("cpu=%d", __entry->cpu)
|
||||
);
|
||||
|
||||
#ifdef CONFIG_SMP
|
||||
#ifdef CREATE_TRACE_POINTS
|
||||
static inline
|
||||
int __trace_sched_cpu(struct cfs_rq *cfs_rq, struct sched_entity *se)
|
||||
{
|
||||
#ifdef CONFIG_FAIR_GROUP_SCHED
|
||||
struct rq *rq = cfs_rq ? cfs_rq->rq : NULL;
|
||||
#else
|
||||
struct rq *rq = cfs_rq ? container_of(cfs_rq, struct rq, cfs) : NULL;
|
||||
#endif
|
||||
return rq ? cpu_of(rq)
|
||||
: task_cpu((container_of(se, struct task_struct, se)));
|
||||
}
|
||||
|
||||
static inline
|
||||
int __trace_sched_path(struct cfs_rq *cfs_rq, char *path, int len)
|
||||
{
|
||||
#ifdef CONFIG_FAIR_GROUP_SCHED
|
||||
int l = path ? len : 0;
|
||||
|
||||
if (cfs_rq && task_group_is_autogroup(cfs_rq->tg))
|
||||
return autogroup_path(cfs_rq->tg, path, l) + 1;
|
||||
else if (cfs_rq && cfs_rq->tg->css.cgroup)
|
||||
return cgroup_path(cfs_rq->tg->css.cgroup, path, l) + 1;
|
||||
#endif
|
||||
if (path)
|
||||
strcpy(path, "(null)");
|
||||
|
||||
return strlen("(null)");
|
||||
}
|
||||
|
||||
static inline
|
||||
struct cfs_rq *__trace_sched_group_cfs_rq(struct sched_entity *se)
|
||||
{
|
||||
#ifdef CONFIG_FAIR_GROUP_SCHED
|
||||
return se->my_q;
|
||||
#else
|
||||
return NULL;
|
||||
#endif
|
||||
}
|
||||
#endif /* CREATE_TRACE_POINTS */
|
||||
|
||||
/*
|
||||
* Tracepoint for cfs_rq load tracking:
|
||||
*/
|
||||
TRACE_EVENT(sched_load_cfs_rq,
|
||||
|
||||
TP_PROTO(struct cfs_rq *cfs_rq),
|
||||
|
||||
TP_ARGS(cfs_rq),
|
||||
|
||||
TP_STRUCT__entry(
|
||||
__field( int, cpu )
|
||||
__dynamic_array(char, path,
|
||||
__trace_sched_path(cfs_rq, NULL, 0) )
|
||||
__field( unsigned long, load )
|
||||
__field( unsigned long, rbl_load )
|
||||
__field( unsigned long, util )
|
||||
),
|
||||
|
||||
TP_fast_assign(
|
||||
__entry->cpu = __trace_sched_cpu(cfs_rq, NULL);
|
||||
__trace_sched_path(cfs_rq, __get_dynamic_array(path),
|
||||
__get_dynamic_array_len(path));
|
||||
__entry->load = cfs_rq->avg.load_avg;
|
||||
__entry->rbl_load = cfs_rq->avg.runnable_load_avg;
|
||||
__entry->util = cfs_rq->avg.util_avg;
|
||||
),
|
||||
|
||||
TP_printk("cpu=%d path=%s load=%lu rbl_load=%lu util=%lu",
|
||||
__entry->cpu, __get_str(path), __entry->load,
|
||||
__entry->rbl_load, __entry->util)
|
||||
);
|
||||
|
||||
/*
|
||||
* Tracepoint for rt_rq load tracking:
|
||||
*/
|
||||
struct rq;
|
||||
TRACE_EVENT(sched_load_rt_rq,
|
||||
|
||||
TP_PROTO(struct rq *rq),
|
||||
|
||||
TP_ARGS(rq),
|
||||
|
||||
TP_STRUCT__entry(
|
||||
__field( int, cpu )
|
||||
__field( unsigned long, util )
|
||||
),
|
||||
|
||||
TP_fast_assign(
|
||||
__entry->cpu = rq->cpu;
|
||||
__entry->util = rq->avg_rt.util_avg;
|
||||
),
|
||||
|
||||
TP_printk("cpu=%d util=%lu", __entry->cpu,
|
||||
__entry->util)
|
||||
);
|
||||
|
||||
/*
|
||||
* Tracepoint for sched_entity load tracking:
|
||||
*/
|
||||
TRACE_EVENT(sched_load_se,
|
||||
|
||||
TP_PROTO(struct sched_entity *se),
|
||||
|
||||
TP_ARGS(se),
|
||||
|
||||
TP_STRUCT__entry(
|
||||
__field( int, cpu )
|
||||
__dynamic_array(char, path,
|
||||
__trace_sched_path(__trace_sched_group_cfs_rq(se), NULL, 0) )
|
||||
__array( char, comm, TASK_COMM_LEN )
|
||||
__field( pid_t, pid )
|
||||
__field( unsigned long, load )
|
||||
__field( unsigned long, rbl_load )
|
||||
__field( unsigned long, util )
|
||||
),
|
||||
|
||||
TP_fast_assign(
|
||||
struct cfs_rq *gcfs_rq = __trace_sched_group_cfs_rq(se);
|
||||
struct task_struct *p = gcfs_rq ? NULL
|
||||
: container_of(se, struct task_struct, se);
|
||||
|
||||
__entry->cpu = __trace_sched_cpu(gcfs_rq, se);
|
||||
__trace_sched_path(gcfs_rq, __get_dynamic_array(path),
|
||||
__get_dynamic_array_len(path));
|
||||
memcpy(__entry->comm, p ? p->comm : "(null)", TASK_COMM_LEN);
|
||||
__entry->pid = p ? p->pid : -1;
|
||||
__entry->load = se->avg.load_avg;
|
||||
__entry->rbl_load = se->avg.runnable_load_avg;
|
||||
__entry->util = se->avg.util_avg;
|
||||
),
|
||||
|
||||
TP_printk("cpu=%d path=%s comm=%s pid=%d load=%lu rbl_load=%lu util=%lu",
|
||||
__entry->cpu, __get_str(path), __entry->comm, __entry->pid,
|
||||
__entry->load, __entry->rbl_load, __entry->util)
|
||||
);
|
||||
|
||||
/*
|
||||
* Tracepoint for task_group load tracking:
|
||||
*/
|
||||
#ifdef CONFIG_FAIR_GROUP_SCHED
|
||||
TRACE_EVENT(sched_load_tg,
|
||||
|
||||
TP_PROTO(struct cfs_rq *cfs_rq),
|
||||
|
||||
TP_ARGS(cfs_rq),
|
||||
|
||||
TP_STRUCT__entry(
|
||||
__field( int, cpu )
|
||||
__dynamic_array(char, path,
|
||||
__trace_sched_path(cfs_rq, NULL, 0) )
|
||||
__field( long, load )
|
||||
),
|
||||
|
||||
TP_fast_assign(
|
||||
__entry->cpu = cfs_rq->rq->cpu;
|
||||
__trace_sched_path(cfs_rq, __get_dynamic_array(path),
|
||||
__get_dynamic_array_len(path));
|
||||
__entry->load = atomic_long_read(&cfs_rq->tg->load_avg);
|
||||
),
|
||||
|
||||
TP_printk("cpu=%d path=%s load=%ld", __entry->cpu, __get_str(path),
|
||||
__entry->load)
|
||||
);
|
||||
#endif /* CONFIG_FAIR_GROUP_SCHED */
|
||||
|
||||
/*
|
||||
* Tracepoint for tasks' estimated utilization.
|
||||
*/
|
||||
TRACE_EVENT(sched_util_est_task,
|
||||
|
||||
TP_PROTO(struct task_struct *tsk, struct sched_avg *avg),
|
||||
|
||||
TP_ARGS(tsk, avg),
|
||||
|
||||
TP_STRUCT__entry(
|
||||
__array( char, comm, TASK_COMM_LEN )
|
||||
__field( pid_t, pid )
|
||||
__field( int, cpu )
|
||||
__field( unsigned int, util_avg )
|
||||
__field( unsigned int, est_enqueued )
|
||||
__field( unsigned int, est_ewma )
|
||||
|
||||
),
|
||||
|
||||
TP_fast_assign(
|
||||
memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
|
||||
__entry->pid = tsk->pid;
|
||||
__entry->cpu = task_cpu(tsk);
|
||||
__entry->util_avg = avg->util_avg;
|
||||
__entry->est_enqueued = avg->util_est.enqueued;
|
||||
__entry->est_ewma = avg->util_est.ewma;
|
||||
),
|
||||
|
||||
TP_printk("comm=%s pid=%d cpu=%d util_avg=%u util_est_ewma=%u util_est_enqueued=%u",
|
||||
__entry->comm,
|
||||
__entry->pid,
|
||||
__entry->cpu,
|
||||
__entry->util_avg,
|
||||
__entry->est_ewma,
|
||||
__entry->est_enqueued)
|
||||
);
|
||||
|
||||
/*
|
||||
* Tracepoint for root cfs_rq's estimated utilization.
|
||||
*/
|
||||
TRACE_EVENT(sched_util_est_cpu,
|
||||
|
||||
TP_PROTO(int cpu, struct cfs_rq *cfs_rq),
|
||||
|
||||
TP_ARGS(cpu, cfs_rq),
|
||||
|
||||
TP_STRUCT__entry(
|
||||
__field( int, cpu )
|
||||
__field( unsigned int, util_avg )
|
||||
__field( unsigned int, util_est_enqueued )
|
||||
),
|
||||
|
||||
TP_fast_assign(
|
||||
__entry->cpu = cpu;
|
||||
__entry->util_avg = cfs_rq->avg.util_avg;
|
||||
__entry->util_est_enqueued = cfs_rq->avg.util_est.enqueued;
|
||||
),
|
||||
|
||||
TP_printk("cpu=%d util_avg=%u util_est_enqueued=%u",
|
||||
__entry->cpu,
|
||||
__entry->util_avg,
|
||||
__entry->util_est_enqueued)
|
||||
);
|
||||
|
||||
/*
|
||||
* Tracepoint for find_best_target
|
||||
*/
|
||||
TRACE_EVENT(sched_find_best_target,
|
||||
|
||||
TP_PROTO(struct task_struct *tsk, bool prefer_idle,
|
||||
unsigned long min_util, int best_idle, int best_active,
|
||||
int target, int backup),
|
||||
|
||||
TP_ARGS(tsk, prefer_idle, min_util, best_idle,
|
||||
best_active, target, backup),
|
||||
|
||||
TP_STRUCT__entry(
|
||||
__array( char, comm, TASK_COMM_LEN )
|
||||
__field( pid_t, pid )
|
||||
__field( unsigned long, min_util )
|
||||
__field( bool, prefer_idle )
|
||||
__field( int, best_idle )
|
||||
__field( int, best_active )
|
||||
__field( int, target )
|
||||
__field( int, backup )
|
||||
),
|
||||
|
||||
TP_fast_assign(
|
||||
memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
|
||||
__entry->pid = tsk->pid;
|
||||
__entry->min_util = min_util;
|
||||
__entry->prefer_idle = prefer_idle;
|
||||
__entry->best_idle = best_idle;
|
||||
__entry->best_active = best_active;
|
||||
__entry->target = target;
|
||||
__entry->backup = backup;
|
||||
),
|
||||
|
||||
TP_printk("pid=%d comm=%s prefer_idle=%d "
|
||||
"best_idle=%d best_active=%d target=%d backup=%d",
|
||||
__entry->pid, __entry->comm, __entry->prefer_idle,
|
||||
__entry->best_idle, __entry->best_active,
|
||||
__entry->target, __entry->backup)
|
||||
);
|
||||
|
||||
/*
|
||||
* Tracepoint for accounting CPU boosted utilization
|
||||
*/
|
||||
TRACE_EVENT(sched_boost_cpu,
|
||||
|
||||
TP_PROTO(int cpu, unsigned long util, long margin),
|
||||
|
||||
TP_ARGS(cpu, util, margin),
|
||||
|
||||
TP_STRUCT__entry(
|
||||
__field( int, cpu )
|
||||
__field( unsigned long, util )
|
||||
__field(long, margin )
|
||||
),
|
||||
|
||||
TP_fast_assign(
|
||||
__entry->cpu = cpu;
|
||||
__entry->util = util;
|
||||
__entry->margin = margin;
|
||||
),
|
||||
|
||||
TP_printk("cpu=%d util=%lu margin=%ld",
|
||||
__entry->cpu,
|
||||
__entry->util,
|
||||
__entry->margin)
|
||||
);
|
||||
|
||||
/*
|
||||
* Tracepoint for schedtune_tasks_update
|
||||
*/
|
||||
TRACE_EVENT(sched_tune_tasks_update,
|
||||
|
||||
TP_PROTO(struct task_struct *tsk, int cpu, int tasks, int idx,
|
||||
int boost, int max_boost, u64 group_ts),
|
||||
|
||||
TP_ARGS(tsk, cpu, tasks, idx, boost, max_boost, group_ts),
|
||||
|
||||
TP_STRUCT__entry(
|
||||
__array( char, comm, TASK_COMM_LEN )
|
||||
__field( pid_t, pid )
|
||||
__field( int, cpu )
|
||||
__field( int, tasks )
|
||||
__field( int, idx )
|
||||
__field( int, boost )
|
||||
__field( int, max_boost )
|
||||
__field( u64, group_ts )
|
||||
),
|
||||
|
||||
TP_fast_assign(
|
||||
memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
|
||||
__entry->pid = tsk->pid;
|
||||
__entry->cpu = cpu;
|
||||
__entry->tasks = tasks;
|
||||
__entry->idx = idx;
|
||||
__entry->boost = boost;
|
||||
__entry->max_boost = max_boost;
|
||||
__entry->group_ts = group_ts;
|
||||
),
|
||||
|
||||
TP_printk("pid=%d comm=%s "
|
||||
"cpu=%d tasks=%d idx=%d boost=%d max_boost=%d timeout=%llu",
|
||||
__entry->pid, __entry->comm,
|
||||
__entry->cpu, __entry->tasks, __entry->idx,
|
||||
__entry->boost, __entry->max_boost,
|
||||
__entry->group_ts)
|
||||
);
|
||||
|
||||
/*
|
||||
* Tracepoint for schedtune_boostgroup_update
|
||||
*/
|
||||
TRACE_EVENT(sched_tune_boostgroup_update,
|
||||
|
||||
TP_PROTO(int cpu, int variation, int max_boost),
|
||||
|
||||
TP_ARGS(cpu, variation, max_boost),
|
||||
|
||||
TP_STRUCT__entry(
|
||||
__field( int, cpu )
|
||||
__field( int, variation )
|
||||
__field( int, max_boost )
|
||||
),
|
||||
|
||||
TP_fast_assign(
|
||||
__entry->cpu = cpu;
|
||||
__entry->variation = variation;
|
||||
__entry->max_boost = max_boost;
|
||||
),
|
||||
|
||||
TP_printk("cpu=%d variation=%d max_boost=%d",
|
||||
__entry->cpu, __entry->variation, __entry->max_boost)
|
||||
);
|
||||
|
||||
/*
|
||||
* Tracepoint for accounting task boosted utilization
|
||||
*/
|
||||
TRACE_EVENT(sched_boost_task,
|
||||
|
||||
TP_PROTO(struct task_struct *tsk, unsigned long util, long margin),
|
||||
|
||||
TP_ARGS(tsk, util, margin),
|
||||
|
||||
TP_STRUCT__entry(
|
||||
__array( char, comm, TASK_COMM_LEN )
|
||||
__field( pid_t, pid )
|
||||
__field( unsigned long, util )
|
||||
__field( long, margin )
|
||||
|
||||
),
|
||||
|
||||
TP_fast_assign(
|
||||
memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
|
||||
__entry->pid = tsk->pid;
|
||||
__entry->util = util;
|
||||
__entry->margin = margin;
|
||||
),
|
||||
|
||||
TP_printk("comm=%s pid=%d util=%lu margin=%ld",
|
||||
__entry->comm, __entry->pid,
|
||||
__entry->util,
|
||||
__entry->margin)
|
||||
);
|
||||
|
||||
/*
|
||||
* Tracepoint for system overutilized flag
|
||||
*/
|
||||
TRACE_EVENT(sched_overutilized,
|
||||
|
||||
TP_PROTO(int overutilized),
|
||||
|
||||
TP_ARGS(overutilized),
|
||||
|
||||
TP_STRUCT__entry(
|
||||
__field( int, overutilized )
|
||||
),
|
||||
|
||||
TP_fast_assign(
|
||||
__entry->overutilized = overutilized;
|
||||
),
|
||||
|
||||
TP_printk("overutilized=%d",
|
||||
__entry->overutilized)
|
||||
);
|
||||
|
||||
#endif /* CONFIG_SMP */
|
||||
#endif /* _TRACE_SCHED_H */
|
||||
|
||||
/* This part must be outside protection */
|
||||
|
|
|
|||
23
init/Kconfig
|
|
@ -991,6 +991,29 @@ config SCHED_AUTOGROUP
|
|||
desktop applications. Task group autogeneration is currently based
|
||||
upon task session.
|
||||
|
||||
config SCHED_TUNE
|
||||
bool "Boosting for CFS tasks (EXPERIMENTAL)"
|
||||
depends on SMP
|
||||
help
|
||||
This option enables support for task classification using a new
|
||||
cgroup controller, schedtune. Schedtune allows tasks to be given
|
||||
a boost value and marked as latency-sensitive or not. This option
|
||||
provides the "schedtune" controller.
|
||||
|
||||
This new controller:
|
||||
1. allows only a two-layer hierarchy, where the root defines the
|
||||
system-wide boost value and each of its direct children defines a
|
||||
different "class of tasks" to be boosted with a different value
|
||||
2. supports up to 16 different task classes, each of which can be
|
||||
configured with a different boost value
|
||||
|
||||
Latency-sensitive tasks are not subject to energy-aware wakeup
|
||||
task placement. The boost value assigned to tasks is used to
|
||||
influence task placement and CPU frequency selection (if
|
||||
utilization-driven frequency selection is in use).
|
||||
|
||||
If unsure, say N.
|
||||
|
||||
config SYSFS_DEPRECATED
|
||||
bool "Enable deprecated sysfs features to support old userspace tools"
|
||||
depends on SYSFS
|
||||
|
|
|
|||
|
|
@ -298,3 +298,18 @@ config PM_GENERIC_DOMAINS_OF
|
|||
|
||||
config CPU_PM
|
||||
bool
|
||||
|
||||
config ENERGY_MODEL
|
||||
bool "Energy Model for CPUs"
|
||||
depends on SMP
|
||||
depends on CPU_FREQ
|
||||
default n
|
||||
help
|
||||
Several subsystems (thermal and/or the task scheduler for example)
|
||||
can leverage information about the energy consumed by CPUs to make
|
||||
smarter decisions. This config option enables the framework from
|
||||
which subsystems can access the energy models.
|
||||
|
||||
The exact usage of the energy model is subsystem-dependent.
|
||||
|
||||
If in doubt, say N.
|
||||
|
|
|
|||
|
|
@ -15,3 +15,5 @@ obj-$(CONFIG_PM_AUTOSLEEP) += autosleep.o
|
|||
obj-$(CONFIG_PM_WAKELOCKS) += wakelock.o
|
||||
|
||||
obj-$(CONFIG_MAGIC_SYSRQ) += poweroff.o
|
||||
|
||||
obj-$(CONFIG_ENERGY_MODEL) += energy_model.o
|
||||
|
|
|
|||
291
kernel/power/energy_model.c
Normal file
|
|
@ -0,0 +1,291 @@
|
|||
// SPDX-License-Identifier: GPL-2.0
|
||||
/*
|
||||
* Energy Model of CPUs
|
||||
*
|
||||
* Copyright (c) 2018, Arm ltd.
|
||||
* Written by: Quentin Perret, Arm ltd.
|
||||
*/
|
||||
|
||||
#define pr_fmt(fmt) "energy_model: " fmt
|
||||
|
||||
#include <linux/cpu.h>
|
||||
#include <linux/cpumask.h>
|
||||
#include <linux/energy_model.h>
|
||||
#include <linux/sched/topology.h>
|
||||
#include <linux/slab.h>
|
||||
|
||||
/* Mapping of each CPU to the performance domain to which it belongs. */
|
||||
static DEFINE_PER_CPU(struct em_perf_domain *, em_data);
|
||||
|
||||
/*
|
||||
* Mutex serializing the registrations of performance domains and letting
|
||||
* callbacks defined by drivers sleep.
|
||||
*/
|
||||
static DEFINE_MUTEX(em_pd_mutex);
|
||||
|
||||
static struct kobject *em_kobject;
|
||||
|
||||
/* Getters for the attributes of em_perf_domain objects */
|
||||
struct em_pd_attr {
|
||||
struct attribute attr;
|
||||
ssize_t (*show)(struct em_perf_domain *pd, char *buf);
|
||||
ssize_t (*store)(struct em_perf_domain *pd, const char *buf, size_t s);
|
||||
};
|
||||
|
||||
#define EM_ATTR_LEN 13
|
||||
#define show_table_attr(_attr) \
|
||||
static ssize_t show_##_attr(struct em_perf_domain *pd, char *buf) \
|
||||
{ \
|
||||
ssize_t cnt = 0; \
|
||||
int i; \
|
||||
for (i = 0; i < pd->nr_cap_states; i++) { \
|
||||
if (cnt >= (ssize_t) (PAGE_SIZE / sizeof(char) \
|
||||
- (EM_ATTR_LEN + 2))) \
|
||||
goto out; \
|
||||
cnt += scnprintf(&buf[cnt], EM_ATTR_LEN + 1, "%lu ", \
|
||||
pd->table[i]._attr); \
|
||||
} \
|
||||
out: \
|
||||
cnt += sprintf(&buf[cnt], "\n"); \
|
||||
return cnt; \
|
||||
}
|
||||
|
||||
show_table_attr(power);
|
||||
show_table_attr(frequency);
|
||||
show_table_attr(cost);
|
||||
|
||||
static ssize_t show_cpus(struct em_perf_domain *pd, char *buf)
|
||||
{
|
||||
return sprintf(buf, "%*pbl\n", cpumask_pr_args(to_cpumask(pd->cpus)));
|
||||
}
|
||||
|
||||
#define pd_attr(_name) em_pd_##_name##_attr
|
||||
#define define_pd_attr(_name) static struct em_pd_attr pd_attr(_name) = \
|
||||
__ATTR(_name, 0444, show_##_name, NULL)
|
||||
|
||||
define_pd_attr(power);
|
||||
define_pd_attr(frequency);
|
||||
define_pd_attr(cost);
|
||||
define_pd_attr(cpus);
|
||||
|
||||
static struct attribute *em_pd_default_attrs[] = {
|
||||
&pd_attr(power).attr,
|
||||
&pd_attr(frequency).attr,
|
||||
&pd_attr(cost).attr,
|
||||
&pd_attr(cpus).attr,
|
||||
NULL
|
||||
};
|
||||
|
||||
#define to_pd(k) container_of(k, struct em_perf_domain, kobj)
|
||||
#define to_pd_attr(a) container_of(a, struct em_pd_attr, attr)
|
||||
|
||||
static ssize_t show(struct kobject *kobj, struct attribute *attr, char *buf)
|
||||
{
|
||||
struct em_perf_domain *pd = to_pd(kobj);
|
||||
struct em_pd_attr *pd_attr = to_pd_attr(attr);
|
||||
ssize_t ret;
|
||||
|
||||
ret = pd_attr->show(pd, buf);
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
static const struct sysfs_ops em_pd_sysfs_ops = {
|
||||
.show = show,
|
||||
};
|
||||
|
||||
static struct kobj_type ktype_em_pd = {
|
||||
.sysfs_ops = &em_pd_sysfs_ops,
|
||||
.default_attrs = em_pd_default_attrs,
|
||||
};
|
||||
|
||||
static struct em_perf_domain *em_create_pd(cpumask_t *span, int nr_states,
|
||||
struct em_data_callback *cb)
|
||||
{
|
||||
unsigned long opp_eff, prev_opp_eff = ULONG_MAX;
|
||||
unsigned long power, freq, prev_freq = 0;
|
||||
int i, ret, cpu = cpumask_first(span);
|
||||
struct em_cap_state *table;
|
||||
struct em_perf_domain *pd;
|
||||
u64 fmax;
|
||||
|
||||
if (!cb->active_power)
|
||||
return NULL;
|
||||
|
||||
pd = kzalloc(sizeof(*pd) + cpumask_size(), GFP_KERNEL);
|
||||
if (!pd)
|
||||
return NULL;
|
||||
|
||||
table = kcalloc(nr_states, sizeof(*table), GFP_KERNEL);
|
||||
if (!table)
|
||||
goto free_pd;
|
||||
|
||||
/* Build the list of capacity states for this performance domain */
|
||||
for (i = 0, freq = 0; i < nr_states; i++, freq++) {
|
||||
/*
|
||||
* active_power() is a driver callback which ceils 'freq' to
|
||||
* lowest capacity state of 'cpu' above 'freq' and updates
|
||||
* 'power' and 'freq' accordingly.
|
||||
*/
|
||||
ret = cb->active_power(&power, &freq, cpu);
|
||||
if (ret) {
|
||||
pr_err("pd%d: invalid cap. state: %d\n", cpu, ret);
|
||||
goto free_cs_table;
|
||||
}
|
||||
|
||||
/*
|
||||
* We expect the driver callback to increase the frequency for
|
||||
* higher capacity states.
|
||||
*/
|
||||
if (freq <= prev_freq) {
|
||||
pr_err("pd%d: non-increasing freq: %lu\n", cpu, freq);
|
||||
goto free_cs_table;
|
||||
}
|
||||
|
||||
/*
|
||||
* The power returned by active_power() is expected to be
|
||||
* positive, in milli-watts and to fit into 16 bits.
|
||||
*/
|
||||
if (!power || power > EM_CPU_MAX_POWER) {
|
||||
pr_err("pd%d: invalid power: %lu\n", cpu, power);
|
||||
goto free_cs_table;
|
||||
}
|
||||
|
||||
table[i].power = power;
|
||||
table[i].frequency = prev_freq = freq;
|
||||
|
||||
/*
|
||||
* The hertz/watts efficiency ratio should decrease as the
|
||||
* frequency grows on sane platforms. But this isn't always
|
||||
* true in practice so warn the user if a higher OPP is more
|
||||
* power efficient than a lower one.
|
||||
*/
|
||||
opp_eff = freq / power;
|
||||
if (opp_eff >= prev_opp_eff)
|
||||
pr_warn("pd%d: hertz/watts ratio non-monotonically decreasing: em_cap_state %d >= em_cap_state%d\n",
|
||||
cpu, i, i - 1);
|
||||
prev_opp_eff = opp_eff;
|
||||
}
|
||||
|
||||
/* Compute the cost of each capacity_state. */
|
||||
fmax = (u64) table[nr_states - 1].frequency;
|
||||
for (i = 0; i < nr_states; i++) {
|
||||
table[i].cost = div64_u64(fmax * table[i].power,
|
||||
table[i].frequency);
|
||||
}
|
||||
|
||||
pd->table = table;
|
||||
pd->nr_cap_states = nr_states;
|
||||
cpumask_copy(to_cpumask(pd->cpus), span);
|
||||
|
||||
ret = kobject_init_and_add(&pd->kobj, &ktype_em_pd, em_kobject,
|
||||
"pd%u", cpu);
|
||||
if (ret)
|
||||
pr_err("pd%d: failed kobject_init_and_add(): %d\n", cpu, ret);
|
||||
|
||||
return pd;
|
||||
|
||||
free_cs_table:
|
||||
kfree(table);
|
||||
free_pd:
|
||||
kfree(pd);
|
||||
|
||||
return NULL;
|
||||
}
|
||||
|
||||
/**
|
||||
* em_cpu_get() - Return the performance domain for a CPU
|
||||
* @cpu : CPU to find the performance domain for
|
||||
*
|
||||
* Return: the performance domain to which 'cpu' belongs, or NULL if it doesn't
|
||||
* exist.
|
||||
*/
|
||||
struct em_perf_domain *em_cpu_get(int cpu)
|
||||
{
|
||||
return READ_ONCE(per_cpu(em_data, cpu));
|
||||
}
|
||||
EXPORT_SYMBOL_GPL(em_cpu_get);
|
||||
|
||||
/**
|
||||
* em_register_perf_domain() - Register the Energy Model of a performance domain
|
||||
* @span : Mask of CPUs in the performance domain
|
||||
* @nr_states : Number of capacity states to register
|
||||
* @cb : Callback functions providing the data of the Energy Model
|
||||
*
|
||||
* Create Energy Model tables for a performance domain using the callbacks
|
||||
* defined in cb.
|
||||
*
|
||||
* If multiple clients register the same performance domain, all but the first
|
||||
* registration will be ignored.
|
||||
*
|
||||
* Return 0 on success
|
||||
*/
|
||||
int em_register_perf_domain(cpumask_t *span, unsigned int nr_states,
|
||||
struct em_data_callback *cb)
|
||||
{
|
||||
unsigned long cap, prev_cap = 0;
|
||||
struct em_perf_domain *pd;
|
||||
int cpu, ret = 0;
|
||||
|
||||
if (!span || !nr_states || !cb)
|
||||
return -EINVAL;
|
||||
|
||||
/*
|
||||
* Use a mutex to serialize the registration of performance domains and
|
||||
* let the driver-defined callback functions sleep.
|
||||
*/
|
||||
mutex_lock(&em_pd_mutex);
|
||||
|
||||
if (!em_kobject) {
|
||||
em_kobject = kobject_create_and_add("energy_model",
|
||||
&cpu_subsys.dev_root->kobj);
|
||||
if (!em_kobject) {
|
||||
ret = -ENODEV;
|
||||
goto unlock;
|
||||
}
|
||||
}
|
||||
|
||||
for_each_cpu(cpu, span) {
|
||||
/* Make sure we don't register again an existing domain. */
|
||||
if (READ_ONCE(per_cpu(em_data, cpu))) {
|
||||
ret = -EEXIST;
|
||||
goto unlock;
|
||||
}
|
||||
|
||||
/*
|
||||
* All CPUs of a domain must have the same micro-architecture
|
||||
* since they all share the same table.
|
||||
*/
|
||||
cap = arch_scale_cpu_capacity(NULL, cpu);
|
||||
if (prev_cap && prev_cap != cap) {
|
||||
pr_err("CPUs of %*pbl must have the same capacity\n",
|
||||
cpumask_pr_args(span));
|
||||
ret = -EINVAL;
|
||||
goto unlock;
|
||||
}
|
||||
prev_cap = cap;
|
||||
}
|
||||
|
||||
/* Create the performance domain and add it to the Energy Model. */
|
||||
pd = em_create_pd(span, nr_states, cb);
|
||||
if (!pd) {
|
||||
ret = -EINVAL;
|
||||
goto unlock;
|
||||
}
|
||||
|
||||
for_each_cpu(cpu, span) {
|
||||
/*
|
||||
* The per-cpu array can be read concurrently from em_cpu_get().
|
||||
* The barrier enforces the ordering needed to make sure readers
|
||||
* can only access well formed em_perf_domain structs.
|
||||
*/
|
||||
smp_store_release(per_cpu_ptr(&em_data, cpu), pd);
|
||||
}
|
||||
|
||||
pr_debug("Created perf domain %*pbl\n", cpumask_pr_args(span));
|
||||
unlock:
|
||||
mutex_unlock(&em_pd_mutex);
|
||||
|
||||
return ret;
|
||||
}
|
||||
EXPORT_SYMBOL_GPL(em_register_perf_domain);
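
Reading aid for the cost loop near the end of em_create_pd() above: each capacity state's cost is fmax * power / frequency, where fmax is the frequency of the last (highest) state in the table. The following is a minimal userspace sketch of that arithmetic, not kernel code. The names cap_state_sketch and compute_costs() are hypothetical, plain 64-bit division stands in for div64_u64(), and the frequency and power values in the usage note are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical userspace stand-in for struct em_cap_state. */
struct cap_state_sketch {
	unsigned long frequency;	/* e.g. kHz, strictly increasing */
	unsigned long power;		/* e.g. mW, non-zero */
	unsigned long cost;		/* filled in by compute_costs() */
};

/*
 * Mirror the loop in em_create_pd(): scale each state's power by
 * fmax / frequency. The caller must pass nr_states >= 1 and non-zero
 * frequencies, as the kernel code guarantees through its earlier checks.
 */
static void compute_costs(struct cap_state_sketch *table, int nr_states)
{
	uint64_t fmax = table[nr_states - 1].frequency;
	int i;

	for (i = 0; i < nr_states; i++)
		table[i].cost = (unsigned long)(fmax * table[i].power /
						table[i].frequency);
}
```

With three illustrative states of (500000, 100), (1000000, 300) and (2000000, 900), the computed costs are 400, 600 and 900. Power grows faster than frequency in this made-up table, so the cost increases with each state.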
|
||||
|
|
@ -24,6 +24,7 @@ obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o
|
|||
obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
|
||||
obj-$(CONFIG_SCHEDSTATS) += stats.o
|
||||
obj-$(CONFIG_SCHED_DEBUG) += debug.o
|
||||
obj-$(CONFIG_SCHED_TUNE) += tune.o
|
||||
obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
|
||||
obj-$(CONFIG_CPU_FREQ) += cpufreq.o
|
||||
obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
|
||||
|
|
|
|||
|
|
@ -259,7 +259,6 @@ void proc_sched_autogroup_show_task(struct task_struct *p, struct seq_file *m)
|
|||
}
|
||||
#endif /* CONFIG_PROC_FS */
|
||||
|
||||
#ifdef CONFIG_SCHED_DEBUG
|
||||
int autogroup_path(struct task_group *tg, char *buf, int buflen)
|
||||
{
|
||||
if (!task_group_is_autogroup(tg))
|
||||
|
|
@ -267,4 +266,3 @@ int autogroup_path(struct task_group *tg, char *buf, int buflen)
|
|||
|
||||
return snprintf(buf, buflen, "%s-%ld", "/autogroup", tg->autogroup->id);
|
||||
}
|
||||
#endif
|
||||
|
|
|
|||
|
|
@ -412,6 +412,8 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
|
|||
if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
|
||||
return;
|
||||
|
||||
head->count++;
|
||||
|
||||
get_task_struct(task);
|
||||
|
||||
/*
|
||||
|
|
@ -421,6 +423,10 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
|
|||
head->lastp = &node->next;
|
||||
}
|
||||
|
||||
static int
|
||||
try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags,
|
||||
int sibling_count_hint);
|
||||
|
||||
void wake_up_q(struct wake_q_head *head)
|
||||
{
|
||||
struct wake_q_node *node = head->first;
|
||||
|
|
@ -435,10 +441,10 @@ void wake_up_q(struct wake_q_head *head)
|
|||
task->wake_q.next = NULL;
|
||||
|
||||
/*
|
||||
* wake_up_process() executes a full barrier, which pairs with
|
||||
* try_to_wake_up() executes a full barrier, which pairs with
|
||||
* the queueing in wake_q_add() so as not to miss wakeups.
|
||||
*/
|
||||
wake_up_process(task);
|
||||
try_to_wake_up(task, TASK_NORMAL, 0, head->count);
|
||||
put_task_struct(task);
|
||||
}
|
||||
}
|
||||
|
|
@ -1523,12 +1529,14 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
|
|||
* The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
|
||||
*/
|
||||
static inline
|
||||
int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
|
||||
int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags,
|
||||
int sibling_count_hint)
|
||||
{
|
||||
lockdep_assert_held(&p->pi_lock);
|
||||
|
||||
if (p->nr_cpus_allowed > 1)
|
||||
cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
|
||||
cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags,
|
||||
sibling_count_hint);
|
||||
else
|
||||
cpu = cpumask_any(&p->cpus_allowed);
|
||||
|
||||
|
|
@ -1931,6 +1939,8 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
|
|||
* @p: the thread to be awakened
|
||||
* @state: the mask of task states that can be woken
|
||||
* @wake_flags: wake modifier flags (WF_*)
|
||||
* @sibling_count_hint: A hint at the number of threads that are being woken up
|
||||
* in this event.
|
||||
*
|
||||
* If (@state & @p->state) @p->state = TASK_RUNNING.
|
||||
*
|
||||
|
|
@ -1946,7 +1956,8 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
|
|||
* %false otherwise.
|
||||
*/
|
||||
static int
|
||||
try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
|
||||
try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags,
|
||||
int sibling_count_hint)
|
||||
{
|
||||
unsigned long flags;
|
||||
int cpu, success = 0;
|
||||
|
|
@ -2033,7 +2044,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
|
|||
atomic_dec(&task_rq(p)->nr_iowait);
|
||||
}
|
||||
|
||||
cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
|
||||
cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags,
|
||||
sibling_count_hint);
|
||||
if (task_cpu(p) != cpu) {
|
||||
wake_flags |= WF_MIGRATED;
|
||||
set_task_cpu(p, cpu);
|
||||
|
|
@ -2120,13 +2132,13 @@ static void try_to_wake_up_local(struct task_struct *p, struct rq_flags *rf)
|
|||
*/
|
||||
int wake_up_process(struct task_struct *p)
|
||||
{
|
||||
return try_to_wake_up(p, TASK_NORMAL, 0);
|
||||
return try_to_wake_up(p, TASK_NORMAL, 0, 1);
|
||||
}
|
||||
EXPORT_SYMBOL(wake_up_process);
|
||||
|
||||
int wake_up_state(struct task_struct *p, unsigned int state)
|
||||
{
|
||||
return try_to_wake_up(p, state, 0);
|
||||
return try_to_wake_up(p, state, 0, 1);
|
||||
}
|
||||
|
||||
/*
|
||||
|
|
@ -2408,7 +2420,7 @@ void wake_up_new_task(struct task_struct *p)
|
|||
* as we're not fully set-up yet.
|
||||
*/
|
||||
p->recent_used_cpu = task_cpu(p);
|
||||
__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
|
||||
__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0, 1));
|
||||
#endif
|
||||
rq = __task_rq_lock(p, &rf);
|
||||
update_rq_clock(rq);
|
||||
|
|
@ -2947,7 +2959,7 @@ void sched_exec(void)
|
|||
int dest_cpu;
|
||||
|
||||
raw_spin_lock_irqsave(&p->pi_lock, flags);
|
||||
dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0);
|
||||
dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0, 1);
|
||||
if (dest_cpu == smp_processor_id())
|
||||
goto unlock;
|
||||
|
||||
|
|
@ -3708,7 +3720,7 @@ asmlinkage __visible void __sched preempt_schedule_irq(void)
|
|||
int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags,
|
||||
void *key)
|
||||
{
|
||||
return try_to_wake_up(curr->private, mode, wake_flags);
|
||||
return try_to_wake_up(curr->private, mode, wake_flags, 1);
|
||||
}
|
||||
EXPORT_SYMBOL(default_wake_function);
|
||||
|
||||
|
|
|
|||
|
|
@ -13,11 +13,13 @@
|
|||
|
||||
#include "sched.h"
|
||||
|
||||
#include <linux/sched/cpufreq.h>
|
||||
#include <trace/events/power.h>
|
||||
|
||||
struct sugov_tunables {
|
||||
struct gov_attr_set attr_set;
|
||||
unsigned int rate_limit_us;
|
||||
unsigned int up_rate_limit_us;
|
||||
unsigned int down_rate_limit_us;
|
||||
};
|
||||
|
||||
struct sugov_policy {
|
||||
|
|
@ -28,7 +30,9 @@ struct sugov_policy {
|
|||
|
||||
raw_spinlock_t update_lock; /* For shared policies */
|
||||
u64 last_freq_update_time;
|
||||
s64 freq_update_delay_ns;
|
||||
s64 min_rate_limit_ns;
|
||||
s64 up_rate_delay_ns;
|
||||
s64 down_rate_delay_ns;
|
||||
unsigned int next_freq;
|
||||
unsigned int cached_raw_freq;
|
||||
|
||||
|
|
@ -93,9 +97,32 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
|
|||
if (unlikely(sg_policy->need_freq_update))
|
||||
return true;
|
||||
|
||||
/* No need to recalculate next freq for min_rate_limit_us
|
||||
* at least. However we might still decide to further rate
|
||||
* limit once frequency change direction is decided, according
|
||||
* to the separate rate limits.
|
||||
*/
|
||||
|
||||
delta_ns = time - sg_policy->last_freq_update_time;
|
||||
return delta_ns >= sg_policy->min_rate_limit_ns;
|
||||
}
|
||||
|
||||
static bool sugov_up_down_rate_limit(struct sugov_policy *sg_policy, u64 time,
|
||||
unsigned int next_freq)
|
||||
{
|
||||
s64 delta_ns;
|
||||
|
||||
delta_ns = time - sg_policy->last_freq_update_time;
|
||||
|
||||
return delta_ns >= sg_policy->freq_update_delay_ns;
|
||||
if (next_freq > sg_policy->next_freq &&
|
||||
delta_ns < sg_policy->up_rate_delay_ns)
|
||||
return true;
|
||||
|
||||
if (next_freq < sg_policy->next_freq &&
|
||||
delta_ns < sg_policy->down_rate_delay_ns)
|
||||
return true;
|
||||
|
||||
return false;
|
||||
}
|
||||
|
||||
static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
|
||||
|
|
@ -104,6 +131,9 @@ static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
|
|||
if (sg_policy->next_freq == next_freq)
|
||||
return false;
|
||||
|
||||
if (sugov_up_down_rate_limit(sg_policy, time, next_freq))
|
||||
return false;
|
||||
|
||||
sg_policy->next_freq = next_freq;
|
||||
sg_policy->last_freq_update_time = time;
|
||||
|
||||
|
|
@ -167,7 +197,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
|
|||
unsigned int freq = arch_scale_freq_invariant() ?
|
||||
policy->cpuinfo.max_freq : policy->cur;
|
||||
|
||||
freq = (freq + (freq >> 2)) * util / max;
|
||||
freq = map_util_freq(util, freq, max);
|
||||
|
||||
if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
|
||||
return sg_policy->next_freq;
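
Reading aid for the hunk above: the removed line computes freq = (freq + (freq >> 2)) * util / max, which requests about 1.25 times the frequency that would exactly match the utilization. The sketch below assumes that map_util_freq() keeps the arithmetic of the line it replaces. Its definition is not shown in this part of the diff, so treat that as an assumption. The name map_util_freq_sketch() is hypothetical.

```c
/*
 * Hypothetical userspace sketch: map a utilization value to a frequency
 * request with 25% headroom, following the removed line in the hunk above.
 * 'cap' must be non-zero.
 */
static unsigned long map_util_freq_sketch(unsigned long util,
					  unsigned long freq,
					  unsigned long cap)
{
	return (freq + (freq >> 2)) * util / cap;
}
```

For example, with freq = 2000000, util = 512 and cap = 1024 the request is 1250000. At util == cap the request exceeds freq, and the sketch does no clamping of its own.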
|
||||
|
|
@ -189,6 +219,9 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
|
|||
* Where the cfs,rt and dl util numbers are tracked with the same metric and
|
||||
* synchronized windows and are thus directly comparable.
|
||||
*
|
||||
* The @util parameter passed to this function is assumed to be the aggregation
|
||||
* of RT and CFS util numbers. The cases of DL and IRQ are managed here.
|
||||
*
|
||||
* The cfs,rt,dl utilization are the running times measured with rq->clock_task
|
||||
* which excludes things like IRQ and steal-time. These latter are then accrued
|
||||
* in the irq utilization.
|
||||
|
|
@ -197,15 +230,14 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
|
|||
* based on the task model parameters and gives the minimal utilization
|
||||
* required to meet deadlines.
|
||||
*/
|
||||
static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
|
||||
unsigned long schedutil_freq_util(int cpu, unsigned long util,
|
||||
unsigned long max, enum schedutil_type type)
|
||||
{
|
||||
struct rq *rq = cpu_rq(sg_cpu->cpu);
|
||||
unsigned long util, irq, max;
|
||||
struct rq *rq = cpu_rq(cpu);
|
||||
unsigned long irq;
|
||||
|
||||
sg_cpu->max = max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
|
||||
sg_cpu->bw_dl = cpu_bw_dl(rq);
|
||||
|
||||
if (rt_rq_is_runnable(&rq->rt))
|
||||
if (sched_feat(SUGOV_RT_MAX_FREQ) && type == FREQUENCY_UTIL &&
|
||||
rt_rq_is_runnable(&rq->rt))
|
||||
return max;
|
||||
|
||||
/*
|
||||
|
|
@ -218,25 +250,34 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
|
|||
return max;
|
||||
|
||||
/*
|
||||
* Because the time spend on RT/DL tasks is visible as 'lost' time to
|
||||
* CFS tasks and we use the same metric to track the effective
|
||||
* utilization (PELT windows are synchronized) we can directly add them
|
||||
* to obtain the CPU's actual utilization.
|
||||
* The function is called with @util defined as the aggregation (the
|
||||
* sum) of RT and CFS signals, hence leaving the special case of DL
|
||||
* to be dealt with. The exact way of doing things depends on the calling
|
||||
* context.
|
||||
*/
|
||||
util = cpu_util_cfs(rq);
|
||||
util += cpu_util_rt(rq);
|
||||
|
||||
/*
|
||||
* We do not make cpu_util_dl() a permanent part of this sum because we
|
||||
* want to use cpu_bw_dl() later on, but we need to check if the
|
||||
* CFS+RT+DL sum is saturated (ie. no idle time) such that we select
|
||||
* f_max when there is no idle time.
|
||||
*
|
||||
* NOTE: numerical errors or stop class might cause us to not quite hit
|
||||
* saturation when we should -- something for later.
|
||||
*/
|
||||
if ((util + cpu_util_dl(rq)) >= max)
|
||||
return max;
|
||||
if (type == FREQUENCY_UTIL) {
|
||||
/*
|
||||
* For frequency selection we do not make cpu_util_dl() a
|
||||
* permanent part of this sum because we want to use
|
||||
* cpu_bw_dl() later on, but we need to check if the
|
||||
* CFS+RT+DL sum is saturated (ie. no idle time) such
|
||||
* that we select f_max when there is no idle time.
|
||||
*
|
||||
* NOTE: numerical errors or stop class might cause us
|
||||
* to not quite hit saturation when we should --
|
||||
* something for later.
|
||||
*/
|
||||
if ((util + cpu_util_dl(rq)) >= max)
|
||||
return max;
|
||||
} else {
|
||||
/*
|
||||
* OTOH, for energy computation we need the estimated
|
||||
* running time, so include util_dl and ignore dl_bw.
|
||||
*/
|
||||
util += cpu_util_dl(rq);
|
||||
if (util >= max)
|
||||
return max;
|
||||
}
|
||||
|
||||
/*
|
||||
* There is still idle time; further improve the number by using the
|
||||
|
|
@ -250,17 +291,35 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
|
|||
util = scale_irq_capacity(util, irq, max);
|
||||
util += irq;
|
||||
|
||||
/*
|
||||
* Bandwidth required by DEADLINE must always be granted while, for
|
||||
* FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
|
||||
* to gracefully reduce the frequency when no tasks show up for longer
|
||||
* periods of time.
|
||||
*
|
||||
* Ideally we would like to set bw_dl as min/guaranteed freq and util +
|
||||
* bw_dl as requested freq. However, cpufreq is not yet ready for such
|
||||
* an interface. So, we only do the latter for now.
|
||||
*/
|
||||
return min(max, util + sg_cpu->bw_dl);
|
||||
if (type == FREQUENCY_UTIL) {
|
||||
/*
|
||||
* Bandwidth required by DEADLINE must always be granted
|
||||
* while, for FAIR and RT, we use blocked utilization of
|
||||
* IDLE CPUs as a mechanism to gracefully reduce the
|
||||
* frequency when no tasks show up for longer periods of
|
||||
* time.
|
||||
*
|
||||
* Ideally we would like to set bw_dl as min/guaranteed
|
||||
* freq and util + bw_dl as requested freq. However,
|
||||
* cpufreq is not yet ready for such an interface. So,
|
||||
* we only do the latter for now.
|
||||
*/
|
||||
util += cpu_bw_dl(rq);
|
||||
}
|
||||
|
||||
return min(max, util);
|
||||
}
|
||||
|
||||
static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
|
||||
{
|
||||
struct rq *rq = cpu_rq(sg_cpu->cpu);
|
||||
unsigned long util = boosted_cpu_util(sg_cpu->cpu, cpu_util_rt(rq));
|
||||
unsigned long max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
|
||||
|
||||
sg_cpu->max = max;
|
||||
sg_cpu->bw_dl = cpu_bw_dl(rq);
|
||||
|
||||
return schedutil_freq_util(sg_cpu->cpu, util, max, FREQUENCY_UTIL);
|
||||
}
|
||||
|
||||
/**
|
||||
|
|
@ -562,15 +621,32 @@ static inline struct sugov_tunables *to_sugov_tunables(struct gov_attr_set *attr
|
|||
return container_of(attr_set, struct sugov_tunables, attr_set);
|
||||
}
|
||||
|
||||
static ssize_t rate_limit_us_show(struct gov_attr_set *attr_set, char *buf)
|
||||
static DEFINE_MUTEX(min_rate_lock);
|
||||
|
||||
static void update_min_rate_limit_ns(struct sugov_policy *sg_policy)
|
||||
{
|
||||
mutex_lock(&min_rate_lock);
|
||||
sg_policy->min_rate_limit_ns = min(sg_policy->up_rate_delay_ns,
|
||||
sg_policy->down_rate_delay_ns);
|
||||
mutex_unlock(&min_rate_lock);
|
||||
}
|
||||
|
||||
static ssize_t up_rate_limit_us_show(struct gov_attr_set *attr_set, char *buf)
|
||||
{
|
||||
struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
|
||||
|
||||
return sprintf(buf, "%u\n", tunables->rate_limit_us);
|
||||
return sprintf(buf, "%u\n", tunables->up_rate_limit_us);
|
||||
}
|
||||
|
||||
static ssize_t
|
||||
rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf, size_t count)
|
||||
static ssize_t down_rate_limit_us_show(struct gov_attr_set *attr_set, char *buf)
|
||||
{
|
||||
struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
|
||||
|
||||
return sprintf(buf, "%u\n", tunables->down_rate_limit_us);
|
||||
}
|
||||
|
||||
static ssize_t up_rate_limit_us_store(struct gov_attr_set *attr_set,
|
||||
const char *buf, size_t count)
|
||||
{
|
||||
struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
|
||||
struct sugov_policy *sg_policy;
|
||||
|
|
@ -579,18 +655,42 @@ rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf, size_t count
|
|||
if (kstrtouint(buf, 10, &rate_limit_us))
|
||||
return -EINVAL;
|
||||
|
||||
tunables->rate_limit_us = rate_limit_us;
|
||||
tunables->up_rate_limit_us = rate_limit_us;
|
||||
|
||||
list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook)
|
||||
sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC;
|
||||
list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook) {
|
||||
sg_policy->up_rate_delay_ns = rate_limit_us * NSEC_PER_USEC;
|
||||
update_min_rate_limit_ns(sg_policy);
|
||||
}
|
||||
|
||||
return count;
|
||||
}
|
||||
|
||||
static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);
|
||||
static ssize_t down_rate_limit_us_store(struct gov_attr_set *attr_set,
|
||||
const char *buf, size_t count)
|
||||
{
|
||||
struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
|
||||
struct sugov_policy *sg_policy;
|
||||
unsigned int rate_limit_us;
|
||||
|
||||
if (kstrtouint(buf, 10, &rate_limit_us))
|
||||
return -EINVAL;
|
||||
|
||||
tunables->down_rate_limit_us = rate_limit_us;
|
||||
|
||||
list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook) {
|
||||
sg_policy->down_rate_delay_ns = rate_limit_us * NSEC_PER_USEC;
|
||||
update_min_rate_limit_ns(sg_policy);
|
||||
}
|
||||
|
||||
return count;
|
||||
}
|
||||
|
||||
static struct governor_attr up_rate_limit_us = __ATTR_RW(up_rate_limit_us);
|
||||
static struct governor_attr down_rate_limit_us = __ATTR_RW(down_rate_limit_us);
|
||||
|
||||
static struct attribute *sugov_attributes[] = {
|
||||
&rate_limit_us.attr,
|
||||
&up_rate_limit_us.attr,
|
||||
&down_rate_limit_us.attr,
|
||||
NULL
|
||||
};
|
||||
|
||||
|
|
@ -601,7 +701,7 @@ static struct kobj_type sugov_tunables_ktype = {
|
|||
|
||||
/********************** cpufreq governor interface *********************/
|
||||
|
||||
static struct cpufreq_governor schedutil_gov;
|
||||
struct cpufreq_governor schedutil_gov;
|
||||
|
||||
static struct sugov_policy *sugov_policy_alloc(struct cpufreq_policy *policy)
|
||||
{
|
||||
|
|
@ -746,7 +846,8 @@ static int sugov_init(struct cpufreq_policy *policy)
|
|||
goto stop_kthread;
|
||||
}
|
||||
|
||||
tunables->rate_limit_us = cpufreq_policy_transition_delay_us(policy);
|
||||
tunables->up_rate_limit_us = cpufreq_policy_transition_delay_us(policy);
|
||||
tunables->down_rate_limit_us = cpufreq_policy_transition_delay_us(policy);
|
||||
|
||||
policy->governor_data = sg_policy;
|
||||
sg_policy->tunables = tunables;
|
||||
|
|
@ -804,7 +905,11 @@ static int sugov_start(struct cpufreq_policy *policy)
|
|||
struct sugov_policy *sg_policy = policy->governor_data;
|
||||
unsigned int cpu;
|
||||
|
||||
sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
|
||||
sg_policy->up_rate_delay_ns =
|
||||
sg_policy->tunables->up_rate_limit_us * NSEC_PER_USEC;
|
||||
sg_policy->down_rate_delay_ns =
|
||||
sg_policy->tunables->down_rate_limit_us * NSEC_PER_USEC;
|
||||
update_min_rate_limit_ns(sg_policy);
|
||||
sg_policy->last_freq_update_time = 0;
|
||||
sg_policy->next_freq = 0;
|
||||
sg_policy->work_in_progress = false;
|
||||
|
|
@ -860,7 +965,7 @@ static void sugov_limits(struct cpufreq_policy *policy)
|
|||
sg_policy->need_freq_update = true;
|
||||
}
|
||||
|
||||
static struct cpufreq_governor schedutil_gov = {
|
||||
struct cpufreq_governor schedutil_gov = {
|
||||
.name = "schedutil",
|
||||
.owner = THIS_MODULE,
|
||||
.dynamic_switching = true,
|
||||
|
|
@ -883,3 +988,36 @@ static int __init sugov_register(void)
|
|||
return cpufreq_register_governor(&schedutil_gov);
|
||||
}
|
||||
fs_initcall(sugov_register);
|
||||
|
||||
#ifdef CONFIG_ENERGY_MODEL
|
||||
extern bool sched_energy_update;
|
||||
extern struct mutex sched_energy_mutex;
|
||||
|
||||
static void rebuild_sd_workfn(struct work_struct *work)
|
||||
{
|
||||
mutex_lock(&sched_energy_mutex);
|
||||
sched_energy_update = true;
|
||||
rebuild_sched_domains();
|
||||
sched_energy_update = false;
|
||||
mutex_unlock(&sched_energy_mutex);
|
||||
}
|
||||
static DECLARE_WORK(rebuild_sd_work, rebuild_sd_workfn);
|
||||
|
||||
/*
|
||||
* EAS shouldn't be attempted without sugov, so rebuild the sched_domains
|
||||
* on governor changes to make sure the scheduler knows about it.
|
||||
*/
|
||||
void sched_cpufreq_governor_change(struct cpufreq_policy *policy,
|
||||
struct cpufreq_governor *old_gov)
|
||||
{
|
||||
if (old_gov == &schedutil_gov || policy->governor == &schedutil_gov) {
|
||||
/*
|
||||
* When called from the cpufreq_register_driver() path, the
|
||||
* cpu_hotplug_lock is already held, so use a work item to
|
||||
* avoid nested locking in rebuild_sched_domains().
|
||||
*/
|
||||
schedule_work(&rebuild_sd_work);
|
||||
}
|
||||
|
||||
}
|
||||
#endif
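
Reading aid for the rate-limit changes to schedutil above: the single rate_limit_us tunable is split into up_rate_limit_us and down_rate_limit_us, and sugov_update_next_freq() now rejects a change when the delay for its direction has not yet elapsed. The following is a minimal userspace sketch of that decision, not kernel code. The struct and function names are hypothetical, and only the fields that the check needs are modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant fields of struct sugov_policy. */
struct sugov_policy_sketch {
	uint64_t last_freq_update_time;	/* ns */
	int64_t up_rate_delay_ns;
	int64_t down_rate_delay_ns;
	unsigned int next_freq;
};

/* True when a change towards next_freq must be held back for now. */
static bool up_down_rate_limited(const struct sugov_policy_sketch *p,
				 uint64_t time, unsigned int next_freq)
{
	int64_t delta_ns = (int64_t)(time - p->last_freq_update_time);

	if (next_freq > p->next_freq && delta_ns < p->up_rate_delay_ns)
		return true;
	if (next_freq < p->next_freq && delta_ns < p->down_rate_delay_ns)
		return true;
	return false;
}

/* Commit next_freq unless it is unchanged or rate limited. */
static bool update_next_freq(struct sugov_policy_sketch *p, uint64_t time,
			     unsigned int next_freq)
{
	if (p->next_freq == next_freq)
		return false;
	if (up_down_rate_limited(p, time, next_freq))
		return false;

	p->next_freq = next_freq;
	p->last_freq_update_time = time;
	return true;
}
```

Giving the down direction a longer delay than the up direction lets the frequency rise quickly and fall slowly, which appears to be the reason for splitting the tunable. The delay values to use are a tuning choice and are not prescribed by this diff.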
|
||||
|
|
|
|||
|
|
@ -1567,7 +1567,8 @@ static void yield_task_dl(struct rq *rq)
|
|||
static int find_later_rq(struct task_struct *task);
|
||||
|
||||
static int
|
||||
select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
|
||||
select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags,
|
||||
int sibling_count_hint)
|
||||
{
|
||||
struct task_struct *curr;
|
||||
struct rq *rq;
|
||||
|
|
|
|||
1070
kernel/sched/fair.c
File diff suppressed because it is too large
|
|
@ -90,3 +90,33 @@ SCHED_FEAT(WA_BIAS, true)
|
|||
* UtilEstimation. Use estimated CPU utilization.
|
||||
*/
|
||||
SCHED_FEAT(UTIL_EST, true)
|
||||
|
||||
/*
|
||||
* Fast pre-selection of CPU candidates for EAS.
|
||||
*/
|
||||
SCHED_FEAT(FIND_BEST_TARGET, true)
|
||||
|
||||
/*
|
||||
* Energy aware scheduling algorithm choices:
|
||||
* EAS_PREFER_IDLE
|
||||
* Direct tasks in a schedtune.prefer_idle=1 group through
|
||||
* the EAS path for wakeup task placement. Otherwise, put
|
||||
* those tasks through the mainline slow path.
|
||||
*/
|
||||
SCHED_FEAT(EAS_PREFER_IDLE, true)
|
||||
|
||||
/*
|
||||
* Request max frequency from schedutil whenever a RT task is running.
|
||||
*/
|
||||
SCHED_FEAT(SUGOV_RT_MAX_FREQ, false)
|
||||
|
||||
/*
|
||||
* Apply schedtune boost hold to tasks of all sched classes.
|
||||
* If enabled, schedtune will hold the boost applied to a CPU
|
||||
* for 50ms regardless of task activation - if the task is
|
||||
* still running 50ms later, the boost hold expires and schedtune
|
||||
* boost will expire immediately the task stops.
|
||||
* If disabled, this behaviour will only apply to tasks of the
|
||||
* RT class.
|
||||
*/
|
||||
SCHED_FEAT(SCHEDTUNE_BOOST_HOLD_ALL, false)
|
||||
|
|
|
|||
|
|
@ -16,9 +16,10 @@ extern char __cpuidle_text_start[], __cpuidle_text_end[];
|
|||
* sched_idle_set_state - Record idle state for the current CPU.
|
||||
* @idle_state: State to record.
|
||||
*/
|
||||
void sched_idle_set_state(struct cpuidle_state *idle_state)
|
||||
void sched_idle_set_state(struct cpuidle_state *idle_state, int index)
|
||||
{
|
||||
idle_set_state(this_rq(), idle_state);
|
||||
idle_set_state_idx(this_rq(), index);
|
||||
}
|
||||
|
||||
static int __read_mostly cpu_idle_force_poll;
|
||||
|
|
@ -374,7 +375,8 @@ void cpu_startup_entry(enum cpuhp_state state)
|
|||
|
||||
#ifdef CONFIG_SMP
|
||||
static int
|
||||
select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags)
|
||||
select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags,
|
||||
int sibling_count_hint)
|
||||
{
|
||||
return task_cpu(p); /* IDLE tasks are never migrated */
|
||||
}
|
||||
|
|
|
|||
|
|
@ -29,6 +29,8 @@
|
|||
#include "sched-pelt.h"
|
||||
#include "pelt.h"
|
||||
|
||||
#include <trace/events/sched.h>
|
||||
|
||||
/*
|
||||
* Approximate:
|
||||
* val * y^n, where y^32 ~= 0.5 (~1 scheduling period)
|
||||
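Reviewer note: the "val * y^n, where y^32 ~= 0.5" comment in the hunk above describes a geometric decay with a half-life of 32 periods. The snippet below is only a floating-point sketch of that arithmetic, not the kernel's fixed-point implementation; the constant `PELT_Y_SKETCH` and the function name `pelt_decay_sketch` are hypothetical names introduced for illustration.

```c
/*
 * Assumption: y = 0.5^(1/32), rounded to nine decimals. The kernel uses
 * precomputed fixed-point tables instead; this is illustration only.
 */
#define PELT_Y_SKETCH 0.978572062

/* Decay val by y^n using repeated multiplication. */
static double pelt_decay_sketch(double val, unsigned int n)
{
	while (n--)
		val *= PELT_Y_SKETCH;

	return val;
}
```

Under these assumptions a contribution of 1024 decays to roughly half after 32 periods and to roughly a quarter after 64.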
|
|
@ -274,6 +276,9 @@ int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se)
|
|||
|
||||
if (___update_load_sum(now, cpu, &se->avg, 0, 0, 0)) {
|
||||
___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
|
||||
|
||||
trace_sched_load_se(se);
|
||||
|
||||
return 1;
|
||||
}
|
||||
|
||||
|
|
@ -290,6 +295,9 @@ int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_e
|
|||
|
||||
___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
|
||||
cfs_se_util_change(&se->avg);
|
||||
|
||||
trace_sched_load_se(se);
|
||||
|
||||
return 1;
|
||||
}
|
||||
|
||||
|
|
@ -304,6 +312,9 @@ int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
|
|||
cfs_rq->curr != NULL)) {
|
||||
|
||||
___update_load_avg(&cfs_rq->avg, 1, 1);
|
||||
|
||||
trace_sched_load_cfs_rq(cfs_rq);
|
||||
|
||||
return 1;
|
||||
}
|
||||
|
||||
|
|
@ -329,6 +340,9 @@ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
|
|||
running)) {
|
||||
|
||||
___update_load_avg(&rq->avg_rt, 1, 1);
|
||||
|
||||
trace_sched_load_rt_rq(rq);
|
||||
|
||||
return 1;
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -1329,6 +1329,8 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
|
|||
{
|
||||
struct sched_rt_entity *rt_se = &p->rt;
|
||||
|
||||
schedtune_enqueue_task(p, cpu_of(rq));
|
||||
|
||||
if (flags & ENQUEUE_WAKEUP)
|
||||
rt_se->timeout = 0;
|
||||
|
||||
|
|
@ -1342,6 +1344,8 @@ static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
|
|||
{
|
||||
struct sched_rt_entity *rt_se = &p->rt;
|
||||
|
||||
schedtune_dequeue_task(p, cpu_of(rq));
|
||||
|
||||
update_curr_rt(rq);
|
||||
dequeue_rt_entity(rt_se, flags);
|
||||
|
||||
|
|
@ -1386,7 +1390,8 @@ static void yield_task_rt(struct rq *rq)
|
|||
static int find_lowest_rq(struct task_struct *task);
|
||||
|
||||
static int
|
||||
select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
|
||||
select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags,
|
||||
int sibling_count_hint)
|
||||
{
|
||||
struct task_struct *curr;
|
||||
struct rq *rq;
|
||||
|
|
|
|||
|
|
@ -44,6 +44,7 @@
|
|||
#include <linux/ctype.h>
|
||||
#include <linux/debugfs.h>
|
||||
#include <linux/delayacct.h>
|
||||
#include <linux/energy_model.h>
|
||||
#include <linux/init_task.h>
|
||||
#include <linux/kprobes.h>
|
||||
#include <linux/kthread.h>
|
||||
|
|
@ -79,6 +80,8 @@
|
|||
# define SCHED_WARN_ON(x) ({ (void)(x), 0; })
|
||||
#endif
|
||||
|
||||
#include "tune.h"
|
||||
|
||||
struct rq;
|
||||
struct cpuidle_state;
|
||||
|
||||
|
|
@ -702,6 +705,22 @@ static inline bool sched_asym_prefer(int a, int b)
|
|||
return arch_asym_cpu_priority(a) > arch_asym_cpu_priority(b);
|
||||
}
|
||||
|
||||
struct perf_domain {
|
||||
struct em_perf_domain *em_pd;
|
||||
struct perf_domain *next;
|
||||
struct rcu_head rcu;
|
||||
};
|
||||
|
||||
struct max_cpu_capacity {
|
||||
raw_spinlock_t lock;
|
||||
unsigned long val;
|
||||
int cpu;
|
||||
};
|
||||
|
||||
/* Scheduling group status flags */
|
||||
#define SG_OVERLOAD 0x1 /* More than one runnable task on a CPU. */
|
||||
#define SG_OVERUTILIZED 0x2 /* One or more CPUs are over-utilized. */
|
||||
|
||||
/*
|
||||
* We add the notion of a root-domain which will be used to define per-domain
|
||||
* variables. Each exclusive cpuset essentially defines an island domain by
|
||||
|
|
@ -717,8 +736,15 @@ struct root_domain {
|
|||
cpumask_var_t span;
|
||||
cpumask_var_t online;
|
||||
|
||||
/* Indicate more than one runnable task for any CPU */
|
||||
bool overload;
|
||||
/*
|
||||
* Indicate pullable load on at least one CPU, e.g:
|
||||
* - More than one runnable task
|
||||
* - Running task is misfit
|
||||
*/
|
||||
int overload;
|
||||
|
||||
/* Indicate one or more cpus over-utilized (tipping point) */
|
||||
int overutilized;
|
||||
|
||||
/*
|
||||
* The bit corresponding to a CPU gets set here if such CPU has more
|
||||
|
|
@ -749,13 +775,21 @@ struct root_domain {
|
|||
cpumask_var_t rto_mask;
|
||||
struct cpupri cpupri;
|
||||
|
||||
unsigned long max_cpu_capacity;
|
||||
/* Maximum cpu capacity in the system. */
|
||||
struct max_cpu_capacity max_cpu_capacity;
|
||||
|
||||
/*
|
||||
* NULL-terminated list of performance domains intersecting with the
|
||||
* CPUs of the rd. Protected by RCU.
|
||||
*/
|
||||
struct perf_domain *pd;
|
||||
};
|
||||
|
||||
extern struct root_domain def_root_domain;
|
||||
extern struct mutex sched_domains_mutex;
|
||||
|
||||
extern void init_defrootdomain(void);
|
||||
extern void init_max_cpu_capacity(struct max_cpu_capacity *mcc);
|
||||
extern int sched_init_domains(const struct cpumask *cpu_map);
|
||||
extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
|
||||
extern void sched_get_rd(struct root_domain *rd);
|
||||
|
|
@ -845,6 +879,8 @@ struct rq {
|
|||
|
||||
unsigned char idle_balance;
|
||||
|
||||
unsigned long misfit_task_load;
|
||||
|
||||
/* For active balancing */
|
||||
int active_balance;
|
||||
int push_cpu;
|
||||
|
|
@ -916,6 +952,7 @@ struct rq {
|
|||
#ifdef CONFIG_CPU_IDLE
|
||||
/* Must be inspected within a rcu lock section */
|
||||
struct cpuidle_state *idle_state;
|
||||
int idle_state_idx;
|
||||
#endif
|
||||
};
|
||||
|
||||
|
|
@ -1187,7 +1224,9 @@ DECLARE_PER_CPU(int, sd_llc_size);
|
|||
DECLARE_PER_CPU(int, sd_llc_id);
|
||||
DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
|
||||
DECLARE_PER_CPU(struct sched_domain *, sd_numa);
|
||||
DECLARE_PER_CPU(struct sched_domain *, sd_asym);
|
||||
DECLARE_PER_CPU(struct sched_domain *, sd_asym_packing);
|
||||
DECLARE_PER_CPU(struct sched_domain *, sd_asym_cpucapacity);
|
||||
extern struct static_key_false sched_asym_cpucapacity;
|
||||
|
||||
struct sched_group_capacity {
|
||||
atomic_t ref;
|
||||
|
|
@ -1197,6 +1236,7 @@ struct sched_group_capacity {
|
|||
*/
|
||||
unsigned long capacity;
|
||||
unsigned long min_capacity; /* Min per-CPU capacity in group */
|
||||
unsigned long max_capacity; /* Max per-CPU capacity in group */
|
||||
unsigned long next_update;
|
||||
int imbalance; /* XXX unrelated to capacity but shared group state */
|
||||
|
||||
|
|
@ -1525,7 +1565,8 @@ struct sched_class {
|
|||
void (*put_prev_task)(struct rq *rq, struct task_struct *p);
|
||||
|
||||
#ifdef CONFIG_SMP
|
||||
int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
|
||||
int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags,
|
||||
int sibling_count_hint);
|
||||
void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
|
||||
|
||||
void (*task_woken)(struct rq *this_rq, struct task_struct *task);
|
||||
|
|
@ -1613,6 +1654,17 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq)
|
|||
|
||||
return rq->idle_state;
|
||||
}
|
||||
|
||||
static inline void idle_set_state_idx(struct rq *rq, int idle_state_idx)
|
||||
{
|
||||
rq->idle_state_idx = idle_state_idx;
|
||||
}
|
||||
|
||||
static inline int idle_get_state_idx(struct rq *rq)
|
||||
{
|
||||
WARN_ON(!rcu_read_lock_held());
|
||||
return rq->idle_state_idx;
|
||||
}
|
||||
#else
|
||||
static inline void idle_set_state(struct rq *rq,
|
||||
struct cpuidle_state *idle_state)
|
||||
|
|
@ -1623,6 +1675,15 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq)
|
|||
{
|
||||
return NULL;
|
||||
}
|
||||
|
||||
static inline void idle_set_state_idx(struct rq *rq, int idle_state_idx)
|
||||
{
|
||||
}
|
||||
|
||||
static inline int idle_get_state_idx(struct rq *rq)
|
||||
{
|
||||
return -1;
|
||||
}
|
||||
#endif
|
||||
|
||||
extern void schedule_idle(void);
|
||||
|
|
@ -1696,8 +1757,8 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
|
|||
|
||||
if (prev_nr < 2 && rq->nr_running >= 2) {
|
||||
#ifdef CONFIG_SMP
|
||||
if (!rq->rd->overload)
|
||||
rq->rd->overload = true;
|
||||
if (!READ_ONCE(rq->rd->overload))
|
||||
WRITE_ONCE(rq->rd->overload, 1);
|
||||
#endif
|
||||
}
|
||||
|
||||
|
|
@ -1756,26 +1817,14 @@ unsigned long arch_scale_freq_capacity(int cpu)
|
|||
}
|
||||
#endif
|
||||
|
||||
#ifdef CONFIG_SMP
|
||||
#ifndef arch_scale_cpu_capacity
|
||||
#ifndef arch_scale_max_freq_capacity
|
||||
struct sched_domain;
|
||||
static __always_inline
|
||||
unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
|
||||
{
|
||||
if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
|
||||
return sd->smt_gain / sd->span_weight;
|
||||
|
||||
return SCHED_CAPACITY_SCALE;
|
||||
}
|
||||
#endif
|
||||
#else
|
||||
#ifndef arch_scale_cpu_capacity
|
||||
static __always_inline
|
||||
unsigned long arch_scale_cpu_capacity(void __always_unused *sd, int cpu)
|
||||
unsigned long arch_scale_max_freq_capacity(struct sched_domain *sd, int cpu)
|
||||
{
|
||||
return SCHED_CAPACITY_SCALE;
|
||||
}
|
||||
#endif
|
||||
#endif
|
||||
|
||||
struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
|
||||
__acquires(rq->lock);
|
||||
|
|
@ -2189,6 +2238,38 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
|
|||
#endif
|
||||
|
||||
#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
|
||||
/**
|
||||
* enum schedutil_type - CPU utilization type
|
||||
* @FREQUENCY_UTIL: Utilization used to select frequency
|
||||
* @ENERGY_UTIL: Utilization used during energy calculation
|
||||
*
|
||||
* The utilization signals of all scheduling classes (CFS/RT/DL) and IRQ time
|
||||
* need to be aggregated differently depending on the usage made of them. This
|
||||
* enum is used within schedutil_freq_util() to differentiate the types of
|
||||
* utilization expected by the callers, and adjust the aggregation accordingly.
|
||||
*/
|
||||
enum schedutil_type {
|
||||
FREQUENCY_UTIL,
|
||||
ENERGY_UTIL,
|
||||
};
|
||||
|
||||
unsigned long schedutil_freq_util(int cpu, unsigned long util,
|
||||
unsigned long max, enum schedutil_type type);
|
||||
|
||||
static inline unsigned long schedutil_energy_util(int cpu, unsigned long util)
|
||||
{
|
||||
unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
|
||||
|
||||
return schedutil_freq_util(cpu, util, max, ENERGY_UTIL);
|
||||
}
|
||||
#else /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
|
||||
static inline unsigned long schedutil_energy_util(int cpu, unsigned long util)
|
||||
{
|
||||
return util;
|
||||
}
|
||||
#endif
|
||||
|
||||
#ifdef CONFIG_SMP
|
||||
static inline unsigned long cpu_bw_dl(struct rq *rq)
|
||||
{
|
||||
return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
|
||||
|
|
@ -2244,3 +2325,13 @@ unsigned long scale_irq_capacity(unsigned long util, unsigned long irq, unsigned
|
|||
return util;
|
||||
}
|
||||
#endif
|
||||
|
||||
#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
|
||||
#define perf_domain_span(pd) (to_cpumask(((pd)->em_pd->cpus)))
|
||||
#else
|
||||
#define perf_domain_span(pd) NULL
|
||||
#endif
|
||||
|
||||
#ifdef CONFIG_SMP
|
||||
extern struct static_key_false sched_energy_present;
|
||||
#endif
|
||||
|
|
|
|||
|
|
@ -11,7 +11,8 @@
|
|||
|
||||
#ifdef CONFIG_SMP
|
||||
static int
|
||||
select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags)
|
||||
select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags,
|
||||
int sibling_count_hint)
|
||||
{
|
||||
return task_cpu(p); /* stop tasks are never migrated */
|
||||
}
|
||||
|
|
|
|||
|
|
@ -201,6 +201,242 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
|
|||
return 1;
|
||||
}
|
||||
|
||||
DEFINE_STATIC_KEY_FALSE(sched_energy_present);
|
||||
#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
|
||||
unsigned int sysctl_sched_energy_aware = 1;
|
||||
DEFINE_MUTEX(sched_energy_mutex);
|
||||
bool sched_energy_update;
|
||||
|
||||
#ifdef CONFIG_PROC_SYSCTL
|
||||
int sched_energy_aware_handler(struct ctl_table *table, int write,
|
||||
void __user *buffer, size_t *lenp, loff_t *ppos)
|
||||
{
|
||||
int ret, state;
|
||||
|
||||
if (write && !capable(CAP_SYS_ADMIN))
|
||||
return -EPERM;
|
||||
|
||||
ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
|
||||
if (!ret && write) {
|
||||
state = static_branch_unlikely(&sched_energy_present);
|
||||
if (state != sysctl_sched_energy_aware) {
|
||||
mutex_lock(&sched_energy_mutex);
|
||||
sched_energy_update = 1;
|
||||
rebuild_sched_domains();
|
||||
sched_energy_update = 0;
|
||||
mutex_unlock(&sched_energy_mutex);
|
||||
}
|
||||
}
|
||||
|
||||
return ret;
|
||||
}
|
||||
#endif
|
||||
|
||||
static void free_pd(struct perf_domain *pd)
|
||||
{
|
||||
struct perf_domain *tmp;
|
||||
|
||||
while (pd) {
|
||||
tmp = pd->next;
|
||||
kfree(pd);
|
||||
pd = tmp;
|
||||
}
|
||||
}
|
||||
|
||||
static struct perf_domain *find_pd(struct perf_domain *pd, int cpu)
|
||||
{
|
||||
while (pd) {
|
||||
if (cpumask_test_cpu(cpu, perf_domain_span(pd)))
|
||||
return pd;
|
||||
pd = pd->next;
|
||||
}
|
||||
|
||||
return NULL;
|
||||
}
|
||||
|
||||
static struct perf_domain *pd_init(int cpu)
|
||||
{
|
||||
struct em_perf_domain *obj = em_cpu_get(cpu);
|
||||
struct perf_domain *pd;
|
||||
|
||||
if (!obj) {
|
||||
if (sched_debug())
|
||||
pr_info("%s: no EM found for CPU%d\n", __func__, cpu);
|
||||
return NULL;
|
||||
}
|
||||
|
||||
pd = kzalloc(sizeof(*pd), GFP_KERNEL);
|
||||
if (!pd)
|
||||
return NULL;
|
||||
pd->em_pd = obj;
|
||||
|
||||
return pd;
|
||||
}
|
||||
|
||||
static void perf_domain_debug(const struct cpumask *cpu_map,
|
||||
struct perf_domain *pd)
|
||||
{
|
||||
if (!sched_debug() || !pd)
|
||||
return;
|
||||
|
||||
printk(KERN_DEBUG "root_domain %*pbl: ", cpumask_pr_args(cpu_map));
|
||||
|
||||
while (pd) {
|
||||
printk(KERN_CONT " pd%d:{ cpus=%*pbl nr_cstate=%d }",
|
||||
cpumask_first(perf_domain_span(pd)),
|
||||
cpumask_pr_args(perf_domain_span(pd)),
|
||||
em_pd_nr_cap_states(pd->em_pd));
|
||||
pd = pd->next;
|
||||
}
|
||||
|
||||
printk(KERN_CONT "\n");
|
||||
}
|
||||
|
||||
static void destroy_perf_domain_rcu(struct rcu_head *rp)
|
||||
{
|
||||
struct perf_domain *pd;
|
||||
|
||||
pd = container_of(rp, struct perf_domain, rcu);
|
||||
free_pd(pd);
|
||||
}
|
||||
|
||||
static void sched_energy_start(int ndoms_new, cpumask_var_t doms_new[])
|
||||
{
|
||||
/*
|
||||
* The conditions for EAS to start are checked during the creation of
|
||||
* root domains. If one of them meets all conditions, it will have a
|
||||
* non-null list of performance domains.
|
||||
*/
|
||||
while (ndoms_new) {
|
||||
if (cpu_rq(cpumask_first(doms_new[ndoms_new - 1]))->rd->pd)
|
||||
goto enable;
|
||||
ndoms_new--;
|
||||
}
|
||||
|
||||
if (static_branch_unlikely(&sched_energy_present)) {
|
||||
if (sched_debug())
|
||||
pr_info("%s: stopping EAS\n", __func__);
|
||||
static_branch_disable_cpuslocked(&sched_energy_present);
|
||||
}
|
||||
|
||||
return;
|
||||
|
||||
enable:
|
||||
if (!static_branch_unlikely(&sched_energy_present)) {
|
||||
if (sched_debug())
|
||||
pr_info("%s: starting EAS\n", __func__);
|
||||
static_branch_enable_cpuslocked(&sched_energy_present);
|
||||
}
|
||||
}
|
||||
|
||||
/*
|
||||
* EAS can be used on a root domain if it meets all the following conditions:
|
||||
* 1. an Energy Model (EM) is available;
|
||||
* 2. the SD_ASYM_CPUCAPACITY flag is set in the sched_domain hierarchy.
|
||||
* 3. the EM complexity is low enough to keep scheduling overheads low;
|
||||
* 4. schedutil is driving the frequency of all CPUs of the rd;
|
||||
*
|
||||
* The complexity of the Energy Model is defined as:
|
||||
*
|
||||
* C = nr_pd * (nr_cpus + nr_cs)
|
||||
*
|
||||
* with parameters defined as:
|
||||
* - nr_pd: the number of performance domains
|
||||
* - nr_cpus: the number of CPUs
|
||||
* - nr_cs: the sum of the number of capacity states of all performance
|
||||
* domains (for example, on a system with 2 performance domains,
|
||||
* with 10 capacity states each, nr_cs = 2 * 10 = 20).
|
||||
*
|
||||
* It is generally not a good idea to use such a model in the wake-up path on
|
||||
* very complex platforms because of the associated scheduling overheads. The
|
||||
* arbitrary constraint below prevents that. It makes EAS usable up to 16 CPUs
|
||||
* with per-CPU DVFS and less than 8 capacity states each, for example.
|
||||
*/
|
||||
#define EM_MAX_COMPLEXITY 2048
|
||||
|
||||
extern struct cpufreq_governor schedutil_gov;
|
||||
static void build_perf_domains(const struct cpumask *cpu_map)
|
||||
{
|
||||
int i, nr_pd = 0, nr_cs = 0, nr_cpus = cpumask_weight(cpu_map);
|
||||
struct perf_domain *pd = NULL, *tmp;
|
||||
int cpu = cpumask_first(cpu_map);
|
||||
struct root_domain *rd = cpu_rq(cpu)->rd;
|
||||
struct cpufreq_policy *policy;
|
||||
struct cpufreq_governor *gov;
|
||||
|
||||
if (!sysctl_sched_energy_aware)
|
||||
goto free;
|
||||
|
||||
/* EAS is enabled for asymmetric CPU capacity topologies. */
|
||||
if (!per_cpu(sd_asym_cpucapacity, cpu)) {
|
||||
if (sched_debug()) {
|
||||
pr_info("rd %*pbl: CPUs do not have asymmetric capacities\n",
|
||||
cpumask_pr_args(cpu_map));
|
||||
}
|
||||
goto free;
|
||||
}
|
||||
|
||||
for_each_cpu(i, cpu_map) {
|
||||
/* Skip already covered CPUs. */
|
||||
if (find_pd(pd, i))
|
||||
continue;
|
||||
|
||||
/* Do not attempt EAS if schedutil is not being used. */
|
||||
policy = cpufreq_cpu_get(i);
|
||||
if (!policy)
|
||||
goto free;
|
||||
gov = policy->governor;
|
||||
cpufreq_cpu_put(policy);
|
||||
if (gov != &schedutil_gov) {
|
||||
if (rd->pd)
|
||||
pr_warn("rd %*pbl: Disabling EAS, schedutil is mandatory\n",
|
||||
cpumask_pr_args(cpu_map));
|
||||
goto free;
|
||||
}
|
||||
|
||||
/* Create the new pd and add it to the local list. */
|
||||
tmp = pd_init(i);
|
||||
if (!tmp)
|
||||
goto free;
|
||||
tmp->next = pd;
|
||||
pd = tmp;
|
||||
|
||||
/*
|
||||
* Count performance domains and capacity states for the
|
||||
* complexity check.
|
||||
*/
|
||||
nr_pd++;
|
||||
nr_cs += em_pd_nr_cap_states(pd->em_pd);
|
||||
}
|
||||
|
||||
/* Bail out if the Energy Model complexity is too high. */
|
||||
if (nr_pd * (nr_cs + nr_cpus) > EM_MAX_COMPLEXITY) {
|
||||
WARN(1, "rd %*pbl: Failed to start EAS, EM complexity is too high\n",
|
||||
cpumask_pr_args(cpu_map));
|
||||
goto free;
|
||||
}
|
||||
|
||||
perf_domain_debug(cpu_map, pd);
|
||||
|
||||
/* Attach the new list of performance domains to the root domain. */
|
||||
tmp = rd->pd;
|
||||
rcu_assign_pointer(rd->pd, pd);
|
||||
if (tmp)
|
||||
call_rcu(&tmp->rcu, destroy_perf_domain_rcu);
|
||||
|
||||
return;
|
||||
|
||||
free:
|
||||
free_pd(pd);
|
||||
tmp = rd->pd;
|
||||
rcu_assign_pointer(rd->pd, NULL);
|
||||
if (tmp)
|
||||
call_rcu(&tmp->rcu, destroy_perf_domain_rcu);
|
||||
}
|
||||
#else
|
||||
static void free_pd(struct perf_domain *pd) { }
|
||||
#endif /* CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
|
||||
|
||||
static void free_rootdomain(struct rcu_head *rcu)
|
||||
{
|
||||
struct root_domain *rd = container_of(rcu, struct root_domain, rcu);
|
||||
|
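Reviewer note: the complexity bound documented above EM_MAX_COMPLEXITY can be checked by hand. The helper below is a minimal sketch that restates the condition tested in build_perf_domains(); the function name `em_complexity_ok_sketch` is hypothetical and does not exist in the patch.

```c
#define EM_MAX_COMPLEXITY 2048

/*
 * Return 1 when C = nr_pd * (nr_cpus + nr_cs) stays within the limit,
 * mirroring the "> EM_MAX_COMPLEXITY" bail-out in build_perf_domains().
 * nr_cs is the sum of capacity states over all performance domains.
 */
static int em_complexity_ok_sketch(int nr_pd, int nr_cpus, int nr_cs)
{
	return nr_pd * (nr_cs + nr_cpus) <= EM_MAX_COMPLEXITY;
}
```

For 16 CPUs with per-CPU DVFS (16 performance domains) and 7 capacity states each, C = 16 * (16 + 112) = 2048, which passes. With 8 states each, C = 16 * (16 + 128) = 2304, which is rejected. This agrees with the "less than 8 capacity states each" remark in the comment.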
|
@ -211,6 +447,7 @@ static void free_rootdomain(struct rcu_head *rcu)
|
|||
free_cpumask_var(rd->rto_mask);
|
||||
free_cpumask_var(rd->online);
|
||||
free_cpumask_var(rd->span);
|
||||
free_pd(rd->pd);
|
||||
kfree(rd);
|
||||
}
|
||||
|
||||
|
|
@ -287,6 +524,9 @@ static int init_rootdomain(struct root_domain *rd)
|
|||
|
||||
if (cpupri_init(&rd->cpupri) != 0)
|
||||
goto free_cpudl;
|
||||
|
||||
init_max_cpu_capacity(&rd->max_cpu_capacity);
|
||||
|
||||
return 0;
|
||||
|
||||
free_cpudl:
|
||||
|
|
@ -397,7 +637,9 @@ DEFINE_PER_CPU(int, sd_llc_size);
|
|||
DEFINE_PER_CPU(int, sd_llc_id);
|
||||
DEFINE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
|
||||
DEFINE_PER_CPU(struct sched_domain *, sd_numa);
|
||||
DEFINE_PER_CPU(struct sched_domain *, sd_asym);
|
||||
DEFINE_PER_CPU(struct sched_domain *, sd_asym_packing);
|
||||
DEFINE_PER_CPU(struct sched_domain *, sd_asym_cpucapacity);
|
||||
DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
|
||||
|
||||
static void update_top_cache_domain(int cpu)
|
||||
{
|
||||
|
|
@ -422,7 +664,10 @@ static void update_top_cache_domain(int cpu)
|
|||
rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
|
||||
|
||||
sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
|
||||
rcu_assign_pointer(per_cpu(sd_asym, cpu), sd);
|
||||
rcu_assign_pointer(per_cpu(sd_asym_packing, cpu), sd);
|
||||
|
||||
sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY);
|
||||
rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
|
||||
}
|
||||
|
||||
/*
|
||||
|
|
@ -692,6 +937,7 @@ static void init_overlap_sched_group(struct sched_domain *sd,
|
|||
sg_span = sched_group_span(sg);
|
||||
sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
|
||||
sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
|
||||
sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;
|
||||
}
|
||||
|
||||
static int
|
||||
|
|
@ -851,6 +1097,7 @@ static struct sched_group *get_group(int cpu, struct sd_data *sdd)
|
|||
|
||||
sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sched_group_span(sg));
|
||||
sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
|
||||
sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;
|
||||
|
||||
return sg;
|
||||
}
|
||||
|
|
@ -1061,7 +1308,6 @@ static struct cpumask ***sched_domains_numa_masks;
|
|||
* SD_SHARE_PKG_RESOURCES - describes shared caches
|
||||
* SD_NUMA - describes NUMA topologies
|
||||
* SD_SHARE_POWERDOMAIN - describes shared power domain
|
||||
* SD_ASYM_CPUCAPACITY - describes mixed capacity topologies
|
||||
*
|
||||
* Odd one out, which beside describing the topology has a quirk also
|
||||
* prescribes the desired behaviour that goes along with it:
|
||||
|
|
@ -1073,13 +1319,12 @@ static struct cpumask ***sched_domains_numa_masks;
|
|||
SD_SHARE_PKG_RESOURCES | \
|
||||
SD_NUMA | \
|
||||
SD_ASYM_PACKING | \
|
||||
SD_ASYM_CPUCAPACITY | \
|
||||
SD_SHARE_POWERDOMAIN)
|
||||
|
||||
static struct sched_domain *
|
||||
sd_init(struct sched_domain_topology_level *tl,
|
||||
const struct cpumask *cpu_map,
|
||||
struct sched_domain *child, int cpu)
|
||||
struct sched_domain *child, int dflags, int cpu)
|
||||
{
|
||||
struct sd_data *sdd = &tl->data;
|
||||
struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
|
||||
|
|
@ -1100,6 +1345,9 @@ sd_init(struct sched_domain_topology_level *tl,
|
|||
"wrong sd_flags in topology description\n"))
|
||||
sd_flags &= ~TOPOLOGY_SD_FLAGS;
|
||||
|
||||
/* Apply detected topology flags */
|
||||
sd_flags |= dflags;
|
||||
|
||||
*sd = (struct sched_domain){
|
||||
.min_interval = sd_weight,
|
||||
.max_interval = 2*sd_weight,
|
||||
|
|
@ -1122,7 +1370,7 @@ sd_init(struct sched_domain_topology_level *tl,
|
|||
| 0*SD_SHARE_CPUCAPACITY
|
||||
| 0*SD_SHARE_PKG_RESOURCES
|
||||
| 0*SD_SERIALIZE
|
||||
| 0*SD_PREFER_SIBLING
|
||||
| 1*SD_PREFER_SIBLING
|
||||
| 0*SD_NUMA
|
||||
| sd_flags
|
||||
,
|
||||
|
|
@ -1148,17 +1396,21 @@ sd_init(struct sched_domain_topology_level *tl,
|
|||
if (sd->flags & SD_ASYM_CPUCAPACITY) {
|
||||
struct sched_domain *t = sd;
|
||||
|
||||
/*
|
||||
* Don't attempt to spread across CPUs of different capacities.
|
||||
*/
|
||||
if (sd->child)
|
||||
sd->child->flags &= ~SD_PREFER_SIBLING;
|
||||
|
||||
for_each_lower_domain(t)
|
||||
t->flags |= SD_BALANCE_WAKE;
|
||||
}
|
||||
|
||||
if (sd->flags & SD_SHARE_CPUCAPACITY) {
|
||||
sd->flags |= SD_PREFER_SIBLING;
|
||||
sd->imbalance_pct = 110;
|
||||
sd->smt_gain = 1178; /* ~15% */
|
||||
|
||||
} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
|
||||
sd->flags |= SD_PREFER_SIBLING;
|
||||
sd->imbalance_pct = 117;
|
||||
sd->cache_nice_tries = 1;
|
||||
sd->busy_idx = 2;
|
||||
|
|
@ -1169,6 +1421,7 @@ sd_init(struct sched_domain_topology_level *tl,
|
|||
sd->busy_idx = 3;
|
||||
sd->idle_idx = 2;
|
||||
|
||||
sd->flags &= ~SD_PREFER_SIBLING;
|
||||
sd->flags |= SD_SERIALIZE;
|
||||
if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
|
||||
sd->flags &= ~(SD_BALANCE_EXEC |
|
||||
|
|
@ -1178,7 +1431,6 @@ sd_init(struct sched_domain_topology_level *tl,
|
|||
|
||||
#endif
|
||||
} else {
|
||||
sd->flags |= SD_PREFER_SIBLING;
|
||||
sd->cache_nice_tries = 1;
|
||||
sd->busy_idx = 2;
|
||||
sd->idle_idx = 1;
|
||||
|
|
@ -1604,9 +1856,9 @@ static void __sdt_free(const struct cpumask *cpu_map)
|
|||
|
||||
static struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
|
||||
const struct cpumask *cpu_map, struct sched_domain_attr *attr,
|
||||
struct sched_domain *child, int cpu)
|
||||
struct sched_domain *child, int dflags, int cpu)
|
||||
{
|
||||
struct sched_domain *sd = sd_init(tl, cpu_map, child, cpu);
|
||||
struct sched_domain *sd = sd_init(tl, cpu_map, child, dflags, cpu);
|
||||
|
||||
if (child) {
|
||||
sd->level = child->level + 1;
|
||||
|
|
@ -1632,6 +1884,65 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
|
|||
return sd;
|
||||
}
|
||||
|
||||
/*
|
||||
* Find the sched_domain_topology_level where all CPU capacities are visible
|
||||
* for all CPUs.
|
||||
*/
|
||||
static struct sched_domain_topology_level
|
||||
*asym_cpu_capacity_level(const struct cpumask *cpu_map)
|
||||
{
|
||||
int i, j, asym_level = 0;
|
||||
bool asym = false;
|
||||
struct sched_domain_topology_level *tl, *asym_tl = NULL;
|
||||
unsigned long cap;
|
||||
|
||||
/* Is there any asymmetry? */
|
||||
cap = arch_scale_cpu_capacity(NULL, cpumask_first(cpu_map));
|
||||
|
||||
for_each_cpu(i, cpu_map) {
|
||||
if (arch_scale_cpu_capacity(NULL, i) != cap) {
|
||||
asym = true;
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
if (!asym)
|
||||
return NULL;
|
||||
|
||||
/*
|
||||
* Examine topology from all CPUs' points of view to detect the lowest
|
||||
* sched_domain_topology_level where a highest capacity CPU is visible
|
||||
* to everyone.
|
||||
*/
|
||||
for_each_cpu(i, cpu_map) {
|
||||
unsigned long max_capacity = arch_scale_cpu_capacity(NULL, i);
|
||||
int tl_id = 0;
|
||||
|
||||
for_each_sd_topology(tl) {
|
||||
if (tl_id < asym_level)
|
||||
goto next_level;
|
||||
|
||||
for_each_cpu_and(j, tl->mask(i), cpu_map) {
|
||||
unsigned long capacity;
|
||||
|
||||
capacity = arch_scale_cpu_capacity(NULL, j);
|
||||
|
||||
if (capacity <= max_capacity)
|
||||
continue;
|
||||
|
||||
max_capacity = capacity;
|
||||
asym_level = tl_id;
|
||||
asym_tl = tl;
|
||||
}
|
||||
next_level:
|
||||
tl_id++;
|
||||
}
|
||||
}
|
||||
|
||||
return asym_tl;
|
||||
}
|
||||
|
||||
|
||||
/*
|
||||
* Build sched domains for a given set of CPUs and attach the sched domains
|
||||
* to the individual CPUs
|
||||
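Reviewer note: the level search in asym_cpu_capacity_level() is easier to follow on a small user-space model. The sketch below assumes at most 8 CPUs, represents each topology level as a per-CPU bitmask, and returns a level index instead of a pointer; `SKETCH_MAX_CPUS` and `asym_level_sketch` are hypothetical names, and the masks stand in for tl->mask(i).

```c
#define SKETCH_MAX_CPUS 8

/*
 * masks[level][cpu] is a bitmask of the CPUs that 'cpu' sees at 'level'.
 * Returns the lowest level index at which a highest-capacity CPU is
 * visible to every CPU, or -1 when all capacities are equal.
 */
static int asym_level_sketch(const unsigned long cap[], int nr_cpus,
			     const unsigned int masks[][SKETCH_MAX_CPUS],
			     int nr_levels)
{
	int i, j, tl_id, asym_level = 0, asym_tl = -1, asym = 0;

	/* Is there any asymmetry at all? */
	for (i = 1; i < nr_cpus; i++) {
		if (cap[i] != cap[0]) {
			asym = 1;
			break;
		}
	}
	if (!asym)
		return -1;

	for (i = 0; i < nr_cpus; i++) {
		unsigned long max_capacity = cap[i];

		for (tl_id = 0; tl_id < nr_levels; tl_id++) {
			/* Levels below the current candidate cannot win. */
			if (tl_id < asym_level)
				continue;

			for (j = 0; j < nr_cpus; j++) {
				if (!(masks[tl_id][i] & (1u << j)))
					continue;
				if (cap[j] <= max_capacity)
					continue;

				/* A bigger CPU first becomes visible here. */
				max_capacity = cap[j];
				asym_level = tl_id;
				asym_tl = tl_id;
			}
		}
	}

	return asym_tl;
}
```

On a hypothetical 4+4 big.LITTLE layout with one level per cluster and one level spanning all CPUs, the little CPUs only see a big CPU at the second level, so the sketch returns index 1. When all CPUs share a single level, it returns index 0.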
|
|
@ -1642,20 +1953,31 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
|
|||
enum s_alloc alloc_state;
|
||||
struct sched_domain *sd;
|
||||
struct s_data d;
|
||||
struct rq *rq = NULL;
|
||||
int i, ret = -ENOMEM;
|
||||
struct sched_domain_topology_level *tl_asym;
|
||||
bool has_asym = false;
|
||||
|
||||
alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
|
||||
if (alloc_state != sa_rootdomain)
|
||||
goto error;
|
||||
|
||||
tl_asym = asym_cpu_capacity_level(cpu_map);
|
||||
|
||||
/* Set up domains for CPUs specified by the cpu_map: */
|
||||
for_each_cpu(i, cpu_map) {
|
||||
struct sched_domain_topology_level *tl;
|
||||
|
||||
sd = NULL;
|
||||
for_each_sd_topology(tl) {
|
||||
sd = build_sched_domain(tl, cpu_map, attr, sd, i);
|
||||
int dflags = 0;
|
||||
|
||||
if (tl == tl_asym) {
|
||||
dflags |= SD_ASYM_CPUCAPACITY;
|
||||
has_asym = true;
|
||||
}
|
||||
|
||||
sd = build_sched_domain(tl, cpu_map, attr, sd, dflags, i);
|
||||
|
||||
if (tl == sched_domain_topology)
|
||||
*per_cpu_ptr(d.sd, i) = sd;
|
||||
if (tl->flags & SDTL_OVERLAP)
|
||||
|
|
@ -1693,21 +2015,13 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
|
|||
/* Attach the domains */
|
||||
rcu_read_lock();
|
||||
for_each_cpu(i, cpu_map) {
|
||||
rq = cpu_rq(i);
|
||||
sd = *per_cpu_ptr(d.sd, i);
|
||||
|
||||
/* Use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing: */
|
||||
if (rq->cpu_capacity_orig > READ_ONCE(d.rd->max_cpu_capacity))
|
||||
WRITE_ONCE(d.rd->max_cpu_capacity, rq->cpu_capacity_orig);
|
||||
|
||||
cpu_attach_domain(sd, d.rd, i);
|
||||
}
|
||||
rcu_read_unlock();
|
||||
|
||||
if (rq && sched_debug_enabled) {
|
||||
pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n",
|
||||
cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
|
||||
}
|
||||
if (has_asym)
|
||||
static_branch_enable_cpuslocked(&sched_asym_cpucapacity);
|
||||
|
||||
ret = 0;
|
||||
error:
|
||||
|
|
@ -1879,8 +2193,8 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
|
|||
/* Destroy deleted domains: */
|
||||
for (i = 0; i < ndoms_cur; i++) {
|
||||
for (j = 0; j < n && !new_topology; j++) {
|
||||
if (cpumask_equal(doms_cur[i], doms_new[j])
|
||||
&& dattrs_equal(dattr_cur, i, dattr_new, j))
|
||||
if (cpumask_equal(doms_cur[i], doms_new[j]) &&
|
||||
dattrs_equal(dattr_cur, i, dattr_new, j))
|
||||
goto match1;
|
||||
}
|
||||
/* No match - a current sched domain not in new doms_new[] */
|
||||
|
|
@ -1900,8 +2214,8 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
|
|||
/* Build new domains: */
|
||||
for (i = 0; i < ndoms_new; i++) {
|
||||
for (j = 0; j < n && !new_topology; j++) {
|
||||
if (cpumask_equal(doms_new[i], doms_cur[j])
|
||||
&& dattrs_equal(dattr_new, i, dattr_cur, j))
|
||||
if (cpumask_equal(doms_new[i], doms_cur[j]) &&
|
||||
dattrs_equal(dattr_new, i, dattr_cur, j))
|
||||
goto match2;
|
||||
}
|
||||
/* No match - add a new doms_new */
|
||||
|
|
@ -1910,6 +2224,22 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
|
|||
;
|
||||
}
|
||||
|
||||
#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
|
||||
/* Build perf. domains: */
|
||||
for (i = 0; i < ndoms_new; i++) {
|
||||
for (j = 0; j < n && !sched_energy_update; j++) {
|
||||
if (cpumask_equal(doms_new[i], doms_cur[j]) &&
|
||||
cpu_rq(cpumask_first(doms_cur[j]))->rd->pd)
|
||||
goto match3;
|
||||
}
|
||||
/* No match - add perf. domains for a new rd */
|
||||
build_perf_domains(doms_new[i]);
|
||||
match3:
|
||||
;
|
||||
}
|
||||
sched_energy_start(ndoms_new, doms_new);
|
||||
#endif
|
||||
|
||||
/* Remember the new sched domains: */
|
||||
if (doms_cur != &fallback_doms)
|
||||
free_sched_domains(doms_cur, ndoms_cur);
|
||||
|
|
|
|||
692
kernel/sched/tune.c
Normal file
|
|
@ -0,0 +1,692 @@
|
|||
#include <linux/cgroup.h>
|
||||
#include <linux/err.h>
|
||||
#include <linux/kernel.h>
|
||||
#include <linux/percpu.h>
|
||||
#include <linux/printk.h>
|
||||
#include <linux/rcupdate.h>
|
||||
#include <linux/slab.h>
|
||||
|
||||
#include <trace/events/sched.h>
|
||||
|
||||
#include "sched.h"
|
||||
|
||||
bool schedtune_initialized = false;
|
||||
extern struct reciprocal_value schedtune_spc_rdiv;
|
||||
|
||||
/* We hold schedtune boost in effect for at least this long */
|
||||
#define SCHEDTUNE_BOOST_HOLD_NS 50000000ULL
|
||||
|
||||
/*
|
||||
* EAS scheduler tunables for task groups.
|
||||
*
|
||||
* When CGroup support is enabled, we have to synchronize two different
|
||||
* paths:
|
||||
* - slow path: where CGroups are created/updated/removed
|
||||
* - fast path: where tasks in a CGroup are accounted
|
||||
*
|
||||
* The slow path tracks (a limited number of) CGroups and maps each on a
|
||||
* "boost_group" index. The fastpath accounts tasks currently RUNNABLE on each
|
||||
* "boost_group".
|
||||
*
|
||||
* Once a new CGroup is created, a boost group idx is assigned and the
|
||||
* corresponding "boost_group" marked as valid on each CPU.
|
||||
* Once a CGroup is released, the corresponding "boost_group" is marked as
|
||||
* invalid on each CPU. The CPU boost value (boost_max) is aggregated by
|
||||
* considering only valid boost_groups with a non null tasks counter.
|
||||
*
|
||||
* .:: Locking strategy
|
||||
*
|
||||
* The fast path uses a spin lock for each CPU boost_group which protects the
|
||||
* tasks counter.
|
||||
*
|
||||
* The "valid" and "boost" values of each CPU boost_group are instead
|
||||
* protected by the RCU lock provided by the CGroups callbacks. Thus, only the
|
||||
* slow path can access and modify the boost_group attributes of each CPU.
|
||||
* The fast path picks up the most recent values at the next scheduling
|
||||
* event (i.e. enqueue/dequeue).
|
||||
*
|
||||
* |
|
||||
* SLOW PATH | FAST PATH
|
||||
* CGroup add/update/remove | Scheduler enqueue/dequeue events
|
||||
* |
|
||||
* |
|
||||
* | DEFINE_PER_CPU(struct boost_groups)
|
||||
* | +--------------+----+---+----+----+
|
||||
* | | idle | | | | |
|
||||
* | | boost_max | | | | |
|
||||
* | +---->lock | | | | |
|
||||
* struct schedtune allocated_groups | | | group[ ] | | | | |
|
||||
* +------------------------------+ +-------+ | | +--+---------+-+----+---+----+----+
|
||||
* | idx | | | | | | valid |
|
||||
* | boost / prefer_idle | | | | | | boost |
|
||||
* | perf_{boost/constraints}_idx | <---------+(*) | | | | tasks | <------------+
|
||||
* | css | +-------+ | | +---------+ |
|
||||
* +-+----------------------------+ | | | | | | |
|
||||
* ^ | | | | | | |
|
||||
* | +-------+ | | +---------+ |
|
||||
* | | | | | | | |
|
||||
* | | | | | | | |
|
||||
* | +-------+ | | +---------+ |
|
||||
* | kzalloc | | | | | | |
|
||||
* | | | | | | | |
|
||||
* | +-------+ | | +---------+ |
|
||||
* + BOOSTGROUPS_COUNT | | BOOSTGROUPS_COUNT |
|
||||
* schedtune_boostgroup_init() | + |
|
||||
* | schedtune_{en,de}queue_task() |
|
||||
* | +
|
||||
* | schedtune_tasks_update()
|
||||
* |
|
||||
*/
|
||||
|
||||
/* SchedTune tunables for a group of tasks */
|
||||
struct schedtune {
|
||||
/* SchedTune CGroup subsystem */
|
||||
struct cgroup_subsys_state css;
|
||||
|
||||
/* Boost group allocated ID */
|
||||
int idx;
|
||||
|
||||
/* Boost value for tasks on that SchedTune CGroup */
|
||||
int boost;
|
||||
|
||||
/* Hint to bias scheduling of tasks on that SchedTune CGroup
|
||||
* towards idle CPUs */
|
||||
int prefer_idle;
|
||||
};
|
||||
|
||||
static inline struct schedtune *css_st(struct cgroup_subsys_state *css)
|
||||
{
|
||||
return css ? container_of(css, struct schedtune, css) : NULL;
|
||||
}
|
||||
|
||||
static inline struct schedtune *task_schedtune(struct task_struct *tsk)
|
||||
{
|
||||
return css_st(task_css(tsk, schedtune_cgrp_id));
|
||||
}
|
||||
|
||||
static inline struct schedtune *parent_st(struct schedtune *st)
|
||||
{
|
||||
return css_st(st->css.parent);
|
||||
}
|
||||
|
||||
/*
|
||||
* SchedTune root control group
|
||||
* The root control group is used to define a system-wide boosting tuning,
|
||||
* which is applied to all tasks in the system.
|
||||
* Task specific boost tuning could be specified by creating and
|
||||
* configuring a child control group under the root one.
|
||||
* By default, system-wide boosting is disabled, i.e. no boosting is applied
|
||||
* to tasks which are not in a child control group.
|
||||
*/
|
||||
static struct schedtune
|
||||
root_schedtune = {
|
||||
.boost = 0,
|
||||
.prefer_idle = 0,
|
||||
};
|
||||
|
||||
/*
|
||||
* Maximum number of boost groups to support
|
||||
* When per-task boosting is used we still allow only a limited number of
|
||||
* boost groups for two main reasons:
|
||||
* 1. on a real system we usually have only a few classes of workloads which
|
||||
* make sense to boost with different values (e.g. background vs foreground
|
||||
* tasks, interactive vs low-priority tasks)
|
||||
* 2. a limited number allows for a simpler and more memory/time efficient
|
||||
* implementation especially for the computation of the per-CPU boost
|
||||
* value
|
||||
*/
|
||||
#define BOOSTGROUPS_COUNT 5
|
||||
|
||||
/* Array of configured boostgroups */
|
||||
static struct schedtune *allocated_group[BOOSTGROUPS_COUNT] = {
|
||||
&root_schedtune,
|
||||
NULL,
|
||||
};
|
||||
|
||||
/* SchedTune boost groups
|
||||
* Keep track of all the boost groups which impact a CPU, for example when a
|
||||
* CPU has two RUNNABLE tasks belonging to two different boost groups and thus
|
||||
* likely with different boost values.
|
||||
* Since on each system we expect only a limited number of boost groups, here
|
||||
* we use a simple array to keep track of the metrics required to compute the
|
||||
* maximum per-CPU boosting value.
|
||||
*/
|
||||
struct boost_groups {
|
||||
/* Maximum boost value for all RUNNABLE tasks on a CPU */
|
||||
int boost_max;
|
||||
u64 boost_ts;
|
||||
struct {
|
||||
/* True when this boost group maps an actual cgroup */
|
||||
bool valid;
|
||||
/* The boost for tasks on that boost group */
|
||||
int boost;
|
||||
/* Count of RUNNABLE tasks on that boost group */
|
||||
unsigned tasks;
|
||||
/* Timestamp of boost activation */
|
||||
u64 ts;
|
||||
} group[BOOSTGROUPS_COUNT];
|
||||
/* CPU's boost group locking */
|
||||
raw_spinlock_t lock;
|
||||
};
|
||||
|
||||
/* Boost groups affecting each CPU in the system */
|
||||
DEFINE_PER_CPU(struct boost_groups, cpu_boost_groups);
|
||||
|
||||
static inline bool schedtune_boost_timeout(u64 now, u64 ts)
|
||||
{
|
||||
return ((now - ts) > SCHEDTUNE_BOOST_HOLD_NS);
|
||||
}
|
||||
|
||||
static inline bool
|
||||
schedtune_boost_group_active(int idx, struct boost_groups* bg, u64 now)
|
||||
{
|
||||
if (bg->group[idx].tasks)
|
||||
return true;
|
||||
|
||||
return !schedtune_boost_timeout(now, bg->group[idx].ts);
|
||||
}
|
||||
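Review note: the hold logic above can be modelled outside the kernel. The following is a minimal user-space sketch, assuming a monotonic nanosecond clock; `struct group_model`, `hold_expired()` and `group_active()` are hypothetical names used only for illustration, and `HOLD_NS` copies the value of `SCHEDTUNE_BOOST_HOLD_NS` from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same value as SCHEDTUNE_BOOST_HOLD_NS in the patch: 50 ms in nanoseconds. */
#define HOLD_NS 50000000ULL

/* Hypothetical stand-alone model of one per-CPU boost group slot. */
struct group_model {
	unsigned int tasks;	/* RUNNABLE tasks accounted to the group */
	uint64_t ts;		/* timestamp of the last boost activation */
};

/* True once strictly more than HOLD_NS has elapsed since ts. */
static bool hold_expired(uint64_t now, uint64_t ts)
{
	assert(now >= ts);	/* assumption of this model: clock is monotonic */
	return (now - ts) > HOLD_NS;
}

/* A group counts as active while it has tasks or its hold window is open. */
static bool group_active(const struct group_model *g, uint64_t now)
{
	if (g->tasks)
		return true;

	return !hold_expired(now, g->ts);
}
```

The comparison is strict, so a group whose last activation was exactly `HOLD_NS` ago is still treated as active in this sketch.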
|
||||
static void
|
||||
schedtune_cpu_update(int cpu, u64 now)
|
||||
{
|
||||
struct boost_groups *bg = &per_cpu(cpu_boost_groups, cpu);
|
||||
int boost_max;
|
||||
u64 boost_ts;
|
||||
int idx;
|
||||
|
||||
/* The root boost group is always active */
|
||||
boost_max = bg->group[0].boost;
|
||||
boost_ts = now;
|
||||
for (idx = 1; idx < BOOSTGROUPS_COUNT; ++idx) {
|
||||
|
||||
/* Ignore boost groups not mapping a cgroup */
|
||||
if (!bg->group[idx].valid)
|
||||
continue;
|
||||
|
||||
/*
|
||||
* A boost group affects a CPU only if it has
|
||||
* RUNNABLE tasks on that CPU or it has a hold
|
||||
* in effect from a previous task.
|
||||
*/
|
||||
if (!schedtune_boost_group_active(idx, bg, now))
|
||||
continue;
|
||||
|
||||
/* This boost group is active */
|
||||
if (boost_max > bg->group[idx].boost)
|
||||
continue;
|
||||
|
||||
boost_max = bg->group[idx].boost;
|
||||
boost_ts = bg->group[idx].ts;
|
||||
}
|
||||
|
||||
/* Ensures boost_max is non-negative when all cgroup boost values
|
||||
* are negative. Avoids under-accounting of cpu capacity which may cause
|
||||
* task stacking and frequency spikes.*/
|
||||
boost_max = max(boost_max, 0);
|
||||
bg->boost_max = boost_max;
|
||||
bg->boost_ts = boost_ts;
|
||||
}
|
||||
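Review note: the per-CPU aggregation in `schedtune_cpu_update()` can be read as a clamped maximum over the active groups. Below is a rough user-space sketch of that reading. `struct slot` and `aggregate_boost_max()` are hypothetical, and the `active` flag stands in for the result of `schedtune_boost_group_active()`; timestamps are left out on purpose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flattened view of one boost group on a CPU. */
struct slot {
	bool valid;	/* slot maps an actual cgroup */
	bool active;	/* has RUNNABLE tasks or a hold still in effect */
	int boost;	/* boost value configured for the group */
};

/*
 * Return the largest boost among the root slot (index 0, always counted)
 * and every other slot that is both valid and active, clamped at zero.
 */
static int aggregate_boost_max(const struct slot *slots, size_t n)
{
	int boost_max;
	size_t i;

	assert(n > 0);
	boost_max = slots[0].boost;

	for (i = 1; i < n; i++) {
		if (!slots[i].valid || !slots[i].active)
			continue;
		if (boost_max > slots[i].boost)
			continue;
		boost_max = slots[i].boost;
	}

	/* Never report a negative boost for the CPU. */
	return boost_max < 0 ? 0 : boost_max;
}
```

With only a handful of slots a linear scan is cheap, which matches the rationale given for keeping `BOOSTGROUPS_COUNT` small.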
|
||||
static int
|
||||
schedtune_boostgroup_update(int idx, int boost)
|
||||
{
|
||||
struct boost_groups *bg;
|
||||
int cur_boost_max;
|
||||
int old_boost;
|
||||
int cpu;
|
||||
u64 now;
|
||||
|
||||
/* Update per CPU boost groups */
|
||||
for_each_possible_cpu(cpu) {
|
||||
bg = &per_cpu(cpu_boost_groups, cpu);
|
||||
|
||||
/* CGroups are never associated with invalid boost groups */
|
||||
BUG_ON(!bg->group[idx].valid);
|
||||
|
||||
/*
|
||||
* Keep track of current boost values to compute the per CPU
|
||||
* maximum only when it has been affected by the new value of
|
||||
* the updated boost group
|
||||
*/
|
||||
cur_boost_max = bg->boost_max;
|
||||
old_boost = bg->group[idx].boost;
|
||||
|
||||
/* Update the boost value of this boost group */
|
||||
bg->group[idx].boost = boost;
|
||||
|
||||
/* Check if this update increases the current max */
|
||||
now = sched_clock_cpu(cpu);
|
||||
if (boost > cur_boost_max &&
|
||||
schedtune_boost_group_active(idx, bg, now)) {
|
||||
bg->boost_max = boost;
|
||||
bg->boost_ts = bg->group[idx].ts;
|
||||
|
||||
trace_sched_tune_boostgroup_update(cpu, 1, bg->boost_max);
|
||||
continue;
|
||||
}
|
||||
|
||||
/* Check if this update has decreased current max */
|
||||
if (cur_boost_max == old_boost && old_boost > boost) {
|
||||
schedtune_cpu_update(cpu, now);
|
||||
trace_sched_tune_boostgroup_update(cpu, -1, bg->boost_max);
|
||||
continue;
|
||||
}
|
||||
|
||||
trace_sched_tune_boostgroup_update(cpu, 0, bg->boost_max);
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
|
||||
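Review note: the per-CPU decision in `schedtune_boostgroup_update()` has three outcomes, which line up with the `1`, `-1` and `0` arguments passed to the trace call. A small sketch of that classification follows; `enum update_kind` and `classify_update()` are hypothetical names, and `group_active` stands in for the `schedtune_boost_group_active()` check.

```c
#include <stdbool.h>

/* Values chosen to match the second argument of the trace calls. */
enum update_kind {
	BOOST_UNCHANGED = 0,	/* CPU maximum is not affected */
	BOOST_RAISED = 1,	/* new value becomes the CPU maximum */
	BOOST_RESCAN = -1,	/* old maximum was lowered, rescan all groups */
};

/*
 * cur_max:   CPU maximum before the write
 * old_boost: previous boost of the updated group
 * new_boost: boost being written
 */
static enum update_kind classify_update(int cur_max, int old_boost,
					int new_boost, bool group_active)
{
	if (new_boost > cur_max && group_active)
		return BOOST_RAISED;

	if (cur_max == old_boost && old_boost > new_boost)
		return BOOST_RESCAN;

	return BOOST_UNCHANGED;
}
```

Only the rescan case needs the full loop over all groups; the other two are constant-time, which is presumably why the function tracks `cur_boost_max` and `old_boost` before overwriting the group value.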
|
||||
#define ENQUEUE_TASK 1
|
||||
#define DEQUEUE_TASK -1
|
||||
|
||||
static inline bool
|
||||
schedtune_update_timestamp(struct task_struct *p)
|
||||
{
|
||||
if (sched_feat(SCHEDTUNE_BOOST_HOLD_ALL))
|
||||
return true;
|
||||
|
||||
return task_has_rt_policy(p);
|
||||
}
|
||||
|
||||
static inline void
|
||||
schedtune_tasks_update(struct task_struct *p, int cpu, int idx, int task_count)
|
||||
{
|
||||
struct boost_groups *bg = &per_cpu(cpu_boost_groups, cpu);
|
||||
int tasks = bg->group[idx].tasks + task_count;
|
||||
|
||||
/* Update boosted tasks count without letting it go negative */
|
||||
bg->group[idx].tasks = max(0, tasks);
|
||||
|
||||
/* Update timeout on enqueue */
|
||||
if (task_count > 0) {
|
||||
u64 now = sched_clock_cpu(cpu);
|
||||
|
||||
if (schedtune_update_timestamp(p))
|
||||
bg->group[idx].ts = now;
|
||||
|
||||
/* Boost group activation or deactivation on that RQ */
|
||||
if (bg->group[idx].tasks == 1)
|
||||
schedtune_cpu_update(cpu, now);
|
||||
}
|
||||
|
||||
trace_sched_tune_tasks_update(p, cpu, tasks, idx,
|
||||
bg->group[idx].boost, bg->boost_max,
|
||||
bg->group[idx].ts);
|
||||
}
|
||||
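Review note: the counter update in `schedtune_tasks_update()` adds a signed delta to an unsigned counter and clamps the result at zero. A minimal sketch of the same idea using a wider signed intermediate is shown here; `apply_delta()` is a hypothetical helper, and it assumes the delta is +1 or -1 and that counts stay far below `UINT_MAX`.

```c
/*
 * Add a signed enqueue/dequeue delta to an unsigned task count.
 * The sum is computed in a wider signed type so that a dequeue on an
 * already empty group yields 0 instead of wrapping around.
 */
static unsigned int apply_delta(unsigned int tasks, int delta)
{
	long long next = (long long)tasks + delta;

	return next < 0 ? 0u : (unsigned int)next;
}
```

A dequeue that arrives with the counter already at zero then leaves it at zero, which is the behaviour the clamp in the patch is aiming for.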
|
||||
/*
|
||||
* NOTE: This function must be called while holding the lock on the CPU RQ
|
||||
*/
|
||||
void schedtune_enqueue_task(struct task_struct *p, int cpu)
|
||||
{
|
||||
struct boost_groups *bg = &per_cpu(cpu_boost_groups, cpu);
|
||||
unsigned long irq_flags;
|
||||
struct schedtune *st;
|
||||
int idx;
|
||||
|
||||
if (unlikely(!schedtune_initialized))
|
||||
return;
|
||||
|
||||
/*
|
||||
* Boost group accounting is protected by a per-cpu lock and requires
|
||||
* interrupts to be disabled to avoid race conditions for example on
|
||||
* do_exit()::cgroup_exit() and task migration.
|
||||
*/
|
||||
raw_spin_lock_irqsave(&bg->lock, irq_flags);
|
||||
rcu_read_lock();
|
||||
|
||||
st = task_schedtune(p);
|
||||
idx = st->idx;
|
||||
|
||||
schedtune_tasks_update(p, cpu, idx, ENQUEUE_TASK);
|
||||
|
||||
rcu_read_unlock();
|
||||
raw_spin_unlock_irqrestore(&bg->lock, irq_flags);
|
||||
}
|
||||
|
||||
int schedtune_can_attach(struct cgroup_taskset *tset)
|
||||
{
|
||||
struct task_struct *task;
|
||||
struct cgroup_subsys_state *css;
|
||||
struct boost_groups *bg;
|
||||
struct rq_flags rq_flags;
|
||||
unsigned int cpu;
|
||||
struct rq *rq;
|
||||
int src_bg; /* Source boost group index */
|
||||
int dst_bg; /* Destination boost group index */
|
||||
int tasks;
|
||||
u64 now;
|
||||
|
||||
if (unlikely(!schedtune_initialized))
|
||||
return 0;
|
||||
|
||||
|
||||
cgroup_taskset_for_each(task, css, tset) {
|
||||
|
||||
/*
|
||||
* Lock the RQ of the CPU the task is enqueued on to avoid race
|
||||
* conditions with migration code while the task is being
|
||||
* accounted
|
||||
*/
|
||||
rq = task_rq_lock(task, &rq_flags);
|
||||
|
||||
if (!task->on_rq) {
|
||||
task_rq_unlock(rq, task, &rq_flags);
|
||||
continue;
|
||||
}
|
||||
|
||||
/*
|
||||
* Boost group accounting is protected by a per-cpu lock and requires
|
||||
* interrupts to be disabled to avoid race conditions on...
|
||||
*/
|
||||
cpu = cpu_of(rq);
|
||||
bg = &per_cpu(cpu_boost_groups, cpu);
|
||||
raw_spin_lock(&bg->lock);
|
||||
|
||||
dst_bg = css_st(css)->idx;
|
||||
src_bg = task_schedtune(task)->idx;
|
||||
|
||||
/*
|
||||
* Current task is not changing boostgroup, which can
|
||||
* happen when the new hierarchy is in use.
|
||||
*/
|
||||
if (unlikely(dst_bg == src_bg)) {
|
||||
raw_spin_unlock(&bg->lock);
|
||||
task_rq_unlock(rq, task, &rq_flags);
|
||||
continue;
|
||||
}
|
||||
|
||||
/*
|
||||
* This is the case of a RUNNABLE task which is switching its
|
||||
* current boost group.
|
||||
*/
|
||||
|
||||
/* Move task from src to dst boost group */
|
||||
tasks = bg->group[src_bg].tasks - 1;
|
||||
bg->group[src_bg].tasks = max(0, tasks);
|
||||
bg->group[dst_bg].tasks += 1;
|
||||
|
||||
/* Update boost hold start for this group */
|
||||
now = sched_clock_cpu(cpu);
|
||||
bg->group[dst_bg].ts = now;
|
||||
|
||||
/* Force boost group re-evaluation at next boost check */
|
||||
bg->boost_ts = now - SCHEDTUNE_BOOST_HOLD_NS;
|
||||
|
||||
raw_spin_unlock(&bg->lock);
|
||||
task_rq_unlock(rq, task, &rq_flags);
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
void schedtune_cancel_attach(struct cgroup_taskset *tset)
|
||||
{
|
||||
/* This can happen only if SchedTune controller is mounted with
|
||||
* other hierarchies and one of them fails. Since usually SchedTune is
|
||||
* mounted on its own hierarchy, for the time being we do not implement
|
||||
* a proper rollback mechanism */
|
||||
WARN(1, "SchedTune cancel attach not implemented");
|
||||
}
|
||||
|
||||
/*
|
||||
* NOTE: This function must be called while holding the lock on the CPU RQ
|
||||
*/
|
||||
void schedtune_dequeue_task(struct task_struct *p, int cpu)
|
||||
{
|
||||
struct boost_groups *bg = &per_cpu(cpu_boost_groups, cpu);
|
||||
unsigned long irq_flags;
|
||||
struct schedtune *st;
|
||||
int idx;
|
||||
|
||||
if (unlikely(!schedtune_initialized))
|
||||
return;
|
||||
|
||||
/*
|
||||
* Boost group accounting is protected by a per-cpu lock and requires
|
||||
* interrupts to be disabled to avoid race conditions on...
|
||||
*/
|
||||
raw_spin_lock_irqsave(&bg->lock, irq_flags);
|
||||
rcu_read_lock();
|
||||
|
||||
st = task_schedtune(p);
|
||||
idx = st->idx;
|
||||
|
||||
schedtune_tasks_update(p, cpu, idx, DEQUEUE_TASK);
|
||||
|
||||
rcu_read_unlock();
|
||||
raw_spin_unlock_irqrestore(&bg->lock, irq_flags);
|
||||
}
|
||||
|
||||
int schedtune_cpu_boost(int cpu)
|
||||
{
|
||||
struct boost_groups *bg;
|
||||
u64 now;
|
||||
|
||||
bg = &per_cpu(cpu_boost_groups, cpu);
|
||||
now = sched_clock_cpu(cpu);
|
||||
|
||||
/* Check to see if we have a hold in effect */
|
||||
if (schedtune_boost_timeout(now, bg->boost_ts))
|
||||
schedtune_cpu_update(cpu, now);
|
||||
|
||||
return bg->boost_max;
|
||||
}
|
||||
|
||||
int schedtune_task_boost(struct task_struct *p)
|
||||
{
|
||||
struct schedtune *st;
|
||||
int task_boost;
|
||||
|
||||
if (unlikely(!schedtune_initialized))
|
||||
return 0;
|
||||
|
||||
/* Get task boost value */
|
||||
rcu_read_lock();
|
||||
st = task_schedtune(p);
|
||||
task_boost = st->boost;
|
||||
rcu_read_unlock();
|
||||
|
||||
return task_boost;
|
||||
}
|
||||
|
||||
int schedtune_prefer_idle(struct task_struct *p)
|
||||
{
|
||||
struct schedtune *st;
|
||||
int prefer_idle;
|
||||
|
||||
if (unlikely(!schedtune_initialized))
|
||||
return 0;
|
||||
|
||||
/* Get prefer_idle value */
|
||||
rcu_read_lock();
|
||||
st = task_schedtune(p);
|
||||
prefer_idle = st->prefer_idle;
|
||||
rcu_read_unlock();
|
||||
|
||||
return prefer_idle;
|
||||
}
|
||||
|
||||
static u64
|
||||
prefer_idle_read(struct cgroup_subsys_state *css, struct cftype *cft)
|
||||
{
|
||||
struct schedtune *st = css_st(css);
|
||||
|
||||
return st->prefer_idle;
|
||||
}
|
||||
|
||||
static int
|
||||
prefer_idle_write(struct cgroup_subsys_state *css, struct cftype *cft,
|
||||
u64 prefer_idle)
|
||||
{
|
||||
struct schedtune *st = css_st(css);
|
||||
st->prefer_idle = !!prefer_idle;
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
static s64
|
||||
boost_read(struct cgroup_subsys_state *css, struct cftype *cft)
|
||||
{
|
||||
struct schedtune *st = css_st(css);
|
||||
|
||||
return st->boost;
|
||||
}
|
||||
|
||||
static int
|
||||
boost_write(struct cgroup_subsys_state *css, struct cftype *cft,
|
||||
s64 boost)
|
||||
{
|
||||
struct schedtune *st = css_st(css);
|
||||
|
||||
if (boost < 0 || boost > 100)
|
||||
return -EINVAL;
|
||||
|
||||
st->boost = boost;
|
||||
|
||||
/* Update CPU boost */
|
||||
schedtune_boostgroup_update(st->idx, st->boost);
|
||||
|
||||
return 0;
|
||||
}
|
||||
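Review note: from user space, the `boost` attribute is driven by writing a decimal value to the cgroup file. The sketch below mirrors the 0..100 range check from `boost_write()` before touching the file. `write_boost()` is a hypothetical helper, and the attribute path is an assumption: it depends on where the schedtune hierarchy is mounted and on the cgroup name.

```c
#include <errno.h>
#include <stdio.h>

/*
 * Write a boost percentage to a schedtune "boost" attribute file.
 * Returns 0 on success or a negative errno-style value.
 * The path is supplied by the caller; nothing here knows the mount point.
 */
static int write_boost(const char *path, int boost)
{
	FILE *f;
	int ret = 0;

	/* Same range the kernel-side handler accepts. */
	if (boost < 0 || boost > 100)
		return -EINVAL;

	f = fopen(path, "w");
	if (!f)
		return -errno;

	if (fprintf(f, "%d\n", boost) < 0)
		ret = -EIO;
	if (fclose(f) != 0 && ret == 0)
		ret = -EIO;

	return ret;
}
```

A caller might use something like `write_boost("/dev/stune/foreground/schedtune.boost", 10)`, where both the mount point and the group name are illustrative only.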
|
||||
static struct cftype files[] = {
|
||||
{
|
||||
.name = "boost",
|
||||
.read_s64 = boost_read,
|
||||
.write_s64 = boost_write,
|
||||
},
|
||||
{
|
||||
.name = "prefer_idle",
|
||||
.read_u64 = prefer_idle_read,
|
||||
.write_u64 = prefer_idle_write,
|
||||
},
|
||||
{ } /* terminate */
|
||||
};
|
||||
|
||||
static void
|
||||
schedtune_boostgroup_init(struct schedtune *st, int idx)
|
||||
{
|
||||
struct boost_groups *bg;
|
||||
int cpu;
|
||||
|
||||
/* Initialize per CPUs boost group support */
|
||||
for_each_possible_cpu(cpu) {
|
||||
bg = &per_cpu(cpu_boost_groups, cpu);
|
||||
bg->group[idx].boost = 0;
|
||||
bg->group[idx].valid = true;
|
||||
bg->group[idx].ts = 0;
|
||||
}
|
||||
|
||||
/* Keep track of allocated boost groups */
|
||||
allocated_group[idx] = st;
|
||||
st->idx = idx;
|
||||
}
|
||||
|
||||
static struct cgroup_subsys_state *
|
||||
schedtune_css_alloc(struct cgroup_subsys_state *parent_css)
|
||||
{
|
||||
struct schedtune *st;
|
||||
int idx;
|
||||
|
||||
if (!parent_css)
|
||||
return &root_schedtune.css;
|
||||
|
||||
/* Allow only single level hierarchies */
|
||||
if (parent_css != &root_schedtune.css) {
|
||||
pr_err("Nested SchedTune boosting groups not allowed\n");
|
||||
return ERR_PTR(-ENOMEM);
|
||||
}
|
||||
|
||||
/* Allow only a limited number of boosting groups */
|
||||
for (idx = 1; idx < BOOSTGROUPS_COUNT; ++idx)
|
||||
if (!allocated_group[idx])
|
||||
break;
|
||||
if (idx == BOOSTGROUPS_COUNT) {
|
||||
pr_err("Trying to create more than %d SchedTune boosting groups\n",
|
||||
BOOSTGROUPS_COUNT);
|
||||
return ERR_PTR(-ENOSPC);
|
||||
}
|
||||
|
||||
st = kzalloc(sizeof(*st), GFP_KERNEL);
|
||||
if (!st)
|
||||
goto out;
|
||||
|
||||
/* Initialize per CPUs boost group support */
|
||||
schedtune_boostgroup_init(st, idx);
|
||||
|
||||
return &st->css;
|
||||
|
||||
out:
|
||||
return ERR_PTR(-ENOMEM);
|
||||
}
|
||||
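Review note: the slot search in `schedtune_css_alloc()` is a first-fit scan that skips index 0, which is reserved for the root group. A stand-alone sketch of that scan, with a hypothetical `find_free_slot()` and a plain pointer table standing in for `allocated_group[]`:

```c
#include <errno.h>
#include <stddef.h>

/* Same limit as BOOSTGROUPS_COUNT in the patch. */
#define SLOTS 5

/*
 * Return the index of the first unused slot, starting from 1 because
 * slot 0 is reserved for the root group, or -ENOSPC when all are taken.
 */
static int find_free_slot(void *const table[], size_t n)
{
	size_t i;

	for (i = 1; i < n; i++)
		if (!table[i])
			return (int)i;

	return -ENOSPC;
}
```

Because the root group permanently occupies slot 0, at most `SLOTS - 1` child groups can exist at once in this model.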
|
||||
static void
|
||||
schedtune_boostgroup_release(struct schedtune *st)
|
||||
{
|
||||
struct boost_groups *bg;
|
||||
int cpu;
|
||||
|
||||
/* Reset per CPUs boost group support */
|
||||
for_each_possible_cpu(cpu) {
|
||||
bg = &per_cpu(cpu_boost_groups, cpu);
|
||||
bg->group[st->idx].valid = false;
|
||||
bg->group[st->idx].boost = 0;
|
||||
}
|
||||
|
||||
/* Keep track of allocated boost groups */
|
||||
allocated_group[st->idx] = NULL;
|
||||
}
|
||||
|
||||
static void
|
||||
schedtune_css_free(struct cgroup_subsys_state *css)
|
||||
{
|
||||
struct schedtune *st = css_st(css);
|
||||
|
||||
/* Release per CPUs boost group support */
|
||||
schedtune_boostgroup_release(st);
|
||||
kfree(st);
|
||||
}
|
||||
|
||||
struct cgroup_subsys schedtune_cgrp_subsys = {
|
||||
.css_alloc = schedtune_css_alloc,
|
||||
.css_free = schedtune_css_free,
|
||||
.can_attach = schedtune_can_attach,
|
||||
.cancel_attach = schedtune_cancel_attach,
|
||||
.legacy_cftypes = files,
|
||||
.early_init = 1,
|
||||
};
|
||||
|
||||
static inline void
|
||||
schedtune_init_cgroups(void)
|
||||
{
|
||||
struct boost_groups *bg;
|
||||
int cpu;
|
||||
|
||||
/* Initialize the per CPU boost groups */
|
||||
for_each_possible_cpu(cpu) {
|
||||
bg = &per_cpu(cpu_boost_groups, cpu);
|
||||
memset(bg, 0, sizeof(struct boost_groups));
|
||||
bg->group[0].valid = true;
|
||||
raw_spin_lock_init(&bg->lock);
|
||||
}
|
||||
|
||||
pr_info("schedtune: configured to support %d boost groups\n",
|
||||
BOOSTGROUPS_COUNT);
|
||||
|
||||
schedtune_initialized = true;
|
||||
}
|
||||
|
||||
/*
|
||||
* Initialize the cgroup structures
|
||||
*/
|
||||
static int
|
||||
schedtune_init(void)
|
||||
{
|
||||
schedtune_spc_rdiv = reciprocal_value(100);
|
||||
schedtune_init_cgroups();
|
||||
return 0;
|
||||
}
|
||||
postcore_initcall(schedtune_init);
|
||||
37
kernel/sched/tune.h
Normal file
|
|
@ -0,0 +1,37 @@
|
|||
|
||||
#ifdef CONFIG_SCHED_TUNE
|
||||
|
||||
#include <linux/reciprocal_div.h>
|
||||
|
||||
/*
|
||||
* System energy normalization constants
|
||||
*/
|
||||
struct target_nrg {
|
||||
unsigned long min_power;
|
||||
unsigned long max_power;
|
||||
struct reciprocal_value rdiv;
|
||||
};
|
||||
|
||||
int schedtune_cpu_boost(int cpu);
|
||||
int schedtune_task_boost(struct task_struct *tsk);
|
||||
|
||||
int schedtune_prefer_idle(struct task_struct *tsk);
|
||||
|
||||
void schedtune_enqueue_task(struct task_struct *p, int cpu);
|
||||
void schedtune_dequeue_task(struct task_struct *p, int cpu);
|
||||
|
||||
unsigned long boosted_cpu_util(int cpu, unsigned long other_util);
|
||||
|
||||
#else /* CONFIG_SCHED_TUNE */
|
||||
|
||||
#define schedtune_cpu_boost(cpu) 0
|
||||
#define schedtune_task_boost(tsk) 0
|
||||
|
||||
#define schedtune_prefer_idle(tsk) 0
|
||||
|
||||
#define schedtune_enqueue_task(task, cpu) do { } while (0)
|
||||
#define schedtune_dequeue_task(task, cpu) do { } while (0)
|
||||
|
||||
#define boosted_cpu_util(cpu, other_util) cpu_util_cfs(cpu_rq(cpu))
|
||||
|
||||
#endif /* CONFIG_SCHED_TUNE */
|
||||
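Review note: the `#else` branch of this header stubs the enqueue/dequeue hooks out with `do { } while (0)`. The sketch below illustrates why that form is used for statement-like macros: it expands to exactly one statement, so an unbraced `if`/`else` around the call still parses. `DEMO_FEATURE`, `demo_hook()` and `run()` are hypothetical names for this illustration.

```c
#ifdef DEMO_FEATURE
/* Feature compiled in: a real function does the work. */
static int demo_calls;

static void demo_hook(int cpu)
{
	(void)cpu;
	demo_calls++;
}
#else
/* Feature compiled out: the call site becomes a single empty statement. */
#define demo_hook(cpu) do { } while (0)
#endif

/* Returns 1 when the else branch ran, 0 otherwise. */
static int run(int enabled)
{
	int took_else = 0;

	if (enabled)
		demo_hook(0);	/* one statement in either configuration */
	else
		took_else = 1;

	return took_else;
}
```

Had the stub been defined as an empty macro or as `{ }`, the trailing semicolon after `demo_hook(0)` would either leave a bare `;` (which compilers may warn about) or detach the `else` from its `if`.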
|
|
@ -320,6 +320,13 @@ static struct ctl_table kern_table[] = {
|
|||
.proc_handler = proc_dointvec,
|
||||
},
|
||||
#ifdef CONFIG_SCHED_DEBUG
|
||||
{
|
||||
.procname = "sched_cstate_aware",
|
||||
.data = &sysctl_sched_cstate_aware,
|
||||
.maxlen = sizeof(unsigned int),
|
||||
.mode = 0644,
|
||||
.proc_handler = proc_dointvec,
|
||||
},
|
||||
{
|
||||
.procname = "sched_min_granularity_ns",
|
||||
.data = &sysctl_sched_min_granularity,
|
||||
|
|
@ -338,6 +345,13 @@ static struct ctl_table kern_table[] = {
|
|||
.extra1 = &min_sched_granularity_ns,
|
||||
.extra2 = &max_sched_granularity_ns,
|
||||
},
|
||||
{
|
||||
.procname = "sched_sync_hint_enable",
|
||||
.data = &sysctl_sched_sync_hint_enable,
|
||||
.maxlen = sizeof(unsigned int),
|
||||
.mode = 0644,
|
||||
.proc_handler = proc_dointvec,
|
||||
},
|
||||
{
|
||||
.procname = "sched_wakeup_granularity_ns",
|
||||
.data = &sysctl_sched_wakeup_granularity,
|
||||
|
|
@ -466,6 +480,17 @@ static struct ctl_table kern_table[] = {
|
|||
.extra1 = &one,
|
||||
},
|
||||
#endif
|
||||
#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
|
||||
{
|
||||
.procname = "sched_energy_aware",
|
||||
.data = &sysctl_sched_energy_aware,
|
||||
.maxlen = sizeof(unsigned int),
|
||||
.mode = 0644,
|
||||
.proc_handler = sched_energy_aware_handler,
|
||||
.extra1 = &zero,
|
||||
.extra2 = &one,
|
||||
},
|
||||
#endif
|
||||
#ifdef CONFIG_PROVE_LOCKING
|
||||
{
|
||||
.procname = "prove_locking",
|
||||
|
|
|
|||