Merge branch eas-dev into experimental/android-4.19

Bug: 118439987
Bug: 120440300
Change-Id: I46a509df5e3bcb5253717d083f90679e7a72d378
Signed-off-by: Alistair Strachan <astrachan@google.com>
This commit is contained in:
Alistair Strachan 2018-11-08 11:12:08 -08:00
commit 409c3ce064
55 changed files with 4478 additions and 364 deletions

View File

@ -0,0 +1,413 @@
Central, scheduler-driven, power-performance control
(EXPERIMENTAL)
Abstract
========
The topic of a single simple power-performance tunable, that is wholly
scheduler centric, and has well defined and predictable properties has come up
on several occasions in the past [1,2]. With techniques such as a scheduler
driven DVFS [3], we now have a good framework for implementing such a tunable.
This document describes the overall ideas behind its design and implementation.
Table of Contents
=================
1. Motivation
2. Introduction
3. Signal Boosting Strategy
4. OPP selection using boosted CPU utilization
5. Per task group boosting
6. Per-task wakeup-placement-strategy Selection
7. Question and Answers
- What about "auto" mode?
- What about boosting on a congested system?
- How CPUs are boosted when we have tasks with multiple boost values?
8. References
1. Motivation
=============
Sched-DVFS [3] was a new event-driven cpufreq governor which allows the
scheduler to select the optimal DVFS operating point (OPP) for running a task
allocated to a CPU. Later, the cpufreq maintainers introduced a similar
governor, schedutil. The introduction of schedutil also enables running
workloads at the most energy efficient OPPs.
However, sometimes it may be desired to intentionally boost the performance of
a workload even if that could imply a reasonable increase in energy
consumption. For example, in order to reduce the response time of a task, we
may want to run the task at a higher OPP than the one that is actually required
by it's CPU bandwidth demand.
This last requirement is especially important if we consider that one of the
main goals of the utilization-driven governor component is to replace all
currently available CPUFreq policies. Since sched-DVFS and schedutil are event
based, as opposed to the sampling driven governors we currently have, they are
already more responsive at selecting the optimal OPP to run tasks allocated to
a CPU. However, just tracking the actual task load demand may not be enough
from a performance standpoint. For example, it is not possible to get
behaviors similar to those provided by the "performance" and "interactive"
CPUFreq governors.
This document describes an implementation of a tunable, stacked on top of the
utilization-driven governors which extends their functionality to support task
performance boosting.
By "performance boosting" we mean the reduction of the time required to
complete a task activation, i.e. the time elapsed from a task wakeup to its
next deactivation (e.g. because it goes back to sleep or it terminates). For
example, if we consider a simple periodic task which executes the same workload
for 5[s] every 20[s] while running at a certain OPP, a boosted execution of
that task must complete each of its activations in less than 5[s].
A previous attempt [5] to introduce such a boosting feature has not been
successful mainly because of the complexity of the proposed solution. Previous
versions of the approach described in this document exposed a single simple
interface to user-space. This single tunable knob allowed the tuning of
system wide scheduler behaviours ranging from energy efficiency at one end
through to incremental performance boosting at the other end. This first
tunable affects all tasks. However, that is not useful for Android products
so in this version only a more advanced extension of the concept is provided
which uses CGroups to boost the performance of only selected tasks while using
the energy efficient default for all others.
The rest of this document introduces in more details the proposed solution
which has been named SchedTune.
2. Introduction
===============
SchedTune exposes a simple user-space interface provided through a new
CGroup controller 'stune' which provides two power-performance tunables
per group:
/<stune cgroup mount point>/schedtune.prefer_idle
/<stune cgroup mount point>/schedtune.boost
The CGroup implementation permits arbitrary user-space defined task
classification to tune the scheduler for different goals depending on the
specific nature of the task, e.g. background vs interactive vs low-priority.
More details are given in section 5.
2.1 Boosting
============
The boost value is expressed as an integer in the range [-100..0..100].
A value of 0 (default) configures the CFS scheduler for maximum energy
efficiency. This means that sched-DVFS runs the tasks at the minimum OPP
required to satisfy their workload demand.
A value of 100 configures scheduler for maximum performance, which translates
to the selection of the maximum OPP on that CPU.
A value of -100 configures scheduler for minimum performance, which translates
to the selection of the minimum OPP on that CPU.
The range between -100, 0 and 100 can be set to satisfy other scenarios suitably.
For example to satisfy interactive response or depending on other system events
(battery level etc).
The overall design of the SchedTune module is built on top of "Per-Entity Load
Tracking" (PELT) signals and sched-DVFS by introducing a bias on the Operating
Performance Point (OPP) selection.
Each time a task is allocated on a CPU, cpufreq is given the opportunity to tune
the operating frequency of that CPU to better match the workload demand. The
selection of the actual OPP being activated is influenced by the boost value
for the task CGroup.
This simple biasing approach leverages existing frameworks, which means minimal
modifications to the scheduler, and yet it allows to achieve a range of
different behaviours all from a single simple tunable knob.
In EAS schedulers, we use boosted task and CPU utilization for energy
calculation and energy-aware task placement.
2.2 prefer_idle
===============
This is a flag which indicates to the scheduler that userspace would like
the scheduler to focus on energy or to focus on performance.
A value of 0 (default) signals to the CFS scheduler that tasks in this group
can be placed according to the energy-aware wakeup strategy.
A value of 1 signals to the CFS scheduler that tasks in this group should be
placed to minimise wakeup latency.
The value is combined with the boost value - task placement will not be
boost aware however CPU OPP selection is still boost aware.
Android platforms typically use this flag for application tasks which the
user is currently interacting with.
3. Signal Boosting Strategy
===========================
The whole PELT machinery works based on the value of a few load tracking signals
which basically track the CPU bandwidth requirements for tasks and the capacity
of CPUs. The basic idea behind the SchedTune knob is to artificially inflate
some of these load tracking signals to make a task or RQ appears more demanding
that it actually is.
Which signals have to be inflated depends on the specific "consumer". However,
independently from the specific (signal, consumer) pair, it is important to
define a simple and possibly consistent strategy for the concept of boosting a
signal.
A boosting strategy defines how the "abstract" user-space defined
sched_cfs_boost value is translated into an internal "margin" value to be added
to a signal to get its inflated value:
margin := boosting_strategy(sched_cfs_boost, signal)
boosted_signal := signal + margin
Different boosting strategies were identified and analyzed before selecting the
one found to be most effective.
Signal Proportional Compensation (SPC)
--------------------------------------
In this boosting strategy the sched_cfs_boost value is used to compute a
margin which is proportional to the complement of the original signal.
When a signal has a maximum possible value, its complement is defined as
the delta from the actual value and its possible maximum.
Since the tunable implementation uses signals which have SCHED_LOAD_SCALE as
the maximum possible value, the margin becomes:
margin := sched_cfs_boost * (SCHED_LOAD_SCALE - signal)
Using this boosting strategy:
- a 100% sched_cfs_boost means that the signal is scaled to the maximum value
- each value in the range of sched_cfs_boost effectively inflates the signal in
question by a quantity which is proportional to the maximum value.
For example, by applying the SPC boosting strategy to the selection of the OPP
to run a task it is possible to achieve these behaviors:
- 0% boosting: run the task at the minimum OPP required by its workload
- 100% boosting: run the task at the maximum OPP available for the CPU
- 50% boosting: run at the half-way OPP between minimum and maximum
Which means that, at 50% boosting, a task will be scheduled to run at half of
the maximum theoretically achievable performance on the specific target
platform.
A graphical representation of an SPC boosted signal is represented in the
following figure where:
a) "-" represents the original signal
b) "b" represents a 50% boosted signal
c) "p" represents a 100% boosted signal
^
| SCHED_LOAD_SCALE
+-----------------------------------------------------------------+
|pppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
|
| boosted_signal
| bbbbbbbbbbbbbbbbbbbbbbbb
|
| original signal
| bbbbbbbbbbbbbbbbbbbbbbbb+----------------------+
| |
|bbbbbbbbbbbbbbbbbb |
| |
| |
| |
| +-----------------------+
| |
| |
| |
|------------------+
|
|
+----------------------------------------------------------------------->
The plot above shows a ramped load signal (titled 'original_signal') and it's
boosted equivalent. For each step of the original signal the boosted signal
corresponding to a 50% boost is midway from the original signal and the upper
bound. Boosting by 100% generates a boosted signal which is always saturated to
the upper bound.
4. OPP selection using boosted CPU utilization
==============================================
It is worth calling out that the implementation does not introduce any new load
signals. Instead, it provides an API to tune existing signals. This tuning is
done on demand and only in scheduler code paths where it is sensible to do so.
The new API calls are defined to return either the default signal or a boosted
one, depending on the value of sched_cfs_boost. This is a clean an non invasive
modification of the existing existing code paths.
The signal representing a CPU's utilization is boosted according to the
previously described SPC boosting strategy. To sched-DVFS, this allows a CPU
(ie CFS run-queue) to appear more used then it actually is.
Thus, with the sched_cfs_boost enabled we have the following main functions to
get the current utilization of a CPU:
cpu_util()
boosted_cpu_util()
The new boosted_cpu_util() is similar to the first but returns a boosted
utilization signal which is a function of the sched_cfs_boost value.
This function is used in the CFS scheduler code paths where sched-DVFS needs to
decide the OPP to run a CPU at.
For example, this allows selecting the highest OPP for a CPU which has
the boost value set to 100%.
5. Per task group boosting
==========================
On battery powered devices there usually are many background services which are
long running and need energy efficient scheduling. On the other hand, some
applications are more performance sensitive and require an interactive
response and/or maximum performance, regardless of the energy cost.
To better service such scenarios, the SchedTune implementation has an extension
that provides a more fine grained boosting interface.
A new CGroup controller, namely "schedtune", can be enabled which allows to
defined and configure task groups with different boosting values.
Tasks that require special performance can be put into separate CGroups.
The value of the boost associated with the tasks in this group can be specified
using a single knob exposed by the CGroup controller:
schedtune.boost
This knob allows the definition of a boost value that is to be used for
SPC boosting of all tasks attached to this group.
The current schedtune controller implementation is really simple and has these
main characteristics:
1) It is only possible to create 1 level depth hierarchies
The root control groups define the system-wide boost value to be applied
by default to all tasks. Its direct subgroups are named "boost groups" and
they define the boost value for specific set of tasks.
Further nested subgroups are not allowed since they do not have a sensible
meaning from a user-space standpoint.
2) It is possible to define only a limited number of "boost groups"
This number is defined at compile time and by default configured to 16.
This is a design decision motivated by two main reasons:
a) In a real system we do not expect utilization scenarios with more then few
boost groups. For example, a reasonable collection of groups could be
just "background", "interactive" and "performance".
b) It simplifies the implementation considerably, especially for the code
which has to compute the per CPU boosting once there are multiple
RUNNABLE tasks with different boost values.
Such a simple design should allow servicing the main utilization scenarios identified
so far. It provides a simple interface which can be used to manage the
power-performance of all tasks or only selected tasks.
Moreover, this interface can be easily integrated by user-space run-times (e.g.
Android, ChromeOS) to implement a QoS solution for task boosting based on tasks
classification, which has been a long standing requirement.
Setup and usage
---------------
0. Use a kernel with CONFIG_SCHED_TUNE support enabled
1. Check that the "schedtune" CGroup controller is available:
root@linaro-nano:~# cat /proc/cgroups
#subsys_name hierarchy num_cgroups enabled
cpuset 0 1 1
cpu 0 1 1
schedtune 0 1 1
2. Mount a tmpfs to create the CGroups mount point (Optional)
root@linaro-nano:~# sudo mount -t tmpfs cgroups /sys/fs/cgroup
3. Mount the "schedtune" controller
root@linaro-nano:~# mkdir /sys/fs/cgroup/stune
root@linaro-nano:~# sudo mount -t cgroup -o schedtune stune /sys/fs/cgroup/stune
4. Create task groups and configure their specific boost value (Optional)
For example here we create a "performance" boost group configure to boost
all its tasks to 100%
root@linaro-nano:~# mkdir /sys/fs/cgroup/stune/performance
root@linaro-nano:~# echo 100 > /sys/fs/cgroup/stune/performance/schedtune.boost
5. Move tasks into the boost group
For example, the following moves the tasks with PID $TASKPID (and all its
threads) into the "performance" boost group.
root@linaro-nano:~# echo "TASKPID > /sys/fs/cgroup/stune/performance/cgroup.procs
This simple configuration allows only the threads of the $TASKPID task to run,
when needed, at the highest OPP in the most capable CPU of the system.
6. Per-task wakeup-placement-strategy Selection
===============================================
Many devices have a number of CFS tasks in use which require an absolute
minimum wakeup latency, and many tasks for which wakeup latency is not
important.
For touch-driven environments, removing additional wakeup latency can be
critical.
When you use the Schedtume CGroup controller, you have access to a second
parameter which allows a group to be marked such that energy_aware task
placement is bypassed for tasks belonging to that group.
prefer_idle=0 (default - use energy-aware task placement if available)
prefer_idle=1 (never use energy-aware task placement for these tasks)
Since the regular wakeup task placement algorithm in CFS is biased for
performance, this has the effect of restoring minimum wakeup latency
for the desired tasks whilst still allowing energy-aware wakeup placement
to save energy for other tasks.
7. Question and Answers
=======================
What about "auto" mode?
-----------------------
The 'auto' mode as described in [5] can be implemented by interfacing SchedTune
with some suitable user-space element. This element could use the exposed
system-wide or cgroup based interface.
How are multiple groups of tasks with different boost values managed?
---------------------------------------------------------------------
The current SchedTune implementation keeps track of the boosted RUNNABLE tasks
on a CPU. The CPU utilization seen by the scheduler-driven cpufreq governors
(and used to select an appropriate OPP) is boosted with a value which is the
maximum of the boost values of the currently RUNNABLE tasks in its RQ.
This allows cpufreq to boost a CPU only while there are boosted tasks ready
to run and switch back to the energy efficient mode as soon as the last boosted
task is dequeued.
8. References
=============
[1] http://lwn.net/Articles/552889
[2] http://lkml.org/lkml/2012/5/18/91
[3] http://lkml.org/lkml/2015/6/26/620

View File

@ -42,6 +42,7 @@ cpu0: cpu@0 {
cci-control-port = <&cci_control1>;
cpu-idle-states = <&CLUSTER_SLEEP_BIG>;
capacity-dmips-mhz = <1024>;
dynamic-power-coefficient = <990>;
};
cpu1: cpu@1 {
@ -51,6 +52,7 @@ cpu1: cpu@1 {
cci-control-port = <&cci_control1>;
cpu-idle-states = <&CLUSTER_SLEEP_BIG>;
capacity-dmips-mhz = <1024>;
dynamic-power-coefficient = <990>;
};
cpu2: cpu@2 {
@ -60,6 +62,7 @@ cpu2: cpu@2 {
cci-control-port = <&cci_control2>;
cpu-idle-states = <&CLUSTER_SLEEP_LITTLE>;
capacity-dmips-mhz = <516>;
dynamic-power-coefficient = <133>;
};
cpu3: cpu@3 {
@ -69,6 +72,7 @@ cpu3: cpu@3 {
cci-control-port = <&cci_control2>;
cpu-idle-states = <&CLUSTER_SLEEP_LITTLE>;
capacity-dmips-mhz = <516>;
dynamic-power-coefficient = <133>;
};
cpu4: cpu@4 {
@ -78,6 +82,7 @@ cpu4: cpu@4 {
cci-control-port = <&cci_control2>;
cpu-idle-states = <&CLUSTER_SLEEP_LITTLE>;
capacity-dmips-mhz = <516>;
dynamic-power-coefficient = <133>;
};
idle-states {

View File

@ -2,6 +2,12 @@ CONFIG_SYSVIPC=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_CGROUPS=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_SCHED_AUTOGROUP=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_EMBEDDED=y
CONFIG_PERF_EVENTS=y
@ -116,6 +122,7 @@ CONFIG_PCI_ENDPOINT=y
CONFIG_PCI_ENDPOINT_CONFIGFS=y
CONFIG_PCI_EPF_TEST=m
CONFIG_SMP=y
CONFIG_SCHED_MC=y
CONFIG_NR_CPUS=16
CONFIG_SECCOMP=y
CONFIG_ARM_APPENDED_DTB=y
@ -124,10 +131,10 @@ CONFIG_KEXEC=y
CONFIG_EFI=y
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_STAT=y
CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=m
CONFIG_CPU_FREQ_GOV_USERSPACE=m
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m
CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y
CONFIG_CPU_FREQ_GOV_SCHEDUTIL=y
CONFIG_CPUFREQ_DT=y
CONFIG_ARM_IMX6Q_CPUFREQ=y
@ -137,6 +144,7 @@ CONFIG_ARM_CPUIDLE=y
CONFIG_ARM_ZYNQ_CPUIDLE=y
CONFIG_ARM_EXYNOS_CPUIDLE=y
CONFIG_KERNEL_MODE_NEON=y
CONFIG_ENERGY_MODEL=y
CONFIG_NET=y
CONFIG_PACKET=y
CONFIG_UNIX=y

View File

@ -30,9 +30,15 @@ const struct cpumask *cpu_coregroup_mask(int cpu);
/* Replace task scheduler's default frequency-invariant accounting */
#define arch_scale_freq_capacity topology_get_freq_scale
/* Replace task scheduler's default max-frequency-invariant accounting */
#define arch_scale_max_freq_capacity topology_get_max_freq_scale
/* Replace task scheduler's default cpu-invariant accounting */
#define arch_scale_cpu_capacity topology_get_cpu_scale
/* Enable topology flag updates */
#define arch_update_cpu_topology topology_update_cpu_topology
#else
static inline void init_cpu_topology(void) { }

View File

@ -99,6 +99,7 @@ A72_0: cpu@0 {
clocks = <&scpi_dvfs 0>;
cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
capacity-dmips-mhz = <1024>;
dynamic-power-coefficient = <450>;
};
A72_1: cpu@1 {
@ -116,6 +117,7 @@ A72_1: cpu@1 {
clocks = <&scpi_dvfs 0>;
cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
capacity-dmips-mhz = <1024>;
dynamic-power-coefficient = <450>;
};
A53_0: cpu@100 {
@ -133,6 +135,7 @@ A53_0: cpu@100 {
clocks = <&scpi_dvfs 1>;
cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
capacity-dmips-mhz = <485>;
dynamic-power-coefficient = <140>;
};
A53_1: cpu@101 {
@ -150,6 +153,7 @@ A53_1: cpu@101 {
clocks = <&scpi_dvfs 1>;
cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
capacity-dmips-mhz = <485>;
dynamic-power-coefficient = <140>;
};
A53_2: cpu@102 {
@ -167,6 +171,7 @@ A53_2: cpu@102 {
clocks = <&scpi_dvfs 1>;
cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
capacity-dmips-mhz = <485>;
dynamic-power-coefficient = <140>;
};
A53_3: cpu@103 {
@ -184,6 +189,7 @@ A53_3: cpu@103 {
clocks = <&scpi_dvfs 1>;
cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
capacity-dmips-mhz = <485>;
dynamic-power-coefficient = <140>;
};
A72_L2: l2-cache0 {

View File

@ -98,6 +98,7 @@ A57_0: cpu@0 {
clocks = <&scpi_dvfs 0>;
cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
capacity-dmips-mhz = <1024>;
dynamic-power-coefficient = <530>;
};
A57_1: cpu@1 {
@ -115,6 +116,7 @@ A57_1: cpu@1 {
clocks = <&scpi_dvfs 0>;
cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
capacity-dmips-mhz = <1024>;
dynamic-power-coefficient = <530>;
};
A53_0: cpu@100 {
@ -132,6 +134,7 @@ A53_0: cpu@100 {
clocks = <&scpi_dvfs 1>;
cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
capacity-dmips-mhz = <578>;
dynamic-power-coefficient = <140>;
};
A53_1: cpu@101 {
@ -149,6 +152,7 @@ A53_1: cpu@101 {
clocks = <&scpi_dvfs 1>;
cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
capacity-dmips-mhz = <578>;
dynamic-power-coefficient = <140>;
};
A53_2: cpu@102 {
@ -166,6 +170,7 @@ A53_2: cpu@102 {
clocks = <&scpi_dvfs 1>;
cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
capacity-dmips-mhz = <578>;
dynamic-power-coefficient = <140>;
};
A53_3: cpu@103 {
@ -183,6 +188,7 @@ A53_3: cpu@103 {
clocks = <&scpi_dvfs 1>;
cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
capacity-dmips-mhz = <578>;
dynamic-power-coefficient = <140>;
};
A57_L2: l2-cache0 {

View File

@ -19,9 +19,13 @@ CONFIG_BLK_CGROUP=y
CONFIG_CGROUP_PIDS=y
CONFIG_CGROUP_HUGETLB=y
CONFIG_CPUSETS=y
CONFIG_CGROUPS=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_CGROUP_SCHED=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_PERF=y
CONFIG_CGROUP_FREEZER=y
CONFIG_USER_NS=y
CONFIG_SCHED_AUTOGROUP=y
CONFIG_BLK_DEV_INITRD=y
@ -101,13 +105,16 @@ CONFIG_XEN=y
CONFIG_COMPAT=y
CONFIG_HIBERNATION=y
CONFIG_WQ_POWER_EFFICIENT_DEFAULT=y
CONFIG_ENERGY_MODEL=y
CONFIG_SCHED_TUNE=y
CONFIG_ARM_CPUIDLE=y
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_STAT=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=m
CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y
CONFIG_CPU_FREQ_GOV_SCHEDUTIL=y
CONFIG_CPUFREQ_DT=y
CONFIG_ACPI_CPPC_CPUFREQ=m

View File

@ -42,9 +42,15 @@ int pcibus_to_node(struct pci_bus *bus);
/* Replace task scheduler's default frequency-invariant accounting */
#define arch_scale_freq_capacity topology_get_freq_scale
/* Replace task scheduler's default max-frequency-invariant accounting */
#define arch_scale_max_freq_capacity topology_get_max_freq_scale
/* Replace task scheduler's default cpu-invariant accounting */
#define arch_scale_cpu_capacity topology_get_cpu_scale
/* Enable topology flag updates */
#define arch_update_cpu_topology topology_update_cpu_topology
#include <asm-generic/topology.h>
#endif /* _ASM_ARM_TOPOLOGY_H */

View File

@ -219,4 +219,5 @@ source "drivers/siox/Kconfig"
source "drivers/slimbus/Kconfig"
source "drivers/energy_model/Kconfig"
endmenu

View File

@ -157,6 +157,8 @@ obj-$(CONFIG_REMOTEPROC) += remoteproc/
obj-$(CONFIG_RPMSG) += rpmsg/
obj-$(CONFIG_SOUNDWIRE) += soundwire/
obj-$(CONFIG_ENERGY_MODEL) += energy_model/
# Virtualization drivers
obj-$(CONFIG_VIRT_DRIVERS) += virt/
obj-$(CONFIG_HYPERV) += hv/

View File

@ -15,8 +15,11 @@
#include <linux/slab.h>
#include <linux/string.h>
#include <linux/sched/topology.h>
#include <linux/cpuset.h>
DEFINE_PER_CPU(unsigned long, freq_scale) = SCHED_CAPACITY_SCALE;
DEFINE_PER_CPU(unsigned long, max_cpu_freq);
DEFINE_PER_CPU(unsigned long, max_freq_scale) = SCHED_CAPACITY_SCALE;
void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
unsigned long max_freq)
@ -26,8 +29,29 @@ void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
scale = (cur_freq << SCHED_CAPACITY_SHIFT) / max_freq;
for_each_cpu(i, cpus)
for_each_cpu(i, cpus) {
per_cpu(freq_scale, i) = scale;
per_cpu(max_cpu_freq, i) = max_freq;
}
}
void arch_set_max_freq_scale(struct cpumask *cpus,
unsigned long policy_max_freq)
{
unsigned long scale, max_freq;
int cpu = cpumask_first(cpus);
if (cpu > nr_cpu_ids)
return;
max_freq = per_cpu(max_cpu_freq, cpu);
if (!max_freq)
return;
scale = (policy_max_freq << SCHED_CAPACITY_SHIFT) / max_freq;
for_each_cpu(cpu, cpus)
per_cpu(max_freq_scale, cpu) = scale;
}
static DEFINE_MUTEX(cpu_scale_mutex);
@ -47,6 +71,9 @@ static ssize_t cpu_capacity_show(struct device *dev,
return sprintf(buf, "%lu\n", topology_get_cpu_scale(NULL, cpu->dev.id));
}
static void update_topology_flags_workfn(struct work_struct *work);
static DECLARE_WORK(update_topology_flags_work, update_topology_flags_workfn);
static ssize_t cpu_capacity_store(struct device *dev,
struct device_attribute *attr,
const char *buf,
@ -72,6 +99,8 @@ static ssize_t cpu_capacity_store(struct device *dev,
topology_set_cpu_scale(i, new_capacity);
mutex_unlock(&cpu_scale_mutex);
schedule_work(&update_topology_flags_work);
return count;
}
@ -96,6 +125,25 @@ static int register_cpu_capacity_sysctl(void)
}
subsys_initcall(register_cpu_capacity_sysctl);
static int update_topology;
int topology_update_cpu_topology(void)
{
return update_topology;
}
/*
* Updating the sched_domains can't be done directly from cpufreq callbacks
* due to locking, so queue the work for later.
*/
static void update_topology_flags_workfn(struct work_struct *work)
{
update_topology = 1;
rebuild_sched_domains();
pr_debug("sched_domain hierarchy rebuilt, flags updated\n");
update_topology = 0;
}
static u32 capacity_scale;
static u32 *raw_capacity;
@ -201,6 +249,7 @@ init_cpu_capacity_callback(struct notifier_block *nb,
if (cpumask_empty(cpus_to_visit)) {
topology_normalize_cpu_scale();
schedule_work(&update_topology_flags_work);
free_raw_capacity();
pr_debug("cpu_capacity: parsing done\n");
schedule_work(&parsing_done_work);

View File

@ -24,6 +24,7 @@
#include <linux/cpufreq.h>
#include <linux/cpumask.h>
#include <linux/cpu_cooling.h>
#include <linux/energy_model.h>
#include <linux/export.h>
#include <linux/module.h>
#include <linux/mutex.h>
@ -456,6 +457,7 @@ static int get_cluster_clk_and_freq_table(struct device *cpu_dev,
/* Per-CPU initialization */
static int bL_cpufreq_init(struct cpufreq_policy *policy)
{
struct em_data_callback em_cb = EM_DATA_CB(of_dev_pm_opp_get_cpu_power);
u32 cur_cluster = cpu_to_cluster(policy->cpu);
struct device *cpu_dev;
int ret;
@ -487,6 +489,14 @@ static int bL_cpufreq_init(struct cpufreq_policy *policy)
policy->cpuinfo.transition_latency =
arm_bL_ops->get_transition_latency(cpu_dev);
ret = dev_pm_opp_get_opp_count(cpu_dev);
if (ret <= 0) {
dev_dbg(cpu_dev, "OPP table is not ready, deferring probe\n");
return -EPROBE_DEFER;
}
em_register_perf_domain(policy->cpus, ret, &em_cb);
if (is_bL_switching_enabled())
per_cpu(cpu_last_req_freq, policy->cpu) = clk_get_cpu_rate(policy->cpu);

View File

@ -16,6 +16,7 @@
#include <linux/cpu_cooling.h>
#include <linux/cpufreq.h>
#include <linux/cpumask.h>
#include <linux/energy_model.h>
#include <linux/err.h>
#include <linux/module.h>
#include <linux/of.h>
@ -152,6 +153,7 @@ static int resources_available(void)
static int cpufreq_init(struct cpufreq_policy *policy)
{
struct em_data_callback em_cb = EM_DATA_CB(of_dev_pm_opp_get_cpu_power);
struct cpufreq_frequency_table *freq_table;
struct opp_table *opp_table = NULL;
struct private_data *priv;
@ -160,7 +162,7 @@ static int cpufreq_init(struct cpufreq_policy *policy)
unsigned int transition_latency;
bool fallback = false;
const char *name;
int ret;
int ret, nr_opp;
cpu_dev = get_cpu_device(policy->cpu);
if (!cpu_dev) {
@ -237,6 +239,7 @@ static int cpufreq_init(struct cpufreq_policy *policy)
ret = -EPROBE_DEFER;
goto out_free_opp;
}
nr_opp = ret;
if (fallback) {
cpumask_setall(policy->cpus);
@ -280,6 +283,8 @@ static int cpufreq_init(struct cpufreq_policy *policy)
policy->cpuinfo.transition_latency = transition_latency;
policy->dvfs_possible_from_any_cpu = true;
em_register_perf_domain(policy->cpus, nr_opp, &em_cb);
return 0;
out_free_cpufreq_table:

View File

@ -25,6 +25,7 @@
#include <linux/kernel_stat.h>
#include <linux/module.h>
#include <linux/mutex.h>
#include <linux/sched/cpufreq.h>
#include <linux/slab.h>
#include <linux/suspend.h>
#include <linux/syscore_ops.h>
@ -158,6 +159,12 @@ __weak void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
}
EXPORT_SYMBOL_GPL(arch_set_freq_scale);
__weak void arch_set_max_freq_scale(struct cpumask *cpus,
unsigned long policy_max_freq)
{
}
EXPORT_SYMBOL_GPL(arch_set_max_freq_scale);
/*
* This is a generic cpufreq init() routine which can be used by cpufreq
* drivers of SMP systems. It will do following:
@ -2243,6 +2250,8 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
policy->max = new_policy->max;
trace_cpu_frequency_limits(policy);
arch_set_max_freq_scale(policy->cpus, policy->max);
policy->cached_target_freq = UINT_MAX;
pr_debug("new min and max freqs are %u - %u kHz\n",
@ -2277,6 +2286,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
ret = cpufreq_start_governor(policy);
if (!ret) {
pr_debug("cpufreq: governor change\n");
sched_cpufreq_governor_change(policy, old_gov);
return 0;
}
cpufreq_exit_governor(policy);

View File

@ -12,6 +12,7 @@
#include <linux/cpufreq.h>
#include <linux/cpumask.h>
#include <linux/cpu_cooling.h>
#include <linux/energy_model.h>
#include <linux/export.h>
#include <linux/module.h>
#include <linux/pm_opp.h>
@ -103,13 +104,42 @@ scmi_get_sharing_cpus(struct device *cpu_dev, struct cpumask *cpumask)
return 0;
}
static int __maybe_unused
scmi_get_cpu_power(unsigned long *power, unsigned long *KHz, int cpu)
{
struct device *cpu_dev = get_cpu_device(cpu);
unsigned long Hz;
int ret, domain;
if (!cpu_dev) {
pr_err("failed to get cpu%d device\n", cpu);
return -ENODEV;
}
domain = handle->perf_ops->device_domain_id(cpu_dev);
if (domain < 0)
return domain;
/* Get the power cost of the performance domain. */
Hz = *KHz * 1000;
ret = handle->perf_ops->est_power_get(handle, domain, &Hz, power);
if (ret)
return ret;
/* The EM framework specifies the frequency in KHz. */
*KHz = Hz / 1000;
return 0;
}
static int scmi_cpufreq_init(struct cpufreq_policy *policy)
{
int ret;
int ret, nr_opp;
unsigned int latency;
struct device *cpu_dev;
struct scmi_data *priv;
struct cpufreq_frequency_table *freq_table;
struct em_data_callback em_cb = EM_DATA_CB(scmi_get_cpu_power);
cpu_dev = get_cpu_device(policy->cpu);
if (!cpu_dev) {
@ -142,6 +172,7 @@ static int scmi_cpufreq_init(struct cpufreq_policy *policy)
ret = -EPROBE_DEFER;
goto out_free_opp;
}
nr_opp = ret;
priv = kzalloc(sizeof(*priv), GFP_KERNEL);
if (!priv) {
@ -171,6 +202,9 @@ static int scmi_cpufreq_init(struct cpufreq_policy *policy)
policy->cpuinfo.transition_latency = latency;
policy->fast_switch_possible = true;
em_register_perf_domain(policy->cpus, nr_opp, &em_cb);
return 0;
out_free_priv:

View File

@ -23,6 +23,7 @@
#include <linux/cpufreq.h>
#include <linux/cpumask.h>
#include <linux/cpu_cooling.h>
#include <linux/energy_model.h>
#include <linux/export.h>
#include <linux/module.h>
#include <linux/of_platform.h>
@ -98,11 +99,12 @@ scpi_get_sharing_cpus(struct device *cpu_dev, struct cpumask *cpumask)
static int scpi_cpufreq_init(struct cpufreq_policy *policy)
{
int ret;
int ret, nr_opp;
unsigned int latency;
struct device *cpu_dev;
struct scpi_data *priv;
struct cpufreq_frequency_table *freq_table;
struct em_data_callback em_cb = EM_DATA_CB(of_dev_pm_opp_get_cpu_power);
cpu_dev = get_cpu_device(policy->cpu);
if (!cpu_dev) {
@ -135,6 +137,7 @@ static int scpi_cpufreq_init(struct cpufreq_policy *policy)
ret = -EPROBE_DEFER;
goto out_free_opp;
}
nr_opp = ret;
priv = kzalloc(sizeof(*priv), GFP_KERNEL);
if (!priv) {
@ -170,6 +173,9 @@ static int scpi_cpufreq_init(struct cpufreq_policy *policy)
policy->cpuinfo.transition_latency = latency;
policy->fast_switch_possible = false;
em_register_perf_domain(policy->cpus, nr_opp, &em_cb);
return 0;
out_free_cpufreq_table:

View File

@ -221,7 +221,7 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
}
/* Take note of the planned idle state. */
sched_idle_set_state(target_state);
sched_idle_set_state(target_state, index);
trace_cpu_idle_rcuidle(index, dev->cpu);
time_start = ns_to_ktime(local_clock());
@ -235,7 +235,7 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
/* The cpu is no longer idle or about to enter idle. */
sched_idle_set_state(NULL);
sched_idle_set_state(NULL, -1);
if (broadcast) {
if (WARN_ON_ONCE(!irqs_disabled()))

View File

@ -0,0 +1,16 @@
config LEGACY_ENERGY_MODEL_DT
bool "Legacy DT-based Energy Model of CPUs"
default n
help
The Energy Aware Scheduler (EAS) used to rely on Energy Models
(EMs) statically defined in the Device Tree. More recent
versions of EAS now rely on the EM framework to get the power
costs of CPUs.
This driver reads old-style static EMs in DT and feeds them in
the EM framework, hence enabling to use EAS on platforms with
old DT files. Since EAS now uses only the active costs of CPUs,
the cluster-related costs and idle-costs of the old EM are
ignored.
If in doubt, say N.

View File

@ -0,0 +1,3 @@
# SPDX-License-Identifier: GPL-2.0
obj-$(CONFIG_LEGACY_ENERGY_MODEL_DT) += legacy_em_dt.o

View File

@ -0,0 +1,193 @@
// SPDX-License-Identifier: GPL-2.0
/*
* Legacy Energy Model loading driver
*
* Copyright (C) 2018, ARM Ltd.
* Written by: Quentin Perret, ARM Ltd.
*/
#define pr_fmt(fmt) "legacy-dt-em: " fmt
#include <linux/cpufreq.h>
#include <linux/cpumask.h>
#include <linux/cpuset.h>
#include <linux/energy_model.h>
#include <linux/gfp.h>
#include <linux/init.h>
#include <linux/of.h>
#include <linux/printk.h>
#include <linux/slab.h>
static cpumask_var_t cpus_to_visit;
static DEFINE_PER_CPU(unsigned long, nr_states) = 0;
struct em_state {
unsigned long frequency;
unsigned long power;
unsigned long capacity;
};
static DEFINE_PER_CPU(struct em_state*, cpu_em) = NULL;
static void finish_em_loading_workfn(struct work_struct *work);
static DECLARE_WORK(finish_em_loading_work, finish_em_loading_workfn);
static DEFINE_MUTEX(em_loading_mutex);
/*
* Callback given to the EM framework. All this does is browse the table
* created by legacy_em_dt().
*/
static int get_power(unsigned long *mW, unsigned long *KHz, int cpu)
{
unsigned long nstates = per_cpu(nr_states, cpu);
struct em_state *em = per_cpu(cpu_em, cpu);
int i;
if (!nstates || !em)
return -ENODEV;
for (i = 0; i < nstates - 1; i++) {
if (em[i].frequency > *KHz)
break;
}
*KHz = em[i].frequency;
*mW = em[i].power;
return 0;
}
static int init_em_dt_callback(struct notifier_block *nb, unsigned long val,
void *data)
{
struct em_data_callback em_cb = EM_DATA_CB(get_power);
unsigned long nstates, scale_cpu, max_freq;
struct cpufreq_policy *policy = data;
const struct property *prop;
struct device_node *cn, *cp;
struct em_state *em;
int cpu, i, ret = 0;
const __be32 *tmp;
if (val != CPUFREQ_NOTIFY)
return 0;
mutex_lock(&em_loading_mutex);
/* Do not register twice an energy model */
for_each_cpu(cpu, policy->cpus) {
if (per_cpu(nr_states, cpu) || per_cpu(cpu_em, cpu)) {
pr_err("EM of CPU%d already loaded\n", cpu);
ret = -EEXIST;
goto unlock;
}
}
max_freq = policy->cpuinfo.max_freq;
if (!max_freq) {
pr_err("No policy->max for CPU%d\n", cpu);
ret = -EINVAL;
goto unlock;
}
cpu = cpumask_first(policy->cpus);
cn = of_get_cpu_node(cpu, NULL);
if (!cn) {
pr_err("No device_node for CPU%d\n", cpu);
ret = -ENODEV;
goto unlock;
}
cp = of_parse_phandle(cn, "sched-energy-costs", 0);
if (!cp) {
pr_err("CPU%d node has no sched-energy-costs\n", cpu);
ret = -ENODEV;
goto unlock;
}
prop = of_find_property(cp, "busy-cost-data", NULL);
if (!prop || !prop->value) {
pr_err("No busy-cost-data for CPU%d\n", cpu);
ret = -ENODEV;
goto unlock;
}
nstates = (prop->length / sizeof(u32)) / 2;
em = kcalloc(nstates, sizeof(struct em_cap_state), GFP_KERNEL);
if (!em) {
ret = -ENOMEM;
goto unlock;
}
/* Copy the capacity and power cost to the table. */
for (i = 0, tmp = prop->value; i < nstates; i++) {
em[i].capacity = be32_to_cpup(tmp++);
em[i].power = be32_to_cpup(tmp++);
}
/* Get the CPU capacity (according to the EM) */
scale_cpu = em[nstates - 1].capacity;
if (!scale_cpu) {
pr_err("CPU%d: capacity cannot be 0\n", cpu);
kfree(em);
ret = -EINVAL;
goto unlock;
}
/* Re-compute the intermediate frequencies based on the EM. */
for (i = 0; i < nstates; i++)
em[i].frequency = em[i].capacity * max_freq / scale_cpu;
/* Assign the table to all CPUs of this policy. */
for_each_cpu(i, policy->cpus) {
per_cpu(nr_states, i) = nstates;
per_cpu(cpu_em, i) = em;
}
pr_info("Registering EM of %*pbl\n", cpumask_pr_args(policy->cpus));
em_register_perf_domain(policy->cpus, nstates, &em_cb);
/* Finish the work when all possible CPUs have been registered. */
cpumask_andnot(cpus_to_visit, cpus_to_visit, policy->cpus);
if (cpumask_empty(cpus_to_visit))
schedule_work(&finish_em_loading_work);
unlock:
mutex_unlock(&em_loading_mutex);
return ret;
}
static struct notifier_block init_em_dt_notifier = {
.notifier_call = init_em_dt_callback,
};
static void finish_em_loading_workfn(struct work_struct *work)
{
cpufreq_unregister_notifier(&init_em_dt_notifier,
CPUFREQ_POLICY_NOTIFIER);
free_cpumask_var(cpus_to_visit);
/* Let the scheduler know the Energy Model is ready. */
rebuild_sched_domains();
}
static int __init register_cpufreq_notifier(void)
{
int ret;
if (!alloc_cpumask_var(&cpus_to_visit, GFP_KERNEL))
return -ENOMEM;
cpumask_copy(cpus_to_visit, cpu_possible_mask);
ret = cpufreq_register_notifier(&init_em_dt_notifier,
CPUFREQ_POLICY_NOTIFIER);
if (ret)
free_cpumask_var(cpus_to_visit);
return ret;
}
core_initcall(register_cpufreq_notifier);

View File

@ -427,6 +427,33 @@ static int scmi_dvfs_freq_get(const struct scmi_handle *handle, u32 domain,
return ret;
}
static int scmi_dvfs_est_power_get(const struct scmi_handle *handle, u32 domain,
unsigned long *freq, unsigned long *power)
{
struct scmi_perf_info *pi = handle->perf_priv;
struct perf_dom_info *dom;
unsigned long opp_freq;
int idx, ret = -EINVAL;
struct scmi_opp *opp;
dom = pi->dom_info + domain;
if (!dom)
return -EIO;
for (opp = dom->opp, idx = 0; idx < dom->opp_count; idx++, opp++) {
opp_freq = opp->perf * dom->mult_factor;
if (opp_freq < *freq)
continue;
*freq = opp_freq;
*power = opp->power;
ret = 0;
break;
}
return ret;
}
static struct scmi_perf_ops perf_ops = {
.limits_set = scmi_perf_limits_set,
.limits_get = scmi_perf_limits_get,
@ -437,6 +464,7 @@ static struct scmi_perf_ops perf_ops = {
.device_opps_add = scmi_dvfs_device_opps_add,
.freq_set = scmi_dvfs_freq_set,
.freq_get = scmi_dvfs_freq_get,
.est_power_get = scmi_dvfs_est_power_get,
};
static int scmi_perf_protocol_init(struct scmi_handle *handle)

View File

@ -778,3 +778,44 @@ struct device_node *dev_pm_opp_get_of_node(struct dev_pm_opp *opp)
return of_node_get(opp->np);
}
EXPORT_SYMBOL_GPL(dev_pm_opp_get_of_node);
int of_dev_pm_opp_get_cpu_power(unsigned long *mW, unsigned long *KHz, int cpu)
{
unsigned long mV, Hz, MHz;
struct device *cpu_dev;
struct dev_pm_opp *opp;
struct device_node *np;
u32 cap;
u64 tmp;
cpu_dev = get_cpu_device(cpu);
if (!cpu_dev)
return -ENODEV;
np = of_node_get(cpu_dev->of_node);
if (!np)
return -EINVAL;
if (of_property_read_u32(np, "dynamic-power-coefficient", &cap))
return -EINVAL;
Hz = *KHz * 1000;
opp = dev_pm_opp_find_freq_ceil(cpu_dev, &Hz);
if (IS_ERR(opp))
return -EINVAL;
mV = dev_pm_opp_get_voltage(opp) / 1000;
dev_pm_opp_put(opp);
if (!mV)
return -EINVAL;
MHz = Hz / 1000000;
tmp = (u64)cap * mV * mV * MHz;
do_div(tmp, 1000000000);
*mW = (unsigned long)tmp;
*KHz = Hz / 1000;
return 0;
}
EXPORT_SYMBOL_GPL(of_dev_pm_opp_get_cpu_power);

View File

@ -31,6 +31,7 @@
#include <linux/slab.h>
#include <linux/cpu.h>
#include <linux/cpu_cooling.h>
#include <linux/energy_model.h>
#include <trace/events/thermal.h>
@ -48,19 +49,6 @@
* ...
*/
/**
* struct freq_table - frequency table along with power entries
* @frequency: frequency in KHz
* @power: power in mW
*
* This structure is built when the cooling device registers and helps
* in translating frequency to power and vice versa.
*/
struct freq_table {
u32 frequency;
u32 power;
};
/**
* struct time_in_idle - Idle time stats
* @time: previous reading of the absolute time that this cpu was idle
@ -82,7 +70,7 @@ struct time_in_idle {
* frequency.
* @max_level: maximum cooling level. One less than total number of valid
* cpufreq frequencies.
* @freq_table: Freq table in descending order of frequencies
* @em: Reference on the Energy Model of the device
* @cdev: thermal_cooling_device pointer to keep track of the
* registered cooling device.
* @policy: cpufreq policy.
@ -98,7 +86,7 @@ struct cpufreq_cooling_device {
unsigned int cpufreq_state;
unsigned int clipped_freq;
unsigned int max_level;
struct freq_table *freq_table; /* In descending order */
struct em_perf_domain *em;
struct thermal_cooling_device *cdev;
struct cpufreq_policy *policy;
struct list_head node;
@ -111,26 +99,6 @@ static LIST_HEAD(cpufreq_cdev_list);
/* Below code defines functions to be used for cpufreq as cooling device */
/**
* get_level: Find the level for a particular frequency
* @cpufreq_cdev: cpufreq_cdev for which the property is required
* @freq: Frequency
*
* Return: level corresponding to the frequency.
*/
static unsigned long get_level(struct cpufreq_cooling_device *cpufreq_cdev,
unsigned int freq)
{
struct freq_table *freq_table = cpufreq_cdev->freq_table;
unsigned long level;
for (level = 1; level <= cpufreq_cdev->max_level; level++)
if (freq > freq_table[level].frequency)
break;
return level - 1;
}
/**
* cpufreq_thermal_notifier - notifier callback for cpufreq policy change.
* @nb: struct notifier_block * with callback info.
@ -184,105 +152,52 @@ static int cpufreq_thermal_notifier(struct notifier_block *nb,
return NOTIFY_OK;
}
#ifdef CONFIG_ENERGY_MODEL
/**
* update_freq_table() - Update the freq table with power numbers
* @cpufreq_cdev: the cpufreq cooling device in which to update the table
* @capacitance: dynamic power coefficient for these cpus
* get_level: Find the level for a particular frequency
* @cpufreq_cdev: cpufreq_cdev for which the property is required
* @freq: Frequency
*
* Update the freq table with power numbers. This table will be used in
* cpu_power_to_freq() and cpu_freq_to_power() to convert between power and
* frequency efficiently. Power is stored in mW, frequency in KHz. The
* resulting table is in descending order.
*
* Return: 0 on success, -EINVAL if there are no OPPs for any CPUs,
* or -ENOMEM if we run out of memory.
* Return: level corresponding to the frequency.
*/
static int update_freq_table(struct cpufreq_cooling_device *cpufreq_cdev,
u32 capacitance)
static unsigned long get_level(struct cpufreq_cooling_device *cpufreq_cdev,
unsigned int freq)
{
struct freq_table *freq_table = cpufreq_cdev->freq_table;
struct dev_pm_opp *opp;
struct device *dev = NULL;
int num_opps = 0, cpu = cpufreq_cdev->policy->cpu, i;
int i;
dev = get_cpu_device(cpu);
if (unlikely(!dev)) {
dev_warn(&cpufreq_cdev->cdev->device,
"No cpu device for cpu %d\n", cpu);
return -ENODEV;
for (i = cpufreq_cdev->max_level - 1; i >= 0; i--) {
if (freq > cpufreq_cdev->em->table[i].frequency)
break;
}
num_opps = dev_pm_opp_get_opp_count(dev);
if (num_opps < 0)
return num_opps;
/*
* The cpufreq table is also built from the OPP table and so the count
* should match.
*/
if (num_opps != cpufreq_cdev->max_level + 1) {
dev_warn(dev, "Number of OPPs not matching with max_levels\n");
return -EINVAL;
}
for (i = 0; i <= cpufreq_cdev->max_level; i++) {
unsigned long freq = freq_table[i].frequency * 1000;
u32 freq_mhz = freq_table[i].frequency / 1000;
u64 power;
u32 voltage_mv;
/*
* Find ceil frequency as 'freq' may be slightly lower than OPP
* freq due to truncation while converting to kHz.
*/
opp = dev_pm_opp_find_freq_ceil(dev, &freq);
if (IS_ERR(opp)) {
dev_err(dev, "failed to get opp for %lu frequency\n",
freq);
return -EINVAL;
}
voltage_mv = dev_pm_opp_get_voltage(opp) / 1000;
dev_pm_opp_put(opp);
/*
* Do the multiplication with MHz and millivolt so as
* to not overflow.
*/
power = (u64)capacitance * freq_mhz * voltage_mv * voltage_mv;
do_div(power, 1000000000);
/* power is stored in mW */
freq_table[i].power = power;
}
return 0;
return cpufreq_cdev->max_level - i - 1;
}
static u32 cpu_freq_to_power(struct cpufreq_cooling_device *cpufreq_cdev,
u32 freq)
{
int i;
struct freq_table *freq_table = cpufreq_cdev->freq_table;
for (i = 1; i <= cpufreq_cdev->max_level; i++)
if (freq > freq_table[i].frequency)
for (i = cpufreq_cdev->max_level - 1; i >= 0; i--) {
if (freq > cpufreq_cdev->em->table[i].frequency)
break;
}
return freq_table[i - 1].power;
return cpufreq_cdev->em->table[i + 1].power;
}
static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
u32 power)
{
int i;
struct freq_table *freq_table = cpufreq_cdev->freq_table;
for (i = 1; i <= cpufreq_cdev->max_level; i++)
if (power > freq_table[i].power)
for (i = cpufreq_cdev->max_level - 1; i >= 0; i--) {
if (power > cpufreq_cdev->em->table[i].power)
break;
}
return freq_table[i - 1].frequency;
return cpufreq_cdev->em->table[i + 1].frequency;
}
/**
@ -332,6 +247,7 @@ static u32 get_dynamic_power(struct cpufreq_cooling_device *cpufreq_cdev,
raw_cpu_power = cpu_freq_to_power(cpufreq_cdev, freq);
return (raw_cpu_power * cpufreq_cdev->last_load) / 100;
}
#endif
/* cpufreq cooling device callback functions are defined below */
@ -374,6 +290,30 @@ static int cpufreq_get_cur_state(struct thermal_cooling_device *cdev,
return 0;
}
static unsigned int get_state_freq(struct cpufreq_cooling_device *cpufreq_cdev,
unsigned long state)
{
struct cpufreq_policy *policy;
unsigned long idx;
#ifdef CONFIG_ENERGY_MODEL
/* Use the Energy Model table if available */
if (cpufreq_cdev->em) {
idx = cpufreq_cdev->max_level - state;
return cpufreq_cdev->em->table[idx].frequency;
}
#endif
/* Otherwise, fallback on the CPUFreq table */
policy = cpufreq_cdev->policy;
if (policy->freq_table_sorted == CPUFREQ_TABLE_SORTED_ASCENDING)
idx = cpufreq_cdev->max_level - state;
else
idx = state;
return policy->freq_table[idx].frequency;
}
/**
* cpufreq_set_cur_state - callback function to set the current cooling state.
* @cdev: thermal cooling device pointer.
@ -398,7 +338,7 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
if (cpufreq_cdev->cpufreq_state == state)
return 0;
clip_freq = cpufreq_cdev->freq_table[state].frequency;
clip_freq = get_state_freq(cpufreq_cdev, state);
cpufreq_cdev->cpufreq_state = state;
cpufreq_cdev->clipped_freq = clip_freq;
@ -407,6 +347,7 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
return 0;
}
#ifdef CONFIG_ENERGY_MODEL
/**
* cpufreq_get_requested_power() - get the current power
* @cdev: &thermal_cooling_device pointer
@ -497,7 +438,7 @@ static int cpufreq_state2power(struct thermal_cooling_device *cdev,
struct thermal_zone_device *tz,
unsigned long state, u32 *power)
{
unsigned int freq, num_cpus;
unsigned int freq, num_cpus, idx;
struct cpufreq_cooling_device *cpufreq_cdev = cdev->devdata;
/* Request state should be less than max_level */
@ -506,7 +447,8 @@ static int cpufreq_state2power(struct thermal_cooling_device *cdev,
num_cpus = cpumask_weight(cpufreq_cdev->policy->cpus);
freq = cpufreq_cdev->freq_table[state].frequency;
idx = cpufreq_cdev->max_level - state;
freq = cpufreq_cdev->em->table[idx].frequency;
*power = cpu_freq_to_power(cpufreq_cdev, freq) * num_cpus;
return 0;
@ -553,14 +495,6 @@ static int cpufreq_power2state(struct thermal_cooling_device *cdev,
return 0;
}
/* Bind cpufreq callbacks to thermal cooling device ops */
static struct thermal_cooling_device_ops cpufreq_cooling_ops = {
.get_max_state = cpufreq_get_max_state,
.get_cur_state = cpufreq_get_cur_state,
.set_cur_state = cpufreq_set_cur_state,
};
static struct thermal_cooling_device_ops cpufreq_power_cooling_ops = {
.get_max_state = cpufreq_get_max_state,
.get_cur_state = cpufreq_get_cur_state,
@ -569,32 +503,27 @@ static struct thermal_cooling_device_ops cpufreq_power_cooling_ops = {
.state2power = cpufreq_state2power,
.power2state = cpufreq_power2state,
};
#endif
/* Bind cpufreq callbacks to thermal cooling device ops */
static struct thermal_cooling_device_ops cpufreq_cooling_ops = {
.get_max_state = cpufreq_get_max_state,
.get_cur_state = cpufreq_get_cur_state,
.set_cur_state = cpufreq_set_cur_state,
};
/* Notifier for cpufreq policy change */
static struct notifier_block thermal_cpufreq_notifier_block = {
.notifier_call = cpufreq_thermal_notifier,
};
static unsigned int find_next_max(struct cpufreq_frequency_table *table,
unsigned int prev_max)
{
struct cpufreq_frequency_table *pos;
unsigned int max = 0;
cpufreq_for_each_valid_entry(pos, table) {
if (pos->frequency > max && pos->frequency < prev_max)
max = pos->frequency;
}
return max;
}
/**
* __cpufreq_cooling_register - helper function to create cpufreq cooling device
* @np: a valid struct device_node to the cooling device device tree node
* @policy: cpufreq policy
* Normally this should be same as cpufreq policy->related_cpus.
* @capacitance: dynamic power coefficient for these cpus
* @try_model: true if a power model should be used
*
* This interface function registers the cpufreq cooling device with the name
* "thermal-cpufreq-%x". This api can support multiple instances of cpufreq
@ -606,12 +535,12 @@ static unsigned int find_next_max(struct cpufreq_frequency_table *table,
*/
static struct thermal_cooling_device *
__cpufreq_cooling_register(struct device_node *np,
struct cpufreq_policy *policy, u32 capacitance)
struct cpufreq_policy *policy, bool try_model)
{
struct thermal_cooling_device *cdev;
struct cpufreq_cooling_device *cpufreq_cdev;
char dev_name[THERMAL_NAME_LENGTH];
unsigned int freq, i, num_cpus;
unsigned int i, num_cpus;
int ret;
struct thermal_cooling_device_ops *cooling_ops;
bool first;
@ -645,54 +574,36 @@ __cpufreq_cooling_register(struct device_node *np,
/* max_level is an index, not a counter */
cpufreq_cdev->max_level = i - 1;
cpufreq_cdev->freq_table = kmalloc_array(i,
sizeof(*cpufreq_cdev->freq_table),
GFP_KERNEL);
if (!cpufreq_cdev->freq_table) {
cdev = ERR_PTR(-ENOMEM);
goto free_idle_time;
}
#ifdef CONFIG_ENERGY_MODEL
if (try_model) {
struct em_perf_domain *em = em_cpu_get(policy->cpu);
if (!em || !cpumask_equal(policy->cpus, to_cpumask(em->cpus))) {
cdev = ERR_PTR(-EINVAL);
goto free_idle_time;
}
cpufreq_cdev->em = em;
cooling_ops = &cpufreq_power_cooling_ops;
} else
#endif
cooling_ops = &cpufreq_cooling_ops;
ret = ida_simple_get(&cpufreq_ida, 0, 0, GFP_KERNEL);
if (ret < 0) {
cdev = ERR_PTR(ret);
goto free_table;
goto free_idle_time;
}
cpufreq_cdev->id = ret;
snprintf(dev_name, sizeof(dev_name), "thermal-cpufreq-%d",
cpufreq_cdev->id);
/* Fill freq-table in descending order of frequencies */
for (i = 0, freq = -1; i <= cpufreq_cdev->max_level; i++) {
freq = find_next_max(policy->freq_table, freq);
cpufreq_cdev->freq_table[i].frequency = freq;
/* Warn for duplicate entries */
if (!freq)
pr_warn("%s: table has duplicate entries\n", __func__);
else
pr_debug("%s: freq:%u KHz\n", __func__, freq);
}
if (capacitance) {
ret = update_freq_table(cpufreq_cdev, capacitance);
if (ret) {
cdev = ERR_PTR(ret);
goto remove_ida;
}
cooling_ops = &cpufreq_power_cooling_ops;
} else {
cooling_ops = &cpufreq_cooling_ops;
}
cdev = thermal_of_cooling_device_register(np, dev_name, cpufreq_cdev,
cooling_ops);
if (IS_ERR(cdev))
goto remove_ida;
cpufreq_cdev->clipped_freq = cpufreq_cdev->freq_table[0].frequency;
cpufreq_cdev->clipped_freq = get_state_freq(cpufreq_cdev, 0);
cpufreq_cdev->cdev = cdev;
mutex_lock(&cooling_list_lock);
@ -709,8 +620,6 @@ __cpufreq_cooling_register(struct device_node *np,
remove_ida:
ida_simple_remove(&cpufreq_ida, cpufreq_cdev->id);
free_table:
kfree(cpufreq_cdev->freq_table);
free_idle_time:
kfree(cpufreq_cdev->idle_time);
free_cdev:
@ -732,7 +641,7 @@ __cpufreq_cooling_register(struct device_node *np,
struct thermal_cooling_device *
cpufreq_cooling_register(struct cpufreq_policy *policy)
{
return __cpufreq_cooling_register(NULL, policy, 0);
return __cpufreq_cooling_register(NULL, policy, false);
}
EXPORT_SYMBOL_GPL(cpufreq_cooling_register);
@ -760,7 +669,6 @@ of_cpufreq_cooling_register(struct cpufreq_policy *policy)
{
struct device_node *np = of_get_cpu_node(policy->cpu, NULL);
struct thermal_cooling_device *cdev = NULL;
u32 capacitance = 0;
if (!np) {
pr_err("cpu_cooling: OF node not available for cpu%d\n",
@ -769,10 +677,7 @@ of_cpufreq_cooling_register(struct cpufreq_policy *policy)
}
if (of_find_property(np, "#cooling-cells", NULL)) {
of_property_read_u32(np, "dynamic-power-coefficient",
&capacitance);
cdev = __cpufreq_cooling_register(np, policy, capacitance);
cdev = __cpufreq_cooling_register(np, policy, true);
if (IS_ERR(cdev)) {
pr_err("cpu_cooling: cpu%d is not running as cooling device: %ld\n",
policy->cpu, PTR_ERR(cdev));
@ -814,7 +719,6 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
thermal_cooling_device_unregister(cpufreq_cdev->cdev);
ida_simple_remove(&cpufreq_ida, cpufreq_cdev->id);
kfree(cpufreq_cdev->idle_time);
kfree(cpufreq_cdev->freq_table);
kfree(cpufreq_cdev);
}
EXPORT_SYMBOL_GPL(cpufreq_cooling_unregister);

View File

@ -9,6 +9,7 @@
#include <linux/percpu.h>
void topology_normalize_cpu_scale(void);
int topology_update_cpu_topology(void);
struct device_node;
bool topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu);
@ -32,4 +33,12 @@ unsigned long topology_get_freq_scale(int cpu)
return per_cpu(freq_scale, cpu);
}
DECLARE_PER_CPU(unsigned long, max_freq_scale);
static inline
unsigned long topology_get_max_freq_scale(struct sched_domain *sd, int cpu)
{
return per_cpu(max_freq_scale, cpu);
}
#endif /* _LINUX_ARCH_TOPOLOGY_H_ */

View File

@ -21,6 +21,10 @@ SUBSYS(cpu)
SUBSYS(cpuacct)
#endif
#if IS_ENABLED(CONFIG_SCHED_TUNE)
SUBSYS(schedtune)
#endif
#if IS_ENABLED(CONFIG_BLK_CGROUP)
SUBSYS(io)
#endif

View File

@ -955,6 +955,8 @@ extern unsigned int arch_freq_get_on_cpu(int cpu);
extern void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
unsigned long max_freq);
extern void arch_set_max_freq_scale(struct cpumask *cpus,
unsigned long policy_max_freq);
/* the following are really really optional */
extern struct freq_attr cpufreq_freq_attr_scaling_available_freqs;

View File

@ -219,7 +219,7 @@ static inline void cpuidle_use_deepest_state(bool enable)
#endif
/* kernel/sched/idle.c */
extern void sched_idle_set_state(struct cpuidle_state *idle_state);
extern void sched_idle_set_state(struct cpuidle_state *idle_state, int index);
extern void default_idle_call(void);
#ifdef CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED

View File

@ -0,0 +1,189 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _LINUX_ENERGY_MODEL_H
#define _LINUX_ENERGY_MODEL_H
#include <linux/cpumask.h>
#include <linux/jump_label.h>
#include <linux/kobject.h>
#include <linux/rcupdate.h>
#include <linux/sched/cpufreq.h>
#include <linux/sched/topology.h>
#include <linux/types.h>
#ifdef CONFIG_ENERGY_MODEL
/**
* em_cap_state - Capacity state of a performance domain
* @frequency: The CPU frequency in KHz, for consistency with CPUFreq
* @power: The power consumed by 1 CPU at this level, in milli-watts
* @cost: The cost coefficient associated with this level, used during
* energy calculation. Equal to: power * max_frequency / frequency
*/
struct em_cap_state {
unsigned long frequency;
unsigned long power;
unsigned long cost;
};
/**
* em_perf_domain - Performance domain
* @table: List of capacity states, in ascending order
* @nr_cap_states: Number of capacity states
* @kobj: Kobject used to expose the domain in sysfs
* @cpus: Cpumask covering the CPUs of the domain
*
* A "performance domain" represents a group of CPUs whose performance is
* scaled together. All CPUs of a performance domain must have the same
* micro-architecture. Performance domains often have a 1-to-1 mapping with
* CPUFreq policies.
*/
struct em_perf_domain {
struct em_cap_state *table;
int nr_cap_states;
struct kobject kobj;
unsigned long cpus[0];
};
#define EM_CPU_MAX_POWER 0xFFFF
struct em_data_callback {
/**
* active_power() - Provide power at the next capacity state of a CPU
* @power : Active power at the capacity state in mW (modified)
* @freq : Frequency at the capacity state in kHz (modified)
* @cpu : CPU for which we do this operation
*
* active_power() must find the lowest capacity state of 'cpu' above
* 'freq' and update 'power' and 'freq' to the matching active power
* and frequency.
*
* The power is the one of a single CPU in the domain, expressed in
* milli-watts. It is expected to fit in the [0, EM_CPU_MAX_POWER]
* range.
*
* Return 0 on success.
*/
int (*active_power)(unsigned long *power, unsigned long *freq, int cpu);
};
#define EM_DATA_CB(_active_power_cb) { .active_power = &_active_power_cb }
struct em_perf_domain *em_cpu_get(int cpu);
int em_register_perf_domain(cpumask_t *span, unsigned int nr_states,
struct em_data_callback *cb);
/**
* em_pd_energy() - Estimates the energy consumed by the CPUs of a perf. domain
* @pd : performance domain for which energy has to be estimated
* @max_util : highest utilization among CPUs of the domain
* @sum_util : sum of the utilization of all CPUs in the domain
*
* Return: the sum of the energy consumed by the CPUs of the domain assuming
* a capacity state satisfying the max utilization of the domain.
*/
static inline unsigned long em_pd_energy(struct em_perf_domain *pd,
unsigned long max_util, unsigned long sum_util)
{
unsigned long freq, scale_cpu;
struct em_cap_state *cs;
int i, cpu;
/*
* In order to predict the capacity state, map the utilization of the
* most utilized CPU of the performance domain to a requested frequency,
* like schedutil.
*/
cpu = cpumask_first(to_cpumask(pd->cpus));
scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
cs = &pd->table[pd->nr_cap_states - 1];
freq = map_util_freq(max_util, cs->frequency, scale_cpu);
/*
* Find the lowest capacity state of the Energy Model above the
* requested frequency.
*/
for (i = 0; i < pd->nr_cap_states; i++) {
cs = &pd->table[i];
if (cs->frequency >= freq)
break;
}
/*
* The capacity of a CPU in the domain at that capacity state (cs)
* can be computed as:
*
* cs->freq * scale_cpu
* cs->cap = -------------------- (1)
* cpu_max_freq
*
* So, ignoring the costs of idle states (which are not available in
* the EM), the energy consumed by this CPU at that capacity state is
* estimated as:
*
* cs->power * cpu_util
* cpu_nrg = -------------------- (2)
* cs->cap
*
* since 'cpu_util / cs->cap' represents its percentage of busy time.
*
* NOTE: Although the result of this computation actually is in
* units of power, it can be manipulated as an energy value
* over a scheduling period, since it is assumed to be
* constant during that interval.
*
* By injecting (1) in (2), 'cpu_nrg' can be re-expressed as a product
* of two terms:
*
* cs->power * cpu_max_freq cpu_util
* cpu_nrg = ------------------------ * --------- (3)
* cs->freq scale_cpu
*
* The first term is static, and is stored in the em_cap_state struct
* as 'cs->cost'.
*
* Since all CPUs of the domain have the same micro-architecture, they
* share the same 'cs->cost', and the same CPU capacity. Hence, the
* total energy of the domain (which is the simple sum of the energy of
* all of its CPUs) can be factorized as:
*
* cs->cost * \Sum cpu_util
* pd_nrg = ------------------------ (4)
* scale_cpu
*/
return cs->cost * sum_util / scale_cpu;
}
/**
* em_pd_nr_cap_states() - Get the number of capacity states of a perf. domain
* @pd : performance domain for which this must be done
*
* Return: the number of capacity states in the performance domain table
*/
static inline int em_pd_nr_cap_states(struct em_perf_domain *pd)
{
return pd->nr_cap_states;
}
#else
struct em_perf_domain {};
struct em_data_callback {};
#define EM_DATA_CB(_active_power_cb) { }
static inline int em_register_perf_domain(cpumask_t *span,
unsigned int nr_states, struct em_data_callback *cb)
{
return -EINVAL;
}
static inline struct em_perf_domain *em_cpu_get(int cpu)
{
return NULL;
}
static inline unsigned long em_pd_energy(struct em_perf_domain *pd,
unsigned long max_util, unsigned long sum_util)
{
return 0;
}
static inline int em_pd_nr_cap_states(struct em_perf_domain *pd)
{
return 0;
}
#endif
#endif

View File

@ -301,6 +301,7 @@ int dev_pm_opp_of_get_sharing_cpus(struct device *cpu_dev, struct cpumask *cpuma
struct device_node *dev_pm_opp_of_get_opp_desc_node(struct device *dev);
struct dev_pm_opp *of_dev_pm_opp_find_required_opp(struct device *dev, struct device_node *np);
struct device_node *dev_pm_opp_get_of_node(struct dev_pm_opp *opp);
int of_dev_pm_opp_get_cpu_power(unsigned long *mW, unsigned long *KHz, int cpu);
#else
static inline int dev_pm_opp_of_add_table(struct device *dev)
{
@ -343,6 +344,10 @@ static inline struct device_node *dev_pm_opp_get_of_node(struct dev_pm_opp *opp)
{
return NULL;
}
static inline int of_dev_pm_opp_get_cpu_power(unsigned long *mW, unsigned long *KHz, int cpu)
{
return -ENOTSUPP;
}
#endif
#endif /* __LINUX_OPP_H__ */

View File

@ -2,6 +2,7 @@
#ifndef _LINUX_SCHED_CPUFREQ_H
#define _LINUX_SCHED_CPUFREQ_H
#include <linux/cpufreq.h>
#include <linux/types.h>
/*
@ -20,6 +21,20 @@ void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
void (*func)(struct update_util_data *data, u64 time,
unsigned int flags));
void cpufreq_remove_update_util_hook(int cpu);
static inline unsigned long map_util_freq(unsigned long util,
unsigned long freq, unsigned long cap)
{
return (freq + (freq >> 2)) * util / cap;
}
#endif /* CONFIG_CPU_FREQ */
#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
void sched_cpufreq_governor_change(struct cpufreq_policy *policy,
struct cpufreq_governor *old_gov);
#else
static inline void sched_cpufreq_governor_change(struct cpufreq_policy *policy,
struct cpufreq_governor *old_gov) { }
#endif
#endif /* _LINUX_SCHED_CPUFREQ_H */

View File

@ -22,6 +22,8 @@ enum { sysctl_hung_task_timeout_secs = 0 };
extern unsigned int sysctl_sched_latency;
extern unsigned int sysctl_sched_min_granularity;
extern unsigned int sysctl_sched_sync_hint_enable;
extern unsigned int sysctl_sched_cstate_aware;
extern unsigned int sysctl_sched_wakeup_granularity;
extern unsigned int sysctl_sched_child_runs_first;
@ -83,4 +85,11 @@ extern int sysctl_schedstats(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos);
#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
extern unsigned int sysctl_sched_energy_aware;
extern int sched_energy_aware_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos);
#endif
#endif /* _LINUX_SCHED_SYSCTL_H */

View File

@ -23,10 +23,10 @@
#define SD_BALANCE_FORK 0x0008 /* Balance on fork, clone */
#define SD_BALANCE_WAKE 0x0010 /* Balance on wakeup */
#define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */
#define SD_ASYM_CPUCAPACITY 0x0040 /* Groups have different max cpu capacities */
#define SD_SHARE_CPUCAPACITY 0x0080 /* Domain members share cpu capacity */
#define SD_ASYM_CPUCAPACITY 0x0040 /* Domain members have different CPU capacities */
#define SD_SHARE_CPUCAPACITY 0x0080 /* Domain members share CPU capacity */
#define SD_SHARE_POWERDOMAIN 0x0100 /* Domain members share power domain */
#define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */
#define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share CPU pkg resources */
#define SD_SERIALIZE 0x0400 /* Only a single load balancing instance */
#define SD_ASYM_PACKING 0x0800 /* Place busy groups earlier in the domain */
#define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling domain */
@ -202,6 +202,17 @@ extern void set_sched_topology(struct sched_domain_topology_level *tl);
# define SD_INIT_NAME(type)
#endif
#ifndef arch_scale_cpu_capacity
static __always_inline
unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
{
if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
return sd->smt_gain / sd->span_weight;
return SCHED_CAPACITY_SCALE;
}
#endif
#else /* CONFIG_SMP */
struct sched_domain_attr;
@ -217,6 +228,14 @@ static inline bool cpus_share_cache(int this_cpu, int that_cpu)
return true;
}
#ifndef arch_scale_cpu_capacity
static __always_inline
unsigned long arch_scale_cpu_capacity(void __always_unused *sd, int cpu)
{
return SCHED_CAPACITY_SCALE;
}
#endif
#endif /* !CONFIG_SMP */
static inline int task_node(const struct task_struct *p)

View File

@ -34,6 +34,7 @@
struct wake_q_head {
struct wake_q_node *first;
struct wake_q_node **lastp;
int count;
};
#define WAKE_Q_TAIL ((struct wake_q_node *) 0x01)
@ -45,6 +46,7 @@ static inline void wake_q_init(struct wake_q_head *head)
{
head->first = WAKE_Q_TAIL;
head->lastp = &head->first;
head->count = 0;
}
extern void wake_q_add(struct wake_q_head *head,

View File

@ -91,6 +91,8 @@ struct scmi_clk_ops {
* to sustained performance level mapping
* @freq_get: gets the frequency for a given device using sustained frequency
* to sustained performance level mapping
* @est_power_get: gets the estimated power cost for a given performance domain
* at a given frequency
*/
struct scmi_perf_ops {
int (*limits_set)(const struct scmi_handle *handle, u32 domain,
@ -110,6 +112,8 @@ struct scmi_perf_ops {
unsigned long rate, bool poll);
int (*freq_get)(const struct scmi_handle *handle, u32 domain,
unsigned long *rate, bool poll);
int (*est_power_get)(const struct scmi_handle *handle, u32 domain,
unsigned long *rate, unsigned long *power);
};
/**

View File

@ -572,6 +572,423 @@ TRACE_EVENT(sched_wake_idle_without_ipi,
TP_printk("cpu=%d", __entry->cpu)
);
#ifdef CONFIG_SMP
#ifdef CREATE_TRACE_POINTS
static inline
int __trace_sched_cpu(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
#ifdef CONFIG_FAIR_GROUP_SCHED
struct rq *rq = cfs_rq ? cfs_rq->rq : NULL;
#else
struct rq *rq = cfs_rq ? container_of(cfs_rq, struct rq, cfs) : NULL;
#endif
return rq ? cpu_of(rq)
: task_cpu((container_of(se, struct task_struct, se)));
}
static inline
int __trace_sched_path(struct cfs_rq *cfs_rq, char *path, int len)
{
#ifdef CONFIG_FAIR_GROUP_SCHED
int l = path ? len : 0;
if (cfs_rq && task_group_is_autogroup(cfs_rq->tg))
return autogroup_path(cfs_rq->tg, path, l) + 1;
else if (cfs_rq && cfs_rq->tg->css.cgroup)
return cgroup_path(cfs_rq->tg->css.cgroup, path, l) + 1;
#endif
if (path)
strcpy(path, "(null)");
return strlen("(null)");
}
static inline
struct cfs_rq *__trace_sched_group_cfs_rq(struct sched_entity *se)
{
#ifdef CONFIG_FAIR_GROUP_SCHED
return se->my_q;
#else
return NULL;
#endif
}
#endif /* CREATE_TRACE_POINTS */
/*
* Tracepoint for cfs_rq load tracking:
*/
TRACE_EVENT(sched_load_cfs_rq,
TP_PROTO(struct cfs_rq *cfs_rq),
TP_ARGS(cfs_rq),
TP_STRUCT__entry(
__field( int, cpu )
__dynamic_array(char, path,
__trace_sched_path(cfs_rq, NULL, 0) )
__field( unsigned long, load )
__field( unsigned long, rbl_load )
__field( unsigned long, util )
),
TP_fast_assign(
__entry->cpu = __trace_sched_cpu(cfs_rq, NULL);
__trace_sched_path(cfs_rq, __get_dynamic_array(path),
__get_dynamic_array_len(path));
__entry->load = cfs_rq->avg.load_avg;
__entry->rbl_load = cfs_rq->avg.runnable_load_avg;
__entry->util = cfs_rq->avg.util_avg;
),
TP_printk("cpu=%d path=%s load=%lu rbl_load=%lu util=%lu",
__entry->cpu, __get_str(path), __entry->load,
__entry->rbl_load,__entry->util)
);
/*
* Tracepoint for rt_rq load tracking:
*/
struct rq;
TRACE_EVENT(sched_load_rt_rq,
TP_PROTO(struct rq *rq),
TP_ARGS(rq),
TP_STRUCT__entry(
__field( int, cpu )
__field( unsigned long, util )
),
TP_fast_assign(
__entry->cpu = rq->cpu;
__entry->util = rq->avg_rt.util_avg;
),
TP_printk("cpu=%d util=%lu", __entry->cpu,
__entry->util)
);
/*
* Tracepoint for sched_entity load tracking:
*/
TRACE_EVENT(sched_load_se,
TP_PROTO(struct sched_entity *se),
TP_ARGS(se),
TP_STRUCT__entry(
__field( int, cpu )
__dynamic_array(char, path,
__trace_sched_path(__trace_sched_group_cfs_rq(se), NULL, 0) )
__array( char, comm, TASK_COMM_LEN )
__field( pid_t, pid )
__field( unsigned long, load )
__field( unsigned long, rbl_load )
__field( unsigned long, util )
),
TP_fast_assign(
struct cfs_rq *gcfs_rq = __trace_sched_group_cfs_rq(se);
struct task_struct *p = gcfs_rq ? NULL
: container_of(se, struct task_struct, se);
__entry->cpu = __trace_sched_cpu(gcfs_rq, se);
__trace_sched_path(gcfs_rq, __get_dynamic_array(path),
__get_dynamic_array_len(path));
memcpy(__entry->comm, p ? p->comm : "(null)", TASK_COMM_LEN);
__entry->pid = p ? p->pid : -1;
__entry->load = se->avg.load_avg;
__entry->rbl_load = se->avg.runnable_load_avg;
__entry->util = se->avg.util_avg;
),
TP_printk("cpu=%d path=%s comm=%s pid=%d load=%lu rbl_load=%lu util=%lu",
__entry->cpu, __get_str(path), __entry->comm, __entry->pid,
__entry->load, __entry->rbl_load, __entry->util)
);
/*
* Tracepoint for task_group load tracking:
*/
#ifdef CONFIG_FAIR_GROUP_SCHED
TRACE_EVENT(sched_load_tg,
TP_PROTO(struct cfs_rq *cfs_rq),
TP_ARGS(cfs_rq),
TP_STRUCT__entry(
__field( int, cpu )
__dynamic_array(char, path,
__trace_sched_path(cfs_rq, NULL, 0) )
__field( long, load )
),
TP_fast_assign(
__entry->cpu = cfs_rq->rq->cpu;
__trace_sched_path(cfs_rq, __get_dynamic_array(path),
__get_dynamic_array_len(path));
__entry->load = atomic_long_read(&cfs_rq->tg->load_avg);
),
TP_printk("cpu=%d path=%s load=%ld", __entry->cpu, __get_str(path),
__entry->load)
);
#endif /* CONFIG_FAIR_GROUP_SCHED */
/*
* Tracepoint for tasks' estimated utilization.
*/
TRACE_EVENT(sched_util_est_task,
TP_PROTO(struct task_struct *tsk, struct sched_avg *avg),
TP_ARGS(tsk, avg),
TP_STRUCT__entry(
__array( char, comm, TASK_COMM_LEN )
__field( pid_t, pid )
__field( int, cpu )
__field( unsigned int, util_avg )
__field( unsigned int, est_enqueued )
__field( unsigned int, est_ewma )
),
TP_fast_assign(
memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
__entry->pid = tsk->pid;
__entry->cpu = task_cpu(tsk);
__entry->util_avg = avg->util_avg;
__entry->est_enqueued = avg->util_est.enqueued;
__entry->est_ewma = avg->util_est.ewma;
),
TP_printk("comm=%s pid=%d cpu=%d util_avg=%u util_est_ewma=%u util_est_enqueued=%u",
__entry->comm,
__entry->pid,
__entry->cpu,
__entry->util_avg,
__entry->est_ewma,
__entry->est_enqueued)
);
/*
* Tracepoint for root cfs_rq's estimated utilization.
*/
TRACE_EVENT(sched_util_est_cpu,
TP_PROTO(int cpu, struct cfs_rq *cfs_rq),
TP_ARGS(cpu, cfs_rq),
TP_STRUCT__entry(
__field( int, cpu )
__field( unsigned int, util_avg )
__field( unsigned int, util_est_enqueued )
),
TP_fast_assign(
__entry->cpu = cpu;
__entry->util_avg = cfs_rq->avg.util_avg;
__entry->util_est_enqueued = cfs_rq->avg.util_est.enqueued;
),
TP_printk("cpu=%d util_avg=%u util_est_enqueued=%u",
__entry->cpu,
__entry->util_avg,
__entry->util_est_enqueued)
);
/*
* Tracepoint for find_best_target
*/
TRACE_EVENT(sched_find_best_target,
TP_PROTO(struct task_struct *tsk, bool prefer_idle,
unsigned long min_util, int best_idle, int best_active,
int target, int backup),
TP_ARGS(tsk, prefer_idle, min_util, best_idle,
best_active, target, backup),
TP_STRUCT__entry(
__array( char, comm, TASK_COMM_LEN )
__field( pid_t, pid )
__field( unsigned long, min_util )
__field( bool, prefer_idle )
__field( int, best_idle )
__field( int, best_active )
__field( int, target )
__field( int, backup )
),
TP_fast_assign(
memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
__entry->pid = tsk->pid;
__entry->min_util = min_util;
__entry->prefer_idle = prefer_idle;
__entry->best_idle = best_idle;
__entry->best_active = best_active;
__entry->target = target;
__entry->backup = backup;
),
TP_printk("pid=%d comm=%s prefer_idle=%d "
"best_idle=%d best_active=%d target=%d backup=%d",
__entry->pid, __entry->comm, __entry->prefer_idle,
__entry->best_idle, __entry->best_active,
__entry->target, __entry->backup)
);
/*
* Tracepoint for accounting CPU boosted utilization
*/
TRACE_EVENT(sched_boost_cpu,
TP_PROTO(int cpu, unsigned long util, long margin),
TP_ARGS(cpu, util, margin),
TP_STRUCT__entry(
__field( int, cpu )
__field( unsigned long, util )
__field(long, margin )
),
TP_fast_assign(
__entry->cpu = cpu;
__entry->util = util;
__entry->margin = margin;
),
TP_printk("cpu=%d util=%lu margin=%ld",
__entry->cpu,
__entry->util,
__entry->margin)
);
/*
* Tracepoint for schedtune_tasks_update
*/
TRACE_EVENT(sched_tune_tasks_update,
TP_PROTO(struct task_struct *tsk, int cpu, int tasks, int idx,
int boost, int max_boost, u64 group_ts),
TP_ARGS(tsk, cpu, tasks, idx, boost, max_boost, group_ts),
TP_STRUCT__entry(
__array( char, comm, TASK_COMM_LEN )
__field( pid_t, pid )
__field( int, cpu )
__field( int, tasks )
__field( int, idx )
__field( int, boost )
__field( int, max_boost )
__field( u64, group_ts )
),
TP_fast_assign(
memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
__entry->pid = tsk->pid;
__entry->cpu = cpu;
__entry->tasks = tasks;
__entry->idx = idx;
__entry->boost = boost;
__entry->max_boost = max_boost;
__entry->group_ts = group_ts;
),
TP_printk("pid=%d comm=%s "
"cpu=%d tasks=%d idx=%d boost=%d max_boost=%d timeout=%llu",
__entry->pid, __entry->comm,
__entry->cpu, __entry->tasks, __entry->idx,
__entry->boost, __entry->max_boost,
__entry->group_ts)
);
/*
* Tracepoint for schedtune_boostgroup_update
*/
TRACE_EVENT(sched_tune_boostgroup_update,
TP_PROTO(int cpu, int variation, int max_boost),
TP_ARGS(cpu, variation, max_boost),
TP_STRUCT__entry(
__field( int, cpu )
__field( int, variation )
__field( int, max_boost )
),
TP_fast_assign(
__entry->cpu = cpu;
__entry->variation = variation;
__entry->max_boost = max_boost;
),
TP_printk("cpu=%d variation=%d max_boost=%d",
__entry->cpu, __entry->variation, __entry->max_boost)
);
/*
* Tracepoint for accounting task boosted utilization
*/
TRACE_EVENT(sched_boost_task,
TP_PROTO(struct task_struct *tsk, unsigned long util, long margin),
TP_ARGS(tsk, util, margin),
TP_STRUCT__entry(
__array( char, comm, TASK_COMM_LEN )
__field( pid_t, pid )
__field( unsigned long, util )
__field( long, margin )
),
TP_fast_assign(
memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
__entry->pid = tsk->pid;
__entry->util = util;
__entry->margin = margin;
),
TP_printk("comm=%s pid=%d util=%lu margin=%ld",
__entry->comm, __entry->pid,
__entry->util,
__entry->margin)
);
/*
* Tracepoint for system overutilized flag
*/
TRACE_EVENT(sched_overutilized,
TP_PROTO(int overutilized),
TP_ARGS(overutilized),
TP_STRUCT__entry(
__field( int, overutilized )
),
TP_fast_assign(
__entry->overutilized = overutilized;
),
TP_printk("overutilized=%d",
__entry->overutilized)
);
#endif /* CONFIG_SMP */
#endif /* _TRACE_SCHED_H */
/* This part must be outside protection */

View File

@ -991,6 +991,29 @@ config SCHED_AUTOGROUP
desktop applications. Task group autogeneration is currently based
upon task session.
config SCHED_TUNE
bool "Boosting for CFS tasks (EXPERIMENTAL)"
depends on SMP
help
This option enables support for task classification using a new
cgroup controller, schedtune. Schedtune allows tasks to be given
a boost value and marked as latency-sensitive or not. This option
provides the "schedtune" controller.
This new controller:
1. allows only a two layers hierarchy, where the root defines the
system-wide boost value and its direct childrens define each one a
different "class of tasks" to be boosted with a different value
2. supports up to 16 different task classes, each one which could be
configured with a different boost value
Latency-sensitive tasks are not subject to energy-aware wakeup
task placement. The boost value assigned to tasks is used to
influence task placement and CPU frequency selection (if
utilization-driven frequency selection is in use).
If unsure, say N.
config SYSFS_DEPRECATED
bool "Enable deprecated sysfs features to support old userspace tools"
depends on SYSFS

View File

@ -298,3 +298,18 @@ config PM_GENERIC_DOMAINS_OF
config CPU_PM
bool
config ENERGY_MODEL
bool "Energy Model for CPUs"
depends on SMP
depends on CPU_FREQ
default n
help
Several subsystems (thermal and/or the task scheduler for example)
can leverage information about the energy consumed by CPUs to make
smarter decisions. This config option enables the framework from
which subsystems can access the energy models.
The exact usage of the energy model is subsystem-dependent.
If in doubt, say N.

View File

@ -15,3 +15,5 @@ obj-$(CONFIG_PM_AUTOSLEEP) += autosleep.o
obj-$(CONFIG_PM_WAKELOCKS) += wakelock.o
obj-$(CONFIG_MAGIC_SYSRQ) += poweroff.o
obj-$(CONFIG_ENERGY_MODEL) += energy_model.o

291
kernel/power/energy_model.c Normal file
View File

@ -0,0 +1,291 @@
// SPDX-License-Identifier: GPL-2.0
/*
* Energy Model of CPUs
*
* Copyright (c) 2018, Arm ltd.
* Written by: Quentin Perret, Arm ltd.
*/
#define pr_fmt(fmt) "energy_model: " fmt
#include <linux/cpu.h>
#include <linux/cpumask.h>
#include <linux/energy_model.h>
#include <linux/sched/topology.h>
#include <linux/slab.h>
/* Mapping of each CPU to the performance domain to which it belongs. */
static DEFINE_PER_CPU(struct em_perf_domain *, em_data);
/*
* Mutex serializing the registrations of performance domains and letting
* callbacks defined by drivers sleep.
*/
static DEFINE_MUTEX(em_pd_mutex);
static struct kobject *em_kobject;
/* Getters for the attributes of em_perf_domain objects */
struct em_pd_attr {
struct attribute attr;
ssize_t (*show)(struct em_perf_domain *pd, char *buf);
ssize_t (*store)(struct em_perf_domain *pd, const char *buf, size_t s);
};
#define EM_ATTR_LEN 13
#define show_table_attr(_attr) \
static ssize_t show_##_attr(struct em_perf_domain *pd, char *buf) \
{ \
ssize_t cnt = 0; \
int i; \
for (i = 0; i < pd->nr_cap_states; i++) { \
if (cnt >= (ssize_t) (PAGE_SIZE / sizeof(char) \
- (EM_ATTR_LEN + 2))) \
goto out; \
cnt += scnprintf(&buf[cnt], EM_ATTR_LEN + 1, "%lu ", \
pd->table[i]._attr); \
} \
out: \
cnt += sprintf(&buf[cnt], "\n"); \
return cnt; \
}
show_table_attr(power);
show_table_attr(frequency);
show_table_attr(cost);
static ssize_t show_cpus(struct em_perf_domain *pd, char *buf)
{
return sprintf(buf, "%*pbl\n", cpumask_pr_args(to_cpumask(pd->cpus)));
}
#define pd_attr(_name) em_pd_##_name##_attr
#define define_pd_attr(_name) static struct em_pd_attr pd_attr(_name) = \
__ATTR(_name, 0444, show_##_name, NULL)
define_pd_attr(power);
define_pd_attr(frequency);
define_pd_attr(cost);
define_pd_attr(cpus);
static struct attribute *em_pd_default_attrs[] = {
&pd_attr(power).attr,
&pd_attr(frequency).attr,
&pd_attr(cost).attr,
&pd_attr(cpus).attr,
NULL
};
#define to_pd(k) container_of(k, struct em_perf_domain, kobj)
#define to_pd_attr(a) container_of(a, struct em_pd_attr, attr)
static ssize_t show(struct kobject *kobj, struct attribute *attr, char *buf)
{
struct em_perf_domain *pd = to_pd(kobj);
struct em_pd_attr *pd_attr = to_pd_attr(attr);
ssize_t ret;
ret = pd_attr->show(pd, buf);
return ret;
}
static const struct sysfs_ops em_pd_sysfs_ops = {
.show = show,
};
static struct kobj_type ktype_em_pd = {
.sysfs_ops = &em_pd_sysfs_ops,
.default_attrs = em_pd_default_attrs,
};
static struct em_perf_domain *em_create_pd(cpumask_t *span, int nr_states,
struct em_data_callback *cb)
{
unsigned long opp_eff, prev_opp_eff = ULONG_MAX;
unsigned long power, freq, prev_freq = 0;
int i, ret, cpu = cpumask_first(span);
struct em_cap_state *table;
struct em_perf_domain *pd;
u64 fmax;
if (!cb->active_power)
return NULL;
pd = kzalloc(sizeof(*pd) + cpumask_size(), GFP_KERNEL);
if (!pd)
return NULL;
table = kcalloc(nr_states, sizeof(*table), GFP_KERNEL);
if (!table)
goto free_pd;
/* Build the list of capacity states for this performance domain */
for (i = 0, freq = 0; i < nr_states; i++, freq++) {
/*
* active_power() is a driver callback which ceils 'freq' to
* lowest capacity state of 'cpu' above 'freq' and updates
* 'power' and 'freq' accordingly.
*/
ret = cb->active_power(&power, &freq, cpu);
if (ret) {
pr_err("pd%d: invalid cap. state: %d\n", cpu, ret);
goto free_cs_table;
}
/*
* We expect the driver callback to increase the frequency for
* higher capacity states.
*/
if (freq <= prev_freq) {
pr_err("pd%d: non-increasing freq: %lu\n", cpu, freq);
goto free_cs_table;
}
/*
* The power returned by active_state() is expected to be
* positive, in milli-watts and to fit into 16 bits.
*/
if (!power || power > EM_CPU_MAX_POWER) {
pr_err("pd%d: invalid power: %lu\n", cpu, power);
goto free_cs_table;
}
table[i].power = power;
table[i].frequency = prev_freq = freq;
/*
* The hertz/watts efficiency ratio should decrease as the
* frequency grows on sane platforms. But this isn't always
* true in practice so warn the user if a higher OPP is more
* power efficient than a lower one.
*/
opp_eff = freq / power;
if (opp_eff >= prev_opp_eff)
pr_warn("pd%d: hertz/watts ratio non-monotonically decreasing: em_cap_state %d >= em_cap_state%d\n",
cpu, i, i - 1);
prev_opp_eff = opp_eff;
}
/* Compute the cost of each capacity_state. */
fmax = (u64) table[nr_states - 1].frequency;
for (i = 0; i < nr_states; i++) {
table[i].cost = div64_u64(fmax * table[i].power,
table[i].frequency);
}
pd->table = table;
pd->nr_cap_states = nr_states;
cpumask_copy(to_cpumask(pd->cpus), span);
ret = kobject_init_and_add(&pd->kobj, &ktype_em_pd, em_kobject,
"pd%u", cpu);
if (ret)
pr_err("pd%d: failed kobject_init_and_add(): %d\n", cpu, ret);
return pd;
free_cs_table:
kfree(table);
free_pd:
kfree(pd);
return NULL;
}
/**
* em_cpu_get() - Return the performance domain for a CPU
* @cpu : CPU to find the performance domain for
*
* Return: the performance domain to which 'cpu' belongs, or NULL if it doesn't
* exist.
*/
struct em_perf_domain *em_cpu_get(int cpu)
{
return READ_ONCE(per_cpu(em_data, cpu));
}
EXPORT_SYMBOL_GPL(em_cpu_get);
/**
* em_register_perf_domain() - Register the Energy Model of a performance domain
* @span : Mask of CPUs in the performance domain
* @nr_states : Number of capacity states to register
* @cb : Callback functions providing the data of the Energy Model
*
* Create Energy Model tables for a performance domain using the callbacks
* defined in cb.
*
* If multiple clients register the same performance domain, all but the first
* registration will be ignored.
*
* Return 0 on success
*/
int em_register_perf_domain(cpumask_t *span, unsigned int nr_states,
struct em_data_callback *cb)
{
unsigned long cap, prev_cap = 0;
struct em_perf_domain *pd;
int cpu, ret = 0;
if (!span || !nr_states || !cb)
return -EINVAL;
/*
* Use a mutex to serialize the registration of performance domains and
* let the driver-defined callback functions sleep.
*/
mutex_lock(&em_pd_mutex);
if (!em_kobject) {
em_kobject = kobject_create_and_add("energy_model",
&cpu_subsys.dev_root->kobj);
if (!em_kobject) {
ret = -ENODEV;
goto unlock;
}
}
for_each_cpu(cpu, span) {
/* Make sure we don't register again an existing domain. */
if (READ_ONCE(per_cpu(em_data, cpu))) {
ret = -EEXIST;
goto unlock;
}
/*
* All CPUs of a domain must have the same micro-architecture
* since they all share the same table.
*/
cap = arch_scale_cpu_capacity(NULL, cpu);
if (prev_cap && prev_cap != cap) {
pr_err("CPUs of %*pbl must have the same capacity\n",
cpumask_pr_args(span));
ret = -EINVAL;
goto unlock;
}
prev_cap = cap;
}
/* Create the performance domain and add it to the Energy Model. */
pd = em_create_pd(span, nr_states, cb);
if (!pd) {
ret = -EINVAL;
goto unlock;
}
for_each_cpu(cpu, span) {
/*
* The per-cpu array can be read concurrently from em_cpu_get().
* The barrier enforces the ordering needed to make sure readers
* can only access well formed em_perf_domain structs.
*/
smp_store_release(per_cpu_ptr(&em_data, cpu), pd);
}
pr_debug("Created perf domain %*pbl\n", cpumask_pr_args(span));
unlock:
mutex_unlock(&em_pd_mutex);
return ret;
}
EXPORT_SYMBOL_GPL(em_register_perf_domain);

View File

@ -24,6 +24,7 @@ obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o
obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
obj-$(CONFIG_SCHED_TUNE) += tune.o
obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
obj-$(CONFIG_CPU_FREQ) += cpufreq.o
obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o

View File

@ -259,7 +259,6 @@ void proc_sched_autogroup_show_task(struct task_struct *p, struct seq_file *m)
}
#endif /* CONFIG_PROC_FS */
#ifdef CONFIG_SCHED_DEBUG
int autogroup_path(struct task_group *tg, char *buf, int buflen)
{
if (!task_group_is_autogroup(tg))
@ -267,4 +266,3 @@ int autogroup_path(struct task_group *tg, char *buf, int buflen)
return snprintf(buf, buflen, "%s-%ld", "/autogroup", tg->autogroup->id);
}
#endif

View File

@ -412,6 +412,8 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
return;
head->count++;
get_task_struct(task);
/*
@ -421,6 +423,10 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
head->lastp = &node->next;
}
static int
try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags,
int sibling_count_hint);
void wake_up_q(struct wake_q_head *head)
{
struct wake_q_node *node = head->first;
@ -435,10 +441,10 @@ void wake_up_q(struct wake_q_head *head)
task->wake_q.next = NULL;
/*
* wake_up_process() executes a full barrier, which pairs with
* try_to_wake_up() executes a full barrier, which pairs with
* the queueing in wake_q_add() so as not to miss wakeups.
*/
wake_up_process(task);
try_to_wake_up(task, TASK_NORMAL, 0, head->count);
put_task_struct(task);
}
}
@ -1523,12 +1529,14 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
* The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
*/
static inline
int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags,
int sibling_count_hint)
{
lockdep_assert_held(&p->pi_lock);
if (p->nr_cpus_allowed > 1)
cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags,
sibling_count_hint);
else
cpu = cpumask_any(&p->cpus_allowed);
@ -1931,6 +1939,8 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
* @p: the thread to be awakened
* @state: the mask of task states that can be woken
* @wake_flags: wake modifier flags (WF_*)
* @sibling_count_hint: A hint at the number of threads that are being woken up
* in this event.
*
* If (@state & @p->state) @p->state = TASK_RUNNING.
*
@ -1946,7 +1956,8 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
* %false otherwise.
*/
static int
try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags,
int sibling_count_hint)
{
unsigned long flags;
int cpu, success = 0;
@ -2033,7 +2044,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
atomic_dec(&task_rq(p)->nr_iowait);
}
cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags,
sibling_count_hint);
if (task_cpu(p) != cpu) {
wake_flags |= WF_MIGRATED;
set_task_cpu(p, cpu);
@ -2120,13 +2132,13 @@ static void try_to_wake_up_local(struct task_struct *p, struct rq_flags *rf)
*/
int wake_up_process(struct task_struct *p)
{
return try_to_wake_up(p, TASK_NORMAL, 0);
return try_to_wake_up(p, TASK_NORMAL, 0, 1);
}
EXPORT_SYMBOL(wake_up_process);
int wake_up_state(struct task_struct *p, unsigned int state)
{
return try_to_wake_up(p, state, 0);
return try_to_wake_up(p, state, 0, 1);
}
/*
@ -2408,7 +2420,7 @@ void wake_up_new_task(struct task_struct *p)
* as we're not fully set-up yet.
*/
p->recent_used_cpu = task_cpu(p);
__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0, 1));
#endif
rq = __task_rq_lock(p, &rf);
update_rq_clock(rq);
@ -2947,7 +2959,7 @@ void sched_exec(void)
int dest_cpu;
raw_spin_lock_irqsave(&p->pi_lock, flags);
dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0);
dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0, 1);
if (dest_cpu == smp_processor_id())
goto unlock;
@ -3708,7 +3720,7 @@ asmlinkage __visible void __sched preempt_schedule_irq(void)
int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags,
void *key)
{
return try_to_wake_up(curr->private, mode, wake_flags);
return try_to_wake_up(curr->private, mode, wake_flags, 1);
}
EXPORT_SYMBOL(default_wake_function);

View File

@ -13,11 +13,13 @@
#include "sched.h"
#include <linux/sched/cpufreq.h>
#include <trace/events/power.h>
struct sugov_tunables {
struct gov_attr_set attr_set;
unsigned int rate_limit_us;
unsigned int up_rate_limit_us;
unsigned int down_rate_limit_us;
};
struct sugov_policy {
@ -28,7 +30,9 @@ struct sugov_policy {
raw_spinlock_t update_lock; /* For shared policies */
u64 last_freq_update_time;
s64 freq_update_delay_ns;
s64 min_rate_limit_ns;
s64 up_rate_delay_ns;
s64 down_rate_delay_ns;
unsigned int next_freq;
unsigned int cached_raw_freq;
@ -93,9 +97,32 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
if (unlikely(sg_policy->need_freq_update))
return true;
/* No need to recalculate next freq for min_rate_limit_us
* at least. However we might still decide to further rate
* limit once frequency change direction is decided, according
* to the separate rate limits.
*/
delta_ns = time - sg_policy->last_freq_update_time;
return delta_ns >= sg_policy->min_rate_limit_ns;
}
static bool sugov_up_down_rate_limit(struct sugov_policy *sg_policy, u64 time,
unsigned int next_freq)
{
s64 delta_ns;
delta_ns = time - sg_policy->last_freq_update_time;
return delta_ns >= sg_policy->freq_update_delay_ns;
if (next_freq > sg_policy->next_freq &&
delta_ns < sg_policy->up_rate_delay_ns)
return true;
if (next_freq < sg_policy->next_freq &&
delta_ns < sg_policy->down_rate_delay_ns)
return true;
return false;
}
static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
@ -104,6 +131,9 @@ static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
if (sg_policy->next_freq == next_freq)
return false;
if (sugov_up_down_rate_limit(sg_policy, time, next_freq))
return false;
sg_policy->next_freq = next_freq;
sg_policy->last_freq_update_time = time;
@ -167,7 +197,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
unsigned int freq = arch_scale_freq_invariant() ?
policy->cpuinfo.max_freq : policy->cur;
freq = (freq + (freq >> 2)) * util / max;
freq = map_util_freq(util, freq, max);
if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
return sg_policy->next_freq;
@ -189,6 +219,9 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
* Where the cfs,rt and dl util numbers are tracked with the same metric and
* synchronized windows and are thus directly comparable.
*
* The @util parameter passed to this function is assumed to be the aggregation
* of RT and CFS util numbers. The cases of DL and IRQ are managed here.
*
* The cfs,rt,dl utilization are the running times measured with rq->clock_task
* which excludes things like IRQ and steal-time. These latter are then accrued
* in the irq utilization.
@ -197,15 +230,14 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
* based on the task model parameters and gives the minimal utilization
* required to meet deadlines.
*/
static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
unsigned long schedutil_freq_util(int cpu, unsigned long util,
unsigned long max, enum schedutil_type type)
{
struct rq *rq = cpu_rq(sg_cpu->cpu);
unsigned long util, irq, max;
struct rq *rq = cpu_rq(cpu);
unsigned long irq;
sg_cpu->max = max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
sg_cpu->bw_dl = cpu_bw_dl(rq);
if (rt_rq_is_runnable(&rq->rt))
if (sched_feat(SUGOV_RT_MAX_FREQ) && type == FREQUENCY_UTIL &&
rt_rq_is_runnable(&rq->rt))
return max;
/*
@ -218,25 +250,34 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
return max;
/*
* Because the time spend on RT/DL tasks is visible as 'lost' time to
* CFS tasks and we use the same metric to track the effective
* utilization (PELT windows are synchronized) we can directly add them
* to obtain the CPU's actual utilization.
* The function is called with @util defined as the aggregation (the
* sum) of RT and CFS signals, hence leaving the special case of DL
* to be delt with. The exact way of doing things depend on the calling
* context.
*/
util = cpu_util_cfs(rq);
util += cpu_util_rt(rq);
/*
* We do not make cpu_util_dl() a permanent part of this sum because we
* want to use cpu_bw_dl() later on, but we need to check if the
* CFS+RT+DL sum is saturated (ie. no idle time) such that we select
* f_max when there is no idle time.
*
* NOTE: numerical errors or stop class might cause us to not quite hit
* saturation when we should -- something for later.
*/
if ((util + cpu_util_dl(rq)) >= max)
return max;
if (type == FREQUENCY_UTIL) {
/*
* For frequency selection we do not make cpu_util_dl() a
* permanent part of this sum because we want to use
* cpu_bw_dl() later on, but we need to check if the
* CFS+RT+DL sum is saturated (ie. no idle time) such
* that we select f_max when there is no idle time.
*
* NOTE: numerical errors or stop class might cause us
* to not quite hit saturation when we should --
* something for later.
*/
if ((util + cpu_util_dl(rq)) >= max)
return max;
} else {
/*
* OTOH, for energy computation we need the estimated
* running time, so include util_dl and ignore dl_bw.
*/
util += cpu_util_dl(rq);
if (util >= max)
return max;
}
/*
* There is still idle time; further improve the number by using the
@ -250,17 +291,35 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
util = scale_irq_capacity(util, irq, max);
util += irq;
/*
* Bandwidth required by DEADLINE must always be granted while, for
* FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
* to gracefully reduce the frequency when no tasks show up for longer
* periods of time.
*
* Ideally we would like to set bw_dl as min/guaranteed freq and util +
* bw_dl as requested freq. However, cpufreq is not yet ready for such
* an interface. So, we only do the latter for now.
*/
return min(max, util + sg_cpu->bw_dl);
if (type == FREQUENCY_UTIL) {
/*
* Bandwidth required by DEADLINE must always be granted
* while, for FAIR and RT, we use blocked utilization of
* IDLE CPUs as a mechanism to gracefully reduce the
* frequency when no tasks show up for longer periods of
* time.
*
* Ideally we would like to set bw_dl as min/guaranteed
* freq and util + bw_dl as requested freq. However,
* cpufreq is not yet ready for such an interface. So,
* we only do the latter for now.
*/
util += cpu_bw_dl(rq);
}
return min(max, util);
}
static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
{
struct rq *rq = cpu_rq(sg_cpu->cpu);
unsigned long util = boosted_cpu_util(sg_cpu->cpu, cpu_util_rt(rq));
unsigned long max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
sg_cpu->max = max;
sg_cpu->bw_dl = cpu_bw_dl(rq);
return schedutil_freq_util(sg_cpu->cpu, util, max, FREQUENCY_UTIL);
}
/**
@ -562,15 +621,32 @@ static inline struct sugov_tunables *to_sugov_tunables(struct gov_attr_set *attr
return container_of(attr_set, struct sugov_tunables, attr_set);
}
static ssize_t rate_limit_us_show(struct gov_attr_set *attr_set, char *buf)
static DEFINE_MUTEX(min_rate_lock);
static void update_min_rate_limit_ns(struct sugov_policy *sg_policy)
{
mutex_lock(&min_rate_lock);
sg_policy->min_rate_limit_ns = min(sg_policy->up_rate_delay_ns,
sg_policy->down_rate_delay_ns);
mutex_unlock(&min_rate_lock);
}
static ssize_t up_rate_limit_us_show(struct gov_attr_set *attr_set, char *buf)
{
struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
return sprintf(buf, "%u\n", tunables->rate_limit_us);
return sprintf(buf, "%u\n", tunables->up_rate_limit_us);
}
static ssize_t
rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf, size_t count)
static ssize_t down_rate_limit_us_show(struct gov_attr_set *attr_set, char *buf)
{
struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
return sprintf(buf, "%u\n", tunables->down_rate_limit_us);
}
static ssize_t up_rate_limit_us_store(struct gov_attr_set *attr_set,
const char *buf, size_t count)
{
struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
struct sugov_policy *sg_policy;
@ -579,18 +655,42 @@ rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf, size_t count
if (kstrtouint(buf, 10, &rate_limit_us))
return -EINVAL;
tunables->rate_limit_us = rate_limit_us;
tunables->up_rate_limit_us = rate_limit_us;
list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook)
sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC;
list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook) {
sg_policy->up_rate_delay_ns = rate_limit_us * NSEC_PER_USEC;
update_min_rate_limit_ns(sg_policy);
}
return count;
}
static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);
static ssize_t down_rate_limit_us_store(struct gov_attr_set *attr_set,
const char *buf, size_t count)
{
struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
struct sugov_policy *sg_policy;
unsigned int rate_limit_us;
if (kstrtouint(buf, 10, &rate_limit_us))
return -EINVAL;
tunables->down_rate_limit_us = rate_limit_us;
list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook) {
sg_policy->down_rate_delay_ns = rate_limit_us * NSEC_PER_USEC;
update_min_rate_limit_ns(sg_policy);
}
return count;
}
static struct governor_attr up_rate_limit_us = __ATTR_RW(up_rate_limit_us);
static struct governor_attr down_rate_limit_us = __ATTR_RW(down_rate_limit_us);
static struct attribute *sugov_attributes[] = {
&rate_limit_us.attr,
&up_rate_limit_us.attr,
&down_rate_limit_us.attr,
NULL
};
@ -601,7 +701,7 @@ static struct kobj_type sugov_tunables_ktype = {
/********************** cpufreq governor interface *********************/
static struct cpufreq_governor schedutil_gov;
struct cpufreq_governor schedutil_gov;
static struct sugov_policy *sugov_policy_alloc(struct cpufreq_policy *policy)
{
@ -746,7 +846,8 @@ static int sugov_init(struct cpufreq_policy *policy)
goto stop_kthread;
}
tunables->rate_limit_us = cpufreq_policy_transition_delay_us(policy);
tunables->up_rate_limit_us = cpufreq_policy_transition_delay_us(policy);
tunables->down_rate_limit_us = cpufreq_policy_transition_delay_us(policy);
policy->governor_data = sg_policy;
sg_policy->tunables = tunables;
@ -804,7 +905,11 @@ static int sugov_start(struct cpufreq_policy *policy)
struct sugov_policy *sg_policy = policy->governor_data;
unsigned int cpu;
sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
sg_policy->up_rate_delay_ns =
sg_policy->tunables->up_rate_limit_us * NSEC_PER_USEC;
sg_policy->down_rate_delay_ns =
sg_policy->tunables->down_rate_limit_us * NSEC_PER_USEC;
update_min_rate_limit_ns(sg_policy);
sg_policy->last_freq_update_time = 0;
sg_policy->next_freq = 0;
sg_policy->work_in_progress = false;
@ -860,7 +965,7 @@ static void sugov_limits(struct cpufreq_policy *policy)
sg_policy->need_freq_update = true;
}
static struct cpufreq_governor schedutil_gov = {
struct cpufreq_governor schedutil_gov = {
.name = "schedutil",
.owner = THIS_MODULE,
.dynamic_switching = true,
@ -883,3 +988,36 @@ static int __init sugov_register(void)
return cpufreq_register_governor(&schedutil_gov);
}
fs_initcall(sugov_register);
#ifdef CONFIG_ENERGY_MODEL
extern bool sched_energy_update;
extern struct mutex sched_energy_mutex;
static void rebuild_sd_workfn(struct work_struct *work)
{
mutex_lock(&sched_energy_mutex);
sched_energy_update = true;
rebuild_sched_domains();
sched_energy_update = false;
mutex_unlock(&sched_energy_mutex);
}
static DECLARE_WORK(rebuild_sd_work, rebuild_sd_workfn);
/*
* EAS shouldn't be attempted without sugov, so rebuild the sched_domains
* on governor changes to make sure the scheduler knows about it.
*/
void sched_cpufreq_governor_change(struct cpufreq_policy *policy,
struct cpufreq_governor *old_gov)
{
if (old_gov == &schedutil_gov || policy->governor == &schedutil_gov) {
/*
* When called from the cpufreq_register_driver() path, the
* cpu_hotplug_lock is already held, so use a work item to
* avoid nested locking in rebuild_sched_domains().
*/
schedule_work(&rebuild_sd_work);
}
}
#endif

View File

@ -1567,7 +1567,8 @@ static void yield_task_dl(struct rq *rq)
static int find_later_rq(struct task_struct *task);
static int
select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags,
int sibling_count_hint)
{
struct task_struct *curr;
struct rq *rq;

File diff suppressed because it is too large Load Diff

View File

@ -90,3 +90,33 @@ SCHED_FEAT(WA_BIAS, true)
* UtilEstimation. Use estimated CPU utilization.
*/
SCHED_FEAT(UTIL_EST, true)
/*
* Fast pre-selection of CPU candidates for EAS.
*/
SCHED_FEAT(FIND_BEST_TARGET, true)
/*
* Energy aware scheduling algorithm choices:
* EAS_PREFER_IDLE
* Direct tasks in a schedtune.prefer_idle=1 group through
* the EAS path for wakeup task placement. Otherwise, put
* those tasks through the mainline slow path.
*/
SCHED_FEAT(EAS_PREFER_IDLE, true)
/*
* Request max frequency from schedutil whenever a RT task is running.
*/
SCHED_FEAT(SUGOV_RT_MAX_FREQ, false)
/*
* Apply schedtune boost hold to tasks of all sched classes.
* If enabled, schedtune will hold the boost applied to a CPU
* for 50ms regardless of task activation - if the task is
* still running 50ms later, the boost hold expires and schedtune
* boost will expire immediately the task stops.
* If disabled, this behaviour will only apply to tasks of the
* RT class.
*/
SCHED_FEAT(SCHEDTUNE_BOOST_HOLD_ALL, false)

View File

@ -16,9 +16,10 @@ extern char __cpuidle_text_start[], __cpuidle_text_end[];
* sched_idle_set_state - Record idle state for the current CPU.
* @idle_state: State to record.
*/
void sched_idle_set_state(struct cpuidle_state *idle_state)
void sched_idle_set_state(struct cpuidle_state *idle_state, int index)
{
idle_set_state(this_rq(), idle_state);
idle_set_state_idx(this_rq(), index);
}
static int __read_mostly cpu_idle_force_poll;
@ -374,7 +375,8 @@ void cpu_startup_entry(enum cpuhp_state state)
#ifdef CONFIG_SMP
static int
select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags)
select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags,
int sibling_count_hint)
{
return task_cpu(p); /* IDLE tasks as never migrated */
}

View File

@ -29,6 +29,8 @@
#include "sched-pelt.h"
#include "pelt.h"
#include <trace/events/sched.h>
/*
* Approximate:
* val * y^n, where y^32 ~= 0.5 (~1 scheduling period)
@ -274,6 +276,9 @@ int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se)
if (___update_load_sum(now, cpu, &se->avg, 0, 0, 0)) {
___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
trace_sched_load_se(se);
return 1;
}
@ -290,6 +295,9 @@ int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_e
___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
cfs_se_util_change(&se->avg);
trace_sched_load_se(se);
return 1;
}
@ -304,6 +312,9 @@ int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
cfs_rq->curr != NULL)) {
___update_load_avg(&cfs_rq->avg, 1, 1);
trace_sched_load_cfs_rq(cfs_rq);
return 1;
}
@ -329,6 +340,9 @@ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
running)) {
___update_load_avg(&rq->avg_rt, 1, 1);
trace_sched_load_rt_rq(rq);
return 1;
}

View File

@ -1329,6 +1329,8 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
{
struct sched_rt_entity *rt_se = &p->rt;
schedtune_enqueue_task(p, cpu_of(rq));
if (flags & ENQUEUE_WAKEUP)
rt_se->timeout = 0;
@ -1342,6 +1344,8 @@ static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
{
struct sched_rt_entity *rt_se = &p->rt;
schedtune_dequeue_task(p, cpu_of(rq));
update_curr_rt(rq);
dequeue_rt_entity(rt_se, flags);
@ -1386,7 +1390,8 @@ static void yield_task_rt(struct rq *rq)
static int find_lowest_rq(struct task_struct *task);
static int
select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags,
int sibling_count_hint)
{
struct task_struct *curr;
struct rq *rq;

View File

@ -44,6 +44,7 @@
#include <linux/ctype.h>
#include <linux/debugfs.h>
#include <linux/delayacct.h>
#include <linux/energy_model.h>
#include <linux/init_task.h>
#include <linux/kprobes.h>
#include <linux/kthread.h>
@ -79,6 +80,8 @@
# define SCHED_WARN_ON(x) ({ (void)(x), 0; })
#endif
#include "tune.h"
struct rq;
struct cpuidle_state;
@ -702,6 +705,22 @@ static inline bool sched_asym_prefer(int a, int b)
return arch_asym_cpu_priority(a) > arch_asym_cpu_priority(b);
}
struct perf_domain {
struct em_perf_domain *em_pd;
struct perf_domain *next;
struct rcu_head rcu;
};
struct max_cpu_capacity {
raw_spinlock_t lock;
unsigned long val;
int cpu;
};
/* Scheduling group status flags */
#define SG_OVERLOAD 0x1 /* More than one runnable task on a CPU. */
#define SG_OVERUTILIZED 0x2 /* One or more CPUs are over-utilized. */
/*
* We add the notion of a root-domain which will be used to define per-domain
* variables. Each exclusive cpuset essentially defines an island domain by
@ -717,8 +736,15 @@ struct root_domain {
cpumask_var_t span;
cpumask_var_t online;
/* Indicate more than one runnable task for any CPU */
bool overload;
/*
* Indicate pullable load on at least one CPU, e.g:
* - More than one runnable task
* - Running task is misfit
*/
int overload;
/* Indicate one or more cpus over-utilized (tipping point) */
int overutilized;
/*
* The bit corresponding to a CPU gets set here if such CPU has more
@ -749,13 +775,21 @@ struct root_domain {
cpumask_var_t rto_mask;
struct cpupri cpupri;
unsigned long max_cpu_capacity;
/* Maximum cpu capacity in the system. */
struct max_cpu_capacity max_cpu_capacity;
/*
* NULL-terminated list of performance domains intersecting with the
* CPUs of the rd. Protected by RCU.
*/
struct perf_domain *pd;
};
extern struct root_domain def_root_domain;
extern struct mutex sched_domains_mutex;
extern void init_defrootdomain(void);
extern void init_max_cpu_capacity(struct max_cpu_capacity *mcc);
extern int sched_init_domains(const struct cpumask *cpu_map);
extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
extern void sched_get_rd(struct root_domain *rd);
@ -845,6 +879,8 @@ struct rq {
unsigned char idle_balance;
unsigned long misfit_task_load;
/* For active balancing */
int active_balance;
int push_cpu;
@ -916,6 +952,7 @@ struct rq {
#ifdef CONFIG_CPU_IDLE
/* Must be inspected within a rcu lock section */
struct cpuidle_state *idle_state;
int idle_state_idx;
#endif
};
@ -1187,7 +1224,9 @@ DECLARE_PER_CPU(int, sd_llc_size);
DECLARE_PER_CPU(int, sd_llc_id);
DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
DECLARE_PER_CPU(struct sched_domain *, sd_numa);
DECLARE_PER_CPU(struct sched_domain *, sd_asym);
DECLARE_PER_CPU(struct sched_domain *, sd_asym_packing);
DECLARE_PER_CPU(struct sched_domain *, sd_asym_cpucapacity);
extern struct static_key_false sched_asym_cpucapacity;
struct sched_group_capacity {
atomic_t ref;
@ -1197,6 +1236,7 @@ struct sched_group_capacity {
*/
unsigned long capacity;
unsigned long min_capacity; /* Min per-CPU capacity in group */
unsigned long max_capacity; /* Max per-CPU capacity in group */
unsigned long next_update;
int imbalance; /* XXX unrelated to capacity but shared group state */
@ -1525,7 +1565,8 @@ struct sched_class {
void (*put_prev_task)(struct rq *rq, struct task_struct *p);
#ifdef CONFIG_SMP
int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags,
int subling_count_hint);
void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
void (*task_woken)(struct rq *this_rq, struct task_struct *task);
@ -1613,6 +1654,17 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq)
return rq->idle_state;
}
static inline void idle_set_state_idx(struct rq *rq, int idle_state_idx)
{
rq->idle_state_idx = idle_state_idx;
}
static inline int idle_get_state_idx(struct rq *rq)
{
WARN_ON(!rcu_read_lock_held());
return rq->idle_state_idx;
}
#else
static inline void idle_set_state(struct rq *rq,
struct cpuidle_state *idle_state)
@ -1623,6 +1675,15 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq)
{
return NULL;
}
static inline void idle_set_state_idx(struct rq *rq, int idle_state_idx)
{
}
static inline int idle_get_state_idx(struct rq *rq)
{
return -1;
}
#endif
extern void schedule_idle(void);
@ -1696,8 +1757,8 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
if (prev_nr < 2 && rq->nr_running >= 2) {
#ifdef CONFIG_SMP
if (!rq->rd->overload)
rq->rd->overload = true;
if (!READ_ONCE(rq->rd->overload))
WRITE_ONCE(rq->rd->overload, 1);
#endif
}
@ -1756,26 +1817,14 @@ unsigned long arch_scale_freq_capacity(int cpu)
}
#endif
#ifdef CONFIG_SMP
#ifndef arch_scale_cpu_capacity
#ifndef arch_scale_max_freq_capacity
struct sched_domain;
static __always_inline
unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
{
if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
return sd->smt_gain / sd->span_weight;
return SCHED_CAPACITY_SCALE;
}
#endif
#else
#ifndef arch_scale_cpu_capacity
static __always_inline
unsigned long arch_scale_cpu_capacity(void __always_unused *sd, int cpu)
unsigned long arch_scale_max_freq_capacity(struct sched_domain *sd, int cpu)
{
return SCHED_CAPACITY_SCALE;
}
#endif
#endif
struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
__acquires(rq->lock);
@ -2189,6 +2238,38 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
#endif
#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
/**
* enum schedutil_type - CPU utilization type
* @FREQUENCY_UTIL: Utilization used to select frequency
* @ENERGY_UTIL: Utilization used during energy calculation
*
* The utilization signals of all scheduling classes (CFS/RT/DL) and IRQ time
* need to be aggregated differently depending on the usage made of them. This
* enum is used within schedutil_freq_util() to differentiate the types of
* utilization expected by the callers, and adjust the aggregation accordingly.
*/
enum schedutil_type {
FREQUENCY_UTIL,
ENERGY_UTIL,
};
unsigned long schedutil_freq_util(int cpu, unsigned long util,
unsigned long max, enum schedutil_type type);
static inline unsigned long schedutil_energy_util(int cpu, unsigned long util)
{
unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
return schedutil_freq_util(cpu, util, max, ENERGY_UTIL);
}
#else /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
static inline unsigned long schedutil_energy_util(int cpu, unsigned long util)
{
return util;
}
#endif
#ifdef CONFIG_SMP
static inline unsigned long cpu_bw_dl(struct rq *rq)
{
return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
@ -2244,3 +2325,13 @@ unsigned long scale_irq_capacity(unsigned long util, unsigned long irq, unsigned
return util;
}
#endif
#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
#define perf_domain_span(pd) (to_cpumask(((pd)->em_pd->cpus)))
#else
#define perf_domain_span(pd) NULL
#endif
#ifdef CONFIG_SMP
extern struct static_key_false sched_energy_present;
#endif

View File

@ -11,7 +11,8 @@
#ifdef CONFIG_SMP
static int
select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags)
select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags,
int sibling_count_hint)
{
return task_cpu(p); /* stop tasks as never migrate */
}

View File

@ -201,6 +201,242 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
return 1;
}
DEFINE_STATIC_KEY_FALSE(sched_energy_present);
#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
unsigned int sysctl_sched_energy_aware = 1;
DEFINE_MUTEX(sched_energy_mutex);
bool sched_energy_update;
#ifdef CONFIG_PROC_SYSCTL
int sched_energy_aware_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp, loff_t *ppos)
{
int ret, state;
if (write && !capable(CAP_SYS_ADMIN))
return -EPERM;
ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
if (!ret && write) {
state = static_branch_unlikely(&sched_energy_present);
if (state != sysctl_sched_energy_aware) {
mutex_lock(&sched_energy_mutex);
sched_energy_update = 1;
rebuild_sched_domains();
sched_energy_update = 0;
mutex_unlock(&sched_energy_mutex);
}
}
return ret;
}
#endif
static void free_pd(struct perf_domain *pd)
{
struct perf_domain *tmp;
while (pd) {
tmp = pd->next;
kfree(pd);
pd = tmp;
}
}
static struct perf_domain *find_pd(struct perf_domain *pd, int cpu)
{
while (pd) {
if (cpumask_test_cpu(cpu, perf_domain_span(pd)))
return pd;
pd = pd->next;
}
return NULL;
}
static struct perf_domain *pd_init(int cpu)
{
struct em_perf_domain *obj = em_cpu_get(cpu);
struct perf_domain *pd;
if (!obj) {
if (sched_debug())
pr_info("%s: no EM found for CPU%d\n", __func__, cpu);
return NULL;
}
pd = kzalloc(sizeof(*pd), GFP_KERNEL);
if (!pd)
return NULL;
pd->em_pd = obj;
return pd;
}
static void perf_domain_debug(const struct cpumask *cpu_map,
struct perf_domain *pd)
{
if (!sched_debug() || !pd)
return;
printk(KERN_DEBUG "root_domain %*pbl: ", cpumask_pr_args(cpu_map));
while (pd) {
printk(KERN_CONT " pd%d:{ cpus=%*pbl nr_cstate=%d }",
cpumask_first(perf_domain_span(pd)),
cpumask_pr_args(perf_domain_span(pd)),
em_pd_nr_cap_states(pd->em_pd));
pd = pd->next;
}
printk(KERN_CONT "\n");
}
static void destroy_perf_domain_rcu(struct rcu_head *rp)
{
struct perf_domain *pd;
pd = container_of(rp, struct perf_domain, rcu);
free_pd(pd);
}
static void sched_energy_start(int ndoms_new, cpumask_var_t doms_new[])
{
/*
* The conditions for EAS to start are checked during the creation of
* root domains. If one of them meets all conditions, it will have a
* non-null list of performance domains.
*/
while (ndoms_new) {
if (cpu_rq(cpumask_first(doms_new[ndoms_new - 1]))->rd->pd)
goto enable;
ndoms_new--;
}
if (static_branch_unlikely(&sched_energy_present)) {
if (sched_debug())
pr_info("%s: stopping EAS\n", __func__);
static_branch_disable_cpuslocked(&sched_energy_present);
}
return;
enable:
if (!static_branch_unlikely(&sched_energy_present)) {
if (sched_debug())
pr_info("%s: starting EAS\n", __func__);
static_branch_enable_cpuslocked(&sched_energy_present);
}
}
/*
* EAS can be used on a root domain if it meets all the following conditions:
* 1. an Energy Model (EM) is available;
* 2. the SD_ASYM_CPUCAPACITY flag is set in the sched_domain hierarchy.
* 3. the EM complexity is low enough to keep scheduling overheads low;
* 4. schedutil is driving the frequency of all CPUs of the rd;
*
* The complexity of the Energy Model is defined as:
*
* C = nr_pd * (nr_cpus + nr_cs)
*
* with parameters defined as:
* - nr_pd: the number of performance domains
* - nr_cpus: the number of CPUs
* - nr_cs: the sum of the number of capacity states of all performance
* domains (for example, on a system with 2 performance domains,
* with 10 capacity states each, nr_cs = 2 * 10 = 20).
*
* It is generally not a good idea to use such a model in the wake-up path on
* very complex platforms because of the associated scheduling overheads. The
* arbitrary constraint below prevents that. It makes EAS usable up to 16 CPUs
* with per-CPU DVFS and less than 8 capacity states each, for example.
*/
#define EM_MAX_COMPLEXITY 2048
extern struct cpufreq_governor schedutil_gov;
static void build_perf_domains(const struct cpumask *cpu_map)
{
int i, nr_pd = 0, nr_cs = 0, nr_cpus = cpumask_weight(cpu_map);
struct perf_domain *pd = NULL, *tmp;
int cpu = cpumask_first(cpu_map);
struct root_domain *rd = cpu_rq(cpu)->rd;
struct cpufreq_policy *policy;
struct cpufreq_governor *gov;
if (!sysctl_sched_energy_aware)
goto free;
/* EAS is enabled for asymmetric CPU capacity topologies. */
if (!per_cpu(sd_asym_cpucapacity, cpu)) {
if (sched_debug()) {
pr_info("rd %*pbl: CPUs do not have asymmetric capacities\n",
cpumask_pr_args(cpu_map));
}
goto free;
}
for_each_cpu(i, cpu_map) {
/* Skip already covered CPUs. */
if (find_pd(pd, i))
continue;
/* Do not attempt EAS if schedutil is not being used. */
policy = cpufreq_cpu_get(i);
if (!policy)
goto free;
gov = policy->governor;
cpufreq_cpu_put(policy);
if (gov != &schedutil_gov) {
if (rd->pd)
pr_warn("rd %*pbl: Disabling EAS, schedutil is mandatory\n",
cpumask_pr_args(cpu_map));
goto free;
}
/* Create the new pd and add it to the local list. */
tmp = pd_init(i);
if (!tmp)
goto free;
tmp->next = pd;
pd = tmp;
/*
* Count performance domains and capacity states for the
* complexity check.
*/
nr_pd++;
nr_cs += em_pd_nr_cap_states(pd->em_pd);
}
/* Bail out if the Energy Model complexity is too high. */
if (nr_pd * (nr_cs + nr_cpus) > EM_MAX_COMPLEXITY) {
WARN(1, "rd %*pbl: Failed to start EAS, EM complexity is too high\n",
cpumask_pr_args(cpu_map));
goto free;
}
perf_domain_debug(cpu_map, pd);
/* Attach the new list of performance domains to the root domain. */
tmp = rd->pd;
rcu_assign_pointer(rd->pd, pd);
if (tmp)
call_rcu(&tmp->rcu, destroy_perf_domain_rcu);
return;
free:
free_pd(pd);
tmp = rd->pd;
rcu_assign_pointer(rd->pd, NULL);
if (tmp)
call_rcu(&tmp->rcu, destroy_perf_domain_rcu);
}
#else
static void free_pd(struct perf_domain *pd) { }
#endif /* CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL*/
static void free_rootdomain(struct rcu_head *rcu)
{
struct root_domain *rd = container_of(rcu, struct root_domain, rcu);
@ -211,6 +447,7 @@ static void free_rootdomain(struct rcu_head *rcu)
free_cpumask_var(rd->rto_mask);
free_cpumask_var(rd->online);
free_cpumask_var(rd->span);
free_pd(rd->pd);
kfree(rd);
}
@ -287,6 +524,9 @@ static int init_rootdomain(struct root_domain *rd)
if (cpupri_init(&rd->cpupri) != 0)
goto free_cpudl;
init_max_cpu_capacity(&rd->max_cpu_capacity);
return 0;
free_cpudl:
@ -397,7 +637,9 @@ DEFINE_PER_CPU(int, sd_llc_size);
DEFINE_PER_CPU(int, sd_llc_id);
DEFINE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
DEFINE_PER_CPU(struct sched_domain *, sd_numa);
DEFINE_PER_CPU(struct sched_domain *, sd_asym);
DEFINE_PER_CPU(struct sched_domain *, sd_asym_packing);
DEFINE_PER_CPU(struct sched_domain *, sd_asym_cpucapacity);
DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
static void update_top_cache_domain(int cpu)
{
@ -422,7 +664,10 @@ static void update_top_cache_domain(int cpu)
rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
rcu_assign_pointer(per_cpu(sd_asym, cpu), sd);
rcu_assign_pointer(per_cpu(sd_asym_packing, cpu), sd);
sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY);
rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
}
/*
@ -692,6 +937,7 @@ static void init_overlap_sched_group(struct sched_domain *sd,
sg_span = sched_group_span(sg);
sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;
}
static int
@ -851,6 +1097,7 @@ static struct sched_group *get_group(int cpu, struct sd_data *sdd)
sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sched_group_span(sg));
sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;
return sg;
}
@ -1061,7 +1308,6 @@ static struct cpumask ***sched_domains_numa_masks;
* SD_SHARE_PKG_RESOURCES - describes shared caches
* SD_NUMA - describes NUMA topologies
* SD_SHARE_POWERDOMAIN - describes shared power domain
* SD_ASYM_CPUCAPACITY - describes mixed capacity topologies
*
* Odd one out, which beside describing the topology has a quirk also
* prescribes the desired behaviour that goes along with it:
@ -1073,13 +1319,12 @@ static struct cpumask ***sched_domains_numa_masks;
SD_SHARE_PKG_RESOURCES | \
SD_NUMA | \
SD_ASYM_PACKING | \
SD_ASYM_CPUCAPACITY | \
SD_SHARE_POWERDOMAIN)
static struct sched_domain *
sd_init(struct sched_domain_topology_level *tl,
const struct cpumask *cpu_map,
struct sched_domain *child, int cpu)
struct sched_domain *child, int dflags, int cpu)
{
struct sd_data *sdd = &tl->data;
struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
@ -1100,6 +1345,9 @@ sd_init(struct sched_domain_topology_level *tl,
"wrong sd_flags in topology description\n"))
sd_flags &= ~TOPOLOGY_SD_FLAGS;
/* Apply detected topology flags */
sd_flags |= dflags;
*sd = (struct sched_domain){
.min_interval = sd_weight,
.max_interval = 2*sd_weight,
@ -1122,7 +1370,7 @@ sd_init(struct sched_domain_topology_level *tl,
| 0*SD_SHARE_CPUCAPACITY
| 0*SD_SHARE_PKG_RESOURCES
| 0*SD_SERIALIZE
| 0*SD_PREFER_SIBLING
| 1*SD_PREFER_SIBLING
| 0*SD_NUMA
| sd_flags
,
@ -1148,17 +1396,21 @@ sd_init(struct sched_domain_topology_level *tl,
if (sd->flags & SD_ASYM_CPUCAPACITY) {
struct sched_domain *t = sd;
/*
* Don't attempt to spread across CPUs of different capacities.
*/
if (sd->child)
sd->child->flags &= ~SD_PREFER_SIBLING;
for_each_lower_domain(t)
t->flags |= SD_BALANCE_WAKE;
}
if (sd->flags & SD_SHARE_CPUCAPACITY) {
sd->flags |= SD_PREFER_SIBLING;
sd->imbalance_pct = 110;
sd->smt_gain = 1178; /* ~15% */
} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
sd->flags |= SD_PREFER_SIBLING;
sd->imbalance_pct = 117;
sd->cache_nice_tries = 1;
sd->busy_idx = 2;
@ -1169,6 +1421,7 @@ sd_init(struct sched_domain_topology_level *tl,
sd->busy_idx = 3;
sd->idle_idx = 2;
sd->flags &= ~SD_PREFER_SIBLING;
sd->flags |= SD_SERIALIZE;
if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
sd->flags &= ~(SD_BALANCE_EXEC |
@ -1178,7 +1431,6 @@ sd_init(struct sched_domain_topology_level *tl,
#endif
} else {
sd->flags |= SD_PREFER_SIBLING;
sd->cache_nice_tries = 1;
sd->busy_idx = 2;
sd->idle_idx = 1;
@ -1604,9 +1856,9 @@ static void __sdt_free(const struct cpumask *cpu_map)
static struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
const struct cpumask *cpu_map, struct sched_domain_attr *attr,
struct sched_domain *child, int cpu)
struct sched_domain *child, int dflags, int cpu)
{
struct sched_domain *sd = sd_init(tl, cpu_map, child, cpu);
struct sched_domain *sd = sd_init(tl, cpu_map, child, dflags, cpu);
if (child) {
sd->level = child->level + 1;
@ -1632,6 +1884,65 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
return sd;
}
/*
* Find the sched_domain_topology_level where all CPU capacities are visible
* for all CPUs.
*/
static struct sched_domain_topology_level
*asym_cpu_capacity_level(const struct cpumask *cpu_map)
{
int i, j, asym_level = 0;
bool asym = false;
struct sched_domain_topology_level *tl, *asym_tl = NULL;
unsigned long cap;
/* Is there any asymmetry? */
cap = arch_scale_cpu_capacity(NULL, cpumask_first(cpu_map));
for_each_cpu(i, cpu_map) {
if (arch_scale_cpu_capacity(NULL, i) != cap) {
asym = true;
break;
}
}
if (!asym)
return NULL;
/*
* Examine topology from all CPU's point of views to detect the lowest
* sched_domain_topology_level where a highest capacity CPU is visible
* to everyone.
*/
for_each_cpu(i, cpu_map) {
unsigned long max_capacity = arch_scale_cpu_capacity(NULL, i);
int tl_id = 0;
for_each_sd_topology(tl) {
if (tl_id < asym_level)
goto next_level;
for_each_cpu_and(j, tl->mask(i), cpu_map) {
unsigned long capacity;
capacity = arch_scale_cpu_capacity(NULL, j);
if (capacity <= max_capacity)
continue;
max_capacity = capacity;
asym_level = tl_id;
asym_tl = tl;
}
next_level:
tl_id++;
}
}
return asym_tl;
}
/*
* Build sched domains for a given set of CPUs and attach the sched domains
* to the individual CPUs
@ -1642,20 +1953,31 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
enum s_alloc alloc_state;
struct sched_domain *sd;
struct s_data d;
struct rq *rq = NULL;
int i, ret = -ENOMEM;
struct sched_domain_topology_level *tl_asym;
bool has_asym = false;
alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
if (alloc_state != sa_rootdomain)
goto error;
tl_asym = asym_cpu_capacity_level(cpu_map);
/* Set up domains for CPUs specified by the cpu_map: */
for_each_cpu(i, cpu_map) {
struct sched_domain_topology_level *tl;
sd = NULL;
for_each_sd_topology(tl) {
sd = build_sched_domain(tl, cpu_map, attr, sd, i);
int dflags = 0;
if (tl == tl_asym) {
dflags |= SD_ASYM_CPUCAPACITY;
has_asym = true;
}
sd = build_sched_domain(tl, cpu_map, attr, sd, dflags, i);
if (tl == sched_domain_topology)
*per_cpu_ptr(d.sd, i) = sd;
if (tl->flags & SDTL_OVERLAP)
@ -1693,21 +2015,13 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
/* Attach the domains */
rcu_read_lock();
for_each_cpu(i, cpu_map) {
rq = cpu_rq(i);
sd = *per_cpu_ptr(d.sd, i);
/* Use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing: */
if (rq->cpu_capacity_orig > READ_ONCE(d.rd->max_cpu_capacity))
WRITE_ONCE(d.rd->max_cpu_capacity, rq->cpu_capacity_orig);
cpu_attach_domain(sd, d.rd, i);
}
rcu_read_unlock();
if (rq && sched_debug_enabled) {
pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n",
cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
}
if (has_asym)
static_branch_enable_cpuslocked(&sched_asym_cpucapacity);
ret = 0;
error:
@ -1879,8 +2193,8 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
/* Destroy deleted domains: */
for (i = 0; i < ndoms_cur; i++) {
for (j = 0; j < n && !new_topology; j++) {
if (cpumask_equal(doms_cur[i], doms_new[j])
&& dattrs_equal(dattr_cur, i, dattr_new, j))
if (cpumask_equal(doms_cur[i], doms_new[j]) &&
dattrs_equal(dattr_cur, i, dattr_new, j))
goto match1;
}
/* No match - a current sched domain not in new doms_new[] */
@ -1900,8 +2214,8 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
/* Build new domains: */
for (i = 0; i < ndoms_new; i++) {
for (j = 0; j < n && !new_topology; j++) {
if (cpumask_equal(doms_new[i], doms_cur[j])
&& dattrs_equal(dattr_new, i, dattr_cur, j))
if (cpumask_equal(doms_new[i], doms_cur[j]) &&
dattrs_equal(dattr_new, i, dattr_cur, j))
goto match2;
}
/* No match - add a new doms_new */
@ -1910,6 +2224,22 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
;
}
#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
/* Build perf. domains: */
for (i = 0; i < ndoms_new; i++) {
for (j = 0; j < n && !sched_energy_update; j++) {
if (cpumask_equal(doms_new[i], doms_cur[j]) &&
cpu_rq(cpumask_first(doms_cur[j]))->rd->pd)
goto match3;
}
/* No match - add perf. domains for a new rd */
build_perf_domains(doms_new[i]);
match3:
;
}
sched_energy_start(ndoms_new, doms_new);
#endif
/* Remember the new sched domains: */
if (doms_cur != &fallback_doms)
free_sched_domains(doms_cur, ndoms_cur);

692
kernel/sched/tune.c Normal file
View File

@ -0,0 +1,692 @@
#include <linux/cgroup.h>
#include <linux/err.h>
#include <linux/kernel.h>
#include <linux/percpu.h>
#include <linux/printk.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <trace/events/sched.h>
#include "sched.h"
bool schedtune_initialized = false;
extern struct reciprocal_value schedtune_spc_rdiv;
/* We hold schedtune boost in effect for at least this long */
#define SCHEDTUNE_BOOST_HOLD_NS 50000000ULL
/*
* EAS scheduler tunables for task groups.
*
* When CGroup support is enabled, we have to synchronize two different
* paths:
* - slow path: where CGroups are created/updated/removed
* - fast path: where tasks in a CGroups are accounted
*
* The slow path tracks (a limited number of) CGroups and maps each on a
* "boost_group" index. The fastpath accounts tasks currently RUNNABLE on each
* "boost_group".
*
* Once a new CGroup is created, a boost group idx is assigned and the
* corresponding "boost_group" marked as valid on each CPU.
* Once a CGroup is release, the corresponding "boost_group" is marked as
* invalid on each CPU. The CPU boost value (boost_max) is aggregated by
* considering only valid boost_groups with a non null tasks counter.
*
* .:: Locking strategy
*
* The fast path uses a spin lock for each CPU boost_group which protects the
* tasks counter.
*
* The "valid" and "boost" values of each CPU boost_group is instead
* protected by the RCU lock provided by the CGroups callbacks. Thus, only the
* slow path can access and modify the boost_group attribtues of each CPU.
* The fast path will catch up the most updated values at the next scheduling
* event (i.e. enqueue/dequeue).
*
* |
* SLOW PATH | FAST PATH
* CGroup add/update/remove | Scheduler enqueue/dequeue events
* |
* |
* | DEFINE_PER_CPU(struct boost_groups)
* | +--------------+----+---+----+----+
* | | idle | | | | |
* | | boost_max | | | | |
* | +---->lock | | | | |
* struct schedtune allocated_groups | | | group[ ] | | | | |
* +------------------------------+ +-------+ | | +--+---------+-+----+---+----+----+
* | idx | | | | | | valid |
* | boots / prefer_idle | | | | | | boost |
* | perf_{boost/constraints}_idx | <---------+(*) | | | | tasks | <------------+
* | css | +-------+ | | +---------+ |
* +-+----------------------------+ | | | | | | |
* ^ | | | | | | |
* | +-------+ | | +---------+ |
* | | | | | | | |
* | | | | | | | |
* | +-------+ | | +---------+ |
* | zmalloc | | | | | | |
* | | | | | | | |
* | +-------+ | | +---------+ |
* + BOOSTGROUPS_COUNT | | BOOSTGROUPS_COUNT |
* schedtune_boostgroup_init() | + |
* | schedtune_{en,de}queue_task() |
* | +
* | schedtune_tasks_update()
* |
*/
/* SchdTune tunables for a group of tasks */
struct schedtune {
/* SchedTune CGroup subsystem */
struct cgroup_subsys_state css;
/* Boost group allocated ID */
int idx;
/* Boost value for tasks on that SchedTune CGroup */
int boost;
/* Hint to bias scheduling of tasks on that SchedTune CGroup
* towards idle CPUs */
int prefer_idle;
};
static inline struct schedtune *css_st(struct cgroup_subsys_state *css)
{
return css ? container_of(css, struct schedtune, css) : NULL;
}
static inline struct schedtune *task_schedtune(struct task_struct *tsk)
{
return css_st(task_css(tsk, schedtune_cgrp_id));
}
static inline struct schedtune *parent_st(struct schedtune *st)
{
return css_st(st->css.parent);
}
/*
* SchedTune root control group
* The root control group is used to defined a system-wide boosting tuning,
* which is applied to all tasks in the system.
* Task specific boost tuning could be specified by creating and
* configuring a child control group under the root one.
* By default, system-wide boosting is disabled, i.e. no boosting is applied
* to tasks which are not into a child control group.
*/
static struct schedtune
root_schedtune = {
.boost = 0,
.prefer_idle = 0,
};
/*
* Maximum number of boost groups to support
* When per-task boosting is used we still allow only limited number of
* boost groups for two main reasons:
* 1. on a real system we usually have only few classes of workloads which
* make sense to boost with different values (e.g. background vs foreground
* tasks, interactive vs low-priority tasks)
* 2. a limited number allows for a simpler and more memory/time efficient
* implementation especially for the computation of the per-CPU boost
* value
*/
#define BOOSTGROUPS_COUNT 5
/* Array of configured boostgroups */
static struct schedtune *allocated_group[BOOSTGROUPS_COUNT] = {
&root_schedtune,
NULL,
};
/* SchedTune boost groups
* Keep track of all the boost groups which impact on CPU, for example when a
* CPU has two RUNNABLE tasks belonging to two different boost groups and thus
* likely with different boost values.
* Since on each system we expect only a limited number of boost groups, here
* we use a simple array to keep track of the metrics required to compute the
* maximum per-CPU boosting value.
*/
struct boost_groups {
/* Maximum boost value for all RUNNABLE tasks on a CPU */
int boost_max;
u64 boost_ts;
struct {
/* True when this boost group maps an actual cgroup */
bool valid;
/* The boost for tasks on that boost group */
int boost;
/* Count of RUNNABLE tasks on that boost group */
unsigned tasks;
/* Timestamp of boost activation */
u64 ts;
} group[BOOSTGROUPS_COUNT];
/* CPU's boost group locking */
raw_spinlock_t lock;
};
/* Boost groups affecting each CPU in the system */
DEFINE_PER_CPU(struct boost_groups, cpu_boost_groups);
static inline bool schedtune_boost_timeout(u64 now, u64 ts)
{
return ((now - ts) > SCHEDTUNE_BOOST_HOLD_NS);
}
static inline bool
schedtune_boost_group_active(int idx, struct boost_groups* bg, u64 now)
{
if (bg->group[idx].tasks)
return true;
return !schedtune_boost_timeout(now, bg->group[idx].ts);
}
static void
schedtune_cpu_update(int cpu, u64 now)
{
struct boost_groups *bg = &per_cpu(cpu_boost_groups, cpu);
int boost_max;
u64 boost_ts;
int idx;
/* The root boost group is always active */
boost_max = bg->group[0].boost;
boost_ts = now;
for (idx = 1; idx < BOOSTGROUPS_COUNT; ++idx) {
/* Ignore non boostgroups not mapping a cgroup */
if (!bg->group[idx].valid)
continue;
/*
* A boost group affects a CPU only if it has
* RUNNABLE tasks on that CPU or it has hold
* in effect from a previous task.
*/
if (!schedtune_boost_group_active(idx, bg, now))
continue;
/* This boost group is active */
if (boost_max > bg->group[idx].boost)
continue;
boost_max = bg->group[idx].boost;
boost_ts = bg->group[idx].ts;
}
/* Ensures boost_max is non-negative when all cgroup boost values
* are neagtive. Avoids under-accounting of cpu capacity which may cause
* task stacking and frequency spikes.*/
boost_max = max(boost_max, 0);
bg->boost_max = boost_max;
bg->boost_ts = boost_ts;
}
static int
schedtune_boostgroup_update(int idx, int boost)
{
struct boost_groups *bg;
int cur_boost_max;
int old_boost;
int cpu;
u64 now;
/* Update per CPU boost groups */
for_each_possible_cpu(cpu) {
bg = &per_cpu(cpu_boost_groups, cpu);
/* CGroups are never associated to non active cgroups */
BUG_ON(!bg->group[idx].valid);
/*
* Keep track of current boost values to compute the per CPU
* maximum only when it has been affected by the new value of
* the updated boost group
*/
cur_boost_max = bg->boost_max;
old_boost = bg->group[idx].boost;
/* Update the boost value of this boost group */
bg->group[idx].boost = boost;
/* Check if this update increase current max */
now = sched_clock_cpu(cpu);
if (boost > cur_boost_max &&
schedtune_boost_group_active(idx, bg, now)) {
bg->boost_max = boost;
bg->boost_ts = bg->group[idx].ts;
trace_sched_tune_boostgroup_update(cpu, 1, bg->boost_max);
continue;
}
/* Check if this update has decreased current max */
if (cur_boost_max == old_boost && old_boost > boost) {
schedtune_cpu_update(cpu, now);
trace_sched_tune_boostgroup_update(cpu, -1, bg->boost_max);
continue;
}
trace_sched_tune_boostgroup_update(cpu, 0, bg->boost_max);
}
return 0;
}
#define ENQUEUE_TASK 1
#define DEQUEUE_TASK -1
static inline bool
schedtune_update_timestamp(struct task_struct *p)
{
if (sched_feat(SCHEDTUNE_BOOST_HOLD_ALL))
return true;
return task_has_rt_policy(p);
}
static inline void
schedtune_tasks_update(struct task_struct *p, int cpu, int idx, int task_count)
{
struct boost_groups *bg = &per_cpu(cpu_boost_groups, cpu);
int tasks = bg->group[idx].tasks + task_count;
/* Update boosted tasks count while avoiding to make it negative */
bg->group[idx].tasks = max(0, tasks);
/* Update timeout on enqueue */
if (task_count > 0) {
u64 now = sched_clock_cpu(cpu);
if (schedtune_update_timestamp(p))
bg->group[idx].ts = now;
/* Boost group activation or deactivation on that RQ */
if (bg->group[idx].tasks == 1)
schedtune_cpu_update(cpu, now);
}
trace_sched_tune_tasks_update(p, cpu, tasks, idx,
bg->group[idx].boost, bg->boost_max,
bg->group[idx].ts);
}
/*
* NOTE: This function must be called while holding the lock on the CPU RQ
*/
void schedtune_enqueue_task(struct task_struct *p, int cpu)
{
struct boost_groups *bg = &per_cpu(cpu_boost_groups, cpu);
unsigned long irq_flags;
struct schedtune *st;
int idx;
if (unlikely(!schedtune_initialized))
return;
/*
* Boost group accouting is protected by a per-cpu lock and requires
* interrupt to be disabled to avoid race conditions for example on
* do_exit()::cgroup_exit() and task migration.
*/
raw_spin_lock_irqsave(&bg->lock, irq_flags);
rcu_read_lock();
st = task_schedtune(p);
idx = st->idx;
schedtune_tasks_update(p, cpu, idx, ENQUEUE_TASK);
rcu_read_unlock();
raw_spin_unlock_irqrestore(&bg->lock, irq_flags);
}
int schedtune_can_attach(struct cgroup_taskset *tset)
{
struct task_struct *task;
struct cgroup_subsys_state *css;
struct boost_groups *bg;
struct rq_flags rq_flags;
unsigned int cpu;
struct rq *rq;
int src_bg; /* Source boost group index */
int dst_bg; /* Destination boost group index */
int tasks;
u64 now;
if (unlikely(!schedtune_initialized))
return 0;
cgroup_taskset_for_each(task, css, tset) {
/*
* Lock the CPU's RQ the task is enqueued to avoid race
* conditions with migration code while the task is being
* accounted
*/
rq = task_rq_lock(task, &rq_flags);
if (!task->on_rq) {
task_rq_unlock(rq, task, &rq_flags);
continue;
}
/*
* Boost group accouting is protected by a per-cpu lock and requires
* interrupt to be disabled to avoid race conditions on...
*/
cpu = cpu_of(rq);
bg = &per_cpu(cpu_boost_groups, cpu);
raw_spin_lock(&bg->lock);
dst_bg = css_st(css)->idx;
src_bg = task_schedtune(task)->idx;
/*
* Current task is not changing boostgroup, which can
* happen when the new hierarchy is in use.
*/
if (unlikely(dst_bg == src_bg)) {
raw_spin_unlock(&bg->lock);
task_rq_unlock(rq, task, &rq_flags);
continue;
}
/*
* This is the case of a RUNNABLE task which is switching its
* current boost group.
*/
/* Move task from src to dst boost group */
tasks = bg->group[src_bg].tasks - 1;
bg->group[src_bg].tasks = max(0, tasks);
bg->group[dst_bg].tasks += 1;
/* Update boost hold start for this group */
now = sched_clock_cpu(cpu);
bg->group[dst_bg].ts = now;
/* Force boost group re-evaluation at next boost check */
bg->boost_ts = now - SCHEDTUNE_BOOST_HOLD_NS;
raw_spin_unlock(&bg->lock);
task_rq_unlock(rq, task, &rq_flags);
}
return 0;
}
void schedtune_cancel_attach(struct cgroup_taskset *tset)
{
/* This can happen only if SchedTune controller is mounted with
* other hierarchies ane one of them fails. Since usually SchedTune is
* mouted on its own hierarcy, for the time being we do not implement
* a proper rollback mechanism */
WARN(1, "SchedTune cancel attach not implemented");
}
/*
* NOTE: This function must be called while holding the lock on the CPU RQ
*/
void schedtune_dequeue_task(struct task_struct *p, int cpu)
{
struct boost_groups *bg = &per_cpu(cpu_boost_groups, cpu);
unsigned long irq_flags;
struct schedtune *st;
int idx;
if (unlikely(!schedtune_initialized))
return;
/*
* Boost group accouting is protected by a per-cpu lock and requires
* interrupt to be disabled to avoid race conditions on...
*/
raw_spin_lock_irqsave(&bg->lock, irq_flags);
rcu_read_lock();
st = task_schedtune(p);
idx = st->idx;
schedtune_tasks_update(p, cpu, idx, DEQUEUE_TASK);
rcu_read_unlock();
raw_spin_unlock_irqrestore(&bg->lock, irq_flags);
}
int schedtune_cpu_boost(int cpu)
{
struct boost_groups *bg;
u64 now;
bg = &per_cpu(cpu_boost_groups, cpu);
now = sched_clock_cpu(cpu);
/* Check to see if we have a hold in effect */
if (schedtune_boost_timeout(now, bg->boost_ts))
schedtune_cpu_update(cpu, now);
return bg->boost_max;
}
int schedtune_task_boost(struct task_struct *p)
{
struct schedtune *st;
int task_boost;
if (unlikely(!schedtune_initialized))
return 0;
/* Get task boost value */
rcu_read_lock();
st = task_schedtune(p);
task_boost = st->boost;
rcu_read_unlock();
return task_boost;
}
int schedtune_prefer_idle(struct task_struct *p)
{
struct schedtune *st;
int prefer_idle;
if (unlikely(!schedtune_initialized))
return 0;
/* Get prefer_idle value */
rcu_read_lock();
st = task_schedtune(p);
prefer_idle = st->prefer_idle;
rcu_read_unlock();
return prefer_idle;
}
static u64
prefer_idle_read(struct cgroup_subsys_state *css, struct cftype *cft)
{
struct schedtune *st = css_st(css);
return st->prefer_idle;
}
static int
prefer_idle_write(struct cgroup_subsys_state *css, struct cftype *cft,
u64 prefer_idle)
{
struct schedtune *st = css_st(css);
st->prefer_idle = !!prefer_idle;
return 0;
}
static s64
boost_read(struct cgroup_subsys_state *css, struct cftype *cft)
{
struct schedtune *st = css_st(css);
return st->boost;
}
static int
boost_write(struct cgroup_subsys_state *css, struct cftype *cft,
s64 boost)
{
struct schedtune *st = css_st(css);
if (boost < 0 || boost > 100)
return -EINVAL;
st->boost = boost;
/* Update CPU boost */
schedtune_boostgroup_update(st->idx, st->boost);
return 0;
}
static struct cftype files[] = {
{
.name = "boost",
.read_s64 = boost_read,
.write_s64 = boost_write,
},
{
.name = "prefer_idle",
.read_u64 = prefer_idle_read,
.write_u64 = prefer_idle_write,
},
{ } /* terminate */
};
static void
schedtune_boostgroup_init(struct schedtune *st, int idx)
{
struct boost_groups *bg;
int cpu;
/* Initialize per CPUs boost group support */
for_each_possible_cpu(cpu) {
bg = &per_cpu(cpu_boost_groups, cpu);
bg->group[idx].boost = 0;
bg->group[idx].valid = true;
bg->group[idx].ts = 0;
}
/* Keep track of allocated boost groups */
allocated_group[idx] = st;
st->idx = idx;
}
static struct cgroup_subsys_state *
schedtune_css_alloc(struct cgroup_subsys_state *parent_css)
{
struct schedtune *st;
int idx;
if (!parent_css)
return &root_schedtune.css;
/* Allow only single level hierachies */
if (parent_css != &root_schedtune.css) {
pr_err("Nested SchedTune boosting groups not allowed\n");
return ERR_PTR(-ENOMEM);
}
/* Allow only a limited number of boosting groups */
for (idx = 1; idx < BOOSTGROUPS_COUNT; ++idx)
if (!allocated_group[idx])
break;
if (idx == BOOSTGROUPS_COUNT) {
pr_err("Trying to create more than %d SchedTune boosting groups\n",
BOOSTGROUPS_COUNT);
return ERR_PTR(-ENOSPC);
}
st = kzalloc(sizeof(*st), GFP_KERNEL);
if (!st)
goto out;
/* Initialize per CPUs boost group support */
schedtune_boostgroup_init(st, idx);
return &st->css;
out:
return ERR_PTR(-ENOMEM);
}
static void
schedtune_boostgroup_release(struct schedtune *st)
{
struct boost_groups *bg;
int cpu;
/* Reset per CPUs boost group support */
for_each_possible_cpu(cpu) {
bg = &per_cpu(cpu_boost_groups, cpu);
bg->group[st->idx].valid = false;
bg->group[st->idx].boost = 0;
}
/* Keep track of allocated boost groups */
allocated_group[st->idx] = NULL;
}
static void
schedtune_css_free(struct cgroup_subsys_state *css)
{
struct schedtune *st = css_st(css);
/* Release per CPUs boost group support */
schedtune_boostgroup_release(st);
kfree(st);
}
struct cgroup_subsys schedtune_cgrp_subsys = {
.css_alloc = schedtune_css_alloc,
.css_free = schedtune_css_free,
.can_attach = schedtune_can_attach,
.cancel_attach = schedtune_cancel_attach,
.legacy_cftypes = files,
.early_init = 1,
};
static inline void
schedtune_init_cgroups(void)
{
struct boost_groups *bg;
int cpu;
/* Initialize the per CPU boost groups */
for_each_possible_cpu(cpu) {
bg = &per_cpu(cpu_boost_groups, cpu);
memset(bg, 0, sizeof(struct boost_groups));
bg->group[0].valid = true;
raw_spin_lock_init(&bg->lock);
}
pr_info("schedtune: configured to support %d boost groups\n",
BOOSTGROUPS_COUNT);
schedtune_initialized = true;
}
/*
* Initialize the cgroup structures
*/
static int
schedtune_init(void)
{
schedtune_spc_rdiv = reciprocal_value(100);
schedtune_init_cgroups();
return 0;
}
postcore_initcall(schedtune_init);

37
kernel/sched/tune.h Normal file
View File

@ -0,0 +1,37 @@
#ifdef CONFIG_SCHED_TUNE
#include <linux/reciprocal_div.h>
/*
* System energy normalization constants
*/
struct target_nrg {
unsigned long min_power;
unsigned long max_power;
struct reciprocal_value rdiv;
};
int schedtune_cpu_boost(int cpu);
int schedtune_task_boost(struct task_struct *tsk);
int schedtune_prefer_idle(struct task_struct *tsk);
void schedtune_enqueue_task(struct task_struct *p, int cpu);
void schedtune_dequeue_task(struct task_struct *p, int cpu);
unsigned long boosted_cpu_util(int cpu, unsigned long other_util);
#else /* CONFIG_SCHED_TUNE */
#define schedtune_cpu_boost(cpu) 0
#define schedtune_task_boost(tsk) 0
#define schedtune_prefer_idle(tsk) 0
#define schedtune_enqueue_task(task, cpu) do { } while (0)
#define schedtune_dequeue_task(task, cpu) do { } while (0)
#define boosted_cpu_util(cpu, other_util) cpu_util_cfs(cpu_rq(cpu))
#endif /* CONFIG_SCHED_TUNE */

View File

@ -320,6 +320,13 @@ static struct ctl_table kern_table[] = {
.proc_handler = proc_dointvec,
},
#ifdef CONFIG_SCHED_DEBUG
{
.procname = "sched_cstate_aware",
.data = &sysctl_sched_cstate_aware,
.maxlen = sizeof(unsigned int),
.mode = 0644,
.proc_handler = proc_dointvec,
},
{
.procname = "sched_min_granularity_ns",
.data = &sysctl_sched_min_granularity,
@ -338,6 +345,13 @@ static struct ctl_table kern_table[] = {
.extra1 = &min_sched_granularity_ns,
.extra2 = &max_sched_granularity_ns,
},
{
.procname = "sched_sync_hint_enable",
.data = &sysctl_sched_sync_hint_enable,
.maxlen = sizeof(unsigned int),
.mode = 0644,
.proc_handler = proc_dointvec,
},
{
.procname = "sched_wakeup_granularity_ns",
.data = &sysctl_sched_wakeup_granularity,
@ -466,6 +480,17 @@ static struct ctl_table kern_table[] = {
.extra1 = &one,
},
#endif
#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
{
.procname = "sched_energy_aware",
.data = &sysctl_sched_energy_aware,
.maxlen = sizeof(unsigned int),
.mode = 0644,
.proc_handler = sched_energy_aware_handler,
.extra1 = &zero,
.extra2 = &one,
},
#endif
#ifdef CONFIG_PROVE_LOCKING
{
.procname = "prove_locking",