Commit Graph

215 Commits

Author SHA1 Message Date
Rafael J. Wysocki
106a2662e6 cpuidle: governors: teo: Rearrange stopped tick handling
This change is based on the observation that it is not in fact necessary
to select a deep idle state every time the scheduler tick has been
stopped before the idle state selection takes place.  Namely, if the
time till the closest timer (that is not the tick) is short enough,
a shallow idle state can be selected because the timer will kick the
CPU out of that state, so the damage from a possible overly optimistic
selection will be limited.

Update the teo governor in accordance with the above in analogy with
the previous analogous menu governor update.

Among other things, this will cause the teo governor to call
tick_nohz_get_sleep_length() every time when the tick has been
stopped already and only change the original idle state selection
if the time till the closest timer is beyond SAFE_TIMER_RANGE_NS
which is way more straightforward than the current code flow.

Of course, this effectively throws away some of the recent teo governor
changes made recently, but the resulting simplification is worth it in
my view.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/1865078.VLH7GnMWUR@rafael.j.wysocki
2026-03-05 15:23:16 +01:00
Rafael J. Wysocki
e57c2bf2e8 cpuidle: governors: menu: Refine stopped tick handling
This change is based on the observation that it is not in fact necessary
to select a deep idle state every time the scheduler tick has been
stopped before the idle state selection takes place.  Namely, if the
time till the closest timer (that is not the tick) is short enough,
a shallow idle state can be selected because the timer will kick the
CPU out of that state, so the damage from a possible overly optimistic
selection will be limited.

Update the menu governor in accordance with the above and use twice
the tick period length as the "safe timer range" for allowing the
original predicted_ns value to be used even if the tick has been
stopped.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/3341782.5fSG56mABF@rafael.j.wysocki
2026-03-05 15:23:16 +01:00
Christian Loehle
93983a9f3b cpuidle: menu: Remove single state handling
cpuidle systems where the governor has no choice because there's only
a single idle state are now handled by cpuidle core and bypass the
governor, so remove the related handling.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
[ rjw: Rebase on top of the cpuidle changes merged recently ]
Link: https://patch.msgid.link/20260216185005.1131593-5-aboorvad@linux.ibm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2026-02-17 15:49:53 +01:00
Christian Loehle
825d5d3479 cpuidle: teo: Remove single state handling
cpuidle systems where the governor has no choice because there's only
a single idle state are now handled by cpuidle core and bypass the
governor, so remove the related handling.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20260216185005.1131593-4-aboorvad@linux.ibm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2026-02-17 15:49:53 +01:00
Aboorva Devarajan
9b9c0ff095 cpuidle: haltpoll: Remove single state handling
cpuidle systems where the governor has no choice because there's only
a single idle state are now handled by cpuidle core and bypass the
governor, so remove the related handling.

Signed-off-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
[ rjw: Extended the change to drop a redundant local variable ]
Link: https://patch.msgid.link/20260216185005.1131593-3-aboorvad@linux.ibm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2026-02-17 15:49:24 +01:00
Rafael J. Wysocki
a971f984b8 cpuidle: governors: teo: Refine intercepts-based idle state lookup
There are cases in which decisions made by the teo governor are
arguably overly conservative.

For instance, suppose that there are 4 idle states and the values of
the intercepts metric for the first 3 of them are 400, 250, and 251,
respectively.  If the total sum computed in teo_update() is 1000, the
governor will select idle state 1 (provided that all idle states are
enabled and the scheduler tick has not been stopped) although arguably
idle state 0 would be a better choice because the likelihood of getting
an idle duration below the target residency of idle state 1 is greater
than the likelihood of getting an idle duration between the target
residency of idle state 1 and the target residency of idle state 2.

To address this, refine the candidate idle state lookup based on
intercepts to start at the state with the maximum intercepts metric,
below the deepest enabled one, to avoid the cases in which the search
may stop before reaching that state.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
[ rjw: Fixed typo "intercetps" in new comments (3 places) ]
Link: https://patch.msgid.link/2417298.ElGaqSPkdT@rafael.j.wysocki
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2026-01-30 20:15:52 +01:00
Rafael J. Wysocki
f36de72673 cpuidle: governors: teo: Adjust the classification of wakeup events
If differences between target residency values of adjacent idle states
of a given CPU are relatively large, the corresponding idle state bins
used by the teo governors are large either and the rule by which hits
are distinguished from intercepts is inaccurate.

Namely, by that rule, a wakeup event is classified as a hit if the
sleep length (the time till the closest timer other than the tick)
and the measured idle duration, adjusted for the entered idle state
exit latency, fall into the same idle state bin.  However, if that bin
is large enough, the actual difference between the sleep length and
the measured idle duration may be significant.  It may in fact be
significantly greater than the analogous difference for an event where
the sleep length and the measured idle duration fall into different
bins.

For this reason, amend the rule in question with a check that will only
allow a wakeup event to be counted as a hit if the sleep length is less
than the "raw" measured idle duration (which means that the wakeup
appears to have occurred after the anticipated timer event).  Otherwise,
the event will be counted as an intercept.

Also update the documentation part explaining the difference between
"hits" and "intercepts" to take the above change into account.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/5093379.31r3eYUQgx@rafael.j.wysocki
2026-01-30 20:15:43 +01:00
Rafael J. Wysocki
475ca3470b cpuidle: governors: teo: Refine tick_intercepts vs total events check
Use 2/3 as the proportion coefficient in the check comparing
cpu_data->tick_intercepts with cpu_data->total because it is close
enough to the current one (5/8) and it allows of more straightforward
interpretation (on average, intercepts within the tick period length
are twice as frequent as other events).

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/10793374.nUPlyArG6x@rafael.j.wysocki
2026-01-23 21:50:38 +01:00
Rafael J. Wysocki
60836533b4 cpuidle: governors: teo: Avoid fake intercepts produced by tick
Tick wakeups can lead to fake intercepts that may skew idle state
selection towards shallow states, so it is better to avoid counting
them as intercepts.

For this purpose, add a check causing teo_update() to only count
tick wakeups as intercepts if intercepts within the tick period
range are at least twice as frequent as any other events.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/3404606.44csPzL39Z@rafael.j.wysocki
2026-01-23 21:50:38 +01:00
Rafael J. Wysocki
4bd2221f23 cpuidle: governors: teo: Avoid selecting states with zero-size bins
If the last two enabled idle states have the same target residency which
is at least equal to TICK_NSEC, teo may select the next-to-last one even
though the size of that state's bin is 0, which is confusing.

Prevent that from happening by adding a target residency check to the
relevant code path.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
[ rjw: Fixed a typo in the changelog ]
Link: https://patch.msgid.link/3033265.e9J7NaK4W3@rafael.j.wysocki
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2026-01-23 21:49:54 +01:00
Rafael J. Wysocki
80606f4eb8 cpuidle: governors: menu: Always check timers with tick stopped
After commit 5484e31bbb ("cpuidle: menu: Skip tick_nohz_get_sleep_length()
call in some cases"), if the return value of get_typical_interval()
multiplied by NSEC_PER_USEC is not greater than RESIDENCY_THRESHOLD_NS,
the menu governor will skip computing the time till the closest timer.
If that happens when the tick has been stopped already, the selected
idle state may be too deep due to the subsequent check comparing
predicted_ns with TICK_NSEC and causing its value to be replaced with
the expected time till the closest timer, which is KTIME_MAX in that
case.  That will cause the deepest enabled idle state to be selected,
but the time till the closest timer very well may be shorter than the
target residency of that state, in which case a shallower state should
be used.

Address this by making menu_select() always compute the time till the
closest timer when the tick has been stopped.

Also move the predicted_ns check mentioned above into the branch in
which the time till the closest timer is determined because it only
needs to be done in that case.

Fixes: 5484e31bbb ("cpuidle: menu: Skip tick_nohz_get_sleep_length() call in some cases")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/5959091.DvuYhMxLoT@rafael.j.wysocki
2026-01-23 21:22:42 +01:00
Breno Leitao
fcbd7897b8 cpuidle: menu: Remove incorrect unlikely() annotation
The unlikely() annotation on the early-return condition in menu_select()
is incorrect on systems with only one idle state (e.g., ARM64 servers
with a single ACPI LPI state). Branch profiling shows 100% misprediction
on such systems since drv->state_count <= 1 is always true.

On platforms where only state0 is available, this path is the common
case, not an unlikely edge case. Remove the misleading annotation to
let the branch predictor learn the actual behavior.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260105-annotated_idle-v1-1-10ddf0771b58@debian.org
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2026-01-09 21:52:54 +01:00
Rafael J. Wysocki
15bfdadd61 cpuidle: governors: teo: Add missing space to the description
There is a missing space in the governor description comment, so add it.

No functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/5059034.31r3eYUQgx@rafael.j.wysocki
2025-11-24 20:43:32 +01:00
Rafael J. Wysocki
d834e68a0e cpuidle: governors: teo: Simplify intercepts-based state lookup
Simplify the loop looking up a candidate idle state in the case when an
intercept is likely to occur by adding a search for the state index limit
if the tick is stopped before it.

First, call tick_nohz_tick_stopped() just once and if it returns true,
look for the shallowest state index below the current candidate one with
target residency at least equal to the tick period length.

Next, simply look for a state that is not shallower than the one found
in the previous step and satisfies the intercepts majority condition (if
there are no such states, the shallowest state that is not shallower
than the one found in the previous step becomes the new candidate).

Since teo_state_ok() has no callers any more after the above changes,
drop it.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
[ rjw: Changelog clarification and code comment edit ]
Link: https://patch.msgid.link/2418792.ElGaqSPkdT@rafael.j.wysocki
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-11-20 16:32:48 +01:00
Rafael J. Wysocki
50db438231 cpuidle: governors: teo: Fix tick_intercepts handling in teo_update()
The condition deciding whether or not to increase cpu_data->tick_intercepts
in teo_update() is reverse, so fix it.

Fixes: d619b5cc67 ("cpuidle: teo: Simplify counting events used for tick management")
Cc: 6.14+ <stable@vger.kernel.org> # 6.14+: 0796ddf4a7f0: cpuidle: teo: Use this_cpu_ptr() where possible
Cc: 6.14+ <stable@vger.kernel.org> # 6.14+: 8f3f01082d7a: cpuidle: governors: teo: Use s64 consistently in teo_update()
Cc: 6.14+ <stable@vger.kernel.org> # 6.14+: b54df61c7428: cpuidle: governors: teo: Decay metrics below DECAY_SHIFT threshold
Cc: 6.14+ <stable@vger.kernel.org> 6.14+: 083654ded547: cpuidle: governors: teo: Rework the handling of tick wakeups
Cc: 6.14+ <stable@vger.kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/5085160.31r3eYUQgx@rafael.j.wysocki
2025-11-20 14:54:08 +01:00
Rafael J. Wysocki
083654ded5 cpuidle: governors: teo: Rework the handling of tick wakeups
If the wakeup pattern is clearly dominated by tick wakeups, count those
wakeups as hits on the deepest available idle state to increase the
likelihood of stopping the tick, especially on systems where there are
only 2 usable idle states and the tick can only be stopped when the
deeper state is selected.

This change is expected to reduce power on some systems where state 0 is
selected relatively often even though they are almost idle.  Without it,
the governor may end up selecting the shallowest idle state all the time
even if the system is almost completely idle due all tick wakeups being
counted as hits on that state and preventing the tick from being stopped
at all.

Fixes: 4b20b07ce7 ("cpuidle: teo: Don't count non-existent intercepts")
Reported-by: Reka Norman <rekanorman@chromium.org>
Closes: https://lore.kernel.org/linux-pm/CAEmPcwsNMNnNXuxgvHTQ93Mx-q3Oz9U57THQsU_qdcCx1m4w5g@mail.gmail.com/
Tested-by: Reka Norman <rekanorman@chromium.org>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Cc: 6.11+ <stable@vger.kernel.org> # 6.11+: 92ce5c07b7a1: cpuidle: teo: Reorder candidate state index checks
Cc: 6.11+ <stable@vger.kernel.org> # 6.11+: ea185406d1ed: cpuidle: teo: Combine candidate state index checks against 0
Cc: 6.11+ <stable@vger.kernel.org> # 6.11+: b9a6af26bd83: cpuidle: teo: Drop local variable prev_intercept_idx
Cc: 6.11+ <stable@vger.kernel.org> # 6.11+: e24f8a55de50: cpuidle: teo: Clarify two code comments
Cc: 6.11+ <stable@vger.kernel.org> # 6.11+: d619b5cc6780: cpuidle: teo: Simplify counting events used for tick management
Cc: 6.11+ <stable@vger.kernel.org> # 6.11+: 13ed5c4a6d9c: cpuidle: teo: Skip getting the sleep length if wakeups are very frequent
Cc: 6.11+ <stable@vger.kernel.org> # 6.11+: ddcfa7964677: cpuidle: teo: Simplify handling of total events count
Cc: 6.11+ <stable@vger.kernel.org> # 6.11+: 65e18e654475: cpuidle: teo: Replace time_span_ns with a flag
Cc: 6.11+ <stable@vger.kernel.org> # 6.11+: 0796ddf4a7f0: cpuidle: teo: Use this_cpu_ptr() where possible
Cc: 6.11+ <stable@vger.kernel.org> # 6.11+: 8f3f01082d7a: cpuidle: governors: teo: Use s64 consistently in teo_update()
Cc: 6.11+ <stable@vger.kernel.org> # 6.11+: b54df61c7428: cpuidle: governors: teo: Decay metrics below DECAY_SHIFT threshold
Cc: 6.11+ <stable@vger.kernel.org> # 6.11+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
[ rjw: Rebase on commit 0796ddf4a7, changelog update ]
Link: https://patch.msgid.link/6228387.lOV4Wx5bFT@rafael.j.wysocki
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-11-20 14:49:57 +01:00
Rafael J. Wysocki
b54df61c74 cpuidle: governors: teo: Decay metrics below DECAY_SHIFT threshold
If a given governor metric falls below a certain value (8 for
DECAY_SHIFT equal to 3), it will not decay any more due to the
simplistic decay implementation.  This may in some cases lead to
subtle inconsistencies in the governor behavior, so change the
decay implementation to take it into account and set the metric
at hand to 0 in that case.

Suggested-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/2819353.mvXUDI8C0e@rafael.j.wysocki
2025-11-14 15:20:01 +01:00
Rafael J. Wysocki
8f3f01082d cpuidle: governors: teo: Use s64 consistently in teo_update()
Two local variables in teo_update() are defined as u64, but their
values are then compared with s64 values, so it is more consistent
to use s64 as their data type.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/3026616.e9J7NaK4W3@rafael.j.wysocki
2025-11-14 15:20:01 +01:00
Rafael J. Wysocki
17673f64a0 cpuidle: governors: teo: Drop redundant function parameter
The last no_poll parameter of teo_find_shallower_state() is always
false, so drop it.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/2253109.irdbgypaU6@rafael.j.wysocki
2025-11-14 15:20:01 +01:00
Rafael J. Wysocki
a03b201180 cpuidle: governors: teo: Drop misguided target residency check
When the target residency of the current candidate idle state is
greater than the expected time till the closest timer (the sleep
length), it does not matter whether or not the tick has already been
stopped or if it is going to be stopped.  The closest timer will
trigger anyway at its due time, so if an idle state with target
residency above the sleep length is selected, energy will be wasted
and there may be excess latency.

Of course, if the closest timer were canceled before it could trigger,
a deeper idle state would be more suitable, but this is not expected
to happen (generally speaking, hrtimers are not expected to be
canceled as a rule).

Accordingly, the teo_state_ok() check done in that case causes energy to
be wasted more often than it allows any energy to be saved (if it allows
any energy to be saved at all), so drop it and let the governor use the
teo_find_shallower_state() return value as the new candidate idle state
index.

Fixes: 21d28cd2fa ("cpuidle: teo: Do not call tick_nohz_get_sleep_length() upfront")
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/5955081.DvuYhMxLoT@rafael.j.wysocki
2025-11-14 15:19:39 +01:00
Christian Loehle
0796ddf4a7 cpuidle: teo: Use this_cpu_ptr() where possible
The cpuidle governor callbacks for update, select and reflect
are always running on the actual idle entering/exiting CPU, so
use the more optimized this_cpu_ptr() to access the internal teo
data.

This brings down the latency-critical teo_reflect() from
static void teo_reflect(struct cpuidle_device *dev, int state)
{
ffffffc080ffcff0:	hint	#0x19
ffffffc080ffcff4:	stp	x29, x30, [sp, #-48]!
	struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
ffffffc080ffcff8:	adrp	x2, ffffffc0848c0000 <gicv5_global_data+0x28>
{
ffffffc080ffcffc:	add	x29, sp, #0x0
ffffffc080ffd000:	stp	x19, x20, [sp, #16]
ffffffc080ffd004:	orr	x20, xzr, x0
	struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
ffffffc080ffd008:	add	x0, x2, #0xc20
{
ffffffc080ffd00c:	stp	x21, x22, [sp, #32]
	struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
ffffffc080ffd010:	adrp	x19, ffffffc083eb5000 <cpu_devices+0x78>
ffffffc080ffd014:	add	x19, x19, #0xbb0
ffffffc080ffd018:	ldr	w3, [x20, #4]

	dev->last_state_idx = state;

to

static void teo_reflect(struct cpuidle_device *dev, int state)
{
ffffffc080ffd034:	hint	#0x19
ffffffc080ffd038:	stp	x29, x30, [sp, #-48]!
ffffffc080ffd03c:	add	x29, sp, #0x0
ffffffc080ffd040:	stp	x19, x20, [sp, #16]
ffffffc080ffd044:	orr	x20, xzr, x0
	struct teo_cpu *cpu_data = this_cpu_ptr(&teo_cpus);
ffffffc080ffd048:	adrp	x19, ffffffc083eb5000 <cpu_devices+0x78>
{
ffffffc080ffd04c:	stp	x21, x22, [sp, #32]
	struct teo_cpu *cpu_data = this_cpu_ptr(&teo_cpus);
ffffffc080ffd050:	add	x19, x19, #0xbb0

	dev->last_state_idx = state;

This saves us:
	adrp    x2, ffffffc0848c0000 <gicv5_global_data+0x28>
	add     x0, x2, #0xc20
	ldr     w3, [x20, #4]

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
[ rjw: Subject tweak ]
Link: https://patch.msgid.link/20251110120819.714560-1-christian.loehle@arm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-11-12 21:02:07 +01:00
Aboorva Devarajan
07d8157012 cpuidle: menu: Use residency threshold in polling state override decisions
On virtualized PowerPC (pseries) systems, where only one polling state
(Snooze) and one deep state (CEDE) are available, selecting CEDE when
the predicted idle duration is less than the target residency of CEDE
state can hurt performance. In such cases, the entry/exit overhead of
CEDE outweighs the power savings, leading to unnecessary state
transitions and higher latency.

Menu governor currently contains a special-case rule that prioritizes
the first non-polling state over polling, even when its target residency
is much longer than the predicted idle duration. On PowerPC/pseries,
where the gap between the polling state (Snooze) and the first non-polling
state (CEDE) is large, this behavior causes performance regressions.

Refine that special case by adding an extra requirement: the first
non-polling state can only be chosen if its target residency is below
the defined RESIDENCY_THRESHOLD_NS. If this condition is not satisfied,
polling is allowed instead, avoiding suboptimal non-polling state
entries.

This change is limited to the single special-case rule for the first
non-polling state. The general non-polling state selection logic in the
menu governor remains unchanged.

Performance improvement observed with pgbench on PowerPC (pseries)
system:
+---------------------------+------------+------------+------------+
| Metric                    | Baseline   | Patched    | Change (%) |
+---------------------------+------------+------------+------------+
| Transactions/sec (TPS)    | 495,210    | 536,982    | +8.45%     |
| Avg latency (ms)          | 0.163      | 0.150      | -7.98%     |
+---------------------------+------------+------------+------------+

CPUIdle state usage:
+--------------+--------------+-------------+
| Metric       | Baseline     | Patched     |
+--------------+--------------+-------------+
| Total usage  | 12,735,820   | 13,918,442  |
| Above usage  | 11,401,520   | 1,598,210   |
| Below usage  | 20,145       | 702,395     |
+--------------+--------------+-------------+

Above/Total and Below/Total usage percentages:
+------------------------+-----------+---------+
| Metric                 | Baseline  | Patched |
+------------------------+-----------+---------+
| Above % (Above/Total)  | 89.56%    | 11.49%  |
| Below % (Below/Total)  | 0.16%     | 5.05%   |
| Total cpuidle miss (%) | 89.72%    | 16.54%  |
+------------------------+-----------+---------+

The results indicate that restricting CEDE selection to cases where
its residency matches the predicted idle time reduces mispredictions,
lowers unnecessary state transitions, and improves overall throughput.

Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
[ rjw: Changelog edits, rebase ]
Link: https://patch.msgid.link/20251006013954.17972-1-aboorvad@linux.ibm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-10-27 14:53:46 +01:00
Rafael J. Wysocki
db86f55bf8 cpuidle: governors: menu: Select polling state in some more cases
A throughput regression of 11% introduced by commit 779b1a1cb1 ("cpuidle:
governors: menu: Avoid selecting states with too much latency") has been
reported and it is related to the case when the menu governor checks if
selecting a proper idle state instead of a polling one makes sense.

In particular, it is questionable to do so if the exit latency of the
idle state in question exceeds the predicted idle duration, so add a
check for that, which is sufficient to make the reported regression go
away, and update the related code comment accordingly.

Fixes: 779b1a1cb1 ("cpuidle: governors: menu: Avoid selecting states with too much latency")
Closes: https://lore.kernel.org/linux-pm/004501dc43c9$ec8aa930$c59ffb90$@telus.net/
Reported-by: Doug Smythies <dsmythies@telus.net>
Tested-by: Doug Smythies <dsmythies@telus.net>
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/12786727.O9o76ZdvQC@rafael.j.wysocki
2025-10-27 14:41:27 +01:00
Rafael J. Wysocki
10fad40122 Revert "cpuidle: menu: Avoid discarding useful information"
It is reported that commit 85975daeaa ("cpuidle: menu: Avoid discarding
useful information") led to a performance regression on Intel Jasper Lake
systems because it reduced the time spent by CPUs in idle state C7 which
is correlated to the maximum frequency the CPUs can get to because of an
average running power limit [1].

Before that commit, get_typical_interval() would have returned UINT_MAX
whenever it had been unable to make a high-confidence prediction which
had led to selecting the deepest available idle state too often and
both power and performance had been inadequate as a result of that on
some systems.  However, this had not been a problem on systems with
relatively aggressive average running power limits, like the Jasper Lake
systems in question, because on those systems it was compensated by the
ability to run CPUs faster.

It was addressed by causing get_typical_interval() to return a number
based on the recent idle duration information available to it even if it
could not make a high-confidence prediction, but that clearly did not
take the possible correlation between idle power and available CPU
capacity into account.

For this reason, revert most of the changes made by commit 85975daeaa,
except for one cosmetic cleanup, and add a comment explaining the
rationale for returning UINT_MAX from get_typical_interval() when it
is unable to make a high-confidence prediction.

Fixes: 85975daeaa ("cpuidle: menu: Avoid discarding useful information")
Closes: https://lore.kernel.org/linux-pm/36iykr223vmcfsoysexug6s274nq2oimcu55ybn6ww4il3g3cv@cohflgdbpnq7/ [1]
Reported-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/3663603.iIbC2pHGDl@rafael.j.wysocki
2025-10-20 21:27:16 +02:00
Rafael J. Wysocki
17224c1d25 cpuidle: governors: menu: Rearrange main loop in menu_select()
Reduce the indentation level in the main loop of menu_select() by
rearranging some checks and assignments in it.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/2389215.ElGaqSPkdT@rafael.j.wysocki
2025-08-21 22:02:28 +02:00
Rafael J. Wysocki
779b1a1cb1 cpuidle: governors: menu: Avoid selecting states with too much latency
Occasionally, the exit latency of the idle state selected by the menu
governor may exceed the PM QoS CPU wakeup latency limit.  Namely, if the
scheduler tick has been stopped already and predicted_ns is greater than
the tick period length, the governor may return an idle state whose exit
latency exceeds latency_req because that decision is made before
checking the current idle state's exit latency.

For instance, say that there are 3 idle states, 0, 1, and 2.  For idle
states 0 and 1, the exit latency is equal to the target residency and
the values are 0 and 5 us, respectively.  State 2 is deeper and has the
exit latency and target residency of 200 us and 2 ms (which is greater
than the tick period length), respectively.

Say that predicted_ns is equal to TICK_NSEC and the PM QoS latency
limit is 20 us.  After the first two iterations of the main loop in
menu_select(), idx becomes 1 and in the third iteration of it the target
residency of the current state (state 2) is greater than predicted_ns.
State 2 is not a polling one and predicted_ns is not less than TICK_NSEC,
so the check on whether or not the tick has been stopped is done.  Say
that the tick has been stopped already and there are no imminent timers
(that is, delta_tick is greater than the target residency of state 2).
In that case, idx becomes 2 and it is returned immediately, but the exit
latency of state 2 exceeds the latency limit.

Address this issue by modifying the code to compare the exit latency of
the current idle state (idle state i) with the latency limit before
comparing its target residency with predicted_ns, which allows one
more exit_latency_ns check that becomes redundant to be dropped.

However, after the above change, latency_req cannot take the predicted_ns
value any more, which takes place after commit 38f83090f5 ("cpuidle:
menu: Remove iowait influence"), because it may cause a polling state
to be returned prematurely.

In the context of the previous example say that predicted_ns is 3000 and
the PM QoS latency limit is still 20 us.  Additionally, say that idle
state 0 is a polling one.  Moving the exit_latency_ns check before the
target_residency_ns one causes the loop to terminate in the second
iteration, before the target_residency_ns check, so idle state 0 will be
returned even though previously state 1 would be returned if there were
no imminent timers.

For this reason, remove the assignment of the predicted_ns value to
latency_req from the code.

Fixes: 5ef499cd57 ("cpuidle: menu: Handle stopped tick more aggressively")
Cc: 4.17+ <stable@vger.kernel.org> # 4.17+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/5043159.31r3eYUQgx@rafael.j.wysocki
2025-08-18 19:04:25 +02:00
Rafael J. Wysocki
fa3fa55de0 cpuidle: governors: menu: Avoid using invalid recent intervals data
Marc has reported that commit 85975daeaa ("cpuidle: menu: Avoid
discarding useful information") caused the number of wakeup interrupts
to increase on an idle system [1], which was not expected to happen
after merely allowing shallower idle states to be selected by the
governor in some cases.

However, on the system in question, all of the idle states deeper than
WFI are rejected by the driver due to a firmware issue [2].  This causes
the governor to only consider the recent interval duriation data
corresponding to attempts to enter WFI that are successful and the
recent invervals table is filled with values lower than the scheduler
tick period.  Consequently, the governor predicts an idle duration
below the scheduler tick period length and avoids stopping the tick
more often which leads to the observed symptom.

Address it by modifying the governor to update the recent intervals
table also when entering the previously selected idle state fails, so
it knows that the short idle intervals might have been the minority
had the selected idle states been actually entered every time.

Fixes: 85975daeaa ("cpuidle: menu: Avoid discarding useful information")
Link: https://lore.kernel.org/linux-pm/86o6sv6n94.wl-maz@kernel.org/ [1]
Link: https://lore.kernel.org/linux-pm/7ffcb716-9a1b-48c2-aaa4-469d0df7c792@arm.com/ [2]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/2793874.mvXUDI8C0e@rafael.j.wysocki
2025-08-11 21:46:14 +02:00
Zhongqiu Han
d4a7882f93 cpuidle: menu: Optimize bucket assignment when next_timer_ns equals KTIME_MAX
Directly assign the last bucket value instead of calling which_bucket()
when next_timer_ns equals KTIME_MAX, the largest possible value that
always falls into the last bucket.

This avoids unnecessary calculations and enhances performance.

Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Zhongqiu Han <quic_zhonhan@quicinc.com>
Link: https://patch.msgid.link/20250405135308.1854342-1-quic_zhonhan@quicinc.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-04-09 19:37:30 +02:00
Atul Kumar Pant
194c396e8a cpuidle: teo: Fix typos in two comments
Fix a typo and correct a spelling mistake in comments.

Signed-off-by: Atul Kumar Pant <atulpant.linux@gmail.com>
Link: https://patch.msgid.link/20250405060909.2026332-1-atulpant.linux@gmail.com
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-04-09 19:34:23 +02:00
Rafael J. Wysocki
5c35041099 cpuidle: menu: Update documentation after get_typical_interval() changes
The documentation of the menu cpuidle governor needs to be updated
to match the code behavior after some changes made recently.

No functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/4998484.31r3eYUQgx@rjwysocki.net
[ rjw: More specific subject, two typos fixed in the changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-02-25 12:12:46 +01:00
Rafael J. Wysocki
85975daeaa cpuidle: menu: Avoid discarding useful information
When giving up on making a high-confidence prediction,
get_typical_interval() always returns UINT_MAX which means that the
next idle interval prediction will be based entirely on the time till
the next timer.  However, the information represented by the most
recent intervals may not be completely useless in those cases.

Namely, the largest recent idle interval is an upper bound on the
recently observed idle duration, so it is reasonable to assume that
the next idle duration is unlikely to exceed it.  Moreover, this is
still true after eliminating the suspected outliers if the sample
set still under consideration is at least as large as 50% of the
maximum sample set size.

Accordingly, make get_typical_interval() return the current maximum
recent interval value in that case instead of UINT_MAX.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reported-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Link: https://patch.msgid.link/7770672.EvYhyI6sBW@rjwysocki.net
2025-02-25 12:00:43 +01:00
Rafael J. Wysocki
8de7606f0f cpuidle: menu: Eliminate outliers on both ends of the sample set
Currently, get_typical_interval() attempts to eliminate outliers at the
high end of the sample set only (probably in order to bias the prediction
toward lower values), but this it problematic because if the outliers are
present at the low end of the sample set, discarding the highest values
will not help to reduce the variance.

Since the presence of outliers at the low end of the sample set is
generally as likely as their presence at the high end of the sample
set, modify get_typical_interval() to treat samples at the largest
distances from the average (on both ends of the sample set) as outliers.

This should increase the likelihood of making a meaningful prediction
in some cases.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reported-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Link: https://patch.msgid.link/2301940.iZASKD2KPV@rjwysocki.net
2025-02-25 12:00:35 +01:00
Rafael J. Wysocki
60256e458e cpuidle: menu: Tweak threshold use in get_typical_interval()
To prepare get_typical_interval() for subsequent changes, rearrange
the use of the data point threshold in it a bit and initialize that
threshold to UINT_MAX which is more consistent with its data type.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Link: https://patch.msgid.link/8490144.T7Z3S40VBb@rjwysocki.net
2025-02-25 12:00:28 +01:00
Rafael J. Wysocki
13982929fb cpuidle: menu: Use one loop for average and variance computations
Use the observation that one loop is sufficient to compute the average
of an array of values and their variance to eliminate one of the loops
from get_typical_interval().

While at it, make get_typical_interval() consistently use u64 as the
64-bit unsigned integer data type and rearrange some white space and the
declarations of local variables in it (to make them follow the reverse
X-mas tree pattern).

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Link: https://patch.msgid.link/3339073.aeNJFYEL58@rjwysocki.net
2025-02-25 12:00:17 +01:00
Rafael J. Wysocki
d2cd195b57 cpuidle: menu: Drop a redundant local variable
Local variable min in get_typical_interval() is updated, but never
accessed later, so drop it.

No functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Link: https://patch.msgid.link/13699686.uLZWGnKmhe@rjwysocki.net
2025-02-25 11:59:02 +01:00
Rafael J. Wysocki
16c8d7586c cpuidle: teo: Skip sleep length computation for low latency constraints
If the idle state exit latency constraint is sufficiently low, it
is better to avoid the additional latency related to calling
tick_nohz_get_sleep_length().  It is also not necessary to compute
the sleep length in that case because shallow idle state selection
will be forced then regardless of the recent wakeup history.

Accordingly, skip the sleep length computation and subsequent
checks of the exit latency constraint is low enough.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/6122398.lOV4Wx5bFT@rjwysocki.net
2025-01-20 17:19:53 +01:00
Rafael J. Wysocki
65e18e6544 cpuidle: teo: Replace time_span_ns with a flag
After recent updates, the time_span_ns field in struct teo_cpu has
become an indicator on whether or not the most recent wakeup has been
"genuine" which may as well be indicated by a bool field without
calling local_clock(), so update the code accordingly.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/6010475.MhkbZ0Pkbq@rjwysocki.net
2025-01-20 17:18:02 +01:00
Rafael J. Wysocki
ddcfa79646 cpuidle: teo: Simplify handling of total events count
Instead of computing the total events count from scratch every time,
decay it and add a PULSE value to it in teo_update().

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/9388883.CDJkKcVGEf@rjwysocki.net
2025-01-20 17:17:56 +01:00
Rafael J. Wysocki
13ed5c4a6d cpuidle: teo: Skip getting the sleep length if wakeups are very frequent
Commit 6da8f9ba5a ("cpuidle: teo: Skip tick_nohz_get_sleep_length()
call in some cases") attempted to reduce the governor overhead in some
cases by making it avoid obtaining the sleep length (the time till the
next timer event) which may be costly.

Among other things, after the above commit, tick_nohz_get_sleep_length()
was not called any more when idle state 0 was to be returned, which
turned out to be problematic and the previous behavior in that respect
was restored by commit 4b20b07ce7 ("cpuidle: teo: Don't count non-
existent intercepts").

However, commit 6da8f9ba5a also caused the governor to avoid calling
tick_nohz_get_sleep_length() on systems where idle state 0 is a "polling"
one (that is, it is not really an idle state, but a loop continuously
executed by the CPU) when the target residency of the idle state to be
returned was low enough, so there was no practical need to refine the
idle state selection in any way.  This change was not removed by the
other commit, so now on systems where idle state 0 is a "polling" one,
tick_nohz_get_sleep_length() is called when idle state 0 is to be
returned, but it is not called when a deeper idle state with
sufficiently low target residency is to be returned.  That is arguably
confusing and inconsistent.

Moreover, there is no specific reason why the behavior in question
should depend on whether or not idle state 0 is a "polling" one.

One way to address this would be to make the governor always call
tick_nohz_get_sleep_length() to obtain the sleep length, but that would
effectively mean reverting commit 6da8f9ba5a and restoring the latency
issue that was the reason for doing it.  This approach is thus not
particularly attractive.

To address it differently, notice that if a CPU is woken up very often,
this is not likely to be caused by timers in the first place (user space
has a default timer slack of 50 us and there are relatively few timers
with a deadline shorter than several microseconds in the kernel) and
even if it were the case, the potential benefit from using a deep idle
state would then be questionable for latency reasons.  Therefore, if the
majority of CPU wakeups occur within several microseconds, it can be
assumed that all wakeups in that range are non-timer and the sleep
length need not be determined.

Accordingly, introduce a new metric for counting wakeups with the
measured idle duration below RESIDENCY_THRESHOLD_NS and modify the idle
state selection to skip the tick_nohz_get_sleep_length() invocation if
idle state 0 has been selected or the target residency of the candidate
idle state is below RESIDENCY_THRESHOLD_NS and the value of the new
metric is at least 1/2 of the total event count.

Since the above requires the measured idle duration to be determined
every time, except for the cases when one of the safety nets has
triggered in which the wakeup is counted as a hit in the deepest
idle state idle residency range, update the handling of those cases
to avoid skipping the idle duration computation when the CPU wakeup
is "genuine".

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/3851791.kQq0lBPeGt@rjwysocki.net
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
[ rjw: Renamed a struct field ]
[ rjw: Fixed typo in the subject and one in a comment ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-01-20 17:17:02 +01:00
Rafael J. Wysocki
d619b5cc67 cpuidle: teo: Simplify counting events used for tick management
Replace the tick_hits metric with a new tick_intercepts one that can be
used directly when deciding whether or not to stop the scheduler tick
and update the governor functional description accordingly.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/1987985.PYKUYFuaPT@rjwysocki.net
2025-01-20 17:16:53 +01:00
Rafael J. Wysocki
e24f8a55de cpuidle: teo: Clarify two code comments
Rewrite two code comments suposed to explain its behavior that are too
concise or not sufficiently clear.

No functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/8472971.T7Z3S40VBb@rjwysocki.net
[ rjw: Fixed 2 typos in new comments ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-01-20 17:16:45 +01:00
Rafael J. Wysocki
b9a6af26bd cpuidle: teo: Drop local variable prev_intercept_idx
Local variable prev_intercept_idx in teo_select() is redundant because
it cannot be 0 when candidate state index is 0.

The prev_intercept_idx value is the index of the deepest enabled idle
state, so if it is 0, state 0 is the deepest enabled idle state, in
which case it must be the only enabled idle state, but then teo_select()
would have returned early before initializing prev_intercept_idx.

Thus prev_intercept_idx must be nonzero and the check of it against 0
always passes, so it can be dropped altogether.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/3327997.aeNJFYEL58@rjwysocki.net
[ rjw: Fixed typo in the changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-01-20 17:16:37 +01:00
Rafael J. Wysocki
ea185406d1 cpuidle: teo: Combine candidate state index checks against 0
There are two candidate state index checks against 0 in teo_select()
that need not be separate, so combine them and update comments around
them.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/13676346.uLZWGnKmhe@rjwysocki.net
2025-01-20 17:16:30 +01:00
Rafael J. Wysocki
92ce5c07b7 cpuidle: teo: Reorder candidate state index checks
Since constraint_idx may be 0, the candidate state index may change to 0
after assigning constraint_idx to it, so first check if it is greater
than constraint_idx (and update it if so) and then check it against 0.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/1907276.tdWV9SEqCh@rjwysocki.net
2025-01-20 17:16:16 +01:00
Rafael J. Wysocki
425b753645 cpuidle: teo: Rearrange idle state lookup code
Rearrange code in the idle state lookup loop in teo_select() to make it
somewhat easier to follow and update comments around it.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/4619938.LvFx2qVVIh@rjwysocki.net
2025-01-20 17:15:44 +01:00
Rafael J. Wysocki
5a597a19a2 cpuidle: teo: Update documentation after previous changes
After previous changes, the description of the teo governor in the
documentation comment does not match the code any more, so update it
as appropriate.

Fixes: 4499143980 ("cpuidle: teo: Remove recent intercepts metric")
Fixes: 2662342079 ("cpuidle: teo: Gather statistics regarding whether or not to stop the tick")
Fixes: 6da8f9ba5a ("cpuidle: teo: Skip tick_nohz_get_sleep_length() call in some cases")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/6120335.lOV4Wx5bFT@rjwysocki.net
[ rjw: Corrected 3 typos found by Christian ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-01-13 20:46:27 +01:00
Christian Loehle
38f83090f5 cpuidle: menu: Remove iowait influence
Remove CPU iowaiters influence on idle state selection.

Remove the menu notion of performance multiplier which increased with
the number of tasks that went to iowait sleep on this CPU and haven't
woken up yet.

Relying on iowait for cpuidle is problematic for a few reasons:

 1. There is no guarantee that an iowaiting task will wake up on the
    same CPU.

 2. The task being in iowait says nothing about the idle duration, we
    could be selecting shallower states for a long time.

 3. The task being in iowait doesn't always imply a performance hit
    with increased latency.

 4. If there is such a performance hit, the number of iowaiting tasks
    doesn't directly correlate.

 5. The definition of iowait altogether is vague at best, it is
    sprinkled across kernel code.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20240905092645.2885200-2-christian.loehle@arm.com
[ rjw: Minor edits in the changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-09-30 17:00:55 +02:00
Christian Loehle
4b20b07ce7 cpuidle: teo: Don't count non-existent intercepts
When bailing out early, teo will not query the sleep length anymore
since commit 6da8f9ba5a ("cpuidle: teo:
Skip tick_nohz_get_sleep_length() call in some cases") with an
expected sleep_length_ns value of KTIME_MAX.
This lead to state0 accumulating lots of 'intercepts' because
the actually measured sleep length was < KTIME_MAX, so query the sleep
length instead for teo to recognize if it still is in an
intercept-likely scenario without alternating between the two modes.

Fundamentally we can only do one of the two:
 1. Skip sleep_length_ns query when we think intercept is likely.
 2. Have accurate data if sleep_length_ns is actually intercepted when
we believe it is currently intercepted.

Previously teo did the former while this patch chooses the latter as
the additional time it takes to query the sleep length was found to be
negligible and the variants of option 1 (count all unknowns as misses
or count all unknown as hits) had significant regressions (as misses
had lots of too shallow idle state selections and as hits had terrible
performance in intercept-heavy workloads).

Fixes: 6da8f9ba5a ("cpuidle: teo: Skip tick_nohz_get_sleep_length() call in some cases")
Link: https://patch.msgid.link/c40acf72-010f-4a8b-80e4-33f133ba266b@arm.com
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-07-01 18:58:55 +02:00
Christian Loehle
4499143980 cpuidle: teo: Remove recent intercepts metric
The logic for recent intercepts didn't work, there is an underflow
of the 'recent' value that can be observed during boot already, which
teo usually doesn't recover from, making the entire logic pointless.
Furthermore the recent intercepts also were never reset, thus not
actually being very 'recent'.

Having underflowed 'recent' values lead to teo always acting as if
we were in a scenario were expected sleep length based on timers is
too high and it therefore unnecessarily selecting shallower states.

Experiments show that the remaining 'intercept' logic is enough to
quickly react to scenarios in which teo cannot rely on the timer
expected sleep length.

See also here:
https://lore.kernel.org/lkml/0ce2d536-1125-4df8-9a5b-0d5e389cd8af@arm.com/

Fixes: 77577558f2 ("cpuidle: teo: Rework most recent idle duration values treatment")
Link: https://patch.msgid.link/20240628095955.34096-3-christian.loehle@arm.com
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-06-28 21:03:21 +02:00
Christian Loehle
0a2998fa48 Revert: "cpuidle: teo: Introduce util-awareness"
This reverts commit 9ce0f7c4bc.

Util-awareness was reported to be too aggressive in selecting shallower
states. Additionally a single threshold was found to not be suitable
for reasoning about sleep length as, for all practical purposes,
almost arbitrary sleep lengths are still possible for any load value.

Fixes: 9ce0f7c4bc ("cpuidle: teo: Introduce util-awareness")
Link: https://patch.msgid.link/20240628095955.34096-2-christian.loehle@arm.com
Reported-by: Qais Yousef <qyousef@layalina.io>
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Qais Yousef <qyousef@layalina.io>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-06-28 21:03:21 +02:00