linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-25 07:33:19 +02:00

Author	SHA1	Message	Date
Rafael J. Wysocki	106a2662e6	cpuidle: governors: teo: Rearrange stopped tick handling This change is based on the observation that it is not in fact necessary to select a deep idle state every time the scheduler tick has been stopped before the idle state selection takes place. Namely, if the time till the closest timer (that is not the tick) is short enough, a shallow idle state can be selected because the timer will kick the CPU out of that state, so the damage from a possible overly optimistic selection will be limited. Update the teo governor in accordance with the above in analogy with the previous analogous menu governor update. Among other things, this will cause the teo governor to call tick_nohz_get_sleep_length() every time when the tick has been stopped already and only change the original idle state selection if the time till the closest timer is beyond SAFE_TIMER_RANGE_NS which is way more straightforward than the current code flow. Of course, this effectively throws away some of the recent teo governor changes made recently, but the resulting simplification is worth it in my view. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/1865078.VLH7GnMWUR@rafael.j.wysocki	2026-03-05 15:23:16 +01:00
Rafael J. Wysocki	e57c2bf2e8	cpuidle: governors: menu: Refine stopped tick handling This change is based on the observation that it is not in fact necessary to select a deep idle state every time the scheduler tick has been stopped before the idle state selection takes place. Namely, if the time till the closest timer (that is not the tick) is short enough, a shallow idle state can be selected because the timer will kick the CPU out of that state, so the damage from a possible overly optimistic selection will be limited. Update the menu governor in accordance with the above and use twice the tick period length as the "safe timer range" for allowing the original predicted_ns value to be used even if the tick has been stopped. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/3341782.5fSG56mABF@rafael.j.wysocki	2026-03-05 15:23:16 +01:00
Christian Loehle	93983a9f3b	cpuidle: menu: Remove single state handling cpuidle systems where the governor has no choice because there's only a single idle state are now handled by cpuidle core and bypass the governor, so remove the related handling. Signed-off-by: Christian Loehle <christian.loehle@arm.com> [ rjw: Rebase on top of the cpuidle changes merged recently ] Link: https://patch.msgid.link/20260216185005.1131593-5-aboorvad@linux.ibm.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2026-02-17 15:49:53 +01:00
Christian Loehle	825d5d3479	cpuidle: teo: Remove single state handling cpuidle systems where the governor has no choice because there's only a single idle state are now handled by cpuidle core and bypass the governor, so remove the related handling. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/20260216185005.1131593-4-aboorvad@linux.ibm.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2026-02-17 15:49:53 +01:00
Aboorva Devarajan	9b9c0ff095	cpuidle: haltpoll: Remove single state handling cpuidle systems where the governor has no choice because there's only a single idle state are now handled by cpuidle core and bypass the governor, so remove the related handling. Signed-off-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> [ rjw: Extended the change to drop a redundant local variable ] Link: https://patch.msgid.link/20260216185005.1131593-3-aboorvad@linux.ibm.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2026-02-17 15:49:24 +01:00
Rafael J. Wysocki	a971f984b8	cpuidle: governors: teo: Refine intercepts-based idle state lookup There are cases in which decisions made by the teo governor are arguably overly conservative. For instance, suppose that there are 4 idle states and the values of the intercepts metric for the first 3 of them are 400, 250, and 251, respectively. If the total sum computed in teo_update() is 1000, the governor will select idle state 1 (provided that all idle states are enabled and the scheduler tick has not been stopped) although arguably idle state 0 would be a better choice because the likelihood of getting an idle duration below the target residency of idle state 1 is greater than the likelihood of getting an idle duration between the target residency of idle state 1 and the target residency of idle state 2. To address this, refine the candidate idle state lookup based on intercepts to start at the state with the maximum intercepts metric, below the deepest enabled one, to avoid the cases in which the search may stop before reaching that state. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> [ rjw: Fixed typo "intercetps" in new comments (3 places) ] Link: https://patch.msgid.link/2417298.ElGaqSPkdT@rafael.j.wysocki Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2026-01-30 20:15:52 +01:00
Rafael J. Wysocki	f36de72673	cpuidle: governors: teo: Adjust the classification of wakeup events If differences between target residency values of adjacent idle states of a given CPU are relatively large, the corresponding idle state bins used by the teo governors are large either and the rule by which hits are distinguished from intercepts is inaccurate. Namely, by that rule, a wakeup event is classified as a hit if the sleep length (the time till the closest timer other than the tick) and the measured idle duration, adjusted for the entered idle state exit latency, fall into the same idle state bin. However, if that bin is large enough, the actual difference between the sleep length and the measured idle duration may be significant. It may in fact be significantly greater than the analogous difference for an event where the sleep length and the measured idle duration fall into different bins. For this reason, amend the rule in question with a check that will only allow a wakeup event to be counted as a hit if the sleep length is less than the "raw" measured idle duration (which means that the wakeup appears to have occurred after the anticipated timer event). Otherwise, the event will be counted as an intercept. Also update the documentation part explaining the difference between "hits" and "intercepts" to take the above change into account. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/5093379.31r3eYUQgx@rafael.j.wysocki	2026-01-30 20:15:43 +01:00
Rafael J. Wysocki	475ca3470b	cpuidle: governors: teo: Refine tick_intercepts vs total events check Use 2/3 as the proportion coefficient in the check comparing cpu_data->tick_intercepts with cpu_data->total because it is close enough to the current one (5/8) and it allows of more straightforward interpretation (on average, intercepts within the tick period length are twice as frequent as other events). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/10793374.nUPlyArG6x@rafael.j.wysocki	2026-01-23 21:50:38 +01:00
Rafael J. Wysocki	60836533b4	cpuidle: governors: teo: Avoid fake intercepts produced by tick Tick wakeups can lead to fake intercepts that may skew idle state selection towards shallow states, so it is better to avoid counting them as intercepts. For this purpose, add a check causing teo_update() to only count tick wakeups as intercepts if intercepts within the tick period range are at least twice as frequent as any other events. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/3404606.44csPzL39Z@rafael.j.wysocki	2026-01-23 21:50:38 +01:00
Rafael J. Wysocki	4bd2221f23	cpuidle: governors: teo: Avoid selecting states with zero-size bins If the last two enabled idle states have the same target residency which is at least equal to TICK_NSEC, teo may select the next-to-last one even though the size of that state's bin is 0, which is confusing. Prevent that from happening by adding a target residency check to the relevant code path. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> [ rjw: Fixed a typo in the changelog ] Link: https://patch.msgid.link/3033265.e9J7NaK4W3@rafael.j.wysocki Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2026-01-23 21:49:54 +01:00
Rafael J. Wysocki	80606f4eb8	cpuidle: governors: menu: Always check timers with tick stopped After commit `5484e31bbb` ("cpuidle: menu: Skip tick_nohz_get_sleep_length() call in some cases"), if the return value of get_typical_interval() multiplied by NSEC_PER_USEC is not greater than RESIDENCY_THRESHOLD_NS, the menu governor will skip computing the time till the closest timer. If that happens when the tick has been stopped already, the selected idle state may be too deep due to the subsequent check comparing predicted_ns with TICK_NSEC and causing its value to be replaced with the expected time till the closest timer, which is KTIME_MAX in that case. That will cause the deepest enabled idle state to be selected, but the time till the closest timer very well may be shorter than the target residency of that state, in which case a shallower state should be used. Address this by making menu_select() always compute the time till the closest timer when the tick has been stopped. Also move the predicted_ns check mentioned above into the branch in which the time till the closest timer is determined because it only needs to be done in that case. Fixes: `5484e31bbb` ("cpuidle: menu: Skip tick_nohz_get_sleep_length() call in some cases") Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/5959091.DvuYhMxLoT@rafael.j.wysocki	2026-01-23 21:22:42 +01:00
Breno Leitao	fcbd7897b8	cpuidle: menu: Remove incorrect unlikely() annotation The unlikely() annotation on the early-return condition in menu_select() is incorrect on systems with only one idle state (e.g., ARM64 servers with a single ACPI LPI state). Branch profiling shows 100% misprediction on such systems since drv->state_count <= 1 is always true. On platforms where only state0 is available, this path is the common case, not an unlikely edge case. Remove the misleading annotation to let the branch predictor learn the actual behavior. Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20260105-annotated_idle-v1-1-10ddf0771b58@debian.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2026-01-09 21:52:54 +01:00
Rafael J. Wysocki	15bfdadd61	cpuidle: governors: teo: Add missing space to the description There is a missing space in the governor description comment, so add it. No functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/5059034.31r3eYUQgx@rafael.j.wysocki	2025-11-24 20:43:32 +01:00
Rafael J. Wysocki	d834e68a0e	cpuidle: governors: teo: Simplify intercepts-based state lookup Simplify the loop looking up a candidate idle state in the case when an intercept is likely to occur by adding a search for the state index limit if the tick is stopped before it. First, call tick_nohz_tick_stopped() just once and if it returns true, look for the shallowest state index below the current candidate one with target residency at least equal to the tick period length. Next, simply look for a state that is not shallower than the one found in the previous step and satisfies the intercepts majority condition (if there are no such states, the shallowest state that is not shallower than the one found in the previous step becomes the new candidate). Since teo_state_ok() has no callers any more after the above changes, drop it. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> [ rjw: Changelog clarification and code comment edit ] Link: https://patch.msgid.link/2418792.ElGaqSPkdT@rafael.j.wysocki Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2025-11-20 16:32:48 +01:00
Rafael J. Wysocki	50db438231	cpuidle: governors: teo: Fix tick_intercepts handling in teo_update() The condition deciding whether or not to increase cpu_data->tick_intercepts in teo_update() is reverse, so fix it. Fixes: `d619b5cc67` ("cpuidle: teo: Simplify counting events used for tick management") Cc: 6.14+ <stable@vger.kernel.org> # 6.14+: 0796ddf4a7f0: cpuidle: teo: Use this_cpu_ptr() where possible Cc: 6.14+ <stable@vger.kernel.org> # 6.14+: 8f3f01082d7a: cpuidle: governors: teo: Use s64 consistently in teo_update() Cc: 6.14+ <stable@vger.kernel.org> # 6.14+: b54df61c7428: cpuidle: governors: teo: Decay metrics below DECAY_SHIFT threshold Cc: 6.14+ <stable@vger.kernel.org> 6.14+: 083654ded547: cpuidle: governors: teo: Rework the handling of tick wakeups Cc: 6.14+ <stable@vger.kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/5085160.31r3eYUQgx@rafael.j.wysocki	2025-11-20 14:54:08 +01:00
Rafael J. Wysocki	083654ded5	cpuidle: governors: teo: Rework the handling of tick wakeups If the wakeup pattern is clearly dominated by tick wakeups, count those wakeups as hits on the deepest available idle state to increase the likelihood of stopping the tick, especially on systems where there are only 2 usable idle states and the tick can only be stopped when the deeper state is selected. This change is expected to reduce power on some systems where state 0 is selected relatively often even though they are almost idle. Without it, the governor may end up selecting the shallowest idle state all the time even if the system is almost completely idle due all tick wakeups being counted as hits on that state and preventing the tick from being stopped at all. Fixes: `4b20b07ce7` ("cpuidle: teo: Don't count non-existent intercepts") Reported-by: Reka Norman <rekanorman@chromium.org> Closes: https://lore.kernel.org/linux-pm/CAEmPcwsNMNnNXuxgvHTQ93Mx-q3Oz9U57THQsU_qdcCx1m4w5g@mail.gmail.com/ Tested-by: Reka Norman <rekanorman@chromium.org> Tested-by: Christian Loehle <christian.loehle@arm.com> Cc: 6.11+ <stable@vger.kernel.org> # 6.11+: 92ce5c07b7a1: cpuidle: teo: Reorder candidate state index checks Cc: 6.11+ <stable@vger.kernel.org> # 6.11+: ea185406d1ed: cpuidle: teo: Combine candidate state index checks against 0 Cc: 6.11+ <stable@vger.kernel.org> # 6.11+: b9a6af26bd83: cpuidle: teo: Drop local variable prev_intercept_idx Cc: 6.11+ <stable@vger.kernel.org> # 6.11+: e24f8a55de50: cpuidle: teo: Clarify two code comments Cc: 6.11+ <stable@vger.kernel.org> # 6.11+: d619b5cc6780: cpuidle: teo: Simplify counting events used for tick management Cc: 6.11+ <stable@vger.kernel.org> # 6.11+: 13ed5c4a6d9c: cpuidle: teo: Skip getting the sleep length if wakeups are very frequent Cc: 6.11+ <stable@vger.kernel.org> # 6.11+: ddcfa7964677: cpuidle: teo: Simplify handling of total events count Cc: 6.11+ <stable@vger.kernel.org> # 6.11+: 65e18e654475: cpuidle: teo: Replace time_span_ns with a flag Cc: 6.11+ <stable@vger.kernel.org> # 6.11+: 0796ddf4a7f0: cpuidle: teo: Use this_cpu_ptr() where possible Cc: 6.11+ <stable@vger.kernel.org> # 6.11+: 8f3f01082d7a: cpuidle: governors: teo: Use s64 consistently in teo_update() Cc: 6.11+ <stable@vger.kernel.org> # 6.11+: b54df61c7428: cpuidle: governors: teo: Decay metrics below DECAY_SHIFT threshold Cc: 6.11+ <stable@vger.kernel.org> # 6.11+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [ rjw: Rebase on commit `0796ddf4a7`, changelog update ] Link: https://patch.msgid.link/6228387.lOV4Wx5bFT@rafael.j.wysocki Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2025-11-20 14:49:57 +01:00
Rafael J. Wysocki	b54df61c74	cpuidle: governors: teo: Decay metrics below DECAY_SHIFT threshold If a given governor metric falls below a certain value (8 for DECAY_SHIFT equal to 3), it will not decay any more due to the simplistic decay implementation. This may in some cases lead to subtle inconsistencies in the governor behavior, so change the decay implementation to take it into account and set the metric at hand to 0 in that case. Suggested-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/2819353.mvXUDI8C0e@rafael.j.wysocki	2025-11-14 15:20:01 +01:00
Rafael J. Wysocki	8f3f01082d	cpuidle: governors: teo: Use s64 consistently in teo_update() Two local variables in teo_update() are defined as u64, but their values are then compared with s64 values, so it is more consistent to use s64 as their data type. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/3026616.e9J7NaK4W3@rafael.j.wysocki	2025-11-14 15:20:01 +01:00
Rafael J. Wysocki	17673f64a0	cpuidle: governors: teo: Drop redundant function parameter The last no_poll parameter of teo_find_shallower_state() is always false, so drop it. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/2253109.irdbgypaU6@rafael.j.wysocki	2025-11-14 15:20:01 +01:00
Rafael J. Wysocki	a03b201180	cpuidle: governors: teo: Drop misguided target residency check When the target residency of the current candidate idle state is greater than the expected time till the closest timer (the sleep length), it does not matter whether or not the tick has already been stopped or if it is going to be stopped. The closest timer will trigger anyway at its due time, so if an idle state with target residency above the sleep length is selected, energy will be wasted and there may be excess latency. Of course, if the closest timer were canceled before it could trigger, a deeper idle state would be more suitable, but this is not expected to happen (generally speaking, hrtimers are not expected to be canceled as a rule). Accordingly, the teo_state_ok() check done in that case causes energy to be wasted more often than it allows any energy to be saved (if it allows any energy to be saved at all), so drop it and let the governor use the teo_find_shallower_state() return value as the new candidate idle state index. Fixes: `21d28cd2fa` ("cpuidle: teo: Do not call tick_nohz_get_sleep_length() upfront") Cc: All applicable <stable@vger.kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/5955081.DvuYhMxLoT@rafael.j.wysocki	2025-11-14 15:19:39 +01:00
Christian Loehle	0796ddf4a7	cpuidle: teo: Use this_cpu_ptr() where possible The cpuidle governor callbacks for update, select and reflect are always running on the actual idle entering/exiting CPU, so use the more optimized this_cpu_ptr() to access the internal teo data. This brings down the latency-critical teo_reflect() from static void teo_reflect(struct cpuidle_device dev, int state) { ffffffc080ffcff0: hint #0x19 ffffffc080ffcff4: stp x29, x30, [sp, #-48]! struct teo_cpu cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu); ffffffc080ffcff8: adrp x2, ffffffc0848c0000 <gicv5_global_data+0x28> { ffffffc080ffcffc: add x29, sp, #0x0 ffffffc080ffd000: stp x19, x20, [sp, #16] ffffffc080ffd004: orr x20, xzr, x0 struct teo_cpu cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu); ffffffc080ffd008: add x0, x2, #0xc20 { ffffffc080ffd00c: stp x21, x22, [sp, #32] struct teo_cpu cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu); ffffffc080ffd010: adrp x19, ffffffc083eb5000 <cpu_devices+0x78> ffffffc080ffd014: add x19, x19, #0xbb0 ffffffc080ffd018: ldr w3, [x20, #4] dev->last_state_idx = state; to static void teo_reflect(struct cpuidle_device dev, int state) { ffffffc080ffd034: hint #0x19 ffffffc080ffd038: stp x29, x30, [sp, #-48]! ffffffc080ffd03c: add x29, sp, #0x0 ffffffc080ffd040: stp x19, x20, [sp, #16] ffffffc080ffd044: orr x20, xzr, x0 struct teo_cpu cpu_data = this_cpu_ptr(&teo_cpus); ffffffc080ffd048: adrp x19, ffffffc083eb5000 <cpu_devices+0x78> { ffffffc080ffd04c: stp x21, x22, [sp, #32] struct teo_cpu *cpu_data = this_cpu_ptr(&teo_cpus); ffffffc080ffd050: add x19, x19, #0xbb0 dev->last_state_idx = state; This saves us: adrp x2, ffffffc0848c0000 <gicv5_global_data+0x28> add x0, x2, #0xc20 ldr w3, [x20, #4] Signed-off-by: Christian Loehle <christian.loehle@arm.com> [ rjw: Subject tweak ] Link: https://patch.msgid.link/20251110120819.714560-1-christian.loehle@arm.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2025-11-12 21:02:07 +01:00
Aboorva Devarajan	07d8157012	cpuidle: menu: Use residency threshold in polling state override decisions On virtualized PowerPC (pseries) systems, where only one polling state (Snooze) and one deep state (CEDE) are available, selecting CEDE when the predicted idle duration is less than the target residency of CEDE state can hurt performance. In such cases, the entry/exit overhead of CEDE outweighs the power savings, leading to unnecessary state transitions and higher latency. Menu governor currently contains a special-case rule that prioritizes the first non-polling state over polling, even when its target residency is much longer than the predicted idle duration. On PowerPC/pseries, where the gap between the polling state (Snooze) and the first non-polling state (CEDE) is large, this behavior causes performance regressions. Refine that special case by adding an extra requirement: the first non-polling state can only be chosen if its target residency is below the defined RESIDENCY_THRESHOLD_NS. If this condition is not satisfied, polling is allowed instead, avoiding suboptimal non-polling state entries. This change is limited to the single special-case rule for the first non-polling state. The general non-polling state selection logic in the menu governor remains unchanged. Performance improvement observed with pgbench on PowerPC (pseries) system: +---------------------------+------------+------------+------------+ \| Metric \| Baseline \| Patched \| Change (%) \| +---------------------------+------------+------------+------------+ \| Transactions/sec (TPS) \| 495,210 \| 536,982 \| +8.45% \| \| Avg latency (ms) \| 0.163 \| 0.150 \| -7.98% \| +---------------------------+------------+------------+------------+ CPUIdle state usage: +--------------+--------------+-------------+ \| Metric \| Baseline \| Patched \| +--------------+--------------+-------------+ \| Total usage \| 12,735,820 \| 13,918,442 \| \| Above usage \| 11,401,520 \| 1,598,210 \| \| Below usage \| 20,145 \| 702,395 \| +--------------+--------------+-------------+ Above/Total and Below/Total usage percentages: +------------------------+-----------+---------+ \| Metric \| Baseline \| Patched \| +------------------------+-----------+---------+ \| Above % (Above/Total) \| 89.56% \| 11.49% \| \| Below % (Below/Total) \| 0.16% \| 5.05% \| \| Total cpuidle miss (%) \| 89.72% \| 16.54% \| +------------------------+-----------+---------+ The results indicate that restricting CEDE selection to cases where its residency matches the predicted idle time reduces mispredictions, lowers unnecessary state transitions, and improves overall throughput. Reviewed-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Aboorva Devarajan <aboorvad@linux.ibm.com> [ rjw: Changelog edits, rebase ] Link: https://patch.msgid.link/20251006013954.17972-1-aboorvad@linux.ibm.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2025-10-27 14:53:46 +01:00
Rafael J. Wysocki	db86f55bf8	cpuidle: governors: menu: Select polling state in some more cases A throughput regression of 11% introduced by commit `779b1a1cb1` ("cpuidle: governors: menu: Avoid selecting states with too much latency") has been reported and it is related to the case when the menu governor checks if selecting a proper idle state instead of a polling one makes sense. In particular, it is questionable to do so if the exit latency of the idle state in question exceeds the predicted idle duration, so add a check for that, which is sufficient to make the reported regression go away, and update the related code comment accordingly. Fixes: `779b1a1cb1` ("cpuidle: governors: menu: Avoid selecting states with too much latency") Closes: https://lore.kernel.org/linux-pm/004501dc43c9$ec8aa930$c59ffb90$@telus.net/ Reported-by: Doug Smythies <dsmythies@telus.net> Tested-by: Doug Smythies <dsmythies@telus.net> Cc: All applicable <stable@vger.kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/12786727.O9o76ZdvQC@rafael.j.wysocki	2025-10-27 14:41:27 +01:00
Rafael J. Wysocki	10fad40122	Revert "cpuidle: menu: Avoid discarding useful information" It is reported that commit `85975daeaa` ("cpuidle: menu: Avoid discarding useful information") led to a performance regression on Intel Jasper Lake systems because it reduced the time spent by CPUs in idle state C7 which is correlated to the maximum frequency the CPUs can get to because of an average running power limit [1]. Before that commit, get_typical_interval() would have returned UINT_MAX whenever it had been unable to make a high-confidence prediction which had led to selecting the deepest available idle state too often and both power and performance had been inadequate as a result of that on some systems. However, this had not been a problem on systems with relatively aggressive average running power limits, like the Jasper Lake systems in question, because on those systems it was compensated by the ability to run CPUs faster. It was addressed by causing get_typical_interval() to return a number based on the recent idle duration information available to it even if it could not make a high-confidence prediction, but that clearly did not take the possible correlation between idle power and available CPU capacity into account. For this reason, revert most of the changes made by commit `85975daeaa`, except for one cosmetic cleanup, and add a comment explaining the rationale for returning UINT_MAX from get_typical_interval() when it is unable to make a high-confidence prediction. Fixes: `85975daeaa` ("cpuidle: menu: Avoid discarding useful information") Closes: https://lore.kernel.org/linux-pm/36iykr223vmcfsoysexug6s274nq2oimcu55ybn6ww4il3g3cv@cohflgdbpnq7/ [1] Reported-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: All applicable <stable@vger.kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/3663603.iIbC2pHGDl@rafael.j.wysocki	2025-10-20 21:27:16 +02:00
Rafael J. Wysocki	17224c1d25	cpuidle: governors: menu: Rearrange main loop in menu_select() Reduce the indentation level in the main loop of menu_select() by rearranging some checks and assignments in it. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/2389215.ElGaqSPkdT@rafael.j.wysocki	2025-08-21 22:02:28 +02:00
Rafael J. Wysocki	779b1a1cb1	cpuidle: governors: menu: Avoid selecting states with too much latency Occasionally, the exit latency of the idle state selected by the menu governor may exceed the PM QoS CPU wakeup latency limit. Namely, if the scheduler tick has been stopped already and predicted_ns is greater than the tick period length, the governor may return an idle state whose exit latency exceeds latency_req because that decision is made before checking the current idle state's exit latency. For instance, say that there are 3 idle states, 0, 1, and 2. For idle states 0 and 1, the exit latency is equal to the target residency and the values are 0 and 5 us, respectively. State 2 is deeper and has the exit latency and target residency of 200 us and 2 ms (which is greater than the tick period length), respectively. Say that predicted_ns is equal to TICK_NSEC and the PM QoS latency limit is 20 us. After the first two iterations of the main loop in menu_select(), idx becomes 1 and in the third iteration of it the target residency of the current state (state 2) is greater than predicted_ns. State 2 is not a polling one and predicted_ns is not less than TICK_NSEC, so the check on whether or not the tick has been stopped is done. Say that the tick has been stopped already and there are no imminent timers (that is, delta_tick is greater than the target residency of state 2). In that case, idx becomes 2 and it is returned immediately, but the exit latency of state 2 exceeds the latency limit. Address this issue by modifying the code to compare the exit latency of the current idle state (idle state i) with the latency limit before comparing its target residency with predicted_ns, which allows one more exit_latency_ns check that becomes redundant to be dropped. However, after the above change, latency_req cannot take the predicted_ns value any more, which takes place after commit `38f83090f5` ("cpuidle: menu: Remove iowait influence"), because it may cause a polling state to be returned prematurely. In the context of the previous example say that predicted_ns is 3000 and the PM QoS latency limit is still 20 us. Additionally, say that idle state 0 is a polling one. Moving the exit_latency_ns check before the target_residency_ns one causes the loop to terminate in the second iteration, before the target_residency_ns check, so idle state 0 will be returned even though previously state 1 would be returned if there were no imminent timers. For this reason, remove the assignment of the predicted_ns value to latency_req from the code. Fixes: `5ef499cd57` ("cpuidle: menu: Handle stopped tick more aggressively") Cc: 4.17+ <stable@vger.kernel.org> # 4.17+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/5043159.31r3eYUQgx@rafael.j.wysocki	2025-08-18 19:04:25 +02:00
Rafael J. Wysocki	fa3fa55de0	cpuidle: governors: menu: Avoid using invalid recent intervals data Marc has reported that commit `85975daeaa` ("cpuidle: menu: Avoid discarding useful information") caused the number of wakeup interrupts to increase on an idle system [1], which was not expected to happen after merely allowing shallower idle states to be selected by the governor in some cases. However, on the system in question, all of the idle states deeper than WFI are rejected by the driver due to a firmware issue [2]. This causes the governor to only consider the recent interval duriation data corresponding to attempts to enter WFI that are successful and the recent invervals table is filled with values lower than the scheduler tick period. Consequently, the governor predicts an idle duration below the scheduler tick period length and avoids stopping the tick more often which leads to the observed symptom. Address it by modifying the governor to update the recent intervals table also when entering the previously selected idle state fails, so it knows that the short idle intervals might have been the minority had the selected idle states been actually entered every time. Fixes: `85975daeaa` ("cpuidle: menu: Avoid discarding useful information") Link: https://lore.kernel.org/linux-pm/86o6sv6n94.wl-maz@kernel.org/ [1] Link: https://lore.kernel.org/linux-pm/7ffcb716-9a1b-48c2-aaa4-469d0df7c792@arm.com/ [2] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/2793874.mvXUDI8C0e@rafael.j.wysocki	2025-08-11 21:46:14 +02:00
Zhongqiu Han	d4a7882f93	cpuidle: menu: Optimize bucket assignment when next_timer_ns equals KTIME_MAX Directly assign the last bucket value instead of calling which_bucket() when next_timer_ns equals KTIME_MAX, the largest possible value that always falls into the last bucket. This avoids unnecessary calculations and enhances performance. Reviewed-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Zhongqiu Han <quic_zhonhan@quicinc.com> Link: https://patch.msgid.link/20250405135308.1854342-1-quic_zhonhan@quicinc.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2025-04-09 19:37:30 +02:00
Atul Kumar Pant	194c396e8a	cpuidle: teo: Fix typos in two comments Fix a typo and correct a spelling mistake in comments. Signed-off-by: Atul Kumar Pant <atulpant.linux@gmail.com> Link: https://patch.msgid.link/20250405060909.2026332-1-atulpant.linux@gmail.com [ rjw: Subject and changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2025-04-09 19:34:23 +02:00
Rafael J. Wysocki	5c35041099	cpuidle: menu: Update documentation after get_typical_interval() changes The documentation of the menu cpuidle governor needs to be updated to match the code behavior after some changes made recently. No functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/4998484.31r3eYUQgx@rjwysocki.net [ rjw: More specific subject, two typos fixed in the changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2025-02-25 12:12:46 +01:00
Rafael J. Wysocki	85975daeaa	cpuidle: menu: Avoid discarding useful information When giving up on making a high-confidence prediction, get_typical_interval() always returns UINT_MAX which means that the next idle interval prediction will be based entirely on the time till the next timer. However, the information represented by the most recent intervals may not be completely useless in those cases. Namely, the largest recent idle interval is an upper bound on the recently observed idle duration, so it is reasonable to assume that the next idle duration is unlikely to exceed it. Moreover, this is still true after eliminating the suspected outliers if the sample set still under consideration is at least as large as 50% of the maximum sample set size. Accordingly, make get_typical_interval() return the current maximum recent interval value in that case instead of UINT_MAX. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reported-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Link: https://patch.msgid.link/7770672.EvYhyI6sBW@rjwysocki.net	2025-02-25 12:00:43 +01:00
Rafael J. Wysocki	8de7606f0f	cpuidle: menu: Eliminate outliers on both ends of the sample set Currently, get_typical_interval() attempts to eliminate outliers at the high end of the sample set only (probably in order to bias the prediction toward lower values), but this it problematic because if the outliers are present at the low end of the sample set, discarding the highest values will not help to reduce the variance. Since the presence of outliers at the low end of the sample set is generally as likely as their presence at the high end of the sample set, modify get_typical_interval() to treat samples at the largest distances from the average (on both ends of the sample set) as outliers. This should increase the likelihood of making a meaningful prediction in some cases. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reported-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Link: https://patch.msgid.link/2301940.iZASKD2KPV@rjwysocki.net	2025-02-25 12:00:35 +01:00
Rafael J. Wysocki	60256e458e	cpuidle: menu: Tweak threshold use in get_typical_interval() To prepare get_typical_interval() for subsequent changes, rearrange the use of the data point threshold in it a bit and initialize that threshold to UINT_MAX which is more consistent with its data type. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Link: https://patch.msgid.link/8490144.T7Z3S40VBb@rjwysocki.net	2025-02-25 12:00:28 +01:00
Rafael J. Wysocki	13982929fb	cpuidle: menu: Use one loop for average and variance computations Use the observation that one loop is sufficient to compute the average of an array of values and their variance to eliminate one of the loops from get_typical_interval(). While at it, make get_typical_interval() consistently use u64 as the 64-bit unsigned integer data type and rearrange some white space and the declarations of local variables in it (to make them follow the reverse X-mas tree pattern). No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Link: https://patch.msgid.link/3339073.aeNJFYEL58@rjwysocki.net	2025-02-25 12:00:17 +01:00
Rafael J. Wysocki	d2cd195b57	cpuidle: menu: Drop a redundant local variable Local variable min in get_typical_interval() is updated, but never accessed later, so drop it. No functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Link: https://patch.msgid.link/13699686.uLZWGnKmhe@rjwysocki.net	2025-02-25 11:59:02 +01:00
Rafael J. Wysocki	16c8d7586c	cpuidle: teo: Skip sleep length computation for low latency constraints If the idle state exit latency constraint is sufficiently low, it is better to avoid the additional latency related to calling tick_nohz_get_sleep_length(). It is also not necessary to compute the sleep length in that case because shallow idle state selection will be forced then regardless of the recent wakeup history. Accordingly, skip the sleep length computation and subsequent checks of the exit latency constraint is low enough. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/6122398.lOV4Wx5bFT@rjwysocki.net	2025-01-20 17:19:53 +01:00
Rafael J. Wysocki	65e18e6544	cpuidle: teo: Replace time_span_ns with a flag After recent updates, the time_span_ns field in struct teo_cpu has become an indicator on whether or not the most recent wakeup has been "genuine" which may as well be indicated by a bool field without calling local_clock(), so update the code accordingly. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/6010475.MhkbZ0Pkbq@rjwysocki.net	2025-01-20 17:18:02 +01:00
Rafael J. Wysocki	ddcfa79646	cpuidle: teo: Simplify handling of total events count Instead of computing the total events count from scratch every time, decay it and add a PULSE value to it in teo_update(). No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/9388883.CDJkKcVGEf@rjwysocki.net	2025-01-20 17:17:56 +01:00
Rafael J. Wysocki	13ed5c4a6d	cpuidle: teo: Skip getting the sleep length if wakeups are very frequent Commit `6da8f9ba5a` ("cpuidle: teo: Skip tick_nohz_get_sleep_length() call in some cases") attempted to reduce the governor overhead in some cases by making it avoid obtaining the sleep length (the time till the next timer event) which may be costly. Among other things, after the above commit, tick_nohz_get_sleep_length() was not called any more when idle state 0 was to be returned, which turned out to be problematic and the previous behavior in that respect was restored by commit `4b20b07ce7` ("cpuidle: teo: Don't count non- existent intercepts"). However, commit `6da8f9ba5a` also caused the governor to avoid calling tick_nohz_get_sleep_length() on systems where idle state 0 is a "polling" one (that is, it is not really an idle state, but a loop continuously executed by the CPU) when the target residency of the idle state to be returned was low enough, so there was no practical need to refine the idle state selection in any way. This change was not removed by the other commit, so now on systems where idle state 0 is a "polling" one, tick_nohz_get_sleep_length() is called when idle state 0 is to be returned, but it is not called when a deeper idle state with sufficiently low target residency is to be returned. That is arguably confusing and inconsistent. Moreover, there is no specific reason why the behavior in question should depend on whether or not idle state 0 is a "polling" one. One way to address this would be to make the governor always call tick_nohz_get_sleep_length() to obtain the sleep length, but that would effectively mean reverting commit `6da8f9ba5a` and restoring the latency issue that was the reason for doing it. This approach is thus not particularly attractive. To address it differently, notice that if a CPU is woken up very often, this is not likely to be caused by timers in the first place (user space has a default timer slack of 50 us and there are relatively few timers with a deadline shorter than several microseconds in the kernel) and even if it were the case, the potential benefit from using a deep idle state would then be questionable for latency reasons. Therefore, if the majority of CPU wakeups occur within several microseconds, it can be assumed that all wakeups in that range are non-timer and the sleep length need not be determined. Accordingly, introduce a new metric for counting wakeups with the measured idle duration below RESIDENCY_THRESHOLD_NS and modify the idle state selection to skip the tick_nohz_get_sleep_length() invocation if idle state 0 has been selected or the target residency of the candidate idle state is below RESIDENCY_THRESHOLD_NS and the value of the new metric is at least 1/2 of the total event count. Since the above requires the measured idle duration to be determined every time, except for the cases when one of the safety nets has triggered in which the wakeup is counted as a hit in the deepest idle state idle residency range, update the handling of those cases to avoid skipping the idle duration computation when the CPU wakeup is "genuine". Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/3851791.kQq0lBPeGt@rjwysocki.net Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> [ rjw: Renamed a struct field ] [ rjw: Fixed typo in the subject and one in a comment ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2025-01-20 17:17:02 +01:00
Rafael J. Wysocki	d619b5cc67	cpuidle: teo: Simplify counting events used for tick management Replace the tick_hits metric with a new tick_intercepts one that can be used directly when deciding whether or not to stop the scheduler tick and update the governor functional description accordingly. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/1987985.PYKUYFuaPT@rjwysocki.net	2025-01-20 17:16:53 +01:00
Rafael J. Wysocki	e24f8a55de	cpuidle: teo: Clarify two code comments Rewrite two code comments suposed to explain its behavior that are too concise or not sufficiently clear. No functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/8472971.T7Z3S40VBb@rjwysocki.net [ rjw: Fixed 2 typos in new comments ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2025-01-20 17:16:45 +01:00
Rafael J. Wysocki	b9a6af26bd	cpuidle: teo: Drop local variable prev_intercept_idx Local variable prev_intercept_idx in teo_select() is redundant because it cannot be 0 when candidate state index is 0. The prev_intercept_idx value is the index of the deepest enabled idle state, so if it is 0, state 0 is the deepest enabled idle state, in which case it must be the only enabled idle state, but then teo_select() would have returned early before initializing prev_intercept_idx. Thus prev_intercept_idx must be nonzero and the check of it against 0 always passes, so it can be dropped altogether. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/3327997.aeNJFYEL58@rjwysocki.net [ rjw: Fixed typo in the changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2025-01-20 17:16:37 +01:00
Rafael J. Wysocki	ea185406d1	cpuidle: teo: Combine candidate state index checks against 0 There are two candidate state index checks against 0 in teo_select() that need not be separate, so combine them and update comments around them. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/13676346.uLZWGnKmhe@rjwysocki.net	2025-01-20 17:16:30 +01:00
Rafael J. Wysocki	92ce5c07b7	cpuidle: teo: Reorder candidate state index checks Since constraint_idx may be 0, the candidate state index may change to 0 after assigning constraint_idx to it, so first check if it is greater than constraint_idx (and update it if so) and then check it against 0. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/1907276.tdWV9SEqCh@rjwysocki.net	2025-01-20 17:16:16 +01:00
Rafael J. Wysocki	425b753645	cpuidle: teo: Rearrange idle state lookup code Rearrange code in the idle state lookup loop in teo_select() to make it somewhat easier to follow and update comments around it. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/4619938.LvFx2qVVIh@rjwysocki.net	2025-01-20 17:15:44 +01:00
Rafael J. Wysocki	5a597a19a2	cpuidle: teo: Update documentation after previous changes After previous changes, the description of the teo governor in the documentation comment does not match the code any more, so update it as appropriate. Fixes: `4499143980` ("cpuidle: teo: Remove recent intercepts metric") Fixes: `2662342079` ("cpuidle: teo: Gather statistics regarding whether or not to stop the tick") Fixes: `6da8f9ba5a` ("cpuidle: teo: Skip tick_nohz_get_sleep_length() call in some cases") Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/6120335.lOV4Wx5bFT@rjwysocki.net [ rjw: Corrected 3 typos found by Christian ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2025-01-13 20:46:27 +01:00
Christian Loehle	38f83090f5	cpuidle: menu: Remove iowait influence Remove CPU iowaiters influence on idle state selection. Remove the menu notion of performance multiplier which increased with the number of tasks that went to iowait sleep on this CPU and haven't woken up yet. Relying on iowait for cpuidle is problematic for a few reasons: 1. There is no guarantee that an iowaiting task will wake up on the same CPU. 2. The task being in iowait says nothing about the idle duration, we could be selecting shallower states for a long time. 3. The task being in iowait doesn't always imply a performance hit with increased latency. 4. If there is such a performance hit, the number of iowaiting tasks doesn't directly correlate. 5. The definition of iowait altogether is vague at best, it is sprinkled across kernel code. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/20240905092645.2885200-2-christian.loehle@arm.com [ rjw: Minor edits in the changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2024-09-30 17:00:55 +02:00
Christian Loehle	4b20b07ce7	cpuidle: teo: Don't count non-existent intercepts When bailing out early, teo will not query the sleep length anymore since commit `6da8f9ba5a` ("cpuidle: teo: Skip tick_nohz_get_sleep_length() call in some cases") with an expected sleep_length_ns value of KTIME_MAX. This lead to state0 accumulating lots of 'intercepts' because the actually measured sleep length was < KTIME_MAX, so query the sleep length instead for teo to recognize if it still is in an intercept-likely scenario without alternating between the two modes. Fundamentally we can only do one of the two: 1. Skip sleep_length_ns query when we think intercept is likely. 2. Have accurate data if sleep_length_ns is actually intercepted when we believe it is currently intercepted. Previously teo did the former while this patch chooses the latter as the additional time it takes to query the sleep length was found to be negligible and the variants of option 1 (count all unknowns as misses or count all unknown as hits) had significant regressions (as misses had lots of too shallow idle state selections and as hits had terrible performance in intercept-heavy workloads). Fixes: `6da8f9ba5a` ("cpuidle: teo: Skip tick_nohz_get_sleep_length() call in some cases") Link: https://patch.msgid.link/c40acf72-010f-4a8b-80e4-33f133ba266b@arm.com Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2024-07-01 18:58:55 +02:00
Christian Loehle	4499143980	cpuidle: teo: Remove recent intercepts metric The logic for recent intercepts didn't work, there is an underflow of the 'recent' value that can be observed during boot already, which teo usually doesn't recover from, making the entire logic pointless. Furthermore the recent intercepts also were never reset, thus not actually being very 'recent'. Having underflowed 'recent' values lead to teo always acting as if we were in a scenario were expected sleep length based on timers is too high and it therefore unnecessarily selecting shallower states. Experiments show that the remaining 'intercept' logic is enough to quickly react to scenarios in which teo cannot rely on the timer expected sleep length. See also here: https://lore.kernel.org/lkml/0ce2d536-1125-4df8-9a5b-0d5e389cd8af@arm.com/ Fixes: `77577558f2` ("cpuidle: teo: Rework most recent idle duration values treatment") Link: https://patch.msgid.link/20240628095955.34096-3-christian.loehle@arm.com Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2024-06-28 21:03:21 +02:00
Christian Loehle	0a2998fa48	Revert: "cpuidle: teo: Introduce util-awareness" This reverts commit `9ce0f7c4bc`. Util-awareness was reported to be too aggressive in selecting shallower states. Additionally a single threshold was found to not be suitable for reasoning about sleep length as, for all practical purposes, almost arbitrary sleep lengths are still possible for any load value. Fixes: `9ce0f7c4bc` ("cpuidle: teo: Introduce util-awareness") Link: https://patch.msgid.link/20240628095955.34096-2-christian.loehle@arm.com Reported-by: Qais Yousef <qyousef@layalina.io> Reported-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Qais Yousef <qyousef@layalina.io> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2024-06-28 21:03:21 +02:00

1 2 3 4 5

215 Commits