linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-25 07:33:19 +02:00

Author	SHA1	Message	Date
Rafael J. Wysocki	e57c2bf2e8	cpuidle: governors: menu: Refine stopped tick handling This change is based on the observation that it is not in fact necessary to select a deep idle state every time the scheduler tick has been stopped before the idle state selection takes place. Namely, if the time till the closest timer (that is not the tick) is short enough, a shallow idle state can be selected because the timer will kick the CPU out of that state, so the damage from a possible overly optimistic selection will be limited. Update the menu governor in accordance with the above and use twice the tick period length as the "safe timer range" for allowing the original predicted_ns value to be used even if the tick has been stopped. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/3341782.5fSG56mABF@rafael.j.wysocki	2026-03-05 15:23:16 +01:00
Christian Loehle	93983a9f3b	cpuidle: menu: Remove single state handling cpuidle systems where the governor has no choice because there's only a single idle state are now handled by cpuidle core and bypass the governor, so remove the related handling. Signed-off-by: Christian Loehle <christian.loehle@arm.com> [ rjw: Rebase on top of the cpuidle changes merged recently ] Link: https://patch.msgid.link/20260216185005.1131593-5-aboorvad@linux.ibm.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2026-02-17 15:49:53 +01:00
Rafael J. Wysocki	80606f4eb8	cpuidle: governors: menu: Always check timers with tick stopped After commit `5484e31bbb` ("cpuidle: menu: Skip tick_nohz_get_sleep_length() call in some cases"), if the return value of get_typical_interval() multiplied by NSEC_PER_USEC is not greater than RESIDENCY_THRESHOLD_NS, the menu governor will skip computing the time till the closest timer. If that happens when the tick has been stopped already, the selected idle state may be too deep due to the subsequent check comparing predicted_ns with TICK_NSEC and causing its value to be replaced with the expected time till the closest timer, which is KTIME_MAX in that case. That will cause the deepest enabled idle state to be selected, but the time till the closest timer very well may be shorter than the target residency of that state, in which case a shallower state should be used. Address this by making menu_select() always compute the time till the closest timer when the tick has been stopped. Also move the predicted_ns check mentioned above into the branch in which the time till the closest timer is determined because it only needs to be done in that case. Fixes: `5484e31bbb` ("cpuidle: menu: Skip tick_nohz_get_sleep_length() call in some cases") Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/5959091.DvuYhMxLoT@rafael.j.wysocki	2026-01-23 21:22:42 +01:00
Breno Leitao	fcbd7897b8	cpuidle: menu: Remove incorrect unlikely() annotation The unlikely() annotation on the early-return condition in menu_select() is incorrect on systems with only one idle state (e.g., ARM64 servers with a single ACPI LPI state). Branch profiling shows 100% misprediction on such systems since drv->state_count <= 1 is always true. On platforms where only state0 is available, this path is the common case, not an unlikely edge case. Remove the misleading annotation to let the branch predictor learn the actual behavior. Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20260105-annotated_idle-v1-1-10ddf0771b58@debian.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2026-01-09 21:52:54 +01:00
Aboorva Devarajan	07d8157012	cpuidle: menu: Use residency threshold in polling state override decisions On virtualized PowerPC (pseries) systems, where only one polling state (Snooze) and one deep state (CEDE) are available, selecting CEDE when the predicted idle duration is less than the target residency of CEDE state can hurt performance. In such cases, the entry/exit overhead of CEDE outweighs the power savings, leading to unnecessary state transitions and higher latency. Menu governor currently contains a special-case rule that prioritizes the first non-polling state over polling, even when its target residency is much longer than the predicted idle duration. On PowerPC/pseries, where the gap between the polling state (Snooze) and the first non-polling state (CEDE) is large, this behavior causes performance regressions. Refine that special case by adding an extra requirement: the first non-polling state can only be chosen if its target residency is below the defined RESIDENCY_THRESHOLD_NS. If this condition is not satisfied, polling is allowed instead, avoiding suboptimal non-polling state entries. This change is limited to the single special-case rule for the first non-polling state. The general non-polling state selection logic in the menu governor remains unchanged. 
Performance improvement observed with pgbench on PowerPC (pseries) system:

+---------------------------+------------+------------+------------+
| Metric                    | Baseline   | Patched    | Change (%) |
+---------------------------+------------+------------+------------+
| Transactions/sec (TPS)    | 495,210    | 536,982    | +8.45%     |
| Avg latency (ms)          | 0.163      | 0.150      | -7.98%     |
+---------------------------+------------+------------+------------+

CPUIdle state usage:

+--------------+--------------+-------------+
| Metric       | Baseline     | Patched     |
+--------------+--------------+-------------+
| Total usage  | 12,735,820   | 13,918,442  |
| Above usage  | 11,401,520   | 1,598,210   |
| Below usage  | 20,145       | 702,395     |
+--------------+--------------+-------------+

Above/Total and Below/Total usage percentages:

+------------------------+-----------+---------+
| Metric                 | Baseline  | Patched |
+------------------------+-----------+---------+
| Above % (Above/Total)  | 89.56%    | 11.49%  |
| Below % (Below/Total)  | 0.16%     | 5.05%   |
| Total cpuidle miss (%) | 89.72%    | 16.54%  |
+------------------------+-----------+---------+

The results indicate that restricting CEDE selection to cases where its residency matches the predicted idle time reduces mispredictions, lowers unnecessary state transitions, and improves overall throughput. Reviewed-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Aboorva Devarajan <aboorvad@linux.ibm.com> [ rjw: Changelog edits, rebase ] Link: https://patch.msgid.link/20251006013954.17972-1-aboorvad@linux.ibm.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2025-10-27 14:53:46 +01:00
Rafael J. Wysocki	db86f55bf8	cpuidle: governors: menu: Select polling state in some more cases A throughput regression of 11% introduced by commit `779b1a1cb1` ("cpuidle: governors: menu: Avoid selecting states with too much latency") has been reported and it is related to the case when the menu governor checks if selecting a proper idle state instead of a polling one makes sense. In particular, it is questionable to do so if the exit latency of the idle state in question exceeds the predicted idle duration, so add a check for that, which is sufficient to make the reported regression go away, and update the related code comment accordingly. Fixes: `779b1a1cb1` ("cpuidle: governors: menu: Avoid selecting states with too much latency") Closes: https://lore.kernel.org/linux-pm/004501dc43c9$ec8aa930$c59ffb90$@telus.net/ Reported-by: Doug Smythies <dsmythies@telus.net> Tested-by: Doug Smythies <dsmythies@telus.net> Cc: All applicable <stable@vger.kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/12786727.O9o76ZdvQC@rafael.j.wysocki	2025-10-27 14:41:27 +01:00
Rafael J. Wysocki	10fad40122	Revert "cpuidle: menu: Avoid discarding useful information" It is reported that commit `85975daeaa` ("cpuidle: menu: Avoid discarding useful information") led to a performance regression on Intel Jasper Lake systems because it reduced the time spent by CPUs in idle state C7 which is correlated to the maximum frequency the CPUs can get to because of an average running power limit [1]. Before that commit, get_typical_interval() would have returned UINT_MAX whenever it had been unable to make a high-confidence prediction which had led to selecting the deepest available idle state too often and both power and performance had been inadequate as a result of that on some systems. However, this had not been a problem on systems with relatively aggressive average running power limits, like the Jasper Lake systems in question, because on those systems it was compensated by the ability to run CPUs faster. It was addressed by causing get_typical_interval() to return a number based on the recent idle duration information available to it even if it could not make a high-confidence prediction, but that clearly did not take the possible correlation between idle power and available CPU capacity into account. For this reason, revert most of the changes made by commit `85975daeaa`, except for one cosmetic cleanup, and add a comment explaining the rationale for returning UINT_MAX from get_typical_interval() when it is unable to make a high-confidence prediction. Fixes: `85975daeaa` ("cpuidle: menu: Avoid discarding useful information") Closes: https://lore.kernel.org/linux-pm/36iykr223vmcfsoysexug6s274nq2oimcu55ybn6ww4il3g3cv@cohflgdbpnq7/ [1] Reported-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: All applicable <stable@vger.kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/3663603.iIbC2pHGDl@rafael.j.wysocki	2025-10-20 21:27:16 +02:00
Rafael J. Wysocki	17224c1d25	cpuidle: governors: menu: Rearrange main loop in menu_select() Reduce the indentation level in the main loop of menu_select() by rearranging some checks and assignments in it. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/2389215.ElGaqSPkdT@rafael.j.wysocki	2025-08-21 22:02:28 +02:00
Rafael J. Wysocki	779b1a1cb1	cpuidle: governors: menu: Avoid selecting states with too much latency Occasionally, the exit latency of the idle state selected by the menu governor may exceed the PM QoS CPU wakeup latency limit. Namely, if the scheduler tick has been stopped already and predicted_ns is greater than the tick period length, the governor may return an idle state whose exit latency exceeds latency_req because that decision is made before checking the current idle state's exit latency. For instance, say that there are 3 idle states, 0, 1, and 2. For idle states 0 and 1, the exit latency is equal to the target residency and the values are 0 and 5 us, respectively. State 2 is deeper and has the exit latency and target residency of 200 us and 2 ms (which is greater than the tick period length), respectively. Say that predicted_ns is equal to TICK_NSEC and the PM QoS latency limit is 20 us. After the first two iterations of the main loop in menu_select(), idx becomes 1 and in the third iteration of it the target residency of the current state (state 2) is greater than predicted_ns. State 2 is not a polling one and predicted_ns is not less than TICK_NSEC, so the check on whether or not the tick has been stopped is done. Say that the tick has been stopped already and there are no imminent timers (that is, delta_tick is greater than the target residency of state 2). In that case, idx becomes 2 and it is returned immediately, but the exit latency of state 2 exceeds the latency limit. Address this issue by modifying the code to compare the exit latency of the current idle state (idle state i) with the latency limit before comparing its target residency with predicted_ns, which allows one more exit_latency_ns check that becomes redundant to be dropped. 
However, after the above change, latency_req cannot take the predicted_ns value any more, which takes place after commit `38f83090f5` ("cpuidle: menu: Remove iowait influence"), because it may cause a polling state to be returned prematurely. In the context of the previous example say that predicted_ns is 3000 and the PM QoS latency limit is still 20 us. Additionally, say that idle state 0 is a polling one. Moving the exit_latency_ns check before the target_residency_ns one causes the loop to terminate in the second iteration, before the target_residency_ns check, so idle state 0 will be returned even though previously state 1 would be returned if there were no imminent timers. For this reason, remove the assignment of the predicted_ns value to latency_req from the code. Fixes: `5ef499cd57` ("cpuidle: menu: Handle stopped tick more aggressively") Cc: 4.17+ <stable@vger.kernel.org> # 4.17+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/5043159.31r3eYUQgx@rafael.j.wysocki	2025-08-18 19:04:25 +02:00
Rafael J. Wysocki	fa3fa55de0	cpuidle: governors: menu: Avoid using invalid recent intervals data Marc has reported that commit `85975daeaa` ("cpuidle: menu: Avoid discarding useful information") caused the number of wakeup interrupts to increase on an idle system [1], which was not expected to happen after merely allowing shallower idle states to be selected by the governor in some cases. However, on the system in question, all of the idle states deeper than WFI are rejected by the driver due to a firmware issue [2]. This causes the governor to only consider the recent interval duration data corresponding to attempts to enter WFI that are successful and the recent intervals table is filled with values lower than the scheduler tick period. Consequently, the governor predicts an idle duration below the scheduler tick period length and avoids stopping the tick more often which leads to the observed symptom. Address it by modifying the governor to update the recent intervals table also when entering the previously selected idle state fails, so it knows that the short idle intervals might have been the minority had the selected idle states been actually entered every time. Fixes: `85975daeaa` ("cpuidle: menu: Avoid discarding useful information") Link: https://lore.kernel.org/linux-pm/86o6sv6n94.wl-maz@kernel.org/ [1] Link: https://lore.kernel.org/linux-pm/7ffcb716-9a1b-48c2-aaa4-469d0df7c792@arm.com/ [2] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/2793874.mvXUDI8C0e@rafael.j.wysocki	2025-08-11 21:46:14 +02:00
Zhongqiu Han	d4a7882f93	cpuidle: menu: Optimize bucket assignment when next_timer_ns equals KTIME_MAX Directly assign the last bucket value instead of calling which_bucket() when next_timer_ns equals KTIME_MAX, the largest possible value that always falls into the last bucket. This avoids unnecessary calculations and enhances performance. Reviewed-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Zhongqiu Han <quic_zhonhan@quicinc.com> Link: https://patch.msgid.link/20250405135308.1854342-1-quic_zhonhan@quicinc.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2025-04-09 19:37:30 +02:00
Rafael J. Wysocki	5c35041099	cpuidle: menu: Update documentation after get_typical_interval() changes The documentation of the menu cpuidle governor needs to be updated to match the code behavior after some changes made recently. No functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/4998484.31r3eYUQgx@rjwysocki.net [ rjw: More specific subject, two typos fixed in the changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2025-02-25 12:12:46 +01:00
Rafael J. Wysocki	85975daeaa	cpuidle: menu: Avoid discarding useful information When giving up on making a high-confidence prediction, get_typical_interval() always returns UINT_MAX which means that the next idle interval prediction will be based entirely on the time till the next timer. However, the information represented by the most recent intervals may not be completely useless in those cases. Namely, the largest recent idle interval is an upper bound on the recently observed idle duration, so it is reasonable to assume that the next idle duration is unlikely to exceed it. Moreover, this is still true after eliminating the suspected outliers if the sample set still under consideration is at least as large as 50% of the maximum sample set size. Accordingly, make get_typical_interval() return the current maximum recent interval value in that case instead of UINT_MAX. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reported-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Link: https://patch.msgid.link/7770672.EvYhyI6sBW@rjwysocki.net	2025-02-25 12:00:43 +01:00
Rafael J. Wysocki	8de7606f0f	cpuidle: menu: Eliminate outliers on both ends of the sample set Currently, get_typical_interval() attempts to eliminate outliers at the high end of the sample set only (probably in order to bias the prediction toward lower values), but this is problematic because if the outliers are present at the low end of the sample set, discarding the highest values will not help to reduce the variance. Since the presence of outliers at the low end of the sample set is generally as likely as their presence at the high end of the sample set, modify get_typical_interval() to treat samples at the largest distances from the average (on both ends of the sample set) as outliers. This should increase the likelihood of making a meaningful prediction in some cases. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reported-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Link: https://patch.msgid.link/2301940.iZASKD2KPV@rjwysocki.net	2025-02-25 12:00:35 +01:00
Rafael J. Wysocki	60256e458e	cpuidle: menu: Tweak threshold use in get_typical_interval() To prepare get_typical_interval() for subsequent changes, rearrange the use of the data point threshold in it a bit and initialize that threshold to UINT_MAX which is more consistent with its data type. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Link: https://patch.msgid.link/8490144.T7Z3S40VBb@rjwysocki.net	2025-02-25 12:00:28 +01:00
Rafael J. Wysocki	13982929fb	cpuidle: menu: Use one loop for average and variance computations Use the observation that one loop is sufficient to compute the average of an array of values and their variance to eliminate one of the loops from get_typical_interval(). While at it, make get_typical_interval() consistently use u64 as the 64-bit unsigned integer data type and rearrange some white space and the declarations of local variables in it (to make them follow the reverse X-mas tree pattern). No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Link: https://patch.msgid.link/3339073.aeNJFYEL58@rjwysocki.net	2025-02-25 12:00:17 +01:00
Rafael J. Wysocki	d2cd195b57	cpuidle: menu: Drop a redundant local variable Local variable min in get_typical_interval() is updated, but never accessed later, so drop it. No functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Link: https://patch.msgid.link/13699686.uLZWGnKmhe@rjwysocki.net	2025-02-25 11:59:02 +01:00
Christian Loehle	38f83090f5	cpuidle: menu: Remove iowait influence Remove CPU iowaiters influence on idle state selection. Remove the menu notion of performance multiplier which increased with the number of tasks that went to iowait sleep on this CPU and haven't woken up yet. Relying on iowait for cpuidle is problematic for a few reasons: 1. There is no guarantee that an iowaiting task will wake up on the same CPU. 2. The task being in iowait says nothing about the idle duration, we could be selecting shallower states for a long time. 3. The task being in iowait doesn't always imply a performance hit with increased latency. 4. If there is such a performance hit, the number of iowaiting tasks doesn't directly correlate. 5. The definition of iowait altogether is vague at best, it is sprinkled across kernel code. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/20240905092645.2885200-2-christian.loehle@arm.com [ rjw: Minor edits in the changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2024-09-30 17:00:55 +02:00
Christian Loehle	bf18311384	cpuidle: menu: Cleanup after loadavg removal The performance impact of loadavg was removed with commit `a7fe5190c0` ("cpuidle: menu: Remove get_loadavg() from the performance multiplier"). With only iowait remaining, the description can be simplified; also remove the no longer needed includes. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2024-06-07 20:55:00 +02:00
Rafael J. Wysocki	5484e31bbb	cpuidle: menu: Skip tick_nohz_get_sleep_length() call in some cases Because the cost of calling tick_nohz_get_sleep_length() may increase in the future, reorder the code in menu_select() so it first uses the statistics to determine the expected idle duration. If that value is higher than RESIDENCY_THRESHOLD_NS, tick_nohz_get_sleep_length() will be called to obtain the time till the closest timer and refine the idle duration prediction if necessary. This causes the governor to always take the full overhead of get_typical_interval() with the assumption that the cost will be amortized by skipping the tick_nohz_get_sleep_length() call in the cases when the predicted idle duration is relatively very small. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Doug Smythies <dsmythies@telus.net>	2023-08-17 11:28:38 +02:00
Jason Wang	14e6c70671	cpuidle: menu: Fix typo in a comment The word `these' in a comment is repeated, so drop one. Signed-off-by: Jason Wang <wangborong@cdjrlc.com> [ rjw: Changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2021-11-24 17:30:44 +01:00
Alexey Dobriyan	8fc2858e57	sched: Make nr_iowait_cpu() return 32-bit value Runqueue ->nr_iowait counters are 32-bit anyway. Propagate 32-bitness into other code, but don't try too hard. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210422200228.1423391-3-adobriyan@gmail.com	2021-05-12 21:34:16 +02:00
Rafael J. Wysocki	060e3535ad	cpuidle: menu: Take negative "sleep length" values into account Make the menu governor check the tick_nohz_get_next_hrtimer() return value so as to avoid dealing with negative "sleep length" values and make it use that value directly when the tick is stopped. While at it, rename local variable delta_next in menu_select() to delta_tick which better reflects its purpose. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2021-04-07 19:26:44 +02:00
Rafael J. Wysocki	c1d51f684c	cpuidle: Use nanoseconds as the unit of time Currently, the cpuidle subsystem uses microseconds as the unit of time which (among other things) causes the idle loop to incur some integer division overhead for no clear benefit. In order to allow cpuidle to measure time in nanoseconds, add two new fields, exit_latency_ns and target_residency_ns, to represent the exit latency and target residency of an idle state in nanoseconds, respectively, to struct cpuidle_state and initialize them with the help of the corresponding values in microseconds provided by drivers. Additionally, change cpuidle_governor_latency_req() to return the idle state exit latency constraint in nanoseconds. Also measure idle state residency (last_residency_ns in struct cpuidle_device and time_ns in struct cpuidle_driver) in nanoseconds and update the cpuidle core and governors accordingly. However, the menu governor still computes typical intervals in microseconds to avoid integer overflows. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Doug Smythies <dsmythies@telus.net> Tested-by: Doug Smythies <dsmythies@telus.net>	2019-11-11 21:56:07 +01:00
Rafael J. Wysocki	99e98d3fb1	cpuidle: Consolidate disabled state checks There are two reasons why CPU idle states may be disabled: either because the driver has disabled them or because they have been disabled by user space via sysfs. In the former case, the state's "disabled" flag is set once during the initialization of the driver and it is never cleared later (it is read-only effectively). In the latter case, the "disable" field of the given state's cpuidle_state_usage struct is set and it may be changed via sysfs. Thus checking whether or not an idle state has been disabled involves reading these two flags every time. In order to avoid the additional check of the state's "disabled" flag (which is effectively read-only anyway), use the value of it at the init time to set a (new) flag in the "disable" field of that state's cpuidle_state_usage structure and use the sysfs interface to manipulate another (new) flag in it. This way the state is disabled whenever the "disable" field of its cpuidle_state_usage structure is nonzero, whatever the reason, and it is the only place to look into to check whether or not the state has been disabled. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>	2019-11-06 13:19:56 +01:00
Rafael J. Wysocki	32b91ca153	cpuidle: menu: Allow tick to be stopped if PM QoS is used After commit `554c8aa8ec` ("sched: idle: Select idle state before stopping the tick") the menu governor prevents the scheduler tick from being stopped (unless stopped already) if there is a PM QoS latency constraint for the given CPU and the target residency of the deepest idle state matching that constraint is below the tick boundary. However, that is problematic if CPUs with PM QoS latency constraints are idle for long times, because it effectively causes the tick to run on them all the time which is wasteful. [It is also confusing and questionable if they are full dynticks CPUs.] To address that issue, make the menu governor allow the tick to be stopped only if the idle duration predicted by it is beyond the tick boundary, except when the shallowest idle state is selected upfront and it is not a "polling" one. Fixes: `554c8aa8ec` ("sched: idle: Select idle state before stopping the tick") Link: https://lore.kernel.org/lkml/79b247b3-e056-610e-9a07-e685dfdaa6c9@gmail.com/ Reported-by: Thomas Lindroth <thomas.lindroth@gmail.com> Tested-by: Thomas Lindroth <thomas.lindroth@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2019-08-05 11:02:44 +02:00
Marcelo Tosatti	7d4daeedd5	governors: unify last_state_idx Since this field is shared by all governors, move it to cpuidle device structure. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2019-07-30 17:27:37 +02:00
Thomas Gleixner	7925f8f78f	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 215 Based on 1 normalized pattern(s): this code is licenced under the gpl version 2 as described in the copying file that acompanies the linux kernel extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 1 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexios Zavras <alexios.zavras@intel.com> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Steve Winslow <swinslow@gmail.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190528171439.466585205@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-05-30 11:29:54 -07:00
Rafael J. Wysocki	814b8797f9	cpuidle: menu: Avoid overflows when computing variance The variance computation in get_typical_interval() may overflow if the square of the value of diff exceeds the maximum for the int64_t data type value which basically is the case when it is of the order of UINT_MAX. However, data points so far in the future don't matter for idle state selection anyway, so change the initial threshold value in get_typical_interval() to INT_MAX which will cause more "outlying" data points to be discarded without affecting the selection result. Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2019-03-07 10:54:22 +01:00
Linus Torvalds	6ef746769e	More power management updates for 4.20-rc1 - Fix build regression in the intel_pstate driver that doesn't build without CONFIG_ACPI after recent changes (Dominik Brodowski). - One of the heuristics in the menu cpuidle governor is based on a function returning 0 most of the time, so drop it and clean up the scheduler code related to it (Daniel Lezcano). - Prevent the arm_big_little cpufreq driver from being used on ARM64 which is not suitable for it and drop the arm_big_little_dt driver that is not used any more (Sudeep Holla). - Prevent the hung task watchdog from triggering during resume from system-wide sleep states by disabling it before freezing tasks and enabling it again after they have been thawed (Vitaly Kuznetsov). Merge tag 'pm-4.20-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more power management updates from Rafael Wysocki: "These remove a questionable heuristic from the menu cpuidle governor, fix a recent build regression in the intel_pstate driver, clean up ARM big-Little support in cpufreq and fix up hung task watchdog's interaction with system-wide power management transitions.
Specifics: - Fix build regression in the intel_pstate driver that doesn't build without CONFIG_ACPI after recent changes (Dominik Brodowski). - One of the heuristics in the menu cpuidle governor is based on a function returning 0 most of the time, so drop it and clean up the scheduler code related to it (Daniel Lezcano). - Prevent the arm_big_little cpufreq driver from being used on ARM64 which is not suitable for it and drop the arm_big_little_dt driver that is not used any more (Sudeep Holla). - Prevent the hung task watchdog from triggering during resume from system-wide sleep states by disabling it before freezing tasks and enabling it again after they have been thawed (Vitaly Kuznetsov)" * tag 'pm-4.20-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: kernel: hung_task.c: disable on suspend cpufreq: remove unused arm_big_little_dt driver cpufreq: drop ARM_BIG_LITTLE_CPUFREQ support for ARM64 cpufreq: intel_pstate: Fix compilation for !CONFIG_ACPI cpuidle: menu: Remove get_loadavg() from the performance multiplier sched: Factor out nr_iowait and nr_iowait_cpu	2018-10-30 09:08:07 -07:00
Johannes Weiner	8508cf3ffa	sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD There are several definitions of those functions/macros in places that mess with fixed-point load averages. Provide an official version. [akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c] Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Daniel Drake <drake@endlessm.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-10-26 16:26:32 -07:00
Daniel Lezcano	a7fe5190c0	cpuidle: menu: Remove get_loadavg() from the performance multiplier The function get_loadavg() returns almost always zero. To be more precise, statistically speaking for a total of 1023379 times passing in the function, the load is equal to zero 1020728 times, greater than 100, 610 times, the remaining is between 0 and 5. In 2011, the get_loadavg() was removed from the Android tree because of the above [1]. At this time, the load was: unsigned long this_cpu_load(void) { struct rq *this = this_rq(); return this->cpu_load[0]; } In 2014, the code was changed by commit `372ba8cb46` (cpuidle: menu: Lookup CPU runqueues less) and the load is: void get_iowait_load(unsigned long *nr_waiters, unsigned long *load) { struct rq *rq = this_rq(); *nr_waiters = atomic_read(&rq->nr_iowait); *load = rq->load.weight; } with the same result. Both measurements show using the load in this code path does not matter anymore. Removing it. [1] `4dedd9f124` Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2018-10-25 16:49:27 +02:00
Rafael J. Wysocki	f1c8e410cd	cpuidle: menu: Avoid computations when result will be discarded If the minimum interval taken into account in the average computation loop in get_typical_interval() is less than the expected idle duration determined so far, the resultant average cannot be greater than that value as well and the entire return result of the function is going to be discarded anyway going forward. In that case, it is a waste of time to carry out the remaining computations in get_typical_interval(), so avoid that by returning early if the minimum interval is not below the expected idle duration. No intentional changes of behavior. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2018-10-18 09:34:13 +02:00
Rafael J. Wysocki	12b65eadf0	cpuidle: menu: Drop redundant comparison Since the correction factor cannot be greater than RESOLUTION * DECAY, the result of the predicted_us computation in menu_select() cannot be greater than data->next_timer_us, so it is not necessary to compare the "typical interval" value coming from get_typical_interval() with data->next_timer_us separately. It is sufficient to compare predicted_us with the return value of get_typical_interval() directly, so do that and drop the now redundant expected_interval variable. No intentional changes of behavior. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2018-10-18 09:34:13 +02:00
Rafael J. Wysocki	bde091ece2	cpuidle: menu: Simplify checks related to the polling state After some recent menu governor changes, the promotion of the "polling" state to a physical one is mostly controlled by the latency limit (resulting from the "interactivity" factor) and not by the time to the closest timer event, so it should be sufficient to check the exit latency of that state for this purpose (of course, its target residency still needs to be within the next timer event range for energy-efficiency). Also, the physical state the "polling" one is promoted to need not be the next one in principle (in case the next state is disabled, for example). For these reasons, simplify the checks made to decide whether or not to promote the "polling" state to a physical one and update the target idle duration when it is promoted in case the residency of the new state turns out to be above the tick boundary (in which case there is no reason to stop the tick). Tested-by: Doug Smythies <dsmythies@telus.net> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2018-10-12 10:46:37 +02:00
Rafael J. Wysocki	53812cdc91	cpuidle: menu: Move the latency_req == 0 special case check It is better to always update data->bucket before returning from menu_select() to avoid updating the correction factor for a stale bucket, so combine the latency_req == 0 special check with the more general check below. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>	2018-10-04 19:27:27 +02:00
Rafael J. Wysocki	8b007ebec9	cpuidle: menu: Avoid computations for very close timers If the next timer event (not including the tick) is closer than the target residency of the second state or the PM QoS latency constraint is below its exit latency, state[0] will be used regardless of any other factors, so skip the computations in menu_select() then and return 0 straight away from it. Still, do that after the bucket has been determined to avoid updating the correction factor for a stale bucket. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>	2018-10-04 19:27:27 +02:00
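The shortcut condition described in commit 8b007ebec9 can be sketched as a small predicate. This is a simplified illustration under stated assumptions, not the actual menu_select() code: the struct `idle_state_sketch`, its field names and the function `must_use_state0` are hypothetical, all times are taken to be in microseconds, and checks such as whether state 0 is disabled are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified idle-state descriptor (times in microseconds). */
struct idle_state_sketch {
    uint32_t exit_latency_us;
    uint32_t target_residency_us;
};

/*
 * Returns true when state 0 would be used regardless of other factors:
 * either the next timer event is closer than the target residency of the
 * second state (states[1]), or the latency limit is below that state's
 * exit latency. The array is assumed to hold at least two entries.
 */
static bool must_use_state0(const struct idle_state_sketch *states,
                            uint64_t next_timer_us,
                            uint64_t latency_req_us)
{
    return next_timer_us < states[1].target_residency_us ||
           latency_req_us < states[1].exit_latency_us;
}
```

When the predicate is true, a governor following this scheme could return 0 without running the rest of the selection logic, provided the bucket has already been determined as the commit message requires.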
Rafael J. Wysocki	eb40a380bf	cpuidle: menu: Do not update last_state_idx in menu_select() It is not necessary to update data->last_state_idx in menu_select() as it only is used in menu_update() which only runs when data->needs_update is set and that is set only when updating data->last_state_idx in menu_reflect(). Accordingly, drop the update of data->last_state_idx from menu_select() and get rid of the (now redundant) "out" label from it. No intentional behavior changes. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>	2018-10-04 19:26:38 +02:00
Rafael J. Wysocki	96c3d11df1	cpuidle: menu: Get rid of first_idx from menu_select() Rearrange the code in menu_select() so that the loop over idle states always starts from 0 and get rid of the first_idx variable. While at it, add two empty lines to separate conditional statements from one another. No intentional behavior changes. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>	2018-10-04 19:25:53 +02:00
Rafael J. Wysocki	23e8ceb9ce	cpuidle: menu: Compute first_idx when latency_req is known Since menu_select() can only set first_idx to 1 if the exit latency of the second state is not greater than the latency limit, it should first determine that limit. Thus first_idx should be computed after the "interactivity" factor has been taken into account. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>	2018-10-04 19:24:14 +02:00
Rafael J. Wysocki	5f26bdceb9	cpuidle: menu: Fix wakeup statistics updates for polling state If the CPU exits the "polling" state due to the time limit in the loop in poll_idle(), this is not a real wakeup and it just means that the "polling" state selection was not adequate. The governor mispredicted short idle duration, but had a more suitable state been selected, the CPU might have spent more time in it. In fact, there is no reason to expect that there would have been a wakeup event earlier than the next timer in that case. Handling such cases as regular wakeups in menu_update() may cause the menu governor to make suboptimal decisions going forward, but ignoring them altogether would not be correct either, because every time menu_select() is invoked, it makes a separate new attempt to predict the idle duration taking distinct time to the closest timer event as input and the outcomes of all those attempts should be recorded. For this reason, make menu_update() always assume that if the "polling" state was exited due to the time limit, the next proper wakeup event for the CPU would be the next timer event (not including the tick). Fixes: `a37b969a61` "cpuidle: poll_state: Add time limit to poll_idle()" Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>	2018-10-04 10:23:37 +02:00
Rafael J. Wysocki	03dba27804	cpuidle: menu: Replace data->predicted_us with local variable The predicted_us field in struct menu_device is only accessed in menu_select(), so replace it with a local variable in that function. With that, stop using expected_interval instead of predicted_us to store the new predicted idle duration value if it is set to the selected state's target residency which is quite confusing. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>	2018-10-03 12:02:44 +02:00
Fieah Lim	6a5f95b5a4	cpuidle: Remove unnecessary wrapper cpuidle_get_last_residency() cpuidle_get_last_residency() is just a wrapper for retrieving the last_residency member of struct cpuidle_device. It's also weirdly the only wrapper function for accessing cpuidle_* struct member (by my best guess is it could be a leftover from v2.x). Anyhow, since the only two users (the ladder and menu governors) can access dev->last_residency directly, and it's more intuitive to do it that way, let's just get rid of the wrapper. This patch tidies up CPU idle code a bit without functional changes. Signed-off-by: Fieah Lim <kw@fieahl.im> [ rjw: Changelog cleanup ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2018-09-18 09:24:44 +02:00
Rafael J. Wysocki	757ab15c3f	cpuidle: menu: Retain tick when shallow state is selected The case addressed by commit `5ef499cd57` (cpuidle: menu: Handle stopped tick more aggressively) in the stopped tick case is present when the tick has not been stopped yet too. Namely, if only two CPU idle states, shallow state A with target residency significantly below the tick boundary and deep state B with target residency significantly above it, are available and the predicted idle duration is above the tick boundary, but below the target residency of state B, state A will be selected and the CPU may spend indefinite amount of time in it, which is not quite energy-efficient. However, if the tick has not been stopped yet and the governor is about to select a shallow idle state for the CPU even though the idle duration predicted by it is above the tick boundary, it should be fine to wake up the CPU early, so the tick can be retained then and the governor will have a chance to select a deeper state when it runs next time. [Note that when this really happens, it will make the idle duration predictor believe that the CPU might be idle longer than predicted, which will make it more likely to predict longer idle durations going forward, but that will also cause deeper idle states to be selected going forward, on average, which is what's needed here.] Fixes: `87c9fe6ee4` (cpuidle: menu: Avoid selecting shallow states with stopped tick) Reported-by: Leo Yan <leo.yan@linaro.org> Cc: 4.17+ <stable@vger.kernel.org> # 4.17+: `5ef499cd57` (cpuidle: menu: Handle ...) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2018-08-25 13:16:08 +02:00
Rafael J. Wysocki	5ef499cd57	cpuidle: menu: Handle stopped tick more aggressively Commit `87c9fe6ee4` (cpuidle: menu: Avoid selecting shallow states with stopped tick) missed the case when the target residencies of deep idle states of CPUs are above the tick boundary which may cause the CPU to get stuck in a shallow idle state for a long time. Say there are two CPU idle states available: one shallow, with the target residency much below the tick boundary and one deep, with the target residency significantly above the tick boundary. In that case, if the tick has been stopped already and the expected next timer event is relatively far in the future, the governor will assume the idle duration to be equal to TICK_USEC and it will select the idle state for the CPU accordingly. However, that will cause the shallow state to be selected even though it would have been more energy-efficient to select the deep one. To address this issue, modify the governor to always use the time till the closest timer event instead of the predicted idle duration if the latter is less than the tick period length and the tick has been stopped already. Also make it extend the search for a matching idle state if the tick is stopped to avoid settling on a shallow state if deep states with target residencies above the tick period length are available. In addition, make it always indicate that the tick should be stopped if it has been stopped already for consistency. Fixes: `87c9fe6ee4` (cpuidle: menu: Avoid selecting shallow states with stopped tick) Reported-by: Leo Yan <leo.yan@linaro.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: 4.17+ <stable@vger.kernel.org> # 4.17+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2018-08-20 13:37:03 +02:00
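The stopped-tick rule introduced by commit 5ef499cd57 can be sketched as follows. This is a minimal illustration, not the governor's code: the function name `effective_idle_us` and its parameters are hypothetical, and the tick period is passed in explicitly instead of using the kernel's TICK_USEC constant.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: picks the idle duration used for state selection.
 * If the tick has already been stopped and the predicted idle duration is
 * shorter than one tick period, the prediction is replaced with the time
 * till the closest timer event; otherwise the prediction is kept.
 */
static uint64_t effective_idle_us(uint64_t predicted_us,
                                  uint64_t next_timer_us,
                                  bool tick_stopped,
                                  uint64_t tick_period_us)
{
    if (tick_stopped && predicted_us < tick_period_us)
        return next_timer_us;

    return predicted_us;
}
```

For example, with a 4000 us tick period, a 500 us prediction and a timer 20000 us away, the helper returns 20000 when the tick is stopped and 500 when it is still running, so a deep state with a target residency above the tick period remains selectable in the stopped-tick case.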
Rafael J. Wysocki	50f7ccc647	cpuidle: menu: Update stale polling override comment The comment to explain why the menu governor uses idle state 1 instead of idle state 0 as the first one sometimes is stale (among other things it mentions a user setting not present any more), so update it. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2018-08-16 23:05:43 +02:00
Rafael J. Wysocki	f390c5eb28	cpuidle: menu: Fix white space Fix some damaged white space in menu_select(). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2018-08-15 00:08:51 +02:00
Rafael J. Wysocki	0fc784fb09	cpuidle: governors: Consolidate PM QoS handling There is some code duplication related to the PM QoS handling between the existing cpuidle governors, so move that code to a common helper function and call that from the governors. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2018-05-30 23:13:00 +02:00
Rafael J. Wysocki	cf7eeea947	cpuidle: governors: Drop redundant checks related to PM QoS PM_QOS_RESUME_LATENCY_NO_CONSTRAINT is defined as the 32-bit integer maximum, so it is not necessary to test the return value of dev_pm_qos_raw_read_value() against it directly in the menu and ladder cpuidle governors. Drop these redundant checks. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2018-05-30 23:13:00 +02:00
Rafael J. Wysocki	87c9fe6ee4	cpuidle: menu: Avoid selecting shallow states with stopped tick If the scheduler tick has been stopped already and the governor selects a shallow idle state, the CPU can spend a long time in that state if the selection is based on an inaccurate prediction of idle time. That effect turns out to be relevant, so it needs to be mitigated. To that end, modify the menu governor to discard the result of the idle time prediction if the tick is stopped and the predicted idle time is less than the tick period length, unless the tick timer is going to expire soon. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>	2018-04-09 11:54:57 +02:00


135 Commits