From 4fb352df14de4b5277f38a9874f7c19cf641ae4d Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Fri, 5 Dec 2025 16:24:05 +0100 Subject: [PATCH 01/65] PM: sleep: Do not flag runtime PM workqueue as freezable Till now, the runtime PM workqueue has been flagged as freezable, so it does not process work items during system-wide PM transitions like system suspend and resume. The original reason to do that was to reduce the likelihood of runtime PM getting in the way of system-wide PM processing, but now it is mostly an optimization because (1) runtime suspend of devices is prevented by bumping up their runtime PM usage counters in device_prepare() and (2) device drivers are expected to disable runtime PM for the devices handled by them before they embark on system-wide PM activities that may change the state of the hardware or otherwise interfere with runtime PM. However, it prevents asynchronous runtime resume of devices from working during system-wide PM transitions, which is confusing because synchronous runtime resume is not prevented at the same time, and it also sometimes turns out to be problematic. For example, it has been reported that blk_queue_enter() may deadlock during a system suspend transition because of the pm_request_resume() usage in it [1]. It may also deadlock during a system resume transition in a similar way. That happens because the asynchronous runtime resume of the given device is not processed due to the freezing of the runtime PM workqueue. While it may be better to address this particular issue in the block layer, the very presence of it means that similar problems may be expected to occur elsewhere. For this reason, remove the WQ_FREEZABLE flag from the runtime PM workqueue and make device_suspend_late() use the generic variant of pm_runtime_disable() that will carry out runtime PM of the device synchronously if there is pending resume work for it. 
Also update the comment before the pm_runtime_disable() call in device_suspend_late(), to document the fact that the runtime PM should not be expected to work for the device until the end of device_resume_early(), and update the related documentation. This change may, even though it is not expected to, uncover some latent issues related to queuing up asynchronous runtime resume work items during system suspend or hibernation. However, they should be limited to the interference between runtime resume and system-wide PM callbacks in the cases when device drivers start to handle system-wide PM before disabling runtime PM as described above. Link: https://lore.kernel.org/linux-pm/20251126101636.205505-2-yang.yang@vivo.com/ Signed-off-by: Rafael J. Wysocki Reviewed-by: Ulf Hansson Link: https://patch.msgid.link/12794222.O9o76ZdvQC@rafael.j.wysocki --- Documentation/power/runtime_pm.rst | 7 +++---- drivers/base/power/main.c | 7 ++++--- kernel/power/main.c | 2 +- 3 files changed, 8 insertions(+), 8 deletions(-) diff --git a/Documentation/power/runtime_pm.rst b/Documentation/power/runtime_pm.rst index 455b9d135d85..a53ab09c37d5 100644 --- a/Documentation/power/runtime_pm.rst +++ b/Documentation/power/runtime_pm.rst @@ -712,10 +712,9 @@ out the following operations: * During system suspend pm_runtime_get_noresume() is called for every device right before executing the subsystem-level .prepare() callback for it and pm_runtime_barrier() is called for every device right before executing the - subsystem-level .suspend() callback for it. In addition to that the PM core - calls __pm_runtime_disable() with 'false' as the second argument for every - device right before executing the subsystem-level .suspend_late() callback - for it. + subsystem-level .suspend() callback for it. In addition to that, the PM + core disables runtime PM for every device right before executing the + subsystem-level .suspend_late() callback for it. 
* During system resume pm_runtime_enable() and pm_runtime_put() are called for every device right after executing the subsystem-level .resume_early() diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c index 97a8b4fcf471..189de5250f25 100644 --- a/drivers/base/power/main.c +++ b/drivers/base/power/main.c @@ -1647,10 +1647,11 @@ static void device_suspend_late(struct device *dev, pm_message_t state, bool asy goto Complete; /* - * Disable runtime PM for the device without checking if there is a - * pending resume request for it. + * After this point, any runtime PM operations targeting the device + * will fail until the corresponding pm_runtime_enable() call in + * device_resume_early(). */ - __pm_runtime_disable(dev, false); + pm_runtime_disable(dev); if (dev->power.syscore) goto Skip; diff --git a/kernel/power/main.c b/kernel/power/main.c index 03b2c5495c77..5f8c9e12eaec 100644 --- a/kernel/power/main.c +++ b/kernel/power/main.c @@ -1125,7 +1125,7 @@ EXPORT_SYMBOL_GPL(pm_wq); static int __init pm_start_workqueues(void) { - pm_wq = alloc_workqueue("pm", WQ_FREEZABLE | WQ_UNBOUND, 0); + pm_wq = alloc_workqueue("pm", WQ_UNBOUND, 0); if (!pm_wq) return -ENOMEM; From 6b401a5b2d2acf56ec902f96f6381982457ab339 Mon Sep 17 00:00:00 2001 From: Kaushlendra Kumar Date: Tue, 2 Dec 2025 10:10:12 +0530 Subject: [PATCH 02/65] cpupower: idle_monitor: fix incorrect value logged after stop MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The cpuidle sysfs monitor printed the previous sample’s counter value in cpuidle_stop() instead of the freshly read one. The dprint line used previous_count[cpu][state] while current_count[cpu][state] had just been populated. This caused misleading debug output. Switch the logging to current_count so the post-interval snapshot matches the displayed value. 
Link: https://lore.kernel.org/r/20251202044012.3844790-1-kaushlendra.kumar@intel.com Signed-off-by: Kaushlendra Kumar Signed-off-by: Shuah Khan --- tools/power/cpupower/utils/idle_monitor/cpuidle_sysfs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/power/cpupower/utils/idle_monitor/cpuidle_sysfs.c b/tools/power/cpupower/utils/idle_monitor/cpuidle_sysfs.c index 8b42c2f0a5b0..4225eff9833d 100644 --- a/tools/power/cpupower/utils/idle_monitor/cpuidle_sysfs.c +++ b/tools/power/cpupower/utils/idle_monitor/cpuidle_sysfs.c @@ -70,7 +70,7 @@ static int cpuidle_stop(void) current_count[cpu][state] = cpuidle_state_time(cpu, state); dprint("CPU %d - State: %d - Val: %llu\n", - cpu, state, previous_count[cpu][state]); + cpu, state, current_count[cpu][state]); } } return 0; From 24858a84163c8d04827166b3bcaed80612bb62fc Mon Sep 17 00:00:00 2001 From: Kaushlendra Kumar Date: Wed, 26 Nov 2025 14:46:13 +0530 Subject: [PATCH 03/65] tools/cpupower: Fix inverted APERF capability check The capability check was inverted, causing the function to return error when APERF support is available and proceed when it is not. Negate the condition to return error only when APERF capability is absent. 
Link: https://lore.kernel.org/r/20251126091613.567480-1-kaushlendra.kumar@intel.com Signed-off-by: Kaushlendra Kumar Signed-off-by: Shuah Khan --- tools/power/cpupower/utils/cpufreq-info.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/power/cpupower/utils/cpufreq-info.c b/tools/power/cpupower/utils/cpufreq-info.c index 7d3732f5f2f6..5fe01e516817 100644 --- a/tools/power/cpupower/utils/cpufreq-info.c +++ b/tools/power/cpupower/utils/cpufreq-info.c @@ -270,7 +270,7 @@ static int get_freq_hardware(unsigned int cpu, unsigned int human) { unsigned long freq; - if (cpupower_cpu_info.caps & CPUPOWER_CAP_APERF) + if (!(cpupower_cpu_info.caps & CPUPOWER_CAP_APERF)) return -EINVAL; freq = cpufreq_get_freq_hardware(cpu); From 1b9aaf36b7b40235e5a529c15848c3d866362207 Mon Sep 17 00:00:00 2001 From: Kaushlendra Kumar Date: Thu, 27 Nov 2025 10:15:36 +0530 Subject: [PATCH 04/65] tools/cpupower: Use strcspn() to strip trailing newline Replace manual newline removal with strcspn() which is safer and cleaner. This avoids potential out-of-bounds access on empty strings and handles the case where no newline exists. 
Link: https://lore.kernel.org/r/20251127044536.715722-1-kaushlendra.kumar@intel.com Signed-off-by: Kaushlendra Kumar Signed-off-by: Shuah Khan --- tools/power/cpupower/lib/cpuidle.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/tools/power/cpupower/lib/cpuidle.c b/tools/power/cpupower/lib/cpuidle.c index f2c1139adf71..6a881d93d2e9 100644 --- a/tools/power/cpupower/lib/cpuidle.c +++ b/tools/power/cpupower/lib/cpuidle.c @@ -193,8 +193,7 @@ static char *cpuidle_state_get_one_string(unsigned int cpu, if (result == NULL) return NULL; - if (result[strlen(result) - 1] == '\n') - result[strlen(result) - 1] = '\0'; + result[strcspn(result, "\n")] = '\0'; return result; } @@ -366,8 +365,7 @@ static char *sysfs_cpuidle_get_one_string(enum cpuidle_string which) if (result == NULL) return NULL; - if (result[strlen(result) - 1] == '\n') - result[strlen(result) - 1] = '\0'; + result[strcspn(result, "\n")] = '\0'; return result; } From f9bd3762cf1bd0c2465f2e6121b340883471d1bf Mon Sep 17 00:00:00 2001 From: Kaushlendra Kumar Date: Mon, 1 Dec 2025 17:47:45 +0530 Subject: [PATCH 05/65] tools/power cpupower: Reset errno before strtoull() cpuidle_state_get_one_value() never cleared errno before calling strtoull(), so a prior ERANGE caused every cpuidle counter read to return zero. Reset errno to 0 before the conversion so each sysfs read is evaluated independently. 
Link: https://lore.kernel.org/r/20251201121745.3776703-1-kaushlendra.kumar@intel.com Signed-off-by: Kaushlendra Kumar Signed-off-by: Shuah Khan --- tools/power/cpupower/lib/cpuidle.c | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/power/cpupower/lib/cpuidle.c b/tools/power/cpupower/lib/cpuidle.c index 6a881d93d2e9..2fcb343d8e75 100644 --- a/tools/power/cpupower/lib/cpuidle.c +++ b/tools/power/cpupower/lib/cpuidle.c @@ -150,6 +150,7 @@ unsigned long long cpuidle_state_get_one_value(unsigned int cpu, if (len == 0) return 0; + errno = 0; value = strtoull(linebuf, &endp, 0); if (endp == linebuf || errno == ERANGE) From ff72619e11348ab189e232c59515dd5c33780d7c Mon Sep 17 00:00:00 2001 From: Kaushlendra Kumar Date: Tue, 2 Dec 2025 12:24:03 +0530 Subject: [PATCH 06/65] tools/power cpupower: Show C0 in idle-info dump `cpupower idle-info -o` skipped C0 because the loop began at 1: before: states: C1 ... latency[002] residency[00002] C2 ... latency[010] residency[00020] C3 ... latency[133] residency[00600] after: states: C0 ... latency[000] residency[00000] C1 ... latency[002] residency[00002] C2 ... latency[010] residency[00020] C3 ... latency[133] residency[00600] Start iterating at index 0 so the idle report mirrors sysfs and includes C0 stats. 
Link: https://lore.kernel.org/r/20251202065403.1492807-1-kaushlendra.kumar@intel.com Signed-off-by: Kaushlendra Kumar Signed-off-by: Shuah Khan --- tools/power/cpupower/utils/cpuidle-info.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/power/cpupower/utils/cpuidle-info.c b/tools/power/cpupower/utils/cpuidle-info.c index e0d17f0de3fe..81b4763a97d6 100644 --- a/tools/power/cpupower/utils/cpuidle-info.c +++ b/tools/power/cpupower/utils/cpuidle-info.c @@ -111,7 +111,7 @@ static void proc_cpuidle_cpu_output(unsigned int cpu) printf(_("max_cstate: C%u\n"), cstates-1); printf(_("maximum allowed latency: %lu usec\n"), max_allowed_cstate); printf(_("states:\t\n")); - for (cstate = 1; cstate < cstates; cstate++) { + for (cstate = 0; cstate < cstates; cstate++) { printf(_(" C%d: " "type[C%d] "), cstate, cstate); printf(_("promotion[--] demotion[--] ")); From 77cf053b041fe13d1fdd2e572e16ee7776ff687d Mon Sep 17 00:00:00 2001 From: Lifeng Zheng Date: Tue, 2 Dec 2025 15:27:26 +0800 Subject: [PATCH 07/65] cpufreq: Return -EOPNOTSUPP if no policy supports boost In cpufreq_boost_trigger_state(), if none of the policies support boost, policy_set_boost() will not be called and this function will return 0. But it is better to return an error to indicate that the platform doesn't support boost. Signed-off-by: Lifeng Zheng Acked-by: Viresh Kumar Reviewed-by: Jie Zhan [ rjw: Subject and changelog edits ] Link: https://patch.msgid.link/20251202072727.1368285-2-zhenglifeng1@huawei.com Signed-off-by: Rafael J. 
Wysocki --- drivers/cpufreq/cpufreq.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c index 4472bb1ec83c..8de9c94c097f 100644 --- a/drivers/cpufreq/cpufreq.c +++ b/drivers/cpufreq/cpufreq.c @@ -2803,7 +2803,7 @@ static int cpufreq_boost_trigger_state(int state) { struct cpufreq_policy *policy; unsigned long flags; - int ret = 0; + int ret = -EOPNOTSUPP; /* * Don't compare 'cpufreq_driver->boost_enabled' with 'state' here to @@ -2823,6 +2823,10 @@ static int cpufreq_boost_trigger_state(int state) if (ret) goto err_reset_state; } + + if (ret) + goto err_reset_state; + cpus_read_unlock(); return 0; From 78d83b293891c597cef773eb17d9cc02b386f21a Mon Sep 17 00:00:00 2001 From: Lifeng Zheng Date: Tue, 2 Dec 2025 15:27:27 +0800 Subject: [PATCH 08/65] cpufreq: cpufreq_boost_trigger_state() optimization Optimize the error handling code in cpufreq_boost_trigger_state(). Signed-off-by: Lifeng Zheng Acked-by: Viresh Kumar Reviewed-by: Jie Zhan [ rjw: Changelog edit ] Link: https://patch.msgid.link/20251202072727.1368285-3-zhenglifeng1@huawei.com Signed-off-by: Rafael J. 
Wysocki --- drivers/cpufreq/cpufreq.c | 13 ++++--------- 1 file changed, 4 insertions(+), 9 deletions(-) diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c index 8de9c94c097f..50dde2980f1b 100644 --- a/drivers/cpufreq/cpufreq.c +++ b/drivers/cpufreq/cpufreq.c @@ -2820,19 +2820,14 @@ static int cpufreq_boost_trigger_state(int state) continue; ret = policy_set_boost(policy, state); - if (ret) - goto err_reset_state; + if (unlikely(ret)) + break; } - if (ret) - goto err_reset_state; - cpus_read_unlock(); - return 0; - -err_reset_state: - cpus_read_unlock(); + if (likely(!ret)) + return 0; write_lock_irqsave(&cpufreq_driver_lock, flags); cpufreq_driver->boost_enabled = !state; From 549a1be5cebb7079789e5821d8ad53140e181367 Mon Sep 17 00:00:00 2001 From: Krzysztof Kozlowski Date: Fri, 2 Jan 2026 13:49:14 +0100 Subject: [PATCH 09/65] OPP: of: Simplify with scoped for each OF child loop Use scoped for-each loop when iterating over device nodes to make code a bit simpler. Signed-off-by: Krzysztof Kozlowski Signed-off-by: Viresh Kumar --- drivers/opp/of.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/drivers/opp/of.c b/drivers/opp/of.c index 1e0d0adb18e1..a268c2b250c0 100644 --- a/drivers/opp/of.c +++ b/drivers/opp/of.c @@ -956,7 +956,6 @@ static struct dev_pm_opp *_opp_add_static_v2(struct opp_table *opp_table, /* Initializes OPP tables based on new bindings */ static int _of_add_opp_table_v2(struct device *dev, struct opp_table *opp_table) { - struct device_node *np; int ret, count = 0; struct dev_pm_opp *opp; @@ -971,13 +970,12 @@ static int _of_add_opp_table_v2(struct device *dev, struct opp_table *opp_table) } /* We have opp-table node now, iterate over it and add OPPs */ - for_each_available_child_of_node(opp_table->np, np) { + for_each_available_child_of_node_scoped(opp_table->np, np) { opp = _opp_add_static_v2(opp_table, dev, np); if (IS_ERR(opp)) { ret = PTR_ERR(opp); dev_err(dev, "%s: Failed to add OPP, %d\n", __func__, ret); 
- of_node_put(np); goto remove_static_opp; } else if (opp) { count++; From 25ff69011ddf9ec73114382dc90040a4cad490b0 Mon Sep 17 00:00:00 2001 From: Artem Bityutskiy Date: Mon, 15 Dec 2025 13:12:29 +0200 Subject: [PATCH 10/65] intel_idle: Remove unused driver version constant The INTEL_IDLE_VERSION constant has not been updated since 2020 and serves no useful purpose. The driver version is implicitly defined by the kernel version, making this constant redundant. Remove the constant to eliminate potential confusion about version tracking. Signed-off-by: Artem Bityutskiy Reviewed-by: Andy Shevchenko Link: https://patch.msgid.link/20251215111229.132705-1-dedekind1@gmail.com Signed-off-by: Rafael J. Wysocki --- drivers/idle/intel_idle.c | 5 ----- 1 file changed, 5 deletions(-) diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c index 9ba83954c255..aa44b3c2cb2c 100644 --- a/drivers/idle/intel_idle.c +++ b/drivers/idle/intel_idle.c @@ -63,8 +63,6 @@ #include #include -#define INTEL_IDLE_VERSION "0.5.1" - static struct cpuidle_driver intel_idle_driver = { .name = "intel_idle", .owner = THIS_MODULE, @@ -2478,9 +2476,6 @@ static int __init intel_idle_init(void) return -ENODEV; } - pr_debug("v" INTEL_IDLE_VERSION " model 0x%X\n", - boot_cpu_data.x86_model); - intel_idle_cpuidle_devices = alloc_percpu(struct cpuidle_device); if (!intel_idle_cpuidle_devices) return -ENOMEM; From a36dc37b56722bc114d5dd5657b884334031eb49 Mon Sep 17 00:00:00 2001 From: Artem Bityutskiy Date: Mon, 15 Dec 2025 13:13:00 +0200 Subject: [PATCH 11/65] intel_idle: Remove the 'preferred_cstates' parameter Remove the 'preferred_cstates' module parameter as it is not really useful. The parameter currently only affects Alder Lake, where it controls C1/C1E preference, with C1E being the default. The parameter does not support any other platform. For example, Meteor Lake has a similar C1/C1E limitation, but the parameter does not support Meteor Lake. 
This indicates that the parameter is not very useful. Generally, independent C1 and C1E are important for server platforms where low latency is key. However, they are not as important for client platforms, like Alder Lake, where C1E providing better energy savings is generally preferred. The parameter was originally introduced for Sapphire Rapids Xeon: da0e58c038e6 intel_idle: add 'preferred_cstates' module argument Later it was added to Alder Lake: d1cf8bbfed1ed ("intel_idle: Add AlderLake support") But it was removed from Sapphire Rapids when firmware fixed the C1/C1E limitation: 1548fac47a114 ("intel_idle: make SPR C1 and C1E be independent") So Alder Lake is the only platform left where this parameter has any effect. Remove this parameter to simplify the driver and reduce maintenance burden. Signed-off-by: Artem Bityutskiy Link: https://patch.msgid.link/20251215111300.132803-1-dedekind1@gmail.com Signed-off-by: Rafael J. Wysocki --- drivers/idle/intel_idle.c | 36 ------------------------------------ 1 file changed, 36 deletions(-) diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c index aa44b3c2cb2c..2d67a091ed3f 100644 --- a/drivers/idle/intel_idle.c +++ b/drivers/idle/intel_idle.c @@ -70,7 +70,6 @@ static struct cpuidle_driver intel_idle_driver = { /* intel_idle.max_cstate=0 disables driver */ static int max_cstate = CPUIDLE_STATE_MAX - 1; static unsigned int disabled_states_mask __read_mostly; -static unsigned int preferred_states_mask __read_mostly; static bool force_irq_on __read_mostly; static bool ibrs_off __read_mostly; @@ -2049,25 +2048,6 @@ static void __init skx_idle_state_table_update(void) } } -/** - * adl_idle_state_table_update - Adjust AlderLake idle states table. - */ -static void __init adl_idle_state_table_update(void) -{ - /* Check if user prefers C1 over C1E. 
*/ - if (preferred_states_mask & BIT(1) && !(preferred_states_mask & BIT(2))) { - cpuidle_state_table[0].flags &= ~CPUIDLE_FLAG_UNUSABLE; - cpuidle_state_table[1].flags |= CPUIDLE_FLAG_UNUSABLE; - - /* Disable C1E by clearing the "C1E promotion" bit. */ - c1e_promotion = C1E_PROMOTION_DISABLE; - return; - } - - /* Make sure C1E is enabled by default */ - c1e_promotion = C1E_PROMOTION_ENABLE; -} - /** * spr_idle_state_table_update - Adjust Sapphire Rapids idle states table. */ @@ -2174,11 +2154,6 @@ static void __init intel_idle_init_cstates_icpu(struct cpuidle_driver *drv) case INTEL_EMERALDRAPIDS_X: spr_idle_state_table_update(); break; - case INTEL_ALDERLAKE: - case INTEL_ALDERLAKE_L: - case INTEL_ATOM_GRACEMONT: - adl_idle_state_table_update(); - break; case INTEL_ATOM_SILVERMONT: case INTEL_ATOM_AIRMONT: byt_cht_auto_demotion_disable(); @@ -2532,17 +2507,6 @@ module_param(max_cstate, int, 0444); */ module_param_named(states_off, disabled_states_mask, uint, 0444); MODULE_PARM_DESC(states_off, "Mask of disabled idle states"); -/* - * Some platforms come with mutually exclusive C-states, so that if one is - * enabled, the other C-states must not be used. Example: C1 and C1E on - * Sapphire Rapids platform. This parameter allows for selecting the - * preferred C-states among the groups of mutually exclusive C-states - the - * selected C-states will be registered, the other C-states from the mutually - * exclusive group won't be registered. If the platform has no mutually - * exclusive C-states, this parameter has no effect. - */ -module_param_named(preferred_cstates, preferred_states_mask, uint, 0444); -MODULE_PARM_DESC(preferred_cstates, "Mask of preferred idle states"); /* * Debugging option that forces the driver to enter all C-states with * interrupts enabled. 
Does not apply to C-states with From ff24f314447a25164bac85cb310c382e289afdbe Mon Sep 17 00:00:00 2001 From: Artem Bityutskiy Date: Tue, 16 Dec 2025 10:04:00 +0200 Subject: [PATCH 12/65] intel_idle: Initialize sysfs after cpuidle driver initialization Reorder initialization calls to initialize the internal driver data before sysfs: Was: intel_idle_sysfs_init(); intel_idle_cpuidle_driver_init(); Now: intel_idle_cpuidle_driver_init(); intel_idle_sysfs_init(); Follow the general principle that drivers should initialize internal state before registering external interfaces like sysfs, avoiding potential usage before full initialization. Signed-off-by: Artem Bityutskiy Link: https://patch.msgid.link/20251216080402.156988-2-dedekind1@gmail.com Signed-off-by: Rafael J. Wysocki --- drivers/idle/intel_idle.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c index 2d67a091ed3f..f64463e00df7 100644 --- a/drivers/idle/intel_idle.c +++ b/drivers/idle/intel_idle.c @@ -2455,12 +2455,12 @@ static int __init intel_idle_init(void) if (!intel_idle_cpuidle_devices) return -ENOMEM; + intel_idle_cpuidle_driver_init(&intel_idle_driver); + retval = intel_idle_sysfs_init(); if (retval) pr_warn("failed to initialized sysfs"); - intel_idle_cpuidle_driver_init(&intel_idle_driver); - retval = cpuidle_register_driver(&intel_idle_driver); if (retval) { struct cpuidle_driver *drv = cpuidle_get_driver(); From 111f77a233484cf39a6317f4d0306387e9ffda7b Mon Sep 17 00:00:00 2001 From: Artem Bityutskiy Date: Tue, 16 Dec 2025 10:04:01 +0200 Subject: [PATCH 13/65] intel_idle: Add cmdline option to adjust C-states table Add a new module parameter that allows adjusting the C-states table used by the driver. Currently, the C-states table is hardcoded in the driver based on the CPU model. The goal is to have good enough defaults for most users. 
However, C-state characteristics, such as exit latency and residency, can vary between different variants of the same CPU model and BIOS settings. Moreover, different platform usage models and user preferences may benefit from different C-state target_residency values. Provide a way for users to adjust the C-states table via a module parameter "table". The general format is: "state1:latency1:target_residency1,state2:latency2:target_residency2,..." In other words, represent each C-state by its name, exit latency (in microseconds), and target residency (in microseconds), separated by colons. Separate multiple C-states by commas. For example, suppose a CPU has 3 C-states with the following characteristics: C1: exit_latency=1, target_residency=2 C1E: exit_latency=10, target_residency=10 C6: exit_latency=100, target_residency=500 Users can specify a custom C-states table as follows: 1. intel_idle.table="C1:2:2,C1E:5:20,C6:150:600" Result: C1: exit_latency=2, target_residency=2 C1E: exit_latency=5, target_residency=20 C6: exit_latency=150, target_residency=600 2. intel_idle.table="C6::400" Result: C1: exit_latency=1, target_residency=2 (unchanged) C1E: exit_latency=10, target_residency=10 (unchanged) C6: exit_latency=100, target_residency=400 (only target_residency changed) Signed-off-by: Artem Bityutskiy Link: https://patch.msgid.link/20251216080402.156988-3-dedekind1@gmail.com Signed-off-by: Rafael J. 
Wysocki --- drivers/idle/intel_idle.c | 169 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 169 insertions(+) diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c index f64463e00df7..ab6b86ff9905 100644 --- a/drivers/idle/intel_idle.c +++ b/drivers/idle/intel_idle.c @@ -73,6 +73,10 @@ static unsigned int disabled_states_mask __read_mostly; static bool force_irq_on __read_mostly; static bool ibrs_off __read_mostly; +/* The maximum allowed length for the 'table' module parameter */ +#define MAX_CMDLINE_TABLE_LEN 256 +static char cmdline_table_str[MAX_CMDLINE_TABLE_LEN] __read_mostly; + static struct cpuidle_device __percpu *intel_idle_cpuidle_devices; static unsigned long auto_demotion_disable_flags; @@ -104,6 +108,9 @@ static struct device *sysfs_root __initdata; static const struct idle_cpu *icpu __initdata; static struct cpuidle_state *cpuidle_state_table __initdata; +/* C-states data from the 'intel_idle.table' cmdline parameter */ +static struct cpuidle_state cmdline_states[CPUIDLE_STATE_MAX] __initdata; + static unsigned int mwait_substates __initdata; /* @@ -2393,6 +2400,149 @@ static void __init intel_idle_sysfs_uninit(void) put_device(sysfs_root); } + /** + * get_cmdline_field - Get the current field from a cmdline string. + * @args: The cmdline string to get the current field from. + * @field: Pointer to the current field upon return. + * @sep: The fields separator character. + * + * Examples: + * Input: args="C1:1:1,C1E:2:10", sep=':' + * Output: field="C1", return "1:1,C1E:2:10" + * Input: args="C1:1:1,C1E:2:10", sep=',' + * Output: field="C1:1:1", return "C1E:2:10" + * Input: args="::", sep=':' + * Output: field="", return ":" + * + * Return: The continuation of the cmdline string after the field or NULL. 
+ */ +static char *get_cmdline_field(char *args, char **field, char sep) +{ + unsigned int i; + + for (i = 0; args[i] && !isspace(args[i]); i++) { + if (args[i] == sep) + break; + } + + *field = args; + + if (args[i] != sep) + return NULL; + + args[i] = '\0'; + return args + i + 1; +} + +/** + * cmdline_table_adjust - Adjust the C-states table with data from cmdline. + * @drv: cpuidle driver (assumed to point to intel_idle_driver). + * + * Adjust the C-states table with data from the 'intel_idle.table' module + * parameter (if specified). + */ +static void __init cmdline_table_adjust(struct cpuidle_driver *drv) +{ + char *args = cmdline_table_str; + struct cpuidle_state *state; + int i; + + if (args[0] == '\0') + /* The 'intel_idle.table' module parameter was not specified */ + return; + + /* Create a copy of the C-states table */ + for (i = 0; i < drv->state_count; i++) + cmdline_states[i] = drv->states[i]; + + /* + * Adjust the C-states table copy with data from the 'intel_idle.table' + * module parameter. + */ + while (args) { + char *fields, *name, *val; + + /* + * Get the next C-state definition, which is expected to be + * '::'. Treat "empty" + * fields as unchanged. For example, + * '::' leaves the latency unchanged. 
+ */ + args = get_cmdline_field(args, &fields, ','); + + /* name */ + fields = get_cmdline_field(fields, &name, ':'); + if (!fields) + goto error; + + if (!strcmp(name, "POLL")) { + pr_err("Cannot adjust POLL\n"); + continue; + } + + /* Find the C-state by its name */ + state = NULL; + for (i = 0; i < drv->state_count; i++) { + if (!strcmp(name, drv->states[i].name)) { + state = &cmdline_states[i]; + break; + } + } + + if (!state) { + pr_err("C-state '%s' was not found\n", name); + continue; + } + + /* Latency */ + fields = get_cmdline_field(fields, &val, ':'); + if (!fields) + goto error; + + if (*val) { + if (kstrtouint(val, 0, &state->exit_latency)) + goto error; + } + + /* Target residency */ + fields = get_cmdline_field(fields, &val, ':'); + + if (*val) { + if (kstrtouint(val, 0, &state->target_residency)) + goto error; + } + + /* + * Allow for 3 more fields, but ignore them. Helps to make + * possible future extensions of the cmdline format backward + * compatible. + */ + for (i = 0; fields && i < 3; i++) { + fields = get_cmdline_field(fields, &val, ':'); + if (!fields) + break; + } + + if (fields) { + pr_err("Too many fields for C-state '%s'\n", state->name); + goto error; + } + + pr_info("C-state from cmdline: name=%s, latency=%u, residency=%u\n", + state->name, state->exit_latency, state->target_residency); + } + + /* Copy the adjusted C-states table back */ + for (i = 1; i < drv->state_count; i++) + drv->states[i] = cmdline_states[i]; + + pr_info("Adjusted C-states with data from 'intel_idle.table'\n"); + return; + +error: + pr_info("Failed to adjust C-states with data from 'intel_idle.table'\n"); +} + static int __init intel_idle_init(void) { const struct x86_cpu_id *id; @@ -2456,6 +2606,7 @@ static int __init intel_idle_init(void) return -ENOMEM; intel_idle_cpuidle_driver_init(&intel_idle_driver); + cmdline_table_adjust(&intel_idle_driver); retval = intel_idle_sysfs_init(); if (retval) @@ -2519,3 +2670,21 @@ module_param(force_irq_on, bool, 0444); */ 
module_param(ibrs_off, bool, 0444); MODULE_PARM_DESC(ibrs_off, "Disable IBRS when idle"); + +/* + * Define the C-states table from a user input string. Expected format is + * 'name:latency:residency', where: + * - name: The C-state name. + * - latency: The C-state exit latency in us. + * - residency: The C-state target residency in us. + * + * Multiple C-states can be defined by separating them with commas: + * 'name1:latency1:residency1,name2:latency2:residency2' + * + * Example: intel_idle.table=C1:1:1,C1E:5:10,C6:100:600 + * + * To leave latency or residency unchanged, use an empty field, for example: + * 'C1:1:1,C1E::10' - leaves C1E latency unchanged. + */ +module_param_string(table, cmdline_table_str, MAX_CMDLINE_TABLE_LEN, 0444); +MODULE_PARM_DESC(table, "Build the C-states table from a user input string"); From be6a150829b375c1b53d7ea5794ccc9edd2e0c9c Mon Sep 17 00:00:00 2001 From: Artem Bityutskiy Date: Tue, 16 Dec 2025 10:04:02 +0200 Subject: [PATCH 14/65] intel_idle: Add C-states validation Add validation for C-states specified via the "table=" module parameter. Treat this module parameter as untrusted input and validate it thoroughly. Signed-off-by: Artem Bityutskiy Link: https://patch.msgid.link/20251216080402.156988-4-dedekind1@gmail.com Signed-off-by: Rafael J. 
Wysocki --- drivers/idle/intel_idle.c | 54 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 54 insertions(+) diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c index ab6b86ff9905..f49c939d636f 100644 --- a/drivers/idle/intel_idle.c +++ b/drivers/idle/intel_idle.c @@ -45,6 +45,7 @@ #include #include #include +#include #include #include #include @@ -75,6 +76,11 @@ static bool ibrs_off __read_mostly; /* The maximum allowed length for the 'table' module parameter */ #define MAX_CMDLINE_TABLE_LEN 256 +/* Maximum allowed C-state latency */ +#define MAX_CMDLINE_LATENCY_US (5 * USEC_PER_MSEC) +/* Maximum allowed C-state target residency */ +#define MAX_CMDLINE_RESIDENCY_US (100 * USEC_PER_MSEC) + static char cmdline_table_str[MAX_CMDLINE_TABLE_LEN] __read_mostly; static struct cpuidle_device __percpu *intel_idle_cpuidle_devices; @@ -2434,6 +2440,41 @@ static char *get_cmdline_field(char *args, char **field, char sep) return args + i + 1; } +/** + * validate_cmdline_cstate - Validate a C-state from cmdline. + * @state: The C-state to validate. + * @prev_state: The previous C-state in the table or NULL. + * + * Return: 0 if the C-state is valid or -EINVAL otherwise. + */ +static int validate_cmdline_cstate(struct cpuidle_state *state, + struct cpuidle_state *prev_state) +{ + if (state->exit_latency == 0) + /* Exit latency 0 can only be used for the POLL state */ + return -EINVAL; + + if (state->exit_latency > MAX_CMDLINE_LATENCY_US) + return -EINVAL; + + if (state->target_residency > MAX_CMDLINE_RESIDENCY_US) + return -EINVAL; + + if (state->target_residency < state->exit_latency) + return -EINVAL; + + if (!prev_state) + return 0; + + if (state->exit_latency <= prev_state->exit_latency) + return -EINVAL; + + if (state->target_residency <= prev_state->target_residency) + return -EINVAL; + + return 0; +} + /** * cmdline_table_adjust - Adjust the C-states table with data from cmdline. * @drv: cpuidle driver (assumed to point to intel_idle_driver). 
@@ -2532,6 +2573,19 @@ static void __init cmdline_table_adjust(struct cpuidle_driver *drv) state->name, state->exit_latency, state->target_residency); } + /* Validate the adjusted C-states, start with index 1 to skip POLL */ + for (i = 1; i < drv->state_count; i++) { + struct cpuidle_state *prev_state; + + state = &cmdline_states[i]; + prev_state = &cmdline_states[i - 1]; + + if (validate_cmdline_cstate(state, prev_state)) { + pr_err("C-state '%s' validation failed\n", state->name); + goto error; + } + } + /* Copy the adjusted C-states table back */ for (i = 1; i < drv->state_count; i++) drv->states[i] = cmdline_states[i]; From 1ade6a4f7f09d5d6f6fc449e6bfa92b5e2d063c2 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Mon, 22 Dec 2025 20:52:33 +0100 Subject: [PATCH 15/65] USB: core: Discard pm_runtime_put() return value To allow the return type of pm_runtime_put() to be changed to void in the future, modify usb_autopm_put_interface_async() to discard the return value of pm_runtime_put(). That value is merely used in a debug comment printed by the function in question and it is not a particularly useful piece of information because pm_runtime_put() does not guarantee that the device will be suspended even if it successfully queues up a work item to check whether or not the device can be suspended. Signed-off-by: Rafael J. 
Wysocki Acked-by: Alan Stern Acked-by: Greg Kroah-Hartman Link: https://patch.msgid.link/5058509.GXAFRqVoOG@rafael.j.wysocki --- drivers/usb/core/driver.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/drivers/usb/core/driver.c b/drivers/usb/core/driver.c index d29edc7c616a..2f5958bc4f7f 100644 --- a/drivers/usb/core/driver.c +++ b/drivers/usb/core/driver.c @@ -1810,13 +1810,11 @@ EXPORT_SYMBOL_GPL(usb_autopm_put_interface); void usb_autopm_put_interface_async(struct usb_interface *intf) { struct usb_device *udev = interface_to_usbdev(intf); - int status; usb_mark_last_busy(udev); - status = pm_runtime_put(&intf->dev); - dev_vdbg(&intf->dev, "%s: cnt %d -> %d\n", - __func__, atomic_read(&intf->dev.power.usage_count), - status); + pm_runtime_put(&intf->dev); + dev_vdbg(&intf->dev, "%s: cnt %d\n", + __func__, atomic_read(&intf->dev.power.usage_count)); } EXPORT_SYMBOL_GPL(usb_autopm_put_interface_async); From 88dcab0650fd31072ed07a0d26fce5bbbbd8e7a1 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Mon, 22 Dec 2025 20:59:58 +0100 Subject: [PATCH 16/65] drm/imagination: Discard pm_runtime_put() return value The Imagination DRM driver defines pvr_power_put() to pass the return value of pm_runtime_put() to the caller, but then it never uses the return value of pvr_power_put(). Modify pvr_power_put() to discard the pm_runtime_put() return value and change its return type to void. No intentional functional impact. This will facilitate a planned change of the pm_runtime_put() return type to void in the future. Signed-off-by: Rafael J. 
Wysocki Reviewed-by: Matt Coster Link: https://patch.msgid.link/8642685.T7Z3S40VBb@rafael.j.wysocki --- drivers/gpu/drm/imagination/pvr_power.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/imagination/pvr_power.h b/drivers/gpu/drm/imagination/pvr_power.h index b853d092242c..c34252bda078 100644 --- a/drivers/gpu/drm/imagination/pvr_power.h +++ b/drivers/gpu/drm/imagination/pvr_power.h @@ -30,12 +30,12 @@ pvr_power_get(struct pvr_device *pvr_dev) return pm_runtime_resume_and_get(drm_dev->dev); } -static __always_inline int +static __always_inline void pvr_power_put(struct pvr_device *pvr_dev) { struct drm_device *drm_dev = from_pvr_device(pvr_dev); - return pm_runtime_put(drm_dev->dev); + pm_runtime_put(drm_dev->dev); } int pvr_power_domains_init(struct pvr_device *pvr_dev); From 0cc7933cbec80900bdbe658b72e2ba99187fe628 Mon Sep 17 00:00:00 2001 From: Andreas Kemnade Date: Thu, 8 Jan 2026 09:26:12 +0100 Subject: [PATCH 17/65] cpufreq: omap: remove driver The omap-cpufreq driver is not used in the corresponding defconfigs. The pseudo platform device to use it was removed by commit cb6675d6a868 ("ARM: OMAP2+: Remove legacy PM init") 10 years ago. Checking if there is any need to reactivate it: For omap3, dra7 there is ti-cpufreq to create cpufreq-dt device For omap2/4/5 there is cpufreq-dt-plat to create cpufreq-dt device. For omap1 this driver cannot be selected at all. So no users, no need to reactivate the driver somehow. So remove it. Signed-off-by: Andreas Kemnade Acked-by: Kevin Hilman Link: https://patch.msgid.link/20260108-omap-cpufreq-removal-v1-1-8fe42f130f48@kemnade.info Signed-off-by: Rafael J. 
Wysocki --- drivers/cpufreq/Kconfig.arm | 5 - drivers/cpufreq/Makefile | 1 - drivers/cpufreq/omap-cpufreq.c | 195 --------------------------------- 3 files changed, 201 deletions(-) delete mode 100644 drivers/cpufreq/omap-cpufreq.c diff --git a/drivers/cpufreq/Kconfig.arm b/drivers/cpufreq/Kconfig.arm index 9be0503df55a..4014bc9dd73a 100644 --- a/drivers/cpufreq/Kconfig.arm +++ b/drivers/cpufreq/Kconfig.arm @@ -141,11 +141,6 @@ config ARM_MEDIATEK_CPUFREQ_HW The driver implements the cpufreq interface for this HW engine. Say Y if you want to support CPUFreq HW. -config ARM_OMAP2PLUS_CPUFREQ - bool "TI OMAP2+" - depends on ARCH_OMAP2PLUS || COMPILE_TEST - default ARCH_OMAP2PLUS - config ARM_QCOM_CPUFREQ_NVMEM tristate "Qualcomm nvmem based CPUFreq" depends on ARCH_QCOM || COMPILE_TEST diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile index 681d687b5a18..385c9fcc65c6 100644 --- a/drivers/cpufreq/Makefile +++ b/drivers/cpufreq/Makefile @@ -69,7 +69,6 @@ obj-$(CONFIG_ARM_KIRKWOOD_CPUFREQ) += kirkwood-cpufreq.o obj-$(CONFIG_ARM_MEDIATEK_CPUFREQ) += mediatek-cpufreq.o obj-$(CONFIG_ARM_MEDIATEK_CPUFREQ_HW) += mediatek-cpufreq-hw.o obj-$(CONFIG_MACH_MVEBU_V7) += mvebu-cpufreq.o -obj-$(CONFIG_ARM_OMAP2PLUS_CPUFREQ) += omap-cpufreq.o obj-$(CONFIG_ARM_PXA2xx_CPUFREQ) += pxa2xx-cpufreq.o obj-$(CONFIG_PXA3xx) += pxa3xx-cpufreq.o obj-$(CONFIG_ARM_QCOM_CPUFREQ_HW) += qcom-cpufreq-hw.o diff --git a/drivers/cpufreq/omap-cpufreq.c b/drivers/cpufreq/omap-cpufreq.c deleted file mode 100644 index bbb01d93b54b..000000000000 --- a/drivers/cpufreq/omap-cpufreq.c +++ /dev/null @@ -1,195 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0-only -/* - * CPU frequency scaling for OMAP using OPP information - * - * Copyright (C) 2005 Nokia Corporation - * Written by Tony Lindgren - * - * Based on cpu-sa1110.c, Copyright (C) 2001 Russell King - * - * Copyright (C) 2007-2011 Texas Instruments, Inc. 
- * - OMAP3/4 support by Rajendra Nayak, Santosh Shilimkar - */ - -#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -/* OPP tolerance in percentage */ -#define OPP_TOLERANCE 4 - -static struct cpufreq_frequency_table *freq_table; -static atomic_t freq_table_users = ATOMIC_INIT(0); -static struct device *mpu_dev; -static struct regulator *mpu_reg; - -static int omap_target(struct cpufreq_policy *policy, unsigned int index) -{ - int r, ret; - struct dev_pm_opp *opp; - unsigned long freq, volt = 0, volt_old = 0, tol = 0; - unsigned int old_freq, new_freq; - - old_freq = policy->cur; - new_freq = freq_table[index].frequency; - - freq = new_freq * 1000; - ret = clk_round_rate(policy->clk, freq); - if (ret < 0) { - dev_warn(mpu_dev, - "CPUfreq: Cannot find matching frequency for %lu\n", - freq); - return ret; - } - freq = ret; - - if (mpu_reg) { - opp = dev_pm_opp_find_freq_ceil(mpu_dev, &freq); - if (IS_ERR(opp)) { - dev_err(mpu_dev, "%s: unable to find MPU OPP for %d\n", - __func__, new_freq); - return -EINVAL; - } - volt = dev_pm_opp_get_voltage(opp); - dev_pm_opp_put(opp); - tol = volt * OPP_TOLERANCE / 100; - volt_old = regulator_get_voltage(mpu_reg); - } - - dev_dbg(mpu_dev, "cpufreq-omap: %u MHz, %ld mV --> %u MHz, %ld mV\n", - old_freq / 1000, volt_old ? volt_old / 1000 : -1, - new_freq / 1000, volt ? volt / 1000 : -1); - - /* scaling up? scale voltage before frequency */ - if (mpu_reg && (new_freq > old_freq)) { - r = regulator_set_voltage(mpu_reg, volt - tol, volt + tol); - if (r < 0) { - dev_warn(mpu_dev, "%s: unable to scale voltage up.\n", - __func__); - return r; - } - } - - ret = clk_set_rate(policy->clk, new_freq * 1000); - - /* scaling down? 
scale voltage after frequency */ - if (mpu_reg && (new_freq < old_freq)) { - r = regulator_set_voltage(mpu_reg, volt - tol, volt + tol); - if (r < 0) { - dev_warn(mpu_dev, "%s: unable to scale voltage down.\n", - __func__); - clk_set_rate(policy->clk, old_freq * 1000); - return r; - } - } - - return ret; -} - -static inline void freq_table_free(void) -{ - if (atomic_dec_and_test(&freq_table_users)) - dev_pm_opp_free_cpufreq_table(mpu_dev, &freq_table); -} - -static int omap_cpu_init(struct cpufreq_policy *policy) -{ - int result; - - policy->clk = clk_get(NULL, "cpufreq_ck"); - if (IS_ERR(policy->clk)) - return PTR_ERR(policy->clk); - - if (!freq_table) { - result = dev_pm_opp_init_cpufreq_table(mpu_dev, &freq_table); - if (result) { - dev_err(mpu_dev, - "%s: cpu%d: failed creating freq table[%d]\n", - __func__, policy->cpu, result); - clk_put(policy->clk); - return result; - } - } - - atomic_inc_return(&freq_table_users); - - /* FIXME: what's the actual transition time? */ - cpufreq_generic_init(policy, freq_table, 300 * 1000); - - return 0; -} - -static void omap_cpu_exit(struct cpufreq_policy *policy) -{ - freq_table_free(); - clk_put(policy->clk); -} - -static struct cpufreq_driver omap_driver = { - .flags = CPUFREQ_NEED_INITIAL_FREQ_CHECK, - .verify = cpufreq_generic_frequency_table_verify, - .target_index = omap_target, - .get = cpufreq_generic_get, - .init = omap_cpu_init, - .exit = omap_cpu_exit, - .register_em = cpufreq_register_em_with_opp, - .name = "omap", -}; - -static int omap_cpufreq_probe(struct platform_device *pdev) -{ - mpu_dev = get_cpu_device(0); - if (!mpu_dev) { - pr_warn("%s: unable to get the MPU device\n", __func__); - return -EINVAL; - } - - mpu_reg = regulator_get(mpu_dev, "vcc"); - if (IS_ERR(mpu_reg)) { - pr_warn("%s: unable to get MPU regulator\n", __func__); - mpu_reg = NULL; - } else { - /* - * Ensure physical regulator is present. - * (e.g. could be dummy regulator.) 
- */ - if (regulator_get_voltage(mpu_reg) < 0) { - pr_warn("%s: physical regulator not present for MPU\n", - __func__); - regulator_put(mpu_reg); - mpu_reg = NULL; - } - } - - return cpufreq_register_driver(&omap_driver); -} - -static void omap_cpufreq_remove(struct platform_device *pdev) -{ - cpufreq_unregister_driver(&omap_driver); -} - -static struct platform_driver omap_cpufreq_platdrv = { - .driver = { - .name = "omap-cpufreq", - }, - .probe = omap_cpufreq_probe, - .remove = omap_cpufreq_remove, -}; -module_platform_driver(omap_cpufreq_platdrv); - -MODULE_DESCRIPTION("cpufreq driver for OMAP SoCs"); -MODULE_LICENSE("GPL"); From 80b49829ba1776d3593998293d457397e349b765 Mon Sep 17 00:00:00 2001 From: Andreas Kemnade Date: Thu, 8 Jan 2026 09:26:13 +0100 Subject: [PATCH 18/65] MAINTAINERS: remove omap-cpufreq Remove entry for omap-cpufreq, since it is removed. Signed-off-by: Andreas Kemnade Acked-by: Kevin Hilman Link: https://patch.msgid.link/20260108-omap-cpufreq-removal-v1-2-8fe42f130f48@kemnade.info Signed-off-by: Rafael J. Wysocki --- MAINTAINERS | 1 - 1 file changed, 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index 5b11839cba9d..2f950a4c9fac 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -19129,7 +19129,6 @@ M: Kevin Hilman L: linux-omap@vger.kernel.org S: Maintained F: arch/arm/*omap*/*pm* -F: drivers/cpufreq/omap-cpufreq.c OMAP POWERDOMAIN SOC ADAPTATION LAYER SUPPORT M: Paul Walmsley From c9f7b0e6b903a68780684c30773e3b591b10deaa Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Mon, 22 Dec 2025 21:03:25 +0100 Subject: [PATCH 19/65] media: ccs: Discard pm_runtime_put() return value Passing the pm_runtime_put() return value to callers is not particularly useful. 
Returning an error code from pm_runtime_put() merely means that it has not queued up a work item to check whether or not the device can be suspended and there are many perfectly valid situations in which that can happen, like after writing "on" to the devices' runtime PM "control" attribute in sysfs for one example. It also happens when the kernel is configured with CONFIG_PM unset. Accordingly, update ccs_post_streamoff() to simply discard the return value of pm_runtime_put() and always return success to the caller. This will facilitate a planned change of the pm_runtime_put() return type to void in the future. Signed-off-by: Rafael J. Wysocki Acked-by: Sakari Ailus Link: https://patch.msgid.link/22966634.EfDdHjke4D@rafael.j.wysocki --- drivers/media/i2c/ccs/ccs-core.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/media/i2c/ccs/ccs-core.c b/drivers/media/i2c/ccs/ccs-core.c index f8523140784c..0d7b922fd4c4 100644 --- a/drivers/media/i2c/ccs/ccs-core.c +++ b/drivers/media/i2c/ccs/ccs-core.c @@ -1974,7 +1974,9 @@ static int ccs_post_streamoff(struct v4l2_subdev *subdev) struct ccs_sensor *sensor = to_ccs_sensor(subdev); struct i2c_client *client = v4l2_get_subdevdata(&sensor->src->sd); - return pm_runtime_put(&client->dev); + pm_runtime_put(&client->dev); + + return 0; } static int ccs_enum_mbus_code(struct v4l2_subdev *subdev, From f52defa7b830abbba6b26df503ca42c0b2f20abe Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Mon, 22 Dec 2025 21:07:46 +0100 Subject: [PATCH 20/65] watchdog: rz: Discard pm_runtime_put() return values Failing a watchdog stop due to pm_runtime_put() returning a negative value is not particularly useful. 
Returning an error code from pm_runtime_put() merely means that it has not queued up a work item to check whether or not the device can be suspended and there are many perfectly valid situations in which that can happen, like after writing "on" to the devices' runtime PM "control" attribute in sysfs for one example. It also happens when the kernel is configured with CONFIG_PM unset. Accordingly, update rzg2l_wdt_stop() and rzv2h_wdt_stop() to simply discard the return value of pm_runtime_put(). This will facilitate a planned change of the pm_runtime_put() return type to void in the future. Signed-off-by: Rafael J. Wysocki Reviewed-by: Guenter Roeck Link: https://patch.msgid.link/3340071.5fSG56mABF@rafael.j.wysocki --- drivers/watchdog/rzg2l_wdt.c | 4 +--- drivers/watchdog/rzv2h_wdt.c | 4 +--- 2 files changed, 2 insertions(+), 6 deletions(-) diff --git a/drivers/watchdog/rzg2l_wdt.c b/drivers/watchdog/rzg2l_wdt.c index 1c9aa366d0a0..509f9dffdacd 100644 --- a/drivers/watchdog/rzg2l_wdt.c +++ b/drivers/watchdog/rzg2l_wdt.c @@ -132,9 +132,7 @@ static int rzg2l_wdt_stop(struct watchdog_device *wdev) if (ret) return ret; - ret = pm_runtime_put(wdev->parent); - if (ret < 0) - return ret; + pm_runtime_put(wdev->parent); return 0; } diff --git a/drivers/watchdog/rzv2h_wdt.c b/drivers/watchdog/rzv2h_wdt.c index a694786837e1..f93647934db7 100644 --- a/drivers/watchdog/rzv2h_wdt.c +++ b/drivers/watchdog/rzv2h_wdt.c @@ -174,9 +174,7 @@ static int rzv2h_wdt_stop(struct watchdog_device *wdev) if (priv->of_data->wdtdcr) rzt2h_wdt_wdtdcr_count_stop(priv); - ret = pm_runtime_put(wdev->parent); - if (ret < 0) - return ret; + pm_runtime_put(wdev->parent); return 0; } From 7b8de72b4001a7e2071c69b6bcc95ac21ca01094 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Mon, 22 Dec 2025 21:09:22 +0100 Subject: [PATCH 21/65] watchdog: rzv2h_wdt: Discard pm_runtime_put() return value Failing device probe due to pm_runtime_put() returning an error is not particularly useful. 
Returning an error code from pm_runtime_put() merely means that it has not queued up a work item to check whether or not the device can be suspended and there are many perfectly valid situations in which that can happen, like after writing "on" to the devices' runtime PM "control" attribute in sysfs for one example. It also happens when the kernel is configured with CONFIG_PM unset. Accordingly, update rzt2h_wdt_wdtdcr_init() to simply discard the return value of pm_runtime_put() and return success to the caller after invoking that function. This will facilitate a planned change of the pm_runtime_put() return type to void in the future. Signed-off-by: Rafael J. Wysocki Reviewed-by: Guenter Roeck Link: https://patch.msgid.link/1867890.VLH7GnMWUR@rafael.j.wysocki --- drivers/watchdog/rzv2h_wdt.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/drivers/watchdog/rzv2h_wdt.c b/drivers/watchdog/rzv2h_wdt.c index f93647934db7..3b6abb66a1da 100644 --- a/drivers/watchdog/rzv2h_wdt.c +++ b/drivers/watchdog/rzv2h_wdt.c @@ -268,9 +268,7 @@ static int rzt2h_wdt_wdtdcr_init(struct platform_device *pdev, rzt2h_wdt_wdtdcr_count_stop(priv); - ret = pm_runtime_put(&pdev->dev); - if (ret < 0) - return ret; + pm_runtime_put(&pdev->dev); return 0; } From d33976be6cecfe340a52b365ecf706a0c55d543d Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Mon, 22 Dec 2025 21:24:19 +0100 Subject: [PATCH 22/65] hwspinlock: omap: Discard pm_runtime_put() return value Failing driver probe due to pm_runtime_put() returning a negative value is not particularly useful. Returning an error code from pm_runtime_put() merely means that it has not queued up a work item to check whether or not the device can be suspended and there are many perfectly valid situations in which that can happen, like after writing "on" to the devices' runtime PM "control" attribute in sysfs for one example. It also happens when the kernel has been configured with CONFIG_PM unset. 
Accordingly, update omap_hwspinlock_probe() to simply discard the return value of pm_runtime_put(). This will facilitate a planned change of the pm_runtime_put() return type to void in the future. Signed-off-by: Rafael J. Wysocki Acked-by: Bjorn Andersson Link: https://patch.msgid.link/883243465.0ifERbkFSE@rafael.j.wysocki --- drivers/hwspinlock/omap_hwspinlock.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/drivers/hwspinlock/omap_hwspinlock.c b/drivers/hwspinlock/omap_hwspinlock.c index 27b47b8623c0..3a9a5678737b 100644 --- a/drivers/hwspinlock/omap_hwspinlock.c +++ b/drivers/hwspinlock/omap_hwspinlock.c @@ -101,9 +101,7 @@ static int omap_hwspinlock_probe(struct platform_device *pdev) * runtime PM will make sure the clock of this module is * enabled again iff at least one lock is requested */ - ret = pm_runtime_put(&pdev->dev); - if (ret < 0) - return ret; + pm_runtime_put(&pdev->dev); /* one of the four lsb's must be set, and nothing else */ if (hweight_long(i & 0xf) != 1 || i > 8) From 01eafccacc707da2db2a9eb4be56c9367e42323f Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Mon, 22 Dec 2025 21:25:57 +0100 Subject: [PATCH 23/65] coresight: Discard pm_runtime_put() return values Failing a debugfs write due to pm_runtime_put() returning a negative value is not particularly useful. Returning an error code from pm_runtime_put() merely means that it has not queued up a work item to check whether or not the device can be suspended and there are many perfectly valid situations in which that can happen, like after writing "on" to the devices' runtime PM "control" attribute in sysfs for one example. It also happens when the kernel has been configured with CONFIG_PM unset, in which case debug_disable_func() in the coresight driver will always return an error. 
For this reason, update debug_disable_func() to simply discard the return value of pm_runtime_put(), change its return type to void, and propagate that change to debug_func_knob_write(). This will facilitate a planned change of the pm_runtime_put() return type to void in the future. Signed-off-by: Rafael J. Wysocki Acked-by: Suzuki K Poulose Link: https://patch.msgid.link/2058657.yKVeVyVuyW@rafael.j.wysocki --- drivers/hwtracing/coresight/coresight-cpu-debug.c | 12 ++++-------- 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/drivers/hwtracing/coresight/coresight-cpu-debug.c b/drivers/hwtracing/coresight/coresight-cpu-debug.c index 5f21366406aa..629614278e46 100644 --- a/drivers/hwtracing/coresight/coresight-cpu-debug.c +++ b/drivers/hwtracing/coresight/coresight-cpu-debug.c @@ -451,10 +451,10 @@ static int debug_enable_func(void) return ret; } -static int debug_disable_func(void) +static void debug_disable_func(void) { struct debug_drvdata *drvdata; - int cpu, ret, err = 0; + int cpu; /* * Disable debug power domains, records the error and keep @@ -466,12 +466,8 @@ static int debug_disable_func(void) if (!drvdata) continue; - ret = pm_runtime_put(drvdata->dev); - if (ret < 0) - err = ret; + pm_runtime_put(drvdata->dev); } - - return err; } static ssize_t debug_func_knob_write(struct file *f, @@ -492,7 +488,7 @@ static ssize_t debug_func_knob_write(struct file *f, if (val) ret = debug_enable_func(); else - ret = debug_disable_func(); + debug_disable_func(); if (ret) { pr_err("%s: unable to %s debug function: %d\n", From 6401e43479a809b7a5a930d76c363f4b5705ed00 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Mon, 22 Dec 2025 21:27:44 +0100 Subject: [PATCH 24/65] platform/chrome: cros_hps_i2c: Discard pm_runtime_put() return value Passing pm_runtime_put() return value to the callers is not particularly useful. 
Returning an error code from pm_runtime_put() merely means that it has not queued up a work item to check whether or not the device can be suspended and there are many perfectly valid situations in which that can happen, like after writing "on" to the devices' runtime PM "control" attribute in sysfs for one example. It also happens when the kernel is configured with CONFIG_PM unset. Accordingly, update hps_release() to simply discard the return value of pm_runtime_put() and always return success to the caller. This will facilitate a planned change of the pm_runtime_put() return type to void in the future. Signed-off-by: Rafael J. Wysocki Acked-by: Tzung-Bi Shih Link: https://patch.msgid.link/2302270.NgBsaNRSFp@rafael.j.wysocki --- drivers/platform/chrome/cros_hps_i2c.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/platform/chrome/cros_hps_i2c.c b/drivers/platform/chrome/cros_hps_i2c.c index 6b479cfe3f73..ac6498c593e3 100644 --- a/drivers/platform/chrome/cros_hps_i2c.c +++ b/drivers/platform/chrome/cros_hps_i2c.c @@ -46,7 +46,9 @@ static int hps_release(struct inode *inode, struct file *file) struct hps_drvdata, misc_device); struct device *dev = &hps->client->dev; - return pm_runtime_put(dev); + pm_runtime_put(dev); + + return 0; } static const struct file_operations hps_fops = { From bf91b35a46ceef08a1e64c54b0e611fcae531e7a Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Mon, 22 Dec 2025 21:31:45 +0100 Subject: [PATCH 25/65] scsi: ufs: core: Discard pm_runtime_put() return values The ufshcd driver defines ufshcd_rpm_put() to return an int, but that return value is never used. It also passes the return value of pm_runtime_put() to the caller which is not very useful. 
Returning an error code from pm_runtime_put() merely means that it has not queued up a work item to check whether or not the device can be suspended and there are many perfectly valid situations in which that can happen, like after writing "on" to the devices' runtime PM "control" attribute in sysfs for one example. Modify ufshcd_rpm_put() to discard the pm_runtime_put() return value and change its return type to void. No intentional functional impact. This will facilitate a planned change of the pm_runtime_put() return type to void in the future. Signed-off-by: Rafael J. Wysocki Reviewed-by: Bart Van Assche Reviewed-by: Martin K. Petersen Link: https://patch.msgid.link/2781685.BddDVKsqQX@rafael.j.wysocki --- drivers/ufs/core/ufshcd-priv.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/ufs/core/ufshcd-priv.h b/drivers/ufs/core/ufshcd-priv.h index 4259f499382f..27b18b0cc058 100644 --- a/drivers/ufs/core/ufshcd-priv.h +++ b/drivers/ufs/core/ufshcd-priv.h @@ -348,9 +348,9 @@ static inline int ufshcd_rpm_resume(struct ufs_hba *hba) return pm_runtime_resume(&hba->ufs_device_wlun->sdev_gendev); } -static inline int ufshcd_rpm_put(struct ufs_hba *hba) +static inline void ufshcd_rpm_put(struct ufs_hba *hba) { - return pm_runtime_put(&hba->ufs_device_wlun->sdev_gendev); + pm_runtime_put(&hba->ufs_device_wlun->sdev_gendev); } /** From fcbd7897b871e157ee5c595e950c8466d86c0cd5 Mon Sep 17 00:00:00 2001 From: Breno Leitao Date: Mon, 5 Jan 2026 06:37:06 -0800 Subject: [PATCH 26/65] cpuidle: menu: Remove incorrect unlikely() annotation The unlikely() annotation on the early-return condition in menu_select() is incorrect on systems with only one idle state (e.g., ARM64 servers with a single ACPI LPI state). Branch profiling shows 100% misprediction on such systems since drv->state_count <= 1 is always true. On platforms where only state0 is available, this path is the common case, not an unlikely edge case. 
Remove the misleading annotation to let the branch predictor learn the actual behavior. Signed-off-by: Breno Leitao Link: https://patch.msgid.link/20260105-annotated_idle-v1-1-10ddf0771b58@debian.org Signed-off-by: Rafael J. Wysocki --- drivers/cpuidle/governors/menu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c index 64d6f7a1c776..ef9c5a84643e 100644 --- a/drivers/cpuidle/governors/menu.c +++ b/drivers/cpuidle/governors/menu.c @@ -271,7 +271,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev, data->bucket = BUCKETS - 1; } - if (unlikely(drv->state_count <= 1 || latency_req == 0) || + if (drv->state_count <= 1 || latency_req == 0 || ((data->next_timer_ns < drv->states[1].target_residency_ns || latency_req < drv->states[1].exit_latency_ns) && !dev->states_usage[0].disable)) { From fd0d2872dc53fe55f66842767e952457348b8d18 Mon Sep 17 00:00:00 2001 From: Christian Loehle Date: Tue, 6 Jan 2026 13:36:53 +0000 Subject: [PATCH 27/65] MAINTAINERS: Add myself as cpuidle reviewer I've been reviewing cpuidle changes, for governors in particular, for the last couple of years and will continue to do so. Signed-off-by: Christian Loehle Link: https://patch.msgid.link/71f63cb7-2d9b-49a3-9b04-a47e2edef5e0@arm.com Signed-off-by: Rafael J. Wysocki --- MAINTAINERS | 1 + 1 file changed, 1 insertion(+) diff --git a/MAINTAINERS b/MAINTAINERS index 765ad2daa218..ea1d4c85b865 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -6554,6 +6554,7 @@ F: rust/kernel/cpu.rs CPU IDLE TIME MANAGEMENT FRAMEWORK M: "Rafael J. 
Wysocki" M: Daniel Lezcano +R: Christian Loehle L: linux-pm@vger.kernel.org S: Maintained B: https://bugzilla.kernel.org From 07e5e811f86dcd6f595c3bbd71cde294e8545889 Mon Sep 17 00:00:00 2001 From: Sumeet Pawnikar Date: Sun, 11 Jan 2026 19:42:36 +0530 Subject: [PATCH 28/65] powercap: Replace sprintf() with sysfs_emit() in sysfs show functions Replace all sprintf() calls with sysfs_emit() in sysfs show functions. sysfs_emit() is preferred over sprintf() for formatting sysfs output as it provides better bounds checking and prevents potential buffer overflows. Also, replace sprintf() with sysfs_emit() in show_constraint_name() and simplify the code by removing the redundant strlen() call since sysfs_emit() returns the length. Signed-off-by: Sumeet Pawnikar Link: https://patch.msgid.link/20260111141237.12340-1-sumeet4linux@gmail.com Signed-off-by: Rafael J. Wysocki --- drivers/powercap/powercap_sys.c | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/drivers/powercap/powercap_sys.c b/drivers/powercap/powercap_sys.c index 1ff369880beb..f3b2ae635305 100644 --- a/drivers/powercap/powercap_sys.c +++ b/drivers/powercap/powercap_sys.c @@ -27,7 +27,7 @@ static ssize_t _attr##_show(struct device *dev, \ \ if (power_zone->ops->get_##_attr) { \ if (!power_zone->ops->get_##_attr(power_zone, &value)) \ - len = sprintf(buf, "%lld\n", value); \ + len = sysfs_emit(buf, "%lld\n", value); \ } \ \ return len; \ @@ -75,7 +75,7 @@ static ssize_t show_constraint_##_attr(struct device *dev, \ pconst = &power_zone->constraints[id]; \ if (pconst && pconst->ops && pconst->ops->get_##_attr) { \ if (!pconst->ops->get_##_attr(power_zone, id, &value)) \ - len = sprintf(buf, "%lld\n", value); \ + len = sysfs_emit(buf, "%lld\n", value); \ } \ \ return len; \ @@ -171,9 +171,8 @@ static ssize_t show_constraint_name(struct device *dev, if (pconst && pconst->ops && pconst->ops->get_name) { name = pconst->ops->get_name(power_zone, id); if (name) { - sprintf(buf, "%.*s\n", 
POWERCAP_CONSTRAINT_NAME_LEN - 1, - name); - len = strlen(buf); + len = sysfs_emit(buf, "%.*s\n", + POWERCAP_CONSTRAINT_NAME_LEN - 1, name); } } @@ -350,7 +349,7 @@ static ssize_t name_show(struct device *dev, { struct powercap_zone *power_zone = to_powercap_zone(dev); - return sprintf(buf, "%s\n", power_zone->name); + return sysfs_emit(buf, "%s\n", power_zone->name); } static DEVICE_ATTR_RO(name); @@ -438,7 +437,7 @@ static ssize_t enabled_show(struct device *dev, mode = false; } - return sprintf(buf, "%d\n", mode); + return sysfs_emit(buf, "%d\n", mode); } static ssize_t enabled_store(struct device *dev, From 54b3cd55a515c7c0fcfa0c1f0b10d62c11d64bcc Mon Sep 17 00:00:00 2001 From: Daniel Tang Date: Wed, 14 Jan 2026 21:01:52 -0500 Subject: [PATCH 29/65] powercap: intel_rapl: Add PL4 support for Ice Lake Microsoft Surface Pro 7 firmware throttles the processor upon boot/resume. Userspace needs to be able to restore the correct value. Link: https://github.com/linux-surface/linux-surface/issues/706 Signed-off-by: Daniel Tang Link: https://patch.msgid.link/6088605.ChMirdbgyp@daniel-desktop3 Signed-off-by: Rafael J. Wysocki --- drivers/powercap/intel_rapl_msr.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/powercap/intel_rapl_msr.c b/drivers/powercap/intel_rapl_msr.c index 9a7e150b3536..a2bc0a9c1e10 100644 --- a/drivers/powercap/intel_rapl_msr.c +++ b/drivers/powercap/intel_rapl_msr.c @@ -162,6 +162,7 @@ static int rapl_msr_write_raw(int cpu, struct reg_action *ra) /* List of verified CPUs. */ static const struct x86_cpu_id pl4_support_ids[] = { + X86_MATCH_VFM(INTEL_ICELAKE_L, NULL), X86_MATCH_VFM(INTEL_TIGERLAKE_L, NULL), X86_MATCH_VFM(INTEL_ALDERLAKE, NULL), X86_MATCH_VFM(INTEL_ALDERLAKE_L, NULL), From e9df6eba060c6db2f7f3fd8666d1af0a369d6f7b Mon Sep 17 00:00:00 2001 From: "Rafael J. 
Wysocki" Date: Thu, 8 Jan 2026 16:05:37 +0100 Subject: [PATCH 30/65] genirq/chip: Change irq_chip_pm_put() return type to void The irq_chip_pm_put() return value is only used in __irq_do_set_handler() to trigger a WARN_ON() if it is negative, but doing so is not useful because irq_chip_pm_put() simply passes the pm_runtime_put() return value to its callers. Returning an error code from pm_runtime_put() merely means that it has not queued up a work item to check whether or not the device can be suspended and there are many perfectly valid situations in which that can happen, like after writing "on" to the devices' runtime PM "control" attribute in sysfs for one example. For this reason, modify irq_chip_pm_put() to discard the pm_runtime_put() return value, change its return type to void, and drop the WARN_ON() around the irq_chip_pm_put() invocation from __irq_do_set_handler(). Also update the irq_chip_pm_put() kerneldoc comment to be more accurate. This will facilitate a planned change of the pm_runtime_put() return type to void in the future. Signed-off-by: Rafael J. 
Wysocki Reviewed-by: Thomas Gleixner Link: https://patch.msgid.link/5075294.31r3eYUQgx@rafael.j.wysocki --- include/linux/irq.h | 2 +- kernel/irq/chip.c | 22 +++++++++++----------- 2 files changed, 12 insertions(+), 12 deletions(-) diff --git a/include/linux/irq.h b/include/linux/irq.h index 4a9f1d7b08c3..ef0816fdc6f2 100644 --- a/include/linux/irq.h +++ b/include/linux/irq.h @@ -658,7 +658,7 @@ extern void handle_fasteoi_nmi(struct irq_desc *desc); extern int irq_chip_compose_msi_msg(struct irq_data *data, struct msi_msg *msg); extern int irq_chip_pm_get(struct irq_data *data); -extern int irq_chip_pm_put(struct irq_data *data); +extern void irq_chip_pm_put(struct irq_data *data); #ifdef CONFIG_IRQ_DOMAIN_HIERARCHY extern void handle_fasteoi_ack_irq(struct irq_desc *desc); extern void handle_fasteoi_mask_irq(struct irq_desc *desc); diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c index 678f094d261a..23f22f3d5207 100644 --- a/kernel/irq/chip.c +++ b/kernel/irq/chip.c @@ -974,7 +974,7 @@ __irq_do_set_handler(struct irq_desc *desc, irq_flow_handler_t handle, irq_state_set_disabled(desc); if (is_chained) { desc->action = NULL; - WARN_ON(irq_chip_pm_put(irq_desc_get_irq_data(desc))); + irq_chip_pm_put(irq_desc_get_irq_data(desc)); } desc->depth = 1; } @@ -1530,20 +1530,20 @@ int irq_chip_pm_get(struct irq_data *data) } /** - * irq_chip_pm_put - Disable power for an IRQ chip + * irq_chip_pm_put - Drop a PM reference on an IRQ chip * @data: Pointer to interrupt specific data * - * Disable the power to the IRQ chip referenced by the interrupt data - * structure, belongs. Note that power will only be disabled, once this - * function has been called for all IRQs that have called irq_chip_pm_get(). + * Drop a power management reference, acquired via irq_chip_pm_get(), on the IRQ + * chip represented by the interrupt data structure. 
+ * + * Note that this will not disable power to the IRQ chip until this function + * has been called for all IRQs that have called irq_chip_pm_get() and it may + * not disable power at all (if user space prevents that, for example). */ -int irq_chip_pm_put(struct irq_data *data) +void irq_chip_pm_put(struct irq_data *data) { struct device *dev = irq_get_pm_device(data); - int retval = 0; - if (IS_ENABLED(CONFIG_PM) && dev) - retval = pm_runtime_put(dev); - - return (retval < 0) ? retval : 0; + if (dev) + pm_runtime_put(dev); } From 75e8635832a2e45d2a910c247eddd6b65d5ce6e1 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Thu, 8 Jan 2026 16:17:17 +0100 Subject: [PATCH 31/65] drm: Discard pm_runtime_put() return value Multiple DRM drivers use the pm_runtime_put() return value for printing debug or even error messages, and all of those messages are at least somewhat misleading. Returning an error code from pm_runtime_put() merely means that it has not queued up a work item to check whether or not the device can be suspended, and there are many perfectly valid situations in which that can happen, like after writing "on" to a device's runtime PM "control" attribute in sysfs, for example. It also happens when the kernel has been configured with CONFIG_PM unset. For this reason, modify all of those drivers to simply discard the pm_runtime_put() return value, which is what they should be doing. This will facilitate a planned change of the pm_runtime_put() return type to void in the future. Signed-off-by: Rafael J.
Wysocki Acked-by: Dave Stevenson Acked-by: Liviu Dudau Link: https://patch.msgid.link/2256082.irdbgypaU6@rafael.j.wysocki --- drivers/gpu/drm/arm/malidp_crtc.c | 6 +----- drivers/gpu/drm/bridge/imx/imx8qm-ldb.c | 4 +--- drivers/gpu/drm/bridge/imx/imx8qxp-ldb.c | 4 +--- drivers/gpu/drm/bridge/imx/imx8qxp-pixel-combiner.c | 5 +---- drivers/gpu/drm/bridge/imx/imx8qxp-pxl2dpi.c | 5 +---- drivers/gpu/drm/imx/dc/dc-crtc.c | 12 +++--------- drivers/gpu/drm/vc4/vc4_hdmi.c | 5 +---- drivers/gpu/drm/vc4/vc4_vec.c | 12 ++---------- 8 files changed, 11 insertions(+), 42 deletions(-) diff --git a/drivers/gpu/drm/arm/malidp_crtc.c b/drivers/gpu/drm/arm/malidp_crtc.c index d72c22dcf685..e61cf362abdf 100644 --- a/drivers/gpu/drm/arm/malidp_crtc.c +++ b/drivers/gpu/drm/arm/malidp_crtc.c @@ -77,7 +77,6 @@ static void malidp_crtc_atomic_disable(struct drm_crtc *crtc, crtc); struct malidp_drm *malidp = crtc_to_malidp_device(crtc); struct malidp_hw_device *hwdev = malidp->dev; - int err; /* always disable planes on the CRTC that is being turned off */ drm_atomic_helper_disable_planes_on_crtc(old_state, false); @@ -87,10 +86,7 @@ static void malidp_crtc_atomic_disable(struct drm_crtc *crtc, clk_disable_unprepare(hwdev->pxlclk); - err = pm_runtime_put(crtc->dev->dev); - if (err < 0) { - DRM_DEBUG_DRIVER("Failed to disable runtime power management: %d\n", err); - } + pm_runtime_put(crtc->dev->dev); } static const struct gamma_curve_segment { diff --git a/drivers/gpu/drm/bridge/imx/imx8qm-ldb.c b/drivers/gpu/drm/bridge/imx/imx8qm-ldb.c index 47aa65938e6a..fc67e7ed653d 100644 --- a/drivers/gpu/drm/bridge/imx/imx8qm-ldb.c +++ b/drivers/gpu/drm/bridge/imx/imx8qm-ldb.c @@ -280,9 +280,7 @@ static void imx8qm_ldb_bridge_atomic_disable(struct drm_bridge *bridge, clk_disable_unprepare(imx8qm_ldb->clk_bypass); clk_disable_unprepare(imx8qm_ldb->clk_pixel); - ret = pm_runtime_put(dev); - if (ret < 0) - DRM_DEV_ERROR(dev, "failed to put runtime PM: %d\n", ret); + pm_runtime_put(dev); } static const 
u32 imx8qm_ldb_bus_output_fmts[] = { diff --git a/drivers/gpu/drm/bridge/imx/imx8qxp-ldb.c b/drivers/gpu/drm/bridge/imx/imx8qxp-ldb.c index 122502968927..d70f3c9b3925 100644 --- a/drivers/gpu/drm/bridge/imx/imx8qxp-ldb.c +++ b/drivers/gpu/drm/bridge/imx/imx8qxp-ldb.c @@ -282,9 +282,7 @@ static void imx8qxp_ldb_bridge_atomic_disable(struct drm_bridge *bridge, if (is_split && companion) companion->funcs->atomic_disable(companion, state); - ret = pm_runtime_put(dev); - if (ret < 0) - DRM_DEV_ERROR(dev, "failed to put runtime PM: %d\n", ret); + pm_runtime_put(dev); } static const u32 imx8qxp_ldb_bus_output_fmts[] = { diff --git a/drivers/gpu/drm/bridge/imx/imx8qxp-pixel-combiner.c b/drivers/gpu/drm/bridge/imx/imx8qxp-pixel-combiner.c index 8517b1c953d4..8e64b5404561 100644 --- a/drivers/gpu/drm/bridge/imx/imx8qxp-pixel-combiner.c +++ b/drivers/gpu/drm/bridge/imx/imx8qxp-pixel-combiner.c @@ -181,11 +181,8 @@ static void imx8qxp_pc_bridge_atomic_disable(struct drm_bridge *bridge, { struct imx8qxp_pc_channel *ch = bridge->driver_private; struct imx8qxp_pc *pc = ch->pc; - int ret; - ret = pm_runtime_put(pc->dev); - if (ret < 0) - DRM_DEV_ERROR(pc->dev, "failed to put runtime PM: %d\n", ret); + pm_runtime_put(pc->dev); } static const u32 imx8qxp_pc_bus_output_fmts[] = { diff --git a/drivers/gpu/drm/bridge/imx/imx8qxp-pxl2dpi.c b/drivers/gpu/drm/bridge/imx/imx8qxp-pxl2dpi.c index 111310acab2c..82a2bba375ad 100644 --- a/drivers/gpu/drm/bridge/imx/imx8qxp-pxl2dpi.c +++ b/drivers/gpu/drm/bridge/imx/imx8qxp-pxl2dpi.c @@ -127,11 +127,8 @@ static void imx8qxp_pxl2dpi_bridge_atomic_disable(struct drm_bridge *bridge, struct drm_atomic_state *state) { struct imx8qxp_pxl2dpi *p2d = bridge->driver_private; - int ret; - ret = pm_runtime_put(p2d->dev); - if (ret < 0) - DRM_DEV_ERROR(p2d->dev, "failed to put runtime PM: %d\n", ret); + pm_runtime_put(p2d->dev); if (p2d->companion) p2d->companion->funcs->atomic_disable(p2d->companion, state); diff --git a/drivers/gpu/drm/imx/dc/dc-crtc.c 
b/drivers/gpu/drm/imx/dc/dc-crtc.c index 31d3a982deaf..608c610662dc 100644 --- a/drivers/gpu/drm/imx/dc/dc-crtc.c +++ b/drivers/gpu/drm/imx/dc/dc-crtc.c @@ -300,7 +300,7 @@ dc_crtc_atomic_disable(struct drm_crtc *crtc, struct drm_atomic_state *state) drm_atomic_get_new_crtc_state(state, crtc); struct dc_drm_device *dc_drm = to_dc_drm_device(crtc->dev); struct dc_crtc *dc_crtc = to_dc_crtc(crtc); - int idx, ret; + int idx; if (!drm_dev_enter(crtc->dev, &idx)) goto out; @@ -313,16 +313,10 @@ dc_crtc_atomic_disable(struct drm_crtc *crtc, struct drm_atomic_state *state) dc_fg_disable_clock(dc_crtc->fg); /* request pixel engine power-off as plane is off too */ - ret = pm_runtime_put(dc_drm->pe->dev); - if (ret) - dc_crtc_err(crtc, "failed to put DC pixel engine RPM: %d\n", - ret); + pm_runtime_put(dc_drm->pe->dev); /* request display engine power-off when CRTC is disabled */ - ret = pm_runtime_put(dc_crtc->de->dev); - if (ret < 0) - dc_crtc_err(crtc, "failed to put DC display engine RPM: %d\n", - ret); + pm_runtime_put(dc_crtc->de->dev); drm_dev_exit(idx); diff --git a/drivers/gpu/drm/vc4/vc4_hdmi.c b/drivers/gpu/drm/vc4/vc4_hdmi.c index 1798d1156d10..4504e38ce844 100644 --- a/drivers/gpu/drm/vc4/vc4_hdmi.c +++ b/drivers/gpu/drm/vc4/vc4_hdmi.c @@ -848,7 +848,6 @@ static void vc4_hdmi_encoder_post_crtc_powerdown(struct drm_encoder *encoder, struct vc4_hdmi *vc4_hdmi = encoder_to_vc4_hdmi(encoder); struct drm_device *drm = vc4_hdmi->connector.dev; unsigned long flags; - int ret; int idx; mutex_lock(&vc4_hdmi->mutex); @@ -867,9 +866,7 @@ static void vc4_hdmi_encoder_post_crtc_powerdown(struct drm_encoder *encoder, clk_disable_unprepare(vc4_hdmi->pixel_bvb_clock); clk_disable_unprepare(vc4_hdmi->pixel_clock); - ret = pm_runtime_put(&vc4_hdmi->pdev->dev); - if (ret < 0) - drm_err(drm, "Failed to release power domain: %d\n", ret); + pm_runtime_put(&vc4_hdmi->pdev->dev); drm_dev_exit(idx); diff --git a/drivers/gpu/drm/vc4/vc4_vec.c b/drivers/gpu/drm/vc4/vc4_vec.c index 
b84fad2a5b23..b0b271d93b27 100644 --- a/drivers/gpu/drm/vc4/vc4_vec.c +++ b/drivers/gpu/drm/vc4/vc4_vec.c @@ -542,7 +542,7 @@ static void vc4_vec_encoder_disable(struct drm_encoder *encoder, { struct drm_device *drm = encoder->dev; struct vc4_vec *vec = encoder_to_vc4_vec(encoder); - int idx, ret; + int idx; if (!drm_dev_enter(drm, &idx)) return; @@ -556,17 +556,9 @@ static void vc4_vec_encoder_disable(struct drm_encoder *encoder, clk_disable_unprepare(vec->clock); - ret = pm_runtime_put(&vec->pdev->dev); - if (ret < 0) { - drm_err(drm, "Failed to release power domain: %d\n", ret); - goto err_dev_exit; - } + pm_runtime_put(&vec->pdev->dev); drm_dev_exit(idx); - return; - - -err_dev_exit: - drm_dev_exit(idx); } static void vc4_vec_encoder_enable(struct drm_encoder *encoder, From 7799ba2160e4919913ecabca8a7fc1aa4c576fb4 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jo=C3=A3o=20Marcos=20Costa?= Date: Tue, 13 Jan 2026 14:27:53 +0100 Subject: [PATCH 32/65] cpupower: make systemd unit installation optional MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit cpupower currently installs a cpupower.service unit file into unitdir unconditionally, regardless of whether systemd is used by the host. Improve the installation procedure by making this systemd step optional via a 'SYSTEMD' build parameter that defaults to 'true' and can be set to 'false' to disable the installation of the systemd unit file. Since 'SYSTEMD' defaults to true, the current behavior is kept as the default.
Link: https://lore.kernel.org/r/20260113132753.1730020-2-joaomarcos.costa@bootlin.com Signed-off-by: João Marcos Costa Signed-off-by: Shuah Khan --- tools/power/cpupower/Makefile | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/tools/power/cpupower/Makefile b/tools/power/cpupower/Makefile index a1df9196dc45..969716dfe8de 100644 --- a/tools/power/cpupower/Makefile +++ b/tools/power/cpupower/Makefile @@ -315,7 +315,17 @@ endif $(INSTALL_DATA) lib/cpuidle.h $(DESTDIR)${includedir}/cpuidle.h $(INSTALL_DATA) lib/powercap.h $(DESTDIR)${includedir}/powercap.h -install-tools: $(OUTPUT)cpupower +# SYSTEMD=false disables installation of the systemd unit file +SYSTEMD ?= true + +install-systemd: + $(INSTALL) -d $(DESTDIR)${unitdir} + sed 's|___CDIR___|${confdir}|; s|___LDIR___|${libexecdir}|' cpupower.service.in > '$(DESTDIR)${unitdir}/cpupower.service' + $(SETPERM_DATA) '$(DESTDIR)${unitdir}/cpupower.service' + +INSTALL_SYSTEMD := $(if $(filter true,$(strip $(SYSTEMD))),install-systemd) + +install-tools: $(OUTPUT)cpupower $(INSTALL_SYSTEMD) $(INSTALL) -d $(DESTDIR)${bindir} $(INSTALL_PROGRAM) $(OUTPUT)cpupower $(DESTDIR)${bindir} $(INSTALL) -d $(DESTDIR)${bash_completion_dir} @@ -324,9 +334,6 @@ install-tools: $(OUTPUT)cpupower $(INSTALL_DATA) cpupower-service.conf '$(DESTDIR)${confdir}' $(INSTALL) -d $(DESTDIR)${libexecdir} $(INSTALL_PROGRAM) cpupower.sh '$(DESTDIR)${libexecdir}/cpupower' - $(INSTALL) -d $(DESTDIR)${unitdir} - sed 's|___CDIR___|${confdir}|; s|___LDIR___|${libexecdir}|' cpupower.service.in > '$(DESTDIR)${unitdir}/cpupower.service' - $(SETPERM_DATA) '$(DESTDIR)${unitdir}/cpupower.service' install-man: $(INSTALL_DATA) -D man/cpupower.1 $(DESTDIR)${mandir}/man1/cpupower.1 @@ -406,4 +413,4 @@ help: @echo ' uninstall - Remove previously installed files from the dir defined by "DESTDIR"' @echo ' cmdline or Makefile config block option (default: "")' -.PHONY: all utils libcpupower update-po create-gmo install-lib install-tools 
install-man install-gmo install uninstall clean help +.PHONY: all utils libcpupower update-po create-gmo install-lib install-systemd install-tools install-man install-gmo install uninstall clean help From 80606f4eb8d7484ab7f7d6f0fd30d71e6fbcf328 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Tue, 20 Jan 2026 16:26:14 +0100 Subject: [PATCH 33/65] cpuidle: governors: menu: Always check timers with tick stopped After commit 5484e31bbbff ("cpuidle: menu: Skip tick_nohz_get_sleep_length() call in some cases"), if the return value of get_typical_interval() multiplied by NSEC_PER_USEC is not greater than RESIDENCY_THRESHOLD_NS, the menu governor will skip computing the time till the closest timer. If that happens when the tick has been stopped already, the selected idle state may be too deep due to the subsequent check comparing predicted_ns with TICK_NSEC and causing its value to be replaced with the expected time till the closest timer, which is KTIME_MAX in that case. That will cause the deepest enabled idle state to be selected, but the time till the closest timer very well may be shorter than the target residency of that state, in which case a shallower state should be used. Address this by making menu_select() always compute the time till the closest timer when the tick has been stopped. Also move the predicted_ns check mentioned above into the branch in which the time till the closest timer is determined because it only needs to be done in that case. Fixes: 5484e31bbbff ("cpuidle: menu: Skip tick_nohz_get_sleep_length() call in some cases") Signed-off-by: Rafael J. 
Wysocki Reviewed-by: Christian Loehle Link: https://patch.msgid.link/5959091.DvuYhMxLoT@rafael.j.wysocki --- drivers/cpuidle/governors/menu.c | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c index ef9c5a84643e..c6052055ba0f 100644 --- a/drivers/cpuidle/governors/menu.c +++ b/drivers/cpuidle/governors/menu.c @@ -239,7 +239,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev, /* Find the shortest expected idle interval. */ predicted_ns = get_typical_interval(data) * NSEC_PER_USEC; - if (predicted_ns > RESIDENCY_THRESHOLD_NS) { + if (predicted_ns > RESIDENCY_THRESHOLD_NS || tick_nohz_tick_stopped()) { unsigned int timer_us; /* Determine the time till the closest timer. */ @@ -259,6 +259,16 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev, RESOLUTION * DECAY * NSEC_PER_USEC); /* Use the lowest expected idle interval to pick the idle state. */ predicted_ns = min((u64)timer_us * NSEC_PER_USEC, predicted_ns); + /* + * If the tick is already stopped, the cost of possible short + * idle duration misprediction is much higher, because the CPU + * may be stuck in a shallow idle state for a long time as a + * result of it. In that case, say we might mispredict and use + * the known time till the closest timer event for the idle + * state selection. + */ + if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC) + predicted_ns = data->next_timer_ns; } else { /* * Because the next timer event is not going to be determined @@ -284,16 +294,6 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev, return 0; } - /* - * If the tick is already stopped, the cost of possible short idle - * duration misprediction is much higher, because the CPU may be stuck - * in a shallow idle state for a long time as a result of it. 
In that - * case, say we might mispredict and use the known time till the closest - * timer event for the idle state selection. - */ - if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC) - predicted_ns = data->next_timer_ns; - /* * Find the idle state with the lowest power while satisfying * our constraints. From 4bd2221f231d798b01027367857d9ba2f24f6ea0 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Wed, 14 Jan 2026 20:44:04 +0100 Subject: [PATCH 34/65] cpuidle: governors: teo: Avoid selecting states with zero-size bins If the last two enabled idle states have the same target residency which is at least equal to TICK_NSEC, teo may select the next-to-last one even though the size of that state's bin is 0, which is confusing. Prevent that from happening by adding a target residency check to the relevant code path. Signed-off-by: Rafael J. Wysocki Reviewed-by: Christian Loehle [ rjw: Fixed a typo in the changelog ] Link: https://patch.msgid.link/3033265.e9J7NaK4W3@rafael.j.wysocki Signed-off-by: Rafael J. Wysocki --- drivers/cpuidle/governors/teo.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c index 81ac5fd58a1c..9820ef36a664 100644 --- a/drivers/cpuidle/governors/teo.c +++ b/drivers/cpuidle/governors/teo.c @@ -388,6 +388,15 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev, while (min_idx < idx && drv->states[min_idx].target_residency_ns < TICK_NSEC) min_idx++; + + /* + * Avoid selecting a state with a lower index, but with + * the same target residency as the current candidate + * one. + */ + if (drv->states[min_idx].target_residency_ns == + drv->states[idx].target_residency_ns) + goto constraint; } /* @@ -410,6 +419,7 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev, } } +constraint: /* * If there is a latency constraint, it may be necessary to select an * idle state shallower than the current candidate one. 
From 60836533b4c7b69e6cb815c87f089e39c2878acd Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Wed, 14 Jan 2026 20:44:53 +0100 Subject: [PATCH 35/65] cpuidle: governors: teo: Avoid fake intercepts produced by tick Tick wakeups can lead to fake intercepts that may skew idle state selection towards shallow states, so it is better to avoid counting them as intercepts. For this purpose, add a check causing teo_update() to only count tick wakeups as intercepts if intercepts within the tick period range are at least twice as frequent as any other events. Signed-off-by: Rafael J. Wysocki Reviewed-by: Christian Loehle Link: https://patch.msgid.link/3404606.44csPzL39Z@rafael.j.wysocki --- drivers/cpuidle/governors/teo.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c index 9820ef36a664..5434584af040 100644 --- a/drivers/cpuidle/governors/teo.c +++ b/drivers/cpuidle/governors/teo.c @@ -239,6 +239,17 @@ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev) cpu_data->state_bins[drv->state_count-1].hits += PULSE; return; } + /* + * If intercepts within the tick period range are not frequent + * enough, count this wakeup as a hit, since it is likely that + * the tick has woken up the CPU because an expected intercept + * was not there. Otherwise, one of the intercepts may have + * been incidentally preceded by the tick wakeup. + */ + if (3 * cpu_data->tick_intercepts < 2 * total) { + cpu_data->state_bins[idx_timer].hits += PULSE; + return; + } } /* From 475ca3470b3739150720f1b285646de38103e7b7 Mon Sep 17 00:00:00 2001 From: "Rafael J. 
Wysocki" Date: Wed, 14 Jan 2026 20:45:30 +0100 Subject: [PATCH 36/65] cpuidle: governors: teo: Refine tick_intercepts vs total events check Use 2/3 as the proportion coefficient in the check comparing cpu_data->tick_intercepts with cpu_data->total because it is close enough to the current one (5/8) and it allows a more straightforward interpretation (on average, intercepts within the tick period length are twice as frequent as other events). Signed-off-by: Rafael J. Wysocki Reviewed-by: Christian Loehle Link: https://patch.msgid.link/10793374.nUPlyArG6x@rafael.j.wysocki --- drivers/cpuidle/governors/teo.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c index 5434584af040..750ab0678a77 100644 --- a/drivers/cpuidle/governors/teo.c +++ b/drivers/cpuidle/governors/teo.c @@ -485,7 +485,7 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev, * total wakeup events, do not stop the tick. */ if (drv->states[idx].target_residency_ns < TICK_NSEC && - cpu_data->tick_intercepts > cpu_data->total / 2 + cpu_data->total / 8) + 3 * cpu_data->tick_intercepts >= 2 * cpu_data->total) duration_ns = TICK_NSEC / 2; end: From 0b7277e02dabba2a9921a7f4761ae6e627e7297a Mon Sep 17 00:00:00 2001 From: Aleks Todorov Date: Fri, 23 Jan 2026 14:03:44 +0000 Subject: [PATCH 37/65] OPP: Return correct value in dev_pm_opp_get_level Commit 073d3d2ca7d4 ("OPP: Level zero is valid") modified the documentation for this function to indicate that errors should return a non-zero value to avoid colliding with the OPP level zero; however, it forgot to actually update the return value. No in-tree kernel code depends on the error value being 0.
Fixes: 073d3d2ca7d4 ("OPP: Level zero is valid") Signed-off-by: Aleks Todorov Signed-off-by: Viresh Kumar --- drivers/opp/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/opp/core.c b/drivers/opp/core.c index dbebb8c829bc..ae43c656f108 100644 --- a/drivers/opp/core.c +++ b/drivers/opp/core.c @@ -241,7 +241,7 @@ unsigned int dev_pm_opp_get_level(struct dev_pm_opp *opp) { if (IS_ERR_OR_NULL(opp) || !opp->available) { pr_err("%s: Invalid parameters\n", __func__); - return 0; + return U32_MAX; } return opp->level; From 8c8b12a55614ea05953e8d695e700e6e1322a05d Mon Sep 17 00:00:00 2001 From: Alexandre Courbot Date: Fri, 28 Nov 2025 11:11:39 +0900 Subject: [PATCH 38/65] rust: cpufreq: always inline functions using build_assert with arguments `build_assert` relies on the compiler to optimize out its error path. Functions using it with its arguments must thus always be inlined, otherwise the error path of `build_assert` might not be optimized out, triggering a build error. Signed-off-by: Alexandre Courbot Reviewed-by: Daniel Almeida Signed-off-by: Viresh Kumar --- rust/kernel/cpufreq.rs | 2 ++ 1 file changed, 2 insertions(+) diff --git a/rust/kernel/cpufreq.rs b/rust/kernel/cpufreq.rs index f968fbd22890..0879a79485f8 100644 --- a/rust/kernel/cpufreq.rs +++ b/rust/kernel/cpufreq.rs @@ -1015,6 +1015,8 @@ impl Registration { ..pin_init::zeroed() }; + // Always inline to optimize out error path of `build_assert`. + #[inline(always)] const fn copy_name(name: &'static CStr) -> [c_char; CPUFREQ_NAME_LEN] { let src = name.to_bytes_with_nul(); let mut dst = [0; CPUFREQ_NAME_LEN]; From 9d84fd86d9ce26be72f1cf6839a9335005734d4f Mon Sep 17 00:00:00 2001 From: Alice Ryhl Date: Tue, 2 Dec 2025 19:37:35 +0000 Subject: [PATCH 39/65] rust: cpufreq: add __rust_helper to helpers This is needed to inline these helpers into Rust code. 
Signed-off-by: Alice Ryhl Reviewed-by: Boqun Feng Signed-off-by: Viresh Kumar --- rust/helpers/cpufreq.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/rust/helpers/cpufreq.c b/rust/helpers/cpufreq.c index 7c1343c4d65e..0e16aeef2b5a 100644 --- a/rust/helpers/cpufreq.c +++ b/rust/helpers/cpufreq.c @@ -3,7 +3,8 @@ #include #ifdef CONFIG_CPU_FREQ -void rust_helper_cpufreq_register_em_with_opp(struct cpufreq_policy *policy) +__rust_helper void +rust_helper_cpufreq_register_em_with_opp(struct cpufreq_policy *policy) { cpufreq_register_em_with_opp(policy); } From e79cc7b5eba255fc0534212d25ee6142213d5314 Mon Sep 17 00:00:00 2001 From: Luca Weiss Date: Wed, 10 Dec 2025 10:43:25 +0900 Subject: [PATCH 40/65] dt-bindings: cpufreq: qcom-hw: document Milos CPUFREQ Hardware Document the CPUFREQ Hardware on the Milos SoC. Acked-by: Rob Herring (Arm) Acked-by: Viresh Kumar Signed-off-by: Luca Weiss Signed-off-by: Viresh Kumar --- Documentation/devicetree/bindings/cpufreq/cpufreq-qcom-hw.yaml | 2 ++ 1 file changed, 2 insertions(+) diff --git a/Documentation/devicetree/bindings/cpufreq/cpufreq-qcom-hw.yaml b/Documentation/devicetree/bindings/cpufreq/cpufreq-qcom-hw.yaml index 2d42fc3d8ef8..22eeaef14f55 100644 --- a/Documentation/devicetree/bindings/cpufreq/cpufreq-qcom-hw.yaml +++ b/Documentation/devicetree/bindings/cpufreq/cpufreq-qcom-hw.yaml @@ -35,6 +35,7 @@ properties: - description: v2 of CPUFREQ HW (EPSS) items: - enum: + - qcom,milos-cpufreq-epss - qcom,qcs8300-cpufreq-epss - qcom,qdu1000-cpufreq-epss - qcom,sa8255p-cpufreq-epss @@ -169,6 +170,7 @@ allOf: compatible: contains: enum: + - qcom,milos-cpufreq-epss - qcom,qcs8300-cpufreq-epss - qcom,sc7280-cpufreq-epss - qcom,sm8250-cpufreq-epss From d6a6c58da38e4c4564e841faf3880769ff09936b Mon Sep 17 00:00:00 2001 From: Aaron Kling Date: Thu, 18 Dec 2025 15:39:52 -0600 Subject: [PATCH 41/65] cpufreq: Add Tegra186 and Tegra194 to cpufreq-dt-platdev blocklist These have platform specific drivers. 
Signed-off-by: Aaron Kling Signed-off-by: Viresh Kumar --- drivers/cpufreq/cpufreq-dt-platdev.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/cpufreq/cpufreq-dt-platdev.c b/drivers/cpufreq/cpufreq-dt-platdev.c index a1d11ecd1ac8..4348eba6eb91 100644 --- a/drivers/cpufreq/cpufreq-dt-platdev.c +++ b/drivers/cpufreq/cpufreq-dt-platdev.c @@ -147,6 +147,8 @@ static const struct of_device_id blocklist[] __initconst = { { .compatible = "nvidia,tegra30", }, { .compatible = "nvidia,tegra114", }, { .compatible = "nvidia,tegra124", }, + { .compatible = "nvidia,tegra186", }, + { .compatible = "nvidia,tegra194", }, { .compatible = "nvidia,tegra210", }, { .compatible = "nvidia,tegra234", }, From e05d9e5c8b754cc7d72acd896f5f7caf6b78a973 Mon Sep 17 00:00:00 2001 From: Tamir Duberstein Date: Mon, 22 Dec 2025 13:29:32 +0100 Subject: [PATCH 42/65] rust: cpufreq: replace `kernel::c_str!` with C-Strings C-String literals were added in Rust 1.77. Replace instances of `kernel::c_str!` with C-String literals where possible. Acked-by: Greg Kroah-Hartman Reviewed-by: Alice Ryhl Reviewed-by: Benno Lossin Signed-off-by: Tamir Duberstein Reviewed-by: Daniel Almeida Acked-by: Danilo Krummrich Signed-off-by: Viresh Kumar --- drivers/cpufreq/rcpufreq_dt.rs | 5 ++--- rust/kernel/cpufreq.rs | 3 +-- 2 files changed, 3 insertions(+), 5 deletions(-) diff --git a/drivers/cpufreq/rcpufreq_dt.rs b/drivers/cpufreq/rcpufreq_dt.rs index 31e07f0279db..f17bf64c22e2 100644 --- a/drivers/cpufreq/rcpufreq_dt.rs +++ b/drivers/cpufreq/rcpufreq_dt.rs @@ -3,7 +3,6 @@ //! Rust based implementation of the cpufreq-dt driver. 
use kernel::{ - c_str, clk::Clk, cpu, cpufreq, cpumask::CpumaskVar, @@ -52,7 +51,7 @@ impl opp::ConfigOps for CPUFreqDTDriver {} #[vtable] impl cpufreq::Driver for CPUFreqDTDriver { - const NAME: &'static CStr = c_str!("cpufreq-dt"); + const NAME: &'static CStr = c"cpufreq-dt"; const FLAGS: u16 = cpufreq::flags::NEED_INITIAL_FREQ_CHECK | cpufreq::flags::IS_COOLING_DEV; const BOOST_ENABLED: bool = true; @@ -197,7 +196,7 @@ fn register_em(policy: &mut cpufreq::Policy) { OF_TABLE, MODULE_OF_TABLE, ::IdInfo, - [(of::DeviceId::new(c_str!("operating-points-v2")), ())] + [(of::DeviceId::new(c"operating-points-v2"), ())] ); impl platform::Driver for CPUFreqDTDriver { diff --git a/rust/kernel/cpufreq.rs b/rust/kernel/cpufreq.rs index 0879a79485f8..76faa1ac8501 100644 --- a/rust/kernel/cpufreq.rs +++ b/rust/kernel/cpufreq.rs @@ -840,7 +840,6 @@ fn register_em(_policy: &mut Policy) { /// ``` /// use kernel::{ /// cpufreq, -/// c_str, /// device::{Core, Device}, /// macros::vtable, /// of, platform, @@ -853,7 +852,7 @@ fn register_em(_policy: &mut Policy) { /// /// #[vtable] /// impl cpufreq::Driver for SampleDriver { -/// const NAME: &'static CStr = c_str!("cpufreq-sample"); +/// const NAME: &'static CStr = c"cpufreq-sample"; /// const FLAGS: u16 = cpufreq::flags::NEED_INITIAL_FREQ_CHECK | cpufreq::flags::IS_COOLING_DEV; /// const BOOST_ENABLED: bool = true; /// From f9cadb3d56912a70571fdd95f426b757557c465b Mon Sep 17 00:00:00 2001 From: Jie Zhan Date: Tue, 23 Dec 2025 15:21:17 +0800 Subject: [PATCH 43/65] ACPI: CPPC: Factor out and export per-cpu cppc_perf_ctrs_in_pcc_cpu() Factor out cppc_perf_ctrs_in_pcc_cpu() for checking whether per-cpu CPC regs are defined in PCC channels, and export it out for further use. Reviewed-by: Lifeng Zheng Reviewed-by: Pierre Gondois Signed-off-by: Jie Zhan Acked-by: Rafael J. 
Wysocki (Intel) Signed-off-by: Viresh Kumar --- drivers/acpi/cppc_acpi.c | 48 ++++++++++++++++++++++------------------ include/acpi/cppc_acpi.h | 5 +++++ 2 files changed, 32 insertions(+), 21 deletions(-) diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c index 3bdeeee3414e..ec4966aaa8d4 100644 --- a/drivers/acpi/cppc_acpi.c +++ b/drivers/acpi/cppc_acpi.c @@ -1422,6 +1422,32 @@ int cppc_get_perf_caps(int cpunum, struct cppc_perf_caps *perf_caps) } EXPORT_SYMBOL_GPL(cppc_get_perf_caps); +/** + * cppc_perf_ctrs_in_pcc_cpu - Check if any perf counters of a CPU are in PCC. + * @cpu: CPU on which to check perf counters. + * + * Return: true if any of the counters are in PCC regions, false otherwise + */ +bool cppc_perf_ctrs_in_pcc_cpu(unsigned int cpu) +{ + struct cpc_desc *cpc_desc = per_cpu(cpc_desc_ptr, cpu); + struct cpc_register_resource *ref_perf_reg; + + /* + * If reference perf register is not supported then we should use the + * nominal perf value + */ + ref_perf_reg = &cpc_desc->cpc_regs[REFERENCE_PERF]; + if (!CPC_SUPPORTED(ref_perf_reg)) + ref_perf_reg = &cpc_desc->cpc_regs[NOMINAL_PERF]; + + return CPC_IN_PCC(&cpc_desc->cpc_regs[DELIVERED_CTR]) || + CPC_IN_PCC(&cpc_desc->cpc_regs[REFERENCE_CTR]) || + CPC_IN_PCC(&cpc_desc->cpc_regs[CTR_WRAP_TIME]) || + CPC_IN_PCC(ref_perf_reg); +} +EXPORT_SYMBOL_GPL(cppc_perf_ctrs_in_pcc_cpu); + /** * cppc_perf_ctrs_in_pcc - Check if any perf counters are in a PCC region. 
* @@ -1436,27 +1462,7 @@ bool cppc_perf_ctrs_in_pcc(void) int cpu; for_each_online_cpu(cpu) { - struct cpc_register_resource *ref_perf_reg; - struct cpc_desc *cpc_desc; - - cpc_desc = per_cpu(cpc_desc_ptr, cpu); - - if (CPC_IN_PCC(&cpc_desc->cpc_regs[DELIVERED_CTR]) || - CPC_IN_PCC(&cpc_desc->cpc_regs[REFERENCE_CTR]) || - CPC_IN_PCC(&cpc_desc->cpc_regs[CTR_WRAP_TIME])) - return true; - - - ref_perf_reg = &cpc_desc->cpc_regs[REFERENCE_PERF]; - - /* - * If reference perf register is not supported then we should - * use the nominal perf value - */ - if (!CPC_SUPPORTED(ref_perf_reg)) - ref_perf_reg = &cpc_desc->cpc_regs[NOMINAL_PERF]; - - if (CPC_IN_PCC(ref_perf_reg)) + if (cppc_perf_ctrs_in_pcc_cpu(cpu)) return true; } diff --git a/include/acpi/cppc_acpi.h b/include/acpi/cppc_acpi.h index 13fa81504844..4bcdcaf8bf2c 100644 --- a/include/acpi/cppc_acpi.h +++ b/include/acpi/cppc_acpi.h @@ -154,6 +154,7 @@ extern int cppc_get_perf_ctrs(int cpu, struct cppc_perf_fb_ctrs *perf_fb_ctrs); extern int cppc_set_perf(int cpu, struct cppc_perf_ctrls *perf_ctrls); extern int cppc_set_enable(int cpu, bool enable); extern int cppc_get_perf_caps(int cpu, struct cppc_perf_caps *caps); +extern bool cppc_perf_ctrs_in_pcc_cpu(unsigned int cpu); extern bool cppc_perf_ctrs_in_pcc(void); extern unsigned int cppc_perf_to_khz(struct cppc_perf_caps *caps, unsigned int perf); extern unsigned int cppc_khz_to_perf(struct cppc_perf_caps *caps, unsigned int freq); @@ -204,6 +205,10 @@ static inline int cppc_get_perf_caps(int cpu, struct cppc_perf_caps *caps) { return -EOPNOTSUPP; } +static inline bool cppc_perf_ctrs_in_pcc_cpu(unsigned int cpu) +{ + return false; +} static inline bool cppc_perf_ctrs_in_pcc(void) { return false; From 206b6612556398e717b1e293d96992d5ab2b8f32 Mon Sep 17 00:00:00 2001 From: Jie Zhan Date: Tue, 23 Dec 2025 15:21:18 +0800 Subject: [PATCH 44/65] cpufreq: CPPC: Factor out cppc_fie_kworker_init() Factor out the CPPC FIE kworker init in cppc_freq_invariance_init() because 
it's a standalone procedure for use when the CPC regs are in PCC channels. Reviewed-by: Lifeng Zheng Reviewed-by: Pierre Gondois Signed-off-by: Jie Zhan Signed-off-by: Viresh Kumar --- drivers/cpufreq/cppc_cpufreq.c | 29 +++++++++++++++++------------ 1 file changed, 17 insertions(+), 12 deletions(-) diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c index 9eac77c4f294..947b4e2e1d4e 100644 --- a/drivers/cpufreq/cppc_cpufreq.c +++ b/drivers/cpufreq/cppc_cpufreq.c @@ -184,7 +184,7 @@ static void cppc_cpufreq_cpu_fie_exit(struct cpufreq_policy *policy) } } -static void __init cppc_freq_invariance_init(void) +static void cppc_fie_kworker_init(void) { struct sched_attr attr = { .size = sizeof(struct sched_attr), @@ -201,17 +201,6 @@ static void __init cppc_freq_invariance_init(void) }; int ret; - if (fie_disabled != FIE_ENABLED && fie_disabled != FIE_DISABLED) { - fie_disabled = FIE_ENABLED; - if (cppc_perf_ctrs_in_pcc()) { - pr_info("FIE not enabled on systems with registers in PCC\n"); - fie_disabled = FIE_DISABLED; - } - } - - if (fie_disabled) - return; - kworker_fie = kthread_run_worker(0, "cppc_fie"); if (IS_ERR(kworker_fie)) { pr_warn("%s: failed to create kworker_fie: %ld\n", __func__, @@ -229,6 +218,22 @@ static void __init cppc_freq_invariance_init(void) } } +static void __init cppc_freq_invariance_init(void) +{ + if (fie_disabled != FIE_ENABLED && fie_disabled != FIE_DISABLED) { + fie_disabled = FIE_ENABLED; + if (cppc_perf_ctrs_in_pcc()) { + pr_info("FIE not enabled on systems with registers in PCC\n"); + fie_disabled = FIE_DISABLED; + } + } + + if (fie_disabled) + return; + + cppc_fie_kworker_init(); +} + static void cppc_freq_invariance_exit(void) { if (fie_disabled) From 997c021abc6eb9cf7df39fa77fa5e666ad55e3a3 Mon Sep 17 00:00:00 2001 From: Jie Zhan Date: Tue, 23 Dec 2025 15:21:19 +0800 Subject: [PATCH 45/65] cpufreq: CPPC: Update FIE arch_freq_scale in ticks for non-PCC regs Currently, the CPPC Frequency Invariance Engine (FIE) 
is invoked from the scheduler tick but defers the update of arch_freq_scale to a separate thread because cppc_get_perf_ctrs() would sleep if the CPC regs are in PCC. However, this deferred update mechanism is unnecessary and introduces extra overhead for non-PCC register spaces (e.g. System Memory or FFH), where accessing the regs won't sleep and can be safely performed from the tick context. Furthermore, with the CPPC FIE registered, it throws repeated warnings of "cppc_scale_freq_workfn: failed to read perf counters" on our platform with the CPC regs in System Memory and a power-down idle state enabled. That's because the remote CPU can be in a power-down idle state, and reading its perf counters returns 0. Moving the FIE handling back to the scheduler tick process makes the CPU handle its own perf counters, so it won't be idle and the issue would be inherently solved. To address the above issues, update arch_freq_scale directly in ticks for non-PCC regs and keep the deferred update mechanism for PCC regs. Reviewed-by: Lifeng Zheng Reviewed-by: Pierre Gondois Signed-off-by: Jie Zhan Signed-off-by: Viresh Kumar --- drivers/cpufreq/cppc_cpufreq.c | 77 +++++++++++++++++++++++----------- 1 file changed, 52 insertions(+), 25 deletions(-) diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c index 947b4e2e1d4e..36e8a75a37f1 100644 --- a/drivers/cpufreq/cppc_cpufreq.c +++ b/drivers/cpufreq/cppc_cpufreq.c @@ -54,31 +54,24 @@ static int cppc_perf_from_fbctrs(struct cppc_perf_fb_ctrs *fb_ctrs_t0, struct cppc_perf_fb_ctrs *fb_ctrs_t1); /** - * cppc_scale_freq_workfn - CPPC arch_freq_scale updater for frequency invariance - * @work: The work item. + * __cppc_scale_freq_tick - CPPC arch_freq_scale updater for frequency invariance + * @cppc_fi: per-cpu CPPC FIE data. 
* - * The CPPC driver register itself with the topology core to provide its own + * The CPPC driver registers itself with the topology core to provide its own * implementation (cppc_scale_freq_tick()) of topology_scale_freq_tick() which * gets called by the scheduler on every tick. * * Note that the arch specific counters have higher priority than CPPC counters, * if available, though the CPPC driver doesn't need to have any special * handling for that. - * - * On an invocation of cppc_scale_freq_tick(), we schedule an irq work (since we - * reach here from hard-irq context), which then schedules a normal work item - * and cppc_scale_freq_workfn() updates the per_cpu arch_freq_scale variable - * based on the counter updates since the last tick. */ -static void cppc_scale_freq_workfn(struct kthread_work *work) +static void __cppc_scale_freq_tick(struct cppc_freq_invariance *cppc_fi) { - struct cppc_freq_invariance *cppc_fi; struct cppc_perf_fb_ctrs fb_ctrs = {0}; struct cppc_cpudata *cpu_data; unsigned long local_freq_scale; u64 perf; - cppc_fi = container_of(work, struct cppc_freq_invariance, work); cpu_data = cppc_fi->cpu_data; if (cppc_get_perf_ctrs(cppc_fi->cpu, &fb_ctrs)) { @@ -102,6 +95,24 @@ static void cppc_scale_freq_workfn(struct kthread_work *work) per_cpu(arch_freq_scale, cppc_fi->cpu) = local_freq_scale; } +static void cppc_scale_freq_tick(void) +{ + __cppc_scale_freq_tick(&per_cpu(cppc_freq_inv, smp_processor_id())); +} + +static struct scale_freq_data cppc_sftd = { + .source = SCALE_FREQ_SOURCE_CPPC, + .set_freq_scale = cppc_scale_freq_tick, +}; + +static void cppc_scale_freq_workfn(struct kthread_work *work) +{ + struct cppc_freq_invariance *cppc_fi; + + cppc_fi = container_of(work, struct cppc_freq_invariance, work); + __cppc_scale_freq_tick(cppc_fi); +} + static void cppc_irq_work(struct irq_work *irq_work) { struct cppc_freq_invariance *cppc_fi; @@ -110,7 +121,14 @@ static void cppc_irq_work(struct irq_work *irq_work) 
kthread_queue_work(kworker_fie, &cppc_fi->work); } -static void cppc_scale_freq_tick(void) +/* + * Reading perf counters may sleep if the CPC regs are in PCC. Thus, we + * schedule an irq work in scale_freq_tick (since we reach here from hard-irq + * context), which then schedules a normal work item cppc_scale_freq_workfn() + * that updates the per_cpu arch_freq_scale variable based on the counter + * updates since the last tick. + */ +static void cppc_scale_freq_tick_pcc(void) { struct cppc_freq_invariance *cppc_fi = &per_cpu(cppc_freq_inv, smp_processor_id()); @@ -121,13 +139,14 @@ static void cppc_scale_freq_tick(void) irq_work_queue(&cppc_fi->irq_work); } -static struct scale_freq_data cppc_sftd = { +static struct scale_freq_data cppc_sftd_pcc = { .source = SCALE_FREQ_SOURCE_CPPC, - .set_freq_scale = cppc_scale_freq_tick, + .set_freq_scale = cppc_scale_freq_tick_pcc, }; static void cppc_cpufreq_cpu_fie_init(struct cpufreq_policy *policy) { + struct scale_freq_data *sftd = &cppc_sftd; struct cppc_freq_invariance *cppc_fi; int cpu, ret; @@ -138,8 +157,11 @@ static void cppc_cpufreq_cpu_fie_init(struct cpufreq_policy *policy) cppc_fi = &per_cpu(cppc_freq_inv, cpu); cppc_fi->cpu = cpu; cppc_fi->cpu_data = policy->driver_data; - kthread_init_work(&cppc_fi->work, cppc_scale_freq_workfn); - init_irq_work(&cppc_fi->irq_work, cppc_irq_work); + if (cppc_perf_ctrs_in_pcc_cpu(cpu)) { + kthread_init_work(&cppc_fi->work, cppc_scale_freq_workfn); + init_irq_work(&cppc_fi->irq_work, cppc_irq_work); + sftd = &cppc_sftd_pcc; + } ret = cppc_get_perf_ctrs(cpu, &cppc_fi->prev_perf_fb_ctrs); @@ -155,7 +177,7 @@ static void cppc_cpufreq_cpu_fie_init(struct cpufreq_policy *policy) } /* Register for freq-invariance */ - topology_set_scale_freq_source(&cppc_sftd, policy->cpus); + topology_set_scale_freq_source(sftd, policy->cpus); } /* @@ -178,6 +200,8 @@ static void cppc_cpufreq_cpu_fie_exit(struct cpufreq_policy *policy) topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_CPPC, 
policy->related_cpus); for_each_cpu(cpu, policy->related_cpus) { + if (!cppc_perf_ctrs_in_pcc_cpu(cpu)) + continue; cppc_fi = &per_cpu(cppc_freq_inv, cpu); irq_work_sync(&cppc_fi->irq_work); kthread_cancel_work_sync(&cppc_fi->work); @@ -206,6 +230,7 @@ static void cppc_fie_kworker_init(void) pr_warn("%s: failed to create kworker_fie: %ld\n", __func__, PTR_ERR(kworker_fie)); fie_disabled = FIE_DISABLED; + kworker_fie = NULL; return; } @@ -215,20 +240,24 @@ static void cppc_fie_kworker_init(void) ret); kthread_destroy_worker(kworker_fie); fie_disabled = FIE_DISABLED; + kworker_fie = NULL; } } static void __init cppc_freq_invariance_init(void) { - if (fie_disabled != FIE_ENABLED && fie_disabled != FIE_DISABLED) { - fie_disabled = FIE_ENABLED; - if (cppc_perf_ctrs_in_pcc()) { + bool perf_ctrs_in_pcc = cppc_perf_ctrs_in_pcc(); + + if (fie_disabled == FIE_UNSET) { + if (perf_ctrs_in_pcc) { pr_info("FIE not enabled on systems with registers in PCC\n"); fie_disabled = FIE_DISABLED; + } else { + fie_disabled = FIE_ENABLED; } } - if (fie_disabled) + if (fie_disabled || !perf_ctrs_in_pcc) return; cppc_fie_kworker_init(); @@ -236,10 +265,8 @@ static void __init cppc_freq_invariance_init(void) static void cppc_freq_invariance_exit(void) { - if (fie_disabled) - return; - - kthread_destroy_worker(kworker_fie); + if (kworker_fie) + kthread_destroy_worker(kworker_fie); } #else From 11af6e102d31433e3084d6d6cdb2b2fe6c23d1a9 Mon Sep 17 00:00:00 2001 From: Yilin Chen <1479826151@qq.com> Date: Mon, 12 Jan 2026 16:00:47 +0800 Subject: [PATCH 46/65] rust: cpumask: rename methods of Cpumask for clarity and consistency Rename `as_ref` and `as_mut_ref` to `from_raw` and `from_raw_mut` to align with the established naming convention for constructing types from raw pointers in the kernel's Rust codebase. 
Signed-off-by: Yilin Chen <1479826151@qq.com> Reviewed-by: Gary Guo Reviewed-by: Alice Ryhl Signed-off-by: Viresh Kumar --- rust/kernel/cpumask.rs | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/rust/kernel/cpumask.rs b/rust/kernel/cpumask.rs index c1d17826ae7b..44bb36636ee3 100644 --- a/rust/kernel/cpumask.rs +++ b/rust/kernel/cpumask.rs @@ -39,7 +39,7 @@ /// fn set_clear_cpu(ptr: *mut bindings::cpumask, set_cpu: CpuId, clear_cpu: CpuId) { /// // SAFETY: The `ptr` is valid for writing and remains valid for the lifetime of the /// // returned reference. -/// let mask = unsafe { Cpumask::as_mut_ref(ptr) }; +/// let mask = unsafe { Cpumask::from_raw_mut(ptr) }; /// /// mask.set(set_cpu); /// mask.clear(clear_cpu); @@ -49,13 +49,13 @@ pub struct Cpumask(Opaque); impl Cpumask { - /// Creates a mutable reference to an existing `struct cpumask` pointer. + /// Creates a mutable reference from an existing `struct cpumask` pointer. /// /// # Safety /// /// The caller must ensure that `ptr` is valid for writing and remains valid for the lifetime /// of the returned reference. - pub unsafe fn as_mut_ref<'a>(ptr: *mut bindings::cpumask) -> &'a mut Self { + pub unsafe fn from_raw_mut<'a>(ptr: *mut bindings::cpumask) -> &'a mut Self { // SAFETY: Guaranteed by the safety requirements of the function. // // INVARIANT: The caller ensures that `ptr` is valid for writing and remains valid for the @@ -63,13 +63,13 @@ pub unsafe fn as_mut_ref<'a>(ptr: *mut bindings::cpumask) -> &'a mut Self { unsafe { &mut *ptr.cast() } } - /// Creates a reference to an existing `struct cpumask` pointer. + /// Creates a reference from an existing `struct cpumask` pointer. /// /// # Safety /// /// The caller must ensure that `ptr` is valid for reading and remains valid for the lifetime /// of the returned reference. 
- pub unsafe fn as_ref<'a>(ptr: *const bindings::cpumask) -> &'a Self { + pub unsafe fn from_raw<'a>(ptr: *const bindings::cpumask) -> &'a Self { // SAFETY: Guaranteed by the safety requirements of the function. // // INVARIANT: The caller ensures that `ptr` is valid for reading and remains valid for the From 7b781899072c5701ef9538c365757ee9ab9c00bd Mon Sep 17 00:00:00 2001 From: Konrad Dybcio Date: Tue, 13 Jan 2026 16:25:35 +0100 Subject: [PATCH 47/65] cpufreq: dt-platdev: Block the driver from probing on more QC platforms Add a number of QC platforms to the blocklist; they all use the qcom-cpufreq-hw driver. Signed-off-by: Konrad Dybcio Signed-off-by: Viresh Kumar --- drivers/cpufreq/cpufreq-dt-platdev.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/cpufreq/cpufreq-dt-platdev.c b/drivers/cpufreq/cpufreq-dt-platdev.c index 4348eba6eb91..73b00c51f9e9 100644 --- a/drivers/cpufreq/cpufreq-dt-platdev.c +++ b/drivers/cpufreq/cpufreq-dt-platdev.c @@ -171,8 +171,11 @@ static const struct of_device_id blocklist[] __initconst = { { .compatible = "qcom,sdm845", }, { .compatible = "qcom,sdx75", }, { .compatible = "qcom,sm6115", }, + { .compatible = "qcom,sm6125", }, + { .compatible = "qcom,sm6150", }, { .compatible = "qcom,sm6350", }, { .compatible = "qcom,sm6375", }, + { .compatible = "qcom,sm7125", }, { .compatible = "qcom,sm7225", }, { .compatible = "qcom,sm7325", }, { .compatible = "qcom,sm8150", }, From 8c376f337a7e31c42949247e24eaad9a30d6c62c Mon Sep 17 00:00:00 2001 From: Sergey Shtylyov Date: Tue, 13 Jan 2026 22:33:30 +0300 Subject: [PATCH 48/65] cpufreq: scmi: correct SCMI explanation SCMI stands for System Control and Management Interface, not System Control and Power Interface -- apparently, Sudeep Holla copied this line from his SCPI driver and then just forgot to update the acronym explanation...
:-) Fixes: 99d6bdf33877 ("cpufreq: add support for CPU DVFS based on SCMI message protocol") Signed-off-by: Sergey Shtylyov Reviewed-by: Sudeep Holla Signed-off-by: Viresh Kumar --- drivers/cpufreq/scmi-cpufreq.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/cpufreq/scmi-cpufreq.c b/drivers/cpufreq/scmi-cpufreq.c index d2a110079f5f..e0e1756180b0 100644 --- a/drivers/cpufreq/scmi-cpufreq.c +++ b/drivers/cpufreq/scmi-cpufreq.c @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0 /* - * System Control and Power Interface (SCMI) based CPUFreq Interface driver + * System Control and Management Interface (SCMI) based CPUFreq Interface driver * * Copyright (C) 2018-2021 ARM Ltd. * Sudeep Holla From 94dbce6c13cd7634f9bdb402248991c95a8c3d57 Mon Sep 17 00:00:00 2001 From: Juan Martinez Date: Fri, 16 Jan 2026 15:45:39 -0600 Subject: [PATCH 49/65] cpufreq/amd-pstate: Add comment explaining nominal_perf usage for performance policy Add comment explaining why nominal_perf is used for MinPerf when the CPU frequency policy is set to CPUFREQ_POLICY_PERFORMANCE, rather than using highest_perf or lowest_nonlinear_perf. Signed-off-by: Juan Martinez Signed-off-by: Viresh Kumar --- drivers/cpufreq/amd-pstate.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/drivers/cpufreq/amd-pstate.c b/drivers/cpufreq/amd-pstate.c index c45bc98721d2..ec9f38b219de 100644 --- a/drivers/cpufreq/amd-pstate.c +++ b/drivers/cpufreq/amd-pstate.c @@ -636,6 +636,19 @@ static void amd_pstate_update_min_max_limit(struct cpufreq_policy *policy) WRITE_ONCE(cpudata->max_limit_freq, policy->max); if (cpudata->policy == CPUFREQ_POLICY_PERFORMANCE) { + /* + * For performance policy, set MinPerf to nominal_perf rather than + * highest_perf or lowest_nonlinear_perf. + * + * Per commit 0c411b39e4f4c, using highest_perf was observed + * to cause frequency throttling on power-limited platforms, leading to + * performance regressions. 
Using lowest_nonlinear_perf would limit + * performance too much for HPC workloads requiring high frequency + * operation and minimal wakeup latency from idle states. + * + * nominal_perf therefore provides a balance by avoiding throttling + * while still maintaining enough performance for HPC workloads. + */ perf.min_limit_perf = min(perf.nominal_perf, perf.max_limit_perf); WRITE_ONCE(cpudata->min_limit_freq, min(cpudata->nominal_freq, cpudata->max_limit_freq)); } else { From 945fc28a06a1d30315ca416167754e10208024a5 Mon Sep 17 00:00:00 2001 From: Dhruva Gole Date: Tue, 20 Jan 2026 17:17:30 +0530 Subject: [PATCH 50/65] cpufreq: dt-platdev: Add ti,am62l3 to blocklist Add AM62L3 SoC to the dt-platdev blocklist to ensure proper handling of CPUFreq functionality. The AM62L3 will use its native TI CPUFreq driver implementation instead of the generic dt-platdev driver. This follows the same pattern as other TI SoCs like AM62A7, AM62D2, and AM62P5 which have been previously added to this blocklist. Reviewed-by: Kendall Willis Signed-off-by: Dhruva Gole Signed-off-by: Viresh Kumar --- drivers/cpufreq/cpufreq-dt-platdev.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/cpufreq/cpufreq-dt-platdev.c b/drivers/cpufreq/cpufreq-dt-platdev.c index 73b00c51f9e9..4b0b6c521b36 100644 --- a/drivers/cpufreq/cpufreq-dt-platdev.c +++ b/drivers/cpufreq/cpufreq-dt-platdev.c @@ -196,6 +196,7 @@ static const struct of_device_id blocklist[] __initconst = { { .compatible = "ti,am625", }, { .compatible = "ti,am62a7", }, { .compatible = "ti,am62d2", }, + { .compatible = "ti,am62l3", }, { .compatible = "ti,am62p5", }, { .compatible = "qcom,ipq5332", }, From dea8bfea76e4bea9f727f777604d4053d7e9cd92 Mon Sep 17 00:00:00 2001 From: Dhruva Gole Date: Tue, 20 Jan 2026 17:17:31 +0530 Subject: [PATCH 51/65] cpufreq: ti-cpufreq: add support for AM62L3 SoC Add CPUFreq support for the AM62L3 SoC with the appropriate AM62L3 speed grade constants according to the datasheet [1]. 
This follows the same architecture-specific implementation pattern as other TI SoCs in the AM6x family. While at it, also sort instances where the SOC family names were not sorted alphabetically. [1] https://www.ti.com/lit/pdf/SPRSPA1 Signed-off-by: Dhruva Gole Reviewed-by: Kendall Willis Signed-off-by: Viresh Kumar --- drivers/cpufreq/ti-cpufreq.c | 34 +++++++++++++++++++++++++++++++++- 1 file changed, 33 insertions(+), 1 deletion(-) diff --git a/drivers/cpufreq/ti-cpufreq.c b/drivers/cpufreq/ti-cpufreq.c index 6ee76f5fe9c5..3d1129aeed02 100644 --- a/drivers/cpufreq/ti-cpufreq.c +++ b/drivers/cpufreq/ti-cpufreq.c @@ -70,6 +70,12 @@ enum { #define AM62A7_SUPPORT_R_MPU_OPP BIT(1) #define AM62A7_SUPPORT_V_MPU_OPP BIT(2) +#define AM62L3_EFUSE_E_MPU_OPP 5 +#define AM62L3_EFUSE_O_MPU_OPP 15 + +#define AM62L3_SUPPORT_E_MPU_OPP BIT(0) +#define AM62L3_SUPPORT_O_MPU_OPP BIT(1) + #define AM62P5_EFUSE_O_MPU_OPP 15 #define AM62P5_EFUSE_S_MPU_OPP 19 #define AM62P5_EFUSE_T_MPU_OPP 20 @@ -213,6 +219,22 @@ static unsigned long am625_efuse_xlate(struct ti_cpufreq_data *opp_data, return calculated_efuse; } +static unsigned long am62l3_efuse_xlate(struct ti_cpufreq_data *opp_data, + unsigned long efuse) +{ + unsigned long calculated_efuse = AM62L3_SUPPORT_E_MPU_OPP; + + switch (efuse) { + case AM62L3_EFUSE_O_MPU_OPP: + calculated_efuse |= AM62L3_SUPPORT_O_MPU_OPP; + fallthrough; + case AM62L3_EFUSE_E_MPU_OPP: + calculated_efuse |= AM62L3_SUPPORT_E_MPU_OPP; + } + + return calculated_efuse; +} + static struct ti_cpufreq_soc_data am3x_soc_data = { .efuse_xlate = amx3_efuse_xlate, .efuse_fallback = AM33XX_800M_ARM_MPU_MAX_FREQ, @@ -313,8 +335,9 @@ static struct ti_cpufreq_soc_data am3517_soc_data = { static const struct soc_device_attribute k3_cpufreq_soc[] = { { .family = "AM62X", }, { .family = "AM62AX", }, - { .family = "AM62PX", }, { .family = "AM62DX", }, + { .family = "AM62LX", }, + { .family = "AM62PX", }, { /* sentinel */ } }; @@ -335,6 +358,14 @@ static struct 
ti_cpufreq_soc_data am62a7_soc_data = { .multi_regulator = false, }; +static struct ti_cpufreq_soc_data am62l3_soc_data = { + .efuse_xlate = am62l3_efuse_xlate, + .efuse_offset = 0x0, + .efuse_mask = 0x07c0, + .efuse_shift = 0x6, + .multi_regulator = false, +}; + static struct ti_cpufreq_soc_data am62p5_soc_data = { .efuse_xlate = am62p5_efuse_xlate, .efuse_offset = 0x0, @@ -463,6 +494,7 @@ static const struct of_device_id ti_cpufreq_of_match[] __maybe_unused = { { .compatible = "ti,am625", .data = &am625_soc_data, }, { .compatible = "ti,am62a7", .data = &am62a7_soc_data, }, { .compatible = "ti,am62d2", .data = &am62a7_soc_data, }, + { .compatible = "ti,am62l3", .data = &am62l3_soc_data, }, { .compatible = "ti,am62p5", .data = &am62p5_soc_data, }, /* legacy */ { .compatible = "ti,omap3430", .data = &omap34xx_soc_data, }, From 0b7fbf9333fa4699a53145bad8ce74ea986caa13 Mon Sep 17 00:00:00 2001 From: Felix Gu Date: Wed, 21 Jan 2026 23:32:06 +0800 Subject: [PATCH 52/65] cpufreq: scmi: Fix device_node reference leak in scmi_cpu_domain_id() When calling of_parse_phandle_with_args(), the caller is responsible for calling of_node_put() to release the reference on the device node. scmi_cpu_domain_id() does not release that reference.
Fixes: e336baa4193e ("cpufreq: scmi: Prepare to move OF parsing of domain-id to cpufreq") Signed-off-by: Felix Gu Signed-off-by: Viresh Kumar --- drivers/cpufreq/scmi-cpufreq.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/cpufreq/scmi-cpufreq.c b/drivers/cpufreq/scmi-cpufreq.c index e0e1756180b0..c7a3b038385b 100644 --- a/drivers/cpufreq/scmi-cpufreq.c +++ b/drivers/cpufreq/scmi-cpufreq.c @@ -101,6 +101,7 @@ static int scmi_cpu_domain_id(struct device *cpu_dev) return -EINVAL; } + of_node_put(domain_id.np); return domain_id.args[0]; } From 4a1cf5ed51b1b6049d7771d2e77789b99dafc8ae Mon Sep 17 00:00:00 2001 From: Sumit Gupta Date: Tue, 20 Jan 2026 20:26:15 +0530 Subject: [PATCH 53/65] cpufreq: CPPC: Add generic helpers for sysfs show/store Add generic helper functions for u64 sysfs attributes that follow the common pattern of calling CPPC get/set APIs: - cppc_cpufreq_sysfs_show_u64(): reads value and handles -EOPNOTSUPP - cppc_cpufreq_sysfs_store_u64(): parses input and calls set function Add CPPC_CPUFREQ_ATTR_RW_U64() macro to generate show/store functions using these helpers, reducing boilerplate for simple attributes. Convert auto_act_window and energy_performance_preference_val to use the new macro. No functional changes. Signed-off-by: Sumit Gupta Reviewed-by: Lifeng Zheng [ rjw: Retained empty code line after a conditional ] Link: https://patch.msgid.link/20260120145623.2959636-2-sumitg@nvidia.com Signed-off-by: Rafael J. 
Wysocki --- drivers/cpufreq/cppc_cpufreq.c | 72 +++++++++++++--------------------- 1 file changed, 27 insertions(+), 45 deletions(-) diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c index 36e8a75a37f1..7e8042efedd1 100644 --- a/drivers/cpufreq/cppc_cpufreq.c +++ b/drivers/cpufreq/cppc_cpufreq.c @@ -863,14 +863,13 @@ static ssize_t store_auto_select(struct cpufreq_policy *policy, return count; } -static ssize_t show_auto_act_window(struct cpufreq_policy *policy, char *buf) +static ssize_t cppc_cpufreq_sysfs_show_u64(unsigned int cpu, + int (*get_func)(int, u64 *), + char *buf) { u64 val; - int ret; + int ret = get_func((int)cpu, &val); - ret = cppc_get_auto_act_window(policy->cpu, &val); - - /* show "" when this register is not supported by cpc */ if (ret == -EOPNOTSUPP) return sysfs_emit(buf, "\n"); @@ -880,42 +879,9 @@ static ssize_t show_auto_act_window(struct cpufreq_policy *policy, char *buf) return sysfs_emit(buf, "%llu\n", val); } -static ssize_t store_auto_act_window(struct cpufreq_policy *policy, - const char *buf, size_t count) -{ - u64 usec; - int ret; - - ret = kstrtou64(buf, 0, &usec); - if (ret) - return ret; - - ret = cppc_set_auto_act_window(policy->cpu, usec); - if (ret) - return ret; - - return count; -} - -static ssize_t show_energy_performance_preference_val(struct cpufreq_policy *policy, char *buf) -{ - u64 val; - int ret; - - ret = cppc_get_epp_perf(policy->cpu, &val); - - /* show "" when this register is not supported by cpc */ - if (ret == -EOPNOTSUPP) - return sysfs_emit(buf, "\n"); - - if (ret) - return ret; - - return sysfs_emit(buf, "%llu\n", val); -} - -static ssize_t store_energy_performance_preference_val(struct cpufreq_policy *policy, - const char *buf, size_t count) +static ssize_t cppc_cpufreq_sysfs_store_u64(unsigned int cpu, + int (*set_func)(int, u64), + const char *buf, size_t count) { u64 val; int ret; @@ -924,13 +890,29 @@ static ssize_t store_energy_performance_preference_val(struct cpufreq_policy 
*po if (ret) return ret; - ret = cppc_set_epp(policy->cpu, val); - if (ret) - return ret; + ret = set_func((int)cpu, val); - return count; + return ret ? ret : count; } +#define CPPC_CPUFREQ_ATTR_RW_U64(_name, _get_func, _set_func) \ +static ssize_t show_##_name(struct cpufreq_policy *policy, char *buf) \ +{ \ + return cppc_cpufreq_sysfs_show_u64(policy->cpu, _get_func, buf);\ +} \ +static ssize_t store_##_name(struct cpufreq_policy *policy, \ + const char *buf, size_t count) \ +{ \ + return cppc_cpufreq_sysfs_store_u64(policy->cpu, _set_func, \ + buf, count); \ +} + +CPPC_CPUFREQ_ATTR_RW_U64(auto_act_window, cppc_get_auto_act_window, + cppc_set_auto_act_window) + +CPPC_CPUFREQ_ATTR_RW_U64(energy_performance_preference_val, + cppc_get_epp_perf, cppc_set_epp) + cpufreq_freq_attr_ro(freqdomain_cpus); cpufreq_freq_attr_rw(auto_select); cpufreq_freq_attr_rw(auto_act_window); From 1081c1649da989ef9cbc01ffa99babc190df6077 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Mon, 26 Jan 2026 21:03:57 +0100 Subject: [PATCH 54/65] PM: hibernate: Drop NULL pointer checks before acomp_request_free() Since acomp_request_free() checks its argument against NULL, the NULL pointer checks before calling it added by commit 7966cf0ebe32 ("PM: hibernate: Fix crash when freeing invalid crypto compressor") are redundant, so drop them. No intentional functional impact. Signed-off-by: Rafael J.
Wysocki Link: https://patch.msgid.link/6233709.lOV4Wx5bFT@rafael.j.wysocki --- kernel/power/swap.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/kernel/power/swap.c b/kernel/power/swap.c index 8050e5182835..7e462957c9bf 100644 --- a/kernel/power/swap.c +++ b/kernel/power/swap.c @@ -902,8 +902,8 @@ static int save_compressed_image(struct swap_map_handle *handle, for (thr = 0; thr < nr_threads; thr++) { if (data[thr].thr) kthread_stop(data[thr].thr); - if (data[thr].cr) - acomp_request_free(data[thr].cr); + + acomp_request_free(data[thr].cr); if (!IS_ERR_OR_NULL(data[thr].cc)) crypto_free_acomp(data[thr].cc); @@ -1502,8 +1502,8 @@ static int load_compressed_image(struct swap_map_handle *handle, for (thr = 0; thr < nr_threads; thr++) { if (data[thr].thr) kthread_stop(data[thr].thr); - if (data[thr].cr) - acomp_request_free(data[thr].cr); + + acomp_request_free(data[thr].cr); if (!IS_ERR_OR_NULL(data[thr].cc)) crypto_free_acomp(data[thr].cc); From cc764d3bbd545d7d6f5f66ac678ffc522d75f0f9 Mon Sep 17 00:00:00 2001 From: Pengjie Zhang Date: Fri, 16 Jan 2026 17:46:23 +0800 Subject: [PATCH 55/65] cpufreq: userspace: make scaling_setspeed return the actual requested frequency According to the Linux kernel ABI documentation for 'scaling_setspeed': "It returns the last frequency requested by the governor (in kHz) or can be written to in order to set a new frequency for the policy." However, the current implementation of show_speed() returns 'policy->cur'. 'policy->cur' represents the frequency after the driver has resolved the request against the hardware frequency table and applied policy limits (min/max). This creates a discrepancy between the documentation/user expectation and the actual code behavior. For instance: 1. User writes a value to 'scaling_setspeed' that is not in the OPP table (e.g., user asks for A, driver rounds it to B). 2. User reads 'scaling_setspeed'. 3. Code returns B ('policy->cur'). 4. 
User expects A (the "frequency requested"), but gets B. This patch changes show_speed() to return 'userspace->setspeed', which stores the actual value last requested by the user. This restores the read/write symmetry of the attribute and aligns the code with the ABI description. The effective frequency can still be observed via 'scaling_cur_freq' or 'cpuinfo_cur_freq', preserving the distinction between "what was requested" (setspeed) and "what is effective" (cur_freq). Signed-off-by: Pengjie Zhang Acked-by: Viresh Kumar Acked-by: lihuisong@huawei.com Link: https://patch.msgid.link/20260116094623.2980031-1-zhangpengjie2@huawei.com Signed-off-by: Rafael J. Wysocki --- drivers/cpufreq/cpufreq_userspace.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/cpufreq/cpufreq_userspace.c b/drivers/cpufreq/cpufreq_userspace.c index 77d62152cd38..4bd62e6c5c51 100644 --- a/drivers/cpufreq/cpufreq_userspace.c +++ b/drivers/cpufreq/cpufreq_userspace.c @@ -49,7 +49,9 @@ static int cpufreq_set(struct cpufreq_policy *policy, unsigned int freq) static ssize_t show_speed(struct cpufreq_policy *policy, char *buf) { - return sprintf(buf, "%u\n", policy->cur); + struct userspace_policy *userspace = policy->governor_data; + + return sprintf(buf, "%u\n", userspace->setspeed); } static int cpufreq_userspace_policy_init(struct cpufreq_policy *policy) From a554a25e66efea0b78fb3d24f4f19289e037c0dc Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Wed, 28 Jan 2026 17:05:27 +0100 Subject: [PATCH 56/65] cpufreq: ondemand: Simplify idle cputime granularity test cpufreq calls get_cpu_idle_time_us() just to know if idle cputime accounting has nanosecond granularity. Use the appropriate indicator instead to make that deduction. Signed-off-by: Frederic Weisbecker Link: https://patch.msgid.link/aXozx0PXutnm8ECX@localhost.localdomain Signed-off-by: Rafael J.
Wysocki --- drivers/cpufreq/cpufreq_ondemand.c | 7 +------ include/linux/tick.h | 2 ++ kernel/time/hrtimer.c | 2 +- kernel/time/tick-internal.h | 2 -- kernel/time/tick-sched.c | 8 +++++++- kernel/time/timer.c | 2 +- 6 files changed, 12 insertions(+), 11 deletions(-) diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c index a6ecc203f7b7..bb7db82930e4 100644 --- a/drivers/cpufreq/cpufreq_ondemand.c +++ b/drivers/cpufreq/cpufreq_ondemand.c @@ -334,17 +334,12 @@ static void od_free(struct policy_dbs_info *policy_dbs) static int od_init(struct dbs_data *dbs_data) { struct od_dbs_tuners *tuners; - u64 idle_time; - int cpu; tuners = kzalloc(sizeof(*tuners), GFP_KERNEL); if (!tuners) return -ENOMEM; - cpu = get_cpu(); - idle_time = get_cpu_idle_time_us(cpu, NULL); - put_cpu(); - if (idle_time != -1ULL) { + if (tick_nohz_is_active()) { /* Idle micro accounting is supported. Use finer thresholds */ dbs_data->up_threshold = MICRO_FREQUENCY_UP_THRESHOLD; } else { diff --git a/include/linux/tick.h b/include/linux/tick.h index ac76ae9fa36d..738007d6f577 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -126,6 +126,7 @@ enum tick_dep_bits { #ifdef CONFIG_NO_HZ_COMMON extern bool tick_nohz_enabled; +extern bool tick_nohz_is_active(void); extern bool tick_nohz_tick_stopped(void); extern bool tick_nohz_tick_stopped_cpu(int cpu); extern void tick_nohz_idle_stop_tick(void); @@ -142,6 +143,7 @@ extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time); extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time); #else /* !CONFIG_NO_HZ_COMMON */ #define tick_nohz_enabled (0) +static inline bool tick_nohz_is_active(void) { return false; } static inline int tick_nohz_tick_stopped(void) { return 0; } static inline int tick_nohz_tick_stopped_cpu(int cpu) { return 0; } static inline void tick_nohz_idle_stop_tick(void) { } diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c index 0e4bc1ca15ff..1caf02a72ba8 100644 --- 
a/kernel/time/hrtimer.c +++ b/kernel/time/hrtimer.c @@ -943,7 +943,7 @@ void clock_was_set(unsigned int bases) cpumask_var_t mask; int cpu; - if (!hrtimer_hres_active(cpu_base) && !tick_nohz_active) + if (!hrtimer_hres_active(cpu_base) && !tick_nohz_is_active()) goto out_timerfd; if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) { diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h index 4e4f7bbe2a64..597d816d22e8 100644 --- a/kernel/time/tick-internal.h +++ b/kernel/time/tick-internal.h @@ -156,7 +156,6 @@ static inline void tick_nohz_init(void) { } #endif #ifdef CONFIG_NO_HZ_COMMON -extern unsigned long tick_nohz_active; extern void timers_update_nohz(void); extern u64 get_jiffies_update(unsigned long *basej); # ifdef CONFIG_SMP @@ -171,7 +170,6 @@ extern void timer_expire_remote(unsigned int cpu); # endif #else /* CONFIG_NO_HZ_COMMON */ static inline void timers_update_nohz(void) { } -#define tick_nohz_active (0) #endif DECLARE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases); diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 2f8a7923fa27..72e39c793117 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -693,7 +693,7 @@ void __init tick_nohz_init(void) * NO HZ enabled ? 
*/ bool tick_nohz_enabled __read_mostly = true; -unsigned long tick_nohz_active __read_mostly; +static unsigned long tick_nohz_active __read_mostly; /* * Enable / Disable tickless mode */ @@ -704,6 +704,12 @@ static int __init setup_tick_nohz(char *str) __setup("nohz=", setup_tick_nohz); +bool tick_nohz_is_active(void) +{ + return tick_nohz_active; +} +EXPORT_SYMBOL_GPL(tick_nohz_is_active); + bool tick_nohz_tick_stopped(void) { struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched); diff --git a/kernel/time/timer.c b/kernel/time/timer.c index 1f2364126894..7e1e3bde6b8b 100644 --- a/kernel/time/timer.c +++ b/kernel/time/timer.c @@ -281,7 +281,7 @@ DEFINE_STATIC_KEY_FALSE(timers_migration_enabled); static void timers_update_migration(void) { - if (sysctl_timer_migration && tick_nohz_active) + if (sysctl_timer_migration && tick_nohz_is_active()) static_branch_enable(&timers_migration_enabled); else static_branch_disable(&timers_migration_enabled); From f36de72673ad80c9931c0b411df0d6ef184f6c22 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Thu, 29 Jan 2026 21:49:12 +0100 Subject: [PATCH 57/65] cpuidle: governors: teo: Adjust the classification of wakeup events If differences between target residency values of adjacent idle states of a given CPU are relatively large, the corresponding idle state bins used by the teo governor are large too and the rule by which hits are distinguished from intercepts is inaccurate. Namely, by that rule, a wakeup event is classified as a hit if the sleep length (the time till the closest timer other than the tick) and the measured idle duration, adjusted for the entered idle state exit latency, fall into the same idle state bin. However, if that bin is large enough, the actual difference between the sleep length and the measured idle duration may be significant. It may in fact be significantly greater than the analogous difference for an event where the sleep length and the measured idle duration fall into different bins.
For this reason, amend the rule in question with a check that will only allow a wakeup event to be counted as a hit if the sleep length is less than the "raw" measured idle duration (which means that the wakeup appears to have occurred after the anticipated timer event). Otherwise, the event will be counted as an intercept. Also update the documentation part explaining the difference between "hits" and "intercepts" to take the above change into account. Signed-off-by: Rafael J. Wysocki Reviewed-by: Christian Loehle Link: https://patch.msgid.link/5093379.31r3eYUQgx@rafael.j.wysocki --- drivers/cpuidle/governors/teo.c | 25 ++++++++++++++----------- 1 file changed, 14 insertions(+), 11 deletions(-) diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c index 750ab0678a77..34b769b37a86 100644 --- a/drivers/cpuidle/governors/teo.c +++ b/drivers/cpuidle/governors/teo.c @@ -48,12 +48,11 @@ * in accordance with what happened last time. * * The "hits" metric reflects the relative frequency of situations in which the - * sleep length and the idle duration measured after CPU wakeup fall into the - * same bin (that is, the CPU appears to wake up "on time" relative to the sleep - * length). In turn, the "intercepts" metric reflects the relative frequency of - * non-timer wakeup events for which the measured idle duration falls into a bin - * that corresponds to an idle state shallower than the one whose bin is fallen - * into by the sleep length (these events are also referred to as "intercepts" + * sleep length and the idle duration measured after CPU wakeup are close enough + * (that is, the CPU appears to wake up "on time" relative to the sleep length). + * In turn, the "intercepts" metric reflects the relative frequency of non-timer + * wakeup events for which the measured idle duration is significantly different + * from the sleep length (these events are also referred to as "intercepts" * below). 
* * The governor also counts "intercepts" with the measured idle duration below @@ -167,6 +166,7 @@ static void teo_decay(unsigned int *metric) */ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev) { + s64 lat_ns = drv->states[dev->last_state_idx].exit_latency_ns; struct teo_cpu *cpu_data = this_cpu_ptr(&teo_cpus); int i, idx_timer = 0, idx_duration = 0; s64 target_residency_ns, measured_ns; @@ -182,8 +182,6 @@ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev) */ measured_ns = S64_MAX; } else { - s64 lat_ns = drv->states[dev->last_state_idx].exit_latency_ns; - measured_ns = dev->last_residency_ns; /* * The delay between the wakeup and the first instruction @@ -253,12 +251,17 @@ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev) } /* - * If the measured idle duration falls into the same bin as the sleep - * length, this is a "hit", so update the "hits" metric for that bin. + * If the measured idle duration (adjusted for the entered state exit + * latency) falls into the same bin as the sleep length and the latter + * is less than the "raw" measured idle duration (so the wakeup appears + * to have occurred after the anticipated timer event), this is a "hit", + * so update the "hits" metric for that bin. + * * Otherwise, update the "intercepts" metric for the bin fallen into by * the measured idle duration. */ - if (idx_timer == idx_duration) { + if (idx_timer == idx_duration && + cpu_data->sleep_length_ns - measured_ns < lat_ns / 2) { cpu_data->state_bins[idx_timer].hits += PULSE; } else { cpu_data->state_bins[idx_duration].intercepts += PULSE; From a971f984b8455db0ef23910442029cdad53bc459 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Thu, 29 Jan 2026 21:51:11 +0100 Subject: [PATCH 58/65] cpuidle: governors: teo: Refine intercepts-based idle state lookup There are cases in which decisions made by the teo governor are arguably overly conservative. 
For instance, suppose that there are 4 idle states and the values of the intercepts metric for the first 3 of them are 400, 250, and 251, respectively. If the total sum computed in teo_update() is 1000, the governor will select idle state 1 (provided that all idle states are enabled and the scheduler tick has not been stopped) although arguably idle state 0 would be a better choice because the likelihood of getting an idle duration below the target residency of idle state 1 is greater than the likelihood of getting an idle duration between the target residency of idle state 1 and the target residency of idle state 2. To address this, refine the candidate idle state lookup based on intercepts to start at the state with the maximum intercepts metric, below the deepest enabled one, to avoid the cases in which the search may stop before reaching that state. Signed-off-by: Rafael J. Wysocki Reviewed-by: Christian Loehle [ rjw: Fixed typo "intercetps" in new comments (3 places) ] Link: https://patch.msgid.link/2417298.ElGaqSPkdT@rafael.j.wysocki Signed-off-by: Rafael J. Wysocki --- drivers/cpuidle/governors/teo.c | 50 ++++++++++++++++++++++++++++----- 1 file changed, 43 insertions(+), 7 deletions(-) diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c index 34b769b37a86..80f3ba942a06 100644 --- a/drivers/cpuidle/governors/teo.c +++ b/drivers/cpuidle/governors/teo.c @@ -74,12 +74,17 @@ * than the candidate one (it represents the cases in which the CPU was * likely woken up by a non-timer wakeup source). * + * Also find the idle state with the maximum intercepts metric (if there are + * multiple states with the maximum intercepts metric, choose the one with + * the highest index). + * * 2. If the second sum computed in step 1 is greater than a half of the sum of * both metrics for the candidate state bin and all subsequent bins (if any), * a shallower idle state is likely to be more suitable, so look for it. 
* * - Traverse the enabled idle states shallower than the candidate one in the - * descending order. + * descending order, starting at the state with the maximum intercepts + * metric found in step 1. * * - For each of them compute the sum of the "intercepts" metrics over all * of the idle states between it and the candidate one (including the @@ -308,8 +313,10 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev, ktime_t delta_tick = TICK_NSEC / 2; unsigned int idx_intercept_sum = 0; unsigned int intercept_sum = 0; + unsigned int intercept_max = 0; unsigned int idx_hit_sum = 0; unsigned int hit_sum = 0; + int intercept_max_idx = -1; int constraint_idx = 0; int idx0 = 0, idx = -1; s64 duration_ns; @@ -340,17 +347,32 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev, if (!dev->states_usage[0].disable) idx = 0; - /* Compute the sums of metrics for early wakeup pattern detection. */ + /* + * Compute the sums of metrics for early wakeup pattern detection and + * look for the state bin with the maximum intercepts metric below the + * deepest enabled one (if there are multiple states with the maximum + * intercepts metric, choose the one with the highest index). + */ for (i = 1; i < drv->state_count; i++) { struct teo_bin *prev_bin = &cpu_data->state_bins[i-1]; + unsigned int prev_intercepts = prev_bin->intercepts; struct cpuidle_state *s = &drv->states[i]; /* * Update the sums of idle state metrics for all of the states * shallower than the current one. */ - intercept_sum += prev_bin->intercepts; hit_sum += prev_bin->hits; + intercept_sum += prev_intercepts; + /* + * Check if this is the bin with the maximum number of + * intercepts so far and in that case update the index of + * the state with the maximum intercepts metric. 
+ */ + if (prev_intercepts >= intercept_max) { + intercept_max = prev_intercepts; + intercept_max_idx = i - 1; + } if (dev->states_usage[i].disable) continue; @@ -414,9 +436,22 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev, } /* - * Look for the deepest idle state whose target residency had - * not exceeded the idle duration in over a half of the relevant - * cases in the past. + * If the minimum state index is greater than or equal to the + * index of the state with the maximum intercepts metric and + * the corresponding state is enabled, there is no need to look + * at the deeper states. + */ + if (min_idx >= intercept_max_idx && + !dev->states_usage[min_idx].disable) { + idx = min_idx; + goto constraint; + } + + /* + * Look for the deepest enabled idle state, at most as deep as + * the one with the maximum intercepts metric, whose target + * residency had not been greater than the idle duration in over + * a half of the relevant cases in the past. * * Take the possible duration limitation present if the tick * has been stopped already into account. @@ -428,7 +463,8 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev, continue; idx = i; - if (2 * intercept_sum > idx_intercept_sum) + if (2 * intercept_sum > idx_intercept_sum && + i <= intercept_max_idx) break; } } From e79eec6ca1f5a3dbd804b73fd313b3fe455df4f3 Mon Sep 17 00:00:00 2001 From: Patrick Little Date: Wed, 28 Jan 2026 16:33:11 -0600 Subject: [PATCH 59/65] Documentation: Fix typos in energy model documentation Fix typos in documentation related to energy model management. Signed-off-by: Patrick Little Acked-by: Randy Dunlap [ rjw: Subject and changelog edits ] Link: https://patch.msgid.link/20260128-documentation-fix-grammar-v1-1-39238dc471f9@gmail.com Signed-off-by: Rafael J. 
Wysocki --- Documentation/power/energy-model.rst | 14 +++++++------- Documentation/scheduler/sched-energy.rst | 8 ++++---- 2 files changed, 11 insertions(+), 11 deletions(-) diff --git a/Documentation/power/energy-model.rst b/Documentation/power/energy-model.rst index cbdf7520aaa6..65133187f2ad 100644 --- a/Documentation/power/energy-model.rst +++ b/Documentation/power/energy-model.rst @@ -14,8 +14,8 @@ subsystems willing to use that information to make energy-aware decisions. The source of the information about the power consumed by devices can vary greatly from one platform to another. These power costs can be estimated using devicetree data in some cases. In others, the firmware will know better. -Alternatively, userspace might be best positioned. And so on. In order to avoid -each and every client subsystem to re-implement support for each and every +Alternatively, userspace might be best positioned. In order to avoid +having each and every client subsystem re-implement support for each and every possible source of information on its own, the EM framework intervenes as an abstraction layer which standardizes the format of power cost tables in the kernel, hence enabling to avoid redundant work. @@ -32,7 +32,7 @@ be found in the Intelligent Power Allocation in Documentation/driver-api/thermal/power_allocator.rst. Kernel subsystems might implement automatic detection to check whether EM registered devices have inconsistent scale (based on EM internal flag). -Important thing to keep in mind is that when the power values are expressed in +An important thing to keep in mind is that when the power values are expressed in an 'abstract scale' deriving real energy in micro-Joules would not be possible. The figure below depicts an example of drivers (Arm-specific here, but the @@ -82,7 +82,7 @@ using kref mechanism. The device driver which provided the new EM at runtime, should call EM API to free it safely when it's no longer needed. 
The EM framework will handle the clean-up when it's possible. -The kernel code which want to modify the EM values is protected from concurrent +The kernel code which wants to modify the EM values is protected from concurrent access using a mutex. Therefore, the device driver code must run in sleeping context when it tries to modify the EM. @@ -113,7 +113,7 @@ Registration of 'advanced' EM ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The 'advanced' EM gets its name due to the fact that the driver is allowed -to provide more precised power model. It's not limited to some implemented math +to provide a more precise power model. It's not limited to some implemented math formula in the framework (like it is in 'simple' EM case). It can better reflect the real power measurements performed for each performance state. Thus, this registration method should be preferred in case considering EM static power @@ -172,7 +172,7 @@ Registration of 'simple' EM ~~~~~~~~~~~~~~~~~~~~~~~~~~~ The 'simple' EM is registered using the framework helper function -cpufreq_register_em_with_opp(). It implements a power model which is tight to +cpufreq_register_em_with_opp(). It implements a power model which is tied to a math formula:: Power = C * V^2 * f @@ -251,7 +251,7 @@ It returns the 'struct em_perf_state' pointer which is an array of performance states in ascending order. This function must be called in the RCU read lock section (after the rcu_read_lock()). When the EM table is not needed anymore there is a need to -call rcu_real_unlock(). In this way the EM safely uses the RCU read section +call rcu_read_unlock(). In this way the EM safely uses the RCU read section and protects the users. It also allows the EM framework to manage the memory and free it. More details how to use it can be found in Section 3.2 in the example driver. 
diff --git a/Documentation/scheduler/sched-energy.rst b/Documentation/scheduler/sched-energy.rst index 70e2921ef725..4e47aaf103eb 100644 --- a/Documentation/scheduler/sched-energy.rst +++ b/Documentation/scheduler/sched-energy.rst @@ -244,7 +244,7 @@ Example 2. From these calculations, the Case 1 has the lowest total energy. So CPU 1 - is be the best candidate from an energy-efficiency standpoint. + is the best candidate from an energy-efficiency standpoint. Big CPUs are generally more power hungry than the little ones and are thus used mainly when a task doesn't fit the littles. However, little CPUs aren't always @@ -252,7 +252,7 @@ necessarily more energy-efficient than big CPUs. For some systems, the high OPPs of the little CPUs can be less energy-efficient than the lowest OPPs of the bigs, for example. So, if the little CPUs happen to have enough utilization at a specific point in time, a small task waking up at that moment could be better -of executing on the big side in order to save energy, even though it would fit +off executing on the big side in order to save energy, even though it would fit on the little side. And even in the case where all OPPs of the big CPUs are less energy-efficient @@ -285,7 +285,7 @@ much that can be done by the scheduler to save energy without severely harming throughput. In order to avoid hurting performance with EAS, CPUs are flagged as 'over-utilized' as soon as they are used at more than 80% of their compute capacity. As long as no CPUs are over-utilized in a root domain, load balancing -is disabled and EAS overridess the wake-up balancing code. EAS is likely to load +is disabled and EAS overrides the wake-up balancing code. EAS is likely to load the most energy efficient CPUs of the system more than the others if that can be done without harming throughput. So, the load-balancer is disabled to prevent it from breaking the energy-efficient task placement found by EAS. 
It is safe to @@ -385,7 +385,7 @@ Using EAS with any other governor than schedutil is not supported. 6.5 Scale-invariant utilization signals ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -In order to make accurate prediction across CPUs and for all performance +In order to make accurate predictions across CPUs and for all performance states, EAS needs frequency-invariant and CPU-invariant PELT signals. These can be obtained using the architecture-defined arch_scale{cpu,freq}_capacity() callbacks. From 1c7442d10b031ace1b7f4902af48bdca465ca25f Mon Sep 17 00:00:00 2001 From: Patrick Little Date: Wed, 28 Jan 2026 16:33:12 -0600 Subject: [PATCH 60/65] PM: EM: Documentation: Fix bug in example code snippet A semicolon was mistakenly placed at the end of two 'if' statements. If the example is copied as-is, each subsequent return is executed unconditionally, which is incorrect, and the rest of the function is never reached. Signed-off-by: Patrick Little Acked-by: Randy Dunlap [ rjw: Subject adjustment ] Link: https://patch.msgid.link/20260128-documentation-fix-grammar-v1-2-39238dc471f9@gmail.com Signed-off-by: Rafael J. Wysocki --- Documentation/power/energy-model.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Documentation/power/energy-model.rst b/Documentation/power/energy-model.rst index 65133187f2ad..0d4644d72767 100644 --- a/Documentation/power/energy-model.rst +++ b/Documentation/power/energy-model.rst @@ -308,12 +308,12 @@ EM framework:: 05 06 /* Use the 'foo' protocol to ceil the frequency */ 07 freq = foo_get_freq_ceil(dev, *KHz); - 08 if (freq < 0); + 08 if (freq < 0) 09 return freq; 10 11 /* Estimate the power cost for the dev at the relevant freq.
*/ 12 power = foo_estimate_power(dev, freq); - 13 if (power < 0); + 13 if (power < 0) 14 return power; 15 16 /* Return the values to the EM framework */ From 75ce02f4bc9a8b8350b6b1b01872467b0cc960cc Mon Sep 17 00:00:00 2001 From: Samuel Wu Date: Fri, 23 Jan 2026 17:21:29 -0800 Subject: [PATCH 61/65] PM: wakeup: Handle empty list in wakeup_sources_walk_start() In the case of an empty wakeup_sources list, wakeup_sources_walk_start() will return an invalid but non-NULL address. This also affects wrappers of the aforementioned function, like for_each_wakeup_source(). Update wakeup_sources_walk_start() to return NULL in case of an empty list. Fixes: b4941adb24c0 ("PM: wakeup: Add routine to help fetch wakeup source object.") Signed-off-by: Samuel Wu [ rjw: Subject and changelog edits ] Link: https://patch.msgid.link/20260124012133.2451708-2-wusamuel@google.com Signed-off-by: Rafael J. Wysocki --- drivers/base/power/wakeup.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/drivers/base/power/wakeup.c b/drivers/base/power/wakeup.c index 1e1a0e7eeac5..e69033d16fba 100644 --- a/drivers/base/power/wakeup.c +++ b/drivers/base/power/wakeup.c @@ -275,9 +275,7 @@ EXPORT_SYMBOL_GPL(wakeup_sources_read_unlock); */ struct wakeup_source *wakeup_sources_walk_start(void) { - struct list_head *ws_head = &wakeup_sources; - - return list_entry_rcu(ws_head->next, struct wakeup_source, entry); + return list_first_or_null_rcu(&wakeup_sources, struct wakeup_source, entry); } EXPORT_SYMBOL_GPL(wakeup_sources_walk_start); From 1fedbb589448bee9f20bb2ed9c850d1d2cf9963c Mon Sep 17 00:00:00 2001 From: Yaxiong Tian Date: Tue, 3 Feb 2026 10:48:52 +0800 Subject: [PATCH 62/65] cpufreq: intel_pstate: Enable asym capacity only when CPU SMT is not possible According to the description in the intel_pstate.rst documentation, Capacity-Aware Scheduling and Energy-Aware Scheduling are only supported on a hybrid processor without SMT. 
Previously, the driver used sched_smt_active() for this check, which is not a reliable condition because users can switch SMT on or off via /sys at any time. This could lead to incorrect driver settings in certain scenarios. For example, on a CPU that supports SMT, a user can disable SMT via the nosmt parameter to enable asym capacity, and then re-enable SMT via /sys. In such cases, some settings in the driver would no longer be correct. To address this issue, replace sched_smt_active() with cpu_smt_possible(), and only enable asym capacity when CPU SMT is not possible. Fixes: 929ebc93ccaa ("cpufreq: intel_pstate: Set asymmetric CPU capacity on hybrid systems") Signed-off-by: Yaxiong Tian [ rjw: Subject and changelog edits ] Link: https://patch.msgid.link/20260203024852.301066-1-tianyaxiong@kylinos.cn Signed-off-by: Rafael J. Wysocki --- drivers/cpufreq/intel_pstate.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c index ec4abe374573..1625ec2d0d06 100644 --- a/drivers/cpufreq/intel_pstate.c +++ b/drivers/cpufreq/intel_pstate.c @@ -1161,7 +1161,7 @@ static void hybrid_init_cpu_capacity_scaling(bool refresh) * the capacity of SMT threads is not deterministic even approximately, * do not do that when SMT is in use.
*/ - if (hwp_is_hybrid && !sched_smt_active() && arch_enable_hybrid_capacity_scale()) { + if (hwp_is_hybrid && !cpu_smt_possible() && arch_enable_hybrid_capacity_scale()) { hybrid_refresh_cpu_capacity_scaling(); /* * Disabling ITMT causes sched domains to be rebuilt to disable asym From 3bd1cde3dffbb29764453201e19c17053557a520 Mon Sep 17 00:00:00 2001 From: Yaxiong Tian Date: Tue, 3 Feb 2026 17:35:01 +0800 Subject: [PATCH 63/65] cpufreq: Documentation: Update description of rate_limit_us default value Commit 37c6dccd6837 ("cpufreq: Remove LATENCY_MULTIPLIER") changed the computation in cpufreq_policy_transition_delay_us(), so the original description of the 2 ms default has become inaccurate. Therefore, update the description of the default value for rate_limit_us from 2 ms to 1 ms. Signed-off-by: Yaxiong Tian [ rjw: Subject and changelog edits ] Link: https://patch.msgid.link/20260203093501.1138721-1-tianyaxiong@kylinos.cn Signed-off-by: Rafael J. Wysocki --- Documentation/admin-guide/pm/cpufreq.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst index 738d7b4dc33a..dbe6d23a5d67 100644 --- a/Documentation/admin-guide/pm/cpufreq.rst +++ b/Documentation/admin-guide/pm/cpufreq.rst @@ -439,7 +439,7 @@ This governor exposes only one tunable: ``rate_limit_us`` Minimum time (in microseconds) that has to pass between two consecutive runs of governor computations (default: 1.5 times the scaling driver's - transition latency or 1ms if the driver does not provide a latency value). The purpose of this tunable is to reduce the scheduler context overhead of the governor which might be excessive without it.
From 5c9ecd8e6437cd55a38ea4f1e1d19cee8e226cb8 Mon Sep 17 00:00:00 2001 From: Gui-Dong Han Date: Tue, 3 Feb 2026 11:19:43 +0800 Subject: [PATCH 64/65] PM: sleep: wakeirq: harden dev_pm_clear_wake_irq() against races dev_pm_clear_wake_irq() currently uses a dangerous pattern where dev->power.wakeirq is read and checked for NULL outside the lock. If two callers invoke this function concurrently, both might see a valid pointer and proceed. This could result in a double-free when the second caller acquires the lock and tries to release the same object. Address this by removing the lockless check of dev->power.wakeirq. Instead, acquire dev->power.lock immediately to ensure the check and the subsequent operations are atomic. If dev->power.wakeirq is NULL under the lock, simply unlock and return. This guarantees that concurrent calls cannot race to free the same object. Based on a quick scan of current users, I did not find an actual bug as drivers seem to rely on their own synchronization. However, since asynchronous usage patterns exist (e.g., in drivers/net/wireless/ti/wlcore), I believe a race is theoretically possible if the API is used less carefully in the future. This change hardens the API to be robust against such cases. Fixes: 4990d4fe327b ("PM / Wakeirq: Add automated device wake IRQ handling") Signed-off-by: Gui-Dong Han Link: https://patch.msgid.link/20260203031943.1924-1-hanguidong02@gmail.com Signed-off-by: Rafael J. 
Wysocki --- drivers/base/power/wakeirq.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/drivers/base/power/wakeirq.c b/drivers/base/power/wakeirq.c index 8aa28c08b289..c0809d18fc54 100644 --- a/drivers/base/power/wakeirq.c +++ b/drivers/base/power/wakeirq.c @@ -83,13 +83,16 @@ EXPORT_SYMBOL_GPL(dev_pm_set_wake_irq); */ void dev_pm_clear_wake_irq(struct device *dev) { - struct wake_irq *wirq = dev->power.wakeirq; + struct wake_irq *wirq; unsigned long flags; - if (!wirq) - return; - spin_lock_irqsave(&dev->power.lock, flags); + wirq = dev->power.wakeirq; + if (!wirq) { + spin_unlock_irqrestore(&dev->power.lock, flags); + return; + } + device_wakeup_detach_irq(dev); dev->power.wakeirq = NULL; spin_unlock_irqrestore(&dev->power.lock, flags); From 0491f3f9f664e7e0131eb4d2a8b19c49562e5c64 Mon Sep 17 00:00:00 2001 From: Xuewen Yan Date: Wed, 4 Feb 2026 13:25:09 +0100 Subject: [PATCH 65/65] PM: sleep: core: Avoid bit field races related to work_in_progress In all of the system suspend transition phases, the async processing of a device may be carried out in parallel with power.work_in_progress updates for the device's parent or suppliers and if it touches bit fields from the same group (for example, power.must_resume or power.wakeup_path), bit field corruption is possible. To avoid that, turn work_in_progress in struct dev_pm_info into a proper bool field and relocate it to save space. Fixes: aa7a9275ab81 ("PM: sleep: Suspend async parents after suspending children") Fixes: 443046d1ad66 ("PM: sleep: Make suspend of devices more asynchronous") Signed-off-by: Xuewen Yan Closes: https://lore.kernel.org/linux-pm/20260203063459.12808-1-xuewen.yan@unisoc.com/ Cc: All applicable [ rjw: Added subject and changelog ] Link: https://patch.msgid.link/CAB8ipk_VX2VPm706Jwa1=8NSA7_btWL2ieXmBgHr2JcULEP76g@mail.gmail.com Signed-off-by: Rafael J. 
Wysocki --- include/linux/pm.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/pm.h b/include/linux/pm.h index 98a899858ece..afcaaa37a812 100644 --- a/include/linux/pm.h +++ b/include/linux/pm.h @@ -681,10 +681,10 @@ struct dev_pm_info { struct list_head entry; struct completion completion; struct wakeup_source *wakeup; + bool work_in_progress; /* Owned by the PM core */ bool wakeup_path:1; bool syscore:1; bool no_pm_callbacks:1; /* Owned by the PM core */ - bool work_in_progress:1; /* Owned by the PM core */ bool smart_suspend:1; /* Owned by the PM core */ bool must_resume:1; /* Owned by the PM core */ bool may_skip_resume:1; /* Set by subsystems */