From 7802fce7dc18394d041a1310fe4ad76120e08145 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Mon, 27 Jan 2025 14:07:12 +0100 Subject: [PATCH 1/3] cpufreq: intel_pstate: Make it possible to avoid enabling CAS Capacity-aware scheduling (CAS) is enabled by default by intel_pstate on hybrid systems without SMT, but in some usage scenarios it may be more attractive to place tasks for maximum CPU performance regardless of the extra cost in terms of energy, which is the case on such systems when CAS is not enabled, so introduce a command line option to forbid intel_pstate to enable CAS. Signed-off-by: Rafael J. Wysocki Acked-by:Srinivas Pandruvada Link: https://patch.msgid.link/2781262.mvXUDI8C0e@rjwysocki.net --- Documentation/admin-guide/kernel-parameters.txt | 3 +++ Documentation/admin-guide/pm/intel_pstate.rst | 3 +++ drivers/cpufreq/intel_pstate.c | 9 +++++++++ 3 files changed, 15 insertions(+) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index fb8752b42ec8..77e671e55993 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -2316,6 +2316,9 @@ per_cpu_perf_limits Allow per-logical-CPU P-State performance control limits using cpufreq sysfs interface + no_cas + Do not enable capacity-aware scheduling (CAS) on + hybrid systems intremap= [X86-64,Intel-IOMMU,EARLY] on enable Interrupt Remapping (default) diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst index bf13ad25a32f..78fc83ed2a7e 100644 --- a/Documentation/admin-guide/pm/intel_pstate.rst +++ b/Documentation/admin-guide/pm/intel_pstate.rst @@ -696,6 +696,9 @@ of them have to be prepended with the ``intel_pstate=`` prefix. Use per-logical-CPU P-State limits (see `Coordination of P-state Limits`_ for details). +``no_cas`` + Do not enable capacity-aware scheduling (CAS) which is enabled by + default on hybrid systems. Diagnostics and Tuning ====================== diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c index 9c4cc01fd51a..bc31e9b9b660 100644 --- a/drivers/cpufreq/intel_pstate.c +++ b/drivers/cpufreq/intel_pstate.c @@ -936,6 +936,8 @@ static struct freq_attr *hwp_cpufreq_attrs[] = { NULL, }; +static bool no_cas __ro_after_init; + static struct cpudata *hybrid_max_perf_cpu __read_mostly; /* * Protects hybrid_max_perf_cpu, the capacity_perf fields in struct cpudata, @@ -1041,6 +1043,10 @@ static void hybrid_refresh_cpu_capacity_scaling(void) static void hybrid_init_cpu_capacity_scaling(bool refresh) { + /* Bail out if enabling capacity-aware scheduling is prohibited. */ + if (no_cas) + return; + /* * If hybrid_max_perf_cpu is set at this point, the hybrid CPU capacity * scaling has been enabled already and the driver is just changing the @@ -3835,6 +3841,9 @@ static int __init intel_pstate_setup(char *str) if (!strcmp(str, "no_hwp")) no_hwp = 1; + if (!strcmp(str, "no_cas")) + no_cas = true; + if (!strcmp(str, "force")) force_load = 1; if (!strcmp(str, "hwp_only")) From 3698dd6b139dc37b35a9ad83d9330c1f99666c02 Mon Sep 17 00:00:00 2001 From: Jie Zhan Date: Thu, 13 Feb 2025 11:55:10 +0800 Subject: [PATCH 2/3] cpufreq: governor: Fix negative 'idle_time' handling in dbs_update() We observed an issue that the CPU frequency can't raise up with a 100% CPU load when NOHZ is off and the 'conservative' governor is selected. 'idle_time' can be negative if it's obtained from get_cpu_idle_time_jiffy() when NOHZ is off. This was found and explained in commit 9485e4ca0b48 ("cpufreq: governor: Fix handling of special cases in dbs_update()"). However, commit 7592019634f8 ("cpufreq: governors: Fix long idle detection logic in load calculation") introduced a comparison between 'idle_time' and 'samling_rate' to detect a long idle interval. While 'idle_time' is converted to int before comparison, it's actually promoted to unsigned again when compared with an unsigned 'sampling_rate'. Hence, this leads to wrong idle interval detection when it's in fact 100% busy and sets policy_dbs->idle_periods to a very large value. 'conservative' adjusts the frequency to minimum because of the large 'idle_periods', such that the frequency can't raise up. 'Ondemand' doesn't use policy_dbs->idle_periods so it fortunately avoids the issue. Correct negative 'idle_time' to 0 before any use of it in dbs_update(). Fixes: 7592019634f8 ("cpufreq: governors: Fix long idle detection logic in load calculation") Signed-off-by: Jie Zhan Reviewed-by: Chen Yu Link: https://patch.msgid.link/20250213035510.2402076-1-zhanjie9@hisilicon.com Signed-off-by: Rafael J. Wysocki --- drivers/cpufreq/cpufreq_governor.c | 45 +++++++++++++++--------------- 1 file changed, 23 insertions(+), 22 deletions(-) diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c index af44ee6a6430..1a7fcaf39cc9 100644 --- a/drivers/cpufreq/cpufreq_governor.c +++ b/drivers/cpufreq/cpufreq_governor.c @@ -145,7 +145,23 @@ unsigned int dbs_update(struct cpufreq_policy *policy) time_elapsed = update_time - j_cdbs->prev_update_time; j_cdbs->prev_update_time = update_time; - idle_time = cur_idle_time - j_cdbs->prev_cpu_idle; + /* + * cur_idle_time could be smaller than j_cdbs->prev_cpu_idle if + * it's obtained from get_cpu_idle_time_jiffy() when NOHZ is + * off, where idle_time is calculated by the difference between + * time elapsed in jiffies and "busy time" obtained from CPU + * statistics. If a CPU is 100% busy, the time elapsed and busy + * time should grow with the same amount in two consecutive + * samples, but in practice there could be a tiny difference, + * making the accumulated idle time decrease sometimes. Hence, + * in this case, idle_time should be regarded as 0 in order to + * make the further process correct. + */ + if (cur_idle_time > j_cdbs->prev_cpu_idle) + idle_time = cur_idle_time - j_cdbs->prev_cpu_idle; + else + idle_time = 0; + j_cdbs->prev_cpu_idle = cur_idle_time; if (ignore_nice) { @@ -162,7 +178,7 @@ unsigned int dbs_update(struct cpufreq_policy *policy) * calls, so the previous load value can be used then. */ load = j_cdbs->prev_load; - } else if (unlikely((int)idle_time > 2 * sampling_rate && + } else if (unlikely(idle_time > 2 * sampling_rate && j_cdbs->prev_load)) { /* * If the CPU had gone completely idle and a task has @@ -189,30 +205,15 @@ unsigned int dbs_update(struct cpufreq_policy *policy) load = j_cdbs->prev_load; j_cdbs->prev_load = 0; } else { - if (time_elapsed >= idle_time) { + if (time_elapsed > idle_time) load = 100 * (time_elapsed - idle_time) / time_elapsed; - } else { - /* - * That can happen if idle_time is returned by - * get_cpu_idle_time_jiffy(). In that case - * idle_time is roughly equal to the difference - * between time_elapsed and "busy time" obtained - * from CPU statistics. Then, the "busy time" - * can end up being greater than time_elapsed - * (for example, if jiffies_64 and the CPU - * statistics are updated by different CPUs), - * so idle_time may in fact be negative. That - * means, though, that the CPU was busy all - * the time (on the rough average) during the - * last sampling interval and 100 can be - * returned as the load. - */ - load = (int)idle_time < 0 ? 100 : 0; - } + else + load = 0; + j_cdbs->prev_load = load; } - if (unlikely((int)idle_time > 2 * sampling_rate)) { + if (unlikely(idle_time > 2 * sampling_rate)) { unsigned int periods = idle_time / sampling_rate; if (periods < idle_periods) From ed7cad0504e38a2a4e1aa6168b6eadee6de531b5 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Fri, 7 Feb 2025 13:48:40 +0100 Subject: [PATCH 3/3] cpufreq: intel_pstate: Relocate platform preference check Move the invocation of intel_pstate_platform_pwr_mgmt_exists() before checking whether or not HWP is enabled because it does not depend on any code running before it except for the vendor check and if CPU performance scaling is going to be carried out by the platform, all of the code that runs before that function (again, except for the vendor check) is redundant. This is not expected to alter any functionality except for the ordering of messages printed by intel_pstate_init() when it is going to return an error before attempting to register the driver. Signed-off-by: Rafael J. Wysocki Acked-by: Srinivas Pandruvada Link: https://patch.msgid.link/2776745.mvXUDI8C0e@rjwysocki.net --- drivers/cpufreq/intel_pstate.c | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c index bc31e9b9b660..fa81054fa411 100644 --- a/drivers/cpufreq/intel_pstate.c +++ b/drivers/cpufreq/intel_pstate.c @@ -3694,6 +3694,15 @@ static int __init intel_pstate_init(void) if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL) return -ENODEV; + /* + * The Intel pstate driver will be ignored if the platform + * firmware has its own power management modes. + */ + if (intel_pstate_platform_pwr_mgmt_exists()) { + pr_info("P-states controlled by the platform\n"); + return -ENODEV; + } + id = x86_match_cpu(hwp_support_ids); if (id) { hwp_forced = intel_pstate_hwp_is_enabled(); @@ -3749,15 +3758,6 @@ static int __init intel_pstate_init(void) default_driver = &intel_cpufreq; hwp_cpu_matched: - /* - * The Intel pstate driver will be ignored if the platform - * firmware has its own power management modes. - */ - if (intel_pstate_platform_pwr_mgmt_exists()) { - pr_info("P-states controlled by the platform\n"); - return -ENODEV; - } - if (!hwp_active && hwp_only) return -ENOTSUPP;