- Fix printf format warning for bprintf
sunrpc uses a trace_printk() that triggers a printf warning during the
compile. Move the __printf() attribute around for when debugging is not
enabled the warning will go away.
- Remove redundant check for EVENT_FILE_FL_FREED in event_filter_write()
The FREED flag is checked in the call to event_file_file() and then
checked again right afterward, which is unneeded.
- Clean up event_file_file() and event_file_data() helpers
These helper functions played a different role in the past, but now with
eventfs, the READ_ONCE() isn't needed. Simplify the code a bit and also
add a warning to event_file_data() if the file or its data is not present.
- Remove updating file->private_data in tracing open
All access to the file private data is handled by the helper functions,
which do not use file->private_data. Stop updating it on open.
- Show ENUM names in function arguments via BTF in function tracing
When showing the function arguments when func-args option is set for
function tracing, if one of the arguments is found to be an enum, show the
name of the enum instead of its number.
- Add new trace_call__##name() API for tracepoints
Tracepoints are enabled via static_branch() blocks, where when not
enabled, there's only a nop that is in the code where the execution will
just skip over it. When tracing is enabled, the nop is converted to a
direct jump to the tracepoint code. Sometimes more calculations are
required to be performed to update the parameters of the tracepoint. In
this case, trace_##name##_enabled() is called which is a static_branch()
that gets enabled only when the tracepoint is enabled. This allows the
extra calculations to also be skipped by the nop:
if (trace_foo_enabled()) {
x = bar();
trace_foo(x);
}
Where the x=bar() is only performed when foo is enabled. The problem with
this approach is that there's now two static_branch() calls. One for
checking if the tracepoint is enabled, and then again to know if the
tracepoint should be called. The second one is redundant.
Introduce trace_call__foo() that will call the foo() tracepoint directly
without doing a static_branch():
if (trace_foo_enabled()) {
x = bar();
trace_call__foo();
}
- Update various locations to use the new trace_call__##name() API
- Move snapshot code out of trace.c
Cleaning up trace.c to not be a "dump all", move the snapshot code out of
it and into a new trace_snapshot.c file.
- Clean up some "%*.s" to "%*s"
- Allow boot kernel command line options to be called multiple times
Have options like:
ftrace_filter=foo ftrace_filter=bar ftrace_filter=zoo
Equal to:
ftrace_filter=foo,bar,zoo
- Fix ipi_raise event CPU field to be a CPU field
The ipi_raise target_cpus field is defined as a __bitmask(). There is now a
__cpumask() field definition. Update the field to use that.
- Have hist_field_name() use a snprintf() and not a series of strcat()
It's safer to use snprintf() that a series of strcat().
- Fix tracepoint regfunc balancing
A tracepoint can define a "reg" and "unreg" function that gets called
before the tracepoint is enabled, and after it is disabled respectively.
But on error, after the "reg" func is called and the tracepoint is not
enabled, the "unreg" function is not called to tear down what the "reg"
function performed.
- Fix output that shows what histograms are enabled
Event variables are displayed incorrectly in the histogram output.
Instead of "sched.sched_wakeup.$var", it is showing
"$sched.sched_wakeup.var" where the '$' is in the incorrect location.
- Some other simple cleanups.
-----BEGIN PGP SIGNATURE-----
iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaeCpvxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qt2WAP44m85BbAjBqJe4WR103eOXV+bREBta
dRoReKJOMe519gEAp0rK/HoCvHgHhIGe3gaGdIsNhnaxoFyNWMG/wokoLAY=
=Hg6+
-----END PGP SIGNATURE-----
Merge tag 'trace-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:
- Fix printf format warning for bprintf
sunrpc uses a trace_printk() that triggers a printf warning during
the compile. Move the __printf() attribute around for when debugging
is not enabled the warning will go away
- Remove redundant check for EVENT_FILE_FL_FREED in
event_filter_write()
The FREED flag is checked in the call to event_file_file() and then
checked again right afterward, which is unneeded
- Clean up event_file_file() and event_file_data() helpers
These helper functions played a different role in the past, but now
with eventfs, the READ_ONCE() isn't needed. Simplify the code a bit
and also add a warning to event_file_data() if the file or its data
is not present
- Remove updating file->private_data in tracing open
All access to the file private data is handled by the helper
functions, which do not use file->private_data. Stop updating it on
open
- Show ENUM names in function arguments via BTF in function tracing
When showing the function arguments when func-args option is set for
function tracing, if one of the arguments is found to be an enum,
show the name of the enum instead of its number
- Add new trace_call__##name() API for tracepoints
Tracepoints are enabled via static_branch() blocks, where when not
enabled, there's only a nop that is in the code where the execution
will just skip over it. When tracing is enabled, the nop is converted
to a direct jump to the tracepoint code. Sometimes more calculations
are required to be performed to update the parameters of the
tracepoint. In this case, trace_##name##_enabled() is called which is
a static_branch() that gets enabled only when the tracepoint is
enabled. This allows the extra calculations to also be skipped by the
nop:
if (trace_foo_enabled()) {
x = bar();
trace_foo(x);
}
Where the x=bar() is only performed when foo is enabled. The problem
with this approach is that there's now two static_branch() calls. One
for checking if the tracepoint is enabled, and then again to know if
the tracepoint should be called. The second one is redundant
Introduce trace_call__foo() that will call the foo() tracepoint
directly without doing a static_branch():
if (trace_foo_enabled()) {
x = bar();
trace_call__foo();
}
- Update various locations to use the new trace_call__##name() API
- Move snapshot code out of trace.c
Cleaning up trace.c to not be a "dump all", move the snapshot code
out of it and into a new trace_snapshot.c file
- Clean up some "%*.s" to "%*s"
- Allow boot kernel command line options to be called multiple times
Have options like:
ftrace_filter=foo ftrace_filter=bar ftrace_filter=zoo
Equal to:
ftrace_filter=foo,bar,zoo
- Fix ipi_raise event CPU field to be a CPU field
The ipi_raise target_cpus field is defined as a __bitmask(). There is
now a __cpumask() field definition. Update the field to use that
- Have hist_field_name() use a snprintf() and not a series of strcat()
It's safer to use snprintf() that a series of strcat()
- Fix tracepoint regfunc balancing
A tracepoint can define a "reg" and "unreg" function that gets called
before the tracepoint is enabled, and after it is disabled
respectively. But on error, after the "reg" func is called and the
tracepoint is not enabled, the "unreg" function is not called to tear
down what the "reg" function performed
- Fix output that shows what histograms are enabled
Event variables are displayed incorrectly in the histogram output
Instead of "sched.sched_wakeup.$var", it is showing
"$sched.sched_wakeup.var" where the '$' is in the incorrect location
- Some other simple cleanups
* tag 'trace-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (24 commits)
selftests/ftrace: Add test case for fully-qualified variable references
tracing: Fix fully-qualified variable reference printing in histograms
tracepoint: balance regfunc() on func_add() failure in tracepoint_add_func()
tracing: Rebuild full_name on each hist_field_name() call
tracing: Report ipi_raise target CPUs as cpumask
tracing: Remove duplicate latency_fsnotify() stub
tracing: Preserve repeated trace_trigger boot parameters
tracing: Append repeated boot-time tracing parameters
tracing: Remove spurious default precision from show_event_trigger/filter formats
cpufreq: Use trace_call__##name() at guarded tracepoint call sites
tracing: Remove tracing_alloc_snapshot() when snapshot isn't defined
tracing: Move snapshot code out of trace.c and into trace_snapshot.c
mm: damon: Use trace_call__##name() at guarded tracepoint call sites
btrfs: Use trace_call__##name() at guarded tracepoint call sites
spi: Use trace_call__##name() at guarded tracepoint call sites
i2c: Use trace_call__##name() at guarded tracepoint call sites
kernel: Use trace_call__##name() at guarded tracepoint call sites
tracepoint: Add trace_call__##name() API
tracing: trace_mmap.h: fix a kernel-doc warning
tracing: Pretty-print enum parameters in function arguments
...
Add support for new features:
* CPPC performance priority
* Dynamic EPP
* Raw EPP
* New unit tests for new features
Fixes for:
* PREEMPT_RT
* sysfs files being present when HW missing
* Broken/outdated documentation
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEECwtuSU6dXvs5GA2aLRkspiR3AnYFAmnOpNgTHHN1cGVybTFA
a2VybmVsLm9yZwAKCRAtGSymJHcCduR7EADexgetxq0l6/iV2DyI1/YJcf+cNPoS
yxE93vN9i3A2xcx87klncVF0C2zIZaZFkp6o7VY/AReL/UyUOh6snz371OXBl7pm
A/uppkT5QdzTpmknJMyqkLRlHfkMjNRzWv4sdh4kyJSB3SkgaN7zSVi6Zxamt/vJ
VNCgExZQeDqk4VL2X/NBfaBagYSnPnBmBdXoY6aPYqFrqKj4SlDxYNbJsQlcyE9Z
z0naVGb5YPEJOaMvE+5z+DwX4EmtN3si+vfi8VuQOXPnoDGOG763rpMLnz7xYvfW
poPu2fnitN39MaT96btRShD6XuCg9eaPAEmpb3j6c93n1kUo+joLLbalhfc0HMeL
1/8ndz+KatEUMQTCVgs8cboob1PpRvqhIb+vrs6aTEqCsgqUKUZ7GYgglBamyRka
mivC5Q+ssCxq47/ilGfECFr8vK0oV3rTu9Ltp4MS5zN70tI0YYZk3o1454nY5dhc
Byv5e9bft/n9AA576y5vXENcWCSez/8UFGl5RjoxQZ7SFKNFnbSic1BT4uMRVX/G
4QUk5TWwC8WdOp7YsO30LwZ0y9vtxmfBn8BF/6n/dYGhM1/DVQ1nX9iyzhCHZ3XH
fgyrkUktdI1dsm/xKvbqxK9Djw0tkMsfH1yI6iQccefnlo4gRSvTRFiM2yepY6py
E8MZpz1ML8T2Pw==
=XTdh
-----END PGP SIGNATURE-----
Merge tag 'amd-pstate-v7.1-2026-04-02' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/superm1/linux
Pull amd-pstate new content for 7.1 (2026-04-02) from Mario Limonciello:
"Add support for new features:
* CPPC performance priority
* Dynamic EPP
* Raw EPP
* New unit tests for new features
Fixes for:
* PREEMPT_RT
* sysfs files being present when HW missing
* Broken/outdated documentation"
* tag 'amd-pstate-v7.1-2026-04-02' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/superm1/linux: (22 commits)
MAINTAINERS: amd-pstate: Step down as maintainer, add Prateek as reviewer
cpufreq: Pass the policy to cpufreq_driver->adjust_perf()
cpufreq/amd-pstate: Pass the policy to amd_pstate_update()
cpufreq/amd-pstate-ut: Add a unit test for raw EPP
cpufreq/amd-pstate: Add support for raw EPP writes
cpufreq/amd-pstate: Add support for platform profile class
cpufreq/amd-pstate: add kernel command line to override dynamic epp
cpufreq/amd-pstate: Add dynamic energy performance preference
Documentation: amd-pstate: fix dead links in the reference section
cpufreq/amd-pstate: Cache the max frequency in cpudata
Documentation/amd-pstate: Add documentation for amd_pstate_floor_{freq,count}
Documentation/amd-pstate: List amd_pstate_prefcore_ranking sysfs file
Documentation/amd-pstate: List amd_pstate_hw_prefcore sysfs file
amd-pstate-ut: Add a testcase to validate the visibility of driver attributes
amd-pstate-ut: Add module parameter to select testcases
amd-pstate: Introduce a tracepoint trace_amd_pstate_cppc_req2()
amd-pstate: Add sysfs support for floor_freq and floor_count
amd-pstate: Add support for CPPC_REQ2 and FLOOR_PERF
x86/cpufeatures: Add AMD CPPC Performance Priority feature.
amd-pstate: Make certain freq_attrs conditionally visible
...
cpufreq_cpu_get() can sleep on PREEMPT_RT in presence of concurrent
writer(s), however amd-pstate depends on fetching the cpudata via the
policy's driver data which necessitates grabbing the reference.
Since schedutil governor can call "cpufreq_driver->update_perf()"
during sched_tick/enqueue/dequeue with rq_lock held and IRQs disabled,
fetching the policy object using the cpufreq_cpu_get() helper in the
scheduler fast-path leads to "BUG: scheduling while atomic" on
PREEMPT_RT [1].
Pass the cached cpufreq policy object in sg_policy to the update_perf()
instead of just the CPU. The CPU can be inferred using "policy->cpu".
The lifetime of cpufreq_policy object outlasts that of the governor and
the cpufreq driver (allocated when the CPU is onlined and only reclaimed
when the CPU is offlined / the CPU device is removed) which makes it
safe to be referenced throughout the governor's lifetime.
Closes:https://lore.kernel.org/all/20250731092316.3191-1-spasswolf@web.de/ [1]
Fixes: 1d215f0319 ("cpufreq: amd-pstate: Add fast switch function for AMD P-State")
Reported-by: Bert Karwatzki <spasswolf@web.de>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Acked-by: Gary Guo <gary@garyguo.net> # Rust
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Reviewed-by: Zhongqiu Han <zhongqiu.han@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20260316081849.19368-3-kprateek.nayak@amd.com
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Replace trace_foo() with the new trace_call__foo() at sites already
guarded by trace_foo_enabled(), avoiding a redundant
static_branch_unlikely() re-evaluation inside the tracepoint.
trace_call__foo() calls the tracepoint callbacks directly without
utilizing the static branch again.
Cc: Huang Rui <ray.huang@amd.com>
Cc: Mario Limonciello <mario.limonciello@amd.com>
Cc: Perry Yuan <perry.yuan@amd.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Len Brown <lenb@kernel.org>
Link: https://patch.msgid.link/20260323160052.17528-7-vineeth@bitbyteword.org
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Assisted-by: Claude:claude-sonnet-4-6
Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org> # cpufreq core
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Repeated intel_pstate disables currently return an error, adding unnecessary
complexity to userspace scripts which must first read the current state and
conditionally write 'off'.
Make repeated intel_pstate disables a no-op.
Signed-off-by: Fabio M. De Francesco <fabio.m.de.francesco@linux.intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Link: https://patch.msgid.link/20260219181600.16388-1-fabio.m.de.francesco@linux.intel.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When the system is booted with kernel command line argument "nosmt" or
"maxcpus" to limit the number of CPUs, disabling turbo via:
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
results in a crash:
PF: supervisor read access in kernel mode
PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] SMP PTI
...
RIP: 0010:store_no_turbo+0x100/0x1f0
...
This occurs because for_each_possible_cpu() returns CPUs even if they
are not online. For those CPUs, all_cpu_data[] will be NULL. Since
commit 973207ae3d ("cpufreq: intel_pstate: Rearrange max frequency
updates handling code"), all_cpu_data[] is dereferenced even for CPUs
which are not online, causing the NULL pointer dereference.
To fix that, pass CPU number to intel_pstate_update_max_freq() and use
all_cpu_data[] for those CPUs for which there is a valid cpufreq policy.
Fixes: 973207ae3d ("cpufreq: intel_pstate: Rearrange max frequency updates handling code")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=221068
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: 6.16+ <stable@vger.kernel.org> # 6.16+
Link: https://patch.msgid.link/20260225001752.890164-1-srinivas.pandruvada@linux.intel.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The update_cpu_qos_request() function attempts to initialize the 'freq'
variable by dereferencing 'cpudata' before verifying if the 'policy'
is valid.
This issue occurs on systems booted with the "nosmt" parameter, where
all_cpu_data[cpu] is NULL for the SMT sibling threads. As a result,
any call to update_qos_requests() will result in a NULL pointer
dereference as the code will attempt to access pstate.turbo_freq using
the NULL cpudata pointer.
Also, pstate.turbo_freq may be updated by intel_pstate_get_hwp_cap()
after initializing the 'freq' variable, so it is better to defer the
'freq' until intel_pstate_get_hwp_cap() has been called.
Fix this by deferring the 'freq' assignment until after the policy and
driver_data have been validated.
Fixes: ae1bdd23b9 ("cpufreq: intel_pstate: Adjust frequency percentage computations")
Reported-by: Jirka Hladky <jhladky@redhat.com>
Closes: https://lore.kernel.org/all/CAE4VaGDfiPvz3AzrwrwM4kWB3SCkMci25nPO8W1JmTBd=xHzZg@mail.gmail.com/
Signed-off-by: David Arcari <darcari@redhat.com>
Cc: 6.18+ <stable@vger.kernel.org> # 6.18+
[ rjw: Added one paragraph to the changelog ]
Link: https://patch.msgid.link/20260224122106.228116-1-darcari@redhat.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
According to the description in the intel_pstate.rst documentation,
Capacity-Aware Scheduling and Energy-Aware Scheduling are only
supported on a hybrid processor without SMT. Previously, the system
used sched_smt_active() for judgment, which is not a strict condition
because users can switch it on or off via /sys at any time.
This could lead to incorrect driver settings in certain scenarios.
For example, on a CPU that supports SMT, a user can disable SMT
via the nosmt parameter to enable asym capacity, and then re-enable
SMT via /sys. In such cases, some settings in the driver would no
longer be correct.
To address this issue, replace sched_smt_active() with cpu_smt_possible(),
and only enable asym capacity when CPU SMT is not possible.
Fixes: 929ebc93cc ("cpufreq: intel_pstate: Set asymmetric CPU capacity on hybrid systems")
Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
[ rjw: Subject and changelog edits ]
Link: https://patch.msgid.link/20260203024852.301066-1-tianyaxiong@kylinos.cn
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
To eliminate some code duplication from the intel_pstate driver,
move the core_get_val() function body to a new function called
get_perf_ctl_val() and make both core_get_val() and atom_get_val()
invoke it to carry out the same computation.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Link: https://patch.msgid.link/2829273.mvXUDI8C0e@rafael.j.wysocki
Use guard(mutex)(&intel_pstate_driver_lock), or the scoped variant of
it, wherever intel_pstate_driver_lock needs to be held.
This allows some local variables and goto statements to be dropped as
they are not necessary any more.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Link: https://patch.msgid.link/2807232.mvXUDI8C0e@rafael.j.wysocki
Commit ac4e04d9e3 ("cpufreq: intel_pstate: Unchecked MSR aceess in
legacy mode") introduced a check for feature X86_FEATURE_IDA to verify
turbo mode support. Although this is the correct way to check for turbo
mode support, it causes issues on some platforms that disable turbo
during OS boot, but enable it later [1]. Before adding this feature
check, users were able to get turbo mode frequencies by writing 0 to
/sys/devices/system/cpu/intel_pstate/no_turbo post-boot.
To restore the old behavior on the affected systems while still
addressing the unchecked MSR issue on some Skylake-X systems, check
X86_FEATURE_IDA only immediately before updates of MSR_IA32_PERF_CTL
that may involve setting the Turbo Engage Bit (bit 32).
Fixes: ac4e04d9e3 ("cpufreq: intel_pstate: Unchecked MSR aceess in legacy mode")
Reported-by: Aaron Rainbolt <arainbolt@kfocus.org>
Closes: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2122531 [1]
Tested-by: Aaron Rainbolt <arainbolt@kfocus.org>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw: Subject adjustment, changelog edits ]
Link: https://patch.msgid.link/20251111010840.141490-1-srinivas.pandruvada@linux.intel.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Prevent intel_pstate from loading when Out-of-Band (OOB) P-states mode
is enabled.
The OOB identification mechanism for Diamond Rapids servers is the same
as for prior generation CPUs such as Granite Rapids. Add the Diamond
Rapids CPU model to intel_pstate_cpu_oob_ids[] to ensure correct OOB
handling.
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Link: https://patch.msgid.link/20251022215425.3566218-1-sathyanarayanan.kuppuswamy@linux.intel.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Some debug messages generated by intel_pstate on a given hybrid system
are only printed for some CPUs which is confusing, so modify the driver
to print them for all CPUs. Also change those messages to avoid
printing local variable names in them.
Moreover, some debug messages printed by intel_pstate are quite hard
to understand without looking at the code printing them, so make them
somewhat clearer while at it.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/8609836.T7Z3S40VBb@rafael.j.wysocki
Instead of using HWP-to-frequency scaling factors for computing cost
coefficients in the energy model used on hybrid systems, which is
fragile, rely on CPU type information that is easily accessible now and
the information on whether or not L3 cache is present for this purpose.
This also allows the cost coefficients for P-cores to be adjusted so
that they start to be populated somewhat earlier (that is, before
E-cores are loaded up to their full capacity).
In addition to the above, replace an inaccurate comment regarding the
reason why the freq value is added to the cost in hybrid_get_cost().
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
Link: https://patch.msgid.link/5932894.DvuYhMxLoT@rafael.j.wysocki
Introduce a function for checking whether or not a given CPU has L3
cache, called hybrid_has_l3(), and use it in hybrid_get_cost() for
computing cost coefficients associated with a given perf domain.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/13884343.uLZWGnKmhe@rafael.j.wysocki
Introduce a function for identifying the type of a given CPU in a
hybrid system, called hybrid_get_cpu_type(), and use if for hybrid
scaling factor determination in hwp_get_cpu_scaling().
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/1954386.tdWV9SEqCh@rafael.j.wysocki
The comment above the condition `if (cpu->last_sample_time)` clearly
indicates that the branch is taken for the vast majority of invocations
after the first sample in a cycle. The first sample is a one-time
initialization case.
Add likely() hint to the condition to improve branch prediction for
this performance-critical path in intel_pstate_sample().
Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
So far, HWP has never been enabled without EPP (Energy-Performance
Preference) interface support, since the lack of the latter indicates an
incomplete implementation of HWP, which was the case on early development
vehicle platforms. However, HWP can be expected to work if DEC (Dynamic
Efficiency Control) is enabled as indicated by setting bit 27 in
MSR_IA32_POWER_CTL (DEC enable bit).
Accordingly, allow HWP to be enabled if the EPP interface is not
supported so long as DEC is enabled in the processor.
Still, the EPP control sysfs interface is useless when EPP is not
supported, so do not expose it in that case.
Link: https://lore.kernel.org/linux-pm/20250904000608.260817-2-srinivas.pandruvada@linux.intel.com/
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Co-developed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The intel_pstate driver manages CPU capacity changes itself and it does
not need an update of the capacity of all CPUs in the system to be
carried out after registering a PD.
Moreover, in some configurations (for instance, an SMT-capable
hybrid x86 system booted with nosmt in the kernel command line) the
em_check_capacity_update() call at the end of em_dev_register_perf_domain()
always fails and reschedules itself to run once again in 1 s, so
effectively it runs in vain every 1 s forever.
To address this, introduce a new variant of em_dev_register_perf_domain(),
called em_dev_register_pd_no_update(), that does not invoke
em_check_capacity_update(), and make intel_pstate use it instead of the
original.
Fixes: 7b010f9b90 ("cpufreq: intel_pstate: EAS support for hybrid platforms")
Closes: https://lore.kernel.org/linux-pm/40212796-734c-4140-8a85-854f72b8144d@panix.com/
Reported-by: Kenneth R. Crudup <kenny@panix.com>
Tested-by: Kenneth R. Crudup <kenny@panix.com>
Cc: 6.16+ <stable@vger.kernel.org> # 6.16+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Adjust frequency percentage computations in update_cpu_qos_request() to
avoid going above the exact numerical percentage in the FREQ_QOS_MAX
case.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Zihuan Zhang <zhangzihuan@kylinos.cn>
Link: https://patch.msgid.link/3395556.44csPzL39Z@rafael.j.wysocki
[ rjw: Rename "cpu" to "cpudata" and "cpunum" to "cpu" ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Move the code from the for_each_possible_cpu() loop in update_qos_request()
to a separate function and use __free() for cpufreq policy reference
counting in it to avoid having to call cpufreq_cpu_put() repeatedly (or
using goto).
While at it, rename update_qos_request() to update_qos_requests()
because it updates multiple requests in one go.
No intentional functional impact.
Link: https://lore.kernel.org/linux-pm/CAJZ5v0gN1T5woSF0tO=TbAh+2-sWzxFjWyDbB7wG2TFCOU01iQ@mail.gmail.com/
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Zihuan Zhang <zhangzihuan@kylinos.cn>
Link: https://patch.msgid.link/3026597.e9J7NaK4W3@rafael.j.wysocki
[ rjw: Rename "cpu" to "cpudata" and "cpunum" to "cpu" in new code ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The cpufreq_cpu_put() call in update_qos_request() takes place too early
because the latter subsequently calls freq_qos_update_request() that
indirectly accesses the policy object in question through the QoS request
object passed to it.
Fortunately, update_qos_request() is called under intel_pstate_driver_lock,
so this issue does not matter for changing the intel_pstate operation
mode, but it theoretically can cause a crash to occur on CPU device hot
removal (which currently can only happen in virt, but it is formally
supported nevertheless).
Address this issue by modifying update_qos_request() to drop the
reference to the policy later.
Fixes: da5c504c7a ("cpufreq: intel_pstate: Implement QoS supported freq constraints")
Cc: 5.4+ <stable@vger.kernel.org> # 5.4+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Zihuan Zhang <zhangzihuan@kylinos.cn>
Link: https://patch.msgid.link/2255671.irdbgypaU6@rafael.j.wysocki
The intel_pstate driver does not enable HWP mode when CPUID.06H:EAX[10]
is not set, indicating that EPP (Energy Performance Preference) is not
supported by the hardware.
When EPP is unavailable, the system falls back to using EPB (Energy
Performance Bias) if the feature is supported. However, since the
intel_pstate driver will not enable HWP in this scenario, any EPB-related
code becomes unreachable and irrelevant. Remove the EPB handling code
paths simplifying the driver logic and reducing code size.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Link: https://patch.msgid.link/20250904000608.260817-1-srinivas.pandruvada@linux.intel.com
[ rjw: Subject adjustment ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Follow cleanup.h recommendations and define and assign a variable
in one statement when __free() is used.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Zihuan Zhang <zhangzihuan@kylinos.cn>
Link: https://patch.msgid.link/2251447.irdbgypaU6@rafael.j.wysocki
Users may disable HWP in firmware, in which case intel_pstate
wouldn't load unless the CPU model is explicitly supported.
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Link: https://patch.msgid.link/20250623105601.3924-1-lirongqing@baidu.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
In the passive mode, intel_cpufreq_update_pstate() sets HWP_MIN_PERF in
accordance with the target frequency to ensure delivering adequate
performance, but it sets HWP_DESIRED_PERF to 0, so the processor has no
indication that the desired performance level is actually equal to the
floor one. This may cause it to choose a performance point way above
the desired level.
Moreover, this is inconsistent with intel_cpufreq_adjust_perf() which
actually sets HWP_DESIRED_PERF in accordance with the target performance
value.
Address this by adjusting intel_cpufreq_update_pstate() to pass
target_pstate as both the minimum and the desired performance levels
to intel_cpufreq_hwp_update().
Fixes: a365ab6b9d ("cpufreq: intel_pstate: Implement the ->adjust_perf() callback")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Shashank Balaji <shashank.mahadasyam@sony.com>
Link: https://patch.msgid.link/6173276.lOV4Wx5bFT@rjwysocki.net
- Fix potential division-by-zero error in em_compute_costs() (Yaxiong
Tian).
- Fix typos in energy model documentation and example driver code (Moon
Hee Lee, Atul Kumar Pant).
- Rearrange the energy model management code and add a new function for
adjusting a CPU energy model after adjusting the capacity of the
given CPU to it (Rafael Wysocki).
- Refactor cpufreq_online(), add and use cpufreq policy locking guards,
use __free() in policy reference counting, and clean up core cpufreq
code on top of that (Rafael Wysocki).
- Fix boost handling on CPU suspend/resume and sysfs updates (Viresh
Kumar).
- Fix des_perf clamping with max_perf in amd_pstate_update() (Dhananjay
Ugwekar).
- Add offline, online and suspend callbacks to the amd-pstate driver,
rename and use the existing amd_pstate_epp callbacks in it (Dhananjay
Ugwekar).
- Add support for the "Requested CPU Min frequency" BIOS option to the
amd-pstate driver (Dhananjay Ugwekar).
- Reset amd-pstate driver mode after running selftests (Swapnil
Sapkal).
- Avoid shadowing ret in amd_pstate_ut_check_driver() (Nathan
Chancellor).
- Add helper for governor checks to the schedutil cpufreq governor and
move cpufreq-specific EAS checks to cpufreq (Rafael Wysocki).
- Populate the cpu_capacity sysfs entries from the intel_pstate driver
after registering asym capacity support (Ricardo Neri).
- Add support for enabling Energy-aware scheduling (EAS) to the
intel_pstate driver when operating in the passive mode on a hybrid
platform (Rafael Wysocki).
- Drop redundant cpus_read_lock() from store_local_boost() in the
cpufreq core (Seyediman Seyedarab).
- Replace sscanf() with kstrtouint() in the cpufreq code and use a
symbol instead of a raw number in it (Bowen Yu).
- Add support for autonomous CPU performance state selection to the
CPPC cpufreq driver (Lifeng Zheng).
- OPP: Add dev_pm_opp_set_level() (Praveen Talari).
- Introduce scope-based cleanup headers and mutex locking guards in OPP
core (Viresh Kumar).
- Switch OPP to use kmemdup_array() (Zhang Enpei).
- Optimize bucket assignment when next_timer_ns equals KTIME_MAX in the
menu cpuidle governor (Zhongqiu Han).
- Convert the cpuidle PSCI driver to a faux device one (Sudeep Holla).
- Add C1 demotion on/off sysfs knob to the intel_idle driver (Artem
Bityutskiy).
- Fix typos in two comments in the teo cpuidle governor (Atul Kumar
Pant).
- Fix denying of auto suspend in pm_suspend_timer_fn() (Charan Teja
Kalla).
- Move debug runtime PM attributes to runtime_attrs[] (Rafael Wysocki).
- Add new devm_ functions for enabling runtime PM and runtime PM
reference counting (Bence Csókás).
- Remove size arguments from strscpy() calls in the hibernation core
code (Thorsten Blum).
- Adjust the handling of devices with asynchronous suspend enabled
during system suspend and resume to start resuming them immediately
after resuming their parents and to start suspending such a device
immediately after suspending its first child (Rafael Wysocki).
- Adjust messages printed during tasks freezing to avoid using
pr_cont() (Andrew Sayers, Paul Menzel).
- Clean up unnecessary usage of !! in pm_print_times_init() (Zihuan
Zhang).
- Add missing wakeup source attribute relax_count to sysfs and
remove the space character at the end ofi the string produced by
pm_show_wakelocks() (Zijun Hu).
- Add configurable pm_test delay for hibernation (Zihuan Zhang).
- Disable asynchronous suspend in ucsi_ccg_probe() to prevent the
cypd4226 device on Tegra boards from suspending prematurely (Jon
Hunter).
- Unbreak printing PM debug messages during hibernation and clean up
some related code (Rafael Wysocki).
- Add a systemd service to run cpupower and change cpupower binding's
Makefile to use -lcpupower (John B. Wyatt IV, Francesco Poli).
-----BEGIN PGP SIGNATURE-----
iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmg0xS0SHHJqd0Byand5
c29ja2kubmV0AAoJEO5fvZ0v1OO1AwwH/Rvgza5YBPb9JZqWJT/ZiBw7HcEWHhP1
fNfcVU1gXPZiF0yoPfjfJua6BcLj6lyQ3d/+zWqqAcWfmRSD6HPe8yYz8qALUAqj
RWhDa04aGj6B9bQuOjejatznYlQlkwCRT7zec+75D+dAHVMqR/Vt2LFAetCadgHe
MQibAQmVFXu3RFkBjReTAdGzVoTXkwoZDrzdfA2aFAfMJNtJpOW4atUZvnucuctv
VK3ZratrctCIw7yXEoB1nWSmlY7R5JlslplBfndjmmOnky3YxNr7C6paqwtbTWoF
MiX48qkmLOGeO6gS8s/lVCDQ4oZ+UNFQvXRsM5NGjycBikhHX/dp/w4=
=dIqJ
-----END PGP SIGNATURE-----
Merge tag 'pm-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"Once again, the changes are dominated by cpufreq updates, but this
time the majority of them are cpufreq core changes, mostly related to
the introduction of policy locking guards and __free() usage, and
fixes related to boost handling.
Still, there is also a significant update of the intel_pstate driver
making it register an energy model when running on a hybrid platform
which is used for enabling energy-aware scheduling (EAS) if the driver
operates in the passive mode (and schedutil is used as the cpufreq
governor for all CPUs which is the passive mode default).
There are some amd-pstate driver updates too, for a good measure,
including the "Requested CPU Min frequency" BIOS option support and
new online/offline callbacks.
In the cpuidle space, the most significant change is the addition of a
C1 demotion on/off sysfs knob to intel_idle which should help some
users to configure their systems more precisely. There is also the
conversion of the PSCI cpuidle driver to a faux device one and there
are two small updates of cpuidle governors.
Device power management is also modified quite a bit, especially the
handling of devices with asynchronous suspend and resume enabled
during system transitions. They are now going to be handled more
asynchronously during suspend transitions and somewhat less
aggressively during resume transitions.
Apart from the above, the operating performance points (OPP) library
is now going to use mutex locking guards and scope-based cleanup
helpers and there is the usual bunch of assorted fixes and code
cleanups.
Specifics:
- Fix potential division-by-zero error in em_compute_costs() (Yaxiong
Tian)
- Fix typos in energy model documentation and example driver code
(Moon Hee Lee, Atul Kumar Pant)
- Rearrange the energy model management code and add a new function
for adjusting a CPU energy model after adjusting the capacity of
the given CPU to it (Rafael Wysocki)
- Refactor cpufreq_online(), add and use cpufreq policy locking
guards, use __free() in policy reference counting, and clean up
core cpufreq code on top of that (Rafael Wysocki)
- Fix boost handling on CPU suspend/resume and sysfs updates (Viresh
Kumar)
- Fix des_perf clamping with max_perf in amd_pstate_update()
(Dhananjay Ugwekar)
- Add offline, online and suspend callbacks to the amd-pstate driver,
rename and use the existing amd_pstate_epp callbacks in it
(Dhananjay Ugwekar)
- Add support for the "Requested CPU Min frequency" BIOS option to
the amd-pstate driver (Dhananjay Ugwekar)
- Reset amd-pstate driver mode after running selftests (Swapnil
Sapkal)
- Avoid shadowing ret in amd_pstate_ut_check_driver() (Nathan
Chancellor)
- Add helper for governor checks to the schedutil cpufreq governor
and move cpufreq-specific EAS checks to cpufreq (Rafael Wysocki)
- Populate the cpu_capacity sysfs entries from the intel_pstate
driver after registering asym capacity support (Ricardo Neri)
- Add support for enabling Energy-aware scheduling (EAS) to the
intel_pstate driver when operating in the passive mode on a hybrid
platform (Rafael Wysocki)
- Drop redundant cpus_read_lock() from store_local_boost() in the
cpufreq core (Seyediman Seyedarab)
- Replace sscanf() with kstrtouint() in the cpufreq code and use a
symbol instead of a raw number in it (Bowen Yu)
- Add support for autonomous CPU performance state selection to the
CPPC cpufreq driver (Lifeng Zheng)
- OPP: Add dev_pm_opp_set_level() (Praveen Talari)
- Introduce scope-based cleanup headers and mutex locking guards in
OPP core (Viresh Kumar)
- Switch OPP to use kmemdup_array() (Zhang Enpei)
- Optimize bucket assignment when next_timer_ns equals KTIME_MAX in
the menu cpuidle governor (Zhongqiu Han)
- Convert the cpuidle PSCI driver to a faux device one (Sudeep Holla)
- Add C1 demotion on/off sysfs knob to the intel_idle driver (Artem
Bityutskiy)
- Fix typos in two comments in the teo cpuidle governor (Atul Kumar
Pant)
- Fix denying of auto suspend in pm_suspend_timer_fn() (Charan Teja
Kalla)
- Move debug runtime PM attributes to runtime_attrs[] (Rafael
Wysocki)
- Add new devm_ functions for enabling runtime PM and runtime PM
reference counting (Bence Csókás)
- Remove size arguments from strscpy() calls in the hibernation core
code (Thorsten Blum)
- Adjust the handling of devices with asynchronous suspend enabled
during system suspend and resume to start resuming them immediately
after resuming their parents and to start suspending such a device
immediately after suspending its first child (Rafael Wysocki)
- Adjust messages printed during tasks freezing to avoid using
pr_cont() (Andrew Sayers, Paul Menzel)
- Clean up unnecessary usage of !! in pm_print_times_init() (Zihuan
Zhang)
- Add missing wakeup source attribute relax_count to sysfs and remove
the space character at the end ofi the string produced by
pm_show_wakelocks() (Zijun Hu)
- Add configurable pm_test delay for hibernation (Zihuan Zhang)
- Disable asynchronous suspend in ucsi_ccg_probe() to prevent the
cypd4226 device on Tegra boards from suspending prematurely (Jon
Hunter)
- Unbreak printing PM debug messages during hibernation and clean up
some related code (Rafael Wysocki)
- Add a systemd service to run cpupower and change cpupower binding's
Makefile to use -lcpupower (John B. Wyatt IV, Francesco Poli)"
* tag 'pm-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (72 commits)
cpufreq: CPPC: Add support for autonomous selection
cpufreq: Update sscanf() to kstrtouint()
cpufreq: Replace magic number
OPP: switch to use kmemdup_array()
PM: freezer: Rewrite restarting tasks log to remove stray *done.*
PM: runtime: fix denying of auto suspend in pm_suspend_timer_fn()
cpufreq: drop redundant cpus_read_lock() from store_local_boost()
cpupower: do not install files to /etc/default/
cpupower: do not call systemctl at install time
cpupower: do not write DESTDIR to cpupower.service
PM: sleep: Introduce pm_sleep_transition_in_progress()
cpufreq/amd-pstate: Avoid shadowing ret in amd_pstate_ut_check_driver()
cpufreq: intel_pstate: Document hybrid processor support
cpufreq: intel_pstate: EAS: Increase cost for CPUs using L3 cache
cpufreq: intel_pstate: EAS support for hybrid platforms
PM: EM: Introduce em_adjust_cpu_capacity()
PM: EM: Move CPU capacity check to em_adjust_new_capacity()
PM: EM: Documentation: Fix typos in example driver code
cpufreq: Drop policy locking from cpufreq_policy_is_good_for_eas()
PM: sleep: Introduce pm_suspend_in_progress()
...
On some hybrid platforms some efficient CPUs (E-cores) are not connected
to the L3 cache, but there are no other differences between them and the
other E-cores that use L3. In that case, it is generally more efficient
to run "light" workloads on the E-cores that do not use L3 and allow all
of the cores using L3, including P-cores, to go into idle states.
For this reason, slightly increase the cost for all CPUs sharing the L3
cache to make EAS prefer CPUs that do not use it to the other CPUs of
the same type (if any).
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/2032776.usQuhbGJ8B@rjwysocki.net
Modify intel_pstate to register EM perf domains for CPUs on hybrid
platforms without SMT which causes EAS to be enabled on them when
schedutil is used as the cpufreq governor (which requires intel_pstate
to operate in the passive mode).
This change is targeting platforms (for example, Lunar Lake) where the
"little" CPUs (E-cores) are always more energy-efficient than the "big"
or "performance" CPUs (P-cores) when run at the same HWP performance
level, so it is sufficient to tell EAS that E-cores are always preferred
(so long as there is enough spare capacity on one of them to run the
given task). However, migrating tasks between CPUs of the same type
too often is not desirable because it may hurt both performance and
energy efficiency due to leaving warm caches behind.
For this reason, register a separate perf domain for each CPU and choose
the cost values for them so that the cost mostly depends on the CPU type,
but there is also a small component of it depending on the performance
level (utilization) which helps to balance the load between CPUs of the
same type.
The cost component related to the CPU type is computed with the help of
the observation that the IPC metric value for a given CPU is inversely
proportional to its performance-to-frequency scaling factor and the cost
of running code on it can be assumed to be roughly proportional to that
IPC ratio (in principle, the higher the IPC ratio, the more resources
are utilized when running at a given frequency, so the cost should be
higher).
For all CPUs that are online at the system initialization time, EM perf
domains are registered when the driver starts up, after asymmetric
capacity support has been enabled. For the CPUs that become online
later, EM perf domains are registered after setting the asymmetric
capacity for them.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/6057101.MhkbZ0Pkbq@rjwysocki.net
Intel hybrid processors have CPUs of different capacity. Populate the
interface /sys/devices/system/cpu/cpuN/cpu_capacity.
This interface uses the per-CPU variable `cpu_scale`. On x86 this
variable has no other use besides feeding the sysfs entries. Initialize
it when setting CPU capacity for the scheduler and scale-invariant code.
Feed it with arch_scale_cpu_capacity() as it gives capacity normalized
to the interval [0, SCHED_CAPACITY_SCALE].
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When turbo mode is unavailable on a Skylake-X system, executing the
command:
# echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
results in an unchecked MSR access error:
WRMSR to 0x199 (attempted to write 0x0000000100001300).
This issue was reproduced on an OEM (Original Equipment Manufacturer)
system and is not a common problem across all Skylake-X systems.
This error occurs because the MSR 0x199 Turbo Engage Bit (bit 32) is set
when turbo mode is disabled. The issue arises when intel_pstate fails to
detect that turbo mode is disabled. Here intel_pstate relies on
MSR_IA32_MISC_ENABLE bit 38 to determine the status of turbo mode.
However, on this system, bit 38 is not set even when turbo mode is
disabled.
According to the Intel Software Developer's Manual (SDM), the BIOS sets
this bit during platform initialization to enable or disable
opportunistic processor performance operations. Logically, this bit
should be set in such cases. However, the SDM also specifies that "OS
and applications must use CPUID leaf 06H to detect processors with
opportunistic processor performance operations enabled."
Therefore, in addition to checking MSR_IA32_MISC_ENABLE bit 38, verify
that CPUID.06H:EAX[1] is 0 to accurately determine if turbo mode is
disabled.
Fixes: 4521e1a0ce ("cpufreq: intel_pstate: Reflect current no_turbo state correctly")
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit b52aaeeadf ("cpufreq: intel_pstate: Avoid SMP calls to get
cpu-type") introduced two issues into hwp_get_cpu_scaling(). First,
it made that function use the CPU type of the CPU running the code
even though the target CPU is passed as the argument to it and second,
it used smp_processor_id() for that even though hwp_get_cpu_scaling()
runs in preemptible context.
Fix both of these problems by simply passing "cpu" to cpu_data().
Fixes: b52aaeeadf ("cpufreq: intel_pstate: Avoid SMP calls to get cpu-type")
Link: https://lore.kernel.org/linux-pm/20250412103434.5321-1-xry111@xry111.site/
Reported-by: Xi Ruoyao <xry111@xry111.site>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/12659608.O9o76ZdvQC@rjwysocki.net
Since cpufreq_update_limits() obtains a cpufreq policy pointer for the
given CPU and reference counts the corresponding policy object, it may
as well pass the policy pointer to the cpufreq driver's ->update_limits()
callback which allows that callback to avoid invoking cpufreq_cpu_get()
for the same CPU.
Accordingly, redefine ->update_limits() to take a policy pointer instead
of a CPU number and update both drivers implementing it, intel_pstate
and amd-pstate, as needed.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Sudeep Holla <sudeep.holla@arm.com>
Tested-by: Sudeep Holla <sudeep.holla@arm.com>
Link: https://patch.msgid.link/8560367.NyiUUSuA9g@rjwysocki.net
Rename __intel_pstate_update_max_freq() to intel_pstate_update_max_freq()
and move the cpufreq policy reference counting and locking into it (and
implement the locking with the recently introduced cpufreq policy "write"
locking guard).
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Link: https://patch.msgid.link/2315023.iZASKD2KPV@rjwysocki.net