Merge tag 'sched_ext-for-7.0-rc6-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
"These are late but both fix subtle yet critical problems and the blast
radius is limited strictly to sched_ext.
- Fix stale direct dispatch state in ddsp_dsq_id which can cause
spurious warnings in mark_direct_dispatch() on task wakeup
- Fix is_bpf_migration_disabled() false negative on non-PREEMPT_RCU
configs which can lead to incorrectly dispatching migration-
disabled tasks to remote CPUs"
* tag 'sched_ext-for-7.0-rc6-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Fix stale direct dispatch state in ddsp_dsq_id
sched_ext: Fix is_bpf_migration_disabled() false negative on non-PREEMPT_RCU
@p->scx.ddsp_dsq_id can be left set (non-SCX_DSQ_INVALID), triggering a
spurious warning in mark_direct_dispatch() when the next wakeup's
ops.select_cpu() calls scx_bpf_dsq_insert(), such as:
WARNING: kernel/sched/ext.c:1273 at scx_dsq_insert_commit+0xcd/0x140
The root cause is that ddsp_dsq_id was only cleared in dispatch_enqueue(),
which is not reached in all paths that consume or cancel a direct dispatch
verdict.
Fix it by clearing it at the right places (see the sketch after this list):
- direct_dispatch(): cache the direct dispatch state in local variables
and clear it before dispatch_enqueue() on the synchronous path. For
the deferred path, the direct dispatch state must remain set until
process_ddsp_deferred_locals() consumes it.
- process_ddsp_deferred_locals(): cache the dispatch state in local
variables and clear it before calling dispatch_to_local_dsq(), which
may migrate the task to another rq.
- do_enqueue_task(): clear the dispatch state on the enqueue path
(local/global/bypass fallbacks), where the direct dispatch verdict is
ignored.
- dequeue_task_scx(): clear the dispatch state after dispatch_dequeue()
to handle both the deferred dispatch cancellation and the holding_cpu
race, covering all cases where a pending direct dispatch is
cancelled.
- scx_disable_task(): clear the direct dispatch state when
transitioning a task out of the current scheduler. Waking tasks may
have had the direct dispatch state set by the outgoing scheduler's
ops.select_cpu() and then been queued on a wake_list via
ttwu_queue_wakelist(), when SCX_OPS_ALLOW_QUEUED_WAKEUP is set. Such
tasks are not on the runqueue and are not iterated by scx_bypass(),
so their direct dispatch state won't be cleared. Without this clear,
any subsequent SCX scheduler that tries to direct dispatch the task
will trigger the WARN_ON_ONCE() in mark_direct_dispatch().
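As a hedged illustration, the cache-and-clear pattern described above
could look roughly like this (the helper is hypothetical; the scx fields
are the ones named in this changelog):

	/* Hypothetical sketch, not the actual kernel code. */
	static u64 ddsp_consume(struct task_struct *p, u64 *enq_flags)
	{
		u64 dsq_id = p->scx.ddsp_dsq_id;

		*enq_flags = p->scx.ddsp_enq_flags;
		/* Clear the verdict so it cannot leak into the next wakeup. */
		p->scx.ddsp_dsq_id = SCX_DSQ_INVALID;
		p->scx.ddsp_enq_flags = 0;

		return dsq_id;
	}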
Fixes: 5b26f7b920 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches")
Cc: stable@vger.kernel.org # v6.12+
Cc: Daniel Hodges <hodgesd@meta.com>
Cc: Patrick Somaru <patsomaru@meta.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Merge tag 'pm-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix a potential NULL pointer dereference in the energy model
netlink interface and a potential double free in an error path in
the common cpufreq governor management code:
- Fix a NULL pointer dereference in the energy model netlink
interface that may occur if a given perf domain ID is not
recognized (Changwoo Min)
- Avoid double free in the cpufreq_dbs_governor_init() error
path when kobject_init_and_add() fails (Guangshuo Li)"
* tag 'pm-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: governor: fix double free in cpufreq_dbs_governor_init() error path
PM: EM: Fix NULL pointer dereference when perf domain ID is not found
Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Alexei Starovoitov:
- Fix register equivalence for pointers to packet (Alexei Starovoitov)
- Fix incorrect pruning due to atomic fetch precision tracking (Daniel
Borkmann)
- Fix grace period wait for bpf_link-ed tracepoints (Kumar Kartikeya
Dwivedi)
- Fix use-after-free of sockmap's sk->sk_socket (Kuniyuki Iwashima)
- Reject direct access to nullable PTR_TO_BUF pointers (Qi Tang)
- Reject sleepable kprobe_multi programs at attach time (Varun R
Mallya)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Add more precision tracking tests for atomics
bpf: Fix incorrect pruning due to atomic fetch precision tracking
bpf: Reject sleepable kprobe_multi programs at attach time
bpf: reject direct access to nullable PTR_TO_BUF pointers
bpf: sockmap: Fix use-after-free of sk->sk_socket in sk_psock_verdict_data_ready().
bpf: Fix grace period wait for tracepoint bpf_link
bpf: Fix regsafe() for pointers to packet
Since commit 8e4f0b1ebc ("bpf: use rcu_read_lock_dont_migrate() for
trampoline.c"), the BPF prolog (__bpf_prog_enter) calls migrate_disable()
only when CONFIG_PREEMPT_RCU is enabled, via rcu_read_lock_dont_migrate().
Without CONFIG_PREEMPT_RCU, the prolog never touches migration_disabled,
so migration_disabled == 1 always means the task is truly
migration-disabled regardless of whether it is the current task.
The old unconditional p == current check therefore produced a false
negative in this case, potentially allowing a migration-disabled task
to be dispatched to a remote CPU and triggering scx_error() in
task_can_run_on_remote_rq().
Only apply the p == current disambiguation when CONFIG_PREEMPT_RCU is
enabled, where the ambiguity with the BPF prolog still exists.
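A hedged sketch of the resulting check (illustrative, not the verbatim
helper):

	static bool is_bpf_migration_disabled(struct task_struct *p)
	{
		if (IS_ENABLED(CONFIG_PREEMPT_RCU)) {
			/*
			 * The BPF prolog may have bumped migration_disabled
			 * for the current task, so a count of 1 on current
			 * is ambiguous there.
			 */
			if (p == current)
				return p->migration_disabled > 1;
		}
		/* Without PREEMPT_RCU the prolog never touches the counter. */
		return p->migration_disabled;
	}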
Fixes: 8e4f0b1ebc ("bpf: use rcu_read_lock_dont_migrate() for trampoline.c")
Cc: stable@vger.kernel.org # v6.18+
Link: https://lore.kernel.org/lkml/20250821090609.42508-8-dongml2@chinatelecom.cn/
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When backtrack_insn encounters a BPF_STX instruction with BPF_ATOMIC
and BPF_FETCH, the src register (or r0 for BPF_CMPXCHG) also acts as
a destination, thus receiving the old value from the memory location.
The current backtracking logic does not account for this. It treats
atomic fetch operations the same as regular stores where the src
register is only an input. This leads backtrack_insn() to fail to
propagate precision to the stack location, which is then not marked
as precise!
Later, the verifier's path pruning can incorrectly consider two states
equivalent when they differ in terms of stack state. Meaning, two
branches can be treated as equivalent and thus get pruned when they
should not be seen as such.
Fix it as follows: Extend the BPF_LDX handling in backtrack_insn to
also cover atomic fetch operations via is_atomic_fetch_insn() helper.
When the fetch dst register is being tracked for precision, clear it,
and propagate precision over to the stack slot. For non-stack memory,
the precision walk stops at the atomic instruction, same as regular
BPF_LDX. This covers all fetch variants.
Before:
0: (b7) r1 = 8 ; R1=8
1: (7b) *(u64 *)(r10 -8) = r1 ; R1=8 R10=fp0 fp-8=8
2: (b7) r2 = 0 ; R2=0
3: (db) r2 = atomic64_fetch_add((u64 *)(r10 -8), r2) ; R2=8 R10=fp0 fp-8=mmmmmmmm
4: (bf) r3 = r10 ; R3=fp0 R10=fp0
5: (0f) r3 += r2
mark_precise: frame0: last_idx 5 first_idx 0 subseq_idx -1
mark_precise: frame0: regs=r2 stack= before 4: (bf) r3 = r10
mark_precise: frame0: regs=r2 stack= before 3: (db) r2 = atomic64_fetch_add((u64 *)(r10 -8), r2)
mark_precise: frame0: regs=r2 stack= before 2: (b7) r2 = 0
6: R2=8 R3=fp8
6: (b7) r0 = 0 ; R0=0
7: (95) exit
After:
0: (b7) r1 = 8 ; R1=8
1: (7b) *(u64 *)(r10 -8) = r1 ; R1=8 R10=fp0 fp-8=8
2: (b7) r2 = 0 ; R2=0
3: (db) r2 = atomic64_fetch_add((u64 *)(r10 -8), r2) ; R2=8 R10=fp0 fp-8=mmmmmmmm
4: (bf) r3 = r10 ; R3=fp0 R10=fp0
5: (0f) r3 += r2
mark_precise: frame0: last_idx 5 first_idx 0 subseq_idx -1
mark_precise: frame0: regs=r2 stack= before 4: (bf) r3 = r10
mark_precise: frame0: regs=r2 stack= before 3: (db) r2 = atomic64_fetch_add((u64 *)(r10 -8), r2)
mark_precise: frame0: regs= stack=-8 before 2: (b7) r2 = 0
mark_precise: frame0: regs= stack=-8 before 1: (7b) *(u64 *)(r10 -8) = r1
mark_precise: frame0: regs=r1 stack= before 0: (b7) r1 = 8
6: R2=8 R3=fp8
6: (b7) r0 = 0 ; R0=0
7: (95) exit
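For reference, a hedged sketch of the predicate behind the helper named
above; all fetch variants, including BPF_XCHG and BPF_CMPXCHG, carry the
BPF_FETCH bit in insn->imm:

	static bool is_atomic_fetch_insn(const struct bpf_insn *insn)
	{
		return BPF_CLASS(insn->code) == BPF_STX &&
		       BPF_MODE(insn->code) == BPF_ATOMIC &&
		       (insn->imm & BPF_FETCH);
	}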
Fixes: 5ffa25502b ("bpf: Add instructions for atomic_[cmp]xchg")
Fixes: 5ca419f286 ("bpf: Add BPF_FETCH field / create atomic_fetch_add instruction")
Reported-by: STAR Labs SG <info@starlabs.sg>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260331222020.401848-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
kprobe.multi programs run in atomic/RCU context and cannot sleep.
However, bpf_kprobe_multi_link_attach() did not validate whether the
program being attached had the sleepable flag set, allowing sleepable
helpers such as bpf_copy_from_user() to be invoked from a non-sleepable
context.
This causes a "sleeping function called from invalid context" splat:
BUG: sleeping function called from invalid context at ./include/linux/uaccess.h:169
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1787, name: sudo
preempt_count: 1, expected: 0
RCU nest depth: 2, expected: 0
Fix this by rejecting sleepable programs early in
bpf_kprobe_multi_link_attach(), before any further processing.
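A hedged sketch of the early guard (the exact location of the sleepable
flag varies across kernel versions):

	/*
	 * kprobe_multi runs in atomic/RCU context; sleepable BPF programs
	 * must not be attached here.
	 */
	if (prog->sleepable)
		return -EINVAL;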
Fixes: 0dcac27254 ("bpf: Add multi kprobe link")
Signed-off-by: Varun R Mallya <varunrmallya@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Leon Hwang <leon.hwang@linux.dev>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260401191126.440683-1-varunrmallya@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
check_mem_access() matches PTR_TO_BUF via base_type() which strips
PTR_MAYBE_NULL, allowing direct dereference without a null check.
Map iterator ctx->key and ctx->value are PTR_TO_BUF | PTR_MAYBE_NULL.
On stop callbacks these are NULL, causing a kernel NULL dereference.
Add a type_may_be_null() guard to the PTR_TO_BUF branch, matching the
existing PTR_TO_BTF_ID pattern.
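A hedged sketch of the added branch in check_mem_access(), with the
surrounding structure simplified:

	if (base_type(reg->type) == PTR_TO_BUF) {
		if (type_may_be_null(reg->type)) {
			verbose(env, "R%d invalid mem access '%s'\n", regno,
				reg_type_str(env, reg->type));
			return -EACCES;
		}
		/* ... existing read-only/read-write buffer checks ... */
	}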
Fixes: 20b2aff4bc ("bpf: Introduce MEM_RDONLY flag")
Signed-off-by: Qi Tang <tpluszz77@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260402092923.38357-2-tpluszz77@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
John reported that stress-ng-yield could make his machine unhappy and
managed to bisect it to commit b3d99f43c7 ("sched/fair: Fix
zero_vruntime tracking").
The commit in question changes avg_vruntime() from a function that is
a pure reader, to a function that updates variables. This turns an
unlocked sched/debug usage of this function from a minor mistake into
a data corruptor.
Fixes: af4cf40470 ("sched/fair: Add cfs_rq::avg_vruntime")
Fixes: b3d99f43c7 ("sched/fair: Fix zero_vruntime tracking")
Reported-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: John Stultz <jstultz@google.com>
Link: https://patch.msgid.link/20260401132355.196370805@infradead.org
John reported that stress-ng-yield could make his machine unhappy and
managed to bisect it to commit b3d99f43c7 ("sched/fair: Fix
zero_vruntime tracking").
The combination of yield and that commit was specific enough to
hypothesize the following scenario:
Suppose we have 2 runnable tasks, both doing yield. Then one will be
eligible and one will not be, because the average position must be in
between these two entities.
Therefore, the runnable task will be eligible, and be promoted a full
slice (all the tasks do is yield after all). This causes it to jump over
the other task and now the other task is eligible and current is no
longer. So we schedule.
Since we are runnable, there is no {de,en}queue. All we have is the
__{en,de}queue_entity() from {put_prev,set_next}_task(). But per the
fingered commit, those two no longer move zero_vruntime.
All that moves zero_vruntime are tick and full {de,en}queue.
This means, that if the two tasks playing leapfrog can reach the
critical speed to reach the overflow point inside one tick's worth of
time, we're up a creek.
Additionally, when multiple cgroups are involved, there is no guarantee
the tick will in fact hit every cgroup in a timely manner. Statistically
speaking it will, but those same statistics do not rule out the
possibility of one cgroup not getting a tick for a significant amount of
time -- however unlikely.
Therefore, just like with the yield() case, force an update at the end
of every slice. This ensures the update is never more than a single
slice behind and the whole thing is within 2 lag bounds as per the
comment on entity_key().
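A hedged sketch of where such an update could land; the helper name and
placement are assumptions based on this description, not the actual
patch:

	/* In update_curr(), after accounting delta_exec (hypothetical): */
	if (curr->sum_exec_runtime - curr->prev_sum_exec_runtime >= curr->slice)
		update_zero_vruntime(cfs_rq);	/* refresh once per slice */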
Fixes: b3d99f43c7 ("sched/fair: Fix zero_vruntime tracking")
Reported-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: John Stultz <jstultz@google.com>
Link: https://patch.msgid.link/20260401132355.081530332@infradead.org
Recently, tracepoints were switched from using disabled preemption
(which acts as RCU read section) to SRCU-fast when they are not
faultable. This means that to do a proper grace period wait for programs
running in such tracepoints, we must use SRCU's grace period wait.
This is only for non-faultable tracepoints; faultable ones continue
using RCU Tasks Trace.
However, bpf_link_free() currently does call_rcu() for all cases when
the link is non-sleepable (hence, for tracepoints, non-faultable). Fix
this by doing a call_srcu() grace period wait.
As far as RCU Tasks Trace gp -> RCU gp chaining is concerned, it is deemed
unnecessary for tracepoint programs. The link and program are either
accessed under RCU Tasks Trace protection, or SRCU-fast protection now.
The earlier logic of chaining both RCU Tasks Trace and RCU gp waits was
to generalize the logic, even if it conceded an extra RCU gp wait;
however, that was unnecessary for tracepoints even before this change.
In practice no cost was paid since rcu_trace_implies_rcu_gp() was always
true. Hence we need not chain any RCU gp after the SRCU gp.
For instance, for a non-faultable raw tracepoint, the RCU read section
of the program in __bpf_trace_run() is enclosed in the SRCU read
section; likewise, for a faultable raw tracepoint, the program is under
RCU Tasks Trace protection. Hence, the outermost scope can be waited
upon to ensure correctness.
Also, sleepable programs cannot be attached to non-faultable
tracepoints, so whenever program or link is sleepable, only RCU Tasks
Trace protection is being used for the link and prog.
Fixes: a46023d561 ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast")
Reviewed-by: Sun Jian <sun.jian.kdev@gmail.com>
Reviewed-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20260331211021.1632902-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In case rold->reg->range == BEYOND_PKT_END && rcur->reg->range == N,
regsafe() may return true, which may lead to the current state with a
valid packet range not being explored. Fix the bug.
Fixes: 6d94e741a8 ("bpf: Support for pointers beyond pkt_end.")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Amery Hung <ameryhung@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20260331204228.26726-1-alexei.starovoitov@gmail.com
Merge tag 'sched_ext-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Fix SCX_KICK_WAIT deadlock where multiple CPUs waiting for each other
in hardirq context form a cycle. Move the wait to a balance callback
which can drop the rq lock and process IPIs.
- Fix inconsistent NUMA node lookup in scx_select_cpu_dfl() where
the waker_node used cpu_to_node() while prev_cpu used
scx_cpu_node_if_enabled(), leading to undefined behavior when
per-node idle tracking is disabled.
* tag 'sched_ext-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
selftests/sched_ext: Add cyclic SCX_KICK_WAIT stress test
sched_ext: Fix SCX_KICK_WAIT deadlock by deferring wait to balance callback
sched_ext: Fix inconsistent NUMA node lookup in scx_select_cpu_dfl()
Merge tag 'wq-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fix from Tejun Heo:
- Fix false positive stall reports on weakly ordered architectures
where the lockless worklist/timestamp check in the watchdog can
observe stale values due to memory reordering.
Recheck under pool->lock to confirm.
* tag 'wq-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Better describe stall check
workqueue: Fix false positive stall reports
Merge tag 'cgroup-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
- Fix cgroup rmdir racing with dying tasks.
Deferred task cgroup unlink introduced a window where cgroup.procs
is empty but the cgroup is still populated, causing rmdir to fail
with -EBUSY and selftest failures.
Make rmdir wait for dying tasks to fully leave and fix selftests to
not depend on synchronous populated updates.
- Fix cpuset v1 task migration failure from empty cpusets under strict
security policies.
When CPU hotplug removes the last CPU from a v1 cpuset, tasks must be
migrated to an ancestor without a security_task_setscheduler() check
that would block the migration.
* tag 'cgroup-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup/cpuset: Skip security check for hotplug induced v1 task migration
cgroup/cpuset: Simplify setsched decision check in task iteration loop of cpuset_can_attach()
cgroup: Fix cgroup_drain_dying() testing the wrong condition
selftests/cgroup: Don't require synchronous populated update on task exit
cgroup: Wait for dying tasks to leave on rmdir
When a CPU hot removal causes a v1 cpuset to lose all its CPUs, the
cpuset hotplug handler will schedule a work function to migrate tasks
in that cpuset with no CPU to its ancestor to enable those tasks to
continue running.
If a strict security policy is in place, however, the task migration
may fail when the security_task_setscheduler() call in
cpuset_can_attach() returns -EACCES. That means those tasks will have
no CPU to run on, and system administrators will have to intervene
explicitly to either add CPUs to that cpuset or move the tasks
elsewhere, if they are aware of the situation at all.
This problem was found via a reported test failure in LTP's
cpuset_hotplug_test.sh. Fix it by treating this special case as an
exception and skipping the setsched security check in
cpuset_can_attach() when a v1 cpuset with tasks has no CPUs left.
With this fix applied, the cpuset_hotplug_test.sh test runs
successfully without failure.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Hoist the check that decides whether security_task_setscheduler() needs
to run out of the task iteration loop in cpuset_can_attach(), as it
does not depend on the characteristics of the tasks themselves.
There is no functional change.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
dev_energymodel_nl_get_perf_domains_doit() calls
em_perf_domain_get_by_id() but does not check the return value before
passing it to __em_nl_get_pd_size(). When a caller supplies a
non-existent perf domain ID, em_perf_domain_get_by_id() returns NULL,
and __em_nl_get_pd_size() immediately dereferences pd->cpus
(struct offset 0x30), causing a NULL pointer dereference.
The sister handler dev_energymodel_nl_get_perf_table_doit() already
handles this correctly via __em_nl_get_pd_table_id(), which returns
NULL and causes the caller to return -EINVAL. Add the same NULL check
in the get-perf-domains do handler.
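A hedged sketch of the added check (function names as cited above; the
local variable names are illustrative):

	pd = em_perf_domain_get_by_id(pd_id);
	if (!pd)
		return -EINVAL;	/* match the get-perf-table handler */

	size = __em_nl_get_pd_size(pd);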
Fixes: 380ff27af2 ("PM: EM: Add dump to get-perf-domains in the EM YNL spec")
Reported-by: Yi Lai <yi1.lai@linux.intel.com>
Closes: https://lore.kernel.org/lkml/aXiySM79UYfk+ytd@ly-workstation/
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Cc: 6.19+ <stable@vger.kernel.org> # 6.19+
[ rjw: Subject and changelog edits ]
Link: https://patch.msgid.link/20260329073615.649976-1-changwoo@igalia.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
SCX_KICK_WAIT busy-waits in kick_cpus_irq_workfn() using
smp_cond_load_acquire() until the target CPU's kick_sync advances. Because
the irq_work runs in hardirq context, the waiting CPU cannot reschedule and
its own kick_sync never advances. If multiple CPUs form a wait cycle, all
CPUs deadlock.
Replace the busy-wait in kick_cpus_irq_workfn() with resched_curr() to
force the CPU through do_pick_task_scx(), which queues a balance callback
to perform the wait. The balance callback drops the rq lock and enables
IRQs following the sched_core_balance() pattern, so the CPU can process
IPIs while waiting. The local CPU's kick_sync is advanced on entry to
do_pick_task_scx() and continuously during the wait, ensuring any CPU that
starts waiting for us sees the advancement and cannot form cyclic
dependencies.
Fixes: 90e55164da ("sched_ext: Implement SCX_KICK_WAIT")
Cc: stable@vger.kernel.org # v6.12+
Reported-by: Christian Loehle <christian.loehle@arm.com>
Link: https://lore.kernel.org/r/20260316100249.1651641-1-christian.loehle@arm.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Merge tag 'timers-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Ingo Molnar:
"Fix an argument order bug in the alarm timer forwarding logic, which
may cause missed expirations or incorrect overrun accounting"
* tag 'timers-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
alarmtimer: Fix argument order in alarm_timer_forward()
Merge tag 'locking-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull futex fixes from Ingo Molnar:
- Tighten up the sys_futex_requeue() ABI a bit, to disallow dissimilar
futex flags and potential UaF access (Peter Zijlstra)
- Fix UaF between futex_key_to_node_opt() and vma_replace_policy()
(Hao-Yu Yang)
- Clear stale exiting pointer in futex_lock_pi() retry path, which
triggered a warning (and potential misbehavior) in stress-testing
(Davidlohr Bueso)
* tag 'locking-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Clear stale exiting pointer in futex_lock_pi() retry path
futex: Fix UaF between futex_key_to_node_opt() and vma_replace_policy()
futex: Require sys_futex_requeue() to have identical flags
Merge tag 'trace-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix potential deadlock in osnoise and hotplug
The interface_lock can be taken by an osnoise thread, and the CPU
shutdown logic of osnoise can wait for this thread to finish. But
cpus_read_lock() can also be taken while holding the interface_lock.
This produces a circular lock dependency and can cause a deadlock.
Swap the ordering of cpus_read_lock() and the interface_lock to have
interface_lock taken within the cpus_read_lock() context to prevent
this circular dependency.
- Fix freeing of event triggers in early boot up
If the same trigger is added on the kernel command line, the second
one will fail to be applied and the trigger created will be freed.
This calls into the deferred logic and creates a kernel thread to do
the freeing. But the command line logic is called before kernel
threads can be created and this leads to a NULL pointer dereference.
Delay freeing event triggers until late init.
* tag 'trace-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Drain deferred trigger frees if kthread creation fails
tracing: Fix potential deadlock in cpu hotplug with osnoise
Fuzzing/stressing futexes triggered:
WARNING: kernel/futex/core.c:825 at wait_for_owner_exiting+0x7a/0x80, CPU#11: futex_lock_pi_s/524
When futex_lock_pi_atomic() sees the owner is exiting, it returns -EBUSY
and stores a refcounted task pointer in 'exiting'.
After wait_for_owner_exiting() consumes that reference, the local pointer
is never reset to nil. Upon a retry, if futex_lock_pi_atomic() returns a
different error, the bogus pointer is passed to wait_for_owner_exiting().
CPU0 CPU1 CPU2
futex_lock_pi(uaddr)
// acquires the PI futex
exit()
futex_cleanup_begin()
futex_state = EXITING;
futex_lock_pi(uaddr)
futex_lock_pi_atomic()
attach_to_pi_owner()
// observes EXITING
*exiting = owner; // takes ref
return -EBUSY
wait_for_owner_exiting(-EBUSY, owner)
put_task_struct(); // drops ref
// exiting still points to owner
goto retry;
futex_lock_pi_atomic()
lock_pi_update_atomic()
cmpxchg(uaddr)
*uaddr ^= WAITERS // whatever
// value changed
return -EAGAIN;
wait_for_owner_exiting(-EAGAIN, exiting) // stale
WARN_ON_ONCE(exiting)
Fix this by resetting upon retry, essentially aligning it with requeue_pi.
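A hedged sketch of the retry path after the fix (simplified):

	wait_for_owner_exiting(ret, exiting);
	/*
	 * The reference was consumed above; forget the stale pointer so a
	 * different error on the next iteration cannot hand it to
	 * wait_for_owner_exiting() again.
	 */
	exiting = NULL;
	goto retry;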
Fixes: 3ef240eaff ("futex: Prevent exit livelock")
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260326001759.4129680-1-dave@stgolabs.net
Boot-time trigger registration can fail before the trigger-data cleanup
kthread exists. Deferring those frees until late init is fine, but the
post-boot fallback must still drain the deferred list if kthread
creation never succeeds.
Otherwise, boot-deferred nodes can accumulate on
trigger_data_free_list, later frees fall back to synchronously freeing
only the current object, and the older queued entries are leaked
forever.
To trigger this, add the following to the kernel command line:
trace_event=sched_switch trace_trigger=sched_switch.traceon,sched_switch.traceon
The second traceon trigger will fail and be freed. This triggers a NULL
pointer dereference and crashes the kernel.
Keep the deferred boot-time behavior, but when kthread creation fails,
drain the whole queued list synchronously. Do the same in the late-init
drain path so queued entries are not stranded there either.
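A hedged sketch of the fallback; trigger_data_free_list is named above,
while the loop details and the synchronous free helper are illustrative:

	if (IS_ERR(kthread)) {
		struct event_trigger_data *data, *tmp;

		/*
		 * No kthread will ever run: drain everything queued so far
		 * synchronously instead of freeing only the current object.
		 */
		list_for_each_entry_safe(data, tmp, &trigger_data_free_list, list)
			free_trigger_data_sync(data);	/* hypothetical */
	}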
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260324221326.1395799-3-atwellwea@gmail.com
Fixes: 61d445af0a ("tracing: Add bulk garbage collection of freeing event_trigger_data")
Signed-off-by: Wesley Atwell <atwellwea@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Merge tag 'sysctl-7.00-fixes-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl
Pull sysctl fix from Joel Granados:
"Fix uninitialized variable error when writing to a sysctl bitmap
Removed the possibility of returning an unjustified -EINVAL when
writing to a sysctl bitmap"
* tag 'sysctl-7.00-fixes-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
sysctl: fix uninitialized variable in proc_do_large_bitmap
Merge tag 'pm-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix two cpufreq issues, one in the core and one in the
conservative governor, and two issues related to system sleep:
- Restore the cpufreq core behavior changed inadvertently during the
6.19 development cycle to call cpufreq_frequency_table_cpuinfo()
for cpufreq policies getting re-initialized which ensures that
policy->max and policy->cpuinfo_max_freq will be valid going
forward (Viresh Kumar)
- Adjust the cached requested frequency in the conservative cpufreq
governor on policy limits changes to prevent it from becoming stale
in some cases (Viresh Kumar)
- Prevent pm_restore_gfp_mask() from triggering a WARN_ON() in some
code paths in which it is legitimately called without invoking
pm_restrict_gfp_mask() previously (Youngjun Park)
- Update snapshot_write_finalize() to take trailing zero pages into
account properly which prevents user space restore from failing
subsequently in some cases (Alberto Garcia)"
* tag 'pm-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: sleep: Drop spurious WARN_ON() from pm_restore_gfp_mask()
PM: hibernate: Drain trailing zero pages on userspace restore
cpufreq: conservative: Reset requested_freq on limits change
cpufreq: Don't skip cpufreq_frequency_table_cpuinfo()
Merge tag 'dma-mapping-7.0-2026-03-25' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fixes from Marek Szyprowski:
"A set of fixes for DMA-mapping subsystem, which resolve false-
positive warnings from KMSAN and DMA-API debug (Shigeru Yoshida
and Leon Romanovsky) as well as a simple build fix (Miguel Ojeda)"
* tag 'dma-mapping-7.0-2026-03-25' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma-mapping: add missing `inline` for `dma_free_attrs`
mm/hmm: Indicate that HMM requires DMA coherency
RDMA/umem: Tell DMA mapping that UMEM requires coherency
iommu/dma: add support for DMA_ATTR_REQUIRE_COHERENT attribute
dma-direct: prevent SWIOTLB path when DMA_ATTR_REQUIRE_COHERENT is set
dma-mapping: Introduce DMA require coherency attribute
dma-mapping: Clarify valid conditions for CPU cache line overlap
dma-mapping: handle DMA_ATTR_CPU_CACHE_CLEAN in trace output
dma-debug: Allow multiple invocations of overlapping entries
dma: swiotlb: add KMSAN annotations to swiotlb_bounce()
During futex_key_to_node_opt() execution, vma->vm_policy is read under
speculative mmap lock and RCU. Concurrently, mbind() may call
vma_replace_policy() which frees the old mempolicy immediately via
kmem_cache_free().
This creates a race where __futex_key_to_node() dereferences a freed
mempolicy pointer, causing a use-after-free read of mpol->mode.
[ 151.412631] BUG: KASAN: slab-use-after-free in __futex_key_to_node (kernel/futex/core.c:349)
[ 151.414046] Read of size 2 at addr ffff888001c49634 by task e/87
[ 151.415969] Call Trace:
[ 151.416732] __asan_load2 (mm/kasan/generic.c:271)
[ 151.416777] __futex_key_to_node (kernel/futex/core.c:349)
[ 151.416822] get_futex_key (kernel/futex/core.c:374 kernel/futex/core.c:386 kernel/futex/core.c:593)
Fix by making __mpol_put() free the mempolicy via an RCU grace period.
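A hedged sketch of the idea; it assumes an rcu_head is added to struct
mempolicy and is not the verbatim patch:

	static void mpol_free_rcu(struct rcu_head *head)
	{
		struct mempolicy *pol = container_of(head, struct mempolicy, rcu);

		kmem_cache_free(policy_cache, pol);
	}

	void __mpol_put(struct mempolicy *pol)
	{
		if (!atomic_dec_and_test(&pol->refcnt))
			return;
		/* Defer the free by one grace period for RCU readers. */
		call_rcu(&pol->rcu, mpol_free_rcu);
	}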
Fixes: c042c50521 ("futex: Implement FUTEX2_MPOL")
Reported-by: Hao-Yu Yang <naup96721@gmail.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Hao-Yu Yang <naup96721@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Link: https://patch.msgid.link/20260324174418.GB1850007@noisy.programming.kicks-ass.net
Nicholas reported that his LLM found it was possible to create a UaF
when sys_futex_requeue() is used with different flags. The initial
motivation for allowing different flags was the variable sized futex,
but since that hasn't been merged (yet), simply mandate the flags are
identical, as is the case for the old style sys_futex() requeue
operations.
Fixes: 0f4b5f9722 ("futex: Add sys_futex_requeue()")
Reported-by: Nicholas Carlini <npc@anthropic.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
proc_do_large_bitmap() does not initialize variable c, which is expected
to be set to a trailing character by proc_get_long().
However, proc_get_long() only sets c when the input buffer contains a
trailing character after the parsed value.
If c is not initialized it may happen to contain a '-'. If this is the
case proc_do_large_bitmap() expects to be able to parse a second part of
the input buffer. If there is no second part an unjustified -EINVAL will
be returned.
Initialize c to 0 to prevent returning -EINVAL on valid input.
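A hedged sketch of the relevant parsing flow, simplified from
proc_do_large_bitmap():

	char c = 0;	/* was uninitialized; proc_get_long() may not set it */

	err = proc_get_long(&p, &left, &val_a, &neg, tr_a, sizeof(tr_a), &c);
	/* ... error handling ... */
	if (c == '-') {
		/* Only a genuine trailing '-' starts a "first-last" range. */
		err = proc_get_long(&p, &left, &val_b, &neg, tr_b,
				    sizeof(tr_b), &c);
		/* ... error handling ... */
	}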
Fixes: 9f977fb7ae ("sysctl: add proc_do_large_bitmap")
Signed-off-by: Marc Buerg <buermarc@googlemail.com>
Reviewed-by: Joel Granados <joel.granados@kernel.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Merge tag 'rcu-fixes.v7.0-20260325a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux
Pull RCU fixes from Boqun Feng:
"Fix a regression introduced by commit c27cea4416 ("rcu: Re-implement
RCU Tasks Trace in terms of SRCU-fast"): BPF contexts can run with
preemption disabled or scheduler locks held, so call_srcu() must work
in all such contexts.
Fix this by converting SRCU's spinlocks to raw spinlocks and avoiding
scheduler lock acquisition in call_srcu() by deferring to an irq_work
(similar to call_rcu_tasks_generic()), for both tree SRCU and tiny
SRCU.
Also fix a follow-on lockdep splat caused by srcu_node allocation
under the newly introduced raw spinlock by deferring the allocation to
grace-period worker context"
* tag 'rcu-fixes.v7.0-20260325a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux:
srcu: Use irq_work to start GP in tiny SRCU
rcu: Use an intermediate irq_work to start process_srcu()
srcu: Push srcu_node allocation to GP when non-preemptible
srcu: Use raw spinlocks so call_srcu() can be used under preempt_disable()
cgroup_drain_dying() was using cgroup_is_populated() to test whether there are
dying tasks to wait for. cgroup_is_populated() tests nr_populated_csets,
nr_populated_domain_children and nr_populated_threaded_children, but
cgroup_drain_dying() only needs to care about this cgroup's own tasks - whether
there are children is cgroup_destroy_locked()'s concern.
This caused hangs during shutdown. When systemd tried to rmdir a cgroup that had
no direct tasks but had a populated child, cgroup_drain_dying() would enter its
wait loop because cgroup_is_populated() was true from
nr_populated_domain_children. The task iterator found nothing to wait for, yet
the populated state never cleared because it was driven by live tasks in the
child cgroup.
Fix it by using cgroup_has_tasks() which only tests nr_populated_csets.
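A hedged before/after sketch of the predicate; the retry helper is
illustrative:

	/* before: also true when only a child cgroup is populated */
	if (cgroup_is_populated(cgrp))
		wait_for_task_dead_and_retry(cgrp);

	/* after: true only when this cgroup's own csets hold tasks */
	if (cgroup_has_tasks(cgrp))
		wait_for_task_dead_and_retry(cgrp);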
v3: Fix cgroup_is_populated() -> cgroup_has_tasks() (Sebastian).
v2: https://lore.kernel.org/r/20260323200205.1063629-1-tj@kernel.org
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Fixes: 1b164b876c ("cgroup: Wait for dying tasks to leave on rmdir")
Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tiny SRCU's srcu_gp_start_if_needed() directly calls schedule_work(),
which acquires the workqueue pool->lock.
This causes a lockdep splat when call_srcu() is called with a scheduler
lock held, due to:
call_srcu() [holding pi_lock]
srcu_gp_start_if_needed()
schedule_work() -> pool->lock
workqueue_init() / create_worker() [holding pool->lock]
wake_up_process() -> try_to_wake_up() -> pi_lock
Also add irq_work_sync() to cleanup_srcu_struct() to prevent a
use-after-free if a queued irq_work fires after cleanup begins.
Tested with rcutorture SRCU-T and no lockdep warnings.
[ Thanks to Boqun for similar fix in patch "rcu: Use an intermediate irq_work
to start process_srcu()" ]
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun@kernel.org>
Since commit c27cea4416 ("rcu: Re-implement RCU Tasks Trace in terms
of SRCU-fast") we switched to SRCU in BPF. However as BPF instrument can
happen basically everywhere (including where a scheduler lock is held),
call_srcu() now needs to avoid acquiring scheduler lock because
otherwise it could cause deadlock [1]. Fix this by following what the
previous RCU Tasks Trace did: using an irq_work to delay the queuing of
the work to start process_srcu().
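A hedged sketch of the indirection; member names are illustrative:

	static void srcu_gp_irq_workfn(struct irq_work *iwp)
	{
		struct srcu_struct *ssp = container_of(iwp, struct srcu_struct,
						       gp_irq_work);

		/*
		 * Runs later in hardirq context with no scheduler lock held,
		 * so taking the workqueue pool->lock here is safe.
		 */
		queue_work(system_wq, &ssp->srcu_gp_work);
	}

	/* In the call_srcu() path, instead of calling schedule_work(): */
	irq_work_queue(&ssp->gp_irq_work);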
[boqun: Apply Joel's feedback]
[boqun: Apply Andrea's test feedback]
Reported-by: Andrea Righi <arighi@nvidia.com>
Closes: https://lore.kernel.org/all/abjzvz_tL_siV17s@gpd4/
Fixes: c27cea4416 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast")
Link: https://lore.kernel.org/rcu/3c4c5a29-24ea-492d-aeee-e0d9605b4183@nvidia.com/ [1]
Suggested-by: Zqiang <qiang.zhang@linux.dev>
Tested-by: Andrea Righi <arighi@nvidia.com>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Boqun Feng <boqun@kernel.org>
When the srcutree.convert_to_big and srcutree.big_cpu_lim kernel boot
parameters specify initialization-time allocation of the srcu_node
tree for statically allocated srcu_struct structures (for example, in
DEFINE_SRCU() at build time instead of init_srcu_struct() at runtime),
init_srcu_struct_nodes() will attempt to dynamically allocate this tree
at the first run-time update-side use of this srcu_struct structure,
but while holding a raw spinlock. Because the memory allocator can
acquire non-raw spinlocks, this can result in lockdep splats.
This commit therefore uses the same SRCU_SIZE_ALLOC trick that is used
when the first run-time update-side use of this srcu_struct structure
happens before srcu_init() is called. The actual allocation then takes
place from workqueue context at the ends of upcoming SRCU grace periods.
[boqun: Adjust the sha1 of the Fixes tag]
Fixes: 175b45ed34 ("srcu: Use raw spinlocks so call_srcu() can be used under preempt_disable()")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun@kernel.org>
Tree SRCU has used non-raw spinlocks for many years, motivated by a desire
to avoid unnecessary real-time latency and the absence of any reason to
use raw spinlocks. However, the recent use of SRCU in tracing as the
underlying implementation of RCU Tasks Trace means that call_srcu()
is invoked from preemption-disabled regions of code, which in turn
requires that any locks acquired by call_srcu() or its callees must be
raw spinlocks.
This commit therefore converts SRCU's spinlocks to raw spinlocks.
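A hedged sketch of the conversion pattern in tree SRCU, where the lock
lives in srcu_usage; raw spinlocks never sleep, even on PREEMPT_RT, so
they remain safe under preempt_disable():

	raw_spin_lock_irqsave(&ssp->srcu_sup->lock, flags);
	/* ... queue the callback, snapshot grace-period state ... */
	raw_spin_unlock_irqrestore(&ssp->srcu_sup->lock, flags);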
[boqun: Add Fixes tag]
Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Fixes: c27cea4416 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Try to be more explicit about why the workqueue watchdog does not take
pool->lock by default. Spin locks are full memory barriers, which can
delay everything around them; most obviously, they would primarily
delay operations on the related worker pools.
Explain why re-checking the timestamp under pool->lock is enough to
prevent the false positive.
Finally, make it clear what the alternative solution in __queue_work(),
which is a hotter path, would be.
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
alarm_timer_forward() passes arguments to alarm_forward() in the wrong
order:
alarm_forward(alarm, timr->it_interval, now);
However, alarm_forward() is defined as:
u64 alarm_forward(struct alarm *alarm, ktime_t now, ktime_t interval);
and uses the second argument as the current time:
delta = ktime_sub(now, alarm->node.expires);
Passing the interval as "now" results in incorrect delta computation,
which can lead to missed expirations or incorrect overrun accounting.
This issue has been present since the introduction of
alarm_timer_forward().
Fix this by swapping the arguments.
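With the prototype above, the corrected call reads:

	alarm_forward(alarm, now, timr->it_interval);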
Fixes: e7561f1633 ("alarmtimer: Implement forward callback")
Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260323061130.29991-1-zhanxusheng@xiaomi.com
a72f73c4dd ("cgroup: Don't expose dead tasks in cgroup") hid PF_EXITING
tasks from cgroup.procs so that systemd doesn't see tasks that have already
been reaped via waitpid(). However, the populated counter (nr_populated_csets)
is only decremented when the task later passes through cgroup_task_dead() in
finish_task_switch(). This means cgroup.procs can appear empty while the
cgroup is still populated, causing rmdir to fail with -EBUSY.
Fix this by making cgroup_rmdir() wait for dying tasks to fully leave. If the
cgroup is populated but all remaining tasks have PF_EXITING set (the task
iterator returns none due to the existing filter), wait for a kick from
cgroup_task_dead() and retry. The wait is brief as tasks are removed from the
cgroup's css_set between PF_EXITING assertion in do_exit() and
cgroup_task_dead() in finish_task_switch().
v2: cgroup_is_populated() true to false transition happens under css_set_lock
not cgroup_mutex, so retest under css_set_lock before sleeping to avoid
missed wakeups (Sebastian).
Fixes: a72f73c4dd ("cgroup: Don't expose dead tasks in cgroup")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202603222104.2c81684e-lkp@intel.com
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Bert Karwatzki <spasswolf@web.de>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: cgroups@vger.kernel.org
Commit 35e4a69b20 ("PM: sleep: Allow pm_restrict_gfp_mask()
stacking") introduced refcount-based GFP mask management that warns
when pm_restore_gfp_mask() is called with saved_gfp_count == 0.
Some hibernation paths call pm_restore_gfp_mask() defensively where
the GFP mask may or may not be restricted depending on the execution
path. For example, the uswsusp interface invokes it in
SNAPSHOT_CREATE_IMAGE, SNAPSHOT_UNFREEZE, and snapshot_release().
Before the stacking change this was a silent no-op; it now triggers
a spurious WARNING.
Remove the WARN_ON() wrapper from the !saved_gfp_count check while
retaining the check itself, so that defensive calls remain harmless
without producing false warnings.
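A hedged sketch of the resulting check in pm_restore_gfp_mask():

	/* was: if (WARN_ON(!saved_gfp_count)) return; */
	if (!saved_gfp_count)
		return;	/* defensive call; nothing to restore */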
Fixes: 35e4a69b20 ("PM: sleep: Allow pm_restrict_gfp_mask() stacking")
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
[ rjw: Subject tweak ]
Link: https://patch.msgid.link/20260322120528.750178-1-youngjun.park@lge.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit 005e8dddd4 ("PM: hibernate: don't store zero pages in the
image file") added an optimization to skip zero-filled pages in the
hibernation image. On restore, zero pages are handled internally by
snapshot_write_next() in a loop that processes them without returning
to the caller.
With the userspace restore interface, writing the last non-zero page
to /dev/snapshot is followed by the SNAPSHOT_ATOMIC_RESTORE ioctl. At
this point there are no more calls to snapshot_write_next() so any
trailing zero pages are not processed, snapshot_image_loaded() fails
because handle->cur is smaller than expected, the ioctl returns -EPERM
and the image is not restored.
The in-kernel restore path is not affected by this because the loop in
load_image() in swap.c calls snapshot_write_next() until it returns 0.
It is this final call that drains any trailing zero pages.
Fix this by calling snapshot_write_next() in snapshot_write_finalize(),
giving the kernel the chance to drain any trailing zero pages.
Fixes: 005e8dddd4 ("PM: hibernate: don't store zero pages in the image file")
Signed-off-by: Alberto Garcia <berto@igalia.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Link: https://patch.msgid.link/ef5a7c5e3e3dbd17dcb20efaa0c53a47a23498bb.1773075892.git.berto@igalia.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Alexei Starovoitov:
- Fix how linked registers track zero extension of subregisters (Daniel
Borkmann)
- Fix unsound scalar fork for OR instructions (Daniel Wade)
- Fix exception exit lock check for subprogs (Ihor Solodrai)
- Fix undefined behavior in interpreter for SDIV/SMOD instructions
(Jenny Guanni Qu)
- Release module's BTF when module is unloaded (Kumar Kartikeya
Dwivedi)
- Fix constant blinding for PROBE_MEM32 instructions (Sachin Kumar)
- Reset register ID for END instructions to prevent incorrect value
tracking (Yazhou Tang)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Add a test cases for sync_linked_regs regarding zext propagation
bpf: Fix sync_linked_regs regarding BPF_ADD_CONST32 zext propagation
selftests/bpf: Add tests for maybe_fork_scalars() OR vs AND handling
bpf: Fix unsound scalar forking in maybe_fork_scalars() for BPF_OR
selftests/bpf: Add tests for sdiv32/smod32 with INT_MIN dividend
bpf: Fix undefined behavior in interpreter sdiv/smod for INT_MIN
selftests/bpf: Add tests for bpf_throw lock leak from subprogs
bpf: Fix exception exit lock checking for subprogs
bpf: Release module BTF IDR before module unload
selftests/bpf: Fix pkg-config call on static builds
bpf: Fix constant blinding for PROBE_MEM32 stores
selftests/bpf: Add test for BPF_END register ID reset
bpf: Reset register ID for BPF_END value tracking
- Revert "tracing: Remove pid in task_rename tracing output"
A change was made to remove the pid field from the task_rename event
because it was thought that the rename was always done for the current
task and recording the pid would be redundant. This turned out to be
incorrect: there are a few corner cases where this is not true, and the
change caused some regressions in tooling.
- Fix the reading from user space for migration
The reading of user space uses seqlock-style logic with a per-cpu
temporary buffer: it disables migration, then enables preemption, does
the copy from user space, disables preemption, enables migration, and
checks whether any schedule switches happened while preemption was
enabled. If there was a context switch, the per-cpu buffer is
considered potentially corrupted and the copy is retried. As a
protection against live lock, if it takes a hundred tries, a warning
is issued and the code bails out.
This was triggered because the task was selected by the load balancer to
be migrated to another CPU: every time preemption was enabled, the
migration task would schedule in, try to migrate the task, fail because
migration was disabled, and let it run again. The scheduler therefore
scheduled out the task every time it enabled preemption, so the loop
never exited (until the 100-iteration check triggered).
Fix this by enabling and disabling preemption while keeping migration
enabled if the read from user space needs to be done again. This lets
the migration thread migrate the task, and the copy from user space
will likely succeed on the next iteration.
- Fix trace_marker copy option freeing
The "copy_trace_marker" option allows a tracing instance to get a copy of
a write to the trace_marker file of the top level instance. This is
managed by a linked list protected by RCU. When an instance is removed, a
check is made whether the option is set, and if so synchronize_rcu() is
called. The problem is that the iteration that resets all the flags to
what they were when the instance was created (to perform cleanups) was
done before the check of the copy_trace_marker option, and it cleared
that option, so synchronize_rcu() was never called.
Move the clearing of all the flags to after the copy_trace_marker check,
so that the option is still set when the check runs and
synchronize_rcu() is performed.
- Fix entries setting when validating the persistent ring buffer
When validating the persistent ring buffer on boot up, the number of
events per sub-buffer is added to the sub-buffer meta page. The validator
was updating cpu_buffer->head_page (the first sub-buffer of the per-cpu
buffer) and not the "head_page" variable that was iterating the
sub-buffers. This was causing the first sub-buffer to be assigned the
entries for each sub-buffer and not the sub-buffer that was supposed to be
updated.
- Use "hash" value to update the direct callers
When updating the ftrace direct callers, it assigned a temporary callback
to all the callback functions of the ftrace ops and not just the
functions represented by the passed-in hash. This causes an unnecessary
slowdown of the functions of the ftrace_ops that are not being modified.
Only make the functions that are going to be modified call the ftrace
loop function, so that the update is limited to those functions.
-----BEGIN PGP SIGNATURE-----
iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCacAMahQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qr0sAQCoI4L3iAR5HU1z8dw2GWhOz9fTnzfw
9VPRZAsga9J5xgEA1Y0bvKBM0UPHFAL2POkaILYV1aT00lZ7aIVHPqfdYgA=
=OoGW
-----END PGP SIGNATURE-----
Merge tag 'trace-v7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Revert "tracing: Remove pid in task_rename tracing output"
A change was made to remove the pid field from the task_rename event
because it was thought that the rename was always done for the
current task and recording the pid would be redundant. This turned
out to be incorrect: there are a few corner cases where this is not
true, and the change caused some regressions in tooling.
- Fix the reading from user space for migration
The reading of user space uses seqlock-style logic with a per-cpu
temporary buffer: it disables migration, then enables preemption,
does the copy from user space, disables preemption, enables
migration, and checks whether any schedule switches happened while
preemption was enabled. If there was a context switch, the per-cpu
buffer is considered potentially corrupted and the copy is retried.
As a protection against live lock, if it takes a hundred tries, a
warning is issued and the code bails out.
This was triggered because the task was selected by the load balancer
to be migrated to another CPU: every time preemption was enabled, the
migration task would schedule in, try to migrate the task, fail
because migration was disabled, and let it run again. The scheduler
therefore scheduled out the task every time it enabled preemption, so
the loop never exited (until the 100-iteration check triggered).
Fix this by enabling and disabling preemption while keeping migration
enabled if the read from user space needs to be done again. This lets
the migration thread migrate the task, and the copy from user space
will likely succeed on the next iteration (see the sketch after this
list).
- Fix trace_marker copy option freeing
The "copy_trace_marker" option allows a tracing instance to get a
copy of a write to the trace_marker file of the top level instance.
This is managed by a linked list protected by RCU. When an instance
is removed, a check is made whether the option is set, and if so
synchronize_rcu() is called.
The problem is that the iteration that resets all the flags to what
they were when the instance was created (to perform cleanups) was done
before the check of the copy_trace_marker option, and it cleared that
option, so synchronize_rcu() was never called.
Move the clearing of all the flags to after the copy_trace_marker
check, so that the option is still set when the check runs and
synchronize_rcu() is performed (a sketch of the corrected ordering
follows the shortlog below).
- Fix entries setting when validating the persistent ring buffer
When validating the persistent ring buffer on boot up, the number of
events per sub-buffer is added to the sub-buffer meta page. The
validator was updating cpu_buffer->head_page (the first sub-buffer of
the per-cpu buffer) and not the "head_page" variable that was
iterating the sub-buffers. This was causing the first sub-buffer to
be assigned the entries for each sub-buffer and not the sub-buffer
that was supposed to be updated.
- Use "hash" value to update the direct callers
When updating the ftrace direct callers, it assigned a temporary
callback to all the callback functions of the ftrace ops and not just
the functions represented by the passed-in hash. This causes an
unnecessary slowdown of the functions of the ftrace_ops that are not
being modified. Only make the functions that are going to be modified
call the ftrace loop function, so that the update is limited to those
functions.
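As referenced above, an illustrative sketch of the fixed retry loop; the
helpers this_rq_nr_switches() and per_cpu_buf() are invented names for
this example, not the actual tracing code:

    for (try = 0; try < 100; try++) {
            preempt_disable();
            cpu = smp_processor_id();
            switches = this_rq_nr_switches();       /* hypothetical */
            preempt_enable();

            /* Migration stays enabled here, so the migration thread can
             * move the task between attempts. The copy may fault and
             * schedule, letting others reuse the per-cpu buffer. */
            if (copy_from_user(per_cpu_buf(cpu), uptr, len))
                    return -EFAULT;

            preempt_disable();
            stable = (cpu == smp_processor_id() &&
                      switches == this_rq_nr_switches());
            preempt_enable();

            if (stable)
                    return 0;       /* buffer contents are trustworthy */
    }
    WARN_ONCE(1, "user space read kept racing with scheduling");
    return -EBUSY;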
* tag 'trace-v7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ftrace: Use hash argument for tmp_ops in update_ftrace_direct_mod
ring-buffer: Fix to update per-subbuf entries of persistent ring buffer
tracing: Fix trace_marker copy link list updates
tracing: Fix failure to read user space from system call trace events
tracing: Revert "tracing: Remove pid in task_rename tracing output"
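And a sketch of the corrected teardown ordering for the trace_marker
copy option; the flag and helper names (TRACE_ITER_COPY_MARKER,
marker_list, reset_instance_flags()) are assumptions for illustration:

    static void instance_teardown(struct trace_array *tr)
    {
            /* Check the option while it is still set: unlink this
             * instance from the RCU-protected copy list and wait for
             * in-flight readers before the instance can be freed. */
            if (tr->trace_flags & TRACE_ITER_COPY_MARKER) {
                    list_del_rcu(&tr->marker_list);
                    synchronize_rcu();
            }

            /* Only now reset all flags to their creation-time values;
             * doing this first (the bug) cleared the bit above and
             * skipped the synchronize_rcu(). */
            reset_instance_flags(tr);
    }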
- Fix a PMU driver crash on AMD EPYC systems, caused by
a race condition in x86_pmu_enable()
- Fix a possible counter-initialization bug in x86_pmu_enable()
- Fix a counter inheritance bug in inherit_event() and
__perf_event_read()
- Fix an Intel PMU driver branch constraints handling bug
found by UBSAN
- Fix the Intel PMU driver's new Off-Module Response (OMR)
support code for Diamond Rapids / Nova Lake, to fix a snoop
information parsing bug
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmm/ptcRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1i8yBAAsbAbTzyJOxiE29beCiW3r9V4ELtrSlpG
syUtZ4Y1cozK1w6/8S2oWHJkuWa+ToO6bn9rKxNemWIrnsxe5B4iKfeNWRFTz24g
41xTXaxUB9c7vBgv4BDvr/ykkYGQybkn0Bf/U5rufzvIlst9bx7zKVAnIT9Qws37
UTMY96XGYY5HNzGSZbQkpQ4cs8n72U+00OBHMTWtH8NJT+fRmaM312Q8F6wNKgH2
YtaAjwb55BU5+hQUz5YN96xQGYaoj0s8UtOk4a3tS/t0F8mOodDVTxzMdHKToQmD
SbscGvfC+bg5zjoYGFEU+cXoBMkkZlBPqZdLVQAGEy3YZT+JIdmyJhn9FD6HVuVw
OzyG9VuY+TxOFrFQdMs3Xfa0vZ7AO1c90HB08P4T7nWaMioR1iFAF31MSVEMXuzd
ZROHplWNIqDeOzmerXgZq4JWy3Bpaam8fH1B5/qN450oAxaym3lCOoCZidJYgy3g
CVBF/6BO7DlpKiy9lXknRscItwIiRmZ9Xr+sOmOMQRGqQkC6Ykk/Hj2sA1qKPuQ5
ruRqqaL9cznttAoJR2jZ0Ijyu7usgxB/y066nR1zXKdvEdNcntaic+QybHxbQoYx
kyYNoR1dg+AWLb5juT68abkP4trZ8EUa7Q29OX1SzTk+0U7M0fO3/rq9gt71HiHR
WSjRDQPfWrI=
=jp5H
-----END PGP SIGNATURE-----
Merge tag 'perf-urgent-2026-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
- Fix a PMU driver crash on AMD EPYC systems, caused by
a race condition in x86_pmu_enable()
- Fix a possible counter-initialization bug in x86_pmu_enable()
- Fix a counter inheritance bug in inherit_event() and
__perf_event_read()
- Fix an Intel PMU driver branch constraints handling bug
found by UBSAN
- Fix the Intel PMU driver's new Off-Module Response (OMR)
support code for Diamond Rapids / Nova Lake, to fix a snoop
information parsing bug
* tag 'perf-urgent-2026-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Fix OMR snoop information parsing issues
perf/x86/intel: Add missing branch counters constraint apply
perf: Make sure to use pmu_ctx->pmu for groups
x86/perf: Make sure to program the counter value for stopped events on migration
perf/x86: Move event pointer setup earlier in x86_pmu_enable()
On weakly ordered architectures (e.g., arm64), the lockless check in
wq_watchdog_timer_fn() can observe a reordering between the worklist
insertion and the last_progress_ts update. Specifically, the watchdog
can see a non-empty worklist (from a list_add) while reading a stale
last_progress_ts value, causing a false positive stall report.
This was confirmed by reading pool->last_progress_ts again after holding
pool->lock in wq_watchdog_timer_fn():
workqueue watchdog: pool 7 false positive detected!
lockless_ts=4784580465 locked_ts=4785033728
diff=453263ms worklist_empty=0
To avoid slowing down the hot path (queue_work(), etc.), recheck
last_progress_ts with pool->lock held. This eliminates the false
positives with minimal overhead. Also remove two extra empty lines in
wq_watchdog_timer_fn() while at it.
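A sketch of the resulting check; the names last_progress_ts, pool->lock
and worklist come from this message, while the surrounding watchdog
structure and report_stall() are assumptions:

    ts = READ_ONCE(pool->last_progress_ts);
    if (time_after(jiffies, ts + thresh) &&
        !list_empty(&pool->worklist)) {
            /* Suspected stall: only now take pool->lock and re-read the
             * timestamp, ruling out a reordered list_add vs. timestamp
             * update on weakly ordered architectures. */
            raw_spin_lock_irq(&pool->lock);
            ts = pool->last_progress_ts;
            raw_spin_unlock_irq(&pool->lock);

            if (time_after(jiffies, ts + thresh))
                    report_stall(pool);     /* hypothetical helper */
    }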
Fixes: 82607adcf9 ("workqueue: implement lockup detector")
Cc: stable@vger.kernel.org # v4.5+
Assisted-by: claude-code:claude-opus-4-6
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
In the WAKE_SYNC path of scx_select_cpu_dfl(), waker_node was computed
with cpu_to_node(), while node (for prev_cpu) was computed with
scx_cpu_node_if_enabled(). When scx_builtin_idle_per_node is disabled,
idle_cpumask(waker_node) is called with a real node ID even though
per-node idle tracking is disabled, resulting in undefined behavior.
Fix by using scx_cpu_node_if_enabled() for waker_node as well, ensuring
both variables are computed consistently.
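The change amounts to a one-liner, sketched here under the assumption
(consistent with the message above) that scx_cpu_node_if_enabled()
returns a non-node value when per-node idle tracking is disabled:

    /* Before: waker_node could be a real node ID even with
     * scx_builtin_idle_per_node disabled, unlike the node computed
     * for prev_cpu. */
    waker_node = cpu_to_node(cpu);

    /* After: both node IDs are computed the same way, so
     * idle_cpumask(waker_node) is never handed a real node ID while
     * per-node idle tracking is off. */
    waker_node = scx_cpu_node_if_enabled(cpu);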
Fixes: 48849271e6 ("sched_ext: idle: Per-node idle cpumasks")
Cc: stable@vger.kernel.org # v6.15+
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The modify logic registers a temporary ftrace_ops object (tmp_ops) to
trigger the slow path for all direct callers, so that the attached
addresses can be modified safely.
At the moment we use ops->func_hash as the tmp_ops filter, which
represents all of the ops' attachments. It's faster to use just the
passed-in hash as the filter, which contains only the modified sites and
is always a subset of ops->func_hash.
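Illustratively (treating filter_hash as the effective filter of the
ops; a sketch of the intent, not the actual diff):

    /* Before: the temporary ops inherited every attachment, forcing
     * the slow path on sites that are not being modified. */
    tmp_ops->func_hash->filter_hash = ops->func_hash->filter_hash;

    /* After: filter tmp_ops with the hash passed to
     * update_ftrace_direct_mod(), i.e. only the modified sites,
     * always a subset of ops->func_hash. */
    tmp_ops->func_hash->filter_hash = hash;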
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Menglong Dong <menglong8.dong@gmail.com>
Cc: Song Liu <song@kernel.org>
Link: https://patch.msgid.link/20260312123738.129926-1-jolsa@kernel.org
Fixes: e93672f770 ("ftrace: Add update_ftrace_direct_mod function")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Since the validation loop in rb_meta_validate_events() always updates the
same cpu_buffer->head_page->entries, the entries of the other sub-buffers
are never updated. Fix it to update the entries field through head_page,
which is the cursor of this loop.
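Schematically (the loop structure and the count_events() helper are
assumptions for illustration; head_page is the loop cursor):

    struct buffer_page *head_page = cpu_buffer->head_page;

    do {
            int entries = count_events(head_page);  /* hypothetical */

            /* Bug: this always wrote the first sub-buffer:
             *   local_set(&cpu_buffer->head_page->entries, entries);
             * Fix: write through the loop cursor instead: */
            local_set(&head_page->entries, entries);

            rb_inc_page(&head_page);        /* advance the cursor */
    } while (head_page != cpu_buffer->head_page);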
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ian Rogers <irogers@google.com>
Fixes: 5f3b6e839f ("ring-buffer: Validate boot range memory events")
Link: https://patch.msgid.link/177391153882.193994.17158784065013676533.stgit@mhiramat.tok.corp.google.com
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>