Now that dev_get_cma_area() is no longer inline, we don't have any user
of dma_contiguous_default_area outside of contiguous.c, so we can make
it static.
Signed-off-by: Maxime Ripard <mripard@kernel.org>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20260331-dma-buf-heaps-as-modules-v4-3-e18fda504419@kernel.org
As we try to enable dma-buf heaps, and the CMA one in particular, to
compile as modules, we need to export dev_get_cma_area(). It's currently
implemented as an inline function that returns either the content of
device->cma_area or dma_contiguous_default_area.
This means we need to export dma_contiguous_default_area, which
isn't really something we want any module to have access to.
Instead, let's make dev_get_cma_area() a proper function that we can
export, so we avoid exporting dma_contiguous_default_area.
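For illustration, the out-of-line version could look roughly like the
existing inline helper moved into contiguous.c (a sketch, not
necessarily the exact patch):

  struct cma *dev_get_cma_area(struct device *dev)
  {
  	if (dev && dev->cma_area)
  		return dev->cma_area;
  	return dma_contiguous_default_area;
  }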
Signed-off-by: Maxime Ripard <mripard@kernel.org>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20260331-dma-buf-heaps-as-modules-v4-2-e18fda504419@kernel.org
The CMA heap instantiation was initially developed by having the
contiguous DMA code call into the CMA heap to create a new instance
every time a reserved memory area is probed.
Turning the CMA heap into a module would create a dependency of the
kernel on a module, which doesn't work.
Let's turn the logic around: store all the reserved-memory CMA regions
in the contiguous DMA code, and provide an iterator for the heap to use
when it probes.
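A minimal sketch of the inversion, with illustrative names (the actual
API may differ):

  /* Contiguous DMA code: remember each reserved-memory CMA region. */
  static struct cma *dma_cma_regions[MAX_CMA_AREAS];
  static unsigned int dma_cma_nr_regions;

  /* Iterator for the heap to walk the stored regions at probe time. */
  int dma_contiguous_for_each_region(int (*fn)(struct cma *cma, void *data),
  				     void *data)
  {
  	unsigned int i;
  	int ret;

  	for (i = 0; i < dma_cma_nr_regions; i++) {
  		ret = fn(dma_cma_regions[i], data);
  		if (ret)
  			return ret;
  	}
  	return 0;
  }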
Signed-off-by: Maxime Ripard <mripard@kernel.org>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20260331-dma-buf-heaps-as-modules-v4-1-e18fda504419@kernel.org
This commit tests invoking call_srcu() with preemption both enabled
and disabled, the latter by acquiring the pi lock.
[ Joel: reword commit message. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Add a Kconfig option to set the default value of the
kernel.panic_on_rcu_stall sysctl, allowing the kernel to be built
with panic-on-RCU-stall enabled by default.
This is useful for high-availability systems that require automatic
recovery (via panic_timeout) when a CPU stall is detected, without
needing userspace to configure the sysctl at boot.
This follows the pattern established by BOOTPARAM_SOFTLOCKUP_PANIC
and BOOTPARAM_HUNG_TASK_PANIC. The runtime sysctl can still override
the Kconfig default.
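A sketch of how the default could be seeded (the Kconfig symbol name
here is illustrative, not necessarily the one in the patch):

  int sysctl_panic_on_rcu_stall = IS_ENABLED(CONFIG_BOOTPARAM_RCU_STALL_PANIC);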
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Gustavo Luiz Duarte <gustavold@gmail.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Currently, all calls to torture_hrtimeout_ns() either provide a non-zero
fuzzt_ns or a NULL trsp, either of which avoids taking the modulus of a
zero-valued fuzzt_ns. But this code should do a better job of defending
itself, so this commit explicitly checks fuzzt_ns and avoids the modulus
when its value is zero.
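In sketch form, the defensive check amounts to (variable names follow
the commit text; the surrounding code is elided):

  /* Only fuzz the sleep duration when fuzzing is actually requested;
   * a zero-valued fuzzt_ns would otherwise mean a modulus by zero. */
  if (fuzzt_ns > 0 && trsp)
  	baset_ns += torture_random(trsp) % fuzzt_ns;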
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
The bypass flush decision logic is duplicated in rcu_nocb_try_bypass()
and nocb_gp_wait() with similar conditions.
This commit therefore extracts the functionality into a common helper
function, nocb_bypass_needs_flush(), improving code readability.
A flush_faster parameter is added to control the flushing thresholds
and timeouts. This design was in the original commit d1b222c6be
("rcu/nocb: Add bypass callback queueing") to avoid having the GP
kthread aggressively flush the bypass queue.
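The helper could take roughly this shape (a sketch; the actual
thresholds, fields, and parameters may differ):

  static bool nocb_bypass_needs_flush(struct rcu_data *rdp, long bypass_ncbs,
  				      unsigned long j, bool flush_faster)
  {
  	/* flush_faster selects the tighter limits used on the enqueue
  	 * path; nocb_gp_wait() passes false so that the GP kthread
  	 * does not flush the bypass queue aggressively. */
  	long limit = flush_faster ? qhimark : 2 * qhimark;

  	return bypass_ncbs > limit ||
  	       time_after(j, READ_ONCE(rdp->nocb_bypass_first) + HZ);
  }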
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
The rcu_nocb_cpu_offload() and rcu_nocb_cpu_deoffload() functions are
nearly duplicates.
Therefore, extract the common logic into rcu_nocb_cpu_toggle_offload()
which takes an 'offload' boolean, and make both exported functions
simple wrappers.
This eliminates a bunch of duplicate code at the call sites, namely
mutex locking, CPU hotplug locking and CPU online checks.
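After the extraction, the exported functions reduce to wrappers along
these lines (sketch):

  int rcu_nocb_cpu_deoffload(int cpu)
  {
  	return rcu_nocb_cpu_toggle_offload(cpu, false);
  }
  EXPORT_SYMBOL_GPL(rcu_nocb_cpu_deoffload);

  int rcu_nocb_cpu_offload(int cpu)
  {
  	return rcu_nocb_cpu_toggle_offload(cpu, true);
  }
  EXPORT_SYMBOL_GPL(rcu_nocb_cpu_offload);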
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
The cblist_init_generic() function is executed during the early boot
phase due to commit 30ef09635b9e ("rcu-tasks: Initialize callback
lists at rcu_init() time"). At that time, only the boot CPU is
online and interrupts are disabled. This commit therefore uses plain
assignments in place of smp_store_release() and WRITE_ONCE() in
cblist_init_generic().
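For example (sketch; the field names are illustrative):

  /* Single boot CPU, IRQs disabled: no concurrent readers exist yet,
   * so no ordering or tearing protection is needed. */
  rtp->percpu_enqueue_lim = 1;	/* was smp_store_release() */
  rtp->percpu_dequeue_lim = 1;	/* was WRITE_ONCE() */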
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
The torture_shutdown_init() function spawns a shutdown kthread in
a manner very similar to that implemented by rcu_scale_shutdown().
This commit therefore re-implements rcu_scale_shutdown() in terms of
torture_shutdown_init().
This patch was generated by Claude given as input the patch making the
same transformation of ref_scale_shutdown().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
The torture_shutdown_init() function spawns a shutdown kthread in
a manner very similar to that implemented by ref_scale_shutdown().
This commit therefore re-implements ref_scale_shutdown() in terms of
torture_shutdown_init().
The initial draft of this patch was generated by version 2.1.16 of the
Claude AI/LLM, but trained and configured for use by my employer, and
prompted to refer to Linux-kernel source code. This initial draft failed
to provide a forward reference to ref_scale_cleanup(), passed zero to
torture_shutdown_init() for an unwelcome insta-shutdown, and failed to
pass the kvm.sh --duration argument in as a refscale module parameter.
On the other hand, it did catch the need to NULL main_task on the
post-test self-shutdown code path, which I might well have forgotten
to do.
This version of the patch fixes those problems, and in fact very little
of the initial draft remains.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
This commit adds a trivial textbook implementation of preemptible RCU
to rcutorture ("torture_type=trivial-preempt"), similar in spirit to the
existing "torture_type=trivial" textbook implementation of non-preemptible
RCU. Neither trivial RCU implementation has any value for production use;
both are intended only to keep Paul honest in his introductory writings
and presentations.
[ paulmck: Apply kernel test robot feedback. ]
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
dev_energymodel_nl_get_perf_domains_doit() calls
em_perf_domain_get_by_id() but does not check the return value before
passing it to __em_nl_get_pd_size(). When a caller supplies a
non-existent perf domain ID, em_perf_domain_get_by_id() returns NULL,
and __em_nl_get_pd_size() immediately dereferences pd->cpus
(struct offset 0x30), causing a NULL pointer dereference.
The sister handler dev_energymodel_nl_get_perf_table_doit() already
handles this correctly via __em_nl_get_pd_table_id(), which returns
NULL and causes the caller to return -EINVAL. Add the same NULL check
in the get-perf-domains do handler.
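The fix amounts to a check of this shape in the do handler (sketch;
the surrounding code and the ID variable name are elided/illustrative):

  	pd = em_perf_domain_get_by_id(pd_id);
  	if (!pd)
  		return -EINVAL;	/* unknown perf domain ID */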
Fixes: 380ff27af2 ("PM: EM: Add dump to get-perf-domains in the EM YNL spec")
Reported-by: Yi Lai <yi1.lai@linux.intel.com>
Closes: https://lore.kernel.org/lkml/aXiySM79UYfk+ytd@ly-workstation/
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Cc: 6.19+ <stable@vger.kernel.org> # 6.19+
[ rjw: Subject and changelog edits ]
Link: https://patch.msgid.link/20260329073615.649976-1-changwoo@igalia.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Conflict in kernel/sched/ext.c init_sched_ext_class() between:
415cb193bb ("sched_ext: Fix SCX_KICK_WAIT deadlock by deferring wait
to balance callback")
which adds cpus_to_sync cpumask allocation, and:
84b1a0ea0b ("sched_ext: Implement scx_bpf_dsq_reenq() for user DSQs")
8c1b9453fd ("sched_ext: Convert deferred_reenq_locals from llist to
regular list")
which add deferred_reenq init code at the same location. Both are
independent additions. Include both.
Signed-off-by: Tejun Heo <tj@kernel.org>
SCX_KICK_WAIT busy-waits in kick_cpus_irq_workfn() using
smp_cond_load_acquire() until the target CPU's kick_sync advances. Because
the irq_work runs in hardirq context, the waiting CPU cannot reschedule and
its own kick_sync never advances. If multiple CPUs form a wait cycle, all
CPUs deadlock.
Replace the busy-wait in kick_cpus_irq_workfn() with resched_curr() to
force the CPU through do_pick_task_scx(), which queues a balance callback
to perform the wait. The balance callback drops the rq lock and enables
IRQs following the sched_core_balance() pattern, so the CPU can process
IPIs while waiting. The local CPU's kick_sync is advanced on entry to
do_pick_task_scx() and continuously during the wait, ensuring any CPU that
starts waiting for us sees the advancement and cannot form cyclic
dependencies.
Fixes: 90e55164da ("sched_ext: Implement SCX_KICK_WAIT")
Cc: stable@vger.kernel.org # v6.12+
Reported-by: Christian Loehle <christian.loehle@arm.com>
Link: https://lore.kernel.org/r/20260316100249.1651641-1-christian.loehle@arm.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Christian Loehle <christian.loehle@arm.com>
When CONFIG_DMA_API_DEBUG is enabled, the DMA debug infrastructure
tracks active mappings per cacheline and warns if two different DMA
mappings share the same cacheline ("cacheline tracking EEXIST,
overlapping mappings aren't supported").
On x86_64, ARCH_KMALLOC_MINALIGN defaults to 8, so small kmalloc
allocations (e.g. the 8-byte hub->buffer and hub->status in the USB
hub driver) frequently land in the same 64-byte cacheline. When both
are DMA-mapped, this triggers a false positive warning.
This has been reported repeatedly since v5.14 (when the EEXIST check
was added) across various USB host controllers and devices including
xhci_hcd with USB hubs, USB audio devices, and USB ethernet adapters.
The cacheline overlap is only a real concern on architectures that
require DMA buffer alignment to cacheline boundaries (i.e. where
ARCH_DMA_MINALIGN >= L1_CACHE_BYTES). On architectures like x86_64
where dma_get_cache_alignment() returns 1, the hardware is
cache-coherent and overlapping cacheline mappings are harmless.
Suppress the EEXIST warning when dma_get_cache_alignment() is less
than L1_CACHE_BYTES, indicating the architecture does not require
cacheline-aligned DMA buffers.
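In sketch form, the suppression is a guard in the overlap-reporting
path:

  	/* Coherent DMA (alignment requirement below a cacheline):
  	 * overlapping cacheline mappings are harmless, don't report. */
  	if (dma_get_cache_alignment() < L1_CACHE_BYTES)
  		return;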
Verified with a kernel module reproducer that performs two kmalloc(8)
allocations back-to-back and DMA-maps both:
  Before: allocations share a cacheline, EEXIST fires within ~50 pairs
  After:  same cacheline pair found, but no warning emitted
Fixes: 2b4bbc6231 ("dma-debug: report -EEXIST errors in add_dma_entry")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215740
Suggested-by: Harry Yoo <harry@kernel.org>
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Signed-off-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20260327124156.24820-1-mikhail.v.gavrilov@gmail.com
Merge tag 'timers-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Ingo Molnar:
"Fix an argument order bug in the alarm timer forwarding logic, which
may cause missed expirations or incorrect overrun accounting"
* tag 'timers-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
alarmtimer: Fix argument order in alarm_timer_forward()
Merge tag 'locking-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull futex fixes from Ingo Molnar:
- Tighten up the sys_futex_requeue() ABI a bit, to disallow dissimilar
futex flags and potential UaF access (Peter Zijlstra)
- Fix UaF between futex_key_to_node_opt() and vma_replace_policy()
(Hao-Yu Yang)
- Clear stale exiting pointer in futex_lock_pi() retry path, which
triggered a warning (and potential misbehavior) in stress-testing
(Davidlohr Bueso)
* tag 'locking-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Clear stale exiting pointer in futex_lock_pi() retry path
futex: Fix UaF between futex_key_to_node_opt() and vma_replace_policy()
futex: Require sys_futex_requeue() to have identical flags
The following kfuncs currently accept void *meta__ign argument:
* bpf_obj_new_impl
* bpf_obj_drop_impl
* bpf_percpu_obj_new_impl
* bpf_percpu_obj_drop_impl
* bpf_refcount_acquire_impl
* bpf_list_push_back_impl
* bpf_list_push_front_impl
* bpf_rbtree_add_impl
The __ign suffix is an indicator for the verifier to skip the argument
in check_kfunc_args(). Then, in fixup_kfunc_call() the verifier may
set the value of this argument to struct btf_struct_meta *
kptr_struct_meta from insn_aux_data.
BPF programs must pass a dummy NULL value when calling these kfuncs.
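For example, the bpf_obj_new() convenience macro in the selftests'
bpf_experimental.h supplies that dummy NULL (shape shown for context):

  #define bpf_obj_new(type) \
  	((type *)bpf_obj_new_impl(bpf_core_type_id_local(type), NULL))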
Additionally, the list and rbtree _impl kfuncs also accept an implicit
u64 argument, which doesn't require __ign suffix because it's a
scalar, and BPF programs explicitly pass 0.
Add new kfuncs with KF_IMPLICIT_ARGS [1] that correspond to each
_impl kfunc accepting meta__ign. The existing _impl kfuncs remain
unchanged for backwards compatibility.
To support this, add "btf_struct_meta" to the list of recognized
implicit argument types in resolve_btfids.
Implement is_kfunc_arg_implicit() in the verifier, which determines
implicit args by inspecting the non-_impl BTF prototype of the
kfunc.
Update the special_kfunc_list in the verifier and relevant checks to
support both the old _impl and the new KF_IMPLICIT_ARGS variants of
btf_struct_meta users.
[1] https://lore.kernel.org/bpf/20260120222638.3976562-1-ihor.solodrai@linux.dev/
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260327203241.3365046-1-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Commit 2d8b7f9bf8 ("tracing: Have show_event_trigger/filter format a bit more in columns")
added space padding to align the output.
However, it used ("%*.s", len, ""), which requests the default precision.
It doesn't matter here whether the userspace default (0) or kernel
default (no precision) is used, but the format should be "%*s".
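The corrected call is simply (sketch, assuming a seq_file context):

  	seq_printf(m, "%*s", len, "");	/* pad with spaces to 'len' columns */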
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260326201824.3919-1-david.laight.linux@gmail.com
Signed-off-by: David Laight <david.laight.linux@gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Merge tag 'trace-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix potential deadlock in osnoise and hotplug
The interface_lock can be taken by an osnoise thread and the CPU
shutdown logic of osnoise can wait for this thread to finish. But
cpus_read_lock() can also be taken while holding the interface_lock.
This produces a circular lock dependency and can cause a deadlock.
Swap the ordering of cpus_read_lock() and the interface_lock to have
interface_lock taken within the cpus_read_lock() context to prevent
this circular dependency.
- Fix freeing of event triggers in early boot up
If the same trigger is added on the kernel command line, the second
one will fail to be applied and the trigger created will be freed.
This calls into the deferred logic and creates a kernel thread to do
the freeing. But the command line logic is called before kernel
threads can be created and this leads to a NULL pointer dereference.
Delay freeing event triggers until late init.
* tag 'trace-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Drain deferred trigger frees if kthread creation fails
tracing: Fix potential deadlock in cpu hotplug with osnoise
The function tracing_alloc_snapshot() is only used by trace.c and
trace_snapshot.c. When snapshot isn't configured, it's not used at all.
The stub function was defined as a global with no users and no prototype,
causing build issues.
Remove the function when snapshot isn't configured, as nothing is calling
it.
Also remove the EXPORT_SYMBOL_GPL() that was associated with it, as it's
not used outside of the tracing subsystem, including by any modules.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260328101946.2c4ef4a5@robin
Reported-by: Mark Brown <broonie@kernel.org>
Closes: https://lore.kernel.org/all/acb-IuZ4vDkwwQLW@sirena.co.uk/
Fixes: bade44fe54 ("tracing: Move snapshot code out of trace.c and into trace_snapshot.c")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The comment in exit_itimers() still refers to itimer_delete(),
which was replaced by posix_timer_delete(). Update the comment
accordingly.
Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260326142210.98632-1-zhanxusheng@xiaomi.com
Fuzzing/stressing futexes triggered:
WARNING: kernel/futex/core.c:825 at wait_for_owner_exiting+0x7a/0x80, CPU#11: futex_lock_pi_s/524
When futex_lock_pi_atomic() sees the owner is exiting, it returns -EBUSY
and stores a refcounted task pointer in 'exiting'.
After wait_for_owner_exiting() consumes that reference, the local pointer
is never reset to nil. Upon a retry, if futex_lock_pi_atomic() returns a
different error, the bogus pointer is passed to wait_for_owner_exiting().
  CPU0                      CPU1                     CPU2
  futex_lock_pi(uaddr)
  // acquires the PI futex
  exit()
  futex_cleanup_begin()
  futex_state = EXITING;
                                                     futex_lock_pi(uaddr)
                                                     futex_lock_pi_atomic()
                                                     attach_to_pi_owner()
                                                     // observes EXITING
                                                     *exiting = owner; // takes ref
                                                     return -EBUSY
                                                     wait_for_owner_exiting(-EBUSY, owner)
                                                     put_task_struct(); // drops ref
                                                     // exiting still points to owner
                                                     goto retry;
                                                     futex_lock_pi_atomic()
                                                     lock_pi_update_atomic()
                                                     cmpxchg(uaddr)
                            *uaddr ^= WAITERS // whatever
                            // value changed
                                                     return -EAGAIN;
                                                     wait_for_owner_exiting(-EAGAIN, exiting) // stale
                                                     WARN_ON_ONCE(exiting)
Fix this by resetting the pointer upon retry, essentially aligning it
with requeue_pi.
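In sketch form, the retry path becomes:

  	wait_for_owner_exiting(ret, exiting);
  	exiting = NULL;	/* reference already consumed; don't reuse */
  	goto retry;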
Fixes: 3ef240eaff ("futex: Prevent exit livelock")
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260326001759.4129680-1-dave@stgolabs.net
Boot-time trigger registration can fail before the trigger-data cleanup
kthread exists. Deferring those frees until late init is fine, but the
post-boot fallback must still drain the deferred list if kthread
creation never succeeds.
Otherwise, boot-deferred nodes can accumulate on
trigger_data_free_list, later frees fall back to synchronously freeing
only the current object, and the older queued entries are leaked
forever.
To trigger this, add the following to the kernel command line:
trace_event=sched_switch trace_trigger=sched_switch.traceon,sched_switch.traceon
The second traceon trigger will fail and be freed. This triggers a NULL
pointer dereference and crashes the kernel.
Keep the deferred boot-time behavior, but when kthread creation fails,
drain the whole queued list synchronously. Do the same in the late-init
drain path so queued entries are not stranded there either.
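A sketch of the drain fallback (the list handling and field names here
are illustrative, not the actual implementation):

  	struct event_trigger_data *data, *tmp;

  	/* kthread creation failed: free everything queued so far, not
  	 * just the current object, so boot-deferred nodes don't leak. */
  	list_for_each_entry_safe(data, tmp, &trigger_data_free_list, list) {
  		list_del(&data->list);
  		kfree(data);
  	}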
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260324221326.1395799-3-atwellwea@gmail.com
Fixes: 61d445af0a ("tracing: Add bulk garbage collection of freeing event_trigger_data")
Signed-off-by: Wesley Atwell <atwellwea@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
After the introduction of clear_pages(), we exploit the fact that the
process vm_area is allocated in contiguous pages to just clear them all in
one swift operation.
Link: https://lkml.kernel.org/r/20260224-mm-fork-clear-pages-v1-1-184c65a72d49@kernel.org
Signed-off-by: Linus Walleij <linusw@kernel.org>
Suggested-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/linux-mm/dpnwsp7dl4535rd7qmszanw6u5an2p74uxfex4dh53frpb7pu3@2bnjjavjrepe/
Suggested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Link: https://lore.kernel.org/20240311164638.2015063-7-pasha.tatashin@soleen.com
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Ankur Arora <ankur.a.arora@oracle.com>
Cc: Ben Segall <bsegall@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Now that kernel_clone() checks valid_signal(args->exit_signal), the "sig"
argument of do_notify_parent() must always be valid or we have a bug.
However, do_notify_parent() only checks that sig != -1 at the start, then
it does another valid_signal() check before __send_signal_locked().
This is confusing. Change do_notify_parent() to WARN and return early if
valid_signal(sig) is false.
Link: https://lkml.kernel.org/r/abld-ilvMEZ7VgMw@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Deepanshu Kartikey <Kartikey406@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, the buddy system only performs checks every 3rd sample, with a
4-second interval. If a check window is missed, the next check occurs 12
seconds later, potentially delaying hard lockup detection for up to 24
seconds.
Modify the buddy system to perform checks at every interval (4s).
Introduce a missed-interrupt threshold to maintain the existing grace
period while reducing the detection window to 8-12 seconds.
Best and worst case detection scenarios:
Before (12s check window):
- Best case: Lockup occurs after first check but just before heartbeat
interval. Detected in ~8s (8s till next check).
- Worst case: Lockup occurs just after a check.
Detected in ~24s (missed check + 12s till next check + 12s logic).
After (4s check window with threshold of 3):
- Best case: Lockup occurs just before a check.
Detected in ~8s (0s till 1st check + 4s till 2nd + 4s till 3rd).
- Worst case: Lockup occurs just after a check.
Detected in ~12s (4s till 1st check + 4s till 2nd + 4s till 3rd).
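In sketch form, with illustrative identifiers rather than the actual
kernel names, the per-interval check with a miss threshold of 3 looks
like:

  static bool buddy_check_hardlockup(unsigned int cpu)
  {
  	unsigned long ints = per_cpu(hrtimer_interrupts, cpu);

  	if (ints != per_cpu(hrtimer_interrupts_saved, cpu))
  		per_cpu(missed_checks, cpu) = 0;	/* progress seen */
  	else if (++per_cpu(missed_checks, cpu) >= 3)
  		return true;	/* no progress for 3 intervals */

  	per_cpu(hrtimer_interrupts_saved, cpu) = ints;
  	return false;
  }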
Link: https://lkml.kernel.org/r/20260312-hardlockup-watchdog-fixes-v2-4-45bd8a0cc7ed@google.com
Signed-off-by: Mayank Rungta <mrungta@google.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Li Huafei <lihuafei1@huawei.com>
Cc: Max Kellermann <max.kellermann@ionos.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Wang Jinchao <wangjinchao600@gmail.com>
Cc: Yunhui Cui <cuiyunhui@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, arch_touch_nmi_watchdog() causes an early return that skips
updating hrtimer_interrupts_saved. This leads to stale comparisons and
delayed lockup detection.
I found this issue because in our system the serial console is fairly
chatty. For example, the 8250 console driver frequently calls
touch_nmi_watchdog() via console_write(). If a CPU locks up after a timer
interrupt but before the next watchdog check, we see the following sequence:
* watchdog_hardlockup_check() saves counter (e.g., 1000)
* Timer runs and updates the counter (1001)
* touch_nmi_watchdog() is called
* CPU locks up
* 10s pass: check() notices touch, returns early, skips update
* 10s pass: check() saves counter (1001)
* 10s pass: check() finally detects lockup
This delays detection to 30 seconds. With this fix, we detect the lockup
in 20 seconds.
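The fix, in sketch form (illustrative identifiers), is to refresh the
snapshot before honoring the touch:

  	unsigned long ints = per_cpu(hrtimer_interrupts, cpu);
  	bool stalled = (ints == per_cpu(hrtimer_interrupts_saved, cpu));

  	/* Always refresh the snapshot, even when touched, so the next
  	 * check compares against current data, not a stale count. */
  	per_cpu(hrtimer_interrupts_saved, cpu) = ints;

  	if (per_cpu(watchdog_hardlockup_touched, cpu)) {
  		per_cpu(watchdog_hardlockup_touched, cpu) = false;
  		return;
  	}

  	if (stalled)
  		watchdog_hardlockup_report(cpu);	/* illustrative */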
Link: https://lkml.kernel.org/r/20260312-hardlockup-watchdog-fixes-v2-2-45bd8a0cc7ed@google.com
Signed-off-by: Mayank Rungta <mrungta@google.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Li Huafei <lihuafei1@huawei.com>
Cc: Max Kellermann <max.kellermann@ionos.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Wang Jinchao <wangjinchao600@gmail.com>
Cc: Yunhui Cui <cuiyunhui@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "watchdog/hardlockup: Improvements to hardlockup", v2.
This series addresses limitations in the hardlockup detector
implementations and updates the documentation to reflect actual behavior
and recent changes.
The changes are structured as follows:
Refactoring (Patch 1)
=====================
Patch 1 refactors watchdog_hardlockup_check() to return early if no
lockup is detected. This reduces the indentation level of the main
logic block, serving as a clean base for the subsequent changes.
Hardlockup Detection Improvements (Patches 2 & 4)
=================================================
The hardlockup detector logic relies on updating saved interrupt counts to
determine if the CPU is making progress.
Patch 2 ensures that the saved interrupt count is updated unconditionally
before checking the "touched" flag. This prevents stale comparisons which
can delay detection. This is a logic fix that ensures the detector
remains accurate even when the watchdog is frequently touched.
Patch 4 improves the Buddy detector's timeliness. The current checking
interval (every 3rd sample) causes high variability in detection time (up
to 24s). This patch changes the Buddy detector to check at every hrtimer
interval (4s) with a missed-interrupt threshold of 3, narrowing the
detection window to a consistent 8-12 second range.
Documentation Updates (Patches 3 & 5)
=====================================
The current documentation does not fully capture the variable nature of
detection latency or the details of the Buddy system.
Patch 3 removes the strict "10 seconds" definition of a hardlockup, which
was misleading given the periodic nature of the detector. It adds a
"Detection Overhead" section to the admin guide, using "Best Case" and
"Worst Case" scenarios to illustrate that detection time can vary
significantly (e.g., ~6s to ~20s).
Patch 5 adds a dedicated section for the Buddy detector, which was
previously undocumented. It details the mechanism, the new timing logic,
and known limitations.
This patch (of 5):
Invert the `is_hardlockup(cpu)` check in `watchdog_hardlockup_check()` to
return early when a hardlockup is not detected. This flattens the main
logic block, reducing the indentation level and making the code easier to
read and maintain.
This refactoring serves as a preparation patch for future hardlockup
changes.
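The shape of the refactor (sketch):

  	if (!is_hardlockup(cpu))
  		return;

  	/* Hardlockup handling continues here at the top indentation
  	 * level instead of inside a large if-block. */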
Link: https://lkml.kernel.org/r/20260312-hardlockup-watchdog-fixes-v2-0-45bd8a0cc7ed@google.com
Link: https://lkml.kernel.org/r/20260312-hardlockup-watchdog-fixes-v2-1-45bd8a0cc7ed@google.com
Signed-off-by: Mayank Rungta <mrungta@google.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Li Huafei <lihuafei1@huawei.com>
Cc: Max Kellermann <max.kellermann@ionos.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Wang Jinchao <wangjinchao600@gmail.com>
Cc: Yunhui Cui <cuiyunhui@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kexec_core.c does not do any cryptographic hashing, so the header
crypto/hash.h is not needed at all.
Link: https://lkml.kernel.org/r/20260314204144.44884-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Several files related to kernel crash dumps include crypto/sha1.h but
never use any of its functionality. Remove these includes so that these
files don't unnecessarily come up in searches for which kernel code is
still using the obsolete SHA-1 algorithm.
Link: https://lkml.kernel.org/r/20260314204243.45001-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, the hung task reporting mechanism indiscriminately labels all
TASK_UNINTERRUPTIBLE (D) tasks as "blocked", irrespective of whether they
are awaiting I/O completion or kernel locking primitives. This ambiguity
compels system administrators to manually inspect stack traces to discern
whether the delay stems from an I/O wait (typically indicative of hardware
or filesystem anomalies) or software contention. Such detailed analysis
is not always immediately accessible to system administrators or support
engineers.
To address this, this patch utilises the existing in_iowait field within
struct task_struct to augment the failure report. If the task is blocked
due to I/O (e.g., via io_schedule_prepare()), the log message is updated
to explicitly state "blocked in I/O wait".
Examples:
- Standard Block: "INFO: task bash:123 blocked for more than 120
seconds".
- I/O Block: "INFO: task dd:456 blocked in I/O wait for more than
120 seconds".
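In sketch form, the message selection keys off t->in_iowait (the
surrounding variables are elided/illustrative):

  	pr_err("INFO: task %s:%d blocked%s for more than %ld seconds.\n",
  	       t->comm, t->pid,
  	       t->in_iowait ? " in I/O wait" : "",
  	       timeout);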
Theoretically, concurrent executions of io_schedule_finish() could result
in a race condition where the read flag does not precisely correlate with
the subsequently printed backtrace. However, this limitation is deemed
acceptable in practice. The entire reporting mechanism is inherently racy
by design; nevertheless, it remains highly reliable in the vast majority
of cases, particularly because it primarily captures protracted stalls.
Consequently, introducing additional synchronisation to mitigate this
minor inaccuracy would be entirely disproportionate to the situation.
Link: https://lkml.kernel.org/r/20260303221324.4106917-1-atomlin@atomlin.com
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Lance Yang <lance.yang@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
A recent change made it possible to reset the global counter of hung tasks using
the sysctl interface. A potential race with the regular check has been
solved by updating the global counter only once at the end of the check.
However, the hung task check can take a significant amount of time,
particularly when task information is being dumped to slow serial
consoles. Some users monitor this global counter to trigger immediate
migration of critical containers. Delaying the increment until the full
check completes postpones these high-priority rescue operations.
Update the global counter as soon as a hung task is detected. Since the
value is read asynchronously, a relaxed atomic operation is sufficient.
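In sketch form:

  	/* atomic_long_inc() is a relaxed RMW, which is sufficient
  	 * because the counter is only ever read asynchronously. */
  	if (task_is_hung(t, timeout))
  		atomic_long_inc(&sysctl_hung_task_detect_count);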
Link: https://lkml.kernel.org/r/20260303203031.4097316-4-atomlin@atomlin.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
Reported-by: Lance Yang <lance.yang@linux.dev>
Closes: https://lore.kernel.org/r/f239e00f-4282-408d-b172-0f9885f4b01b@linux.dev
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, the hung_task_detect_count sysctl provides a cumulative count
of hung tasks since boot. In long-running, high-availability
environments, this counter may lose its utility if it cannot be reset once
an incident has been resolved. Furthermore, the previous implementation
relied upon implicit ordering, which could not strictly guarantee that
diagnostic metadata published by one CPU was visible to the panic logic on
another.
This patch introduces the capability to reset the detection count by
writing "0" to the hung_task_detect_count sysctl. The proc_handler logic
has been updated to validate this input and atomically reset the counter.
The synchronisation of sysctl_hung_task_detect_count relies upon a
transactional model to ensure the integrity of the detection counter
against concurrent resets from userspace. The application of
atomic_long_read_acquire() and atomic_long_cmpxchg_release() is correct
and provides the following guarantees:
 1. Prevention of Load-Store Reordering via Acquire Semantics:
    By utilising atomic_long_read_acquire() to snapshot the counter
    before initiating the task traversal, we establish a strict
    memory barrier. This prevents the compiler or hardware from
    reordering the initial load to a point later in the scan. Without
    this "acquire" barrier, a delayed load could potentially read a
    "0" value resulting from a userspace reset that occurred
    mid-scan. This would lead to the subsequent cmpxchg succeeding
    erroneously, thereby overwriting the user's reset with stale
    increment data.
 2. Atomicity of the "Commit" Phase via Release Semantics:
    The atomic_long_cmpxchg_release() serves as the transaction's
    commit point. The "release" barrier ensures that all diagnostic
    recordings and task-state observations made during the scan are
    globally visible before the counter is incremented.
 3. Race Condition Resolution:
    This pairing effectively detects any "out-of-band" reset of the
    counter. If sysctl_hung_task_detect_count is modified via the
    procfs interface during the scan, the final cmpxchg will detect
    the discrepancy between the current value and the "acquire"
    snapshot. Consequently, the update will fail, ensuring that a
    reset command from the administrator is prioritised over a scan
    that may have been invalidated by that very reset.
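The overall transaction, in sketch form:

  	long seen = atomic_long_read_acquire(&sysctl_hung_task_detect_count);

  	/* ... scan tasks, accumulating this_round_count ... */

  	if (atomic_long_cmpxchg_release(&sysctl_hung_task_detect_count,
  					seen, seen + this_round_count) != seen) {
  		/* Counter was reset mid-scan: the administrator's
  		 * reset wins and this round's count is dropped. */
  	}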
Link: https://lkml.kernel.org/r/20260303203031.4097316-3-atomlin@atomlin.com
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Joel Granados <joel.granados@kernel.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Lance Yang <lance.yang@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "hung_task: Provide runtime reset interface for hung task
detector", v9.
This series introduces the ability to reset
/proc/sys/kernel/hung_task_detect_count.
Writing a "0" value to this file atomically resets the counter of detected
hung tasks. This functionality provides system administrators with the
means to clear the cumulative diagnostic history following incident
resolution, thereby simplifying subsequent monitoring without
necessitating a system restart.
This patch (of 3):
The check_hung_task() function currently conflates two distinct
responsibilities: validating whether a task is hung and handling the
subsequent reporting (printing warnings, triggering panics, or
tracepoints).
This patch refactors the logic by introducing hung_task_info(), a function
dedicated solely to reporting. The actual detection check,
task_is_hung(), is hoisted into the primary loop within
check_hung_uninterruptible_tasks(). This separation clearly decouples the
mechanism of detection from the policy of reporting.
Furthermore, to facilitate future support for concurrent hung task
detection, the global sysctl_hung_task_detect_count variable is converted
from unsigned long to atomic_long_t. Consequently, the counting logic is
updated to accumulate the number of hung tasks locally (this_round_count)
during the iteration. The global counter is then updated atomically via
atomic_long_cmpxchg_relaxed() once the loop concludes, rather than
incrementally during the scan.
These changes are strictly preparatory and introduce no functional change
to the system's runtime behaviour.
Link: https://lkml.kernel.org/r/20260303203031.4097316-1-atomlin@atomlin.com
Link: https://lkml.kernel.org/r/20260303203031.4097316-2-atomlin@atomlin.com
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Joel Granados <joel.granados@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Replace sprintf() with sysfs_emit() in sysfs show functions. sysfs_emit()
is preferred for formatting sysfs output because it provides safer bounds
checking. No functional changes.
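Typical shape of the conversion (sketch; the show function and value
are illustrative):

  static unsigned int foo_value;

  static ssize_t foo_show(struct kobject *kobj, struct kobj_attribute *attr,
  			char *buf)
  {
  	/* sysfs_emit() caps output at PAGE_SIZE and validates buf. */
  	return sysfs_emit(buf, "%u\n", foo_value);
  }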
Link: https://lkml.kernel.org/r/20260301125106.911980-2-thorsten.blum@linux.dev
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Both copy_process() and alloc_pid() do the same PIDNS_ADDING check. The
reasons for these checks, and the fact that both are necessary, are not
immediately obvious. Add comments explaining them.
Link: https://lkml.kernel.org/r/aaGIRElc78U4Er42@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Adrian Reber <areber@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Alexander Mikhalitsyn <alexander@mihalicyn.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "pid: make sub-init creation retryable".
This patch (of 2):
Currently we allow only one attempt to create init in a new namespace. If
the first fork() fails after alloc_pid() succeeds, free_pid() clears
PIDNS_ADDING and thus disables further PID allocations.
Nowadays this looks like an unnecessary limitation. The original reason
to handle "case PIDNS_ADDING" in free_pid() is gone, most probably after
commit 69879c01a0 ("proc: Remove the now unnecessary internal mount of
proc").
Change free_pid() to keep ns->pid_allocated == PIDNS_ADDING, and change
alloc_pid() to reset the cursor early, right after taking pidmap_lock.
Test-case:
	#define _GNU_SOURCE
	#include <linux/sched.h>
	#include <sys/syscall.h>
	#include <sys/wait.h>
	#include <assert.h>
	#include <sched.h>
	#include <errno.h>

	int main(void)
	{
		struct clone_args args = {
			.exit_signal = SIGCHLD,
			.flags = CLONE_PIDFD,
			.pidfd = 0,
		};
		unsigned long pidfd;
		int pid;

		assert(unshare(CLONE_NEWPID) == 0);

		pid = syscall(__NR_clone3, &args, sizeof(args));
		assert(pid == -1 && errno == EFAULT);

		args.pidfd = (unsigned long)&pidfd;
		pid = syscall(__NR_clone3, &args, sizeof(args));
		if (pid)
			assert(pid > 0 && wait(NULL) == pid);
		else
			assert(getpid() == 1);

		return 0;
	}
Link: https://lkml.kernel.org/r/aaGHu3ixbw9Y7kFj@redhat.com
Link: https://lkml.kernel.org/r/aaGIHa7vGdwhEc_D@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Andrei Vagin <avagin@gmail.com>
Cc: Adrian Reber <areber@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Alexander Mikhalitsyn <alexander@mihalicyn.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The function read_key_from_user_keying() is missing an 'r' in its name.
Fix the typo by renaming it to read_key_from_user_keyring().
Link: https://lkml.kernel.org/r/20260227230422.859423-1-thorsten.blum@linux.dev
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
'key_count' is an 'unsigned int' and cannot be less than zero. Remove
the redundant condition.
Link: https://lkml.kernel.org/r/20260228085136.861971-2-thorsten.blum@linux.dev
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Replace simple_strtoul() with the recommended kstrtoul() for parsing the
'coredump_filter=' boot parameter.
Check the return value of kstrtoul() and reject invalid values. This adds
error handling while preserving behavior for existing values, and removes
use of the deprecated simple_strtoul() helper. The current code silently
sets 'default_dump_filter = 0' if parsing fails, instead of leaving the
default value (MMF_DUMP_FILTER_DEFAULT) unchanged.
Rename the static variable 'default_dump_filter' to 'coredump_filter'
since it does not necessarily contain the default value and the current
name can be misleading.
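A sketch of the updated parser (placement and surrounding code elided):

  static unsigned long coredump_filter = MMF_DUMP_FILTER_DEFAULT;

  static int __init coredump_filter_setup(char *s)
  {
  	unsigned long val;

  	/* Reject invalid input and keep MMF_DUMP_FILTER_DEFAULT rather
  	 * than silently zeroing the filter. */
  	if (kstrtoul(s, 0, &val))
  		return 0;

  	coredump_filter = val & MMF_DUMP_FILTER_MASK;
  	return 1;
  }
  __setup("coredump_filter=", coredump_filter_setup);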
Link: https://lkml.kernel.org/r/20251215142152.4082-2-thorsten.blum@linux.dev
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Ben Segall <bsegall@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The "(signal->core_state || !(signal->flags & SIGNAL_GROUP_EXIT))" check
in complete_signal() is not obvious at all, and in fact it only adds
unnecessary confusion: this condition is always true.
prepare_signal() does:
	if (signal->flags & SIGNAL_GROUP_EXIT) {
		if (signal->core_state)
			return sig == SIGKILL;
		/*
		 * The process is in the middle of dying, drop the signal.
		 */
		return false;
	}
This means that "!signal->core_state && (signal->flags &
SIGNAL_GROUP_EXIT)" in complete_signal() is never possible.
If SIGNAL_GROUP_EXIT is set, prepare_signal() can only return true if
signal->core_state is not NULL.
Link: https://lkml.kernel.org/r/aZsfkDhnqJ4s1oTs@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kees Cook <kees@kernel.org>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc; Deepanshu Kartikey <kartikey406@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
thread_group_empty(tsk) is only possible if tsk is a group leader, and
thread_group_empty() already does the thread_group_leader() check.
So it makes no sense to check "thread_group_leader() &&
thread_group_empty()"; thread_group_empty() alone is enough.
Link: https://lkml.kernel.org/r/aZsfeegKZPZZszJh@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc; Deepanshu Kartikey <kartikey406@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
However there's a convention of assuming that __init-time allocations
cannot fail. Because if a kmalloc() were to fail at this time, the kernel
is hopelessly messed up anyway. So simply panic() if that kmalloc failed,
then make that 350-byte buffer __initdata.
Link: https://lkml.kernel.org/r/20260223035914.4033-1-rioo.tsukatsukii@gmail.com
Signed-off-by: Rio <rioo.tsukatsukii@gmail.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Wang Jinchao <wangjinchao600@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The buffer used to hold the taint string is statically allocated, which
requires updating whenever a new taint flag is added.
Instead, allocate the exact required length at boot once the allocator is
available in an init function. The allocation sums the string lengths in
taint_flags[], along with space for separators and formatting.
print_tainted() is switched to use this dynamically allocated buffer.
If allocation fails, print_tainted() warns about the failure and continues
to use the original static buffer as a fallback.
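A sketch of the init-time sizing (the taint_flags[] field name 'desc'
is illustrative):

  static char *taint_buf;	/* dynamic replacement for the static buffer */

  static int __init taint_buf_init(void)
  {
  	size_t len = sizeof("Tainted: ");
  	int i;

  	for (i = 0; i < TAINT_FLAGS_COUNT; i++)
  		len += strlen(taint_flags[i].desc) + 2;	/* ", " separator */

  	taint_buf = kmalloc(len, GFP_KERNEL);
  	return 0;	/* on failure print_tainted() warns and falls back */
  }
  early_initcall(taint_buf_init);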
Link: https://lkml.kernel.org/r/20260222140804.22225-1-rioo.tsukatsukii@gmail.com
Signed-off-by: Rio <rioo.tsukatsukii@gmail.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Wang Jinchao <wangjinchao600@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The verbose 'Tainted: ...' string in print_tainted_seq can total 327
characters, while the buffer defined in _print_tainted is 320 bytes.
Increase its size to 350 characters to hold all flags, along with some
headroom.
[akpm@linux-foundation.org: fix spello, add comment]
Link: https://lkml.kernel.org/r/20260220151500.13585-1-rioo.tsukatsukii@gmail.com
Signed-off-by: Rio <rioo.tsukatsukii@gmail.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Wang Jinchao <wangjinchao600@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Merge tag 'sysctl-7.00-fixes-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl
Pull sysctl fix from Joel Granados:
"Fix uninitialized variable error when writing to a sysctl bitmap
Removed the possibility of returning an unjustified -EINVAL when
writing to a sysctl bitmap"
* tag 'sysctl-7.00-fixes-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
sysctl: fix uninitialized variable in proc_do_large_bitmap
Add a comment explaining the design intent behind rejecting built-in DSQs
(%SCX_DSQ_GLOBAL and %SCX_DSQ_LOCAL*) as sources. Local DSQs support
reenqueueing but the BPF scheduler cannot directly iterate or move tasks
from them. %SCX_DSQ_GLOBAL is similar but also doesn't support
reenqueueing because it maps to multiple per-node DSQs, making the scope
difficult to define.
Also annotate @dsq_id to make clear it must be a user-created DSQ.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
get_data() has a sanity check for regular data blocks to ensure at
least space for the ID exists. But a regular block should also have
at least 1 byte of data (otherwise it would be data-less instead of
regular).
Expand the get_data() block size sanity check to additionally expect
at least 1 byte of data.
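In sketch form, the expanded sanity check (variable names
illustrative):

  	/* A regular data block must hold its ID plus at least one byte
  	 * of data; anything smaller would be a data-less block. */
  	if (blk_size < sizeof(db->id) + 1)
  		return NULL;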
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Link: https://patch.msgid.link/20260326133809.8045-2-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
Commit cc3bad11de ("printk_ringbuffer: Fix check of valid data
size when blk_lpos overflows") added sanity checking to get_data()
to avoid returning data of illegal sizes (too large or too small).
It uses the helper function data_check_size() for the check.
However, data_check_size() expects the size of the data, not the
size of the data block. get_data() is providing the size of the
data block. This means that if the data size (text_buf_size) is
at or near the maximum legal size:
sizeof(prb_data_block) + text_buf_size == DATA_SIZE(data_ring) / 2
data_check_size() will report failure because it adds
sizeof(prb_data_block) to the provided size. The sanity check in
get_data() is counting the data block header twice. The result is
that the reader fails to read the legal record.
Since get_data() subtracts the data block header size before returning,
move the sanity check to after the subtraction.
Luckily printk() is not vulnerable to this problem because
truncate_msg() limits printk-messages to 1/4 of the ringbuffer.
Indeed, by adjusting the printk_ringbuffer KUnit test, which does not
use printk() and its truncate_msg() check, it is easy to see that the
reader fails and the WARN_ON is triggered.
Fixes: cc3bad11de ("printk_ringbuffer: Fix check of valid data size when blk_lpos overflows")
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Link: https://patch.msgid.link/20260326133809.8045-1-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
A bunch of new hooks for managing block devices were added a while ago
but they weren't actually appropriately classified.
* bpf_lsm_bdev_alloc() is called when the inode for the block
device is allocated. This happens from a sleepable context so mark the
function as sleepable. When this function is called, the memory for the
block device storage embedded into the inode is zeroed. That block
device cannot be meaningfully referenced or interacted with at this
point. So mark it as untrusted for now.
* bpf_lsm_bdev_free() is called when the inode for the block
device is freed. A bunch of memory associated with the block device
has already been freed and there's dangling pointers in there. So mark
it as untrusted. It cannot be meaningfully referenced or interacted
with anymore. It is also called from sb->s_op->free_inode(), which
means it runs in rcu context (most of the time). So leave it as
non-sleepable.
* bpf_lsm_bdev_setintegrity() is called when a dm-verity device
is instantiated (glossing over details for simplicity of the commit
message). The block device is very much alive so it remains a trusted
hook. It's also called with device mapper's suspend lock held and so
the hook is able to sleep so mark it sleepable.
Signed-off-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20260326-work-bpf-bdev-v2-1-5e3c58963987@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When a bridge window contains big and small resource(s), the small
resource(s) may not amount to the half of the size of the big resource
which would allow calculate_head_align() to shrink the head alignment.
This results in always placing the small resource(s) after the big
resource.
In general, it would be good to be able to place the small resource(s)
before the big resource to achieve better utilization of the address space.
In the cases where the large resource can only fit at the end of the
window, it is even required.
However, carrying the information over from pbus_size_mem() and
calculate_head_align() to __pci_assign_resource() and
pcibios_align_resource() is not easy with the current data structures.
A somewhat hacky way to move the non-aligning tail part to the head is
possible within pcibios_align_resource(). The free space between the start
of the free space span and the aligned start address can be compared with
the non-aligning remainder of the size. If the free space is larger than
the remainder, placing the remainder before the start address is possible.
This relocation should generally work, because PCI resources consist only
of power-of-2 atoms.
Various arch requirements may still need to override the relocation, so the
relocation is only applied selectively in such cases.
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=221205
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Xifer <xiferdev@gmail.com>
Link: https://patch.msgid.link/20260324165633.4583-10-ilpo.jarvinen@linux.intel.com
__find_resource_space() has variable called 'tmp'. Rename it to
'full_avail' to better indicate its purpose.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Xifer <xiferdev@gmail.com>
Link: https://patch.msgid.link/20260324165633.4583-4-ilpo.jarvinen@linux.intel.com
__find_resource_space() calculates the full extent of empty space but only
passes the aligned space to resource_alignf callback. In some situations,
the callback may choose to take advantage of the free space before the
requested alignment.
Pass the full extent of the calculated empty space to resource_alignf
callback as an additional parameter.
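Sketched against the current resource_alignf typedef (the new parameter's
name is an assumption, following the 'full_avail' naming above):
  typedef resource_size_t (*resource_alignf)(void *data,
                                             const struct resource *res,
                                             resource_size_t size,
                                             resource_size_t align,
                                             const struct resource *full_avail);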
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Xifer <xiferdev@gmail.com>
Link: https://patch.msgid.link/20260324165633.4583-3-ilpo.jarvinen@linux.intel.com
Validate layout if present, but because the kernel must be
strict in what it accepts, reject BTF with unsupported kinds,
even if they are in the layout information.
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260326145444.2076244-8-alan.maguire@oracle.com
__find_resource_space() currently uses resource_contains() but for
tentative resources that are not yet crafted into the resource tree. As
resource_contains() checks that IORESOURCE_UNSET is not set for either of
the resources, the caller has to hack around this problem by clearing the
IORESOURCE_UNSET flag (essentially lying to resource_contains()).
Instead of the hack, introduce __resource_contains_unbound() for cases like
this.
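A minimal sketch, assuming the helper mirrors resource_contains() minus
the IORESOURCE_UNSET test:
  static bool __resource_contains_unbound(const struct resource *r1,
                                          const struct resource *r2)
  {
          /* Plain range check, no IORESOURCE_UNSET guard. */
          return r1->start <= r2->start && r1->end >= r2->end;
  }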
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Xifer <xiferdev@gmail.com>
Link: https://patch.msgid.link/20260324165633.4583-2-ilpo.jarvinen@linux.intel.com
Merge tag 'pm-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix two cpufreq issues, one in the core and one in the
conservative governor, and two issues related to system sleep:
- Restore the cpufreq core behavior changed inadvertently during the
6.19 development cycle to call cpufreq_frequency_table_cpuinfo()
for cpufreq policies getting re-initialized which ensures that
policy->max and policy->cpuinfo_max_freq will be valid going
forward (Viresh Kumar)
- Adjust the cached requested frequency in the conservative cpufreq
governor on policy limits changes to prevent it from becoming stale
in some cases (Viresh Kumar)
- Prevent pm_restore_gfp_mask() from triggering a WARN_ON() in some
code paths in which it is legitimately called without invoking
pm_restrict_gfp_mask() previously (Youngjun Park)
- Update snapshot_write_finalize() to take trailing zero pages into
account properly which prevents user space restore from failing
subsequently in some cases (Alberto Garcia)"
* tag 'pm-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: sleep: Drop spurious WARN_ON() from pm_restore_gfp_mask()
PM: hibernate: Drain trailing zero pages on userspace restore
cpufreq: conservative: Reset requested_freq on limits change
cpufreq: Don't skip cpufreq_frequency_table_cpuinfo()
Add optional reserved memory callbacks to perform region verification and
early fixup, then move all CMA related code in of_reserved_mem.c to them.
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://patch.msgid.link/20260325090023.3175348-5-m.szyprowski@samsung.com
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Move the init function from the OF_DECLARE() argument to the given
reserved memory region ops structure and then pass that structure to the
OF_DECLARE() initializer. This node_init callback is mandatory for the
reserved mem driver. This change makes it possible in the future to add
more functions called by the generic code before a given memory region
is initialized and its rmem object is created.
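An illustrative sketch of a converted user (the struct, field, and macro
argument shapes here are assumptions, not the literal patch):
  static int my_region_init(struct reserved_mem *rmem)
  {
          /* former OF_DECLARE() init function body */
          return 0;
  }

  static const struct reserved_mem_ops my_region_ops = {
          .node_init = my_region_init,    /* mandatory callback */
  };

  RESERVEDMEM_OF_DECLARE(my_region, "vendor,my-region", &my_region_ops);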
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://patch.msgid.link/20260325090023.3175348-4-m.szyprowski@samsung.com
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
When a given reserved memory region doesn't support the given node,
return -ENODEV instead of -ENOENT. Then fix the __reserved_mem_init_node()
function to properly propagate error codes other than -ENODEV instead of
silently ignoring them.
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://patch.msgid.link/20260325090023.3175348-3-m.szyprowski@samsung.com
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
The FDT node is not needed for anything besides initialization, so it
can simply be passed as an argument to the reserved memory region init
function.
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://patch.msgid.link/20260325090023.3175348-2-m.szyprowski@samsung.com
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
When a caller enqueues a work item using schedule_delayed_work() the used
wq is "system_wq" (per-cpu wq) while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when no target CPU is specified). The same applies
to schedule_work(), which uses system_wq, and queue_work(), which again
makes use of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
Continue the effort to refactor workqueue APIs, which began with the
introduction of new workqueues and a new alloc_workqueue() flag in:
commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")
and switch smp_call_on_cpu() to use system_percpu_wq because system_wq is
going away once the ongoing workqueue restructuring is done.
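The change itself is essentially a one-liner; sketched (assuming the
queueing call used in smp_call_on_cpu()):
  - queue_work_on(cpu, system_wq, &sscs.work);
  + queue_work_on(cpu, system_percpu_wq, &sscs.work);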
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://patch.msgid.link/20251110170332.319314-1-marco.crivellari@suse.com
Merge tag 'dma-mapping-7.0-2026-03-25' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fixes from Marek Szyprowski:
"A set of fixes for DMA-mapping subsystem, which resolve false-
positive warnings from KMSAN and DMA-API debug (Shigeru Yoshida
and Leon Romanovsky) as well as a simple build fix (Miguel Ojeda)"
* tag 'dma-mapping-7.0-2026-03-25' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma-mapping: add missing `inline` for `dma_free_attrs`
mm/hmm: Indicate that HMM requires DMA coherency
RDMA/umem: Tell DMA mapping that UMEM requires coherency
iommu/dma: add support for DMA_ATTR_REQUIRE_COHERENT attribute
dma-direct: prevent SWIOTLB path when DMA_ATTR_REQUIRE_COHERENT is set
dma-mapping: Introduce DMA require coherency attribute
dma-mapping: Clarify valid conditions for CPU cache line overlap
dma-mapping: handle DMA_ATTR_CPU_CACHE_CLEAN in trace output
dma-debug: Allow multiple invocations of overlapping entries
dma: swiotlb: add KMSAN annotations to swiotlb_bounce()
During futex_key_to_node_opt() execution, vma->vm_policy is read under
speculative mmap lock and RCU. Concurrently, mbind() may call
vma_replace_policy() which frees the old mempolicy immediately via
kmem_cache_free().
This creates a race where __futex_key_to_node() dereferences a freed
mempolicy pointer, causing a use-after-free read of mpol->mode.
[ 151.412631] BUG: KASAN: slab-use-after-free in __futex_key_to_node (kernel/futex/core.c:349)
[ 151.414046] Read of size 2 at addr ffff888001c49634 by task e/87
[ 151.415969] Call Trace:
[ 151.416732] __asan_load2 (mm/kasan/generic.c:271)
[ 151.416777] __futex_key_to_node (kernel/futex/core.c:349)
[ 151.416822] get_futex_key (kernel/futex/core.c:374 kernel/futex/core.c:386 kernel/futex/core.c:593)
Fix this by deferring the mempolicy free via RCU in __mpol_put().
Fixes: c042c50521 ("futex: Implement FUTEX2_MPOL")
Reported-by: Hao-Yu Yang <naup96721@gmail.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Hao-Yu Yang <naup96721@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Link: https://patch.msgid.link/20260324174418.GB1850007@noisy.programming.kicks-ass.net
Nicholas reported that his LLM found it was possible to create a UaF
when sys_futex_requeue() is used with different flags. The initial
motivation for allowing different flags was the variable sized futex,
but since that hasn't been merged (yet), simply mandate the flags are
identical, as is the case for the old style sys_futex() requeue
operations.
Fixes: 0f4b5f9722 ("futex: Add sys_futex_requeue()")
Reported-by: Nicholas Carlini <npc@anthropic.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Previously, missing time namespace support in the vDSO meant that time
namespaces needed to be disabled globally. This was expressed in a hard
dependency on the generic vDSO library. This also meant that architectures
without any vDSO or only a stub vDSO could not enable time namespaces.
Now that all architectures using a real vDSO are using the generic library,
that dependency is not necessary anymore.
Remove the dependency and let all architectures enable time namespaces.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260326-vdso-timens-decoupling-v2-2-c82693a7775f@linutronix.de
In preparation for untangling time namespaces and the vDSO, move
the glue functions between those subsystems into a new file.
While at it, switch the mutex lock and mmap_read_lock() in the vDSO
namespace code to guard().
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260326-vdso-timens-decoupling-v2-1-c82693a7775f@linutronix.de
The trace.c file was a dumping ground for most tracing code. Start
organizing it better by moving various functions out into their own files.
Move all the snapshot code, including the max trace code into its own
trace_snapshot.c file.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260324140145.36352d6a@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Introduce the FOR_EACH_NS_TYPE(X) macro as the single source of truth
for the set of (struct type, CLONE_NEW* flag) pairs that define Linux
namespace types.
Currently, the list of CLONE_NEW* flags is duplicated inline in
multiple call sites and would need another copy in each new consumer.
This makes it easy to miss one when a new namespace type is added.
Derive two things from the X-macro:
- CLONE_NS_ALL: Bitmask of all known CLONE_NEW* flags, usable as a
validity mask or iteration bound.
- ns_common_type(): Rewritten to use the X-macro via a leading-comma
_Generic pattern, so the struct-to-flag mapping stays in sync with the
flag set automatically.
Replace the inline flag enumerations in copy_namespaces(),
unshare_nsproxy_namespaces(), check_setns_flags(), and
ksys_unshare() with CLONE_NS_ALL.
When a new namespace type is added, only FOR_EACH_NS_TYPE needs to
be updated; CLONE_NS_ALL, ns_common_type(), and all the call sites
pick up the change automatically.
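A condensed sketch of the pattern (entries abbreviated; the exact macro
bodies are an assumption, the names follow the description above):
  #define FOR_EACH_NS_TYPE(X)                     \
          X(mnt_namespace, CLONE_NEWNS)           \
          X(uts_namespace, CLONE_NEWUTS)          \
          X(net, CLONE_NEWNET)    /* ...one X() entry per type */

  /* Bitmask of all known CLONE_NEW* flags. */
  #define NS_FLAG_BIT(type, flag) | (flag)
  #define CLONE_NS_ALL            (0 FOR_EACH_NS_TYPE(NS_FLAG_BIT))

  /* Leading-comma _Generic arm: maps a struct pointer to its flag. */
  #define NS_GENERIC_ARM(type, flag)      , struct type *: (flag)
  #define ns_common_type(ns) \
          _Generic((ns) FOR_EACH_NS_TYPE(NS_GENERIC_ARM))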
Cc: Christian Brauner <brauner@kernel.org>
Cc: Günther Noack <gnoack@google.com>
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Link: https://patch.msgid.link/20260312100444.2609563-4-mic@digikod.net
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
proc_do_large_bitmap() does not initialize variable c, which is expected
to be set to a trailing character by proc_get_long().
However, proc_get_long() only sets c when the input buffer contains a
trailing character after the parsed value.
If c is not initialized it may happen to contain a '-'. If this is the
case proc_do_large_bitmap() expects to be able to parse a second part of
the input buffer. If there is no second part an unjustified -EINVAL will
be returned.
Initialize c to 0 to prevent returning -EINVAL on valid input.
Fixes: 9f977fb7ae ("sysctl: add proc_do_large_bitmap")
Signed-off-by: Marc Buerg <buermarc@googlemail.com>
Reviewed-by: Joel Granados <joel.granados@kernel.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
025b1bd419 introduced SCX_EV_SUB_BYPASS_DISPATCH to track scheduling
of bypassed descendant tasks; it is correctly incremented per-CPU and
displayed in sysfs and dump output. However, scx_read_events(), which
aggregates per-CPU counters into a summary, was not updated to include
this event, causing it to always read as zero in sysfs, in debug dumps,
and via the scx_bpf_events() kfunc.
Add the missing scx_agg_event() call for SCX_EV_SUB_BYPASS_DISPATCH.
Fixes: 025b1bd419 ("sched_ext: Implement hierarchical bypass mode")
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When scx_bpf_dsq_move[_vtime]() is called on a task that belongs to a
different scheduler, scx_error() is invoked to flag the violation.
scx_error() schedules an asynchronous scheduler teardown via irq_work
and returns immediately, so execution falls through and the DSQ move
proceeds on a cross-scheduler task regardless, potentially corrupting
DSQ state.
Add the missing return false so the function exits right after
reporting the error, consistent with the other early-exit checks in
the same function (e.g. scx_vet_enq_flags() failure at the top).
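The shape of the fix, sketched (variable names assumed, surrounding
details abbreviated):
  if (src_sch != sch) {
          scx_error(sch, "task does not belong to this scheduler");
          return false;   /* previously missing: fell through to the move */
  }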
Fixes: bb4d9fd551 ("sched_ext: scx_dsq_move() should validate the task belongs to the right scheduler")
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Merge tag 'rcu-fixes.v7.0-20260325a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux
Pull RCU fixes from Boqun Feng:
"Fix a regression introduced by commit c27cea4416 ("rcu: Re-implement
RCU Tasks Trace in terms of SRCU-fast"): BPF contexts can run with
preemption disabled or scheduler locks held, so call_srcu() must work
in all such contexts.
Fix this by converting SRCU's spinlocks to raw spinlocks and avoiding
scheduler lock acquisition in call_srcu() by deferring to an irq_work
(similar to call_rcu_tasks_generic()), for both tree SRCU and tiny
SRCU.
Also fix a follow-on lockdep splat caused by srcu_node allocation
under the newly introduced raw spinlock by deferring the allocation to
grace-period worker context"
* tag 'rcu-fixes.v7.0-20260325a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux:
srcu: Use irq_work to start GP in tiny SRCU
rcu: Use an intermediate irq_work to start process_srcu()
srcu: Push srcu_node allocation to GP when non-preemptible
srcu: Use raw spinlocks so call_srcu() can be used under preempt_disable()
cgroup_drain_dying() was using cgroup_is_populated() to test whether there are
dying tasks to wait for. cgroup_is_populated() tests nr_populated_csets,
nr_populated_domain_children and nr_populated_threaded_children, but
cgroup_drain_dying() only needs to care about this cgroup's own tasks - whether
there are children is cgroup_destroy_locked()'s concern.
This caused hangs during shutdown. When systemd tried to rmdir a cgroup that had
no direct tasks but had a populated child, cgroup_drain_dying() would enter its
wait loop because cgroup_is_populated() was true from
nr_populated_domain_children. The task iterator found nothing to wait for, yet
the populated state never cleared because it was driven by live tasks in the
child cgroup.
Fix it by using cgroup_has_tasks() which only tests nr_populated_csets.
v3: Fix cgroup_is_populated() -> cgroup_has_tasks() (Sebastian).
v2: https://lore.kernel.org/r/20260323200205.1063629-1-tj@kernel.org
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Fixes: 1b164b876c ("cgroup: Wait for dying tasks to leave on rmdir")
Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Both smp_call_function() and smp_call_function_single() use per-CPU
call_single_data_t variable to hold the infamous CSD lock. However,
while smp_call_function() acquires the destination CPU's CSD lock,
smp_call_function_single() instead uses the source CPU's CSD lock.
(These are two separate sets of CSD locks, cfd_data and csd_data,
respectively.)
This otherwise inexplicable pair of choices is explained by their
respective queueing properties. If smp_call_function() were to
use the sending CPU's CSD lock, that would serialize the destination
CPUs' IPI handlers and result in long smp_call_function() latencies,
especially on systems with large numbers of CPUs. For its part, if
smp_call_function_single() were to use the (single) destination CPU's
CSD lock, this would similarly serialize in the case where many CPUs
are sending IPIs to a single "victim" CPU. Plus it would result in
higher levels of memory contention.
Except that if there is no NMI-based stack tracing on a weakly ordered
system where remote unsynchronized stack traces are especially unreliable,
the improved debugging beats the improved queueing. This improved queueing
only matters if a bunch of CPUs are calling smp_call_function_single()
concurrently for a single "victim" CPU, which is not the common case.
Therefore, make smp_call_function_single() use the destination CPU's
csd_data instance in kernels built with CONFIG_CSD_LOCK_WAIT_DEBUG=y
where csdlock_debug_enabled is also set. Otherwise, continue to use
the source CPU's csd_data.
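Sketched, with the static key spelled as the commit text suggests
(surrounding details assumed, not the literal patch):
  /* Destination CPU's lock for debugging, source CPU's otherwise. */
  if (static_branch_unlikely(&csdlock_debug_enabled))
          csd = &per_cpu(csd_data, cpu);
  else
          csd = this_cpu_ptr(&csd_data);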
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://patch.msgid.link/25c2eb97-77c8-49a5-80ac-efe78dea272c@paulmck-laptop
smp_call_function_single() and smp_call_function_many_cond() disable
preemption and cache the CPU number via get_cpu().
Use this cached value throughout the function instead of invoking
smp_processor_id() again.
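The pattern, sketched:
  int this_cpu = get_cpu();       /* disables preemption, caches CPU id */

  /* ... use this_cpu instead of calling smp_processor_id() again ... */

  put_cpu();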
[ tglx: Make the copy&pasta'ed change log match the patch ]
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Mukesh Kumar Chaurasiya (IBM) <mkchauras@gmail.com>
Link: https://patch.msgid.link/20260323193630.640311-4-sshegde@linux.ibm.com
Add missing kernel-doc comments and rearrange the order of others to
prevent all kernel-doc warnings.
- add function Returns: sections or format existing comments as kernel-doc
- add missing function parameter comments
- use "/**" for smp_call_function_any() and on_each_cpu_cond_mask()
- correct the commented function name for on_each_cpu_cond_mask()
- use correct format for function short descriptions
- add all kernel-doc comments for smp_call_on_cpu()
- remove kernel-doc comments for raw_smp_processor_id() since there is
no prototype for it here (other than !SMP)
- in smp.h, rearrange some lines so that the kernel-doc comments for
smp_processor_id() are immediately before the macro (to prevent
kernel-doc warnings)
- remove "Returns" from smp_call_function() since it doesn't
return a value
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260310061726.1153764-1-rdunlap@infradead.org
Merge tag 'dma-mapping-7.0-2026-03-25' into dma-mapping-for-next
dma-mapping fixes for Linux 7.0
A set of fixes for DMA-mapping subsystem, which resolve false-positive
warnings from KMSAN and DMA-API debug (Shigeru Yoshida and Leon
Romanovsky) as well as a simple build fix (Miguel Ojeda).
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tiny SRCU's srcu_gp_start_if_needed() directly calls schedule_work(),
which acquires the workqueue pool->lock.
This causes a lockdep splat when call_srcu() is called with a scheduler
lock held, due to:
call_srcu() [holding pi_lock]
srcu_gp_start_if_needed()
schedule_work() -> pool->lock
workqueue_init() / create_worker() [holding pool->lock]
wake_up_process() -> try_to_wake_up() -> pi_lock
Also add irq_work_sync() to cleanup_srcu_struct() to prevent a
use-after-free if a queued irq_work fires after cleanup begins.
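The indirection, sketched (names assumed, not the literal patch):
  static void srcu_sched_cb(struct irq_work *iwp)
  {
          /* Runs without scheduler locks held, so this is safe. */
          schedule_work(&srcu_work);
  }
  static DEFINE_IRQ_WORK(srcu_irq_work, srcu_sched_cb);

  /* In srcu_gp_start_if_needed(), instead of schedule_work(): */
  irq_work_queue(&srcu_irq_work);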
Tested with rcutorture SRCU-T and no lockdep warnings.
[ Thanks to Boqun for similar fix in patch "rcu: Use an intermediate irq_work
to start process_srcu()" ]
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun@kernel.org>
Since commit c27cea4416 ("rcu: Re-implement RCU Tasks Trace in terms
of SRCU-fast") we switched to SRCU in BPF. However as BPF instrument can
happen basically everywhere (including where a scheduler lock is held),
call_srcu() now needs to avoid acquiring scheduler lock because
otherwise it could cause deadlock [1]. Fix this by following what the
previous RCU Tasks Trace did: using an irq_work to delay the queuing of
the work to start process_srcu().
[boqun: Apply Joel's feedback]
[boqun: Apply Andrea's test feedback]
Reported-by: Andrea Righi <arighi@nvidia.com>
Closes: https://lore.kernel.org/all/abjzvz_tL_siV17s@gpd4/
Fixes: c27cea4416 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast")
Link: https://lore.kernel.org/rcu/3c4c5a29-24ea-492d-aeee-e0d9605b4183@nvidia.com/ [1]
Suggested-by: Zqiang <qiang.zhang@linux.dev>
Tested-by: Andrea Righi <arighi@nvidia.com>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Boqun Feng <boqun@kernel.org>
When the srcutree.convert_to_big and srcutree.big_cpu_lim kernel boot
parameters specify initialization-time allocation of the srcu_node
tree for statically allocated srcu_struct structures (for example, in
DEFINE_SRCU() at build time instead of init_srcu_struct() at runtime),
init_srcu_struct_nodes() will attempt to dynamically allocate this tree
at the first run-time update-side use of this srcu_struct structure,
but while holding a raw spinlock. Because the memory allocator can
acquire non-raw spinlocks, this can result in lockdep splats.
This commit therefore uses the same SRCU_SIZE_ALLOC trick that is used
when the first run-time update-side use of this srcu_struct structure
happens before srcu_init() is called. The actual allocation then takes
place from workqueue context at the ends of upcoming SRCU grace periods.
[boqun: Adjust the sha1 of the Fixes tag]
Fixes: 175b45ed34 ("srcu: Use raw spinlocks so call_srcu() can be used under preempt_disable()")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun@kernel.org>
Tree SRCU has used non-raw spinlocks for many years, motivated by a desire
to avoid unnecessary real-time latency and the absence of any reason to
use raw spinlocks. However, the recent use of SRCU in tracing as the
underlying implementation of RCU Tasks Trace means that call_srcu()
is invoked from preemption-disabled regions of code, which in turn
requires that any locks acquired by call_srcu() or its callees must be
raw spinlocks.
This commit therefore converts SRCU's spinlocks to raw spinlocks.
[boqun: Add Fixes tag]
Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Fixes: c27cea4416 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
When alloc_and_link_pwqs() fails partway through the per-cpu allocation
loop, some pool_workqueues may have already been linked into wq->pwqs
via link_pwq(). The error path frees these pwqs with kmem_cache_free()
but never removes them from the wq->pwqs list, leaving dangling pointers
in the list.
Currently this is not exploitable because the workqueue was never added
to the global workqueues list and the caller frees the wq immediately
after. However, this makes sure that alloc_and_link_pwqs() doesn't leave
any half-baked structure, which may have side effects if not properly
cleaned up.
Fix this by unlinking each pwq from wq->pwqs before freeing it. No
locking is needed as the workqueue has not been published yet, thus
no concurrency is possible.
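The fixed error path, sketched (field and cache names as in workqueue.c,
surrounding code abbreviated):
  /* No locking needed: the workqueue has not been published yet. */
  list_for_each_entry_safe(pwq, next, &wq->pwqs, pwqs_node) {
          list_del(&pwq->pwqs_node);      /* previously missing */
          kmem_cache_free(pwq_cache, pwq);
  }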
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Try to be more explicit about why the workqueue watchdog does not take
pool->lock by default. Spin locks are full memory barriers which can
delay anything. Obviously, they would primarily delay operations
on the related worker pools.
Explain why it is enough to prevent the false positive by re-checking
the timestamp under the pool->lock.
Finally, make it clear what the alternative solution would be in
__queue_work(), which is a hotter path.
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
print_scx_info() always outputs the root scx_sched structure's ops.name,
but in kernels built with CONFIG_EXT_SUB_SCHED=y, tasks may be attached
to a sub scx_sched structure. This commit therefore uses
scx_task_sched_rcu() to get the correct scx_sched structure for
outputting ops.name, and drops the state check.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
Previously different architectures were using random sources of
differing strength and cost to decide the random kstack offset. A number
of architectures (loongarch, powerpc, s390, x86) were using their
timestamp counter, at whatever the frequency happened to be. Other
arches (arm64, riscv) were using entropy from the crng via
get_random_u16().
There have been concerns that in some cases the timestamp counters may
be too weak, because they can be easily guessed or influenced by user
space. And get_random_u16() has been shown to be too costly for the
level of protection kstack offset randomization provides.
So let's use a common, architecture-agnostic source of entropy; a
per-cpu prng, seeded at boot-time from the crng. This has a few
benefits:
- We can remove choose_random_kstack_offset(); that was only there to
try to make the timestamp counter value a bit harder to influence
from user space [*].
- The architecture code is simplified. All it has to do now is call
add_random_kstack_offset() in the syscall path.
- The strength of the randomness can be reasoned about independently
of the architecture.
- Arches previously using get_random_u16() now have much faster
syscall paths, see below results.
[*] Additionally, this gets rid of some redundant work on s390 and x86.
Before this patch, those architectures called
choose_random_kstack_offset() under arch_exit_to_user_mode_prepare(),
which is also called for exception returns to userspace which were *not*
syscalls (e.g. regular interrupts). Getting rid of
choose_random_kstack_offset() avoids a small amount of redundant work
for the non-syscall cases.
In some configurations, add_random_kstack_offset() will now call
instrumentable code, so for a couple of arches, I have moved the call a
bit later to the first point where instrumentation is allowed. This
doesn't impact the efficacy of the mechanism.
There have been some claims that a prng may be less strong than the
timestamp counter if not regularly reseeded. But the prng has a period
of about 2^113. So as long as the prng state remains secret, it should
not be possible to guess. If the prng state can be accessed, we have
bigger problems.
Additionally, we are only consuming 6 bits to randomize the stack, so
there are only 64 possible random offsets. I assert that it would be
trivial for an attacker to brute force by repeating their attack and
waiting for the random stack offset to be the desired one. The prng
approach seems entirely proportional to this level of protection.
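A rough sketch of the per-cpu prng (state and helper names here are
assumptions; prandom_u32_state() is one such generator):
  /* Per-cpu PRNG state, seeded from the crng at boot. */
  static DEFINE_PER_CPU(struct rnd_state, kstack_rng);

  /* On the syscall path, preemption is already disabled here. */
  static inline u32 kstack_rng_next(void)
  {
          return prandom_u32_state(this_cpu_ptr(&kstack_rng));
  }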
Performance data are provided below. The baseline is v6.18 with rndstack
on for each respective arch. (I)/(R) indicate statistically significant
improvement/regression. arm64 platform is AWS Graviton3 (m7g.metal).
x86_64 platform is AWS Sapphire Rapids (m7i.24xlarge):
+-----------------+--------------+---------------+---------------+
| Benchmark | Result Class | per-cpu-prng | per-cpu-prng |
| | | arm64 (metal) | x86_64 (VM) |
+=================+==============+===============+===============+
| syscall/getpid | mean (ns) | (I) -9.50% | (I) -17.65% |
| | p99 (ns) | (I) -59.24% | (I) -24.41% |
| | p99.9 (ns) | (I) -59.52% | (I) -28.52% |
+-----------------+--------------+---------------+---------------+
| syscall/getppid | mean (ns) | (I) -9.52% | (I) -19.24% |
| | p99 (ns) | (I) -59.25% | (I) -25.03% |
| | p99.9 (ns) | (I) -59.50% | (I) -28.17% |
+-----------------+--------------+---------------+---------------+
| syscall/invalid | mean (ns) | (I) -10.31% | (I) -18.56% |
| | p99 (ns) | (I) -60.79% | (I) -20.06% |
| | p99.9 (ns) | (I) -61.04% | (I) -25.04% |
+-----------------+--------------+---------------+---------------+
I tested an earlier version of this change on x86 bare metal and it
showed a smaller but still significant improvement. The bare metal
system wasn't available this time around so testing was done in a VM
instance. I'm guessing the cost of rdtsc is higher for VMs.
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Link: https://patch.msgid.link/20260303150840.3789438-3-ryan.roberts@arm.com
Signed-off-by: Kees Cook <kees@kernel.org>
kstack_offset was previously maintained per-cpu, but this caused a
couple of issues. So let's instead make it per-task.
Issue 1: add_random_kstack_offset() and choose_random_kstack_offset()
expected and required to be called with interrupts and preemption
disabled so that it could manipulate per-cpu state. But arm64, loongarch
and risc-v are calling them with interrupts and preemption enabled. I
don't _think_ this causes any functional issues, but it's certainly
unexpected and could lead to manipulating the wrong cpu's state, which
could cause a minor performance degradation due to bouncing the cache
lines. By maintaining the state per-task those functions can safely be
called in preemptible context.
Issue 2: add_random_kstack_offset() is called before executing the
syscall and expands the stack using a previously chosen random offset.
choose_random_kstack_offset() is called after executing the syscall and
chooses and stores a new random offset for the next syscall. With
per-cpu storage for this offset, an attacker could force cpu migration
during the execution of the syscall and prevent the offset from being
updated for the original cpu such that it is predictable for the next
syscall on that cpu. By maintaining the state per-task, this problem
goes away because the per-task random offset is updated after the
syscall regardless of which cpu it is executing on.
Fixes: 39218ff4c6 ("stack: Optionally randomize kernel stack offset each syscall")
Closes: https://lore.kernel.org/all/dd8c37bc-795f-4c7a-9086-69e584d8ab24@arm.com/
Cc: stable@vger.kernel.org
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Link: https://patch.msgid.link/20260303150840.3789438-2-ryan.roberts@arm.com
Signed-off-by: Kees Cook <kees@kernel.org>
The testing for tracing was triggering a timestamp count issue that was
always off by one. This has been happening for some time but has never
been reported by anyone else. It was finally discovered to be an issue
with the "uptime" (jiffies) clock that happened to be traced and the
internal recursion caused the discrepancy. This would have been much
easier to solve if the clock function being used was displayed when the
error was detected.
Add the clock function to the error output.
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260323202212.479bb288@gandalf.local.home
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The commit f35dbac694 ("ring-buffer: Fix to update per-subbuf entries of
persistent ring buffer") was a fix and merged upstream. It is needed for
some other work in the ring buffer. The current branch has the remote
buffer code that is shared with the Arm64 subsystem and can't be rebased.
Merge in the upstream commit to allow continuing of the ring buffer work.
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Calling smp_processor_id():
- In CONFIG_DEBUG_PREEMPT=y, if preemption/irq is disabled, then it does
not print any warning.
- In CONFIG_DEBUG_PREEMPT=n, it doesn't do anything apart from getting
__smp_processor_id.
So with both CONFIG_DEBUG_PREEMPT=y/n, in a preemption-disabled section
it is better to cache the value. It saves a few cycles; though tiny,
repeated calls add up.
timer_clear_idle() is called with interrupts disabled, so cache the
value once.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Mukesh Kumar Chaurasiya (IBM) <mkchauras@gmail.com>
Link: https://patch.msgid.link/20260323193630.640311-5-sshegde@linux.ibm.com
The DEVMAP_HASH branch in dev_map_redirect_multi() uses
hlist_for_each_entry_safe() to iterate hash buckets, but this function
runs under RCU protection (called from xdp_do_generic_redirect_map()
in softirq context). Concurrent writers (__dev_map_hash_update_elem,
dev_map_hash_delete_elem) modify the list using RCU primitives
(hlist_add_head_rcu, hlist_del_rcu).
hlist_for_each_entry_safe() performs plain pointer dereferences without
rcu_dereference(), missing the acquire barrier needed to pair with
writers' rcu_assign_pointer(). On weakly-ordered architectures (ARM64,
POWER), a reader can observe a partially-constructed node. It also
defeats CONFIG_PROVE_RCU lockdep validation and KCSAN data-race
detection.
Replace with hlist_for_each_entry_rcu() using rcu_read_lock_bh_held()
as the lockdep condition, consistent with the rcu_dereference_check()
used in the DEVMAP (non-hash) branch of the same functions. Also fix
the same incorrect lockdep_is_held(&dtab->index_lock) condition in
dev_map_enqueue_multi(), where the lock is not held either.
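The fixed iteration in the DEVMAP_HASH branch, sketched:
  hlist_for_each_entry_rcu(dst, head, index_hlist,
                           rcu_read_lock_bh_held())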
Fixes: e624d4ed4a ("xdp: Extend xdp_redirect_map with broadcast support")
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260320072645.16731-1-devnexen@gmail.com
alarm_timer_forward() passes arguments to alarm_forward() in the wrong
order:
alarm_forward(alarm, timr->it_interval, now);
However, alarm_forward() is defined as:
u64 alarm_forward(struct alarm *alarm, ktime_t now, ktime_t interval);
and uses the second argument as the current time:
delta = ktime_sub(now, alarm->node.expires);
Passing the interval as "now" results in incorrect delta computation,
which can lead to missed expirations or incorrect overrun accounting.
This issue has been present since the introduction of
alarm_timer_forward().
Fix this by swapping the arguments.
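The fix is the one-line swap:
  - alarm_forward(alarm, timr->it_interval, now);
  + alarm_forward(alarm, now, timr->it_interval);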
Fixes: e7561f1633 ("alarmtimer: Implement forward callback")
Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260323061130.29991-1-zhanxusheng@xiaomi.com
The purpose of the constant is not entirely clear from its name.
As this constant is going to be exposed in a UAPI header, give it a more
specific name for clarity. As all its users call it 'marker', use that
wording in the constant itself.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Reviewed-by: Nicolas Schier <nsc@kernel.org>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
This enum originates in generic cryptographic code and has a very
generic name. Nowadays it is only used for module signatures.
As this enum is going to be exposed in a UAPI header, give it a more
specific name for clarity and consistency.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Reviewed-by: Nicolas Schier <nsc@kernel.org>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
The function btf_check_kfunc_arg_match() was refactored into
check_kfunc_args() by commit 00b85860fe ("bpf: Rewrite kfunc
argument handling"). Update the comment accordingly.
Assisted-by: unnamed:deepseek-v3.2 coccinelle
Signed-off-by: Kexin Sun <kexinsun@smail.nju.edu.cn>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://lore.kernel.org/r/20260321105658.6006-1-kexinsun@smail.nju.edu.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add BPF verifier support for single- and multi-level pointer
parameters and return values in BPF trampolines by treating these
parameters as SCALAR_VALUE.
This extends the existing support for int and void pointers that are
already treated as SCALAR_VALUE.
This provides consistent logic for single and multi-level pointers:
if a type is treated as SCALAR for a single-level pointer, the same
applies to multi-level pointers. The exception is pointer-to-struct,
which is currently PTR_TO_BTF_ID for single-level but treated as
scalar for multi-level pointers since the verifier lacks context
to infer the size of target memory regions.
Safety is ensured by existing BTF verification, which rejects invalid
pointer types at the BTF verification stage.
Signed-off-by: Slava Imameev <slava.imameev@crowdstrike.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260314082127.7939-2-slava.imameev@crowdstrike.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
a72f73c4dd ("cgroup: Don't expose dead tasks in cgroup") hid PF_EXITING
tasks from cgroup.procs so that systemd doesn't see tasks that have already
been reaped via waitpid(). However, the populated counter (nr_populated_csets)
is only decremented when the task later passes through cgroup_task_dead() in
finish_task_switch(). This means cgroup.procs can appear empty while the
cgroup is still populated, causing rmdir to fail with -EBUSY.
Fix this by making cgroup_rmdir() wait for dying tasks to fully leave. If the
cgroup is populated but all remaining tasks have PF_EXITING set (the task
iterator returns none due to the existing filter), wait for a kick from
cgroup_task_dead() and retry. The wait is brief as tasks are removed from the
cgroup's css_set between PF_EXITING assertion in do_exit() and
cgroup_task_dead() in finish_task_switch().
v2: cgroup_is_populated() true to false transition happens under css_set_lock
not cgroup_mutex, so retest under css_set_lock before sleeping to avoid
missed wakeups (Sebastian).
Fixes: a72f73c4dd ("cgroup: Don't expose dead tasks in cgroup")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202603222104.2c81684e-lkp@intel.com
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Bert Karwatzki <spasswolf@web.de>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: cgroups@vger.kernel.org
The current implementation only checks whether the first argument is
refcounted. Fix this by iterating over all arguments.
Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr>
Fixes: 38f1e66abd ("bpf: Do not allow tail call in strcut_ops program with __ref argument")
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Amery Hung <ameryhung@gmail.com>
Link: https://lore.kernel.org/r/20260320130219.63711-1-keisuke.nishimura@inria.fr
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
kvmemdup_bpfptr() returns -EFAULT when the user pointer cannot be
copied, and -ENOMEM on allocation failure. The error path always
returned -ENOMEM, misreporting bad addresses as out-of-memory.
Return PTR_ERR(sig) so user space gets the correct errno.
Signed-off-by: Weixie Cui <cuiweixie@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/tencent_C9C5B2B28413D6303D505CD02BFEA4708C07@qq.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
If fprobe_entry does not fill the allocated fgraph_data completely, the
unused part does not have to be zeroed.
fgraph_data is a short-lived part of the shadow stack. The preceding
length field allows locating the end regardless of the content.
Link: https://lore.kernel.org/all/20260324084804.375764-1-martin@kaiser.cx/
Signed-off-by: Martin Kaiser <martin@kaiser.cx>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Simplify tnum_step() from a 10-variable algorithm into a straight-line
sequence of bitwise operations.
Problem Reduction:
tnum_step(): Given a tnum `(tval, tmask)` where `tval & tmask == 0`,
and a value `z` with `tval ≤ z < (tval | tmask)`, find the smallest
`r > z`, a tnum-satisfying value, i.e., `r & ~tmask == tval`.
Every tnum-satisfying value has the form tval | s where s is a subset
of tmask bits (s & ~tmask == 0). Since tval and tmask are disjoint:
tval | s = tval + s
Similarly z = tval + d where d = z - tval, so r > z becomes:
tval + s > tval + d
s > d
The problem reduces to: find the smallest s, a subset of tmask, such
that s > d.
Notice that `s` must be a subset of tmask; the problem is now simplified.
Algorithm:
The mask bits of `d` form a "counter" that we want to increment by one,
but the counter has gaps at the fixed-bit positions. A normal +1 would
stop at the first 0-bit it meets; we need it to skip over fixed-bit
gaps and land on the next mask bit.
Step 1 -- plug the gaps:
d | carry_mask | ~tmask
- ~tmask fills all fixed-bit positions with 1.
- carry_mask = (1 << fls64(d & ~tmask)) - 1 fills all positions
(including mask positions) below the highest non-mask bit of d.
After this, the only remaining 0s are mask bits above the highest
non-mask bit of d where d is also 0 -- exactly the positions where
the carry can validly land.
Step 2 -- increment:
(d | carry_mask | ~tmask) + 1
Adding 1 flips all trailing 1s to 0 and sets the first 0 to 1. Since
every gap has been plugged, that first 0 is guaranteed to be a mask bit
above all non-mask bits of d.
Step 3 -- mask:
((d | carry_mask | ~tmask) + 1) & tmask
Strip the scaffolding, keeping only mask bits. Call the result inc.
Step 4 -- result:
tval | inc
Reattach the fixed bits.
A simple 8-bit example:
tmask: 1 1 0 1 0 1 1 0
d: 1 0 1 0 0 0 1 0 (d = 162)
^
non-mask 1 at bit 5
With carry_mask = 0b00111111 (smeared from bit 5):
d|carry|~tm 1 0 1 1 1 1 1 1
+ 1 1 1 0 0 0 0 0 0
& tmask 1 1 0 0 0 0 0 0
The patch passes my local test: test_verifier, test_progs for
`-t verifier` and `-t reg_bounds`.
CBMC shows the new code is equivalent to the original one [1], and
a Lean 4 proof of correctness is available [2]:
theorem tnumStep_correct (tval tmask z : BitVec 64)
-- Precondition: valid tnum and input z
(h_consistent : (tval &&& tmask) = 0)
(h_lo : tval ≤ z)
(h_hi : z < (tval ||| tmask)) :
-- Postcondition: r must be:
-- (1) tnum member
-- (2) z < r
-- (3) for any other member w > z, r <= w
let r := tnumStep tval tmask z
satisfiesTnum64 r tval tmask ∧
tval ≤ r ∧ r ≤ (tval ||| tmask) ∧
z < r ∧
∀ w, satisfiesTnum64 w tval tmask → z < w → r ≤ w := by
-- unfold definition
unfold tnumStep satisfiesTnum64
simp only []
refine ⟨?_, ?_, ?_, ?_, ?_⟩
-- the solver proves each conjunct
· bv_decide
· bv_decide
· bv_decide
· bv_decide
· intro w hw1 hw2; bv_decide
[1] https://github.com/eddyz87/tnum-step-verif/blob/master/main.c
[2] https://pastebin.com/raw/czHKiyY0
Signed-off-by: Hao Sun <hao.sun@inf.ethz.ch>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Reviewed-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com>
Link: https://lore.kernel.org/r/20260320162336.166542-1-hao.sun@inf.ethz.ch
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This was renamed in commit 23ef9d4397 ("kcfi: Rename CONFIG_CFI_CLANG
to CONFIG_CFI") as it is now a compiler-agnostic option. Using the wrong
name results in the code getting compiled out, meaning the CFI failures
for btf_dtor_kfunc_t would still trigger.
Fixes: 99fde4d062 ("bpf, btf: Enforce destructor kfunc type with CFI")
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260312183818.2721750-1-cmllamas@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Since commit 603b441623 ("bpf: Update the bpf_prog_calc_tag to use
SHA256") made BPF program tags use SHA-256 instead of SHA-1, the header
<crypto/sha1.h> no longer needs to be included. Remove the relevant
inclusions so that they no longer unnecessarily come up in searches for
which kernel code is still using the obsolete SHA-1 algorithm.
Since net/ipv6/addrconf.c was relying on the transitive inclusion of
<crypto/sha1.h> (for an unrelated purpose) via <linux/filter.h>, make it
include <crypto/sha1.h> explicitly in order to keep that file building.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Acked-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/20260314214555.112386-1-ebiggers@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Calling smp_processor_id():
- In CONFIG_DEBUG_PREEMPT=y, if preemption/irq is disabled, then it does
not print any warning.
- In CONFIG_DEBUG_PREEMPT=n, it doesn't do anything apart from getting
__smp_processor_id.
So with both CONFIG_DEBUG_PREEMPT=y/n, in a preemption-disabled section
it is better to cache the value. It could save a few cycles; though
tiny, repeated calls could add up to a small value.
ttwu_queue_cond() is called with interrupts disabled, so preemption is
disabled. Hence cache the value once instead.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mukesh Kumar Chaurasiya (IBM) <mkchauras@gmail.com>
Link: https://patch.msgid.link/20260323193630.640311-3-sshegde@linux.ibm.com
Calling smp_processor_id():
- In CONFIG_DEBUG_PREEMPT=y, if preemption/irq is disabled, then it does
not print any warning.
- In CONFIG_DEBUG_PREEMPT=n, it doesn't do anything apart from getting
__smp_processor_id.
So with both CONFIG_DEBUG_PREEMPT=y/n, in a preemption-disabled section
it is better to cache the value. It could save a few cycles; though
tiny, repeated in a loop it could add up to a small value.
find_new_ilb() is called in interrupt context, so preemption is
disabled. Hoist the this_cpu read out of the loop.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mukesh Kumar Chaurasiya (IBM) <mkchauras@gmail.com>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260323193630.640311-2-sshegde@linux.ibm.com
Currently, print_function_args() prints enum parameter values
in decimal format, reducing trace log readability.
Use BTF information to resolve enum parameters and print their
symbolic names (where available). This improves readability by
showing meaningful identifiers instead of raw numbers.
Before:
mod_memcg_lruvec_state(lruvec=0xffff..., idx=5, val=320)
After:
mod_memcg_lruvec_state(lruvec=0xffff..., idx=5 [NR_SLAB_RECLAIMABLE_B], val=320)
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://patch.msgid.link/20260209071949.4040193-1-dolinux.peng@gmail.com
Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The tracing_open_file_tr() function currently copies the trace_event_file
pointer from inode->i_private to file->private_data when the file is
successfully opened. This duplication is not particularly useful, as all
event code should utilize event_file_file() or event_file_data() to
retrieve a trace_event_file pointer from a file struct and these access
functions read file->f_inode->i_private. Moreover, this setup requires the
code for opening hist files to explicitly clear file->private_data before
calling single_open(), since this function expects the private_data member
to be set to NULL and uses it to store a pointer to a seq_file.
Remove the unnecessary setting of file->private_data in
tracing_open_file_tr() and simplify the hist code.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Link: https://patch.msgid.link/20260219162737.314231-6-petr.pavlu@suse.com
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The tracing code provides two functions event_file_file() and
event_file_data() to obtain a trace_event_file pointer from a file struct.
The primary method to use is event_file_file(), as it checks for the
EVENT_FILE_FL_FREED flag to determine whether the event is being removed.
The second function event_file_data() is an optimization for retrieving the
same data when the event_mutex is still held.
In the past, when removing an event directory in remove_event_file_dir(),
the code set i_private to NULL for all event files and readers were
expected to check for this state to recognize that the event is being
removed. In the case of event_id_read(), the value was read using
event_file_data() without acquiring the event_mutex. This required
event_file_data() to use READ_ONCE() when retrieving the i_private data.
With the introduction of eventfs, i_private is assigned when an eventfs
inode is allocated and remains set throughout its lifetime.
Remove the now unnecessary READ_ONCE() access to i_private in both
event_file_file() and event_file_data(). Inline the access to i_private in
remove_event_file_dir(), which allows event_file_data() to handle i_private
solely as a trace_event_file pointer. Add a check in event_file_data() to
ensure that the event_mutex is held and that file->flags doesn't have the
EVENT_FILE_FL_FREED flag set. Finally, move event_file_data() immediately
after event_file_file() since the latter provides a comment explaining how
both functions should be used together.
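Sketched from the description (not the literal patch), the strengthened
helper could look like:
  static inline void *event_file_data(struct file *filp)
  {
          struct trace_event_file *file = filp->f_inode->i_private;

          lockdep_assert_held(&event_mutex);
          WARN_ON_ONCE(file->flags & EVENT_FILE_FL_FREED);

          return file;
  }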
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Link: https://patch.msgid.link/20260219162737.314231-5-petr.pavlu@suse.com
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The event_filter_write() function calls event_file_file() to retrieve
a trace_event_file associated with a given file struct. If a non-NULL
pointer is returned, the function then checks whether the trace_event_file
instance has the EVENT_FILE_FL_FREED flag set. This check is redundant
because event_file_file() already performs this validation and returns NULL
if the flag is set. The err value is also already initialized to -ENODEV.
Remove the unnecessary check for EVENT_FILE_FL_FREED in
event_filter_write().
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Link: https://patch.msgid.link/20260219162737.314231-4-petr.pavlu@suse.com
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The sunrpc change to use trace_printk() for debugging caused
a new warning for every instance of dprintk() in some configurations,
when -Wformat-security is enabled:
fs/nfs/getroot.c: In function 'nfs_get_root':
fs/nfs/getroot.c:90:17: error: format not a string literal and no format arguments [-Werror=format-security]
90 | nfs_errorf(fc, "NFS: Couldn't getattr on root");
I've been slowly chipping away at those warnings over time with the
intention of enabling them by default in the future. While I could not
figure out why this only happens for this one instance, I see that the
__trace_bprintk() function is always called with a local variable as
the format string, rather than a literal.
Move the __printf(2,3) annotation on this function from the declaration
to the caller. As this is can only be validated for literals, the
attribute on the declaration causes the warnings every time, but
removing it entirely introduces a new warning on the __ftrace_vbprintk()
definition.
The format strings still get checked because the underlying literal keeps
getting passed into __trace_printk() in the "else" branch, which is not
taken but still evaluated for compile-time warnings.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Anna Schumaker <anna.schumaker@oracle.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Yury Norov <ynorov@nvidia.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Link: https://patch.msgid.link/20260203164545.3174910-1-arnd@kernel.org
Fixes: ec7d8e68ef ("sunrpc: add a Kconfig option to redirect dfprintk() output to trace buffer")
Acked-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When scx_alloc_and_add_sched() creates the sub-scheduler kset, it sets
sch->kobj as the parent. Because sch->kobj.kset points to scx_kset,
registering this sub-kset triggers a KOBJ_ADD uevent. The uevent walk
finds scx_kset and calls scx_uevent() with the sub-kset's kobject.
scx_uevent() unconditionally uses container_of() to cast the incoming
kobject to struct scx_sched, producing a wild pointer when the kobject
belongs to the kset itself rather than a scheduler instance. Accessing
sch->ops.name through this pointer causes a KASAN slab-out-of-bounds
read:
BUG: KASAN: slab-out-of-bounds in string+0x3b6/0x4c0
Read of size 1 at addr ffff888004d04348 by task scx_enable_help/748
Call Trace:
string+0x3b6/0x4c0
vsnprintf+0x3ec/0x1550
add_uevent_var+0x160/0x3a0
scx_uevent+0x22/0x30
kobject_uevent_env+0x5dc/0x1730
kset_register+0x192/0x280
scx_alloc_and_add_sched+0x130d/0x1c60
...
Fix this by checking the kobject's ktype against scx_ktype before
performing the cast, and returning 0 for non-matching kobjects.
Tested with vng and scx_qmap without triggering any KASAN errors.
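A minimal sketch of the added check (function layout is illustrative, not
the exact patch):

  static int scx_uevent(const struct kobject *kobj, struct kobj_uevent_env *env)
  {
          const struct scx_sched *sch;

          /* the sub-kset's own kobject also reaches this callback;
           * only scheduler instances are backed by scx_ktype */
          if (get_ktype(kobj) != &scx_ktype)
                  return 0;

          sch = container_of_const(kobj, struct scx_sched, kobj);
          return add_uevent_var(env, "SCXOPS=%s", sch->ops.name);
  }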
Fixes: ebeca1f930 ("sched_ext: Introduce cgroup sub-sched support")
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The function freezable_schedule() was removed in commit
f5d39b0208 ("freezer,sched: Rewrite core freezer logic"), which
rewrote the freezer to use a dedicated TASK_FROZEN state instead.
do_signal_stop() and ptrace_stop() no longer call
freezable_schedule(); they now set TASK_STOPPED/TASK_TRACED and the
freezer handles those states directly via TASK_FROZEN. Update the
comment to reflect the current mechanism.
Assisted-by: unnamed:deepseek-v3.2 coccinelle
Signed-off-by: Kexin Sun <kexinsun@smail.nju.edu.cn>
Link: https://patch.msgid.link/20260321105927.7979-1-kexinsun@smail.nju.edu.cn
Signed-off-by: Christian Brauner <brauner@kernel.org>
Commit 35e4a69b20 ("PM: sleep: Allow pm_restrict_gfp_mask()
stacking") introduced refcount-based GFP mask management that warns
when pm_restore_gfp_mask() is called with saved_gfp_count == 0.
Some hibernation paths call pm_restore_gfp_mask() defensively where
the GFP mask may or may not be restricted depending on the execution
path. For example, the uswsusp interface invokes it in
SNAPSHOT_CREATE_IMAGE, SNAPSHOT_UNFREEZE, and snapshot_release().
Before the stacking change this was a silent no-op; it now triggers
a spurious WARNING.
Remove the WARN_ON() wrapper from the !saved_gfp_count check while
retaining the check itself, so that defensive calls remain harmless
without producing false warnings.
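Illustratively:

  /* was: if (WARN_ON(!saved_gfp_count)) return; */
  if (!saved_gfp_count)
          return;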
Fixes: 35e4a69b20 ("PM: sleep: Allow pm_restrict_gfp_mask() stacking")
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
[ rjw: Subject tweak ]
Link: https://patch.msgid.link/20260322120528.750178-1-youngjun.park@lge.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit 005e8dddd4 ("PM: hibernate: don't store zero pages in the
image file") added an optimization to skip zero-filled pages in the
hibernation image. On restore, zero pages are handled internally by
snapshot_write_next() in a loop that processes them without returning
to the caller.
With the userspace restore interface, writing the last non-zero page
to /dev/snapshot is followed by the SNAPSHOT_ATOMIC_RESTORE ioctl. At
this point there are no more calls to snapshot_write_next() so any
trailing zero pages are not processed, snapshot_image_loaded() fails
because handle->cur is smaller than expected, the ioctl returns -EPERM
and the image is not restored.
The in-kernel restore path is not affected by this because the loop in
load_image() in swap.c calls snapshot_write_next() until it returns 0.
It is this final call that drains any trailing zero pages.
Fix this by calling snapshot_write_next() in snapshot_write_finalize(),
giving the kernel the chance to drain any trailing zero pages.
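A sketch of the idea (error handling abridged):

  /* one more pass lets the zero-page loop inside
   * snapshot_write_next() run to completion */
  error = snapshot_write_next(handle);
  if (error < 0)
          return error;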
Fixes: 005e8dddd4 ("PM: hibernate: don't store zero pages in the image file")
Signed-off-by: Alberto Garcia <berto@igalia.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Link: https://patch.msgid.link/ef5a7c5e3e3dbd17dcb20efaa0c53a47a23498bb.1773075892.git.berto@igalia.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When the check_undefined command in kernel/trace/Makefile fails, there
is no output, making it hard to understand why the build failed. Capture
the output of the $(NM) + grep command and print it when failing to make
it clearer what the problem is.
Fixes: a717943d8e ("tracing: Check for undefined symbols in simple_ring_buffer")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Vincent Donnefort <vdonnefort@google.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20260320-cmd_check_undefined-verbose-v1-1-54fc5b061f94@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cross-merge BPF and other fixes after downstream PR.
Minor conflicts in:
tools/testing/selftests/bpf/progs/exceptions_fail.c
tools/testing/selftests/bpf/progs/verifier_bounds.c
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
schedule_deferred() uses irq_work_queue() which always queues on the
calling CPU. The deferred work can run from any CPU correctly, and the
_locked() path already processes remote rqs from the calling CPU. However,
when falling through to the irq_work path, queuing on the target CPU is
preferable as the work can run sooner via IPI delivery rather than waiting
for the calling CPU to re-enable IRQs.
Currently, only reenqueue operations use this path - either BPF-initiated
reenqueue targeting a remote rq, or IMMED reenqueue when the target CPU is
busy running userspace (not in balance or wakeup, so the _locked() fast
paths aren't available). Use irq_work_queue_on() to target the owning CPU.
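Illustratively (the irq_work member name is an assumption):

  /* queue on the rq's CPU so the work can run via IPI instead of
   * waiting for the calling CPU to re-enable IRQs */
  irq_work_queue_on(&rq->scx.deferred_irq_work, cpu_of(rq));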
This improves IMMED reenqueue latency when tasks are dispatched to
remote local DSQs. Testing on a 24-CPU AMD Ryzen 3900X with scx_qmap
-I -F 50 (ALWAYS_ENQ_IMMED, every 50th enqueue forced to prev_cpu's
local DSQ) under heavy mixed load (2x CPU oversubscription, yield and
context-switch pressure, SCHED_FIFO bursts, periodic fork storms, mixed
nice levels, C-states disabled), measuring local DSQ residence time
(insert to remove) over 5 x 120s runs (~1.2M tasks per set):
>128us outliers: 71 -> 39 (-45%)
>256us outliers: 59 -> 36 (-39%)
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
When building with SCHED_CLASS_EXT=y but CGROUPS=n, clang reports errors
for undeclared cgroup_put() and cgroup_get() calls, and a warning for the
unused err_stop_helper label.
EXT_SUB_SCHED is def_bool y depending only on SCHED_CLASS_EXT, but it
fundamentally requires cgroups (cgroup_path, cgroup_get, cgroup_put,
cgroup_id, etc.). Add the missing CGROUPS dependency to EXT_SUB_SCHED in
init/Kconfig.
Guard cgroup_put() and cgroup_get() in the common paths with:
#if defined(CONFIG_EXT_GROUP_SCHED) || defined(CONFIG_EXT_SUB_SCHED)
Guard the err_stop_helper label with #ifdef CONFIG_EXT_SUB_SCHED since
all gotos targeting it are inside that same ifdef block.
Tested with both CGROUPS enabled and disabled.
Fixes: ebeca1f930 ("sched_ext: Introduce cgroup sub-sched support")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202603210903.IrKhPd6k-lkp@intel.com/
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Alexei Starovoitov:
- Fix how linked registers track zero extension of subregisters (Daniel
Borkmann)
- Fix unsound scalar fork for OR instructions (Daniel Wade)
- Fix exception exit lock check for subprogs (Ihor Solodrai)
- Fix undefined behavior in interpreter for SDIV/SMOD instructions
(Jenny Guanni Qu)
- Release module's BTF when module is unloaded (Kumar Kartikeya
Dwivedi)
- Fix constant blinding for PROBE_MEM32 instructions (Sachin Kumar)
- Reset register ID for END instructions to prevent incorrect value
tracking (Yazhou Tang)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Add a test cases for sync_linked_regs regarding zext propagation
bpf: Fix sync_linked_regs regarding BPF_ADD_CONST32 zext propagation
selftests/bpf: Add tests for maybe_fork_scalars() OR vs AND handling
bpf: Fix unsound scalar forking in maybe_fork_scalars() for BPF_OR
selftests/bpf: Add tests for sdiv32/smod32 with INT_MIN dividend
bpf: Fix undefined behavior in interpreter sdiv/smod for INT_MIN
selftests/bpf: Add tests for bpf_throw lock leak from subprogs
bpf: Fix exception exit lock checking for subprogs
bpf: Release module BTF IDR before module unload
selftests/bpf: Fix pkg-config call on static builds
bpf: Fix constant blinding for PROBE_MEM32 stores
selftests/bpf: Add test for BPF_END register ID reset
bpf: Reset register ID for BPF_END value tracking
- Revert "tracing: Remove pid in task_rename tracing output"
A change was made to remove the pid field from the task_rename event
because it was thought that it was always done for the current task and
recording the pid would be redundant. This turned out to be incorrect and
there are a few corner case where this is not true and caused some
regressions in tooling.
- Fix the reading from user space for migration
The reading of user space uses a seq lock type of logic where it uses a
per-cpu temporary buffer and disables migration, then enables preemption,
does the copy from user space, disables preemption, enables migration and
checks if there was any schedule switches while preemption was enabled. If
there was a context switch, then it is considered that the per-cpu buffer
could be corrupted and it tries again. There's a protection check that
tests if it takes a hundred tries, it issues a warning and exits out to
prevent a live lock.
This was triggered because the task was selected by the load balancer to
be migrated to another CPU, every time preemption is enabled the migration
task would schedule in try to migrate the task but can't because migration
is disabled and let it run again. This caused the scheduler to schedule out
the task every time it enabled preemption and made the loop never exit
(until the 100 iteration test triggered).
Fix this by enabling and disabling preemption and keeping migration
enabled if the reading from user space needs to be done again. This will
let the migration thread migrate the task and the copy from user space
will likely pass on the next iteration.
- Fix trace_marker copy option freeing
The "copy_trace_marker" option allows a tracing instance to get a copy of
a write to the trace_marker file of the top level instance. This is
managed by a link list protected by RCU. When an instance is removed, a
check is made if the option is set, and if so synchronized_rcu() is
called. The problem is that an iteration is made to reset all the flags to
what they were when the instance was created (to perform clean ups) was
done before the check of the copy_trace_marker option and that option was
cleared, so the synchronize_rcu() was never called.
Move the clearing of all the flags after the check of copy_trace_marker to
do synchronize_rcu() so that the option is still set if it was before and
the synchronization is performed.
- Fix entries setting when validating the persistent ring buffer
When validating the persistent ring buffer on boot up, the number of
events per sub-buffer is added to the sub-buffer meta page. The validator
was updating cpu_buffer->head_page (the first sub-buffer of the per-cpu
buffer) and not the "head_page" variable that was iterating the
sub-buffers. This was causing the first sub-buffer to be assigned the
entries for each sub-buffer and not the sub-buffer that was supposed to be
updated.
- Use "hash" value to update the direct callers
When updating the ftrace direct callers, it assigned a temporary callback
to all the callback functions of the ftrace ops and not just the
functions represented by the passed in hash. This causes an unnecessary
slow down of the functions of the ftrace_ops that is not being modified.
Only update the functions that are going to be modified to call the
ftrace loop function so that the update can be made on those functions.
-----BEGIN PGP SIGNATURE-----
iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCacAMahQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qr0sAQCoI4L3iAR5HU1z8dw2GWhOz9fTnzfw
9VPRZAsga9J5xgEA1Y0bvKBM0UPHFAL2POkaILYV1aT00lZ7aIVHPqfdYgA=
=OoGW
-----END PGP SIGNATURE-----
Merge tag 'trace-v7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Revert "tracing: Remove pid in task_rename tracing output"
A change was made to remove the pid field from the task_rename event
because it was thought that it was always done for the current task
and recording the pid would be redundant. This turned out to be
incorrect: there are a few corner cases where this is not true, and
they caused some regressions in tooling.
- Fix the reading from user space for migration
The reading of user space uses seqlock-style logic: it grabs a
per-cpu temporary buffer and disables migration, then enables
preemption, does the copy from user space, disables preemption,
enables migration, and checks whether any schedule switches happened
while preemption was enabled. If there was a context switch, the
per-cpu buffer is considered possibly corrupted and the read is
tried again. As protection against a live lock, if it takes a
hundred tries, a warning is issued and the function bails out.
This was triggered because the task was selected by the load
balancer to be migrated to another CPU. Every time preemption was
enabled, the migration thread would schedule in, try to migrate the
task, fail because migration was disabled, and let it run again.
This caused the scheduler to schedule out the task every time it
enabled preemption and made the loop never exit (until the 100
iteration test triggered).
Fix this by enabling and disabling preemption and keeping migration
enabled if the reading from user space needs to be done again. This
will let the migration thread migrate the task and the copy from user
space will likely pass on the next iteration.
- Fix trace_marker copy option freeing
The "copy_trace_marker" option allows a tracing instance to get a
copy of a write to the trace_marker file of the top level instance.
This is managed by a linked list protected by RCU. When an instance is
removed, a check is made if the option is set, and if so
synchronize_rcu() is called.
The problem is that the pass that resets all the flags to what they
were when the instance was created (to perform cleanups) ran before
the check of the copy_trace_marker option. By then that option had
already been cleared, so synchronize_rcu() was never called.
Move the clearing of all the flags after the check of
copy_trace_marker to do synchronize_rcu() so that the option is still
set if it was before and the synchronization is performed.
- Fix entries setting when validating the persistent ring buffer
When validating the persistent ring buffer on boot up, the number of
events per sub-buffer is added to the sub-buffer meta page. The
validator was updating cpu_buffer->head_page (the first sub-buffer of
the per-cpu buffer) and not the "head_page" variable that was
iterating the sub-buffers. This was causing the first sub-buffer to
be assigned the entries for each sub-buffer and not the sub-buffer
that was supposed to be updated.
- Use "hash" value to update the direct callers
When updating the ftrace direct callers, it assigned a temporary
callback to all the callback functions of the ftrace ops and not just
the functions represented by the passed-in hash. This causes an
unnecessary slowdown of the functions of the ftrace_ops that are not
being modified. Only update the functions that are going to be
modified to call the ftrace loop function so that the update can be
made on those functions.
* tag 'trace-v7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ftrace: Use hash argument for tmp_ops in update_ftrace_direct_mod
ring-buffer: Fix to update per-subbuf entries of persistent ring buffer
tracing: Fix trace_marker copy link list updates
tracing: Fix failure to read user space from system call trace events
tracing: Revert "tracing: Remove pid in task_rename tracing output"
Merge tag 'perf-urgent-2026-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
- Fix a PMU driver crash on AMD EPYC systems, caused by
a race condition in x86_pmu_enable()
- Fix a possible counter-initialization bug in x86_pmu_enable()
- Fix a counter inheritance bug in inherit_event() and
__perf_event_read()
- Fix an Intel PMU driver branch constraints handling bug
found by UBSAN
- Fix the Intel PMU driver's new Off-Module Response (OMR)
support code for Diamond Rapids / Nova lake, to fix a snoop
information parsing bug
* tag 'perf-urgent-2026-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Fix OMR snoop information parsing issues
perf/x86/intel: Add missing branch counters constraint apply
perf: Make sure to use pmu_ctx->pmu for groups
x86/perf: Make sure to program the counter value for stopped events on migration
perf/x86: Move event pointer setup earlier in x86_pmu_enable()
On weakly ordered architectures (e.g., arm64), the lockless check in
wq_watchdog_timer_fn() can observe a reordering between the worklist
insertion and the last_progress_ts update. Specifically, the watchdog
can see a non-empty worklist (from a list_add) while reading a stale
last_progress_ts value, causing a false positive stall report.
This was confirmed by reading pool->last_progress_ts again after holding
pool->lock in wq_watchdog_timer_fn():
workqueue watchdog: pool 7 false positive detected!
lockless_ts=4784580465 locked_ts=4785033728
diff=453263ms worklist_empty=0
To avoid slowing down the hot path (queue_work, etc.), recheck
last_progress_ts with pool->lock held. This will eliminate the false
positive with minimal overhead.
Remove two extra empty lines in wq_watchdog_timer_fn() while we are at it.
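A sketch of the recheck (types and threshold handling simplified):

  unsigned long ts;

  /* the lockless read may have raced with a concurrent list_add;
   * confirm the stall with pool->lock held before reporting it */
  raw_spin_lock_irq(&pool->lock);
  ts = READ_ONCE(pool->last_progress_ts);
  raw_spin_unlock_irq(&pool->lock);
  if (time_after(ts + thresh, jiffies))
          continue;       /* progress was made: false positive */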
Fixes: 82607adcf9 ("workqueue: implement lockup detector")
Cc: stable@vger.kernel.org # v4.5+
Assisted-by: claude-code:claude-opus-4-6
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
syzbot reported the following warning:
DEAD callback error for CPU1
WARNING: kernel/cpu.c:1463 at _cpu_down+0x759/0x1020 kernel/cpu.c:1463, CPU#0: syz.0.1960/14614
at commit 4ae12d8bd9 ("Merge tag 'kbuild-fixes-7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux")
which tglx traced to padata_cpu_dead() given it's the only
sub-CPUHP_TEARDOWN_CPU callback that returns an error.
Failure isn't allowed in hotplug states before CPUHP_TEARDOWN_CPU, so
move the CPU offline callback to the ONLINE section where failure is
possible.
Fixes: 894c9ef978 ("padata: validate cpumask without removed CPU during offline")
Reported-by: syzbot+123e1b70473ce213f3af@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69af0a05.050a0220.310d8.002f.GAE@google.com/
Debugged-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
In the WAKE_SYNC path of scx_select_cpu_dfl(), waker_node was computed
with cpu_to_node(), while node (for prev_cpu) was computed with
scx_cpu_node_if_enabled(). When scx_builtin_idle_per_node is disabled,
idle_cpumask(waker_node) is called with a real node ID even though
per-node idle tracking is disabled, resulting in undefined behavior.
Fix by using scx_cpu_node_if_enabled() for waker_node as well, ensuring
both variables are computed consistently.
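Illustratively:

  int node = scx_cpu_node_if_enabled(prev_cpu);
  /* was: cpu_to_node(cpu), which yields a real node ID even when
   * per-node idle tracking is disabled */
  int waker_node = scx_cpu_node_if_enabled(cpu);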
Fixes: 48849271e6 ("sched_ext: idle: Per-node idle cpumasks")
Cc: stable@vger.kernel.org # v6.15+
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The modify logic registers a temporary ftrace_ops object (tmp_ops) to
trigger the slow path so that all direct callers can safely have their
attached addresses modified.
At the moment we use ops->func_hash for the tmp_ops filter, which
represents all of the system's attachments. It's faster to use just the
passed hash filter, which contains only the modified sites and is always
a subset of ops->func_hash.
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Menglong Dong <menglong8.dong@gmail.com>
Cc: Song Liu <song@kernel.org>
Link: https://patch.msgid.link/20260312123738.129926-1-jolsa@kernel.org
Fixes: e93672f770 ("ftrace: Add update_ftrace_direct_mod function")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Since the validation loop in rb_meta_validate_events() updates the same
cpu_buffer->head_page->entries, the entries of the other sub-buffers are
not updated. Fix it to use head_page to update the entries field, since
it is the cursor in this loop.
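Illustratively, inside the loop (field access simplified):

  /* update the sub-buffer the loop is currently visiting, not the
   * first sub-buffer of the CPU buffer */
  head_page->entries = entries;   /* was: cpu_buffer->head_page->entries */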
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ian Rogers <irogers@google.com>
Fixes: 5f3b6e839f ("ring-buffer: Validate boot range memory events")
Link: https://patch.msgid.link/177391153882.193994.17158784065013676533.stgit@mhiramat.tok.corp.google.com
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When the "copy_trace_marker" option is enabled for an instance, anything
written into /sys/kernel/tracing/trace_marker is also copied into that
instances buffer. When the option is set, that instance's trace_array
descriptor is added to the marker_copies link list. This list is protected
by RCU, as all iterations uses an RCU protected list traversal.
When the instance is deleted, all the flags that were enabled are cleared.
This also clears the copy_trace_marker flag and removes the trace_array
descriptor from the list.
The issue is that after the flags are cleared, a direct call to
update_marker_trace() is performed to clear the flag. This function
returns true if the state of the flag changed and false otherwise. If it
returns true here, synchronize_rcu() is called to make sure all readers
see that it has been removed from the list.
But since the flag was already cleared, the state does not change and the
synchronization is never called, leaving a possible UAF bug.
Move the clearing of all flags below the updating of the copy_trace_marker
option which then makes sure the synchronization is performed.
Also use the flag for checking the state in update_marker_trace() instead
of looking at whether the list is empty.
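A sketch of the corrected ordering (helper usage per the description
above, not the exact patch):

  /* clear the option while the flag is still set so the state
   * change is observed and readers get synchronized */
  if (update_marker_trace(tr, 0))
          synchronize_rcu();
  /* only then reset the remaining instance flags for cleanup */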
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260318185512.1b6c7db4@gandalf.local.home
Fixes: 7b382efd5e ("tracing: Allow the top level trace_marker to write into another instances")
Reported-by: Sasha Levin <sashal@kernel.org>
Closes: https://lore.kernel.org/all/20260225133122.237275-1-sashal@kernel.org/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The system call trace events call trace_user_fault_read() to read the user
space part of some system calls. This is done by grabbing a per-cpu
buffer, disabling migration, enabling preemption, calling
copy_from_user(), disabling preemption, enabling migration and checking if
the task was preempted while preemption was enabled. If it was, the buffer
is considered corrupted and it tries again.
There's a safety mechanism that will fail out of this loop if it fails 100
times (with a warning). That warning message was triggered in some
pi_futex stress tests. Enabling the sched_switch trace event and
traceoff_on_warning showed the problem:
pi_mutex_hammer-1375 [006] d..21 138.981648: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981651: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981656: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981659: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981664: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981667: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981671: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981675: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981679: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981682: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981687: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981690: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981695: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981698: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981703: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981706: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981711: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981714: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981719: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981722: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981727: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981730: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981735: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981738: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
What happened was that task 1375 was flagged to be migrated. When
preemption was enabled, the migration thread woke up to migrate that task,
but failed because migration for that task was disabled. This caused the
loop to fail to exit because the task scheduled out while trying to read
user space.
Every time the task enabled preemption the migration thread would schedule
in, try to migrate the task, fail and let the task continue. But because
the loop would only enable preemption with migration disabled, it would
always fail because each time it enabled preemption to read user space,
the migration thread would try to migrate it.
To solve this, when the loop fails to read user space without being
scheduled out, enable and disable preemption with migration enabled. This
will allow the migration task to successfully migrate the task and the
next loop should succeed to read user space without being scheduled out.
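A sketch of the retry path (loop structure elided):

  /* scheduled out during the copy: open a migration window so the
   * migration thread can move the task, then try again */
  migrate_enable();
  preempt_enable();
  preempt_disable();
  migrate_disable();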
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260316130734.1858a998@gandalf.local.home
Fixes: 64cf7d058a ("tracing: Have trace_marker use per-cpu data to read user space")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Jenny reported that in sync_linked_regs() the BPF_ADD_CONST32 flag is
checked on known_reg (the register narrowed by a conditional branch)
instead of reg (the linked target register created by an alu32 operation).
Example case with reg:
1. r6 = bpf_get_prandom_u32()
2. r7 = r6 (linked, same id)
3. w7 += 5 (alu32 -- r7 gets BPF_ADD_CONST32, zero-extended by CPU)
4. if w6 < 0xFFFFFFFC goto safe (narrows r6 to [0xFFFFFFFC, 0xFFFFFFFF])
5. sync_linked_regs() propagates to r7 but does NOT call zext_32_to_64()
6. Verifier thinks r7 is [0x100000001, 0x100000004] instead of [1, 4]
Since known_reg above does not have BPF_ADD_CONST32 set, zext_32_to_64()
is never called on alu32-derived linked registers. This causes the verifier
to track incorrect 64-bit bounds, while the CPU correctly zero-extends the
32-bit result.
The code checking known_reg->id was correct however (see scalars_alu32_wrap
selftest case), but the real fix needs to handle both directions - zext
propagation should be done when either register has BPF_ADD_CONST32, since
the linked relationship involves a 32-bit operation regardless of which
side has the flag.
Example case with known_reg (exercised also by scalars_alu32_wrap):
1. r1 = r0; w1 += 0x100 (alu32 -- r1 gets BPF_ADD_CONST32)
2. if r1 > 0x80 - known_reg = r1 (has BPF_ADD_CONST32), reg = r0 (doesn't)
Hence, fix it by checking for (reg->id | known_reg->id) & BPF_ADD_CONST32.
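Illustratively, in sync_linked_regs():

  /* the link involves a 32-bit op if either side carries the flag */
  if ((reg->id | known_reg->id) & BPF_ADD_CONST32)
          zext_32_to_64(reg);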
Moreover, sync_linked_regs() also has a soundness issue when two linked
registers used different ALU widths: one with BPF_ADD_CONST32 and the
other with BPF_ADD_CONST64. The delta relationship between linked registers
assumes the same arithmetic width though. When one register went through
alu32 (CPU zero-extends the 32-bit result) and the other went through
alu64 (no zero-extension), the propagation produces incorrect bounds.
Example:
r6 = bpf_get_prandom_u32() // fully unknown
if r6 >= 0x100000000 goto out // constrain r6 to [0, U32_MAX]
r7 = r6
w7 += 1 // alu32: r7.id = N | BPF_ADD_CONST32
r8 = r6
r8 += 2 // alu64: r8.id = N | BPF_ADD_CONST64
if r7 < 0xFFFFFFFF goto out // narrows r7 to [0xFFFFFFFF, 0xFFFFFFFF]
At the branch on r7, sync_linked_regs() runs with known_reg=r7
(BPF_ADD_CONST32) and reg=r8 (BPF_ADD_CONST64). The delta path
computes:
r8 = r7 + (delta_r8 - delta_r7) = 0xFFFFFFFF + (2 - 1) = 0x100000000
Then, because known_reg->id has BPF_ADD_CONST32, zext_32_to_64(r8) is
called, truncating r8 to [0, 0]. But r8 used a 64-bit ALU op -- the
CPU does NOT zero-extend it. The actual CPU value of r8 is
0xFFFFFFFE + 2 = 0x100000000, not 0. The verifier now underestimates
r8's 64-bit bounds, which is a soundness violation.
Fix sync_linked_regs() by skipping propagation when the two registers
have mixed ALU widths (one BPF_ADD_CONST32, the other BPF_ADD_CONST64).
Lastly, fix regsafe() used for path pruning: the existing checks used
"& BPF_ADD_CONST" to test for offset linkage, which treated
BPF_ADD_CONST32 and BPF_ADD_CONST64 as equivalent.
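Rough sketches of the other two fixes (placement and surrounding logic
elided; register names as in the verifier):

  /* sync_linked_regs(): mixed ALU widths break the delta math */
  if ((reg->id & BPF_ADD_CONST32 && known_reg->id & BPF_ADD_CONST64) ||
      (reg->id & BPF_ADD_CONST64 && known_reg->id & BPF_ADD_CONST32))
          continue;

  /* regsafe(): compare the exact width flags, not just BPF_ADD_CONST */
  if ((rold->id & (BPF_ADD_CONST32 | BPF_ADD_CONST64)) !=
      (rcur->id & (BPF_ADD_CONST32 | BPF_ADD_CONST64)))
          return false;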
Fixes: 7a433e5193 ("bpf: Support negative offsets, BPF_SUB, and alu32 for linked register tracking")
Reported-by: Jenny Guanni Qu <qguanni@gmail.com>
Co-developed-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260319211507.213816-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
maybe_fork_scalars() is called for both BPF_AND and BPF_OR when the
source operand is a constant. When dst has signed range [-1, 0], it
forks the verifier state: the pushed path gets dst = 0, the current
path gets dst = -1.
For BPF_AND this is correct: 0 & K == 0.
For BPF_OR this is wrong: 0 | K == K, not 0.
The pushed path therefore tracks dst as 0 when the runtime value is K,
producing an exploitable verifier/runtime divergence that allows
out-of-bounds map access.
Fix this by passing env->insn_idx (instead of env->insn_idx + 1) to
push_stack(), so the pushed path re-executes the ALU instruction with
dst = 0 and naturally computes the correct result for any opcode.
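Illustratively (variable naming is ours):

  /* fork at the current instruction so the pushed state re-executes
   * the ALU op with dst = 0; was env->insn_idx + 1 */
  branch = push_stack(env, env->insn_idx, env->prev_insn_idx, false);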
Fixes: bffacdb80b ("bpf: Recognize special arithmetic shift in the verifier")
Signed-off-by: Daniel Wade <danjwade95@gmail.com>
Reviewed-by: Amery Hung <ameryhung@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260314021521.128361-2-danjwade95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The BPF interpreter's signed 32-bit division and modulo handlers use
the kernel abs() macro on s32 operands. The abs() macro documentation
(include/linux/math.h) explicitly states the result is undefined when
the input is the type minimum. When DST contains S32_MIN (0x80000000),
abs((s32)DST) triggers undefined behavior and returns S32_MIN unchanged
on arm64/x86. This value is then sign-extended to u64 as
0xFFFFFFFF80000000, causing do_div() to compute the wrong result.
The verifier's abstract interpretation (scalar32_min_max_sdiv) computes
the mathematically correct result for range tracking, creating a
verifier/interpreter mismatch that can be exploited for out-of-bounds
map value access.
Introduce abs_s32() which handles S32_MIN correctly by casting to u32
before negating, avoiding signed overflow entirely. Replace all 8
abs((s32)...) call sites in the interpreter's sdiv32/smod32 handlers.
s32 is the only affected case -- the s64 division/modulo handlers do
not use abs().
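A sketch of the helper (semantics per the description above):

  /* |v| computed in unsigned arithmetic: for v == S32_MIN the
   * negation happens in u32, where 0x80000000 is representable,
   * so no signed overflow occurs */
  static inline u32 abs_s32(s32 v)
  {
          return v < 0 ? -(u32)v : (u32)v;
  }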
Fixes: ec0e2da95f ("bpf: Support new signed div/mod instructions.")
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Jenny Guanni Qu <qguanni@gmail.com>
Link: https://lore.kernel.org/r/20260311011116.2108005-2-qguanni@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The sleepable context check for global function calls in
check_func_call() open-codes the same checks that in_sleepable_context()
already performs. Replace the open-coded check with a call to
in_sleepable_context() and use non_sleepable_context_description() for
the error message, consistent with check_helper_call() and
check_kfunc_call().
Note that in_sleepable_context() also checks active_locks, which
overlaps with the existing active_locks check above it. However, the two
checks serve different purposes: the active_locks check rejects all
global function calls while holding a lock (not just sleepable ones), so
it must remain as a separate guard.
Update the expected error messages in the irq and preempt_lock selftests
to match.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260318174327.3151925-4-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
check_kfunc_call() has multiple scattered checks that reject sleepable
kfuncs in various non-sleepable contexts (RCU, preempt-disabled, IRQ-
disabled). These are the same conditions already checked by
in_sleepable_context(), so replace them with a single consolidated
check.
This also simplifies the preempt lock tracking by flattening the nested
if/else structure into a linear chain: preempt_disable increments,
preempt_enable checks for underflow and decrements. The sleepable check
is kept as a separate block since it is logically distinct from the lock
accounting.
No functional change since in_sleepable_context() checks all the same
state (active_rcu_locks, active_preempt_locks, active_locks,
active_irq_id, in_sleepable).
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260318174327.3151925-3-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
check_helper_call() prints the error message for every
env->cur_state->active* element when calling a sleepable helper.
Consolidate all of them into a single print statement.
The check for env->cur_state->active_locks was not part of the removed
print statements and will not be triggered by the consolidated print
either, because it is checked in do_check() before check_helper_call()
is even reached.
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260318174327.3151925-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
process_bpf_exit_full() passes check_lock = !curframe to
check_resource_leak(), which is false in cases when bpf_throw() is
called from a static subprog. This makes check_resource_leak() skip
validation of active_rcu_locks, active_preempt_locks, and
active_irq_id on exception exits from subprogs.
At runtime bpf_throw() unwinds the stack via ORC without releasing any
user-acquired locks, which may cause various issues as a result.
Fix by setting check_lock = true for exception exits regardless of
curframe, since exceptions bypass all intermediate frame
cleanup. Update the error message prefix to "bpf_throw" for exception
exits to distinguish them from normal BPF_EXIT.
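Roughly (flag computation per the description; the surrounding call is
elided):

  /* exceptions bypass intermediate frame cleanup, so lock state must
   * be validated even when throwing from a subprog frame */
  bool check_lock = exception_exit || !curframe;
  const char *prefix = exception_exit ? "bpf_throw" : "BPF_EXIT";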
Fix reject_subprog_with_rcu_read_lock test which was previously
passing for the wrong reason. Test program returned directly from the
subprog call without closing the RCU section, so the error was
triggered by the unclosed RCU lock on normal exit, not by
bpf_throw. Update __msg annotations for affected tests to match the
new "bpf_throw" error prefix.
The spin_lock case is not affected because spin locks are already
checked [1] at the call site in do_check_insn() before bpf_throw can run.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/bpf/verifier.c?h=v7.0-rc4#n21098
Assisted-by: Claude:claude-opus-4-6
Fixes: f18b03faba ("bpf: Implement BPF exceptions")
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260320000809.643798-1-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
dmemcg_parse_limit does not use the region parameter. Remove it.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In the default built-in idle CPU selection policy, when @prev_cpu is
busy and no fully idle core is available, try to place the task on its
SMT sibling if that sibling is idle, before searching any other idle CPU
in the same LLC.
Migration to the sibling is cheap and keeps the task on the same core,
preserving L1 cache and reducing wakeup latency.
On large SMT systems this appears to consistently boost throughput by
roughly 2-3% on CPU-bound workloads (running a number of tasks equal to
the number of SMT cores).
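A rough sketch of the added step (helper names follow the sched_ext idle
code but are assumptions here):

  int sibling;

  /* prev_cpu is busy and no fully idle core is available: an idle
   * SMT sibling keeps the task on the same core and L1 cache */
  for_each_cpu(sibling, cpu_smt_mask(prev_cpu)) {
          if (sibling != prev_cpu && scx_idle_test_and_clear_cpu(sibling))
                  return sibling;
  }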
Cc: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Merge tag 'v7.0-rc4' into timers/core, to resolve conflict
Resolve conflict between this change in the upstream kernel:
4c652a4772 ("rseq: Mark rseq_arm_slice_extension_timer() __always_inline")
... and this pending change in timers/core:
0e98eb1481 ("entry: Prepare for deferred hrtimer rearming")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
snapshot_image_loaded() is used in both the in-kernel and the
userspace restore path to ensure that the snapshot image has been
completely loaded. However, the latter path returns -EPERM in such
situations, which is meant for cases where the operation is neither
write-only nor ready.
This patch updates the check so the returned error code is -ENODATA in
both cases.
Suggested-by: Brian Geffon <bgeffon@google.com>
Signed-off-by: Alberto Garcia <berto@igalia.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Link: https://patch.msgid.link/8cfda38659c623f5392f3458cb32504ffd556a74.1773075892.git.berto@igalia.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This effectively gives us the ability to create the pid namespace init as
a child of a process (setns-ed into the pid namespace) different from the
process which created the pid namespace itself.
Original problem:
There is a cool set_tid feature in the clone3() syscall: it allows you to
create a process with desired pids on multiple pid namespace levels,
which is useful for restoring processes in CRIU in the nested pid
namespace case.
In nested container case we can potentially see this kind of pid/user
namespace tree:
Process
┌─────────┐
User NS0 ──▶ Pid NS0 ──▶ Pid p0 │
│ │ │ │
▼ ▼ │ │
User NS1 ──▶ Pid NS1 ──▶ Pid p1 │
│ │ │ │
... ... │ ... │
│ │ │ │
▼ ▼ │ │
User NSn ──▶ Pid NSn ──▶ Pid pn │
└─────────┘
So to create the "Process" and set pids {p0, p1, ... pn} for it on all
pid namespace levels we can use clone3() syscall set_tid feature, BUT
the syscall does not allow you to set pid on pid namespace levels you
don't have permission to. So basically you have to be in "User NS0" when
creating the "Process" to actually be able to set pids on all levels.
It is ok for almost any process, but with pid namespace init this does
not work, as currently we can only create pid namespace init and the pid
namespace itself simultaneously, so to make "Pid NSn" owned by "User
NSn" we have to be in the "User NSn".
We can't possibly be in "User NS0" and "User NSn" at the same time,
hence the problem.
Alternative solution:
Yes, for the case of pid namespace init we can use the old and gold
/proc/sys/kernel/ns_last_pid interface on the levels lower than n. But
it is much more complicated and introduces tons of extra code. It
would be nice to make the clone3() set_tid interface also applicable to
this corner case.
Implementation:
Now that anyone can setns() into the pid namespace before the creation of
init, and thus multiple processes can fork children into the pid
namespace, it is important that we enforce that the first process created
is always the pid namespace init. (Note that this was done by the
previous preparatory patch as a standalone useful change.) We only allow
other processes after init sets pid_namespace->child_reaper.
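A minimal sketch of the enforcement in alloc_pid() (condition as quoted
in the previous patch; exact placement is illustrative):

  /* only init (pid 1) may be created until init has published
   * itself as the namespace's child reaper */
  if (tid != 1 && !READ_ONCE(tmp->child_reaper))
          goto out_free;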
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
--
v2: Use *_ONCE for ->child_reaper access atomicity, and avoid taking
the tasklist lock for reading it. Rebase to master, and thus remove
the now-excess pidns_ready variable.
v3: Split the *_ONCE change and the "init is first" checks into separate
commits.
v5: Add Andrei's review tag.
Note: I didn't find anything in copy_process() around setting the
->child_reaper which can influence the pid namespace, so it looks like
the pid namespace is fully set up at the point when init sets
->child_reaper to receive more processes. Thus the tasklist lock looks
excess in pidns_for_children_get()'s ->child_reaper check and it should
be safe not to have it in the corresponding check in alloc_pid()
(introduced earlier in this series).
Link: https://patch.msgid.link/20260318122157.280595-4-ptikhomirov@virtuozzo.com
Acked-by: Andrei Vagin <avagin@google.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
This moves the condition (tid != 1 && !tmp->child_reaper) to after the
idr allocation, so it not only covers the case where the first process in
a pid namespace must have pid 1 when clone3(set_tid) requests a wrong
pid, but also the case where the idr itself gives a wrong pid for some
reason.
The latter could've been the case before this patch: when, while creating
the first process, the alloc_pid()->pidfs_add_pid() code path fails,
idr->idr_next is no longer zero and the next process calling alloc_pid()
will get 2 as a pid from idr_alloc_cyclic(). Though thanks to the
PIDNS_ADDING logic, free_pid() disables further pid allocation in this
case and it does not lead to any real problem.
Note: This is also a preparation for the next patch in the series, which
will introduce the ability to create init from a task different from the
task which created the pid namespace. This is needed to make sure that
init is always first, even in this new case.
--
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Link: https://patch.msgid.link/20260318122157.280595-3-ptikhomirov@virtuozzo.com
v3: Split from main commit. Merge two checks of ->child_reaper into one.
v4: Update commit message about PIDNS_ADDING.
v5: Add Andrei's review tag.
Signed-off-by: Christian Brauner <brauner@kernel.org>
To avoid potential problems related to CPU/compiler optimizations around
->child_reaper, let's use WRITE_ONCE (in addition to the tasklist lock)
everywhere we write it and use READ_ONCE where we read it without an
explicit lock. Note: It also pairs with the existing READ_ONCE with no
lock in nsfs_fh_to_dentry().
Also let's add ASSERT_EXCLUSIVE_WRITER before the write to tell KCSAN
that we don't expect any concurrent ->child_reaper modifications, and
that those must be detected.
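Illustratively:

  /* writer side, under the tasklist lock */
  ASSERT_EXCLUSIVE_WRITER(pid_ns->child_reaper);
  WRITE_ONCE(pid_ns->child_reaper, task);

  /* lockless reader side, e.g. nsfs_fh_to_dentry() */
  reaper = READ_ONCE(pid_ns->child_reaper);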
--
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Link: https://patch.msgid.link/20260318122157.280595-2-ptikhomirov@virtuozzo.com
v3: Split from main commit. Add ASSERT_EXCLUSIVE_WRITER.
v5: Add one more READ_ONCE for access without lock in free_pid().
Signed-off-by: Christian Brauner <brauner@kernel.org>
The clocksource watchdog code has over time reached the state of an
impenetrable maze of duct tape and staples. The original design, which was
made in the context of systems far smaller than today, is based on the
assumption that the to-be-monitored clocksource (TSC) can be trivially
compared against a known-to-be-stable clocksource (HPET/ACPI-PM timer).
Over the years it turned out that this approach has major flaws:
- Long delays between watchdog invocations can result in wraparounds
of the reference clocksource
- Scalability of the reference clocksource readout can degrade on large
multi-socket systems due to interconnect congestion
This was addressed with various heuristics which degraded the accuracy of
the watchdog to the point that it fails to detect actual TSC problems on
older hardware which exposes slow inter-CPU drifts due to firmware
manipulating the TSC to hide SMI time.
To address this and bring back sanity to the watchdog, rewrite the code
completely with a different approach:
1) Restrict the validation against a reference clocksource to the boot
CPU, which is usually the CPU/Socket closest to the legacy block which
contains the reference source (HPET/ACPI-PM timer). Validate that the
reference readout is within a bound latency so that the actual
comparison against the TSC stays within 500ppm as long as the clocks
are stable.
2) Compare the TSCs of the other CPUs in a round robin fashion against
the boot CPU in the same way the TSC synchronization on CPU hotplug
works. This still can suffer from delayed reaction of the remote CPU
to the SMP function call and the latency of the control variable cache
line. But this latency is not affecting correctness. It only affects
the accuracy. With low contention the readout latency is in the low
nanoseconds range, which detects even slight skews between CPUs. Under
high contention this becomes obviously less accurate, but still
detects slow skews reliably as it solely relies on subsequent readouts
being monotonically increasing. It just can take slightly longer to
detect the issue.
3) Rewrite the watchdog test so it tests the various mechanisms one by
one and validates the result against the expectation.
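As a rough sketch of the pairwise check in (2), with all names
hypothetical:

  /* correctness relies only on subsequent remote readouts being
   * monotonically increasing relative to the boot CPU */
  u64 remote = wd_read_remote_tsc(cpu);   /* hypothetical readout */

  if (remote < per_cpu(wd_last_remote_tsc, cpu))
          mark_tsc_unstable("watchdog: inter-CPU skew detected");
  per_cpu(wd_last_remote_tsc, cpu) = remote;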
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Daniel J Blueman <daniel@quora.org>
Reviewed-by: Jiri Wiesner <jwiesner@suse.de>
Reviewed-by: Daniel J Blueman <daniel@quora.org>
Link: https://patch.msgid.link/20260123231521.926490888@kernel.org
Link: https://patch.msgid.link/87h5qeomm5.ffs@tglx
DMA_ATTR_REQUIRE_COHERENT indicates that SWIOTLB must not be used.
Ensure the SWIOTLB path is declined whenever the DMA direct path is
selected.
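Illustratively, in the DMA direct mapping path (placement is an
assumption):

  /* DMA_ATTR_REQUIRE_COHERENT forbids bounce buffering: fail the
   * mapping instead of silently falling back to SWIOTLB */
  if (unlikely(attrs & DMA_ATTR_REQUIRE_COHERENT))
          return DMA_MAPPING_ERROR;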
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20260316-dma-debug-overlap-v3-5-1dde90a7f08b@nvidia.com
Buffers mapped with this attribute require a DMA coherent system.
This means that they can't take the SWIOTLB path, can have CPU cache
line overlap, and don't undergo cache flushing.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20260316-dma-debug-overlap-v3-4-1dde90a7f08b@nvidia.com
Rename the DMA_ATTR_CPU_CACHE_CLEAN attribute to better reflect that it
is a debugging aid to inform the DMA core code that CPU cache line
overlaps are allowed, and refine the documentation describing its use.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20260316-dma-debug-overlap-v3-3-1dde90a7f08b@nvidia.com
Add /sys/module/*/import_ns to expose imported namespaces for
currently loaded modules. The file contains one namespace per line and
only exists for modules that import at least one namespace.
Previously, the only way for userspace to inspect the symbol
namespaces a module imports was to locate the .ko on disk and invoke
modinfo(8) to decompress/parse the metadata. The kernel validated
namespaces at load time, but the information was otherwise discarded.
Exposing this data via sysfs provides a runtime mechanism to verify
which namespaces are being used by modules. For example, this allows
userspace to audit driver API access in Android GKI, which uses symbol
namespaces to restrict vendor drivers from using specific kernel
interfaces (e.g., direct filesystem access).
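A hedged sketch of such a read-only attribute (the namespace iteration
helper is hypothetical; the real module metadata layout may differ):

  static ssize_t import_ns_show(struct module_attribute *mattr,
                                struct module_kobject *mk, char *buf)
  {
          ssize_t len = 0;
          const char *ns;

          /* one imported namespace per line */
          for_each_imported_namespace(mk->mod, ns)        /* hypothetical */
                  len += sysfs_emit_at(buf, len, "%s\n", ns);
          return len;
  }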
Signed-off-by: Nicholas Sielicki <linux@opensource.nslick.com>
[Sami: Updated the commit message to explain motivation.]
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
While very unlikely, local storage theoretically may leak memory of the
size of "struct bpf_local_storage" when destroy() fails to grab
local_storage->lock and initializes selem->local_storage before a
racing map_free() sees it. Warn the user to allow debugging the issue
instead of leaking the memory silently.
Note that test_maps in bpf selftests already stress tested
bpf_selem_unlink_nofail() by creating 4096 sockets and then immediately
destroying them in multiple threads. With 64 threads, 64 x 4096 socket
local storages were created and destroyed during the test and no warning
in the function was triggered.
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://patch.msgid.link/20260318224219.615105-1-ameryhung@gmail.com
Currently, local storage may deadlock when deferring the freeing of a
selem or local storage through kfree_rcu(), call_rcu() or
call_rcu_tasks_trace() in NMI or reentrant contexts. Since deleting a
selem in NMI is an unlikely use case, partially mitigate it by returning
an error when calling from bpf_xxx_storage_delete() helpers in NMI. Note
that it is still possible to deadlock through reentrancy. A full
mitigation requires returning an error when irqs_disabled() is true,
which, however, is too heavy-handed for bpf_xxx_storage_delete().
The long-term solution requires _nolock versions of call_rcu. Another
possible solution is to defer the free through irq_work [0], but it
would grow the size of selem, which is non-ideal.
The check is only needed in bpf_selem_unlink(), which is used by helpers
and syscalls. bpf_selem_unlink_nofail() is fine as it is called during
map and owner teardown, which never runs in NMI or reentrant contexts.
[0] https://lore.kernel.org/bpf/20260205190233.912-1-alexei.starovoitov@gmail.com/
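Illustratively, early in bpf_selem_unlink() (the error value is an
assumption):

  /* deferring the free via kfree_rcu()/call_rcu() can deadlock in
   * NMI context, so refuse the deletion there */
  if (in_nmi())
          return -EBUSY;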
Fixes: a10787e6d5 ("bpf: Enable task local storage for tracing programs")
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://patch.msgid.link/20260319025716.2361065-1-ameryhung@gmail.com
Merge tag 'pm-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix an idle loop issue exposed by recent changes and a race
condition related to device removal in the runtime PM core code:
- Consolidate the handling of two special cases in the idle loop that
occur when only one CPU idle state is present (Rafael Wysocki)
- Fix a race condition related to device removal in the runtime PM
core code that may cause a stale device object pointer to be
dereferenced (Bart Van Assche)"
* tag 'pm-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: runtime: Fix a race condition related to device removal
sched: idle: Consolidate the handling of two special cases
Gregory reported in [0] that the global_map_resize test, when run
repeatedly, ends up failing during program load. This stems from the fact
that BTF reference has not dropped to zero after the previous run's
module is unloaded, and the older module's BTF is still discoverable and
visible. Later, in libbpf, load_module_btfs() will find the ID for this
stale BTF, open its fd, and then it will be used during program load
where later steps taking module reference using btf_try_get_module()
fail since the underlying module for the BTF is gone.
Logically, once a module is unloaded, its associated BTF artifacts
should become hidden. The BTF object inside the kernel may still remain
alive as long as references to it are held, but it should no longer be
discoverable.
To fix this, let us call btf_free_id() from the MODULE_STATE_GOING case
for the module unload to free the BTF associated IDR entry, and disable
its discovery once module unload returns to user space. If a race
happens during unload, the outcome is non-deterministic anyway. However,
user space should be able to rely on the guarantee that once it has
synchronously established a successful module unload, no more stale
artifacts associated with this module can be obtained subsequently.
Note that we must be careful to not invoke btf_free_id() in btf_put()
when btf_is_module() is true now. There could be a window where the
module unload drops a non-terminal reference, frees the IDR, but the
same ID gets reused and the second unconditional btf_free_id() ends up
releasing an unrelated entry.
To avoid special-casing btf_is_module(), set btf->id to zero to make
btf_free_id() idempotent, so that we can unconditionally invoke it
from btf_put(), and also from the MODULE_STATE_GOING case. Since zero is
an invalid IDR ID, the idr_remove() is then a no-op.
Note that we can be sure that by the time we reach final btf_put() for
btf_is_module() case, the btf_free_id() is already done, since the
module itself holds the BTF reference, and it will call this function
for the BTF before dropping its own reference.
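A sketch of the idempotent-free pattern being described; the names
mirror the kernel's BTF code, but this is not the literal diff:

static void btf_free_id(struct btf *btf)
{
	unsigned long flags;

	spin_lock_irqsave(&btf_idr_lock, flags);
	/*
	 * btf->id == 0 is never a valid IDR entry, so a second call
	 * (e.g. from btf_put() after MODULE_STATE_GOING already freed
	 * the ID) is a harmless no-op.
	 */
	if (btf->id) {
		idr_remove(&btf_idr, btf->id);
		btf->id = 0;
	}
	spin_unlock_irqrestore(&btf_idr_lock, flags);
}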
[0]: https://lore.kernel.org/bpf/cover.1773170190.git.grbell@redhat.com
Fixes: 36e68442d1 ("bpf: Load and verify kernel module BTFs")
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Reported-by: Gregory Bell <grbell@redhat.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260312205307.1346991-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* Use the preferred `unsigned int` over plain `unsigned` for the `num`
parameter.
* Synchronize the parameter names in moduleparam.h with the ones used by
the implementation in params.c.
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
When setting a charp module parameter, the param_set_charp() function
allocates memory to store a copy of the input value. Later, when the module
is potentially unloaded, the destroy_params() function is called to free
this allocated memory.
However, destroy_params() is available only when CONFIG_SYSFS=y, otherwise
only a dummy variant is present. In the unlikely case that the kernel is
configured with CONFIG_MODULES=y and CONFIG_SYSFS=n, this results in
a memory leak of charp values when a module is unloaded.
Fix this issue by making destroy_params() always available when
CONFIG_MODULES=y. Rename the function to module_destroy_params() to clarify
that it is intended for use by the module loader.
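The cleanup loop itself is small; a sketch of an always-available
module_destroy_params(), based on the behavior described above rather
than the literal patch:

void module_destroy_params(const struct kernel_param *params,
			   unsigned int num)
{
	unsigned int i;

	/* Free charp (and similar) values allocated when the
	 * parameter was set.
	 */
	for (i = 0; i < num; i++)
		if (params[i].ops->free)
			params[i].ops->free(params[i].arg);
}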
Fixes: e180a6b775 ("param: fix charp parameters set via sysfs")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Use the "sd_llc" passed to select_idle_cpu() to obtain the
"sd_llc_shared" instead of dereferencing the per-CPU variable.
Since "sd->shared" is always reclaimed at the same time as "sd" via
call_rcu() and update_top_cache_domain() always ensures a valid
"sd->shared" assignment when "sd_llc" is present, "sd_llc->shared" can
always be dereferenced without needing an additional check.
While at it, move the cpumask_and() operation after the SIS_UTIL bailout
check to avoid unnecessarily computing the cpumask.
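Roughly, the change in select_idle_cpu() amounts to the following
(sketch, not the literal diff):

	/* Before: extra per-CPU dereference plus validity check */
	sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));

	/* After: sd passed in is sd_llc, whose ->shared is always valid */
	sd_share = sd->shared;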
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/20260312044434.1974-10-kprateek.nayak@amd.com
Only the topmost SD_SHARE_LLC domain has the "sd->shared" assigned.
Simply use "sd->shared" as an indicator for load balancing at the highest
SD_SHARE_LLC domain in update_idle_cpu_scan() instead of relying on
llc_size.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/20260312044434.1974-9-kprateek.nayak@amd.com
select_task_rq_fair() is always called with p->pi_lock held and IRQs
disabled which makes it equivalent of an RCU read-side.
Since commit 71fedc41c2 ("sched/fair: Switch to
rcu_dereference_all()") switched to using rcu_dereference_all() in the
wakeup path, drop the explicit rcu_read_{lock,unlock}() in the fair
task's wakeup path.
Future plans to reuse select_task_rq_fair() /
find_energy_efficient_cpu() in the fair class' balance callback will do
so with IRQs disabled and will comply with the requirements of
rcu_dereference_all(), keeping this change safe with those future
development plans in mind.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/20260312044434.1974-8-kprateek.nayak@amd.com
Similar to commit 71fedc41c2 ("sched/fair: Switch to
rcu_dereference_all()"), switch to checking for rcu_read_lock_any_held()
in idle_get_state() to allow removing superfluous rcu_read_lock()
regions in the fair task's wakeup path where the pi_lock is held and
IRQs are disabled.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/20260312044434.1974-7-kprateek.nayak@amd.com
Now that "sd->shared" assignments are using the sched_domain_shared
objects allocated with s_data, remove the sd_data based allocations.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/20260312044434.1974-6-kprateek.nayak@amd.com
Use the "sched_domain_shared" object allocated in s_data for
"sd->shared" assignments. Assign "sd->shared" for the topmost
SD_SHARE_LLC domain before degeneration and rely on the degeneration
path to correctly pass down the shared object to "sd_llc".
sd_degenerate_parent() ensures degenerating domains must have the same
sched_domain_span() which ensures 1:1 passing down of the shared object.
If the topmost SD_SHARE_LLC domain degenerates, the shared object is
freed from destroy_sched_domain() when the last reference is dropped.
claim_allocations() NULLs out the objects that have been assigned as
"sd->shared" and the unassigned ones are freed from the __sds_free()
path.
To keep all the claim_allocations() bits in one place,
claim_allocations() has been extended to accept "s_data" and iterate the
domains internally to free both "sched_domain_shared" and the
per-topology-level data for the particular CPU in one place.
Post cpu_attach_domain(), all reclaims of "sd->shared" are handled via
call_rcu() on the sched_domain object via destroy_sched_domains_rcu().
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/20260312044434.1974-5-kprateek.nayak@amd.com
The "sched_domain_shared" object is allocated for every topology level
in __sdt_alloc() and is freed post sched domain rebuild if they aren't
assigned during sd_init().
"sd->shared" is only assigned for SD_SHARE_LLC domains and out of all
the assigned objects, only "sd_llc_shared" is ever used by the
scheduler.
Since only "sd_llc_shared" is ever used, and since SD_SHARE_LLC domains
never overlap, allocate only a single per-CPU range of
"sched_domain_shared" objects with s_data instead of doing it per
topology level.
The subsequent commit uses the degeneration path to correctly assign the
"sd->shared" to the topmost SD_SHARE_LLC domain.
No functional changes are expected at this point.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/20260312044434.1974-4-kprateek.nayak@amd.com
Subsequent changes to assign "sd->shared" from "s_data" would
necessitate finding the topmost SD_SHARE_LLC domain to assign the shared
object to.
This is very similar to the "imb_numa_nr" computation loop except that
"imb_numa_nr" cares about the first domain without the SD_SHARE_LLC flag
(immediate parent of sd_llc) whereas the "sd->shared" assignment would
require sd_llc itself.
Extract the "imb_numa_nr" calculation into a helper
adjust_numa_imbalance() and use the current loop in the
build_sched_domains() to find the sd_llc.
While at it, guard the call behind CONFIG_NUMA's status since
"imb_numa_nr" only makes sense on NUMA enabled configs with SD_NUMA
domains.
No functional changes intended.
Suggested-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/20260312044434.1974-3-kprateek.nayak@amd.com
The "sd_weight" used for calculating the load balancing interval, and
its limits, considers the span weight of the entire topology level
without accounting for cpuset partitions.
For example, consider a large system of 128 CPUs divided into 8
partitions of 16 CPUs each, which is typical when deploying virtual
machines:
[ PKG Domain: 128CPUs ]
[Partition0: 16CPUs][Partition1: 16CPUs] ... [Partition7: 16CPUs]
Although each partition only contains 16 CPUs, the load balancing
interval is set to a minimum of 128 jiffies based on the span of the
entire 128-CPU domain, which can lead to longer imbalances within a
partition even though balancing across its 16 CPUs is cheaper.
Compute the "sd_weight" after computing the "sd_span" considering the
cpu_map covered by the partition, and set the load balancing interval,
and its limits accordingly.
For the above example, the balancing intervals for the partitions PKG
domain changes as follows:
                    before   after
balance_interval    128      16
min_interval        128      16
max_interval        256      32
Intervals are now proportional to the CPUs in the partitioned domain as
was intended by the original formula.
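A sketch of the corrected computation, assuming the usual sd_init()
shape where the interval limits derive from sd_weight:

	/* Restrict the level's span to this partition's cpu_map first. */
	cpumask_and(sched_domain_span(sd), tl->mask(cpu), cpu_map);
	sd_weight = cpumask_weight(sched_domain_span(sd));

	sd->min_interval = sd_weight;		/* 16 instead of 128 */
	sd->max_interval = 2 * sd_weight;	/* 32 instead of 256 */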
Fixes: cb83b629ba ("sched/numa: Rewrite the CONFIG_NUMA sched domain support")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/20260312044434.1974-2-kprateek.nayak@amd.com
Restore the SPDX tag that was accidentally dropped.
Fixes: 7e4b6c9430 ("tracing: add more symbols to whitelist")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://patch.msgid.link/20260317194252.1890568-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Compiler and tooling-generated symbols are difficult to maintain
across all supported architectures. Make the allowlist more robust by
replacing the hardcoded list with a mechanism that automatically detects
these symbols.
This mechanism generates a C function designed to trigger common
compiler-inserted symbols.
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Tested-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20260316092845.3367411-1-vdonnefort@google.com
[maz: added __msan prefix to allowlist as pointed out by Arnd]
Signed-off-by: Marc Zyngier <maz@kernel.org>
There are two special cases in the idle loop that are handled
inconsistently even though they are analogous.
The first one is when a cpuidle driver is absent and the default CPU
idle time power management implemented by the architecture code is used.
In that case, the scheduler tick is stopped every time before invoking
default_idle_call().
The second one is when a cpuidle driver is present, but there is only
one idle state in its table. In that case, the scheduler tick is never
stopped at all.
Since each of these approaches has its drawbacks, reconcile them with
the help of one simple heuristic. Namely, stop the tick if the CPU has
been woken up by it in the previous iteration of the idle loop, or let
it tick otherwise.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Qais Yousef <qyousef@layalina.io>
Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Fixes: ed98c34919 ("sched: idle: Do not stop the tick before cpuidle_idle_call()")
[ rjw: Added Fixes tag, changelog edits ]
Link: https://patch.msgid.link/4741364.LvFx2qVVIh@rafael.j.wysocki
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Merge tag 'mm-hotfixes-stable-2026-03-16-12-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"6 hotfixes. 4 are cc:stable. 3 are for MM.
All are singletons - please see the changelogs for details"
* tag 'mm-hotfixes-stable-2026-03-16-12-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
MAINTAINERS: update email address for Ignat Korchagin
mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared THP
mm/rmap: fix incorrect pte restoration for lazyfree folios
mm/huge_memory: fix use of NULL folio in move_pages_huge_pmd()
build_bug.h: correct function parameters names in kernel-doc
crash_dump: don't log dm-crypt key bytes in read_key_from_user_keying
The BPF verifier currently enforces a call stack depth of 8 frames,
regardless of the actual stack space consumption of those frames. The
limit is necessary for static call stacks, because the bookkeeping data
structures used by the verifier when stepping into static functions
during verification only support 8 stack frames. However, this
limitation only matters for static stack frames: Global subprogs are
verified by themselves and do not require limiting the call depth.
Relax this limitation to only apply to static stack frames. Verification
now only fails when there is a sequence of 8 calls to non-global
subprogs. Calling into a global subprog resets the counter. This allows
deeper call stacks, provided all frames still fit in the stack.
The change does not increase the maximum size of the call stack, only
the maximum number of frames we can place in it.
Also change the progs/test_global_func3.c selftest to use static
functions, since with the new patch it would otherwise unexpectedly
pass verification.
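Conceptually, the relaxed check looks like this (illustrative, not the
verifier's exact bookkeeping):

	if (subprog_is_global(env, subprog))
		frame_depth = 0;	/* verified separately, resets count */
	else if (++frame_depth >= MAX_CALL_FRAMES)
		return -E2BIG;		/* 8 consecutive static frames */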
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260316161225.128011-2-emil@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
ancestors[] is a flexible array member that needs level + 1 slots to
hold all ancestors including self (indices 0..level), but kzalloc_flex()
only allocates `level` slots:
sch = kzalloc_flex(*sch, ancestors, level);
...
sch->ancestors[level] = sch; /* one past the end */
For the root scheduler (level = 0), zero slots are allocated and
ancestors[0] is written immediately past the end of the object.
KASAN reports:
BUG: KASAN: slab-out-of-bounds in scx_alloc_and_add_sched+0x1c17/0x1d10
Write of size 8 at addr ffff888066b56538 by task scx_enable_help/667
The buggy address is located 0 bytes to the right of
allocated 1336-byte region [ffff888066b56000, ffff888066b56538)
Fix by passing level + 1 to kzalloc_flex().
Tested with vng + scx_lavd, KASAN no longer triggers.
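The fix is a one-liner, shown here with context:

	/* ancestors[] must hold indices 0..level, i.e. level + 1 slots. */
	sch = kzalloc_flex(*sch, ancestors, level + 1);
	...
	sch->ancestors[level] = sch;	/* now within the allocation */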
Fixes: ebeca1f930 ("sched_ext: Introduce cgroup sub-sched support")
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Make the spinlock implementation compatible with lock context analysis
(CONTEXT_ANALYSIS := 1) by adding lock context annotations to the
_raw_##op##_...() macros.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260313171510.230998-4-bvanassche@acm.org
Currently ATOMIC_INIT() is not used because in the past that macro was
provided by linux/atomic.h which is not usable from linux/jump_label.h.
However since commit 7ca8cf5347 ("locking/atomic: Move ATOMIC_INIT
into linux/types.h") the macro only requires linux/types.h.
Remove the now unnecessary workaround and the associated assertions.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260313-jump_label-cleanup-v2-1-35d3c0bde549@linutronix.de
Commit 1ea4b47350 ("locking/rwsem: Remove the list_head from struct
rw_semaphore") introduced a logic error in rwsem_del_waiter().
The root cause of this issue is an inconsistency in the return values of
__rwsem_del_waiter() and rwsem_del_waiter(). Specifically,
__rwsem_del_waiter() returns true when the wait list becomes empty,
whereas rwsem_del_waiter() is supposed to return true if the wait list
is NOT empty.
This caused a null pointer dereference in rwsem_mark_wake() because it
was being called when sem->first_waiter was NULL.
Fixes: 1ea4b47350 ("locking/rwsem: Remove the list_head from struct rw_semaphore")
Reported-by: syzbot+3d2ff92c67127d337463@syzkaller.appspotmail.com
Signed-off-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: syzbot+3d2ff92c67127d337463@syzkaller.appspotmail.com
Link: https://patch.msgid.link/20260314182607.3343346-1-avagin@google.com
When a device performs DMA to a bounce buffer, KMSAN is unaware of
the write and does not mark the data as initialized. When
swiotlb_bounce() later copies the bounce buffer back to the original
buffer, memcpy propagates the uninitialized shadow to the original
buffer, causing false positive uninit-value reports.
Fix this by calling kmsan_unpoison_memory() on the bounce buffer
before copying it back in the DMA_FROM_DEVICE path, so that memcpy
naturally propagates initialized shadow to the destination.
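A sketch of the fix in the DMA_FROM_DEVICE copy-back path (variable
names illustrative):

	if (dir == DMA_FROM_DEVICE) {
		/* The device wrote this buffer behind KMSAN's back;
		 * mark it initialized so memcpy() propagates clean
		 * shadow to the original buffer.
		 */
		kmsan_unpoison_memory(tlb_vaddr, size);
	}
	memcpy(orig_vaddr, tlb_vaddr, size);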
Suggested-by: Alexander Potapenko <glider@google.com>
Link: https://lore.kernel.org/CAG_fn=WUGta-paG1BgsGRoAR+fmuCgh3xo=R3XdzOt_-DqSdHw@mail.gmail.com/
Fixes: 7ade4f1077 ("dma: kmsan: unpoison DMA mappings")
Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20260315082750.2375581-1-syoshida@redhat.com
kobject_init_and_add() failure requires kobject_put() for proper cleanup, but
the error paths were using kfree(sch), possibly leaking the kobject name. The
kset_create_and_add() failure was already using kobject_put() correctly.
Switch the kobject_init_and_add() error paths to use kobject_put(). As the
release path puts the cgroup ref, make scx_alloc_and_add_sched() always
consume @cgrp via a new err_put_cgrp label at the bottom of the error chain
and update scx_sub_enable_workfn() accordingly.
Fixes: 17108735b4 ("sched_ext: Use dynamic allocation for scx_sched")
Reported-by: David Carlier <devnexen@gmail.com>
Link: https://lore.kernel.org/r/20260314134457.46216-1-devnexen@gmail.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
The abort path in scx_sub_enable_workfn() fell through to out_put_cgrp,
double-putting the cgroup ref already owned by sch->cgrp. It also skipped
kthread_flush_work() needed to flush the disable path.
Relocate the abort block above err_unlock_and_disable so it falls through to
err_disable.
Fixes: 337ec00b1d ("sched_ext: Implement cgroup sub-sched enabling and disabling")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Merge tag 'probes-fixes-v7.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fixes from Masami Hiramatsu:
- Avoid crash when rmmod/insmod after ftrace killed
This fixes a kernel crash caused by kprobes on the symbol in a module
which is unloaded after ftrace_kill() is called.
- Remove unneeded warnings from __arm_kprobe_ftrace()
Remove unneeded WARN messages which can be triggered if the kprobe is
using ftrace and it fails to enable the ftrace. Since kprobes
correctly handle such failure, we don't need to warn it.
* tag 'probes-fixes-v7.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
kprobes: Remove unneeded warnings from __arm_kprobe_ftrace()
kprobes: avoid crash when rmmod/insmod after ftrace killed
Merge tag 'sched-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"More MM-CID fixes, mostly fixing hangs/races:
- Fix CID hangs due to a race between concurrent forks
- Fix vfork()/CLONE_VM MMCID bug causing hangs
- Remove pointless preemption guard
- Fix CID task list walk performance regression on large systems
by removing the known-flaky and slow counting logic using
for_each_process_thread() in mm_cid_*fixup_tasks_to_cpus(), and
implementing a simple sched_mm_cid::node list instead"
* tag 'sched-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/mmcid: Avoid full tasklist walks
sched/mmcid: Remove pointless preempt guard
sched/mmcid: Handle vfork()/CLONE_VM correctly
sched/mmcid: Prevent CID stalls due to concurrent forks
Under CONFIG_EXT_SUB_SCHED, the kzalloc() and kstrdup() failure
paths jump to err_stop_helper without first setting ret. The
function then returns ERR_PTR(ret) with ret uninitialized, which
can produce ERR_PTR(0) (NULL), causing the caller's IS_ERR() check
to pass and leading to a NULL pointer dereference.
Set ret = -ENOMEM before each goto to fix the error path.
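A minimal sketch of the corrected error path (names follow the
description, not the literal diff):

	sch = kzalloc(sizeof(*sch), GFP_KERNEL);
	if (!sch) {
		ret = -ENOMEM;	/* previously left uninitialized */
		goto err_stop_helper;
	}
	...
err_stop_helper:
	...
	return ERR_PTR(ret);	/* ERR_PTR(0) == NULL can no longer escape */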
Fixes: ebeca1f930 ("sched_ext: Introduce cgroup sub-sched support")
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In commit 5dbb19b16a ("bpf: Add third round of bounds deduction"), I
added a new round of bounds deduction because two rounds were not enough
to converge to a fixed point. This commit slightly refactors the bounds
deduction logic such that two rounds are enough.
In [1], Eduard noticed that after we improved the refinement logic, a
third call to the bounds deduction (__reg_deduce_bounds) was needed to
converge to a fixed point. More specifically, we needed this third call
to improve the s64 range using the s32 range. We added the third call
and postponed a more detailed analysis of the refinement logic.
I've been looking into this more recently. The register refinement
consists of the following calls.
__update_reg_bounds();
3 x __reg_deduce_bounds() {
deduce_bounds_32_from_64();
deduce_bounds_32_from_32();
deduce_bounds_64_from_64();
deduce_bounds_64_from_32();
};
__reg_bound_offset();
__update_reg_bounds();
From this, we can observe that we first improve the 32bit ranges from
the 64bit ranges in deduce_bounds_32_from_64, then improve the 64bit
ranges on their own in deduce_bounds_64_from_64. Intuitively, if we
were to improve the 64bit ranges on their own *before* we use them to
improve the 32bit ranges, we may reach a fixed point earlier.
In a similar manner, using CBMC, Eduard found that it's best to improve
the 32bit ranges on their own *after* we've improved them using the 64bit
ranges. That is, running deduce_bounds_32_from_32 after
deduce_bounds_32_from_64.
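With both reorderings applied, a plausible per-round sequence consistent
with the description is:

	static void __reg_deduce_bounds(struct bpf_reg_state *reg)
	{
		deduce_bounds_64_from_64(reg);	/* 64bit first, on their own */
		deduce_bounds_32_from_64(reg);	/* then 64bit -> 32bit */
		deduce_bounds_32_from_32(reg);	/* then 32bit on their own */
		deduce_bounds_64_from_32(reg);
	}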
These changes allow us to lose one call to __reg_deduce_bounds. Without
this reordering, the test "verifier_bounds/bounds deduction cross sign
boundary, negative overlap" fails when removing one call to
__reg_deduce_bounds. In some cases, this change can even improve
precision a little bit, as illustrated in the new selftest in the next
patch.
As expected, this change didn't have any impact on the number of
instructions processed when running it through the Cilium complexity
test suite [2].
Link: https://lore.kernel.org/bpf/aIKtSK9LjQXB8FLY@mail.gmail.com/ [1]
Link: https://pchaigno.github.io/test-verifier-complexity.html [2]
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Co-developed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/1b00d2749ec4c774c3ada84e265ac7fda72cfe56.1773401138.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Extend these APIs with a flush argument:
dma_direct_unmap_phys(), dma_direct_map_phys(), and
dma_direct_sync_single_for_cpu(). For single-buffer cases, flush=true
is used, while for SG cases flush=false is used, followed by a single
flush after all cache operations have been issued in
dma_direct_{map,unmap}_sg().
This ultimately benefits dma_map_sg() and dma_unmap_sg().
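Assuming the new parameter is simply appended to the existing
signatures, usage looks roughly like:

	/* Single buffer: flush as part of the call. */
	dma_direct_sync_single_for_cpu(dev, addr, size, dir, true);

	/* SG list: defer, then flush once at the end. */
	for_each_sg(sgl, sg, nents, i)
		dma_direct_unmap_phys(dev, sg_dma_address(sg),
				      sg_dma_len(sg), dir, attrs, false);
	arch_sync_dma_flush();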
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Reviewed-by: Leon Romanovsky <leon@kernel.org>
Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
Signed-off-by: Barry Song <baohua@kernel.org>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20260228221337.59951-1-21cnbao@gmail.com
Currently, arch_sync_dma_for_cpu and arch_sync_dma_for_device
always wait for the completion of each DMA buffer. That is,
issuing the DMA sync and waiting for completion is done in a
single API call.
For scatter-gather lists with multiple entries, this means
issuing and waiting is repeated for each entry, which can hurt
performance. Architectures like ARM64 may be able to issue all
DMA sync operations for all entries first and then wait for
completion together.
To address this, arch_sync_dma_for_* now batches DMA operations
and performs a flush afterward. On ARM64, the flush is implemented
with a dsb instruction in arch_sync_dma_flush(). On other
architectures, arch_sync_dma_flush() is currently a nop.
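The resulting scatter-gather pattern looks roughly like this (sketch;
the real call sites are in the dma-direct and IOMMU paths):

	struct scatterlist *sg;
	int i;

	/* Issue all cache maintenance operations first... */
	for_each_sg(sgl, sg, nents, i)
		arch_sync_dma_for_device(sg_phys(sg), sg->length, dir);

	/* ...then wait once: a single dsb on arm64, a nop elsewhere. */
	arch_sync_dma_flush();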
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Reviewed-by: Juergen Gross <jgross@suse.com> # drivers/xen/swiotlb-xen.c
Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
Signed-off-by: Barry Song <baohua@kernel.org>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20260228221316.59934-1-21cnbao@gmail.com
Merge tag 'wq-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fixes from Tejun Heo:
- Improve workqueue stall diagnostics: dump all busy workers (not just
running ones), show wall-clock duration of in-flight work items, and
add a sample module for reproducing stalls
- Fix POOL_BH vs WQ_BH flag namespace mismatch in pr_cont_worker_id()
- Rename pool->watchdog_ts to pool->last_progress_ts and related
functions for clarity
* tag 'wq-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Rename show_cpu_pool{s,}_hog{s,}() to reflect broadened scope
workqueue: Add stall detector sample module
workqueue: Show all busy workers in stall diagnostics
workqueue: Show in-flight work item duration in stall diagnostics
workqueue: Rename pool->watchdog_ts to pool->last_progress_ts
workqueue: Use POOL_BH instead of WQ_BH when checking pool flags
Merge tag 'cgroup-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
- Hide PF_EXITING tasks from cgroup.procs to avoid exposing dead tasks
that haven't been removed yet, fixing a systemd timeout issue on
PREEMPT_RT
- Call rebuild_sched_domains() directly in CPU hotplug instead of
deferring to a workqueue, fixing a race where online/offline CPUs
could briefly appear in stale sched domains
* tag 'cgroup-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: Don't expose dead tasks in cgroup
cgroup/cpuset: Call rebuild_sched_domains() directly in hotplug
Merge tag 'sched_ext-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Fix data races flagged by KCSAN: add missing READ_ONCE()/WRITE_ONCE()
annotations for lock-free accesses to module parameters and dsq->seq
- Fix silent truncation of upper 32 enqueue flags (SCX_ENQ_PREEMPT and
above) when passed through the int sched_class interface
- Documentation updates: scheduling class precedence, task ownership
state machine, example scheduler descriptions, config list cleanup
- Selftest fix for format specifier and buffer length in
file_write_long()
* tag 'sched_ext-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Use WRITE_ONCE() for the write side of scx_enable helper pointer
sched_ext: Fix enqueue_task_scx() truncation of upper enqueue flags
sched_ext: Documentation: Update sched-ext.rst
sched_ext: Use READ_ONCE() for scx_slice_bypass_us in scx_bypass()
sched_ext: Documentation: Mention scheduling class precedence
sched_ext: Document task ownership state machine
sched_ext: Use READ_ONCE() for lock-free reads of module param variables
sched_ext/selftests: Fix format specifier and buffer length in file_write_long()
sched_ext: Use WRITE_ONCE() for the write side of dsq->seq update
schedule_dsq_reenq() always uses schedule_deferred() which falls back to
irq_work. However, callers like schedule_reenq_local() already hold the
target rq lock, and scx_bpf_dsq_reenq() may hold it via the ops callback.
Add a locked_rq parameter so schedule_dsq_reenq() can use
schedule_deferred_locked() when the target rq is already held. The locked
variant can use cheaper paths (balance callbacks, wakeup hooks) instead of
always bouncing through irq_work.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
SCX_ENQ_IMMED makes enqueue to local DSQs succeed only if the task can start
running immediately. Otherwise, the task is re-enqueued through ops.enqueue().
This provides tighter control but requires specifying the flag on every
insertion.
Add SCX_OPS_ALWAYS_ENQ_IMMED ops flag. When set, SCX_ENQ_IMMED is
automatically applied to all local DSQ enqueues including through
scx_bpf_dsq_move_to_local().
scx_qmap is updated with -I option to test the feature and -F option for
IMMED stress testing which forces every Nth enqueue to a busy local DSQ.
v2: - Cover scx_bpf_dsq_move_to_local() path (now has enq_flags via ___v2).
- scx_qmap: Remove sched_switch and cpu_release handlers (superseded by
kernel-side wakeup_preempt_scx()). Add -F for IMMED stress testing.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
scx_bpf_dsq_move_to_local() moves a task from a non-local DSQ to the
current CPU's local DSQ. This is an indirect way of dispatching to a local
DSQ and should support enq_flags like direct dispatches do - e.g.
SCX_ENQ_HEAD for head-of-queue insertion and SCX_ENQ_IMMED for immediate
execution guarantees.
Add scx_bpf_dsq_move_to_local___v2() with an enq_flags parameter. The
original becomes a v1 compat wrapper passing 0. The compat macro is updated
to a three-level chain: v2 (7.1+) -> v1 (current) -> scx_bpf_consume
(pre-rename). All in-tree BPF schedulers are updated to pass 0.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Add enq_flags parameter to consume_dispatch_q() and consume_remote_task(),
passing it through to move_{local,remote}_task_to_local_dsq(). All callers
pass 0.
No functional change. This prepares for SCX_ENQ_IMMED support on the consume
path.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Add SCX_ENQ_IMMED enqueue flag for local DSQ insertions. Once a task is
dispatched with IMMED, it either gets on the CPU immediately and stays on it,
or gets reenqueued back to the BPF scheduler. It will never linger on a local
DSQ behind other tasks or on a CPU taken by a higher-priority class.
rq_is_open() uses rq->next_class to determine whether the rq is available,
and wakeup_preempt_scx() triggers reenqueue when a higher-priority class task
arrives. These capture all higher class preemptions. Combined with reenqueue
points in the dispatch path, all cases where an IMMED task would not execute
immediately are covered.
SCX_TASK_IMMED persists in p->scx.flags until the next fresh enqueue, so the
guarantee survives SAVE/RESTORE cycles. If preempted while running,
put_prev_task_scx() reenqueues through ops.enqueue() with
SCX_TASK_REENQ_PREEMPTED instead of silently placing the task back on the
local DSQ.
This enables tighter scheduling latency control by preventing tasks from
piling up on local DSQs. It also enables opportunistic CPU sharing across
sub-schedulers - without this, a sub-scheduler can stuff the local DSQ of a
shared CPU, making it difficult for others to use.
v2: - Rewrite is_curr_done() as rq_is_open() using rq->next_class and
implement wakeup_preempt_scx() to achieve complete coverage of all
cases where IMMED tasks could get stranded.
- Track IMMED persistently in p->scx.flags and reenqueue
preempted-while-running tasks through ops.enqueue().
- Bound deferred reenq cycles (SCX_REENQ_LOCAL_MAX_REPEAT).
- Misc renames, documentation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Add scx_vet_enq_flags() stub and call it from scx_dsq_insert_preamble() and
scx_dsq_move(). Pass dsq_id into preamble so the vetting function can
validate flag and DSQ combinations.
No functional change. This prepares for SCX_ENQ_IMMED which will populate the
vetting function.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Split task_should_reenq() into local_task_should_reenq() and
user_task_should_reenq(). The local variant takes reenq_flags by pointer.
No functional change. This prepares for SCX_ENQ_IMMED which will add
IMMED-specific logic to the local variant.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
parse_affn_scope() uses strncasecmp() with the length of the candidate
name, which means it only checks if the input *starts with* a known
scope name.
Given that an upcoming change will add a "cache_shard" affinity scope,
writing "cache_shard" to a workqueue's affinity_scope sysfs attribute
would always match "cache" first, making it impossible to select
"cache_shard" via sysfs. This fix makes "cache" and "cache_shard"
distinguishable.
Fix by replacing the hand-rolled prefix matching loop with
sysfs_match_string(), which uses sysfs_streq() for exact matching
(modulo trailing newlines). Also add the missing const qualifier to
the wq_affn_names[] array declaration.
Note that sysfs_streq() is case-sensitive, unlike the previous
strncasecmp() approach. This is intentional and consistent with
how other sysfs attributes handle string matching in the kernel.
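With sysfs_match_string() the parser collapses to an index lookup
(sketch):

	static int parse_affn_scope(const char *val)
	{
		/* Exact match modulo a trailing newline; "cache" no
		 * longer shadows "cache_shard".
		 */
		return sysfs_match_string(wq_affn_names, val);
	}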
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Randconfig builds show a number of cryptic build errors from
hitting undefined symbols in simple_ring_buffer.o:
make[7]: *** [/home/arnd/arm-soc/kernel/trace/Makefile:147: kernel/trace/simple_ring_buffer.o.checked] Error 1
These happen with CONFIG_TRACE_BRANCH_PROFILING, CONFIG_KASAN_HW_TAGS,
CONFIG_STACKPROTECTOR, CONFIG_DEBUG_IRQFLAGS and indirectly from WARN_ON().
Add exceptions for each one that I have hit so far on arm64, x86_64 and arm
randconfig builds.
Other architectures likely hit additional ones, so it would be nice
to produce a little more verbose output that includes the names of the
missing symbols directly.
Fixes: a717943d8e ("tracing: Check for undefined symbols in simple_ring_buffer")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20260312123601.625063-2-arnd@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Undefined symbols are not allowed for simple_ring_buffer.c, but some
compiler-emitted symbols are missing from the allowlist. Update it.
Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Fixes: a717943d8e ("tracing: Check for undefined symbols in simple_ring_buffer")
Closes: https://lore.kernel.org/all/20260311221816.GA316631@ax162/
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://patch.msgid.link/20260312113535.2213350-1-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Add support for creating a mount namespace that contains only a copy of
the root mount from the caller's mount namespace, with none of the
child mounts. This is useful for containers and sandboxes that want to
start with a minimal mount table and populate it from scratch rather
than inheriting and then tearing down the full mount tree.
Two new flags are introduced:
- CLONE_EMPTY_MNTNS for clone3(), using the 64-bit flag space.
- UNSHARE_EMPTY_MNTNS for unshare(), reusing the
CLONE_PARENT_SETTID bit which has no meaning for unshare.
Both flags imply CLONE_NEWNS. For the unshare path,
UNSHARE_EMPTY_MNTNS is converted to CLONE_EMPTY_MNTNS in
unshare_nsproxy_namespaces() before it reaches copy_mnt_ns(), so the
mount namespace code only needs to handle a single flag.
In copy_mnt_ns(), when CLONE_EMPTY_MNTNS is set, clone_mnt() is used
instead of copy_tree() to clone only the root mount. The caller's root
and working directory are both reset to the root dentry of the new
mount.
The cleanup variables are changed from vfsmount pointers with
__free(mntput) to struct path with __free(path_put) because the empty
mount namespace path needs to release both mount and dentry references
when replacing the caller's root and pwd. In the normal (non-empty)
path only the mount component is set, and dput(NULL) is a no-op so
path_put remains correct there as well.
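Userspace usage would look roughly like this (illustrative; the flag
values come from the uapi headers once merged):

	struct clone_args args = {
		.flags	     = CLONE_EMPTY_MNTNS,	/* implies CLONE_NEWNS */
		.exit_signal = SIGCHLD,
	};
	pid_t pid = syscall(__NR_clone3, &args, sizeof(args));

	/* Or, for the calling task itself: */
	unshare(UNSHARE_EMPTY_MNTNS);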
Link: https://patch.msgid.link/20260306-work-empty-mntns-consolidated-v1-1-6eb30529bbb0@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Using a non-continuous aka untrusted clocksource as a watchdog for another
untrusted clocksource is equivalent to putting the fox in charge of the
henhouse.
That's especially true with the jiffies clocksource which depends on
interrupt delivery based on a periodic timer. Neither the frequency of that
timer is trustworthy nor the kernel's ability to react on it in a timely
manner and rearm it if it is not self rearming.
Just don't bother to deal with this. It's not worth the trouble and only
relevant to museum piece hardware.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260123231521.858743259@kernel.org
The container_of() call is open-coded multiple times.
Add a helper macro.
Use container_of_const() to preserve constness.
Signed-off-by: Thomas Weißschuh (Schneider Electric) <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260311-hrtimer-cleanups-v1-12-095357392669@linutronix.de
This pointer indirection is a remnant from when ktime_t was a struct,
today it is pointless.
Drop the pointer indirection.
Signed-off-by: Thomas Weißschuh (Schneider Electric) <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260311-hrtimer-cleanups-v1-9-095357392669@linutronix.de
The value is assigned before any use.
No other function in hrtimer.c does such a zero-initialization.
Signed-off-by: Thomas Weißschuh (Schneider Electric) <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260311-hrtimer-cleanups-v1-7-095357392669@linutronix.de
Neither the array nor the offsets it is pointing to are meant to be
changed through the array.
Mark both the array and the values it points to as const.
Signed-off-by: Thomas Weißschuh (Schneider Electric) <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260311-hrtimer-cleanups-v1-5-095357392669@linutronix.de
In aux_clock_enable() the clocksource from tkr_raw is used to call
tk_setup_internals(). Do the same in tk_aux_update_clocksource(). While
the clocksources will be the same in any case, this is less confusing.
Signed-off-by: Thomas Weißschuh (Schneider Electric) <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260311-hrtimer-cleanups-v1-4-095357392669@linutronix.de
The sentinel value added by the wrapper macros __print_symbolic() et al
prevents the callers from adding their own trailing comma. This makes
constructing symbol list dynamically based on kconfig values tedious.
Drop the sentinel elements, so callers can either specify the trailing
comma or not, just like in regular array initializers.
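Callers can now assemble the list conditionally with uniform trailing
commas, e.g. (helper macro name hypothetical):

#ifdef CONFIG_HIGH_RES_TIMERS
# define HRES_MODE_SYM	{ HRTIMER_MODE_PINNED, "PINNED" },
#else
# define HRES_MODE_SYM
#endif

	__print_symbolic(mode,
		{ HRTIMER_MODE_ABS, "ABS" },
		{ HRTIMER_MODE_REL, "REL" },
		HRES_MODE_SYM)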
Signed-off-by: Thomas Weißschuh (Schneider Electric) <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260311-hrtimer-cleanups-v1-2-095357392669@linutronix.de
Oliver reported that x86_pmu_del() ended up doing an out-of-bounds memory
access when group_sched_in() fails and needs to roll back.
This *should* be handled by the transaction callbacks, but he found that when
the group leader is a software event, the transaction handlers of the wrong
PMU are used, despite the move_group case in perf_event_open() and
group_sched_in() using pmu_ctx->pmu.
Turns out, inherit uses event->pmu to clone the events, effectively undoing
the move_group case for all inherited contexts. Fix this by also making
inherit use pmu_ctx->pmu, ensuring all inherited counters end up in the same
pmu context.
Similarly, __perf_event_read() should equally use pmu_ctx->pmu for the
group case.
Fixes: bd27568117 ("perf: Rewrite core context handling")
Reported-by: Oliver Rosenberg <olrose55@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://patch.msgid.link/20260309133713.GB606826@noisy.programming.kicks-ass.net
Replace the comma operator with separate statements when assigning
NUMA fault statistics. This improves readability and follows kernel
coding style.
Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260309024247.10908-1-zhanxusheng@xiaomi.com
Replace the global cgroup_file_kn_lock with a per-cgroup_file spinlock
to eliminate cross-cgroup contention as it is not really protecting
data shared between different cgroups.
The lock is initialized in cgroup_add_file() alongside timer_setup().
No lock acquisition is needed during initialization since the cgroup
directory is being populated under cgroup_mutex and no concurrent
accessors exist at that point.
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add lockless checks before acquiring cgroup_file_kn_lock:
1. READ_ONCE(cfile->kn) NULL check to skip torn-down files.
2. READ_ONCE(cfile->notified_at) rate-limit check to skip when
within the notification interval. If within the interval, arm
the deferred timer via timer_reduce() and confirm it is pending
before returning -- if the timer fired in between, fall through
to the lock path so the notification is not lost.
Both checks have safe error directions -- a stale read can only
cause unnecessary lock acquisition, never a missed notification.
The critical section is simplified to just taking a kernfs_get()
reference and updating notified_at.
Annotate cfile->kn and cfile->notified_at write sites with
WRITE_ONCE() to pair with the lockless readers.
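A sketch of the resulting fast path (interval constant and timer field
names assumed):

	if (!READ_ONCE(cfile->kn))
		return;			/* file already torn down */

	if (time_before(jiffies, READ_ONCE(cfile->notified_at) +
			CGROUP_FILE_NOTIFY_MIN_INTV)) {
		timer_reduce(&cfile->notify_timer,
			     READ_ONCE(cfile->notified_at) +
			     CGROUP_FILE_NOTIFY_MIN_INTV);
		if (timer_pending(&cfile->notify_timer))
			return;	/* deferred timer will deliver it */
		/* Timer fired in between: fall through to the locked
		 * path so the notification is not lost.
		 */
	}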
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
cgroup_file_notify() calls kernfs_notify() while holding the global
cgroup_file_kn_lock. kernfs_notify() does non-trivial work including
wake_up_interruptible() and acquisition of a second global spinlock
(kernfs_notify_lock), inflating the hold time.
Take a kernfs_get() reference under the lock and call kernfs_notify()
after dropping it, following the pattern from cgroup_file_show().
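The critical section then shrinks to a reference grab (sketch):

	spin_lock_irqsave(&cgroup_file_kn_lock, flags);
	kn = cfile->kn;
	if (kn)
		kernfs_get(kn);
	spin_unlock_irqrestore(&cgroup_file_kn_lock, flags);

	if (kn) {
		kernfs_notify(kn);	/* heavy work, now outside the lock */
		kernfs_put(kn);
	}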
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add a new clone3() flag CLONE_PIDFD_AUTOKILL that ties a child's
lifetime to the pidfd returned from clone3(). When the last reference to
the struct file created by clone3() is closed the kernel sends SIGKILL
to the child. A pidfd obtained via pidfd_open() for the same process
does not keep the child alive and does not trigger autokill - only the
specific struct file from clone3() has this property.
This is useful for container runtimes, service managers, and sandboxed
subprocess execution - any scenario where the child must die if the
parent crashes or abandons the pidfd.
CLONE_PIDFD_AUTOKILL requires both CLONE_PIDFD (the whole point is tying
lifetime to the pidfd file) and CLONE_AUTOREAP (a killed child with no
one to reap it would become a zombie). CLONE_THREAD is rejected because
autokill targets a process not a thread.
The clone3 pidfd is identified by the PIDFD_AUTOKILL file flag set on
the struct file at clone3() time. The pidfs .release handler checks this
flag and sends SIGKILL via do_send_sig_info(SIGKILL, SEND_SIG_PRIV, ...)
only when it is set. Files from pidfd_open() or open_by_handle_at() are
distinct struct files that do not carry this flag. dup()/fork() share the
same struct file so they extend the child's lifetime until the last
reference drops.
CLONE_PIDFD_AUTOKILL uses a privilege model based on CLONE_NNP: without
CLONE_NNP the child could escalate privileges via setuid/setgid exec
after being spawned, so the caller must have CAP_SYS_ADMIN in its user
namespace. With CLONE_NNP the child can never gain new privileges so
unprivileged usage is allowed. This is a deliberate departure from the
pdeath_signal model which is reset during secureexec and commit_creds()
rendering it useless for container runtimes that need to deprivilege
themselves.
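Putting the pieces together, a runtime could spawn a lifetime-tied child
like this (illustrative userspace sketch):

	int pidfd = -1;
	struct clone_args args = {
		.flags = CLONE_PIDFD | CLONE_PIDFD_AUTOKILL |
			 CLONE_AUTOREAP | CLONE_NNP,
		.pidfd = (__u64)(uintptr_t)&pidfd,
	};
	pid_t pid = syscall(__NR_clone3, &args, sizeof(args));

	/* Closing the last reference to pidfd's struct file SIGKILLs
	 * the child, which is then auto-reaped.
	 */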
Link: https://patch.msgid.link/20260226-work-pidfs-autoreap-v5-3-d148b984a989@kernel.org
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Add a new clone3() flag CLONE_NNP that sets no_new_privs on the child
process at clone time. This is analogous to prctl(PR_SET_NO_NEW_PRIVS)
but applied at process creation rather than requiring a separate step
after the child starts running.
CLONE_NNP is rejected with CLONE_THREAD. It's conceptually a lot simpler
if the whole thread-group is forced into NNP and not have single threads
running around with NNP.
Link: https://patch.msgid.link/20260226-work-pidfs-autoreap-v5-2-d148b984a989@kernel.org
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Add a new clone3() flag CLONE_AUTOREAP that makes a child process
auto-reap on exit without ever becoming a zombie. This is a per-process
property in contrast to the existing auto-reap mechanism via
SA_NOCLDWAIT or SIG_IGN for SIGCHLD which applies to all children of a
given parent.
Currently the only way to automatically reap children is to set
SA_NOCLDWAIT or SIG_IGN on SIGCHLD. This is a parent-scoped property
affecting all children which makes it unsuitable for libraries or
applications that need selective auto-reaping of specific children while
still being able to wait() on others.
CLONE_AUTOREAP stores an autoreap flag in the child's signal_struct.
When the child exits do_notify_parent() checks this flag and causes
exit_notify() to transition the task directly to EXIT_DEAD. Since the
flag lives on the child it survives reparenting: if the original parent
exits and the child is reparented to a subreaper or init the child still
auto-reaps when it eventually exits.
CLONE_AUTOREAP can be combined with CLONE_PIDFD to allow the parent to
monitor the child's exit via poll() and retrieve exit status via
PIDFD_GET_INFO. Without CLONE_PIDFD it provides a fire-and-forget
pattern where the parent simply doesn't care about the child's exit
status. No exit signal is delivered so exit_signal must be zero.
CLONE_AUTOREAP is rejected in combination with CLONE_PARENT. If a
CLONE_AUTOREAP child were to clone(CLONE_PARENT) the new grandchild
would inherit exit_signal == 0 from the autoreap parent's group leader
but without signal->autoreap. This grandchild would become a zombie that
never sends a signal and is never autoreaped - confusing and arguably
broken behavior.
The flag is not inherited by the autoreap process's own children. Each
child that should be autoreaped must be explicitly created with
CLONE_AUTOREAP.
Link: https://github.com/uapi-group/kernel-features/issues/45
Link: https://patch.msgid.link/20260226-work-pidfs-autoreap-v5-1-d148b984a989@kernel.org
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
When hrtimer_interrupt() needs to restart more than 3 times and still has
expired timers, the interrupt is considered hung. To give the system a
little time to recover, the hardware timer is programmed a little into the
future.
Prior to commit 2889243848 ("hrtimer: Re-arrange hrtimer_interrupt()"),
this delay was relative to the amount of time spent servicing the
interrupt, with a max of 100 msec.
However, in order to simplify, and because this condition 'should' not
happen, the timeout was unconditionally set to 100 msec.
'Obviously' there is a benchmark that hits this hard, by programming a
ton of very short timers :-/
Since reprogramming is decoupled from the interrupt handling, the actual
execution time is lost; however, the code does track max_hang_time. Using
that, rather than the 100 msec max, restores performance.
stress-ng --timeout 60 --times --verify --metrics --no-rand-seed --timermix 64
                  bogo ops/s
288924384856^1:   23715979.93
288924384856:     11550049.77
patched:          23361116.78
Additionally, Thomas noted that cpu_base->hang_detected should not be
cleared until the next interrupt, such that __hrtimer_reprogram() won't
undo the extra delay.
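A sketch of the resulting hang path (abridged and based on the description
above, not the exact patch; field types approximated):

  /* hang: hrtimer_interrupt() looped more than 3 times with timers still
   * expired; hang_detected stays set until the next interrupt so that
   * __hrtimer_reprogram() won't undo the extra delay */
  cpu_base->hang_detected = 1;

  delta = ktime_sub(now, entry_time);
  if ((unsigned int)delta > cpu_base->max_hang_time)
          cpu_base->max_hang_time = (unsigned int)delta;

  /* back off by the tracked worst-case handler runtime instead of an
   * unconditional 100 msec */
  if (cpu_base->max_hang_time > 100 * NSEC_PER_MSEC)
          expires_next = ktime_add_ns(now, 100 * NSEC_PER_MSEC);
  else
          expires_next = ktime_add_ns(now, cpu_base->max_hang_time);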
Fixes: 2889243848 ("hrtimer: Re-arrange hrtimer_interrupt()")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260311121500.GF652779@noisy.programming.kicks-ass.net
Closes: https://lore.kernel.org/oe-lkp/202603102229.74b9dee4-lkp@intel.com
Chasing vfork()'ed tasks on a CID ownership mode switch requires a full
task list walk, which is obviously expensive on large systems.
Avoid that by keeping a list of tasks using an mm's MMCID entity in
mm::mm_cid and walking this list instead. This removes the counting logic,
which has proven to be flaky, and avoids a full task list walk in the case
of vfork()'ed tasks.
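A rough sketch of the mechanism (field, member and helper names here are
illustrative assumptions, not the patch's actual identifiers):

  /* each task using an mm's MMCID entity is linked into mm->mm_cid.users */
  struct mm_mm_cid {
          raw_spinlock_t          lock;
          struct list_head        users;  /* tasks holding a CID for this mm */
  };

  /* fixup now walks only this mm's users instead of the global task list */
  static void mm_cid_fixup_users(struct mm_struct *mm)
  {
          struct task_struct *t;

          lockdep_assert_held(&mm->mm_cid.lock);
          list_for_each_entry(t, &mm->mm_cid.users, mm_cid_node)
                  mm_cid_fixup_task(t);
  }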
Fixes: fbd0e71dc3 ("sched/mmcid: Provide CID ownership mode fixup functions")
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260310202526.183824481@kernel.org
This is a leftover from the early versions of this function where it could
be invoked without mm::mm_cid::lock held.
Remove it and add lockdep asserts instead.
Fixes: 653fda7ae7 ("sched/mmcid: Switch over to the new mechanism")
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260310202526.116363613@kernel.org
Matthieu and Jiri reported stalls where a task endlessly loops in
mm_get_cid() when scheduling in.
It turned out that the logic which handles vfork()'ed tasks is broken. It
is invoked when the number of tasks associated with a process is smaller
than the number of MMCID users. It then walks the task list to find the
vfork()'ed task, but accounts all the already processed tasks as well.
If that double processing brings the number of tasks still to be handled
to 0, the walk stops and the vfork()'ed task's CID is not fixed up. As a
consequence a subsequent schedule-in fails to acquire a (transitional) CID
and the machine stalls.
Cure this by removing the accounting condition and making the fixup always
walk the full task list if it could not find the exact number of users in
the process' thread list.
Fixes: fbd0e71dc3 ("sched/mmcid: Provide CID ownership mode fixup functions")
Closes: https://lore.kernel.org/b24ffcb3-09d5-4e48-9070-0b69bc654281@kernel.org
Reported-by: Matthieu Baerts <matttbe@kernel.org>
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260310202526.048657665@kernel.org
A newly forked task is accounted as MMCID user before the task is visible
in the process' thread list and the global task list. This creates the
following problem:
CPU1                                    CPU2
fork()
  sched_mm_cid_fork(tnew1)
    tnew1->mm.mm_cid_users++;
    tnew1->mm_cid.cid = getcid()
  -> preemption
                                        fork()
                                          sched_mm_cid_fork(tnew2)
                                            tnew2->mm.mm_cid_users++;
                                            // Reaches the per CPU threshold
                                            mm_cid_fixup_tasks_to_cpus()
                                              for_each_other(current, p)
                                                ....
As tnew1 is not visible yet, this fails to fix up the already allocated CID
of tnew1. As a consequence a subsequent schedule-in might fail to acquire a
(transitional) CID and the machine stalls.
Move the invocation of sched_mm_cid_fork() after the new task becomes
visible in the thread and the task list to prevent this.
This also makes it symmetrical vs. exit() where the task is removed as CID
user before the task is removed from the thread and task lists.
Fixes: fbd0e71dc3 ("sched/mmcid: Provide CID ownership mode fixup functions")
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260310202525.969061974@kernel.org
The trace_clock_jiffies() function that handles the "uptime" clock for
tracing calls jiffies_64_to_clock_t(). This causes the function tracer to
constantly recurse when the tracing clock is set to "uptime". Mark it
notrace to prevent unnecessary recursion when using the "uptime" clock.
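The annotated function then looks roughly like this (a sketch based on the
description above):

  u64 notrace trace_clock_jiffies(void)
  {
          /* notrace: jiffies_64_to_clock_t() would otherwise be traced and
           * recurse back into the "uptime" clock */
          return jiffies_64_to_clock_t(jiffies_64 - INITIAL_JIFFIES);
  }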
Fixes: 58d4e21e50 ("tracing: Fix wraparound problems in "uptime" trace clock")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260306212403.72270bb2@robin
After sparc64, there are no remaining users of ARCH_CLOCKSOURCE_DATA
and it can just be removed.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Andreas Larsson <andreas@gaisler.com>
Reviewed-by: Andreas Larsson <andreas@gaisler.com>
Acked-by: John Stultz <jstultz@google.com>
Link: https://patch.msgid.link/20260304-vdso-sparc64-generic-2-v6-14-d8eb3b0e1410@linutronix.de
[Thomas: drop sparc64 bits from the patch]
When debug logging is enabled, read_key_from_user_keying() logs the first
8 bytes of the key payload and partially exposes the dm-crypt key. Stop
logging any key bytes.
Link: https://lkml.kernel.org/r/20260227230008.858641-2-thorsten.blum@linux.dev
Fixes: 479e58549b ("crash_dump: store dm crypt keys in kdump reserved memory")
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Cc: Baoquan He <bhe@redhat.com>
Cc: Coiby Xu <coxu@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, audit_receive_msg() ignores unknown status bits in AUDIT_SET
requests, incorrectly returning success to newer user space tools
querying unsupported features. This breaks forward compatibility.
Fix this by defining AUDIT_STATUS_ALL and returning -EINVAL if any
unrecognized bits are set (s.mask & ~AUDIT_STATUS_ALL).
This ensures invalid requests are safely rejected, allowing user space
to reliably test for and gracefully handle feature detection on older
kernels.
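A sketch of the check (the exact set of flags folded into AUDIT_STATUS_ALL
is an assumption based on the current uapi flag list):

  #define AUDIT_STATUS_ALL (AUDIT_STATUS_ENABLED | AUDIT_STATUS_FAILURE | \
                            AUDIT_STATUS_PID | AUDIT_STATUS_RATE_LIMIT | \
                            AUDIT_STATUS_BACKLOG_LIMIT | \
                            AUDIT_STATUS_BACKLOG_WAIT_TIME | \
                            AUDIT_STATUS_LOST | \
                            AUDIT_STATUS_BACKLOG_WAIT_TIME_ACTUAL)

          /* in the AUDIT_SET handler: reject unknown bits instead of
           * silently ignoring them */
          if (s.mask & ~AUDIT_STATUS_ALL)
                  return -EINVAL;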
Suggested-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Ricardo Robaina <rrobaina@redhat.com>
[PM: subject line tweak]
Signed-off-by: Paul Moore <paul@paul-moore.com>
BPF_ST | BPF_PROBE_MEM32 immediate stores are not handled by
bpf_jit_blind_insn(), allowing user-controlled 32-bit immediates to
survive unblinded into JIT-compiled native code when bpf_jit_harden >= 1.
The root cause is that convert_ctx_accesses() rewrites BPF_ST|BPF_MEM
to BPF_ST|BPF_PROBE_MEM32 for arena pointer stores during verification,
before bpf_jit_blind_constants() runs during JIT compilation. The
blinding switch only matches BPF_ST|BPF_MEM (mode 0x60), not
BPF_ST|BPF_PROBE_MEM32 (mode 0xa0). The instruction falls through
unblinded.
Add BPF_ST|BPF_PROBE_MEM32 cases to bpf_jit_blind_insn() alongside the
existing BPF_ST|BPF_MEM cases. The blinding transformation is identical:
load the blinded immediate into BPF_REG_AX via mov+xor, then convert
the immediate store to a register store (BPF_STX).
The rewritten STX instruction must preserve the BPF_PROBE_MEM32 mode so
the architecture JIT emits the correct arena addressing (R12-based on
x86-64). Cannot use the BPF_STX_MEM() macro here because it hardcodes
BPF_MEM mode; construct the instruction directly instead.
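A sketch of the added cases, modeled on the existing BPF_ST | BPF_MEM
handling in bpf_jit_blind_insn() (abridged, not the verbatim patch):

  case BPF_ST | BPF_PROBE_MEM32 | BPF_DW:
  case BPF_ST | BPF_PROBE_MEM32 | BPF_W:
  case BPF_ST | BPF_PROBE_MEM32 | BPF_H:
  case BPF_ST | BPF_PROBE_MEM32 | BPF_B:
          /* blind the immediate through BPF_REG_AX */
          *to++ = BPF_ALU64_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ from->imm);
          *to++ = BPF_ALU64_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
          /* BPF_STX_MEM() would hardcode BPF_MEM; build the store directly
           * so the BPF_PROBE_MEM32 mode survives into the JIT */
          *to++ = (struct bpf_insn) {
                  .code    = BPF_STX | BPF_PROBE_MEM32 | BPF_SIZE(from->code),
                  .dst_reg = from->dst_reg,
                  .src_reg = BPF_REG_AX,
                  .off     = from->off,
                  .imm     = 0,
          };
          break;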
Fixes: 6082b6c328 ("bpf: Recognize addr_space_cast instruction in the verifier.")
Reviewed-by: Puranjay Mohan <puranjay@kernel.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Sachin Kumar <xcyfun@protonmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/Y6IT5VvNRchPBLI5D7JZHBzZrU9rb0ycRJPJzJSXGj7kJlX8RJwZFSM2YZjcDxoQKABkxt1T8Os2gi23PYyFuQe6KkZGWVyfz8K5afdy9ak=@protonmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch adds support for validating a pointer as not null when it is
compared to a register whose value the verifier knows to be null.
The initial pattern only recognized comparisons against an immediate
operand.
Signed-off-by: Cupertino Miranda <cupertino.miranda@oracle.com>
Cc: David Faust <david.faust@oracle.com>
Cc: Jose Marchesi <jose.marchesi@oracle.com>
Cc: Elena Zannoni <elena.zannoni@oracle.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260304195018.181396-3-cupertino.miranda@oracle.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When a register undergoes a BPF_END (byte swap) operation, its scalar
value is mutated in-place. If this register previously shared a scalar ID
with another register (e.g., after an `r1 = r0` assignment), this tie must
be broken.
Currently, the verifier misses resetting `dst_reg->id` to 0 for BPF_END.
Consequently, if a conditional jump checks the swapped register, the
verifier incorrectly propagates the learned bounds to the linked register,
leading to false confidence in the linked register's value and potentially
allowing out-of-bounds memory accesses.
Fix this by explicitly resetting `dst_reg->id` to 0 in the BPF_END case
to break the scalar tie, similar to how BPF_NEG handles it via
`__mark_reg_known`.
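A sketch of the fix (the exact placement inside the verifier's ALU handling
is an assumption):

  case BPF_END:
          /* the byte swap mutates dst_reg in place; sever any scalar ID
           * tie so bounds learned later for dst_reg are not propagated to
           * a register it no longer mirrors */
          dst_reg->id = 0;
          /* ... existing bitwise tracking for BPF_END ... */
          break;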
Fixes: 9d21199842 ("bpf: Add bitwise tracking for BPF_END")
Closes: https://lore.kernel.org/bpf/AMBPR06MB108683CFEB1CB8D9E02FC95ECF17EA@AMBPR06MB10868.eurprd06.prod.outlook.com/
Link: https://lore.kernel.org/bpf/4be25f7442a52244d0dd1abb47bc6750e57984c9.camel@gmail.com/
Reported-by: Guillaume Laporte <glapt.pro@outlook.com>
Co-developed-by: Tianci Cao <ziye@zju.edu.cn>
Signed-off-by: Tianci Cao <ziye@zju.edu.cn>
Co-developed-by: Shenghao Yuan <shenghaoyuan0928@163.com>
Signed-off-by: Shenghao Yuan <shenghaoyuan0928@163.com>
Signed-off-by: Yazhou Tang <tangyazhou518@outlook.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260304083228.142016-2-tangyazhou@zju.edu.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
scx_claim_exit() propagates exits to descendants under scx_sched_lock.
A sub-sched being attached concurrently could be missed if it links
after the propagation. Check the parent's exit_kind in scx_link_sched()
under scx_sched_lock to interlock against scx_claim_exit() - either the
parent sees the child in its iteration or the child sees the parent's
non-NONE exit_kind and fails attachment.
Fixes: ebeca1f930 ("sched_ext: Introduce cgroup sub-sched support")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
There are two sites that nest rq lock inside scx_sched_lock:
- scx_bypass() takes scx_sched_lock then rq lock per CPU to propagate
per-cpu bypass flags and re-enqueue tasks.
- sysrq_handle_sched_ext_dump() takes scx_sched_lock to iterate all
scheds, scx_dump_state() then takes rq lock per CPU for dump.
And scx_claim_exit() takes scx_sched_lock to propagate exits to
descendants. It can be reached from scx_tick(), BPF kfuncs, and many
other paths with rq lock already held, creating the reverse ordering:
rq lock -> scx_sched_lock vs. scx_sched_lock -> rq lock
Fix by flipping scx_bypass() to take rq lock first, and dropping
scx_sched_lock from sysrq_handle_sched_ext_dump() as scx_sched_all is
already RCU-traversable and scx_dump_lock now prevents dumping a dead
sched. This makes the consistent ordering rq lock -> scx_sched_lock.
Reported-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Link: https://lore.kernel.org/r/20260309163025.2240221-1-yphbchou0911@gmail.com
Fixes: ebeca1f930 ("sched_ext: Introduce cgroup sub-sched support")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
scx_disable() directly called kthread_queue_work() which can acquire
worker->lock, pi_lock and rq->__lock. This made scx_disable() unsafe to
call while holding locks that conflict with this chain - in particular,
scx_claim_exit() calls scx_disable() for each descendant while holding
scx_sched_lock, which nests inside rq->__lock in scx_bypass().
The error path (scx_vexit()) was already bouncing through irq_work to
avoid this issue. Generalize the pattern to all scx_disable() calls by
always going through irq_work. irq_work_queue() is lockless and safe to
call from any context, and the actual kthread_queue_work() call happens
in the irq_work handler outside any locks.
Rename error_irq_work to disable_irq_work to reflect the broader usage.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Add a dedicated scx_dump_lock and per-sched dump_disabled flag so that
debug dumping can be safely disabled during sched teardown without
relying on scx_sched_lock. This is a prep for the next patch which
decouples the sysrq dump path from scx_sched_lock to resolve a lock
ordering issue.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
sub_detach is the parent's op called to notify the parent that a child
is detaching. Test parent->ops.sub_detach instead of sch->ops.sub_detach.
Fixes: ebeca1f930 ("sched_ext: Introduce cgroup sub-sched support")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Add a resource-managed version of alloc_workqueue() to fix the common
problem of drivers mixing devm() calls with destroy_workqueue(). Such a
naive and discouraged approach leads to difficult-to-debug bugs when the
driver:
1. Allocates the workqueue in the standard way and destroys it in the
driver remove() callback,
2. Sets up a work struct with devm_work_autocancel(),
3. Registers an interrupt handler with devm_request_threaded_irq().
Which leads to the following unbind/removal path:
1. destroy_workqueue() via driver remove().
Any interrupt coming now would still execute the interrupt handler,
which queues work on the destroyed workqueue.
2. devm_irq_release(),
3. devm_work_drop() -> cancel_work_sync() on the destroyed workqueue.
devm_alloc_workqueue() has two benefits:
1. Solves the above problem of mixing devres and non-devres code in a
driver,
2. Simplifies any sane drivers which were correctly using
alloc_workqueue() + devm_add_action_or_reset().
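A sketch of what such a helper looks like using the classic devres pattern
(the actual signature may differ; alloc_workqueue()'s printf-style
arguments are omitted here for brevity):

  static void devm_workqueue_release(struct device *dev, void *res)
  {
          destroy_workqueue(*(struct workqueue_struct **)res);
  }

  struct workqueue_struct *devm_alloc_workqueue(struct device *dev,
                                                const char *fmt,
                                                unsigned int flags,
                                                int max_active)
  {
          struct workqueue_struct **ptr, *wq;

          ptr = devres_alloc(devm_workqueue_release, sizeof(*ptr),
                             GFP_KERNEL);
          if (!ptr)
                  return NULL;

          wq = alloc_workqueue(fmt, flags, max_active);
          if (!wq) {
                  devres_free(ptr);
                  return NULL;
          }

          *ptr = wq;
          devres_add(dev, ptr);
          return wq;     /* destroyed automatically on driver unbind */
  }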
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Prefer using IS_ERR_OR_NULL() over using IS_ERR() and a manual NULL
check.
Change generated with coccinelle.
Signed-off-by: Philipp Hahn <phahn-oss@avm.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
The _DESCS_COUNT macro currently uses 1U (32-bit unsigned) instead of
1UL (unsigned long), which breaks the intended overflow testing design
on 64-bit systems.
Problem Analysis:
----------------
The printk_ringbuffer uses a deliberate design choice to initialize
descriptor IDs near the maximum 62-bit value to trigger overflow early
in the system's lifetime. This is documented in printk_ringbuffer.h:
"initial values are chosen that map to the correct initial array
indexes, but will result in overflows soon."
The DESC0_ID macro calculates:
DESC0_ID(ct_bits) = DESC_ID(-(_DESCS_COUNT(ct_bits) + 1))
On 64-bit systems with typical configuration (descbits=16):
- Current buggy behavior: DESC0_ID = 0xfffeffff
- Expected behavior: DESC0_ID = 0x3ffffffffffeffff
The buggy version only uses 32 bits, which means:
1. The initial ID is nowhere near 2^62
2. It would take ~140 trillion wraps to trigger 62-bit overflow
3. The overflow handling code is never tested in practice
Root Cause:
----------
The issue is in this line:
#define _DESCS_COUNT(ct_bits) (1U << (ct_bits))
When _DESCS_COUNT(16) is calculated:
1U << 16 = 0x10000 (32-bit value)
-(0x10000 + 1) = -0x10001 = 0xFFFEFFFF (32-bit two's complement)
On 64-bit systems, this 32-bit value doesn't get extended to create
the intended 62-bit ID near the maximum value.
Impact:
------
While index calculations still work correctly in the short term, this
bug has several implications:
1. Violates the design intention documented in the code
2. Overflow handling code paths remain untested
3. ABA detection code doesn't get exercised under overflow conditions
4. In extreme long-term running scenarios (though unlikely), could
potentially cause issues when ID actually reaches 2^62
Verification:
------------
Tested on ARM64 system with CONFIG_LOG_BUF_SHIFT=20 (descbits=15):
- Before fix: DESC0_ID(16) = 0xfffeffff
- After fix: DESC0_ID(16) = 0x3fffffffffff7fff
The fix aligns _DESCS_COUNT with _DATA_SIZE, which already correctly
uses 1UL:
#define _DATA_SIZE(sz_bits) (1UL << (sz_bits))
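The fix itself is the matching one-character change:

  #define _DESCS_COUNT(ct_bits) (1UL << (ct_bits))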
Signed-off-by: feng.zhou <realsummitzhou@gmail.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Link: https://patch.msgid.link/20260202094140.9518-1-realsummitzhou@gmail.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
If the cpuidle governor .select() callback is skipped because there
is only one idle state in the cpuidle driver, the .reflect() callback
should be skipped as well, at least for consistency (if not for
correctness), so do it.
Fixes: e5c9ffc6ae ("cpuidle: Skip governor when only one idle state is available")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://patch.msgid.link/12857700.O9o76ZdvQC@rafael.j.wysocki
c2a57380df ("sched: Replace use of system_unbound_wq with system_dfl_wq")
converted system_unbound_wq usages in ext.c but missed the queue_rcu_work()
call in scx_kobj_release() which was added later by the dynamic scx_sched
allocation conversion. Apply the same conversion.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Pull sched/core to resolve conflicts between:
c2a57380df ("sched: Replace use of system_unbound_wq with system_dfl_wq")
from the tip tree and commit:
cde94c032b ("sched_ext: Make watchdog sub-sched aware")
The latter moves around code modified by the former. Apply the changes in
the new locations.
Signed-off-by: Tejun Heo <tj@kernel.org>
While running scx_flatcg, dmesg prints "SCX_OPS_HAS_CGROUP_WEIGHT is
deprecated and a noop". In the code, SCX_OPS_HAS_CGROUP_WEIGHT has been
marked as DEPRECATED and slated for removal in 6.18. Now it's time to do it.
Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently there are users of queue_delayed_work() who specify
system_long_wq, the per-cpu workqueue. This workqueue should
be used for long per-cpu works, but queue_delayed_work()
queues the work using:
queue_delayed_work_on(WORK_CPU_UNBOUND, ...);
This would end up calling __queue_delayed_work() that does:
if (housekeeping_enabled(HK_TYPE_TIMER)) {
// [....]
} else {
if (likely(cpu == WORK_CPU_UNBOUND))
add_timer_global(timer);
else
add_timer_on(timer, cpu);
}
So when cpu == WORK_CPU_UNBOUND the timer is global and is
not using a specific CPU. Later, when __queue_work() is called:
if (req_cpu == WORK_CPU_UNBOUND) {
if (wq->flags & WQ_UNBOUND)
cpu = wq_select_unbound_cpu(raw_smp_processor_id());
else
cpu = raw_smp_processor_id();
}
Because the wq is not unbound, it takes the CPU where the timer
fired and enqueues the work on that CPU.
The consequence of all of this is that the work can run anywhere,
depending on where the timer fired.
Introduce system_dfl_long_wq in order to change, in a future step,
users that are still calling:
queue_delayed_work(system_long_wq, ...);
with the new system_dfl_long_wq instead, so that the work may
benefit from scheduler task placement.
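A hypothetical conversion then looks like this (my_dwork is a placeholder):

  /* before: nominally per-cpu long workqueue; the work actually lands on
   * whichever CPU the timer happened to fire on */
  queue_delayed_work(system_long_wq, &my_dwork, HZ);

  /* after: explicitly unbound, so the scheduler places the work */
  queue_delayed_work(system_dfl_long_wq, &my_dwork, HZ);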
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The simple_ring_buffer implementation must remain simple enough to be
used by the pKVM hypervisor. Prevent the object build if unresolved
symbols are found.
Link: https://patch.msgid.link/20260309162516.2623589-19-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Add load/unload callbacks used for each admitted page in the ring-buffer.
This will be later useful for the pKVM hypervisor, which uses a different
VA space and needs to dynamically map/unmap the ring-buffer pages.
Link: https://patch.msgid.link/20260309162516.2623589-18-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Add a module to help test the tracefs support for trace remotes. This
module:
* Uses simple_ring_buffer to write into a ring-buffer.
* Declares a single "selftest" event that can be triggered from
user-space.
* Registers a "test" trace remote.
This is intended to be used by trace remote selftests.
Link: https://patch.msgid.link/20260309162516.2623589-15-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Add a simple implementation of the kernel ring-buffer. This is intended to
be used later by ring-buffer remotes such as the pKVM hypervisor, hence
the need for a cut-down version (write only) without any dependency.
Link: https://patch.msgid.link/20260309162516.2623589-14-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
In preparation for allowing the writing of ring-buffer compliant pages
outside of ring_buffer.c, move buffer_data_page and timestamps encoding
macros into the publicly available ring_buffer_types.h.
Link: https://patch.msgid.link/20260309162516.2623589-13-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Just like for the kernel events directory, add 'enable', 'header_page'
and 'header_event' at the root of the trace remote events/ directory.
Link: https://patch.msgid.link/20260309162516.2623589-11-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
An event is a predefined point in the writer code that allows data to be
logged. Following the same scheme as kernel events, add remote events,
described to user-space within the events/ tracefs directory found in
the corresponding trace remote.
Remote events are expected to be described during the trace remote
registration.
Also add a .enable_event callback to trace_remote to toggle event
logging, if supported.
Link: https://patch.msgid.link/20260309162516.2623589-10-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Add a .init callback so the trace remote callers can add entries to the
tracefs directory.
Link: https://patch.msgid.link/20260309162516.2623589-9-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Allow reading the trace file for trace remotes. This performs a
non-consuming read of the trace buffer.
Link: https://patch.msgid.link/20260309162516.2623589-8-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Allow resetting the trace remote buffer by writing to the Tracefs "trace"
file. This is similar to the regular Tracefs interface.
Link: https://patch.msgid.link/20260309162516.2623589-7-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
A trace remote relies on ring-buffer remotes to read and control
compatible tracing buffers, written by an entity such as firmware or a
hypervisor.
Add a Tracefs directory remotes/ that contains all instances of trace
remotes. Each instance follows the same hierarchy as any other to ease
support by existing user-space tools.
This currently does not provide any event support, which will come
later.
Link: https://patch.msgid.link/20260309162516.2623589-6-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The remote is expected to swap pages only at the kernel's instruction (via
the swap_reader_page() callback). This means we know at what point the
ring-buffer geometry has changed. It is therefore possible to rearrange
the kernel view of that ring-buffer to allow non-consuming reads.
Link: https://patch.msgid.link/20260309162516.2623589-5-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Add ring-buffer remotes to support entities outside of the kernel (such
as firmware or a hypervisor) that write events into a ring-buffer using
the tracefs format.
Require a description of the ring-buffer pages (struct
trace_buffer_desc) and callbacks (swap_reader_page and reset) to set up
the ring-buffer on the kernel side.
Expect the remote entity to maintain and update the meta-page.
Link: https://patch.msgid.link/20260309162516.2623589-4-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The subbuf_ids field allows pointing to a specific page of the
ring-buffer based on its ID. As preparation for the upcoming
ring-buffer remote support, point this array to the buffer_page instead
of the buffer_data_page.
Link: https://patch.msgid.link/20260309162516.2623589-3-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Add two fields, pages_touched and pages_lost, to the ring-buffer
meta-page. Those fields are useful for getting the number of used pages in
the ring-buffer.
Link: https://patch.msgid.link/20260309162516.2623589-2-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
fmod_ret BPF programs can only be attached to selected functions. For
convenience, the error injection list was originally used (along with
functions prefixed with "security_"), which contains syscalls and
several other functions.
When error injection is disabled (CONFIG_FUNCTION_ERROR_INJECTION=n),
that list is empty and fmod_ret programs are effectively unavailable for
most of the functions. In such a case, at least enable fmod_ret programs
on syscalls.
Signed-off-by: Viktor Malik <vmalik@redhat.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/472310f9a5f4944ad03214e4d943a4830fd8eb76.1773055375.git.vmalik@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Sleepable BPF programs can only be attached to selected functions. For
convenience, the error injection list was originally used, which
contains syscalls and several other functions.
When error injection is disabled (CONFIG_FUNCTION_ERROR_INJECTION=n),
that list is empty and sleepable tracing programs are effectively
unavailable. In such a case, at least enable sleepable programs on
syscalls. For discussion why syscalls were chosen, see [1].
To detect that a function is a syscall handler, we check for
arch-specific prefixes for the most common architectures. Unfortunately,
the prefixes are hard-coded in arch syscall code so we need to hard-code
them, too.
[1] https://lore.kernel.org/bpf/CAADnVQK6qP8izg+k9yV0vdcT-+=axtFQ2fKw7D-2Ei-V6WS5Dw@mail.gmail.com/
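A sketch of such a check (the prefix list is illustrative, not exhaustive):

  static bool name_is_syscall_handler(const char *fname)
  {
          /* arch-specific syscall wrapper prefixes; hard-coded because
           * the arch syscall code hard-codes them too */
          static const char * const prefixes[] = {
                  "__x64_sys_", "__ia32_sys_",    /* x86 */
                  "__arm64_sys_",                 /* arm64 */
                  "__s390x_sys_",                 /* s390 */
          };
          size_t i;

          for (i = 0; i < ARRAY_SIZE(prefixes); i++)
                  if (str_has_prefix(fname, prefixes[i]))
                          return true;
          return false;
  }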
Signed-off-by: Viktor Malik <vmalik@redhat.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/2704a8512746655037e3c02b471b31bd0d76c8db.1773055375.git.vmalik@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
scx_enable() uses double-checked locking to lazily initialize a static
kthread_worker pointer. The fast path reads helper locklessly:
if (!READ_ONCE(helper)) { // lockless read -- no helper_mutex
The write side initializes helper under helper_mutex, but previously
used a plain assignment:
helper = kthread_run_worker(0, "scx_enable_helper");
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
plain write -- KCSAN data race with READ_ONCE() above
Since READ_ONCE() on the fast path and the plain write on the
initialization path access the same variable without a common lock,
they constitute a data race. KCSAN requires that all sides of a
lock-free access use READ_ONCE()/WRITE_ONCE() consistently.
Use a temporary variable to stage the result of kthread_run_worker(),
and only WRITE_ONCE() into helper after confirming the pointer is
valid. This avoids a window where a concurrent caller on the fast path
could observe an ERR pointer via READ_ONCE(helper) before the error
check completes.
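A sketch of the fixed slow path (abridged from the description above):

  mutex_lock(&helper_mutex);
  if (!helper) {
          struct kthread_worker *w;

          w = kthread_run_worker(0, "scx_enable_helper");
          /* stage in a local and publish with WRITE_ONCE() only if valid,
           * so the lockless READ_ONCE() fast path can never observe an
           * ERR pointer */
          if (!IS_ERR(w))
                  WRITE_ONCE(helper, w);
  }
  mutex_unlock(&helper_mutex);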
Fixes: b06ccbabe2 ("sched_ext: Fix starvation of scx_enable() under fair-class saturation")
Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Instead of embedding a list_head in struct mutex, store a pointer to
the first waiter. The list of waiters remains a doubly linked list so
we can efficiently add to the tail of the list, remove from the front
(or middle) of the list.
Some of the list manipulation becomes more complicated, but it's a
reasonable tradeoff on the slow paths to shrink data structures which
embed a mutex like struct file.
Some of the debug checks have to be deleted because there's no equivalent
to checking them in the new scheme (e.g. an empty waiter->list now means
that it is the only waiter, not that the waiter is no longer on the list).
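Conceptually (fields abridged, debug and optimistic-spin members omitted),
the change shrinks the embedded footprint like so:

  struct mutex {
          atomic_long_t           owner;
          raw_spinlock_t          wait_lock;
          /* was: struct list_head wait_list; -- two pointers */
          struct mutex_waiter     *waiters;    /* first waiter, or NULL */
  };

The waiters themselves still form a doubly linked list through their
on-stack mutex_waiter entries; only the list head in the lock shrinks.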
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260305195545.3707590-4-willy@infradead.org
Instead of embedding a list_head in struct semaphore, store a pointer to
the first waiter. The list of waiters remains a doubly linked list so
we can efficiently add to the tail of the list and remove from the front
(or middle) of the list.
Some of the list manipulation becomes more complicated, but it's a
reasonable tradeoff on the slow paths to shrink data structures
which embed a semaphore.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260305195545.3707590-3-willy@infradead.org
Instead of embedding a list_head in struct rw_semaphore, store a pointer
to the first waiter. The list of waiters remains a doubly linked list
so we can efficiently add to the tail of the list, remove from the front
(or middle) of the list.
Some of the list manipulation becomes more complicated, but it's a
reasonable tradeoff on the slow paths to shrink some core data structures
like struct inode.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260305195545.3707590-2-willy@infradead.org
This reverts commit 9adfcef334.
dsq->nr is protected by dsq->lock and reading while holding the lock doesn't
constitute a racy read.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: zhidao su <suzhidao@xiaomi.com>
Merge tag 'timers-urgent-2026-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Ingo Molnar:
"Make clock_adjtime() syscall timex validation slightly more permissive
for auxiliary clocks, to not reject syscalls based on the status field
that do not try to modify the status field.
This makes the ABI behavior in clock_adjtime() consistent with
CLOCK_REALTIME"
* tag 'timers-urgent-2026-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timekeeping: Fix timex status validation for auxiliary clocks
Merge tag 'sched-urgent-2026-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
"Fix a DL scheduler bug that may corrupt internal metrics during PI and
setscheduler() syscalls, resulting in kernel warnings and misbehavior.
Found during stress-testing"
* tag 'sched-urgent-2026-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Fix missing ENQUEUE_REPLENISH during PI de-boosting
ffa7ae0724 ("sched_ext: Add reenq_flags plumbing to scx_bpf_dsq_reenq()")
introduced task_should_reenq() as a filter inside reenq_local(), requiring
SCX_REENQ_ANY to be set in order to match any task. scx_bpf_dsq_reenq()
handles this correctly by converting a bare reenq_flags=0 to SCX_REENQ_ANY,
but scx_bpf_reenqueue_local() was not updated and continued to call
reenq_local() with 0, causing it to silently reenqueue zero tasks.
Fix by passing SCX_REENQ_ANY directly.
Fixes: ffa7ae0724 ("sched_ext: Add reenq_flags plumbing to scx_bpf_dsq_reenq()")
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Merge tag 'trace-v7.0-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix possible NULL pointer dereference in trace_data_alloc()
On the trace_data_alloc() error path, it can call trigger_data_free()
with a NULL pointer. This used to be a kfree() but was changed to
trigger_data_free() to clean up any partial initialization. The issue
is that trigger_data_free() does not expect a NULL pointer. Have
trigger_data_free() return safely on NULL pointer.
- Fix multiple events on the command line and bootconfig
If multiple events are enabled on the command line separately and not
grouped, only the last event gets enabled. That is:
trace_event=sched_switch trace_event=sched_waking
will only enable sched_waking whereas:
trace_event=sched_switch,sched_waking
will enable both.
The bootconfig makes it even worse as the second way is the more
common method.
The issue is that a temporary buffer is used to store the events to
enable later in boot. Each time the cmdline callback is called, it
overwrites what was previously there.
Have the callback append the next value (delimited by a comma) if the
temporary buffer already has content.
- Fix command line trace_buffer_size if >= 2G
The logic to allocate the trace buffer uses "int" for the size
parameter in the command line code, causing overflow issues if more
than 2G is specified.
* tag 'trace-v7.0-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix trace_buf_size= cmdline parameter with sizes >= 2G
tracing: Fix enabling multiple events on the kernel command line and bootconfig
tracing: Add NULL pointer check to trigger_data_free()
SCX_ENQ_REENQ indicates that a task is being re-enqueued but doesn't tell the
BPF scheduler why. Add SCX_TASK_REENQ_REASON flags using bits 12-13 of
p->scx.flags to communicate the reason during ops.enqueue():
- NONE: Not being reenqueued
- KFUNC: Reenqueued by scx_bpf_dsq_reenq() and friends
More reasons will be added.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Task states (NONE, INIT, READY, ENABLED) were defined in a separate enum with
unshifted values and then shifted when stored in scx_entity.flags. Simplify by
defining them as pre-shifted values directly in scx_ent_flags and removing the
separate scx_task_state enum. This removes the need for shifting when
reading/writing state values.
scx_get_task_state() now returns the masked flags value directly.
scx_set_task_state() accepts the pre-shifted state value. scx_dump_task()
shifts down for display to maintain readable output.
No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
schedule_dsq_reenq() always acquires deferred_reenq_lock to queue a reenqueue
request. Add a lockless fast-path to skip lock acquisition when the request is
already pending with the required flags set.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
scx_bpf_dsq_reenq() currently only supports local DSQs. Extend it to support
user-defined DSQs by adding a deferred re-enqueue mechanism similar to the
local DSQ handling.
Add per-cpu deferred_reenq_user_node/flags to scx_dsq_pcpu and
deferred_reenq_users list to scx_rq. When scx_bpf_dsq_reenq() is called on a
user DSQ, the DSQ's per-cpu node is added to the current rq's deferred list.
process_deferred_reenq_users() then iterates the DSQ using the cursor helpers
and re-enqueues each task.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Factor out cursor-based DSQ iteration from bpf_iter_scx_dsq_next() into
nldsq_cursor_next_task() and the task-lost check from scx_dsq_move() into
nldsq_cursor_lost_task() to prepare for reuse.
As ->priv is only used to record dsq->seq for cursors, update
INIT_DSQ_LIST_CURSOR() to take the DSQ pointer and set ->priv from dsq->seq
so that users don't have to read it manually. Move scx_dsq_iter_flags enum
earlier so nldsq_cursor_next_task() can use SCX_DSQ_ITER_REV.
bypass_lb_cpu() now sets cursor.priv to dsq->seq but doesn't use it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Add per-CPU data structure to dispatch queues. Each DSQ now has a percpu
scx_dsq_pcpu which contains a back-pointer to the DSQ. This will be used by
future changes to implement per-CPU reenqueue tracking for user DSQs.
init_dsq() now allocates the percpu data and can fail, so it returns an
error code. All callers are updated to handle failures. exit_dsq() is added
to free the percpu data and is called from all DSQ cleanup paths.
In scx_bpf_create_dsq(), init_dsq() is called before rcu_read_lock() since
alloc_percpu() requires GFP_KERNEL context, and dsq->sched is set
afterwards.
v2: Fix err_free_pcpu to only exit_dsq() initialized bypass DSQs (Andrea
Righi).
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Add infrastructure to pass flags through the deferred reenqueue path.
reenq_local() now takes a reenq_flags parameter, and scx_sched_pcpu gains a
deferred_reenq_local_flags field to accumulate flags from multiple
scx_bpf_dsq_reenq() calls before processing. No flags are defined yet.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
scx_bpf_reenqueue_local() can only trigger re-enqueue of the current CPU's
local DSQ. Introduce scx_bpf_dsq_reenq() which takes a DSQ ID and can target
any local DSQ including remote CPUs via SCX_DSQ_LOCAL_ON | cpu. This will be
expanded to support user DSQs by future changes.
scx_bpf_reenqueue_local() is reimplemented as a simple wrapper around
scx_bpf_dsq_reenq(SCX_DSQ_LOCAL, 0) and may be deprecated in the future.
Update compat.bpf.h with a compatibility shim and scx_qmap to test the new
functionality.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Wrap the deferred_reenq_local_node list_head into struct
scx_deferred_reenq_local. More fields will be added and this allows using a
shorthand pointer to access them.
No functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
The deferred reenqueue local mechanism uses an llist (lockless list) for
collecting schedulers that need their local DSQs re-enqueued. Convert to a
regular list protected by a raw_spinlock.
The llist was used for its lockless properties, but the upcoming changes to
support remote reenqueue require more complex list operations that are
difficult to implement correctly with lockless data structures. A spinlock-
protected regular list provides the necessary flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Previously, both process_ddsp_deferred_locals() and reenq_local() required
forward declarations. Reorganize so that only run_deferred() needs to be
declared. Both callees are grouped right before run_deferred() for better
locality. This reduces forward declaration clutter and will ease adding more
to the run_deferred() path.
No functional changes.
v2: Also relocate process_ddsp_deferred_locals() next to run_deferred()
(Daniel Jordan).
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Change find_global_dsq() to take a CPU number directly instead of a task
pointer. This prepares for callers where the CPU is available but the task is
not.
No functional changes.
v2: Rename tcpu to cpu in find_global_dsq() (Emil Tsalapatis).
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Extract pnode allocation and deallocation logic into alloc_pnode() and
free_pnode() helpers. This simplifies scx_alloc_and_add_sched() and prepares
for adding more per-node initialization and cleanup in subsequent patches.
No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Global DSQs are currently stored as an array of scx_dispatch_q pointers,
one per NUMA node. To allow adding more per-node data structures, wrap the
global DSQ in scx_sched_pnode and replace global_dsqs with pnode array.
NUMA-aware allocation is maintained. No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Move scx_bpf_task_cgroup() kfunc definition and its BTF_ID entry to the end
of the kfunc section before __bpf_kfunc_end_defs() for cleaner code
organization.
No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
ops.quiescent() is invoked with the same deq_flags as ops.dequeue(), so
the BPF scheduler is able to distinguish sleep vs property changes in
both callbacks.
However, dequeue_task_scx() receives deq_flags as an int from the
sched_class interface, so SCX flags above bit 32 (%SCX_DEQ_SCHED_CHANGE)
are truncated. ops_dequeue() reconstructs the full u64 for ops.dequeue(),
but ops.quiescent() is still called with the original int and can never
see %SCX_DEQ_SCHED_CHANGE.
Fix this by constructing the full u64 deq_flags in dequeue_task_scx()
(renaming the int parameter to core_deq_flags) and passing the complete
flags to both ops_dequeue() and ops.quiescent().
Fixes: ebf1ccff79 ("sched_ext: Fix ops.dequeue() semantics")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
enqueue_task_scx() takes int enq_flags from the sched_class interface.
SCX enqueue flags starting at bit 32 (SCX_ENQ_PREEMPT and above) are
silently truncated when passed through activate_task(). extra_enq_flags
was added as a workaround - storing high bits in rq->scx.extra_enq_flags
and OR-ing them back in enqueue_task_scx(). However, the OR target is
still the int parameter, so the high bits are lost anyway.
The current impact is limited as the only affected flag is SCX_ENQ_PREEMPT
which is informational to the BPF scheduler - its loss means the scheduler
doesn't know about preemption but doesn't cause incorrect behavior.
Fix by renaming the int parameter to core_enq_flags and introducing a
u64 enq_flags local that merges both sources. All downstream functions
already take u64 enq_flags.
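As a sketch (abridged):

  static void enqueue_task_scx(struct rq *rq, struct task_struct *p,
                               int core_enq_flags)
  {
          /* widen first: sched-core flags live in the low 32 bits,
           * SCX-only flags such as SCX_ENQ_PREEMPT in the high bits
           * stashed in rq->scx.extra_enq_flags */
          u64 enq_flags = (u32)core_enq_flags | rq->scx.extra_enq_flags;

          /* ... all downstream helpers already take u64 enq_flags ... */
  }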
Fixes: f0e1a0643a ("sched_ext: Implement BPF extensible scheduler class")
Cc: stable@vger.kernel.org # v6.12+
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Fix an inconsistency between func_states_equal() and
collect_linked_regs():
- regsafe() uses check_ids() to verify that cached and current states
have identical register id mapping.
- func_states_equal() calls regsafe() only for registers computed as
live by compute_live_registers().
- clean_live_states() is supposed to remove dead registers from cached
states, but it can skip states belonging to an iterator-based loop.
- collect_linked_regs() collects all registers sharing the same id,
ignoring the marks computed by compute_live_registers().
Linked registers are stored in the state's jump history.
- backtrack_insn() marks all linked registers for an instruction
as precise whenever one of the linked registers is precise.
The above might lead to a scenario:
- There is an instruction I with register rY known to be dead at I.
- Instruction I is reached via two paths: first A, then B.
- On path A:
- There is an id link between registers rX and rY.
- Checkpoint C is created at I.
- Linked register set {rX, rY} is saved to the jump history.
- rX is marked as precise at I, causing both rX and rY
to be marked precise at C.
- On path B:
- There is no id link between registers rX and rY,
otherwise register states are sub-states of those in C.
- Because rY is dead at I, check_ids() returns true.
- Current state is considered equal to checkpoint C,
propagate_precision() propagates spurious precision
mark for register rY along the path B.
- Depending on a program, this might hit verifier_bug()
in the backtrack_insn(), e.g. if rY ∈ [r1..r5]
and backtrack_insn() spots a function call.
The reproducer program is in the next patch.
This was hit by sched_ext scx_lavd scheduler code.
Changes in tests:
- verifier_scalar_ids.c selftests need modification to preserve
some registers as live for __msg() checks.
- exceptions_assert.c adjusted to match changes in the verifier log,
R0 is dead after conditional instruction and thus does not get
range.
- precise.c adjusted to match changes in the verifier log, register r9
is dead after comparison and its range is not important for the test.
Reported-by: Emil Tsalapatis <emil@etsalapatis.com>
Fixes: 0fb3cf6110 ("bpf: use register liveness information for func_states_equal")
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260306-linked-regs-and-propagate-precision-v1-1-18e859be570d@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
During the CPU offline process, the dying CPU is cleared from the
cpu_online_mask in takedown_cpu(). After this step, various CPUHP_*_DEAD
callbacks are executed to perform cleanup jobs for the dead CPU, so this
cpu online check in padata_cpu_dead() is unnecessary.
Similarly, when executing padata_cpu_online() during the
CPUHP_AP_ONLINE_DYN phase, the CPU has already been set in the
cpu_online_mask; that even occurs earlier than the
CPUHP_AP_ONLINE_IDLE stage.
Remove this unnecessary cpu online check in __padata_add_cpu() and
__padata_remove_cpu().
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Some of the sizing logic through tracer_alloc_buffers() uses int
internally, causing unexpected behavior if the user passes a value that
does not fit in an int (on my x86 machine, the result is uselessly tiny
buffers).
Fix by plumbing the parameter's real type (unsigned long) through to the
ring buffer allocation functions, which already use unsigned long.
It has always been possible to create larger ring buffers via the sysfs
interface: this only affects the cmdline parameter.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/bff42a4288aada08bdf74da3f5b67a2c28b761f8.1772852067.git.calvin@wbinvd.org
Fixes: 73c5162aa3 ("tracing: keep ring buffer to minimum size till used")
Signed-off-by: Calvin Owens <calvin@wbinvd.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Same as in __reg64_deduce_bounds(), refine s32/u32 ranges
in __reg32_deduce_bounds() in the following situations:
- s32 range crosses U32_MAX/0 boundary, positive part of the s32 range
overlaps with u32 range:
0 U32_MAX
| [xxxxxxxxxxxxxx u32 range xxxxxxxxxxxxxx] |
|----------------------------|----------------------------|
|xxxxx s32 range xxxxxxxxx] [xxxxxxx|
0 S32_MAX S32_MIN -1
- s32 range crosses U32_MAX/0 boundary, negative part of the s32 range
overlaps with u32 range:
0 U32_MAX
| [xxxxxxxxxxxxxx u32 range xxxxxxxxxxxxxx] |
|----------------------------|----------------------------|
|xxxxxxxxx] [xxxxxxxxxxxx s32 range |
0 S32_MAX S32_MIN -1
- No refinement if ranges overlap in two intervals.
This helps for e.g. consider the following program:
call %[bpf_get_prandom_u32];
w0 &= 0xffffffff;
if w0 < 0x3 goto 1f; // on fall-through u32 range [3..U32_MAX]
if w0 s> 0x1 goto 1f; // on fall-through s32 range [S32_MIN..1]
if w0 s< 0x0 goto 1f; // range can be narrowed to [S32_MIN..-1]
r10 = 0;
1: ...;
The reg_bounds.c selftest is updated to incorporate identical logic,
refinement based on non-overflowing range halves:
((x ∩ [0, smax]) ∩ (y ∩ [0, smax])) ∪
((x ∩ [smin,-1]) ∩ (y ∩ [smin,-1]))
Reported-by: Andrea Righi <arighi@nvidia.com>
Reported-by: Emil Tsalapatis <emil@etsalapatis.com>
Closes: https://lore.kernel.org/bpf/aakqucg4vcujVwif@gpd4/T/
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260306-bpf-32-bit-range-overflow-v3-1-f7f67e060a6b@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Once a task exits, its state is set to TASK_DEAD and then it is
removed from the cgroup it belonged to. The last step happens once the
task gets out of its last schedule() invocation, and is delayed on
PREEMPT_RT due to locking constraints.
As a result it is possible to receive a pid via waitpid() for a task
which is still listed in cgroup.procs for the cgroup it belonged
to. This is something that systemd does not expect, and as a result it
waits for the exit until a timeout occurs.
This can also be reproduced on a !PREEMPT_RT kernel with a significant
delay in do_exit() after exit_notify().
Hide tasks which have PF_EXITING set from the output; the flag is set
before the parent is notified. Keeping zombies with live threads
shouldn't break anything (suggested by Tejun).
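A sketch of the filtering in the cgroup.procs iterator (placement is an
assumption):

  /* PF_EXITING is set before the parent is notified, so a parent that
   * has already reaped the task via waitpid() won't see it listed */
  if (unlikely(task->flags & PF_EXITING))
          continue;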
Reported-by: Bert Karwatzki <spasswolf@web.de>
Closes: https://lore.kernel.org/all/20260219164648.3014-1-spasswolf@web.de/
Tested-by: Bert Karwatzki <spasswolf@web.de>
Fixes: 9311e6c29b ("cgroup: Fix sleeping from invalid context warning on PREEMPT_RT")
Cc: stable@vger.kernel.org # v6.19+
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Multiple events can be enabled on the kernel command line via a comma
separator. But if the are specified one at a time, then only the last
event is enabled. This is because the event names are saved in a temporary
buffer, and each call by the init cmdline code will reset that buffer.
This also affects names in the boot config file, as it may call the
callback multiple times with an example of:
kernel.trace_event = ":mod:rproc_qcom_common", ":mod:qrtr", ":mod:qcom_aoss"
Change the cmdline callback function to append a comma and the next value
if the temporary buffer already has content.
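A sketch of the appending callback; the buffer name, its size, and the
exact __setup hook are assumptions for illustration:

static char bootup_event_buf[COMMAND_LINE_SIZE] __initdata;

static int __init setup_trace_event(char *str)
{
        /* Append to, rather than overwrite, what earlier invocations saved. */
        if (bootup_event_buf[0])
                strlcat(bootup_event_buf, ",", sizeof(bootup_event_buf));
        strlcat(bootup_event_buf, str, sizeof(bootup_event_buf));
        return 1;
}
__setup("trace_event=", setup_trace_event);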
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260302-trace-events-allow-multiple-modules-v1-1-ce4436e37fb8@oss.qualcomm.com
Signed-off-by: Andrei-Alexandru Tachici <andrei-alexandru.tachici@oss.qualcomm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
If trigger_data_alloc() fails and returns NULL, event_hist_trigger_parse()
jumps to the out_free error path. While kfree() safely handles a NULL
pointer, trigger_data_free() does not. This causes a NULL pointer
dereference in trigger_data_free() when evaluating
data->cmd_ops->set_filter.
Fix the problem by adding a NULL pointer check to trigger_data_free().
The problem was found by an experimental code review agent based on
gemini-3.1-pro while reviewing backports into v6.18.y.
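The shape of the fix, with the rest of the teardown elided:

void trigger_data_free(struct event_trigger_data *data)
{
        if (!data)
                return;

        /* data->cmd_ops->set_filter and the remaining teardown follow,
         * now guaranteed a non-NULL data pointer. */
}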
Cc: Miaoqian Lin <linmq006@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://patch.msgid.link/20260305193339.2810953-1-linux@roeck-us.net
Fixes: 0550069cc2 ("tracing: Properly process error handling in event_hist_trigger_parse()")
Assisted-by: Gemini:gemini-3.1-pro
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
This is an early-stage partial implementation that demonstrates the core
building blocks for nested sub-scheduler dispatching. While significant
work remains in the enqueue path and other areas, this patch establishes
the fundamental mechanisms needed for hierarchical scheduler operation.
The key building blocks introduced include:
- Private stack support for ops.dispatch() to prevent stack overflow when
walking down nested schedulers during dispatch operations
- scx_bpf_sub_dispatch() kfunc that allows parent schedulers to trigger
dispatch operations on their direct child schedulers
- Proper parent-child relationship validation to ensure dispatch requests
are only made to legitimate child schedulers
- Updated scx_dispatch_sched() to handle both nested and non-nested
invocations with appropriate kf_mask handling
The qmap scheduler is updated to demonstrate the functionality by calling
scx_bpf_sub_dispatch() on registered child schedulers when it has no
tasks in its own queues.
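A hypothetical sketch of the qmap change; the scx_bpf_sub_dispatch()
signature and the child bookkeeping (qmap_dispatch_own(), child_cgids,
nr_children) are assumptions, not the actual interface:

void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
{
        /* Serve our own queues first; fall through to children when empty. */
        if (qmap_dispatch_own(cpu, prev))
                return;

        for (u32 i = 0; i < nr_children; i++)
                scx_bpf_sub_dispatch(child_cgids[i]);
}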
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Add rhashtable-based lookup for sub-schedulers indexed by cgroup_id to
enable efficient scheduler discovery in preparation for multiple scheduler
support. The hash table allows quick lookup of the appropriate scheduler
instance when processing tasks from different cgroups.
This extends scx_link_sched() to register sub-schedulers in the hash table
and scx_unlink_sched() to remove them. A new scx_find_sub_sched() function
provides the lookup interface.
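A sketch of the lookup side, assuming scx_sched embeds a struct
rhash_head keyed by a cgroup_id field; the table, params, and field
names are illustrative:

static const struct rhashtable_params scx_sched_ht_params = {
        .key_len        = sizeof_field(struct scx_sched, cgroup_id),
        .key_offset     = offsetof(struct scx_sched, cgroup_id),
        .head_offset    = offsetof(struct scx_sched, hash_node),
};

static struct scx_sched *scx_find_sub_sched(u64 cgid)
{
        return rhashtable_lookup_fast(&scx_sched_ht, &cgid,
                                      scx_sched_ht_params);
}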
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Factor out scx_link_sched() and scx_unlink_sched() functions to reduce
code duplication in the scheduler enable/disable paths.
No functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
scx_bpf_reenqueue_local() currently re-enqueues all tasks on the local DSQ
regardless of which sub-scheduler owns them. With multiple sub-schedulers,
each should only re-enqueue tasks it owns or that are owned by its descendants.
Replace the per-rq boolean flag with a lock-free linked list to track
per-scheduler reenqueue requests. Filter tasks in reenq_local() using
hierarchical ownership checks and block deferrals during bypass to prevent
use on dead schedulers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Add a back pointer from scx_sched_pcpu to scx_sched. This will be used by
the next patch to make scx_bpf_reenqueue_local() sub-sched aware.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
The preceding changes implemented the framework to support cgroup
sub-scheds and updated scheduling paths and kfuncs so that they have
minimal but working support for sub-scheds. However, actual sub-sched
enabling/disabling hasn't been implemented yet and all tasks stay on
scx_root.
Implement cgroup sub-sched enabling and disabling to actually activate
sub-scheds:
- Both enable and disable operations bypass only the tasks in the subtree
of the child being enabled or disabled to limit disruptions.
- When enabling, all candidate tasks are first initialized for the child
sched. Once that succeeds, the tasks are exited for the parent and then
switched over to the child. This adds a bit of complication but
guarantees that child scheduler failures are always contained.
- Disabling works the same way in the other direction. However, when the
parent fails to initialize a task, disabling is propagated up to the
parent. While this means that a parent sched may fail due to a child
sched event, the failure can only originate from the parent itself (its
ops.init_task()). The only effect a malfunctioning child can have on the
parent is attempting to move the tasks back to the parent.
After this change, although not all the necessary mechanisms are in place
yet, sub-scheds can take control of their tasks and schedule them.
v2: Fix missing scx_cgroup_unlock()/percpu_up_write() in abort path
(Cheng-Yang Chou).
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Extend scx_dump_state() to support multiple schedulers and improve task
identification in dumps. The function now takes a specific scheduler to
dump and can optionally filter tasks by scheduler.
scx_dump_task() now displays which scheduler each task belongs to, using
"*" to mark tasks owned by the scheduler being dumped. Sub-schedulers
are identified with their level and cgroup ID.
The SysRq-D handler now iterates through all active schedulers under
scx_sched_lock and dumps each one separately. For SysRq-D dumps, only
tasks owned by each scheduler are dumped to avoid redundancy since all
schedulers are being dumped. Error-triggered dumps continue to dump all
tasks since only that specific scheduler is being dumped.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
The scx_dump_state() function uses a regular spinlock to serialize
access. In a subsequent patch, this function will be called while
holding scx_sched_lock, which is a raw spinlock, creating a lock
nesting violation.
Convert the dump_lock to a raw spinlock and use the guard macro for
cleaner lock management.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Currently, the watchdog checks all tasks as if they are all on scx_root.
Move scx_watchdog_timeout inside scx_sched and make check_rq_for_timeouts()
use the timeout from the scx_sched associated with each task.
refresh_watchdog() is added, which determines the timer interval as half of
the shortest watchdog timeout of all scheds and arms or disarms it as
necessary. Every scx_sched instance has equivalent or better detection
latency while sharing the same timer.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
scx_dsp_ctx and scx_dsp_max_batch are global variables used in the dispatch
path. In preparation for multiple scheduler support, move the former into
scx_sched_pcpu and the latter into scx_sched. No user-visible behavior
changes intended.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
The cgroup sub-sched support involves invasive changes to many areas of
sched_ext. The overall scaffolding is now in place and the next step is
implementing sub-sched enable/disable.
To enable partial testing and verification, update balance_one() to
dispatch from all scx_sched instances until it finds a task to run. This
should keep scheduling working when sub-scheds are enabled with tasks on
them. This will be replaced by BPF-driven hierarchical operation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
When a sub-scheduler enters bypass mode, its tasks must be scheduled by an
ancestor to guarantee forward progress. Tasks from bypassing descendants are
queued in the bypass DSQs of the nearest non-bypassing ancestor, or the root
scheduler if all ancestors are bypassing. This requires coordination between
bypassing schedulers and their hosts.
Add bypass_enq_target_dsq() to find the correct bypass DSQ by walking up the
hierarchy until reaching a non-bypassing ancestor. When a sub-scheduler starts
bypassing, all its runnable tasks are re-enqueued after scx_bypassing() is set,
ensuring proper migration to ancestor bypass DSQs.
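A sketch of that walk; the helper names follow the ones mentioned in
this series, their exact signatures are assumptions:

static struct scx_dispatch_q *bypass_enq_target_dsq(struct scx_sched *sch,
                                                    s32 cpu)
{
        /* Climb until a non-bypassing ancestor (or the root) is found. */
        while (sch->parent && scx_bypassing(sch, cpu))
                sch = sch->parent;
        return bypass_dsq(sch, cpu);
}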
Update scx_dispatch_sched() to handle hosting bypassed descendants. When a
scheduler is not bypassing but has bypassing descendants, it must schedule both
its own tasks and bypassed descendant tasks. A simple policy is implemented
where every Nth dispatch attempt (SCX_BYPASS_HOST_NTH=2) consumes from the
bypass DSQ. A fallback consumption is also added at the end of dispatch to
ensure bypassed tasks make progress even when normal scheduling is idle.
Update enable_bypass_dsp() and disable_bypass_dsp() to increment
bypass_dsp_enable_depth on both the bypassing scheduler and its parent host,
ensuring both can detect that bypass dispatch is active through
bypass_dsp_enabled().
Add SCX_EV_SUB_BYPASS_DISPATCH event counter to track scheduling of bypassed
descendant tasks.
v2: Fix comment typos (Andrea).
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
The bypass_depth field tracks nesting of bypass operations but is also used
to determine whether the bypass dispatch path should be active. With
hierarchical scheduling, child schedulers may need to activate their parent's
bypass dispatch path without affecting the parent's bypass_depth, requiring
separation of these concerns.
Add bypass_dsp_enable_depth and bypass_dsp_claim to independently control
bypass dispatch path activation. The new enable_bypass_dsp() and
disable_bypass_dsp() functions manage this state with proper claim semantics
to prevent races. The bypass dispatch path now only activates when
bypass_dsp_enabled() returns true, which checks the new enable_depth counter.
The disable operation is carefully ordered after all tasks are moved out of
bypass DSQs to ensure they are drained before the dispatch path is disabled.
During scheduler teardown, disable_bypass_dsp() is called explicitly to ensure
cleanup even if bypass mode was never entered normally.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
The @prev parameter passed into ops.dispatch() is expected to be on the
same sched. Passing in @prev which isn't on the sched can spuriously
trigger failures that can kill the scheduler. Pass in @prev iff it's on
the same sched.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
In preparation of multiple scheduler support, factor out
scx_dispatch_sched() from balance_one(). The function boundary makes
remembering $prev_on_scx and $prev_on_rq less useful. Open code $prev_on_scx
in balance_one() and $prev_on_rq in both balance_one() and
scx_dispatch_sched().
No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Bypass mode is used to simplify enable and disable paths and guarantee
forward progress when something goes wrong. When enabled, all tasks skip BPF
scheduling and fall back to simple in-kernel FIFO scheduling. While this
global behavior can be used as-is when dealing with sub-scheds, that would
allow any sub-sched instance to affect the whole system in a significantly
disruptive manner.
Make bypass state hierarchical by propagating it to descendants and updating
per-cpu flags accordingly. This allows an scx_sched to bypass if itself or
any of its ancestors are in bypass mode. However, this doesn't make the
actual bypass enqueue and dispatch paths hierarchical yet. That will be done
in later patches.
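The patch caches the result in per-cpu flags, but conceptually the test
reduces to an ancestor walk (field names here are assumptions):

static bool scx_bypassing_hier(struct scx_sched *sch)
{
        for (; sch; sch = sch->parent)
                if (READ_ONCE(sch->bypass_depth))
                        return true;
        return false;
}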
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
In preparation of multiple scheduler support, make bypass state
per-scx_sched. Move scx_bypass_depth, bypass_timestamp and bypass_lb_timer
from globals into scx_sched. Move SCX_RQ_BYPASSING from rq to scx_sched_pcpu
as SCX_SCHED_PCPU_BYPASSING.
scx_bypass() now takes @sch and scx_rq_bypassing(rq) is replaced with
scx_bypassing(sch, cpu). All callers updated.
scx_bypassed_for_enable existed to balance the global scx_bypass_depth when
enable failed. Now that bypass_depth is per-scheduler, the counter is
destroyed along with the scheduler on enable failure. Remove
scx_bypassed_for_enable.
As all tasks currently use the root scheduler, there's no observable behavior
change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
To support bypass mode for sub-schedulers, move bypass_dsq from struct scx_rq
to struct scx_sched_pcpu. Add bypass_dsq() helper. Move bypass_dsq
initialization from init_sched_ext_class() to scx_alloc_and_attach_sched().
bypass_lb_cpu() now takes a CPU number instead of rq pointer. All callers
updated. No behavior change as all tasks use the root scheduler.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
The abort state was tracked in the global scx_aborting flag which was used to
break out of potential live-lock scenarios when an error occurs. With
hierarchical scheduling, each scheduler instance must track its own abort
state independently so that an aborting scheduler doesn't interfere with
others.
Move the aborting flag into struct scx_sched and update all access sites. The
early initialization check in scx_root_enable() that warned about residual
aborting state is no longer needed as each scheduler instance now starts with
a clean state.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
The default time slice was stored in the global scx_slice_dfl variable which
was dynamically modified when entering and exiting bypass mode. With
hierarchical scheduling, each scheduler instance needs its own default slice
configuration so that bypass operations on one scheduler don't affect others.
Move slice_dfl into struct scx_sched and update all access sites. The bypass
logic now modifies the root scheduler's slice_dfl. At task initialization in
init_scx_entity(), use the SCX_SLICE_DFL constant directly since the task may
not yet be associated with a specific scheduler.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Call ops.core_sched_before() iff both tasks belong to the same scx_sched.
Otherwise, use timestamp based ordering.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
- Add the @sch parameter to scx_init_task() and drop @tg as it can be
obtained from @p. Separate out __scx_init_task() which does everything
except for the task state transition.
- Add the @sch parameter to scx_enable_task(). Separate out
__scx_enable_task() which does everything except for the task state
transition.
- Add the @sch parameter to scx_disable_task().
- Rename scx_exit_task() to scx_disable_and_exit_task() and separate out
__scx_disable_and_exit_task() which does everything except for the task
state transition.
While some task state transitions are relocated, no meaningful behavior
changes are expected.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
scx_bpf_dsq_move[_vtime]() calls scx_dsq_move() to move task from a DSQ to
another. However, @p doesn't necessarily have to come from the containing
iteration and can thus be a task which belongs to another scx_sched. Verify
that @p is on the same scx_sched as the DSQ being iterated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime() now verify that
the calling scheduler has authority over the task before allowing updates.
This prevents schedulers from modifying tasks that don't belong to them in
hierarchical scheduling configurations.
Direct writes to p->scx.slice and p->scx.dsq_vtime are deprecated and now
trigger warnings. They will be disallowed in a future release.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Add checks to enforce scheduling authority boundaries when multiple
schedulers are present:
1. In scx_dsq_insert_preamble() and the dispatch retry path, ignore attempts
to insert tasks that the scheduler doesn't own, counting them via
SCX_EV_INSERT_NOT_OWNED. As BPF schedulers are allowed to ignore
dequeues, such attempts can occur legitimately during sub-scheduler
enabling when tasks move between schedulers. The counter helps distinguish
normal cases from scheduler bugs.
2. For scx_bpf_dsq_insert_vtime() and scx_bpf_select_cpu_and(), error out
when sub-schedulers are attached. These functions lack the aux__prog
parameter needed to identify the calling scheduler, so they cannot be used
safely with multiple schedulers. BPF programs should use the arg-wrapped
versions (__scx_bpf_dsq_insert_vtime() and __scx_bpf_select_cpu_and())
instead.
These checks ensure that with multiple concurrent schedulers, scheduler
identity can be properly determined and unauthorized task operations are
prevented or tracked.
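A sketch of the gate from item 1 in the insert path; the event
accounting call is illustrative:

if (unlikely(!scx_task_on_sched(p, sch))) {
        /* Legitimate during sub-scheduler enabling; count, don't fail. */
        scx_add_event(sch, SCX_EV_INSERT_NOT_OWNED, 1);
        return;
}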
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
In preparation for multiple scheduler support, introduce scx_prog_sched()
accessor which returns the scx_sched instance associated with a BPF program.
The association is determined via the special KF_IMPLICIT_ARGS kfunc
parameter, which provides access to bpf_prog_aux. This aux can be used to
retrieve the struct_ops (sched_ext_ops) that the program is associated with,
and from there, the corresponding scx_sched instance.
For compatibility, when ops.sub_attach is not implemented (older schedulers
without sub-scheduler support), unassociated programs fall back to scx_root.
A warning is logged once per scheduler for such programs.
As scx_root is still the only scheduler, this shouldn't introduce
user-visible behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
In preparation of multiple scheduler support, add p->scx.sched which points
to the scx_sched instance that the task is scheduled by, which is currently
always scx_root. Add scx_task_sched[_rcu]() accessors which return the
associated scx_sched of the specified task and replace the raw scx_root
dereferences with it where applicable. scx_task_on_sched() is also added to
test whether a given task is on the specified sched.
As scx_root is still the only scheduler, this shouldn't introduce
user-visible behavior changes.
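Plausible accessor shapes under this design; the RCU details are
assumptions:

static struct scx_sched *scx_task_sched_rcu(struct task_struct *p)
{
        return rcu_dereference(p->scx.sched);
}

static bool scx_task_on_sched(struct task_struct *p, struct scx_sched *sch)
{
        return scx_task_sched_rcu(p) == sch;
}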
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
A system often runs multiple workloads, especially in multi-tenant server
environments where a system is split into partitions servicing separate,
more-or-less independent workloads, each requiring an application-specific
scheduler. To support such and other use cases, sched_ext is in the process
of growing multiple scheduler support.
When partitioning a system in terms of CPUs for such use cases, an
oft-taken approach is hard partitioning the system using cpuset. While it
would be possible to tie sched_ext multiple scheduler support to cpuset
partitions, such an approach would have fundamental limitations stemming
from the lack of dynamism and flexibility.
Users often don't care which specific CPUs are assigned to which workload
and want to take advantage of optimizations which are enabled by running
workloads on a larger machine - e.g. opportunistic over-commit, improving
latency critical workload characteristics while maintaining bandwidth
fairness, employing control mechanisms based on different criteria than
on-CPU time for e.g. flexible memory bandwidth isolation, packing similar
parts from different workloads on same L3s to improve cache efficiency,
and so on.
As these sorts of dynamic behaviors are impossible or difficult to implement
with hard partitioning, sched_ext is implementing cgroup sub-sched support
where schedulers can be attached to the cgroup hierarchy and a parent
scheduler is responsible for controlling the CPUs that each child can use
at any given moment. This makes CPU distribution dynamically controlled by
BPF allowing high flexibility.
This patch adds the skeletal sched_ext cgroup sub-sched support:
- sched_ext_ops.sub_cgroup_id and .sub_attach/detach() are added. Non-zero
sub_cgroup_id indicates that the scheduler is to be attached to the
identified cgroup. A sub-sched is attached to the cgroup iff the nearest
ancestor scheduler implements .sub_attach() and grants the attachment. Max
nesting depth is limited by SCX_SUB_MAX_DEPTH.
- When a scheduler exits, all its descendant schedulers are exited
together. Also, cgroup.scx_sched added which points to the effective
scheduler instance for the cgroup. This is updated on scheduler
init/exit and inherited on cgroup online. When a cgroup is offlined, the
attached scheduler is automatically exited.
- Sub-sched support is gated on CONFIG_EXT_SUB_SCHED which is
automatically enabled if both SCX and cgroups are enabled. Sub-sched
support is not tied to the CPU controller but rather the cgroup
hierarchy itself. This is intentional as the support for cpu.weight and
cpu.max based resource control is orthogonal to sub-sched support. Note
that CONFIG_CGROUPS around cgroup subtree iteration support for
scx_task_iter is replaced with CONFIG_EXT_SUB_SCHED for consistency.
- This allows loading sub-scheds and most framework operations such as
propagating disable down the hierarchy work. However, sub-scheds are not
operational yet and all tasks stay with the root sched. This will serve
as the basis for building up full sub-sched support.
- DSQs point to the scx_sched they belong to.
- scx_qmap is updated to allow attachment of sub-scheds and also serving
as sub-scheds.
- scx_is_descendant() is added but not yet used in this patch. It is used by
later changes in the series and placed here as this is where the function
belongs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
In preparation for multiple scheduler support, reorganize the enable and
disable paths to make scheduler instances explicit. Extract
scx_root_disable() from scx_disable_workfn(). Rename scx_enable_workfn()
to scx_root_enable_workfn(). Change scx_disable() to take @sch parameter
and only queue disable_work if scx_claim_exit() succeeds for consistency.
Move exit_kind validation into scx_claim_exit(). The sysrq handler now
prints a message when no scheduler is loaded.
These changes don't materially affect user-visible behavior.
v2: Keep scx_enable() name as-is and only rename the workfn to
scx_root_enable_workfn(). Change scx_enable() return type to s32.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
- Always trigger the warning if p->scx.disallow is set for fork inits. There
is no reason to set it during forks.
- Flip the positions of if/else arms to ease adding error conditions.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
The planned sched_ext cgroup sub-scheduler support needs the newly forked
task to be associated with its cgroup in its post_fork() hook. There is no
existing ordering requirement between the two now. Swap them and note the
new ordering requirement.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Cc: Ingo Molnar <mingo@redhat.com>
Make sched_cgroup_fork() pass @kargs to scx_fork(). This will be used to
determine @p's cgroup for cgroup sub-sched support.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Cc: Peter Zijlstra <peterz@infradead.org>
For the planned cgroup sub-scheduler support, enable/disable operations are
going to be subtree specific and iterating all tasks in the system for those
operations can be unnecessarily expensive and disruptive.
cgroup already has mechanisms to perform subtree task iterations. Implement
cgroup subtree iteration for scx_task_iter:
- Add optional @cgrp to scx_task_iter_start() which enables cgroup subtree
iteration.
- Make scx_task_iter use css_next_descendant_pre() and css_task_iter to
iterate all tasks in the cgroup subtree.
- Update all existing callers to pass NULL to maintain current behavior.
The two iteration mechanisms are independent and redundant. It's likely that
scx_tasks can be removed in favor of always using cgroup iteration if
CONFIG_SCHED_CLASS_EXT depends on CONFIG_CGROUPS.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Besides deferring the call to housekeeping_update(), commit 6df415aa46
("cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug
to workqueue") also defers the rebuild_sched_domains() call to
the workqueue. So a new offline CPU may still be in a sched domain
or new online CPU not showing up in the sched domains for a short
transition period. That could be a problem in some corner cases and
can be the cause of a reported test failure[1]. Fix it by calling
rebuild_sched_domains_cpuslocked() directly in hotplug as before. If
isolated partition invalidation or recreation is being done, the
housekeeping_update() call to update the housekeeping cpumasks will
still be deferred to a workqueue.
In commit 3bfe479671 ("cgroup/cpuset: Move
housekeeping_update()/rebuild_sched_domains() together"),
housekeeping_update() is called before rebuild_sched_domains() because
it needs to access the HK_TYPE_DOMAIN housekeeping cpumask. That is now
changed to use the static HK_TYPE_DOMAIN_BOOT cpumask as HK_TYPE_DOMAIN
cpumask is now changeable at run time. As a result, we can move the
rebuild_sched_domains() call before housekeeping_update() with
the slight advantage that it will be done in the same cpus_read_lock
critical section without the possibility of interference by a concurrent
cpu hot add/remove operation.
As it doesn't make sense to acquire cpuset_mutex/cpuset_top_mutex after
calling housekeeping_update() and immediately release them again, move
the cpuset_full_unlock() operation inside update_hk_sched_domains()
and rename it to cpuset_update_sd_hk_unlock() to signify that it will
release the full set of locks.
[1] https://lore.kernel.org/lkml/1a89aceb-48db-4edd-a730-b445e41221fe@nvidia.com
Fixes: 6df415aa46 ("cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue")
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Commit 0927780c90 ("sched_ext: Use READ_ONCE() for lock-free reads
of module param variables") annotated the plain reads of
scx_slice_bypass_us and scx_bypass_lb_intv_us in bypass_lb_cpu(), but
missed a third site in scx_bypass():
WRITE_ONCE(scx_slice_dfl, scx_slice_bypass_us * NSEC_PER_USEC);
scx_slice_bypass_us is a module parameter writable via sysfs in
process context through set_slice_us() -> param_set_uint_minmax(),
which performs a plain store without holding bypass_lock. scx_bypass()
reads the variable under bypass_lock, but since the writer does not
take that lock, the two accesses are concurrent.
WRITE_ONCE() only applies volatile semantics to the store of
scx_slice_dfl -- the val expression containing scx_slice_bypass_us is
evaluated as a plain read, providing no protection against concurrent
writes.
Wrap the read with READ_ONCE() to complete the annotation started by
commit 0927780c90 and make the access KCSAN-clean, consistent with
the existing READ_ONCE(scx_slice_bypass_us) in bypass_lb_cpu().
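With the annotation completed, the assignment reads:

WRITE_ONCE(scx_slice_dfl,
           READ_ONCE(scx_slice_bypass_us) * NSEC_PER_USEC);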
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
show_cpu_pool_hog() and show_cpu_pools_hogs() no longer only dump CPU
hogs — since commit 8823eaef45 ("workqueue: Show all busy workers in
stall diagnostics"), they dump every in-flight worker in the pool's
busy_hash.
Rename them to show_cpu_pool_busy_workers() and
show_cpu_pools_busy_workers() to accurately describe what they do.
Also fix the pr_info() message to say "stalled worker pools" instead of
"stalled CPU-bound worker pools", since sleeping/blocked workers are now
included.
No functional change.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmmqPRMQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgplf5D/9uOsBr+OGXtkLUJtD6MiwoJUsYgYF2dMIx
epcp+8RdMaOGtigtx69QXzTP5aPjA+AvBLAMYM+QDQDAPMWbRPsD7LaCYHy7ekwA
OL68R3QRTMYPPgpuf7pKyhif7olozAvoWAnRaoWlo67rbK+mTzZsTIsgTwF4zUu6
T0dL9thbWqtJMxKSuUk+DywggvGyNZWICJ3rAZ6os2htruH0fPhsJNGVFgNXMnpe
Cy2OvWxBWRQkZnpDEocZUdYyCRVhHr7hu311j6nSLNXufqpgFmWLGO4C3vetOlgx
ulEHfGNINcSLcw9R8pNWRxU14V6iw8Oy4nU9RtZhUpF32Iasvxb4H0w76Dp9Ukq1
/DuoSkWg/Ahn24xSYxJwwZpOEE8L92pn0M2ukCfC6h7ytmDjjEL1AQ2kyFHV4mR3
nc/3FkQ0abe3HHk8Rit6+txe3sSQo5no1z8kFlb9yp2MwAmonxCCQ9N1s7pxeeP+
iLaPbGMaZ7Ra1GswD/vzxFQtkglsxLuM5D0JkjHe99a54ZnF0vF3y9jeDVOQbV1C
H6/bU/2DI3SQ8xqv6tIXQ22reyRen3ao5VKLSrmrT/tDQVoEBV5SMnJFO1J8jBP4
QST03wiu8ShHSyZ98KefwlsndrTX02V9UVD4FVj+TZXwCWltulnIR4dVYFdySWwW
d613iUsWJw==
=NNcQ
-----END PGP SIGNATURE-----
Merge tag 'block-7.0-20260305' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block fixes from Jens Axboe:
- NVMe pull request via Keith:
- Improve quirk visibility and configurability (Maurizio)
- Fix runtime user modification to queue setup (Keith)
- Fix multipath leak on try_module_get failure (Keith)
- Ignore ambiguous spec definitions for better atomics support
(John)
- Fix admin queue leak on controller reset (Ming)
- Fix large allocation in persistent reservation read keys
(Sungwoo Kim)
- Fix fcloop callback handling (Justin)
- Securely free DHCHAP secrets (Daniel)
- Various cleanups and typo fixes (John, Wilfred)
- Avoid a circular lock dependency issue in the sysfs nr_requests or
scheduler store handling
- Fix a circular lock dependency with the pcpu mutex and the queue
freeze lock
- Cleanup for bio_copy_kern(), using __bio_add_page() rather than
bio_add_page(), as adding a page here cannot fail. The existing code
had broken cleanup for the error condition, so make it clear that the
error condition cannot happen
- Fix for a __this_cpu_read() in preemptible context splat
* tag 'block-7.0-20260305' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
block: use trylock to avoid lockdep circular dependency in sysfs
nvme: fix memory allocation in nvme_pr_read_keys()
block: use __bio_add_page in bio_copy_kern
block: break pcpu_alloc_mutex dependency on freeze_lock
blktrace: fix __this_cpu_read/write in preemptible context
nvme-multipath: fix leak on try_module_get failure
nvmet-fcloop: Check remoteport port_state before calling done callback
nvme-pci: do not try to add queue maps at runtime
nvme-pci: cap queue creation to used queues
nvme-pci: ensure we're polling a polled queue
nvme: fix memory leak in quirks_param_set()
nvme: correct comment about nvme_ns_remove()
nvme: stop setting namespace gendisk device driver data
nvme: add support for dynamic quirk configuration via module parameter
nvme: fix admin queue leak on controller reset
nvme-fabrics: use kfree_sensitive() for DHCHAP secrets
nvme: stop using AWUPF
nvme: expose active quirks in sysfs
nvme/host: fixup some typos
kthread_exit became a macro to do_exit in commit 28aaa9c399
("kthread: consolidate kthread exit paths to prevent use-after-free"),
so there is no kthread_exit function BTF ID to resolve. Remove it from
noreturn_deny to avoid resolve_btfids unresolved symbol warnings.
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On 32-bit architectures, unsigned long is only 32 bits wide, which
causes 64-bit inode numbers to be silently truncated. Several
filesystems (NFS, XFS, BTRFS, etc.) can generate inode numbers that
exceed 32 bits, and this truncation can lead to inode number collisions
and other subtle bugs on 32-bit systems.
Change the type of inode->i_ino from unsigned long to u64 to ensure that
inode numbers are always represented as 64-bit values regardless of
architecture. Update all format specifiers treewide from %lu/%lx to
%llu/%llx to match the new type, along with corresponding local variable
types.
This is the bulk treewide conversion. Earlier patches in this series
handled trace events separately to allow trace field reordering for
better struct packing on 32-bit.
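The typical shape of the treewide change, at a hypothetical call site:

/* i_ino is now u64, so 32-bit builds need the wide specifier. */
pr_info("inode %llu\n", inode->i_ino);  /* was: "inode %lu\n" */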
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20260304-iino-u64-v3-12-2257ad83d372@kernel.org
Acked-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
inode->i_ino is being widened from unsigned long to u64. The audit
subsystem uses unsigned long ino in struct fields, function parameters,
and local variables that store inode numbers from arbitrary filesystems.
On 32-bit platforms this truncates inode numbers that exceed 32 bits,
which will cause incorrect audit log entries and broken watch/mark
comparisons.
Widen all audit ino fields, parameters, and locals to u64, and update
the inode format string from %lu to %llu to match.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20260304-iino-u64-v3-2-2257ad83d372@kernel.org
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
raw_spin_rq_unlock() is short, and is called in some hot code paths
such as finish_lock_switch().
Inline raw_spin_rq_unlock() to micro-optimize performance a bit.
Signed-off-by: Xie Yuanbin <qq570070308@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20260216164950.147617-3-qq570070308@gmail.com
This recent commit:
96d1610e0b ("sched: Optimize hrtimer handling")
introduced a new build warning when !CONFIG_HOTPLUG_CPU
while SCHED_HRTIMERS=y [ == HIGH_RES_TIMERS=y ]:
/tip.testing/kernel/sched/core.c:882:13: warning: ‘hrtick_clear’ defined but not used [-Wunused-function]
Mark this helper function as always-used, instead of complicating
the code with another obscure #ifdef.
Fixes: 96d1610e0b ("sched: Optimize hrtimer handling")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/177245077226.1647592.1821545206171336606.tip-bot2@tip-bot2
Expose the following through cgroup.h:
- cgroup_on_dfl()
- cgroup_is_dead()
- cgroup_for_each_live_child()
- cgroup_for_each_live_descendant_pre()
- cgroup_for_each_live_descendant_post()
Until now, these didn't need to be exposed because controllers only cared
about the css hierarchy. The planned sched_ext hierarchical scheduler
support will be based on the default cgroup hierarchy, which is in line
with the existing BPF cgroup support, and thus needs these exposed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fix various coding style issues across the audit subsystem flagged
by checkpatch.pl script to adhere to kernel coding standards.
Specific changes include:
- kernel/auditfilter.c: Move the open brace '{' to the previous line
for the audit_ops array declaration.
- lib/audit.c: Add a required space before the open parenthesis '('.
- include/uapi/linux/audit.h: Enclose the complex macro value for
AUDIT_UID_UNSET in parentheses.
Signed-off-by: Ricardo Robaina <rrobaina@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
show_cpu_pool_hog() only prints workers whose task is currently running
on the CPU (task_is_running()). This misses workers that are busy
processing a work item but are sleeping or blocked — for example, a
worker that clears PF_WQ_WORKER and enters wait_event_idle(). Such a
worker still occupies a pool slot and prevents progress, yet produces
an empty backtrace section in the watchdog output.
This is happening on real arm64 systems, where
toggle_allocation_gate() IPIs every single CPU in the machine (which
lacks NMI), causing workqueue stalls that show empty backtraces because
toggle_allocation_gate() is sleeping in wait_event_idle().
Remove the task_is_running() filter so every in-flight worker in the
pool's busy_hash is dumped. The busy_hash is protected by pool->lock,
which is already held.
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
When diagnosing workqueue stalls, knowing how long each in-flight work
item has been executing is valuable. Add a current_start timestamp
(jiffies) to struct worker, set it when a work item begins execution in
process_one_work(), and print the elapsed wall-clock time in show_pwq().
Unlike current_at (which tracks CPU runtime and resets on wakeup for
CPU-intensive detection), current_start is never reset because the
diagnostic cares about total wall-clock time including sleeps.
Before: in-flight: 165:stall_work_fn [wq_stall]
After: in-flight: 165:stall_work_fn [wq_stall] for 100s
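The bookkeeping amounts to the following, with names as described above:

/* process_one_work(): stamp the wall-clock start of the work item. */
worker->current_start = jiffies;

/* show_pwq(): print elapsed wall-clock time, sleeps included. */
pr_cont(" for %lus", (jiffies - worker->current_start) / HZ);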
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
The watchdog_ts name doesn't convey what the timestamp actually tracks.
This field tracks the last time a workqueue got progress.
Rename it to last_progress_ts to make it clear that it records when the
pool last made forward progress (started processing new work items).
No functional change.
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
pr_cont_worker_id() checks pool->flags against WQ_BH, which is a
workqueue-level flag (defined in workqueue.h). Pool flags use a
separate namespace with POOL_* constants (defined in workqueue.c).
The correct constant is POOL_BH. Both WQ_BH and POOL_BH are defined
as (1 << 0) so this has no behavioral impact, but it is semantically
wrong and inconsistent with every other pool-level BH check in the
file.
Fixes: 4cb1ef6460 ("workqueue: Implement BH workqueues to eventually replace tasklets")
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Borislav reported a division by zero in the timekeeping code and random
hangs with the new coupled clocksource/clockevent functionality.
It turned out that the TSC clocksource is not always updating the
freq_khz field of the clocksource on registration. The coupled mode
conversion calculation requires the frequency and as it's not
initialized the resulting factor is zero or a random value. As a
consequence this causes a division by zero or random boot hangs.
Instead of chasing down all clocksources which fail to update that
member, fill it in at registration time where the caller has to supply
the frequency anyway. Except for special clocksources like jiffies which
never can have coupled mode.
To make this more robust put a check into the registration function to
validate that the caller supplied a frequency if the coupled mode
feature bit is set. If not, emit a warning and clear the feature bit.
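A sketch of the registration-time handling; units are simplified and the
coupled feature flag name is an assumption:

/* In the clocksource registration path: */
if (freq) {
        cs->freq_khz = freq * scale / 1000;     /* caller-supplied */
} else if (cs->flags & CLOCK_SOURCE_COUPLED) {
        WARN_ONCE(1, "clocksource %s: coupled mode needs a frequency",
                  cs->name);
        cs->flags &= ~CLOCK_SOURCE_COUPLED;
}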
Fixes: cd38bdb8e6 ("timekeeping: Provide infrastructure for coupled clockevents")
Reported-by: Borislav Petkov <bp@alien8.de>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Borislav Petkov <bp@alien8.de>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: https://patch.msgid.link/87cy1jsa4m.ffs@tglx
Closes: https://lore.kernel.org/20260303213027.GA2168957@ax162
Nathan reported a boot failure after the coupled clocksource/event support
was enabled for the TSC deadline timer. It turns out that on the affected
test systems the TSC frequency is not refined against HPET, so it is
registered with the same frequency as the TSC-early clocksource.
As a consequence the update function which checks for a change of the
shift/mult pair of the clocksource fails to compute the conversion
limit, which is zero initialized. This check is there to avoid pointless
computations on every timekeeping update cycle (tick).
So the actual clockevent conversion function limits the delta expiry to
zero, which means the timer is always programmed to expire in the
past. This obviously results in a spectacular timer interrupt storm,
which goes unnoticed because the per CPU interrupts on x86 are not
exposed to the runaway detection mechanism and the NMI watchdog is not
yet functional. So the machine simply stops booting.
That did not show up in testing. All test machines refine the TSC frequency
so TSC has a different shift/mult pair than TSC-early and the conversion
limit is properly initialized.
Cure that by setting the conversion limit right at the point where the new
clocksource is installed.
Fixes: cd38bdb8e6 ("timekeeping: Provide infrastructure for coupled clockevents")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: John Stultz <jstultz@google.com>
Link: https://patch.msgid.link/87bjh4zies.ffs@tglx
Closes: https://lore.kernel.org/20260303012905.GA978396@ax162
The task ownership state machine in sched_ext is quite hard to follow
from the code alone. The interaction of ownership states, memory
ordering rules and cross-CPU "lock dancing" makes the overall model
subtle.
Extend the documentation next to scx_ops_state to provide a more
structured and self-contained description of the state transitions and
their synchronization rules.
The new reference should make the code easier to reason about and
maintain and can help future contributors understand the overall
task-ownership workflow.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
bypass_lb_cpu() reads scx_bypass_lb_intv_us and scx_slice_bypass_us
without holding any lock, in timer callback context where module
parameter writes via sysfs can happen concurrently:
min_delta_us = scx_bypass_lb_intv_us / SCX_BYPASS_LB_MIN_DELTA_DIV;
^^^^^^^^^^^^^^^^^^^^
plain read -- KCSAN data race
if (delta < DIV_ROUND_UP(min_delta_us, scx_slice_bypass_us))
^^^^^^^^^^^^^^^^^
plain read -- KCSAN data race
scx_bypass_lb_intv_us already uses READ_ONCE() in scx_bypass_lb_timerfn()
and scx_bypass() for its other lock-free read sites, leaving
bypass_lb_cpu() inconsistent. scx_slice_bypass_us has the same
lock-free access pattern in the same function.
Fix both plain reads by using READ_ONCE() to complete the concurrent
access annotation and make the code KCSAN-clean.
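Per the description, the annotated reads take the form:

min_delta_us = READ_ONCE(scx_bypass_lb_intv_us) /
               SCX_BYPASS_LB_MIN_DELTA_DIV;
if (delta < DIV_ROUND_UP(min_delta_us, READ_ONCE(scx_slice_bypass_us)))
        /* load-balance as before */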
Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
- Fix thresh_return of function graph tracer
The update to store data on the shadow stack removed the abuse of
using the task recursion word as a way to keep track of what functions
to ignore. The trace_graph_return() was updated to handle this, but
when function_graph tracer is using a threshold (only trace functions
that took longer than a specified time), it uses
trace_graph_thresh_return() instead. This function was still incorrectly
using the task struct recursion word causing the function graph tracer to
permanently set all functions to "notrace"
- Fix thresh_return nosleep accounting
When the calltime was moved to the shadow stack storage instead of being
on the fgraph descriptor, the calculations for the amount of sleep time
was updated. The calculation was done in the trace_graph_thresh_return()
function, which also called the trace_graph_return(), which did the
calculation again, causing the time to be doubled.
Remove the call to trace_graph_return() as what it needed to do wasn't
that much, and just do the work in trace_graph_thresh_return().
- Fix syscall trace event activation on boot up
The syscall trace events are pseudo events attached to the raw_syscall
tracepoints. When the first syscall event is enabled, it enables the
raw_syscall tracepoint and doesn't need to do anything when a second
syscall event is also enabled.
When events are enabled via the kernel command line, syscall events
are partially enabled as the enabling is called before rcu_init.
This is done to allow early events to be enabled immediately. Because
kernel command line events do not distinguish between different
types of events, the syscall events are enabled here but are not fully
functioning. After rcu_init, they are disabled and re-enabled so that
they can be fully enabled. The problem is that this
"disable-enable" is done one at a time. If more than one syscall event
is specified on the command line, by disabling them one at a time,
the counter never gets to zero, and the raw_syscall is not disabled and
enabled, keeping the syscall events in their non-fully functional state.
Instead, disable all events and re-enable them all, as that will ensure
the raw_syscall event is also disabled and re-enabled.
- Disable preemption in ftrace pid filtering
The ftrace pid filtering attaches to the fork and exit tracepoints to
add or remove pids that should be traced. They access variables protected
by RCU (preemption disabled). Now that tracepoint callbacks are called with
preemption enabled, this protection needs to be added explicitly, and
not depend on the functions being called with preemption disabled.
- Disable preemption in event pid filtering
The event pid filtering needs the same preemption disabling guards as
ftrace pid filtering.
- Fix accounting of the memory mapped ring buffer on fork
Memory mapping the ftrace ring buffer sets the vm_flags to DONTCOPY. But
this does not prevent the application from calling madvise(MADV_DOFORK).
This causes the mapping to be copied on fork. After the first task exits,
the mapping is considered unmapped by everyone. But when the second task
exits, the counter goes below zero and triggers a WARN_ON.
Since nothing prevents two separate tasks from mmapping the ftrace ring
buffer (although two mappings may mess each other up), there's no reason
to stop the memory from being copied on fork.
Update the vm_operations to have an ".open" handler to update the
accounting and let the ring buffer know someone else has it mapped.
- Add all ftrace headers in MAINTAINERS file
The MAINTAINERS file only specifies include/linux/ftrace.h But misses
ftrace_irq.h and ftrace_regs.h. Make the file use wildcards to get all
*ftrace* files.
-----BEGIN PGP SIGNATURE-----
iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaamiIBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qulnAP9ZO6iChQL0hX/Xuu2VyRhVz0Svf8Sg
iq2IUHP48twOogEApR4zeelMORxdKqkLR+BajZUVFR1PukVbMaszPr9GoQw=
=H9pj
-----END PGP SIGNATURE-----
Merge tag 'trace-v7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix thresh_return of function graph tracer
The update to store data on the shadow stack removed the abuse of
using the task recursion word as a way to keep track of what
functions to ignore. The trace_graph_return() was updated to handle
this, but when function_graph tracer is using a threshold (only trace
functions that took longer than a specified time), it uses
trace_graph_thresh_return() instead.
This function was still incorrectly using the task struct recursion
word causing the function graph tracer to permanently set all
functions to "notrace"
- Fix thresh_return nosleep accounting
When the calltime was moved to the shadow stack storage instead of
being on the fgraph descriptor, the calculations for the amount of
sleep time was updated. The calculation was done in the
trace_graph_thresh_return() function, which also called the
trace_graph_return(), which did the calculation again, causing the
time to be doubled.
Remove the call to trace_graph_return() as what it needed to do
wasn't that much, and just do the work in
trace_graph_thresh_return().
- Fix syscall trace event activation on boot up
The syscall trace events are pseudo events attached to the
raw_syscall tracepoints. When the first syscall event is enabled, it
enables the raw_syscall tracepoint and doesn't need to do anything
when a second syscall event is also enabled.
When events are enabled via the kernel command line, syscall events
are partially enabled as the enabling is called before rcu_init. This
is done to allow early events to be enabled immediately. Because
kernel command line events do not distinguish between different types
of events, the syscall events are enabled here but are not fully
functioning. After rcu_init, they are disabled and re-enabled so that
they can be fully enabled.
The problem is that this "disable-enable" is done one at a
time. If more than one syscall event is specified on the command
line, by disabling them one at a time, the counter never gets to
zero, and the raw_syscall is not disabled and enabled, keeping the
syscall events in their non-fully functional state.
Instead, disable all events and re-enable them all, as that will
ensure the raw_syscall event is also disabled and re-enabled.
- Disable preemption in ftrace pid filtering
The ftrace pid filtering attaches to the fork and exit tracepoints to
add or remove pids that should be traced. They access variables
protected by RCU (preemption disabled). Now that tracepoint callbacks
are called with preemption enabled, this protection needs to be added
explicitly, and not depend on the functions being called with
preemption disabled.
- Disable preemption in event pid filtering
The event pid filtering needs the same preemption disabling guards as
ftrace pid filtering.
- Fix accounting of the memory mapped ring buffer on fork
Memory mapping the ftrace ring buffer sets the vm_flags to DONTCOPY.
But this does not prevent the application from calling
madvise(MADV_DOFORK). This causes the mapping to be copied on
fork. After the first task exits, the mapping is considered unmapped
by everyone. But when the second task exits, the counter goes below
zero and triggers a WARN_ON.
Since nothing prevents two separate tasks from mmapping the ftrace
ring buffer (although two mappings may mess each other up), there's
no reason to stop the memory from being copied on fork.
Update the vm_operations to have an ".open" handler to update the
accounting and let the ring buffer know someone else has it mapped.
- Add all ftrace headers in MAINTAINERS file
The MAINTAINERS file only specifies include/linux/ftrace.h But misses
ftrace_irq.h and ftrace_regs.h. Make the file use wildcards to get
all *ftrace* files.
* tag 'trace-v7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ftrace: Add MAINTAINERS entries for all ftrace headers
tracing: Fix WARN_ON in tracing_buffers_mmap_close
tracing: Disable preemption in the tracepoint callbacks handling filtered pids
ftrace: Disable preemption in the tracepoint callbacks handling filtered pids
tracing: Fix syscall events activation by ensuring refcount hits zero
fgraph: Fix thresh_return nosleeptime double-adjust
fgraph: Fix thresh_return clear per-task notrace
- Fix a potential kernel panic in the module loader by adding a bounds
check for the ELF section index. This prevents crashes if attempting
to load a module that uses SHN_XINDEX or is corrupted.
- Fix the Kconfig menu layout for module versioning, signing, and
compression options so they correctly appear as submenus in menuconfig.
- Remove a redundant lockdep_free_key_range() call in the load_module()
error path. This is already handled by module_deallocate() calling
free_mod_mem() since the module_memory rework.
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQSE9au1u/dCZerzchhaByWrOaGnegUCaaeC1QAKCRBaByWrOaGn
enQ7AQCJWZPofsDiEN2GZsupXsMMn1kt4xkimGGlb55Fwq1/pQD+OfczUt63MBst
dwMJuaW4ndRQLRXFQHpoa441zjFCcgw=
=CkAk
-----END PGP SIGNATURE-----
Merge tag 'modules-7.0-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux
Pull module fixes from Sami Tolvanen:
- Fix a potential kernel panic in the module loader by adding a bounds
check for the ELF section index. This prevents crashes if attempting
to load a module that uses SHN_XINDEX or is corrupted.
- Fix the Kconfig menu layout for module versioning, signing, and
compression options so they correctly appear as submenus in
menuconfig.
- Remove a redundant lockdep_free_key_range() call in the load_module()
error path. This is already handled by module_deallocate() calling
free_mod_mem() since the module_memory rework.
* tag 'modules-7.0-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux:
module: Fix kernel panic when a symbol st_shndx is out of bounds
module: Fix the modversions and signing submenus
module: Remove duplicate freeing of lockdep classes
The timekeeping_validate_timex() function validates the timex status
of an auxiliary system clock even when the status is not to be changed,
which causes unexpected errors for applications that make read-only
clock_adjtime() calls, or set some other timex fields without clearing
the status field.
Do the AUX-specific status validation only when the modes field contains
ADJ_STATUS, i.e. the application is actually trying to change the
status. This makes the AUX-specific clock_adjtime() behavior consistent
with CLOCK_REALTIME.
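The gating amounts to something like the following inside
timekeeping_validate_timex(); the aux_clock predicate and the
valid-status mask are assumptions:

/* Validate AUX status bits only when the caller changes them. */
if (aux_clock && (txc->modes & ADJ_STATUS)) {
        if (txc->status & ~AUX_CLOCK_VALID_STATUS)
                return -EINVAL;
}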
Fixes: 4eca49d0b6 ("timekeeping: Prepare do_adtimex() for auxiliary clocks")
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260225085231.276751-1-mlichvar@redhat.com
bpf_iter_scx_dsq_new() reads dsq->seq via READ_ONCE() without holding
any lock, making dsq->seq a lock-free concurrently accessed variable.
However, dispatch_enqueue(), the sole writer of dsq->seq, uses a plain
increment without the matching WRITE_ONCE() on the write side:
dsq->seq++;
^^^^^^^^^^^
plain write -- KCSAN data race
The KCSAN documentation requires that if one accessor uses READ_ONCE()
or WRITE_ONCE() on a variable to annotate lock-free access, all other
accesses must also use the appropriate accessor. A plain write leaves
the pair incomplete and will trigger KCSAN warnings.
Fix by using WRITE_ONCE() for the write side of the update:
WRITE_ONCE(dsq->seq, dsq->seq + 1);
This is consistent with bpf_iter_scx_dsq_new() and makes the
concurrent access annotation complete and KCSAN-clean.
Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
* Fix error when reporting jiffies converted values back to user space
Return the converted value instead of "Invalid argument" error.
* Testing
Spent around a week in linux-next (enough for this small fix)
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEErkcJVyXmMSXOyyeQupfNUreWQU8FAmmoNL0ACgkQupfNUreW
QU/dMQv/YL6Lpv76iFGOL8gP1tU3oebKWKGJYDcQtqPZfLnlFmWu+XliHCldqJ7J
Ur9u2KleA0jM/Szq/v4FOyq2L7992dpKSkzM6ZsMyEfrz0e21WCZus40pcpE0L2j
kMNo4Vf3bAP+18KNsxh6zUc9WeYJ3suySmme+je2WNkab/io9XNUxYv7LhnKWze7
3iCXYZj/HtF3G9/xk0v3Ihlw6rNRVxNPfC3DpGXlvtnTSchlj9S9IK4pczcAmdw8
CNTEGCi+yzZYCcyI310IoeH0d3L5k39daJqtSC0BlVp607kr57nt5Hygf08WdnG8
2U+lvoWKp7odyu9/D1nqcpoQVY+9IzRkW+RM1bnYOmNYAiFrhiKTNCpZOhhqWn6P
3f3zvRq3Wt9zuA8upGjT6adxTrPMkpiqQD4POExgzSvoqkZ31Lw1/A6INtdWng82
+rFdL4PqdElrghVl07zydX5UWz/+fZKsQMz/j1cKROhKRQsaLXWIYHbo6OSlp6AC
JLONWmgW
=F0SK
-----END PGP SIGNATURE-----
Merge tag 'sysctl-7.00-fixes-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl
Pull sysctl fix from Joel Granados:
- Fix error when reporting jiffies converted values back to user space
Return the converted value instead of "Invalid argument" error
* tag 'sysctl-7.00-fixes-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
time/jiffies: Fix sysctl file error on configurations where USER_HZ < HZ
Running stress-ng --schedpolicy 0 on an RT kernel on a big machine
might lead to the following WARNINGs (edited).
sched: DL de-boosted task PID 22725: REPLENISH flag missing
WARNING: CPU: 93 PID: 0 at kernel/sched/deadline.c:239 dequeue_task_dl+0x15c/0x1f8
... (running_bw underflow)
Call trace:
dequeue_task_dl+0x15c/0x1f8 (P)
dequeue_task+0x80/0x168
deactivate_task+0x24/0x50
push_dl_task+0x264/0x2e0
dl_task_timer+0x1b0/0x228
__hrtimer_run_queues+0x188/0x378
hrtimer_interrupt+0xfc/0x260
...
The problem is that when a SCHED_DEADLINE task (lock holder) is
changed to a lower priority class via sched_setscheduler(), it may
fail to properly inherit the parameters of potential DEADLINE donors
if it didn't already inherit them in the past (because its deadline was
shorter than the donor's at that time). This might lead to bandwidth accounting
corruption, as enqueue_task_dl() won't recognize the lock holder as
boosted.
The scenario occurs when:
1. A DEADLINE task (donor) blocks on a PI mutex held by another
DEADLINE task (holder), but the holder doesn't inherit parameters
(e.g., it already has a shorter deadline)
2. sched_setscheduler() changes the holder from DEADLINE to a lower
class while still holding the mutex
3. The holder should now inherit DEADLINE parameters from the donor
and be enqueued with ENQUEUE_REPLENISH, but this doesn't happen
Fix the issue by introducing __setscheduler_dl_pi(), which detects when
a DEADLINE (proper or boosted) task gets setscheduled to a lower
priority class. In that case, the function makes the task inherit the
DEADLINE parameters of the donor (pi_se) and sets the ENQUEUE_REPLENISH
flag to ensure proper bandwidth accounting during the next enqueue
operation.
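In sketch form, under assumed accessors (not the actual patch):

	static int __setscheduler_dl_pi(struct task_struct *p,
					const struct sched_dl_entity *pi_se)
	{
		if (!pi_se)
			return 0;

		/* leaving DEADLINE while PI-boosted: adopt the donor's
		 * parameters so bandwidth accounting stays consistent */
		p->dl.dl_runtime  = pi_se->dl_runtime;
		p->dl.dl_deadline = pi_se->dl_deadline;
		p->dl.dl_period   = pi_se->dl_period;

		/* caller ORs this into the next enqueue's flags */
		return ENQUEUE_REPLENISH;
	}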
Fixes: 2279f540ea ("sched/deadline: Fix priority inheritance with multiple scheduling classes")
Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260302-upstream-fix-deadline-piboost-b4-v3-1-6ba32184a9e0@redhat.com
Commit 2dc164a48e ("sysctl: Create converter functions with two new
macros") incorrectly returns error to user space when jiffies sysctl
converter is used. The old overflow check got replaced with an
unconditional one:
+ if (USER_HZ < HZ)
+ return -EINVAL;
which will always be true on configurations with "USER_HZ < HZ".
Remove the check; it is no longer needed, as clock_t_to_jiffies() returns
ULONG_MAX in the overflow case and proc_int_u2k_conv_uop() checks for
"> INT_MAX" after conversion.
Fixes: 2dc164a48e ("sysctl: Create converter functions with two new macros")
Reported-by: Colm Harrington <colm.harrington@oracle.com>
Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Add support for DMA scatter-gather mapping, intended for testing
mapping performance. This is achieved by introducing the dma_sg_map_param
structure and related functions, which implement scatter-gather
mapping preparation, mapping, and unmapping operations.
Additionally, the dma_map_benchmark_ops array is updated to include
operations for scatter-gather mapping. This commit aims to provide
a wider range of mapping performance tests to cater to different scenarios.
Reviewed-by: Barry Song <baohua@kernel.org>
Signed-off-by: Qinxin Xia <xiaqinxin@huawei.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20260225093800.3625054-3-xiaqinxin@huawei.com
This patch adjusts the DMA map benchmark framework to make it more
flexible and adaptable to other mapping modes in the future, by
abstracting the framework into five interfaces: prepare, unprepare,
initialize_data, do_map, and do_unmap. New map schemas can then be
introduced without major modifications to the existing code structure.
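The five interfaces could be grouped in an ops structure along these
lines (a sketch; the struct layout is an assumption based on the
commit text):

	struct dma_map_benchmark_ops {
		int  (*prepare)(struct map_benchmark_data *map);
		void (*unprepare)(struct map_benchmark_data *map);
		void (*initialize_data)(struct map_benchmark_data *map);
		int  (*do_map)(struct map_benchmark_data *map);
		void (*do_unmap)(struct map_benchmark_data *map);
	};

A new mapping mode then only has to supply its own ops entry instead
of touching the benchmark loop itself.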
Reviewed-by: Barry Song <baohua@kernel.org>
Signed-off-by: Qinxin Xia <xiaqinxin@huawei.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20260225093800.3625054-2-xiaqinxin@huawei.com
When a process forks, the child process copies the parent's VMAs but the
user_mapped reference count is not incremented. As a result, when both the
parent and child processes exit, tracing_buffers_mmap_close() is called
twice. On the second call, user_mapped is already 0, causing the function to
return -ENODEV and trigger a WARN_ON.
Normally, this isn't an issue, as the memory is mapped with VM_DONTCOPY set.
But that flag is not permanent: the application can call
madvise(MADV_DOFORK), which clears the VM_DONTCOPY flag. When the
application does that, it can trigger this issue on fork.
Fix it by incrementing the user_mapped reference count without re-mapping
the pages in the VMA's open callback.
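A sketch of that open-callback approach (the function, type, and
helper names here are assumptions, not the actual patch):

	static void tracing_buffers_mmap_open(struct vm_area_struct *vma)
	{
		struct ftrace_buffer_info *info = vma->vm_private_data;

		/* The forked child's VMA shares the already-mapped pages,
		 * so don't re-map anything; just take one more
		 * user_mapped reference so the second mmap_close
		 * balances instead of underflowing. */
		get_user_mapped_ref(info);	/* hypothetical helper */
	}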
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://patch.msgid.link/20260227025842.1085206-1-wangqing7171@gmail.com
Fixes: cf9f0f7c4c ("tracing: Allow user-space mapping of the ring-buffer")
Reported-by: syzbot+3b5dd2030fe08afdf65d@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=3b5dd2030fe08afdf65d
Tested-by: syzbot+3b5dd2030fe08afdf65d@syzkaller.appspotmail.com
Signed-off-by: Qing Wang <wangqing7171@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When function trace PID filtering is enabled, the function tracer
attaches callbacks to the fork and exit tracepoints that add the forked
child's PID to the PID filtering list and remove the PID of the task
that is exiting.
Commit a46023d561 ("tracing: Guard __DECLARE_TRACE() use of
__DO_TRACE_CALL() with SRCU-fast") removed the disabling of preemption
when calling tracepoint callbacks.
The callbacks used for the PID filtering accounting depended on preemption
being disabled, and now they trigger a "suspicious RCU usage" warning message.
Make them explicitly disable preemption.
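One way to write that, as a sketch (the callback name and guard style
are illustrative):

	static void event_filter_pid_sched_process_fork(void *data,
							struct task_struct *self,
							struct task_struct *task)
	{
		/* tracepoint core no longer disables preemption for us */
		guard(preempt)();
		/* ... record task's PID in the filtered-PID list ... */
	}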
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260302213546.156e3e4f@gandalf.local.home
Fixes: a46023d561 ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
When multiple syscall events are specified in the kernel command line
(e.g., trace_event=syscalls:sys_enter_openat,syscalls:sys_enter_close),
they are often not captured after boot, even though they appear enabled
in the tracing/set_event file.
The issue stems from how syscall events are initialized. Syscall
tracepoints require the global reference count (sys_tracepoint_refcount)
to transition from 0 to 1 to trigger the registration of the syscall
work (TIF_SYSCALL_TRACEPOINT) for tasks, including the init process (pid 1).
The current implementation of early_enable_events() with disable_first=true
used an interleaved sequence of "Disable A -> Enable A -> Disable B -> Enable B".
If multiple syscalls are enabled, the refcount never drops to zero,
preventing the 0->1 transition that triggers actual registration.
Fix this by splitting early_enable_events() into two distinct phases:
1. Disable all events specified in the buffer.
2. Enable all events specified in the buffer.
This ensures the refcount hits zero before re-enabling, allowing syscall
events to be properly activated during early boot.
The code is also refactored to use a helper function to avoid logic
duplication between the disable and enable phases.
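A sketch of the two-phase structure (the helper name is illustrative;
note strsep() consumes its input, so each phase walks its own copy of
the boot buffer):

	static void early_set_events(struct trace_array *tr, char *buf,
				     bool enable)
	{
		char *token;

		while ((token = strsep(&buf, ",")) != NULL)
			if (*token)
				ftrace_set_clr_event(tr, token, enable);
	}

	/* Phase 1: drop every requested event so the syscall
	 * tracepoint refcount can reach zero. */
	early_set_events(tr, disable_buf, false);
	/* Phase 2: re-enable, triggering the 0->1 transition that
	 * registers TIF_SYSCALL_TRACEPOINT. */
	early_set_events(tr, enable_buf, true);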
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260224023544.1250787-1-hehuiwen@kylinos.cn
Fixes: ce1039bd3a ("tracing: Fix enabling of syscall events on the command line")
Signed-off-by: Huiwen He <hehuiwen@kylinos.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
trace_graph_thresh_return() called handle_nosleeptime() and then delegated
to trace_graph_return(), which calls handle_nosleeptime() again. When
sleep-time accounting is disabled this double-adjusts calltime and can
produce bogus durations (including underflow).
Fix this by computing rettime once, applying handle_nosleeptime() only
once, using the adjusted calltime for threshold comparison, and writing
the return event directly via __trace_graph_return() when the threshold is
met.
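In sketch form, the reordered handler becomes (signatures and locals
simplified for illustration):

	u64 rettime  = trace_clock_local();		/* computed once */
	u64 calltime = handle_nosleeptime(trace);	/* adjusted once */

	if (rettime - calltime < tracing_thresh)
		return;

	__trace_graph_return(tr, trace, trace_ctx, calltime, rettime);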
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260221113314048jE4VRwIyZEALiYByGK0My@zte.com.cn
Fixes: 3c9880f3ab ("ftrace: Use a running sleeptime instead of saving on shadow stack")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When tracing_thresh is enabled, function graph tracing uses
trace_graph_thresh_return() as the return handler. Unlike
trace_graph_return(), it did not clear the per-task TRACE_GRAPH_NOTRACE
flag set by the entry handler for set_graph_notrace addresses. This could
leave the task permanently in "notrace" state and effectively disable
function graph tracing for that task.
Mirror trace_graph_return()'s per-task notrace handling by clearing
TRACE_GRAPH_NOTRACE and returning early when set.
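In sketch form, mirroring the handling in trace_graph_return()
(task_var stands for the per-task shadow-stack variable):

	if (*task_var & TRACE_GRAPH_NOTRACE) {
		/* the entry handler marked this task notrace; clear the
		 * flag and skip emitting the return event */
		*task_var &= ~TRACE_GRAPH_NOTRACE;
		return;
	}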
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260221113007819YgrZsMGABff4Rc-O_fZxL@zte.com.cn
Fixes: b84214890a ("function_graph: Move graph notrace bit to shadow stack global var")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The root cause of this bug is that when 'bpf_link_put' reduces the
refcount of 'shim_link->link.link' to zero, the resource is considered
released but may still be referenced via 'tr->progs_hlist' in
'cgroup_shim_find'. The actual cleanup of 'tr->progs_hlist' in
'bpf_shim_tramp_link_release' is deferred. During this window, another
process can cause a use-after-free via 'bpf_trampoline_link_cgroup_shim'.
Based on Martin KaFai Lau's suggestions, fix this with a simple patch:
add an atomic non-zero check in 'bpf_trampoline_link_cgroup_shim', and
only increment the refcount if it is not already zero.
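Roughly, the guarded lookup becomes (a sketch; locals abbreviated):

	shim_link = cgroup_shim_find(tr, bpf_func);
	if (shim_link) {
		/* A zero refcount means bpf_shim_tramp_link_release()
		 * may already be running and this hlist entry is stale,
		 * so don't reuse it. */
		if (!atomic64_inc_not_zero(&shim_link->link.link.refcnt))
			shim_link = NULL;	/* allocate a fresh shim */
	}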
Testing:
I verified the fix by adding a delay in
'bpf_shim_tramp_link_release' to make the bug easier to trigger:
	static void bpf_shim_tramp_link_release(struct bpf_link *link)
	{
		/* ... */
		if (!shim_link->trampoline)
			return;
	+	msleep(100);
		WARN_ON_ONCE(bpf_trampoline_unlink_prog(&shim_link->link,
							shim_link->trampoline, NULL));
		bpf_trampoline_put(shim_link->trampoline);
	}
Before the patch, running a PoC easily reproduced the crash (almost 100%
of the time) with a call trace similar to the one in Kaiyan Mei's report.
After the patch, the bug no longer occurs even after millions of
iterations.
Fixes: 69fd337a97 ("bpf: per-cgroup lsm flavor")
Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
Closes: https://lore.kernel.org/bpf/3c4ebb0b.46ff8.19abab8abe2.Coremail.kaiyanm@hust.edu.cn/
Signed-off-by: Lang Xu <xulang@uniontech.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/279EEE1BA1DDB49D+20260303095217.34436-1-xulang@uniontech.com
Merge tag 'cgroup-for-7.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
- Fix circular locking dependency in cpuset partition code by
deferring housekeeping_update() calls to a workqueue instead
of calling them directly under cpus_read_lock
- Fix null-ptr-deref in rebuild_sched_domains_cpuslocked() when
generate_sched_domains() returns NULL due to kmalloc failure
- Fix incorrect cpuset behavior for effective_xcpus in
partition_xcpus_del() and cpuset_update_tasks_cpumask()
in update_cpumasks_hier()
- Fix race between task migration and cgroup iteration
* tag 'cgroup-for-7.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup/cpuset: fix null-ptr-deref in rebuild_sched_domains_cpuslocked
cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock
cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue
cgroup/cpuset: Move housekeeping_update()/rebuild_sched_domains() together
kselftest/cgroup: Simplify test_cpuset_prs.sh by removing "S+" command
cgroup/cpuset: Set isolated_cpus_updating only if isolated_cpus is changed
cgroup/cpuset: Clarify exclusion rules for cpuset internal variables
cgroup/cpuset: Fix incorrect use of cpuset_update_tasks_cpumask() in update_cpumasks_hier()
cgroup/cpuset: Fix incorrect change to effective_xcpus in partition_xcpus_del()
cgroup: fix race between task migration and iteration
Merge tag 'sched_ext-for-7.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Fix starvation of scx_enable() under fair-class saturation by
offloading the enable path to an RT kthread
- Fix out-of-bounds access in idle mask initialization on systems with
non-contiguous NUMA node IDs
- Fix a preemption window during scheduler exit and a refcount
underflow in cgroup init error path
- Fix SCX_EFLAG_INITIALIZED being a no-op flag
- Add READ_ONCE() annotations for KCSAN-clean lockless accesses and
replace naked scx_root dereferences with container_of() in kobject
callbacks
- Tooling and selftest fixes: compilation issues with clang 17,
strtoul() misuse, unused options cleanup, and Kconfig sync
* tag 'sched_ext-for-7.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Fix starvation of scx_enable() under fair-class saturation
sched_ext: Remove redundant css_put() in scx_cgroup_init()
selftests/sched_ext: Fix peek_dsq.bpf.c compile error for clang 17
selftests/sched_ext: Add -fms-extensions to bpf build flags
tools/sched_ext: Add -fms-extensions to bpf build flags
sched_ext: Use READ_ONCE() for plain reads of scx_watchdog_timeout
sched_ext: Replace naked scx_root dereferences in kobject callbacks
sched_ext: Use READ_ONCE() for the read side of dsq->nr update
tools/sched_ext: fix strtoul() misuse in scx_hotplug_seq()
sched_ext: Fix SCX_EFLAG_INITIALIZED being a no-op flag
sched_ext: Fix out-of-bounds access in scx_idle_init_masks()
sched_ext: Disable preemption between scx_claim_exit() and kicking helper work
tools/sched_ext: Add Kconfig to sync with upstream
tools/sched_ext: Sync README.md Kconfig with upstream scx
selftests/sched_ext: Remove duplicated unistd.h include in rt_stall.c
tools/sched_ext: scx_sdt: Remove unused '-f' option
tools/sched_ext: scx_central: Remove unused '-p' option
selftests/sched_ext: Fix unused-result warning for read()
selftests/sched_ext: Abort test loop on signal
During scx_enable(), the READY -> ENABLED task switching loop changes the
calling thread's sched_class from fair to ext. Since fair has higher
priority than ext, saturating fair-class workloads can indefinitely starve
the enable thread, hanging the system. This was introduced when the enable
path switched from preempt_disable() to scx_bypass() which doesn't protect
against fair-class starvation. Note that the original preempt_disable()
protection wasn't complete either - in partial switch modes, the calling
thread could still be starved after preempt_enable() as it may have been
switched to ext class.
Fix it by offloading the enable body to a dedicated system-wide RT
(SCHED_FIFO) kthread which cannot be starved by either fair or ext class
tasks. scx_enable() lazily creates the kthread on first use and passes the
ops pointer through a struct scx_enable_cmd containing the kthread_work,
then synchronously waits for completion.
The workfn runs on a different kthread from sch->helper (which runs
disable_work), so it can safely flush disable_work on the error path
without deadlock.
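In sketch form, the offload looks roughly like this (struct layout and
names are illustrative; error handling elided):

	struct scx_enable_cmd {
		struct kthread_work	work;
		struct sched_ext_ops	*ops;
		int			ret;
	};

	static struct kthread_worker *scx_enable_helper;

	static int scx_enable_offload(struct sched_ext_ops *ops)
	{
		struct scx_enable_cmd cmd = { .ops = ops };

		if (!scx_enable_helper) {	/* created lazily on first use */
			scx_enable_helper = kthread_create_worker(0, "scx_enable");
			sched_set_fifo(scx_enable_helper->task);
		}

		kthread_init_work(&cmd.work, scx_enable_workfn);
		kthread_queue_work(scx_enable_helper, &cmd.work);
		kthread_flush_work(&cmd.work);	/* wait synchronously */
		return cmd.ret;
	}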
Fixes: 8c2090c504 ("sched_ext: Initialize in bypass mode")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Tejun Heo <tj@kernel.org>
Global subprogs are currently not allowed to return void. Adjust
verifier logic to allow global functions with a void return type.
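For example, a program like the following sketch is now accepted:

	/* global (non-static) subprog with a void return type */
	__noinline void log_event(void)
	{
		bpf_printk("event");
	}

	SEC("tc")
	int prog(struct __sk_buff *skb)
	{
		log_event();
		return 0;
	}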
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260228184759.108145-5-emil@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
From: Eduard Zingerman <eddyz87@gmail.com>
Both main progs and subprogs use the same function in the verifier,
check_return_code, to verify the type and value range of the register
being returned. However, subprogs only need a subset of the logic in
check_return_code. This also goes the other way: check_return_code
explicitly checks whether it is handling a subprogram in multiple places,
complicating the logic. Separate the handling of the two into separate
functions.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260228184759.108145-4-emil@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
From: Eduard Zingerman <eddyz87@gmail.com>
The check_return_code function has explicit checks on whether
a program type can return void. Factor this logic out to reuse
it later for both main progs and subprogs.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260228184759.108145-3-emil@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Factor the return value range calculation logic in check_return_code
out of the function in preparation for separating the return value
validation logic for BPF_EXIT and bpf_throw().
No functional changes. The change made in return_retval_code's handling
of PROG_TRACING program types (not erroring on the default case) is a
no-op because the match on the program attach type is exhaustive.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260228184759.108145-2-emil@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The limitation on fixed offsets stems from the fact that certain program
types rewrite the accesses to the context structure and translate them
to accesses to the real underlying type. Technically, in the past, we
could have stashed the register offset in insn_aux and made rewrites
work, but we've never needed it in the past since the offset for such
context structures easily fit in the s16 signed instruction offset.
Regardless, the consequence is that for program types where the program
type's verifier ops doesn't supply a convert_ctx_access callback, we
unnecessarily reject accesses with a modified ctx pointer (i.e., one
whose offset has been shifted) in check_ptr_off_reg. Make an exception
for such program types (like syscall, tracepoint, raw_tp, etc.).
Pass in fixed_off_ok as true to __check_ptr_off_reg for such cases, and
accumulate the register offset into insn->off passed to check_ctx_access.
In particular, the accumulation is critical since we need to correctly
track the max_ctx_offset which is used for bounds checking the buffer
for syscall programs at runtime.
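Illustratively, a syscall program can now pass a shifted ctx pointer
like this (the struct and subprog names are assumptions):

	SEC("syscall")
	int prog(struct args *ctx)
	{
		/* &ctx->b is the ctx pointer plus a constant offset;
		 * previously rejected in check_ptr_off_reg for prog
		 * types without convert_ctx_access, and the offset now
		 * also feeds max_ctx_offset tracking. */
		return subprog(&ctx->b);
	}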
Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Dan Schatzberg <dschatzberg@meta.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260227005725.1247305-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
cpu_map_bpf_prog_run_xdp() handles XDP_PASS, XDP_REDIRECT, and
XDP_DROP but is missing an XDP_ABORTED case. Without it, XDP_ABORTED
falls into the default case which logs a misleading "invalid XDP
action" warning instead of tracing the abort via trace_xdp_exception().
Add the missing XDP_ABORTED case with trace_xdp_exception(), matching
the handling already present in the skb path (cpu_map_bpf_prog_run_skb),
devmap (dev_map_bpf_prog_run), and the generic XDP path (do_xdp_generic).
Also pass xdpf->dev_rx instead of NULL to bpf_warn_invalid_xdp_action()
in the default case, so the warning includes the actual device name.
This aligns with the generic XDP path in net/core/dev.c which already
passes the real device.
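The resulting switch arm looks roughly like this (drop handling
abbreviated):

	case XDP_ABORTED:
		trace_xdp_exception(xdpf->dev_rx, rcpu->prog, act);
		fallthrough;
	case XDP_DROP:
		xdp_return_frame(xdpf);
		stats->drop++;
		break;
	default:
		bpf_warn_invalid_xdp_action(xdpf->dev_rx, rcpu->prog, act);
		/* ... */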
Signed-off-by: Anand Kumar Shaw <anandkrshawheritage@gmail.com>
Link: https://lore.kernel.org/r/20260218042924.42931-1-anandkrshawheritage@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Implementing BPF version of preempt_count() requires accessing lowcore
from BPF. Since lowcore can be relocated, open-coding
(struct lowcore *)0 does not work, so add a kfunc.
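A sketch of such a kfunc (whether this matches the final kfunc name is
an assumption):

	__bpf_kfunc struct lowcore *bpf_get_lowcore(void)
	{
		/* get_lowcore() resolves the relocated lowcore address */
		return get_lowcore();
	}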
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20260217160813.100855-2-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The iterator css_for_each_descendant_pre() walks the cgroup hierarchy
under cgroup_lock(). It does not increment the reference counts on
yielded css structs.
According to the cgroup documentation, css_put() should only be used
to release a reference obtained via css_get() or css_tryget_online().
Since the iterator does not use either of these to acquire a reference,
calling css_put() in the error path of scx_cgroup_init() causes a
refcount underflow.
Remove the unbalanced css_put() to prevent a potential Use-After-Free
(UAF) vulnerability.
Fixes: 8195136669 ("sched_ext: Add cgroup support")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
scx_watchdog_timeout is written with WRITE_ONCE() in scx_enable():
WRITE_ONCE(scx_watchdog_timeout, timeout);
However, three read-side accesses use plain reads without the matching
READ_ONCE():
/* check_rq_for_timeouts() - L2824 */
last_runnable + scx_watchdog_timeout
/* scx_watchdog_workfn() - L2852 */
scx_watchdog_timeout / 2
/* scx_enable() - L5179 */
scx_watchdog_timeout / 2
The KCSAN documentation requires that if one accessor uses WRITE_ONCE()
to annotate lock-free access, all other accesses must also use the
appropriate accessor. Plain reads alongside WRITE_ONCE() leave the pair
incomplete and can trigger KCSAN warnings.
Note that scx_tick() already uses the correct READ_ONCE() annotation:
last_check + READ_ONCE(scx_watchdog_timeout)
Fix the three remaining plain reads to match, making all accesses to
scx_watchdog_timeout consistently annotated and KCSAN-clean.
Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Static variables are automatically initialized to 0 by the compiler.
Remove the redundant explicit assignments in kernel/audit.c to clean
up the code, align with standard kernel coding style, and fix the
following checkpatch.pl errors:
./scripts/checkpatch.pl kernel/audit.c | grep -A2 ERROR:
ERROR: do not initialise statics to 0
+ static unsigned long last_check = 0;
--
ERROR: do not initialise statics to 0
+ static int messages = 0;
--
ERROR: do not initialise statics to 0
+ static unsigned long last_msg = 0;
Signed-off-by: Ricardo Robaina <rrobaina@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>