Commit Graph

51747 Commits

Linus Torvalds
2b4d0215be 20 hotfixes. All are for MM (and for MMish maintainers). 9 are cc:stable
and the remainder are for post-7.0 issues or aren't deemed suitable for
 backporting.
 
 There's a 2 patch DAMON series from SeongJae Park which address races
 which could lead to use-after-free errors.  And a 3 patch DAMON series
 which avoids the possibility of presenting stale parameter values to
 users.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCafPaVQAKCRDdBJ7gKXxA
 jiUxAQCceUQi6IqADUMhYAsbGcs1LoeMWfiMfbCz2NCoiTXAEwD/S2uqSELRPQQc
 7iW6D7U6dTa3d2kkbnxC02ocekaxiQ4=
 =M/FI
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2026-04-30-15-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM fixes from Andrew Morton:
 "20 hotfixes. All are for MM (and for MMish maintainers). 9 are
  cc:stable and the remainder are for post-7.0 issues or aren't deemed
  suitable for backporting.

  There are two DAMON series from SeongJae Park which address races
  which could lead to use-after-free errors, and avoid the possibility
  of presenting stale parameter values to users"

* tag 'mm-hotfixes-stable-2026-04-30-15-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm: memcontrol: fix rcu unbalance in get_non_dying_memcg_end()
  mm/userfaultfd: detect VMA type change after copy retry in mfill_copy_folio_retry()
  MAINTAINERS: remove stale kdump project URL
  mm/damon/stat: detect and use fresh enabled value
  mm/damon/lru_sort: detect and use fresh enabled and kdamond_pid values
  mm/damon/reclaim: detect and use fresh enabled and kdamond_pid values
  selftests/mm: specify requirement for PROC_MEM_ALWAYS_FORCE=y
  mm/damon/sysfs-schemes: protect path kfree() with damon_sysfs_lock
  mm/damon/sysfs-schemes: protect memcg_path kfree() with damon_sysfs_lock
  MAINTAINERS: update Li Wang's email address
  MAINTAINERS, mailmap: update email address for Qi Zheng
  MAINTAINERS: update Liam's email address
  mm/hugetlb_cma: round up per_node before logging it
  MAINTAINERS: fix regex pattern in CORE MM category
  mm/vma: do not try to unmap a VMA if mmap_prepare() invoked from mmap()
  mm: start background writeback based on per-wb threshold for strictlimit BDIs
  kho: fix error handling in kho_add_subtree()
  liveupdate: fix return value on session allocation failure
  mailmap: update entry for Dan Carpenter
  vmalloc: fix buffer overflow in vrealloc_node_align()
2026-05-01 08:45:23 -07:00
Linus Torvalds
e75a43c7ce tracing fixes for v7.1:
- Fix inverted check of registering the stats for branch tracing
 
   When calling register_stat_tracer() which returns zero on success and
   negative on error, the callers were checking the return of zero as an
   error and printing a warning message. Because this was just a normal
   printk() message and not a WARN(), it wasn't caught in any testing.
 
   Fix the check to print the warning message when an error actually happens.
 
 - Fix a typo in a comment in tracepoint.h
 
 - Limit the size of event probes to 3K in size
 
   It is possible to create a dynamic event probe via the tracefs system that
   is greater than the max size of an event that the ring buffer can hold.
   This basically causes the event to become useless. Limit the size of an
   event probe to be 3K as that should be large enough to handle any dynamic
   events being created, and fits within the PAGE_SIZE sub-buffers of the
   ring buffer.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCafJn9hQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6quUaAP9ezC2VXWugRIhOEmK0pIpY90zddsyv
 xYTa3XuPP2r5JQEA8DcmH6fTo7JJxzbbyWPot/EgaqMt5XguqKC7txhjfww=
 =+e7q
 -----END PGP SIGNATURE-----

Merge tag 'trace-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix inverted check of registering the stats for branch tracing

   When calling register_stat_tracer() which returns zero on success and
   negative on error, the callers were checking the return of zero as an
   error and printing a warning message. Because this was just a normal
   printk() message and not a WARN(), it wasn't caught in any testing.

   Fix the check to print the warning message when an error actually
   happens.

 - Fix a typo in a comment in tracepoint.h

 - Limit the size of event probes to 3K in size

   It is possible to create a dynamic event probe via the tracefs system
   that is greater than the max size of an event that the ring buffer
   can hold. This basically causes the event to become useless.

   Limit the size of an event probe to be 3K as that should be large
   enough to handle any dynamic events being created, and fits within
   the PAGE_SIZE sub-buffers of the ring buffer.

* tag 'trace-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/probes: Limit size of event probe to 3K
  tracepoint: Fix typo in tracepoint.h comment
  tracing: branch: Fix inverted check on stat tracer registration
2026-04-29 22:21:44 -07:00
Steven Rostedt
b2aa3b4d64 tracing/probes: Limit size of event probe to 3K
There currently isn't a limit on how large an event probe can be. One
could make an event greater than PAGE_SIZE, which makes the event
useless: if it's bigger than the largest event that can be recorded into
the ring buffer, it will never be recorded.

An event probe should never need to be greater than 3K, so make that the
max size. As long as the max is less than the largest event that can be
recorded into the ring buffer, it should be fine.

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Fixes: 93ccae7a22 ("tracing/kprobes: Support basic types on dynamic events")
Link: https://patch.msgid.link/20260428122302.706610ba@gandalf.local.home
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2026-04-29 16:07:38 -04:00
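The clamp the commit above describes reduces to a single size check. A minimal C sketch follows; the 3K constant matches the commit, but the helper name and the bare `-22` (`-EINVAL`) return are illustrative stand-ins, not the actual tracing/probes code:

```c
#include <stddef.h>

/* Illustrative sketch, not kernel code: the 3K cap from the commit. */
#define EPROBE_MAX_SIZE (3 * 1024)

/* Return 0 if an event probe of @size can ever be recorded, negative if
 * it would exceed the ring buffer's PAGE_SIZE sub-buffers. */
int eprobe_check_size(size_t size)
{
	if (size > EPROBE_MAX_SIZE)
		return -22;	/* -EINVAL: the event would never fit */
	return 0;
}
```

A probe at exactly 3K passes; anything larger is rejected at creation time instead of silently never being recorded.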
Linus Torvalds
664f0f6be3 sched_ext: Fixes for v7.1-rc1
The merge window pulled in the cgroup sub-scheduler infrastructure, and
 new AI reviews are accelerating bug reporting and fixing - hence the
 larger than usual fixes batch.
 
 - Use-after-frees during scheduler load/unload. The disable path
   could free the BPF scheduler while deferred irq_work / kthread work
   was still in flight; cgroup setter callbacks read the active
   scheduler outside the rwsem that synchronizes against teardown.
   Fixed both, and reused the disable drain in the enable error paths
   so the BPF JIT page can't be freed under live callbacks.
 
 - Several BPF op invocations didn't tell the framework which runqueue
   was already locked, so helper kfuncs that re-acquire the runqueue
   by CPU could deadlock on the held lock. Fixed at the affected
   callsites, including recursive parent-into-child dispatch.
 
 - The hardlockup notifier ran from NMI but eventually took a
   non-NMI-safe lock. Bounced through irq_work.
 
 - A handful of bugs in the new sub-scheduler hierarchy: helper
   kfuncs hard-coded the root instead of resolving the caller's
   scheduler; the enable error path tried to disable per-task state
   that had never been initialized, and leaked cpus_read_lock on the
   way out; a sysfs object was leaked on every load/unload; the
   dispatch fast-path used the root scheduler instead of the task's;
   and a couple of CONFIG #ifdef guards were misclassified.
 
 - Verifier-time hardening: BPF programs of unrelated struct_ops
   types (e.g. tcp_congestion_ops) could call sched_ext kfuncs - a
   semantic bug and, once sub-sched was enabled, a KASAN
   out-of-bounds read. Now rejected at load. Plus a few NULL and
   cross-task argument checks on sched_ext kfuncs, and a selftest
   covering the new deny.
 
 - rhashtable (Herbert): restored the insecure_elasticity toggle and
   bounced the deferred-resize kick through irq_work to break a
   lock-order cycle observable from raw-spinlock callers. sched_ext's
   scheduler-instance hash is the first user of both.
 
 - The bypass-mode load balancer used file-scope cpumasks; with
   multiple scheduler instances now possible, those raced. Moved
   per-instance, plus a follow-up to skip tasks whose recorded CPU is
   stale relative to the new owning runqueue.
 
 - Smaller fixes: a dispatch queue's first-task tracking misbehaved
   when a parked iterator cursor sat in the list; the runqueue's
   next-class wasn't promoted on local-queue enqueue, leaving an SCX
   task behind RT in edge cases; the reference qmap scheduler stopped
   erroring on legitimate cross-scheduler task-storage misses.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCafEN/A4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGaydAQCxWrUqnZXhxF4LnztjxTF2tgv7p8P7TbpS6aU6
 etqRpAEA9RFmIXs7XrhwCm0n2BwSgjvrNxnWfPhWvuH0uN0GTAA=
 =wLna
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:
 "The merge window pulled in the cgroup sub-scheduler infrastructure,
  and new AI reviews are accelerating bug reporting and fixing - hence
  the larger than usual fixes batch:

   - Use-after-frees during scheduler load/unload:
       - The disable path could free the BPF scheduler while deferred
         irq_work / kthread work was still in flight
       - cgroup setter callbacks read the active scheduler outside the
         rwsem that synchronizes against teardown
     Fix both, and reuse the disable drain in the enable error paths so
     the BPF JIT page can't be freed under live callbacks.

   - Several BPF op invocations didn't tell the framework which runqueue
     was already locked, so helper kfuncs that re-acquire the runqueue
     by CPU could deadlock on the held lock

     Fix the affected callsites, including recursive parent-into-child
     dispatch.

   - The hardlockup notifier ran from NMI but eventually took a
     non-NMI-safe lock. Bounce it through irq_work.

   - A handful of bugs in the new sub-scheduler hierarchy:
       - helper kfuncs hard-coded the root instead of resolving the
         caller's scheduler
       - the enable error path tried to disable per-task state that had
         never been initialized, and leaked cpus_read_lock on the way
         out
       - a sysfs object was leaked on every load/unload
       - the dispatch fast-path used the root scheduler instead of the
         task's
       - a couple of CONFIG #ifdef guards were misclassified

   - Verifier-time hardening: BPF programs of unrelated struct_ops types
     (e.g. tcp_congestion_ops) could call sched_ext kfuncs - a semantic
     bug and, once sub-sched was enabled, a KASAN out-of-bounds read.
     Now rejected at load. Plus a few NULL and cross-task argument
     checks on sched_ext kfuncs, and a selftest covering the new deny.

   - rhashtable (Herbert): restore the insecure_elasticity toggle and
     bounce the deferred-resize kick through irq_work to break a
     lock-order cycle observable from raw-spinlock callers. sched_ext's
     scheduler-instance hash is the first user of both.

   - The bypass-mode load balancer used file-scope cpumasks; with
     multiple scheduler instances now possible, those raced. Move to
     per-instance cpumasks, plus a follow-up to skip tasks whose
     recorded CPU is stale relative to the new owning runqueue.

   - Smaller fixes:
       - a dispatch queue's first-task tracking misbehaved when a parked
         iterator cursor sat in the list
       - the runqueue's next-class wasn't promoted on local-queue
         enqueue, leaving an SCX task behind RT in edge cases
       - the reference qmap scheduler stopped erroring on legitimate
         cross-scheduler task-storage misses"

* tag 'sched_ext-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (26 commits)
  sched_ext: Fix scx_flush_disable_work() UAF race
  sched_ext: Call wakeup_preempt() in local_dsq_post_enq()
  sched_ext: Release cpus_read_lock on scx_link_sched() failure in root enable
  sched_ext: Reject NULL-sch callers in scx_bpf_task_set_slice/dsq_vtime
  sched_ext: Refuse cross-task select_cpu_from_kfunc calls
  sched_ext: Align cgroup #ifdef guards with SUB_SCHED vs GROUP_SCHED
  sched_ext: Make bypass LB cpumasks per-scheduler
  sched_ext: Pass held rq to SCX_CALL_OP() for core_sched_before
  sched_ext: Pass held rq to SCX_CALL_OP() for dump_cpu/dump_task
  sched_ext: Save and restore scx_locked_rq across SCX_CALL_OP
  sched_ext: Use dsq->first_task instead of list_empty() in dispatch_enqueue() FIFO-tail
  sched_ext: Resolve caller's scheduler in scx_bpf_destroy_dsq() / scx_bpf_dsq_nr_queued()
  sched_ext: Read scx_root under scx_cgroup_ops_rwsem in cgroup setters
  sched_ext: Don't disable tasks in scx_sub_enable_workfn() abort path
  sched_ext: Skip tasks with stale task_rq in bypass_lb_cpu()
  sched_ext: Guard scx_dsq_move() against NULL kit->dsq after failed iter_new
  sched_ext: Unregister sub_kset on scheduler disable
  sched_ext: Defer scx_hardlockup() out of NMI
  sched_ext: sync disable_irq_work in bpf_scx_unreg()
  sched_ext: Fix local_dsq_post_enq() to use task's scheduler in sub-sched
  ...
2026-04-28 16:26:11 -07:00
Breno Leitao
3b75dd76e6 tracing: branch: Fix inverted check on stat tracer registration
init_annotated_branch_stats() and all_annotated_branch_stats() check the
return value of register_stat_tracer() with "if (!ret)", but
register_stat_tracer() returns 0 on success and a negative errno on
failure. The inverted check causes the warning to be printed on every
successful registration, e.g.:

  Warning: could not register annotated branches stats

while leaving real failures silent. The initcall also returned a
hard-coded 1 instead of the actual error.

Invert the check and propagate ret so that the warning fires on real
errors and the initcall reports the correct status.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: https://patch.msgid.link/20260420-tracing-v1-1-d8f4cd0d6af1@debian.org
Fixes: 002bb86d8d ("tracing/ftrace: separate events tracing and stats tracing engine")
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2026-04-28 14:28:29 -04:00
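The inverted check fixed above comes down to one negation plus a propagated return value. A hedged C sketch, with a stub standing in for register_stat_tracer() (all names and the `-22` errno are illustrative, not the kernel functions):

```c
#include <stdio.h>

/* Stand-in for register_stat_tracer(): 0 on success, negative errno on
 * failure. Illustrative only. */
int register_stat_tracer_stub(int fail)
{
	return fail ? -22 : 0;	/* -EINVAL on failure */
}

/* Buggy pattern: "if (!ret)" warns on every *success* and hides real
 * failures; the initcall also returns a hard-coded 1. */
int init_stats_buggy(int fail)
{
	int ret = register_stat_tracer_stub(fail);

	if (!ret)
		fprintf(stderr, "Warning: could not register stats\n");
	return 1;	/* swallows the real status */
}

/* Fixed pattern: warn only on error and propagate ret. */
int init_stats_fixed(int fail)
{
	int ret = register_stat_tracer_stub(fail);

	if (ret)
		fprintf(stderr, "Warning: could not register stats\n");
	return ret;
}
```

Because the bogus warning was a plain printk() rather than a WARN(), nothing in testing tripped on it, which is how the inversion survived.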
Cheng-Yang Chou
d99f7a32f0 sched_ext: Fix scx_flush_disable_work() UAF race
scx_flush_disable_work() calls irq_work_sync() followed by
kthread_flush_work() to ensure that the disable kthread work has
fully completed before bpf_scx_unreg() frees the SCX scheduler.

However, a concurrent scx_vexit() (e.g., triggered by a watchdog stall)
creates a race window between scx_claim_exit() and irq_work_queue():

  CPU A (scx_vexit (watchdog))        CPU B (bpf_scx_unreg)
  ----                                ----
  scx_claim_exit()
    atomic_try_cmpxchg(NONE->kind)
  stack_trace_save()
  vscnprintf()
                                      scx_disable()
                                        scx_claim_exit() -> FAIL
                                      scx_flush_disable_work()
                                        irq_work_sync()      // no-op: not queued yet
                                        kthread_flush_work() // no-op: not queued yet
                                      kobject_put(&sch->kobj) -> free %sch
  irq_work_queue() -> UAF on %sch
  scx_disable_irq_workfn()
    kthread_queue_work() -> UAF

The root cause is that CPU B's scx_flush_disable_work() returns after
syncing an irq_work that has not yet been queued, while CPU A is still
executing the code between scx_claim_exit() and irq_work_queue().

Loop until exit_kind reaches SCX_EXIT_DONE or SCX_EXIT_NONE, draining
disable_irq_work and disable_work in each pass. This ensures that any
work queued after the previous check is caught, while also correctly
handling cases where no disable was triggered (e.g., the
scx_sub_enable_workfn() abort path).

Fixes: 510a270554 ("sched_ext: sync disable_irq_work in bpf_scx_unreg()")
Reported-by: https://sashiko.dev/#/patchset/20260424100221.32407-1-icheng%40nvidia.com
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-28 07:40:03 -10:00
Kuba Piecuch
163f8b7f9a sched_ext: Call wakeup_preempt() in local_dsq_post_enq()
There are several edge cases (see linked thread) where an IMMED task
can be left lingering on a local DSQ if an RT task swoops in at the
wrong time. All of these edge cases are due to rq->next_class being idle
even after dispatching a task to rq's local DSQ. We should bump
rq->next_class to &ext_sched_class as soon as we've inserted a task into
the local DSQ.

To optimize the common case of rq->next_class == &ext_sched_class,
only call wakeup_preempt() if rq->next_class is below EXT. If next_class
is EXT or above, wakeup_preempt() is a no-op anyway.

This lets us also simplify the preempt_curr() logic a bit since
wakeup_preempt() will call preempt_curr() for us if next_class is
below EXT.

Link: https://lore.kernel.org/all/DHZPHUFXB4N3.2RY28MUEWBNYK@google.com/
Signed-off-by: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-28 06:28:48 -10:00
Linus Torvalds
3b3bea6d4b cgroup: Fixes for v7.1-rc1
- Fix UAF race in psi pressure_write() against cgroup file release by
   extending cgroup_mutex coverage and ordering of->priv access after
   cgroup_kn_lock_live().
 
 - Fix integer overflow in rdmacg_try_charge() when usage equals INT_MAX
   by performing the increment in s64.
 
 - Fix asymmetric DL bandwidth accounting on cpuset attach rollback by
   recording the CPU used by dl_bw_alloc() so cancel_attach() returns
   the reservation to the same root domain.
 
 - Fix nr_dying_subsys_* race that briefly showed 0 in cgroup.stat after
   rmdir by incrementing from kill_css() instead of offline_css().
 
 - Typo fix in cgroup-v2 documentation.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCae+xjw4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGaIUAQD2hJ7ELRDXAtXzL1Ck1zH8vESvbX8syFfkSO6L
 IgtovQEA4Tk7/RIO3HfBxBjgp6Q5vo7C7Biz4ye7fCu/ry7x3Qk=
 =pypQ
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fixes from Tejun Heo:

 - Fix UAF race in psi pressure_write() against cgroup file release by
   extending cgroup_mutex coverage and ordering of->priv access after
   cgroup_kn_lock_live()

 - Fix integer overflow in rdmacg_try_charge() when usage equals INT_MAX
   by performing the increment in s64

 - Fix asymmetric DL bandwidth accounting on cpuset attach rollback by
   recording the CPU used by dl_bw_alloc() so cancel_attach() returns
   the reservation to the same root domain

 - Fix nr_dying_subsys_* race that briefly showed 0 in cgroup.stat after
   rmdir by incrementing from kill_css() instead of offline_css()

 - Typo fix in cgroup-v2 documentation

* tag 'cgroup-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  docs: cgroup: fix typo 'protetion' -> 'protection'
  cgroup: Increment nr_dying_subsys_* from rmdir context
  cgroup/cpuset: record DL BW alloc CPU for attach rollback
  cgroup/rdma: fix integer overflow in rdmacg_try_charge()
  sched/psi: fix race between file release and pressure write
2026-04-27 16:51:27 -07:00
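The rdmacg overflow fix above amounts to widening the arithmetic before the limit check: a 32-bit counter sitting at INT_MAX would wrap on increment (undefined behavior for signed int), letting the charge slip past the limit. A hedged C sketch of the fixed shape, with illustrative names (the kernel uses s64; int64_t is the userspace equivalent):

```c
#include <stdint.h>
#include <limits.h>

/* Illustrative sketch of the fix: do the increment in 64 bits so a
 * 32-bit usage counter at INT_MAX cannot wrap before the comparison.
 * Returns 1 if the charge must be refused. */
int charge_would_overflow(int usage, int max)
{
	int64_t next = (int64_t)usage + 1;	/* cannot wrap in 64 bits */

	return next > (int64_t)max;
}
```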
Breno Leitao
9ec9532989 kho: fix error handling in kho_add_subtree()
Fix two error handling issues in kho_add_subtree():

1. If fdt_setprop() fails after the subnode has been created, the
   subnode is not removed. This leaves an incomplete node in the FDT
   (missing "preserved-data" or "blob-size" properties).

2. The fdt_setprop() return value (an FDT error code) is stored
   directly in err and returned to the caller, which expects -errno.

Fix both by storing fdt_setprop() results in fdt_err, jumping to a new
out_del_node label that removes the subnode on failure, and only setting
err = 0 on the success path, otherwise returning -ENOMEM (instead of
FDT_ERR_ errors that would come from fdt_setprop).

No user-visible changes.  This patch fixes error handling in the KHO
(Kexec HandOver) subsystem, which is used to preserve data across kexec
reboots.  The fix only affects a rare failure path during kexec
preparation — specifically when the kernel runs out of space in the
Flattened Device Tree buffer while registering preserved memory regions.

In the unlikely event that this error path was triggered, the old code
would leave a malformed node in the device tree and return an incorrect
error code to the calling subsystem, which could lead to confusing log
messages or incorrect recovery decisions.  With this fix, the incomplete
node is properly cleaned up and the appropriate errno value is
propagated; this error code is not returned to the user.

Link: https://lore.kernel.org/20260410-kho_fix_send-v2-1-1b4debf7ee08@debian.org
Fixes: 3dc92c3114 ("kexec: add Kexec HandOver (KHO) generation helpers")
Signed-off-by: Breno Leitao <leitao@debian.org>
Suggested-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Breno Leitao <leitao@debian.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-27 05:54:23 -07:00
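The cleanup pattern the commit above describes can be sketched in a few lines: keep the raw FDT error in its own variable, delete the half-built subnode on failure, and hand -errno (never an FDT_ERR_* code) back to the caller. The fdt_* stubs below are hypothetical stand-ins for libfdt, not the real API:

```c
#include <errno.h>

/* Hypothetical libfdt stand-ins: negative return = raw FDT_ERR_* code.
 * 'fail_at' selects which fdt_setprop() call fails (0 = none). */
static int fail_at, call_no;

static int fdt_setprop_stub(void)
{
	return (++call_no == fail_at) ? -1 : 0;
}

static int fdt_del_node_stub(void) { return 0; }

/* Sketch of the fixed error path, loosely modeled on the commit. */
int kho_add_subtree_sketch(int fail)
{
	int fdt_err;

	call_no = 0;
	fail_at = fail;

	/* the subnode has already been created at this point */
	fdt_err = fdt_setprop_stub();		/* "preserved-data" */
	if (fdt_err)
		goto out_del_node;
	fdt_err = fdt_setprop_stub();		/* "blob-size" */
	if (fdt_err)
		goto out_del_node;

	return 0;				/* err = 0 only on success */

out_del_node:
	fdt_del_node_stub();	/* don't leave an incomplete node in the FDT */
	return -ENOMEM;		/* callers expect -errno, not FDT_ERR_* */
}
```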
Pasha Tatashin
0562b572ce liveupdate: fix return value on session allocation failure
When session allocation fails during deserialization, the global 'err'
variable was not updated before returning.  This caused subsequent calls
to luo_session_deserialize() to incorrectly report success.

Ensure 'err' is set to the error code from PTR_ERR(session).  This ensures
that an error is correctly returned to userspace when it attempts to open
/dev/liveupdate in the new kernel if deserialization failed.

Link: https://lore.kernel.org/20260415193738.515491-1-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Cc: David Matlack <dmatlack@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-27 05:54:23 -07:00
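The bug fixed above is a classic stale-status pattern: the outcome lives in a module-scope 'err' that later opens of the device consult, and the allocation-failure path forgot to record the failure. A minimal C sketch under illustrative names (the real code goes through ERR_PTR/PTR_ERR; this models only the bookkeeping):

```c
#include <errno.h>
#include <stddef.h>

/* Illustrative module-scope status; 0 = deserialization succeeded. */
static int g_err;
static int session_buf;		/* stand-in for a real allocation */

int luo_session_deserialize_sketch(int alloc_fails)
{
	void *session = alloc_fails ? NULL : &session_buf;

	if (!session) {
		g_err = -ENOMEM;	/* the fix: record the failure */
		return g_err;
	}
	return 0;
}

/* What a later open of /dev/liveupdate would consult. Without the fix,
 * g_err would still read 0 here after a failed deserialization. */
int deserialize_status(void)
{
	return g_err;
}
```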
Tejun Heo
deb7b2f93d sched_ext: Release cpus_read_lock on scx_link_sched() failure in root enable
scx_root_enable_workfn() takes cpus_read_lock() before
scx_link_sched(sch), but the `if (ret) goto err_disable` on failure
skips the matching cpus_read_unlock() - all other err_disable gotos
along this path drop the lock first.

scx_link_sched() only returns non-zero on the sub-sched path
(parent != NULL), so the leak path is unreachable via the root
caller today. Still, the unwind is out of line with the surrounding
paths.

Drop cpus_read_lock() before goto err_disable.

v2: Correct Fixes: tag (Andrea Righi).

Fixes: 25037af712 ("sched_ext: Add rhashtable lookup for sub-schedulers")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-24 14:31:36 -10:00
Tejun Heo
05b4a9a9bc sched_ext: Reject NULL-sch callers in scx_bpf_task_set_slice/dsq_vtime
scx_prog_sched(aux) returns NULL for TRACING / SYSCALL BPF progs that
have no struct_ops association when the root scheduler has sub_attach
set. scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime() pass
that NULL into scx_task_on_sched(sch, p), which under
CONFIG_EXT_SUB_SCHED is rcu_access_pointer(p->scx.sched) == sch. For
any non-scx task p->scx.sched is NULL, so NULL == NULL returns true
and the authority gate is bypassed - a privileged but
non-struct_ops-associated prog can poke p->scx.slice /
p->scx.dsq_vtime on arbitrary tasks.

Reject !sch up front so the gate only admits callers with a resolved
scheduler.

Fixes: 245d09c594 ("sched_ext: Enforce scheduler ownership when updating slice and dsq_vtime")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:36 -10:00
Tejun Heo
ea7c716a24 sched_ext: Refuse cross-task select_cpu_from_kfunc calls
select_cpu_from_kfunc() skipped pi_lock for @p when called from
ops.select_cpu() or another rq-locked SCX op, assuming the held lock
protects @p. scx_bpf_select_cpu_dfl() / __scx_bpf_select_cpu_and() accept an
arbitrary KF_RCU task_struct, so a caller in e.g. ops.select_cpu(p1) or
ops.enqueue(p1) can pass some other p2 - the held pi_lock / rq lock is p1's,
not p2's - and reading p2->cpus_ptr / nr_cpus_allowed races with
set_cpus_allowed_ptr() and migrate_disable_switch() on another CPU.

Abort the scheduler on cross-task calls in both branches: for
ops.select_cpu() use scx_kf_arg_task_ok() to verify @p is the wake-up
task recorded in current->scx.kf_tasks[] by SCX_CALL_OP_TASK_RET();
for other rq-locked SCX ops compare task_rq(p) against scx_locked_rq().

v2: Switch the in_select_cpu cross-task check from direct_dispatch_task
    comparison to scx_kf_arg_task_ok(). The former spuriously rejects when
    ops.select_cpu() calls scx_bpf_dsq_insert() first, then calls
    scx_bpf_select_cpu_*() on the same task. (Andrea Righi)

Fixes: 0022b32850 ("sched_ext: Decouple kfunc unlocked-context check from kf_mask")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:36 -10:00
Tejun Heo
c0e8ddc76d sched_ext: Align cgroup #ifdef guards with SUB_SCHED vs GROUP_SCHED
Two EXT_GROUP_SCHED/SUB_SCHED guards are misclassified:

- scx_root_enable_workfn()'s cgroup_get(cgrp) and the err_put_cgrp unwind
  in scx_alloc_and_add_sched() are under `#if GROUP || SUB`, but the
  matching cgroup_put() in scx_sched_free_rcu_work() is inside `#ifdef SUB`
  only (via sch->cgrp, stored only under SUB). GROUP-only would leak a
  reference on every root-sched enable.

- sch_cgroup() / set_cgroup_sched() live under `#if GROUP || SUB` but touch
  SUB-only fields (sch->cgrp, cgroup->scx_sched). GROUP-only wouldn't
  compile.

GROUP needs CGROUP_SCHED; SUB needs only CGROUPS. CGROUPS=y/CGROUP_SCHED=n
gives the reachable GROUP=n, SUB=y combination; GROUP=y, SUB=n isn't
reachable today (SUB is def_bool y under CGROUPS). Neither miscategorization
triggers a real bug in any reachable config, but keep the guards honest:

- Narrow cgroup_get and err_put_cgrp to `#ifdef SUB` (matches the free-side
  put).
- Move sch_cgroup() and set_cgroup_sched() to a separate `#ifdef SUB` block
  with no-op stubs for the !SUB case; keep root_cgroup() and scx_cgroup_{
  lock,unlock}() under `#if GROUP || SUB` since those only need cgroup core.

Fixes: ebeca1f930 ("sched_ext: Introduce cgroup sub-sched support")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:36 -10:00
Tejun Heo
d292aa00de sched_ext: Make bypass LB cpumasks per-scheduler
scx_bypass_lb_{donee,resched}_cpumask were file-scope statics shared by all
scheduler instances. With CONFIG_EXT_SUB_SCHED, multiple sched instances
each arm their own bypass_lb_timer; concurrent bypass_lb_node() calls RMW
the global cpumasks with no lock, corrupting donee/resched decisions.

Move the cpumasks into struct scx_sched, allocate them alongside the timer
in scx_alloc_and_add_sched(), free them in scx_sched_free_rcu_work().

Fixes: 95d1df610c ("sched_ext: Implement load balancer for bypass mode")
Cc: stable@vger.kernel.org # v6.19+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:36 -10:00
Tejun Heo
4155fb489f sched_ext: Pass held rq to SCX_CALL_OP() for core_sched_before
scx_prio_less() runs from core-sched's pick_next_task() path with rq
locked but invokes ops.core_sched_before() with NULL locked_rq, leaving
scx_locked_rq_state NULL. If the BPF callback calls a kfunc that
re-acquires rq based on scx_locked_rq() - e.g. scx_bpf_cpuperf_set(cpu)
- it re-acquires the already-held rq.

Pass task_rq(a).

Fixes: 7b0888b7cc ("sched_ext: Implement core-sched support")
Cc: stable@vger.kernel.org # v6.12+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:36 -10:00
Tejun Heo
207d76a372 sched_ext: Pass held rq to SCX_CALL_OP() for dump_cpu/dump_task
scx_dump_state() walks CPUs with rq_lock_irqsave() held and invokes
ops.dump_cpu / ops.dump_task with NULL locked_rq, leaving
scx_locked_rq_state NULL. If the BPF callback calls a kfunc that
re-acquires rq based on scx_locked_rq() - e.g. scx_bpf_cpuperf_set(cpu)
- it re-acquires the already-held rq.

Pass the held rq to SCX_CALL_OP(). Thread it into scx_dump_task() too.
The pre-loop ops.dump call runs before rq_lock_irqsave() so keeps
rq=NULL.

Fixes: 07814a9439 ("sched_ext: Print debug dump after an error exit")
Cc: stable@vger.kernel.org # v6.12+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:36 -10:00
Tejun Heo
7fb39e4eb4 sched_ext: Save and restore scx_locked_rq across SCX_CALL_OP
SCX_CALL_OP{,_RET}() unconditionally clears scx_locked_rq_state to NULL on
exit. Correct at the top level, but ops can recurse via
scx_bpf_sub_dispatch(): a parent's ops.dispatch calls the helper, which
invokes the child's ops.dispatch under another SCX_CALL_OP. When the inner
call returns, the NULL clobbers the outer's state. The parent's BPF then
calls kfuncs like scx_bpf_cpuperf_set() which read scx_locked_rq()==NULL and
re-acquire the already-held rq.

Snapshot scx_locked_rq_state on entry and restore on exit. Rename the rq
parameter to locked_rq across all SCX_CALL_OP* macros so the snapshot local
can be typed as 'struct rq *' without colliding with the parameter token in
the expansion. SCX_CALL_OP_TASK{,_RET}() and SCX_CALL_OP_2TASKS_RET() funnel
through the two base macros and inherit the fix.

Fixes: 4f8b122848 ("sched_ext: Add basic building blocks for nested sub-scheduler dispatching")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:36 -10:00
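The save/restore discipline in the commit above can be modeled with plain functions: snapshot the global "which rq is locked" state on entry and restore it on exit, so a nested op invocation cannot clear the outer one. This is a hedged sketch of the pattern, not the SCX_CALL_OP macro itself; all names are illustrative:

```c
#include <stddef.h>

struct rq { int cpu; };

static struct rq *locked_rq;	/* models scx_locked_rq_state */

void call_op(struct rq *rq, void (*op)(void))
{
	struct rq *saved = locked_rq;	/* snapshot on entry */

	locked_rq = rq;
	op();				/* may recurse into call_op() */
	locked_rq = saved;		/* restore instead of clearing to NULL */
}

static struct rq outer_rq, inner_rq;
static struct rq *seen_after_nested;

static void inner_op(void) { /* leaf op: nothing to do */ }

static void outer_op(void)
{
	call_op(&inner_rq, inner_op);	/* nested sub-scheduler dispatch */
	seen_after_nested = locked_rq;	/* must still be &outer_rq */
}

/* Returns 1 if the outer rq survived the nested call and the state is
 * clean again afterwards; the old unconditional-NULL behavior fails
 * the first condition. */
int demo_nested_restore(void)
{
	call_op(&outer_rq, outer_op);
	return seen_after_nested == &outer_rq && locked_rq == NULL;
}
```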
Tejun Heo
2f2ea77092 sched_ext: Use dsq->first_task instead of list_empty() in dispatch_enqueue() FIFO-tail
dispatch_enqueue()'s FIFO-tail path used list_empty(&dsq->list) to decide
whether to set dsq->first_task on enqueue. dsq->list can contain parked BPF
iterator cursors (SCX_DSQ_LNODE_ITER_CURSOR), so list_empty() is not a
reliable "no real task" check. If the last real task is unlinked while a
cursor is parked, first_task becomes NULL; the next FIFO-tail enqueue then
sees list_empty() == false and skips the first_task update, leaving
scx_bpf_dsq_peek() returning NULL for a non-empty DSQ.

Test dsq->first_task directly, which already tracks only real tasks and is
maintained under dsq->lock.

Fixes: 44f5c8ec5b ("sched_ext: Add lockless peek operation for DSQs")
Cc: stable@vger.kernel.org # v6.19+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Cc: Ryan Newton <newton@meta.com>
2026-04-24 14:31:35 -10:00
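The first_task-vs-list_empty() distinction above can be shown with a tiny model: the list may hold parked iterator cursors as well as real tasks, so "list is non-empty" is not the same as "a task is queued". The structures below are illustrative, not the kernel's DSQ:

```c
#include <stddef.h>

struct node { struct node *next; int is_cursor; };

struct dsq {
	struct node *head;		/* tasks and parked cursors */
	struct node *first_task;	/* real tasks only */
};

void enqueue_tail(struct dsq *q, struct node *n)
{
	struct node **p = &q->head;

	while (*p)
		p = &(*p)->next;
	n->next = NULL;
	*p = n;
	/* The fix: key the update off first_task, not list_empty(). */
	if (!n->is_cursor && !q->first_task)
		q->first_task = n;
}

/* Scenario from the commit: a cursor is parked, the last real task is
 * gone, then a new task arrives. A list_empty() check would see a
 * non-empty list and skip the first_task update, leaving the peek
 * returning NULL; keying off first_task does not. */
int demo_peek_with_parked_cursor(void)
{
	struct node cursor = { .next = NULL, .is_cursor = 1 };
	struct node task = { .next = NULL, .is_cursor = 0 };
	struct dsq q = { NULL, NULL };

	enqueue_tail(&q, &cursor);	/* non-empty, but no real task */
	if (q.first_task != NULL)
		return 0;
	enqueue_tail(&q, &task);
	return q.first_task == &task;	/* peek now finds the task */
}
```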
Tejun Heo
cc2a387d33 sched_ext: Resolve caller's scheduler in scx_bpf_destroy_dsq() / scx_bpf_dsq_nr_queued()
scx_bpf_create_dsq() resolves the calling scheduler via scx_prog_sched(aux)
and inserts the new DSQ into that scheduler's dsq_hash. Its inverse
scx_bpf_destroy_dsq() and the query helper scx_bpf_dsq_nr_queued() were
hard-coded to rcu_dereference(scx_root), so a sub-scheduler could only
destroy or query DSQs in the root scheduler's hash - never its own. If the
root had a DSQ with the same id, the sub-sched silently destroyed it and the
root aborted on the next dispatch ("invalid DSQ ID 0x0..").

Take a const struct bpf_prog_aux *aux via KF_IMPLICIT_ARGS and resolve the
scheduler with scx_prog_sched(aux), matching scx_bpf_create_dsq().

Fixes: ebeca1f930 ("sched_ext: Introduce cgroup sub-sched support")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:35 -10:00
Tejun Heo
80afd4c84b sched_ext: Read scx_root under scx_cgroup_ops_rwsem in cgroup setters
scx_group_set_{weight,idle,bandwidth}() cache scx_root before acquiring
scx_cgroup_ops_rwsem, so the pointer can be stale by the time the op runs.
If the loaded scheduler is disabled and freed (via RCU work) and another is
enabled between the naked load and the rwsem acquire, the reader sees
scx_cgroup_enabled=true (the new scheduler's) but dereferences the freed one
- UAF on SCX_HAS_OP(sch, ...) / SCX_CALL_OP(sch, ...).

scx_cgroup_enabled is toggled only under scx_cgroup_ops_rwsem write
(scx_cgroup_{init,exit}), so reading scx_root inside the rwsem read section
correlates @sch with the enabled snapshot.

Fixes: a5bd6ba30b ("sched_ext: Use cgroup_lock/unlock() to synchronize against cgroup operations")
Cc: stable@vger.kernel.org # v6.18+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:35 -10:00
Tejun Heo
21a5a97ba4 sched_ext: Don't disable tasks in scx_sub_enable_workfn() abort path
scx_sub_enable_workfn()'s prep loop calls __scx_init_task(sch, p, false)
without transitioning task state, then sets SCX_TASK_SUB_INIT. If prep fails
partway, the abort path runs __scx_disable_and_exit_task(sch, p) on the
marked tasks. Task state is still the parent's ENABLED, so that dispatches
to the SCX_TASK_ENABLED arm and calls scx_disable_task(sch, p) - i.e.
child->ops.disable() - for tasks on which child->ops.enable() never ran. A
BPF sub-scheduler allocating per-task state in enable/freeing in disable
would operate on uninitialized state.

The dying-task branch in scx_disable_and_exit_task() has the same problem,
and scx_enabling_sub_sched was cleared before the abort cleanup loop - a
task exiting during cleanup tripped the WARN and skipped both ops.exit_task
and the SCX_TASK_SUB_INIT clear, leaking per-task resources and leaving the
task stuck.

Introduce scx_sub_init_cancel_task() that calls ops.exit_task with
cancelled=true - matching what the top-level init path does when init_task
itself returns -errno. Use it in the abort loop and in the dying-task
branch. scx_enabling_sub_sched now stays set until the abort loop finishes
clearing SUB_INIT, so concurrent exits hitting the dying-task branch can
still find @sch. That branch also clears SCX_TASK_SUB_INIT unconditionally
when seen, leaving the task unmarked even if the WARN fires.

Fixes: 337ec00b1d ("sched_ext: Implement cgroup sub-sched enabling and disabling")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:35 -10:00
Tejun Heo
da2d81b411 sched_ext: Skip tasks with stale task_rq in bypass_lb_cpu()
bypass_lb_cpu() transfers tasks between per-CPU bypass DSQs without
migrating them - task_cpu() only updates when the donee later consumes the
task via move_remote_task_to_local_dsq(). If the LB timer fires again before
consumption and the new DSQ becomes a donor, @p is still on the previous CPU
and task_rq(@p) != donor_rq. @p can't be moved without its own rq locked.

Skip such tasks.

Fixes: 95d1df610c ("sched_ext: Implement load balancer for bypass mode")
Cc: stable@vger.kernel.org # v6.19+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:35 -10:00
Tejun Heo
4fda9f0e7c sched_ext: Guard scx_dsq_move() against NULL kit->dsq after failed iter_new
bpf_iter_scx_dsq_new() clears kit->dsq on failure and
bpf_iter_scx_dsq_{next,destroy}() guard against that. scx_dsq_move() doesn't -
it dereferences kit->dsq immediately, so a BPF program that calls
scx_bpf_dsq_move[_vtime]() after a failed iter_new oopses the kernel.

Return false if kit->dsq is NULL.

Fixes: 4c30f5ce4f ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
Cc: stable@vger.kernel.org # v6.12+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:35 -10:00
Tejun Heo
411d3ef1a7 sched_ext: Unregister sub_kset on scheduler disable
When ops.sub_attach is set, scx_alloc_and_add_sched() creates sub_kset as a
child of &sch->kobj, which pins the parent with its own reference. The
disable paths never call kset_unregister(), so the final kobject_put() in
bpf_scx_unreg() leaves a stale reference and scx_kobj_release() never runs,
leaking the whole struct scx_sched on every load/unload cycle.

Unregister sub_kset in scx_root_disable() and scx_sub_disable() before
kobject_del(&sch->kobj).

Fixes: ebeca1f930 ("sched_ext: Introduce cgroup sub-sched support")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:35 -10:00
Tejun Heo
bd2d76455b sched_ext: Defer scx_hardlockup() out of NMI
scx_hardlockup() runs from NMI and eventually calls scx_claim_exit(),
which takes scx_sched_lock. scx_sched_lock isn't NMI-safe and grabbing
it from NMI context can lead to deadlocks.

The hardlockup handler is best-effort recovery and the disable path it
triggers runs off of irq_work anyway. Move the handle_lockup() call into
an irq_work so it runs in IRQ context.

Fixes: ebeca1f930 ("sched_ext: Introduce cgroup sub-sched support")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:13:22 -10:00
Linus Torvalds
27d128c1cf ring-buffer fix for 7.1:
- Fix accounting of persistent ring buffer rewind
 
   On boot up, the head page is moved back to the earliest point of
   the saved ring buffer. This is because the ring buffer being read by
   user space on a crash may not save the part it read. Rewinding the head
   page back to the earliest saved position helps keep those events from
   being lost.
 
   The number of events is also read during boot up and displayed in the
   stats file in the tracefs directory. It's also used for other accounting
   as well. On boot up, the "reader page" is accounted for but a rewind may
   put it back into the buffer and then the reader page may be accounted
   for again.
 
   Save off the original reader page and skip accounting it when scanning
   the pages in the ring buffer.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaevLRxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qsNBAQCO9T+3kbDLD/DxRBYjeMXhTg2iY2OV
 SHV9jEzcYTahmwEAqVCFY9sNxVYXait/C924SmHCSi1N0HrIuEzIEd29iwk=
 =bv+t
 -----END PGP SIGNATURE-----

Merge tag 'trace-ring-buffer-v7.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ring-buffer fix from Steven Rostedt:

 - Fix accounting of persistent ring buffer rewind

   On boot up, the head page is moved back to the earliest point of the
   saved ring buffer. This is because the ring buffer being read by user
   space on a crash may not save the part it read. Rewinding the head
   page back to the earliest saved position helps keep those events from
   being lost.

   The number of events is also read during boot up and displayed in the
   stats file in the tracefs directory. It's also used for other
   accounting as well. On boot up, the "reader page" is accounted for
   but a rewind may put it back into the buffer and then the reader page
   may be accounted for again.

   Save off the original reader page and skip accounting it when
   scanning the pages in the ring buffer.

* tag 'trace-ring-buffer-v7.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ring-buffer: Do not double count the reader_page
2026-04-24 15:17:23 -07:00
Masami Hiramatsu (Google)
92d5a60672 ring-buffer: Do not double count the reader_page
The cpu_buffer->reader_page is updated if there are unwound
pages. After that update, we should skip the page if it is the
original reader_page, because the original reader_page has already
been checked.

Cc: stable@vger.kernel.org
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ian Rogers <irogers@google.com>
Link: https://patch.msgid.link/177701353063.2223789.1471163147644103306.stgit@mhiramat.tok.corp.google.com
Fixes: ca296d32ec ("tracing: ring_buffer: Rewind persistent ring buffer on reboot")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2026-04-24 15:34:39 -04:00
Richard Cheng
510a270554 sched_ext: sync disable_irq_work in bpf_scx_unreg()
When unregistering my self-written scx scheduler, the following panic
occurs.

[  229.923133] Kernel text patching generated an invalid instruction at 0xffff80009bc2c1f8!
[  229.923146] Internal error: Oops - BRK: 00000000f2000100 [#1]  SMP
[  230.077871] CPU: 48 UID: 0 PID: 1760 Comm: kworker/u583:7 Not tainted 7.0.0+ #3 PREEMPT(full)
[  230.086677] Hardware name: NVIDIA GB200 NVL/P3809-BMC, BIOS 02.05.12 20251107
[  230.093972] Workqueue: events_unbound bpf_map_free_deferred
[  230.099675] Sched_ext: invariant_0.1.0_aarch64_unknown_linux_gnu_debug (disabling), task: runnable_at=-174ms
[  230.116843] pc : 0xffff80009bc2c1f8
[  230.120406] lr : dequeue_task_scx+0x270/0x2d0
[  230.217749] Call trace:
[  230.228515]  0xffff80009bc2c1f8 (P)
[  230.232077]  dequeue_task+0x84/0x188
[  230.235728]  sched_change_begin+0x1dc/0x250
[  230.240000]  __set_cpus_allowed_ptr_locked+0x17c/0x240
[  230.245250]  __set_cpus_allowed_ptr+0x74/0xf0
[  230.249701]  ___migrate_enable+0x4c/0xa0
[  230.253707]  bpf_map_free_deferred+0x1a4/0x1b0
[  230.258246]  process_one_work+0x184/0x540
[  230.262342]  worker_thread+0x19c/0x348
[  230.266170]  kthread+0x13c/0x150
[  230.269465]  ret_from_fork+0x10/0x20
[  230.281393] Code: d4202000 d4202000 d4202000 d4202000 (d4202000)
[  230.287621] ---[ end trace 0000000000000000 ]---
[  231.160046] Kernel panic - not syncing: Oops - BRK: Fatal exception in interrupt

The root cause is that the JIT page backing ops->quiescent() is freed
before all callers of that function have stopped.

The expected ordering during teardown is:
    bitmap_zero(sch->has_op) + synchronize_rcu()
        -> guarantees no CPU will ever call sch->ops.* again
    -> only THEN free the BPF struct_ops JIT page

bpf_scx_unreg() is supposed to enforce the order, but after
commit f4a6c506d1 ("sched_ext: Always bounce scx_disable() through
irq_work"), disable_work is no longer queued directly, causing
kthread_flush_work() to be a noop. Thus, the caller drops the struct_ops
map too early and poisoned with AARCH64_BREAK_FAULT before
disable_workfn ever execute.

So the subsequent dequeue_task() still sees SCX_HAS_OP(sch, quiescent)
as true and calls ops.quiescent, which hits the poisoned page and
triggers the BRK panic.

Add a helper, scx_flush_disable_work(), so that future use cases that
want to flush disable_work can use it.
Also amend scx_root_enable_workfn() and scx_sub_enable_workfn(), which
have a similar pattern in their error paths.

Fixes: f4a6c506d1 ("sched_ext: Always bounce scx_disable() through irq_work")
Signed-off-by: Richard Cheng <icheng@nvidia.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-24 07:26:48 -10:00
Linus Torvalds
892c894b4b Misc locking fixes:
- Fix ww_mutex regression, which caused hangs/pauses in some DRM drivers
  - Fix rtmutex proxy-rollback bug
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmnrdJERHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jnJw//bi0bzZnBYeU4ukrqnyVWUZS7lKCnl+3M
 Ys31Ls38nXTQPhs+qbgUlqLzybJ/96s7uattPIW8+0Jkr+PfQguCvT+QiXs0SFzC
 aEp6PcZAD1z+QQY1zj2c1TM414IdyWbkShRxsPULb1j8nYJ6mO50UzP4zFdl7QwH
 gwZRgPkriUhD3vfgtHX3U0rshrvFHu9Ed6uuMkLFxaEiHn0h/4rZa0q9vonxw+cU
 64MabrxtPyQA97Mjr/YVsJjodt5bzkBmKNDCu9YxfqFQ0LTXyi1DZG/zdFHmT1Kf
 WcRIXbCx27HcR+NOWRze4rby2SR27NnPAZC/rR7Y6MX0IHGg1m2tM9DIGnDw71Su
 3HkQ1W3KhQFO0uvbhkKCXk1SAfBrJwh0g3p91sgj/tZyaXXdEYRChhpie5jMy3eg
 3k49QfYHigQZCZa8IH0PjrxLzDHo8S2tB75X6dWlxC/UeN/6X1G+Z9a52G9kZrFJ
 ZhrxY5/WVF4LbeIYupFbjN393jVN4WPEJKN3Esouuc+bluqMjUjl65v12yTQZeFy
 7pObrM9nJ4FijWjt+OCHUUZLwt+k5zttdUjgEh0zonx94JtpDA0qjkpPtbYgCFak
 WQxwD+qs1r8yv+msqkQ65/ir5GT6GyAzQnQAQbAQyB4itC9iyDhoS6mzkFBsw3bk
 TxfGszSV5ng=
 =6wf3
 -----END PGP SIGNATURE-----

Merge tag 'locking-urgent-2026-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fixes from Ingo Molnar:

 - Fix ww_mutex regression, which caused hangs/pauses in some DRM drivers

 - Fix rtmutex proxy-rollback bug

* tag 'locking-urgent-2026-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/mutex: Fix ww_mutex wait_list operations
  rtmutex: Use waiter::task instead of current in remove_waiter()
2026-04-24 10:14:29 -07:00
Petr Malat
13e786b64b cgroup: Increment nr_dying_subsys_* from rmdir context
Incrementing nr_dying_subsys_* in offline_css(), which is executed by a
cgroup_offline_wq worker, leads to a race where the user can see the
value as 0 if they read cgroup.stat after calling rmdir and before the
worker executes. This makes the user wrongly expect that resources
released by the removed cgroup are available for a new assignment.

Increment nr_dying_subsys_* from kill_css(), which is called from the
cgroup_rmdir() context.

Fixes: ab03125268 ("cgroup: Show # of subsystem CSSes in cgroup.stat")
Signed-off-by: Petr Malat <oss@malat.biz>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-23 07:37:40 -10:00
zhidao su
4e3d7c89e1 sched_ext: Fix local_dsq_post_enq() to use task's scheduler in sub-sched
local_dsq_post_enq() calls call_task_dequeue() with scx_root instead of
the scheduler instance actually managing the task. When
CONFIG_EXT_SUB_SCHED is enabled, tasks may be managed by a sub-scheduler
whose ops.dequeue() callback differs from root's. Using scx_root causes
the wrong scheduler's ops.dequeue() to be consulted: sub-sched tasks
dispatched to a local DSQ via scx_bpf_dsq_move_to_local() will have
SCX_TASK_IN_CUSTODY cleared but the sub-scheduler's ops.dequeue() is
never invoked, violating the custody exit semantics.

Fix by adding a 'struct scx_sched *sch' parameter to local_dsq_post_enq()
and move_local_task_to_local_dsq(), and propagating the correct scheduler
from their callers dispatch_enqueue(), move_task_between_dsqs(), and
consume_dispatch_q().

This is consistent with dispatch_enqueue()'s non-local path which already
passes 'sch' directly to call_task_dequeue() for global/bypass DSQs.

Fixes: ebf1ccff79 ("sched_ext: Fix ops.dequeue() semantics")
Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-23 06:36:56 -10:00
Peter Zijlstra
0adc92b910 locking/mutex: Fix ww_mutex wait_list operations
Chaitanya, John and Mikhail reported commit 25500ba7e7 ("locking/mutex:
Remove the list_head from struct mutex") wrecked ww_mutex.

Specifically there were 2 issues:

 - __ww_waiter_prev() had the termination condition wrong; it would terminate
   when the previous entry was the first, which results in a truncated
   iteration: W3, W2, (no W1).

 - __mutex_add_waiter(@pos != NULL), as used by __ww_waiter_add() /
   __ww_mutex_add_waiter(); this inserts @waiter before @pos (which is what
   list_add_tail() does). But this should then also update lock->first_waiter.

Much thanks to Prateek for spotting the __mutex_add_waiter() issue!

Fixes: 25500ba7e7 ("locking/mutex: Remove the list_head from struct mutex")
Reported-by: "Borah, Chaitanya Kumar" <chaitanya.kumar.borah@intel.com>
Closes: https://lore.kernel.org/r/af005996-05e9-4336-8450-d14ca652ba5d%40intel.com
Reported-by: John Stultz <jstultz@google.com>
Closes: https://lore.kernel.org/r/CANDhNCq%3Doizzud3hH3oqGzTrcjB8OwGeineJ3mwZuGdDWG8fRQ%40mail.gmail.com
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Closes: https://lore.kernel.org/r/CABXGCsO5fKq2nD9nO8yO1z50ZzgCPWqueNXHANjntaswoOh2Dg@mail.gmail.com
Debugged-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Link: https://patch.msgid.link/20260422092335.GH3102924%40noisy.programming.kicks-ass.net
2026-04-23 10:05:49 +02:00
Linus Torvalds
1e18ed5727 ring-buffer fix for v7.1
- Make undefsyms_base.c into a real file
 
   The file undefsyms_base.c is used to catch any symbols used by a remote
   ring buffer that is made for use of a pKVM hypervisor. As it doesn't share
   the same text as the rest of the kernel, referencing any symbols within
   the kernel will make it fail to be built for the standalone hypervisor.
 
   A file was created by the Makefile that checked for any symbols that could
   cause issues. There's no reason to have this file created by the Makefile,
   just create it as a normal file instead.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaekbsxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qr0MAQCHMFK56uzmMvQJg5Tfhtb+grY7ryo4
 j0rjLkNQBovmxwD/XylBDkhxr8IQ72jU04e+xuXzrqoT18U7XvQEVpPgkQQ=
 =dQi0
 -----END PGP SIGNATURE-----

Merge tag 'trace-ring-buffer-v7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ring-buffer fix from Steven Rostedt:

 - Make undefsyms_base.c into a real file

   The file undefsyms_base.c is used to catch any symbols used by a
   remote ring buffer that is made for use of a pKVM hypervisor. As it
   doesn't share the same text as the rest of the kernel, referencing
   any symbols within the kernel will make it fail to be built for the
   standalone hypervisor.

   A file was created by the Makefile that checked for any symbols that
   could cause issues. There's no reason to have this file created by
   the Makefile, just create it as a normal file instead.

* tag 'trace-ring-buffer-v7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Make undefsyms_base.c a first-class citizen
2026-04-22 14:47:52 -07:00
Linus Torvalds
38ee6e1fb6 kgdb patches for 7.1
Only a very small update for kgdb this cycle: a single patch from Kexin Sun
 that fixes some outdated comments.
 
 Signed-off-by: Daniel Thompson (RISCstar) <danielt@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEELzVBU1D3lWq6cKzwfOMlXTn3iKEFAmno2OcACgkQfOMlXTn3
 iKHVNw/+IXsSmpUTllkwlzmib+D7zYVEoyLOdoBRQyJDjjT9Qe811budkIs02sIk
 NUgMBnffInU73lCw2TbRpKag9+dl8ZglKFvM3UpElFzY5ePnUc/fHeFtsXdzVjSK
 PVzL3fCtN66R1MuMeCdMNjhtSVHNJCrclj0We39JRgli7kWK1zH6kY8X/j/ExBTR
 9S5e3/9Yn/V9Z2Z33lwBMCKLHyd/5r0ezKbvfVGmqT/sTE/IbTcTfVMipEHGtHfA
 JHs5PC+1obJ9jeq/1c1fV8tZ8kO+VjGI0yUibbWEBmPx0pVUW/FG1z+WwhlbcyOl
 cyhKQ7d4h4e718M+bWAXeoRGk9bXyAWHqWojM98Xh+TvR2H20qAzUC44qeT9tt7B
 SOfvZM4Xwsn+uYqmthtMqLMqe8VIX2WbUVjsntYZ46fRTq8RAUY3z7yLv6bm7N4l
 p4i7/cdCsbzi+tN6uaKANywvt9ygEAcPILGfMhfIS4Pa7aQ/JVjKgxKhP1bUuH9u
 ftqIcJwvX3d8MDQY4FsLpsRToj3rADfr35HnQCc7fmqRIFi7++fXzPerktdF/5Sd
 6NMR0+mdFewdW9JfYKGZMTybf9XXyC+F6HlKgNOBddPje7qQlLUIQBajn71W753i
 TNKI5Go6idMgWEukNVXGD2XqObyKD2eBgCj/42Kc8bDfo7D66UA=
 =MegD
 -----END PGP SIGNATURE-----

Merge tag 'kgdb-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux

Pull kgdb update from Daniel Thompson:
 "Only a very small update for kgdb this cycle: a single patch from
  Kexin Sun that fixes some outdated comments"

* tag 'kgdb-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
  kgdb: update outdated references to kgdb_wait()
2026-04-22 14:26:58 -07:00
Paolo Bonzini
5335e318ad tracing: Make undefsyms_base.c a first-class citizen
Linus points out that dumping undefsyms_base.c from the Makefile
is rather ugly, and that a much better course of action would be
to have this file as a first-class citizen in the git tree.

This allows some extra cleanup in the Makefile, and the removal of
the .gitignore file in kernel/trace.

Cc: Marc Zyngier <maz@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/CAHk-=wieqGd_XKpu8UxDoyADZx8TDe8CF3RmkUXt5N_9t5Pf_w@mail.gmail.com
Link: https://lore.kernel.org/all/20260421095446.2951646-1-maz@kernel.org/
Link: https://patch.msgid.link/20260421100455.324333-1-pbonzini@redhat.com
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2026-04-22 11:24:41 -04:00
Masami Hiramatsu (Google)
476c5bbae6 tracing/fprobe: Fix to unregister ftrace_ops if it is empty on module unloading
Fix fprobe to unregister ftrace_ops if the corresponding type of fprobe
does not exist on the fprobe_ip_table, as it is expected to be empty
when unloading modules.

Since ftrace treats an empty hash as meaning everything should be
traced, if fprobes are set only on the unloaded module, all functions
are unexpectedly traced after the module is unloaded.
e.g.

 # modprobe xt_LOG.ko
 # echo 'f:test log_tg*' > dynamic_events
 # echo 1 > events/fprobes/test/enable
 # cat enabled_functions
log_tg [xt_LOG] (1)             tramp: 0xffffffffa0004000 (fprobe_ftrace_entry+0x0/0x490) ->fprobe_ftrace_entry+0x0/0x490
log_tg_check [xt_LOG] (1)               tramp: 0xffffffffa0004000 (fprobe_ftrace_entry+0x0/0x490) ->fprobe_ftrace_entry+0x0/0x490
log_tg_destroy [xt_LOG] (1)             tramp: 0xffffffffa0004000 (fprobe_ftrace_entry+0x0/0x490) ->fprobe_ftrace_entry+0x0/0x490
 # rmmod xt_LOG
 # wc -l enabled_functions
34085 enabled_functions

Link: https://lore.kernel.org/all/177669368776.132053.10042301916765771279.stgit@mhiramat.tok.corp.google.com/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2026-04-22 09:24:13 +09:00
Kexin Sun
256e5254ef kgdb: update outdated references to kgdb_wait()
The function kgdb_wait() was folded into the static function
kgdb_cpu_enter() by commit 62fae31219 ("kgdb: eliminate
kgdb_wait(), all cpus enter the same way").  Update the four stale
references accordingly:

 - include/linux/kgdb.h and arch/x86/kernel/kgdb.c: the
   kgdb_roundup_cpus() kdoc describes what other CPUs are rounded up
   to call.  Because kgdb_cpu_enter() is static, the correct public
   entry point is kgdb_handle_exception(); also fix a pre-existing
   grammar error ("get them be" -> "get them into") and reflow the
   text.

 - kernel/debug/debug_core.c: replace with the generic description
   "the debug trap handler", since the actual entry path is
   architecture-specific.

 - kernel/debug/gdbstub.c: kgdb_cpu_enter() is correct here (it
   describes internal state, not a call target); add the missing
   parentheses.

Suggested-by: Daniel Thompson <daniel@riscstar.com>
Assisted-by: unnamed:deepseek-v3.2 coccinelle
Signed-off-by: Kexin Sun <kexinsun@smail.nju.edu.cn>
2026-04-21 16:41:54 +01:00
Masami Hiramatsu (Google)
0ac0058a74 tracing/fprobe: Check the same type fprobe on table as the unregistered one
Commit 2c67dc457b ("tracing: fprobe: optimization for entry only case")
introduced a different ftrace_ops for entry-only fprobes.

However, when unregistering an fprobe, the kernel only checks whether
another fprobe exists at the same address, without checking which type
of fprobe it is.
If different fprobes are registered at the same address, the address
will be registered in both fgraph_ops and ftrace_ops, but only one of
them will be deleted when unregistering (the one removed first will not
be deleted from the ops).

This results in junk entries remaining in either fgraph_ops or ftrace_ops.
For example:
 =======
 cd /sys/kernel/tracing

 # 'Add entry and exit events on the same place'
 echo 'f:event1 vfs_read' >> dynamic_events
 echo 'f:event2 vfs_read%return' >> dynamic_events

 # 'Enable both of them'
 echo 1 > events/fprobes/enable
 cat enabled_functions
vfs_read (2)            ->arch_ftrace_ops_list_func+0x0/0x210

 # 'Disable and remove exit event'
 echo 0 > events/fprobes/event2/enable
 echo -:event2 >> dynamic_events

 # 'Disable and remove all events'
 echo 0 > events/fprobes/enable
 echo > dynamic_events

 # 'Add another event'
 echo 'f:event3 vfs_open%return' > dynamic_events
 cat dynamic_events
f:fprobes/event3 vfs_open%return

 echo 1 > events/fprobes/enable
 cat enabled_functions
vfs_open (1)            tramp: 0xffffffffa0001000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60    subops: {ent:fprobe_fgraph_entry+0x0/0x620 ret:fprobe_return+0x0/0x150}
vfs_read (1)            tramp: 0xffffffffa0001000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60    subops: {ent:fprobe_fgraph_entry+0x0/0x620 ret:fprobe_return+0x0/0x150}
 =======

As you can see, an entry for the vfs_read remains.

To fix this issue, when unregistering, the kernel should also check
whether an fprobe of the same type still exists at the same address,
and if not, delete its entry from either fgraph_ops or ftrace_ops.

Link: https://lore.kernel.org/all/177669367993.132053.10553046138528674802.stgit@mhiramat.tok.corp.google.com/

Fixes: 2c67dc457b ("tracing: fprobe: optimization for entry only case")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2026-04-22 00:03:10 +09:00
Masami Hiramatsu (Google)
aa72812b49 tracing/fprobe: Avoid kcalloc() in rcu_read_lock section
fprobe_remove_node_in_module() is called with the RCU read lock held,
but it invokes kcalloc() if there are more than 8 fprobes installed
on the module. Sashiko warns about this because kcalloc() can sleep [1].

 [1] https://sashiko.dev/#/patchset/177552432201.853249.5125045538812833325.stgit%40mhiramat.tok.corp.google.com

To fix this issue, expand the batch size to 128 and do not expand
the fprobe_addr_list; instead, cancel the walk on fprobe_ip_table,
update fgraph/ftrace_ops and retry the loop.

Link: https://lore.kernel.org/all/177669367206.132053.1493637946869032744.stgit@mhiramat.tok.corp.google.com/

Fixes: 0de4c70d04 ("tracing: fprobe: use rhltable for fprobe_ip_table")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2026-04-22 00:02:59 +09:00
Masami Hiramatsu (Google)
845947aca6 tracing/fprobe: Remove fprobe from hash in failure path
When register_fprobe_ips() fails, it tries to remove a list of
fprobe_hash_nodes from fprobe_ip_table, but it fails to remove the
fprobe itself from fprobe_table. Moreover, when removing a
fprobe_hash_node that was once added to the rhltable, it must be
freed with kfree_rcu() after being removed from the rhltable.

To fix these issues, reuse the unregister_fprobe() internal
code to roll back the half-registered fprobe.

Link: https://lore.kernel.org/all/177669366417.132053.17874946321744910456.stgit@mhiramat.tok.corp.google.com/

Fixes: 4346ba1604 ("fprobe: Rewrite fprobe on function-graph tracer")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2026-04-21 23:59:57 +09:00
Masami Hiramatsu (Google)
1aec9e5c3e tracing/fprobe: Unregister fprobe even if memory allocation fails
unregister_fprobe() can fail under memory pressure because of a memory
allocation failure, but it may be called from module unloading, where
there is usually no way to retry. Moreover, trace_fprobe does not
check the return value.

To fix this problem, unregister the fprobe and fprobe_hash_node even if
the working memory allocation fails.
In any case, when the last fprobe is removed, the filter will be freed.

Link: https://lore.kernel.org/all/177669365629.132053.8433032896213721288.stgit@mhiramat.tok.corp.google.com/

Fixes: 4346ba1604 ("fprobe: Rewrite fprobe on function-graph tracer")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2026-04-21 23:59:39 +09:00
Masami Hiramatsu (Google)
6ad51ada17 tracing/fprobe: Reject registration of a registered fprobe before init
Reject registration of an already-registered fprobe, which is on the
fprobe hash table, before initializing the fprobe.
add_fprobe_hash() checks for such a re-registered fprobe, but since
fprobe_init() clears the hlist_array field, that check comes too late.
The re-registration has to be checked before touching the fprobe.

Link: https://lore.kernel.org/all/177669364845.132053.18375367916162315835.stgit@mhiramat.tok.corp.google.com/

Fixes: 4346ba1604 ("fprobe: Rewrite fprobe on function-graph tracer")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2026-04-21 23:59:29 +09:00
Linus Torvalds
b4e07588e7 tracing: tell git to ignore the generated 'undefsyms_base.c' file
This odd file was added to automatically figure out tool-generated
symbols.

Honestly, it *should* have been just a real honest-to-goodness regular
file in git, instead of having strange code to generate it in the
Makefile, but that is not how that silly thing works.  So now we need to
ignore it explicitly.

Fixes: 1211907ac0 ("tracing: Generate undef symbols allowlist for simple_ring_buffer")
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Marc Zyngier <maz@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-04-20 17:25:56 -07:00
Linus Torvalds
b66cb4f156 printk changes for 7.1
-----BEGIN PGP SIGNATURE-----
 
 iQJPBAABCAA5FiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmnmEggbFIAAAAAABAAO
 bWFudTIsMi41KzEuMTIsMiwyAAoJEFKgDEdIgJTyjQwP/j2a9EzdI4+DfGrmA56m
 /heo44ObpFJppYaWEyAN7xX8Xpm7ErokLjZxVxhgQY7hGU5WLx1CmslnJbfWYkWy
 r7q92Kd/QIDmsvwHzlE0xdaX3rp8AXp3O2iIhcDfv9FRe+UultrESBpw9JbRYXXQ
 SLAFU1iOFMxyuzvhW/2l/B+PQDm0uoRMpMWTsLXo2JtNOsIGewNyV/7dhOumm+RD
 /0lgVA9jMAQA33j4Hkr38REe8lYH7aGl66x1mZhDg+kYb7w7ndKW/QN21OJiP+2D
 s4moi2/VLmC/UcxoDAOQTArKYkrYc2nzLMJRMaIun8jcNfCHEqaQAW2MTzSDAC78
 +atdlWrfIxykORA31lTtjh4o5vg43sQPjFZCJr3RxNLfxFy15NULhT75QFrxe9sA
 W3gs4Rz+LeynoQbJNCn8hgK0xcpKLtzC4dMY7dfQo6uyq2YnJauEvioJKbUYyoDh
 3l3uYfnzHcfRUb+yNtIkNCiYXwrpTAnSnifCbWO3smWYxhCdAT14Rval/fxtTjSe
 sTphd7m32U56WpWX0lYQbY001nMzso+gN/eQSK20IzrhvghwYUqvCKNm+fIM9tjR
 nQBxka+B7XhO27hNV4QIT8ZQRUkabQQAseEFhLQI2Ptgw3eV0s85gMP60Lg4gpZ0
 7OGA/VM+xrLqXvsFeDDWccSg
 =O6QR
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux

Pull printk updates from Petr Mladek:

 - Fix printk ring buffer initialization and sanity checks

 - Workaround printf kunit test compilation with gcc < 12.1

 - Add IPv6 address printf format tests

 - Misc code and documentation cleanup

* tag 'printk-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  printf: Compile the kunit test with DISABLE_BRANCH_PROFILING
  lib/vsprintf: use bool for local decode variable
  lib/hexdump: print_hex_dump_bytes() calls print_hex_dump_debug()
  printk: ringbuffer: fix errors in comments
  printk_ringbuffer: Add sanity check for 0-size data
  printk_ringbuffer: Fix get_data() size sanity check
  printf: add IPv6 address format tests
  printk: Fix _DESCS_COUNT type for 64-bit systems
2026-04-20 15:42:18 -07:00
Linus Torvalds
ccbc9fdb32 Fix timer stalls caused by incorrect handling of
the dev->next_event_forced flag.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmnl1/4RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1g/ww/7BI4CyQUJLSCpYMjvkj+87Rrfd5u6FGqt
 jz0dGeQpY0LvRGSqASwICe+1r0zwHF+xDsUfJvA13mRaPM6D6bEU+JE6ffK8B6T9
 EIyYwEwQ2a7DrdIu9+FCXTwqXDUoGLFsguD50b4qupQKFcDlwgZbg4UAWi/ptau9
 Ww+5T/+sfw/SMR9EwXBSKH79N0gOOGpDNfGtpDv+0X0qPQvo9QGAxMfIUgMf7ZaA
 y55agXi5iOdM+mAIrE69WLinBzrBvXHWNr66/SaMadQs93I1hU54sLpir4ft1yCs
 WnDtTRWG11Y0HBHUqqgbnN8BR/2VIFDVe9BtRDoDUD70iEJ8TGqJjOvF8v7C00MK
 ets0zNel9Rqbz9wjrjTekPYUHfC/t9qqzV77c0TdU1IR6FArf/OT9Ge34AVr60EX
 a5s4aX7ECLjwuTwgQPLXsSedOD0eQndf/VYdEQ86fTUfyyujVg2NCxbFEfDr3eho
 SbjcNv1UQ1WY/7miJzYaiA2aVNtwuX25YNI+t3f3pX/1tGqmx9oB1tNzJqgGfuN9
 3/Rx3uP0kH+gpbw1lKAugFJazOEHDLJG8LBgF0PYbmlVdIGn57IuBlQL5GDLe77O
 G+sFUhrLNpkrIJVhNWODkM+K/z9vvKzENiRgG4hB0oAbQEUuqTA6ppayKWWs8KDb
 9fdgLSkdjtI=
 =GPvm
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2026-04-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fix from Ingo Molnar:
 "Fix timer stalls caused by incorrect handling of the
  dev->next_event_forced flag"

* tag 'timers-urgent-2026-04-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clockevents: Add missing resets of the next_event_forced flag
2026-04-20 15:30:08 -07:00
Keenan Dong
3bfdc63936 rtmutex: Use waiter::task instead of current in remove_waiter()
remove_waiter() is used by the slowlock paths, but it is also used for
proxy-lock rollback in rt_mutex_start_proxy_lock() when invoked from
futex_requeue().

In the latter case waiter::task is not current, but remove_waiter()
operates on current for the dequeue operation. That results in several
problems:

  1) the rbtree dequeue happens without waiter::task::pi_lock being held

  2) the waiter task's pi_blocked_on state is not cleared, which leaves a
     dangling pointer primed for use-after-free lying around.

  3) rt_mutex_adjust_prio_chain() operates on the wrong top priority waiter
     task

Use waiter::task instead of current in all related operations in
remove_waiter() to cure those problems.

[ tglx: Fixup rt_mutex_adjust_prio_chain(), add a comment and amend the
  	changelog ]

Fixes: 8161239a8b ("rtmutex: Simplify PI algorithm and make highest prio task get lock")
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Signed-off-by: Keenan Dong <keenanat2000@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: stable@vger.kernel.org
2026-04-21 00:22:31 +02:00
Cheng-Yang Chou
2d2b026c3e sched_ext: Deny SCX kfuncs to non-SCX struct_ops programs
scx_kfunc_context_filter() currently allows non-SCX struct_ops programs
(e.g. tcp_congestion_ops) to call SCX unlocked kfuncs. This is wrong
for two reasons:

- It is semantically incorrect: a TCP congestion control program has no
  business calling SCX kfuncs such as scx_bpf_kick_cpu().

- With CONFIG_EXT_SUB_SCHED=y, kfuncs like scx_bpf_kick_cpu() call
  scx_prog_sched(aux), which invokes bpf_prog_get_assoc_struct_ops(aux)
  and casts the result to struct sched_ext_ops * before reading ops->priv.
  For a non-SCX struct_ops program the returned pointer is the kdata of
  that struct_ops type, which is far smaller than sched_ext_ops, making
  the read an out-of-bounds access (confirmed with KASAN).

Extend the filter to cover scx_kfunc_set_any and scx_kfunc_set_idle as
well, and deny all SCX kfuncs for any struct_ops program that is not the
SCX struct_ops. This addresses both issues: the semantic contract is
enforced at the verifier level, and the runtime out-of-bounds access
becomes unreachable.

Fixes: d1d3c1c6ae ("sched_ext: Add verifier-time kfunc context filter")
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-20 07:57:29 -10:00
Petr Mladek
add9d911be Merge branch 'rework/prb-fixes' into for-linus 2026-04-20 13:42:01 +02:00
Tejun Heo
87019cb6c2 sched_ext: Mark scx_sched_hash insecure_elasticity
Insertions into scx_sched_hash happen under scx_sched_lock (raw_spinlock_irq)
in scx_link_sched(). rhashtable's sync grow path calls get_random_u32()
and does a GFP_ATOMIC allocation; both acquire regular spinlocks, which
is unsafe under raw_spinlock_t. Set insecure_elasticity to skip the
sync grow.

v2:
- Dropped dsq_hash changes. Insertion is not under raw_spin_lock.

- Switched from no_sync_grow flag to insecure_elasticity.

Fixes: 25037af712 ("sched_ext: Add rhashtable lookup for sub-schedulers")
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-19 05:47:28 -10:00