Commit Graph

51779 Commits

Author SHA1 Message Date
Jann Horn
c1fa0bb633 exit: prevent preemption of oopsing TASK_DEAD task
When an already-exiting task oopses, make_task_dead() currently calls
do_task_dead() with preemption enabled.  That is forbidden:
do_task_dead() calls __schedule(), which has a comment saying "WARNING:
must be called with preemption disabled!".

If an oopsing task is preempted in do_task_dead(), between becoming
TASK_DEAD and entering the scheduler explicitly, bad things happen:
finish_task_switch() assumes that once the scheduler has switched away
from a TASK_DEAD task, the task can never run again and its stack is no
longer needed; but that assumption apparently doesn't hold if the dead
task was preempted (the SM_PREEMPT case).

This means that the scheduler ends up repeatedly dropping references on
the dead task's stack, which can lead to use-after-free or double-free
of the entire task stack; in other words, two tasks can end up running
on the same stack, resulting in various kinds of memory corruption.

(This does not just affect "recursively oopsing" tasks; it is enough to
oops once during task exit, for example in a file_operations::release
handler)

Fixes: 7f80a2fd7d ("exit: Stop poorly open coding do_task_dead in make_task_dead")
Cc: stable@kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-05-11 08:55:11 -07:00
Linus Torvalds
515186b7be bpf-fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmn/wmgACgkQ6rmadz2v
 bTosmhAAgYkQLg7zVQdruoSYb7Vzjz1Di4tM2rBXNIX4S7dvfZUGGBNzFV1lWobk
 /r6269llSnPKXofs+69LDVCpdvUXmGRmS7+bq+bxV7WVmg7JruVOTWg839jValJK
 cY3IQi0lZ9GVKaePI5C2XxBS3rCrdQmby91fcfp5C6A/gR6m7PzAlnoIuJ2SQx6A
 7tsxxJb4wRtFWPBp7ClbBo7MAMIzPse/6CzsA2eP+icyJC+De9WGYs6bTDNi7vpY
 +eul0HMyHLTszJe/AGrsu5Ky3S6l+CTydi1fAUSOnk1pYHHhRvvD2WV8ix05/0rO
 2looZl6ogpcisCm1i8HN8g1ST0tS74x3bL9kjvB/hhKGh6K1QpU6/drEvmJqKMAu
 fspYHD3qO+OXN7EV7tFZ1ErJvJZ7zT7UP0JxirAK1DFQZWrki/tJKehSD6gbir8R
 GwwZctXDOPTGADBsdqbxEPEAp1gVTvDXf04k6GOCLkzqqYBMVKdW/8GXN+6Itr+O
 nxxoC0SOOkW7rRlJaxuJd5+kpaCKOuK9FaXWONOn7HPzBgK0E0CL9g3+cZcS1QvI
 2/5utfFj0gMeo40ZDjCyDWXm7w+AnTSKMMapB5pyi0FY3AVtroSV88HNbpm7DJrs
 xp9jO5ZD6EQ9Wn1cufOYAkrgZYwTZL5Z2EqyKcoJUIk1ZjpQbXg=
 =x/fg
 -----END PGP SIGNATURE-----

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull bpf fixes from Alexei Starovoitov:

 - Fix sk_local_storage diag dump via netlink (Amery Hung)

 - Fix off-by-one in arena direct-value access (Junyoung Jang)

 - Reject TCP_NODELAY in bpf-tcp congestion control (KaFai Wan)

 - Fix type confusion in bpf_*_sock() (Kuniyuki Iwashima)

 - Reject TX-only AF_XDP sockets (Linpu Yu)

 - Don't run arg-tracking analysis twice on main subprog (Paul Chaignon)

 - Fix NULL pointer dereference in bpf_sk_storage_clone and fib lookup
   (Weiming Shi)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Fix off-by-one boundary validation in arena direct-value access
  xskmap: reject TX-only AF_XDP sockets
  bpf: Don't run arg-tracking analysis twice on main subprog
  bpf: Free reuseport cBPF prog after RCU grace period.
  bpf: tcp: Fix type confusion in sol_tcp_sockopt().
  bpf: tcp: Fix type confusion in bpf_skc_to_tcp6_sock().
  bpf: tcp: Fix type confusion in bpf_skc_to_tcp_sock().
  mptcp: bpf: Fix type confusion in bpf_mptcp_sock_from_subflow()
  selftest: bpf: Add test for bpf_tcp_sock() and RAW socket.
  bpf: tcp: Fix type confusion in bpf_tcp_sock().
  tools/headers: Regenerate stddef.h to fix BPF selftests
  bpf: Fix sk_local_storage diag dumping uninitialized special fields
  bpf: Fix NULL pointer dereference in bpf_skb_fib_lookup()
  sockmap: Fix sk_psock_drop() race vs sock_map_{unhash,close,destroy}().
  bpf: Fix NULL pointer dereference in bpf_sk_storage_clone and diag paths
  selftests/bpf: Verify bpf-tcp-cc rejects TCP_NODELAY
  selftests/bpf: Test TCP_NODELAY in TCP hdr opt callbacks
  bpf: Reject TCP_NODELAY in bpf-tcp-cc
  bpf: Reject TCP_NODELAY in TCP header option callbacks
2026-05-09 18:42:54 -07:00
Junyoung Jang
3ac1a467e3 bpf: Fix off-by-one boundary validation in arena direct-value access
BPF_MAP_TYPE_ARENA accepts BPF_PSEUDO_MAP_VALUE offsets at exactly
the end of the arena mapping (off == arena_size). The boundary check
in arena_map_direct_value_addr() uses `>` instead of `>=`, which
incorrectly allows a one-past-end pointer to be accepted.

Change the condition to `>=` to correctly reject offsets that fall
outside the valid arena user_vm range.

Fixes: 317460317a ("bpf: Introduce bpf_arena.")
Signed-off-by: Junyoung Jang <graypanda.inzag@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260426172505.1947915-1-graypanda.inzag@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-05-09 16:18:39 -07:00
Paul Chaignon
512809bb8a bpf: Don't run arg-tracking analysis twice on main subprog
Because subprog 0, the main subprog, is considered a global function,
we end up running the arg-tracking dataflow analysis twice on it. That
results in slightly longer verification but mostly in more verbose
verifier logs. This patch fixes it by keeping only the iteration over
global subprogs.

When running over all of Cilium's programs with BPF_LOG_LEVEL2, this
reduces verbosity by ~20% on average.

Fixes: bf0c571f7f ("bpf: introduce forward arg-tracking dataflow analysis")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/e4d7b53d4963ef520541a782f5fc8108a168877c.1778176504.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-05-09 16:12:40 -07:00
Linus Torvalds
6e1e5a33e8 Fix CPU hotplug activation race in the timer migration code,
by Frederic Weisbecker.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmn+lQYRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iNhhAAmzI4kh9kulol1M73u+LiQTdx0zMcDcSM
 NiHeZ+QbNNMqkfnj6xxOVGViPLTVSXpa+dU4hdeHM3n7vYnlGDiHBm4EM95z7S28
 Jk39z6Wq/0vmR938rhVVNgMbyx3N+rHV15ReyY88vq6OcI4SdJ9sin72ZW8piuBi
 EeHZLOyPsz+qf4tG6UFHP3uzM7fQZ9DrVxzcDZE4+xbVpk1kXLuAGo89wXF0GbUX
 NhUYtHakvARehLg3N+0qo/DUGRXYp1GF+qfIXH7V6g91eMhPureUrhzsKVmbu5BL
 OXMMskWG20LhySrLX+fKDk8DH3n3G+dTFCgM5kc6JaR4vLT0prEJHeVsIX+ydOLF
 sZqqkaVhKvFNlxjrqbjRNBV09/Cd9XN1wYUqnBeh55qzB+yoOnfS9+z7dsguf38K
 YKiCmlt2T68NL5/ZUvXF3ghm5ic7dTyZFUyM/abo7rSTeGByh5vruD19r2zp0C1/
 hgSPmAm8fuJAzQCRUzXxDsY5hLhKtv+1h9+Y9humakzfxUdneydsHigz5YjZ6tzI
 E8LyLsBZnSB6IKZfVetwpg0xTxB75GtSwa+p0XUg2+q2ss94mYmWMFM+ZKWxaTYq
 yI5paLXZT1DO9Fx/4gsCDlUrR1y096GL6O/1t3TOUsoJ/9kEFvOL/ZF4sjEnWrch
 CYI+DPoUOUA=
 =UHOI
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2026-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fix from Ingo Molnar:
 "Fix CPU hotplug activation race in the timer migration code, by
  Frederic Weisbecker"

* tag 'timers-urgent-2026-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers/migration: Fix another hotplug activation race
2026-05-08 20:03:39 -07:00
Linus Torvalds
7f00232152 Miscellaneous scheduler fixes:
- Fix spurious failures in rseq self-tests (Mark Brown)
 
  - Fix rseq rseq::cpu_id_start ABI regression due to
    TCMalloc's creative use of the supposedly read-only field.
    The fix is to introduce a new ABI variant based on
    a new (larger) rseq area registration size, to keep
    the TCMalloc use of rseq backwards compatible on new kernels.
    (Thomas Gleixner)
 
  - Fix wakeup_preempt_fair() for not waking up task (Vincent Guittot)
 
  - Fix s64 mult overflow in vruntime_eligible() (Zhan Xusheng)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmn+lBARHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iy1hAAunlBoDq8/MXSt4JeMRX/3p+CKihExTnO
 LO535Rv8DfcepBgZIysTJKMn9bM/l+7OXGdQ+YDjS70GsLM2aOzDYBKCwOHHm0pZ
 OJ6Y+UUFAacnQS4EuQLqyNBW0Ice4AIYWu0pLLADs2KUgX1DmSo9bhgZbHcbsMnA
 IjoaFNhebeA1bHSDD11UIHTza23mqEinxM0yOK8pT+M6fOMXWOo/kLLLYjG/yAIB
 qBFGpwJkdKjBcpCmAYU9jpw26p/17YMzkgmAaUXOKRLZi+h5zQMNVjR+OIjK4qxt
 z5Tj+h7t3IcFV2d1zUThPpxxHLn3ro30R5mW0OrsPPFI8AkSRC6GsIX/Ft9uFbQQ
 1SGknyx5qLrSldmT8KKXPlM/vriyh3iL6/QMgXtTb8FfegRCbjXsZy39s3wOexCD
 oBnJt0rX3NviPyb/Up9cfdx++kPfJM074NdVHRBW4ucoOpwHosNcuBP0YQsctFw0
 QO7lkfjTo19eB9ftSyfadwF9e+2jYF7YoLmTOKqGLbZEQeIT5tJHDgHklLWrqHAh
 HDmDCHyqtiXBDCeapqnomrhqczKlSUK1Qqk/Hh+Hwiwj0N4vLhHXmRIpTb/R43/Y
 i6cdwV8Xtl+RltmYaINpMxdnFl/iz/kpapYkqy/ykLNfBj01AWbdaIVPRk47ndU1
 E29BSjMZhPU=
 =2QJp
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2026-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar:

 - Fix spurious failures in rseq self-tests (Mark Brown)

 - Fix rseq rseq::cpu_id_start ABI regression due to TCMalloc's creative
   use of the supposedly read-only field

   The fix is to introduce a new ABI variant based on a new (larger)
   rseq area registration size, to keep the TCMalloc use of rseq
   backwards compatible on new kernels (Thomas Gleixner)

 - Fix wakeup_preempt_fair() for not waking up task (Vincent Guittot)

 - Fix s64 mult overflow in vruntime_eligible() (Zhan Xusheng)

* tag 'sched-urgent-2026-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix wakeup_preempt_fair() for not waking up task
  sched/fair: Fix overflow in vruntime_eligible()
  selftests/rseq: Expand for optimized RSEQ ABI v2
  rseq: Reenable performance optimizations conditionally
  rseq: Implement read only ABI enforcement for optimized RSEQ V2 mode
  selftests/rseq: Validate legacy behavior
  selftests/rseq: Make registration flexible for legacy and optimized mode
  selftests/rseq: Skip tests if time slice extensions are not available
  rseq: Revert to historical performance killing behaviour
  rseq: Don't advertise time slice extensions if disabled
  rseq: Protect rseq_reset() against interrupts
  rseq: Set rseq::cpu_id_start to 0 on unregistration
  selftests/rseq: Don't run tests with runner scripts outside of the scripts
2026-05-08 19:42:10 -07:00
Linus Torvalds
e5cf0260a7 Miscellaneous perf events fixes:
- Fix deadlock in the perf_mmap() failure path (Peter Zijlstra)
 
  - Intel ACR (Auto Counter Reload) fixes (Dapeng Mi):
 
    - Fix validation and configuration of ACR masks
    - Fix ACR rescheduling bug causing stale masks
    - Disable the PMI on ACR-enabled hardware
    - Enable ACR on Panther Cover uarch too
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmn+kB0RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1ib2RAAkh19iPRq7tHCXvDYJEHztpAuioyaKznw
 57pNlzG+/K/gg/yZ7uoCtXvogEHgDtart9ZtH7CVtZQgky6YfdiBq0g0pNSdOoY5
 O4IB0ZIXIu/FOh1+Q1k3Md5MeuxUz1Jp21Wq+JNK6VkWwfq+oCZ3XJK06+2C45wI
 uUePaEMFZn7VX9WOToZJQZME/+5yQvrgOq+D6gBs+y3UJO5u6kpdoley6fPXRtYV
 hyfBYiutJlcV1dJC9g7Dc6CHrBkaolFTKsRi2RjD658fHmUUMCubsn6lSG9UJqiZ
 CWtNMHJ/k1WBLuPLUaZBa3W0s+mUZ+0E6W3nLRHC2ORRQhAnQKeoDb2IWlVhYTdB
 NmyABPqjwvGfgidMh39aMt8GS4lFZBXGozVNWTZprN56U/jYH/Ol4cJNLJ9Ez5yk
 fzIkljCc5L/ZmxmqjJNBvmJtTpAt/FhN0qKT/k9jksISFE24bzZ0oRg3t051OyXs
 Mndldyl/2EFHA2PBIN2phISTVWh5lewYNBaK0SBbx77DX6NzMdevhdGAvw2cRVT/
 BJvqj+OeBfiaGBNb/lAIsoZCnuMClQi2t4jlKGkmN3n9hbgPyPAsz/WJRDLr9GZ+
 cqQgh7fL80HoqZTfV7tWxKTkDK3AciXXZE+8ntBpGC6CgMmgsJqLmxc60Jzfh2OO
 qGXcodOISag=
 =WNs1
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-2026-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf events fixes from Ingo Molnar:

 - Fix deadlock in the perf_mmap() failure path (Peter Zijlstra)

 - Intel ACR (Auto Counter Reload) fixes (Dapeng Mi):
     - Fix validation and configuration of ACR masks
     - Fix ACR rescheduling bug causing stale masks
     - Disable the PMI on ACR-enabled hardware
     - Enable ACR on Panther Cover uarch too

* tag 'perf-urgent-2026-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel: Enable auto counter reload for DMR
  perf/x86/intel: Disable PMI for self-reloaded ACR events
  perf/x86/intel: Always reprogram ACR events to prevent stale masks
  perf/x86/intel: Improve validation and configuration of ACR masks
  perf/core: Fix deadlock in perf_mmap() failure path
2026-05-08 19:39:18 -07:00
Vincent Guittot
9f6d929ee2 sched/fair: Fix wakeup_preempt_fair() for not waking up task
Make sure to only call pick_next_entity() on an non-empty cfs_rq.

The assumption that p is always enqueued and not delayed, is only true for
wakeup. If p was moved while delayed, pick_next_entity() will dequeue it and
the cfs might become empty. Test if there are still queued tasks before trying
again to determine if p could be the next one to be picked.

There are at least 2 cases:

When cfs becomes idle, it tries to pull tasks but if those pulled tasks are
delayed, they will be dequeued when attached to cfs. attach_tasks() ->
attach_task() -> wakeup_preempt(rq, p, 0);

A misfit task running on cfs A triggers a load balance to be pulled on a better
cpu, the load balance on cfs B starts an active load balance to pulled the
running misfit task. If there is a delayed dequeue task on cfs A, it can be
pulled instead of the previously running misfit task. attach_one_task() ->
attach_task() -> wakeup_preempt(rq, p, 0);

Fixes: ac8e69e693 ("sched/fair: Fix wakeup_preempt_fair() vs delayed dequeue")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260503104503.1732682-1-vincent.guittot@linaro.org
2026-05-06 17:41:18 +02:00
Zhan Xusheng
b6eee96843 sched/fair: Fix overflow in vruntime_eligible()
Zhan Xusheng reported running into sporadic a s64 mult overflow in
vruntime_eligible().

When constructing a worst case scenario:

If you have cgroups, then you can have an entity of weight 2 (per
calc_group_shares()), and its vlag should then be bounded by: (slice+TICK_NSEC)
* NICE_0_LOAD, which is around 44 bits as per the comment on entity_key().

The other extreme is 100*NICE_0_LOAD, thus you get:

{key, weight}[] := {
  puny: { (slice + TICK_NSEC) * NICE_0_LOAD, 2               },
  max:  { 0,                                 100*NICE_0_LOAD },
}

The avg_vruntime() would end up being very close to 0 (which is
zero_vruntime), so no real help making that more accurate.

vruntime_eligible(puny) ends up with:

 avg  = 2 * puny.key (+ 0)
 load = 2 + 100 * NICE_0_LOAD

 avg >= puny.key * load

And that is: (slice + TICK_NSEC) * NICE_0_LOAD * NICE_0_LOAD * 100, which will
overflow s64.

Zhan suggested using __builtin_mul_overflow(), however after staring at
compiler output for various architectures using godbolt, it seems that using an
__int128 multiplication often results in better code.

Specifically, a number of architectures already compute the __int128 product to
determine the overflow. Eg. arm64 already has the 'smulh' instruction used. By
explicitly doing an __int128 multiply, it will emit the 'mul; smulh' pattern,
which modern cores can fuse (armv8-a clang-22.1.0). x86_64 has less branches
(no OF handling).

Since Linux has ARCH_SUPPORTS_INT128 to gate __int128 usage, also provide the
__builtin_mul_overflow() variant as a fallback.

[peterz: Changelog and __int128 bits]
Fixes: 556146ce5e ("sched/fair: Avoid overflow in enqueue_entity()")
Reported-by: Zhan Xusheng <zhanxusheng1024@gmail.com>
Closes: https://patch.msgid.link/20260415145742.10359-1-zhanxusheng%40xiaomi.com
Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260505103155.GN3102924%40noisy.programming.kicks-ass.net
2026-05-06 17:41:17 +02:00
Thomas Gleixner
99428157dc rseq: Reenable performance optimizations conditionally
Due to the incompatibility with TCMalloc the RSEQ optimizations and
extended features (time slice extensions) have been disabled and made
run-time conditional.

The original RSEQ implementation, which TCMalloc depends on, registers a 32
byte region (ORIG_RSEG_SIZE). This region has a 32 byte alignment
requirement.

The extension safe newer variant exposes the kernel RSEQ feature size via
getauxval(AT_RSEQ_FEATURE_SIZE) and the alignment requirement via
getauxval(AT_RSEQ_ALIGN). The alignment requirement is that the registered
RSEQ region is aligned to the next power of two of the feature size. The
kernel currently has a feature size of 33 bytes, which means the alignment
requirement is 64 bytes.

The TCMalloc RSEQ region is embedded into a cache line aligned data
structure starting at offset 32 bytes so that bytes 28-31 and the
cpu_id_start field at bytes 32-35 form a 64-bit little endian pointer with
the top-most bit (63 set) to check whether the kernel has overwritten
cpu_id_start with an actual CPU id value, which is guaranteed to not have
the top most bit set.

As this is part of their performance tuned magic, it's a pretty safe
assumption, that TCMalloc won't use a larger RSEQ size.

This allows the kernel to declare that registrations with a size greater
than the original size of 32 bytes, which is the cases since time slice
extensions got introduced, as RSEQ ABI v2 with the following differences to
the original behaviour:

  1) Unconditional updates of the user read only fields (CPU, node, MMCID)
     are removed. Those fields are only updated on registration, task
     migration and MMCID changes.

  2) Unconditional evaluation of the criticial section pointer is
     removed. It's only evaluated when user space was interrupted and was
     scheduled out or before delivering a signal in the interrupted
     context.

  3) The read/only requirement of the ID fields is enforced. When the
     kernel detects that userspace manipulated the fields, the process is
     terminated. This ensures that multiple entities (libraries) can
     utilize RSEQ without interfering.

  4) Todays extended RSEQ feature (time slice extensions) and future
     extensions are only enabled in the v2 enabled mode.

Registrations with the original size of 32 bytes operate in backwards
compatible legacy mode without performance improvements and extended
features.

Unfortunately that also affects users of older GLIBC versions which
register the original size of 32 bytes and do not evaluate the kernel
required size in the auxiliary vector AT_RSEQ_FEATURE_SIZE.

That's the result of the lack of enforcement in the original implementation
and the unwillingness of a single entity to cooperate with the larger
ecosystem for many years.

Implement the required registration changes by restructuring the spaghetti
code and adding the size/version check. Also add documentation about the
differences of legacy and optimized RSEQ V2 mode.

Thanks to Mathieu for pointing out the ORIG_RSEQ_SIZE constraints!

Fixes: d6200245c7 ("rseq: Allow registering RSEQ with slice extension")
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Link: https://patch.msgid.link/20260428224427.927160119%40kernel.org
Cc: stable@vger.kernel.org
2026-05-06 17:40:27 +02:00
Thomas Gleixner
82f572449c rseq: Implement read only ABI enforcement for optimized RSEQ V2 mode
The optimized RSEQ V2 mode requires that user space adheres to the ABI
specification and does not modify the read-only fields cpu_id_start,
cpu_id, node_id and mm_cid behind the kernel's back.

While the kernel does not rely on these fields, the adherence to this is a
fundamental prerequisite to allow multiple entities, e.g. libraries, in an
application to utilize the full potential of RSEQ without stepping on each
other toes.

Validate this adherence on every update of these fields. If the kernel
detects that user space modified the fields, the application is force
terminated.

Fixes: d6200245c7 ("rseq: Allow registering RSEQ with slice extension")
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Link: https://patch.msgid.link/20260428224427.845230956%40kernel.org
Cc: stable@vger.kernel.org
2026-05-06 17:40:15 +02:00
Frederic Weisbecker
bd3c45dd01 timers/migration: Fix another hotplug activation race
The hotplug control CPU is assumed to be active in the hierarchy but
that doesn't imply that the root is active. If the current CPU is not
the one that activated the current hierarchy, and the CPU performing
this duty is still halfway through the tree, the root may still be
observed inactive. And this can break the activation of a new root as in
the following scenario:

1) Initially, the whole system has 64 CPUs and only CPU 63 is awake.

                   [GRP1:0]
                    active
                  /    |    \
                 /     |     \
         [GRP0:0]    [...]    [GRP0:7]
           idle      idle      active
         /   |   \               |
     CPU 0  CPU 1  ...         CPU 63
     idle   idle               active

2) CPU 63 goes idle _but_ due to a #VMEXIT it hasn't yet reached the
   [GRP1:0]->parent dereference (that would be NULL and stop the walk)
   in __walk_groups_from().

                   [GRP1:0]
                     idle
                  /    |    \
                 /     |     \
         [GRP0:0]    [...]    [GRP0:7]
           idle      idle       idle
         /   |   \                |
     CPU 0  CPU 1  ...         CPU 63
     idle   idle                idle

3) CPU 1 wakes up, activates GRP0:0 but didn't yet manage to propagate
   up to GRP1:0 due to yet another #VMEXIT.

                   [GRP1:0]
                     idle
                  /    |    \
                 /     |     \
         [GRP0:0]    [...]    [GRP0:7]
         active      idle       idle
         /   |   \                |
     CPU 0  CPU 1  ...         CPU 63
     idle  active               idle

3) CPU 0 wakes up and doesn't need to walk above GRP0:0 as it's CPU 1
   role.

                   [GRP1:0]
                     idle
                  /    |    \
                 /     |     \
         [GRP0:0]    [...]    [GRP0:7]
         active      idle       idle
         /   |   \                |
     CPU 0  CPU 1  ...         CPU 63
    active  active              idle

4) CPU 0 boots CPU 64. It creates a new root for it.

                             [GRP2:0]
                               idle
                           /          \
                          /            \
                   [GRP1:0]           [GRP1:1]
                   idle                 idle
                  /    |    \                \
                 /     |     \                \
         [GRP0:0]    [...]    [GRP0:7]      [GRP0:8]
         active      idle       idle          idle
         /   |   \                |            |
     CPU 0  CPU 1  ...         CPU 63        CPU 64
    active  active              idle         offline

5) CPU 0 activates the new root, but note that GRP1:0 is still idle,
   waiting for CPU 1 to resume from #VMEXIT and activate it.

                             [GRP2:0]
                              active
                           /          \
                          /            \
                   [GRP1:0]           [GRP1:1]
                   idle                 idle
                  /    |    \                \
                 /     |     \                \
         [GRP0:0]    [...]    [GRP0:7]      [GRP0:8]
         active      idle       idle          idle
         /   |   \                |            |
     CPU 0  CPU 1  ...         CPU 63        CPU 64
    active  active              idle         offline

6) CPU 63 resumes after #VMEXIT and sees the new GRP1:0 parent.
   Therefore it propagates the stale inactive state of GRP1:0 up to
   GRP2:0.

                             [GRP2:0]
                              idle
                           /          \
                          /            \
                   [GRP1:0]           [GRP1:1]
                   idle                 idle
                  /    |    \                \
                 /     |     \                \
         [GRP0:0]    [...]    [GRP0:7]      [GRP0:8]
         active      idle       idle          idle
         /   |   \                |            |
     CPU 0  CPU 1  ...         CPU 63        CPU 64
    active  active              idle         offline

7) CPU 1 resumes after #VMEXIT and finally activates GRP1:0. But it
   doesn't observe its parent link because no ordering enforced that.
   Therefore GRP2:0 is spuriously left idle.

                             [GRP2:0]
                              idle
                           /          \
                          /            \
                   [GRP1:0]           [GRP1:1]
                   active                 idle
                  /    |    \                \
                 /     |     \                \
         [GRP0:0]    [...]    [GRP0:7]      [GRP0:8]
         active      idle       idle          idle
         /   |   \                |            |
     CPU 0  CPU 1  ...         CPU 63        CPU 64
    active  active              idle         offline

Such races are highly theoretical and the problem would solve itself
once the old root ever becomes idle again. But it still leaves a taste
of discomfort.

Fix it with enforcing a fully ordered atomic read of the old root state
before propagating the activate state up to the new root. It has a two
directions ordering effect:

* Acquire + release of the latest old root state: If the hotplug control
  CPU is not the one that woke up the old root, make sure to acquire its
  active state and propagate it upwards through the ordered chain of
  activation (the acquire pairs with the cmpxchg() in tmigr_active_up()
  and subsequent releases will pair with atomic_read_acquire() and
  smp_mb__after_atomic() in tmigr_inactive_up()).

* Release: If the hotplug control CPU is not the one that must wake up
  the old root, but the CPU covering that is lagging behind its duty,
  publish the links from the old root to the new parents. This way the
  lagging CPU will propagate the active state itself.

Fixes: 7ee9887703 ("timers: Implement the hierarchical pull model")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260423165354.95152-2-frederic@kernel.org
2026-05-06 08:21:12 +02:00
Linus Torvalds
74fe02ce12 workqueue: Fixes for v7.1-rc2
- Fix devm_alloc_workqueue() passing a va_list as a positional arg to the
   variadic alloc_workqueue() macro, which garbled wq->name and skipped
   lockdep init on the devm path. Fold both noprof entry points onto a
   va_list helper. Also, annotate __printf(1, 0).
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCafpheA4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGeBeAQDKq3XmN45CaOB76k9wSmRYqVgTSoWWTL83O8km
 nC2UhQEAiJBfqhP586coXyK6saaZX9QwFdPzTcB72LhZSAj13gQ=
 =Zixt
 -----END PGP SIGNATURE-----

Merge tag 'wq-for-7.1-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

Pull workqueue fixes from Tejun Heo:

 - Fix devm_alloc_workqueue() passing a va_list as a positional arg to
   the variadic alloc_workqueue() macro, which garbled wq->name and
   skipped lockdep init on the devm path. Fold both noprof entry points
   onto a va_list helper.

   Also, annotate it using __printf(1, 0)

* tag 'wq-for-7.1-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Annotate alloc_workqueue_va() with __printf(1, 0)
  workqueue: fix devm_alloc_workqueue() va_list misuse
2026-05-05 16:09:31 -07:00
Linus Torvalds
11f00074f7 cgroup: Fixes for v7.1-rc2
- During v6.19, cgroup task unlink was moved from do_exit() to after the
   final task switch to satisfy a controller invariant. That left the kernel
   seeing tasks past exit_signals() longer than userspace expected, and
   several v7.0 follow-ups tried to bridge the gap by making rmdir wait for
   the kernel side. None held up. The latest is an A-A deadlock when rmdir
   is invoked by the reaper of zombies whose pidns teardown the rmdir itself
   is waiting on, which points at the synchronizing approach being
   fundamentally wrong:
 
   - Take a different approach: drop the wait, leave rmdir's user-visible
     side returning as soon as cgroup.procs is empty, and defer the css
     percpu_ref kill that drives ->css_offline() until the cgroup is fully
     depopulated.
 
   - Tagged for stable. Somewhat invasive but contained. The hope is that
     fixing forward sticks. If not, the fallback is to revert the entire
     chain and rework on the development branch.
 
   - Doesn't plug a pre-existing analogous race in
     cgroup_apply_control_disable() (controller disable via subtree_control).
     Not a regression. The development branch will do the more invasive
     restructuring needed for that.
 
 - Documentation update for cgroup-v1 charge-commit section that still
   referenced functions removed when the memcg hugetlb try-commit-cancel
   protocol was retired.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCafphbw4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGbydAQDxlEIeJPdJlwbU6X4PBW/7DYeDHABG7OdrFf5K
 Fq4ECAD/ZHsFyCNEOcZym6t2/FCZR0xbaPGQibLt3er6AkLRFwM=
 =3Jra
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-7.1-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fixes from Tejun Heo:

 - During v6.19, cgroup task unlink was moved from do_exit() to after the
   final task switch to satisfy a controller invariant. That left the kernel
   seeing tasks past exit_signals() longer than userspace expected, and
   several v7.0 follow-ups tried to bridge the gap by making rmdir wait for
   the kernel side. None held up.

   The latest is an A-A deadlock when rmdir is invoked by the reaper of
   zombies whose pidns teardown the rmdir itself is waiting on, which
   points at the synchronizing approach being fundamentally wrong.

   Take a different approach: drop the wait, leave rmdir's user-visible
   side returning as soon as cgroup.procs is empty, and defer the css
   percpu_ref kill that drives ->css_offline() until the cgroup is fully
   depopulated.

   Tagged for stable. Somewhat invasive but contained. The hope is that
   fixing forward sticks. If not, the fallback is to revert the entire
   chain and rework on the development branch.

   Note that this doesn't plug a pre-existing analogous race in
   cgroup_apply_control_disable() (controller disable via
   subtree_control). Not a regression. The development branch will do
   the more invasive restructuring needed for that.

 - Documentation update for cgroup-v1 charge-commit section that still
   referenced functions removed when the memcg hugetlb try-commit-cancel
   protocol was retired.

* tag 'cgroup-for-7.1-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  docs: cgroup-v1: Update charge-commit section
  cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated
2026-05-05 15:43:32 -07:00
Linus Torvalds
de95ad90fb sched_ext: Fixes for v7.1-rc2
- Fix idle CPU selection returning prev_cpu outside the task's cpus_ptr
   when the BPF caller's allowed mask was wider. Stable backport.
 
 - Two opposite-direction gaps in scx_task_iter's cgroup-scoped mode
   versus the global mode:
 
   - Tasks past exit_signals() are filtered by the cgroup walk but kept by
     global. Sub-scheduler enable abort leaked __scx_init_task() state.
     Add a CSS_TASK_ITER_WITH_DEAD flag to cgroup's task iterator
     (scx_task_iter is its only user) and use it.
 
   - Tasks past sched_ext_dead() are still returned, tripping
     WARN_ON_ONCE() in callers or making them touch torn-down state. Mark
     and skip under the per-task rq lock.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCafphXA4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGbI/AP4nRHDusUuYDSFBLyHODvLZXfMM3Nb0yzS7euQJ
 qvx6OQEA1p5AyRWA2apFvKjjQrl1dOb5vUlro1Fj8VF51X7Spwc=
 =olGB
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-7.1-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

 - Fix idle CPU selection returning prev_cpu outside the task's cpus_ptr
   when the BPF caller's allowed mask was wider. Stable backport.

 - Two opposite-direction gaps in scx_task_iter's cgroup-scoped mode
   versus the global mode:

    - Tasks past exit_signals() are filtered by the cgroup walk but kept
      by global. Sub-scheduler enable abort leaked __scx_init_task()
      state. Add a CSS_TASK_ITER_WITH_DEAD flag to cgroup's task
      iterator (scx_task_iter is its only user) and use it.

    - Tasks past sched_ext_dead() are still returned, tripping
      WARN_ON_ONCE() in callers or making them touch torn-down state.
      Mark and skip under the per-task rq lock.

* tag 'sched_ext-for-7.1-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: idle: Recheck prev_cpu after narrowing allowed mask
  sched_ext: Skip past-sched_ext_dead() tasks in scx_task_iter_next_locked()
  cgroup, sched_ext: Include exiting tasks in cgroup iter
2026-05-05 15:22:04 -07:00
Thomas Gleixner
b9eac6a9d9 rseq: Revert to historical performance killing behaviour
The recent RSEQ optimization work broke the TCMalloc abuse of the RSEQ ABI
as it not longer unconditionally updates the CPU, node, mm_cid fields,
which are documented as read only for user space. Due to the observed
behavior of the kernel it was possible for TCMalloc to overwrite the
cpu_id_start field for their own purposes and rely on the kernel to update
it unconditionally after each context switch and before signal delivery.

The RSEQ ABI only guarantees that these fields are updated when the data
changes, i.e. the task is migrated or the MMCID of the task changes due to
switching from or to per CPU ownership mode.

The optimization work eliminated the unconditional updates and reduced them
to the documented ABI guarantees, which results in a massive performance
win for syscall, scheduling heavy work loads, which in turn breaks the
TCMalloc expectations.

There have been several options discussed to restore the TCMalloc
functionality while preserving the optimization benefits. They all end up
in a series of hard to maintain workarounds, which in the worst case
introduce overhead for everyone, e.g. in the scheduler.

The requirements of TCMalloc and the optimization work are diametral and
the required work arounds are a maintainence burden. They end up as fragile
constructs, which are blocking further optimization work and are pretty
much guaranteed to cause more subtle issues down the road.

The optimization work heavily depends on the generic entry code, which is
not used by all architectures yet. So the rework preserved the original
mechanism moslty unmodified to keep the support for architectures, which
handle rseq in their own exit to user space loop. That code is currently
optimized out by the compiler on architectures which use the generic entry
code.

This allows to revert back to the original behaviour by replacing the
compile time constant conditions with a runtime condition where required,
which disables the optimization and the dependend time slice extension
feature until the run-time condition can be enabled in the RSEQ
registration code on a per task basis again.

The following changes are required to restore the original behavior, which
makes TCMalloc work again:

  1) Replace the compile time constant conditionals with runtime
     conditionals where appropriate to prevent the compiler from optimizing
     the legacy mode out

  2) Enforce unconditional update of IDs on context switch for the
     non-optimized v1 mode

  3) Enforce update of IDs in the pre signal delivery path for the
     non-optimized v1 mode

  4) Enforce update of IDs in the membarrier(RSEQ) IPI for the
     non-optimized v1 mode

  5) Make time slice and future extensions depend on optimized v2 mode

This brings back the full performance problems, but preserves the v2
optimization code and for generic entry code using architectures also the
TIF_RSEQ optimization which avoids a full evaluation of the exit to user
mode loop in many cases.

Fixes: 566d8015f7 ("rseq: Avoid CPU/MM CID updates when no event pending")
Reported-by: Mathias Stearn <mathias@mongodb.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Closes: https://lore.kernel.org/CAHnCjA25b+nO2n5CeifknSKHssJpPrjnf+dtr7UgzRw4Zgu=oA@mail.gmail.com
Link: https://patch.msgid.link/20260428224427.517051752%40kernel.org
Cc: stable@vger.kernel.org
2026-05-05 16:02:57 +02:00
Peter Zijlstra
c69df06e4e perf/core: Fix deadlock in perf_mmap() failure path
Ian noted that commit 77de62ad3d ("perf/core: Fix refcount bug and
potential UAF in perf_mmap") would cause a deadlock due to
event->mmap_mutex recursion.

This happens because we're now calling perf_mmap_close() under
mmap_mutex, while that function itself can also take mmap_mutex.

Solve this by noting that perf_mmap_close() is far more complicated
than we need at this particular point, since it deals with scenarios
that cannot happen in this particular case.

Replace the call to perf_mmap_close() with a very narrow undo for the
case of first-exposure. If this is not the first mmap(), there is no
race and it is fine to drop the lock and call perf_mmap_close() to
handle to more complicated scenarios.

Note: move the rb->mmap_user (namespace) handling into the rb
init/free code such that it does not complicate the mmap handling.

Fixes: 77de62ad3d ("perf/core: Fix refcount bug and potential UAF in perf_mmap")
Reported-by: Ian Rogers <irogers@google.com>
Closes: https://patch.msgid.link/CAP-5%3DfVJyVMZw%3DDqP53Kxg58nUmJ_0bxoaeOKAbC03BVc11HaA%40mail.gmail.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260326112821.GK3738786@noisy.programming.kicks-ass.net
2026-05-05 12:47:20 +02:00
David Carlier
b34c82777a sched_ext: idle: Recheck prev_cpu after narrowing allowed mask
scx_select_cpu_dfl() narrows @allowed to @cpus_allowed & @p->cpus_ptr
when the BPF caller supplies a @cpus_allowed that differs from
@p->cpus_ptr and @p doesn't have full affinity. However,
@is_prev_allowed was computed against the original (wider)
@cpus_allowed, so the prev_cpu fast paths could pick a @prev_cpu that
is in @cpus_allowed but not in @p->cpus_ptr, violating the intended
invariant that the returned CPU is always usable by @p. The kernel
masks this via the SCX_EV_SELECT_CPU_FALLBACK fallback, but the
behavior contradicts the documented contract.

Move the @is_prev_allowed evaluation past the narrowing block so it
tests against the final @allowed mask.

Fixes: ee9a4e9279 ("sched_ext: idle: Properly handle invalid prev_cpu during idle selection")
Cc: stable@vger.kernel.org # v6.16+
Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: David Carlier <devnexen@gmail.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-04 11:01:04 -10:00
Tejun Heo
ff9eda4ea9 sched_ext: Skip past-sched_ext_dead() tasks in scx_task_iter_next_locked()
scx_task_iter's cgroup-scoped mode can return tasks whose
sched_ext_dead() has already completed: cgroup_task_dead() removes
from cset->tasks after sched_ext_dead() in finish_task_switch() and is
irq-work deferred on PREEMPT_RT. The global mode is fine -
sched_ext_dead() removes from scx_tasks via list_del_init() first.

Callers (sub-sched enable prep/abort/apply, scx_sub_disable(),
scx_fail_parent()) assume returned tasks are still on @sch and trip
WARN_ON_ONCE() or operate on torn-down state otherwise.

Set %SCX_TASK_OFF_TASKS in sched_ext_dead() under @p's rq lock and
have scx_task_iter_next_locked() skip flagged tasks under the same
lock. Setter and reader serialize on the per-task rq lock - no race.

Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-04 09:06:03 -10:00
Tejun Heo
60f21a2649 cgroup, sched_ext: Include exiting tasks in cgroup iter
a72f73c4dd ("cgroup: Don't expose dead tasks in cgroup") made
css_task_iter_advance() skip exiting tasks so cgroup.procs stays consistent
with waitpid() visibility. Unfortunately, this broke scx_task_iter.

scx_task_iter walks either scx_tasks (global) or a cgroup subtree via
css_task_iter() and the two modes are expected to cover the same set of
tasks. After the above change the cgroup-scoped mode silently skips tasks
past exit_signals() that are still on scx_tasks.

scx_sub_enable_workfn()'s abort path is one of the symptoms: an exiting
SCX_TASK_SUB_INIT task can race past the cgroup iter leaking
__scx_init_task() state. Other iterations share the same gap.

Add CSS_TASK_ITER_WITH_DEAD to opt out of the skip and use it from
scx_task_iter().

Fixes: b0e4c2f8a0 ("sched_ext: Implement cgroup subtree iteration for scx_task_iter")
Reported-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-04 09:06:03 -10:00
Tejun Heo
93618edf75 cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated
A chain of commits going back to v7.0 reworked rmdir to satisfy the
controller invariant that a subsystem's ->css_offline() must not run while
tasks are still doing kernel-side work in the cgroup.

[1] d245698d72 ("cgroup: Defer task cgroup unlink until after the task is done switching out")
[2] a72f73c4dd ("cgroup: Don't expose dead tasks in cgroup")
[3] 1b164b876c ("cgroup: Wait for dying tasks to leave on rmdir")
[4] 4c56a8ac68 ("cgroup: Fix cgroup_drain_dying() testing the wrong condition")
[5] 13e786b64b ("cgroup: Increment nr_dying_subsys_* from rmdir context")

[1] moved task cset unlink from do_exit() to finish_task_switch() so a
task's cset link drops only after the task has fully stopped scheduling.
That made tasks past exit_signals() linger on cset->tasks until their final
context switch, which led to a series of problems as what userspace expected
to see after rmdir diverged from what the kernel needs to wait for. [2]-[5]
tried to bridge that divergence: [2] filtered the exiting tasks from
cgroup.procs; [3] had rmdir(2) sleep in TASK_UNINTERRUPTIBLE for them; [4]
fixed the wait's condition; [5] made nr_dying_subsys_* visible
synchronously.

The cgroup_drain_dying() wait in [3] turned out to be a dead end. When the
rmdir caller is also the reaper of a zombie that pins a pidns teardown (e.g.
host PID 1 systemd reaping orphan pids that were re-parented to it during
the same teardown), rmdir blocks in TASK_UNINTERRUPTIBLE waiting for those
pids to free, the pids can't free because PID 1 is the reaper and it's stuck
in rmdir, and the system A-A deadlocks. No internal lock ordering breaks
this; the wait itself is the bug.

The css killing side that drove the original reorder, however, can be made
cleanly asynchronous: ->css_offline() is already async, run from
css_killed_work_fn() driven by percpu_ref_kill_and_confirm(). The fix is to
make that chain start only after all tasks have left the cgroup. rmdir's
user-visible side then returns as soon as cgroup.procs and friends are
empty, while ->css_offline() still runs only after the cgroup is fully
drained.

Verified by the original reproducer (pidns teardown + zombie reaper, runs
under vng) which hangs vanilla and succeeds here, and by per-commit
deterministic repros for [2], [3], [4], [5] with a boot parameter that
widens the post-exit_signals() window so each state is reliably reachable.
Some stress tests on top of that.

cgroup_apply_control_disable() has the same shape of pre-existing race:
when a controller is disabled via subtree_control, kill_css() ran
synchronously while tasks past exit_signals() could still be linked to
the cgroup's csets, and ->css_offline() could fire before they drained.
This patch preserves the existing synchronous behavior at that call site
(kill_css_sync() + kill_css_finish() back-to-back) and a follow-up patch
will defer kill_css_finish() there using a per-css trigger.

This seems like the right approach and I don't see problems with it. The
changes are somewhat invasive but not excessively so, so backporting to
-stable should be okay. If something does turn out to be wrong, the fallback
is to revert the entire chain ([1]-[5]) and rework in the development branch
instead.

v2: Pin cgrp across the deferred destroy work with explicit
    cgroup_get()/cgroup_put() around queue_work() and the work_fn. v1
    wasn't actually broken (ordered cgroup_offline_wq + queue_work order
    in cgroup_task_dead() saved it) but the explicit ref removes the
    dependency on those non-obvious invariants. Also note the
    pre-existing cgroup_apply_control_disable() race in the description;
    a follow-up will defer kill_css_finish() there.

Fixes: 1b164b876c ("cgroup: Wait for dying tasks to leave on rmdir")
Cc: stable@vger.kernel.org # v7.0+
Reported-and-tested-by: Martin Pitt <martin@piware.de>
Link: https://lore.kernel.org/all/afHNg2VX2jy9bW7y@piware.de/
Link: https://lore.kernel.org/all/35e0670adb4abeab13da2c321582af9f@kernel.org/
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2026-05-04 08:52:26 -10:00
Linus Torvalds
cffcf520fd Fix lockup in requeue-PI during signal/timeout wakeups,
by Sebastian Andrzej Siewior.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmn2/AYRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iQCA//cAsib3cTZr9GAvjX5+Fjf3Dsl4HdO7qr
 XzOeMNtvnz0wcWgNCq02vwutbQwzRRr71DqDzhYV7YGxwyrqw+fE0RMvQDEML3I6
 SI1I8aj1Eo+WNHcy7HajYd0WBiOuMAcSZsa+3kYWKnDZ0GJadbQHTrQo5nT8VgFf
 o+aGU4kivGsKlz+UrcTxViNovenQ3mysuY8Pn3xKFlwn+vnJIwT2WUuQ1U8wK2OY
 edH9O4UEJkPFIOxqhL5+s4J/utsFasEFSLMpx9NpYzOGH85PTxIJg6O+n9a5NTAa
 40tsXlWkDsi/AfiNbWBYOpw8gS2yHyrLuY9CLBuxiRojfS6TePzfJyCPFvLLhBau
 90y02GskoDe4DFox9f+33BykR48yWxGOtxXiFJ1caXH4HsZi5z6Wd3vFCQp61zwm
 RPGKA5k8a9+hlToOpaAwHqA8ODbnEyRKwhG/OdnHo7cKPAWH+2awSSyW30DQoo+o
 mBYcMNbNeZObzQ/DErZvErDpq0hePATn/zfMFNEtXh+0Y1WZWix0NT5atHbtid+w
 0tRaazUpNCpGNp+7xxFGmaHxN/bPCmjZXpTIWIhc6vn8DNx/Y059g3ItYyeiRVkZ
 WD0vWdBgYerK0CfYsllh5d4fiTSLKoILq/f5Zc5Pq/GnVAZe/JQy7v6Duj+HmJ9O
 g9es4fSjzBs=
 =HUBJ
 -----END PGP SIGNATURE-----

Merge tag 'locking-urgent-2026-05-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fix from Ingo Molnar:
 "Fix lockup in requeue-PI during signal/timeout wakeups, by Sebastian
  Andrzej Siewior"

* tag 'locking-urgent-2026-05-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Prevent lockup in requeue-PI during signal/ timeout wakeup
2026-05-03 08:17:09 -07:00
Linus Torvalds
c3cba36b39 Miscellaneous scheduler fixes:
- Fix the delayed dequeue negative lag increase fix in the
    fair scheduler (Peter Zijlstra)
 
  - Fix wakeup_preempt_fair() to do proper delayed dequeue
    (Vincent Guittot)
 
  - Clear sched_entity::rel_deadline when initializing
    forked entities, which bug can cause all tasks to be
    EEVDF-ineligible, causing a NULL pointer dereference
    crash in pick_next_entity() (Zicheng Qu)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmn2+tsRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gDdg//WX8E7dyiwRaFNsPMgEvi4Q7pBoXAb5Zi
 /iYNlpw1/QRG9KF59175CC3cLOVLJ3dA+79EZPS/mQSuukdTxJg6sbPTkneULV3D
 l8DkjH0uDS7mZBqlLDC+Xjqv1y7Y01V6qN9Si8hpR5rP3D0heWdspAGS5aSZ/8Dp
 h/VSYl2R615Z32NaS0Xys2hy4M0/I30Uuw4fJScvIYeAMb4s5+6RHEQmuuo25A3O
 HQg9Ljqi5NQaHwzvHTYjCenh/NENOd/tu/kZzFgW57HJqSXGM7KBqcjaK68q4vCl
 LgBlsfux7RTbnrEAIhGYBSoDss+tBbMKm5qaYZNENJpLS8ptm4J3iKgUJ0W2e3dW
 Zp6IjVkj0E+qC65WnFENXsiDr+/fZ9d71/xq2L4z6SxQNv1mtX2+f/HUYyKU5jCc
 I4NDEBLGbdRVNuPW7esECVIEVRYFR1cPdZigLW8M7JEnr0p0skF1zgnnMtVuK6Ep
 qONYldUIHWdsx67yOSdhykSyq6Qfaew/UKuG1ivlN0BDL4I/AWf+4BVMHJrigeok
 xKD8DRWH6s7fSicfM2aJMmj6nRSR2Zz5a9T3lE4LxSvDm41JEnatqpb2Xhjri+I0
 21slsm4AZmh1xR1kj7sTKxdHJn0E+lNN/XSZP6OcCoNqlr2XEGMRxWpVo/WqJnO6
 HVG9/VoSP1w=
 =z+8E
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2026-05-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar:

 - Fix the delayed dequeue negative lag increase fix in the
   fair scheduler (Peter Zijlstra)

 - Fix wakeup_preempt_fair() to do proper delayed dequeue
   (Vincent Guittot)

 - Clear sched_entity::rel_deadline when initializing
   forked entities, which bug can cause all tasks to be
   EEVDF-ineligible, causing a NULL pointer dereference
   crash in pick_next_entity() (Zicheng Qu)

* tag 'sched-urgent-2026-05-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Clear rel_deadline when initializing forked entities
  sched/fair: Fix wakeup_preempt_fair() vs delayed dequeue
  sched/fair: Fix the negative lag increase fix
2026-05-03 08:05:23 -07:00
Davidlohr Bueso
ee9dce4436 futex: Drop CLONE_THREAD requirement for private default hash alloc
Currently need_futex_hash_allocate_default() depends on strict pthread
semantics, abusing CLONE_THREAD.  This breaks the non-concurrency
assumptions when doing the mm->futex_ref pcpu allocations, leading to
bugs[0] when sharing the mm in other ways; ie:

    BUG: KASAN: slab-use-after-free in futex_hash_put

... where the +1 bias can end up on a percpu counter that mm->futex_ref
no longer points at.

Loosen the check to cover any CLONE_VM clone, except vfork().  Excluding
vfork keeps the existing paths untouched (no overhead), and we can't
race in the first place: either the parent is suspended and the child
runs alone, or mm->futex_ref is already allocated from an earlier
CLONE_VM.

Link: https://lore.kernel.org/all/CAL_bE8LsmCQ-FAtYDuwbJhOkt9p2wwYQwAbMh=PifC=VsiBM6A@mail.gmail.com/ [0]
Fixes: d9b05321e2 ("futex: Move futex_hash_free() back to __mmput()")
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-05-01 13:12:34 -07:00
Thomas Gleixner
010b7723c0 rseq: Don't advertise time slice extensions if disabled
If time slice extensions have been disabled on the kernel command line,
then advertising them in RSEQ flags is wrong.

Adjust the conditionals to reflect reality, fixup the misleading comments
about the gap of these flags and the rseq::flags field.

Fixes: d6200245c7 ("rseq: Allow registering RSEQ with slice extension")
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Link: https://patch.msgid.link/20260428224427.437059375%40kernel.org
Cc: stable@vger.kernel.org
2026-05-01 21:32:20 +02:00
Thomas Gleixner
2cb68e4512 rseq: Set rseq::cpu_id_start to 0 on unregistration
The RSEQ rework changed that to RSEQ_CPU_UNINITILIZED, which is obviously
incompatible. Revert back to the original behavior.

Fixes: 0f085b4188 ("rseq: Provide and use rseq_set_ids()")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Link: https://patch.msgid.link/20260428224427.271566313%40kernel.org
Cc: stable@vger.kernel.org
2026-05-01 21:32:20 +02:00
Linus Torvalds
2b4d0215be 20 hotfixes. All are for MM (and for MMish maintainers). 9 are cc:stable
and the remainder are for post-7.0 issues or aren't deemed suitable for
 backporting.
 
 There's a 2 patch DAMON series from SeongJae Park which address races
 which could lead to use-after-free errors.  And a 3 patch DAMON series
 which avoids the possibility of presenting stale parameter values to
 users.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCafPaVQAKCRDdBJ7gKXxA
 jiUxAQCceUQi6IqADUMhYAsbGcs1LoeMWfiMfbCz2NCoiTXAEwD/S2uqSELRPQQc
 7iW6D7U6dTa3d2kkbnxC02ocekaxiQ4=
 =M/FI
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2026-04-30-15-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM fixes from Andrew Morton:
 "20 hotfixes. All are for MM (and for MMish maintainers). 9 are
  cc:stable and the remainder are for post-7.0 issues or aren't deemed
  suitable for backporting.

  There are two DAMON series from SeongJae Park which address races
  which could lead to use-after-free errors, and avoid the possibility
  of presenting stale parameter values to users"

* tag 'mm-hotfixes-stable-2026-04-30-15-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm: memcontrol: fix rcu unbalance in get_non_dying_memcg_end()
  mm/userfaultfd: detect VMA type change after copy retry in mfill_copy_folio_retry()
  MAINTAINERS: remove stale kdump project URL
  mm/damon/stat: detect and use fresh enabled value
  mm/damon/lru_sort: detect and use fresh enabled and kdamond_pid values
  mm/damon/reclaim: detect and use fresh enabled and kdamond_pid values
  selftests/mm: specify requirement for PROC_MEM_ALWAYS_FORCE=y
  mm/damon/sysfs-schemes: protect path kfree() with damon_sysfs_lock
  mm/damon/sysfs-schemes: protect memcg_path kfree() with damon_sysfs_lock
  MAINTAINERS: update Li Wang's email address
  MAINTAINERS, mailmap: update email address for Qi Zheng
  MAINTAINERS: update Liam's email address
  mm/hugetlb_cma: round up per_node before logging it
  MAINTAINERS: fix regex pattern in CORE MM category
  mm/vma: do not try to unmap a VMA if mmap_prepare() invoked from mmap()
  mm: start background writeback based on per-wb threshold for strictlimit BDIs
  kho: fix error handling in kho_add_subtree()
  liveupdate: fix return value on session allocation failure
  mailmap: update entry for Dan Carpenter
  vmalloc: fix buffer overflow in vrealloc_node_align()
2026-05-01 08:45:23 -07:00
Linus Torvalds
e75a43c7ce tracing fixes for v7.1:
- Fix inverted check of registering the stats for branch tracing
 
   When calling register_stat_tracer() which returns zero on success and
   negative on error, the callers were checking the return of zero as an
   error and printing a warning message. Because this was just a normal
   printk() message and not a WARN(), it wasn't caught in any testing.
 
   Fix the check to print the warning message when an error actually happens.
 
 - Fix a typo in a comment in tracepoint.h
 
 - Limit the size of event probes to 3K in size
 
   It is possible to create a dynamic event probe via the tracefs system that
   is greater than the max size of an event that the ring buffer can hold.
   This basically causes the event to become useless. Limit the size of an
   event probe to be 3K as that should be large enough to handle any dynamic
   events being created, and fits within the PAGE_SIZE sub-buffers of the
   ring buffer.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCafJn9hQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6quUaAP9ezC2VXWugRIhOEmK0pIpY90zddsyv
 xYTa3XuPP2r5JQEA8DcmH6fTo7JJxzbbyWPot/EgaqMt5XguqKC7txhjfww=
 =+e7q
 -----END PGP SIGNATURE-----

Merge tag 'trace-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix inverted check of registering the stats for branch tracing

   When calling register_stat_tracer() which returns zero on success and
   negative on error, the callers were checking the return of zero as an
   error and printing a warning message. Because this was just a normal
   printk() message and not a WARN(), it wasn't caught in any testing.

   Fix the check to print the warning message when an error actually
   happens.

 - Fix a typo in a comment in tracepoint.h

 - Limit the size of event probes to 3K in size

   It is possible to create a dynamic event probe via the tracefs system
   that is greater than the max size of an event that the ring buffer
   can hold. This basically causes the event to become useless.

   Limit the size of an event probe to be 3K as that should be large
   enough to handle any dynamic events being created, and fits within
   the PAGE_SIZE sub-buffers of the ring buffer.

* tag 'trace-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/probes: Limit size of event probe to 3K
  tracepoint: Fix typo in tracepoint.h comment
  tracing: branch: Fix inverted check on stat tracer registration
2026-04-29 22:21:44 -07:00
Steven Rostedt
b2aa3b4d64 tracing/probes: Limit size of event probe to 3K
There currently isn't a max limit an event probe can be. One could make an
event greater than PAGE_SIZE, which makes the event useless because if
it's bigger than the max event that can be recorded into the ring buffer,
then it will never be recorded.

A event probe should never need to be greater than 3K, so make that the
max size. As long as the max is less than the max that can be recorded
onto the ring buffer, it should be fine.

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Fixes: 93ccae7a22 ("tracing/kprobes: Support basic types on dynamic events")
Link: https://patch.msgid.link/20260428122302.706610ba@gandalf.local.home
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2026-04-29 16:07:38 -04:00
Tejun Heo
20e81c64c9 workqueue: Annotate alloc_workqueue_va() with __printf(1, 0)
alloc_workqueue_va() forwards its va_list to __alloc_workqueue() which
ultimately feeds vsnprintf(). __alloc_workqueue() already carries
__printf(1, 0); the new wrapper needs the same annotation so format
string checking propagates through the forwarding.

Fixes: 0de4cb473a ("workqueue: fix devm_alloc_workqueue() va_list misuse")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202604300347.2LgXyteh-lkp@intel.com/
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-29 09:44:16 -10:00
Sebastian Andrzej Siewior
bc7304f3ae futex: Prevent lockup in requeue-PI during signal/ timeout wakeup
During wait-requeue-pi (task A) and requeue-PI (task B) the following
race can happen:

     Task A                             Task B
  futex_wait_requeue_pi()
    futex_setup_timer()
    futex_do_wait()
                                   futex_requeue()
                                        CLASS(hb, hb1)(&key1);
                                        CLASS(hb, hb2)(&key2);
        *timeout*
    futex_requeue_pi_wakeup_sync()
        requeue_state = Q_REQUEUE_PI_IGNORE

    *blocks on hb->lock*

                                        futex_proxy_trylock_atomic()
                                          futex_requeue_pi_prepare()
                                            Q_REQUEUE_PI_IGNORE => -EAGAIN
                                        double_unlock_hb(hb1, hb2)
                                         *retry*

Task B acquires both hb locks and attempts to acquire the PI-lock of the
top most waiter (task B). Task A is leaving early due to a signal/
timeout and started removing itself from the queue. It updates its
requeue_state but can not remove it from the list because this requires
the hb lock which is owned by task B.

Usually task A is able to swoop the lock after task B unlocked it.
However if task B is of higher priority then task A may not be able to
wake up in time and acquire the lock before task B gets it again.
Especially on a UP system where A is never scheduled.

As a result task A blocks on the lock and task B busy loops, trying to
make progress but live locks the system instead. Tragic.

This can be fixed by removing the top most waiter from the list in this
case. This allows task B to grab the next top waiter (if any) in the
next iteration and make progress.

Remove the top most waiter if futex_requeue_pi_prepare() fails.
Let the waiter conditionally remove itself from the list in
handle_early_requeue_pi_wakeup().

Fixes: 07d91ef510 ("futex: Prevent requeue_pi() lock nesting issue on RT")
Reported-by: Moritz Klammler <Moritz.Klammler@ferchau.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260428103425.dywXyPd3@linutronix.de
Closes: https://lore.kernel.org/all/VE1PR06MB6894BE61C173D802365BE19DFF4CA@VE1PR06MB6894.eurprd06.prod.outlook.com
2026-04-29 08:56:40 +02:00
Linus Torvalds
664f0f6be3 sched_ext: Fixes for v7.1-rc1
The merge window pulled in the cgroup sub-scheduler infrastructure, and
 new AI reviews are accelerating bug reporting and fixing - hence the
 larger than usual fixes batch.
 
 - Use-after-frees during scheduler load/unload. The disable path
   could free the BPF scheduler while deferred irq_work / kthread work
   was still in flight; cgroup setter callbacks read the active
   scheduler outside the rwsem that synchronizes against teardown.
   Fixed both, and reused the disable drain in the enable error paths
   so the BPF JIT page can't be freed under live callbacks.
 
 - Several BPF op invocations didn't tell the framework which runqueue
   was already locked, so helper kfuncs that re-acquire the runqueue
   by CPU could deadlock on the held lock. Fixed at the affected
   callsites, including recursive parent-into-child dispatch.
 
 - The hardlockup notifier ran from NMI but eventually took a
   non-NMI-safe lock. Bounced through irq_work.
 
 - A handful of bugs in the new sub-scheduler hierarchy: helper
   kfuncs hard-coded the root instead of resolving the caller's
   scheduler; the enable error path tried to disable per-task state
   that had never been initialized, and leaked cpus_read_lock on the
   way out; a sysfs object was leaked on every load/unload; the
   dispatch fast-path used the root scheduler instead of the task's;
   and a couple of CONFIG #ifdef guards were misclassified.
 
 - Verifier-time hardening: BPF programs of unrelated struct_ops
   types (e.g. tcp_congestion_ops) could call sched_ext kfuncs - a
   semantic bug and, once sub-sched was enabled, a KASAN
   out-of-bounds read. Now rejected at load. Plus a few NULL and
   cross-task argument checks on sched_ext kfuncs, and a selftest
   covering the new deny.
 
 - rhashtable (Herbert): restored the insecure_elasticity toggle and
   bounced the deferred-resize kick through irq_work to break a
   lock-order cycle observable from raw-spinlock callers. sched_ext's
   scheduler-instance hash is the first user of both.
 
 - The bypass-mode load balancer used file-scope cpumasks; with
   multiple scheduler instances now possible, those raced. Moved
   per-instance, plus a follow-up to skip tasks whose recorded CPU is
   stale relative to the new owning runqueue.
 
 - Smaller fixes: a dispatch queue's first-task tracking misbehaved
   when a parked iterator cursor sat in the list; the runqueue's
   next-class wasn't promoted on local-queue enqueue, leaving an SCX
   task behind RT in edge cases; the reference qmap scheduler stopped
   erroring on legitimate cross-scheduler task-storage misses.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCafEN/A4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGaydAQCxWrUqnZXhxF4LnztjxTF2tgv7p8P7TbpS6aU6
 etqRpAEA9RFmIXs7XrhwCm0n2BwSgjvrNxnWfPhWvuH0uN0GTAA=
 =wLna
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:
 "The merge window pulled in the cgroup sub-scheduler infrastructure,
  and new AI reviews are accelerating bug reporting and fixing - hence
  the larger than usual fixes batch:

   - Use-after-frees during scheduler load/unload:
       - The disable path could free the BPF scheduler while deferred
         irq_work / kthread work was still in flight
       - cgroup setter callbacks read the active scheduler outside the
         rwsem that synchronizes against teardown
     Fix both, and reuse the disable drain in the enable error paths so
     the BPF JIT page can't be freed under live callbacks.

   - Several BPF op invocations didn't tell the framework which runqueue
     was already locked, so helper kfuncs that re-acquire the runqueue
     by CPU could deadlock on the held lock

     Fix the affected callsites, including recursive parent-into-child
     dispatch.

   - The hardlockup notifier ran from NMI but eventually took a
     non-NMI-safe lock. Bounce it through irq_work.

   - A handful of bugs in the new sub-scheduler hierarchy:
       - helper kfuncs hard-coded the root instead of resolving the
         caller's scheduler
       - the enable error path tried to disable per-task state that had
         never been initialized, and leaked cpus_read_lock on the way
         out
       - a sysfs object was leaked on every load/unload
       - the dispatch fast-path used the root scheduler instead of the
         task's
       - a couple of CONFIG #ifdef guards were misclassified

   - Verifier-time hardening: BPF programs of unrelated struct_ops types
     (e.g. tcp_congestion_ops) could call sched_ext kfuncs - a semantic
     bug and, once sub-sched was enabled, a KASAN out-of-bounds read.
     Now rejected at load. Plus a few NULL and cross-task argument
     checks on sched_ext kfuncs, and a selftest covering the new deny.

   - rhashtable (Herbert): restore the insecure_elasticity toggle and
     bounce the deferred-resize kick through irq_work to break a
     lock-order cycle observable from raw-spinlock callers. sched_ext's
     scheduler-instance hash is the first user of both.

   - The bypass-mode load balancer used file-scope cpumasks; with
     multiple scheduler instances now possible, those raced. Move to
     per-instance cpumasks, plus a follow-up to skip tasks whose
     recorded CPU is stale relative to the new owning runqueue.

   - Smaller fixes:
       - a dispatch queue's first-task tracking misbehaved when a parked
         iterator cursor sat in the list
       - the runqueue's next-class wasn't promoted on local-queue
         enqueue, leaving an SCX task behind RT in edge cases
       - the reference qmap scheduler stopped erroring on legitimate
         cross-scheduler task-storage misses"

* tag 'sched_ext-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (26 commits)
  sched_ext: Fix scx_flush_disable_work() UAF race
  sched_ext: Call wakeup_preempt() in local_dsq_post_enq()
  sched_ext: Release cpus_read_lock on scx_link_sched() failure in root enable
  sched_ext: Reject NULL-sch callers in scx_bpf_task_set_slice/dsq_vtime
  sched_ext: Refuse cross-task select_cpu_from_kfunc calls
  sched_ext: Align cgroup #ifdef guards with SUB_SCHED vs GROUP_SCHED
  sched_ext: Make bypass LB cpumasks per-scheduler
  sched_ext: Pass held rq to SCX_CALL_OP() for core_sched_before
  sched_ext: Pass held rq to SCX_CALL_OP() for dump_cpu/dump_task
  sched_ext: Save and restore scx_locked_rq across SCX_CALL_OP
  sched_ext: Use dsq->first_task instead of list_empty() in dispatch_enqueue() FIFO-tail
  sched_ext: Resolve caller's scheduler in scx_bpf_destroy_dsq() / scx_bpf_dsq_nr_queued()
  sched_ext: Read scx_root under scx_cgroup_ops_rwsem in cgroup setters
  sched_ext: Don't disable tasks in scx_sub_enable_workfn() abort path
  sched_ext: Skip tasks with stale task_rq in bypass_lb_cpu()
  sched_ext: Guard scx_dsq_move() against NULL kit->dsq after failed iter_new
  sched_ext: Unregister sub_kset on scheduler disable
  sched_ext: Defer scx_hardlockup() out of NMI
  sched_ext: sync disable_irq_work in bpf_scx_unreg()
  sched_ext: Fix local_dsq_post_enq() to use task's scheduler in sub-sched
  ...
2026-04-28 16:26:11 -07:00
Breno Leitao
3b75dd76e6 tracing: branch: Fix inverted check on stat tracer registration
init_annotated_branch_stats() and all_annotated_branch_stats() check the
return value of register_stat_tracer() with "if (!ret)", but
register_stat_tracer() returns 0 on success and a negative errno on
failure. The inverted check causes the warning to be printed on every
successful registration, e.g.:

  Warning: could not register annotated branches stats

while leaving real failures silent. The initcall also returned a
hard-coded 1 instead of the actual error.

Invert the check and propagate ret so that the warning fires on real
errors and the initcall reports the correct status.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: https://patch.msgid.link/20260420-tracing-v1-1-d8f4cd0d6af1@debian.org
Fixes: 002bb86d8d ("tracing/ftrace: separate events tracing and stats tracing engine")
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2026-04-28 14:28:29 -04:00
Cheng-Yang Chou
d99f7a32f0 sched_ext: Fix scx_flush_disable_work() UAF race
scx_flush_disable_work() calls irq_work_sync() followed by
kthread_flush_work() to ensure that the disable kthread work has
fully completed before bpf_scx_unreg() frees the SCX scheduler.

However, a concurrent scx_vexit() (e.g., triggered by a watchdog stall)
creates a race window between scx_claim_exit() and irq_work_queue():

  CPU A (scx_vexit (watchdog))        CPU B (bpf_scx_unreg)
  ----                                ----
  scx_claim_exit()
    atomic_try_cmpxchg(NONE->kind)
  stack_trace_save()
  vscnprintf()
                                      scx_disable()
                                        scx_claim_exit() -> FAIL
                                      scx_flush_disable_work()
                                        irq_work_sync()      // no-op: not queued yet
                                        kthread_flush_work() // no-op: not queued yet
                                      kobject_put(&sch->kobj) -> free %sch
  irq_work_queue() -> UAF on %sch
  scx_disable_irq_workfn()
    kthread_queue_work() -> UAF

The root cause is that CPU B's scx_flush_disable_work() returns after
syncing an irq_work that has not yet been queued, while CPU A is still
executing the code between scx_claim_exit() and irq_work_queue().

Loop until exit_kind reaches SCX_EXIT_DONE or SCX_EXIT_NONE, draining
disable_irq_work and disable_work in each pass. This ensures that any
work queued after the previous check is caught, while also correctly
handling cases where no disable was triggered (e.g., the
scx_sub_enable_workfn() abort path).

Fixes: 510a270554 ("sched_ext: sync disable_irq_work in bpf_scx_unreg()")
Reported-by: https://sashiko.dev/#/patchset/20260424100221.32407-1-icheng%40nvidia.com
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-28 07:40:03 -10:00
Kuba Piecuch
163f8b7f9a sched_ext: Call wakeup_preempt() in local_dsq_post_enq()
There are several edge cases (see linked thread) where an IMMED task
can be left lingering on a local DSQ if an RT task swoops in at the
wrong time. All of these edge cases are due to rq->next_class being idle
even after dispatching a task to rq's local DSQ. We should bump
rq->next_class to &ext_sched_class as soon as we've inserted a task into
the local DSQ.

To optimize the common case of rq->next_class == &ext_sched_class,
only call wakeup_preempt() if rq->next_class is below EXT. If next_class
is EXT or above, wakeup_preempt() is a no-op anyway.

This lets us also simplify the preempt_curr() logic a bit since
wakeup_preempt() will call preempt_curr() for us if next_class is
below EXT.

Link: https://lore.kernel.org/all/DHZPHUFXB4N3.2RY28MUEWBNYK@google.com/
Signed-off-by: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-28 06:28:48 -10:00
Breno Leitao
0de4cb473a workqueue: fix devm_alloc_workqueue() va_list misuse
devm_alloc_workqueue() built a va_list and passed it as a single
positional argument to the variadic alloc_workqueue() macro:

	va_start(args, max_active);
	wq = alloc_workqueue(fmt, flags, max_active, args);
	va_end(args);

C does not allow forwarding a va_list through a ... parameter.
alloc_workqueue() expands to alloc_workqueue_noprof(), which runs
its own va_start() over its ... params, so the inner
vsnprintf(wq->name, sizeof(wq->name), fmt, args) in
__alloc_workqueue() received the outer va_list object as the first
variadic slot rather than the caller's actual format arguments.

Add a new static helper alloc_workqueue_va() that wraps
__alloc_workqueue() and runs wq_init_lockdep() on success, and
fold both alloc_workqueue_noprof() and devm_alloc_workqueue_noprof()
onto it as suggested by Tejun.

The wq_init_lockdep() step is required on the devm path
too, otherwise __flush_workqueue()'s on-stack
COMPLETION_INITIALIZER_ONSTACK_MAP would NULL-deref wq->lockdep_map.

No caller changes are required. devm_alloc_ordered_workqueue() is
a macro forwarding to devm_alloc_workqueue() and inherits the fix.
Two in-tree callers actively trigger the broken path on every probe:

  drivers/power/supply/mt6370-charger.c:889
  drivers/power/supply/max77705_charger.c:649

both of which use devm_alloc_ordered_workqueue(dev, "%s", 0,
dev_name(dev)).

A standalone reproducer module is available at[1].

Link: https://github.com/leitao/debug/blob/main/workqueue/valist/wq_va_test.c [1]
Fixes: 1dfc9d60a6 ("workqueue: devres: Add device-managed allocate workqueue")
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-28 06:13:40 -10:00
Zicheng Qu
3da56dc063 sched/fair: Clear rel_deadline when initializing forked entities
A yield-triggered crash can happen when a newly forked sched_entity
enters the fair class with se->rel_deadline unexpectedly set.

The failing sequence is:

  1. A task is forked while se->rel_deadline is still set.
  2. __sched_fork() initializes vruntime, vlag and other sched_entity
     state, but does not clear rel_deadline.
  3. On the first enqueue, enqueue_entity() calls place_entity().
  4. Because se->rel_deadline is set, place_entity() treats se->deadline
     as a relative deadline and converts it to an absolute deadline by
     adding the current vruntime.
  5. However, the forked entity's deadline is not a valid inherited
     relative deadline for this new scheduling instance, so the conversion
     produces an abnormally large deadline.
  6. If the task later calls sched_yield(), yield_task_fair() advances
     se->vruntime to se->deadline.
  7. The inflated vruntime is then used by the following enqueue path,
     where the vruntime-derived key can overflow when multiplied by the
     entity weight.
  8. This corrupts cfs_rq->sum_w_vruntime, breaks EEVDF eligibility
     calculation, and can eventually make all entities appear ineligible.
     pick_next_entity() may then return NULL unexpectedly, leading to a
     later NULL dereference.

A captured trace shows the effect clearly. Before yield, the entity's
vruntime was around:

  9834017729983308

After yield_task_fair() executed:

  se->vruntime = se->deadline

the vruntime jumped to:

  19668035460670230

and the deadline was later advanced further to:

  19668035463470230

This shows that the deadline had already become abnormally large before
yield_task_fair() copied it into vruntime.

rel_deadline is only meaningful when se->deadline really carries a
relative deadline that still needs to be placed against vruntime. A
freshly forked sched_entity should not inherit or retain this state.
Clear se->rel_deadline in __sched_fork(), together with the other
sched_entity runtime state, so that the first enqueue does not interpret
the new entity's deadline as a stale relative deadline.

Fixes: 82e9d0456e ("sched/fair: Avoid re-setting virtual deadline on 'migrations'")
Analyzed-by: Hui Tang <tanghui20@huawei.com>
Analyzed-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260424071113.1199600-1-quzicheng@huawei.com
2026-04-28 09:19:54 +02:00
Vincent Guittot
ac8e69e693 sched/fair: Fix wakeup_preempt_fair() vs delayed dequeue
Similar to how pick_next_entity() must dequeue delayed entities, so too must
wakeup_preempt_fair(). Any delayed task being found means it is eligible and
hence past the 0-lag point, ready for removal.

Worse, by not removing delayed entities from consideration, it can skew the
preemption decision, with the end result that a short slice wakeup will not
result in a preemption.

                     tip/sched/core  tip/sched/core    +this patch
cyclictest slice  (ms) (default)2.8             8               8
hackbench slice   (ms) (default)2.8            20              20
Total Samples          |    22559           22595           22683
Average           (us) |      157              64( 59%)        59(  8%)
Median (P50)      (us) |       57              57(  0%)        58(- 2%)
90th Percentile   (us) |       64              60(  6%)        60(  0%)
99th Percentile   (us) |     2407              67( 97%)        67(  0%)
99.9th Percentile (us) |     3400            2288( 33%)       727( 68%)
Maximum           (us) |     5037            9252(-84%)      7461( 19%)

Fixes: f12e148892 ("sched/fair: Prepare pick_next_task() for delayed dequeue")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260422093400.319251-1-vincent.guittot@linaro.org
2026-04-28 09:19:54 +02:00
Peter Zijlstra
c5cd6fd75b sched/fair: Fix the negative lag increase fix
Vincent reported that my rework of his original patch lost a little
something.

Specifically it got the return value wrong; it should not compare
against the old se->vlag, but rather against the current value. Since
the thing that matters is if the effective vruntime of an entity is
affected and the thing needs repositioning or not.

Fixes: 059258b0d4 ("sched/fair: Prevent negative lag increase during delayed dequeue")
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20260423094107.GT3102624%40noisy.programming.kicks-ass.net
2026-04-28 09:19:54 +02:00
Linus Torvalds
3b3bea6d4b cgroup: Fixes for v7.1-rc1
- Fix UAF race in psi pressure_write() against cgroup file release by
   extending cgroup_mutex coverage and ordering of->priv access after
   cgroup_kn_lock_live().
 
 - Fix integer overflow in rdmacg_try_charge() when usage equals INT_MAX
   by performing the increment in s64.
 
 - Fix asymmetric DL bandwidth accounting on cpuset attach rollback by
   recording the CPU used by dl_bw_alloc() so cancel_attach() returns
   the reservation to the same root domain.
 
 - Fix nr_dying_subsys_* race that briefly showed 0 in cgroup.stat after
   rmdir by incrementing from kill_css() instead of offline_css().
 
 - Typo fix in cgroup-v2 documentation.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCae+xjw4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGaIUAQD2hJ7ELRDXAtXzL1Ck1zH8vESvbX8syFfkSO6L
 IgtovQEA4Tk7/RIO3HfBxBjgp6Q5vo7C7Biz4ye7fCu/ry7x3Qk=
 =pypQ
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fixes from Tejun Heo:

 - Fix UAF race in psi pressure_write() against cgroup file release by
   extending cgroup_mutex coverage and ordering of->priv access after
   cgroup_kn_lock_live()

 - Fix integer overflow in rdmacg_try_charge() when usage equals INT_MAX
   by performing the increment in s64

 - Fix asymmetric DL bandwidth accounting on cpuset attach rollback by
   recording the CPU used by dl_bw_alloc() so cancel_attach() returns
   the reservation to the same root domain

 - Fix nr_dying_subsys_* race that briefly showed 0 in cgroup.stat after
   rmdir by incrementing from kill_css() instead of offline_css()

 - Typo fix in cgroup-v2 documentation

* tag 'cgroup-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  docs: cgroup: fix typo 'protetion' -> 'protection'
  cgroup: Increment nr_dying_subsys_* from rmdir context
  cgroup/cpuset: record DL BW alloc CPU for attach rollback
  cgroup/rdma: fix integer overflow in rdmacg_try_charge()
  sched/psi: fix race between file release and pressure write
2026-04-27 16:51:27 -07:00
Breno Leitao
9ec9532989 kho: fix error handling in kho_add_subtree()
Fix two error handling issues in kho_add_subtree(), where it doesn't
handle the error path correctly.

1. If fdt_setprop() fails after the subnode has been created, the
   subnode is not removed. This leaves an incomplete node in the FDT
   (missing "preserved-data" or "blob-size" properties).

2. The fdt_setprop() return value (an FDT error code) is stored
   directly in err and returned to the caller, which expects -errno.

Fix both by storing fdt_setprop() results in fdt_err, jumping to a new
out_del_node label that removes the subnode on failure, and only setting
err = 0 on the success path, otherwise returning -ENOMEM (instead of
FDT_ERR_ errors that would come from fdt_setprop).

No user-visible changes.  This patch fixes error handling in the KHO
(Kexec HandOver) subsystem, which is used to preserve data across kexec
reboots.  The fix only affects a rare failure path during kexec
preparation — specifically when the kernel runs out of space in the
Flattened Device Tree buffer while registering preserved memory regions.

In the unlikely event that this error path was triggered, the old code
would leave a malformed node in the device tree and return an incorrect
error code to the calling subsystem, which could lead to confusing log
messages or incorrect recovery decisions.  With this fix, the incomplete
node is properly cleaned up and the appropriate errno value is propagated,
this error code is not returned to the user.

Link: https://lore.kernel.org/20260410-kho_fix_send-v2-1-1b4debf7ee08@debian.org
Fixes: 3dc92c3114 ("kexec: add Kexec HandOver (KHO) generation helpers")
Signed-off-by: Breno Leitao <leitao@debian.org>
Suggested-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Breno Leitao <leitao@debian.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-27 05:54:23 -07:00
Pasha Tatashin
0562b572ce liveupdate: fix return value on session allocation failure
When session allocation fails during deserialization, the global 'err'
variable was not updated before returning.  This caused subsequent calls
to luo_session_deserialize() to incorrectly report success.

Ensure 'err' is set to the error code from PTR_ERR(session).  This ensures
that an error is correctly returned to userspace when it attempts to open
/dev/liveupdate in the new kernel if deserialization failed.

Link: https://lore.kernel.org/20260415193738.515491-1-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Cc: David Matlack <dmatlack@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-27 05:54:23 -07:00
Tejun Heo
deb7b2f93d sched_ext: Release cpus_read_lock on scx_link_sched() failure in root enable
scx_root_enable_workfn() takes cpus_read_lock() before
scx_link_sched(sch), but the `if (ret) goto err_disable` on failure
skips the matching cpus_read_unlock() - all other err_disable gotos
along this path drop the lock first.

scx_link_sched() only returns non-zero on the sub-sched path
(parent != NULL), so the leak path is unreachable via the root
caller today. Still, the unwind is out of line with the surrounding
paths.

Drop cpus_read_lock() before goto err_disable.

v2: Correct Fixes: tag (Andrea Righi).

Fixes: 25037af712 ("sched_ext: Add rhashtable lookup for sub-schedulers")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-24 14:31:36 -10:00
Tejun Heo
05b4a9a9bc sched_ext: Reject NULL-sch callers in scx_bpf_task_set_slice/dsq_vtime
scx_prog_sched(aux) returns NULL for TRACING / SYSCALL BPF progs that
have no struct_ops association when the root scheduler has sub_attach
set. scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime() pass
that NULL into scx_task_on_sched(sch, p), which under
CONFIG_EXT_SUB_SCHED is rcu_access_pointer(p->scx.sched) == sch. For
any non-scx task p->scx.sched is NULL, so NULL == NULL returns true
and the authority gate is bypassed - a privileged but
non-struct_ops-associated prog can poke p->scx.slice /
p->scx.dsq_vtime on arbitrary tasks.

Reject !sch up front so the gate only admits callers with a resolved
scheduler.

Fixes: 245d09c594 ("sched_ext: Enforce scheduler ownership when updating slice and dsq_vtime")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:36 -10:00
Tejun Heo
ea7c716a24 sched_ext: Refuse cross-task select_cpu_from_kfunc calls
select_cpu_from_kfunc() skipped pi_lock for @p when called from
ops.select_cpu() or another rq-locked SCX op, assuming the held lock
protects @p. scx_bpf_select_cpu_dfl() / __scx_bpf_select_cpu_and() accept an
arbitrary KF_RCU task_struct, so a caller in e.g. ops.select_cpu(p1) or
ops.enqueue(p1) can pass some other p2 - the held pi_lock / rq lock is p1's,
not p2's - and reading p2->cpus_ptr / nr_cpus_allowed races with
set_cpus_allowed_ptr() and migrate_disable_switch() on another CPU.

Abort the scheduler on cross-task calls in both branches: for
ops.select_cpu() use scx_kf_arg_task_ok() to verify @p is the wake-up
task recorded in current->scx.kf_tasks[] by SCX_CALL_OP_TASK_RET();
for other rq-locked SCX ops compare task_rq(p) against scx_locked_rq().

v2: Switch the in_select_cpu cross-task check from direct_dispatch_task
    comparison to scx_kf_arg_task_ok(). The former spuriously rejects when
    ops.select_cpu() calls scx_bpf_dsq_insert() first, then calls
    scx_bpf_select_cpu_*() on the same task. (Andrea Righi)

Fixes: 0022b32850 ("sched_ext: Decouple kfunc unlocked-context check from kf_mask")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:36 -10:00
Tejun Heo
c0e8ddc76d sched_ext: Align cgroup #ifdef guards with SUB_SCHED vs GROUP_SCHED
Two EXT_GROUP_SCHED/SUB_SCHED guards are misclassified:

- scx_root_enable_workfn()'s cgroup_get(cgrp) and the err_put_cgrp unwind
  in scx_alloc_and_add_sched() are under `#if GROUP || SUB`, but the
  matching cgroup_put() in scx_sched_free_rcu_work() is inside `#ifdef SUB`
  only (via sch->cgrp, stored only under SUB). GROUP-only would leak a
  reference on every root-sched enable.

- sch_cgroup() / set_cgroup_sched() live under `#if GROUP || SUB` but touch
  SUB-only fields (sch->cgrp, cgroup->scx_sched). GROUP-only wouldn't
  compile.

GROUP needs CGROUP_SCHED; SUB needs only CGROUPS. CGROUPS=y/CGROUP_SCHED=n
gives the reachable GROUP=n, SUB=y combination; GROUP=y, SUB=n isn't
reachable today (SUB is def_bool y under CGROUPS). Neither miscategorization
triggers a real bug in any reachable config, but keep the guards honest:

- Narrow cgroup_get and err_put_cgrp to `#ifdef SUB` (matches the free-side
  put).
- Move sch_cgroup() and set_cgroup_sched() to a separate `#ifdef SUB` block
  with no-op stubs for the !SUB case; keep root_cgroup() and scx_cgroup_{
  lock,unlock}() under `#if GROUP || SUB` since those only need cgroup core.

Fixes: ebeca1f930 ("sched_ext: Introduce cgroup sub-sched support")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:36 -10:00
Tejun Heo
d292aa00de sched_ext: Make bypass LB cpumasks per-scheduler
scx_bypass_lb_{donee,resched}_cpumask were file-scope statics shared by all
scheduler instances. With CONFIG_EXT_SUB_SCHED, multiple sched instances
each arm their own bypass_lb_timer; concurrent bypass_lb_node() calls RMW
the global cpumasks with no lock, corrupting donee/resched decisions.

Move the cpumasks into struct scx_sched, allocate them alongside the timer
in scx_alloc_and_add_sched(), free them in scx_sched_free_rcu_work().

Fixes: 95d1df610c ("sched_ext: Implement load balancer for bypass mode")
Cc: stable@vger.kernel.org # v6.19+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:36 -10:00
Tejun Heo
4155fb489f sched_ext: Pass held rq to SCX_CALL_OP() for core_sched_before
scx_prio_less() runs from core-sched's pick_next_task() path with rq
locked but invokes ops.core_sched_before() with NULL locked_rq, leaving
scx_locked_rq_state NULL. If the BPF callback calls a kfunc that
re-acquires rq based on scx_locked_rq() - e.g. scx_bpf_cpuperf_set(cpu)
- it re-acquires the already-held rq.

Pass task_rq(a).

Fixes: 7b0888b7cc ("sched_ext: Implement core-sched support")
Cc: stable@vger.kernel.org # v6.12+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:36 -10:00
Tejun Heo
207d76a372 sched_ext: Pass held rq to SCX_CALL_OP() for dump_cpu/dump_task
scx_dump_state() walks CPUs with rq_lock_irqsave() held and invokes
ops.dump_cpu / ops.dump_task with NULL locked_rq, leaving
scx_locked_rq_state NULL. If the BPF callback calls a kfunc that
re-acquires rq based on scx_locked_rq() - e.g. scx_bpf_cpuperf_set(cpu)
- it re-acquires the already-held rq.

Pass the held rq to SCX_CALL_OP(). Thread it into scx_dump_task() too.
The pre-loop ops.dump call runs before rq_lock_irqsave() so keeps
rq=NULL.

Fixes: 07814a9439 ("sched_ext: Print debug dump after an error exit")
Cc: stable@vger.kernel.org # v6.12+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:36 -10:00
Tejun Heo
7fb39e4eb4 sched_ext: Save and restore scx_locked_rq across SCX_CALL_OP
SCX_CALL_OP{,_RET}() unconditionally clears scx_locked_rq_state to NULL on
exit. Correct at the top level, but ops can recurse via
scx_bpf_sub_dispatch(): a parent's ops.dispatch calls the helper, which
invokes the child's ops.dispatch under another SCX_CALL_OP. When the inner
call returns, the NULL clobbers the outer's state. The parent's BPF then
calls kfuncs like scx_bpf_cpuperf_set() which read scx_locked_rq()==NULL and
re-acquire the already-held rq.

Snapshot scx_locked_rq_state on entry and restore on exit. Rename the rq
parameter to locked_rq across all SCX_CALL_OP* macros so the snapshot local
can be typed as 'struct rq *' without colliding with the parameter token in
the expansion. SCX_CALL_OP_TASK{,_RET}() and SCX_CALL_OP_2TASKS_RET() funnel
through the two base macros and inherit the fix.

Fixes: 4f8b122848 ("sched_ext: Add basic building blocks for nested sub-scheduler dispatching")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:36 -10:00