linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-12 08:08:03 +02:00

Author	SHA1	Message	Date
Jann Horn	c1fa0bb633	exit: prevent preemption of oopsing TASK_DEAD task When an already-exiting task oopses, make_task_dead() currently calls do_task_dead() with preemption enabled. That is forbidden: do_task_dead() calls __schedule(), which has a comment saying "WARNING: must be called with preemption disabled!". If an oopsing task is preempted in do_task_dead(), between becoming TASK_DEAD and entering the scheduler explicitly, bad things happen: finish_task_switch() assumes that once the scheduler has switched away from a TASK_DEAD task, the task can never run again and its stack is no longer needed; but that assumption apparently doesn't hold if the dead task was preempted (the SM_PREEMPT case). This means that the scheduler ends up repeatedly dropping references on the dead task's stack, which can lead to use-after-free or double-free of the entire task stack; in other words, two tasks can end up running on the same stack, resulting in various kinds of memory corruption. (This does not just affect "recursively oopsing" tasks; it is enough to oops once during task exit, for example in a file_operations::release handler) Fixes: `7f80a2fd7d` ("exit: Stop poorly open coding do_task_dead in make_task_dead") Cc: stable@kernel.org Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-05-11 08:55:11 -07:00
Linus Torvalds	515186b7be	bpf-fixes -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmn/wmgACgkQ6rmadz2v bTosmhAAgYkQLg7zVQdruoSYb7Vzjz1Di4tM2rBXNIX4S7dvfZUGGBNzFV1lWobk /r6269llSnPKXofs+69LDVCpdvUXmGRmS7+bq+bxV7WVmg7JruVOTWg839jValJK cY3IQi0lZ9GVKaePI5C2XxBS3rCrdQmby91fcfp5C6A/gR6m7PzAlnoIuJ2SQx6A 7tsxxJb4wRtFWPBp7ClbBo7MAMIzPse/6CzsA2eP+icyJC+De9WGYs6bTDNi7vpY +eul0HMyHLTszJe/AGrsu5Ky3S6l+CTydi1fAUSOnk1pYHHhRvvD2WV8ix05/0rO 2looZl6ogpcisCm1i8HN8g1ST0tS74x3bL9kjvB/hhKGh6K1QpU6/drEvmJqKMAu fspYHD3qO+OXN7EV7tFZ1ErJvJZ7zT7UP0JxirAK1DFQZWrki/tJKehSD6gbir8R GwwZctXDOPTGADBsdqbxEPEAp1gVTvDXf04k6GOCLkzqqYBMVKdW/8GXN+6Itr+O nxxoC0SOOkW7rRlJaxuJd5+kpaCKOuK9FaXWONOn7HPzBgK0E0CL9g3+cZcS1QvI 2/5utfFj0gMeo40ZDjCyDWXm7w+AnTSKMMapB5pyi0FY3AVtroSV88HNbpm7DJrs xp9jO5ZD6EQ9Wn1cufOYAkrgZYwTZL5Z2EqyKcoJUIk1ZjpQbXg= =x/fg -----END PGP SIGNATURE----- Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull bpf fixes from Alexei Starovoitov: - Fix sk_local_storage diag dump via netlink (Amery Hung) - Fix off-by-one in arena direct-value access (Junyoung Jang) - Reject TCP_NODELAY in bpf-tcp congestion control (KaFai Wan) - Fix type confusion in bpf__sock() (Kuniyuki Iwashima) - Reject TX-only AF_XDP sockets (Linpu Yu) - Don't run arg-tracking analysis twice on main subprog (Paul Chaignon) - Fix NULL pointer dereference in bpf_sk_storage_clone and fib lookup (Weiming Shi) tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Fix off-by-one boundary validation in arena direct-value access xskmap: reject TX-only AF_XDP sockets bpf: Don't run arg-tracking analysis twice on main subprog bpf: Free reuseport cBPF prog after RCU grace period. bpf: tcp: Fix type confusion in sol_tcp_sockopt(). bpf: tcp: Fix type confusion in bpf_skc_to_tcp6_sock(). bpf: tcp: Fix type confusion in bpf_skc_to_tcp_sock(). mptcp: bpf: Fix type confusion in bpf_mptcp_sock_from_subflow() selftest: bpf: Add test for bpf_tcp_sock() and RAW socket. bpf: tcp: Fix type confusion in bpf_tcp_sock(). tools/headers: Regenerate stddef.h to fix BPF selftests bpf: Fix sk_local_storage diag dumping uninitialized special fields bpf: Fix NULL pointer dereference in bpf_skb_fib_lookup() sockmap: Fix sk_psock_drop() race vs sock_map_{unhash,close,destroy}(). bpf: Fix NULL pointer dereference in bpf_sk_storage_clone and diag paths selftests/bpf: Verify bpf-tcp-cc rejects TCP_NODELAY selftests/bpf: Test TCP_NODELAY in TCP hdr opt callbacks bpf: Reject TCP_NODELAY in bpf-tcp-cc bpf: Reject TCP_NODELAY in TCP header option callbacks	2026-05-09 18:42:54 -07:00
Junyoung Jang	3ac1a467e3	bpf: Fix off-by-one boundary validation in arena direct-value access BPF_MAP_TYPE_ARENA accepts BPF_PSEUDO_MAP_VALUE offsets at exactly the end of the arena mapping (off == arena_size). The boundary check in arena_map_direct_value_addr() uses `>` instead of `>=`, which incorrectly allows a one-past-end pointer to be accepted. Change the condition to `>=` to correctly reject offsets that fall outside the valid arena user_vm range. Fixes: `317460317a` ("bpf: Introduce bpf_arena.") Signed-off-by: Junyoung Jang <graypanda.inzag@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260426172505.1947915-1-graypanda.inzag@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-05-09 16:18:39 -07:00
Paul Chaignon	512809bb8a	bpf: Don't run arg-tracking analysis twice on main subprog Because subprog 0, the main subprog, is considered a global function, we end up running the arg-tracking dataflow analysis twice on it. That results in slightly longer verification but mostly in more verbose verifier logs. This patch fixes it by keeping only the iteration over global subprogs. When running over all of Cilium's programs with BPF_LOG_LEVEL2, this reduces verbosity by ~20% on average. Fixes: `bf0c571f7f` ("bpf: introduce forward arg-tracking dataflow analysis") Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/e4d7b53d4963ef520541a782f5fc8108a168877c.1778176504.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-05-09 16:12:40 -07:00
Linus Torvalds	6e1e5a33e8	Fix CPU hotplug activation race in the timer migration code, by Frederic Weisbecker. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmn+lQYRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1iNhhAAmzI4kh9kulol1M73u+LiQTdx0zMcDcSM NiHeZ+QbNNMqkfnj6xxOVGViPLTVSXpa+dU4hdeHM3n7vYnlGDiHBm4EM95z7S28 Jk39z6Wq/0vmR938rhVVNgMbyx3N+rHV15ReyY88vq6OcI4SdJ9sin72ZW8piuBi EeHZLOyPsz+qf4tG6UFHP3uzM7fQZ9DrVxzcDZE4+xbVpk1kXLuAGo89wXF0GbUX NhUYtHakvARehLg3N+0qo/DUGRXYp1GF+qfIXH7V6g91eMhPureUrhzsKVmbu5BL OXMMskWG20LhySrLX+fKDk8DH3n3G+dTFCgM5kc6JaR4vLT0prEJHeVsIX+ydOLF sZqqkaVhKvFNlxjrqbjRNBV09/Cd9XN1wYUqnBeh55qzB+yoOnfS9+z7dsguf38K YKiCmlt2T68NL5/ZUvXF3ghm5ic7dTyZFUyM/abo7rSTeGByh5vruD19r2zp0C1/ hgSPmAm8fuJAzQCRUzXxDsY5hLhKtv+1h9+Y9humakzfxUdneydsHigz5YjZ6tzI E8LyLsBZnSB6IKZfVetwpg0xTxB75GtSwa+p0XUg2+q2ss94mYmWMFM+ZKWxaTYq yI5paLXZT1DO9Fx/4gsCDlUrR1y096GL6O/1t3TOUsoJ/9kEFvOL/ZF4sjEnWrch CYI+DPoUOUA= =UHOI -----END PGP SIGNATURE----- Merge tag 'timers-urgent-2026-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Ingo Molnar: "Fix CPU hotplug activation race in the timer migration code, by Frederic Weisbecker" * tag 'timers-urgent-2026-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timers/migration: Fix another hotplug activation race	2026-05-08 20:03:39 -07:00
Linus Torvalds	7f00232152	Miscellaneous scheduler fixes: - Fix spurious failures in rseq self-tests (Mark Brown) - Fix rseq rseq::cpu_id_start ABI regression due to TCMalloc's creative use of the supposedly read-only field. The fix is to introduce a new ABI variant based on a new (larger) rseq area registration size, to keep the TCMalloc use of rseq backwards compatible on new kernels. (Thomas Gleixner) - Fix wakeup_preempt_fair() for not waking up task (Vincent Guittot) - Fix s64 mult overflow in vruntime_eligible() (Zhan Xusheng) Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmn+lBARHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1iy1hAAunlBoDq8/MXSt4JeMRX/3p+CKihExTnO LO535Rv8DfcepBgZIysTJKMn9bM/l+7OXGdQ+YDjS70GsLM2aOzDYBKCwOHHm0pZ OJ6Y+UUFAacnQS4EuQLqyNBW0Ice4AIYWu0pLLADs2KUgX1DmSo9bhgZbHcbsMnA IjoaFNhebeA1bHSDD11UIHTza23mqEinxM0yOK8pT+M6fOMXWOo/kLLLYjG/yAIB qBFGpwJkdKjBcpCmAYU9jpw26p/17YMzkgmAaUXOKRLZi+h5zQMNVjR+OIjK4qxt z5Tj+h7t3IcFV2d1zUThPpxxHLn3ro30R5mW0OrsPPFI8AkSRC6GsIX/Ft9uFbQQ 1SGknyx5qLrSldmT8KKXPlM/vriyh3iL6/QMgXtTb8FfegRCbjXsZy39s3wOexCD oBnJt0rX3NviPyb/Up9cfdx++kPfJM074NdVHRBW4ucoOpwHosNcuBP0YQsctFw0 QO7lkfjTo19eB9ftSyfadwF9e+2jYF7YoLmTOKqGLbZEQeIT5tJHDgHklLWrqHAh HDmDCHyqtiXBDCeapqnomrhqczKlSUK1Qqk/Hh+Hwiwj0N4vLhHXmRIpTb/R43/Y i6cdwV8Xtl+RltmYaINpMxdnFl/iz/kpapYkqy/ykLNfBj01AWbdaIVPRk47ndU1 E29BSjMZhPU= =2QJp -----END PGP SIGNATURE----- Merge tag 'sched-urgent-2026-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: - Fix spurious failures in rseq self-tests (Mark Brown) - Fix rseq rseq::cpu_id_start ABI regression due to TCMalloc's creative use of the supposedly read-only field The fix is to introduce a new ABI variant based on a new (larger) rseq area registration size, to keep the TCMalloc use of rseq backwards compatible on new kernels (Thomas Gleixner) - Fix wakeup_preempt_fair() for not waking up task (Vincent Guittot) - Fix s64 mult overflow in vruntime_eligible() (Zhan Xusheng) * tag 'sched-urgent-2026-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix wakeup_preempt_fair() for not waking up task sched/fair: Fix overflow in vruntime_eligible() selftests/rseq: Expand for optimized RSEQ ABI v2 rseq: Reenable performance optimizations conditionally rseq: Implement read only ABI enforcement for optimized RSEQ V2 mode selftests/rseq: Validate legacy behavior selftests/rseq: Make registration flexible for legacy and optimized mode selftests/rseq: Skip tests if time slice extensions are not available rseq: Revert to historical performance killing behaviour rseq: Don't advertise time slice extensions if disabled rseq: Protect rseq_reset() against interrupts rseq: Set rseq::cpu_id_start to 0 on unregistration selftests/rseq: Don't run tests with runner scripts outside of the scripts	2026-05-08 19:42:10 -07:00
Linus Torvalds	e5cf0260a7	Miscellaneous perf events fixes: - Fix deadlock in the perf_mmap() failure path (Peter Zijlstra) - Intel ACR (Auto Counter Reload) fixes (Dapeng Mi): - Fix validation and configuration of ACR masks - Fix ACR rescheduling bug causing stale masks - Disable the PMI on ACR-enabled hardware - Enable ACR on Panther Cover uarch too Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmn+kB0RHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1ib2RAAkh19iPRq7tHCXvDYJEHztpAuioyaKznw 57pNlzG+/K/gg/yZ7uoCtXvogEHgDtart9ZtH7CVtZQgky6YfdiBq0g0pNSdOoY5 O4IB0ZIXIu/FOh1+Q1k3Md5MeuxUz1Jp21Wq+JNK6VkWwfq+oCZ3XJK06+2C45wI uUePaEMFZn7VX9WOToZJQZME/+5yQvrgOq+D6gBs+y3UJO5u6kpdoley6fPXRtYV hyfBYiutJlcV1dJC9g7Dc6CHrBkaolFTKsRi2RjD658fHmUUMCubsn6lSG9UJqiZ CWtNMHJ/k1WBLuPLUaZBa3W0s+mUZ+0E6W3nLRHC2ORRQhAnQKeoDb2IWlVhYTdB NmyABPqjwvGfgidMh39aMt8GS4lFZBXGozVNWTZprN56U/jYH/Ol4cJNLJ9Ez5yk fzIkljCc5L/ZmxmqjJNBvmJtTpAt/FhN0qKT/k9jksISFE24bzZ0oRg3t051OyXs Mndldyl/2EFHA2PBIN2phISTVWh5lewYNBaK0SBbx77DX6NzMdevhdGAvw2cRVT/ BJvqj+OeBfiaGBNb/lAIsoZCnuMClQi2t4jlKGkmN3n9hbgPyPAsz/WJRDLr9GZ+ cqQgh7fL80HoqZTfV7tWxKTkDK3AciXXZE+8ntBpGC6CgMmgsJqLmxc60Jzfh2OO qGXcodOISag= =WNs1 -----END PGP SIGNATURE----- Merge tag 'perf-urgent-2026-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf events fixes from Ingo Molnar: - Fix deadlock in the perf_mmap() failure path (Peter Zijlstra) - Intel ACR (Auto Counter Reload) fixes (Dapeng Mi): - Fix validation and configuration of ACR masks - Fix ACR rescheduling bug causing stale masks - Disable the PMI on ACR-enabled hardware - Enable ACR on Panther Cover uarch too * tag 'perf-urgent-2026-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel: Enable auto counter reload for DMR perf/x86/intel: Disable PMI for self-reloaded ACR events perf/x86/intel: Always reprogram ACR events to prevent stale masks perf/x86/intel: Improve validation and configuration of ACR masks perf/core: Fix deadlock in perf_mmap() failure path	2026-05-08 19:39:18 -07:00
Vincent Guittot	9f6d929ee2	sched/fair: Fix wakeup_preempt_fair() for not waking up task Make sure to only call pick_next_entity() on an non-empty cfs_rq. The assumption that p is always enqueued and not delayed, is only true for wakeup. If p was moved while delayed, pick_next_entity() will dequeue it and the cfs might become empty. Test if there are still queued tasks before trying again to determine if p could be the next one to be picked. There are at least 2 cases: When cfs becomes idle, it tries to pull tasks but if those pulled tasks are delayed, they will be dequeued when attached to cfs. attach_tasks() -> attach_task() -> wakeup_preempt(rq, p, 0); A misfit task running on cfs A triggers a load balance to be pulled on a better cpu, the load balance on cfs B starts an active load balance to pulled the running misfit task. If there is a delayed dequeue task on cfs A, it can be pulled instead of the previously running misfit task. attach_one_task() -> attach_task() -> wakeup_preempt(rq, p, 0); Fixes: `ac8e69e693` ("sched/fair: Fix wakeup_preempt_fair() vs delayed dequeue") Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260503104503.1732682-1-vincent.guittot@linaro.org	2026-05-06 17:41:18 +02:00
Zhan Xusheng	b6eee96843	sched/fair: Fix overflow in vruntime_eligible() Zhan Xusheng reported running into sporadic a s64 mult overflow in vruntime_eligible(). When constructing a worst case scenario: If you have cgroups, then you can have an entity of weight 2 (per calc_group_shares()), and its vlag should then be bounded by: (slice+TICK_NSEC) * NICE_0_LOAD, which is around 44 bits as per the comment on entity_key(). The other extreme is 100NICE_0_LOAD, thus you get: {key, weight}[] := { puny: { (slice + TICK_NSEC) NICE_0_LOAD, 2 }, max: { 0, 100NICE_0_LOAD }, } The avg_vruntime() would end up being very close to 0 (which is zero_vruntime), so no real help making that more accurate. vruntime_eligible(puny) ends up with: avg = 2 puny.key (+ 0) load = 2 + 100 * NICE_0_LOAD avg >= puny.key * load And that is: (slice + TICK_NSEC) * NICE_0_LOAD * NICE_0_LOAD * 100, which will overflow s64. Zhan suggested using __builtin_mul_overflow(), however after staring at compiler output for various architectures using godbolt, it seems that using an __int128 multiplication often results in better code. Specifically, a number of architectures already compute the __int128 product to determine the overflow. Eg. arm64 already has the 'smulh' instruction used. By explicitly doing an __int128 multiply, it will emit the 'mul; smulh' pattern, which modern cores can fuse (armv8-a clang-22.1.0). x86_64 has less branches (no OF handling). Since Linux has ARCH_SUPPORTS_INT128 to gate __int128 usage, also provide the __builtin_mul_overflow() variant as a fallback. [peterz: Changelog and __int128 bits] Fixes: `556146ce5e` ("sched/fair: Avoid overflow in enqueue_entity()") Reported-by: Zhan Xusheng <zhanxusheng1024@gmail.com> Closes: https://patch.msgid.link/20260415145742.10359-1-zhanxusheng%40xiaomi.com Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260505103155.GN3102924%40noisy.programming.kicks-ass.net	2026-05-06 17:41:17 +02:00
Thomas Gleixner	99428157dc	rseq: Reenable performance optimizations conditionally Due to the incompatibility with TCMalloc the RSEQ optimizations and extended features (time slice extensions) have been disabled and made run-time conditional. The original RSEQ implementation, which TCMalloc depends on, registers a 32 byte region (ORIG_RSEG_SIZE). This region has a 32 byte alignment requirement. The extension safe newer variant exposes the kernel RSEQ feature size via getauxval(AT_RSEQ_FEATURE_SIZE) and the alignment requirement via getauxval(AT_RSEQ_ALIGN). The alignment requirement is that the registered RSEQ region is aligned to the next power of two of the feature size. The kernel currently has a feature size of 33 bytes, which means the alignment requirement is 64 bytes. The TCMalloc RSEQ region is embedded into a cache line aligned data structure starting at offset 32 bytes so that bytes 28-31 and the cpu_id_start field at bytes 32-35 form a 64-bit little endian pointer with the top-most bit (63 set) to check whether the kernel has overwritten cpu_id_start with an actual CPU id value, which is guaranteed to not have the top most bit set. As this is part of their performance tuned magic, it's a pretty safe assumption, that TCMalloc won't use a larger RSEQ size. This allows the kernel to declare that registrations with a size greater than the original size of 32 bytes, which is the cases since time slice extensions got introduced, as RSEQ ABI v2 with the following differences to the original behaviour: 1) Unconditional updates of the user read only fields (CPU, node, MMCID) are removed. Those fields are only updated on registration, task migration and MMCID changes. 2) Unconditional evaluation of the criticial section pointer is removed. It's only evaluated when user space was interrupted and was scheduled out or before delivering a signal in the interrupted context. 3) The read/only requirement of the ID fields is enforced. When the kernel detects that userspace manipulated the fields, the process is terminated. This ensures that multiple entities (libraries) can utilize RSEQ without interfering. 4) Todays extended RSEQ feature (time slice extensions) and future extensions are only enabled in the v2 enabled mode. Registrations with the original size of 32 bytes operate in backwards compatible legacy mode without performance improvements and extended features. Unfortunately that also affects users of older GLIBC versions which register the original size of 32 bytes and do not evaluate the kernel required size in the auxiliary vector AT_RSEQ_FEATURE_SIZE. That's the result of the lack of enforcement in the original implementation and the unwillingness of a single entity to cooperate with the larger ecosystem for many years. Implement the required registration changes by restructuring the spaghetti code and adding the size/version check. Also add documentation about the differences of legacy and optimized RSEQ V2 mode. Thanks to Mathieu for pointing out the ORIG_RSEQ_SIZE constraints! Fixes: `d6200245c7` ("rseq: Allow registering RSEQ with slice extension") Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Link: https://patch.msgid.link/20260428224427.927160119%40kernel.org Cc: stable@vger.kernel.org	2026-05-06 17:40:27 +02:00
Thomas Gleixner	82f572449c	rseq: Implement read only ABI enforcement for optimized RSEQ V2 mode The optimized RSEQ V2 mode requires that user space adheres to the ABI specification and does not modify the read-only fields cpu_id_start, cpu_id, node_id and mm_cid behind the kernel's back. While the kernel does not rely on these fields, the adherence to this is a fundamental prerequisite to allow multiple entities, e.g. libraries, in an application to utilize the full potential of RSEQ without stepping on each other toes. Validate this adherence on every update of these fields. If the kernel detects that user space modified the fields, the application is force terminated. Fixes: `d6200245c7` ("rseq: Allow registering RSEQ with slice extension") Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Link: https://patch.msgid.link/20260428224427.845230956%40kernel.org Cc: stable@vger.kernel.org	2026-05-06 17:40:15 +02:00
Frederic Weisbecker	bd3c45dd01	timers/migration: Fix another hotplug activation race The hotplug control CPU is assumed to be active in the hierarchy but that doesn't imply that the root is active. If the current CPU is not the one that activated the current hierarchy, and the CPU performing this duty is still halfway through the tree, the root may still be observed inactive. And this can break the activation of a new root as in the following scenario: 1) Initially, the whole system has 64 CPUs and only CPU 63 is awake. [GRP1:0] active / \| \ / \| \ [GRP0:0] [...] [GRP0:7] idle idle active / \| \ \| CPU 0 CPU 1 ... CPU 63 idle idle active 2) CPU 63 goes idle _but_ due to a #VMEXIT it hasn't yet reached the [GRP1:0]->parent dereference (that would be NULL and stop the walk) in __walk_groups_from(). [GRP1:0] idle / \| \ / \| \ [GRP0:0] [...] [GRP0:7] idle idle idle / \| \ \| CPU 0 CPU 1 ... CPU 63 idle idle idle 3) CPU 1 wakes up, activates GRP0:0 but didn't yet manage to propagate up to GRP1:0 due to yet another #VMEXIT. [GRP1:0] idle / \| \ / \| \ [GRP0:0] [...] [GRP0:7] active idle idle / \| \ \| CPU 0 CPU 1 ... CPU 63 idle active idle 3) CPU 0 wakes up and doesn't need to walk above GRP0:0 as it's CPU 1 role. [GRP1:0] idle / \| \ / \| \ [GRP0:0] [...] [GRP0:7] active idle idle / \| \ \| CPU 0 CPU 1 ... CPU 63 active active idle 4) CPU 0 boots CPU 64. It creates a new root for it. [GRP2:0] idle / \ / \ [GRP1:0] [GRP1:1] idle idle / \| \ \ / \| \ \ [GRP0:0] [...] [GRP0:7] [GRP0:8] active idle idle idle / \| \ \| \| CPU 0 CPU 1 ... CPU 63 CPU 64 active active idle offline 5) CPU 0 activates the new root, but note that GRP1:0 is still idle, waiting for CPU 1 to resume from #VMEXIT and activate it. [GRP2:0] active / \ / \ [GRP1:0] [GRP1:1] idle idle / \| \ \ / \| \ \ [GRP0:0] [...] [GRP0:7] [GRP0:8] active idle idle idle / \| \ \| \| CPU 0 CPU 1 ... CPU 63 CPU 64 active active idle offline 6) CPU 63 resumes after #VMEXIT and sees the new GRP1:0 parent. Therefore it propagates the stale inactive state of GRP1:0 up to GRP2:0. [GRP2:0] idle / \ / \ [GRP1:0] [GRP1:1] idle idle / \| \ \ / \| \ \ [GRP0:0] [...] [GRP0:7] [GRP0:8] active idle idle idle / \| \ \| \| CPU 0 CPU 1 ... CPU 63 CPU 64 active active idle offline 7) CPU 1 resumes after #VMEXIT and finally activates GRP1:0. But it doesn't observe its parent link because no ordering enforced that. Therefore GRP2:0 is spuriously left idle. [GRP2:0] idle / \ / \ [GRP1:0] [GRP1:1] active idle / \| \ \ / \| \ \ [GRP0:0] [...] [GRP0:7] [GRP0:8] active idle idle idle / \| \ \| \| CPU 0 CPU 1 ... CPU 63 CPU 64 active active idle offline Such races are highly theoretical and the problem would solve itself once the old root ever becomes idle again. But it still leaves a taste of discomfort. Fix it with enforcing a fully ordered atomic read of the old root state before propagating the activate state up to the new root. It has a two directions ordering effect: * Acquire + release of the latest old root state: If the hotplug control CPU is not the one that woke up the old root, make sure to acquire its active state and propagate it upwards through the ordered chain of activation (the acquire pairs with the cmpxchg() in tmigr_active_up() and subsequent releases will pair with atomic_read_acquire() and smp_mb__after_atomic() in tmigr_inactive_up()). * Release: If the hotplug control CPU is not the one that must wake up the old root, but the CPU covering that is lagging behind its duty, publish the links from the old root to the new parents. This way the lagging CPU will propagate the active state itself. Fixes: `7ee9887703` ("timers: Implement the hierarchical pull model") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260423165354.95152-2-frederic@kernel.org	2026-05-06 08:21:12 +02:00
Linus Torvalds	74fe02ce12	workqueue: Fixes for v7.1-rc2 - Fix devm_alloc_workqueue() passing a va_list as a positional arg to the variadic alloc_workqueue() macro, which garbled wq->name and skipped lockdep init on the devm path. Fold both noprof entry points onto a va_list helper. Also, annotate __printf(1, 0). -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCafpheA4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGeBeAQDKq3XmN45CaOB76k9wSmRYqVgTSoWWTL83O8km nC2UhQEAiJBfqhP586coXyK6saaZX9QwFdPzTcB72LhZSAj13gQ= =Zixt -----END PGP SIGNATURE----- Merge tag 'wq-for-7.1-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fixes from Tejun Heo: - Fix devm_alloc_workqueue() passing a va_list as a positional arg to the variadic alloc_workqueue() macro, which garbled wq->name and skipped lockdep init on the devm path. Fold both noprof entry points onto a va_list helper. Also, annotate it using __printf(1, 0) * tag 'wq-for-7.1-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Annotate alloc_workqueue_va() with __printf(1, 0) workqueue: fix devm_alloc_workqueue() va_list misuse	2026-05-05 16:09:31 -07:00
Linus Torvalds	11f00074f7	cgroup: Fixes for v7.1-rc2 - During v6.19, cgroup task unlink was moved from do_exit() to after the final task switch to satisfy a controller invariant. That left the kernel seeing tasks past exit_signals() longer than userspace expected, and several v7.0 follow-ups tried to bridge the gap by making rmdir wait for the kernel side. None held up. The latest is an A-A deadlock when rmdir is invoked by the reaper of zombies whose pidns teardown the rmdir itself is waiting on, which points at the synchronizing approach being fundamentally wrong: - Take a different approach: drop the wait, leave rmdir's user-visible side returning as soon as cgroup.procs is empty, and defer the css percpu_ref kill that drives ->css_offline() until the cgroup is fully depopulated. - Tagged for stable. Somewhat invasive but contained. The hope is that fixing forward sticks. If not, the fallback is to revert the entire chain and rework on the development branch. - Doesn't plug a pre-existing analogous race in cgroup_apply_control_disable() (controller disable via subtree_control). Not a regression. The development branch will do the more invasive restructuring needed for that. - Documentation update for cgroup-v1 charge-commit section that still referenced functions removed when the memcg hugetlb try-commit-cancel protocol was retired. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCafphbw4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGbydAQDxlEIeJPdJlwbU6X4PBW/7DYeDHABG7OdrFf5K Fq4ECAD/ZHsFyCNEOcZym6t2/FCZR0xbaPGQibLt3er6AkLRFwM= =3Jra -----END PGP SIGNATURE----- Merge tag 'cgroup-for-7.1-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - During v6.19, cgroup task unlink was moved from do_exit() to after the final task switch to satisfy a controller invariant. That left the kernel seeing tasks past exit_signals() longer than userspace expected, and several v7.0 follow-ups tried to bridge the gap by making rmdir wait for the kernel side. None held up. The latest is an A-A deadlock when rmdir is invoked by the reaper of zombies whose pidns teardown the rmdir itself is waiting on, which points at the synchronizing approach being fundamentally wrong. Take a different approach: drop the wait, leave rmdir's user-visible side returning as soon as cgroup.procs is empty, and defer the css percpu_ref kill that drives ->css_offline() until the cgroup is fully depopulated. Tagged for stable. Somewhat invasive but contained. The hope is that fixing forward sticks. If not, the fallback is to revert the entire chain and rework on the development branch. Note that this doesn't plug a pre-existing analogous race in cgroup_apply_control_disable() (controller disable via subtree_control). Not a regression. The development branch will do the more invasive restructuring needed for that. - Documentation update for cgroup-v1 charge-commit section that still referenced functions removed when the memcg hugetlb try-commit-cancel protocol was retired. * tag 'cgroup-for-7.1-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: docs: cgroup-v1: Update charge-commit section cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated	2026-05-05 15:43:32 -07:00
Linus Torvalds	de95ad90fb	sched_ext: Fixes for v7.1-rc2 - Fix idle CPU selection returning prev_cpu outside the task's cpus_ptr when the BPF caller's allowed mask was wider. Stable backport. - Two opposite-direction gaps in scx_task_iter's cgroup-scoped mode versus the global mode: - Tasks past exit_signals() are filtered by the cgroup walk but kept by global. Sub-scheduler enable abort leaked __scx_init_task() state. Add a CSS_TASK_ITER_WITH_DEAD flag to cgroup's task iterator (scx_task_iter is its only user) and use it. - Tasks past sched_ext_dead() are still returned, tripping WARN_ON_ONCE() in callers or making them touch torn-down state. Mark and skip under the per-task rq lock. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCafphXA4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGbI/AP4nRHDusUuYDSFBLyHODvLZXfMM3Nb0yzS7euQJ qvx6OQEA1p5AyRWA2apFvKjjQrl1dOb5vUlro1Fj8VF51X7Spwc= =olGB -----END PGP SIGNATURE----- Merge tag 'sched_ext-for-7.1-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix idle CPU selection returning prev_cpu outside the task's cpus_ptr when the BPF caller's allowed mask was wider. Stable backport. - Two opposite-direction gaps in scx_task_iter's cgroup-scoped mode versus the global mode: - Tasks past exit_signals() are filtered by the cgroup walk but kept by global. Sub-scheduler enable abort leaked __scx_init_task() state. Add a CSS_TASK_ITER_WITH_DEAD flag to cgroup's task iterator (scx_task_iter is its only user) and use it. - Tasks past sched_ext_dead() are still returned, tripping WARN_ON_ONCE() in callers or making them touch torn-down state. Mark and skip under the per-task rq lock. * tag 'sched_ext-for-7.1-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: idle: Recheck prev_cpu after narrowing allowed mask sched_ext: Skip past-sched_ext_dead() tasks in scx_task_iter_next_locked() cgroup, sched_ext: Include exiting tasks in cgroup iter	2026-05-05 15:22:04 -07:00
Thomas Gleixner	b9eac6a9d9	rseq: Revert to historical performance killing behaviour The recent RSEQ optimization work broke the TCMalloc abuse of the RSEQ ABI as it not longer unconditionally updates the CPU, node, mm_cid fields, which are documented as read only for user space. Due to the observed behavior of the kernel it was possible for TCMalloc to overwrite the cpu_id_start field for their own purposes and rely on the kernel to update it unconditionally after each context switch and before signal delivery. The RSEQ ABI only guarantees that these fields are updated when the data changes, i.e. the task is migrated or the MMCID of the task changes due to switching from or to per CPU ownership mode. The optimization work eliminated the unconditional updates and reduced them to the documented ABI guarantees, which results in a massive performance win for syscall, scheduling heavy work loads, which in turn breaks the TCMalloc expectations. There have been several options discussed to restore the TCMalloc functionality while preserving the optimization benefits. They all end up in a series of hard to maintain workarounds, which in the worst case introduce overhead for everyone, e.g. in the scheduler. The requirements of TCMalloc and the optimization work are diametral and the required work arounds are a maintainence burden. They end up as fragile constructs, which are blocking further optimization work and are pretty much guaranteed to cause more subtle issues down the road. The optimization work heavily depends on the generic entry code, which is not used by all architectures yet. So the rework preserved the original mechanism moslty unmodified to keep the support for architectures, which handle rseq in their own exit to user space loop. That code is currently optimized out by the compiler on architectures which use the generic entry code. This allows to revert back to the original behaviour by replacing the compile time constant conditions with a runtime condition where required, which disables the optimization and the dependend time slice extension feature until the run-time condition can be enabled in the RSEQ registration code on a per task basis again. The following changes are required to restore the original behavior, which makes TCMalloc work again: 1) Replace the compile time constant conditionals with runtime conditionals where appropriate to prevent the compiler from optimizing the legacy mode out 2) Enforce unconditional update of IDs on context switch for the non-optimized v1 mode 3) Enforce update of IDs in the pre signal delivery path for the non-optimized v1 mode 4) Enforce update of IDs in the membarrier(RSEQ) IPI for the non-optimized v1 mode 5) Make time slice and future extensions depend on optimized v2 mode This brings back the full performance problems, but preserves the v2 optimization code and for generic entry code using architectures also the TIF_RSEQ optimization which avoids a full evaluation of the exit to user mode loop in many cases. Fixes: `566d8015f7` ("rseq: Avoid CPU/MM CID updates when no event pending") Reported-by: Mathias Stearn <mathias@mongodb.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Closes: https://lore.kernel.org/CAHnCjA25b+nO2n5CeifknSKHssJpPrjnf+dtr7UgzRw4Zgu=oA@mail.gmail.com Link: https://patch.msgid.link/20260428224427.517051752%40kernel.org Cc: stable@vger.kernel.org	2026-05-05 16:02:57 +02:00
Peter Zijlstra	c69df06e4e	perf/core: Fix deadlock in perf_mmap() failure path Ian noted that commit `77de62ad3d` ("perf/core: Fix refcount bug and potential UAF in perf_mmap") would cause a deadlock due to event->mmap_mutex recursion. This happens because we're now calling perf_mmap_close() under mmap_mutex, while that function itself can also take mmap_mutex. Solve this by noting that perf_mmap_close() is far more complicated than we need at this particular point, since it deals with scenarios that cannot happen in this particular case. Replace the call to perf_mmap_close() with a very narrow undo for the case of first-exposure. If this is not the first mmap(), there is no race and it is fine to drop the lock and call perf_mmap_close() to handle to more complicated scenarios. Note: move the rb->mmap_user (namespace) handling into the rb init/free code such that it does not complicate the mmap handling. Fixes: `77de62ad3d` ("perf/core: Fix refcount bug and potential UAF in perf_mmap") Reported-by: Ian Rogers <irogers@google.com> Closes: https://patch.msgid.link/CAP-5%3DfVJyVMZw%3DDqP53Kxg58nUmJ_0bxoaeOKAbC03BVc11HaA%40mail.gmail.com Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260326112821.GK3738786@noisy.programming.kicks-ass.net	2026-05-05 12:47:20 +02:00
David Carlier	b34c82777a	sched_ext: idle: Recheck prev_cpu after narrowing allowed mask scx_select_cpu_dfl() narrows @allowed to @cpus_allowed & @p->cpus_ptr when the BPF caller supplies a @cpus_allowed that differs from @p->cpus_ptr and @p doesn't have full affinity. However, @is_prev_allowed was computed against the original (wider) @cpus_allowed, so the prev_cpu fast paths could pick a @prev_cpu that is in @cpus_allowed but not in @p->cpus_ptr, violating the intended invariant that the returned CPU is always usable by @p. The kernel masks this via the SCX_EV_SELECT_CPU_FALLBACK fallback, but the behavior contradicts the documented contract. Move the @is_prev_allowed evaluation past the narrowing block so it tests against the final @allowed mask. Fixes: `ee9a4e9279` ("sched_ext: idle: Properly handle invalid prev_cpu during idle selection") Cc: stable@vger.kernel.org # v6.16+ Assisted-by: Claude <noreply@anthropic.com> Signed-off-by: David Carlier <devnexen@gmail.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-05-04 11:01:04 -10:00
Tejun Heo	ff9eda4ea9	sched_ext: Skip past-sched_ext_dead() tasks in scx_task_iter_next_locked() scx_task_iter's cgroup-scoped mode can return tasks whose sched_ext_dead() has already completed: cgroup_task_dead() removes from cset->tasks after sched_ext_dead() in finish_task_switch() and is irq-work deferred on PREEMPT_RT. The global mode is fine - sched_ext_dead() removes from scx_tasks via list_del_init() first. Callers (sub-sched enable prep/abort/apply, scx_sub_disable(), scx_fail_parent()) assume returned tasks are still on @sch and trip WARN_ON_ONCE() or operate on torn-down state otherwise. Set %SCX_TASK_OFF_TASKS in sched_ext_dead() under @p's rq lock and have scx_task_iter_next_locked() skip flagged tasks under the same lock. Setter and reader serialize on the per-task rq lock - no race. Signed-off-by: Tejun Heo <tj@kernel.org>	2026-05-04 09:06:03 -10:00
Tejun Heo	60f21a2649	cgroup, sched_ext: Include exiting tasks in cgroup iter `a72f73c4dd` ("cgroup: Don't expose dead tasks in cgroup") made css_task_iter_advance() skip exiting tasks so cgroup.procs stays consistent with waitpid() visibility. Unfortunately, this broke scx_task_iter. scx_task_iter walks either scx_tasks (global) or a cgroup subtree via css_task_iter() and the two modes are expected to cover the same set of tasks. After the above change the cgroup-scoped mode silently skips tasks past exit_signals() that are still on scx_tasks. scx_sub_enable_workfn()'s abort path is one of the symptoms: an exiting SCX_TASK_SUB_INIT task can race past the cgroup iter leaking __scx_init_task() state. Other iterations share the same gap. Add CSS_TASK_ITER_WITH_DEAD to opt out of the skip and use it from scx_task_iter(). Fixes: `b0e4c2f8a0` ("sched_ext: Implement cgroup subtree iteration for scx_task_iter") Reported-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-05-04 09:06:03 -10:00
Tejun Heo	93618edf75	cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated A chain of commits going back to v7.0 reworked rmdir to satisfy the controller invariant that a subsystem's ->css_offline() must not run while tasks are still doing kernel-side work in the cgroup. [1] `d245698d72` ("cgroup: Defer task cgroup unlink until after the task is done switching out") [2] `a72f73c4dd` ("cgroup: Don't expose dead tasks in cgroup") [3] `1b164b876c` ("cgroup: Wait for dying tasks to leave on rmdir") [4] `4c56a8ac68` ("cgroup: Fix cgroup_drain_dying() testing the wrong condition") [5] `13e786b64b` ("cgroup: Increment nr_dying_subsys_* from rmdir context") [1] moved task cset unlink from do_exit() to finish_task_switch() so a task's cset link drops only after the task has fully stopped scheduling. That made tasks past exit_signals() linger on cset->tasks until their final context switch, which led to a series of problems as what userspace expected to see after rmdir diverged from what the kernel needs to wait for. [2]-[5] tried to bridge that divergence: [2] filtered the exiting tasks from cgroup.procs; [3] had rmdir(2) sleep in TASK_UNINTERRUPTIBLE for them; [4] fixed the wait's condition; [5] made nr_dying_subsys_* visible synchronously. The cgroup_drain_dying() wait in [3] turned out to be a dead end. When the rmdir caller is also the reaper of a zombie that pins a pidns teardown (e.g. host PID 1 systemd reaping orphan pids that were re-parented to it during the same teardown), rmdir blocks in TASK_UNINTERRUPTIBLE waiting for those pids to free, the pids can't free because PID 1 is the reaper and it's stuck in rmdir, and the system A-A deadlocks. No internal lock ordering breaks this; the wait itself is the bug. The css killing side that drove the original reorder, however, can be made cleanly asynchronous: ->css_offline() is already async, run from css_killed_work_fn() driven by percpu_ref_kill_and_confirm(). The fix is to make that chain start only after all tasks have left the cgroup. rmdir's user-visible side then returns as soon as cgroup.procs and friends are empty, while ->css_offline() still runs only after the cgroup is fully drained. Verified by the original reproducer (pidns teardown + zombie reaper, runs under vng) which hangs vanilla and succeeds here, and by per-commit deterministic repros for [2], [3], [4], [5] with a boot parameter that widens the post-exit_signals() window so each state is reliably reachable. Some stress tests on top of that. cgroup_apply_control_disable() has the same shape of pre-existing race: when a controller is disabled via subtree_control, kill_css() ran synchronously while tasks past exit_signals() could still be linked to the cgroup's csets, and ->css_offline() could fire before they drained. This patch preserves the existing synchronous behavior at that call site (kill_css_sync() + kill_css_finish() back-to-back) and a follow-up patch will defer kill_css_finish() there using a per-css trigger. This seems like the right approach and I don't see problems with it. The changes are somewhat invasive but not excessively so, so backporting to -stable should be okay. If something does turn out to be wrong, the fallback is to revert the entire chain ([1]-[5]) and rework in the development branch instead. v2: Pin cgrp across the deferred destroy work with explicit cgroup_get()/cgroup_put() around queue_work() and the work_fn. v1 wasn't actually broken (ordered cgroup_offline_wq + queue_work order in cgroup_task_dead() saved it) but the explicit ref removes the dependency on those non-obvious invariants. Also note the pre-existing cgroup_apply_control_disable() race in the description; a follow-up will defer kill_css_finish() there. Fixes: `1b164b876c` ("cgroup: Wait for dying tasks to leave on rmdir") Cc: stable@vger.kernel.org # v7.0+ Reported-and-tested-by: Martin Pitt <martin@piware.de> Link: https://lore.kernel.org/all/afHNg2VX2jy9bW7y@piware.de/ Link: https://lore.kernel.org/all/35e0670adb4abeab13da2c321582af9f@kernel.org/ Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>	2026-05-04 08:52:26 -10:00
Linus Torvalds	cffcf520fd	Fix lockup in requeue-PI during signal/timeout wakeups, by Sebastian Andrzej Siewior. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmn2/AYRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1iQCA//cAsib3cTZr9GAvjX5+Fjf3Dsl4HdO7qr XzOeMNtvnz0wcWgNCq02vwutbQwzRRr71DqDzhYV7YGxwyrqw+fE0RMvQDEML3I6 SI1I8aj1Eo+WNHcy7HajYd0WBiOuMAcSZsa+3kYWKnDZ0GJadbQHTrQo5nT8VgFf o+aGU4kivGsKlz+UrcTxViNovenQ3mysuY8Pn3xKFlwn+vnJIwT2WUuQ1U8wK2OY edH9O4UEJkPFIOxqhL5+s4J/utsFasEFSLMpx9NpYzOGH85PTxIJg6O+n9a5NTAa 40tsXlWkDsi/AfiNbWBYOpw8gS2yHyrLuY9CLBuxiRojfS6TePzfJyCPFvLLhBau 90y02GskoDe4DFox9f+33BykR48yWxGOtxXiFJ1caXH4HsZi5z6Wd3vFCQp61zwm RPGKA5k8a9+hlToOpaAwHqA8ODbnEyRKwhG/OdnHo7cKPAWH+2awSSyW30DQoo+o mBYcMNbNeZObzQ/DErZvErDpq0hePATn/zfMFNEtXh+0Y1WZWix0NT5atHbtid+w 0tRaazUpNCpGNp+7xxFGmaHxN/bPCmjZXpTIWIhc6vn8DNx/Y059g3ItYyeiRVkZ WD0vWdBgYerK0CfYsllh5d4fiTSLKoILq/f5Zc5Pq/GnVAZe/JQy7v6Duj+HmJ9O g9es4fSjzBs= =HUBJ -----END PGP SIGNATURE----- Merge tag 'locking-urgent-2026-05-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fix from Ingo Molnar: "Fix lockup in requeue-PI during signal/timeout wakeups, by Sebastian Andrzej Siewior" * tag 'locking-urgent-2026-05-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: futex: Prevent lockup in requeue-PI during signal/ timeout wakeup	2026-05-03 08:17:09 -07:00
Linus Torvalds	c3cba36b39	Miscellaneous scheduler fixes: - Fix the delayed dequeue negative lag increase fix in the fair scheduler (Peter Zijlstra) - Fix wakeup_preempt_fair() to do proper delayed dequeue (Vincent Guittot) - Clear sched_entity::rel_deadline when initializing forked entities, which bug can cause all tasks to be EEVDF-ineligible, causing a NULL pointer dereference crash in pick_next_entity() (Zicheng Qu) Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmn2+tsRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1gDdg//WX8E7dyiwRaFNsPMgEvi4Q7pBoXAb5Zi /iYNlpw1/QRG9KF59175CC3cLOVLJ3dA+79EZPS/mQSuukdTxJg6sbPTkneULV3D l8DkjH0uDS7mZBqlLDC+Xjqv1y7Y01V6qN9Si8hpR5rP3D0heWdspAGS5aSZ/8Dp h/VSYl2R615Z32NaS0Xys2hy4M0/I30Uuw4fJScvIYeAMb4s5+6RHEQmuuo25A3O HQg9Ljqi5NQaHwzvHTYjCenh/NENOd/tu/kZzFgW57HJqSXGM7KBqcjaK68q4vCl LgBlsfux7RTbnrEAIhGYBSoDss+tBbMKm5qaYZNENJpLS8ptm4J3iKgUJ0W2e3dW Zp6IjVkj0E+qC65WnFENXsiDr+/fZ9d71/xq2L4z6SxQNv1mtX2+f/HUYyKU5jCc I4NDEBLGbdRVNuPW7esECVIEVRYFR1cPdZigLW8M7JEnr0p0skF1zgnnMtVuK6Ep qONYldUIHWdsx67yOSdhykSyq6Qfaew/UKuG1ivlN0BDL4I/AWf+4BVMHJrigeok xKD8DRWH6s7fSicfM2aJMmj6nRSR2Zz5a9T3lE4LxSvDm41JEnatqpb2Xhjri+I0 21slsm4AZmh1xR1kj7sTKxdHJn0E+lNN/XSZP6OcCoNqlr2XEGMRxWpVo/WqJnO6 HVG9/VoSP1w= =z+8E -----END PGP SIGNATURE----- Merge tag 'sched-urgent-2026-05-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: - Fix the delayed dequeue negative lag increase fix in the fair scheduler (Peter Zijlstra) - Fix wakeup_preempt_fair() to do proper delayed dequeue (Vincent Guittot) - Clear sched_entity::rel_deadline when initializing forked entities, which bug can cause all tasks to be EEVDF-ineligible, causing a NULL pointer dereference crash in pick_next_entity() (Zicheng Qu) * tag 'sched-urgent-2026-05-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Clear rel_deadline when initializing forked entities sched/fair: Fix wakeup_preempt_fair() vs delayed dequeue sched/fair: Fix the negative lag increase fix	2026-05-03 08:05:23 -07:00
Davidlohr Bueso	ee9dce4436	futex: Drop CLONE_THREAD requirement for private default hash alloc Currently need_futex_hash_allocate_default() depends on strict pthread semantics, abusing CLONE_THREAD. This breaks the non-concurrency assumptions when doing the mm->futex_ref pcpu allocations, leading to bugs[0] when sharing the mm in other ways; ie: BUG: KASAN: slab-use-after-free in futex_hash_put ... where the +1 bias can end up on a percpu counter that mm->futex_ref no longer points at. Loosen the check to cover any CLONE_VM clone, except vfork(). Excluding vfork keeps the existing paths untouched (no overhead), and we can't race in the first place: either the parent is suspended and the child runs alone, or mm->futex_ref is already allocated from an earlier CLONE_VM. Link: https://lore.kernel.org/all/CAL_bE8LsmCQ-FAtYDuwbJhOkt9p2wwYQwAbMh=PifC=VsiBM6A@mail.gmail.com/ [0] Fixes: `d9b05321e2` ("futex: Move futex_hash_free() back to __mmput()") Reported-by: Yiming Qian <yimingqian591@gmail.com> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-05-01 13:12:34 -07:00
Thomas Gleixner	010b7723c0	rseq: Don't advertise time slice extensions if disabled If time slice extensions have been disabled on the kernel command line, then advertising them in RSEQ flags is wrong. Adjust the conditionals to reflect reality, fixup the misleading comments about the gap of these flags and the rseq::flags field. Fixes: `d6200245c7` ("rseq: Allow registering RSEQ with slice extension") Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Link: https://patch.msgid.link/20260428224427.437059375%40kernel.org Cc: stable@vger.kernel.org	2026-05-01 21:32:20 +02:00
Thomas Gleixner	2cb68e4512	rseq: Set rseq::cpu_id_start to 0 on unregistration The RSEQ rework changed that to RSEQ_CPU_UNINITILIZED, which is obviously incompatible. Revert back to the original behavior. Fixes: `0f085b4188` ("rseq: Provide and use rseq_set_ids()") Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Link: https://patch.msgid.link/20260428224427.271566313%40kernel.org Cc: stable@vger.kernel.org	2026-05-01 21:32:20 +02:00
Linus Torvalds	2b4d0215be	20 hotfixes. All are for MM (and for MMish maintainers). 9 are cc:stable and the remainder are for post-7.0 issues or aren't deemed suitable for backporting. There's a 2 patch DAMON series from SeongJae Park which address races which could lead to use-after-free errors. And a 3 patch DAMON series which avoids the possibility of presenting stale parameter values to users. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCafPaVQAKCRDdBJ7gKXxA jiUxAQCceUQi6IqADUMhYAsbGcs1LoeMWfiMfbCz2NCoiTXAEwD/S2uqSELRPQQc 7iW6D7U6dTa3d2kkbnxC02ocekaxiQ4= =M/FI -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2026-04-30-15-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM fixes from Andrew Morton: "20 hotfixes. All are for MM (and for MMish maintainers). 9 are cc:stable and the remainder are for post-7.0 issues or aren't deemed suitable for backporting. There are two DAMON series from SeongJae Park which address races which could lead to use-after-free errors, and avoid the possibility of presenting stale parameter values to users" * tag 'mm-hotfixes-stable-2026-04-30-15-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: memcontrol: fix rcu unbalance in get_non_dying_memcg_end() mm/userfaultfd: detect VMA type change after copy retry in mfill_copy_folio_retry() MAINTAINERS: remove stale kdump project URL mm/damon/stat: detect and use fresh enabled value mm/damon/lru_sort: detect and use fresh enabled and kdamond_pid values mm/damon/reclaim: detect and use fresh enabled and kdamond_pid values selftests/mm: specify requirement for PROC_MEM_ALWAYS_FORCE=y mm/damon/sysfs-schemes: protect path kfree() with damon_sysfs_lock mm/damon/sysfs-schemes: protect memcg_path kfree() with damon_sysfs_lock MAINTAINERS: update Li Wang's email address MAINTAINERS, mailmap: update email address for Qi Zheng MAINTAINERS: update Liam's email address mm/hugetlb_cma: round up per_node before logging it MAINTAINERS: fix regex pattern in CORE MM category mm/vma: do not try to unmap a VMA if mmap_prepare() invoked from mmap() mm: start background writeback based on per-wb threshold for strictlimit BDIs kho: fix error handling in kho_add_subtree() liveupdate: fix return value on session allocation failure mailmap: update entry for Dan Carpenter vmalloc: fix buffer overflow in vrealloc_node_align()	2026-05-01 08:45:23 -07:00
Linus Torvalds	e75a43c7ce	tracing fixes for v7.1: - Fix inverted check of registering the stats for branch tracing When calling register_stat_tracer() which returns zero on success and negative on error, the callers were checking the return of zero as an error and printing a warning message. Because this was just a normal printk() message and not a WARN(), it wasn't caught in any testing. Fix the check to print the warning message when an error actually happens. - Fix a typo in a comment in tracepoint.h - Limit the size of event probes to 3K in size It is possible to create a dynamic event probe via the tracefs system that is greater than the max size of an event that the ring buffer can hold. This basically causes the event to become useless. Limit the size of an event probe to be 3K as that should be large enough to handle any dynamic events being created, and fits within the PAGE_SIZE sub-buffers of the ring buffer. -----BEGIN PGP SIGNATURE----- iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCafJn9hQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6quUaAP9ezC2VXWugRIhOEmK0pIpY90zddsyv xYTa3XuPP2r5JQEA8DcmH6fTo7JJxzbbyWPot/EgaqMt5XguqKC7txhjfww= =+e7q -----END PGP SIGNATURE----- Merge tag 'trace-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix inverted check of registering the stats for branch tracing When calling register_stat_tracer() which returns zero on success and negative on error, the callers were checking the return of zero as an error and printing a warning message. Because this was just a normal printk() message and not a WARN(), it wasn't caught in any testing. Fix the check to print the warning message when an error actually happens. - Fix a typo in a comment in tracepoint.h - Limit the size of event probes to 3K in size It is possible to create a dynamic event probe via the tracefs system that is greater than the max size of an event that the ring buffer can hold. This basically causes the event to become useless. Limit the size of an event probe to be 3K as that should be large enough to handle any dynamic events being created, and fits within the PAGE_SIZE sub-buffers of the ring buffer. * tag 'trace-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/probes: Limit size of event probe to 3K tracepoint: Fix typo in tracepoint.h comment tracing: branch: Fix inverted check on stat tracer registration	2026-04-29 22:21:44 -07:00
Steven Rostedt	b2aa3b4d64	tracing/probes: Limit size of event probe to 3K There currently isn't a max limit an event probe can be. One could make an event greater than PAGE_SIZE, which makes the event useless because if it's bigger than the max event that can be recorded into the ring buffer, then it will never be recorded. A event probe should never need to be greater than 3K, so make that the max size. As long as the max is less than the max that can be recorded onto the ring buffer, it should be fine. Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Fixes: `93ccae7a22` ("tracing/kprobes: Support basic types on dynamic events") Link: https://patch.msgid.link/20260428122302.706610ba@gandalf.local.home Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2026-04-29 16:07:38 -04:00
Tejun Heo	20e81c64c9	workqueue: Annotate alloc_workqueue_va() with __printf(1, 0) alloc_workqueue_va() forwards its va_list to __alloc_workqueue() which ultimately feeds vsnprintf(). __alloc_workqueue() already carries __printf(1, 0); the new wrapper needs the same annotation so format string checking propagates through the forwarding. Fixes: `0de4cb473a` ("workqueue: fix devm_alloc_workqueue() va_list misuse") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202604300347.2LgXyteh-lkp@intel.com/ Signed-off-by: Tejun Heo <tj@kernel.org>	2026-04-29 09:44:16 -10:00
Sebastian Andrzej Siewior	bc7304f3ae	futex: Prevent lockup in requeue-PI during signal/ timeout wakeup During wait-requeue-pi (task A) and requeue-PI (task B) the following race can happen: Task A Task B futex_wait_requeue_pi() futex_setup_timer() futex_do_wait() futex_requeue() CLASS(hb, hb1)(&key1); CLASS(hb, hb2)(&key2); timeout futex_requeue_pi_wakeup_sync() requeue_state = Q_REQUEUE_PI_IGNORE blocks on hb->lock futex_proxy_trylock_atomic() futex_requeue_pi_prepare() Q_REQUEUE_PI_IGNORE => -EAGAIN double_unlock_hb(hb1, hb2) retry Task B acquires both hb locks and attempts to acquire the PI-lock of the top most waiter (task B). Task A is leaving early due to a signal/ timeout and started removing itself from the queue. It updates its requeue_state but can not remove it from the list because this requires the hb lock which is owned by task B. Usually task A is able to swoop the lock after task B unlocked it. However if task B is of higher priority then task A may not be able to wake up in time and acquire the lock before task B gets it again. Especially on a UP system where A is never scheduled. As a result task A blocks on the lock and task B busy loops, trying to make progress but live locks the system instead. Tragic. This can be fixed by removing the top most waiter from the list in this case. This allows task B to grab the next top waiter (if any) in the next iteration and make progress. Remove the top most waiter if futex_requeue_pi_prepare() fails. Let the waiter conditionally remove itself from the list in handle_early_requeue_pi_wakeup(). Fixes: `07d91ef510` ("futex: Prevent requeue_pi() lock nesting issue on RT") Reported-by: Moritz Klammler <Moritz.Klammler@ferchau.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260428103425.dywXyPd3@linutronix.de Closes: https://lore.kernel.org/all/VE1PR06MB6894BE61C173D802365BE19DFF4CA@VE1PR06MB6894.eurprd06.prod.outlook.com	2026-04-29 08:56:40 +02:00
Linus Torvalds	664f0f6be3	sched_ext: Fixes for v7.1-rc1 The merge window pulled in the cgroup sub-scheduler infrastructure, and new AI reviews are accelerating bug reporting and fixing - hence the larger than usual fixes batch. - Use-after-frees during scheduler load/unload. The disable path could free the BPF scheduler while deferred irq_work / kthread work was still in flight; cgroup setter callbacks read the active scheduler outside the rwsem that synchronizes against teardown. Fixed both, and reused the disable drain in the enable error paths so the BPF JIT page can't be freed under live callbacks. - Several BPF op invocations didn't tell the framework which runqueue was already locked, so helper kfuncs that re-acquire the runqueue by CPU could deadlock on the held lock. Fixed at the affected callsites, including recursive parent-into-child dispatch. - The hardlockup notifier ran from NMI but eventually took a non-NMI-safe lock. Bounced through irq_work. - A handful of bugs in the new sub-scheduler hierarchy: helper kfuncs hard-coded the root instead of resolving the caller's scheduler; the enable error path tried to disable per-task state that had never been initialized, and leaked cpus_read_lock on the way out; a sysfs object was leaked on every load/unload; the dispatch fast-path used the root scheduler instead of the task's; and a couple of CONFIG #ifdef guards were misclassified. - Verifier-time hardening: BPF programs of unrelated struct_ops types (e.g. tcp_congestion_ops) could call sched_ext kfuncs - a semantic bug and, once sub-sched was enabled, a KASAN out-of-bounds read. Now rejected at load. Plus a few NULL and cross-task argument checks on sched_ext kfuncs, and a selftest covering the new deny. - rhashtable (Herbert): restored the insecure_elasticity toggle and bounced the deferred-resize kick through irq_work to break a lock-order cycle observable from raw-spinlock callers. sched_ext's scheduler-instance hash is the first user of both. - The bypass-mode load balancer used file-scope cpumasks; with multiple scheduler instances now possible, those raced. Moved per-instance, plus a follow-up to skip tasks whose recorded CPU is stale relative to the new owning runqueue. - Smaller fixes: a dispatch queue's first-task tracking misbehaved when a parked iterator cursor sat in the list; the runqueue's next-class wasn't promoted on local-queue enqueue, leaving an SCX task behind RT in edge cases; the reference qmap scheduler stopped erroring on legitimate cross-scheduler task-storage misses. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCafEN/A4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGaydAQCxWrUqnZXhxF4LnztjxTF2tgv7p8P7TbpS6aU6 etqRpAEA9RFmIXs7XrhwCm0n2BwSgjvrNxnWfPhWvuH0uN0GTAA= =wLna -----END PGP SIGNATURE----- Merge tag 'sched_ext-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: "The merge window pulled in the cgroup sub-scheduler infrastructure, and new AI reviews are accelerating bug reporting and fixing - hence the larger than usual fixes batch: - Use-after-frees during scheduler load/unload: - The disable path could free the BPF scheduler while deferred irq_work / kthread work was still in flight - cgroup setter callbacks read the active scheduler outside the rwsem that synchronizes against teardown Fix both, and reuse the disable drain in the enable error paths so the BPF JIT page can't be freed under live callbacks. - Several BPF op invocations didn't tell the framework which runqueue was already locked, so helper kfuncs that re-acquire the runqueue by CPU could deadlock on the held lock Fix the affected callsites, including recursive parent-into-child dispatch. - The hardlockup notifier ran from NMI but eventually took a non-NMI-safe lock. Bounce it through irq_work. - A handful of bugs in the new sub-scheduler hierarchy: - helper kfuncs hard-coded the root instead of resolving the caller's scheduler - the enable error path tried to disable per-task state that had never been initialized, and leaked cpus_read_lock on the way out - a sysfs object was leaked on every load/unload - the dispatch fast-path used the root scheduler instead of the task's - a couple of CONFIG #ifdef guards were misclassified - Verifier-time hardening: BPF programs of unrelated struct_ops types (e.g. tcp_congestion_ops) could call sched_ext kfuncs - a semantic bug and, once sub-sched was enabled, a KASAN out-of-bounds read. Now rejected at load. Plus a few NULL and cross-task argument checks on sched_ext kfuncs, and a selftest covering the new deny. - rhashtable (Herbert): restore the insecure_elasticity toggle and bounce the deferred-resize kick through irq_work to break a lock-order cycle observable from raw-spinlock callers. sched_ext's scheduler-instance hash is the first user of both. - The bypass-mode load balancer used file-scope cpumasks; with multiple scheduler instances now possible, those raced. Move to per-instance cpumasks, plus a follow-up to skip tasks whose recorded CPU is stale relative to the new owning runqueue. - Smaller fixes: - a dispatch queue's first-task tracking misbehaved when a parked iterator cursor sat in the list - the runqueue's next-class wasn't promoted on local-queue enqueue, leaving an SCX task behind RT in edge cases - the reference qmap scheduler stopped erroring on legitimate cross-scheduler task-storage misses" * tag 'sched_ext-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (26 commits) sched_ext: Fix scx_flush_disable_work() UAF race sched_ext: Call wakeup_preempt() in local_dsq_post_enq() sched_ext: Release cpus_read_lock on scx_link_sched() failure in root enable sched_ext: Reject NULL-sch callers in scx_bpf_task_set_slice/dsq_vtime sched_ext: Refuse cross-task select_cpu_from_kfunc calls sched_ext: Align cgroup #ifdef guards with SUB_SCHED vs GROUP_SCHED sched_ext: Make bypass LB cpumasks per-scheduler sched_ext: Pass held rq to SCX_CALL_OP() for core_sched_before sched_ext: Pass held rq to SCX_CALL_OP() for dump_cpu/dump_task sched_ext: Save and restore scx_locked_rq across SCX_CALL_OP sched_ext: Use dsq->first_task instead of list_empty() in dispatch_enqueue() FIFO-tail sched_ext: Resolve caller's scheduler in scx_bpf_destroy_dsq() / scx_bpf_dsq_nr_queued() sched_ext: Read scx_root under scx_cgroup_ops_rwsem in cgroup setters sched_ext: Don't disable tasks in scx_sub_enable_workfn() abort path sched_ext: Skip tasks with stale task_rq in bypass_lb_cpu() sched_ext: Guard scx_dsq_move() against NULL kit->dsq after failed iter_new sched_ext: Unregister sub_kset on scheduler disable sched_ext: Defer scx_hardlockup() out of NMI sched_ext: sync disable_irq_work in bpf_scx_unreg() sched_ext: Fix local_dsq_post_enq() to use task's scheduler in sub-sched ...	2026-04-28 16:26:11 -07:00
Breno Leitao	3b75dd76e6	tracing: branch: Fix inverted check on stat tracer registration init_annotated_branch_stats() and all_annotated_branch_stats() check the return value of register_stat_tracer() with "if (!ret)", but register_stat_tracer() returns 0 on success and a negative errno on failure. The inverted check causes the warning to be printed on every successful registration, e.g.: Warning: could not register annotated branches stats while leaving real failures silent. The initcall also returned a hard-coded 1 instead of the actual error. Invert the check and propagate ret so that the warning fires on real errors and the initcall reports the correct status. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: https://patch.msgid.link/20260420-tracing-v1-1-d8f4cd0d6af1@debian.org Fixes: `002bb86d8d` ("tracing/ftrace: separate events tracing and stats tracing engine") Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2026-04-28 14:28:29 -04:00
Cheng-Yang Chou	d99f7a32f0	sched_ext: Fix scx_flush_disable_work() UAF race scx_flush_disable_work() calls irq_work_sync() followed by kthread_flush_work() to ensure that the disable kthread work has fully completed before bpf_scx_unreg() frees the SCX scheduler. However, a concurrent scx_vexit() (e.g., triggered by a watchdog stall) creates a race window between scx_claim_exit() and irq_work_queue(): CPU A (scx_vexit (watchdog)) CPU B (bpf_scx_unreg) ---- ---- scx_claim_exit() atomic_try_cmpxchg(NONE->kind) stack_trace_save() vscnprintf() scx_disable() scx_claim_exit() -> FAIL scx_flush_disable_work() irq_work_sync() // no-op: not queued yet kthread_flush_work() // no-op: not queued yet kobject_put(&sch->kobj) -> free %sch irq_work_queue() -> UAF on %sch scx_disable_irq_workfn() kthread_queue_work() -> UAF The root cause is that CPU B's scx_flush_disable_work() returns after syncing an irq_work that has not yet been queued, while CPU A is still executing the code between scx_claim_exit() and irq_work_queue(). Loop until exit_kind reaches SCX_EXIT_DONE or SCX_EXIT_NONE, draining disable_irq_work and disable_work in each pass. This ensures that any work queued after the previous check is caught, while also correctly handling cases where no disable was triggered (e.g., the scx_sub_enable_workfn() abort path). Fixes: `510a270554` ("sched_ext: sync disable_irq_work in bpf_scx_unreg()") Reported-by: https://sashiko.dev/#/patchset/20260424100221.32407-1-icheng%40nvidia.com Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-04-28 07:40:03 -10:00
Kuba Piecuch	163f8b7f9a	sched_ext: Call wakeup_preempt() in local_dsq_post_enq() There are several edge cases (see linked thread) where an IMMED task can be left lingering on a local DSQ if an RT task swoops in at the wrong time. All of these edge cases are due to rq->next_class being idle even after dispatching a task to rq's local DSQ. We should bump rq->next_class to &ext_sched_class as soon as we've inserted a task into the local DSQ. To optimize the common case of rq->next_class == &ext_sched_class, only call wakeup_preempt() if rq->next_class is below EXT. If next_class is EXT or above, wakeup_preempt() is a no-op anyway. This lets us also simplify the preempt_curr() logic a bit since wakeup_preempt() will call preempt_curr() for us if next_class is below EXT. Link: https://lore.kernel.org/all/DHZPHUFXB4N3.2RY28MUEWBNYK@google.com/ Signed-off-by: Kuba Piecuch <jpiecuch@google.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-04-28 06:28:48 -10:00
Breno Leitao	0de4cb473a	workqueue: fix devm_alloc_workqueue() va_list misuse devm_alloc_workqueue() built a va_list and passed it as a single positional argument to the variadic alloc_workqueue() macro: va_start(args, max_active); wq = alloc_workqueue(fmt, flags, max_active, args); va_end(args); C does not allow forwarding a va_list through a ... parameter. alloc_workqueue() expands to alloc_workqueue_noprof(), which runs its own va_start() over its ... params, so the inner vsnprintf(wq->name, sizeof(wq->name), fmt, args) in __alloc_workqueue() received the outer va_list object as the first variadic slot rather than the caller's actual format arguments. Add a new static helper alloc_workqueue_va() that wraps __alloc_workqueue() and runs wq_init_lockdep() on success, and fold both alloc_workqueue_noprof() and devm_alloc_workqueue_noprof() onto it as suggested by Tejun. The wq_init_lockdep() step is required on the devm path too, otherwise __flush_workqueue()'s on-stack COMPLETION_INITIALIZER_ONSTACK_MAP would NULL-deref wq->lockdep_map. No caller changes are required. devm_alloc_ordered_workqueue() is a macro forwarding to devm_alloc_workqueue() and inherits the fix. Two in-tree callers actively trigger the broken path on every probe: drivers/power/supply/mt6370-charger.c:889 drivers/power/supply/max77705_charger.c:649 both of which use devm_alloc_ordered_workqueue(dev, "%s", 0, dev_name(dev)). A standalone reproducer module is available at[1]. Link: https://github.com/leitao/debug/blob/main/workqueue/valist/wq_va_test.c [1] Fixes: `1dfc9d60a6` ("workqueue: devres: Add device-managed allocate workqueue") Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-04-28 06:13:40 -10:00
Zicheng Qu	3da56dc063	sched/fair: Clear rel_deadline when initializing forked entities A yield-triggered crash can happen when a newly forked sched_entity enters the fair class with se->rel_deadline unexpectedly set. The failing sequence is: 1. A task is forked while se->rel_deadline is still set. 2. __sched_fork() initializes vruntime, vlag and other sched_entity state, but does not clear rel_deadline. 3. On the first enqueue, enqueue_entity() calls place_entity(). 4. Because se->rel_deadline is set, place_entity() treats se->deadline as a relative deadline and converts it to an absolute deadline by adding the current vruntime. 5. However, the forked entity's deadline is not a valid inherited relative deadline for this new scheduling instance, so the conversion produces an abnormally large deadline. 6. If the task later calls sched_yield(), yield_task_fair() advances se->vruntime to se->deadline. 7. The inflated vruntime is then used by the following enqueue path, where the vruntime-derived key can overflow when multiplied by the entity weight. 8. This corrupts cfs_rq->sum_w_vruntime, breaks EEVDF eligibility calculation, and can eventually make all entities appear ineligible. pick_next_entity() may then return NULL unexpectedly, leading to a later NULL dereference. A captured trace shows the effect clearly. Before yield, the entity's vruntime was around: 9834017729983308 After yield_task_fair() executed: se->vruntime = se->deadline the vruntime jumped to: 19668035460670230 and the deadline was later advanced further to: 19668035463470230 This shows that the deadline had already become abnormally large before yield_task_fair() copied it into vruntime. rel_deadline is only meaningful when se->deadline really carries a relative deadline that still needs to be placed against vruntime. A freshly forked sched_entity should not inherit or retain this state. Clear se->rel_deadline in __sched_fork(), together with the other sched_entity runtime state, so that the first enqueue does not interpret the new entity's deadline as a stale relative deadline. Fixes: `82e9d0456e` ("sched/fair: Avoid re-setting virtual deadline on 'migrations'") Analyzed-by: Hui Tang <tanghui20@huawei.com> Analyzed-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260424071113.1199600-1-quzicheng@huawei.com	2026-04-28 09:19:54 +02:00
Vincent Guittot	ac8e69e693	sched/fair: Fix wakeup_preempt_fair() vs delayed dequeue Similar to how pick_next_entity() must dequeue delayed entities, so too must wakeup_preempt_fair(). Any delayed task being found means it is eligible and hence past the 0-lag point, ready for removal. Worse, by not removing delayed entities from consideration, it can skew the preemption decision, with the end result that a short slice wakeup will not result in a preemption. tip/sched/core tip/sched/core +this patch cyclictest slice (ms) (default)2.8 8 8 hackbench slice (ms) (default)2.8 20 20 Total Samples \| 22559 22595 22683 Average (us) \| 157 64( 59%) 59( 8%) Median (P50) (us) \| 57 57( 0%) 58(- 2%) 90th Percentile (us) \| 64 60( 6%) 60( 0%) 99th Percentile (us) \| 2407 67( 97%) 67( 0%) 99.9th Percentile (us) \| 3400 2288( 33%) 727( 68%) Maximum (us) \| 5037 9252(-84%) 7461( 19%) Fixes: `f12e148892` ("sched/fair: Prepare pick_next_task() for delayed dequeue") Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260422093400.319251-1-vincent.guittot@linaro.org	2026-04-28 09:19:54 +02:00
Peter Zijlstra	c5cd6fd75b	sched/fair: Fix the negative lag increase fix Vincent reported that my rework of his original patch lost a little something. Specifically it got the return value wrong; it should not compare against the old se->vlag, but rather against the current value. Since the thing that matters is if the effective vruntime of an entity is affected and the thing needs repositioning or not. Fixes: `059258b0d4` ("sched/fair: Prevent negative lag increase during delayed dequeue") Reported-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://patch.msgid.link/20260423094107.GT3102624%40noisy.programming.kicks-ass.net	2026-04-28 09:19:54 +02:00
Linus Torvalds	3b3bea6d4b	cgroup: Fixes for v7.1-rc1 - Fix UAF race in psi pressure_write() against cgroup file release by extending cgroup_mutex coverage and ordering of->priv access after cgroup_kn_lock_live(). - Fix integer overflow in rdmacg_try_charge() when usage equals INT_MAX by performing the increment in s64. - Fix asymmetric DL bandwidth accounting on cpuset attach rollback by recording the CPU used by dl_bw_alloc() so cancel_attach() returns the reservation to the same root domain. - Fix nr_dying_subsys_* race that briefly showed 0 in cgroup.stat after rmdir by incrementing from kill_css() instead of offline_css(). - Typo fix in cgroup-v2 documentation. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCae+xjw4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGaIUAQD2hJ7ELRDXAtXzL1Ck1zH8vESvbX8syFfkSO6L IgtovQEA4Tk7/RIO3HfBxBjgp6Q5vo7C7Biz4ye7fCu/ry7x3Qk= =pypQ -----END PGP SIGNATURE----- Merge tag 'cgroup-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - Fix UAF race in psi pressure_write() against cgroup file release by extending cgroup_mutex coverage and ordering of->priv access after cgroup_kn_lock_live() - Fix integer overflow in rdmacg_try_charge() when usage equals INT_MAX by performing the increment in s64 - Fix asymmetric DL bandwidth accounting on cpuset attach rollback by recording the CPU used by dl_bw_alloc() so cancel_attach() returns the reservation to the same root domain - Fix nr_dying_subsys_* race that briefly showed 0 in cgroup.stat after rmdir by incrementing from kill_css() instead of offline_css() - Typo fix in cgroup-v2 documentation * tag 'cgroup-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: docs: cgroup: fix typo 'protetion' -> 'protection' cgroup: Increment nr_dying_subsys_* from rmdir context cgroup/cpuset: record DL BW alloc CPU for attach rollback cgroup/rdma: fix integer overflow in rdmacg_try_charge() sched/psi: fix race between file release and pressure write	2026-04-27 16:51:27 -07:00
Breno Leitao	9ec9532989	kho: fix error handling in kho_add_subtree() Fix two error handling issues in kho_add_subtree(), where it doesn't handle the error path correctly. 1. If fdt_setprop() fails after the subnode has been created, the subnode is not removed. This leaves an incomplete node in the FDT (missing "preserved-data" or "blob-size" properties). 2. The fdt_setprop() return value (an FDT error code) is stored directly in err and returned to the caller, which expects -errno. Fix both by storing fdt_setprop() results in fdt_err, jumping to a new out_del_node label that removes the subnode on failure, and only setting err = 0 on the success path, otherwise returning -ENOMEM (instead of FDT_ERR_ errors that would come from fdt_setprop). No user-visible changes. This patch fixes error handling in the KHO (Kexec HandOver) subsystem, which is used to preserve data across kexec reboots. The fix only affects a rare failure path during kexec preparation — specifically when the kernel runs out of space in the Flattened Device Tree buffer while registering preserved memory regions. In the unlikely event that this error path was triggered, the old code would leave a malformed node in the device tree and return an incorrect error code to the calling subsystem, which could lead to confusing log messages or incorrect recovery decisions. With this fix, the incomplete node is properly cleaned up and the appropriate errno value is propagated, this error code is not returned to the user. Link: https://lore.kernel.org/20260410-kho_fix_send-v2-1-1b4debf7ee08@debian.org Fixes: `3dc92c3114` ("kexec: add Kexec HandOver (KHO) generation helpers") Signed-off-by: Breno Leitao <leitao@debian.org> Suggested-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Breno Leitao <leitao@debian.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-27 05:54:23 -07:00
Pasha Tatashin	0562b572ce	liveupdate: fix return value on session allocation failure When session allocation fails during deserialization, the global 'err' variable was not updated before returning. This caused subsequent calls to luo_session_deserialize() to incorrectly report success. Ensure 'err' is set to the error code from PTR_ERR(session). This ensures that an error is correctly returned to userspace when it attempts to open /dev/liveupdate in the new kernel if deserialization failed. Link: https://lore.kernel.org/20260415193738.515491-1-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org> Cc: David Matlack <dmatlack@google.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Samiullah Khawaja <skhawaja@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-27 05:54:23 -07:00
Tejun Heo	deb7b2f93d	sched_ext: Release cpus_read_lock on scx_link_sched() failure in root enable scx_root_enable_workfn() takes cpus_read_lock() before scx_link_sched(sch), but the `if (ret) goto err_disable` on failure skips the matching cpus_read_unlock() - all other err_disable gotos along this path drop the lock first. scx_link_sched() only returns non-zero on the sub-sched path (parent != NULL), so the leak path is unreachable via the root caller today. Still, the unwind is out of line with the surrounding paths. Drop cpus_read_lock() before goto err_disable. v2: Correct Fixes: tag (Andrea Righi). Fixes: `25037af712` ("sched_ext: Add rhashtable lookup for sub-schedulers") Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-04-24 14:31:36 -10:00
Tejun Heo	05b4a9a9bc	sched_ext: Reject NULL-sch callers in scx_bpf_task_set_slice/dsq_vtime scx_prog_sched(aux) returns NULL for TRACING / SYSCALL BPF progs that have no struct_ops association when the root scheduler has sub_attach set. scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime() pass that NULL into scx_task_on_sched(sch, p), which under CONFIG_EXT_SUB_SCHED is rcu_access_pointer(p->scx.sched) == sch. For any non-scx task p->scx.sched is NULL, so NULL == NULL returns true and the authority gate is bypassed - a privileged but non-struct_ops-associated prog can poke p->scx.slice / p->scx.dsq_vtime on arbitrary tasks. Reject !sch up front so the gate only admits callers with a resolved scheduler. Fixes: `245d09c594` ("sched_ext: Enforce scheduler ownership when updating slice and dsq_vtime") Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-04-24 14:31:36 -10:00
Tejun Heo	ea7c716a24	sched_ext: Refuse cross-task select_cpu_from_kfunc calls select_cpu_from_kfunc() skipped pi_lock for @p when called from ops.select_cpu() or another rq-locked SCX op, assuming the held lock protects @p. scx_bpf_select_cpu_dfl() / __scx_bpf_select_cpu_and() accept an arbitrary KF_RCU task_struct, so a caller in e.g. ops.select_cpu(p1) or ops.enqueue(p1) can pass some other p2 - the held pi_lock / rq lock is p1's, not p2's - and reading p2->cpus_ptr / nr_cpus_allowed races with set_cpus_allowed_ptr() and migrate_disable_switch() on another CPU. Abort the scheduler on cross-task calls in both branches: for ops.select_cpu() use scx_kf_arg_task_ok() to verify @p is the wake-up task recorded in current->scx.kf_tasks[] by SCX_CALL_OP_TASK_RET(); for other rq-locked SCX ops compare task_rq(p) against scx_locked_rq(). v2: Switch the in_select_cpu cross-task check from direct_dispatch_task comparison to scx_kf_arg_task_ok(). The former spuriously rejects when ops.select_cpu() calls scx_bpf_dsq_insert() first, then calls scx_bpf_select_cpu_*() on the same task. (Andrea Righi) Fixes: `0022b32850` ("sched_ext: Decouple kfunc unlocked-context check from kf_mask") Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andrea Righi <arighi@nvidia.com>	2026-04-24 14:31:36 -10:00
Tejun Heo	c0e8ddc76d	sched_ext: Align cgroup #ifdef guards with SUB_SCHED vs GROUP_SCHED Two EXT_GROUP_SCHED/SUB_SCHED guards are misclassified: - scx_root_enable_workfn()'s cgroup_get(cgrp) and the err_put_cgrp unwind in scx_alloc_and_add_sched() are under `#if GROUP \|\| SUB`, but the matching cgroup_put() in scx_sched_free_rcu_work() is inside `#ifdef SUB` only (via sch->cgrp, stored only under SUB). GROUP-only would leak a reference on every root-sched enable. - sch_cgroup() / set_cgroup_sched() live under `#if GROUP \|\| SUB` but touch SUB-only fields (sch->cgrp, cgroup->scx_sched). GROUP-only wouldn't compile. GROUP needs CGROUP_SCHED; SUB needs only CGROUPS. CGROUPS=y/CGROUP_SCHED=n gives the reachable GROUP=n, SUB=y combination; GROUP=y, SUB=n isn't reachable today (SUB is def_bool y under CGROUPS). Neither miscategorization triggers a real bug in any reachable config, but keep the guards honest: - Narrow cgroup_get and err_put_cgrp to `#ifdef SUB` (matches the free-side put). - Move sch_cgroup() and set_cgroup_sched() to a separate `#ifdef SUB` block with no-op stubs for the !SUB case; keep root_cgroup() and scx_cgroup_{ lock,unlock}() under `#if GROUP \|\| SUB` since those only need cgroup core. Fixes: `ebeca1f930` ("sched_ext: Introduce cgroup sub-sched support") Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-04-24 14:31:36 -10:00
Tejun Heo	d292aa00de	sched_ext: Make bypass LB cpumasks per-scheduler scx_bypass_lb_{donee,resched}_cpumask were file-scope statics shared by all scheduler instances. With CONFIG_EXT_SUB_SCHED, multiple sched instances each arm their own bypass_lb_timer; concurrent bypass_lb_node() calls RMW the global cpumasks with no lock, corrupting donee/resched decisions. Move the cpumasks into struct scx_sched, allocate them alongside the timer in scx_alloc_and_add_sched(), free them in scx_sched_free_rcu_work(). Fixes: `95d1df610c` ("sched_ext: Implement load balancer for bypass mode") Cc: stable@vger.kernel.org # v6.19+ Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-04-24 14:31:36 -10:00
Tejun Heo	4155fb489f	sched_ext: Pass held rq to SCX_CALL_OP() for core_sched_before scx_prio_less() runs from core-sched's pick_next_task() path with rq locked but invokes ops.core_sched_before() with NULL locked_rq, leaving scx_locked_rq_state NULL. If the BPF callback calls a kfunc that re-acquires rq based on scx_locked_rq() - e.g. scx_bpf_cpuperf_set(cpu) - it re-acquires the already-held rq. Pass task_rq(a). Fixes: `7b0888b7cc` ("sched_ext: Implement core-sched support") Cc: stable@vger.kernel.org # v6.12+ Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-04-24 14:31:36 -10:00
Tejun Heo	207d76a372	sched_ext: Pass held rq to SCX_CALL_OP() for dump_cpu/dump_task scx_dump_state() walks CPUs with rq_lock_irqsave() held and invokes ops.dump_cpu / ops.dump_task with NULL locked_rq, leaving scx_locked_rq_state NULL. If the BPF callback calls a kfunc that re-acquires rq based on scx_locked_rq() - e.g. scx_bpf_cpuperf_set(cpu) - it re-acquires the already-held rq. Pass the held rq to SCX_CALL_OP(). Thread it into scx_dump_task() too. The pre-loop ops.dump call runs before rq_lock_irqsave() so keeps rq=NULL. Fixes: `07814a9439` ("sched_ext: Print debug dump after an error exit") Cc: stable@vger.kernel.org # v6.12+ Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-04-24 14:31:36 -10:00
Tejun Heo	7fb39e4eb4	sched_ext: Save and restore scx_locked_rq across SCX_CALL_OP SCX_CALL_OP{,_RET}() unconditionally clears scx_locked_rq_state to NULL on exit. Correct at the top level, but ops can recurse via scx_bpf_sub_dispatch(): a parent's ops.dispatch calls the helper, which invokes the child's ops.dispatch under another SCX_CALL_OP. When the inner call returns, the NULL clobbers the outer's state. The parent's BPF then calls kfuncs like scx_bpf_cpuperf_set() which read scx_locked_rq()==NULL and re-acquire the already-held rq. Snapshot scx_locked_rq_state on entry and restore on exit. Rename the rq parameter to locked_rq across all SCX_CALL_OP* macros so the snapshot local can be typed as 'struct rq *' without colliding with the parameter token in the expansion. SCX_CALL_OP_TASK{,_RET}() and SCX_CALL_OP_2TASKS_RET() funnel through the two base macros and inherit the fix. Fixes: `4f8b122848` ("sched_ext: Add basic building blocks for nested sub-scheduler dispatching") Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2026-04-24 14:31:36 -10:00

1 2 3 4 5 ...

51779 Commits