Commit Graph

5323 Commits

Linus Torvalds
7f00232152 Miscellaneous scheduler fixes:
- Fix spurious failures in rseq self-tests (Mark Brown)
 
  - Fix rseq rseq::cpu_id_start ABI regression due to
    TCMalloc's creative use of the supposedly read-only field.
    The fix is to introduce a new ABI variant based on
    a new (larger) rseq area registration size, to keep
    the TCMalloc use of rseq backwards compatible on new kernels.
    (Thomas Gleixner)
 
  - Fix wakeup_preempt_fair() for not waking up task (Vincent Guittot)
 
  - Fix s64 mult overflow in vruntime_eligible() (Zhan Xusheng)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmn+lBARHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iy1hAAunlBoDq8/MXSt4JeMRX/3p+CKihExTnO
 LO535Rv8DfcepBgZIysTJKMn9bM/l+7OXGdQ+YDjS70GsLM2aOzDYBKCwOHHm0pZ
 OJ6Y+UUFAacnQS4EuQLqyNBW0Ice4AIYWu0pLLADs2KUgX1DmSo9bhgZbHcbsMnA
 IjoaFNhebeA1bHSDD11UIHTza23mqEinxM0yOK8pT+M6fOMXWOo/kLLLYjG/yAIB
 qBFGpwJkdKjBcpCmAYU9jpw26p/17YMzkgmAaUXOKRLZi+h5zQMNVjR+OIjK4qxt
 z5Tj+h7t3IcFV2d1zUThPpxxHLn3ro30R5mW0OrsPPFI8AkSRC6GsIX/Ft9uFbQQ
 1SGknyx5qLrSldmT8KKXPlM/vriyh3iL6/QMgXtTb8FfegRCbjXsZy39s3wOexCD
 oBnJt0rX3NviPyb/Up9cfdx++kPfJM074NdVHRBW4ucoOpwHosNcuBP0YQsctFw0
 QO7lkfjTo19eB9ftSyfadwF9e+2jYF7YoLmTOKqGLbZEQeIT5tJHDgHklLWrqHAh
 HDmDCHyqtiXBDCeapqnomrhqczKlSUK1Qqk/Hh+Hwiwj0N4vLhHXmRIpTb/R43/Y
 i6cdwV8Xtl+RltmYaINpMxdnFl/iz/kpapYkqy/ykLNfBj01AWbdaIVPRk47ndU1
 E29BSjMZhPU=
 =2QJp
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2026-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar:

 - Fix spurious failures in rseq self-tests (Mark Brown)

 - Fix rseq rseq::cpu_id_start ABI regression due to TCMalloc's creative
   use of the supposedly read-only field

   The fix is to introduce a new ABI variant based on a new (larger)
   rseq area registration size, to keep the TCMalloc use of rseq
   backwards compatible on new kernels (Thomas Gleixner)

 - Fix wakeup_preempt_fair() for not waking up task (Vincent Guittot)

 - Fix s64 mult overflow in vruntime_eligible() (Zhan Xusheng)

* tag 'sched-urgent-2026-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix wakeup_preempt_fair() for not waking up task
  sched/fair: Fix overflow in vruntime_eligible()
  selftests/rseq: Expand for optimized RSEQ ABI v2
  rseq: Reenable performance optimizations conditionally
  rseq: Implement read only ABI enforcement for optimized RSEQ V2 mode
  selftests/rseq: Validate legacy behavior
  selftests/rseq: Make registration flexible for legacy and optimized mode
  selftests/rseq: Skip tests if time slice extensions are not available
  rseq: Revert to historical performance killing behaviour
  rseq: Don't advertise time slice extensions if disabled
  rseq: Protect rseq_reset() against interrupts
  rseq: Set rseq::cpu_id_start to 0 on unregistration
  selftests/rseq: Don't run tests with runner scripts outside of the scripts
2026-05-08 19:42:10 -07:00
Vincent Guittot
9f6d929ee2 sched/fair: Fix wakeup_preempt_fair() for not waking up task
Make sure to only call pick_next_entity() on a non-empty cfs_rq.

The assumption that p is always enqueued and not delayed is only true for
wakeup. If p was moved while delayed, pick_next_entity() will dequeue it and
the cfs_rq might become empty. Test whether there are still queued tasks before
trying again to determine if p could be the next one to be picked.

There are at least 2 cases:

When a cfs_rq becomes idle, it tries to pull tasks, but if those pulled tasks
are delayed, they will be dequeued when attached to the cfs_rq:
attach_tasks() -> attach_task() -> wakeup_preempt(rq, p, 0);

A misfit task running on cfs A triggers a load balance so it can be pulled to a
better CPU; the load balance on cfs B starts an active load balance to pull the
running misfit task. If there is a delayed-dequeue task on cfs A, it can be
pulled instead of the previously running misfit task:
attach_one_task() -> attach_task() -> wakeup_preempt(rq, p, 0);

Fixes: ac8e69e693 ("sched/fair: Fix wakeup_preempt_fair() vs delayed dequeue")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260503104503.1732682-1-vincent.guittot@linaro.org
2026-05-06 17:41:18 +02:00
Zhan Xusheng
b6eee96843 sched/fair: Fix overflow in vruntime_eligible()
Zhan Xusheng reported running into a sporadic s64 mult overflow in
vruntime_eligible().

When constructing a worst case scenario:

If you have cgroups, then you can have an entity of weight 2 (per
calc_group_shares()), and its vlag should then be bounded by: (slice+TICK_NSEC)
* NICE_0_LOAD, which is around 44 bits as per the comment on entity_key().

The other extreme is 100*NICE_0_LOAD, thus you get:

{key, weight}[] := {
  puny: { (slice + TICK_NSEC) * NICE_0_LOAD, 2               },
  max:  { 0,                                 100*NICE_0_LOAD },
}

The avg_vruntime() would end up being very close to 0 (which is
zero_vruntime), so no real help making that more accurate.

vruntime_eligible(puny) ends up with:

 avg  = 2 * puny.key (+ 0)
 load = 2 + 100 * NICE_0_LOAD

 avg >= puny.key * load

And that is: (slice + TICK_NSEC) * NICE_0_LOAD * NICE_0_LOAD * 100, which will
overflow s64.

Zhan suggested using __builtin_mul_overflow(), however after staring at
compiler output for various architectures using godbolt, it seems that using an
__int128 multiplication often results in better code.

Specifically, a number of architectures already compute the __int128 product to
determine the overflow. Eg. arm64 already has the 'smulh' instruction used. By
explicitly doing an __int128 multiply, it will emit the 'mul; smulh' pattern,
which modern cores can fuse (armv8-a clang-22.1.0). x86_64 has less branches
(no OF handling).

Since Linux has ARCH_SUPPORTS_INT128 to gate __int128 usage, also provide the
__builtin_mul_overflow() variant as a fallback.

[peterz: Changelog and __int128 bits]
Fixes: 556146ce5e ("sched/fair: Avoid overflow in enqueue_entity()")
Reported-by: Zhan Xusheng <zhanxusheng1024@gmail.com>
Closes: https://patch.msgid.link/20260415145742.10359-1-zhanxusheng%40xiaomi.com
Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260505103155.GN3102924%40noisy.programming.kicks-ass.net
2026-05-06 17:41:17 +02:00
Linus Torvalds
de95ad90fb sched_ext: Fixes for v7.1-rc2
- Fix idle CPU selection returning prev_cpu outside the task's cpus_ptr
   when the BPF caller's allowed mask was wider. Stable backport.
 
 - Two opposite-direction gaps in scx_task_iter's cgroup-scoped mode
   versus the global mode:
 
   - Tasks past exit_signals() are filtered by the cgroup walk but kept by
     global. Sub-scheduler enable abort leaked __scx_init_task() state.
     Add a CSS_TASK_ITER_WITH_DEAD flag to cgroup's task iterator
     (scx_task_iter is its only user) and use it.
 
   - Tasks past sched_ext_dead() are still returned, tripping
     WARN_ON_ONCE() in callers or making them touch torn-down state. Mark
     and skip under the per-task rq lock.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCafphXA4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGbI/AP4nRHDusUuYDSFBLyHODvLZXfMM3Nb0yzS7euQJ
 qvx6OQEA1p5AyRWA2apFvKjjQrl1dOb5vUlro1Fj8VF51X7Spwc=
 =olGB
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-7.1-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

 - Fix idle CPU selection returning prev_cpu outside the task's cpus_ptr
   when the BPF caller's allowed mask was wider. Stable backport.

 - Two opposite-direction gaps in scx_task_iter's cgroup-scoped mode
   versus the global mode:

    - Tasks past exit_signals() are filtered by the cgroup walk but kept
      by global. Sub-scheduler enable abort leaked __scx_init_task()
      state. Add a CSS_TASK_ITER_WITH_DEAD flag to cgroup's task
      iterator (scx_task_iter is its only user) and use it.

    - Tasks past sched_ext_dead() are still returned, tripping
      WARN_ON_ONCE() in callers or making them touch torn-down state.
      Mark and skip under the per-task rq lock.

* tag 'sched_ext-for-7.1-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: idle: Recheck prev_cpu after narrowing allowed mask
  sched_ext: Skip past-sched_ext_dead() tasks in scx_task_iter_next_locked()
  cgroup, sched_ext: Include exiting tasks in cgroup iter
2026-05-05 15:22:04 -07:00
Thomas Gleixner
b9eac6a9d9 rseq: Revert to historical performance killing behaviour
The recent RSEQ optimization work broke the TCMalloc abuse of the RSEQ ABI
as it no longer unconditionally updates the CPU, node and mm_cid fields,
which are documented as read-only for user space. Due to the observed
behavior of the kernel it was possible for TCMalloc to overwrite the
cpu_id_start field for its own purposes and rely on the kernel to update
it unconditionally after each context switch and before signal delivery.

The RSEQ ABI only guarantees that these fields are updated when the data
changes, i.e. the task is migrated or the MMCID of the task changes due to
switching from or to per CPU ownership mode.

The optimization work eliminated the unconditional updates and reduced them
to the documented ABI guarantees, which results in a massive performance
win for syscall- and scheduling-heavy workloads, but in turn breaks the
TCMalloc expectations.

There have been several options discussed to restore the TCMalloc
functionality while preserving the optimization benefits. They all end up
in a series of hard to maintain workarounds, which in the worst case
introduce overhead for everyone, e.g. in the scheduler.

The requirements of TCMalloc and the optimization work are diametrically
opposed, and the required workarounds are a maintenance burden. They end up
as fragile constructs which block further optimization work and are pretty
much guaranteed to cause more subtle issues down the road.

The optimization work heavily depends on the generic entry code, which is
not used by all architectures yet. So the rework preserved the original
mechanism mostly unmodified to keep support for architectures which handle
rseq in their own exit-to-user-space loop. That code is currently
optimized out by the compiler on architectures which use the generic entry
code.

This allows reverting to the original behaviour by replacing the
compile-time constant conditions with a runtime condition where required,
which disables the optimization and the dependent time slice extension
feature until the runtime condition can be enabled again in the RSEQ
registration code on a per-task basis.

The following changes are required to restore the original behavior, which
makes TCMalloc work again:

  1) Replace the compile time constant conditionals with runtime
     conditionals where appropriate to prevent the compiler from optimizing
     the legacy mode out

  2) Enforce unconditional update of IDs on context switch for the
     non-optimized v1 mode

  3) Enforce update of IDs in the pre signal delivery path for the
     non-optimized v1 mode

  4) Enforce update of IDs in the membarrier(RSEQ) IPI for the
     non-optimized v1 mode

  5) Make time slice and future extensions depend on optimized v2 mode

This brings back the full performance problems, but preserves the v2
optimization code and, for architectures using the generic entry code, the
TIF_RSEQ optimization, which avoids a full evaluation of the exit-to-user-mode
loop in many cases.

Fixes: 566d8015f7 ("rseq: Avoid CPU/MM CID updates when no event pending")
Reported-by: Mathias Stearn <mathias@mongodb.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Closes: https://lore.kernel.org/CAHnCjA25b+nO2n5CeifknSKHssJpPrjnf+dtr7UgzRw4Zgu=oA@mail.gmail.com
Link: https://patch.msgid.link/20260428224427.517051752%40kernel.org
Cc: stable@vger.kernel.org
2026-05-05 16:02:57 +02:00
David Carlier
b34c82777a sched_ext: idle: Recheck prev_cpu after narrowing allowed mask
scx_select_cpu_dfl() narrows @allowed to @cpus_allowed & @p->cpus_ptr
when the BPF caller supplies a @cpus_allowed that differs from
@p->cpus_ptr and @p doesn't have full affinity. However,
@is_prev_allowed was computed against the original (wider)
@cpus_allowed, so the prev_cpu fast paths could pick a @prev_cpu that
is in @cpus_allowed but not in @p->cpus_ptr, violating the intended
invariant that the returned CPU is always usable by @p. The kernel
masks this via the SCX_EV_SELECT_CPU_FALLBACK fallback, but the
behavior contradicts the documented contract.

Move the @is_prev_allowed evaluation past the narrowing block so it
tests against the final @allowed mask.

Fixes: ee9a4e9279 ("sched_ext: idle: Properly handle invalid prev_cpu during idle selection")
Cc: stable@vger.kernel.org # v6.16+
Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: David Carlier <devnexen@gmail.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-04 11:01:04 -10:00
Tejun Heo
ff9eda4ea9 sched_ext: Skip past-sched_ext_dead() tasks in scx_task_iter_next_locked()
scx_task_iter's cgroup-scoped mode can return tasks whose
sched_ext_dead() has already completed: cgroup_task_dead() removes
from cset->tasks after sched_ext_dead() in finish_task_switch() and is
irq-work deferred on PREEMPT_RT. The global mode is fine -
sched_ext_dead() removes from scx_tasks via list_del_init() first.

Callers (sub-sched enable prep/abort/apply, scx_sub_disable(),
scx_fail_parent()) assume returned tasks are still on @sch and trip
WARN_ON_ONCE() or operate on torn-down state otherwise.

Set %SCX_TASK_OFF_TASKS in sched_ext_dead() under @p's rq lock and
have scx_task_iter_next_locked() skip flagged tasks under the same
lock. Setter and reader serialize on the per-task rq lock - no race.

Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-04 09:06:03 -10:00
Tejun Heo
60f21a2649 cgroup, sched_ext: Include exiting tasks in cgroup iter
a72f73c4dd ("cgroup: Don't expose dead tasks in cgroup") made
css_task_iter_advance() skip exiting tasks so cgroup.procs stays consistent
with waitpid() visibility. Unfortunately, this broke scx_task_iter.

scx_task_iter walks either scx_tasks (global) or a cgroup subtree via
css_task_iter() and the two modes are expected to cover the same set of
tasks. After the above change the cgroup-scoped mode silently skips tasks
past exit_signals() that are still on scx_tasks.

scx_sub_enable_workfn()'s abort path is one of the symptoms: an exiting
SCX_TASK_SUB_INIT task can race past the cgroup iter, leaking
__scx_init_task() state. Other iterations share the same gap.

Add CSS_TASK_ITER_WITH_DEAD to opt out of the skip and use it from
scx_task_iter().

Fixes: b0e4c2f8a0 ("sched_ext: Implement cgroup subtree iteration for scx_task_iter")
Reported-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-04 09:06:03 -10:00
Linus Torvalds
c3cba36b39 Miscellaneous scheduler fixes:
- Fix the delayed dequeue negative lag increase fix in the
    fair scheduler (Peter Zijlstra)
 
  - Fix wakeup_preempt_fair() to do proper delayed dequeue
    (Vincent Guittot)
 
  - Clear sched_entity::rel_deadline when initializing
    forked entities, which bug can cause all tasks to be
    EEVDF-ineligible, causing a NULL pointer dereference
    crash in pick_next_entity() (Zicheng Qu)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmn2+tsRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gDdg//WX8E7dyiwRaFNsPMgEvi4Q7pBoXAb5Zi
 /iYNlpw1/QRG9KF59175CC3cLOVLJ3dA+79EZPS/mQSuukdTxJg6sbPTkneULV3D
 l8DkjH0uDS7mZBqlLDC+Xjqv1y7Y01V6qN9Si8hpR5rP3D0heWdspAGS5aSZ/8Dp
 h/VSYl2R615Z32NaS0Xys2hy4M0/I30Uuw4fJScvIYeAMb4s5+6RHEQmuuo25A3O
 HQg9Ljqi5NQaHwzvHTYjCenh/NENOd/tu/kZzFgW57HJqSXGM7KBqcjaK68q4vCl
 LgBlsfux7RTbnrEAIhGYBSoDss+tBbMKm5qaYZNENJpLS8ptm4J3iKgUJ0W2e3dW
 Zp6IjVkj0E+qC65WnFENXsiDr+/fZ9d71/xq2L4z6SxQNv1mtX2+f/HUYyKU5jCc
 I4NDEBLGbdRVNuPW7esECVIEVRYFR1cPdZigLW8M7JEnr0p0skF1zgnnMtVuK6Ep
 qONYldUIHWdsx67yOSdhykSyq6Qfaew/UKuG1ivlN0BDL4I/AWf+4BVMHJrigeok
 xKD8DRWH6s7fSicfM2aJMmj6nRSR2Zz5a9T3lE4LxSvDm41JEnatqpb2Xhjri+I0
 21slsm4AZmh1xR1kj7sTKxdHJn0E+lNN/XSZP6OcCoNqlr2XEGMRxWpVo/WqJnO6
 HVG9/VoSP1w=
 =z+8E
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2026-05-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar:

 - Fix the delayed dequeue negative lag increase fix in the
   fair scheduler (Peter Zijlstra)

 - Fix wakeup_preempt_fair() to do proper delayed dequeue
   (Vincent Guittot)

 - Clear sched_entity::rel_deadline when initializing
   forked entities, which bug can cause all tasks to be
   EEVDF-ineligible, causing a NULL pointer dereference
   crash in pick_next_entity() (Zicheng Qu)

* tag 'sched-urgent-2026-05-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Clear rel_deadline when initializing forked entities
  sched/fair: Fix wakeup_preempt_fair() vs delayed dequeue
  sched/fair: Fix the negative lag increase fix
2026-05-03 08:05:23 -07:00
Linus Torvalds
664f0f6be3 sched_ext: Fixes for v7.1-rc1
The merge window pulled in the cgroup sub-scheduler infrastructure, and
 new AI reviews are accelerating bug reporting and fixing - hence the
 larger than usual fixes batch.
 
 - Use-after-frees during scheduler load/unload. The disable path
   could free the BPF scheduler while deferred irq_work / kthread work
   was still in flight; cgroup setter callbacks read the active
   scheduler outside the rwsem that synchronizes against teardown.
   Fixed both, and reused the disable drain in the enable error paths
   so the BPF JIT page can't be freed under live callbacks.
 
 - Several BPF op invocations didn't tell the framework which runqueue
   was already locked, so helper kfuncs that re-acquire the runqueue
   by CPU could deadlock on the held lock. Fixed at the affected
   callsites, including recursive parent-into-child dispatch.
 
 - The hardlockup notifier ran from NMI but eventually took a
   non-NMI-safe lock. Bounced through irq_work.
 
 - A handful of bugs in the new sub-scheduler hierarchy: helper
   kfuncs hard-coded the root instead of resolving the caller's
   scheduler; the enable error path tried to disable per-task state
   that had never been initialized, and leaked cpus_read_lock on the
   way out; a sysfs object was leaked on every load/unload; the
   dispatch fast-path used the root scheduler instead of the task's;
   and a couple of CONFIG #ifdef guards were misclassified.
 
 - Verifier-time hardening: BPF programs of unrelated struct_ops
   types (e.g. tcp_congestion_ops) could call sched_ext kfuncs - a
   semantic bug and, once sub-sched was enabled, a KASAN
   out-of-bounds read. Now rejected at load. Plus a few NULL and
   cross-task argument checks on sched_ext kfuncs, and a selftest
   covering the new deny.
 
 - rhashtable (Herbert): restored the insecure_elasticity toggle and
   bounced the deferred-resize kick through irq_work to break a
   lock-order cycle observable from raw-spinlock callers. sched_ext's
   scheduler-instance hash is the first user of both.
 
 - The bypass-mode load balancer used file-scope cpumasks; with
   multiple scheduler instances now possible, those raced. Moved
   per-instance, plus a follow-up to skip tasks whose recorded CPU is
   stale relative to the new owning runqueue.
 
 - Smaller fixes: a dispatch queue's first-task tracking misbehaved
   when a parked iterator cursor sat in the list; the runqueue's
   next-class wasn't promoted on local-queue enqueue, leaving an SCX
   task behind RT in edge cases; the reference qmap scheduler stopped
   erroring on legitimate cross-scheduler task-storage misses.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCafEN/A4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGaydAQCxWrUqnZXhxF4LnztjxTF2tgv7p8P7TbpS6aU6
 etqRpAEA9RFmIXs7XrhwCm0n2BwSgjvrNxnWfPhWvuH0uN0GTAA=
 =wLna
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:
 "The merge window pulled in the cgroup sub-scheduler infrastructure,
  and new AI reviews are accelerating bug reporting and fixing - hence
  the larger than usual fixes batch:

   - Use-after-frees during scheduler load/unload:
       - The disable path could free the BPF scheduler while deferred
         irq_work / kthread work was still in flight
       - cgroup setter callbacks read the active scheduler outside the
         rwsem that synchronizes against teardown
     Fix both, and reuse the disable drain in the enable error paths so
     the BPF JIT page can't be freed under live callbacks.

   - Several BPF op invocations didn't tell the framework which runqueue
     was already locked, so helper kfuncs that re-acquire the runqueue
     by CPU could deadlock on the held lock

     Fix the affected callsites, including recursive parent-into-child
     dispatch.

   - The hardlockup notifier ran from NMI but eventually took a
     non-NMI-safe lock. Bounce it through irq_work.

   - A handful of bugs in the new sub-scheduler hierarchy:
       - helper kfuncs hard-coded the root instead of resolving the
         caller's scheduler
       - the enable error path tried to disable per-task state that had
         never been initialized, and leaked cpus_read_lock on the way
         out
       - a sysfs object was leaked on every load/unload
       - the dispatch fast-path used the root scheduler instead of the
         task's
       - a couple of CONFIG #ifdef guards were misclassified

   - Verifier-time hardening: BPF programs of unrelated struct_ops types
     (e.g. tcp_congestion_ops) could call sched_ext kfuncs - a semantic
     bug and, once sub-sched was enabled, a KASAN out-of-bounds read.
     Now rejected at load. Plus a few NULL and cross-task argument
     checks on sched_ext kfuncs, and a selftest covering the new deny.

   - rhashtable (Herbert): restore the insecure_elasticity toggle and
     bounce the deferred-resize kick through irq_work to break a
     lock-order cycle observable from raw-spinlock callers. sched_ext's
     scheduler-instance hash is the first user of both.

   - The bypass-mode load balancer used file-scope cpumasks; with
     multiple scheduler instances now possible, those raced. Move to
     per-instance cpumasks, plus a follow-up to skip tasks whose
     recorded CPU is stale relative to the new owning runqueue.

   - Smaller fixes:
       - a dispatch queue's first-task tracking misbehaved when a parked
         iterator cursor sat in the list
       - the runqueue's next-class wasn't promoted on local-queue
         enqueue, leaving an SCX task behind RT in edge cases
       - the reference qmap scheduler stopped erroring on legitimate
         cross-scheduler task-storage misses"

* tag 'sched_ext-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (26 commits)
  sched_ext: Fix scx_flush_disable_work() UAF race
  sched_ext: Call wakeup_preempt() in local_dsq_post_enq()
  sched_ext: Release cpus_read_lock on scx_link_sched() failure in root enable
  sched_ext: Reject NULL-sch callers in scx_bpf_task_set_slice/dsq_vtime
  sched_ext: Refuse cross-task select_cpu_from_kfunc calls
  sched_ext: Align cgroup #ifdef guards with SUB_SCHED vs GROUP_SCHED
  sched_ext: Make bypass LB cpumasks per-scheduler
  sched_ext: Pass held rq to SCX_CALL_OP() for core_sched_before
  sched_ext: Pass held rq to SCX_CALL_OP() for dump_cpu/dump_task
  sched_ext: Save and restore scx_locked_rq across SCX_CALL_OP
  sched_ext: Use dsq->first_task instead of list_empty() in dispatch_enqueue() FIFO-tail
  sched_ext: Resolve caller's scheduler in scx_bpf_destroy_dsq() / scx_bpf_dsq_nr_queued()
  sched_ext: Read scx_root under scx_cgroup_ops_rwsem in cgroup setters
  sched_ext: Don't disable tasks in scx_sub_enable_workfn() abort path
  sched_ext: Skip tasks with stale task_rq in bypass_lb_cpu()
  sched_ext: Guard scx_dsq_move() against NULL kit->dsq after failed iter_new
  sched_ext: Unregister sub_kset on scheduler disable
  sched_ext: Defer scx_hardlockup() out of NMI
  sched_ext: sync disable_irq_work in bpf_scx_unreg()
  sched_ext: Fix local_dsq_post_enq() to use task's scheduler in sub-sched
  ...
2026-04-28 16:26:11 -07:00
Cheng-Yang Chou
d99f7a32f0 sched_ext: Fix scx_flush_disable_work() UAF race
scx_flush_disable_work() calls irq_work_sync() followed by
kthread_flush_work() to ensure that the disable kthread work has
fully completed before bpf_scx_unreg() frees the SCX scheduler.

However, a concurrent scx_vexit() (e.g., triggered by a watchdog stall)
creates a race window between scx_claim_exit() and irq_work_queue():

  CPU A (scx_vexit (watchdog))        CPU B (bpf_scx_unreg)
  ----                                ----
  scx_claim_exit()
    atomic_try_cmpxchg(NONE->kind)
  stack_trace_save()
  vscnprintf()
                                      scx_disable()
                                        scx_claim_exit() -> FAIL
                                      scx_flush_disable_work()
                                        irq_work_sync()      // no-op: not queued yet
                                        kthread_flush_work() // no-op: not queued yet
                                      kobject_put(&sch->kobj) -> free %sch
  irq_work_queue() -> UAF on %sch
  scx_disable_irq_workfn()
    kthread_queue_work() -> UAF

The root cause is that CPU B's scx_flush_disable_work() returns after
syncing an irq_work that has not yet been queued, while CPU A is still
executing the code between scx_claim_exit() and irq_work_queue().

Loop until exit_kind reaches SCX_EXIT_DONE or SCX_EXIT_NONE, draining
disable_irq_work and disable_work in each pass. This ensures that any
work queued after the previous check is caught, while also correctly
handling cases where no disable was triggered (e.g., the
scx_sub_enable_workfn() abort path).

Fixes: 510a270554 ("sched_ext: sync disable_irq_work in bpf_scx_unreg()")
Reported-by: https://sashiko.dev/#/patchset/20260424100221.32407-1-icheng%40nvidia.com
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-28 07:40:03 -10:00
Kuba Piecuch
163f8b7f9a sched_ext: Call wakeup_preempt() in local_dsq_post_enq()
There are several edge cases (see linked thread) where an IMMED task
can be left lingering on a local DSQ if an RT task swoops in at the
wrong time. All of these edge cases are due to rq->next_class being idle
even after dispatching a task to rq's local DSQ. We should bump
rq->next_class to &ext_sched_class as soon as we've inserted a task into
the local DSQ.

To optimize the common case of rq->next_class == &ext_sched_class,
only call wakeup_preempt() if rq->next_class is below EXT. If next_class
is EXT or above, wakeup_preempt() is a no-op anyway.

This lets us also simplify the preempt_curr() logic a bit since
wakeup_preempt() will call preempt_curr() for us if next_class is
below EXT.

Link: https://lore.kernel.org/all/DHZPHUFXB4N3.2RY28MUEWBNYK@google.com/
Signed-off-by: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-28 06:28:48 -10:00
Zicheng Qu
3da56dc063 sched/fair: Clear rel_deadline when initializing forked entities
A yield-triggered crash can happen when a newly forked sched_entity
enters the fair class with se->rel_deadline unexpectedly set.

The failing sequence is:

  1. A task is forked while se->rel_deadline is still set.
  2. __sched_fork() initializes vruntime, vlag and other sched_entity
     state, but does not clear rel_deadline.
  3. On the first enqueue, enqueue_entity() calls place_entity().
  4. Because se->rel_deadline is set, place_entity() treats se->deadline
     as a relative deadline and converts it to an absolute deadline by
     adding the current vruntime.
  5. However, the forked entity's deadline is not a valid inherited
     relative deadline for this new scheduling instance, so the conversion
     produces an abnormally large deadline.
  6. If the task later calls sched_yield(), yield_task_fair() advances
     se->vruntime to se->deadline.
  7. The inflated vruntime is then used by the following enqueue path,
     where the vruntime-derived key can overflow when multiplied by the
     entity weight.
  8. This corrupts cfs_rq->sum_w_vruntime, breaks EEVDF eligibility
     calculation, and can eventually make all entities appear ineligible.
     pick_next_entity() may then return NULL unexpectedly, leading to a
     later NULL dereference.

A captured trace shows the effect clearly. Before yield, the entity's
vruntime was around:

  9834017729983308

After yield_task_fair() executed:

  se->vruntime = se->deadline

the vruntime jumped to:

  19668035460670230

and the deadline was later advanced further to:

  19668035463470230

This shows that the deadline had already become abnormally large before
yield_task_fair() copied it into vruntime.

rel_deadline is only meaningful when se->deadline really carries a
relative deadline that still needs to be placed against vruntime. A
freshly forked sched_entity should not inherit or retain this state.
Clear se->rel_deadline in __sched_fork(), together with the other
sched_entity runtime state, so that the first enqueue does not interpret
the new entity's deadline as a stale relative deadline.

Fixes: 82e9d0456e ("sched/fair: Avoid re-setting virtual deadline on 'migrations'")
Analyzed-by: Hui Tang <tanghui20@huawei.com>
Analyzed-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260424071113.1199600-1-quzicheng@huawei.com
2026-04-28 09:19:54 +02:00
Vincent Guittot
ac8e69e693 sched/fair: Fix wakeup_preempt_fair() vs delayed dequeue
Similar to how pick_next_entity() must dequeue delayed entities, so too must
wakeup_preempt_fair(). Any delayed task that is found is eligible and hence
past the 0-lag point, ready for removal.

Worse, by not removing delayed entities from consideration, it can skew the
preemption decision, with the end result that a short slice wakeup will not
result in a preemption.

                     tip/sched/core  tip/sched/core    +this patch
cyclictest slice  (ms) (default) 2.8            8               8
hackbench slice   (ms) (default) 2.8           20              20
Total Samples          |    22559           22595           22683
Average           (us) |      157              64( 59%)        59(  8%)
Median (P50)      (us) |       57              57(  0%)        58(- 2%)
90th Percentile   (us) |       64              60(  6%)        60(  0%)
99th Percentile   (us) |     2407              67( 97%)        67(  0%)
99.9th Percentile (us) |     3400            2288( 33%)       727( 68%)
Maximum           (us) |     5037            9252(-84%)      7461( 19%)

Fixes: f12e148892 ("sched/fair: Prepare pick_next_task() for delayed dequeue")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260422093400.319251-1-vincent.guittot@linaro.org
2026-04-28 09:19:54 +02:00
Peter Zijlstra
c5cd6fd75b sched/fair: Fix the negative lag increase fix
Vincent reported that my rework of his original patch lost a little
something.

Specifically, it got the return value wrong; it should not compare
against the old se->vlag, but rather against the current value, since
what matters is whether the effective vruntime of an entity is affected
and the entity needs repositioning or not.

Fixes: 059258b0d4 ("sched/fair: Prevent negative lag increase during delayed dequeue")
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20260423094107.GT3102624%40noisy.programming.kicks-ass.net
2026-04-28 09:19:54 +02:00
Tejun Heo
deb7b2f93d sched_ext: Release cpus_read_lock on scx_link_sched() failure in root enable
scx_root_enable_workfn() takes cpus_read_lock() before
scx_link_sched(sch), but the `if (ret) goto err_disable` on failure
skips the matching cpus_read_unlock() - all other err_disable gotos
along this path drop the lock first.

scx_link_sched() only returns non-zero on the sub-sched path
(parent != NULL), so the leak path is unreachable via the root
caller today. Still, the unwind is out of line with the surrounding
paths.

Drop cpus_read_lock() before goto err_disable.

v2: Correct Fixes: tag (Andrea Righi).

Fixes: 25037af712 ("sched_ext: Add rhashtable lookup for sub-schedulers")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-24 14:31:36 -10:00
Tejun Heo
05b4a9a9bc sched_ext: Reject NULL-sch callers in scx_bpf_task_set_slice/dsq_vtime
scx_prog_sched(aux) returns NULL for TRACING / SYSCALL BPF progs that
have no struct_ops association when the root scheduler has sub_attach
set. scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime() pass
that NULL into scx_task_on_sched(sch, p), which under
CONFIG_EXT_SUB_SCHED is rcu_access_pointer(p->scx.sched) == sch. For
any non-scx task p->scx.sched is NULL, so NULL == NULL returns true
and the authority gate is bypassed - a privileged but
non-struct_ops-associated prog can poke p->scx.slice /
p->scx.dsq_vtime on arbitrary tasks.

Reject !sch up front so the gate only admits callers with a resolved
scheduler.

Fixes: 245d09c594 ("sched_ext: Enforce scheduler ownership when updating slice and dsq_vtime")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:36 -10:00
Tejun Heo
ea7c716a24 sched_ext: Refuse cross-task select_cpu_from_kfunc calls
select_cpu_from_kfunc() skipped pi_lock for @p when called from
ops.select_cpu() or another rq-locked SCX op, assuming the held lock
protects @p. scx_bpf_select_cpu_dfl() / __scx_bpf_select_cpu_and() accept an
arbitrary KF_RCU task_struct, so a caller in e.g. ops.select_cpu(p1) or
ops.enqueue(p1) can pass some other p2 - the held pi_lock / rq lock is p1's,
not p2's - and reading p2->cpus_ptr / nr_cpus_allowed races with
set_cpus_allowed_ptr() and migrate_disable_switch() on another CPU.

Abort the scheduler on cross-task calls in both branches: for
ops.select_cpu() use scx_kf_arg_task_ok() to verify @p is the wake-up
task recorded in current->scx.kf_tasks[] by SCX_CALL_OP_TASK_RET();
for other rq-locked SCX ops compare task_rq(p) against scx_locked_rq().

v2: Switch the in_select_cpu cross-task check from direct_dispatch_task
    comparison to scx_kf_arg_task_ok(). The former spuriously rejects when
    ops.select_cpu() calls scx_bpf_dsq_insert() first, then calls
    scx_bpf_select_cpu_*() on the same task. (Andrea Righi)

Fixes: 0022b32850 ("sched_ext: Decouple kfunc unlocked-context check from kf_mask")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:36 -10:00
Tejun Heo
c0e8ddc76d sched_ext: Align cgroup #ifdef guards with SUB_SCHED vs GROUP_SCHED
Two EXT_GROUP_SCHED/SUB_SCHED guards are misclassified:

- scx_root_enable_workfn()'s cgroup_get(cgrp) and the err_put_cgrp unwind
  in scx_alloc_and_add_sched() are under `#if GROUP || SUB`, but the
  matching cgroup_put() in scx_sched_free_rcu_work() is inside `#ifdef SUB`
  only (via sch->cgrp, stored only under SUB). GROUP-only would leak a
  reference on every root-sched enable.

- sch_cgroup() / set_cgroup_sched() live under `#if GROUP || SUB` but touch
  SUB-only fields (sch->cgrp, cgroup->scx_sched). GROUP-only wouldn't
  compile.

GROUP needs CGROUP_SCHED; SUB needs only CGROUPS. CGROUPS=y/CGROUP_SCHED=n
gives the reachable GROUP=n, SUB=y combination; GROUP=y, SUB=n isn't
reachable today (SUB is def_bool y under CGROUPS). Neither miscategorization
triggers a real bug in any reachable config, but keep the guards honest:

- Narrow cgroup_get and err_put_cgrp to `#ifdef SUB` (matches the free-side
  put).
- Move sch_cgroup() and set_cgroup_sched() to a separate `#ifdef SUB` block
  with no-op stubs for the !SUB case; keep root_cgroup() and scx_cgroup_{
  lock,unlock}() under `#if GROUP || SUB` since those only need cgroup core.

Fixes: ebeca1f930 ("sched_ext: Introduce cgroup sub-sched support")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:36 -10:00
Tejun Heo
d292aa00de sched_ext: Make bypass LB cpumasks per-scheduler
scx_bypass_lb_{donee,resched}_cpumask were file-scope statics shared by all
scheduler instances. With CONFIG_EXT_SUB_SCHED, multiple sched instances
each arm their own bypass_lb_timer; concurrent bypass_lb_node() calls RMW
the global cpumasks with no lock, corrupting donee/resched decisions.

Move the cpumasks into struct scx_sched, allocate them alongside the timer
in scx_alloc_and_add_sched(), free them in scx_sched_free_rcu_work().

Fixes: 95d1df610c ("sched_ext: Implement load balancer for bypass mode")
Cc: stable@vger.kernel.org # v6.19+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:36 -10:00
Tejun Heo
4155fb489f sched_ext: Pass held rq to SCX_CALL_OP() for core_sched_before
scx_prio_less() runs from core-sched's pick_next_task() path with rq
locked but invokes ops.core_sched_before() with NULL locked_rq, leaving
scx_locked_rq_state NULL. If the BPF callback calls a kfunc that
re-acquires rq based on scx_locked_rq() - e.g. scx_bpf_cpuperf_set(cpu)
- it re-acquires the already-held rq.

Pass task_rq(a).

Fixes: 7b0888b7cc ("sched_ext: Implement core-sched support")
Cc: stable@vger.kernel.org # v6.12+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:36 -10:00
Tejun Heo
207d76a372 sched_ext: Pass held rq to SCX_CALL_OP() for dump_cpu/dump_task
scx_dump_state() walks CPUs with rq_lock_irqsave() held and invokes
ops.dump_cpu / ops.dump_task with NULL locked_rq, leaving
scx_locked_rq_state NULL. If the BPF callback calls a kfunc that
re-acquires rq based on scx_locked_rq() - e.g. scx_bpf_cpuperf_set(cpu)
- it re-acquires the already-held rq.

Pass the held rq to SCX_CALL_OP(). Thread it into scx_dump_task() too.
The pre-loop ops.dump call runs before rq_lock_irqsave() so keeps
rq=NULL.

Fixes: 07814a9439 ("sched_ext: Print debug dump after an error exit")
Cc: stable@vger.kernel.org # v6.12+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:36 -10:00
Tejun Heo
7fb39e4eb4 sched_ext: Save and restore scx_locked_rq across SCX_CALL_OP
SCX_CALL_OP{,_RET}() unconditionally clears scx_locked_rq_state to NULL on
exit. Correct at the top level, but ops can recurse via
scx_bpf_sub_dispatch(): a parent's ops.dispatch calls the helper, which
invokes the child's ops.dispatch under another SCX_CALL_OP. When the inner
call returns, the NULL clobbers the outer's state. The parent's BPF then
calls kfuncs like scx_bpf_cpuperf_set() which read scx_locked_rq()==NULL and
re-acquire the already-held rq.

Snapshot scx_locked_rq_state on entry and restore on exit. Rename the rq
parameter to locked_rq across all SCX_CALL_OP* macros so the snapshot local
can be typed as 'struct rq *' without colliding with the parameter token in
the expansion. SCX_CALL_OP_TASK{,_RET}() and SCX_CALL_OP_2TASKS_RET() funnel
through the two base macros and inherit the fix.

Fixes: 4f8b122848 ("sched_ext: Add basic building blocks for nested sub-scheduler dispatching")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:36 -10:00
Tejun Heo
2f2ea77092 sched_ext: Use dsq->first_task instead of list_empty() in dispatch_enqueue() FIFO-tail
dispatch_enqueue()'s FIFO-tail path used list_empty(&dsq->list) to decide
whether to set dsq->first_task on enqueue. dsq->list can contain parked BPF
iterator cursors (SCX_DSQ_LNODE_ITER_CURSOR), so list_empty() is not a
reliable "no real task" check. If the last real task is unlinked while a
cursor is parked, first_task becomes NULL; the next FIFO-tail enqueue then
sees list_empty() == false and skips the first_task update, leaving
scx_bpf_dsq_peek() returning NULL for a non-empty DSQ.

Test dsq->first_task directly, which already tracks only real tasks and is
maintained under dsq->lock.

Fixes: 44f5c8ec5b ("sched_ext: Add lockless peek operation for DSQs")
Cc: stable@vger.kernel.org # v6.19+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Cc: Ryan Newton <newton@meta.com>
2026-04-24 14:31:35 -10:00
Tejun Heo
cc2a387d33 sched_ext: Resolve caller's scheduler in scx_bpf_destroy_dsq() / scx_bpf_dsq_nr_queued()
scx_bpf_create_dsq() resolves the calling scheduler via scx_prog_sched(aux)
and inserts the new DSQ into that scheduler's dsq_hash. Its inverse
scx_bpf_destroy_dsq() and the query helper scx_bpf_dsq_nr_queued() were
hard-coded to rcu_dereference(scx_root), so a sub-scheduler could only
destroy or query DSQs in the root scheduler's hash - never its own. If the
root had a DSQ with the same id, the sub-sched silently destroyed it and the
root aborted on the next dispatch ("invalid DSQ ID 0x0..").

Take a const struct bpf_prog_aux *aux via KF_IMPLICIT_ARGS and resolve the
scheduler with scx_prog_sched(aux), matching scx_bpf_create_dsq().

Fixes: ebeca1f930 ("sched_ext: Introduce cgroup sub-sched support")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:35 -10:00
Tejun Heo
80afd4c84b sched_ext: Read scx_root under scx_cgroup_ops_rwsem in cgroup setters
scx_group_set_{weight,idle,bandwidth}() cache scx_root before acquiring
scx_cgroup_ops_rwsem, so the pointer can be stale by the time the op runs.
If the loaded scheduler is disabled and freed (via RCU work) and another is
enabled between the naked load and the rwsem acquire, the reader sees
scx_cgroup_enabled=true (the new scheduler's) but dereferences the freed one
- UAF on SCX_HAS_OP(sch, ...) / SCX_CALL_OP(sch, ...).

scx_cgroup_enabled is toggled only under scx_cgroup_ops_rwsem write
(scx_cgroup_{init,exit}), so reading scx_root inside the rwsem read section
correlates @sch with the enabled snapshot.

Fixes: a5bd6ba30b ("sched_ext: Use cgroup_lock/unlock() to synchronize against cgroup operations")
Cc: stable@vger.kernel.org # v6.18+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:35 -10:00
Tejun Heo
21a5a97ba4 sched_ext: Don't disable tasks in scx_sub_enable_workfn() abort path
scx_sub_enable_workfn()'s prep loop calls __scx_init_task(sch, p, false)
without transitioning task state, then sets SCX_TASK_SUB_INIT. If prep fails
partway, the abort path runs __scx_disable_and_exit_task(sch, p) on the
marked tasks. Task state is still the parent's ENABLED, so that dispatches
to the SCX_TASK_ENABLED arm and calls scx_disable_task(sch, p) - i.e.
child->ops.disable() - for tasks on which child->ops.enable() never ran. A
BPF sub-scheduler allocating per-task state in enable/freeing in disable
would operate on uninitialized state.

The dying-task branch in scx_disable_and_exit_task() has the same problem,
and scx_enabling_sub_sched was cleared before the abort cleanup loop - a
task exiting during cleanup tripped the WARN and skipped both ops.exit_task
and the SCX_TASK_SUB_INIT clear, leaking per-task resources and leaving the
task stuck.

Introduce scx_sub_init_cancel_task() that calls ops.exit_task with
cancelled=true - matching what the top-level init path does when init_task
itself returns -errno. Use it in the abort loop and in the dying-task
branch. scx_enabling_sub_sched now stays set until the abort loop finishes
clearing SUB_INIT, so concurrent exits hitting the dying-task branch can
still find @sch. That branch also clears SCX_TASK_SUB_INIT unconditionally
when seen, leaving the task unmarked even if the WARN fires.

Fixes: 337ec00b1d ("sched_ext: Implement cgroup sub-sched enabling and disabling")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:35 -10:00
Tejun Heo
da2d81b411 sched_ext: Skip tasks with stale task_rq in bypass_lb_cpu()
bypass_lb_cpu() transfers tasks between per-CPU bypass DSQs without
migrating them - task_cpu() only updates when the donee later consumes the
task via move_remote_task_to_local_dsq(). If the LB timer fires again before
consumption and the new DSQ becomes a donor, @p is still on the previous CPU
and task_rq(@p) != donor_rq. @p can't be moved without its own rq locked.

Skip such tasks.

Fixes: 95d1df610c ("sched_ext: Implement load balancer for bypass mode")
Cc: stable@vger.kernel.org # v6.19+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:35 -10:00
Tejun Heo
4fda9f0e7c sched_ext: Guard scx_dsq_move() against NULL kit->dsq after failed iter_new
bpf_iter_scx_dsq_new() clears kit->dsq on failure and
bpf_iter_scx_dsq_{next,destroy}() guard against that. scx_dsq_move() doesn't -
it dereferences kit->dsq immediately, so a BPF program that calls
scx_bpf_dsq_move[_vtime]() after a failed iter_new oopses the kernel.

Return false if kit->dsq is NULL.

Fixes: 4c30f5ce4f ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
Cc: stable@vger.kernel.org # v6.12+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:35 -10:00
Tejun Heo
411d3ef1a7 sched_ext: Unregister sub_kset on scheduler disable
When ops.sub_attach is set, scx_alloc_and_add_sched() creates sub_kset as a
child of &sch->kobj, which pins the parent with its own reference. The
disable paths never call kset_unregister(), so the final kobject_put() in
bpf_scx_unreg() leaves a stale reference and scx_kobj_release() never runs,
leaking the whole struct scx_sched on every load/unload cycle.

Unregister sub_kset in scx_root_disable() and scx_sub_disable() before
kobject_del(&sch->kobj).

Fixes: ebeca1f930 ("sched_ext: Introduce cgroup sub-sched support")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:35 -10:00
Tejun Heo
bd2d76455b sched_ext: Defer scx_hardlockup() out of NMI
scx_hardlockup() runs from NMI and eventually calls scx_claim_exit(),
which takes scx_sched_lock. scx_sched_lock isn't NMI-safe and grabbing
it from NMI context can lead to deadlocks.

The hardlockup handler is best-effort recovery and the disable path it
triggers runs off of irq_work anyway. Move the handle_lockup() call into
an irq_work so it runs in IRQ context.

Fixes: ebeca1f930 ("sched_ext: Introduce cgroup sub-sched support")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:13:22 -10:00
Richard Cheng
510a270554 sched_ext: sync disable_irq_work in bpf_scx_unreg()
When unregistering my self-written scx scheduler, the following panic
occurs.

[  229.923133] Kernel text patching generated an invalid instruction at 0xffff80009bc2c1f8!
[  229.923146] Internal error: Oops - BRK: 00000000f2000100 [#1]  SMP
[  230.077871] CPU: 48 UID: 0 PID: 1760 Comm: kworker/u583:7 Not tainted 7.0.0+ #3 PREEMPT(full)
[  230.086677] Hardware name: NVIDIA GB200 NVL/P3809-BMC, BIOS 02.05.12 20251107
[  230.093972] Workqueue: events_unbound bpf_map_free_deferred
[  230.099675] Sched_ext: invariant_0.1.0_aarch64_unknown_linux_gnu_debug (disabling), task: runnable_at=-174ms
[  230.116843] pc : 0xffff80009bc2c1f8
[  230.120406] lr : dequeue_task_scx+0x270/0x2d0
[  230.217749] Call trace:
[  230.228515]  0xffff80009bc2c1f8 (P)
[  230.232077]  dequeue_task+0x84/0x188
[  230.235728]  sched_change_begin+0x1dc/0x250
[  230.240000]  __set_cpus_allowed_ptr_locked+0x17c/0x240
[  230.245250]  __set_cpus_allowed_ptr+0x74/0xf0
[  230.249701]  ___migrate_enable+0x4c/0xa0
[  230.253707]  bpf_map_free_deferred+0x1a4/0x1b0
[  230.258246]  process_one_work+0x184/0x540
[  230.262342]  worker_thread+0x19c/0x348
[  230.266170]  kthread+0x13c/0x150
[  230.269465]  ret_from_fork+0x10/0x20
[  230.281393] Code: d4202000 d4202000 d4202000 d4202000 (d4202000)
[  230.287621] ---[ end trace 0000000000000000 ]---
[  231.160046] Kernel panic - not syncing: Oops - BRK: Fatal exception in interrupt

The root cause is that the JIT page backing ops->quiescent() is freed
before all callers of that function have stopped.

The expected ordering during teardown is:
    bitmap_zero(sch->has_op) + synchronize_rcu()
        -> guarantees no CPU will ever call sch->ops.* again
    -> only THEN free the BPF struct_ops JIT page

bpf_scx_unreg() is supposed to enforce the order, but after
commit f4a6c506d1 ("sched_ext: Always bounce scx_disable() through
irq_work"), disable_work is no longer queued directly, causing
kthread_flush_work() to be a noop. Thus, the caller drops the struct_ops
map too early and it is poisoned with AARCH64_BREAK_FAULT before
disable_workfn ever executes.

So the subsequent dequeue_task() still sees SCX_HAS_OP(sch, quiescent)
as true and calls ops.quiescent, which hits the poisoned page and
triggers a BRK panic.

Add a helper scx_flush_disable_work() so future use cases that want to
flush disable_work can use it. Also amend the calls in
scx_root_enable_workfn() and scx_sub_enable_workfn(), which have a
similar pattern in their error paths.

Fixes: f4a6c506d1 ("sched_ext: Always bounce scx_disable() through irq_work")
Signed-off-by: Richard Cheng <icheng@nvidia.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-24 07:26:48 -10:00
zhidao su
4e3d7c89e1 sched_ext: Fix local_dsq_post_enq() to use task's scheduler in sub-sched
local_dsq_post_enq() calls call_task_dequeue() with scx_root instead of
the scheduler instance actually managing the task. When
CONFIG_EXT_SUB_SCHED is enabled, tasks may be managed by a sub-scheduler
whose ops.dequeue() callback differs from root's. Using scx_root causes
the wrong scheduler's ops.dequeue() to be consulted: sub-sched tasks
dispatched to a local DSQ via scx_bpf_dsq_move_to_local() will have
SCX_TASK_IN_CUSTODY cleared but the sub-scheduler's ops.dequeue() is
never invoked, violating the custody exit semantics.

Fix by adding a 'struct scx_sched *sch' parameter to local_dsq_post_enq()
and move_local_task_to_local_dsq(), and propagating the correct scheduler
from their callers dispatch_enqueue(), move_task_between_dsqs(), and
consume_dispatch_q().

This is consistent with dispatch_enqueue()'s non-local path which already
passes 'sch' directly to call_task_dequeue() for global/bypass DSQs.

Fixes: ebf1ccff79 ("sched_ext: Fix ops.dequeue() semantics")
Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-23 06:36:56 -10:00
Cheng-Yang Chou
2d2b026c3e sched_ext: Deny SCX kfuncs to non-SCX struct_ops programs
scx_kfunc_context_filter() currently allows non-SCX struct_ops programs
(e.g. tcp_congestion_ops) to call SCX unlocked kfuncs. This is wrong
for two reasons:

- It is semantically incorrect: a TCP congestion control program has no
  business calling SCX kfuncs such as scx_bpf_kick_cpu().

- With CONFIG_EXT_SUB_SCHED=y, kfuncs like scx_bpf_kick_cpu() call
  scx_prog_sched(aux), which invokes bpf_prog_get_assoc_struct_ops(aux)
  and casts the result to struct sched_ext_ops * before reading ops->priv.
  For a non-SCX struct_ops program the returned pointer is the kdata of
  that struct_ops type, which is far smaller than sched_ext_ops, making
  the read an out-of-bounds access (confirmed with KASAN).

Extend the filter to cover scx_kfunc_set_any and scx_kfunc_set_idle as
well, and deny all SCX kfuncs for any struct_ops program that is not the
SCX struct_ops. This addresses both issues: the semantic contract is
enforced at the verifier level, and the runtime out-of-bounds access
becomes unreachable.

Fixes: d1d3c1c6ae ("sched_ext: Add verifier-time kfunc context filter")
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-20 07:57:29 -10:00
Tejun Heo
87019cb6c2 sched_ext: Mark scx_sched_hash insecure_elasticity
Insertions into scx_sched_hash happen under scx_sched_lock
(raw_spinlock_irq) in scx_link_sched(). rhashtable's sync grow path
calls get_random_u32()
and does a GFP_ATOMIC allocation; both acquire regular spinlocks, which
is unsafe under raw_spinlock_t. Set insecure_elasticity to skip the
sync grow.

v2:
- Dropped dsq_hash changes. Insertion is not under raw_spin_lock.

- Switched from no_sync_grow flag to insecure_elasticity.

Fixes: 25037af712 ("sched_ext: Add rhashtable lookup for sub-schedulers")
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-19 05:47:28 -10:00
Linus Torvalds
cb30bf881c tracing updates for v7.1:
- Fix printf format warning for bprintf
 
   sunrpc uses a trace_printk() that triggers a printf warning during the
   compile. Move the __printf() attribute around so that when debugging is
   not enabled the warning goes away.
 
 - Remove redundant check for EVENT_FILE_FL_FREED in event_filter_write()
 
   The FREED flag is checked in the call to event_file_file() and then
   checked again right afterward, which is unneeded.
 
 - Clean up event_file_file() and event_file_data() helpers
 
   These helper functions played a different role in the past, but now with
   eventfs, the READ_ONCE() isn't needed. Simplify the code a bit and also
   add a warning to event_file_data() if the file or its data is not present.
 
 - Remove updating file->private_data in tracing open
 
   All access to the file private data is handled by the helper functions,
   which do not use file->private_data. Stop updating it on open.
 
 - Show ENUM names in function arguments via BTF in function tracing
 
   When showing function arguments with the func-args option set for
   function tracing, if one of the arguments is found to be an enum, show
   the name of the enum instead of its number.
 
 - Add new trace_call__##name() API for tracepoints
 
   Tracepoints are enabled via static_branch() blocks: when not enabled,
   there is only a nop in the code that execution skips over. When tracing
   is enabled, the nop is converted to a direct jump to the tracepoint
   code. Sometimes extra calculations are required to prepare the
   parameters of the tracepoint. In this case, trace_##name##_enabled() is
   called, which is a static_branch() that gets enabled only when the
   tracepoint is enabled. This allows the extra calculations to also be
   skipped by the nop:
 
   if (trace_foo_enabled()) {
       x = bar();
       trace_foo(x);
   }
 
   Where the x=bar() is only performed when foo is enabled. The problem
   with this approach is that there are now two static_branch() calls: one
   to check if the tracepoint is enabled, and another to know if the
   tracepoint should be called. The second one is redundant.
 
   Introduce trace_call__foo() that will call the foo() tracepoint directly
   without doing a static_branch():
 
   if (trace_foo_enabled()) {
       x = bar();
       trace_call__foo();
   }
 
 - Update various locations to use the new trace_call__##name() API
 
 - Move snapshot code out of trace.c
 
   Cleaning up trace.c to not be a "dump all", move the snapshot code out of
   it and into a new trace_snapshot.c file.
 
 - Clean up some "%*.s" to "%*s"
 
 - Allow boot kernel command line options to be called multiple times
 
   Have options like:
 
     ftrace_filter=foo ftrace_filter=bar ftrace_filter=zoo
 
   Equal to:
 
     ftrace_filter=foo,bar,zoo
 
 - Fix ipi_raise event CPU field to be a CPU field
 
   The ipi_raise target_cpus field is defined as a __bitmask(). There is now a
   __cpumask() field definition. Update the field to use that.
 
 - Have hist_field_name() use a snprintf() and not a series of strcat()
 
   It's safer to use snprintf() than a series of strcat().
 
 - Fix tracepoint regfunc balancing
 
   A tracepoint can define a "reg" and "unreg" function that gets called
   before the tracepoint is enabled, and after it is disabled respectively.
   But on error, after the "reg" func is called and the tracepoint is not
   enabled, the "unreg" function is not called to tear down what the "reg"
   function performed.
 
 - Fix output that shows what histograms are enabled
 
   Event variables are displayed incorrectly in the histogram output.
 
   Instead of "sched.sched_wakeup.$var", it is showing
   "$sched.sched_wakeup.var" where the '$' is in the incorrect location.
 
 - Some other simple cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaeCpvxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qt2WAP44m85BbAjBqJe4WR103eOXV+bREBta
 dRoReKJOMe519gEAp0rK/HoCvHgHhIGe3gaGdIsNhnaxoFyNWMG/wokoLAY=
 =Hg6+
 -----END PGP SIGNATURE-----

Merge tag 'trace-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing updates from Steven Rostedt:

 - Fix printf format warning for bprintf

   sunrpc uses a trace_printk() that triggers a printf warning during
   the compile. Move the __printf() attribute around so that when
   debugging is not enabled the warning goes away

 - Remove redundant check for EVENT_FILE_FL_FREED in
   event_filter_write()

   The FREED flag is checked in the call to event_file_file() and then
   checked again right afterward, which is unneeded

 - Clean up event_file_file() and event_file_data() helpers

   These helper functions played a different role in the past, but now
   with eventfs, the READ_ONCE() isn't needed. Simplify the code a bit
   and also add a warning to event_file_data() if the file or its data
   is not present

 - Remove updating file->private_data in tracing open

   All access to the file private data is handled by the helper
   functions, which do not use file->private_data. Stop updating it on
   open

 - Show ENUM names in function arguments via BTF in function tracing

   When showing function arguments with the func-args option set for
   function tracing, if one of the arguments is found to be an enum,
   show the name of the enum instead of its number

 - Add new trace_call__##name() API for tracepoints

   Tracepoints are enabled via static_branch() blocks: when not enabled,
   there is only a nop in the code that execution skips over. When
   tracing is enabled, the nop is converted to a direct jump to the
   tracepoint code. Sometimes extra calculations are required to prepare
   the parameters of the tracepoint. In this case,
   trace_##name##_enabled() is called, which is a static_branch() that
   gets enabled only when the tracepoint is enabled. This allows the
   extra calculations to also be skipped by the nop:

	if (trace_foo_enabled()) {
		x = bar();
		trace_foo(x);
	}

   Where the x=bar() is only performed when foo is enabled. The problem
   with this approach is that there are now two static_branch() calls:
   one to check if the tracepoint is enabled, and another to know if the
   tracepoint should be called. The second one is redundant

   Introduce trace_call__foo() that will call the foo() tracepoint
   directly without doing a static_branch():

	if (trace_foo_enabled()) {
		x = bar();
		trace_call__foo();
	}

 - Update various locations to use the new trace_call__##name() API

 - Move snapshot code out of trace.c

   Cleaning up trace.c to not be a "dump all", move the snapshot code
   out of it and into a new trace_snapshot.c file

 - Clean up some "%*.s" to "%*s"

 - Allow boot kernel command line options to be called multiple times

   Have options like:

	ftrace_filter=foo ftrace_filter=bar ftrace_filter=zoo

   Equal to:

	ftrace_filter=foo,bar,zoo

 - Fix ipi_raise event CPU field to be a CPU field

   The ipi_raise target_cpus field is defined as a __bitmask(). There is
   now a __cpumask() field definition. Update the field to use that

 - Have hist_field_name() use a snprintf() and not a series of strcat()

   It's safer to use snprintf() than a series of strcat()

 - Fix tracepoint regfunc balancing

   A tracepoint can define a "reg" and "unreg" function that gets called
   before the tracepoint is enabled, and after it is disabled
   respectively. But on error, after the "reg" func is called and the
   tracepoint is not enabled, the "unreg" function is not called to tear
   down what the "reg" function performed

 - Fix output that shows what histograms are enabled

   Event variables are displayed incorrectly in the histogram output

   Instead of "sched.sched_wakeup.$var", it is showing
   "$sched.sched_wakeup.var" where the '$' is in the incorrect location

 - Some other simple cleanups

* tag 'trace-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (24 commits)
  selftests/ftrace: Add test case for fully-qualified variable references
  tracing: Fix fully-qualified variable reference printing in histograms
  tracepoint: balance regfunc() on func_add() failure in tracepoint_add_func()
  tracing: Rebuild full_name on each hist_field_name() call
  tracing: Report ipi_raise target CPUs as cpumask
  tracing: Remove duplicate latency_fsnotify() stub
  tracing: Preserve repeated trace_trigger boot parameters
  tracing: Append repeated boot-time tracing parameters
  tracing: Remove spurious default precision from show_event_trigger/filter formats
  cpufreq: Use trace_call__##name() at guarded tracepoint call sites
  tracing: Remove tracing_alloc_snapshot() when snapshot isn't defined
  tracing: Move snapshot code out of trace.c and into trace_snapshot.c
  mm: damon: Use trace_call__##name() at guarded tracepoint call sites
  btrfs: Use trace_call__##name() at guarded tracepoint call sites
  spi: Use trace_call__##name() at guarded tracepoint call sites
  i2c: Use trace_call__##name() at guarded tracepoint call sites
  kernel: Use trace_call__##name() at guarded tracepoint call sites
  tracepoint: Add trace_call__##name() API
  tracing: trace_mmap.h: fix a kernel-doc warning
  tracing: Pretty-print enum parameters in function arguments
  ...
2026-04-17 09:43:12 -07:00
Linus Torvalds
fdbfee9fc5 Runtime Verification updates for 7.1:
- Refactor da_monitor header to share handlers across monitor types
 
   No functional changes, only less code duplication.
 
 - Add Hybrid Automata model class
 
   Add a new model class that extends deterministic automata by adding
   constraints on transitions and states. Those constraints can take into
   account wall-clock time and as such allow RV monitor to make
   assertions on real time. Add documentation and code generation
   scripts.
 
 - Add stall monitor as hybrid automaton example
 
   Add a monitor that triggers a violation when a task is stalling as an
   example of automaton working with real time variables.
 
 - Convert the opid monitor to a hybrid automaton
 
   The opid monitor can be heavily simplified if written as a hybrid
   automaton: instead of tracking preempt and interrupt enable/disable
   events, it can just run constraints on the preemption/interrupt
   states when events like wakeup and need_resched occur.
 
 - Add support for per-object monitors in DA/HA
 
   Allow writing deterministic and hybrid automata monitors for generic
   objects (e.g. any struct), by exploiting a hash table where objects
   are saved. This allows tracking more than just tasks in RV. For
   instance it will be used to track deadline entities in deadline
   monitors.
 
 - Add deadline tracepoints and move some deadline utilities
 
   Prepare the ground for deadline monitors by defining events and
   exporting helpers.
 
 - Add nomiss deadline monitor
 
   Add first example of deadline monitor asserting all entities complete
   before their deadline.
 
 - Improve rvgen error handling
 
   Introduce AutomataError exception class and better handle expected
   exceptions while showing a backtrace for unexpected ones.
 
 - Improve python code quality in rvgen
 
   Refactor the rvgen generation scripts to align with python best
   practices: use f-strings instead of %, use len() instead of __len__(),
   remove semicolons, use context managers for file operations, fix
   whitespace violations, extract magic strings into constants, remove
   unused imports and methods.
 
 - Fix small bugs in rvgen
 
   The generator scripts had some corner-case bugs: a logical error in
   validating what a correct dot file looks like, a broken isinstance()
   check, no check that a dot file has an initial state, and wrong type
   annotations and typos in comments.
 
 - rvgen refactoring
 
   Refactor automata.py to use iterator-based parsing and handle required
   arguments directly in argparse.
 
 - Allow epoll in rtapp-sleep monitor
 
   The epoll_wait call is now rt-friendly so it should be allowed in the
   sleep monitor as a valid sleep method.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCad4kUxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qiJgAPsGJ0TCTubdn2x6fpwLDKUcSsNsYrok
 m8gLHeK0rmkKNQD/ajUW+RBsOufXAsnSzUoUcl7sw1TSiQJ085U+0+ZeDgM=
 =WZqk
 -----END PGP SIGNATURE-----

Merge tag 'trace-rv-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull runtime verification updates from Steven Rostedt:

 - Refactor da_monitor header to share handlers across monitor types

   No functional changes, only less code duplication.

 - Add Hybrid Automata model class

   Add a new model class that extends deterministic automata by adding
   constraints on transitions and states. Those constraints can take
   into account wall-clock time and as such allow RV monitors to make
   assertions on real time. Add documentation and code generation
   scripts.

 - Add stall monitor as hybrid automaton example

   Add a monitor that triggers a violation when a task is stalling as an
   example of automaton working with real time variables.

 - Convert the opid monitor to a hybrid automaton

   The opid monitor can be heavily simplified if written as a hybrid
   automaton: instead of tracking preempt and interrupt enable/disable
   events, it can just run constraints on the preemption/interrupt
   states when events like wakeup and need_resched occur.

 - Add support for per-object monitors in DA/HA

   Allow writing deterministic and hybrid automata monitors for generic
   objects (e.g. any struct), by exploiting a hash table where objects
   are saved. This allows tracking more than just tasks in RV. For
   instance it will be used to track deadline entities in deadline
   monitors.

 - Add deadline tracepoints and move some deadline utilities

   Prepare the ground for deadline monitors by defining events and
   exporting helpers.

 - Add nomiss deadline monitor

   Add first example of deadline monitor asserting all entities complete
   before their deadline.

 - Improve rvgen error handling

   Introduce AutomataError exception class and better handle expected
   exceptions while showing a backtrace for unexpected ones.

 - Improve python code quality in rvgen

   Refactor the rvgen generation scripts to align with python best
   practices: use f-strings instead of %, use len() instead of
   __len__(), remove semicolons, use context managers for file
   operations, fix whitespace violations, extract magic strings into
   constants, remove unused imports and methods.

 - Fix small bugs in rvgen

   The generator scripts had some corner-case bugs: a logical error in
   validating what a correct dot file looks like, a broken isinstance()
   check, no check that a dot file has an initial state, and wrong type
   annotations and typos in comments.

 - rvgen refactoring

   Refactor automata.py to use iterator-based parsing and handle
   required arguments directly in argparse.

 - Allow epoll in rtapp-sleep monitor

   The epoll_wait call is now rt-friendly so it should be allowed in the
   sleep monitor as a valid sleep method.

* tag 'trace-rv-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (32 commits)
  rv: Allow epoll in rtapp-sleep monitor
  rv/rvgen: fix _fill_states() return type annotation
  rv/rvgen: fix unbound loop variable warning
  rv/rvgen: enforce presence of initial state
  rv/rvgen: extract node marker string to class constant
  rv/rvgen: fix isinstance check in Variable.expand()
  rv/rvgen: make monitor arguments required in rvgen
  rv/rvgen: remove unused __get_main_name method
  rv/rvgen: remove unused sys import from dot2c
  rv/rvgen: refactor automata.py to use iterator-based parsing
  rv/rvgen: use class constant for init marker
  rv/rvgen: fix DOT file validation logic error
  rv/rvgen: fix PEP 8 whitespace violations
  rv/rvgen: fix typos in automata and generator docstring and comments
  rv/rvgen: use context managers for file operations
  rv/rvgen: remove unnecessary semicolons
  rv/rvgen: replace __len__() calls with len()
  rv/rvgen: replace % string formatting with f-strings
  rv/rvgen: remove bare except clauses in generator
  rv/rvgen: introduce AutomataError exception class
  ...
2026-04-15 17:15:18 -07:00
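The rvgen code-quality items above are generic Python modernisations. A minimal before/after sketch of the same patterns (hypothetical helper names, not the actual rvgen code):

```python
# Hypothetical before/after sketch of the rvgen-style Python cleanups:
# f-strings instead of %-formatting, len() instead of __len__(), and a
# context manager for file I/O. Names are illustrative only.

def describe_before(name, states):
    # Old style: %-formatting and a direct __len__() call.
    return "monitor %s has %d states" % (name, states.__len__())

def describe_after(name, states):
    # New style: f-string and the len() builtin.
    return f"monitor {name} has {len(states)} states"

def write_report(path, text):
    # Context manager closes the file even if write() raises.
    with open(path, "w") as f:
        f.write(text)
```

Both variants produce identical output; the refactors change readability, not behaviour.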
Linus Torvalds
5bdb4078e1 sched_ext: Changes for v7.1
- Cgroup sub-scheduler groundwork. Multiple BPF schedulers can be
   attached to cgroups and the dispatch path is made hierarchical. This
   involves substantial restructuring of the core dispatch, bypass,
   watchdog, and dump paths to be per-scheduler, along with new
   infrastructure for scheduler ownership enforcement, lifecycle
   management, and cgroup subtree iteration. The enqueue path is not yet
   updated and will follow in a later cycle.
 
 - scx_bpf_dsq_reenq() generalized to support any DSQ including remote
   local DSQs and user DSQs. Built on top of this, SCX_ENQ_IMMED
   guarantees that tasks dispatched to local DSQs either run immediately
   or get reenqueued back through ops.enqueue(), giving schedulers tighter
   control over queueing latency. Also useful for opportunistic CPU
   sharing across sub-schedulers.
 
 - ops.dequeue() was only invoked when the core knew a task was in BPF
   data structures, missing scheduling property change events and skipping
   callbacks for non-local DSQ dispatches from ops.select_cpu(). Fixed to
   guarantee exactly one ops.dequeue() call when a task leaves BPF
   scheduler custody.
 
 - Kfunc access validation moved from runtime to BPF verifier time,
   removing runtime mask enforcement.
 
 - Idle SMT sibling prioritization in the idle CPU selection path.
 
 - Documentation, selftest, and tooling updates. Misc bug fixes and
   cleanups.
 
 - Merges from tip/sched-core, cgroup/for-7.1, and for-7.0-fixes to
   resolve dependencies and conflicts for the above changes.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCad0uaA4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGbktAQD2HrKdydyEfefz/n4mNpIXh/DFYX49NgKYcgUh
 sKy4ngD/Sy7nAZS2zwM+36PN6jBV7+cfuoaiKPgCstPFeGsvPwU=
 =fsgj
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext updates from Tejun Heo:

 - cgroup sub-scheduler groundwork

   Multiple BPF schedulers can be attached to cgroups and the dispatch
   path is made hierarchical. This involves substantial restructuring of
   the core dispatch, bypass, watchdog, and dump paths to be
   per-scheduler, along with new infrastructure for scheduler ownership
   enforcement, lifecycle management, and cgroup subtree iteration

   The enqueue path is not yet updated and will follow in a later cycle

 - scx_bpf_dsq_reenq() generalized to support any DSQ including remote
   local DSQs and user DSQs

   Built on top of this, SCX_ENQ_IMMED guarantees that tasks dispatched
   to local DSQs either run immediately or get reenqueued back through
   ops.enqueue(), giving schedulers tighter control over queueing
   latency

   Also useful for opportunistic CPU sharing across sub-schedulers

 - ops.dequeue() was only invoked when the core knew a task was in BPF
   data structures, missing scheduling property change events and
   skipping callbacks for non-local DSQ dispatches from ops.select_cpu()

   Fixed to guarantee exactly one ops.dequeue() call when a task leaves
   BPF scheduler custody

 - Kfunc access validation moved from runtime to BPF verifier time,
   removing runtime mask enforcement

 - Idle SMT sibling prioritization in the idle CPU selection path

 - Documentation, selftest, and tooling updates. Misc bug fixes and
   cleanups

* tag 'sched_ext-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (134 commits)
  tools/sched_ext: Add explicit cast from void* in RESIZE_ARRAY()
  sched_ext: Make string params of __ENUM_set() const
  tools/sched_ext: Kick home CPU for stranded tasks in scx_qmap
  sched_ext: Drop spurious warning on kick during scheduler disable
  sched_ext: Warn on task-based SCX op recursion
  sched_ext: Rename scx_kf_allowed_on_arg_tasks() to scx_kf_arg_task_ok()
  sched_ext: Remove runtime kfunc mask enforcement
  sched_ext: Add verifier-time kfunc context filter
  sched_ext: Drop redundant rq-locked check from scx_bpf_task_cgroup()
  sched_ext: Decouple kfunc unlocked-context check from kf_mask
  sched_ext: Fix ops.cgroup_move() invocation kf_mask and rq tracking
  sched_ext: Track @p's rq lock across set_cpus_allowed_scx -> ops.set_cpumask
  sched_ext: Add select_cpu kfuncs to scx_kfunc_ids_unlocked
  sched_ext: Drop TRACING access to select_cpu kfuncs
  selftests/sched_ext: Fix wrong DSQ ID in peek_dsq error message
  sched_ext: Documentation: improve accuracy of task lifecycle pseudo-code
  selftests/sched_ext: Improve runner error reporting for invalid arguments
  sched_ext: Documentation: Fix scx_bpf_move_to_local kfunc name
  sched_ext: Documentation: Add ops.dequeue() to task lifecycle
  tools/sched_ext: Fix off-by-one in scx_sdt payload zeroing
  ...
2026-04-15 10:54:24 -07:00
Linus Torvalds
1c3b68f0d5 Scheduler changes for v7.1:
Fair scheduling updates:
 
  - Skip SCHED_IDLE rq for SCHED_IDLE tasks (Christian Loehle)
  - Remove superfluous rcu_read_lock() in the wakeup path (K Prateek Nayak)
  - Simplify the entry condition for update_idle_cpu_scan() (K Prateek Nayak)
  - Simplify SIS_UTIL handling in select_idle_cpu() (K Prateek Nayak)
  - Avoid overflow in enqueue_entity() (K Prateek Nayak)
  - Update overutilized detection (Vincent Guittot)
  - Prevent negative lag increase during delayed dequeue (Vincent Guittot)
  - Clear buddies for preempt_short (Vincent Guittot)
  - Implement more complex proportional newidle balance (Peter Zijlstra)
  - Increase weight bits for avg_vruntime (Peter Zijlstra)
  - Use full weight to __calc_delta() (Peter Zijlstra)
 
 RT and DL scheduling updates:
 
  - Fix incorrect schedstats for rt and dl thread (Dengjun Su)
  - Skip group schedulable check with rt_group_sched=0 (Michal Koutný)
  - Move group schedulability check to sched_rt_global_validate()
    (Michal Koutný)
  - Add reporting of runtime left & abs deadline to sched_getattr()
    for DEADLINE tasks (Tommaso Cucinotta)
 
 Scheduling topology updates by K Prateek Nayak:
 
  - Compute sd_weight considering cpuset partitions
  - Extract "imb_numa_nr" calculation into a separate helper
  - Allocate per-CPU sched_domain_shared in s_data
  - Switch to assigning "sd->shared" from s_data
  - Remove sched_domain_shared allocation with sd_data
 
 Energy-aware scheduling updates:
 
  - Filter false overloaded_group case for EAS (Vincent Guittot)
  - PM: EM: Switch to rcu_dereference_all() in wakeup path
    (Dietmar Eggemann)
 
 Infrastructure updates:
 
  - Replace use of system_unbound_wq with system_dfl_wq (Marco Crivellari)
 
 Proxy scheduling updates by John Stultz:
 
  - Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
  - Minimise repeated sched_proxy_exec() checking
  - Fix potentially missing balancing with Proxy Exec
  - Fix and improve task::blocked_on et al handling
  - Add assert_balance_callbacks_empty() helper
  - Add logic to zap balancing callbacks if we pick again
  - Move attach_one_task() and attach_task() helpers to sched.h
  - Handle blocked-waiter migration (and return migration)
  - Add K Prateek Nayak to scheduler reviewers for proxy execution
 
 Misc cleanups and fixes by John Stultz, Joseph Salisbury,
 Peter Zijlstra, K Prateek Nayak, Michal Koutný, Randy Dunlap,
 Shrikanth Hegde, Vincent Guittot, Zhan Xusheng and Xie Yuanbin.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmncq4oRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gxoA/8DD0SsMhBLaZLi+LAdY5fD6rGjOLGBtxz
 NgwN8CAvPIFH7qFzPjAk7WtVXoKjF62sRDFvUaBEsliflRzOkBkYr3SnUYRORyBB
 VRj7D6ymuWhxnhYsy8+Hviu/93c3GyEO59IYU0wIShxBzYBxqDfNxWvEUQte2Cin
 1yFy4CICJeGpsBv9Ev+0LtesxtF5bnaioawbAYcpc2IdYsK+nsMKRvkwg1YSdLmh
 v9+vIYuQBrclBn3OR7dsv2krBev5qodYtDZFwdJagE+6aaQv2zhWIfhetPpkzwrq
 zhuzVZH+E9404Pn5EqJaw7KmU9eyBBwIUVqBaQfH73eSe5PY0tiSrpPU9foocUjo
 4Td9sL11SLzjwpM4bIijW0ezZY8y+4Q0A21GwdcwAx3LPstXcF5GIjQ76dVFPRKN
 Unbt6o+9O9NvMLg8CLzwonlFzOoLOrL+5eKJs+caOuOikT+cXnBQrukgB4ck3RAD
 PIVD8XnufJTCKiDvx2vravLXsWiA2cg7citVsgc8y5FBcdhzv3YVqXd/lGkqg+09
 7rVqE6NRDlkk4G4KZACTK45YVcVwXhQlMU/qiS0IduHdD0NtL9DPnQvdfzQWQehO
 30cJ5vZ+fqbHspJ8AdPuqntUyfEvPTCbCT4Ou/AEcvO8NRQu2gplcq9mF4U46WZG
 GBPWXvGHzM8=
 =NjyS
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Fair scheduling updates:
   - Skip SCHED_IDLE rq for SCHED_IDLE tasks (Christian Loehle)
   - Remove superfluous rcu_read_lock() in the wakeup path (K Prateek Nayak)
   - Simplify the entry condition for update_idle_cpu_scan() (K Prateek Nayak)
   - Simplify SIS_UTIL handling in select_idle_cpu() (K Prateek Nayak)
   - Avoid overflow in enqueue_entity() (K Prateek Nayak)
   - Update overutilized detection (Vincent Guittot)
   - Prevent negative lag increase during delayed dequeue (Vincent Guittot)
   - Clear buddies for preempt_short (Vincent Guittot)
   - Implement more complex proportional newidle balance (Peter Zijlstra)
   - Increase weight bits for avg_vruntime (Peter Zijlstra)
   - Use full weight to __calc_delta() (Peter Zijlstra)

  RT and DL scheduling updates:
   - Fix incorrect schedstats for rt and dl thread (Dengjun Su)
   - Skip group schedulable check with rt_group_sched=0 (Michal Koutný)
   - Move group schedulability check to sched_rt_global_validate()
     (Michal Koutný)
   - Add reporting of runtime left & abs deadline to sched_getattr()
     for DEADLINE tasks (Tommaso Cucinotta)

  Scheduling topology updates by K Prateek Nayak:
   - Compute sd_weight considering cpuset partitions
   - Extract "imb_numa_nr" calculation into a separate helper
   - Allocate per-CPU sched_domain_shared in s_data
   - Switch to assigning "sd->shared" from s_data
   - Remove sched_domain_shared allocation with sd_data

  Energy-aware scheduling updates:
   - Filter false overloaded_group case for EAS (Vincent Guittot)
   - PM: EM: Switch to rcu_dereference_all() in wakeup path
     (Dietmar Eggemann)

  Infrastructure updates:
   - Replace use of system_unbound_wq with system_dfl_wq (Marco Crivellari)

  Proxy scheduling updates by John Stultz:
   - Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
   - Minimise repeated sched_proxy_exec() checking
   - Fix potentially missing balancing with Proxy Exec
   - Fix and improve task::blocked_on et al handling
   - Add assert_balance_callbacks_empty() helper
   - Add logic to zap balancing callbacks if we pick again
   - Move attach_one_task() and attach_task() helpers to sched.h
   - Handle blocked-waiter migration (and return migration)
   - Add K Prateek Nayak to scheduler reviewers for proxy execution

  Misc cleanups and fixes by John Stultz, Joseph Salisbury, Peter
  Zijlstra, K Prateek Nayak, Michal Koutný, Randy Dunlap, Shrikanth
  Hegde, Vincent Guittot, Zhan Xusheng and Xie Yuanbin"

* tag 'sched-core-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
  sched/eevdf: Clear buddies for preempt_short
  sched/rt: Cleanup global RT bandwidth functions
  sched/rt: Move group schedulability check to sched_rt_global_validate()
  sched/rt: Skip group schedulable check with rt_group_sched=0
  sched/fair: Avoid overflow in enqueue_entity()
  sched: Use u64 for bandwidth ratio calculations
  sched/fair: Prevent negative lag increase during delayed dequeue
  sched/fair: Use sched_energy_enabled()
  sched: Handle blocked-waiter migration (and return migration)
  sched: Move attach_one_task and attach_task helpers to sched.h
  sched: Add logic to zap balance callbacks if we pick again
  sched: Add assert_balance_callbacks_empty helper
  sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration
  sched: Fix modifying donor->blocked on without proper locking
  locking: Add task::blocked_lock to serialize blocked_on state
  sched: Fix potentially missing balancing with Proxy Exec
  sched: Minimise repeated sched_proxy_exec() checking
  sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
  MAINTAINERS: Add K Prateek Nayak to scheduler reviewers
  sched/core: Get this cpu once in ttwu_queue_cond()
  ...
2026-04-14 13:33:36 -07:00
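For context on the avg_vruntime items above: EEVDF keeps the queue's weighted average vruntime as a relative sum so eligibility can be tested without division. The following is a hedged Python sketch of that check only (illustrative names, not kernel code; the kernel does this in s64/u64 arithmetic, which is exactly why widening the weight bits for avg_vruntime matters):

```python
# Sketch of the EEVDF eligibility idea: an entity is eligible when its
# vruntime is at or before the load-weighted average vruntime of the queue.
# The average is kept as a relative sum, avg = sum(w_i * (v_i - v0)), so
#   v <= v0 + avg / W    becomes    (v - v0) * W <= avg
# with no division. Illustrative only; not the kernel's implementation.

def vruntime_eligible(entities, v0, vruntime):
    """entities: list of (weight, vruntime) pairs; v0: queue min_vruntime."""
    W = sum(w for w, _ in entities)               # total load
    avg = sum(w * (v - v0) for w, v in entities)  # relative weighted sum
    return (vruntime - v0) * W <= avg
```

In C the multiply on the left-hand side is where a signed 64-bit overflow can bite, hence the fixes referenced above.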
Linus Torvalds
c1fe867b5b Updates for the timer/timekeeping core:
- A rework of the hrtimer subsystem to reduce the overhead for frequently
     armed timers, especially the hrtick scheduler timer.
 
       - Better timer locality decision
 
       - Simplification of the evaluation of the first expiry time by
         keeping track of the neighbor timers in the RB-tree by providing a
         RB-tree variant with neighbor links. That avoids walking the
          RB-tree on removal to find the next expiry time and, more
          importantly, allows quickly evaluating whether a rearmed timer
          changes its position in the RB-tree with the modified expiry
          time or not. If not, the dequeue/enqueue sequence, either half
          of which can end up rebalancing the tree, is avoided entirely.
 
       - Deferred reprogramming of the underlying clock event device. This
         optimizes for the situation where a hrtimer callback sets the need
         resched bit. In that case the code attempts to defer the
         re-programming of the clock event device up to the point where the
         scheduler has picked the next task and has the next hrtick timer
          armed. If there is no immediate reschedule, or soft interrupts
          have to be handled before reaching the reschedule point in the
          interrupt entry code, the clock event is reprogrammed in one of
          those code paths so that the timer does not become stale.
 
       - Support for clocksource coupled clockevents
 
       	The TSC deadline timer is coupled to the TSC. The next event is
       	programmed in TSC time. Currently this is done by converting the
       	CLOCK_MONOTONIC based expiry value into a relative timeout,
       	converting it into TSC ticks, reading the TSC, adding the delta
       	ticks and writing the deadline MSR.
 
 	As the timekeeping core has the conversion factors for the TSC
 	already, the whole back and forth conversion can be completely
 	avoided. The timekeeping core calculates the reverse conversion
 	factors from nanoseconds to TSC ticks and utilizes the base
 	timestamps of TSC and CLOCK_MONOTONIC which are updated once per
 	tick. This allows a direct conversion into the TSC deadline value
 	without reading the time and as a bonus keeps the deadline
 	conversion in sync with the TSC conversion factors, which are
 	updated by adjtimex() on systems with NTP/PTP enabled.
 
      - Allow inlining of the clocksource read and clockevent write
        functions when they are tiny enough, e.g. on x86 RDTSC and WRMSR.
 
     With all those enhancements in place a hrtick enabled scheduler
     provides the same performance as without hrtick. But also other hrtimer
     users obviously benefit from these optimizations.
 
   - Robustness improvements and cleanups of historical sins in the hrtimer
     and timekeeping code.
 
   - Rewrite of the clocksource watchdog.
 
     The clocksource watchdog code has over time reached the state of an
     impenetrable maze of duct tape and staples. The original design, which was
     made in the context of systems far smaller than today, is based on the
     assumption that the monitored clocksource (TSC) can be trivially
     compared against a known-stable clocksource (HPET/ACPI-PM timer).
 
     Over the years this rather naive approach turned out to have major
     flaws. Long delays between the watchdog invocations can cause
     wraparounds of the reference clocksource. The access to the reference
     clocksource degrades on large multi-socket systems due to
     interconnect congestion. This has been addressed with various
     heuristics which degraded the accuracy of the watchdog to the point
     that it fails to detect actual TSC problems on older hardware which
     exposes slow inter-CPU drifts due to firmware manipulating the TSC to
     hide SMI time.
 
     The rewrite addresses this by:
 
       - Restricting the validation against the reference clocksource to the
         boot CPU which is usually closest to the legacy block which
         contains the reference clocksource (HPET/ACPI-PM).
 
       - Do a round-robin validation between the boot CPU and the other CPUs
         based only on the TSC with an algorithm similar to the TSC
         synchronization code during CPU hotplug.
 
       - Being more lenient about remote timeouts
 
   - The usual tiny fixes, cleanups and enhancements all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCgAuFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmnbxdMQHHRnbHhAa2Vy
 bmVsLm9yZwAKCRCmGPVMDXSYoeq6EAC4h9wuBr5yCkxmog1Bhlk9cnK0oX1THb7V
 Q4z6DrYAiDXP6z4IDwqSW+3vvmNw1QXOeqpyMTiiIcQ5mNSs1IDnCt5HOEwY+ICm
 fiSUMYkXkH6xdFWspYWFkD7aExHJRT3hd/bo+WnXGHhHclPj5NHZssLMIDboHrzX
 jLV1hljmthfwg/uOXDGmQUPRFjqr2ZQjo7zGA5SwfVg8Krz7g/qRVy2wUns9TdW/
 NYwihDm1YV7qkK/+f1GnMdd70toqb1OZo/fS9FPbBrPLdyi8V+UbnFSUeZu8kCwV
 KubAzjLZR4xYCnrlaHhoi208GMd0TOvHMOrdAA0zkQHfhmszGl4N0pbF/EI29Ft2
 tQG/FUTG+nzgNOrMCPN2nr3u/UOXLP+gO+2hkyyQjqUP35IyaTYQn10JgBmPTdJL
 Ab6E8WL9gTMCd+t/bVjdU/B8W9ruMihKBtWkTfMBCcQ9uNJFCEGzrcMF8hzFYRTs
 /4rMDr3NlGYydAnbKPj6bkC5gtjBvh/L08kOdUFyXCMSqIzvJkZJ4241ogl1Awi6
 VfdwjF5KZCQo3M1ujpep+1L010wC/yulqLt2brQMO9Nt05dRhgwM3lxy7cnlMNm3
 NdfMgi+OG0CzQ+ZUpvo20hCgTDUVgWN9g5R3rar8FJX+Ym3T+ZoEKlShZF+fSRjf
 YAUIbUyi7A==
 =2qc8
 -----END PGP SIGNATURE-----

Merge tag 'timers-core-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer core updates from Thomas Gleixner:

 - A rework of the hrtimer subsystem to reduce the overhead for
   frequently armed timers, especially the hrtick scheduler timer:

     - Better timer locality decision

     - Simplification of the evaluation of the first expiry time by
       keeping track of the neighbor timers in the RB-tree by providing
       a RB-tree variant with neighbor links. That avoids walking the
        RB-tree on removal to find the next expiry time and, more
        importantly, allows quickly evaluating whether a rearmed timer
        changes its position in the RB-tree with the modified expiry
        time or not. If not, the dequeue/enqueue sequence, either half
        of which can end up rebalancing the tree, is avoided entirely.

     - Deferred reprogramming of the underlying clock event device. This
       optimizes for the situation where a hrtimer callback sets the
       need resched bit. In that case the code attempts to defer the
       re-programming of the clock event device up to the point where
       the scheduler has picked the next task and has the next hrtick
        timer armed. If there is no immediate reschedule, or soft
        interrupts have to be handled before reaching the reschedule
        point in the interrupt entry code, the clock event is
        reprogrammed in one of those code paths so that the timer does
        not become stale.

     - Support for clocksource coupled clockevents

       The TSC deadline timer is coupled to the TSC. The next event is
       programmed in TSC time. Currently this is done by converting the
       CLOCK_MONOTONIC based expiry value into a relative timeout,
        converting it into TSC ticks, reading the TSC, adding the delta
       ticks and writing the deadline MSR.

       As the timekeeping core has the conversion factors for the TSC
       already, the whole back and forth conversion can be completely
       avoided. The timekeeping core calculates the reverse conversion
       factors from nanoseconds to TSC ticks and utilizes the base
       timestamps of TSC and CLOCK_MONOTONIC which are updated once per
       tick. This allows a direct conversion into the TSC deadline value
       without reading the time and as a bonus keeps the deadline
       conversion in sync with the TSC conversion factors, which are
       updated by adjtimex() on systems with NTP/PTP enabled.

     - Allow inlining of the clocksource read and clockevent write
       functions when they are tiny enough, e.g. on x86 RDTSC and WRMSR.

   With all those enhancements in place a hrtick enabled scheduler
   provides the same performance as without hrtick. But also other
   hrtimer users obviously benefit from these optimizations.

 - Robustness improvements and cleanups of historical sins in the
   hrtimer and timekeeping code.

 - Rewrite of the clocksource watchdog.

   The clocksource watchdog code has over time reached the state of an
   impenetrable maze of duct tape and staples. The original design,
   which was made in the context of systems far smaller than today, is
    based on the assumption that the monitored clocksource (TSC) can be
    trivially compared against a known-stable clocksource
    (HPET/ACPI-PM timer).

   Over the years this rather naive approach turned out to have major
    flaws. Long delays between the watchdog invocations can cause
    wraparounds of the reference clocksource. The access to the reference
    clocksource degrades on large multi-socket systems due to
   interconnect congestion. This has been addressed with various
   heuristics which degraded the accuracy of the watchdog to the point
   that it fails to detect actual TSC problems on older hardware which
    exposes slow inter-CPU drifts due to firmware manipulating the TSC to
   hide SMI time.

   The rewrite addresses this by:

     - Restricting the validation against the reference clocksource to
       the boot CPU which is usually closest to the legacy block which
       contains the reference clocksource (HPET/ACPI-PM).

      - Do a round-robin validation between the boot CPU and the other
       CPUs based only on the TSC with an algorithm similar to the TSC
       synchronization code during CPU hotplug.

      - Being more lenient about remote timeouts

 - The usual tiny fixes, cleanups and enhancements all over the place

* tag 'timers-core-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
  alarmtimer: Access timerqueue node under lock in suspend
  hrtimer: Fix incorrect #endif comment for BITS_PER_LONG check
  posix-timers: Fix stale function name in comment
  timers: Get this_cpu once while clearing the idle state
  clocksource: Rewrite watchdog code completely
  clocksource: Don't use non-continuous clocksources as watchdog
  x86/tsc: Handle CLOCK_SOURCE_VALID_FOR_HRES correctly
  MIPS: Don't select CLOCKSOURCE_WATCHDOG
  parisc: Remove unused clocksource flags
  hrtimer: Add a helper to retrieve a hrtimer from its timerqueue node
  hrtimer: Remove trailing comma after HRTIMER_MAX_CLOCK_BASES
  hrtimer: Mark index and clockid of clock base as const
  hrtimer: Drop unnecessary pointer indirection in hrtimer_expire_entry event
  hrtimer: Drop spurious space in 'enum hrtimer_base_type'
  hrtimer: Don't zero-initialize ret in hrtimer_nanosleep()
  hrtimer: Remove hrtimer_get_expires_ns()
  timekeeping: Mark offsets array as const
  timekeeping/auxclock: Consistently use raw timekeeper for tk_setup_internals()
  timer_list: Print offset as signed integer
  tracing: Use explicit array size instead of sentinel elements in symbol printing
  ...
2026-04-14 10:27:07 -07:00
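The "clocksource coupled clockevents" conversion described above can be sketched numerically. This is a hedged illustration of the mult/shift idea only, with made-up constants and names; it is not the kernel's implementation:

```python
# Sketch: converting a CLOCK_MONOTONIC expiry directly into a TSC deadline
# using a precomputed inverse conversion factor (nanoseconds -> TSC ticks,
# as a mult/shift pair) and per-tick base timestamps for CLOCK_MONOTONIC
# and the TSC. This avoids the ns -> relative timeout -> read-TSC -> add
# round trip described above. All constants here are illustrative.

SHIFT = 32

def ns_to_tsc_mult(tsc_hz):
    # Inverse factor so that ticks = (ns * mult) >> SHIFT.
    return (tsc_hz << SHIFT) // 1_000_000_000

def tsc_deadline(expiry_ns, mono_base_ns, tsc_base, mult):
    # Delta since the last tick's CLOCK_MONOTONIC base timestamp, converted
    # straight into TSC ticks and added to the TSC base timestamp.
    delta_ns = expiry_ns - mono_base_ns
    return tsc_base + ((delta_ns * mult) >> SHIFT)
```

With a 1 GHz TSC the factor is exactly 1 tick/ns, so a 500 ns delta lands 500 ticks past the TSC base; at 3 GHz the same math yields 3 ticks per nanosecond. Because the factors come from the timekeeping core, they track adjtimex() corrections for free, as the pull message notes.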
Linus Torvalds
d7c8087a9c Power management updates for 7.1-rc1
- Update qcom-hw DT bindings to include Eliza hardware (Abel Vesa)
 
  - Update cpufreq-dt-platdev blocklist (Faruque Ansari)
 
  - Minor updates to driver and dt-bindings for Tegra (Thierry Reding,
    Rosen Penev)
 
  - Add MAINTAINERS entry for CPPC driver (Viresh Kumar)
 
  - Add support for new features: CPPC performance priority, Dynamic EPP,
    Raw EPP, and new unit tests for them to amd-pstate (Gautham Shenoy,
    Mario Limonciello)
 
  - Fix sysfs files being present when HW missing and broken/outdated
    documentation in the amd-pstate driver (Ninad Naik, Gautham Shenoy)
 
  - Pass the policy to cpufreq_driver->adjust_perf() to avoid using
    cpufreq_cpu_get() in the .adjust_perf() callback in amd-pstate which
    leads to a scheduling-while-atomic bug (K Prateek Nayak)
 
  - Clean up dead code in Kconfig for cpufreq (Julian Braha)
 
  - Remove max_freq_req update for pre-existing cpufreq policy and add a
    boost_freq_req QoS request to save the boost constraint instead of
    overwriting the last scaling_max_freq constraint (Pierre Gondois)
 
  - Embed cpufreq QoS freq_req objects in cpufreq policy so they all
    are allocated in one go along with the policy to simplify lifetime
    rules and avoid error handling issues (Viresh Kumar)
 
  - Use DMI max speed when CPPC is unavailable in the acpi-cpufreq
    scaling driver (Henry Tseng)
 
  - Switch policy_is_shared() in cpufreq to using cpumask_nth() instead
    of cpumask_weight() because the former is more efficient (Yury Norov)
 
  - Use sysfs_emit() in sysfs show functions for cpufreq governor
    attributes (Thorsten Blum)
 
  - Update intel_pstate to stop returning an error when "off" is written
    to its status sysfs attribute while the driver is already off (Fabio
    De Francesco)
 
  - Include current frequency in the debug message printed by
    __cpufreq_driver_target() (Pengjie Zhang)
 
  - Refine stopped tick handling in the menu cpuidle governor and
    rearrange stopped tick handling in the teo cpuidle governor (Rafael
    Wysocki)
 
  - Add Panther Lake C-states table to the intel_idle driver (Artem
    Bityutskiy)
 
  - Clean up dead dependencies on CPU_IDLE in Kconfig (Julian Braha)
 
  - Simplify cpuidle_register_device() with guard() (Huisong Li)
 
  - Use performance level if available to distinguish between rates in
    OPP debugfs (Manivannan Sadhasivam)
 
  - Fix scoped_guard in dev_pm_opp_xlate_required_opp() (Viresh Kumar)
 
  - Return -ENODATA if the snapshot image is not loaded (Alberto Garcia)
 
  - Remove inclusion of crypto/hash.h from hibernate_64.c on x86 (Eric
    Biggers)
 
  - Clean up and rearrange the intel_rapl power capping driver to make
    the respective interface drivers (TPMI, MSR, and MMIO) hold their
    own settings and primitives and consolidate PL4 and PMU support
    flags into rapl_defaults (Kuppuswamy Sathyanarayanan)
 
  - Correct kernel-doc function parameter names in the power capping core
    code (Randy Dunlap)
 
  - Remove unneeded casting for HZ_PER_KHZ in devfreq (Andy Shevchenko)
 
  - Use _visible attribute to replace create/remove_sysfs_files() in
    devfreq (Pengjie Zhang)
 
  - Add Tegra114 support to activity monitor device in tegra30-devfreq in
    preparation for upcoming EMC controller support (Svyatoslav Ryhel)
 
  - Fix mistakes in cpupower man pages, add the boost and epp options to
    the cpupower-frequency-info man page, and add the perf-bias option to
    the cpupower-info man page (Roberto Ricci)
 
  - Remove unnecessary extern declarations from getopt.h in argument
    parsing functions in cpufreq-set, cpuidle-info, cpuidle-set,
    cpupower-info, and cpupower-set utilities (Kaushlendra Kumar)
 -----BEGIN PGP SIGNATURE-----
 
 iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmnY9TISHHJqd0Byand5
 c29ja2kubmV0AAoJEO5fvZ0v1OO1G9gH/j5mEqfPpiwX6fQ/ZwOGdNOOPVA5w9j4
 KPHSMwMD5lZkoaZfasp2vt27KY5SOoVVvRZ2DKkFJ3Jai4I3cUPZYypga2nre1ag
 tgzX4vOjcw2r40Eda6ezWl1h4mca/xJJBX7xH2+hn1JY+Y1in37g50CqMIjKh96z
 Uugkk6UZytL1XcF55PMhIUgDf6pDtRT5UOW9xOKOkUt8FVWTJ7ei3HaWyV5kDmVq
 b5eQ42+OH7y6sWNnoKczFd8fStvh6J/avoJurBEvcOQhMcjaIaB48G19+KjDg73E
 NjrVcgG20P2rltBvV2d0J1TKskZHkaP7XjIeWfkwjGZhee3FL7ssS/g=
 =fRCO
 -----END PGP SIGNATURE-----

Merge tag 'pm-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "Once again, cpufreq is the most active development area, mostly
  because of the new feature additions and documentation updates in the
  amd-pstate driver, but there are also changes in the cpufreq core
  related to boost support and other assorted updates elsewhere.

  Next up are power capping changes due to the major cleanup of the
  Intel RAPL driver.

  On the cpuidle front, a new C-states table for Intel Panther Lake is
  added to the intel_idle driver, the stopped tick handling in the menu
  and teo governors is updated, and there are a couple of cleanups.

  Apart from the above, support for Tegra114 is added to devfreq and
  there are assorted cleanups of that code, there are also two updates
  of the operating performance points (OPP) library, two minor updates
  related to hibernation, and cpupower utility man pages updates and
  cleanups.

  Specifics:

   - Update qcom-hw DT bindings to include Eliza hardware (Abel Vesa)

   - Update cpufreq-dt-platdev blocklist (Faruque Ansari)

   - Minor updates to driver and dt-bindings for Tegra (Thierry Reding,
     Rosen Penev)

   - Add MAINTAINERS entry for CPPC driver (Viresh Kumar)

   - Add support for new features: CPPC performance priority, Dynamic
     EPP, Raw EPP, and new unit tests for them to amd-pstate (Gautham
     Shenoy, Mario Limonciello)

   - Fix sysfs files being present when the underlying hardware is
     missing, and broken/outdated documentation, in the amd-pstate
     driver (Ninad Naik, Gautham Shenoy)

   - Pass the policy to cpufreq_driver->adjust_perf() to avoid using
     cpufreq_cpu_get() in the .adjust_perf() callback in amd-pstate,
     which leads to a scheduling-while-atomic bug (K Prateek Nayak)

   - Clean up dead code in Kconfig for cpufreq (Julian Braha)

   - Remove max_freq_req update for pre-existing cpufreq policy and add
     a boost_freq_req QoS request to save the boost constraint instead
     of overwriting the last scaling_max_freq constraint (Pierre
     Gondois)

   - Embed cpufreq QoS freq_req objects in cpufreq policy so they all
     are allocated in one go along with the policy to simplify lifetime
     rules and avoid error handling issues (Viresh Kumar)

   - Use DMI max speed when CPPC is unavailable in the acpi-cpufreq
     scaling driver (Henry Tseng)

   - Switch policy_is_shared() in cpufreq to using cpumask_nth() instead
     of cpumask_weight() because the former is more efficient (Yury
     Norov)

   - Use sysfs_emit() in sysfs show functions for cpufreq governor
     attributes (Thorsten Blum)

   - Update intel_pstate to stop returning an error when "off" is
     written to its status sysfs attribute while the driver is already
     off (Fabio De Francesco)

   - Include current frequency in the debug message printed by
     __cpufreq_driver_target() (Pengjie Zhang)

   - Refine stopped tick handling in the menu cpuidle governor and
     rearrange stopped tick handling in the teo cpuidle governor (Rafael
     Wysocki)

   - Add Panther Lake C-states table to the intel_idle driver (Artem
     Bityutskiy)

   - Clean up dead dependencies on CPU_IDLE in Kconfig (Julian Braha)

   - Simplify cpuidle_register_device() with guard() (Huisong Li)

   - Use performance level if available to distinguish between rates in
     OPP debugfs (Manivannan Sadhasivam)

   - Fix scoped_guard in dev_pm_opp_xlate_required_opp() (Viresh Kumar)

   - Return -ENODATA if the snapshot image is not loaded (Alberto
     Garcia)

   - Remove inclusion of crypto/hash.h from hibernate_64.c on x86 (Eric
     Biggers)

   - Clean up and rearrange the intel_rapl power capping driver to make
     the respective interface drivers (TPMI, MSR, and MMIO) hold their
     own settings and primitives and consolidate PL4 and PMU support
     flags into rapl_defaults (Kuppuswamy Sathyanarayanan)

   - Correct kernel-doc function parameter names in the power capping
     core code (Randy Dunlap)

   - Remove unneeded casting for HZ_PER_KHZ in devfreq (Andy Shevchenko)

   - Use _visible attribute to replace create/remove_sysfs_files() in
     devfreq (Pengjie Zhang)

   - Add Tegra114 support to activity monitor device in tegra30-devfreq
     in preparation for upcoming EMC controller support (Svyatoslav
     Ryhel)

   - Fix mistakes in cpupower man pages, add the boost and epp options
     to the cpupower-frequency-info man page, and add the perf-bias
     option to the cpupower-info man page (Roberto Ricci)

   - Remove unnecessary extern declarations from getopt.h in argument
     parsing functions in cpufreq-set, cpuidle-info, cpuidle-set,
     cpupower-info, and cpupower-set utilities (Kaushlendra Kumar)"

* tag 'pm-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (74 commits)
  cpufreq/amd-pstate: Add POWER_SUPPLY select for dynamic EPP
  cpupower: remove extern declarations in cmd functions
  cpuidle: Simplify cpuidle_register_device() with guard()
  PM / devfreq: tegra30-devfreq: add support for Tegra114
  PM / devfreq: use _visible attribute to replace create/remove_sysfs_files()
  PM / devfreq: Remove unneeded casting for HZ_PER_KHZ
  MAINTAINERS: amd-pstate: Step down as maintainer, add Prateek as reviewer
  cpufreq: Pass the policy to cpufreq_driver->adjust_perf()
  cpufreq/amd-pstate: Pass the policy to amd_pstate_update()
  cpufreq/amd-pstate-ut: Add a unit test for raw EPP
  cpufreq/amd-pstate: Add support for raw EPP writes
  cpufreq/amd-pstate: Add support for platform profile class
  cpufreq/amd-pstate: add kernel command line to override dynamic epp
  cpufreq/amd-pstate: Add dynamic energy performance preference
  Documentation: amd-pstate: fix dead links in the reference section
  cpufreq/amd-pstate: Cache the max frequency in cpudata
  Documentation/amd-pstate: Add documentation for amd_pstate_floor_{freq,count}
  Documentation/amd-pstate: List amd_pstate_prefcore_ranking sysfs file
  Documentation/amd-pstate: List amd_pstate_hw_prefcore sysfs file
  amd-pstate-ut: Add a testcase to validate the visibility of driver attributes
  ...
2026-04-13 19:47:52 -07:00
Thomas Gleixner
ff1c0c5d07 Merge branch 'timers/urgent' into timers/core
to resolve the conflict with urgent fixes.
2026-04-11 07:58:33 +02:00
Tejun Heo
49d78adf95 sched_ext: Drop spurious warning on kick during scheduler disable
kick_cpus_irq_workfn() warns when scx_kick_syncs is NULL, but this can
legitimately happen when a BPF timer or other kick source races with
free_kick_syncs() during scheduler disable. Drop the pr_warn_once() and
add a comment explaining the race.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
2026-04-10 16:38:25 -10:00
Tejun Heo
e719e17d99 sched_ext: Warn on task-based SCX op recursion
The kf_tasks[] design assumes task-based SCX ops don't nest - if they
did, kf_tasks[0] would get clobbered. The old scx_kf_allow() WARN_ONCE
caught invalid nesting via kf_mask, but that machinery is gone now.

Add a WARN_ON_ONCE(current->scx.kf_tasks[0]) at the top of each
SCX_CALL_OP_TASK*() macro. Checking kf_tasks[0] alone is sufficient: all
three variants (SCX_CALL_OP_TASK, SCX_CALL_OP_TASK_RET,
SCX_CALL_OP_2TASKS_RET) write to kf_tasks[0], so a non-NULL value at
entry to any of the three means re-entry from somewhere in the family.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2026-04-10 07:54:06 -10:00
Tejun Heo
979a98b6e9 sched_ext: Rename scx_kf_allowed_on_arg_tasks() to scx_kf_arg_task_ok()
The "kf_allowed" framing on this helper comes from the old runtime
scx_kf_allowed() gate, which has been removed. Rename it to describe what it
actually does in the new model.

Pure rename, no functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2026-04-10 07:54:06 -10:00
Cheng-Yang Chou
7cd9a5d7d4 sched_ext: Remove runtime kfunc mask enforcement
Now that scx_kfunc_context_filter enforces context-sensitive kfunc
restrictions at BPF load time, the per-task runtime enforcement via
scx_kf_mask is redundant. Remove it entirely:

 - Delete enum scx_kf_mask, the kf_mask field on sched_ext_entity, and
   the scx_kf_allow()/scx_kf_disallow()/scx_kf_allowed() helpers along
   with the higher_bits()/highest_bit() helpers they used.
 - Strip the @mask parameter (and the BUILD_BUG_ON checks) from the
   SCX_CALL_OP[_RET]/SCX_CALL_OP_TASK[_RET]/SCX_CALL_OP_2TASKS_RET
   macros and update every call site. Reflow call sites that were
   wrapped only to fit the old 5-arg form and now collapse onto a single
   line under ~100 cols.
 - Remove the in-kfunc scx_kf_allowed() runtime checks from
   scx_dsq_insert_preamble(), scx_dsq_move(), scx_bpf_dispatch_nr_slots(),
   scx_bpf_dispatch_cancel(), scx_bpf_dsq_move_to_local___v2(),
   scx_bpf_sub_dispatch(), scx_bpf_reenqueue_local(), and the per-call
   guard inside select_cpu_from_kfunc().

scx_bpf_task_cgroup() and scx_kf_allowed_on_arg_tasks() were already
cleaned up in the "drop redundant rq-locked check" patch.
scx_kf_allowed_if_unlocked() was rewritten in the preceding "decouple"
patch. No further changes to those helpers here.

Co-developed-by: Juntong Deng <juntong.deng@outlook.com>
Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-10 07:54:06 -10:00
Tejun Heo
d1d3c1c6ae sched_ext: Add verifier-time kfunc context filter
Move enforcement of SCX context-sensitive kfunc restrictions from per-task
runtime kf_mask checks to BPF verifier-time filtering, using the BPF core's
struct_ops context information.

A shared .filter callback is attached to each context-sensitive BTF set
and consults a per-op allow table (scx_kf_allow_flags[]) indexed by SCX
ops member offset. Disallowed calls are now rejected at program load time
instead of at runtime.

The old model split reachability across two places: each SCX_CALL_OP*()
set bits naming its op context, and each kfunc's scx_kf_allowed() check
OR'd together the bits it accepted. A kfunc was callable when those two
masks overlapped. The new model transposes the result to the caller side -
each op's allow flags directly list the kfunc groups it may call. The old
bit assignments were:

  Call-site bits:
    ops.select_cpu   = ENQUEUE | SELECT_CPU
    ops.enqueue      = ENQUEUE
    ops.dispatch     = DISPATCH
    ops.cpu_release  = CPU_RELEASE

  Kfunc-group accepted bits:
    enqueue group     = ENQUEUE | DISPATCH
    select_cpu group  = SELECT_CPU | ENQUEUE
    dispatch group    = DISPATCH
    cpu_release group = CPU_RELEASE

Intersecting them yields the reachability now expressed directly by
scx_kf_allow_flags[]:

  ops.select_cpu  -> SELECT_CPU | ENQUEUE
  ops.enqueue     -> SELECT_CPU | ENQUEUE
  ops.dispatch    -> ENQUEUE | DISPATCH
  ops.cpu_release -> CPU_RELEASE

Unlocked ops carried no kf_mask bits and reached only unlocked kfuncs;
that maps directly to UNLOCKED in the new table.

Equivalence was checked by walking every (op, kfunc-group) combination
across SCX ops, SYSCALL, and non-SCX struct_ops callers against the old
scx_kf_allowed() runtime checks. With two intended exceptions (see below),
all combinations reach the same verdict; disallowed calls are now caught at
load time instead of firing scx_error() at runtime.

scx_bpf_dsq_move_set_slice() and scx_bpf_dsq_move_set_vtime() are
exceptions: they have no runtime check at all, but the new filter rejects
them from ops outside dispatch/unlocked. The affected cases are nonsensical
- the values these setters store are only read by
scx_bpf_dsq_move{,_vtime}(), which is itself restricted to
dispatch/unlocked, so a setter call from anywhere else was already dead
code.

Runtime scx_kf_mask enforcement is left in place by this patch and removed
in a follow-up.

Original-patch-by: Juntong Deng <juntong.deng@outlook.com>
Original-patch-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-10 07:54:06 -10:00
Tejun Heo
2193af26a1 sched_ext: Drop redundant rq-locked check from scx_bpf_task_cgroup()
scx_kf_allowed_on_arg_tasks() runs both an scx_kf_allowed(__SCX_KF_RQ_LOCKED)
mask check and a kf_tasks[] check. After the preceding call-site fixes,
every SCX_CALL_OP_TASK*() invocation has kf_mask & __SCX_KF_RQ_LOCKED
non-zero, so the mask check is redundant whenever the kf_tasks[] check
passes. Drop it and simplify the helper to take only @sch and @p.

Fold the locking guarantee into the SCX_CALL_OP_TASK() comment block, which
scx_bpf_task_cgroup() now points to.

No functional change.

Extracted from a larger verifier-time kfunc context filter patch
originally written by Juntong Deng.

Original-patch-by: Juntong Deng <juntong.deng@outlook.com>
Cc: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-10 07:54:06 -10:00
Tejun Heo
0022b32850 sched_ext: Decouple kfunc unlocked-context check from kf_mask
scx_kf_allowed_if_unlocked() uses !current->scx.kf_mask as a proxy for "no
SCX-tracked lock held". kf_mask is removed in a follow-up patch, so its two
callers - select_cpu_from_kfunc() and scx_dsq_move() - need another basis.

Add a new bool scx_rq.in_select_cpu, set across the SCX_CALL_OP_TASK_RET
that invokes ops.select_cpu(), to capture the one case where SCX itself
holds no lock but try_to_wake_up() holds @p's pi_lock. Together with
scx_locked_rq(), it expresses the same accepted-context set.

select_cpu_from_kfunc() needs a runtime test because it has to take
different locking paths depending on context. Open-code the check as a
three-way branch. The unlocked branch takes raw_spin_lock_irqsave(&p->pi_lock)
directly - pi_lock alone is enough for the fields the kfunc reads, and is
lighter than task_rq_lock().

scx_dsq_move() doesn't really need a runtime test - its accepted contexts
could be enforced at verifier load time. But since the runtime state is
already there and using it keeps the upcoming load-time filter simpler, just
write it the same way: (scx_locked_rq() || in_select_cpu) &&
!kf_allowed(DISPATCH).

scx_kf_allowed_if_unlocked() is deleted with the conversions.

No semantic change.

v2: s/No functional change/No semantic change/ - the unlocked path now acquires
    pi_lock instead of the heavier task_rq_lock() (Andrea Righi).

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-10 07:54:06 -10:00
Tejun Heo
b470e37c1f sched_ext: Fix ops.cgroup_move() invocation kf_mask and rq tracking
sched_move_task() invokes ops.cgroup_move() inside task_rq_lock(tsk), so
@p's rq lock is held. The SCX_CALL_OP_TASK invocation mislabels this:

  - kf_mask = SCX_KF_UNLOCKED (== 0), claiming no lock is held.
  - rq = NULL, so update_locked_rq() doesn't run and scx_locked_rq()
    returns NULL.

Switch to SCX_KF_REST and pass task_rq(p), matching ops.set_cpumask()
from set_cpus_allowed_scx().

Three effects:

  - scx_bpf_task_cgroup() becomes callable (was rejected by
    scx_kf_allowed(__SCX_KF_RQ_LOCKED)). Safe; rq lock is held.

  - scx_bpf_dsq_move() is now rejected (was allowed via the unlocked
    branch). Calling it while holding an unrelated task's rq lock is
    risky; rejection is correct.

  - scx_bpf_select_cpu_*() previously took the unlocked branch in
    select_cpu_from_kfunc() and called task_rq_lock(p, &rf), which
    would deadlock against the already-held pi_lock. Now it takes the
    locked-rq branch and is rejected with -EPERM via the existing
    kf_allowed(SCX_KF_SELECT_CPU | SCX_KF_ENQUEUE) check. Latent
    deadlock fix.

No in-tree scheduler is known to call any of these from ops.cgroup_move().

v2: Add Fixes: tag (Andrea Righi).

Fixes: 18853ba782 ("sched_ext: Track currently locked rq")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-10 07:54:06 -10:00