Commit Graph

41 Commits

Author SHA1 Message Date
Tejun Heo
d292aa00de sched_ext: Make bypass LB cpumasks per-scheduler
scx_bypass_lb_{donee,resched}_cpumask were file-scope statics shared by all
scheduler instances. With CONFIG_EXT_SUB_SCHED, multiple sched instances
each arm their own bypass_lb_timer; concurrent bypass_lb_node() calls RMW
the global cpumasks with no lock, corrupting donee/resched decisions.

Move the cpumasks into struct scx_sched, allocate them alongside the timer
in scx_alloc_and_add_sched(), free them in scx_sched_free_rcu_work().
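
A rough sketch of the resulting layout and lifetime (the per-scheduler field
names below are illustrative, not necessarily what the patch uses):

    struct scx_sched {
    	...
    	struct timer_list	bypass_lb_timer;
    	/* were file-scope statics; now one pair per scheduler instance */
    	cpumask_var_t		bypass_lb_donee_cpumask;
    	cpumask_var_t		bypass_lb_resched_cpumask;
    	...
    };

    /* scx_alloc_and_add_sched(): allocate alongside the timer */
    if (!alloc_cpumask_var(&sch->bypass_lb_donee_cpumask, GFP_KERNEL) ||
        !alloc_cpumask_var(&sch->bypass_lb_resched_cpumask, GFP_KERNEL))
    	goto err_free;

    /* scx_sched_free_rcu_work(): free with the rest of the instance */
    free_cpumask_var(sch->bypass_lb_donee_cpumask);
    free_cpumask_var(sch->bypass_lb_resched_cpumask);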

Fixes: 95d1df610c ("sched_ext: Implement load balancer for bypass mode")
Cc: stable@vger.kernel.org # v6.19+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24 14:31:36 -10:00
Tejun Heo
d1d3c1c6ae sched_ext: Add verifier-time kfunc context filter
Move enforcement of SCX context-sensitive kfunc restrictions from per-task
runtime kf_mask checks to BPF verifier-time filtering, using the BPF core's
struct_ops context information.

A shared .filter callback is attached to each context-sensitive BTF set
and consults a per-op allow table (scx_kf_allow_flags[]) indexed by SCX
ops member offset. Disallowed calls are now rejected at program load time
instead of at runtime.
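
For illustration, the allow table and shared filter have roughly the following
shape (flag names follow the existing kf_mask naming; the real .filter hook is
keyed by kfunc BTF ID rather than by group, and scx_prog_op_idx() stands in
for however the implementing ops member is derived from the struct_ops
context):

    /* per-op allow flags, indexed by ops member offset via SCX_OP_IDX() */
    static const u32 scx_kf_allow_flags[] = {
    	[SCX_OP_IDX(select_cpu)]  = SCX_KF_SELECT_CPU | SCX_KF_ENQUEUE,
    	[SCX_OP_IDX(enqueue)]     = SCX_KF_SELECT_CPU | SCX_KF_ENQUEUE,
    	[SCX_OP_IDX(dispatch)]    = SCX_KF_ENQUEUE | SCX_KF_DISPATCH,
    	[SCX_OP_IDX(cpu_release)] = SCX_KF_CPU_RELEASE,
    	/* ops without an entry may only call unlocked kfuncs */
    };

    /* conceptually, the shared filter attached to each BTF set does: */
    static bool scx_kf_allowed_from(const struct bpf_prog *prog, u32 kf_group)
    {
    	return scx_kf_allow_flags[scx_prog_op_idx(prog)] & kf_group;
    }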

The old model split reachability across two places: each SCX_CALL_OP*()
set bits naming its op context, and each kfunc's scx_kf_allowed() check
OR'd together the bits it accepted. A kfunc was callable when those two
masks overlapped. The new model transposes the result to the caller side -
each op's allow flags directly list the kfunc groups it may call. The old
bit assignments were:

  Call-site bits:
    ops.select_cpu   = ENQUEUE | SELECT_CPU
    ops.enqueue      = ENQUEUE
    ops.dispatch     = DISPATCH
    ops.cpu_release  = CPU_RELEASE

  Kfunc-group accepted bits:
    enqueue group     = ENQUEUE | DISPATCH
    select_cpu group  = SELECT_CPU | ENQUEUE
    dispatch group    = DISPATCH
    cpu_release group = CPU_RELEASE

Intersecting them yields the reachability now expressed directly by
scx_kf_allow_flags[]:

  ops.select_cpu  -> SELECT_CPU | ENQUEUE
  ops.enqueue     -> SELECT_CPU | ENQUEUE
  ops.dispatch    -> ENQUEUE | DISPATCH
  ops.cpu_release -> CPU_RELEASE

Unlocked ops carried no kf_mask bits and reached only unlocked kfuncs;
that maps directly to UNLOCKED in the new table.

Equivalence was checked by walking every (op, kfunc-group) combination
across SCX ops, SYSCALL, and non-SCX struct_ops callers against the old
scx_kf_allowed() runtime checks. With two intended exceptions (see below),
all combinations reach the same verdict; disallowed calls are now caught at
load time instead of firing scx_error() at runtime.

scx_bpf_dsq_move_set_slice() and scx_bpf_dsq_move_set_vtime() are
exceptions: they have no runtime check at all, but the new filter rejects
them from ops outside dispatch/unlocked. The affected cases are nonsensical
- the values these setters store are only read by
scx_bpf_dsq_move{,_vtime}(), which is itself restricted to
dispatch/unlocked, so a setter call from anywhere else was already dead
code.

Runtime scx_kf_mask enforcement is left in place by this patch and removed
in a follow-up.

Original-patch-by: Juntong Deng <juntong.deng@outlook.com>
Original-patch-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-10 07:54:06 -10:00
Tejun Heo
0022b32850 sched_ext: Decouple kfunc unlocked-context check from kf_mask
scx_kf_allowed_if_unlocked() uses !current->scx.kf_mask as a proxy for "no
SCX-tracked lock held". kf_mask is removed in a follow-up patch, so its two
callers - select_cpu_from_kfunc() and scx_dsq_move() - need another basis.

Add a new bool scx_rq.in_select_cpu, set across the SCX_CALL_OP_TASK_RET
that invokes ops.select_cpu(), to capture the one case where SCX itself
holds no lock but try_to_wake_up() holds @p's pi_lock. Together with
scx_locked_rq(), it expresses the same accepted-context set.

select_cpu_from_kfunc() needs a runtime test because it has to take
different locking paths depending on context. Open-code as a three-way
branch. The unlocked branch takes raw_spin_lock_irqsave(&p->pi_lock)
directly - pi_lock alone is enough for the fields the kfunc reads, and is
lighter than task_rq_lock().
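
The open-coded branch looks roughly like the following (helper names and the
pick itself are illustrative and simplified):

    if (this_rq()->scx.in_select_cpu) {
    	/* inside ops.select_cpu(): try_to_wake_up() holds @p's pi_lock */
    	cpu = pick_idle_cpu(p, prev_cpu, wake_flags);
    } else if (scx_locked_rq()) {
    	/* an SCX-tracked rq lock is already held */
    	cpu = pick_idle_cpu(p, prev_cpu, wake_flags);
    } else {
    	/*
    	 * Unlocked: pi_lock alone covers the fields being read and is
    	 * lighter than task_rq_lock().
    	 */
    	raw_spin_lock_irqsave(&p->pi_lock, flags);
    	cpu = pick_idle_cpu(p, prev_cpu, wake_flags);
    	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
    }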

scx_dsq_move() doesn't really need a runtime test - its accepted contexts
could be enforced at verifier load time. But since the runtime state is
already there and using it keeps the upcoming load-time filter simpler, just
write it the same way: (scx_locked_rq() || in_select_cpu) &&
!kf_allowed(DISPATCH).

scx_kf_allowed_if_unlocked() is deleted with the conversions.

No semantic change.

v2: s/No functional change/No semantic change/ - the unlocked path now acquires
    pi_lock instead of the heavier task_rq_lock() (Andrea Righi).

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-10 07:54:06 -10:00
Tejun Heo
3229ac4a5e sched_ext: Add SCX_OPS_ALWAYS_ENQ_IMMED ops flag
SCX_ENQ_IMMED makes enqueues to local DSQs succeed only if the task can start
running immediately. Otherwise, the task is re-enqueued through ops.enqueue().
This provides tighter control but requires specifying the flag on every
insertion.

Add SCX_OPS_ALWAYS_ENQ_IMMED ops flag. When set, SCX_ENQ_IMMED is
automatically applied to all local DSQ enqueues including through
scx_bpf_dsq_move_to_local().
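
The effect on the insertion path is essentially:

    /* when inserting into a local DSQ, roughly: */
    if (sch->ops.flags & SCX_OPS_ALWAYS_ENQ_IMMED)
    	enq_flags |= SCX_ENQ_IMMED;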

scx_qmap is updated with a -I option to test the feature and a -F option for
IMMED stress testing, which forces every Nth enqueue to a busy local DSQ.

v2: - Cover scx_bpf_dsq_move_to_local() path (now has enq_flags via ___v2).
    - scx_qmap: Remove sched_switch and cpu_release handlers (superseded by
      kernel-side wakeup_preempt_scx()). Add -F for IMMED stress testing.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-13 09:43:23 -10:00
Tejun Heo
98d709cba3 sched_ext: Implement SCX_ENQ_IMMED
Add SCX_ENQ_IMMED enqueue flag for local DSQ insertions. Once a task is
dispatched with IMMED, it either gets on the CPU immediately and stays on it,
or gets reenqueued back to the BPF scheduler. It will never linger on a local
DSQ behind other tasks or on a CPU taken by a higher-priority class.

rq_is_open() uses rq->next_class to determine whether the rq is available,
and wakeup_preempt_scx() triggers reenqueue when a higher-priority class task
arrives. These capture all higher class preemptions. Combined with reenqueue
points in the dispatch path, all cases where an IMMED task would not execute
immediately are covered.
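
Conceptually, and assuming the usual priority ordering of sched classes,
rq_is_open() amounts to something like (exact comparison details may differ):

    static bool rq_is_open(struct rq *rq)
    {
    	/* open iff no higher-priority class is about to take the CPU */
    	return !sched_class_above(rq->next_class, &ext_sched_class);
    }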

SCX_TASK_IMMED persists in p->scx.flags until the next fresh enqueue, so the
guarantee survives SAVE/RESTORE cycles. If preempted while running,
put_prev_task_scx() reenqueues through ops.enqueue() with
SCX_TASK_REENQ_PREEMPTED instead of silently placing the task back on the
local DSQ.

This enables tighter scheduling latency control by preventing tasks from
piling up on local DSQs. It also enables opportunistic CPU sharing across
sub-schedulers - without this, a sub-scheduler can stuff the local DSQ of a
shared CPU, making it difficult for others to use.

v2: - Rewrite is_curr_done() as rq_is_open() using rq->next_class and
      implement wakeup_preempt_scx() to achieve complete coverage of all
      cases where IMMED tasks could get stranded.
    - Track IMMED persistently in p->scx.flags and reenqueue
      preempted-while-running tasks through ops.enqueue().
    - Bound deferred reenq cycles (SCX_REENQ_LOCAL_MAX_REPEAT).
    - Misc renames, documentation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-13 09:43:22 -10:00
Tejun Heo
f4a6c506d1 sched_ext: Always bounce scx_disable() through irq_work
scx_disable() directly called kthread_queue_work() which can acquire
worker->lock, pi_lock and rq->__lock. This made scx_disable() unsafe to
call while holding locks that conflict with this chain - in particular,
scx_claim_exit() calls scx_disable() for each descendant while holding
scx_sched_lock, which nests inside rq->__lock in scx_bypass().

The error path (scx_vexit()) was already bouncing through irq_work to
avoid this issue. Generalize the pattern to all scx_disable() calls by
always going through irq_work. irq_work_queue() is lockless and safe to
call from any context, and the actual kthread_queue_work() call happens
in the irq_work handler outside any locks.
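
Sketch of the resulting flow (field names follow the existing error_irq_work
path and may not match exactly):

    static void scx_disable(struct scx_sched *sch, enum scx_exit_kind kind)
    {
    	...
    	/* lockless, safe from any context including under rq->__lock */
    	irq_work_queue(&sch->disable_irq_work);
    }

    static void scx_disable_irq_workfn(struct irq_work *irq_work)
    {
    	struct scx_sched *sch = container_of(irq_work, struct scx_sched,
    					     disable_irq_work);

    	/* may take worker->lock, pi_lock, etc. - now outside callers' locks */
    	kthread_queue_work(sch->helper, &sch->disable_work);
    }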

Rename error_irq_work to disable_irq_work to reflect the broader usage.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-10 07:12:21 -10:00
Tejun Heo
b5bc043505 sched_ext: Add scx_dump_lock and dump_disabled
Add a dedicated scx_dump_lock and per-sched dump_disabled flag so that
debug dumping can be safely disabled during sched teardown without
relying on scx_sched_lock. This is a prep for the next patch which
decouples the sysrq dump path from scx_sched_lock to resolve a lock
ordering issue.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-10 07:12:21 -10:00
Zhao Mengmeng
bec10581e9 sched_ext: remove SCX_OPS_HAS_CGROUP_WEIGHT
While running scx_flatcg, dmesg prints "SCX_OPS_HAS_CGROUP_WEIGHT is
deprecated and a noop". In the code, SCX_OPS_HAS_CGROUP_WEIGHT has been
marked as deprecated with removal slated for 6.18. Now it's time to do it.

Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-09 09:45:18 -10:00
Tejun Heo
ce897abc21 sched_ext: Add SCX_TASK_REENQ_REASON flags
SCX_ENQ_REENQ indicates that a task is being re-enqueued but doesn't tell the
BPF scheduler why. Add SCX_TASK_REENQ_REASON flags using bits 12-13 of
p->scx.flags to communicate the reason during ops.enqueue():

- NONE: Not being reenqueued
- KFUNC: Reenqueued by scx_bpf_dsq_reenq() and friends

More reasons will be added.
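
For illustration, the reason field could be encoded as follows (names beyond
NONE/KFUNC are illustrative):

    /* two-bit reenq reason field in p->scx.flags, bits 12-13 */
    #define SCX_TASK_REENQ_REASON_SHIFT	12
    #define SCX_TASK_REENQ_REASON_MASK	(0x3 << SCX_TASK_REENQ_REASON_SHIFT)
    #define SCX_TASK_REENQ_NONE		(0x0 << SCX_TASK_REENQ_REASON_SHIFT)
    #define SCX_TASK_REENQ_KFUNC		(0x1 << SCX_TASK_REENQ_REASON_SHIFT)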

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07 05:29:50 -10:00
Tejun Heo
ffa7ae0724 sched_ext: Add reenq_flags plumbing to scx_bpf_dsq_reenq()
Add infrastructure to pass flags through the deferred reenqueue path.
reenq_local() now takes a reenq_flags parameter, and scx_sched_pcpu gains a
deferred_reenq_local_flags field to accumulate flags from multiple
scx_bpf_dsq_reenq() calls before processing. No flags are defined yet.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07 05:29:49 -10:00
Tejun Heo
0c4df54ad8 sched_ext: Wrap deferred_reenq_local_node into a struct
Wrap the deferred_reenq_local_node list_head into struct
scx_deferred_reenq_local. More fields will be added and this allows using a
shorthand pointer to access them.

No functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07 05:29:49 -10:00
Tejun Heo
8c1b9453fd sched_ext: Convert deferred_reenq_locals from llist to regular list
The deferred reenqueue local mechanism uses an llist (lockless list) for
collecting schedulers that need their local DSQs re-enqueued. Convert to a
regular list protected by a raw_spinlock.

The llist was used for its lockless properties, but the upcoming changes to
support remote reenqueue require more complex list operations that are
difficult to implement correctly with lockless data structures. A spinlock-
protected regular list provides the necessary flexibility.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07 05:29:49 -10:00
Tejun Heo
d4ae868c6b sched_ext: Wrap global DSQs in per-node structure
Global DSQs are currently stored as an array of scx_dispatch_q pointers,
one per NUMA node. To allow adding more per-node data structures, wrap the
global DSQ in scx_sched_pnode and replace global_dsqs with a pnode array.

NUMA-aware allocation is maintained. No functional changes.
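
Sketch of the wrapping (whether the DSQ is embedded or pointed to is an
illustrative detail):

    /* one instance per NUMA node, allocated on that node */
    struct scx_sched_pnode {
    	struct scx_dispatch_q	global_dsq;
    	/* more per-node fields to come */
    };

    struct scx_sched {
    	...
    	struct scx_sched_pnode	**pnode;	/* replaces global_dsqs[] */
    	...
    };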

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07 05:29:49 -10:00
Tejun Heo
25037af712 sched_ext: Add rhashtable lookup for sub-schedulers
Add rhashtable-based lookup for sub-schedulers indexed by cgroup_id to
enable efficient scheduler discovery in preparation for multiple scheduler
support. The hash table allows quick lookup of the appropriate scheduler
instance when processing tasks from different cgroups.

This extends scx_link_sched() to register sub-schedulers in the hash table
and scx_unlink_sched() to remove them. A new scx_find_sub_sched() function
provides the lookup interface.
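
For illustration (member names are assumptions), the table is keyed by the
attachment cgroup ID:

    static const struct rhashtable_params scx_sub_sched_ht_params = {
    	.key_len	= sizeof(u64),			/* cgroup ID */
    	.key_offset	= offsetof(struct scx_sched, cgrp_id),
    	.head_offset	= offsetof(struct scx_sched, hash_node),
    	.automatic_shrinking = true,
    };

    /* caller must be in an RCU read-side section */
    struct scx_sched *scx_find_sub_sched(u64 cgrp_id)
    {
    	return rhashtable_lookup_fast(&scx_sub_sched_ht, &cgrp_id,
    				      scx_sub_sched_ht_params);
    }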

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06 07:58:04 -10:00
Tejun Heo
0d8c551dd5 sched_ext: Make scx_bpf_reenqueue_local() sub-sched aware
scx_bpf_reenqueue_local() currently re-enqueues all tasks on the local DSQ
regardless of which sub-scheduler owns them. With multiple sub-schedulers,
each should only re-enqueue tasks it owns or are owned by its descendants.

Replace the per-rq boolean flag with a lock-free linked list to track
per-scheduler reenqueue requests. Filter tasks in reenq_local() using
hierarchical ownership checks and block deferrals during bypass to prevent
use on dead schedulers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06 07:58:04 -10:00
Tejun Heo
7f5fcd47dd sched_ext: Add scx_sched back pointer to scx_sched_pcpu
Add a back pointer from scx_sched_pcpu to scx_sched. This will be used by
the next patch to make scx_bpf_reenqueue_local() sub-sched aware.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06 07:58:04 -10:00
Tejun Heo
cde94c032b sched_ext: Make watchdog sub-sched aware
Currently, the watchdog checks all tasks as if they are all on scx_root.
Move scx_watchdog_timeout inside scx_sched and make check_rq_for_timeouts()
use the timeout from the scx_sched associated with each task.
refresh_watchdog() is added, which determines the timer interval as half of
the shortest watchdog timeout across all scheds and arms or disarms the timer
as necessary. Every scx_sched instance gets equivalent or better detection
latency while sharing the same timer.
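
Roughly (list and field names are illustrative):

    static void refresh_watchdog(void)
    {
    	unsigned long shortest = ULONG_MAX;
    	struct scx_sched *sch;

    	list_for_each_entry(sch, &scx_sched_list, sched_node)
    		shortest = min(shortest, sch->watchdog_timeout);

    	if (shortest == ULONG_MAX)
    		cancel_delayed_work(&scx_watchdog_work);	/* no scheds */
    	else
    		/* half the shortest timeout keeps detection latency bounded */
    		mod_delayed_work(system_unbound_wq, &scx_watchdog_work,
    				 shortest / 2);
    }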

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06 07:58:04 -10:00
Tejun Heo
34ecfb3551 sched_ext: Move scx_dsp_ctx and scx_dsp_max_batch into scx_sched
scx_dsp_ctx and scx_dsp_max_batch are global variables used in the dispatch
path. In preparation for multiple scheduler support, move the former into
scx_sched_pcpu and the latter into scx_sched. No user-visible behavior
changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06 07:58:04 -10:00
Tejun Heo
025b1bd419 sched_ext: Implement hierarchical bypass mode
When a sub-scheduler enters bypass mode, its tasks must be scheduled by an
ancestor to guarantee forward progress. Tasks from bypassing descendants are
queued in the bypass DSQs of the nearest non-bypassing ancestor, or the root
scheduler if all ancestors are bypassing. This requires coordination between
bypassing schedulers and their hosts.

Add bypass_enq_target_dsq() to find the correct bypass DSQ by walking up the
hierarchy until reaching a non-bypassing ancestor. When a sub-scheduler starts
bypassing, all its runnable tasks are re-enqueued after scx_bypassing() is set,
ensuring proper migration to ancestor bypass DSQs.
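
The hierarchy walk is essentially (the parent linkage name is illustrative):

    static struct scx_dispatch_q *bypass_enq_target_dsq(struct scx_sched *sch,
    						    int cpu)
    {
    	/* climb until a non-bypassing ancestor or the root scheduler */
    	while (sch->parent && scx_bypassing(sch, cpu))
    		sch = sch->parent;

    	return bypass_dsq(sch, cpu);
    }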

Update scx_dispatch_sched() to handle hosting bypassed descendants. When a
scheduler is not bypassing but has bypassing descendants, it must schedule both
its own tasks and bypassed descendant tasks. A simple policy is implemented
where every Nth dispatch attempt (SCX_BYPASS_HOST_NTH=2) consumes from the
bypass DSQ. A fallback consumption is also added at the end of dispatch to
ensure bypassed tasks make progress even when normal scheduling is idle.

Update enable_bypass_dsp() and disable_bypass_dsp() to increment
bypass_dsp_enable_depth on both the bypassing scheduler and its parent host,
ensuring both can detect that bypass dispatch is active through
bypass_dsp_enabled().

Add SCX_EV_SUB_BYPASS_DISPATCH event counter to track scheduling of bypassed
descendant tasks.

v2: Fix comment typos (Andrea).

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06 07:58:04 -10:00
Tejun Heo
aa2a0a1968 sched_ext: Separate bypass dispatch enabling from bypass depth tracking
The bypass_depth field tracks nesting of bypass operations but is also used
to determine whether the bypass dispatch path should be active. With
hierarchical scheduling, child schedulers may need to activate their parent's
bypass dispatch path without affecting the parent's bypass_depth, requiring
separation of these concerns.

Add bypass_dsp_enable_depth and bypass_dsp_claim to independently control
bypass dispatch path activation. The new enable_bypass_dsp() and
disable_bypass_dsp() functions manage this state with proper claim semantics
to prevent races. The bypass dispatch path now only activates when
bypass_dsp_enabled() returns true, which checks the new enable_depth counter.

The disable operation is carefully ordered after all tasks are moved out of
bypass DSQs to ensure they are drained before the dispatch path is disabled.
During scheduler teardown, disable_bypass_dsp() is called explicitly to ensure
cleanup even if bypass mode was never entered normally.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06 07:58:03 -10:00
Tejun Heo
5c8d98a1b4 sched_ext: Move bypass state into scx_sched
In preparation for multiple scheduler support, make bypass state
per-scx_sched. Move scx_bypass_depth, bypass_timestamp and bypass_lb_timer
from globals into scx_sched. Move SCX_RQ_BYPASSING from rq to scx_sched_pcpu
as SCX_SCHED_PCPU_BYPASSING.

scx_bypass() now takes @sch and scx_rq_bypassing(rq) is replaced with
scx_bypassing(sch, cpu). All callers updated.

scx_bypassed_for_enable existed to balance the global scx_bypass_depth when
enable failed. Now that bypass_depth is per-scheduler, the counter is
destroyed along with the scheduler on enable failure. Remove
scx_bypassed_for_enable.

As all tasks currently use the root scheduler, there's no observable behavior
change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06 07:58:03 -10:00
Tejun Heo
ff06f727a9 sched_ext: Move bypass_dsq into scx_sched_pcpu
To support bypass mode for sub-schedulers, move bypass_dsq from struct scx_rq
to struct scx_sched_pcpu. Add bypass_dsq() helper. Move bypass_dsq
initialization from init_sched_ext_class() to scx_alloc_and_attach_sched().
bypass_lb_cpu() now takes a CPU number instead of rq pointer. All callers
updated. No behavior change as all tasks use the root scheduler.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06 07:58:03 -10:00
Tejun Heo
c1743da43c sched_ext: Move aborting flag to per-scheduler field
The abort state was tracked in the global scx_aborting flag which was used to
break out of potential live-lock scenarios when an error occurs. With
hierarchical scheduling, each scheduler instance must track its own abort
state independently so that an aborting scheduler doesn't interfere with
others.

Move the aborting flag into struct scx_sched and update all access sites. The
early initialization check in scx_root_enable() that warned about residual
aborting state is no longer needed as each scheduler instance now starts with
a clean state.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06 07:58:03 -10:00
Tejun Heo
e1cccf365e sched_ext: Move default slice to per-scheduler field
The default time slice was stored in the global scx_slice_dfl variable which
was dynamically modified when entering and exiting bypass mode. With
hierarchical scheduling, each scheduler instance needs its own default slice
configuration so that bypass operations on one scheduler don't affect others.

Move slice_dfl into struct scx_sched and update all access sites. The bypass
logic now modifies the root scheduler's slice_dfl. At task initialization in
init_scx_entity(), use the SCX_SLICE_DFL constant directly since the task may
not yet be associated with a specific scheduler.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06 07:58:03 -10:00
Tejun Heo
a5fa0708cb sched_ext: Enforce scheduling authority in dispatch and select_cpu operations
Add checks to enforce scheduling authority boundaries when multiple
schedulers are present:

1. In scx_dsq_insert_preamble() and the dispatch retry path, ignore attempts
   to insert tasks that the scheduler doesn't own, counting them via
   SCX_EV_INSERT_NOT_OWNED. As BPF schedulers are allowed to ignore
   dequeues, such attempts can occur legitimately during sub-scheduler
   enabling when tasks move between schedulers. The counter helps distinguish
   normal cases from scheduler bugs.

2. For scx_bpf_dsq_insert_vtime() and scx_bpf_select_cpu_and(), error out
   when sub-schedulers are attached. These functions lack the aux__prog
   parameter needed to identify the calling scheduler, so they cannot be used
   safely with multiple schedulers. BPF programs should use the arg-wrapped
   versions (__scx_bpf_dsq_insert_vtime() and __scx_bpf_select_cpu_and())
   instead.

These checks ensure that with multiple concurrent schedulers, scheduler
identity can be properly determined and unauthorized task operations are
prevented or tracked.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06 07:58:03 -10:00
Tejun Heo
105dcd005b sched_ext: Introduce scx_prog_sched()
In preparation for multiple scheduler support, introduce scx_prog_sched()
accessor which returns the scx_sched instance associated with a BPF program.
The association is determined via the special KF_IMPLICIT_ARGS kfunc
parameter, which provides access to bpf_prog_aux. This aux can be used to
retrieve the struct_ops (sched_ext_ops) that the program is associated with,
and from there, the corresponding scx_sched instance.

For compatibility, when ops.sub_attach is not implemented (older schedulers
without sub-scheduler support), unassociated programs fall back to scx_root.
A warning is logged once per scheduler for such programs.

As scx_root is still the only scheduler, this shouldn't introduce
user-visible behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06 07:58:03 -10:00
Tejun Heo
88234b075c sched_ext: Introduce scx_task_sched[_rcu]()
In preparation for multiple scheduler support, add p->scx.sched, which points
to the scx_sched instance that the task is scheduled by, which is currently
always scx_root. Add scx_task_sched[_rcu]() accessors which return the
associated scx_sched of the specified task and replace the raw scx_root
dereferences with it where applicable. scx_task_on_sched() is also added to
test whether a given task is on the specified sched.

As scx_root is still the only scheduler, this shouldn't introduce
user-visible behavior changes.
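
The accessors are straightforward (sketch; the real ones may carry additional
lockdep conditions):

    static struct scx_sched *scx_task_sched(struct task_struct *p)
    {
    	return p->scx.sched;		/* currently always scx_root */
    }

    static struct scx_sched *scx_task_sched_rcu(struct task_struct *p)
    {
    	return rcu_dereference(p->scx.sched);
    }

    static bool scx_task_on_sched(struct task_struct *p, struct scx_sched *sch)
    {
    	return scx_task_sched(p) == sch;
    }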

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06 07:58:03 -10:00
Tejun Heo
ebeca1f930 sched_ext: Introduce cgroup sub-sched support
A system often runs multiple workloads, especially in multi-tenant server
environments where the system is split into partitions servicing separate,
more-or-less independent workloads, each requiring an application-specific
scheduler. To support these and other use cases, sched_ext is in the process
of growing multiple scheduler support.

When partitioning a system in terms of CPUs for such use cases, an
oft-taken approach is hard partitioning the system using cpuset. While it
would be possible to tie sched_ext multiple scheduler support to cpuset
partitions, such an approach would have fundamental limitations stemming
from the lack of dynamism and flexibility.

Users often don't care which specific CPUs are assigned to which workload
and want to take advantage of optimizations which are enabled by running
workloads on a larger machine - e.g. opportunistic over-commit, improving
latency critical workload characteristics while maintaining bandwidth
fairness, employing control mechanisms based on different criteria than
on-CPU time for e.g. flexible memory bandwidth isolation, packing similar
parts from different workloads on same L3s to improve cache efficiency,
and so on.

As this sort of dynamic behavior is impossible or difficult to implement
with hard partitioning, sched_ext is implementing cgroup sub-sched support
where schedulers can be attached to the cgroup hierarchy and a parent
scheduler is responsible for controlling the CPUs that each child can use
at any given moment. This makes CPU distribution dynamically controlled by
BPF, allowing high flexibility.

This patch adds the skeletal sched_ext cgroup sub-sched support:

- sched_ext_ops.sub_cgroup_id and .sub_attach/detach() are added. Non-zero
  sub_cgroup_id indicates that the scheduler is to be attached to the
  identified cgroup. A sub-sched is attached to the cgroup iff the nearest
  ancestor scheduler implements .sub_attach() and grants the attachment. Max
  nesting depth is limited by SCX_SUB_MAX_DEPTH.

- When a scheduler exits, all its descendant schedulers are exited
  together. Also, cgroup.scx_sched added which points to the effective
  scheduler instance for the cgroup. This is updated on scheduler
  init/exit and inherited on cgroup online. When a cgroup is offlined, the
  attached scheduler is automatically exited.

- Sub-sched support is gated on CONFIG_EXT_SUB_SCHED which is
  automatically enabled if both SCX and cgroups are enabled. Sub-sched
  support is not tied to the CPU controller but rather the cgroup
  hierarchy itself. This is intentional as the support for cpu.weight and
  cpu.max based resource control is orthogonal to sub-sched support. Note
  that CONFIG_CGROUPS around cgroup subtree iteration support for
  scx_task_iter is replaced with CONFIG_EXT_SUB_SCHED for consistency.

- This allows loading sub-scheds and most framework operations such as
  propagating disable down the hierarchy work. However, sub-scheds are not
  operational yet and all tasks stay with the root sched. This will serve
  as the basis for building up full sub-sched support.

- DSQs point to the scx_sched they belong to.

- scx_qmap is updated to allow attachment of sub-scheds and also serving
  as sub-scheds.

- scx_is_descendant() is added but not yet used in this patch. It is used by
  later changes in the series and placed here as this is where the function
  belongs.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06 07:58:03 -10:00
Tejun Heo
32e940f2bd Merge branch 'for-7.0-fixes' into for-7.1
To prepare for the hierarchical scheduling patchset, which would otherwise
cause multiple conflicts.

Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-06 07:46:32 -10:00
Andrea Righi
70f54f61a3 sched_ext: Document task ownership state machine
The task ownership state machine in sched_ext is quite hard to follow
from the code alone. The interaction of ownership states, memory
ordering rules and cross-CPU "lock dancing" makes the overall model
subtle.

Extend the documentation next to scx_ops_state to provide a more
structured and self-contained description of the state transitions and
their synchronization rules.

The new reference should make the code easier to reason about and
maintain and can help future contributors understand the overall
task-ownership workflow.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-05 06:21:06 -10:00
David Carlier
749989b2d9 sched_ext: Fix SCX_EFLAG_INITIALIZED being a no-op flag
SCX_EFLAG_INITIALIZED is the sole member of enum scx_exit_flags with no
explicit value, so the compiler assigns it 0. This makes the bitwise OR
in scx_ops_init() a no-op:

    sch->exit_info->flags |= SCX_EFLAG_INITIALIZED; /* |= 0 */

As a result, BPF schedulers cannot distinguish whether ops.init()
completed successfully by inspecting exit_info->flags.

Assign the value 1LLU << 0 so the flag is actually set.
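
That is:

    enum scx_exit_flags {
    	/* set iff ops.init() completed successfully */
    	SCX_EFLAG_INITIALIZED	= 1LLU << 0,
    };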

Fixes: f3aec2adce ("sched_ext: Add SCX_EFLAG_INITIALIZED to indicate successful ops.init()")
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-26 12:03:24 -10:00
Andrea Righi
ebf1ccff79 sched_ext: Fix ops.dequeue() semantics
Currently, ops.dequeue() is only invoked when the sched_ext core knows
that a task resides in BPF-managed data structures, which causes it to
miss scheduling property change events. In addition, ops.dequeue()
callbacks are completely skipped when tasks are dispatched to non-local
DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
track task state.

Fix this by guaranteeing that each task entering the BPF scheduler's
custody triggers exactly one ops.dequeue() call when it leaves that
custody, whether the exit is due to a dispatch (regular or via a core
scheduling pick) or to a scheduling property change (e.g.
sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
balancing, etc.).

BPF scheduler custody concept: a task is considered to be in the BPF
scheduler's custody when the scheduler is responsible for managing its
lifecycle. This includes tasks dispatched to user-created DSQs or stored
in the BPF scheduler's internal data structures from ops.enqueue().
Custody ends when the task is dispatched to a terminal DSQ (such as the
local DSQ or %SCX_DSQ_GLOBAL), selected by core scheduling, or removed
due to a property change.

Tasks directly dispatched to terminal DSQs bypass the BPF scheduler
entirely and are never in its custody. Terminal DSQs include:
 - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
   where tasks go directly to execution.
 - Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the
   BPF scheduler is considered "done" with the task.

As a result, ops.dequeue() is not invoked for tasks directly dispatched
to terminal DSQs.

To identify dequeues triggered by scheduling property changes, introduce
the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
the dequeue was caused by a scheduling property change.

New ops.dequeue() semantics:
 - ops.dequeue() is invoked exactly once when the task leaves the BPF
   scheduler's custody, in one of the following cases:
   a) regular dispatch: a task dispatched to a user DSQ or stored in
      internal BPF data structures is moved to a terminal DSQ
      (ops.dequeue() called without any special flags set),
   b) core scheduling dispatch: core-sched picks task before dispatch
      (ops.dequeue() called with %SCX_DEQ_CORE_SCHED_EXEC flag set),
   c) property change: task properties modified before dispatch,
      (ops.dequeue() called with %SCX_DEQ_SCHED_CHANGE flag set).

This allows BPF schedulers to:
 - reliably track task ownership and lifecycle,
 - maintain accurate accounting of managed tasks,
 - update internal state when tasks change properties.

Cc: Tejun Heo <tj@kernel.org>
Cc: Emil Tsalapatis <emil@etsalapatis.com>
Cc: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-23 10:01:18 -10:00
Tejun Heo
95d1df610c sched_ext: Implement load balancer for bypass mode
In bypass mode, tasks are queued on per-CPU bypass DSQs. While this works well
in most cases, there is a failure mode where a BPF scheduler can skew task
placement severely before triggering bypass in highly over-saturated systems.
If most tasks end up concentrated on a few CPUs, those CPUs can accumulate
queues that are too long to drain in a reasonable time, leading to RCU stalls
and hung tasks.

Implement a simple timer-based load balancer that redistributes tasks across
CPUs within each NUMA node. The balancer runs periodically (default 500ms,
tunable via bypass_lb_intv_us module parameter) and moves tasks from overloaded
CPUs to underloaded ones.

When moving tasks between bypass DSQs, the load balancer holds nested DSQ locks
to avoid dropping and reacquiring the donor DSQ lock on each iteration, as
donor DSQs can be very long and highly contended. Add the SCX_ENQ_NESTED flag
and use raw_spin_lock_nested() in dispatch_enqueue() to support this. The load
balancer timer function reads scx_bypass_depth locklessly to check whether
bypass mode is active. Use WRITE_ONCE() when updating scx_bypass_depth to pair
with the READ_ONCE() in the timer function.
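
Sketch of the nested locking and the lockless depth check (the task selection
loop is omitted):

    raw_spin_lock(&donor_dsq->lock);
    /* for each task picked to move off the overloaded CPU: */
    	/*
    	 * SCX_ENQ_NESTED makes dispatch_enqueue() acquire the donee DSQ
    	 * lock with raw_spin_lock_nested(), so the long, contended donor
    	 * lock never needs to be dropped and reacquired per task.
    	 */
    	dispatch_enqueue(sch, donee_dsq, p, enq_flags | SCX_ENQ_NESTED);
    raw_spin_unlock(&donor_dsq->lock);

    /* in the timer function, paired with WRITE_ONCE() at the update sites: */
    if (!READ_ONCE(scx_bypass_depth))
    	return;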

This has been tested on a 192 CPU dual socket AMD EPYC machine with ~20k
runnable tasks running scx_cpu0. As scx_cpu0 queues all tasks to CPU0, almost
all tasks end up on CPU0 creating severe imbalance. Without the load balancer,
disabling the scheduler can lead to RCU stalls and hung tasks, taking a very
long time to complete. With the load balancer, disable completes in about a
second.

The load balancing operation can be monitored using the sched_ext_bypass_lb
tracepoint and disabled by setting bypass_lb_intv_us to 0.

v2: Lock both rq and DSQ in bypass_lb_cpu() and use dispatch_dequeue_locked()
    to prevent races with dispatch_dequeue() (Andrea Righi).

Cc: Andrea Righi <arighi@nvidia.com>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Emil Tsalapatis <etsal@meta.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12 06:43:44 -10:00
Tejun Heo
5a629ecbcd sched_ext: Mark racy bitfields to prevent adding fields that can't tolerate races
The warned bitfields in struct scx_sched are updated racily from concurrent
CPUs causing RMW races, which is fine for these boolean warning flags. Add a
comment marking this area to prevent future fields that can't tolerate racy
updates from being added here.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-05 12:07:09 -10:00
Tejun Heo
a379fa1e2c sched_ext: Fix SCX_KICK_WAIT to work reliably
SCX_KICK_WAIT is used to synchronously wait for the target CPU to complete
a reschedule and can be used to implement operations like core scheduling.

This used to be implemented by scx_next_task_picked() incrementing pnt_seq,
which was always called when a CPU picks the next task to run, allowing
SCX_KICK_WAIT to reliably wait for the target CPU to enter the scheduler and
pick the next task.

However, commit b999e365c2 ("sched_ext: Replace scx_next_task_picked()
with switch_class()") replaced scx_next_task_picked() with the
switch_class() callback, which is only called when switching between sched
classes. This broke SCX_KICK_WAIT because pnt_seq would no longer be
reliably incremented unless the previous task was SCX and the next task was
not.

This fix leverages commit 4c95380701 ("sched/ext: Fold balance_scx() into
pick_task_scx()") which refactored the pick path making put_prev_task_scx()
the natural place to track task switches for SCX_KICK_WAIT. The fix moves
pnt_seq increment to put_prev_task_scx() and also increments it in
pick_task_scx() to handle cases where the same task is re-selected, whether
by BPF scheduler decision or slice refill. The semantics: If the current
task on the target CPU is SCX, SCX_KICK_WAIT waits until the CPU enters the
scheduling path. This provides sufficient guarantee for use cases like core
scheduling while keeping the operation self-contained within SCX.

v2: - Also increment pnt_seq in pick_task_scx() to handle same-task
      re-selection (Andrea Righi).
    - Use smp_cond_load_acquire() for the busy-wait loop for better
      architecture optimization (Peter Zijlstra).

Reported-by: Wen-Fang Liu <liuwenfang@honor.com>
Link: http://lkml.kernel.org/r/228ebd9e6ed3437996dffe15735a9caa@honor.com
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-22 11:42:14 -10:00
zhidao su
347ed2d566 sched/ext: Implement cgroup_set_idle() callback
Implement the missing cgroup_set_idle() callback that was marked as a
TODO. This allows BPF schedulers to be notified when a cgroup's idle
state changes, enabling them to adjust their scheduling behavior
accordingly.

The implementation follows the same pattern as other cgroup callbacks
like cgroup_set_weight() and cgroup_set_bandwidth(). It checks if the
BPF scheduler has implemented the callback and invokes it with the
appropriate parameters.

Fixes a spelling error in the cgroup_set_bandwidth() documentation.

tj: s/scx_cgroup_rwsem/scx_cgroup_ops_rwsem/ to fix build breakage.

Signed-off-by: zhidao su <soolaugust@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-14 10:17:33 -10:00
Tejun Heo
f3aec2adce sched_ext: Add SCX_EFLAG_INITIALIZED to indicate successful ops.init()
ops.exit() may be called even if the loading failed before ops.init()
finishes successfully. This is because ops.exit() allows rich exit info
communication. Add SCX_EFLAG_INITIALIZED flag to scx_exit_info.flags to
indicate whether ops.init() finished successfully.

This enables BPF schedulers to distinguish between exit scenarios and
handle cleanup appropriately based on initialization state.

Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23 09:03:26 -10:00
Tejun Heo
c7e739746d sched_ext: Use bitfields for boolean warning flags
Convert warned_zero_slice and warned_deprecated_rq in scx_sched struct to
single-bit bitfields. While this doesn't reduce struct size immediately,
it prepares for future bitfield additions.

v2: Update patch description.

Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23 09:03:26 -10:00
Christian Loehle
5c48d88fe0 sched_ext: deprecation warn for scx_bpf_cpu_rq()
scx_bpf_cpu_rq() works on an unlocked rq which generally isn't safe.
For the common use-cases scx_bpf_locked_rq() and
scx_bpf_cpu_curr() work, so add a deprecation warning
to scx_bpf_cpu_rq() so it can eventually be removed.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-03 11:51:57 -10:00
Tejun Heo
bcb7c23056 sched_ext: Put event_stats_cpu in struct scx_sched_pcpu
scx_sched.event_stats_cpu holds the percpu counters that are used to track
stats. Introduce struct scx_sched_pcpu and move the counters inside. This
will ease adding more per-cpu fields. No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-09-03 11:33:28 -10:00
Tejun Heo
0c2b8356e4 sched_ext: Move internal type and accessor definitions to ext_internal.h
There currently isn't a place to place SCX-internal types and accessors to
be shared between ext.c and ext_idle.c. Create kernel/sched/ext_internal.h
and move internal type and accessor definitions there. This trims ext.c a
bit and makes future additions easier. Pure code reorganization. No
functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-09-03 11:33:28 -10:00