linux/kernel
Tejun Heo ebeca1f930 sched_ext: Introduce cgroup sub-sched support
A system often runs multiple workloads especially in multi-tenant server
environments where a system is split into partitions servicing separate
more-or-less independent workloads each requiring an application-specific
scheduler. To support such and other use cases, sched_ext is in the process
of growing multiple scheduler support.

When partitioning a system in terms of CPUs for such use cases, an
oft-taken approach is hard partitioning the system using cpuset. While it
would be possible to tie sched_ext multiple scheduler support to cpuset
partitions, such an approach would have fundamental limitations stemming
from the lack of dynamism and flexibility.

Users often don't care which specific CPUs are assigned to which workload
and want to take advantage of optimizations which are enabled by running
workloads on a larger machine - e.g. opportunistic over-commit, improving
latency critical workload characteristics while maintaining bandwidth
fairness, employing control mechanisms based on different criteria than
on-CPU time for e.g. flexible memory bandwidth isolation, packing similar
parts from different workloads on same L3s to improve cache efficiency,
and so on.

As this sort of dynamic behaviors are impossible or difficult to implement
with hard partitioning, sched_ext is implementing cgroup sub-sched support
where schedulers can be attached to the cgroup hierarchy and a parent
scheduler is responsible for controlling the CPUs that each child can use
at any given moment. This makes CPU distribution dynamically controlled by
BPF allowing high flexibility.

This patch adds the skeletal sched_ext cgroup sub-sched support:

- sched_ext_ops.sub_cgroup_id and .sub_attach/detach() are added. Non-zero
  sub_cgroup_id indicates that the scheduler is to be attached to the
  identified cgroup. A sub-sched is attached to the cgroup iff the nearest
  ancestor scheduler implements .sub_attach() and grants the attachment. Max
  nesting depth is limited by SCX_SUB_MAX_DEPTH.

- When a scheduler exits, all its descendant schedulers are exited
  together. Also, cgroup.scx_sched added which points to the effective
  scheduler instance for the cgroup. This is updated on scheduler
  init/exit and inherited on cgroup online. When a cgroup is offlined, the
  attached scheduler is automatically exited.

- Sub-sched support is gated on CONFIG_EXT_SUB_SCHED which is
  automatically enabled if both SCX and cgroups are enabled. Sub-sched
  support is not tied to the CPU controller but rather the cgroup
  hierarchy itself. This is intentional as the support for cpu.weight and
  cpu.max based resource control is orthogonal to sub-sched support. Note
  that CONFIG_CGROUPS around cgroup subtree iteration support for
  scx_task_iter is replaced with CONFIG_EXT_SUB_SCHED for consistency.

- This allows loading sub-scheds and most framework operations such as
  propagating disable down the hierarchy work. However, sub-scheds are not
  operational yet and all tasks stay with the root sched. This will serve
  as the basis for building up full sub-sched support.

- DSQs point to the scx_sched they belong to.

- scx_qmap is updated to allow attachment of sub-scheds and also serving
  as sub-scheds.

- scx_is_descendant() is added but not yet used in this patch. It is used by
  later changes in the series and placed here as this is where the function
  belongs.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06 07:58:03 -10:00
..
bpf bpf: Improve bounds when tnum has a single possible value 2026-02-27 16:11:50 -08:00
cgroup cgroup: Expose some cgroup helpers 2026-03-05 18:15:58 -10:00
configs Remove WARN_ALL_UNSEEDED_RANDOM kernel config option 2026-02-23 11:18:48 -08:00
debug treewide: Replace kmalloc with kmalloc_obj for non-scalar types 2026-02-21 01:02:28 -08:00
dma dma-mapping: avoid random addr value print out on error path 2026-02-23 08:26:54 +01:00
entry Merge branch 'core/entry' into sched/core 2026-01-30 15:40:05 +01:00
events perf: Fix __perf_event_overflow() vs perf_remove_from_context() race 2026-02-25 15:02:34 +01:00
futex Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
gcov Convert more 'alloc_obj' cases to default GFP_KERNEL arguments 2026-02-21 20:03:00 -08:00
irq Convert 'alloc_flex' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
kcsan kcsan: test: Adjust "expect" allocation type for kmalloc_obj 2026-02-26 09:54:08 -08:00
livepatch Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
liveupdate liveupdate: luo_file: remember retrieve() status 2026-02-24 11:13:26 -08:00
locking Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL uses 2026-02-22 08:26:33 -08:00
module module: Fix kernel panic when a symbol st_shndx is out of bounds 2026-02-23 19:37:28 +00:00
power Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
printk Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
rcu Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL uses 2026-02-22 08:26:33 -08:00
sched sched_ext: Introduce cgroup sub-sched support 2026-03-06 07:58:03 -10:00
time Summary 2026-03-04 08:21:11 -08:00
trace tracing: Fix WARN_ON in tracing_buffers_mmap_close 2026-03-03 22:25:32 -05:00
unwind Convert more 'alloc_obj' cases to default GFP_KERNEL arguments 2026-02-21 20:03:00 -08:00
.gitignore
acct.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
async.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
audit_fsnotify.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
audit_tree.c Convert 'alloc_flex' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
audit_watch.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
audit.c Convert 'alloc_flex' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
audit.h audit: fix comment misindentation in audit.h 2025-10-22 19:28:06 -04:00
auditfilter.c Convert 'alloc_flex' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
auditsc.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
backtracetest.c
bounds.c x86/asm: Remove ANNOTATE_DATA_SPECIAL usage 2025-12-03 16:53:19 +01:00
capability.c
cfi.c
compat.c
configs.c
context_tracking.c context_tracking: Remove rcu_task_trace_heavyweight_{enter,exit}() 2026-01-01 16:39:46 +08:00
cpu_pm.c syscore: Pass context data to callbacks 2025-11-14 10:01:52 +01:00
cpu.c SPDX updates for 7.0-rc1 2026-02-17 09:46:03 -08:00
crash_core_test.c
crash_core.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
crash_dump_dm_crypt.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
crash_reserve.c crash: let architecture decide crash memory export to iomem_resource 2025-11-12 10:00:15 -08:00
cred.c cred: remove unused set_security_override_from_ctx() 2026-01-06 20:52:57 -05:00
delayacct.c delayacct: fix uapi timespec64 definition 2026-02-08 00:13:32 -08:00
dma.c
elfcorehdr.c
exec_domain.c
exit.c kthread: consolidate kthread exit paths to prevent use-after-free 2026-02-26 10:45:49 +01:00
exit.h
extable.c
fail_function.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
fork.c sched/core: Swap the order between sched_post_fork() and cgroup_post_fork() 2026-03-06 07:58:02 -10:00
freezer.c freezer: Clarify that only cgroup1 freezer uses PM freezer 2025-10-30 20:10:27 +01:00
gen_kheaders.sh
groups.c treewide: Replace kmalloc with kmalloc_obj for non-scalar types 2026-02-21 01:02:28 -08:00
hung_task.c hung_task: add hung_task_sys_info sysctl to dump sys info on task-hung 2025-11-20 14:03:43 -08:00
iomem.c
irq_work.c
jump_label.c
kallsyms_internal.h kallsyms: Get rid of kallsyms relative base 2026-01-22 15:58:22 -07:00
kallsyms_selftest.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
kallsyms_selftest.h
kallsyms.c mm.git review status for linus..mm-nonmm-stable 2026-02-12 12:13:01 -08:00
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.kexec liveupdate: kho: move to kernel/liveupdate 2025-11-27 14:24:33 -08:00
Kconfig.locks
Kconfig.preempt sched: Further restrict the preemption modes 2026-01-08 12:43:57 +01:00
kcov.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
kexec_core.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
kexec_elf.c
kexec_file.c kexec: derive purgatory entry from symbol 2026-01-31 16:16:07 -08:00
kexec_internal.h
kexec.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
kheaders.c
kprobes.c Convert 'alloc_flex' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
kstack_erase.c sysctl: remove __user qualifier from stack_erasing_sysctl buffer argument 2025-11-27 15:44:53 +01:00
ksyms_common.c
ksysfs.c kexec: move sysfs entries to /sys/kernel/kexec 2025-11-27 14:24:42 -08:00
kthread.c kthread: consolidate kthread exit paths to prevent use-after-free 2026-02-26 10:45:49 +01:00
latencytop.c
Makefile kcov: Enable context analysis 2026-01-05 16:43:34 +01:00
module_signature.c
notifier.c
nscommon.c nsfs: tighten permission checks for ns iteration ioctls 2026-02-27 22:00:08 +01:00
nsproxy.c nsproxy: fix free_nsproxy() and simplify create_new_namespaces() 2025-11-14 13:10:38 +01:00
nstree.c nstree: tighten permission checks for listing 2026-02-27 22:00:11 +01:00
padata.c Convert more 'alloc_obj' cases to default GFP_KERNEL arguments 2026-02-21 20:03:00 -08:00
panic.c panic: add panic_force_cpu= parameter to redirect panic to a specific CPU 2026-02-03 08:21:26 -08:00
params.c Convert more 'alloc_obj' cases to default GFP_KERNEL arguments 2026-02-21 20:03:00 -08:00
pid_namespace.c pid: rely on common reference count behavior 2025-11-11 10:01:32 +01:00
pid_sysctl.h
pid.c Revert "pid: make __task_pid_nr_ns(ns => NULL) safe for zombie callers" 2026-02-10 11:39:30 +01:00
profile.c
ptrace.c rseq: Introduce struct rseq_data 2025-11-04 08:30:50 +01:00
range.c
reboot.c treewide: Replace kmalloc with kmalloc_obj for non-scalar types 2026-02-21 01:02:28 -08:00
regset.c
relay.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
resource_kunit.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
resource.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
rseq.c rseq: slice ext: Ensure rseq feature size differs from original rseq size 2026-02-23 11:19:19 +01:00
scftorture.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
scs.c scs: fix a wrong parameter in __scs_magic 2025-11-12 10:00:13 -08:00
seccomp.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
signal.c compiler-context-analysis: Remove __cond_lock() function-like helper 2026-01-05 16:43:33 +01:00
smp.c smp: Introduce a helper function to check for pending IPIs 2025-11-19 18:06:50 +01:00
smpboot.c
smpboot.h
softirq.c
stacktrace.c
static_call_inline.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
static_call.c
stop_machine.c
sys_ni.c rseq: Implement sys_rseq_slice_yield() 2026-01-22 11:11:17 +01:00
sys.c RISC-V updates for v7.0 2026-02-12 19:17:44 -08:00
sysctl-test.c
sysctl.c sysctl: replace SYSCTL_INT_CONV_CUSTOM macro with functions 2026-01-06 11:27:10 +01:00
task_work.c task_work: Fix NMI race condition 2025-10-29 10:29:54 +01:00
taskstats.c
torture.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
tracepoint.c Convert 'alloc_flex' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
tsacct.c tsacct: skip all kernel threads 2026-01-26 19:07:13 -08:00
ucount.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
uid16.c
uid16.h
umh.c treewide: Replace kmalloc with kmalloc_obj for non-scalar types 2026-02-21 01:02:28 -08:00
up.c
user_namespace.c Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL uses 2026-02-22 08:26:33 -08:00
user-return-notifier.c
user.c ns: drop custom reference count initialization for initial namespaces 2025-11-11 10:01:32 +01:00
utsname_sysctl.c
utsname.c namespace-6.18-rc1 2025-09-29 11:20:29 -07:00
vhost_task.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
vmcore_info.c mm.git review status for linus..mm-nonmm-stable 2026-02-12 12:13:01 -08:00
watch_queue.c Convert 'alloc_flex' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
watchdog_buddy.c
watchdog_perf.c watchdog/hardlockup: simplify perf event probe and remove per-cpu dependency 2026-02-08 00:13:35 -08:00
watchdog.c watchdog/softlockup: fix sample ring index wrap in need_counting_irqs() 2026-02-08 00:13:34 -08:00
workqueue_internal.h
workqueue.c Convert 'alloc_flex' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00