linux/init
Tejun Heo ebeca1f930 sched_ext: Introduce cgroup sub-sched support
A system often runs multiple workloads, especially in multi-tenant server
environments where it is split into partitions that service separate,
more-or-less independent workloads, each requiring an application-specific
scheduler. To support these and other use cases, sched_ext is in the
process of growing multiple scheduler support.

When partitioning a system in terms of CPUs for such use cases, an
oft-taken approach is hard partitioning the system using cpuset. While it
would be possible to tie sched_ext multiple scheduler support to cpuset
partitions, such an approach would have fundamental limitations stemming
from the lack of dynamism and flexibility.

Users often don't care which specific CPUs are assigned to which workload
and want to take advantage of optimizations which are enabled by running
workloads on a larger machine - e.g. opportunistic over-commit, improving
latency critical workload characteristics while maintaining bandwidth
fairness, employing control mechanisms based on different criteria than
on-CPU time for e.g. flexible memory bandwidth isolation, packing similar
parts from different workloads on the same L3s to improve cache efficiency,
and so on.

As these kinds of dynamic behavior are impossible or difficult to implement
with hard partitioning, sched_ext is implementing cgroup sub-sched support
where schedulers can be attached to the cgroup hierarchy and a parent
scheduler is responsible for controlling the CPUs that each child can use
at any given moment. This makes CPU distribution dynamically controlled by
BPF, allowing high flexibility.

This patch adds the skeletal sched_ext cgroup sub-sched support:

- sched_ext_ops.sub_cgroup_id and .sub_attach/detach() are added. Non-zero
  sub_cgroup_id indicates that the scheduler is to be attached to the
  identified cgroup. A sub-sched is attached to the cgroup iff the nearest
  ancestor scheduler implements .sub_attach() and grants the attachment. Max
  nesting depth is limited by SCX_SUB_MAX_DEPTH.

- When a scheduler exits, all its descendant schedulers are exited
  together. Also, cgroup.scx_sched is added, which points to the effective
  scheduler instance for the cgroup. This is updated on scheduler
  init/exit and inherited on cgroup online. When a cgroup is offlined, the
  attached scheduler is automatically exited.

- Sub-sched support is gated on CONFIG_EXT_SUB_SCHED which is
  automatically enabled if both SCX and cgroups are enabled. Sub-sched
  support is not tied to the CPU controller but rather the cgroup
  hierarchy itself. This is intentional as the support for cpu.weight and
  cpu.max based resource control is orthogonal to sub-sched support. Note
  that the CONFIG_CGROUPS guard around cgroup subtree iteration support for
  scx_task_iter is replaced with CONFIG_EXT_SUB_SCHED for consistency.

- This allows loading sub-scheds, and most framework operations, such as
  propagating disable down the hierarchy, work. However, sub-scheds are not
  operational yet and all tasks stay with the root sched. This will serve
  as the basis for building up full sub-sched support.

- DSQs point to the scx_sched they belong to.

- scx_qmap is updated to allow attachment of sub-scheds and to also serve
  as a sub-sched.

- scx_is_descendant() is added but not yet used in this patch. It is used by
  later changes in the series and placed here as this is where the function
  belongs.
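
The attachment and exit rules listed above can be illustrated with a small
userspace model. This is only a sketch and not the kernel implementation:
every toy_* name, the struct layouts, and the value of TOY_SUB_MAX_DEPTH are
hypothetical stand-ins, and it is an assumption that a scheduler counts as
its own descendant. The model also updates only the cgroup being attached
to, not cgroups that are already online beneath it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for SCX_SUB_MAX_DEPTH; the real value is not given here. */
#define TOY_SUB_MAX_DEPTH 4

struct toy_sched {
	struct toy_sched *parent;	/* nearest ancestor scheduler, NULL for the root */
	int depth;			/* the root scheduler is at depth 0 */
	bool has_sub_attach;		/* models "implements .sub_attach()" */
	bool grants;			/* models .sub_attach() granting the request */
	bool exited;
};

struct toy_cgroup {
	struct toy_cgroup *parent;
	struct toy_sched *sched;	/* models cgroup.scx_sched, the effective scheduler */
};

/* A cgroup coming online inherits its parent's effective scheduler. */
static void toy_cgroup_online(struct toy_cgroup *cgrp, struct toy_cgroup *parent)
{
	cgrp->parent = parent;
	cgrp->sched = parent->sched;
}

/* True if @sch is @ancestor itself or is nested anywhere below it (assumed semantics). */
static bool toy_is_descendant(const struct toy_sched *sch,
			      const struct toy_sched *ancestor)
{
	for (; sch; sch = sch->parent)
		if (sch == ancestor)
			return true;
	return false;
}

/*
 * Attach @sub to @cgrp iff the nearest ancestor scheduler implements
 * sub_attach, grants the request, and the depth limit is not exceeded.
 */
static bool toy_sub_attach(struct toy_sched *sub, struct toy_cgroup *cgrp)
{
	struct toy_sched *anc;

	assert(sub && cgrp);
	anc = cgrp->sched;	/* effective scheduler == nearest ancestor scheduler */

	if (!anc || anc->exited || !anc->has_sub_attach || !anc->grants)
		return false;
	if (anc->depth + 1 > TOY_SUB_MAX_DEPTH)
		return false;

	sub->parent = anc;
	sub->depth = anc->depth + 1;
	sub->exited = false;
	cgrp->sched = sub;
	return true;
}

/* Exiting @sch also exits every scheduler nested below it. */
static void toy_exit(struct toy_sched *sch, struct toy_sched **all, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (toy_is_descendant(all[i], sch))
			all[i]->exited = true;
}
```

In this model, a sub-sched whose nearest ancestor does not implement the
attach hook is refused and the cgroup keeps pointing at the ancestor, while
exiting a mid-level scheduler marks everything below it as exited and leaves
its siblings and the root untouched.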

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06 07:58:03 -10:00
.gitignore kbuild: build init/built-in.a just once 2022-09-29 04:40:15 +09:00
.kunitconfig initramfs_test: kunit tests for initramfs unpacking 2025-03-08 12:13:04 +01:00
calibrate.c calibrate: update header inclusion 2025-11-27 14:24:45 -08:00
do_mounts_initrd.c init: remove /proc/sys/kernel/real-root-dev 2026-01-12 17:22:27 +01:00
do_mounts_rd.c initrd: remove deprecated code path (linuxrc) 2026-01-12 17:22:22 +01:00
do_mounts.c vfs-7.0-rc1.nullfs 2026-02-09 13:41:34 -08:00
do_mounts.h fs: use nullfs unconditionally as the real rootfs 2026-01-14 11:23:39 +01:00
init_task.c Scheduler changes for v7.0: 2026-02-10 12:50:10 -08:00
initramfs_internal.h init: add initramfs_internal.h 2025-03-04 09:52:36 +01:00
initramfs_test.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
initramfs.c Convert 'alloc_flex' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
Kconfig sched_ext: Introduce cgroup sub-sched support 2026-03-06 07:58:03 -10:00
main.c mm.git review status for linus..mm-nonmm-stable 2026-02-12 12:13:01 -08:00
Makefile initramfs_test: kunit tests for initramfs unpacking 2025-03-08 12:13:04 +01:00
noinitramfs.c
version-timestamp.c ns: drop custom reference count initialization for initial namespaces 2025-11-11 10:01:32 +01:00
version.c init/version.c: Replace strlcpy with strscpy 2023-09-22 09:50:56 -07:00