Commit Graph

11 Commits

Author SHA1 Message Date
Cheng-Yang Chou
2d2b026c3e sched_ext: Deny SCX kfuncs to non-SCX struct_ops programs
scx_kfunc_context_filter() currently allows non-SCX struct_ops programs
(e.g. tcp_congestion_ops) to call SCX unlocked kfuncs. This is wrong
for two reasons:

- It is semantically incorrect: a TCP congestion control program has no
  business calling SCX kfuncs such as scx_bpf_kick_cpu().

- With CONFIG_EXT_SUB_SCHED=y, kfuncs like scx_bpf_kick_cpu() call
  scx_prog_sched(aux), which invokes bpf_prog_get_assoc_struct_ops(aux)
  and casts the result to struct sched_ext_ops * before reading ops->priv.
  For a non-SCX struct_ops program the returned pointer is the kdata of
  that struct_ops type, which is far smaller than sched_ext_ops, making
  the read an out-of-bounds access (confirmed with KASAN).

Extend the filter to cover scx_kfunc_set_any and scx_kfunc_set_idle as
well, and deny all SCX kfuncs for any struct_ops program that is not the
SCX struct_ops. This addresses both issues: the semantic contract is
enforced at the verifier level, and the runtime out-of-bounds access
becomes unreachable.

Fixes: d1d3c1c6ae ("sched_ext: Add verifier-time kfunc context filter")
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-20 07:57:29 -10:00
Tejun Heo
d1d3c1c6ae sched_ext: Add verifier-time kfunc context filter
Move enforcement of SCX context-sensitive kfunc restrictions from per-task
runtime kf_mask checks to BPF verifier-time filtering, using the BPF core's
struct_ops context information.

A shared .filter callback is attached to each context-sensitive BTF set
and consults a per-op allow table (scx_kf_allow_flags[]) indexed by SCX
ops member offset. Disallowed calls are now rejected at program load time
instead of at runtime.

The old model split reachability across two places: each SCX_CALL_OP*()
set bits naming its op context, and each kfunc's scx_kf_allowed() check
OR'd together the bits it accepted. A kfunc was callable when those two
masks overlapped. The new model transposes the result to the caller side -
each op's allow flags directly list the kfunc groups it may call. The old
bit assignments were:

  Call-site bits:
    ops.select_cpu   = ENQUEUE | SELECT_CPU
    ops.enqueue      = ENQUEUE
    ops.dispatch     = DISPATCH
    ops.cpu_release  = CPU_RELEASE

  Kfunc-group accepted bits:
    enqueue group     = ENQUEUE | DISPATCH
    select_cpu group  = SELECT_CPU | ENQUEUE
    dispatch group    = DISPATCH
    cpu_release group = CPU_RELEASE

Intersecting them yields the reachability now expressed directly by
scx_kf_allow_flags[]:

  ops.select_cpu  -> SELECT_CPU | ENQUEUE
  ops.enqueue     -> SELECT_CPU | ENQUEUE
  ops.dispatch    -> ENQUEUE | DISPATCH
  ops.cpu_release -> CPU_RELEASE

Unlocked ops carried no kf_mask bits and reached only unlocked kfuncs;
that maps directly to UNLOCKED in the new table.

Equivalence was checked by walking every (op, kfunc-group) combination
across SCX ops, SYSCALL, and non-SCX struct_ops callers against the old
scx_kf_allowed() runtime checks. With two intended exceptions (see below),
all combinations reach the same verdict; disallowed calls are now caught at
load time instead of firing scx_error() at runtime.

scx_bpf_dsq_move_set_slice() and scx_bpf_dsq_move_set_vtime() are
exceptions: they have no runtime check at all, but the new filter rejects
them from ops outside dispatch/unlocked. The affected cases are nonsensical
- the values these setters store are only read by
scx_bpf_dsq_move{,_vtime}(), which is itself restricted to
dispatch/unlocked, so a setter call from anywhere else was already dead
code.

Runtime scx_kf_mask enforcement is left in place by this patch and removed
in a follow-up.

Original-patch-by: Juntong Deng <juntong.deng@outlook.com>
Original-patch-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-10 07:54:06 -10:00
Cheng-Yang Chou
545b343015 sched_ext: Always use SMP versions in kernel/sched/ext_idle.h
Simplify the scheduler by making formerly SMP-only primitives and data
structures unconditional.

tj: Updated subject for clarity.

Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-13 14:47:59 -10:00
Andrea Righi
353656eb84 sched_ext: idle: Make local functions static in ext_idle.c
Functions that are only used within ext_idle.c can be marked as static
to limit their scope.

No functional changes.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-09 06:25:14 -10:00
Andrea Righi
23c63a9652 sched_ext: idle: Explicitly pass allowed cpumask to scx_select_cpu_dfl()
Modify scx_select_cpu_dfl() to take the allowed cpumask as an explicit
argument, instead of implicitly using @p->cpus_ptr.

This prepares for future changes where arbitrary cpumasks may be passed
to the built-in idle CPU selection policy.

This is a pure refactoring with no functional changes.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-07 07:13:52 -10:00
Andrea Righi
e4855fc90e sched_ext: idle: Refactor scx_select_cpu_dfl()
Make scx_select_cpu_dfl() more consistent with the other idle-related
APIs by returning a negative value when an idle CPU isn't found.

No functional changes, this is purely a refactoring.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-14 08:17:11 -10:00
Andrea Righi
c414c2171c sched_ext: idle: Honor idle flags in the built-in idle selection policy
Enable passing idle flags (%SCX_PICK_IDLE_*) to scx_select_cpu_dfl(),
to enforce strict selection criteria, such as selecting an idle CPU
strictly within @prev_cpu's node or choosing only a fully idle SMT core.

This functionality will be exposed through a dedicated kfunc in a
separate patch.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-14 08:17:01 -10:00
Andrea Righi
48849271e6 sched_ext: idle: Per-node idle cpumasks
Using a single global idle mask can lead to inefficiencies and a lot of
stress on the cache coherency protocol on large systems with multiple
NUMA nodes, since all the CPUs can create a really intense read/write
activity on the single global cpumask.

Therefore, split the global cpumask into multiple per-NUMA node cpumasks
to improve scalability and performance on large systems.

The concept is that each cpumask will track only the idle CPUs within
its corresponding NUMA node, treating CPUs in other NUMA nodes as busy.
In this way concurrent access to the idle cpumask will be restricted
within each NUMA node.

The split of multiple per-node idle cpumasks can be controlled using the
SCX_OPS_BUILTIN_IDLE_PER_NODE flag.

By default SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled and a global
host-wide idle cpumask is used, maintaining the previous behavior.

NOTE: if a scheduler explicitly enables the per-node idle cpumasks (via
SCX_OPS_BUILTIN_IDLE_PER_NODE), scx_bpf_get_idle_cpu/smtmask() will
trigger an scx error, since there are no system-wide cpumasks.

= Test =

Hardware:
 - System: DGX B200
    - CPUs: 224 SMT threads (112 physical cores)
    - Processor: INTEL(R) XEON(R) PLATINUM 8570
    - 2 NUMA nodes

Scheduler:
 - scx_simple [1] (so that we can focus at the built-in idle selection
   policy and not at the scheduling policy itself)

Test:
 - Run a parallel kernel build `make -j $(nproc)` and measure the average
   elapsed time over 10 runs:

          avg time | stdev
          ---------+------
 before:   52.431s | 2.895
  after:   50.342s | 2.895

= Conclusion =

Splitting the global cpumask into multiple per-NUMA cpumasks helped to
achieve a speedup of approximately +4% with this particular architecture
and test case.

The same test on a DGX-1 (40 physical cores, Intel Xeon E5-2698 v4 @
2.20GHz, 2 NUMA nodes) shows a speedup of around 1.5-3%.

On smaller systems, I haven't noticed any measurable regressions or
improvements with the same test (parallel kernel build) and scheduler
(scx_simple).

Moreover, with a modified scx_bpfland that uses the new NUMA-aware APIs
I observed an additional +2-2.5% performance improvement with the same
test.

[1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_simple.bpf.c

Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-16 06:52:20 -10:00
Andrea Righi
0aaaf89df8 sched_ext: idle: Introduce SCX_OPS_BUILTIN_IDLE_PER_NODE
Add the new scheduler flag SCX_OPS_BUILTIN_IDLE_PER_NODE, which allows
BPF schedulers to select between using a global flat idle cpumask or
multiple per-node cpumasks.

This only introduces the flag and the mechanism to enable/disable this
feature without affecting any scheduling behavior.

Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-16 06:52:20 -10:00
Andrea Righi
d73249f887 sched_ext: idle: Make idle static keys private
Make all the static keys used by the idle CPU selection policy private
to ext_idle.c. This avoids unnecessary exposure in headers and improves
code encapsulation.

Cc: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-16 06:52:20 -10:00
Andrea Righi
337d1b354a sched_ext: Move built-in idle CPU selection policy to a separate file
As ext.c is becoming quite large, move the idle CPU selection policy to
separate files (ext_idle.c / ext_idle.h) for better code readability.

Moreover, group together all the idle CPU selection kfunc's to the same
btf_kfunc_id_set block.

No functional changes, this is purely code reorganization.

Suggested-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-27 12:43:43 -10:00