scx_central currently assumes that ops.init() runs on the selected
central CPU and aborts otherwise. This is no longer true, as ops.init()
is invoked from the scx_enable_helper thread, which can run on any
CPU.
As a result, sched_setaffinity() from userspace doesn't work, causing
scx_central to fail when loading with:
[ 1985.319942] sched_ext: central: scx_central.bpf.c:314: init from non-central CPU
[ 1985.320317] scx_exit+0xa3/0xd0
[ 1985.320535] scx_bpf_error_bstr+0xbd/0x220
[ 1985.320840] bpf_prog_3a445a8163fa8149_central_init+0x103/0x1ba
[ 1985.321073] bpf__sched_ext_ops_init+0x40/0xa8
[ 1985.321286] scx_root_enable_workfn+0x507/0x1650
[ 1985.321461] kthread_worker_fn+0x260/0x940
[ 1985.321745] kthread+0x303/0x3e0
[ 1985.321901] ret_from_fork+0x589/0x7d0
[ 1985.322065] ret_from_fork_asm+0x1a/0x30
DEBUG DUMP
===================================================================
central: root
scx_enable_help[134] triggered exit kind 1025:
scx_bpf_error (scx_central.bpf.c:314: init from non-central CPU)
Fix this by:
- Defer bpf_timer_start() to the first dispatch on the central CPU.
- Initialize the BPF timer in central_init() and kick the central CPU
to guarantee entering the dispatch path on the central CPU immediately.
- Remove the unnecessary sched_setaffinity() call in userspace.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
Several demo schedulers and the selftest runner had usage strings
that omitted options which are actually supported:
- scx_central: add missing [-v]
- scx_pair: add missing [-v]
- scx_qmap: add missing [-S] and [-H]
- scx_userland: add missing [-v]
- scx_sdt: remove [-f] which no longer exists
- runner.c: add missing [-s], [-l], [-q]; drop [-h] which none of the
other sched_ext tools list in their usage lines
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The '-p' option is defined in getopt() but not handled in the switch
statement or documented in the help text. Providing '-p' currently
triggers the default error path.
Remove it to sync the optstring with the actual implementation.
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
After goto restart, optind retains its advanced position from the
previous getopt loop, causing getopt() to immediately return -1.
This silently drops all command-line options on the restarted skeleton.
Reset optind to 1 at the restart label so options are re-parsed.
Affected schedulers: scx_simple, scx_central, scx_flatcg, scx_pair,
scx_sdt, scx_cpu0.
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Use CPU_SET_S() instead of CPU_SET() on the dynamically allocated
cpuset to avoid a potential out-of-bounds write when nr_cpu_ids
exceeds CPU_SETSIZE.
Also destroy the skeleton before returning on invalid central CPU ID
to prevent a resource leak.
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The cpu set is dynamically allocated for nr_cpu_ids using CPU_ALLOC(),
so the size passed to sched_setaffinity() should be CPU_ALLOC_SIZE()
rather than sizeof(cpu_set_t). Valgrind flagged this as accessing
unaddressable bytes past the allocation.
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Receive tools/sched_ext updates form https://github.com/sched-ext/scx to
sync userspace bits:
- basic BPF arena allocator abstractions,
- additional process flags definitions,
- fixed is_migration_disabled() helper,
- separate out user_exit_info BPF and user space code.
This also fixes the following warning when building the selftests:
tools/sched_ext/include/scx/common.bpf.h:550:9: warning: 'likely' macro redefined [-Wmacro-redefined]
550 | #define likely(x) __builtin_expect(!!(x), 1)
| ^
Co-developed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Synchronize with https://github.com/sched-ext/scx at d384453984a0 ("kernel:
Sync at ad3b301aa0 ("sched_ext: Provides a sysfs 'events' to expose core
event counters")").
Signed-off-by: Tejun Heo <tj@kernel.org>
Receive tools/sched_ext updates form https://github.com/sched-ext/scx to
sync userspace bits:
- scx_bpf_dump_header() added which can be used to print out basic scheduler
info on dump.
- BPF possible/online CPU iterators added.
- CO-RE enums added. The enums are autogenerated from vmlinux.h. Include the
generated artifacts in tools/sched_ext to keep the Makefile simpler.
- Other misc changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
sizeof when applied to a pointer typed expression gives the size of
the pointer.
The proper fix in this particular case is to code sizeof(*cpuset)
instead of sizeof(cpuset).
This issue was detected with the help of Coccinelle.
Fixes: 22a920209a ("sched_ext: Implement tickless support")
Signed-off-by: guanjing <guanjing@cmss.chinamobile.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add ops.cpu_online/offline() which are invoked when CPUs come online and
offline respectively. As the enqueue path already automatically bypasses
tasks to the local dsq on a deactivated CPU, BPF schedulers are guaranteed
to see tasks only on CPUs which are between online() and offline().
If the BPF scheduler doesn't implement ops.cpu_online/offline(), the
scheduler is automatically exited with SCX_ECODE_RESTART |
SCX_ECODE_RSN_HOTPLUG. Userspace can implement CPU hotpplug support
trivially by simply reinitializing and reloading the scheduler.
scx_qmap is updated to print out online CPUs on hotplug events. Other
schedulers are updated to restart based on ecode.
v3: - The previous implementation added @reason to
sched_class.rq_on/offline() to distinguish between CPU hotplug events
and topology updates. This was buggy and fragile as the methods are
skipped if the current state equals the target state. Instead, add
scx_rq_[de]activate() which are directly called from
sched_cpu_de/activate(). This also allows ops.cpu_on/offline() to
sleep which can be useful.
- ops.dispatch() could be called on a CPU that the BPF scheduler was
told to be offline. The dispatch patch is updated to bypass in such
cases.
v2: - To accommodate lock ordering change between scx_cgroup_rwsem and
cpus_read_lock(), CPU hotplug operations are put into its own SCX_OPI
block and enabled eariler during scx_ope_enable() so that
cpus_read_lock() can be dropped before acquiring scx_cgroup_rwsem.
- Auto exit with ECODE added.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>
Allow BPF schedulers to indicate tickless operation by setting p->scx.slice
to SCX_SLICE_INF. A CPU whose current task has infinte slice goes into
tickless operation.
scx_central is updated to use tickless operations for all tasks and
instead use a BPF timer to expire slices. This also uses the SCX_ENQ_PREEMPT
and task state tracking added by the previous patches.
Currently, there is no way to pin the timer on the central CPU, so it may
end up on one of the worker CPUs; however, outside of that, the worker CPUs
can go tickless both while running sched_ext tasks and idling.
With schbench running, scx_central shows:
root@test ~# grep ^LOC /proc/interrupts; sleep 10; grep ^LOC /proc/interrupts
LOC: 142024 656 664 449 Local timer interrupts
LOC: 161663 663 665 449 Local timer interrupts
Without it:
root@test ~ [SIGINT]# grep ^LOC /proc/interrupts; sleep 10; grep ^LOC /proc/interrupts
LOC: 188778 3142 3793 3993 Local timer interrupts
LOC: 198993 5314 6323 6438 Local timer interrupts
While scx_central itself is too barebone to be useful as a
production scheduler, a more featureful central scheduler can be built using
the same approach. Google's experience shows that such an approach can have
significant benefits for certain applications such as VM hosting.
v4: Allow operation even if BPF_F_TIMER_CPU_PIN is not available.
v3: Pin the central scheduler's timer on the central_cpu using
BPF_F_TIMER_CPU_PIN.
v2: Convert to BPF inline iterators.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>
This patch adds a new example scheduler, scx_central, which demonstrates
central scheduling where one CPU is responsible for making all scheduling
decisions in the system using scx_bpf_kick_cpu(). The central CPU makes
scheduling decisions for all CPUs in the system, queues tasks on the
appropriate local dsq's and preempts the worker CPUs. The worker CPUs in
turn preempt the central CPU when it needs tasks to run.
Currently, every CPU depends on its own tick to expire the current task. A
follow-up patch implementing tickless support for sched_ext will allow the
worker CPUs to go full tickless so that they can run completely undisturbed.
v3: - Kumar fixed a bug where the dispatch path could overflow the dispatch
buffer if too many are dispatched to the fallback DSQ.
- Use the new SCX_KICK_IDLE to wake up non-central CPUs.
- Dropped '-p' option.
v2: - Use RESIZABLE_ARRAY() instead of fixed MAX_CPUS and use SCX_BUG[_ON]()
to simplify error handling.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Julia Lawall <julia.lawall@inria.fr>