linux/kernel/sched
Thomas Gleixner b9eac6a9d9 rseq: Revert to historical performance killing behaviour
The recent RSEQ optimization work broke the TCMalloc abuse of the RSEQ ABI
as it not longer unconditionally updates the CPU, node, mm_cid fields,
which are documented as read only for user space. Due to the observed
behavior of the kernel it was possible for TCMalloc to overwrite the
cpu_id_start field for their own purposes and rely on the kernel to update
it unconditionally after each context switch and before signal delivery.

The RSEQ ABI only guarantees that these fields are updated when the data
changes, i.e. the task is migrated or the MMCID of the task changes due to
switching from or to per CPU ownership mode.

The optimization work eliminated the unconditional updates and reduced them
to the documented ABI guarantees, which results in a massive performance
win for syscall, scheduling heavy work loads, which in turn breaks the
TCMalloc expectations.

There have been several options discussed to restore the TCMalloc
functionality while preserving the optimization benefits. They all end up
in a series of hard to maintain workarounds, which in the worst case
introduce overhead for everyone, e.g. in the scheduler.

The requirements of TCMalloc and the optimization work are diametral and
the required work arounds are a maintainence burden. They end up as fragile
constructs, which are blocking further optimization work and are pretty
much guaranteed to cause more subtle issues down the road.

The optimization work heavily depends on the generic entry code, which is
not used by all architectures yet. So the rework preserved the original
mechanism moslty unmodified to keep the support for architectures, which
handle rseq in their own exit to user space loop. That code is currently
optimized out by the compiler on architectures which use the generic entry
code.

This allows to revert back to the original behaviour by replacing the
compile time constant conditions with a runtime condition where required,
which disables the optimization and the dependend time slice extension
feature until the run-time condition can be enabled in the RSEQ
registration code on a per task basis again.

The following changes are required to restore the original behavior, which
makes TCMalloc work again:

  1) Replace the compile time constant conditionals with runtime
     conditionals where appropriate to prevent the compiler from optimizing
     the legacy mode out

  2) Enforce unconditional update of IDs on context switch for the
     non-optimized v1 mode

  3) Enforce update of IDs in the pre signal delivery path for the
     non-optimized v1 mode

  4) Enforce update of IDs in the membarrier(RSEQ) IPI for the
     non-optimized v1 mode

  5) Make time slice and future extensions depend on optimized v2 mode

This brings back the full performance problems, but preserves the v2
optimization code and for generic entry code using architectures also the
TIF_RSEQ optimization which avoids a full evaluation of the exit to user
mode loop in many cases.

Fixes: 566d8015f7 ("rseq: Avoid CPU/MM CID updates when no event pending")
Reported-by: Mathias Stearn <mathias@mongodb.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Closes: https://lore.kernel.org/CAHnCjA25b+nO2n5CeifknSKHssJpPrjnf+dtr7UgzRw4Zgu=oA@mail.gmail.com
Link: https://patch.msgid.link/20260428224427.517051752%40kernel.org
Cc: stable@vger.kernel.org
2026-05-05 16:02:57 +02:00
..
autogroup.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
autogroup.h
build_policy.c sched_ext: Move internal type and accessor definitions to ext_internal.h 2025-09-03 11:33:28 -10:00
build_utility.c
clock.c sched/clock: Avoid false sharing for sched_clock_irqtime 2026-02-03 12:04:19 +01:00
completion.c
core_sched.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
core.c sched/fair: Clear rel_deadline when initializing forked entities 2026-04-28 09:19:54 +02:00
cpuacct.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
cpudeadline.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
cpudeadline.h sched/deadline: only set free_cpus for online runqueues 2025-10-16 11:13:49 +02:00
cpufreq_schedutil.c cpufreq: Pass the policy to cpufreq_driver->adjust_perf() 2026-04-02 11:30:24 -05:00
cpufreq.c
cpupri.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
cpupri.h
cputime.c - A nice cleanup to the paravirt code containing a unification of the paravirt 2026-02-10 19:01:45 -08:00
deadline.c Runtime Verification updates for 7.1: 2026-04-15 17:15:18 -07:00
debug.c Merge branch 'sched/urgent' into sched/core, to resolve conflicts 2026-04-02 15:04:09 +02:00
ext_idle.c sched_ext: Remove runtime kfunc mask enforcement 2026-04-10 07:54:06 -10:00
ext_idle.h sched_ext: Add verifier-time kfunc context filter 2026-04-10 07:54:06 -10:00
ext_internal.h sched_ext: Add verifier-time kfunc context filter 2026-04-10 07:54:06 -10:00
ext.c tracing updates for v7.1: 2026-04-17 09:43:12 -07:00
ext.h sched_ext: Add @kargs to scx_fork() 2026-03-06 07:58:02 -10:00
fair.c sched/fair: Fix wakeup_preempt_fair() vs delayed dequeue 2026-04-28 09:19:54 +02:00
features.h Scheduler changes for v7.1: 2026-04-14 13:33:36 -07:00
idle.c sched: idle: Consolidate the handling of two special cases 2026-03-16 20:29:47 +01:00
isolation.c cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock 2026-02-23 10:46:49 -10:00
loadavg.c Merge branch 'tip/sched/urgent' 2025-07-14 17:16:28 +02:00
Makefile sched: Enable context analysis for core.c and fair.c 2026-01-05 16:43:36 +01:00
membarrier.c rseq: Revert to historical performance killing behaviour 2026-05-05 16:02:57 +02:00
pelt.c treewide: Update email address 2026-01-11 06:09:11 -10:00
pelt.h sched/fair: Switch to task based throttle model 2025-09-03 10:03:14 +02:00
psi.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
rq-offsets.c sched: Make migrate_{en,dis}able() inline 2025-09-25 09:57:16 +02:00
rt.c sched/rt: Cleanup global RT bandwidth functions 2026-04-08 13:11:44 +02:00
sched-pelt.h
sched.h sched_ext: Changes for v7.1 2026-04-15 10:54:24 -07:00
smp.h
stats.c sched/smp: Use the SMP version of schedstats 2025-06-13 08:47:21 +02:00
stats.h delayacct: add timestamp of delay max 2026-01-31 16:16:06 -08:00
stop_task.c sched/core: Rework sched_class::wakeup_preempt() and rq_modified_*() 2025-12-17 10:53:25 +01:00
swait.c
syscalls.c Linux 7.0-rc4 2026-03-17 07:14:42 +01:00
topology.c sched/fair: Use sched_energy_enabled() 2026-04-03 14:23:41 +02:00
wait_bit.c
wait.c ARM: 2025-07-30 17:14:01 -07:00