linux/kernel
Thomas Gleixner 99428157dc rseq: Reenable performance optimizations conditionally
Due to the incompatibility with TCMalloc the RSEQ optimizations and
extended features (time slice extensions) have been disabled and made
run-time conditional.

The original RSEQ implementation, which TCMalloc depends on, registers a 32
byte region (ORIG_RSEG_SIZE). This region has a 32 byte alignment
requirement.

The extension safe newer variant exposes the kernel RSEQ feature size via
getauxval(AT_RSEQ_FEATURE_SIZE) and the alignment requirement via
getauxval(AT_RSEQ_ALIGN). The alignment requirement is that the registered
RSEQ region is aligned to the next power of two of the feature size. The
kernel currently has a feature size of 33 bytes, which means the alignment
requirement is 64 bytes.

The TCMalloc RSEQ region is embedded into a cache line aligned data
structure starting at offset 32 bytes so that bytes 28-31 and the
cpu_id_start field at bytes 32-35 form a 64-bit little endian pointer with
the top-most bit (63 set) to check whether the kernel has overwritten
cpu_id_start with an actual CPU id value, which is guaranteed to not have
the top most bit set.

As this is part of their performance tuned magic, it's a pretty safe
assumption, that TCMalloc won't use a larger RSEQ size.

This allows the kernel to declare that registrations with a size greater
than the original size of 32 bytes, which is the cases since time slice
extensions got introduced, as RSEQ ABI v2 with the following differences to
the original behaviour:

  1) Unconditional updates of the user read only fields (CPU, node, MMCID)
     are removed. Those fields are only updated on registration, task
     migration and MMCID changes.

  2) Unconditional evaluation of the criticial section pointer is
     removed. It's only evaluated when user space was interrupted and was
     scheduled out or before delivering a signal in the interrupted
     context.

  3) The read/only requirement of the ID fields is enforced. When the
     kernel detects that userspace manipulated the fields, the process is
     terminated. This ensures that multiple entities (libraries) can
     utilize RSEQ without interfering.

  4) Todays extended RSEQ feature (time slice extensions) and future
     extensions are only enabled in the v2 enabled mode.

Registrations with the original size of 32 bytes operate in backwards
compatible legacy mode without performance improvements and extended
features.

Unfortunately that also affects users of older GLIBC versions which
register the original size of 32 bytes and do not evaluate the kernel
required size in the auxiliary vector AT_RSEQ_FEATURE_SIZE.

That's the result of the lack of enforcement in the original implementation
and the unwillingness of a single entity to cooperate with the larger
ecosystem for many years.

Implement the required registration changes by restructuring the spaghetti
code and adding the size/version check. Also add documentation about the
differences of legacy and optimized RSEQ V2 mode.

Thanks to Mathieu for pointing out the ORIG_RSEQ_SIZE constraints!

Fixes: d6200245c7 ("rseq: Allow registering RSEQ with slice extension")
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Link: https://patch.msgid.link/20260428224427.927160119%40kernel.org
Cc: stable@vger.kernel.org
2026-05-06 17:40:27 +02:00
..
bpf bpf-fixes 2026-04-17 15:58:22 -07:00
cgroup mm.git review status for linus..mm-stable 2026-04-19 08:01:17 -07:00
configs Remove WARN_ALL_UNSEEDED_RANDOM kernel config option 2026-02-23 11:18:48 -08:00
debug kgdb: update outdated references to kgdb_wait() 2026-04-21 16:41:54 +01:00
dma memblock: updates for 7.0-rc1 2026-04-18 11:29:14 -07:00
entry arm64 updates for 7.1: 2026-04-14 16:48:56 -07:00
events mm.git review status for linus..mm-stable 2026-04-15 12:59:16 -07:00
futex Locking updates for v7.1: 2026-04-14 12:36:25 -07:00
gcov Convert more 'alloc_obj' cases to default GFP_KERNEL arguments 2026-02-21 20:03:00 -08:00
irq genirq/chip: Invoke add_interrupt_randomness() in handle_percpu_devid_irq() 2026-04-02 23:03:29 +02:00
kcsan kcsan: test: Adjust "expect" allocation type for kmalloc_obj 2026-02-26 09:54:08 -08:00
livepatch Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
liveupdate mm.git review status for linus..mm-stable 2026-04-19 08:01:17 -07:00
locking locking/mutex: Fix ww_mutex wait_list operations 2026-04-23 10:05:49 +02:00
module module: Simplify warning on positive returns from module_init() 2026-04-04 00:04:48 +00:00
power Merge branches 'pm-cpuidle', 'pm-opp' and 'pm-sleep' 2026-04-10 12:37:27 +02:00
printk Merge branch 'rework/prb-fixes' into for-linus 2026-04-20 13:42:01 +02:00
rcu RCU changes for v7.1 2026-04-13 09:36:45 -07:00
sched rseq: Revert to historical performance killing behaviour 2026-05-05 16:02:57 +02:00
time clockevents: Add missing resets of the next_event_forced flag 2026-04-16 21:22:04 +02:00
trace ring-buffer fix for 7.1: 2026-04-24 15:17:23 -07:00
unwind Convert more 'alloc_obj' cases to default GFP_KERNEL arguments 2026-02-21 20:03:00 -08:00
.gitignore kheaders: rebuild kheaders_data.tar.xz when a file is modified within a minute 2025-06-24 20:30:37 +09:00
acct.c vfs-7.1-rc1.misc 2026-04-13 14:20:11 -07:00
async.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
audit_fsnotify.c audit: widen ino fields to u64 2026-03-06 14:31:26 +01:00
audit_tree.c Convert 'alloc_flex' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
audit_watch.c audit: widen ino fields to u64 2026-03-06 14:31:26 +01:00
audit.c audit: handle unknown status requests in audit_receive_msg() 2026-03-10 15:22:43 -04:00
audit.h audit: widen ino fields to u64 2026-03-06 14:31:26 +01:00
auditfilter.c audit: fix coding style issues 2026-03-05 22:16:08 -05:00
auditsc.c audit: widen ino fields to u64 2026-03-06 14:31:26 +01:00
backtracetest.c
bounds.c x86/asm: Remove ANNOTATE_DATA_SPECIAL usage 2025-12-03 16:53:19 +01:00
capability.c
cfi.c cfi: Move BPF CFI types and helpers to generic code 2025-07-31 18:23:53 -07:00
compat.c
configs.c
context_tracking.c context_tracking: Remove rcu_task_trace_heavyweight_{enter,exit}() 2026-01-01 16:39:46 +08:00
cpu_pm.c syscore: Pass context data to callbacks 2025-11-14 10:01:52 +01:00
cpu.c SPDX updates for 7.0-rc1 2026-02-17 09:46:03 -08:00
crash_core_test.c crash: add KUnit tests for crash_exclude_mem_range 2025-09-13 17:32:55 -07:00
crash_core.c kernel/crash: remove inclusion of crypto/sha1.h 2026-03-27 21:19:46 -07:00
crash_dump_dm_crypt.c crash_dump/dm-crypt: don't print in arch-specific code 2026-04-02 23:36:24 -07:00
crash_reserve.c kernel/crash: remove inclusion of crypto/sha1.h 2026-03-27 21:19:46 -07:00
cred.c cred: remove unused set_security_override_from_ctx() 2026-01-06 20:52:57 -05:00
delayacct.c delayacct: fix uapi timespec64 definition 2026-02-08 00:13:32 -08:00
dma.c
elfcorehdr.c
exec_domain.c
exit.c mm.git review status for linus..mm-nonmm-stable 2026-04-16 20:11:56 -07:00
exit.h
extable.c
fail_function.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
fork.c mm.git review status for linus..mm-nonmm-stable 2026-04-16 20:11:56 -07:00
freezer.c freezer: Clarify that only cgroup1 freezer uses PM freezer 2025-10-30 20:10:27 +01:00
gen_kheaders.sh kheaders: make it possible to override TAR 2025-08-06 10:23:36 +09:00
groups.c treewide: Replace kmalloc with kmalloc_obj for non-scalar types 2026-02-21 01:02:28 -08:00
hung_task.c hung_task: explicitly report I/O wait state in log output 2026-03-27 21:19:40 -07:00
iomem.c
irq_work.c kernel: Use trace_call__##name() at guarded tracepoint call sites 2026-03-26 08:28:49 -04:00
jump_label.c jump_label: use ATOMIC_INIT() for initialization of .enabled 2026-03-16 13:16:48 +01:00
kallsyms_internal.h kallsyms: Get rid of kallsyms relative base 2026-01-22 15:58:22 -07:00
kallsyms_selftest.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
kallsyms_selftest.h
kallsyms.c mm.git review status for linus..mm-nonmm-stable 2026-02-12 12:13:01 -08:00
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.kexec liveupdate: kho: move to kernel/liveupdate 2025-11-27 14:24:33 -08:00
Kconfig.locks
Kconfig.preempt sched: Further restrict the preemption modes 2026-01-08 12:43:57 +01:00
kcov.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
kexec_core.c kernel/kexec: remove inclusion of crypto/hash.h 2026-03-27 21:19:46 -07:00
kexec_elf.c
kexec_file.c kexec: derive purgatory entry from symbol 2026-01-31 16:16:07 -08:00
kexec_internal.h kexec: enable CMA based contiguous allocation 2025-08-02 12:01:38 -07:00
kexec.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
kheaders.c
kprobes.c kprobes: Remove unneeded warnings from __arm_kprobe_ftrace() 2026-03-13 23:15:26 +09:00
kstack_erase.c sysctl: remove __user qualifier from stack_erasing_sysctl buffer argument 2025-11-27 15:44:53 +01:00
ksyms_common.c
ksysfs.c kernel: ksysfs: initialize kernel_kobj earlier 2026-04-03 19:39:52 +02:00
kthread.c kthread: consolidate kthread exit paths to prevent use-after-free 2026-02-26 10:45:49 +01:00
latencytop.c
Makefile kcov: Enable context analysis 2026-01-05 16:43:34 +01:00
module_signature.c module: Give 'enum pkey_id_type' a more specific name 2026-03-24 21:42:37 +00:00
notifier.c
nscommon.c nsfs: tighten permission checks for ns iteration ioctls 2026-02-27 22:00:08 +01:00
nsproxy.c vfs-7.1-rc1.mount.v2 2026-04-14 19:59:25 -07:00
nstree.c nstree: tighten permission checks for listing 2026-02-27 22:00:11 +01:00
padata.c padata: Put CPU offline callback in ONLINE section to allow failure 2026-03-22 11:17:59 +09:00
panic.c kernel/panic: mark init_taint_buf as __initdata and panic instead of warning in alloc_taint_buf() 2026-03-27 21:19:33 -07:00
params.c module: Clean up parse_args() arguments 2026-03-18 21:43:18 +00:00
pid_namespace.c pid_namespace: allow opening pid_for_children before init was created 2026-03-20 14:44:26 +01:00
pid_sysctl.h
pid.c mm.git review status for linus..mm-nonmm-stable 2026-04-16 20:11:56 -07:00
profile.c
ptrace.c clone: add CLONE_AUTOREAP 2026-03-11 23:14:02 +01:00
range.c
reboot.c treewide: Replace kmalloc with kmalloc_obj for non-scalar types 2026-02-21 01:02:28 -08:00
regset.c
relay.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
resource_kunit.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
resource.c PCI: Align head space better 2026-03-27 10:19:08 -05:00
rseq.c rseq: Reenable performance optimizations conditionally 2026-05-06 17:40:27 +02:00
scftorture.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
scs.c scs: fix a wrong parameter in __scs_magic 2025-11-12 10:00:13 -08:00
seccomp.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
signal.c mm.git review status for linus..mm-nonmm-stable 2026-04-16 20:11:56 -07:00
smp.c tracing updates for v7.1: 2026-04-17 09:43:12 -07:00
smpboot.c
smpboot.h
softirq.c softirq: Prepare for deferred hrtimer rearming 2026-02-27 16:40:13 +01:00
stacktrace.c
static_call_inline.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
static_call.c
stop_machine.c sched/core: Fix migrate_swap() vs. hotplug 2025-07-01 15:02:03 +02:00
sys_ni.c rseq: Implement sys_rseq_slice_yield() 2026-01-22 11:11:17 +01:00
sys.c prctl: cfi: change the branch landing pad prctl()s to be more descriptive 2026-04-04 18:40:58 -06:00
sysctl-test.c
sysctl.c sysctl: fix uninitialized variable in proc_do_large_bitmap 2026-03-26 09:32:19 +01:00
task_work.c task_work: Fix NMI race condition 2025-10-29 10:29:54 +01:00
taskstats.c taskstats: set version in TGID exit notifications 2026-04-15 02:15:02 -07:00
torture.c torture: Avoid modulo-zero error in torture_hrtimeout_ns() 2026-03-30 15:48:14 -04:00
tracepoint.c tracepoint: balance regfunc() on func_add() failure in tracepoint_add_func() 2026-04-14 05:17:02 -04:00
tsacct.c tsacct: skip all kernel threads 2026-01-26 19:07:13 -08:00
ucount.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
uid16.c
uid16.h
umh.c treewide: Replace kmalloc with kmalloc_obj for non-scalar types 2026-02-21 01:02:28 -08:00
up.c
user_namespace.c Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL uses 2026-02-22 08:26:33 -08:00
user-return-notifier.c
user.c ns: drop custom reference count initialization for initial namespaces 2025-11-11 10:01:32 +01:00
utsname_sysctl.c
utsname.c namespace-6.18-rc1 2025-09-29 11:20:29 -07:00
vhost_task.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
vmcore_info.c mm.git review status for linus..mm-nonmm-stable 2026-04-16 20:11:56 -07:00
watch_queue.c Convert 'alloc_flex' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
watchdog_buddy.c watchdog/hardlockup: improve buddy system detection timeliness 2026-03-27 21:19:47 -07:00
watchdog_perf.c watchdog/hardlockup: simplify perf event probe and remove per-cpu dependency 2026-02-08 00:13:35 -08:00
watchdog.c watchdog/hardlockup: improve buddy system detection timeliness 2026-03-27 21:19:47 -07:00
workqueue_internal.h workqueue: Show in-flight work item duration in stall diagnostics 2026-03-05 07:27:48 -10:00
workqueue.c workqueue: Changes for v7.1 2026-04-15 10:32:08 -07:00