linux/Documentation/userspace-api
Thomas Gleixner 99428157dc rseq: Reenable performance optimizations conditionally
Due to the incompatibility with TCMalloc the RSEQ optimizations and
extended features (time slice extensions) have been disabled and made
run-time conditional.

The original RSEQ implementation, which TCMalloc depends on, registers a 32
byte region (ORIG_RSEQ_SIZE). This region has a 32 byte alignment
requirement.

The newer, extension-safe variant exposes the kernel RSEQ feature size via
getauxval(AT_RSEQ_FEATURE_SIZE) and the alignment requirement via
getauxval(AT_RSEQ_ALIGN). The alignment requirement is that the registered
RSEQ region is aligned to the next power of two of the feature size. The
kernel currently has a feature size of 33 bytes, which means the alignment
requirement is 64 bytes.
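
The size and alignment rule above can be sketched in C as follows. This is a
minimal illustration, not code from the kernel or from any library: the helper
names rseq_next_pow2() and rseq_required_align() are hypothetical, and the
fallback to 32 bytes when the auxiliary vector entry is absent is an
assumption based on the original alignment requirement.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/auxv.h>

/* Auxiliary vector keys as defined in the Linux UAPI header linux/auxvec.h. */
#ifndef AT_RSEQ_FEATURE_SIZE
#define AT_RSEQ_FEATURE_SIZE 27
#endif
#ifndef AT_RSEQ_ALIGN
#define AT_RSEQ_ALIGN 28
#endif

/* Hypothetical helper: round n up to the next power of two (n itself if it
 * already is one). A feature size of 33 yields 64. */
size_t rseq_next_pow2(size_t n)
{
	size_t p = 1;

	assert(n > 0);
	while (p < n)
		p <<= 1;
	return p;
}

/* Hypothetical helper: alignment the kernel asks for. getauxval() returns 0
 * when the entry is missing; falling back to the original 32 byte alignment
 * in that case is an assumption of this sketch. */
size_t rseq_required_align(void)
{
	unsigned long align = getauxval(AT_RSEQ_ALIGN);

	return align ? (size_t)align : 32;
}
```

A caller would typically read getauxval(AT_RSEQ_FEATURE_SIZE) as well and
allocate the region with an aligned allocator using the value returned by
rseq_required_align().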

The TCMalloc RSEQ region is embedded into a cache line aligned data
structure starting at offset 32 bytes so that bytes 28-31 and the
cpu_id_start field at bytes 32-35 form a 64-bit little endian pointer with
the top-most bit (63) set. TCMalloc uses that bit to check whether the
kernel has overwritten cpu_id_start with an actual CPU id value, which is
guaranteed to not have the top-most bit set.
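
The overlay described above can be illustrated with a small C sketch. It is
not TCMalloc's code: the function names are hypothetical, and the sketch only
models the arithmetic of combining the 32-bit word at bytes 28-31 (low half)
with cpu_id_start at bytes 32-35 (high half) on a little endian machine.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: the 64-bit value a little endian load of bytes 28-35
 * would produce. Bytes 28-31 are the low half, cpu_id_start the high half. */
uint64_t overlay_value(uint32_t low_word, uint32_t cpu_id_start)
{
	return ((uint64_t)cpu_id_start << 32) | low_word;
}

/* Hypothetical helper: a real CPU id never has its top bit set, so a clear
 * bit 63 in the combined value means the kernel has overwritten
 * cpu_id_start. */
bool kernel_wrote_cpu_id(uint64_t v)
{
	return (v & (UINT64_C(1) << 63)) == 0;
}
```

With a user space value whose high half is 0x80001234, bit 63 stays set and
kernel_wrote_cpu_id() returns false. Once the kernel stores a CPU id such as
5 into cpu_id_start, the bit is clear and the function returns true.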

As this is part of their performance tuned magic, it's a pretty safe
assumption that TCMalloc won't use a larger RSEQ size.

This allows the kernel to declare registrations with a size greater than
the original size of 32 bytes, which is the case since time slice
extensions were introduced, as RSEQ ABI v2, with the following differences
from the original behaviour:

  1) Unconditional updates of the user read only fields (CPU, node, MMCID)
     are removed. Those fields are only updated on registration, task
     migration and MMCID changes.

  2) Unconditional evaluation of the critical section pointer is
     removed. It's only evaluated when user space was interrupted and was
     scheduled out or before delivering a signal in the interrupted
     context.

  3) The read-only requirement of the ID fields is enforced. When the
     kernel detects that user space manipulated the fields, the process is
     terminated. This ensures that multiple entities (libraries) can
     utilize RSEQ without interfering with each other.

  4) Today's extended RSEQ feature (time slice extensions) and future
     extensions are only enabled in v2 mode.

Registrations with the original size of 32 bytes operate in backwards
compatible legacy mode without performance improvements and extended
features.
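
The size based distinction between the two modes can be summarized in a short
C sketch. This is not the kernel implementation: the enum and the function
name are hypothetical, and the sketch deliberately leaves out any validation
of the registration (alignment, unsupported sizes), which is outside what
this description covers.

```c
/* Original registration size, as named in the text above. */
#define ORIG_RSEQ_SIZE 32u

/* Hypothetical names for the two operating modes. */
enum rseq_mode {
	RSEQ_MODE_LEGACY,	/* backwards compatible, no optimizations */
	RSEQ_MODE_V2,		/* optimizations and extended features */
};

/* Hypothetical helper: registrations larger than the original 32 bytes are
 * treated as RSEQ ABI v2, the original size selects legacy mode. */
enum rseq_mode rseq_mode_for_size(unsigned int rseq_len)
{
	return rseq_len > ORIG_RSEQ_SIZE ? RSEQ_MODE_V2 : RSEQ_MODE_LEGACY;
}
```

A registration with the original 32 byte size therefore maps to legacy mode,
while a larger registration maps to v2 mode.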

Unfortunately that also affects users of older GLIBC versions which
register the original size of 32 bytes and do not evaluate the kernel
required size in the auxiliary vector AT_RSEQ_FEATURE_SIZE.

That's the result of the lack of enforcement in the original implementation
and the unwillingness of a single entity to cooperate with the larger
ecosystem for many years.

Implement the required registration changes by restructuring the spaghetti
code and adding the size/version check. Also add documentation about the
differences between legacy and optimized RSEQ v2 mode.

Thanks to Mathieu for pointing out the ORIG_RSEQ_SIZE constraints!

Fixes: d6200245c7 ("rseq: Allow registering RSEQ with slice extension")
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Link: https://patch.msgid.link/20260428224427.927160119%40kernel.org
Cc: stable@vger.kernel.org
2026-05-06 17:40:27 +02:00
..
accelerators Documentation: ocxl.rst: Update consortium site 2025-03-12 16:26:45 -06:00
ebpf
fwctl fwctl/bnxt_fwctl: Add documentation entries 2026-03-31 13:33:55 -03:00
gpio Documentation: use a source-read extension for the index link boilerplate 2026-01-23 11:59:34 -07:00
ioctl Char/Misc/IIO driver changes for 7.0-rc1 2026-02-17 09:11:04 -08:00
media media: uapi: Clarify MBUS color component order for serial buses 2026-03-24 11:58:02 +01:00
netlink docs: netlink: Couple of intro-specs documentation fixes 2025-11-06 14:50:59 -08:00
check_exec.rst security: Add EXEC_RESTRICT_FILE and EXEC_DENY_INTERACTIVE securebits 2024-12-18 17:00:29 -08:00
dcdbas.rst Documentation: move driver-api/dcdbas to userspace-api/ 2024-01-03 14:17:40 -07:00
dma-buf-alloc-exchange.rst doc: uapi: Add document describing dma-buf semantics 2023-08-21 18:20:05 +02:00
dma-buf-heaps.rst dma-buf: heaps: system: document system_cc_shared heap 2026-04-10 14:20:01 +02:00
ELF.rst ELF: document some de-facto PT_* ABI quirks 2023-04-20 17:53:38 -06:00
futex2.rst
index.rst Scheduler changes for v7.0: 2026-02-10 12:50:10 -08:00
iommufd.rst iommufd: Fix spelling errors in iommufd.rst 2025-08-18 11:15:06 -03:00
isapnp.rst Documentation: move driver-api/isapnp to userspace-api/ 2024-01-03 14:17:39 -07:00
landlock.rst landlock: Document fallocate(2) as another truncation corner case 2026-04-07 18:51:11 +02:00
liveupdate.rst docs: add luo documentation 2025-11-27 14:24:39 -08:00
lsm.rst LSM: Create lsm_list_modules system call 2023-11-12 22:54:42 -05:00
mfd_noexec.rst mm/memfd: add documentation for MFD_NOEXEC_SEAL MFD_EXEC 2024-06-15 10:43:07 -07:00
mseal.rst LoongArch: Enable ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS 2025-05-30 21:45:42 +08:00
no_new_privs.rst
ntsync.rst docs: ntsync: Add documentation for the ntsync uAPI. 2025-01-08 13:18:12 +01:00
perf_ring_buffer.rst coresight: docs: Remove target sink from examples 2025-03-11 14:55:57 +00:00
rseq.rst rseq: Reenable performance optimizations conditionally 2026-05-06 17:40:27 +02:00
seccomp_filter.rst Documentation: userspace-api: correct spelling 2023-02-02 11:07:18 -07:00
spec_ctrl.rst Documentation/x86: Fix PR_SET_SPECULATION_CTRL error codes 2025-12-29 16:27:45 +01:00
sysfs-platform_profile.rst docs: Fix typos, improve grammar in Userspace API 2025-06-09 15:13:33 -06:00
tee.rst Documentation: Destage TEE subsystem documentation 2023-12-08 15:45:10 -07:00
unshare.rst
vduse.rst Documentation: Add documentation for VDUSE Address Space IDs 2026-01-28 15:32:18 -05:00