mirror of
https://github.com/torvalds/linux.git
synced 2026-05-12 16:18:45 +02:00
Due to the incompatibility with TCMalloc, the RSEQ optimizations and
extended features (time slice extensions) have been disabled and made
run-time conditional.
The original RSEQ implementation, which TCMalloc depends on, registers a 32
byte region (ORIG_RSEQ_SIZE). This region has a 32 byte alignment
requirement.
The newer, extension-safe variant exposes the kernel RSEQ feature size via
getauxval(AT_RSEQ_FEATURE_SIZE) and the alignment requirement via
getauxval(AT_RSEQ_ALIGN). The alignment requirement is that the registered
RSEQ region is aligned to the next power of two of the feature size. The
kernel currently has a feature size of 33 bytes, which means the alignment
requirement is 64 bytes.
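
The alignment rule above can be sketched as a small helper. This is only an illustration of the arithmetic; rseq_align_for_size() is a hypothetical name and not a kernel or libc function:

```c
#include <stddef.h>

/*
 * Hypothetical helper (not a kernel function): round a feature size up
 * to the next power of two, which is the alignment rule described above.
 */
static size_t rseq_align_for_size(size_t feature_size)
{
	size_t align = 1;

	/* Double until the alignment covers the feature size. */
	while (align < feature_size)
		align <<= 1;
	return align;
}
```

With a feature size of 33 bytes the helper yields 64, and the original 32 byte size keeps its 32 byte alignment.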
The TCMalloc RSEQ region is embedded into a cache line aligned data
structure starting at offset 32 bytes, so that bytes 28-31 and the
cpu_id_start field at bytes 32-35 form a 64-bit little endian pointer with
the top-most bit (bit 63) set. That bit is used to check whether the kernel
has overwritten cpu_id_start with an actual CPU id value, which is
guaranteed to not have the top-most bit set.
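
A minimal sketch of that check, assuming the layout described above. This is not TCMalloc's code; the names "slab", load_le64_at_28() and cpu_id_start_untouched() are made up for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only: "slab" stands for the cache line aligned structure
 * described above, with the RSEQ region starting at byte 32. Bytes 28-35
 * are assembled as a little endian 64-bit value, so cpu_id_start
 * (bytes 32-35) supplies the upper half including bit 63.
 */
static uint64_t load_le64_at_28(const unsigned char *slab)
{
	uint64_t v = 0;

	/* Walk from the most significant byte (35) down to byte 28. */
	for (int i = 7; i >= 0; i--)
		v = (v << 8) | slab[28 + i];
	return v;
}

/* True while bit 63 is still set, i.e. no CPU id has been written yet. */
static bool cpu_id_start_untouched(const unsigned char *slab)
{
	return (load_le64_at_28(slab) >> 63) & 1;
}
```

Once the kernel stores a CPU id into cpu_id_start, the upper half of the 64-bit value no longer has bit 63 set and the check returns false.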
As this is part of their performance-tuned magic, it's a pretty safe
assumption that TCMalloc won't use a larger RSEQ size.
This allows the kernel to declare registrations with a size greater than
the original size of 32 bytes, which is the case since time slice
extensions got introduced, as RSEQ ABI v2 with the following differences to
the original behaviour:
  1) Unconditional updates of the user read-only fields (CPU, node, MMCID)
     are removed. Those fields are only updated on registration, task
     migration and MMCID changes.
  2) Unconditional evaluation of the critical section pointer is
     removed. It's only evaluated when user space was interrupted and was
     scheduled out or before delivering a signal in the interrupted
     context.
  3) The read-only requirement of the ID fields is enforced. When the
     kernel detects that userspace manipulated the fields, the process is
     terminated. This ensures that multiple entities (libraries) can
     utilize RSEQ without interfering.
  4) Today's extended RSEQ feature (time slice extensions) and future
     extensions are only enabled in the v2 enabled mode.
Registrations with the original size of 32 bytes operate in backwards
compatible legacy mode without performance improvements and extended
features.
Unfortunately that also affects users of older GLIBC versions which
register the original size of 32 bytes and do not evaluate the kernel
required size in the auxiliary vector AT_RSEQ_FEATURE_SIZE.
That's the result of the lack of enforcement in the original implementation
and the unwillingness of a single entity to cooperate with the larger
ecosystem for many years.
Implement the required registration changes by restructuring the spaghetti
code and adding the size/version check. Also add documentation about the
differences of legacy and optimized RSEQ V2 mode.
Thanks to Mathieu for pointing out the ORIG_RSEQ_SIZE constraints!
Fixes: d6200245c7 ("rseq: Allow registering RSEQ with slice extension")
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Link: https://patch.msgid.link/20260428224427.927160119%40kernel.org
Cc: stable@vger.kernel.org
233 lines
9.1 KiB
ReStructuredText

=====================
Restartable Sequences
=====================

Restartable Sequences allow registering a per-thread userspace memory area
to be used as an ABI between kernel and userspace for three purposes:

* userspace restartable sequences

* quick access to the current CPU number and node ID from userspace

* scheduler time slice extensions


Restartable sequences (per-cpu atomics)
---------------------------------------

Restartable sequences allow userspace to perform update operations on
per-cpu data without requiring heavyweight atomic operations. The actual
ABI is unfortunately only available in the code and selftests.

Quick access to CPU number, node ID
-----------------------------------

This allows per-CPU data to be implemented efficiently. Documentation is in
code and selftests. :(


Optimized RSEQ V2
-----------------

On architectures which utilize the generic entry code and generic TIF bits
the kernel supports runtime optimizations for RSEQ, which also enable
enhanced features like scheduler time slice extensions.

To enable them, a task has to register the RSEQ region with at least the
length advertised by getauxval(AT_RSEQ_FEATURE_SIZE).

If existing binaries register with ORIG_RSEQ_SIZE (32 bytes), the kernel
keeps the legacy low performance mode enabled to fulfil the expectations
of existing users regarding the original RSEQ implementation behaviour.

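
A minimal userspace sketch of how the registration size and alignment could be queried. The struct and function names are hypothetical, the fallback defines only cover libc headers which predate these auxiliary vector entries, and clamping to 32 bytes is an assumption of this sketch rather than documented kernel behaviour:

```c
#include <stddef.h>
#include <sys/auxv.h>

/* Fallbacks for older libc headers; values as in the kernel's uapi auxvec.h. */
#ifndef AT_RSEQ_FEATURE_SIZE
#define AT_RSEQ_FEATURE_SIZE	27
#endif
#ifndef AT_RSEQ_ALIGN
#define AT_RSEQ_ALIGN		28
#endif

#define RSEQ_LEGACY_SIZE	32	/* original registration size */

/* Hypothetical container for the values needed to allocate the region. */
struct rseq_layout {
	size_t size;	/* length to pass at registration */
	size_t align;	/* required alignment of the region */
};

static struct rseq_layout rseq_query_layout(void)
{
	struct rseq_layout l;

	/* getauxval() returns 0 when the kernel does not provide the entry. */
	l.size = getauxval(AT_RSEQ_FEATURE_SIZE);
	l.align = getauxval(AT_RSEQ_ALIGN);

	/* Never go below the original 32 byte size and alignment. */
	if (l.size < RSEQ_LEGACY_SIZE)
		l.size = RSEQ_LEGACY_SIZE;
	if (l.align < RSEQ_LEGACY_SIZE)
		l.align = RSEQ_LEGACY_SIZE;
	return l;
}
```

A caller would allocate the region with the returned alignment, for example via aligned_alloc(), and pass the returned size as the length at registration. If the result is 32 bytes, the registration ends up in legacy mode.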

The following table documents the ABI and behavioral guarantees of the
legacy and the optimized V2 mode.

.. list-table:: RSEQ modes
   :header-rows: 1

   * - Nr
     - What
     - Legacy
     - Optimized V2

   * - 1
     - The cpu_id_start, cpu_id, node_id and mm_cid fields (User mode read
       only)

       .. Legacy

     - Updated by the kernel unconditionally after each context switch and
       before signal delivery

       .. Optimized V2

     - Updated by the kernel if and only if they change, i.e. if the task
       is migrated or mm_cid changes

   * - 2
     - The rseq_cs critical section field

       .. Legacy

     - Evaluated and handled unconditionally after each context switch and
       before signal delivery

       .. Optimized V2

     - Evaluated and handled conditionally only when user space was
       interrupted and was scheduled out or before delivering a signal in
       the interrupted context.

   * - 3
     - Read only fields

       .. Legacy

     - No strict enforcement except in debug mode

       .. Optimized V2

     - Strict enforcement

   * - 4
     - membarrier(...RSEQ)

       .. Legacy

     - All running threads of the process are interrupted, the ID fields
       are rewritten and any active critical sections are aborted
       before they return to user space. All threads which are scheduled
       out, whether voluntarily or not, are covered by #1/#2 above.

       .. Optimized V2

     - All running threads of the process are interrupted and any
       active critical sections are aborted before these threads return to
       user space. The ID fields are only updated if changed as a
       consequence of the interrupt. All threads which are scheduled out,
       whether voluntarily or not, are covered by #1/#2 above.

   * - 5
     - Time slice extensions

       .. Legacy

     - Not supported

       .. Optimized V2

     - Supported


The legacy mode is obviously less performant as it does unconditional
updates and critical section checks even if not strictly required by the
ABI contract. That can't be changed anymore as some users depend on that
observed behavior, which in turn enables them to violate the ABI and
overwrite the cpu_id_start field for their own purposes. This is obviously
discouraged as it renders RSEQ incompatible with the intended usage and
breaks the expectation of other libraries in the same application.

The ABI compliant optimized V2 mode, which respects the read only fields,
does not require unconditional updates and therefore is way more
performant. The kernel validates the read only fields for compliance. If
user space modifies them, the process is killed. Compliant usage allows
multiple libraries in the same application to benefit from the RSEQ
functionality without disturbing each other. The ABI compliant optimized V2
mode also enables extended RSEQ features like time slice extensions.



Scheduler time slice extensions
-------------------------------

This allows a thread to request a time slice extension when it enters a
critical section to avoid contention on a resource when the thread is
scheduled out inside of the critical section.

The prerequisites for this functionality are:

* Enabled in Kconfig

* Enabled at boot time (default is enabled)

* A rseq userspace pointer has been registered for the thread in
  optimized V2 mode

The thread has to enable the functionality via prctl(2)::

    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
          PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);


prctl() returns 0 on success, or otherwise fails with one of the following
error codes:

========= ==============================================================
Errorcode Meaning
========= ==============================================================
EINVAL    Functionality not available or invalid function arguments.
          Note: arg4 and arg5 must be zero
ENOTSUPP  Functionality was disabled on the kernel command line
ENXIO     Available, but no rseq user struct registered
========= ==============================================================

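
As an illustration, the error codes of the enable operation could be mapped to messages as sketched below. slice_ext_enable_error() is a hypothetical helper. ENOTSUPP is a kernel internal error number (524) which userspace errno.h usually does not define, so the fallback define is an assumption about what userspace would observe:

```c
#include <errno.h>

/* ENOTSUPP is kernel internal; userspace errno.h usually lacks it. */
#ifndef ENOTSUPP
#define ENOTSUPP 524
#endif

/*
 * Hypothetical helper: translate the errno value seen after the
 * PR_RSEQ_SLICE_EXTENSION_SET prctl() into the meaning listed in the
 * table above. 0 stands for success.
 */
static const char *slice_ext_enable_error(int err)
{
	switch (err) {
	case 0:
		return "enabled";
	case EINVAL:
		return "not available or invalid arguments";
	case ENOTSUPP:
		return "disabled on the kernel command line";
	case ENXIO:
		return "available, but no rseq user struct registered";
	default:
		return "unexpected error";
	}
}
```

A caller would pass 0 when prctl() succeeded and errno otherwise.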

The state can also be queried via prctl(2)::

    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);

prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
disabled. Otherwise it fails with one of the following error codes:

========= ==============================================================
Errorcode Meaning
========= ==============================================================
EINVAL    Functionality not available or invalid function arguments.
          Note: arg3 and arg4 and arg5 must be zero
========= ==============================================================


The availability and status are also exposed in the rseq ABI struct flags
field via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read-only for user
space and only for informational purposes.

If the mechanism was enabled via prctl(), the thread can request a time
slice extension by setting rseq::slice_ctrl::request to 1. If the thread is
interrupted and the interrupt results in a reschedule request in the
kernel, then the kernel can grant a time slice extension and return to
userspace instead of scheduling out. The length of the extension is
determined by debugfs:rseq/slice_ext_nsec. The default value is 5 usec, which
is also the minimum value. It can be increased up to 50 usecs, however doing
so can/will affect the minimum scheduling latency.

Any proposed changes to this default will have to come with a selftest and
rseq-slice-hist.py output that shows the new value has merit.


The kernel indicates the grant by clearing rseq::slice_ctrl::request and
setting rseq::slice_ctrl::granted to 1. If there is a reschedule of the
thread after granting the extension, the kernel clears the granted bit to
indicate that to userspace.

If the request bit is still set when leaving the critical section,
userspace can clear it and continue.

If the granted bit is set, then userspace invokes rseq_slice_yield(2) when
leaving the critical section to relinquish the CPU. The kernel enforces
this by arming a timer to prevent misbehaving userspace from abusing this
mechanism.

If both the request bit and the granted bit are false when leaving the
critical section, then this indicates that a grant was revoked and no
further action is required by userspace.

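
The three cases above can be summarized as a small decision helper. This is only a sketch of the rules as described; the enum and the function name are hypothetical and not part of the rseq ABI:

```c
#include <stdbool.h>

/* Hypothetical names for the three outcomes described above. */
enum slice_exit_action {
	SLICE_EXIT_CLEAR_REQUEST,	/* request still set: clear it and continue */
	SLICE_EXIT_YIELD,		/* granted: invoke rseq_slice_yield(2) */
	SLICE_EXIT_NOTHING,		/* both clear: the grant was revoked */
};

/*
 * Decide what userspace has to do when leaving the critical section,
 * based on the request and granted bits it observes at that point.
 */
static enum slice_exit_action slice_exit_decide(bool request, bool granted)
{
	if (request)
		return SLICE_EXIT_CLEAR_REQUEST;
	if (granted)
		return SLICE_EXIT_YIELD;
	return SLICE_EXIT_NOTHING;
}
```

The request bit is checked first because the kernel clears it when it grants an extension, so a set request bit means that no grant happened.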

The required code flow is as follows::

    rseq->slice_ctrl.request = 1;
    barrier();  // Prevent compiler reordering
    critical_section();
    barrier();  // Prevent compiler reordering
    rseq->slice_ctrl.request = 0;
    if (rseq->slice_ctrl.granted)
            rseq_slice_yield();

As all of this is strictly CPU local, there are no atomicity requirements.
Checking the granted state is racy, but that cannot be avoided at all::

    if (rseq->slice_ctrl.granted)
      -> Interrupt results in schedule and grant revocation
            rseq_slice_yield();

So there is no point in pretending that this might be solved by an atomic
operation.


If the thread issues a syscall other than rseq_slice_yield(2) within the
granted time slice extension, the grant is also revoked and the CPU is
relinquished immediately when entering the kernel. This is required as
syscalls might consume arbitrary CPU time until they reach a scheduling
point when the preemption model is either NONE or VOLUNTARY and therefore
might exceed the grant by far.

The preferred solution for user space is to use rseq_slice_yield(2) which
is side effect free. The support for arbitrary syscalls is required to
support applications with an onion layer architecture, where the code
handling the critical section and requesting the time slice extension has no
control over the code within the critical section.

The kernel enforces flag consistency and terminates the thread with SIGSEGV
if it detects a violation.