mirror of
https://github.com/torvalds/linux.git
synced 2026-05-31 10:33:41 +02:00
Documentation: Add real-time to core-api
The documents explain the design concepts behind PREEMPT_RT and highlight key
differences necessary to achieve it. They also include a list of requirements
that must be fulfilled to support PREEMPT_RT on a given architecture.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
[jc: tweaked "how they differ" section head]
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20250815093858.930751-4-bigeasy@linutronix.de
parent f41c808c43
commit f51fe3b7e4
@ -24,6 +24,7 @@ it.
    printk-index
    symbol-namespaces
    asm-annotations
+   real-time/index

 Data structures and low-level utilities
 =======================================

109
Documentation/core-api/real-time/architecture-porting.rst
Normal file
@ -0,0 +1,109 @@
.. SPDX-License-Identifier: GPL-2.0

=============================================
Porting an architecture to support PREEMPT_RT
=============================================

:Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

This list outlines the architecture-specific requirements that must be
implemented in order to enable PREEMPT_RT. Once all required features are
implemented, ARCH_SUPPORTS_RT can be selected in the architecture’s Kconfig to
make PREEMPT_RT selectable.
Many prerequisites (genirq support for example) are enforced by the common code
and are omitted here.

The optional features are not strictly required but are worth considering.

Requirements
------------

Forced threaded interrupts
  CONFIG_IRQ_FORCED_THREADING must be selected. Any interrupts that must
  remain in hard-IRQ context must be marked with IRQF_NO_THREAD. This
  requirement applies, for instance, to clocksource event interrupts,
  perf interrupts and cascading interrupt-controller handlers.

PREEMPTION support
  Kernel preemption must be supported and requires that
  CONFIG_ARCH_NO_PREEMPT remain unselected. Scheduling requests, such as those
  issued from an interrupt or other exception handler, must be processed
  immediately.

POSIX CPU timers and KVM
  POSIX CPU timers must expire from thread context rather than directly within
  the timer interrupt. This behavior is enabled by setting the configuration
  option CONFIG_HAVE_POSIX_CPU_TIMERS_TASK_WORK.
  When KVM is enabled, CONFIG_KVM_XFER_TO_GUEST_WORK must also be set to ensure
  that any pending work, such as POSIX timer expiration, is handled before
  transitioning into guest mode.

Hard-IRQ and Soft-IRQ stacks
  Soft interrupts are handled in the thread context in which they are raised. If
  a soft interrupt is triggered from hard-IRQ context, its execution is deferred
  to the ksoftirqd thread. Preemption is never disabled during soft interrupt
  handling, which makes soft interrupts preemptible.
  If an architecture provides a custom __do_softirq() implementation that uses a
  separate stack, it must select CONFIG_HAVE_SOFTIRQ_ON_OWN_STACK. The
  functionality should only be enabled when CONFIG_SOFTIRQ_ON_OWN_STACK is set.

FPU and SIMD access in kernel mode
  FPU and SIMD registers are typically not used in kernel mode and are therefore
  not saved during kernel preemption. As a result, any kernel code that uses
  these registers must be enclosed within a kernel_fpu_begin() and
  kernel_fpu_end() section.
  The kernel_fpu_begin() function usually invokes local_bh_disable() to prevent
  interruptions from softirqs and to disable regular preemption. This allows the
  protected code to run safely in both thread and softirq contexts.
  On PREEMPT_RT kernels, however, kernel_fpu_begin() must not call
  local_bh_disable(). Instead, it should use preempt_disable(), since softirqs
  are always handled in thread context under PREEMPT_RT. In this case, disabling
  preemption alone is sufficient.
  The crypto subsystem operates on memory pages and requires users to "walk and
  map" these pages while processing a request. This operation must occur outside
  the kernel_fpu_begin()/kernel_fpu_end() section because it requires preemption
  to be enabled. These preemption points are generally sufficient to avoid
  excessive scheduling latency.

Exception handlers
  Exception handlers, such as the page fault handler, typically enable interrupts
  early, before invoking any generic code to process the exception. This is
  necessary because handling a page fault may involve operations that can sleep.
  Enabling interrupts is especially important on PREEMPT_RT, where certain
  locks, such as spinlock_t, become sleepable. For example, handling an
  invalid opcode may result in sending a SIGILL signal to the user task. A
  debug exception will send a SIGTRAP signal.
  In both cases, if the exception occurred in user space, it is safe to enable
  interrupts early. Sending a signal requires both interrupts and kernel
  preemption to be enabled.
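
Seen from user space, the timers covered by the POSIX CPU timers requirement
above are those that count consumed CPU time rather than wall-clock time. The
following is a minimal user-space sketch, not kernel code. It arms a one-shot
CPU-time timer and burns CPU until the timer expires. It assumes that
setitimer(ITIMER_VIRTUAL) is an acceptable stand-in for a timer created with
timer_create() on a CPU clock; arm_cpu_timer() and burn_until_fired() are
hypothetical helper names.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <string.h>
#include <sys/time.h>

/* Set by the signal handler once the CPU-time timer has expired. */
static volatile sig_atomic_t cpu_timer_fired;

static void on_cpu_timer(int sig)
{
    (void)sig;
    cpu_timer_fired = 1;
}

/*
 * Arm a one-shot timer that expires after usec microseconds of user CPU
 * time consumed by this process. Returns 0 on success, -1 on error.
 */
int arm_cpu_timer(long usec)
{
    struct sigaction sa;
    struct itimerval it;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_cpu_timer;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGVTALRM, &sa, NULL) == -1)
        return -1;

    memset(&it, 0, sizeof(it));          /* it_interval == 0: one-shot */
    it.it_value.tv_sec = usec / 1000000;
    it.it_value.tv_usec = usec % 1000000;
    cpu_timer_fired = 0;
    return setitimer(ITIMER_VIRTUAL, &it, NULL);
}

/*
 * Consume CPU time until the timer fires or the spin budget is used up.
 * Returns 1 if the timer fired, 0 otherwise.
 */
int burn_until_fired(unsigned long max_spins)
{
    volatile unsigned long i;

    for (i = 0; i < max_spins && !cpu_timer_fired; i++)
        ;
    return cpu_timer_fired;
}
```

The program cannot tell in which kernel context the expiry was processed; that
is decided entirely on the kernel side, which is why the requirement is
expressed through a configuration option.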

Optional features
-----------------

Timer and clocksource
  A high-resolution clocksource and clockevents device are recommended. The
  clockevents device should support the CLOCK_EVT_FEAT_ONESHOT feature for
  optimal timer behavior. In most cases, microsecond-level accuracy is
  sufficient.

Lazy preemption
  This mechanism allows an in-kernel scheduling request for non-real-time tasks
  to be delayed until the task is about to return to user space. It helps avoid
  preempting a task that holds a sleeping lock at the time of the scheduling
  request.
  With CONFIG_GENERIC_IRQ_ENTRY enabled, supporting this feature requires
  defining a bit for TIF_NEED_RESCHED_LAZY, preferably near TIF_NEED_RESCHED.

Serial console with NBCON
  With PREEMPT_RT enabled, all console output is handled by a dedicated thread
  rather than directly from the context in which printk() is invoked. This design
  allows printk() to be safely used in atomic contexts.
  However, this also means that if the kernel crashes and cannot switch to the
  printing thread, no output will be visible, preventing the system from printing
  its final messages.
  There are exceptions for immediate output, such as during panic() handling. To
  support this, the console driver must implement new-style lock handling. This
  involves setting the CON_NBCON flag in console::flags and providing
  implementations for the write_atomic, write_thread, device_lock, and
  device_unlock callbacks.
242
Documentation/core-api/real-time/differences.rst
Normal file
@ -0,0 +1,242 @@
.. SPDX-License-Identifier: GPL-2.0

===========================
How realtime kernels differ
===========================

:Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Preface
=======

With forced-threaded interrupts and sleeping spin locks, code paths that
previously caused long scheduling latencies have been made preemptible and
moved into process context. This allows the scheduler to manage them more
effectively and respond to higher-priority tasks with reduced latency.

The following chapters provide an overview of key differences between a
PREEMPT_RT kernel and a standard, non-PREEMPT_RT kernel.

Locking
=======

Spinning locks such as spinlock_t are used to provide synchronization for data
structures accessed from both interrupt context and process context. For this
reason, locking functions are also available with the _irq() or _irqsave()
suffixes, which disable interrupts before acquiring the lock. This ensures that
the lock can be safely acquired in process context when interrupts are enabled.

However, on a PREEMPT_RT system, interrupts are forced-threaded and no longer
run in hard IRQ context. As a result, there is no need to disable interrupts as
part of the locking procedure when using spinlock_t.

For low-level core components such as interrupt handling, the scheduler, or the
timer subsystem, the kernel uses raw_spinlock_t. This lock type preserves
traditional semantics: it disables preemption and, when used with _irq() or
_irqsave(), also disables interrupts. This ensures proper synchronization in
critical sections that must remain non-preemptible or with interrupts disabled.

Execution context
=================

Interrupt handling in a PREEMPT_RT system is invoked in process context through
the use of threaded interrupts. Other parts of the kernel also shift their
execution into threaded context by different mechanisms. The goal is to keep
execution paths preemptible, allowing the scheduler to interrupt them when a
higher-priority task needs to run.

Below is an overview of the kernel subsystems involved in this transition to
threaded, preemptible execution.

Interrupt handling
------------------

All interrupts are forced-threaded in a PREEMPT_RT system. The exceptions are
interrupts that are requested with the IRQF_NO_THREAD, IRQF_PERCPU, or
IRQF_ONESHOT flags.

The IRQF_ONESHOT flag is used together with threaded interrupts, meaning those
registered using request_threaded_irq() and providing only a threaded handler.
Its purpose is to keep the interrupt line masked until the threaded handler has
completed.

If a primary handler is also provided in this case, it is essential that the
handler does not acquire any sleeping locks, as it will not be threaded. The
handler should be minimal and must avoid introducing delays, such as
busy-waiting on hardware registers.


Soft interrupts, bottom half handling
-------------------------------------

Soft interrupts are raised by the interrupt handler and are executed after the
handler returns. Since they run in thread context, they can be preempted by
other threads. Do not assume that softirq context runs with preemption
disabled. This means you must not rely on mechanisms like local_bh_disable() in
process context to protect per-CPU variables. Because softirq handlers are
preemptible under PREEMPT_RT, this approach does not provide reliable
synchronization.

If this kind of protection is required for performance reasons, consider using
local_lock_nested_bh(). On non-PREEMPT_RT kernels, this allows lockdep to
verify that bottom halves are disabled. On PREEMPT_RT systems, it adds the
necessary locking to ensure proper protection.

Using local_lock_nested_bh() also makes the locking scope explicit and easier
for readers and maintainers to understand.


Per-CPU variables
-----------------

Protecting access to per-CPU variables solely by using preempt_disable() should
be avoided, especially if the critical section has unbounded runtime or may
call APIs that can sleep.

If using a spinlock_t is considered too costly for performance reasons,
consider using local_lock_t. On non-PREEMPT_RT configurations, this introduces
no runtime overhead when lockdep is disabled. With lockdep enabled, it verifies
that the lock is only acquired in process context and never from softirq or
hard IRQ context.

On a PREEMPT_RT kernel, local_lock_t is implemented using a per-CPU spinlock_t,
which provides safe local protection for per-CPU data while keeping the system
preemptible.

Because spinlock_t on PREEMPT_RT does not disable preemption, it cannot be used
to protect per-CPU data by relying on implicit preemption disabling. If this
inherited preemption disabling is essential, and if local_lock_t cannot be used
due to performance constraints, brevity of the code, or abstraction boundaries
within an API, then preempt_disable_nested() may be a suitable alternative. On
non-PREEMPT_RT kernels, it verifies with lockdep that preemption is already
disabled. On PREEMPT_RT, it explicitly disables preemption.

Timers
------

By default, an hrtimer is executed in hard interrupt context. The exception is
timers initialized with the HRTIMER_MODE_SOFT flag, which are executed in
softirq context.

On a PREEMPT_RT kernel, this behavior is reversed: hrtimers are executed in
softirq context by default, typically within the ktimersd thread. This thread
runs at the lowest real-time priority, ensuring it executes before any
SCHED_OTHER tasks but does not interfere with higher-priority real-time
threads. To explicitly request execution in hard interrupt context on
PREEMPT_RT, the timer must be marked with the HRTIMER_MODE_HARD flag.

Memory allocation
-----------------

The memory allocation APIs, such as kmalloc() and alloc_pages(), require a
gfp_t flag to indicate the allocation context. On non-PREEMPT_RT kernels, it is
necessary to use GFP_ATOMIC when allocating memory from interrupt context or
from sections where preemption is disabled. This is because the allocator must
not sleep in these contexts waiting for memory to become available.

However, this approach does not work on PREEMPT_RT kernels. The memory
allocator in PREEMPT_RT uses sleeping locks internally, which cannot be
acquired when preemption is disabled. Fortunately, this is generally not a
problem, because PREEMPT_RT moves most contexts that would traditionally run
with preemption or interrupts disabled into threaded context, where sleeping is
allowed.

What remains problematic is code that explicitly disables preemption or
interrupts. In such cases, memory allocation must be performed outside the
critical section.

This restriction also applies to memory deallocation routines such as kfree()
and free_pages(), which may also involve internal locking and must not be
called from non-preemptible contexts.
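
The rule above can be illustrated with a user-space analogy, under the
assumption that a busy-wait flag is a fair stand-in for a section that runs
with preemption disabled. The sketch allocates before entering the section and
frees only after leaving it. The names list_push(), list_pop() and struct item
are hypothetical, and malloc()/free() merely stand in for kmalloc()/kfree().

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdlib.h>

struct item {
    int value;
    struct item *next;
};

/* Busy-wait flag: stand-in for a non-preemptible critical section. */
static atomic_flag list_busy = ATOMIC_FLAG_INIT;
static struct item *list_head;

static void section_enter(void)
{
    while (atomic_flag_test_and_set_explicit(&list_busy, memory_order_acquire))
        ;
}

static void section_leave(void)
{
    atomic_flag_clear_explicit(&list_busy, memory_order_release);
}

/* Allocate first (may block), then link the item inside the section. */
int list_push(int value)
{
    struct item *it = malloc(sizeof(*it));

    if (!it)
        return -ENOMEM;
    it->value = value;

    section_enter();
    it->next = list_head;
    list_head = it;
    section_leave();
    return 0;
}

/* Unlink inside the section, free only after leaving it. */
int list_pop(int *value)
{
    struct item *it;

    section_enter();
    it = list_head;
    if (it)
        list_head = it->next;
    section_leave();

    if (!it)
        return -ENOENT;
    *value = it->value;
    free(it);
    return 0;
}
```

The critical section shrinks to a few pointer updates, which is the property
that matters for latency regardless of the allocator in use.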

IRQ work
--------

The irq_work API provides a mechanism to schedule a callback in interrupt
context. It is designed for use in contexts where traditional scheduling is not
possible, such as from within NMI handlers or from inside the scheduler, where
using a workqueue would be unsafe.

On non-PREEMPT_RT systems, all irq_work items are executed immediately in
interrupt context. Items marked with IRQ_WORK_LAZY are deferred until the next
timer tick but are still executed in interrupt context.

On PREEMPT_RT systems, the execution model changes. Because irq_work callbacks
may acquire sleeping locks or have unbounded execution time, they are handled
in thread context by a per-CPU irq_work kernel thread. This thread runs at the
lowest real-time priority, ensuring it executes before any SCHED_OTHER tasks
but does not interfere with higher-priority real-time threads.

The exceptions are work items marked with IRQ_WORK_HARD_IRQ, which are still
executed in hard interrupt context. Lazy items (IRQ_WORK_LAZY) continue to be
deferred until the next timer tick and are also executed by the per-CPU
irq_work thread.

RCU callbacks
-------------

RCU callbacks are invoked by default in softirq context. Their execution is
important because, depending on the use case, they either free memory or ensure
progress in state transitions. Running these callbacks as part of the softirq
chain can lead to undesired situations, such as contention for CPU resources
with other SCHED_OTHER tasks when executed within ksoftirqd.

To avoid running callbacks in softirq context, the RCU subsystem provides a
mechanism to execute them in process context instead. This behavior can be
enabled by setting the boot command-line parameter rcutree.use_softirq=0. This
setting is enforced in kernels configured with PREEMPT_RT.

Spin until ready
================

The "spin until ready" pattern involves repeatedly checking (spinning on) the
state of a data structure until it becomes available. This pattern assumes that
preemption, soft interrupts, or interrupts are disabled. If the data structure
is marked busy, it is presumed to be in use by another CPU, and spinning should
eventually succeed as that CPU makes progress.

Examples are hrtimer_cancel() and timer_delete_sync(). These functions
cancel timers that execute with interrupts or soft interrupts disabled. If a
thread attempts to cancel a timer and finds it active, spinning until the
callback completes is safe because the callback can only run on another CPU and
will eventually finish.

On PREEMPT_RT kernels, however, timer callbacks run in thread context. This
introduces a challenge: a higher-priority thread attempting to cancel the timer
may preempt the timer callback thread. Since the scheduler cannot migrate the
callback thread to another CPU due to affinity constraints, spinning can result
in livelock even on multiprocessor systems.

To avoid this, both the canceling and callback sides must use a handshake
mechanism that supports priority inheritance. This allows the canceling thread
to suspend until the callback completes, ensuring forward progress without
risking livelock.

To solve the problem at the API level, the sequence locks were extended to
allow a proper handover between the spinning reader and the possibly blocked
writer.
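
The handshake for timer cancellation can be sketched in user space with a
priority-inheritance mutex. This is a toy model under stated assumptions, not
the kernel implementation: the callback side holds an expiry lock while it
runs, and the cancel side blocks on that lock instead of spinning on a busy
flag. struct toy_timer and all toy_timer_*() names are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_timer {
    pthread_mutex_t expiry_lock;   /* held for the duration of the callback */
    atomic_bool running;           /* true while the callback executes */
    atomic_int fired;              /* number of completed callbacks */
};

int toy_timer_init(struct toy_timer *t)
{
    pthread_mutexattr_t attr;
    int ret;

    atomic_init(&t->running, false);
    atomic_init(&t->fired, 0);

    ret = pthread_mutexattr_init(&attr);
    if (ret)
        return ret;
    /* Priority inheritance: a blocked canceller boosts the lock owner. */
    ret = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (!ret)
        ret = pthread_mutex_init(&t->expiry_lock, &attr);
    pthread_mutexattr_destroy(&attr);
    return ret;
}

/* Callback side: what the timer thread does when the timer expires. */
void toy_timer_expire(struct toy_timer *t)
{
    pthread_mutex_lock(&t->expiry_lock);
    atomic_store(&t->running, true);
    atomic_fetch_add(&t->fired, 1);        /* the "callback" body */
    atomic_store(&t->running, false);
    pthread_mutex_unlock(&t->expiry_lock);
}

/* Thread entry point wrapping toy_timer_expire(). */
void *toy_timer_thread(void *arg)
{
    toy_timer_expire(arg);
    return NULL;
}

/* Cancel side: sleep on the lock instead of spinning on ->running. */
void toy_timer_cancel_sync(struct toy_timer *t)
{
    while (atomic_load(&t->running)) {
        pthread_mutex_lock(&t->expiry_lock);    /* blocks until callback ends */
        pthread_mutex_unlock(&t->expiry_lock);
    }
}
```

Because the canceller sleeps on a PI lock, a high-priority canceller lends its
priority to the callback thread rather than starving it, which is the forward
progress guarantee described above.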

Sequence locks
--------------

Sequence counters and sequential locks are documented in
Documentation/locking/seqlock.rst.

The interface has been extended to ensure proper preemption states for the
writer and spinning reader contexts. This is achieved by embedding the writer
serialization lock directly into the sequence counter type, resulting in
composite types such as seqcount_spinlock_t or seqcount_mutex_t.

These composite types allow readers to detect an ongoing write and actively
boost the writer’s priority to help it complete its update instead of spinning
and waiting for its completion.

If the plain seqcount_t is used, extra care must be taken to synchronize the
reader with the writer during updates. The writer must ensure its update is
serialized and non-preemptible relative to the reader. This cannot be achieved
using a regular spinlock_t because spinlock_t on PREEMPT_RT does not disable
preemption. In such cases, using seqcount_spinlock_t is the preferred solution.

However, if there is no spinning involved, i.e., if the reader only needs to
detect whether a write has started and does not need to serialize against it,
then using seqcount_t is reasonable.
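
To make the spinning reader concrete, the following user-space sketch models a
plain sequence counter with C11 atomics. It is an illustration only and does
not use the kernel seqlock API; struct toy_seq and the toy_seq_*() helpers are
hypothetical, a single writer is assumed, and the default sequentially
consistent atomics are used for simplicity rather than performance.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Toy sequence counter guarding two values; a single writer is assumed. */
struct toy_seq {
    atomic_uint seq;          /* odd while a write is in progress */
    _Atomic uint64_t a;
    _Atomic uint64_t b;
};

void toy_seq_init(struct toy_seq *s)
{
    atomic_init(&s->seq, 0);
    atomic_init(&s->a, 0);
    atomic_init(&s->b, 0);
}

void toy_seq_write(struct toy_seq *s, uint64_t a, uint64_t b)
{
    atomic_fetch_add(&s->seq, 1);    /* counter becomes odd: write begins */
    atomic_store(&s->a, a);
    atomic_store(&s->b, b);
    atomic_fetch_add(&s->seq, 1);    /* counter becomes even: write done */
}

void toy_seq_read(struct toy_seq *s, uint64_t *a, uint64_t *b)
{
    unsigned int start;

    do {
        /* The spinning part: wait while a writer is active. */
        while ((start = atomic_load(&s->seq)) & 1)
            ;
        *a = atomic_load(&s->a);
        *b = atomic_load(&s->b);
        /* Retry if a write started or finished in the meantime. */
    } while (atomic_load(&s->seq) != start);
}
```

The inner loop is where a preempted writer becomes a problem: a reader that
outranks the writer on the same CPU would spin on the odd counter forever. The
composite types avoid this by letting the reader block on the writer's lock
instead of spinning.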
16
Documentation/core-api/real-time/index.rst
Normal file
@ -0,0 +1,16 @@
.. SPDX-License-Identifier: GPL-2.0

=====================
Real-time preemption
=====================

This documentation is intended for Linux kernel developers and contributors
interested in the inner workings of PREEMPT_RT. It explains key concepts and
the required changes compared to a non-PREEMPT_RT configuration.

.. toctree::
   :maxdepth: 2

   theory
   differences
   architecture-porting
116
Documentation/core-api/real-time/theory.rst
Normal file
@ -0,0 +1,116 @@
.. SPDX-License-Identifier: GPL-2.0

=====================
Theory of operation
=====================

:Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Preface
=======

PREEMPT_RT transforms the Linux kernel into a real-time kernel. It achieves
this by replacing locking primitives, such as spinlock_t, with a preemptible
and priority-inheritance aware implementation known as rtmutex, and by enforcing
the use of threaded interrupts. As a result, the kernel becomes fully
preemptible, with the exception of a few critical code paths, including entry
code, the scheduler, and low-level interrupt handling routines.

This transformation places the majority of kernel execution contexts under the
control of the scheduler and significantly increases the number of preemption
points. Consequently, it reduces the latency between a high-priority task
becoming runnable and its actual execution on the CPU.

Scheduling
==========

The core principles of Linux scheduling and the associated user-space API are
documented in the man page
`sched(7) <https://man7.org/linux/man-pages/man7/sched.7.html>`_.
By default, the Linux kernel uses the SCHED_OTHER scheduling policy. Under
this policy, a task is preempted when the scheduler determines that it has
consumed a fair share of CPU time relative to other runnable tasks. However,
the policy does not guarantee immediate preemption when a new SCHED_OTHER task
becomes runnable. The currently running task may continue executing.

This behavior differs from that of real-time scheduling policies such as
SCHED_FIFO. When a task with a real-time policy becomes runnable, the
scheduler immediately selects it for execution if it has a higher priority than
the currently running task. The task continues to run until it voluntarily
yields the CPU, typically by blocking on an event.
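
As a user-space illustration of the policies described above, the following
sketch queries the valid SCHED_FIFO priority range and tries to switch the
calling thread to SCHED_FIFO. The helper names fifo_priority_valid() and
try_set_fifo() are hypothetical. Switching policy normally requires
CAP_SYS_NICE or a suitable RLIMIT_RTPRIO, so an unprivileged caller should
expect -EPERM.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sched.h>
#include <stdbool.h>

/* Return true if prio is a valid static priority for SCHED_FIFO. */
bool fifo_priority_valid(int prio)
{
    int lo = sched_get_priority_min(SCHED_FIFO);
    int hi = sched_get_priority_max(SCHED_FIFO);

    return lo != -1 && hi != -1 && prio >= lo && prio <= hi;
}

/*
 * Try to move the calling thread to SCHED_FIFO at the given priority.
 * Returns 0 on success or a negative errno value on failure.
 */
int try_set_fifo(int prio)
{
    struct sched_param sp = { .sched_priority = prio };

    if (!fifo_priority_valid(prio))
        return -EINVAL;
    if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1)
        return -errno;
    return 0;
}
```

A task that succeeds with try_set_fifo() runs until it blocks or yields, so a
real application would pair this with a bounded work loop and a blocking wait.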

Sleeping spin locks
===================

The various lock types and their behavior under real-time configurations are
described in detail in Documentation/locking/locktypes.rst.
In a non-PREEMPT_RT configuration, a spinlock_t is acquired by first disabling
preemption and then actively spinning until the lock becomes available. Once
the lock is released, preemption is enabled again. From a real-time
perspective, this approach is undesirable because disabling preemption prevents
the scheduler from switching to a higher-priority task, potentially increasing
latency.

To address this, PREEMPT_RT replaces spinning locks with sleeping spin locks
that do not disable preemption. On PREEMPT_RT, spinlock_t is implemented using
rtmutex. Instead of spinning, a task attempting to acquire a contended lock
disables CPU migration, donates its priority to the lock owner (priority
inheritance), and voluntarily schedules out while waiting for the lock to
become available.

Disabling CPU migration keeps the task on the same CPU while it holds a
sleeping lock, which is the guarantee that disabling preemption used to
provide, while still allowing the task to be preempted.

Priority inheritance
====================

Lock types such as spinlock_t and struct mutex in a PREEMPT_RT enabled kernel
are implemented on top of rtmutex, which provides support for priority
inheritance (PI). When a task blocks on such a lock, the PI mechanism
temporarily propagates the blocked task’s scheduling parameters to the lock
owner.

For example, if a SCHED_FIFO task A blocks on a lock currently held by a
SCHED_OTHER task B, task A’s scheduling policy and priority are temporarily
inherited by task B. After this inheritance, task A is put to sleep while
waiting for the lock, and task B effectively becomes the highest-priority task
in the system. This allows B to continue executing, make progress, and
eventually release the lock.

Once B releases the lock, it reverts to its original scheduling parameters, and
task A can resume execution.
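
User space can request the same behavior through the POSIX threads API by
creating a mutex with the PTHREAD_PRIO_INHERIT protocol. The sketch below is a
user-space illustration rather than kernel code, and pi_mutex_init() is a
hypothetical helper name. It only sets up the lock; observing an actual
priority boost additionally requires threads running under a real-time policy.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>

/*
 * Initialise a mutex that uses the priority-inheritance protocol.
 * Returns 0 on success or a positive error number, as the pthread API does.
 */
int pi_mutex_init(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int ret;

    ret = pthread_mutexattr_init(&attr);
    if (ret)
        return ret;

    /* A waiter lends its priority to the current owner of the mutex. */
    ret = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (!ret)
        ret = pthread_mutex_init(m, &attr);

    pthread_mutexattr_destroy(&attr);
    return ret;
}
```

With such a mutex, the scenario from the example plays out as described: while
task A waits, task B runs with A's priority and drops back once it unlocks.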

Threaded interrupts
===================

Interrupt handlers are another source of code that executes with preemption
disabled and outside the control of the scheduler. To bring interrupt handling
under scheduler control, PREEMPT_RT enforces threaded interrupt handlers.

With forced threading, interrupt handling is split into two stages. The first
stage, the primary handler, is executed in IRQ context with interrupts disabled.
Its sole responsibility is to wake the associated threaded handler. The second
stage, the threaded handler, is the function passed to request_irq() as the
interrupt handler. It runs in process context, scheduled by the kernel.

From waking the interrupt thread until threaded handling is completed, the
interrupt source is masked in the interrupt controller. This ensures that the
device interrupt remains pending but does not retrigger the CPU, allowing the
system to exit IRQ context and handle the interrupt in a scheduled thread.

By default, the threaded handler executes with the SCHED_FIFO scheduling policy
and a priority of 50 (MAX_RT_PRIO / 2), which is midway between the minimum and
maximum real-time priorities.

If the threaded interrupt handler raises any soft interrupts during its
execution, those soft interrupt routines are invoked after the threaded handler
completes, within the same thread. Preemption remains enabled during the
execution of the soft interrupt handler.
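
The two-stage sequence and the masking window can be traced with a small state
model. This is a single-threaded toy written for illustration and not the
genirq implementation; struct irq_line, primary_stage(), threaded_stage() and
count_event() are hypothetical names.

```c
#include <stdbool.h>

/* Toy model of one interrupt line. */
struct irq_line {
    bool masked;           /* masked in the modelled interrupt controller */
    bool thread_pending;   /* the threaded handler has been woken */
    int handled;           /* completed runs of the threaded handler */
};

/*
 * Stage 1, modelled hard-IRQ context: mask the line and wake the thread.
 * Returns false if the line is masked and therefore cannot interrupt.
 */
bool primary_stage(struct irq_line *l)
{
    if (l->masked)
        return false;
    l->masked = true;
    l->thread_pending = true;
    return true;
}

/* Stage 2, modelled IRQ thread: run the driver handler, then unmask. */
void threaded_stage(struct irq_line *l, void (*handler)(void *), void *dev)
{
    if (!l->thread_pending)
        return;
    handler(dev);
    l->thread_pending = false;
    l->handled++;
    l->masked = false;
}

/* Example driver handler: counts events in an int supplied by the caller. */
void count_event(void *dev)
{
    (*(int *)dev)++;
}
```

Between primary_stage() and the end of threaded_stage() the line stays masked,
so a second call to primary_stage() is rejected, mirroring how the device
interrupt stays pending without retriggering the CPU.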

Summary
=======

By using sleeping locks and forced-threaded interrupts, PREEMPT_RT
significantly reduces sections of code where interrupts or preemption are
disabled, allowing the scheduler to preempt the current execution context and
switch to a higher-priority task.