Linux kernel source tree
Raag Jadav b7cf9f4ac1
drm: Introduce device wedged event
Introduce device wedged event, which notifies userspace of the 'wedged'
(hung/unusable) state of the DRM device through a uevent. This is
useful especially in cases where the device is no longer operating as
expected and has become unrecoverable from driver context. The purpose of
this implementation is to provide drivers a generic way to recover the
device with the help of userspace intervention without taking any drastic
measures (like resetting or re-enumerating the full bus, on which the
underlying physical device is sitting) in the driver.

A 'wedged' device is basically a device that is declared dead by the
driver after exhausting all possible attempts to recover it from driver
context. The uevent is the notification that is sent to userspace along
with a hint about what could possibly be attempted to recover the device
from userspace and bring it back to a usable state. Different drivers may
have different ideas of a 'wedged' device depending on the hardware
implementation of the underlying physical device, hence the
vendor-agnostic nature of the event. It is up to the drivers to decide
when they see the need for device recovery and which of the available
methods to use.

Driver prerequisites
--------------------

The driver, before opting for recovery, needs to make sure that the
'wedged' device doesn't harm the system as a whole by taking care of the
prerequisites. Necessary actions must include disabling DMA to system
memory as well as any communication channels with other devices. Further,
the driver must ensure that all dma_fences are signalled and any device
state that the core kernel might depend on is cleaned up. All existing
mmaps should be invalidated and page faults should be redirected to a
dummy page. Once the event is sent, the device must be kept in 'wedged'
state until the recovery is performed. New accesses to the device
(IOCTLs) should be rejected, preferably with an error code that resembles
the type of failure the device has encountered. This will signify the
reason for wedging, which can be reported to the application if needed.

Recovery
--------

The current implementation defines three recovery methods, of which
drivers can use any one, multiple or none. Method(s) of choice will be
sent in the uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in
order of fewer to more side effects. If the driver is unsure about recovery
or the method is unknown (like soft/hard system reboot, firmware flashing,
physical device replacement or any other procedure which can't be
attempted on the fly), ``WEDGED=unknown`` will be sent instead.

Userspace consumers can parse this event and attempt recovery as per the
following expectations.

    =============== ========================================
    Recovery method Consumer expectations
    =============== ========================================
    none            optional telemetry collection
    rebind          unbind + bind driver
    bus-reset       unbind + bus reset/re-enumeration + bind
    unknown         consumer policy
    =============== ========================================

The only exception to this is ``WEDGED=none``, which signifies that the
device was temporarily 'wedged' at some point but was recovered from driver
context using device-specific methods like reset. No explicit recovery is
expected from the consumer in this case, but it can still take additional
steps like gathering telemetry information (devcoredump, syslog). This is
useful because the first hang is usually the most critical one and can
lead to further hangs or complete wedging.
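
A consumer could interpret the ``WEDGED`` value along the following lines.
This is only a minimal sketch: the function names and the action strings
they print are hypothetical and not part of the uevent interface. It relies
on the ordering described in this section, so the first listed method is
taken to be the least intrusive one.

```shell
#!/bin/sh

# Print the first recovery method listed in a WEDGED value.
# Methods are ordered from fewest to most side effects, so the
# first entry is the least intrusive one.
first_method() {
    printf '%s\n' "${1%%,*}"
}

# Map a WEDGED value to a consumer action (action names are
# illustrative only).
wedged_action() {
    case "$(first_method "$1")" in
        none)      echo "collect-telemetry" ;;
        rebind)    echo "rebind" ;;
        bus-reset) echo "bus-reset" ;;
        *)         echo "policy" ;;   # 'unknown' or anything unrecognised
    esac
}
```

For example, ``wedged_action "rebind,bus-reset"`` selects ``rebind``, while
an unrecognised value falls through to consumer policy.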

Consumer prerequisites
----------------------

It is the responsibility of the consumer to make sure that the device or
its resources are not in use by any process before attempting recovery.
With IOCTLs erroring out, all device memory should be unmapped and file
descriptors should be closed to prevent leaks or undefined behaviour. The
idea here is to clear the device of all user context beforehand and set
the stage for a clean recovery.
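
One possible way to check whether a device node is still held open is to
scan the file descriptor links under ``/proc``. The sketch below is an
assumption about how a consumer might do this, not a required procedure.
The ``device_in_use`` helper is hypothetical; it inspects only open file
descriptors, not memory mappings, and a real consumer would have to repeat
the check for every node that belongs to the device.

```shell
#!/bin/sh

# Return 0 if any process has the given device node open, 1 otherwise.
# $1: device node path, e.g. /dev/dri/card0
# $2: proc root (defaults to /proc; a parameter only to ease testing)
device_in_use() {
    node=$1
    proc_root=${2:-/proc}
    for fd in "$proc_root"/[0-9]*/fd/*; do
        # Skip unmatched globs and entries we are not allowed to read.
        [ -L "$fd" ] || continue
        if [ "$(readlink "$fd")" = "$node" ]; then
            return 0
        fi
    done
    return 1
}
```

A recovery script could call ``device_in_use /dev/dri/card0`` and postpone
recovery while it returns success.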

Example
-------

Udev rule::

    SUBSYSTEM=="drm", ENV{WEDGED}=="rebind", DEVPATH=="*/drm/card[0-9]", \
    RUN+="/path/to/rebind.sh $env{DEVPATH}"

Recovery script::

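    #!/bin/sh
    # $1 is the DEVPATH of the DRM card, as passed by the udev rule.

    DEVPATH=$(readlink -f "/sys/$1/device")   # underlying physical device
    DEVICE=$(basename "$DEVPATH")             # device name on its bus
    DRIVER=$(readlink -f "$DEVPATH/driver")   # driver bound to the device

    # Unbind and then bind the driver again to recover the device.
    printf '%s' "$DEVICE" > "$DRIVER/unbind"
    printf '%s' "$DEVICE" > "$DRIVER/bind"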
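
For ``bus-reset``, the text above does not prescribe a procedure. As a
rough sketch, and assuming the device sits on a PCI bus, re-enumeration
could be attempted through the PCI sysfs ``remove`` and ``rescan``
attributes. The ``pci_reenumerate`` helper and its arguments are
hypothetical. The driver is normally bound again when the device is
rediscovered, so no explicit bind step is shown.

```shell
#!/bin/sh

# Hypothetical helper for PCI devices only.
# $1: sysfs path of the device, e.g. /sys/bus/pci/devices/<address>
# $2: PCI rescan attribute (defaults to /sys/bus/pci/rescan)
pci_reenumerate() {
    dev=$1
    rescan=${2:-/sys/bus/pci/rescan}
    name=$(basename "$dev")
    driver=$(readlink -f "$dev/driver")

    # Unbind the driver, drop the device from the bus, then ask the
    # bus to rediscover it.
    printf '%s' "$name" > "$driver/unbind" || return 1
    printf '1' > "$dev/remove" || return 1
    printf '1' > "$rescan"
}
```

Other buses would need their own equivalent, if one exists.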

Customization
-------------

Although basic recovery is possible with a simple script, consumers can
define custom policies around recovery. For example, if the driver supports
multiple recovery methods, consumers can opt for the suitable one depending
on scenarios like repeat offences or vendor-specific failures. Consumers
can also choose to have the device available for debugging or telemetry
collection and base their recovery decision on the findings. This is useful
especially when the driver is unsure about recovery or the method is unknown.
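
As an illustration of such a policy, the sketch below escalates through the
advertised methods on repeat offences: the first event uses the first (least
intrusive) method, and each further event moves one step along the list,
staying on the last method once the list is exhausted. The ``choose_method``
helper is hypothetical, and how the consumer counts previous attempts is
left open here.

```shell
#!/bin/sh

# choose_method <WEDGED value> <number of previous recovery attempts>
# Prints the method to try next.
choose_method() {
    methods=$1
    attempts=$2
    i=0
    chosen=
    old_ifs=$IFS
    IFS=,                       # split the method list on commas
    for m in $methods; do
        chosen=$m
        if [ "$i" -ge "$attempts" ]; then
            break
        fi
        i=$((i + 1))
    done
    IFS=$old_ifs
    printf '%s\n' "$chosen"
}
```

With ``WEDGED=rebind,bus-reset``, this picks ``rebind`` for the first event
and ``bus-reset`` for every later one.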

 v4: s/drm_dev_wedged/drm_dev_wedged_event
     Use drm_info() (Jani)
     Kernel doc adjustment (Aravind)
 v5: Send recovery method with uevent (Lina)
 v6: Access wedge_recovery_opts[] using helper function (Jani)
     Use snprintf() (Jani)
 v7: Convert recovery helpers into regular functions (Andy, Jani)
     Aesthetic adjustments (Andy)
     Handle invalid recovery method
 v8: Allow sending multiple methods with uevent (Lucas, Michal)
     static_assert() globally (Andy)
 v9: Provide 'none' method for device reset (Christian)
     Provide recovery opts using switch cases
v11: Log device reset (André)

Signed-off-by: Raag Jadav <raag.jadav@intel.com>
Reviewed-by: André Almeida <andrealmeid@igalia.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250204070528.1919158-2-raag.jadav@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-02-13 12:15:43 -05:00
arch Merge drm/drm-next into drm-misc-next 2025-02-06 13:47:32 +01:00
block block-6.14-20250131 2025-01-31 11:49:30 -08:00
certs sign-file,extract-cert: use pkcs11 provider for OPENSSL MAJOR >= 3 2024-09-20 19:52:48 +03:00
crypto treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
Documentation dt-bindings: display: bridge: sn65dsi83: Add interrupt 2025-02-13 16:17:38 +01:00
drivers drm: Introduce device wedged event 2025-02-13 12:15:43 -05:00
fs assorted stuff for this merge window 2025-02-01 15:07:56 -08:00
include drm: Introduce device wedged event 2025-02-13 12:15:43 -05:00
init Kbuild updates for v6.14 2025-01-31 12:07:07 -08:00
io_uring io_uring-6.14-20250131 2025-01-31 11:29:23 -08:00
ipc treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
kernel 21 hotfixes. 8 are cc:stable and the remainder address post-6.13 issues. 2025-02-01 09:49:20 -08:00
lib 21 hotfixes. 8 are cc:stable and the remainder address post-6.13 issues. 2025-02-01 09:49:20 -08:00
LICENSES LICENSES: add 0BSD license text 2024-09-01 20:43:24 -07:00
mm assorted stuff for this merge window 2025-02-01 15:07:56 -08:00
net assorted stuff for this merge window 2025-02-01 15:07:56 -08:00
rust Kbuild updates for v6.14 2025-01-31 12:07:07 -08:00
samples AT_EXECVE_CHECK update for v6.14-rc1 (fix1) 2025-01-31 17:12:31 -08:00
scripts 21 hotfixes. 8 are cc:stable and the remainder address post-6.13 issues. 2025-02-01 09:49:20 -08:00
security treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
sound Merge drm/drm-next into drm-misc-next 2025-02-06 13:47:32 +01:00
tools Turbostat 2025.02.02 updates since 2024.11.30 2025-02-02 10:49:13 -08:00
usr kbuild: Drop support for include/asm-<arch> in headers_check.pl 2024-12-21 11:43:17 +09:00
virt Merge branch 'kvm-mirror-page-tables' into HEAD 2025-01-20 07:15:58 -05:00
.clang-format clang-format: Update with v6.11-rc1's for_each macro list 2024-08-02 13:20:31 +02:00
.clippy.toml rust: give Clippy the minimum supported Rust version 2025-01-10 00:17:25 +01:00
.cocciconfig
.editorconfig .editorconfig: remove trim_trailing_whitespace option 2024-06-13 16:47:52 +02:00
.get_maintainer.ignore MAINTAINERS: Retire Ralf Baechle 2024-11-12 15:48:59 +01:00
.gitattributes .gitattributes: set diff driver for Rust source code files 2023-05-31 17:48:25 +02:00
.gitignore rust: use host dylib naming convention to support macOS 2025-01-10 01:01:24 +01:00
.mailmap 21 hotfixes. 8 are cc:stable and the remainder address post-6.13 issues. 2025-02-01 09:49:20 -08:00
.rustfmt.toml rust: add .rustfmt.toml 2022-09-28 09:02:20 +02:00
COPYING
CREDITS Merge drm/drm-next into drm-misc-next 2025-02-06 13:47:32 +01:00
Kbuild drm: ensure drm headers are self-contained and pass kernel-doc 2025-02-12 10:44:43 +02:00
Kconfig
MAINTAINERS drm/i2c: move TDA998x driver under drivers/gpu/drm/bridge 2025-02-13 00:19:41 +02:00
Makefile Linux 6.14-rc1 2025-02-02 15:39:26 -08:00
README README: Fix spelling 2024-03-18 03:36:32 -06:00

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the reStructuredText markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result from upgrading your kernel.