mirror of
https://github.com/torvalds/linux.git
synced 2026-05-29 17:43:52 +02:00
Add support for decoding NVIDIA-specific CPER sections delivered via the
APEI GHES vendor record notifier chain. NVIDIA hardware generates
vendor-specific CPER sections containing error signatures and diagnostic
register dumps. This implementation registers a notifier_block with the
GHES vendor record notifier and decodes these sections, printing error
details via dev_info().

The driver binds to ACPI device NVDA2012, present on NVIDIA server
platforms. The NVIDIA CPER section contains a fixed header with error
metadata (signature, error type, severity, socket) followed by
variable-length register address-value pairs for hardware diagnostics.

This work is based on libcper [1].

Example output:

  nvidia-ghes NVDA2012:00: NVIDIA CPER section, error_data_length: 544
  nvidia-ghes NVDA2012:00: signature: CMET-INFO
  nvidia-ghes NVDA2012:00: error_type: 0
  nvidia-ghes NVDA2012:00: error_instance: 0
  nvidia-ghes NVDA2012:00: severity: 3
  nvidia-ghes NVDA2012:00: socket: 0
  nvidia-ghes NVDA2012:00: number_regs: 32
  nvidia-ghes NVDA2012:00: instance_base: 0x0000000000000000
  nvidia-ghes NVDA2012:00: register[0]: address=0x8000000100000000 value=0x0000000100000000

https://github.com/openbmc/libcper/commit/683e055061ce [1]

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Kai-Heng Feng <kaihengf@nvidia.com>
[ rjw: Changelog edits ]
Link: https://patch.msgid.link/20260330094203.38022-4-kaihengf@nvidia.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
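The section layout described above (a fixed header followed by register address-value pairs) can be sketched in userspace C. This is an illustration only, not the driver's code: the struct names, the helper names (`nv_cper_section_len`, `nv_cper_parse`, `nv_cper_get_reg`), and the exact field widths are assumptions modeled on libcper's NVIDIA section, and the decode assumes a little-endian host. Under those assumptions the header is 32 bytes and each register pair is 16 bytes, so 32 registers give 32 + 32 * 16 = 544 bytes, which agrees with the error_data_length in the example output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed header layout (hypothetical names, field widths per libcper). */
struct nv_cper_header {
	char     signature[16];   /* e.g. "CMET-INFO", NUL padded */
	uint16_t error_type;
	uint16_t error_instance;
	uint8_t  severity;
	uint8_t  socket;
	uint8_t  number_regs;     /* count of address-value pairs that follow */
	uint8_t  reserved;
	uint64_t instance_base;
};

/* One diagnostic register entry following the header. */
struct nv_cper_reg {
	uint64_t address;
	uint64_t value;
};

/* The sketch relies on the header having no padding. */
_Static_assert(sizeof(struct nv_cper_header) == 32, "unexpected header size");
_Static_assert(sizeof(struct nv_cper_reg) == 16, "unexpected register size");

/* Total section length implied by a given register count. */
static size_t nv_cper_section_len(uint8_t number_regs)
{
	return sizeof(struct nv_cper_header) +
	       (size_t)number_regs * sizeof(struct nv_cper_reg);
}

/*
 * Copy the header out of buf and verify that buf is long enough to hold
 * every register pair the header announces.
 * Returns 0 on success, -1 if the buffer is truncated.
 */
static int nv_cper_parse(const uint8_t *buf, size_t len,
			 struct nv_cper_header *hdr)
{
	assert(buf != NULL && hdr != NULL);

	if (len < sizeof(*hdr))
		return -1;
	memcpy(hdr, buf, sizeof(*hdr));
	if (len < nv_cper_section_len(hdr->number_regs))
		return -1;
	return 0;
}

/* Copy register pair idx out of buf. Returns -1 if it lies past len. */
static int nv_cper_get_reg(const uint8_t *buf, size_t len, unsigned int idx,
			   struct nv_cper_reg *reg)
{
	size_t off = sizeof(struct nv_cper_header) +
		     (size_t)idx * sizeof(*reg);

	if (off + sizeof(*reg) > len)
		return -1;
	memcpy(reg, buf + off, sizeof(*reg));
	return 0;
}
```

Checking the announced register count against the buffer length before reading any pair is the important step here, because the count comes from firmware-supplied data and determines how far past the header the decoder reads.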
99 lines
3.1 KiB
Plaintext
# SPDX-License-Identifier: GPL-2.0
config HAVE_ACPI_APEI
	bool

config HAVE_ACPI_APEI_NMI
	bool

config ACPI_APEI
	bool "ACPI Platform Error Interface (APEI)"
	select MISC_FILESYSTEMS
	select PSTORE
	select UEFI_CPER
	depends on HAVE_ACPI_APEI
	help
	  APEI allows errors (for example, from the chipset) to be
	  reported to the operating system. This improves NMI handling
	  in particular. In addition, it supports error serialization
	  and error injection.

config ACPI_APEI_GHES
	bool "APEI Generic Hardware Error Source"
	depends on ACPI_APEI
	select ACPI_HED
	select IRQ_WORK
	select GENERIC_ALLOCATOR
	select ARM_SDE_INTERFACE if ARM64
	help
	  Generic Hardware Error Source provides a way to report
	  platform hardware errors (such as those from the chipset). It
	  works in so-called "Firmware First" mode, that is, hardware
	  errors are reported to firmware first, then reported to
	  Linux by firmware. This way, non-standard hardware error
	  registers or non-standard hardware links can be checked
	  by firmware to produce more valuable hardware error
	  information for Linux.

config ACPI_APEI_PCIEAER
	bool "APEI PCIe AER logging/recovering support"
	depends on ACPI_APEI && PCIEAER
	help
	  PCIe AER errors may be reported via APEI firmware first mode.
	  Turn on this option to enable the corresponding support.

config ACPI_APEI_SEA
	bool
	depends on ARM64 && ACPI_APEI_GHES
	default y

config ACPI_APEI_MEMORY_FAILURE
	bool "APEI memory error recovering support"
	depends on ACPI_APEI && MEMORY_FAILURE
	help
	  Memory errors may be reported via APEI firmware first mode.
	  Turn on this option to enable the memory recovering support.

config ACPI_APEI_EINJ
	tristate "APEI Error INJection (EINJ)"
	depends on ACPI_APEI && DEBUG_FS
	help
	  EINJ provides a hardware error injection mechanism. It is
	  mainly used for debugging and testing the other parts of
	  APEI and some other RAS features.

config ACPI_APEI_EINJ_CXL
	bool "CXL Error INJection Support"
	default ACPI_APEI_EINJ
	depends on ACPI_APEI_EINJ
	depends on CXL_BUS && CXL_BUS <= ACPI_APEI_EINJ
	help
	  Support for CXL protocol Error INJection through debugfs/cxl.
	  Availability and the set of supported errors depend on
	  the host platform. See ACPI v6.5 section 18.6.4 and the kernel
	  EINJ documentation for more information.

	  If unsure, say 'n'.

config ACPI_APEI_GHES_NVIDIA
	tristate "NVIDIA GHES vendor record handler"
	depends on ACPI_APEI_GHES
	help
	  Support for decoding NVIDIA-specific CPER sections delivered via
	  the APEI GHES vendor record notifier chain. Registers a handler
	  for the NVIDIA section GUID and logs error signatures, severity,
	  socket, and diagnostic register address-value pairs.

	  Enable on NVIDIA server platforms (e.g. DGX, HGX) that expose
	  ACPI device NVDA2012 in their firmware tables.

	  If unsure, say N.

config ACPI_APEI_ERST_DEBUG
	tristate "APEI Error Record Serialization Table (ERST) Debug Support"
	depends on ACPI_APEI
	help
	  ERST is a way provided by APEI to save and retrieve hardware
	  error information to and from a persistent store. Enable this
	  if you want to debug and test the ERST kernel support
	  and firmware implementation.