Linux kernel source tree
Go to file
Shiju Jose 588ca944c2 cxl/edac: Add CXL memory device memory sparing control feature
Memory sparing is defined as a repair function that replaces a portion of
memory with a portion of functional memory at that same DPA. The subclasses
for this operation vary in terms of the scope of the sparing being
performed. The cacheline sparing subclass refers to a sparing action that
can replace a full cacheline. Row sparing is provided as an alternative to
PPR sparing functions and its scope is that of a single DDR row.
As per CXL r3.2 Table 8-125 foot note 1. Memory sparing is preferred over
PPR when possible.
Bank sparing allows an entire bank to be replaced. Rank sparing is defined
as an operation in which an entire DDR rank is replaced.

Memory sparing maintenance operations may be supported by CXL devices
that implement CXL.mem protocol. A sparing maintenance operation requests
the CXL device to perform a repair operation on its media.
For example, a CXL device with DRAM components that support memory sparing
features may implement sparing maintenance operations.

The host may issue a query command by setting query resources flag in the
input payload (CXL spec 3.2 Table 8-120) to determine availability of
sparing resources for a given address. In response to a query request,
the device shall report the resource availability by producing the memory
sparing event record (CXL spec 3.2 Table 8-60) in which the Channel, Rank,
Nibble Mask, Bank Group, Bank, Row, Column, Sub-Channel fields are a copy
of the values specified in the request.

During the execution of a sparing maintenance operation, a CXL memory
device:
- may not retain data
- may not be able to process CXL.mem requests correctly.
These CXL memory device capabilities are specified by restriction flags
in the memory sparing feature readable attributes.

When a CXL device identifies error on a memory component, the device
may inform the host about the need for a memory sparing maintenance
operation by using DRAM event record, where the 'maintenance needed' flag
may set. The event record contains some of the DPA, Channel, Rank,
Nibble Mask, Bank Group, Bank, Row, Column, Sub-Channel fields that
should be repaired. The userspace tool requests for maintenance operation
if the 'maintenance needed' flag set in the CXL DRAM error record.

CXL spec 3.2 section 8.2.10.7.1.4 describes the device's memory sparing
maintenance operation feature.

CXL spec 3.2 section 8.2.10.7.2.3 describes the memory sparing feature
discovery and configuration.

Add support for controlling CXL memory device memory sparing feature.
Register with EDAC driver, which gets the memory repair attr descriptors
from the EDAC memory repair driver and exposes sysfs repair control
attributes for memory sparing to the userspace. For example CXL memory
sparing control for the CXL mem0 device is exposed in
/sys/bus/edac/devices/cxl_mem0/mem_repairX/

Use case
========
1. CXL device identifies a failure in a memory component, report to
   userspace in a CXL DRAM trace event with DPA and other attributes of
   memory to repair such as channel, rank, nibble mask, bank Group,
   bank, row, column, sub-channel.

2. Rasdaemon process the trace event and may issue query request in sysfs
check resources available for memory sparing if either of the following
conditions met.
 - 'maintenance needed' flag set in the event record.
 - 'threshold event' flag set for CVME threshold feature.
 - When the number of corrected error reported on a CXL.mem media to the
   userspace exceeds the threshold value for corrected error count defined
   by the userspace policy.

3. Rasdaemon process the memory sparing trace event and issue repair
   request for memory sparing.

Kernel CXL driver shall report memory sparing event record to the userspace
with the resource availability in order rasdaemon to process the event
record and issue a repair request in sysfs for the memory sparing operation
in the CXL device.

Note: Based on the feedbacks from the community 'query' sysfs attribute is
removed and reporting memory sparing error record to the userspace are not
supported. Instead userspace issues sparing operation and kernel does the
same to the CXL memory device, when 'maintenance needed' flag set in the
DRAM event record.

Add checks to ensure the memory to be repaired is offline and if online,
then originates from a CXL DRAM error record reported in the current boot
before requesting a memory sparing operation on the device.

Note: Tested memory sparing feature control with QEMU patch
      "hw/cxl: Add emulation for memory sparing control feature"
      https://lore.kernel.org/linux-cxl/20250509172229.726-1-shiju.jose@huawei.com/T/#m5f38512a95670d75739f9dad3ee91b95c7f5c8d6

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Link: https://patch.msgid.link/20250521124749.817-8-shiju.jose@huawei.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-05-23 13:24:53 -07:00
arch Misc fixes: 2025-04-26 09:45:54 -07:00
block block-6.15-20250424 2025-04-25 11:34:39 -07:00
certs sign-file,extract-cert: use pkcs11 provider for OPENSSL MAJOR >= 3 2024-09-20 19:52:48 +03:00
crypto crypto: scomp - Fix off-by-one bug when calculating last page 2025-04-23 09:32:57 +08:00
Documentation cxl/edac: Add CXL memory device memory sparing control feature 2025-05-23 13:24:53 -07:00
drivers cxl/edac: Add CXL memory device memory sparing control feature 2025-05-23 13:24:53 -07:00
fs vfs-6.15-rc4.fixes 2025-04-25 15:57:21 -07:00
include cxl/edac: Add CXL memory device memory sparing control feature 2025-05-23 13:24:53 -07:00
init Kconfig: switch CONFIG_SYSFS_SYCALL default to n 2025-04-15 10:28:35 +02:00
io_uring io_uring: fix 'sync' handling of io_fallback_tw() 2025-04-24 10:32:43 -06:00
ipc treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
kernel Fix sporadic crashes in dequeue_entities() due to ... bad math. 2025-04-26 09:23:20 -07:00
lib hardening fixes for v6.15-rc3 2025-04-18 13:20:20 -07:00
LICENSES LICENSES: add 0BSD license text 2024-09-01 20:43:24 -07:00
mm mm/migrate: fix sleep in atomic for large folios and buffer heads 2025-04-22 18:16:08 +02:00
net nfsd-6.15 fixes: 2025-04-26 10:43:03 -07:00
rust Driver core fixes for 6.15-rc4 2025-04-25 10:02:59 -07:00
samples samples/bpf: Fix compilation failure for samples/bpf on LoongArch Fedora 2025-04-25 09:32:02 -07:00
scripts Fix mis-uses of 'cc-option' for warning disablement 2025-04-23 10:08:29 -07:00
security Landlock fix for v6.15-rc4 2025-04-24 12:59:05 -07:00
sound ASoC: Fixes for v6.15 2025-04-11 15:51:19 +02:00
tools cxl/edac: Add CXL memory device patrol scrub control feature 2025-05-23 13:24:09 -07:00
usr kbuild: hdrcheck: fix cross build with clang 2025-03-05 04:06:45 +09:00
virt ARM: 2025-04-08 13:47:55 -07:00
.clang-format clang-format: Update the ForEachMacros list for v6.15-rc1 2025-04-13 11:03:59 +02:00
.clippy.toml rust: give Clippy the minimum supported Rust version 2025-01-10 00:17:25 +01:00
.cocciconfig
.editorconfig .editorconfig: remove trim_trailing_whitespace option 2024-06-13 16:47:52 +02:00
.get_maintainer.ignore MAINTAINERS: Retire Ralf Baechle 2024-11-12 15:48:59 +01:00
.gitattributes .gitattributes: set diff driver for Rust source code files 2023-05-31 17:48:25 +02:00
.gitignore kbuild: Create intermediate vmlinux build with relocations preserved 2025-03-17 00:29:50 +09:00
.mailmap sound fixes for 6.15-rc3 2025-04-17 10:14:51 -07:00
.rustfmt.toml rust: add .rustfmt.toml 2022-09-28 09:02:20 +02:00
COPYING
CREDITS MAINTAINERS: update SLAB ALLOCATOR maintainers 2025-04-17 20:10:06 -07:00
Kbuild drm: ensure drm headers are self-contained and pass kernel-doc 2025-02-12 10:44:43 +02:00
Kconfig io_uring: Rename KConfig to Kconfig 2025-02-19 14:53:27 -07:00
MAINTAINERS pci-v6.15-fixes-3 2025-04-26 13:02:36 -07:00
Makefile Linux 6.15-rc4 2025-04-27 15:19:23 -07:00
README README: Fix spelling 2024-03-18 03:36:32 -06:00

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the reStructuredText markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.