linux/mm
Shuai Xue aaf99ac2ce mm/hwpoison: do not send SIGBUS to processes with recovered clean pages
When an uncorrected memory error is consumed there is a race between the
CMCI from the memory controller reporting an uncorrected error with a UCNA
signature, and the core reporting an SRAR signature machine check when
the data is about to be consumed.

- Background: why *UN*corrected errors are tied to *C*MCI on Intel platforms [1]

Prior to Icelake memory controllers reported patrol scrub events that
detected a previously unseen uncorrected error in memory by signaling a
broadcast machine check with an SRAO (Software Recoverable Action
Optional) signature in the machine check bank.  This was overkill because
it's not an urgent problem when no core is on the verge of consuming that
bad data.  It was also found that multiple SRAO UCEs may cause nested MCE
interrupts and eventually escalate to an IERR.

Hence, Intel downgraded the machine check bank signature of patrol scrub
from SRAO to UCNA (Uncorrected, No Action required), and changed the signal
to #CMCI.  Just to add to the confusion, Linux does take an action (in
uc_decode_notifier()) to try to offline the page despite the UC*NA*
signature name.

- Background: why #CMCI and #MCE race when poison is being consumed on Intel platforms [1]

Having decided that CMCI/UCNA is the best action for patrol scrub errors,
the memory controller uses it for reads too.  But the memory controller is
executing asynchronously from the core, and can't tell the difference
between a "real" read and a speculative read.  So it will do CMCI/UCNA if
an error is found in any read.

Thus:

1) Core is clever and thinks address A is needed soon, issues a speculative read.
2) Core finds it is going to use address A soon after sending the read request.
3) The CMCI from the memory controller is in a race with the MCE from the core
   that will soon try to retire the load from address A.

Quite often (because speculation has got better) the CMCI from the memory
controller is delivered before the core is committed to the instruction
reading address A, so the interrupt is taken, and Linux offlines the page
(marking it as poison).

- Why the user process is killed in the instr case

Commit 046545a661 ("mm/hwpoison: fix error page recovered but reported
"not recovered"") tried to fix the noisy "Memory error not recovered"
message and to skip duplicate SIGBUSs caused by the race.  But it also
introduced a bug: kill_accessing_process() returns -EHWPOISON for the instr
case and, as a result, kill_me_maybe() sends a SIGBUS to the user process.

If the CMCI wins that race, the page is marked poisoned when
uc_decode_notifier() calls memory_failure().  For dirty pages,
memory_failure() invokes try_to_unmap() with the TTU_HWPOISON flag,
converting the PTE to a hwpoison entry.  As a result,
kill_accessing_process():

- calls walk_page_range(), which returns 1 regardless of whether
  try_to_unmap() succeeds or fails,
- calls kill_proc() to make sure a SIGBUS is sent,
- returns -EHWPOISON to indicate that SIGBUS has already been sent to the
  process and kill_me_maybe() doesn't have to send it again.

However, for clean pages, the TTU_HWPOISON flag is cleared, leaving the
PTE unchanged and not converted to a hwpoison entry.  Because the PTE
entries of such clean pages are not marked as hwpoison,
kill_accessing_process() returns -EFAULT, causing kill_me_maybe() to send
a SIGBUS.

Console log looks like this:

    Memory failure: 0x827ca68: corrupted page was clean: dropped without side effects
    Memory failure: 0x827ca68: recovery action for clean LRU page: Recovered
    Memory failure: 0x827ca68: already hardware poisoned
    mce: Memory error not recovered

To fix it, return 0 for "corrupted page was clean", preventing an
unnecessary SIGBUS to the user process.

[1] https://lore.kernel.org/lkml/20250217063335.22257-1-xueshuai@linux.alibaba.com/T/#mba94f1305b3009dd340ce4114d3221fe810d1871
Link: https://lkml.kernel.org/r/20250312112852.82415-3-xueshuai@linux.alibaba.com
Fixes: 046545a661 ("mm/hwpoison: fix error page recovered but reported "not recovered"")
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ruidong Tian <tianruidong@linux.alibaba.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yazen Ghannam <yazen.ghannam@amd.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17 22:07:05 -07:00
..
damon mm/damon/sysfs-schemes: avoid Wformat-security warning on damon_sysfs_access_pattern_add_range_dir() 2025-03-17 22:07:01 -07:00
kasan kasan: don't call find_vm_area() in a PREEMPT_RT kernel 2025-02-17 22:40:04 -08:00
kfence kfence: skip __GFP_THISNODE allocations on NUMA systems 2025-02-01 03:53:26 -08:00
kmsan dma: kmsan: export kmsan_handle_dma() for modules 2025-03-05 21:36:14 -08:00
backing-dev.c
balloon_compaction.c
bootmem_info.c mm/sparse: allow for alternate vmemmap section init at boot 2025-03-16 22:06:27 -07:00
cma_debug.c mm, cma: support multiple contiguous ranges, if requested 2025-03-16 22:06:25 -07:00
cma_sysfs.c mm/cma: export total and free number of pages for CMA areas 2025-03-16 22:06:24 -07:00
cma.c mm/cma: introduce interface for early reservations 2025-03-16 22:06:30 -07:00
cma.h mm/cma: introduce interface for early reservations 2025-03-16 22:06:30 -07:00
compaction.c mm/page_alloc: clarify terminology in migratetype fallback code 2025-03-17 00:05:35 -07:00
debug_page_alloc.c
debug_page_ref.c
debug_vm_pgtable.c mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries 2024-09-17 01:07:01 -07:00
debug.c mm: move _pincount in folio to page[2] on 32bit 2025-03-17 22:06:44 -07:00
dmapool_test.c
dmapool.c
early_ioremap.c mm/early_ioremap: add null pointer checks to prevent NULL-pointer dereference 2025-01-13 22:40:59 -08:00
execmem.c alloc_tag: populate memory for module tags as needed 2024-11-07 14:25:16 -08:00
fadvise.c fdget(), trivial conversions 2024-11-03 01:28:06 -05:00
fail_page_alloc.c
failslab.c
filemap.c mm/filemap: use xas_try_split() in __filemap_add_folio() 2025-03-17 22:07:01 -07:00
folio-compat.c mm/writeback: add folio_mark_dirty_lock() 2024-11-05 11:14:32 +01:00
gup_test.c
gup_test.h
gup.c mm: move _pincount in folio to page[2] on 32bit 2025-03-17 22:06:44 -07:00
highmem.c
hmm.c mm: allow compound zone device pages 2025-03-17 22:06:39 -07:00
huge_memory.c mm/truncate: use folio_split() in truncate operation 2025-03-17 22:07:00 -07:00
hugetlb_cgroup.c page_counter: track failcnt only for legacy cgroups 2025-03-17 00:05:35 -07:00
hugetlb_cma.c mm/hugetlb: move hugetlb CMA code in to its own file 2025-03-16 22:06:31 -07:00
hugetlb_cma.h mm/hugetlb: move hugetlb CMA code in to its own file 2025-03-16 22:06:31 -07:00
hugetlb_vmemmap.c mm/hugetlb: do pre-HVO for bootmem allocated pages 2025-03-16 22:06:29 -07:00
hugetlb_vmemmap.h mm/hugetlb: do pre-HVO for bootmem allocated pages 2025-03-16 22:06:29 -07:00
hugetlb.c mm/hugetlb: update nr_huge_pages and surplus_huge_pages together 2025-03-17 22:06:50 -07:00
hwpoison-inject.c
init-mm.c mm: replace vm_lock and detached flag with a reference count 2025-03-16 22:06:20 -07:00
internal.h arch, mm: make releasing of memory to page allocator more explicit 2025-03-17 22:06:53 -07:00
interval_tree.c
io-mapping.c
ioremap.c mm/ioremap: pass pgprot_t to ioremap_prot() instead of unsigned long 2025-03-16 22:06:23 -07:00
Kconfig mm: CONFIG_NO_PAGE_MAPCOUNT to prepare for not maintain per-page mapcounts in large folios 2025-03-17 22:06:46 -07:00
Kconfig.debug mm: rename GENERIC_PTDUMP and PTDUMP_CORE 2025-03-17 00:05:32 -07:00
khugepaged.c mm: convert folio_likely_mapped_shared() to folio_maybe_mapped_shared() 2025-03-17 22:06:46 -07:00
kmemleak.c mm: kmemleak: add support for dumping physical and __percpu object info 2025-03-16 22:06:08 -07:00
ksm.c mm/ksm: handle device-exclusive entries correctly in write_protect_page() 2025-03-16 22:05:58 -07:00
list_lru.c mm/list_lru: make the case where mlru is NULL as unlikely 2025-03-17 00:05:32 -07:00
maccess.c kasan: migrate copy_user_test to kunit 2024-11-11 00:26:44 -08:00
madvise.c mm/madvise: remove len parameter of madvise_do_behavior() 2025-03-17 22:07:04 -07:00
Makefile mm: rename GENERIC_PTDUMP and PTDUMP_CORE 2025-03-17 00:05:32 -07:00
mapping_dirty_helpers.c
memblock.c arch, mm: streamline HIGHMEM freeing 2025-03-17 22:06:53 -07:00
memcontrol-v1.c mm: memcontrol: move memsw charge callbacks to v1 2025-03-16 22:05:55 -07:00
memcontrol-v1.h mm: memcontrol: move memsw charge callbacks to v1 2025-03-16 22:05:55 -07:00
memcontrol.c memcg: bypass root memcg check for skmem charging 2025-03-17 00:05:36 -07:00
memfd.c mm/memfd: fix spelling and grammatical issues 2025-03-16 22:06:04 -07:00
memory_hotplug.c hwpoison, memory_hotplug: lock folio before unmap hwpoisoned folio 2025-03-05 21:36:13 -08:00
memory-failure.c mm/hwpoison: do not send SIGBUS to processes with recovered clean pages 2025-03-17 22:07:05 -07:00
memory-tiers.c memory tiers: use default_dram_perf_ref_source in log message 2024-09-26 14:01:44 -07:00
memory.c arch, mm: set high_memory in free_area_init() 2025-03-17 22:06:52 -07:00
mempolicy.c mm: convert folio_likely_mapped_shared() to folio_maybe_mapped_shared() 2025-03-17 22:06:46 -07:00
mempool.c
memremap.c device/dax: properly refcount device dax pages when mapping 2025-03-17 22:06:41 -07:00
memtest.c
migrate_device.c mm: allow compound zone device pages 2025-03-17 22:06:39 -07:00
migrate.c mm: use ptep_get() instead of directly dereferencing pte_t* 2025-03-17 22:07:02 -07:00
mincore.c mm/mincore: improve performance by adding an unlikely hint 2025-03-16 22:06:32 -07:00
mlock.c mm: allow compound zone device pages 2025-03-17 22:06:39 -07:00
mm_init.c arch, mm: make releasing of memory to page allocator more explicit 2025-03-17 22:06:53 -07:00
mm_slot.h
mmap_lock.c mm: mmap_lock: optimize mmap_lock tracepoints 2025-01-13 22:40:34 -08:00
mmap.c mm/mremap: refactor move_page_tables(), abstracting state 2025-03-17 22:06:42 -07:00
mmu_gather.c mm/mmu_gather: update comment on RCU freeing 2025-03-16 22:06:12 -07:00
mmu_notifier.c
mmzone.c
mprotect.c mm: convert folio_likely_mapped_shared() to folio_maybe_mapped_shared() 2025-03-17 22:06:46 -07:00
mremap.c mm/mremap: thread state through move page table operation 2025-03-17 22:06:42 -07:00
mseal.c mseal: remove can_do_mseal() 2025-01-13 22:40:51 -08:00
msync.c
nommu.c arch, mm: set high_memory in free_area_init() 2025-03-17 22:06:52 -07:00
numa_emulation.c mm/fake-numa: allow later numa node hotplug 2025-01-25 20:22:29 -08:00
numa_memblks.c mm/fake-numa: allow later numa node hotplug 2025-01-25 20:22:29 -08:00
numa.c mm/memblock: add memblock_alloc_or_panic interface 2025-01-25 20:22:38 -08:00
oom_kill.c mm/oom_kill: fix trivial typo in comment 2025-03-16 22:05:55 -07:00
page_alloc.c mm/page_alloc: add trace event for totalreserve_pages calculation 2025-03-17 22:07:03 -07:00
page_counter.c page_counter: track failcnt only for legacy cgroups 2025-03-17 00:05:35 -07:00
page_ext.c mm: page_ext: add an iteration API for page extensions 2025-03-17 22:06:57 -07:00
page_frag_cache.c mm/page_alloc: export free_frozen_pages() instead of free_unref_page() 2025-01-13 22:40:31 -08:00
page_idle.c mm/page_idle: handle device-exclusive entries correctly in page_idle_clear_pte_refs_one() 2025-03-16 22:05:59 -07:00
page_io.c page_io: zswap: do not crash the kernel on decompression failure 2025-03-17 22:06:50 -07:00
page_isolation.c mm/hugetlb: wait for hugetlb folios to be freed 2025-03-05 21:36:14 -08:00
page_owner.c mm: page_owner: use new iteration API 2025-03-17 22:06:57 -07:00
page_poison.c
page_reporting.c
page_reporting.h
page_table_check.c mm: page_table_check: use new iteration API 2025-03-17 22:06:57 -07:00
page_vma_mapped.c mm: make page_mapped_in_vma() hugetlb walk aware 2025-03-16 22:06:42 -07:00
page-writeback.c writeback: fix calculations in trace_balance_dirty_pages() for cgwb 2025-03-17 00:05:37 -07:00
pagewalk.c mm: pagewalk: add the ability to install PTEs 2024-11-11 00:26:44 -08:00
percpu-internal.h
percpu-km.c
percpu-stats.c
percpu-vm.c
percpu.c mm, percpu: do not consider sleepable allocations atomic 2025-03-16 22:06:08 -07:00
pgalloc-track.h
pgtable-generic.c mm: add RCU annotation to pte_offset_map(_lock) 2024-12-18 19:04:43 -08:00
process_vm_access.c mm: refactor mm_access() to not return NULL 2024-11-05 16:56:23 -08:00
pt_reclaim.c mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED) 2025-01-13 22:40:48 -08:00
ptdump.c
readahead.c The various patchsets are summarized below. Plus of course many 2025-01-26 18:36:23 -08:00
rmap.c mm: stop maintaining the per-page mapcount of large folios (CONFIG_NO_PAGE_MAPCOUNT) 2025-03-17 22:06:48 -07:00
rodata_test.c mm/rodata_test: verify test data is unchanged, rather than non-zero 2025-01-13 22:40:38 -08:00
secretmem.c add a string-to-qstr constructor 2025-01-27 19:25:45 -05:00
shmem_quota.c
shmem.c mm/shmem: fix functions documentation 2025-03-17 22:07:02 -07:00
show_mem.c mm/show_mem: use str_yes_no() helper in show_free_areas() 2024-11-07 14:38:08 -08:00
shrinker_debug.c mm/shrinker: fix name consistency issue in shrinker_debugfs_rename() 2025-03-17 00:05:40 -07:00
shrinker.c mm: shrinker: avoid memleak in alloc_shrinker_info 2024-10-31 20:27:04 -07:00
shuffle.c
shuffle.h
slab_common.c mm/slab/kvfree_rcu: Switch to WQ_MEM_RECLAIM wq 2025-03-04 08:51:53 +01:00
slab.h mm/slab: fix kernel-doc func param names 2025-01-13 10:22:04 +01:00
slub.c alloc_tag: uninline code gated by mem_alloc_profiling_key in slab allocator 2025-03-16 22:06:03 -07:00
sparse-vmemmap.c mm/hugetlb: do pre-HVO for bootmem allocated pages 2025-03-16 22:06:29 -07:00
sparse.c drivers/base/memory: improve add_boot_memory_block() 2025-03-17 22:07:01 -07:00
swap_cgroup.c mm: swap_cgroup: remove double initialization of locals 2025-03-17 22:06:58 -07:00
swap_state.c mm, swap: simplify folio swap allocation 2025-03-16 22:06:44 -07:00
swap.c fs/dax: properly refcount fs dax pages 2025-03-17 22:06:41 -07:00
swap.h mm, swap: simplify folio swap allocation 2025-03-16 22:06:44 -07:00
swapfile.c mm, swap: simplify folio swap allocation 2025-03-16 22:06:44 -07:00
truncate.c mm/truncate: use folio_split() in truncate operation 2025-03-17 22:07:00 -07:00
usercopy.c
userfaultfd.c mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail 2025-03-16 22:06:18 -07:00
util.c mm: add comments to do_mmap(), mmap_region() and vm_mmap() 2025-01-13 22:40:59 -08:00
vma_internal.h mm/vma: move brk() internals to mm/vma.c 2025-01-13 22:40:42 -08:00
vma.c mm: make vma cache SLAB_TYPESAFE_BY_RCU 2025-03-16 22:06:21 -07:00
vma.h mm: make vma cache SLAB_TYPESAFE_BY_RCU 2025-03-16 22:06:21 -07:00
vmalloc.c mm/vmalloc: refactor __vmalloc_node_range_noprof() 2025-03-17 22:06:58 -07:00
vmpressure.c
vmscan.c mm: lock PGDAT_RECLAIM_LOCKED with acquire memory ordering 2025-03-17 22:07:04 -07:00
vmstat.c vmstat: disable vmstat_work on vmstat_cpu_down_prep() 2025-01-12 19:03:38 -08:00
workingset.c mm/mglru: rework workingset protection 2025-01-25 20:22:39 -08:00
zpdesc.h mm/zsmalloc: introduce __zpdesc_clear/set_zsmalloc() 2025-01-25 20:22:35 -08:00
zpool.c mm: zpool: remove zpool_malloc_support_movable() 2025-03-17 00:05:41 -07:00
zsmalloc.c mm: zpool: remove zpool_malloc_support_movable() 2025-03-17 00:05:41 -07:00
zswap.c page_io: zswap: do not crash the kernel on decompression failure 2025-03-17 22:06:50 -07:00