From 3a206a8649f83bec99a3517da5e7dac9c138875e Mon Sep 17 00:00:00 2001 From: "Lorenzo Stoakes (Oracle)" Date: Wed, 18 Mar 2026 12:26:32 +0000 Subject: [PATCH 1/6] mm/rmap: clear vma->anon_vma on error Commit 542eda1a8329 ("mm/rmap: improve anon_vma_clone(), unlink_anon_vmas() comments, add asserts") alters the way errors are handled, but overlooked one important aspect of clean up. When a VMA encounters an error state in anon_vma_clone() (that is, on attempted allocation of anon_vma_chain objects), it cleans up partially established state in cleanup_partial_anon_vmas(), before returning an error. However, this occurs prior to anon_vma->num_active_vmas being incremented, and it also fails to clear the VMA's vma->anon_vma field, which remains in place. This is immediately an inconsistent state, because anon_vma->num_active_vmas is supposed to track the number of VMAs whose vma->anon_vma field references that anon_vma, and now that count is off-by-negative-1 for each VMA for which this error state has occurred. When VMAs are unlinked from this anon_vma, unlink_anon_vmas() will eventually underflow anon_vma->num_active_vmas, which will trigger a warning. This will always eventually happen, as we unlink anon_vma's at process teardown. It could also cause maybe_reuse_anon_vma() to incorrectly permit the reuse of an anon_vma which has active VMAs attached, which will lead to a persistently invalid state. The solution is to clear the VMA's anon_vma field when we clean up partial state, as the fact we are doing so indicates clearly that the VMA is not correctly integrated into the anon_vma tree and thus this field is invalid. Link: https://lkml.kernel.org/r/20260318122632.63404-1-ljs@kernel.org Fixes: 542eda1a8329 ("mm/rmap: improve anon_vma_clone(), unlink_anon_vmas() comments, add asserts") Signed-off-by: Lorenzo Stoakes (Oracle) Reported-by: Sasha Levin Closes: https://lore.kernel.org/linux-mm/20260302151547.2389070-1-sashal@kernel.org/ Reported-by: Jiakai Xu Closes: https://lore.kernel.org/linux-mm/CAFb8wJvRhatRD-9DVmr5v5pixTMPEr3UKjYBJjCd09OfH55CKg@mail.gmail.com/ Acked-by: David Hildenbrand (Arm) Acked-by: Vlastimil Babka (SUSE) Tested-by: Jiakai Xu Acked-by: Harry Yoo Cc: Jann Horn Cc: Liam Howlett Cc: Rik van Riel Cc: Sasha Levin (Microsoft) Signed-off-by: Andrew Morton --- mm/rmap.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/mm/rmap.c b/mm/rmap.c index 391337282e3f..8f08090d7eb9 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -457,6 +457,13 @@ static void cleanup_partial_anon_vmas(struct vm_area_struct *vma) list_del(&avc->same_vma); anon_vma_chain_free(avc); } + + /* + * The anon_vma assigned to this VMA is no longer valid, as we were not + * able to correctly clone AVC state. Avoid inconsistent anon_vma tree + * state by resetting. + */ + vma->anon_vma = NULL; } /** From 26f775a054c3cda86ad465a64141894a90a9e145 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Thu, 19 Mar 2026 07:52:17 -0700 Subject: [PATCH 2/6] mm/damon/core: avoid use of half-online-committed context One major usage of damon_call() is online DAMON parameters update. It is done by calling damon_commit_ctx() inside the damon_call() callback function. damon_commit_ctx() can fail for two reasons: 1) invalid parameters and 2) internal memory allocation failures. In case of failures, the damon_ctx that attempted to be updated (commit destination) can be partially updated (or, corrupted from a perspective), and therefore shouldn't be used anymore. The function only ensures the damon_ctx object can safely deallocated using damon_destroy_ctx(). The API callers are, however, calling damon_commit_ctx() only after asserting the parameters are valid, to avoid damon_commit_ctx() fails due to invalid input parameters. But it can still theoretically fail if the internal memory allocation fails. In the case, DAMON may run with the partially updated damon_ctx. This can result in unexpected behaviors including even NULL pointer dereference in case of damos_commit_dests() failure [1]. Such allocation failure is arguably too small to fail, so the real world impact would be rare. But, given the bad consequence, this needs to be fixed. Avoid such partially-committed (maybe-corrupted) damon_ctx use by saving the damon_commit_ctx() failure on the damon_ctx object. For this, introduce damon_ctx->maybe_corrupted field. damon_commit_ctx() sets it when it is failed. kdamond_call() checks if the field is set after each damon_call_control->fn() is executed. If it is set, ignore remaining callback requests and return. All kdamond_call() callers including kdamond_fn() also check the maybe_corrupted field right after kdamond_call() invocations. If the field is set, break the kdamond_fn() main loop so that DAMON sill doesn't use the context that might be corrupted. [sj@kernel.org: let kdamond_call() with cancel regardless of maybe_corrupted] Link: https://lkml.kernel.org/r/20260320031553.2479-1-sj@kernel.org Link: https://sashiko.dev/#/patchset/20260319145218.86197-1-sj%40kernel.org Link: https://lkml.kernel.org/r/20260319145218.86197-1-sj@kernel.org Link: https://lore.kernel.org/20260319043309.97966-1-sj@kernel.org [1] Fixes: 3301f1861d34 ("mm/damon/sysfs: handle commit command using damon_call()") Signed-off-by: SeongJae Park Cc: [6.15+] Signed-off-by: Andrew Morton --- include/linux/damon.h | 6 ++++++ mm/damon/core.c | 8 ++++++++ 2 files changed, 14 insertions(+) diff --git a/include/linux/damon.h b/include/linux/damon.h index a4fea23da857..be3d198043ff 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -810,6 +810,12 @@ struct damon_ctx { struct damos_walk_control *walk_control; struct mutex walk_control_lock; + /* + * indicate if this may be corrupted. Currentonly this is set only for + * damon_commit_ctx() failure. + */ + bool maybe_corrupted; + /* Working thread of the given DAMON context */ struct task_struct *kdamond; /* Protects @kdamond field access */ diff --git a/mm/damon/core.c b/mm/damon/core.c index c1d1091d307e..3e1890d64d06 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -1252,6 +1252,7 @@ int damon_commit_ctx(struct damon_ctx *dst, struct damon_ctx *src) { int err; + dst->maybe_corrupted = true; if (!is_power_of_2(src->min_region_sz)) return -EINVAL; @@ -1277,6 +1278,7 @@ int damon_commit_ctx(struct damon_ctx *dst, struct damon_ctx *src) dst->addr_unit = src->addr_unit; dst->min_region_sz = src->min_region_sz; + dst->maybe_corrupted = false; return 0; } @@ -2678,6 +2680,8 @@ static void kdamond_call(struct damon_ctx *ctx, bool cancel) complete(&control->completion); else if (control->canceled && control->dealloc_on_cancel) kfree(control); + if (!cancel && ctx->maybe_corrupted) + break; } mutex_lock(&ctx->call_controls_lock); @@ -2707,6 +2711,8 @@ static int kdamond_wait_activation(struct damon_ctx *ctx) kdamond_usleep(min_wait_time); kdamond_call(ctx, false); + if (ctx->maybe_corrupted) + return -EINVAL; damos_walk_cancel(ctx); } return -EBUSY; @@ -2790,6 +2796,8 @@ static int kdamond_fn(void *data) * kdamond_merge_regions() if possible, to reduce overhead */ kdamond_call(ctx, false); + if (ctx->maybe_corrupted) + break; if (!list_empty(&ctx->schemes)) kdamond_apply_schemes(ctx); else From b0377ee8042985b0d91bf579afcc4ee9150db14d Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Thu, 19 Mar 2026 12:44:56 +0900 Subject: [PATCH 3/6] zram: do not slot_free() written-back slots slot_free() basically completely resets the slots by clearing all of its flags and attributes. While zram_writeback_complete() restores some of flags back (those that are necessary for async read decompression) we still lose a lot of slot's metadata. For example, slot's ac-time, or ZRAM_INCOMPRESSIBLE. More importantly, restoring flags/attrs requires extra attention as some of the flags are directly affecting zram device stats. And the original code did not pay that attention. Namely ZRAM_HUGE slots handling in zram_writeback_complete(). The call to slot_free() would decrement ->huge_pages, however when zram_writeback_complete() restored the slot's ZRAM_HUGE flag, it would not get reflected in an incremented ->huge_pages. So when the slot would finally get freed, slot_free() would decrement ->huge_pages again, leading to underflow. Fix this by open-coding the required memory free and stats updates in zram_writeback_complete(), rather than calling the destructive slot_free(). Since we now preserve the ZRAM_HUGE flag on written-back slots (for the deferred decompression path), we also update slot_free() to skip decrementing ->huge_pages if ZRAM_WB is set. Link: https://lkml.kernel.org/r/20260320023143.2372879-1-senozhatsky@chromium.org Link: https://lkml.kernel.org/r/20260319034912.1894770-1-senozhatsky@chromium.org Fixes: d38fab605c667 ("zram: introduce compressed data writeback") Signed-off-by: Sergey Senozhatsky Acked-by: Minchan Kim Cc: Brian Geffon Cc: Richard Chang Signed-off-by: Andrew Morton --- drivers/block/zram/zram_drv.c | 39 +++++++++++++---------------------- 1 file changed, 14 insertions(+), 25 deletions(-) diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index a324ede6206d..af679375b193 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -917,9 +917,8 @@ static void zram_account_writeback_submit(struct zram *zram) static int zram_writeback_complete(struct zram *zram, struct zram_wb_req *req) { - u32 size, index = req->pps->index; - int err, prio; - bool huge; + u32 index = req->pps->index; + int err; err = blk_status_to_errno(req->bio.bi_status); if (err) { @@ -946,28 +945,13 @@ static int zram_writeback_complete(struct zram *zram, struct zram_wb_req *req) goto out; } - if (zram->compressed_wb) { - /* - * ZRAM_WB slots get freed, we need to preserve data required - * for read decompression. - */ - size = get_slot_size(zram, index); - prio = get_slot_comp_priority(zram, index); - huge = test_slot_flag(zram, index, ZRAM_HUGE); - } - - slot_free(zram, index); - set_slot_flag(zram, index, ZRAM_WB); + clear_slot_flag(zram, index, ZRAM_IDLE); + if (test_slot_flag(zram, index, ZRAM_HUGE)) + atomic64_dec(&zram->stats.huge_pages); + atomic64_sub(get_slot_size(zram, index), &zram->stats.compr_data_size); + zs_free(zram->mem_pool, get_slot_handle(zram, index)); set_slot_handle(zram, index, req->blk_idx); - - if (zram->compressed_wb) { - if (huge) - set_slot_flag(zram, index, ZRAM_HUGE); - set_slot_size(zram, index, size); - set_slot_comp_priority(zram, index, prio); - } - - atomic64_inc(&zram->stats.pages_stored); + set_slot_flag(zram, index, ZRAM_WB); out: slot_unlock(zram, index); @@ -2010,8 +1994,13 @@ static void slot_free(struct zram *zram, u32 index) set_slot_comp_priority(zram, index, 0); if (test_slot_flag(zram, index, ZRAM_HUGE)) { + /* + * Writeback completion decrements ->huge_pages but keeps + * ZRAM_HUGE flag for deferred decompression path. + */ + if (!test_slot_flag(zram, index, ZRAM_WB)) + atomic64_dec(&zram->stats.huge_pages); clear_slot_flag(zram, index, ZRAM_HUGE); - atomic64_dec(&zram->stats.huge_pages); } if (test_slot_flag(zram, index, ZRAM_WB)) { From 38dfd294e24c0f397413799c2e5633aedb2058bf Mon Sep 17 00:00:00 2001 From: Muhammad Usama Anjum Date: Tue, 10 Mar 2026 17:17:39 +0000 Subject: [PATCH 4/6] mailmap: update email address for Muhammad Usama Anjum Add updated email address. Link: https://lkml.kernel.org/r/20260310171757.3970390-1-usama.anjum@arm.com Signed-off-by: Muhammad Usama Anjum Cc: Arnd Bergmann Cc: Carlos Bilbao Cc: Hans Verkuil Cc: Jakub Kacinski Cc: Martin Kepplinger Cc: Shannon Nelson Signed-off-by: Andrew Morton --- .mailmap | 1 + 1 file changed, 1 insertion(+) diff --git a/.mailmap b/.mailmap index 40b4db2b2d60..7d14504daf24 100644 --- a/.mailmap +++ b/.mailmap @@ -587,6 +587,7 @@ Morten Welinder Morten Welinder Morten Welinder Morten Welinder +Muhammad Usama Anjum Mukesh Ojha Muna Sinada Murali Nalajala From 631c1111501f34980649242751e93cfdadfd1f1c Mon Sep 17 00:00:00 2001 From: "Lorenzo Stoakes (Oracle)" Date: Mon, 16 Mar 2026 14:01:22 +0000 Subject: [PATCH 5/6] mm/zswap: add missing kunmap_local() Commit e2c3b6b21c77 ("mm: zswap: use SG list decompression APIs from zsmalloc") updated zswap_decompress() to use the scatterwalk API to copy data for uncompressed pages. In doing so, it mapped kernel memory locally for 32-bit kernels using kmap_local_folio(), however it never unmapped this memory. This resulted in the linked syzbot report where a BUG_ON() is triggered due to leaking the kmap slot. This patch fixes the issue by explicitly unmapping the established kmap. Also, add flush_dcache_folio() after the kunmap_local() call I had assumed that a new folio here combined with the flush that is done at the point of setting the PTE would suffice, but it doesn't seem that's actually the case, as update_mmu_cache() will in many archtectures only actually flush entries where a dcache flush was done on a range previously. I had also wondered whether kunmap_local() might suffice, but it doesn't seem to be the case. Some arches do seem to actually dcache flush on unmap, parisc does it if CONFIG_HIGHMEM is not set by setting ARCH_HAS_FLUSH_ON_KUNMAP and calling kunmap_flush_on_unmap() from __kunmap_local(), otherwise non-CONFIG_HIGHMEM callers do nothing here. Otherwise arch_kmap_local_pre_unmap() is called which does: * sparc - flush_cache_all() * arm - if VIVT, __cpuc_flush_dcache_area() * otherwise - nothing Also arch_kmap_local_post_unmap() is called which does: * arm - local_flush_tlb_kernel_page() * csky - kmap_flush_tlb() * microblaze, ppc - local_flush_tlb_page() * mips - local_flush_tlb_one() * sparc - flush_tlb_all() (again) * x86 - arch_flush_lazy_mmu_mode() * otherwise - nothing But this is only if it's high memory, and doesn't cover all architectures, so is presumably intended to handle other cache consistency concerns. In any case, VIPT is problematic here whether low or high memory (in spite of what the documentation claims, see [0] - 'the kernel did write to a page that is in the page cache page and / or in high memory'), because dirty cache lines may exist at the set indexed by the kernel direct mapping, which won't exist in the set indexed by any subsequent userland mapping, meaning userland might read stale data from L2 cache. Even if the documentation is correct and low memory is fine not to be flushed here, we can't be sure as to whether the memory is low or high (kmap_local_folio() will be a no-op if low), and this call should be harmless if it is low. VIVT would require more work if the memory were shared and already mapped, but this isn't the case here, and would anyway be handled by the dcache flush call. In any case, we definitely need this flush as far as I can tell. And we should probably consider updating the documentation unless it turns out there's somehow dcache synchronisation that happens for low memory/64-bit kernels elsewhere? [ljs@kernel.org: add flush_dcache_folio() after the kunmap_local() call] Link: https://lkml.kernel.org/r/13e09a99-181f-45ac-a18d-057faf94bccb@lucifer.local Link: https://lkml.kernel.org/r/20260316140122.339697-1-ljs@kernel.org Link: https://docs.kernel.org/core-api/cachetlb.html [0] Fixes: e2c3b6b21c77 ("mm: zswap: use SG list decompression APIs from zsmalloc") Signed-off-by: Lorenzo Stoakes (Oracle) Reported-by: syzbot+fe426bef95363177631d@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/69b75e2c.050a0220.12d28.015a.GAE@google.com Acked-by: Yosry Ahmed Acked-by: Johannes Weiner Reviewed-by: SeongJae Park Acked-by: Yosry Ahmed Acked-by: Nhat Pham Cc: Chengming Zhou Signed-off-by: Andrew Morton --- mm/zswap.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/mm/zswap.c b/mm/zswap.c index e6ec3295bdb0..16b2ef7223e1 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -942,9 +942,15 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio) /* zswap entries of length PAGE_SIZE are not compressed. */ if (entry->length == PAGE_SIZE) { + void *dst; + WARN_ON_ONCE(input->length != PAGE_SIZE); - memcpy_from_sglist(kmap_local_folio(folio, 0), input, 0, PAGE_SIZE); + + dst = kmap_local_folio(folio, 0); + memcpy_from_sglist(dst, input, 0, PAGE_SIZE); dlen = PAGE_SIZE; + kunmap_local(dst); + flush_dcache_folio(folio); } else { sg_init_table(&output, 1); sg_set_folio(&output, folio, PAGE_SIZE, 0); From 84481e705ab07ed46e56587fe846af194acacafe Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Mon, 16 Mar 2026 16:51:17 -0700 Subject: [PATCH 6/6] mm/damon/stat: monitor all System RAM resources DAMON_STAT usage document (Documentation/admin-guide/mm/damon/stat.rst) says it monitors the system's entire physical memory. But, it is monitoring only the biggest System RAM resource of the system. When there are multiple System RAM resources, this results in monitoring only an unexpectedly small fraction of the physical memory. For example, suppose the system has a 500 GiB System RAM, 10 MiB non-System RAM, and 500 GiB System RAM resources in order on the physical address space. DAMON_STAT will monitor only the first 500 GiB System RAM. This situation is particularly common on NUMA systems. Select a physical address range that covers all System RAM areas of the system, to fix this issue and make it work as documented. [sj@kernel.org: return error if monitoring target region is invalid] Link: https://lkml.kernel.org/r/20260317053631.87907-1-sj@kernel.org Link: https://lkml.kernel.org/r/20260316235118.873-1-sj@kernel.org Fixes: 369c415e6073 ("mm/damon: introduce DAMON_STAT module") Signed-off-by: SeongJae Park Cc: [6.17+] Signed-off-by: Andrew Morton --- mm/damon/stat.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 50 insertions(+), 3 deletions(-) diff --git a/mm/damon/stat.c b/mm/damon/stat.c index 25fb44ccf99d..cf2c5a541eee 100644 --- a/mm/damon/stat.c +++ b/mm/damon/stat.c @@ -145,12 +145,59 @@ static int damon_stat_damon_call_fn(void *data) return 0; } +struct damon_stat_system_ram_range_walk_arg { + bool walked; + struct resource res; +}; + +static int damon_stat_system_ram_walk_fn(struct resource *res, void *arg) +{ + struct damon_stat_system_ram_range_walk_arg *a = arg; + + if (!a->walked) { + a->walked = true; + a->res.start = res->start; + } + a->res.end = res->end; + return 0; +} + +static unsigned long damon_stat_res_to_core_addr(resource_size_t ra, + unsigned long addr_unit) +{ + /* + * Use div_u64() for avoiding linking errors related with __udivdi3, + * __aeabi_uldivmod, or similar problems. This should also improve the + * performance optimization (read div_u64() comment for the detail). + */ + if (sizeof(ra) == 8 && sizeof(addr_unit) == 4) + return div_u64(ra, addr_unit); + return ra / addr_unit; +} + +static int damon_stat_set_monitoring_region(struct damon_target *t, + unsigned long addr_unit, unsigned long min_region_sz) +{ + struct damon_addr_range addr_range; + struct damon_stat_system_ram_range_walk_arg arg = {}; + + walk_system_ram_res(0, -1, &arg, damon_stat_system_ram_walk_fn); + if (!arg.walked) + return -EINVAL; + addr_range.start = damon_stat_res_to_core_addr( + arg.res.start, addr_unit); + addr_range.end = damon_stat_res_to_core_addr( + arg.res.end + 1, addr_unit); + if (addr_range.end <= addr_range.start) + return -EINVAL; + return damon_set_regions(t, &addr_range, 1, min_region_sz); +} + static struct damon_ctx *damon_stat_build_ctx(void) { struct damon_ctx *ctx; struct damon_attrs attrs; struct damon_target *target; - unsigned long start = 0, end = 0; ctx = damon_new_ctx(); if (!ctx) @@ -180,8 +227,8 @@ static struct damon_ctx *damon_stat_build_ctx(void) if (!target) goto free_out; damon_add_target(ctx, target); - if (damon_set_region_biggest_system_ram_default(target, &start, &end, - ctx->min_region_sz)) + if (damon_stat_set_monitoring_region(target, ctx->addr_unit, + ctx->min_region_sz)) goto free_out; return ctx; free_out: