linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-29 17:43:52 +02:00

Author	SHA1	Message	Date
Linus Torvalds	334fbe734e	mm.git review status for linus..mm-stable Everything: Total patches: 368 Reviews/patch: 1.56 Reviewed rate: 74% Excluding DAMON: Total patches: 316 Reviews/patch: 1.77 Reviewed rate: 81% Excluding DAMON and zram: Total patches: 306 Reviews/patch: 1.81 Reviewed rate: 82% Excluding DAMON, zram and maple_tree: Total patches: 276 Reviews/patch: 2.01 Reviewed rate: 91% Significant patch series in this merge: - The 30 patch series "maple_tree: Replace big node with maple copy" from Liam Howlett is mainly preparatory work for ongoing development but it does reduce stack usage and is an improvement. - The 12 patch series "mm, swap: swap table phase III: remove swap_map" from Kairui Song offers memory savings by removing the static swap_map. It also yields some CPU savings and implements several cleanups. - The 2 patch series "mm: memfd_luo: preserve file seals" from Pratyush Yadav adds file seal preservation to LUO's memfd code. - The 2 patch series "mm: zswap: add per-memcg stat for incompressible pages" from Jiayuan Chen adds additional userspace stats reporting to zswap. - The 4 patch series "arch, mm: consolidate empty_zero_page" from Mike Rapoport implements some cleanups for our handling of ZERO_PAGE() and zero_pfn. - The 2 patch series "mm/kmemleak: Improve scan_should_stop() implementation" from Zhongqiu Han provides a robustness improvement and some cleanups in the kmemleak code. - The 4 patch series "Improve khugepaged scan logic" from Vernon Yang "improves the khugepaged scan logic and reduces CPU consumption by prioritizing scanning tasks that access memory frequently". 
- The 2 patch series "Make KHO Stateless" from Jason Miu simplifies Kexec Handover by "transitioning KHO from an xarray-based metadata tracking system with serialization to a radix tree data structure that can be passed directly to the next kernel" - The 3 patch series "mm: vmscan: add PID and cgroup ID to vmscan tracepoints" from Thomas Ballasi and Steven Rostedt enhances vmscan's tracepointing. - The 5 patch series "mm: arch/shstk: Common shadow stack mapping helper and VM_NOHUGEPAGE" from Catalin Marinas is a cleanup for the shadow stack code: remove per-arch code in favour of a generic implementation. - The 2 patch series "Fix KASAN support for KHO restored vmalloc regions" from Pasha Tatashin fixes a WARN() which can be emitted when KHO restores a vmalloc area. - The 4 patch series "mm: Remove stray references to pagevec" from Tal Zussman provides several cleanups, mainly updating references to "struct pagevec", which became folio_batch three years ago. - The 17 patch series "mm: Eliminate fake head pages from vmemmap optimization" from Kiryl Shutsemau simplifies the HugeTLB vmemmap optimization (HVO) by changing how tail pages encode their relationship to the head page. - The 2 patch series "mm/damon/core: improve DAMOS quota efficiency for core layer filters" from SeongJae Park improves two problematic behaviors of DAMOS that make it less efficient when core layer filters are used. - The 3 patch series "mm/damon: strictly respect min_nr_regions" from SeongJae Park improves DAMON usability by extending the treatment of the min_nr_regions user-settable parameter. - The 3 patch series "mm/page_alloc: pcp locking cleanup" from Vlastimil Babka is a proper fix for a previously hotfixed SMP=n issue. Code simplifications and cleanups ensued. - The 16 patch series "mm: cleanups around unmapping / zapping" from David Hildenbrand implements "a bunch of cleanups around unmapping and zapping. 
Mostly simplifications, code movements, documentation and renaming of zapping functions". - The 6 patch series "support batched checking of the young flag for MGLRU" from Baolin Wang supports batched checking of the young flag for MGLRU. It's part cleanups; one benchmark shows large performance benefits for arm64. - The 5 patch series "memcg: obj stock and slab stat caching cleanups" from Johannes Weiner provides memcg cleanup and robustness improvements. - The 5 patch series "Allow order zero pages in page reporting" from Yuvraj Sakshith enhances page_reporting's free page reporting - it is presently and undesirably order-0 pages when reporting free memory. - The 6 patch series "mm: vma flag tweaks" from Lorenzo Stoakes is cleanup work following from the recent conversion of the VMA flags to a bitmap. - The 10 patch series "mm/damon: add optional debugging-purpose sanity checks" from SeongJae Park adds some more developer-facing debug checks into DAMON core. - The 2 patch series "mm/damon: test and document power-of-2 min_region_sz requirement" from SeongJae Park adds an additional DAMON kunit test and makes some adjustments to the addr_unit parameter handling. - The 3 patch series "mm/damon/core: make passed_sample_intervals comparisons overflow-safe" from SeongJae Park fixes a hard-to-hit time overflow issue in DAMON core. - The 7 patch series "mm/damon: improve/fixup/update ratio calculation, test and documentation" from SeongJae Park is a "batch of misc/minor improvements and fixups" for DAMON. - The 4 patch series "mm: move vma_(kernel\|mmu)_pagesize() out of hugetlb.c" from David Hildenbrand fixes a possible issue with dax-device when CONFIG_HUGETLB=n. Some code movement was required. - The 6 patch series "zram: recompression cleanups and tweaks" from Sergey Senozhatsky provides "a somewhat random mix of fixups, recompression cleanups and improvements" in the zram code. 
- The 11 patch series "mm/damon: support multiple goal-based quota tuning algorithms" from SeongJae Park extends DAMOS quotas goal auto-tuning to support multiple tuning algorithms that users can select. - The 4 patch series "mm: thp: reduce unnecessary start_stop_khugepaged()" from Breno Leitao fixes the khugepaged sysfs handling so we no longer spam the logs with reams of junk when starting/stopping khugepaged. - The 3 patch series "mm: improve map count checks" from Lorenzo Stoakes provides some cleanups and slight fixes in the mremap, mmap and vma code. - The 5 patch series "mm/damon: support addr_unit on default monitoring targets for modules" from SeongJae Park extends the use of DAMON core's addr_unit tunable. - The 5 patch series "mm: khugepaged cleanups and mTHP prerequisites" from Nico Pache provides cleanups in khugepaged and is a base for Nico's planned khugepaged mTHP support. - The 15 patch series "mm: memory hot(un)plug and SPARSEMEM cleanups" from David Hildenbrand implements code movement and cleanups in the memhotplug and sparsemem code. - The 2 patch series "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and cleanup CONFIG_MIGRATION" from David Hildenbrand rationalizes some memhotplug Kconfig support. - The 6 patch series "change young flag check functions to return bool" from Baolin Wang is "a cleanup patchset to change all young flag check functions to return bool". - The 3 patch series "mm/damon/sysfs: fix memory leak and NULL dereference issues" from Josh Law and SeongJae Park fixes a few potential DAMON bugs. - The 25 patch series "mm/vma: convert vm_flags_t to vma_flags_t in vma code" from Lorenzo Stoakes "converts a lot of the existing use of the legacy vm_flags_t data type to the new vma_flags_t type which replaces it". Mainly in the vma code. 
- The 21 patch series "mm: expand mmap_prepare functionality and usage" from Lorenzo Stoakes "expands the mmap_prepare functionality, which is intended to replace the deprecated f_op->mmap hook which has been the source of bugs and security issues for some time". Cleanups, documentation, extension of mmap_prepare into filesystem drivers. - The 13 patch series "mm/huge_memory: refactor zap_huge_pmd()" from Lorenzo Stoakes simplifies and cleans up zap_huge_pmd(). Additional cleanups around vm_normal_folio_pmd() and the softleaf functionality are performed. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCad3HDQAKCRDdBJ7gKXxA jrUQAPwNhPk5nPSxnyxjAeQtOBHqgCdnICeEismLajPKd9aYRgEA0s2XAu3tSUYi GrBnWImHG3s4ePQxVcPCegWTsOUrXgQ= =1Q7o -----END PGP SIGNATURE----- Merge tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "maple_tree: Replace big node with maple copy" (Liam Howlett) Mainly preparatory work for ongoing development but it does reduce stack usage and is an improvement. - "mm, swap: swap table phase III: remove swap_map" (Kairui Song) Offers memory savings by removing the static swap_map. It also yields some CPU savings and implements several cleanups. 
- "mm: memfd_luo: preserve file seals" (Pratyush Yadav) File seal preservation to LUO's memfd code - "mm: zswap: add per-memcg stat for incompressible pages" (Jiayuan Chen) Additional userspace stats reporting to zswap - "arch, mm: consolidate empty_zero_page" (Mike Rapoport) Some cleanups for our handling of ZERO_PAGE() and zero_pfn - "mm/kmemleak: Improve scan_should_stop() implementation" (Zhongqiu Han) A robustness improvement and some cleanups in the kmemleak code - "Improve khugepaged scan logic" (Vernon Yang) Improve khugepaged scan logic and reduce CPU consumption by prioritizing scanning tasks that access memory frequently - "Make KHO Stateless" (Jason Miu) Simplify Kexec Handover by transitioning KHO from an xarray-based metadata tracking system with serialization to a radix tree data structure that can be passed directly to the next kernel - "mm: vmscan: add PID and cgroup ID to vmscan tracepoints" (Thomas Ballasi and Steven Rostedt) Enhance vmscan's tracepointing - "mm: arch/shstk: Common shadow stack mapping helper and VM_NOHUGEPAGE" (Catalin Marinas) Cleanup for the shadow stack code: remove per-arch code in favour of a generic implementation - "Fix KASAN support for KHO restored vmalloc regions" (Pasha Tatashin) Fix a WARN() which can be emitted when KHO restores a vmalloc area - "mm: Remove stray references to pagevec" (Tal Zussman) Several cleanups, mainly updating references to "struct pagevec", which became folio_batch three years ago - "mm: Eliminate fake head pages from vmemmap optimization" (Kiryl Shutsemau) Simplify the HugeTLB vmemmap optimization (HVO) by changing how tail pages encode their relationship to the head page - "mm/damon/core: improve DAMOS quota efficiency for core layer filters" (SeongJae Park) Improve two problematic behaviors of DAMOS that make it less efficient when core layer filters are used - "mm/damon: strictly respect min_nr_regions" (SeongJae Park) Improve DAMON usability by extending the treatment of the 
min_nr_regions user-settable parameter - "mm/page_alloc: pcp locking cleanup" (Vlastimil Babka) The proper fix for a previously hotfixed SMP=n issue. Code simplifications and cleanups ensued - "mm: cleanups around unmapping / zapping" (David Hildenbrand) A bunch of cleanups around unmapping and zapping. Mostly simplifications, code movements, documentation and renaming of zapping functions - "support batched checking of the young flag for MGLRU" (Baolin Wang) Batched checking of the young flag for MGLRU. It's part cleanups; one benchmark shows large performance benefits for arm64 - "memcg: obj stock and slab stat caching cleanups" (Johannes Weiner) memcg cleanup and robustness improvements - "Allow order zero pages in page reporting" (Yuvraj Sakshith) Enhance free page reporting - it is presently and undesirably order-0 pages when reporting free memory. - "mm: vma flag tweaks" (Lorenzo Stoakes) Cleanup work following from the recent conversion of the VMA flags to a bitmap - "mm/damon: add optional debugging-purpose sanity checks" (SeongJae Park) Add some more developer-facing debug checks into DAMON core - "mm/damon: test and document power-of-2 min_region_sz requirement" (SeongJae Park) An additional DAMON kunit test and makes some adjustments to the addr_unit parameter handling - "mm/damon/core: make passed_sample_intervals comparisons overflow-safe" (SeongJae Park) Fix a hard-to-hit time overflow issue in DAMON core - "mm/damon: improve/fixup/update ratio calculation, test and documentation" (SeongJae Park) A batch of misc/minor improvements and fixups for DAMON - "mm: move vma_(kernel\|mmu)_pagesize() out of hugetlb.c" (David Hildenbrand) Fix a possible issue with dax-device when CONFIG_HUGETLB=n. Some code movement was required. 
- "zram: recompression cleanups and tweaks" (Sergey Senozhatsky) A somewhat random mix of fixups, recompression cleanups and improvements in the zram code - "mm/damon: support multiple goal-based quota tuning algorithms" (SeongJae Park) Extend DAMOS quotas goal auto-tuning to support multiple tuning algorithms that users can select - "mm: thp: reduce unnecessary start_stop_khugepaged()" (Breno Leitao) Fix the khugepaged sysfs handling so we no longer spam the logs with reams of junk when starting/stopping khugepaged - "mm: improve map count checks" (Lorenzo Stoakes) Provide some cleanups and slight fixes in the mremap, mmap and vma code - "mm/damon: support addr_unit on default monitoring targets for modules" (SeongJae Park) Extend the use of DAMON core's addr_unit tunable - "mm: khugepaged cleanups and mTHP prerequisites" (Nico Pache) Cleanups to khugepaged and is a base for Nico's planned khugepaged mTHP support - "mm: memory hot(un)plug and SPARSEMEM cleanups" (David Hildenbrand) Code movement and cleanups in the memhotplug and sparsemem code - "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and cleanup CONFIG_MIGRATION" (David Hildenbrand) Rationalize some memhotplug Kconfig support - "change young flag check functions to return bool" (Baolin Wang) Cleanups to change all young flag check functions to return bool - "mm/damon/sysfs: fix memory leak and NULL dereference issues" (Josh Law and SeongJae Park) Fix a few potential DAMON bugs - "mm/vma: convert vm_flags_t to vma_flags_t in vma code" (Lorenzo Stoakes) Convert a lot of the existing use of the legacy vm_flags_t data type to the new vma_flags_t type which replaces it. Mainly in the vma code. - "mm: expand mmap_prepare functionality and usage" (Lorenzo Stoakes) Expand the mmap_prepare functionality, which is intended to replace the deprecated f_op->mmap hook which has been the source of bugs and security issues for some time. 
Cleanups, documentation, extension of mmap_prepare into filesystem drivers - "mm/huge_memory: refactor zap_huge_pmd()" (Lorenzo Stoakes) Simplify and clean up zap_huge_pmd(). Additional cleanups around vm_normal_folio_pmd() and the softleaf functionality are performed. * tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits) mm: fix deferred split queue races during migration mm/khugepaged: fix issue with tracking lock mm/huge_memory: add and use has_deposited_pgtable() mm/huge_memory: add and use normal_or_softleaf_folio_pmd() mm: add softleaf_is_valid_pmd_entry(), pmd_to_softleaf_folio() mm/huge_memory: separate out the folio part of zap_huge_pmd() mm/huge_memory: use mm instead of tlb->mm mm/huge_memory: remove unnecessary sanity checks mm/huge_memory: deduplicate zap deposited table call mm/huge_memory: remove unnecessary VM_BUG_ON_PAGE() mm/huge_memory: add a common exit path to zap_huge_pmd() mm/huge_memory: handle buggy PMD entry in zap_huge_pmd() mm/huge_memory: have zap_huge_pmd return a boolean, add kdoc mm/huge: avoid big else branch in zap_huge_pmd() mm/huge_memory: simplify vma_is_specal_huge() mm: on remap assert that input range within the proposed VMA mm: add mmap_action_map_kernel_pages[_full]() uio: replace deprecated mmap hook with mmap_prepare in uio_info drivers: hv: vmbus: replace deprecated mmap hook with mmap_prepare mm: allow handling of stacked mmap_prepare hooks in more drivers ...	2026-04-15 12:59:16 -07:00
Chen Ni	42561b341b	mm/swapfile: remove duplicate include of swap_table.h Remove duplicate inclusion of swap_table.h in swapfile.c to clean up redundant code. Link: https://lkml.kernel.org/r/20260318043849.399266-1-nichen@iscas.ac.cn Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:31 -07:00
Hui Zhu	86e69c020b	mm/swap: strengthen locking assertions and invariants in cluster allocation swap_cluster_alloc_table() requires several locks to be held by its callers: ci->lock, the per-CPU swap_cluster lock, and, for non-solid-state devices (non-SWP_SOLIDSTATE), the si->global_cluster_lock. While most call paths (e.g., via cluster_alloc_swap_entry() or alloc_swap_scan_list()) correctly acquire these locks before invocation, the path through swap_reclaim_work() -> swap_reclaim_full_clusters() -> isolate_lock_cluster() is distinct. This path operates exclusively on si->full_clusters, where the swap allocation tables are guaranteed to be already allocated. Consequently, isolate_lock_cluster() should never trigger a call to swap_cluster_alloc_table() for these clusters. Strengthen the locking and state assertions to formalize these invariants: 1. Add a lockdep_assert_held() for si->global_cluster_lock in swap_cluster_alloc_table() for non-SWP_SOLIDSTATE devices. 2. Reorder existing lockdep assertions in swap_cluster_alloc_table() to match the actual lock acquisition order (per-CPU lock, then global lock, then cluster lock). 3. Add a VM_WARN_ON_ONCE() in isolate_lock_cluster() to ensure that table allocations are only attempted for clusters being isolated from the free list. Attempting to allocate a table for a cluster from other lists (like the full list during reclaim) indicates a violation of subsystem invariants. These changes ensure locking consistency and help catch potential synchronization or logic issues during development. 
[zhuhui@kylinos.cn: remove redundant comment, per Barry] Link: https://lkml.kernel.org/r/20260311022241.177801-1-hui.zhu@linux.dev [zhuhui@kylinos.cn: initialize `flags', per Chris] Link: https://lkml.kernel.org/r/20260312023024.903143-1-hui.zhu@linux.dev Link: https://lkml.kernel.org/r/20260310015657.42395-1-hui.zhu@linux.dev Signed-off-by: Hui Zhu <zhuhui@kylinos.cn> Reviewed-by: Youngjun Park <youngjun.park@lge.com> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:25 -07:00
Kairui Song	1df1a1b950	mm, swap: no need to clear the shadow explicitly Since we no longer bypass the swap cache, every swap-in will clear the swap shadow by inserting the folio into the swap table. The only place we may seem to need to free the swap shadow is when the swap slots are freed directly without a folio (swap_put_entries_direct). But with the swap table, that is not needed either. Freeing a slot in the swap table will set the table entry to NULL, which erases the shadow just fine. So just delete all explicit shadow clearing, it's no longer needed. Also, rearrange the freeing. Link: https://lkml.kernel.org/r/20260218-swap-table-p3-v3-12-f4e34be021a7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <lkp@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:52:59 -07:00
Kairui Song	a0f79916e1	mm, swap: simplify checking if a folio is swapped Clean up and simplify how we check if a folio is swapped. The helper already requires the folio to be in swap cache and locked. That's enough to pin the swap cluster from being freed, so there is no need to lock anything else to avoid UAF. And besides, we have cleaned up and defined the swap operation to be mostly folio based, and now the only place a folio will have any of its swap slots' count increased from 0 to 1 is folio_dup_swap, which also requires the folio lock. So as we are holding the folio lock here, a folio can't change its swap status from not swapped (all swap slots have a count of 0) to swapped (any slot has a swap count larger than 0). So there won't be any false negatives of this helper if we simply depend on the folio lock to stabilize the cluster. We are only using this helper to determine if we can and should release the swap cache. So false positives are completely harmless, and also already exist before. Depending on the timing, previously, it's also possible that a racing thread releases the swap count right after releasing the ci lock and before this helper returns. In any case, the worst that could happen is we leave a clean swap cache. It will still be reclaimed when under pressure just fine. So, in conclusion, we can simplify and make the check much simpler and lockless. Also, rename it to folio_maybe_swapped to reflect the design. 
Link: https://lkml.kernel.org/r/20260218-swap-table-p3-v3-11-f4e34be021a7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <lkp@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:52:59 -07:00
Kairui Song	45711d446b	mm, swap: no need to truncate the scan border swap_map had a static flexible size, so the last cluster won't be fully covered, hence the allocator needs to check the scan border to avoid OOB. But the swap table has a fixed-sized swap table for each cluster, and the slots beyond the device size are marked as bad slots. The allocator can simply scan all slots as usual, and any bad slots will be skipped. Link: https://lkml.kernel.org/r/20260218-swap-table-p3-v3-10-f4e34be021a7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <lkp@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:52:59 -07:00
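The bad-slot idea in the commit above can be illustrated with a small userspace sketch. It is not the kernel's code: the `border_` names, the eight-slot cluster and the three-state enum are all hypothetical, chosen only to show why a fixed-size per-cluster table with pre-marked bad slots lets a scan run over every slot without a border check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cluster size and slot states, for illustration only. */
#define BORDER_CLUSTER_SLOTS 8

enum border_slot { BORDER_SLOT_FREE, BORDER_SLOT_USED, BORDER_SLOT_BAD };

/*
 * Initialise one fixed-size cluster table. Slots at or past nr_valid
 * lie beyond the end of the device and are marked bad up front.
 */
static void border_init_cluster(enum border_slot table[BORDER_CLUSTER_SLOTS],
				size_t nr_valid)
{
	assert(nr_valid <= BORDER_CLUSTER_SLOTS);
	for (size_t i = 0; i < BORDER_CLUSTER_SLOTS; i++)
		table[i] = (i < nr_valid) ? BORDER_SLOT_FREE : BORDER_SLOT_BAD;
}

/*
 * Scan the whole table with no border check: bad slots are simply
 * never free, so they are skipped like any used slot.
 * Returns the allocated slot index, or -1 if nothing is free.
 */
static int border_alloc_slot(enum border_slot table[BORDER_CLUSTER_SLOTS])
{
	for (size_t i = 0; i < BORDER_CLUSTER_SLOTS; i++) {
		if (table[i] == BORDER_SLOT_FREE) {
			table[i] = BORDER_SLOT_USED;
			return (int)i;
		}
	}
	return -1;
}
```

With `nr_valid` set to 3, three allocations succeed and the fourth returns -1, even though the loop always visits all eight slots.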
Kairui Song	0d6af9bcf3	mm, swap: use the swap table to track the swap count Now all the infrastructures are ready, switch to using the swap table only. This is unfortunately a large patch because the whole old counting mechanism, especially SWP_CONTINUED, has to be gone and switch to the new mechanism together, with no intermediate steps available. The swap table is capable of holding up to SWP_TB_COUNT_MAX - 1 counts in the higher bits of each table entry, so using that, the swap_map can be completely dropped. swap_map also had a limit of SWAP_CONT_MAX. Any value beyond that limit will require a COUNT_CONTINUED page. COUNT_CONTINUED is a bit complex to maintain, so for the swap table, a simpler approach is used: when the count goes beyond SWP_TB_COUNT_MAX - 1, the cluster will have an extend_table allocated, which is a swap cluster-sized array of unsigned int. The counting is basically offloaded there until the count drops below SWP_TB_COUNT_MAX again. Both the swap table and the extend table are cluster-based, so they exhibit good performance and sparsity. To make the switch from swap_map to swap table clean, this commit cleans up and introduces a new set of functions based on the swap table design, for manipulating swap counts: - __swap_cluster_dup_entry, __swap_cluster_put_entry, __swap_cluster_alloc_entry, __swap_cluster_free_entry: Increase/decrease the count of a swap slot, or alloc / free a swap slot. This is the internal routine that does the counting work based on the swap table and handles all the complexities. The caller will need to lock the cluster before calling them. All swap count-related update operations are wrapped by these four helpers. - swap_dup_entries_cluster, swap_put_entries_cluster: Increase/decrease the swap count of one or a set of swap slots in the same cluster range. These two helpers serve as the common routines for folio_dup_swap & swap_dup_entry_direct, or folio_put_swap & swap_put_entries_direct. 
And use these helpers to replace all existing callers. This helps to simplify the count tracking by a lot, and the swap_map is gone. [ryncsn@gmail.com: fix build] Link: https://lkml.kernel.org/r/aZWuLZi-vYi3vAWe@KASONG-MC4 Link: https://lkml.kernel.org/r/20260218-swap-table-p3-v3-9-f4e34be021a7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Suggested-by: Chris Li <chrisl@kernel.org> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <lkp@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:52:59 -07:00
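The counting scheme described in the commit above (a small count kept in the high bits of each table entry, spilling into a per-cluster array of unsigned int once it grows too large) can be sketched in userspace C. Everything here is an assumption made for illustration: the `sketch_` names, the 3-bit count width, the four-slot cluster, the statically sized extend array and the use of the maximum inline value as a "count lives in the extend table" marker are not taken from the kernel source.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical parameters; the kernel's real layout may differ. */
#define SKETCH_SLOTS      4
#define SKETCH_COUNT_BITS 3
/* Largest inline value, used here as the "see extend table" marker. */
#define SKETCH_COUNT_MAX  ((1UL << SKETCH_COUNT_BITS) - 1)
#define SKETCH_SHIFT      (sizeof(unsigned long) * CHAR_BIT - SKETCH_COUNT_BITS)
#define SKETCH_LOW_MASK   ((1UL << SKETCH_SHIFT) - 1)

struct sketch_cluster {
	unsigned long table[SKETCH_SLOTS];  /* count in high bits, payload below */
	unsigned int extend[SKETCH_SLOTS];  /* overflow counts, one per slot */
};

static unsigned long sketch_inline(unsigned long entry)
{
	return entry >> SKETCH_SHIFT;
}

static unsigned long sketch_set_inline(unsigned long entry, unsigned long count)
{
	return (entry & SKETCH_LOW_MASK) | (count << SKETCH_SHIFT);
}

/* Effective count: inline value, or the extend table when marked. */
static unsigned int sketch_count(const struct sketch_cluster *ci, int i)
{
	unsigned long c = sketch_inline(ci->table[i]);

	return c == SKETCH_COUNT_MAX ? ci->extend[i] : (unsigned int)c;
}

/* Increase the count, offloading to the extend table on overflow. */
static void sketch_dup(struct sketch_cluster *ci, int i)
{
	unsigned long c = sketch_inline(ci->table[i]);

	if (c < SKETCH_COUNT_MAX - 1) {
		ci->table[i] = sketch_set_inline(ci->table[i], c + 1);
	} else if (c == SKETCH_COUNT_MAX - 1) {
		ci->extend[i] = SKETCH_COUNT_MAX;
		ci->table[i] = sketch_set_inline(ci->table[i], SKETCH_COUNT_MAX);
	} else {
		ci->extend[i]++;
	}
}

/* Decrease the count, moving it back inline once it fits again. */
static void sketch_put(struct sketch_cluster *ci, int i)
{
	unsigned long c = sketch_inline(ci->table[i]);

	assert(sketch_count(ci, i) > 0);
	if (c == SKETCH_COUNT_MAX) {
		if (--ci->extend[i] < SKETCH_COUNT_MAX) {
			ci->table[i] = sketch_set_inline(ci->table[i], ci->extend[i]);
			ci->extend[i] = 0;
		}
	} else {
		ci->table[i] = sketch_set_inline(ci->table[i], c - 1);
	}
}
```

In this sketch the low bits of each entry are left untouched by count updates, mirroring the message's point that the count occupies only the higher bits of a table entry. A real implementation would also need the cluster lock that the commit message says callers must hold.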
Kairui Song	5dc533f7aa	mm, swap: simplify swap table sanity range check The newly introduced helper, which checks bad slots and emptiness of a cluster, can cover the older sanity check just fine, with a more rigorous condition check. So merge them. Link: https://lkml.kernel.org/r/20260218-swap-table-p3-v3-8-f4e34be021a7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <lkp@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:52:59 -07:00
Kairui Song	1307442b93	mm, swap: mark bad slots in swap table directly In preparation for deprecating swap_map, mark bad slots in the swap table too when setting SWAP_MAP_BAD in swap_map. Also, refine the swap table sanity check on freeing to adapt to the bad slots change. For swapoff, the bad slots count must match the cluster usage count, as nothing should touch them, and they contribute to the cluster usage count on swapon. For ordinary swap table freeing, the swap table of clusters with bad slots should never be freed since the cluster usage count never reaches zero. Link: https://lkml.kernel.org/r/20260218-swap-table-p3-v3-7-f4e34be021a7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <lkp@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:52:59 -07:00
Kairui Song	0c7e6014b7	mm, swap: consolidate bad slots setup and make it more robust In preparation for using the swap table to track bad slots directly, move the bad slot setup to one place, set up the swap_map mark, and cluster counter update together. While at it, provide more informative logs and a more robust fallback if any bad slot info looks incorrect. Fixes a potential issue that a malformed swap file may cause the cluster to be unusable upon swapon, and provides a more verbose warning on a malformed swap file Link: https://lkml.kernel.org/r/20260218-swap-table-p3-v3-4-f4e34be021a7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <lkp@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:52:58 -07:00
Kairui Song	50f8c41928	mm, swap: remove redundant arguments and locking for enabling a device There is no need to repeatedly pass zero map and priority values. zeromap is similar to cluster info and swap_map, which are only used once the swap device is exposed. And the prio values are currently read only once set, and only used for the list insertion upon expose or swap info display. Link: https://lkml.kernel.org/r/20260218-swap-table-p3-v3-3-f4e34be021a7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <lkp@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:52:58 -07:00
Kairui Song	451c632610	mm, swap: clean up swapon process and locking Slightly clean up the swapon process. Add comments about what swap_lock protects, introduce and rename helpers that wrap swap_map and cluster_info setup, and do it outside of the swap_lock lock. This lock protection is not needed for swap_map and cluster_info setup because all swap users must either hold the percpu ref or hold a stable allocated swap entry (e.g., locking a folio in the swap cache) before accessing. So before the swap device is exposed by enable_swap_info, nothing would use the swap device's map or cluster. So we are safe to allocate and set up swap data freely first, then expose the swap device and set the SWP_WRITEOK flag. Link: https://lkml.kernel.org/r/20260218-swap-table-p3-v3-2-f4e34be021a7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <lkp@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:52:58 -07:00
Kairui Song	eca4d01b98	mm, swap: protect si->swap_file properly and use as a mount indicator Patch series "mm, swap: swap table phase III: remove swap_map", v3. This series removes the static swap_map and uses the swap table for the swap count directly. This saves about ~30% memory usage for the static swap metadata. For example, this saves 256MB of memory when mounting a 1TB swap device. Performance is slightly better too, since the double update of the swap table and swap_map is now gone. Test results: Mounting a swap device: ======================= Mount a 1TB brd device as SWAP, just to verify the memory save: `free -m` before: total used free shared buff/cache available Mem: 1465 1051 417 1 61 413 Swap: 1054435 0 1054435 `free -m` after: total used free shared buff/cache available Mem: 1465 795 672 1 62 670 Swap: 1054435 0 1054435 Idle memory usage is reduced by ~256MB just as expected. And following this design we should be able to save another ~512MB in a next phase. Build kernel test: ================== Test using ZSWAP with NVME SWAP, make -j48, defconfig, in a x86_64 VM with 5G RAM, under global pressure, avg of 32 test run: Before After: System time: 1038.97s 1013.75s (-2.4%) Test using ZRAM as SWAP, make -j12, tinyconfig, in a ARM64 VM with 1.5G RAM, under global pressure, avg of 32 test run: Before After: System time: 67.75s 66.65s (-1.6%) The result is slightly better. Redis / Valkey benchmark: ========================= Test using ZRAM as SWAP, in a ARM64 VM with 1.5G RAM, under global pressure, avg of 64 test run: Server: valkey-server --maxmemory 2560M Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get no persistence with BGSAVE Before: 472705.71 RPS 369451.68 RPS After: 481197.93 RPS (+1.8%) 374922.32 RPS (+1.5%) In conclusion, performance is better in all cases, and memory usage is much lower. 
The swap cgroup array will also be merged into the swap table in a later phase, saving the other ~60% part of the static swap metadata and making all the swap metadata dynamic. The improved API for swap operations also reduces the lock contention and makes more batching operations possible. This patch (of 12): /proc/swaps uses si->swap_map as the indicator to check if the swap device is mounted. swap_map will be removed soon, so change it to use si->swap_file instead because: - si->swap_file is exactly the only dynamic content that /proc/swaps is interested in. Previously, it was checking si->swap_map just to ensure si->swap_file is available. si->swap_map is set under mutex protection, and after si->swap_file is set, so having si->swap_map set guarantees si->swap_file is set. - Checking si->flags doesn't work here. SWP_WRITEOK is cleared during swapoff, but /proc/swaps is supposed to show the device under swapoff too to report the swapoff progress. And SWP_USED is set even if the device hasn't been properly set up. We can have another flag, but the easier way is to just check si->swap_file directly. So protect si->swap_file setting with the mutex, and set si->swap_file only when the swap device is truly enabled. /proc/swaps is only interested in si->swap_file and in reading a few static fields. Only si->swap_file needs protection. Reading other static fields is always fine. 
Link: https://lkml.kernel.org/r/20260218-swap-table-p3-v3-0-f4e34be021a7@tencent.com Link: https://lkml.kernel.org/r/20260218-swap-table-p3-v3-1-f4e34be021a7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:52:58 -07:00
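The 256MB and ~512MB figures quoted in the cover letter above can be checked with a little arithmetic. The sketch below is only an illustration under stated assumptions: 4 KiB pages, one byte of swap_map per swap slot, and two bytes per slot for the swap cgroup array (the last value is inferred from the ~512MB figure, not stated in the commit message). The function name is hypothetical.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, one swap slot per page

def static_metadata_bytes(device_bytes, bytes_per_slot):
    """Bytes of per-slot static metadata for a swap device of the given size."""
    slots = device_bytes // PAGE_SIZE
    return slots * bytes_per_slot

TIB = 1 << 40
MIB = 1 << 20

# Assumption: swap_map uses one byte per slot.
swap_map_mib = static_metadata_bytes(TIB, 1) // MIB
# Assumption: the swap cgroup array uses two bytes per slot.
swap_cgroup_mib = static_metadata_bytes(TIB, 2) // MIB

print(swap_map_mib, swap_cgroup_mib)
```

With these assumptions a 1 TiB device has 2^28 slots, so the one-byte array accounts for 256 MiB, which lines up with the drop in used memory between the two `free -m` runs (1051 to 795).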
Kairui Song	396f57b572	mm, swap: speed up hibernation allocation and writeout Since commit `0ff67f990b` ("mm, swap: remove swap slot cache"), hibernation has been using the swap slot slow allocation path for simplification, which, it turns out, might cause a regression on some devices because the allocator now rotates clusters too often, leading to slower allocation and more random distribution of data. Fast allocation is not complex, so implement hibernation support as well. Test result with Samsung SSD 830 Series (SATA II, 3.0 Gbps) shows the performance is several times better [1]: 6.19: 324 seconds After this series: 35 seconds Link: https://lkml.kernel.org/r/20260216-hibernate-perf-v4-1-1ba9f0bf1ec9@tencent.com Link: https://lore.kernel.org/linux-mm/8b4bdcfa-ce3f-4e23-839f-31367df7c18f@gmx.de/ [1] Signed-off-by: Kairui Song <kasong@tencent.com> Fixes: `0ff67f990b` ("mm, swap: remove swap slot cache") Reported-by: Carsten Grohmann <mail@carstengrohmann.de> Closes: https://lore.kernel.org/linux-mm/20260206121151.dea3633d1f0ded7bbf49c22e@linux-foundation.org/ Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-03-24 14:38:25 -07:00
Damien Le Moal	ecd92cfec5	block: remove bdev_nonrot() bdev_nonrot() is simply the negative return value of bdev_rot(). So replace all call sites of bdev_nonrot() with calls to bdev_rot() and remove bdev_nonrot(). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-09 14:30:00 -06:00
Linus Torvalds	32a92f8c89	Convert more 'alloc_obj' cases to default GFP_KERNEL arguments This converts some of the visually simpler cases that have been split over multiple lines. I only did the ones that are easy to verify the resulting diff by having just that final GFP_KERNEL argument on the next line. Somebody should probably do a proper coccinelle script for this, but for me the trivial script actually resulted in an assertion failure in the middle of the script. I probably had made it a bit _too_ trivial. So after fighting that for a while I decided to just do some of the syntactically simpler cases with variations of the previous 'sed' scripts. The more syntactically complex multi-line cases would mostly really want whitespace cleanup anyway. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 20:03:00 -08:00
Linus Torvalds	bf4afc53b7	Convert 'alloc_obj' family to use the new default GFP_KERNEL argument This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' | xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 17:09:51 -08:00
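To see why the brute-force substitution described above only catches single-line calls, here is a rough Python translation of the sed expression. It assumes the expression is meant to strip a final `, GFP_KERNEL` argument from alloc_obj()/alloc_objs() calls; the function name `drop_default_gfp` and the sample lines are hypothetical, and this is a sketch of the idea rather than the script that was actually run.

```python
import re

# Python translation of the sed substitution. In sed's basic syntax \( \) is a
# group and a bare ( is literal, so the escaping is the other way round here.
PATTERN = re.compile(r"(alloc_objs*\(.*), GFP_KERNEL\)")

def drop_default_gfp(line):
    """Drop a trailing ', GFP_KERNEL' argument from an alloc_obj-style call."""
    return PATTERN.sub(r"\1)", line)

samples = [
    "p = kzalloc_obj(*p, GFP_KERNEL);",              # converted
    "v = kmalloc_objs(struct foo, n, GFP_KERNEL);",  # converted
    "p = kzalloc_obj(*p, GFP_ATOMIC);",              # other flags: untouched
    "v = kmalloc_objs(struct foo, n,",               # split over lines: untouched
]
for line in samples:
    print(drop_default_gfp(line))
```

Because the pattern needs the call and `GFP_KERNEL)` on the same line, a call whose last argument sits on the following line is left alone, which is the limitation the commit message points out.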
Kees Cook	69050f8d6d	treewide: Replace kmalloc with kmalloc_obj for non-scalar types This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(PTR, FAM, COUNT, ...) (where TYPE may also be VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>	2026-02-21 01:02:28 -08:00
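The three rewrite rules listed in the commit above can be illustrated with a toy text rewriter. This is not how Coccinelle works (it matches on the parsed C code, not on text) and it only handles the plain `kmalloc` spellings on a single line; the `RULES` table and the `rewrite` function are hypothetical names used purely to show the argument reordering.

```python
import re

# Each rule: (pattern on the old call prefix, replacement prefix).
RULES = [
    # kmalloc_array(COUNT, sizeof(TYPE), ...) -> kmalloc_objs(TYPE, COUNT, ...)
    (re.compile(r"kmalloc_array\(([^,()]+),\s*sizeof\(([^()]+)\),\s*"),
     r"kmalloc_objs(\2, \1, "),
    # kmalloc(struct_size(PTR, FAM, COUNT), ...) -> kmalloc_flex(PTR, FAM, COUNT, ...)
    (re.compile(r"kmalloc\(struct_size\(([^()]+)\),\s*"),
     r"kmalloc_flex(\1, "),
    # kmalloc(sizeof(TYPE), ...) -> kmalloc_obj(TYPE, ...)
    (re.compile(r"kmalloc\(sizeof\(([^()]+)\),\s*"),
     r"kmalloc_obj(\1, "),
]

def rewrite(line):
    """Apply the three toy rewrite rules to one line of C source text."""
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line

print(rewrite("p = kmalloc(sizeof(*p), GFP_KERNEL);"))
print(rewrite("a = kmalloc_array(n, sizeof(struct foo), GFP_KERNEL);"))
print(rewrite("f = kmalloc(struct_size(f, items, n), GFP_KERNEL);"))
```

Note how the array form swaps the order of its first two arguments: the type comes first in `kmalloc_objs()`, the count second.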
Linus Torvalds	4cff5c05e0	mm.git review status for linus..mm-stable Everything: Total patches: 325 Reviews/patch: 1.39 Reviewed rate: 72% Excluding DAMON: Total patches: 262 Reviews/patch: 1.63 Reviewed rate: 82% Excluding DAMON and zram: Total patches: 248 Reviews/patch: 1.72 Reviewed rate: 86% - The 14 patch series "powerpc/64s: do not re-activate batched TLB flush" from Alexander Gordeev makes arch_{enter|leave}_lazy_mmu_mode() nest properly. It adds a generic enter/leave layer and switches architectures to use it. Various hacks were removed in the process. - The 7 patch series "zram: introduce compressed data writeback" from Richard Chang and Sergey Senozhatsky implements data compression for zram writeback. - The 8 patch series "mm: folio_zero_user: clear page ranges" from David Hildenbrand adds clearing of contiguous page ranges for hugepages. Large improvements during demand faulting are demonstrated. - The 2 patch series "memcg cleanups" from Chen Ridong tidies up some memcg code. - The 12 patch series "mm/damon: introduce {,max_}nr_snapshots and tracepoint for damos stats" from SeongJae Park improves DAMOS stat's provided information, deterministic control, and readability. - The 3 patch series "selftests/mm: hugetlb cgroup charging: robustness fixes" from Li Wang fixes a few issues in the hugetlb cgroup charging selftests. - The 5 patch series "Fix va_high_addr_switch.sh test failure - again" from Chunyu Hu addresses several issues in the va_high_addr_switch test. - The 5 patch series "mm/damon/tests/core-kunit: extend existing test scenarios" from Shu Anzai improves the KUnit test coverage for DAMON. - The 2 patch series "mm/khugepaged: fix dirty page handling for MADV_COLLAPSE" from Shivank Garg fixes a glitch in khugepaged which was causing madvise(MADV_COLLAPSE) to transiently return -EAGAIN. 
- The 29 patch series "arch, mm: consolidate hugetlb early reservation" from Mike Rapoport reworks and consolidates a pile of straggly code related to reservation of hugetlb memory from bootmem and creation of CMA areas for hugetlb. - The 9 patch series "mm: clean up anon_vma implementation" from Lorenzo Stoakes cleans up the anon_vma implementation in various ways. - The 3 patch series "tweaks for __alloc_pages_slowpath()" from Vlastimil Babka does a little streamlining of the page allocator's slowpath code. - The 8 patch series "memcg: separate private and public ID namespaces" from Shakeel Butt cleans up the memcg ID code and prevents the internal-only private IDs from being exposed to userspace. - The 6 patch series "mm: hugetlb: allocate frozen gigantic folio" from Kefeng Wang cleans up the allocation of frozen folios and avoids some atomic refcount operations. - The 11 patch series "mm/damon: advance DAMOS-based LRU sorting" from SeongJae Park improves DAMOS's movement of memory between the active and inactive LRUs and adds auto-tuning of the ratio-based quotas and of monitoring intervals. - The 18 patch series "Support page table check on PowerPC" from Andrew Donnellan makes CONFIG_PAGE_TABLE_CHECK_ENFORCED work on powerpc. - The 3 patch series "nodemask: align nodes_and{,not} with underlying bitmap ops" from Yury Norov makes nodes_and() and nodes_andnot() propagate the return values from the underlying bit operations, enabling some cleanup in calling code. - The 5 patch series "mm/damon: hide kdamond and kdamond_lock from API callers" from SeongJae Park cleans up some DAMON internal interfaces. - The 4 patch series "mm/khugepaged: cleanups and scan limit fix" from Shivank Garg does some cleanup work in khugepaged and fixes a scan limit accounting issue. - The 24 patch series "mm: balloon infrastructure cleanups" from David Hildenbrand goes to town on the balloon infrastructure and its page migration function. Mainly cleanups, also some locking simplification. 
- The 2 patch series "mm/vmscan: add tracepoint and reason for kswapd_failures reset" from Jiayuan Chen adds additional tracepoints to the page reclaim code. - The 3 patch series "Replace wq users and add WQ_PERCPU to alloc_workqueue() users" from Marco Crivellari is part of Marco's kernel-wide migration from the legacy workqueue APIs over to the preferred unbound workqueues. - The 9 patch series "Various mm kselftests improvements/fixes" from Kevin Brodsky provides various unrelated improvements/fixes for the mm kselftests. - The 5 patch series "mm: accelerate gigantic folio allocation" from Kefeng Wang greatly speeds up gigantic folio allocation, mainly by avoiding unnecessary work in pfn_range_valid_contig(). - The 5 patch series "selftests/damon: improve leak detection and wss estimation reliability" from SeongJae Park improves the reliability of two of the DAMON selftests. - The 8 patch series "mm/damon: cleanup kdamond, damon_call(), damos filter and DAMON_MIN_REGION" from SeongJae Park does some cleanup work in the core DAMON code. - The 8 patch series "Docs/mm/damon: update intro, modules, maintainer profile, and misc" from SeongJae Park performs maintenance work on the DAMON documentation. - The 10 patch series "mm: add and use vma_assert_stabilised() helper" from Lorenzo Stoakes refactors and cleans up the core VMA code. The main aim here is to be able to use the mmap write lock's lockdep state to perform various assertions regarding the locking which the VMA code requires. - The 19 patch series "mm, swap: swap table phase II: unify swapin use" from Kairui Song removes some old swap code (swap cache bypassing and swap synchronization) which wasn't working very well. Various other cleanups and simplifications were made. The end result is a 20% speedup in one benchmark. 
- The 8 patch series "enable PT_RECLAIM on more 64-bit architectures" from Qi Zheng makes PT_RECLAIM available on 64-bit alpha, loongarch, mips, parisc, and um. Various cleanups were performed along the way. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaY1HfAAKCRDdBJ7gKXxA jqhZAP9H8ZlKKqCEgnr6U5XXmJ63Ep2FDQpl8p35yr9yVuU9+gEAgfyWiJ43l1fP rT0yjsUW3KQFBi/SEA3R6aYarmoIBgI= =+HLt -----END PGP SIGNATURE----- Merge tag 'mm-stable-2026-02-11-19-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "powerpc/64s: do not re-activate batched TLB flush" makes arch_{enter|leave}_lazy_mmu_mode() nest properly (Alexander Gordeev) It adds a generic enter/leave layer and switches architectures to use it. Various hacks were removed in the process. - "zram: introduce compressed data writeback" implements data compression for zram writeback (Richard Chang and Sergey Senozhatsky) - "mm: folio_zero_user: clear page ranges" adds clearing of contiguous page ranges for hugepages. 
Large improvements during demand faulting are demonstrated (David Hildenbrand) - "memcg cleanups" tidies up some memcg code (Chen Ridong) - "mm/damon: introduce {,max_}nr_snapshots and tracepoint for damos stats" improves DAMOS stat's provided information, deterministic control, and readability (SeongJae Park) - "selftests/mm: hugetlb cgroup charging: robustness fixes" fixes a few issues in the hugetlb cgroup charging selftests (Li Wang) - "Fix va_high_addr_switch.sh test failure - again" addresses several issues in the va_high_addr_switch test (Chunyu Hu) - "mm/damon/tests/core-kunit: extend existing test scenarios" improves the KUnit test coverage for DAMON (Shu Anzai) - "mm/khugepaged: fix dirty page handling for MADV_COLLAPSE" fixes a glitch in khugepaged which was causing madvise(MADV_COLLAPSE) to transiently return -EAGAIN (Shivank Garg) - "arch, mm: consolidate hugetlb early reservation" reworks and consolidates a pile of straggly code related to reservation of hugetlb memory from bootmem and creation of CMA areas for hugetlb (Mike Rapoport) - "mm: clean up anon_vma implementation" cleans up the anon_vma implementation in various ways (Lorenzo Stoakes) - "tweaks for __alloc_pages_slowpath()" does a little streamlining of the page allocator's slowpath code (Vlastimil Babka) - "memcg: separate private and public ID namespaces" cleans up the memcg ID code and prevents the internal-only private IDs from being exposed to userspace (Shakeel Butt) - "mm: hugetlb: allocate frozen gigantic folio" cleans up the allocation of frozen folios and avoids some atomic refcount operations (Kefeng Wang) - "mm/damon: advance DAMOS-based LRU sorting" improves DAMOS's movement of memory between the active and inactive LRUs and adds auto-tuning of the ratio-based quotas and of monitoring intervals (SeongJae Park) - "Support page table check on PowerPC" makes CONFIG_PAGE_TABLE_CHECK_ENFORCED work on powerpc (Andrew Donnellan) - "nodemask: align nodes_and{,not} with underlying 
bitmap ops" makes nodes_and() and nodes_andnot() propagate the return values from the underlying bit operations, enabling some cleanup in calling code (Yury Norov) - "mm/damon: hide kdamond and kdamond_lock from API callers" cleans up some DAMON internal interfaces (SeongJae Park) - "mm/khugepaged: cleanups and scan limit fix" does some cleanup work in khugepaged and fixes a scan limit accounting issue (Shivank Garg) - "mm: balloon infrastructure cleanups" goes to town on the balloon infrastructure and its page migration function. Mainly cleanups, also some locking simplification (David Hildenbrand) - "mm/vmscan: add tracepoint and reason for kswapd_failures reset" adds additional tracepoints to the page reclaim code (Jiayuan Chen) - "Replace wq users and add WQ_PERCPU to alloc_workqueue() users" is part of Marco's kernel-wide migration from the legacy workqueue APIs over to the preferred unbound workqueues (Marco Crivellari) - "Various mm kselftests improvements/fixes" provides various unrelated improvements/fixes for the mm kselftests (Kevin Brodsky) - "mm: accelerate gigantic folio allocation" greatly speeds up gigantic folio allocation, mainly by avoiding unnecessary work in pfn_range_valid_contig() (Kefeng Wang) - "selftests/damon: improve leak detection and wss estimation reliability" improves the reliability of two of the DAMON selftests (SeongJae Park) - "mm/damon: cleanup kdamond, damon_call(), damos filter and DAMON_MIN_REGION" does some cleanup work in the core DAMON code (SeongJae Park) - "Docs/mm/damon: update intro, modules, maintainer profile, and misc" performs maintenance work on the DAMON documentation (SeongJae Park) - "mm: add and use vma_assert_stabilised() helper" refactors and cleans up the core VMA code. 
The main aim here is to be able to use the mmap write lock's lockdep state to perform various assertions regarding the locking which the VMA code requires (Lorenzo Stoakes) - "mm, swap: swap table phase II: unify swapin use" removes some old swap code (swap cache bypassing and swap synchronization) which wasn't working very well. Various other cleanups and simplifications were made. The end result is a 20% speedup in one benchmark (Kairui Song) - "enable PT_RECLAIM on more 64-bit architectures" makes PT_RECLAIM available on 64-bit alpha, loongarch, mips, parisc, and um. Various cleanups were performed along the way (Qi Zheng) * tag 'mm-stable-2026-02-11-19-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (325 commits) mm/memory: handle non-split locks correctly in zap_empty_pte_table() mm: move pte table reclaim code to memory.c mm: make PT_RECLAIM depends on MMU_GATHER_RCU_TABLE_FREE mm: convert __HAVE_ARCH_TLB_REMOVE_TABLE to CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config um: mm: enable MMU_GATHER_RCU_TABLE_FREE parisc: mm: enable MMU_GATHER_RCU_TABLE_FREE mips: mm: enable MMU_GATHER_RCU_TABLE_FREE LoongArch: mm: enable MMU_GATHER_RCU_TABLE_FREE alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE mm: change mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h mm/damon/stat: remove __read_mostly from memory_idle_ms_percentiles zsmalloc: make common caches global mm: add SPDX id lines to some mm source files mm/zswap: use %pe to print error pointers mm/vmscan: use %pe to print error pointers mm/readahead: fix typo in comment mm: khugepaged: fix NR_FILE_PAGES and NR_SHMEM in collapse_file() mm: refactor vma_map_pages to use vm_insert_pages mm/damon: unify address range representation with damon_addr_range mm/cma: replace snprintf with strscpy in cma_new_area ...	2026-02-12 11:32:37 -08:00
Kairui Song	50c7f34c5c	mm, swap: remove no longer needed _swap_info_get There are now only two users of _swap_info_get after consolidating these callers, folio_free_swap and swp_swapcount. folio_free_swap already holds the folio lock, and the folio must be in the swap cache, _swap_info_get is redundant. For swp_swapcount, it should use get_swap_device instead. get_swap_device increases the device ref count, which is actually a bit safer. The only current use is smap walking, and the performance change here is tiny. And after these changes, _swap_info_get is no longer used, so we can safely remove it. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-19-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:58 -08:00
Kairui Song	d3852f9692	mm, swap: drop the SWAP_HAS_CACHE flag Now, the swap cache is managed by the swap table. All swap cache users are checking the swap table directly to check the swap cache state. SWAP_HAS_CACHE is now just a temporary pin before the first increase from 0 to 1 of a slot's swap count (swap_dup_entries) after swap allocation (folio_alloc_swap), or before the final free of slots pinned by folio in swap cache (put_swap_folio). Drop these two usages. For the first dup, SWAP_HAS_CACHE pinning was hard to kill because it used to have multiple meanings, more than just "a slot is cached". We have just simplified that and defined that the first dup is always done with folio locked in swap cache (folio_dup_swap), so stop checking the SWAP_HAS_CACHE bit and just check the swap cache (swap table) directly, and add a WARN if a swap entry's count is being increased for the first time while the folio is not in swap cache. As for freeing, just let the swap cache free all swap entries of a folio that have a swap count of zero directly upon folio removal. We have also just cleaned up batch freeing to check the swap cache usage using the swap table: a slot with swap cache in the swap table will not be freed until its cache is gone, and no SWAP_HAS_CACHE bit is involved anymore. And besides, the removal of a folio and freeing of the slots are being done in the same critical section now, which should improve the performance. After these two changes, SWAP_HAS_CACHE no longer has any users. Swap cache synchronization is also done by the swap table directly, so using SWAP_HAS_CACHE to pin a slot before adding the cache is also no longer needed. Remove all related logic and helpers. swap_map is now only used for tracking the count, so all swap_map users can just read it directly, ignoring the swap_count helper, which was previously used to filter out the SWAP_HAS_CACHE bit. 
The idea of dropping SWAP_HAS_CACHE and using the swap table directly was initially from Chris's idea of merging all the metadata usage of all swaps into one place. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-18-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Suggested-by: Chris Li <chrisl@kernel.org> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:57 -08:00
Kairui Song	e1c5c6be3c	mm, swap: clean up and improve swap entries freeing There are a few problems with the current freeing of swap entries. When freeing a set of swap entries directly (swap_put_entries_direct, typically from zapping the page table), it scans the whole swap region multiple times. First, it scans the whole region to check if it can be batch freed and if there is any cached folio. Then do a batch free only if the whole region's swap count equals 1. And if any entry is cached, even if only one, it will have to walk the whole region again to clean up the cache. And if any entry is not in a consistent status with other entries, it will fall back to order 0 freeing. For example, if only one of them is cached, the batch free will fall back. And the current batch freeing workflow relies on the swap map's SWAP_HAS_CACHE bit for both continuous checking and batch freeing, which isn't compatible with the swap table design. Tidy this up, introduce a new cluster scoped helper for all swap entry freeing jobs. It will batch free all continuous entries, and just start a new batch if any inconsistent entry is found. This may improve the batch size when the clusters are fragmented. This should also be more robust with more sanity checks, and make it clear that a slot pinned by swap cache will be cleared upon cache reclaim. And the cache reclaim scan is also now limited to each cluster. If a cluster has any clean swap cache left after putting the swap count, reclaim the cluster only instead of the whole region. And since a folio's entries are always in the same cluster, putting swap entries from a folio can also use the new helper directly. This should be both an optimization and a cleanup, and the new helper is adapted to the swap table. 
Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-17-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:57 -08:00
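The batching behaviour described in the commit above (free continuous entries together, and start a new batch at the first inconsistent entry instead of falling back to order-0 freeing for the whole region) can be sketched as a simple run-splitting loop. This is a loose illustration only: `split_free_batches` and its boolean `freeable` input are hypothetical, standing in for whatever per-slot check the kernel helper performs (for example, the swap count dropping to zero with no swap cache attached).

```python
def split_free_batches(freeable):
    """Return (start, length) runs of consecutive slots that can be freed together.

    freeable[i] is True when slot i can be freed right now (hypothetical
    predicate). A slot that cannot be freed just ends the current batch;
    it does not stop later slots from being batched.
    """
    batches = []
    start = None
    for i, ok in enumerate(freeable):
        if ok and start is None:
            start = i                          # open a new batch
        elif not ok and start is not None:
            batches.append((start, i - start)) # close the batch before slot i
            start = None
    if start is not None:
        batches.append((start, len(freeable) - start))
    return batches

# One still-cached slot in the middle splits the region into two batches.
print(split_free_batches([True, True, False, True, True, True]))
```

In this model a single inconsistent slot costs one extra batch, whereas the older workflow described in the message would give up on batching for the whole region.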
Kairui Song	4984d746c8	mm, swap: check swap table directly for checking cache Instead of looking at the swap map, check swap table directly to tell if a swap slot is cached. Prepares for the removal of SWAP_HAS_CACHE. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-16-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:57 -08:00
Kairui Song	270f095179	mm, swap: add folio to swap cache directly on allocation The allocator uses SWAP_HAS_CACHE to pin a swap slot upon allocation. SWAP_HAS_CACHE is being deprecated as it caused a lot of confusion. This pinning usage here can be dropped by adding the folio to swap cache directly on allocation. All swap allocations are folio-based now (except for hibernation), so the swap allocator can always take the folio as the parameter. And now both swap cache (swap table) and swap map are protected by the cluster lock, scanning the map and inserting the folio can be done in the same critical section. This eliminates the time window that a slot is pinned by SWAP_HAS_CACHE, but it has no cache, and avoids touching the lock multiple times. This is both a cleanup and an optimization. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-15-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:57 -08:00
Kairui Song	3697615914	mm, swap: cleanup swap entry management workflow The current swap entry allocation/freeing workflow has never had a clear definition. This makes it hard to debug or add new optimizations. This commit introduces a proper definition of how swap entries would be allocated and freed. Now, most operations are folio based, so they will never exceed one swap cluster, and we now have a cleaner border between swap and the rest of mm, making it much easier to follow and debug, especially with newly added sanity checks. Also making more optimization possible. Swap entries will be mostly allocated and freed with a folio bound. The folio lock will be useful for resolving many swap related races. Now swap allocation (except hibernation) always starts with a folio in the swap cache, and gets duped/freed protected by the folio lock: - folio_alloc_swap() - The only allocation entry point now. Context: The folio must be locked. This allocates one or a set of continuous swap slots for a folio and binds them to the folio by adding the folio to the swap cache. The swap slots' swap count starts with zero value. - folio_dup_swap() - Increase the swap count of one or more entries. Context: The folio must be locked and in the swap cache. For now, the caller still has to lock the new swap entry owner (e.g., PTL). This increases the ref count of swap entries allocated to a folio. Newly allocated swap slots' count has to be increased by this helper as the folio got unmapped (and swap entries got installed). - folio_put_swap() - Decrease the swap count of one or more entries. Context: The folio must be locked and in the swap cache. For now, the caller still has to lock the new swap entry owner (e.g., PTL). This decreases the ref count of swap entries allocated to a folio. Typically, swapin will decrease the swap count as the folio got installed back and the swap entry got uninstalled. This won't remove the folio from the swap cache and free the slot. 
Lazy freeing of swap cache is helpful for reducing IO. There is already a folio_free_swap() for immediate cache reclaim. This part could be further optimized later. The above locking constraints could be further relaxed when the swap table is fully implemented. Currently dup still needs the caller to lock the swap entry container (e.g. PTL), or a concurrent zap may underflow the swap count. Some swap users need to interact with swap count without involving folio (e.g. forking/zapping the page table or mapping truncate without swapin). In such cases, the caller has to ensure there is no race condition on whatever owns the swap count and use the below helpers: - swap_put_entries_direct() - Decrease the swap count directly. Context: The caller must lock whatever is referencing the slots to avoid a race. Typically the page table zapping or shmem mapping truncate will need to free swap slots directly. If a slot is cached (has a folio bound), this will also try to release the swap cache. - swap_dup_entry_direct() - Increase the swap count directly. Context: The caller must lock whatever is referencing the entries to avoid race, and the entries must already have a swap count > 1. Typically, forking will need to copy the page table and hence needs to increase the swap count of the entries in the table. The page table is locked while referencing the swap entries, so the entries all have a swap count > 1 and can't be freed. Hibernation subsystem is a bit different, so two special wrappers are here: - swap_alloc_hibernation_slot() - Allocate one entry from one device. - swap_free_hibernation_slot() - Free one entry allocated by the above helper. All hibernation entries are exclusive to the hibernation subsystem and should not interact with ordinary swap routines. By separating the workflows, it will be possible to bind folio more tightly with swap cache and get rid of the SWAP_HAS_CACHE as a temporary pin. 
This commit should not introduce any behavior change. [kasong@tencent.com: fix leak, per Chris Mason. Remove WARN_ON, per Lai Yi] Link: https://lkml.kernel.org/r/CAMgjq7AUz10uETVm8ozDWcB3XohkOqf0i33KGrAquvEVvfp5cg@mail.gmail.com [ryncsn@gmail.com: fix KSM copy pages for swapoff, per Chris] Link: https://lkml.kernel.org/r/aXxkANcET3l2Xu6J@KASONG-MC4 Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-14-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Signed-off-by: Kairui Song <ryncsn@gmail.com> Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Cc: Chris Mason <clm@fb.com> Cc: Chris Mason <clm@meta.com> Cc: Lai Yi <yi1.lai@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:56 -08:00
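The folio-bound lifecycle defined in the commit above (allocate with a count of zero and the folio in the swap cache, dup on unmap, put on swapin, free lazily once the cache goes away) can be modelled with a few lines of state tracking. This is a toy model, not kernel code: locking is ignored entirely, the method names `folio_alloc_swap`, `folio_dup_swap` and `folio_put_swap` are borrowed from the commit message only as labels, and `drop_swap_cache` is a hypothetical name for removing the folio from the swap cache.

```python
class SwapModel:
    """Toy model of swap slot state: a swap count plus a 'has swap cache' flag."""

    def __init__(self):
        self.count = {}      # slot -> swap count
        self.cached = set()  # slots whose folio is currently in the swap cache

    def folio_alloc_swap(self, slot):
        # Allocation binds the slot to a folio in the swap cache; count starts at 0.
        assert slot not in self.count, "slot already allocated"
        self.count[slot] = 0
        self.cached.add(slot)

    def folio_dup_swap(self, slot):
        # Dup requires the folio to be in the swap cache (e.g. on unmap).
        assert slot in self.cached, "folio must be in the swap cache"
        self.count[slot] += 1

    def folio_put_swap(self, slot):
        # Put (e.g. on swapin) lowers the count but keeps the cache: lazy freeing.
        assert slot in self.cached, "folio must be in the swap cache"
        assert self.count[slot] > 0, "swap count underflow"
        self.count[slot] -= 1

    def drop_swap_cache(self, slot):
        # Hypothetical helper: removing the folio frees the slot if count is zero.
        self.cached.discard(slot)
        if self.count[slot] == 0:
            del self.count[slot]
            return True
        return False


m = SwapModel()
m.folio_alloc_swap(7)   # swap-out starts: slot bound to the folio, count 0
m.folio_dup_swap(7)     # folio unmapped, swap entry installed: count 1
m.folio_put_swap(7)     # swapin: count back to 0, slot still held by the cache
print(m.drop_swap_cache(7), 7 in m.count)
```

The point of the model is the last two steps: the put alone does not free the slot, and the slot only goes away when the cache is dropped with a count of zero.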
Kairui Song	2732acda82	mm, swap: use swap cache as the swap in synchronize layer Current swap in synchronization mostly uses the swap_map's SWAP_HAS_CACHE bit. Whoever sets the bit first does the actual work to swap in a folio. This has been causing many issues as it's just a poor implementation of a bit lock. Raced users have no idea what is pinning a slot, so it has to loop with a schedule_timeout_uninterruptible(1), which is ugly and causes long-tailing or other performance issues. Besides, the abuse of SWAP_HAS_CACHE has been causing many other troubles for synchronization or maintenance. This is the first step to remove this bit completely. Now all swap in paths are using the swap cache, and both the swap cache and swap map are protected by the cluster lock. So we can just resolve the swap synchronization with the swap cache layer directly using the cluster lock and folio lock. Whoever inserts a folio in the swap cache first does the swap in work. And because folios are locked during swap operations, other raced swap operations will just wait on the folio lock. The SWAP_HAS_CACHE will be removed in later commit. For now, we still set it for some remaining users. But now we do the bit setting and swap cache folio adding in the same critical section, after swap cache is ready. No one will have to spin on the SWAP_HAS_CACHE bit anymore. This both simplifies the logic and should improve the performance, eliminating issues like the one solved in commit `01626a1823` ("mm: avoid unconditional one-tick sleep when swapcache_prepare fails"), or the "skip_if_exists" from commit `a65b0e7607` ("zswap: make shrinking memcg-aware"), which will be removed very soon. 
[kasong@tencent.com: fix cgroup v1 accounting issue] Link: https://lkml.kernel.org/r/CAMgjq7CGUnzOVG7uSaYjzw9wD7w2dSKOHprJfaEp4CcGLgE3iw@mail.gmail.com Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-12-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:56 -08:00
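The "whoever inserts a folio first does the swap in work" rule can be sketched as an insert-or-get operation. The model below is single-threaded and entirely hypothetical (`toy_folio`, `toy_cache_add` and the array-backed cache are illustration only); in the kernel the check and insert happen under the cluster lock, and a losing task waits on the folio lock rather than sleeping for a tick.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a folio: just an id and a lock flag. */
struct toy_folio {
	int id;
	int locked;
};

#define TOY_NR_SLOTS 4

/* One cache pointer per swap slot; NULL means nothing is cached. */
static struct toy_folio *toy_swap_cache[TOY_NR_SLOTS];

/*
 * Try to install @folio for @slot and return the folio that is in the
 * cache afterwards. If the return value is the caller's own folio, the
 * caller won the race and is responsible for the actual read. Otherwise
 * the caller would wait on the returned folio's lock.
 */
static struct toy_folio *toy_cache_add(unsigned int slot,
				       struct toy_folio *folio)
{
	assert(slot < TOY_NR_SLOTS);

	/* Check and insert form one critical section in the real design. */
	if (toy_swap_cache[slot])
		return toy_swap_cache[slot];

	folio->locked = 1;	/* stays locked while the swap-in runs */
	toy_swap_cache[slot] = folio;
	return folio;
}
```

The point of the design is that a raced task always learns which folio holds the slot, so it has something concrete to wait on.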
Kairui Song	78d6a12dd9	mm, swap: split locked entry duplicating into a standalone helper No feature change, split the common logic into a stand alone helper to be reused later. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-11-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:55 -08:00
Kairui Song	cda2504c51	mm, swap: consolidate cluster reclaim and usability check Swap cluster cache reclaim requires releasing the lock, so the cluster may become unusable after the reclaim. To prepare for checking swap cache using the swap table directly, consolidate the swap cluster reclaim and the check logic. We will want to avoid touching the cluster's data completely with the swap table, to avoid RCU overhead here. And by moving the cluster usable check into the reclaim helper, it will also help avoid a redundant scan of the slots if the cluster is no longer usable, and we will want to avoid touching the cluster. Also, adjust it very slightly while at it: always scan the whole region during reclaim, don't skip slots covered by a reclaimed folio. Because the reclaim is lockless, it's possible that new cache lands at any time. And for allocation, we want all caches to be reclaimed to avoid fragmentation. Besides, if the scan offset is not aligned with the size of the reclaimed folio, we might skip some existing cache and fail the reclaim unexpectedly. There should be no observable behavior change. It might slightly improve the fragmentation issue or performance. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-10-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:55 -08:00
Kairui Song	f7ad377a92	mm, swap: swap entry of a bad slot should not be considered as swapped out When checking if a swap entry is swapped out, we simply check if the bitwise result of the count value is larger than 0. But SWAP_MAP_BAD will also be considered as a swap count value larger than 0. SWAP_MAP_BAD being considered as a count value larger than 0 is useful for the swap allocator: bad slots will be seen as used slots, so the allocator will skip them. But for the swapped out check, this isn't correct. There is currently no observable issue. The swapped out check is only useful for readahead and the folio swapped-out status check. For readahead, the swap cache layer will abort upon checking and updating the swap map. For the folio swapped-out status check, the swap allocator will never allocate an entry of a bad slot to a folio, so that part is fine too. The worst that could happen now is redundant allocation/freeing of folios and wasted CPU time. This also makes it easier to get rid of swap map checking and update during folio insertion in the swap cache layer. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-9-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:55 -08:00
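The difference between the allocator's view and the swapped-out check can be shown with two small predicates. This is a sketch, not the patch itself: the `TOY_*` constants are assumed values chosen to resemble the swap_map byte encoding, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values for illustration; not taken from the text above. */
#define TOY_SWAP_MAP_BAD	0x3f	/* marks an unusable slot */
#define TOY_SWAP_HAS_CACHE	0x40	/* flag bit, not part of the count */

/* Strip the cache flag so only the count portion of the byte remains. */
static unsigned char toy_count(unsigned char ent)
{
	return (unsigned char)(ent & ~TOY_SWAP_HAS_CACHE);
}

/* Allocator view: any non-zero count, including BAD, means "skip it". */
static bool toy_count_nonzero(unsigned char ent)
{
	return toy_count(ent) > 0;
}

/* Swapped-out view: a bad slot never holds swapped-out data. */
static bool toy_swapped_out(unsigned char ent)
{
	unsigned char count = toy_count(ent);

	return count > 0 && count != TOY_SWAP_MAP_BAD;
}
```

Both predicates agree for ordinary counts; they differ only for the bad-slot marker, which is exactly the case the commit message singles out.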
Nhat Pham	bc617c990e	mm/shmem, swap: remove SWAP_MAP_SHMEM The SWAP_MAP_SHMEM state was introduced in the commit `aaa468653b` ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a swap entry belongs to shmem during swapoff. However, swapoff has since been rewritten in the commit `b56a2d8af9` ("mm: rid swapoff of quadratic complexity"). Now having swap count == SWAP_MAP_SHMEM value is basically the same as having swap count == 1, and swap_shmem_alloc() behaves analogously to swap_duplicate(). The only difference of note is that swap_shmem_alloc() does not check for -ENOMEM returned from __swap_duplicate(), but it is OK because shmem never re-duplicates any swap entry it owns. This will still be safe if we use (batched) swap_duplicate() instead. This commit adds swap_duplicate_nr(), the batched variant of swap_duplicate(), and removes the SWAP_MAP_SHMEM state and the associated swap_shmem_alloc() helper to simplify the state machine (both mentally and in terms of actual code). We will also have an extra state/special value that can be repurposed (for swap entries that never get re-duplicated). Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-8-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Signed-off-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:55 -08:00
Kairui Song	c246d236b1	mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO Now the overhead of the swap cache is trivial to none, bypassing the swap cache is no longer a good optimization. We have removed the cache bypass swapin for anon memory, now do the same for shmem. Many helpers and functions can be dropped now. The performance may slightly drop because of the co-existence and double update of swap_map and swap table, and this problem will be improved very soon in later commits by dropping the swap_map update partially: Swapin of 24 GB file with tmpfs with transparent_hugepage_tmpfs=within_size and ZRAM, 3 test runs on my machine: Before: After this commit: After this series: 5.99s 6.29s 6.08s And later swap table phases will drop the swap_map completely to avoid overhead and reduce memory usage. Link: https://lkml.kernel.org/r/20251219195751.61328-1-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:54 -08:00
Kairui Song	d7cf0d54f2	mm, swap: rename __read_swap_cache_async to swap_cache_alloc_folio Patch series "mm, swap: swap table phase II: unify swapin use", v5. This series removes the SWP_SYNCHRONOUS_IO swap cache bypass swapin code and special swap flag bits including SWAP_HAS_CACHE, along with many historical issues. The performance is about ~20% better for some workloads, like Redis with persistence. This also cleans up the code to prepare for later phases, some patches are from a previously posted series. Swap cache bypassing and swap synchronization in general had many issues. Some are solved as workarounds, and some are still there [1]. To resolve them in a clean way, one good solution is to always use swap cache as the synchronization layer [2]. So we have to remove the swap cache bypass swap-in path first. It wasn't very doable due to performance issues, but now combined with the swap table, removing the swap cache bypass path will instead improve the performance, there is no reason to keep it. Now we can rework the swap entry and cache synchronization following the new design. Swap cache synchronization was heavily relying on SWAP_HAS_CACHE, which is the cause of many issues. By dropping the usage of special swap map bits and related workarounds, we get a cleaner code base and prepare for merging the swap count into the swap table in the next step. And swap_map is now only used for swap count, so in the next phase, swap_map can be merged into the swap table, which will clean up more things and start to reduce the static memory usage. Removal of swap_cgroup_ctrl is also doable, but needs to be done after we also simplify the allocation of swapin folios: always use the new swap_cache_alloc_folio helper so the accounting will also be managed by the swap layer by then. 
Test results:

Redis / Valkey bench:
=====================

Testing on an ARM64 VM with 1.5G memory:
Server: valkey-server --maxmemory 2560M
Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get

        no persistence              with BGSAVE
Before: 460475.84 RPS               311591.19 RPS
After:  451943.34 RPS (-1.9%)       371379.06 RPS (+19.2%)

Testing on an x86_64 VM with 4G memory (system components take about 2G):
Server:
Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get

        no persistence              with BGSAVE
Before: 306044.38 RPS               102745.88 RPS
After:  309645.44 RPS (+1.2%)       125313.28 RPS (+22.0%)

The performance is a lot better when persistence is applied. This should apply to many other workloads that involve sharing memory and COW. A slight performance drop was observed for the ARM64 Redis test: We are still using swap_map to track the swap count, which is causing redundant cache and CPU overhead and is not very performance-friendly for some arches. This will be improved once we merge the swap map into the swap table (as already demonstrated previously [3]).

vm-scalability
==============

usemem --init-time -O -y -x -n 32 1536M (16G memory, global pressure, simulated PMEM as swap), average result of 6 test runs:

                           Before:         After:
System time:               282.22s         283.47s
Sum Throughput:            5677.35 MB/s    5688.78 MB/s
Single process Throughput: 176.41 MB/s     176.23 MB/s
Free latency:              518477.96 us    521488.06 us

Which is almost identical.

Build kernel test:
==================

Test using ZRAM as SWAP, make -j48, defconfig, on an x86_64 VM with 4G RAM, under global pressure, avg of 32 test runs:

              Before:      After:
System time:  1379.91s     1364.22s (-0.11%)

Test using ZSWAP with NVME SWAP, make -j48, defconfig, on an x86_64 VM with 4G RAM, under global pressure, avg of 32 test runs:

              Before:      After:
System time:  1822.52s     1803.33s (-0.11%)

Which is almost identical.
MySQL: ====== sysbench /usr/share/sysbench/oltp_read_only.lua --tables=16 --table-size=1000000 --threads=96 --time=600 (using ZRAM as SWAP, in a 512M memory cgroup, buffer pool set to 3G, 3 test runs and 180s warm up). Before: 318162.18 qps After: 318512.01 qps (+0.01%) In conclusion, the result is better or identical for most cases, and it's especially better for workloads with swap count > 1 on SYNC_IO devices, with a gain of about 20% in the above test. Next phases will start to merge swap count into swap table and reduce memory usage. One more gain here is that we now have better support for THP swapin. Previously, THP swapin was bound to swap cache bypassing, which only works for single-mapped folios. Removing the bypassing path also enables THP swapin for all folios. THP swapin is still limited to SYNC_IO devices; the limitation can be removed later. This may cause more serious THP thrashing for certain workloads, but that's not an issue caused by this series, it's a common THP issue we should resolve separately. This patch (of 19): __read_swap_cache_async is widely used to allocate a folio and ensure it is in the swap cache, or to get the folio if one is already there. It's not async, and it's not doing any read. Rename it to better reflect its usage, and prepare it to be reworked as part of the new swap cache APIs. Also, add some comments for the function. Worth noting that the skip_if_exists argument is a long-existing workaround that will be dropped soon. 
Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-0-8862a265a033@tencent.com Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-1-8862a265a033@tencent.com Link: https://lore.kernel.org/linux-mm/CAMgjq7D5qoFEK9Omvd5_Zqs6M+TEoG03+2i_mhuP5CQPSOPrmQ@mail.gmail.com/ [1] Link: https://lore.kernel.org/linux-mm/20240326185032.72159-1-ryncsn@gmail.com/ [2] Link: https://lore.kernel.org/linux-mm/20250514201729.48420-1-ryncsn@gmail.com/ [3] Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:53 -08:00
Kevin Lourenco	62451ae347	mm: fix minor spelling mistakes in comments Correct several typos in comments across files in mm/ [akpm@linux-foundation.org: also fix comment grammar, per SeongJae] Link: https://lkml.kernel.org/r/20251218150906.25042-1-klourencodev@gmail.com Signed-off-by: Kevin Lourenco <klourencodev@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-20 19:24:48 -08:00
Al Viro	47b3b9bf93	simplify the callers of file_open_name() It accepts ERR_PTR() for name and does the right thing in that case. That allows simplifying the logic in callers, making them trivial to switch to CLASS(filename). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2026-01-13 15:18:08 -05:00
Youngjun Park	b60a3ef784	mm/swapfile: use plist_for_each_entry in __folio_throttle_swaprate The loop breaks immediately after finding the first swap device and never modifies the list. Replace plist_for_each_entry_safe() with plist_for_each_entry() and remove the unused next variable. Link: https://lkml.kernel.org/r/20251127100303.783198-3-youngjun.park@lge.com Signed-off-by: Youngjun Park <youngjun.park@lge.com> Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Barry Song <baohua@kernel.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-29 10:41:11 -08:00
Youngjun Park	f9e82f99b3	mm/swapfile: fix list iteration when next node is removed during discard Patch series "mm/swapfile: fix and cleanup swap list iterations", v2. This series fixes a potential list iteration issue in swap_sync_discard() when devices are removed, and includes a cleanup for __folio_throttle_swaprate(). This patch (of 2): When the next node is removed from the plist (e.g. by swapoff), plist_del() makes the node point to itself, causing the iteration to loop on the same entry indefinitely. Add a plist_node_empty() check to detect this case and restart iteration, allowing swap_sync_discard() to continue processing remaining swap devices that still have pending discard entries. Additionally, switch from swap_avail_lock/swap_avail_head to swap_lock/swap_active_head so that iteration is only affected by swapoff operations rather than frequent availability changes, reducing exceptional condition checks and lock contention. Link: https://lkml.kernel.org/r/20251127100303.783198-1-youngjun.park@lge.com Link: https://lkml.kernel.org/r/20251127100303.783198-2-youngjun.park@lge.com Fixes: 686ea517f471 ("mm, swap: do not perform synchronous discard during allocation") Signed-off-by: Youngjun Park <youngjun.park@lge.com> Suggested-by: Kairui Song <kasong@tencent.com> Acked-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-29 10:41:10 -08:00
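The failure mode described above (a removed node that points to itself) and the restart-on-empty fix can be reproduced with a plain circular list. This is a minimal sketch: the `toy_*` list helpers are hypothetical stand-ins for the plist primitives, and the `victim` parameter merely simulates a swapoff that removes the next node while the current one is being processed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list; a removed node points to itself. */
struct toy_node {
	struct toy_node *prev, *next;
	int pending;	/* 1 if this device still has discard work */
};

static void toy_init(struct toy_node *n)
{
	n->prev = n->next = n;
}

static bool toy_empty(const struct toy_node *n)
{
	return n->next == n;
}

static void toy_add_tail(struct toy_node *n, struct toy_node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void toy_del_init(struct toy_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	toy_init(n);
}

/*
 * Process all pending nodes and return how many were handled. If @victim
 * is the saved next node, it is removed mid-iteration. Without the
 * toy_empty() check the loop would step onto the self-pointing node and
 * never leave it.
 */
static int toy_sync_discard(struct toy_node *head, struct toy_node *victim)
{
	struct toy_node *pos, *next;
	int done = 0;

restart:
	for (pos = head->next; pos != head; pos = next) {
		next = pos->next;
		if (pos->pending) {
			pos->pending = 0;
			done++;
		}
		if (victim && next == victim) {
			toy_del_init(victim);	/* simulated swapoff */
			victim = NULL;
		}
		if (next != head && toy_empty(next))
			goto restart;	/* saved next left the list */
	}
	return done;
}
```

Restarting from the head is safe here because already processed nodes have no pending work left, so the second pass skips over them.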
Youngjun Park	b7dd80f8f9	mm: swap: remove scan_swap_map_slots() references from comments The scan_swap_map_slots() helper has been removed, but several comments still referred to it in swap allocation and reclaim paths. This patch cleans up those outdated references and reflows the affected comment blocks to match kernel coding style. Link: https://lkml.kernel.org/r/20251031065011.40863-6-youngjun.park@lge.com Signed-off-by: Youngjun Park <youngjun.park@lge.com> Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Barry Song <baohua@kernel.org> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-24 15:08:56 -08:00
Youngjun Park	4c239d5f59	mm: swap: change swap_alloc_slow() to void swap_alloc_slow() does not need to return a bool, as all callers handle allocation results via the entry parameter. Update the function signature and remove return statements accordingly. Link: https://lkml.kernel.org/r/20251031065011.40863-5-youngjun.park@lge.com Signed-off-by: Youngjun Park <youngjun.park@lge.com> Reviewed-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Barry Song <baohua@kernel.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-24 15:08:56 -08:00
Youngjun Park	68f78bf55b	mm, swap: use SWP_SOLIDSTATE to determine if swap is rotational The current non rotational check is unreliable as the device's rotational status can be changed by a user via sysfs. Use the more reliable SWP_SOLIDSTATE flag which is set at swapon time, to ensure the nr_rotate_swap count remains consistent. Plus, it is easy to read and simple. Link: https://lkml.kernel.org/r/20251031065011.40863-3-youngjun.park@lge.com Fixes: `81a0298bdf` ("mm, swap: don't use VMA based swap readahead if HDD is used as swap") Signed-off-by: Youngjun Park <youngjun.park@lge.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-24 15:08:56 -08:00
Youngjun Park	cb65082a0a	mm, swap: fix memory leak in setup_clusters() error path Patch series "mm: swap: small fixes and comment cleanups", v2. This series provides a few small fixes and cleanups for the swap code. The first patch fixes a memory leak in an error path that was recently introduced. The subsequent patches include minor logic adjustments and the removal of redundant comments. This patch (of 5): setup_clusters() could leak 'cluster_info' memory if an error occurred on a path that did not jump to the 'err_free' label. This patch simplifies the error handling by removing the goto label and instead calling free_cluster_info() on all error exit paths. The new logic is safe, as free_cluster_info() already handles NULL pointer inputs. Link: https://lkml.kernel.org/r/20251031065011.40863-1-youngjun.park@lge.com Link: https://lkml.kernel.org/r/20251031065011.40863-2-youngjun.park@lge.com Fixes: `07adc4cf1e` ("mm, swap: implement dynamic allocation of swap table") Signed-off-by: Youngjun Park <youngjun.park@lge.com> Reviewed-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-24 15:08:56 -08:00
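The error-handling pattern described above (no goto label, a NULL-tolerant free called on every failing exit) looks roughly like the following in userspace C. The structure, the `toy_*` names and the `fail_step` injection parameter are all hypothetical; the sketch only illustrates why a free routine that accepts NULL makes each exit path trivially correct.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-device cluster bookkeeping. */
struct toy_cluster_info {
	int *table;
};

static int toy_free_calls;	/* instrumentation for the example only */

/* Accepts NULL, so callers can invoke it on any error path. */
static void toy_free_cluster_info(struct toy_cluster_info *ci)
{
	toy_free_calls++;
	if (!ci)
		return;
	free(ci->table);
	free(ci);
}

/*
 * Returns NULL on failure. @fail_step injects a failure at allocation
 * step 1 or 2 (0 means no injected failure). Every failing exit calls
 * the same free routine, so no path can forget a partial allocation.
 */
static struct toy_cluster_info *toy_setup_clusters(size_t nr, int fail_step)
{
	struct toy_cluster_info *ci;

	ci = (fail_step == 1) ? NULL : calloc(1, sizeof(*ci));
	if (!ci) {
		toy_free_cluster_info(ci);
		return NULL;
	}

	ci->table = (fail_step == 2) ? NULL : calloc(nr, sizeof(*ci->table));
	if (!ci->table) {
		toy_free_cluster_info(ci);
		return NULL;
	}

	return ci;
}
```

The trade-off is a few repeated calls in exchange for not having to reason about which label each failure must jump to.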
Youngjun Park	c230719523	mm/swap: fix wrong plist empty check in swap_alloc_slow() swap_alloc_slow() was checking `si->avail_list` instead of `next->avail_list` when verifying if the next swap device is still in the list, which could cause unnecessary restarts during allocation. Link: https://lkml.kernel.org/r/20251119114136.594108-1-youngjun.park@lge.com Fixes: `8e689f8ea4` ("mm/swap: do not choose swap device according to numa node") Signed-off-by: Youngjun Park <youngjun.park@lge.com> Acked-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-24 15:08:56 -08:00
Lorenzo Stoakes	a3a3e215c9	mm: replace remaining pte_to_swp_entry() with softleaf_from_pte() There are straggler invocations of pte_to_swp_entry() lying around, replace all of these with the software leaf entry equivalent - softleaf_from_pte(). With those removed, eliminate pte_to_swp_entry() altogether. No functional change intended. Link: https://lkml.kernel.org/r/d8ee5ccefe4c42d7c4fe1a2e46f285ac40421cd3.1762812360.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Byungchul Park <byungchul@sk.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Chris Li <chrisl@kernel.org> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Gregory Price <gourry@gourry.net> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song 
<muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Nico Pache <npache@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rik van Riel <riel@surriel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Xu <weixugc@google.com> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-24 15:08:52 -08:00
Lorenzo Stoakes	06fb61462b	mm: eliminate is_swap_pte() when softleaf_from_pte() suffices In cases where we can simply utilise the fact that softleaf_from_pte() treats present entries as if they were none entries and thus eliminate spurious uses of is_swap_pte(), do so. No functional change intended. Link: https://lkml.kernel.org/r/92ebab9567978155116804c67babc3c64636c403.1762812360.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Byungchul Park <byungchul@sk.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Chris Li <chrisl@kernel.org> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Gregory Price <gourry@gourry.net> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song 
<muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Nico Pache <npache@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rik van Riel <riel@surriel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Wei Xu <weixugc@google.com> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-24 15:08:50 -08:00
Andrew Morton	87fcafc4e2	Merge branch 'mm-hotfixes-stable' into mm-stable in order to merge "mm/huge_memory: only get folio_order() once during __folio_split()" into mm-stable.	2025-11-24 15:07:34 -08:00
Youngjun Park	f5e31a196e	mm: swap: remove duplicate nr_swap_pages decrement in get_swap_page_of_type() After commit `4f78252da8`, nr_swap_pages is decremented in swap_range_alloc(). Since cluster_alloc_swap_entry() calls swap_range_alloc() internally, the decrement in get_swap_page_of_type() causes double-decrementing. As a representative userspace-visible runtime example of the impact, /proc/meminfo reports increasingly inaccurate SwapFree values. The discrepancy grows with each swap allocation, and during hibernation when large amounts of memory are written to swap, the reported value can deviate significantly from actual available swap space, misleading users and monitoring tools. Remove the duplicate decrement. Link: https://lkml.kernel.org/r/20251102082456.79807-1-youngjun.park@lge.com Fixes: `4f78252da8` ("mm: swap: move nr_swap_pages counter decrement from folio_alloc_swap() to swap_range_alloc()") Signed-off-by: Youngjun Park <youngjun.park@lge.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Kairui Song <kasong@tencent.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: <stable@vger.kernel.org> [6.17+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-24 14:25:17 -08:00
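The double decrement is easy to see in a reduced model. The functions below are hypothetical simplifications that keep only the counter updates of the call chain named in the message; they are not the kernel code.

```c
#include <assert.h>

/* Toy counter standing in for the global count of free swap pages. */
static long toy_nr_swap_pages;

/* The range allocation step owns the decrement. */
static void toy_swap_range_alloc(long nr)
{
	toy_nr_swap_pages -= nr;
}

/* The cluster allocator calls the range allocation internally. */
static void toy_cluster_alloc_entry(void)
{
	toy_swap_range_alloc(1);
}

/* Before the fix: the caller decrements again, so each page counts twice. */
static void toy_get_page_buggy(void)
{
	toy_cluster_alloc_entry();
	toy_nr_swap_pages--;
}

/* After the fix: the callee's decrement is the only one. */
static void toy_get_page_fixed(void)
{
	toy_cluster_alloc_entry();
}
```

In the buggy variant the error grows by one page per allocation, which matches the message's observation that the reported free swap drifts further from reality the more is allocated.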
Baoquan He	52f37efc59	mm/swap: select swap device with default priority round robin Swap devices are assumed to have similar access speed if no priority is specified at swapon time. It is unfair, and makes little sense, for one swap device to get a higher priority than another merely because it was swapped on first. Here, set all swap devices to have priority '-1' by default. With this change, swap devices with default priority will be selected round robin when swapping out. This can improve the swapping efficiency a lot among multiple swap devices with default priority. Below is the swapon output captured while a high pressure vm-scalability test is running:

1) This is pre-commit `a2468cc9bf`: swap devices are selected one by one by priority from high to low, moving on when one swap device is exhausted:
------------------------------------
[root@hp-dl385g10-03 ~]# swapon
NAME       TYPE      SIZE   USED PRIO
/dev/zram0 partition  16G    16G   -1
/dev/zram1 partition  16G 966.2M   -2
/dev/zram2 partition  16G     0B   -3
/dev/zram3 partition  16G     0B   -4

2) This is the behaviour with commit `a2468cc9bf`: on a node, the swap device sharing that node id is selected first until exhausted; on a node with no swap device sharing its node id, the device with the highest priority is selected until exhausted:
------------------------------------
[root@hp-dl385g10-03 ~]# swapon
NAME       TYPE      SIZE   USED PRIO
/dev/zram0 partition  16G  15.7G   -2
/dev/zram1 partition  16G   3.4G   -3
/dev/zram2 partition  16G   3.4G   -4
/dev/zram3 partition  16G   2.6G   -5

3) After this patch is applied, swap devices with default priority are selected round robin:
------------------------------------
[root@hp-dl385g10-03 block]# swapon
NAME       TYPE      SIZE   USED PRIO
/dev/zram0 partition  16G   6.6G   -1
/dev/zram1 partition  16G   6.6G   -1
/dev/zram2 partition  16G   6.6G   -1
/dev/zram3 partition  16G   6.6G   -1

With the change, there is about an 18% efficiency improvement relative to the node based way, as shown below. (Surely, the pre-commit `a2468cc9bf` way is the worst.)
vm-scability test: ================== Test with: usemem --init-time -O -y -x -n 31 2G (4G memcg, zram as swap) one by one: node based: round robin: System time: 1087.38 s 637.92 s 526.74 s (lower is better) Sum Throughput: 2036.55 MB/s 3546.56 MB/s 4207.56 MB/s (higher is better) Single process Throughput: 65.69 MB/s 114.40 MB/s 135.72 MB/s (high is better) free latency: 15769409.48 us 10138455.99 us 6810119.01 us(lower is better) Link: https://lkml.kernel.org/r/20251028034308.929550-3-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Suggested-by: Chris Li <chrisl@kernel.org> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Barry Song <baohua@kernel.org> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-16 17:28:27 -08:00
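The difference between "one by one" and "round robin" selection in 52f37efc59 can be sketched with a priority-sorted list in which a chosen device is moved behind its equal-priority peers. This is a simplified userspace model under stated assumptions, not the kernel implementation: the function `pick_device` and the list layout are hypothetical, and device exhaustion is not modelled.

```python
def pick_device(avail):
    """Pick the front device and requeue it behind its equal-priority peers.

    avail: list of (name, priority) tuples, sorted by priority, highest first.
    The list is modified in place; the chosen device name is returned.
    """
    name, prio = avail.pop(0)
    # Find the first entry with a strictly lower priority; the chosen device
    # is reinserted just before it, i.e. after all peers of equal priority.
    idx = 0
    while idx < len(avail) and avail[idx][1] >= prio:
        idx += 1
    avail.insert(idx, (name, prio))
    return name


# All devices share the default priority -1: selection rotates.
equal = [("zram0", -1), ("zram1", -1), ("zram2", -1), ("zram3", -1)]
equal_picks = [pick_device(equal) for _ in range(8)]

# Distinct priorities -1..-4: the highest-priority device is always chosen.
distinct = [("zram0", -1), ("zram1", -2), ("zram2", -3), ("zram3", -4)]
distinct_picks = [pick_device(distinct) for _ in range(4)]
```

With equal priorities the picks cycle through zram0 to zram3 and repeat, which corresponds to the even USED column in the third swapon listing. With distinct priorities zram0 is picked every time, matching the first listing, where zram0 fills up before zram1 is touched.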
Baoquan He	8e689f8ea4	mm/swap: do not choose swap device according to numa node

Patch series "mm/swapfile.c: select swap devices of default priority round robin", v5.

Currently, on a system with multiple swap devices, swap allocation selects a swap device according to priority. The swap device with the highest priority is chosen to allocate from first. A priority from 0 to 32767 can be specified when swapping on a device; otherwise the system assigns priorities from -2 downwards by default. Meanwhile, on a NUMA system, a swap device with a node_id is considered first on the NUMA node with that node_id.

In the current code, an array of plists, swap_avail_heads[nid], is used to organize swap devices on each NUMA node. For each NUMA node, there is a plist organizing all swap devices. The 'prio' value in the plist is the negated value of the device's priority, because a plist is sorted from low to high. The swap device owning a node_id is promoted to the front position on that NUMA node, and the other swap devices follow in order of their default priority.

For example, take a system with 8 NUMA nodes and 4 zram partitions set up as swap devices.

Current behaviour:
their priorities will be (note that -1 is skipped):
NAME       TYPE      SIZE USED PRIO
/dev/zram0 partition  16G   0B   -2
/dev/zram1 partition  16G   0B   -3
/dev/zram2 partition  16G   0B   -4
/dev/zram3 partition  16G   0B   -5

And their positions in the 8 swap_avail_lists[nid] will be:
swap_avail_lists[0]: /* node 0's available swap device list */
zram0   -> zram1   -> zram2   -> zram3
prio:1     prio:3     prio:4     prio:5
swap_avail_lists[1]: /* node 1's available swap device list */
zram1   -> zram0   -> zram2   -> zram3
prio:1     prio:2     prio:4     prio:5
swap_avail_lists[2]: /* node 2's available swap device list */
zram2   -> zram0   -> zram1   -> zram3
prio:1     prio:2     prio:3     prio:5
swap_avail_lists[3]: /* node 3's available swap device list */
zram3   -> zram0   -> zram1   -> zram2
prio:1     prio:2     prio:3     prio:4
swap_avail_lists[4-7]: /* node 4,5,6,7's available swap device list */
zram0   -> zram1   -> zram2   -> zram3
prio:2     prio:3     prio:4     prio:5

The adjustment for swap devices with a node_id was intended to reduce lock contention on any single swap device by using different swap devices on different nodes. The adjustment was introduced in commit `a2468cc9bf` ("swap: choose swap device according to numa node").

However, the adjustment is a little coarse-grained. On a node, the swap device sharing the node's id is always selected first by that node's CPUs until it is exhausted, then the next one. On other nodes, where no swap device shares the node id, the swap device with priority '-2' is selected first until exhausted, then the one with priority '-3'.

This is the swapon output captured while a high-pressure vm-scalability test was running. It clearly shows zram0 being heavily used until exhausted.

===================================
[root@hp-dl385g10-03 ~]# swapon
NAME       TYPE      SIZE  USED PRIO
/dev/zram0 partition  16G 15.7G   -2
/dev/zram1 partition  16G  3.4G   -3
/dev/zram2 partition  16G  3.4G   -4
/dev/zram3 partition  16G  2.6G   -5

The node-based strategy for selecting a swap device is much better than the old way of selecting swap devices one by one. However, it is still unreasonable, because swap devices are assumed to have similar access speed when no priority is specified at swapon time. It is unfair, and makes no sense, for one swap device to get a higher priority than another merely because it was swapped on first.

So this patchset changes the selection to round robin among swap devices with default priority. In code, the plist array swap_avail_heads[nid] is replaced with a single plist swap_avail_head, which reverts commit `a2468cc9bf`. On top of the revert, a further change makes every device without a specified priority get the same default priority '-1'. Swap devices with a specified priority are still always placed first; this is not affected. If you care about differences in access speed, use 'swapon -p xx' to assign priorities to your swap devices.

New behaviour:

swap_avail_list: /* one global available swap device list */
zram0   -> zram1   -> zram2   -> zram3
prio:1     prio:1     prio:1     prio:1

This is the swapon output captured while the high-pressure vm-scalability test was running; all devices are selected round robin:
=======================================
[root@hp-dl385g10-03 linux]# swapon
NAME       TYPE      SIZE  USED PRIO
/dev/zram0 partition  16G 12.6G   -1
/dev/zram1 partition  16G 12.6G   -1
/dev/zram2 partition  16G 12.6G   -1
/dev/zram3 partition  16G 12.6G   -1

With the change, we see an efficiency improvement of about 18%, as below:

vm-scalability test:
==================
Test with:
usemem --init-time -O -y -x -n 31 2G (4G memcg, zram as swap)
                            Before:          After:
System time:                637.92 s         526.74 s        (lower is better)
Sum Throughput:             3546.56 MB/s     4207.56 MB/s    (higher is better)
Single process Throughput:  114.40 MB/s      135.72 MB/s     (higher is better)
free latency:               10138455.99 us   6810119.01 us   (lower is better)

This patch (of 2):

This reverts commit `a2468cc9bf` ("swap: choose swap device according to numa node").

After this patch, the behaviour changes back to that of pre-commit `a2468cc9bf`. That is, priorities are assigned from -1 downwards by default, and when swapping, swap devices are exhausted one by one according to priority from high to low. This is preparation work for a later change.

[root@hp-dl385g10-03 ~]# swapon
NAME       TYPE      SIZE   USED PRIO
/dev/zram0 partition  16G    16G   -1
/dev/zram1 partition  16G 966.2M   -2
/dev/zram2 partition  16G     0B   -3
/dev/zram3 partition  16G     0B   -4

Link: https://lkml.kernel.org/r/20251028034308.929550-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20251028034308.929550-2-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Suggested-by: Chris Li <chrisl@kernel.org>
Acked-by: Chris Li <chrisl@kernel.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Kairui Song <kasong@tencent.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-16 17:28:27 -08:00
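The per-node ordering described in 8e689f8ea4 (plist 'prio' is the negated device priority, and a device owning the node's id is promoted to prio 1) can be reproduced with a small model. This is a sketch only: the function `node_list` is hypothetical, and the mapping of zram0..zram3 to nodes 0..3 is an assumption inferred from the per-node lists quoted in the message.

```python
def node_list(devices, nid):
    """Build one node's available-device list in the pre-revert scheme.

    devices: list of (name, priority, node_id) tuples.
    Returns [(name, plist_prio)] sorted from low to high plist_prio.
    """
    entries = []
    for name, priority, node_id in devices:
        # A device on this node is promoted to the front with prio 1;
        # otherwise its plist prio is the negated swap priority.
        plist_prio = 1 if node_id == nid else -priority
        entries.append((name, plist_prio))
    # sorted() is stable, so ties keep their original order.
    return sorted(entries, key=lambda entry: entry[1])


# Assumed layout: default priorities -2..-5, one device on each of nodes 0..3.
devices = [("zram0", -2, 0), ("zram1", -3, 1), ("zram2", -4, 2), ("zram3", -5, 3)]

list_node1 = node_list(devices, 1)
list_node4 = node_list(devices, 4)
```

Under these assumptions `list_node1` starts with zram1 at prio 1 followed by zram0, zram2 and zram3, and `list_node4` keeps plain priority order starting with zram0 at prio 2, matching the swap_avail_lists[1] and swap_avail_lists[4-7] entries in the message.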
Kairui Song	4fd58b51ef	mm, swap: remove redundant argument for isolating a cluster The order argument was introduced by an intermediate commit and was never used afterwards, so remove it. Link: https://lkml.kernel.org/r/20251024-swap-clean-after-swap-table-p1-v2-5-a709469052e7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-16 17:28:20 -08:00
Kairui Song	a983471cfc	mm, swap: cleanup swap entry allocation parameter We no longer need this GFP parameter after commit `8578e0c00d` ("mm, swap: use the swap table for the swap cache and switch API"). Before that commit, the GFP parameter was already almost identical for all callers, so that commit changed nothing in practice. The swap table just moved the GFP handling to a lower layer and made it better defined, with the flags depending on whether the allocation is atomic or may sleep. This parameter is now unused, so remove it. No behavior change. Link: https://lkml.kernel.org/r/20251024-swap-clean-after-swap-table-p1-v2-3-a709469052e7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-16 17:28:20 -08:00
Kairui Song	e4adea27b9	mm, swap: rename helper for setup bad slots The name inc_cluster_info_page is very confusing, as this helper is only used during swapon to mark bad slots. Rename it properly and turn the VM_BUG_ON in it into WARN_ON to expose more potential issues. Swapon is a cold path, so adding more checks should be a good idea. No feature change except new WARN_ON. Link: https://lkml.kernel.org/r/20251024-swap-clean-after-swap-table-p1-v2-2-a709469052e7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-16 17:28:20 -08:00

686 Commits