mirror of
https://github.com/torvalds/linux.git
synced 2026-05-12 16:18:45 +02:00
master
309 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
334fbe734e |
mm.git review status for linus..mm-stable
Everything: Total patches: 368 Reviews/patch: 1.56 Reviewed rate: 74% Excluding DAMON: Total patches: 316 Reviews/patch: 1.77 Reviewed rate: 81% Excluding DAMON and zram: Total patches: 306 Reviews/patch: 1.81 Reviewed rate: 82% Excluding DAMON, zram and maple_tree: Total patches: 276 Reviews/patch: 2.01 Reviewed rate: 91% Significant patch series in this merge: - The 30 patch series "maple_tree: Replace big node with maple copy" from Liam Howlett is mainly prepararatory work for ongoing development but it does reduce stack usage and is an improvement. - The 12 patch series "mm, swap: swap table phase III: remove swap_map" from Kairui Song offers memory savings by removing the static swap_map. It also yields some CPU savings and implements several cleanups. - The 2 patch series "mm: memfd_luo: preserve file seals" from Pratyush Yadav adds file seal preservation to LUO's memfd code. - The 2 patch series "mm: zswap: add per-memcg stat for incompressible pages" from Jiayuan Chen adds additional userspace stats reportng to zswap. - The 4 patch series "arch, mm: consolidate empty_zero_page" from Mike Rapoport implements some cleanups for our handling of ZERO_PAGE() and zero_pfn. - The 2 patch series "mm/kmemleak: Improve scan_should_stop() implementation" from Zhongqiu Han provides an robustness improvement and some cleanups in the kmemleak code. - The 4 patch series "Improve khugepaged scan logic" from Vernon Yang "improves the khugepaged scan logic and reduces CPU consumption by prioritizing scanning tasks that access memory frequently". - The 2 patch series "Make KHO Stateless" from Jason Miu simplifies Kexec Handover by "transitioning KHO from an xarray-based metadata tracking system with serialization to a radix tree data structure that can be passed directly to the next kernel" - The 3 patch series "mm: vmscan: add PID and cgroup ID to vmscan tracepoints" from Thomas Ballasi and Steven Rostedt enhances vmscan's tracepointing. - The 5 patch series "mm: arch/shstk: Common shadow stack mapping helper and VM_NOHUGEPAGE" from Catalin Marinas is a cleanup for the shadow stack code: remove per-arch code in favour of a generic implementation. - The 2 patch series "Fix KASAN support for KHO restored vmalloc regions" from Pasha Tatashin fixes a WARN() which can be emitted the KHO restores a vmalloc area. - The 4 patch series "mm: Remove stray references to pagevec" from Tal Zussman provides several cleanups, mainly udpating references to "struct pagevec", which became folio_batch three years ago. - The 17 patch series "mm: Eliminate fake head pages from vmemmap optimization" from Kiryl Shutsemau simplifies the HugeTLB vmemmap optimization (HVO) by changing how tail pages encode their relationship to the head page. - The 2 patch series "mm/damon/core: improve DAMOS quota efficiency for core layer filters" from SeongJae Park improves two problematic behaviors of DAMOS that makes it less efficient when core layer filters are used. - The 3 patch series "mm/damon: strictly respect min_nr_regions" from SeongJae Park improves DAMON usability by extending the treatment of the min_nr_regions user-settable parameter. - The 3 patch series "mm/page_alloc: pcp locking cleanup" from Vlastimil Babka is a proper fix for a previously hotfixed SMP=n issue. Code simplifications and cleanups ennsed. - The 16 patch series "mm: cleanups around unmapping / zapping" from David Hildenbrand implements "a bunch of cleanups around unmapping and zapping. Mostly simplifications, code movements, documentation and renaming of zapping functions". - The 6 patch series "support batched checking of the young flag for MGLRU" from Baolin Wang supports batched checking of the young flag for MGLRU. It's part cleanups; one benchmark shows large performance benefits for arm64. - The 5 patch series "memcg: obj stock and slab stat caching cleanups" from Johannes Weiner provides memcg cleanup and robustness improvements. - The 5 patch series "Allow order zero pages in page reporting" from Yuvraj Sakshith enhances page_reporting's free page reporting - it is presently and undesirably order-0 pages when reporting free memory. - The 6 patch series "mm: vma flag tweaks" from Lorenzo Stoakes is cleanup work following from the recent conversion of the VMA flags to a bitmap. - The 10 patch series "mm/damon: add optional debugging-purpose sanity checks" from SeongJae Park adds some more developer-facing debug checks into DAMON core. - The 2 patch series "mm/damon: test and document power-of-2 min_region_sz requirement" from SeongJae Park adds an additional DAMON kunit test and makes some adjustments to the addr_unit parameter handling. - The 3 patch series "mm/damon/core: make passed_sample_intervals comparisons overflow-safe" from SeongJae Park fixes a hard-to-hit time overflow issue in DAMON core. - The 7 patch series "mm/damon: improve/fixup/update ratio calculation, test and documentation" from SeongJae Park is a "batch of misc/minor improvements and fixups" for DAMON. - The 4 patch series "mm: move vma_(kernel|mmu)_pagesize() out of hugetlb.c" from David Hildenbrand fixes a possible issue with dax-device when CONFIG_HUGETLB=n. Some code movement was required. - The 6 patch series "zram: recompression cleanups and tweaks" from Sergey Senozhatsky provides "a somewhat random mix of fixups, recompression cleanups and improvements" in the zram code. - The 11 patch series "mm/damon: support multiple goal-based quota tuning algorithms" from SeongJae Park extend DAMOS quotas goal auto-tuning to support multiple tuning algorithms that users can select. - The 4 patch series "mm: thp: reduce unnecessary start_stop_khugepaged()" from Breno Leitao fixes the khugpaged sysfs handling so we no longer spam the logs with reams of junk when starting/stopping khugepaged. - The 3 patch series "mm: improve map count checks" from Lorenzo Stoakes provides some cleanups and slight fixes in the mremap, mmap and vma code. - The 5 patch series "mm/damon: support addr_unit on default monitoring targets for modules" from SeongJae Park extends the use of DAMON core's addr_unit tunable. - The 5 patch series "mm: khugepaged cleanups and mTHP prerequisites" from Nico Pache provides cleanups in the khugepaged and is a base for Nico's planned khugepaged mTHP support. - The 15 patch series "mm: memory hot(un)plug and SPARSEMEM cleanups" from David Hildenbrand implements code movement and cleanups in the memhotplug and sparsemem code. - The 2 patch series "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and cleanup CONFIG_MIGRATION" from David Hildenbrand rationalizes some memhotplug Kconfig support. - The 6 patch series "change young flag check functions to return bool" from Baolin Wang is "a cleanup patchset to change all young flag check functions to return bool". - The 3 patch series "mm/damon/sysfs: fix memory leak and NULL dereference issues" from Josh Law and SeongJae Park fixes a few potential DAMON bugs. - The 25 patch series "mm/vma: convert vm_flags_t to vma_flags_t in vma code" from "converts a lot of the existing use of the legacy vm_flags_t data type to the new vma_flags_t type which replaces it". Mainly in the vma code. - The 21 patch series "mm: expand mmap_prepare functionality and usage" from Lorenzo Stoakes "expands the mmap_prepare functionality, which is intended to replace the deprecated f_op->mmap hook which has been the source of bugs and security issues for some time". Cleanups, documentation, extension of mmap_prepare into filesystem drivers. - The 13 patch series "mm/huge_memory: refactor zap_huge_pmd()" from Lorenzo Stoakes simplifies and cleans up zap_huge_pmd(). Additional cleanups around vm_normal_folio_pmd() and the softleaf functionality are performed. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCad3HDQAKCRDdBJ7gKXxA jrUQAPwNhPk5nPSxnyxjAeQtOBHqgCdnICeEismLajPKd9aYRgEA0s2XAu3tSUYi GrBnWImHG3s4ePQxVcPCegWTsOUrXgQ= =1Q7o -----END PGP SIGNATURE----- Merge tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "maple_tree: Replace big node with maple copy" (Liam Howlett) Mainly prepararatory work for ongoing development but it does reduce stack usage and is an improvement. - "mm, swap: swap table phase III: remove swap_map" (Kairui Song) Offers memory savings by removing the static swap_map. It also yields some CPU savings and implements several cleanups. - "mm: memfd_luo: preserve file seals" (Pratyush Yadav) File seal preservation to LUO's memfd code - "mm: zswap: add per-memcg stat for incompressible pages" (Jiayuan Chen) Additional userspace stats reportng to zswap - "arch, mm: consolidate empty_zero_page" (Mike Rapoport) Some cleanups for our handling of ZERO_PAGE() and zero_pfn - "mm/kmemleak: Improve scan_should_stop() implementation" (Zhongqiu Han) A robustness improvement and some cleanups in the kmemleak code - "Improve khugepaged scan logic" (Vernon Yang) Improve khugepaged scan logic and reduce CPU consumption by prioritizing scanning tasks that access memory frequently - "Make KHO Stateless" (Jason Miu) Simplify Kexec Handover by transitioning KHO from an xarray-based metadata tracking system with serialization to a radix tree data structure that can be passed directly to the next kernel - "mm: vmscan: add PID and cgroup ID to vmscan tracepoints" (Thomas Ballasi and Steven Rostedt) Enhance vmscan's tracepointing - "mm: arch/shstk: Common shadow stack mapping helper and VM_NOHUGEPAGE" (Catalin Marinas) Cleanup for the shadow stack code: remove per-arch code in favour of a generic implementation - "Fix KASAN support for KHO restored vmalloc regions" (Pasha Tatashin) Fix a WARN() which can be emitted the KHO restores a vmalloc area - "mm: Remove stray references to pagevec" (Tal Zussman) Several cleanups, mainly udpating references to "struct pagevec", which became folio_batch three years ago - "mm: Eliminate fake head pages from vmemmap optimization" (Kiryl Shutsemau) Simplify the HugeTLB vmemmap optimization (HVO) by changing how tail pages encode their relationship to the head page - "mm/damon/core: improve DAMOS quota efficiency for core layer filters" (SeongJae Park) Improve two problematic behaviors of DAMOS that makes it less efficient when core layer filters are used - "mm/damon: strictly respect min_nr_regions" (SeongJae Park) Improve DAMON usability by extending the treatment of the min_nr_regions user-settable parameter - "mm/page_alloc: pcp locking cleanup" (Vlastimil Babka) The proper fix for a previously hotfixed SMP=n issue. Code simplifications and cleanups ensued - "mm: cleanups around unmapping / zapping" (David Hildenbrand) A bunch of cleanups around unmapping and zapping. Mostly simplifications, code movements, documentation and renaming of zapping functions - "support batched checking of the young flag for MGLRU" (Baolin Wang) Batched checking of the young flag for MGLRU. It's part cleanups; one benchmark shows large performance benefits for arm64 - "memcg: obj stock and slab stat caching cleanups" (Johannes Weiner) memcg cleanup and robustness improvements - "Allow order zero pages in page reporting" (Yuvraj Sakshith) Enhance free page reporting - it is presently and undesirably order-0 pages when reporting free memory. - "mm: vma flag tweaks" (Lorenzo Stoakes) Cleanup work following from the recent conversion of the VMA flags to a bitmap - "mm/damon: add optional debugging-purpose sanity checks" (SeongJae Park) Add some more developer-facing debug checks into DAMON core - "mm/damon: test and document power-of-2 min_region_sz requirement" (SeongJae Park) An additional DAMON kunit test and makes some adjustments to the addr_unit parameter handling - "mm/damon/core: make passed_sample_intervals comparisons overflow-safe" (SeongJae Park) Fix a hard-to-hit time overflow issue in DAMON core - "mm/damon: improve/fixup/update ratio calculation, test and documentation" (SeongJae Park) A batch of misc/minor improvements and fixups for DAMON - "mm: move vma_(kernel|mmu)_pagesize() out of hugetlb.c" (David Hildenbrand) Fix a possible issue with dax-device when CONFIG_HUGETLB=n. Some code movement was required. - "zram: recompression cleanups and tweaks" (Sergey Senozhatsky) A somewhat random mix of fixups, recompression cleanups and improvements in the zram code - "mm/damon: support multiple goal-based quota tuning algorithms" (SeongJae Park) Extend DAMOS quotas goal auto-tuning to support multiple tuning algorithms that users can select - "mm: thp: reduce unnecessary start_stop_khugepaged()" (Breno Leitao) Fix the khugpaged sysfs handling so we no longer spam the logs with reams of junk when starting/stopping khugepaged - "mm: improve map count checks" (Lorenzo Stoakes) Provide some cleanups and slight fixes in the mremap, mmap and vma code - "mm/damon: support addr_unit on default monitoring targets for modules" (SeongJae Park) Extend the use of DAMON core's addr_unit tunable - "mm: khugepaged cleanups and mTHP prerequisites" (Nico Pache) Cleanups to khugepaged and is a base for Nico's planned khugepaged mTHP support - "mm: memory hot(un)plug and SPARSEMEM cleanups" (David Hildenbrand) Code movement and cleanups in the memhotplug and sparsemem code - "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and cleanup CONFIG_MIGRATION" (David Hildenbrand) Rationalize some memhotplug Kconfig support - "change young flag check functions to return bool" (Baolin Wang) Cleanups to change all young flag check functions to return bool - "mm/damon/sysfs: fix memory leak and NULL dereference issues" (Josh Law and SeongJae Park) Fix a few potential DAMON bugs - "mm/vma: convert vm_flags_t to vma_flags_t in vma code" (Lorenzo Stoakes) Convert a lot of the existing use of the legacy vm_flags_t data type to the new vma_flags_t type which replaces it. Mainly in the vma code. - "mm: expand mmap_prepare functionality and usage" (Lorenzo Stoakes) Expand the mmap_prepare functionality, which is intended to replace the deprecated f_op->mmap hook which has been the source of bugs and security issues for some time. Cleanups, documentation, extension of mmap_prepare into filesystem drivers - "mm/huge_memory: refactor zap_huge_pmd()" (Lorenzo Stoakes) Simplify and clean up zap_huge_pmd(). Additional cleanups around vm_normal_folio_pmd() and the softleaf functionality are performed. * tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits) mm: fix deferred split queue races during migration mm/khugepaged: fix issue with tracking lock mm/huge_memory: add and use has_deposited_pgtable() mm/huge_memory: add and use normal_or_softleaf_folio_pmd() mm: add softleaf_is_valid_pmd_entry(), pmd_to_softleaf_folio() mm/huge_memory: separate out the folio part of zap_huge_pmd() mm/huge_memory: use mm instead of tlb->mm mm/huge_memory: remove unnecessary sanity checks mm/huge_memory: deduplicate zap deposited table call mm/huge_memory: remove unnecessary VM_BUG_ON_PAGE() mm/huge_memory: add a common exit path to zap_huge_pmd() mm/huge_memory: handle buggy PMD entry in zap_huge_pmd() mm/huge_memory: have zap_huge_pmd return a boolean, add kdoc mm/huge: avoid big else branch in zap_huge_pmd() mm/huge_memory: simplify vma_is_specal_huge() mm: on remap assert that input range within the proposed VMA mm: add mmap_action_map_kernel_pages[_full]() uio: replace deprecated mmap hook with mmap_prepare in uio_info drivers: hv: vmbus: replace deprecated mmap hook with mmap_prepare mm: allow handling of stacked mmap_prepare hooks in more drivers ... |
||
|
|
8231e4c040 |
mm/slab: use compound_head() in page_slab()
page_slab() contained an open-coded implementation of compound_head(). Replace the duplicated code with a direct call to compound_head(). Link: https://lkml.kernel.org/r/20260227194302.274384-19-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
8c846c879e |
mm: rework compound_head() for power-of-2 sizeof(struct page)
For tail pages, the kernel uses the 'compound_info' field to get to the head page. The bit 0 of the field indicates whether the page is a tail page, and if set, the remaining bits represent a pointer to the head page. For cases when size of struct page is power-of-2, change the encoding of compound_info to store a mask that can be applied to the virtual address of the tail page in order to access the head page. It is possible because struct page of the head page is naturally aligned with regards to order of the page. The significant impact of this modification is that all tail pages of the same order will now have identical 'compound_info', regardless of the compound page they are associated with. This paves the way for eliminating fake heads. The HugeTLB Vmemmap Optimization (HVO) creates fake heads and it is only applied when the sizeof(struct page) is power-of-2. Having identical tail pages allows the same page to be mapped into the vmemmap of all pages, maintaining memory savings without fake heads. If sizeof(struct page) is not power-of-2, there is no functional changes. Limit mask usage to HugeTLB vmemmap optimization (HVO) where it makes a difference. The approach with mask would work in the wider set of conditions, but it requires validating that struct pages are naturally aligned for all orders up to the MAX_FOLIO_ORDER, which can be tricky. Link: https://lkml.kernel.org/r/20260227194302.274384-8-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Usama Arif <usamaarif642@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
d50569612c |
mm: rename the 'compound_head' field in the 'struct page' to 'compound_info'
The 'compound_head' field in the 'struct page' encodes whether the page is a tail and where to locate the head page. Bit 0 is set if the page is a tail, and the remaining bits in the field point to the head page. As preparation for changing how the field encodes information about the head page, rename the field to 'compound_info'. Link: https://lkml.kernel.org/r/20260227194302.274384-4-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand (arm) <david@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
5ba6bc27b1 |
slab: decouple pointer to barn from kmem_cache_node
The pointer to barn currently exists in struct kmem_cache_node. That struct is instantiated for every NUMA node with memory, but we want to have a barn for every online node (including memoryless). Thus decouple the two structures. In struct kmem_cache we have an array for kmem_cache_node pointers that appears to be sized MAX_NUMNODES but the actual size calculation in kmem_cache_init() uses nr_node_ids. Therefore we can't just add another array of barn pointers. Instead change the array to newly introduced struct kmem_cache_per_node_ptrs holding both kmem_cache_node and barn pointer. Adjust barn accessor and allocation/initialization code accordingly. For now no functional change intended, barns are created 1:1 together with kmem_cache_nodes. Link: https://patch.msgid.link/20260311-b4-slab-memoryless-barns-v1-1-70ab850be4ce@kernel.org Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Hao Li <hao.li@linux.dev> |
||
|
|
6432f15c81 |
mm/slab: change stride type from unsigned short to unsigned int
Commit |
||
|
|
f3ec502b67 |
mm/slab: mark alloc tags empty for sheaves allocated with __GFP_NO_OBJ_EXT
alloc_empty_sheaf() allocates sheaves from SLAB_KMALLOC caches using __GFP_NO_OBJ_EXT to avoid recursion, however it does not mark their allocation tags empty before freeing, which results in a warning when CONFIG_MEM_ALLOC_PROFILING_DEBUG is set. Fix this by marking allocation tags for such sheaves as empty. The problem was technically introduced in commit |
||
|
|
815c8e3551 |
Merge branch 'slab/for-7.0/sheaves' into slab/for-next
Merge series "slab: replace cpu (partial) slabs with sheaves". The percpu sheaves caching layer was introduced as opt-in but the goal was to eventually move all caches to them. This is the next step, enabling sheaves for all caches (except the two bootstrap ones) and then removing the per cpu (partial) slabs and lots of associated code. Besides the lower locking overhead and much more likely fastpath when freeing, this removes the rather complicated code related to the cpu slab lockless fastpaths (using this_cpu_try_cmpxchg128/64) and all its complications for PREEMPT_RT or kmalloc_nolock(). The lockless slab freelist+counters update operation using try_cmpxchg128/64 remains and is crucial for freeing remote NUMA objects and to allow flushing objects from sheaves to slabs mostly without the node list_lock. Link: https://lore.kernel.org/all/20260123-sheaves-for-all-v4-0-041323d506f7@suse.cz/ |
||
|
|
2f35fee943 |
mm/slab: only allow SLAB_OBJ_EXT_IN_OBJ for unmergeable caches
While SLAB_OBJ_EXT_IN_OBJ allows to reduce memory overhead to account slab objects, it prevents slab merging because merging can change the metadata layout. As pointed out Vlastimil Babka, disabling merging solely for this memory optimization may not be a net win, because disabling slab merging tends to increase overall memory usage. Restrict SLAB_OBJ_EXT_IN_OBJ to caches that are already unmergeable for other reasons (e.g., those with constructors or SLAB_TYPESAFE_BY_RCU). Suggested-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260127103151.21883-3-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
fab0694646 |
mm/slab: move [__]ksize and slab_ksize() to mm/slub.c
To access SLUB's internal implementation details beyond cache flags in ksize(), move __ksize(), ksize(), and slab_ksize() to mm/slub.c. [vbabka@suse.cz: also make __ksize() static and move its kerneldoc to ksize() ] Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260113061845.159790-9-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
4b1530f89c |
mm/memcontrol,alloc_tag: handle slabobj_ext access under KASAN poison
In the near future, slabobj_ext may reside outside the allocated slab
object range within a slab, which could be reported as an out-of-bounds
access by KASAN.
As suggested by Andrey Konovalov [1], explicitly disable KASAN and KMSAN
checks when accessing slabobj_ext within slab allocator, memory profiling,
and memory cgroup code. While an alternative approach could be to unpoison
slabobj_ext, out-of-bounds accesses outside the slab allocator are
generally more common.
Move metadata_access_enable()/disable() helpers to mm/slab.h so that
it can be used outside mm/slub.c. However, as suggested by Suren
Baghdasaryan [2], instead of calling them directly from mm code (which is
more prone to errors), change users to access slabobj_ext via get/put
APIs:
- Users should call get_slab_obj_exts() to access slabobj_metadata
and call put_slab_obj_exts() when it's done.
- From now on, accessing it outside the section covered by
get_slab_obj_exts() ~ put_slab_obj_exts() is illegal.
This ensures that accesses to slabobj_ext metadata won't be reported
as access violations.
Call kasan_reset_tag() in slab_obj_ext() before returning the address to
prevent SW or HW tag-based KASAN from reporting false positives.
Suggested-by: Andrey Konovalov <andreyknvl@gmail.com>
Suggested-by: Suren Baghdasaryan <surenb@google.com>
Link: https://lore.kernel.org/linux-mm/CA+fCnZezoWn40BaS3cgmCeLwjT+5AndzcQLc=wH3BjMCu6_YCw@mail.gmail.com [1]
Link: https://lore.kernel.org/linux-mm/CAJuCfpG=Lb4WhYuPkSpdNO4Ehtjm1YcEEK0OM=3g9i=LxmpHSQ@mail.gmail.com [2]
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20260113061845.159790-7-harry.yoo@oracle.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
||
|
|
7a8e71bc61 |
mm/slab: use stride to access slabobj_ext
Use a configurable stride value when accessing slab object extension metadata instead of assuming a fixed sizeof(struct slabobj_ext). Store stride value in free bits of slab->counters field. This allows for flexibility in cases where the extension is embedded within slab objects. Since these free bits exist only on 64-bit, any future optimizations that need to change stride value cannot be enabled on 32-bit architectures. Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260113061845.159790-6-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
52f1ca8a45 |
mm/slab: abstract slabobj_ext access via new slab_obj_ext() helper
Currently, the slab allocator assumes that slab->obj_exts is a pointer to an array of struct slabobj_ext objects. However, to support storage methods where struct slabobj_ext is embedded within objects, the slab allocator should not make this assumption. Instead of directly dereferencing the slabobj_exts array, abstract access to struct slabobj_ext via helper functions. Introduce a new API slabobj_ext metadata access: slab_obj_ext(slab, obj_exts, index) - returns the pointer to struct slabobj_ext element at the given index. Directly dereferencing the return value of slab_obj_exts() is no longer allowed. Instead, slab_obj_ext() must always be used to access individual struct slabobj_ext objects. Convert all users to use these APIs. No functional changes intended. Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260113061845.159790-5-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
32c894c727 |
slab: remove struct kmem_cache_cpu
The cpu slab is not used anymore for allocation or freeing, the remaining code is for flushing, but it's effectively dead. Remove the whole struct kmem_cache_cpu, the flushing code and other orphaned functions. The remaining used field of kmem_cache_cpu is the stat array with CONFIG_SLUB_STATS. Put it instead in a new struct kmem_cache_stats. In struct kmem_cache, the field is cpu_stats and placed near the end of the struct. Reviewed-by: Hao Li <hao.li@linux.dev> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
073d5f1562 |
slab: simplify kmalloc_nolock()
The kmalloc_nolock() implementation has several complications and
restrictions due to SLUB's cpu slab locking, lockless fastpath and
PREEMPT_RT differences. With cpu slab usage removed, we can simplify
things:
- relax the PREEMPT_RT context checks as they were before commit
|
||
|
|
ab2f752ac3 |
slab: remove defer_deactivate_slab()
There are no more cpu slabs so we don't need their deferred deactivation. The function is now only used from places where we allocate a new slab but then can't spin on node list_lock to put it on the partial list. Instead of the deferred action we can free it directly via __free_slab(), we just need to tell it to use _nolock() freeing of the underlying pages and take care of the accounting. Since free_frozen_pages_nolock() variant does not yet exist for code outside of the page allocator, create it as a trivial wrapper for __free_frozen_pages(..., FPI_TRYLOCK). Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Hao Li <hao.li@linux.dev> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
e323b52cf0 |
slab: remove SLUB_CPU_PARTIAL
We have removed the partial slab usage from allocation paths. Now remove the whole config option and associated code. Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Hao Li <hao.li@linux.dev> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
f3421f8d15 |
slab: introduce percpu sheaves bootstrap
Until now, kmem_cache->cpu_sheaves was !NULL only for caches with sheaves enabled. Since we want to enable them for almost all caches, it's suboptimal to test the pointer in the fast paths, so instead allocate it for all caches in do_kmem_cache_create(). Instead of testing the cpu_sheaves pointer to recognize caches (yet) without sheaves, test kmem_cache->sheaf_capacity for being 0, where needed, using a new cache_has_sheaves() helper. However, for the fast paths sake we also assume that the main sheaf always exists (pcs->main is !NULL), and during bootstrap we cannot allocate sheaves yet. Solve this by introducing a single static bootstrap_sheaf that's assigned as pcs->main during bootstrap. It has a size of 0, so during allocations, the fast path will find it's empty. Since the size of 0 matches sheaf_capacity of 0, the freeing fast paths will find it's "full". In the slow path handlers, we use cache_has_sheaves() to recognize that the cache doesn't (yet) have real sheaves, and fall back. Thus sharing the single bootstrap sheaf like this for multiple caches and cpus is safe. Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Hao Li <hao.li@linux.dev> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
8598351edc |
mm/slab: move and refactor __kmem_cache_alias()
Move __kmem_cache_alias() to slab_common.c since it's called by __kmem_cache_create_args() and calls find_mergeable() that both are in this file. We can remove two slab.h declarations and make them static. Instead declare sysfs_slab_alias() from slub.c so that __kmem_cache_alias() can keep calling it. Add args parameter to __kmem_cache_alias() and find_mergeable() instead of align and ctor. With that we can also move the checks for usersize and sheaf_capacity there from __kmem_cache_create_args() and make the result more symmetric with slab_unmergeable(). No functional changes intended. Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
0f35040de5 |
mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction
Currently, kvfree_rcu_barrier() flushes RCU sheaves across all slab
caches when a cache is destroyed. This is unnecessary; only the RCU
sheaves belonging to the cache being destroyed need to be flushed.
As suggested by Vlastimil Babka, introduce a weaker form of
kvfree_rcu_barrier() that operates on a specific slab cache.
Factor out flush_rcu_sheaves_on_cache() from flush_all_rcu_sheaves() and
call it from flush_all_rcu_sheaves() and kvfree_rcu_barrier_on_cache().
Call kvfree_rcu_barrier_on_cache() instead of kvfree_rcu_barrier() on
cache destruction.
The performance benefit is evaluated on a 12 core 24 threads AMD Ryzen
5900X machine (1 socket), by loading slub_kunit module.
Before:
Total calls: 19
Average latency (us): 18127
Total time (us): 344414
After:
Total calls: 19
Average latency (us): 10066
Total time (us): 191264
Two performance regression have been reported:
- stress module loader test's runtime increases by 50-60% (Daniel)
- internal graphics test's runtime on Tegra234 increases by 35% (Jon)
They are fixed by this change.
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Fixes:
|
||
|
|
ed80cc758b |
Merge branch 'slab/for-6.19/freelist_aba_t_cleanups' into slab/for-next
Merge series "slab: cmpxchg cleanups enabled by -fms-extensions" From the cover letter [1]: After learning about -fms-extensions being enabled for 6.19, I realized there is some cleanup potential in slub code by extending the definition and usage of freelist_aba_t, as it can now become an unnamed member of struct slab. This series performs the cleanup, with no functional changes intended. Additionally we turn freelist_aba_t to struct freelist_counters as it doesn't meet any criteria for being a typedef, per Documentation/process/coding-style.rst Based on the tag kbuild-ms-extensions-6.19 from git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linuxV Link: https://lore.kernel.org/all/20251107-slab-fms-cleanup-v1-0-650b1491ac9e@suse.cz/#t [1] |
||
|
|
e5d7764e13 |
Merge branch 'slab/for-6.19/memdesc_prep' into slab/for-next
Merge series "Prepare slab for memdescs" by Matthew Wilcox. From the cover letter [1]: When we separate struct folio, struct page and struct slab from each other, converting to folios then to slabs will be nonsense. It made sense under the 'folio is just a head page' interpretation, but with full separation, page_folio() will return NULL for a page which belongs to a slab. This patch series removes almost all mentions of folio from slab. There are a few folio_test_slab() invocations left around the tree that I haven't decided how to handle yet. We're not yet quite at the point of separately allocating struct slab, but that's what I'll be working on next. Link: https://lore.kernel.org/all/20251113000932.1589073-1-willy@infradead.org/ [1] |
||
|
|
76ade24433 |
slab: Remove references to folios from virt_to_slab()
Use page_slab() instead of virt_to_folio() which will work perfectly when struct slab is separated from struct folio. This was the last user of folio_slab(), so delete it. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://patch.msgid.link/20251113000932.1589073-17-willy@infradead.org Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
ee1ee8abc4 |
slab: Remove folio references from __ksize()
In the future, we will separate slab, folio and page from each other and calling virt_to_folio() on an address allocated from slab will return NULL. Delay the conversion from struct page to struct slab until we know we're not dealing with a large kmalloc allocation. There's a minor win for large kmalloc allocations as we avoid the compound_head() hidden in virt_to_folio(). This deprecates calling ksize() on memory allocated by alloc_pages(). Today it becomes a warning and support will be removed entirely in the future. Introduce large_kmalloc_size() to abstract how we represent the size of a large kmalloc allocation. For now, this is the same as page_size(), but it will change with separately allocated memdescs. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://patch.msgid.link/20251113000932.1589073-3-willy@infradead.org Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
2bcd3800f2 |
slab: Reimplement page_slab()
In order to separate slabs from folios, we need to convert from any page in a slab to the slab directly without going through a page to folio conversion first. Up to this point, page_slab() has followed the example of other memdesc converters (page_folio(), page_ptdesc() etc) and just cast the pointer to the requested type, regardless of whether the pointer is actually a pointer to the correct type or not. That changes with this commit; we check that the page actually belongs to a slab and return NULL if it does not. Other memdesc converters will adopt this convention in future. kfence was the only user of page_slab(), so adjust it to the new way of working. It will need to be touched again when we separate slab from page. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Alexander Potapenko <glider@google.com> Cc: Marco Elver <elver@google.com> Cc: kasan-dev@googlegroups.com Link: https://patch.msgid.link/20251113000932.1589073-2-willy@infradead.org Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Tested-by: Marco Elver <elver@google.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
3993ca9d64 |
slab: turn freelist_aba_t to a struct and fully define counters there
In struct slab we currently have freelist and counters pair, where counters itself is a union of unsigned long with a sub-struct of several smaller fields. Then for the usage with double cmpxchg we have freelist_aba_t that duplicates the definition of the freelist+counters with implicitly the same layout as the full definition in struct slab. Thanks to -fms-extension we can now move the full counters definition to freelist_aba_t (while changing it to struct freelist_counters as a typedef is unnecessary and discouraged) and replace the relevant part in struct slab to an unnamed reference to it. The immediate benefit is the removal of duplication and no longer relying on the same layout implicitly. It also allows further cleanups thanks to having the full definition of counters in struct freelist_counters. Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
31e0886fd5 |
slub: remove CONFIG_SLUB_TINY specific code paths
CONFIG_SLUB_TINY minimizes the SLUB's memory overhead in multiple ways, mainly by avoiding percpu caching of slabs and objects. It also reduces code size by replacing some code paths with simplified ones through ifdefs, but the benefits of that are smaller and would complicate the upcoming changes. Thus remove these code paths and associated ifdefs and simplify the code base. Link: https://patch.msgid.link/20251105-sheaves-cleanups-v1-4-b8218e1ac7ef@suse.cz Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
8804d970fa |
Summary of significant series in this pull request:
- The 3 patch series "mm, swap: improve cluster scan strategy" from Kairui Song improves performance and reduces the failure rate of swap cluster allocation. - The 4 patch series "support large align and nid in Rust allocators" from Vitaly Wool permits Rust allocators to set NUMA node and large alignment when perforning slub and vmalloc reallocs. - The 2 patch series "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend DAMOS_STAT's handling of the DAMON operations sets for virtual address spaces for ops-level DAMOS filters. - The 3 patch series "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren Baghdasaryan reduces mmap_lock contention during reads of /proc/pid/maps. - The 2 patch series "mm/mincore: minor clean up for swap cache checking" from Kairui Song performs some cleanup in the swap code. - The 11 patch series "mm: vm_normal_page*() improvements" from David Hildenbrand provides code cleanup in the pagemap code. - The 5 patch series "add persistent huge zero folio support" from Pankaj Raghav provides a block layer speedup by optionalls making the huge_zero_pagepersistent, instead of releasing it when its refcount falls to zero. - The 3 patch series "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to the recently added Kexec Handover feature. - The 10 patch series "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo Stoakes turns mm_struct.flags into a bitmap. To end the constant struggle with space shortage on 32-bit conflicting with 64-bit's needs. - The 2 patch series "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap code. - The 7 patch series "selftests/mm: Fix false positives and skip unsupported tests" from Donet Tom fixes a few things in our selftests code. - The 7 patch series "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised" from David Hildenbrand "allows individual processes to opt-out of THP=always into THP=madvise, without affecting other workloads on the system". It's a long story - the [1/N] changelog spells out the considerations. - The 11 patch series "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on the memdesc project. Please see https://kernelnewbies.org/MatthewWilcox/Memdescs and https://blogs.oracle.com/linux/post/introducing-memdesc. - The 3 patch series "Tiny optimization for large read operations" from Chi Zhiling improves the efficiency of the pagecache read path. - The 5 patch series "Better split_huge_page_test result check" from Zi Yan improves our folio splitting selftest code. - The 2 patch series "test that rmap behaves as expected" from Wei Yang adds some rmap selftests. - The 3 patch series "remove write_cache_pages()" from Christoph Hellwig removes that function and converts its two remaining callers. - The 2 patch series "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD selftests issues. - The 3 patch series "introduce kernel file mapped folios" from Boris Burkov introduces the concept of "kernel file pages". Using these permits btrfs to account its metadata pages to the root cgroup, rather than to the cgroups of random inappropriate tasks. - The 2 patch series "mm/pageblock: improve readability of some pageblock handling" from Wei Yang provides some readability improvements to the page allocator code. - The 11 patch series "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON to understand arm32 highmem. - The 4 patch series "tools: testing: Use existing atomic.h for vma/maple tests" from Brendan Jackman performs some code cleanups and deduplication under tools/testing/. - The 2 patch series "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes a couple of 32-bit issues in tools/testing/radix-tree.c. - The 2 patch series "kasan: unify kasan_enabled() and remove arch-specific implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific initialization code into a common arch-neutral implementation. - The 3 patch series "mm: remove zpool" from Johannes Weiner removes zspool - an indirection layer which now only redirects to a single thing (zsmalloc). - The 2 patch series "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a couple of cleanups in the fork code. - The 37 patch series "mm: remove nth_page()" from David Hildenbrand makes rather a lot of adjustments at various nth_page() callsites, eventually permitting the removal of that undesirable helper function. - The 2 patch series "introduce kasan.write_only option in hw-tags" from Yeoreum Yun creates a KASAN read-only mode for ARM, using that architecture's memory tagging feature. It is felt that a read-only mode KASAN is suitable for use in production systems rather than debug-only. - The 3 patch series "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does some tidying in the hugetlb folio allocation code. - The 12 patch series "mm: establish const-correctness for pointer parameters" from Max Kellermann makes quite a number of the MM API functions more accurate about the constness of their arguments. This was getting in the way of subsystems (in this case CEPH) when they attempt to improving their own const/non-const accuracy. - The 7 patch series "Cleanup free_pages() misuse" from Vishal Moola fixes a number of code sites which were confused over when to use free_pages() vs __free_pages(). - The 3 patch series "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the mapletree code accessible to Rust. Required by nouveau and by its forthcoming successor: the new Rust Nova driver. - The 2 patch series "selftests/mm: split_huge_page_test: split_pte_mapped_thp improvements" from David Hildenbrand adds a fix and some cleanups to the thp selftesting code. - The 14 patch series "mm, swap: introduce swap table as swap cache (phase I)" from Chris Li and Kairui Song is the first step along the path to implementing "swap tables" - a new approach to swap allocation and state tracking which is expected to yield speed and space improvements. This patchset itself yields a 5-20% performance benefit in some situations. - The 3 patch series "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc layer to clean up the ptdesc code a little. - The 3 patch series "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some issues in our 5-level pagetable selftesting code. - The 2 patch series "Minor fixes for memory allocation profiling" from Suren Baghdasaryan addresses a couple of minor issues in relatively new memory allocation profiling feature. - The 3 patch series "Small cleanups" from Matthew Wilcox has a few cleanups in preparation for more memdesc work. - The 2 patch series "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from Quanmin Yan makes some changes to DAMON in furtherance of supporting arm highmem. - The 2 patch series "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad Anjum adds that compiler check to selftests code and fixes the fallout, by removing dead code. - The 10 patch series "Improvements to Victim Process Thawing and OOM Reaper Traversal Order" from zhongjinji makes a number of improvements in the OOM killer: mainly thawing a more appropriate group of victim threads so they can release resources. - The 5 patch series "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park is a bunch of small and unrelated fixups for DAMON. - The 7 patch series "mm/damon: define and use DAMON initialization check function" from SeongJae Park implement reliability and maintainability improvements to a recently-added bug fix. - The 2 patch series "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from SeongJae Park provides additional transparency to userspace clients of the DAMON_STAT information. - The 2 patch series "Expand scope of khugepaged anonymous collapse" from Dev Jain removes some constraints on khubepaged's collapsing of anon VMAs. It also increases the success rate of MADV_COLLAPSE against an anon vma. - The 2 patch series "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()" from Lorenzo Stoakes moves us further towards removal of file_operations.mmap(). This patchset concentrates upon clearing up the treatment of stacked filesystems. - The 6 patch series "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau provides some fixes and improvements to mlock's tracking of large folios. /proc/meminfo's "Mlocked" field became more accurate. - The 2 patch series "mm/ksm: Fix incorrect accounting of KSM counters during fork" from Donet Tom fixes several user-visible KSM stats inaccuracies across forks and adds selftest code to verify these counters. - The 2 patch series "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses some potential but presently benign issues in KSM's mm_slot handling. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaN3cywAKCRDdBJ7gKXxA jtaPAQDmIuIu7+XnVUK5V11hsQ/5QtsUeLHV3OsAn4yW5/3dEQD/UddRU08ePN+1 2VRB0EwkLAdfMWW7TfiNZ+yhuoiL/AA= =4mhY -----END PGP SIGNATURE----- Merge tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "mm, swap: improve cluster scan strategy" from Kairui Song improves performance and reduces the failure rate of swap cluster allocation - "support large align and nid in Rust allocators" from Vitaly Wool permits Rust allocators to set NUMA node and large alignment when perforning slub and vmalloc reallocs - "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend DAMOS_STAT's handling of the DAMON operations sets for virtual address spaces for ops-level DAMOS filters - "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren Baghdasaryan reduces mmap_lock contention during reads of /proc/pid/maps - "mm/mincore: minor clean up for swap cache checking" from Kairui Song performs some cleanup in the swap code - "mm: vm_normal_page*() improvements" from David Hildenbrand provides code cleanup in the pagemap code - "add persistent huge zero folio support" from Pankaj Raghav provides a block layer speedup by optionalls making the huge_zero_pagepersistent, instead of releasing it when its refcount falls to zero - "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to the recently added Kexec Handover feature - "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo Stoakes turns mm_struct.flags into a bitmap. To end the constant struggle with space shortage on 32-bit conflicting with 64-bit's needs - "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap code - "selftests/mm: Fix false positives and skip unsupported tests" from Donet Tom fixes a few things in our selftests code - "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised" from David Hildenbrand "allows individual processes to opt-out of THP=always into THP=madvise, without affecting other workloads on the system". It's a long story - the [1/N] changelog spells out the considerations - "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on the memdesc project. Please see https://kernelnewbies.org/MatthewWilcox/Memdescs and https://blogs.oracle.com/linux/post/introducing-memdesc - "Tiny optimization for large read operations" from Chi Zhiling improves the efficiency of the pagecache read path - "Better split_huge_page_test result check" from Zi Yan improves our folio splitting selftest code - "test that rmap behaves as expected" from Wei Yang adds some rmap selftests - "remove write_cache_pages()" from Christoph Hellwig removes that function and converts its two remaining callers - "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD selftests issues - "introduce kernel file mapped folios" from Boris Burkov introduces the concept of "kernel file pages". Using these permits btrfs to account its metadata pages to the root cgroup, rather than to the cgroups of random inappropriate tasks - "mm/pageblock: improve readability of some pageblock handling" from Wei Yang provides some readability improvements to the page allocator code - "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON to understand arm32 highmem - "tools: testing: Use existing atomic.h for vma/maple tests" from Brendan Jackman performs some code cleanups and deduplication under tools/testing/ - "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes a couple of 32-bit issues in tools/testing/radix-tree.c - "kasan: unify kasan_enabled() and remove arch-specific implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific initialization code into a common arch-neutral implementation - "mm: remove zpool" from Johannes Weiner removes zspool - an indirection layer which now only redirects to a single thing (zsmalloc) - "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a couple of cleanups in the fork code - "mm: remove nth_page()" from David Hildenbrand makes rather a lot of adjustments at various nth_page() callsites, eventually permitting the removal of that undesirable helper function - "introduce kasan.write_only option in hw-tags" from Yeoreum Yun creates a KASAN read-only mode for ARM, using that architecture's memory tagging feature. It is felt that a read-only mode KASAN is suitable for use in production systems rather than debug-only - "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does some tidying in the hugetlb folio allocation code - "mm: establish const-correctness for pointer parameters" from Max Kellermann makes quite a number of the MM API functions more accurate about the constness of their arguments. This was getting in the way of subsystems (in this case CEPH) when they attempt to improving their own const/non-const accuracy - "Cleanup free_pages() misuse" from Vishal Moola fixes a number of code sites which were confused over when to use free_pages() vs __free_pages() - "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the mapletree code accessible to Rust. Required by nouveau and by its forthcoming successor: the new Rust Nova driver - "selftests/mm: split_huge_page_test: split_pte_mapped_thp improvements" from David Hildenbrand adds a fix and some cleanups to the thp selftesting code - "mm, swap: introduce swap table as swap cache (phase I)" from Chris Li and Kairui Song is the first step along the path to implementing "swap tables" - a new approach to swap allocation and state tracking which is expected to yield speed and space improvements. This patchset itself yields a 5-20% performance benefit in some situations - "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc layer to clean up the ptdesc code a little - "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some issues in our 5-level pagetable selftesting code - "Minor fixes for memory allocation profiling" from Suren Baghdasaryan addresses a couple of minor issues in relatively new memory allocation profiling feature - "Small cleanups" from Matthew Wilcox has a few cleanups in preparation for more memdesc work - "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from Quanmin Yan makes some changes to DAMON in furtherance of supporting arm highmem - "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad Anjum adds that compiler check to selftests code and fixes the fallout, by removing dead code - "Improvements to Victim Process Thawing and OOM Reaper Traversal Order" from zhongjinji makes a number of improvements in the OOM killer: mainly thawing a more appropriate group of victim threads so they can release resources - "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park is a bunch of small and unrelated fixups for DAMON - "mm/damon: define and use DAMON initialization check function" from SeongJae Park implement reliability and maintainability improvements to a recently-added bug fix - "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from SeongJae Park provides additional transparency to userspace clients of the DAMON_STAT information - "Expand scope of khugepaged anonymous collapse" from Dev Jain removes some constraints on khubepaged's collapsing of anon VMAs. It also increases the success rate of MADV_COLLAPSE against an anon vma - "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()" from Lorenzo Stoakes moves us further towards removal of file_operations.mmap(). This patchset concentrates upon clearing up the treatment of stacked filesystems - "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau provides some fixes and improvements to mlock's tracking of large folios. /proc/meminfo's "Mlocked" field became more accurate - "mm/ksm: Fix incorrect accounting of KSM counters during fork" from Donet Tom fixes several user-visible KSM stats inaccuracies across forks and adds selftest code to verify these counters - "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses some potential but presently benign issues in KSM's mm_slot handling * tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits) mm: swap: check for stable address space before operating on the VMA mm: convert folio_page() back to a macro mm/khugepaged: use start_addr/addr for improved readability hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list alloc_tag: fix boot failure due to NULL pointer dereference mm: silence data-race in update_hiwater_rss mm/memory-failure: don't select MEMORY_ISOLATION mm/khugepaged: remove definition of struct khugepaged_mm_slot mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL hugetlb: increase number of reserving hugepages via cmdline selftests/mm: add fork inheritance test for ksm_merging_pages counter mm/ksm: fix incorrect KSM counter handling in mm_struct during fork drivers/base/node: fix double free in register_one_node() mm: remove PMD alignment constraint in execmem_vmalloc() mm/memory_hotplug: fix typo 'esecially' -> 'especially' mm/rmap: improve mlock tracking for large folios mm/filemap: map entire large folio faultaround mm/fault: try to map the entire file folio in finish_fault() mm/rmap: mlock large folios in try_to_unmap_one() mm/rmap: fix a mlock race condition in folio_referenced_one() ... |
||
|
|
ca74b8cada |
Merge series "slab: Re-entrant kmalloc_nolock()"
From the cover letter [1]: This patch set introduces kmalloc_nolock() which is the next logical step towards any context allocation necessary to remove bpf_mem_alloc and get rid of preallocation requirement in BPF infrastructure. In production BPF maps grew to gigabytes in size. Preallocation wastes memory. Alloc from any context addresses this issue for BPF and other subsystems that are forced to preallocate too. This long task started with introduction of alloc_pages_nolock(), then memcg and objcg were converted to operate from any context including NMI, this set completes the task with kmalloc_nolock() that builds on top of alloc_pages_nolock() and memcg changes. After that BPF subsystem will gradually adopt it everywhere. Link: https://lore.kernel.org/all/20250909010007.1660-1-alexei.starovoitov@gmail.com/ [1] |
||
|
|
b912061924 |
Merge series "SLUB percpu sheaves"
This series adds an opt-in percpu array-based caching layer to SLUB.
It has evolved to a state where kmem caches with sheaves are compatible
with all SLUB features (slub_debug, SLUB_TINY, NUMA locality
considerations). The plan is therefore that it will be later enabled for
all kmem caches and replace the complicated cpu (partial) slabs code.
Note the name "sheaf" was invented by Matthew Wilcox so we don't call
the arrays magazines like the original Bonwick paper. The per-NUMA-node
cache of sheaves is thus called "barn".
This caching may seem similar to the arrays we had in SLAB, but there
are some important differences:
- deals differently with NUMA locality of freed objects, thus there are
no per-node "shared" arrays (with possible lock contention) and no
"alien" arrays that would need periodical flushing
- instead, freeing remote objects (which is rare) bypasses the sheaves
- percpu sheaves thus contain only local objects (modulo rare races
and local node exhaustion)
- NUMA restricted allocations and strict_numa mode is still honoured
- improves kfree_rcu() handling by reusing whole sheaves
- there is an API for obtaining a preallocated sheaf that can be used
for guaranteed and efficient allocations in a restricted context, when
the upper bound for needed objects is known but rarely reached
- opt-in, not used for every cache (for now)
The motivation comes mainly from the ongoing work related to VMA locking
scalability and the related maple tree operations. This is why VMA and
maple nodes caches are sheaf-enabled in the patchset.
A sheaf-enabled cache has the following expected advantages:
- Cheaper fast paths. For allocations, instead of local double cmpxchg,
thanks to local_trylock() it becomes a preempt_disable() and no atomic
operations. Same for freeing, which is otherwise a local double cmpxchg
only for short term allocations (so the same slab is still active on the
same cpu when freeing the object) and a more costly locked double
cmpxchg otherwise.
- kfree_rcu() batching and recycling. kfree_rcu() will put objects to a
separate percpu sheaf and only submit the whole sheaf to call_rcu()
when full. After the grace period, the sheaf can be used for
allocations, which is more efficient than freeing and reallocating
individual slab objects (even with the batching done by kfree_rcu()
implementation itself). In case only some cpus are allowed to handle rcu
callbacks, the sheaf can still be made available to other cpus on the
same node via the shared barn. The maple_node cache uses kfree_rcu() and
thus can benefit from this.
Note: this path is currently limited to !PREEMPT_RT
- Preallocation support. A prefilled sheaf can be privately borrowed to
perform a short term operation that is not allowed to block in the
middle and may need to allocate some objects. If an upper bound (worst
case) for the number of allocations is known, but only much fewer
allocations actually needed on average, borrowing and returning a sheaf
is much more efficient then a bulk allocation for the worst case
followed by a bulk free of the many unused objects. Maple tree write
operations should benefit from this.
- Compatibility with slub_debug. When slub_debug is enabled for a cache,
we simply don't create the percpu sheaves so that the debugging hooks
(at the node partial list slowpaths) are reached as before. The same
thing is done for CONFIG_SLUB_TINY. Sheaf preallocation still works by
reusing the (ineffective) paths for requests exceeding the cache's
sheaf_capacity. This is in line with the existing approach where
debugging bypasses the fast paths and SLUB_TINY preferes memory
savings over performance.
The above is adapted from the cover letter [1], which contains also
in-kernel microbenchmark results showing the lower overhead of sheaves.
Results from Suren Baghdasaryan [2] using a mmap/munmap microbenchmark
also show improvements.
Results from Sudarsan Mahendran [3] using will-it-scale show both
benefits and regressions, probably due to overall noisiness of those
tests.
Link: https://lore.kernel.org/all/20250910-slub-percpu-caches-v8-0-ca3099d8352c@suse.cz/ [1]
Link: https://lore.kernel.org/all/CAJuCfpEQ%3DRUgcAvRzE5jRrhhFpkm8E2PpBK9e9GhK26ZaJQt%3DQ@mail.gmail.com/ [2]
Link: https://lore.kernel.org/all/20250913000935.1021068-1-sudarsanm@google.com/ [3]
|
||
|
|
af92793e52 |
slab: Introduce kmalloc_nolock() and kfree_nolock().
kmalloc_nolock() relies on ability of local_trylock_t to detect
the situation when per-cpu kmem_cache is locked.
In !PREEMPT_RT local_(try)lock_irqsave(&s->cpu_slab->lock, flags)
disables IRQs and marks s->cpu_slab->lock as acquired.
local_lock_is_locked(&s->cpu_slab->lock) returns true when
slab is in the middle of manipulating per-cpu cache
of that specific kmem_cache.
kmalloc_nolock() can be called from any context and can re-enter
into ___slab_alloc():
kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> NMI -> bpf ->
kmalloc_nolock() -> ___slab_alloc(cache_B)
or
kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> tracepoint/kprobe -> bpf ->
kmalloc_nolock() -> ___slab_alloc(cache_B)
Hence the caller of ___slab_alloc() checks if &s->cpu_slab->lock
can be acquired without a deadlock before invoking the function.
If that specific per-cpu kmem_cache is busy the kmalloc_nolock()
retries in a different kmalloc bucket. The second attempt will
likely succeed, since this cpu locked different kmem_cache.
Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
per-cpu rt_spin_lock is locked by current _task_. In this case
re-entrance into the same kmalloc bucket is unsafe, and
kmalloc_nolock() tries a different bucket that is most likely is
not locked by the current task. Though it may be locked by a
different task it's safe to rt_spin_lock() and sleep on it.
Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL
immediately if called from hard irq or NMI in PREEMPT_RT.
kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
and (in_nmi() or in PREEMPT_RT).
SLUB_TINY config doesn't use local_lock_is_locked() and relies on
spin_trylock_irqsave(&n->list_lock) to allocate,
while kfree_nolock() always defers to irq_work.
Note, kfree_nolock() must be called _only_ for objects allocated
with kmalloc_nolock(). Debug checks (like kmemleak and kfence)
were skipped on allocation, hence obj = kmalloc(); kfree_nolock(obj);
will miss kmemleak/kfence book keeping and will cause false positives.
large_kmalloc is not supported by either kmalloc_nolock()
or kfree_nolock().
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
||
|
|
83382af9dd |
slab: Make slub local_(try)lock more precise for LOCKDEP
kmalloc_nolock() can be called from any context the ___slab_alloc() can acquire local_trylock_t (which is rt_spin_lock in PREEMPT_RT) and attempt to acquire a different local_trylock_t while in the same task context. The calling sequence might look like: kmalloc() -> tracepoint -> bpf -> kmalloc_nolock() or more precisely: __lock_acquire+0x12ad/0x2590 lock_acquire+0x133/0x2d0 rt_spin_lock+0x6f/0x250 ___slab_alloc+0xb7/0xec0 kmalloc_nolock_noprof+0x15a/0x430 my_debug_callback+0x20e/0x390 [testmod] ___slab_alloc+0x256/0xec0 __kmalloc_cache_noprof+0xd6/0x3b0 Make LOCKDEP understand that local_trylock_t-s protect different kmem_caches. In order to do that add lock_class_key for each kmem_cache and use that key in local_trylock_t. This stack trace is possible on both PREEMPT_RT and !PREEMPT_RT, but teach lockdep about it only for PREEMPT_RT, since in !PREEMPT_RT the ___slab_alloc() code is using local_trylock_irqsave() when lockdep is on. Note, this patch applies this logic to local_lock_t while the next one converts it to local_trylock_t. Both are mapped to rt_spin_lock in PREEMPT_RT. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
ec66e0d599 |
slab: add sheaf support for batching kfree_rcu() operations
Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
For caches with sheaves, on each cpu maintain a rcu_free sheaf in
addition to main and spare sheaves.
kfree_rcu() operations will try to put objects on this sheaf. Once full,
the sheaf is detached and submitted to call_rcu() with a handler that
will try to put it in the barn, or flush to slab pages using bulk free,
when the barn is full. Then a new empty sheaf must be obtained to put
more objects there.
It's possible that no free sheaves are available to use for a new
rcu_free sheaf, and the allocation in kfree_rcu() context can only use
GFP_NOWAIT and thus may fail. In that case, fall back to the existing
kfree_rcu() implementation.
Expected advantages:
- batching the kfree_rcu() operations, that could eventually replace the
existing batching
- sheaves can be reused for allocations via barn instead of being
flushed to slabs, which is more efficient
- this includes cases where only some cpus are allowed to process rcu
callbacks (CONFIG_RCU_NOCB_CPU)
Possible disadvantage:
- objects might be waiting for more than their grace period (it is
determined by the last object freed into the sheaf), increasing memory
usage - but the existing batching does that too.
Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
implementation favors smaller memory footprint over performance.
Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
contexts where kfree_rcu() is called might not be compatible with taking
a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
spinlock - the current kfree_rcu() implementation avoids doing that.
Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
that have them. This is not a cheap operation, but the barrier usage is
rare - currently kmem_cache_destroy() or on module unload.
Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
count how many kfree_rcu() used the rcu_free sheaf successfully and how
many had to fall back to the existing implementation.
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
||
|
|
2d517aa09b |
slab: add opt-in caching layer of percpu sheaves
Specifying a non-zero value for a new struct kmem_cache_args field sheaf_capacity will setup a caching layer of percpu arrays called sheaves of given capacity for the created cache. Allocations from the cache will allocate via the percpu sheaves (main or spare) as long as they have no NUMA node preference. Frees will also put the object back into one of the sheaves. When both percpu sheaves are found empty during an allocation, an empty sheaf may be replaced with a full one from the per-node barn. If none are available and the allocation is allowed to block, an empty sheaf is refilled from slab(s) by an internal bulk alloc operation. When both percpu sheaves are full during freeing, the barn can replace a full one with an empty one, unless over a full sheaves limit. In that case a sheaf is flushed to slab(s) by an internal bulk free operation. Flushing sheaves and barns is also wired to the existing cpu flushing and cache shrinking operations. The sheaves do not distinguish NUMA locality of the cached objects. If an allocation is requested with kmem_cache_alloc_node() (or a mempolicy with strict_numa mode enabled) with a specific node (not NUMA_NO_NODE), the sheaves are bypassed. The bulk operations exposed to slab users also try to utilize the sheaves as long as the necessary (full or empty) sheaves are available on the cpu or in the barn. Once depleted, they will fallback to bulk alloc/free to slabs directly to avoid double copying. The sheaf_capacity value is exported in sysfs for observability. Sysfs CONFIG_SLUB_STATS counters alloc_cpu_sheaf and free_cpu_sheaf count objects allocated or freed using the sheaves (and thus not counting towards the other alloc/free path counters). Counters sheaf_refill and sheaf_flush count objects filled or flushed from or to slab pages, and can be used to assess how effective the caching is. The refill and flush operations will also count towards the usual alloc_fastpath/slowpath, free_fastpath/slowpath and other counters for the backing slabs. For barn operations, barn_get and barn_put count how many full sheaves were get from or put to the barn, the _fail variants count how many such requests could not be satisfied mainly because the barn was either empty or full. While the barn also holds empty sheaves to make some operations easier, these are not as critical to mandate own counters. Finally, there are sheaf_alloc/sheaf_free counters. Access to the percpu sheaves is protected by local_trylock() when potential callers include irq context, and local_lock() otherwise (such as when we already know the gfp flags allow blocking). The trylock failures should be rare and we can easily fallback. Each per-NUMA-node barn has a spin_lock. When slub_debug is enabled for a cache with sheaf_capacity also specified, the latter is ignored so that allocations and frees reach the slow path where debugging hooks are processed. Similarly, we ignore it with CONFIG_SLUB_TINY which prefers low memory usage to performance. [boot failure: https://lore.kernel.org/all/583eacf5-c971-451a-9f76-fed0e341b815@linux.ibm.com/ ] Reported-and-tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
4038016397 |
slab: prevent warnings when slab obj_exts vector allocation fails
When object extension vector allocation fails, we set slab->obj_exts to
OBJEXTS_ALLOC_FAIL to indicate the failure. Later, once the vector is
successfully allocated, we will use this flag to mark codetag references
stored in that vector as empty to avoid codetag warnings.
slab_obj_exts() used to retrieve the slab->obj_exts vector pointer checks
slab->obj_exts for being either NULL or a pointer with MEMCG_DATA_OBJEXTS
bit set. However it does not handle the case when slab->obj_exts equals
OBJEXTS_ALLOC_FAIL. Add the missing condition to avoid extra warning.
Fixes:
|
||
|
|
11afccce2a |
slab: use memdesc_nid()
We no longer need to convert from slab to folio to get the nid, we can ask memdesc_nid() for the nid directly. Link: https://lkml.kernel.org/r/20250805172307.1302730-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
87479378ac |
slab: use memdesc_flags_t
The slab flags are memdesc flags and contain the same information in the upper bits as the other memdescs (like node ID). Link: https://lkml.kernel.org/r/20250805172307.1302730-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
3df29914d9 |
slab: Add SL_pfmemalloc flag
Give slab its own name for this flag. Move the implementation from slab.h to slub.c since it's only used inside slub.c. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20250611155916.2579160-5-willy@infradead.org Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
30908096dd |
slab: Rename slab->__page_flags to slab->flags
Slab has its own reasons for using flag bits; they aren't just the page bits. Maybe this won't be the ultimate solution, but we should be clear that these bits are in use. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://patch.msgid.link/20250611155916.2579160-3-willy@infradead.org Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
dea2d9221e |
Merge branch 'slab/for-6.15/kfree_rcu_tiny' into slab/for-next
Merge the slab feature branch kfree_rcu_tiny for 6.15: - Move the TINY_RCU kvfree_rcu() implementation from RCU to SLAB subsystem and cleanup its integration. |
||
|
|
12f4888c9d |
mm/slab: simplify SLAB_* flag handling
SLUB is the only remaining allocator. We can therefore get rid of the logic for allocator-specific flags: * Merge SLAB_CACHE_FLAGS into SLAB_CORE_FLAGS. * Remove CACHE_CREATE_MASK and instead mask out SLAB_DEBUG_FLAGS if !CONFIG_SLUB_DEBUG. SLAB_DEBUG_FLAGS is now defined unconditionally (no impact on existing code, which ignores it if !CONFIG_SLUB_DEBUG). * Define SLAB_FLAGS_PERMITTED in terms of SLAB_CORE_FLAGS and SLAB_DEBUG_FLAGS (no functional change). While at it also remove misleading comments that suggest that multiple allocators are available. Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
49d5377b38 |
rcu, slab: use a regular callback function for kvfree_rcu
RCU has been special-casing callback function pointers that are integers lower than 4096 as offsets of rcu_head for kvfree() instead. The tree RCU implementation no longer does that as the batched kvfree_rcu() is not a simple call_rcu(). The tiny RCU still does, and the plan is also to make tree RCU use call_rcu() for SLUB_TINY configurations. Instead of teaching tree RCU again to special case the offsets, let's remove the special casing completely. Since there's no SLOB anymore, it is possible to create a callback function that can take a pointer to a middle of slab object with unknown offset and determine the object's pointer before freeing it, so implement that as kvfree_rcu_cb(). Large kmalloc and vmalloc allocations are handled simply by aligning down to page size. For that we retain the requirement that the offset is smaller than 4096. But we can remove __is_kvfree_rcu_offset() completely and instead just opencode the condition in the BUILD_BUG_ON() check. Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Tested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
2da76e9e12 |
mm/slab: fix kernel-doc func param names
Use corrected function parameter names to eliminate kernel-doc warnings: slab.h:142: warning: Function parameter or struct member 's' not described in 'slab_folio' slab.h:142: warning: Excess function parameter 'slab' description in 'slab_folio' slab.h:168: warning: Function parameter or struct member 's' not described in 'slab_page' slab.h:168: warning: Excess function parameter 'slab' description in 'slab_page' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
dbc1691527 |
mm/slub: Avoid list corruption when removing a slab from the full list
Boot with slub_debug=UFPZ.
If allocated object failed in alloc_consistency_checks, all objects of
the slab will be marked as used, and then the slab will be removed from
the partial list.
When an object belonging to the slab got freed later, the remove_full()
function is called. Because the slab is neither on the partial list nor
on the full list, it eventually lead to a list corruption (actually a
list poison being detected).
So we need to mark and isolate the slab page with metadata corruption,
do not put it back in circulation.
Because the debug caches avoid all the fastpaths, reusing the frozen bit
to mark slab page with metadata corruption seems to be fine.
[ 4277.385669] list_del corruption, ffffea00044b3e50->next is LIST_POISON1 (dead000000000100)
[ 4277.387023] ------------[ cut here ]------------
[ 4277.387880] kernel BUG at lib/list_debug.c:56!
[ 4277.388680] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[ 4277.389562] CPU: 5 PID: 90 Comm: kworker/5:1 Kdump: loaded Tainted: G OE 6.6.1-1 #1
[ 4277.392113] Workqueue: xfs-inodegc/vda1 xfs_inodegc_worker [xfs]
[ 4277.393551] RIP: 0010:__list_del_entry_valid_or_report+0x7b/0xc0
[ 4277.394518] Code: 48 91 82 e8 37 f9 9a ff 0f 0b 48 89 fe 48 c7 c7 28 49 91 82 e8 26 f9 9a ff 0f 0b 48 89 fe 48 c7 c7 58 49 91
[ 4277.397292] RSP: 0018:ffffc90000333b38 EFLAGS: 00010082
[ 4277.398202] RAX: 000000000000004e RBX: ffffea00044b3e50 RCX: 0000000000000000
[ 4277.399340] RDX: 0000000000000002 RSI: ffffffff828f8715 RDI: 00000000ffffffff
[ 4277.400545] RBP: ffffea00044b3e40 R08: 0000000000000000 R09: ffffc900003339f0
[ 4277.401710] R10: 0000000000000003 R11: ffffffff82d44088 R12: ffff888112cf9910
[ 4277.402887] R13: 0000000000000001 R14: 0000000000000001 R15: ffff8881000424c0
[ 4277.404049] FS: 0000000000000000(0000) GS:ffff88842fd40000(0000) knlGS:0000000000000000
[ 4277.405357] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4277.406389] CR2: 00007f2ad0b24000 CR3: 0000000102a3a006 CR4: 00000000007706e0
[ 4277.407589] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4277.408780] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 4277.410000] PKRU: 55555554
[ 4277.410645] Call Trace:
[ 4277.411234] <TASK>
[ 4277.411777] ? die+0x32/0x80
[ 4277.412439] ? do_trap+0xd6/0x100
[ 4277.413150] ? __list_del_entry_valid_or_report+0x7b/0xc0
[ 4277.414158] ? do_error_trap+0x6a/0x90
[ 4277.414948] ? __list_del_entry_valid_or_report+0x7b/0xc0
[ 4277.415915] ? exc_invalid_op+0x4c/0x60
[ 4277.416710] ? __list_del_entry_valid_or_report+0x7b/0xc0
[ 4277.417675] ? asm_exc_invalid_op+0x16/0x20
[ 4277.418482] ? __list_del_entry_valid_or_report+0x7b/0xc0
[ 4277.419466] ? __list_del_entry_valid_or_report+0x7b/0xc0
[ 4277.420410] free_to_partial_list+0x515/0x5e0
[ 4277.421242] ? xfs_iext_remove+0x41a/0xa10 [xfs]
[ 4277.422298] xfs_iext_remove+0x41a/0xa10 [xfs]
[ 4277.423316] ? xfs_inodegc_worker+0xb4/0x1a0 [xfs]
[ 4277.424383] xfs_bmap_del_extent_delay+0x4fe/0x7d0 [xfs]
[ 4277.425490] __xfs_bunmapi+0x50d/0x840 [xfs]
[ 4277.426445] xfs_itruncate_extents_flags+0x13a/0x490 [xfs]
[ 4277.427553] xfs_inactive_truncate+0xa3/0x120 [xfs]
[ 4277.428567] xfs_inactive+0x22d/0x290 [xfs]
[ 4277.429500] xfs_inodegc_worker+0xb4/0x1a0 [xfs]
[ 4277.430479] process_one_work+0x171/0x340
[ 4277.431227] worker_thread+0x277/0x390
[ 4277.431962] ? __pfx_worker_thread+0x10/0x10
[ 4277.432752] kthread+0xf0/0x120
[ 4277.433382] ? __pfx_kthread+0x10/0x10
[ 4277.434134] ret_from_fork+0x2d/0x50
[ 4277.434837] ? __pfx_kthread+0x10/0x10
[ 4277.435566] ret_from_fork_asm+0x1b/0x30
[ 4277.436280] </TASK>
Fixes:
|
||
|
|
fb5eda0dfe |
mm/kasan: Don't store metadata inside kmalloc object when slub_debug_orig_size is on
For a kmalloc object, when both kasan and slub redzone sanity check are enabled, they could both manipulate its data space like storing kasan free meta data and setting up kmalloc redzone, and may affect accuracy of that object's 'orig_size'. As an accurate 'orig_size' will be needed by some function like krealloc() soon, save kasan's free meta data in slub's metadata area instead of inside object when 'orig_size' is enabled. This will make it easier to maintain/understand the code. Size wise, when these two options are both enabled, the slub meta data space is already huge, and this just slightly increase the overall size. Signed-off-by: Feng Tang <feng.tang@intel.com> Acked-by: Andrey Konovalov <andreyknvl@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
3f1dd33f99 |
mm, slab: suppress warnings in test_leak_destroy kunit test
The test_leak_destroy kunit test intends to test the detection of stray
objects in kmem_cache_destroy(), which normally produces a warning. The
other slab kunit tests suppress the warnings in the kunit test context,
so suppress warnings and related printk output in this test as well.
Automated test running environments then don't need to learn to filter
the warnings.
Also rename the test's kmem_cache, the name was wrongly copy-pasted from
test_kfree_rcu.
Fixes:
|
||
|
|
77ced98f0f |
mm, slab: fix use of SLAB_SUPPORTS_SYSFS in kmem_cache_release()
The fix implemented in commit |
||
|
|
ecc4d6af97 |
Merge branch 'slab/for-6.12/kmem_cache_args' into slab/for-next
Merge kmem_cache_create() refactoring by Christian Brauner. Note this includes a merge of the vfs.file tree that contains the prerequisity kmem_cache_create_rcu() work. |
||
|
|
dacf472bcd |
slab: remove rcu_freeptr_offset from struct kmem_cache
Pass down struct kmem_cache_args to calculate_sizes() so we can use
args->{use}_freeptr_offset directly. This allows us to remove
->rcu_freeptr_offset from struct kmem_cache.
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
||
|
|
3dbe2bad57 |
slab: pass struct kmem_cache_args to do_kmem_cache_create()
and initialize most things in do_kmem_cache_create(). In a follow-up patch we'll remove rcu_freeptr_offset from struct kmem_cache. Reviewed-by: Kees Cook <kees@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |