linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-12 16:18:45 +02:00

Author	SHA1	Message	Date
Harry Yoo (Oracle)	5b31044e64	mm/slab: return NULL early from kmalloc_nolock() in NMI on UP On UP kernels (!CONFIG_SMP), spin_trylock() is a no-op that unconditionally succeeds even when the lock is already held. As a result, kmalloc_nolock() called from NMI context can re-enter the slab allocator and acquire n->list_lock that the interrupted context is already holding, corrupting slab state. With CONFIG_DEBUG_SPINLOCK on UP, the following BUG is triggered with the slub_kunit test module: BUG: spinlock trylock failure on UP on CPU#0, kunit_try_catch/243 [...] Call Trace: <NMI> dump_stack_lvl+0x3f/0x60 do_raw_spin_trylock+0x41/0x50 _raw_spin_trylock+0x24/0x50 get_from_partial_node+0x120/0x4d0 ___slab_alloc+0x8a/0x4c0 kmalloc_nolock_noprof+0x164/0x310 [...] </NMI> Fix this by returning NULL early when invoked from NMI on a UP kernel. Link: https://lore.kernel.org/linux-mm/ad_cqe51pvr1WaDg@hyeyoo Cc: stable@vger.kernel.org Fixes: `af92793e52` ("slab: Introduce kmalloc_nolock() and kfree_nolock().") Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org> Link: https://patch.msgid.link/20260427-nolock-api-fix-v2-2-a6b83a92d9a4@kernel.org Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>	2026-04-27 09:14:36 +02:00
Marco Elver	082a6d03a2	slub: fix data loss and overflow in krealloc() Commit `2cd8231796` ("mm/slub: allow to set node and align in k[v]realloc") introduced the ability to force a reallocation if the original object does not satisfy new alignment or NUMA node, even when the object is being shrunk. This introduced two bugs in the reallocation fallback path: 1. Data loss during NUMA migration: The jump to 'alloc_new' happens before 'ks' and 'orig_size' are initialized. As a result, the memcpy() in the 'alloc_new' block would copy 0 bytes into the new allocation. 2. Buffer overflow during shrinking: When shrinking an object while forcing a new alignment, 'new_size' is smaller than the old size. However, the memcpy() used the old size ('orig_size ?: ks'), leading to an out-of-bounds write. The same overflow bug exists in the kvrealloc() fallback path, where the old bucket size ksize(p) is copied into the new buffer without being bounded by the new size. A simple reproducer: // e.g. add to lkdtm as KREALLOC_SHRINK_OVERFLOW while (1) { void *p = kmalloc(128, GFP_KERNEL); p = krealloc_node_align(p, 64, 256, GFP_KERNEL, NUMA_NO_NODE); kfree(p); } demonstrates the issue: ================================================================== BUG: KFENCE: out-of-bounds write in memcpy_orig+0x68/0x130 Out-of-bounds write at 0xffff8883ad757038 (120B right of kfence-#47): memcpy_orig+0x68/0x130 krealloc_node_align_noprof+0x1c8/0x340 lkdtm_KREALLOC_SHRINK_OVERFLOW+0x8c/0xc0 [lkdtm] lkdtm_do_action+0x3a/0x60 [lkdtm] ... kfence-#47: 0xffff8883ad756fc0-0xffff8883ad756fff, size=64, cache=kmalloc-64 allocated by task 316 on cpu 7 at 97.680481s (0.021813s ago): krealloc_node_align_noprof+0x19c/0x340 lkdtm_KREALLOC_SHRINK_OVERFLOW+0x8c/0xc0 [lkdtm] lkdtm_do_action+0x3a/0x60 [lkdtm] ... ================================================================== Fix it by moving the old size calculation to the top of __do_krealloc() and bounding all copy lengths by the new allocation size. Fixes: `2cd8231796` ("mm/slub: allow to set node and align in k[v]realloc") Cc: stable@vger.kernel.org Reported-by: https://sashiko.dev/#/patchset/20260415143735.2974230-1-elver%40google.com Signed-off-by: Marco Elver <elver@google.com> Link: https://patch.msgid.link/20260416132837.3787694-1-elver@google.com Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>	2026-04-17 11:07:48 +02:00
Vlastimil Babka (SUSE)	44e0ebe4ac	Merge branch 'slab/for-7.1/misc' into slab/for-next Merge misc slab changes that are not related to sheaves. Various improvements for sysfs, debugging and testing.	2026-04-13 13:23:36 +02:00
Hao Li	5127483619	slub: clarify kmem_cache_refill_sheaf() comments In the in-place refill case, some objects may already have been added before the function returns -ENOMEM. Clarify this behavior and polish the rest of the comment for readability. Acked-by: Harry Yoo (Oracle) <harry@kernel.org> Signed-off-by: Hao Li <hao.li@linux.dev> Link: https://patch.msgid.link/20260407120018.42692-1-hao.li@linux.dev Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>	2026-04-07 14:39:11 +02:00
Hao Li	7f9bb84fdb	slub: use N_NORMAL_MEMORY in can_free_to_pcs to handle remote frees Memory hotplug now keeps N_NORMAL_MEMORY up to date correctly, so make can_free_to_pcs() use it. As a result, when freeing objects on memoryless nodes, or on nodes that have memory but only in ZONE_MOVABLE, the objects can be freed to the sheaf instead of going through the slow path. Signed-off-by: Hao Li <hao.li@linux.dev> Acked-by: Harry Yoo (Oracle) <harry@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Link: https://patch.msgid.link/20260403073958.8722-1-hao.li@linux.dev Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>	2026-04-07 11:10:52 +02:00
Vlastimil Babka (SUSE)	e65d430111	slab: free remote objects to sheaves on memoryless nodes On memoryless nodes we can now allocate from cpu sheaves and refill them normally. But when a node is memoryless on a system without actual CONFIG_HAVE_MEMORYLESS_NODES support, freeing always uses the slowpath because all objects appear as remote. We could instead benefit from the freeing fastpath, because the allocations can't obtain local objects anyway if the node is memoryless. Thus adapt the locality check when freeing, and move them to an inline function can_free_to_pcs() for a single shared implementation. On configurations with CONFIG_HAVE_MEMORYLESS_NODES=y continue using numa_mem_id() so the percpu sheaves and barn on a memoryless node will contain mostly objects from the closest memory node (returned by numa_mem_id()). No change is thus intended for such configuration. On systems with CONFIG_HAVE_MEMORYLESS_NODES=n use numa_node_id() (the cpu's node) since numa_mem_id() just aliases it anyway. But if we are freeing on a memoryless node, allow the freeing to use percpu sheaves for objects from any node, since they are all remote anyway. This way we avoid the slowpath and get more performant freeing. The potential downside is that allocations will obtain objects with a larger average distance. If we kept bypassing the sheaves on freeing, a refill of sheaves from slabs would tend to get closer objects thanks to the ordering of the zonelist. Architectures that allow de-facto memoryless nodes without proper CONFIG_HAVE_MEMORYLESS_NODES support should perhaps consider adding such support. Link: https://patch.msgid.link/20260311-b4-slab-memoryless-barns-v1-3-70ab850be4ce@kernel.org Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Hao Li <hao.li@linux.dev>	2026-03-19 13:22:49 +01:00
Vlastimil Babka (SUSE)	7f693882f0	slab: create barns for online memoryless nodes Ming Lei has reported [1] a performance regression due to replacing cpu (partial) slabs with sheaves. With slub stats enabled, a large amount of slowpath allocations were observed. The affected system has 8 online NUMA nodes but only 2 have memory. For sheaves to work effectively on given cpu, its NUMA node has to have struct node_barn allocated. Those are currently only allocated on nodes with memory (N_MEMORY) where kmem_cache_node also exist as the goal is to cache only node-local objects. But in order to have good performance on a memoryless node, we need its barn to exist and use sheaves to cache non-local objects (as no local objects can exist anyway). Therefore change the implementation to allocate barns on all online nodes, tracked in a new nodemask slab_barn_nodes. Also add a cpu hotplug callback as that's when a memoryless node can become online. Change both get_barn() and rcu_sheaf->node assignment to numa_node_id() so it's returned to the barn of the local cpu's (potentially memoryless) node, and not to the nearest node with memory anymore. On systems with CONFIG_HAVE_MEMORYLESS_NODES=y (which are not the main target of this change) barns did not exist on memoryless nodes, but get_barn() using numa_mem_id() meant a barn was returned from the nearest node with memory. This works, but the barn lock contention increases with every such memoryless node. With this change, barn will be allocated also on the memoryless node, reducing this contention in exchange for increased memory consumption. Reported-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/all/aZ0SbIqaIkwoW2mB@fedora/ [1] Link: https://patch.msgid.link/20260311-b4-slab-memoryless-barns-v1-2-70ab850be4ce@kernel.org Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Hao Li <hao.li@linux.dev>	2026-03-19 13:22:44 +01:00
Vlastimil Babka (SUSE)	5ba6bc27b1	slab: decouple pointer to barn from kmem_cache_node The pointer to barn currently exists in struct kmem_cache_node. That struct is instantiated for every NUMA node with memory, but we want to have a barn for every online node (including memoryless). Thus decouple the two structures. In struct kmem_cache we have an array for kmem_cache_node pointers that appears to be sized MAX_NUMNODES but the actual size calculation in kmem_cache_init() uses nr_node_ids. Therefore we can't just add another array of barn pointers. Instead change the array to newly introduced struct kmem_cache_per_node_ptrs holding both kmem_cache_node and barn pointer. Adjust barn accessor and allocation/initialization code accordingly. For now no functional change intended, barns are created 1:1 together with kmem_cache_nodes. Link: https://patch.msgid.link/20260311-b4-slab-memoryless-barns-v1-1-70ab850be4ce@kernel.org Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Hao Li <hao.li@linux.dev>	2026-03-19 13:22:39 +01:00
Vlastimil Babka (SUSE)	69d73421b7	slab: remove alloc_full_sheaf() The function allocates and then refills and empty sheaf. It's only called from __pcs_replace_empty_main(), which can also in some cases refill an empty sheaf. We can therefore consolidate this code. Remove alloc_full_sheaf() and refactor __pcs_replace_empty_main() so it will call alloc_empty_sheaf() when necessary, and then use the pre-existing refill_sheaf(). The result should be simpler to follow and less duplicated code. Also adjust the comment about returning sheaves to barn, the part about where the empty sheaf we'd be returning comes from is incorrect. No functional change intended. Reviewed-by: Qing Wang <wangqing7171@gmail.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Hao Li <hao.li@linux.dev> Link: https://patch.msgid.link/20260311-b4-slab-remove-alloc_full_sheaf-v1-1-c4c5bb587ae5@kernel.org Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>	2026-03-16 14:13:46 +01:00
Qing Wang	464b1c1158	slab: fix memory leak when refill_sheaf() fails When refill_sheaf() partially fills one sheaf (e.g., fills 5 objects but need to fill 10), it will update sheaf->size and return -ENOMEM. However, the callers (alloc_full_sheaf() and __pcs_replace_empty_main()) directly call free_empty_sheaf() on failure, which only does kfree(sheaf), causing the partially allocated objects memory in sheaf->objects[] leaked. Fix this by calling sheaf_flush_unused() before free_empty_sheaf() to free objects of sheaf->objects[]. And also add a WARN_ON() in free_empty_sheaf() to catch any future cases where a non-empty sheaf is being freed. Fixes: `ed30c4adfc` ("slab: add optimized sheaf refill from partial list") Signed-off-by: Qing Wang <wangqing7171@gmail.com> Link: https://patch.msgid.link/20260311093617.4155965-1-wangqing7171@gmail.com Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Hao Li <hao.li@linux.dev> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>	2026-03-11 17:55:26 +01:00
Harry Yoo	8dafa9f590	mm/slab: fix an incorrect check in obj_exts_alloc_size() obj_exts_alloc_size() prevents recursive allocation of slabobj_ext array from the same cache, to avoid creating slabs that are never freed. There is one mistake that returns the original size when memory allocation profiling is disabled. The assumption was that memcg-triggered slabobj_ext allocation is always served from KMALLOC_CGROUP type. But this is wrong [1]: when the caller specifies both __GFP_RECLAIMABLE and __GFP_ACCOUNT with SLUB_TINY enabled, the allocation is served from normal kmalloc. This is because kmalloc_type() prioritizes __GFP_RECLAIMABLE over __GFP_ACCOUNT, and SLUB_TINY aliases KMALLOC_RECLAIM with KMALLOC_NORMAL. As a result, the recursion guard is bypassed and the problematic slabs can be created. Fix this by removing the mem_alloc_profiling_enabled() check entirely. The remaining is_kmalloc_normal() check is still sufficient to detect whether the cache is of KMALLOC_NORMAL type and avoid bumping the size if it's not. Without SLUB_TINY, no functional change intended. With SLUB_TINY, allocations with __GFP_ACCOUNT\|__GFP_RECLAIMABLE now allocate a larger array if the sizes equal. Reported-by: Zw Tang <shicenci@gmail.com> Fixes: `280ea9c315` ("mm/slab: avoid allocating slabobj_ext array from its own slab") Closes: https://lore.kernel.org/linux-mm/CAPHJ_VKuMKSke8b11AZQw1PTSFN4n2C0gFxC6xGOG0ZLHgPmnA@mail.gmail.com [1] Cc: stable@vger.kernel.org Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260309072219.22653-1-harry.yoo@oracle.com Tested-by: Zw Tang <shicenci@gmail.com> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>	2026-03-10 11:02:54 +01:00
Vlastimil Babka (SUSE)	fb1091febd	mm/slab: allow sheaf refill if blocking is not allowed Ming Lei reported [1] a regression in the ublk null target benchmark due to sheaves. The profile shows that the alloc_from_pcs() fastpath fails and allocations fall back to ___slab_alloc(). It also shows the allocations happen through mempool_alloc(). The strategy of mempool_alloc() is to call the underlying allocator (here slab) without __GFP_DIRECT_RECLAIM first. This does not play well with __pcs_replace_empty_main() checking for gfpflags_allow_blocking() to decide if it should refill an empty sheaf or fallback to the slowpath, so we end up falling back. We could change the mempool strategy but there might be other paths doing the same ting. So instead allow sheaf refill when blocking is not allowed, changing the condition to gfpflags_allow_spinning(). The original condition was unnecessarily restrictive. Note this doesn't fully resolve the regression [1] as another component of that are memoryless nodes, which is to be addressed separately. Reported-by: Ming Lei <ming.lei@redhat.com> Fixes: `e47c897a29` ("slab: add sheaves to most caches") Link: https://lore.kernel.org/all/aZ0SbIqaIkwoW2mB@fedora/ [1] Link: https://patch.msgid.link/20260302095536.34062-2-vbabka@kernel.org Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>	2026-03-04 11:03:54 +01:00
Vlastimil Babka	48647d3f9a	slab: distinguish lock and trylock for sheaf_flush_main() sheaf_flush_main() can be called from __pcs_replace_full_main() where it's fine if the trylock fails, and pcs_flush_all() where it's not expected to and for some flush callers (when destroying the cache or memory hotremove) it would be actually a problem if it failed and left the main sheaf not flushed. The flush callers can however safely use local_lock() instead of trylock. The trylock failure should not happen in practice on !PREEMPT_RT, but can happen on PREEMPT_RT. The impact is limited in practice because when a trylock fails in the kmem_cache_destroy() path, it means someone is using the cache while destroying it, which is a bug on its own. The memory hotremove path is unlikely to be employed in a production RT config, but it's possible. To fix this, split the function into sheaf_flush_main() (using local_lock()) and sheaf_try_flush_main() (using local_trylock()) where both call __sheaf_flush_main_batch() to flush a single batch of objects. This will also allow lockdep to verify our context assumptions. The problem was raised in an off-list question by Marcelo. Fixes: `2d517aa09b` ("slab: add opt-in caching layer of percpu sheaves") Cc: stable@vger.kernel.org Reported-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Hao Li <hao.li@linux.dev> Link: https://patch.msgid.link/20260211-b4-sheaf-flush-v1-1-4e7f492f0055@suse.cz Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>	2026-03-02 10:04:22 +01:00
Harry Yoo	e9217ca77d	mm/slab: initialize slab->stride early to avoid memory ordering issues When alloc_slab_obj_exts() is called later (instead of during slab allocation and initialization), slab->stride and slab->obj_exts are updated after the slab is already accessible by multiple CPUs. The current implementation does not enforce memory ordering between slab->stride and slab->obj_exts. For correctness, slab->stride must be visible before slab->obj_exts. Otherwise, concurrent readers may observe slab->obj_exts as non-zero while stride is still stale. With stale slab->stride, slab_obj_ext() could return the wrong obj_ext. This could cause two problems: - obj_cgroup_put() is called on the wrong objcg, leading to a use-after-free due to incorrect reference counting [1] by decrementing the reference count more than it was incremented. - refill_obj_stock() is called on the wrong objcg, leading to a page_counter overflow [2] by uncharging more memory than charged. Fix this by unconditionally initializing slab->stride in alloc_slab_obj_exts_early(), before the need_slab_obj_exts() check. In the case of SLAB_OBJ_EXT_IN_OBJ, it is overridden in the function. This ensures updates to slab->stride become visible before the slab can be accessed by other CPUs via the per-node partial slab list (protected by spinlock with acquire/release semantics). Thanks to Shakeel Butt for pointing out this issue [3]. [vbabka@kernel.org: the bug reports [1] and [2] are not yet fully fixed, with investigation ongoing, but it is nevertheless a step in the right direction to only set stride once after allocating the slab and not change it later ] Fixes: `7a8e71bc61` ("mm/slab: use stride to access slabobj_ext") Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Link: https://lore.kernel.org/lkml/ca241daa-e7e7-4604-a48d-de91ec9184a5@linux.ibm.com [1] Link: https://lore.kernel.org/all/ddff7c7d-c0c3-4780-808f-9a83268bbf0c@linux.ibm.com [2] Link: https://lore.kernel.org/linux-mm/aZu9G9mVIVzSm6Ft@hyeyoo [3] Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>	2026-02-27 16:22:57 +01:00
Suren Baghdasaryan	f3ec502b67	mm/slab: mark alloc tags empty for sheaves allocated with __GFP_NO_OBJ_EXT alloc_empty_sheaf() allocates sheaves from SLAB_KMALLOC caches using __GFP_NO_OBJ_EXT to avoid recursion, however it does not mark their allocation tags empty before freeing, which results in a warning when CONFIG_MEM_ALLOC_PROFILING_DEBUG is set. Fix this by marking allocation tags for such sheaves as empty. The problem was technically introduced in commit `4c0a17e283` but only becomes possible to hit with commit `913ffd3a1b`. Fixes: `4c0a17e283` ("slab: prevent recursive kmalloc() in alloc_empty_sheaf()") Fixes: `913ffd3a1b` ("slab: handle kmalloc sheaves bootstrap") Reported-by: David Wang <00107082@163.com> Closes: https://lore.kernel.org/all/20260223155128.3849-1-00107082@163.com/ Analyzed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Tested-by: Harry Yoo <harry.yoo@oracle.com> Tested-by: David Wang <00107082@163.com> Link: https://patch.msgid.link/20260225163407.2218712-1-surenb@google.com Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>	2026-02-26 17:30:32 +01:00
Harry Yoo	021ca6b670	mm/slab: pass __GFP_NOWARN to refill_sheaf() if fallback is available When refill_sheaf() is called, failing to refill the sheaf doesn't necessarily mean the allocation will fail because a fallback path might be available and serve the allocation request. Suppress spurious warnings by passing __GFP_NOWARN along with __GFP_NOMEMALLOC whenever a fallback path is available. When the caller is alloc_full_sheaf() or __pcs_replace_empty_main(), the kernel always falls back to the slowpath (__slab_alloc_node()). For __prefill_sheaf_pfmemalloc(), the fallback path is available only when gfp_pfmemalloc_allowed() returns true. Reported-and-tested-by: Chris Bainbridge <chris.bainbridge@gmail.com> Closes: https://lore.kernel.org/linux-mm/aZt2-oS9lkmwT7Ch@debian.local Fixes: `1ce20c28ea` ("slab: handle pfmemalloc slabs properly with sheaves") Link: https://lore.kernel.org/linux-mm/aZwSreGj9-HHdD-j@hyeyoo Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260223133322.16705-1-harry.yoo@oracle.com Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>	2026-02-26 17:30:06 +01:00
Thomas Weißschuh	9042e77a5c	mm/slab: constify sysfs attributes These attributes are never modified, make them read-only. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260223-sysfs-const-slub-v1-2-ff86ffc26fff@weissschuh.net Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>	2026-02-26 15:25:06 +01:00
Thomas Weißschuh	5aa2a02b98	mm/slab: create sysfs attribute through default_groups The driver core can automatically create custom type attributes. This makes the code and error-handling shorter. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260223-sysfs-const-slub-v1-1-ff86ffc26fff@weissschuh.net Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>	2026-02-26 15:25:06 +01:00
Linus Torvalds	bf4afc53b7	Convert 'alloc_obj' family to use the new default GFP_KERNEL argument This was done entirely with mindless brute force, using git grep -l '\<k[vmz]alloc_objs(., GFP_KERNEL)' \| xargs sed -i 's/\(alloc_objs(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 17:09:51 -08:00
Kees Cook	69050f8d6d	treewide: Replace kmalloc with kmalloc_obj for non-scalar types This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(PTR, FAM, COUNT, ...) (where TYPE may also be VAR) The resulting allocations no longer return "void ", instead returning "TYPE ". Signed-off-by: Kees Cook <kees@kernel.org>	2026-02-21 01:02:28 -08:00
Linus Torvalds	9702969978	slab updates for 7.0 part2 -----BEGIN PGP SIGNATURE----- iQFPBAABCAA5FiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmmTRqgbFIAAAAAABAAO bWFudTIsMi41KzEuMTEsMiwyAAoJELvgsHXSRYiaUboIAIQRGZNZLzAD04PpEwDe LP3g1iI6DytfzHkcqkf+cV1OHpsKZjKUDY8xw42L3ztktzD83W6ypSzQBz1opnUx 5w7N8EoE/GtY+pbOgBwGi7rvwg2i0+IkCdt9R8VpKa5fmwcgWcIpNtp0XRdWjWTb pn04sRTHiNHlMZxdVHVAmlxgcC/8SNBHi4w5KJtDUrq+bkZUS3XAN2ssU3oKBpMy OxhZw7BwfIO7PbBLFTrGQNPjfDU6IL7q8p7T6JcLyugPmqbvzAk07fDOs6GBFPBt jc1wZvC8h32y7WnWqA4rU+g06jXb088B71IywpxzUSIyPs0rfGy/eEtdEOBWrqIT 5o8= =dulw -----END PGP SIGNATURE----- Merge tag 'slab-for-7.0-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull more slab updates from Vlastimil Babka: - Two stable fixes for kmalloc_nolock() usage from NMI context (Harry Yoo) - Allow kmalloc_nolock() allocations to be freed with kfree() and thus also kfree_rcu() and simplify slabobj_ext handling - we no longer need to track how it was allocated to use the matching freeing function (Harry Yoo) * tag 'slab-for-7.0-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm/slab: drop the OBJEXTS_NOSPIN_ALLOC flag from enum objext_flags mm/slab: allow freeing kmalloc_nolock()'d objects using kfree[_rcu]() mm/slab: use prandom if !allow_spin mm/slab: do not access current->mems_allowed_seq if !allow_spin	2026-02-16 13:41:38 -08:00
Linus Torvalds	4cff5c05e0	mm.git review status for linus..mm-stable Everything: Total patches: 325 Reviews/patch: 1.39 Reviewed rate: 72% Excluding DAMON: Total patches: 262 Reviews/patch: 1.63 Reviewed rate: 82% Excluding DAMON and zram: Total patches: 248 Reviews/patch: 1.72 Reviewed rate: 86% - The 14 patch series "powerpc/64s: do not re-activate batched TLB flush" from Alexander Gordeev makes arch_{enter\|leave}_lazy_mmu_mode() nest properly. It adds a generic enter/leave layer and switches architectures to use it. Various hacks were removed in the process. - The 7 patch series "zram: introduce compressed data writeback" from Richard Chang and Sergey Senozhatsky implements data compression for zram writeback. - The 8 patch series "mm: folio_zero_user: clear page ranges" from David Hildenbrand adds clearing of contiguous page ranges for hugepages. Large improvements during demand faulting are demonstrated. - The 2 patch series "memcg cleanups" from Chen Ridong tideis up some memcg code. - The 12 patch series "mm/damon: introduce {,max_}nr_snapshots and tracepoint for damos stats" from SeongJae Park improves DAMOS stat's provided information, deterministic control, and readability. - The 3 patch series "selftests/mm: hugetlb cgroup charging: robustness fixes" from Li Wang fixes a few issues in the hugetlb cgroup charging selftests. - The 5 patch series "Fix va_high_addr_switch.sh test failure - again" from Chunyu Hu addresses several issues in the va_high_addr_switch test. - The 5 patch series "mm/damon/tests/core-kunit: extend existing test scenarios" from Shu Anzai improves the KUnit test coverage for DAMON. - The 2 patch series "mm/khugepaged: fix dirty page handling for MADV_COLLAPSE" from Shivank Garg fixes a glitch in khugepaged which was causing madvise(MADV_COLLAPSE) to transiently return -EAGAIN. - The 29 patch series "arch, mm: consolidate hugetlb early reservation" from Mike Rapoport reworks and consolidates a pile of straggly code related to reservation of hugetlb memory from bootmem and creation of CMA areas for hugetlb. - The 9 patch series "mm: clean up anon_vma implementation" from Lorenzo Stoakes cleans up the anon_vma implementation in various ways. - The 3 patch series "tweaks for __alloc_pages_slowpath()" from Vlastimil Babka does a little streamlining of the page allocator's slowpath code. - The 8 patch series "memcg: separate private and public ID namespaces" from Shakeel Butt cleans up the memcg ID code and prevents the internal-only private IDs from being exposed to userspace. - The 6 patch series "mm: hugetlb: allocate frozen gigantic folio" from Kefeng Wang cleans up the allocation of frozen folios and avoids some atomic refcount operations. - The 11 patch series "mm/damon: advance DAMOS-based LRU sorting" from SeongJae Park improves DAMOS's movement of memory betewwn the active and inactive LRUs and adds auto-tuning of the ratio-based quotas and of monitoring intervals. - The 18 patch series "Support page table check on PowerPC" from Andrew Donnellan makes CONFIG_PAGE_TABLE_CHECK_ENFORCED work on powerpc. - The 3 patch series "nodemask: align nodes_and{,not} with underlying bitmap ops" from Yury Norov makes nodes_and() and nodes_andnot() propagate the return values from the underlying bit operations, enabling some cleanup in calling code. - The 5 patch series "mm/damon: hide kdamond and kdamond_lock from API callers" from SeongJae Park cleans up some DAMON internal interfaces. - The 4 patch series "mm/khugepaged: cleanups and scan limit fix" from Shivank Garg does some cleanup work in khupaged and fixes a scan limit accounting issue. - The 24 patch series "mm: balloon infrastructure cleanups" from David Hildenbrand goes to town on the balloon infrastructure and its page migration function. Mainly cleanups, also some locking simplification. - The 2 patch series "mm/vmscan: add tracepoint and reason for kswapd_failures reset" from Jiayuan Chen adds additional tracepoints to the page reclaim code. - The 3 patch series "Replace wq users and add WQ_PERCPU to alloc_workqueue() users" from Marco Crivellari is part of Marco's kernel-wide migration from the legacy workqueue APIs over to the preferred unbound workqueues. - The 9 patch series "Various mm kselftests improvements/fixes" from Kevin Brodsky provides various unrelated improvements/fixes for the mm kselftests. - The 5 patch series "mm: accelerate gigantic folio allocation" from Kefeng Wang greatly speeds up gigantic folio allocation, mainly by avoiding unnecessary work in pfn_range_valid_contig(). - The 5 patch series "selftests/damon: improve leak detection and wss estimation reliability" from SeongJae Park improves the reliability of two of the DAMON selftests. - The 8 patch series "mm/damon: cleanup kdamond, damon_call(), damos filter and DAMON_MIN_REGION" from SeongJae Park does some cleanup work in the core DAMON code. - The 8 patch series "Docs/mm/damon: update intro, modules, maintainer profile, and misc" from SeongJae Park performs maintenance work on the DAMON documentation. - The 10 patch series "mm: add and use vma_assert_stabilised() helper" from Lorenzo Stoakes refactors and cleans up the core VMA code. The main aim here is to be able to use the mmap write lock's lockdep state to perform various assertions regarding the locking which the VMA code requires. - The 19 patch series "mm, swap: swap table phase II: unify swapin use" from Kairui Song removes some old swap code (swap cache bypassing and swap synchronization) which wasn't working very well. Various other cleanups and simplifications were made. The end result is a 20% speedup in one benchmark. - The 8 patch series "enable PT_RECLAIM on more 64-bit architectures" from Qi Zheng makes PT_RECLAIM available on 64-bit alpha, loongarch, mips, parisc, um, Various cleanups were performed along the way. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaY1HfAAKCRDdBJ7gKXxA jqhZAP9H8ZlKKqCEgnr6U5XXmJ63Ep2FDQpl8p35yr9yVuU9+gEAgfyWiJ43l1fP rT0yjsUW3KQFBi/SEA3R6aYarmoIBgI= =+HLt -----END PGP SIGNATURE----- Merge tag 'mm-stable-2026-02-11-19-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "powerpc/64s: do not re-activate batched TLB flush" makes arch_{enter\|leave}_lazy_mmu_mode() nest properly (Alexander Gordeev) It adds a generic enter/leave layer and switches architectures to use it. Various hacks were removed in the process. - "zram: introduce compressed data writeback" implements data compression for zram writeback (Richard Chang and Sergey Senozhatsky) - "mm: folio_zero_user: clear page ranges" adds clearing of contiguous page ranges for hugepages. Large improvements during demand faulting are demonstrated (David Hildenbrand) - "memcg cleanups" tidies up some memcg code (Chen Ridong) - "mm/damon: introduce {,max_}nr_snapshots and tracepoint for damos stats" improves DAMOS stat's provided information, deterministic control, and readability (SeongJae Park) - "selftests/mm: hugetlb cgroup charging: robustness fixes" fixes a few issues in the hugetlb cgroup charging selftests (Li Wang) - "Fix va_high_addr_switch.sh test failure - again" addresses several issues in the va_high_addr_switch test (Chunyu Hu) - "mm/damon/tests/core-kunit: extend existing test scenarios" improves the KUnit test coverage for DAMON (Shu Anzai) - "mm/khugepaged: fix dirty page handling for MADV_COLLAPSE" fixes a glitch in khugepaged which was causing madvise(MADV_COLLAPSE) to transiently return -EAGAIN (Shivank Garg) - "arch, mm: consolidate hugetlb early reservation" reworks and consolidates a pile of straggly code related to reservation of hugetlb memory from bootmem and creation of CMA areas for hugetlb (Mike Rapoport) - "mm: clean up anon_vma implementation" cleans up the anon_vma implementation in various ways (Lorenzo Stoakes) - "tweaks for __alloc_pages_slowpath()" does a little streamlining of the page allocator's slowpath code (Vlastimil Babka) - "memcg: separate private and public ID namespaces" cleans up the memcg ID code and prevents the internal-only private IDs from being exposed to userspace (Shakeel Butt) - "mm: hugetlb: allocate frozen gigantic folio" cleans up the allocation of frozen folios and avoids some atomic refcount operations (Kefeng Wang) - "mm/damon: advance DAMOS-based LRU sorting" improves DAMOS's movement of memory betewwn the active and inactive LRUs and adds auto-tuning of the ratio-based quotas and of monitoring intervals (SeongJae Park) - "Support page table check on PowerPC" makes CONFIG_PAGE_TABLE_CHECK_ENFORCED work on powerpc (Andrew Donnellan) - "nodemask: align nodes_and{,not} with underlying bitmap ops" makes nodes_and() and nodes_andnot() propagate the return values from the underlying bit operations, enabling some cleanup in calling code (Yury Norov) - "mm/damon: hide kdamond and kdamond_lock from API callers" cleans up some DAMON internal interfaces (SeongJae Park) - "mm/khugepaged: cleanups and scan limit fix" does some cleanup work in khupaged and fixes a scan limit accounting issue (Shivank Garg) - "mm: balloon infrastructure cleanups" goes to town on the balloon infrastructure and its page migration function. Mainly cleanups, also some locking simplification (David Hildenbrand) - "mm/vmscan: add tracepoint and reason for kswapd_failures reset" adds additional tracepoints to the page reclaim code (Jiayuan Chen) - "Replace wq users and add WQ_PERCPU to alloc_workqueue() users" is part of Marco's kernel-wide migration from the legacy workqueue APIs over to the preferred unbound workqueues (Marco Crivellari) - "Various mm kselftests improvements/fixes" provides various unrelated improvements/fixes for the mm kselftests (Kevin Brodsky) - "mm: accelerate gigantic folio allocation" greatly speeds up gigantic folio allocation, mainly by avoiding unnecessary work in pfn_range_valid_contig() (Kefeng Wang) - "selftests/damon: improve leak detection and wss estimation reliability" improves the reliability of two of the DAMON selftests (SeongJae Park) - "mm/damon: cleanup kdamond, damon_call(), damos filter and DAMON_MIN_REGION" does some cleanup work in the core DAMON code (SeongJae Park) - "Docs/mm/damon: update intro, modules, maintainer profile, and misc" performs maintenance work on the DAMON documentation (SeongJae Park) - "mm: add and use vma_assert_stabilised() helper" refactors and cleans up the core VMA code. The main aim here is to be able to use the mmap write lock's lockdep state to perform various assertions regarding the locking which the VMA code requires (Lorenzo Stoakes) - "mm, swap: swap table phase II: unify swapin use" removes some old swap code (swap cache bypassing and swap synchronization) which wasn't working very well. Various other cleanups and simplifications were made. The end result is a 20% speedup in one benchmark (Kairui Song) - "enable PT_RECLAIM on more 64-bit architectures" makes PT_RECLAIM available on 64-bit alpha, loongarch, mips, parisc, and um. Various cleanups were performed along the way (Qi Zheng) * tag 'mm-stable-2026-02-11-19-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (325 commits) mm/memory: handle non-split locks correctly in zap_empty_pte_table() mm: move pte table reclaim code to memory.c mm: make PT_RECLAIM depends on MMU_GATHER_RCU_TABLE_FREE mm: convert __HAVE_ARCH_TLB_REMOVE_TABLE to CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config um: mm: enable MMU_GATHER_RCU_TABLE_FREE parisc: mm: enable MMU_GATHER_RCU_TABLE_FREE mips: mm: enable MMU_GATHER_RCU_TABLE_FREE LoongArch: mm: enable MMU_GATHER_RCU_TABLE_FREE alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE mm: change mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h mm/damon/stat: remove __read_mostly from memory_idle_ms_percentiles zsmalloc: make common caches global mm: add SPDX id lines to some mm source files mm/zswap: use %pe to print error pointers mm/vmscan: use %pe to print error pointers mm/readahead: fix typo in comment mm: khugepaged: fix NR_FILE_PAGES and NR_SHMEM in collapse_file() mm: refactor vma_map_pages to use vm_insert_pages mm/damon: unify address range representation with damon_addr_range mm/cma: replace snprintf with strscpy in cma_new_area ...	2026-02-12 11:32:37 -08:00
Linus Torvalds	148f95f75c	slab updates for 7.0 -----BEGIN PGP SIGNATURE----- iQFPBAABCAA5FiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmmK68UbFIAAAAAABAAO bWFudTIsMi41KzEuMTEsMiwyAAoJELvgsHXSRYiamM8H/0eOKSvZG/C/HdTm36cy pVjOjgX9KmlHoeH1dOMjqgL2KfOIBis8j1GY0Q/qF1a86uzQa6uuz4XdmJeTUkEE YfzwOdaLIR0U6R/gIH9YPfyU9h3VBLUNtotDculntSO3ZgwY5QUHQHz+ROnVG5SU MSQ2oshSRkh06LRIlvbd0kLax8vZy8UjfYPonF33+XRya17nIY6V2DvqC0MDuEcM jWvbQfm5HTamTAlSV4bmJw+U/FehEdpC4U0ulsAtQILGpJvHCwqDGCNQRFkzcsaM yhi1JLFCZcoHqbQycZMNAypPERfIp8O5thSU6xU2AP/cNl2scR/7/MSuWOvjKBv4 pKU= =u52A -----END PGP SIGNATURE----- Merge tag 'slab-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab updates from Vlastimil Babka: - The percpu sheaves caching layer was introduced as opt-in in 6.18 and now we enable it for all caches and remove the previous cpu (partial) slab caching mechanism. Besides the lower locking overhead and much more likely fastpath when freeing, this removes the rather complicated code related to the cpu slab lockless fastpaths (using this_cpu_try_cmpxchg128/64) and all its complications for PREEMPT_RT or kmalloc_nolock(). The lockless slab freelist+counters update operation using try_cmpxchg128/64 remains and is crucial for freeing remote NUMA objects, and to allow flushing objects from sheaves to slabs mostly without the node list_lock (Vlastimil Babka) - Eliminate slabobj_ext metadata overhead when possible. Instead of using kmalloc() to allocate the array for memcg and/or allocation profiling tag pointers, use leftover space in a slab or per-object padding due to alignment (Harry Yoo) - Various followup improvements to the above (Hao Li) * tag 'slab-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (39 commits) slub: let need_slab_obj_exts() return false if SLAB_NO_OBJ_EXT is set mm/slab: only allow SLAB_OBJ_EXT_IN_OBJ for unmergeable caches mm/slab: place slabobj_ext metadata in unused space within s->size mm/slab: move [__]ksize and slab_ksize() to mm/slub.c mm/slab: save memory by allocating slabobj_ext array from leftover mm/memcontrol,alloc_tag: handle slabobj_ext access under KASAN poison mm/slab: use stride to access slabobj_ext mm/slab: abstract slabobj_ext access via new slab_obj_ext() helper ext4: specify the free pointer offset for ext4_inode_cache mm/slab: allow specifying free pointer offset when using constructor mm/slab: use unsigned long for orig_size to ensure proper metadata align slub: clarify object field layout comments mm/slab: avoid allocating slabobj_ext array from its own slab slub: avoid list_lock contention from __refill_objects_any() mm/slub: cleanup and repurpose some stat items mm/slub: remove DEACTIVATE_TO_* stat items slab: remove frozen slab checks from __slab_free() slab: update overview comments slab: refill sheaves from all nodes slab: remove unused PREEMPT_RT specific macros ...	2026-02-11 14:12:50 -08:00
Harry Yoo	27125df9a5	mm/slab: drop the OBJEXTS_NOSPIN_ALLOC flag from enum objext_flags OBJEXTS_NOSPIN_ALLOC was used to remember whether a slabobj_ext vector was allocated via kmalloc_nolock(), so that free_slab_obj_exts() could call kfree_nolock() instead of kfree(). Now that kfree() supports freeing kmalloc_nolock() objects, this flag is no longer needed. Instead, pass the allow_spin parameter down to free_slab_obj_exts() to determine whether kfree_nolock() or kfree() should be called in the free path, and free one bit in enum objext_flags. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Hao Li <hao.li@linux.dev> Link: https://patch.msgid.link/20260210044642.139482-3-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-02-10 11:39:30 +01:00
Harry Yoo	c4d6d78298	mm/slab: allow freeing kmalloc_nolock()'d objects using kfree[_rcu]() Slab objects that are allocated with kmalloc_nolock() must be freed using kfree_nolock() because only a subset of alloc hooks are called, since kmalloc_nolock() can't spin on a lock during allocation. This imposes a limitation: such objects cannot be freed with kfree_rcu(), forcing users to work around this limitation by calling call_rcu() with a callback that frees the object using kfree_nolock(). Remove this limitation by teaching kmemleak to gracefully ignore cases when kmemleak_free() or kmemleak_ignore() is called without a prior kmemleak_alloc(). Unlike kmemleak, kfence already handles this case, because, due to its design, only a subset of allocations are served from kfence. With this change, kfree() and kfree_rcu() can be used to free objects that are allocated using kmalloc_nolock(). Suggested-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260210044642.139482-2-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-02-10 11:39:30 +01:00
Harry Yoo	a1e244a9f1	mm/slab: use prandom if !allow_spin When CONFIG_SLAB_FREELIST_RANDOM is enabled and get_random_u32() is called in an NMI context, lockdep complains because it acquires a local_lock: ================================ WARNING: inconsistent lock state 6.19.0-rc5-slab-for-next+ #325 Tainted: G N -------------------------------- inconsistent {INITIAL USE} -> {IN-NMI} usage. kunit_try_catch/8312 [HC2[2]:SC0[0]:HE0:SE1] takes: ffff88a02ec49cc0 (batched_entropy_u32.lock){-.-.}-{3:3}, at: get_random_u32+0x7f/0x2e0 {INITIAL USE} state was registered at: lock_acquire+0xd9/0x2f0 get_random_u32+0x93/0x2e0 __get_random_u32_below+0x17/0x70 cache_random_seq_create+0x121/0x1c0 init_cache_random_seq+0x5d/0x110 do_kmem_cache_create+0x1e0/0xa30 __kmem_cache_create_args+0x4ec/0x830 create_kmalloc_caches+0xe6/0x130 kmem_cache_init+0x1b1/0x660 mm_core_init+0x1d8/0x4b0 start_kernel+0x620/0xcd0 x86_64_start_reservations+0x18/0x30 x86_64_start_kernel+0xf3/0x140 common_startup_64+0x13e/0x148 irq event stamp: 76 hardirqs last enabled at (75): [<ffffffff8298b77a>] exc_nmi+0x11a/0x240 hardirqs last disabled at (76): [<ffffffff8298b991>] sysvec_irq_work+0x11/0x110 softirqs last enabled at (0): [<ffffffff813b2dda>] copy_process+0xc7a/0x2350 softirqs last disabled at (0): [<0000000000000000>] 0x0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(batched_entropy_u32.lock); <Interrupt> lock(batched_entropy_u32.lock); * DEADLOCK * Fix this by using pseudo-random number generator if !allow_spin. This means kmalloc_nolock() users won't get truly random numbers, but there is not much we can do about it. Note that an NMI handler might interrupt prandom_u32_state() and change the random state, but that's safe. Link: https://lore.kernel.org/all/0c33bdee-6de8-4d9f-92ca-4f72c1b6fb9f@suse.cz Fixes: `af92793e52` ("slab: Introduce kmalloc_nolock() and kfree_nolock().") Cc: stable@vger.kernel.org Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260210081900.329447-3-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-02-10 10:55:32 +01:00
Harry Yoo	144080a582	mm/slab: do not access current->mems_allowed_seq if !allow_spin Lockdep complains when get_from_any_partial() is called in an NMI context, because current->mems_allowed_seq is seqcount_spinlock_t and not NMI-safe: ================================ WARNING: inconsistent lock state 6.19.0-rc5-kfree-rcu+ #315 Tainted: G N -------------------------------- inconsistent {INITIAL USE} -> {IN-NMI} usage. kunit_try_catch/9989 [HC1[1]:SC0[0]:HE0:SE1] takes: ffff889085799820 (&____s->seqcount#3){.-.-}-{0:0}, at: ___slab_alloc+0x58f/0xc00 {INITIAL USE} state was registered at: lock_acquire+0x185/0x320 kernel_init_freeable+0x391/0x1150 kernel_init+0x1f/0x220 ret_from_fork+0x736/0x8f0 ret_from_fork_asm+0x1a/0x30 irq event stamp: 56 hardirqs last enabled at (55): [<ffffffff850a68d7>] _raw_spin_unlock_irq+0x27/0x70 hardirqs last disabled at (56): [<ffffffff850858ca>] __schedule+0x2a8a/0x6630 softirqs last enabled at (0): [<ffffffff81536711>] copy_process+0x1dc1/0x6a10 softirqs last disabled at (0): [<0000000000000000>] 0x0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&____s->seqcount#3); <Interrupt> lock(&____s->seqcount#3); * DEADLOCK * According to Documentation/locking/seqlock.rst, seqcount_t is not NMI-safe and seqcount_latch_t should be used when read path can interrupt the write-side critical section. In this case, do not access current->mems_allowed_seq and avoid retry. Fixes: `af92793e52` ("slab: Introduce kmalloc_nolock() and kfree_nolock().") Cc: stable@vger.kernel.org Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260210081900.329447-2-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-02-10 10:55:31 +01:00
Vlastimil Babka	815c8e3551	Merge branch 'slab/for-7.0/sheaves' into slab/for-next Merge series "slab: replace cpu (partial) slabs with sheaves". The percpu sheaves caching layer was introduced as opt-in but the goal was to eventually move all caches to them. This is the next step, enabling sheaves for all caches (except the two bootstrap ones) and then removing the per cpu (partial) slabs and lots of associated code. Besides the lower locking overhead and much more likely fastpath when freeing, this removes the rather complicated code related to the cpu slab lockless fastpaths (using this_cpu_try_cmpxchg128/64) and all its complications for PREEMPT_RT or kmalloc_nolock(). The lockless slab freelist+counters update operation using try_cmpxchg128/64 remains and is crucial for freeing remote NUMA objects and to allow flushing objects from sheaves to slabs mostly without the node list_lock. Link: https://lore.kernel.org/all/20260123-sheaves-for-all-v4-0-041323d506f7@suse.cz/	2026-02-10 09:10:00 +01:00
Hao Li	98e99fc4ad	slub: let need_slab_obj_exts() return false if SLAB_NO_OBJ_EXT is set SLAB_NO_OBJ_EXT is set for boot caches, but need_slab_obj_exts() doesn't check this flag. We should return false unconditionally when SLAB_NO_OBJ_EXT is set. Signed-off-by: Hao Li <hao.li@linux.dev> Acked-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260205120709.425719-1-hao.li@linux.dev Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-02-06 10:39:36 +01:00
Hao Ge	e6c53ead2d	mm/slab: Add alloc_tagging_slab_free_hook for memcg_alloc_abort_single When CONFIG_MEM_ALLOC_PROFILING_DEBUG is enabled, the following warning may be noticed: [ 3959.023862] ------------[ cut here ]------------ [ 3959.023891] alloc_tag was not cleared (got tag for lib/xarray.c:378) [ 3959.023947] WARNING: ./include/linux/alloc_tag.h:155 at alloc_tag_add+0x128/0x178, CPU#6: mkfs.ntfs/113998 [ 3959.023978] Modules linked in: dns_resolver tun brd overlay exfat btrfs blake2b libblake2b xor xor_neon raid6_pq loop sctp ip6_udp_tunnel udp_tunnel ext4 crc16 mbcache jbd2 rfkill sunrpc vfat fat sg fuse nfnetlink sr_mod virtio_gpu cdrom drm_client_lib virtio_dma_buf drm_shmem_helper drm_kms_helper ghash_ce drm sm4 backlight virtio_net net_failover virtio_scsi failover virtio_console virtio_blk virtio_mmio dm_mirror dm_region_hash dm_log dm_multipath dm_mod i2c_dev aes_neon_bs aes_ce_blk [last unloaded: hwpoison_inject] [ 3959.024170] CPU: 6 UID: 0 PID: 113998 Comm: mkfs.ntfs Kdump: loaded Tainted: G W 6.19.0-rc7+ #7 PREEMPT(voluntary) [ 3959.024182] Tainted: [W]=WARN [ 3959.024186] Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022 [ 3959.024192] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 3959.024199] pc : alloc_tag_add+0x128/0x178 [ 3959.024207] lr : alloc_tag_add+0x128/0x178 [ 3959.024214] sp : ffff80008b696d60 [ 3959.024219] x29: ffff80008b696d60 x28: 0000000000000000 x27: 0000000000000240 [ 3959.024232] x26: 0000000000000000 x25: 0000000000000240 x24: ffff800085d17860 [ 3959.024245] x23: 0000000000402800 x22: ffff0000c0012dc0 x21: 00000000000002d0 [ 3959.024257] x20: ffff0000e6ef3318 x19: ffff800085ae0410 x18: 0000000000000000 [ 3959.024269] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 [ 3959.024281] x14: 0000000000000000 x13: 0000000000000001 x12: ffff600064101293 [ 3959.024292] x11: 1fffe00064101292 x10: ffff600064101292 x9 : dfff800000000000 [ 3959.024305] x8 : 00009fff9befed6e x7 : ffff000320809493 x6 : 0000000000000001 [ 3959.024316] x5 : ffff000320809490 x4 : ffff600064101293 x3 : ffff800080691838 [ 3959.024328] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0000d5bcd640 [ 3959.024340] Call trace: [ 3959.024346] alloc_tag_add+0x128/0x178 (P) [ 3959.024355] __alloc_tagging_slab_alloc_hook+0x11c/0x1a8 [ 3959.024362] kmem_cache_alloc_lru_noprof+0x1b8/0x5e8 [ 3959.024369] xas_alloc+0x304/0x4f0 [ 3959.024381] xas_create+0x1e0/0x4a0 [ 3959.024388] xas_store+0x68/0xda8 [ 3959.024395] __filemap_add_folio+0x5b0/0xbd8 [ 3959.024409] filemap_add_folio+0x16c/0x7e0 [ 3959.024416] __filemap_get_folio_mpol+0x2dc/0x9e8 [ 3959.024424] iomap_get_folio+0xfc/0x180 [ 3959.024435] __iomap_get_folio+0x2f8/0x4b8 [ 3959.024441] iomap_write_begin+0x198/0xc18 [ 3959.024448] iomap_write_iter+0x2ec/0x8f8 [ 3959.024454] iomap_file_buffered_write+0x19c/0x290 [ 3959.024461] blkdev_write_iter+0x38c/0x978 [ 3959.024470] vfs_write+0x4d4/0x928 [ 3959.024482] ksys_write+0xfc/0x1f8 [ 3959.024489] __arm64_sys_write+0x74/0xb0 [ 3959.024496] invoke_syscall+0xd4/0x258 [ 3959.024507] el0_svc_common.constprop.0+0xb4/0x240 [ 3959.024514] do_el0_svc+0x48/0x68 [ 3959.024520] el0_svc+0x40/0xf8 [ 3959.024526] el0t_64_sync_handler+0xa0/0xe8 [ 3959.024533] el0t_64_sync+0x1ac/0x1b0 [ 3959.024540] ---[ end trace 0000000000000000 ]--- When __memcg_slab_post_alloc_hook() fails, there are two different free paths depending on whether size == 1 or size != 1. In the kmem_cache_free_bulk() path, we do call alloc_tagging_slab_free_hook(). However, in memcg_alloc_abort_single() we don't, the above warning will be triggered on the next allocation. Therefore, add alloc_tagging_slab_free_hook() to the memcg_alloc_abort_single() path. Fixes: `9f9796b413` ("mm, slab: move memcg charging to post-alloc hook") Cc: stable@vger.kernel.org Suggested-by: Hao Li <hao.li@linux.dev> Signed-off-by: Hao Ge <hao.ge@linux.dev> Reviewed-by: Hao Li <hao.li@linux.dev> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260204101401.202762-1-hao.ge@linux.dev Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-02-06 09:51:08 +01:00
Harry Yoo	2f35fee943	mm/slab: only allow SLAB_OBJ_EXT_IN_OBJ for unmergeable caches While SLAB_OBJ_EXT_IN_OBJ allows to reduce memory overhead to account slab objects, it prevents slab merging because merging can change the metadata layout. As pointed out Vlastimil Babka, disabling merging solely for this memory optimization may not be a net win, because disabling slab merging tends to increase overall memory usage. Restrict SLAB_OBJ_EXT_IN_OBJ to caches that are already unmergeable for other reasons (e.g., those with constructors or SLAB_TYPESAFE_BY_RCU). Suggested-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260127103151.21883-3-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-02-04 10:05:36 +01:00
Harry Yoo	a77d6d3386	mm/slab: place slabobj_ext metadata in unused space within s->size When a cache has high s->align value and s->object_size is not aligned to it, each object ends up with some unused space because of alignment. If this wasted space is big enough, we can use it to store the slabobj_ext metadata instead of wasting it. On my system, this happens with caches like kmem_cache, mm_struct, pid, task_struct, sighand_cache, xfs_inode, and others. To place the slabobj_ext metadata within each object, the existing slab_obj_ext() logic can still be used by setting: - slab->obj_exts = slab_address(slab) + (slabobj_ext offset) - stride = s->size slab_obj_ext() doesn't need know where the metadata is stored, so this method works without adding extra overhead to slab_obj_ext(). A good example benefiting from this optimization is xfs_inode (object_size: 992, align: 64). To measure memory savings, 2 millions of files were created on XFS. [ MEMCG=y, MEM_ALLOC_PROFILING=n ] Before patch (creating ~2.64M directories on xfs): Slab: 5175976 kB SReclaimable: 3837524 kB SUnreclaim: 1338452 kB After patch (creating ~2.64M directories on xfs): Slab: 5152912 kB SReclaimable: 3838568 kB SUnreclaim: 1314344 kB (-23.54 MiB) Enjoy the memory savings! Suggested-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260113061845.159790-10-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-02-04 10:05:36 +01:00
Harry Yoo	fab0694646	mm/slab: move [__]ksize and slab_ksize() to mm/slub.c To access SLUB's internal implementation details beyond cache flags in ksize(), move __ksize(), ksize(), and slab_ksize() to mm/slub.c. [vbabka@suse.cz: also make __ksize() static and move its kerneldoc to ksize() ] Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260113061845.159790-9-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-02-04 10:05:35 +01:00
Harry Yoo	70089d0188	mm/slab: save memory by allocating slabobj_ext array from leftover The leftover space in a slab is always smaller than s->size, and kmem caches for large objects that are not power-of-two sizes tend to have a greater amount of leftover space per slab. In some cases, the leftover space is larger than the size of the slabobj_ext array for the slab. An excellent example of such a cache is ext4_inode_cache. On my system, the object size is 1136, with a preferred order of 3, 28 objects per slab, and 960 bytes of leftover space per slab. Since the size of the slabobj_ext array is only 224 bytes (w/o mem profiling) or 448 bytes (w/ mem profiling) per slab, the entire array fits within the leftover space. Allocate the slabobj_exts array from this unused space instead of using kcalloc() when it is large enough. The array is allocated from unused space only when creating new slabs, and it doesn't try to utilize unused space if alloc_slab_obj_exts() is called after slab creation because implementing lazy allocation involves more expensive synchronization. The implementation and evaluation of lazy allocation from unused space is left as future-work. As pointed by Vlastimil Babka [1], it could be beneficial when a slab cache without SLAB_ACCOUNT can be created, and some of the allocations from the cache use __GFP_ACCOUNT. For example, xarray does that. To avoid unnecessary overhead when MEMCG (with SLAB_ACCOUNT) and MEM_ALLOC_PROFILING are not used for the cache, allocate the slabobj_ext array only when either of them is enabled on slab allocation. [ MEMCG=y, MEM_ALLOC_PROFILING=n ] Before patch (creating ~2.64M directories on ext4): Slab: 4747880 kB SReclaimable: 4169652 kB SUnreclaim: 578228 kB After patch (creating ~2.64M directories on ext4): Slab: `4724020` kB SReclaimable: 4169188 kB SUnreclaim: 554832 kB (-22.84 MiB) Enjoy the memory savings! Link: https://lore.kernel.org/linux-mm/48029aab-20ea-4d90-bfd1-255592b2018e@suse.cz [1] Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260113061845.159790-8-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-02-04 10:05:35 +01:00
Harry Yoo	4b1530f89c	mm/memcontrol,alloc_tag: handle slabobj_ext access under KASAN poison In the near future, slabobj_ext may reside outside the allocated slab object range within a slab, which could be reported as an out-of-bounds access by KASAN. As suggested by Andrey Konovalov [1], explicitly disable KASAN and KMSAN checks when accessing slabobj_ext within slab allocator, memory profiling, and memory cgroup code. While an alternative approach could be to unpoison slabobj_ext, out-of-bounds accesses outside the slab allocator are generally more common. Move metadata_access_enable()/disable() helpers to mm/slab.h so that it can be used outside mm/slub.c. However, as suggested by Suren Baghdasaryan [2], instead of calling them directly from mm code (which is more prone to errors), change users to access slabobj_ext via get/put APIs: - Users should call get_slab_obj_exts() to access slabobj_metadata and call put_slab_obj_exts() when it's done. - From now on, accessing it outside the section covered by get_slab_obj_exts() ~ put_slab_obj_exts() is illegal. This ensures that accesses to slabobj_ext metadata won't be reported as access violations. Call kasan_reset_tag() in slab_obj_ext() before returning the address to prevent SW or HW tag-based KASAN from reporting false positives. Suggested-by: Andrey Konovalov <andreyknvl@gmail.com> Suggested-by: Suren Baghdasaryan <surenb@google.com> Link: https://lore.kernel.org/linux-mm/CA+fCnZezoWn40BaS3cgmCeLwjT+5AndzcQLc=wH3BjMCu6_YCw@mail.gmail.com [1] Link: https://lore.kernel.org/linux-mm/CAJuCfpG=Lb4WhYuPkSpdNO4Ehtjm1YcEEK0OM=3g9i=LxmpHSQ@mail.gmail.com [2] Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260113061845.159790-7-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-02-04 10:05:35 +01:00
Harry Yoo	7a8e71bc61	mm/slab: use stride to access slabobj_ext Use a configurable stride value when accessing slab object extension metadata instead of assuming a fixed sizeof(struct slabobj_ext). Store stride value in free bits of slab->counters field. This allows for flexibility in cases where the extension is embedded within slab objects. Since these free bits exist only on 64-bit, any future optimizations that need to change stride value cannot be enabled on 32-bit architectures. Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260113061845.159790-6-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-02-04 10:05:35 +01:00
Harry Yoo	52f1ca8a45	mm/slab: abstract slabobj_ext access via new slab_obj_ext() helper Currently, the slab allocator assumes that slab->obj_exts is a pointer to an array of struct slabobj_ext objects. However, to support storage methods where struct slabobj_ext is embedded within objects, the slab allocator should not make this assumption. Instead of directly dereferencing the slabobj_exts array, abstract access to struct slabobj_ext via helper functions. Introduce a new API slabobj_ext metadata access: slab_obj_ext(slab, obj_exts, index) - returns the pointer to struct slabobj_ext element at the given index. Directly dereferencing the return value of slab_obj_exts() is no longer allowed. Instead, slab_obj_ext() must always be used to access individual struct slabobj_ext objects. Convert all users to use these APIs. No functional changes intended. Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260113061845.159790-5-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-02-04 10:05:35 +01:00
Harry Yoo	a13b68d79d	mm/slab: allow specifying free pointer offset when using constructor When a slab cache has a constructor, the free pointer is placed after the object because certain fields must not be overwritten even after the object is freed. However, some fields that the constructor does not initialize can safely be overwritten after free. Allow specifying the free pointer offset within the object, reducing the overall object size when some fields can be reused for the free pointer. Adjust the document accordingly. Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260113061845.159790-3-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-02-04 10:05:35 +01:00
Harry Yoo	b85f369b81	mm/slab: use unsigned long for orig_size to ensure proper metadata align When both KASAN and SLAB_STORE_USER are enabled, accesses to struct kasan_alloc_meta fields can be misaligned on 64-bit architectures. This occurs because orig_size is currently defined as unsigned int, which only guarantees 4-byte alignment. When struct kasan_alloc_meta is placed after orig_size, it may end up at a 4-byte boundary rather than the required 8-byte boundary on 64-bit systems. Note that 64-bit architectures without HAVE_EFFICIENT_UNALIGNED_ACCESS are assumed to require 64-bit accesses to be 64-bit aligned. See HAVE_64BIT_ALIGNED_ACCESS and commit `adab66b71a` ("Revert: "ring-buffer: Remove HAVE_64BIT_ALIGNED_ACCESS"") for more details. Change orig_size from unsigned int to unsigned long to ensure proper alignment for any subsequent metadata. This should not waste additional memory because kmalloc objects are already aligned to at least ARCH_KMALLOC_MINALIGN. Closes: https://lore.kernel.org/all/aPrLF0OUK651M4dk@hyeyoo Suggested-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: stable@vger.kernel.org Fixes: `6edf2576a6` ("mm/slub: enable debugging memory wasting of kmalloc") Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Closes: https://lore.kernel.org/all/aPrLF0OUK651M4dk@hyeyoo/ Link: https://patch.msgid.link/20260113061845.159790-2-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-02-04 10:05:35 +01:00
Hao Li	9346ee2b53	slub: clarify object field layout comments The comments above check_pad_bytes() document the field layout of a single object. Rewrite them to improve clarity and precision. Also update an outdated comment in calculate_sizes(). Suggested-by: Harry Yoo <harry.yoo@oracle.com> Acked-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Hao Li <hao.li@linux.dev> Link: https://patch.msgid.link/20251229122415.192377-1-hao.li@linux.dev Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-02-04 10:05:35 +01:00
Harry Yoo	280ea9c315	mm/slab: avoid allocating slabobj_ext array from its own slab When allocating slabobj_ext array in alloc_slab_obj_exts(), the array can be allocated from the same slab we're allocating the array for. This led to obj_exts_in_slab() incorrectly returning true [1], although the array is not allocated from wasted space of the slab. Vlastimil Babka observed that this problem should be fixed even when ignoring its incompatibility with obj_exts_in_slab(), because it creates slabs that are never freed as there is always at least one allocated object. To avoid this, use the next kmalloc size or large kmalloc when the array can be allocated from the same cache we're allocating the array for. In case of random kmalloc caches, there are multiple kmalloc caches for the same size and the cache is selected based on the caller address. Because it is fragile to ensure the same caller address is passed to kmalloc_slab(), kmalloc_noprof(), and kmalloc_node_noprof(), bump the size to (s->object_size + 1) when the sizes are equal, instead of directly comparing the kmem_cache pointers. Note that this doesn't happen when memory allocation profiling is disabled, as when the allocation of the array is triggered by memory cgroup (KMALLOC_CGROUP), the array is allocated from KMALLOC_NORMAL. Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202601231457.f7b31e09-lkp@intel.com [1] Cc: stable@vger.kernel.org Fixes: `4b87369646` ("mm/slab: add allocation accounting into slab allocation and free paths") Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260126125714.88008-1-harry.yoo@oracle.com Reviewed-by: Hao Li <hao.li@linux.dev> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-02-04 10:05:24 +01:00
Marco Crivellari	ed0a826ce3	mm: add WQ_PERCPU to alloc_workqueue users This continues the effort to refactor workqueue APIs, which began with the introduction of new workqueues and a new alloc_workqueue flag in: commit `128ea9f6cc` ("workqueue: Add system_percpu_wq and system_dfl_wq") commit `930c2ea566` ("workqueue: Add new WQ_PERCPU flag") The refactoring is going to alter the default behavior of alloc_workqueue() to be unbound by default. With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND), any alloc_workqueue() caller that doesn't explicitly specify WQ_UNBOUND must now use WQ_PERCPU. For more details see the Link tag below. In order to keep alloc_workqueue() behavior identical, explicitly request WQ_PERCPU. [akpm@linux-foundation.org: fix mm/slub.c] [akpm@linux-foundation.org: fix kmem_cache_init_late() properly, per Sebastian] Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/ Link: https://lkml.kernel.org/r/20260113114630.152942-4-marco.crivellari@suse.com Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Suggested-by: Tejun Heo <tj@kernel.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lai jiangshan <jiangshanlai@gmail.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Marco Elver <elver@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:39 -08:00
Vlastimil Babka	40fd0acc45	slub: avoid list_lock contention from __refill_objects_any() Kernel test robot has reported a regression in the patch "slab: refill sheaves from all nodes". When taken in isolation like this, there is indeed a tradeoff - we prefer to use remote objects prior to allocating new local slabs. It is replicating a behavior that existed before sheaves for replenishing cpu (partial) slabs - now called get_from_any_partial() to allocate a single object. So the possibility of allocating remote objects is intended even if remote accesses are then slower. But the profiles in the report also suggested a contention on the list_lock spinlock. And that's something we can try to avoid without much tradeoff - if someone else has the spin_lock, it's more likely they are allocating from the node than freeing to it, so we can skip it even if it means allocating a new local slab - contributing to that lock's contention isn't worth it. It should not result in partial slabs accumulating on the remote node. Thus add an allow_spin parameter to __refill_objects_node() and get_partial_node_bulk() to make the attempts from __refill_objects_any() use only a trylock. Reported-by: kernel test robot <oliver.sang@intel.com> Link: https://lore.kernel.org/oe-lkp/202601132136.77efd6d7-lkp@intel.com Link: https://patch.msgid.link/20260129-b4-refill_any_trylock-v1-1-de7420b25840@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-01-29 10:10:20 +01:00
Vlastimil Babka	6f1912181d	mm/slub: cleanup and repurpose some stat items A number of stat items related to cpu slabs became unused, remove them. Two of those were ALLOC_FASTPATH and FREE_FASTPATH. But instead of removing those, use them instead of ALLOC_PCS and FREE_PCS, since sheaves are the new (and only) fastpaths, Remove the recently added _PCS variants instead. Change where FREE_SLOWPATH is counted so that it only counts freeing of objects by slab users that (for whatever reason) do not go to a percpu sheaf, and not all (including internal) callers of __slab_free(). Thus sheaf flushing (already counted by SHEAF_FLUSH) does not affect FREE_SLOWPATH anymore. This matches how ALLOC_SLOWPATH doesn't count sheaf refills (counted by SHEAF_REFILL). Reviewed-by: Hao Li <hao.li@linux.dev> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-01-29 09:29:51 +01:00
Vlastimil Babka	fb016a5ec7	mm/slub: remove DEACTIVATE_TO_* stat items The cpu slabs and their deactivations were removed, so remove the unused stat items. Weirdly enough the values were also used to control __add_partial() adding to head or tail of the list, so replace that with a new enum add_mode, which is cleaner. Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Hao Li <hao.li@linux.dev> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-01-29 09:29:41 +01:00
Vlastimil Babka	b16af1c812	slab: remove frozen slab checks from __slab_free() Currently slabs are only frozen after consistency checks failed. This can happen only in caches with debugging enabled, and those use free_to_partial_list() for freeing. The non-debug operation of __slab_free() can thus stop considering the frozen field, and we can remove the FREE_FROZEN stat. Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Hao Li <hao.li@linux.dev> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-01-29 09:29:27 +01:00
Vlastimil Babka	0f7075bea8	slab: update overview comments The changes related to sheaves made the description of locking and other details outdated. Update it to reflect current state. Also add a new copyright line due to major changes. Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Hao Li <hao.li@linux.dev> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-01-29 09:29:27 +01:00
Vlastimil Babka	46dea17444	slab: refill sheaves from all nodes __refill_objects() currently only attempts to get partial slabs from the local node and then allocates new slab(s). Expand it to trying also other nodes while observing the remote node defrag ratio, similarly to get_any_partial(). This will prevent allocating new slabs on a node while other nodes have many free slabs. It does mean sheaves will contain non-local objects in that case. Allocations that care about specific node will still be served appropriately, but might get a slowpath allocation. Like get_any_partial() we do observe cpuset_zone_allowed(), although we might be refilling a sheaf that will be then used from a different allocation context. We can also use the resulting refill_objects() in __kmem_cache_alloc_bulk() for non-debug caches. This means kmem_cache_alloc_bulk() will get better performance when sheaves are exhausted. kmem_cache_alloc_bulk() cannot indicate a preferred node so it's compatible with sheaves refill in preferring the local node. Its users also have gfp flags that allow spinning, so document that as a requirement. Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Hao Li <hao.li@linux.dev> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-01-29 09:29:27 +01:00
Vlastimil Babka	6c2f307f30	slab: remove unused PREEMPT_RT specific macros The macros slub_get_cpu_ptr()/slub_put_cpu_ptr() are now unused, remove them. USE_LOCKLESS_FAST_PATH() has lost its true meaning with the code being removed. The only remaining usage is in fact testing whether we can assert irqs disabled, because spin_lock_irqsave() only does that on !RT. Test for CONFIG_PREEMPT_RT instead. Reviewed-by: Hao Li <hao.li@linux.dev> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-01-29 09:29:27 +01:00
Vlastimil Babka	32c894c727	slab: remove struct kmem_cache_cpu The cpu slab is not used anymore for allocation or freeing, the remaining code is for flushing, but it's effectively dead. Remove the whole struct kmem_cache_cpu, the flushing code and other orphaned functions. The remaining used field of kmem_cache_cpu is the stat array with CONFIG_SLUB_STATS. Put it instead in a new struct kmem_cache_stats. In struct kmem_cache, the field is cpu_stats and placed near the end of the struct. Reviewed-by: Hao Li <hao.li@linux.dev> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-01-29 09:29:27 +01:00

1 2 3 4 5 ...

1416 Commits