mirror of
https://github.com/torvalds/linux.git
synced 2026-05-12 16:18:45 +02:00
bf0e022821
26365 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
811129272d |
slab fixes for 7.1-rc1
-----BEGIN PGP SIGNATURE----- iQFPBAABCAA5FiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmn3BZgbFIAAAAAABAAO bWFudTIsMi41KzEuMTIsMiwyAAoJELvgsHXSRYiabzcIAIDyPybWZ/bRup/KfWGE GknYLaUA3nw9ZpZQdQ0cJ+jGO6pfoUqacO0cZ/5ppdcXblKz22AFItUnpsd44M9H 92QjwNJoT6vjgzaSWCDE/6TeE0y27HjyhBlIYjs9mTRTb8sIH+mzfDS58FZponb/ RZMGmcJvS4La+VzaetAxzyx+cX7J9W/7zDfvz/qPMpm8tysGe4z8H7/54c5x/U+c NvuEXaxPaKEa62x5e+IFG6G67NszbJnfRrLGn/UMwp+SS5WUId6D9kmjtT70yJlg hbdU6zw6kI+DIPxRidnpI/iJqcA08r9UuC/TrZwcjaM6sythev5UvOxZXMRVp19w qXU= =yAe1 -----END PGP SIGNATURE----- Merge tag 'slab-for-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab fixes from Vlastimil Babka: - Stable fixes for CONFIG_SMP=n where _nolock() allocations in NMI both at kmalloc and page allocator levels are not properly protected by the spin_trylock() semantics on !SMP (Harry Yoo) * tag 'slab-for-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm/slab: return NULL early from kmalloc_nolock() in NMI on UP mm/page_alloc: return NULL early from alloc_frozen_pages_nolock() in NMI on UP |
||
|
|
99ebc509ee |
mm: memcontrol: fix rcu unbalance in get_non_dying_memcg_end()
Currently, get_non_dying_memcg_start() and get_non_dying_memcg_end() both
evaluate cgroup_subsys_on_dfl(memory_cgrp_subsys) independently to
determine whether to acquire or release the RCU read lock.
However, the result of cgroup_subsys_on_dfl() can change dynamically at
runtime due to cgroup hierarchy rebinding (e.g., when the memory
controller is moved between cgroup v1 and v2 hierarchies). This can cause
the following warning:
=====================================
WARNING: bad unlock balance detected!
7.0.0-next-20260420+ #83 Tainted: G W
-------------------------------------
memcg-repro/270 is trying to release lock (rcu_read_lock) at:
[<ffffffff815f57f7>] rcu_read_unlock+0x17/0x60
but there are no more locks to release!
other info that might help us debug this:
1 lock held by memcg-repro/270:
#0: ffff888102fa2088 (vm_lock){++++}-{0:0}, at: do_user_addr_fault+0x285/0x880
stack backtrace:
CPU: 0 UID: 0 PID: 270 Comm: memcg-repro Tainted: G W 7.0.0-next-20260420+ #
Tainted: [W]=WARN
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
Call Trace:
<TASK>
? rcu_read_unlock+0x17/0x60
dump_stack_lvl+0x77/0xb0
print_unlock_imbalance_bug+0xe0/0xf0
? rcu_read_unlock+0x17/0x60
lock_release+0x21d/0x2a0
rcu_read_unlock+0x1c/0x60
do_pte_missing+0x233/0xb40
__handle_mm_fault+0x80e/0xcd0
handle_mm_fault+0x146/0x310
do_user_addr_fault+0x303/0x880
exc_page_fault+0x9b/0x270
asm_exc_page_fault+0x26/0x30
RIP: 0033:0x5590e4eb41ea
Code: 61 cc 66 0f 6f e0 66 0f 61 c2 66 0f db cd 66 0f 69 e2 66 0f 6f d0 66 0f 69 d4 66 0f 61 0
RSP: 002b:00007ffcad25f030 EFLAGS: 00010202
RAX: 00005590e4eb8010 RBX: 00007ffcad260f7d RCX: 00007f73c474d44d
RDX: 00005590e4eb80a0 RSI: 00005590e4eb503c RDI: 000000000000000f
RBP: 00005590e4eb70a0 R08: 0000000000000000 R09: 00007f73c483a680
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffcad25f180 R14: 00005590e4eb6dd8 R15: 00007f73c4869020
</TASK>
------------[ cut here ]------------
Fix this by explicitly tracking the RCU lock state, ensuring that
rcu_read_unlock() in get_non_dying_memcg_end() is strictly paired with the
lock acquisition, regardless of any runtime rebinding events.
Link: https://lore.kernel.org/20260429073105.44472-1-qi.zheng@linux.dev
Fixes:
|
||
|
|
292411fda2 |
mm/userfaultfd: detect VMA type change after copy retry in mfill_copy_folio_retry()
mfill_copy_folio_retry() drops mmap_lock for the copy_from_user() call.
During this window, the VMA can be replaced with a different type (e.g.
hugetlb), making the caller's ops pointer stale. Subsequent use of the
stale ops would dispatch into the wrong per-vma handlers.
Capture the VMA's ops via vma_uffd_ops() before dropping the lock and
compare against the current vma_uffd_ops() after re-acquiring it.
Return -EAGAIN if they differ so the operation can be retried. This
avoids comparing against the caller's ops which may have been
overridden to anon_uffd_ops for MAP_PRIVATE file-backed mappings.
Link: https://lore.kernel.org/20260424183638.196227-1-devnexen@gmail.com
Fixes:
|
||
|
|
f98590bc08 |
mm/damon/stat: detect and use fresh enabled value
DAMON_STAT updates 'enabled' parameter value, which represents the running
status of its kdamond, when the user explicitly requests start/stop of the
kdamond. The kdamond can, however, be stopped even if the user explicitly
requested the stop, if ctx->regions_score_histogram allocation failure at
beginning of the execution of the kdamond. Hence, if the kdamond is
stopped by the allocation failure, the value of the parameter can be
stale.
Users could show the stale value and be confused. The problem will only
rarely happen in real and common setups because the allocation is arguably
too small to fail. Also, unlike the similar bugs that are now fixed in
DAMON_RECLAIM and DAMON_LRU_SORT, kdamond can be restarted in this case,
because DAMON_STAT force-updates the enabled parameter value for user
inputs. The bug is a bug, though.
The issue stems from the fact that there are multiple events that can
change the status, and following all the events is challenging.
Dynamically detect and use the fresh status for the parameters when those
are requested.
The issue was dicovered [1] by Sashiko.
Link: https://lore.kernel.org/20260419161003.79176-4-sj@kernel.org
Link: https://lore.kernel.org/20260416040602.88665-1-sj@kernel.org [1]
Fixes:
|
||
|
|
b98b7ff602 |
mm/damon/lru_sort: detect and use fresh enabled and kdamond_pid values
DAMON_LRU_SORT updates 'enabled' and 'kdamond_pid' parameter values, which
represents the running status of its kdamond, when the user explicitly
requests start/stop of the kdamond. The kdamond can, however, be stopped
in events other than the explicit user request in the following three
events.
1. ctx->regions_score_histogram allocation failure at beginning of the
execution,
2. damon_commit_ctx() failure due to invalid user input, and
3. damon_commit_ctx() failure due to its internal allocation failures.
Hence, if the kdamond is stopped by the above three events, the values of
the status parameters can be stale. Users could show the stale values and
be confused. This is already bad, but the real consequence is worse.
DAMON_LRU_SORT avoids unnecessary damon_start() and damon_stop() calls
based on the 'enabled' parameter value. And the update of 'enabled'
parameter value depends on the damon_start() and damon_stop() call
results. Hence, once the kdamond has stopped by the unintentional events,
the user cannot restart the kdamond before the system reboot. For
example, the issue can be reproduced via below steps.
# cd /sys/module/damon_lru_sort/parameters
#
# # start DAMON_LRU_SORT
# echo Y > enabled
# ps -ef | grep kdamond
root 806 2 0 17:53 ? 00:00:00 [kdamond.0]
root 808 803 0 17:53 pts/4 00:00:00 grep kdamond
#
# # commit wrong input to stop kdamond withou explicit stop request
# echo 3 > addr_unit
# echo Y > commit_inputs
bash: echo: write error: Invalid argument
#
# # confirm kdamond is stopped
# ps -ef | grep kdamond
root 811 803 0 17:53 pts/4 00:00:00 grep kdamond
#
# # users casn now show stable status
# cat enabled
Y
# cat kdamond_pid
806
#
# # even after fixing the wrong parameter,
# # kdamond cannot be restarted.
# echo 1 > addr_unit
# echo Y > enabled
# ps -ef | grep kdamond
root 815 803 0 17:54 pts/4 00:00:00 grep kdamond
The problem will only rarely happen in real and common setups for the
following reasons. The allocation failures are unlikely in such setups
since those allocations are arguably too small to fail. Also sane users
on real production environments may not commit wrong input parameters.
But once it happens, the consequence is quite bad. And the bug is a bug.
The issue stems from the fact that there are multiple events that can
change the status, and following all the events is challenging.
Dynamically detect and use the fresh status for the parameters when those
are requested.
Link: https://lore.kernel.org/20260419161003.79176-3-sj@kernel.org
Fixes:
|
||
|
|
64a140afa5 |
mm/damon/reclaim: detect and use fresh enabled and kdamond_pid values
Patch series "mm/damon/modules: detect and use fresh status", v3.
DAMON modules including DAMON_RECLAIM, DAMON_LRU_SORT and DAMON_STAT
commonly expose the kdamond running status via their parameters. Under
certain scenarios including wrong user inputs and memory allocation
failures, those parameter values can be stale. It can confuse users. For
DAMON_RECLAIM and DAMON_LRU_SORT, it even makes the kdamond unable to be
restarted before the system reboot.
The problem comes from the fact that there are multiple events for the
status changes and it is difficult to follow up all the scenarios. Fix
the issue by detecting and using the status on demand, instead of using a
cached status that is difficult to be updated.
Patches 1-3 fix the bugs in DAMON_RECLAIM, DAMON_LRU_SORT and DAMON_STAT
in the order.
This patch (of 3):
DAMON_RECLAIM updates 'enabled' and 'kdamond_pid' parameter values, which
represents the running status of its kdamond, when the user explicitly
requests start/stop of the kdamond. The kdamond can, however, be stopped
in events other than the explicit user request in the following three
events.
1. ctx->regions_score_histogram allocation failure at beginning of the
execution,
2. damon_commit_ctx() failure due to invalid user input, and
3. damon_commit_ctx() failure due to its internal allocation failures.
Hence, if the kdamond is stopped by the above three events, the values of
the status parameters can be stale. Users could show the stale values and
be confused. This is already bad, but the real consequence is worse.
DAMON_RECLAIM avoids unnecessary damon_start() and damon_stop() calls
based on the 'enabled' parameter value. And the update of 'enabled'
parameter value depends on the damon_start() and damon_stop() call
results. Hence, once the kdamond has stopped by the unintentional events,
the user cannot restart the kdamond before the system reboot. For
example, the issue can be reproduced via below steps.
# cd /sys/module/damon_reclaim/parameters
#
# # start DAMON_RECLAIM
# echo Y > enabled
# ps -ef | grep kdamond
root 806 2 0 17:53 ? 00:00:00 [kdamond.0]
root 808 803 0 17:53 pts/4 00:00:00 grep kdamond
#
# # commit wrong input to stop kdamond withou explicit stop request
# echo 3 > addr_unit
# echo Y > commit_inputs
bash: echo: write error: Invalid argument
#
# # confirm kdamond is stopped
# ps -ef | grep kdamond
root 811 803 0 17:53 pts/4 00:00:00 grep kdamond
#
# # users casn now show stable status
# cat enabled
Y
# cat kdamond_pid
806
#
# # even after fixing the wrong parameter,
# # kdamond cannot be restarted.
# echo 1 > addr_unit
# echo Y > enabled
# ps -ef | grep kdamond
root 815 803 0 17:54 pts/4 00:00:00 grep kdamond
The problem will only rarely happen in real and common setups for the
following reasons. The allocation failures are unlikely in such setups
since those allocations are arguably too small to fail. Also sane users
on real production environments may not commit wrong input parameters.
But once it happens, the consequence is quite bad. And the bug is a bug.
The issue stems from the fact that there are multiple events that can
change the status, and following all the events is challenging.
Dynamically detect and use the fresh status for the parameters when those
are requested.
Link: https://lore.kernel.org/20260419161003.79176-1-sj@kernel.org
Link: https://lore.kernel.org/20260419161003.79176-2-sj@kernel.org
Fixes:
|
||
|
|
cf3b71421c |
mm/damon/sysfs-schemes: protect path kfree() with damon_sysfs_lock
damon_sysfs_quot_goal->path can be read and written by users, via DAMON
sysfs 'path' file. It can also be indirectly read, for the parameters
{on,off}line committing to DAMON. The reads for parameters committing are
protected by damon_sysfs_lock to avoid the sysfs files being destroyed
while any of the parameters are being read. But the user-driven direct
reads and writes are not protected by any lock, while the write is
deallocating the path-pointing buffer. As a result, the readers could
read the already freed buffer (user-after-free). Note that the user-reads
don't race when the same open file is used by the writer, due to kernfs's
open file locking. Nonetheless, doing the reads and writes with separate
open files would be common. Fix it by protecting both the user-direct
reads and writes with damon_sysfs_lock.
Link: https://lore.kernel.org/20260423150253.111520-3-sj@kernel.org
Fixes:
|
||
|
|
1e68eb96e8 |
mm/damon/sysfs-schemes: protect memcg_path kfree() with damon_sysfs_lock
Patch series "mm/damon/sysfs-schemes: fix use-after-free for [memcg_]path".
Reads of 'memcg_path' and 'path' files in DAMON sysfs interface could race
with their writes, results in use-after-free. Fix those.
This patch (of 2):
damon_sysfs_scheme_filter->mmecg_path can be read and written by users,
via DAMON sysfs memcg_path file. It can also be indirectly read, for the
parameters {on,off}line committing to DAMON. The reads for parameters
committing are protected by damon_sysfs_lock to avoid the sysfs files
being destroyed while any of the parameters are being read. But the
user-driven direct reads and writes are not protected by any lock, while
the write is deallocating the memcg_path-pointing buffer. As a result,
the readers could read the already freed buffer (user-after-free). Note
that the user-reads don't race when the same open file is used by the
writer, due to kernfs's open file locking. Nonetheless, doing the reads
and writes with separate open files would be common. Fix it by protecting
both the user-direct reads and writes with damon_sysfs_lock.
Link: https://lore.kernel.org/20260423150253.111520-1-sj@kernel.org
Link: https://lore.kernel.org/20260423150253.111520-2-sj@kernel.org
Fixes:
|
||
|
|
8f5ce56b76 |
mm/hugetlb_cma: round up per_node before logging it
When the user requests a total hugetlb CMA size without per-node
specification, hugetlb_cma_reserve() computes per_node from
hugetlb_cma_size and the number of nodes that have memory
per_node = DIV_ROUND_UP(hugetlb_cma_size,
nodes_weight(hugetlb_bootmem_nodes));
The reservation loop later computes
size = round_up(min(per_node, hugetlb_cma_size - reserved),
PAGE_SIZE << order);
So the actually reserved per_node size is multiple of (PAGE_SIZE <<
order), but the logged per_node is not rounded up, so it may be smaller
than the actual reserved size.
For example, as the existing comment describes, if a 3 GB area is
requested on a machine with 4 NUMA nodes that have memory, 1 GB is
allocated on the first three nodes, but the printed log is
hugetlb_cma: reserve 3072 MiB, up to 768 MiB per node
Round per_node up to (PAGE_SIZE << order) before logging so that the
printed log always matches the actual reserved size. No functional change
to the actual reservation size, as the following case analysis shows
1. remaining (hugetlb_cma_size - reserved) >= rounded per_node
- AS-IS: min() picks unrounded per_node;
round_up() returns rounded per_node
- TO-BE: min() picks rounded per_node;
round_up() returns rounded per_node (no-op)
2. remaining < unrounded per_node
- AS-IS: min() picks remaining;
round_up() returns round_up(remaining)
- TO-BE: min() picks remaining;
round_up() returns round_up(remaining)
3. unrounded per_node <= remaining < rounded per_node
- AS-IS: min() picks unrounded per_node;
round_up() returns rounded per_node
- TO-BE: min() picks remaining;
round_up() returns round_up(remaining) equals rounded per_node
Link: https://lore.kernel.org/20260422143353.852257-1-ekffu200098@gmail.com
Fixes:
|
||
|
|
619eab23e1 |
mm/vma: do not try to unmap a VMA if mmap_prepare() invoked from mmap()
The mmap_prepare hook functionality includes the ability to invoke
mmap_prepare() from the mmap() hook of existing 'stacked' drivers, that is
ones which are capable of calling the mmap hooks of other drivers/file
systems (e.g. overlayfs, shm).
As part of the mmap_prepare action functionality, we deal with errors by
unmapping the VMA should one arise. This works in the usual mmap_prepare
case, as we invoke this action at the last moment, when the VMA is
established in the maple tree.
However, the mmap() hook passes a not-fully-established VMA pointer to the
caller (which is the motivation behind the mmap_prepare() work), which is
detached.
So attempting to unmap a VMA in this state will be problematic, with the
most obvious symptom being a warning in vma_mark_detached(), because the
VMA is already detached.
It's also unncessary - the mmap() handler will clean up the VMA on error.
So to fix this issue, this patch propagates whether or not an mmap action
is being completed via the compatibility layer or directly.
If the former, then we do not attempt VMA cleanup, if the latter, then we
do.
This patch also updates the userland VMA tests to reflect the change.
Link: https://lore.kernel.org/20260421102150.189982-1-ljs@kernel.org
Fixes:
|
||
|
|
0437906841 |
mm: start background writeback based on per-wb threshold for strictlimit BDIs
The proactive nr_dirty > gdtc->bg_thresh check in balance_dirty_pages()
only checks the global dirty threshold to start background writeback while
the writer is still free-running, but for strictlimit BDIs (eg fuse), the
per-wb dirty count can exceed the per-wb background threshold while the
global threshold is not yet exceeded, so background writeback for this
case never gets proactively started.
Add a per-wb threshold check for strictlimit BDIs so that background
writeback is started when wb_dirty exceeds wb_bg_thresh, which drains
dirty pages before the writer hits the throttle wall, matching the
proactive behavior that the global check provides for non-strictlimit
BDIs.
fio runs on fuse show about a 3-4% improvement in perf for buffered
writes:
fio --name=writeback_test --ioengine=psync --rw=write --bs=128k \
--size=2G --numjobs=4 --ramp_time=10 --runtime=20 \
--time_based --group_reporting=1 --direct=0
Link: https://lore.kernel.org/20260326234629.840938-2-joannelkoong@gmail.com
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
82d1f01292 |
vmalloc: fix buffer overflow in vrealloc_node_align()
Commit |
||
|
|
5b31044e64 |
mm/slab: return NULL early from kmalloc_nolock() in NMI on UP
On UP kernels (!CONFIG_SMP), spin_trylock() is a no-op that
unconditionally succeeds even when the lock is already held. As a
result, kmalloc_nolock() called from NMI context can re-enter the slab
allocator and acquire n->list_lock that the interrupted context is
already holding, corrupting slab state.
With CONFIG_DEBUG_SPINLOCK on UP, the following BUG is triggered with
the slub_kunit test module:
BUG: spinlock trylock failure on UP on CPU#0, kunit_try_catch/243
[...]
Call Trace:
<NMI>
dump_stack_lvl+0x3f/0x60
do_raw_spin_trylock+0x41/0x50
_raw_spin_trylock+0x24/0x50
get_from_partial_node+0x120/0x4d0
___slab_alloc+0x8a/0x4c0
kmalloc_nolock_noprof+0x164/0x310
[...]
</NMI>
Fix this by returning NULL early when invoked from NMI on a UP kernel.
Link: https://lore.kernel.org/linux-mm/ad_cqe51pvr1WaDg@hyeyoo
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
620b46ed6a |
mm/page_alloc: return NULL early from alloc_frozen_pages_nolock() in NMI on UP
On UP kernels (!CONFIG_SMP), spin_trylock() is a no-op that
unconditionally succeeds even when the lock is already held. As a
result, alloc_frozen_pages_nolock() called from NMI context can
re-enter rmqueue() and acquire the zone lock that the interrupted
context is already holding, corrupting the freelists.
With CONFIG_DEBUG_SPINLOCK on UP, the following BUG is triggered with
the slub_kunit test module:
BUG: spinlock trylock failure on UP on CPU#0, kunit_try_catch/243
[...]
Call Trace:
<NMI>
dump_stack_lvl+0x3f/0x60
do_raw_spin_trylock+0x41/0x50
_raw_spin_trylock+0x24/0x50
rmqueue.isra.0+0x2a9/0xa70
get_page_from_freelist+0xeb/0x450
alloc_frozen_pages_nolock_noprof+0x111/0x1e0
allocate_slab+0x42a/0x500
___slab_alloc+0xa7/0x4c0
kmalloc_nolock_noprof+0x164/0x310
[...]
</NMI>
Fix this by returning NULL early when invoked from NMI on a UP kernel.
Link: https://lore.kernel.org/linux-mm/ad_cqe51pvr1WaDg@hyeyoo
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
82138f0183 |
slab fix for 7.1
-----BEGIN PGP SIGNATURE----- iQFPBAABCAA5FiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmnrPncbFIAAAAAABAAO bWFudTIsMi41KzEuMTIsMiwyAAoJELvgsHXSRYiaj6gH/jr11AbZbxCd7z1DzwHx blRszXNBGjoMYPbxENG+xostFgccnJVZw5tAeGo8AZoTN2PGrtk0pVOGq+H8ShSd qoQAxx1+wggv2qfQd2qwFmOFueoQBpIlR1kpVF7YFjFz0Z8q2NNqhzNhjIMWiyRI 4qPHcB5GojgXb+khQG25qQ5Hed8D+D+fEelNrNF3lLd8dK5mbqD4VFpf4lnLyIAQ TWf46PhuqQSeYAHMr1j5J+vW2lNeEY5ps/CmAS6DzUt4pl0JnubKT1WPEEUdOjT1 HBTN761tDMM9W5NyqSRLJYqStVbBOFEaJ5ZSulhy8cHreyY8nLN4G1smU/8dGRXc htE= =NYjb -----END PGP SIGNATURE----- Merge tag 'slab-for-7.1-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab fix from Vlastimil Babka: - A stable fix for k(v)ealloc() where reallocating on a different node or shrinking the object can result in either losing the original data or a buffer overflow (Marco Elver) * tag 'slab-for-7.1-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: slub: fix data loss and overflow in krealloc() |
||
|
|
2a4c0c11c0 |
s390 updates for 7.1 merge window
- Add support for CONFIG_PAGE_TABLE_CHECK and enable it in debug_defconfig. s390 can only tell user from kernel PTEs via the mm, so mm_struct is now passed into pxx_user_accessible_page() callbacks - Expose the PCI function UID as an arch-specific slot attribute in sysfs so a function can be identified by its user-defined id while still in standby. Introduces a generic ARCH_PCI_SLOT_GROUPS hook in drivers/pci/slot.c - Refresh s390 PCI documentation to reflect current behavior and cover previously undocumented sysfs attributes - zcrypt device driver cleanup series: consistent field types, clearer variable naming, a kernel-doc warning fix, and a comment explaining the intentional synchronize_rcu() in pkey_handler_register() - Provide an s390 arch_raw_cpu_ptr() that avoids the detour via get_lowcore() using alternatives, shrinking defconfig by ~27 kB - Guard identity-base randomization with kaslr_enabled() so nokaslr keeps the identity mapping at 0 even with CONFIG_RANDOMIZE_IDENTITY_BASE=y - Build S390_MODULES_SANITY_TEST as a module only by requiring KUNIT && m, since built-in would not exercise module loading - Remove the permanently commented-out HMCDRV_DEV_CLASS create_class() code in the hmcdrv driver - Drop stale ident_map_size extern conflicting with asm/page.h -----BEGIN PGP SIGNATURE----- iQEzBAABCgAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAmno78kACgkQjYWKoQLX FBiHbggAmW5hPIDf4F8HLomMREaaQb7QAyYwfeefwhcFUXSMu8td8S68aN4UkOnS DGSFjb+V6Nqd+ewrF7IS9pRU9YFsmBqo3MnLdcJ/ojZFz8BlwoAi+E4AD1a38hY2 9zh2siPBMjydqBRUn6zjsK8auk4e8r44iS5MNNMXDF2ePE/PnPKTm93GhbtnnM6r a7mQkiPbi6j0sN/UU+pQkhS4fm2XNaGpCGGX0W0v2RdLIYZ9zQQdg4TaEsjQ5wZA OC3P8LG3OyJjnxsY2J8PIKK0VM0JP67KUGnQOi1y8HbN1LkFfAWF6CK7tsyUE/JM TYg7ENs2mUMmaa8niOGkiXzjjAxD0g== =NpmP -----END PGP SIGNATURE----- Merge tag 's390-7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Vasily Gorbik: - Add support for CONFIG_PAGE_TABLE_CHECK and enable it in debug_defconfig. s390 can only tell user from kernel PTEs via the mm, so mm_struct is now passed into pxx_user_accessible_page() callbacks - Expose the PCI function UID as an arch-specific slot attribute in sysfs so a function can be identified by its user-defined id while still in standby. Introduces a generic ARCH_PCI_SLOT_GROUPS hook in drivers/pci/slot.c - Refresh s390 PCI documentation to reflect current behavior and cover previously undocumented sysfs attributes - zcrypt device driver cleanup series: consistent field types, clearer variable naming, a kernel-doc warning fix, and a comment explaining the intentional synchronize_rcu() in pkey_handler_register() - Provide an s390 arch_raw_cpu_ptr() that avoids the detour via get_lowcore() using alternatives, shrinking defconfig by ~27 kB - Guard identity-base randomization with kaslr_enabled() so nokaslr keeps the identity mapping at 0 even with RANDOMIZE_IDENTITY_BASE=y - Build S390_MODULES_SANITY_TEST as a module only by requiring KUNIT && m, since built-in would not exercise module loading - Remove the permanently commented-out HMCDRV_DEV_CLASS create_class() code in the hmcdrv driver - Drop stale ident_map_size extern conflicting with asm/page.h * tag 's390-7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/zcrypt: Fix warning about wrong kernel doc comment PCI: s390: Expose the UID as an arch specific PCI slot attribute docs: s390/pci: Improve and update PCI documentation s390/pkey: Add comment about synchronize_rcu() to pkey base s390/hmcdrv: Remove commented out code s390/zcrypt: Slight rework on the agent_id field s390/zcrypt: Explicitly use a card variable in _zcrypt_send_cprb s390/zcrypt: Rework MKVP fields and handling s390/zcrypt: Make apfs a real unsigned int field s390/zcrypt: Rework domain processing within zcrypt device driver s390/zcrypt: Move inline function rng_type6cprb_msgx from header to code s390/percpu: Provide arch_raw_cpu_ptr() s390: Enable page table check for debug_defconfig s390/pgtable: Add s390 support for page table check s390/pgtable: Use set_pmd_bit() to invalidate PMD entry mm/page_table_check: Pass mm_struct to pxx_user_accessible_page() s390/boot: Respect kaslr_enabled() for identity randomization s390/Kconfig: Make modules sanity test a module-only option s390/setup: Drop stale ident_map_size declaration |
||
|
|
065c4e67cc |
Mostly cleanups and small things, notably:
- musl libc compatibility - vDSO installation fix - TLB sync race fix for recent SMP support - build fix for 32-bit with Clang 20/21 -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmnmKS4ACgkQ10qiO8sP aAD2+w//dOOblgUYgQJUXIxHpS7Gcb3Tm+a7ujC23q/kWf/pc8milCSf+zoxzUXL 23Vwh4Gt4KrHKp8lG1gU3xZqV0qwhXNi5HO2hMpB0ioIVpX3TcrUhFbp/Oirvhgi 3PvnvsFtUlW82DFgewB98tefXZSAlG/pg+RjQ3weHfEo+xQbjYc+kR8o59tN8LNR Ea4rrxyjsr3KN2yBNaFpDkMchudP6XWgKByAZBxZ2FofC3zuVRCyF8ThDfQl/3/W muSqX+2iuKjGpmxV0XWt72hYOhNYjBtDY7f4EPe6sbUy+PU6SjD9h/s7VTyVHgZR 3Sii9AQLLJNYPoglExMfmWfeUnJCUJNNTLUze+ZtnhURZQYTvyJRzVmKj6fDPjK2 jGEKXanfZCK9Cfgy2f2xbQxCxhAVwz6QT0XaQO2dZBXa0anzG+2HM0Zn8MNa9jbU +Lm11k1jd1QBifr+5zeni98KHt2mf77blCny8TraODgLNgWUVi5kMkPF4bZgD4Qj udMU9lOkTD08R89hG/Le9TsB+NIpPauyNxDHUpC/VDterFdZqFvmOFT6afTo/4RZ nXNVdL1tn+7O7v0bLdbyhXwj2her1GDbe6HZ5eTNqmjcOthcgI3gF2stDfFhEbNb /wMHnpGPncMeEI8YWtWOFA4FA5T32+LafLCKhuRJdaw0+f/NMOo= =oovZ -----END PGP SIGNATURE----- Merge tag 'uml-for-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux Pull uml updates from Johannes Berg: "Mostly cleanups and small things, notably: - musl libc compatibility - vDSO installation fix - TLB sync race fix for recent SMP support - build fix for 32-bit with Clang 20/21" * tag 'uml-for-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux: um: Disable GCOV_PROFILE_ALL on 32-bit UML with Clang 20/21 um: drivers: call kernel_strrchr() explicitly in cow_user.c um: Replace strncpy() with strnlen()+memcpy_and_pad() in strncpy_chunk_from_user() x86/um: fix vDSO installation um: Remove CONFIG_FRAME_WARN from x86_64_defconfig um: Fix pte_read() and pte_exec() for kernel mappings um: Fix potential race condition in TLB sync um: time-travel: clean up kernel-doc warnings um: avoid struct sigcontext redefinition with musl um: fix address-of CMSG_DATA() rvalue in stub |
||
|
|
c1f49dea2b |
7 hotfixes. 6 are cc:stable and all are for MM. Please see the
individual changelogs for details. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaeSA8wAKCRDdBJ7gKXxA jlKAAP4z9SQH3J71BVZK0pLi/9a/ISEZyJ9zIpsNi1sQ3shO/gD7BoUTQkepX5dk dv1YBhroYz981dgIIV3kXpIAEYnoeQc= =Os+z -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2026-04-19-00-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM fixes from Andrew Morton: "7 hotfixes. 6 are cc:stable and all are for MM. Please see the individual changelogs for details" * tag 'mm-hotfixes-stable-2026-04-19-00-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/damon/core: disallow non-power of two min_region_sz on damon_start() mm/vmalloc: take vmap_purge_lock in shrinker mm: call ->free_folio() directly in folio_unmap_invalidate() mm: blk-cgroup: fix use-after-free in cgwb_release_workfn() mm/zone_device: do not touch device folio after calling ->folio_free() mm/damon/core: disallow time-quota setting zero esz mm/mempolicy: fix weighted interleave auto sysfs name |
||
|
|
40735a683b |
mm.git review status for linus..mm-stable
Everything: Total patches: 121 Reviews/patch: 2.11 Reviewed rate: 90% Excluding DAMON: Total patches: 113 Reviews/patch: 2.25 Reviewed rate: 96% - The 33 patch series "Eliminate Dying Memory Cgroup" from Qi Zheng and Muchun Song addresses the longstanding "dying memcg problem". A situation wherein a no-longer-used memory control group will hang around for an extended period pointlessly consuming memory. The [0/N] changelog has a good overview of this work. - The 3 patch series "fix unexpected type conversions and potential overflows" from Qi Zheng fixes a couple of potential 32-bit/64-bit issues which were identified during review of the "Eliminate Dying Memory Cgroup" series. - The 6 patch series "kho: history: track previous kernel version and kexec boot count" from Breno Leitao uses Kexec Handover (KHO) to pass the previous kernel's version string and the number of kexec reboots since the last cold boot to the next kernel, and prints it at boot time. - The 4 patch series "liveupdate: prevent double preservation" from Pasha Tatashin teaches LUO to avoid managing the same file across different active sessions. - The 10 patch series "liveupdate: Fix module unloading and unregister API" from Pasha Tatashin addresses an issue with how LUO handles module reference counting and unregistration during module unloading. - The 2 patch series "zswap pool per-CPU acomp_ctx simplifications" from Kanchana Sridhar simplifies and cleans up the zswap crypto compression handling and improves the lifecycle management of zswap pool's per-CPU acomp_ctx resources. - The 2 patch series "mm/damon/core: fix damon_call()/damos_walk() vs kdmond exit race" from SeongJae Park addresses unlikely but possible leaks and deadlocks in damon_call() and damon_walk(). - The 2 patch series "mm/damon/core: validate damos_quota_goal->nid" from SeongJae Park fixes a couple of root-only wild pointer dereferences. - The 2 patch series "Docs/admin-guide/mm/damon: warn commit_inputs vs other params race" from SeongJae Park updates the DAMON documentation to warn operators about potential races which can occur if the commit_inputs parameter is altered at the wrong time. - The 3 patch series "Minor hmm_test fixes and cleanups" from Alistair Popple implements two bugfixes a cleanup for the HMM kernel selftests. - The 6 patch series "Modify memfd_luo code" from Chenghao Duan provides cleanups, simplifications and speedups in the memfd_lou code. - The 4 patch series "mm, kvm: allow uffd support in guest_memfd" from Mike Rapoport enables support for userfaultfd in guest_memfd. - The 6 patch series "selftests/mm: skip several tests when thp is not available" from Chunyu Hu fixes several issues in the selftests code which were causing breakage when the tests were run on CONFIG_THP=n kernels. - The 2 patch series "mm/mprotect: micro-optimization work" from Pedro Falcato implements a couple of nice speedups for mprotect(). - The 3 patch series "MAINTAINERS: update KHO and LIVE UPDATE entries" from Pratyush Yadav reflects upcoming changes in the maintenance of KHO, LUO, memfd_luo, kexec, crash, kdump and probably other kexec-based things - they are being moved out of mm.git and into a new git tree. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaeNL/wAKCRDdBJ7gKXxA jt7EAQCEEQvYYTjld+8HJKsCbavY4pEfci7z4SBiQyIPjRracQD/ZfjXnzL7ucc1 b6q6G4TcslvIDBgzVkk9G2BVn2oCoAg= =3ozv -----END PGP SIGNATURE----- Merge tag 'mm-stable-2026-04-18-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: - "Eliminate Dying Memory Cgroup" (Qi Zheng and Muchun Song) Address the longstanding "dying memcg problem". A situation wherein a no-longer-used memory control group will hang around for an extended period pointlessly consuming memory - "fix unexpected type conversions and potential overflows" (Qi Zheng) Fix a couple of potential 32-bit/64-bit issues which were identified during review of the "Eliminate Dying Memory Cgroup" series - "kho: history: track previous kernel version and kexec boot count" (Breno Leitao) Use Kexec Handover (KHO) to pass the previous kernel's version string and the number of kexec reboots since the last cold boot to the next kernel, and print it at boot time - "liveupdate: prevent double preservation" (Pasha Tatashin) Teach LUO to avoid managing the same file across different active sessions - "liveupdate: Fix module unloading and unregister API" (Pasha Tatashin) Address an issue with how LUO handles module reference counting and unregistration during module unloading - "zswap pool per-CPU acomp_ctx simplifications" (Kanchana Sridhar) Simplify and clean up the zswap crypto compression handling and improve the lifecycle management of zswap pool's per-CPU acomp_ctx resources - "mm/damon/core: fix damon_call()/damos_walk() vs kdmond exit race" (SeongJae Park) Address unlikely but possible leaks and deadlocks in damon_call() and damon_walk() - "mm/damon/core: validate damos_quota_goal->nid" (SeongJae Park) Fix a couple of root-only wild pointer dereferences - "Docs/admin-guide/mm/damon: warn commit_inputs vs other params race" (SeongJae Park) Update the DAMON documentation to warn operators about potential races which can occur if the commit_inputs parameter is altered at the wrong time - "Minor hmm_test fixes and cleanups" (Alistair Popple) Bugfixes and a cleanup for the HMM kernel selftests - "Modify memfd_luo code" (Chenghao Duan) Cleanups, simplifications and speedups to the memfd_lou code - "mm, kvm: allow uffd support in guest_memfd" (Mike Rapoport) Support for userfaultfd in guest_memfd - "selftests/mm: skip several tests when thp is not available" (Chunyu Hu) Fix several issues in the selftests code which were causing breakage when the tests were run on CONFIG_THP=n kernels - "mm/mprotect: micro-optimization work" (Pedro Falcato) A couple of nice speedups for mprotect() - "MAINTAINERS: update KHO and LIVE UPDATE entries" (Pratyush Yadav) Document upcoming changes in the maintenance of KHO, LUO, memfd_luo, kexec, crash, kdump and probably other kexec-based things - they are being moved out of mm.git and into a new git tree * tag 'mm-stable-2026-04-18-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (121 commits) MAINTAINERS: add page cache reviewer mm/vmscan: avoid false-positive -Wuninitialized warning MAINTAINERS: update Dave's kdump reviewer email address MAINTAINERS: drop include/linux/liveupdate from LIVE UPDATE MAINTAINERS: drop include/linux/kho/abi/ from KHO MAINTAINERS: update KHO and LIVE UPDATE maintainers MAINTAINERS: update kexec/kdump maintainers entries mm/migrate_device: remove dead migration entry check in migrate_vma_collect_huge_pmd() selftests: mm: skip charge_reserved_hugetlb without killall userfaultfd: allow registration of ranges below mmap_min_addr mm/vmstat: fix vmstat_shepherd double-scheduling vmstat_update mm/hugetlb: fix early boot crash on parameters without '=' separator zram: reject unrecognized type= values in recompress_store() docs: proc: document ProtectionKey in smaps mm/mprotect: special-case small folios when applying permissions mm/mprotect: move softleaf code out of the main function mm: remove '!root_reclaim' checking in should_abort_scan() mm/sparse: fix comment for section map alignment mm/page_io: use sio->len for PSWPIN accounting in sio_read_complete() selftests/mm: transhuge_stress: skip the test when thp not available ... |
||
|
|
95093e5cb4 |
mm/damon/core: disallow non-power of two min_region_sz on damon_start()
Commit |
||
|
|
ec05f51f1e |
mm/vmalloc: take vmap_purge_lock in shrinker
decay_va_pool_node() can be invoked concurrently from two paths:
__purge_vmap_area_lazy() when pools are being purged, and the shrinker via
vmap_node_shrink_scan().
However, decay_va_pool_node() is not safe to run concurrently, and the
shrinker path currently lacks serialization, leading to races and possible
leaks.
Protect decay_va_pool_node() by taking vmap_purge_lock in the shrinker
path to ensure serialization with purge users.
Link: https://lore.kernel.org/20260413192646.14683-1-urezki@gmail.com
Fixes:
|
||
|
|
615d9bb2cc |
mm: call ->free_folio() directly in folio_unmap_invalidate()
We can only call filemap_free_folio() if we have a reference to (or hold a
lock on) the mapping. Otherwise, we've already removed the folio from the
mapping so it no longer pins the mapping and the mapping can be removed,
causing a use-after-free when accessing mapping->a_ops.
Follow the same pattern as __remove_mapping() and load the free_folio
function pointer before dropping the lock on the mapping. That lets us
make filemap_free_folio() static as this was the only caller outside
filemap.c.
Link: https://lore.kernel.org/20260413184314.3419945-1-willy@infradead.org
Fixes:
|
||
|
|
8f5857be99 |
mm: blk-cgroup: fix use-after-free in cgwb_release_workfn()
cgwb_release_workfn() calls css_put(wb->blkcg_css) and then later accesses
wb->blkcg_css again via blkcg_unpin_online(). If css_put() drops the last
reference, the blkcg can be freed asynchronously (css_free_rwork_fn ->
blkcg_css_free -> kfree) before blkcg_unpin_online() dereferences the
pointer to access blkcg->online_pin, resulting in a use-after-free:
BUG: KASAN: slab-use-after-free in blkcg_unpin_online (./include/linux/instrumented.h:112 ./include/linux/atomic/atomic-instrumented.h:400 ./include/linux/refcount.h:389 ./include/linux/refcount.h:432 ./include/linux/refcount.h:450 block/blk-cgroup.c:1367)
Write of size 4 at addr ff11000117aa6160 by task kworker/71:1/531
Workqueue: cgwb_release cgwb_release_workfn
Call Trace:
<TASK>
blkcg_unpin_online (./include/linux/instrumented.h:112 ./include/linux/atomic/atomic-instrumented.h:400 ./include/linux/refcount.h:389 ./include/linux/refcount.h:432 ./include/linux/refcount.h:450 block/blk-cgroup.c:1367)
cgwb_release_workfn (mm/backing-dev.c:629)
process_scheduled_works (kernel/workqueue.c:3278 kernel/workqueue.c:3385)
Freed by task 1016:
kfree (./include/linux/kasan.h:235 mm/slub.c:2689 mm/slub.c:6246 mm/slub.c:6561)
css_free_rwork_fn (kernel/cgroup/cgroup.c:5542)
process_scheduled_works (kernel/workqueue.c:3302 kernel/workqueue.c:3385)
** Stack based on commit 66672af7a095 ("Add linux-next specific files
for 20260410")
I am seeing this crash sporadically in Meta fleet across multiple kernel
versions. A full reproducer is available at:
https://github.com/leitao/debug/blob/main/reproducers/repro_blkcg_uaf.sh
(The race window is narrow. To make it easily reproducible, inject a
msleep(100) between css_put() and blkcg_unpin_online() in
cgwb_release_workfn(). With that delay and a KASAN-enabled kernel, the
reproducer triggers the splat reliably in less than a second.)
Fix this by moving blkcg_unpin_online() before css_put(), so the
cgwb's CSS reference keeps the blkcg alive while blkcg_unpin_online()
accesses it.
Link: https://lore.kernel.org/20260413-blkcg-v1-1-35b72622d16c@debian.org
Fixes:
|
||
|
|
3992898495 |
mm/zone_device: do not touch device folio after calling ->folio_free()
The contents of a device folio can immediately change after calling
->folio_free(), as the folio may be reallocated by a driver with a
different order. Instead of touching the folio again to extract the
pgmap, use the local stack variable when calling percpu_ref_put_many().
Link: https://lore.kernel.org/20260410230346.4009855-1-matthew.brost@intel.com
Fixes:
|
||
|
|
8bbde987c2 |
mm/damon/core: disallow time-quota setting zero esz
When the throughput of a DAMOS scheme is very slow, DAMOS time quota can
make the effective size quota smaller than damon_ctx->min_region_sz. In
the case, damos_apply_scheme() will skip applying the action, because the
action is tried at region level, which requires >=min_region_sz size.
That is, the quota is effectively exceeded for the quota charge window.
Because no action will be applied, the total_charged_sz and
total_charged_ns are also not updated. damos_set_effective_quota() will
try to update the effective size quota before starting the next charge
window. However, because the total_charged_sz and total_charged_ns have
not updated, the throughput and effective size quota are also not changed.
Since effective size quota can only be decreased, other effective size
quota update factors including DAMOS quota goals and size quota cannot
make any change, either.
As a result, the scheme is unexpectedly deactivated until the user notices
and mitigates the situation. The users can mitigate this situation by
changing the time quota online or re-install the scheme. While the
mitigation is somewhat straightforward, finding the situation would be
challenging, because DAMON is not providing good observabilities for that.
Even if such observability is provided, doing the additional monitoring
and the mitigation is somewhat cumbersome and not aligned to the intention
of the time quota. The time quota was intended to help reduce the user's
administration overhead.
Fix the problem by setting time quota-modified effective size quota be at
least min_region_sz always.
The issue was discovered [1] by sashiko.
Link: https://lore.kernel.org/20260407003153.79589-1-sj@kernel.org
Link: https://lore.kernel.org/20260405192504.110014-1-sj@kernel.org [1]
Fixes:
|
||
|
|
8fedac321f |
mm/mempolicy: fix weighted interleave auto sysfs name
The __ATTR macro is a utility that makes defining kobj_attributes easier by stringfying the name, verifying the mode, and setting the show/store fields in a single initializer. It takes a raw token as the first value, rather than a string, so that __ATTR family macros like __ATTR_RW can token-paste it for inferring the _show / _store function names. Commit |
||
|
|
9055c64567 |
memblock: updates for 7.0-rc1
* improve debugability of reserve_mem kernel parameter handling with print outs in case of a failure and debugfs info showing what was actually reserved * Make memblock_free_late() and free_reserved_area() use the same core logic for freeing the memory to buddy and ensure it takes care of updating memblock arrays when ARCH_KEEP_MEMBLOCK is enabled. -----BEGIN PGP SIGNATURE----- iQFEBAABCgAuFiEEeOVYVaWZL5900a/pOQOGJssO/ZEFAmnjRmsQHHJwcHRAa2Vy bmVsLm9yZwAKCRA5A4Ymyw79kYh0CAC4NpZGFqpEBep1eQcfqsPH05dvp1LUXDNk i5GwS2ht/F5D9GcD+EyoYRQjRM8k+XZyOe3sqEF01Uav/rHAv3XrITg/pfiA92AR K7CvQv4NvyQqUNcv/mEb+P8niriJ4oHRXCag9inop1jo/x3Mym07oEy73rknAx9r ZQKwoFNOM/QQGVb9hZUANKCkE8cAsUXG89yEOH0n17FOahC0PZbK/vxjeO+br3IL HxEoC5l1j4cUauf8XEhsVXXdch0iqit/fB3ROePYFNCx7koVYHk6Yl1w++AM0RUA ypOmfPsSiqLY2ciuTIAnpTeMfQkkhEmMI3mp6T5BUBwSKJxLRaSM =c1xd -----END PGP SIGNATURE----- Merge tag 'memblock-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock Pull memblock updates from Mike Rapoport: - improve debuggability of reserve_mem kernel parameter handling with print outs in case of a failure and debugfs info showing what was actually reserved - Make memblock_free_late() and free_reserved_area() use the same core logic for freeing the memory to buddy and ensure it takes care of updating memblock arrays when ARCH_KEEP_MEMBLOCK is enabled. * tag 'memblock-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock: x86/alternative: delay freeing of smp_locks section memblock: warn when freeing reserved memory before memory map is initialized memblock, treewide: make memblock_free() handle late freeing memblock: make free_reserved_area() update memblock if ARCH_KEEP_MEMBLOCK=y memblock: extract page freeing from free_reserved_area() into a helper memblock: make free_reserved_area() more robust mm: move free_reserved_area() to mm/memblock.c powerpc: opal-core: pair alloc_pages_exact() with free_pages_exact() powerpc: fadump: pair alloc_pages_exact() with free_pages_exact() memblock: reserve_mem: fix end caclulation in reserve_mem_release_by_name() memblock: move reserve_bootmem_range() to memblock.c and make it static memblock: Add reserve_mem debugfs info memblock: Print out errors on reserve_mem parser |
||
|
|
3de705a43a |
mm/vmscan: avoid false-positive -Wuninitialized warning
When the -fsanitize=bounds sanitizer is enabled, gcc-16 sometimes runs
into a corner case in the read_ctrl_pos() pos function, where it sees
possible undefined behavior from the 'tier' index overflowing, presumably
in the case that this was called with a negative tier:
In function 'get_tier_idx',
inlined from 'isolate_folios' at mm/vmscan.c:4671:14:
mm/vmscan.c: In function 'isolate_folios':
mm/vmscan.c:4645:29: error: 'pv.refaulted' is used uninitialized [-Werror=uninitialized]
Part of the problem seems to be that read_ctrl_pos() has unusual calling
conventions since commit
|
||
|
|
57294a97bd |
mm/migrate_device: remove dead migration entry check in migrate_vma_collect_huge_pmd()
The softleaf_is_migration() check is unreachable as entries that are not
device_private are filtered out. Similarly, the PTE-level equivalent in
migrate_vma_collect_pmd() skips migration entries.
This dead branch also contained a double spin_unlock(ptl) bug.
Link: https://lore.kernel.org/20260212014611.416695-1-dave@stgolabs.net
Fixes:
|
||
|
|
2b19bf0571 |
mm/vmstat: fix vmstat_shepherd double-scheduling vmstat_update
vmstat_shepherd uses delayed_work_pending() to check whether vmstat_update
is already scheduled for a given CPU before queuing it. However,
delayed_work_pending() only tests WORK_STRUCT_PENDING_BIT, which is
cleared the moment a worker thread picks up the work to execute it.
This means that while vmstat_update is actively running on a CPU,
delayed_work_pending() returns false. If need_update() also returns true
at that point (per-cpu counters not yet zeroed mid-flush), the shepherd
queues a second invocation with delay=0, causing vmstat_update to run
again immediately after finishing.
On a 72-CPU system this race is readily observable: before the fix, many
CPUs show invocation gaps well below 500 jiffies (the minimum
round_jiffies_relative() can produce), with the most extreme cases
reaching 0 jiffies—vmstat_update called twice within the same jiffy.
Fix this by replacing delayed_work_pending() with work_busy(), which
returns non-zero for both WORK_BUSY_PENDING (timer armed or work queued)
and WORK_BUSY_RUNNING (work currently executing). The shepherd now
correctly skips a CPU in all busy states.
After the fix, all sub-jiffy and most sub-100-jiffie gaps disappear. The
remaining early invocations have gaps in the 700–999 jiffie range,
attributable to round_jiffies_relative() aligning to a nearer
jiffie-second boundary rather than to this race.
Each spurious vmstat_update invocation has a measurable side effect:
refresh_cpu_vm_stats() calls decay_pcp_high() for every zone, which drains
idle per-CPU pages back to the buddy allocator via free_pcppages_bulk(),
taking the zone spinlock each time. Eliminating the double-scheduling
therefore reduces zone lock contention directly. On a 72-CPU stress-ng
workload measured with perf lock contention:
free_pcppages_bulk contention count: ~55% reduction
free_pcppages_bulk total wait time: ~57% reduction
free_pcppages_bulk max wait time: ~47% reduction
Note: work_busy() is inherently racy—between the check and the
subsequent queue_delayed_work_on() call, vmstat_update can finish
execution, leaving the work neither pending nor running. In that narrow
window the shepherd can still queue a second invocation. After the fix,
this residual race is rare and produces only occasional small gaps, a
significant improvement over the systematic double-scheduling seen with
delayed_work_pending().
Link: https://lore.kernel.org/20260409-vmstat-v2-1-e9d9a6db08ad@debian.org
Fixes:
|
||
|
|
c45b354911 |
mm/hugetlb: fix early boot crash on parameters without '=' separator
If hugepages, hugepagesz, or default_hugepagesz are specified on the
kernel command line without the '=' separator, early parameter parsing
passes NULL to hugetlb_add_param(), which dereferences it in strlen() and
can crash the system during early boot.
Reject NULL values in hugetlb_add_param() and return -EINVAL instead.
Link: https://lore.kernel.org/20260409105437.108686-4-thorsten.blum@linux.dev
Fixes:
|
||
|
|
89e613bc0b |
mm/mprotect: special-case small folios when applying permissions
The common order-0 case is important enough to want its own branch, and avoids the hairy, large loop logic that the CPU does not seem to handle particularly well. While at it, encourage the compiler to inline batch PTE logic and resolve constant branches by adding __always_inline strategically. Link: https://lore.kernel.org/20260402141628.3367596-3-pfalcato@suse.de Signed-off-by: Pedro Falcato <pfalcato@suse.de> Suggested-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Tested-by: Luke Yang <luyang@redhat.com> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Jiri Hladky <jhladky@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
3bc181c143 |
mm/mprotect: move softleaf code out of the main function
Patch series "mm/mprotect: micro-optimization work", v3. Micro-optimize the change_protection functionality and the change_pte_range() routine. This set of functions works in an incredibly tight loop, and even small inefficiencies are incredibly evident when spun hundreds, thousands or hundreds of thousands of times. There was an attempt to keep the batching functionality as much as possible, which introduced some part of the slowness, but not all of it. Removing it for !arm64 architectures would speed mprotect() up even further, but could easily pessimize cases where large folios are mapped (which is not as rare as it seems, particularly when it comes to the page cache these days). The micro-benchmark used for the tests was [0] (usable using google/benchmark and g++ -O2 -lbenchmark repro.cpp) This resulted in the following (first entry is baseline): --------------------------------------------------------- Benchmark Time CPU Iterations --------------------------------------------------------- mprotect_bench 85967 ns 85967 ns 6935 mprotect_bench 70684 ns 70684 ns 9887 After the patchset we can observe an ~18% speedup in mprotect. Wonderful for the elusive mprotect-based workloads! Testing & more ideas welcome. I suspect there is plenty of improvement possible but it would require more time than what I have on my hands right now. The entire inlined function (which inlines into change_protection()) is gigantic - I'm not surprised this is so finnicky. Note: per my profiling, the next _big_ bottleneck here is modify_prot_start_ptes, exactly on the xchg() done by x86. ptep_get_and_clear() is _expensive_. I don't think there's a properly safe way to go about it since we do depend on the D bit quite a lot. This might not be such an issue on other architectures. Luke Yang reported [1]: : On average, we see improvements ranging from a minimum of 5% to a : maximum of 55%, with most improvements showing around a 25% speed up in : the libmicro/mprot_tw4m micro benchmark. This patch (of 2): Move softleaf change_pte_range code into a separate function. This makes the change_pte_range() function a good bit smaller, and lessens cognitive load when reading through the function. Link: https://lore.kernel.org/20260402141628.3367596-1-pfalcato@suse.de Link: https://lore.kernel.org/20260402141628.3367596-2-pfalcato@suse.de Link: https://lore.kernel.org/all/aY8-XuFZ7zCvXulB@luyang-thinkpadp1gen7.toromso.csb/ Link: https://gist.github.com/heatd/1450d273005aba91fa5744f44dfcd933 [0] Link: https://lore.kernel.org/CAL2CeBxT4jtJ+LxYb6=BNxNMGinpgD_HYH5gGxOP-45Q2OncqQ@mail.gmail.com [1] Signed-off-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Tested-by: Luke Yang <luyang@redhat.com> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Jiri Hladky <jhladky@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
19999e479c |
mm: remove '!root_reclaim' checking in should_abort_scan()
Android systems usually use memory.reclaim interface to implement user space memory management which expects that the requested reclaim target and actually reclaimed amount memory are not diverging by too much. With the current MGRLU implementation there is, however, no bail out when the reclaim target is reached and this could lead to an excessive reclaim that scales with the reclaim hierarchy size.For example, we can get a nr_reclaimed=394/nr_to_reclaim=32 proactive reclaim under a common 1-N cgroup hierarchy. This defect arose from the goal of keeping fairness among memcgs that is, for try_to_free_mem_cgroup_pages -> shrink_node_memcgs -> shrink_lruvec -> lru_gen_shrink_lruvec -> try_to_shrink_lruvec, the !root_reclaim(sc) check was there for reclaim fairness, which was necessary before commit |
||
|
|
df620ec4d4 |
mm/page_io: use sio->len for PSWPIN accounting in sio_read_complete()
sio_read_complete() uses sio->pages to account global PSWPIN vm events, but sio->pages tracks the number of bvec entries (folios), not base pages. While large folios cannot currently reach this path (SWP_FS_OPS and SWP_SYNCHRONOUS_IO are mutually exclusive, and mTHP swap-in allocation is gated on SWP_SYNCHRONOUS_IO), the accounting is semantically inconsistent with the per-memcg path which correctly uses folio_nr_pages(). Use sio->len >> PAGE_SHIFT instead, which gives the correct base page count since sio->len is accumulated via folio_size(folio). Link: https://lore.kernel.org/20260402061408.36119-1-devnexen@gmail.com Signed-off-by: David Carlier <devnexen@gmail.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: NeilBrown <neil@brown.name> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
6ab703034f |
userfaultfd: mfill_atomic(): remove retry logic
Since __mfill_atomic_pte() handles the retry for both anonymous and shmem, there is no need to retry copying the date from the userspace in the loop in mfill_atomic(). Drop the retry logic from mfill_atomic(). [rppt@kernel.org: remove safety measure of not returning ENOENT from _copy] Link: https://lore.kernel.org/ac5zcDUY8CFHr6Lw@kernel.org Link: https://lore.kernel.org/20260402041156.1377214-12-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Harry Yoo (Oracle) <harry@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
f74991b4e3 |
shmem, userfaultfd: implement shmem uffd operations using vm_uffd_ops
Add filemap_add() and filemap_remove() methods to vm_uffd_ops and use them in __mfill_atomic_pte() to add shmem folios to page cache and remove them in case of error. Implement these methods in shmem along with vm_uffd_ops->alloc_folio() and drop shmem_mfill_atomic_pte(). Since userfaultfd now does not reference any functions from shmem, drop include if linux/shmem_fs.h from mm/userfaultfd.c mfill_atomic_install_pte() is not used anywhere outside of mm/userfaultfd, make it static. Link: https://lore.kernel.org/20260402041156.1377214-11-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: James Houghton <jthoughton@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Harry Yoo (Oracle) <harry@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
ad9ac30813 |
userfaultfd: introduce vm_uffd_ops->alloc_folio()
and use it to refactor mfill_atomic_pte_zeroed_folio() and mfill_atomic_pte_copy(). mfill_atomic_pte_zeroed_folio() and mfill_atomic_pte_copy() perform almost identical actions: * allocate a folio * update folio contents (either copy from userspace of fill with zeros) * update page tables with the new folio Split a __mfill_atomic_pte() helper that handles both cases and uses newly introduced vm_uffd_ops->alloc_folio() to allocate the folio. Pass the ops structure from the callers to __mfill_atomic_pte() to later allow using anon_uffd_ops for MAP_PRIVATE mappings of file-backed VMAs. Note, that the new ops method is called alloc_folio() rather than folio_alloc() to avoid clash with alloc_tag macro folio_alloc(). Link: https://lore.kernel.org/20260402041156.1377214-10-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: James Houghton <jthoughton@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Harry Yoo (Oracle) <harry@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
dfc4d77182 |
shmem, userfaultfd: use a VMA callback to handle UFFDIO_CONTINUE
When userspace resolves a page fault in a shmem VMA with UFFDIO_CONTINUE it needs to get a folio that already exists in the pagecache backing that VMA. Instead of using shmem_get_folio() for that, add a get_folio_noalloc() method to 'struct vm_uffd_ops' that will return a folio if it exists in the VMA's pagecache at given pgoff. Implement get_folio_noalloc() method for shmem and slightly refactor userfaultfd's mfill_get_vma() and mfill_atomic_pte_continue() to support this new API. Link: https://lore.kernel.org/20260402041156.1377214-9-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: James Houghton <jthoughton@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Harry Yoo (Oracle) <harry@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
0f48947c42 |
userfaultfd: introduce vm_uffd_ops
Current userfaultfd implementation works only with memory managed by core MM: anonymous, shmem and hugetlb. First, there is no fundamental reason to limit userfaultfd support only to the core memory types and userfaults can be handled similarly to regular page faults provided a VMA owner implements appropriate callbacks. Second, historically various code paths were conditioned on vma_is_anonymous(), vma_is_shmem() and is_vm_hugetlb_page() and some of these conditions can be expressed as operations implemented by a particular memory type. Introduce vm_uffd_ops extension to vm_operations_struct that will delegate memory type specific operations to a VMA owner. Operations for anonymous memory are handled internally in userfaultfd using anon_uffd_ops that implicitly assigned to anonymous VMAs. Start with a single operation, ->can_userfault() that will verify that a VMA meets requirements for userfaultfd support at registration time. Implement that method for anonymous, shmem and hugetlb and move relevant parts of vma_can_userfault() into the new callbacks. [rppt@kernel.org: relocate VM_DROPPABLE test, per Tal] Link: https://lore.kernel.org/adffgfM5ANxtPIEF@kernel.org Link: https://lore.kernel.org/20260402041156.1377214-8-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Harry Yoo (Oracle) <harry@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Cc: Tal Zussman <tz2294@columbia.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
a5bb866987 |
userfaultfd: move vma_can_userfault out of line
vma_can_userfault() has grown pretty big and it's not called on performance critical path. Move it out of line. No functional changes. Link: https://lore.kernel.org/20260402041156.1377214-7-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Harry Yoo (Oracle) <harry@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
f5f035a724 |
userfaultfd: retry copying with locks dropped in mfill_atomic_pte_copy()
Implementation of UFFDIO_COPY for anonymous memory might fail to copy data from userspace buffer when the destination VMA is locked (either with mm_lock or with per-VMA lock). In that case, mfill_atomic() releases the locks, retries copying the data with locks dropped and then re-locks the destination VMA and re-establishes PMD. Since this retry-reget dance is only relevant for UFFDIO_COPY and it never happens for other UFFDIO_ operations, make it a part of mfill_atomic_pte_copy() that actually implements UFFDIO_COPY for anonymous memory. As a temporal safety measure to avoid breaking biscection mfill_atomic_pte_copy() makes sure to never return -ENOENT so that the loop in mfill_atomic() won't retry copiyng outside of mmap_lock. This is removed later when shmem implementation will be updated later and the loop in mfill_atomic() will be adjusted. [akpm@linux-foundation.org: update mfill_copy_folio_retry()] Link: https://lore.kernel.org/20260316173829.1126728-1-avagin@google.com Link: https://lore.kernel.org/20260306171815.3160826-6-rppt@kernel.org Link: https://lore.kernel.org/20260402041156.1377214-6-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Cc: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
b8c03b7f45 |
userfaultfd: introduce mfill_get_vma() and mfill_put_vma()
Split the code that finds, locks and verifies VMA from mfill_atomic() into a helper function. This function will be used later during refactoring of mfill_atomic_pte_copy(). Add a counterpart mfill_put_vma() helper that unlocks the VMA and releases map_changing_lock. [avagin@google.com: fix lock leak in mfill_get_vma()] Link: https://lore.kernel.org/20260316173829.1126728-1-avagin@google.com Link: https://lore.kernel.org/20260402041156.1377214-5-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Andrei Vagin <avagin@google.com> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
e2e0b826d3 |
userfaultfd: introduce mfill_establish_pmd() helper
There is a lengthy code chunk in mfill_atomic() that establishes the PMD for UFFDIO operations. This code may be called twice: first time when the copy is performed with VMA/mm locks held and the other time after the copy is retried with locks dropped. Move the code that establishes a PMD into a helper function so it can be reused later during refactoring of mfill_atomic_pte_copy(). Link: https://lore.kernel.org/20260402041156.1377214-4-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
db0062d2c0 |
userfaultfd: introduce struct mfill_state
mfill_atomic() passes a lot of parameters down to its callees.
Aggregate them all into mfill_state structure and pass this structure to
functions that implement various UFFDIO_ commands.
Tracking the state in a structure will allow moving the code that retries
copying of data for UFFDIO_COPY into mfill_atomic_pte_copy() and make the
loop in mfill_atomic() identical for all UFFDIO operations on PTE-mapped
memory.
The mfill_state definition is deliberately local to mm/userfaultfd.c,
hence shmem_mfill_atomic_pte() is not updated.
[harry.yoo@oracle.com: properly initialize mfill_state.len to fix
folio_add_new_anon_rmap() WARN]
Link: https://lore.kernel.org/abehBY7QakYF9bK4@hyeyoo
Link: https://lore.kernel.org/20260402041156.1377214-3-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrei Vagin <avagin@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Harry Yoo (Oracle) <harry@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nikita Kalyazin <kalyazin@amazon.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Carlier <devnexen@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
c0620487fc |
userfaultfd: introduce mfill_copy_folio_locked() helper
Patch series "mm, kvm: allow uffd support in guest_memfd", v4. These patches enable support for userfaultfd in guest_memfd. As the groundwork I refactored userfaultfd handling of PTE-based memory types (anonymous and shmem) and converted them to use vm_uffd_ops for allocating a folio or getting an existing folio from the page cache. shmem also implements callbacks that add a folio to the page cache after the data passed in UFFDIO_COPY was copied and remove the folio from the page cache if page table update fails. In order for guest_memfd to notify userspace about page faults, there are new VM_FAULT_UFFD_MINOR and VM_FAULT_UFFD_MISSING that a ->fault() handler can return to inform the page fault handler that it needs to call handle_userfault() to complete the fault. Nikita helped to plumb these new goodies into guest_memfd and provided basic tests to verify that guest_memfd works with userfaultfd. The handling of UFFDIO_MISSING in guest_memfd requires ability to remove a folio from page cache, the best way I could find was exporting filemap_remove_folio() to KVM. I deliberately left hugetlb out, at least for the most part. hugetlb handles acquisition of VMA and more importantly establishing of parent page table entry differently than PTE-based memory types. This is a different abstraction level than what vm_uffd_ops provides and people objected to exposing such low level APIs as a part of VMA operations. Also, to enable uffd in guest_memfd refactoring of hugetlb is not needed and I prefer to delay it until the dust settles after the changes in this set. This patch (of 4): Split copying of data when locks held from mfill_atomic_pte_copy() into a helper function mfill_copy_folio_locked(). This makes improves code readability and makes complex mfill_atomic_pte_copy() function easier to comprehend. No functional change. Link: https://lore.kernel.org/20260402041156.1377214-1-rppt@kernel.org Link: https://lore.kernel.org/20260402041156.1377214-2-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
dc44f32fde |
mm/memfd_luo: remove folio from page cache when accounting fails
In memfd_luo_retrieve_folios(), when shmem_inode_acct_blocks() fails after successfully adding the folio to the page cache, the code jumps to unlock_folio without removing the folio from the page cache. While the folio eventually will be freed when the file is released by memfd_luo_retrieve(), it is a good idea to directly remove a folio that was not fully added to the file. This avoids the possibility of accounting mismatches in shmem or filemap core. Fix by adding a remove_from_cache label that calls filemap_remove_folio() before unlocking, matching the error handling pattern in shmem_alloc_and_add_folio(). This issue was identified by AI review: https://sashiko.dev/#/patchset/20260323110747.193569-1-duanchenghao@kylinos.cn [pratyush@kernel.org: changelog alterations] Link: https://lore.kernel.org/2vxzzf3lfujq.fsf@kernel.org Link: https://lore.kernel.org/20260326084727.118437-7-duanchenghao@kylinos.cn Signed-off-by: Chenghao Duan <duanchenghao@kylinos.cn> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Haoran Jiang <jianghaoran@kylinos.cn> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
3538f90ab8 |
mm/memfd_luo: fix physical address conversion in put_folios cleanup
In memfd_luo_retrieve_folios()'s put_folios cleanup path:
1. kho_restore_folio() expects a phys_addr_t (physical address) but
receives a raw PFN (pfolio->pfn). This causes kho_restore_page() to
check the wrong physical address (pfn << PAGE_SHIFT instead of the
actual physical address).
2. This loop lacks the !pfolio->pfn check that exists in the main
retrieval loop and memfd_luo_discard_folios(), which could
incorrectly process sparse file holes where pfn=0.
Fix by converting PFN to physical address with PFN_PHYS() and adding
the !pfolio->pfn check, matching the pattern used elsewhere in this file.
This issue was identified by the AI review.
https://sashiko.dev/#/patchset/20260323110747.193569-1-duanchenghao@kylinos.cn
Link: https://lore.kernel.org/20260326084727.118437-6-duanchenghao@kylinos.cn
Fixes:
|
||
|
|
32f6cec5e7 |
mm/memfd_luo: use i_size_write() to set inode size during retrieve
Use i_size_write() instead of directly assigning to inode->i_size when restoring the memfd size in memfd_luo_retrieve(), to keep code consistency. No functional change intended. Link: https://lore.kernel.org/20260326084727.118437-5-duanchenghao@kylinos.cn Signed-off-by: Chenghao Duan <duanchenghao@kylinos.cn> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Haoran Jiang <jianghaoran@kylinos.cn> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Pratyush Yadav <pratyush@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
4aa6424f37 |
mm/memfd_luo: remove unnecessary memset in zero-size memfd path
The memset(kho_vmalloc, 0, sizeof(*kho_vmalloc)) call in the zero-size file handling path is unnecessary because the allocation of the ser structure already uses the __GFP_ZERO flag, ensuring the memory is already zero-initialized. Link: https://lore.kernel.org/20260326084727.118437-4-duanchenghao@kylinos.cn Signed-off-by: Chenghao Duan <duanchenghao@kylinos.cn> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Haoran Jiang <jianghaoran@kylinos.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |