Commit Graph

119 Commits

Author SHA1 Message Date
Lorenzo Stoakes
619eab23e1 mm/vma: do not try to unmap a VMA if mmap_prepare() invoked from mmap()
The mmap_prepare hook functionality includes the ability to invoke
mmap_prepare() from the mmap() hook of existing 'stacked' drivers, that is
ones which are capable of calling the mmap hooks of other drivers/file
systems (e.g.  overlayfs, shm).

As part of the mmap_prepare action functionality, we deal with errors by
unmapping the VMA should one arise.  This works in the usual mmap_prepare
case, as we invoke this action at the last moment, when the VMA is
established in the maple tree.

However, the mmap() hook passes a not-fully-established VMA pointer to the
caller (which is the motivation behind the mmap_prepare() work), which is
detached.

So attempting to unmap a VMA in this state will be problematic, with the
most obvious symptom being a warning in vma_mark_detached(), because the
VMA is already detached.

It's also unncessary - the mmap() handler will clean up the VMA on error.

So to fix this issue, this patch propagates whether or not an mmap action
is being completed via the compatibility layer or directly.

If the former, then we do not attempt VMA cleanup, if the latter, then we
do.

This patch also updates the userland VMA tests to reflect the change.

Link: https://lore.kernel.org/20260421102150.189982-1-ljs@kernel.org
Fixes: ac0a3fc9c0 ("mm: add ability to take further action in vm_area_desc")
Signed-off-by: Lorenzo Stoakes <ljs@kernel.org>
Reported-by: syzbot+db390288d141a1dccf96@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69e69734.050a0220.24bfd3.0027.GAE@google.com/
Cc: David Hildenbrand <david@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-27 05:54:24 -07:00
Linus Torvalds
334fbe734e mm.git review status for linus..mm-stable
Everything:
 
 Total patches:       368
 Reviews/patch:       1.56
 Reviewed rate:       74%
 
 Excluding DAMON:
 
 Total patches:       316
 Reviews/patch:       1.77
 Reviewed rate:       81%
 
 Excluding DAMON and zram:
 
 Total patches:       306
 Reviews/patch:       1.81
 Reviewed rate:       82%
 
 Excluding DAMON, zram and maple_tree:
 
 Total patches:       276
 Reviews/patch:       2.01
 Reviewed rate:       91%
 
 Significant patch series in this merge:
 
 - The 30 patch series "maple_tree: Replace big node with maple copy"
   from Liam Howlett is mainly prepararatory work for ongoing development
   but it does reduce stack usage and is an improvement.
 
 - The 12 patch series "mm, swap: swap table phase III: remove swap_map"
   from Kairui Song offers memory savings by removing the static swap_map.
   It also yields some CPU savings and implements several cleanups.
 
 - The 2 patch series "mm: memfd_luo: preserve file seals" from Pratyush
   Yadav adds file seal preservation to LUO's memfd code.
 
 - The 2 patch series "mm: zswap: add per-memcg stat for incompressible
   pages" from Jiayuan Chen adds additional userspace stats reportng to
   zswap.
 
 - The 4 patch series "arch, mm: consolidate empty_zero_page" from Mike
   Rapoport implements some cleanups for our handling of ZERO_PAGE() and
   zero_pfn.
 
 - The 2 patch series "mm/kmemleak: Improve scan_should_stop()
   implementation" from Zhongqiu Han provides an robustness improvement and
   some cleanups in the kmemleak code.
 
 - The 4 patch series "Improve khugepaged scan logic" from Vernon Yang
   "improves the khugepaged scan logic and reduces CPU consumption by
   prioritizing scanning tasks that access memory frequently".
 
 - The 2 patch series "Make KHO Stateless" from Jason Miu simplifies
   Kexec Handover by "transitioning KHO from an xarray-based metadata
   tracking system with serialization to a radix tree data structure that
   can be passed directly to the next kernel"
 
 - The 3 patch series "mm: vmscan: add PID and cgroup ID to vmscan
   tracepoints" from Thomas Ballasi and Steven Rostedt enhances vmscan's
   tracepointing.
 
 - The 5 patch series "mm: arch/shstk: Common shadow stack mapping helper
   and VM_NOHUGEPAGE" from Catalin Marinas is a cleanup for the shadow
   stack code: remove per-arch code in favour of a generic implementation.
 
 - The 2 patch series "Fix KASAN support for KHO restored vmalloc
   regions" from Pasha Tatashin fixes a WARN() which can be emitted the KHO
   restores a vmalloc area.
 
 - The 4 patch series "mm: Remove stray references to pagevec" from Tal
   Zussman provides several cleanups, mainly udpating references to "struct
   pagevec", which became folio_batch three years ago.
 
 - The 17 patch series "mm: Eliminate fake head pages from vmemmap
   optimization" from Kiryl Shutsemau simplifies the HugeTLB vmemmap
   optimization (HVO) by changing how tail pages encode their relationship
   to the head page.
 
 - The 2 patch series "mm/damon/core: improve DAMOS quota efficiency for
   core layer filters" from SeongJae Park improves two problematic
   behaviors of DAMOS that makes it less efficient when core layer filters
   are used.
 
 - The 3 patch series "mm/damon: strictly respect min_nr_regions" from
   SeongJae Park improves DAMON usability by extending the treatment of the
   min_nr_regions user-settable parameter.
 
 - The 3 patch series "mm/page_alloc: pcp locking cleanup" from Vlastimil
   Babka is a proper fix for a previously hotfixed SMP=n issue.  Code
   simplifications and cleanups ennsed.
 
 - The 16 patch series "mm: cleanups around unmapping / zapping" from
   David Hildenbrand implements "a bunch of cleanups around unmapping and
   zapping.  Mostly simplifications, code movements, documentation and
   renaming of zapping functions".
 
 - The 6 patch series "support batched checking of the young flag for
   MGLRU" from Baolin Wang supports batched checking of the young flag for
   MGLRU.  It's part cleanups; one benchmark shows large performance
   benefits for arm64.
 
 - The 5 patch series "memcg: obj stock and slab stat caching cleanups"
   from Johannes Weiner provides memcg cleanup and robustness improvements.
 
 - The 5 patch series "Allow order zero pages in page reporting" from
   Yuvraj Sakshith enhances page_reporting's free page reporting - it is
   presently and undesirably order-0 pages when reporting free memory.
 
 - The 6 patch series "mm: vma flag tweaks" from Lorenzo Stoakes is
   cleanup work following from the recent conversion of the VMA flags to a
   bitmap.
 
 - The 10 patch series "mm/damon: add optional debugging-purpose sanity
   checks" from SeongJae Park adds some more developer-facing debug checks
   into DAMON core.
 
 - The 2 patch series "mm/damon: test and document power-of-2
   min_region_sz requirement" from SeongJae Park adds an additional DAMON
   kunit test and makes some adjustments to the addr_unit parameter
   handling.
 
 - The 3 patch series "mm/damon/core: make passed_sample_intervals
   comparisons overflow-safe" from SeongJae Park fixes a hard-to-hit time
   overflow issue in DAMON core.
 
 - The 7 patch series "mm/damon: improve/fixup/update ratio calculation,
   test and documentation" from SeongJae Park is a "batch of misc/minor
   improvements and fixups" for DAMON.
 
 - The 4 patch series "mm: move vma_(kernel|mmu)_pagesize() out of
   hugetlb.c" from David Hildenbrand fixes a possible issue with dax-device
   when CONFIG_HUGETLB=n.  Some code movement was required.
 
 - The 6 patch series "zram: recompression cleanups and tweaks" from
   Sergey Senozhatsky provides "a somewhat random mix of fixups,
   recompression cleanups and improvements" in the zram code.
 
 - The 11 patch series "mm/damon: support multiple goal-based quota
   tuning algorithms" from SeongJae Park extend DAMOS quotas goal
   auto-tuning to support multiple tuning algorithms that users can select.
 
 - The 4 patch series "mm: thp: reduce unnecessary
   start_stop_khugepaged()" from Breno Leitao fixes the khugpaged sysfs
   handling so we no longer spam the logs with reams of junk when
   starting/stopping khugepaged.
 
 - The 3 patch series "mm: improve map count checks" from Lorenzo Stoakes
   provides some cleanups and slight fixes in the mremap, mmap and vma
   code.
 
 - The 5 patch series "mm/damon: support addr_unit on default monitoring
   targets for modules" from SeongJae Park extends the use of DAMON core's
   addr_unit tunable.
 
 - The 5 patch series "mm: khugepaged cleanups and mTHP prerequisites"
   from Nico Pache provides cleanups in the khugepaged and is a base for
   Nico's planned khugepaged mTHP support.
 
 - The 15 patch series "mm: memory hot(un)plug and SPARSEMEM cleanups"
   from David Hildenbrand implements code movement and cleanups in the
   memhotplug and sparsemem code.
 
 - The 2 patch series "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and
   cleanup CONFIG_MIGRATION" from David Hildenbrand rationalizes some
   memhotplug Kconfig support.
 
 - The 6 patch series "change young flag check functions to return bool"
   from Baolin Wang is "a cleanup patchset to change all young flag check
   functions to return bool".
 
 - The 3 patch series "mm/damon/sysfs: fix memory leak and NULL
   dereference issues" from Josh Law and SeongJae Park fixes a few
   potential DAMON bugs.
 
 - The 25 patch series "mm/vma: convert vm_flags_t to vma_flags_t in vma
   code" from "converts a lot of the existing use of the legacy vm_flags_t
   data type to the new vma_flags_t type which replaces it".  Mainly in the
   vma code.
 
 - The 21 patch series "mm: expand mmap_prepare functionality and usage"
   from Lorenzo Stoakes "expands the mmap_prepare functionality, which is
   intended to replace the deprecated f_op->mmap hook which has been the
   source of bugs and security issues for some time".  Cleanups,
   documentation, extension of mmap_prepare into filesystem drivers.
 
 - The 13 patch series "mm/huge_memory: refactor zap_huge_pmd()" from
   Lorenzo Stoakes simplifies and cleans up zap_huge_pmd().  Additional
   cleanups around vm_normal_folio_pmd() and the softleaf functionality are
   performed.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCad3HDQAKCRDdBJ7gKXxA
 jrUQAPwNhPk5nPSxnyxjAeQtOBHqgCdnICeEismLajPKd9aYRgEA0s2XAu3tSUYi
 GrBnWImHG3s4ePQxVcPCegWTsOUrXgQ=
 =1Q7o
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - "maple_tree: Replace big node with maple copy" (Liam Howlett)

   Mainly prepararatory work for ongoing development but it does reduce
   stack usage and is an improvement.

 - "mm, swap: swap table phase III: remove swap_map" (Kairui Song)

   Offers memory savings by removing the static swap_map. It also yields
   some CPU savings and implements several cleanups.

 - "mm: memfd_luo: preserve file seals" (Pratyush Yadav)

   File seal preservation to LUO's memfd code

 - "mm: zswap: add per-memcg stat for incompressible pages" (Jiayuan
   Chen)

   Additional userspace stats reportng to zswap

 - "arch, mm: consolidate empty_zero_page" (Mike Rapoport)

   Some cleanups for our handling of ZERO_PAGE() and zero_pfn

 - "mm/kmemleak: Improve scan_should_stop() implementation" (Zhongqiu
   Han)

   A robustness improvement and some cleanups in the kmemleak code

 - "Improve khugepaged scan logic" (Vernon Yang)

   Improve khugepaged scan logic and reduce CPU consumption by
   prioritizing scanning tasks that access memory frequently

 - "Make KHO Stateless" (Jason Miu)

   Simplify Kexec Handover by transitioning KHO from an xarray-based
   metadata tracking system with serialization to a radix tree data
   structure that can be passed directly to the next kernel

 - "mm: vmscan: add PID and cgroup ID to vmscan tracepoints" (Thomas
   Ballasi and Steven Rostedt)

   Enhance vmscan's tracepointing

 - "mm: arch/shstk: Common shadow stack mapping helper and
   VM_NOHUGEPAGE" (Catalin Marinas)

   Cleanup for the shadow stack code: remove per-arch code in favour of
   a generic implementation

 - "Fix KASAN support for KHO restored vmalloc regions" (Pasha Tatashin)

   Fix a WARN() which can be emitted the KHO restores a vmalloc area

 - "mm: Remove stray references to pagevec" (Tal Zussman)

   Several cleanups, mainly udpating references to "struct pagevec",
   which became folio_batch three years ago

 - "mm: Eliminate fake head pages from vmemmap optimization" (Kiryl
   Shutsemau)

   Simplify the HugeTLB vmemmap optimization (HVO) by changing how tail
   pages encode their relationship to the head page

 - "mm/damon/core: improve DAMOS quota efficiency for core layer
   filters" (SeongJae Park)

   Improve two problematic behaviors of DAMOS that makes it less
   efficient when core layer filters are used

 - "mm/damon: strictly respect min_nr_regions" (SeongJae Park)

   Improve DAMON usability by extending the treatment of the
   min_nr_regions user-settable parameter

 - "mm/page_alloc: pcp locking cleanup" (Vlastimil Babka)

   The proper fix for a previously hotfixed SMP=n issue. Code
   simplifications and cleanups ensued

 - "mm: cleanups around unmapping / zapping" (David Hildenbrand)

   A bunch of cleanups around unmapping and zapping. Mostly
   simplifications, code movements, documentation and renaming of
   zapping functions

 - "support batched checking of the young flag for MGLRU" (Baolin Wang)

   Batched checking of the young flag for MGLRU. It's part cleanups; one
   benchmark shows large performance benefits for arm64

 - "memcg: obj stock and slab stat caching cleanups" (Johannes Weiner)

   memcg cleanup and robustness improvements

 - "Allow order zero pages in page reporting" (Yuvraj Sakshith)

   Enhance free page reporting - it is presently and undesirably order-0
   pages when reporting free memory.

 - "mm: vma flag tweaks" (Lorenzo Stoakes)

   Cleanup work following from the recent conversion of the VMA flags to
   a bitmap

 - "mm/damon: add optional debugging-purpose sanity checks" (SeongJae
   Park)

   Add some more developer-facing debug checks into DAMON core

 - "mm/damon: test and document power-of-2 min_region_sz requirement"
   (SeongJae Park)

   An additional DAMON kunit test and makes some adjustments to the
   addr_unit parameter handling

 - "mm/damon/core: make passed_sample_intervals comparisons
   overflow-safe" (SeongJae Park)

   Fix a hard-to-hit time overflow issue in DAMON core

 - "mm/damon: improve/fixup/update ratio calculation, test and
   documentation" (SeongJae Park)

   A batch of misc/minor improvements and fixups for DAMON

 - "mm: move vma_(kernel|mmu)_pagesize() out of hugetlb.c" (David
   Hildenbrand)

   Fix a possible issue with dax-device when CONFIG_HUGETLB=n. Some code
   movement was required.

 - "zram: recompression cleanups and tweaks" (Sergey Senozhatsky)

   A somewhat random mix of fixups, recompression cleanups and
   improvements in the zram code

 - "mm/damon: support multiple goal-based quota tuning algorithms"
   (SeongJae Park)

   Extend DAMOS quotas goal auto-tuning to support multiple tuning
   algorithms that users can select

 - "mm: thp: reduce unnecessary start_stop_khugepaged()" (Breno Leitao)

   Fix the khugpaged sysfs handling so we no longer spam the logs with
   reams of junk when starting/stopping khugepaged

 - "mm: improve map count checks" (Lorenzo Stoakes)

   Provide some cleanups and slight fixes in the mremap, mmap and vma
   code

 - "mm/damon: support addr_unit on default monitoring targets for
   modules" (SeongJae Park)

   Extend the use of DAMON core's addr_unit tunable

 - "mm: khugepaged cleanups and mTHP prerequisites" (Nico Pache)

   Cleanups to khugepaged and is a base for Nico's planned khugepaged
   mTHP support

 - "mm: memory hot(un)plug and SPARSEMEM cleanups" (David Hildenbrand)

   Code movement and cleanups in the memhotplug and sparsemem code

 - "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and cleanup
   CONFIG_MIGRATION" (David Hildenbrand)

   Rationalize some memhotplug Kconfig support

 - "change young flag check functions to return bool" (Baolin Wang)

   Cleanups to change all young flag check functions to return bool

 - "mm/damon/sysfs: fix memory leak and NULL dereference issues" (Josh
   Law and SeongJae Park)

   Fix a few potential DAMON bugs

 - "mm/vma: convert vm_flags_t to vma_flags_t in vma code" (Lorenzo
   Stoakes)

   Convert a lot of the existing use of the legacy vm_flags_t data type
   to the new vma_flags_t type which replaces it. Mainly in the vma
   code.

 - "mm: expand mmap_prepare functionality and usage" (Lorenzo Stoakes)

   Expand the mmap_prepare functionality, which is intended to replace
   the deprecated f_op->mmap hook which has been the source of bugs and
   security issues for some time. Cleanups, documentation, extension of
   mmap_prepare into filesystem drivers

 - "mm/huge_memory: refactor zap_huge_pmd()" (Lorenzo Stoakes)

   Simplify and clean up zap_huge_pmd(). Additional cleanups around
   vm_normal_folio_pmd() and the softleaf functionality are performed.

* tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
  mm: fix deferred split queue races during migration
  mm/khugepaged: fix issue with tracking lock
  mm/huge_memory: add and use has_deposited_pgtable()
  mm/huge_memory: add and use normal_or_softleaf_folio_pmd()
  mm: add softleaf_is_valid_pmd_entry(), pmd_to_softleaf_folio()
  mm/huge_memory: separate out the folio part of zap_huge_pmd()
  mm/huge_memory: use mm instead of tlb->mm
  mm/huge_memory: remove unnecessary sanity checks
  mm/huge_memory: deduplicate zap deposited table call
  mm/huge_memory: remove unnecessary VM_BUG_ON_PAGE()
  mm/huge_memory: add a common exit path to zap_huge_pmd()
  mm/huge_memory: handle buggy PMD entry in zap_huge_pmd()
  mm/huge_memory: have zap_huge_pmd return a boolean, add kdoc
  mm/huge: avoid big else branch in zap_huge_pmd()
  mm/huge_memory: simplify vma_is_specal_huge()
  mm: on remap assert that input range within the proposed VMA
  mm: add mmap_action_map_kernel_pages[_full]()
  uio: replace deprecated mmap hook with mmap_prepare in uio_info
  drivers: hv: vmbus: replace deprecated mmap hook with mmap_prepare
  mm: allow handling of stacked mmap_prepare hooks in more drivers
  ...
2026-04-15 12:59:16 -07:00
Sechang Lim
894f99eb53 mm/vma: fix memory leak in __mmap_region()
commit 605f6586ec ("mm/vma: do not leak memory when .mmap_prepare
swaps the file") handled the success path by skipping get_file() via
file_doesnt_need_get, but missed the error path.

When /dev/zero is mmap'd with MAP_SHARED, mmap_zero_prepare() calls
shmem_zero_setup_desc() which allocates a new shmem file to back the
mapping. If __mmap_new_vma() subsequently fails, this replacement
file is never fput()'d - the original is released by
ksys_mmap_pgoff(), but nobody releases the new one.

Add fput() for the swapped file in the error path.

Reproducible with fault injection.

FAULT_INJECTION: forcing a failure.
name failslab, interval 1, probability 0, space 0, times 1
CPU: 2 UID: 0 PID: 366 Comm: syz.7.14 Not tainted 7.0.0-rc6 #2 PREEMPT(full)
Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x164/0x1f0
 should_fail_ex+0x525/0x650
 should_failslab+0xdf/0x140
 kmem_cache_alloc_noprof+0x78/0x630
 vm_area_alloc+0x24/0x160
 __mmap_region+0xf6b/0x2660
 mmap_region+0x2eb/0x3a0
 do_mmap+0xc79/0x1240
 vm_mmap_pgoff+0x252/0x4c0
 ksys_mmap_pgoff+0xf8/0x120
 __x64_sys_mmap+0x12a/0x190
 do_syscall_64+0xa9/0x580
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
 </TASK>

kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
BUG: memory leak
unreferenced object 0xffff8881118aca80 (size 360):
  comm "syz.7.14", pid 366, jiffies 4294913255
  hex dump (first 32 bytes):
    00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .....N..........
    ff ff ff ff ff ff ff ff c0 28 4d ae ff ff ff ff  .........(M.....
  backtrace (crc db0f53bc):
    kmem_cache_alloc_noprof+0x3ab/0x630
    alloc_empty_file+0x5a/0x1e0
    alloc_file_pseudo+0x135/0x220
    __shmem_file_setup+0x274/0x420
    shmem_zero_setup_desc+0x9c/0x170
    mmap_zero_prepare+0x123/0x140
    __mmap_region+0xdda/0x2660
    mmap_region+0x2eb/0x3a0
    do_mmap+0xc79/0x1240
    vm_mmap_pgoff+0x252/0x4c0
    ksys_mmap_pgoff+0xf8/0x120
    __x64_sys_mmap+0x12a/0x190
    do_syscall_64+0xa9/0x580
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

Found by syzkaller.

Link: https://lkml.kernel.org/r/20260331180811.1333348-1-rhkrqnwk98@gmail.com
Fixes: 605f6586ec ("mm/vma: do not leak memory when .mmap_prepare swaps the file")
Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-06 11:13:43 -07:00
Lorenzo Stoakes (Oracle)
c50ca15dd4 mm: add vm_ops->mapped hook
Previously, when a driver needed to do something like establish a
reference count, it could do so in the mmap hook in the knowledge that the
mapping would succeed.

With the introduction of f_op->mmap_prepare this is no longer the case, as
it is invoked prior to actually establishing the mapping.

mmap_prepare is not appropriate for this kind of thing as it is called
before any merge might take place, and after which an error might occur
meaning resources could be leaked.

To take this into account, introduce a new vm_ops->mapped callback which
is invoked when the VMA is first mapped (though notably - not when it is
merged - which is correct and mirrors existing mmap/open/close behaviour).

We do better that vm_ops->open() here, as this callback can return an
error, at which point the VMA will be unmapped.

Note that vm_ops->mapped() is invoked after any mmap action is complete
(such as I/O remapping).

We intentionally do not expose the VMA at this point, exposing only the
fields that could be used, and an output parameter in case the operation
needs to update the vma->vm_private_data field.

In order to deal with stacked filesystems which invoke inner filesystem's
mmap() invocations, add __compat_vma_mapped() and invoke it on vfs_mmap()
(via compat_vma_mmap()) to ensure that the mapped callback is handled when
an mmap() caller invokes a nested filesystem's mmap_prepare() callback.

Update the mmap_prepare documentation to describe the mapped hook and make
it clear what its intended use is.

The vm_ops->mapped() call is handled by the mmap complete logic to ensure
the same code paths are handled by both the compatibility and VMA layers.

Additionally, update VMA userland test headers to reflect the change.

Link: https://lkml.kernel.org/r/4c5e98297eb0aae9565c564e1c296a112702f144.1774045440.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexandre Torgue <alexandre.torgue@foss.st.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bodo Stroesser <bostroesser@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Long Li <longli@microsoft.com>
Cc: Marc Dionne <marc.dionne@auristor.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vignesh Raghavendra <vigneshr@ti.com>
Cc: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:53:43 -07:00
Lorenzo Stoakes (Oracle)
382c0f2895 mm: have mmap_action_complete() handle the rmap lock and unmap
Rather than have the callers handle this both the rmap lock release and
unmapping the VMA on error, handle it within the mmap_action_complete()
logic where it makes sense to, being careful not to unlock twice.

This simplifies the logic and makes it harder to make mistake with this,
while retaining correct behaviour with regard to avoiding deadlocks.

Also replace the call_action_complete() function with a direct invocation
of mmap_action_complete() as the abstraction is no longer required.

Also update the VMA tests to reflect this change.

Link: https://lkml.kernel.org/r/8d1ee8ebd3542d006a47e8382fb80cf5b57ecf10.1774045440.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexandre Torgue <alexandre.torgue@foss.st.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bodo Stroesser <bostroesser@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Long Li <longli@microsoft.com>
Cc: Marc Dionne <marc.dionne@auristor.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vignesh Raghavendra <vigneshr@ti.com>
Cc: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:53:42 -07:00
Lorenzo Stoakes (Oracle)
04501e759e mm/vma: remove superfluous map->hold_file_rmap_lock
We don't need to reference this field, it's confusing as it duplicates
mmap_action->hide_from_rmap_until_complete, so thread the mmap_action
through to __mmap_new_vma() instead and use the same field consistently.

Link: https://lkml.kernel.org/r/42c3fbb701e361a17193ecda0d2dabcc326288a5.1774045440.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexandre Torgue <alexandre.torgue@foss.st.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bodo Stroesser <bostroesser@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Long Li <longli@microsoft.com>
Cc: Marc Dionne <marc.dionne@auristor.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vignesh Raghavendra <vigneshr@ti.com>
Cc: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:53:42 -07:00
Lorenzo Stoakes (Oracle)
f96e1d5f15 mm: avoid deadlock when holding rmap on mmap_prepare error
Commit ac0a3fc9c0 ("mm: add ability to take further action in
vm_area_desc") added the ability for drivers to instruct mm to take actions
after the .mmap_prepare callback is complete.

To make life simpler and safer, this is done before the VMA/mmap write lock
is dropped but when the VMA is completely established.

So on error, we simply munmap() the VMA.

As part of this implementation, unfortunately a horrible hack had to be
implemented to support some questionable behaviour hugetlb relies upon -
that is that the file rmap lock is held until the operation is complete.

The implementation, for convenience, did this in mmap_action_finish() so
both the VMA and mmap_prepare compatibility layer paths would have this
correctly handled.

However, it turns out there is a mistake here - the rmap lock cannot be
held on munmap, as free_pgtables() -> unlink_file_vma_batch_add() ->
unlink_file_vma_batch_process() takes the file rmap lock.

We therefore currently have a deadlock issue that might arise.

Resolve this by leaving it to callers to handle the unmap.

The compatibility layer does not support this rmap behaviour, so we simply
have it unmap on error after calling mmap_action_complete().

In the VMA implementation, we only perform the unmap after the rmap lock is
dropped.

This resolves the issue by ensuring the rmap lock is always dropped when
the unmap occurs.

Link: https://lkml.kernel.org/r/d44248be9da68258b07c2c59d4e73485ee0ca943.1774045440.git.ljs@kernel.org
Fixes: ac0a3fc9c0 ("mm: add ability to take further action in vm_area_desc")
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexandre Torgue <alexandre.torgue@foss.st.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bodo Stroesser <bostroesser@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Long Li <longli@microsoft.com>
Cc: Marc Dionne <marc.dionne@auristor.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vignesh Raghavendra <vigneshr@ti.com>
Cc: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:53:42 -07:00
Lorenzo Stoakes (Oracle)
3e4bb27068 mm: various small mmap_prepare cleanups
Patch series "mm: expand mmap_prepare functionality and usage", v4.

This series expands the mmap_prepare functionality, which is intended to
replace the deprecated f_op->mmap hook which has been the source of bugs
and security issues for some time.

This series starts with some cleanup of existing mmap_prepare logic, then
adds documentation for the mmap_prepare call to make it easier for
filesystem and driver writers to understand how it works.

It then importantly adds a vm_ops->mapped hook, a key feature that was
missing from mmap_prepare previously - this is invoked when a driver which
specifies mmap_prepare has successfully been mapped but not merged with
another VMA.

mmap_prepare is invoked prior to a merge being attempted, so you cannot
manipulate state such as reference counts as if it were a new mapping.

The vm_ops->mapped hook allows a driver to perform tasks required at this
stage, and provides symmetry against subsequent vm_ops->open,close calls.

The series uses this to correct the afs implementation which wrongly
manipulated reference count at mmap_prepare time.

It then adds an mmap_prepare equivalent of vm_iomap_memory() -
mmap_action_simple_ioremap(), then uses this to update a number of drivers.

It then splits out the mmap_prepare compatibility layer (which allows for
invocation of mmap_prepare hooks in an mmap() hook) in such a way as to
allow for more incremental implementation of mmap_prepare hooks.

It then uses this to extend mmap_prepare usage in drivers.

Finally it adds an mmap_prepare equivalent of vm_map_pages(), which lays
the foundation for future work which will extend mmap_prepare to DMA
coherent mappings.


This patch (of 21):

Rather than passing arbitrary fields, pass a vm_area_desc pointer to mmap
prepare functions to mmap prepare, and an action and vma pointer to mmap
complete in order to put all the action-specific logic in the function
actually doing the work.

Additionally, allow mmap prepare functions to return an error so we can
error out as soon as possible if there is something logically incorrect in
the input.

Update remap_pfn_range_prepare() to properly check the input range for the
CoW case.

Also remove io_remap_pfn_range_complete(), as we can simply set up the
fields correctly in io_remap_pfn_range_prepare() and use
remap_pfn_range_complete() for this.

While we're here, make remap_pfn_range_prepare_vma() a little neater, and
pass mmap_action directly to call_action_complete().

Then, update compat_vma_mmap() to perform its logic directly, as
__compat_vma_map() is not used by anything so we don't need to export it.

Also update compat_vma_mmap() to use vfs_mmap_prepare() rather than
calling the mmap_prepare op directly.

Finally, update the VMA userland tests to reflect the changes.

Link: https://lkml.kernel.org/r/cover.1774045440.git.ljs@kernel.org
Link: https://lkml.kernel.org/r/99f408e4694f44ab12bdc55fe0bd9685d3bd1117.1774045440.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexandre Torgue <alexandre.torgue@foss.st.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bodo Stroesser <bostroesser@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Long Li <longli@microsoft.com>
Cc: Marc Dionne <marc.dionne@auristor.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vignesh Raghavendra <vigneshr@ti.com>
Cc: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:53:41 -07:00
Lorenzo Stoakes (Oracle)
71fdcf9283 mm: simplify VMA flag tests of excluded flags
We have implemented flag mask comparisons of the form:

if ((vm_flags & (VM_FOO|VM_BAR|VM_BAZ) == VM_FOO) { ... }

Like-for-like in the code using a bitwise-and mask via vma_flags_and() and
using vma_flags_same() to ensure the final result equals only the required
flag value.

This is fine but confusing, make things clearer by instead explicitly
excluding undesired flags and including the desired one via tests of the
form:

	if (vma_flags_test(&flags, VMA_FOO_BIT) &&
	    !vma_flags_test_any(&flags, VMA_BAR_BIT, VMA_BAZ_BIT)) { ... }

Which makes it easier to understand what is going on.

No functional change intended.

Link: https://lkml.kernel.org/r/d395c5dd837a9864f5efcec42175910afbe3ce73.1774034900.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Suggested-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:53:41 -07:00
Lorenzo Stoakes (Oracle)
90cb921c4d mm/vma: convert __mmap_region() to use vma_flags_t
Update the mmap() implementation logic implemented in __mmap_region() and
functions invoked by it.  The mmap_region() function converts its input
vm_flags_t parameter to a vma_flags_t value which it then passes to
__mmap_region() which uses the vma_flags_t value consistently from then
on.

As part of the change, we convert map_deny_write_exec() to using
vma_flags_t (it was incorrectly using unsigned long before), and place it
in vma.h, as it is only used internal to mm.

With this change, we eliminate the legacy is_shared_maywrite_vm_flags()
helper function which is now no longer required.

We are also able to update the MMAP_STATE() and VMG_MMAP_STATE() macros to
use the vma_flags_t value.

Finally, we update the VMA tests to reflect the change.

Link: https://lkml.kernel.org/r/1fc33a404c962f02da778da100387cc19bd62153.1774034900.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:53:41 -07:00
Lorenzo Stoakes (Oracle)
a06eb2f827 mm/vma: convert vma_modify_flags[_uffd]() to use vma_flags_t
Update the vma_modify_flags() and vma_modify_flags_uffd() functions to
accept a vma_flags_t parameter rather than a vm_flags_t one, and propagate
the changes as needed to implement this change.

Also add vma_flags_reset_once() in replacement of vm_flags_reset_once(). We
still need to be careful here because we need to avoid tearing, so maintain
the assumption that the first system word set of flags are the only ones
that require protection from tearing, and retain this functionality.

We can copy the remainder of VMA flags above 64 bits normally. But
hopefully by the time that happens, we will have replaced the logic that
requires these WRITE_ONCE()'s with something else.

We also replace instances of vm_flags_reset() with a simple write of VMA
flags. We are no longer perform a number of checks, most notable of all the
VMA flags asserts becase:

1. We might be operating on a VMA that is not yet added to the tree.

2. We might be operating on a VMA that is now detached.

3. Really in all but core code, you should be using vma_desc_xxx().

4. Other VMA fields are manipulated with no such checks.

5. It'd be egregious to have to add variants of flag functions just to
   account for cases such as the above, especially when we don't do so for
   other VMA fields. Drivers are the problematic cases and why it was
   especially important (and also for debug as VMA locks were introduced),
   the mmap_prepare work is solving this generally.

Additionally, we can fairly safely assume by this point the soft dirty
flags are being set correctly, so it's reasonable to drop this also.

Finally, update the VMA tests to reflect this.

Link: https://lkml.kernel.org/r/51afbb2b8c3681003cc7926647e37335d793836e.1774034900.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:53:41 -07:00
Lorenzo Stoakes (Oracle)
769669bd9c mm/vma: convert as much as we can in mm/vma.c to vma_flags_t
Now we have established a good foundation for vm_flags_t to vma_flags_t
changes, update mm/vma.c to utilise vma_flags_t wherever possible.

We are able to convert VM_STARTGAP_FLAGS entirely as this is only used in
mm/vma.c, and to account for the fact we can't use VM_NONE to make life
easier, place the definition of this within existing #ifdef's to be
cleaner.

Generally the remaining changes are mechanical.

Also update the VMA tests to reflect the changes.

Link: https://lkml.kernel.org/r/5fdeaf8af9a12c2a5d68497495f52fa627d05a5b.1774034900.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:53:41 -07:00
Lorenzo Stoakes (Oracle)
3a6455d56b mm: convert do_brk_flags() to use vma_flags_t
In order to be able to do this, we need to change VM_DATA_DEFAULT_FLAGS
and friends and update the architecture-specific definitions also.

We then have to update some KSM logic to handle VMA flags, and introduce
VMA_STACK_FLAGS to define the vma_flags_t equivalent of VM_STACK_FLAGS.

We also introduce two helper functions for use during the time we are
converting legacy flags to vma_flags_t values - vma_flags_to_legacy() and
legacy_to_vma_flags().

This enables us to iteratively make changes to break these changes up into
separate parts.

We use these explicitly here to keep VM_STACK_FLAGS around for certain
users which need to maintain the legacy vm_flags_t values for the time
being.

We are no longer able to rely on the simple VM_xxx being set to zero if
the feature is not enabled, so in the case of VM_DROPPABLE we introduce
VMA_DROPPABLE as the vma_flags_t equivalent, which is set to
EMPTY_VMA_FLAGS if the droppable flag is not available.

While we're here, we make the description of do_brk_flags() into a kdoc
comment, as it almost was already.

We use vma_flags_to_legacy() to not need to update the vm_get_page_prot()
logic as this time.

Note that in create_init_stack_vma() we have to replace the BUILD_BUG_ON()
with a VM_WARN_ON_ONCE() as the tested values are no longer build time
available.

We also update mprotect_fixup() to use VMA flags where possible, though we
have to live with a little duplication between vm_flags_t and vma_flags_t
values for the time being until further conversions are made.

While we're here, update VM_SPECIAL to be defined in terms of
VMA_SPECIAL_FLAGS now we have vma_flags_to_legacy().

Finally, we update the VMA tests to reflect these changes.

Link: https://lkml.kernel.org/r/d02e3e45d9a33d7904b149f5604904089fd640ae.1774034900.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com>	[SELinux]
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:53:40 -07:00
Lorenzo Stoakes (Oracle)
7ec1885a7e mm/vma: use new VMA flags for sticky flags logic
Use the new vma_flags_t flags implementation to perform the logic around
sticky flags and what flags are ignored on VMA merge.

We make use of the new vma_flags_empty(), vma_flags_diff_pair(), and
vma_flags_and_mask() functionality.

Also update the VMA tests accordingly.

Link: https://lkml.kernel.org/r/369574f06360ffa44707047e3b58eb4897345fba.1774034900.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:53:38 -07:00
Lorenzo Stoakes (Oracle)
2d1e54aab6 mm: abstract reading sysctl_max_map_count, and READ_ONCE()
Concurrent reads and writes of sysctl_max_map_count are possible, so we
should READ_ONCE() and WRITE_ONCE().

The sysctl procfs logic already enforces WRITE_ONCE(), so abstract the
read side with get_sysctl_max_map_count().

While we're here, also move the field to mm/internal.h and add the getter
there since only mm interacts with it, there's no need for anybody else to
have access.

Finally, update the VMA userland tests to reflect the change.

Link: https://lkml.kernel.org/r/0715259eb37cbdfde4f9e5db92a20ec7110a1ce5.1773249037.git.ljs@kernel.org
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Cc: Jann Horn <jannh@google.com>
Cc: Jianzhou Zhao <luckd0g@163.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:53:28 -07:00
Anthony Yznaga
d239462787 mm: prevent droppable mappings from being locked
Droppable mappings must not be lockable.  There is a check for VMAs with
VM_DROPPABLE set in mlock_fixup() along with checks for other types of
unlockable VMAs which ensures this when calling mlock()/mlock2().

For mlockall(MCL_FUTURE), the check for unlockable VMAs is different.  In
apply_mlockall_flags(), if the flags parameter has MCL_FUTURE set, the
current task's mm's default VMA flag field mm->def_flags has VM_LOCKED
applied to it.  VM_LOCKONFAULT is also applied if MCL_ONFAULT is also set.
When these flags are set as default in this manner they are cleared in
__mmap_complete() for new mappings that do not support mlock.  A check for
VM_DROPPABLE in __mmap_complete() is missing resulting in droppable
mappings created with VM_LOCKED set.  To fix this and reduce that chance
of similar bugs in the future, introduce and use vma_supports_mlock().

Link: https://lkml.kernel.org/r/20260310155821.17869-1-anthony.yznaga@oracle.com
Fixes: 9651fcedf7 ("mm: add MAP_DROPPABLE for designating always lazily freeable mappings")
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
Suggested-by: David Hildenbrand <david@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Tested-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Jason A. Donenfeld <jason@zx2c4.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:53:25 -07:00
David Hildenbrand (Arm)
a9496e9e4b mm: move vma_mmu_pagesize() from hugetlb to vma.c
vma_mmu_pagesize() is also queried on non-hugetlb VMAs and does not really
belong into hugetlb.c.

PPC64 provides a custom overwrite with CONFIG_HUGETLB_PAGE, see
arch/powerpc/mm/book3s64/slice.c, so we cannot easily make this a static
inline function.

So let's move it to vma.c and add some proper kerneldoc.

To make vma tests happy, add a simple vma_kernel_pagesize() stub in
tools/testing/vma/include/custom.h.

Link: https://lkml.kernel.org/r/20260309151901.123947-3-david@kernel.org
Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:53:23 -07:00
Lorenzo Stoakes
53f1d93644 mm: make vm_area_desc utilise vma_flags_t only
Now we have eliminated all uses of vm_area_desc->vm_flags, eliminate this
field, and have mmap_prepare users utilise the vma_flags_t
vm_area_desc->vma_flags field only.

As part of this change we alter is_shared_maywrite() to accept a
vma_flags_t parameter, and introduce is_shared_maywrite_vm_flags() for use
with legacy vm_flags_t flags.

We also update struct mmap_state to add a union between vma_flags and
vm_flags temporarily until the mmap logic is also converted to using
vma_flags_t.

Also update the VMA userland tests to reflect this change.

Link: https://lkml.kernel.org/r/fd2a2938b246b4505321954062b1caba7acfc77a.1769097829.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Yury Norov <ynorov@nvidia.com>
Cc: Chris Mason <clm@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-12 15:42:59 -08:00
Lorenzo Stoakes
fd3196ee9c mm: update secretmem to use VMA flags on mmap_prepare
This patch updates secretmem to use the new vma_flags_t type which will
soon supersede vm_flags_t altogether.

In order to make this change we also have to update mlock_future_ok(), we
replace the vm_flags_t parameter with a simple boolean is_vma_locked one,
which also simplifies the invocation here.

This is laying the groundwork for eliminating the vm_flags_t in
vm_area_desc and more broadly throughout the kernel.

No functional changes intended.

[lorenzo.stoakes@oracle.com: fix check_brk_limits(), per Chris]
  Link: https://lkml.kernel.org/r/3aab9ab1-74b4-405e-9efb-08fc2500c06e@lucifer.local
Link: https://lkml.kernel.org/r/a243a09b0a5d0581e963d696de1735f61f5b2075.1769097829.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Yury Norov <ynorov@nvidia.com>
Cc: Chris Mason <clm@fb.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-12 15:42:58 -08:00
Liam R. Howlett
a8700d42b0 mm: use unmap_desc struct for freeing page tables
Pass through the unmap_desc to free_pgtables() because it almost has
everything necessary and is already on the stack.

Updates testing code as necessary.

No functional changes intended.

[Liam.Howlett@oracle.com: fix up unmap desc use on exit_mmap()]
  Link: https://lkml.kernel.org/r/20260210214214.364856-1-Liam.Howlett@oracle.com
Link: https://lkml.kernel.org/r/20260121164946.2093480-12-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: SeongJae Park <sj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-12 15:42:56 -08:00
Liam R. Howlett
2314fe9ba5 mm/vma: use unmap_region() in vms_clear_ptes()
There is no need to open code the vms_clear_ptes() now that unmap_desc
struct is used.

Link: https://lkml.kernel.org/r/20260121164946.2093480-11-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: SeongJae Park <sj@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-12 15:42:56 -08:00
Liam R. Howlett
0df5a8d394 mm/vma: use unmap_desc in exit_mmap() and vms_clear_ptes()
Convert vms_clear_ptes() to use unmap_desc to call unmap_vmas() instead of
the large argument list.  The UNMAP_STATE() cannot be used because the vma
iterator in the vms does not point to the correct maple state
(mas_detach), and the tree_end will be set incorrectly.  Setting up the
arguments manually avoids setting the struct up incorrectly and doing
extra work to get the correct pagetable range.

exit_mmap() also calls unmap_vmas() with many arguments.  Using the
unmap_all_init() function to set the unmap descriptor for all vmas makes
this a bit easier to read.

Update to the vma test code is necessary to ensure testing continues to
function.

No functional changes intended.

Link: https://lkml.kernel.org/r/20260121164946.2093480-10-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: SeongJae Park <sj@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-12 15:42:55 -08:00
Liam R. Howlett
5b6626a76a mm: introduce unmap_desc struct to reduce function arguments
The unmap_region code uses a number of arguments that could use better
documentation.  With the addition of a descriptor for unmap (called
unmap_desc), the arguments can be more self-documenting and increase the
descriptions within the declaration.

No functional change intended

Link: https://lkml.kernel.org/r/20260121164946.2093480-9-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-12 15:42:55 -08:00
Liam R. Howlett
243de0c0dc mm/vma: add page table limit to unmap_region()
The unmap_region() calls need to pass through the page table limit for a
future patch.

No functional changes intended.

Link: https://lkml.kernel.org/r/20260121164946.2093480-7-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-12 15:42:54 -08:00
Liam R. Howlett
eda8c5e776 mm/memory: add tree limit to free_pgtables()
The ceiling and tree search limit need to be different arguments for the
future change in the failed fork attempt.  The ceiling and floor variables
are not very descriptive, so change them to pg_start/pg_end.

Adding a new variable for the vma_end to the function as it will differ
from the pg_end in the later patches in the series.

Add a kernel doc about the free_pgtables() function.

Test code also updated.

No functional changes intended.

Link: https://lkml.kernel.org/r/20260121164946.2093480-6-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-12 15:42:54 -08:00
Liam R. Howlett
23bd03a9a2 mm/vma: add limits to unmap_region() for vmas
Add a limit to the vma search instead of using the start and end of the
one passed in.

No functional changes intended.

Link: https://lkml.kernel.org/r/20260121164946.2093480-5-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-12 15:42:54 -08:00
Lorenzo Stoakes
d17f02417a mm/rmap: separate out fork-only logic on anon_vma_clone()
Specify which operation is being performed to anon_vma_clone(), which
allows us to do checks specific to each operation type, as well as to
separate out and make clear that the anon_vma reuse logic is absolutely
specific to fork only.

This opens the door to further refactorings and refinements later as we
have more information to work with.

Link: https://lkml.kernel.org/r/cf7da7a2d973cdc72a1b80dd9a73260519e8fa9f.1768746221.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: Chris Li <chriscli@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Harry Yoo <harry.yoo@oracle.com>
Cc: Jann Horn <jannh@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-26 20:02:22 -08:00
Lorenzo Stoakes
53eb797ffc mm/rmap: remove anon_vma_merge() function
This function is confusing, we already have the concept of anon_vma merge
to adjacent VMA's anon_vma's to increase probability of anon_vma
compatibility and therefore VMA merge (see is_mergeable_anon_vma() etc.),
as well as anon_vma reuse, along side the usual VMA merge logic.

We can remove the anon_vma check as it is redundant - a merge would not
have been permitted with removal if the anon_vma's were not the same (and
in the case of an unfaulted/faulted merge, we would have already set the
unfaulted VMA's anon_vma to vp->remove->anon_vma in dup_anon_vma()).

Avoid overloading this term when we're very simply unlinking anon_vma
state from a removed VMA upon merge.

Link: https://lkml.kernel.org/r/56bbe45e309f7af197b1c4f94a9a0c8931ff2d29.1768746221.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: Chris Li <chriscli@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Harry Yoo <harry.yoo@oracle.com>
Cc: Jann Horn <jannh@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-26 20:02:21 -08:00
Kevin Lourenco
62451ae347 mm: fix minor spelling mistakes in comments
Correct several typos in comments across files in mm/

[akpm@linux-foundation.org: also fix comment grammar, per SeongJae]
Link: https://lkml.kernel.org/r/20251218150906.25042-1-klourencodev@gmail.com
Signed-off-by: Kevin Lourenco <klourencodev@gmail.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 19:24:48 -08:00
Lorenzo Stoakes
605f6586ec mm/vma: do not leak memory when .mmap_prepare swaps the file
The current implementation of mmap() is set up such that a struct file
object is obtained for the input fd in ksys_mmap_pgoff() via fget(), and
its reference count decremented at the end of the function via.  fput().

If a merge can be achieved, we are fine to simply decrement the refcount
on the file.  Otherwise, in __mmap_new_file_vma(), we increment the
reference count on the file via get_file() such that the fput() in
ksys_mmap_pgoff() does not free the now-referenced file object.

The introduction of the f_op->mmap_prepare hook changes things, as it
becomes possible for a driver to replace the file object right at the
beginning of the mmap operation.

The current implementation is buggy if this happens because it
unconditionally calls get_file() on the mapping's file whether or not it
was replaced (and thus whether or not its reference count will be
decremented at the end of ksys_mmap_pgoff()).

This results in a memory leak, and was exposed in commit ab04945f91
("mm: update mem char driver to use mmap_prepare").

This patch solves the problem by explicitly tracking whether we actually
need to call get_file() on the file or not, and only doing so if required.

Link: https://lkml.kernel.org/r/20260112155143.661284-1-lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Fixes: ab04945f91 ("mm: update mem char driver to use mmap_prepare")
Reported-by: syzbot+bf5de69ebb4bdf86f59f@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6964a92b.050a0220.eaf7.008a.GAE@google.com/
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-19 12:30:01 -08:00
Lorenzo Stoakes
3b617fd3d3 mm/vma: enforce VMA fork limit on unfaulted,faulted mremap merge too
The is_mergeable_anon_vma() function uses vmg->middle as the source VMA. 
However when merging a new VMA, this field is NULL.

In all cases except mremap(), the new VMA will either be newly established
and thus lack an anon_vma, or will be an expansion of an existing VMA thus
we do not care about whether VMA is CoW'd or not.

In the case of an mremap(), we can end up in a situation where we can
accidentally allow an unfaulted/faulted merge with a VMA that has been
forked, violating the general rule that we do not permit this for reasons
of anon_vma lock scalability.

Now we have the ability to be aware of the fact we are copying a VMA and
also know which VMA that is, we can explicitly check for this, so do so.

This is pertinent since commit 879bca0a2c ("mm/vma: fix incorrectly
disallowed anonymous VMA merges"), as this patch permits unfaulted/faulted
merges that were previously disallowed running afoul of this issue.

While we are here, vma_had_uncowed_parents() is a confusing name, so make
it simple and rename it to vma_is_fork_child().

Link: https://lkml.kernel.org/r/6e2b9b3024ae1220961c8b81d74296d4720eaf2b.1767638272.git.lorenzo.stoakes@oracle.com
Fixes: 879bca0a2c ("mm/vma: fix incorrectly disallowed anonymous VMA merges")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Jeongjun Park <aha310510@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Yeoreum Yun <yeoreum.yun@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-14 22:16:24 -08:00
Lorenzo Stoakes
61f67c230a mm/vma: fix anon_vma UAF on mremap() faulted, unfaulted merge
Patch series "mm/vma: fix anon_vma UAF on mremap() faulted, unfaulted
merge", v2.

Commit 879bca0a2c ("mm/vma: fix incorrectly disallowed anonymous VMA
merges") introduced the ability to merge previously unavailable VMA merge
scenarios.

However, it is handling merges incorrectly when it comes to mremap() of a
faulted VMA adjacent to an unfaulted VMA.  The issues arise in three
cases:

1. Previous VMA unfaulted:

              copied -----|
                          v
	|-----------|.............|
	| unfaulted |(faulted VMA)|
	|-----------|.............|
	     prev

2. Next VMA unfaulted:

              copied -----|
                          v
	            |.............|-----------|
	            |(faulted VMA)| unfaulted |
                    |.............|-----------|
		                      next

3. Both adjacent VMAs unfaulted:

              copied -----|
                          v
	|-----------|.............|-----------|
	| unfaulted |(faulted VMA)| unfaulted |
	|-----------|.............|-----------|
	     prev                      next

This series fixes each of these cases, and introduces self tests to assert
that the issues are corrected.

I also test a further case which was already handled, to assert that my
changes continues to correctly handle it:

4. prev unfaulted, next faulted:

              copied -----|
                          v
	|-----------|.............|-----------|
	| unfaulted |(faulted VMA)|  faulted  |
	|-----------|.............|-----------|
	     prev                      next

This bug was discovered via a syzbot report, linked to in the first patch
in the series, I confirmed that this series fixes the bug.

I also discovered that we are failing to check that the faulted VMA was
not forked when merging a copied VMA in cases 1-3 above, an issue this
series also addresses.

I also added self tests to assert that this is resolved (and confirmed
that the tests failed prior to this).

I also cleaned up vma_expand() as part of this work, renamed
vma_had_uncowed_parents() to vma_is_fork_child() as the previous name was
unduly confusing, and simplified the comments around this function.


This patch (of 4):

Commit 879bca0a2c ("mm/vma: fix incorrectly disallowed anonymous VMA
merges") introduced the ability to merge previously unavailable VMA merge
scenarios.

The key piece of logic introduced was the ability to merge a faulted VMA
immediately next to an unfaulted VMA, which relies upon dup_anon_vma() to
correctly handle anon_vma state.

In the case of the merge of an existing VMA (that is changing properties
of a VMA and then merging if those properties are shared by adjacent
VMAs), dup_anon_vma() is invoked correctly.

However in the case of the merge of a new VMA, a corner case peculiar to
mremap() was missed.

The issue is that vma_expand() only performs dup_anon_vma() if the target
(the VMA that will ultimately become the merged VMA): is not the next VMA,
i.e.  the one that appears after the range in which the new VMA is to be
established.

A key insight here is that in all other cases other than mremap(), a new
VMA merge either expands an existing VMA, meaning that the target VMA will
be that VMA, or would have anon_vma be NULL.

Specifically:

* __mmap_region() - no anon_vma in place, initial mapping.
* do_brk_flags() - expanding an existing VMA.
* vma_merge_extend() - expanding an existing VMA.
* relocate_vma_down() - no anon_vma in place, initial mapping.

In addition, we are in the unique situation of needing to duplicate
anon_vma state from a VMA that is neither the previous or next VMA being
merged with.

dup_anon_vma() deals exclusively with the target=unfaulted, src=faulted
case.  This leaves four possibilities, in each case where the copied VMA
is faulted:

1. Previous VMA unfaulted:

              copied -----|
                          v
	|-----------|.............|
	| unfaulted |(faulted VMA)|
	|-----------|.............|
	     prev

target = prev, expand prev to cover.

2. Next VMA unfaulted:

              copied -----|
                          v
	            |.............|-----------|
	            |(faulted VMA)| unfaulted |
                    |.............|-----------|
		                      next

target = next, expand next to cover.

3. Both adjacent VMAs unfaulted:

              copied -----|
                          v
	|-----------|.............|-----------|
	| unfaulted |(faulted VMA)| unfaulted |
	|-----------|.............|-----------|
	     prev                      next

target = prev, expand prev to cover.

4. prev unfaulted, next faulted:

              copied -----|
                          v
	|-----------|.............|-----------|
	| unfaulted |(faulted VMA)|  faulted  |
	|-----------|.............|-----------|
	     prev                      next

target = prev, expand prev to cover.  Essentially equivalent to 3, but
with additional requirement that next's anon_vma is the same as the copied
VMA's.  This is covered by the existing logic.

To account for this very explicitly, we introduce
vma_merge_copied_range(), which sets a newly introduced vmg->copied_from
field, then invokes vma_merge_new_range() which handles the rest of the
logic.

We then update the key vma_expand() function to clean up the logic and
make what's going on clearer, making the 'remove next' case less special,
before invoking dup_anon_vma() unconditionally should we be copying from a
VMA.

Note that in case 3, the if (remove_next) ...  branch will be a no-op, as
next=src in this instance and src is unfaulted.

In case 4, it won't be, but since in this instance next=src and it is
faulted, this will have required tgt=faulted, src=faulted to be
compatible, meaning that next->anon_vma == vmg->copied_from->anon_vma, and
thus a single dup_anon_vma() of next suffices to copy anon_vma state for
the copied-from VMA also.

If we are copying from a VMA in a successful merge we must _always_
propagate anon_vma state.

This issue can be observed most directly by invoked mremap() to move
around a VMA and cause this kind of merge with the MREMAP_DONTUNMAP flag
specified.

This will result in unlink_anon_vmas() being called after failing to
duplicate anon_vma state to the target VMA, which results in the anon_vma
itself being freed with folios still possessing dangling pointers to the
anon_vma and thus a use-after-free bug.

This bug was discovered via a syzbot report, which this patch resolves.

We further make a change to update the mergeable anon_vma check to assert
the copied-from anon_vma did not have CoW parents, as otherwise
dup_anon_vma() might incorrectly propagate CoW ancestors from the next VMA
in case 4 despite the anon_vma's being identical for both VMAs.

Link: https://lkml.kernel.org/r/cover.1767638272.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/b7930ad2b1503a657e29fe928eb33061d7eadf5b.1767638272.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Fixes: 879bca0a2c ("mm/vma: fix incorrectly disallowed anonymous VMA merges")
Reported-by: syzbot+b165fc2e11771c66d8ba@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/694a2745.050a0220.19928e.0017.GAE@google.com/
Reported-by: syzbot+5272541ccbbb14e2ec30@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/694e3dc6.050a0220.35954c.0066.GAE@google.com/
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Jeongjun Park <aha310510@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Yeoreum Yun <yeoreum.yun@arm.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-14 22:16:24 -08:00
Chunyan Zhang
277a1ae387 mm: softdirty: add pgtable_supports_soft_dirty()
Patch series "mm: Add soft-dirty and uffd-wp support for RISC-V", v15.

This patchset adds support for Svrsw60t59b [1] extension which is ratified
now, also add soft dirty and userfaultfd write protect tracking for
RISC-V.

The patches 1 and 2 add macros to allow architectures to define their own
checks if the soft-dirty / uffd_wp PTE bits are available, in other words
for RISC-V, the Svrsw60t59b extension is supported on which device the
kernel is running.  Also patch1-2 are removing "ifdef
CONFIG_MEM_SOFT_DIRTY" "ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP" and "ifdef
CONFIG_PTE_MARKER_UFFD_WP" in favor of checks which if not overridden by
the architecture, no change in behavior is expected.

This patchset has been tested with kselftest mm suite in which soft-dirty,
madv_populate, test_unmerge_uffd_wp, and uffd-unit-tests run and pass, and
no regressions are observed in any of the other tests.


This patch (of 6):

Some platforms can customize the PTE PMD entry soft-dirty bit making it
unavailable even if the architecture provides the resource.

Add an API which architectures can define their specific implementations
to detect if soft-dirty bit is available on which device the kernel is
running.

This patch is removing "ifdef CONFIG_MEM_SOFT_DIRTY" in favor of
pgtable_supports_soft_dirty() checks that defaults to
IS_ENABLED(CONFIG_MEM_SOFT_DIRTY), if not overridden by the architecture,
no change in behavior is expected.

We make sure to never set VM_SOFTDIRTY if !pgtable_supports_soft_dirty(),
so we will never run into VM_SOFTDIRTY checks.

[lorenzo.stoakes@oracle.com: fix VMA selftests]
  Link: https://lkml.kernel.org/r/dac6ddfe-773a-43d5-8f69-021b9ca4d24b@lucifer.local
Link: https://lkml.kernel.org/r/20251113072806.795029-1-zhangchunyan@iscas.ac.cn
Link: https://lkml.kernel.org/r/20251113072806.795029-2-zhangchunyan@iscas.ac.cn
Link: https://github.com/riscv-non-isa/riscv-iommu/pull/543 [1]
Signed-off-by: Chunyan Zhang <zhangchunyan@iscas.ac.cn>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Conor Dooley <conor@kernel.org>
Cc: Deepak Gupta <debug@rivosinc.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Andrew Jones <ajones@ventanamicro.com>
Cc: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-24 15:08:54 -08:00
Lorenzo Stoakes
64212ba02e mm: implement sticky VMA flags
It is useful to be able to designate that certain flags are 'sticky', that
is, if two VMAs are merged one with a flag of this nature and one without,
the merged VMA sets this flag.

As a result we ignore these flags for the purposes of determining VMA flag
differences between VMAs being considered for merge.

This patch therefore updates the VMA merge logic to perform this action,
with flags possessing this property being described in the VM_STICKY
bitmap.

Those flags which ought to be ignored for the purposes of VMA merge are
described in the VM_IGNORE_MERGE bitmap, which the VMA merge logic is also
updated to use.

As part of this change we place VM_SOFTDIRTY in VM_IGNORE_MERGE as it
already had this behaviour, alongside VM_STICKY as sticky flags by
implication must not disallow merge.

Ultimately it seems that we should make VM_SOFTDIRTY a sticky flag in its
own right, but this change is out of scope for this series.

The only sticky flag designated as such is VM_MAYBE_GUARD, so as a result
of this change, once the VMA flag is set upon guard region installation,
VMAs with guard ranges will now not have their merge behaviour impacted as
a result and can be freely merged with other VMAs without VM_MAYBE_GUARD
set.

Also update the comments for vma_modify_flags() to directly reference
sticky flags now we have established the concept.

We also update the VMA userland tests to account for the changes.

Link: https://lkml.kernel.org/r/22ad5269f7669d62afb42ce0c79bad70b994c58d.1763460113.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20 13:43:58 -08:00
Lorenzo Stoakes
9119d6c209 mm: update vma_modify_flags() to handle residual flags, document
The vma_modify_*() family of functions each either perform splits, a merge
or no changes at all in preparation for the requested modification to
occur.

When doing so for a VMA flags change, we currently don't account for any
flags which may remain (for instance, VM_SOFTDIRTY) despite the requested
change in the case that a merge succeeded.

This is made more important by subsequent patches which will introduce the
concept of sticky VMA flags which rely on this behaviour.

This patch fixes this by passing the VMA flags parameter as a pointer and
updating it accordingly on merge and updating callers to accommodate for
this.

Additionally, while we are here, we add kdocs for each of the
vma_modify_*() functions, as the fact that the requested modification is
not performed is confusing so it is useful to make this abundantly clear.

We also update the VMA userland tests to account for this change.

Link: https://lkml.kernel.org/r/23b5b549b0eaefb2922625626e58c2a352f3e93c.1763460113.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20 13:43:58 -08:00
Kefeng Wang
340b59816b mm: kill mm_wr_locked from unmap_vmas() and unmap_single_vma()
Kill mm_wr_locked since commit f8e97613fe ("mm: convert VM_PFNMAP
tracking to pfnmap_track() + pfnmap_untrack()") remove the user.

Link: https://lkml.kernel.org/r/20251104085709.2688433-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20 13:43:57 -08:00
Lorenzo Stoakes
ac0a3fc9c0 mm: add ability to take further action in vm_area_desc
Some drivers/filesystems need to perform additional tasks after the VMA is
set up.  This is typically in the form of pre-population.

The forms of pre-population most likely to be performed are a PFN remap
or the insertion of normal folios and PFNs into a mixed map.

We start by implementing the PFN remap functionality, ensuring that we
perform the appropriate actions at the appropriate time - that is setting
flags at the point of .mmap_prepare, and performing the actual remap at the
point at which the VMA is fully established.

This prevents the driver from doing anything too crazy with a VMA at any
stage, and we retain complete control over how the mm functionality is
applied.

Unfortunately callers still do often require some kind of custom action,
so we add an optional success/error _hook to allow the caller to do
something after the action has succeeded or failed.

This is done at the point when the VMA has already been established, so
the harm that can be done is limited.

The error hook can be used to filter errors if necessary.

There may be cases in which the caller absolutely must hold the file rmap
lock until the operation is entirely complete. It is an edge case, but
certainly the hugetlbfs mmap hook requires it.

To accommodate this, we add the hide_from_rmap_until_complete flag to the
mmap_action type. In this case, if a new VMA is allocated, we will hold the
file rmap lock until the operation is entirely completed (including any
success/error hooks).

Note that we do not need to update __compat_vma_mmap() to accommodate this
flag, as this function will be invoked from an .mmap handler whose VMA is
not yet visible, so we implicitly hide it from the rmap.

If any error arises on these final actions, we simply unmap the VMA
altogether.

Also update the stacked filesystem compatibility layer to utilise the
action behaviour, and update the VMA tests accordingly.

While we're here, rename __compat_vma_mmap_prepare() to __compat_vma_mmap()
as we are now performing actions invoked by the mmap_prepare in addition to
just the mmap_prepare hook.

Link: https://lkml.kernel.org/r/2601199a7b2eaeadfcd8ab6e199c6d1706650c94.1760959442.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Chatre, Reinette <reinette.chatre@intel.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Dave Martin <dave.martin@arm.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Morse <james.morse@arm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicolas Pitre <nico@fluxnic.net>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Robin Murohy <robin.murphy@arm.com>
Cc: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-16 17:28:12 -08:00
Lorenzo Stoakes
2bcd9207de mm/vma: rename __mmap_prepare() function to avoid confusion
Now we have the f_op->mmap_prepare() hook, having a static function called
__mmap_prepare() that has nothing to do with it is confusing, so rename
the function to __mmap_setup().

Link: https://lkml.kernel.org/r/d25a22c60ca0f04091697ef9cda0d72ce0cf8af3.1760959442.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Chatre, Reinette <reinette.chatre@intel.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Dave Martin <dave.martin@arm.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Morse <james.morse@arm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicolas Pitre <nico@fluxnic.net>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Robin Murohy <robin.murphy@arm.com>
Cc: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-16 17:28:11 -08:00
Lorenzo Stoakes
cf1d98f44d mm/vma: remove unused function, make internal functions static
unlink_file_vma() is not used by anything, so remove it.

vma_link() and vma_link_file() are only used within mm/vma.c, so make them
static.

Link: https://lkml.kernel.org/r/f2ab9ea051225a02e6d1d45a7608f4e149220117.1760959442.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Chatre, Reinette <reinette.chatre@intel.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Dave Martin <dave.martin@arm.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Morse <james.morse@arm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicolas Pitre <nico@fluxnic.net>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Robin Murohy <robin.murphy@arm.com>
Cc: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-16 17:28:10 -08:00
jianyun.gao
b6c46600bf mm: fix some typos in mm module
Below are some typos in the code comments:

  intevals ==> intervals
  addesses ==> addresses
  unavaliable ==> unavailable
  facor ==> factor
  droping ==> dropping
  exlusive ==> exclusive
  decription ==> description
  confict ==> conflict
  desriptions ==> descriptions
  otherwize ==> otherwise
  vlaue ==> value
  cheching ==> checking
  exisitng ==> existing
  modifed ==> modified
  differenciate ==> differentiate
  refernece ==> reference
  permissons ==> permissions
  indepdenent ==> independent
  spliting ==> splitting

Just fix it.

Link: https://lkml.kernel.org/r/20250929002608.1633825-1-jianyungao89@gmail.com
Signed-off-by: jianyun.gao <jianyungao89@gmail.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Acked-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-16 17:27:52 -08:00
Lorenzo Stoakes
af6703838e mm: specify separate file and vm_file params in vm_area_desc
Patch series "mm: do not assume file == vma->vm_file in
compat_vma_mmap_prepare()", v2.

As part of the efforts to eliminate the problematic f_op->mmap callback, a
new callback - f_op->mmap_prepare was provided.

While we are converting these callbacks, we must deal with 'stacked'
filesystems and drivers - those which in their own f_op->mmap callback
invoke an inner f_op->mmap callback.

To accomodate for this, a compatibility layer is provided that, via
vfs_mmap(), detects if f_op->mmap_prepare is provided and if so, generates
a vm_area_desc containing the VMA's metadata and invokes the call.

So far, we have provided desc->file equal to vma->vm_file.  However this
is not necessarily valid, especially in the case of stacked drivers which
wish to assign a new file after the inner hook is invoked.

To account for this, we adjust vm_area_desc to have both file and vm_file
fields.  The .vm_file field is strictly set to vma->vm_file (or in the
case of a new mapping, what will become vma->vm_file).

However, .file is set to whichever file vfs_mmap() is invoked with when
using the compatibilty layer.

Therefore, if the VMA's file needs to be updated in .mmap_prepare,
desc->vm_file should be assigned, whilst desc->file should be read.

No current f_op->mmap_prepare users assign desc->file so this is safe to
do.

This makes the .mmap_prepare callback in the context of a stacked
filesystem or driver completely consistent with the existing .mmap
implementations.

While we're here, we do a few small cleanups, and ensure that we const-ify
things correctly in the vm_area_desc struct to avoid hooks accidentally
trying to assign fields they should not.


This patch (of 2):

Stacked filesystems and drivers may invoke mmap hooks with a struct file
pointer that differs from the overlying file.  We will make this
functionality possible in a subsequent patch.

In order to prepare for this, let's update vm_area_struct to separately
provide desc->file and desc->vm_file parameters.

The desc->file parameter is the file that the hook is expected to operate
upon, and is not assignable (though the hok may wish to e.g.  update the
file's accessed time for instance).

The desc->vm_file defaults to what will become vma->vm_file and is what
the hook must reassign should it wish to change the VMA"s vma->vm_file.

For now we keep desc->file, vm_file the same to remain consistent.

No f_op->mmap_prepare() callback sets a new vma->vm_file currently, so
this is safe to change.

While we're here, make the mm_struct desc->mm pointers at immutable as
well as the desc->mm field itself.

As part of this change, also update the single hook which this would
otherwise break - mlock_future_ok(), invoked by secretmem_mmap_prepare()).

We additionally update set_vma_from_desc() to compare fields in a more
logical fashion, checking the (possibly) user-modified fields as the first
operand against the existing value as the second one.

Additionally, update VMA tests to accommodate changes.

Link: https://lkml.kernel.org/r/cover.1756920635.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/3fa15a861bb7419f033d22970598aa61850ea267.1756920635.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-22 20:17:11 -07:00
Linus Torvalds
da23ea194d Significant patch series in this pull request:
- The 4 patch series "mseal cleanups" from Lorenzo Stoakes erforms some
   mseal cleaning with no intended functional change.
 
 - The 3 patch series "Optimizations for khugepaged" from David
   Hildenbrand improves khugepaged throughput by batching PTE operations
   for large folios.  This gain is mainly for arm64.
 
 - The 8 patch series "x86: enable EXECMEM_ROX_CACHE for ftrace and
   kprobes" from Mike Rapoport provides a bugfix, additional debug code and
   cleanups to the execmem code.
 
 - The 7 patch series "mm/shmem, swap: bugfix and improvement of mTHP
   swap in" from Kairui Song provides bugfixes, cleanups and performance
   improvememnts to the mTHP swapin code.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaI+6HQAKCRDdBJ7gKXxA
 jv7lAQCAKE5dUhdZ0pOYbhBKTlDapQh2KqHrlV3QFcxXgknEoQD/c3gG01rY3fLh
 Cnf5l9+cdyfKxFniO48sUPx6IpriRg8=
 =HT5/
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-08-03-12-35' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull more MM updates from Andrew Morton:
 "Significant patch series in this pull request:

   - "mseal cleanups" (Lorenzo Stoakes)

     Some mseal cleaning with no intended functional change.

   - "Optimizations for khugepaged" (David Hildenbrand)

     Improve khugepaged throughput by batching PTE operations for large
     folios. This gain is mainly for arm64.

   - "x86: enable EXECMEM_ROX_CACHE for ftrace and kprobes" (Mike Rapoport)

     A bugfix, additional debug code and cleanups to the execmem code.

   - "mm/shmem, swap: bugfix and improvement of mTHP swap in" (Kairui Song)

     Bugfixes, cleanups and performance improvememnts to the mTHP swapin
     code"

* tag 'mm-stable-2025-08-03-12-35' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (38 commits)
  mm: mempool: fix crash in mempool_free() for zero-minimum pools
  mm: correct type for vmalloc vm_flags fields
  mm/shmem, swap: fix major fault counting
  mm/shmem, swap: rework swap entry and index calculation for large swapin
  mm/shmem, swap: simplify swapin path and result handling
  mm/shmem, swap: never use swap cache and readahead for SWP_SYNCHRONOUS_IO
  mm/shmem, swap: tidy up swap entry splitting
  mm/shmem, swap: tidy up THP swapin checks
  mm/shmem, swap: avoid redundant Xarray lookup during swapin
  x86/ftrace: enable EXECMEM_ROX_CACHE for ftrace allocations
  x86/kprobes: enable EXECMEM_ROX_CACHE for kprobes allocations
  execmem: drop writable parameter from execmem_fill_trapping_insns()
  execmem: add fallback for failures in vmalloc(VM_ALLOW_HUGE_VMAP)
  execmem: move execmem_force_rw() and execmem_restore_rox() before use
  execmem: rework execmem_cache_free()
  execmem: introduce execmem_alloc_rw()
  execmem: drop unused execmem_update_copy()
  mm: fix a UAF when vma->mm is freed after vma->vm_refcnt got dropped
  mm/rmap: add anon_vma lifetime debug check
  mm: remove mm/io-mapping.c
  ...
2025-08-05 16:02:07 +03:00
Lorenzo Stoakes
8b2914162a mm/mseal: small cleanups
Drop the wholly unnecessary set_vma_sealed() helper(), which is used only
once, and place VMA_ITERATOR() declarations in the correct place.

Retain vma_is_sealed(), and use it instead of the confusingly named
can_modify_vma(), so it's abundantly clear what's being tested, rather
then a nebulous sense of 'can the VMA be modified'.

No functional change intended.

Link: https://lkml.kernel.org/r/98cf28d04583d632a6eb698e9ad23733bb6af26b.1753431105.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Jeff Xu <jeffxu@chromium.org>
Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02 12:06:09 -07:00
Linus Torvalds
beace86e61 Summary of significant series in this pull request:
- The 4 patch series "mm: ksm: prevent KSM from breaking merging of new
   VMAs" from Lorenzo Stoakes addresses an issue with KSM's
   PR_SET_MEMORY_MERGE mode: newly mapped VMAs were not eligible for
   merging with existing adjacent VMAs.
 
 - The 4 patch series "mm/damon: introduce DAMON_STAT for simple and
   practical access monitoring" from SeongJae Park adds a new kernel module
   which simplifies the setup and usage of DAMON in production
   environments.
 
 - The 6 patch series "stop passing a writeback_control to swap/shmem
   writeout" from Christoph Hellwig is a cleanup to the writeback code
   which removes a couple of pointers from struct writeback_control.
 
 - The 7 patch series "drivers/base/node.c: optimization and cleanups"
   from Donet Tom contains largely uncorrelated cleanups to the NUMA node
   setup and management code.
 
 - The 4 patch series "mm: userfaultfd: assorted fixes and cleanups" from
   Tal Zussman does some maintenance work on the userfaultfd code.
 
 - The 5 patch series "Readahead tweaks for larger folios" from Ryan
   Roberts implements some tuneups for pagecache readahead when it is
   reading into order>0 folios.
 
 - The 4 patch series "selftests/mm: Tweaks to the cow test" from Mark
   Brown provides some cleanups and consistency improvements to the
   selftests code.
 
 - The 4 patch series "Optimize mremap() for large folios" from Dev Jain
   does that.  A 37% reduction in execution time was measured in a
   memset+mremap+munmap microbenchmark.
 
 - The 5 patch series "Remove zero_user()" from Matthew Wilcox expunges
   zero_user() in favor of the more modern memzero_page().
 
 - The 3 patch series "mm/huge_memory: vmf_insert_folio_*() and
   vmf_insert_pfn_pud() fixes" from David Hildenbrand addresses some warts
   which David noticed in the huge page code.  These were not known to be
   causing any issues at this time.
 
 - The 3 patch series "mm/damon: use alloc_migrate_target() for
   DAMOS_MIGRATE_{HOT,COLD" from SeongJae Park provides some cleanup and
   consolidation work in DAMON.
 
 - The 3 patch series "use vm_flags_t consistently" from Lorenzo Stoakes
   uses vm_flags_t in places where we were inappropriately using other
   types.
 
 - The 3 patch series "mm/memfd: Reserve hugetlb folios before
   allocation" from Vivek Kasireddy increases the reliability of large page
   allocation in the memfd code.
 
 - The 14 patch series "mm: Remove pXX_devmap page table bit and pfn_t
   type" from Alistair Popple removes several now-unneeded PFN_* flags.
 
 - The 5 patch series "mm/damon: decouple sysfs from core" from SeongJae
   Park implememnts some cleanup and maintainability work in the DAMON
   sysfs layer.
 
 - The 5 patch series "madvise cleanup" from Lorenzo Stoakes does quite a
   lot of cleanup/maintenance work in the madvise() code.
 
 - The 4 patch series "madvise anon_name cleanups" from Vlastimil Babka
   provides additional cleanups on top or Lorenzo's effort.
 
 - The 11 patch series "Implement numa node notifier" from Oscar Salvador
   creates a standalone notifier for NUMA node memory state changes.
   Previously these were lumped under the more general memory on/offline
   notifier.
 
 - The 6 patch series "Make MIGRATE_ISOLATE a standalone bit" from Zi Yan
   cleans up the pageblock isolation code and fixes a potential issue which
   doesn't seem to cause any problems in practice.
 
 - The 5 patch series "selftests/damon: add python and drgn based DAMON
   sysfs functionality tests" from SeongJae Park adds additional drgn- and
   python-based DAMON selftests which are more comprehensive than the
   existing selftest suite.
 
 - The 5 patch series "Misc rework on hugetlb faulting path" from Oscar
   Salvador fixes a rather obscure deadlock in the hugetlb fault code and
   follows that fix with a series of cleanups.
 
 - The 3 patch series "cma: factor out allocation logic from
   __cma_declare_contiguous_nid" from Mike Rapoport rationalizes and cleans
   up the highmem-specific code in the CMA allocator.
 
 - The 28 patch series "mm/migration: rework movable_ops page migration
   (part 1)" from David Hildenbrand provides cleanups and
   future-preparedness to the migration code.
 
 - The 2 patch series "mm/damon: add trace events for auto-tuned
   monitoring intervals and DAMOS quota" from SeongJae Park adds some
   tracepoints to some DAMON auto-tuning code.
 
 - The 6 patch series "mm/damon: fix misc bugs in DAMON modules" from
   SeongJae Park does that.
 
 - The 6 patch series "mm/damon: misc cleanups" from SeongJae Park also
   does what it claims.
 
 - The 4 patch series "mm: folio_pte_batch() improvements" from David
   Hildenbrand cleans up the large folio PTE batching code.
 
 - The 13 patch series "mm/damon/vaddr: Allow interleaving in
   migrate_{hot,cold} actions" from SeongJae Park facilitates dynamic
   alteration of DAMON's inter-node allocation policy.
 
 - The 3 patch series "Remove unmap_and_put_page()" from Vishal Moola
   provides a couple of page->folio conversions.
 
 - The 4 patch series "mm: per-node proactive reclaim" from Davidlohr
   Bueso implements a per-node control of proactive reclaim - beyond the
   current memcg-based implementation.
 
 - The 14 patch series "mm/damon: remove damon_callback" from SeongJae
   Park replaces the damon_callback interface with a more general and
   powerful damon_call()+damos_walk() interface.
 
 - The 10 patch series "mm/mremap: permit mremap() move of multiple VMAs"
   from Lorenzo Stoakes implements a number of mremap cleanups (of course)
   in preparation for adding new mremap() functionality: newly permit the
   remapping of multiple VMAs when the user is specifying MREMAP_FIXED.  It
   still excludes some specialized situations where this cannot be
   performed reliably.
 
 - The 3 patch series "drop hugetlb_free_pgd_range()" from Anthony Yznaga
   switches some sparc hugetlb code over to the generic version and removes
   the thus-unneeded hugetlb_free_pgd_range().
 
 - The 4 patch series "mm/damon/sysfs: support periodic and automated
   stats update" from SeongJae Park augments the present
   userspace-requested update of DAMON sysfs monitoring files.  Automatic
   update is now provided, along with a tunable to control the update
   interval.
 
 - The 4 patch series "Some randome fixes and cleanups to swapfile" from
   Kemeng Shi does what is claims.
 
 - The 4 patch series "mm: introduce snapshot_page" from Luiz Capitulino
   and David Hildenbrand provides (and uses) a means by which debug-style
   functions can grab a copy of a pageframe and inspect it locklessly
   without tripping over the races inherent in operating on the live
   pageframe directly.
 
 - The 6 patch series "use per-vma locks for /proc/pid/maps reads" from
   Suren Baghdasaryan addresses the large contention issues which can be
   triggered by reads from that procfs file.  Latencies are reduced by more
   than half in some situations.  The series also introduces several new
   selftests for the /proc/pid/maps interface.
 
 - The 6 patch series "__folio_split() clean up" from Zi Yan cleans up
   __folio_split()!
 
 - The 7 patch series "Optimize mprotect() for large folios" from Dev
   Jain provides some quite large (>3x) speedups to mprotect() when dealing
   with large folios.
 
 - The 2 patch series "selftests/mm: reuse FORCE_READ to replace "asm
   volatile("" : "+r" (XXX));" and some cleanup" from wang lian does some
   cleanup work in the selftests code.
 
 - The 3 patch series "tools/testing: expand mremap testing" from Lorenzo
   Stoakes extends the mremap() selftest in several ways, including adding
   more checking of Lorenzo's recently added "permit mremap() move of
   multiple VMAs" feature.
 
 - The 22 patch series "selftests/damon/sysfs.py: test all parameters"
   from SeongJae Park extends the DAMON sysfs interface selftest so that it
   tests all possible user-requested parameters.  Rather than the present
   minimal subset.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaIqcCgAKCRDdBJ7gKXxA
 jkVBAQCCn9DR1QP0CRk961ot0cKzOgioSc0aA03DPb2KXRt2kQEAzDAz0ARurFhL
 8BzbvI0c+4tntHLXvIlrC33n9KWAOQM=
 =XsFy
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "As usual, many cleanups. The below blurbiage describes 42 patchsets.
  21 of those are partially or fully cleanup work. "cleans up",
  "cleanup", "maintainability", "rationalizes", etc.

  I never knew the MM code was so dirty.

  "mm: ksm: prevent KSM from breaking merging of new VMAs" (Lorenzo Stoakes)
     addresses an issue with KSM's PR_SET_MEMORY_MERGE mode: newly
     mapped VMAs were not eligible for merging with existing adjacent
     VMAs.

  "mm/damon: introduce DAMON_STAT for simple and practical access monitoring" (SeongJae Park)
     adds a new kernel module which simplifies the setup and usage of
     DAMON in production environments.

  "stop passing a writeback_control to swap/shmem writeout" (Christoph Hellwig)
     is a cleanup to the writeback code which removes a couple of
     pointers from struct writeback_control.

  "drivers/base/node.c: optimization and cleanups" (Donet Tom)
     contains largely uncorrelated cleanups to the NUMA node setup and
     management code.

  "mm: userfaultfd: assorted fixes and cleanups" (Tal Zussman)
     does some maintenance work on the userfaultfd code.

  "Readahead tweaks for larger folios" (Ryan Roberts)
     implements some tuneups for pagecache readahead when it is reading
     into order>0 folios.

  "selftests/mm: Tweaks to the cow test" (Mark Brown)
     provides some cleanups and consistency improvements to the
     selftests code.

  "Optimize mremap() for large folios" (Dev Jain)
     does that. A 37% reduction in execution time was measured in a
     memset+mremap+munmap microbenchmark.

  "Remove zero_user()" (Matthew Wilcox)
     expunges zero_user() in favor of the more modern memzero_page().

  "mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes" (David Hildenbrand)
     addresses some warts which David noticed in the huge page code.
     These were not known to be causing any issues at this time.

  "mm/damon: use alloc_migrate_target() for DAMOS_MIGRATE_{HOT,COLD" (SeongJae Park)
     provides some cleanup and consolidation work in DAMON.

  "use vm_flags_t consistently" (Lorenzo Stoakes)
     uses vm_flags_t in places where we were inappropriately using other
     types.

  "mm/memfd: Reserve hugetlb folios before allocation" (Vivek Kasireddy)
     increases the reliability of large page allocation in the memfd
     code.

  "mm: Remove pXX_devmap page table bit and pfn_t type" (Alistair Popple)
     removes several now-unneeded PFN_* flags.

  "mm/damon: decouple sysfs from core" (SeongJae Park)
     implememnts some cleanup and maintainability work in the DAMON
     sysfs layer.

  "madvise cleanup" (Lorenzo Stoakes)
     does quite a lot of cleanup/maintenance work in the madvise() code.

  "madvise anon_name cleanups" (Vlastimil Babka)
     provides additional cleanups on top or Lorenzo's effort.

  "Implement numa node notifier" (Oscar Salvador)
     creates a standalone notifier for NUMA node memory state changes.
     Previously these were lumped under the more general memory
     on/offline notifier.

  "Make MIGRATE_ISOLATE a standalone bit" (Zi Yan)
     cleans up the pageblock isolation code and fixes a potential issue
     which doesn't seem to cause any problems in practice.

  "selftests/damon: add python and drgn based DAMON sysfs functionality tests" (SeongJae Park)
     adds additional drgn- and python-based DAMON selftests which are
     more comprehensive than the existing selftest suite.

  "Misc rework on hugetlb faulting path" (Oscar Salvador)
     fixes a rather obscure deadlock in the hugetlb fault code and
     follows that fix with a series of cleanups.

  "cma: factor out allocation logic from __cma_declare_contiguous_nid" (Mike Rapoport)
     rationalizes and cleans up the highmem-specific code in the CMA
     allocator.

  "mm/migration: rework movable_ops page migration (part 1)" (David Hildenbrand)
     provides cleanups and future-preparedness to the migration code.

  "mm/damon: add trace events for auto-tuned monitoring intervals and DAMOS quota" (SeongJae Park)
     adds some tracepoints to some DAMON auto-tuning code.

  "mm/damon: fix misc bugs in DAMON modules" (SeongJae Park)
     does that.

  "mm/damon: misc cleanups" (SeongJae Park)
     also does what it claims.

  "mm: folio_pte_batch() improvements" (David Hildenbrand)
     cleans up the large folio PTE batching code.

  "mm/damon/vaddr: Allow interleaving in migrate_{hot,cold} actions" (SeongJae Park)
     facilitates dynamic alteration of DAMON's inter-node allocation
     policy.

  "Remove unmap_and_put_page()" (Vishal Moola)
     provides a couple of page->folio conversions.

  "mm: per-node proactive reclaim" (Davidlohr Bueso)
     implements a per-node control of proactive reclaim - beyond the
     current memcg-based implementation.

  "mm/damon: remove damon_callback" (SeongJae Park)
     replaces the damon_callback interface with a more general and
     powerful damon_call()+damos_walk() interface.

  "mm/mremap: permit mremap() move of multiple VMAs" (Lorenzo Stoakes)
     implements a number of mremap cleanups (of course) in preparation
     for adding new mremap() functionality: newly permit the remapping
     of multiple VMAs when the user is specifying MREMAP_FIXED. It still
     excludes some specialized situations where this cannot be performed
     reliably.

  "drop hugetlb_free_pgd_range()" (Anthony Yznaga)
     switches some sparc hugetlb code over to the generic version and
     removes the thus-unneeded hugetlb_free_pgd_range().

  "mm/damon/sysfs: support periodic and automated stats update" (SeongJae Park)
     augments the present userspace-requested update of DAMON sysfs
     monitoring files. Automatic update is now provided, along with a
     tunable to control the update interval.

  "Some randome fixes and cleanups to swapfile" (Kemeng Shi)
     does what is claims.

  "mm: introduce snapshot_page" (Luiz Capitulino and David Hildenbrand)
     provides (and uses) a means by which debug-style functions can grab
     a copy of a pageframe and inspect it locklessly without tripping
     over the races inherent in operating on the live pageframe
     directly.

  "use per-vma locks for /proc/pid/maps reads" (Suren Baghdasaryan)
     addresses the large contention issues which can be triggered by
     reads from that procfs file. Latencies are reduced by more than
     half in some situations. The series also introduces several new
     selftests for the /proc/pid/maps interface.

  "__folio_split() clean up" (Zi Yan)
     cleans up __folio_split()!

  "Optimize mprotect() for large folios" (Dev Jain)
     provides some quite large (>3x) speedups to mprotect() when dealing
     with large folios.

  "selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));" and some cleanup" (wang lian)
     does some cleanup work in the selftests code.

  "tools/testing: expand mremap testing" (Lorenzo Stoakes)
     extends the mremap() selftest in several ways, including adding
     more checking of Lorenzo's recently added "permit mremap() move of
     multiple VMAs" feature.

  "selftests/damon/sysfs.py: test all parameters" (SeongJae Park)
     extends the DAMON sysfs interface selftest so that it tests all
     possible user-requested parameters. Rather than the present minimal
     subset"

* tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (370 commits)
  MAINTAINERS: add missing headers to mempory policy & migration section
  MAINTAINERS: add missing file to cgroup section
  MAINTAINERS: add MM MISC section, add missing files to MISC and CORE
  MAINTAINERS: add missing zsmalloc file
  MAINTAINERS: add missing files to page alloc section
  MAINTAINERS: add missing shrinker files
  MAINTAINERS: move memremap.[ch] to hotplug section
  MAINTAINERS: add missing mm_slot.h file THP section
  MAINTAINERS: add missing interval_tree.c to memory mapping section
  MAINTAINERS: add missing percpu-internal.h file to per-cpu section
  mm/page_alloc: remove trace_mm_alloc_contig_migrate_range_info()
  selftests/damon: introduce _common.sh to host shared function
  selftests/damon/sysfs.py: test runtime reduction of DAMON parameters
  selftests/damon/sysfs.py: test non-default parameters runtime commit
  selftests/damon/sysfs.py: generalize DAMON context commit assertion
  selftests/damon/sysfs.py: generalize monitoring attributes commit assertion
  selftests/damon/sysfs.py: generalize DAMOS schemes commit assertion
  selftests/damon/sysfs.py: test DAMOS filters commitment
  selftests/damon/sysfs.py: generalize DAMOS scheme commit assertion
  selftests/damon/sysfs.py: test DAMOS destinations commitment
  ...
2025-07-31 14:57:54 -07:00
Lorenzo Stoakes
cfea89210a mm/vma: refactor vma_modify_flags_name() to vma_modify_name()
The single instance in which we use this function doesn't actually need to
change VMA flags, so remove this parameter and update the caller
accordingly.

[lorenzo.stoakes@oracle.com: correct comment]
  Link: https://lkml.kernel.org/r/77f45b2e-a748-4635-9381-a5051091087f@lucifer.local
Link: https://lkml.kernel.org/r/20250714135839.178032-1-lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24 19:12:28 -07:00
Lorenzo Stoakes
bfbe71109f mm: update core kernel code to use vm_flags_t consistently
The core kernel code is currently very inconsistent in its use of
vm_flags_t vs.  unsigned long.  This prevents us from changing the type of
vm_flags_t in the future and is simply not correct, so correct this.

While this results in rather a lot of churn, it is a critical
pre-requisite for a future planned change to VMA flag type.

Additionally, update VMA userland tests to account for the changes.

To make review easier and to break things into smaller parts, driver and
architecture-specific changes is left for a subsequent commit.

The code has been adjusted to cascade the changes across all calling code
as far as is needed.

We will adjust architecture-specific and driver code in a subsequent patch.

Overall, this patch does not introduce any functional change.

Link: https://lkml.kernel.org/r/d1588e7bb96d1ea3fe7b9df2c699d5b4592d901d.1750274467.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Kees Cook <kees@kernel.org>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09 22:42:13 -07:00
Lorenzo Stoakes
4535cb331c mm/vma: use vmg->target to specify target VMA for new VMA merge
In commit 3a75ccba04 ("mm: simplify vma merge structure and expand
comments") we introduced the vmg->target field to make the merging of
existing VMAs simpler - clarifying precisely which VMA would eventually
become the merged VMA once the merge operation was complete.

New VMA merging did not get quite the same treatment, retaining the rather
confusing convention of storing the target VMA in vmg->middle.

This patch corrects this state of affairs, utilising vmg->target for this
purpose for both vma_merge_new_range() and also for vma_expand().

We retain the WARN_ON for vmg->middle being specified in
vma_merge_new_range() as doing so would make no sense, but add an
additional debug assert for setting vmg->target.

This patch additionally updates VMA userland testing to account for this
change.

[lorenzo.stoakes@oracle.com: make comment consistent in vma_expand()]
  Link: https://lkml.kernel.org/r/c54f45e3-a6ac-4749-93c0-cc9e3080ee37@lucifer.local
Link: https://lkml.kernel.org/r/20250613184807.108089-1-lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09 22:42:11 -07:00
Lorenzo Stoakes
cf7e7a3503 mm: prevent KSM from breaking VMA merging for new VMAs
If a user wishes to enable KSM mergeability for an entire process and all
fork/exec'd processes that come after it, they use the prctl()
PR_SET_MEMORY_MERGE operation.

This defaults all newly mapped VMAs to have the VM_MERGEABLE VMA flag set
(in order to indicate they are KSM mergeable), as well as setting this
flag for all existing VMAs and propagating this across fork/exec.

However it also breaks VMA merging for new VMAs, both in the process and
all forked (and fork/exec'd) child processes.

This is because when a new mapping is proposed, the flags specified will
never have VM_MERGEABLE set.  However all adjacent VMAs will already have
VM_MERGEABLE set, rendering VMAs unmergeable by default.

To work around this, we try to set the VM_MERGEABLE flag prior to
attempting a merge.  In the case of brk() this can always be done.

However on mmap() things are more complicated - while KSM is not supported
for MAP_SHARED file-backed mappings, it is supported for MAP_PRIVATE
file-backed mappings.

These mappings may have deprecated .mmap() callbacks specified which
could, in theory, adjust flags and thus KSM eligibility.

So we check to determine whether this is possible.  If not, we set
VM_MERGEABLE prior to the merge attempt on mmap(), otherwise we retain the
previous behaviour.

This fixes VMA merging for all new anonymous mappings, which covers the
majority of real-world cases, so we should see a significant improvement
in VMA mergeability.

For MAP_PRIVATE file-backed mappings, those which implement the
.mmap_prepare() hook and shmem are both known to be safe, so we allow
these, disallowing all other cases.

Also add stubs for newly introduced function invocations to VMA userland
testing.

[lorenzo.stoakes@oracle.com: correctly invoke late KSM check after mmap hook]
  Link: https://lkml.kernel.org/r/5861f8f6-cf5a-4d82-a062-139fb3f9cddb@lucifer.local
Link: https://lkml.kernel.org/r/3ba660af716d87a18ca5b4e635f2101edeb56340.1748537921.git.lorenzo.stoakes@oracle.com
Fixes: d7597f59d1 ("mm: add new api to enable ksm per process") # please no backport!
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Xu Xin <xu.xin16@zte.com.cn>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Stefan Roesch <shr@devkernel.io>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09 22:41:54 -07:00
Lorenzo Stoakes
20ca475d98
mm: rename call_mmap/mmap_prepare to vfs_mmap/mmap_prepare
The call_mmap() function violates the existing convention in
include/linux/fs.h whereby invocations of virtual file system hooks is
performed by functions prefixed with vfs_xxx().

Correct this by renaming call_mmap() to vfs_mmap(). This also avoids
confusion as to the fact that f_op->mmap_prepare may be invoked here.

Also rename __call_mmap_prepare() function to vfs_mmap_prepare() and adjust
to accept a file parameter, this is useful later for nested file systems.

Finally, fix up the VMA userland tests and ensure the mmap_prepare -> mmap
shim is implemented there.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/8d389f4994fa736aa8f9172bef8533c10a9e9011.1750099179.git.lorenzo.stoakes@oracle.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-17 13:35:23 +02:00
Lorenzo Stoakes
bb666b7c27 mm: add mmap_prepare() compatibility layer for nested file systems
Nested file systems, that is those which invoke call_mmap() within their
own f_op->mmap() handlers, may encounter underlying file systems which
provide the f_op->mmap_prepare() hook introduced by commit c84bf6dd2b
("mm: introduce new .mmap_prepare() file callback").

We have a chicken-and-egg scenario here - until all file systems are
converted to using .mmap_prepare(), we cannot convert these nested
handlers, as we can't call f_op->mmap from an .mmap_prepare() hook.

So we have to do it the other way round - invoke the .mmap_prepare() hook
from an .mmap() one.

in order to do so, we need to convert VMA state into a struct vm_area_desc
descriptor, invoking the underlying file system's f_op->mmap_prepare()
callback passing a pointer to this, and then setting VMA state accordingly
and safely.

This patch achieves this via the compat_vma_mmap_prepare() function, which
we invoke from call_mmap() if f_op->mmap_prepare() is specified in the
passed in file pointer.

We place the fundamental logic into mm/vma.h where VMA manipulation
belongs.  We also update the VMA userland tests to accommodate the
changes.

The compat_vma_mmap_prepare() function and its associated machinery is
temporary, and will be removed once the conversion of file systems is
complete.

We carefully place this code so it can be used with CONFIG_MMU and also
with cutting edge nommu silicon.

[akpm@linux-foundation.org: export compat_vma_mmap_prepare tp fix build]
[lorenzo.stoakes@oracle.com: remove unused declarations]
  Link: https://lkml.kernel.org/r/ac3ae324-4c65-432a-8c6d-2af988b18ac8@lucifer.local
Link: https://lkml.kernel.org/r/20250609165749.344976-1-lorenzo.stoakes@oracle.com
Fixes: c84bf6dd2b ("mm: introduce new .mmap_prepare() file callback").
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reported-by: Jann Horn <jannh@google.com>
Closes: https://lore.kernel.org/linux-mm/CAG48ez04yOEVx1ekzOChARDDBZzAKwet8PEoPM4Ln3_rk91AzQ@mail.gmail.com/
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-12 21:39:02 -07:00