mirror of
https://github.com/torvalds/linux.git
synced 2026-05-12 16:18:45 +02:00
Patch series "mm: remove is_swap_[pte, pmd]() + non-swap entries,
introduce leaf entries", v3.
There's an established convention in the kernel that we treat leaf page
tables (so far at the PTE, PMD level) as containing 'swap entries' should
they be neither empty (i.e. p**_none() evaluating true) nor present (i.e.
p**_present() evaluating true).
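The convention can be pictured with a small userspace model. This is only a sketch: toy_pte_t, TOY_PTE_PRESENT and the toy_* helpers are hypothetical stand-ins invented for illustration, not kernel definitions, and real architectures encode presence differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: a 64-bit word stands in for a leaf page table entry. */
typedef uint64_t toy_pte_t;

/* Hypothetical "present" bit; real architectures differ. */
#define TOY_PTE_PRESENT UINT64_C(0x1)

enum toy_pte_kind { TOY_NONE, TOY_PRESENT, TOY_SWAP_ENTRY };

static bool toy_pte_none(toy_pte_t pte)
{
	return pte == 0;
}

static bool toy_pte_present(toy_pte_t pte)
{
	return (pte & TOY_PTE_PRESENT) != 0;
}

/* Neither none nor present: by convention, treated as a 'swap entry'. */
static enum toy_pte_kind toy_classify(toy_pte_t pte)
{
	if (toy_pte_none(pte))
		return TOY_NONE;
	if (toy_pte_present(pte))
		return TOY_PRESENT;
	return TOY_SWAP_ENTRY;
}
```

Note that no separate "is this a swap entry?" predicate appears in the model: the third case falls out of the first two checks.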
However, at the same time we also have helper predicates - is_swap_pte(),
is_swap_pmd() - which are inconsistently used.
This is problematic, as it is logical to assume that should somebody wish
to operate upon a page table swap entry they should first check to see if
it is in fact one.
It also implies that perhaps, in future, we might introduce a non-present,
non-none page table entry that is not a swap entry.
This series resolves this issue by systematically eliminating all use of
the is_swap_pte() and is_swap_pmd() predicates so we retain only the
convention that should a leaf page table entry be neither none nor present
it is a swap entry.
We also have the further issue that 'swap entry' is unfortunately a really
rather overloaded term and in fact refers to both entries for swap and for
other information such as migration entries, page table markers, and
device private entries.
We therefore have the rather 'unique' concept of a 'non-swap' swap entry.
This series therefore introduces the concept of 'software leaf entries',
of type softleaf_t, to eliminate this confusion.
A software leaf entry in this sense is any page table entry which is
non-present, and represented by the softleaf_t type. That is - page table
leaf entries which are software-controlled by the kernel.
This includes 'none' or empty entries, which are simply represented by a
zero leaf entry value.
In order to maintain compatibility as we transition the kernel to this new
type, we simply typedef swp_entry_t to softleaf_t.
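As a rough illustration of the idea, the sketch below models a software leaf entry in userspace. Everything prefixed toy_ or TOY_ is hypothetical, including the bit layout; only the ideas that a zero value means 'none', that one type covers swap and 'non-swap' entries alike, and that the old type name is kept as a typedef come from the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a software leaf entry; not the kernel's softleaf_t. */
typedef struct { uint64_t val; } toy_softleaf_t;

/* Compatibility alias, mirroring the typedef described above. */
typedef toy_softleaf_t toy_swp_entry_t;

/* Hypothetical entry types; a zero value is the 'none' entry. */
enum toy_leaf_type {
	TOY_LEAF_NONE = 0,
	TOY_LEAF_SWAP,
	TOY_LEAF_MIGRATION,
	TOY_LEAF_MARKER,
	TOY_LEAF_DEVICE_PRIVATE,
};

/* Hypothetical layout: type in the top 6 bits, offset in the rest. */
#define TOY_TYPE_SHIFT	58
#define TOY_OFFSET_MASK	((UINT64_C(1) << TOY_TYPE_SHIFT) - 1)

static toy_softleaf_t toy_leaf_make(enum toy_leaf_type type, uint64_t offset)
{
	toy_softleaf_t leaf = {
		.val = ((uint64_t)type << TOY_TYPE_SHIFT) |
		       (offset & TOY_OFFSET_MASK),
	};

	return leaf;
}

static enum toy_leaf_type toy_leaf_type_of(toy_softleaf_t leaf)
{
	return (enum toy_leaf_type)(leaf.val >> TOY_TYPE_SHIFT);
}

static uint64_t toy_leaf_offset(toy_softleaf_t leaf)
{
	return leaf.val & TOY_OFFSET_MASK;
}

static bool toy_leaf_is_none(toy_softleaf_t leaf)
{
	return leaf.val == 0;
}

/* A real swap entry, as opposed to a 'non-swap' swap entry. */
static bool toy_leaf_is_swap(toy_softleaf_t leaf)
{
	return toy_leaf_type_of(leaf) == TOY_LEAF_SWAP;
}
```

Because the alias and the new type are the same struct, code written against either name compiles unchanged during a transition.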
We introduce a number of predicates and helpers to interact with software
leaf entries in include/linux/leafops.h which, as it imports swapops.h,
can be treated as a drop-in replacement for swapops.h wherever leaf entry
helpers are used.
Since softleaf_from_[pte, pmd]() treats present entries as if they were
empty/none leaf entries, this allows for a great deal of simplification of
code throughout the code base, which this series utilises a great deal.
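A hedged sketch of why this simplifies callers: if the conversion maps present entries to the none value, a caller only has to ask one question of the result. The toy_* names and the present bit below are hypothetical and only model the behaviour described; they are not the helpers in leafops.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins; not kernel types. */
typedef uint64_t toy_pte_t;
typedef struct { uint64_t val; } toy_softleaf_t;

/* Hypothetical "present" bit. */
#define TOY_PTE_PRESENT UINT64_C(0x1)

/*
 * Model of the conversion: present entries (and empty ones) come back
 * as the zero/none leaf entry; anything else is passed through.
 */
static toy_softleaf_t toy_softleaf_from_pte(toy_pte_t pte)
{
	toy_softleaf_t leaf = { 0 };

	if (!(pte & TOY_PTE_PRESENT))
		leaf.val = pte;
	return leaf;
}

static bool toy_softleaf_is_none(toy_softleaf_t leaf)
{
	return leaf.val == 0;
}

/* Caller: a single check replaces the none/present/swap triage. */
static bool toy_has_software_entry(toy_pte_t pte)
{
	return !toy_softleaf_is_none(toy_softleaf_from_pte(pte));
}
```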
We additionally change from swap entry to software leaf entry handling
where it makes sense to and eliminate functions from swapops.h where
software leaf entries obviate the need for the functions.
This patch (of 16):
PTE markers were previously only concerned with UFFD-specific logic - that
is, PTE entries with the UFFD WP marker set or those marked via
UFFDIO_POISON.
However since the introduction of guard markers in commit 7c53dfbdb0
("mm: add PTE_MARKER_GUARD PTE marker"), this has no longer been the case.
Issues have been avoided as guard regions are not permitted in conjunction
with UFFD, but it still leaves very confusing logic in place, most notably
the misleading and poorly named pte_none_mostly() and
huge_pte_none_mostly().
These predicates return true for PTE entries that ought to be treated as
none, but only in certain circumstances, and on the assumption we are
dealing with H/W poison markers or UFFD WP markers.
This patch removes these functions and makes each invocation of these
functions instead explicitly check what it needs to check.
As part of this effort it introduces is_uffd_pte_marker() to explicitly
determine if a marker in fact is used as part of UFFD or not.
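The shape of such an explicit check might look like the userspace sketch below. The TOY_MARKER_* bits and toy_is_uffd_marker() are hypothetical; treating the poison marker as UFFD-related is an assumption drawn from the mention of UFFDIO_POISON above, and the real is_uffd_pte_marker() may differ.

```c
#include <stdbool.h>

/* Hypothetical marker bits, for illustration only. */
#define TOY_MARKER_UFFD_WP	(1u << 0)
#define TOY_MARKER_POISONED	(1u << 1)
#define TOY_MARKER_GUARD	(1u << 2)

/* Assumption: UFFD WP and poison markers count as UFFD, guard does not. */
#define TOY_MARKER_UFFD_MASK	(TOY_MARKER_UFFD_WP | TOY_MARKER_POISONED)

static bool toy_is_uffd_marker(unsigned int marker)
{
	return (marker & TOY_MARKER_UFFD_MASK) != 0;
}
```

The point of the explicit predicate is that a guard marker no longer slips through a helper that was written with only UFFD in mind.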
In the HMM logic we note that the only time we would need to check for a
fault is in the case of a UFFD WP marker, otherwise we simply encounter a
fault error (VM_FAULT_HWPOISON for H/W poisoned marker, VM_FAULT_SIGSEGV
for a guard marker), so only check for the UFFD WP case.
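The decision described for HMM can be summarised as a small mapping from marker type to outcome. This is a userspace sketch with hypothetical toy_* names; the outcomes only mirror the fault results named above and this is not the HMM code itself.

```c
/* Hypothetical marker kinds and outcomes, for illustration only. */
enum toy_marker {
	TOY_MARKER_UFFD_WP,
	TOY_MARKER_POISONED,
	TOY_MARKER_GUARD,
};

enum toy_outcome {
	TOY_CHECK_FAULT,	/* a fault may resolve the entry */
	TOY_ERR_HWPOISON,	/* models VM_FAULT_HWPOISON */
	TOY_ERR_SIGSEGV,	/* models VM_FAULT_SIGSEGV */
};

static enum toy_outcome toy_hmm_marker_outcome(enum toy_marker marker)
{
	switch (marker) {
	case TOY_MARKER_UFFD_WP:
		/* The only case where checking for a fault is useful. */
		return TOY_CHECK_FAULT;
	case TOY_MARKER_POISONED:
		return TOY_ERR_HWPOISON;
	case TOY_MARKER_GUARD:
		return TOY_ERR_SIGSEGV;
	}
	/* Assumption: unknown markers are treated as an error. */
	return TOY_ERR_SIGSEGV;
}
```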
While we're here we also refactor code to make it easier to understand.
[akpm@linux-foundation.org: fix comment typo, per Mike]
Link: https://lkml.kernel.org/r/cover.1762812360.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/c38625fd9a1c1f1cf64ae8a248858e45b3dcdf11.1762812360.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Chris Li <chrisl@kernel.org>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mathew Brost <matthew.brost@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Wei Xu <weixugc@google.com>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
342 lines
8.4 KiB
C
// SPDX-License-Identifier: GPL-2.0
/*
 *	linux/mm/mincore.c
 *
 * Copyright (C) 1994-2006  Linus Torvalds
 */

/*
 * The mincore() system call.
 */
#include <linux/pagemap.h>
#include <linux/gfp.h>
#include <linux/pagewalk.h>
#include <linux/mman.h>
#include <linux/syscalls.h>
#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/shmem_fs.h>
#include <linux/hugetlb.h>
#include <linux/pgtable.h>

#include <linux/uaccess.h>
#include "swap.h"
#include "internal.h"

static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
			unsigned long end, struct mm_walk *walk)
{
#ifdef CONFIG_HUGETLB_PAGE
	unsigned char present;
	unsigned char *vec = walk->private;
	spinlock_t *ptl;

	ptl = huge_pte_lock(hstate_vma(walk->vma), walk->mm, pte);

	/*
	 * Hugepages under user process are always in RAM and never
	 * swapped out, but theoretically it needs to be checked.
	 */
	if (!pte) {
		present = 0;
	} else {
		const pte_t ptep = huge_ptep_get(walk->mm, addr, pte);

		if (huge_pte_none(ptep) || is_pte_marker(ptep))
			present = 0;
		else
			present = 1;
	}

	for (; addr != end; vec++, addr += PAGE_SIZE)
		*vec = present;
	walk->private = vec;
	spin_unlock(ptl);
#else
	BUG();
#endif
	return 0;
}

static unsigned char mincore_swap(swp_entry_t entry, bool shmem)
{
	struct swap_info_struct *si;
	struct folio *folio = NULL;
	unsigned char present = 0;

	if (!IS_ENABLED(CONFIG_SWAP)) {
		WARN_ON(1);
		return 0;
	}

	/*
	 * Shmem mapping may contain swapin error entries, which are
	 * absent. Page table may contain migration or hwpoison
	 * entries which are always uptodate.
	 */
	if (non_swap_entry(entry))
		return !shmem;

	/*
	 * Shmem mapping lookup is lockless, so we need to grab the swap
	 * device. mincore page table walk locks the PTL, and the swap
	 * device is stable, avoid touching the si for better performance.
	 */
	if (shmem) {
		si = get_swap_device(entry);
		if (!si)
			return 0;
	}
	folio = swap_cache_get_folio(entry);
	if (shmem)
		put_swap_device(si);
	/* The swap cache space contains either folio, shadow or NULL */
	if (folio && !xa_is_value(folio)) {
		present = folio_test_uptodate(folio);
		folio_put(folio);
	}

	return present;
}

/*
 * Later we can get more picky about what "in core" means precisely.
 * For now, simply check to see if the page is in the page cache,
 * and is up to date; i.e. that no page-in operation would be required
 * at this time if an application were to map and access this page.
 */
static unsigned char mincore_page(struct address_space *mapping, pgoff_t index)
{
	unsigned char present = 0;
	struct folio *folio;

	/*
	 * When tmpfs swaps out a page from a file, any process mapping that
	 * file will not get a swp_entry_t in its pte, but rather it is like
	 * any other file mapping (ie. marked !present and faulted in with
	 * tmpfs's .fault). So swapped out tmpfs mappings are tested here.
	 */
	folio = filemap_get_entry(mapping, index);
	if (folio) {
		if (xa_is_value(folio)) {
			if (shmem_mapping(mapping))
				return mincore_swap(radix_to_swp_entry(folio),
						    true);
			else
				return 0;
		}
		present = folio_test_uptodate(folio);
		folio_put(folio);
	}

	return present;
}

static int __mincore_unmapped_range(unsigned long addr, unsigned long end,
				struct vm_area_struct *vma, unsigned char *vec)
{
	unsigned long nr = (end - addr) >> PAGE_SHIFT;
	int i;

	if (vma->vm_file) {
		pgoff_t pgoff;

		pgoff = linear_page_index(vma, addr);
		for (i = 0; i < nr; i++, pgoff++)
			vec[i] = mincore_page(vma->vm_file->f_mapping, pgoff);
	} else {
		for (i = 0; i < nr; i++)
			vec[i] = 0;
	}
	return nr;
}

static int mincore_unmapped_range(unsigned long addr, unsigned long end,
				   __always_unused int depth,
				   struct mm_walk *walk)
{
	walk->private += __mincore_unmapped_range(addr, end,
						  walk->vma, walk->private);
	return 0;
}

static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
			struct mm_walk *walk)
{
	spinlock_t *ptl;
	struct vm_area_struct *vma = walk->vma;
	pte_t *ptep;
	unsigned char *vec = walk->private;
	int nr = (end - addr) >> PAGE_SHIFT;
	int step, i;

	ptl = pmd_trans_huge_lock(pmd, vma);
	if (ptl) {
		memset(vec, 1, nr);
		spin_unlock(ptl);
		goto out;
	}

	ptep = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
	if (!ptep) {
		walk->action = ACTION_AGAIN;
		return 0;
	}
	for (; addr != end; ptep += step, addr += step * PAGE_SIZE) {
		pte_t pte = ptep_get(ptep);

		step = 1;
		/* We need to do cache lookup too for markers */
		if (pte_none(pte) || is_pte_marker(pte))
			__mincore_unmapped_range(addr, addr + PAGE_SIZE,
						 vma, vec);
		else if (pte_present(pte)) {
			unsigned int batch = pte_batch_hint(ptep, pte);

			if (batch > 1) {
				unsigned int max_nr = (end - addr) >> PAGE_SHIFT;

				step = min_t(unsigned int, batch, max_nr);
			}

			for (i = 0; i < step; i++)
				vec[i] = 1;
		} else { /* pte is a swap entry */
			*vec = mincore_swap(pte_to_swp_entry(pte), false);
		}
		vec += step;
	}
	pte_unmap_unlock(ptep - 1, ptl);
out:
	walk->private += nr;
	cond_resched();
	return 0;
}

static inline bool can_do_mincore(struct vm_area_struct *vma)
{
	if (vma_is_anonymous(vma))
		return true;
	if (!vma->vm_file)
		return false;
	/*
	 * Reveal pagecache information only for non-anonymous mappings that
	 * correspond to the files the calling process could (if tried) open
	 * for writing; otherwise we'd be including shared non-exclusive
	 * mappings, which opens a side channel.
	 */
	return inode_owner_or_capable(&nop_mnt_idmap,
				      file_inode(vma->vm_file)) ||
	       file_permission(vma->vm_file, MAY_WRITE) == 0;
}

static const struct mm_walk_ops mincore_walk_ops = {
	.pmd_entry		= mincore_pte_range,
	.pte_hole		= mincore_unmapped_range,
	.hugetlb_entry		= mincore_hugetlb,
	.walk_lock		= PGWALK_RDLOCK,
};

/*
 * Do a chunk of "sys_mincore()". We've already checked
 * all the arguments, we hold the mmap semaphore: we should
 * just return the amount of info we're asked for.
 */
static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *vec)
{
	struct vm_area_struct *vma;
	unsigned long end;
	int err;

	vma = vma_lookup(current->mm, addr);
	if (!vma)
		return -ENOMEM;
	end = min(vma->vm_end, addr + (pages << PAGE_SHIFT));
	if (!can_do_mincore(vma)) {
		unsigned long pages = DIV_ROUND_UP(end - addr, PAGE_SIZE);
		memset(vec, 1, pages);
		return pages;
	}
	err = walk_page_range(vma->vm_mm, addr, end, &mincore_walk_ops, vec);
	if (err < 0)
		return err;
	return (end - addr) >> PAGE_SHIFT;
}

/*
 * The mincore(2) system call.
 *
 * mincore() returns the memory residency status of the pages in the
 * current process's address space specified by [addr, addr + len).
 * The status is returned in a vector of bytes.  The least significant
 * bit of each byte is 1 if the referenced page is in memory, otherwise
 * it is zero.
 *
 * Because the status of a page can change after mincore() checks it
 * but before it returns to the application, the returned vector may
 * contain stale information.  Only locked pages are guaranteed to
 * remain in memory.
 *
 * return values:
 *  zero    - success
 *  -EFAULT - vec points to an illegal address
 *  -EINVAL - addr is not a multiple of PAGE_SIZE
 *  -ENOMEM - Addresses in the range [addr, addr + len] are
 *		invalid for the address space of this process, or
 *		specify one or more pages which are not currently
 *		mapped
 *  -EAGAIN - A kernel resource was temporarily unavailable.
 */
SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len,
		unsigned char __user *, vec)
{
	long retval;
	unsigned long pages;
	unsigned char *tmp;

	start = untagged_addr(start);

	/* Check the start address: needs to be page-aligned.. */
	if (unlikely(start & ~PAGE_MASK))
		return -EINVAL;

	/* ..and we need to be passed a valid user-space range */
	if (!access_ok((void __user *) start, len))
		return -ENOMEM;

	/* This also avoids any overflows on PAGE_ALIGN */
	pages = len >> PAGE_SHIFT;
	pages += (offset_in_page(len)) != 0;

	if (!access_ok(vec, pages))
		return -EFAULT;

	tmp = (void *) __get_free_page(GFP_USER);
	if (!tmp)
		return -EAGAIN;

	retval = 0;
	while (pages) {
		/*
		 * Do at most PAGE_SIZE entries per iteration, due to
		 * the temporary buffer size.
		 */
		mmap_read_lock(current->mm);
		retval = do_mincore(start, min(pages, PAGE_SIZE), tmp);
		mmap_read_unlock(current->mm);

		if (retval <= 0)
			break;
		if (copy_to_user(vec, tmp, retval)) {
			retval = -EFAULT;
			break;
		}
		pages -= retval;
		vec += retval;
		start += retval << PAGE_SHIFT;
		retval = 0;
	}
	free_page((unsigned long) tmp);
	return retval;
}
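For orientation, the userspace side of this system call can be exercised roughly as follows. This is a minimal sketch assuming a Linux system with glibc; count_resident_pages() is a hypothetical helper written for this example and is not part of the kernel file above.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: count how many pages of [addr, addr + len) are
 * reported resident by mincore(2). addr must be page-aligned.
 * Returns -1 on error (errno is set by mincore() or malloc()).
 */
static long count_resident_pages(void *addr, size_t len)
{
	long page_size = sysconf(_SC_PAGESIZE);
	size_t pages, i;
	unsigned char *vec;
	long resident = 0;

	if (page_size <= 0 || len == 0)
		return -1;

	/* One status byte per page, rounding the length up. */
	pages = (len + (size_t)page_size - 1) / (size_t)page_size;
	vec = malloc(pages);
	if (!vec)
		return -1;

	if (mincore(addr, len, vec) != 0) {
		free(vec);
		return -1;
	}

	/* Only the least significant bit of each byte is meaningful. */
	for (i = 0; i < pages; i++)
		resident += vec[i] & 1;

	free(vec);
	return resident;
}
```

As the comment on the system call notes, the result can be stale by the time it is read, so it is best treated as a hint unless the pages are locked.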