Commit Graph

277 Commits

Author SHA1 Message Date
David Hildenbrand (Arm)
6a288a4ddb mm/page_alloc: fix initialization of tags of the huge zero folio with init_on_free
__GFP_ZEROTAGS semantics are currently a bit weird, but effectively this
flag is only ever set alongside __GFP_ZERO and __GFP_SKIP_KASAN.

If we run with init_on_free, we will zero out pages during
__free_pages_prepare(), to skip zeroing on the allocation path.

However, when allocating with __GFP_ZEROTAG set, post_alloc_hook() will
consequently not only skip clearing page content, but also skip clearing
tag memory.

Not clearing tags through __GFP_ZEROTAGS is irrelevant for most pages that
will get mapped to user space through set_pte_at() later: set_pte_at() and
friends will detect that the tags have not been initialized yet
(PG_mte_tagged not set), and initialize them.

However, for the huge zero folio, which will be mapped through a PMD
marked as special, this initialization will not be performed, ending up
exposing whatever tags were still set for the pages.

The docs (Documentation/arch/arm64/memory-tagging-extension.rst) state
that allocation tags are set to 0 when a page is first mapped to user
space.  That no longer holds with the huge zero folio when init_on_free is
enabled.

Fix it by decoupling __GFP_ZEROTAGS from __GFP_ZERO, passing to
tag_clear_highpages() whether we want to also clear page content.

Invert the meaning of the tag_clear_highpages() return value to have
clearer semantics.

Reproduced with the huge zero folio by modifying the check_buffer_fill
arm64/mte selftest to use a 2 MiB area, after making sure that pages have
a non-0 tag set when freeing (note that, during boot, we will not actually
initialize tags, but only set KASAN_TAG_KERNEL in the page flags).

	$ ./check_buffer_fill
	1..20
	...
	not ok 17 Check initial tags with private mapping, sync error mode and mmap memory
	not ok 18 Check initial tags with private mapping, sync error mode and mmap/mprotect memory
	...

This code needs more cleanups; we'll tackle that next, like
decoupling __GFP_ZEROTAGS from __GFP_SKIP_KASAN.

[akpm@linux-foundation.org: s/__GPF_ZERO/__GFP_ZERO/, per David]
Link: https://lore.kernel.org/20260421-zerotags-v2-1-05cb1035482e@kernel.org
Fixes: adfb6609c6 ("mm/huge_memory: initialise the tags of the huge zero folio")
Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Lance Yang <lance.yang@linux.dev>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Liam Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-13 17:40:02 -07:00
Linus Torvalds
01f492e181 Arm:
- Add support for tracing in the standalone EL2 hypervisor code, which
   should help both debugging and performance analysis.  This uses the
   new infrastructure for 'remote' trace buffers that can be exposed
   by non-kernel entities such as firmware, and which came through the
   tracing tree.
 
 - Add support for GICv5 Per Processor Interrupts (PPIs), as the starting
   point for supporting the new GIC architecture in KVM.
 
 - Finally add support for pKVM protected guests, where pages are unmapped
   from the host as they are faulted into the guest and can be shared back
   from the guest using pKVM hypercalls.  Protected guests are created
   using a new machine type identifier.  As the elusive guestmem has not
   yet delivered on its promises, anonymous memory is also supported.
 
   This is only a first step towards full isolation from the host; for
   example, the CPU register state and DMA accesses are not yet isolated.
   Because this does not really yet bring fully what it promises, it is
   hidden behind CONFIG_ARM_PKVM_GUEST + 'kvm-arm.mode=protected', and
   also triggers TAINT_USER when a VM is created.  Caveat emptor.
 
 - Rework the dreaded user_mem_abort() function to make it more
   maintainable, reducing the amount of state being exposed to the
   various helpers and rendering a substantial amount of state immutable.
 
 - Expand the Stage-2 page table dumper to support NV shadow page tables
   on a per-VM basis.
 
 - Tidy up the pKVM PSCI proxy code to be slightly less hard to follow.
 
 - Fix both SPE and TRBE in non-VHE configurations so that they do not
   generate spurious, out of context table walks that ultimately lead
   to very bad HW lockups.
 
 - A small set of patches fixing the Stage-2 MMU freeing in error cases.
 
 - Tighten-up accepted SMC immediate value to be only #0 for host
   SMCCC calls.
 
 - The usual cleanups and other selftest churn.
 
 LoongArch:
 
 - Use CSR_CRMD_PLV for kvm_arch_vcpu_in_kernel().
 
 - Add DMSINTC irqchip in kernel support.
 
 RISC-V:
 
 - Fix steal time shared memory alignment checks
 
 - Fix vector context allocation leak
 
 - Fix array out-of-bounds in pmu_ctr_read() and pmu_fw_ctr_read_hi()
 
 - Fix double-free of sdata in kvm_pmu_clear_snapshot_area()
 
 - Fix integer overflow in kvm_pmu_validate_counter_mask()
 
 - Fix shift-out-of-bounds in make_xfence_request()
 
 - Fix lost write protection on huge pages during dirty logging
 
 - Split huge pages during fault handling for dirty logging
 
 - Skip CSR restore if VCPU is reloaded on the same core
 
 - Implement kvm_arch_has_default_irqchip() for KVM selftests
 
 - Factored-out ISA checks into separate sources
 
 - Added hideleg to struct kvm_vcpu_config
 
 - Factored-out VCPU config into separate sources
 
 - Support configuration of per-VM HGATP mode from KVM user space
 
 s390:
 
 - Support for ESA (31-bit) guests inside nested hypervisors.
 
 - Remove restriction on memslot alignment, which is not needed anymore with
   the new gmap code.
 
 - Fix LPSW/E to update the bear (which of course is the breaking event
   address register).
 
 x86:
 
 - Shut up various UBSAN warnings on reading module parameter before they
   were initialized.
 
 - Don't zero-allocate page tables that are used for splitting hugepages in
   the TDP MMU, as KVM is guaranteed to set all SPTEs in the page table and
   thus write all bytes.
 
 - As an optimization, bail early when trying to unsync 4KiB mappings if the
   target gfn can just be mapped with a 2MiB hugepage.
 
 x86 generic:
 
 - Copy single-chunk MMIO write values into struct kvm_vcpu (more precisely
   struct kvm_mmio_fragment) to fix use-after-free stack bugs where KVM
   would dereference stack pointer after an exit to userspace.
 
 - Clean up and comment the emulated MMIO code to try to make it easier to
   maintain (not necessarily "easy", but "easier").
 
 - Move VMXON+VMXOFF and EFER.SVME toggling out of KVM (not *all* of VMX
   and SVM enabling) as it is needed for trusted I/O.
 
 - Advertise support for AVX512 Bit Matrix Multiply (BMM) instructions
 
 - Immediately fail the build if a required #define is missing in one of
   KVM's headers that is included multiple times.
 
 - Reject SET_GUEST_DEBUG with -EBUSY if there's an already injected
   exception, mostly to prevent syzkaller from abusing the uAPI to
   trigger WARNs, but also because it can help prevent userspace from
   unintentionally crashing the VM.
 
 - Exempt SMM from CPUID faulting on Intel, as per the spec.
 
 - Misc hardening and cleanup changes.
 
 x86 (AMD):
 
 - Fix and optimize IRQ window inhibit handling for AVIC; make it per-vCPU
   so that KVM doesn't prematurely re-enable AVIC if multiple
   vCPUs have to-be-injected IRQs.
 
 - Clean up and optimize the OSVW handling, avoiding a bug in which KVM would
   overwrite state when enabling virtualization on multiple CPUs in parallel.
   This should not be a problem because OSVW should usually be the same for
   all CPUs.
 
 - Drop a WARN in KVM_MEMORY_ENCRYPT_REG_REGION where KVM complains about a
   "too large" size based purely on user input.
 
 - Clean up and harden the pinning code for KVM_MEMORY_ENCRYPT_REG_REGION.
 
 - Disallow synchronizing a VMSA of an already-launched/encrypted vCPU, as
   doing so for an SNP guest will crash the host due to an RMP violation
   page fault.
 
 - Overhaul KVM's APIs for detecting SEV+ guests so that VM-scoped queries
   are required to hold kvm->lock, and enforce it by lockdep.  Fix various
   bugs where sev_guest() was not ensured to be stable for the whole
   duration of a function or ioctl.
 
 - Convert a pile of kvm->lock SEV code to guard().
 
 - Play nicer with userspace that does not enable KVM_CAP_EXCEPTION_PAYLOAD,
   for which KVM needs to set CR2 and DR6 as a response to ioctls such as
   KVM_GET_VCPU_EVENTS (even if the payload would end up in EXITINFO2
   rather than CR2, for example).  Only set CR2 and DR6 when consumption of
   the payload is imminent, but on the other hand force delivery of the
   payload in all paths where userspace retrieves CR2 or DR6.
 
 - Use vcpu->arch.cr2 when updating vmcb12's CR2 on nested #VMEXIT instead
   of vmcb02->save.cr2.  The value is out of sync after a save/restore
   or after a #PF is injected into L2.
 
 - Fix a class of nSVM bugs where some fields written by the CPU are not
   synchronized from vmcb02 to cached vmcb12 after VMRUN, and so are not
   up-to-date when saved by KVM_GET_NESTED_STATE.
 
 - Fix a class of bugs where the ordering between KVM_SET_NESTED_STATE and
   KVM_SET_{S}REGS could cause vmcb02 to be incorrectly initialized after
   save+restore.
 
 - Add a variety of missing nSVM consistency checks.
 
 - Fix several bugs where KVM failed to correctly update VMCB fields on
   nested #VMEXIT.
 
 - Fix several bugs where KVM failed to correctly synthesize #UD or #GP for
   SVM-related instructions.
 
 - Add support for save+restore of virtualized LBRs (on SVM).
 
 - Refactor various helpers and macros to improve clarity and (hopefully)
   make the code easier to maintain.
 
 - Aggressively sanitize fields when copying from vmcb12, to guard against
   unintentionally allowing L1 to utilize yet-to-be-defined features.
 
 - Fix several bugs where KVM botched rAX legality checks when emulating SVM
   instructions.  There are remaining issues in that KVM doesn't handle size
   prefix overrides for 64-bit guests.
 
 - Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails instead of
   somewhat arbitrarily synthesizing #GP (i.e. don't double down on AMD's
   architectural but sketchy behavior of generating #GP for "unsupported"
   addresses).
 
 - Cache all used vmcb12 fields to further harden against TOCTOU bugs.
 
 x86 (Intel):
 
 - Drop obsolete branch hint prefixes from the VMX instruction macros.
 
 - Use ASM_INPUT_RM() in __vmcs_writel() to coerce clang into using a
   register input when appropriate.
 
 - Code cleanups.
 
 guest_memfd:
 
 - Don't mark guest_memfd folios as accessed, as guest_memfd doesn't support
   reclaim, the memory is unevictable, and there is no storage to write
   back to.
 
 LoongArch selftests:
 
 - Add KVM PMU test cases
 
 s390 selftests:
 
 - Enable more memory selftests.
 
 x86 selftests:
 
 - Add support for Hygon CPUs in KVM selftests.
 
 - Fix a bug in the MSR test where it would get false failures on AMD/Hygon
   CPUs with exactly one of RDPID or RDTSCP.
 
 - Add an MADV_COLLAPSE testcase for guest_memfd as a regression test for a
   bug where the kernel would attempt to collapse guest_memfd folios against
   KVM's will.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmnftRQUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroPAzwf+NKO4Ktv+7A22ImN0SBl0nlUuulsz
 vTcw3+hxdRoIw83GdNS+hG5js0wrpMDnbv3t4+VliDNBSSxrBzcSWX2wpilW0Xtw
 qGo1MWhs2lKPy1NlaRVOwPS6j7uF3AR0TQ1iQLGMedQuCU9WpiKJxyhNXJdbLrt3
 8EgFzsvtEsv+jKNRUNDf9+d0j4gZsFyIe+Brhianbw+u3/UCiUClLCdsKPc4+5ZX
 08otYXytacGNIf/5Ev1vT4pHkHL0yqKXAtX7LEtaS3+0KrPuLjV4slemivzE9vf5
 Evafm5AhA4wpaNMb1ZerhY3T94lsMaJpWxotjR//0Q7C9B59pCQnXCm8mg==
 =CcE0
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "Arm:

   - Add support for tracing in the standalone EL2 hypervisor code,
     which should help both debugging and performance analysis. This
     uses the new infrastructure for 'remote' trace buffers that can be
     exposed by non-kernel entities such as firmware, and which came
     through the tracing tree

   - Add support for GICv5 Per Processor Interrupts (PPIs), as the
     starting point for supporting the new GIC architecture in KVM

   - Finally add support for pKVM protected guests, where pages are
     unmapped from the host as they are faulted into the guest and can
     be shared back from the guest using pKVM hypercalls. Protected
     guests are created using a new machine type identifier. As the
     elusive guestmem has not yet delivered on its promises, anonymous
     memory is also supported

     This is only a first step towards full isolation from the host; for
     example, the CPU register state and DMA accesses are not yet
     isolated. Because this does not really yet bring fully what it
     promises, it is hidden behind CONFIG_ARM_PKVM_GUEST +
     'kvm-arm.mode=protected', and also triggers TAINT_USER when a VM is
     created. Caveat emptor

   - Rework the dreaded user_mem_abort() function to make it more
     maintainable, reducing the amount of state being exposed to the
     various helpers and rendering a substantial amount of state
     immutable

   - Expand the Stage-2 page table dumper to support NV shadow page
     tables on a per-VM basis

   - Tidy up the pKVM PSCI proxy code to be slightly less hard to
     follow

   - Fix both SPE and TRBE in non-VHE configurations so that they do not
     generate spurious, out of context table walks that ultimately lead
     to very bad HW lockups

   - A small set of patches fixing the Stage-2 MMU freeing in error
     cases

   - Tighten-up accepted SMC immediate value to be only #0 for host
     SMCCC calls

   - The usual cleanups and other selftest churn

  LoongArch:

   - Use CSR_CRMD_PLV for kvm_arch_vcpu_in_kernel()

   - Add DMSINTC irqchip in kernel support

  RISC-V:

   - Fix steal time shared memory alignment checks

   - Fix vector context allocation leak

   - Fix array out-of-bounds in pmu_ctr_read() and pmu_fw_ctr_read_hi()

   - Fix double-free of sdata in kvm_pmu_clear_snapshot_area()

   - Fix integer overflow in kvm_pmu_validate_counter_mask()

   - Fix shift-out-of-bounds in make_xfence_request()

   - Fix lost write protection on huge pages during dirty logging

   - Split huge pages during fault handling for dirty logging

   - Skip CSR restore if VCPU is reloaded on the same core

   - Implement kvm_arch_has_default_irqchip() for KVM selftests

   - Factored-out ISA checks into separate sources

   - Added hideleg to struct kvm_vcpu_config

   - Factored-out VCPU config into separate sources

   - Support configuration of per-VM HGATP mode from KVM user space

  s390:

   - Support for ESA (31-bit) guests inside nested hypervisors

   - Remove restriction on memslot alignment, which is not needed
     anymore with the new gmap code

   - Fix LPSW/E to update the bear (which of course is the breaking
     event address register)

  x86:

   - Shut up various UBSAN warnings on reading module parameter before
     they were initialized

   - Don't zero-allocate page tables that are used for splitting
     hugepages in the TDP MMU, as KVM is guaranteed to set all SPTEs in
     the page table and thus write all bytes

   - As an optimization, bail early when trying to unsync 4KiB mappings
     if the target gfn can just be mapped with a 2MiB hugepage

  x86 generic:

   - Copy single-chunk MMIO write values into struct kvm_vcpu (more
     precisely struct kvm_mmio_fragment) to fix use-after-free stack
     bugs where KVM would dereference stack pointer after an exit to
     userspace

   - Clean up and comment the emulated MMIO code to try to make it
     easier to maintain (not necessarily "easy", but "easier")

   - Move VMXON+VMXOFF and EFER.SVME toggling out of KVM (not *all* of
     VMX and SVM enabling) as it is needed for trusted I/O

   - Advertise support for AVX512 Bit Matrix Multiply (BMM) instructions

   - Immediately fail the build if a required #define is missing in one
     of KVM's headers that is included multiple times

   - Reject SET_GUEST_DEBUG with -EBUSY if there's an already injected
     exception, mostly to prevent syzkaller from abusing the uAPI to
     trigger WARNs, but also because it can help prevent userspace from
     unintentionally crashing the VM

   - Exempt SMM from CPUID faulting on Intel, as per the spec

   - Misc hardening and cleanup changes

  x86 (AMD):

   - Fix and optimize IRQ window inhibit handling for AVIC; make it
     per-vCPU so that KVM doesn't prematurely re-enable AVIC if multiple
     vCPUs have to-be-injected IRQs

   - Clean up and optimize the OSVW handling, avoiding a bug in which
     KVM would overwrite state when enabling virtualization on multiple
     CPUs in parallel. This should not be a problem because OSVW should
     usually be the same for all CPUs

   - Drop a WARN in KVM_MEMORY_ENCRYPT_REG_REGION where KVM complains
     about a "too large" size based purely on user input

   - Clean up and harden the pinning code for KVM_MEMORY_ENCRYPT_REG_REGION

   - Disallow synchronizing a VMSA of an already-launched/encrypted
     vCPU, as doing so for an SNP guest will crash the host due to an
     RMP violation page fault

   - Overhaul KVM's APIs for detecting SEV+ guests so that VM-scoped
     queries are required to hold kvm->lock, and enforce it by lockdep.
     Fix various bugs where sev_guest() was not ensured to be stable for
     the whole duration of a function or ioctl

   - Convert a pile of kvm->lock SEV code to guard()

   - Play nicer with userspace that does not enable
     KVM_CAP_EXCEPTION_PAYLOAD, for which KVM needs to set CR2 and DR6
     as a response to ioctls such as KVM_GET_VCPU_EVENTS (even if the
     payload would end up in EXITINFO2 rather than CR2, for example).
     Only set CR2 and DR6 when consumption of the payload is imminent,
     but on the other hand force delivery of the payload in all paths
     where userspace retrieves CR2 or DR6

   - Use vcpu->arch.cr2 when updating vmcb12's CR2 on nested #VMEXIT
     instead of vmcb02->save.cr2. The value is out of sync after a
     save/restore or after a #PF is injected into L2

   - Fix a class of nSVM bugs where some fields written by the CPU are
     not synchronized from vmcb02 to cached vmcb12 after VMRUN, and so
     are not up-to-date when saved by KVM_GET_NESTED_STATE

   - Fix a class of bugs where the ordering between KVM_SET_NESTED_STATE
     and KVM_SET_{S}REGS could cause vmcb02 to be incorrectly
     initialized after save+restore

   - Add a variety of missing nSVM consistency checks

   - Fix several bugs where KVM failed to correctly update VMCB fields
     on nested #VMEXIT

   - Fix several bugs where KVM failed to correctly synthesize #UD or
     #GP for SVM-related instructions

   - Add support for save+restore of virtualized LBRs (on SVM)

   - Refactor various helpers and macros to improve clarity and
     (hopefully) make the code easier to maintain

   - Aggressively sanitize fields when copying from vmcb12, to guard
     against unintentionally allowing L1 to utilize yet-to-be-defined
     features

   - Fix several bugs where KVM botched rAX legality checks when
     emulating SVM instructions. There are remaining issues in that KVM
     doesn't handle size prefix overrides for 64-bit guests

   - Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails
     instead of somewhat arbitrarily synthesizing #GP (i.e. don't double
     down on AMD's architectural but sketchy behavior of generating #GP
     for "unsupported" addresses)

   - Cache all used vmcb12 fields to further harden against TOCTOU bugs

  x86 (Intel):

   - Drop obsolete branch hint prefixes from the VMX instruction macros

   - Use ASM_INPUT_RM() in __vmcs_writel() to coerce clang into using a
     register input when appropriate

   - Code cleanups

  guest_memfd:

   - Don't mark guest_memfd folios as accessed, as guest_memfd doesn't
     support reclaim, the memory is unevictable, and there is no storage
     to write back to

  LoongArch selftests:

   - Add KVM PMU test cases

  s390 selftests:

   - Enable more memory selftests

  x86 selftests:

   - Add support for Hygon CPUs in KVM selftests

   - Fix a bug in the MSR test where it would get false failures on
     AMD/Hygon CPUs with exactly one of RDPID or RDTSCP

   - Add an MADV_COLLAPSE testcase for guest_memfd as a regression test
     for a bug where the kernel would attempt to collapse guest_memfd
     folios against KVM's will"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (373 commits)
  KVM: x86: use inlines instead of macros for is_sev_*guest
  x86/virt: Treat SVM as unsupported when running as an SEV+ guest
  KVM: SEV: Goto an existing error label if charging misc_cg for an ASID fails
  KVM: SVM: Move lock-protected allocation of SEV ASID into a separate helper
  KVM: SEV: use mutex guard in snp_handle_guest_req()
  KVM: SEV: use mutex guard in sev_mem_enc_unregister_region()
  KVM: SEV: use mutex guard in sev_mem_enc_ioctl()
  KVM: SEV: use mutex guard in snp_launch_update()
  KVM: SEV: Assert that kvm->lock is held when querying SEV+ support
  KVM: SEV: Document that checking for SEV+ guests when reclaiming memory is "safe"
  KVM: SEV: Hide "struct kvm_sev_info" behind CONFIG_KVM_AMD_SEV=y
  KVM: SEV: WARN on unhandled VM type when initializing VM
  KVM: LoongArch: selftests: Add PMU overflow interrupt test
  KVM: LoongArch: selftests: Add basic PMU event counting test
  KVM: LoongArch: selftests: Add cpucfg read/write helpers
  LoongArch: KVM: Add DMSINTC inject msi to vCPU
  LoongArch: KVM: Add DMSINTC device support
  LoongArch: KVM: Make vcpu_is_preempted() as a macro rather than function
  LoongArch: KVM: Move host CSR_GSTAT save and restore in context switch
  LoongArch: KVM: Move host CSR_EENTRY save and restore in context switch
  ...
2026-04-17 07:18:03 -07:00
Will Deacon
281a38ad29 KVM: arm64: Reclaim faulting page from pKVM in spurious fault handler
Host kernel accesses to pages that are inaccessible at stage-2 result in
the injection of a translation fault, which is fatal unless an exception
table fixup is registered for the faulting PC (e.g. for user access
routines). This is undesirable, since a get_user_pages() call could be
used to obtain a reference to a donated page and then a subsequent
access via a kernel mapping would lead to a panic().

Rework the spurious fault handler so that stage-2 faults injected back
into the host result in the target page being forcefully reclaimed when
no exception table fixup handler is registered.

Tested-by: Fuad Tabba <tabba@google.com>
Tested-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://patch.msgid.link/20260330144841.26181-27-will@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-03-30 16:58:09 +01:00
Quentin Perret
9ff714a092 KVM: arm64: Inject SIGSEGV on illegal accesses
The pKVM hypervisor will currently panic if the host tries to access
memory that it doesn't own (e.g. protected guest memory). Sadly, as
guest memory can still be mapped into the VMM's address space, userspace
can trivially crash the kernel/hypervisor by poking into guest memory.

To prevent this, inject the abort back in the host with S1PTW set in the
ESR, hence allowing the host to differentiate this abort from normal
userspace faults and inject a SIGSEGV cleanly.

Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Tested-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://patch.msgid.link/20260330144841.26181-20-will@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-03-30 16:58:08 +01:00
Ryan Roberts
b7d9d2e3a8 arm64: mm: __ptep_set_access_flags must hint correct TTL
It has been reported that since commit 752a0d1d48 ("arm64: mm:
Provide level hint for flush_tlb_page()"), the arm64
check_hugetlb_options selftest has been locking up while running "Check
child hugetlb memory with private mapping, sync error mode and mmap
memory".

This is due to hugetlb (and THP) helpers casting their PMD/PUD entries
to PTE and calling __ptep_set_access_flags(), which issues a
__flush_tlb_page(). Now that this is hinted for level 3, in this case,
the TLB entry does not get evicted and we end up in a spurious fault
loop.

Fix this by creating a __ptep_set_access_flags_anysz() function which
takes the pgsize of the entry. It can then add the appropriate hint. The
"_anysz" approach is the established pattern for problems of this class.

Reported-by: Aishwarya TCV <Aishwarya.TCV@arm.com>
Fixes: 752a0d1d48 ("arm64: mm: Provide level hint for flush_tlb_page()")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2026-03-25 18:08:13 +00:00
Ryan Roberts
15397e3c38 arm64: mm: Wrap flush_tlb_page() around __do_flush_tlb_range()
Flushing a page from the tlb is just a special case of flushing a range.
So let's rework flush_tlb_page() so that it simply wraps
__do_flush_tlb_range(). While at it, let's also update the API to take
the same flags that we use when flushing a range. This allows us to
delete all the ugly "_nosync", "_local" and "_nonotify" variants.

Thanks to constant folding, all of the complex looping and tlbi-by-range
options get eliminated so that the generated code for flush_tlb_page()
looks very similar to the previous version.

Reviewed-by: Linu Cherian <linu.cherian@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2026-03-13 17:26:26 +00:00
Linus Torvalds
44fc84337b arm64 updates for 6.19:
Core features:
 
  - Basic Arm MPAM (Memory system resource Partitioning And Monitoring)
    driver under drivers/resctrl/ which makes use of the fs/rectrl/ API
 
 Perf and PMU:
 
  - Avoid cycle counter on multi-threaded CPUs
 
  - Extend CSPMU device probing and add additional filtering support for
    NVIDIA implementations
 
  - Add support for the PMUs on the NoC S3 interconnect
 
  - Add additional compatible strings for new Cortex and C1 CPUs
 
  - Add support for data source filtering to the SPE driver
 
  - Add support for i.MX8QM and "DB" PMU in the imx PMU driver
 
 Memory managemennt:
 
  - Avoid broadcast TLBI if page reused in write fault
 
  - Elide TLB invalidation if the old PTE was not valid
 
  - Drop redundant cpu_set_*_tcr_t0sz() macros
 
  - Propagate pgtable_alloc() errors outside of __create_pgd_mapping()
 
  - Propagate return value from __change_memory_common()
 
 ACPI and EFI:
 
  - Call EFI runtime services without disabling preemption
 
  - Remove unused ACPI function
 
 Miscellaneous:
 
  - ptrace support to disable streaming on SME-only systems
 
  - Improve sysreg generation to include a 'Prefix' descriptor
 
  - Replace __ASSEMBLY__ with __ASSEMBLER__
 
  - Align register dumps in the kselftest zt-test
 
  - Remove some no longer used macros/functions
 
  - Various spelling corrections
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAmkvMjkACgkQa9axLQDI
 XvGaGg//dtT/ZAqrWa6Yniv1LOlh837C07YdxAYTTuJ+I87DnrxIqjwbW+ye+bF+
 61RTkioeCUm3PH+ncO9gPVNi4ASZ1db3/Rc8Fb6rr1TYOI1sMIeBsbbVdRJgsbX6
 zu9197jOBHscTAeDceB6jZBDyW8iSLINPZ7LN6lGxXsZM/Vn5zfE0heKEEio6Fsx
 +AzO2vos0XcwBR9vFGXtiCDx57T+/cXUtrWfA0Cjz4nvHSgD8+ghS+Jwv+kHMt1L
 zrarqbeQfj+Iixm9PVHiazv+8THo9QdNl1yGLxDmJ4LEVPewjW5jBs8+5e8e3/Gj
 p5JEvmSyWvKTTbFoM5vhxC72A7yuT1QwAk2iCyFIxMbQ25PndHboKVp/569DzOkT
 +6CjI88sVSP6D7bVlN6pFlzc/Fa07YagnDMnMCSfk4LBjUfE3jYb+usaFydyv/rl
 jwZbJrnSF/H+uQlyoJFgOEXSoQdDsll3dv6yEsUCwbd8RqXbAe3svbguOUHSdvIj
 sCViezGZQ7Rkn6D21AfF9j6e7ceaSDaf5DWMxPI3dAxFKG8TJbCBsToR59NnoSj+
 bNEozbZ1mCxmwH8i43wZ6P0RkClvJnoXcvRA+TJj02fSZACO39d3XDNswfXWL41r
 KiWGUJZyn2lPKtiAWVX6pSBtDJ+5rFhuoFgADLX6trkxDe9/EMQ=
 =4Sb6
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Catalin Marinas:
 "These are the arm64 updates for 6.19.

  The biggest part is the Arm MPAM driver under drivers/resctrl/.
  There's a patch touching mm/ to handle spurious faults for huge pmd
  (similar to the pte version). The corresponding arm64 part allows us
  to avoid the TLB maintenance if a (huge) page is reused after a write
  fault. There's EFI refactoring to allow runtime services with
  preemption enabled and the rest is the usual perf/PMU updates and
  several cleanups/typos.

  Summary:

  Core features:

   - Basic Arm MPAM (Memory system resource Partitioning And Monitoring)
     driver under drivers/resctrl/ which makes use of the fs/rectrl/ API

  Perf and PMU:

   - Avoid cycle counter on multi-threaded CPUs

   - Extend CSPMU device probing and add additional filtering support
     for NVIDIA implementations

   - Add support for the PMUs on the NoC S3 interconnect

   - Add additional compatible strings for new Cortex and C1 CPUs

   - Add support for data source filtering to the SPE driver

   - Add support for i.MX8QM and "DB" PMU in the imx PMU driver

  Memory managemennt:

   - Avoid broadcast TLBI if page reused in write fault

   - Elide TLB invalidation if the old PTE was not valid

   - Drop redundant cpu_set_*_tcr_t0sz() macros

   - Propagate pgtable_alloc() errors outside of __create_pgd_mapping()

   - Propagate return value from __change_memory_common()

  ACPI and EFI:

   - Call EFI runtime services without disabling preemption

   - Remove unused ACPI function

  Miscellaneous:

   - ptrace support to disable streaming on SME-only systems

   - Improve sysreg generation to include a 'Prefix' descriptor

   - Replace __ASSEMBLY__ with __ASSEMBLER__

   - Align register dumps in the kselftest zt-test

   - Remove some no longer used macros/functions

   - Various spelling corrections"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (94 commits)
  arm64/mm: Document why linear map split failure upon vm_reset_perms is not problematic
  arm64/pageattr: Propagate return value from __change_memory_common
  arm64/sysreg: Remove unused define ARM64_FEATURE_FIELD_BITS
  KVM: arm64: selftests: Consider all 7 possible levels of cache
  KVM: arm64: selftests: Remove ARM64_FEATURE_FIELD_BITS and its last user
  arm64: atomics: lse: Remove unused parameters from ATOMIC_FETCH_OP_AND macros
  Documentation/arm64: Fix the typo of register names
  ACPI: GTDT: Get rid of acpi_arch_timer_mem_init()
  perf: arm_spe: Add support for filtering on data source
  perf: Add perf_event_attr::config4
  perf/imx_ddr: Add support for PMU in DB (system interconnects)
  perf/imx_ddr: Get and enable optional clks
  perf/imx_ddr: Move ida_alloc() from ddr_perf_init() to ddr_perf_probe()
  dt-bindings: perf: fsl-imx-ddr: Add compatible string for i.MX8QM, i.MX8QXP and i.MX8DXL
  arm64: remove duplicate ARCH_HAS_MEM_ENCRYPT
  arm64: mm: use untagged address to calculate page index
  MAINTAINERS: new entry for MPAM Driver
  arm_mpam: Add kunit tests for props_mismatch()
  arm_mpam: Add kunit test for bitmap reset
  arm_mpam: Add helper to reset saved mbwu state
  ...
2025-12-02 17:03:55 -08:00
Huang Ying
cb1fa2e999 arm64, tlbflush: don't TLBI broadcast if page reused in write fault
A multi-thread customer workload with large memory footprint uses
fork()/exec() to run some external programs every tens seconds.  When
running the workload on an arm64 server machine, it's observed that
quite some CPU cycles are spent in the TLB flushing functions.  While
running the workload on the x86_64 server machine, it's not.  This
causes the performance on arm64 to be much worse than that on x86_64.

During the workload running, after fork()/exec() write-protects all
pages in the parent process, memory writing in the parent process
will cause a write protection fault.  Then the page fault handler
will make the PTE/PDE writable if the page can be reused, which is
almost always true in the workload.  On arm64, to avoid the write
protection fault on other CPUs, the page fault handler flushes the TLB
globally with TLBI broadcast after changing the PTE/PDE.  However, this
isn't always necessary.  Firstly, it's safe to leave some stale
read-only TLB entries as long as they will be flushed finally.
Secondly, it's quite possible that the original read-only PTE/PDEs
aren't cached in remote TLB at all if the memory footprint is large.
In fact, on x86_64, the page fault handler doesn't flush the remote
TLB in this situation, which benefits the performance a lot.

To improve the performance on arm64, make the write protection fault
handler flush the TLB locally instead of globally via TLBI broadcast
after making the PTE/PDE writable.  If there are stale read-only TLB
entries in the remote CPUs, the page fault handler on these CPUs will
regard the page fault as spurious and flush the stale TLB entries.

To test the patchset, make the usemem.c from
vm-scalability (https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git).
support calling fork()/exec() periodically.  To mimic the behavior of
the customer workload, run usemem with 4 threads, access 100GB memory,
and call fork()/exec() every 40 seconds.  Test results show that with
the patchset the score of usemem improves ~40.6%.  The cycles% of TLB
flush functions reduces from ~50.5% to ~0.3% in perf profile.

Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Will Deacon <will@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Christoph Lameter (Ampere) <cl@gentwo.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Yin Fengwei <fengwei_yin@linux.alibaba.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Reviewed-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 16:01:48 +00:00
Linus Torvalds
5bebe8de19 mm/huge_memory: Fix initialization of huge zero folio
The recent fix to properly initialize the tags of the huge zero folio
had an unfortunate not-so-subtle side effect: it caused the actual
*contents* of the huge zero folio to not be initialized at all when the
hardware didn't support the memory tagging.

The reason was the unfortunate semantics of tag_clear_highpage(): on
hardware that didn't do the tagging, it would silently just not do
anything at all.  And since this is done only on arm64 with MTE support,
that basically meant most hardware.

It wasn't necessarily immediately obvious since the huge zero page isn't
necessarily very heavily used - or because it might already be zero
because all-zeroes is the most common pattern.  But it ends up causing
random odd user space failures when you do hit it.

The unfortunate semantics have been around for a while, but became a
real bug only when we started actively using __GFP_ZEROTAGS in the
generic get_huge_zero_folio() function - before that, it had only ever
been used in code that checked that the hardware supported it.

Fix this by simply changing the semantics of tag_clear_highpage() to
return whether it actually successfully did something or not.  While at
it, also make it initialize multiple pages in one go, since that's
actually what the only caller wants it to do and it simplifies the whole
logic.

Fixes: adfb6609c6 ("mm/huge_memory: initialise the tags of the huge zero folio")
Link: https://lore.kernel.org/all/20251117082023.90176-1-00107082@163.com/
Reviewed-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reported-and-tested-by: David Wang <00107082@163.com>
Reported-and-tested-by: Carlos Llamas <cmllamas@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-11-18 08:21:27 -08:00
Catalin Marinas
adfb6609c6 mm/huge_memory: initialise the tags of the huge zero folio
On arm64 with MTE enabled, a page mapped as Normal Tagged (PROT_MTE) in
user space will need to have its allocation tags initialised.  This is
normally done in the arm64 set_pte_at() after checking the memory
attributes.  Such page is also marked with the PG_mte_tagged flag to avoid
subsequent clearing.  Since this relies on having a struct page,
pte_special() mappings are ignored.

Commit d82d09e482 ("mm/huge_memory: mark PMD mappings of the huge zero
folio special") maps the huge zero folio special and the arm64
set_pmd_at() will no longer zero the tags.  There is no guarantee that the
tags are zero, especially if parts of this huge page have been previously
tagged.

It's fairly easy to detect this by regularly dropping the caches to
force the reallocation of the huge zero folio.

Allocate the huge zero folio with the __GFP_ZEROTAGS flag.  In addition,
do not warn in the arm64 __access_remote_tags() when reading tags from the
huge zero page.

I bundled the arm64 change in here as well since they are both related to
the commit mapping the huge zero folio as special.

[catalin.marinas@arm.com: handle arch mte_zero_clear_page_tags() code issuing MTE instructions]
  Link: https://lkml.kernel.org/r/aQi8dA_QpXM8XqrE@arm.com
Link: https://lkml.kernel.org/r/20251031170133.280742-1-catalin.marinas@arm.com
Fixes: d82d09e482 ("mm/huge_memory: mark PMD mappings of the huge zero folio special")
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Tested-by: Beleswar Padhi <b-padhi@ti.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Aishwarya TCV <aishwarya.tcv@arm.com>
Cc: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-09 21:19:46 -08:00
Linus Torvalds
beace86e61 Summary of significant series in this pull request:
- The 4 patch series "mm: ksm: prevent KSM from breaking merging of new
   VMAs" from Lorenzo Stoakes addresses an issue with KSM's
   PR_SET_MEMORY_MERGE mode: newly mapped VMAs were not eligible for
   merging with existing adjacent VMAs.
 
 - The 4 patch series "mm/damon: introduce DAMON_STAT for simple and
   practical access monitoring" from SeongJae Park adds a new kernel module
   which simplifies the setup and usage of DAMON in production
   environments.
 
 - The 6 patch series "stop passing a writeback_control to swap/shmem
   writeout" from Christoph Hellwig is a cleanup to the writeback code
   which removes a couple of pointers from struct writeback_control.
 
 - The 7 patch series "drivers/base/node.c: optimization and cleanups"
   from Donet Tom contains largely uncorrelated cleanups to the NUMA node
   setup and management code.
 
 - The 4 patch series "mm: userfaultfd: assorted fixes and cleanups" from
   Tal Zussman does some maintenance work on the userfaultfd code.
 
 - The 5 patch series "Readahead tweaks for larger folios" from Ryan
   Roberts implements some tuneups for pagecache readahead when it is
   reading into order>0 folios.
 
 - The 4 patch series "selftests/mm: Tweaks to the cow test" from Mark
   Brown provides some cleanups and consistency improvements to the
   selftests code.
 
 - The 4 patch series "Optimize mremap() for large folios" from Dev Jain
   does that.  A 37% reduction in execution time was measured in a
   memset+mremap+munmap microbenchmark.
 
 - The 5 patch series "Remove zero_user()" from Matthew Wilcox expunges
   zero_user() in favor of the more modern memzero_page().
 
 - The 3 patch series "mm/huge_memory: vmf_insert_folio_*() and
   vmf_insert_pfn_pud() fixes" from David Hildenbrand addresses some warts
   which David noticed in the huge page code.  These were not known to be
   causing any issues at this time.
 
 - The 3 patch series "mm/damon: use alloc_migrate_target() for
   DAMOS_MIGRATE_{HOT,COLD" from SeongJae Park provides some cleanup and
   consolidation work in DAMON.
 
 - The 3 patch series "use vm_flags_t consistently" from Lorenzo Stoakes
   uses vm_flags_t in places where we were inappropriately using other
   types.
 
 - The 3 patch series "mm/memfd: Reserve hugetlb folios before
   allocation" from Vivek Kasireddy increases the reliability of large page
   allocation in the memfd code.
 
 - The 14 patch series "mm: Remove pXX_devmap page table bit and pfn_t
   type" from Alistair Popple removes several now-unneeded PFN_* flags.
 
 - The 5 patch series "mm/damon: decouple sysfs from core" from SeongJae
   Park implememnts some cleanup and maintainability work in the DAMON
   sysfs layer.
 
 - The 5 patch series "madvise cleanup" from Lorenzo Stoakes does quite a
   lot of cleanup/maintenance work in the madvise() code.
 
 - The 4 patch series "madvise anon_name cleanups" from Vlastimil Babka
   provides additional cleanups on top or Lorenzo's effort.
 
 - The 11 patch series "Implement numa node notifier" from Oscar Salvador
   creates a standalone notifier for NUMA node memory state changes.
   Previously these were lumped under the more general memory on/offline
   notifier.
 
 - The 6 patch series "Make MIGRATE_ISOLATE a standalone bit" from Zi Yan
   cleans up the pageblock isolation code and fixes a potential issue which
   doesn't seem to cause any problems in practice.
 
 - The 5 patch series "selftests/damon: add python and drgn based DAMON
   sysfs functionality tests" from SeongJae Park adds additional drgn- and
   python-based DAMON selftests which are more comprehensive than the
   existing selftest suite.
 
 - The 5 patch series "Misc rework on hugetlb faulting path" from Oscar
   Salvador fixes a rather obscure deadlock in the hugetlb fault code and
   follows that fix with a series of cleanups.
 
 - The 3 patch series "cma: factor out allocation logic from
   __cma_declare_contiguous_nid" from Mike Rapoport rationalizes and cleans
   up the highmem-specific code in the CMA allocator.
 
 - The 28 patch series "mm/migration: rework movable_ops page migration
   (part 1)" from David Hildenbrand provides cleanups and
   future-preparedness to the migration code.
 
 - The 2 patch series "mm/damon: add trace events for auto-tuned
   monitoring intervals and DAMOS quota" from SeongJae Park adds some
   tracepoints to some DAMON auto-tuning code.
 
 - The 6 patch series "mm/damon: fix misc bugs in DAMON modules" from
   SeongJae Park does that.
 
 - The 6 patch series "mm/damon: misc cleanups" from SeongJae Park also
   does what it claims.
 
 - The 4 patch series "mm: folio_pte_batch() improvements" from David
   Hildenbrand cleans up the large folio PTE batching code.
 
 - The 13 patch series "mm/damon/vaddr: Allow interleaving in
   migrate_{hot,cold} actions" from SeongJae Park facilitates dynamic
   alteration of DAMON's inter-node allocation policy.
 
 - The 3 patch series "Remove unmap_and_put_page()" from Vishal Moola
   provides a couple of page->folio conversions.
 
 - The 4 patch series "mm: per-node proactive reclaim" from Davidlohr
   Bueso implements a per-node control of proactive reclaim - beyond the
   current memcg-based implementation.
 
 - The 14 patch series "mm/damon: remove damon_callback" from SeongJae
   Park replaces the damon_callback interface with a more general and
   powerful damon_call()+damos_walk() interface.
 
 - The 10 patch series "mm/mremap: permit mremap() move of multiple VMAs"
   from Lorenzo Stoakes implements a number of mremap cleanups (of course)
   in preparation for adding new mremap() functionality: newly permit the
   remapping of multiple VMAs when the user is specifying MREMAP_FIXED.  It
   still excludes some specialized situations where this cannot be
   performed reliably.
 
 - The 3 patch series "drop hugetlb_free_pgd_range()" from Anthony Yznaga
   switches some sparc hugetlb code over to the generic version and removes
   the thus-unneeded hugetlb_free_pgd_range().
 
 - The 4 patch series "mm/damon/sysfs: support periodic and automated
   stats update" from SeongJae Park augments the present
   userspace-requested update of DAMON sysfs monitoring files.  Automatic
   update is now provided, along with a tunable to control the update
   interval.
 
 - The 4 patch series "Some randome fixes and cleanups to swapfile" from
   Kemeng Shi does what is claims.
 
 - The 4 patch series "mm: introduce snapshot_page" from Luiz Capitulino
   and David Hildenbrand provides (and uses) a means by which debug-style
   functions can grab a copy of a pageframe and inspect it locklessly
   without tripping over the races inherent in operating on the live
   pageframe directly.
 
 - The 6 patch series "use per-vma locks for /proc/pid/maps reads" from
   Suren Baghdasaryan addresses the large contention issues which can be
   triggered by reads from that procfs file.  Latencies are reduced by more
   than half in some situations.  The series also introduces several new
   selftests for the /proc/pid/maps interface.
 
 - The 6 patch series "__folio_split() clean up" from Zi Yan cleans up
   __folio_split()!
 
 - The 7 patch series "Optimize mprotect() for large folios" from Dev
   Jain provides some quite large (>3x) speedups to mprotect() when dealing
   with large folios.
 
 - The 2 patch series "selftests/mm: reuse FORCE_READ to replace "asm
   volatile("" : "+r" (XXX));" and some cleanup" from wang lian does some
   cleanup work in the selftests code.
 
 - The 3 patch series "tools/testing: expand mremap testing" from Lorenzo
   Stoakes extends the mremap() selftest in several ways, including adding
   more checking of Lorenzo's recently added "permit mremap() move of
   multiple VMAs" feature.
 
 - The 22 patch series "selftests/damon/sysfs.py: test all parameters"
   from SeongJae Park extends the DAMON sysfs interface selftest so that it
   tests all possible user-requested parameters.  Rather than the present
   minimal subset.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaIqcCgAKCRDdBJ7gKXxA
 jkVBAQCCn9DR1QP0CRk961ot0cKzOgioSc0aA03DPb2KXRt2kQEAzDAz0ARurFhL
 8BzbvI0c+4tntHLXvIlrC33n9KWAOQM=
 =XsFy
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "As usual, many cleanups. The below blurbiage describes 42 patchsets.
  21 of those are partially or fully cleanup work. "cleans up",
  "cleanup", "maintainability", "rationalizes", etc.

  I never knew the MM code was so dirty.

  "mm: ksm: prevent KSM from breaking merging of new VMAs" (Lorenzo Stoakes)
     addresses an issue with KSM's PR_SET_MEMORY_MERGE mode: newly
     mapped VMAs were not eligible for merging with existing adjacent
     VMAs.

  "mm/damon: introduce DAMON_STAT for simple and practical access monitoring" (SeongJae Park)
     adds a new kernel module which simplifies the setup and usage of
     DAMON in production environments.

  "stop passing a writeback_control to swap/shmem writeout" (Christoph Hellwig)
     is a cleanup to the writeback code which removes a couple of
     pointers from struct writeback_control.

  "drivers/base/node.c: optimization and cleanups" (Donet Tom)
     contains largely uncorrelated cleanups to the NUMA node setup and
     management code.

  "mm: userfaultfd: assorted fixes and cleanups" (Tal Zussman)
     does some maintenance work on the userfaultfd code.

  "Readahead tweaks for larger folios" (Ryan Roberts)
     implements some tuneups for pagecache readahead when it is reading
     into order>0 folios.

  "selftests/mm: Tweaks to the cow test" (Mark Brown)
     provides some cleanups and consistency improvements to the
     selftests code.

  "Optimize mremap() for large folios" (Dev Jain)
     does that. A 37% reduction in execution time was measured in a
     memset+mremap+munmap microbenchmark.

  "Remove zero_user()" (Matthew Wilcox)
     expunges zero_user() in favor of the more modern memzero_page().

  "mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes" (David Hildenbrand)
     addresses some warts which David noticed in the huge page code.
     These were not known to be causing any issues at this time.

  "mm/damon: use alloc_migrate_target() for DAMOS_MIGRATE_{HOT,COLD" (SeongJae Park)
     provides some cleanup and consolidation work in DAMON.

  "use vm_flags_t consistently" (Lorenzo Stoakes)
     uses vm_flags_t in places where we were inappropriately using other
     types.

  "mm/memfd: Reserve hugetlb folios before allocation" (Vivek Kasireddy)
     increases the reliability of large page allocation in the memfd
     code.

  "mm: Remove pXX_devmap page table bit and pfn_t type" (Alistair Popple)
     removes several now-unneeded PFN_* flags.

  "mm/damon: decouple sysfs from core" (SeongJae Park)
     implememnts some cleanup and maintainability work in the DAMON
     sysfs layer.

  "madvise cleanup" (Lorenzo Stoakes)
     does quite a lot of cleanup/maintenance work in the madvise() code.

  "madvise anon_name cleanups" (Vlastimil Babka)
     provides additional cleanups on top or Lorenzo's effort.

  "Implement numa node notifier" (Oscar Salvador)
     creates a standalone notifier for NUMA node memory state changes.
     Previously these were lumped under the more general memory
     on/offline notifier.

  "Make MIGRATE_ISOLATE a standalone bit" (Zi Yan)
     cleans up the pageblock isolation code and fixes a potential issue
     which doesn't seem to cause any problems in practice.

  "selftests/damon: add python and drgn based DAMON sysfs functionality tests" (SeongJae Park)
     adds additional drgn- and python-based DAMON selftests which are
     more comprehensive than the existing selftest suite.

  "Misc rework on hugetlb faulting path" (Oscar Salvador)
     fixes a rather obscure deadlock in the hugetlb fault code and
     follows that fix with a series of cleanups.

  "cma: factor out allocation logic from __cma_declare_contiguous_nid" (Mike Rapoport)
     rationalizes and cleans up the highmem-specific code in the CMA
     allocator.

  "mm/migration: rework movable_ops page migration (part 1)" (David Hildenbrand)
     provides cleanups and future-preparedness to the migration code.

  "mm/damon: add trace events for auto-tuned monitoring intervals and DAMOS quota" (SeongJae Park)
     adds some tracepoints to some DAMON auto-tuning code.

  "mm/damon: fix misc bugs in DAMON modules" (SeongJae Park)
     does that.

  "mm/damon: misc cleanups" (SeongJae Park)
     also does what it claims.

  "mm: folio_pte_batch() improvements" (David Hildenbrand)
     cleans up the large folio PTE batching code.

  "mm/damon/vaddr: Allow interleaving in migrate_{hot,cold} actions" (SeongJae Park)
     facilitates dynamic alteration of DAMON's inter-node allocation
     policy.

  "Remove unmap_and_put_page()" (Vishal Moola)
     provides a couple of page->folio conversions.

  "mm: per-node proactive reclaim" (Davidlohr Bueso)
     implements a per-node control of proactive reclaim - beyond the
     current memcg-based implementation.

  "mm/damon: remove damon_callback" (SeongJae Park)
     replaces the damon_callback interface with a more general and
     powerful damon_call()+damos_walk() interface.

  "mm/mremap: permit mremap() move of multiple VMAs" (Lorenzo Stoakes)
     implements a number of mremap cleanups (of course) in preparation
     for adding new mremap() functionality: newly permit the remapping
     of multiple VMAs when the user is specifying MREMAP_FIXED. It still
     excludes some specialized situations where this cannot be performed
     reliably.

  "drop hugetlb_free_pgd_range()" (Anthony Yznaga)
     switches some sparc hugetlb code over to the generic version and
     removes the thus-unneeded hugetlb_free_pgd_range().

  "mm/damon/sysfs: support periodic and automated stats update" (SeongJae Park)
     augments the present userspace-requested update of DAMON sysfs
     monitoring files. Automatic update is now provided, along with a
     tunable to control the update interval.

  "Some randome fixes and cleanups to swapfile" (Kemeng Shi)
     does what is claims.

  "mm: introduce snapshot_page" (Luiz Capitulino and David Hildenbrand)
     provides (and uses) a means by which debug-style functions can grab
     a copy of a pageframe and inspect it locklessly without tripping
     over the races inherent in operating on the live pageframe
     directly.

  "use per-vma locks for /proc/pid/maps reads" (Suren Baghdasaryan)
     addresses the large contention issues which can be triggered by
     reads from that procfs file. Latencies are reduced by more than
     half in some situations. The series also introduces several new
     selftests for the /proc/pid/maps interface.

  "__folio_split() clean up" (Zi Yan)
     cleans up __folio_split()!

  "Optimize mprotect() for large folios" (Dev Jain)
     provides some quite large (>3x) speedups to mprotect() when dealing
     with large folios.

  "selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));" and some cleanup" (wang lian)
     does some cleanup work in the selftests code.

  "tools/testing: expand mremap testing" (Lorenzo Stoakes)
     extends the mremap() selftest in several ways, including adding
     more checking of Lorenzo's recently added "permit mremap() move of
     multiple VMAs" feature.

  "selftests/damon/sysfs.py: test all parameters" (SeongJae Park)
     extends the DAMON sysfs interface selftest so that it tests all
     possible user-requested parameters. Rather than the present minimal
     subset"

* tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (370 commits)
  MAINTAINERS: add missing headers to mempory policy & migration section
  MAINTAINERS: add missing file to cgroup section
  MAINTAINERS: add MM MISC section, add missing files to MISC and CORE
  MAINTAINERS: add missing zsmalloc file
  MAINTAINERS: add missing files to page alloc section
  MAINTAINERS: add missing shrinker files
  MAINTAINERS: move memremap.[ch] to hotplug section
  MAINTAINERS: add missing mm_slot.h file THP section
  MAINTAINERS: add missing interval_tree.c to memory mapping section
  MAINTAINERS: add missing percpu-internal.h file to per-cpu section
  mm/page_alloc: remove trace_mm_alloc_contig_migrate_range_info()
  selftests/damon: introduce _common.sh to host shared function
  selftests/damon/sysfs.py: test runtime reduction of DAMON parameters
  selftests/damon/sysfs.py: test non-default parameters runtime commit
  selftests/damon/sysfs.py: generalize DAMON context commit assertion
  selftests/damon/sysfs.py: generalize monitoring attributes commit assertion
  selftests/damon/sysfs.py: generalize DAMOS schemes commit assertion
  selftests/damon/sysfs.py: test DAMOS filters commitment
  selftests/damon/sysfs.py: generalize DAMOS scheme commit assertion
  selftests/damon/sysfs.py: test DAMOS destinations commitment
  ...
2025-07-31 14:57:54 -07:00
Linus Torvalds
6fb44438a5 arm64 updates for 6.17:
Perf and PMU updates:
 
  - Add support for new (v3) Hisilicon SLLC and DDRC PMUs
 
  - Add support for Arm-NI PMU integrations that share interrupts between
    clock domains within a given instance
 
  - Allow SPE to be configured with a lower sample period than the
    minimum recommendation advertised by PMSIDR_EL1.Interval
 
  - Add suppport for Arm's "Branch Record Buffer Extension" (BRBE)
 
  - Adjust the perf watchdog period according to cpu frequency changes
 
  - Minor driver fixes and cleanups
 
 Hardware features:
 
  - Support for MTE store-only checking (FEAT_MTE_STORE_ONLY)
 
  - Support for reporting the non-address bits during a synchronous MTE
    tag check fault (FEAT_MTE_TAGGED_FAR)
 
  - Optimise the TLBI when folding/unfolding contiguous PTEs on hardware
    with FEAT_BBM (break-before-make) level 2 and no TLB conflict aborts
 
 Software features:
 
  - Enable HAVE_LIVEPATCH after implementing arch_stack_walk_reliable()
    and using the text-poke API for late module relocations
 
  - Force VMAP_STACK always on and change arm64_efi_rt_init() to use
    arch_alloc_vmap_stack() in order to avoid KASAN false positives
 
 ACPI:
 
  - Improve SPCR handling and messaging on systems lacking an SPCR table
 
 Debug:
 
  - Simplify the debug exception entry path
 
  - Drop redundant DBG_MDSCR_* macros
 
 Kselftests:
 
  - Cleanups and improvements for SME, SVE and FPSIMD tests
 
 Miscellaneous:
 
  - Optimise loop to reduce redundant operations in contpte_ptep_get()
 
  - Remove ISB when resetting POR_EL0 during signal handling
 
  - Mark the kernel as tainted on SEA and SError panic
 
  - Remove redundant gcs_free() call
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAmiDkgoACgkQa9axLQDI
 XvFucQ//bYugRP5/Sdlrq5eDKWBGi1HufYzwfDEBLc4S75Eu8mGL/tuThfu9yFn+
 qCowtt4U84HdWsZDTSVo6lym6v2vJUpGOMgXzepvJaFBRnqGv9X9NxH6RQO1LTnu
 Pm7rO+7I9tNpfuc7Zu9pHDggsJEw+WzVfmEF6WPSFlT9mUNv6NbSx4rbLQKU86Dm
 ouTqXaePEQZ5oiRXVasxyT0otGtiACD20WpgOtNjYGzsfUVwCf/C83V/2DLwwbhr
 9cW9lCtFxA/yFdQcA9ThRzWZ9Eo5LAHqjGIq00+zOjuzgDbBtcTT79gpChkhovIR
 FBIsWHd9j9i3nYxzf4V4eRKQnyqS3NQWv7g7uKFwNgARif1Zk0VJ77QIlAYk5xLI
 ENTRjLKz5WNGGnhdkeCvDlVyxX+OktgcVTp3vqRxAKCRahMMUqBrwxiM8RzVF37e
 yzkEQayL8F7uZqy9H7Sjn48UpHZux6frJ1bBQw1oEvR9QmAoAdqavPMSAYIOT3Zr
 ze4WIljq/cFr3kBPIFP5pK1e0qYMHXZpSKIm8MAv6y/7KmQuVbMjZthpuPbLSIw0
 Q7C0KalB8lToPIbO7qMni/he0dCN4K2+E1YHFTR+pzfcoLuW4rjSg7i8tqMLKMJ8
 H+SeGLyPtM5A6bdAPTTpqefcgUUe7064ENUqrGUpDEynGXA7boE=
 =5h1C
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Catalin Marinas:
 "A quick summary: perf support for Branch Record Buffer Extensions
  (BRBE), typical PMU hardware updates, small additions to MTE for
  store-only tag checking and exposing non-address bits to signal
  handlers, HAVE_LIVEPATCH enabled on arm64, VMAP_STACK forced on.

  There is also a TLBI optimisation on hardware that does not require
  break-before-make when changing the user PTEs between contiguous and
  non-contiguous.

  More details:

  Perf and PMU updates:

   - Add support for new (v3) Hisilicon SLLC and DDRC PMUs

   - Add support for Arm-NI PMU integrations that share interrupts
     between clock domains within a given instance

   - Allow SPE to be configured with a lower sample period than the
     minimum recommendation advertised by PMSIDR_EL1.Interval

   - Add suppport for Arm's "Branch Record Buffer Extension" (BRBE)

   - Adjust the perf watchdog period according to cpu frequency changes

   - Minor driver fixes and cleanups

  Hardware features:

   - Support for MTE store-only checking (FEAT_MTE_STORE_ONLY)

   - Support for reporting the non-address bits during a synchronous MTE
     tag check fault (FEAT_MTE_TAGGED_FAR)

   - Optimise the TLBI when folding/unfolding contiguous PTEs on
     hardware with FEAT_BBM (break-before-make) level 2 and no TLB
     conflict aborts

  Software features:

   - Enable HAVE_LIVEPATCH after implementing arch_stack_walk_reliable()
     and using the text-poke API for late module relocations

   - Force VMAP_STACK always on and change arm64_efi_rt_init() to use
     arch_alloc_vmap_stack() in order to avoid KASAN false positives

  ACPI:

   - Improve SPCR handling and messaging on systems lacking an SPCR
     table

  Debug:

   - Simplify the debug exception entry path

   - Drop redundant DBG_MDSCR_* macros

  Kselftests:

   - Cleanups and improvements for SME, SVE and FPSIMD tests

  Miscellaneous:

   - Optimise loop to reduce redundant operations in contpte_ptep_get()

   - Remove ISB when resetting POR_EL0 during signal handling

   - Mark the kernel as tainted on SEA and SError panic

   - Remove redundant gcs_free() call"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (93 commits)
  arm64/gcs: task_gcs_el0_enable() should use passed task
  arm64: Kconfig: Keep selects somewhat alphabetically ordered
  arm64: signal: Remove ISB when resetting POR_EL0
  kselftest/arm64: Handle attempts to disable SM on SME only systems
  kselftest/arm64: Fix SVE write data generation for SME only systems
  kselftest/arm64: Test SME on SME only systems in fp-ptrace
  kselftest/arm64: Test FPSIMD format data writes via NT_ARM_SVE in fp-ptrace
  kselftest/arm64: Allow sve-ptrace to run on SME only systems
  arm64/mm: Drop redundant addr increment in set_huge_pte_at()
  kselftest/arm4: Provide local defines for AT_HWCAP3
  arm64: Mark kernel as tainted on SAE and SError panic
  arm64/gcs: Don't call gcs_free() when releasing task_struct
  drivers/perf: hisi: Support PMUs with no interrupt
  drivers/perf: hisi: Relax the event number check of v2 PMUs
  drivers/perf: hisi: Add support for HiSilicon SLLC v3 PMU driver
  drivers/perf: hisi: Use ACPI driver_data to retrieve SLLC PMU information
  drivers/perf: hisi: Add support for HiSilicon DDRC v3 PMU driver
  drivers/perf: hisi: Simplify the probe process for each DDRC version
  perf/arm-ni: Support sharing IRQs within an NI instance
  perf/arm-ni: Consolidate CPU affinity handling
  ...
2025-07-29 20:21:54 -07:00
Catalin Marinas
3ae8cef210 Merge branches 'for-next/livepatch', 'for-next/user-contig-bbml2', 'for-next/misc', 'for-next/acpi', 'for-next/debug-entry', 'for-next/feat_mte_tagged_far', 'for-next/kselftest', 'for-next/mdscr-cleanup' and 'for-next/vmap-stack', remote-tracking branch 'arm64/for-next/perf' into for-next/core
* arm64/for-next/perf: (23 commits)
  drivers/perf: hisi: Support PMUs with no interrupt
  drivers/perf: hisi: Relax the event number check of v2 PMUs
  drivers/perf: hisi: Add support for HiSilicon SLLC v3 PMU driver
  drivers/perf: hisi: Use ACPI driver_data to retrieve SLLC PMU information
  drivers/perf: hisi: Add support for HiSilicon DDRC v3 PMU driver
  drivers/perf: hisi: Simplify the probe process for each DDRC version
  perf/arm-ni: Support sharing IRQs within an NI instance
  perf/arm-ni: Consolidate CPU affinity handling
  perf/cxlpmu: Fix typos in cxl_pmu.c comments and documentation
  perf/cxlpmu: Remove unintended newline from IRQ name format string
  perf/cxlpmu: Fix devm_kcalloc() argument order in cxl_pmu_probe()
  perf: arm_spe: Relax period restriction
  perf: arm_pmuv3: Add support for the Branch Record Buffer Extension (BRBE)
  KVM: arm64: nvhe: Disable branch generation in nVHE guests
  arm64: Handle BRBE booting requirements
  arm64/sysreg: Add BRBE registers and fields
  perf/arm: Add missing .suppress_bind_attrs
  perf/arm-cmn: Reduce stack usage during discovery
  perf: imx9_perf: make the read-only array mask static const
  perf/arm-cmn: Broaden module description for wider interconnect support
  ...

* for-next/livepatch:
  : Support for HAVE_LIVEPATCH on arm64
  arm64: Kconfig: Keep selects somewhat alphabetically ordered
  arm64: Implement HAVE_LIVEPATCH
  arm64: stacktrace: Implement arch_stack_walk_reliable()
  arm64: stacktrace: Check kretprobe_find_ret_addr() return value
  arm64/module: Use text-poke API for late relocations.

* for-next/user-contig-bbml2:
  : Optimise the TLBI when folding/unfolding contigous PTEs on hardware with BBML2 and no TLB conflict aborts
  arm64/mm: Elide tlbi in contpte_convert() under BBML2
  iommu/arm: Add BBM Level 2 smmu feature
  arm64: Add BBM Level 2 cpu feature
  arm64: cpufeature: Introduce MATCH_ALL_EARLY_CPUS capability type

* for-next/misc:
  : Miscellaneous arm64 patches
  arm64/gcs: task_gcs_el0_enable() should use passed task
  arm64: signal: Remove ISB when resetting POR_EL0
  arm64/mm: Drop redundant addr increment in set_huge_pte_at()
  arm64: Mark kernel as tainted on SAE and SError panic
  arm64/gcs: Don't call gcs_free() when releasing task_struct
  arm64: fix unnecessary rebuilding when CONFIG_DEBUG_EFI=y
  arm64/mm: Optimize loop to reduce redundant operations of contpte_ptep_get
  arm64: pi: use 'targets' instead of extra-y in Makefile

* for-next/acpi:
  : Various ACPI arm64 changes
  ACPI: Suppress misleading SPCR console message when SPCR table is absent
  ACPI: Return -ENODEV from acpi_parse_spcr() when SPCR support is disabled

* for-next/debug-entry:
  : Simplify the debug exception entry path
  arm64: debug: remove debug exception registration infrastructure
  arm64: debug: split bkpt32 exception entry
  arm64: debug: split brk64 exception entry
  arm64: debug: split hardware watchpoint exception entry
  arm64: debug: split single stepping exception entry
  arm64: debug: refactor reinstall_suspended_bps()
  arm64: debug: split hardware breakpoint exception entry
  arm64: entry: Add entry and exit functions for debug exceptions
  arm64: debug: remove break/step handler registration infrastructure
  arm64: debug: call step handlers statically
  arm64: debug: call software breakpoint handlers statically
  arm64: refactor aarch32_break_handler()
  arm64: debug: clean up single_step_handler logic

* for-next/feat_mte_tagged_far:
  : Support for reporting the non-address bits during a synchronous MTE tag check fault
  kselftest/arm64/mte: Add mtefar tests on check_mmap_options
  kselftest/arm64/mte: Refactor check_mmap_option test
  kselftest/arm64/mte: Add verification for address tag in signal handler
  kselftest/arm64/mte: Add address tag related macro and function
  kselftest/arm64/mte: Check MTE_FAR feature is supported
  kselftest/arm64/mte: Register mte signal handler with SA_EXPOSE_TAGBITS
  kselftest/arm64: Add MTE_FAR hwcap test
  KVM: arm64: Expose FEAT_MTE_TAGGED_FAR feature to guest
  arm64: Report address tag when FEAT_MTE_TAGGED_FAR is supported
  arm64/cpufeature: Add FEAT_MTE_TAGGED_FAR feature

* for-next/kselftest:
  : Kselftest updates for arm64
  kselftest/arm64: Handle attempts to disable SM on SME only systems
  kselftest/arm64: Fix SVE write data generation for SME only systems
  kselftest/arm64: Test SME on SME only systems in fp-ptrace
  kselftest/arm64: Test FPSIMD format data writes via NT_ARM_SVE in fp-ptrace
  kselftest/arm64: Allow sve-ptrace to run on SME only systems
  kselftest/arm4: Provide local defines for AT_HWCAP3
  kselftest/arm64: Specify SVE data when testing VL set in sve-ptrace
  kselftest/arm64: Fix test for streaming FPSIMD write in sve-ptrace
  kselftest/arm64: Fix check for setting new VLs in sve-ptrace
  kselftest/arm64: Convert tpidr2 test to use kselftest.h

* for-next/mdscr-cleanup:
  : Drop redundant DBG_MDSCR_* macros
  KVM: selftests: Change MDSCR_EL1 register holding variables as uint64_t
  arm64/debug: Drop redundant DBG_MDSCR_* macros

* for-next/vmap-stack:
  : Force VMAP_STACK on arm64
  arm64: remove CONFIG_VMAP_STACK checks from entry code
  arm64: remove CONFIG_VMAP_STACK checks from SDEI stack handling
  arm64: remove CONFIG_VMAP_STACK checks from stacktrace overflow logic
  arm64: remove CONFIG_VMAP_STACK conditionals from traps overflow stack
  arm64: remove CONFIG_VMAP_STACK conditionals from irq stack setup
  arm64: Remove CONFIG_VMAP_STACK conditionals from THREAD_SHIFT and THREAD_ALIGN
  arm64: efi: Remove CONFIG_VMAP_STACK check
  arm64: Mandate VMAP_STACK
  arm64: efi: Fix KASAN false positive for EFI runtime stack
  arm64/ptrace: Fix stack-out-of-bounds read in regs_get_kernel_stack_nth()
  arm64/gcs: Don't call gcs_free() during flush_gcs()
  arm64: Restrict pagetable teardown to avoid false warning
  docs: arm64: Fix ICC_SRE_EL2 register typo in booting.rst
2025-07-24 16:01:22 +01:00
Breno Leitao
d7ce7e3a84 arm64: Mark kernel as tainted on SAE and SError panic
Set TAINT_MACHINE_CHECK when SError or Synchronous External Abort (SEA)
interrupts trigger a panic to flag potential hardware faults. This
tainting mechanism aids in debugging and enables correlation of
hardware-related crashes in large-scale deployments.

This change aligns with similar patches[1] that mark machine check
events when the system crashes due to hardware errors.

Link: https://lore.kernel.org/all/20250702-add_tain-v1-1-9187b10914b9@debian.org/ [1]
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20250716-vmcore_hw_error-v2-1-f187f7d62aba@debian.org
Signed-off-by: Will Deacon <will@kernel.org>
2025-07-17 11:07:15 +01:00
Lorenzo Stoakes
d75fa3c947 mm: update architecture and driver code to use vm_flags_t
In future we intend to change the vm_flags_t type, so it isn't correct for
architecture and driver code to assume it is unsigned long.  Correct this
assumption across the board.

Overall, this patch does not introduce any functional change.

Link: https://lkml.kernel.org/r/b6eb1894abc5555ece80bb08af5c022ef780c8bc.1750274467.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09 22:42:14 -07:00
Ada Couprie Diaz
a8b8cce9d9 arm64: debug: remove debug exception registration infrastructure
Now that debug exceptions are handled individually and without the need
for dynamic registration, remove the unused registration infrastructure.

This removes the external caller for `debug_exception_enter()` and
`debug_exception_exit()`.
Make them static again and remove them from the header.

Remove `early_brk64()` as it has been made redundant by
(arm64: debug: split brk64 exception entry) and is not used anymore.
Note : in `early_brk64()` `bug_brk_handler()` is called unconditionally
as a fall-through, but now `call_break_hook()` only calls it if the
immediate matches.
This does not change the behaviour in early boot, as if
`bug_brk_handler()` was called on a non-BUG immediate it would return
DBG_HOOK_ERROR anyway, which `call_break_hook()` will do if no immediate
matches.

Remove `trap_init()`, as it would be empty and a weak definition already
exists in `init/main.c`.

Signed-off-by: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Tested-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Reviewed-by: Will Deacon <will@kernel.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20250707114109.35672-14-ada.coupriediaz@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-07-08 13:27:42 +01:00
Ada Couprie Diaz
eaff68b328 arm64: entry: Add entry and exit functions for debug exceptions
Move the `debug_exception_enter()` and `debug_exception_exit()`
functions from mm/fault.c, as they are needed to split
the debug exceptions entry paths from the current unified one.

Make them externally visible in include/asm/exception.h until
the caller in mm/fault.c is cleaned up.

Signed-off-by: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Tested-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Will Deacon <will@kernel.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20250707114109.35672-7-ada.coupriediaz@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-07-08 13:27:41 +01:00
Kevin Brodsky
22f3a4f608 arm64: poe: Handle spurious Overlay faults
We do not currently issue an ISB after updating POR_EL0 when
context-switching it, for instance. The rationale is that if the old
value of POR_EL0 is more restrictive and causes a fault during
uaccess, the access will be retried [1]. In other words, we are
trading an ISB on every context-switching for the (unlikely)
possibility of a spurious fault. We may also miss faults if the new
value of POR_EL0 is more restrictive, but that's considered
acceptable.

However, as things stand, a spurious Overlay fault results in
uaccess failing right away since it causes fault_from_pkey() to
return true. If an Overlay fault is reported, we therefore need to
double check POR_EL0 against vma_pkey(vma) - this is what
arch_vma_access_permitted() already does.

As it turns out, we already perform that explicit check if no
Overlay fault is reported, and we need to keep that check (see
comment added in fault_from_pkey()). Net result: the Overlay ISS2
bit isn't of much help to decide whether a pkey fault occurred.

Remove the check for the Overlay bit from fault_from_pkey() and
add a comment to try and explain the situation. While at it, also
add a comment to permission_overlay_switch() in case anyone gets
surprised by the lack of ISB.

[1] https://lore.kernel.org/linux-arm-kernel/ZtYNGBrcE-j35fpw@arm.com/

Fixes: 160a8e13de ("arm64: context switch POR_EL0 register")
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Link: https://lore.kernel.org/r/20250619160042.2499290-2-kevin.brodsky@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-07-04 16:40:38 +01:00
Yeoreum Yun
7c7f55039b arm64: Report address tag when FEAT_MTE_TAGGED_FAR is supported
If FEAT_MTE_TAGGED_FAR (Armv8.9) is supported, bits 63:60 of the fault address
are preserved in response to synchronous tag check faults (SEGV_MTESERR).

This patch modifies below to support this feature:
  - Use the original FAR_EL1 value when an MTE tag check fault occurs,
    if ARM64_MTE_FAR is supported so that not only logical tag
    (bits 59:56) but also address tag (bits 63:60] being reported too.

  - Add HWCAP for mtefar to let user know bits 63:60 includes
    address tag information when when FEAT_MTE_TAGGED_FAR is supported.

Applications that require this information should install
a signal handler with the SA_EXPOSE_TAGBITS flag.
While this introduces a minor ABI change,
most applications do not set this flag and therefore will not be affected.

Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
Link: https://lore.kernel.org/r/20250618084513.1761345-3-yeoreum.yun@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-07-02 17:44:17 +01:00
Kristina Martšenko
04a9f771d8 arm64: mm: Handle PAN faults on uaccess CPY* instructions
A subsequent patch will use CPY* instructions to copy between user and
kernel memory. Add handling for PAN faults caused by an intended kernel
memory access erroneously accessing user memory, in order to make it
easier to debug kernel bugs and to keep the same behavior as with
regular loads/stores.

Signed-off-by: Kristina Martšenko <kristina.martsenko@arm.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/20250228170006.390100-3-kristina.martsenko@arm.com
[catalin.marinas@arm.com: Folded the extable search into insn_may_access_user()]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-03-07 18:28:29 +00:00
Kristina Martšenko
653884f887 arm64: extable: Add fixup handling for uaccess CPY* instructions
A subsequent patch will use CPY* instructions to copy between user and
kernel memory. Add a new exception fixup type to avoid fixing up faults
on kernel memory accesses, in order to make it easier to debug kernel
bugs and to keep the same behavior as with regular loads/stores.

Signed-off-by: Kristina Martšenko <kristina.martsenko@arm.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/20250228170006.390100-2-kristina.martsenko@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-03-07 16:18:06 +00:00
Linus Torvalds
5c00ff742b - The series "zram: optimal post-processing target selection" from
Sergey Senozhatsky improves zram's post-processing selection algorithm.
   This leads to improved memory savings.
 
 - Wei Yang has gone to town on the mapletree code, contributing several
   series which clean up the implementation:
 
 	- "refine mas_mab_cp()"
 	- "Reduce the space to be cleared for maple_big_node"
 	- "maple_tree: simplify mas_push_node()"
 	- "Following cleanup after introduce mas_wr_store_type()"
 	- "refine storing null"
 
 - The series "selftests/mm: hugetlb_fault_after_madv improvements" from
   David Hildenbrand fixes this selftest for s390.
 
 - The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng
   implements some rationaizations and cleanups in the page mapping code.
 
 - The series "mm: optimize shadow entries removal" from Shakeel Butt
   optimizes the file truncation code by speeding up the handling of shadow
   entries.
 
 - The series "Remove PageKsm()" from Matthew Wilcox completes the
   migration of this flag over to being a folio-based flag.
 
 - The series "Unify hugetlb into arch_get_unmapped_area functions" from
   Oscar Salvador implements a bunch of consolidations and cleanups in the
   hugetlb code.
 
 - The series "Do not shatter hugezeropage on wp-fault" from Dev Jain
   takes away the wp-fault time practice of turning a huge zero page into
   small pages.  Instead we replace the whole thing with a THP.  More
   consistent cleaner and potentiall saves a large number of pagefaults.
 
 - The series "percpu: Add a test case and fix for clang" from Andy
   Shevchenko enhances and fixes the kernel's built in percpu test code.
 
 - The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett
   optimizes mremap() by avoiding doing things which we didn't need to do.
 
 - The series "Improve the tmpfs large folio read performance" from
   Baolin Wang teaches tmpfs to copy data into userspace at the folio size
   rather than as individual pages.  A 20% speedup was observed.
 
 - The series "mm/damon/vaddr: Fix issue in
   damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON splitting.
 
 - The series "memcg-v1: fully deprecate charge moving" from Shakeel Butt
   removes the long-deprecated memcgv2 charge moving feature.
 
 - The series "fix error handling in mmap_region() and refactor" from
   Lorenzo Stoakes cleanup up some of the mmap() error handling and
   addresses some potential performance issues.
 
 - The series "x86/module: use large ROX pages for text allocations" from
   Mike Rapoport teaches x86 to use large pages for read-only-execute
   module text.
 
 - The series "page allocation tag compression" from Suren Baghdasaryan
   is followon maintenance work for the new page allocation profiling
   feature.
 
 - The series "page->index removals in mm" from Matthew Wilcox remove
   most references to page->index in mm/.  A slow march towards shrinking
   struct page.
 
 - The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs
   interface tests" from Andrew Paniakin performs maintenance work for
   DAMON's self testing code.
 
 - The series "mm: zswap swap-out of large folios" from Kanchana Sridhar
   improves zswap's batching of compression and decompression.  It is a
   step along the way towards using Intel IAA hardware acceleration for
   this zswap operation.
 
 - The series "kasan: migrate the last module test to kunit" from
   Sabyrzhan Tasbolatov completes the migration of the KASAN built-in tests
   over to the KUnit framework.
 
 - The series "implement lightweight guard pages" from Lorenzo Stoakes
   permits userapace to place fault-generating guard pages within a single
   VMA, rather than requiring that multiple VMAs be created for this.
   Improved efficiencies for userspace memory allocators are expected.
 
 - The series "memcg: tracepoint for flushing stats" from JP Kobryn uses
   tracepoints to provide increased visibility into memcg stats flushing
   activity.
 
 - The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky
   fixes a zram buglet which potentially affected performance.
 
 - The series "mm: add more kernel parameters to control mTHP" from
   Maíra Canal enhances our ability to control/configuremultisize THP from
   the kernel boot command line.
 
 - The series "kasan: few improvements on kunit tests" from Sabyrzhan
   Tasbolatov has a couple of fixups for the KASAN KUnit tests.
 
 - The series "mm/list_lru: Split list_lru lock into per-cgroup scope"
   from Kairui Song optimizes list_lru memory utilization when lockdep is
   enabled.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZzwFqgAKCRDdBJ7gKXxA
 jkeuAQCkl+BmeYHE6uG0hi3pRxkupseR6DEOAYIiTv0/l8/GggD/Z3jmEeqnZaNq
 xyyenpibWgUoShU2wZ/Ha8FE5WDINwg=
 =JfWR
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - The series "zram: optimal post-processing target selection" from
   Sergey Senozhatsky improves zram's post-processing selection
   algorithm. This leads to improved memory savings.

 - Wei Yang has gone to town on the mapletree code, contributing several
   series which clean up the implementation:
	- "refine mas_mab_cp()"
	- "Reduce the space to be cleared for maple_big_node"
	- "maple_tree: simplify mas_push_node()"
	- "Following cleanup after introduce mas_wr_store_type()"
	- "refine storing null"

 - The series "selftests/mm: hugetlb_fault_after_madv improvements" from
   David Hildenbrand fixes this selftest for s390.

 - The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng
   implements some rationaizations and cleanups in the page mapping
   code.

 - The series "mm: optimize shadow entries removal" from Shakeel Butt
   optimizes the file truncation code by speeding up the handling of
   shadow entries.

 - The series "Remove PageKsm()" from Matthew Wilcox completes the
   migration of this flag over to being a folio-based flag.

 - The series "Unify hugetlb into arch_get_unmapped_area functions" from
   Oscar Salvador implements a bunch of consolidations and cleanups in
   the hugetlb code.

 - The series "Do not shatter hugezeropage on wp-fault" from Dev Jain
   takes away the wp-fault time practice of turning a huge zero page
   into small pages. Instead we replace the whole thing with a THP. More
   consistent cleaner and potentiall saves a large number of pagefaults.

 - The series "percpu: Add a test case and fix for clang" from Andy
   Shevchenko enhances and fixes the kernel's built in percpu test code.

 - The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett
   optimizes mremap() by avoiding doing things which we didn't need to
   do.

 - The series "Improve the tmpfs large folio read performance" from
   Baolin Wang teaches tmpfs to copy data into userspace at the folio
   size rather than as individual pages. A 20% speedup was observed.

 - The series "mm/damon/vaddr: Fix issue in
   damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON
   splitting.

 - The series "memcg-v1: fully deprecate charge moving" from Shakeel
   Butt removes the long-deprecated memcgv2 charge moving feature.

 - The series "fix error handling in mmap_region() and refactor" from
   Lorenzo Stoakes cleanup up some of the mmap() error handling and
   addresses some potential performance issues.

 - The series "x86/module: use large ROX pages for text allocations"
   from Mike Rapoport teaches x86 to use large pages for
   read-only-execute module text.

 - The series "page allocation tag compression" from Suren Baghdasaryan
   is followon maintenance work for the new page allocation profiling
   feature.

 - The series "page->index removals in mm" from Matthew Wilcox remove
   most references to page->index in mm/. A slow march towards shrinking
   struct page.

 - The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs
   interface tests" from Andrew Paniakin performs maintenance work for
   DAMON's self testing code.

 - The series "mm: zswap swap-out of large folios" from Kanchana Sridhar
   improves zswap's batching of compression and decompression. It is a
   step along the way towards using Intel IAA hardware acceleration for
   this zswap operation.

 - The series "kasan: migrate the last module test to kunit" from
   Sabyrzhan Tasbolatov completes the migration of the KASAN built-in
   tests over to the KUnit framework.

 - The series "implement lightweight guard pages" from Lorenzo Stoakes
   permits userapace to place fault-generating guard pages within a
   single VMA, rather than requiring that multiple VMAs be created for
   this. Improved efficiencies for userspace memory allocators are
   expected.

 - The series "memcg: tracepoint for flushing stats" from JP Kobryn uses
   tracepoints to provide increased visibility into memcg stats flushing
   activity.

 - The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky
   fixes a zram buglet which potentially affected performance.

 - The series "mm: add more kernel parameters to control mTHP" from
   Maíra Canal enhances our ability to control/configuremultisize THP
   from the kernel boot command line.

 - The series "kasan: few improvements on kunit tests" from Sabyrzhan
   Tasbolatov has a couple of fixups for the KASAN KUnit tests.

 - The series "mm/list_lru: Split list_lru lock into per-cgroup scope"
   from Kairui Song optimizes list_lru memory utilization when lockdep
   is enabled.

* tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (215 commits)
  cma: enforce non-zero pageblock_order during cma_init_reserved_mem()
  mm/kfence: add a new kunit test test_use_after_free_read_nofault()
  zram: fix NULL pointer in comp_algorithm_show()
  memcg/hugetlb: add hugeTLB counters to memcg
  vmstat: call fold_vm_zone_numa_events() before show per zone NUMA event
  mm: mmap_lock: check trace_mmap_lock_$type_enabled() instead of regcount
  zram: ZRAM_DEF_COMP should depend on ZRAM
  MAINTAINERS/MEMORY MANAGEMENT: add document files for mm
  Docs/mm/damon: recommend academic papers to read and/or cite
  mm: define general function pXd_init()
  kmemleak: iommu/iova: fix transient kmemleak false positive
  mm/list_lru: simplify the list_lru walk callback function
  mm/list_lru: split the lock to per-cgroup scope
  mm/list_lru: simplify reparenting and initial allocation
  mm/list_lru: code clean up for reparenting
  mm/list_lru: don't export list_lru_add
  mm/list_lru: don't pass unnecessary key parameters
  kasan: add kunit tests for kmalloc_track_caller, kmalloc_node_track_caller
  kasan: change kasan_atomics kunit test as KUNIT_CASE_SLOW
  kasan: use EXPORT_SYMBOL_IF_KUNIT to export symbols
  ...
2024-11-23 09:58:07 -08:00
Kefeng Wang
6359c39c9d mm: remove unused hugepage for vma_alloc_folio()
The hugepage parameter was deprecated since commit ddc1a5cbc0
("mempolicy: alloc_pages_mpol() for NUMA policy without vma"), for
PMD-sized THP, it still tries only preferred node if possible in
vma_alloc_folio() by checking the order of the folio allocation.

Link: https://lkml.kernel.org/r/20241010061556.1846751-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06 20:11:12 -08:00
Mark Brown
cfad706e8f arm64/mm: Handle GCS data aborts
All GCS operations at EL0 must happen on a page which is marked as
having UnprivGCS access, including read operations.  If a GCS operation
attempts to access a page without this then it will generate a data
abort with the GCS bit set in ESR_EL1.ISS2.

EL0 may validly generate such faults, for example due to copy on write
which will cause the GCS data to be stored in a read only page with no
GCS permissions until the actual copy happens.  Since UnprivGCS allows
both reads and writes to the GCS (though only through GCS operations) we
need to ensure that the memory management subsystem handles GCS accesses
as writes at all times.  Do this by adding FAULT_FLAG_WRITE to any GCS
page faults, adding handling to ensure that invalid cases are identfied
as such early so the memory management core does not think they will
succeed.  The core cannot distinguish between VMAs which are generally
writeable and VMAs which are only writeable through GCS operations.

EL1 may validly write to EL0 GCS for management purposes (eg, while
initialising with cap tokens).

We also report any GCS faults in VMAs not marked as part of a GCS as
access violations, causing a fault to be delivered to userspace if it
attempts to do GCS operations outside a GCS.

Reviewed-by: Thiago Jung Bauermann <thiago.bauermann@linaro.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20241001-arm64-gcs-v13-20-222b78d87eee@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-04 12:04:38 +01:00
Joey Gouly
7f0ab60763 arm64: handle PKEY/POE faults
If a memory fault occurs that is due to an overlay/pkey fault, report that to
userspace with a SEGV_PKUERR.

Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20240822151113.1479789-17-joey.gouly@arm.com
[will: Add ESR.FSC check to data abort handler]
Signed-off-by: Will Deacon <will@kernel.org>
2024-09-04 12:53:44 +01:00
Kefeng Wang
eebb5181a0 arm64: mm: drop VM_FAULT_BADMAP/VM_FAULT_BADACCESS
Patch series "mm: remove arch's private VM_FAULT_BADMAP/BADACCESS", v2.

Directly set SEGV_MAPRR or SEGV_ACCERR for arm/arm64 to remove the last
two arch's private vm_fault reasons.  


This patch (of 2):

If bad map or access, directly set si_code to SEGV_MAPRR or SEGV_ACCERR,
also set fault to 0 and goto error handling, which make us to drop the
arch's special vm fault reason.

Link: https://lkml.kernel.org/r/20240411130925.73281-1-wangkefeng.wang@huawei.com
Link: https://lkml.kernel.org/r/20240411130925.73281-2-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Aishwarya TCV <aishwarya.tcv@arm.com>
Cc: Cristian Marussi <cristian.marussi@arm.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05 17:53:32 -07:00
Kefeng Wang
faab3d0f25 arm64: mm: accelerate pagefault when VM_FAULT_BADACCESS
The vm_flags of vma already checked under per-VMA lock, if it is a bad
access, directly set fault to VM_FAULT_BADACCESS and handle error, no need
to retry with mmap_lock again, the latency time reduces 34% in 'lat_sig -P
1 prot lat_sig' from lmbench testcase.

Since the page fault is handled under per-VMA lock, count it as a vma lock
event with VMA_LOCK_SUCCESS.

Link: https://lkml.kernel.org/r/20240403083805.1818160-3-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:38 -07:00
Kefeng Wang
6ea02ee489 arm64: mm: cleanup __do_page_fault()
Patch series "arch/mm/fault: accelerate pagefault when badaccess", v2.

After VMA lock-based page fault handling enabled, if bad access met
under per-vma lock, it will fallback to mmap_lock-based handling,
so it leads to unnessary mmap lock and vma find again. A test from
lmbench shows 34% improve after this changes on arm64,

  lat_sig -P 1 prot lat_sig 0.29194 -> 0.19198


This patch (of 7):

The __do_page_fault() only calls handle_mm_fault() after vm_flags checked,
and it is only called by do_page_fault(), let's squash it into
do_page_fault() to cleanup code.

Link: https://lkml.kernel.org/r/20240403083805.1818160-1-wangkefeng.wang@huawei.com
Link: https://lkml.kernel.org/r/20240403083805.1818160-2-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:38 -07:00
Linus Torvalds
902861e34c - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames
from hotplugged memory rather than only from main memory.  Series
   "implement "memmap on memory" feature on s390".
 
 - More folio conversions from Matthew Wilcox in the series
 
 	"Convert memcontrol charge moving to use folios"
 	"mm: convert mm counter to take a folio"
 
 - Chengming Zhou has optimized zswap's rbtree locking, providing
   significant reductions in system time and modest but measurable
   reductions in overall runtimes.  The series is "mm/zswap: optimize the
   scalability of zswap rb-tree".
 
 - Chengming Zhou has also provided the series "mm/zswap: optimize zswap
   lru list" which provides measurable runtime benefits in some
   swap-intensive situations.
 
 - And Chengming Zhou further optimizes zswap in the series "mm/zswap:
   optimize for dynamic zswap_pools".  Measured improvements are modest.
 
 - zswap cleanups and simplifications from Yosry Ahmed in the series "mm:
   zswap: simplify zswap_swapoff()".
 
 - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has
   contributed several DAX cleanups as well as adding a sysfs tunable to
   control the memmap_on_memory setting when the dax device is hotplugged
   as system memory.
 
 - Johannes Weiner has added the large series "mm: zswap: cleanups",
   which does that.
 
 - More DAMON work from SeongJae Park in the series
 
 	"mm/damon: make DAMON debugfs interface deprecation unignorable"
 	"selftests/damon: add more tests for core functionalities and corner cases"
 	"Docs/mm/damon: misc readability improvements"
 	"mm/damon: let DAMOS feeds and tame/auto-tune itself"
 
 - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs
   extension" Rakie Kim has developed a new mempolicy interleaving policy
   wherein we allocate memory across nodes in a weighted fashion rather
   than uniformly.  This is beneficial in heterogeneous memory environments
   appearing with CXL.
 
 - Christophe Leroy has contributed some cleanup and consolidation work
   against the ARM pagetable dumping code in the series "mm: ptdump:
   Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute".
 
 - Luis Chamberlain has added some additional xarray selftesting in the
   series "test_xarray: advanced API multi-index tests".
 
 - Muhammad Usama Anjum has reworked the selftest code to make its
   human-readable output conform to the TAP ("Test Anything Protocol")
   format.  Amongst other things, this opens up the use of third-party
   tools to parse and process out selftesting results.
 
 - Ryan Roberts has added fork()-time PTE batching of THP ptes in the
   series "mm/memory: optimize fork() with PTE-mapped THP".  Mainly
   targeted at arm64, this significantly speeds up fork() when the process
   has a large number of pte-mapped folios.
 
 - David Hildenbrand also gets in on the THP pte batching game in his
   series "mm/memory: optimize unmap/zap with PTE-mapped THP".  It
   implements batching during munmap() and other pte teardown situations.
   The microbenchmark improvements are nice.
 
 - And in the series "Transparent Contiguous PTEs for User Mappings" Ryan
   Roberts further utilizes arm's pte's contiguous bit ("contpte
   mappings").  Kernel build times on arm64 improved nicely.  Ryan's series
   "Address some contpte nits" provides some followup work.
 
 - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has
   fixed an obscure hugetlb race which was causing unnecessary page faults.
   He has also added a reproducer under the selftest code.
 
 - In the series "selftests/mm: Output cleanups for the compaction test",
   Mark Brown did what the title claims.
 
 - Kinsey Ho has added the series "mm/mglru: code cleanup and refactoring".
 
 - Even more zswap material from Nhat Pham.  The series "fix and extend
   zswap kselftests" does as claimed.
 
 - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX
   regression" Mathieu Desnoyers has cleaned up and fixed rather a mess in
   our handling of DAX on archiecctures which have virtually aliasing data
   caches.  The arm architecture is the main beneficiary.
 
 - Lokesh Gidra's series "per-vma locks in userfaultfd" provides dramatic
   improvements in worst-case mmap_lock hold times during certain
   userfaultfd operations.
 
 - Some page_owner enhancements and maintenance work from Oscar Salvador
   in his series
 
 	"page_owner: print stacks and their outstanding allocations"
 	"page_owner: Fixup and cleanup"
 
 - Uladzislau Rezki has contributed some vmalloc scalability improvements
   in his series "Mitigate a vmap lock contention".  It realizes a 12x
   improvement for a certain microbenchmark.
 
 - Some kexec/crash cleanup work from Baoquan He in the series "Split
   crash out from kexec and clean up related config items".
 
 - Some zsmalloc maintenance work from Chengming Zhou in the series
 
 	"mm/zsmalloc: fix and optimize objects/page migration"
 	"mm/zsmalloc: some cleanup for get/set_zspage_mapping()"
 
 - Zi Yan has taught the MM to perform compaction on folios larger than
   order=0.  This a step along the path to implementaton of the merging of
   large anonymous folios.  The series is named "Enable >0 order folio
   memory compaction".
 
 - Christoph Hellwig has done quite a lot of cleanup work in the
   pagecache writeback code in his series "convert write_cache_pages() to
   an iterator".
 
 - Some modest hugetlb cleanups and speedups in Vishal Moola's series
   "Handle hugetlb faults under the VMA lock".
 
 - Zi Yan has changed the page splitting code so we can split huge pages
   into sizes other than order-0 to better utilize large folios.  The
   series is named "Split a folio to any lower order folios".
 
 - David Hildenbrand has contributed the series "mm: remove
   total_mapcount()", a cleanup.
 
 - Matthew Wilcox has sought to improve the performance of bulk memory
   freeing in his series "Rearrange batched folio freeing".
 
 - Gang Li's series "hugetlb: parallelize hugetlb page init on boot"
   provides large improvements in bootup times on large machines which are
   configured to use large numbers of hugetlb pages.
 
 - Matthew Wilcox's series "PageFlags cleanups" does that.
 
 - Qi Zheng's series "minor fixes and supplement for ptdesc" does that
   also.  S390 is affected.
 
 - Cleanups to our pagemap utility functions from Peter Xu in his series
   "mm/treewide: Replace pXd_large() with pXd_leaf()".
 
 - Nico Pache has fixed a few things with our hugepage selftests in his
   series "selftests/mm: Improve Hugepage Test Handling in MM Selftests".
 
 - Also, of course, many singleton patches to many things.  Please see
   the individual changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZfJpPQAKCRDdBJ7gKXxA
 joxeAP9TrcMEuHnLmBlhIXkWbIR4+ki+pA3v+gNTlJiBhnfVSgD9G55t1aBaRplx
 TMNhHfyiHYDTx/GAV9NXW84tasJSDgA=
 =TG55
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames
   from hotplugged memory rather than only from main memory. Series
   "implement "memmap on memory" feature on s390".

 - More folio conversions from Matthew Wilcox in the series

	"Convert memcontrol charge moving to use folios"
	"mm: convert mm counter to take a folio"

 - Chengming Zhou has optimized zswap's rbtree locking, providing
   significant reductions in system time and modest but measurable
   reductions in overall runtimes. The series is "mm/zswap: optimize the
   scalability of zswap rb-tree".

 - Chengming Zhou has also provided the series "mm/zswap: optimize zswap
   lru list" which provides measurable runtime benefits in some
   swap-intensive situations.

 - And Chengming Zhou further optimizes zswap in the series "mm/zswap:
   optimize for dynamic zswap_pools". Measured improvements are modest.

 - zswap cleanups and simplifications from Yosry Ahmed in the series
   "mm: zswap: simplify zswap_swapoff()".

 - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has
   contributed several DAX cleanups as well as adding a sysfs tunable to
   control the memmap_on_memory setting when the dax device is
   hotplugged as system memory.

 - Johannes Weiner has added the large series "mm: zswap: cleanups",
   which does that.

 - More DAMON work from SeongJae Park in the series

	"mm/damon: make DAMON debugfs interface deprecation unignorable"
	"selftests/damon: add more tests for core functionalities and corner cases"
	"Docs/mm/damon: misc readability improvements"
	"mm/damon: let DAMOS feeds and tame/auto-tune itself"

 - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs
   extension" Rakie Kim has developed a new mempolicy interleaving
   policy wherein we allocate memory across nodes in a weighted fashion
   rather than uniformly. This is beneficial in heterogeneous memory
   environments appearing with CXL.

 - Christophe Leroy has contributed some cleanup and consolidation work
   against the ARM pagetable dumping code in the series "mm: ptdump:
   Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute".

 - Luis Chamberlain has added some additional xarray selftesting in the
   series "test_xarray: advanced API multi-index tests".

 - Muhammad Usama Anjum has reworked the selftest code to make its
   human-readable output conform to the TAP ("Test Anything Protocol")
   format. Amongst other things, this opens up the use of third-party
   tools to parse and process out selftesting results.

 - Ryan Roberts has added fork()-time PTE batching of THP ptes in the
   series "mm/memory: optimize fork() with PTE-mapped THP". Mainly
   targeted at arm64, this significantly speeds up fork() when the
   process has a large number of pte-mapped folios.

 - David Hildenbrand also gets in on the THP pte batching game in his
   series "mm/memory: optimize unmap/zap with PTE-mapped THP". It
   implements batching during munmap() and other pte teardown
   situations. The microbenchmark improvements are nice.

 - And in the series "Transparent Contiguous PTEs for User Mappings"
   Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte
   mappings"). Kernel build times on arm64 improved nicely. Ryan's
   series "Address some contpte nits" provides some followup work.

 - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has
   fixed an obscure hugetlb race which was causing unnecessary page
   faults. He has also added a reproducer under the selftest code.

 - In the series "selftests/mm: Output cleanups for the compaction
   test", Mark Brown did what the title claims.

 - Kinsey Ho has added the series "mm/mglru: code cleanup and
   refactoring".

 - Even more zswap material from Nhat Pham. The series "fix and extend
   zswap kselftests" does as claimed.

 - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX
   regression" Mathieu Desnoyers has cleaned up and fixed rather a mess
   in our handling of DAX on archiecctures which have virtually aliasing
   data caches. The arm architecture is the main beneficiary.

 - Lokesh Gidra's series "per-vma locks in userfaultfd" provides
   dramatic improvements in worst-case mmap_lock hold times during
   certain userfaultfd operations.

 - Some page_owner enhancements and maintenance work from Oscar Salvador
   in his series

	"page_owner: print stacks and their outstanding allocations"
	"page_owner: Fixup and cleanup"

 - Uladzislau Rezki has contributed some vmalloc scalability
   improvements in his series "Mitigate a vmap lock contention". It
   realizes a 12x improvement for a certain microbenchmark.

 - Some kexec/crash cleanup work from Baoquan He in the series "Split
   crash out from kexec and clean up related config items".

 - Some zsmalloc maintenance work from Chengming Zhou in the series

	"mm/zsmalloc: fix and optimize objects/page migration"
	"mm/zsmalloc: some cleanup for get/set_zspage_mapping()"

 - Zi Yan has taught the MM to perform compaction on folios larger than
   order=0. This a step along the path to implementaton of the merging
   of large anonymous folios. The series is named "Enable >0 order folio
   memory compaction".

 - Christoph Hellwig has done quite a lot of cleanup work in the
   pagecache writeback code in his series "convert write_cache_pages()
   to an iterator".

 - Some modest hugetlb cleanups and speedups in Vishal Moola's series
   "Handle hugetlb faults under the VMA lock".

 - Zi Yan has changed the page splitting code so we can split huge pages
   into sizes other than order-0 to better utilize large folios. The
   series is named "Split a folio to any lower order folios".

 - David Hildenbrand has contributed the series "mm: remove
   total_mapcount()", a cleanup.

 - Matthew Wilcox has sought to improve the performance of bulk memory
   freeing in his series "Rearrange batched folio freeing".

 - Gang Li's series "hugetlb: parallelize hugetlb page init on boot"
   provides large improvements in bootup times on large machines which
   are configured to use large numbers of hugetlb pages.

 - Matthew Wilcox's series "PageFlags cleanups" does that.

 - Qi Zheng's series "minor fixes and supplement for ptdesc" does that
   also. S390 is affected.

 - Cleanups to our pagemap utility functions from Peter Xu in his series
   "mm/treewide: Replace pXd_large() with pXd_leaf()".

 - Nico Pache has fixed a few things with our hugepage selftests in his
   series "selftests/mm: Improve Hugepage Test Handling in MM
   Selftests".

 - Also, of course, many singleton patches to many things. Please see
   the individual changelogs for details.

* tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits)
  mm/zswap: remove the memcpy if acomp is not sleepable
  crypto: introduce: acomp_is_async to expose if comp drivers might sleep
  memtest: use {READ,WRITE}_ONCE in memory scanning
  mm: prohibit the last subpage from reusing the entire large folio
  mm: recover pud_leaf() definitions in nopmd case
  selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements
  selftests/mm: skip uffd hugetlb tests with insufficient hugepages
  selftests/mm: dont fail testsuite due to a lack of hugepages
  mm/huge_memory: skip invalid debugfs new_order input for folio split
  mm/huge_memory: check new folio order when split a folio
  mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure
  mm: add an explicit smp_wmb() to UFFDIO_CONTINUE
  mm: fix list corruption in put_pages_list
  mm: remove folio from deferred split list before uncharging it
  filemap: avoid unnecessary major faults in filemap_fault()
  mm,page_owner: drop unnecessary check
  mm,page_owner: check for null stack_record before bumping its refcount
  mm: swap: fix race between free_swap_and_cache() and swapoff()
  mm/treewide: align up pXd_leaf() retval across archs
  mm/treewide: drop pXd_large()
  ...
2024-03-14 17:43:30 -07:00
Ryan Roberts
5a00bfd6a5 arm64/mm: new ptep layer to manage contig bit
Create a new layer for the in-table PTE manipulation APIs.  For now, The
existing API is prefixed with double underscore to become the arch-private
API and the public API is just a simple wrapper that calls the private
API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate.  But since there are
already some contig-aware users (e.g.  hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

The following APIs are treated this way:

 - ptep_get
 - set_pte
 - set_ptes
 - pte_clear
 - ptep_get_and_clear
 - ptep_test_and_clear_young
 - ptep_clear_flush_young
 - ptep_set_wrprotect
 - ptep_set_access_flags

Link: https://lkml.kernel.org/r/20240215103205.2607016-11-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Barry Song <21cnbao@gmail.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morse <james.morse@arm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22 15:27:18 -08:00
Ryan Roberts
659e193027 arm64/mm: convert set_pte_at() to set_ptes(..., 1)
Since set_ptes() was introduced, set_pte_at() has been implemented as a
generic macro around set_ptes(..., 1).  So this change should continue to
generate the same code.  However, making this change prepares us for the
transparent contpte support.  It means we can reroute set_ptes() to
__set_ptes().  Since set_pte_at() is a generic macro, there will be no
equivalent __set_pte_at() to reroute to.

Note that a couple of calls to set_pte_at() remain in the arch code.  This
is intentional, since those call sites are acting on behalf of core-mm and
should continue to call into the public set_ptes() rather than the
arch-private __set_ptes().

Link: https://lkml.kernel.org/r/20240215103205.2607016-9-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Barry Song <21cnbao@gmail.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morse <james.morse@arm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22 15:27:18 -08:00
Ryan Roberts
532736558e arm64/mm: convert READ_ONCE(*ptep) to ptep_get(ptep)
There are a number of places in the arch code that read a pte by using the
READ_ONCE() macro.  Refactor these call sites to instead use the
ptep_get() helper, which itself is a READ_ONCE().  Generated code should
be the same.

This will benefit us when we shortly introduce the transparent contpte
support.  In this case, ptep_get() will become more complex so we now have
all the code abstracted through it.

Link: https://lkml.kernel.org/r/20240215103205.2607016-8-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Barry Song <21cnbao@gmail.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morse <james.morse@arm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22 15:27:18 -08:00
Ard Biesheuvel
7ac8d5b242 arm64: Add ESR decoding for exceptions involving translation level -1
The LPA2 feature introduces new FSC values to report abort exceptions
related to translation level -1. Define these and wire them up.

Reuse the new ESR FSC classification helpers that arrived via the KVM
arm64 tree, and update the one for translation faults to check
specifically for a translation fault at level -1. (Access flag or
permission faults cannot occur at level -1 because they alway involve a
descriptor at the superior level so changing those helpers is not
needed).

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20240214122845.2033971-73-ardb+git@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-02-16 12:42:37 +00:00
Suren Baghdasaryan
46e714c729 arch/mm/fault: fix major fault accounting when retrying under per-VMA lock
A test [1] in Android test suite started failing after [2] was merged.  It
turns out that after handling a major fault under per-VMA lock, the
process major fault counter does not register that fault as major.  Before
[2] read faults would be done under mmap_lock, in which case
FAULT_FLAG_TRIED flag is set before retrying.  That in turn causes
mm_account_fault() to account the fault as major once retry completes. 
With per-VMA locks we often retry because a fault can't be handled without
locking the whole mm using mmap_lock.  Therefore such retries do not set
FAULT_FLAG_TRIED flag.  This logic does not work after [2] because we can
now handle read major faults under per-VMA lock and upon retry the fact
there was a major fault gets lost.  Fix this by setting FAULT_FLAG_TRIED
after retrying under per-VMA lock if VM_FAULT_MAJOR was returned.  Ideally
we would use an additional VM_FAULT bit to indicate the reason for the
retry (could not handle under per-VMA lock vs other reason) but this
simpler solution seems to work, so keeping it simple.

[1] https://cs.android.com/android/platform/superproject/+/master:test/vts-testcase/kernel/api/drop_caches_prop/drop_caches_test.cpp
[2] https://lore.kernel.org/all/20231006195318.4087158-6-willy@infradead.org/

Link: https://lkml.kernel.org/r/20231226214610.109282-1-surenb@google.com
Fixes: 12214eba19 ("mm: handle read faults under the VMA lock")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29 11:06:49 -08:00
Mark Rutland
4e00f1d9b7 arm64: Avoid cpus_have_const_cap() for ARM64_HAS_EPAN
We use cpus_have_const_cap() to check for ARM64_HAS_EPAN but this is not
necessary and alternative_has_cap() or cpus_have_cap() would be
preferable.

For historical reasons, cpus_have_const_cap() is more complicated than
it needs to be. Before cpucaps are finalized, it will perform a bitmap
test of the system_cpucaps bitmap, and once cpucaps are finalized it
will use an alternative branch. This used to be necessary to handle some
race conditions in the window between cpucap detection and the
subsequent patching of alternatives and static branches, where different
branches could be out-of-sync with one another (or w.r.t. alternative
sequences). Now that we use alternative branches instead of static
branches, these are all patched atomically w.r.t. one another, and there
are only a handful of cases that need special care in the window between
cpucap detection and alternative patching.

Due to the above, it would be nice to remove cpus_have_const_cap(), and
migrate callers over to alternative_has_cap_*(), cpus_have_final_cap(),
or cpus_have_cap() depending on when their requirements. This will
remove redundant instructions and improve code generation, and will make
it easier to determine how each callsite will behave before, during, and
after alternative patching.

The ARM64_HAS_EPAN cpucap is used to affect two things:

1) The permision bits used for userspace executable mappings, which are
   chosen by adjust_protection_map(), which is an arch_initcall. This is
   called after the ARM64_HAS_EPAN cpucap has been detected and
   alternatives have been patched, and before any userspace translation
   tables exist.

2) The handling of faults taken from (user or kernel) accesses to
   userspace executable mappings in do_page_fault(). Userspace
   translation tables are created after adjust_protection_map() is
   called, and hence after the ARM64_HAS_EPAN cpucap has been detected
   and alternatives have been patched.

Neither of these run until after ARM64_HAS_EPAN cpucap has been detected
and alternatives have been patched, and hence there's no need to use
cpus_have_const_cap(). Since adjust_protection_map() is only executed
once at boot time it would be best for it to use cpus_have_cap(), and
since do_page_fault() is executed frequently it would be best for it to
use alternatives_have_cap_unlikely().

This patch replaces the uses of cpus_have_const_cap() with
cpus_have_cap() and alternative_has_cap_unlikely(), which will avoid
generating redundant code, and should be better for all subsequent calls
at runtime. The ARM64_HAS_EPAN cpucap is added to cpucap_is_possible()
so that code can be elided entirely when this is not possible.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Vladimir Murzin <vladimir.murzin@arm.com>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2023-10-16 14:17:04 +01:00
Suren Baghdasaryan
4089eef0e6 mm: drop per-VMA lock when returning VM_FAULT_RETRY or VM_FAULT_COMPLETED
handle_mm_fault returning VM_FAULT_RETRY or VM_FAULT_COMPLETED means
mmap_lock has been released.  However with per-VMA locks behavior is
different and the caller should still release it.  To make the rules
consistent for the caller, drop the per-VMA lock when returning
VM_FAULT_RETRY or VM_FAULT_COMPLETED.  Currently the only path returning
VM_FAULT_RETRY under per-VMA locks is do_swap_page and no path returns
VM_FAULT_COMPLETED for now.

[willy@infradead.org: fix riscv]
  Link: https://lkml.kernel.org/r/CAJuCfpE6GWEx1rPBmNpUfoD5o-gNFz9-UFywzCE2PbEGBiVz7g@mail.gmail.com
Link: https://lkml.kernel.org/r/20230630211957.1341547-4-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Tested-by: Conor Dooley <conor.dooley@microchip.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hillf Danton <hdanton@sina.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michel Lespinasse <michel@lespinasse.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-24 16:20:17 -07:00
Matthew Wilcox (Oracle)
284e059204 mm: remove CONFIG_PER_VMA_LOCK ifdefs
Patch series "Handle most file-backed faults under the VMA lock", v3.

This patchset adds the ability to handle page faults on parts of files
which are already in the page cache without taking the mmap lock.


This patch (of 10):

Provide lock_vma_under_rcu() when CONFIG_PER_VMA_LOCK is not defined to
eliminate ifdefs in the users.

Link: https://lkml.kernel.org/r/20230724185410.1124082-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20230724185410.1124082-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18 10:12:50 -07:00
SeongJae Park
24be4d0b46 arch/arm64/mm/fault: Fix undeclared variable error in do_page_fault()
Commit ae870a68b5 ("arm64/mm: Convert to using
lock_mm_and_find_vma()") made do_page_fault() to use 'vma' even if
CONFIG_PER_VMA_LOCK is not defined, but the declaration is still in the
ifdef.

As a result, building kernel without the config fails with undeclared
variable error as below:

    arch/arm64/mm/fault.c: In function 'do_page_fault':
    arch/arm64/mm/fault.c:624:2: error: 'vma' undeclared (first use in this function); did you mean 'vmap'?
      624 |  vma = lock_mm_and_find_vma(mm, addr, regs);
          |  ^~~
          |  vmap

Fix it by moving the declaration out of the ifdef.

Fixes: ae870a68b5 ("arm64/mm: Convert to using lock_mm_and_find_vma()")
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-07-03 19:04:32 -07:00
Linus Torvalds
9471f1f2f5 Merge branch 'expand-stack'
This modifies our user mode stack expansion code to always take the
mmap_lock for writing before modifying the VM layout.

It's actually something we always technically should have done, but
because we didn't strictly need it, we were being lazy ("opportunistic"
sounds so much better, doesn't it?) about things, and had this hack in
place where we would extend the stack vma in-place without doing the
proper locking.

And it worked fine.  We just needed to change vm_start (or, in the case
of grow-up stacks, vm_end) and together with some special ad-hoc locking
using the anon_vma lock and the mm->page_table_lock, it all was fairly
straightforward.

That is, it was all fine until Ruihan Li pointed out that now that the
vma layout uses the maple tree code, we *really* don't just change
vm_start and vm_end any more, and the locking really is broken.  Oops.

It's not actually all _that_ horrible to fix this once and for all, and
do proper locking, but it's a bit painful.  We have basically three
different cases of stack expansion, and they all work just a bit
differently:

 - the common and obvious case is the page fault handling. It's actually
   fairly simple and straightforward, except for the fact that we have
   something like 24 different versions of it, and you end up in a maze
   of twisty little passages, all alike.

 - the simplest case is the execve() code that creates a new stack.
   There are no real locking concerns because it's all in a private new
   VM that hasn't been exposed to anybody, but lockdep still can end up
   unhappy if you get it wrong.

 - and finally, we have GUP and page pinning, which shouldn't really be
   expanding the stack in the first place, but in addition to execve()
   we also use it for ptrace(). And debuggers do want to possibly access
   memory under the stack pointer and thus need to be able to expand the
   stack as a special case.

None of these cases are exactly complicated, but the page fault case in
particular is just repeated slightly differently many many times.  And
ia64 in particular has a fairly complicated situation where you can have
both a regular grow-down stack _and_ a special grow-up stack for the
register backing store.

So to make this slightly more manageable, the bulk of this series is to
first create a helper function for the most common page fault case, and
convert all the straightforward architectures to it.

Thus the new 'lock_mm_and_find_vma()' helper function, which ends up
being used by x86, arm, powerpc, mips, riscv, alpha, arc, csky, hexagon,
loongarch, nios2, sh, sparc32, and xtensa.  So we not only convert more
than half the architectures, we now have more shared code and avoid some
of those twisty little passages.

And largely due to this common helper function, the full diffstat of
this series ends up deleting more lines than it adds.

That still leaves eight architectures (ia64, m68k, microblaze, openrisc,
parisc, s390, sparc64 and um) that end up doing 'expand_stack()'
manually because they are doing something slightly different from the
normal pattern.  Along with the couple of special cases in execve() and
GUP.

So there's a couple of patches that first create 'locked' helper
versions of the stack expansion functions, so that there's a obvious
path forward in the conversion.  The execve() case is then actually
pretty simple, and is a nice cleanup from our old "grow-up stackls are
special, because at execve time even they grow down".

The #ifdef CONFIG_STACK_GROWSUP in that code just goes away, because
it's just more straightforward to write out the stack expansion there
manually, instead od having get_user_pages_remote() do it for us in some
situations but not others and have to worry about locking rules for GUP.

And the final step is then to just convert the remaining odd cases to a
new world order where 'expand_stack()' is called with the mmap_lock held
for reading, but where it might drop it and upgrade it to a write, only
to return with it held for reading (in the success case) or with it
completely dropped (in the failure case).

In the process, we remove all the stack expansion from GUP (where
dropping the lock wouldn't be ok without special rules anyway), and add
it in manually to __access_remote_vm() for ptrace().

Thanks to Adrian Glaubitz and Frank Scheiner who tested the ia64 cases.
Everything else here felt pretty straightforward, but the ia64 rules for
stack expansion are really quite odd and very different from everything
else.  Also thanks to Vegard Nossum who caught me getting one of those
odd conditions entirely the wrong way around.

Anyway, I think I want to actually move all the stack expansion code to
a whole new file of its own, rather than have it split up between
mm/mmap.c and mm/memory.c, but since this will have to be backported to
the initial maple tree vma introduction anyway, I tried to keep the
patches _fairly_ minimal.

Also, while I don't think it's valid to expand the stack from GUP, the
final patch in here is a "warn if some crazy GUP user wants to try to
expand the stack" patch.  That one will be reverted before the final
release, but it's left to catch any odd cases during the merge window
and release candidates.

Reported-by: Ruihan Li <lrh2000@pku.edu.cn>

* branch 'expand-stack':
  gup: add warning if some caller would seem to want stack expansion
  mm: always expand the stack with the mmap write lock held
  execve: expand new process stack manually ahead of time
  mm: make find_extend_vma() fail if write lock not held
  powerpc/mm: convert coprocessor fault to lock_mm_and_find_vma()
  mm/fault: convert remaining simple cases to lock_mm_and_find_vma()
  arm/mm: Convert to using lock_mm_and_find_vma()
  riscv/mm: Convert to using lock_mm_and_find_vma()
  mips/mm: Convert to using lock_mm_and_find_vma()
  powerpc/mm: Convert to using lock_mm_and_find_vma()
  arm64/mm: Convert to using lock_mm_and_find_vma()
  mm: make the page fault mmap locking killable
  mm: introduce new 'lock_mm_and_find_vma()' page fault helper
2023-06-28 20:35:21 -07:00
Linus Torvalds
6e17c6de3d - Yosry Ahmed brought back some cgroup v1 stats in OOM logs.
- Yosry has also eliminated cgroup's atomic rstat flushing.
 
 - Nhat Pham adds the new cachestat() syscall.  It provides userspace
   with the ability to query pagecache status - a similar concept to
   mincore() but more powerful and with improved usability.
 
 - Mel Gorman provides more optimizations for compaction, reducing the
   prevalence of page rescanning.
 
 - Lorenzo Stoakes has done some maintanance work on the get_user_pages()
   interface.
 
 - Liam Howlett continues with cleanups and maintenance work to the maple
   tree code.  Peng Zhang also does some work on maple tree.
 
 - Johannes Weiner has done some cleanup work on the compaction code.
 
 - David Hildenbrand has contributed additional selftests for
   get_user_pages().
 
 - Thomas Gleixner has contributed some maintenance and optimization work
   for the vmalloc code.
 
 - Baolin Wang has provided some compaction cleanups,
 
 - SeongJae Park continues maintenance work on the DAMON code.
 
 - Huang Ying has done some maintenance on the swap code's usage of
   device refcounting.
 
 - Christoph Hellwig has some cleanups for the filemap/directio code.
 
 - Ryan Roberts provides two patch series which yield some
   rationalization of the kernel's access to pte entries - use the provided
   APIs rather than open-coding accesses.
 
 - Lorenzo Stoakes has some fixes to the interaction between pagecache
   and directio access to file mappings.
 
 - John Hubbard has a series of fixes to the MM selftesting code.
 
 - ZhangPeng continues the folio conversion campaign.
 
 - Hugh Dickins has been working on the pagetable handling code, mainly
   with a view to reducing the load on the mmap_lock.
 
 - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from
   128 to 8.
 
 - Domenico Cerasuolo has improved the zswap reclaim mechanism by
   reorganizing the LRU management.
 
 - Matthew Wilcox provides some fixups to make gfs2 work better with the
   buffer_head code.
 
 - Vishal Moola also has done some folio conversion work.
 
 - Matthew Wilcox has removed the remnants of the pagevec code - their
   functionality is migrated over to struct folio_batch.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZJejewAKCRDdBJ7gKXxA
 joggAPwKMfT9lvDBEUnJagY7dbDPky1cSYZdJKxxM2cApGa42gEA6Cl8HRAWqSOh
 J0qXCzqaaN8+BuEyLGDVPaXur9KirwY=
 =B7yQ
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull mm updates from Andrew Morton:

 - Yosry Ahmed brought back some cgroup v1 stats in OOM logs

 - Yosry has also eliminated cgroup's atomic rstat flushing

 - Nhat Pham adds the new cachestat() syscall. It provides userspace
   with the ability to query pagecache status - a similar concept to
   mincore() but more powerful and with improved usability

 - Mel Gorman provides more optimizations for compaction, reducing the
   prevalence of page rescanning

 - Lorenzo Stoakes has done some maintanance work on the
   get_user_pages() interface

 - Liam Howlett continues with cleanups and maintenance work to the
   maple tree code. Peng Zhang also does some work on maple tree

 - Johannes Weiner has done some cleanup work on the compaction code

 - David Hildenbrand has contributed additional selftests for
   get_user_pages()

 - Thomas Gleixner has contributed some maintenance and optimization
   work for the vmalloc code

 - Baolin Wang has provided some compaction cleanups,

 - SeongJae Park continues maintenance work on the DAMON code

 - Huang Ying has done some maintenance on the swap code's usage of
   device refcounting

 - Christoph Hellwig has some cleanups for the filemap/directio code

 - Ryan Roberts provides two patch series which yield some
   rationalization of the kernel's access to pte entries - use the
   provided APIs rather than open-coding accesses

 - Lorenzo Stoakes has some fixes to the interaction between pagecache
   and directio access to file mappings

 - John Hubbard has a series of fixes to the MM selftesting code

 - ZhangPeng continues the folio conversion campaign

 - Hugh Dickins has been working on the pagetable handling code, mainly
   with a view to reducing the load on the mmap_lock

 - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment
   from 128 to 8

 - Domenico Cerasuolo has improved the zswap reclaim mechanism by
   reorganizing the LRU management

 - Matthew Wilcox provides some fixups to make gfs2 work better with the
   buffer_head code

 - Vishal Moola also has done some folio conversion work

 - Matthew Wilcox has removed the remnants of the pagevec code - their
   functionality is migrated over to struct folio_batch

* tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits)
  mm/hugetlb: remove hugetlb_set_page_subpool()
  mm: nommu: correct the range of mmap_sem_read_lock in task_mem()
  hugetlb: revert use of page_cache_next_miss()
  Revert "page cache: fix page_cache_next/prev_miss off by one"
  mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
  mm: memcg: rename and document global_reclaim()
  mm: kill [add|del]_page_to_lru_list()
  mm: compaction: convert to use a folio in isolate_migratepages_block()
  mm: zswap: fix double invalidate with exclusive loads
  mm: remove unnecessary pagevec includes
  mm: remove references to pagevec
  mm: rename invalidate_mapping_pagevec to mapping_try_invalidate
  mm: remove struct pagevec
  net: convert sunrpc from pagevec to folio_batch
  i915: convert i915_gpu_error to use a folio_batch
  pagevec: rename fbatch_count()
  mm: remove check_move_unevictable_pages()
  drm: convert drm_gem_put_pages() to use a folio_batch
  i915: convert shmem_sg_free_table() to use a folio_batch
  scatterlist: add sg_set_folio()
  ...
2023-06-28 10:28:11 -07:00
Linus Torvalds
2605e80d34 arm64 updates for 6.5:
- Support for the Armv8.9 Permission Indirection Extensions. While this
   feature doesn't add new functionality, it enables future support for
   Guarded Control Stacks (GCS) and Permission Overlays.
 
 - User-space support for the Armv8.8 memcpy/memset instructions.
 
 - arm64 perf: support the HiSilicon SoC uncore PMU, Arm CMN sysfs
   identifier, support for the NXP i.MX9 SoC DDRC PMU, fixes and
   cleanups.
 
 - Removal of superfluous ISBs on context switch (following retrospective
   architecture tightening).
 
 - Decode the ISS2 register during faults for additional information to
   help with debugging.
 
 - KPTI clean-up/simplification of the trampoline exit code.
 
 - Addressing several -Wmissing-prototype warnings.
 
 - Kselftest improvements for signal handling and ptrace.
 
 - Fix TPIDR2_EL0 restoring on sigreturn
 
 - Clean-up, robustness improvements of the module allocation code.
 
 - More sysreg conversions to the automatic register/bitfields
   generation.
 
 - CPU capabilities handling cleanup.
 
 - Arm documentation updates: ACPI, ptdump.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAmSZyXwACgkQa9axLQDI
 XvEM3BAAkMzHGTDhNVNGLSO07PVmdzTiuoNFlfX7bktdIb+El76VhGXhHeEywTje
 wAq9JIYBf/Src2HbgZLwuly8Fn2vCrhyp++bRJW82o9SiBnx91+0mH7zLf+XHiQ4
 FHKZxvaE6PaDc9o8WXr+IeucPRb5W2HgH37mktxh7ShMLsxorwS94V1oL29A2mV9
 t4XQY7/tdmrDKMKMuQnIr1DurNXBhJ1OKvDnSN/Zzm96JOU/QQ32N2wEE7Y0aHOh
 bBzClksx2mguQqV515mySGFe5yy9NqaAfx2hTAciq+1rwbiCSjqQQmEswoUH8WLX
 JNLylxADWT2qXThFe8W6uyFzEshSAoI1yKxlCGuOsQpu4sFJtR8oh8dDj5669g4Y
 j0jR87r9rWm0iyYI5I+XDMxFVyuh2eFInvjtynRbj+mtS3f/SkO8fXG6Uya+I76C
 UGLlBUKnLr/zHuIGN0LE/V4dYTqsi9EtHoc2Am2xCZsS9jqkxKJG8C93Zsm4GlJC
 OcUtBSjW0rYJq+tLk0yhR6hbh59QbiRh05KnZsPpOKi8purlKSL9ZNPRi7TndLdm
 HjHUY+vQwNIpPIb6pyK4aYZuTdGEQIsQykQ8CULiIGlHi7kc4g9029ouLc5bBAeU
 mU8D62I2ztzPoYljYWNtO7K6g/Dq8c4lpsaMAJ+1Wp2iq2xBJjo=
 =rNBK
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Catalin Marinas:
 "Notable features are user-space support for the memcpy/memset
  instructions and the permission indirection extension.

   - Support for the Armv8.9 Permission Indirection Extensions. While
     this feature doesn't add new functionality, it enables future
     support for Guarded Control Stacks (GCS) and Permission Overlays

   - User-space support for the Armv8.8 memcpy/memset instructions

   - arm64 perf: support the HiSilicon SoC uncore PMU, Arm CMN sysfs
     identifier, support for the NXP i.MX9 SoC DDRC PMU, fixes and
     cleanups

   - Removal of superfluous ISBs on context switch (following
     retrospective architecture tightening)

   - Decode the ISS2 register during faults for additional information
     to help with debugging

   - KPTI clean-up/simplification of the trampoline exit code

   - Addressing several -Wmissing-prototype warnings

   - Kselftest improvements for signal handling and ptrace

   - Fix TPIDR2_EL0 restoring on sigreturn

   - Clean-up, robustness improvements of the module allocation code

   - More sysreg conversions to the automatic register/bitfields
     generation

   - CPU capabilities handling cleanup

   - Arm documentation updates: ACPI, ptdump"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (124 commits)
  kselftest/arm64: Add a test case for TPIDR2 restore
  arm64/signal: Restore TPIDR2 register rather than memory state
  arm64: alternatives: make clean_dcache_range_nopatch() noinstr-safe
  Documentation/arm64: Add ptdump documentation
  arm64: hibernate: remove WARN_ON in save_processor_state
  kselftest/arm64: Log signal code and address for unexpected signals
  docs: perf: Fix warning from 'make htmldocs' in hisi-pmu.rst
  arm64/fpsimd: Exit streaming mode when flushing tasks
  docs: perf: Add new description for HiSilicon UC PMU
  drivers/perf: hisi: Add support for HiSilicon UC PMU driver
  drivers/perf: hisi: Add support for HiSilicon H60PA and PAv3 PMU driver
  perf: arm_cspmu: Add missing MODULE_DEVICE_TABLE
  perf/arm-cmn: Add sysfs identifier
  perf/arm-cmn: Revamp model detection
  perf/arm_dmc620: Add cpumask
  arm64: mm: fix VA-range sanity check
  arm64/mm: remove now-superfluous ISBs from TTBR writes
  Documentation/arm64: Update ACPI tables from BBR
  Documentation/arm64: Update references in arm-acpi
  Documentation/arm64: Update ARM and arch reference
  ...
2023-06-26 17:11:53 -07:00
Linus Torvalds
ae870a68b5 arm64/mm: Convert to using lock_mm_and_find_vma()
This converts arm64 to use the new page fault helper.  It was very
straightforward, but still needed a fix for the "obvious" conversion I
initially did.  Thanks to Suren for the fix and testing.

Fixed-and-tested-by: Suren Baghdasaryan <surenb@google.com>
Unnecessary-code-removal-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-24 14:12:58 -07:00
Catalin Marinas
f42039d10b Merge branches 'for-next/kpti', 'for-next/missing-proto-warn', 'for-next/iss2-decode', 'for-next/kselftest', 'for-next/misc', 'for-next/feat_mops', 'for-next/module-alloc', 'for-next/sysreg', 'for-next/cpucap', 'for-next/acpi', 'for-next/kdump', 'for-next/acpi-doc', 'for-next/doc' and 'for-next/tpidr2-fix', remote-tracking branch 'arm64/for-next/perf' into for-next/core
* arm64/for-next/perf:
  docs: perf: Fix warning from 'make htmldocs' in hisi-pmu.rst
  docs: perf: Add new description for HiSilicon UC PMU
  drivers/perf: hisi: Add support for HiSilicon UC PMU driver
  drivers/perf: hisi: Add support for HiSilicon H60PA and PAv3 PMU driver
  perf: arm_cspmu: Add missing MODULE_DEVICE_TABLE
  perf/arm-cmn: Add sysfs identifier
  perf/arm-cmn: Revamp model detection
  perf/arm_dmc620: Add cpumask
  dt-bindings: perf: fsl-imx-ddr: Add i.MX93 compatible
  drivers/perf: imx_ddr: Add support for NXP i.MX9 SoC DDRC PMU driver
  perf/arm_cspmu: Decouple APMT dependency
  perf/arm_cspmu: Clean up ACPI dependency
  ACPI/APMT: Don't register invalid resource
  perf/arm_cspmu: Fix event attribute type
  perf: arm_cspmu: Set irq affinitiy only if overflow interrupt is used
  drivers/perf: hisi: Don't migrate perf to the CPU going to teardown
  drivers/perf: apple_m1: Force 63bit counters for M2 CPUs
  perf/arm-cmn: Fix DTC reset
  perf: qcom_l2_pmu: Make l2_cache_pmu_probe_cluster() more robust
  perf/arm-cci: Slightly optimize cci_pmu_sync_counters()

* for-next/kpti:
  : Simplify KPTI trampoline exit code
  arm64: entry: Simplify tramp_alias macro and tramp_exit routine
  arm64: entry: Preserve/restore X29 even for compat tasks

* for-next/missing-proto-warn:
  : Address -Wmissing-prototype warnings
  arm64: add alt_cb_patch_nops prototype
  arm64: move early_brk64 prototype to header
  arm64: signal: include asm/exception.h
  arm64: kaslr: add kaslr_early_init() declaration
  arm64: flush: include linux/libnvdimm.h
  arm64: module-plts: inline linux/moduleloader.h
  arm64: hide unused is_valid_bugaddr()
  arm64: efi: add efi_handle_corrupted_x18 prototype
  arm64: cpuidle: fix #ifdef for acpi functions
  arm64: kvm: add prototypes for functions called in asm
  arm64: spectre: provide prototypes for internal functions
  arm64: move cpu_suspend_set_dbg_restorer() prototype to header
  arm64: avoid prototype warnings for syscalls
  arm64: add scs_patch_vmlinux prototype
  arm64: xor-neon: mark xor_arm64_neon_*() static

* for-next/iss2-decode:
  : Add decode of ISS2 to data abort reports
  arm64/esr: Add decode of ISS2 to data abort reporting
  arm64/esr: Use GENMASK() for the ISS mask

* for-next/kselftest:
  : Various arm64 kselftest improvements
  kselftest/arm64: Log signal code and address for unexpected signals
  kselftest/arm64: Add a smoke test for ptracing hardware break/watch points

* for-next/misc:
  : Miscellaneous patches
  arm64: alternatives: make clean_dcache_range_nopatch() noinstr-safe
  arm64: hibernate: remove WARN_ON in save_processor_state
  arm64/fpsimd: Exit streaming mode when flushing tasks
  arm64: mm: fix VA-range sanity check
  arm64/mm: remove now-superfluous ISBs from TTBR writes
  arm64: consolidate rox page protection logic
  arm64: set __exception_irq_entry with __irq_entry as a default
  arm64: syscall: unmask DAIF for tracing status
  arm64: lockdep: enable checks for held locks when returning to userspace
  arm64/cpucaps: increase string width to properly format cpucaps.h
  arm64/cpufeature: Use helper for ECV CNTPOFF cpufeature

* for-next/feat_mops:
  : Support for ARMv8.8 memcpy instructions in userspace
  kselftest/arm64: add MOPS to hwcap test
  arm64: mops: allow disabling MOPS from the kernel command line
  arm64: mops: detect and enable FEAT_MOPS
  arm64: mops: handle single stepping after MOPS exception
  arm64: mops: handle MOPS exceptions
  KVM: arm64: hide MOPS from guests
  arm64: mops: don't disable host MOPS instructions from EL2
  arm64: mops: document boot requirements for MOPS
  KVM: arm64: switch HCRX_EL2 between host and guest
  arm64: cpufeature: detect FEAT_HCX
  KVM: arm64: initialize HCRX_EL2

* for-next/module-alloc:
  : Make the arm64 module allocation code more robust (clean-up, VA range expansion)
  arm64: module: rework module VA range selection
  arm64: module: mandate MODULE_PLTS
  arm64: module: move module randomization to module.c
  arm64: kaslr: split kaslr/module initialization
  arm64: kasan: remove !KASAN_VMALLOC remnants
  arm64: module: remove old !KASAN_VMALLOC logic

* for-next/sysreg: (21 commits)
  : More sysreg conversions to automatic generation
  arm64/sysreg: Convert TRBIDR_EL1 register to automatic generation
  arm64/sysreg: Convert TRBTRG_EL1 register to automatic generation
  arm64/sysreg: Convert TRBMAR_EL1 register to automatic generation
  arm64/sysreg: Convert TRBSR_EL1 register to automatic generation
  arm64/sysreg: Convert TRBBASER_EL1 register to automatic generation
  arm64/sysreg: Convert TRBPTR_EL1 register to automatic generation
  arm64/sysreg: Convert TRBLIMITR_EL1 register to automatic generation
  arm64/sysreg: Rename TRBIDR_EL1 fields per auto-gen tools format
  arm64/sysreg: Rename TRBTRG_EL1 fields per auto-gen tools format
  arm64/sysreg: Rename TRBMAR_EL1 fields per auto-gen tools format
  arm64/sysreg: Rename TRBSR_EL1 fields per auto-gen tools format
  arm64/sysreg: Rename TRBBASER_EL1 fields per auto-gen tools format
  arm64/sysreg: Rename TRBPTR_EL1 fields per auto-gen tools format
  arm64/sysreg: Rename TRBLIMITR_EL1 fields per auto-gen tools format
  arm64/sysreg: Convert OSECCR_EL1 to automatic generation
  arm64/sysreg: Convert OSDTRTX_EL1 to automatic generation
  arm64/sysreg: Convert OSDTRRX_EL1 to automatic generation
  arm64/sysreg: Convert OSLAR_EL1 to automatic generation
  arm64/sysreg: Standardise naming of bitfield constants in OSL[AS]R_EL1
  arm64/sysreg: Convert MDSCR_EL1 to automatic register generation
  ...

* for-next/cpucap:
  : arm64 cpucap clean-up
  arm64: cpufeature: fold cpus_set_cap() into update_cpu_capabilities()
  arm64: cpufeature: use cpucap naming
  arm64: alternatives: use cpucap naming
  arm64: standardise cpucap bitmap names

* for-next/acpi:
  : Various arm64-related ACPI patches
  ACPI: bus: Consolidate all arm specific initialisation into acpi_arm_init()

* for-next/kdump:
  : Simplify the crashkernel reservation behaviour of crashkernel=X,high on arm64
  arm64: add kdump.rst into index.rst
  Documentation: add kdump.rst to present crashkernel reservation on arm64
  arm64: kdump: simplify the reservation behaviour of crashkernel=,high

* for-next/acpi-doc:
  : Update ACPI documentation for Arm systems
  Documentation/arm64: Update ACPI tables from BBR
  Documentation/arm64: Update references in arm-acpi
  Documentation/arm64: Update ARM and arch reference

* for-next/doc:
  : arm64 documentation updates
  Documentation/arm64: Add ptdump documentation

* for-next/tpidr2-fix:
  : Fix the TPIDR2_EL0 register restoring on sigreturn
  kselftest/arm64: Add a test case for TPIDR2 restore
  arm64/signal: Restore TPIDR2 register rather than memory state
2023-06-23 18:32:20 +01:00
Hugh Dickins
52924726f4 arm64: allow pte_offset_map() to fail
In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.

Link: https://lkml.kernel.org/r/35e46485-8499-4337-c51f-b8fa495a1a93@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: John David Anglin <dave.anglin@bell.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:19:06 -07:00
Arnd Bergmann
bb6e04a173 kasan: use internal prototypes matching gcc-13 builtins
gcc-13 warns about function definitions for builtin interfaces that have a
different prototype, e.g.:

In file included from kasan_test.c:31:
kasan.h:574:6: error: conflicting types for built-in function '__asan_register_globals'; expected 'void(void *, long int)' [-Werror=builtin-declaration-mismatch]
  574 | void __asan_register_globals(struct kasan_global *globals, size_t size);
kasan.h:577:6: error: conflicting types for built-in function '__asan_alloca_poison'; expected 'void(void *, long int)' [-Werror=builtin-declaration-mismatch]
  577 | void __asan_alloca_poison(unsigned long addr, size_t size);
kasan.h:580:6: error: conflicting types for built-in function '__asan_load1'; expected 'void(void *)' [-Werror=builtin-declaration-mismatch]
  580 | void __asan_load1(unsigned long addr);
kasan.h:581:6: error: conflicting types for built-in function '__asan_store1'; expected 'void(void *)' [-Werror=builtin-declaration-mismatch]
  581 | void __asan_store1(unsigned long addr);
kasan.h:643:6: error: conflicting types for built-in function '__hwasan_tag_memory'; expected 'void(void *, unsigned char,  long int)' [-Werror=builtin-declaration-mismatch]
  643 | void __hwasan_tag_memory(unsigned long addr, u8 tag, unsigned long size);

The two problems are:

 - Addresses are passes as 'unsigned long' in the kernel, but gcc-13
   expects a 'void *'.

 - sizes meant to use a signed ssize_t rather than size_t.

Change all the prototypes to match these.  Using 'void *' consistently for
addresses gets rid of a couple of type casts, so push that down to the
leaf functions where possible.

This now passes all randconfig builds on arm, arm64 and x86, but I have
not tested it on the other architectures that support kasan, since they
tend to fail randconfig builds in other ways.  This might fail if any of
the 32-bit architectures expect a 'long' instead of 'int' for the size
argument.

The __asan_allocas_unpoison() function prototype is somewhat weird, since
it uses a pointer for 'stack_top' and an size_t for 'stack_bottom'.  This
looks like it is meant to be 'addr' and 'size' like the others, but the
implementation clearly treats them as 'top' and 'bottom'.

Link: https://lkml.kernel.org/r/20230509145735.9263-2-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:19 -07:00
Jisheng Zhang
0e2aba6948 arm64: mm: pass original fault address to handle_mm_fault() in PER_VMA_LOCK block
When reading the arm64's PER_VMA_LOCK support code, I found a bit
difference between arm64 and other arch when calling handle_mm_fault()
during VMA lock-based page fault handling: the fault address is masked
before passing to handle_mm_fault(). This is also different from the
usage in mmap_lock-based handling. I think we need to pass the
original fault address to handle_mm_fault() as we did in
commit 84c5e23ede ("arm64: mm: Pass original fault address to
handle_mm_fault()").

If we go through the code path further, we can find that the "masked"
fault address can cause mismatched fault address between perf sw
major/minor page fault sw event and perf page fault sw event:

do_page_fault
  perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, ..., addr)   // orig addr
  handle_mm_fault
    mm_account_fault
      perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, ...) // masked addr

Fixes: cd7f176aea ("arm64/mm: try VMA lock-based page fault handling first")
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20230524131305.2808-1-jszhang@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
2023-06-02 13:02:44 +01:00
Mark Brown
1f9d4ba683 arm64/esr: Add decode of ISS2 to data abort reporting
The architecture has added more information about faults to ISS2 within
ESR. Add decode of this to our data abort fault decode to aid diagnostics.
Features that are not currently enabled are included here for completeness.

Since the architecture specifies the values of bits within ISS2 in terms
of ISS2 rather than in terms of the register as a whole we do so for our
definitions as well, this makes it easier to review bitfield definitions.

Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20230417-arm64-iss2-dabt-decode-v3-2-c1fa503e503a@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2023-05-26 10:11:42 +01:00
Arnd Bergmann
e13d32e992 arm64: move early_brk64 prototype to header
The prototype used for calling early_brk64() is in the file that calls
it, which is the wrong place, as it is not included for the definition:

arch/arm64/kernel/traps.c:1100:12: error: no previous prototype for 'early_brk64' [-Werror=missing-prototypes]

Move it to an appropriate header instead.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20230516160642.523862-15-arnd@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2023-05-25 17:44:03 +01:00
Min-Hua Chen
d91d580878 arm64/mm: mark private VM_FAULT_X defines as vm_fault_t
This patch fixes several sparse warnings for fault.c:

arch/arm64/mm/fault.c:493:24: sparse: warning: incorrect type in return expression (different base types)
arch/arm64/mm/fault.c:493:24: sparse:    expected restricted vm_fault_t
arch/arm64/mm/fault.c:493:24: sparse:    got int
arch/arm64/mm/fault.c:501:32: sparse: warning: incorrect type in return expression (different base types)
arch/arm64/mm/fault.c:501:32: sparse:    expected restricted vm_fault_t
arch/arm64/mm/fault.c:501:32: sparse:    got int
arch/arm64/mm/fault.c:503:32: sparse: warning: incorrect type in return expression (different base types)
arch/arm64/mm/fault.c:503:32: sparse:    expected restricted vm_fault_t
arch/arm64/mm/fault.c:503:32: sparse:    got int
arch/arm64/mm/fault.c:511:24: sparse: warning: incorrect type in return expression (different base types)
arch/arm64/mm/fault.c:511:24: sparse:    expected restricted vm_fault_t
arch/arm64/mm/fault.c:511:24: sparse:    got int
arch/arm64/mm/fault.c:670:13: sparse: warning: restricted vm_fault_t degrades to integer
arch/arm64/mm/fault.c:670:13: sparse: warning: restricted vm_fault_t degrades to integer
arch/arm64/mm/fault.c:713:39: sparse: warning: restricted vm_fault_t degrades to integer

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Min-Hua Chen <minhuadotchen@gmail.com>
Link: https://lore.kernel.org/r/20230502151909.128810-1-minhuadotchen@gmail.com
Signed-off-by: Will Deacon <will@kernel.org>
2023-05-16 14:50:43 +01:00
Suren Baghdasaryan
cd7f176aea arm64/mm: try VMA lock-based page fault handling first
Attempt VMA lock-based page fault handling first, and fall back to the
existing mmap_lock-based handling if that fails.

Link: https://lkml.kernel.org/r/20230227173632.3292573-31-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 20:03:01 -07:00