Commit Graph

1267 Commits

Author SHA1 Message Date
Paolo Bonzini
6b72d0578c KVM: x86: use again the flush argument of __link_shadow_page()
Except in the case of parentless nested-TDP pages, mmu_page_zap_pte()
clears the SPTE but leaves the invalid_list empty.  In this case, using
kvm_flush_remote_tlbs() as kvm_mmu_remote_flush_or_zap() does is overkill.
Avoid flushing the entirety of the remote TLBs unless the invalid_list
was populated: instead, use a more efficient gfn-targeting flush (if
available) and skip it altogether if the caller guarantees that a TLB
flush is not necessary.

Based-on: <20260503201029.106481-1-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260503210917.121840-1-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-05-12 23:12:31 +02:00
Sean Christopherson
0cb2af2ea6 KVM: x86: Fix shadow paging use-after-free due to unexpected GFN
The shadow MMU computes GFNs for direct shadow pages using sp->gfn plus
the SPTE index. This assumption breaks for shadow paging if the guest
page tables are modified between VM entries (similar to commit
aad885e774, "KVM: x86/mmu: Drop/zap existing present SPTE even
when creating an MMIO SPTE", 2026-03-27).  The flow is as follows:

- a PDE is installed for a 2MB mapping, and a page in that area is
  accessed.  KVM creates a kvm_mmu_page consisting of 512 4KB pages;
  the kvm_mmu_page is marked by FNAME(fetch) as direct-mapped because
  the guest's mapping is a huge page (and thus contiguous).

- the PDE mapping is changed from outside the guest.

- the guest accesses another page in the same 2MB area.  KVM installs
  a new leaf SPTE and rmap entry; the SPTE uses the "correct" GFN
  (i.e. based on the new mapping, as changed in the previous step) but
  that GFN is outside of the [sp->gfn, sp->gfn + 511] range; therefore
  the rmap entry cannot be found and removed when the kvm_mmu_page
  is zapped.

- the memslot that covers the first 2MB mapping is deleted, and the
  kvm_mmu_page for the now-invalid GPA is zapped.  However, rmap_remove()
  only looks at the [sp->gfn, sp->gfn + 511] range established in step 1,
  and fails to find the rmap entry that was recorded by step 3.

- any operation that causes an rmap walk for the same page accessed
  by step 3 then walks a stale rmap and dereferences a freed kvm_mmu_page.
  This includes dirty logging or MMU notifier invalidations (e.g., from
  MADV_DONTNEED).

The underlying issue is that KVM's walking of shadow PTEs assumes that
if a SPTE is present when KVM wants to install a non-leaf SPTE, then the
existing kvm_mmu_page must be for the correct gfn.  Because the only way
for the gfn to be wrong is if KVM messed up and failed to zap a SPTE...
which shouldn't happen, but *actually* only happens in response to a
guest write.

That bug dates back literally forever, as even the first version of KVM
assumes that the GFN matches and walks into the "wrong" shadow page.
However, that was only an imprecision until 2032a93d66 ("KVM: MMU:
Don't allocate gfns page for direct mmu pages") came along.

Fix it by checking for a target gfn mismatch and zapping the existing
SPTE.  That way the old SP and rmap entries are gone, KVM installs
the rmap in the right location, and everyone is happy.

Fixes: 2032a93d66 ("KVM: MMU: Don't allocate gfns page for direct mmu pages")
Fixes: 6aa8b732ca ("kvm: userspace interface")
Reported-by: Alexander Bulekov <bkov@amazon.com>
Reported-by: Fred Griffoul <fgriffo@amazon.co.uk>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://patch.msgid.link/20260503201029.106481-1-pbonzini@redhat.com/
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-05-03 22:32:53 +02:00
Linus Torvalds
01f492e181 Arm:
- Add support for tracing in the standalone EL2 hypervisor code, which
   should help both debugging and performance analysis.  This uses the
   new infrastructure for 'remote' trace buffers that can be exposed
   by non-kernel entities such as firmware, and which came through the
   tracing tree.
 
 - Add support for GICv5 Per Processor Interrupts (PPIs), as the starting
   point for supporting the new GIC architecture in KVM.
 
 - Finally add support for pKVM protected guests, where pages are unmapped
   from the host as they are faulted into the guest and can be shared back
   from the guest using pKVM hypercalls.  Protected guests are created
   using a new machine type identifier.  As the elusive guestmem has not
   yet delivered on its promises, anonymous memory is also supported.
 
   This is only a first step towards full isolation from the host; for
   example, the CPU register state and DMA accesses are not yet isolated.
   Because this does not really yet bring fully what it promises, it is
   hidden behind CONFIG_ARM_PKVM_GUEST + 'kvm-arm.mode=protected', and
   also triggers TAINT_USER when a VM is created.  Caveat emptor.
 
 - Rework the dreaded user_mem_abort() function to make it more
   maintainable, reducing the amount of state being exposed to the
   various helpers and rendering a substantial amount of state immutable.
 
 - Expand the Stage-2 page table dumper to support NV shadow page tables
   on a per-VM basis.
 
 - Tidy up the pKVM PSCI proxy code to be slightly less hard to follow.
 
 - Fix both SPE and TRBE in non-VHE configurations so that they do not
   generate spurious, out of context table walks that ultimately lead
   to very bad HW lockups.
 
 - A small set of patches fixing the Stage-2 MMU freeing in error cases.
 
 - Tighten-up accepted SMC immediate value to be only #0 for host
   SMCCC calls.
 
 - The usual cleanups and other selftest churn.
 
 LoongArch:
 
 - Use CSR_CRMD_PLV for kvm_arch_vcpu_in_kernel().
 
 - Add DMSINTC irqchip in kernel support.
 
 RISC-V:
 
 - Fix steal time shared memory alignment checks
 
 - Fix vector context allocation leak
 
 - Fix array out-of-bounds in pmu_ctr_read() and pmu_fw_ctr_read_hi()
 
 - Fix double-free of sdata in kvm_pmu_clear_snapshot_area()
 
 - Fix integer overflow in kvm_pmu_validate_counter_mask()
 
 - Fix shift-out-of-bounds in make_xfence_request()
 
 - Fix lost write protection on huge pages during dirty logging
 
 - Split huge pages during fault handling for dirty logging
 
 - Skip CSR restore if VCPU is reloaded on the same core
 
 - Implement kvm_arch_has_default_irqchip() for KVM selftests
 
 - Factored-out ISA checks into separate sources
 
 - Added hideleg to struct kvm_vcpu_config
 
 - Factored-out VCPU config into separate sources
 
 - Support configuration of per-VM HGATP mode from KVM user space
 
 s390:
 
 - Support for ESA (31-bit) guests inside nested hypervisors.
 
 - Remove restriction on memslot alignment, which is not needed anymore with
   the new gmap code.
 
 - Fix LPSW/E to update the bear (which of course is the breaking event
   address register).
 
 x86:
 
 - Shut up various UBSAN warnings on reading module parameter before they
   were initialized.
 
 - Don't zero-allocate page tables that are used for splitting hugepages in
   the TDP MMU, as KVM is guaranteed to set all SPTEs in the page table and
   thus write all bytes.
 
 - As an optimization, bail early when trying to unsync 4KiB mappings if the
   target gfn can just be mapped with a 2MiB hugepage.
 
 x86 generic:
 
 - Copy single-chunk MMIO write values into struct kvm_vcpu (more precisely
   struct kvm_mmio_fragment) to fix use-after-free stack bugs where KVM
   would dereference stack pointer after an exit to userspace.
 
 - Clean up and comment the emulated MMIO code to try to make it easier to
   maintain (not necessarily "easy", but "easier").
 
 - Move VMXON+VMXOFF and EFER.SVME toggling out of KVM (not *all* of VMX
   and SVM enabling) as it is needed for trusted I/O.
 
 - Advertise support for AVX512 Bit Matrix Multiply (BMM) instructions
 
 - Immediately fail the build if a required #define is missing in one of
   KVM's headers that is included multiple times.
 
 - Reject SET_GUEST_DEBUG with -EBUSY if there's an already injected
   exception, mostly to prevent syzkaller from abusing the uAPI to
   trigger WARNs, but also because it can help prevent userspace from
   unintentionally crashing the VM.
 
 - Exempt SMM from CPUID faulting on Intel, as per the spec.
 
 - Misc hardening and cleanup changes.
 
 x86 (AMD):
 
 - Fix and optimize IRQ window inhibit handling for AVIC; make it per-vCPU
   so that KVM doesn't prematurely re-enable AVIC if multiple
   vCPUs have to-be-injected IRQs.
 
 - Clean up and optimize the OSVW handling, avoiding a bug in which KVM would
   overwrite state when enabling virtualization on multiple CPUs in parallel.
   This should not be a problem because OSVW should usually be the same for
   all CPUs.
 
 - Drop a WARN in KVM_MEMORY_ENCRYPT_REG_REGION where KVM complains about a
   "too large" size based purely on user input.
 
 - Clean up and harden the pinning code for KVM_MEMORY_ENCRYPT_REG_REGION.
 
 - Disallow synchronizing a VMSA of an already-launched/encrypted vCPU, as
   doing so for an SNP guest will crash the host due to an RMP violation
   page fault.
 
 - Overhaul KVM's APIs for detecting SEV+ guests so that VM-scoped queries
   are required to hold kvm->lock, and enforce it by lockdep.  Fix various
   bugs where sev_guest() was not ensured to be stable for the whole
   duration of a function or ioctl.
 
 - Convert a pile of kvm->lock SEV code to guard().
 
 - Play nicer with userspace that does not enable KVM_CAP_EXCEPTION_PAYLOAD,
   for which KVM needs to set CR2 and DR6 as a response to ioctls such as
   KVM_GET_VCPU_EVENTS (even if the payload would end up in EXITINFO2
   rather than CR2, for example).  Only set CR2 and DR6 when consumption of
   the payload is imminent, but on the other hand force delivery of the
   payload in all paths where userspace retrieves CR2 or DR6.
 
 - Use vcpu->arch.cr2 when updating vmcb12's CR2 on nested #VMEXIT instead
   of vmcb02->save.cr2.  The value is out of sync after a save/restore
   or after a #PF is injected into L2.
 
 - Fix a class of nSVM bugs where some fields written by the CPU are not
   synchronized from vmcb02 to cached vmcb12 after VMRUN, and so are not
   up-to-date when saved by KVM_GET_NESTED_STATE.
 
 - Fix a class of bugs where the ordering between KVM_SET_NESTED_STATE and
   KVM_SET_{S}REGS could cause vmcb02 to be incorrectly initialized after
   save+restore.
 
 - Add a variety of missing nSVM consistency checks.
 
 - Fix several bugs where KVM failed to correctly update VMCB fields on
   nested #VMEXIT.
 
 - Fix several bugs where KVM failed to correctly synthesize #UD or #GP for
   SVM-related instructions.
 
 - Add support for save+restore of virtualized LBRs (on SVM).
 
 - Refactor various helpers and macros to improve clarity and (hopefully)
   make the code easier to maintain.
 
 - Aggressively sanitize fields when copying from vmcb12, to guard against
   unintentionally allowing L1 to utilize yet-to-be-defined features.
 
 - Fix several bugs where KVM botched rAX legality checks when emulating SVM
   instructions.  There are remaining issues in that KVM doesn't handle size
   prefix overrides for 64-bit guests.
 
 - Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails instead of
   somewhat arbitrarily synthesizing #GP (i.e. don't double down on AMD's
   architectural but sketchy behavior of generating #GP for "unsupported"
   addresses).
 
 - Cache all used vmcb12 fields to further harden against TOCTOU bugs.
 
 x86 (Intel):
 
 - Drop obsolete branch hint prefixes from the VMX instruction macros.
 
 - Use ASM_INPUT_RM() in __vmcs_writel() to coerce clang into using a
   register input when appropriate.
 
 - Code cleanups.
 
 guest_memfd:
 
 - Don't mark guest_memfd folios as accessed, as guest_memfd doesn't support
   reclaim, the memory is unevictable, and there is no storage to write
   back to.
 
 LoongArch selftests:
 
 - Add KVM PMU test cases
 
 s390 selftests:
 
 - Enable more memory selftests.
 
 x86 selftests:
 
 - Add support for Hygon CPUs in KVM selftests.
 
 - Fix a bug in the MSR test where it would get false failures on AMD/Hygon
   CPUs with exactly one of RDPID or RDTSCP.
 
 - Add an MADV_COLLAPSE testcase for guest_memfd as a regression test for a
   bug where the kernel would attempt to collapse guest_memfd folios against
   KVM's will.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmnftRQUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroPAzwf+NKO4Ktv+7A22ImN0SBl0nlUuulsz
 vTcw3+hxdRoIw83GdNS+hG5js0wrpMDnbv3t4+VliDNBSSxrBzcSWX2wpilW0Xtw
 qGo1MWhs2lKPy1NlaRVOwPS6j7uF3AR0TQ1iQLGMedQuCU9WpiKJxyhNXJdbLrt3
 8EgFzsvtEsv+jKNRUNDf9+d0j4gZsFyIe+Brhianbw+u3/UCiUClLCdsKPc4+5ZX
 08otYXytacGNIf/5Ev1vT4pHkHL0yqKXAtX7LEtaS3+0KrPuLjV4slemivzE9vf5
 Evafm5AhA4wpaNMb1ZerhY3T94lsMaJpWxotjR//0Q7C9B59pCQnXCm8mg==
 =CcE0
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "Arm:

   - Add support for tracing in the standalone EL2 hypervisor code,
     which should help both debugging and performance analysis. This
     uses the new infrastructure for 'remote' trace buffers that can be
     exposed by non-kernel entities such as firmware, and which came
     through the tracing tree

   - Add support for GICv5 Per Processor Interrupts (PPIs), as the
     starting point for supporting the new GIC architecture in KVM

   - Finally add support for pKVM protected guests, where pages are
     unmapped from the host as they are faulted into the guest and can
     be shared back from the guest using pKVM hypercalls. Protected
     guests are created using a new machine type identifier. As the
     elusive guestmem has not yet delivered on its promises, anonymous
     memory is also supported

     This is only a first step towards full isolation from the host; for
     example, the CPU register state and DMA accesses are not yet
     isolated. Because this does not really yet bring fully what it
     promises, it is hidden behind CONFIG_ARM_PKVM_GUEST +
     'kvm-arm.mode=protected', and also triggers TAINT_USER when a VM is
     created. Caveat emptor

   - Rework the dreaded user_mem_abort() function to make it more
     maintainable, reducing the amount of state being exposed to the
     various helpers and rendering a substantial amount of state
     immutable

   - Expand the Stage-2 page table dumper to support NV shadow page
     tables on a per-VM basis

   - Tidy up the pKVM PSCI proxy code to be slightly less hard to
     follow

   - Fix both SPE and TRBE in non-VHE configurations so that they do not
     generate spurious, out of context table walks that ultimately lead
     to very bad HW lockups

   - A small set of patches fixing the Stage-2 MMU freeing in error
     cases

   - Tighten-up accepted SMC immediate value to be only #0 for host
     SMCCC calls

   - The usual cleanups and other selftest churn

  LoongArch:

   - Use CSR_CRMD_PLV for kvm_arch_vcpu_in_kernel()

   - Add DMSINTC irqchip in kernel support

  RISC-V:

   - Fix steal time shared memory alignment checks

   - Fix vector context allocation leak

   - Fix array out-of-bounds in pmu_ctr_read() and pmu_fw_ctr_read_hi()

   - Fix double-free of sdata in kvm_pmu_clear_snapshot_area()

   - Fix integer overflow in kvm_pmu_validate_counter_mask()

   - Fix shift-out-of-bounds in make_xfence_request()

   - Fix lost write protection on huge pages during dirty logging

   - Split huge pages during fault handling for dirty logging

   - Skip CSR restore if VCPU is reloaded on the same core

   - Implement kvm_arch_has_default_irqchip() for KVM selftests

   - Factored-out ISA checks into separate sources

   - Added hideleg to struct kvm_vcpu_config

   - Factored-out VCPU config into separate sources

   - Support configuration of per-VM HGATP mode from KVM user space

  s390:

   - Support for ESA (31-bit) guests inside nested hypervisors

   - Remove restriction on memslot alignment, which is not needed
     anymore with the new gmap code

   - Fix LPSW/E to update the bear (which of course is the breaking
     event address register)

  x86:

   - Shut up various UBSAN warnings on reading module parameter before
     they were initialized

   - Don't zero-allocate page tables that are used for splitting
     hugepages in the TDP MMU, as KVM is guaranteed to set all SPTEs in
     the page table and thus write all bytes

   - As an optimization, bail early when trying to unsync 4KiB mappings
     if the target gfn can just be mapped with a 2MiB hugepage

  x86 generic:

   - Copy single-chunk MMIO write values into struct kvm_vcpu (more
     precisely struct kvm_mmio_fragment) to fix use-after-free stack
     bugs where KVM would dereference stack pointer after an exit to
     userspace

   - Clean up and comment the emulated MMIO code to try to make it
     easier to maintain (not necessarily "easy", but "easier")

   - Move VMXON+VMXOFF and EFER.SVME toggling out of KVM (not *all* of
     VMX and SVM enabling) as it is needed for trusted I/O

   - Advertise support for AVX512 Bit Matrix Multiply (BMM) instructions

   - Immediately fail the build if a required #define is missing in one
     of KVM's headers that is included multiple times

   - Reject SET_GUEST_DEBUG with -EBUSY if there's an already injected
     exception, mostly to prevent syzkaller from abusing the uAPI to
     trigger WARNs, but also because it can help prevent userspace from
     unintentionally crashing the VM

   - Exempt SMM from CPUID faulting on Intel, as per the spec

   - Misc hardening and cleanup changes

  x86 (AMD):

   - Fix and optimize IRQ window inhibit handling for AVIC; make it
     per-vCPU so that KVM doesn't prematurely re-enable AVIC if multiple
     vCPUs have to-be-injected IRQs

   - Clean up and optimize the OSVW handling, avoiding a bug in which
     KVM would overwrite state when enabling virtualization on multiple
     CPUs in parallel. This should not be a problem because OSVW should
     usually be the same for all CPUs

   - Drop a WARN in KVM_MEMORY_ENCRYPT_REG_REGION where KVM complains
     about a "too large" size based purely on user input

   - Clean up and harden the pinning code for KVM_MEMORY_ENCRYPT_REG_REGION

   - Disallow synchronizing a VMSA of an already-launched/encrypted
     vCPU, as doing so for an SNP guest will crash the host due to an
     RMP violation page fault

   - Overhaul KVM's APIs for detecting SEV+ guests so that VM-scoped
     queries are required to hold kvm->lock, and enforce it by lockdep.
     Fix various bugs where sev_guest() was not ensured to be stable for
     the whole duration of a function or ioctl

   - Convert a pile of kvm->lock SEV code to guard()

   - Play nicer with userspace that does not enable
     KVM_CAP_EXCEPTION_PAYLOAD, for which KVM needs to set CR2 and DR6
     as a response to ioctls such as KVM_GET_VCPU_EVENTS (even if the
     payload would end up in EXITINFO2 rather than CR2, for example).
     Only set CR2 and DR6 when consumption of the payload is imminent,
     but on the other hand force delivery of the payload in all paths
     where userspace retrieves CR2 or DR6

   - Use vcpu->arch.cr2 when updating vmcb12's CR2 on nested #VMEXIT
     instead of vmcb02->save.cr2. The value is out of sync after a
     save/restore or after a #PF is injected into L2

   - Fix a class of nSVM bugs where some fields written by the CPU are
     not synchronized from vmcb02 to cached vmcb12 after VMRUN, and so
     are not up-to-date when saved by KVM_GET_NESTED_STATE

   - Fix a class of bugs where the ordering between KVM_SET_NESTED_STATE
     and KVM_SET_{S}REGS could cause vmcb02 to be incorrectly
     initialized after save+restore

   - Add a variety of missing nSVM consistency checks

   - Fix several bugs where KVM failed to correctly update VMCB fields
     on nested #VMEXIT

   - Fix several bugs where KVM failed to correctly synthesize #UD or
     #GP for SVM-related instructions

   - Add support for save+restore of virtualized LBRs (on SVM)

   - Refactor various helpers and macros to improve clarity and
     (hopefully) make the code easier to maintain

   - Aggressively sanitize fields when copying from vmcb12, to guard
     against unintentionally allowing L1 to utilize yet-to-be-defined
     features

   - Fix several bugs where KVM botched rAX legality checks when
     emulating SVM instructions. There are remaining issues in that KVM
     doesn't handle size prefix overrides for 64-bit guests

   - Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails
     instead of somewhat arbitrarily synthesizing #GP (i.e. don't double
     down on AMD's architectural but sketchy behavior of generating #GP
     for "unsupported" addresses)

   - Cache all used vmcb12 fields to further harden against TOCTOU bugs

  x86 (Intel):

   - Drop obsolete branch hint prefixes from the VMX instruction macros

   - Use ASM_INPUT_RM() in __vmcs_writel() to coerce clang into using a
     register input when appropriate

   - Code cleanups

  guest_memfd:

   - Don't mark guest_memfd folios as accessed, as guest_memfd doesn't
     support reclaim, the memory is unevictable, and there is no storage
     to write back to

  LoongArch selftests:

   - Add KVM PMU test cases

  s390 selftests:

   - Enable more memory selftests

  x86 selftests:

   - Add support for Hygon CPUs in KVM selftests

   - Fix a bug in the MSR test where it would get false failures on
     AMD/Hygon CPUs with exactly one of RDPID or RDTSCP

   - Add an MADV_COLLAPSE testcase for guest_memfd as a regression test
     for a bug where the kernel would attempt to collapse guest_memfd
     folios against KVM's will"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (373 commits)
  KVM: x86: use inlines instead of macros for is_sev_*guest
  x86/virt: Treat SVM as unsupported when running as an SEV+ guest
  KVM: SEV: Goto an existing error label if charging misc_cg for an ASID fails
  KVM: SVM: Move lock-protected allocation of SEV ASID into a separate helper
  KVM: SEV: use mutex guard in snp_handle_guest_req()
  KVM: SEV: use mutex guard in sev_mem_enc_unregister_region()
  KVM: SEV: use mutex guard in sev_mem_enc_ioctl()
  KVM: SEV: use mutex guard in snp_launch_update()
  KVM: SEV: Assert that kvm->lock is held when querying SEV+ support
  KVM: SEV: Document that checking for SEV+ guests when reclaiming memory is "safe"
  KVM: SEV: Hide "struct kvm_sev_info" behind CONFIG_KVM_AMD_SEV=y
  KVM: SEV: WARN on unhandled VM type when initializing VM
  KVM: LoongArch: selftests: Add PMU overflow interrupt test
  KVM: LoongArch: selftests: Add basic PMU event counting test
  KVM: LoongArch: selftests: Add cpucfg read/write helpers
  LoongArch: KVM: Add DMSINTC inject msi to vCPU
  LoongArch: KVM: Add DMSINTC device support
  LoongArch: KVM: Make vcpu_is_preempted() as a macro rather than function
  LoongArch: KVM: Move host CSR_GSTAT save and restore in context switch
  LoongArch: KVM: Move host CSR_EENTRY save and restore in context switch
  ...
2026-04-17 07:18:03 -07:00
Linus Torvalds
334fbe734e mm.git review status for linus..mm-stable
Everything:
 
 Total patches:       368
 Reviews/patch:       1.56
 Reviewed rate:       74%
 
 Excluding DAMON:
 
 Total patches:       316
 Reviews/patch:       1.77
 Reviewed rate:       81%
 
 Excluding DAMON and zram:
 
 Total patches:       306
 Reviews/patch:       1.81
 Reviewed rate:       82%
 
 Excluding DAMON, zram and maple_tree:
 
 Total patches:       276
 Reviews/patch:       2.01
 Reviewed rate:       91%
 
 Significant patch series in this merge:
 
 - The 30 patch series "maple_tree: Replace big node with maple copy"
   from Liam Howlett is mainly prepararatory work for ongoing development
   but it does reduce stack usage and is an improvement.
 
 - The 12 patch series "mm, swap: swap table phase III: remove swap_map"
   from Kairui Song offers memory savings by removing the static swap_map.
   It also yields some CPU savings and implements several cleanups.
 
 - The 2 patch series "mm: memfd_luo: preserve file seals" from Pratyush
   Yadav adds file seal preservation to LUO's memfd code.
 
 - The 2 patch series "mm: zswap: add per-memcg stat for incompressible
   pages" from Jiayuan Chen adds additional userspace stats reportng to
   zswap.
 
 - The 4 patch series "arch, mm: consolidate empty_zero_page" from Mike
   Rapoport implements some cleanups for our handling of ZERO_PAGE() and
   zero_pfn.
 
 - The 2 patch series "mm/kmemleak: Improve scan_should_stop()
   implementation" from Zhongqiu Han provides an robustness improvement and
   some cleanups in the kmemleak code.
 
 - The 4 patch series "Improve khugepaged scan logic" from Vernon Yang
   "improves the khugepaged scan logic and reduces CPU consumption by
   prioritizing scanning tasks that access memory frequently".
 
 - The 2 patch series "Make KHO Stateless" from Jason Miu simplifies
   Kexec Handover by "transitioning KHO from an xarray-based metadata
   tracking system with serialization to a radix tree data structure that
   can be passed directly to the next kernel"
 
 - The 3 patch series "mm: vmscan: add PID and cgroup ID to vmscan
   tracepoints" from Thomas Ballasi and Steven Rostedt enhances vmscan's
   tracepointing.
 
 - The 5 patch series "mm: arch/shstk: Common shadow stack mapping helper
   and VM_NOHUGEPAGE" from Catalin Marinas is a cleanup for the shadow
   stack code: remove per-arch code in favour of a generic implementation.
 
 - The 2 patch series "Fix KASAN support for KHO restored vmalloc
   regions" from Pasha Tatashin fixes a WARN() which can be emitted the KHO
   restores a vmalloc area.
 
 - The 4 patch series "mm: Remove stray references to pagevec" from Tal
   Zussman provides several cleanups, mainly udpating references to "struct
   pagevec", which became folio_batch three years ago.
 
 - The 17 patch series "mm: Eliminate fake head pages from vmemmap
   optimization" from Kiryl Shutsemau simplifies the HugeTLB vmemmap
   optimization (HVO) by changing how tail pages encode their relationship
   to the head page.
 
 - The 2 patch series "mm/damon/core: improve DAMOS quota efficiency for
   core layer filters" from SeongJae Park improves two problematic
   behaviors of DAMOS that makes it less efficient when core layer filters
   are used.
 
 - The 3 patch series "mm/damon: strictly respect min_nr_regions" from
   SeongJae Park improves DAMON usability by extending the treatment of the
   min_nr_regions user-settable parameter.
 
 - The 3 patch series "mm/page_alloc: pcp locking cleanup" from Vlastimil
   Babka is a proper fix for a previously hotfixed SMP=n issue.  Code
   simplifications and cleanups ennsed.
 
 - The 16 patch series "mm: cleanups around unmapping / zapping" from
   David Hildenbrand implements "a bunch of cleanups around unmapping and
   zapping.  Mostly simplifications, code movements, documentation and
   renaming of zapping functions".
 
 - The 6 patch series "support batched checking of the young flag for
   MGLRU" from Baolin Wang supports batched checking of the young flag for
   MGLRU.  It's part cleanups; one benchmark shows large performance
   benefits for arm64.
 
 - The 5 patch series "memcg: obj stock and slab stat caching cleanups"
   from Johannes Weiner provides memcg cleanup and robustness improvements.
 
 - The 5 patch series "Allow order zero pages in page reporting" from
   Yuvraj Sakshith enhances page_reporting's free page reporting - it is
   presently and undesirably order-0 pages when reporting free memory.
 
 - The 6 patch series "mm: vma flag tweaks" from Lorenzo Stoakes is
   cleanup work following from the recent conversion of the VMA flags to a
   bitmap.
 
 - The 10 patch series "mm/damon: add optional debugging-purpose sanity
   checks" from SeongJae Park adds some more developer-facing debug checks
   into DAMON core.
 
 - The 2 patch series "mm/damon: test and document power-of-2
   min_region_sz requirement" from SeongJae Park adds an additional DAMON
   kunit test and makes some adjustments to the addr_unit parameter
   handling.
 
 - The 3 patch series "mm/damon/core: make passed_sample_intervals
   comparisons overflow-safe" from SeongJae Park fixes a hard-to-hit time
   overflow issue in DAMON core.
 
 - The 7 patch series "mm/damon: improve/fixup/update ratio calculation,
   test and documentation" from SeongJae Park is a "batch of misc/minor
   improvements and fixups" for DAMON.
 
 - The 4 patch series "mm: move vma_(kernel|mmu)_pagesize() out of
   hugetlb.c" from David Hildenbrand fixes a possible issue with dax-device
   when CONFIG_HUGETLB=n.  Some code movement was required.
 
 - The 6 patch series "zram: recompression cleanups and tweaks" from
   Sergey Senozhatsky provides "a somewhat random mix of fixups,
   recompression cleanups and improvements" in the zram code.
 
 - The 11 patch series "mm/damon: support multiple goal-based quota
   tuning algorithms" from SeongJae Park extend DAMOS quotas goal
   auto-tuning to support multiple tuning algorithms that users can select.
 
 - The 4 patch series "mm: thp: reduce unnecessary
   start_stop_khugepaged()" from Breno Leitao fixes the khugpaged sysfs
   handling so we no longer spam the logs with reams of junk when
   starting/stopping khugepaged.
 
 - The 3 patch series "mm: improve map count checks" from Lorenzo Stoakes
   provides some cleanups and slight fixes in the mremap, mmap and vma
   code.
 
 - The 5 patch series "mm/damon: support addr_unit on default monitoring
   targets for modules" from SeongJae Park extends the use of DAMON core's
   addr_unit tunable.
 
 - The 5 patch series "mm: khugepaged cleanups and mTHP prerequisites"
   from Nico Pache provides cleanups in the khugepaged and is a base for
   Nico's planned khugepaged mTHP support.
 
 - The 15 patch series "mm: memory hot(un)plug and SPARSEMEM cleanups"
   from David Hildenbrand implements code movement and cleanups in the
   memhotplug and sparsemem code.
 
 - The 2 patch series "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and
   cleanup CONFIG_MIGRATION" from David Hildenbrand rationalizes some
   memhotplug Kconfig support.
 
 - The 6 patch series "change young flag check functions to return bool"
   from Baolin Wang is "a cleanup patchset to change all young flag check
   functions to return bool".
 
 - The 3 patch series "mm/damon/sysfs: fix memory leak and NULL
   dereference issues" from Josh Law and SeongJae Park fixes a few
   potential DAMON bugs.
 
 - The 25 patch series "mm/vma: convert vm_flags_t to vma_flags_t in vma
   code" from "converts a lot of the existing use of the legacy vm_flags_t
   data type to the new vma_flags_t type which replaces it".  Mainly in the
   vma code.
 
 - The 21 patch series "mm: expand mmap_prepare functionality and usage"
   from Lorenzo Stoakes "expands the mmap_prepare functionality, which is
   intended to replace the deprecated f_op->mmap hook which has been the
   source of bugs and security issues for some time".  Cleanups,
   documentation, extension of mmap_prepare into filesystem drivers.
 
 - The 13 patch series "mm/huge_memory: refactor zap_huge_pmd()" from
   Lorenzo Stoakes simplifies and cleans up zap_huge_pmd().  Additional
   cleanups around vm_normal_folio_pmd() and the softleaf functionality are
   performed.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCad3HDQAKCRDdBJ7gKXxA
 jrUQAPwNhPk5nPSxnyxjAeQtOBHqgCdnICeEismLajPKd9aYRgEA0s2XAu3tSUYi
 GrBnWImHG3s4ePQxVcPCegWTsOUrXgQ=
 =1Q7o
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - "maple_tree: Replace big node with maple copy" (Liam Howlett)

   Mainly prepararatory work for ongoing development but it does reduce
   stack usage and is an improvement.

 - "mm, swap: swap table phase III: remove swap_map" (Kairui Song)

   Offers memory savings by removing the static swap_map. It also yields
   some CPU savings and implements several cleanups.

 - "mm: memfd_luo: preserve file seals" (Pratyush Yadav)

   File seal preservation to LUO's memfd code

 - "mm: zswap: add per-memcg stat for incompressible pages" (Jiayuan
   Chen)

   Additional userspace stats reportng to zswap

 - "arch, mm: consolidate empty_zero_page" (Mike Rapoport)

   Some cleanups for our handling of ZERO_PAGE() and zero_pfn

 - "mm/kmemleak: Improve scan_should_stop() implementation" (Zhongqiu
   Han)

   A robustness improvement and some cleanups in the kmemleak code

 - "Improve khugepaged scan logic" (Vernon Yang)

   Improve khugepaged scan logic and reduce CPU consumption by
   prioritizing scanning tasks that access memory frequently

 - "Make KHO Stateless" (Jason Miu)

   Simplify Kexec Handover by transitioning KHO from an xarray-based
   metadata tracking system with serialization to a radix tree data
   structure that can be passed directly to the next kernel

 - "mm: vmscan: add PID and cgroup ID to vmscan tracepoints" (Thomas
   Ballasi and Steven Rostedt)

   Enhance vmscan's tracepointing

 - "mm: arch/shstk: Common shadow stack mapping helper and
   VM_NOHUGEPAGE" (Catalin Marinas)

   Cleanup for the shadow stack code: remove per-arch code in favour of
   a generic implementation

 - "Fix KASAN support for KHO restored vmalloc regions" (Pasha Tatashin)

   Fix a WARN() which can be emitted the KHO restores a vmalloc area

 - "mm: Remove stray references to pagevec" (Tal Zussman)

   Several cleanups, mainly udpating references to "struct pagevec",
   which became folio_batch three years ago

 - "mm: Eliminate fake head pages from vmemmap optimization" (Kiryl
   Shutsemau)

   Simplify the HugeTLB vmemmap optimization (HVO) by changing how tail
   pages encode their relationship to the head page

 - "mm/damon/core: improve DAMOS quota efficiency for core layer
   filters" (SeongJae Park)

   Improve two problematic behaviors of DAMOS that makes it less
   efficient when core layer filters are used

 - "mm/damon: strictly respect min_nr_regions" (SeongJae Park)

   Improve DAMON usability by extending the treatment of the
   min_nr_regions user-settable parameter

 - "mm/page_alloc: pcp locking cleanup" (Vlastimil Babka)

   The proper fix for a previously hotfixed SMP=n issue. Code
   simplifications and cleanups ensued

 - "mm: cleanups around unmapping / zapping" (David Hildenbrand)

   A bunch of cleanups around unmapping and zapping. Mostly
   simplifications, code movements, documentation and renaming of
   zapping functions

 - "support batched checking of the young flag for MGLRU" (Baolin Wang)

   Batched checking of the young flag for MGLRU. It's part cleanups; one
   benchmark shows large performance benefits for arm64

 - "memcg: obj stock and slab stat caching cleanups" (Johannes Weiner)

   memcg cleanup and robustness improvements

 - "Allow order zero pages in page reporting" (Yuvraj Sakshith)

   Enhance free page reporting - it is presently and undesirably order-0
   pages when reporting free memory.

 - "mm: vma flag tweaks" (Lorenzo Stoakes)

   Cleanup work following from the recent conversion of the VMA flags to
   a bitmap

 - "mm/damon: add optional debugging-purpose sanity checks" (SeongJae
   Park)

   Add some more developer-facing debug checks into DAMON core

 - "mm/damon: test and document power-of-2 min_region_sz requirement"
   (SeongJae Park)

   An additional DAMON kunit test and makes some adjustments to the
   addr_unit parameter handling

 - "mm/damon/core: make passed_sample_intervals comparisons
   overflow-safe" (SeongJae Park)

   Fix a hard-to-hit time overflow issue in DAMON core

 - "mm/damon: improve/fixup/update ratio calculation, test and
   documentation" (SeongJae Park)

   A batch of misc/minor improvements and fixups for DAMON

 - "mm: move vma_(kernel|mmu)_pagesize() out of hugetlb.c" (David
   Hildenbrand)

   Fix a possible issue with dax-device when CONFIG_HUGETLB=n. Some code
   movement was required.

 - "zram: recompression cleanups and tweaks" (Sergey Senozhatsky)

   A somewhat random mix of fixups, recompression cleanups and
   improvements in the zram code

 - "mm/damon: support multiple goal-based quota tuning algorithms"
   (SeongJae Park)

   Extend DAMOS quotas goal auto-tuning to support multiple tuning
   algorithms that users can select

 - "mm: thp: reduce unnecessary start_stop_khugepaged()" (Breno Leitao)

   Fix the khugpaged sysfs handling so we no longer spam the logs with
   reams of junk when starting/stopping khugepaged

 - "mm: improve map count checks" (Lorenzo Stoakes)

   Provide some cleanups and slight fixes in the mremap, mmap and vma
   code

 - "mm/damon: support addr_unit on default monitoring targets for
   modules" (SeongJae Park)

   Extend the use of DAMON core's addr_unit tunable

 - "mm: khugepaged cleanups and mTHP prerequisites" (Nico Pache)

   Cleanups to khugepaged and is a base for Nico's planned khugepaged
   mTHP support

 - "mm: memory hot(un)plug and SPARSEMEM cleanups" (David Hildenbrand)

   Code movement and cleanups in the memhotplug and sparsemem code

 - "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and cleanup
   CONFIG_MIGRATION" (David Hildenbrand)

   Rationalize some memhotplug Kconfig support

 - "change young flag check functions to return bool" (Baolin Wang)

   Cleanups to change all young flag check functions to return bool

 - "mm/damon/sysfs: fix memory leak and NULL dereference issues" (Josh
   Law and SeongJae Park)

   Fix a few potential DAMON bugs

 - "mm/vma: convert vm_flags_t to vma_flags_t in vma code" (Lorenzo
   Stoakes)

   Convert a lot of the existing use of the legacy vm_flags_t data type
   to the new vma_flags_t type which replaces it. Mainly in the vma
   code.

 - "mm: expand mmap_prepare functionality and usage" (Lorenzo Stoakes)

   Expand the mmap_prepare functionality, which is intended to replace
   the deprecated f_op->mmap hook which has been the source of bugs and
   security issues for some time. Cleanups, documentation, extension of
   mmap_prepare into filesystem drivers

 - "mm/huge_memory: refactor zap_huge_pmd()" (Lorenzo Stoakes)

   Simplify and clean up zap_huge_pmd(). Additional cleanups around
   vm_normal_folio_pmd() and the softleaf functionality are performed.

* tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
  mm: fix deferred split queue races during migration
  mm/khugepaged: fix issue with tracking lock
  mm/huge_memory: add and use has_deposited_pgtable()
  mm/huge_memory: add and use normal_or_softleaf_folio_pmd()
  mm: add softleaf_is_valid_pmd_entry(), pmd_to_softleaf_folio()
  mm/huge_memory: separate out the folio part of zap_huge_pmd()
  mm/huge_memory: use mm instead of tlb->mm
  mm/huge_memory: remove unnecessary sanity checks
  mm/huge_memory: deduplicate zap deposited table call
  mm/huge_memory: remove unnecessary VM_BUG_ON_PAGE()
  mm/huge_memory: add a common exit path to zap_huge_pmd()
  mm/huge_memory: handle buggy PMD entry in zap_huge_pmd()
  mm/huge_memory: have zap_huge_pmd return a boolean, add kdoc
  mm/huge: avoid big else branch in zap_huge_pmd()
  mm/huge_memory: simplify vma_is_specal_huge()
  mm: on remap assert that input range within the proposed VMA
  mm: add mmap_action_map_kernel_pages[_full]()
  uio: replace deprecated mmap hook with mmap_prepare in uio_info
  drivers: hv: vmbus: replace deprecated mmap hook with mmap_prepare
  mm: allow handling of stacked mmap_prepare hooks in more drivers
  ...
2026-04-15 12:59:16 -07:00
Paolo Bonzini
1b3090da8d KVM x86 MMU changes for 7.1
- Fix an undefined behavior warning where a crafty userspace can read kvm.ko's
    nx_huge_pages before it's fully initialized.
 
  - Don't zero-allocate page tables that are used for splitting hugepages in the
    TDP MMU, as KVM is guaranteed to set all SPTEs in the page table and thus
    write all bytes.
 
  - Bail early when trying to unsync 4KiB mappings if the target gfn can be
    mapped with a 2MiB hugepage, to avoid the gfn hash lookup.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmnZJZIACgkQOlYIJqCj
 N/1Tlg/+PMF0O39BRxjyIulWVyxWWU/CD3WxS4xSlOd1Pj3rrwIxDx1VYYkmWX90
 8DrBkGCBUdruzNcQQ9YOVU1H+z5yvs5QY2a75TWJX/5A40wXZPMl+8xSpbflP8vQ
 QECaKAyxc+IutnjdivbLxZGbXiFS5yp1PnvK7dOo75gazFkDBF0kpO8lFyvsoaUh
 Q97kkIT2eR/wTsVx+HXwhY2EgxiBdG6N69JIzzVwOvaddqtB6jSzDXPjAhPiOJVU
 xlKaJnlCnHhH59eyAegpef3BtpFuyArFVJUZoI0f7zEqMHiztS616iltHPPNeSgT
 8oxbJcvAIbLKd6RgN+DJ5zbSOVwU8O7G2WP6dXvuVsjN63nAX9NapMOD5gS7e6/9
 gjJq85XClFX2bOHiEnBSAVJ+9obgO69GgP+X3/MIIs0efnCL9ZGtu8tz9ClhkCwP
 6cU7vtojENJOYBM/YEUCaGnrB837+0TX1U4LO+tGJeZbWUJXb0UP0I3EpPsHnk0o
 m/M4q9qzbQHvT/GV3qajp4MLd1NE6W/386xTqvrGv0hFIo0HwDPIPVxzTD7IyypE
 sEm3HIOiLmE95BGWFKKeKhdT1Yb+wJRuggZXatKGTCeik5FOGYbxtclTGV4h8OHY
 Vv7Ll/Xom3ekL2dOQbFHbtlfv3ra+4HzLm5rd3uB+RoMcxyPAf0=
 =y0wF
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-mmu-7.1' of https://github.com/kvm-x86/linux into HEAD

KVM x86 MMU changes for 7.1

 - Fix an undefined behavior warning where a crafty userspace can read kvm.ko's
   nx_huge_pages before it's fully initialized.

 - Don't zero-allocate page tables that are used for splitting hugepages in the
   TDP MMU, as KVM is guaranteed to set all SPTEs in the page table and thus
   write all bytes.

 - Bail early when trying to unsync 4KiB mappings if the target gfn can be
   mapped with a 2MiB hugepage, to avoid the gfn hash lookup.
2026-04-13 12:50:08 +02:00
Mike Rapoport (Microsoft)
9a1d0c738b mm: rename my_zero_pfn() to zero_pfn()
my_zero_pfn() is a silly name.

Rename zero_pfn variable to zero_page_pfn and my_zero_pfn() function to
zero_pfn().

While on it, move extern declarations of zero_page_pfn outside the
functions that use it and add a comment about what ZERO_PAGE is.

Link: https://lkml.kernel.org/r/20260211103141.3215197-3-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Magnus Lindholm <linmag7@gmail.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 13:53:01 -07:00
Sean Christopherson
df83746075 KVM: x86/mmu: Only WARN in direct MMUs when overwriting shadow-present SPTE
Adjust KVM's sanity check against overwriting a shadow-present SPTE with a
another SPTE with a different target PFN to only apply to direct MMUs,
i.e. only to MMUs without shadowed gPTEs.  While it's impossible for KVM
to overwrite a shadow-present SPTE in response to a guest write, writes
from outside the scope of KVM, e.g. from host userspace, aren't detected
by KVM's write tracking and so can break KVM's shadow paging rules.

  ------------[ cut here ]------------
  pfn != spte_to_pfn(*sptep)
  WARNING: arch/x86/kvm/mmu/mmu.c:3069 at mmu_set_spte+0x1e4/0x440 [kvm], CPU#0: vmx_ept_stale_r/872
  Modules linked in: kvm_intel kvm irqbypass
  CPU: 0 UID: 1000 PID: 872 Comm: vmx_ept_stale_r Not tainted 7.0.0-rc2-eafebd2d2ab0-sink-vm #319 PREEMPT
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:mmu_set_spte+0x1e4/0x440 [kvm]
  Call Trace:
   <TASK>
   ept_page_fault+0x535/0x7f0 [kvm]
   kvm_mmu_do_page_fault+0xee/0x1f0 [kvm]
   kvm_mmu_page_fault+0x8d/0x620 [kvm]
   vmx_handle_exit+0x18c/0x5a0 [kvm_intel]
   kvm_arch_vcpu_ioctl_run+0xc55/0x1c20 [kvm]
   kvm_vcpu_ioctl+0x2d5/0x980 [kvm]
   __x64_sys_ioctl+0x8a/0xd0
   do_syscall_64+0xb5/0x730
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
   </TASK>
  ---[ end trace 0000000000000000 ]---

Fixes: 11d4517511 ("KVM: x86/mmu: Warn if PFN changes on shadow-present SPTE in shadow MMU")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-27 22:33:33 +01:00
Sean Christopherson
aad885e774 KVM: x86/mmu: Drop/zap existing present SPTE even when creating an MMIO SPTE
When installing an emulated MMIO SPTE, do so *after* dropping/zapping the
existing SPTE (if it's shadow-present).  While commit a54aa15c6b was
right about it being impossible to convert a shadow-present SPTE to an
MMIO SPTE due to a _guest_ write, it failed to account for writes to guest
memory that are outside the scope of KVM.

E.g. if host userspace modifies a shadowed gPTE to switch from a memslot
to emulted MMIO and then the guest hits a relevant page fault, KVM will
install the MMIO SPTE without first zapping the shadow-present SPTE.

  ------------[ cut here ]------------
  is_shadow_present_pte(*sptep)
  WARNING: arch/x86/kvm/mmu/mmu.c:484 at mark_mmio_spte+0xb2/0xc0 [kvm], CPU#0: vmx_ept_stale_r/4292
  Modules linked in: kvm_intel kvm irqbypass
  CPU: 0 UID: 1000 PID: 4292 Comm: vmx_ept_stale_r Not tainted 7.0.0-rc2-eafebd2d2ab0-sink-vm #319 PREEMPT
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:mark_mmio_spte+0xb2/0xc0 [kvm]
  Call Trace:
   <TASK>
   mmu_set_spte+0x237/0x440 [kvm]
   ept_page_fault+0x535/0x7f0 [kvm]
   kvm_mmu_do_page_fault+0xee/0x1f0 [kvm]
   kvm_mmu_page_fault+0x8d/0x620 [kvm]
   vmx_handle_exit+0x18c/0x5a0 [kvm_intel]
   kvm_arch_vcpu_ioctl_run+0xc55/0x1c20 [kvm]
   kvm_vcpu_ioctl+0x2d5/0x980 [kvm]
   __x64_sys_ioctl+0x8a/0xd0
   do_syscall_64+0xb5/0x730
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
  RIP: 0033:0x47fa3f
   </TASK>
  ---[ end trace 0000000000000000 ]---

Reported-by: Alexander Bulekov <bkov@amazon.com>
Debugged-by: Alexander Bulekov <bkov@amazon.com>
Suggested-by: Fred Griffoul <fgriffo@amazon.co.uk>
Fixes: a54aa15c6b ("KVM: x86/mmu: Handle MMIO SPTEs directly in mmu_set_spte()")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-27 22:33:33 +01:00
Lai Jiangshan
b3ae3ceb55 KVM: x86/mmu: KVM: x86/mmu: Skip unsync when large pages are allowed
Use the large-page metadata to avoid pointless attempts to search SP.

If the target GFN falls within a range where a large page is allowed,
then there cannot be a shadow page for that GFN; a shadow page in the
range would itself disallow using a large page. In that case, there
is nothing to unsync and mmu_try_to_unsync_pages() can return
immediately.

This is always true for TDP MMU without nested TDP, and holds for a
significant fraction of cases with shadow paging even all SPs are 4K.

For shadow paging, this optimization theoretically avoids work for about
1/e ~= 37% of GFNs, assuming one guest page table per 2M of memory and
that each GPT falls randomly into the 2M memory buckets. In a simple
test setup, it skipped unsync in a much higher percentage of cases,
mainly because the guest buddy allocator clusters GPTs into fewer
buckets.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://patch.msgid.link/20260123090304.32286-2-jiangshanlai@gmail.com
[sean: check for hugepage after write-tracking, update comment]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-12 10:36:01 -07:00
Sean Christopherson
ecb8062932 KVM: x86/mmu: Don't zero-allocate page table used for splitting a hugepage
When splitting hugepages in the TDP MMU, don't zero the new page table on
allocation since tdp_mmu_split_huge_page() is guaranteed to write every
entry and thus every byte.

Unless someone peeks at the memory between allocating the page table and
writing the child SPTEs, no functional change intended.

Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://patch.msgid.link/20260218210820.2828896-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-03 12:16:27 -08:00
Gal Pressman
1450ab0810 KVM: x86/mmu: Fix UBSAN warning when reading nx_huge_pages parameter
The nx_huge_pages parameter is stored as an int (initialized to -1 to
indicate auto mode), but get_nx_huge_pages() calls param_get_bool()
which expects a bool pointer.
This causes UBSAN to report "load of value 255 is not a valid value for
type '_Bool'" when the parameter is read via sysfs during a narrow time
window.

The issue occurs during module load: the module parameter is registered
and its sysfs file becomes readable before the kvm_mmu_x86_module_init()
function runs:

1. Module load begins, static variable initialized to -1
2. mod_sysfs_setup() creates /sys/module/kvm/parameters/nx_huge_pages
3. (Parameter readable, value = -1)
4. do_init_module() runs kvm_x86_init()
5. kvm_mmu_x86_module_init() resolves -1 to bool

If userspace (e.g., sos report) reads the parameter during step 3,
param_get_bool() dereferences the int as a bool, triggering the UBSAN
warning.

Fix that by properly reading and converting the -1 value into an 'auto'
string.

Fixes: b8e8c8303f ("kvm: mmu: ITLB_MULTIHIT mitigation")
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Link: https://patch.msgid.link/20260225145050.2350278-3-gal@nvidia.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-02 16:12:42 -08:00
Kees Cook
69050f8d6d treewide: Replace kmalloc with kmalloc_obj for non-scalar types
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-21 01:02:28 -08:00
Xiaoyao Li
57a7b47ab3 KVM: x86: Don't read guest CR3 when doing async pf while the MMU is direct
Don't read guest CR3 in kvm_arch_setup_async_pf() if the MMU is direct
and use INVALID_GPA instead.

When KVM tries to perform the host-only async page fault for the shared
memory of TDX guests, the following WARNING is triggered:

  WARNING: CPU: 1 PID: 90922 at arch/x86/kvm/vmx/main.c:483 vt_cache_reg+0x16/0x20
  Call Trace:
  __kvm_mmu_faultin_pfn
  kvm_mmu_faultin_pfn
  kvm_tdp_page_fault
  kvm_mmu_do_page_fault
  kvm_mmu_page_fault
  tdx_handle_ept_violation

This WARNING is triggered when calling kvm_mmu_get_guest_pgd() to cache
the guest CR3 in kvm_arch_setup_async_pf() for later use in
kvm_arch_async_page_ready() to determine if it's possible to fix the
page fault in the current vCPU context to save one VM exit. However, when
guest state is protected, KVM cannot read the guest CR3.

Since protected guests aren't compatible with shadow paging, i.e, they
must use direct MMU, avoid calling kvm_mmu_get_guest_pgd() to read guest
CR3 when the MMU is direct and use INVALID_GPA instead.

Note that for protected guests mmu->root_role.direct is always true, so
that kvm_mmu_get_guest_pgd() in kvm_arch_async_page_ready() won't be
reached.

Reported-by: Farrah Chen <farrah.chen@intel.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://patch.msgid.link/20251212135051.2155280-1-xiaoyao.li@intel.com
[sean: explicitly cast to "unsigned long" to make 32-bit builds happy]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08 12:04:58 -08:00
Sean Christopherson
b47b93c15b KVM: x86: Disallow setting CPUID and/or feature MSRs if L2 is active
Extend KVM's restriction on CPUID and feature MSR changes to disallow
updates while L2 is active in addition to rejecting updates after the vCPU
has run at least once.  Like post-run vCPU model updates, attempting to
react to model changes while L2 is active is practically infeasible, e.g.
KVM would need to do _something_ in response to impossible situations where
userspace has a removed a feature that was consumed as parted of nested
VM-Enter.

In practice, disallowing vCPU model changes while L2 is active is largely
uninteresting, as the only way for L2 to be active without the vCPU having
run at least once is if userspace stuffed state via KVM_SET_NESTED_STATE.
And because KVM_SET_NESTED_STATE can't put the vCPU into L2 without
userspace first defining the vCPU model, e.g. to enable SVM/VMX, modifying
the vCPU model while L2 is active would require deliberately setting the
vCPU model, then loading nested state, and then changing the model.  I.e.
no sane VMM should run afoul of the new restriction, and any VMM that does
encounter problems has likely been running a broken setup for a long time.

Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Kevin Cheng <chengkev@google.com>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251230205641.4092235-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-07 12:44:42 -08:00
Paolo Bonzini
d1e7b4613e KVM VMX changes for 6.19:
- Use the root role from kvm_mmu_page to construct EPTPs instead of the
    current vCPU state, partly as worthwhile cleanup, but mostly to pave the
    way for tracking per-root TLB flushes so that KVM can elide EPT flushes on
    pCPU migration if KVM has flushed the root at least once.
 
  - Add a few missing nested consistency checks.
 
  - Rip out support for doing "early" consistency checks via hardware as the
    functionality hasn't been used in years and is no longer useful in general,
    and replace it with an off-by-default module param to detected missed
    consistency checks (i.e. WARN if hardware finds a check that KVM does not).
 
  - Fix a currently-benign bug where KVM would drop the guest's SPEC_CTRL[63:32]
    on VM-Enter.
 
  - Misc cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmkmWaoACgkQOlYIJqCj
 N/39qw//T14lCRVtO/raX1cuFCyYRTtzEEwRF+T1AdnhLoigduz/ajALLc/giF3Y
 qU5Ubl/N9k/uC62mFd+tC8e/F7BKXHJIvC2WxbxzT00nos/gmHm7ZLZRlFWI51cs
 9oshc87lD0XIW6Ze6Dq9xbJVA2ly1AadvrdFgi8p+/kRTO0/Vyxm9AzmvEzRzO6l
 F6RXRzWzbJtNmmdqBKVMuiMAObP6nPs8Gh3gCHnBGdrSTw/Wt7W7nCLujFK4VlOD
 IXZXDhnYRcSKh8NW/O6Y42VFYN0pzApMDKiFtrSM0kCkANWFusCBXiM/vt1StdZr
 /MNlwJkZbIfP1jMI2km/0TSRM3x9RIlAEJ/07w38WQouc5an7eoVquNtLa3PgJ2L
 AqnKzNro0TtfOSg0Nhy9LznZPOub/4VegCYvxc4+6Q74jKv8pAFMEW9uvJrSq6rk
 cqQCxWG3qid6xCoE6n5uPaMWRBROVG9EzfkK+K3JyvLut9iClKHq37dFOkNQ+AZe
 03uhP0CP529cnbqG+1K1VtPKl9VX7PC+Mg3ZPHI3n1GUL+/AFj3K7bpaZ+LaJpyK
 jfuxBffA69dD57yPGURdfiiTkfUbbAyEP7eiBcV3Yi7vTHkbi+AsF21TA0wsGIpo
 WoRw/jrjw4ScHgNbBP24QmBYynlBUzLXGo7lXE2HlUIst4qqOY0=
 =hP3F
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-vmx-6.19' of https://github.com/kvm-x86/linux into HEAD

KVM VMX changes for 6.19:

 - Use the root role from kvm_mmu_page to construct EPTPs instead of the
   current vCPU state, partly as worthwhile cleanup, but mostly to pave the
   way for tracking per-root TLB flushes so that KVM can elide EPT flushes on
   pCPU migration if KVM has flushed the root at least once.

 - Add a few missing nested consistency checks.

 - Rip out support for doing "early" consistency checks via hardware as the
   functionality hasn't been used in years and is no longer useful in general,
   and replace it with an off-by-default module param to detected missed
   consistency checks (i.e. WARN if hardware finds a check that KVM does not).

 - Fix a currently-benign bug where KVM would drop the guest's SPEC_CTRL[63:32]
   on VM-Enter.

 - Misc cleanups.
2025-11-26 09:44:52 +01:00
Paolo Bonzini
de8e8ebb1a KVM TDX changes for 6.19:
- Overhaul the TDX code to address systemic races where KVM (acting on behalf
    of userspace) could inadvertantly trigger lock contention in the TDX-Module,
    which KVM was either working around in weird, ugly ways, or was simply
    oblivious to (as proven by Yan tripping several KVM_BUG_ON()s with clever
    selftests).
 
  - Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a vCPU if
    creating said vCPU failed partway through.
 
  - Fix a few sparse warnings (bad annotation, 0 != NULL).
 
  - Use struct_size() to simplify copying capabilities to userspace.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmkmVkAACgkQOlYIJqCj
 N/18Ow//cWPmXAdJcM0fRtnSGwzIZszGSD63htgdh5UDeJIFVyUGKH7uGhndQUwK
 Uo8jCJ4ikwMxDdCijv+e4eqCCMZjb7HQhFKaauPVCJZOhmZn0br3EB5xX24Qgp8R
 YN5gTheiTCHHVaxAMl9grgi1xTRi6pJRufRebOmtyGKNQkclctXcuSdtw7IEhqdM
 wKM3eyb7qUhUrmt5tBkSyFAioGcPJIHE3vqLjImqDgduinbXJdQa1sek4Br0sX45
 rfISZ2geXDj/Sh7EPrPU1ne5LQbtgzp1WTG6MRCidYfP86riMQUlEMY6odEYAgIX
 kCd+z248OJShF5EYcEmjc894YLHJ0vVXIXKx/qh0+Jiobz3bujk+whaxTNa26rj0
 3qLPGzFpYugtxkGqBYH4q90oUTovEk4922+jPsQ9GKQ26f0q3XzvriEUSOgrvo0Z
 O26OyK7BezqSM5WMMSf/EGI1ESuli5lbBLYDOaNZS35di2YcDEgtaikRETpWwy82
 TGxrjyeW9Pu6M3iTtQsOVHNxA4hU//Qd5HcDj5rcXOg1rgiPV9n2OaCEMwc6qi+V
 VytbGm4IlMsz6AVHqyv3SUIt1Z4LNAZ/FwK8oeBRVd6LNfm6nfyrW6eQFQVLoIpA
 1nyi9XjMg7xj6ubiSEQSTSl9gto8FzVWwLKwZ8dLH7SPvqlz+zY=
 =qGpA
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-tdx-6.19' of https://github.com/kvm-x86/linux into HEAD

KVM TDX changes for 6.19:

 - Overhaul the TDX code to address systemic races where KVM (acting on behalf
   of userspace) could inadvertantly trigger lock contention in the TDX-Module,
   which KVM was either working around in weird, ugly ways, or was simply
   oblivious to (as proven by Yan tripping several KVM_BUG_ON()s with clever
   selftests).

 - Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a vCPU if
   creating said vCPU failed partway through.

 - Fix a few sparse warnings (bad annotation, 0 != NULL).

 - Use struct_size() to simplify copying capabilities to userspace.
2025-11-26 09:36:37 +01:00
Paolo Bonzini
adc99a6cfc KVM x86 MMU changes for 6.19:
- Skip the costly "zap all SPTEs" on an MMIO generation wrap if MMIO SPTE
    caching is disabled, as there can't be any relevant SPTEs to zap.
 
  - Relocate a misplace export.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmkmUUgACgkQOlYIJqCj
 N/2l/BAAqh0Ac4IZaQ6JMxkI5ZKEUVd/UhkMhSZMMVomXR9H9l6MkiCGAMRurOss
 5fAcJfT/hZ25TeLt6XEaOR/jq7QmHetMtENcO6isLw5tp5cLNowoPZHJLRpBZJeC
 bpFFss/Y5p306vXhGJUH1AHtlp3EgnpCNO3ANoZNISeiicg29G1GxoLwLjEisXeO
 QQ2K8UQGLCk3Mk2u/gWmfdDxeN4RcN3XG4+SQOZiCzWMZUwXyerVnkWbXz/OzBWZ
 vFi24tEwdPQMmnG2ss69LBjEWAritNbKygEwNkKDUo5BZBNDmdmAtAqPcZIOong9
 QZ83dfqVb/4Q9kgHmGfOwuT5MrZbUorXG6SpxFvHDJ22k7TMH16Z+fRDn89vP2e7
 wGHW9jIN3IbPt1RxFD08vsbFAqKKLihVOXkkv65mICz8Pwt36UG3bOmaWLw0Jf7f
 0jGOlKyP8tOcb3IUsVAvE52piEXs5xeFnXuyl0YemvDbN/4gfd1u7ZtO/S3gev2k
 Mo0mqZj4RRVP8al5odQ4yS0rQOF2ztaS7RWDh0ZrWEjKGlDwMk2x/IMpXzcgmvRd
 v/rSED/rPg7nxDZjYEqlffHBUSLGx6ivBzc5/ThSqQE+CKEy1K+wlXp1/t+16D+2
 Ltld2NaVoUeva7owdZa0Qh15Ul+QoMzHBmmzxQM5SE70nz67V14=
 =m2IM
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-mmu-6.19' of https://github.com/kvm-x86/linux into HEAD

KVM x86 MMU changes for 6.19:

 - Skip the costly "zap all SPTEs" on an MMIO generation wrap if MMIO SPTE
   caching is disabled, as there can't be any relevant SPTEs to zap.

 - Relocate a misplace export.
2025-11-26 09:36:01 +01:00
Brendan Jackman
38ee66cb18 KVM: x86: Unify L1TF flushing under per-CPU variable
Currently the tracking of the need to flush L1D for L1TF is tracked by
two bits: one per-CPU and one per-vCPU.

The per-vCPU bit is always set when the vCPU shows up on a core, so
there is no interesting state that's truly per-vCPU. Indeed, this is a
requirement, since L1D is a part of the physical CPU.

So simplify this by combining the two bits.

The vCPU bit was being written from preemption-enabled regions.  To play
nice with those cases, wrap all calls from KVM and use a raw write so that
request a flush with preemption enabled doesn't trigger what would
effectively be DEBUG_PREEMPT false positives.  Preemption doesn't need to
be disabled, as kvm_arch_vcpu_load() will mark the new CPU as needing a
flush if the vCPU task is migrated, or if userspace runs the vCPU on a
different task.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
[sean: put raw write in KVM instead of in a hardirq.h variant]
Link: https://patch.msgid.link/20251113233746.1703361-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-18 16:22:45 -08:00
Sean Christopherson
f6106d41ec x86/bugs: Use an x86 feature to track the MMIO Stale Data mitigation
Convert the MMIO Stale Data mitigation tracking from a static branch into
an x86 feature flag so that it can be used via ALTERNATIVE_2 in KVM.

No functional change intended.

Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Reviewed-by: Brendan Jackman <jackmanb@google.com>
Link: https://patch.msgid.link/20251113233746.1703361-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-18 16:22:42 -08:00
Sean Christopherson
b3e5b670c9 KVM: x86: Use "checked" versions of get_user() and put_user()
Use the normal, checked versions for get_user() and put_user() instead of
the double-underscore versions that omit range checks, as the checked
versions are actually measurably faster on modern CPUs (12%+ on Intel,
25%+ on AMD).

The performance hit on the unchecked versions is almost entirely due to
the added LFENCE on CPUs where LFENCE is serializing (which is effectively
all modern CPUs), which was added by commit 304ec1b050 ("x86/uaccess:
Use __uaccess_begin_nospec() and uaccess_try_nospec").  The small
optimizations done by commit b19b74bc99 ("x86/mm: Rework address range
check in get_user() and put_user()") likely shave a few cycles off, but
the bulk of the extra latency comes from the LFENCE.

Don't bother trying to open-code an equivalent for performance reasons, as
the loss of inlining (e.g. see commit ea6f043fc9 ("x86: Make __get_user()
generate an out-of-line call") is largely a non-factor (ignoring setups
where RET is something entirely different),

As measured across tens of millions of calls of guest PTE reads in
FNAME(walk_addr_generic):

              __get_user()  get_user()  open-coded  open-coded, no LFENCE
Intel (EMR)           75.1        67.6        75.3                   65.5
AMD (Turin)           68.1        51.1        67.5                   49.3

Note, Hyper-V MSR emulation is not a remotely hot path, but convert it
anyways for consistency, and because there is a general desire to remove
__{get,put}_user() entirely.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Closes: https://lore.kernel.org/all/CAHk-=wimh_3jM9Xe8Zx0rpuf8CPDu6DkRCGb44azk0Sz5yqSnw@mail.gmail.com
Cc: Borislav Petkov <bp@alien8.de>
Link: https://patch.msgid.link/20251106210206.221558-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-17 07:50:20 -08:00
Sean Christopherson
b9d5cf6de0 KVM: TDX: WARN if mirror SPTE doesn't have full RWX when creating S-EPT mapping
Pass in the mirror_spte to kvm_x86_ops.set_external_spte() to provide
symmetry with .remove_external_spte(), and assert in TDX that the mirror
SPTE is shadow-present with full RWX permissions (the TDX-Module doesn't
allow the hypervisor to control protections).

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251030200951.3402865-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05 11:05:51 -08:00
Sean Christopherson
7139c86065 KVM: x86/mmu: Drop the return code from kvm_x86_ops.remove_external_spte()
Drop the return code from kvm_x86_ops.remove_external_spte(), a.k.a.
tdx_sept_remove_private_spte(), as KVM simply does a KVM_BUG_ON() failure,
and that KVM_BUG_ON() is redundant since all error paths in TDX also do a
KVM_BUG_ON().

Opportunistically pass the spte instead of the pfn, as the API is clearly
about removing an spte.

Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251030200951.3402865-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05 11:05:50 -08:00
Sean Christopherson
6de2fb089b KVM: x86/mmu: Rename kvm_tdp_map_page() to kvm_tdp_page_prefault()
Rename kvm_tdp_map_page() to kvm_tdp_page_prefault() now that it's used
only by kvm_arch_vcpu_pre_fault_memory().

No functional change intended.

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251030200951.3402865-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05 11:03:14 -08:00
Sean Christopherson
fe7413e398 Revert "KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU"
Remove the helper and exports that were added to allow TDX code to reuse
kvm_tdp_map_page() for its gmem post-populate flow now that a dedicated
TDP MMU API is provided to install a mapping given a gfn+pfn pair.

This reverts commit 2608f10576.

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251030200951.3402865-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05 11:03:14 -08:00
Sean Christopherson
c1f173fb33 KVM: x86/mmu: WARN if KVM attempts to map into an invalid TDP MMU root
When mapping into the TDP MMU, WARN (if KVM_PROVE_MMU=y) if the root is
invalid, e.g. if KVM is attempting to insert a mapping without checking if
the information and MMU context is fresh.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251030200951.3402865-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05 11:03:13 -08:00
Sean Christopherson
3ab3283dbb KVM: x86/mmu: Add dedicated API to map guest_memfd pfn into TDP MMU
Add and use a new API for mapping a private pfn from guest_memfd into the
TDP MMU from TDX's post-populate hook instead of partially open-coding the
functionality into the TDX code.  Sharing code with the pre-fault path
sounded good on paper, but it's fatally flawed as simulating a fault loses
the pfn, and calling back into gmem to re-retrieve the pfn creates locking
problems, e.g. kvm_gmem_populate() already holds the gmem invalidation
lock.

Providing a dedicated API will also removing several MMU exports that
ideally would not be exposed outside of the MMU, let alone to vendor code.
On that topic, opportunistically drop the kvm_mmu_load() export.  Leave
kvm_tdp_mmu_gpa_is_mapped() alone for now; the entire commit that added
kvm_tdp_mmu_gpa_is_mapped() will be removed in the near future.

Gate the API on CONFIG_KVM_GUEST_MEMFD=y as private memory _must_ be backed
by guest_memfd.  Add a lockdep-only assert to that the incoming pfn is
indeed backed by guest_memfd, and that the gmem instance's invalidate lock
is held (which, combined with slots_lock being held, obviates the need to
check for a stale "fault").

Cc: Michael Roth <michael.roth@amd.com>
Cc: Yan Zhao <yan.y.zhao@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/all/20250709232103.zwmufocd3l7sqk7y@amd.com
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251030200951.3402865-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05 11:03:12 -08:00
Kai Huang
6422060aa9 KVM: x86/mmu: Move the misplaced export of kvm_zap_gfn_range()
Currently, the export of kvm_zap_gfn_range() is misplaced, i.e., it's
not placed right after the kvm_zap_gfn_range() function body but after
kvm_mmu_zap_collapsible_spte().  Move it to the right place.

No functional change intended.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251021114345.159372-1-kai.huang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-04 09:51:06 -08:00
Sean Christopherson
a10f5cc3ac KVM: x86/mmu: Move "dummy root" helpers to spte.h
Move the helpers to get/query a dummy root from mmu_internal.h to spte.h
so that VMX can detect and handle dummy roots when constructing EPTPs.
This will allow using the root's role to build the EPTP instead of pulling
equivalent information out of the vCPU structure.

No functional change intended.

Link: https://lore.kernel.org/r/20250919005955.1366256-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-10-17 15:11:26 -07:00
Dmytro Maluka
b850841a53 KVM: x86/mmu: Skip MMIO SPTE invalidation if enable_mmio_caching=0
If MMIO caching is disabled, there are no MMIO SPTEs to invalidate, so
the costly zapping of all pages is unnecessary even in the unlikely case
when the MMIO generation number has wrapped.

Signed-off-by: Dmytro Maluka <dmaluka@chromium.org>
Link: https://lore.kernel.org/r/20250926135139.1597781-1-dmaluka@chromium.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-10-13 14:50:46 -07:00
Sean Christopherson
6b36119b94 KVM: x86: Export KVM-internal symbols for sub-modules only
Rework almost all of KVM x86's exports to expose symbols only to KVM's
vendor modules, i.e. to kvm-{amd,intel}.ko.  Keep the generic exports that
are guarded by CONFIG_KVM_EXTERNAL_WRITE_TRACKING=y, as they're explicitly
designed/intended for external usage.

Link: https://lore.kernel.org/r/20250919003303.1355064-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-09-30 13:40:02 -04:00
Paolo Bonzini
12abeb81c8 KVM x86 CET virtualization support for 6.18
Add support for virtualizing Control-flow Enforcement Technology (CET) on
 Intel (Shadow Stacks and Indirect Branch Tracking) and AMD (Shadow Stacks).
 
 CET is comprised of two distinct features, Shadow Stacks (SHSTK) and Indirect
 Branch Tracking (IBT), that can be utilized by software to help provide
 Control-flow integrity (CFI).  SHSTK defends against backward-edge attacks
 (a.k.a. Return-oriented programming (ROP)), while IBT defends against
 forward-edge attacks (a.k.a. similarly CALL/JMP-oriented programming (COP/JOP)).
 
 Attackers commonly use ROP and COP/JOP methodologies to redirect the control-
 flow to unauthorized targets in order to execute small snippets of code,
 a.k.a. gadgets, of the attackers choice.  By chaining together several gadgets,
 an attacker can perform arbitrary operations and circumvent the system's
 defenses.
 
 SHSTK defends against backward-edge attacks, which execute gadgets by modifying
 the stack to branch to the attacker's target via RET, by providing a second
 stack that is used exclusively to track control transfer operations.  The
 shadow stack is separate from the data/normal stack, and can be enabled
 independently in user and kernel mode.
 
 When SHSTK is is enabled, CALL instructions push the return address on both the
 data and shadow stack. RET then pops the return address from both stacks and
 compares the addresses.  If the return addresses from the two stacks do not
 match, the CPU generates a Control Protection (#CP) exception.
 
 IBT defends against backward-edge attacks, which branch to gadgets by executing
 indirect CALL and JMP instructions with attacker controlled register or memory
 state, by requiring the target of indirect branches to start with a special
 marker instruction, ENDBRANCH.  If an indirect branch is executed and the next
 instruction is not an ENDBRANCH, the CPU generates a #CP.  Note, ENDBRANCH
 behaves as a NOP if IBT is disabled or unsupported.
 
 From a virtualization perspective, CET presents several problems.  While SHSTK
 and IBT have two layers of enabling, a global control in the form of a CR4 bit,
 and a per-feature control in user and kernel (supervisor) MSRs (U_CET and S_CET
 respectively), the {S,U}_CET MSRs can be context switched via XSAVES/XRSTORS.
 Practically speaking, intercepting and emulating XSAVES/XRSTORS is not a viable
 option due to complexity, and outright disallowing use of XSTATE to context
 switch SHSTK/IBT state would render the features unusable to most guests.
 
 To limit the overall complexity without sacrificing performance or usability,
 simply ignore the potential virtualization hole, but ensure that all paths in
 KVM treat SHSTK/IBT as usable by the guest if the feature is supported in
 hardware, and the guest has access to at least one of SHSTK or IBT.  I.e. allow
 userspace to advertise one of SHSTK or IBT if both are supported in hardware,
 even though doing so would allow a misbehaving guest to use the unadvertised
 feature.
 
 Fully emulating SHSTK and IBT would also require significant complexity, e.g.
 to track and update branch state for IBT, and shadow stack state for SHSTK.
 Given that emulating large swaths of the guest code stream isn't necessary on
 modern CPUs, punt on emulating instructions that meaningful impact or consume
 SHSTK or IBT.  However, instead of doing nothing, explicitly reject emulation
 of such instructions so that KVM's emulator can't be abused to circumvent CET.
 Disable support for SHSTK and IBT if KVM is configured such that emulation of
 arbitrary guest instructions may be required, specifically if Unrestricted
 Guest (Intel only) is disabled, or if KVM will emulate a guest.MAXPHYADDR that
 is smaller than host.MAXPHYADDR.
 
 Lastly disable SHSTK support if shadow paging is enabled, as the protections
 for the shadow stack are novel (shadow stacks require Writable=0,Dirty=1, so
 that they can't be directly modified by software), i.e. would require
 non-trivial support in the Shadow MMU.
 
 Note, AMD CPUs currently only support SHSTK.  Explicitly disable IBT support
 so that KVM doesn't over-advertise if AMD CPUs add IBT, and virtualizing IBT
 in SVM requires KVM modifications.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmjXbisACgkQOlYIJqCj
 N/373w//ckB4c9MjS6eDRp+LtTXQfXyAs8eMcs9YTs7yD3uMvqcbaNuDsf1U2cI6
 i2qcuOdxlnKSJphn6oH2JKDWPjRAfHhCqmYghUPaJwgeYqsTfork9s8rzU2tC82q
 38mQ6BhAuOwa/plodvDp/+POEIoXUyexSoWX+cngGVTmFWdbfA4NNGjWMZOl1XG2
 qLBck6t+IxxUTs1Ij+OsexlAKdY7FcZZ85Ok6I/VE4/lITEhuTJkwkYdh8td3KK/
 IVVk1jb1Z7t8lGQ5fi3+N/D8iHJ/0ladmOux6Yxzw88uyj6XLIFOOFsdK09GyhUS
 QzV06syFkV2vU68VDYiOcMZIdeGmYR5jDpmy9N+o0s86YLU6rKKEaXRP7vW5yHj/
 99AU+DfRHvhqKwWyQ51B+rhr80F3EQrkZXI0QBr8KO7sseFZvZNNVozwKjSyZtNH
 VBhxjIlVQm5Z1rjucKjc573sONK95z9XUSZjYnCUwB1NH7VsvdULQmJBucCmzW/p
 9j49CpmShwggceV6LcYg4Miuvjl/bL1B8Go5Fg+1Fdg7L6Nepi16yywxHmyPqreJ
 Wx/6N0gqZ3LKDdl5CFYxAxvJoldJR6lbw/AGjvFkre8A+TGGRdz3uS9XXqGHvtbu
 W5wKhnvGov69lm4xYbxbI+rvxYmmQLm9SgQXel23icbKJ5kmE48=
 =zsBl
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-cet-6.18' of https://github.com/kvm-x86/linux into HEAD

KVM x86 CET virtualization support for 6.18

Add support for virtualizing Control-flow Enforcement Technology (CET) on
Intel (Shadow Stacks and Indirect Branch Tracking) and AMD (Shadow Stacks).

CET is comprised of two distinct features, Shadow Stacks (SHSTK) and Indirect
Branch Tracking (IBT), that can be utilized by software to help provide
Control-flow integrity (CFI).  SHSTK defends against backward-edge attacks
(a.k.a. Return-oriented programming (ROP)), while IBT defends against
forward-edge attacks (a.k.a. similarly CALL/JMP-oriented programming (COP/JOP)).

Attackers commonly use ROP and COP/JOP methodologies to redirect the control-
flow to unauthorized targets in order to execute small snippets of code,
a.k.a. gadgets, of the attackers choice.  By chaining together several gadgets,
an attacker can perform arbitrary operations and circumvent the system's
defenses.

SHSTK defends against backward-edge attacks, which execute gadgets by modifying
the stack to branch to the attacker's target via RET, by providing a second
stack that is used exclusively to track control transfer operations.  The
shadow stack is separate from the data/normal stack, and can be enabled
independently in user and kernel mode.

When SHSTK is is enabled, CALL instructions push the return address on both the
data and shadow stack. RET then pops the return address from both stacks and
compares the addresses.  If the return addresses from the two stacks do not
match, the CPU generates a Control Protection (#CP) exception.

IBT defends against backward-edge attacks, which branch to gadgets by executing
indirect CALL and JMP instructions with attacker controlled register or memory
state, by requiring the target of indirect branches to start with a special
marker instruction, ENDBRANCH.  If an indirect branch is executed and the next
instruction is not an ENDBRANCH, the CPU generates a #CP.  Note, ENDBRANCH
behaves as a NOP if IBT is disabled or unsupported.

From a virtualization perspective, CET presents several problems.  While SHSTK
and IBT have two layers of enabling, a global control in the form of a CR4 bit,
and a per-feature control in user and kernel (supervisor) MSRs (U_CET and S_CET
respectively), the {S,U}_CET MSRs can be context switched via XSAVES/XRSTORS.
Practically speaking, intercepting and emulating XSAVES/XRSTORS is not a viable
option due to complexity, and outright disallowing use of XSTATE to context
switch SHSTK/IBT state would render the features unusable to most guests.

To limit the overall complexity without sacrificing performance or usability,
simply ignore the potential virtualization hole, but ensure that all paths in
KVM treat SHSTK/IBT as usable by the guest if the feature is supported in
hardware, and the guest has access to at least one of SHSTK or IBT.  I.e. allow
userspace to advertise one of SHSTK or IBT if both are supported in hardware,
even though doing so would allow a misbehaving guest to use the unadvertised
feature.

Fully emulating SHSTK and IBT would also require significant complexity, e.g.
to track and update branch state for IBT, and shadow stack state for SHSTK.
Given that emulating large swaths of the guest code stream isn't necessary on
modern CPUs, punt on emulating instructions that meaningful impact or consume
SHSTK or IBT.  However, instead of doing nothing, explicitly reject emulation
of such instructions so that KVM's emulator can't be abused to circumvent CET.
Disable support for SHSTK and IBT if KVM is configured such that emulation of
arbitrary guest instructions may be required, specifically if Unrestricted
Guest (Intel only) is disabled, or if KVM will emulate a guest.MAXPHYADDR that
is smaller than host.MAXPHYADDR.

Lastly disable SHSTK support if shadow paging is enabled, as the protections
for the shadow stack are novel (shadow stacks require Writable=0,Dirty=1, so
that they can't be directly modified by software), i.e. would require
non-trivial support in the Shadow MMU.

Note, AMD CPUs currently only support SHSTK.  Explicitly disable IBT support
so that KVM doesn't over-advertise if AMD CPUs add IBT, and virtualizing IBT
in SVM requires KVM modifications.
2025-09-30 13:37:14 -04:00
Paolo Bonzini
5b0d0d8542 KVM x86 MMU changes for 6.18
- Recover possible NX huge pages within the TDP MMU under read lock to
    reduce guest jitter when restoring NX huge pages.
 
  - Return -EAGAIN during prefault if userspace concurrently deletes/moves the
    relevant memslot to fix an issue where prefaulting could deadlock with the
    memslot update.
 
  - Don't retry in TDX's anti-zero-step mitigation if the target memslot is
    invalid, i.e. is being deleted or moved, to fix a deadlock scenario similar
    to the aforementioned prefaulting case.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmjXHaEACgkQOlYIJqCj
 N/1uDxAAxGMl1q1Hg0tpVPw7PdcourXlVYJjFzsrK6CdtZpL7n2GJPVhEFBDovud
 oIM9IIiP5f2UDtWeRb6b/mm9INqwTB8lyswbJk/tO+CshBiBdE7PfDbzDzvj9lAv
 Uecc6tQhv+CDpJcSf7t5OqgiRo5gEBTXZZj0l5GOdtiaOU09eq4ttZTME5S1jQgh
 kBddFd3glWeMLv67cTNCxdHsOFnaVWIBoupfw7Fv7LVJ1k6cgKyHAhjfq8A9elEK
 3CyDo8DZ8MG4aguhHzAUQuEM9ELMxOTyJG8xS2BWtFA/glbvUBnOfGeyTmHgo/nN
 qKyjytlpmO0yIlehTd/5tLfpidL8l30VN7+nDpqwTjCDEz9bC39zC9zBmKni84Dt
 wItfmELb6lbvprA+FOseiRwk7/2quLrgc4y21GI29Zqbf6wMoQEnRHF/moFZ3cqg
 C/SP1Ev6N5ENM2BZG9mFSRWr8e2yyan8YWs+AUtsBEM82KaeJrMlZ4yqA1m33a5T
 YK5eL3DablObdfvvz1YXCVxByQ7aIbVCpE3VVigeyHrqoR/EFwZMzYLouOI34jjN
 Nj5+Qck6VMhI+OetUlcXS1D/DIHgpDgZFPcgeLURiwO0l62H/gYLHuoCek4YmkIi
 30ZwVXubBWDg5TcxEi5oIbVfyZfHNi+MyeLMWLEy6hEdnFsTsZU=
 =6qMx
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-mmu-6.18' of https://github.com/kvm-x86/linux into HEAD

KVM x86 MMU changes for 6.18

 - Recover possible NX huge pages within the TDP MMU under read lock to
   reduce guest jitter when restoring NX huge pages.

 - Return -EAGAIN during prefault if userspace concurrently deletes/moves the
   relevant memslot to fix an issue where prefaulting could deadlock with the
   memslot update.

 - Don't retry in TDX's anti-zero-step mitigation if the target memslot is
   invalid, i.e. is being deleted or moved, to fix a deadlock scenario similar
   to the aforementioned prefaulting case.
2025-09-30 13:32:27 -04:00
Sean Christopherson
843af0f2e4 KVM: x86/mmu: Pretty print PK, SS, and SGX flags in MMU tracepoints
Add PK (Protection Keys), SS (Shadow Stacks), and SGX (Software Guard
Extensions) to the set of #PF error flags handled via
kvm_mmu_trace_pferr_flags.  While KVM doesn't expect PK or SS #PFs in
particular, pretty print their names instead of the raw hex value saves
the user from having to go spelunking in the SDM to figure out what's
going on.

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-23-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23 09:17:32 -07:00
Sean Christopherson
3ccbf6f470 KVM: x86/mmu: Return -EAGAIN if userspace deletes/moves memslot during prefault
Return -EAGAIN if userspace attempts to delete or move a memslot while also
prefaulting memory for that same memslot, i.e. force userspace to retry
instead of trying to handle the scenario entirely within KVM.  Unlike
KVM_RUN, which needs to handle the scenario entirely within KVM because
userspace has come to depend on such behavior, KVM_PRE_FAULT_MEMORY can
return -EAGAIN without breaking userspace as this scenario can't have ever
worked (and there's no sane use case for prefaulting to a memslot that's
being deleted/moved).

And also unlike KVM_RUN, the prefault path doesn't naturally guarantee
forward progress.  E.g. to handle such a scenario, KVM would need to drop
and reacquire SRCU to break the deadlock between the memslot update
(synchronizes SRCU) and the prefault (waits for the memslot update to
complete).

However, dropping SRCU creates more problems, as completing the memslot
update will bump the memslot generation, which in turn will invalidate the
MMU root.  To handle that, prefaulting would need to handle pending
KVM_REQ_MMU_FREE_OBSOLETE_ROOTS requests and do kvm_mmu_reload() prior to
mapping each individual.

I.e. to fully handle this scenario, prefaulting would eventually need to
look a lot like vcpu_enter_guest().  Given that there's no reasonable use
case and practically zero risk of breaking userspace, punt the problem to
userspace and avoid adding unnecessary complexity to the prefault path.

Note, TDX's guest_memfd post-populate path is unaffected as slots_lock is
held for the entire duration of populate(), i.e. any memslot modifications
will be fully serialized against TDX's flavor of prefaulting.

Reported-by: Reinette Chatre <reinette.chatre@intel.com>
Closes: https://lore.kernel.org/all/20250519023737.30360-1-yan.y.zhao@intel.com
Debugged-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250822070347.26451-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-10 12:06:30 -07:00
Ackerley Tng
f029f04ddb KVM: x86/mmu: Handle guest page faults for guest_memfd with shared memory
Update the KVM MMU fault handler to service guest page faults
for memory slots backed by guest_memfd with mmap support. For such
slots, the MMU must always fault in pages directly from guest_memfd,
bypassing the host's userspace_addr.

This ensures that guest_memfd-backed memory is always handled through
the guest_memfd specific faulting path, regardless of whether it's for
private or non-private (shared) use cases.

Additionally, rename kvm_mmu_faultin_pfn_private() to
kvm_mmu_faultin_pfn_gmem(), as this function is now used to fault in
pages from guest_memfd for both private and non-private memory,
accommodating the new use cases.

Co-developed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
[sean: drop the helper]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20250729225455.670324-17-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27 04:35:01 -04:00
Sean Christopherson
b7d97f69ed KVM: x86/mmu: Extend guest_memfd's max mapping level to shared mappings
Rework kvm_mmu_max_mapping_level() to consult guest_memfd for all mappings,
not just private mappings, so that hugepage support plays nice with the
upcoming support for backing non-private memory with guest_memfd.

In addition to getting the max order from guest_memfd for gmem-only
memslots, update TDX's hook to effectively ignore shared mappings, as TDX's
restrictions on page size only apply to Secure EPT mappings.  Do nothing
for SNP, as RMP restrictions apply to both private and shared memory.

Suggested-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20250729225455.670324-16-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27 04:35:01 -04:00
Sean Christopherson
a3522ac71f KVM: x86/mmu: Enforce guest_memfd's max order when recovering hugepages
Rework kvm_mmu_max_mapping_level() to provide the plumbing to consult
guest_memfd (and relevant vendor code) when recovering hugepages, e.g.
after disabling live migration.  The flaw has existed since guest_memfd was
originally added, but has gone unnoticed due to lack of guest_memfd support
for hugepages or dirty logging.

Don't actually call into guest_memfd at this time, as it's unclear as to
what the API should be.  Ideally, KVM would simply use kvm_gmem_get_pfn(),
but invoking kvm_gmem_get_pfn() would lead to sleeping in atomic context
if guest_memfd needed to allocate memory (mmu_lock is held).  Luckily,
the path isn't actually reachable, so just add a TODO and WARN to ensure
the functionality is added alongisde guest_memfd hugepage support, and
punt the guest_memfd API design question to the future.

Note, calling kvm_mem_is_private() in the non-fault path is safe, so long
as mmu_lock is held, as hugepage recovery operates on shadow-present SPTEs,
i.e. calling kvm_mmu_max_mapping_level() with @fault=NULL is mutually
exclusive with kvm_vm_set_mem_attributes() changing the PRIVATE attribute
of the gfn.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Message-ID: <20250729225455.670324-15-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27 04:35:01 -04:00
Sean Christopherson
1c3fdf1370 KVM: x86/mmu: Hoist guest_memfd max level/order helpers "up" in mmu.c
Move kvm_max_level_for_order() and kvm_max_private_mapping_level() up in
mmu.c so that they can be used by __kvm_mmu_max_mapping_level().

Opportunistically drop the "inline" from kvm_max_level_for_order().

No functional change intended.

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Message-ID: <20250729225455.670324-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27 04:35:00 -04:00
Ackerley Tng
d6c840adfe KVM: x86/mmu: Rename .private_max_mapping_level() to .gmem_max_mapping_level()
Rename kvm_x86_ops.private_max_mapping_level() to .gmem_max_mapping_level()
in anticipation of extending guest_memfd support to non-private memory.

No functional change intended.

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Message-ID: <20250729225455.670324-13-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27 04:35:00 -04:00
Fuad Tabba
923310be23 KVM: Rename kvm_slot_can_be_private() to kvm_slot_has_gmem()
Rename kvm_slot_can_be_private() to kvm_slot_has_gmem() to improve
clarity and accurately reflect its purpose.

The function kvm_slot_can_be_private() was previously used to check if a
given kvm_memory_slot is backed by guest_memfd. However, its name
implied that the memory in such a slot was exclusively "private".

As guest_memfd support expands to include non-private memory (e.g.,
shared host mappings), it's important to remove this association. The
new name, kvm_slot_has_gmem(), states that the slot is backed by
guest_memfd without making assumptions about the memory's privacy
attributes.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Co-developed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27 04:35:00 -04:00
Vipin Sharma
a577509095 KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock
Use MMU read lock to recover TDP MMU NX huge pages.  To prevent
concurrent modification of the list of potential huge pages, iterate over
the list under tdp_mmu_pages_lock protection and unaccount the page
before dropping the lock.

Zapping under MMU read lock unblocks vCPUs which are waiting for MMU
read lock, which solves a guest jitter issue on Windows VMs which were
observing an increase in network latency.

Do not zap an SPTE if:
- The SPTE is a root page.
- The SPTE does not point at the SP's page table.

If the SPTE does not point at the SP's page table, then something else
has change the SPTE, so KVM cannot safely zap it.

Warn if zapping SPTE fails and current SPTE is still pointing to same
page table, as it should be impossible for the CMPXCHG to fail due to all
other write scenarios being mutually exclusive.

There is always a race between dirty logging, vCPU faults, and NX huge
page recovery for backing a gfn by an NX huge page or an executable
small page.  Unaccounting sooner during the list traversal increases the
window of that race, but functionally, it is okay.  Accounting doesn't
protect against iTLB multi-hit bug, it is there purely to prevent KVM
from bouncing a gfn between two page sizes. The only  downside is that a
vCPU will end up doing more work in tearing down all  the child SPTEs.
This should be a very rare race.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vipin Sharma <vipinsh@google.com>
Co-developed-by: James Houghton <jthoughton@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
Link: https://lore.kernel.org/r/20250707224720.4016504-4-jthoughton@google.com
[sean: clean up kvm_mmu_sp_dirty_logging_enabled() and the changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-19 07:40:41 -07:00
Vipin Sharma
6210556422 KVM: x86/mmu: Rename kvm_tdp_mmu_zap_sp() to better indicate its purpose
kvm_tdp_mmu_zap_sp() is only used for NX huge page recovery, so rename
it to kvm_tdp_mmu_zap_possible_nx_huge_page(). In a future commit, this
function will be changed to include logic specific to NX huge page
recovery.

Signed-off-by: Vipin Sharma <vipinsh@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
https://lore.kernel.org/r/20250707224720.4016504-3-jthoughton@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-19 07:40:25 -07:00
Vipin Sharma
6777885605 KVM: x86/mmu: Track possible NX huge pages separately for TDP vs. Shadow MMU
Track possible NX huge pages for the TDP MMU separately from Shadow MMUs
in anticipation of doing recovery for the TDP MMU while holding mmu_lock
for read instead of write.

Use a small structure to hold the list of pages along with the number of
pages/entries in the list, as relying on kvm->stat.nx_lpage_splits to
calculating the number of pages to recover would result in over-zapping
when both TDP and Shadow MMUs are active.

Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: David Matlack <dmatlack@google.com>
Signed-off-by: Vipin Sharma <vipinsh@google.com>
Co-developed-by: James Houghton <jthoughton@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
Link: https://lore.kernel.org/r/20250707224720.4016504-2-jthoughton@google.com
[sean: rewrite changelog, use #ifdef instead of dummy KVM_TDP_MMU #define]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-19 07:39:10 -07:00
Paolo Bonzini
d7f4aac280 Merge tag 'kvm-x86-mmu-6.17' of https://github.com/kvm-x86/linux into HEAD
KVM x86 MMU changes for 6.17

 - Exempt nested EPT from the the !USER + CR0.WP logic, as EPT doesn't interact
   with CR0.WP.

 - Move the TDX hardware setup code to tdx.c to better co-locate TDX code
   and eliminate a few global symbols.

 - Dynamically allocation the shadow MMU's hashed page list, and defer
   allocating the hashed list until it's actually needed (the TDP MMU doesn't
   use the list).
2025-07-29 08:36:43 -04:00
Sean Christopherson
83ebe71574 KVM: VMX: Apply MMIO Stale Data mitigation if KVM maps MMIO into the guest
Enforce the MMIO State Data mitigation if KVM has ever mapped host MMIO
into the VM, not if the VM has an assigned device.  VFIO is but one of
many ways to map host MMIO into a KVM guest, and even within VFIO,
formally attaching a device to a VM via KVM_DEV_VFIO_FILE_ADD is entirely
optional.

Track whether or not the guest can access host MMIO on a per-MMU basis,
i.e. based on whether or not the vCPU has a mapping to host MMIO.  For
simplicity, track MMIO mappings in "special" rools (those without a
kvm_mmu_page) at the VM level, as only Intel CPUs are vulnerable, and so
only legacy 32-bit shadow paging is affected, i.e. lack of precise
tracking is a complete non-issue.

Make the per-MMU and per-VM flags sticky.  Detecting when *all* MMIO
mappings have been removed would be absurdly complex.  And in practice,
removing MMIO from a guest will be done by deleting the associated memslot,
which by default will force KVM to re-allocate all roots.  Special roots
will forever be mitigated, but as above, the affected scenarios are not
expected to be performance sensitive.

Use a VMX_RUN flag to communicate the need for a buffers flush to
vmx_vcpu_enter_exit() so that kvm_vcpu_can_access_host_mmio() and all its
dependencies don't need to be marked __always_inline, e.g. so that KASAN
doesn't trigger a noinstr violation.

Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Fixes: 8cb861e9e3 ("x86/speculation/mmio: Add mitigation for Processor MMIO Stale Data")
Tested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Link: https://lore.kernel.org/r/20250523011756.3243624-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-25 08:42:51 -07:00
Sean Christopherson
ffe9d7966d KVM: x86/mmu: Locally cache whether a PFN is host MMIO when making a SPTE
When making a SPTE, cache whether or not the target PFN is host MMIO in
order to avoid multiple rounds of the slow path of kvm_is_mmio_pfn(), e.g.
hitting pat_pfn_immune_to_uc_mtrr() in particular can be problematic.  KVM
currently avoids multiple calls by virtue of the two users being mutually
exclusive (.get_mt_mask() is Intel-only, shadow_me_value is AMD-only), but
that won't hold true if/when KVM needs to detect host MMIO mappings for
other reasons, e.g. for mitigating the MMIO Stale Data vulnerability.

No functional change intended.

Tested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Link: https://lore.kernel.org/r/20250523011756.3243624-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24 13:29:31 -07:00
Sean Christopherson
c126b46e6f KVM: x86: Avoid calling kvm_is_mmio_pfn() when kvm_x86_ops.get_mt_mask is NULL
Guard the call to kvm_x86_call(get_mt_mask) with an explicit check on
kvm_x86_ops.get_mt_mask so as to avoid unnecessarily calling
kvm_is_mmio_pfn(), which is moderately expensive for some backing types.
E.g. lookup_memtype() conditionally takes a system-wide spinlock if KVM
ends up being call pat_pfn_immune_to_uc_mtrr(), e.g. for DAX memory.

While the call to kvm_x86_ops.get_mt_mask() itself is elided, the compiler
still needs to compute all parameters, as it can't know at build time that
the call will be squashed.

   <+243>:   call   0xffffffff812ad880 <kvm_is_mmio_pfn>
   <+248>:   mov    %r13,%rsi
   <+251>:   mov    %rbx,%rdi
   <+254>:   movzbl %al,%edx
   <+257>:   call   0xffffffff81c26af0 <__SCT__kvm_x86_get_mt_mask>

Fixes: 3fee4837ef ("KVM: x86: remove shadow_memtype_mask")
Tested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Link: https://lore.kernel.org/r/20250523011756.3243624-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24 13:29:31 -07:00
Sean Christopherson
9c4fe6d150 KVM: x86/mmu: Defer allocation of shadow MMU's hashed page list
When the TDP MMU is enabled, i.e. when the shadow MMU isn't used until a
nested TDP VM is run, defer allocation of the array of hashed lists used
to track shadow MMU pages until the first shadow root is allocated.

Setting the list outside of mmu_lock is safe, as concurrent readers must
hold mmu_lock in some capacity, shadow pages can only be added (or removed)
from the list when mmu_lock is held for write, and tasks that are creating
a shadow root are serialized by slots_arch_lock.  I.e. it's impossible for
the list to become non-empty until all readers go away, and so readers are
guaranteed to see an empty list even if they make multiple calls to
kvm_get_mmu_page_hash() in a single mmu_lock critical section.

Use smp_store_release() and smp_load_acquire() to access the hash table
pointer to ensure the stores to zero the lists are retired before readers
start to walk the list.  E.g. if the compiler hoisted the store before the
zeroing of memory, for_each_gfn_valid_sp_with_gptes() could consume stale
kernel data.

Cc: James Houghton <jthoughton@google.com>
Link: https://lore.kernel.org/r/20250523001138.3182794-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24 12:51:07 -07:00
Sean Christopherson
039ef33e2f KVM: x86/mmu: Dynamically allocate shadow MMU's hashed page list
Dynamically allocate the (massive) array of hashed lists used to track
shadow pages, as the array itself is 32KiB, i.e. is an order-3 allocation
all on its own, and is *exactly* an order-3 allocation.  Dynamically
allocating the array will allow allocating "struct kvm" using kvmalloc(),
and will also allow deferring allocation of the array until it's actually
needed, i.e. until the first shadow root is allocated.

Opportunistically use kvmalloc() for the hashed lists, as an order-3
allocation is (stating the obvious) less likely to fail than an order-4
allocation, and the overhead of vmalloc() is undesirable given that the
size of the allocation is fixed.

Cc: Vipin Sharma <vipinsh@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250523001138.3182794-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24 12:50:34 -07:00
Sean Christopherson
ffced89220 KVM: x86/mmu: Exempt nested EPT page tables from !USER, CR0.WP=0 logic
Exempt nested EPT shadow pages tables from the CR0.WP=0 handling of
supervisor writes, as EPT doesn't have a U/S bit and isn't affected by
CR0.WP (or CR4.SMEP in the exception to the exception).

Opportunistically refresh the comment to explain what KVM is doing, as
the only record of why KVM shoves in WRITE and drops USER is buried in
years-old changelogs.

Cc: Jon Kohler <jon@nutanix.com>
Cc: Sergey Dyasli <sergey.dyasli@nutanix.com>
Reviewed-by: Jon Kohler <jon@nutanix.com>
Reviewed-by: Sergey Dyasli <sergey.dyasli@nutanix.com>
Link: https://lore.kernel.org/r/20250602234851.54573-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:08:22 -07:00