Commit Graph

2062 Commits

Author SHA1 Message Date
Peter Zijlstra
0701c9e17b x86/kvm/vmx: Move IRQ/NMI dispatch from KVM into x86 core
Move the VMX interrupt dispatch magic into the x86 core code. This
isolates KVM from the FRED/IDT decisions and reduces the amount of
EXPORT_SYMBOL_FOR_KVM().

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: "Verma, Vishal L" <vishal.l.verma@intel.com>
Tested-by: Zhao Liu <zhao1.liu@intel.com>
Tested-by: Zhao Liu <zhao1.liu@intel.com>
Tested-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Binbin Wu <binbin.wu@linxu.intel.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://patch.msgid.link/20260508091829.GO3126523@noisy.programming.kicks-ass.net
2026-05-19 20:25:51 +02:00
Paolo Bonzini
2d5d3fc593 KVM: VMX: introduce module parameter to disable CET
There have been reports of host hangs caused by CET virtualization.
Until these are analyzed further, introduce a module parameter that
makes it possible to easily disable it.

Link: https://lore.kernel.org/all/85548beb-1486-40f9-beb4-632c78e3360b@proxmox.com/
Cc: David Riley <d.riley@proxmox.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-05-13 15:38:22 +02:00
Sean Christopherson
0aec99f9bf KVM: x86: Fix misleading variable names and add more comments for PIR=>IRR flow
Rename kvm_apic_update_irr()'s "irr_updated" and vmx_sync_pir_to_irr()'s
"got_posted_interrupt" to a more accurate "max_irr_is_from_pir", as neither
"irr_updated" nor "got_posted_interrupt" is accurate.
__kvm_apic_update_irr() and thus kvm_apic_update_irr() specifically return
true if and only if the highest priority IRQ, i.e. max_irr, is a "new"
pending IRQ from the PIR.  I.e. it's possible for the IRR to be updated,
i.e. for a posted IRQ to be "got", *without* the APIs returning true.

Expand vmx_sync_pir_to_irr()'s comment to explain why it's necessary to
set KVM_REQ_EVENT only if a "new" IRQ was found, and to explain why it's
safe to do so only if a new IRQ is also the highest priority pending IRQ.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://patch.msgid.link/20260503201703.108231-3-pbonzini@redhat.com/
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-05-03 22:32:41 +02:00
Linus Torvalds
01f492e181 Arm:
- Add support for tracing in the standalone EL2 hypervisor code, which
   should help both debugging and performance analysis.  This uses the
   new infrastructure for 'remote' trace buffers that can be exposed
   by non-kernel entities such as firmware, and which came through the
   tracing tree.
 
 - Add support for GICv5 Per Processor Interrupts (PPIs), as the starting
   point for supporting the new GIC architecture in KVM.
 
 - Finally add support for pKVM protected guests, where pages are unmapped
   from the host as they are faulted into the guest and can be shared back
   from the guest using pKVM hypercalls.  Protected guests are created
   using a new machine type identifier.  As the elusive guestmem has not
   yet delivered on its promises, anonymous memory is also supported.
 
   This is only a first step towards full isolation from the host; for
   example, the CPU register state and DMA accesses are not yet isolated.
   Because this does not really yet bring fully what it promises, it is
   hidden behind CONFIG_ARM_PKVM_GUEST + 'kvm-arm.mode=protected', and
   also triggers TAINT_USER when a VM is created.  Caveat emptor.
 
 - Rework the dreaded user_mem_abort() function to make it more
   maintainable, reducing the amount of state being exposed to the
   various helpers and rendering a substantial amount of state immutable.
 
 - Expand the Stage-2 page table dumper to support NV shadow page tables
   on a per-VM basis.
 
 - Tidy up the pKVM PSCI proxy code to be slightly less hard to follow.
 
 - Fix both SPE and TRBE in non-VHE configurations so that they do not
   generate spurious, out of context table walks that ultimately lead
   to very bad HW lockups.
 
 - A small set of patches fixing the Stage-2 MMU freeing in error cases.
 
 - Tighten-up accepted SMC immediate value to be only #0 for host
   SMCCC calls.
 
 - The usual cleanups and other selftest churn.
 
 LoongArch:
 
 - Use CSR_CRMD_PLV for kvm_arch_vcpu_in_kernel().
 
 - Add DMSINTC irqchip in kernel support.
 
 RISC-V:
 
 - Fix steal time shared memory alignment checks
 
 - Fix vector context allocation leak
 
 - Fix array out-of-bounds in pmu_ctr_read() and pmu_fw_ctr_read_hi()
 
 - Fix double-free of sdata in kvm_pmu_clear_snapshot_area()
 
 - Fix integer overflow in kvm_pmu_validate_counter_mask()
 
 - Fix shift-out-of-bounds in make_xfence_request()
 
 - Fix lost write protection on huge pages during dirty logging
 
 - Split huge pages during fault handling for dirty logging
 
 - Skip CSR restore if VCPU is reloaded on the same core
 
 - Implement kvm_arch_has_default_irqchip() for KVM selftests
 
 - Factored-out ISA checks into separate sources
 
 - Added hideleg to struct kvm_vcpu_config
 
 - Factored-out VCPU config into separate sources
 
 - Support configuration of per-VM HGATP mode from KVM user space
 
 s390:
 
 - Support for ESA (31-bit) guests inside nested hypervisors.
 
 - Remove restriction on memslot alignment, which is not needed anymore with
   the new gmap code.
 
 - Fix LPSW/E to update the bear (which of course is the breaking event
   address register).
 
 x86:
 
 - Shut up various UBSAN warnings on reading module parameter before they
   were initialized.
 
 - Don't zero-allocate page tables that are used for splitting hugepages in
   the TDP MMU, as KVM is guaranteed to set all SPTEs in the page table and
   thus write all bytes.
 
 - As an optimization, bail early when trying to unsync 4KiB mappings if the
   target gfn can just be mapped with a 2MiB hugepage.
 
 x86 generic:
 
 - Copy single-chunk MMIO write values into struct kvm_vcpu (more precisely
   struct kvm_mmio_fragment) to fix use-after-free stack bugs where KVM
   would dereference stack pointer after an exit to userspace.
 
 - Clean up and comment the emulated MMIO code to try to make it easier to
   maintain (not necessarily "easy", but "easier").
 
 - Move VMXON+VMXOFF and EFER.SVME toggling out of KVM (not *all* of VMX
   and SVM enabling) as it is needed for trusted I/O.
 
 - Advertise support for AVX512 Bit Matrix Multiply (BMM) instructions
 
 - Immediately fail the build if a required #define is missing in one of
   KVM's headers that is included multiple times.
 
 - Reject SET_GUEST_DEBUG with -EBUSY if there's an already injected
   exception, mostly to prevent syzkaller from abusing the uAPI to
   trigger WARNs, but also because it can help prevent userspace from
   unintentionally crashing the VM.
 
 - Exempt SMM from CPUID faulting on Intel, as per the spec.
 
 - Misc hardening and cleanup changes.
 
 x86 (AMD):
 
 - Fix and optimize IRQ window inhibit handling for AVIC; make it per-vCPU
   so that KVM doesn't prematurely re-enable AVIC if multiple
   vCPUs have to-be-injected IRQs.
 
 - Clean up and optimize the OSVW handling, avoiding a bug in which KVM would
   overwrite state when enabling virtualization on multiple CPUs in parallel.
   This should not be a problem because OSVW should usually be the same for
   all CPUs.
 
 - Drop a WARN in KVM_MEMORY_ENCRYPT_REG_REGION where KVM complains about a
   "too large" size based purely on user input.
 
 - Clean up and harden the pinning code for KVM_MEMORY_ENCRYPT_REG_REGION.
 
 - Disallow synchronizing a VMSA of an already-launched/encrypted vCPU, as
   doing so for an SNP guest will crash the host due to an RMP violation
   page fault.
 
 - Overhaul KVM's APIs for detecting SEV+ guests so that VM-scoped queries
   are required to hold kvm->lock, and enforce it by lockdep.  Fix various
   bugs where sev_guest() was not ensured to be stable for the whole
   duration of a function or ioctl.
 
 - Convert a pile of kvm->lock SEV code to guard().
 
 - Play nicer with userspace that does not enable KVM_CAP_EXCEPTION_PAYLOAD,
   for which KVM needs to set CR2 and DR6 as a response to ioctls such as
   KVM_GET_VCPU_EVENTS (even if the payload would end up in EXITINFO2
   rather than CR2, for example).  Only set CR2 and DR6 when consumption of
   the payload is imminent, but on the other hand force delivery of the
   payload in all paths where userspace retrieves CR2 or DR6.
 
 - Use vcpu->arch.cr2 when updating vmcb12's CR2 on nested #VMEXIT instead
   of vmcb02->save.cr2.  The value is out of sync after a save/restore
   or after a #PF is injected into L2.
 
 - Fix a class of nSVM bugs where some fields written by the CPU are not
   synchronized from vmcb02 to cached vmcb12 after VMRUN, and so are not
   up-to-date when saved by KVM_GET_NESTED_STATE.
 
 - Fix a class of bugs where the ordering between KVM_SET_NESTED_STATE and
   KVM_SET_{S}REGS could cause vmcb02 to be incorrectly initialized after
   save+restore.
 
 - Add a variety of missing nSVM consistency checks.
 
 - Fix several bugs where KVM failed to correctly update VMCB fields on
   nested #VMEXIT.
 
 - Fix several bugs where KVM failed to correctly synthesize #UD or #GP for
   SVM-related instructions.
 
 - Add support for save+restore of virtualized LBRs (on SVM).
 
 - Refactor various helpers and macros to improve clarity and (hopefully)
   make the code easier to maintain.
 
 - Aggressively sanitize fields when copying from vmcb12, to guard against
   unintentionally allowing L1 to utilize yet-to-be-defined features.
 
 - Fix several bugs where KVM botched rAX legality checks when emulating SVM
   instructions.  There are remaining issues in that KVM doesn't handle size
   prefix overrides for 64-bit guests.
 
 - Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails instead of
   somewhat arbitrarily synthesizing #GP (i.e. don't double down on AMD's
   architectural but sketchy behavior of generating #GP for "unsupported"
   addresses).
 
 - Cache all used vmcb12 fields to further harden against TOCTOU bugs.
 
 x86 (Intel):
 
 - Drop obsolete branch hint prefixes from the VMX instruction macros.
 
 - Use ASM_INPUT_RM() in __vmcs_writel() to coerce clang into using a
   register input when appropriate.
 
 - Code cleanups.
 
 guest_memfd:
 
 - Don't mark guest_memfd folios as accessed, as guest_memfd doesn't support
   reclaim, the memory is unevictable, and there is no storage to write
   back to.
 
 LoongArch selftests:
 
 - Add KVM PMU test cases
 
 s390 selftests:
 
 - Enable more memory selftests.
 
 x86 selftests:
 
 - Add support for Hygon CPUs in KVM selftests.
 
 - Fix a bug in the MSR test where it would get false failures on AMD/Hygon
   CPUs with exactly one of RDPID or RDTSCP.
 
 - Add an MADV_COLLAPSE testcase for guest_memfd as a regression test for a
   bug where the kernel would attempt to collapse guest_memfd folios against
   KVM's will.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmnftRQUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroPAzwf+NKO4Ktv+7A22ImN0SBl0nlUuulsz
 vTcw3+hxdRoIw83GdNS+hG5js0wrpMDnbv3t4+VliDNBSSxrBzcSWX2wpilW0Xtw
 qGo1MWhs2lKPy1NlaRVOwPS6j7uF3AR0TQ1iQLGMedQuCU9WpiKJxyhNXJdbLrt3
 8EgFzsvtEsv+jKNRUNDf9+d0j4gZsFyIe+Brhianbw+u3/UCiUClLCdsKPc4+5ZX
 08otYXytacGNIf/5Ev1vT4pHkHL0yqKXAtX7LEtaS3+0KrPuLjV4slemivzE9vf5
 Evafm5AhA4wpaNMb1ZerhY3T94lsMaJpWxotjR//0Q7C9B59pCQnXCm8mg==
 =CcE0
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "Arm:

   - Add support for tracing in the standalone EL2 hypervisor code,
     which should help both debugging and performance analysis. This
     uses the new infrastructure for 'remote' trace buffers that can be
     exposed by non-kernel entities such as firmware, and which came
     through the tracing tree

   - Add support for GICv5 Per Processor Interrupts (PPIs), as the
     starting point for supporting the new GIC architecture in KVM

   - Finally add support for pKVM protected guests, where pages are
     unmapped from the host as they are faulted into the guest and can
     be shared back from the guest using pKVM hypercalls. Protected
     guests are created using a new machine type identifier. As the
     elusive guestmem has not yet delivered on its promises, anonymous
     memory is also supported

     This is only a first step towards full isolation from the host; for
     example, the CPU register state and DMA accesses are not yet
     isolated. Because this does not really yet bring fully what it
     promises, it is hidden behind CONFIG_ARM_PKVM_GUEST +
     'kvm-arm.mode=protected', and also triggers TAINT_USER when a VM is
     created. Caveat emptor

   - Rework the dreaded user_mem_abort() function to make it more
     maintainable, reducing the amount of state being exposed to the
     various helpers and rendering a substantial amount of state
     immutable

   - Expand the Stage-2 page table dumper to support NV shadow page
     tables on a per-VM basis

   - Tidy up the pKVM PSCI proxy code to be slightly less hard to
     follow

   - Fix both SPE and TRBE in non-VHE configurations so that they do not
     generate spurious, out of context table walks that ultimately lead
     to very bad HW lockups

   - A small set of patches fixing the Stage-2 MMU freeing in error
     cases

   - Tighten-up accepted SMC immediate value to be only #0 for host
     SMCCC calls

   - The usual cleanups and other selftest churn

  LoongArch:

   - Use CSR_CRMD_PLV for kvm_arch_vcpu_in_kernel()

   - Add DMSINTC irqchip in kernel support

  RISC-V:

   - Fix steal time shared memory alignment checks

   - Fix vector context allocation leak

   - Fix array out-of-bounds in pmu_ctr_read() and pmu_fw_ctr_read_hi()

   - Fix double-free of sdata in kvm_pmu_clear_snapshot_area()

   - Fix integer overflow in kvm_pmu_validate_counter_mask()

   - Fix shift-out-of-bounds in make_xfence_request()

   - Fix lost write protection on huge pages during dirty logging

   - Split huge pages during fault handling for dirty logging

   - Skip CSR restore if VCPU is reloaded on the same core

   - Implement kvm_arch_has_default_irqchip() for KVM selftests

   - Factored-out ISA checks into separate sources

   - Added hideleg to struct kvm_vcpu_config

   - Factored-out VCPU config into separate sources

   - Support configuration of per-VM HGATP mode from KVM user space

  s390:

   - Support for ESA (31-bit) guests inside nested hypervisors

   - Remove restriction on memslot alignment, which is not needed
     anymore with the new gmap code

   - Fix LPSW/E to update the bear (which of course is the breaking
     event address register)

  x86:

   - Shut up various UBSAN warnings on reading module parameter before
     they were initialized

   - Don't zero-allocate page tables that are used for splitting
     hugepages in the TDP MMU, as KVM is guaranteed to set all SPTEs in
     the page table and thus write all bytes

   - As an optimization, bail early when trying to unsync 4KiB mappings
     if the target gfn can just be mapped with a 2MiB hugepage

  x86 generic:

   - Copy single-chunk MMIO write values into struct kvm_vcpu (more
     precisely struct kvm_mmio_fragment) to fix use-after-free stack
     bugs where KVM would dereference stack pointer after an exit to
     userspace

   - Clean up and comment the emulated MMIO code to try to make it
     easier to maintain (not necessarily "easy", but "easier")

   - Move VMXON+VMXOFF and EFER.SVME toggling out of KVM (not *all* of
     VMX and SVM enabling) as it is needed for trusted I/O

   - Advertise support for AVX512 Bit Matrix Multiply (BMM) instructions

   - Immediately fail the build if a required #define is missing in one
     of KVM's headers that is included multiple times

   - Reject SET_GUEST_DEBUG with -EBUSY if there's an already injected
     exception, mostly to prevent syzkaller from abusing the uAPI to
     trigger WARNs, but also because it can help prevent userspace from
     unintentionally crashing the VM

   - Exempt SMM from CPUID faulting on Intel, as per the spec

   - Misc hardening and cleanup changes

  x86 (AMD):

   - Fix and optimize IRQ window inhibit handling for AVIC; make it
     per-vCPU so that KVM doesn't prematurely re-enable AVIC if multiple
     vCPUs have to-be-injected IRQs

   - Clean up and optimize the OSVW handling, avoiding a bug in which
     KVM would overwrite state when enabling virtualization on multiple
     CPUs in parallel. This should not be a problem because OSVW should
     usually be the same for all CPUs

   - Drop a WARN in KVM_MEMORY_ENCRYPT_REG_REGION where KVM complains
     about a "too large" size based purely on user input

   - Clean up and harden the pinning code for KVM_MEMORY_ENCRYPT_REG_REGION

   - Disallow synchronizing a VMSA of an already-launched/encrypted
     vCPU, as doing so for an SNP guest will crash the host due to an
     RMP violation page fault

   - Overhaul KVM's APIs for detecting SEV+ guests so that VM-scoped
     queries are required to hold kvm->lock, and enforce it by lockdep.
     Fix various bugs where sev_guest() was not ensured to be stable for
     the whole duration of a function or ioctl

   - Convert a pile of kvm->lock SEV code to guard()

   - Play nicer with userspace that does not enable
     KVM_CAP_EXCEPTION_PAYLOAD, for which KVM needs to set CR2 and DR6
     as a response to ioctls such as KVM_GET_VCPU_EVENTS (even if the
     payload would end up in EXITINFO2 rather than CR2, for example).
     Only set CR2 and DR6 when consumption of the payload is imminent,
     but on the other hand force delivery of the payload in all paths
     where userspace retrieves CR2 or DR6

   - Use vcpu->arch.cr2 when updating vmcb12's CR2 on nested #VMEXIT
     instead of vmcb02->save.cr2. The value is out of sync after a
     save/restore or after a #PF is injected into L2

   - Fix a class of nSVM bugs where some fields written by the CPU are
     not synchronized from vmcb02 to cached vmcb12 after VMRUN, and so
     are not up-to-date when saved by KVM_GET_NESTED_STATE

   - Fix a class of bugs where the ordering between KVM_SET_NESTED_STATE
     and KVM_SET_{S}REGS could cause vmcb02 to be incorrectly
     initialized after save+restore

   - Add a variety of missing nSVM consistency checks

   - Fix several bugs where KVM failed to correctly update VMCB fields
     on nested #VMEXIT

   - Fix several bugs where KVM failed to correctly synthesize #UD or
     #GP for SVM-related instructions

   - Add support for save+restore of virtualized LBRs (on SVM)

   - Refactor various helpers and macros to improve clarity and
     (hopefully) make the code easier to maintain

   - Aggressively sanitize fields when copying from vmcb12, to guard
     against unintentionally allowing L1 to utilize yet-to-be-defined
     features

   - Fix several bugs where KVM botched rAX legality checks when
     emulating SVM instructions. There are remaining issues in that KVM
     doesn't handle size prefix overrides for 64-bit guests

   - Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails
     instead of somewhat arbitrarily synthesizing #GP (i.e. don't double
     down on AMD's architectural but sketchy behavior of generating #GP
     for "unsupported" addresses)

   - Cache all used vmcb12 fields to further harden against TOCTOU bugs

  x86 (Intel):

   - Drop obsolete branch hint prefixes from the VMX instruction macros

   - Use ASM_INPUT_RM() in __vmcs_writel() to coerce clang into using a
     register input when appropriate

   - Code cleanups

  guest_memfd:

   - Don't mark guest_memfd folios as accessed, as guest_memfd doesn't
     support reclaim, the memory is unevictable, and there is no storage
     to write back to

  LoongArch selftests:

   - Add KVM PMU test cases

  s390 selftests:

   - Enable more memory selftests

  x86 selftests:

   - Add support for Hygon CPUs in KVM selftests

   - Fix a bug in the MSR test where it would get false failures on
     AMD/Hygon CPUs with exactly one of RDPID or RDTSCP

   - Add an MADV_COLLAPSE testcase for guest_memfd as a regression test
     for a bug where the kernel would attempt to collapse guest_memfd
     folios against KVM's will"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (373 commits)
  KVM: x86: use inlines instead of macros for is_sev_*guest
  x86/virt: Treat SVM as unsupported when running as an SEV+ guest
  KVM: SEV: Goto an existing error label if charging misc_cg for an ASID fails
  KVM: SVM: Move lock-protected allocation of SEV ASID into a separate helper
  KVM: SEV: use mutex guard in snp_handle_guest_req()
  KVM: SEV: use mutex guard in sev_mem_enc_unregister_region()
  KVM: SEV: use mutex guard in sev_mem_enc_ioctl()
  KVM: SEV: use mutex guard in snp_launch_update()
  KVM: SEV: Assert that kvm->lock is held when querying SEV+ support
  KVM: SEV: Document that checking for SEV+ guests when reclaiming memory is "safe"
  KVM: SEV: Hide "struct kvm_sev_info" behind CONFIG_KVM_AMD_SEV=y
  KVM: SEV: WARN on unhandled VM type when initializing VM
  KVM: LoongArch: selftests: Add PMU overflow interrupt test
  KVM: LoongArch: selftests: Add basic PMU event counting test
  KVM: LoongArch: selftests: Add cpucfg read/write helpers
  LoongArch: KVM: Add DMSINTC inject msi to vCPU
  LoongArch: KVM: Add DMSINTC device support
  LoongArch: KVM: Make vcpu_is_preempted() as a macro rather than function
  LoongArch: KVM: Move host CSR_GSTAT save and restore in context switch
  LoongArch: KVM: Move host CSR_EENTRY save and restore in context switch
  ...
2026-04-17 07:18:03 -07:00
Linus Torvalds
883af1f8e8 - Print TDX module version during boot
- Make TDX attribute naming consistent
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmndRYkACgkQaDWVMHDJ
 krDp3Q/+LyUUBGw250JkI4DyVNJ6jS0DBa80QKdRVsiIi0u4L14EguNN73yV4TyG
 PwyoE8e1ecvdQTfCdm9Wt9FiIowlptBCu2DOGsO4AYAnGOuDIbvNAYZ+HDOzM4+2
 g5QaGlbSw7bx/zFOhykOH58FWXz9tQ1TyyOw0aICPI83XMqE8xrMHKrGFIyOT9g1
 hqPpegVmpMIjbTdYzb3pH1Be2Z4ymyOeAlzH0p+TIp9reX3qFLsOoD9PQxsDbekO
 AILNXZiu1dAg8FS5xqNIEZyoR+liYmcTQ/PhaGKvXV7ml8flylGvnz4FXijWzUvE
 kKtaMjA3JPJ/G8rcdR75T5CrGcTXXEOgl4GCbRUiEF55ocF4ds/DgABVV+YuoosO
 g+bFihYfeOJGkETQnBCjYjJ+NGcuBO6bpD5TqfN1vCsuemoRobNXKlSTS7PihynB
 4GP98ELaLXmZzj8KzME1ysLgRj4mJDEGBtTfRm4iCLRB4Drx8Wv4ZwPMCDaMtWkQ
 T/+wRLd9q/WOdUCmXK+iawQ3LDxMQitIZguHHBNjKk2WfFYoxs0HoRV9OkgdjEVK
 zDzzCkx3945UnoyKX5h/lxPC1wyp9r54+4zsIY2jKTMNuVl773DYDMn5P1aKGxCQ
 YPrHlJrmmlSFuWzi9/Vx2dcZxNWoZ4YPfMrgrzP5P3EKwxUzX9w=
 =x+Yc
 -----END PGP SIGNATURE-----

Merge tag 'x86_tdx_for_7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 TDX updates from Dave Hansen:
 "The only real thing of note here is printing the TDX module version.

  This is a little silly on its own, but the upcoming TDX module update
  code needs the same TDX module call. This shrinks that set a wee bit.

  There's also few minor macro cleanups and a tweak to the GetQuote ABI
  to make it easier for userspace to detect zero-length (failed) quotes"

* tag 'x86_tdx_for_7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  virt: tdx-guest: Return error for GetQuote failures
  KVM/TDX: Rename KVM_SUPPORTED_TD_ATTRS to KVM_SUPPORTED_TDX_TD_ATTRS
  x86/tdx: Rename TDX_ATTR_* to TDX_TD_ATTR_*
  KVM/TDX: Remove redundant definitions of TDX_TD_ATTR_*
  x86/tdx: Fix the typo in TDX_ATTR_MIGRTABLE
  x86/virt/tdx: Print TDX module version during init
  x86/virt/tdx: Retrieve TDX module version
2026-04-14 14:42:55 -07:00
Paolo Bonzini
4a530993da KVM x86 VMXON and EFER.SVME extraction for 7.1
Move _only_ VMXON+VMXOFF and EFER.SVME toggling out of KVM (versus all of VMX
 and SVM enabling) out of KVM and into the core kernel so that non-KVM TDX
 enabling, e.g. for trusted I/O, can make SEAMCALLs without needing to ensure
 KVM is fully loaded.
 
 TDX isn't a hypervisor, and isn't trying to be a hypervisor. Specifically, TDX
 should _never_ have it's own VMCSes (that are visible to the host; the
 TDX-Module has it's own VMCSes to do SEAMCALL/SEAMRET), and so there is simply
 no reason to move that functionality out of KVM.
 
 With that out of the way, dealing with VMXON/VMXOFF and EFER.SVME is a fairly
 simple refcounting game.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmnZJkYACgkQOlYIJqCj
 N/21chAAjg9tb/E8+vqBZDT5vO9Bu6c333irV2vqBBJZWUx6xKhtk77kL6kISWyf
 aI57hJ5IwbUkfDcomSY+MyRXxw/X4OioSs5qqvcC2XHatGA8XwifJE47cN5ZT0+D
 hzZjru8Z9VGHf5wUXS41yTHtm+INiEYMgJiseUQR6sbWx3H+zDcLIooNQx/ZLYrV
 vR+VPtaMYpJ0TTDDqb8PrCnjgXoXFenAnzAj9bAikWP60kaDXrxN9KPc5woDo29+
 TrkTyr2mmQvKpNhLCDwAMNa9bXxgzkHEGx8J2WZTbUi9ZBv4MwVsnGLLsaUKQlaa
 4V1JDiICzYptjMzU+ka4iTF+m0KEz4EykP7mVVK+5MAHc0NOUVfDW6JP2PM/66dh
 NyyjGhbrfH0PwqzDn4N2h0MmWT4YNCIxESClecEMtEzsCyWfYOMitxbDbzHnu9Vw
 a/C0pwWKJ34Trr0O79SevAWJBlu596mya0YvMeCAWxCvSUGknbo5IXdrmtp6htGp
 Gz5+0ZyvVRbYpwxS+OOpWMkZuPvvEcWTbMAG/scbSHh80P/uCVyuLsRZR2HSB8EV
 tYnnLDDDQ1KmLV7xmw5XnkN9hFffAM8eXA7KX9TPjCXjd25lCJGgquQEH0oAHe5q
 1qXf+lWttP7MIbD5/Ga5CO+FqXAE6xmFRWjEBgLx32kSAWXqxPs=
 =SuxR
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-vmxon-7.1' of https://github.com/kvm-x86/linux into HEAD

KVM x86 VMXON and EFER.SVME extraction for 7.1

Move _only_ VMXON+VMXOFF and EFER.SVME toggling out of KVM (versus all of VMX
and SVM enabling) out of KVM and into the core kernel so that non-KVM TDX
enabling, e.g. for trusted I/O, can make SEAMCALLs without needing to ensure
KVM is fully loaded.

TIO isn't a hypervisor, and isn't trying to be a hypervisor. Specifically, TIO
should _never_ have it's own VMCSes (that are visible to the host; the
TDX-Module has it's own VMCSes to do SEAMCALL/SEAMRET), and so there is simply
no reason to move that functionality out of KVM.

With that out of the way, dealing with VMXON/VMXOFF and EFER.SVME is a fairly
simple refcounting game.
2026-04-13 13:04:48 +02:00
Paolo Bonzini
ea8bc95fbb KVM nested SVM changes for 7.1 (with one common x86 fix)
- To minimize the probability of corrupting guest state, defer KVM's
    non-architectural delivery of exception payloads (e.g. CR2 and DR6) until
    consumption of the payload is imminent, and force delivery of the payload
    in all paths where userspace saves relevant state.
 
  - Use vcpu->arch.cr2 when updating vmcb12's CR2 on nested #VMEXIT to fix a
    bug where L2's CR2 can get corrupted after a save/restore, e.g. if the VM
    is migrated while L2 is faulting in memory.
 
  - Fix a class of nSVM bugs where some fields written by the CPU are not
    synchronized from vmcb02 to cached vmcb12 after VMRUN, and so are not
    up-to-date when saved by KVM_GET_NESTED_STATE.
 
  - Fix a class of bugs where the ordering between KVM_SET_NESTED_STATE and
    KVM_SET_{S}REGS could cause vmcb02 to be incorrectly initialized after
    save+restore.
 
  - Add a variety of missing nSVM consistency checks.
 
  - Fix several bugs where KVM failed to correctly update VMCB fields on nested
    #VMEXIT.
 
  - Fix several bugs where KVM failed to correctly synthesize #UD or #GP for
    SVM-related instructions.
 
  - Add support for save+restore of virtualized LBRs (on SVM).
 
  - Refactor various helpers and macros to improve clarity and (hopefully) make
    the code easier to maintain.
 
  - Aggressively sanitize fields when copying from vmcb12 to guard against
    unintentionally allowing L1 to utilize yet-to-be-defined features.
 
  - Fix several bugs where KVM botched rAX legality checks when emulating SVM
    instructions.  Note, KVM is still flawed in that KVM doesn't address size
    prefix overrides for 64-bit guests; this should probably be documented as a
    KVM erratum.
 
  - Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails instead of
    somewhat arbitrarily synthesizing #GP (i.e. don't bastardize AMD's already-
    sketchy behavior of generating #GP if for "unsupported" addresses).
 
  - Cache all used vmcb12 fields to further harden against TOCTOU bugs.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmnZfbwACgkQOlYIJqCj
 N/0pVRAAkys8LLtIekQtEVkaX3EPaXk0lGGmnzXbihgHFsS5lMAS4tcsr7oyk4TI
 rvJUGmkaTKTboQdTaCq0G7lwCu5hMuXsZ10WvmKfivMFxy3kSppqfffux5zVXng2
 U/8oyJSorkX1WPC7d5QAZYMqqcSwQaR+a0FxowghGWBXMRHylerSuH00CiGr6Ron
 QQbZaKBNtkYwYFNos2tLuT4tueyFogk8FPAmdejEQ9CMxUjeAivlKm8JVXaDvGik
 lyPYbJJLukjuxSYGYmeRyGLLwK7VBGkFHQp/KBYSBgzGdweabhsQa1Z0CGm24+w1
 q626W0sxsq97dZ0cd7oE6Cw+AdlMBK+mjpxB9gX4uLGyYlnFkdJV7OSlHVTR9d96
 cqKduT0JvlBnVb7Yd5jyaGVl1YD62p0nwcrTuWidR5IJ16b4mYwwPzvkkQKHLt64
 VAhH8lBVtATtblI9gfsbwGezV74xXnuLb0L1G7xeh1VIWu7pubFdqyRwIA+qiXQa
 OkyxzoDlFl+QF2Uf3cBCFMojBOrSZRiGiLzIkUnjBsN4N2uOPYTsQEfr9BXVVcv7
 obT9xl/wUwry2fAJhUL+IBCDE42+8C62UaWT5KJHQLttBL7Mm06e75hFN5ObbE/x
 nExL+NmAcsSUUbbdojjnD0KWxYKkosNiONBVrjqqXdmBjmzzOvI=
 =ys7N
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-nested-7.1' of https://github.com/kvm-x86/linux into HEAD

KVM nested SVM changes for 7.1 (with one common x86 fix)

 - To minimize the probability of corrupting guest state, defer KVM's
   non-architectural delivery of exception payloads (e.g. CR2 and DR6) until
   consumption of the payload is imminent, and force delivery of the payload
   in all paths where userspace saves relevant state.

 - Use vcpu->arch.cr2 when updating vmcb12's CR2 on nested #VMEXIT to fix a
   bug where L2's CR2 can get corrupted after a save/restore, e.g. if the VM
   is migrated while L2 is faulting in memory.

 - Fix a class of nSVM bugs where some fields written by the CPU are not
   synchronized from vmcb02 to cached vmcb12 after VMRUN, and so are not
   up-to-date when saved by KVM_GET_NESTED_STATE.

 - Fix a class of bugs where the ordering between KVM_SET_NESTED_STATE and
   KVM_SET_{S}REGS could cause vmcb02 to be incorrectly initialized after
   save+restore.

 - Add a variety of missing nSVM consistency checks.

 - Fix several bugs where KVM failed to correctly update VMCB fields on nested
   #VMEXIT.

 - Fix several bugs where KVM failed to correctly synthesize #UD or #GP for
   SVM-related instructions.

 - Add support for save+restore of virtualized LBRs (on SVM).

 - Refactor various helpers and macros to improve clarity and (hopefully) make
   the code easier to maintain.

 - Aggressively sanitize fields when copying from vmcb12 to guard against
   unintentionally allowing L1 to utilize yet-to-be-defined features.

 - Fix several bugs where KVM botched rAX legality checks when emulating SVM
   instructions.  Note, KVM is still flawed in that KVM doesn't address size
   prefix overrides for 64-bit guests; this should probably be documented as a
   KVM erratum.

 - Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails instead of
   somewhat arbitrarily synthesizing #GP (i.e. don't bastardize AMD's already-
   sketchy behavior of generating #GP if for "unsupported" addresses).

 - Cache all used vmcb12 fields to further harden against TOCTOU bugs.
2026-04-13 13:01:50 +02:00
Paolo Bonzini
7e7a6e2ad2 KVM VMX changes for 7.1
- Drop obsolete (largely ignored by hardwre) branch hint prefixes from the
    VMX instruction macros, as saving a byte of code per instruction provides
    more benefits than the (mostly) superfluous prefixes.
 
  - Use ASM_INPUT_RM() in __vmcs_writel() to coerce clang into using a register
    input when appropriate.
 
  - Drop unnecessary parentheses in cpu_has_load_cet_ctrl() so as not to suggest
    that "return (x & y);" is KVM's preferred style.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmnZJ1gACgkQOlYIJqCj
 N/1F1w/9F43oaEVn7EDmssy/YWDfP6gcApCMDVm/+n6Z0+zBdsMNpZ8G3OQlbITT
 nfpfA6GqvmW9xsdFsPuM2e6WXSk1dZoDevifzK2g0fe5ZtGfB8xIsovEOvZJTQzo
 MI8UfJZG9xfnA0WWYcCR9EOjhB9fLyHgnG/7kXrYPao1NAtVo1ADskmZ+FIYZAST
 dEMF3Wf6u1x6rjaUyhn0vMIDiIoRw5OhMBWYDYVHPIFfbHjxm4iXJjWmO981xis9
 fIseUcnGM0OUrFv+dBun7zYqIqj3aUcaI+Bq3/eiSm/pKi9MWBaljgOCjLjj5dZc
 07DhFtF0IAUMJIZmq2N/xbxfaaOWdajbw7Wm/ppkvPJg1efz8gB9aKW4wjGAkyRx
 aNeYFq5VUGlKNp1aXS5TWeYIeAAbc1kqIReRdwFXK/gqXfynQCuH4Z7kBJvWsFTY
 GA3GgK3l7A7Qk9A/A8PQvoj8uCH3DTv1N+qeLX9Hdk+NgmKDUthj/0rhXXVv/eAF
 2auMBtf1JcUjRGwvtu+945dAIFE2ZZ/lTNMYbd19mDiFHQ3yBj9HM3N+Q/+DexWd
 OXH3xNralJ4IHlJZz7t/N1CuIujDKDdRAwBXuWq4bOBSFGYExjILV5JF0DolVNr/
 VSe4+tKkz3jM+HL04edPSe0k8rGCDzL/z1wS4/2vEqHLjSrucbM=
 =8+qL
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-vmx-7.1' of https://github.com/kvm-x86/linux into HEAD

KVM VMX changes for 7.1

 - Drop obsolete (largely ignored by hardwre) branch hint prefixes from the
   VMX instruction macros, as saving a byte of code per instruction provides
   more benefits than the (mostly) superfluous prefixes.

 - Use ASM_INPUT_RM() in __vmcs_writel() to coerce clang into using a register
   input when appropriate.

 - Drop unnecessary parentheses in cpu_has_load_cet_ctrl() so as not to suggest
   that "return (x & y);" is KVM's preferred style.
2026-04-13 12:49:36 +02:00
Paolo Bonzini
aa856775be KVM x86 emulated MMIO changes for 7.1
Copy single-chunk MMIO write values into a persistent (per-fragment) field to
 fix use-after-free stack bugs due to KVM dereferencing a stack pointer after an
 exit to userspace.
 
 Clean up and comment the emulated MMIO code to try to make it easier to
 maintain (not necessarily "easy", but "easier").
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmnZJAkACgkQOlYIJqCj
 N/2TUxAAnmOHpQ0hKjDvkMXTLOUJEitegRn8CDd2NWa6TEnoCFmHu5sXAMaO0V7v
 EK2NH0uwT43Zr55BlXTxjEaQOby16tKjqK4FlG1Zb8UI6e9Lzhk6zIXMI7NTpV41
 HMF73cQyHIzeS9ymf0fdVo3nlnvTXBHVCifyJwmb2RARl+LqmTimHb9+9piuIxJB
 /v473RoFCNxI7Rwl6Pp5sjl7lWTIDUQJSi3+1gMaowTtnsUyCPTLejwj7/b9NWF3
 i+nTNwwpHFvgfTE3decMyKupY9aXUM9FU/AFf+eUbvzjR/Dx/7o31Cpz4NCHQ38c
 4TCFKkOQ2r+e8s7ATBeRKdOXdP7d7DW3qasLfPVjzEDxuifmW+awDRYBZwNM/Ybv
 jDCBO6gbtw/f+oJPKq9oivqBSu+Z6vR7NnPmk1vh7VocsZdAlbCwdBU2+N5DBkfh
 LJ+nOxzNL1q1A8X61CZFffEooh971Mg7ztHV5IDviK5/Fop0NfQBjxdxoz+wttzp
 ufwY1WUMHjImRfZTk3e9icJwarqEI39QRmjKuaUJxEXjbJCbtvfKJ0lrNn7RRNPf
 aqi9M2z6UvVrQi/Vw5rRTx7fYr091QIOHDdGBVl7atcGyU4gvdpXdkBS6xqafgD6
 S/QzWypU868iMgLWqNpNeKtRLPQsuwTywC5vx57Sm0yPlPbM8H0=
 =Qyhh
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-mmio-7.1' of https://github.com/kvm-x86/linux into HEAD

KVM x86 emulated MMIO changes for 7.1

Copy single-chunk MMIO write values into a persistent (per-fragment) field to
fix use-after-free stack bugs due to KVM dereferencing a stack pointer after an
exit to userspace.

Clean up and comment the emulated MMIO code to try to make it easier to
maintain (not necessarily "easy", but "easier").
2026-04-13 12:49:14 +02:00
Paolo Bonzini
276f81a491 KVM x86 misc changes for 7.1
- Advertise support for AVX512 Bit Matrix Multiply (BMM) when it's present in
    hardware (no additional emulation/virtualization required).
 
  - Immediately fail the build if a required #define is missing in one of KVM's
    headers that is included multiple times.
 
  - Reject SET_GUEST_DEBUG with -EBUSY if there's an already injected exception,
    mostly to prevent syzkaller from abusing the uAPI to trigger WARNs, but also
    because it can help prevent userspace from unintentionally crashing the VM.
 
  - Exempt SMM from CPUID faulting on Intel, as per the spec.
 
  - Misc hardening and cleanup changes.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmnZIt8ACgkQOlYIJqCj
 N/2HqA/8CwoMlaK4nPDp39JI1+avlKaBkrwfF5/mku6uZcrq9WeyflH+t4wc7JE0
 lRXQO5PPNideYrjEqLsdn9OWIar+ZsYGrsEO5/MFc4Z67kPkai67m7nUT46APU4Q
 fE/3KpT3afaHcM6+zpIIF/lMmQJVco+7EQrlexSM9LZTap6uxNRvMC3B/czF7/li
 UsEJH37vluXxuCPUXAE61IPHtF++eDf4x6w0nIJ+7UJSUZk8JJYWMvJ5lPIxRTGG
 Pvql2v7hDQ9h2ISIDr+e85wpIpIkbc7hKZMtlib36PB1Dm7gOeKgosFHIwNLnJoJ
 pxuzsqYShXBHsmsYgzmfYlVUcWFF1f02yC4XfoQ735LNnBbX6bm5nuSmPQBmvg4O
 +URUKjo4DLjzzs44RrRsBsBVuZTMbe0Ht2qLmGrWrB9+vr1PxQVNFpLA0MCDCFx7
 skJTo6raJQkLJmmoKUslehiJFTvzOrOJy8JhWhiznkJNSS5jWFbaFf7nEoMCYIl0
 ttzeISQDgzHAvT6V29CO4+zttexF4QVVRwFwG3aI8zGJ3WJhjrNyazVLrvrzWfhA
 ygNwV0BCEbBclMpBRF4jRLGMibnsTeEsBTiMARgJ0ZL7RPUYeQidVzP/JwPKbod0
 DHqqtOXXngl7OsHdfdd74ThKaQb6EzlDFyI5aoYInPCXH/LhE98=
 =ZvDQ
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-misc-7.1' of https://github.com/kvm-x86/linux into HEAD

KVM x86 misc changes for 7.1

 - Advertise support for AVX512 Bit Matrix Multiply (BMM) when it's present in
   hardware (no additional emulation/virtualization required).

 - Immediately fail the build if a required #define is missing in one of KVM's
   headers that is included multiple times.

 - Reject SET_GUEST_DEBUG with -EBUSY if there's an already injected exception,
   mostly to prevent syzkaller from abusing the uAPI to trigger WARNs, but also
   because it can help prevent userspace from unintentionally crashing the VM.

 - Exempt SMM from CPUID faulting on Intel, as per the spec.

 - Misc hardening and cleanup changes.
2026-04-13 11:51:34 +02:00
Sean Christopherson
7212094bae KVM: x86: Suppress WARNs on nested_run_pending after userspace exit
To end an ongoing game of whack-a-mole between KVM and syzkaller, WARN on
illegally cancelling a pending nested VM-Enter if and only if userspace
has NOT gained control of the vCPU since the nested run was initiated.  As
proven time and time again by syzkaller, userspace can clobber vCPU state
so as to force a VM-Exit that violates KVM's architectural modelling of
VMRUN/VMLAUNCH/VMRESUME.

To detect that userspace has gained control, while minimizing the risk of
operating on stale data, convert nested_run_pending from a pure boolean to
a tri-state of sorts, where '0' is still "not pending", '1' is "pending",
and '2' is "pending but untrusted".  Then on KVM_RUN, if the flag is in
the "trusted pending" state, move it to "untrusted pending".

Note, moving the state to "untrusted" even if KVM_RUN is ultimately
rejected is a-ok, because for the "untrusted" state to matter, KVM must
get past kvm_x86_vcpu_pre_run() at some point for the vCPU.

Reviewed-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260312234823.3120658-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-03 09:34:01 -07:00
Yosry Ahmed
3d4470d71f KVM: x86: Move nested_run_pending to kvm_vcpu_arch
Move nested_run_pending field present in both svm_nested_state and
nested_vmx to the common kvm_vcpu_arch. This allows for common code to
use without plumbing it through per-vendor helpers.

nested_run_pending remains zero-initialized, as the entire kvm_vcpu
struct is, and all further accesses are done through vcpu->arch instead
of svm->nested or vmx->nested.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
[sean: expand the commend in the field declaration]
Link: https://patch.msgid.link/20260312234823.3120658-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-03 09:33:30 -07:00
Sean Christopherson
55be358e17 KVM: x86: Immediately fail the build when possible if required #define is missing
Guard usage of the must-be-defined macros in KVM's multi-include headers
with the existing #ifdefs that attempt to alert the developer to a missing
macro, and spit out an explicit #error message if a macro is missing, as
referencing the missing macro completely defeats the purpose of the #ifdef
(the compiler spews a ton of error messages and buries the targeted error
message).

Suggested-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Yuan Yao <yaoyuan@linux.alibaba.com>
Link: https://patch.msgid.link/20260302212619.710873-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-12 10:56:36 -07:00
Xin Li
577da677aa KVM: VMX: Remove unnecessary parentheses
Drop redundant parentheses; the & operator has higher precedence than the
return statement's implicit evaluation, making the grouping redundant.

Signed-off-by: Xin Li <xin@zytor.com>
Link: https://patch.msgid.link/20260306231253.2177246-1-xin@zytor.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-12 09:05:56 -07:00
Yosry Ahmed
3b27c82ba2 KVM: x86: Move some EFER bits enablement to common code
Move EFER bits enablement that only depend on CPU support to common
code, as there is no reason to do it in vendor code. Leave EFER.SVME and
EFER.LMSLE enablement in SVM code as they depend on vendor module
parameters.

Having the enablement in common code ensures that if a vendor starts
supporting an existing feature, KVM doesn't end up advertising to
userspace but not allowing the EFER bit to be set.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260307011619.2324234-2-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-12 09:05:41 -07:00
Paolo Bonzini
6b1ca262a9 KVM: x86: clarify leave_smm() return value
The return value of vmx_leave_smm() is unrelated from that of
nested_vmx_enter_non_root_mode().  Check explicitly for success
(which happens to be 0) and return 1 just like everywhere
else in vmx_leave_smm().

Likewise, in svm_leave_smm() return 0/1 instead of the 0/1/-errno
returned by tenter_svm_guest_mode().

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-03-11 18:41:12 +01:00
Paolo Bonzini
5a30e8aea0 KVM: VMX: check validity of VMCS controls when returning from SMM
The VMCS12 is not available while in SMM.  However, it can be overwritten
if userspace manages to trigger copy_enlightened_to_vmcs12() - for example
via KVM_GET_NESTED_STATE.

Because of this, the VMCS12 has to be checked for validity before it is
used to generate the VMCS02.  Move the check code out of vmx_set_nested_state()
(the other "not a VMLAUNCH/VMRESUME" path that emulates a nested vmentry)
and reuse it in vmx_leave_smm().

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-03-11 18:41:11 +01:00
Jim Mattson
e2ffe85b6d KVM: x86: Introduce KVM_X86_QUIRK_VMCS12_ALLOW_FREEZE_IN_SMM
Add KVM_X86_QUIRK_VMCS12_ALLOW_FREEZE_IN_SMM to allow L1 to set
FREEZE_IN_SMM in vmcs12's GUEST_IA32_DEBUGCTL field, as permitted
prior to commit 6b1dd26544 ("KVM: VMX: Preserve host's
DEBUGCTLMSR_FREEZE_IN_SMM while running the guest").  Enable the quirk
by default for backwards compatibility (like all quirks); userspace
can disable it via KVM_CAP_DISABLE_QUIRKS2 for consistency with the
constraints on WRMSR(IA32_DEBUGCTL).

Note that the quirk only bypasses the consistency check.  The vmcs02 bit is
still owned by the host, and PMCs are not frozen during virtualized SMM.
In particular, if a host administrator decides that PMCs should not be
frozen during physical SMM, then L1 has no say in the matter.

Fixes: 095686e6fc ("KVM: nVMX: Check vmcs12->guest_ia32_debugctl on nested VM-Enter")
Cc: stable@vger.kernel.org
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20260205231537.1278753-1-jmattson@google.com
[sean: tag for stable@, clean-up and fix goofs in the comment and docs]
Signed-off-by: Sean Christopherson <seanjc@google.com>
[Rename quirk. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-03-11 18:41:11 +01:00
Namhyung Kim
f78e627a01 KVM: VMX: Fix a wrong MSR update in add_atomic_switch_msr()
The previous change had a bug to update a guest MSR with a host value.

Fixes: c3d6a7210a ("KVM: VMX: Dedup code for adding MSR to VMCS's auto list")
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://patch.msgid.link/20260220220216.389475-1-namhyung@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-03-11 18:41:11 +01:00
Sean Christopherson
f630de1f8d KVM: TDX: Fold tdx_bringup() into tdx_hardware_setup()
Now that TDX doesn't need to manually enable virtualization through _KVM_
APIs during setup, fold tdx_bringup() into tdx_hardware_setup() where the
code belongs, e.g. so that KVM doesn't leave the S-EPT kvm_x86_ops wired
up when TDX is disabled.

The weird ordering (and naming) was necessary to allow KVM TDX to use
kvm_enable_virtualization(), which in turn had a hard dependency on
kvm_x86_ops.enable_virtualization_cpu and thus kvm_x86_vendor_init().

Tested-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Sagi Shahar <sagis@google.com>
Link: https://patch.msgid.link/20260214012702.2368778-17-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04 08:53:13 -08:00
Chao Gao
eac90a5ba0 x86/virt/tdx: KVM: Consolidate TDX CPU hotplug handling
The core kernel registers a CPU hotplug callback to do VMX and TDX init
and deinit while KVM registers a separate CPU offline callback to block
offlining the last online CPU in a socket.

Splitting TDX-related CPU hotplug handling across two components is odd
and adds unnecessary complexity.

Consolidate TDX-related CPU hotplug handling by integrating KVM's
tdx_offline_cpu() to the one in the core kernel.

Also move nr_configured_hkid to the core kernel because tdx_offline_cpu()
references it. Since HKID allocation and free are handled in the core
kernel, it's more natural to track used HKIDs there.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Tested-by: Chao Gao <chao.gao@intel.com>
Tested-by: Sagi Shahar <sagis@google.com>
Link: https://patch.msgid.link/20260214012702.2368778-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04 08:53:05 -08:00
Sean Christopherson
165e773538 KVM: x86/tdx: Do VMXON and TDX-Module initialization during subsys init
Now that VMXON can be done without bouncing through KVM, do TDX-Module
initialization during subsys init (specifically before module_init() so
that it runs before KVM when both are built-in).  Aside from the obvious
benefits of separating core TDX code from KVM, this will allow tagging a
pile of TDX functions and globals as being __init and __ro_after_init.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Chao Gao <chao.gao@intel.com>
Tested-by: Sagi Shahar <sagis@google.com>
Link: https://patch.msgid.link/20260214012702.2368778-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04 08:52:59 -08:00
Sean Christopherson
0efe5dc161 x86/virt/tdx: Drop the outdated requirement that TDX be enabled in IRQ context
Remove TDX's outdated requirement that per-CPU enabling be done via IPI
function call, which was a stale artifact leftover from early versions of
the TDX enablement series.  The requirement that IRQs be disabled should
have been dropped as part of the revamped series that relied on a the KVM
rework to enable VMX at module load.

In other words, the kernel's "requirement" was never a requirement at all,
but instead a reflection of how KVM enabled VMX (via IPI callback) when
the TDX subsystem code was merged.

Note, accessing per-CPU information is safe even without disabling IRQs,
as tdx_online_cpu() is invoked via a cpuhp callback, i.e. from a per-CPU
thread.

Link: https://lore.kernel.org/all/ZyJOiPQnBz31qLZ7@google.com
Tested-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Sagi Shahar <sagis@google.com>
Link: https://patch.msgid.link/20260214012702.2368778-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04 08:52:55 -08:00
Sean Christopherson
8528a7f9c9 x86/virt: Add refcounting of VMX/SVM usage to support multiple in-kernel users
Implement a per-CPU refcounting scheme so that "users" of hardware
virtualization, e.g. KVM and the future TDX code, can co-exist without
pulling the rug out from under each other.  E.g. if KVM were to disable
VMX on module unload or when the last KVM VM was destroyed, SEAMCALLs from
the TDX subsystem would #UD and panic the kernel.

Disable preemption in the get/put APIs to ensure virtualization is fully
enabled/disabled before returning to the caller.  E.g. if the task were
preempted after a 0=>1 transition, the new task would see a 1=>2 and thus
return without enabling virtualization.  Explicitly disable preemption
instead of requiring the caller to do so, because the need to disable
preemption is an artifact of the implementation.  E.g. from KVM's
perspective there is no _need_ to disable preemption as KVM guarantees the
pCPU on which it is running is stable (but preemption is enabled).

Opportunistically abstract away SVM vs. VMX in the public APIs by using
X86_FEATURE_{SVM,VMX} to communicate what technology the caller wants to
enable and use.

Cc: Xu Yilun <yilun.xu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Sagi Shahar <sagis@google.com>
Link: https://patch.msgid.link/20260214012702.2368778-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04 08:52:52 -08:00
Sean Christopherson
428afac5a8 KVM: x86: Move bulk of emergency virtualizaton logic to virt subsystem
Move the majority of the code related to disabling hardware virtualization
in emergency from KVM into the virt subsystem so that virt can take full
ownership of the state of SVM/VMX.  This will allow refcounting usage of
SVM/VMX so that KVM and the TDX subsystem can enable VMX without stomping
on each other.

To route the emergency callback to the "right" vendor code, add to avoid
mixing vendor and generic code, implement a x86_virt_ops structure to
track the emergency callback, along with the SVM vs. VMX (vs. "none")
feature that is active.

To avoid having to choose between SVM and VMX, simply refuse to enable
either if both are somehow supported.  No known CPU supports both SVM and
VMX, and it's comically unlikely such a CPU will ever exist.

Leave KVM's clearing of loaded VMCSes and MSR_VM_HSAVE_PA in KVM, via a
callback explicitly scoped to KVM.  Loading VMCSes and saving/restoring
host state are firmly tied to running VMs, and thus are (a) KVM's
responsibility and (b) operations that are still exclusively reserved for
KVM (as far as in-tree code is concerned).  I.e. the contract being
established is that non-KVM subsystems can utilize virtualization, but for
all intents and purposes cannot act as full-blown hypervisors.

Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Sagi Shahar <sagis@google.com>
Link: https://patch.msgid.link/20260214012702.2368778-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04 08:52:49 -08:00
Sean Christopherson
920da4f755 KVM: VMX: Move core VMXON enablement to kernel
Move the innermost VMXON+VMXOFF logic out of KVM and into to core x86 so
that TDX can (eventually) force VMXON without having to rely on KVM being
loaded, e.g. to do SEAMCALLs during initialization.

Opportunistically update the comment regarding emergency disabling via NMI
to clarify that virt_rebooting will be set by _another_ emergency callback,
i.e. that virt_rebooting doesn't need to be set before VMCLEAR, only
before _this_ invocation does VMXOFF.

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Sagi Shahar <sagis@google.com>
Link: https://patch.msgid.link/20260214012702.2368778-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04 08:52:42 -08:00
Sean Christopherson
95e4adb24f x86/virt: Force-clear X86_FEATURE_VMX if configuring root VMCS fails
If allocating and configuring a root VMCS fails, clear X86_FEATURE_VMX in
all CPUs so that KVM doesn't need to manually check root_vmcs.  As added
bonuses, clearing VMX will reflect that VMX is unusable in /proc/cpuinfo,
and will avoid a futile auto-probe of kvm-intel.ko.

WARN if allocating a root VMCS page fails, e.g. to help users figure out
why VMX is broken in the unlikely scenario something goes sideways during
boot (and because the allocation should succeed unless there's a kernel
bug).  Tweak KVM's error message to suggest checking kernel logs if VMX is
unsupported (in addition to checking BIOS).

Tested-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Sagi Shahar <sagis@google.com>
Link: https://patch.msgid.link/20260214012702.2368778-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04 08:52:39 -08:00
Sean Christopherson
405b7c2793 KVM: VMX: Unconditionally allocate root VMCSes during boot CPU bringup
Allocate the root VMCS (misleading called "vmxarea" and "kvm_area" in KVM)
for each possible CPU during early boot CPU bringup, before early TDX
initialization, so that TDX can eventually do VMXON on-demand (to make
SEAMCALLs) without needing to load kvm-intel.ko.  Allocate the pages early
on, e.g. instead of trying to do so on-demand, to avoid having to juggle
allocation failures at runtime.

Opportunistically rename the per-CPU pointers to better reflect the role
of the VMCS.  Use Intel's "root VMCS" terminology, e.g. from various VMCS
patents[1][2] and older SDMs, not the more opaque "VMXON region" used in
recent versions of the SDM.  While it's possible the VMCS passed to VMXON
no longer serves as _the_ root VMCS on modern CPUs, it is still in effect
a "root mode VMCS", as described in the patents.

Link: https://patentimages.storage.googleapis.com/c7/e4/32/d7a7def5580667/WO2013101191A1.pdf [1]
Link: https://patentimages.storage.googleapis.com/13/f6/8d/1361fab8c33373/US20080163205A1.pdf [2]
Tested-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Sagi Shahar <sagis@google.com>
Link: https://patch.msgid.link/20260214012702.2368778-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04 08:52:34 -08:00
Sean Christopherson
a1450a8156 KVM: x86: Move "kvm_rebooting" to kernel as "virt_rebooting"
Move "kvm_rebooting" to the kernel, exported for KVM, as one of many steps
towards extracting the innermost VMXON and EFER.SVME management logic out
of KVM and into to core x86.

For lack of a better name, call the new file "hw.c", to yield "virt
hardware" when combined with its parent directory.

No functional change intended.

Tested-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Sagi Shahar <sagis@google.com>
Link: https://patch.msgid.link/20260214012702.2368778-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04 08:52:31 -08:00
Sean Christopherson
3c75e6a5da KVM: VMX: Move architectural "vmcs" and "vmcs_hdr" structures to public vmx.h
Move "struct vmcs" and "struct vmcs_hdr" to asm/vmx.h in anticipation of
moving VMXON/VMXOFF to the core kernel (VMXON requires a "root" VMCS with
the appropriate revision ID in its header).

No functional change intended.

Tested-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Sagi Shahar <sagis@google.com>
Link: https://patch.msgid.link/20260214012702.2368778-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04 08:52:26 -08:00
Xiaoyao Li
3256e41f02 KVM/TDX: Rename KVM_SUPPORTED_TD_ATTRS to KVM_SUPPORTED_TDX_TD_ATTRS
Rename KVM_SUPPORTED_TD_ATTRS to KVM_SUPPORTED_TDX_TD_ATTRS to include
"TDX" in the name, making it clear that it pertains to TDX.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Kiryl Shutsemau <kas@kernel.org>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://patch.msgid.link/20260303030335.766779-5-xiaoyao.li@intel.com
2026-03-03 16:06:49 -08:00
Xiaoyao Li
28bcd8d83f x86/tdx: Rename TDX_ATTR_* to TDX_TD_ATTR_*
The macros TDX_ATTR_* and DEF_TDX_ATTR_* are related to TD attributes,
which are TD-scope attributes. Naming them as TDX_ATTR_* can be somewhat
confusing and might mislead people into thinking they are TDX global
things.

Rename TDX_ATTR_* to TDX_TD_ATTR_* to explicitly clarify they are
TD-scope things.

Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kiryl Shutsemau <kas@kernel.org>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://patch.msgid.link/20260303030335.766779-4-xiaoyao.li@intel.com
2026-03-03 16:06:49 -08:00
Xiaoyao Li
8768698719 KVM/TDX: Remove redundant definitions of TDX_TD_ATTR_*
There are definitions of TD attributes bits inside asm/shared/tdx.h as
TDX_ATTR_*.

Remove KVM's definitions and use the ones in asm/shared/tdx.h

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://patch.msgid.link/20260303030335.766779-3-xiaoyao.li@intel.com
2026-03-03 16:06:49 -08:00
Sean Christopherson
e2138c4a5b KVM: x86: Add helpers to prepare kvm_run for userspace MMIO exit
Add helpers to fill kvm_run for userspace MMIO exits to deduplicate a
variety of code, and to allow for a cleaner return path in
emulator_read_write().

Opportunistically add a KVM_BUG_ON() to ensure the caller is limiting the
length of a single MMIO access to 8 bytes (the largest size userspace is
prepared to handled, as the ABI was baked before things like MOVDQ came
along).

No functional change intended.

Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Binbin Wu <binbin.wu@linux.intel.com>
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Michael Roth <michael.roth@amd.com>
Tested-by: Tom Lendacky <thomas.lendacky@gmail.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://patch.msgid.link/20260225012049.920665-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-02 16:06:49 -08:00
Uros Bizjak
192f777b3a KVM: VMX: Use ASM_INPUT_RM in __vmcs_writel
Use the ASM_INPUT_RM macro for VMCS write operation in vmx_ops.h to
work around clang problems with "rm" asm constraint. clang seems to
always chose the memory input, while it is almost always the worst
choice.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Acked-by: Nathan Chancellor <nathan@kernel.org>
Link: https://patch.msgid.link/20260211102928.100944-2-ubizjak@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-02 14:32:57 -08:00
Uros Bizjak
6dad59124e KVM: VMX: Drop obsolete branch hint prefixes from inline asm
Remove explicit branch hint prefixes (.byte 0x2e / 0x3e) from VMX
inline assembly sequences.

These prefixes (CS/DS segment overrides used as branch hints on
very old x86 CPUs) have been ignored by modern processors for a
long time. Keeping them provides no measurable benefit and only
enlarges the generated code.

No functional change intended.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://patch.msgid.link/20260211102928.100944-1-ubizjak@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-02 14:32:57 -08:00
Kees Cook
189f164e57 Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL uses
Conversion performed via this Coccinelle script:

  // SPDX-License-Identifier: GPL-2.0-only
  // Options: --include-headers-for-types --all-includes --include-headers --keep-comments
  virtual patch

  @gfp depends on patch && !(file in "tools") && !(file in "samples")@
  identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex,
 		    kzalloc_obj,kzalloc_objs,kzalloc_flex,
		    kvmalloc_obj,kvmalloc_objs,kvmalloc_flex,
		    kvzalloc_obj,kvzalloc_objs,kvzalloc_flex};
  @@

  	ALLOC(...
  -		, GFP_KERNEL
  	)

  $ make coccicheck MODE=patch COCCI=gfp.cocci

Build and boot tested x86_64 with Fedora 42's GCC and Clang:

Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01
Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01

Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-22 08:26:33 -08:00
Linus Torvalds
32a92f8c89 Convert more 'alloc_obj' cases to default GFP_KERNEL arguments
This converts some of the visually simpler cases that have been split
over multiple lines.  I only did the ones that are easy to verify the
resulting diff by having just that final GFP_KERNEL argument on the next
line.

Somebody should probably do a proper coccinelle script for this, but for
me the trivial script actually resulted in an assertion failure in the
middle of the script.  I probably had made it a bit _too_ trivial.

So after fighting that far a while I decided to just do some of the
syntactically simpler cases with variations of the previous 'sed'
scripts.

The more syntactically complex multi-line cases would mostly really want
whitespace cleanup anyway.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 20:03:00 -08:00
Linus Torvalds
bf4afc53b7 Convert 'alloc_obj' family to use the new default GFP_KERNEL argument
This was done entirely with mindless brute force, using

    git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
        xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'

to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.

Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.

For the same reason the 'flex' versions will be done as a separate
conversion.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Kees Cook
69050f8d6d treewide: Replace kmalloc with kmalloc_obj for non-scalar types
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-21 01:02:28 -08:00
Paolo Bonzini
bf2c3138ae Merge tag 'kvm-x86-pmu-6.20' of https://github.com/kvm-x86/linux into HEAD
KVM mediated PMU support for 6.20

Add support for mediated PMUs, where KVM gives the guest full ownership of PMU
hardware (contexted switched around the fastpath run loop) and allows direct
access to data MSRs and PMCs (restricted by the vPMU model), but intercepts
access to control registers, e.g. to enforce event filtering and to prevent the
guest from profiling sensitive host state.

To keep overall complexity reasonable, mediated PMU usage is all or nothing
for a given instance of KVM (controlled via module param).  The Mediated PMU
is disabled default, partly to maintain backwards compatilibity for existing
setup, partly because there are tradeoffs when running with a mediated PMU that
may be non-starters for some use cases, e.g. the host loses the ability to
profile guests with mediated PMUs, the fastpath run loop is also a blind spot,
entry/exit transitions are more expensive, etc.

Versus the emulated PMU, where KVM is "just another perf user", the mediated
PMU delivers more accurate profiling and monitoring (no risk of contention and
thus dropped events), with significantly less overhead (fewer exits and faster
emulation/programming of event selectors) E.g. when running Specint-2017 on
a single-socket Sapphire Rapids with 56 cores and no-SMT, and using perf from
within the guest:

  Perf command:
  a. basic-sampling: perf record -F 1000 -e 6-instructions  -a --overwrite
  b. multiplex-sampling: perf record -F 1000 -e 10-instructions -a --overwrite

  Guest performance overhead:
  ---------------------------------------------------------------------------
  | Test case          | emulated vPMU | all passthrough | passthrough with |
  |                    |               |                 | event filters    |
  ---------------------------------------------------------------------------
  | basic-sampling     |   33.62%      |    4.24%        |   6.21%          |
  ---------------------------------------------------------------------------
  | multiplex-sampling |   79.32%      |    7.34%        |   10.45%         |
  ---------------------------------------------------------------------------
2026-02-11 12:45:40 -05:00
Paolo Bonzini
1b13885edf Merge tag 'kvm-x86-apic-6.20' of https://github.com/kvm-x86/linux into HEAD
KVM x86 APIC-ish changes for 6.20

 - Fix a benign bug where KVM could use the wrong memslots (ignored SMM) when
   creating a vCPU-specific mapping of guest memory.

 - Clean up KVM's handling of marking mapped vCPU pages dirty.

 - Drop a pile of *ancient* sanity checks hidden behind in KVM's unused
   ASSERT() macro, most of which could be trivially triggered by the guest
   and/or user, and all of which were useless.

 - Fold "struct dest_map" into its sole user, "struct rtc_status", to make it
   more obvious what the weird parameter is used for, and to allow burying the
   RTC shenanigans behind CONFIG_KVM_IOAPIC=y.

 - Bury all of ioapic.h and KVM_IRQCHIP_KERNEL behind CONFIG_KVM_IOAPIC=y.

 - Add a regression test for recent APICv update fixes.

 - Rework KVM's handling of VMCS updates while L2 is active to temporarily
   switch to vmcs01 instead of deferring the update until the next nested
   VM-Exit.  The deferred updates approach directly contributed to several
   bugs, was proving to be a maintenance burden due to the difficulty in
   auditing the correctness of deferred updates, and was polluting
   "struct nested_vmx" with a growing pile of booleans.

 - Handle "hardware APIC ISR", a.k.a. SVI, updates in kvm_apic_update_apicv()
   to consolidate the updates, and to co-locate SVI updates with the updates
   for KVM's own cache of ISR information.

 - Drop a dead function declaration.
2026-02-11 12:45:32 -05:00
Paolo Bonzini
9123c5f956 Merge tag 'kvm-x86-gmem-6.20' of https://github.com/kvm-x86/linux into HEAD
KVM guest_memfd changes for 6.20

 - Remove kvm_gmem_populate()'s preparation tracking and half-baked hugepage
   handling, and instead rely on SNP (the only user of the tracking) to do its
   own tracking via the RMP.

 - Retroactively document and enforce (for SNP) that KVM_SEV_SNP_LAUNCH_UPDATE
   and KVM_TDX_INIT_MEM_REGION require the source page to be 4KiB aligned, to
   avoid non-trivial complexity for a non-existent usecase (and because
   in-place conversion simply can't support unaligned sources).

 - When populating guest_memfd memory, GUP the source page in common code and
   pass the refcounted page to the vendor callback, instead of letting vendor
   code do the heavy lifting.  Doing so avoids a looming deadlock bug with
   in-place due an AB-BA conflict betwee mmap_lock and guest_memfd's filemap
   invalidate lock.
2026-02-11 12:45:12 -05:00
Paolo Bonzini
9e03b7caf4 KVM x86 misc changes for 6.20
- Disallow changing the virtual CPU model if L2 is active, for all the same
    reasons KVM disallows change the model after the first KVM_RUN.
 
  - Fix a bug where KVM would incorrectly reject host accesses to PV MSRs that
    were advertised as supported to userspace when running with
    KVM_CAP_ENFORCE_PV_FEATURE_CPUID enabled.
 
  - Fix a bug where KVM would attempt to read protect guest state (CR3) when
    configuring an async #PF entry.
 
  - Fail the build if EXPORT_SYMBOL_GPL or EXPORT_SYMBOL is used in KVM (for x86
    only) to enforce usage of EXPORT_SYMBOL_FOR_KVM_INTERNAL.  Explicitly allow
    the few exports that are intended for external usage.
 
  - Ignore -EBUSY when checking nested events after a vCPU exits blocking as
    the WARN is user-triggerable, and because exiting to userspace on -EBUSY
    does more harm than good in pretty much every situation.
 
  - Throw in the towel and drop the WARN on INIT/SIPI being blocked when vCPU is
    in Wait-For-SIPI, as playing whack-a-mole with syzkaller turned out to be an
    unwinnable game.
 
  - Add support for new Intel instructions that don't require anything beyond
    enumerating feature flags to userspace.
 
  - Grab SRCU when reading PDPTRs in KVM_GET_SREGS2.
 
  - Add WARNs to guard against modifying KVM's CPU caps outside of the intended
    setup flow, as nested VMX in particular is sensitive to unexpected changes
    in KVM's golden configuration.
 
  - Add a quirk to allow userspace to opt-in to actually suppress EOI broadcasts
    when the suppression feature is enabled by the guest (currently limited to
    split IRQCHIP, i.e. userspace I/O APIC).  Sadly, simply fixing KVM to honor
    Suppress EOI Broadcasts isn't an option as some userspaces have come to rely
    on KVM's buggy behavior (KVM advertises Supress EOI Broadcast irrespective
    of whether or not userspace I/O APIC supports Directed EOIs).
 
  - Minor cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmmGqtYACgkQOlYIJqCj
 N/2mURAAq6xms7qH8IpXy7RJjGP7UWVfV7sJPP9N8FWERVfljYn2FGGPAlBi0+5b
 Gbpf3dhEk+JEHPda7Skz3RqnfKqNXszhPRfUxXIW4nlKWs3VCBNtI2XuOc3xGSs+
 itq6jwirPJAibi3GhP3GOnzH3VSdlgq5JhkYW3MGO2JeB0+XMzB+OYE/xZbnRjXg
 i4qwoe9+pGVHpV+rf0MMhCd/46HaGAegPOKArQUbMXQIK3L+6Kgz3y4zy74cCJkI
 nOmevvXztuM8rWrJUl8NvhqNWAak3au6gLg/1CkNcaXp6ekQovZb8BWihQ8JrkOS
 AcmUNqK8RcXXGtjohuXgTgigLg/t+z7tpXiwHC/BxAglf3YB/P2hcxN1/q8zG56T
 s5Ua8RFiosYorlN/LVeyMpPK4MEZQi8QyL/biKIlyoPg3vIL+g7Llf3XdBYsfb4d
 gWGecZTNmEvhwhVbwCqo+2zsO2ATYXKdR+lE8czqqdJ98l+6p652DxA315a6dx7Y
 2fkirbs/JJJotjvukWjWDNk5oGFdX6cDxt2tA1SqDaZ9WTLoqXIIT+9EMtnqXPZm
 KsQLEa5mrM0mbRuOid+Ce+Y1bK4x4DLFaM1oH9BF0UIewo+dMIC/gRgrJEcBS+Vv
 E+XdrCSq2904NX9Gy3OubdorwTloMk+2Sc0HfvsXMytw1LBsUYY=
 =ii2B
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-misc-6.20' of https://github.com/kvm-x86/linux into HEAD

KVM x86 misc changes for 6.20

 - Disallow changing the virtual CPU model if L2 is active, for all the same
   reasons KVM disallows change the model after the first KVM_RUN.

 - Fix a bug where KVM would incorrectly reject host accesses to PV MSRs that
   were advertised as supported to userspace when running with
   KVM_CAP_ENFORCE_PV_FEATURE_CPUID enabled.

 - Fix a bug where KVM would attempt to read protect guest state (CR3) when
   configuring an async #PF entry.

 - Fail the build if EXPORT_SYMBOL_GPL or EXPORT_SYMBOL is used in KVM (for x86
   only) to enforce usage of EXPORT_SYMBOL_FOR_KVM_INTERNAL.  Explicitly allow
   the few exports that are intended for external usage.

 - Ignore -EBUSY when checking nested events after a vCPU exits blocking as
   the WARN is user-triggerable, and because exiting to userspace on -EBUSY
   does more harm than good in pretty much every situation.

 - Throw in the towel and drop the WARN on INIT/SIPI being blocked when vCPU is
   in Wait-For-SIPI, as playing whack-a-mole with syzkaller turned out to be an
   unwinnable game.

 - Add support for new Intel instructions that don't require anything beyond
   enumerating feature flags to userspace.

 - Grab SRCU when reading PDPTRs in KVM_GET_SREGS2.

 - Add WARNs to guard against modifying KVM's CPU caps outside of the intended
   setup flow, as nested VMX in particular is sensitive to unexpected changes
   in KVM's golden configuration.

 - Add a quirk to allow userspace to opt-in to actually suppress EOI broadcasts
   when the suppression feature is enabled by the guest (currently limited to
   split IRQCHIP, i.e. userspace I/O APIC).  Sadly, simply fixing KVM to honor
   Suppress EOI Broadcasts isn't an option as some userspaces have come to rely
   on KVM's buggy behavior (KVM advertises Supress EOI Broadcast irrespective
   of whether or not userspace I/O APIC supports Directed EOIs).

 - Minor cleanups.
2026-02-09 18:53:47 +01:00
Paolo Bonzini
687603fb2b KVM VMX changes for 6.20
- Fix an SGX bug where KVM would incorrectly try to handle EPCM #PFs by always
    relecting EPCM #PFs back into the guest.  KVM doesn't shadow EPCM entries,
    and so EPCM violations cannot be due to KVM interference, and can't be
    resolved by KVM.
 
  - Fix a bug where KVM would register its posted interrupt wakeup handler even
    if loading kvm-intel.ko ultimately failed.
 
  - Disallow access to vmcb12 fields that aren't fully supported, mostly to
    avoid weirdness and complexity for FRED and other features, where KVM wants
    enable VMCS shadowing for fields that conditionally exist.
 
  - Print out the "bad" offsets and values if kvm-intel.ko refuses to load (or
    refuses to online a CPU) due to a VMCS config mismatch.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmmGstUACgkQOlYIJqCj
 N/3z3w/+NSA+/0/JfeCmw+CiMtmHY4eCOtScPwmrP0RONcee4HzX2LlzhZww9YeL
 GSBouvaU5eyNoYjA14mgOjTHfLUEkhH6/3kULN2LjE8md+oLxD0ZBVQNwkXDKSgX
 BpP8EJ/9vBuAjSzUNWeikBsNlXt8I4+QxZYSHPe+BiKE0kMVtFuua2LQeDtj5qc9
 SYKguN0EmYdCox09a1YOX9tExk4VULrOtwcOnNK0I7m87os5Xl2DLHy1vYLZ0WPT
 R9iSnh/AfTsYuvCfotlGccDW8x9x+5PILZ7zxyipXBOGvRBgTaOmsgho/Rf81vpj
 laj6PDk06ep5PLfX0IPM7I4+8usQCxWB0dTXnB6Fu32BnmwuFRpwYCW3XJsqBMrb
 Q4fa14a0Aj5rviCn/CWDJOmMZtTRbQ/U+AaYT+A1VlaMRo8hkIMvW3coYSqvCuZY
 tceW2/3oobwzad5pi37OPsNws6STQc/UOgQDsmAIX6c5/B+cc8PF/a/DAInHPyX2
 356rpdIBOnF7uheLfHGBefFeD1TdkVZvW9Gy6rHPaVjWAwyc59+C6OZoA8bTJtyP
 x4akIaS0GrJ7Gi9RcHRJpvKQucMWbhOrpZxov9QDMRgkdH00eznVwixVZfYAFLPN
 iyQpYJU+moyhXQBGmVUJlWTuMud3qwwCxhY4DEi/pGT8JtK1v5M=
 =XHNe
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-vmx-6.20' of https://github.com/kvm-x86/linux into HEAD

KVM VMX changes for 6.20

 - Fix an SGX bug where KVM would incorrectly try to handle EPCM #PFs by always
   relecting EPCM #PFs back into the guest.  KVM doesn't shadow EPCM entries,
   and so EPCM violations cannot be due to KVM interference, and can't be
   resolved by KVM.

 - Fix a bug where KVM would register its posted interrupt wakeup handler even
   if loading kvm-intel.ko ultimately failed.

 - Disallow access to vmcb12 fields that aren't fully supported, mostly to
   avoid weirdness and complexity for FRED and other features, where KVM wants
   enable VMCS shadowing for fields that conditionally exist.

 - Print out the "bad" offsets and values if kvm-intel.ko refuses to load (or
   refuses to online a CPU) due to a VMCS config mismatch.
2026-02-09 18:50:04 +01:00
Sean Christopherson
3f2757dbf3 KVM: x86: Harden against unexpected adjustments to kvm_cpu_caps
Add a flag to track when KVM is actively configuring its CPU caps, and
WARN if a cap is set or cleared if KVM isn't in its configuration stage.
Modifying CPU caps after {svm,vmx}_set_cpu_caps() can be fatal to KVM, as
vendor setup code expects the CPU caps to be frozen at that point, e.g.
will do additional configuration based on the caps.

Rename kvm_set_cpu_caps() to kvm_initialize_cpu_caps() to pair with the
new "finalize", and to make it more obvious that KVM's CPU caps aren't
fully configured within the function.

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://patch.msgid.link/20260128014310.3255561-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-30 13:28:29 -08:00
Sean Christopherson
c0d6b8bbbc KVM: VMX: Print out "bad" offsets+value on VMCS config mismatch
When kvm-intel.ko refuses to load due to a mismatched VMCS config, print
all mismatching offsets+values to make it easier to debug goofs during
development, and to make it at least feasible to triage failures that
occur during production.  E.g. if a physical core is flaky or is running
with the "wrong" microcode patch loaded, then a CPU can get a legitimate
mismatch even without KVM bugs.

Print the mismatches as 32-bit values as a compromise between hand coding
every field (to provide precise information) and printing individual bytes
(requires more effort to deduce the mismatch bit(s)).  All fields in the
VMCS config are either 32-bit or 64-bit values, i.e. in many cases,
printing 32-bit values will be 100% precise, and in the others it's close
enough, especially when considering that MSR values are split into EDX:EAX
anyways.

E.g. on mismatch CET entry/exit controls, KVM will print:

  kvm_intel: VMCS config on CPU 0 doesn't match reference config:
    Offset 76 REF = 0x107fffff, CPU0 = 0x007fffff, mismatch = 0x10000000
    Offset 84 REF = 0x0010f3ff, CPU0 = 0x0000f3ff, mismatch = 0x00100000

Opportunistically tweak the wording on the initial error message to say
"mismatch" instead of "inconsistent", as the VMCS config itself isn't
inconsistent, and the wording conflates the cross-CPU compatibility check
with the error_on_inconsistent_vmcs_config knob that treats inconsistent
VMCS configurations as errors (e.g. if a CPU supports CET entry controls
but no CET exit controls).

Cc: Jim Mattson <jmattson@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://patch.msgid.link/20260128014310.3255561-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-30 13:27:46 -08:00
Sean Christopherson
f8ade833b7 KVM: x86: Explicitly configure supported XSS from {svm,vmx}_set_cpu_caps()
Explicitly configure KVM's supported XSS as part of each vendor's setup
flow to fix a bug where clearing SHSTK and IBT in kvm_cpu_caps, e.g. due
to lack of CET XFEATURE support, makes kvm-intel.ko unloadable when nested
VMX is enabled, i.e. when nested=1.  The late clearing results in
nested_vmx_setup_{entry,exit}_ctls() clearing VM_{ENTRY,EXIT}_LOAD_CET_STATE
when nested_vmx_setup_ctls_msrs() runs during the CPU compatibility checks,
ultimately leading to a mismatched VMCS config due to the reference config
having the CET bits set, but every CPU's "local" config having the bits
cleared.

Note, kvm_caps.supported_{xcr0,xss} are unconditionally initialized by
kvm_x86_vendor_init(), before calling into vendor code, and not referenced
between ops->hardware_setup() and their current/old location.

Fixes: 69cc3e8865 ("KVM: x86: Add XSS support for CET_KERNEL and CET_USER")
Cc: stable@vger.kernel.org
Cc: Mathias Krause <minipli@grsecurity.net>
Cc: John Allen <john.allen@amd.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Chao Gao <chao.gao@intel.com>
Cc: Binbin Wu <binbin.wu@linux.intel.com>
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://patch.msgid.link/20260128014310.3255561-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-30 13:27:33 -08:00
Sean Christopherson
1dc6432059 KVM: nVMX: Remove explicit filtering of GUEST_INTR_STATUS from shadow VMCS fields
Drop KVM's filtering of GUEST_INTR_STATUS when generating the shadow VMCS
bitmap now that KVM drops GUEST_INTR_STATUS from the set of supported
vmcs12 fields if the field isn't supported by hardware, and initialization
of the shadow VMCS fields omits unsupported vmcs12 fields.

Note, there is technically a small functional change here, as the vmcs12
filtering only requires support for Virtual Interrupt Delivery, whereas
the shadow VMCS code being removed required "full" APICv support, i.e.
required Virtual Interrupt Delivery *and* APIC Register Virtualizaton *and*
Posted Interrupt support.

Opportunistically tweak the comment to more precisely explain why the
PML and VMX preemption timer fields need to be explicitly checked.

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://patch.msgid.link/20260115173427.716021-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-26 06:23:56 -08:00
Sean Christopherson
5fdf86e735 KVM: nVMX: Disallow access to vmcs12 fields that aren't supported by "hardware"
Disallow access (VMREAD/VMWRITE), both emulated and via a shadow VMCS, to
VMCS fields that the loaded incarnation of KVM doesn't support, e.g. due
to lack of hardware support, as a middle ground between allowing access to
any vmcs12 field defined by KVM (current behavior) and gating access based
on the userspace-defined vCPU model (the most functionally correct, but
very costly, implementation).

Disallowing access to unsupported fields helps a tiny bit in terms of
closing the virtualization hole (see below), but the main motivation is to
avoid having to weed out unsupported fields when synchronizing between
vmcs12 and a shadow VMCS.  Because shadow VMCS accesses are done via
VMREAD and VMWRITE, KVM _must_ filter out unsupported fields (or eat
VMREAD/VMWRITE failures), and filtering out just shadow VMCS fields is
about the same amount of effort, and arguably much more confusing.

As a bonus, this also fixes a KVM-Unit-Test failure bug when running on
_hardware_ without support for TSC Scaling, which fails with the same
signature as the bug fixed by commit ba1f82456b ("KVM: nVMX: Dynamically
compute max VMCS index for vmcs12"):

  FAIL: VMX_VMCS_ENUM.MAX_INDEX expected: 19, actual: 17

Dynamically computing the max VMCS index only resolved the issue where KVM
was hardcoding max index, but for CPUs with TSC Scaling, that was "good
enough".

Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Xin Li <xin@zytor.com>
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://lore.kernel.org/all/20251026201911.505204-22-xin@zytor.com
Link: https://lore.kernel.org/all/YR2Tf9WPNEzrE7Xg@google.com
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://patch.msgid.link/20260115173427.716021-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-26 06:23:51 -08:00