Commit Graph

1549 Commits

Author SHA1 Message Date
Tina Zhang
9a12fa5213 KVM: SVM: Disable AVIC IPI virtualization on Hygon Family 18h (erratum #1235)
Hygon Family 18h CPUs are derived from AMD Family 17h (Zen1) silicon and
share the same erratum #1235: hardware may read a stale IsRunning=1 bit
during ICR write emulation and silently fail to generate an
AVIC_IPI_FAILURE_TARGET_NOT_RUNNING VM-Exit on the sending vCPU.

The absence of the VM-Exit causes KVM to miss the required wakeup of
blocking target vCPUs, leading to hung vCPUs and unbounded delays in
guest execution.

Extend the existing AMD Family 17h erratum #1235 workaround to also cover
Hygon Family 18h.  With IPI virtualization disabled, KVM never sets
IsRunning=1 in the Physical ID table, so every non-self IPI generates a
VM-Exit and is correctly emulated.

Fixes: 8de4a1c816 ("KVM: SVM: Disable (x2)AVIC IPI virtualization if CPU has erratum #1235")
Cc: <stable@vger.kernel.org>
Signed-off-by: Tina Zhang <zhang_wei@open-hieco.net>
Message-ID: <20260522040014.3380201-1-zhang_wei@open-hieco.net>
2026-05-23 10:09:04 +02:00
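The check described in 9a12fa5213 can be pictured with a small userspace sketch. It is not the kernel code: the vendor enum, `struct cpu_id`, and both helper names are hypothetical stand-ins, and the only facts taken from the changelog are the affected vendor/family pairs (AMD Family 17h, Hygon Family 18h).

```c
#include <stdbool.h>

/* Hypothetical vendor tags; the kernel has its own vendor constants. */
enum cpu_vendor { VENDOR_AMD, VENDOR_HYGON, VENDOR_OTHER };

struct cpu_id {
	enum cpu_vendor vendor;
	unsigned int family;
};

/* True if the CPU is assumed to be affected by erratum #1235. */
static bool cpu_has_erratum_1235(const struct cpu_id *cpu)
{
	if (cpu->vendor == VENDOR_AMD && cpu->family == 0x17)
		return true;
	if (cpu->vendor == VENDOR_HYGON && cpu->family == 0x18)
		return true;
	return false;
}

/* IPI virtualization is used only if requested and the erratum is absent. */
static bool use_ipi_virtualization(const struct cpu_id *cpu, bool requested)
{
	return requested && !cpu_has_erratum_1235(cpu);
}
```

With this shape, adding another affected family is a one-line change in `cpu_has_erratum_1235()` and every caller picks it up.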
Sean Christopherson
5bd1ddb791 KVM: nSVM: Never use L0's PAUSE loop exiting while L2 is running
Never use L0's (KVM's) PAUSE loop exiting controls while L2 is running,
and instead always configure vmcb02 according to L1's exact capabilities
and desires.

The purpose of intercepting PAUSE after N attempts is to detect when the
vCPU may be stuck waiting on a lock, so that KVM can schedule in a
different vCPU that may be holding said lock.  Barring a very interesting
setup, L1 and L2 do not share locks, and it's extremely unlikely that an
L1 vCPU would hold a spinlock while running L2.  I.e. having a vCPU
executing in L1 yield to a vCPU running in L2 will not allow the L1 vCPU
to make forward progress, and vice versa.

While teaching KVM's "on spin" logic to only yield to other vCPUs in L2 is
doable, in all likelihood it would do more harm than good for most setups.
KVM has limited visibility into which L2 "vCPUs" belong to the same VM,
and thus share a locking domain.  And even if L2 vCPUs are in the same
VM, KVM has no visibility into L2 vCPUs that are scheduled out by the
L1 hypervisor.

Furthermore, KVM doesn't actually steal PAUSE exits from L1. If L1 is
intercepting PAUSE, KVM will route PAUSE exits to L1, not L0, as
nested_svm_intercept() gives priority to the vmcb12 intercept.  As such,
overriding the count/threshold fields in vmcb02 with vmcb01's values is
nonsensical, as doing so clobbers all the training/learning that has been
done in L1.

Even worse, if L1 is not intercepting PAUSE, i.e. KVM is handling PAUSE
exits, then KVM will adjust the PLE knobs based on L2 behavior, which could
very well be detrimental to L1, e.g. due to essentially poisoning L1 PLE
training with bad data.

And copying the count from vmcb02 to vmcb01 on a nested VM-Exit makes even
less sense, because again, the purpose of PLE is to detect spinning vCPUs.
Whether or not a vCPU is spinning in L2 at the time of a nested VM-Exit
has no relevance as to the behavior of the vCPU when it executes in L1.

The only scenarios where any of this actually works are those where at least
one of KVM or L1 is NOT intercepting PAUSE for the guest.  Per the original
changelog, those were the only scenarios considered to be supported.
Disabling KVM's use of PLE makes it so the VM is always in a "supported"
mode.

Last, but certainly not least, using KVM's count/threshold instead of the
values provided by L1 is a blatant violation of the SVM architecture.

Fixes: 74fd41ed16 ("KVM: x86: nSVM: support PAUSE filtering when L0 doesn't intercept PAUSE")
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://patch.msgid.link/20260508213321.373309-1-seanjc@google.com/
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-05-12 22:17:28 +02:00
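The rule stated in 5bd1ddb791 (vmcb02's PAUSE filter fields come from L1 only, never from L0's vmcb01) might look roughly like the sketch below. All types and names here are invented for illustration; in particular, modelling "L1's capabilities" as two booleans is an assumption, not the kernel's representation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the PAUSE filter count/threshold pair in a VMCB. */
struct ple_ctl {
	uint16_t count;
	uint16_t thresh;
};

/* Assumed model of which PAUSE filter features are exposed to L1. */
struct l1_caps {
	bool pause_filter;
	bool pause_threshold;
};

/*
 * Build the vmcb02 values purely from what L1 asked for (vmcb12) and what
 * L1 is allowed to use.  L0's own values (vmcb01) are deliberately not an
 * input to this function.
 */
static struct ple_ctl vmcb02_ple(const struct ple_ctl *vmcb12,
				 const struct l1_caps *caps)
{
	struct ple_ctl out = { 0, 0 };

	if (caps->pause_filter)
		out.count = vmcb12->count;
	if (caps->pause_threshold)
		out.thresh = vmcb12->thresh;
	return out;
}
```

Keeping vmcb01 out of the signature makes it hard to reintroduce the override that the changelog calls a violation of the SVM architecture.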
Paolo Bonzini
01f217fa8a KVM: x86: use inlines instead of macros for is_sev_*guest
This helps avoid more embarrassment for this maintainer, but also
will catch mistakes more easily for others.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-04-13 19:00:47 +02:00
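Why an inline catches more mistakes than a macro, as 01f217fa8a suggests, can be shown with a toy example. The structures and both helpers below are hypothetical and are not the kernel's definitions; they only illustrate the type-checking difference.

```c
#include <stdbool.h>

/* Toy stand-ins for the real structures. */
struct toy_kvm {
	bool sev_active;
};

struct toy_vcpu {
	struct toy_kvm *kvm;
	bool sev_active;	/* unrelated field that happens to share a name */
};

/*
 * Macro form: compiles for ANY pointer whose target has ->sev_active, so
 * passing a vCPU where a VM was intended goes unnoticed.
 */
#define TOY_IS_SEV_GUEST_MACRO(kvm) ((kvm)->sev_active)

/*
 * Inline form: the parameter type is checked, so passing a
 * struct toy_kvm * here is diagnosed by the compiler.
 */
static inline bool toy_is_sev_guest(const struct toy_vcpu *vcpu)
{
	return vcpu->kvm->sev_active;
}
```

In this sketch `TOY_IS_SEV_GUEST_MACRO(&vcpu)` silently reads the wrong field, while `toy_is_sev_guest(&kvm)` draws an incompatible-pointer diagnostic.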
Paolo Bonzini
92cdeac6a4 KVM SVM changes for 7.1
Merge tag 'kvm-x86-svm-7.1' of https://github.com/kvm-x86/linux into HEAD

KVM SVM changes for 7.1

 - Fix and optimize IRQ window inhibit handling for AVIC (the tracking needs to
   be per-vCPU, e.g. so that KVM doesn't prematurely re-enable AVIC if multiple
   vCPUs have to-be-injected IRQs).

 - Fix an undefined behavior warning where a crafty userspace can read the
   "avic" module param before it's fully initialized.

 - Fix a (likely benign) bug in the "OS-visible workarounds" handling, where
   KVM could clobber state when enabling virtualization on multiple CPUs in
   parallel, and clean up and optimize the code.

 - Drop a WARN in KVM_MEMORY_ENCRYPT_REG_REGION where KVM complains about a
   "too large" size based purely on user input, and clean up and harden the
   related pinning code.

 - Disallow synchronizing a VMSA of an already-launched/encrypted vCPU, as
   doing so for an SNP guest will trigger an RMP violation #PF and crash the
   host.

 - Protect all of sev_mem_enc_register_region() with kvm->lock to ensure
   sev_guest() is stable for the entirety of the function.

 - Lock all vCPUs when synchronizing VMSAs for SNP guests to ensure the VMSA
   page isn't actively being used.

 - Overhaul KVM's APIs for detecting SEV+ guests so that VM-scoped queries are
   required to hold kvm->lock (KVM has had multiple bugs due to "is SEV?" checks
   becoming stale), enforced by lockdep.  Add and use vCPU-scoped APIs when
   possible/appropriate, as all checks that originate from a vCPU are
   guaranteed to be stable.

 - Convert a pile of kvm->lock SEV code to guard().
2026-04-13 19:00:43 +02:00
Paolo Bonzini
4a530993da KVM x86 VMXON and EFER.SVME extraction for 7.1
Merge tag 'kvm-x86-vmxon-7.1' of https://github.com/kvm-x86/linux into HEAD

KVM x86 VMXON and EFER.SVME extraction for 7.1

Move _only_ VMXON+VMXOFF and EFER.SVME toggling (versus all of VMX and SVM
enabling) out of KVM and into the core kernel so that non-KVM TDX enabling,
e.g. for trusted I/O, can make SEAMCALLs without needing to ensure KVM is
fully loaded.

TDX isn't a hypervisor, and isn't trying to be a hypervisor. Specifically, TDX
should _never_ have its own VMCSes (that are visible to the host; the
TDX-Module has its own VMCSes to do SEAMCALL/SEAMRET), and so there is simply
no reason to move that functionality out of KVM.

With that out of the way, dealing with VMXON/VMXOFF and EFER.SVME is a fairly
simple refcounting game.
2026-04-13 13:04:48 +02:00
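The "fairly simple refcounting game" mentioned in 4a530993da can be sketched as follows. This is a single-threaded toy with invented names; the real code presumably has to deal with per-CPU state and concurrency, none of which is modelled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy state: how many users need virtualization enabled, and whether it is. */
struct virt_state {
	unsigned int users;
	bool hw_enabled;	/* stands in for "VMXON done" or "EFER.SVME set" */
};

/* First user turns the hardware feature on. */
static void virt_get(struct virt_state *s)
{
	if (s->users++ == 0)
		s->hw_enabled = true;
}

/* Last user turns it off again. */
static void virt_put(struct virt_state *s)
{
	assert(s->users > 0);
	if (--s->users == 0)
		s->hw_enabled = false;
}
```

Two independent users (say, KVM and a non-KVM TDX consumer) can then call `virt_get()`/`virt_put()` in any order without either one disabling the feature under the other.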
Paolo Bonzini
ea8bc95fbb KVM nested SVM changes for 7.1 (with one common x86 fix)
Merge tag 'kvm-x86-nested-7.1' of https://github.com/kvm-x86/linux into HEAD

KVM nested SVM changes for 7.1 (with one common x86 fix)

 - To minimize the probability of corrupting guest state, defer KVM's
   non-architectural delivery of exception payloads (e.g. CR2 and DR6) until
   consumption of the payload is imminent, and force delivery of the payload
   in all paths where userspace saves relevant state.

 - Use vcpu->arch.cr2 when updating vmcb12's CR2 on nested #VMEXIT to fix a
   bug where L2's CR2 can get corrupted after a save/restore, e.g. if the VM
   is migrated while L2 is faulting in memory.

 - Fix a class of nSVM bugs where some fields written by the CPU are not
   synchronized from vmcb02 to cached vmcb12 after VMRUN, and so are not
   up-to-date when saved by KVM_GET_NESTED_STATE.

 - Fix a class of bugs where the ordering between KVM_SET_NESTED_STATE and
   KVM_SET_{S}REGS could cause vmcb02 to be incorrectly initialized after
   save+restore.

 - Add a variety of missing nSVM consistency checks.

 - Fix several bugs where KVM failed to correctly update VMCB fields on nested
   #VMEXIT.

 - Fix several bugs where KVM failed to correctly synthesize #UD or #GP for
   SVM-related instructions.

 - Add support for save+restore of virtualized LBRs (on SVM).

 - Refactor various helpers and macros to improve clarity and (hopefully) make
   the code easier to maintain.

 - Aggressively sanitize fields when copying from vmcb12 to guard against
   unintentionally allowing L1 to utilize yet-to-be-defined features.

 - Fix several bugs where KVM botched rAX legality checks when emulating SVM
   instructions.  Note, KVM is still flawed in that KVM doesn't address size
   prefix overrides for 64-bit guests; this should probably be documented as a
   KVM erratum.

 - Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails instead of
   somewhat arbitrarily synthesizing #GP (i.e. don't bastardize AMD's already-
   sketchy behavior of generating #GP for "unsupported" addresses).

 - Cache all used vmcb12 fields to further harden against TOCTOU bugs.
2026-04-13 13:01:50 +02:00
Paolo Bonzini
aa856775be KVM x86 emulated MMIO changes for 7.1
Merge tag 'kvm-x86-mmio-7.1' of https://github.com/kvm-x86/linux into HEAD

KVM x86 emulated MMIO changes for 7.1

Copy single-chunk MMIO write values into a persistent (per-fragment) field to
fix use-after-free stack bugs due to KVM dereferencing a stack pointer after an
exit to userspace.

Clean up and comment the emulated MMIO code to try to make it easier to
maintain (not necessarily "easy", but "easier").
2026-04-13 12:49:14 +02:00
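The MMIO fix summarized in aa856775be (copy the write value into a persistent per-fragment field rather than keeping a pointer to the caller's stack) can be illustrated with a toy fragment structure. The struct layout, the 8-byte limit, and the function name are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy MMIO fragment that owns a copy of the value being written. */
struct mmio_fragment {
	uint64_t gpa;
	unsigned int len;
	uint8_t val[8];		/* persistent copy, valid after the caller returns */
};

/*
 * Record a pending MMIO write.  The data is copied, so the fragment stays
 * valid even if 'src' pointed at a stack buffer that has since gone away.
 */
static void mmio_queue_write(struct mmio_fragment *frag, uint64_t gpa,
			     const void *src, unsigned int len)
{
	assert(len <= sizeof(frag->val));
	frag->gpa = gpa;
	frag->len = len;
	memcpy(frag->val, src, len);
}
```

Storing `src` itself in the fragment would reproduce the bug class: the pointer would dangle once the function that owned the buffer returned to handle the exit to userspace.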
Paolo Bonzini
276f81a491 KVM x86 misc changes for 7.1
Merge tag 'kvm-x86-misc-7.1' of https://github.com/kvm-x86/linux into HEAD

KVM x86 misc changes for 7.1

 - Advertise support for AVX512 Bit Matrix Multiply (BMM) when it's present in
   hardware (no additional emulation/virtualization required).

 - Immediately fail the build if a required #define is missing in one of KVM's
   headers that is included multiple times.

 - Reject SET_GUEST_DEBUG with -EBUSY if there's an already injected exception,
   mostly to prevent syzkaller from abusing the uAPI to trigger WARNs, but also
   because it can help prevent userspace from unintentionally crashing the VM.

 - Exempt SMM from CPUID faulting on Intel, as per the spec.

 - Misc hardening and cleanup changes.
2026-04-13 11:51:34 +02:00
Sean Christopherson
bc0932cf9b KVM: SEV: Goto an existing error label if charging misc_cg for an ASID fails
Dedup a small amount of cleanup code in SEV ASID allocation by reusing
an existing error label.

No functional change intended.

Link: https://patch.msgid.link/20260310234829.2608037-22-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-09 12:00:24 -07:00
Carlos López
1d353dae3d KVM: SVM: Move lock-protected allocation of SEV ASID into a separate helper
Extract the lock-protected parts of SEV ASID allocation into a new helper
and opportunistically convert it to use guard() when acquiring the mutex.

Preserve the goto even though it's a little odd, as there's a fair
amount of subtlety that makes it surprisingly difficult to replicate the
functionality with a loop construct, and arguably using goto yields the
most readable code.

No functional change intended.

Signed-off-by: Carlos López <clopez@suse.de>
[sean: move code to separate helper, rework shortlog+changelog]
Link: https://patch.msgid.link/20260310234829.2608037-21-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-09 12:00:23 -07:00
Carlos López
f09b7f4af9 KVM: SEV: use mutex guard in snp_handle_guest_req()
Simplify the error paths in snp_handle_guest_req() by using a mutex
guard, allowing early return instead of using gotos.

Signed-off-by: Carlos López <clopez@suse.de>
Link: https://patch.msgid.link/20260120201013.3931334-8-clopez@suse.de
Link: https://patch.msgid.link/20260310234829.2608037-20-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-09 12:00:22 -07:00
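The "mutex guard, early return instead of gotos" pattern used in f09b7f4af9 and the neighbouring conversions can be approximated in userspace with the GCC/Clang `cleanup` attribute. This sketch uses a toy lock and a hypothetical `TOY_GUARD()` macro; it is not the kernel's guard() implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock: just tracks whether it is held. */
struct toy_lock {
	bool held;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = true;
}

/* Cleanup handlers receive a pointer to the guarded variable. */
static void toy_lock_release(struct toy_lock **lp)
{
	(*lp)->held = false;
}

/* Acquire now; release automatically when the enclosing scope is left. */
#define TOY_GUARD(lock)							\
	struct toy_lock *toy_guard_					\
		__attribute__((cleanup(toy_lock_release), unused)) =	\
		(toy_lock_acquire(lock), (lock))

static int handle_request(struct toy_lock *lock, int arg)
{
	TOY_GUARD(lock);

	if (arg < 0)
		return -1;	/* early return, lock dropped by the cleanup */

	return arg * 2;		/* normal return, lock dropped the same way */
}
```

Every return path releases the lock without an explicit unlock label, which is what lets the error paths collapse into plain early returns.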
Carlos López
84841f3941 KVM: SEV: use mutex guard in sev_mem_enc_unregister_region()
Simplify the error paths in sev_mem_enc_unregister_region() by using a
mutex guard, allowing early return instead of using gotos.

Signed-off-by: Carlos López <clopez@suse.de>
Link: https://patch.msgid.link/20260120201013.3931334-7-clopez@suse.de
Link: https://patch.msgid.link/20260310234829.2608037-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-09 12:00:22 -07:00
Carlos López
63e56d8425 KVM: SEV: use mutex guard in sev_mem_enc_ioctl()
Simplify the error paths in sev_mem_enc_ioctl() by using a mutex guard,
allowing early return instead of using gotos.

Signed-off-by: Carlos López <clopez@suse.de>
Link: https://patch.msgid.link/20260120201013.3931334-5-clopez@suse.de
Link: https://patch.msgid.link/20260310234829.2608037-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-09 12:00:21 -07:00
Carlos López
04d77ded64 KVM: SEV: use mutex guard in snp_launch_update()
Simplify the error paths in snp_launch_update() by using a mutex guard,
allowing early return instead of using gotos.

Signed-off-by: Carlos López <clopez@suse.de>
Link: https://patch.msgid.link/20260120201013.3931334-4-clopez@suse.de
Link: https://patch.msgid.link/20260310234829.2608037-17-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-09 12:00:20 -07:00
Sean Christopherson
ba903f7382 KVM: SEV: Assert that kvm->lock is held when querying SEV+ support
Assert that kvm->lock is held when checking if a VM is an SEV+ VM, as KVM
sets *and* resets the relevant flags when initializing SEV state, i.e.
it's extremely easy to end up with TOCTOU bugs if kvm->lock isn't held.

Add waivers for a VM being torn down (refcount is '0') and for there being
a loaded vCPU, with comments for both explaining why they're safe.

Note, the "vCPU loaded" waiver is necessary to avoid splats on the SNP
checks in sev_gmem_prepare() and sev_gmem_max_mapping_level(), which are
currently called when handling nested page faults.  Alternatively, those
checks could key off KVM_X86_SNP_VM, as kvm_arch.vm_type is stable early
in VM creation.  Prioritize consistency, at least for now, and leave a
"reminder" that the max mapping level code in particular likely needs
special attention if/when KVM supports dirty logging for SNP guests.

Link: https://patch.msgid.link/20260310234829.2608037-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-09 12:00:20 -07:00
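The assertion with waivers described in ba903f7382 can be modelled in a few lines. Everything below is a toy: the struct, the boolean lock, and both function names are hypothetical, and `assert()` merely stands in for a lockdep-style check.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy VM: a boolean models kvm->lock, 'users' models the VM refcount. */
struct toy_kvm {
	bool lock_held;
	unsigned int users;
	bool sev_active;
};

/* "Unsafe" accessor: no checking, for callers known to be safe. */
static bool toy_unsafe_sev_guest(const struct toy_kvm *kvm)
{
	return kvm->sev_active;
}

/*
 * Checked accessor: the result is only trusted if the lock is held, the VM
 * is being torn down (refcount 0), or the query comes from a loaded vCPU.
 */
static bool toy_sev_guest(const struct toy_kvm *kvm, bool vcpu_loaded)
{
	assert(kvm->lock_held || kvm->users == 0 || vcpu_loaded);
	return toy_unsafe_sev_guest(kvm);
}
```

A caller that satisfies none of the three conditions trips the assertion instead of silently acting on a value that may change underneath it.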
Sean Christopherson
2f34d421e8 KVM: SEV: Document that checking for SEV+ guests when reclaiming memory is "safe"
Document that the check for an SEV+ guest when reclaiming guest memory is
safe even though kvm->lock isn't held.  This will allow asserting that
kvm->lock is held in the SEV accessors, without triggering false positives
on the "safe" cases.

No functional change intended.

Link: https://patch.msgid.link/20260310234829.2608037-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-09 12:00:19 -07:00
Sean Christopherson
85d2243a21 KVM: SEV: Hide "struct kvm_sev_info" behind CONFIG_KVM_AMD_SEV=y
Bury "struct kvm_sev_info" behind CONFIG_KVM_AMD_SEV=y to make it harder
for SEV specific code to sneak into common SVM code.

No functional change intended.

Link: https://patch.msgid.link/20260310234829.2608037-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-09 12:00:18 -07:00
Sean Christopherson
4f67cf7e7e KVM: SEV: WARN on unhandled VM type when initializing VM
WARN if KVM encounters an unhandled VM type when setting up flags for SEV+
VMs, e.g. to guard against adding a new flavor of SEV without adding proper
recognition in sev_vm_init().

Practically speaking, no functional change intended (the new "default" case
should be unreachable).

Link: https://patch.msgid.link/20260310234829.2608037-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-09 12:00:17 -07:00
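The defensive "default" case added in 4f67cf7e7e could be pictured like this. The enum values, the flag struct, and the cumulative way flags are set are assumptions for illustration, and a message on stderr stands in for the kernel's WARN.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical VM types; not the kernel's constants. */
enum toy_vm_type { TOY_DEFAULT_VM, TOY_SEV_VM, TOY_SEV_ES_VM, TOY_SNP_VM };

struct toy_sev_flags {
	bool active;
	bool es_active;
	bool snp_active;
};

/* Returns false (and complains) if the VM type is not recognized. */
static bool toy_sev_vm_init(int vm_type, struct toy_sev_flags *f)
{
	*f = (struct toy_sev_flags){ false, false, false };

	switch (vm_type) {
	case TOY_DEFAULT_VM:
		return true;
	case TOY_SNP_VM:
		f->snp_active = true;
		/* fall through */
	case TOY_SEV_ES_VM:
		f->es_active = true;
		/* fall through */
	case TOY_SEV_VM:
		f->active = true;
		return true;
	default:
		/* A new flavor was added without teaching this function. */
		fprintf(stderr, "unhandled VM type %d\n", vm_type);
		return false;
	}
}
```

The default branch should be unreachable today; its value is that a future VM type fails loudly here rather than running with all flags clear.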
Sean Christopherson
e353f1beed KVM: SEV: Move SEV-specific VM initialization to sev.c
Move SEV+ VM initialization to sev.c (as sev_vm_init()) so that
kvm_sev_info (and all usage) can be gated on CONFIG_KVM_AMD_SEV=y without
needing more #ifdefs.  As a bonus, isolating the logic will make it easier
to harden the flow, e.g. to WARN if the vm_type is unknown.

No functional change intended (SEV, SEV_ES, and SNP VM types are only
supported if CONFIG_KVM_AMD_SEV=y).

Link: https://patch.msgid.link/20260310234829.2608037-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-08 16:04:27 -07:00
Sean Christopherson
7341500f8b KVM: SEV: Move standard VM-scoped helpers to detect SEV+ guests to sev.c
Now that all external usage of the VM-scoped APIs to detect SEV+ guests is
gone, drop the stubs provided for CONFIG_KVM_AMD_SEV=n builds and bury the
"standard" APIs in sev.c.

No functional change intended.

Link: https://patch.msgid.link/20260310234829.2608037-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-08 16:04:26 -07:00
Sean Christopherson
56906910ea KVM: SEV: Document the SEV-ES check when querying SMM support as "safe"
Use the "unsafe" API to check for an SEV-ES+ guest when determining whether
or not SMBASE is a supported MSR, i.e. whether or not emulated SMM is
supported.  This will eventually allow adding lockdep assertions to the
APIs for detecting SEV+ VMs without triggering "real" false positives.

While svm_has_emulated_msr() doesn't hold kvm->lock, i.e. can get both
false positives *and* false negatives, both are completely fine, as the
only time the result isn't stable is when userspace is the sole consumer
of the result.  I.e. userspace can confuse itself, but that's it.

No functional change intended.

Link: https://patch.msgid.link/20260310234829.2608037-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-08 16:04:25 -07:00
Sean Christopherson
138e5f6a3e KVM: SEV: Add quad-underscore version of VM-scoped APIs to detect SEV+ guests
Add "unsafe" quad-underscore versions of the SEV+ guest detectors in
anticipation of hardening the APIs via lockdep assertions.  This will allow
adding exceptions for usage that is known to be safe in advance of the
lockdep assertions.

Use a pile of underscores to try and communicate that use of the "unsafe"
versions shouldn't be done lightly.

No functional change intended.

Link: https://patch.msgid.link/20260310234829.2608037-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-08 16:04:25 -07:00
Sean Christopherson
5bf92e4753 KVM: SEV: Provide vCPU-scoped accessors for detecting SEV+ guests
Provide vCPU-scoped accessors for detecting if the vCPU belongs to an SEV,
SEV-ES, or SEV-SNP VM, partly to dedup a small amount of code, but mostly
to better document which usages are "safe".  Generally speaking, using the
VM-scoped sev_guest() and friends outside of kvm->lock is unsafe, as they
can get both false positives and false negatives.

But for vCPUs, the accessors are guaranteed to provide a stable result as
KVM disallows initializing SEV+ state after vCPUs are created.  I.e.
operating on a vCPU guarantees the VM can't "become" an SEV+ VM, and that
it can't revert back to a "normal" VM.

This will also allow dropping the stubs for the VM-scoped accessors, as
it's relatively easy to eliminate usage of the accessors from common SVM
once the vCPU-scoped checks are out of the way.

No functional change intended.

Link: https://patch.msgid.link/20260310234829.2608037-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-08 16:04:24 -07:00
Sean Christopherson
8075360f3b KVM: SEV: Lock all vCPUs for the duration of SEV-ES VMSA synchronization
Lock and unlock all vCPUs in a single batch when synchronizing SEV-ES VMSAs
during launch finish, partly to dedup the code by a tiny amount, but mostly
so that sev_launch_update_vmsa() uses the same logic/flow as all other SEV
ioctls that lock all vCPUs.

Link: https://patch.msgid.link/20260310234829.2608037-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-08 16:04:23 -07:00
Sean Christopherson
cb923ee6a8 KVM: SEV: Lock all vCPUs when synchronizing VMSAs for SNP launch finish
Lock all vCPUs when synchronizing and encrypting VMSAs for SNP guests, as
allowing userspace to manipulate and/or run a vCPU while its state is being
synchronized would at best corrupt vCPU state, and at worst crash the host
kernel.

Opportunistically assert that vcpu->mutex is held when synchronizing its
VMSA (the SEV-ES path already locks vCPUs).

Fixes: ad27ce1555 ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_FINISH command")
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260310234829.2608037-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-08 16:04:19 -07:00
Yosry Ahmed
2daf71bfd7 KVM: nSVM: Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails
KVM currently injects a #GP if mapping vmcb12 fails when emulating
VMRUN/VMLOAD/VMSAVE. This is not architectural behavior, as #GP should
only be injected if the physical address is not supported or not
aligned. Instead, handle it as an emulation failure, similar to how nVMX
handles failures to read/write guest memory in several emulation paths.

When virtual VMLOAD/VMSAVE is enabled, if vmcb12's GPA is not mapped in
the NPTs a VMEXIT(#NPF) will be generated, and KVM will install an MMIO
SPTE and emulate the instruction if there is no corresponding memslot.
x86_emulate_insn() will return EMULATION_FAILED as VMLOAD/VMSAVE are not
handled as part of the twobyte_insn cases.

Even though this will also result in an emulation failure, it will only
result in a straight return to userspace if
KVM_CAP_EXIT_ON_EMULATION_FAILURE is set. Otherwise, it would inject #UD
and only exit to userspace if not in guest mode. So the behavior is
slightly different if virtual VMLOAD/VMSAVE is enabled.

Fixes: 3d6368ef58 ("KVM: SVM: Add VMRUN handler")
Reported-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260316202732.3164936-8-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-03 16:08:04 -07:00
Yosry Ahmed
878b8efa2a KVM: SVM: Treat mapping failures equally in VMLOAD/VMSAVE emulation
Currently, a #GP is only injected if kvm_vcpu_map() fails with -EINVAL.
But it could also fail with -EFAULT if creating a host mapping failed.
Inject a #GP in all cases, no reason to treat failure modes differently.

Similar to commit 01ddcdc55e ("KVM: nSVM: Always inject a #GP if
mapping VMCB12 fails on nested VMRUN"), treat all failures equally.

Fixes: 8c5fbf1a72 ("KVM/nSVM: Use the new mapping API for mapping guest memory")
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260316202732.3164936-7-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-03 16:08:03 -07:00
Yosry Ahmed
783cf7d01f KVM: SVM: Check EFER.SVME and CPL on #GP intercept of SVM instructions
When KVM intercepts #GP on an SVM instruction from L2, it checks the
legality of RAX, and injects a #GP if RAX is illegal, or otherwise
synthesizes a #VMEXIT to L1. However, checking EFER.SVME and CPL takes
precedence over both the RAX check and the intercept. Call
nested_svm_check_permissions() first to cover both.
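
The ordering described above can be sketched as a small standalone model.
This is an illustration only, not the KVM code: the struct, enum, and
function names are hypothetical, and the #UD-for-EFER.SVME=0 and
#GP-for-CPL>0 outcomes are assumptions about what the permission check
injects.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes, for illustration only. */
enum sketch_outcome {
	SKETCH_INJECT_UD,	/* assumed result when EFER.SVME is clear */
	SKETCH_INJECT_GP,	/* CPL > 0, or RAX is illegal */
	SKETCH_VMEXIT_TO_L1,	/* synthesize a #VMEXIT to L1 */
};

/* Hypothetical, minimal model of the vCPU state that matters here. */
struct sketch_vcpu {
	bool efer_svme;
	unsigned int cpl;
	bool rax_legal;
};

/* #GP intercepted on an SVM instruction executed by L2. */
enum sketch_outcome sketch_gp_on_svm_insn_from_l2(const struct sketch_vcpu *v)
{
	/* Permission checks come first... */
	if (!v->efer_svme)
		return SKETCH_INJECT_UD;
	if (v->cpl != 0)
		return SKETCH_INJECT_GP;

	/* ...then the RAX check, which still precedes the intercept. */
	if (!v->rax_legal)
		return SKETCH_INJECT_GP;

	return SKETCH_VMEXIT_TO_L1;
}
```

The point of the sketch is only the order of the three checks; the real
handler has more state to consider.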

Note that if #GP is intercepted on an SVM instruction in L1, the intercept
handlers of VMRUN/VMLOAD/VMSAVE already perform these checks.

Note #2, if KVM does not intercept #GP, the check for EFER.SVME is not
done in the correct order, because KVM handles it by intercepting the
instructions when EFER.SVME=0 and injecting #UD.  However, a #GP
injected by hardware would happen before the instruction intercept,
leading to #GP taking precedence over #UD from the guest's perspective.
Opportunistically add a FIXME for this.

Fixes: 82a11e9c6f ("KVM: SVM: Add emulation support for #GP triggered by SVM instructions")
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260316202732.3164936-6-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-03 16:08:03 -07:00
Yosry Ahmed
d2fbeb61e1 KVM: SVM: Move RAX legality check to SVM insn interception handlers
When #GP is intercepted by KVM, the #GP interception handler checks
whether the GPA in RAX is legal and reinjects the #GP accordingly.
Otherwise, it calls into the appropriate interception handler for
VMRUN/VMLOAD/VMSAVE. The intercept handlers do not check RAX.

However, the intercept handlers need to do the RAX check, because if the
guest has a smaller MAXPHYADDR, RAX could be legal from the hardware
perspective (i.e. CPU does not inject #GP), but not from the vCPU's
perspective. Note that with allow_smaller_maxphyaddr, both NPT and VLS
cannot be used, so VMLOAD/VMSAVE have to be intercepted, and RAX can
always be checked against the vCPU's MAXPHYADDR.

Move the check into the interception handlers for VMRUN/VMLOAD/VMSAVE as
the CPU does not check RAX before the interception. Read RAX using
kvm_register_read() to avoid a false negative on page_address_valid() on
32-bit due to garbage in the higher bits.

Keep the check in the #GP intercept handler in the nested case where
a #VMEXIT is synthesized into L1, as the RAX check is still needed there
and takes precedence over the intercept.

Opportunistically add a FIXME about the #VMEXIT being synthesized into
L1, as it needs to be conditional.

Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260316202732.3164936-5-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-03 16:08:02 -07:00
Yosry Ahmed
435741a4e7 KVM: SVM: Properly check RAX on #GP intercept of SVM instructions
When KVM intercepts #GP on an SVM instruction, it re-injects the #GP if
the instruction was executed with a misaligned RAX. However, a #GP
should also be reinjected if RAX contains an illegal GPA. According to
the APM, one of the #GP conditions is:

  rAX referenced a physical address above the maximum
  supported physical address.

Replace the PAGE_MASK check with page_address_valid(), which checks both
page-alignment as well as the legality of the GPA based on the vCPU's
MAXPHYADDR. Use kvm_register_read() to read RAX so that bits 63:32 are
dropped when the vCPU is in 32-bit mode, i.e. to avoid a false positive
when checking the validity of the address.
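
As a rough standalone sketch of the combined check (not the kernel's
page_address_valid() or kvm_register_read(); both helpers below are
hypothetical stand-ins, and a 4KiB page size is assumed):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL	/* assumption: 4KiB pages */

/* Hypothetical stand-in: page-aligned AND below the vCPU's MAXPHYADDR. */
bool sketch_page_address_valid(uint64_t gpa, unsigned int maxphyaddr)
{
	uint64_t legal_mask = maxphyaddr >= 64 ? ~0ULL
					       : (1ULL << maxphyaddr) - 1;

	if (gpa & (SKETCH_PAGE_SIZE - 1))
		return false;

	return (gpa & ~legal_mask) == 0;
}

/* Hypothetical stand-in: drop bits 63:32 when not in 64-bit mode. */
uint64_t sketch_read_rax(uint64_t raw_rax, bool is_64bit_mode)
{
	return is_64bit_mode ? raw_rax : (uint32_t)raw_rax;
}
```

Without the truncation, stale upper bits in RAX would make an otherwise
legal 32-bit address look like it exceeds MAXPHYADDR.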

Note that this is currently only a problem if KVM is running an L2 guest
and ends up synthesizing a #VMEXIT to L1, as the RAX check takes
precedence over the intercept. Otherwise, if KVM emulates the
instruction, kvm_vcpu_map() should fail on illegal GPAs and inject a #GP
anyway. However, following patches will change the failure behavior of
kvm_vcpu_map(), so make sure the #GP interception handler does this
appropriately.

Opportunistically drop a teaser FIXME about the SVM instructions
handling on #GP belonging in the emulator.

Fixes: 82a11e9c6f ("KVM: SVM: Add emulation support for #GP triggered by SVM instructions")
Fixes: d1cba6c922 ("KVM: x86: nSVM: test eax for 4K alignment for GP errata workaround")
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260316202732.3164936-4-yosry@kernel.org
[sean: massage wording with respect to kvm_register_read()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-03 16:08:01 -07:00
Yosry Ahmed
27f70eaa86 KVM: SVM: Refactor SVM instruction handling on #GP intercept
Instead of returning an opcode from svm_instr_opcode() and then passing
it to emulate_svm_instr(), which uses it to find the corresponding exit
code and intercept handler, return the exit code directly from
svm_instr_opcode(), and rename it to svm_get_decoded_instr_exit_code().

emulate_svm_instr() boils down to synthesizing a #VMEXIT or calling the
intercept handler, so open-code it in gp_interception(), and use
svm_invoke_exit_handler() to call the intercept handler based on
the exit code. This allows for dropping the SVM_INSTR_* enum, and the
const array mapping its values to exit codes and intercept handlers.
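
A minimal sketch of the "return the exit code directly" idea, written as
a standalone function rather than the actual
svm_get_decoded_instr_exit_code(). The function name, the byte-buffer
interface, and the "no match" sentinel are hypothetical; the opcode bytes
and exit code values are taken from the AMD manual's encodings for
VMRUN/VMLOAD/VMSAVE and should be checked against it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_EXIT_VMRUN	0x080u
#define SKETCH_EXIT_VMLOAD	0x082u
#define SKETCH_EXIT_VMSAVE	0x083u
#define SKETCH_EXIT_NONE	UINT32_MAX	/* hypothetical "not an SVM insn" */

/* Map the decoded instruction bytes straight to an exit code. */
uint32_t sketch_decoded_instr_exit_code(const uint8_t *insn, size_t len)
{
	/* All three instructions are encoded as 0F 01 /modrm. */
	if (len < 3 || insn[0] != 0x0f || insn[1] != 0x01)
		return SKETCH_EXIT_NONE;

	switch (insn[2]) {
	case 0xd8:
		return SKETCH_EXIT_VMRUN;
	case 0xda:
		return SKETCH_EXIT_VMLOAD;
	case 0xdb:
		return SKETCH_EXIT_VMSAVE;
	default:
		return SKETCH_EXIT_NONE;
	}
}
```

With the exit code in hand, the caller can dispatch through the normal
exit-handler table instead of a second, instruction-specific table.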

In gp_interception(), handle SVM instructions first with an early return,
and invert is_guest_mode() checks, un-indenting the rest of the code.

No functional change intended.

Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260316202732.3164936-3-yosry@kernel.org
[sean: add BUILD_BUG_ON(), tweak formatting/naming]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-03 16:08:01 -07:00
Sean Christopherson
624bf3440d KVM: SEV: Disallow LAUNCH_FINISH if vCPUs are actively being created
Reject LAUNCH_FINISH for SEV-ES and SNP VMs if KVM is actively creating
one or more vCPUs, as KVM needs to process and encrypt each vCPU's VMSA.
Letting userspace create vCPUs while LAUNCH_FINISH is in-progress is
"fine", at least in the current code base, as kvm_for_each_vcpu() operates
on online_vcpus, LAUNCH_FINISH (all SEV+ sub-ioctls) holds kvm->mutex, and
fully onlining a vCPU in kvm_vm_ioctl_create_vcpu() is done under
kvm->mutex.  I.e. there's no difference between an in-progress vCPU and a
vCPU that is created entirely after LAUNCH_FINISH.

However, given that concurrent LAUNCH_FINISH and vCPU creation can't
possibly work (for any reasonable definition of "work"), since userspace
can't guarantee whether a particular vCPU will be encrypted or not,
disallow the combination as a hardening measure, to reduce the probability
of introducing bugs in the future, and to avoid having to reason about the
safety of future changes related to LAUNCH_FINISH.

Cc: Jethro Beekman <jethro@fortanix.com>
Closes: https://lore.kernel.org/all/b31f7c6e-2807-4662-bcdd-eea2c1e132fa@fortanix.com
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260310234829.2608037-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-03 09:37:36 -07:00
Sean Christopherson
b6408b6cec KVM: SEV: Protect *all* of sev_mem_enc_register_region() with kvm->lock
Take and hold kvm->lock before checking sev_guest() in
sev_mem_enc_register_region(), as sev_guest() isn't stable unless kvm->lock
is held (or KVM can guarantee KVM_SEV_INIT{2} has completed and can't
roll back state).  If KVM_SEV_INIT{2} fails, KVM can end up trying to add to
a not-yet-initialized sev->regions_list, e.g. triggering a #GP:

  Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN NOPTI
  KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
  CPU: 110 UID: 0 PID: 72717 Comm: syz.15.11462 Tainted: G     U  W  O        6.16.0-smp-DEV #1 NONE
  Tainted: [U]=USER, [W]=WARN, [O]=OOT_MODULE
  Hardware name: Google, Inc. Arcadia_IT_80/Arcadia_IT_80, BIOS 12.52.0-0 10/28/2024
  RIP: 0010:sev_mem_enc_register_region+0x3f0/0x4f0 ../include/linux/list.h:83
  Code: <41> 80 3c 04 00 74 08 4c 89 ff e8 f1 c7 a2 00 49 39 ed 0f 84 c6 00
  RSP: 0018:ffff88838647fbb8 EFLAGS: 00010256
  RAX: dffffc0000000000 RBX: 1ffff92015cf1e0b RCX: dffffc0000000000
  RDX: 0000000000000000 RSI: 0000000000001000 RDI: ffff888367870000
  RBP: ffffc900ae78f050 R08: ffffea000d9e0007 R09: 1ffffd4001b3c000
  R10: dffffc0000000000 R11: fffff94001b3c001 R12: 0000000000000000
  R13: ffff8982ab0bde00 R14: ffffc900ae78f058 R15: 0000000000000000
  FS:  00007f34e9dc66c0(0000) GS:ffff89ee64d33000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fe180adef98 CR3: 000000047210e000 CR4: 0000000000350ef0
  Call Trace:
   <TASK>
   kvm_arch_vm_ioctl+0xa72/0x1240 ../arch/x86/kvm/x86.c:7371
   kvm_vm_ioctl+0x649/0x990 ../virt/kvm/kvm_main.c:5363
   __se_sys_ioctl+0x101/0x170 ../fs/ioctl.c:51
   do_syscall_x64 ../arch/x86/entry/syscall_64.c:63 [inline]
   do_syscall_64+0x6f/0x1f0 ../arch/x86/entry/syscall_64.c:94
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7f34e9f7e9a9
  Code: <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
  RSP: 002b:00007f34e9dc6038 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  RAX: ffffffffffffffda RBX: 00007f34ea1a6080 RCX: 00007f34e9f7e9a9
  RDX: 0000200000000280 RSI: 000000008010aebb RDI: 0000000000000007
  RBP: 00007f34ea000d69 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
  R13: 0000000000000000 R14: 00007f34ea1a6080 R15: 00007ffce77197a8
   </TASK>

with a syzlang reproducer that looks like:

  syz_kvm_add_vcpu$x86(0x0, &(0x7f0000000040)={0x0, &(0x7f0000000180)=ANY=[], 0x70}) (async)
  syz_kvm_add_vcpu$x86(0x0, &(0x7f0000000080)={0x0, &(0x7f0000000180)=ANY=[@ANYBLOB="..."], 0x4f}) (async)
  r0 = openat$kvm(0xffffffffffffff9c, &(0x7f0000000200), 0x0, 0x0)
  r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
  r2 = openat$kvm(0xffffffffffffff9c, &(0x7f0000000240), 0x0, 0x0)
  r3 = ioctl$KVM_CREATE_VM(r2, 0xae01, 0x0)
  ioctl$KVM_SET_CLOCK(r3, 0xc008aeba, &(0x7f0000000040)={0x1, 0x8, 0x0, 0x5625e9b0}) (async)
  ioctl$KVM_SET_PIT2(r3, 0x8010aebb, &(0x7f0000000280)={[...], 0x5}) (async)
  ioctl$KVM_SET_PIT2(r1, 0x4070aea0, 0x0) (async)
  r4 = ioctl$KVM_CREATE_VM(0xffffffffffffffff, 0xae01, 0x0)
  openat$kvm(0xffffffffffffff9c, 0x0, 0x0, 0x0) (async)
  ioctl$KVM_SET_USER_MEMORY_REGION(r4, 0x4020ae46, &(0x7f0000000400)={0x0, 0x0, 0x0, 0x2000, &(0x7f0000001000/0x2000)=nil}) (async)
  r5 = ioctl$KVM_CREATE_VCPU(r4, 0xae41, 0x2)
  close(r0) (async)
  openat$kvm(0xffffffffffffff9c, &(0x7f0000000000), 0x8000, 0x0) (async)
  ioctl$KVM_SET_GUEST_DEBUG(r5, 0x4048ae9b, &(0x7f0000000300)={0x4376ea830d46549b, 0x0, [0x46, 0x0, 0x0, 0x0, 0x0, 0x1000]}) (async)
  ioctl$KVM_RUN(r5, 0xae80, 0x0)

Opportunistically use guard() to avoid having to define a new error label
and goto usage.
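
For readers unfamiliar with scope-based locking, the pattern can be
approximated in plain C with the GCC/Clang cleanup attribute.  This is a
userspace sketch, not the kernel's guard() implementation: the lock type,
the SKETCH_GUARD() macro, the VM struct, and the -ENOTTY return value are
all hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical lock that just counts how deeply it is held. */
struct sketch_lock {
	int depth;
};

static void sketch_lock_acquire(struct sketch_lock *l)
{
	l->depth++;
}

/* Cleanup handlers receive a pointer to the annotated variable. */
static void sketch_lock_release(struct sketch_lock **l)
{
	(*l)->depth--;
}

/* Take the lock now, drop it automatically when the scope is left. */
#define SKETCH_GUARD(lockp)						\
	struct sketch_lock *sketch_guard_				\
		__attribute__((cleanup(sketch_lock_release))) = (lockp); \
	sketch_lock_acquire(sketch_guard_)

struct sketch_vm {
	struct sketch_lock lock;
	bool sev_active;
	int nr_regions;
};

int sketch_register_region(struct sketch_vm *vm)
{
	SKETCH_GUARD(&vm->lock);

	/* Checked under the lock; no unlock label needed on this path. */
	if (!vm->sev_active)
		return -ENOTTY;

	vm->nr_regions++;
	return 0;
}
```

Every return path releases the lock, which is why the early exit needs
neither an error label nor a goto.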

Fixes: 1e80fdc09d ("KVM: SVM: Pin guest memory when SEV is active")
Cc: stable@vger.kernel.org
Reported-by: Alexander Potapenko <glider@google.com>
Tested-by: Alexander Potapenko <glider@google.com>
Link: https://patch.msgid.link/20260310234829.2608037-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-03 09:37:35 -07:00
Sean Christopherson
9b9f7962e3 KVM: SEV: Reject attempts to sync VMSA of an already-launched/encrypted vCPU
Reject synchronizing vCPU state to its associated VMSA if the vCPU has
already been launched, i.e. if the VMSA has already been encrypted.  On a
host with SNP enabled, accessing guest-private memory generates an RMP #PF
and panics the host.

  BUG: unable to handle page fault for address: ff1276cbfdf36000
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x80000003) - RMP violation
  PGD 5a31801067 P4D 5a31802067 PUD 40ccfb5063 PMD 40e5954063 PTE 80000040fdf36163
  SEV-SNP: PFN 0x40fdf36, RMP entry: [0x6010fffffffff001 - 0x000000000000001f]
  Oops: Oops: 0003 [#1] SMP NOPTI
  CPU: 33 UID: 0 PID: 996180 Comm: qemu-system-x86 Tainted: G           OE
  Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
  Hardware name: Dell Inc. PowerEdge R7625/0H1TJT, BIOS 1.5.8 07/21/2023
  RIP: 0010:sev_es_sync_vmsa+0x54/0x4c0 [kvm_amd]
  Call Trace:
   <TASK>
   snp_launch_update_vmsa+0x19d/0x290 [kvm_amd]
   snp_launch_finish+0xb6/0x380 [kvm_amd]
   sev_mem_enc_ioctl+0x14e/0x720 [kvm_amd]
   kvm_arch_vm_ioctl+0x837/0xcf0 [kvm]
   kvm_vm_ioctl+0x3fd/0xcc0 [kvm]
   __x64_sys_ioctl+0xa3/0x100
   x64_sys_call+0xfe0/0x2350
   do_syscall_64+0x81/0x10f0
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7ffff673287d
   </TASK>

Note, the KVM flaw has been present since commit ad73109ae7 ("KVM: SVM:
Provide support to launch and run an SEV-ES guest"), but has only been
actively dangerous for the host since SNP support was added.  With SEV-ES,
KVM would "just" clobber guest state, which is totally fine from a host
kernel perspective since userspace can clobber guest state any time before
sev_launch_update_vmsa().

Fixes: ad27ce1555 ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_FINISH command")
Reported-by: Jethro Beekman <jethro@fortanix.com>
Closes: https://lore.kernel.org/all/d98692e2-d96b-4c36-8089-4bc1e5cc3d57@fortanix.com
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260310234829.2608037-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-03 09:37:35 -07:00
Sean Christopherson
a7f53694d5 KVM: SEV: Use kvzalloc_objs() when pinning userpages
Use kvzalloc_objs() instead of sev_pin_memory()'s open coded (rough)
equivalent to harden the code and

Note!  This sanity check in __kvmalloc_node_noprof()

  /* Don't even allow crazy sizes */
  if (unlikely(size > INT_MAX)) {
          WARN_ON_ONCE(!(flags & __GFP_NOWARN));
          return NULL;
  }

will artificially limit the maximum size of any single pinned region to
just under 1TiB.  While there do appear to be providers that support SEV
VMs with more than 1TiB of _total_ memory, it's unlikely any KVM-based
providers pin 1TiB in a single request.
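
The "just under 1TiB" figure can be reproduced with a few lines of
arithmetic, assuming (these are assumptions, not stated above) that the
allocation is an array of 8-byte page pointers, one per 4KiB page, and
that INT_MAX is 2^31 - 1:

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE	4096ULL	/* assumption: 4KiB pages */
#define SKETCH_PTR_SIZE		8ULL	/* assumption: 64-bit pointers */

/* Largest region whose pointer array still fits in INT_MAX bytes. */
uint64_t sketch_max_pinned_bytes(void)
{
	uint64_t max_entries = (uint64_t)INT_MAX / SKETCH_PTR_SIZE;

	return max_entries * SKETCH_PAGE_SIZE;
}
```

Under those assumptions the result is one page short of 2^40 bytes.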

Allocate with NOWARN so that fuzzers can't trip the WARN_ON_ONCE() when
they inevitably run on systems with copious amounts of RAM, i.e. when they
can get by KVM's "total_npages > totalram_pages()" restriction.

Note #2, KVM's usage of vmalloc()+kmalloc() instead of kvmalloc() predates
commit 7661809d49 ("mm: don't allow oversized kvmalloc() calls") by 4+
years (see commit 89c5058090 ("KVM: SVM: Add support for
KVM_SEV_LAUNCH_UPDATE_DATA command")).  I.e. the open coded behavior wasn't
intended to avoid the aforementioned sanity check.  The implementation
appears to be pure oversight at the time the code was written, as it showed
up in v3[1] of the early RFCs, whereas v2[2] simply used kmalloc().

Cc: Liam Merwick <liam.merwick@oracle.com>
Link: https://lore.kernel.org/all/20170724200303.12197-17-brijesh.singh@amd.com [1]
Link: https://lore.kernel.org/all/148846786714.2349.17724971671841396908.stgit__25299.4950431914$1488470940$gmane$org@brijesh-build-machine [2]
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Tested-by: Liam Merwick <liam.merwick@oracle.com>
Link: https://patch.msgid.link/20260313003302.3136111-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-03 09:37:25 -07:00
Sean Christopherson
7ad02ff1e4 KVM: SEV: Use PFN_DOWN() to simplify "number of pages" math when pinning memory
Use PFN_DOWN() instead of open coded equivalents in sev_pin_memory() to
simplify the code and make it easier to read.

No functional change intended (verified before and after versions of the
generated code are identical).

Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Tested-by: Liam Merwick <liam.merwick@oracle.com>
Link: https://patch.msgid.link/20260313003302.3136111-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-03 09:37:25 -07:00
Sean Christopherson
6d71f9349d KVM: SEV: Disallow pinning more pages than exist in the system
Explicitly disallow pinning more pages for an SEV VM than exist in the
system to defend against absurd userspace requests without relying on
somewhat arbitrary kernel functionality to prevent truly stupid KVM
behavior.  E.g. even with the INT_MAX check, userspace can request that
KVM pin nearly 8TiB of memory, regardless of how much RAM exists in the
system.
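
A rough sketch of both the arithmetic and the new bound.  The helper
names are hypothetical, 4KiB pages are assumed, and reading "the INT_MAX
check" as a cap on the number of pages per request is an interpretation,
not something spelled out above:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL	/* assumption: 4KiB pages */

/* Bytes covered by INT_MAX pages: just under 8TiB. */
uint64_t sketch_int_max_pages_in_bytes(void)
{
	return (uint64_t)INT_MAX * SKETCH_PAGE_SIZE;
}

/* Hypothetical form of the bound: never pin more pages than exist. */
bool sketch_pin_request_ok(uint64_t already_pinned, uint64_t npages,
			   uint64_t totalram_pages)
{
	uint64_t total_npages = already_pinned + npages;

	/* Catch wraparound of the sum as well. */
	if (total_npages < npages)
		return false;

	return total_npages <= totalram_pages;
}
```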

Opportunistically rename "locked" to a more descriptive "total_npages".

Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Tested-by: Liam Merwick <liam.merwick@oracle.com>
Link: https://patch.msgid.link/20260313003302.3136111-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-03 09:37:25 -07:00
Sean Christopherson
12a8ff869d KVM: SEV: Drop useless sanity checks in sev_mem_enc_register_region()
Drop sev_mem_enc_register_region()'s sanity checks on the incoming address
and size, as SEV is 64-bit only, making ULONG_MAX a 64-bit, all-ones value,
and thus making it impossible for kvm_enc_region.{addr,size} to be greater
than ULONG_MAX.

Note, sev_pin_memory() verifies the incoming address is non-NULL (which
isn't strictly required, but whatever), and that addr+size don't wrap to
zero (which _is_ needed and what really needs to be guarded against).

Note #2, pin_user_pages_fast() guards against the end address walking into
kernel address space, so lack of an access_ok() check is also safe (maybe
not ideal, but safe).

No functional change intended (the generated code is literally the same,
i.e. the compiler was smart enough to know the checks were useless).

Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Tested-by: Liam Merwick <liam.merwick@oracle.com>
Link: https://patch.msgid.link/20260313003302.3136111-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-03 09:37:25 -07:00
Sean Christopherson
8acffeef5e KVM: SEV: Drop WARN on large size for KVM_MEMORY_ENCRYPT_REG_REGION
Drop the WARN in sev_pin_memory() on npages overflowing an int, as the
WARN is comically trivial to trigger from userspace, e.g. by doing:

  struct kvm_enc_region range = {
          .addr = 0,
          .size = -1ul,
  };

  __vm_ioctl(vm, KVM_MEMORY_ENCRYPT_REG_REGION, &range);

Note, the checks in sev_mem_enc_register_region() that presumably exist to
verify the incoming address+size are completely worthless, as both "addr"
and "size" are u64s and SEV is 64-bit only, i.e. they _can't_ be greater
than ULONG_MAX.  That wart will be cleaned up in the near future.

	if (range->addr > ULONG_MAX || range->size > ULONG_MAX)
		return -EINVAL;

Opportunistically add a comment to explain why the code calculates the
number of pages the "hard" way, e.g. instead of just shifting @ulen.
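
To illustrate why shifting the length alone is not enough, here is a
standalone sketch of the first/last page computation (hypothetical helper
name, 4KiB pages assumed, and the caller is assumed to have rejected a
zero length and an address+length that wraps):

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4KiB pages */

/* Number of pages touched by [uaddr, uaddr + ulen). */
uint64_t sketch_npages(uint64_t uaddr, uint64_t ulen)
{
	uint64_t first, last;

	assert(ulen != 0);	/* precondition assumed by this sketch */

	first = uaddr >> SKETCH_PAGE_SHIFT;
	last = (uaddr + ulen - 1) >> SKETCH_PAGE_SHIFT;

	return last - first + 1;
}
```

A two-byte range starting at offset 0xfff straddles a page boundary and
touches two pages, whereas shifting the length would yield zero.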

Fixes: 78824fabc7 ("KVM: SVM: fix svn_pin_memory()'s use of get_user_pages_fast()")
Cc: stable@vger.kernel.org
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Tested-by: Liam Merwick <liam.merwick@oracle.com>
Link: https://patch.msgid.link/20260313003302.3136111-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-03 09:37:25 -07:00
Sean Christopherson
7212094bae KVM: x86: Suppress WARNs on nested_run_pending after userspace exit
To end an ongoing game of whack-a-mole between KVM and syzkaller, WARN on
illegally cancelling a pending nested VM-Enter if and only if userspace
has NOT gained control of the vCPU since the nested run was initiated.  As
proven time and time again by syzkaller, userspace can clobber vCPU state
so as to force a VM-Exit that violates KVM's architectural modelling of
VMRUN/VMLAUNCH/VMRESUME.

To detect that userspace has gained control, while minimizing the risk of
operating on stale data, convert nested_run_pending from a pure boolean to
a tri-state of sorts, where '0' is still "not pending", '1' is "pending",
and '2' is "pending but untrusted".  Then on KVM_RUN, if the flag is in
the "trusted pending" state, move it to "untrusted pending".

Note, moving the state to "untrusted" even if KVM_RUN is ultimately
rejected is a-ok, because for the "untrusted" state to matter, KVM must
get past kvm_x86_vcpu_pre_run() at some point for the vCPU.
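
The tri-state can be modelled in a few lines.  This is a sketch with
hypothetical names, not the KVM definitions; only the values 0/1/2 and
their meanings come from the description above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the three states described above. */
enum sketch_run_pending {
	SKETCH_NOT_PENDING = 0,
	SKETCH_PENDING = 1,		/* "trusted" pending */
	SKETCH_PENDING_UNTRUSTED = 2,	/* userspace has had control */
};

/* On KVM_RUN: userspace may have clobbered state, so stop trusting it. */
enum sketch_run_pending sketch_on_kvm_run(enum sketch_run_pending state)
{
	return state == SKETCH_PENDING ? SKETCH_PENDING_UNTRUSTED : state;
}

/* WARN on a cancelled nested run only if userspace never got control. */
bool sketch_warn_on_cancel(enum sketch_run_pending state)
{
	return state == SKETCH_PENDING;
}
```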

Reviewed-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260312234823.3120658-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-03 09:34:01 -07:00
Yosry Ahmed
3d4470d71f KVM: x86: Move nested_run_pending to kvm_vcpu_arch
Move the nested_run_pending field, present in both svm_nested_state and
nested_vmx, to the common kvm_vcpu_arch. This allows common code to
use it without plumbing it through per-vendor helpers.

nested_run_pending remains zero-initialized, as the entire kvm_vcpu
struct is, and all further accesses are done through vcpu->arch instead
of svm->nested or vmx->nested.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
[sean: expand the comment in the field declaration]
Link: https://patch.msgid.link/20260312234823.3120658-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-03 09:33:30 -07:00
Yosry Ahmed
520a1347fa KVM: nSVM: Simplify error handling of nested_svm_copy_vmcb12_to_cache()
nested_svm_vmrun() currently stores the return value of
nested_svm_copy_vmcb12_to_cache() in a local variable 'err', separate
from the generally used 'ret' variable. This is done to have a single
call to kvm_skip_emulated_instruction(), such that we can store the
return value of kvm_skip_emulated_instruction() in 'ret', and then
re-check the return value of nested_svm_copy_vmcb12_to_cache() in 'err'.

The code is unnecessarily confusing. Instead, call
kvm_skip_emulated_instruction() in the failure path of
nested_svm_copy_vmcb12_to_cache() if the return value is not -EFAULT,
and drop 'err'.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260306210900.1933788-3-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-13 15:13:04 -07:00
Sean Christopherson
0b4a043a54 KVM: SVM: Add a helper to get LBR field pointer to dedup MSR accesses
Add a helper to get a pointer to the corresponding VMCB field given an LBR
MSR index, and use it to dedup the handling in svm_{g,s}et_msr().

No functional change intended.

Suggested-by: Yosry Ahmed <yosry@kernel.org>
Reviewed-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260310220414.2569208-1-seanjc@google.com
[sean: use KVM_BUG_ON() instead of BUILD_BUG(), clang ain't smart enough]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-13 13:05:54 -07:00
Yosry Ahmed
3b27c82ba2 KVM: x86: Move some EFER bits enablement to common code
Move EFER bits enablement that only depend on CPU support to common
code, as there is no reason to do it in vendor code. Leave EFER.SVME and
EFER.LMSLE enablement in SVM code as they depend on vendor module
parameters.

Having the enablement in common code ensures that if a vendor starts
supporting an existing feature, KVM doesn't end up advertising the feature
to userspace but not allowing the EFER bit to be set.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260307011619.2324234-2-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-12 09:05:41 -07:00
Paolo Bonzini
6b1ca262a9 KVM: x86: clarify leave_smm() return value
The return value of vmx_leave_smm() is unrelated to that of
nested_vmx_enter_non_root_mode().  Check explicitly for success
(which happens to be 0) and return 1 just like everywhere
else in vmx_leave_smm().

Likewise, in svm_leave_smm() return 0/1 instead of the 0/1/-errno
returned by enter_svm_guest_mode().

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-03-11 18:41:12 +01:00
Paolo Bonzini
be5fa8737d KVM: SVM: check validity of VMCB controls when returning from SMM
The VMCB12 is stored in guest memory and can be mangled while in SMM; it
is then reloaded by svm_leave_smm(), but it is not checked again for
validity.

Move the cached vmcb12 control and save consistency checks out of
svm_set_nested_state() and into a helper, and reuse it in
svm_leave_smm().

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-03-11 18:41:11 +01:00
Sean Christopherson
87d0f901a9 KVM: SVM: Set/clear CR8 write interception when AVIC is (de)activated
Explicitly set/clear CR8 write interception when AVIC is (de)activated to
fix a bug where KVM leaves the interception enabled after AVIC is
activated.  E.g. if KVM emulates INIT=>WFS while AVIC is deactivated, CR8
will remain intercepted in perpetuity.

On its own, the dangling CR8 intercept is "just" a performance issue, but
combined with the TPR sync bug fixed by commit d02e48830e ("KVM: SVM:
Sync TPR from LAPIC into VMCB::V_TPR even if AVIC is active"), the dangling
intercept is fatal to Windows guests as the TPR seen by hardware gets
wildly out of sync with reality.

Note, VMX isn't affected by the bug as TPR_THRESHOLD is explicitly ignored
when Virtual Interrupt Delivery is enabled, i.e. when APICv is active in
KVM's world.  I.e. there's no need to trigger update_cr8_intercept(), this
is firmly an SVM implementation flaw/detail.

WARN if KVM gets a CR8 write #VMEXIT while AVIC is active, as KVM should
never enter the guest with AVIC enabled and CR8 writes intercepted.
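
The intended invariant can be sketched with a toy model.  The struct, the
intercept bit position, and the function names below are hypothetical;
only the rule "clear on activate, set on deactivate, WARN if a CR8 write
exit arrives while AVIC is active" comes from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_INTERCEPT_CR8_WRITE (1u << 0)	/* hypothetical bit */

/* Hypothetical, minimal stand-in for the relevant VMCB state. */
struct sketch_vmcb {
	uint32_t intercepts;
	bool avic_active;
};

void sketch_avic_activate(struct sketch_vmcb *v)
{
	v->avic_active = true;
	v->intercepts &= ~SKETCH_INTERCEPT_CR8_WRITE;
}

void sketch_avic_deactivate(struct sketch_vmcb *v)
{
	v->avic_active = false;
	v->intercepts |= SKETCH_INTERCEPT_CR8_WRITE;
}

/* A CR8 write exit with AVIC active means the invariant was broken. */
bool sketch_cr8_write_exit_is_bug(const struct sketch_vmcb *v)
{
	return v->avic_active;
}
```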

Fixes: 3bbf3565f4 ("svm: Do not intercept CR8 when enable AVIC")
Cc: stable@vger.kernel.org
Cc: Jim Mattson <jmattson@google.com>
Cc: Naveen N Rao (AMD) <naveen@kernel.org>
Cc: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20260203190711.458413-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
[Squash fix to avic_deactivate_vmcb. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-03-11 18:41:11 +01:00
Sean Christopherson
3989a6d036 KVM: SVM: Initialize AVIC VMCB fields if AVIC is enabled with in-kernel APIC
Initialize all per-vCPU AVIC control fields in the VMCB if AVIC is enabled
in KVM and the VM has an in-kernel local APIC, i.e. if it's _possible_ the
vCPU could activate AVIC at any point in its lifecycle.  Configuring the
VMCB if and only if AVIC is active "works" purely because of optimizations
in kvm_create_lapic() to speculatively set apicv_active if AVIC is enabled
*and* to defer updates until the first KVM_RUN.  In quotes because KVM
likely won't do the right thing if kvm_apicv_activated() is false, i.e. if
a vCPU is created while APICv is inhibited at the VM level for whatever
reason.  E.g. if the inhibit is *removed* before KVM_REQ_APICV_UPDATE is
handled in KVM_RUN, then __kvm_vcpu_update_apicv() will elide calls to
vendor code due to seeing "apicv_active == activate".

Cleaning up the initialization code will also allow fixing a bug where KVM
incorrectly leaves CR8 interception enabled when AVIC is activated without
creating a mess with respect to whether AVIC is activated or not.

Cc: stable@vger.kernel.org
Fixes: 67034bb9dd ("KVM: SVM: Add irqchip_split() checks before enabling AVIC")
Fixes: 6c3e4422dd ("svm: Add support for dynamic APICv")
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20260203190711.458413-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-03-11 18:41:11 +01:00
Yosry Ahmed
cdc69269b1 KVM: SVM: Triple fault L1 on unintercepted EFER.SVME clear by L2
KVM tracks when EFER.SVME is set and cleared to initialize and tear down
nested state. However, it doesn't differentiate if EFER.SVME is getting
toggled in L1 or L2+. If L2 clears EFER.SVME, and L1 does not intercept
the EFER write, KVM exits guest mode and tears down nested state while
L2 is running, executing L1 without injecting a proper #VMEXIT.

According to the APM:

    The effect of turning off EFER.SVME while a guest is running is
    undefined; therefore, the VMM should always prevent guests from
    writing EFER.

Since the behavior is architecturally undefined, KVM gets to choose what
to do. Inject a triple fault into L1 as a more graceful option than
running L1 with corrupted state.

Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
base-commit: 95deaec3557dced322e2540bfa426e60e5373d46
Link: https://patch.msgid.link/20260209195142.2554532-2-yosry.ahmed@linux.dev
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04 16:09:10 -08:00
Jim Mattson
66b207f175 KVM: x86: SVM: Remove vmcb_is_dirty()
After commit dd26d1b5d6ed ("KVM: nSVM: Cache all used fields from VMCB12"),
vmcb_is_dirty() has no callers. Remove the function.

Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20260224005500.1471972-2-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04 16:09:09 -08:00