Swap the MOVNTDQA operands, as MOVNTDQA does NOT in fact have "the same
characteristics as 0F E7 (MOVNTDQ)"; MOVNTDQA loads from memory and stores
to registers, while MOVNTDQ loads from registers and stores to memory.
Per the SDM:
MOVNTDQ - Move packed integer values in xmm1 to m128 using non-temporal
hint.
MOVNTDQA - Move double quadword from m128 to xmm1 using non-temporal hint
if WC memory type.
Reported-by: Josh Eads <josheads@google.com>
Fixes: c57d9bafbd ("KVM: x86: Add support for emulating MOVNTDQA")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260506213514.2781948-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- To minimize the probability of corrupting guest state, defer KVM's
non-architectural delivery of exception payloads (e.g. CR2 and DR6) until
consumption of the payload is imminent, and force delivery of the payload
in all paths where userspace saves relevant state.
- Use vcpu->arch.cr2 when updating vmcb12's CR2 on nested #VMEXIT to fix a
bug where L2's CR2 can get corrupted after a save/restore, e.g. if the VM
is migrated while L2 is faulting in memory.
- Fix a class of nSVM bugs where some fields written by the CPU are not
synchronized from vmcb02 to cached vmcb12 after VMRUN, and so are not
up-to-date when saved by KVM_GET_NESTED_STATE.
- Fix a class of bugs where the ordering between KVM_SET_NESTED_STATE and
KVM_SET_{S}REGS could cause vmcb02 to be incorrectly initialized after
save+restore.
- Add a variety of missing nSVM consistency checks.
- Fix several bugs where KVM failed to correctly update VMCB fields on nested
#VMEXIT.
- Fix several bugs where KVM failed to correctly synthesize #UD or #GP for
SVM-related instructions.
- Add support for save+restore of virtualized LBRs (on SVM).
- Refactor various helpers and macros to improve clarity and (hopefully) make
the code easier to maintain.
- Aggressively sanitize fields when copying from vmcb12 to guard against
unintentionally allowing L1 to utilize yet-to-be-defined features.
- Fix several bugs where KVM botched rAX legality checks when emulating SVM
instructions. Note, KVM is still flawed in that KVM doesn't address size
prefix overrides for 64-bit guests; this should probably be documented as a
KVM erratum.
- Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails instead of
somewhat arbitrarily synthesizing #GP (i.e. don't bastardize AMD's already-
sketchy behavior of generating #GP if for "unsupported" addresses).
- Cache all used vmcb12 fields to further harden against TOCTOU bugs.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmnZfbwACgkQOlYIJqCj
N/0pVRAAkys8LLtIekQtEVkaX3EPaXk0lGGmnzXbihgHFsS5lMAS4tcsr7oyk4TI
rvJUGmkaTKTboQdTaCq0G7lwCu5hMuXsZ10WvmKfivMFxy3kSppqfffux5zVXng2
U/8oyJSorkX1WPC7d5QAZYMqqcSwQaR+a0FxowghGWBXMRHylerSuH00CiGr6Ron
QQbZaKBNtkYwYFNos2tLuT4tueyFogk8FPAmdejEQ9CMxUjeAivlKm8JVXaDvGik
lyPYbJJLukjuxSYGYmeRyGLLwK7VBGkFHQp/KBYSBgzGdweabhsQa1Z0CGm24+w1
q626W0sxsq97dZ0cd7oE6Cw+AdlMBK+mjpxB9gX4uLGyYlnFkdJV7OSlHVTR9d96
cqKduT0JvlBnVb7Yd5jyaGVl1YD62p0nwcrTuWidR5IJ16b4mYwwPzvkkQKHLt64
VAhH8lBVtATtblI9gfsbwGezV74xXnuLb0L1G7xeh1VIWu7pubFdqyRwIA+qiXQa
OkyxzoDlFl+QF2Uf3cBCFMojBOrSZRiGiLzIkUnjBsN4N2uOPYTsQEfr9BXVVcv7
obT9xl/wUwry2fAJhUL+IBCDE42+8C62UaWT5KJHQLttBL7Mm06e75hFN5ObbE/x
nExL+NmAcsSUUbbdojjnD0KWxYKkosNiONBVrjqqXdmBjmzzOvI=
=ys7N
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-nested-7.1' of https://github.com/kvm-x86/linux into HEAD
KVM nested SVM changes for 7.1 (with one common x86 fix)
- To minimize the probability of corrupting guest state, defer KVM's
non-architectural delivery of exception payloads (e.g. CR2 and DR6) until
consumption of the payload is imminent, and force delivery of the payload
in all paths where userspace saves relevant state.
- Use vcpu->arch.cr2 when updating vmcb12's CR2 on nested #VMEXIT to fix a
bug where L2's CR2 can get corrupted after a save/restore, e.g. if the VM
is migrated while L2 is faulting in memory.
- Fix a class of nSVM bugs where some fields written by the CPU are not
synchronized from vmcb02 to cached vmcb12 after VMRUN, and so are not
up-to-date when saved by KVM_GET_NESTED_STATE.
- Fix a class of bugs where the ordering between KVM_SET_NESTED_STATE and
KVM_SET_{S}REGS could cause vmcb02 to be incorrectly initialized after
save+restore.
- Add a variety of missing nSVM consistency checks.
- Fix several bugs where KVM failed to correctly update VMCB fields on nested
#VMEXIT.
- Fix several bugs where KVM failed to correctly synthesize #UD or #GP for
SVM-related instructions.
- Add support for save+restore of virtualized LBRs (on SVM).
- Refactor various helpers and macros to improve clarity and (hopefully) make
the code easier to maintain.
- Aggressively sanitize fields when copying from vmcb12 to guard against
unintentionally allowing L1 to utilize yet-to-be-defined features.
- Fix several bugs where KVM botched rAX legality checks when emulating SVM
instructions. Note, KVM is still flawed in that KVM doesn't address size
prefix overrides for 64-bit guests; this should probably be documented as a
KVM erratum.
- Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails instead of
somewhat arbitrarily synthesizing #GP (i.e. don't bastardize AMD's already-
sketchy behavior of generating #GP if for "unsupported" addresses).
- Cache all used vmcb12 fields to further harden against TOCTOU bugs.
Copy single-chunk MMIO write values into a persistent (per-fragment) field to
fix use-after-free stack bugs due to KVM dereferencing a stack pointer after an
exit to userspace.
Clean up and comment the emulated MMIO code to try to make it easier to
maintain (not necessarily "easy", but "easier").
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmnZJAkACgkQOlYIJqCj
N/2TUxAAnmOHpQ0hKjDvkMXTLOUJEitegRn8CDd2NWa6TEnoCFmHu5sXAMaO0V7v
EK2NH0uwT43Zr55BlXTxjEaQOby16tKjqK4FlG1Zb8UI6e9Lzhk6zIXMI7NTpV41
HMF73cQyHIzeS9ymf0fdVo3nlnvTXBHVCifyJwmb2RARl+LqmTimHb9+9piuIxJB
/v473RoFCNxI7Rwl6Pp5sjl7lWTIDUQJSi3+1gMaowTtnsUyCPTLejwj7/b9NWF3
i+nTNwwpHFvgfTE3decMyKupY9aXUM9FU/AFf+eUbvzjR/Dx/7o31Cpz4NCHQ38c
4TCFKkOQ2r+e8s7ATBeRKdOXdP7d7DW3qasLfPVjzEDxuifmW+awDRYBZwNM/Ybv
jDCBO6gbtw/f+oJPKq9oivqBSu+Z6vR7NnPmk1vh7VocsZdAlbCwdBU2+N5DBkfh
LJ+nOxzNL1q1A8X61CZFffEooh971Mg7ztHV5IDviK5/Fop0NfQBjxdxoz+wttzp
ufwY1WUMHjImRfZTk3e9icJwarqEI39QRmjKuaUJxEXjbJCbtvfKJ0lrNn7RRNPf
aqi9M2z6UvVrQi/Vw5rRTx7fYr091QIOHDdGBVl7atcGyU4gvdpXdkBS6xqafgD6
S/QzWypU868iMgLWqNpNeKtRLPQsuwTywC5vx57Sm0yPlPbM8H0=
=Qyhh
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-mmio-7.1' of https://github.com/kvm-x86/linux into HEAD
KVM x86 emulated MMIO changes for 7.1
Copy single-chunk MMIO write values into a persistent (per-fragment) field to
fix use-after-free stack bugs due to KVM dereferencing a stack pointer after an
exit to userspace.
Clean up and comment the emulated MMIO code to try to make it easier to
maintain (not necessarily "easy", but "easier").
Architecturally, VMRUN/VMLOAD/VMSAVE should generate a #GP if the
physical address in RAX is not supported. check_svme_pa() hardcodes this
to checking that bits 63-48 are not set. This is incorrect on HW
supporting 52 bits of physical address space. Additionally, the emulator
does not check if the address is not aligned, which should also result
in #GP.
Use page_address_valid() which properly checks alignment and the address
legality based on the guest's MAXPHYADDR. Plumb it through
x86_emulate_ops, similar to is_canonical_addr(), to avoid directly
accessing the vCPU object in emulator code.
Fixes: 01de8b09e6 ("KVM: SVM: Add intercept checks for SVM instructions")
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260316202732.3164936-2-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Open code the handling of completed MMIO reads instead of using an ops
hook, as burying the logic behind a (likely RETPOLINE'd) indirect call,
and with an unintuitive name, makes relatively straightforward code hard
to comprehend.
Opportunistically add comments to explain the dependencies between the
emulator's mem_read cache and the MMIO read completion logic, as it's very
easy to overlook the cache's role in getting the read data into the
correct destination.
No functional change intended.
Tested-by: Tom Lendacky <thomas.lendacky@gmail.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://patch.msgid.link/20260225012049.920665-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
The Intel Virtualization Technology FlexMigration Application Note says,
"When CPUID faulting is enabled, all executions of the CPUID instruction
outside system-management mode (SMM) cause a general-protection exception
(#GP(0)) if the current privilege level (CPL) is greater than 0."
Always allow the execution of CPUID in SMM.
Fixes: db2336a804 ("KVM: x86: virtualize cpuid faulting")
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20260210234613.1383279-1-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Explicitly zero-initialize stack-allocated struct fxregs_state
variables in em_fxsave() and fxregs_fixup() to ensure all padding and
unused fields are cleared before use.
Both functions declare temporary fxregs_state buffers that may be
partially written by fxsave. Although the emulator copies only the
architecturally defined portion of the state to userspace, any padding
or otherwise untouched bytes in the structure can remain uninitialized.
This can lead to the use of uninitialized stack data and may trigger
KMSAN reports. In the worst case, it could result in leaking stack
contents if such bytes are ever exposed.
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://patch.msgid.link/20260212212457.24483-1-ubizjak@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Some users of KVM have emulated devices (typically added to private
forks of QEMU) that execute AVX instructions on PCI BARs. Whenever
the guest OS tries to do that, an illegal instruction exception or
emulation failure is triggered.
Add the Avx flag to move instructions:
- (66) 0f 10 - MOVUPS/MOVUPD from memory
- (66) 0f 11 - MOVUPS/MOVUPD to memory
- 66 0f 6f - MOVDQA from memory
- 66 0f 7f - MOVDQA to memory
- f3 0f 6f - MOVDQU from memory
- f3 0f 7f - MOVDQU to memory
- (66) 0f 28 - MOVAPS/MOVAPD from memory
- (66) 0f 29 - MOVAPS/MOVAPD to memory
- (66) 0f 2b - MOVNTPS/MOVNTPD to memory
- 66 0f e7 - MOVNTDQ to memory
- 66 0f 38 2a - MOVNTDQA to memory
Co-developed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/kvm/BD108C42-0382-4B17-B601-434A4BD038E7@fb.com/T/
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://patch.msgid.link/20251114003633.60689-11-pbonzini@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
After all the changes done in the previous patches, the only thing
left to support AVX MOV instructions is to expand the VEX prefix into
the appropriate REX, 66/F3/F2 and map prefixes. Three-operand
instructions are not supported.
The Avx bit in this case is not cleared, in fact it is used as the
sign that the instruction does support VEX encoding. Until it is
added to any instruction, however, the only functional change is
to change some not-implemented instructions to #UD if they correspond
to a VEX prefix with an invalid map.
Co-developed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://patch.msgid.link/20251114003633.60689-10-pbonzini@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Restructure how to represent and interpret REX fields, preparing
for handling of both REX2 and VEX.
REX uses the upper four bits of a single byte as a fixed identifier,
and the lower four bits containing the data. VEX and REX2 extends this so
that the first byte identifies the prefix and the rest encode additional
bits; and while VEX only has the same four data bits as REX, eight zero
bits are a valid value for the data bits of REX2. So, stop storing the
REX byte as-is. Instead, store only the low bits of the REX prefix and
track separately whether a REX-like prefix was used.
No functional changes intended.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Message-ID: <20251110180131.28264-11-chang.seok.bae@intel.com>
[Extracted from APX series; removed bitfields and REX2-specific default. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://patch.msgid.link/20251114003633.60689-9-pbonzini@redhat.com
[sean: name REX_{BXRW} enum "rex_bits"]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Prepare struct operand for hosting AVX registers. Remove the
existing, incomplete code that placed the Avx flag in the operand
alignment field, and repurpose the name for a separate bit that
indicates:
- after decode, whether an instruction supports the VEX prefix;
- before writeback, that the instruction did have the VEX prefix and
therefore 1) it can have op_bytes == 32; 2) t should clear high
bytes of XMM registers.
Right now the bit will never be set and the patch has no intended
functional change. However, this is actually more vexing than the
decoder changes itself, and therefore worth separating.
Co-developed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://patch.msgid.link/20251114003633.60689-8-pbonzini@redhat.com
[sean: guard ymm[8-15] accesses with #ifdef CONFIG_X86_64]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Remove all duplicate handling of register operands, including picking
the right register class and fetching it, by extracting a new function
that can be used for both REG and MODRM operands.
Centralize setting op->orig_val = op->val in fetch_register_operand()
as well.
No functional change intended.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Chang S. Bae <chang.seok.bae@intel.com>
Link: https://patch.msgid.link/20251114003633.60689-6-pbonzini@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
VEX decode will need to set it based on the "pp" bits, so make it
a field in the struct rather than a local variable.
No functional change intended.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Chang S. Bae <chang.seok.bae@intel.com>
Link: https://patch.msgid.link/20251114003633.60689-5-pbonzini@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Align a little better the comments on the right side and list
explicitly the bits used by multi-bit fields.
No functional change intended.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Chang S. Bae <chang.seok.bae@intel.com>
Link: https://patch.msgid.link/20251114003633.60689-4-pbonzini@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
An irresistible microoptimization (changing accesses to Src2 to just an
AND :)) that also frees a bit for AVX in the low flags word. This makes
it closer to SSE since both of them can access XMM registers, pointlessly
shaving another clock cycle or two (maybe).
No functional change intended.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Chang S. Bae <chang.seok.bae@intel.com
Link: https://patch.msgid.link/20251114003633.60689-3-pbonzini@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
MOVNTDQA is a simple MOV instruction, in fact it has the same
characteristics as 0F E7 (MOVNTDQ) other than the aligned-address
requirement.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://patch.msgid.link/20251114003633.60689-2-pbonzini@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
emulator in favor of int3_emulate_jcc() which is written in C
- Replace KVM fastops with C-based stubs which avoids problems with the
fastop infra related to latter not adhering to the C ABI due to their
special calling convention and, more importantly, bypassing compiler
control-flow integrity checking because they're written in asm
- Remove wrongly used static branches and other ugliness accumulated
over time in hyperv's hypercall implementation with a proper static
function call to the correct hypervisor call variant
- Add some fixes and modifications to allow running FRED-enabled kernels
in KVM even on non-FRED hardware
- Add kCFI improvements like validating indirect calls and prepare for
enabling kCFI with GCC. Add cmdline params documentation and other
code cleanups
- Use the single-byte 0xd6 insn as the official #UD single-byte
undefined opcode instruction as agreed upon by both x86 vendors
- Other smaller cleanups and touchups all over the place
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmjqXxkACgkQEsHwGGHe
VUq9QBAAsjaay99a1+Dc53xyP1/HzCUFZDOzEYhj9zF85I8/xA9vTXZr7Qg2m6os
+4EEmnlwU43AR5KgwGJcuszLF9qSqTMz5qkAdFpvnoQ1Hbc8b49A+3yo9/hM7NA2
gPGH0gVZVBcffoETiQ8tJN6C9H6Ec0nTZwKTbasWwxz5oUAw+ppjP+aF4rFQ2/5w
b1ofrcga5yucjvSlXjBOEwHvd21l7O9iMre1oGEn6b0E2LU8ldToRkJkVZIhkWeL
2Iq3gYtVNN4Ao06WbV/EfXAqg5HWXjcm5bLcUXDtSF+Blae+gWoCjrT7XQdQGyEq
J12l4FbIZk5Ha8eWAC425ye9i3Wwo+oie3Cc4SVCMdv5A+AmOF0ijAlo1hcxq0rX
eGNWm8BKJOJ9zz1kxLISO7CfjULKgpsXLabF5a19uwoCsQgj5YrhlJezaIKHXbnK
OWwHWg9IuRkN2KLmJa7pXtHkuAHp4MtEV9TP9kU2WCvCInrNrzp3gYtds3pri82c
8ove+WA3yb/AQ6RCq5vAMLYXBxMRbN7FrmY5ZuwgWJTMi6cp1Sp02mhobwJOgNhO
H7nKWCZnQMyCLPzVeg97HTSgqSXw13dSrujWX9gWYVWBMfZO1B9HcUrhtiOhH7Q9
cvELkcqaxKrCKdRHLLYgHeMIQU2tdpsQ5TXHm7C7liEcZPZpk+g=
=3Otb
-----END PGP SIGNATURE-----
Merge tag 'x86_core_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull more x86 updates from Borislav Petkov:
- Remove a bunch of asm implementing condition flags testing in KVM's
emulator in favor of int3_emulate_jcc() which is written in C
- Replace KVM fastops with C-based stubs which avoids problems with the
fastop infra related to latter not adhering to the C ABI due to their
special calling convention and, more importantly, bypassing compiler
control-flow integrity checking because they're written in asm
- Remove wrongly used static branches and other ugliness accumulated
over time in hyperv's hypercall implementation with a proper static
function call to the correct hypervisor call variant
- Add some fixes and modifications to allow running FRED-enabled
kernels in KVM even on non-FRED hardware
- Add kCFI improvements like validating indirect calls and prepare for
enabling kCFI with GCC. Add cmdline params documentation and other
code cleanups
- Use the single-byte 0xd6 insn as the official #UD single-byte
undefined opcode instruction as agreed upon by both x86 vendors
- Other smaller cleanups and touchups all over the place
* tag 'x86_core_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
x86,retpoline: Optimize patch_retpoline()
x86,ibt: Use UDB instead of 0xEA
x86/cfi: Remove __noinitretpoline and __noretpoline
x86/cfi: Add "debug" option to "cfi=" bootparam
x86/cfi: Standardize on common "CFI:" prefix for CFI reports
x86/cfi: Document the "cfi=" bootparam options
x86/traps: Clarify KCFI instruction layout
compiler_types.h: Move __nocfi out of compiler-specific header
objtool: Validate kCFI calls
x86/fred: KVM: VMX: Always use FRED for IRQs when CONFIG_X86_FRED=y
x86/fred: Play nice with invoking asm_fred_entry_from_kvm() on non-FRED hardware
x86/fred: Install system vector handlers even if FRED isn't fully enabled
x86/hyperv: Use direct call to hypercall-page
x86/hyperv: Clean up hv_do_hypercall()
KVM: x86: Remove fastops
KVM: x86: Convert em_salc() to C
KVM: x86: Introduce EM_ASM_3WCL
KVM: x86: Introduce EM_ASM_1SRC2
KVM: x86: Introduce EM_ASM_2CL
KVM: x86: Introduce EM_ASM_2W
...
Emulate the Shadow Stack restriction that the current SSP must be a 32-bit
value on a FAR JMP from 64-bit mode to compatibility mode. From the SDM's
pseudocode for FAR JMP:
IF ShadowStackEnabled(CPL)
IF (IA32_EFER.LMA and DEST(segment selector).L) = 0
(* If target is legacy or compatibility mode then the SSP must be in low 4GB *)
IF (SSP & 0xFFFFFFFF00000000 != 0); THEN
#GP(0);
FI;
FI;
FI;
Note, only the current CPL needs to be considered, as FAR JMP can't be
used for inter-privilege level transfers, and KVM rejects emulation of all
other far branch instructions when Shadow Stacks are enabled.
To give the emulator access to GUEST_SSP, special case handling
MSR_KVM_INTERNAL_GUEST_SSP in emulator_get_msr() to treat the access as a
host access (KVM doesn't allow guest accesses to internal "MSRs"). The
->get_msr() API is only used for implicit accesses from the emulator, i.e.
is only used with hardcoded MSR indices, and so any access to
MSR_KVM_INTERNAL_GUEST_SSP is guaranteed to be from KVM, i.e. not from the
guest via RDMSR.
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-21-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Don't emulate branch instructions, e.g. CALL/RET/JMP etc., that are
affected by Shadow Stacks and/or Indirect Branch Tracking when said
features are enabled in the guest, as fully emulating CET would require
significant complexity for no practical benefit (KVM shouldn't need to
emulate branch instructions on modern hosts). Simply doing nothing isn't
an option as that would allow a malicious entity to subvert CET
protections via the emulator.
To detect instructions that are subject to IBT or affect IBT state, use
the existing IsBranch flag along with the source operand type to detect
indirect branches, and the existing NearBranch flag to detect far JMPs
and CALLs, all of which are effectively indirect. Explicitly check for
emulation of IRET, FAR RET (IMM), and SYSEXIT (the ret-like far branches)
instead of adding another flag, e.g. IsRet, as it's unlikely the emulator
will ever need to check for return-like instructions outside of this one
specific flow. Use an allow-list instead of a deny-list because (a) it's
a shorter list and (b) so that a missed entry gets a false positive, not a
false negative (i.e. reject emulation instead of clobbering CET state).
For Shadow Stacks, explicitly track instructions that directly affect the
current SSP, as KVM's emulator doesn't have existing flags that can be
used to precisely detect such instructions. Alternatively, the em_xxx()
helpers could directly check for ShadowStack interactions, but using a
dedicated flag is arguably easier to audit, and allows for handling both
IBT and SHSTK in one fell swoop.
Note! On far transfers, do NOT consult the current privilege level and
instead treat SHSTK/IBT as being enabled if they're enabled for User *or*
Supervisor mode. On inter-privilege level far transfers, SHSTK and IBT
can be in play for the target privilege level, i.e. checking the current
privilege could get a false negative, and KVM doesn't know the target
privilege level until emulation gets under way.
Note #2, FAR JMP from 64-bit mode to compatibility mode interacts with
the current SSP, but only to ensure SSP[63:32] == 0. Don't tag FAR JMP
as SHSTK, which would be rather confusing and would result in FAR JMP
being rejected unnecessarily the vast majority of the time (ignoring that
it's unlikely to ever be emulated). A future commit will add the #GP(0)
check for the specific FAR JMP scenario.
Note #3, task switches also modify SSP and so need to be rejected. That
too will be addressed in a future commit.
Suggested-by: Chao Gao <chao.gao@intel.com>
Originally-by: Yang Weijiang <weijiang.yang@intel.com>
Cc: Mathias Krause <minipli@grsecurity.net>
Cc: John Allen <john.allen@amd.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Remove the IsBranch flag from ENTER and LEAVE in KVM's emulator, as ENTER
and LEAVE are stack operations, not branches. Add forced emulation of
said instructions to the PMU counters test to prove that KVM diverges from
hardware, and to guard against regressions.
Opportunistically add a missing "1 MOV" to the selftest comment regarding
the number of instructions per loop, which commit 7803339fa9 ("KVM:
selftests: Use data load to trigger LLC references/misses in Intel PMU")
forgot to add.
Fixes: 018d70ffcf ("KVM: x86: Update vPMCs when retiring branch instructions")
Cc: Jim Mattson <jmattson@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250919004639.1360453-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
When completing emulation of instruction that generated a userspace exit
for I/O, don't recheck L1 intercepts as KVM has already finished that
phase of instruction execution, i.e. has already committed to allowing L2
to perform I/O. If L1 (or host userspace) modifies the I/O permission
bitmaps during the exit to userspace, KVM will treat the access as being
intercepted despite already having emulated the I/O access.
Pivot on EMULTYPE_NO_DECODE to detect that KVM is completing emulation.
Of the three users of EMULTYPE_NO_DECODE, only complete_emulated_io() (the
intended "recipient") can reach the code in question. gp_interception()'s
use is mutually exclusive with is_guest_mode(), and
complete_emulated_insn_gp() unconditionally pairs EMULTYPE_NO_DECODE with
EMULTYPE_SKIP.
The bad behavior was detected by a syzkaller program that toggles port I/O
interception during the userspace I/O exit, ultimately resulting in a WARN
on vcpu->arch.pio.count being non-zero due to KVM no completing emulation
of the I/O instruction.
WARNING: CPU: 23 PID: 1083 at arch/x86/kvm/x86.c:8039 emulator_pio_in_out+0x154/0x170 [kvm]
Modules linked in: kvm_intel kvm irqbypass
CPU: 23 UID: 1000 PID: 1083 Comm: repro Not tainted 6.16.0-rc5-c1610d2d66b1-next-vm #74 NONE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:emulator_pio_in_out+0x154/0x170 [kvm]
PKRU: 55555554
Call Trace:
<TASK>
kvm_fast_pio+0xd6/0x1d0 [kvm]
vmx_handle_exit+0x149/0x610 [kvm_intel]
kvm_arch_vcpu_ioctl_run+0xda8/0x1ac0 [kvm]
kvm_vcpu_ioctl+0x244/0x8c0 [kvm]
__x64_sys_ioctl+0x8a/0xd0
do_syscall_64+0x5d/0xc60
entry_SYSCALL_64_after_hwframe+0x4b/0x53
</TASK>
Reported-by: syzbot+cc2032ba16cc2018ca25@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68790db4.a00a0220.3af5df.0020.GAE@google.com
Fixes: 8a76d7f25f ("KVM: x86: Add x86 callback for intercept check")
Cc: stable@vger.kernel.org
Cc: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250715190638.1899116-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Replace fastops with C based stubs. There are a bunch of problems with
the current fastop infrastructure, most all related to their special
calling convention, which bypasses the normal C-ABI.
There are two immediate problems with this at present:
- it relies on RET preserving EFLAGS; whereas C-ABI does not.
- it circumvents compiler based control-flow-integrity checking
because its all asm magic.
The first is a problem for some mitigations where the
x86_indirect_return_thunk needs to include non-trivial work that
clobbers EFLAGS (eg. the Skylake call depth tracking thing).
The second is a problem because it presents a 'naked' indirect call on
kCFI builds, making it a prime target for control flow hijacking.
Additionally, given that a large chunk of virtual machine performance
relies on absolutely avoiding vmexit these days, this emulation stuff
just isn't that critical for performance anymore.
As such, replace the fastop calls with normal C functions using the
'execute' member.
As noted by Paolo: this code was performance critical for pre-Westmere
(2010) and only when running big real mode code.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lkml.kernel.org/r/20250714103439.773781574@infradead.org
Current test_cc() uses the fastop infrastructure to test flags using
SETcc instructions. However, int3_emulate_jcc() already fully
implements the flags->CC mapping, use that.
Removes a pile of gnarly asm.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lkml.kernel.org/r/20250714103439.637049932@infradead.org
When checking for intercept when emulating an instruction on behalf of L2,
pass the emulator's view of the RIP of the instruction being emulated to
vendor code. Unlike SVM, which communicates the next RIP on VM-Exit,
VMX communicates the length of the instruction that generated the VM-Exit,
i.e. requires the current and next RIPs.
Note, unless userspace modifies RIP during a userspace exit that requires
completion, kvm_rip_read() will contain the same information. Pass the
emulator's view largely out of a paranoia, and because there is no
meaningful cost in doing so.
Link: https://lore.kernel.org/r/20250201015518.689704-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
When checking for intercept when emulating an instruction on behalf of L2,
forward the source and destination operand types to vendor code so that
VMX can synthesize the correct EXIT_QUALIFICATION for port I/O VM-Exits.
Link: https://lore.kernel.org/r/20250201015518.689704-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Set "next_rip" in the emulation interception info passed to vendor code
using the emulator context's "_eip", not "eip". "eip" holds RIP from the
start of emulation, i.e. the RIP of the instruction that's being emulated,
whereas _eip tracks the context's current position in decoding the code
stream, which at the time of the intercept checks is effectively the RIP
of the next instruction.
Passing the current RIP as next_rip causes SVM to stuff the wrong value
value into vmcb12->control.next_rip if a nested VM-Exit is generated, i.e.
if L1 wants to intercept the instruction, and could result in L1 putting
L2 into an infinite loop due to restarting L2 with the same RIP over and
over.
Fixes: 8a76d7f25f ("KVM: x86: Add x86 callback for intercept check")
Link: https://lore.kernel.org/r/20250201015518.689704-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add emulation flags for MSR accesses and Descriptor Tables loads, and pass
the new flags as appropriate to emul_is_noncanonical_address(). The flags
will be used to perform the correct canonical check, as the type of access
affects whether or not CR4.LA57 is consulted when determining the canonical
bit.
No functional change is intended.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240906221824.491834-3-mlevitsk@redhat.com
[sean: split to separate patch, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add emulate_ops.is_canonical_addr() to perform (non-)canonical checks in
the emulator, which will allow extending is_noncanonical_address() to
support different flavors of canonical checks, e.g. for descriptor table
bases vs. MSRs, without needing duplicate logic in the emulator.
No functional change is intended.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240906221824.491834-3-mlevitsk@redhat.com
[sean: separate from additional of flags, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
* Initial infrastructure for shadow stage-2 MMUs, as part of nested
virtualization enablement
* Support for userspace changes to the guest CTR_EL0 value, enabling
(in part) migration of VMs between heterogenous hardware
* Fixes + improvements to pKVM's FF-A proxy, adding support for v1.1 of
the protocol
* FPSIMD/SVE support for nested, including merged trap configuration
and exception routing
* New command-line parameter to control the WFx trap behavior under KVM
* Introduce kCFI hardening in the EL2 hypervisor
* Fixes + cleanups for handling presence/absence of FEAT_TCRX
* Miscellaneous fixes + documentation updates
LoongArch:
* Add paravirt steal time support.
* Add support for KVM_DIRTY_LOG_INITIALLY_SET.
* Add perf kvm-stat support for loongarch.
RISC-V:
* Redirect AMO load/store access fault traps to guest
* perf kvm stat support
* Use guest files for IMSIC virtualization, when available
ONE_REG support for the Zimop, Zcmop, Zca, Zcf, Zcd, Zcb and Zawrs ISA
extensions is coming through the RISC-V tree.
s390:
* Assortment of tiny fixes which are not time critical
x86:
* Fixes for Xen emulation.
* Add a global struct to consolidate tracking of host values, e.g. EFER
* Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC
bus frequency, because TDX.
* Print the name of the APICv/AVIC inhibits in the relevant tracepoint.
* Clean up KVM's handling of vendor specific emulation to consistently act on
"compatible with Intel/AMD", versus checking for a specific vendor.
* Drop MTRR virtualization, and instead always honor guest PAT on CPUs
that support self-snoop.
* Update to the newfangled Intel CPU FMS infrastructure.
* Don't advertise IA32_PERF_GLOBAL_OVF_CTRL as an MSR-to-be-saved, as it reads
'0' and writes from userspace are ignored.
* Misc cleanups
x86 - MMU:
* Small cleanups, renames and refactoring extracted from the upcoming
Intel TDX support.
* Don't allocate kvm_mmu_page.shadowed_translation for shadow pages that can't
hold leafs SPTEs.
* Unconditionally drop mmu_lock when allocating TDP MMU page tables for eager
page splitting, to avoid stalling vCPUs when splitting huge pages.
* Bug the VM instead of simply warning if KVM tries to split a SPTE that is
non-present or not-huge. KVM is guaranteed to end up in a broken state
because the callers fully expect a valid SPTE, it's all but dangerous
to let more MMU changes happen afterwards.
x86 - AMD:
* Make per-CPU save_area allocations NUMA-aware.
* Force sev_es_host_save_area() to be inlined to avoid calling into an
instrumentable function from noinstr code.
* Base support for running SEV-SNP guests. API-wise, this includes
a new KVM_X86_SNP_VM type, encrypting/measure the initial image into
guest memory, and finalizing it before launching it. Internally,
there are some gmem/mmu hooks needed to prepare gmem-allocated pages
before mapping them into guest private memory ranges.
This includes basic support for attestation guest requests, enough to
say that KVM supports the GHCB 2.0 specification.
There is no support yet for loading into the firmware those signing
keys to be used for attestation requests, and therefore no need yet
for the host to provide certificate data for those keys. To support
fetching certificate data from userspace, a new KVM exit type will be
needed to handle fetching the certificate from userspace. An attempt to
define a new KVM_EXIT_COCO/KVM_EXIT_COCO_REQ_CERTS exit type to handle
this was introduced in v1 of this patchset, but is still being discussed
by community, so for now this patchset only implements a stub version
of SNP Extended Guest Requests that does not provide certificate data.
x86 - Intel:
* Remove an unnecessary EPT TLB flush when enabling hardware.
* Fix a series of bugs that cause KVM to fail to detect nested pending posted
interrupts as valid wake eents for a vCPU executing HLT in L2 (with
HLT-exiting disable by L1).
* KVM: x86: Suppress MMIO that is triggered during task switch emulation
Explicitly suppress userspace emulated MMIO exits that are triggered when
emulating a task switch as KVM doesn't support userspace MMIO during
complex (multi-step) emulation. Silently ignoring the exit request can
result in the WARN_ON_ONCE(vcpu->mmio_needed) firing if KVM exits to
userspace for some other reason prior to purging mmio_needed.
See commit 0dc902267c ("KVM: x86: Suppress pending MMIO write exits if
emulator detects exception") for more details on KVM's limitations with
respect to emulated MMIO during complex emulator flows.
Generic:
* Rename the AS_UNMOVABLE flag that was introduced for KVM to AS_INACCESSIBLE,
because the special casing needed by these pages is not due to just
unmovability (and in fact they are only unmovable because the CPU cannot
access them).
* New ioctl to populate the KVM page tables in advance, which is useful to
mitigate KVM page faults during guest boot or after live migration.
The code will also be used by TDX, but (probably) not through the ioctl.
* Enable halt poll shrinking by default, as Intel found it to be a clear win.
* Setup empty IRQ routing when creating a VM to avoid having to synchronize
SRCU when creating a split IRQCHIP on x86.
* Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag
that arch code can use for hooking both sched_in() and sched_out().
* Take the vCPU @id as an "unsigned long" instead of "u32" to avoid
truncating a bogus value from userspace, e.g. to help userspace detect bugs.
* Mark a vCPU as preempted if and only if it's scheduled out while in the
KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest
memory when retrieving guest state during live migration blackout.
Selftests:
* Remove dead code in the memslot modification stress test.
* Treat "branch instructions retired" as supported on all AMD Family 17h+ CPUs.
* Print the guest pseudo-RNG seed only when it changes, to avoid spamming the
log for tests that create lots of VMs.
* Make the PMU counters test less flaky when counting LLC cache misses by
doing CLFLUSH{OPT} in every loop iteration.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmaZQB0UHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroNkZwf/bv2jiENaLFNGPe/VqTKMQ6PHQLMG
+sNHx6fJPP35gTM8Jqf0/7/ummZXcSuC1mWrzYbecZm7Oeg3vwNXHZ4LquwwX6Dv
8dKcUzLbWDAC4WA3SKhi8C8RV2v6E7ohy69NtAJmFWTc7H95dtIQm6cduV2osTC3
OEuHe1i8d9umk6couL9Qhm8hk3i9v2KgCsrfyNrQgLtS3hu7q6yOTR8nT0iH6sJR
KE5A8prBQgLmF34CuvYDw4Hu6E4j+0QmIqodovg2884W1gZQ9LmcVqYPaRZGsG8S
iDdbkualLKwiR1TpRr3HJGKWSFdc7RblbsnHRvHIZgFsMQiimh4HrBSCyQ==
=zepX
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM:
- Initial infrastructure for shadow stage-2 MMUs, as part of nested
virtualization enablement
- Support for userspace changes to the guest CTR_EL0 value, enabling
(in part) migration of VMs between heterogenous hardware
- Fixes + improvements to pKVM's FF-A proxy, adding support for v1.1
of the protocol
- FPSIMD/SVE support for nested, including merged trap configuration
and exception routing
- New command-line parameter to control the WFx trap behavior under
KVM
- Introduce kCFI hardening in the EL2 hypervisor
- Fixes + cleanups for handling presence/absence of FEAT_TCRX
- Miscellaneous fixes + documentation updates
LoongArch:
- Add paravirt steal time support
- Add support for KVM_DIRTY_LOG_INITIALLY_SET
- Add perf kvm-stat support for loongarch
RISC-V:
- Redirect AMO load/store access fault traps to guest
- perf kvm stat support
- Use guest files for IMSIC virtualization, when available
s390:
- Assortment of tiny fixes which are not time critical
x86:
- Fixes for Xen emulation
- Add a global struct to consolidate tracking of host values, e.g.
EFER
- Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the
effective APIC bus frequency, because TDX
- Print the name of the APICv/AVIC inhibits in the relevant
tracepoint
- Clean up KVM's handling of vendor specific emulation to
consistently act on "compatible with Intel/AMD", versus checking
for a specific vendor
- Drop MTRR virtualization, and instead always honor guest PAT on
CPUs that support self-snoop
- Update to the newfangled Intel CPU FMS infrastructure
- Don't advertise IA32_PERF_GLOBAL_OVF_CTRL as an MSR-to-be-saved, as
it reads '0' and writes from userspace are ignored
- Misc cleanups
x86 - MMU:
- Small cleanups, renames and refactoring extracted from the upcoming
Intel TDX support
- Don't allocate kvm_mmu_page.shadowed_translation for shadow pages
that can't hold leafs SPTEs
- Unconditionally drop mmu_lock when allocating TDP MMU page tables
for eager page splitting, to avoid stalling vCPUs when splitting
huge pages
- Bug the VM instead of simply warning if KVM tries to split a SPTE
that is non-present or not-huge. KVM is guaranteed to end up in a
broken state because the callers fully expect a valid SPTE, it's
all but dangerous to let more MMU changes happen afterwards
x86 - AMD:
- Make per-CPU save_area allocations NUMA-aware
- Force sev_es_host_save_area() to be inlined to avoid calling into
an instrumentable function from noinstr code
- Base support for running SEV-SNP guests. API-wise, this includes a
new KVM_X86_SNP_VM type, encrypting/measure the initial image into
guest memory, and finalizing it before launching it. Internally,
there are some gmem/mmu hooks needed to prepare gmem-allocated
pages before mapping them into guest private memory ranges
This includes basic support for attestation guest requests, enough
to say that KVM supports the GHCB 2.0 specification
There is no support yet for loading into the firmware those signing
keys to be used for attestation requests, and therefore no need yet
for the host to provide certificate data for those keys.
To support fetching certificate data from userspace, a new KVM exit
type will be needed to handle fetching the certificate from
userspace.
An attempt to define a new KVM_EXIT_COCO / KVM_EXIT_COCO_REQ_CERTS
exit type to handle this was introduced in v1 of this patchset, but
is still being discussed by community, so for now this patchset
only implements a stub version of SNP Extended Guest Requests that
does not provide certificate data
x86 - Intel:
- Remove an unnecessary EPT TLB flush when enabling hardware
- Fix a series of bugs that cause KVM to fail to detect nested
pending posted interrupts as valid wake eents for a vCPU executing
HLT in L2 (with HLT-exiting disable by L1)
- KVM: x86: Suppress MMIO that is triggered during task switch
emulation
Explicitly suppress userspace emulated MMIO exits that are
triggered when emulating a task switch as KVM doesn't support
userspace MMIO during complex (multi-step) emulation
Silently ignoring the exit request can result in the
WARN_ON_ONCE(vcpu->mmio_needed) firing if KVM exits to userspace
for some other reason prior to purging mmio_needed
See commit 0dc902267c ("KVM: x86: Suppress pending MMIO write
exits if emulator detects exception") for more details on KVM's
limitations with respect to emulated MMIO during complex emulator
flows
Generic:
- Rename the AS_UNMOVABLE flag that was introduced for KVM to
AS_INACCESSIBLE, because the special casing needed by these pages
is not due to just unmovability (and in fact they are only
unmovable because the CPU cannot access them)
- New ioctl to populate the KVM page tables in advance, which is
useful to mitigate KVM page faults during guest boot or after live
migration. The code will also be used by TDX, but (probably) not
through the ioctl
- Enable halt poll shrinking by default, as Intel found it to be a
clear win
- Setup empty IRQ routing when creating a VM to avoid having to
synchronize SRCU when creating a split IRQCHIP on x86
- Rework the sched_in/out() paths to replace kvm_arch_sched_in() with
a flag that arch code can use for hooking both sched_in() and
sched_out()
- Take the vCPU @id as an "unsigned long" instead of "u32" to avoid
truncating a bogus value from userspace, e.g. to help userspace
detect bugs
- Mark a vCPU as preempted if and only if it's scheduled out while in
the KVM_RUN loop, e.g. to avoid marking it preempted and thus
writing guest memory when retrieving guest state during live
migration blackout
Selftests:
- Remove dead code in the memslot modification stress test
- Treat "branch instructions retired" as supported on all AMD Family
17h+ CPUs
- Print the guest pseudo-RNG seed only when it changes, to avoid
spamming the log for tests that create lots of VMs
- Make the PMU counters test less flaky when counting LLC cache
misses by doing CLFLUSH{OPT} in every loop iteration"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (227 commits)
crypto: ccp: Add the SNP_VLEK_LOAD command
KVM: x86/pmu: Add kvm_pmu_call() to simplify static calls of kvm_pmu_ops
KVM: x86: Introduce kvm_x86_call() to simplify static calls of kvm_x86_ops
KVM: x86: Replace static_call_cond() with static_call()
KVM: SEV: Provide support for SNP_EXTENDED_GUEST_REQUEST NAE event
x86/sev: Move sev_guest.h into common SEV header
KVM: SEV: Provide support for SNP_GUEST_REQUEST NAE event
KVM: x86: Suppress MMIO that is triggered during task switch emulation
KVM: x86/mmu: Clean up make_huge_page_split_spte() definition and intro
KVM: x86/mmu: Bug the VM if KVM tries to split a !hugepage SPTE
KVM: selftests: x86: Add test for KVM_PRE_FAULT_MEMORY
KVM: x86: Implement kvm_arch_vcpu_pre_fault_memory()
KVM: x86/mmu: Make kvm_mmu_do_page_fault() return mapped level
KVM: x86/mmu: Account pf_{fixed,emulate,spurious} in callers of "do page fault"
KVM: x86/mmu: Bump pf_taken stat only in the "real" page fault handler
KVM: Add KVM_PRE_FAULT_MEMORY vcpu ioctl to pre-populate guest memory
KVM: Document KVM_PRE_FAULT_MEMORY ioctl
mm, virt: merge AS_UNMOVABLE and AS_INACCESSIBLE
perf kvm: Add kvm-stat for loongarch64
LoongArch: KVM: Add PV steal time support in guest side
...
objtool complains:
arch/x86/kvm/kvm.o: warning: objtool: .altinstr_replacement+0xc5: call without frame pointer save/setup
vmlinux.o: warning: objtool: .altinstr_replacement+0x2eb: call without frame pointer save/setup
Make sure %rSP is an output operand to the respective asm() statements.
The test_cc() hunk and ALT_OUTPUT_SP() courtesy of peterz. Also from him
add some helpful debugging info to the documentation.
Now on to the explanations:
tl;dr: The alternatives macros are pretty fragile.
If I do ALT_OUTPUT_SP(output) in order to be able to package in a %rsp
reference for objtool so that a stack frame gets properly generated, the
inline asm input operand with positional argument 0 in clear_page():
"0" (page)
gets "renumbered" due to the added
: "+r" (current_stack_pointer), "=D" (page)
and then gcc says:
./arch/x86/include/asm/page_64.h:53:9: error: inconsistent operand constraints in an ‘asm’
The fix is to use an explicit "D" constraint which points to a singleton
register class (gcc terminology) which ends up doing what is expected
here: the page pointer - input and output - should be in the same %rdi
register.
Other register classes have more than one register in them - example:
"r" and "=r" or "A":
‘A’
The ‘a’ and ‘d’ registers. This class is used for
instructions that return double word results in the ‘ax:dx’
register pair. Single word values will be allocated either in
‘ax’ or ‘dx’.
so using "D" and "=D" just works in this particular case.
And yes, one would say, sure, why don't you do "+D" but then:
: "+r" (current_stack_pointer), "+D" (page)
: [old] "i" (clear_page_orig), [new1] "i" (clear_page_rep), [new2] "i" (clear_page_erms),
: "cc", "memory", "rax", "rcx")
now find the Waldo^Wcomma which throws a wrench into all this.
Because that silly macro has an "input..." consume-all last macro arg
and in it, one is supposed to supply input *and* clobbers, leading to
silly syntax snafus.
Yap, they need to be cleaned up, one fine day...
Closes: https://lore.kernel.org/oe-kbuild-all/202406141648.jO9qNGLa-lkp@intel.com/
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Sean Christopherson <seanjc@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240625112056.GDZnqoGDXgYuWBDUwu@fat_crate.local
Open code the is_guest_vendor_intel() check in string_registers_quirk() to
discourage makiking exact vendor==Intel checks in the emulator, and to
remove the rather awful #ifdeffery.
The string quirk is literally the only Intel specific, *non-architectural*
behavior that KVM emulates. All Intel specific behavior that is
architecturally defined applies to all vendors that are compatible with
Intel's architecture, i.e. should use guest_cpuid_is_intel_compatible().
Link: https://lore.kernel.org/r/20240405235603.1173076-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Emulate SYSENTER in Compatibility Mode for all vCPUs models that are
compatible with Intel's architecture, as the behavior if SYSENTER is
architecturally defined in Intel's SDM, i.e. should be followed by any
CPU that implements Intel's architecture.
Link: https://lore.kernel.org/r/20240405235603.1173076-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Use guest_cpuid_is_intel_compatible() to determine whether SYSCALL in
32-bit Protected Mode (including Compatibility Mode) should #UD or succeed.
The existing code already does the exact equivalent of
guest_cpuid_is_intel_compatible(), just in a rather roundabout way.
No functional change intended.
Link: https://lore.kernel.org/r/20240405235603.1173076-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
- Fix several bugs where KVM speciously prevents the guest from utilizing
fixed counters and architectural event encodings based on whether or not
guest CPUID reports support for the _architectural_ encoding.
- Fix a variety of bugs in KVM's emulation of RDPMC, e.g. for "fast" reads,
priority of VMX interception vs #GP, PMC types in architectural PMUs, etc.
- Add a selftest to verify KVM correctly emulates RDMPC, counter availability,
and a variety of other PMC-related behaviors that depend on guest CPUID,
i.e. are difficult to validate via KVM-Unit-Tests.
- Zero out PMU metadata on AMD if the virtual PMU is disabled to avoid wasting
cycles, e.g. when checking if a PMC event needs to be synthesized when
skipping an instruction.
- Optimize triggering of emulated events, e.g. for "count instructions" events
when skipping an instruction, which yields a ~10% performance improvement in
VM-Exit microbenchmarks when a vPMU is exposed to the guest.
- Tighten the check for "PMI in guest" to reduce false positives if an NMI
arrives in the host while KVM is handling an IRQ VM-Exit.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmXrUFQACgkQOlYIJqCj
N/11dhAAnr9e6mPmXvaH4YKcvOGgTmwIQdi5W4IBzGm27ErEb0Vyskx3UATRhRm+
gZyp3wNgEA9LeifICDNu4ypn7HZcl2VtRql6FYcB8Bcu8OiHfU8PhWL0/qrpY20e
zffUj2tDweq2ft9Iks1SQJD0sxFkcXIcSKOffP7pRZJHFTKLltGORXwxzd9HJHPY
nc4nERKegK2yH4A4gY6nZ0oV5L3OMUNHx815db5Y+HxXOIjBCjTQiNNd6mUdyX1N
C5sIiElXLdvRTSDvirHfA32LqNwnajDGox4QKZkB3wszCxJ3kRd4OCkTEKMYKHxd
KoKCJQnAdJFFW9xqbT8nNKXZ+hg2+ZQuoSaBuwKryf7jWi0e6a7jcV0OH+cQSZw7
UNudKhs3r4ambfvnFp2IVZlZREMDB+LAjo2So48Jn/JGCAzqte3XqwVKskn9pS9S
qeauXCdOLioZALYtTBl8RM1rEY5mbwQrpPv9CzbeU09qQ/hpXV14W9GmbyeOZcI1
T1cYgEqlLuifRluwT/hxrY321+4noF116gSK1yb07x/sJU8/lhRooEk9V562066E
qo6nIvc7Bv9gTGLwo6VReKSPcTT/6t3HwgPsRjqe+evso3EFN9f9hG+uPxtO6TUj
pdPm3mkj2KfxDdJLf+Ys16gyGdiwI0ZImIkA0uLdM0zftNsrb4Y=
=vayI
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-pmu-6.9' of https://github.com/kvm-x86/linux into HEAD
KVM x86 PMU changes for 6.9:
- Fix several bugs where KVM speciously prevents the guest from utilizing
fixed counters and architectural event encodings based on whether or not
guest CPUID reports support for the _architectural_ encoding.
- Fix a variety of bugs in KVM's emulation of RDPMC, e.g. for "fast" reads,
priority of VMX interception vs #GP, PMC types in architectural PMUs, etc.
- Add a selftest to verify KVM correctly emulates RDMPC, counter availability,
and a variety of other PMC-related behaviors that depend on guest CPUID,
i.e. are difficult to validate via KVM-Unit-Tests.
- Zero out PMU metadata on AMD if the virtual PMU is disabled to avoid wasting
cycles, e.g. when checking if a PMC event needs to be synthesized when
skipping an instruction.
- Optimize triggering of emulated events, e.g. for "count instructions" events
when skipping an instruction, which yields a ~10% performance improvement in
VM-Exit microbenchmarks when a vPMU is exposed to the guest.
- Tighten the check for "PMI in guest" to reduce false positives if an NMI
arrives in the host while KVM is handling an IRQ VM-Exit.
Convert kvm_get_dr()'s output parameter to a return value, and clean up
most of the mess that was created by forcing callers to provide a pointer.
No functional change intended.
Acked-by: Mathias Krause <minipli@grsecurity.net>
Reviewed-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20240209220752.388160-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
push and emulate_pop are counterparts. Rename push to emulate_push and
harmonize its function signature with emulate_pop. This should remove
a bit of cognitive load when reading this code.
Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
Link: https://lore.kernel.org/r/20231009092054.556935-2-julian.stecklina@cyberus-technology.de
Signed-off-by: Sean Christopherson <seanjc@google.com>
Explicitly zero out variables passed to emulate_pop() as output params
to harden against consuming uninitialized data, and to make sanitizers
happy. Many flows that use emulate_pop() pass an "unsigned long" so as
to be able to hold the largest possible operand, but the actual number
of bytes written is usually the word with of the vCPU. E.g. if the vCPU
is in 16-bit or 32-bit mode (on a 64-bit host), the upper portion of the
output param will be uninitialized.
Passing around the uninitialized data is benign, as actual KVM usage of
the output is also tied to the word width, but passing around
uninitialized data makes some sanitizers rightly complain.
Note, initializing the data in emulate_pop() is not a safe alternative,
e.g. it would result in em_leave() clobbering RBP[31:16] if LEAVE were
emulated with a 16-bit stack.
Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
Link: https://lore.kernel.org/r/20231009092054.556935-1-julian.stecklina@cyberus-technology.de
[sean: massage changelog, drop em_popa() variable size change]]
Signed-off-by: Sean Christopherson <seanjc@google.com>
The MOVBE instruction can come with an operand-size prefix (66h). In
this, case the x86 emulation code returns EMULATION_FAILED.
It turns out that em_movbe can already handle this case and all that
is missing is an entry in respective opcode tables to populate
gprefix->pfx_66.
Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de>
Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231212095938.26731-1-julian.stecklina@cyberus-technology.de
Signed-off-by: Sean Christopherson <seanjc@google.com>
Apply the pre-intercepts RDPMC validity check only to AMD, and rename all
relevant functions to make it as clear as possible that the check is not a
standard PMC index check. On Intel, the basic rule is that only invalid
opcodes and privilege/permission/mode checks have priority over VM-Exit,
i.e. RDPMC with an invalid index should VM-Exit, not #GP. While the SDM
doesn't explicitly call out RDPMC, it _does_ explicitly use RDMSR of a
non-existent MSR as an example where VM-Exit has priority over #GP, and
RDPMC is effectively just a variation of RDMSR.
Manually testing on various Intel CPUs confirms this behavior, and the
inverted priority was introduced for SVM compatibility, i.e. was not an
intentional change for Intel PMUs. On AMD, *all* exceptions on RDPMC have
priority over VM-Exit.
Check for a NULL kvm_pmu_ops.check_rdpmc_early instead of using a RET0
static call so as to provide a convenient location to document the
difference between Intel and AMD, and to again try to make it as obvious
as possible that the early check is a one-off thing, not a generic "is
this PMC valid?" helper.
Fixes: 8061252ee0 ("KVM: SVM: Add intercept checks for remaining twobyte instructions")
Cc: Jim Mattson <jmattson@google.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Introduce a new interface get_untagged_addr() to kvm_x86_ops to untag
the metadata from linear address. Call the interface in linearization
of instruction emulator for 64-bit mode.
When enabled feature like Intel Linear Address Masking (LAM) or AMD Upper
Address Ignore (UAI), linear addresses may be tagged with metadata that
needs to be dropped prior to canonicality checks, i.e. the metadata is
ignored.
Introduce get_untagged_addr() to kvm_x86_ops to hide the vendor specific
code, as sadly LAM and UAI have different semantics. Pass the emulator
flags to allow vendor specific implementation to precisely identify the
access type (LAM doesn't untag certain accesses).
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-9-binbin.wu@linux.intel.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add an emulation flag X86EMUL_F_INVLPG, which is used to identify an
instruction that does TLB invalidation without true memory access.
Only invlpg & invlpga implemented in emulator belong to this kind.
invlpga doesn't need additional information for emulation. Just pass
the flag to em_invlpg().
Linear Address Masking (LAM) and Linear Address Space Separation (LASS)
don't apply to addresses that are inputs to TLB invalidation. The flag
will be consumed to support LAM/LASS virtualization.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-5-binbin.wu@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Consolidate @write and @fetch of __linearize() into a set of flags so that
additional flags can be added without needing more/new boolean parameters,
to precisely identify the access type.
No functional change intended.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-2-binbin.wu@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>