Commit Graph

612 Commits

Author SHA1 Message Date
Paolo Bonzini
ea8bc95fbb Merge tag 'kvm-x86-nested-7.1' of https://github.com/kvm-x86/linux into HEAD
KVM nested SVM changes for 7.1 (with one common x86 fix)

 - To minimize the probability of corrupting guest state, defer KVM's
   non-architectural delivery of exception payloads (e.g. CR2 and DR6) until
   consumption of the payload is imminent, and force delivery of the payload
   in all paths where userspace saves relevant state.

 - Use vcpu->arch.cr2 when updating vmcb12's CR2 on nested #VMEXIT to fix a
   bug where L2's CR2 can get corrupted after a save/restore, e.g. if the VM
   is migrated while L2 is faulting in memory.

 - Fix a class of nSVM bugs where some fields written by the CPU are not
   synchronized from vmcb02 to cached vmcb12 after VMRUN, and so are not
   up-to-date when saved by KVM_GET_NESTED_STATE.

 - Fix a class of bugs where the ordering between KVM_SET_NESTED_STATE and
   KVM_SET_{S}REGS could cause vmcb02 to be incorrectly initialized after
   save+restore.

 - Add a variety of missing nSVM consistency checks.

 - Fix several bugs where KVM failed to correctly update VMCB fields on nested
   #VMEXIT.

 - Fix several bugs where KVM failed to correctly synthesize #UD or #GP for
   SVM-related instructions.

 - Add support for save+restore of virtualized LBRs (on SVM).

 - Refactor various helpers and macros to improve clarity and (hopefully) make
   the code easier to maintain.

 - Aggressively sanitize fields when copying from vmcb12 to guard against
   unintentionally allowing L1 to utilize yet-to-be-defined features.

 - Fix several bugs where KVM botched rAX legality checks when emulating SVM
   instructions.  Note, KVM is still flawed in that KVM doesn't handle address
   size prefix overrides for 64-bit guests; this should probably be documented
   as a KVM erratum.

 - Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails instead of
   somewhat arbitrarily synthesizing #GP (i.e. don't bastardize AMD's already-
   sketchy behavior of generating #GP for "unsupported" addresses).

 - Cache all used vmcb12 fields to further harden against TOCTOU bugs.
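
The last item, caching vmcb12 fields to harden against TOCTOU bugs, follows
a snapshot-then-validate pattern.  The sketch below is a simplified model and
not KVM code: struct guest_ctl, snapshot_ctl() and ctl_is_valid() are
hypothetical names, and "guest memory" is just an ordinary struct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a control area that lives in guest memory. */
struct guest_ctl {
	uint32_t asid;
	uint64_t intercepts;
};

/* Read each used field once into a private copy; guest memory is not re-read. */
static struct guest_ctl snapshot_ctl(const volatile struct guest_ctl *gmem)
{
	struct guest_ctl cache;

	assert(gmem);
	cache.asid = gmem->asid;
	cache.intercepts = gmem->intercepts;
	return cache;
}

/* Validate the private copy only; SVM treats ASID 0 as illegal for a guest. */
static bool ctl_is_valid(const struct guest_ctl *cache)
{
	return cache->asid != 0;
}
```

Because checks and later uses both operate on the copy, a guest that rewrites
its memory after validation cannot change what the hypervisor acts on.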
2026-04-13 13:01:50 +02:00
Sean Christopherson
7212094bae KVM: x86: Suppress WARNs on nested_run_pending after userspace exit
To end an ongoing game of whack-a-mole between KVM and syzkaller, WARN on
illegally cancelling a pending nested VM-Enter if and only if userspace
has NOT gained control of the vCPU since the nested run was initiated.  As
proven time and time again by syzkaller, userspace can clobber vCPU state
so as to force a VM-Exit that violates KVM's architectural modelling of
VMRUN/VMLAUNCH/VMRESUME.

To detect that userspace has gained control, while minimizing the risk of
operating on stale data, convert nested_run_pending from a pure boolean to
a tri-state of sorts, where '0' is still "not pending", '1' is "pending",
and '2' is "pending but untrusted".  Then on KVM_RUN, if the flag is in
the "trusted pending" state, move it to "untrusted pending".

Note, moving the state to "untrusted" even if KVM_RUN is ultimately
rejected is a-ok, because for the "untrusted" state to matter, KVM must
get past kvm_x86_vcpu_pre_run() at some point for the vCPU.
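
As a rough illustration of the tri-state described above, the following
sketch models the transitions in isolation.  The enum and function names are
hypothetical and are not the identifiers used in KVM; only the values 0, 1
and 2 and their meanings come from the description.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the three states described above. */
enum nested_run_state {
	RUN_NOT_PENDING = 0,
	RUN_PENDING = 1,            /* pending and trusted */
	RUN_PENDING_UNTRUSTED = 2,  /* pending, but userspace has had control */
};

/* On KVM_RUN: a trusted pending run becomes untrusted; other states are kept. */
static enum nested_run_state on_kvm_run(enum nested_run_state s)
{
	assert(s <= RUN_PENDING_UNTRUSTED);
	if (s == RUN_PENDING)
		return RUN_PENDING_UNTRUSTED;
	return s;
}

/* WARN on an illegal cancellation only while the pending run is trusted. */
static bool should_warn_on_cancel(enum nested_run_state s)
{
	return s == RUN_PENDING;
}
```

In this model, a cancellation that happens after any KVM_RUN never WARNs,
which is the behavior the changelog asks for.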

Reviewed-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260312234823.3120658-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-03 09:34:01 -07:00
Yosry Ahmed
3d4470d71f KVM: x86: Move nested_run_pending to kvm_vcpu_arch
Move the nested_run_pending field, present in both svm_nested_state and
nested_vmx, to the common kvm_vcpu_arch.  This allows common code to use
the field without plumbing it through per-vendor helpers.

nested_run_pending remains zero-initialized, as the entire kvm_vcpu
struct is, and all further accesses are done through vcpu->arch instead
of svm->nested or vmx->nested.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
[sean: expand the comment in the field declaration]
Link: https://patch.msgid.link/20260312234823.3120658-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-03 09:33:30 -07:00
Paolo Bonzini
5a30e8aea0 KVM: VMX: check validity of VMCS controls when returning from SMM
The VMCS12 is not available while in SMM.  However, it can be overwritten
if userspace manages to trigger copy_enlightened_to_vmcs12() - for example
via KVM_GET_NESTED_STATE.

Because of this, the VMCS12 has to be checked for validity before it is
used to generate the VMCS02.  Move the check code out of vmx_set_nested_state()
(the other "not a VMLAUNCH/VMRESUME" path that emulates a nested vmentry)
and reuse it in vmx_leave_smm().

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-03-11 18:41:11 +01:00
Jim Mattson
e2ffe85b6d KVM: x86: Introduce KVM_X86_QUIRK_VMCS12_ALLOW_FREEZE_IN_SMM
Add KVM_X86_QUIRK_VMCS12_ALLOW_FREEZE_IN_SMM to allow L1 to set
FREEZE_IN_SMM in vmcs12's GUEST_IA32_DEBUGCTL field, as permitted
prior to commit 6b1dd26544 ("KVM: VMX: Preserve host's
DEBUGCTLMSR_FREEZE_IN_SMM while running the guest").  Enable the quirk
by default for backwards compatibility (like all quirks); userspace
can disable it via KVM_CAP_DISABLE_QUIRKS2 for consistency with the
constraints on WRMSR(IA32_DEBUGCTL).

Note that the quirk only bypasses the consistency check.  The vmcs02 bit is
still owned by the host, and PMCs are not frozen during virtualized SMM.
In particular, if a host administrator decides that PMCs should not be
frozen during physical SMM, then L1 has no say in the matter.
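
A minimal sketch of how such a quirk can gate a consistency check, assuming
a simplified model: vmcs12_debugctl_ok() and its parameters are hypothetical,
and only the bit position of FREEZE_IN_SMM (bit 14 of IA32_DEBUGCTL) is taken
from the architecture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* FREEZE_IN_SMM is bit 14 of IA32_DEBUGCTL. */
#define DEBUGCTL_FREEZE_IN_SMM (1ULL << 14)

/*
 * Hypothetical model of the vmcs12 check.  'guest_allowed' is the set of
 * DEBUGCTL bits the guest may set via WRMSR; 'quirk_enabled' stands in for
 * KVM_X86_QUIRK_VMCS12_ALLOW_FREEZE_IN_SMM.
 */
static bool vmcs12_debugctl_ok(uint64_t debugctl, uint64_t guest_allowed,
			       bool quirk_enabled)
{
	uint64_t allowed = guest_allowed;

	/* Model assumption: WRMSR never lets the guest set the bit itself. */
	assert(!(guest_allowed & DEBUGCTL_FREEZE_IN_SMM));

	/* The quirk only relaxes the check; the host still owns the bit. */
	if (quirk_enabled)
		allowed |= DEBUGCTL_FREEZE_IN_SMM;

	return (debugctl & ~allowed) == 0;
}
```

With the quirk disabled, the check rejects exactly the values that WRMSR
would reject, which is the consistency the changelog refers to.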

Fixes: 095686e6fc ("KVM: nVMX: Check vmcs12->guest_ia32_debugctl on nested VM-Enter")
Cc: stable@vger.kernel.org
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20260205231537.1278753-1-jmattson@google.com
[sean: tag for stable@, clean-up and fix goofs in the comment and docs]
Signed-off-by: Sean Christopherson <seanjc@google.com>
[Rename quirk. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-03-11 18:41:11 +01:00
Paolo Bonzini
bf2c3138ae Merge tag 'kvm-x86-pmu-6.20' of https://github.com/kvm-x86/linux into HEAD
KVM mediated PMU support for 6.20

Add support for mediated PMUs, where KVM gives the guest full ownership of PMU
hardware (context switched around the fastpath run loop) and allows direct
access to data MSRs and PMCs (restricted by the vPMU model), but intercepts
access to control registers, e.g. to enforce event filtering and to prevent the
guest from profiling sensitive host state.

To keep overall complexity reasonable, mediated PMU usage is all or nothing
for a given instance of KVM (controlled via module param).  The mediated PMU
is disabled by default, partly to maintain backwards compatibility for existing
setups, partly because there are tradeoffs when running with a mediated PMU that
may be non-starters for some use cases, e.g. the host loses the ability to
profile guests with mediated PMUs, the fastpath run loop is also a blind spot,
entry/exit transitions are more expensive, etc.

Versus the emulated PMU, where KVM is "just another perf user", the mediated
PMU delivers more accurate profiling and monitoring (no risk of contention and
thus dropped events), with significantly less overhead (fewer exits and faster
emulation/programming of event selectors).  E.g. when running Specint-2017 on
a single-socket Sapphire Rapids with 56 cores and no-SMT, and using perf from
within the guest:

  Perf command:
  a. basic-sampling: perf record -F 1000 -e 6-instructions  -a --overwrite
  b. multiplex-sampling: perf record -F 1000 -e 10-instructions -a --overwrite

  Guest performance overhead:
  ---------------------------------------------------------------------------
  | Test case          | emulated vPMU | all passthrough | passthrough with |
  |                    |               |                 | event filters    |
  ---------------------------------------------------------------------------
  | basic-sampling     |   33.62%      |    4.24%        |   6.21%          |
  ---------------------------------------------------------------------------
  | multiplex-sampling |   79.32%      |    7.34%        |   10.45%         |
  ---------------------------------------------------------------------------
2026-02-11 12:45:40 -05:00
Paolo Bonzini
1b13885edf Merge tag 'kvm-x86-apic-6.20' of https://github.com/kvm-x86/linux into HEAD
KVM x86 APIC-ish changes for 6.20

 - Fix a benign bug where KVM could use the wrong memslots (ignored SMM) when
   creating a vCPU-specific mapping of guest memory.

 - Clean up KVM's handling of marking mapped vCPU pages dirty.

 - Drop a pile of *ancient* sanity checks hidden behind KVM's unused
   ASSERT() macro, most of which could be trivially triggered by the guest
   and/or user, and all of which were useless.

 - Fold "struct dest_map" into its sole user, "struct rtc_status", to make it
   more obvious what the weird parameter is used for, and to allow burying the
   RTC shenanigans behind CONFIG_KVM_IOAPIC=y.

 - Bury all of ioapic.h and KVM_IRQCHIP_KERNEL behind CONFIG_KVM_IOAPIC=y.

 - Add a regression test for recent APICv update fixes.

 - Rework KVM's handling of VMCS updates while L2 is active to temporarily
   switch to vmcs01 instead of deferring the update until the next nested
   VM-Exit.  The deferred updates approach directly contributed to several
   bugs, was proving to be a maintenance burden due to the difficulty in
   auditing the correctness of deferred updates, and was polluting
   "struct nested_vmx" with a growing pile of booleans.

 - Handle "hardware APIC ISR", a.k.a. SVI, updates in kvm_apic_update_apicv()
   to consolidate the updates, and to co-locate SVI updates with the updates
   for KVM's own cache of ISR information.

 - Drop a dead function declaration.
2026-02-11 12:45:32 -05:00
Sean Christopherson
1dc6432059 KVM: nVMX: Remove explicit filtering of GUEST_INTR_STATUS from shadow VMCS fields
Drop KVM's filtering of GUEST_INTR_STATUS when generating the shadow VMCS
bitmap now that KVM drops GUEST_INTR_STATUS from the set of supported
vmcs12 fields if the field isn't supported by hardware, and initialization
of the shadow VMCS fields omits unsupported vmcs12 fields.

Note, there is technically a small functional change here, as the vmcs12
filtering only requires support for Virtual Interrupt Delivery, whereas
the shadow VMCS code being removed required "full" APICv support, i.e.
required Virtual Interrupt Delivery *and* APIC Register Virtualization *and*
Posted Interrupt support.

Opportunistically tweak the comment to more precisely explain why the
PML and VMX preemption timer fields need to be explicitly checked.

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://patch.msgid.link/20260115173427.716021-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-26 06:23:56 -08:00
Sean Christopherson
5fdf86e735 KVM: nVMX: Disallow access to vmcs12 fields that aren't supported by "hardware"
Disallow access (VMREAD/VMWRITE), both emulated and via a shadow VMCS, to
VMCS fields that the loaded incarnation of KVM doesn't support, e.g. due
to lack of hardware support, as a middle ground between allowing access to
any vmcs12 field defined by KVM (current behavior) and gating access based
on the userspace-defined vCPU model (the most functionally correct, but
very costly, implementation).

Disallowing access to unsupported fields helps a tiny bit in terms of
closing the virtualization hole (see below), but the main motivation is to
avoid having to weed out unsupported fields when synchronizing between
vmcs12 and a shadow VMCS.  Because shadow VMCS accesses are done via
VMREAD and VMWRITE, KVM _must_ filter out unsupported fields (or eat
VMREAD/VMWRITE failures), and filtering out just shadow VMCS fields is
about the same amount of effort, and arguably much more confusing.

As a bonus, this also fixes a KVM-Unit-Test failure bug when running on
_hardware_ without support for TSC Scaling, which fails with the same
signature as the bug fixed by commit ba1f82456b ("KVM: nVMX: Dynamically
compute max VMCS index for vmcs12"):

  FAIL: VMX_VMCS_ENUM.MAX_INDEX expected: 19, actual: 17

Dynamically computing the max VMCS index only resolved the issue where KVM
was hardcoding max index, but for CPUs with TSC Scaling, that was "good
enough".
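
To illustrate why the reported maximum index depends on which fields are
supported, here is a small sketch.  calc_vmcs_enum() is a hypothetical helper
and not KVM's implementation; the two encodings are taken from the Intel SDM
and serve only as sample data.  A VMCS field encoding carries its index in
bits 9:1, and the VMX_VMCS_ENUM MSR reports the highest index in the same
bit positions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bits 9:1 of a VMCS field encoding hold the field's index. */
#define FIELD_INDEX_SHIFT 1
#define FIELD_INDEX_MASK  0x3feu

/* Encodings from the Intel SDM, used here only as sample data. */
#define ENCLS_EXITING_BITMAP 0x202eu
#define TSC_MULTIPLIER       0x2032u

/*
 * Hypothetical helper: derive the VMX_VMCS_ENUM value from whichever fields
 * are actually supported instead of hardcoding the maximum index.
 */
static uint64_t calc_vmcs_enum(const uint32_t *fields, size_t nr)
{
	uint32_t max_idx = 0;

	assert(fields || !nr);
	for (size_t i = 0; i < nr; i++) {
		uint32_t idx = (fields[i] & FIELD_INDEX_MASK) >> FIELD_INDEX_SHIFT;

		if (idx > max_idx)
			max_idx = idx;
	}
	/* The MSR reports the highest index in bits 9:1. */
	return (uint64_t)max_idx << FIELD_INDEX_SHIFT;
}
```

With these sample encodings the maximum index is 0x19 when the TSC multiplier
field is included and 0x17 when it is dropped.  That would line up with the
failure message above if the test prints indices in hexadecimal, which is an
assumption here.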

Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Xin Li <xin@zytor.com>
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://lore.kernel.org/all/20251026201911.505204-22-xin@zytor.com
Link: https://lore.kernel.org/all/YR2Tf9WPNEzrE7Xg@google.com
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://patch.msgid.link/20260115173427.716021-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-26 06:23:51 -08:00
Sean Christopherson
26304e0e69 KVM: nVMX: Setup VMX MSRs on loading CPU during nested_vmx_hardware_setup()
Move the call to nested_vmx_setup_ctls_msrs() from vmx_hardware_setup() to
nested_vmx_hardware_setup() so that the nested code can deal with ordering
dependencies without having to straddle vmx_hardware_setup() and
nested_vmx_hardware_setup().  Specifically, an upcoming change will
sanitize the vmcs12 fields based on hardware support, and that code needs
to run _before_ the MSRs are configured, because the lovely vmcs_enum MSR
depends on the max supported vmcs12 field.

No functional change intended.

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://patch.msgid.link/20260115173427.716021-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-16 07:47:59 -08:00
Sean Christopherson
249cc1ab4b KVM: nVMX: Switch to vmcs01 to set virtual APICv mode on-demand if L2 is active
If L1's virtual APIC mode changes while L2 is active, e.g. because L1
doesn't intercept writes to the APIC_BASE MSR and L2 changes the mode,
temporarily load vmcs01 and do all of the necessary actions instead of
deferring the update until the next nested VM-Exit.

This will help in fixing yet more issues related to updates while L2 is
active, e.g. KVM neglects to update vmcs02 MSR intercepts if vmcs01's MSR
intercepts are modified while L2 is active.  Not updating x2APIC MSRs is
benign because vmcs01's settings are not factored into vmcs02's bitmap, but
deferring the x2APIC MSR updates would create a weird, inconsistent state.

Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://patch.msgid.link/20260109034532.1012993-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-13 17:35:32 -08:00
Sean Christopherson
51c821d6d0 KVM: nVMX: Switch to vmcs01 to update APIC page on-demand if L2 is active
If the KVM-owned APIC-access page is migrated while L2 is running,
temporarily load vmcs01 and immediately update APIC_ACCESS_ADDR instead
of deferring the update until the next nested VM-Exit.  Once changing
the virtual APIC mode is converted to always do on-demand updates, all
of the "defer until vmcs01 is active" logic will be gone.

Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://patch.msgid.link/20260109034532.1012993-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-13 17:35:32 -08:00
Sean Christopherson
2bf889a68f KVM: nVMX: Switch to vmcs01 to refresh APICv controls on-demand if L2 is active
If APICv is (un)inhibited while L2 is running, temporarily load vmcs01 and
immediately refresh the APICv controls in vmcs01 instead of deferring the
update until the next nested VM-Exit.  This all but eliminates potential
ordering issues due to vmcs01 not being synchronized with
kvm_lapic.apicv_active, e.g. where KVM _thinks_ it refreshed APICv, but
vmcs01 still contains stale state.

Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://patch.msgid.link/20260109034532.1012993-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-13 17:35:31 -08:00
Sean Christopherson
f0044429b2 KVM: nVMX: Switch to vmcs01 to update SVI on-demand if L2 is active
If APICv is activated while L2 is running and triggers an SVI update,
temporarily load vmcs01 and immediately update SVI instead of deferring
the update until the next nested VM-Exit.  This will eventually allow
killing off kvm_apic_update_hwapic_isr(), and all of nVMX's deferred
APICv updates.

Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://patch.msgid.link/20260109034532.1012993-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-13 17:35:31 -08:00
Sean Christopherson
51ca274607 KVM: nVMX: Switch to vmcs01 to update TPR threshold on-demand if L2 is active
If KVM updates L1's TPR Threshold while L2 is active, temporarily load
vmcs01 and immediately update TPR_THRESHOLD instead of deferring the
update until the next nested VM-Exit.  Deferring the TPR Threshold update
is relatively straightforward, but for several APICv related updates,
deferring updates creates ordering and state consistency problems, e.g.
KVM at-large thinks APICv is enabled, but vmcs01 is still running with
stale (and effectively unknown) state.

Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://patch.msgid.link/20260109034532.1012993-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-13 17:35:31 -08:00
Sean Christopherson
3e013d0a70 KVM: nVMX: Switch to vmcs01 to update PML controls on-demand if L2 is active
If KVM toggles "CPU dirty logging", a.k.a. Page-Modification Logging (PML),
while L2 is active, temporarily load vmcs01 and immediately update the
relevant controls instead of deferring the update until the next nested
VM-Exit.  For PML, deferring the update is relatively straightforward, but
for several APICv related updates, deferring updates creates ordering and
state consistency problems, e.g. KVM at-large thinks APICv is enabled, but
vmcs01 is still running with stale (and effectively unknown) state.

Convert PML first precisely because it's the simplest case to handle: if
something is broken with the vmcs01 <=> vmcs02 dance, then hopefully bugs
will bisect here.
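
The "temporarily load vmcs01" approach can be pictured with a toy model.
Everything below is hypothetical and heavily simplified: a VMCS is reduced to
one control word, "loading" a VMCS is a pointer assignment, and
update_pml_in_vmcs01() is not a KVM function.  Only the position of the
"enable PML" control (bit 17 of the secondary execution controls) is
architectural.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily simplified model: each VMCS is one control word. */
struct vmcs {
	uint32_t exec_ctrl;
};

struct vcpu {
	struct vmcs vmcs01;   /* used when L1 runs */
	struct vmcs vmcs02;   /* used when L2 runs */
	struct vmcs *loaded;  /* VMCS currently loaded on the CPU */
};

#define CTRL_ENABLE_PML (1u << 17)

/* Update vmcs01 immediately, even if vmcs02 is the loaded VMCS. */
static void update_pml_in_vmcs01(struct vcpu *v, bool enable)
{
	struct vmcs *prev = v->loaded;

	assert(prev == &v->vmcs01 || prev == &v->vmcs02);

	v->loaded = &v->vmcs01;  /* temporarily switch to vmcs01 */
	if (enable)
		v->loaded->exec_ctrl |= CTRL_ENABLE_PML;
	else
		v->loaded->exec_ctrl &= ~CTRL_ENABLE_PML;
	v->loaded = prev;        /* switch back to whatever was loaded */
}
```

The point of the pattern is that vmcs01 is never stale relative to KVM's
software state, so there is no deferred flag to audit on nested VM-Exit.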

Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://patch.msgid.link/20260109034532.1012993-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-13 17:35:31 -08:00
Sean Christopherson
57dfa61f62 KVM: VMX: Move nested_mark_vmcs12_pages_dirty() to vmx.c, and rename
Move nested_mark_vmcs12_pages_dirty() to vmx.c now that it's only used in
the VM-Exit path, and add "all" to its name to document that its purpose
is to mark all (mapped-out-of-band) vmcs12 pages as dirty.

No functional change intended.

Link: https://patch.msgid.link/20251121223444.355422-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08 11:58:23 -08:00
Sean Christopherson
f74bb1d2ed KVM: nVMX: Precisely mark vAPIC and PID maps dirty when delivering nested PI
Explicitly mark the vmcs12 vAPIC and PI descriptor pages as dirty when
delivering a nested posted interrupt instead of marking all vmcs12 pages
as dirty.  This will allow marking the APIC access page (and any future
vmcs12 pages) as dirty in nested_mark_vmcs12_pages_dirty() without over-
dirtying in the nested PI case.  Manually marking the vAPIC and PID pages
as dirty also makes the flow a bit more self-documenting, e.g. it's not
obvious at first glance that vmx->nested.pi_desc is actually a host kernel
mapping of a vmcs12 page.

No functional change intended.

Link: https://patch.msgid.link/20251121223444.355422-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08 11:58:22 -08:00
Sean Christopherson
70b02809de KVM: x86: Mark vmcs12 pages as dirty if and only if they're mapped
Mark vmcs12 pages as dirty (in KVM's dirty log bitmap) if and only if the
page is mapped, i.e. if the page is actually "active" in vmcs02.  For some
pages, KVM simply disables the associated VMCS control if the vmcs12 page
is unreachable, i.e. it's possible for nested VM-Enter to succeed with a
"bad" vmcs12 page.
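
A minimal sketch of the "if and only if they're mapped" rule, using a
hypothetical struct page_map and a 64-page dirty bitmap rather than KVM's
real mapping and dirty-log structures:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_PAGES 64

/* Hypothetical mapping of a guest page; 'mapped' is false if it was unreachable. */
struct page_map {
	bool mapped;
	unsigned int gfn;  /* guest frame number, must be < NR_PAGES */
};

/* Set the page's bit in the dirty bitmap only if the mapping is live. */
static void mark_dirty_if_mapped(uint64_t *dirty_bitmap,
				 const struct page_map *map)
{
	assert(map->gfn < NR_PAGES);
	if (map->mapped)
		*dirty_bitmap |= 1ULL << map->gfn;
}
```

An unreachable page whose control was disabled is never written through
vmcs02, so in this model it is simply skipped.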

Link: https://patch.msgid.link/20251121223444.355422-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08 11:58:22 -08:00
Sean Christopherson
d374b89edb KVM: VMX: Add mediated PMU support for CPUs without "save perf global ctrl"
Extend mediated PMU support for Intel CPUs without support for saving
PERF_GLOBAL_CONTROL into the guest VMCS field on VM-Exit, e.g. for Skylake
and its derivatives, as well as Icelake.  While supporting CPUs without
VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL isn't completely trivial, it's not that
complex either.  And not supporting such CPUs would mean not supporting 7+
years of Intel CPUs released in the past 10 years.

On VM-Exit, immediately propagate the saved PERF_GLOBAL_CTRL to the VMCS
as well as KVM's software cache so that KVM doesn't need to add full EXREG
tracking of PERF_GLOBAL_CTRL.  In practice, the vast majority of VM-Exits
won't trigger software writes to guest PERF_GLOBAL_CTRL, so deferring the
VMWRITE to the next VM-Enter would only delay the inevitable without
batching/avoiding VMWRITEs.

Note!  Take care to refresh VM_EXIT_MSR_STORE_COUNT on nested VM-Exit, as
it's unfortunately possible that KVM could recalculate MSR intercepts
while L2 is active, e.g. if userspace loads nested state and _then_ sets
PERF_CAPABILITIES.  Eating the VMWRITE on every nested VM-Exit is
unfortunate, but that's a pre-existing problem and can/should be solved
separately, e.g. modifying the number of auto-load entries while L2 is
active is also uncommon on modern CPUs.

Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Manali Shukla <manali.shukla@amd.com>
Link: https://patch.msgid.link/20251206001720.468579-45-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08 11:52:23 -08:00
Sean Christopherson
58f21a0141 KVM: nVMX: Don't update msr_autostore count when saving TSC for vmcs12
Rework nVMX's use of the MSR auto-store list to snapshot TSC to sneak
MSR_IA32_TSC into the list _without_ updating KVM's software tracking,
and drop the generic functionality so that future usage of the store list
for nested specific logic needs to consider the implications of modifying
the list.  Updating the list only for vmcs02 and only on nested VM-Enter
is a disaster waiting to happen, as it means vmcs01 is stale relative to
the software tracking, and KVM could unintentionally leave an MSR in the
store list in perpetuity while running L1, e.g. if KVM addressed the first
issue and updated vmcs01 on nested VM-Exit without removing TSC from the
list.

Furthermore, mixing KVM's desire to save an MSR with L1's desire to save
an MSR results in KVM clobbering/ignoring the needs of vmcs01 or vmcs02.
E.g. if KVM added MSR_IA32_TSC to the store list for its own purposes, and
then _removed_ MSR_IA32_TSC from the list after emulating nested VM-Enter,
then KVM would remove MSR_IA32_TSC from the list even though saving TSC on
VM-Exit from L2 is still desirable (to provide L1 with an accurate TSC).

Similarly, removing an MSR from the list based on vmcs12's settings could
drop an MSR that KVM wants to save for its own purposes.

In practice, the issues are currently benign, because KVM doesn't use the
store list for vmcs01.  But that will change with upcoming mediated PMU
support.

Alternatively, a "full" solution would be to track MSR list entries for
vmcs12 separately from KVM's standard lists, but MSR_IA32_TSC is likely
the only MSR that KVM would ever want to save on _every_ VM-Exit purely
based on vmcs12.  I.e. the added complexity isn't remotely justified at
this time.

Opportunistically escalate from a pr_warn_ratelimited() to a full WARN as
KVM reserves eight entries in each MSR list, and as above KVM uses at most
one entry.

Opportunistically make vmx_find_loadstore_msr_slot() local to vmx.c as
using it directly from nested code is unsafe due to the potential for
mixing vmcs01 and vmcs02 state (see above).

Cc: Jim Mattson <jmattson@google.com>
Tested-by: Manali Shukla <manali.shukla@amd.com>
Link: https://patch.msgid.link/20251206001720.468579-37-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08 11:52:17 -08:00
Sean Christopherson
462f092dc5 KVM: VMX: Drop intermediate "guest" field from msr_autostore
Drop the intermediate "guest" field from vcpu_vmx.msr_autostore as the
value saved on VM-Exit isn't guaranteed to be the guest's value, it's
purely whatever is in hardware at the time of VM-Exit.  E.g. KVM's only
use of the store list at the moment is to snapshot TSC at VM-Exit, and
the value saved is always the raw TSC even if TSC-offsetting and/or
TSC-scaling is enabled for the guest.

And unlike msr_autoload, there is no need to differentiate between "on-entry"
and "on-exit".

No functional change intended.

Cc: Jim Mattson <jmattson@google.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Manali Shukla <manali.shukla@amd.com>
Link: https://patch.msgid.link/20251206001720.468579-36-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08 11:52:17 -08:00
Mingwei Zhang
88ebc2a319 KVM: nVMX: Disable PMU MSR interception as appropriate while running L2
Merge KVM's PMU MSR interception bitmaps with those of L1, i.e. merge the
bitmaps of vmcs01 and vmcs12, e.g. so that KVM doesn't interpose on MSR
accesses unnecessarily if L1 exposes a mediated PMU (or equivalent) to L2.
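
Conceptually, merging the bitmaps means L2 gets direct access to an MSR only
if neither KVM nor L1 wants to intercept it.  The sketch below shows that rule
as a bulk OR over bitmap words; this is a simplification for illustration, as
KVM handles MSRs individually (see the nested_vmx_merge_msr_bitmaps_xxx()
macros), and merge_msr_bitmaps() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * In a VMX MSR bitmap a set bit means "intercept".  L2 may access an MSR
 * directly only if neither KVM (vmcs01) nor L1 (vmcs12) intercepts it, so
 * the merged bitmap for vmcs02 is the bitwise OR of the two.
 */
static void merge_msr_bitmaps(uint64_t *bitmap02, const uint64_t *bitmap01,
			      const uint64_t *bitmap12, size_t nr_words)
{
	assert(bitmap02 && bitmap01 && bitmap12);
	for (size_t i = 0; i < nr_words; i++)
		bitmap02[i] = bitmap01[i] | bitmap12[i];
}
```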

Signed-off-by: Mingwei Zhang <mizhang@google.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
[sean: rewrite changelog and comment, omit MSRs that are always intercepted]
Tested-by: Xudong Hao <xudong.hao@intel.com>
Tested-by: Manali Shukla <manali.shukla@amd.com>
Link: https://patch.msgid.link/20251206001720.468579-32-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08 11:52:14 -08:00
Dapeng Mi
cb58327c4c KVM: nVMX: Add macros to simplify nested MSR interception setting
Add macros nested_vmx_merge_msr_bitmaps_xxx() to simplify nested MSR
interception setting.  No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Tested-by: Manali Shukla <manali.shukla@amd.com>
Link: https://patch.msgid.link/20251206001720.468579-31-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08 11:52:13 -08:00
Paolo Bonzini
0499add8ef Merge tag 'kvm-x86-fixes-6.19-rc1' of https://github.com/kvm-x86/linux into HEAD
KVM fixes for 6.19-rc1

 - Add a missing "break" to fix param parsing in the rseq selftest.

 - Apply runtime updates to the _current_ CPUID when userspace is setting
   CPUID, e.g. as part of vCPU hotplug, to fix a false positive and to avoid
   dropping the pending update.

 - Disallow toggling KVM_MEM_GUEST_MEMFD on an existing memslot, as it's not
   supported by KVM and leads to a use-after-free due to KVM failing to unbind
   the memslot from the previously-associated guest_memfd instance.

 - Harden against similar KVM_MEM_GUEST_MEMFD goofs, and prepare for supporting
   flags-only changes on KVM_MEM_GUEST_MEMFD memlslots, e.g. for dirty logging.

 - Set exit_code[63:32] to -1 (all 0xffs) when synthesizing a nested
   SVM_EXIT_ERR (a.k.a. VMEXIT_INVALID) #VMEXIT, as VMEXIT_INVALID is defined
   as -1ull (a 64-bit value).

 - Update SVI when activating APICv to fix a bug where a post-activation EOI
   for an in-service IRQ would effective be lost due to SVI being stale.

 - Immediately refresh APICv controls (if necessary) on a nested VM-Exit
   instead of deferring the update via KVM_REQ_APICV_UPDATE, as the request is
   effectively ignored because KVM thinks the vCPU already has the correct
   APICv settings.
2025-12-18 18:38:45 +01:00
Dongli Zhang
2976313883 KVM: nVMX: Immediately refresh APICv controls as needed on nested VM-Exit
If an APICv status update was pended while L2 was active, immediately
refresh vmcs01's controls instead of pending KVM_REQ_APICV_UPDATE as
kvm_vcpu_update_apicv() only calls into vendor code if a change is
necessary.

E.g. if APICv is inhibited, and then activated while L2 is running:

  kvm_vcpu_update_apicv()
  |
  -> __kvm_vcpu_update_apicv()
     |
     -> apic->apicv_active = true
      |
      -> vmx_refresh_apicv_exec_ctrl()
         |
         -> vmx->nested.update_vmcs01_apicv_status = true
          |
          -> return

Then L2 exits to L1:

  __nested_vmx_vmexit()
  |
  -> kvm_make_request(KVM_REQ_APICV_UPDATE)

  vcpu_enter_guest(): KVM_REQ_APICV_UPDATE
  -> kvm_vcpu_update_apicv()
     |
     -> __kvm_vcpu_update_apicv()
        |
        -> return // because if (apic->apicv_active == activate)
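
The two call chains above can be modeled with a small state machine. The sketch below is in Python rather than kernel C, and every class, field, and function name in it is a hypothetical stand-in for the flow described here, not KVM's actual code. It only aims to show why the deferred request ends up being a no-op.

```python
class Vcpu:
    """Toy vCPU holding only the state needed to show the bug (hypothetical)."""

    def __init__(self):
        self.apicv_active = False                # what common KVM code believes
        self.vmcs01_apicv = False                # what vmcs01's controls reflect
        self.in_l2 = False                       # True while vmcs02 is loaded
        self.update_vmcs01_apicv_status = False  # deferred-update flag


def refresh_apicv_exec_ctrl(vcpu):
    """Stand-in for vmx_refresh_apicv_exec_ctrl(): defer while L2 is active."""
    if vcpu.in_l2:
        vcpu.update_vmcs01_apicv_status = True
        return
    vcpu.vmcs01_apicv = vcpu.apicv_active


def update_apicv(vcpu, activate):
    """Stand-in for __kvm_vcpu_update_apicv(): no-op if nothing changed."""
    if vcpu.apicv_active == activate:
        return
    vcpu.apicv_active = activate
    refresh_apicv_exec_ctrl(vcpu)


def nested_vmexit(vcpu, immediate_refresh):
    """L2 exits to L1; handle a pending vmcs01 update one of two ways."""
    vcpu.in_l2 = False
    if not vcpu.update_vmcs01_apicv_status:
        return
    vcpu.update_vmcs01_apicv_status = False
    if immediate_refresh:
        refresh_apicv_exec_ctrl(vcpu)            # fixed behavior
    else:
        update_apicv(vcpu, vcpu.apicv_active)    # old behavior: request is a no-op


def vmcs01_apicv_after_exit(immediate_refresh):
    """Activate APICv while L2 runs, exit to L1, and report vmcs01's state."""
    vcpu = Vcpu()
    vcpu.in_l2 = True
    update_apicv(vcpu, True)
    nested_vmexit(vcpu, immediate_refresh)
    return vcpu.vmcs01_apicv
```

In this model the old path leaves vmcs01 stale because the common code already recorded APICv as active, whereas the immediate refresh applies the pending state.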

Reported-by: Chao Gao <chao.gao@intel.com>
Closes: https://lore.kernel.org/all/aQ2jmnN8wUYVEawF@intel.com
Fixes: 7c69661e22 ("KVM: nVMX: Defer APICv updates while L2 is active until L1 is active")
Cc: stable@vger.kernel.org
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
[sean: write changelog]
Link: https://patch.msgid.link/20251205231913.441872-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-12-08 06:56:29 -08:00
Paolo Bonzini
d1e7b4613e KVM VMX changes for 6.19:
 - Use the root role from kvm_mmu_page to construct EPTPs instead of the
   current vCPU state, partly as worthwhile cleanup, but mostly to pave the
   way for tracking per-root TLB flushes so that KVM can elide EPT flushes on
   pCPU migration if KVM has flushed the root at least once.

 - Add a few missing nested consistency checks.

 - Rip out support for doing "early" consistency checks via hardware as the
   functionality hasn't been used in years and is no longer useful in general,
   and replace it with an off-by-default module param to detect missed
   consistency checks (i.e. WARN if hardware finds a check that KVM does not).

 - Fix a currently-benign bug where KVM would drop the guest's SPEC_CTRL[63:32]
   on VM-Enter.

 - Misc cleanups.

Merge tag 'kvm-x86-vmx-6.19' of https://github.com/kvm-x86/linux into HEAD
2025-11-26 09:44:52 +01:00
Paolo Bonzini
e64dcfab57 KVM x86 misc changes for 6.19:
 - Fix an async #PF bug where KVM would clear the completion queue when the
   guest transitioned in and out of paging mode, e.g. when handling an SMI and
   then returning to paged mode via RSM.

 - Fix a bug where TDX would effectively corrupt user-return MSR values if the
   TDX Module rejects VP.ENTER and thus doesn't clobber host MSRs as expected.

 - Leave the user-return notifier used to restore MSRs registered when
   disabling virtualization, and instead pin kvm.ko.  Restoring host MSRs via
   IPI callback is either pointless (clean reboot) or dangerous (forced reboot)
   since KVM has no idea what code it's interrupting.

 - Use the checked version of {get,put}_user(), as Linus wants to kill off the
   unchecked versions, and the checked versions are measurably faster on modern
   CPUs due to the unchecked versions containing an LFENCE.

 - Fix a long-lurking bug where KVM's lack of catch-up logic for periodic APIC
   timers can result in a hard lockup in the host.

 - Revert the periodic kvmclock sync logic now that KVM doesn't use a
   clocksource that's subject to NTP corrections.

 - Clean up KVM's handling of MMIO Stale Data and L1TF, and bury the latter
   behind CONFIG_CPU_MITIGATIONS.

 - Context switch XCR0, XSS, and PKRU outside of the entry/exit fastpath as
   the only reason they were handled in the fastpath was to paper over a bug in
   the core #MC code that has long since been fixed.

 - Add emulator support for AVX MOV instructions to play nice with emulated
   devices whose PCI BARs guest drivers like to access with large multi-byte
   instructions.

Merge tag 'kvm-x86-misc-6.19' of https://github.com/kvm-x86/linux into HEAD
2025-11-26 09:34:21 +01:00
Brendan Jackman
38ee66cb18 KVM: x86: Unify L1TF flushing under per-CPU variable
Currently, the need to flush L1D for L1TF is tracked by two bits: one
per-CPU and one per-vCPU.

The per-vCPU bit is always set when the vCPU shows up on a core, so
there is no interesting state that's truly per-vCPU. Indeed, this is a
requirement, since L1D is a part of the physical CPU.

So simplify this by combining the two bits.

The vCPU bit was being written from preemption-enabled regions.  To play
nice with those cases, wrap all calls from KVM and use a raw write so that
requesting a flush with preemption enabled doesn't trigger what would
effectively be DEBUG_PREEMPT false positives.  Preemption doesn't need to
be disabled, as kvm_arch_vcpu_load() will mark the new CPU as needing a
flush if the vCPU task is migrated, or if userspace runs the vCPU on a
different task.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
[sean: put raw write in KVM instead of in a hardirq.h variant]
Link: https://patch.msgid.link/20251113233746.1703361-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-18 16:22:45 -08:00
Sean Christopherson
9d7dfb95da KVM: VMX: Inject #UD if guest tries to execute SEAMCALL or TDCALL
Add VMX exit handlers for SEAMCALL and TDCALL to inject a #UD if a non-TD
guest attempts to execute SEAMCALL or TDCALL.  Neither SEAMCALL nor TDCALL
is gated by any software enablement other than VMXON, and so will generate
a VM-Exit instead of e.g. a native #UD when executed from the guest kernel.

Note!  No unprivileged DoS of the L1 kernel is possible as TDCALL and
SEAMCALL #GP at CPL > 0, and the CPL check is performed prior to the VMX
non-root (VM-Exit) check, i.e. userspace can't crash the VM. And for a
nested guest, KVM forwards unknown exits to L1, i.e. an L2 kernel can
crash itself, but not L1.

Note #2!  The Intel® Trust Domain CPU Architectural Extensions spec's
pseudocode shows the CPL > 0 check for SEAMCALL coming _after_ the VM-Exit,
but that appears to be a documentation bug (likely because the CPL > 0
check was incorrectly bundled with other lower-priority #GP checks).
Testing on SPR and EMR shows that the CPL > 0 check is performed before
the VMX non-root check, i.e. SEAMCALL #GPs when executed in usermode.

Note #3!  The aforementioned Trust Domain spec uses confusing pseudocode
that says that SEAMCALL will #UD if executed "inSEAM", but "inSEAM"
specifically means in SEAM Root Mode, i.e. in the TDX-Module.  The long-
form description explicitly states that SEAMCALL generates an exit when
executed in "SEAM VMX non-root operation".  But that's a moot point as the
TDX-Module injects #UD if the guest attempts to execute SEAMCALL, as
documented in the "Unconditionally Blocked Instructions" section of the
TDX-Module base specification.

Cc: stable@vger.kernel.org
Cc: Kai Huang <kai.huang@intel.com>
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20251016182148.69085-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-10-20 09:37:04 -07:00
Sean Christopherson
1100e4910a KVM: nVMX: Add an off-by-default module param to WARN on missed consistency checks
Add an off-by-default param, "warn_on_missed_cc", to have KVM WARN on a
missed VMX Consistency Check on nested VM-Enter, specifically so that KVM
developers and maintainers can more easily detect missing checks.  KVM's
goal/intent is that KVM detect *all* VM-Fail conditions in software, as
relying on hardware leads to false passes when KVM's nested support is a
subset of hardware support, e.g. see commit 095686e6fc ("KVM: nVMX:
Check vmcs12->guest_ia32_debugctl on nested VM-Enter").

With one notable exception, KVM now detects all VM-Fail scenarios for
which there is known test coverage, i.e. KVM developers can enable the
param and expect a clean run, and thus can use the param to detect missed
checks, e.g. when enabling new features, when writing new tests, etc.

The one exception is an unfortunate consistency check on vTPR.  Because
the vTPR for L2 comes from the virtual APIC page provided by L1, L2's vTPR
is fully writable at all times, i.e. is inherently subject to TOCTOU
issues with respect to checks in software versus consumption in hardware.
Further complicating matters is KVM's deferred handling of vmcs12 pages
when loading nested state; KVM flat out cannot check vTPR during
KVM_SET_NESTED_STATE without breaking setups that do on-demand paging,
e.g. for live migration and/or live update.

To fudge around the vTPR issue, add a "late" controls check for vTPR and
also treat an invalid virtual APIC as VM-Fail, but gate the check on
warn_on_missed_cc being enabled to avoid unwanted false positives, i.e. to
avoid breaking KVM in production.

Cc: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250919005955.1366256-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-10-17 15:11:27 -07:00
Sean Christopherson
a175da6d43 KVM: nVMX: Remove support for "early" consistency checks via hardware
Remove nested_early_check and all associated code, as it's quite obviously
not being used or tested (it's been broken for 4+ years without a single
bug report).  More importantly, KVM's software-based consistency checks
have matured since the option to do hardware-based checks was added; KVM
appears to be missing only _one_ consistency check, on vTPR.  And even
*more* importantly, that consistency check can't be prevented by an early
hardware check due to L1 being able to modify the virtual APIC at any
time, i.e. there's an inherent TOCTOU flaw that could cause KVM to "miss"
a consistency check VM-Fail, regardless of whether the check is performed
by software or by hardware.

In other words, KVM _must_ be able to unwind from a late VM-Fail (which
was a big motivation for doing early checks).  I.e. now that KVM provides
(almost) all necessary consistency checks, what's really needed is a way
to detect missing checks in KVM, not a way to avoid having to unwind from
a late VM-Fail.  And that can be done much more simply, e.g. by a simple
module param to guard a WARN (which, sadly, must be off-by-default to
avoid splats due to the aforementioned TOCTOU issue).

For all intents and purposes, this reverts commit 52017608da ("KVM:
nVMX: add option to perform early consistency checks via H/W").

Link: https://lore.kernel.org/r/20250919005955.1366256-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-10-17 15:11:27 -07:00
Sean Christopherson
f91699d569 KVM: nVMX: Stuff vmcs02.TSC_MULTIPLIER early on for nested early checks
If KVM is doing "early" nested VM-Enter consistency checks and TSC scaling
is supported, stuff vmcs02's TSC Multiplier early on to avoid getting a
false positive VM-Fail due to trying to do VM-Enter with TSC_MULTIPLIER=0.
To minimize complexity around L1 vs. L2 TSC, KVM sets the actual TSC
Multiplier rather late during VM-Entry, i.e. may have '0' at the time of
early consistency checks.

If vmcs12 has TSC Scaling enabled, use the multiplier from vmcs12 so that
nested early checks actually check vmcs12 state, otherwise throw in an
arbitrary value of '1' (anything non-zero is legal).

Fixes: d041b5ea93 ("KVM: nVMX: Enable nested TSC scaling")
Link: https://lore.kernel.org/r/20250919005955.1366256-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-10-17 15:11:27 -07:00
Sean Christopherson
ae8e6ad841 KVM: nVMX: Add consistency check for TSC_MULTIPLIER=0
Add a missing consistency check on the TSC Multiplier being '0'.  Per the
SDM:

  If the "use TSC scaling" VM-execution control is 1, the TSC-multiplier
  must not be zero.
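
As a rough illustration of the quoted rule, the check reduces to a single conditional. This is a Python sketch, not the kernel's C code; the function name is hypothetical, and the bit position used for "use TSC scaling" (bit 25 of the secondary processor-based controls) is stated here as an assumption to verify against the SDM.

```python
# Assumed position of the "use TSC scaling" secondary execution control.
SECONDARY_EXEC_TSC_SCALING = 1 << 25


def tsc_multiplier_is_consistent(secondary_exec_controls, tsc_multiplier):
    """Return False if TSC scaling is enabled with a zero multiplier."""
    if secondary_exec_controls & SECONDARY_EXEC_TSC_SCALING:
        return tsc_multiplier != 0
    # The multiplier is ignored when the control is clear.
    return True
```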

Fixes: d041b5ea93 ("KVM: nVMX: Enable nested TSC scaling")
Link: https://lore.kernel.org/r/20250919005955.1366256-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-10-17 15:11:26 -07:00
Sean Christopherson
15fe455dd1 KVM: nVMX: Add consistency check for TPR_THRESHOLD[31:4]!=0 without VID
Add a missing consistency check on the TPR Threshold.  Per the SDM:

  If the "use TPR shadow" VM-execution control is 1 and the "virtual-
  interrupt delivery" VM-execution control is 0, bits 31:4 of the TPR
  threshold VM-execution control field must be 0.

Note, nested_vmx_check_tpr_shadow_controls() bails early if "use TPR
shadow" is 0.
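
The rule quoted above can be sketched as follows. This is a Python illustration with a hypothetical function name, taking the two controls as booleans rather than decoding them from the VMCS; it mirrors the early bail-out when "use TPR shadow" is 0.

```python
def tpr_threshold_is_consistent(use_tpr_shadow, virtual_interrupt_delivery,
                                tpr_threshold):
    """Return False if bits 31:4 of the TPR threshold are set without VID."""
    if not use_tpr_shadow:
        # Nothing to check when the TPR shadow is disabled.
        return True
    if not virtual_interrupt_delivery and (tpr_threshold & 0xFFFFFFF0):
        return False
    return True
```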

Link: https://lore.kernel.org/r/20250919005955.1366256-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-10-17 15:11:26 -07:00
Sean Christopherson
a8749281e4 KVM: nVMX: Hardcode dummy EPTP used for early nested consistency checks
Hardcode the dummy EPTP used for "early" consistency checks as there's no
need to use 5-level EPT based on the guest.MAXPHYADDR (the EPTP just needs
to be valid, it's never truly consumed).

This will allow breaking construct_eptp()'s dependency on having access to
the vCPU, which in turn will (much further in the future) allow for eliding
per-root TLB flushes when a vCPU is migrated between pCPUs (a flush is
needed if and only if that particular pCPU hasn't already flushed the vCPU's
roots).

Link: https://lore.kernel.org/r/20250919005955.1366256-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-10-17 15:11:26 -07:00
Xin Li
f505c7b16f KVM: nVMX: Use vcpu instead of vmx->vcpu when vcpu is available
Prefer using vcpu directly when available, instead of accessing it
through vmx->vcpu.

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Link: https://lore.kernel.org/r/20250924145421.2046822-1-xin@zytor.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-10-13 14:52:18 -07:00
Chao Gao
42ae644853 KVM: nVMX: Advertise new VM-Entry/Exit control bits for CET state
Advertise the LOAD_CET_STATE VM-Entry/Exit control bits in the nested VMX
MSRs, as all nested support for CET virtualization, including consistency
checks, is in place.

Advertise support if and only if KVM supports at least one of IBT or SHSTK.
While it's userspace's responsibility to provide a consistent CPU model to
the guest, that doesn't mean KVM should set userspace up to fail.

Note, the existing {CLEAR,LOAD}_BNDCFGS behavior predates
KVM_X86_QUIRK_STUFF_FEATURE_MSRS, i.e. KVM "solved" the inconsistent CPU
model problem by overwriting the VMX MSRs provided by userspace.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-35-seanjc@google.com
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23 09:26:30 -07:00
Chao Gao
62f7533a6b KVM: nVMX: Add consistency checks for CET states
Introduce consistency checks for CET states during nested VM-entry.

A VMCS contains both guest and host CET states, each comprising the
IA32_S_CET MSR, SSP, and IA32_INTERRUPT_SSP_TABLE_ADDR MSR. Various
checks are applied to CET states during VM-entry as documented in SDM
Vol3 Chapter "VM ENTRIES". Implement all these checks during nested
VM-entry to emulate the architectural behavior.

In summary, there are three kinds of checks on guest/host CET states
during VM-entry:

A. Checks applied to both guest states and host states:

 * The IA32_S_CET field must not set any reserved bits; bits 10 (SUPPRESS)
   and 11 (TRACKER) cannot both be set.
 * SSP should not have bits 1:0 set.
 * The IA32_INTERRUPT_SSP_TABLE_ADDR field must be canonical.

B. Checks applied to host states only:

 * IA32_S_CET MSR and SSP must be canonical if the CPU enters 64-bit mode
   after VM-exit. Otherwise, IA32_S_CET and SSP must have their higher 32
   bits cleared.

C. Checks applied to guest states only:

 * IA32_S_CET MSR and SSP are not required to be canonical (i.e., bits 63:N-1
   are identical, where N is the CPU's maximum linear-address width). But,
   bits 63:N of SSP must be identical.
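
The checks in group A can be sketched in a few lines. This is a Python illustration, not the kernel helper: the function names are hypothetical, the reserved-bit mask is passed in by the caller because this changelog does not spell it out, and the default linear-address width of 48 is an assumption.

```python
CET_SUPPRESS = 1 << 10  # bit 10, per the list above
CET_TRACKER = 1 << 11   # bit 11, per the list above


def is_canonical(addr, vaddr_bits=48):
    """True if bits 63:N-1 of addr are all identical (N = vaddr_bits)."""
    addr &= (1 << 64) - 1
    upper = addr >> (vaddr_bits - 1)
    all_ones = (1 << (64 - vaddr_bits + 1)) - 1
    return upper == 0 or upper == all_ones


def common_cet_state_ok(s_cet, ssp, ssp_table, reserved_mask, vaddr_bits=48):
    """Checks applied to both guest and host CET state (group A)."""
    if s_cet & reserved_mask:
        return False
    if (s_cet & CET_SUPPRESS) and (s_cet & CET_TRACKER):
        return False
    if ssp & 0x3:
        return False
    return is_canonical(ssp_table, vaddr_bits)
```

The host-only and guest-only checks in groups B and C would layer on top of this, e.g. by additionally calling `is_canonical()` on the host's IA32_S_CET and SSP when the host is 64-bit.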

Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-34-seanjc@google.com
[sean: have common helper return 0/-EINVAL, not true/false]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23 09:25:02 -07:00
Chao Gao
8060b2bd2d KVM: nVMX: Add consistency checks for CR0.WP and CR4.CET
Add consistency checks for CR4.CET and CR0.WP in guest-state or host-state
area in the VMCS12. This ensures that configurations with CR4.CET set and
CR0.WP not set result in VM-entry failure, aligning with architectural
behavior.
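
A minimal sketch of the rule, in Python with a hypothetical function name: CR4.CET (bit 23) may only be set when CR0.WP (bit 16) is also set. The same predicate would apply to both the guest-state and host-state CR0/CR4 pairs.

```python
X86_CR0_WP = 1 << 16
X86_CR4_CET = 1 << 23


def cr0_cr4_cet_ok(cr0, cr4):
    """Return False for the illegal combination CR4.CET=1 with CR0.WP=0."""
    if cr4 & X86_CR4_CET:
        return bool(cr0 & X86_CR0_WP)
    return True
```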

Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-33-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23 09:24:35 -07:00
Yang Weijiang
625884996b KVM: nVMX: Prepare for enabling CET support for nested guest
Set up CET MSRs, related VM_ENTRY/EXIT control bits and fixed CR4 setting
to enable CET for nested VM.

vmcs12 and vmcs02 need to be synced when L2 exits to L1 or when L1 wants
to resume L2, so that the correct CET states can be observed by one another.

Please note that consistency checks regarding CET state during VM-Entry
will be added later to prevent this patch from becoming too large.
Advertising the new CET VM_ENTRY/EXIT control bits is also deferred
until after the consistency checks are added.

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Xin Li (Intel) <xin@zytor.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-32-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23 09:24:30 -07:00
Yang Weijiang
033cc166f0 KVM: nVMX: Virtualize NO_HW_ERROR_CODE_CC for L1 event injection to L2
Per the SDM (Vol. 3D, Appendix A.1):
"If bit 56 is read as 1, software can use VM entry to deliver a hardware
exception with or without an error code, regardless of vector"

Modify the has_error_code check performed before injecting events into a
nested guest.  Only enforce the check when the guest is in real mode, the
event is not a hardware exception, or the platform doesn't enumerate bit 56
in VMX_BASIC; in all other cases, ignore the check to be consistent with
the SDM.
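
One possible reading of the rule, sketched in Python. The function name, parameter names, and the set of vectors that architecturally push an error code are illustrative assumptions, not taken from this patch.

```python
# Assumed set of exception vectors that deliver an error code
# (#DF, #TS, #NP, #SS, #GP, #PF, #AC, #CP); illustrative only.
VECTORS_WITH_ERROR_CODE = {8, 10, 11, 12, 13, 14, 17, 21}


def error_code_flag_ok(has_error_code, vector, prot_mode, is_hw_exception,
                       no_hw_error_code_cc):
    """Consistency check on the 'deliver error code' bit for event injection."""
    if not prot_mode or not is_hw_exception:
        # Real mode or a non-hardware-exception event: no error code allowed.
        return not has_error_code
    if no_hw_error_code_cc:
        # Bit 56 of VMX_BASIC enumerated: any combination is accepted.
        return True
    # Otherwise the flag must match what the vector architecturally expects.
    return has_error_code == (vector in VECTORS_WITH_ERROR_CODE)
```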

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-31-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23 09:24:11 -07:00
Sean Christopherson
19e6e083f3 KVM: nVMX: Always forward XSAVES/XRSTORS exits from L2 to L1
Unconditionally forward XSAVES/XRSTORS VM-Exits from L2 to L1, as KVM
doesn't utilize the XSS-bitmap (KVM relies on controlling the XSS value
in hardware to prevent unauthorized access to XSAVES state).  KVM always
loads vmcs02 with vmcs12's bitmap, and so any exit _must_ be due to
vmcs12's XSS-bitmap.

Drop the comment about XSS never being non-zero in anticipation of
enabling CET_KERNEL and CET_USER support.

Opportunistically WARN if XSAVES is not enabled for L2, as the CPU is
supposed to generate #UD before checking the XSS-bitmap.

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-25-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23 09:18:28 -07:00
Yang Weijiang
d2dcf25a4c KVM: x86: Rename kvm_{g,s}et_msr()* to show that they emulate guest accesses
Rename
	kvm_{g,s}et_msr_with_filter()
	kvm_{g,s}et_msr()
to
	kvm_emulate_msr_{read,write}
	__kvm_emulate_msr_{read,write}

to make it more obvious that KVM uses these helpers to emulate guest
behaviors, i.e., host_initiated == false in these helpers.

Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://lore.kernel.org/r/20250812025606.74625-2-chao.gao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-19 11:59:48 -07:00
Xin Li
885df2d210 KVM: x86: Add support for RDMSR/WRMSRNS w/ immediate on Intel
Add support for the immediate forms of RDMSR and WRMSRNS (currently
Intel-only).  The immediate variants are only valid in 64-bit mode, and
use a single general purpose register for the data (the register is also
encoded in the instruction, i.e. not implicit like regular RDMSR/WRMSR).

The immediate variants are primarily motivated by performance, not code
size: by having the MSR index in an immediate, it is available *much*
earlier in the CPU pipeline, which allows hardware much more leeway about
how a particular MSR is handled.

Intel VMX support for the immediate forms of MSR accesses communicates
exit information to the host as follows:

  1) The immediate form of RDMSR uses VM-Exit Reason 84.

  2) The immediate form of WRMSRNS uses VM-Exit Reason 85.

  3) For both VM-Exit reasons 84 and 85, the Exit Qualification field is
     set to the MSR index that triggered the VM-Exit.

  4) Bits 3 ~ 6 of the VM-Exit Instruction Information field are set to
     the register encoding used by the immediate form of the instruction,
     i.e. the destination register for RDMSR, and the source for WRMSRNS.

  5) The VM-Exit Instruction Length field records the size of the
     immediate form of the MSR instruction.
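
The exit information in items 1 through 4 could be decoded along these lines. This is a Python sketch with hypothetical names; truncating the exit qualification to 32 bits for the MSR index is an assumption made for the example.

```python
EXIT_REASON_MSR_READ_IMM = 84
EXIT_REASON_MSR_WRITE_IMM = 85


def decode_msr_imm_exit(exit_reason, exit_qualification, instr_info):
    """Extract direction, MSR index, and GPR encoding from an immediate-form exit."""
    if exit_reason not in (EXIT_REASON_MSR_READ_IMM, EXIT_REASON_MSR_WRITE_IMM):
        raise ValueError("not an immediate-form MSR exit")
    return {
        "is_write": exit_reason == EXIT_REASON_MSR_WRITE_IMM,
        "msr": exit_qualification & 0xFFFFFFFF,  # MSR index (assumed 32 bits)
        "reg": (instr_info >> 3) & 0xF,          # bits 6:3, register encoding
    }
```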

To deal with userspace RDMSR exits, stash the destination register in a
new kvm_vcpu_arch field, similar to cui_linear_rip, pio, etc.
Alternatively, the register could be saved in kvm_run.msr or re-retrieved
from the VMCS, but the former would require sanitizing the value to ensure
userspace doesn't clobber the value to an out-of-bounds index, and the
latter would require a new one-off kvm_x86_ops hook.

Don't bother adding support for the instructions in KVM's emulator, as the
only way for RDMSR/WRMSR to be encountered is if KVM is emulating large
swaths of code due to invalid guest state, and a vCPU cannot have invalid
guest state while in 64-bit mode.

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
[sean: minor tweaks, massage and expand changelog]
Link: https://lore.kernel.org/r/20250805202224.1475590-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-19 11:59:46 -07:00
Sean Christopherson
43f5bea263 KVM: x86/pmu: Add wrappers for counting emulated instructions/branches
Add wrappers for triggering instruction retired and branch retired PMU
events in anticipation of reworking the internal mechanisms to track
which PMCs need to be evaluated, e.g. to avoid having to walk and check
every PMC.

Opportunistically bury "struct kvm_pmu_emulated_event_selectors" in pmu.c.

No functional change intended.

Link: https://lore.kernel.org/r/20250805190526.1453366-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-19 11:59:37 -07:00
Jim Mattson
a7cec20845 KVM: x86: Provide a capability to disable APERF/MPERF read intercepts
Allow a guest to read the physical IA32_APERF and IA32_MPERF MSRs
without interception.

The IA32_APERF and IA32_MPERF MSRs are not virtualized. Writes are not
handled at all. The MSR values are not zeroed on vCPU creation, saved
on suspend, or restored on resume. No accommodation is made for
processor migration or for sharing a logical processor with other
tasks. No adjustments are made for non-unit TSC multipliers. The MSRs
do not account for time the same way as the comparable PMU events,
whether the PMU is virtualized by the traditional emulation method or
the new mediated pass-through approach.

Nonetheless, in a properly constrained environment, this capability
can be combined with a guest CPUID table that advertises support for
CPUID.6:ECX.APERFMPERF[bit 0] to induce a Linux guest to report the
effective physical CPU frequency in /proc/cpuinfo. Moreover, there is
no performance cost for this capability.
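
For context, the effective frequency is conventionally derived from the ratio of APERF and MPERF deltas over an interval, scaled by the base frequency. The Python sketch below shows that arithmetic under the assumption that both counters are 64 bits wide and may wrap; the function name and the kHz unit are choices made for the example.

```python
MASK64 = (1 << 64) - 1


def effective_frequency_khz(base_khz, aperf_start, mperf_start,
                            aperf_end, mperf_end):
    """Scale the base frequency by delta(APERF) / delta(MPERF)."""
    d_aperf = (aperf_end - aperf_start) & MASK64  # tolerate counter wrap
    d_mperf = (mperf_end - mperf_start) & MASK64
    if d_mperf == 0:
        return None  # no reference cycles elapsed; ratio is undefined
    return base_khz * d_aperf // d_mperf
```

Given the caveats listed above (no migration handling, no TSC-multiplier adjustment), a guest computing this ratio only gets meaningful results in a tightly constrained setup.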

Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250530185239.2335185-3-jmattson@google.com
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250626001225.744268-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-09 09:33:37 -07:00
Maxim Levitsky
6b1dd26544 KVM: VMX: Preserve host's DEBUGCTLMSR_FREEZE_IN_SMM while running the guest
Set/clear DEBUGCTLMSR_FREEZE_IN_SMM in GUEST_IA32_DEBUGCTL based on the
host's pre-VM-Enter value, i.e. preserve the host's FREEZE_IN_SMM setting
while running the guest.  When running with the "default treatment of SMIs"
in effect (the only mode KVM supports), SMIs do not generate a VM-Exit that
is visible to host (non-SMM) software, and instead transition directly
from VMX non-root to SMM.  And critically, DEBUGCTL isn't context switched
by hardware on SMI or RSM, i.e. SMM will run with whatever value was
resident in hardware at the time of the SMI.

Failure to preserve FREEZE_IN_SMM results in the PMU unexpectedly counting
events while the CPU is executing in SMM, which can pollute profiling and
potentially leak information into the guest.

Check for changes in FREEZE_IN_SMM prior to every entry into KVM's inner
run loop, as the bit can be toggled in IRQ context via IPI callback (SMP
function call), by way of /sys/devices/cpu/freeze_on_smi.

Add a field in kvm_x86_ops to communicate which DEBUGCTL bits need to be
preserved, as FREEZE_IN_SMM is only supported and defined for Intel CPUs,
i.e. explicitly checking FREEZE_IN_SMM in common x86 is at best weird, and
at worst could lead to undesirable behavior in the future if AMD CPUs ever
happened to pick up a collision with the bit.
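
As an illustration only, the change detection and the per-vendor mask
described above might be sketched as follows. This is a standalone userspace
model, not the KVM code: struct vcpu_sketch, preserve_mask and
host_debugctl_changed() are hypothetical names, and the only detail taken
from the kernel headers is that FREEZE_IN_SMM is bit 14 of DEBUGCTL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* FREEZE_IN_SMM is bit 14 of IA32_DEBUGCTL. */
#define DEBUGCTL_FREEZE_IN_SMM (1ULL << 14)

/* Hypothetical per-vCPU state for this sketch. */
struct vcpu_sketch {
    uint64_t host_debugctl;  /* host value seen at the previous entry */
    uint64_t preserve_mask;  /* vendor-supplied bits to preserve (0 if none) */
};

/*
 * Called before each entry into the inner run loop. Returns true if a
 * preserved host bit changed since the last entry, i.e. if the value loaded
 * for the guest must be refreshed. The snapshot is always updated.
 */
static bool host_debugctl_changed(struct vcpu_sketch *v, uint64_t current_host)
{
    bool reload = ((current_host ^ v->host_debugctl) & v->preserve_mask) != 0;

    v->host_debugctl = current_host;
    return reload;
}
```

In this model a vendor that preserves nothing passes a mask of 0, so the
common code never has to name FREEZE_IN_SMM itself.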

Exempt TDX vCPUs, i.e. protected guests, from the check, as the TDX Module
owns and controls GUEST_IA32_DEBUGCTL.

WARN in SVM if KVM_RUN_LOAD_DEBUGCTL is set, mostly to document that the
lack of handling isn't a KVM bug (TDX already WARNs on any run_flag).

Lastly, explicitly reload GUEST_IA32_DEBUGCTL on a VM-Fail that is missed
by KVM but detected by hardware, i.e. in nested_vmx_restore_host_state().
Doing so avoids the need to track host_debugctl on a per-VMCS basis, as
GUEST_IA32_DEBUGCTL is unconditionally written by prepare_vmcs02() and
load_vmcs12_host_state().  For the VM-Fail case, even though KVM won't
have actually entered the guest, vcpu_enter_guest() will have run with
vmcs02 active and thus could result in vmcs01 being run with a stale value.

Cc: stable@vger.kernel.org
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250610232010.162191-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:05:24 -07:00
Maxim Levitsky
7d0cce6cbe KVM: VMX: Wrap all accesses to IA32_DEBUGCTL with getter/setter APIs
Introduce vmx_guest_debugctl_{read,write}() to handle all accesses to
vmcs.GUEST_IA32_DEBUGCTL. This will allow stuffing FREEZE_IN_SMM into
GUEST_IA32_DEBUGCTL based on the host setting without bleeding the state
into the guest, and without needing to copy+paste the FREEZE_IN_SMM
logic into every path that accesses GUEST_IA32_DEBUGCTL.
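
A minimal sketch of how such accessors could eventually hide the host bit
from the guest, assuming a plain struct in place of the real VMCS. The names
vmcs_sketch, guest_debugctl_read() and guest_debugctl_write() are
hypothetical and are not the helpers added by this patch, which by itself
only wraps the existing accesses.

```c
#include <assert.h>
#include <stdint.h>

/* FREEZE_IN_SMM is bit 14 of IA32_DEBUGCTL. */
#define DEBUGCTL_FREEZE_IN_SMM (1ULL << 14)

/* Hypothetical stand-in for the VMCS field GUEST_IA32_DEBUGCTL. */
struct vmcs_sketch {
    uint64_t guest_ia32_debugctl;
};

/* Getter: never expose the host's FREEZE_IN_SMM setting to the guest. */
static uint64_t guest_debugctl_read(const struct vmcs_sketch *vmcs)
{
    return vmcs->guest_ia32_debugctl & ~DEBUGCTL_FREEZE_IN_SMM;
}

/* Setter: drop any guest-provided bit, then merge in the host's bit. */
static void guest_debugctl_write(struct vmcs_sketch *vmcs,
                                 uint64_t host_debugctl, uint64_t val)
{
    val &= ~DEBUGCTL_FREEZE_IN_SMM;
    val |= host_debugctl & DEBUGCTL_FREEZE_IN_SMM;
    vmcs->guest_ia32_debugctl = val;
}
```

Funnelling every access through one pair of helpers means the masking lives
in exactly two places rather than at each call site.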

No functional change intended.

Cc: stable@vger.kernel.org
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
[sean: massage changelog, make inline, use in all prepare_vmcs02() cases]
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250610232010.162191-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:05:24 -07:00
Maxim Levitsky
095686e6fc KVM: nVMX: Check vmcs12->guest_ia32_debugctl on nested VM-Enter
Add a consistency check for L2's guest_ia32_debugctl, as KVM only supports
a subset of hardware functionality, i.e. KVM can't rely on hardware to
detect illegal/unsupported values.  Failure to check the vmcs12 value
would allow the guest to load any hardware-supported value while running L2.

Take care to exempt BTF and LBR from the validity check in order to match
KVM's behavior for writes via WRMSR, but without clobbering vmcs12.  Even
if VM_EXIT_SAVE_DEBUG_CONTROLS is set in vmcs12, L1 can reasonably expect
that vmcs12->guest_ia32_debugctl will not be modified if writes to the MSR
are being intercepted.
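
The shape of such a check, with BTF and LBR exempted and the input left
unmodified, might look like the sketch below. It is an assumption-laden
userspace model: debugctl_valid_for_nested_entry() and its "supported"
parameter are hypothetical, and only the bit positions of LBR (bit 0) and
BTF (bit 1) come from the architectural MSR layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEBUGCTL_LBR (1ULL << 0)
#define DEBUGCTL_BTF (1ULL << 1)

/*
 * Returns true if val may be loaded for L2. "supported" is the set of bits
 * the hypervisor is willing to virtualize. BTF and LBR are tolerated even
 * when unsupported, and val itself is never modified, so the caller's copy
 * (vmcs12 in the text above) is not clobbered.
 */
static bool debugctl_valid_for_nested_entry(uint64_t val, uint64_t supported)
{
    uint64_t invalid = val & ~supported;

    /* Exempt BTF and LBR from the validity check. */
    invalid &= ~(DEBUGCTL_BTF | DEBUGCTL_LBR);

    return invalid == 0;
}
```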

Arguably, KVM _should_ update vmcs12 if VM_EXIT_SAVE_DEBUG_CONTROLS is set
*and* writes to MSR_IA32_DEBUGCTLMSR are not being intercepted by L1, but
that would incur non-trivial complexity and wouldn't change the fact that
KVM's handling of DEBUGCTL is blatantly broken.  I.e. the extra complexity
is not worth carrying.

Cc: stable@vger.kernel.org
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250610232010.162191-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:05:23 -07:00