KVM: x86: Defer non-architectural deliver of exception payload to userspace read

When attempting to play nice with userspace that hasn't enabled
KVM_CAP_EXCEPTION_PAYLOAD, defer KVM's non-architectural delivery of the
payload until userspace actually reads relevant vCPU state, and more
importantly, force delivery of the payload in *all* paths where userspace
saves relevant vCPU state, not just KVM_GET_VCPU_EVENTS.

Ignoring userspace save/restore for the moment, delivering the payload
before the exception is injected is wrong regardless of whether L1 or L2
is running.  To make matters even more confusing, the flaw *currently*
being papered over by the !is_guest_mode() check isn't even the same bug
that commit da998b46d2 ("kvm: x86: Defer setting of CR2 until #PF
delivery") was trying to avoid.

At the time of commit da998b46d2, KVM didn't correctly handle exception
intercepts, as KVM would wait until VM-Entry into L2 was imminent to check
if the queued exception should morph to a nested VM-Exit.  I.e. KVM would
deliver the payload to L2 and then synthesize a VM-Exit into L1.  But the
payload was only the most blatant issue, e.g. waiting to check exception
intercepts would also lead to KVM incorrectly escalating a
should-be-intercepted #PF into a #DF.

That underlying bug was eventually fixed by commit 7709aba8f7 ("KVM: x86:
Morph pending exceptions to pending VM-Exits at queue time"), but in the
interim, commit a06230b62b ("KVM: x86: Deliver exception payload on
KVM_GET_VCPU_EVENTS") came along and subtly added another dependency on
the !is_guest_mode() check.

While not recorded in the changelog, the motivation for deferring the
!exception_payload_enabled delivery was to fix a flaw where a synthesized
MTF (Monitor Trap Flag) VM-Exit would drop a pending #DB and clobber DR6.
On a VM-Exit, VMX CPUs save pending #DB information into the VMCS, which
is emulated by KVM in nested_vmx_update_pending_dbg() by grabbing the
payload from the queue/pending exception.  I.e. prematurely delivering the
payload would cause the pending #DB to not be recorded in the VMCS, and of
course, clobber L2's DR6 as seen by L1.

Jumping back to save+restore, the quirked behavior of forcing delivery of
the payload only works if userspace does KVM_GET_VCPU_EVENTS *before*
CR2 or DR6 is saved, i.e. before KVM_GET_SREGS{,2} and KVM_GET_DEBUGREGS.
E.g. if userspace does KVM_GET_SREGS before KVM_GET_VCPU_EVENTS, then the
CR2 saved by userspace won't contain the payload for the exception save by
KVM_GET_VCPU_EVENTS.

Deliberately deliver the payload in the store_regs() path, as it's the
least awful option even though userspace may not be doing save+restore.
Because if userspace _is_ doing save restore, it could elide KVM_GET_SREGS
knowing that SREGS were already saved when the vCPU exited.

Link: https://lore.kernel.org/all/20200207103608.110305-1-oupton@google.com
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: stable@vger.kernel.org
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Tested-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20260218005438.2619063-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
This commit is contained in:
Sean Christopherson 2026-02-17 16:54:38 -08:00
parent 5c247d08bc
commit d0ad1b05bb

View File

@ -864,9 +864,6 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
vcpu->arch.exception.error_code = error_code;
vcpu->arch.exception.has_payload = has_payload;
vcpu->arch.exception.payload = payload;
if (!is_guest_mode(vcpu))
kvm_deliver_exception_payload(vcpu,
&vcpu->arch.exception);
return;
}
@ -5531,18 +5528,8 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
return 0;
}
static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
struct kvm_vcpu_events *events)
static struct kvm_queued_exception *kvm_get_exception_to_save(struct kvm_vcpu *vcpu)
{
struct kvm_queued_exception *ex;
process_nmi(vcpu);
#ifdef CONFIG_KVM_SMM
if (kvm_check_request(KVM_REQ_SMI, vcpu))
process_smi(vcpu);
#endif
/*
* KVM's ABI only allows for one exception to be migrated. Luckily,
* the only time there can be two queued exceptions is if there's a
@ -5553,21 +5540,46 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
if (vcpu->arch.exception_vmexit.pending &&
!vcpu->arch.exception.pending &&
!vcpu->arch.exception.injected)
ex = &vcpu->arch.exception_vmexit;
else
ex = &vcpu->arch.exception;
return &vcpu->arch.exception_vmexit;
return &vcpu->arch.exception;
}
static void kvm_handle_exception_payload_quirk(struct kvm_vcpu *vcpu)
{
struct kvm_queued_exception *ex = kvm_get_exception_to_save(vcpu);
/*
* In guest mode, payload delivery should be deferred if the exception
* will be intercepted by L1, e.g. KVM should not modifying CR2 if L1
* intercepts #PF, ditto for DR6 and #DBs. If the per-VM capability,
* KVM_CAP_EXCEPTION_PAYLOAD, is not set, userspace may or may not
* propagate the payload and so it cannot be safely deferred. Deliver
* the payload if the capability hasn't been requested.
* If KVM_CAP_EXCEPTION_PAYLOAD is disabled, then (prematurely) deliver
* the pending exception payload when userspace saves *any* vCPU state
* that interacts with exception payloads to avoid breaking userspace.
*
* Architecturally, KVM must not deliver an exception payload until the
* exception is actually injected, e.g. to avoid losing pending #DB
* information (which VMX tracks in the VMCS), and to avoid clobbering
* state if the exception is never injected for whatever reason. But
* if KVM_CAP_EXCEPTION_PAYLOAD isn't enabled, then userspace may or
* may not propagate the payload across save+restore, and so KVM can't
* safely defer delivery of the payload.
*/
if (!vcpu->kvm->arch.exception_payload_enabled &&
ex->pending && ex->has_payload)
kvm_deliver_exception_payload(vcpu, ex);
}
static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
struct kvm_vcpu_events *events)
{
struct kvm_queued_exception *ex = kvm_get_exception_to_save(vcpu);
process_nmi(vcpu);
#ifdef CONFIG_KVM_SMM
if (kvm_check_request(KVM_REQ_SMI, vcpu))
process_smi(vcpu);
#endif
kvm_handle_exception_payload_quirk(vcpu);
memset(events, 0, sizeof(*events));
@ -5746,6 +5758,8 @@ static int kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu,
vcpu->arch.guest_state_protected)
return -EINVAL;
kvm_handle_exception_payload_quirk(vcpu);
memset(dbgregs, 0, sizeof(*dbgregs));
BUILD_BUG_ON(ARRAY_SIZE(vcpu->arch.db) != ARRAY_SIZE(dbgregs->db));
@ -12136,6 +12150,8 @@ static void __get_sregs_common(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
if (vcpu->arch.guest_state_protected)
goto skip_protected_regs;
kvm_handle_exception_payload_quirk(vcpu);
kvm_get_segment(vcpu, &sregs->cs, VCPU_SREG_CS);
kvm_get_segment(vcpu, &sregs->ds, VCPU_SREG_DS);
kvm_get_segment(vcpu, &sregs->es, VCPU_SREG_ES);