Merge tag 'kvm-x86-fixes-6.16-rcN' of https://github.com/kvm-x86/linux into HEAD

KVM x86 fixes for 6.16-rcN

- Reject SEV{-ES} intra-host migration if one or more vCPUs are actively
  being created so as not to create a non-SEV{-ES} vCPU in an SEV{-ES} VM.

- Use a pre-allocated, per-vCPU buffer for handling de-sparsified vCPU
  masks when emulating Hyper-V hypercalls to fix a "stack frame too large"
  issue.

- Allow out-of-range/invalid Xen event channel ports when configuring IRQ
  routing to avoid dictating a specific ioctl() ordering to userspace.

- Conditionally reschedule when setting memory attributes to avoid soft
  lockups when userspace converts huge swaths of memory to/from private.

- Add back MWAIT as a required feature for the MONITOR/MWAIT selftest.

- Add a missing field in struct sev_data_snp_launch_start that resulted in
  the guest-visible workarounds field being filled at the wrong offset.

- Skip non-canonical address when processing Hyper-V PV TLB flushes to
  avoid VM-Fail on INVVPID.

- Advertise supported TDX TDVMCALLs to userspace.
2026-05-24 23:22:31 +02:00 · 2025-07-08 10:49:19 -04:00 · 2025-07-08 10:49:19 -04:00 · 5383fc057a
commit 5383fc057a
parent 7e7a7bf212 fa787ac07b
13 changed files with 116 additions and 22 deletions
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7196,6 +7196,10 @@ The valid value for 'flags' is:
 					u64 leaf;
 					u64 r11, r12, r13, r14;
 				} get_tdvmcall_info;
+				struct {
+					u64 ret;
+					u64 vector;
+				} setup_event_notify;
 			};
 		} tdx;

@@ -7210,21 +7214,24 @@ number from register R11.  The remaining field of the union provide the
 inputs and outputs of the TDVMCALL.  Currently the following values of
 ``nr`` are defined:

-* ``TDVMCALL_GET_QUOTE``: the guest has requested to generate a TD-Quote
-signed by a service hosting TD-Quoting Enclave operating on the host.
-Parameters and return value are in the ``get_quote`` field of the union.
-The ``gpa`` field and ``size`` specify the guest physical address
-(without the shared bit set) and the size of a shared-memory buffer, in
-which the TDX guest passes a TD Report.  The ``ret`` field represents
-the return value of the GetQuote request.  When the request has been
-queued successfully, the TDX guest can poll the status field in the
-shared-memory area to check whether the Quote generation is completed or
-not. When completed, the generated Quote is returned via the same buffer.
+ * ``TDVMCALL_GET_QUOTE``: the guest has requested to generate a TD-Quote
+   signed by a service hosting TD-Quoting Enclave operating on the host.
+   Parameters and return value are in the ``get_quote`` field of the union.
+   The ``gpa`` field and ``size`` specify the guest physical address
+   (without the shared bit set) and the size of a shared-memory buffer, in
+   which the TDX guest passes a TD Report.  The ``ret`` field represents
+   the return value of the GetQuote request.  When the request has been
+   queued successfully, the TDX guest can poll the status field in the
+   shared-memory area to check whether the Quote generation is completed or
+   not. When completed, the generated Quote is returned via the same buffer.

-* ``TDVMCALL_GET_TD_VM_CALL_INFO``: the guest has requested the support
-status of TDVMCALLs.  The output values for the given leaf should be
-placed in fields from ``r11`` to ``r14`` of the ``get_tdvmcall_info``
-field of the union.
+ * ``TDVMCALL_GET_TD_VM_CALL_INFO``: the guest has requested the support
+   status of TDVMCALLs.  The output values for the given leaf should be
+   placed in fields from ``r11`` to ``r14`` of the ``get_tdvmcall_info``
+   field of the union.
+
+ * ``TDVMCALL_SETUP_EVENT_NOTIFY_INTERRUPT``: the guest has requested to
+   set up a notification interrupt for vector ``vector``.

 KVM may add support for more values in the future that may cause a userspace
 exit, even without calls to ``KVM_ENABLE_CAP`` or similar.  In this case,
--- a/Documentation/virt/kvm/x86/intel-tdx.rst
+++ b/Documentation/virt/kvm/x86/intel-tdx.rst
@@ -79,7 +79,20 @@ to be configured to the TDX guest.
  struct kvm_tdx_capabilities {
        __u64 supported_attrs;
        __u64 supported_xfam;
-        __u64 reserved[254];
+
+        /* TDG.VP.VMCALL hypercalls executed in kernel and forwarded to
+         * userspace, respectively
+         */
+        __u64 kernel_tdvmcallinfo_1_r11;
+        __u64 user_tdvmcallinfo_1_r11;
+
+        /* TDG.VP.VMCALL instruction executions subfunctions executed in kernel
+         * and forwarded to userspace, respectively
+         */
+        __u64 kernel_tdvmcallinfo_1_r12;
+        __u64 user_tdvmcallinfo_1_r12;
+
+        __u64 reserved[250];

        /* Configurable CPUID bits for userspace */
        struct kvm_cpuid2 cpuid;
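
As a rough illustration of how userspace might consume the new fields, the sketch below tests bits in the `user_tdvmcallinfo_1_r11` mask. The bit positions mirror the `TDVMCALLINFO_*` defines added to arch/x86/kvm/vmx/tdx.c in this merge; the `SKETCH_`/`sketch_` names and the helper itself are hypothetical and not part of the KVM uAPI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions copied from the TDVMCALLINFO_* defines in this merge. */
#define SKETCH_TDVMCALLINFO_GET_QUOTE                    (UINT64_C(1) << 0)
#define SKETCH_TDVMCALLINFO_SETUP_EVENT_NOTIFY_INTERRUPT (UINT64_C(1) << 1)

static_assert((SKETCH_TDVMCALLINFO_GET_QUOTE &
	       SKETCH_TDVMCALLINFO_SETUP_EVENT_NOTIFY_INTERRUPT) == 0,
	      "each TDVMCALL must own a distinct bit");

/*
 * Hypothetical helper: returns true if every bit in 'wanted' is set in the
 * mask that KVM reported through user_tdvmcallinfo_1_r11.
 */
static bool sketch_tdvmcall_forwarded(uint64_t user_r11, uint64_t wanted)
{
	return (user_r11 & wanted) == wanted;
}
```

A VMM could run such a check once after querying the capabilities and only then advertise the corresponding TDVMCALLs to the guest.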
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -700,8 +700,13 @@ struct kvm_vcpu_hv {

 	struct kvm_vcpu_hv_tlb_flush_fifo tlb_flush_fifo[HV_NR_TLB_FLUSH_FIFOS];

-	/* Preallocated buffer for handling hypercalls passing sparse vCPU set */
+	/*
+	 * Preallocated buffers for handling hypercalls that pass sparse vCPU
+	 * sets (for high vCPU counts, they're too large to comfortably fit on
+	 * the stack).
+	 */
 	u64 sparse_banks[HV_MAX_SPARSE_VCPU_BANKS];
+	DECLARE_BITMAP(vcpu_mask, KVM_MAX_VCPUS);

 	struct hv_vp_assist_page vp_assist_page;
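
To see why a vCPU bitmap is uncomfortable on the stack, it helps to work out its size. The following is a userspace sketch, not kernel code: the `SKETCH_` macros stand in for the kernel's `BITS_PER_LONG` and `BITS_TO_LONGS()`, and the vCPU counts used in the note below are assumed example values, since `KVM_MAX_VCPUS` depends on the kernel configuration.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

static_assert(CHAR_BIT == 8, "sketch assumes 8-bit bytes");

/* Userspace stand-ins for the kernel's BITS_PER_LONG and BITS_TO_LONGS(). */
#define SKETCH_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_BITS_TO_LONGS(n) \
	(((n) + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG)

/* Bytes occupied by a bitmap with room for nr_bits bits. */
static size_t sketch_bitmap_bytes(size_t nr_bits)
{
	return SKETCH_BITS_TO_LONGS(nr_bits) * sizeof(unsigned long);
}
```

With an assumed limit of 4096 vCPUs the mask alone takes 512 bytes, on top of the other locals in kvm_hv_flush_tlb(); with 1024 it takes 128 bytes.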

--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -72,6 +72,7 @@
 #define TDVMCALL_MAP_GPA		0x10001
 #define TDVMCALL_GET_QUOTE		0x10002
 #define TDVMCALL_REPORT_FATAL_ERROR	0x10003
+#define TDVMCALL_SETUP_EVENT_NOTIFY_INTERRUPT 0x10004ULL

 /*
 * TDG.VP.VMCALL Status Codes (returned in R10)
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -965,7 +965,13 @@ struct kvm_tdx_cmd {
 struct kvm_tdx_capabilities {
 	__u64 supported_attrs;
 	__u64 supported_xfam;
-	__u64 reserved[254];
+
+	__u64 kernel_tdvmcallinfo_1_r11;
+	__u64 user_tdvmcallinfo_1_r11;
+	__u64 kernel_tdvmcallinfo_1_r12;
+	__u64 user_tdvmcallinfo_1_r12;
+
+	__u64 reserved[250];

 	/* Configurable CPUID bits for userspace */
 	struct kvm_cpuid2 cpuid;
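
The four new fields are carved out of the reserved area, so the fixed part of the structure keeps its size and the trailing `cpuid` member keeps its offset. The sketch below checks that arithmetic with trimmed, hypothetical copies of the layout (the `cpuid` member is left out, and `uint64_t` stands in for `__u64`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Trimmed copy of the fixed part before the change (hypothetical name). */
struct sketch_caps_before {
	uint64_t supported_attrs;
	uint64_t supported_xfam;
	uint64_t reserved[254];
};

/* Trimmed copy of the fixed part after the change (hypothetical name). */
struct sketch_caps_after {
	uint64_t supported_attrs;
	uint64_t supported_xfam;
	uint64_t kernel_tdvmcallinfo_1_r11;
	uint64_t user_tdvmcallinfo_1_r11;
	uint64_t kernel_tdvmcallinfo_1_r12;
	uint64_t user_tdvmcallinfo_1_r12;
	uint64_t reserved[250];
};

/* 254 reserved words became 4 named fields plus 250 reserved words. */
static_assert(sizeof(struct sketch_caps_before) ==
	      sizeof(struct sketch_caps_after),
	      "fixed part must keep its size so cpuid stays at the same offset");
```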
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -1979,6 +1979,9 @@ int kvm_hv_vcpu_flush_tlb(struct kvm_vcpu *vcpu)
 		if (entries[i] == KVM_HV_TLB_FLUSHALL_ENTRY)
 			goto out_flush_all;

+		if (is_noncanonical_invlpg_address(entries[i], vcpu))
+			continue;
+
 		/*
 		 * Lower 12 bits of 'address' encode the number of additional
 		 * pages to flush.
@@ -2001,11 +2004,11 @@ int kvm_hv_vcpu_flush_tlb(struct kvm_vcpu *vcpu)
 static u64 kvm_hv_flush_tlb(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
 {
 	struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
+	unsigned long *vcpu_mask = hv_vcpu->vcpu_mask;
 	u64 *sparse_banks = hv_vcpu->sparse_banks;
 	struct kvm *kvm = vcpu->kvm;
 	struct hv_tlb_flush_ex flush_ex;
 	struct hv_tlb_flush flush;
-	DECLARE_BITMAP(vcpu_mask, KVM_MAX_VCPUS);
 	struct kvm_vcpu_hv_tlb_flush_fifo *tlb_flush_fifo;
 	/*
 	 * Normally, there can be no more than 'KVM_HV_TLB_FLUSH_FIFO_SIZE'
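
The skip added to kvm_hv_vcpu_flush_tlb() relies on a canonical-address test. The sketch below is a standalone approximation rather than the kernel's is_noncanonical_invlpg_address(): it takes the virtual-address width as an explicit parameter (an assumption; the kernel derives it from the vCPU) and shows the same skip-and-continue shape on a plain array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * An address is canonical when bits 63..(va_bits - 1) are all zeros or all
 * ones, i.e. the upper bits are a sign extension of bit va_bits - 1.
 */
static bool sketch_is_canonical(uint64_t va, unsigned int va_bits)
{
	assert(va_bits >= 1 && va_bits <= 64);

	uint64_t upper = va >> (va_bits - 1);
	uint64_t all_ones = UINT64_MAX >> (va_bits - 1);

	return upper == 0 || upper == all_ones;
}

/* Count the entries that would be flushed, skipping non-canonical ones. */
static size_t sketch_count_flushable(const uint64_t *entries, size_t n,
				     unsigned int va_bits)
{
	size_t flushed = 0;

	for (size_t i = 0; i < n; i++) {
		if (!sketch_is_canonical(entries[i], va_bits))
			continue;	/* mirrors the new 'continue' in the hunk */
		flushed++;
	}
	return flushed;
}
```

Dropping such entries is harmless for correctness of the flush, since a non-canonical address cannot have a cached translation to invalidate.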
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -1971,6 +1971,10 @@ static int sev_check_source_vcpus(struct kvm *dst, struct kvm *src)
 	struct kvm_vcpu *src_vcpu;
 	unsigned long i;

+	if (src->created_vcpus != atomic_read(&src->online_vcpus) ||
+	    dst->created_vcpus != atomic_read(&dst->online_vcpus))
+		return -EBUSY;
+
 	if (!sev_es_guest(src))
 		return 0;

@@ -4445,8 +4449,12 @@ static void sev_es_init_vmcb(struct vcpu_svm *svm)
 	 * the VMSA will be NULL if this vCPU is the destination for intrahost
 	 * migration, and will be copied later.
 	 */
-	if (svm->sev_es.vmsa && !svm->sev_es.snp_has_guest_vmsa)
-		svm->vmcb->control.vmsa_pa = __pa(svm->sev_es.vmsa);
+	if (!svm->sev_es.snp_has_guest_vmsa) {
+		if (svm->sev_es.vmsa)
+			svm->vmcb->control.vmsa_pa = __pa(svm->sev_es.vmsa);
+		else
+			svm->vmcb->control.vmsa_pa = INVALID_PAGE;
+	}

 	if (cpu_feature_enabled(X86_FEATURE_ALLOWED_SEV_FEATURES))
 		svm->vmcb->control.allowed_sev_features = sev->vmsa_features |
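
The new check in sev_check_source_vcpus() compares two counters per VM. A minimal userspace sketch of that comparison follows, assuming (as in KVM's vCPU creation path) that `created_vcpus` is bumped when creation starts and `online_vcpus` only once the vCPU is fully set up, so a mismatch means a creation is still in flight. `struct sketch_vm` and both helpers are hypothetical stand-ins for `struct kvm` and the kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the two counters kept in struct kvm. */
struct sketch_vm {
	int created_vcpus;	/* incremented when vCPU creation begins */
	atomic_int online_vcpus;	/* incremented when creation completes */
};

static int sketch_vcpu_creation_in_flight(struct sketch_vm *vm)
{
	return vm->created_vcpus != atomic_load(&vm->online_vcpus);
}

/* Refuse migration while either VM still has a vCPU being created. */
static int sketch_check_migration(struct sketch_vm *dst, struct sketch_vm *src)
{
	assert(dst != NULL && src != NULL);

	if (sketch_vcpu_creation_in_flight(src) ||
	    sketch_vcpu_creation_in_flight(dst))
		return -EBUSY;
	return 0;
}
```

Returning -EBUSY rather than waiting leaves the retry policy to userspace.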
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -173,6 +173,9 @@ static void td_init_cpuid_entry2(struct kvm_cpuid_entry2 *entry, unsigned char i
 	tdx_clear_unsupported_cpuid(entry);
 }

+#define TDVMCALLINFO_GET_QUOTE				BIT(0)
+#define TDVMCALLINFO_SETUP_EVENT_NOTIFY_INTERRUPT	BIT(1)
+
 static int init_kvm_tdx_caps(const struct tdx_sys_info_td_conf *td_conf,
 			     struct kvm_tdx_capabilities *caps)
 {
@@ -188,6 +191,10 @@ static int init_kvm_tdx_caps(const struct tdx_sys_info_td_conf *td_conf,

 	caps->cpuid.nent = td_conf->num_cpuid_config;

+	caps->user_tdvmcallinfo_1_r11 =
+		TDVMCALLINFO_GET_QUOTE |
+		TDVMCALLINFO_SETUP_EVENT_NOTIFY_INTERRUPT;
+
 	for (i = 0; i < td_conf->num_cpuid_config; i++)
 		td_init_cpuid_entry2(&caps->cpuid.entries[i], i);

@@ -1530,6 +1537,27 @@ static int tdx_get_quote(struct kvm_vcpu *vcpu)
 	return 0;
 }

+static int tdx_setup_event_notify_interrupt(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	u64 vector = tdx->vp_enter_args.r12;
+
+	if (vector < 32 || vector > 255) {
+		tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
+		return 1;
+	}
+
+	vcpu->run->exit_reason = KVM_EXIT_TDX;
+	vcpu->run->tdx.flags = 0;
+	vcpu->run->tdx.nr = TDVMCALL_SETUP_EVENT_NOTIFY_INTERRUPT;
+	vcpu->run->tdx.setup_event_notify.ret = TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED;
+	vcpu->run->tdx.setup_event_notify.vector = vector;
+
+	vcpu->arch.complete_userspace_io = tdx_complete_simple;
+
+	return 0;
+}
+
 static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 {
 	switch (tdvmcall_leaf(vcpu)) {
@@ -1541,6 +1569,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 		return tdx_get_td_vm_call_info(vcpu);
 	case TDVMCALL_GET_QUOTE:
 		return tdx_get_quote(vcpu);
+	case TDVMCALL_SETUP_EVENT_NOTIFY_INTERRUPT:
+		return tdx_setup_event_notify_interrupt(vcpu);
 	default:
 		break;
 	}
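
tdx_setup_event_notify_interrupt() splits the request into two paths: an out-of-range vector is rejected in the kernel, while a valid one is forwarded to userspace with `ret` pre-filled as "subfunction unsupported", so a VMM that ignores the exit still returns a sensible status. The sketch below models only that decision; `struct sketch_exit` and the function are hypothetical, and the range 32..255 is taken from the hunk above (vectors 0-31 are reserved for exceptions on x86).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MIN_VECTOR	32	/* 0-31 are reserved for exceptions */
#define SKETCH_MAX_VECTOR	255

/* Hypothetical, reduced model of what gets reported to userspace. */
struct sketch_exit {
	bool to_userspace;
	uint64_t vector;
};

/*
 * Returns true and fills 'out' when the request should be forwarded to
 * userspace; returns false when it is rejected without leaving the kernel.
 */
static bool sketch_setup_event_notify(uint64_t vector, struct sketch_exit *out)
{
	assert(out != NULL);

	out->to_userspace = false;
	out->vector = 0;

	if (vector < SKETCH_MIN_VECTOR || vector > SKETCH_MAX_VECTOR)
		return false;	/* guest would see an invalid-operand status */

	out->to_userspace = true;
	out->vector = vector;
	return true;
}
```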
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -1971,8 +1971,19 @@ int kvm_xen_setup_evtchn(struct kvm *kvm,
 {
 	struct kvm_vcpu *vcpu;

-	if (ue->u.xen_evtchn.port >= max_evtchn_port(kvm))
-		return -EINVAL;
+	/*
+	 * Don't check for the port being within range of max_evtchn_port().
+	 * Userspace can configure whatever targets it likes; events just won't
+	 * be delivered if/while the target is invalid, just like userspace can
+	 * configure MSIs which target non-existent APICs.
+	 *
+	 * This allows the IRQ routing table to be restored *independently* of
+	 * other things like creating vCPUs on Live Migration and Live Update,
+	 * without imposing an ordering dependency on userspace.  In this
+	 * particular case, the problematic ordering would be with setting the
+	 * Xen 'long mode' flag, which changes max_evtchn_port() to allow 4096
+	 * instead of 1024 event channels.
+	 */

 	/* We only support 2 level event channels for now */
 	if (ue->u.xen_evtchn.priority != KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL)
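
The ordering problem described in the new comment can be reproduced in a few lines. This is a sketch under the assumption that the 2-level event channel limit is the guest word size in bits, squared, which yields the 1024 and 4096 figures quoted above; all `sketch_` functions are hypothetical and only model the port check.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

static_assert(32 * 32 == 1024 && 64 * 64 == 4096,
	      "2-level limits quoted in the comment");

/* Assumed model of max_evtchn_port(): word size in bits, squared. */
static unsigned int sketch_max_evtchn_port(bool long_mode)
{
	unsigned int word_bits = long_mode ? 64 : 32;

	return word_bits * word_bits;
}

/* Old behaviour: the result depends on whether long mode is already set. */
static int sketch_setup_evtchn_old(unsigned int port, bool long_mode)
{
	if (port >= sketch_max_evtchn_port(long_mode))
		return -EINVAL;
	return 0;
}

/* New behaviour: the port is accepted regardless of the current mode. */
static int sketch_setup_evtchn_new(unsigned int port, bool long_mode)
{
	(void)port;
	(void)long_mode;
	return 0;
}
```

With the old check, restoring a route for port 2000 fails unless the 'long mode' flag was restored first; with the new behaviour the two steps can happen in either order.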
--- a/include/linux/psp-sev.h
+++ b/include/linux/psp-sev.h
@@ -594,6 +594,7 @@ struct sev_data_snp_addr {
 * @imi_en: launch flow is launching an IMI (Incoming Migration Image) for the
 *          purpose of guest-assisted migration.
 * @rsvd: reserved
+ * @desired_tsc_khz: hypervisor desired mean TSC freq in kHz of the guest
 * @gosvw: guest OS-visible workarounds, as defined by hypervisor
 */
 struct sev_data_snp_launch_start {
@@ -603,6 +604,7 @@ struct sev_data_snp_launch_start {
 	u32 ma_en:1;				/* In */
 	u32 imi_en:1;				/* In */
 	u32 rsvd:30;
+	u32 desired_tsc_khz;			/* In */
 	u8 gosvw[16];				/* In */
 } __packed;
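
Because the structure is packed, a missing 32-bit field shifts everything behind it by four bytes. The sketch below demonstrates the shift on hypothetical structs that contain only the members visible in this hunk (the leading members of the real structure are omitted), using the GCC/Clang `packed` attribute in place of the kernel's `__packed`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Tail of the structure before the fix (hypothetical, trimmed). */
struct sketch_launch_tail_before {
	uint32_t ma_en:1;
	uint32_t imi_en:1;
	uint32_t rsvd:30;
	uint8_t gosvw[16];
} __attribute__((packed));

/* Tail of the structure after the fix (hypothetical, trimmed). */
struct sketch_launch_tail_after {
	uint32_t ma_en:1;
	uint32_t imi_en:1;
	uint32_t rsvd:30;
	uint32_t desired_tsc_khz;
	uint8_t gosvw[16];
} __attribute__((packed));

/* Without desired_tsc_khz, gosvw sat four bytes too early. */
static_assert(offsetof(struct sketch_launch_tail_after, gosvw) ==
	      offsetof(struct sketch_launch_tail_before, gosvw) +
	      sizeof(uint32_t),
	      "gosvw moves by exactly the size of the missing field");
```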

--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -467,6 +467,10 @@ struct kvm_run {
 					__u64 leaf;
 					__u64 r11, r12, r13, r14;
 				} get_tdvmcall_info;
+				struct {
+					__u64 ret;
+					__u64 vector;
+				} setup_event_notify;
 			};
 		} tdx;
 		/* Fix the size of the union. */
--- a/tools/testing/selftests/kvm/x86/monitor_mwait_test.c
+++ b/tools/testing/selftests/kvm/x86/monitor_mwait_test.c
@@ -74,6 +74,7 @@ int main(int argc, char *argv[])
 	int testcase;
 	char test[80];

+	TEST_REQUIRE(this_cpu_has(X86_FEATURE_MWAIT));
 	TEST_REQUIRE(kvm_has_cap(KVM_CAP_DISABLE_QUIRKS2));

 	ksft_print_header();
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2572,6 +2572,8 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 		r = xa_reserve(&kvm->mem_attr_array, i, GFP_KERNEL_ACCOUNT);
 		if (r)
 			goto out_unlock;
+
+		cond_resched();
 	}

 	kvm_handle_gfn_range(kvm, &pre_set_range);
@@ -2580,6 +2582,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 		r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
 				    GFP_KERNEL_ACCOUNT));
 		KVM_BUG_ON(r, kvm);
+		cond_resched();
 	}

 	kvm_handle_gfn_range(kvm, &post_set_range);
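
The two cond_resched() calls give the scheduler a chance to run other tasks inside loops whose length is controlled by userspace. As a loose userspace analogy only, the sketch below yields the CPU at a fixed interval in a long loop; it uses POSIX sched_yield(), which always yields, whereas the kernel's cond_resched() reschedules only when needed and can therefore be called on every iteration. The function name and the `yield_every` parameter are hypothetical.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Walk [start, end) and voluntarily yield the CPU every 'yield_every'
 * entries. Returns the number of yield points that were hit.
 */
static size_t sketch_process_range(uint64_t start, uint64_t end,
				   uint64_t yield_every)
{
	size_t yields = 0;

	assert(yield_every != 0);

	for (uint64_t i = start; i < end; i++) {
		/* per-entry work would go here */

		if ((i - start + 1) % yield_every == 0) {
			sched_yield();
			yields++;
		}
	}
	return yields;
}
```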