linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-30 01:53:29 +02:00

Author	SHA1	Message	Date
Mikhail Gavrilov	fff82ea9d9	x86/virt: Silence RCU lockdep splat in emergency virt callback path x86_virt_invoke_kvm_emergency_callback() reaches rcu_dereference() through machine_crash_shutdown() with IRQs disabled but with RCU not necessarily watching the crashing CPU, which triggers a suspicious RCU usage splat on debug kernels (CONFIG_PROVE_RCU=y) during panic/kdump: WARNING: suspicious RCU usage arch/x86/virt/hw.c:52 suspicious rcu_dereference_check() usage! rcu_scheduler_active = 2, debug_locks = 1 1 lock held by tee/11119: #0: ffff8881fa32c440 (sb_writers#3){.+.+}-{0:0}, at: ksys_write Call Trace: <TASK> dump_stack_lvl+0x84/0xd0 lockdep_rcu_suspicious.cold+0x37/0x8f x86_virt_invoke_kvm_emergency_callback+0x5f/0x70 x86_svm_emergency_disable_virtualization_cpu+0x2a/0x30 x86_virt_emergency_disable_virtualization_cpu+0x6b/0x90 native_machine_crash_shutdown+0x72/0x170 __crash_kexec+0x137/0x280 panic+0xce/0xd0 sysrq_handle_crash+0x1f/0x20 __handle_sysrq.cold+0x192/0x335 write_sysrq_trigger+0x8c/0xc0 proc_reg_write+0x1c3/0x3c0 vfs_write+0x1d0/0xf80 ksys_write+0x116/0x250 do_syscall_64+0x11c/0x1480 entry_SYSCALL_64_after_hwframe+0x76/0x7e </TASK> A truly correct fix is non-trivial: the RCU usage genuinely is wrong in panic context (RCU may ignore the crashing CPU during synchronization), and a concurrent KVM module unload could in principle race with the callback read; see commit `2baa33a8dd` ("KVM: x86: Leave user-return notifier registered on reboot/shutdown") which notes that nothing prevents module unload during panic/reboot. However, the alternatives are worse: - smp_store_release()/smp_load_acquire() handles ordering but not liveness; the kernel still needs to keep the module text alive while the callback is in flight. - Taking a lock in the panic path is risky — any lock could be held by a CPU that has already been NMI'd to a halt. Use rcu_dereference_raw() to silence the splat and accept the vanishingly small remaining race. Panic context inherently cannot guarantee complete correctness; the goal here is to keep debug builds quiet on the kdump path so the splat doesn't obscure the actual kernel state being captured. Reproducible on a debug kernel (CONFIG_PROVE_LOCKING=y, CONFIG_PROVE_RCU=y) with kvm_amd or kvm_intel loaded by triggering kdump: echo c > /proc/sysrq-trigger Suggested-by: Sean Christopherson <seanjc@google.com> Fixes: `428afac5a8` ("KVM: x86: Move bulk of emergency virtualizaton logic to virt subsystem") Signed-off-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://patch.msgid.link/20260504235435.90957-1-mikhail.v.gavrilov@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-05-13 09:53:43 -07:00
Sean Christopherson	e30aa03d03	x86/virt: Treat SVM as unsupported when running as an SEV+ guest When running as an SEV+ guest, treat SVM as unsupported even if CPUID (and other reporting, e.g. MSRs) enumerate support for SVM, as KVM doesn't support nested virtualization within an SEV VM (KVM would need to explicitly share all VMCBs and other assets with the untrusted host), let alone running nested VMs within SEV-ES+ guests (e.g. emulating VMLOAD, VMSAVE, and VMRUN all require access to guest register state). And outside of KVM, there is no in-tree user of SVM enabling. Arguably, the hypervisor/VMM (e.g. QEMU) should clear SVM from guest CPUID for SEV VMs, especially for SEV-ES+, but super duper technically, it's feasible to run nested VMs in SEV+ guests (with many caveats). More importantly, Linux-as-a-guest has played nice with SVM being advertised to SEV+ guests for a long time. Treating SVM as unsupported fixes a regression where a clean shutdown of an SEV-ES+ guest degrades into an abrupt termination. Due to a gnarly virtualization hole in SEV-ES (the architecture), where EFER must NOT be intercepted by the hypervisor (because the untrusted hypervisor can't set e.g. EFER.LME on behalf o the guest), the _host's_ EFER.SVME is visible to the guest. Because EFER.SVME must be always '1' while in guest mode, Linux-the-guest sees EFER.SVME=1 even when _its_ EFER.SVME is '0', thinks it has enabled virtualization, and ultimately can cause x86_svm_emergency_disable_virtualization_cpu() to execute STGI to ensure GIF is enabled. Executing STGI _should_ be fine, except Linux is a also wee bit paranoid when running as an SEV-ES guest. Because L0 sees EFER.SVME=0 for the guest, a well-behaved L0 hypervisor will intercept STGI (to inject #UD), and thus generate a #VC on the STGI. Which, again, should be fine. Unfortunately, vc_check_opcode_bytes() fails to account for STGI and other SVM instructions, throws a fatal error, and triggers a termination request. In a perfect world, the #VC handler would be more forgiving of unknown intercepts, especially when the #VC happened on an instruction with exception fixup. For now, just fix the immediate regression. Fixes: `428afac5a8` ("KVM: x86: Move bulk of emergency virtualizaton logic to virt subsystem") Reported-by: Srikanth Aithal <sraithal@amd.com> Closes: https://lore.kernel.org/all/c820e242-9f3a-4210-b414-19d11b022404@amd.com Link: https://patch.msgid.link/20260409191341.1932853-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-04-09 12:21:53 -07:00
Sean Christopherson	8528a7f9c9	x86/virt: Add refcounting of VMX/SVM usage to support multiple in-kernel users Implement a per-CPU refcounting scheme so that "users" of hardware virtualization, e.g. KVM and the future TDX code, can co-exist without pulling the rug out from under each other. E.g. if KVM were to disable VMX on module unload or when the last KVM VM was destroyed, SEAMCALLs from the TDX subsystem would #UD and panic the kernel. Disable preemption in the get/put APIs to ensure virtualization is fully enabled/disabled before returning to the caller. E.g. if the task were preempted after a 0=>1 transition, the new task would see a 1=>2 and thus return without enabling virtualization. Explicitly disable preemption instead of requiring the caller to do so, because the need to disable preemption is an artifact of the implementation. E.g. from KVM's perspective there is no _need_ to disable preemption as KVM guarantees the pCPU on which it is running is stable (but preemption is enabled). Opportunistically abstract away SVM vs. VMX in the public APIs by using X86_FEATURE_{SVM,VMX} to communicate what technology the caller wants to enable and use. Cc: Xu Yilun <yilun.xu@linux.intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-03-04 08:52:52 -08:00
Sean Christopherson	428afac5a8	KVM: x86: Move bulk of emergency virtualizaton logic to virt subsystem Move the majority of the code related to disabling hardware virtualization in emergency from KVM into the virt subsystem so that virt can take full ownership of the state of SVM/VMX. This will allow refcounting usage of SVM/VMX so that KVM and the TDX subsystem can enable VMX without stomping on each other. To route the emergency callback to the "right" vendor code, add to avoid mixing vendor and generic code, implement a x86_virt_ops structure to track the emergency callback, along with the SVM vs. VMX (vs. "none") feature that is active. To avoid having to choose between SVM and VMX, simply refuse to enable either if both are somehow supported. No known CPU supports both SVM and VMX, and it's comically unlikely such a CPU will ever exist. Leave KVM's clearing of loaded VMCSes and MSR_VM_HSAVE_PA in KVM, via a callback explicitly scoped to KVM. Loading VMCSes and saving/restoring host state are firmly tied to running VMs, and thus are (a) KVM's responsibility and (b) operations that are still exclusively reserved for KVM (as far as in-tree code is concerned). I.e. the contract being established is that non-KVM subsystems can utilize virtualization, but for all intents and purposes cannot act as full-blown hypervisors. Reviewed-by: Chao Gao <chao.gao@intel.com> Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-03-04 08:52:49 -08:00
Sean Christopherson	32d76cdfa1	KVM: SVM: Move core EFER.SVME enablement to kernel Move the innermost EFER.SVME logic out of KVM and into to core x86 to land the SVM support alongside VMX support. This will allow providing a more unified API from the kernel to KVM, and will allow moving the bulk of the emergency disabling insanity out of KVM without having a weird split between kernel and KVM for SVM vs. VMX. No functional change intended. Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-03-04 08:52:45 -08:00
Sean Christopherson	920da4f755	KVM: VMX: Move core VMXON enablement to kernel Move the innermost VMXON+VMXOFF logic out of KVM and into to core x86 so that TDX can (eventually) force VMXON without having to rely on KVM being loaded, e.g. to do SEAMCALLs during initialization. Opportunistically update the comment regarding emergency disabling via NMI to clarify that virt_rebooting will be set by _another_ emergency callback, i.e. that virt_rebooting doesn't need to be set before VMCLEAR, only before _this_ invocation does VMXOFF. Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-03-04 08:52:42 -08:00
Sean Christopherson	95e4adb24f	x86/virt: Force-clear X86_FEATURE_VMX if configuring root VMCS fails If allocating and configuring a root VMCS fails, clear X86_FEATURE_VMX in all CPUs so that KVM doesn't need to manually check root_vmcs. As added bonuses, clearing VMX will reflect that VMX is unusable in /proc/cpuinfo, and will avoid a futile auto-probe of kvm-intel.ko. WARN if allocating a root VMCS page fails, e.g. to help users figure out why VMX is broken in the unlikely scenario something goes sideways during boot (and because the allocation should succeed unless there's a kernel bug). Tweak KVM's error message to suggest checking kernel logs if VMX is unsupported (in addition to checking BIOS). Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-03-04 08:52:39 -08:00
Sean Christopherson	405b7c2793	KVM: VMX: Unconditionally allocate root VMCSes during boot CPU bringup Allocate the root VMCS (misleading called "vmxarea" and "kvm_area" in KVM) for each possible CPU during early boot CPU bringup, before early TDX initialization, so that TDX can eventually do VMXON on-demand (to make SEAMCALLs) without needing to load kvm-intel.ko. Allocate the pages early on, e.g. instead of trying to do so on-demand, to avoid having to juggle allocation failures at runtime. Opportunistically rename the per-CPU pointers to better reflect the role of the VMCS. Use Intel's "root VMCS" terminology, e.g. from various VMCS patents[1][2] and older SDMs, not the more opaque "VMXON region" used in recent versions of the SDM. While it's possible the VMCS passed to VMXON no longer serves as _the_ root VMCS on modern CPUs, it is still in effect a "root mode VMCS", as described in the patents. Link: https://patentimages.storage.googleapis.com/c7/e4/32/d7a7def5580667/WO2013101191A1.pdf [1] Link: https://patentimages.storage.googleapis.com/13/f6/8d/1361fab8c33373/US20080163205A1.pdf [2] Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-03-04 08:52:34 -08:00
Sean Christopherson	a1450a8156	KVM: x86: Move "kvm_rebooting" to kernel as "virt_rebooting" Move "kvm_rebooting" to the kernel, exported for KVM, as one of many steps towards extracting the innermost VMXON and EFER.SVME management logic out of KVM and into to core x86. For lack of a better name, call the new file "hw.c", to yield "virt hardware" when combined with its parent directory. No functional change intended. Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-03-04 08:52:31 -08:00

9 Commits