linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-29 17:43:52 +02:00

Author	SHA1	Message	Date
Linus Torvalds	6a97c4d526	Arm: - Fix ITS EventID sanitisation when restoring an interrupt translation table. - Fix PPI memory leak when failing to initialise a vcpu. - Correctly return an error when the validation of a hypervisor trace descriptor fails, and limit this validation to protected mode only. RISC-V: - Fix invalid HVA warning in steal-time recording - Return SBI_ERR_FAILURE to guest upon OOM in pmu_event_info() and pmu_snapshot_set_shmem() - Fix NULL pointer dereference in SBI v0.1 SEND_IPI handler - Fix sign extension of value for MMIO loads s390: - Fix bugs in vSIE (nested virtualization) and UCONTROL, caused by the page table rewrite. x86: - Apply erratum #1235 workaround (disable AVIC IPI virtualization) on Hygon Family 18h, just like on AMD Family 17h. - When KVM_CAP_X86_APIC_BUS_CYCLES_NS is queried on a specific VM, return the VM's configured APIC bus frequency instead of the default. This is less confusing (read: not wrong) and makes it easier to fill in CPUID information that communicates the APIC bus frequency to the guest. Selftests: - Do not include glibc-internal <bits/endian.h>; it worked by chance and broke building KVM selftests with musl. -----BEGIN PGP SIGNATURE----- iQFIBAABCgAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmoSp6cUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroPgswgAiO5Gi7d6dIspG7e41g5fF2Wq5Rnq 1nB7ZV+CqT0k1fvFe4hBrc2c+DLzFn+h3/fj+4scVF4oAN9YRauIq/2xlGWR23bR gsFncJ2w6TAKLN3MvCh1SpO+GI7kcnTs7HtJ6weDkddbGEtUIgkUZkwEYnEN4t6T pgO7USGFbBBXY575UO/xMeLkfyABzJlLjQbKrvG6RKtEsKAxzTxcPtjQegtHYH4Q 6DLGif4YUB0ZWMQETccl/bKqU6L+OQgDUOSUoHWt+2ox0DLDwiy7VVf3infecXsJ r3PGKn709nlrd+hBn2S9gCbT/BCxp828k2DxSasZ7PQ8634O+qrpLLkODw== =VWgs -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm fixes from Paolo Bonzini: "arm64: - Fix ITS EventID sanitisation when restoring an interrupt translation table. - Fix PPI memory leak when failing to initialise a vcpu. - Correctly return an error when the validation of a hypervisor trace descriptor fails, and limit this validation to protected mode only. RISC-V: - Fix invalid HVA warning in steal-time recording - Return SBI_ERR_FAILURE to guest upon OOM in pmu_event_info() and pmu_snapshot_set_shmem() - Fix NULL pointer dereference in SBI v0.1 SEND_IPI handler - Fix sign extension of value for MMIO loads s390: - Fix bugs in vSIE (nested virtualization) and UCONTROL, caused by the page table rewrite. x86: - Apply erratum #1235 workaround (disable AVIC IPI virtualization) on Hygon Family 18h, just like on AMD Family 17h. - When KVM_CAP_X86_APIC_BUS_CYCLES_NS is queried on a specific VM, return the VM's configured APIC bus frequency instead of the default. This is less confusing (read: not wrong) and makes it easier to fill in CPUID information that communicates the APIC bus frequency to the guest. Selftests: - Do not include glibc-internal <bits/endian.h>; it worked by chance and broke building KVM selftests with musl" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: SVM: Disable AVIC IPI virtualization on Hygon Family 18h (erratum #1235) KVM: selftests: Verify that KVM returns the configured APIC cycle length KVM: x86: Return the VM's configured APIC bus frequency when queried KVM: selftests: elf: Include <endian.h> instead of <bits/endian.h> KVM: s390: Properly reset zero bit in PGSTE KVM: s390: vsie: Fix redundant rmap entries KVM: s390: vsie: Fix unshadowing logic KVM: s390: Fix leaking kvm_s390_mmu_cache in case of errors KVM: s390: vsie: Fix memory leak when unshadowing KVM: arm64: Fix nVHE/pKVM hyp tracing error on invalid desc KVM: arm64: vgic: Free private_irqs when init fails after allocation KVM: arm64: vgic-its: Reject restored DTE with out-of-range num_eventid_bits RISC-V: KVM: Fix sign extension for MMIO loads RISC-V: KVM: Fix NULL pointer dereference in SBI v0.1 SEND_IPI handler riscv: kvm: return SBI_ERR_FAILURE for pmu_event_info() when OOM riscv: kvm: return SBI_ERR_FAILURE for pmu_snapshot_set_shmem() when OOM RISC-V: KVM: Fix invalid HVA warning in steal-time recording	2026-05-24 12:50:36 -07:00
Tina Zhang	9a12fa5213	KVM: SVM: Disable AVIC IPI virtualization on Hygon Family 18h (erratum #1235 ) Hygon Family 18h CPUs are derived from AMD Family 17h (Zen1) silicon and share the same erratum #1235: hardware may read a stale IsRunning=1 bit during ICR write emulation and silently fail to generate an AVIC_IPI_FAILURE_TARGET_NOT_RUNNING VM-Exit on the sending vCPU. The absence of the VM-Exit causes KVM to miss the required wakeup of blocking target vCPUs, leading to hung vCPUs and unbounded delays in guest execution. Extend the existing AMD Family 17h erratum #1235 workaround to also cover Hygon Family 18h. With IPI virtualization disabled, KVM never sets IsRunning=1 in the Physical ID table, so every non-self IPI generates a VM-Exit and is correctly emulated. Fixes: `8de4a1c816` ("KVM: SVM: Disable (x2)AVIC IPI virtualization if CPU has erratum #1235") Cc: <stable@vger.kernel.org> Signed-off-by: Tina Zhang <zhang_wei@open-hieco.net> Message-ID: <20260522040014.3380201-1-zhang_wei@open-hieco.net>	2026-05-23 10:09:04 +02:00
Sean Christopherson	86e2de10eb	KVM: x86: Return the VM's configured APIC bus frequency when queried When KVM_CAP_X86_APIC_BUS_CYCLES_NS is queried on a specific VM, return the VM's configured APIC bus frequency, not KVM's default. Aside from the fact that returning the default frequency is blatantly wrong if userspace has changed the frequency, returning the configured frequency means userspace can blindly trust the result, e.g. when filling PV CPUID information that communicates the APIC bus frequency to the guest. Fixes: `6fef518594` ("KVM: x86: Add a capability to configure bus frequency for APIC timer") Reported-by: David Woodhouse <dwmw2@infradead.org> Closes: https://lore.kernel.org/all/ab84153e33fbe7c25667f595c56b310d4d5a93ef.camel@infradead.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20260522173526.3539407-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2026-05-23 10:07:29 +02:00
Peter Zijlstra	0701c9e17b	x86/kvm/vmx: Move IRQ/NMI dispatch from KVM into x86 core Move the VMX interrupt dispatch magic into the x86 core code. This isolates KVM from the FRED/IDT decisions and reduces the amount of EXPORT_SYMBOL_FOR_KVM(). Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: "Verma, Vishal L" <vishal.l.verma@intel.com> Tested-by: Zhao Liu <zhao1.liu@intel.com> Tested-by: Zhao Liu <zhao1.liu@intel.com> Tested-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Binbin Wu <binbin.wu@linxu.intel.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://patch.msgid.link/20260508091829.GO3126523@noisy.programming.kicks-ass.net	2026-05-19 20:25:51 +02:00
Paolo Bonzini	2d5d3fc593	KVM: VMX: introduce module parameter to disable CET There have been reports of host hangs caused by CET virtualization. Until these are analyzed further, introduce a module parameter that makes it possible to easily disable it. Link: https://lore.kernel.org/all/85548beb-1486-40f9-beb4-632c78e3360b@proxmox.com/ Cc: David Riley <d.riley@proxmox.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2026-05-13 15:38:22 +02:00
Sean Christopherson	3098c076c8	KVM: x86: Swap the dst and src operand for MOVNTDQA Swap the MOVNTDQA operands, as MOVNTDQA does NOT in fact have "the same characteristics as 0F E7 (MOVNTDQ)"; MOVNTDQA loads from memory and stores to registers, while MOVNTDQ loads from registers and stores to memory. Per the SDM: MOVNTDQ - Move packed integer values in xmm1 to m128 using non-temporal hint. MOVNTDQA - Move double quadword from m128 to xmm1 using non-temporal hint if WC memory type. Reported-by: Josh Eads <josheads@google.com> Fixes: `c57d9bafbd` ("KVM: x86: Add support for emulating MOVNTDQA") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20260506213514.2781948-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2026-05-12 23:12:32 +02:00
Paolo Bonzini	6b72d0578c	KVM: x86: use again the flush argument of __link_shadow_page() Except in the case of parentless nested-TDP pages, mmu_page_zap_pte() clears the SPTE but leaves the invalid_list empty. In this case, using kvm_flush_remote_tlbs() as kvm_mmu_remote_flush_or_zap() does is overkill. Avoid flushing the entirety of the remote TLBs unless the invalid_list was populated: instead, use a more efficient gfn-targeting flush (if available) and skip it altogether if the caller guarantees that a TLB flush is not necessary. Based-on: <20260503201029.106481-1-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20260503210917.121840-1-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2026-05-12 23:12:31 +02:00
Sean Christopherson	5bd1ddb791	KVM: nSVM: Never use L0's PAUSE loop exiting while L2 is running Never use L0's (KVM's) PAUSE loop exiting controls while L2 is running, and instead always configure vmcb02 according to L1's exact capabilities and desires. The purpose of intercepting PAUSE after N attempts is to detect when the vCPU may be stuck waiting on a lock, so that KVM can schedule in a different vCPU that may be holding said lock. Barring a very interesting setup, L1 and L2 do not share locks, and it's extremely unlikely that an L1 vCPU would hold a spinlock while running L2. I.e. having a vCPU executing in L1 yield to a vCPU running in L2 will not allow the L1 vCPU to make forward progress, and vice versa. While teaching KVM's "on spin" logic to only yield to other vCPUs in L2 is doable, in all likelihood it would do more harm than good for most setups. KVM has limited visibility into which L2 "vCPUs" belong to the same VM, and thus share a locking domain. And even if L2 vCPUs are in the same VM, KVM has no visilibity into L2 vCPU's that are scheduled out by the L1 hypervisor. Furthermore, KVM doesn't actually steal PAUSE exits from L1. If L1 is intercepting PAUSE, KVM will route PAUSE exits to L1, not L0, as nested_svm_intercept() gives priority to the vmcb12 intercept. As such, overriding the count/threshold fields in vmcb02 with vmcb01's values is nonsensical, as doing so clobbers all the training/learning that has been done in L1. Even worse, if L1 is not intercepting PAUSE, i.e. KVM is handling PAUSE exits, then KVM will adjust the PLE knobs based on L2 behavior, which could very well be detrimental to L1, e.g. due to essentially poisoning L1 PLE training with bad data. And copying the count from vmcb02 to vmcb01 on a nested VM-Exit makes even less sense, because again, the purpose of PLE is to detect spinning vCPUs. Whether or not a vCPU is spinning in L2 at the time of a nested VM-Exit has no relevance as to the behavior of the vCPU when it executes in L1. The only scenarios where any of this actually works is if at least one of KVM or L1 is NOT intercepting PAUSE for the guest. Per the original changelog, those were the only scenarios considered to be supported. Disabling KVM's use of PLE makes it so the VM is always in a "supported" mode. Last, but certainly not least, using KVM's count/threshold instead of the values provided by L1 is a blatant violation of the SVM architecture. Fixes: `74fd41ed16` ("KVM: x86: nSVM: support PAUSE filtering when L0 doesn't intercept PAUSE") Cc: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://patch.msgid.link/20260508213321.373309-1-seanjc@google.com/ Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2026-05-12 22:17:28 +02:00
Qiang Ma	2b72f1674e	KVM: x86: Fix Xen hypercall tracepoint argument assignment TRACE_EVENT(kvm_xen_hypercall) stores a5 in __entry->a4 instead of __entry->a5. That overwrites the recorded a4 argument and leaves a5 unset in the trace entry. Fix the typo so both arguments are captured correctly. Signed-off-by: Qiang Ma <maqianga@uniontech.com> Link: https://patch.msgid.link/20260512015313.1685784-1-maqianga@uniontech.com/ Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2026-05-12 22:16:26 +02:00
Sean Christopherson	0cb2af2ea6	KVM: x86: Fix shadow paging use-after-free due to unexpected GFN The shadow MMU computes GFNs for direct shadow pages using sp->gfn plus the SPTE index. This assumption breaks for shadow paging if the guest page tables are modified between VM entries (similar to commit `aad885e774`, "KVM: x86/mmu: Drop/zap existing present SPTE even when creating an MMIO SPTE", 2026-03-27). The flow is as follows: - a PDE is installed for a 2MB mapping, and a page in that area is accessed. KVM creates a kvm_mmu_page consisting of 512 4KB pages; the kvm_mmu_page is marked by FNAME(fetch) as direct-mapped because the guest's mapping is a huge page (and thus contiguous). - the PDE mapping is changed from outside the guest. - the guest accesses another page in the same 2MB area. KVM installs a new leaf SPTE and rmap entry; the SPTE uses the "correct" GFN (i.e. based on the new mapping, as changed in the previous step) but that GFN is outside of the [sp->gfn, sp->gfn + 511] range; therefore the rmap entry cannot be found and removed when the kvm_mmu_page is zapped. - the memslot that covers the first 2MB mapping is deleted, and the kvm_mmu_page for the now-invalid GPA is zapped. However, rmap_remove() only looks at the [sp->gfn, sp->gfn + 511] range established in step 1, and fails to find the rmap entry that was recorded by step 3. - any operation that causes an rmap walk for the same page accessed by step 3 then walks a stale rmap and dereferences a freed kvm_mmu_page. This includes dirty logging or MMU notifier invalidations (e.g., from MADV_DONTNEED). The underlying issue is that KVM's walking of shadow PTEs assumes that if a SPTE is present when KVM wants to install a non-leaf SPTE, then the existing kvm_mmu_page must be for the correct gfn. Because the only way for the gfn to be wrong is if KVM messed up and failed to zap a SPTE... which shouldn't happen, but actually only happens in response to a guest write. That bug dates back literally forever, as even the first version of KVM assumes that the GFN matches and walks into the "wrong" shadow page. However, that was only an imprecision until `2032a93d66` ("KVM: MMU: Don't allocate gfns page for direct mmu pages") came along. Fix it by checking for a target gfn mismatch and zapping the existing SPTE. That way the old SP and rmap entries are gone, KVM installs the rmap in the right location, and everyone is happy. Fixes: `2032a93d66` ("KVM: MMU: Don't allocate gfns page for direct mmu pages") Fixes: `6aa8b732ca` ("kvm: userspace interface") Reported-by: Alexander Bulekov <bkov@amazon.com> Reported-by: Fred Griffoul <fgriffo@amazon.co.uk> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://patch.msgid.link/20260503201029.106481-1-pbonzini@redhat.com/ Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2026-05-03 22:32:53 +02:00
Sean Christopherson	0aec99f9bf	KVM: x86: Fix misleading variable names and add more comments for PIR=>IRR flow Rename kvm_apic_update_irr()'s "irr_updated" and vmx_sync_pir_to_irr()'s "got_posted_interrupt" to a more accurate "max_irr_is_from_pir", as neither "irr_updated" nor "got_posted_interrupt" is accurate. __kvm_apic_update_irr() and thus kvm_apic_update_irr() specifically return true if and only if the highest priority IRQ, i.e. max_irr, is a "new" pending IRQ from the PIR. I.e. it's possible for the IRR to be updated, i.e. for a posted IRQ to be "got", without the APIs returning true. Expand vmx_sync_pir_to_irr()'s comment to explain why it's necessary to set KVM_REQ_EVENT only if a "new" IRQ was found, and to explain why it's safe to do so only if a new IRQ is also the highest priority pending IRQ. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://patch.msgid.link/20260503201703.108231-3-pbonzini@redhat.com/ Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2026-05-03 22:32:41 +02:00
Paolo Bonzini	33fd0ccd25	KVM: x86: Do IRR scan in __kvm_apic_update_irr even if PIR is empty Fall back to apic_find_highest_vector() when PID.ON is set but PIR turns out to be empty, to correctly report the highest pending interrupt from the existing IRR. In a nested VM stress test, the following WARNING fires in vmx_check_nested_events() when kvm_cpu_has_interrupt() reports a pending interrupt but the subsequent kvm_apic_has_interrupt() (which invokes vmx_sync_pir_to_irr() again) returns -1: WARNING: CPU: 99 PID: 57767 at arch/x86/kvm/vmx/nested.c:4449 vmx_check_nested_events+0x6bf/0x6e0 [kvm_intel] Call Trace: kvm_check_and_inject_events vcpu_enter_guest.constprop.0 vcpu_run kvm_arch_vcpu_ioctl_run kvm_vcpu_ioctl __x64_sys_ioctl do_syscall_64 entry_SYSCALL_64_after_hwframe The root cause is a race between vmx_sync_pir_to_irr() on the target vCPU and __vmx_deliver_posted_interrupt() on a sender vCPU. The sender performs two individually-atomic operations that are not a single transaction: 1. pi_test_and_set_pir(vector) -- sets the PIR bit 2. pi_test_and_set_on() -- sets PID.ON The following interleaving triggers the bug: Sender vCPU (IPI): Target vCPU (1st sync_pir_to_irr): B1: set PIR[vector] A1: pi_clear_on() A2: pi_harvest_pir() -> sees B1 bit A3: xchg() -> consumes bit, PIR=0 (1st sync returns correct max_irr) B2: set PID.ON = 1 Target vCPU (2nd sync_pir_to_irr): C1: pi_test_on() -> TRUE (from B2) C2: pi_clear_on() -> ON=0 C3: pi_harvest_pir() -> PIR empty C4: *max_irr = -1, early return IRR NOT SCANNED The interrupt is not lost (it resides in the IRR from the first sync and is recovered on the next vcpu_enter_guest() iteration), but the incorrect max_irr causes a spurious WARNING and a wasted L2 VM-Enter/VM-Exit cycle. Fixes: `b41f8638b9` ("KVM: VMX: Isolate pure loads from atomic XCHG when processing PIR") Reported-by: Farrah Chen <farrah.chen@intel.com> Analyzed-by: Chenyi Qiang <chenyi.qiang@intel.com> Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/kvm/20260428070349.1633238-1-chenyi.qiang@intel.com/T/ Link: https://patch.msgid.link/20260503201703.108231-2-pbonzini@redhat.com/ Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2026-05-03 22:18:15 +02:00
Paolo Bonzini	464af6fc2b	KVM: x86: check for nEPT/nNPT in slow flush hypercalls Checking is_guest_mode(vcpu) is incorrect, because translate_nested_gpa() is only valid if an L2 guest is running with nested EPT/NPT enabled. Instead use the same condition as translate_nested_gpa() itself. Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson <seanjc@google.com> Fixes: `aee738236d` ("KVM: x86: Prepare kvm_hv_flush_tlb() to handle L2's GPAs", 2022-11-18) Link: https://patch.msgid.link/20260503200905.106077-1-pbonzini@redhat.com/ Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2026-05-03 22:17:30 +02:00
Linus Torvalds	01f492e181	Arm: - Add support for tracing in the standalone EL2 hypervisor code, which should help both debugging and performance analysis. This uses the new infrastructure for 'remote' trace buffers that can be exposed by non-kernel entities such as firmware, and which came through the tracing tree. - Add support for GICv5 Per Processor Interrupts (PPIs), as the starting point for supporting the new GIC architecture in KVM. - Finally add support for pKVM protected guests, where pages are unmapped from the host as they are faulted into the guest and can be shared back from the guest using pKVM hypercalls. Protected guests are created using a new machine type identifier. As the elusive guestmem has not yet delivered on its promises, anonymous memory is also supported. This is only a first step towards full isolation from the host; for example, the CPU register state and DMA accesses are not yet isolated. Because this does not really yet bring fully what it promises, it is hidden behind CONFIG_ARM_PKVM_GUEST + 'kvm-arm.mode=protected', and also triggers TAINT_USER when a VM is created. Caveat emptor. - Rework the dreaded user_mem_abort() function to make it more maintainable, reducing the amount of state being exposed to the various helpers and rendering a substantial amount of state immutable. - Expand the Stage-2 page table dumper to support NV shadow page tables on a per-VM basis. - Tidy up the pKVM PSCI proxy code to be slightly less hard to follow. - Fix both SPE and TRBE in non-VHE configurations so that they do not generate spurious, out of context table walks that ultimately lead to very bad HW lockups. - A small set of patches fixing the Stage-2 MMU freeing in error cases. - Tighten-up accepted SMC immediate value to be only #0 for host SMCCC calls. - The usual cleanups and other selftest churn. LoongArch: - Use CSR_CRMD_PLV for kvm_arch_vcpu_in_kernel(). - Add DMSINTC irqchip in kernel support. RISC-V: - Fix steal time shared memory alignment checks - Fix vector context allocation leak - Fix array out-of-bounds in pmu_ctr_read() and pmu_fw_ctr_read_hi() - Fix double-free of sdata in kvm_pmu_clear_snapshot_area() - Fix integer overflow in kvm_pmu_validate_counter_mask() - Fix shift-out-of-bounds in make_xfence_request() - Fix lost write protection on huge pages during dirty logging - Split huge pages during fault handling for dirty logging - Skip CSR restore if VCPU is reloaded on the same core - Implement kvm_arch_has_default_irqchip() for KVM selftests - Factored-out ISA checks into separate sources - Added hideleg to struct kvm_vcpu_config - Factored-out VCPU config into separate sources - Support configuration of per-VM HGATP mode from KVM user space s390: - Support for ESA (31-bit) guests inside nested hypervisors. - Remove restriction on memslot alignment, which is not needed anymore with the new gmap code. - Fix LPSW/E to update the bear (which of course is the breaking event address register). x86: - Shut up various UBSAN warnings on reading module parameter before they were initialized. - Don't zero-allocate page tables that are used for splitting hugepages in the TDP MMU, as KVM is guaranteed to set all SPTEs in the page table and thus write all bytes. - As an optimization, bail early when trying to unsync 4KiB mappings if the target gfn can just be mapped with a 2MiB hugepage. x86 generic: - Copy single-chunk MMIO write values into struct kvm_vcpu (more precisely struct kvm_mmio_fragment) to fix use-after-free stack bugs where KVM would dereference stack pointer after an exit to userspace. - Clean up and comment the emulated MMIO code to try to make it easier to maintain (not necessarily "easy", but "easier"). - Move VMXON+VMXOFF and EFER.SVME toggling out of KVM (not all of VMX and SVM enabling) as it is needed for trusted I/O. - Advertise support for AVX512 Bit Matrix Multiply (BMM) instructions - Immediately fail the build if a required #define is missing in one of KVM's headers that is included multiple times. - Reject SET_GUEST_DEBUG with -EBUSY if there's an already injected exception, mostly to prevent syzkaller from abusing the uAPI to trigger WARNs, but also because it can help prevent userspace from unintentionally crashing the VM. - Exempt SMM from CPUID faulting on Intel, as per the spec. - Misc hardening and cleanup changes. x86 (AMD): - Fix and optimize IRQ window inhibit handling for AVIC; make it per-vCPU so that KVM doesn't prematurely re-enable AVIC if multiple vCPUs have to-be-injected IRQs. - Clean up and optimize the OSVW handling, avoiding a bug in which KVM would overwrite state when enabling virtualization on multiple CPUs in parallel. This should not be a problem because OSVW should usually be the same for all CPUs. - Drop a WARN in KVM_MEMORY_ENCRYPT_REG_REGION where KVM complains about a "too large" size based purely on user input. - Clean up and harden the pinning code for KVM_MEMORY_ENCRYPT_REG_REGION. - Disallow synchronizing a VMSA of an already-launched/encrypted vCPU, as doing so for an SNP guest will crash the host due to an RMP violation page fault. - Overhaul KVM's APIs for detecting SEV+ guests so that VM-scoped queries are required to hold kvm->lock, and enforce it by lockdep. Fix various bugs where sev_guest() was not ensured to be stable for the whole duration of a function or ioctl. - Convert a pile of kvm->lock SEV code to guard(). - Play nicer with userspace that does not enable KVM_CAP_EXCEPTION_PAYLOAD, for which KVM needs to set CR2 and DR6 as a response to ioctls such as KVM_GET_VCPU_EVENTS (even if the payload would end up in EXITINFO2 rather than CR2, for example). Only set CR2 and DR6 when consumption of the payload is imminent, but on the other hand force delivery of the payload in all paths where userspace retrieves CR2 or DR6. - Use vcpu->arch.cr2 when updating vmcb12's CR2 on nested #VMEXIT instead of vmcb02->save.cr2. The value is out of sync after a save/restore or after a #PF is injected into L2. - Fix a class of nSVM bugs where some fields written by the CPU are not synchronized from vmcb02 to cached vmcb12 after VMRUN, and so are not up-to-date when saved by KVM_GET_NESTED_STATE. - Fix a class of bugs where the ordering between KVM_SET_NESTED_STATE and KVM_SET_{S}REGS could cause vmcb02 to be incorrectly initialized after save+restore. - Add a variety of missing nSVM consistency checks. - Fix several bugs where KVM failed to correctly update VMCB fields on nested #VMEXIT. - Fix several bugs where KVM failed to correctly synthesize #UD or #GP for SVM-related instructions. - Add support for save+restore of virtualized LBRs (on SVM). - Refactor various helpers and macros to improve clarity and (hopefully) make the code easier to maintain. - Aggressively sanitize fields when copying from vmcb12, to guard against unintentionally allowing L1 to utilize yet-to-be-defined features. - Fix several bugs where KVM botched rAX legality checks when emulating SVM instructions. There are remaining issues in that KVM doesn't handle size prefix overrides for 64-bit guests. - Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails instead of somewhat arbitrarily synthesizing #GP (i.e. don't double down on AMD's architectural but sketchy behavior of generating #GP for "unsupported" addresses). - Cache all used vmcb12 fields to further harden against TOCTOU bugs. x86 (Intel): - Drop obsolete branch hint prefixes from the VMX instruction macros. - Use ASM_INPUT_RM() in __vmcs_writel() to coerce clang into using a register input when appropriate. - Code cleanups. guest_memfd: - Don't mark guest_memfd folios as accessed, as guest_memfd doesn't support reclaim, the memory is unevictable, and there is no storage to write back to. LoongArch selftests: - Add KVM PMU test cases s390 selftests: - Enable more memory selftests. x86 selftests: - Add support for Hygon CPUs in KVM selftests. - Fix a bug in the MSR test where it would get false failures on AMD/Hygon CPUs with exactly one of RDPID or RDTSCP. - Add an MADV_COLLAPSE testcase for guest_memfd as a regression test for a bug where the kernel would attempt to collapse guest_memfd folios against KVM's will. -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmnftRQUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroPAzwf+NKO4Ktv+7A22ImN0SBl0nlUuulsz vTcw3+hxdRoIw83GdNS+hG5js0wrpMDnbv3t4+VliDNBSSxrBzcSWX2wpilW0Xtw qGo1MWhs2lKPy1NlaRVOwPS6j7uF3AR0TQ1iQLGMedQuCU9WpiKJxyhNXJdbLrt3 8EgFzsvtEsv+jKNRUNDf9+d0j4gZsFyIe+Brhianbw+u3/UCiUClLCdsKPc4+5ZX 08otYXytacGNIf/5Ev1vT4pHkHL0yqKXAtX7LEtaS3+0KrPuLjV4slemivzE9vf5 Evafm5AhA4wpaNMb1ZerhY3T94lsMaJpWxotjR//0Q7C9B59pCQnXCm8mg== =CcE0 -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm updates from Paolo Bonzini: "Arm: - Add support for tracing in the standalone EL2 hypervisor code, which should help both debugging and performance analysis. This uses the new infrastructure for 'remote' trace buffers that can be exposed by non-kernel entities such as firmware, and which came through the tracing tree - Add support for GICv5 Per Processor Interrupts (PPIs), as the starting point for supporting the new GIC architecture in KVM - Finally add support for pKVM protected guests, where pages are unmapped from the host as they are faulted into the guest and can be shared back from the guest using pKVM hypercalls. Protected guests are created using a new machine type identifier. As the elusive guestmem has not yet delivered on its promises, anonymous memory is also supported This is only a first step towards full isolation from the host; for example, the CPU register state and DMA accesses are not yet isolated. Because this does not really yet bring fully what it promises, it is hidden behind CONFIG_ARM_PKVM_GUEST + 'kvm-arm.mode=protected', and also triggers TAINT_USER when a VM is created. Caveat emptor - Rework the dreaded user_mem_abort() function to make it more maintainable, reducing the amount of state being exposed to the various helpers and rendering a substantial amount of state immutable - Expand the Stage-2 page table dumper to support NV shadow page tables on a per-VM basis - Tidy up the pKVM PSCI proxy code to be slightly less hard to follow - Fix both SPE and TRBE in non-VHE configurations so that they do not generate spurious, out of context table walks that ultimately lead to very bad HW lockups - A small set of patches fixing the Stage-2 MMU freeing in error cases - Tighten-up accepted SMC immediate value to be only #0 for host SMCCC calls - The usual cleanups and other selftest churn LoongArch: - Use CSR_CRMD_PLV for kvm_arch_vcpu_in_kernel() - Add DMSINTC irqchip in kernel support RISC-V: - Fix steal time shared memory alignment checks - Fix vector context allocation leak - Fix array out-of-bounds in pmu_ctr_read() and pmu_fw_ctr_read_hi() - Fix double-free of sdata in kvm_pmu_clear_snapshot_area() - Fix integer overflow in kvm_pmu_validate_counter_mask() - Fix shift-out-of-bounds in make_xfence_request() - Fix lost write protection on huge pages during dirty logging - Split huge pages during fault handling for dirty logging - Skip CSR restore if VCPU is reloaded on the same core - Implement kvm_arch_has_default_irqchip() for KVM selftests - Factored-out ISA checks into separate sources - Added hideleg to struct kvm_vcpu_config - Factored-out VCPU config into separate sources - Support configuration of per-VM HGATP mode from KVM user space s390: - Support for ESA (31-bit) guests inside nested hypervisors - Remove restriction on memslot alignment, which is not needed anymore with the new gmap code - Fix LPSW/E to update the bear (which of course is the breaking event address register) x86: - Shut up various UBSAN warnings on reading module parameter before they were initialized - Don't zero-allocate page tables that are used for splitting hugepages in the TDP MMU, as KVM is guaranteed to set all SPTEs in the page table and thus write all bytes - As an optimization, bail early when trying to unsync 4KiB mappings if the target gfn can just be mapped with a 2MiB hugepage x86 generic: - Copy single-chunk MMIO write values into struct kvm_vcpu (more precisely struct kvm_mmio_fragment) to fix use-after-free stack bugs where KVM would dereference stack pointer after an exit to userspace - Clean up and comment the emulated MMIO code to try to make it easier to maintain (not necessarily "easy", but "easier") - Move VMXON+VMXOFF and EFER.SVME toggling out of KVM (not all of VMX and SVM enabling) as it is needed for trusted I/O - Advertise support for AVX512 Bit Matrix Multiply (BMM) instructions - Immediately fail the build if a required #define is missing in one of KVM's headers that is included multiple times - Reject SET_GUEST_DEBUG with -EBUSY if there's an already injected exception, mostly to prevent syzkaller from abusing the uAPI to trigger WARNs, but also because it can help prevent userspace from unintentionally crashing the VM - Exempt SMM from CPUID faulting on Intel, as per the spec - Misc hardening and cleanup changes x86 (AMD): - Fix and optimize IRQ window inhibit handling for AVIC; make it per-vCPU so that KVM doesn't prematurely re-enable AVIC if multiple vCPUs have to-be-injected IRQs - Clean up and optimize the OSVW handling, avoiding a bug in which KVM would overwrite state when enabling virtualization on multiple CPUs in parallel. This should not be a problem because OSVW should usually be the same for all CPUs - Drop a WARN in KVM_MEMORY_ENCRYPT_REG_REGION where KVM complains about a "too large" size based purely on user input - Clean up and harden the pinning code for KVM_MEMORY_ENCRYPT_REG_REGION - Disallow synchronizing a VMSA of an already-launched/encrypted vCPU, as doing so for an SNP guest will crash the host due to an RMP violation page fault - Overhaul KVM's APIs for detecting SEV+ guests so that VM-scoped queries are required to hold kvm->lock, and enforce it by lockdep. Fix various bugs where sev_guest() was not ensured to be stable for the whole duration of a function or ioctl - Convert a pile of kvm->lock SEV code to guard() - Play nicer with userspace that does not enable KVM_CAP_EXCEPTION_PAYLOAD, for which KVM needs to set CR2 and DR6 as a response to ioctls such as KVM_GET_VCPU_EVENTS (even if the payload would end up in EXITINFO2 rather than CR2, for example). Only set CR2 and DR6 when consumption of the payload is imminent, but on the other hand force delivery of the payload in all paths where userspace retrieves CR2 or DR6 - Use vcpu->arch.cr2 when updating vmcb12's CR2 on nested #VMEXIT instead of vmcb02->save.cr2. The value is out of sync after a save/restore or after a #PF is injected into L2 - Fix a class of nSVM bugs where some fields written by the CPU are not synchronized from vmcb02 to cached vmcb12 after VMRUN, and so are not up-to-date when saved by KVM_GET_NESTED_STATE - Fix a class of bugs where the ordering between KVM_SET_NESTED_STATE and KVM_SET_{S}REGS could cause vmcb02 to be incorrectly initialized after save+restore - Add a variety of missing nSVM consistency checks - Fix several bugs where KVM failed to correctly update VMCB fields on nested #VMEXIT - Fix several bugs where KVM failed to correctly synthesize #UD or #GP for SVM-related instructions - Add support for save+restore of virtualized LBRs (on SVM) - Refactor various helpers and macros to improve clarity and (hopefully) make the code easier to maintain - Aggressively sanitize fields when copying from vmcb12, to guard against unintentionally allowing L1 to utilize yet-to-be-defined features - Fix several bugs where KVM botched rAX legality checks when emulating SVM instructions. There are remaining issues in that KVM doesn't handle size prefix overrides for 64-bit guests - Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails instead of somewhat arbitrarily synthesizing #GP (i.e. don't double down on AMD's architectural but sketchy behavior of generating #GP for "unsupported" addresses) - Cache all used vmcb12 fields to further harden against TOCTOU bugs x86 (Intel): - Drop obsolete branch hint prefixes from the VMX instruction macros - Use ASM_INPUT_RM() in __vmcs_writel() to coerce clang into using a register input when appropriate - Code cleanups guest_memfd: - Don't mark guest_memfd folios as accessed, as guest_memfd doesn't support reclaim, the memory is unevictable, and there is no storage to write back to LoongArch selftests: - Add KVM PMU test cases s390 selftests: - Enable more memory selftests x86 selftests: - Add support for Hygon CPUs in KVM selftests - Fix a bug in the MSR test where it would get false failures on AMD/Hygon CPUs with exactly one of RDPID or RDTSCP - Add an MADV_COLLAPSE testcase for guest_memfd as a regression test for a bug where the kernel would attempt to collapse guest_memfd folios against KVM's will" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (373 commits) KVM: x86: use inlines instead of macros for is_sev_*guest x86/virt: Treat SVM as unsupported when running as an SEV+ guest KVM: SEV: Goto an existing error label if charging misc_cg for an ASID fails KVM: SVM: Move lock-protected allocation of SEV ASID into a separate helper KVM: SEV: use mutex guard in snp_handle_guest_req() KVM: SEV: use mutex guard in sev_mem_enc_unregister_region() KVM: SEV: use mutex guard in sev_mem_enc_ioctl() KVM: SEV: use mutex guard in snp_launch_update() KVM: SEV: Assert that kvm->lock is held when querying SEV+ support KVM: SEV: Document that checking for SEV+ guests when reclaiming memory is "safe" KVM: SEV: Hide "struct kvm_sev_info" behind CONFIG_KVM_AMD_SEV=y KVM: SEV: WARN on unhandled VM type when initializing VM KVM: LoongArch: selftests: Add PMU overflow interrupt test KVM: LoongArch: selftests: Add basic PMU event counting test KVM: LoongArch: selftests: Add cpucfg read/write helpers LoongArch: KVM: Add DMSINTC inject msi to vCPU LoongArch: KVM: Add DMSINTC device support LoongArch: KVM: Make vcpu_is_preempted() as a macro rather than function LoongArch: KVM: Move host CSR_GSTAT save and restore in context switch LoongArch: KVM: Move host CSR_EENTRY save and restore in context switch ...	2026-04-17 07:18:03 -07:00
Linus Torvalds	334fbe734e	mm.git review status for linus..mm-stable Everything: Total patches: 368 Reviews/patch: 1.56 Reviewed rate: 74% Excluding DAMON: Total patches: 316 Reviews/patch: 1.77 Reviewed rate: 81% Excluding DAMON and zram: Total patches: 306 Reviews/patch: 1.81 Reviewed rate: 82% Excluding DAMON, zram and maple_tree: Total patches: 276 Reviews/patch: 2.01 Reviewed rate: 91% Significant patch series in this merge: - The 30 patch series "maple_tree: Replace big node with maple copy" from Liam Howlett is mainly prepararatory work for ongoing development but it does reduce stack usage and is an improvement. - The 12 patch series "mm, swap: swap table phase III: remove swap_map" from Kairui Song offers memory savings by removing the static swap_map. It also yields some CPU savings and implements several cleanups. - The 2 patch series "mm: memfd_luo: preserve file seals" from Pratyush Yadav adds file seal preservation to LUO's memfd code. - The 2 patch series "mm: zswap: add per-memcg stat for incompressible pages" from Jiayuan Chen adds additional userspace stats reportng to zswap. - The 4 patch series "arch, mm: consolidate empty_zero_page" from Mike Rapoport implements some cleanups for our handling of ZERO_PAGE() and zero_pfn. - The 2 patch series "mm/kmemleak: Improve scan_should_stop() implementation" from Zhongqiu Han provides an robustness improvement and some cleanups in the kmemleak code. - The 4 patch series "Improve khugepaged scan logic" from Vernon Yang "improves the khugepaged scan logic and reduces CPU consumption by prioritizing scanning tasks that access memory frequently". - The 2 patch series "Make KHO Stateless" from Jason Miu simplifies Kexec Handover by "transitioning KHO from an xarray-based metadata tracking system with serialization to a radix tree data structure that can be passed directly to the next kernel" - The 3 patch series "mm: vmscan: add PID and cgroup ID to vmscan tracepoints" from Thomas Ballasi and Steven Rostedt enhances vmscan's tracepointing. - The 5 patch series "mm: arch/shstk: Common shadow stack mapping helper and VM_NOHUGEPAGE" from Catalin Marinas is a cleanup for the shadow stack code: remove per-arch code in favour of a generic implementation. - The 2 patch series "Fix KASAN support for KHO restored vmalloc regions" from Pasha Tatashin fixes a WARN() which can be emitted the KHO restores a vmalloc area. - The 4 patch series "mm: Remove stray references to pagevec" from Tal Zussman provides several cleanups, mainly udpating references to "struct pagevec", which became folio_batch three years ago. - The 17 patch series "mm: Eliminate fake head pages from vmemmap optimization" from Kiryl Shutsemau simplifies the HugeTLB vmemmap optimization (HVO) by changing how tail pages encode their relationship to the head page. - The 2 patch series "mm/damon/core: improve DAMOS quota efficiency for core layer filters" from SeongJae Park improves two problematic behaviors of DAMOS that makes it less efficient when core layer filters are used. - The 3 patch series "mm/damon: strictly respect min_nr_regions" from SeongJae Park improves DAMON usability by extending the treatment of the min_nr_regions user-settable parameter. - The 3 patch series "mm/page_alloc: pcp locking cleanup" from Vlastimil Babka is a proper fix for a previously hotfixed SMP=n issue. Code simplifications and cleanups ennsed. - The 16 patch series "mm: cleanups around unmapping / zapping" from David Hildenbrand implements "a bunch of cleanups around unmapping and zapping. Mostly simplifications, code movements, documentation and renaming of zapping functions". - The 6 patch series "support batched checking of the young flag for MGLRU" from Baolin Wang supports batched checking of the young flag for MGLRU. It's part cleanups; one benchmark shows large performance benefits for arm64. - The 5 patch series "memcg: obj stock and slab stat caching cleanups" from Johannes Weiner provides memcg cleanup and robustness improvements. - The 5 patch series "Allow order zero pages in page reporting" from Yuvraj Sakshith enhances page_reporting's free page reporting - it is presently and undesirably order-0 pages when reporting free memory. - The 6 patch series "mm: vma flag tweaks" from Lorenzo Stoakes is cleanup work following from the recent conversion of the VMA flags to a bitmap. - The 10 patch series "mm/damon: add optional debugging-purpose sanity checks" from SeongJae Park adds some more developer-facing debug checks into DAMON core. - The 2 patch series "mm/damon: test and document power-of-2 min_region_sz requirement" from SeongJae Park adds an additional DAMON kunit test and makes some adjustments to the addr_unit parameter handling. - The 3 patch series "mm/damon/core: make passed_sample_intervals comparisons overflow-safe" from SeongJae Park fixes a hard-to-hit time overflow issue in DAMON core. - The 7 patch series "mm/damon: improve/fixup/update ratio calculation, test and documentation" from SeongJae Park is a "batch of misc/minor improvements and fixups" for DAMON. - The 4 patch series "mm: move vma_(kernel\|mmu)_pagesize() out of hugetlb.c" from David Hildenbrand fixes a possible issue with dax-device when CONFIG_HUGETLB=n. Some code movement was required. - The 6 patch series "zram: recompression cleanups and tweaks" from Sergey Senozhatsky provides "a somewhat random mix of fixups, recompression cleanups and improvements" in the zram code. - The 11 patch series "mm/damon: support multiple goal-based quota tuning algorithms" from SeongJae Park extend DAMOS quotas goal auto-tuning to support multiple tuning algorithms that users can select. - The 4 patch series "mm: thp: reduce unnecessary start_stop_khugepaged()" from Breno Leitao fixes the khugpaged sysfs handling so we no longer spam the logs with reams of junk when starting/stopping khugepaged. - The 3 patch series "mm: improve map count checks" from Lorenzo Stoakes provides some cleanups and slight fixes in the mremap, mmap and vma code. - The 5 patch series "mm/damon: support addr_unit on default monitoring targets for modules" from SeongJae Park extends the use of DAMON core's addr_unit tunable. - The 5 patch series "mm: khugepaged cleanups and mTHP prerequisites" from Nico Pache provides cleanups in the khugepaged and is a base for Nico's planned khugepaged mTHP support. - The 15 patch series "mm: memory hot(un)plug and SPARSEMEM cleanups" from David Hildenbrand implements code movement and cleanups in the memhotplug and sparsemem code. - The 2 patch series "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and cleanup CONFIG_MIGRATION" from David Hildenbrand rationalizes some memhotplug Kconfig support. - The 6 patch series "change young flag check functions to return bool" from Baolin Wang is "a cleanup patchset to change all young flag check functions to return bool". - The 3 patch series "mm/damon/sysfs: fix memory leak and NULL dereference issues" from Josh Law and SeongJae Park fixes a few potential DAMON bugs. - The 25 patch series "mm/vma: convert vm_flags_t to vma_flags_t in vma code" from "converts a lot of the existing use of the legacy vm_flags_t data type to the new vma_flags_t type which replaces it". Mainly in the vma code. - The 21 patch series "mm: expand mmap_prepare functionality and usage" from Lorenzo Stoakes "expands the mmap_prepare functionality, which is intended to replace the deprecated f_op->mmap hook which has been the source of bugs and security issues for some time". Cleanups, documentation, extension of mmap_prepare into filesystem drivers. - The 13 patch series "mm/huge_memory: refactor zap_huge_pmd()" from Lorenzo Stoakes simplifies and cleans up zap_huge_pmd(). Additional cleanups around vm_normal_folio_pmd() and the softleaf functionality are performed. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCad3HDQAKCRDdBJ7gKXxA jrUQAPwNhPk5nPSxnyxjAeQtOBHqgCdnICeEismLajPKd9aYRgEA0s2XAu3tSUYi GrBnWImHG3s4ePQxVcPCegWTsOUrXgQ= =1Q7o -----END PGP SIGNATURE----- Merge tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "maple_tree: Replace big node with maple copy" (Liam Howlett) Mainly prepararatory work for ongoing development but it does reduce stack usage and is an improvement. - "mm, swap: swap table phase III: remove swap_map" (Kairui Song) Offers memory savings by removing the static swap_map. It also yields some CPU savings and implements several cleanups. - "mm: memfd_luo: preserve file seals" (Pratyush Yadav) File seal preservation to LUO's memfd code - "mm: zswap: add per-memcg stat for incompressible pages" (Jiayuan Chen) Additional userspace stats reportng to zswap - "arch, mm: consolidate empty_zero_page" (Mike Rapoport) Some cleanups for our handling of ZERO_PAGE() and zero_pfn - "mm/kmemleak: Improve scan_should_stop() implementation" (Zhongqiu Han) A robustness improvement and some cleanups in the kmemleak code - "Improve khugepaged scan logic" (Vernon Yang) Improve khugepaged scan logic and reduce CPU consumption by prioritizing scanning tasks that access memory frequently - "Make KHO Stateless" (Jason Miu) Simplify Kexec Handover by transitioning KHO from an xarray-based metadata tracking system with serialization to a radix tree data structure that can be passed directly to the next kernel - "mm: vmscan: add PID and cgroup ID to vmscan tracepoints" (Thomas Ballasi and Steven Rostedt) Enhance vmscan's tracepointing - "mm: arch/shstk: Common shadow stack mapping helper and VM_NOHUGEPAGE" (Catalin Marinas) Cleanup for the shadow stack code: remove per-arch code in favour of a generic implementation - "Fix KASAN support for KHO restored vmalloc regions" (Pasha Tatashin) Fix a WARN() which can be emitted the KHO restores a vmalloc area - "mm: Remove stray references to pagevec" (Tal Zussman) Several cleanups, mainly udpating references to "struct pagevec", which became folio_batch three years ago - "mm: Eliminate fake head pages from vmemmap optimization" (Kiryl Shutsemau) Simplify the HugeTLB vmemmap optimization (HVO) by changing how tail pages encode their relationship to the head page - "mm/damon/core: improve DAMOS quota efficiency for core layer filters" (SeongJae Park) Improve two problematic behaviors of DAMOS that makes it less efficient when core layer filters are used - "mm/damon: strictly respect min_nr_regions" (SeongJae Park) Improve DAMON usability by extending the treatment of the min_nr_regions user-settable parameter - "mm/page_alloc: pcp locking cleanup" (Vlastimil Babka) The proper fix for a previously hotfixed SMP=n issue. Code simplifications and cleanups ensued - "mm: cleanups around unmapping / zapping" (David Hildenbrand) A bunch of cleanups around unmapping and zapping. Mostly simplifications, code movements, documentation and renaming of zapping functions - "support batched checking of the young flag for MGLRU" (Baolin Wang) Batched checking of the young flag for MGLRU. It's part cleanups; one benchmark shows large performance benefits for arm64 - "memcg: obj stock and slab stat caching cleanups" (Johannes Weiner) memcg cleanup and robustness improvements - "Allow order zero pages in page reporting" (Yuvraj Sakshith) Enhance free page reporting - it is presently and undesirably order-0 pages when reporting free memory. - "mm: vma flag tweaks" (Lorenzo Stoakes) Cleanup work following from the recent conversion of the VMA flags to a bitmap - "mm/damon: add optional debugging-purpose sanity checks" (SeongJae Park) Add some more developer-facing debug checks into DAMON core - "mm/damon: test and document power-of-2 min_region_sz requirement" (SeongJae Park) An additional DAMON kunit test and makes some adjustments to the addr_unit parameter handling - "mm/damon/core: make passed_sample_intervals comparisons overflow-safe" (SeongJae Park) Fix a hard-to-hit time overflow issue in DAMON core - "mm/damon: improve/fixup/update ratio calculation, test and documentation" (SeongJae Park) A batch of misc/minor improvements and fixups for DAMON - "mm: move vma_(kernel\|mmu)_pagesize() out of hugetlb.c" (David Hildenbrand) Fix a possible issue with dax-device when CONFIG_HUGETLB=n. Some code movement was required. - "zram: recompression cleanups and tweaks" (Sergey Senozhatsky) A somewhat random mix of fixups, recompression cleanups and improvements in the zram code - "mm/damon: support multiple goal-based quota tuning algorithms" (SeongJae Park) Extend DAMOS quotas goal auto-tuning to support multiple tuning algorithms that users can select - "mm: thp: reduce unnecessary start_stop_khugepaged()" (Breno Leitao) Fix the khugpaged sysfs handling so we no longer spam the logs with reams of junk when starting/stopping khugepaged - "mm: improve map count checks" (Lorenzo Stoakes) Provide some cleanups and slight fixes in the mremap, mmap and vma code - "mm/damon: support addr_unit on default monitoring targets for modules" (SeongJae Park) Extend the use of DAMON core's addr_unit tunable - "mm: khugepaged cleanups and mTHP prerequisites" (Nico Pache) Cleanups to khugepaged and is a base for Nico's planned khugepaged mTHP support - "mm: memory hot(un)plug and SPARSEMEM cleanups" (David Hildenbrand) Code movement and cleanups in the memhotplug and sparsemem code - "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and cleanup CONFIG_MIGRATION" (David Hildenbrand) Rationalize some memhotplug Kconfig support - "change young flag check functions to return bool" (Baolin Wang) Cleanups to change all young flag check functions to return bool - "mm/damon/sysfs: fix memory leak and NULL dereference issues" (Josh Law and SeongJae Park) Fix a few potential DAMON bugs - "mm/vma: convert vm_flags_t to vma_flags_t in vma code" (Lorenzo Stoakes) Convert a lot of the existing use of the legacy vm_flags_t data type to the new vma_flags_t type which replaces it. Mainly in the vma code. - "mm: expand mmap_prepare functionality and usage" (Lorenzo Stoakes) Expand the mmap_prepare functionality, which is intended to replace the deprecated f_op->mmap hook which has been the source of bugs and security issues for some time. Cleanups, documentation, extension of mmap_prepare into filesystem drivers - "mm/huge_memory: refactor zap_huge_pmd()" (Lorenzo Stoakes) Simplify and clean up zap_huge_pmd(). Additional cleanups around vm_normal_folio_pmd() and the softleaf functionality are performed. * tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits) mm: fix deferred split queue races during migration mm/khugepaged: fix issue with tracking lock mm/huge_memory: add and use has_deposited_pgtable() mm/huge_memory: add and use normal_or_softleaf_folio_pmd() mm: add softleaf_is_valid_pmd_entry(), pmd_to_softleaf_folio() mm/huge_memory: separate out the folio part of zap_huge_pmd() mm/huge_memory: use mm instead of tlb->mm mm/huge_memory: remove unnecessary sanity checks mm/huge_memory: deduplicate zap deposited table call mm/huge_memory: remove unnecessary VM_BUG_ON_PAGE() mm/huge_memory: add a common exit path to zap_huge_pmd() mm/huge_memory: handle buggy PMD entry in zap_huge_pmd() mm/huge_memory: have zap_huge_pmd return a boolean, add kdoc mm/huge: avoid big else branch in zap_huge_pmd() mm/huge_memory: simplify vma_is_specal_huge() mm: on remap assert that input range within the proposed VMA mm: add mmap_action_map_kernel_pages[_full]() uio: replace deprecated mmap hook with mmap_prepare in uio_info drivers: hv: vmbus: replace deprecated mmap hook with mmap_prepare mm: allow handling of stacked mmap_prepare hooks in more drivers ...	2026-04-15 12:59:16 -07:00
Linus Torvalds	883af1f8e8	- Print TDX module version during boot - Make TDX attribute naming consistent -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmndRYkACgkQaDWVMHDJ krDp3Q/+LyUUBGw250JkI4DyVNJ6jS0DBa80QKdRVsiIi0u4L14EguNN73yV4TyG PwyoE8e1ecvdQTfCdm9Wt9FiIowlptBCu2DOGsO4AYAnGOuDIbvNAYZ+HDOzM4+2 g5QaGlbSw7bx/zFOhykOH58FWXz9tQ1TyyOw0aICPI83XMqE8xrMHKrGFIyOT9g1 hqPpegVmpMIjbTdYzb3pH1Be2Z4ymyOeAlzH0p+TIp9reX3qFLsOoD9PQxsDbekO AILNXZiu1dAg8FS5xqNIEZyoR+liYmcTQ/PhaGKvXV7ml8flylGvnz4FXijWzUvE kKtaMjA3JPJ/G8rcdR75T5CrGcTXXEOgl4GCbRUiEF55ocF4ds/DgABVV+YuoosO g+bFihYfeOJGkETQnBCjYjJ+NGcuBO6bpD5TqfN1vCsuemoRobNXKlSTS7PihynB 4GP98ELaLXmZzj8KzME1ysLgRj4mJDEGBtTfRm4iCLRB4Drx8Wv4ZwPMCDaMtWkQ T/+wRLd9q/WOdUCmXK+iawQ3LDxMQitIZguHHBNjKk2WfFYoxs0HoRV9OkgdjEVK zDzzCkx3945UnoyKX5h/lxPC1wyp9r54+4zsIY2jKTMNuVl773DYDMn5P1aKGxCQ YPrHlJrmmlSFuWzi9/Vx2dcZxNWoZ4YPfMrgrzP5P3EKwxUzX9w= =x+Yc -----END PGP SIGNATURE----- Merge tag 'x86_tdx_for_7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 TDX updates from Dave Hansen: "The only real thing of note here is printing the TDX module version. This is a little silly on its own, but the upcoming TDX module update code needs the same TDX module call. This shrinks that set a wee bit. There's also few minor macro cleanups and a tweak to the GetQuote ABI to make it easier for userspace to detect zero-length (failed) quotes" * tag 'x86_tdx_for_7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: virt: tdx-guest: Return error for GetQuote failures KVM/TDX: Rename KVM_SUPPORTED_TD_ATTRS to KVM_SUPPORTED_TDX_TD_ATTRS x86/tdx: Rename TDX_ATTR_* to TDX_TD_ATTR_* KVM/TDX: Remove redundant definitions of TDX_TD_ATTR_* x86/tdx: Fix the typo in TDX_ATTR_MIGRTABLE x86/virt/tdx: Print TDX module version during init x86/virt/tdx: Retrieve TDX module version	2026-04-14 14:42:55 -07:00
Paolo Bonzini	01f217fa8a	KVM: x86: use inlines instead of macros for is_sev_*guest This helps avoiding more embarrassment to this maintainer, but also will catch mistakes more easily for others. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2026-04-13 19:00:47 +02:00
Paolo Bonzini	92cdeac6a4	KVM SVM changes for 7.1 - Fix and optimize IRQ window inhibit handling for AVIC (the tracking needs to be per-vCPU, e.g. so that KVM doesn't prematurely re-enable AVIC if multiple vCPUs have to-be-injected IRQs). - Fix an undefined behavior warning where a crafty userspace can read the "avic" module param before it's fully initialized. - Fix a (likely benign) bug in the "OS-visible workarounds" handling, where KVM could clobber state when enabling virtualization on multiple CPUs in parallel, and clean up and optimize the code. - Drop a WARN in KVM_MEMORY_ENCRYPT_REG_REGION where KVM complains about a "too large" size based purely on user input, and clean up and harden the related pinning code. - Disallow synchronizing a VMSA of an already-launched/encrypted vCPU, as doing so for an SNP guest will trigger an RMP violation #PF and crash the host. - Protect all of sev_mem_enc_register_region() with kvm->lock to ensure sev_guest() is stable for the entire of the function. - Lock all vCPUs when synchronizing VMSAs for SNP guests to ensure the VMSA page isn't actively being used. - Overhaul KVM's APIs for detecting SEV+ guests so that VM-scoped queries are required to hold kvm->lock (KVM has had multiple bugs due "is SEV?" checks becoming stale), enforced by lockdep. Add and use vCPU-scoped APIs when possible/appropriate, as all checks that originate from a vCPU are guaranteed to be stable. - Convert a pile of kvm->lock SEV code to guard(). -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmnZK4wACgkQOlYIJqCj N/2uOQ/+LzGQD7myCn47rUhiMo/aY3qjrS+u6PSuFeEMFyaATiWpf/s50hIMHh+/ VCRAptKgL0PBV/RbOqhZdx4Zn/Yb/NNBwraqc7xQgMOlQwFedOetuFtRveJ4z6Af 8ycwMxYYtz6SbaT+R3AdK51Nb8S2ZRpd082CiaLgChVcdodkeFtS5KVBqrlBGB21 EKFbW+QXMHrpmGbgZ8YWMrL5UCSmJFG8ZztcncNfsLS6WxbUjdo/MEiLEDIsrXZd oGViwmnY7hcJ5ClcF8UMPtXHHP1+EOk6BKAsmYguG3qUxbX+EEbymb8o16k+h6iw ybUZWF7cq44Pl1FModTFAB5LQPg6z6XNhjZ8L+0kjAI05lvszf3QDtezQ+BF24tW S18x6yCIpdEJ3VxM4r5Yqf10CRbxMtHKU6EUjL7C4KNNYOz2sX+Tqgi/uHtbgzUJ zPG9faY5M3hMjfj5tOCpy/fAEF3fD1mg4GE8pfXZa8d/ppqI4hU0ASpFzw/d4LnH PJSaeJhmmEIdRj+RtIGIRSZ9flHM61/+clKngaoR+c/mPQPnDbapivl2kgKWbVJ4 47c44pYQLTWI01nuwcEILCEj8D1mABJygPjNoO79b2mitmYazMnO42mV3lI5oP0c QyzX7sSed6ImIRn8xadfE+tIz3ji9r/ak+ekZvdNiqiNEoi2YG8= =AjgE -----END PGP SIGNATURE----- Merge tag 'kvm-x86-svm-7.1' of https://github.com/kvm-x86/linux into HEAD KVM SVM changes for 7.1 - Fix and optimize IRQ window inhibit handling for AVIC (the tracking needs to be per-vCPU, e.g. so that KVM doesn't prematurely re-enable AVIC if multiple vCPUs have to-be-injected IRQs). - Fix an undefined behavior warning where a crafty userspace can read the "avic" module param before it's fully initialized. - Fix a (likely benign) bug in the "OS-visible workarounds" handling, where KVM could clobber state when enabling virtualization on multiple CPUs in parallel, and clean up and optimize the code. - Drop a WARN in KVM_MEMORY_ENCRYPT_REG_REGION where KVM complains about a "too large" size based purely on user input, and clean up and harden the related pinning code. - Disallow synchronizing a VMSA of an already-launched/encrypted vCPU, as doing so for an SNP guest will trigger an RMP violation #PF and crash the host. - Protect all of sev_mem_enc_register_region() with kvm->lock to ensure sev_guest() is stable for the entire of the function. - Lock all vCPUs when synchronizing VMSAs for SNP guests to ensure the VMSA page isn't actively being used. - Overhaul KVM's APIs for detecting SEV+ guests so that VM-scoped queries are required to hold kvm->lock (KVM has had multiple bugs due "is SEV?" checks becoming stale), enforced by lockdep. Add and use vCPU-scoped APIs when possible/appropriate, as all checks that originate from a vCPU are guaranteed to be stable. - Convert a pile of kvm->lock SEV code to guard().	2026-04-13 19:00:43 +02:00
Paolo Bonzini	4a530993da	KVM x86 VMXON and EFER.SVME extraction for 7.1 Move _only_ VMXON+VMXOFF and EFER.SVME toggling out of KVM (versus all of VMX and SVM enabling) out of KVM and into the core kernel so that non-KVM TDX enabling, e.g. for trusted I/O, can make SEAMCALLs without needing to ensure KVM is fully loaded. TDX isn't a hypervisor, and isn't trying to be a hypervisor. Specifically, TDX should _never_ have it's own VMCSes (that are visible to the host; the TDX-Module has it's own VMCSes to do SEAMCALL/SEAMRET), and so there is simply no reason to move that functionality out of KVM. With that out of the way, dealing with VMXON/VMXOFF and EFER.SVME is a fairly simple refcounting game. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmnZJkYACgkQOlYIJqCj N/21chAAjg9tb/E8+vqBZDT5vO9Bu6c333irV2vqBBJZWUx6xKhtk77kL6kISWyf aI57hJ5IwbUkfDcomSY+MyRXxw/X4OioSs5qqvcC2XHatGA8XwifJE47cN5ZT0+D hzZjru8Z9VGHf5wUXS41yTHtm+INiEYMgJiseUQR6sbWx3H+zDcLIooNQx/ZLYrV vR+VPtaMYpJ0TTDDqb8PrCnjgXoXFenAnzAj9bAikWP60kaDXrxN9KPc5woDo29+ TrkTyr2mmQvKpNhLCDwAMNa9bXxgzkHEGx8J2WZTbUi9ZBv4MwVsnGLLsaUKQlaa 4V1JDiICzYptjMzU+ka4iTF+m0KEz4EykP7mVVK+5MAHc0NOUVfDW6JP2PM/66dh NyyjGhbrfH0PwqzDn4N2h0MmWT4YNCIxESClecEMtEzsCyWfYOMitxbDbzHnu9Vw a/C0pwWKJ34Trr0O79SevAWJBlu596mya0YvMeCAWxCvSUGknbo5IXdrmtp6htGp Gz5+0ZyvVRbYpwxS+OOpWMkZuPvvEcWTbMAG/scbSHh80P/uCVyuLsRZR2HSB8EV tYnnLDDDQ1KmLV7xmw5XnkN9hFffAM8eXA7KX9TPjCXjd25lCJGgquQEH0oAHe5q 1qXf+lWttP7MIbD5/Ga5CO+FqXAE6xmFRWjEBgLx32kSAWXqxPs= =SuxR -----END PGP SIGNATURE----- Merge tag 'kvm-x86-vmxon-7.1' of https://github.com/kvm-x86/linux into HEAD KVM x86 VMXON and EFER.SVME extraction for 7.1 Move _only_ VMXON+VMXOFF and EFER.SVME toggling out of KVM (versus all of VMX and SVM enabling) out of KVM and into the core kernel so that non-KVM TDX enabling, e.g. for trusted I/O, can make SEAMCALLs without needing to ensure KVM is fully loaded. TIO isn't a hypervisor, and isn't trying to be a hypervisor. Specifically, TIO should _never_ have it's own VMCSes (that are visible to the host; the TDX-Module has it's own VMCSes to do SEAMCALL/SEAMRET), and so there is simply no reason to move that functionality out of KVM. With that out of the way, dealing with VMXON/VMXOFF and EFER.SVME is a fairly simple refcounting game.	2026-04-13 13:04:48 +02:00
Paolo Bonzini	ea8bc95fbb	KVM nested SVM changes for 7.1 (with one common x86 fix) - To minimize the probability of corrupting guest state, defer KVM's non-architectural delivery of exception payloads (e.g. CR2 and DR6) until consumption of the payload is imminent, and force delivery of the payload in all paths where userspace saves relevant state. - Use vcpu->arch.cr2 when updating vmcb12's CR2 on nested #VMEXIT to fix a bug where L2's CR2 can get corrupted after a save/restore, e.g. if the VM is migrated while L2 is faulting in memory. - Fix a class of nSVM bugs where some fields written by the CPU are not synchronized from vmcb02 to cached vmcb12 after VMRUN, and so are not up-to-date when saved by KVM_GET_NESTED_STATE. - Fix a class of bugs where the ordering between KVM_SET_NESTED_STATE and KVM_SET_{S}REGS could cause vmcb02 to be incorrectly initialized after save+restore. - Add a variety of missing nSVM consistency checks. - Fix several bugs where KVM failed to correctly update VMCB fields on nested #VMEXIT. - Fix several bugs where KVM failed to correctly synthesize #UD or #GP for SVM-related instructions. - Add support for save+restore of virtualized LBRs (on SVM). - Refactor various helpers and macros to improve clarity and (hopefully) make the code easier to maintain. - Aggressively sanitize fields when copying from vmcb12 to guard against unintentionally allowing L1 to utilize yet-to-be-defined features. - Fix several bugs where KVM botched rAX legality checks when emulating SVM instructions. Note, KVM is still flawed in that KVM doesn't address size prefix overrides for 64-bit guests; this should probably be documented as a KVM erratum. - Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails instead of somewhat arbitrarily synthesizing #GP (i.e. don't bastardize AMD's already- sketchy behavior of generating #GP if for "unsupported" addresses). - Cache all used vmcb12 fields to further harden against TOCTOU bugs. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmnZfbwACgkQOlYIJqCj N/0pVRAAkys8LLtIekQtEVkaX3EPaXk0lGGmnzXbihgHFsS5lMAS4tcsr7oyk4TI rvJUGmkaTKTboQdTaCq0G7lwCu5hMuXsZ10WvmKfivMFxy3kSppqfffux5zVXng2 U/8oyJSorkX1WPC7d5QAZYMqqcSwQaR+a0FxowghGWBXMRHylerSuH00CiGr6Ron QQbZaKBNtkYwYFNos2tLuT4tueyFogk8FPAmdejEQ9CMxUjeAivlKm8JVXaDvGik lyPYbJJLukjuxSYGYmeRyGLLwK7VBGkFHQp/KBYSBgzGdweabhsQa1Z0CGm24+w1 q626W0sxsq97dZ0cd7oE6Cw+AdlMBK+mjpxB9gX4uLGyYlnFkdJV7OSlHVTR9d96 cqKduT0JvlBnVb7Yd5jyaGVl1YD62p0nwcrTuWidR5IJ16b4mYwwPzvkkQKHLt64 VAhH8lBVtATtblI9gfsbwGezV74xXnuLb0L1G7xeh1VIWu7pubFdqyRwIA+qiXQa OkyxzoDlFl+QF2Uf3cBCFMojBOrSZRiGiLzIkUnjBsN4N2uOPYTsQEfr9BXVVcv7 obT9xl/wUwry2fAJhUL+IBCDE42+8C62UaWT5KJHQLttBL7Mm06e75hFN5ObbE/x nExL+NmAcsSUUbbdojjnD0KWxYKkosNiONBVrjqqXdmBjmzzOvI= =ys7N -----END PGP SIGNATURE----- Merge tag 'kvm-x86-nested-7.1' of https://github.com/kvm-x86/linux into HEAD KVM nested SVM changes for 7.1 (with one common x86 fix) - To minimize the probability of corrupting guest state, defer KVM's non-architectural delivery of exception payloads (e.g. CR2 and DR6) until consumption of the payload is imminent, and force delivery of the payload in all paths where userspace saves relevant state. - Use vcpu->arch.cr2 when updating vmcb12's CR2 on nested #VMEXIT to fix a bug where L2's CR2 can get corrupted after a save/restore, e.g. if the VM is migrated while L2 is faulting in memory. - Fix a class of nSVM bugs where some fields written by the CPU are not synchronized from vmcb02 to cached vmcb12 after VMRUN, and so are not up-to-date when saved by KVM_GET_NESTED_STATE. - Fix a class of bugs where the ordering between KVM_SET_NESTED_STATE and KVM_SET_{S}REGS could cause vmcb02 to be incorrectly initialized after save+restore. - Add a variety of missing nSVM consistency checks. - Fix several bugs where KVM failed to correctly update VMCB fields on nested #VMEXIT. - Fix several bugs where KVM failed to correctly synthesize #UD or #GP for SVM-related instructions. - Add support for save+restore of virtualized LBRs (on SVM). - Refactor various helpers and macros to improve clarity and (hopefully) make the code easier to maintain. - Aggressively sanitize fields when copying from vmcb12 to guard against unintentionally allowing L1 to utilize yet-to-be-defined features. - Fix several bugs where KVM botched rAX legality checks when emulating SVM instructions. Note, KVM is still flawed in that KVM doesn't address size prefix overrides for 64-bit guests; this should probably be documented as a KVM erratum. - Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails instead of somewhat arbitrarily synthesizing #GP (i.e. don't bastardize AMD's already- sketchy behavior of generating #GP if for "unsupported" addresses). - Cache all used vmcb12 fields to further harden against TOCTOU bugs.	2026-04-13 13:01:50 +02:00
Paolo Bonzini	1b3090da8d	KVM x86 MMU changes for 7.1 - Fix an undefined behavior warning where a crafty userspace can read kvm.ko's nx_huge_pages before it's fully initialized. - Don't zero-allocate page tables that are used for splitting hugepages in the TDP MMU, as KVM is guaranteed to set all SPTEs in the page table and thus write all bytes. - Bail early when trying to unsync 4KiB mappings if the target gfn can be mapped with a 2MiB hugepage, to avoid the gfn hash lookup. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmnZJZIACgkQOlYIJqCj N/1Tlg/+PMF0O39BRxjyIulWVyxWWU/CD3WxS4xSlOd1Pj3rrwIxDx1VYYkmWX90 8DrBkGCBUdruzNcQQ9YOVU1H+z5yvs5QY2a75TWJX/5A40wXZPMl+8xSpbflP8vQ QECaKAyxc+IutnjdivbLxZGbXiFS5yp1PnvK7dOo75gazFkDBF0kpO8lFyvsoaUh Q97kkIT2eR/wTsVx+HXwhY2EgxiBdG6N69JIzzVwOvaddqtB6jSzDXPjAhPiOJVU xlKaJnlCnHhH59eyAegpef3BtpFuyArFVJUZoI0f7zEqMHiztS616iltHPPNeSgT 8oxbJcvAIbLKd6RgN+DJ5zbSOVwU8O7G2WP6dXvuVsjN63nAX9NapMOD5gS7e6/9 gjJq85XClFX2bOHiEnBSAVJ+9obgO69GgP+X3/MIIs0efnCL9ZGtu8tz9ClhkCwP 6cU7vtojENJOYBM/YEUCaGnrB837+0TX1U4LO+tGJeZbWUJXb0UP0I3EpPsHnk0o m/M4q9qzbQHvT/GV3qajp4MLd1NE6W/386xTqvrGv0hFIo0HwDPIPVxzTD7IyypE sEm3HIOiLmE95BGWFKKeKhdT1Yb+wJRuggZXatKGTCeik5FOGYbxtclTGV4h8OHY Vv7Ll/Xom3ekL2dOQbFHbtlfv3ra+4HzLm5rd3uB+RoMcxyPAf0= =y0wF -----END PGP SIGNATURE----- Merge tag 'kvm-x86-mmu-7.1' of https://github.com/kvm-x86/linux into HEAD KVM x86 MMU changes for 7.1 - Fix an undefined behavior warning where a crafty userspace can read kvm.ko's nx_huge_pages before it's fully initialized. - Don't zero-allocate page tables that are used for splitting hugepages in the TDP MMU, as KVM is guaranteed to set all SPTEs in the page table and thus write all bytes. - Bail early when trying to unsync 4KiB mappings if the target gfn can be mapped with a 2MiB hugepage, to avoid the gfn hash lookup.	2026-04-13 12:50:08 +02:00
Paolo Bonzini	7e7a6e2ad2	KVM VMX changes for 7.1 - Drop obsolete (largely ignored by hardwre) branch hint prefixes from the VMX instruction macros, as saving a byte of code per instruction provides more benefits than the (mostly) superfluous prefixes. - Use ASM_INPUT_RM() in __vmcs_writel() to coerce clang into using a register input when appropriate. - Drop unnecessary parentheses in cpu_has_load_cet_ctrl() so as not to suggest that "return (x & y);" is KVM's preferred style. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmnZJ1gACgkQOlYIJqCj N/1F1w/9F43oaEVn7EDmssy/YWDfP6gcApCMDVm/+n6Z0+zBdsMNpZ8G3OQlbITT nfpfA6GqvmW9xsdFsPuM2e6WXSk1dZoDevifzK2g0fe5ZtGfB8xIsovEOvZJTQzo MI8UfJZG9xfnA0WWYcCR9EOjhB9fLyHgnG/7kXrYPao1NAtVo1ADskmZ+FIYZAST dEMF3Wf6u1x6rjaUyhn0vMIDiIoRw5OhMBWYDYVHPIFfbHjxm4iXJjWmO981xis9 fIseUcnGM0OUrFv+dBun7zYqIqj3aUcaI+Bq3/eiSm/pKi9MWBaljgOCjLjj5dZc 07DhFtF0IAUMJIZmq2N/xbxfaaOWdajbw7Wm/ppkvPJg1efz8gB9aKW4wjGAkyRx aNeYFq5VUGlKNp1aXS5TWeYIeAAbc1kqIReRdwFXK/gqXfynQCuH4Z7kBJvWsFTY GA3GgK3l7A7Qk9A/A8PQvoj8uCH3DTv1N+qeLX9Hdk+NgmKDUthj/0rhXXVv/eAF 2auMBtf1JcUjRGwvtu+945dAIFE2ZZ/lTNMYbd19mDiFHQ3yBj9HM3N+Q/+DexWd OXH3xNralJ4IHlJZz7t/N1CuIujDKDdRAwBXuWq4bOBSFGYExjILV5JF0DolVNr/ VSe4+tKkz3jM+HL04edPSe0k8rGCDzL/z1wS4/2vEqHLjSrucbM= =8+qL -----END PGP SIGNATURE----- Merge tag 'kvm-x86-vmx-7.1' of https://github.com/kvm-x86/linux into HEAD KVM VMX changes for 7.1 - Drop obsolete (largely ignored by hardwre) branch hint prefixes from the VMX instruction macros, as saving a byte of code per instruction provides more benefits than the (mostly) superfluous prefixes. - Use ASM_INPUT_RM() in __vmcs_writel() to coerce clang into using a register input when appropriate. - Drop unnecessary parentheses in cpu_has_load_cet_ctrl() so as not to suggest that "return (x & y);" is KVM's preferred style.	2026-04-13 12:49:36 +02:00
Paolo Bonzini	aa856775be	KVM x86 emulated MMIO changes for 7.1 Copy single-chunk MMIO write values into a persistent (per-fragment) field to fix use-after-free stack bugs due to KVM dereferencing a stack pointer after an exit to userspace. Clean up and comment the emulated MMIO code to try to make it easier to maintain (not necessarily "easy", but "easier"). -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmnZJAkACgkQOlYIJqCj N/2TUxAAnmOHpQ0hKjDvkMXTLOUJEitegRn8CDd2NWa6TEnoCFmHu5sXAMaO0V7v EK2NH0uwT43Zr55BlXTxjEaQOby16tKjqK4FlG1Zb8UI6e9Lzhk6zIXMI7NTpV41 HMF73cQyHIzeS9ymf0fdVo3nlnvTXBHVCifyJwmb2RARl+LqmTimHb9+9piuIxJB /v473RoFCNxI7Rwl6Pp5sjl7lWTIDUQJSi3+1gMaowTtnsUyCPTLejwj7/b9NWF3 i+nTNwwpHFvgfTE3decMyKupY9aXUM9FU/AFf+eUbvzjR/Dx/7o31Cpz4NCHQ38c 4TCFKkOQ2r+e8s7ATBeRKdOXdP7d7DW3qasLfPVjzEDxuifmW+awDRYBZwNM/Ybv jDCBO6gbtw/f+oJPKq9oivqBSu+Z6vR7NnPmk1vh7VocsZdAlbCwdBU2+N5DBkfh LJ+nOxzNL1q1A8X61CZFffEooh971Mg7ztHV5IDviK5/Fop0NfQBjxdxoz+wttzp ufwY1WUMHjImRfZTk3e9icJwarqEI39QRmjKuaUJxEXjbJCbtvfKJ0lrNn7RRNPf aqi9M2z6UvVrQi/Vw5rRTx7fYr091QIOHDdGBVl7atcGyU4gvdpXdkBS6xqafgD6 S/QzWypU868iMgLWqNpNeKtRLPQsuwTywC5vx57Sm0yPlPbM8H0= =Qyhh -----END PGP SIGNATURE----- Merge tag 'kvm-x86-mmio-7.1' of https://github.com/kvm-x86/linux into HEAD KVM x86 emulated MMIO changes for 7.1 Copy single-chunk MMIO write values into a persistent (per-fragment) field to fix use-after-free stack bugs due to KVM dereferencing a stack pointer after an exit to userspace. Clean up and comment the emulated MMIO code to try to make it easier to maintain (not necessarily "easy", but "easier").	2026-04-13 12:49:14 +02:00
Paolo Bonzini	276f81a491	KVM x86 misc changes for 7.1 - Advertise support for AVX512 Bit Matrix Multiply (BMM) when it's present in hardware (no additional emulation/virtualization required). - Immediately fail the build if a required #define is missing in one of KVM's headers that is included multiple times. - Reject SET_GUEST_DEBUG with -EBUSY if there's an already injected exception, mostly to prevent syzkaller from abusing the uAPI to trigger WARNs, but also because it can help prevent userspace from unintentionally crashing the VM. - Exempt SMM from CPUID faulting on Intel, as per the spec. - Misc hardening and cleanup changes. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmnZIt8ACgkQOlYIJqCj N/2HqA/8CwoMlaK4nPDp39JI1+avlKaBkrwfF5/mku6uZcrq9WeyflH+t4wc7JE0 lRXQO5PPNideYrjEqLsdn9OWIar+ZsYGrsEO5/MFc4Z67kPkai67m7nUT46APU4Q fE/3KpT3afaHcM6+zpIIF/lMmQJVco+7EQrlexSM9LZTap6uxNRvMC3B/czF7/li UsEJH37vluXxuCPUXAE61IPHtF++eDf4x6w0nIJ+7UJSUZk8JJYWMvJ5lPIxRTGG Pvql2v7hDQ9h2ISIDr+e85wpIpIkbc7hKZMtlib36PB1Dm7gOeKgosFHIwNLnJoJ pxuzsqYShXBHsmsYgzmfYlVUcWFF1f02yC4XfoQ735LNnBbX6bm5nuSmPQBmvg4O +URUKjo4DLjzzs44RrRsBsBVuZTMbe0Ht2qLmGrWrB9+vr1PxQVNFpLA0MCDCFx7 skJTo6raJQkLJmmoKUslehiJFTvzOrOJy8JhWhiznkJNSS5jWFbaFf7nEoMCYIl0 ttzeISQDgzHAvT6V29CO4+zttexF4QVVRwFwG3aI8zGJ3WJhjrNyazVLrvrzWfhA ygNwV0BCEbBclMpBRF4jRLGMibnsTeEsBTiMARgJ0ZL7RPUYeQidVzP/JwPKbod0 DHqqtOXXngl7OsHdfdd74ThKaQb6EzlDFyI5aoYInPCXH/LhE98= =ZvDQ -----END PGP SIGNATURE----- Merge tag 'kvm-x86-misc-7.1' of https://github.com/kvm-x86/linux into HEAD KVM x86 misc changes for 7.1 - Advertise support for AVX512 Bit Matrix Multiply (BMM) when it's present in hardware (no additional emulation/virtualization required). - Immediately fail the build if a required #define is missing in one of KVM's headers that is included multiple times. - Reject SET_GUEST_DEBUG with -EBUSY if there's an already injected exception, mostly to prevent syzkaller from abusing the uAPI to trigger WARNs, but also because it can help prevent userspace from unintentionally crashing the VM. - Exempt SMM from CPUID faulting on Intel, as per the spec. - Misc hardening and cleanup changes.	2026-04-13 11:51:34 +02:00
Sean Christopherson	bc0932cf9b	KVM: SEV: Goto an existing error label if charging misc_cg for an ASID fails Dedup a small amount of cleanup code in SEV ASID allocation by reusing an existing error label. No functional change intended. Link: https://patch.msgid.link/20260310234829.2608037-22-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-04-09 12:00:24 -07:00
Carlos López	1d353dae3d	KVM: SVM: Move lock-protected allocation of SEV ASID into a separate helper Extract the lock-protected parts of SEV ASID allocation into a new helper and opportunistically convert it to use guard() when acquiring the mutex. Preserve the goto even though it's a little odd, as it's there's a fair amount of subtlety that makes it surprisingly difficult to replicate the functionality with a loop construct, and arguably using goto yields the most readable code. No functional change intended. Signed-off-by: Carlos López <clopez@suse.de> [sean: move code to separate helper, rework shortlog+changelog] Link: https://patch.msgid.link/20260310234829.2608037-21-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-04-09 12:00:23 -07:00
Carlos López	f09b7f4af9	KVM: SEV: use mutex guard in snp_handle_guest_req() Simplify the error paths in snp_handle_guest_req() by using a mutex guard, allowing early return instead of using gotos. Signed-off-by: Carlos López <clopez@suse.de> Link: https://patch.msgid.link/20260120201013.3931334-8-clopez@suse.de Link: https://patch.msgid.link/20260310234829.2608037-20-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-04-09 12:00:22 -07:00
Carlos López	84841f3941	KVM: SEV: use mutex guard in sev_mem_enc_unregister_region() Simplify the error paths in sev_mem_enc_unregister_region() by using a mutex guard, allowing early return instead of using gotos. Signed-off-by: Carlos López <clopez@suse.de> Link: https://patch.msgid.link/20260120201013.3931334-7-clopez@suse.de Link: https://patch.msgid.link/20260310234829.2608037-19-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-04-09 12:00:22 -07:00
Carlos López	63e56d8425	KVM: SEV: use mutex guard in sev_mem_enc_ioctl() Simplify the error paths in sev_mem_enc_ioctl() by using a mutex guard, allowing early return instead of using gotos. Signed-off-by: Carlos López <clopez@suse.de> Link: https://patch.msgid.link/20260120201013.3931334-5-clopez@suse.de Link: https://patch.msgid.link/20260310234829.2608037-18-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-04-09 12:00:21 -07:00
Carlos López	04d77ded64	KVM: SEV: use mutex guard in snp_launch_update() Simplify the error paths in snp_launch_update() by using a mutex guard, allowing early return instead of using gotos. Signed-off-by: Carlos López <clopez@suse.de> Link: https://patch.msgid.link/20260120201013.3931334-4-clopez@suse.de Link: https://patch.msgid.link/20260310234829.2608037-17-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-04-09 12:00:20 -07:00
Sean Christopherson	ba903f7382	KVM: SEV: Assert that kvm->lock is held when querying SEV+ support Assert that kvm->lock is held when checking if a VM is an SEV+ VM, as KVM sets and resets the relevant flags when initialization SEV state, i.e. it's extremely easy to end up with TOCTOU bugs if kvm->lock isn't held. Add waivers for a VM being torn down (refcount is '0') and for there being a loaded vCPU, with comments for both explaining why they're safe. Note, the "vCPU loaded" waiver is necessary to avoid splats on the SNP checks in sev_gmem_prepare() and sev_gmem_max_mapping_level(), which are currently called when handling nested page faults. Alternatively, those checks could key off KVM_X86_SNP_VM, as kvm_arch.vm_type is stable early in VM creation. Prioritize consistency, at least for now, and to leave a "reminder" that the max mapping level code in particular likely needs special attention if/when KVM supports dirty logging for SNP guests. Link: https://patch.msgid.link/20260310234829.2608037-16-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-04-09 12:00:20 -07:00
Sean Christopherson	2f34d421e8	KVM: SEV: Document that checking for SEV+ guests when reclaiming memory is "safe" Document that the check for an SEV+ guest when reclaiming guest memory is safe even though kvm->lock isn't held. This will allow asserting that kvm->lock is held in the SEV accessors, without triggering false positives on the "safe" cases. No functional change intended. Link: https://patch.msgid.link/20260310234829.2608037-15-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-04-09 12:00:19 -07:00
Sean Christopherson	85d2243a21	KVM: SEV: Hide "struct kvm_sev_info" behind CONFIG_KVM_AMD_SEV=y Bury "struct kvm_sev_info" behind CONFIG_KVM_AMD_SEV=y to make it harder for SEV specific code to sneak into common SVM code. No functional change intended. Link: https://patch.msgid.link/20260310234829.2608037-14-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-04-09 12:00:18 -07:00
Sean Christopherson	4f67cf7e7e	KVM: SEV: WARN on unhandled VM type when initializing VM WARN if KVM encounters an unhandled VM type when setting up flags for SEV+ VMs, e.g. to guard against adding a new flavor of SEV without adding proper recognition in sev_vm_init(). Practically speaking, no functional change intended (the new "default" case should be unreachable). Link: https://patch.msgid.link/20260310234829.2608037-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-04-09 12:00:17 -07:00
Sean Christopherson	e353f1beed	KVM: SEV: Move SEV-specific VM initialization to sev.c Move SEV+ VM initialization to sev.c (as sev_vm_init()) so that kvm_sev_info (and all usage) can be gated on CONFIG_KVM_AMD_SEV=y without needing more #ifdefs. As a bonus, isolating the logic will make it easier to harden the flow, e.g. to WARN if the vm_type is unknown. No functional change intended (SEV, SEV_ES, and SNP VM types are only supported if CONFIG_KVM_AMD_SEV=y). Link: https://patch.msgid.link/20260310234829.2608037-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-04-08 16:04:27 -07:00
Sean Christopherson	7341500f8b	KVM: SEV: Move standard VM-scoped helpers to detect SEV+ guests to sev.c Now that all external usage of the VM-scoped APIs to detect SEV+ guests is gone, drop the stubs provided for CONFIG_KVM_AMD_SEV=n builds and bury the "standard" APIs in sev.c. No functional change intended. Link: https://patch.msgid.link/20260310234829.2608037-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-04-08 16:04:26 -07:00
Sean Christopherson	56906910ea	KVM: SEV: Document the SEV-ES check when querying SMM support as "safe" Use the "unsafe" API to check for an SEV-ES+ guest when determining whether or not SMBASE is a supported MSR, i.e. whether or not emulated SMM is supported. This will eventually allow adding lockdep assertings to the APIs for detecting SEV+ VMs without triggering "real" false positives. While svm_has_emulated_msr() doesn't hold kvm->lock, i.e. can get both false positives and false negatives, both are completely fine, as the only time the result isn't stable is when userspace is the sole consumer of the result. I.e. userspace can confuse itself, but that's it. No functional change intended. Link: https://patch.msgid.link/20260310234829.2608037-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-04-08 16:04:25 -07:00
Sean Christopherson	138e5f6a3e	KVM: SEV: Add quad-underscore version of VM-scoped APIs to detect SEV+ guests Add "unsafe" quad-underscore versions of the SEV+ guest detectors in anticipation of hardening the APIs via lockdep assertions. This will allow adding exceptions for usage that is known to be safe in advance of the lockdep assertions. Use a pile of underscores to try and communicate that use of the "unsafe" shouldn't be done lightly. No functional change intended. Link: https://patch.msgid.link/20260310234829.2608037-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-04-08 16:04:25 -07:00
Sean Christopherson	5bf92e4753	KVM: SEV: Provide vCPU-scoped accessors for detecting SEV+ guests Provide vCPU-scoped accessors for detecting if the vCPU belongs to an SEV, SEV-ES, or SEV-SNP VM, partly to dedup a small amount of code, but mostly to better document which usages are "safe". Generally speaking, using the VM-scoped sev_guest() and friends outside of kvm->lock is unsafe, as they can get both false positives and false negatives. But for vCPUs, the accessors are guaranteed to provide a stable result as KVM disallows initialization SEV+ state after vCPUs are created. I.e. operating on a vCPU guarantees the VM can't "become" an SEV+ VM, and that it can't revert back to a "normal" VM. This will also allow dropping the stubs for the VM-scoped accessors, as it's relatively easy to eliminate usage of the accessors from common SVM once the vCPU-scoped checks are out of the way. No functional change intended. Link: https://patch.msgid.link/20260310234829.2608037-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-04-08 16:04:24 -07:00
Sean Christopherson	8075360f3b	KVM: SEV: Lock all vCPUs for the duration of SEV-ES VMSA synchronization Lock and unlock all vCPUs in a single batch when synchronizing SEV-ES VMSAs during launch finish, partly to dedup the code by a tiny amount, but mostly so that sev_launch_update_vmsa() uses the same logic/flow as all other SEV ioctls that lock all vCPUs. Link: https://patch.msgid.link/20260310234829.2608037-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-04-08 16:04:23 -07:00
Sean Christopherson	cb923ee6a8	KVM: SEV: Lock all vCPUs when synchronzing VMSAs for SNP launch finish Lock all vCPUs when synchronizing and encrypting VMSAs for SNP guests, as allowing userspace to manipulate and/or run a vCPU while its state is being synchronized would at best corrupt vCPU state, and at worst crash the host kernel. Opportunistically assert that vcpu->mutex is held when synchronizing its VMSA (the SEV-ES path already locks vCPUs). Fixes: `ad27ce1555` ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_FINISH command") Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260310234829.2608037-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-04-08 16:04:19 -07:00
Mike Rapoport (Microsoft)	9a1d0c738b	mm: rename my_zero_pfn() to zero_pfn() my_zero_pfn() is a silly name. Rename zero_pfn variable to zero_page_pfn and my_zero_pfn() function to zero_pfn(). While on it, move extern declarations of zero_page_pfn outside the functions that use it and add a comment about what ZERO_PAGE is. Link: https://lkml.kernel.org/r/20260211103141.3215197-3-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy (CS GROUP) <chleroy@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:01 -07:00
Yosry Ahmed	2daf71bfd7	KVM: nSVM: Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails KVM currently injects a #GP if mapping vmcb12 fails when emulating VMRUN/VMLOAD/VMSAVE. This is not architectural behavior, as #GP should only be injected if the physical address is not supported or not aligned. Instead, handle it as an emulation failure, similar to how nVMX handles failures to read/write guest memory in several emulation paths. When virtual VMLOAD/VMSAVE is enabled, if vmcb12's GPA is not mapped in the NPTs a VMEXIT(#NPF) will be generated, and KVM will install an MMIO SPTE and emulate the instruction if there is no corresponding memslot. x86_emulate_insn() will return EMULATION_FAILED as VMLOAD/VMSAVE are not handled as part of the twobyte_insn cases. Even though this will also result in an emulation failure, it will only result in a straight return to userspace if KVM_CAP_EXIT_ON_EMULATION_FAILURE is set. Otherwise, it would inject #UD and only exit to userspace if not in guest mode. So the behavior is slightly different if virtual VMLOAD/VMSAVE is enabled. Fixes: `3d6368ef58` ("KVM: SVM: Add VMRUN handler") Reported-by: Jim Mattson <jmattson@google.com> Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260316202732.3164936-8-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-04-03 16:08:04 -07:00
Yosry Ahmed	878b8efa2a	KVM: SVM: Treat mapping failures equally in VMLOAD/VMSAVE emulation Currently, a #GP is only injected if kvm_vcpu_map() fails with -EINVAL. But it could also fail with -EFAULT if creating a host mapping failed. Inject a #GP in all cases, no reason to treat failure modes differently. Similar to commit `01ddcdc55e` ("KVM: nSVM: Always inject a #GP if mapping VMCB12 fails on nested VMRUN"), treat all failures equally. Fixes: `8c5fbf1a72` ("KVM/nSVM: Use the new mapping API for mapping guest memory") Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260316202732.3164936-7-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-04-03 16:08:03 -07:00
Yosry Ahmed	783cf7d01f	KVM: SVM: Check EFER.SVME and CPL on #GP intercept of SVM instructions When KVM intercepts #GP on an SVM instruction from L2, it checks the legality of RAX, and injects a #GP if RAX is illegal, or otherwise synthesizes a #VMEXIT to L1. However, checking EFER.SVME and CPL takes precedence over both the RAX check and the intercept. Call nested_svm_check_permissions() first to cover both. Note that if #GP is intercepted on SVM instruction in L1, the intercept handlers of VMRUN/VMLOAD/VMSAVE already perform these checks. Note #2, if KVM does not intercept #GP, the check for EFER.SVME is not done in the correct order, because KVM handles it by intercepting the instructions when EFER.SVME=0 and injecting #UD. However, a #GP injected by hardware would happen before the instruction intercept, leading to #GP taking precedence over #UD from the guest's perspective. Opportunistically add a FIXME for this. Fixes: `82a11e9c6f` ("KVM: SVM: Add emulation support for #GP triggered by SVM instructions") Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260316202732.3164936-6-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-04-03 16:08:03 -07:00
Yosry Ahmed	d2fbeb61e1	KVM: SVM: Move RAX legality check to SVM insn interception handlers When #GP is intercepted by KVM, the #GP interception handler checks whether the GPA in RAX is legal and reinjects the #GP accordingly. Otherwise, it calls into the appropriate interception handler for VMRUN/VMLOAD/VMSAVE. The intercept handlers do not check RAX. However, the intercept handlers need to do the RAX check, because if the guest has a smaller MAXPHYADDR, RAX could be legal from the hardware perspective (i.e. CPU does not inject #GP), but not from the vCPU's perspective. Note that with allow_smaller_maxphyaddr, both NPT and VLS cannot be used, so VMLOAD/VMSAVE have to be intercepted, and RAX can always be checked against the vCPU's MAXPHYADDR. Move the check into the interception handlers for VMRUN/VMLOAD/VMSAVE as the CPU does not check RAX before the interception. Read RAX using kvm_register_read() to avoid a false negative on page_address_valid() on 32-bit due to garbage in the higher bits. Keep the check in the #GP intercept handler in the nested case where a #VMEXIT is synthesized into L1, as the RAX check is still needed there and takes precedence over the intercept. Opportunistically add a FIXME about the #VMEXIT being synthesized into L1, as it needs to be conditional. Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260316202732.3164936-5-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-04-03 16:08:02 -07:00
Yosry Ahmed	435741a4e7	KVM: SVM: Properly check RAX on #GP intercept of SVM instructions When KVM intercepts #GP on an SVM instruction, it re-injects the #GP if the instruction was executed with a mis-algined RAX. However, a #GP should also be reinjected if RAX contains an illegal GPA, according to the APM, one of #GP conditions is: rAX referenced a physical address above the maximum supported physical address. Replace the PAGE_MASK check with page_address_valid(), which checks both page-alignment as well as the legality of the GPA based on the vCPU's MAXPHYADDR. Use kvm_register_read() to read RAX to so that bits 63:32 are dropped when the vCPU is in 32-bit mode, i.e. to avoid a false positive when checking the validity of the address. Note that this is currently only a problem if KVM is running an L2 guest and ends up synthesizing a #VMEXIT to L1, as the RAX check takes precedence over the intercept. Otherwise, if KVM emulates the instruction, kvm_vcpu_map() should fail on illegal GPAs and inject a #GP anyway. However, following patches will change the failure behavior of kvm_vcpu_map(), so make sure the #GP interception handler does this appropriately. Opportunistically drop a teaser FIXME about the SVM instructions handling on #GP belonging in the emulator. Fixes: `82a11e9c6f` ("KVM: SVM: Add emulation support for #GP triggered by SVM instructions") Fixes: `d1cba6c922` ("KVM: x86: nSVM: test eax for 4K alignment for GP errata workaround") Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260316202732.3164936-4-yosry@kernel.org [sean: massage wording with respect to kvm_register_read()] Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-04-03 16:08:01 -07:00
Yosry Ahmed	27f70eaa86	KVM: SVM: Refactor SVM instruction handling on #GP intercept Instead of returning an opcode from svm_instr_opcode() and then passing it to emulate_svm_instr(), which uses it to find the corresponding exit code and intercept handler, return the exit code directly from svm_instr_opcode(), and rename it to svm_get_decoded_instr_exit_code(). emulate_svm_instr() boils down to synthesizing a #VMEXIT or calling the intercept handler, so open-code it in gp_interception(), and use svm_invoke_exit_handler() to call the intercept handler based on the exit code. This allows for dropping the SVM_INSTR_* enum, and the const array mapping its values to exit codes and intercept handlers. In gp_intercept(), handle SVM instructions and first with an early return, and invert is_guest_mode() checks, un-indenting the rest of the code. No functional change intended. Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260316202732.3164936-3-yosry@kernel.org [sean: add BUILD_BUG_ON(), tweak formatting/naming] Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-04-03 16:08:01 -07:00
Yosry Ahmed	c85aaff26d	KVM: SVM: Properly check RAX in the emulator for SVM instructions Architecturally, VMRUN/VMLOAD/VMSAVE should generate a #GP if the physical address in RAX is not supported. check_svme_pa() hardcodes this to checking that bits 63-48 are not set. This is incorrect on HW supporting 52 bits of physical address space. Additionally, the emulator does not check if the address is not aligned, which should also result in #GP. Use page_address_valid() which properly checks alignment and the address legality based on the guest's MAXPHYADDR. Plumb it through x86_emulate_ops, similar to is_canonical_addr(), to avoid directly accessing the vCPU object in emulator code. Fixes: `01de8b09e6` ("KVM: SVM: Add intercept checks for SVM instructions") Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260316202732.3164936-2-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-04-03 16:08:00 -07:00
Sean Christopherson	624bf3440d	KVM: SEV: Disallow LAUNCH_FINISH if vCPUs are actively being created Reject LAUNCH_FINISH for SEV-ES and SNP VMs if KVM is actively creating one or more vCPUs, as KVM needs to process and encrypt each vCPU's VMSA. Letting userspace create vCPUs while LAUNCH_FINISH is in-progress is "fine", at least in the current code base, as kvm_for_each_vcpu() operates on online_vcpus, LAUNCH_FINISH (all SEV+ sub-ioctls) holds kvm->mutex, and fully onlining a vCPU in kvm_vm_ioctl_create_vcpu() is done under kvm->mutex. I.e. there's no difference between an in-progress vCPU and a vCPU that is created entirely after LAUNCH_FINISH. However, given that concurrent LAUNCH_FINISH and vCPU creation can't possibly work (for any reasonable definition of "work"), since userspace can't guarantee whether a particular vCPU will be encrypted or not, disallow the combination as a hardening measure, to reduce the probability of introducing bugs in the future, and to avoid having to reason about the safety of future changes related to LAUNCH_FINISH. Cc: Jethro Beekman <jethro@fortanix.com> Closes: https://lore.kernel.org/all/b31f7c6e-2807-4662-bcdd-eea2c1e132fa@fortanix.com Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260310234829.2608037-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-04-03 09:37:36 -07:00

1 2 3 4 5 ...

11614 Commits