mirror of
https://github.com/torvalds/linux.git
synced 2026-06-05 13:06:59 +02:00
master
379 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
01f492e181 |
Arm:
- Add support for tracing in the standalone EL2 hypervisor code, which should help both debugging and performance analysis. This uses the new infrastructure for 'remote' trace buffers that can be exposed by non-kernel entities such as firmware, and which came through the tracing tree. - Add support for GICv5 Per Processor Interrupts (PPIs), as the starting point for supporting the new GIC architecture in KVM. - Finally add support for pKVM protected guests, where pages are unmapped from the host as they are faulted into the guest and can be shared back from the guest using pKVM hypercalls. Protected guests are created using a new machine type identifier. As the elusive guestmem has not yet delivered on its promises, anonymous memory is also supported. This is only a first step towards full isolation from the host; for example, the CPU register state and DMA accesses are not yet isolated. Because this does not really yet bring fully what it promises, it is hidden behind CONFIG_ARM_PKVM_GUEST + 'kvm-arm.mode=protected', and also triggers TAINT_USER when a VM is created. Caveat emptor. - Rework the dreaded user_mem_abort() function to make it more maintainable, reducing the amount of state being exposed to the various helpers and rendering a substantial amount of state immutable. - Expand the Stage-2 page table dumper to support NV shadow page tables on a per-VM basis. - Tidy up the pKVM PSCI proxy code to be slightly less hard to follow. - Fix both SPE and TRBE in non-VHE configurations so that they do not generate spurious, out of context table walks that ultimately lead to very bad HW lockups. - A small set of patches fixing the Stage-2 MMU freeing in error cases. - Tighten-up accepted SMC immediate value to be only #0 for host SMCCC calls. - The usual cleanups and other selftest churn. LoongArch: - Use CSR_CRMD_PLV for kvm_arch_vcpu_in_kernel(). - Add DMSINTC irqchip in kernel support. RISC-V: - Fix steal time shared memory alignment checks - Fix vector context allocation leak - Fix array out-of-bounds in pmu_ctr_read() and pmu_fw_ctr_read_hi() - Fix double-free of sdata in kvm_pmu_clear_snapshot_area() - Fix integer overflow in kvm_pmu_validate_counter_mask() - Fix shift-out-of-bounds in make_xfence_request() - Fix lost write protection on huge pages during dirty logging - Split huge pages during fault handling for dirty logging - Skip CSR restore if VCPU is reloaded on the same core - Implement kvm_arch_has_default_irqchip() for KVM selftests - Factored-out ISA checks into separate sources - Added hideleg to struct kvm_vcpu_config - Factored-out VCPU config into separate sources - Support configuration of per-VM HGATP mode from KVM user space s390: - Support for ESA (31-bit) guests inside nested hypervisors. - Remove restriction on memslot alignment, which is not needed anymore with the new gmap code. - Fix LPSW/E to update the bear (which of course is the breaking event address register). x86: - Shut up various UBSAN warnings on reading module parameter before they were initialized. - Don't zero-allocate page tables that are used for splitting hugepages in the TDP MMU, as KVM is guaranteed to set all SPTEs in the page table and thus write all bytes. - As an optimization, bail early when trying to unsync 4KiB mappings if the target gfn can just be mapped with a 2MiB hugepage. x86 generic: - Copy single-chunk MMIO write values into struct kvm_vcpu (more precisely struct kvm_mmio_fragment) to fix use-after-free stack bugs where KVM would dereference stack pointer after an exit to userspace. - Clean up and comment the emulated MMIO code to try to make it easier to maintain (not necessarily "easy", but "easier"). - Move VMXON+VMXOFF and EFER.SVME toggling out of KVM (not *all* of VMX and SVM enabling) as it is needed for trusted I/O. - Advertise support for AVX512 Bit Matrix Multiply (BMM) instructions - Immediately fail the build if a required #define is missing in one of KVM's headers that is included multiple times. - Reject SET_GUEST_DEBUG with -EBUSY if there's an already injected exception, mostly to prevent syzkaller from abusing the uAPI to trigger WARNs, but also because it can help prevent userspace from unintentionally crashing the VM. - Exempt SMM from CPUID faulting on Intel, as per the spec. - Misc hardening and cleanup changes. x86 (AMD): - Fix and optimize IRQ window inhibit handling for AVIC; make it per-vCPU so that KVM doesn't prematurely re-enable AVIC if multiple vCPUs have to-be-injected IRQs. - Clean up and optimize the OSVW handling, avoiding a bug in which KVM would overwrite state when enabling virtualization on multiple CPUs in parallel. This should not be a problem because OSVW should usually be the same for all CPUs. - Drop a WARN in KVM_MEMORY_ENCRYPT_REG_REGION where KVM complains about a "too large" size based purely on user input. - Clean up and harden the pinning code for KVM_MEMORY_ENCRYPT_REG_REGION. - Disallow synchronizing a VMSA of an already-launched/encrypted vCPU, as doing so for an SNP guest will crash the host due to an RMP violation page fault. - Overhaul KVM's APIs for detecting SEV+ guests so that VM-scoped queries are required to hold kvm->lock, and enforce it by lockdep. Fix various bugs where sev_guest() was not ensured to be stable for the whole duration of a function or ioctl. - Convert a pile of kvm->lock SEV code to guard(). - Play nicer with userspace that does not enable KVM_CAP_EXCEPTION_PAYLOAD, for which KVM needs to set CR2 and DR6 as a response to ioctls such as KVM_GET_VCPU_EVENTS (even if the payload would end up in EXITINFO2 rather than CR2, for example). Only set CR2 and DR6 when consumption of the payload is imminent, but on the other hand force delivery of the payload in all paths where userspace retrieves CR2 or DR6. - Use vcpu->arch.cr2 when updating vmcb12's CR2 on nested #VMEXIT instead of vmcb02->save.cr2. The value is out of sync after a save/restore or after a #PF is injected into L2. - Fix a class of nSVM bugs where some fields written by the CPU are not synchronized from vmcb02 to cached vmcb12 after VMRUN, and so are not up-to-date when saved by KVM_GET_NESTED_STATE. - Fix a class of bugs where the ordering between KVM_SET_NESTED_STATE and KVM_SET_{S}REGS could cause vmcb02 to be incorrectly initialized after save+restore. - Add a variety of missing nSVM consistency checks. - Fix several bugs where KVM failed to correctly update VMCB fields on nested #VMEXIT. - Fix several bugs where KVM failed to correctly synthesize #UD or #GP for SVM-related instructions. - Add support for save+restore of virtualized LBRs (on SVM). - Refactor various helpers and macros to improve clarity and (hopefully) make the code easier to maintain. - Aggressively sanitize fields when copying from vmcb12, to guard against unintentionally allowing L1 to utilize yet-to-be-defined features. - Fix several bugs where KVM botched rAX legality checks when emulating SVM instructions. There are remaining issues in that KVM doesn't handle size prefix overrides for 64-bit guests. - Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails instead of somewhat arbitrarily synthesizing #GP (i.e. don't double down on AMD's architectural but sketchy behavior of generating #GP for "unsupported" addresses). - Cache all used vmcb12 fields to further harden against TOCTOU bugs. x86 (Intel): - Drop obsolete branch hint prefixes from the VMX instruction macros. - Use ASM_INPUT_RM() in __vmcs_writel() to coerce clang into using a register input when appropriate. - Code cleanups. guest_memfd: - Don't mark guest_memfd folios as accessed, as guest_memfd doesn't support reclaim, the memory is unevictable, and there is no storage to write back to. LoongArch selftests: - Add KVM PMU test cases s390 selftests: - Enable more memory selftests. x86 selftests: - Add support for Hygon CPUs in KVM selftests. - Fix a bug in the MSR test where it would get false failures on AMD/Hygon CPUs with exactly one of RDPID or RDTSCP. - Add an MADV_COLLAPSE testcase for guest_memfd as a regression test for a bug where the kernel would attempt to collapse guest_memfd folios against KVM's will. -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmnftRQUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroPAzwf+NKO4Ktv+7A22ImN0SBl0nlUuulsz vTcw3+hxdRoIw83GdNS+hG5js0wrpMDnbv3t4+VliDNBSSxrBzcSWX2wpilW0Xtw qGo1MWhs2lKPy1NlaRVOwPS6j7uF3AR0TQ1iQLGMedQuCU9WpiKJxyhNXJdbLrt3 8EgFzsvtEsv+jKNRUNDf9+d0j4gZsFyIe+Brhianbw+u3/UCiUClLCdsKPc4+5ZX 08otYXytacGNIf/5Ev1vT4pHkHL0yqKXAtX7LEtaS3+0KrPuLjV4slemivzE9vf5 Evafm5AhA4wpaNMb1ZerhY3T94lsMaJpWxotjR//0Q7C9B59pCQnXCm8mg== =CcE0 -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm updates from Paolo Bonzini: "Arm: - Add support for tracing in the standalone EL2 hypervisor code, which should help both debugging and performance analysis. This uses the new infrastructure for 'remote' trace buffers that can be exposed by non-kernel entities such as firmware, and which came through the tracing tree - Add support for GICv5 Per Processor Interrupts (PPIs), as the starting point for supporting the new GIC architecture in KVM - Finally add support for pKVM protected guests, where pages are unmapped from the host as they are faulted into the guest and can be shared back from the guest using pKVM hypercalls. Protected guests are created using a new machine type identifier. As the elusive guestmem has not yet delivered on its promises, anonymous memory is also supported This is only a first step towards full isolation from the host; for example, the CPU register state and DMA accesses are not yet isolated. Because this does not really yet bring fully what it promises, it is hidden behind CONFIG_ARM_PKVM_GUEST + 'kvm-arm.mode=protected', and also triggers TAINT_USER when a VM is created. Caveat emptor - Rework the dreaded user_mem_abort() function to make it more maintainable, reducing the amount of state being exposed to the various helpers and rendering a substantial amount of state immutable - Expand the Stage-2 page table dumper to support NV shadow page tables on a per-VM basis - Tidy up the pKVM PSCI proxy code to be slightly less hard to follow - Fix both SPE and TRBE in non-VHE configurations so that they do not generate spurious, out of context table walks that ultimately lead to very bad HW lockups - A small set of patches fixing the Stage-2 MMU freeing in error cases - Tighten-up accepted SMC immediate value to be only #0 for host SMCCC calls - The usual cleanups and other selftest churn LoongArch: - Use CSR_CRMD_PLV for kvm_arch_vcpu_in_kernel() - Add DMSINTC irqchip in kernel support RISC-V: - Fix steal time shared memory alignment checks - Fix vector context allocation leak - Fix array out-of-bounds in pmu_ctr_read() and pmu_fw_ctr_read_hi() - Fix double-free of sdata in kvm_pmu_clear_snapshot_area() - Fix integer overflow in kvm_pmu_validate_counter_mask() - Fix shift-out-of-bounds in make_xfence_request() - Fix lost write protection on huge pages during dirty logging - Split huge pages during fault handling for dirty logging - Skip CSR restore if VCPU is reloaded on the same core - Implement kvm_arch_has_default_irqchip() for KVM selftests - Factored-out ISA checks into separate sources - Added hideleg to struct kvm_vcpu_config - Factored-out VCPU config into separate sources - Support configuration of per-VM HGATP mode from KVM user space s390: - Support for ESA (31-bit) guests inside nested hypervisors - Remove restriction on memslot alignment, which is not needed anymore with the new gmap code - Fix LPSW/E to update the bear (which of course is the breaking event address register) x86: - Shut up various UBSAN warnings on reading module parameter before they were initialized - Don't zero-allocate page tables that are used for splitting hugepages in the TDP MMU, as KVM is guaranteed to set all SPTEs in the page table and thus write all bytes - As an optimization, bail early when trying to unsync 4KiB mappings if the target gfn can just be mapped with a 2MiB hugepage x86 generic: - Copy single-chunk MMIO write values into struct kvm_vcpu (more precisely struct kvm_mmio_fragment) to fix use-after-free stack bugs where KVM would dereference stack pointer after an exit to userspace - Clean up and comment the emulated MMIO code to try to make it easier to maintain (not necessarily "easy", but "easier") - Move VMXON+VMXOFF and EFER.SVME toggling out of KVM (not *all* of VMX and SVM enabling) as it is needed for trusted I/O - Advertise support for AVX512 Bit Matrix Multiply (BMM) instructions - Immediately fail the build if a required #define is missing in one of KVM's headers that is included multiple times - Reject SET_GUEST_DEBUG with -EBUSY if there's an already injected exception, mostly to prevent syzkaller from abusing the uAPI to trigger WARNs, but also because it can help prevent userspace from unintentionally crashing the VM - Exempt SMM from CPUID faulting on Intel, as per the spec - Misc hardening and cleanup changes x86 (AMD): - Fix and optimize IRQ window inhibit handling for AVIC; make it per-vCPU so that KVM doesn't prematurely re-enable AVIC if multiple vCPUs have to-be-injected IRQs - Clean up and optimize the OSVW handling, avoiding a bug in which KVM would overwrite state when enabling virtualization on multiple CPUs in parallel. This should not be a problem because OSVW should usually be the same for all CPUs - Drop a WARN in KVM_MEMORY_ENCRYPT_REG_REGION where KVM complains about a "too large" size based purely on user input - Clean up and harden the pinning code for KVM_MEMORY_ENCRYPT_REG_REGION - Disallow synchronizing a VMSA of an already-launched/encrypted vCPU, as doing so for an SNP guest will crash the host due to an RMP violation page fault - Overhaul KVM's APIs for detecting SEV+ guests so that VM-scoped queries are required to hold kvm->lock, and enforce it by lockdep. Fix various bugs where sev_guest() was not ensured to be stable for the whole duration of a function or ioctl - Convert a pile of kvm->lock SEV code to guard() - Play nicer with userspace that does not enable KVM_CAP_EXCEPTION_PAYLOAD, for which KVM needs to set CR2 and DR6 as a response to ioctls such as KVM_GET_VCPU_EVENTS (even if the payload would end up in EXITINFO2 rather than CR2, for example). Only set CR2 and DR6 when consumption of the payload is imminent, but on the other hand force delivery of the payload in all paths where userspace retrieves CR2 or DR6 - Use vcpu->arch.cr2 when updating vmcb12's CR2 on nested #VMEXIT instead of vmcb02->save.cr2. The value is out of sync after a save/restore or after a #PF is injected into L2 - Fix a class of nSVM bugs where some fields written by the CPU are not synchronized from vmcb02 to cached vmcb12 after VMRUN, and so are not up-to-date when saved by KVM_GET_NESTED_STATE - Fix a class of bugs where the ordering between KVM_SET_NESTED_STATE and KVM_SET_{S}REGS could cause vmcb02 to be incorrectly initialized after save+restore - Add a variety of missing nSVM consistency checks - Fix several bugs where KVM failed to correctly update VMCB fields on nested #VMEXIT - Fix several bugs where KVM failed to correctly synthesize #UD or #GP for SVM-related instructions - Add support for save+restore of virtualized LBRs (on SVM) - Refactor various helpers and macros to improve clarity and (hopefully) make the code easier to maintain - Aggressively sanitize fields when copying from vmcb12, to guard against unintentionally allowing L1 to utilize yet-to-be-defined features - Fix several bugs where KVM botched rAX legality checks when emulating SVM instructions. There are remaining issues in that KVM doesn't handle size prefix overrides for 64-bit guests - Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails instead of somewhat arbitrarily synthesizing #GP (i.e. don't double down on AMD's architectural but sketchy behavior of generating #GP for "unsupported" addresses) - Cache all used vmcb12 fields to further harden against TOCTOU bugs x86 (Intel): - Drop obsolete branch hint prefixes from the VMX instruction macros - Use ASM_INPUT_RM() in __vmcs_writel() to coerce clang into using a register input when appropriate - Code cleanups guest_memfd: - Don't mark guest_memfd folios as accessed, as guest_memfd doesn't support reclaim, the memory is unevictable, and there is no storage to write back to LoongArch selftests: - Add KVM PMU test cases s390 selftests: - Enable more memory selftests x86 selftests: - Add support for Hygon CPUs in KVM selftests - Fix a bug in the MSR test where it would get false failures on AMD/Hygon CPUs with exactly one of RDPID or RDTSCP - Add an MADV_COLLAPSE testcase for guest_memfd as a regression test for a bug where the kernel would attempt to collapse guest_memfd folios against KVM's will" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (373 commits) KVM: x86: use inlines instead of macros for is_sev_*guest x86/virt: Treat SVM as unsupported when running as an SEV+ guest KVM: SEV: Goto an existing error label if charging misc_cg for an ASID fails KVM: SVM: Move lock-protected allocation of SEV ASID into a separate helper KVM: SEV: use mutex guard in snp_handle_guest_req() KVM: SEV: use mutex guard in sev_mem_enc_unregister_region() KVM: SEV: use mutex guard in sev_mem_enc_ioctl() KVM: SEV: use mutex guard in snp_launch_update() KVM: SEV: Assert that kvm->lock is held when querying SEV+ support KVM: SEV: Document that checking for SEV+ guests when reclaiming memory is "safe" KVM: SEV: Hide "struct kvm_sev_info" behind CONFIG_KVM_AMD_SEV=y KVM: SEV: WARN on unhandled VM type when initializing VM KVM: LoongArch: selftests: Add PMU overflow interrupt test KVM: LoongArch: selftests: Add basic PMU event counting test KVM: LoongArch: selftests: Add cpucfg read/write helpers LoongArch: KVM: Add DMSINTC inject msi to vCPU LoongArch: KVM: Add DMSINTC device support LoongArch: KVM: Make vcpu_is_preempted() as a macro rather than function LoongArch: KVM: Move host CSR_GSTAT save and restore in context switch LoongArch: KVM: Move host CSR_EENTRY save and restore in context switch ... |
||
|
|
172100088f |
x86/cpufeatures: Add AMD CPPC Performance Priority feature.
Some future AMD processors have feature named "CPPC Performance Priority" which lets userspace specify different floor performance levels for different CPUs. The platform firmware takes these different floor performance levels into consideration while throttling the CPUs under power/thermal constraints. The presence of this feature is indicated by bit 16 of the EDX register for CPUID leaf 0x80000007. More details can be found in AMD Publication titled "AMD64 Collaborative Processor Performance Control (CPPC) Performance Priority" Revision 1.10. Define a new feature bit named X86_FEATURE_CPPC_PERF_PRIO to map to CPUID 0x80000007.EDX[16]. Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Gautham R. Shenoy <gautham.shenoy@amd.com> Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> |
||
|
|
de0bfdc713 |
KVM: x86: Advertise AVX512 Bit Matrix Multiply (BMM) to userspace
Advertise AVX512 Bit Matrix Multiply (BMM) and Bit Reversal instructions to userspace via CPUID leaf 0x80000021_EAX[23]. This feature enables bit matrix multiply operations and bit reversal. Like most AVX instructions, there are no intercept controls for individual instructions, and no extra work is needed in KVM to enable correct execution of the instructions in the guest. The instructions and CPUID feature are first described in: AMD64 Bit Matrix Multiply and Bit Reversal Instructions Publication #69192 Revision: 1.00 Issue Date: January 2026 While at it, reorder PREFETCHI in KVM's initialization sequence to match the CPUID bit position order for better organization. Signed-off-by: Nikunj A Dadhania <nikunj@amd.com> Link: https://patch.msgid.link/20260210053511.1612505-1-nikunj@amd.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
|
|
cb5573868e |
Loongarch:
- Add more CPUCFG mask bits. - Improve feature detection. - Add lazy load support for FPU and binary translation (LBT) register state. - Fix return value for memory reads from and writes to in-kernel devices. - Add support for detecting preemption from within a guest. - Add KVM steal time test case to tools/selftests. ARM: - Add support for FEAT_IDST, allowing ID registers that are not implemented to be reported as a normal trap rather than as an UNDEF exception. - Add sanitisation of the VTCR_EL2 register, fixing a number of UXN/PXN/XN bugs in the process. - Full handling of RESx bits, instead of only RES0, and resulting in SCTLR_EL2 being added to the list of sanitised registers. - More pKVM fixes for features that are not supposed to be exposed to guests. - Make sure that MTE being disabled on the pKVM host doesn't give it the ability to attack the hypervisor. - Allow pKVM's host stage-2 mappings to use the Force Write Back version of the memory attributes by using the "pass-through' encoding. - Fix trapping of ICC_DIR_EL1 on GICv5 hosts emulating GICv3 for the guest. - Preliminary work for guest GICv5 support. - A bunch of debugfs fixes, removing pointless custom iterators stored in guest data structures. - A small set of FPSIMD cleanups. - Selftest fixes addressing the incorrect alignment of page allocation. - Other assorted low-impact fixes and spelling fixes. RISC-V: - Fixes for issues discoverd by KVM API fuzzing in kvm_riscv_aia_imsic_has_attr(), kvm_riscv_aia_imsic_rw_attr(), and kvm_riscv_vcpu_aia_imsic_update() - Allow Zalasr, Zilsd and Zclsd extensions for Guest/VM - Transparent huge page support for hypervisor page tables - Adjust the number of available guest irq files based on MMIO register sizes found in the device tree or the ACPI tables - Add RISC-V specific paging modes to KVM selftests - Detect paging mode at runtime for selftests s390: - Performance improvement for vSIE (aka nested virtualization) - Completely new memory management. s390 was a special snowflake that enlisted help from the architecture's page table management to build hypervisor page tables, in particular enabling sharing the last level of page tables. This however was a lot of code (~3K lines) in order to support KVM, and also blocked several features. The biggest advantages is that the page size of userspace is completely independent of the page size used by the guest: userspace can mix normal pages, THPs and hugetlbfs as it sees fit, and in fact transparent hugepages were not possible before. It's also now possible to have nested guests and guests with huge pages running on the same host. - Maintainership change for s390 vfio-pci - Small quality of life improvement for protected guests x86: - Add support for giving the guest full ownership of PMU hardware (contexted switched around the fastpath run loop) and allowing direct access to data MSRs and PMCs (restricted by the vPMU model). KVM still intercepts access to control registers, e.g. to enforce event filtering and to prevent the guest from profiling sensitive host state. This is more accurate, since it has no risk of contention and thus dropped events, and also has significantly less overhead. For more information, see the commit message for merge commit |
||
|
|
9e03b7caf4 |
KVM x86 misc changes for 6.20
- Disallow changing the virtual CPU model if L2 is active, for all the same
reasons KVM disallows change the model after the first KVM_RUN.
- Fix a bug where KVM would incorrectly reject host accesses to PV MSRs that
were advertised as supported to userspace when running with
KVM_CAP_ENFORCE_PV_FEATURE_CPUID enabled.
- Fix a bug where KVM would attempt to read protect guest state (CR3) when
configuring an async #PF entry.
- Fail the build if EXPORT_SYMBOL_GPL or EXPORT_SYMBOL is used in KVM (for x86
only) to enforce usage of EXPORT_SYMBOL_FOR_KVM_INTERNAL. Explicitly allow
the few exports that are intended for external usage.
- Ignore -EBUSY when checking nested events after a vCPU exits blocking as
the WARN is user-triggerable, and because exiting to userspace on -EBUSY
does more harm than good in pretty much every situation.
- Throw in the towel and drop the WARN on INIT/SIPI being blocked when vCPU is
in Wait-For-SIPI, as playing whack-a-mole with syzkaller turned out to be an
unwinnable game.
- Add support for new Intel instructions that don't require anything beyond
enumerating feature flags to userspace.
- Grab SRCU when reading PDPTRs in KVM_GET_SREGS2.
- Add WARNs to guard against modifying KVM's CPU caps outside of the intended
setup flow, as nested VMX in particular is sensitive to unexpected changes
in KVM's golden configuration.
- Add a quirk to allow userspace to opt-in to actually suppress EOI broadcasts
when the suppression feature is enabled by the guest (currently limited to
split IRQCHIP, i.e. userspace I/O APIC). Sadly, simply fixing KVM to honor
Suppress EOI Broadcasts isn't an option as some userspaces have come to rely
on KVM's buggy behavior (KVM advertises Supress EOI Broadcast irrespective
of whether or not userspace I/O APIC supports Directed EOIs).
- Minor cleanups.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmmGqtYACgkQOlYIJqCj
N/2mURAAq6xms7qH8IpXy7RJjGP7UWVfV7sJPP9N8FWERVfljYn2FGGPAlBi0+5b
Gbpf3dhEk+JEHPda7Skz3RqnfKqNXszhPRfUxXIW4nlKWs3VCBNtI2XuOc3xGSs+
itq6jwirPJAibi3GhP3GOnzH3VSdlgq5JhkYW3MGO2JeB0+XMzB+OYE/xZbnRjXg
i4qwoe9+pGVHpV+rf0MMhCd/46HaGAegPOKArQUbMXQIK3L+6Kgz3y4zy74cCJkI
nOmevvXztuM8rWrJUl8NvhqNWAak3au6gLg/1CkNcaXp6ekQovZb8BWihQ8JrkOS
AcmUNqK8RcXXGtjohuXgTgigLg/t+z7tpXiwHC/BxAglf3YB/P2hcxN1/q8zG56T
s5Ua8RFiosYorlN/LVeyMpPK4MEZQi8QyL/biKIlyoPg3vIL+g7Llf3XdBYsfb4d
gWGecZTNmEvhwhVbwCqo+2zsO2ATYXKdR+lE8czqqdJ98l+6p652DxA315a6dx7Y
2fkirbs/JJJotjvukWjWDNk5oGFdX6cDxt2tA1SqDaZ9WTLoqXIIT+9EMtnqXPZm
KsQLEa5mrM0mbRuOid+Ce+Y1bK4x4DLFaM1oH9BF0UIewo+dMIC/gRgrJEcBS+Vv
E+XdrCSq2904NX9Gy3OubdorwTloMk+2Sc0HfvsXMytw1LBsUYY=
=ii2B
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-misc-6.20' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.20
- Disallow changing the virtual CPU model if L2 is active, for all the same
reasons KVM disallows change the model after the first KVM_RUN.
- Fix a bug where KVM would incorrectly reject host accesses to PV MSRs that
were advertised as supported to userspace when running with
KVM_CAP_ENFORCE_PV_FEATURE_CPUID enabled.
- Fix a bug where KVM would attempt to read protect guest state (CR3) when
configuring an async #PF entry.
- Fail the build if EXPORT_SYMBOL_GPL or EXPORT_SYMBOL is used in KVM (for x86
only) to enforce usage of EXPORT_SYMBOL_FOR_KVM_INTERNAL. Explicitly allow
the few exports that are intended for external usage.
- Ignore -EBUSY when checking nested events after a vCPU exits blocking as
the WARN is user-triggerable, and because exiting to userspace on -EBUSY
does more harm than good in pretty much every situation.
- Throw in the towel and drop the WARN on INIT/SIPI being blocked when vCPU is
in Wait-For-SIPI, as playing whack-a-mole with syzkaller turned out to be an
unwinnable game.
- Add support for new Intel instructions that don't require anything beyond
enumerating feature flags to userspace.
- Grab SRCU when reading PDPTRs in KVM_GET_SREGS2.
- Add WARNs to guard against modifying KVM's CPU caps outside of the intended
setup flow, as nested VMX in particular is sensitive to unexpected changes
in KVM's golden configuration.
- Add a quirk to allow userspace to opt-in to actually suppress EOI broadcasts
when the suppression feature is enabled by the guest (currently limited to
split IRQCHIP, i.e. userspace I/O APIC). Sadly, simply fixing KVM to honor
Suppress EOI Broadcasts isn't an option as some userspaces have come to rely
on KVM's buggy behavior (KVM advertises Supress EOI Broadcast irrespective
of whether or not userspace I/O APIC supports Directed EOIs).
- Minor cleanups.
|
||
|
|
f24ef0093d |
KVM: x86: Advertise MOVRS CPUID to userspace
Define the feature flag for MOVRS and advertise support to userspace when
the feature is supported by the host.
MOVRS is a new set of instructions introduced in the Intel platform
Diamond Rapids, to provide load instructions that carry a read-shared
hint.
Functionally, MOVRS family is equivalent to existing load instructions,
but its read-shared hint indicates that the source memory location is
likely to become read-shared by multiple processors, i.e., read in the
future by at least one other processor before it is written (assuming it
is ever written in the future). This hint could optimize the behavior of
the caches, especially shared caches, for this data for future reads by
multiple processors. Additionally, MOVRS family also includes a software
prefetch instruction, PREFETCHRST2, that carries the same read-shared
hint. [*]
MOVRS family is enumerated by CPUID single-bit (0x7.0x1.EAX[bit 31]).
Since it's on a densely-populated CPUID leaf and some other bits on
this leaf have kernel usages, define this new feature in cpufeatures.h,
but hide it in /proc/cpuinfo due to lack of current kernel usage.
Advertise MOVRS bit to userspace directly. It's safe, since there's no
new VMX controls or additional host enabling required for guests to use
this feature.
[*]: Intel Architecture Instruction Set Extensions and Future Features
(rev.059).
Tested-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://patch.msgid.link/20251120050720.931449-2-zhao1.liu@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
||
|
|
f49ecf5e11 |
x86/cpufeature: Replace X86_FEATURE_SYSENTER32 with X86_FEATURE_SYSFAST32
In most cases, the use of "fast 32-bit system call" depends either on X86_FEATURE_SEP or X86_FEATURE_SYSENTER32 || X86_FEATURE_SYSCALL32. However, nearly all the logic for both is identical. Define X86_FEATURE_SYSFAST32 which indicates that *either* SYSENTER32 or SYSCALL32 should be used, for either 32- or 64-bit kernels. This defaults to SYSENTER; use SYSCALL if the SYSCALL32 bit is also set. As this removes ALL existing uses of X86_FEATURE_SYSENTER32, which is a kernel-only synthetic feature bit, simply remove it and replace it with X86_FEATURE_SYSFAST32. This leaves an unused alternative for a true 32-bit kernel, but that should really not matter in any way. The clearing of X86_FEATURE_SYSCALL32 can be removed once the patches for automatically clearing disabled features has been merged. Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://patch.msgid.link/20251216212606.1325678-10-hpa@zytor.com |
||
|
|
db5e824964 |
KVM: SVM: Virtualize and advertise support for ERAPS
AMD CPUs with the Enhanced Return Address Predictor Security (ERAPS) feature (available on Zen5+) obviate the need for FILL_RETURN_BUFFER sequences right after VMEXITs. ERAPS adds guest/host tags to entries in the RSB (a.k.a. RAP). This helps with speculation protection across the VM boundary, and it also preserves host and guest entries in the RSB that can improve software performance (which would otherwise be flushed due to the FILL_RETURN_BUFFER sequences). Importantly, ERAPS also improves cross-domain security by clearing the RAP in certain situations. Specifically, the RAP is cleared in response to actions that are typically tied to software context switching between tasks. Per the APM: The ERAPS feature eliminates the need to execute CALL instructions to clear the return address predictor in most cases. On processors that support ERAPS, return addresses from CALL instructions executed in host mode are not used in guest mode, and vice versa. Additionally, the return address predictor is cleared in all cases when the TLB is implicitly invalidated and in the following cases: • MOV CR3 instruction • INVPCID other than single address invalidation (operation type 0) ERAPS also allows CPUs to extends the size of the RSB/RAP from the older standard (of 32 entries) to a new size, enumerated in CPUID leaf 0x80000021:EBX bits 23:16 (64 entries in Zen5 CPUs). In hardware, ERAPS is always-on, when running in host context, the CPU uses the full RSB/RAP size without any software changes necessary. However, when running in guest context, the CPU utilizes the full size of the RSB/RAP if and only if the new ALLOW_LARGER_RAP flag is set in the VMCB; if the flag is not set, the CPU limits itself to the historical size of 32 entires. Requiring software to opt-in for guest usage of RAPs larger than 32 entries allows hypervisors, i.e. KVM, to emulate the aforementioned conditions in which the RAP is cleared as well as the guest/host split. E.g. if the CPU unconditionally used the full RAP for guests, failure to clear the RAP on transitions between L1 or L2, or on emulated guest TLB flushes, would expose the guest to RAP-based attacks as a guest without support for ERAPS wouldn't know that its FILL_RETURN_BUFFER sequence is insufficient. Address the ~two broad categories of ERAPS emulation, and advertise ERAPS support to userspace, along with the RAP size enumerated in CPUID. 1. Architectural RAP clearing: as above, CPUs with ERAPS clear RAP entries on several conditions, including CR3 updates. To handle scenarios where a relevant operation is handled in common code (emulation of INVPCID and to a lesser extent MOV CR3), piggyback VCPU_EXREG_CR3 and create an alias, VCPU_EXREG_ERAPS. SVM doesn't utilize CR3 dirty tracking, and so for all intents and purposes VCPU_EXREG_CR3 is unused. Aliasing VCPU_EXREG_ERAPS ensures that any flow that writes CR3 will also clear the guest's RAP, and allows common x86 to mark ERAPS vCPUs as needing a RAP clear without having to add a new request (or other mechanism). 2. Nested guests: the ERAPS feature adds host/guest tagging to entries in the RSB, but does not distinguish between the guest ASIDs. To prevent the case of an L2 guest poisoning the RSB to attack the L1 guest, the CPU exposes a new VMCB bit (CLEAR_RAP). The next VMRUN with a VMCB that has this bit set causes the CPU to flush the RSB before entering the guest context. Set the bit in VMCB01 after a nested #VMEXIT to ensure the next time the L1 guest runs, its RSB contents aren't polluted by the L2's contents. Similarly, before entry into a nested guest, set the bit for VMCB02, so that the L1 guest's RSB contents are not leaked/used in the L2 context. Enable ALLOW_LARGER_RAP (and emulate RAP clears) if and only if ERAPS is exposed to the guest. Enabling ALLOW_LARGER_RAP unconditionally wouldn't cause any functional issues, but ignoring userspace's (and L1's) desires would put KVM into a grey area, which is especially undesirable due to the potential security implications. E.g. if a use case wants to have L1 do manual RAP clearing even when ERAPS is present in hardware, enabling ALLOW_LARGER_RAP could result in L1 leaving stale entries in the RAP. ERAPS is documented in AMD APM Vol 2 (Pub 24593), in revisions 3.43 and later. Signed-off-by: Amit Shah <amit.shah@amd.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Amit Shah <amit.shah@amd.com> Link: https://patch.msgid.link/aR913X8EqO6meCqa@google.com |
||
|
|
51d90a15fe |
ARM:
- Support for userspace handling of synchronous external aborts (SEAs),
allowing the VMM to potentially handle the abort in a non-fatal
manner.
- Large rework of the VGIC's list register handling with the goal of
supporting more active/pending IRQs than available list registers in
hardware. In addition, the VGIC now supports EOImode==1 style
deactivations for IRQs which may occur on a separate vCPU than the
one that acked the IRQ.
- Support for FEAT_XNX (user / privileged execute permissions) and
FEAT_HAF (hardware update to the Access Flag) in the software page
table walkers and shadow MMU.
- Allow page table destruction to reschedule, fixing long need_resched
latencies observed when destroying a large VM.
- Minor fixes to KVM and selftests
Loongarch:
- Get VM PMU capability from HW GCFG register.
- Add AVEC basic support.
- Use 64-bit register definition for EIOINTC.
- Add KVM timer test cases for tools/selftests.
RISC/V:
- SBI message passing (MPXY) support for KVM guest
- Give a new, more specific error subcode for the case when in-kernel
AIA virtualization fails to allocate IMSIC VS-file
- Support KVM_DIRTY_LOG_INITIALLY_SET, enabling dirty log gradually
in small chunks
- Fix guest page fault within HLV* instructions
- Flush VS-stage TLB after VCPU migration for Andes cores
s390:
- Always allocate ESCA (Extended System Control Area), instead of
starting with the basic SCA and converting to ESCA with the
addition of the 65th vCPU. The price is increased number of
exits (and worse performance) on z10 and earlier processor;
ESCA was introduced by z114/z196 in 2010.
- VIRT_XFER_TO_GUEST_WORK support
- Operation exception forwarding support
- Cleanups
x86:
- Skip the costly "zap all SPTEs" on an MMIO generation wrap if MMIO SPTE
caching is disabled, as there can't be any relevant SPTEs to zap.
- Relocate a misplaced export.
- Fix an async #PF bug where KVM would clear the completion queue when the
guest transitioned in and out of paging mode, e.g. when handling an SMI and
then returning to paged mode via RSM.
- Leave KVM's user-return notifier registered even when disabling
virtualization, as long as kvm.ko is loaded. On reboot/shutdown, keeping
the notifier registered is ok; the kernel does not use the MSRs and the
callback will run cleanly and restore host MSRs if the CPU manages to
return to userspace before the system goes down.
- Use the checked version of {get,put}_user().
- Fix a long-lurking bug where KVM's lack of catch-up logic for periodic APIC
timers can result in a hard lockup in the host.
- Revert the periodic kvmclock sync logic now that KVM doesn't use a
clocksource that's subject to NTP corrections.
- Clean up KVM's handling of MMIO Stale Data and L1TF, and bury the latter
behind CONFIG_CPU_MITIGATIONS.
- Context switch XCR0, XSS, and PKRU outside of the entry/exit fast path;
the only reason they were handled in the fast path was to paper of a bug
in the core #MC code, and that has long since been fixed.
- Add emulator support for AVX MOV instructions, to play nice with emulated
devices whose guest drivers like to access PCI BARs with large multi-byte
instructions.
x86 (AMD):
- Fix a few missing "VMCB dirty" bugs.
- Fix the worst of KVM's lack of EFER.LMSLE emulation.
- Add AVIC support for addressing 4k vCPUs in x2AVIC mode.
- Fix incorrect handling of selective CR0 writes when checking intercepts
during emulation of L2 instructions.
- Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32] on
VMRUN and #VMEXIT.
- Fix a bug where KVM corrupt the guest code stream when re-injecting a soft
interrupt if the guest patched the underlying code after the VM-Exit, e.g.
when Linux patches code with a temporary INT3.
- Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits to
userspace, and extend KVM "support" to all policy bits that don't require
any actual support from KVM.
x86 (Intel):
- Use the root role from kvm_mmu_page to construct EPTPs instead of the
current vCPU state, partly as worthwhile cleanup, but mostly to pave the
way for tracking per-root TLB flushes, and elide EPT flushes on pCPU
migration if the root is clean from a previous flush.
- Add a few missing nested consistency checks.
- Rip out support for doing "early" consistency checks via hardware as the
functionality hasn't been used in years and is no longer useful in general;
replace it with an off-by-default module param to WARN if hardware fails
a check that KVM does not perform.
- Fix a currently-benign bug where KVM would drop the guest's SPEC_CTRL[63:32]
on VM-Enter.
- Misc cleanups.
- Overhaul the TDX code to address systemic races where KVM (acting on behalf
of userspace) could inadvertantly trigger lock contention in the TDX-Module;
KVM was either working around these in weird, ugly ways, or was simply
oblivious to them (though even Yan's devilish selftests could only break
individual VMs, not the host kernel)
- Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a TDX vCPU,
if creating said vCPU failed partway through.
- Fix a few sparse warnings (bad annotation, 0 != NULL).
- Use struct_size() to simplify copying TDX capabilities to userspace.
- Fix a bug where TDX would effectively corrupt user-return MSR values if the
TDX Module rejects VP.ENTER and thus doesn't clobber host MSRs as expected.
Selftests:
- Fix a math goof in mmu_stress_test when running on a single-CPU system/VM.
- Forcefully override ARCH from x86_64 to x86 to play nice with specifying
ARCH=x86_64 on the command line.
- Extend a bunch of nested VMX to validate nested SVM as well.
- Add support for LA57 in the core VM_MODE_xxx macro, and add a test to
verify KVM can save/restore nested VMX state when L1 is using 5-level
paging, but L2 is not.
- Clean up the guest paging code in anticipation of sharing the core logic for
nested EPT and nested NPT.
guest_memfd:
- Add NUMA mempolicy support for guest_memfd, and clean up a variety of
rough edges in guest_memfd along the way.
- Define a CLASS to automatically handle get+put when grabbing a guest_memfd
from a memslot to make it harder to leak references.
- Enhance KVM selftests to make it easer to develop and debug selftests like
those added for guest_memfd NUMA support, e.g. where test and/or KVM bugs
often result in hard-to-debug SIGBUS errors.
- Misc cleanups.
Generic:
- Use the recently-added WQ_PERCPU when creating the per-CPU workqueue for
irqfd cleanup.
- Fix a goof in the dirty ring documentation.
- Fix choice of target for directed yield across different calls to
kvm_vcpu_on_spin(); the function was always starting from the first
vCPU instead of continuing the round-robin search.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCgAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmkvMa8UHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroMlFwf+Ow7zOYUuELSQ+Jn+hOYXiCNrdBDx
ZamvMU8kLPr7XX0Zog6HgcMm//qyA6k5nSfqCjfsQZrIhRA/gWJ61jz1OX/Jxq18
pJ9Vz6epnEPYiOtBwz+v8OS8MqDqVNzj2i6W1/cLPQE50c1Hhw64HWS5CSxDQiHW
A7PVfl5YU12lW1vG3uE0sNESDt4Eh/spNM17iddXdF4ZUOGublserjDGjbc17E7H
8BX3DkC2plqkJKwtjg0ae62hREkITZZc7RqsnftUkEhn0N0H9+rb6NKUyzIVh9NZ
bCtCjtrKN9zfZ0Mujnms3ugBOVqNIputu/DtPnnFKXtXWSrHrgGSNv5ewA==
=PEcw
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"ARM:
- Support for userspace handling of synchronous external aborts
(SEAs), allowing the VMM to potentially handle the abort in a
non-fatal manner
- Large rework of the VGIC's list register handling with the goal of
supporting more active/pending IRQs than available list registers
in hardware. In addition, the VGIC now supports EOImode==1 style
deactivations for IRQs which may occur on a separate vCPU than the
one that acked the IRQ
- Support for FEAT_XNX (user / privileged execute permissions) and
FEAT_HAF (hardware update to the Access Flag) in the software page
table walkers and shadow MMU
- Allow page table destruction to reschedule, fixing long
need_resched latencies observed when destroying a large VM
- Minor fixes to KVM and selftests
Loongarch:
- Get VM PMU capability from HW GCFG register
- Add AVEC basic support
- Use 64-bit register definition for EIOINTC
- Add KVM timer test cases for tools/selftests
RISC/V:
- SBI message passing (MPXY) support for KVM guest
- Give a new, more specific error subcode for the case when in-kernel
AIA virtualization fails to allocate IMSIC VS-file
- Support KVM_DIRTY_LOG_INITIALLY_SET, enabling dirty log gradually
in small chunks
- Fix guest page fault within HLV* instructions
- Flush VS-stage TLB after VCPU migration for Andes cores
s390:
- Always allocate ESCA (Extended System Control Area), instead of
starting with the basic SCA and converting to ESCA with the
addition of the 65th vCPU. The price is increased number of exits
(and worse performance) on z10 and earlier processor; ESCA was
introduced by z114/z196 in 2010
- VIRT_XFER_TO_GUEST_WORK support
- Operation exception forwarding support
- Cleanups
x86:
- Skip the costly "zap all SPTEs" on an MMIO generation wrap if MMIO
SPTE caching is disabled, as there can't be any relevant SPTEs to
zap
- Relocate a misplaced export
- Fix an async #PF bug where KVM would clear the completion queue
when the guest transitioned in and out of paging mode, e.g. when
handling an SMI and then returning to paged mode via RSM
- Leave KVM's user-return notifier registered even when disabling
virtualization, as long as kvm.ko is loaded. On reboot/shutdown,
keeping the notifier registered is ok; the kernel does not use the
MSRs and the callback will run cleanly and restore host MSRs if the
CPU manages to return to userspace before the system goes down
- Use the checked version of {get,put}_user()
- Fix a long-lurking bug where KVM's lack of catch-up logic for
periodic APIC timers can result in a hard lockup in the host
- Revert the periodic kvmclock sync logic now that KVM doesn't use a
clocksource that's subject to NTP corrections
- Clean up KVM's handling of MMIO Stale Data and L1TF, and bury the
latter behind CONFIG_CPU_MITIGATIONS
- Context switch XCR0, XSS, and PKRU outside of the entry/exit fast
path; the only reason they were handled in the fast path was to
paper of a bug in the core #MC code, and that has long since been
fixed
- Add emulator support for AVX MOV instructions, to play nice with
emulated devices whose guest drivers like to access PCI BARs with
large multi-byte instructions
x86 (AMD):
- Fix a few missing "VMCB dirty" bugs
- Fix the worst of KVM's lack of EFER.LMSLE emulation
- Add AVIC support for addressing 4k vCPUs in x2AVIC mode
- Fix incorrect handling of selective CR0 writes when checking
intercepts during emulation of L2 instructions
- Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32]
on VMRUN and #VMEXIT
- Fix a bug where KVM corrupt the guest code stream when re-injecting
a soft interrupt if the guest patched the underlying code after the
VM-Exit, e.g. when Linux patches code with a temporary INT3
- Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits
to userspace, and extend KVM "support" to all policy bits that
don't require any actual support from KVM
x86 (Intel):
- Use the root role from kvm_mmu_page to construct EPTPs instead of
the current vCPU state, partly as worthwhile cleanup, but mostly to
pave the way for tracking per-root TLB flushes, and elide EPT
flushes on pCPU migration if the root is clean from a previous
flush
- Add a few missing nested consistency checks
- Rip out support for doing "early" consistency checks via hardware
as the functionality hasn't been used in years and is no longer
useful in general; replace it with an off-by-default module param
to WARN if hardware fails a check that KVM does not perform
- Fix a currently-benign bug where KVM would drop the guest's
SPEC_CTRL[63:32] on VM-Enter
- Misc cleanups
- Overhaul the TDX code to address systemic races where KVM (acting
on behalf of userspace) could inadvertantly trigger lock contention
in the TDX-Module; KVM was either working around these in weird,
ugly ways, or was simply oblivious to them (though even Yan's
devilish selftests could only break individual VMs, not the host
kernel)
- Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a
TDX vCPU, if creating said vCPU failed partway through
- Fix a few sparse warnings (bad annotation, 0 != NULL)
- Use struct_size() to simplify copying TDX capabilities to userspace
- Fix a bug where TDX would effectively corrupt user-return MSR
values if the TDX Module rejects VP.ENTER and thus doesn't clobber
host MSRs as expected
Selftests:
- Fix a math goof in mmu_stress_test when running on a single-CPU
system/VM
- Forcefully override ARCH from x86_64 to x86 to play nice with
specifying ARCH=x86_64 on the command line
- Extend a bunch of nested VMX to validate nested SVM as well
- Add support for LA57 in the core VM_MODE_xxx macro, and add a test
to verify KVM can save/restore nested VMX state when L1 is using
5-level paging, but L2 is not
- Clean up the guest paging code in anticipation of sharing the core
logic for nested EPT and nested NPT
guest_memfd:
- Add NUMA mempolicy support for guest_memfd, and clean up a variety
of rough edges in guest_memfd along the way
- Define a CLASS to automatically handle get+put when grabbing a
guest_memfd from a memslot to make it harder to leak references
- Enhance KVM selftests to make it easer to develop and debug
selftests like those added for guest_memfd NUMA support, e.g. where
test and/or KVM bugs often result in hard-to-debug SIGBUS errors
- Misc cleanups
Generic:
- Use the recently-added WQ_PERCPU when creating the per-CPU
workqueue for irqfd cleanup
- Fix a goof in the dirty ring documentation
- Fix choice of target for directed yield across different calls to
kvm_vcpu_on_spin(); the function was always starting from the first
vCPU instead of continuing the round-robin search"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (260 commits)
KVM: arm64: at: Update AF on software walk only if VM has FEAT_HAFDBS
KVM: arm64: at: Use correct HA bit in TCR_EL2 when regime is EL2
KVM: arm64: Document KVM_PGTABLE_PROT_{UX,PX}
KVM: arm64: Fix spelling mistake "Unexpeced" -> "Unexpected"
KVM: arm64: Add break to default case in kvm_pgtable_stage2_pte_prot()
KVM: arm64: Add endian casting to kvm_swap_s[12]_desc()
KVM: arm64: Fix compilation when CONFIG_ARM64_USE_LSE_ATOMICS=n
KVM: arm64: selftests: Add test for AT emulation
KVM: arm64: nv: Expose hardware access flag management to NV guests
KVM: arm64: nv: Implement HW access flag management in stage-2 SW PTW
KVM: arm64: Implement HW access flag management in stage-1 SW PTW
KVM: arm64: Propagate PTW errors up to AT emulation
KVM: arm64: Add helper for swapping guest descriptor
KVM: arm64: nv: Use pgtable definitions in stage-2 walk
KVM: arm64: Handle endianness in read helper for emulated PTW
KVM: arm64: nv: Stop passing vCPU through void ptr in S2 PTW
KVM: arm64: Call helper for reading descriptors directly
KVM: arm64: nv: Advertise support for FEAT_XNX
KVM: arm64: Teach ptdump about FEAT_XNX permissions
KVM: s390: Use generic VIRT_XFER_TO_GUEST_WORK functions
...
|
||
|
|
d61f1cc5db |
* Enable Linear Address Space Separation (LASS)
* Change X86_FEATURE leaf 17 from an AMD leaf to Linux-defined -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmkuIXAACgkQaDWVMHDJ krAwoRAAqqavNrthj26XJHjR3x7FVGu11/rvYXAd1U2moN/dhM2w82HMHNFvPuQY 3iq9GDRQdc2rKL7LTkREvN4ZM/rFvkFLt6a5Yv0eCRK8KAiSJEw6Yzu/qgG7kF+0 9clujDUskjjHU0zR5v+o1RxirrLVQ+R50sMVI5uoFx6+WJRiW1BvMG4Csw4BgbvA AqgrZpyq1dQ/GQOW4f0yxBPH0z84wgUbdllYzQzE0GeUlGWQSI4lqa8GFMOmE/Gr 7gBcKmyE0M/BycwTZW7tiMnjWgNL+Y5/RroQJ7hh6R+f5WOd+SpGvlyOihbF7GER L3yZfeQ+EWz1aY1QMWwOSvSawIPJo8EkSn3d9/JFq5Vl9zsFh+ZoPZfZ8bEi36U0 inO93swDcyMkkfOTh4sIgxedLgHja5GFNCGPs0yblvLulWbw7yYVzzEmEjXnclzS fmmifsJjGrUpegEnWdEjAQzXkWPd/hKiAvpzDE/3thBal5NkOzFrudITFvCVuk8w uS2MW0U8VCskNoON0jjwnvv84p0XdHJOsgPB9WnsuMMASKC1RqKAJWXh8AXvZA+I TfCNdSyHDTm+o1e+SMQZRbqoE/r7MmAxUQOkKnlvpJDCz58tsLzW64hRXTe7QpCt rry9/wODswu+oaHoDgfjAmzYde2RhCjwWLzGmqmapNIYfCCVhYs= =5bcW -----END PGP SIGNATURE----- Merge tag 'x86_cpu_for_6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 CPU feature updates from Dave Hansen: "The biggest thing of note here is Linear Address Space Separation (LASS). It represents the first time I can think of that the upper=>kernel/lower=>user address space convention is actually recognized by the hardware on x86. It ensures that userspace can not even get the hardware to _start_ page walks for the kernel address space. This, of course, is a really nice generic side channel defense. This is really only a down payment on LASS support. There are still some details to work out in its interaction with EFI calls and vsyscall emulation. For now, LASS is disabled if either of those features is compiled in (which is almost always the case). There's also one straggler commit in here which converts an under-utilized AMD CPU feature leaf into a generic Linux-defined leaf so more feature can be packed in there. Summary: - Enable Linear Address Space Separation (LASS) - Change X86_FEATURE leaf 17 from an AMD leaf to Linux-defined" * tag 'x86_cpu_for_6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu: Enable LASS during CPU initialization selftests/x86: Update the negative vsyscall tests to expect a #GP x86/traps: Communicate a LASS violation in #GP message x86/kexec: Disable LASS during relocate kernel x86/alternatives: Disable LASS when patching kernel code x86/asm: Introduce inline memcpy and memset x86/cpu: Add an LASS dependency on SMAP x86/cpufeatures: Enumerate the LASS feature bits x86/cpufeatures: Make X86_FEATURE leaf 17 Linux-specific |
||
|
|
54de197c9a |
* Allow security version (SVN) updates so enclaves can attest
to new microcode.
* Fix kernel docs typos
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmkuGsMACgkQaDWVMHDJ
krB/AhAAlRe+7+jNXp4Xtor9Q9VpTG5Q7WL7yLUHOd+JbsVKCcCLy2i86rrXEVvo
AN3nQqkAotd1xx9P1SDz2SL4S0PS8zM3avj8JLppBb16VEiApFqUsXYMUJrFVjne
x/+fEsKVOXp+kfM72qhI3/sqgq/zPUanm/6GR0wlpe2+TPOmnw588GgCV9ADW8ha
ktLWaBGPHgyzNGAO5CqSSnN+2u25vRcyKX6mfUZeeooudPPVbANxKLN1WzBP2zZo
eO5jbyeiPNH0bC1p6H5TQOl/bfktfhl99S+ta3e7+/PduDvDR01Z3kMl5JSb74hk
75fXUI1R+/zUyXbyJ+exxTqAkS0F66m1Tv+DU9U4WorKjgAwYFPcdvjkJh/Ju6XB
cqvwvLUE5rH2f3hzUmffSP8FwQO65svLsB4BIWd+3sgu6ZbAV4zSVOSb51W39Q7U
WlAMYtlM6+bZF6RaJCweDUMq407KcHFqjN6rzX7x+VguNe/FMERRsGBz6kBT2FZH
oxMzUXTwRfwSxblvVJMjjecfplywH5viHATbApMIGdamHWis8Dmvl0DMmBi4O6CI
/O792TOHxIqLmxkh9SE7ONOCRe1hUfK2JY4n6sgNrBH8OwylPqkl2kgE5X3Cr9c1
1qwMGfto5t0izr/SA5wruEJtPjQPsAyUKjfP+msyBemTg7dzTFU=
=2lh5
-----END PGP SIGNATURE-----
Merge tag 'x86_sgx_for_6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SGX updates from Dave HansenL
"The main content here is adding support for the new EUPDATESVN SGX
ISA. Before this, folks who updated microcode had to reboot before
enclaves could attest to the new microcode. The new functionality lets
them do this without a reboot.
The rest are some nice, but relatively mundane comment and kernel-doc
fixups.
Summary:
- Allow security version (SVN) updates so enclaves can attest to new
microcode
- Fix kernel docs typos"
* tag 'x86_sgx_for_6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sgx: Fix a typo in the kernel-doc comment for enum sgx_attribute
x86/sgx: Remove superfluous asterisk from copyright comment in asm/sgx.h
x86/sgx: Document structs and enums with '@', not '%'
x86/sgx: Add kernel-doc descriptions for params passed to vDSO user handler
x86/sgx: Add a missing colon in kernel-doc markup for "struct sgx_enclave_run"
x86/sgx: Enable automatic SVN updates for SGX enclaves
x86/sgx: Implement ENCLS[EUPDATESVN]
x86/sgx: Define error codes for use by ENCLS[EUPDATESVN]
x86/cpufeatures: Add X86_FEATURE_SGX_EUPDATESVN feature flag
x86/sgx: Introduce functions to count the sgx_(vepc_)open()
|
||
|
|
d748981834 |
- The mandatory pile of cleanups the cat drags in every merge window
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmktqqMACgkQEsHwGGHe
VUpBDg/8CT4uBJLqgQYoh9YYN9U/C1walUMLivHq9VJbld7a55nGCirUTBuiTRzp
V2np/6V/XpG2YmV9Xj6E4Q6QLiOsfrWKFRzfr7OlPZ7yW5GR8aKBcEZm2Voi7j94
qLVSraAjtfmTtU5Ym7Lp1fqswDW2Y+iU6zqIyotqy+/7qZ6yp4NwlrUwSocDsSLo
n6Gh1Vv6fXdeMckzT1WJ/CMtx07IfQB/wqVVTO4WwBu1Pv71WO+Cv1pG2mewUVjK
879X7+oc1icxCtD2OZxHfOEJGtC97N7cZDq6VEd4s/bjgYF75glyFtkv/DVgbQ6E
BSkwciT5Sqb7B5N09RJWWJiNLlNQPFQdDFAuU/Sk18Vqc01jySFoWLLsiS/DeCsz
opEnPK6uXO4m+2DI44dWpWgVj33a/ao6zej3uEPENkp9+FYaZPwip8Bdjn1FptBp
ZGmqa7oy38MXTzV6hOctMAx/nXE05lff41Xe8fLlisNc7a6ZUwql9oA4A4yULtKI
BMlddTZ5/N5LzOGVH5nixX+ig3wDqtEEOD1tdLb/f8Nhy71Z3Kmbf7zDuJCTaabn
VAGaPUDZ4qfeMks1qRIVVR7rUo6PwRRNvi8Wyiauc7/fnxWyVR0z1vrqs2o3MtZJ
WQ0/HAZZ/K54HWk049ea2b1kFjjOe6R5hKKX1Hc2nPIS3APD/Vc=
=3uJO
-----END PGP SIGNATURE-----
Merge tag 'x86_cleanups_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Borislav Petkov:
- The mandatory pile of cleanups the cat drags in every merge window
* tag 'x86_cleanups_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot: Clean up whitespace in a20.c
x86/mm: Delete disabled debug code
x86/{boot,mtrr}: Remove unused function declarations
x86/percpu: Use BIT_WORD() and BIT_MASK() macros
x86/cpufeatures: Correct LKGS feature flag description
x86/idtentry: Add missing '*' to kernel-doc lines
|
||
|
|
679fcce002 |
KVM SVM changes for 6.19:
- Fix a few missing "VMCB dirty" bugs.
- Fix the worst of KVM's lack of EFER.LMSLE emulation.
- Add AVIC support for addressing 4k vCPUs in x2AVIC mode.
- Fix incorrect handling of selective CR0 writes when checking intercepts
during emulation of L2 instructions.
- Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32] on
VMRUN and #VMEXIT.
- Fix a bug where KVM corrupt the guest code stream when re-injecting a soft
interrupt if the guest patched the underlying code after the VM-Exit, e.g.
when Linux patches code with a temporary INT3.
- Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits to
userspace, and extend KVM "support" to all policy bits that don't require
any actual support from KVM.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmkmVFgACgkQOlYIJqCj
N/3GVA/+ITWLRuY28kGmpDBflp6EyMIDlBe4v+JRFV5Ll/6kq+sYWZTIE8uiDHNP
E6dbrAnU3xxF2fK5pb3Fq0kKslu/+UTnldt3VN0uGzHlTqvw9itKaCdFnOBMND4Z
Fs7SbXUd1ZWz5Z/Uq2niJDd3p1YA69i1At+udXterFVKzl3GSmNMvsXhNPjjF3gL
+purCkHfWLnXhFJYMYgnaWDJUQldR+YfulWwEd2y4qXyfoqOBJhg7DpKuHlewBVT
2ice4k4nBu47rNl+ZRFM9sFX0959OX8MxykO902UB4+qS39jFzTlyM+LjnxKmCfC
dzGTh4lhG/1QQme6TYBQ+OgXMj6H+8KqQ+YNbjxjAEgY8hWDdVK0bZMIq/iS16aQ
VPSf1/GufdvV+dUyyb2DZzf7NhWmKyVGjlN5PnGQQl0x6+LwI5m3EODDLfTHlmlb
0UEZkXdN74ghT2ExepVyVKeDtbQPJNFN/voBnYr8n0P+9Jf28QuoWD5bTloJIxIJ
OwjwJq3HbDduq/RCFbiSERMBPYFxYCkxVlt+TI+ONhNCUNfvxefNfftHIx+6Yk73
IV5g3gWNWkIo4h1yp8zsglwiTStY4qpiR52YlDLN3+btgYcPOAXt/U4nigaomfdR
NoYbuqD1N+u1P4Vlnr8uUZExQVY+JoIPrB3zPnITni+aucSpp+c=
=p1kg
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-svm-6.19' of https://github.com/kvm-x86/linux into HEAD
KVM SVM changes for 6.19:
- Fix a few missing "VMCB dirty" bugs.
- Fix the worst of KVM's lack of EFER.LMSLE emulation.
- Add AVIC support for addressing 4k vCPUs in x2AVIC mode.
- Fix incorrect handling of selective CR0 writes when checking intercepts
during emulation of L2 instructions.
- Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32] on
VMRUN and #VMEXIT.
- Fix a bug where KVM corrupt the guest code stream when re-injecting a soft
interrupt if the guest patched the underlying code after the VM-Exit, e.g.
when Linux patches code with a temporary INT3.
- Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits to
userspace, and extend KVM "support" to all policy bits that don't require
any actual support from KVM.
|
||
|
|
3767def18f |
x86/cpufeatures: Add support for L3 Smart Data Cache Injection Allocation Enforcement
Smart Data Cache Injection (SDCI) is a mechanism that enables direct insertion
of data from I/O devices into the L3 cache. By directly caching data from I/O
devices rather than first storing the I/O data in DRAM, SDCI reduces demands on
DRAM bandwidth and reduces latency to the processor consuming the I/O data.
The SDCIAE (SDCI Allocation Enforcement) PQE feature allows system software to
control the portion of the L3 cache used for SDCI.
When enabled, SDCIAE forces all SDCI lines to be placed into the L3 cache
partitions identified by the highest-supported L3_MASK_n register, where n is
the maximum supported CLOSID.
Add CPUID feature bit that can be used to configure SDCIAE.
The SDCIAE feature details are documented in:
AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.4.7 L3 Smart Data Cache
Injection Allocation Enforcement (SDCIAE).
available at https://bugzilla.kernel.org/show_bug.cgi?id=206537
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/83ca10d981c48e86df2c3ad9658bb3ba3544c763.1762995456.git.babu.moger@amd.com
|
||
|
|
f6106d41ec |
x86/bugs: Use an x86 feature to track the MMIO Stale Data mitigation
Convert the MMIO Stale Data mitigation tracking from a static branch into an x86 feature flag so that it can be used via ALTERNATIVE_2 in KVM. No functional change intended. Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Reviewed-by: Brendan Jackman <jackmanb@google.com> Link: https://patch.msgid.link/20251113233746.1703361-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
|
|
7baadd463e |
x86/cpufeatures: Enumerate the LASS feature bits
Linear Address Space Separation (LASS) is a security feature that mitigates a class of side-channel attacks relying on speculative access across the user/kernel boundary. Privilege mode based access protection already exists today with paging and features such as SMEP and SMAP. However, to enforce these protections, the processor must traverse the paging structures in memory. An attacker can use timing information resulting from this traversal to determine details about the paging structures, and to determine the layout of the kernel memory. LASS provides the same mode-based protections as paging but without traversing the paging structures. Because the protections are enforced prior to page-walks, an attacker will not be able to derive paging-based timing information from the various caching structures such as the TLBs, mid-level caches, page walker, data caches, etc. LASS enforcement relies on the kernel implementation to divide the 64-bit virtual address space into two halves: Addr[63]=0 -> User address space Addr[63]=1 -> Kernel address space Any data access or code execution across address spaces typically results in a #GP fault, with an #SS generated in some rare cases. The LASS enforcement for kernel data accesses is dependent on CR4.SMAP being set. The enforcement can be disabled by toggling the RFLAGS.AC bit similar to SMAP. Define the CPU feature bits to enumerate LASS. Also, disable the feature at compile time on 32-bit kernels. Use a direct dependency on X86_32 (instead of !X86_64) to make it easier to combine with similar 32-bit specific dependencies in the future. LASS mitigates a class of side-channel speculative attacks, such as Spectre LAM, described in the paper, "Leaky Address Masking: Exploiting Unmasked Spectre Gadgets with Noncanonical Address Translation". Add the "lass" flag to /proc/cpuinfo to indicate that the feature is supported by hardware and enabled by the kernel. This allows userspace to determine if the system is secure against such attacks. Signed-off-by: Sohil Mehta <sohil.mehta@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Xin Li (Intel) <xin@zytor.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://patch.msgid.link/20251118182911.2983253-2-sohil.mehta%40intel.com |
||
|
|
47955b58cf |
x86/cpufeatures: Correct LKGS feature flag description
Quotation marks in cpufeatures.h comments are special and when the comment begins with a quoted string, that string lands in /proc/cpuinfo, turning it into a user-visible one. The LKGS comment doesn't begin with a quoted string but just in case drop the quoted "kernel" in there to avoid confusion. And while at it, simply change the description into what the LKGS instruction does for more clarity. No functional changes. Reviewed-by: Xin Li (Intel) <xin@zytor.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20251015103548.10194-1-bp@kernel.org |
||
|
|
5d0316e25d |
x86/cpufeatures: Add X86_FEATURE_X2AVIC_EXT
Add CPUID feature bit for x2AVIC extension that enables AMD SVM to support up to 4096 vCPUs in x2AVIC mode. The primary change is in the size of the AVIC Physical ID table, which can now go up to 8 contiguous 4k pages. The number of pages allocated is controlled by the maximum APIC ID for a guest, and that controls the number of pages to allocate for the AVIC Physical ID table. AVIC hardware is enhanced to look up Physical ID table entries for vCPUs > 512 for locating the target APIC backing page and the host APIC ID of the physical core on which the guest vCPU is running. Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/e5c9c471ab99a130bf9b728b77050ab308cf8624.1757009416.git.naveen@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
|
|
6ffdb49101 |
x86/cpufeatures: Add X86_FEATURE_SGX_EUPDATESVN feature flag
Add a flag indicating whenever ENCLS[EUPDATESVN] SGX instruction is supported. This will be used by SGX driver to perform CPU SVN updates. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Nataliia Bondarevska <bondarn@google.com> |
||
|
|
4793f990ea |
KVM: x86: Advertise EferLmsleUnsupported to userspace
CPUID.80000008H:EBX.EferLmsleUnsupported[bit 20] is a defeature
bit. When this bit is clear, EFER.LMSLE is supported. When this bit is
set, EFER.LMLSE is unsupported. KVM has never _emulated_ EFER.LMSLE, so
KVM cannot truly support a 0-setting of this bit.
However, KVM has allowed the guest to enable EFER.LMSLE in hardware
since commit
|
||
|
|
ddde4abaa0 |
x86/cpufeatures: Make X86_FEATURE leaf 17 Linux-specific
That cpuinfo_x86.x86_capability[] element was supposed to mirror CPUID flags from CPUID_0x80000007_EBX but that leaf has still to this day only three bits defined in it. So move those bits to scattered.c and free the capability element for synthetic flags. No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> |
||
|
|
256e341706 |
Generic:
* Rework almost all of KVM's exports to expose symbols only to KVM's x86
vendor modules (kvm-{amd,intel}.ko and PPC's kvm-{pr,hv}.ko.
x86:
* Rework almost all of KVM x86's exports to expose symbols only to KVM's
vendor modules, i.e. to kvm-{amd,intel}.ko.
* Add support for virtualizing Control-flow Enforcement Technology (CET) on
Intel (Shadow Stacks and Indirect Branch Tracking) and AMD (Shadow Stacks).
It's worth noting that while SHSTK and IBT can be enabled separately in CPUID,
it is not really possible to virtualize them separately. Therefore, Intel
processors will really allow both SHSTK and IBT under the hood if either is
made visible in the guest's CPUID. The alternative would be to intercept
XSAVES/XRSTORS, which is not feasible for performance reasons.
* Fix a variety of fuzzing WARNs all caused by checking L1 intercepts when
completing userspace I/O. KVM has already committed to allowing L2 to
to perform I/O at that point.
* Emulate PERF_CNTR_GLOBAL_STATUS_SET for PerfMonV2 guests, as the MSR is
supposed to exist for v2 PMUs.
* Allow Centaur CPU leaves (base 0xC000_0000) for Zhaoxin CPUs.
* Add support for the immediate forms of RDMSR and WRMSRNS, sans full
emulator support (KVM should never need to emulate the MSRs outside of
forced emulation and other contrived testing scenarios).
* Clean up the MSR APIs in preparation for CET and FRED virtualization, as
well as mediated vPMU support.
* Clean up a pile of PMU code in anticipation of adding support for mediated
vPMUs.
* Reject in-kernel IOAPIC/PIT for TDX VMs, as KVM can't obtain EOI vmexits
needed to faithfully emulate an I/O APIC for such guests.
* Many cleanups and minor fixes.
* Recover possible NX huge pages within the TDP MMU under read lock to
reduce guest jitter when restoring NX huge pages.
* Return -EAGAIN during prefault if userspace concurrently deletes/moves the
relevant memslot, to fix an issue where prefaulting could deadlock with the
memslot update.
x86 (AMD):
* Enable AVIC by default for Zen4+ if x2AVIC (and other prereqs) is supported.
* Require a minimum GHCB version of 2 when starting SEV-SNP guests via
KVM_SEV_INIT2 so that invalid GHCB versions result in immediate errors
instead of latent guest failures.
* Add support for SEV-SNP's CipherText Hiding, an opt-in feature that prevents
unauthorized CPU accesses from reading the ciphertext of SNP guest private
memory, e.g. to attempt an offline attack. This feature splits the shared
SEV-ES/SEV-SNP ASID space into separate ranges for SEV-ES and SEV-SNP guests,
therefore a new module parameter is needed to control the number of ASIDs
that can be used for VMs with CipherText Hiding vs. how many can be used to
run SEV-ES guests.
* Add support for Secure TSC for SEV-SNP guests, which prevents the untrusted
host from tampering with the guest's TSC frequency, while still allowing the
the VMM to configure the guest's TSC frequency prior to launch.
* Validate the XCR0 provided by the guest (via the GHCB) to avoid bugs
resulting from bogus XCR0 values.
* Save an SEV guest's policy if and only if LAUNCH_START fully succeeds to
avoid leaving behind stale state (thankfully not consumed in KVM).
* Explicitly reject non-positive effective lengths during SNP's LAUNCH_UPDATE
instead of subtly relying on guest_memfd to deal with them.
* Reload the pre-VMRUN TSC_AUX on #VMEXIT for SEV-ES guests, not the host's
desired TSC_AUX, to fix a bug where KVM was keeping a different vCPU's
TSC_AUX in the host MSR until return to userspace.
KVM (Intel):
* Preparation for FRED support.
* Don't retry in TDX's anti-zero-step mitigation if the target memslot is
invalid, i.e. is being deleted or moved, to fix a deadlock scenario similar
to the aforementioned prefaulting case.
* Misc bugfixes and minor cleanups.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmjjx/0UHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroMLFwf9HXZdqBn6VvkbSL/HIGdNG1BEzeJ0
MQVEMMdmWJ72JtI6soJ6oN5NWTIJJeMTPuCgRrNxFbIivSdm9vYPTSCNwNBhKb+H
FEsr62a9T4XgnTqy20h+yZJiKNvwtaggdTWFnUAUqsBSFkEtksAP72odvZx+GNv/
cndqtxy/84TcJ4ZXFdxElylCcQ9xRoRkqkU8KaVfg88wqMIMbSR3OBSH/g8bqR+3
cjvDGNC7TPHPEN2Wmq2AYluRlBxB2ZhsOauArsdidPXHAevO+AFnbS27fz6bixZK
LTS/qwKOsvhFzyHngemuG6s6HgkgBEshfcKk5i7d2ReRjaGP4EvkhmlImA==
=k49c
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull x86 kvm updates from Paolo Bonzini:
"Generic:
- Rework almost all of KVM's exports to expose symbols only to KVM's
x86 vendor modules (kvm-{amd,intel}.ko and PPC's kvm-{pr,hv}.ko
x86:
- Rework almost all of KVM x86's exports to expose symbols only to
KVM's vendor modules, i.e. to kvm-{amd,intel}.ko
- Add support for virtualizing Control-flow Enforcement Technology
(CET) on Intel (Shadow Stacks and Indirect Branch Tracking) and AMD
(Shadow Stacks).
It is worth noting that while SHSTK and IBT can be enabled
separately in CPUID, it is not really possible to virtualize them
separately. Therefore, Intel processors will really allow both
SHSTK and IBT under the hood if either is made visible in the
guest's CPUID. The alternative would be to intercept
XSAVES/XRSTORS, which is not feasible for performance reasons
- Fix a variety of fuzzing WARNs all caused by checking L1 intercepts
when completing userspace I/O. KVM has already committed to
allowing L2 to to perform I/O at that point
- Emulate PERF_CNTR_GLOBAL_STATUS_SET for PerfMonV2 guests, as the
MSR is supposed to exist for v2 PMUs
- Allow Centaur CPU leaves (base 0xC000_0000) for Zhaoxin CPUs
- Add support for the immediate forms of RDMSR and WRMSRNS, sans full
emulator support (KVM should never need to emulate the MSRs outside
of forced emulation and other contrived testing scenarios)
- Clean up the MSR APIs in preparation for CET and FRED
virtualization, as well as mediated vPMU support
- Clean up a pile of PMU code in anticipation of adding support for
mediated vPMUs
- Reject in-kernel IOAPIC/PIT for TDX VMs, as KVM can't obtain EOI
vmexits needed to faithfully emulate an I/O APIC for such guests
- Many cleanups and minor fixes
- Recover possible NX huge pages within the TDP MMU under read lock
to reduce guest jitter when restoring NX huge pages
- Return -EAGAIN during prefault if userspace concurrently
deletes/moves the relevant memslot, to fix an issue where
prefaulting could deadlock with the memslot update
x86 (AMD):
- Enable AVIC by default for Zen4+ if x2AVIC (and other prereqs) is
supported
- Require a minimum GHCB version of 2 when starting SEV-SNP guests
via KVM_SEV_INIT2 so that invalid GHCB versions result in immediate
errors instead of latent guest failures
- Add support for SEV-SNP's CipherText Hiding, an opt-in feature that
prevents unauthorized CPU accesses from reading the ciphertext of
SNP guest private memory, e.g. to attempt an offline attack. This
feature splits the shared SEV-ES/SEV-SNP ASID space into separate
ranges for SEV-ES and SEV-SNP guests, therefore a new module
parameter is needed to control the number of ASIDs that can be used
for VMs with CipherText Hiding vs. how many can be used to run
SEV-ES guests
- Add support for Secure TSC for SEV-SNP guests, which prevents the
untrusted host from tampering with the guest's TSC frequency, while
still allowing the the VMM to configure the guest's TSC frequency
prior to launch
- Validate the XCR0 provided by the guest (via the GHCB) to avoid
bugs resulting from bogus XCR0 values
- Save an SEV guest's policy if and only if LAUNCH_START fully
succeeds to avoid leaving behind stale state (thankfully not
consumed in KVM)
- Explicitly reject non-positive effective lengths during SNP's
LAUNCH_UPDATE instead of subtly relying on guest_memfd to deal with
them
- Reload the pre-VMRUN TSC_AUX on #VMEXIT for SEV-ES guests, not the
host's desired TSC_AUX, to fix a bug where KVM was keeping a
different vCPU's TSC_AUX in the host MSR until return to userspace
KVM (Intel):
- Preparation for FRED support
- Don't retry in TDX's anti-zero-step mitigation if the target
memslot is invalid, i.e. is being deleted or moved, to fix a
deadlock scenario similar to the aforementioned prefaulting case
- Misc bugfixes and minor cleanups"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (142 commits)
KVM: x86: Export KVM-internal symbols for sub-modules only
KVM: x86: Drop pointless exports of kvm_arch_xxx() hooks
KVM: x86: Move kvm_intr_is_single_vcpu() to lapic.c
KVM: Export KVM-internal symbols for sub-modules only
KVM: s390/vfio-ap: Use kvm_is_gpa_in_memslot() instead of open coded equivalent
KVM: VMX: Make CR4.CET a guest owned bit
KVM: selftests: Verify MSRs are (not) in save/restore list when (un)supported
KVM: selftests: Add coverage for KVM-defined registers in MSRs test
KVM: selftests: Add KVM_{G,S}ET_ONE_REG coverage to MSRs test
KVM: selftests: Extend MSRs test to validate vCPUs without supported features
KVM: selftests: Add support for MSR_IA32_{S,U}_CET to MSRs test
KVM: selftests: Add an MSR test to exercise guest/host and read/write
KVM: x86: Define AMD's #HV, #VC, and #SX exception vectors
KVM: x86: Define Control Protection Exception (#CP) vector
KVM: x86: Add human friendly formatting for #XM, and #VE
KVM: SVM: Enable shadow stack virtualization for SVM
KVM: SEV: Synchronize MSR_IA32_XSS from the GHCB when it's valid
KVM: SVM: Pass through shadow stack MSRs as appropriate
KVM: SVM: Update dump_vmcb with shadow stack save area additions
KVM: nSVM: Save/load CET Shadow Stack state to/from vmcb12/vmcb02
...
|
||
|
|
d05ca6b793 |
KVM x86 changes for 6.18
- Don't (re)check L1 intercepts when completing userspace I/O to fix a flaw
where a misbehaving usersepace (a.k.a. syzkaller) could swizzle L1's
intercepts and trigger a variety of WARNs in KVM.
- Emulate PERF_CNTR_GLOBAL_STATUS_SET for PerfMonV2 guests, as the MSR is
supposed to exist for v2 PMUs.
- Allow Centaur CPU leaves (base 0xC000_0000) for Zhaoxin CPUs.
- Clean up KVM's vector hashing code for delivering lowest priority IRQs.
- Clean up the fastpath handler code to only handle IPIs and WRMSRs that are
actually "fast", as opposed to handling those that KVM _hopes_ are fast, and
in the process of doing so add fastpath support for TSC_DEADLINE writes on
AMD CPUs.
- Clean up a pile of PMU code in anticipation of adding support for mediated
vPMUs.
- Add support for the immediate forms of RDMSR and WRMSRNS, sans full
emulator support (KVM should never need to emulate the MSRs outside of
forced emulation and other contrived testing scenarios).
- Clean up the MSR APIs in preparation for CET and FRED virtualization, as
well as mediated vPMU support.
- Rejecting a fully in-kernel IRQCHIP if EOIs are protected, i.e. for TDX VMs,
as KVM can't faithfully emulate an I/O APIC for such guests.
- KVM_REQ_MSR_FILTER_CHANGED into a generic RECALC_INTERCEPTS in preparation
for mediated vPMU support, as KVM will need to recalculate MSR intercepts in
response to PMU refreshes for guests with mediated vPMUs.
- Misc cleanups and minor fixes.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmjXIr0ACgkQOlYIJqCj
N/1bbhAAxHzxN7IcizgAYf1BZWMjRU4zJgwlkoGuBeH/IgUOODPjs93L9kyrzvVL
tcFgIe9o5fZRGmUfyZbCKnJaQi/4u/2QPRSGhsYt7vyDjCoXzO5CJPMYIqDz5Z2r
qg+GNMlLtWI8EbcDd4qT22SWC8GufoXFEQnX6PUNhasOHeKit5ye8wmttcG+zvYV
KeIkPluddQkQ2JKyG53IFNmm1lkY05oAibv61hkxqUSwCIJKsQFuDjl4GVouAd/H
eu0+pzNmzPUTQ/qJzr2cNL5Nqz08DGp2OCFFRO6bgXaWkvHnFG3EAEHlhTAUh92t
LPJxmhb6R8SUc+z8rYTgyF/zVpgeJcJO7F44FrXa7r2iV58ds3TfuO53hVaEfyNp
1GUMH0m8N2vfjtFyUVP1KwZHuFxiGKLd1wZ1h0yKpj1Eg1FjR2cEontqwH44tHn2
ENq8MIbWIBhvCsz5fIbM4y591JSevJUrDlYu60Lz7VyXHAw8Cq92t/dN9O7oH5mJ
pIyoracU1g0Q6bbATZYsOGhkCTYLtdelZaBb5AYIgQ+U4C1TA4GpgEBUSVH8HXDy
kXzVqSFlL0v5rrFkBPjiNFb5WD3iLjJIM3DLGoNegOM8+79r/USGHUY+XU3z/kCH
rV8JBlTnLBCrNOHEiHJUI2kwBQ9C9/l88X/VwvRUNv7SthuExSo=
=9IB0
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-misc-6.18' of https://github.com/kvm-x86/linux into HEAD
KVM x86 changes for 6.18
- Don't (re)check L1 intercepts when completing userspace I/O to fix a flaw
where a misbehaving usersepace (a.k.a. syzkaller) could swizzle L1's
intercepts and trigger a variety of WARNs in KVM.
- Emulate PERF_CNTR_GLOBAL_STATUS_SET for PerfMonV2 guests, as the MSR is
supposed to exist for v2 PMUs.
- Allow Centaur CPU leaves (base 0xC000_0000) for Zhaoxin CPUs.
- Clean up KVM's vector hashing code for delivering lowest priority IRQs.
- Clean up the fastpath handler code to only handle IPIs and WRMSRs that are
actually "fast", as opposed to handling those that KVM _hopes_ are fast, and
in the process of doing so add fastpath support for TSC_DEADLINE writes on
AMD CPUs.
- Clean up a pile of PMU code in anticipation of adding support for mediated
vPMUs.
- Add support for the immediate forms of RDMSR and WRMSRNS, sans full
emulator support (KVM should never need to emulate the MSRs outside of
forced emulation and other contrived testing scenarios).
- Clean up the MSR APIs in preparation for CET and FRED virtualization, as
well as mediated vPMU support.
- Rejecting a fully in-kernel IRQCHIP if EOIs are protected, i.e. for TDX VMs,
as KVM can't faithfully emulate an I/O APIC for such guests.
- KVM_REQ_MSR_FILTER_CHANGED into a generic RECALC_INTERCEPTS in preparation
for mediated vPMU support, as KVM will need to recalculate MSR intercepts in
response to PMU refreshes for guests with mediated vPMUs.
- Misc cleanups and minor fixes.
|
||
|
|
a104e0a305 |
KVM SVM changes for 6.18
- Require a minimum GHCB version of 2 when starting SEV-SNP guests via
KVM_SEV_INIT2 so that invalid GHCB versions result in immediate errors
instead of latent guest failures.
- Add support for Secure TSC for SEV-SNP guests, which prevents the untrusted
host from tampering with the guest's TSC frequency, while still allowing the
the VMM to configure the guest's TSC frequency prior to launch.
- Mitigate the potential for TOCTOU bugs when accessing GHCB fields by
wrapping all accesses via READ_ONCE().
- Validate the XCR0 provided by the guest (via the GHCB) to avoid tracking a
bogous XCR0 value in KVM's software model.
- Save an SEV guest's policy if and only if LAUNCH_START fully succeeds to
avoid leaving behind stale state (thankfully not consumed in KVM).
- Explicitly reject non-positive effective lengths during SNP's LAUNCH_UPDATE
instead of subtly relying on guest_memfd to do the "heavy" lifting.
- Reload the pre-VMRUN TSC_AUX on #VMEXIT for SEV-ES guests, not the host's
desired TSC_AUX, to fix a bug where KVM could clobber a different vCPU's
TSC_AUX due to hardware not matching the value cached in the user-return MSR
infrastructure.
- Enable AVIC by default for Zen4+ if x2AVIC (and other prereqs) is supported,
and clean up the AVIC initialization code along the way.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmjXH54ACgkQOlYIJqCj
N/0OCw//e+0o6jov6/PO8ljq6sXJySsXKxEFYnvQlWYzjqtlVs05Y2SY0GBTnMu3
g0ie2c4V3VD7cY5bGAWETWvrOMLqGXM3E7v9dVOuE4xU3xx0HkCAlXc/woOLUXoT
jo/komNXnpeiZ1QRO9FlGooHTJ6Y+jg6/mM7asStS2Pk3Mm//wYgQej9mSJDrypo
NB4+BCS9cyt8rndNtCUkyedFYMboVQ8AEvXh/jeydhw4rdbBh0/Ci2IKGcVI5DP1
be8GD/FsNTIUDtieHRYCR+LCKCMFj/hYzlg2nQ6UjxHZbvlDyQuh2Ld2LtZiGSef
ejNr9e+ro6vxWBgX6wplWtKRLxBYEnQ1h/rQ9A3g50TuhrtFJbxBxY7DPQ16hlBJ
EB/E1JFvVgkGVrYN0oPQCvvfhFtpkx43qnEBw4q0pbdAS79XOnG2GJFvI0hpZAP6
qwy19lbsJ5g3qLTlDPChxQJC08gThn3CbarCmZNNzBpPDQoLDUfYBfyN4prRPuiN
UByfaaEC0Fi6JSgmHsO0LsUB9K++k2ucWiIIW4YQhVgPUtCjTNLe9omgGJ1UYe0X
YITqgklewe3QtBJ46JE0APkPaHio7r6zd7QvO+RhRFkjwZfY6dlsrSImykKrpK3O
rPaZnW+UpAnA1XIqroMl1RVoczFCfGcP1Cat9JwScBVVxjJ1DlI=
=zd53
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-svm-6.18' of https://github.com/kvm-x86/linux into HEAD
KVM SVM changes for 6.18
- Require a minimum GHCB version of 2 when starting SEV-SNP guests via
KVM_SEV_INIT2 so that invalid GHCB versions result in immediate errors
instead of latent guest failures.
- Add support for Secure TSC for SEV-SNP guests, which prevents the untrusted
host from tampering with the guest's TSC frequency, while still allowing the
the VMM to configure the guest's TSC frequency prior to launch.
- Mitigate the potential for TOCTOU bugs when accessing GHCB fields by
wrapping all accesses via READ_ONCE().
- Validate the XCR0 provided by the guest (via the GHCB) to avoid tracking a
bogous XCR0 value in KVM's software model.
- Save an SEV guest's policy if and only if LAUNCH_START fully succeeds to
avoid leaving behind stale state (thankfully not consumed in KVM).
- Explicitly reject non-positive effective lengths during SNP's LAUNCH_UPDATE
instead of subtly relying on guest_memfd to do the "heavy" lifting.
- Reload the pre-VMRUN TSC_AUX on #VMEXIT for SEV-ES guests, not the host's
desired TSC_AUX, to fix a bug where KVM could clobber a different vCPU's
TSC_AUX due to hardware not matching the value cached in the user-return MSR
infrastructure.
- Enable AVIC by default for Zen4+ if x2AVIC (and other prereqs) is supported,
and clean up the AVIC initialization code along the way.
|
||
|
|
e19c062199 |
x86/cpufeatures: Add support for Assignable Bandwidth Monitoring Counters (ABMC)
Users can create as many monitor groups as RMIDs supported by the hardware. However, the bandwidth monitoring feature on AMD only guarantees that RMIDs currently assigned to a processor will be tracked by hardware. The counters of any other RMIDs which are no longer being tracked will be reset to zero. The MBM event counters return "Unavailable" for the RMIDs that are not tracked by hardware. So, there can be only limited number of groups that can give guaranteed monitoring numbers. With ever changing configurations there is no way to definitely know which of these groups are being tracked during a particular time. Users do not have the option to monitor a group or set of groups for a certain period of time without worrying about RMID being reset in between. The ABMC feature allows users to assign a hardware counter to an RMID, event pair and monitor bandwidth usage as long as it is assigned. The hardware continues to track the assigned counter until it is explicitly unassigned by the user. There is no need to worry about counters being reset during this period. Additionally, the user can specify the type of memory transactions (e.g., reads, writes) for the counter to track. Without ABMC enabled, monitoring will work in current mode without assignment option. The Linux resctrl subsystem provides an interface that allows monitoring of up to two memory bandwidth events per group, selected from a combination of available total and local events. When ABMC is enabled, two events will be assigned to each group by default, in line with the current interface design. Users will also have the option to configure which types of memory transactions are counted by these events. Due to the limited number of available counters (32), users may quickly exhaust the available counters. If the system runs out of assignable ABMC counters, the kernel will report an error. In such cases, users will need to unassign one or more active counters to free up counters for new assignments. resctrl will provide options to assign or unassign events through the group-specific interface file. The feature is detected via CPUID_Fn80000020_EBX_x00 bit 5: ABMC (Assignable Bandwidth Monitoring Counters). The ABMC feature details are documented in APM [1] available from [2]. [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth Monitoring (ABMC). [ bp: Massage commit message, fixup enumeration due to VMSCAPE ] Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 # [2] |
||
|
|
223ba8ee0a |
Mitigate VMSCAPE issue with indirect branch predictor flushes
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmi58uwACgkQaDWVMHDJ krCIBxAAj/8/RBSSK6ULtDLKbmpRKMVpwEE1Yt8vK95Z/50gVSidtQtofIet+CPY NeN5Y4Aip3w/JFoIQafop8ZASOFjNjhqVEjE75RdtdDacQCyluqWg/2PrJpKkBVv OWTVVVPD9aSZAY0Tk/79ABV8Fbp/EBID5mhJ40GrBhkLZku2ALDj1eQINEjoBedB 2+sCO1MMqynlmglt8FltwFtl0rHgtlhGviuc/QmsxH9FrLIGBlgciW4Rma+LOtAE 4iD1Ij/ICuwA78kPAgrxvs+B1w3QGZhTPvOHjj0c9kKM3jBqphWoMWFUKbFfUK8i 6rM0jZMB8iaUcKJ+Ra+stNmvddLkbya7J9wwHgQWi/kxEMZMxbbbOXwfl1Ya8sha n/kKxm8Lsrjex3RTnd1hoXvGY2blr0dZ97jfjgOqVuYBZih5yWzixQbuf3TAbCZO Kb+fbfC7EsI1N0zuFh42Q1hT0zxYYshNIxtGPjDwspJRkHvhmNjNswXr7sccXhFo P5araDcYN0ul85SlAhQRMB17mle47ETSgh04LRM4Rq3rbweXzghoRj//WcY4YqYS qSJEFzSC7hVwNabG+NBexUaZL8bZRMoE7qx5lmo0q+tTMIQkEG2rqrFz9b1d4JON g6aKyrD8YyRCoBjZAF0tjCwhQgxSKXGsVwzBYl0+RcY+1Lo1L2U= =8wrr -----END PGP SIGNATURE----- Merge tag 'vmscape-for-linus-20250904' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull vmescape mitigation fixes from Dave Hansen: "Mitigate vmscape issue with indirect branch predictor flushes. vmscape is a vulnerability that essentially takes Spectre-v2 and attacks host userspace from a guest. It particularly affects hypervisors like QEMU. Even if a hypervisor may not have any sensitive data like disk encryption keys, guest-userspace may be able to attack the guest-kernel using the hypervisor as a confused deputy. There are many ways to mitigate vmscape using the existing Spectre-v2 defenses like IBRS variants or the IBPB flushes. This series focuses solely on IBPB because it works universally across vendors and all vulnerable processors. Further work doing vendor and model-specific optimizations can build on top of this if needed / wanted. Do the normal issue mitigation dance: - Add the CPU bug boilerplate - Add a list of vulnerable CPUs - Use IBPB to flush the branch predictors after running guests" * tag 'vmscape-for-linus-20250904' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/vmscape: Add old Intel CPUs to affected list x86/vmscape: Warn when STIBP is disabled with SMT x86/bugs: Move cpu_bugs_smt_update() down x86/vmscape: Enable the mitigation x86/vmscape: Add conditional IBPB mitigation x86/vmscape: Enumerate VMSCAPE bug Documentation/hw-vuln: Add VMSCAPE documentation |
||
|
|
7b59c73fd6 |
x86/cpufeatures: Add SNP Secure TSC
The Secure TSC feature for SEV-SNP allows guests to securely use the RDTSC and RDTSCP instructions, ensuring that the parameters used cannot be altered by the hypervisor once the guest is launched. For more details, refer to the AMD64 APM Vol 2, Section "Secure TSC". Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Tested-by: Vaishali Thakkar <vaishali.thakkar@suse.com> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com> Link: https://lore.kernel.org/r/20250819234833.3080255-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
|
|
3c7cb84145 |
x86/cpufeatures: Add a CPU feature bit for MSR immediate form instructions
The immediate form of MSR access instructions are primarily motivated by performance, not code size: by having the MSR number in an immediate, it is available *much* earlier in the pipeline, which allows the hardware much more leeway about how a particular MSR is handled. Use a scattered CPU feature bit for MSR immediate form instructions. Suggested-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Xin Li (Intel) <xin@zytor.com> Link: https://lore.kernel.org/r/20250805202224.1475590-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
|
|
2f8f173413 |
x86/vmscape: Add conditional IBPB mitigation
VMSCAPE is a vulnerability that exploits insufficient branch predictor isolation between a guest and a userspace hypervisor (like QEMU). Existing mitigations already protect kernel/KVM from a malicious guest. Userspace can additionally be protected by flushing the branch predictors after a VMexit. Since it is the userspace that consumes the poisoned branch predictors, conditionally issue an IBPB after a VMexit and before returning to userspace. Workloads that frequently switch between hypervisor and userspace will incur the most overhead from the new IBPB. This new IBPB is not integrated with the existing IBPB sites. For instance, a task can use the existing speculation control prctl() to get an IBPB at context switch time. With this implementation, the IBPB is doubled up: one at context switch and another before running userspace. The intent is to integrate and optimize these cases post-embargo. [ dhansen: elaborate on suboptimal IBPB solution ] Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Sean Christopherson <seanjc@google.com> |
||
|
|
a508cec6e5 |
x86/vmscape: Enumerate VMSCAPE bug
The VMSCAPE vulnerability may allow a guest to cause Branch Target Injection (BTI) in userspace hypervisors. Kernels (both host and guest) have existing defenses against direct BTI attacks from guests. There are also inter-process BTI mitigations which prevent processes from attacking each other. However, the threat in this case is to a userspace hypervisor within the same process as the attacker. Userspace hypervisors have access to their own sensitive data like disk encryption keys and also typically have access to all guest data. This means guest userspace may use the hypervisor as a confused deputy to attack sensitive guest kernel data. There are no existing mitigations for these attacks. Introduce X86_BUG_VMSCAPE for this vulnerability and set it on affected Intel and AMD CPUs. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> |
||
|
|
7b306dfa32 |
x86/sev: Evict cache lines during SNP memory validation
An SNP cache coherency vulnerability requires a cache line eviction mitigation when validating memory after a page state change to private. The specific mitigation is to touch the first and last byte of each 4K page that is being validated. There is no need to perform the mitigation when performing a page state change to shared and rescinding validation. CPUID bit Fn8000001F_EBX[31] defines the COHERENCY_SFW_NO CPUID bit that, when set, indicates that the software mitigation for this vulnerability is not needed. Implement the mitigation and invoke it when validating memory (making it private) and the COHERENCY_SFW_NO bit is not set, indicating the SNP guest is vulnerable. Co-developed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> |
||
|
|
5bf2f5119b |
Linux 6.16
-----BEGIN PGP SIGNATURE----- iQFSBAABCgA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmiGmY4eHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGJtgH/3Ao101Cd+qBIQk6 Ad1TrV6JfSnPMnbafjEYU6VjXyXKT4iIIxYDqndyxSt8fij5EIWMFveOVELkZhe9 jDqlaODFrV7VYowVMAcVZCLEkQneommnbNaUaJK+odF+tPhkEbcHfubf/6Fd009q H0QB/EsExdjlLpwtkzM6yWnKss+1HA+wonw5TNbdj7+OE5vFemaonP2+lmO6EnCC PeiwkawLWA67hGonJFD4S6gp7KseqRUyQKLAbICDf6+86GueXI8QsIHhELyznfBG nN1DyhfxiPz1ICfvX7sIdPU+CuyQG0KCLX0iepNc0IGcJ40x8tLeXiHKNy/k+3Sf FBD2gmg= =HI0k -----END PGP SIGNATURE----- Merge tag 'v6.16' into x86/cpu, to resolve conflict Resolve overlapping context conflict between this upstream fix: |
||
|
|
65f55a3017 |
x86/CPU/AMD: Add CPUID faulting support
Add CPUID faulting support on AMD using the same user interface. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/20250528213105.1149-1-bp@kernel.org |
||
|
|
d8010d4ba4 |
x86/bugs: Add a Transient Scheduler Attacks mitigation
Add the required features detection glue to bugs.c et all in order to support the TSA mitigation. Co-developed-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> |
||
|
|
7f9039c524 |
Generic:
* Clean up locking of all vCPUs for a VM by using the *_nest_lock()
family of functions, and move duplicated code to virt/kvm/.
kernel/ patches acked by Peter Zijlstra.
* Add MGLRU support to the access tracking perf test.
ARM fixes:
* Make the irqbypass hooks resilient to changes in the GSI<->MSI
routing, avoiding behind stale vLPI mappings being left behind. The
fix is to resolve the VGIC IRQ using the host IRQ (which is stable)
and nuking the vLPI mapping upon a routing change.
* Close another VGIC race where vCPU creation races with VGIC
creation, leading to in-flight vCPUs entering the kernel w/o private
IRQs allocated.
* Fix a build issue triggered by the recently added workaround for
Ampere's AC04_CPU_23 erratum.
* Correctly sign-extend the VA when emulating a TLBI instruction
potentially targeting a VNCR mapping.
* Avoid dereferencing a NULL pointer in the VGIC debug code, which can
happen if the device doesn't have any mapping yet.
s390:
* Fix interaction between some filesystems and Secure Execution
* Some cleanups and refactorings, preparing for an upcoming big series
x86:
* Wait for target vCPU to acknowledge KVM_REQ_UPDATE_PROTECTED_GUEST_STATE to
fix a race between AP destroy and VMRUN.
* Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for the VM.
* Refine and harden handling of spurious faults.
* Add support for ALLOWED_SEV_FEATURES.
* Add #VMGEXIT to the set of handlers special cased for CONFIG_RETPOLINE=y.
* Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing features
that utilize those bits.
* Don't account temporary allocations in sev_send_update_data().
* Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock Threshold.
* Unify virtualization of IBRS on nested VM-Exit, and cross-vCPU IBPB, between
SVM and VMX.
* Advertise support to userspace for WRMSRNS and PREFETCHI.
* Rescan I/O APIC routes after handling EOI that needed to be intercepted due
to the old/previous routing, but not the new/current routing.
* Add a module param to control and enumerate support for device posted
interrupts.
* Fix a potential overflow with nested virt on Intel systems running 32-bit kernels.
* Flush shadow VMCSes on emergency reboot.
* Add support for SNP to the various SEV selftests.
* Add a selftest to verify fastops instructions via forced emulation.
* Refine and optimize KVM's software processing of the posted interrupt bitmap, and share
the harvesting code between KVM and the kernel's Posted MSI handler
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmg9TjwUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroOUxQf7B7nnWqIKd7jSkGzSD6YsSX9TXktr
2tJIOfWM3zNYg5GRCidg+m4Y5+DqQWd3Hi5hH2P9wUw7RNuOjOFsDe+y0VBr8ysE
ve39t/yp+mYalNmHVFl8s3dBDgrIeGKiz+Wgw3zCQIBZ18rJE1dREhv37RlYZ3a2
wSvuObe8sVpCTyKIowDs1xUi7qJUBoopMSuqfleSHawRrcgCpV99U8/KNFF5plLH
7fXOBAHHniVCVc+mqQN2wxtVJDhST+U3TaU4GwlKy9Yevr+iibdOXffveeIgNEU4
D6q1F2zKp6UdV3+p8hxyaTTbiCVDqsp9WOgY/0I/f+CddYn0WVZgOlR+ow==
=mYFL
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull more kvm updates from Paolo Bonzini:
Generic:
- Clean up locking of all vCPUs for a VM by using the *_nest_lock()
family of functions, and move duplicated code to virt/kvm/. kernel/
patches acked by Peter Zijlstra
- Add MGLRU support to the access tracking perf test
ARM fixes:
- Make the irqbypass hooks resilient to changes in the GSI<->MSI
routing, avoiding behind stale vLPI mappings being left behind. The
fix is to resolve the VGIC IRQ using the host IRQ (which is stable)
and nuking the vLPI mapping upon a routing change
- Close another VGIC race where vCPU creation races with VGIC
creation, leading to in-flight vCPUs entering the kernel w/o
private IRQs allocated
- Fix a build issue triggered by the recently added workaround for
Ampere's AC04_CPU_23 erratum
- Correctly sign-extend the VA when emulating a TLBI instruction
potentially targeting a VNCR mapping
- Avoid dereferencing a NULL pointer in the VGIC debug code, which
can happen if the device doesn't have any mapping yet
s390:
- Fix interaction between some filesystems and Secure Execution
- Some cleanups and refactorings, preparing for an upcoming big
series
x86:
- Wait for target vCPU to ack KVM_REQ_UPDATE_PROTECTED_GUEST_STATE
to fix a race between AP destroy and VMRUN
- Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for
the VM
- Refine and harden handling of spurious faults
- Add support for ALLOWED_SEV_FEATURES
- Add #VMGEXIT to the set of handlers special cased for
CONFIG_RETPOLINE=y
- Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing
features that utilize those bits
- Don't account temporary allocations in sev_send_update_data()
- Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock
Threshold
- Unify virtualization of IBRS on nested VM-Exit, and cross-vCPU
IBPB, between SVM and VMX
- Advertise support to userspace for WRMSRNS and PREFETCHI
- Rescan I/O APIC routes after handling EOI that needed to be
intercepted due to the old/previous routing, but not the
new/current routing
- Add a module param to control and enumerate support for device
posted interrupts
- Fix a potential overflow with nested virt on Intel systems running
32-bit kernels
- Flush shadow VMCSes on emergency reboot
- Add support for SNP to the various SEV selftests
- Add a selftest to verify fastops instructions via forced emulation
- Refine and optimize KVM's software processing of the posted
interrupt bitmap, and share the harvesting code between KVM and the
kernel's Posted MSI handler"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (93 commits)
rtmutex_api: provide correct extern functions
KVM: arm64: vgic-debug: Avoid dereferencing NULL ITE pointer
KVM: arm64: vgic-init: Plug vCPU vs. VGIC creation race
KVM: arm64: Unmap vLPIs affected by changes to GSI routing information
KVM: arm64: Resolve vLPI by host IRQ in vgic_v4_unset_forwarding()
KVM: arm64: Protect vLPI translation with vgic_irq::irq_lock
KVM: arm64: Use lock guard in vgic_v4_set_forwarding()
KVM: arm64: Mask out non-VA bits from TLBI VA* on VNCR invalidation
arm64: sysreg: Drag linux/kconfig.h to work around vdso build issue
KVM: s390: Simplify and move pv code
KVM: s390: Refactor and split some gmap helpers
KVM: s390: Remove unneeded srcu lock
s390: Remove unneeded includes
s390/uv: Improve splitting of large folios that cannot be split while dirty
s390/uv: Always return 0 from s390_wiggle_split_folio() if successful
s390/uv: Don't return 0 from make_hva_secure() if the operation was not successful
rust: add helper for mutex_trylock
RISC-V: KVM: use kvm_trylock_all_vcpus when locking all vCPUs
KVM: arm64: use kvm_trylock_all_vcpus when locking all vCPUs
x86: KVM: SVM: use kvm_lock_all_vcpus instead of a custom implementation
...
|
||
|
|
4e02d4f973 |
KVM SVM changes for 6.16:
- Wait for target vCPU to acknowledge KVM_REQ_UPDATE_PROTECTED_GUEST_STATE to
fix a race between AP destroy and VMRUN.
- Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for the VM.
- Add support for ALLOWED_SEV_FEATURES.
- Add #VMGEXIT to the set of handlers special cased for CONFIG_RETPOLINE=y.
- Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing features
that utilize those bits.
- Don't account temporary allocations in sev_send_update_data().
- Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock Threshold.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmgwmwAACgkQOlYIJqCj
N/1pHw//edW/x838POMeeCN8j39NBKErW9yZoQLhMbzogttRvfoba+xYY9zXyRFx
8AXB8+2iLtb7pXUohc0eYN0mNqgD0SnoMLqGfn7nrkJafJSUAJHAoZn1Mdom1M1y
jHvBPbHCMMsgdLV8wpDRqCNWTH+d5W0kcN5WjKwOswVLj1rybVfK7bSLMhvkk1e5
RrOR4Ewf95/Ag2b36L4SvS1yG9fTClmKeGArMXhEXjy2INVSpBYyZMjVtjHiNzU9
TjtB2RSM45O+Zl0T2fZdVW8LFhA6kVeL1v+Oo433CjOQE0LQff3Vl14GCANIlPJU
PiWN/RIKdWkuxStIP3vw02eHzONCcg2GnNHzEyKQ1xW8lmrwzVRdXZzVsc2Dmowb
7qGykBQ+wzoE0sMeZPA0k/QOSqg2vGxUQHjR7720loLV9m9Tu/mJnS9e179GJKgI
e1ArSLwKmHpjwKZqU44IQVTZaxSC4Sg2kI670i21ChPgx8+oVkA6I0LFQXymx7uS
2lbH+ovTlJSlP9fbaJhMwAU2wpSHAyXif/HPjdw2LTH3NdgXzfEnZfTlAWiP65LQ
hnz5HvmUalW3x9kmzRmeDIAkDnAXhyt3ZQMvbNzqlO5AfS+Tqh4Ed5EFP3IrQAzK
HQ+Gi0ip+B84t9Tbi6rfQwzTZEbSSOfYksC7TXqRGhNo/DvHumE=
=k6rK
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-svm-6.16' of https://github.com/kvm-x86/linux into HEAD
KVM SVM changes for 6.16:
- Wait for target vCPU to acknowledge KVM_REQ_UPDATE_PROTECTED_GUEST_STATE to
fix a race between AP destroy and VMRUN.
- Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for the VM.
- Add support for ALLOWED_SEV_FEATURES.
- Add #VMGEXIT to the set of handlers special cased for CONFIG_RETPOLINE=y.
- Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing features
that utilize those bits.
- Don't account temporary allocations in sev_send_update_data().
- Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock Threshold.
|
||
|
|
ebd38b26ec |
KVM x86 misc changes for 6.16:
- Unify virtualization of IBRS on nested VM-Exit, and cross-vCPU IBPB, between
SVM and VMX.
- Advertise support to userspace for WRMSRNS and PREFETCHI.
- Rescan I/O APIC routes after handling EOI that needed to be intercepted due
to the old/previous routing, but not the new/current routing.
- Add a module param to control and enumerate support for device posted
interrupts.
- Misc cleanups.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmgwmHsACgkQOlYIJqCj
N/2fYhAAiwKkqQpOWLcGjjezDTnpMqDDbHCffroq0Ttmqfg/cuul1oyZax+9fxBO
203HUi5VKcG7uAGSpLcMFkPUs9hKnaln2lsDaQD+AnGucdj+JKF5p3INCsSYCo9N
LVRjRZWtZocxJwHSHX9gU8om0pJ5fBCBG2+7+7XhWRaIqCpJe5k944JotiiOkgZ4
5sXeITkN2kouFVMI8eD4wQGNXxRxs837SYUlwCnoD3VuuBesOZuEhz/CEL9l8vNY
keXBLPg7bSW53clKfquNKwXDQRephnZaYoexDebUd+OlZphGhTIPh+C75xPQLWSi
aYg6W9XDu3TChf4LPxHnJLwLg/rjeKNQARcxrnb3XLpPAtx3i2cKU8pDPhnd4qn0
+YV5H0dato8bbe+oClGv+oIolM01qfI9SJVoaEhTPu3Rdw9cCQSVFn5t32vG3Vab
FVxX+seV3+XTmVveD4cjiiMbqtNADwZ/PmHNAi9QCl46DgHR++MLfRtjYuGo1koL
QOmCg2fWOFYtQT6XPJqZxp1SHYxuawrB4qcO9FNyxTuMChoslYoLAr3mBUj0DvwL
fdXNof74Ccj8OK9o3uCPOXS92pZz92rz/edy/XmYiCmO+VEwBJFR6IdginZMPX11
UAc4mAC+KkDTvOcKPPIWEXArOsfQMFKefi+bPOeWUx9/nqpcVws=
=Vs9P
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-misc-6.16' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.16:
- Unify virtualization of IBRS on nested VM-Exit, and cross-vCPU IBPB, between
SVM and VMX.
- Advertise support to userspace for WRMSRNS and PREFETCHI.
- Rescan I/O APIC routes after handling EOI that needed to be intercepted due
to the old/previous routing, but not the new/current routing.
- Add a module param to control and enumerate support for device posted
interrupts.
- Misc cleanups.
|
||
|
|
412751aa69 |
Linux 6.15-rc7
-----BEGIN PGP SIGNATURE----- iQFSBAABCgA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmgqSbkeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGr6sH/1ICAvlin1GuxffE ISVNz3xhXQpXG2k8yl9r0umpdCfPQbGrxm30vZyuIDNutY/FuMvkIqfu+Z1NnLg0 GidZW015LtXrp7/puKtTnUD5CPSjdETMXig+Q7c1PrxkkmHwz8sBbbm173AIDbDB t7wwqSEUQh2AIDouGwN+DXB+6bR2FoOXb/k/njmtappIwR3rBc2f1HQJnP095rKO 5AKw1c9DMv5Wq2cEdBOCP48e4CFZEIN1ycW0nvtjpnOmcPOJjLoEothRbntQolqF udtj5UeTGdAJqmjigv7KHmlrmFNe+GqBq4+beHl5MRxhBaT2uGGaM9jCJiSxT3Jx sHyYYr8= =Ddma -----END PGP SIGNATURE----- Merge tag 'v6.15-rc7' into x86/core, to pick up fixes Pick up build fixes from upstream to make this tree more testable. Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
56b2b1fc90 |
Misc x86 fixes:
- Fix SEV-SNP kdump bugs - Update the email address of Alexey Makhalov in MAINTAINERS - Add the CPU feature flag for the Zen6 microarchitecture - Fix typo in system message Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmgoj3MRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1hppg//S/eodSXrgxzTOvZLu0gFeYN4xyxUnfWl 0Dvc+FRasGCpBpQcD9sl3w9xKnTkaGY250NPP4/OKW2JgiizP6E3UcFYvaDnZ96I TU3/y3acAI5zpvASOuOuDlwDt0w9xIk5L/K0gcVec9dYnGdAOmTE4jjZV6wDm0Q4 rto8k5E0RmSs5HQ4GcpU2sgzJSlaQlkkxZMo6HaUE6oJUiuodmPnxHkjoLgAQiU9 I0ALcrPVtyI1jap52DVxAIDcMsrOddazYley4IyDRqWezwrtrxkNaEzvNkMWO4ZV iAnTYe/21HrppsQ40KuYa5VY5k0Dkv+QVzb23rGT2sZlPaXAiPIVUtt25z4VGtve 1z/kn1TszfcqC9sPodVcHQkzNrTktlaEKXd3u9GuFlfMkuj7iSnmYnGoPMo6x7T9 vcbBF6PUQ+uNi7QZXDvww8S0OMBVVlMDOjhuGjFBFzkmfVzkFtdyC1oGXppiXNzg KG0LjiTDlOeI4B8bxG1Wwldwl7vLfwHJag2xWaw0uQR8mjstkCTLXibjdvz3QNwi bM14hlG3TxmxJSsYl8QNFnF45DwzApWGKz9K81OPz/yZ2Z6KB1uQqrN2l8+blFt9 OUMEukY9sAcmUR1hkt3Rdynb1ri+jGMcJUGOn48w2ne+qiLoVicp8LEgWO6KoI3Z vgLnVmqIa9o= =cD7r -----END PGP SIGNATURE----- Merge tag 'x86-urgent-2025-05-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 fixes from Ingo Molnar: - Fix SEV-SNP kdump bugs - Update the email address of Alexey Makhalov in MAINTAINERS - Add the CPU feature flag for the Zen6 microarchitecture - Fix typo in system message * tag 'x86-urgent-2025-05-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Remove duplicated word in warning message x86/CPU/AMD: Add X86_FEATURE_ZEN6 x86/sev: Make sure pages are not skipped during kdump x86/sev: Do not touch VMSA pages during SNP guest memory kdump MAINTAINERS: Update Alexey Makhalov's email address x86/sev: Fix operator precedence in GHCB_MSR_VMPL_REQ_LEVEL macro |
||
|
|
faad6645e1 |
x86/cpufeatures: Add CPUID feature bit for the Bus Lock Threshold
Misbehaving guests can cause bus locks to degrade the performance of the system. The Bus Lock Threshold feature can be used to address this issue by providing capability to the hypervisor to limit guest's ability to generate bus lock, thereby preventing system slowdown due to performance penalities. When the Bus Lock Threshold feature is enabled, the processor checks the bus lock threshold count before executing the buslock and decides whether to trigger bus lock exit or not. The value of the bus lock threshold count '0' generates bus lock exits, and if the value is greater than '0', the bus lock is executed successfully and the bus lock threshold count is decremented. Presence of the Bus Lock threshold feature is indicated via CPUID function 0x8000000A_EDX[29]. Signed-off-by: Manali Shukla <manali.shukla@amd.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250502050346.14274-3-manali.shukla@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
|
|
24ee8d9432 |
x86/CPU/AMD: Add X86_FEATURE_ZEN6
Add a synthetic feature flag for Zen6.
[ bp: Move the feature flag to a free slot and avoid future merge
conflicts from incoming stuff. ]
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250513204857.3376577-1-yazen.ghannam@amd.com
|
||
|
|
c4070e1996 |
Merge commit 'its-for-linus-20250509-merge' into x86/core, to resolve conflicts
Conflicts: Documentation/admin-guide/hw-vuln/index.rst arch/x86/include/asm/cpufeatures.h arch/x86/kernel/alternative.c arch/x86/kernel/cpu/bugs.c arch/x86/kernel/cpu/common.c drivers/base/cpu.c include/linux/cpu.h Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
69cb33e2f8 |
Merge branch 'x86/microcode' into x86/core, to merge dependent commits
Prepare to resolve conflicts with an upstream series of fixes that conflict
with pending x86 changes:
|
||
|
|
2665281a07 |
x86/its: Add "vmexit" option to skip mitigation on some CPUs
Ice Lake generation CPUs are not affected by guest/host isolation part of ITS. If a user is only concerned about KVM guests, they can now choose a new cmdline option "vmexit" that will not deploy the ITS mitigation when CPU is not affected by guest/host isolation. This saves the performance overhead of ITS mitigation on Ice Lake gen CPUs. When "vmexit" option selected, if the CPU is affected by ITS guest/host isolation, the default ITS mitigation is deployed. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> |
||
|
|
8754e67ad4 |
x86/its: Add support for ITS-safe indirect thunk
Due to ITS, indirect branches in the lower half of a cacheline may be vulnerable to branch target injection attack. Introduce ITS-safe thunks to patch indirect branches in the lower half of cacheline with the thunk. Also thunk any eBPF generated indirect branches in emit_indirect_jump(). Below category of indirect branches are not mitigated: - Indirect branches in the .init section are not mitigated because they are discarded after boot. - Indirect branches that are explicitly marked retpoline-safe. Note that retpoline also mitigates the indirect branches against ITS. This is because the retpoline sequence fills an RSB entry before RET, and it does not suffer from RSB-underflow part of the ITS. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> |
||
|
|
159013a7ca |
x86/its: Enumerate Indirect Target Selection (ITS) bug
ITS bug in some pre-Alderlake Intel CPUs may allow indirect branches in the first half of a cache line get predicted to a target of a branch located in the second half of the cache line. Set X86_BUG_ITS on affected CPUs. Mitigation to follow in later commits. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> |
||
|
|
f9f27c4a37 |
x86/cpufeatures: Add "Allowed SEV Features" Feature
Add CPU feature detection for "Allowed SEV Features" to allow the Hypervisor to enforce that SEV-ES and SEV-SNP guest VMs cannot enable features (via SEV_FEATURES) that the Hypervisor does not support or wish to be enabled. Signed-off-by: Kishon Vijay Abraham I <kvijayab@amd.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Kim Phillips <kim.phillips@amd.com> Link: https://lore.kernel.org/r/20250310201603.1217954-2-kim.phillips@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
|
|
d88bb2ded2 |
KVM: x86: Advertise support for AMD's PREFETCHI
The latest AMD platform has introduced a new instruction called PREFETCHI. This instruction loads a cache line from a specified memory address into the indicated data or instruction cache level, based on locality reference hints. Feature bit definition: CPUID_Fn80000021_EAX [bit 20] - Indicates support for IC prefetch. This feature is analogous to Intel's PREFETCHITI (CPUID.(EAX=7,ECX=1):EDX), though the CPUID bit definitions differ between AMD and Intel. Advertise support to userspace, as no additional enabling is necessary (PREFETCHI can't be intercepted as there's no instruction specific behavior that needs to be virtualize). The feature is documented in Processor Programming Reference (PPR) for AMD Family 1Ah Model 02h, Revision C1 (Link below). Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 Signed-off-by: Babu Moger <babu.moger@amd.com> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/ee1c08fc400bb574a2b8f2c6a0bd9def10a29d35.1744130533.git.babu.moger@amd.com [sean: rewrite shortlog to highlight the KVM functionality] Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
|
|
9a7cb00a8f |
x86/cpufeatures: Define X86_FEATURE_AMD_IBRS_SAME_MODE
Per the APM [1]:
Some processors, identified by CPUID Fn8000_0008_EBX[IbrsSameMode]
(bit 19) = 1, provide additional speculation limits. For these
processors, when IBRS is set, indirect branch predictions are not
influenced by any prior indirect branches, regardless of mode (CPL
and guest/host) and regardless of whether the prior indirect branches
occurred before or after the setting of IBRS. This is referred to as
Same Mode IBRS.
Define this feature bit, which will be used by KVM to determine if an
IBPB is required on nested VM-exits in SVM.
[1] AMD64 Architecture Programmer's Manual Pub. 40332, Rev 4.08 - April
2024, Volume 2, 3.2.9 Speculation Control MSRs
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250221163352.3818347-2-yosry.ahmed@linux.dev
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
||
|
|
4e2c719782 |
x86/cpu: Help users notice when running old Intel microcode
Old microcode is bad for users and for kernel developers.
For users, it exposes them to known fixed security and/or functional
issues. These obviously rarely result in instant dumpster fires in
every environment. But it is as important to keep your microcode up
to date as it is to keep your kernel up to date.
Old microcode also makes kernels harder to debug. A developer looking
at an oops need to consider kernel bugs, known CPU issues and unknown
CPU issues as possible causes. If they know the microcode is up to
date, they can mostly eliminate known CPU issues as the cause.
Make it easier to tell if CPU microcode is out of date. Add a list
of released microcode. If the loaded microcode is older than the
release, tell users in a place that folks can find it:
/sys/devices/system/cpu/vulnerabilities/old_microcode
Tell kernel kernel developers about it with the existing taint
flag:
TAINT_CPU_OUT_OF_SPEC
== Discussion ==
When a user reports a potential kernel issue, it is very common
to ask them to reproduce the issue on mainline. Running mainline,
they will (independently from the distro) acquire a more up-to-date
microcode version list. If their microcode is old, they will
get a warning about the taint and kernel developers can take that
into consideration when debugging.
Just like any other entry in "vulnerabilities/", users are free to
make their own assessment of their exposure.
== Microcode Revision Discussion ==
The microcode versions in the table were generated from the Intel
microcode git repo:
8ac9378a8487 ("microcode-20241112 Release")
which as of this writing lags behind the latest microcode-20250211.
It can be argued that the versions that the kernel picks to call "old"
should be a revision or two old. Which specific version is picked is
less important to me than picking *a* version and enforcing it.
This repository contains only microcode versions that Intel has deemed
to be OS-loadable. It is quite possible that the BIOS has loaded a
newer microcode than the latest in this repo. If this happens, the
system is considered to have new microcode, not old.
Specifically, the sysfs file and taint flag answer the question:
Is the CPU running on the latest OS-loadable microcode,
or something even later that the BIOS loaded?
In other words, Intel never publishes an authoritative list of CPUs
and latest microcode revisions. Until it does, this is the best that
Linux can do.
Also note that the "intel-ucode-defs.h" file is simple, ugly and
has lots of magic numbers. That's on purpose and should allow a
single file to be shared across lots of stable kernel regardless of if
they have the new "VFM" infrastructure or not. It was generated with
a dumb script.
== FAQ ==
Q: Does this tell me if my system is secure or insecure?
A: No. It only tells you if your microcode was old when the
system booted.
Q: Should the kernel warn if the microcode list itself is too old?
A: No. New kernels will get new microcode lists, both mainline
and stable. The only way to have an old list is to be running
an old kernel in which case you have bigger problems.
Q: Is this for security or functional issues?
A: Both.
Q: If a given microcode update only has functional problems but
no security issues, will it be considered old?
A: Yes. All microcode image versions within a microcode release
are treated identically. Intel appears to make security
updates without disclosing them in the release notes. Thus,
all updates are considered to be security-relevant.
Q: Who runs old microcode?
A: Anybody with an old distro. This happens all the time inside
of Intel where there are lots of weird systems in labs that
might not be getting regular distro updates and might also
be running rather exotic microcode images.
Q: If I update my microcode after booting will it stop saying
"Vulnerable"?
A: No. Just like all the other vulnerabilies, you need to
reboot before the kernel will reassess your vulnerability.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "Ahmed S. Darwish" <darwi@linutronix.de>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/all/20250421195659.CF426C07%40davehans-spike.ostc.intel.com
(cherry picked from commit 9127865b15eb0a1bd05ad7efe29489c44394bdc1)
|