linux

mirror of https://github.com/torvalds/linux.git synced 2026-06-03 12:03:54 +02:00

Author	SHA1	Message	Date
Sean Christopherson	0aec99f9bf	KVM: x86: Fix misleading variable names and add more comments for PIR=>IRR flow Rename kvm_apic_update_irr()'s "irr_updated" and vmx_sync_pir_to_irr()'s "got_posted_interrupt" to a more accurate "max_irr_is_from_pir", as neither "irr_updated" nor "got_posted_interrupt" is accurate. __kvm_apic_update_irr() and thus kvm_apic_update_irr() specifically return true if and only if the highest priority IRQ, i.e. max_irr, is a "new" pending IRQ from the PIR. I.e. it's possible for the IRR to be updated, i.e. for a posted IRQ to be "got", without the APIs returning true. Expand vmx_sync_pir_to_irr()'s comment to explain why it's necessary to set KVM_REQ_EVENT only if a "new" IRQ was found, and to explain why it's safe to do so only if a new IRQ is also the highest priority pending IRQ. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://patch.msgid.link/20260503201703.108231-3-pbonzini@redhat.com/ Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2026-05-03 22:32:41 +02:00
Paolo Bonzini	33fd0ccd25	KVM: x86: Do IRR scan in __kvm_apic_update_irr even if PIR is empty Fall back to apic_find_highest_vector() when PID.ON is set but PIR turns out to be empty, to correctly report the highest pending interrupt from the existing IRR. In a nested VM stress test, the following WARNING fires in vmx_check_nested_events() when kvm_cpu_has_interrupt() reports a pending interrupt but the subsequent kvm_apic_has_interrupt() (which invokes vmx_sync_pir_to_irr() again) returns -1: WARNING: CPU: 99 PID: 57767 at arch/x86/kvm/vmx/nested.c:4449 vmx_check_nested_events+0x6bf/0x6e0 [kvm_intel] Call Trace: kvm_check_and_inject_events vcpu_enter_guest.constprop.0 vcpu_run kvm_arch_vcpu_ioctl_run kvm_vcpu_ioctl __x64_sys_ioctl do_syscall_64 entry_SYSCALL_64_after_hwframe The root cause is a race between vmx_sync_pir_to_irr() on the target vCPU and __vmx_deliver_posted_interrupt() on a sender vCPU. The sender performs two individually-atomic operations that are not a single transaction: 1. pi_test_and_set_pir(vector) -- sets the PIR bit 2. pi_test_and_set_on() -- sets PID.ON The following interleaving triggers the bug: Sender vCPU (IPI): Target vCPU (1st sync_pir_to_irr): B1: set PIR[vector] A1: pi_clear_on() A2: pi_harvest_pir() -> sees B1 bit A3: xchg() -> consumes bit, PIR=0 (1st sync returns correct max_irr) B2: set PID.ON = 1 Target vCPU (2nd sync_pir_to_irr): C1: pi_test_on() -> TRUE (from B2) C2: pi_clear_on() -> ON=0 C3: pi_harvest_pir() -> PIR empty C4: *max_irr = -1, early return IRR NOT SCANNED The interrupt is not lost (it resides in the IRR from the first sync and is recovered on the next vcpu_enter_guest() iteration), but the incorrect max_irr causes a spurious WARNING and a wasted L2 VM-Enter/VM-Exit cycle. 
Fixes: `b41f8638b9` ("KVM: VMX: Isolate pure loads from atomic XCHG when processing PIR") Reported-by: Farrah Chen <farrah.chen@intel.com> Analyzed-by: Chenyi Qiang <chenyi.qiang@intel.com> Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/kvm/20260428070349.1633238-1-chenyi.qiang@intel.com/T/ Link: https://patch.msgid.link/20260503201703.108231-2-pbonzini@redhat.com/ Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2026-05-03 22:18:15 +02:00
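
The interleaving in 33fd0ccd25 can be replayed in a small user-space model. This is only a sketch under stated assumptions, not the kernel code: `struct toy_vcpu`, the `toy_*` helpers, the 64-vector width, and the `scan_irr_if_pir_empty` switch are all hypothetical, and the model assumes PID.ON was already set before the first sync. It shows why the early return reports -1 even though the vector is already sitting in the IRR.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model (hypothetical): 64 vectors instead of 256, one sender, one target. */
struct toy_vcpu {
	_Atomic uint64_t pir; /* posted-interrupt requests, set by senders */
	atomic_bool on;       /* models PID.ON */
	uint64_t irr;         /* touched only by the target vCPU */
};

/* Highest set bit or -1; stand-in for apic_find_highest_vector(). */
static int toy_highest_vector(uint64_t bits)
{
	return bits ? 63 - __builtin_clzll(bits) : -1; /* GCC/Clang builtin */
}

/*
 * Target side. Returns true only if the highest pending vector was newly
 * harvested from the PIR. 'scan_irr_if_pir_empty' selects the fixed behavior.
 */
static bool toy_sync_pir_to_irr(struct toy_vcpu *v, bool scan_irr_if_pir_empty,
				int *max_irr)
{
	int old_max = toy_highest_vector(v->irr);
	uint64_t pir;

	if (!atomic_load(&v->on)) {
		*max_irr = old_max;
		return false;
	}
	atomic_store(&v->on, false);       /* models pi_clear_on() */
	pir = atomic_exchange(&v->pir, 0); /* harvest and consume the PIR */
	if (!pir) {
		/* ON was set, but an earlier sync already consumed the bit. */
		*max_irr = scan_irr_if_pir_empty ? old_max : -1;
		return false;
	}
	v->irr |= pir;
	*max_irr = toy_highest_vector(v->irr);
	return toy_highest_vector(pir) > old_max;
}

/* Replays B1, A1-A3, B2, C1-C4 and returns max_irr from the second sync. */
static int toy_replay_race(bool fixed)
{
	struct toy_vcpu v = { 0 };
	int max_irr;

	atomic_store(&v.on, true);                   /* assumed: ON set by an earlier post */
	atomic_fetch_or(&v.pir, UINT64_C(1) << 40);  /* B1: set PIR[40] */
	toy_sync_pir_to_irr(&v, fixed, &max_irr);    /* A1-A3: vector 40 moves to the IRR */
	atomic_store(&v.on, true);                   /* B2: sender sets PID.ON late */
	toy_sync_pir_to_irr(&v, fixed, &max_irr);    /* C1-C4: PIR is already empty */
	return max_irr;
}
```

Calling `toy_replay_race(false)` models the pre-fix early return, and `toy_replay_race(true)` models the fallback to scanning the existing IRR.
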
Anel Orazgaliyeva	00d572d4cd	KVM: X86: Fix array_index_nospec protection in __pv_send_ipi The __pv_send_ipi() function iterates over up to BITS_PER_LONG vCPUs starting from the APIC ID specified in its 'min' argument, which is provided by the guest. Commit `c87bd4dd43` used array_index_nospec() to clamp the value of 'min' but then the for_each_set_bit() loop dereferences higher indices without further protection. Theoretically, a guest can trigger speculative access to up to BITS_PER_LONG elements off the end of the phys_map[] array. (In practice it would probably need aggressive loop unrolling by the compiler to go more than one element off the end, and even that seems unlikely, but the theoretical possibility exists.) Move the array_index_nospec() inside the loop to protect the [map + i] index which is actually being used each time. Fixes: `c87bd4dd43` ("KVM: x86: use array_index_nospec with indices that come from guest") Fixes: `bdf7ffc899` ("KVM: LAPIC: Fix pv ipis out-of-bounds access") Fixes: `4180bf1b65` ("KVM: X86: Implement "send IPI" hypercall") Signed-off-by: Anel Orazgaliyeva <anelkz@amazon.de> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://patch.msgid.link/9d50fc3ca9e8e58f551d015f95d51a3c29ce6ccc.camel@infradead.org Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-03-12 09:11:40 -07:00
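
The change in 00d572d4cd amounts to clamping the index that is actually dereferenced on each iteration, instead of clamping only the starting value. A rough user-space sketch, assuming a hypothetical `toy_index_nospec()` stand-in (the kernel's `array_index_nospec()` is arch-specific and written so the compiler cannot turn the mask back into a branch; this portable version gives no such guarantee) and a hypothetical `toy_send_ipi()` whose map holds 1 for a present vCPU:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for array_index_nospec(): returns idx when
 * idx < size and 0 otherwise, computed with a mask rather than a jump.
 */
static size_t toy_index_nospec(size_t idx, size_t size)
{
	size_t mask = (size_t)0 - (size_t)(idx < size); /* all-ones or zero */

	return idx & mask;
}

/*
 * Walks the set bits of 'bitmap' and touches phys_map[min + i] for each.
 * The clamp sits inside the loop, on the exact index being used.
 * Returns how many present entries were visited.
 */
static int toy_send_ipi(const int *phys_map, size_t map_size,
			unsigned long bitmap, size_t min)
{
	int delivered = 0;

	for (size_t i = 0; i < sizeof(bitmap) * 8; i++) {
		if (!(bitmap & (1ul << i)))
			continue;
		if (min + i >= map_size) /* architectural bounds check */
			break;
		delivered += phys_map[toy_index_nospec(min + i, map_size)];
	}
	return delivered;
}
```

Clamping only `min` before the loop would leave `min + i` unprotected for every `i > 0`, which is the gap the changelog describes.
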
xuanqingshi	26c9bfc0fa	KVM: x86: Add LAPIC guard in kvm_apic_write_nodecode() kvm_apic_write_nodecode() dereferences vcpu->arch.apic without first checking whether the in-kernel LAPIC has been initialized. If it has not (e.g. the vCPU was created without an in-kernel LAPIC), the dereference results in a NULL pointer access. While APIC-write VM-Exits are not expected to occur on a vCPU without an in-kernel LAPIC, kvm_apic_write_nodecode() should be robust against such a scenario as a defense-in-depth measure, e.g. to guard against KVM bugs or CPU errata that could generate a spurious APIC-write VM-Exit. Use KVM_BUG_ON() with lapic_in_kernel() instead of a simple WARN_ON_ONCE(), as suggested by Sean Christopherson, so that KVM kills the VM outright rather than letting it continue in a broken state. Found by a VMCS-targeted fuzzer based on syzkaller. Signed-off-by: xuanqingshi <1356292400@qq.com> Link: https://patch.msgid.link/tencent_7A9F1B4D75468C0CF5DE1B6902038C948B07@qq.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-03-12 09:11:40 -07:00
Kees Cook	69050f8d6d	treewide: Replace kmalloc with kmalloc_obj for non-scalar types This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(PTR, FAM, COUNT, ...) (where TYPE may also be VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>	2026-02-21 01:02:28 -08:00
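
The idea behind 69050f8d6d, type-carrying allocation helpers that return a typed pointer, can be imitated in user space. The macros below are hypothetical analogues built on `calloc()`, not the kernel's `kmalloc_obj()` family: they take no GFP flags, accept only a type (not a variable), and the flex-array variant does no overflow checking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical user-space analogues; each yields a typed, zeroed pointer. */
#define toy_alloc_obj(TYPE)      ((TYPE *)calloc(1, sizeof(TYPE)))
#define toy_alloc_objs(TYPE, N)  ((TYPE *)calloc((N), sizeof(TYPE)))
/* Struct header plus N trailing flexible-array elements (no overflow check). */
#define toy_alloc_flex(TYPE, FAM, N) \
	((TYPE *)calloc(1, offsetof(TYPE, FAM) + (N) * sizeof(((TYPE *)0)->FAM[0])))

/* Example type with a flexible array member. */
struct toy_msg {
	size_t len;
	int data[];
};

/* Allocates a message with room for 'n' ints; caller frees with free(). */
static struct toy_msg *toy_msg_new(size_t n)
{
	struct toy_msg *m = toy_alloc_flex(struct toy_msg, data, n);

	if (m)
		m->len = n;
	return m;
}
```

Because each macro casts to `TYPE *`, assigning the result to a pointer of the wrong type draws a compiler diagnostic, which a bare `void *` return would not.
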
Paolo Bonzini	1b13885edf	Merge tag 'kvm-x86-apic-6.20' of https://github.com/kvm-x86/linux into HEAD KVM x86 APIC-ish changes for 6.20 - Fix a benign bug where KVM could use the wrong memslots (ignored SMM) when creating a vCPU-specific mapping of guest memory. - Clean up KVM's handling of marking mapped vCPU pages dirty. - Drop a pile of ancient sanity checks hidden behind KVM's unused ASSERT() macro, most of which could be trivially triggered by the guest and/or user, and all of which were useless. - Fold "struct dest_map" into its sole user, "struct rtc_status", to make it more obvious what the weird parameter is used for, and to allow burying the RTC shenanigans behind CONFIG_KVM_IOAPIC=y. - Bury all of ioapic.h and KVM_IRQCHIP_KERNEL behind CONFIG_KVM_IOAPIC=y. - Add a regression test for recent APICv update fixes. - Rework KVM's handling of VMCS updates while L2 is active to temporarily switch to vmcs01 instead of deferring the update until the next nested VM-Exit. The deferred updates approach directly contributed to several bugs, was proving to be a maintenance burden due to the difficulty in auditing the correctness of deferred updates, and was polluting "struct nested_vmx" with a growing pile of booleans. - Handle "hardware APIC ISR", a.k.a. SVI, updates in kvm_apic_update_apicv() to consolidate the updates, and to co-locate SVI updates with the updates for KVM's own cache of ISR information. - Drop a dead function declaration.	2026-02-11 12:45:32 -05:00
Khushit Shah	6517dfbcc9	KVM: x86: Add x2APIC "features" to control EOI broadcast suppression Add two flags for KVM_CAP_X2APIC_API to allow userspace to control support for Suppress EOI Broadcasts when using a split IRQCHIP (I/O APIC emulated by userspace), which KVM completely mishandles. When x2APIC support was first added, KVM incorrectly advertised and "enabled" Suppress EOI Broadcast, without fully supporting the I/O APIC side of the equation, i.e. without adding directed EOI to KVM's in-kernel I/O APIC. That flaw was carried over to split IRQCHIP support, i.e. KVM advertised support for Suppress EOI Broadcasts irrespective of whether or not the userspace I/O APIC implementation supported directed EOIs. Even worse, KVM didn't actually suppress EOI broadcasts, i.e. userspace VMMs without support for directed EOI came to rely on the "spurious" broadcasts. KVM "fixed" the in-kernel I/O APIC implementation by completely disabling support for Suppress EOI Broadcasts in commit `0bcc3fb95b` ("KVM: lapic: stop advertising DIRECTED_EOI when in-kernel IOAPIC is in use"), but didn't do anything to remedy userspace I/O APIC implementations. KVM's bogus handling of Suppress EOI Broadcast is problematic when the guest relies on interrupts being masked in the I/O APIC until well after the initial local APIC EOI. E.g. Windows with Credential Guard enabled handles interrupts in the following order: 1. Interrupt for L2 arrives. 2. L1 APIC EOIs the interrupt. 3. L1 resumes L2 and injects the interrupt. 4. L2 EOIs after servicing. 5. L1 performs the I/O APIC EOI. Because KVM EOIs the I/O APIC at step #2, the guest can get an interrupt storm, e.g. if the IRQ line is still asserted and userspace reacts to the EOI by re-injecting the IRQ, because the guest doesn't de-assert the line until step #4, and doesn't expect the interrupt to be re-enabled until step #5. 
Unfortunately, simply "fixing" the bug isn't an option, as KVM has no way of knowing if the userspace I/O APIC supports directed EOIs, i.e. suppressing EOI broadcasts would result in interrupts being stuck masked in the userspace I/O APIC due to step #5 being ignored by userspace. And fully disabling support for Suppress EOI Broadcast is also undesirable, as picking up the fix would require a guest reboot, and more importantly would change the virtual CPU model exposed to the guest without any buy-in from userspace. Add KVM_X2APIC_ENABLE_SUPPRESS_EOI_BROADCAST and KVM_X2APIC_DISABLE_SUPPRESS_EOI_BROADCAST flags to allow userspace to explicitly enable or disable support for Suppress EOI Broadcasts. This gives userspace control over the virtual CPU model exposed to the guest, as KVM should never have enabled support for Suppress EOI Broadcast without userspace opt-in. Not setting either flag will result in legacy quirky behavior for backward compatibility. Disallow fully enabling SUPPRESS_EOI_BROADCAST when using an in-kernel I/O APIC, as KVM's history/support is just as tragic. E.g. it's not clear that commit `c806a6ad35` ("KVM: x86: call irq notifiers with directed EOI") was entirely correct, i.e. it may have simply papered over the lack of Directed EOI emulation in the I/O APIC. Note, Suppress EOI Broadcasts is defined only in Intel's SDM, not in AMD's APM. But the bit is writable on some AMD CPUs, e.g. Turin, and KVM's ABI is to support Directed EOI (KVM's name) irrespective of guest CPU vendor. 
Fixes: `7543a635aa` ("KVM: x86: Add KVM exit for IOAPIC EOIs") Closes: https://lore.kernel.org/kvm/7D497EF1-607D-4D37-98E7-DAF95F099342@nutanix.com Cc: stable@vger.kernel.org Suggested-by: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Khushit Shah <khushit.shah@nutanix.com> Link: https://patch.msgid.link/20260123125657.3384063-1-khushit.shah@nutanix.com [sean: clean up minor formatting goofs and fix a comment typo] Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-30 13:28:35 -08:00
Sean Christopherson	c4a365cd4a	KVM: x86: Drop WARN on INIT/SIPI being blocked when vCPU is in Wait-For-SIPI Drop the sanity check in kvm_apic_accept_events() that attempts to detect KVM bugs by asserting that a vCPU isn't in Wait-For-SIPI if INIT/SIPI are blocked, because if INIT is blocked, then it should be impossible for a vCPU to get into WFS in the first place. Unfortunately, syzbot is smarter than KVM (and its maintainers), and circumvented the guards put in place by commit `0fe3e8d804` ("KVM: x86: Move INIT_RECEIVED vs. INIT/SIPI blocked check to KVM_RUN") by swapping the order and stuffing VMXON after INIT, and then triggering kvm_apic_accept_events() by way of KVM_GET_MP_STATE. Simply drop the WARN as it hasn't detected any meaningful KVM bugs in years (if ever?), and preventing userspace from clobbering guest state is generally a non-goal. More importantly, fully closing the hole would likely require enforcing some amount of ordering in KVM's ioctls, which is a much bigger risk than simply deleting the WARN. Reported-by: syzbot+59f2c3a3fc4f6c09b8cd@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/6925da1b.a70a0220.d98e3.00b0.GAE@google.com Link: https://patch.msgid.link/20260123022816.2283567-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-23 09:11:16 -08:00
Sean Christopherson	000d75b0b1	KVM: x86: Update APICv ISR (a.k.a. SVI) as part of kvm_apic_update_apicv() Fold the calls to .hwapic_isr_update() in kvm_apic_set_state(), kvm_lapic_reset(), and __kvm_vcpu_update_apicv() into kvm_apic_update_apicv(), as updating SVI is directly related to updating KVM's own cache of ISR information, e.g. SVI is more or less the APICv equivalent of highest_isr_cache. Note, calling .hwapic_isr_update() during kvm_apic_update_apicv() has benign side effects, as doing so changes the order of the calls in kvm_lapic_reset() and kvm_apic_set_state(), specifically with respect to the order between .hwapic_isr_update() and .apicv_post_state_restore(). However, the changes in ordering are glorified nops as the former hook is VMX-only and the latter is SVM-only. Reviewed-by: Chao Gao <chao.gao@intel.com> Link: https://patch.msgid.link/20260109034532.1012993-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-13 17:35:32 -08:00
Sean Christopherson	59c3e0603d	KVM: x86: Bury ioapic.h definitions behind CONFIG_KVM_IOAPIC Now that almost everything in ioapic.h is used only by code guarded by CONFIG_KVM_IOAPIC=y, bury (almost) the entire thing behind the Kconfig. Link: https://patch.msgid.link/20251206004311.479939-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-12 09:31:41 -08:00
Sean Christopherson	5cd6b1a6ee	KVM: x86: Fold "struct dest_map" into "struct rtc_status" Drop "struct dest_map" and fold its members into its one and only user, "struct rtc_status". Tracking "pending" EOIs and associated vCPUs is very much a hack for legacy RTC behavior, and should never be needed for other IRQ delivery. In addition to making it more obvious why KVM tracks target vCPUs, this will allow burying the "struct rtc_status" definition behind CONFIG_KVM_IOAPIC=y, which in turn will make it even harder for KVM to misuse the structure. No functional change intended. Link: https://patch.msgid.link/20251206004311.479939-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-12 09:31:41 -08:00
Sean Christopherson	1a5d7f9540	KVM: x86: Add a wrapper to handle common case of IRQ delivery without dest_map Turn kvm_irq_delivery_to_apic() into a wrapper that passes NULL for the @dest_map param, as only the ugly I/O APIC RTC hackery needs to know which vCPUs received the IRQ. No functional change intended. Link: https://patch.msgid.link/20251206004311.479939-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-12 09:31:40 -08:00
Sean Christopherson	37187992dd	KVM: x86: Drop guest/user-triggerable asserts on IRR/ISR vectors Remove the ASSERT()s in apic_find_highest_i{r,s}r() that exist to detect illegal vectors (0-15 are reserved and never recognized by the local APIC), as the asserts, if they were ever to be enabled by #defining DEBUG, can be trivially triggered from both the guest and from userspace, and ultimately because the ASSERT()s are useless. In large part due to lack of emulation for the Error Status Register and its "delayed" read semantics, KVM doesn't filter out bad IRQs (IPIs or otherwise) when IRQs are sent or received. Instead, probably by dumb luck on KVM's part, KVM effectively ignores pending illegal vectors in the IRR due to vectors 0-15 having priority '0', and thus never being higher priority than PPR. As for ISR, a misbehaving userspace could stuff illegal vector bits, but again the end result is mostly benign (aside from userspace likely breaking the VM), as processing illegal vectors "works" and doesn't cause functional problems. Regardless of the safety and correctness of KVM's illegal vector handling, one thing is for certain: the ASSERT()s have done absolutely nothing to help detect such issues since they were added 18+ years ago by commit `97222cc831` ("KVM: Emulate local APIC in kernel"). For all intents and purposes, no functional change intended. Link: https://patch.msgid.link/20251206004311.479939-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-12 09:31:37 -08:00
Sean Christopherson	a4978324e4	KVM: x86: Drop ASSERT()s on APIC/vCPU being non-NULL Remove ASSERT()s on vCPU and APIC structures being non-NULL in the local APIC code as the DEBUG=1 path of ASSERT() ends with BUG(), i.e. isn't meaningfully better for debugging than a NULL pointer dereference. For all intents and purposes, no functional change intended. Link: https://patch.msgid.link/20251206004311.479939-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-12 09:31:37 -08:00
Sean Christopherson	a091fe60c2	KVM: x86: Grab lapic_timer in a local variable to cleanup periodic code Stash apic->lapic_timer in a local "ktimer" variable in advance_periodic_target_expiration() to eliminate a few unaligned wraps, and to make the code easier to read overall. No functional change intended. Link: https://patch.msgid.link/20251113205114.1647493-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-11-17 07:50:23 -08:00
fuqiang wang	18ab3fc8e8	KVM: x86: Fix VM hard lockup after prolonged inactivity with periodic HV timer When advancing the target expiration for the guest's APIC timer in periodic mode, set the expiration to "now" if the target expiration is in the past (similar to what is done in update_target_expiration()). Blindly adding the period to the previous target expiration can result in KVM generating a practically unbounded number of hrtimer IRQs due to programming an expired timer over and over. In extreme scenarios, e.g. if userspace pauses/suspends a VM for an extended duration, this can even cause hard lockups in the host. Currently, the bug only affects Intel CPUs when using the hypervisor timer (HV timer), a.k.a. the VMX preemption timer. Unlike the software timer, a.k.a. hrtimer, which KVM keeps running even on exits to userspace, the HV timer only runs while the guest is active. As a result, if the vCPU does not run for an extended duration, there will be a huge gap between the target expiration and the current time the vCPU resumes running. Because the target expiration is incremented by only one period on each timer expiration, this leads to a series of timer expirations occurring rapidly after the vCPU/VM resumes. More critically, when the vCPU first triggers a periodic HV timer expiration after resuming, advancing the expiration by only one period will result in a target expiration in the past. As a result, the delta may be calculated as a negative value. When the delta is converted into an absolute value (tscdeadline is an unsigned u64), the resulting value can overflow what the HV timer is capable of programming. I.e. the large value will exceed the VMX Preemption Timer's maximum bit width of cpu_preemption_timer_multi + 32, and thus cause KVM to switch from the HV timer to the software timer (hrtimers). 
After switching to the software timer, periodic timer expiration callbacks may be executed consecutively within a single clock interrupt handler, because hrtimers honors KVM's request for an expiration in the past and immediately re-invokes KVM's callback after reprogramming. And because the interrupt handler runs with IRQs disabled, restarting KVM's hrtimer over and over until the target expiration is advanced to "now" can result in a hard lockup. E.g. the following hard lockup was triggered in the host when running a Windows VM (only relevant because it used the APIC timer in periodic mode) after resuming the VM from a long suspend (in the host). NMI watchdog: Watchdog detected hard LOCKUP on cpu 45 ... RIP: 0010:advance_periodic_target_expiration+0x4d/0x80 [kvm] ... RSP: 0018:ff4f88f5d98d8ef0 EFLAGS: 00000046 RAX: fff0103f91be678e RBX: fff0103f91be678e RCX: 00843a7d9e127bcc RDX: 0000000000000002 RSI: 0052ca4003697505 RDI: ff440d5bfbdbd500 RBP: ff440d5956f99200 R08: ff2ff2a42deb6a84 R09: 000000000002a6c0 R10: 0122d794016332b3 R11: 0000000000000000 R12: ff440db1af39cfc0 R13: ff440db1af39cfc0 R14: ffffffffc0d4a560 R15: ff440db1af39d0f8 FS: 00007f04a6ffd700(0000) GS:ff440db1af380000(0000) knlGS:000000e38a3b8000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000d5651feff8 CR3: 000000684e038002 CR4: 0000000000773ee0 PKRU: 55555554 Call Trace: <IRQ> apic_timer_fn+0x31/0x50 [kvm] __hrtimer_run_queues+0x100/0x280 hrtimer_interrupt+0x100/0x210 ? ttwu_do_wakeup+0x19/0x160 smp_apic_timer_interrupt+0x6a/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> Moreover, if the suspend duration of the virtual machine is not long enough to trigger a hard lockup in this scenario, since commit `98c25ead5e` ("KVM: VMX: Move preemption timer <=> hrtimer dance to common x86"), KVM will continue using the software timer until the guest reprograms the APIC timer in some way. 
Since the periodic timer does not require frequent APIC timer register programming, the guest may continue to use the software timer in perpetuity. Fixes: `d8f2f498d9` ("x86/kvm: fix LAPIC timer drift when guest uses periodic mode") Cc: stable@vger.kernel.org Signed-off-by: fuqiang wang <fuqiang.wng@gmail.com> [sean: massage comments and changelog] Link: https://patch.msgid.link/20251113205114.1647493-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-11-17 07:50:22 -08:00
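
The core of the fix in 18ab3fc8e8 is a clamp: after adding one period, a target expiration that is still in the past is forced to "now". The sketch below is a simplified model under stated assumptions, not KVM's code; `struct toy_timer`, `toy_advance_periodic()`, the `clamp` switch, and plain nanosecond integers in place of `ktime_t` are all hypothetical. It counts how many back-to-back expirations are needed before the target catches up with the current time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the periodic part of the emulated APIC timer. */
struct toy_timer {
	uint64_t target_expiration_ns;
	uint64_t period_ns;
};

/* Advance by one period; with 'clamp', never leave the target in the past. */
static void toy_advance_periodic(struct toy_timer *t, uint64_t now_ns, bool clamp)
{
	assert(t->period_ns != 0); /* a zero period would never make progress */
	t->target_expiration_ns += t->period_ns;
	if (clamp && t->target_expiration_ns < now_ns)
		t->target_expiration_ns = now_ns;
}

/* Number of consecutive expirations fired until the target reaches 'now'. */
static uint64_t toy_expirations_to_catch_up(struct toy_timer t, uint64_t now_ns,
					    bool clamp)
{
	uint64_t fired = 0;

	while (t.target_expiration_ns < now_ns) {
		toy_advance_periodic(&t, now_ns, clamp);
		fired++;
	}
	return fired;
}
```

With a 100 ns period and a gap of roughly one millisecond, the unclamped variant fires thousands of times in a row, while the clamped variant fires once. This is the behavior the changelog ties to the hard lockup when the callbacks run with IRQs disabled.
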
fuqiang wang	9633f180ce	KVM: x86: Explicitly set new periodic hrtimer expiration in apic_timer_fn() When restarting an hrtimer to emulate the guest's APIC timer in periodic mode, explicitly set the expiration using the target expiration computed by advance_periodic_target_expiration() instead of adding the period to the existing timer. This will allow making adjustments to the expiration, e.g. to deal with expirations far in the past, without having to implement the same logic in both advance_periodic_target_expiration() and apic_timer_fn(). Cc: stable@vger.kernel.org Signed-off-by: fuqiang wang <fuqiang.wng@gmail.com> [sean: split to separate patch, write changelog] Link: https://patch.msgid.link/20251113205114.1647493-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-11-17 07:50:21 -08:00
Sean Christopherson	0ea9494be9	KVM: x86: WARN if hrtimer callback for periodic APIC timer fires with period=0 WARN and don't restart the hrtimer if KVM's callback runs with the guest's APIC timer in periodic mode but with a period of '0', as not advancing the hrtimer's deadline would put the CPU into an infinite loop of hrtimer events. Observing a period of '0' should be impossible, even when the hrtimer is running on a different CPU than the vCPU, as KVM is supposed to cancel the hrtimer before changing (or zeroing) the period, e.g. when switching from periodic to one-shot. Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20251113205114.1647493-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-11-17 07:50:21 -08:00
Sean Christopherson	6b36119b94	KVM: x86: Export KVM-internal symbols for sub-modules only Rework almost all of KVM x86's exports to expose symbols only to KVM's vendor modules, i.e. to kvm-{amd,intel}.ko. Keep the generic exports that are guarded by CONFIG_KVM_EXTERNAL_WRITE_TRACKING=y, as they're explicitly designed/intended for external usage. Link: https://lore.kernel.org/r/20250919003303.1355064-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2025-09-30 13:40:02 -04:00
Sean Christopherson	d273b52b6f	KVM: x86: Move kvm_intr_is_single_vcpu() to lapic.c Move kvm_intr_is_single_vcpu() to lapic.c, drop its export, and make its "fast" helper local to lapic.c. kvm_intr_is_single_vcpu() is only usable if the local APIC is in-kernel, i.e. it most definitely belongs in the local APIC code. No functional change intended. Fixes: `cf04ec393e` ("KVM: x86: Dedup AVIC vs. PI code for identifying target vCPU") Link: https://lore.kernel.org/r/20250919003303.1355064-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2025-09-30 13:40:02 -04:00
Paolo Bonzini	d05ca6b793	Merge tag 'kvm-x86-misc-6.18' of https://github.com/kvm-x86/linux into HEAD KVM x86 changes for 6.18 - Don't (re)check L1 intercepts when completing userspace I/O to fix a flaw where a misbehaving userspace (a.k.a. syzkaller) could swizzle L1's intercepts and trigger a variety of WARNs in KVM. - Emulate PERF_CNTR_GLOBAL_STATUS_SET for PerfMonV2 guests, as the MSR is supposed to exist for v2 PMUs. - Allow Centaur CPU leaves (base 0xC000_0000) for Zhaoxin CPUs. - Clean up KVM's vector hashing code for delivering lowest priority IRQs. - Clean up the fastpath handler code to only handle IPIs and WRMSRs that are actually "fast", as opposed to handling those that KVM _hopes_ are fast, and in the process of doing so add fastpath support for TSC_DEADLINE writes on AMD CPUs. - Clean up a pile of PMU code in anticipation of adding support for mediated vPMUs. - Add support for the immediate forms of RDMSR and WRMSRNS, sans full emulator support (KVM should never need to emulate the MSRs outside of forced emulation and other contrived testing scenarios). - Clean up the MSR APIs in preparation for CET and FRED virtualization, as well as mediated vPMU support. - Reject a fully in-kernel IRQCHIP if EOIs are protected, i.e. for TDX VMs, as KVM can't faithfully emulate an I/O APIC for such guests. - KVM_REQ_MSR_FILTER_CHANGED into a generic RECALC_INTERCEPTS in preparation for mediated vPMU support, as KVM will need to recalculate MSR intercepts in response to PMU refreshes for guests with mediated vPMUs. - Misc cleanups and minor fixes.	2025-09-30 13:36:41 -04:00
Liao Yuanhong	4319fa120f	KVM: x86: Use guard() instead of mutex_lock() to simplify code Use guard(mutex) instead of mutex_lock/mutex_unlock pair to simplify the error handling when allocating the APIC access page. No functional change intended. Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com> Link: https://lore.kernel.org/r/20250901131822.647802-1-liaoyuanhong@vivo.com [sean: add blank line to isolate guard(), tweak changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-09-16 12:55:10 -07:00
Sean Christopherson	aac057dd62	KVM: x86: Move vector_hashing into lapic.c Move the vector_hashing module param into lapic.c now that all usage is contained within the local APIC emulation code. Opportunistically drop the accessor and append "_enabled" to the variable to help capture that it's a boolean module param. No functional change intended. Link: https://lore.kernel.org/r/20250821214209.3463350-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-09-10 12:05:13 -07:00
Sean Christopherson	73473f31a4	KVM: x86: Make "lowest priority" helpers local to lapic.c Make various helpers for resolving lowest priority IRQs local to lapic.c now that kvm_irq_delivery_to_apic() lives in lapic.c as well. No functional change intended. Link: https://lore.kernel.org/r/20250821214209.3463350-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-09-10 12:05:12 -07:00
Sean Christopherson	cbf5d94574	KVM: x86: Move kvm_irq_delivery_to_apic() from irq.c to lapic.c Move kvm_irq_delivery_to_apic() to lapic.c as it is specific to local APIC emulation. This will allow burying more local APIC code in lapic.c, e.g. the various "lowest priority" helpers. No functional change intended. Link: https://lore.kernel.org/r/20250821214209.3463350-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-09-10 12:05:09 -07:00
Sean Christopherson	0a94b20424	KVM: x86: Unconditionally handle MSR_IA32_TSC_DEADLINE in fastpath exits Drop the fastpath VM-Exit requirement that KVM can use the hypervisor timer to emulate the APIC timer in TSC deadline mode. I.e. unconditionally handle MSR_IA32_TSC_DEADLINE WRMSRs in the fastpath. Restricting the fastpath to maybe using the VMX preemption timer is ineffective and unnecessary. If the requested deadline can't be programmed into the VMX preemption timer, KVM will fall back to hrtimers, i.e. the restriction is ineffective as far as preventing any kind of worst case scenario. But guarding against a worst case scenario is completely unnecessary as the "slow" path, start_sw_tscdeadline() => hrtimer_start(), explicitly disables IRQs. In fact, the worst case scenario is when KVM thinks it can use the VMX preemption timer, as KVM will eat the overhead of calling into vmx_set_hv_timer() and falling back to hrtimers. Opportunistically limit kvm_can_use_hv_timer() to lapic.c as the fastpath code was the only external user. Stating the obvious, this allows handling MSR_IA32_TSC_DEADLINE writes in the fastpath on AMD CPUs. Link: https://lore.kernel.org/r/20250805190526.1453366-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-08-19 11:59:34 -07:00
Sean Christopherson	7774143400	KVM: x86: Only allow "fast" IPIs in fastpath WRMSR(X2APIC_ICR) handler Explicitly restrict fastpath ICR writes to IPIs that are "fast", i.e. can be delivered without having to walk all vCPUs, and that target at most 16 vCPUs. Artificially restricting ICR writes to physical mode guarantees at most one vCPU will receive an IPI (because x2APIC IDs are read-only), but that delivery might not be "fast". E.g. even if the vCPU exists, KVM might have to iterate over 4096 vCPUs to find the right one. Limiting delivery to fast IPIs aligns the WRMSR fastpath with kvm_arch_set_irq_inatomic() (which also runs with IRQs disabled), and will allow dropping the semi-arbitrary restrictions on delivery mode and type. Link: https://lore.kernel.org/r/20250805190526.1453366-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-08-19 11:59:32 -07:00
Sean Christopherson	15daa58e78	KVM: x86: Add kvm_icr_to_lapic_irq() helper to allow for fastpath IPIs Extract the code for converting an ICR message into a kvm_lapic_irq structure into a local helper so that a fast-only IPI path can share the conversion logic. No functional change intended. Link: https://lore.kernel.org/r/20250805190526.1453366-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-08-19 11:59:32 -07:00
Yury Norov	cc63f918a2	kvm: x86: simplify kvm_vector_to_index() Use find_nth_bit() and make the function almost a one-liner. Signed-off-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-08-19 11:59:31 -07:00
Thijs Raymakers	c87bd4dd43	KVM: x86: use array_index_nospec with indices that come from guest min and dest_id are guest-controlled indices. Using array_index_nospec() after the bounds checks clamps these values to mitigate speculative execution side-channels. Signed-off-by: Thijs Raymakers <thijs@raymakers.nl> Cc: stable@vger.kernel.org Cc: Sean Christopherson <seanjc@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Fixes: `715062970f` ("KVM: X86: Implement PV sched yield hypercall") Fixes: `bdf7ffc899` ("KVM: LAPIC: Fix pv ipis out-of-bounds access") Fixes: `4180bf1b65` ("KVM: X86: Implement "send IPI" hypercall") Link: https://lore.kernel.org/r/20250804064405.4802-1-thijs@raymakers.nl Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-08-15 11:33:21 -07:00
Paolo Bonzini	89400f0687	Merge tag 'kvm-x86-apic-6.17' of https://github.com/kvm-x86/linux into HEAD KVM local APIC changes for 6.17 Extract many of KVM's helpers for accessing architectural local APIC state to common x86 so that they can be shared by guest-side code for Secure AVIC.	2025-07-29 08:36:44 -04:00
Neeraj Upadhyay	17776e6c20	x86/apic: KVM: Move apic_test_vector() to common code Move apic_test_vector() to apic.h in order to reuse it in the Secure AVIC guest APIC driver in later patches to test vector state in the APIC backing page. No functional change intended. Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250709033242.267892-14-Neeraj.Upadhyay@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-10 09:44:43 -07:00
Neeraj Upadhyay	3d3a9083da	x86/apic: KVM: Move lapic get/set helpers to common code Move the apic_get_reg(), apic_set_reg(), apic_get_reg64() and apic_set_reg64() helper functions to apic.h in order to reuse them in the Secure AVIC guest APIC driver in later patches to read/write registers from/to the APIC backing page. No functional change intended. Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250709033242.267892-12-Neeraj.Upadhyay@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-10 09:44:42 -07:00
Neeraj Upadhyay	39e81633f6	x86/apic: KVM: Move apic_find_highest_vector() to a common header In preparation for using apic_find_highest_vector() in Secure AVIC guest APIC driver, move it and associated macros to apic.h. No functional change intended. Acked-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Link: https://lore.kernel.org/r/20250709033242.267892-11-Neeraj.Upadhyay@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-10 09:44:41 -07:00
Neeraj Upadhyay	b5f8980f29	KVM: x86: Rename lapic set/clear vector helpers In preparation for moving kvm-internal kvm_lapic_set_vector(), kvm_lapic_clear_vector() to apic.h for use in Secure AVIC APIC driver, rename them as part of the APIC API. No functional change intended. Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250709033242.267892-10-Neeraj.Upadhyay@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-10 09:44:41 -07:00
Neeraj Upadhyay	9c23bc4fec	KVM: x86: Rename lapic get/set_reg64() helpers In preparation for moving kvm-internal __kvm_lapic_set_reg64(), __kvm_lapic_get_reg64() to apic.h for use in Secure AVIC APIC driver, rename them as part of the APIC API. No functional change intended. Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250709033242.267892-9-Neeraj.Upadhyay@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-10 09:44:40 -07:00
Neeraj Upadhyay	b9bd231913	KVM: x86: Rename lapic get/set_reg() helpers In preparation for moving kvm-internal __kvm_lapic_set_reg(), __kvm_lapic_get_reg() to apic.h for use in Secure AVIC APIC driver, rename them as part of the APIC API. No functional change intended. Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250709033242.267892-8-Neeraj.Upadhyay@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-10 09:44:39 -07:00
Neeraj Upadhyay	bdaccfe4e5	KVM: x86: Rename find_highest_vector() In preparation for moving kvm-internal find_highest_vector() to apic.h for use in Secure AVIC APIC driver, rename find_highest_vector() to apic_find_highest_vector() as part of the APIC API. No functional change intended. Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250709033242.267892-7-Neeraj.Upadhyay@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-10 09:44:39 -07:00
Neeraj Upadhyay	e2fa7905b2	KVM: x86: Change lapic regs base address to void pointer Change APIC base address from "char *" to "void *" in KVM lapic's set/get helper functions. Pointer arithmetic for "void *" and "char *" operate identically. With "void *" there is less of a chance of doing the wrong thing, e.g. neglecting to cast and reading a byte instead of the desired APIC register size. No functional change intended. Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250709033242.267892-6-Neeraj.Upadhyay@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-10 09:44:38 -07:00
Neeraj Upadhyay	9cbb5fd156	KVM: x86: Rename VEC_POS/REG_POS macro usages In preparation for moving most of the KVM's lapic helpers which use VEC_POS/REG_POS macros to common APIC header for use in Secure AVIC APIC driver, rename all VEC_POS/REG_POS macro usages to APIC_VECTOR_TO_BIT_NUMBER/APIC_VECTOR_TO_REG_OFFSET and remove VEC_POS/REG_POS. While at it, clean up line wrap in find_highest_vector(). No functional change intended. Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250709033242.267892-5-Neeraj.Upadhyay@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-10 09:44:37 -07:00
Neeraj Upadhyay	3fb7b83e2a	KVM: x86: Remove redundant parentheses around 'bitmap' When doing pointer arithmetic in apic_test_vector() and kvm_lapic_{set|clear}_vector(), remove the unnecessary parentheses surrounding the 'bitmap' parameter. No functional change intended. Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Link: https://lore.kernel.org/r/20250709033242.267892-3-Neeraj.Upadhyay@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-10 09:44:36 -07:00
Neeraj Upadhyay	ac48017020	KVM: x86: Open code setting/clearing of bits in the ISR Remove __apic_test_and_set_vector() and __apic_test_and_clear_vector(), because the _only_ register that's safe to modify with a non-atomic operation is ISR, because KVM isn't running the vCPU, i.e. hardware can't service an IRQ or process an EOI for the relevant (virtual) APIC. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250709033242.267892-2-Neeraj.Upadhyay@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-10 09:44:36 -07:00
Sean Christopherson	628a27731e	KVM: x86: Add CONFIG_KVM_IOAPIC to allow disabling in-kernel I/O APIC Add a Kconfig to allow building KVM without support for emulating an I/O APIC, PIC, and PIT, which is desirable for deployments that effectively don't support a fully in-kernel IRQ chip, i.e. never expect any VMM to create an in-kernel I/O APIC. E.g. compiling out support eliminates a few thousand lines of guest-facing code and gives security folks warm fuzzies. As a bonus, wrapping relevant paths with CONFIG_KVM_IOAPIC #ifdefs makes it much easier for readers to understand which bits and pieces exist specifically for fully in-kernel IRQ chips. Opportunistically convert all two in-kernel uses of __KVM_HAVE_IOAPIC to CONFIG_KVM_IOAPIC, e.g. rather than add a second #ifdef to generate a stub for kvm_arch_post_irq_routing_update(). Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250611213557.294358-15-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:52:50 -07:00
Paolo Bonzini	db44dcbdf8	Merge tag 'kvm-x86-pir-6.16' of https://github.com/kvm-x86/linux into HEAD KVM x86 posted interrupt changes for 6.16: Refine and optimize KVM's software processing of the PIR, and ultimately share PIR harvesting code between KVM and the kernel's Posted MSI handler	2025-05-27 12:15:01 -04:00
Sean Christopherson	edaf3eded3	x86/irq: KVM: Add helper for harvesting PIR to deduplicate KVM and posted MSIs Now that posted MSI and KVM harvesting of PIR is identical, extract the code (and posted MSI's wonderful comment) to a common helper. No functional change intended. Link: https://lore.kernel.org/r/20250401163447.846608-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-04-24 11:19:41 -07:00
Sean Christopherson	baf68a0e3b	KVM: VMX: Use arch_xchg() when processing PIR to avoid instrumentation Use arch_xchg() when moving IRQs from the PIR to the vIRR, purely to avoid instrumentation so that KVM is compatible with the needs of posted MSI. This will allow extracting the core PIR logic to common code and sharing it between KVM and posted MSI handling. Link: https://lore.kernel.org/r/20250401163447.846608-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-04-24 11:19:40 -07:00
Sean Christopherson	b41f8638b9	KVM: VMX: Isolate pure loads from atomic XCHG when processing PIR Rework KVM's processing of the PIR to use the same algorithm as posted MSIs, i.e. to do READ(x4) => XCHG(x4) instead of (READ+XCHG)(x4). Given KVM's long-standing, sub-optimal use of 32-bit accesses to the PIR, it's safe to say far more thought and investigation was put into handling the PIR for posted MSIs, i.e. there's no reason to assume KVM's existing logic is meaningful, let alone superior. Matching the processing done by posted MSIs will also allow deduplicating the code between KVM and posted MSIs. See the comment for handle_pending_pir() added by commit `1b03d82ba1` ("x86/irq: Install posted MSI notification handler") for details on why isolating loads from XCHG is desirable. Suggested-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250401163447.846608-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-04-24 11:19:40 -07:00
Sean Christopherson	06b4d0ea22	KVM: VMX: Process PIR using 64-bit accesses on 64-bit kernels Process the PIR at the natural kernel width, i.e. in 64-bit chunks on 64-bit kernels, so that the worst case of having a posted IRQ in each chunk of the vIRR only requires 4 loads and xchgs from/to the PIR, not 8. Deliberately use a "continue" to skip empty entries so that the code is a carbon copy of handle_pending_pir(), in anticipation of deduplicating KVM and posted MSI logic. Suggested-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250401163447.846608-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-04-24 11:19:39 -07:00
Sean Christopherson	f1459315f4	x86/irq: KVM: Track PIR bitmap as an "unsigned long" array Track the PIR bitmap in posted interrupt descriptor structures as an array of unsigned longs instead of using unionized arrays for KVM (u32s) versus IRQ management (u64s). In practice, because the non-KVM usage is (sanely) restricted to 64-bit kernels, all existing usage of the u64 variant is already working with unsigned longs. Using "unsigned long" for the array will allow reworking KVM's processing of the bitmap to read/write in 64-bit chunks on 64-bit kernels, i.e. will allow optimizing KVM by reducing the number of atomic accesses to PIR. Opportunistically replace the open coded literals in the posted MSIs code with the appropriate macro. Deliberately don't use ARRAY_SIZE() in the for-loops, even though it would be cleaner from a certain perspective, in anticipation of decoupling the processing from the array declaration. No functional change intended. Link: https://lore.kernel.org/r/20250401163447.846608-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-04-24 11:19:38 -07:00
Sean Christopherson	6433fc01f9	KVM: VMX: Ensure vIRR isn't reloaded at odd times when sync'ing PIR Read each vIRR exactly once when shuffling IRQs from the PIR to the vAPIC to ensure getting the highest priority IRQ from the chunk doesn't reload from the vIRR. In practice, a reload is functionally benign as vcpu->mutex is held and so IRQs can't be consumed, i.e. new IRQs can appear, but existing IRQs can't disappear. Link: https://lore.kernel.org/r/20250401163447.846608-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-04-24 11:19:38 -07:00


617 Commits