Rename kvm_apic_update_irr()'s "irr_updated" and vmx_sync_pir_to_irr()'s
"got_posted_interrupt" to a more accurate "max_irr_is_from_pir", as neither
"irr_updated" nor "got_posted_interrupt" is accurate.
__kvm_apic_update_irr() and thus kvm_apic_update_irr() specifically return
true if and only if the highest priority IRQ, i.e. max_irr, is a "new"
pending IRQ from the PIR. I.e. it's possible for the IRR to be updated,
i.e. for a posted IRQ to be "got", *without* the APIs returning true.
Expand vmx_sync_pir_to_irr()'s comment to explain why it's necessary to
set KVM_REQ_EVENT only if a "new" IRQ was found, and to explain why it's
safe to do so only if a new IRQ is also the highest priority pending IRQ.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://patch.msgid.link/20260503201703.108231-3-pbonzini@redhat.com/
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fall back to apic_find_highest_vector() when PID.ON is set but PIR
turns out to be empty, to correctly report the highest pending interrupt
from the existing IRR.
In a nested VM stress test, the following WARNING fires in
vmx_check_nested_events() when kvm_cpu_has_interrupt() reports a pending
interrupt but the subsequent kvm_apic_has_interrupt() (which invokes
vmx_sync_pir_to_irr() again) returns -1:
WARNING: CPU: 99 PID: 57767 at arch/x86/kvm/vmx/nested.c:4449 vmx_check_nested_events+0x6bf/0x6e0 [kvm_intel]
Call Trace:
kvm_check_and_inject_events
vcpu_enter_guest.constprop.0
vcpu_run
kvm_arch_vcpu_ioctl_run
kvm_vcpu_ioctl
__x64_sys_ioctl
do_syscall_64
entry_SYSCALL_64_after_hwframe
The root cause is a race between vmx_sync_pir_to_irr() on the target vCPU
and __vmx_deliver_posted_interrupt() on a sender vCPU. The sender
performs two individually-atomic operations that are not a single
transaction:
1. pi_test_and_set_pir(vector) -- sets the PIR bit
2. pi_test_and_set_on() -- sets PID.ON
The following interleaving triggers the bug:
Sender vCPU (IPI): Target vCPU (1st sync_pir_to_irr):
B1: set PIR[vector]
A1: pi_clear_on()
A2: pi_harvest_pir() -> sees B1 bit
A3: xchg() -> consumes bit, PIR=0
(1st sync returns correct max_irr)
B2: set PID.ON = 1
Target vCPU (2nd sync_pir_to_irr):
C1: pi_test_on() -> TRUE (from B2)
C2: pi_clear_on() -> ON=0
C3: pi_harvest_pir() -> PIR empty
C4: *max_irr = -1, early return
IRR NOT SCANNED
The interrupt is not lost (it resides in the IRR from the first sync and
is recovered on the next vcpu_enter_guest() iteration), but the incorrect
max_irr causes a spurious WARNING and a wasted L2 VM-Enter/VM-Exit cycle.
Fixes: b41f8638b9 ("KVM: VMX: Isolate pure loads from atomic XCHG when processing PIR")
Reported-by: Farrah Chen <farrah.chen@intel.com>
Analyzed-by: Chenyi Qiang <chenyi.qiang@intel.com>
Cc: stable@vger.kernel.org
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/kvm/20260428070349.1633238-1-chenyi.qiang@intel.com/T/
Link: https://patch.msgid.link/20260503201703.108231-2-pbonzini@redhat.com/
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The __pv_send_ipi() function iterates over up to BITS_PER_LONG vCPUs
starting from the APIC ID specified in its 'min' argument, which is
provided by the guest.
Commit c87bd4dd43 used array_index_nospec() to clamp the value of 'min'
but then the for_each_set_bit() loop dereferences higher indices without
further protection. Theoretically, a guest can trigger speculative access
to up to BITS_PER_LONG elements off the end of the phys_map[] array.
(In practice it would probably need aggressive loop unrolling by the
compiler to go more than one element off the end, and even that seems
unlikely, but the theoretical possibility exists.)
Move the array_index_nospec() inside the loop to protect the [map + i]
index which is actually being used each time.
Fixes: c87bd4dd43 ("KVM: x86: use array_index_nospec with indices that come from guest")
Fixes: bdf7ffc899 ("KVM: LAPIC: Fix pv ipis out-of-bounds access")
Fixes: 4180bf1b65 ("KVM: X86: Implement "send IPI" hypercall")
Signed-off-by: Anel Orazgaliyeva <anelkz@amazon.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/9d50fc3ca9e8e58f551d015f95d51a3c29ce6ccc.camel@infradead.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
kvm_apic_write_nodecode() dereferences vcpu->arch.apic without first
checking whether the in-kernel LAPIC has been initialized. If it has
not (e.g. the vCPU was created without an in-kernel LAPIC), the
dereference results in a NULL pointer access.
While APIC-write VM-Exits are not expected to occur on a vCPU without
an in-kernel LAPIC, kvm_apic_write_nodecode() should be robust against
such a scenario as a defense-in-depth measure, e.g. to guard against
KVM bugs or CPU errata that could generate a spurious APIC-write
VM-Exit.
Use KVM_BUG_ON() with lapic_in_kernel() instead of a simple
WARN_ON_ONCE(), as suggested by Sean Christopherson, so that KVM
kills the VM outright rather than letting it continue in a broken
state.
Found by a VMCS-targeted fuzzer based on syzkaller.
Signed-off-by: xuanqingshi <1356292400@qq.com>
Link: https://patch.msgid.link/tencent_7A9F1B4D75468C0CF5DE1B6902038C948B07@qq.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
KVM x86 APIC-ish changes for 6.20
- Fix a benign bug where KVM could use the wrong memslots (ignored SMM) when
creating a vCPU-specific mapping of guest memory.
- Clean up KVM's handling of marking mapped vCPU pages dirty.
- Drop a pile of *ancient* sanity checks hidden behind in KVM's unused
ASSERT() macro, most of which could be trivially triggered by the guest
and/or user, and all of which were useless.
- Fold "struct dest_map" into its sole user, "struct rtc_status", to make it
more obvious what the weird parameter is used for, and to allow burying the
RTC shenanigans behind CONFIG_KVM_IOAPIC=y.
- Bury all of ioapic.h and KVM_IRQCHIP_KERNEL behind CONFIG_KVM_IOAPIC=y.
- Add a regression test for recent APICv update fixes.
- Rework KVM's handling of VMCS updates while L2 is active to temporarily
switch to vmcs01 instead of deferring the update until the next nested
VM-Exit. The deferred updates approach directly contributed to several
bugs, was proving to be a maintenance burden due to the difficulty in
auditing the correctness of deferred updates, and was polluting
"struct nested_vmx" with a growing pile of booleans.
- Handle "hardware APIC ISR", a.k.a. SVI, updates in kvm_apic_update_apicv()
to consolidate the updates, and to co-locate SVI updates with the updates
for KVM's own cache of ISR information.
- Drop a dead function declaration.
Add two flags for KVM_CAP_X2APIC_API to allow userspace to control support
for Suppress EOI Broadcasts when using a split IRQCHIP (I/O APIC emulated
by userspace), which KVM completely mishandles. When x2APIC support was
first added, KVM incorrectly advertised and "enabled" Suppress EOI
Broadcast, without fully supporting the I/O APIC side of the equation,
i.e. without adding directed EOI to KVM's in-kernel I/O APIC.
That flaw was carried over to split IRQCHIP support, i.e. KVM advertised
support for Suppress EOI Broadcasts irrespective of whether or not the
userspace I/O APIC implementation supported directed EOIs. Even worse,
KVM didn't actually suppress EOI broadcasts, i.e. userspace VMMs without
support for directed EOI came to rely on the "spurious" broadcasts.
KVM "fixed" the in-kernel I/O APIC implementation by completely disabling
support for Suppress EOI Broadcasts in commit 0bcc3fb95b ("KVM: lapic:
stop advertising DIRECTED_EOI when in-kernel IOAPIC is in use"), but
didn't do anything to remedy userspace I/O APIC implementations.
KVM's bogus handling of Suppress EOI Broadcast is problematic when the
guest relies on interrupts being masked in the I/O APIC until well after
the initial local APIC EOI. E.g. Windows with Credential Guard enabled
handles interrupts in the following order:
1. Interrupt for L2 arrives.
2. L1 APIC EOIs the interrupt.
3. L1 resumes L2 and injects the interrupt.
4. L2 EOIs after servicing.
5. L1 performs the I/O APIC EOI.
Because KVM EOIs the I/O APIC at step #2, the guest can get an interrupt
storm, e.g. if the IRQ line is still asserted and userspace reacts to the
EOI by re-injecting the IRQ, because the guest doesn't de-assert the line
until step #4, and doesn't expect the interrupt to be re-enabled until
step #5.
Unfortunately, simply "fixing" the bug isn't an option, as KVM has no way
of knowing if the userspace I/O APIC supports directed EOIs, i.e.
suppressing EOI broadcasts would result in interrupts being stuck masked
in the userspace I/O APIC due to step #5 being ignored by userspace. And
fully disabling support for Suppress EOI Broadcast is also undesirable, as
picking up the fix would require a guest reboot, *and* more importantly
would change the virtual CPU model exposed to the guest without any buy-in
from userspace.
Add KVM_X2APIC_ENABLE_SUPPRESS_EOI_BROADCAST and
KVM_X2APIC_DISABLE_SUPPRESS_EOI_BROADCAST flags to allow userspace to
explicitly enable or disable support for Suppress EOI Broadcasts. This
gives userspace control over the virtual CPU model exposed to the guest,
as KVM should never have enabled support for Suppress EOI Broadcast without
userspace opt-in. Not setting either flag will result in legacy quirky
behavior for backward compatibility.
Disallow fully enabling SUPPRESS_EOI_BROADCAST when using an in-kernel
I/O APIC, as KVM's history/support is just as tragic. E.g. it's not clear
that commit c806a6ad35 ("KVM: x86: call irq notifiers with directed EOI")
was entirely correct, i.e. it may have simply papered over the lack of
Directed EOI emulation in the I/O APIC.
Note, Suppress EOI Broadcasts is defined only in Intel's SDM, not in AMD's
APM. But the bit is writable on some AMD CPUs, e.g. Turin, and KVM's ABI
is to support Directed EOI (KVM's name) irrespective of guest CPU vendor.
Fixes: 7543a635aa ("KVM: x86: Add KVM exit for IOAPIC EOIs")
Closes: https://lore.kernel.org/kvm/7D497EF1-607D-4D37-98E7-DAF95F099342@nutanix.com
Cc: stable@vger.kernel.org
Suggested-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Khushit Shah <khushit.shah@nutanix.com>
Link: https://patch.msgid.link/20260123125657.3384063-1-khushit.shah@nutanix.com
[sean: clean up minor formatting goofs and fix a comment typo]
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Drop the sanity check in kvm_apic_accept_events() that attempts to detect
KVM bugs by asserting that a vCPU isn't in Wait-For-SIPI if INIT/SIPI are
blocked, because if INIT is blocked, then it should be impossible for a
vCPU to get into WFS in the first place. Unfortunately, syzbot is smarter
than KVM (and its maintainers), and circumvented the guards put in place
by commit 0fe3e8d804 ("KVM: x86: Move INIT_RECEIVED vs. INIT/SIPI blocked
check to KVM_RUN") by swapping the order and stuffing VMXON after INIT, and
then triggering kvm_apic_accept_events() by way of KVM_GET_MP_STATE.
Simply drop the WARN as it hasn't detected any meaningful KVM bugs in
years (if ever?), and preventing userspace from clobbering guest state is
generally a non-goal. More importantly, fully closing the hole would
likely require enforcing some amount of ordering in KVM's ioctls, which is
a much bigger risk than simply deleting the WARN.
Reported-by: syzbot+59f2c3a3fc4f6c09b8cd@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6925da1b.a70a0220.d98e3.00b0.GAE@google.com
Link: https://patch.msgid.link/20260123022816.2283567-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Fold the calls to .hwapic_isr_update() in kvm_apic_set_state(),
kvm_lapic_reset(), and __kvm_vcpu_update_apicv() into
kvm_apic_update_apicv(), as updating SVI is directly related to updating
KVM's own cache of ISR information, e.g. SVI is more or less the APICv
equivalent of highest_isr_cache.
Note, calling .hwapic_isr_update() during kvm_apic_update_apicv() has
benign side effects, as doing so changes the orders of the calls in
kvm_lapic_reset() and kvm_apic_set_state(), specifically with respect to
to the order between .hwapic_isr_update() and .apicv_post_state_restore().
However, the changes in ordering are glorified nops as the former hook is
VMX-only and the latter is SVM-only.
Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://patch.msgid.link/20260109034532.1012993-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Now that almost everything in ioapic.h is used only by code guarded by
CONFIG_KVM_IOAPIC=y, bury (almost) the entire thing behind the Kconfig.
Link: https://patch.msgid.link/20251206004311.479939-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Drop "struct dest_map" and fold its members into its one and only user,
"struct rtc_status". Tracking "pending" EOIs and associated vCPUs is very
much a hack for legacy RTC behavior, and should never be needed for other
IRQ delivery. In addition to making it more obvious why KVM tracks target
vCPUs, this will allow burying the "struct rtc_status" definition behind
CONFIG_KVM_IOAPIC=y, which in turn will make it even harder for KVM to
misuse the structure.
No functional change intended.
Link: https://patch.msgid.link/20251206004311.479939-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Turn kvm_irq_delivery_to_apic() into a wrapper that passes NULL for the
@dest_map param, as only the ugly I/O APIC RTC hackery needs to know which
vCPUs received the IRQ.
No functional change intended.
Link: https://patch.msgid.link/20251206004311.479939-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Remove the ASSERT()s in apic_find_highest_i{r,s}r() that exist to detect
illegal vectors (0-15 are reserved and never recognized by the local APIC),
as the asserts, if they were ever to be enabled by #defining DEBUG, can be
trivially triggered from both the guest and from userspace, and ultimately
because the ASSERT()s are useless.
In large part due to lack of emulation for the Error Status Register and
its "delayed" read semantics, KVM doesn't filter out bad IRQs (IPIs or
otherwise) when IRQs are sent or received. Instead, probably by dumb
luck on KVM's part, KVM effectively ignores pending illegal vectors in
the IRR due vector 0-15 having priority '0', and thus never being higher
priority than PPR.
As for ISR, a misbehaving userspace could stuff illegal vector bits, but
again the end result is mostly benign (aside from userspace likely
breaking the VM), as processing illegal vectors "works" and doesn't cause
functional problems.
Regardless of the safety and correctness of KVM's illegal vector handling,
one thing is for certain: the ASSERT()s have done absolutely nothing to
help detect such issues since they were added 18+ years ago by commit
97222cc831 ("KVM: Emulate local APIC in kernel").
For all intents and purposes, no functional change intended.
Link: https://patch.msgid.link/20251206004311.479939-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Remove ASSERT()s on vCPU and APIC structures being non-NULL in the local
APIC code as the DEBUG=1 path of ASSERT() ends with BUG(), i.e. isn't
meaningfully better for debugging than a NULL pointer dereference.
For all intents and purposes, no functional change intended.
Link: https://patch.msgid.link/20251206004311.479939-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Stash apic->lapic_timer in a local "ktimer" variable in
advance_periodic_target_expiration() to eliminate a few unaligned wraps,
and to make the code easier to read overall.
No functional change intended.
Link: https://patch.msgid.link/20251113205114.1647493-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
When advancing the target expiration for the guest's APIC timer in periodic
mode, set the expiration to "now" if the target expiration is in the past
(similar to what is done in update_target_expiration()). Blindly adding
the period to the previous target expiration can result in KVM generating
a practically unbounded number of hrtimer IRQs due to programming an
expired timer over and over. In extreme scenarios, e.g. if userspace
pauses/suspends a VM for an extended duration, this can even cause hard
lockups in the host.
Currently, the bug only affects Intel CPUs when using the hypervisor timer
(HV timer), a.k.a. the VMX preemption timer. Unlike the software timer,
a.k.a. hrtimer, which KVM keeps running even on exits to userspace, the
HV timer only runs while the guest is active. As a result, if the vCPU
does not run for an extended duration, there will be a huge gap between
the target expiration and the current time the vCPU resumes running.
Because the target expiration is incremented by only one period on each
timer expiration, this leads to a series of timer expirations occurring
rapidly after the vCPU/VM resumes.
More critically, when the vCPU first triggers a periodic HV timer
expiration after resuming, advancing the expiration by only one period
will result in a target expiration in the past. As a result, the delta
may be calculated as a negative value. When the delta is converted into
an absolute value (tscdeadline is an unsigned u64), the resulting value
can overflow what the HV timer is capable of programming. I.e. the large
value will exceed the VMX Preemption Timer's maximum bit width of
cpu_preemption_timer_multi + 32, and thus cause KVM to switch from the
HV timer to the software timer (hrtimers).
After switching to the software timer, periodic timer expiration callbacks
may be executed consecutively within a single clock interrupt handler,
because hrtimers honors KVM's request for an expiration in the past and
immediately re-invokes KVM's callback after reprogramming. And because
the interrupt handler runs with IRQs disabled, restarting KVM's hrtimer
over and over until the target expiration is advanced to "now" can result
in a hard lockup.
E.g. the following hard lockup was triggered in the host when running a
Windows VM (only relevant because it used the APIC timer in periodic mode)
after resuming the VM from a long suspend (in the host).
NMI watchdog: Watchdog detected hard LOCKUP on cpu 45
...
RIP: 0010:advance_periodic_target_expiration+0x4d/0x80 [kvm]
...
RSP: 0018:ff4f88f5d98d8ef0 EFLAGS: 00000046
RAX: fff0103f91be678e RBX: fff0103f91be678e RCX: 00843a7d9e127bcc
RDX: 0000000000000002 RSI: 0052ca4003697505 RDI: ff440d5bfbdbd500
RBP: ff440d5956f99200 R08: ff2ff2a42deb6a84 R09: 000000000002a6c0
R10: 0122d794016332b3 R11: 0000000000000000 R12: ff440db1af39cfc0
R13: ff440db1af39cfc0 R14: ffffffffc0d4a560 R15: ff440db1af39d0f8
FS: 00007f04a6ffd700(0000) GS:ff440db1af380000(0000) knlGS:000000e38a3b8000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000d5651feff8 CR3: 000000684e038002 CR4: 0000000000773ee0
PKRU: 55555554
Call Trace:
<IRQ>
apic_timer_fn+0x31/0x50 [kvm]
__hrtimer_run_queues+0x100/0x280
hrtimer_interrupt+0x100/0x210
? ttwu_do_wakeup+0x19/0x160
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
Moreover, if the suspend duration of the virtual machine is not long enough
to trigger a hard lockup in this scenario, since commit 98c25ead5e
("KVM: VMX: Move preemption timer <=> hrtimer dance to common x86"), KVM
will continue using the software timer until the guest reprograms the APIC
timer in some way. Since the periodic timer does not require frequent APIC
timer register programming, the guest may continue to use the software
timer in perpetuity.
Fixes: d8f2f498d9 ("x86/kvm: fix LAPIC timer drift when guest uses periodic mode")
Cc: stable@vger.kernel.org
Signed-off-by: fuqiang wang <fuqiang.wng@gmail.com>
[sean: massage comments and changelog]
Link: https://patch.msgid.link/20251113205114.1647493-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
When restarting an hrtimer to emulate a the guest's APIC timer in periodic
mode, explicitly set the expiration using the target expiration computed
by advance_periodic_target_expiration() instead of adding the period to
the existing timer. This will allow making adjustments to the expiration,
e.g. to deal with expirations far in the past, without having to implement
the same logic in both advance_periodic_target_expiration() and
apic_timer_fn().
Cc: stable@vger.kernel.org
Signed-off-by: fuqiang wang <fuqiang.wng@gmail.com>
[sean: split to separate patch, write changelog]
Link: https://patch.msgid.link/20251113205114.1647493-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
WARN and don't restart the hrtimer if KVM's callback runs with the guest's
APIC timer in periodic mode but with a period of '0', as not advancing the
hrtimer's deadline would put the CPU into an infinite loop of hrtimer
events. Observing a period of '0' should be impossible, even when the
hrtimer is running on a different CPU than the vCPU, as KVM is supposed to
cancel the hrtimer before changing (or zeroing) the period, e.g. when
switching from periodic to one-shot.
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20251113205114.1647493-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Rework almost all of KVM x86's exports to expose symbols only to KVM's
vendor modules, i.e. to kvm-{amd,intel}.ko. Keep the generic exports that
are guarded by CONFIG_KVM_EXTERNAL_WRITE_TRACKING=y, as they're explicitly
designed/intended for external usage.
Link: https://lore.kernel.org/r/20250919003303.1355064-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move kvm_intr_is_single_vcpu() to lapic.c, drop its export, and make its
"fast" helper local to lapic.c. kvm_intr_is_single_vcpu() is only usable
if the local APIC is in-kernel, i.e. it most definitely belongs in the
local APIC code.
No functional change intended.
Fixes: cf04ec393e ("KVM: x86: Dedup AVIC vs. PI code for identifying target vCPU")
Link: https://lore.kernel.org/r/20250919003303.1355064-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Don't (re)check L1 intercepts when completing userspace I/O to fix a flaw
where a misbehaving usersepace (a.k.a. syzkaller) could swizzle L1's
intercepts and trigger a variety of WARNs in KVM.
- Emulate PERF_CNTR_GLOBAL_STATUS_SET for PerfMonV2 guests, as the MSR is
supposed to exist for v2 PMUs.
- Allow Centaur CPU leaves (base 0xC000_0000) for Zhaoxin CPUs.
- Clean up KVM's vector hashing code for delivering lowest priority IRQs.
- Clean up the fastpath handler code to only handle IPIs and WRMSRs that are
actually "fast", as opposed to handling those that KVM _hopes_ are fast, and
in the process of doing so add fastpath support for TSC_DEADLINE writes on
AMD CPUs.
- Clean up a pile of PMU code in anticipation of adding support for mediated
vPMUs.
- Add support for the immediate forms of RDMSR and WRMSRNS, sans full
emulator support (KVM should never need to emulate the MSRs outside of
forced emulation and other contrived testing scenarios).
- Clean up the MSR APIs in preparation for CET and FRED virtualization, as
well as mediated vPMU support.
- Rejecting a fully in-kernel IRQCHIP if EOIs are protected, i.e. for TDX VMs,
as KVM can't faithfully emulate an I/O APIC for such guests.
- KVM_REQ_MSR_FILTER_CHANGED into a generic RECALC_INTERCEPTS in preparation
for mediated vPMU support, as KVM will need to recalculate MSR intercepts in
response to PMU refreshes for guests with mediated vPMUs.
- Misc cleanups and minor fixes.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmjXIr0ACgkQOlYIJqCj
N/1bbhAAxHzxN7IcizgAYf1BZWMjRU4zJgwlkoGuBeH/IgUOODPjs93L9kyrzvVL
tcFgIe9o5fZRGmUfyZbCKnJaQi/4u/2QPRSGhsYt7vyDjCoXzO5CJPMYIqDz5Z2r
qg+GNMlLtWI8EbcDd4qT22SWC8GufoXFEQnX6PUNhasOHeKit5ye8wmttcG+zvYV
KeIkPluddQkQ2JKyG53IFNmm1lkY05oAibv61hkxqUSwCIJKsQFuDjl4GVouAd/H
eu0+pzNmzPUTQ/qJzr2cNL5Nqz08DGp2OCFFRO6bgXaWkvHnFG3EAEHlhTAUh92t
LPJxmhb6R8SUc+z8rYTgyF/zVpgeJcJO7F44FrXa7r2iV58ds3TfuO53hVaEfyNp
1GUMH0m8N2vfjtFyUVP1KwZHuFxiGKLd1wZ1h0yKpj1Eg1FjR2cEontqwH44tHn2
ENq8MIbWIBhvCsz5fIbM4y591JSevJUrDlYu60Lz7VyXHAw8Cq92t/dN9O7oH5mJ
pIyoracU1g0Q6bbATZYsOGhkCTYLtdelZaBb5AYIgQ+U4C1TA4GpgEBUSVH8HXDy
kXzVqSFlL0v5rrFkBPjiNFb5WD3iLjJIM3DLGoNegOM8+79r/USGHUY+XU3z/kCH
rV8JBlTnLBCrNOHEiHJUI2kwBQ9C9/l88X/VwvRUNv7SthuExSo=
=9IB0
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-misc-6.18' of https://github.com/kvm-x86/linux into HEAD
KVM x86 changes for 6.18
- Don't (re)check L1 intercepts when completing userspace I/O to fix a flaw
where a misbehaving usersepace (a.k.a. syzkaller) could swizzle L1's
intercepts and trigger a variety of WARNs in KVM.
- Emulate PERF_CNTR_GLOBAL_STATUS_SET for PerfMonV2 guests, as the MSR is
supposed to exist for v2 PMUs.
- Allow Centaur CPU leaves (base 0xC000_0000) for Zhaoxin CPUs.
- Clean up KVM's vector hashing code for delivering lowest priority IRQs.
- Clean up the fastpath handler code to only handle IPIs and WRMSRs that are
actually "fast", as opposed to handling those that KVM _hopes_ are fast, and
in the process of doing so add fastpath support for TSC_DEADLINE writes on
AMD CPUs.
- Clean up a pile of PMU code in anticipation of adding support for mediated
vPMUs.
- Add support for the immediate forms of RDMSR and WRMSRNS, sans full
emulator support (KVM should never need to emulate the MSRs outside of
forced emulation and other contrived testing scenarios).
- Clean up the MSR APIs in preparation for CET and FRED virtualization, as
well as mediated vPMU support.
- Rejecting a fully in-kernel IRQCHIP if EOIs are protected, i.e. for TDX VMs,
as KVM can't faithfully emulate an I/O APIC for such guests.
- KVM_REQ_MSR_FILTER_CHANGED into a generic RECALC_INTERCEPTS in preparation
for mediated vPMU support, as KVM will need to recalculate MSR intercepts in
response to PMU refreshes for guests with mediated vPMUs.
- Misc cleanups and minor fixes.
Use guard(mutex) instead of mutex_lock/mutex_unlock pair to simplify the
error handling when allocating the APIC access page.
No functional change intended.
Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com>
Link: https://lore.kernel.org/r/20250901131822.647802-1-liaoyuanhong@vivo.com
[sean: add blank link to isolate guard(), tweak changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Move the vector_hashing module param into lapic.c now that all usage is
contained within the local APIC emulation code.
Opportunistically drop the accessor and append "_enabled" to the variable
to help capture that it's a boolean module param.
No functional change intended.
Link: https://lore.kernel.org/r/20250821214209.3463350-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Make various helpers for resolving lowest priority IRQs local to lapic.c
now that kvm_irq_delivery_to_apic() lives in lapic.c as well.
No functional change intended.
Link: https://lore.kernel.org/r/20250821214209.3463350-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Move kvm_irq_delivery_to_apic() to lapic.c as it is specific to local APIC
emulation. This will allow burying more local APIC code in lapic.c, e.g.
the various "lowest priority" helpers.
No functional change intended.
Link: https://lore.kernel.org/r/20250821214209.3463350-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Drop the fastpath VM-Exit requirement that KVM can use the hypervisor
timer to emulate the APIC timer in TSC deadline mode. I.e. unconditionally
handle MSR_IA32_TSC_DEADLINE WRMSRs in the fastpath. Restricting the
fastpath to *maybe* using the VMX preemption timer is ineffective and
unnecessary.
If the requested deadline can't be programmed into the VMX preemption
timer, KVM will fall back to hrtimers, i.e. the restriction is ineffective
as far as preventing any kind of worst case scenario.
But guarding against a worst case scenario is completely unnecessary as
the "slow" path, start_sw_tscdeadline() => hrtimer_start(), explicitly
disables IRQs. In fact, the worst case scenario is when KVM thinks it
can use the VMX preemption timer, as KVM will eat the overhead of calling
into vmx_set_hv_timer() and falling back to hrtimers.
Opportunistically limit kvm_can_use_hv_timer() to lapic.c as the fastpath
code was the only external user.
Stating the obvious, this allows handling MSR_IA32_TSC_DEADLINE writes in
the fastpath on AMD CPUs.
Link: https://lore.kernel.org/r/20250805190526.1453366-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Explicitly restrict fastpath ICR writes to IPIs that are "fast", i.e. can
be delivered without having to walk all vCPUs, and that target at most 16
vCPUs. Artificially restricting ICR writes to physical mode guarantees
at most one vCPU will receive in IPI (because x2APIC IDs are read-only),
but that delivery might not be "fast". E.g. even if the vCPU exists, KVM
might have to iterate over 4096 vCPUs to find the right one.
Limiting delivery to fast IPIs aligns the WRMSR fastpath with
kvm_arch_set_irq_inatomic() (which also runs with IRQs disabled), and will
allow dropping the semi-arbitrary restrictions on delivery mode and type.
Link: https://lore.kernel.org/r/20250805190526.1453366-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Extract the code for converting an ICR message into a kvm_lapic_irq
structure into a local helper so that a fast-only IPI path can share the
conversion logic.
No functional change intended.
Link: https://lore.kernel.org/r/20250805190526.1453366-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Use find_nth_bit() and make the function almost a one-liner.
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
KVM local APIC changes for 6.17
Extract many of KVM's helpers for accessing architectural local APIC state
to common x86 so that they can be shared by guest-side code for Secure AVIC.
Move apic_test_vector() to apic.h in order to reuse it in the Secure AVIC
guest APIC driver in later patches to test vector state in the APIC
backing page.
No functional change intended.
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-14-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Move the apic_get_reg(), apic_set_reg(), apic_get_reg64() and
apic_set_reg64() helper functions to apic.h in order to reuse them in the
Secure AVIC guest APIC driver in later patches to read/write registers
from/to the APIC backing page.
No functional change intended.
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-12-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
In preparation for using apic_find_highest_vector() in Secure AVIC
guest APIC driver, move it and associated macros to apic.h.
No functional change intended.
Acked-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Link: https://lore.kernel.org/r/20250709033242.267892-11-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
In preparation for moving kvm-internal kvm_lapic_set_vector(),
kvm_lapic_clear_vector() to apic.h for use in Secure AVIC APIC driver,
rename them as part of the APIC API.
No functional change intended.
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-10-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
In preparation for moving kvm-internal __kvm_lapic_set_reg64(),
__kvm_lapic_get_reg64() to apic.h for use in Secure AVIC APIC driver,
rename them as part of the APIC API.
No functional change intended.
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-9-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
In preparation for moving kvm-internal __kvm_lapic_set_reg(),
__kvm_lapic_get_reg() to apic.h for use in Secure AVIC APIC driver,
rename them as part of the APIC API.
No functional change intended.
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-8-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
In preparation for moving kvm-internal find_highest_vector() to
apic.h for use in Secure AVIC APIC driver, rename find_highest_vector()
to apic_find_highest_vector() as part of the APIC API.
No functional change intended.
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-7-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Change APIC base address from "char *" to "void *" in KVM
lapic's set/get helper functions. Pointer arithmetic for "void *"
and "char *" operate identically. With "void *" there is less
of a chance of doing the wrong thing, e.g. neglecting to cast and
reading a byte instead of the desired APIC register size.
No functional change intended.
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-6-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
In preparation for moving most of the KVM's lapic helpers which
use VEC_POS/REG_POS macros to common APIC header for use in Secure
AVIC APIC driver, rename all VEC_POS/REG_POS macro usages to
APIC_VECTOR_TO_BIT_NUMBER/APIC_VECTOR_TO_REG_OFFSET and remove
VEC_POS/REG_POS.
While at it, clean up line wrap in find_highest_vector().
No functional change intended.
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-5-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
When doing pointer arithmetic in apic_test_vector() and
kvm_lapic_{set|clear}_vector(), remove the unnecessary
parentheses surrounding the 'bitmap' parameter.
No functional change intended.
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Link: https://lore.kernel.org/r/20250709033242.267892-3-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Remove __apic_test_and_set_vector() and __apic_test_and_clear_vector(),
because the _only_ register that's safe to modify with a non-atomic
operation is ISR, because KVM isn't running the vCPU, i.e. hardware can't
service an IRQ or process an EOI for the relevant (virtual) APIC.
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-2-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add a Kconfig to allow building KVM without support for emulating a I/O
APIC, PIC, and PIT, which is desirable for deployments that effectively
don't support a fully in-kernel IRQ chip, i.e. never expect any VMM to
create an in-kernel I/O APIC. E.g. compiling out support eliminates a few
thousand lines of guest-facing code and gives security folks warm fuzzies.
As a bonus, wrapping relevant paths with CONFIG_KVM_IOAPIC #ifdefs makes
it much easier for readers to understand which bits and pieces exist
specifically for fully in-kernel IRQ chips.
Opportunistically convert all two in-kernel uses of __KVM_HAVE_IOAPIC to
CONFIG_KVM_IOAPIC, e.g. rather than add a second #ifdef to generate a stub
for kvm_arch_post_irq_routing_update().
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Refine and optimize KVM's software processing of the PIR, and ultimately share
PIR harvesting code between KVM and the kernel's Posted MSI handler
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmgwmWcACgkQOlYIJqCj
N/3mUw/9HN4OLRqFytu+GjEocl8I7JelJdwCsNMsUwZRnNVnYGDqsjvw8rzqeFmx
RoQ8uNqMd1PqZOgAdN6suLES949ItErbnG2+UlBvZeNgR63K8fyNJaPUzSXh0Kyd
vNNzGschI0txZXNEtMHcIsCuQknU/arlE6v+HOAokb1jxaIZH2h06vrBAj6pLAHO
hbcZPkaQEaFoQhqCbYm015ecJQRPv3IZoW7H1cK5nC4q6QdNo3LPfGqUJwgHV3Wq
hbfS+2J78nTqLhSn7HHE/y5z3R5+ZyPwFQwbqfvjjap5/DW5w8Tltg2Oif597lf2
klBukBkJyfzSdhjaPKb3V23kCNabNyyX7KUDZnW5HCiEu62Lnl0MexXCvFvSvtmy
YDSsXMg3KdtlESwUOaxGjd2J81tx36L3ZvWRaopDLzA2A6KVyVQCSANGOGkKrRzq
Qq3R/frzp1uUVpVDtdyDIO1AujoXkRecdOj1uAIr2XQBg8jx0kveAUyrkXFbQVjK
oNbfRlOiu6/vnXkWqwZ2w/Q0kRRrK7M+vensOZlculqDqxPH+BLWB+dfPqjGikb/
cL01KPu6n/GQJpwAxIbGU4eUIQPAVOcHm3iRaIlRqEoDCs7C8fTRIyDx+cD1vW8O
O9j/r05EV/Ck5XF2ks6bHIK+C3wemNrCvoeFbnO1uicqtdO+Tqw=
=dU1G
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-pir-6.16' of https://github.com/kvm-x86/linux into HEAD
KVM x86 posted interrupt changes for 6.16:
Refine and optimize KVM's software processing of the PIR, and ultimately share
PIR harvesting code between KVM and the kernel's Posted MSI handler
Now that posted MSI and KVM harvesting of PIR is identical, extract the
code (and posted MSI's wonderful comment) to a common helper.
No functional change intended.
Link: https://lore.kernel.org/r/20250401163447.846608-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Use arch_xchg() when moving IRQs from the PIR to the vIRR, purely to avoid
instrumentation so that KVM is compatible with the needs of posted MSI.
This will allow extracting the core PIR logic to common code and sharing
it between KVM and posted MSI handling.
Link: https://lore.kernel.org/r/20250401163447.846608-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Rework KVM's processing of the PIR to use the same algorithm as posted
MSIs, i.e. to do READ(x4) => XCHG(x4) instead of (READ+XCHG)(x4). Given
KVM's long-standing, sub-optimal use of 32-bit accesses to the PIR, it's
safe to say far more thought and investigation was put into handling the
PIR for posted MSIs, i.e. there's no reason to assume KVM's existing
logic is meaningful, let alone superior.
Matching the processing done by posted MSIs will also allow deduplicating
the code between KVM and posted MSIs.
See the comment for handle_pending_pir() added by commit 1b03d82ba1
("x86/irq: Install posted MSI notification handler") for details on
why isolating loads from XCHG is desirable.
Suggested-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250401163447.846608-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Process the PIR at the natural kernel width, i.e. in 64-bit chunks on
64-bit kernels, so that the worst case of having a posted IRQ in each
chunk of the vIRR only requires 4 loads and xchgs from/to the PIR, not 8.
Deliberately use a "continue" to skip empty entries so that the code is a
carbon copy of handle_pending_pir(), in anticipation of deduplicating KVM
and posted MSI logic.
Suggested-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250401163447.846608-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Track the PIR bitmap in posted interrupt descriptor structures as an array
of unsigned longs instead of using unionized arrays for KVM (u32s) versus
IRQ management (u64s). In practice, because the non-KVM usage is (sanely)
restricted to 64-bit kernels, all existing usage of the u64 variant is
already working with unsigned longs.
Using "unsigned long" for the array will allow reworking KVM's processing
of the bitmap to read/write in 64-bit chunks on 64-bit kernels, i.e. will
allow optimizing KVM by reducing the number of atomic accesses to PIR.
Opportunstically replace the open coded literals in the posted MSIs code
with the appropriate macro. Deliberately don't use ARRAY_SIZE() in the
for-loops, even though it would be cleaner from a certain perspective, in
anticipation of decoupling the processing from the array declaration.
No functional change intended.
Link: https://lore.kernel.org/r/20250401163447.846608-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Read each vIRR exactly once when shuffling IRQs from the PIR to the vAPIC
to ensure getting the highest priority IRQ from the chunk doesn't reload
from the vIRR. In practice, a reload is functionally benign as vcpu->mutex
is held and so IRQs can be consumed, i.e. new IRQs can appear, but existing
IRQs can't disappear.
Link: https://lore.kernel.org/r/20250401163447.846608-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>