Commit Graph

98 Commits

Author SHA1 Message Date
Vincent Donnefort
0a90fbc8a1 KVM: arm64: Add event support to the nVHE/pKVM hyp and trace remote
Allow the creation of hypervisor and trace remote events with a single
macro HYP_EVENT(). That macro expands in the kernel side to add all
the required declarations (based on REMOTE_EVENT()) as well as in the
hypervisor side to create the trace_<event>() function.

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://patch.msgid.link/20260309162516.2623589-28-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-03-11 08:51:17 +00:00
Linus Torvalds
cb5573868e Loongarch:
- Add more CPUCFG mask bits.
 
 - Improve feature detection.
 
 - Add lazy load support for FPU and binary translation (LBT) register state.
 
 - Fix return value for memory reads from and writes to in-kernel devices.
 
 - Add support for detecting preemption from within a guest.
 
 - Add KVM steal time test case to tools/selftests.
 
 ARM:
 
 - Add support for FEAT_IDST, allowing ID registers that are not
   implemented to be reported as a normal trap rather than as an UNDEF
   exception.
 
 - Add sanitisation of the VTCR_EL2 register, fixing a number of
   UXN/PXN/XN bugs in the process.
 
 - Full handling of RESx bits, instead of only RES0, and resulting in
   SCTLR_EL2 being added to the list of sanitised registers.
 
 - More pKVM fixes for features that are not supposed to be exposed to
   guests.
 
 - Make sure that MTE being disabled on the pKVM host doesn't give it
   the ability to attack the hypervisor.
 
 - Allow pKVM's host stage-2 mappings to use the Force Write Back
   version of the memory attributes by using the "pass-through'
   encoding.
 
 - Fix trapping of ICC_DIR_EL1 on GICv5 hosts emulating GICv3 for the
   guest.
 
 - Preliminary work for guest GICv5 support.
 
 - A bunch of debugfs fixes, removing pointless custom iterators stored
   in guest data structures.
 
 - A small set of FPSIMD cleanups.
 
 - Selftest fixes addressing the incorrect alignment of page
   allocation.
 
 - Other assorted low-impact fixes and spelling fixes.
 
 RISC-V:
 
 - Fixes for issues discoverd by KVM API fuzzing in
   kvm_riscv_aia_imsic_has_attr(), kvm_riscv_aia_imsic_rw_attr(),
   and kvm_riscv_vcpu_aia_imsic_update()
 
 - Allow Zalasr, Zilsd and Zclsd extensions for Guest/VM
 
 - Transparent huge page support for hypervisor page tables
 
 - Adjust the number of available guest irq files based on MMIO
   register sizes found in the device tree or the ACPI tables
 
 - Add RISC-V specific paging modes to KVM selftests
 
 - Detect paging mode at runtime for selftests
 
 s390:
 
 - Performance improvement for vSIE (aka nested virtualization)
 
 - Completely new memory management.  s390 was a special snowflake that enlisted
   help from the architecture's page table management to build hypervisor
   page tables, in particular enabling sharing the last level of page
   tables.  This however was a lot of code (~3K lines) in order to support
   KVM, and also blocked several features.  The biggest advantages is
   that the page size of userspace is completely independent of the
   page size used by the guest: userspace can mix normal pages, THPs and
   hugetlbfs as it sees fit, and in fact transparent hugepages were not
   possible before.  It's also now possible to have nested guests and
   guests with huge pages running on the same host.
 
 - Maintainership change for s390 vfio-pci
 
 - Small quality of life improvement for protected guests
 
 x86:
 
 - Add support for giving the guest full ownership of PMU hardware (contexted
   switched around the fastpath run loop) and allowing direct access to data
   MSRs and PMCs (restricted by the vPMU model).  KVM still intercepts
   access to control registers, e.g. to enforce event filtering and to
   prevent the guest from profiling sensitive host state.  This is more
   accurate, since it has no risk of contention and thus dropped events, and
   also has significantly less overhead.
 
   For more information, see the commit message for merge commit bf2c3138ae
   ("Merge tag 'kvm-x86-pmu-6.20' of https://github.com/kvm-x86/linux into HEAD").
 
 - Disallow changing the virtual CPU model if L2 is active, for all the same
   reasons KVM disallows change the model after the first KVM_RUN.
 
 - Fix a bug where KVM would incorrectly reject host accesses to PV MSRs
   when running with KVM_CAP_ENFORCE_PV_FEATURE_CPUID enabled, even if those
   were advertised as supported to userspace,
 
 - Fix a bug with protected guest state (SEV-ES/SNP and TDX) VMs, where KVM
   would attempt to read CR3 configuring an async #PF entry.
 
 - Fail the build if EXPORT_SYMBOL_GPL or EXPORT_SYMBOL is used in KVM (for x86
   only) to enforce usage of EXPORT_SYMBOL_FOR_KVM_INTERNAL.  Only a few exports
   that are intended for external usage, and those are allowed explicitly.
 
 - When checking nested events after a vCPU is unblocked, ignore -EBUSY instead
   of WARNing.  Userspace can sometimes put the vCPU into what should be an
   impossible state, and spurious exit to userspace on -EBUSY does not really
   do anything to solve the issue.
 
 - Also throw in the towel and drop the WARN on INIT/SIPI being blocked when vCPU
   is in Wait-For-SIPI, which also resulted in playing whack-a-mole with syzkaller
   stuffing architecturally impossible states into KVM.
 
 - Add support for new Intel instructions that don't require anything beyond
   enumerating feature flags to userspace.
 
 - Grab SRCU when reading PDPTRs in KVM_GET_SREGS2.
 
 - Add WARNs to guard against modifying KVM's CPU caps outside of the intended
   setup flow, as nested VMX in particular is sensitive to unexpected changes
   in KVM's golden configuration.
 
 - Add a quirk to allow userspace to opt-in to actually suppress EOI broadcasts
   when the suppression feature is enabled by the guest (currently limited to
   split IRQCHIP, i.e. userspace I/O APIC).  Sadly, simply fixing KVM to honor
   Suppress EOI Broadcasts isn't an option as some userspaces have come to rely
   on KVM's buggy behavior (KVM advertises Supress EOI Broadcast irrespective
   of whether or not userspace I/O APIC supports Directed EOIs).
 
 - Clean up KVM's handling of marking mapped vCPU pages dirty.
 
 - Drop a pile of *ancient* sanity checks hidden behind in KVM's unused
   ASSERT() macro, most of which could be trivially triggered by the guest
   and/or user, and all of which were useless.
 
 - Fold "struct dest_map" into its sole user, "struct rtc_status", to make it
   more obvious what the weird parameter is used for, and to allow fropping
   these RTC shenanigans if CONFIG_KVM_IOAPIC=n.
 
 - Bury all of ioapic.h, i8254.h and related ioctls (including
   KVM_CREATE_IRQCHIP) behind CONFIG_KVM_IOAPIC=y.
 
 - Add a regression test for recent APICv update fixes.
 
 - Handle "hardware APIC ISR", a.k.a. SVI, updates in kvm_apic_update_apicv()
   to consolidate the updates, and to co-locate SVI updates with the updates
   for KVM's own cache of ISR information.
 
 - Drop a dead function declaration.
 
 - Minor cleanups.
 
 x86 (Intel):
 
 - Rework KVM's handling of VMCS updates while L2 is active to temporarily
   switch to vmcs01 instead of deferring the update until the next nested
   VM-Exit.  The deferred updates approach directly contributed to several
   bugs, was proving to be a maintenance burden due to the difficulty in
   auditing the correctness of deferred updates, and was polluting
   "struct nested_vmx" with a growing pile of booleans.
 
 - Fix an SGX bug where KVM would incorrectly try to handle EPCM page faults,
   and instead always reflect them into the guest.  Since KVM doesn't shadow
   EPCM entries, EPCM violations cannot be due to KVM interference and
   can't be resolved by KVM.
 
 - Fix a bug where KVM would register its posted interrupt wakeup handler even
   if loading kvm-intel.ko ultimately failed.
 
 - Disallow access to vmcb12 fields that aren't fully supported, mostly to
   avoid weirdness and complexity for FRED and other features, where KVM wants
   enable VMCS shadowing for fields that conditionally exist.
 
 - Print out the "bad" offsets and values if kvm-intel.ko refuses to load (or
   refuses to online a CPU) due to a VMCS config mismatch.
 
 x86 (AMD):
 
 - Drop a user-triggerable WARN on nested_svm_load_cr3() failure.
 
 - Add support for virtualizing ERAPS.  Note, correct virtualization of ERAPS
   relies on an upcoming, publicly announced change in the APM to reduce the
   set of conditions where hardware (i.e. KVM) *must* flush the RAP.
 
 - Ignore nSVM intercepts for instructions that are not supported according to
   L1's virtual CPU model.
 
 - Add support for expedited writes to the fast MMIO bus, a la VMX's fastpath
   for EPT Misconfig.
 
 - Don't set GIF when clearing EFER.SVME, as GIF exists independently of SVM,
   and allow userspace to restore nested state with GIF=0.
 
 - Treat exit_code as an unsigned 64-bit value through all of KVM.
 
 - Add support for fetching SNP certificates from userspace.
 
 - Fix a bug where KVM would use vmcb02 instead of vmcb01 when emulating VMLOAD
   or VMSAVE on behalf of L2.
 
 - Misc fixes and cleanups.
 
 x86 selftests:
 
 - Add a regression test for TPR<=>CR8 synchronization and IRQ masking.
 
 - Overhaul selftest's MMU infrastructure to genericize stage-2 MMU support,
   and extend x86's infrastructure to support EPT and NPT (for L2 guests).
 
 - Extend several nested VMX tests to also cover nested SVM.
 
 - Add a selftest for nested VMLOAD/VMSAVE.
 
 - Rework the nested dirty log test, originally added as a regression test for
   PML where KVM logged L2 GPAs instead of L1 GPAs, to improve test coverage
   and to hopefully make the test easier to understand and maintain.
 
 guest_memfd:
 
 - Remove kvm_gmem_populate()'s preparation tracking and half-baked hugepage
   handling.  SEV/SNP was the only user of the tracking and it can do it via
   the RMP.
 
 - Retroactively document and enforce (for SNP) that KVM_SEV_SNP_LAUNCH_UPDATE
   and KVM_TDX_INIT_MEM_REGION require the source page to be 4KiB aligned, to
   avoid non-trivial complexity for something that no known VMM seems to be
   doing and to avoid an API special case for in-place conversion, which
   simply can't support unaligned sources.
 
 - When populating guest_memfd memory, GUP the source page in common code and
   pass the refcounted page to the vendor callback, instead of letting vendor
   code do the heavy lifting.  Doing so avoids a looming deadlock bug with
   in-place due an AB-BA conflict betwee mmap_lock and guest_memfd's filemap
   invalidate lock.
 
 Generic:
 
 - Fix a bug where KVM would ignore the vCPU's selected address space when
   creating a vCPU-specific mapping of guest memory.  Actually this bug
   could not be hit even on x86, the only architecture with multiple
   address spaces, but it's a bug nevertheless.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCgAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmmNqwwUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroPaZAf/cJx5B67lnST272esz0j29MIuT/Ti
 jnf6PI9b7XubKYOtNvlu5ZW4Jsa5dqRG0qeO/JmcXDlwBf5/UkWOyvqIXyiuTl0l
 KcSUlKPtTgKZSoZpJpTppuuDE8FSYqEdcCmjNvoYzcJoPjmaeJbK6aqO0AkBbb6e
 L5InrLV7nV9iua6rFvA0s/G8/Eq2DG8M9hTRHe6NcI/z4hvslOudvpUXtC8Jygoo
 cV8vFavUwc+atrmvhAOLvSitnrjfNa4zcG6XMOlwXPfIdvi3zqTlQTgUpwGKiAGQ
 RIDUVZ/9bcWgJqbPRsdEWwaYRkNQWc5nmrAHRpEEaYV/NeBBNf4v6qfKSw==
 =SkJ1
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "Loongarch:

   - Add more CPUCFG mask bits

   - Improve feature detection

   - Add lazy load support for FPU and binary translation (LBT) register
     state

   - Fix return value for memory reads from and writes to in-kernel
     devices

   - Add support for detecting preemption from within a guest

   - Add KVM steal time test case to tools/selftests

  ARM:

   - Add support for FEAT_IDST, allowing ID registers that are not
     implemented to be reported as a normal trap rather than as an UNDEF
     exception

   - Add sanitisation of the VTCR_EL2 register, fixing a number of
     UXN/PXN/XN bugs in the process

   - Full handling of RESx bits, instead of only RES0, and resulting in
     SCTLR_EL2 being added to the list of sanitised registers

   - More pKVM fixes for features that are not supposed to be exposed to
     guests

   - Make sure that MTE being disabled on the pKVM host doesn't give it
     the ability to attack the hypervisor

   - Allow pKVM's host stage-2 mappings to use the Force Write Back
     version of the memory attributes by using the "pass-through'
     encoding

   - Fix trapping of ICC_DIR_EL1 on GICv5 hosts emulating GICv3 for the
     guest

   - Preliminary work for guest GICv5 support

   - A bunch of debugfs fixes, removing pointless custom iterators
     stored in guest data structures

   - A small set of FPSIMD cleanups

   - Selftest fixes addressing the incorrect alignment of page
     allocation

   - Other assorted low-impact fixes and spelling fixes

  RISC-V:

   - Fixes for issues discoverd by KVM API fuzzing in
     kvm_riscv_aia_imsic_has_attr(), kvm_riscv_aia_imsic_rw_attr(), and
     kvm_riscv_vcpu_aia_imsic_update()

   - Allow Zalasr, Zilsd and Zclsd extensions for Guest/VM

   - Transparent huge page support for hypervisor page tables

   - Adjust the number of available guest irq files based on MMIO
     register sizes found in the device tree or the ACPI tables

   - Add RISC-V specific paging modes to KVM selftests

   - Detect paging mode at runtime for selftests

  s390:

   - Performance improvement for vSIE (aka nested virtualization)

   - Completely new memory management. s390 was a special snowflake that
     enlisted help from the architecture's page table management to
     build hypervisor page tables, in particular enabling sharing the
     last level of page tables. This however was a lot of code (~3K
     lines) in order to support KVM, and also blocked several features.
     The biggest advantages is that the page size of userspace is
     completely independent of the page size used by the guest:
     userspace can mix normal pages, THPs and hugetlbfs as it sees fit,
     and in fact transparent hugepages were not possible before. It's
     also now possible to have nested guests and guests with huge pages
     running on the same host

   - Maintainership change for s390 vfio-pci

   - Small quality of life improvement for protected guests

  x86:

   - Add support for giving the guest full ownership of PMU hardware
     (contexted switched around the fastpath run loop) and allowing
     direct access to data MSRs and PMCs (restricted by the vPMU model).

     KVM still intercepts access to control registers, e.g. to enforce
     event filtering and to prevent the guest from profiling sensitive
     host state. This is more accurate, since it has no risk of
     contention and thus dropped events, and also has significantly less
     overhead.

     For more information, see the commit message for merge commit
     bf2c3138ae ("Merge tag 'kvm-x86-pmu-6.20' ...")

   - Disallow changing the virtual CPU model if L2 is active, for all
     the same reasons KVM disallows change the model after the first
     KVM_RUN

   - Fix a bug where KVM would incorrectly reject host accesses to PV
     MSRs when running with KVM_CAP_ENFORCE_PV_FEATURE_CPUID enabled,
     even if those were advertised as supported to userspace,

   - Fix a bug with protected guest state (SEV-ES/SNP and TDX) VMs,
     where KVM would attempt to read CR3 configuring an async #PF entry

   - Fail the build if EXPORT_SYMBOL_GPL or EXPORT_SYMBOL is used in KVM
     (for x86 only) to enforce usage of EXPORT_SYMBOL_FOR_KVM_INTERNAL.
     Only a few exports that are intended for external usage, and those
     are allowed explicitly

   - When checking nested events after a vCPU is unblocked, ignore
     -EBUSY instead of WARNing. Userspace can sometimes put the vCPU
     into what should be an impossible state, and spurious exit to
     userspace on -EBUSY does not really do anything to solve the issue

   - Also throw in the towel and drop the WARN on INIT/SIPI being
     blocked when vCPU is in Wait-For-SIPI, which also resulted in
     playing whack-a-mole with syzkaller stuffing architecturally
     impossible states into KVM

   - Add support for new Intel instructions that don't require anything
     beyond enumerating feature flags to userspace

   - Grab SRCU when reading PDPTRs in KVM_GET_SREGS2

   - Add WARNs to guard against modifying KVM's CPU caps outside of the
     intended setup flow, as nested VMX in particular is sensitive to
     unexpected changes in KVM's golden configuration

   - Add a quirk to allow userspace to opt-in to actually suppress EOI
     broadcasts when the suppression feature is enabled by the guest
     (currently limited to split IRQCHIP, i.e. userspace I/O APIC).
     Sadly, simply fixing KVM to honor Suppress EOI Broadcasts isn't an
     option as some userspaces have come to rely on KVM's buggy behavior
     (KVM advertises Supress EOI Broadcast irrespective of whether or
     not userspace I/O APIC supports Directed EOIs)

   - Clean up KVM's handling of marking mapped vCPU pages dirty

   - Drop a pile of *ancient* sanity checks hidden behind in KVM's
     unused ASSERT() macro, most of which could be trivially triggered
     by the guest and/or user, and all of which were useless

   - Fold "struct dest_map" into its sole user, "struct rtc_status", to
     make it more obvious what the weird parameter is used for, and to
     allow fropping these RTC shenanigans if CONFIG_KVM_IOAPIC=n

   - Bury all of ioapic.h, i8254.h and related ioctls (including
     KVM_CREATE_IRQCHIP) behind CONFIG_KVM_IOAPIC=y

   - Add a regression test for recent APICv update fixes

   - Handle "hardware APIC ISR", a.k.a. SVI, updates in
     kvm_apic_update_apicv() to consolidate the updates, and to
     co-locate SVI updates with the updates for KVM's own cache of ISR
     information

   - Drop a dead function declaration

   - Minor cleanups

  x86 (Intel):

   - Rework KVM's handling of VMCS updates while L2 is active to
     temporarily switch to vmcs01 instead of deferring the update until
     the next nested VM-Exit.

     The deferred updates approach directly contributed to several bugs,
     was proving to be a maintenance burden due to the difficulty in
     auditing the correctness of deferred updates, and was polluting
     "struct nested_vmx" with a growing pile of booleans

   - Fix an SGX bug where KVM would incorrectly try to handle EPCM page
     faults, and instead always reflect them into the guest. Since KVM
     doesn't shadow EPCM entries, EPCM violations cannot be due to KVM
     interference and can't be resolved by KVM

   - Fix a bug where KVM would register its posted interrupt wakeup
     handler even if loading kvm-intel.ko ultimately failed

   - Disallow access to vmcb12 fields that aren't fully supported,
     mostly to avoid weirdness and complexity for FRED and other
     features, where KVM wants enable VMCS shadowing for fields that
     conditionally exist

   - Print out the "bad" offsets and values if kvm-intel.ko refuses to
     load (or refuses to online a CPU) due to a VMCS config mismatch

  x86 (AMD):

   - Drop a user-triggerable WARN on nested_svm_load_cr3() failure

   - Add support for virtualizing ERAPS. Note, correct virtualization of
     ERAPS relies on an upcoming, publicly announced change in the APM
     to reduce the set of conditions where hardware (i.e. KVM) *must*
     flush the RAP

   - Ignore nSVM intercepts for instructions that are not supported
     according to L1's virtual CPU model

   - Add support for expedited writes to the fast MMIO bus, a la VMX's
     fastpath for EPT Misconfig

   - Don't set GIF when clearing EFER.SVME, as GIF exists independently
     of SVM, and allow userspace to restore nested state with GIF=0

   - Treat exit_code as an unsigned 64-bit value through all of KVM

   - Add support for fetching SNP certificates from userspace

   - Fix a bug where KVM would use vmcb02 instead of vmcb01 when
     emulating VMLOAD or VMSAVE on behalf of L2

   - Misc fixes and cleanups

  x86 selftests:

   - Add a regression test for TPR<=>CR8 synchronization and IRQ masking

   - Overhaul selftest's MMU infrastructure to genericize stage-2 MMU
     support, and extend x86's infrastructure to support EPT and NPT
     (for L2 guests)

   - Extend several nested VMX tests to also cover nested SVM

   - Add a selftest for nested VMLOAD/VMSAVE

   - Rework the nested dirty log test, originally added as a regression
     test for PML where KVM logged L2 GPAs instead of L1 GPAs, to
     improve test coverage and to hopefully make the test easier to
     understand and maintain

  guest_memfd:

   - Remove kvm_gmem_populate()'s preparation tracking and half-baked
     hugepage handling. SEV/SNP was the only user of the tracking and it
     can do it via the RMP

   - Retroactively document and enforce (for SNP) that
     KVM_SEV_SNP_LAUNCH_UPDATE and KVM_TDX_INIT_MEM_REGION require the
     source page to be 4KiB aligned, to avoid non-trivial complexity for
     something that no known VMM seems to be doing and to avoid an API
     special case for in-place conversion, which simply can't support
     unaligned sources

   - When populating guest_memfd memory, GUP the source page in common
     code and pass the refcounted page to the vendor callback, instead
     of letting vendor code do the heavy lifting. Doing so avoids a
     looming deadlock bug with in-place due an AB-BA conflict betwee
     mmap_lock and guest_memfd's filemap invalidate lock

  Generic:

   - Fix a bug where KVM would ignore the vCPU's selected address space
     when creating a vCPU-specific mapping of guest memory. Actually
     this bug could not be hit even on x86, the only architecture with
     multiple address spaces, but it's a bug nevertheless"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (267 commits)
  KVM: s390: Increase permitted SE header size to 1 MiB
  MAINTAINERS: Replace backup for s390 vfio-pci
  KVM: s390: vsie: Fix race in acquire_gmap_shadow()
  KVM: s390: vsie: Fix race in walk_guest_tables()
  KVM: s390: Use guest address to mark guest page dirty
  irqchip/riscv-imsic: Adjust the number of available guest irq files
  RISC-V: KVM: Transparent huge page support
  RISC-V: KVM: selftests: Add Zalasr extensions to get-reg-list test
  RISC-V: KVM: Allow Zalasr extensions for Guest/VM
  KVM: riscv: selftests: Add riscv vm satp modes
  KVM: riscv: selftests: add Zilsd and Zclsd extension to get-reg-list test
  riscv: KVM: allow Zilsd and Zclsd extensions for Guest/VM
  RISC-V: KVM: Skip IMSIC update if vCPU IMSIC state is not initialized
  RISC-V: KVM: Fix null pointer dereference in kvm_riscv_aia_imsic_rw_attr()
  RISC-V: KVM: Fix null pointer dereference in kvm_riscv_aia_imsic_has_attr()
  RISC-V: KVM: Remove unnecessary 'ret' assignment
  KVM: s390: Add explicit padding to struct kvm_s390_keyop
  KVM: LoongArch: selftests: Add steal time test case
  LoongArch: KVM: Add paravirt vcpu_is_preempted() support in guest side
  LoongArch: KVM: Add paravirt preempt feature in hypervisor side
  ...
2026-02-13 11:31:15 -08:00
Linus Torvalds
0c61526621 EFI updates for v7.0
- Quirk the broken EFI framebuffer geometry on the Valve Steam Deck
 
 - Capture the EDID information of the primary display also on non-x86
   EFI systems when booting via the EFI stub.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQQm/3uucuRGn1Dmh0wbglWLn0tXAUCaYoTAgAKCRAwbglWLn0t
 XFe8AQDJe2GSNfzWgqoTqgT6tcH7lFG2SjdpIb+jHSmvgHckbAD/cUaY8YnhdYkm
 nz6URLJN/2NHuaDq1mUL8CwJwIot4wk=
 =41tn
 -----END PGP SIGNATURE-----

Merge tag 'efi-next-for-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi

Pull EFI updates from Ard Biesheuvel:

 - Quirk the broken EFI framebuffer geometry on the Valve Steam Deck

 - Capture the EDID information of the primary display also on non-x86
   EFI systems when booting via the EFI stub.

* tag 'efi-next-for-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
  efi: Support EDID information
  sysfb: Move edid_info into sysfb_primary_display
  sysfb: Pass sysfb_primary_display to devices
  sysfb: Replace screen_info with sysfb_primary_display
  sysfb: Add struct sysfb_display_info
  efi: sysfb_efi: Reduce number of references to global screen_info
  efi: earlycon: Reduce number of references to global screen_info
  efi: sysfb_efi: Fix efidrmfb and simpledrmfb on Valve Steam Deck
  efi: sysfb_efi: Convert swap width and height quirk to a callback
  efi: sysfb_efi: Fix lfb_linelength calculation when applying quirks
  efi: sysfb_efi: Replace open coded swap with the macro
2026-02-09 20:49:19 -08:00
Marc Zyngier
86364832ba KVM: arm64: Don't blindly set set PSTATE.PAN on guest exit
We set PSTATE.PAN to 1 on exiting from a guest if PAN support has
been compiled in and that it exists on the HW. However, this is not
necessarily correct.

In a nVHE configuration, there is no notion of PAN at EL2, so setting
PSTATE.PAN to anything is pointless.

Furthermore, not setting PAN to 0 when CONFIG_ARM64_PAN isn't set
means we run with the *guest's* PSTATE.PAN (which might be set to 1),
and we will explode on the next userspace access. Yes, the architecture
is delightful in that particular corner.

Fix the whole thing by always setting PAN to something when running
VHE (which implies PAN support), and only ignore it when running nVHE.

Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://msgid.link/20260107124600.2736328-1-maz@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
2026-01-09 15:43:14 -08:00
Thomas Zimmermann
a41e0ab394 sysfb: Replace screen_info with sysfb_primary_display
Replace the global screen_info with sysfb_primary_display of type
struct sysfb_display_info. Adapt all users of screen_info.

Instances of screen_info are defined for x86, loongarch and EFI,
with only one instance compiled into a specific build. Replace all
of them with sysfb_primary_display.

All existing users of screen_info are updated by pointing them to
sysfb_primary_display.screen instead. This introduces some churn to
the code, but has no impact on functionality.

Boot parameters and EFI config tables are unchanged. They transfer
screen_info as before. The logic in EFI's alloc_screen_info() changes
slightly, as it now returns the screen field of sysfb_primary_display.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com> # drivers/pci/
Reviewed-by: Richard Lyu <richard.lyu@suse.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2025-12-16 14:12:44 +01:00
Marc Zyngier
8d3dfab1d3 KVM: arm64: Turn vgic-v3 errata traps into a patched-in constant
The trap bits are currently only set to manage CPU errata. However,
we are about to make use of them for purposes beyond beating broken
CPUs into submission.

For this purpose, turn these errata-driven bits into a patched-in
constant that is merged with the KVM-driven value at the point of
programming the ICH_HCR_EL2 register, rather than being directly
stored with with the shadow value..

This allows the KVM code to distinguish between a trap being handled
for the purpose of an erratum workaround, or for KVM's own need.

Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://msgid.link/20251120172540.2267180-5-maz@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
2025-11-24 14:29:11 -08:00
Marc Zyngier
9664d5810e KVM: arm64: Don't access ICC_SRE_EL2 if GICv3 doesn't support v2 compatibility
We currently access ICC_SRE_EL2 at each load/put on VHE, and on each
entry/exit on nVHE. Both are quite onerous on NV, as this register
always traps.

We do this to make sure the EL1 guest doesn't flip between v2 and v3
behind our back. But all modern implementations have dropped v2,
and this is just overhead.

At the same time, the GICv5 spec has been fixed to allow access to
ICC_SRE_EL2 in legacy mode. Use this opportunity to replace the
GICv5 checks for v2 compat checks, with an ad-hoc static key.

Co-developed-by: Sascha Bischoff <sascha.bischoff@arm.com>
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-09-17 17:40:42 +01:00
Linus Torvalds
e9e668cd27 arm64 fixes for -rc1
- Disable problematic linker assertions for broken versions of LLD.
 
 - Work around sporadic link failure with LLD and various randconfig
   builds.
 
 - Fix missing invalidation in the TLB batching code when reclaim races
   with mprotect() and friends.
 
 - Add a command-line override for MPAM to allow booting on systems with
   broken firmware.
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmhBcycQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNDwWCACtc4Jw3wwkmaiiP9Ner1/7wKq8xRLC2WRU
 tJjWLSkeoTthxf0DZILc61rNpOalfaRK774/Xo0OiYOBpKeAi5cSaUYMyabVJGcK
 k1R0KXDUu8oS6xKXmXyeuBV2pK4v4aET3E6lzUQZfvamhzuZfCvvKKrF5K8vv5Ph
 eowBMWKugMrwXMOBkRgVopppobdneFuVvnoMlNNYWOy70wDekoPV3qizoVJG/ulQ
 BTFunXX8Otufrm48Ye2bYalfwoiGdUQaJz/gRuHko0o3SOhqR3qZp2DWxQgBwJ+g
 VI6/dRLnVQpdg6toTvS9jzPczVfLt4/5VhLevbBcJuaUOER4SOZl
 =cfnk
 -----END PGP SIGNATURE-----

Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 fixes from Will Deacon:
 "We've got a couple of build fixes when using LLD, a missing TLB
  invalidation and a workaround for broken firmware on SoCs with CPUs
  that implement MPAM:

   - Disable problematic linker assertions for broken versions of LLD

   - Work around sporadic link failure with LLD and various randconfig
     builds

   - Fix missing invalidation in the TLB batching code when reclaim
     races with mprotect() and friends

   - Add a command-line override for MPAM to allow booting on systems
     with broken firmware"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: Add override for MPAM
  arm64/mm: Close theoretical race where stale TLB entry remains valid
  arm64: Work around convergence issue with LLD linker
  arm64: Disable LLD linker ASSERT()s for the time being
2025-06-05 11:39:17 -07:00
Ard Biesheuvel
dc0a083948 arm64: Work around convergence issue with LLD linker
LLD will occasionally error out with a '__init_end does not converge'
error if INIT_IDMAP_DIR_SIZE is defined in terms of _end, as this
results in a circular dependency.

Counter this by dimensioning the initial IDMAP page tables based on a
new boundary marker 'kimage_limit', and define it such that its value
should not change as a result of the initdata segment being pushed over
a 64k segment boundary due to changes in INIT_IDMAP_DIR_SIZE, provided
that its value doesn't change by more than 2M between linker passes.

Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250531123005.3866382-2-ardb+git@google.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-06-02 12:53:18 +01:00
Ard Biesheuvel
e21560b7d3 arm64: Disable LLD linker ASSERT()s for the time being
It turns out [1] that the way LLD handles ASSERT()s in the linker script
can result in spurious failures, so disable them for the newly
introduced BSS symbol export checks. Since we're not aware of any issues
with the existing assertions in vmlinux.lds.S, leave those alone for now
so that they can continue to provide useful coverage.

A linker fix [2] is due to land in version 21 of LLD.

Link: https://lore.kernel.org/r/202505261019.OUlitN6m-lkp@intel.com [1]
Link: https://github.com/llvm/llvm-project/commit/5859863bab7f [2]
Link: https://github.com/ClangBuiltLinux/linux/issues/2094
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202505261019.OUlitN6m-lkp@intel.com/
Link: https://lore.kernel.org/r/20250529073507.2984959-2-ardb+git@google.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-06-02 12:49:22 +01:00
Linus Torvalds
43db111107 ARM:
* Add large stage-2 mapping (THP) support for non-protected guests when
   pKVM is enabled, clawing back some performance.
 
 * Enable nested virtualisation support on systems that support it,
   though it is disabled by default.
 
 * Add UBSAN support to the standalone EL2 object used in nVHE/hVHE and
   protected modes.
 
 * Large rework of the way KVM tracks architecture features and links
   them with the effects of control bits. While this has no functional
   impact, it ensures correctness of emulation (the data is automatically
   extracted from the published JSON files), and helps dealing with the
   evolution of the architecture.
 
 * Significant changes to the way pKVM tracks ownership of pages,
   avoiding page table walks by storing the state in the hypervisor's
   vmemmap. This in turn enables the THP support described above.
 
 * New selftest checking the pKVM ownership transition rules
 
 * Fixes for FEAT_MTE_ASYNC being accidentally advertised to guests
   even if the host didn't have it.
 
 * Fixes for the address translation emulation, which happened to be
   rather buggy in some specific contexts.
 
 * Fixes for the PMU emulation in NV contexts, decoupling PMCR_EL0.N
   from the number of counters exposed to a guest and addressing a
   number of issues in the process.
 
 * Add a new selftest for the SVE host state being corrupted by a
   guest.
 
 * Keep HCR_EL2.xMO set at all times for systems running with the
   kernel at EL2, ensuring that the window for interrupts is slightly
   bigger, and avoiding a pretty bad erratum on the AmpereOne HW.
 
 * Add workaround for AmpereOne's erratum AC04_CPU_23, which suffers
   from a pretty bad case of TLB corruption unless accesses to HCR_EL2
   are heavily synchronised.
 
 * Add a per-VM, per-ITS debugfs entry to dump the state of the ITS
   tables in a human-friendly fashion.
 
 * and the usual random cleanups.
 
 LoongArch:
 
 * Don't flush tlb if the host supports hardware page table walks.
 
 * Add KVM selftests support.
 
 RISC-V:
 
 * Add vector registers to get-reg-list selftest
 
 * VCPU reset related improvements
 
 * Remove scounteren initialization from VCPU reset
 
 * Support VCPU reset from userspace using set_mpstate() ioctl
 
 x86:
 
 * Initial support for TDX in KVM.  This finally makes it possible to use the
   TDX module to run confidential guests on Intel processors.  This is quite a
   large series, including support for private page tables (managed by the
   TDX module and mirrored in KVM for efficiency), forwarding some TDVMCALLs
   to userspace, and handling several special VM exits from the TDX module.
 
   This has been in the works for literally years and it's not really possible
   to describe everything here, so I'll defer to the various merge commits
   up to and including commit 7bcf7246c4 ("Merge branch 'kvm-tdx-finish-initial'
   into HEAD").
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmg02hwUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroNnkwf/db4xeWKSMseCIvBVR+ObDn3LXhwT
 hAgmTkDkP1zq9RfbfJSbUA1DXRwfP+f1sWySLMWECkFEQW9fGIJF9fOQRDSXKmhX
 158U3+FEt+3jxLRCGFd4zyXAqyY3C8JSkPUyJZxCpUbXtB5tdDNac4rZAXKDULwe
 sUi0OW/kFDM2yt369pBGQAGdN+75/oOrYISGOSvMXHxjccNqvveX8MUhpBjYIuuj
 73iBWmsfv3vCtam56Racz3C3v44ie498PmWFtnB0R+CVfWfrnUAaRiGWx+egLiBW
 dBPDiZywMn++prmphEUFgaStDTQy23JBLJ8+RvHkp+o5GaTISKJB3nedZQ==
 =adZU
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "As far as x86 goes this pull request "only" includes TDX host support.

  Quotes are appropriate because (at 6k lines and 100+ commits) it is
  much bigger than the rest, which will come later this week and
  consists mostly of bugfixes and selftests. s390 changes will also come
  in the second batch.

  ARM:

   - Add large stage-2 mapping (THP) support for non-protected guests
     when pKVM is enabled, clawing back some performance.

   - Enable nested virtualisation support on systems that support it,
     though it is disabled by default.

   - Add UBSAN support to the standalone EL2 object used in nVHE/hVHE
     and protected modes.

   - Large rework of the way KVM tracks architecture features and links
     them with the effects of control bits. While this has no functional
     impact, it ensures correctness of emulation (the data is
     automatically extracted from the published JSON files), and helps
     dealing with the evolution of the architecture.

   - Significant changes to the way pKVM tracks ownership of pages,
     avoiding page table walks by storing the state in the hypervisor's
     vmemmap. This in turn enables the THP support described above.

   - New selftest checking the pKVM ownership transition rules

   - Fixes for FEAT_MTE_ASYNC being accidentally advertised to guests
     even if the host didn't have it.

   - Fixes for the address translation emulation, which happened to be
     rather buggy in some specific contexts.

   - Fixes for the PMU emulation in NV contexts, decoupling PMCR_EL0.N
     from the number of counters exposed to a guest and addressing a
     number of issues in the process.

   - Add a new selftest for the SVE host state being corrupted by a
     guest.

   - Keep HCR_EL2.xMO set at all times for systems running with the
     kernel at EL2, ensuring that the window for interrupts is slightly
     bigger, and avoiding a pretty bad erratum on the AmpereOne HW.

   - Add workaround for AmpereOne's erratum AC04_CPU_23, which suffers
     from a pretty bad case of TLB corruption unless accesses to HCR_EL2
     are heavily synchronised.

   - Add a per-VM, per-ITS debugfs entry to dump the state of the ITS
     tables in a human-friendly fashion.

   - and the usual random cleanups.

  LoongArch:

   - Don't flush tlb if the host supports hardware page table walks.

   - Add KVM selftests support.

  RISC-V:

   - Add vector registers to get-reg-list selftest

   - VCPU reset related improvements

   - Remove scounteren initialization from VCPU reset

   - Support VCPU reset from userspace using set_mpstate() ioctl

  x86:

   - Initial support for TDX in KVM.

     This finally makes it possible to use the TDX module to run
     confidential guests on Intel processors. This is quite a large
     series, including support for private page tables (managed by the
     TDX module and mirrored in KVM for efficiency), forwarding some
     TDVMCALLs to userspace, and handling several special VM exits from
     the TDX module.

     This has been in the works for literally years and it's not really
     possible to describe everything here, so I'll defer to the various
     merge commits up to and including commit 7bcf7246c4 ('Merge
     branch 'kvm-tdx-finish-initial' into HEAD')"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (248 commits)
  x86/tdx: mark tdh_vp_enter() as __flatten
  Documentation: virt/kvm: remove unreferenced footnote
  RISC-V: KVM: lock the correct mp_state during reset
  KVM: arm64: Fix documentation for vgic_its_iter_next()
  KVM: arm64: np-guest CMOs with PMD_SIZE fixmap
  KVM: arm64: Stage-2 huge mappings for np-guests
  KVM: arm64: Add a range to pkvm_mappings
  KVM: arm64: Convert pkvm_mappings to interval tree
  KVM: arm64: Add a range to __pkvm_host_test_clear_young_guest()
  KVM: arm64: Add a range to __pkvm_host_wrprotect_guest()
  KVM: arm64: Add a range to __pkvm_host_unshare_guest()
  KVM: arm64: Add a range to __pkvm_host_share_guest()
  KVM: arm64: Introduce for_each_hyp_page
  KVM: arm64: Handle huge mappings for np-guest CMOs
  KVM: arm64: nv: Release faulted-in VNCR page from mmu_lock critical section
  KVM: arm64: nv: Handle TLBI S1E2 for VNCR invalidation with mmu_lock held
  KVM: arm64: nv: Hold mmu_lock when invalidating VNCR SW-TLB before translating
  RISC-V: KVM: add KVM_CAP_RISCV_MP_STATE_RESET
  RISC-V: KVM: Remove scounteren initialization
  KVM: RISC-V: remove unnecessary SBI reset state
  ...
2025-05-29 08:10:01 -07:00
Linus Torvalds
47cf96fbe3 arm64 updates for 6.16
ACPI, EFI and PSCI:
  - Decouple Arm's "Software Delegated Exception Interface" (SDEI)
    support from the ACPI GHES code so that it can be used by platforms
    booted with device-tree.
 
  - Remove unnecessary per-CPU tracking of the FPSIMD state across EFI
    runtime calls.
 
  - Fix a node refcount imbalance in the PSCI device-tree code.
 
 CPU Features:
  - Ensure register sanitisation is applied to fields in ID_AA64MMFR4.
 
  - Expose AIDR_EL1 to userspace via sysfs, primarily so that KVM guests
    can reliably query the underlying CPU types from the VMM.
 
  - Re-enabling of SME support (CONFIG_ARM64_SME) as a result of fixes
    to our context-switching, signal handling and ptrace code.
 
 Entry code:
  - Hook up TIF_NEED_RESCHED_LAZY so that CONFIG_PREEMPT_LAZY can be
    selected.
 
 Memory management:
  - Prevent BSS exports from being used by the early PI code.
 
  - Propagate level and stride information to the low-level TLB
    invalidation routines when operating on hugetlb entries.
 
  - Use the page-table contiguous hint for vmap() mappings with
    VM_ALLOW_HUGE_VMAP where possible.
 
  - Optimise vmalloc()/vmap() page-table updates to use "lazy MMU mode"
    and hook this up on arm64 so that the trailing DSB (used to publish
    the updates to the hardware walker) can be deferred until the end of
    the mapping operation.
 
  - Extend mmap() randomisation for 52-bit virtual addresses (on par with
    48-bit addressing) and remove limited support for randomisation of
    the linear map.
 
 Perf and PMUs:
  - Add support for probing the CMN-S3 driver using ACPI.
 
  - Minor driver fixes to the CMN, Arm-NI and amlogic PMU drivers.
 
 Selftests:
  - Fix FPSIMD and SME tests to align with the freshly re-enabled SME
    support.
 
  - Fix default setting of the OUTPUT variable so that tests are
    installed in the right location.
 
 vDSO:
  - Replace raw counter access from inline assembly code with a call to
    the the __arch_counter_get_cntvct() helper function.
 
 Miscellaneous:
  - Add some missing header inclusions to the CCA headers.
 
  - Rework rendering of /proc/cpuinfo to follow the x86-approach and
    avoid repeated buffer expansion (the user-visible format remains
    identical).
 
  - Remove redundant selection of CONFIG_CRC32
 
  - Extend early error message when failing to map the device-tree blob.
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmg1uTgQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNFv2CAC9S5OW0btOAo7V/LFBpLhJM3hdIV6Sn6N1
 d/K5znuqPBG6VPfBrshaZltEl/C3U8KG4H8xrlX5cSo7CRuf3DgVBw3kiZ6ERZj6
 1gnKR54juA1oWhcroPl0s76ETWj3N4gO036u2qOhWNAYflDunh1+bCIGJkG4H/yP
 wqtWn974YUbad/zQJSbG3IMO1yvxZ/PsNpVF8HjyQ0/ZPWsYTscrhNQ0hWro17sR
 CTcUaGxH4GrXW24EGNgkLB9aq67X2rtGGtaIlp5oFl8FuLklc7TYbPwJp8cPCTNm
 0Sp0mpuR9M675pYIKoCI9m5twc46znRIKmbXi5LvPd77418y3jTf
 =03N4
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "The headline feature is the re-enablement of support for Arm's
  Scalable Matrix Extension (SME) thanks to a bumper crop of fixes
  from Mark Rutland.

  If matrices aren't your thing, then Ryan's page-table optimisation
  work is much more interesting.

  Summary:

  ACPI, EFI and PSCI:

   - Decouple Arm's "Software Delegated Exception Interface" (SDEI)
     support from the ACPI GHES code so that it can be used by platforms
     booted with device-tree

   - Remove unnecessary per-CPU tracking of the FPSIMD state across EFI
     runtime calls

   - Fix a node refcount imbalance in the PSCI device-tree code

  CPU Features:

   - Ensure register sanitisation is applied to fields in ID_AA64MMFR4

   - Expose AIDR_EL1 to userspace via sysfs, primarily so that KVM
     guests can reliably query the underlying CPU types from the VMM

   - Re-enabling of SME support (CONFIG_ARM64_SME) as a result of fixes
     to our context-switching, signal handling and ptrace code

  Entry code:

   - Hook up TIF_NEED_RESCHED_LAZY so that CONFIG_PREEMPT_LAZY can be
     selected

  Memory management:

   - Prevent BSS exports from being used by the early PI code

   - Propagate level and stride information to the low-level TLB
     invalidation routines when operating on hugetlb entries

   - Use the page-table contiguous hint for vmap() mappings with
     VM_ALLOW_HUGE_VMAP where possible

   - Optimise vmalloc()/vmap() page-table updates to use "lazy MMU mode"
     and hook this up on arm64 so that the trailing DSB (used to publish
     the updates to the hardware walker) can be deferred until the end
     of the mapping operation

   - Extend mmap() randomisation for 52-bit virtual addresses (on par
     with 48-bit addressing) and remove limited support for
     randomisation of the linear map

  Perf and PMUs:

   - Add support for probing the CMN-S3 driver using ACPI

   - Minor driver fixes to the CMN, Arm-NI and amlogic PMU drivers

  Selftests:

   - Fix FPSIMD and SME tests to align with the freshly re-enabled SME
     support

   - Fix default setting of the OUTPUT variable so that tests are
     installed in the right location

  vDSO:

   - Replace raw counter access from inline assembly code with a call to
     the the __arch_counter_get_cntvct() helper function

  Miscellaneous:

   - Add some missing header inclusions to the CCA headers

   - Rework rendering of /proc/cpuinfo to follow the x86-approach and
     avoid repeated buffer expansion (the user-visible format remains
     identical)

   - Remove redundant selection of CONFIG_CRC32

   - Extend early error message when failing to map the device-tree
     blob"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (83 commits)
  arm64: cputype: Add cputype definition for HIP12
  arm64: el2_setup.h: Make __init_el2_fgt labels consistent, again
  perf/arm-cmn: Add CMN S3 ACPI binding
  arm64/boot: Disallow BSS exports to startup code
  arm64/boot: Move global CPU override variables out of BSS
  arm64/boot: Move init_pgdir[] and init_idmap_pgdir[] into __pi_ namespace
  perf/arm-cmn: Initialise cmn->cpu earlier
  kselftest/arm64: Set default OUTPUT path when undefined
  arm64: Update comment regarding values in __boot_cpu_mode
  arm64: mm: Drop redundant check in pmd_trans_huge()
  arm64/mm: Re-organise setting up FEAT_S1PIE registers PIRE0_EL1 and PIR_EL1
  arm64/mm: Permit lazy_mmu_mode to be nested
  arm64/mm: Disable barrier batching in interrupt contexts
  arm64/cpuinfo: only show one cpu's info in c_show()
  arm64/mm: Batch barriers when updating kernel mappings
  mm/vmalloc: Enter lazy mmu mode while manipulating vmalloc ptes
  arm64/mm: Support huge pte-mapped pages in vmap
  mm/vmalloc: Gracefully unmap huge ptes
  mm/vmalloc: Warn on improper use of vunmap_range()
  arm64/mm: Hoist barriers out of set_ptes_anysz() loop
  ...
2025-05-28 14:55:35 -07:00
Marc Zyngier
d5702dd224 Merge branch kvm-arm64/pkvm-selftest-6.16 into kvm-arm64/pkvm-np-thp-6.16
* kvm-arm64/pkvm-selftest-6.16:
  : .
  : pKVM selftests covering the memory ownership transitions by
  : Quentin Perret. From the initial cover letter:
  :
  : "We have recently found a bug [1] in the pKVM memory ownership
  : transitions by code inspection, but it could have been caught with a
  : test.
  :
  : Introduce a boot-time selftest exercising all the known pKVM memory
  : transitions and importantly checks the rejection of illegal transitions.
  :
  : The new test is hidden behind a new Kconfig option separate from
  : CONFIG_EL2_NVHE_DEBUG on purpose as that has side effects on the
  : transition checks ([1] doesn't reproduce with EL2 debug enabled).
  :
  : [1] https://lore.kernel.org/kvmarm/20241128154406.602875-1-qperret@google.com/"
  : .
  KVM: arm64: Extend pKVM selftest for np-guests
  KVM: arm64: Selftest for pKVM transitions
  KVM: arm64: Don't WARN from __pkvm_host_share_guest()
  KVM: arm64: Add .hyp.data section

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-21 14:33:43 +01:00
Ard Biesheuvel
9053052107 arm64/boot: Disallow BSS exports to startup code
BSS might be uninitialized when entering the startup code, so forbid the
use by the startup code of any variables that live after __bss_start in
the linker map.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Yeoreum Yun <yeoreum.yun@arm.com>
Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com>
Link: https://lore.kernel.org/r/20250508114328.2460610-8-ardb+git@google.com
[will: Drop export of 'memstart_offset_seed', as this has been removed]
Signed-off-by: Will Deacon <will@kernel.org>
2025-05-16 16:08:13 +01:00
Ard Biesheuvel
93d0d6f8a6 arm64/boot: Move init_pgdir[] and init_idmap_pgdir[] into __pi_ namespace
init_pgdir[] is only referenced from the startup code, but lives after
BSS in the linker map. Before tightening the rules about accessing BSS
from startup code, move init_pgdir[] into the __pi_ namespace, so it
does not need to be exported explicitly.

For symmetry, do the same with init_idmap_pgdir[], although it lives
before BSS.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Yeoreum Yun <yeoreum.yun@arm.com>
Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com>
Link: https://lore.kernel.org/r/20250508114328.2460610-6-ardb+git@google.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-05-16 16:05:21 +01:00
David Brazdil
74b13d5816 KVM: arm64: Add .hyp.data section
The hypervisor has not needed its own .data section because all globals
were either .rodata or .bss. To avoid having to initialize future
data-structures at run-time, let's introduce add a .data section to the
hypervisor.

Signed-off-by: David Brazdil <dbrazdil@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20250416160900.3078417-2-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-06 09:56:18 +01:00
Ard Biesheuvel
1db780bafa arm64/mm: Remove randomization of the linear map
Since commit

  97d6786e06 ("arm64: mm: account for hotplug memory when randomizing the linear region")

the decision whether or not to randomize the placement of the system's
DRAM inside the linear map is based on the capabilities of the CPU
rather than how much memory is present at boot time. This change was
necessary because memory hotplug may result in DRAM appearing in places
that are not covered by the linear region at all (and therefore
unusable) if the decision is solely based on the memory map at boot.

In the Android GKI kernel, which requires support for memory hotplug,
and is built with a reduced virtual address space of only 39 bits wide,
randomization of the linear map never happens in practice as a result.
And even on arm64 kernels built with support for 48 bit virtual
addressing, the wider PArange of recent CPUs means that linear map
randomization is slowly becoming a feature that only works on systems
that will soon be obsolete.

So let's just remove this feature. We can always bring it back in an
improved form if there is a real need for it.

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Kees Cook <kees@kernel.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250318134949.3194334-2-ardb+git@google.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-04-29 13:21:49 +01:00
Marc Zyngier
117c3b21d3 arm64: Rework checks for broken Cavium HW in the PI code
Calling into the MIDR checking framework from the PI code has recently
become much harder, due to the new fancy "multi-MIDR" support that
relies on tables being populated at boot time, but not that early that
they are available to the PI code. There are additional issues with
this framework, as the code really isn't position independend *at all*.

This leads to some ugly breakages, as reported by Ada.

It so appears that the only reason for the PI code to call into the
MIDR checking code is to cope with The Most Broken ARM64 System Ever,
aka Cavium ThunderX, which cannot deal with nG attributes that result
of the combination of KASLR and KPTI as a consequence of Erratum 27456.

Duplicate the check for the erratum in the PI code, removing the
dependency on the bulk of the MIDR checking framework. This allows
dropping that same check from kaslr_requires_kpti(), as the KPTI code
already relies on the ARM64_WORKAROUND_CAVIUM_27456 cap.

Fixes: c8c2647e69 ("arm64: Make  _midr_in_range_list() an exported function")
Reported-by: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/3d97e45a-23cf-419b-9b6f-140b4d88de7b@arm.com
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Cc: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20250418093129.1755739-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-04-18 13:51:07 -07:00
Oliver Upton
1b1d1b17b8 Merge branch 'kvm-arm64/pmuv3-asahi' into kvmarm/next
* kvm-arm64/pmuv3-asahi:
  : Support PMUv3 for KVM guests on Apple silicon
  :
  : Take advantage of some IMPLEMENTATION DEFINED traps available on Apple
  : parts to trap-and-emulate the PMUv3 registers on behalf of a KVM guest.
  : Constrain the vPMU to a cycle counter and single event counter, as the
  : Apple PMU has events that cannot be counted on every counter.
  :
  : There is a small new interface between the ARM PMU driver and KVM, where
  : the PMU driver owns the PMUv3 -> hardware event mappings.
  arm64: Enable IMP DEF PMUv3 traps on Apple M*
  KVM: arm64: Provide 1 event counter on IMPDEF hardware
  drivers/perf: apple_m1: Provide helper for mapping PMUv3 events
  KVM: arm64: Remap PMUv3 events onto hardware
  KVM: arm64: Advertise PMUv3 if IMPDEF traps are present
  KVM: arm64: Compute synthetic sysreg ESR for Apple PMUv3 traps
  KVM: arm64: Move PMUVer filtering into KVM code
  KVM: arm64: Use guard() to cleanup usage of arm_pmus_lock
  KVM: arm64: Drop kvm_arm_pmu_available static key
  KVM: arm64: Use a cpucap to determine if system supports FEAT_PMUv3
  KVM: arm64: Always support SW_INCR PMU event
  KVM: arm64: Compute PMCEID from arm_pmu's event bitmaps
  drivers/perf: apple_m1: Support host/guest event filtering
  drivers/perf: apple_m1: Refactor event select/filter configuration

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-19 14:54:23 -07:00
Oliver Upton
a38b67d151 KVM: arm64: Drop kvm_arm_pmu_available static key
With the PMUv3 cpucap, kvm_arm_pmu_available is no longer used in the
hot path of guest entry/exit. On top of that, guest support for PMUv3
may not correlate with host support for the feature, e.g. on IMPDEF
hardware.

Throw out the static key and just inspect the list of PMUs to determine
if PMUv3 is supported for KVM guests.

Tested-by: Janne Grunau <j@jannau.net>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250305202641.428114-7-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-11 12:54:29 -07:00
Shameer Kolothum
c8c2647e69 arm64: Make  _midr_in_range_list() an exported function
Subsequent patch will add target implementation CPU support and that
will require _midr_in_range_list() to access new data. To avoid
exporting the data make _midr_in_range_list() a normal function and
export it.

No functional changes intended.

Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20250221140229.12588-5-shameerali.kolothum.thodi@huawei.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-26 13:30:36 -08:00
Marc Zyngier
0bc9a9e85f KVM: arm64: Work around x1e's CNTVOFF_EL2 bogosity
It appears that on Qualcomm's x1e CPU, CNTVOFF_EL2 doesn't really
work, specially with HCR_EL2.E2H=1.

A non-zero offset results in a screaming virtual timer interrupt,
to the tune of a few 100k interrupts per second on a 4 vcpu VM.
This is also evidenced by this CPU's inability to correctly run
any of the timer selftests.

The only case this doesn't break is when this register is set to 0,
which breaks VM migration.

When HCR_EL2.E2H=0, the timer seems to behave normally, and does
not result in an interrupt storm.

As a workaround, use the fact that this CPU implements FEAT_ECV,
and trap all accesses to the virtual timer and counter, keeping
CNTVOFF_EL2 set to zero, and emulate accesses to CVAL/TVAL/CTL
and the counter itself, fixing up the timer to account for the
missing offset.

And if you think this is disgusting, you'd probably be right.

Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241217142321.763801-12-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-02 19:19:10 +00:00
Mark Rutland
18fdb6348c arm64: irqchip/gic-v3: Select priorities at boot time
The distributor and PMR/RPR can present different views of the interrupt
priority space dependent upon the values of GICD_CTLR.DS and
SCR_EL3.FIQ. Currently we treat the distributor's view of the priority
space as canonical, and when the two differ we change the way we handle
values in the PMR/RPR, using the `gic_nonsecure_priorities` static key
to decide what to do.

This approach works, but it's sub-optimal. When using pseudo-NMI we
manipulate the distributor rarely, and we manipulate the PMR/RPR
registers very frequently in code spread out throughout the kernel (e.g.
local_irq_{save,restore}()). It would be nicer if we could use fixed
values for the PMR/RPR, and dynamically choose the values programmed
into the distributor.

This patch changes the GICv3 driver and arm64 code accordingly. PMR
values are chosen at compile time, and the GICv3 driver determines the
appropriate values to program into the distributor at boot time. This
removes the need for the `gic_nonsecure_priorities` static key and
results in smaller and better generated code for saving/restoring the
irqflags.

Before this patch, local_irq_disable() compiles to:

| 0000000000000000 <outlined_local_irq_disable>:
|    0:   d503201f        nop
|    4:   d50343df        msr     daifset, #0x3
|    8:   d65f03c0        ret
|    c:   d503201f        nop
|   10:   d2800c00        mov     x0, #0x60                       // #96
|   14:   d5184600        msr     icc_pmr_el1, x0
|   18:   d65f03c0        ret
|   1c:   d2801400        mov     x0, #0xa0                       // #160
|   20:   17fffffd        b       14 <outlined_local_irq_disable+0x14>

After this patch, local_irq_disable() compiles to:

| 0000000000000000 <outlined_local_irq_disable>:
|    0:   d503201f        nop
|    4:   d50343df        msr     daifset, #0x3
|    8:   d65f03c0        ret
|    c:   d2801800        mov     x0, #0xc0                       // #192
|   10:   d5184600        msr     icc_pmr_el1, x0
|   14:   d65f03c0        ret

... with 3 fewer instructions per call.

For defconfig + CONFIG_PSEUDO_NMI=y, this results in a minor saving of
~4K of text, and will make it easier to make further improvements to the
way we manipulate irqflags and DAIF bits.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexandru Elisei <alexandru.elisei@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Tested-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240617111841.2529370-6-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
2024-06-24 18:16:45 +01:00
Ard Biesheuvel
9684ec186f arm64: Enable LPA2 at boot if supported by the system
Update the early kernel mapping code to take 52-bit virtual addressing
into account based on the LPA2 feature. This is a bit more involved than
LVA (which is supported with 64k pages only), given that some page table
descriptor bits change meaning in this case.

To keep the handling in asm to a minimum, the initial ID map is still
created with 48-bit virtual addressing, which implies that the kernel
image must be loaded into 48-bit addressable physical memory. This is
currently required by the boot protocol, even though we happen to
support placement outside of that for LVA/64k based configurations.

Enabling LPA2 involves more than setting TCR.T1SZ to a lower value,
there is also a DS bit in TCR that needs to be set, and which changes
the meaning of bits [9:8] in all page table descriptors. Since we cannot
enable DS and every live page table descriptor at the same time, let's
pivot through another temporary mapping. This avoids the need to
reintroduce manipulations of the page tables with the MMU and caches
disabled.

To permit the LPA2 feature to be overridden on the kernel command line,
which may be necessary to work around silicon errata, or to deal with
mismatched features on heterogeneous SoC designs, test for CPU feature
overrides first, and only then enable LPA2.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20240214122845.2033971-78-ardb+git@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-02-16 12:42:40 +00:00
Ard Biesheuvel
68aec33f8f arm64: mm: Add feature override support for LVA
Add support for overriding the VARange field of the MMFR2 CPU ID
register. This permits the associated LVA feature to be overridden early
enough for the boot code that creates the kernel mapping to take it into
account.

Given that LPA2 implies LVA, disabling the latter should disable the
former as well. So override the ID_AA64MMFR0.TGran field of the current
page size as well if it advertises support for 52-bit addressing.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20240214122845.2033971-71-ardb+git@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-02-16 12:42:36 +00:00
Ard Biesheuvel
9cce9c6c2c arm64: mm: Handle LVA support as a CPU feature
Currently, we detect CPU support for 52-bit virtual addressing (LVA)
extremely early, before creating the kernel page tables or enabling the
MMU. We cannot override the feature this early, and so large virtual
addressing is always enabled on CPUs that implement support for it if
the software support for it was enabled at build time. It also means we
rely on non-trivial code in asm to deal with this feature.

Given that both the ID map and the TTBR1 mapping of the kernel image are
guaranteed to be 48-bit addressable, it is not actually necessary to
enable support this early, and instead, we can model it as a CPU
feature. That way, we can rely on code patching to get the correct
TCR.T1SZ values programmed on secondary boot and resume from suspend.

On the primary boot path, we simply enable the MMU with 48-bit virtual
addressing initially, and update TCR.T1SZ if LVA is supported from C
code, right before creating the kernel mapping. Given that TTBR1 still
points to reserved_pg_dir at this point, updating TCR.T1SZ should be
safe without the need for explicit TLB maintenance.

Since this gets rid of all accesses to the vabits_actual variable from
asm code that occurred before TCR.T1SZ had been programmed, we no longer
have a need for this variable, and we can replace it with a C expression
that produces the correct value directly, based on the value of TCR.T1SZ.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20240214122845.2033971-70-ardb+git@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-02-16 12:42:36 +00:00
Ard Biesheuvel
ba5b0333a8 arm64: mm: omit redundant remap of kernel image
Now that the early kernel mapping is created with all the right
attributes and segment boundaries, there is no longer a need to recreate
it and switch to it. This also means we no longer have to copy the kasan
shadow or some parts of the fixmap from one set of page tables to the
other.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20240214122845.2033971-68-ardb+git@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-02-16 12:42:35 +00:00
Ard Biesheuvel
84b04d3e6b arm64: kernel: Create initial ID map from C code
The asm code that creates the initial ID map is rather intricate and
hard to follow. This is problematic because it makes adding support for
things like LPA2 or WXN more difficult than necessary. Also, it is
parameterized like the rest of the MM code to run with a configurable
number of levels, which is rather pointless, given that all AArch64 CPUs
implement support for 48-bit virtual addressing, and that many systems
exist with DRAM located outside of the 39-bit addressable range, which
is the only smaller VA size that is widely used, and we need additional
tricks to make things work in that combination.

So let's bite the bullet, and rip out all the asm macros, and fiddly
code, and replace it with a C implementation based on the newly added
routines for creating the early kernel VA mappings. And while at it,
create the initial ID map based on 48-bit virtual addressing as well,
regardless of the number of configured levels for the kernel proper.

Note that this code may execute with the MMU and caches disabled, and is
therefore not permitted to make unaligned accesses. This shouldn't
generally happen in any case for the algorithm as implemented, but to be
sure, let's pass -mstrict-align to the compiler just in case.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20240214122845.2033971-66-ardb+git@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-02-16 12:42:34 +00:00
Ard Biesheuvel
97a6f43bb0 arm64: head: Move early kernel mapping routines into C code
The asm version of the kernel mapping code works fine for creating a
coarse grained identity map, but for mapping the kernel down to its
exact boundaries with the right attributes, it is not suitable. This is
why we create a preliminary RWX kernel mapping first, and then rebuild
it from scratch later on.

So let's reimplement this in C, in a way that will make it unnecessary
to create the kernel page tables yet another time in paging_init().

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20240214122845.2033971-63-ardb+git@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-02-16 12:42:33 +00:00
Ard Biesheuvel
aa6a52b247 arm64: head: move memstart_offset_seed handling to C code
Now that we can set BSS variables from the early code running from the
ID map, we can set memstart_offset_seed directly from the C code that
derives the value instead of passing it back and forth between C and asm
code.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20240214122845.2033971-60-ardb+git@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-02-16 12:42:32 +00:00
Ard Biesheuvel
e223a44912 arm64: idreg-override: Move to early mini C runtime
We will want to parse the ID register overrides even earlier, so that we
can take them into account before creating the kernel mapping. So
migrate the code and make it work in the context of the early C runtime.
We will move the invocation to an earlier stage in a subsequent patch.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20240214122845.2033971-49-ardb+git@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-02-16 12:42:28 +00:00
Ard Biesheuvel
3567fa63cb arm64: kaslr: Adjust randomization range dynamically
Currently, we base the KASLR randomization range on a rough estimate of
the available space in the upper VA region: the lower 1/4th has the
module region and the upper 1/4th has the fixmap, vmemmap and PCI I/O
ranges, and so we pick a random location in the remaining space in the
middle.

Once we enable support for 5-level paging with 4k pages, this no longer
works: the vmemmap region, being dimensioned to cover a 52-bit linear
region, takes up so much space in the upper VA region (the size of which
is based on a 48-bit VA space for compatibility with non-LVA hardware)
that the region above the vmalloc region takes up more than a quarter of
the available space.

So instead of a heuristic, let's derive the randomization range from the
actual boundaries of the vmalloc region.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20231213084024.2367360-16-ardb@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
2024-02-09 10:56:12 +00:00
Arnd Bergmann
b8466fe82b efi: move screen_info into efi init code
After the vga console no longer relies on global screen_info, there are
only two remaining use cases:

 - on the x86 architecture, it is used for multiple boot methods
   (bzImage, EFI, Xen, kexec) to commucate the initial VGA or framebuffer
   settings to a number of device drivers.

 - on other architectures, it is only used as part of the EFI stub,
   and only for the three sysfb framebuffers (simpledrm, simplefb, efifb).

Remove the duplicate data structure definitions by moving it into the
efi-init.c file that sets it up initially for the EFI case, leaving x86
as an exception that retains its own definition for non-EFI boots.

The added #ifdefs here are optional, I added them to further limit the
reach of screen_info to configurations that have at least one of the
users enabled.

Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>
Acked-by: Helge Deller <deller@gmx.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20231017093947.3627976-1-arnd@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-17 16:33:39 +02:00
Ard Biesheuvel
45dd403da8 efi/zboot: arm64: Inject kernel code size symbol into the zboot payload
The EFI zboot code is not built as part of the kernel proper, like the
ordinary EFI stub, but still needs access to symbols that are defined
only internally in the kernel, and are left unexposed deliberately to
avoid creating ABI inadvertently that we're stuck with later.

So capture the kernel code size of the kernel image, and inject it as an
ELF symbol into the object that contains the compressed payload, where
it will be accessible to zboot code that needs it.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
2023-04-26 18:01:41 +02:00
Catalin Marinas
156010ed9c Merge branches 'for-next/sysreg', 'for-next/sme', 'for-next/kselftest', 'for-next/misc', 'for-next/sme2', 'for-next/tpidr2', 'for-next/scs', 'for-next/compat-hwcap', 'for-next/ftrace', 'for-next/efi-boot-mmu-on', 'for-next/ptrauth' and 'for-next/pseudo-nmi', remote-tracking branch 'arm64/for-next/perf' into for-next/core
* arm64/for-next/perf:
  perf: arm_spe: Print the version of SPE detected
  perf: arm_spe: Add support for SPEv1.2 inverted event filtering
  perf: Add perf_event_attr::config3
  drivers/perf: fsl_imx8_ddr_perf: Remove set-but-not-used variable
  perf: arm_spe: Support new SPEv1.2/v8.7 'not taken' event
  perf: arm_spe: Use new PMSIDR_EL1 register enums
  perf: arm_spe: Drop BIT() and use FIELD_GET/PREP accessors
  arm64/sysreg: Convert SPE registers to automatic generation
  arm64: Drop SYS_ from SPE register defines
  perf: arm_spe: Use feature numbering for PMSEVFR_EL1 defines
  perf/marvell: Add ACPI support to TAD uncore driver
  perf/marvell: Add ACPI support to DDR uncore driver
  perf/arm-cmn: Reset DTM_PMU_CONFIG at probe
  drivers/perf: hisi: Extract initialization of "cpa_pmu->pmu"
  drivers/perf: hisi: Simplify the parameters of hisi_pmu_init()
  drivers/perf: hisi: Advertise the PERF_PMU_CAP_NO_EXCLUDE capability

* for-next/sysreg:
  : arm64 sysreg and cpufeature fixes/updates
  KVM: arm64: Use symbolic definition for ISR_EL1.A
  arm64/sysreg: Add definition of ISR_EL1
  arm64/sysreg: Add definition for ICC_NMIAR1_EL1
  arm64/cpufeature: Remove 4 bit assumption in ARM64_FEATURE_MASK()
  arm64/sysreg: Fix errors in 32 bit enumeration values
  arm64/cpufeature: Fix field sign for DIT hwcap detection

* for-next/sme:
  : SME-related updates
  arm64/sme: Optimise SME exit on syscall entry
  arm64/sme: Don't use streaming mode to probe the maximum SME VL
  arm64/ptrace: Use system_supports_tpidr2() to check for TPIDR2 support

* for-next/kselftest: (23 commits)
  : arm64 kselftest fixes and improvements
  kselftest/arm64: Don't require FA64 for streaming SVE+ZA tests
  kselftest/arm64: Copy whole EXTRA context
  kselftest/arm64: Fix enumeration of systems without 128 bit SME for SSVE+ZA
  kselftest/arm64: Fix enumeration of systems without 128 bit SME
  kselftest/arm64: Don't require FA64 for streaming SVE tests
  kselftest/arm64: Limit the maximum VL we try to set via ptrace
  kselftest/arm64: Correct buffer size for SME ZA storage
  kselftest/arm64: Remove the local NUM_VL definition
  kselftest/arm64: Verify simultaneous SSVE and ZA context generation
  kselftest/arm64: Verify that SSVE signal context has SVE_SIG_FLAG_SM set
  kselftest/arm64: Remove spurious comment from MTE test Makefile
  kselftest/arm64: Support build of MTE tests with clang
  kselftest/arm64: Initialise current at build time in signal tests
  kselftest/arm64: Don't pass headers to the compiler as source
  kselftest/arm64: Remove redundant _start labels from FP tests
  kselftest/arm64: Fix .pushsection for strings in FP tests
  kselftest/arm64: Run BTI selftests on systems without BTI
  kselftest/arm64: Fix test numbering when skipping tests
  kselftest/arm64: Skip non-power of 2 SVE vector lengths in fp-stress
  kselftest/arm64: Only enumerate power of two VLs in syscall-abi
  ...

* for-next/misc:
  : Miscellaneous arm64 updates
  arm64/mm: Intercept pfn changes in set_pte_at()
  Documentation: arm64: correct spelling
  arm64: traps: attempt to dump all instructions
  arm64: Apply dynamic shadow call stack patching in two passes
  arm64: el2_setup.h: fix spelling typo in comments
  arm64: Kconfig: fix spelling
  arm64: cpufeature: Use kstrtobool() instead of strtobool()
  arm64: Avoid repeated AA64MMFR1_EL1 register read on pagefault path
  arm64: make ARCH_FORCE_MAX_ORDER selectable

* for-next/sme2: (23 commits)
  : Support for arm64 SME 2 and 2.1
  arm64/sme: Fix __finalise_el2 SMEver check
  kselftest/arm64: Remove redundant _start labels from zt-test
  kselftest/arm64: Add coverage of SME 2 and 2.1 hwcaps
  kselftest/arm64: Add coverage of the ZT ptrace regset
  kselftest/arm64: Add SME2 coverage to syscall-abi
  kselftest/arm64: Add test coverage for ZT register signal frames
  kselftest/arm64: Teach the generic signal context validation about ZT
  kselftest/arm64: Enumerate SME2 in the signal test utility code
  kselftest/arm64: Cover ZT in the FP stress test
  kselftest/arm64: Add a stress test program for ZT0
  arm64/sme: Add hwcaps for SME 2 and 2.1 features
  arm64/sme: Implement ZT0 ptrace support
  arm64/sme: Implement signal handling for ZT
  arm64/sme: Implement context switching for ZT0
  arm64/sme: Provide storage for ZT0
  arm64/sme: Add basic enumeration for SME2
  arm64/sme: Enable host kernel to access ZT0
  arm64/sme: Manually encode ZT0 load and store instructions
  arm64/esr: Document ISS for ZT0 being disabled
  arm64/sme: Document SME 2 and SME 2.1 ABI
  ...

* for-next/tpidr2:
  : Include TPIDR2 in the signal context
  kselftest/arm64: Add test case for TPIDR2 signal frame records
  kselftest/arm64: Add TPIDR2 to the set of known signal context records
  arm64/signal: Include TPIDR2 in the signal context
  arm64/sme: Document ABI for TPIDR2 signal information

* for-next/scs:
  : arm64: harden shadow call stack pointer handling
  arm64: Stash shadow stack pointer in the task struct on interrupt
  arm64: Always load shadow stack pointer directly from the task struct

* for-next/compat-hwcap:
  : arm64: Expose compat ARMv8 AArch32 features (HWCAPs)
  arm64: Add compat hwcap SSBS
  arm64: Add compat hwcap SB
  arm64: Add compat hwcap I8MM
  arm64: Add compat hwcap ASIMDBF16
  arm64: Add compat hwcap ASIMDFHM
  arm64: Add compat hwcap ASIMDDP
  arm64: Add compat hwcap FPHP and ASIMDHP

* for-next/ftrace:
  : Add arm64 support for DYNAMICE_FTRACE_WITH_CALL_OPS
  arm64: avoid executing padding bytes during kexec / hibernation
  arm64: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS
  arm64: ftrace: Update stale comment
  arm64: patching: Add aarch64_insn_write_literal_u64()
  arm64: insn: Add helpers for BTI
  arm64: Extend support for CONFIG_FUNCTION_ALIGNMENT
  ACPI: Don't build ACPICA with '-Os'
  Compiler attributes: GCC cold function alignment workarounds
  ftrace: Add DYNAMIC_FTRACE_WITH_CALL_OPS

* for-next/efi-boot-mmu-on:
  : Permit arm64 EFI boot with MMU and caches on
  arm64: kprobes: Drop ID map text from kprobes blacklist
  arm64: head: Switch endianness before populating the ID map
  efi: arm64: enter with MMU and caches enabled
  arm64: head: Clean the ID map and the HYP text to the PoC if needed
  arm64: head: avoid cache invalidation when entering with the MMU on
  arm64: head: record the MMU state at primary entry
  arm64: kernel: move identity map out of .text mapping
  arm64: head: Move all finalise_el2 calls to after __enable_mmu

* for-next/ptrauth:
  : arm64 pointer authentication cleanup
  arm64: pauth: don't sign leaf functions
  arm64: unify asm-arch manipulation

* for-next/pseudo-nmi:
  : Pseudo-NMI code generation optimisations
  arm64: irqflags: use alternative branches for pseudo-NMI logic
  arm64: add ARM64_HAS_GIC_PRIO_RELAXED_SYNC cpucap
  arm64: make ARM64_HAS_GIC_PRIO_MASKING depend on ARM64_HAS_GIC_CPUIF_SYSREGS
  arm64: rename ARM64_HAS_IRQ_PRIO_MASKING to ARM64_HAS_GIC_PRIO_MASKING
  arm64: rename ARM64_HAS_SYSREG_GIC_CPUIF to ARM64_HAS_GIC_CPUIF_SYSREGS
2023-02-10 18:51:49 +00:00
Mark Rutland
8bf0a8048b arm64: add ARM64_HAS_GIC_PRIO_RELAXED_SYNC cpucap
When Priority Mask Hint Enable (PMHE) == 0b1, the GIC may use the PMR
value to determine whether to signal an IRQ to a PE, and consequently
after a change to the PMR value, a DSB SY may be required to ensure that
interrupts are signalled to a CPU in finite time. When PMHE == 0b0,
interrupts are always signalled to the relevant PE, and all masking
occurs locally, without requiring a DSB SY.

Since commit:

  f226650494 ("arm64: Relax ICC_PMR_EL1 accesses when ICC_CTLR_EL1.PMHE is clear")

... we handle this dynamically: in most cases a static key is used to
determine whether to issue a DSB SY, but the entry code must read from
ICC_CTLR_EL1 as static keys aren't accessible from plain assembly.

It would be much nicer to use an alternative instruction sequence for
the DSB, as this would avoid the need to read from ICC_CTLR_EL1 in the
entry code, and for most other code this will result in simpler code
generation with fewer instructions and fewer branches.

This patch adds a new ARM64_HAS_GIC_PRIO_RELAXED_SYNC cpucap which is
only set when ICC_CTLR_EL1.PMHE == 0b0 (and GIC priority masking is in
use). This allows us to replace the existing users of the
`gic_pmr_sync` static key with alternative sequences which default to a
DSB SY and are relaxed to a NOP when PMHE is not in use.

The entry assembly management of the PMR is slightly restructured to use
a branch (rather than multiple NOPs) when priority masking is not in
use. This is more in keeping with other alternatives in the entry
assembly, and permits the use of a separate alternatives for the
PMHE-dependent DSB SY (and removal of the conditional branch this
currently requires). For consistency I've adjusted both the save and
restore paths.

According to bloat-o-meter, when building defconfig +
CONFIG_ARM64_PSEUDO_NMI=y this shrinks the kernel text by ~4KiB:

| add/remove: 4/2 grow/shrink: 42/310 up/down: 332/-5032 (-4700)

The resulting vmlinux is ~66KiB smaller, though the resulting Image size
is unchanged due to padding and alignment:

| [mark@lakrids:~/src/linux]% ls -al vmlinux-*
| -rwxr-xr-x 1 mark mark 137508344 Jan 17 14:11 vmlinux-after
| -rwxr-xr-x 1 mark mark 137575440 Jan 17 13:49 vmlinux-before
| [mark@lakrids:~/src/linux]% ls -al Image-*
| -rw-r--r-- 1 mark mark 38777344 Jan 17 14:11 Image-after
| -rw-r--r-- 1 mark mark 38777344 Jan 17 13:49 Image-before

Prior to this patch we did not verify the state of ICC_CTLR_EL1.PMHE on
secondary CPUs. As of this patch this is verified by the cpufeature code
when using GIC priority masking (i.e. when using pseudo-NMIs).

Note that since commit:

  7e3a57fa6c ("arm64: Document ICC_CTLR_EL3.PMHE setting requirements")

... Documentation/arm64/booting.rst specifies:

|      - ICC_CTLR_EL3.PMHE (bit 6) must be set to the same value across
|        all CPUs the kernel is executing on, and must stay constant
|        for the lifetime of the kernel.

... so that should not adversely affect any compliant systems, and as
we'll only check for the absense of PMHE when using pseudo-NMIs, this
will only fire when such mismatch will adversely affect the system.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20230130145429.903791-5-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2023-01-31 16:06:17 +00:00
Ard Biesheuvel
6178617038 efi: arm64: enter with MMU and caches enabled
Instead of cleaning the entire loaded kernel image to the PoC and
disabling the MMU and caches before branching to the kernel's bare metal
entry point, we can leave the MMU and caches enabled, and rely on EFI's
cacheable 1:1 mapping of all of system RAM (which is mandated by the
spec) to populate the initial page tables.

This removes the need for managing coherency in software, which is
tedious and error prone.

Note that we still need to clean the executable region of the image to
the PoU if this is required for I/D coherency, but only if we actually
decided to move the image in memory, as otherwise, this will have been
taken care of by the loader.

This change affects both the builtin EFI stub as well as the zboot
decompressor, which now carries the entire EFI stub along with the
decompression code and the compressed image.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20230111102236.1430401-7-ardb@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2023-01-24 11:51:08 +00:00
Linus Torvalds
8fa590bf34 ARM64:
* Enable the per-vcpu dirty-ring tracking mechanism, together with an
   option to keep the good old dirty log around for pages that are
   dirtied by something other than a vcpu.
 
 * Switch to the relaxed parallel fault handling, using RCU to delay
   page table reclaim and giving better performance under load.
 
 * Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping option,
   which multi-process VMMs such as crosvm rely on (see merge commit 382b5b87a97d:
   "Fix a number of issues with MTE, such as races on the tags being
   initialised vs the PG_mte_tagged flag as well as the lack of support
   for VM_SHARED when KVM is involved.  Patches from Catalin Marinas and
   Peter Collingbourne").
 
 * Merge the pKVM shadow vcpu state tracking that allows the hypervisor
   to have its own view of a vcpu, keeping that state private.
 
 * Add support for the PMUv3p5 architecture revision, bringing support
   for 64bit counters on systems that support it, and fix the
   no-quite-compliant CHAIN-ed counter support for the machines that
   actually exist out there.
 
 * Fix a handful of minor issues around 52bit VA/PA support (64kB pages
   only) as a prefix of the oncoming support for 4kB and 16kB pages.
 
 * Pick a small set of documentation and spelling fixes, because no
   good merge window would be complete without those.
 
 s390:
 
 * Second batch of the lazy destroy patches
 
 * First batch of KVM changes for kernel virtual != physical address support
 
 * Removal of a unused function
 
 x86:
 
 * Allow compiling out SMM support
 
 * Cleanup and documentation of SMM state save area format
 
 * Preserve interrupt shadow in SMM state save area
 
 * Respond to generic signals during slow page faults
 
 * Fixes and optimizations for the non-executable huge page errata fix.
 
 * Reprogram all performance counters on PMU filter change
 
 * Cleanups to Hyper-V emulation and tests
 
 * Process Hyper-V TLB flushes from a nested guest (i.e. from a L2 guest
   running on top of a L1 Hyper-V hypervisor)
 
 * Advertise several new Intel features
 
 * x86 Xen-for-KVM:
 
 ** Allow the Xen runstate information to cross a page boundary
 
 ** Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured
 
 ** Add support for 32-bit guests in SCHEDOP_poll
 
 * Notable x86 fixes and cleanups:
 
 ** One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).
 
 ** Reinstate IBPB on emulated VM-Exit that was incorrectly dropped a few
    years back when eliminating unnecessary barriers when switching between
    vmcs01 and vmcs02.
 
 ** Clean up vmread_error_trampoline() to make it more obvious that params
    must be passed on the stack, even for x86-64.
 
 ** Let userspace set all supported bits in MSR_IA32_FEAT_CTL irrespective
    of the current guest CPUID.
 
 ** Fudge around a race with TSC refinement that results in KVM incorrectly
    thinking a guest needs TSC scaling when running on a CPU with a
    constant TSC, but no hardware-enumerated TSC frequency.
 
 ** Advertise (on AMD) that the SMM_CTL MSR is not supported
 
 ** Remove unnecessary exports
 
 Generic:
 
 * Support for responding to signals during page faults; introduces
   new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks
 
 Selftests:
 
 * Fix an inverted check in the access tracking perf test, and restore
   support for asserting that there aren't too many idle pages when
   running on bare metal.
 
 * Fix build errors that occur in certain setups (unsure exactly what is
   unique about the problematic setup) due to glibc overriding
   static_assert() to a variant that requires a custom message.
 
 * Introduce actual atomics for clear/set_bit() in selftests
 
 * Add support for pinning vCPUs in dirty_log_perf_test.
 
 * Rename the so called "perf_util" framework to "memstress".
 
 * Add a lightweight psuedo RNG for guest use, and use it to randomize
   the access pattern and write vs. read percentage in the memstress tests.
 
 * Add a common ucall implementation; code dedup and pre-work for running
   SEV (and beyond) guests in selftests.
 
 * Provide a common constructor and arch hook, which will eventually be
   used by x86 to automatically select the right hypercall (AMD vs. Intel).
 
 * A bunch of added/enabled/fixed selftests for ARM64, covering memslots,
   breakpoints, stage-2 faults and access tracking.
 
 * x86-specific selftest changes:
 
 ** Clean up x86's page table management.
 
 ** Clean up and enhance the "smaller maxphyaddr" test, and add a related
    test to cover generic emulation failure.
 
 ** Clean up the nEPT support checks.
 
 ** Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values.
 
 ** Fix an ordering issue in the AMX test introduced by recent conversions
    to use kvm_cpu_has(), and harden the code to guard against similar bugs
    in the future.  Anything that tiggers caching of KVM's supported CPUID,
    kvm_cpu_has() in this case, effectively hides opt-in XSAVE features if
    the caching occurs before the test opts in via prctl().
 
 Documentation:
 
 * Remove deleted ioctls from documentation
 
 * Clean up the docs for the x86 MSR filter.
 
 * Various fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmOaFrcUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroPemQgAq49excg2Cc+EsHnZw3vu/QWdA0Rt
 KhL3OgKxuHNjCbD2O9n2t5di7eJOTQ7F7T0eDm3xPTr4FS8LQ2327/mQePU/H2CF
 mWOpq9RBWLzFsSTeVA2Mz9TUTkYSnDHYuRsBvHyw/n9cL76BWVzjImldFtjYjjex
 yAwl8c5itKH6bc7KO+5ydswbvBzODkeYKUSBNdbn6m0JGQST7XppNwIAJvpiHsii
 Qgpk0e4Xx9q4PXG/r5DedI6BlufBsLhv0aE9SHPzyKH3JbbUFhJYI8ZD5OhBQuYW
 MwxK2KlM5Jm5ud2NZDDlsMmmvd1lnYCFDyqNozaKEWC1Y5rq1AbMa51fXA==
 =QAYX
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "ARM64:

   - Enable the per-vcpu dirty-ring tracking mechanism, together with an
     option to keep the good old dirty log around for pages that are
     dirtied by something other than a vcpu.

   - Switch to the relaxed parallel fault handling, using RCU to delay
     page table reclaim and giving better performance under load.

   - Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping
     option, which multi-process VMMs such as crosvm rely on (see merge
     commit 382b5b87a97d: "Fix a number of issues with MTE, such as
     races on the tags being initialised vs the PG_mte_tagged flag as
     well as the lack of support for VM_SHARED when KVM is involved.
     Patches from Catalin Marinas and Peter Collingbourne").

   - Merge the pKVM shadow vcpu state tracking that allows the
     hypervisor to have its own view of a vcpu, keeping that state
     private.

   - Add support for the PMUv3p5 architecture revision, bringing support
     for 64bit counters on systems that support it, and fix the
     no-quite-compliant CHAIN-ed counter support for the machines that
     actually exist out there.

   - Fix a handful of minor issues around 52bit VA/PA support (64kB
     pages only) as a prefix of the oncoming support for 4kB and 16kB
     pages.

   - Pick a small set of documentation and spelling fixes, because no
     good merge window would be complete without those.

  s390:

   - Second batch of the lazy destroy patches

   - First batch of KVM changes for kernel virtual != physical address
     support

   - Removal of a unused function

  x86:

   - Allow compiling out SMM support

   - Cleanup and documentation of SMM state save area format

   - Preserve interrupt shadow in SMM state save area

   - Respond to generic signals during slow page faults

   - Fixes and optimizations for the non-executable huge page errata
     fix.

   - Reprogram all performance counters on PMU filter change

   - Cleanups to Hyper-V emulation and tests

   - Process Hyper-V TLB flushes from a nested guest (i.e. from a L2
     guest running on top of a L1 Hyper-V hypervisor)

   - Advertise several new Intel features

   - x86 Xen-for-KVM:

      - Allow the Xen runstate information to cross a page boundary

      - Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured

      - Add support for 32-bit guests in SCHEDOP_poll

   - Notable x86 fixes and cleanups:

      - One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).

      - Reinstate IBPB on emulated VM-Exit that was incorrectly dropped
        a few years back when eliminating unnecessary barriers when
        switching between vmcs01 and vmcs02.

      - Clean up vmread_error_trampoline() to make it more obvious that
        params must be passed on the stack, even for x86-64.

      - Let userspace set all supported bits in MSR_IA32_FEAT_CTL
        irrespective of the current guest CPUID.

      - Fudge around a race with TSC refinement that results in KVM
        incorrectly thinking a guest needs TSC scaling when running on a
        CPU with a constant TSC, but no hardware-enumerated TSC
        frequency.

      - Advertise (on AMD) that the SMM_CTL MSR is not supported

      - Remove unnecessary exports

  Generic:

   - Support for responding to signals during page faults; introduces
     new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks

  Selftests:

   - Fix an inverted check in the access tracking perf test, and restore
     support for asserting that there aren't too many idle pages when
     running on bare metal.

   - Fix build errors that occur in certain setups (unsure exactly what
     is unique about the problematic setup) due to glibc overriding
     static_assert() to a variant that requires a custom message.

   - Introduce actual atomics for clear/set_bit() in selftests

   - Add support for pinning vCPUs in dirty_log_perf_test.

   - Rename the so called "perf_util" framework to "memstress".

   - Add a lightweight psuedo RNG for guest use, and use it to randomize
     the access pattern and write vs. read percentage in the memstress
     tests.

   - Add a common ucall implementation; code dedup and pre-work for
     running SEV (and beyond) guests in selftests.

   - Provide a common constructor and arch hook, which will eventually
     be used by x86 to automatically select the right hypercall (AMD vs.
     Intel).

   - A bunch of added/enabled/fixed selftests for ARM64, covering
     memslots, breakpoints, stage-2 faults and access tracking.

   - x86-specific selftest changes:

      - Clean up x86's page table management.

      - Clean up and enhance the "smaller maxphyaddr" test, and add a
        related test to cover generic emulation failure.

      - Clean up the nEPT support checks.

      - Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values.

      - Fix an ordering issue in the AMX test introduced by recent
        conversions to use kvm_cpu_has(), and harden the code to guard
        against similar bugs in the future. Anything that tiggers
        caching of KVM's supported CPUID, kvm_cpu_has() in this case,
        effectively hides opt-in XSAVE features if the caching occurs
        before the test opts in via prctl().

  Documentation:

   - Remove deleted ioctls from documentation

   - Clean up the docs for the x86 MSR filter.

   - Various fixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (361 commits)
  KVM: x86: Add proper ReST tables for userspace MSR exits/flags
  KVM: selftests: Allocate ucall pool from MEM_REGION_DATA
  KVM: arm64: selftests: Align VA space allocator with TTBR0
  KVM: arm64: Fix benign bug with incorrect use of VA_BITS
  KVM: arm64: PMU: Fix period computation for 64bit counters with 32bit overflow
  KVM: x86: Advertise that the SMM_CTL MSR is not supported
  KVM: x86: remove unnecessary exports
  KVM: selftests: Fix spelling mistake "probabalistic" -> "probabilistic"
  tools: KVM: selftests: Convert clear/set_bit() to actual atomics
  tools: Drop "atomic_" prefix from atomic test_and_set_bit()
  tools: Drop conflicting non-atomic test_and_{clear,set}_bit() helpers
  KVM: selftests: Use non-atomic clear/set bit helpers in KVM tests
  perf tools: Use dedicated non-atomic clear/set bit helpers
  tools: Take @bit as an "unsigned long" in {clear,set}_bit() helpers
  KVM: arm64: selftests: Enable single-step without a "full" ucall()
  KVM: x86: fix APICv/x2AVIC disabled when vm reboot by itself
  KVM: Remove stale comment about KVM_REQ_UNHALT
  KVM: Add missing arch for KVM_CREATE_DEVICE and KVM_{SET,GET}_DEVICE_ATTR
  KVM: Reference to kvm_userspace_memory_region in doc and comments
  KVM: Delete all references to removed KVM_SET_MEMORY_ALIAS ioctl
  ...
2022-12-15 11:12:21 -08:00
Quentin Perret
169cd0f823 KVM: arm64: Don't unnecessarily map host kernel sections at EL2
We no longer need to map the host's '.rodata' and '.bss' sections in the
stage-1 page-table of the pKVM hypervisor at EL2, so remove those
mappings and avoid creating any future dependencies at EL2 on
host-controlled data structures.

Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110190259.26861-25-will@kernel.org
2022-11-11 17:19:35 +00:00
Will Deacon
73f38ef2ae KVM: arm64: Maintain a copy of 'kvm_arm_vmid_bits' at EL2
Sharing 'kvm_arm_vmid_bits' between EL1 and EL2 allows the host to
modify the variable arbitrarily, potentially leading to all sorts of
shenanians as this is used to configure the VTTBR register for the
guest stage-2.

In preparation for unmapping host sections entirely from EL2, maintain
a copy of 'kvm_arm_vmid_bits' in the pKVM hypervisor and initialise it
from the host value while it is still trusted.

Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110190259.26861-23-will@kernel.org
2022-11-11 17:19:35 +00:00
Quentin Perret
fe41a7f8c0 KVM: arm64: Unmap 'kvm_arm_hyp_percpu_base' from the host
When pKVM is enabled, the hypervisor at EL2 does not trust the host at
EL1 and must therefore prevent it from having unrestricted access to
internal hypervisor state.

The 'kvm_arm_hyp_percpu_base' array holds the offsets for hypervisor
per-cpu allocations, so move this this into the nVHE code where it
cannot be modified by the untrusted host at EL1.

Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110190259.26861-22-will@kernel.org
2022-11-11 17:19:35 +00:00
Will Deacon
13e248aab7 KVM: arm64: Provide I-cache invalidation by virtual address at EL2
In preparation for handling cache maintenance of guest pages from within
the pKVM hypervisor at EL2, introduce an EL2 copy of icache_inval_pou()
which will later be plumbed into the stage-2 page-table cache
maintenance callbacks, ensuring that the initial contents of pages
mapped as executable into the guest stage-2 page-table is visible to the
instruction fetcher.

Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110190259.26861-17-will@kernel.org
2022-11-11 17:16:25 +00:00
Ard Biesheuvel
da8dd0c75b efi: libstub: Provide local implementations of strrchr() and memchr()
Clone the implementations of strrchr() and memchr() in lib/string.c so
we can use them in the standalone zboot decompressor app. These routines
are used by the FDT handling code.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2022-11-09 12:42:02 +01:00
Ard Biesheuvel
2e6fa86f2d efi: libstub: Enable efi_printk() in zboot decompressor
Split the efi_printk() routine into its own source file, and provide
local implementations of strlen() and strnlen() so that the standalone
zboot app can efi_err and efi_info etc.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2022-11-09 12:42:02 +01:00
Ard Biesheuvel
52dce39cd2 efi: libstub: Clone memcmp() into the stub
We will no longer be able to call into the kernel image once we merge
the decompressor with the EFI stub, so we need our own implementation of
memcmp(). Let's add the one from lib/string.c and simplify it.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2022-11-09 12:42:02 +01:00
Ard Biesheuvel
fa882a1389 efi: libstub: Use local strncmp() implementation unconditionally
In preparation for moving the EFI stub functionality into the zboot
decompressor, switch to the stub's implementation of strncmp()
unconditionally.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2022-11-09 12:42:02 +01:00
Ard Biesheuvel
aaeb3fc614 arm64: efi: Move dcache cleaning of loaded image out of efi_enter_kernel()
The efi_enter_kernel() routine will be shared between the existing EFI
stub and the zboot decompressor, and the version of
dcache_clean_to_poc() that the core kernel exports to the stub will not
be available in the latter case.

So move the handling into the .c file which will remain part of the stub
build that integrates directly with the kernel proper.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
2022-11-09 12:42:01 +01:00
Linus Torvalds
0e470763d8 EFI updates for v6.1
- implement EFI boot support for LoongArch
 - implement generic EFI compressed boot support for arm64, RISC-V and
   LoongArch, none of which implement a decompressor today
 - measure the kernel command line into the TPM if measured boot is in
   effect
 - refactor the EFI stub code in order to isolate DT dependencies for
   architectures other than x86
 - avoid calling SetVirtualAddressMap() on arm64 if the configured size
   of the VA space guarantees that doing so is unnecessary
 - move some ARM specific code out of the generic EFI source files
 - unmap kernel code from the x86 mixed mode 1:1 page tables
 -----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE+9lifEBpyUIVN1cpw08iOZLZjyQFAmM5mfEACgkQw08iOZLZ
 jySnJwv9G2nBheSlK9bbWKvCpnDvVIExtlL+mg1wB64oxPrGiWRgjxeyA9+92bT0
 Y6jYfKbGOGKnxkEJQl19ik6C3JfEwtGm4SnOVp4+osFeDRB7lFemfcIYN5dqz111
 wkZA/Y15rnz3tZeGaXnq2jMoFuccQDXPJtOlqbdVqFQ5Py6YT92uMyuI079pN0T+
 GSu7VVOX+SBsv4nGaUKIpSVwAP0gXkS/7s7CTf47QiR2+j8WMTlQEYZVjOKZjMJZ
 /7hXY2/mduxnuVuT7cfx0mpZKEryUREJoBL5nDzjTnlhLb5X8cHKiaE1lx0aJ//G
 JYTR8lDklJZl/7RUw/IW/YodcKcofr3F36NMzWB5vzM+KHOOpv4qEZhoGnaXv94u
 auqhzYA83heaRjz7OISlk6kgFxdlIRE1VdrkEBXSlQeCQUv1woS+ZNVGYcKqgR0B
 48b31Ogm2A0pAuba89+U9lz/n33lhIDtYvJqLO6AAPLGiVacD9ZdapN5kMftVg/1
 SfhFqNzy
 =d8Ps
 -----END PGP SIGNATURE-----

Merge tag 'efi-next-for-v6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi

Pull EFI updates from Ard Biesheuvel:
 "A bit more going on than usual in the EFI subsystem. The main driver
  for this has been the introduction of the LoonArch architecture last
  cycle, which inspired some cleanup and refactoring of the EFI code.
  Another driver for EFI changes this cycle and in the future is
  confidential compute.

  The LoongArch architecture does not use either struct bootparams or DT
  natively [yet], and so passing information between the EFI stub and
  the core kernel using either of those is undesirable. And in general,
  overloading DT has been a source of issues on arm64, so using DT for
  this on new architectures is a to avoid for the time being (even if we
  might converge on something DT based for non-x86 architectures in the
  future). For this reason, in addition to the patch that enables EFI
  boot for LoongArch, there are a number of refactoring patches applied
  on top of which separate the DT bits from the generic EFI stub bits.
  These changes are on a separate topich branch that has been shared
  with the LoongArch maintainers, who will include it in their pull
  request as well. This is not ideal, but the best way to manage the
  conflicts without stalling LoongArch for another cycle.

  Another development inspired by LoongArch is the newly added support
  for EFI based decompressors. Instead of adding yet another
  arch-specific incarnation of this pattern for LoongArch, we are
  introducing an EFI app based on the existing EFI libstub
  infrastructure that encapulates the decompression code we use on other
  architectures, but in a way that is fully generic. This has been
  developed and tested in collaboration with distro and systemd folks,
  who are eager to start using this for systemd-boot and also for arm64
  secure boot on Fedora. Note that the EFI zimage files this introduces
  can also be decompressed by non-EFI bootloaders if needed, as the
  image header describes the location of the payload inside the image,
  and the type of compression that was used. (Note that Fedora's arm64
  GRUB is buggy [0] so you'll need a recent version or switch to
  systemd-boot in order to use this.)

  Finally, we are adding TPM measurement of the kernel command line
  provided by EFI. There is an oversight in the TCG spec which results
  in a blind spot for command line arguments passed to loaded images,
  which means that either the loader or the stub needs to take the
  measurement. Given the combinatorial explosion I am anticipating when
  it comes to firmware/bootloader stacks and firmware based attestation
  protocols (SEV-SNP, TDX, DICE, DRTM), it is good to set a baseline now
  when it comes to EFI measured boot, which is that the kernel measures
  the initrd and command line. Intermediate loaders can measure
  additional assets if needed, but with the baseline in place, we can
  deploy measured boot in a meaningful way even if you boot into Linux
  straight from the EFI firmware.

  Summary:

   - implement EFI boot support for LoongArch

   - implement generic EFI compressed boot support for arm64, RISC-V and
     LoongArch, none of which implement a decompressor today

   - measure the kernel command line into the TPM if measured boot is in
     effect

   - refactor the EFI stub code in order to isolate DT dependencies for
     architectures other than x86

   - avoid calling SetVirtualAddressMap() on arm64 if the configured
     size of the VA space guarantees that doing so is unnecessary

   - move some ARM specific code out of the generic EFI source files

   - unmap kernel code from the x86 mixed mode 1:1 page tables"

* tag 'efi-next-for-v6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: (24 commits)
  efi/arm64: libstub: avoid SetVirtualAddressMap() when possible
  efi: zboot: create MemoryMapped() device path for the parent if needed
  efi: libstub: fix up the last remaining open coded boot service call
  efi/arm: libstub: move ARM specific code out of generic routines
  efi/libstub: measure EFI LoadOptions
  efi/libstub: refactor the initrd measuring functions
  efi/loongarch: libstub: remove dependency on flattened DT
  efi: libstub: install boot-time memory map as config table
  efi: libstub: remove DT dependency from generic stub
  efi: libstub: unify initrd loading between architectures
  efi: libstub: remove pointless goto kludge
  efi: libstub: simplify efi_get_memory_map() and struct efi_boot_memmap
  efi: libstub: avoid efi_get_memory_map() for allocating the virt map
  efi: libstub: drop pointless get_memory_map() call
  efi: libstub: fix type confusion for load_options_size
  arm64: efi: enable generic EFI compressed boot
  loongarch: efi: enable generic EFI compressed boot
  riscv: efi: enable generic EFI compressed boot
  efi/libstub: implement generic EFI zboot
  efi/libstub: move efi_system_table global var into separate object
  ...
2022-10-09 08:56:54 -07:00
Ard Biesheuvel
c82ceb440b efi/libstub: use EFI provided memcpy/memset routines
The stub is used in different execution environments, but on arm64,
RISC-V and LoongArch, we still use the core kernel's implementation of
memcpy and memset, as they are just a branch instruction away, and can
generally be reused even from code such as the EFI stub that runs in a
completely different address space.

KAsan complicates this slightly, resulting in the need for some hacks to
expose the uninstrumented, __ prefixed versions as the normal ones, as
the latter are instrumented to include the KAsan checks, which only work
in the core kernel.

Unfortunately, #define'ing memcpy to __memcpy when building C code does
not guarantee that no explicit memcpy() calls will be emitted. And with
the upcoming zboot support, which consists of a separate binary which
therefore needs its own implementation of memcpy/memset anyway, it's
better to provide one explicitly instead of linking to the existing one.

Given that EFI exposes implementations of memmove() and memset() via the
boot services table, let's wire those up in the appropriate way, and
drop the references to the core kernel ones.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2022-09-17 15:13:21 +02:00
Mark Rutland
d926079f17 arm64: alternatives: add shared NOP callback
For each instance of an alternative, the compiler outputs a distinct
copy of the alternative instructions into a subsection. As the compiler
doesn't have special knowledge of alternatives, it cannot coalesce these
to save space.

In a defconfig kernel built with GCC 12.1.0, there are approximately
10,000 instances of alternative_has_feature_likely(), where the
replacement instruction is always a NOP. As NOPs are
position-independent, we don't need a unique copy per alternative
sequence.

This patch adds a callback to patch an alternative sequence with NOPs,
and make use of this in alternative_has_feature_likely(). So that this
can be used for other sites in future, this is written to patch multiple
instructions up to the original sequence length.

For NVHE, an alias is added to image-vars.h.

For modules, the callback is exported. Note that as modules are loaded
within 2GiB of the kernel, an alt_instr entry in a module can always
refer directly to the callback, and no special handling is necessary.

When building with GCC 12.1.0, the vmlinux is ~158KiB smaller, though
the resulting Image size is unchanged due to alignment constraints and
padding:

| % ls -al vmlinux-*
| -rwxr-xr-x 1 mark mark 134644592 Sep  1 14:52 vmlinux-after
| -rwxr-xr-x 1 mark mark 134486232 Sep  1 14:50 vmlinux-before
| % ls -al Image-*
| -rw-r--r-- 1 mark mark 37108224 Sep  1 14:52 Image-after
| -rw-r--r-- 1 mark mark 37108224 Sep  1 14:50 Image-before

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Joey Gouly <joey.gouly@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20220912162210.3626215-9-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2022-09-16 17:15:03 +01:00