Commit Graph

386 Commits

Author SHA1 Message Date
Linus Torvalds
13f24586a2 arm64 updates for 7.1 (second round):
Core features:
 
  - Add workaround for C1-Pro erratum 4193714 - early CME (SME unit)
    DVMSync acknowledgement. The fix consists of sending IPIs on TLB
    maintenance to those CPUs running in user space with SME enabled
 
  - Include kernel-hwcap.h in list of generated files (missed in a recent
    commit generating the KERNEL_HWCAP_* macros)
 
 CCA:
 
  - Fix RSI_INCOMPLETE error check in arm-cca-guest
 
 MPAM:
 
  - Fix an unmount->remount problem with the CDP emulation, uninitialised
    variable and checker warnings
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAmnmQZcACgkQa9axLQDI
 XvGouA//SXlo7hQyM41rkgRru9oqftrGg0y6nxz4Z089kv50cm3Jlf/nUuti6vah
 BMBLCGXA1iOQrGIVmuvtCxDRrfYZWpfKGuT9A0gmEoMqrGIpWl9gfBQG+uR+YrQX
 4kp5DLqB85WrJIPiy7HUV6GQoCbFuMrRJwxl89IdWZSobaei3SczTmnttwyJtxG5
 /BMitl024TYdiOPNo8bhiML1wIJCaTHvH4IrtCHPyUHEAtsHSMy00y0OrSKBtA/9
 ZHZRpY7Po/jnL7YUs1AfYwsaSXjkvqXN0K1Tdavzm75k6lpJmbM3VsZabG/CEuvK
 PCOGV++is4Y/A+7aQsCwXKeVnY3b6AC4sextytNq0g3GZ7I+Ht9O6nbsp5ZmyXzB
 HRiFxmFS1pSQOMX9f1neKi3vxDMTy1tKPeccTTzL8dNnxTvUBXnoWfPoJh3cpbjm
 Dbhe1kksiEn01WWFacGtkIPDa9c+Bkd2T+8wrsk85Z+u3Z0JPM5PfOn6v3X9YlKl
 K7W8fhvlDL1wP+iyWcMT5zdo+xzHY4ZxuyWbi9a4RhKc6lFHVVG2mpUuPwSsh2ma
 NnxkDouriuoADHBir89U71N483HSnNfSjhlVSFYD2LFCre5KOZM4KYZ2vwWb8Sy4
 79q+BlVRUTQ5O6XjePoSPjUW4APPNviHJsF4E4IiqHkd9O5lMZU=
 =LNY2
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull more arm64 updates from Catalin Marinas:
 "The main 'feature' is a workaround for C1-Pro erratum 4193714
  requiring IPIs during TLB maintenance if a process is running in user
  space with SME enabled.

  The hardware acknowledges the DVMSync messages before completing
  in-flight SME accesses, with security implications. The workaround
  makes use of the mm_cpumask() to track the cores that need
  interrupting (arm64 hasn't used this mask before).

  The rest are fixes for MPAM, CCA and generated header that turned up
  during the merging window or shortly before.

  Summary:

  Core features:

   - Add workaround for C1-Pro erratum 4193714 - early CME (SME unit)
     DVMSync acknowledgement. The fix consists of sending IPIs on TLB
     maintenance to those CPUs running in user space with SME enabled

   - Include kernel-hwcap.h in list of generated files (missed in a
     recent commit generating the KERNEL_HWCAP_* macros)

  CCA:

   - Fix RSI_INCOMPLETE error check in arm-cca-guest

  MPAM:

   - Fix an unmount->remount problem with the CDP emulation,
     uninitialised variable and checker warnings"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm_mpam: resctrl: Make resctrl_mon_ctx_waiters static
  arm_mpam: resctrl: Fix the check for no monitor components found
  arm_mpam: resctrl: Fix MBA CDP alloc_capable handling on unmount
  virt: arm-cca-guest: fix error check for RSI_INCOMPLETE
  arm64/hwcap: Include kernel-hwcap.h in list of generated files
  arm64: errata: Work around early CME DVMSync acknowledgement
  arm64: cputype: Add C1-Pro definitions
  arm64: tlb: Pass the corresponding mm to __tlbi_sync_s1ish()
  arm64: tlb: Introduce __tlbi_sync_s1ish_{kernel,batch}() for TLB maintenance
2026-04-20 16:46:22 -07:00
Catalin Marinas
858fbd7248 Merge branch 'for-next/c1-pro-erratum-4193714' into for-next/core
* for-next/c1-pro-erratum-4193714:
  : Work around C1-Pro erratum 4193714 (CVE-2026-0995)
  arm64: errata: Work around early CME DVMSync acknowledgement
  arm64: cputype: Add C1-Pro definitions
  arm64: tlb: Pass the corresponding mm to __tlbi_sync_s1ish()
  arm64: tlb: Introduce __tlbi_sync_s1ish_{kernel,batch}() for TLB maintenance
2026-04-20 13:12:35 +01:00
Linus Torvalds
01f492e181 Arm:
- Add support for tracing in the standalone EL2 hypervisor code, which
   should help both debugging and performance analysis.  This uses the
   new infrastructure for 'remote' trace buffers that can be exposed
   by non-kernel entities such as firmware, and which came through the
   tracing tree.
 
 - Add support for GICv5 Per Processor Interrupts (PPIs), as the starting
   point for supporting the new GIC architecture in KVM.
 
 - Finally add support for pKVM protected guests, where pages are unmapped
   from the host as they are faulted into the guest and can be shared back
   from the guest using pKVM hypercalls.  Protected guests are created
   using a new machine type identifier.  As the elusive guestmem has not
   yet delivered on its promises, anonymous memory is also supported.
 
   This is only a first step towards full isolation from the host; for
   example, the CPU register state and DMA accesses are not yet isolated.
   Because this does not really yet bring fully what it promises, it is
   hidden behind CONFIG_ARM_PKVM_GUEST + 'kvm-arm.mode=protected', and
   also triggers TAINT_USER when a VM is created.  Caveat emptor.
 
 - Rework the dreaded user_mem_abort() function to make it more
   maintainable, reducing the amount of state being exposed to the
   various helpers and rendering a substantial amount of state immutable.
 
 - Expand the Stage-2 page table dumper to support NV shadow page tables
   on a per-VM basis.
 
 - Tidy up the pKVM PSCI proxy code to be slightly less hard to follow.
 
 - Fix both SPE and TRBE in non-VHE configurations so that they do not
   generate spurious, out of context table walks that ultimately lead
   to very bad HW lockups.
 
 - A small set of patches fixing the Stage-2 MMU freeing in error cases.
 
 - Tighten-up accepted SMC immediate value to be only #0 for host
   SMCCC calls.
 
 - The usual cleanups and other selftest churn.
 
 LoongArch:
 
 - Use CSR_CRMD_PLV for kvm_arch_vcpu_in_kernel().
 
 - Add DMSINTC irqchip in kernel support.
 
 RISC-V:
 
 - Fix steal time shared memory alignment checks
 
 - Fix vector context allocation leak
 
 - Fix array out-of-bounds in pmu_ctr_read() and pmu_fw_ctr_read_hi()
 
 - Fix double-free of sdata in kvm_pmu_clear_snapshot_area()
 
 - Fix integer overflow in kvm_pmu_validate_counter_mask()
 
 - Fix shift-out-of-bounds in make_xfence_request()
 
 - Fix lost write protection on huge pages during dirty logging
 
 - Split huge pages during fault handling for dirty logging
 
 - Skip CSR restore if VCPU is reloaded on the same core
 
 - Implement kvm_arch_has_default_irqchip() for KVM selftests
 
 - Factored-out ISA checks into separate sources
 
 - Added hideleg to struct kvm_vcpu_config
 
 - Factored-out VCPU config into separate sources
 
 - Support configuration of per-VM HGATP mode from KVM user space
 
 s390:
 
 - Support for ESA (31-bit) guests inside nested hypervisors.
 
 - Remove restriction on memslot alignment, which is not needed anymore with
   the new gmap code.
 
 - Fix LPSW/E to update the bear (which of course is the breaking event
   address register).
 
 x86:
 
 - Shut up various UBSAN warnings on reading module parameter before they
   were initialized.
 
 - Don't zero-allocate page tables that are used for splitting hugepages in
   the TDP MMU, as KVM is guaranteed to set all SPTEs in the page table and
   thus write all bytes.
 
 - As an optimization, bail early when trying to unsync 4KiB mappings if the
   target gfn can just be mapped with a 2MiB hugepage.
 
 x86 generic:
 
 - Copy single-chunk MMIO write values into struct kvm_vcpu (more precisely
   struct kvm_mmio_fragment) to fix use-after-free stack bugs where KVM
   would dereference stack pointer after an exit to userspace.
 
 - Clean up and comment the emulated MMIO code to try to make it easier to
   maintain (not necessarily "easy", but "easier").
 
 - Move VMXON+VMXOFF and EFER.SVME toggling out of KVM (not *all* of VMX
   and SVM enabling) as it is needed for trusted I/O.
 
 - Advertise support for AVX512 Bit Matrix Multiply (BMM) instructions
 
 - Immediately fail the build if a required #define is missing in one of
   KVM's headers that is included multiple times.
 
 - Reject SET_GUEST_DEBUG with -EBUSY if there's an already injected
   exception, mostly to prevent syzkaller from abusing the uAPI to
   trigger WARNs, but also because it can help prevent userspace from
   unintentionally crashing the VM.
 
 - Exempt SMM from CPUID faulting on Intel, as per the spec.
 
 - Misc hardening and cleanup changes.
 
 x86 (AMD):
 
 - Fix and optimize IRQ window inhibit handling for AVIC; make it per-vCPU
   so that KVM doesn't prematurely re-enable AVIC if multiple
   vCPUs have to-be-injected IRQs.
 
 - Clean up and optimize the OSVW handling, avoiding a bug in which KVM would
   overwrite state when enabling virtualization on multiple CPUs in parallel.
   This should not be a problem because OSVW should usually be the same for
   all CPUs.
 
 - Drop a WARN in KVM_MEMORY_ENCRYPT_REG_REGION where KVM complains about a
   "too large" size based purely on user input.
 
 - Clean up and harden the pinning code for KVM_MEMORY_ENCRYPT_REG_REGION.
 
 - Disallow synchronizing a VMSA of an already-launched/encrypted vCPU, as
   doing so for an SNP guest will crash the host due to an RMP violation
   page fault.
 
 - Overhaul KVM's APIs for detecting SEV+ guests so that VM-scoped queries
   are required to hold kvm->lock, and enforce it by lockdep.  Fix various
   bugs where sev_guest() was not ensured to be stable for the whole
   duration of a function or ioctl.
 
 - Convert a pile of kvm->lock SEV code to guard().
 
 - Play nicer with userspace that does not enable KVM_CAP_EXCEPTION_PAYLOAD,
   for which KVM needs to set CR2 and DR6 as a response to ioctls such as
   KVM_GET_VCPU_EVENTS (even if the payload would end up in EXITINFO2
   rather than CR2, for example).  Only set CR2 and DR6 when consumption of
   the payload is imminent, but on the other hand force delivery of the
   payload in all paths where userspace retrieves CR2 or DR6.
 
 - Use vcpu->arch.cr2 when updating vmcb12's CR2 on nested #VMEXIT instead
   of vmcb02->save.cr2.  The value is out of sync after a save/restore
   or after a #PF is injected into L2.
 
 - Fix a class of nSVM bugs where some fields written by the CPU are not
   synchronized from vmcb02 to cached vmcb12 after VMRUN, and so are not
   up-to-date when saved by KVM_GET_NESTED_STATE.
 
 - Fix a class of bugs where the ordering between KVM_SET_NESTED_STATE and
   KVM_SET_{S}REGS could cause vmcb02 to be incorrectly initialized after
   save+restore.
 
 - Add a variety of missing nSVM consistency checks.
 
 - Fix several bugs where KVM failed to correctly update VMCB fields on
   nested #VMEXIT.
 
 - Fix several bugs where KVM failed to correctly synthesize #UD or #GP for
   SVM-related instructions.
 
 - Add support for save+restore of virtualized LBRs (on SVM).
 
 - Refactor various helpers and macros to improve clarity and (hopefully)
   make the code easier to maintain.
 
 - Aggressively sanitize fields when copying from vmcb12, to guard against
   unintentionally allowing L1 to utilize yet-to-be-defined features.
 
 - Fix several bugs where KVM botched rAX legality checks when emulating SVM
   instructions.  There are remaining issues in that KVM doesn't handle size
   prefix overrides for 64-bit guests.
 
 - Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails instead of
   somewhat arbitrarily synthesizing #GP (i.e. don't double down on AMD's
   architectural but sketchy behavior of generating #GP for "unsupported"
   addresses).
 
 - Cache all used vmcb12 fields to further harden against TOCTOU bugs.
 
 x86 (Intel):
 
 - Drop obsolete branch hint prefixes from the VMX instruction macros.
 
 - Use ASM_INPUT_RM() in __vmcs_writel() to coerce clang into using a
   register input when appropriate.
 
 - Code cleanups.
 
 guest_memfd:
 
 - Don't mark guest_memfd folios as accessed, as guest_memfd doesn't support
   reclaim, the memory is unevictable, and there is no storage to write
   back to.
 
 LoongArch selftests:
 
 - Add KVM PMU test cases
 
 s390 selftests:
 
 - Enable more memory selftests.
 
 x86 selftests:
 
 - Add support for Hygon CPUs in KVM selftests.
 
 - Fix a bug in the MSR test where it would get false failures on AMD/Hygon
   CPUs with exactly one of RDPID or RDTSCP.
 
 - Add an MADV_COLLAPSE testcase for guest_memfd as a regression test for a
   bug where the kernel would attempt to collapse guest_memfd folios against
   KVM's will.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmnftRQUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroPAzwf+NKO4Ktv+7A22ImN0SBl0nlUuulsz
 vTcw3+hxdRoIw83GdNS+hG5js0wrpMDnbv3t4+VliDNBSSxrBzcSWX2wpilW0Xtw
 qGo1MWhs2lKPy1NlaRVOwPS6j7uF3AR0TQ1iQLGMedQuCU9WpiKJxyhNXJdbLrt3
 8EgFzsvtEsv+jKNRUNDf9+d0j4gZsFyIe+Brhianbw+u3/UCiUClLCdsKPc4+5ZX
 08otYXytacGNIf/5Ev1vT4pHkHL0yqKXAtX7LEtaS3+0KrPuLjV4slemivzE9vf5
 Evafm5AhA4wpaNMb1ZerhY3T94lsMaJpWxotjR//0Q7C9B59pCQnXCm8mg==
 =CcE0
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "Arm:

   - Add support for tracing in the standalone EL2 hypervisor code,
     which should help both debugging and performance analysis. This
     uses the new infrastructure for 'remote' trace buffers that can be
     exposed by non-kernel entities such as firmware, and which came
     through the tracing tree

   - Add support for GICv5 Per Processor Interrupts (PPIs), as the
     starting point for supporting the new GIC architecture in KVM

   - Finally add support for pKVM protected guests, where pages are
     unmapped from the host as they are faulted into the guest and can
     be shared back from the guest using pKVM hypercalls. Protected
     guests are created using a new machine type identifier. As the
     elusive guestmem has not yet delivered on its promises, anonymous
     memory is also supported

     This is only a first step towards full isolation from the host; for
     example, the CPU register state and DMA accesses are not yet
     isolated. Because this does not really yet bring fully what it
     promises, it is hidden behind CONFIG_ARM_PKVM_GUEST +
     'kvm-arm.mode=protected', and also triggers TAINT_USER when a VM is
     created. Caveat emptor

   - Rework the dreaded user_mem_abort() function to make it more
     maintainable, reducing the amount of state being exposed to the
     various helpers and rendering a substantial amount of state
     immutable

   - Expand the Stage-2 page table dumper to support NV shadow page
     tables on a per-VM basis

   - Tidy up the pKVM PSCI proxy code to be slightly less hard to
     follow

   - Fix both SPE and TRBE in non-VHE configurations so that they do not
     generate spurious, out of context table walks that ultimately lead
     to very bad HW lockups

   - A small set of patches fixing the Stage-2 MMU freeing in error
     cases

   - Tighten-up accepted SMC immediate value to be only #0 for host
     SMCCC calls

   - The usual cleanups and other selftest churn

  LoongArch:

   - Use CSR_CRMD_PLV for kvm_arch_vcpu_in_kernel()

   - Add DMSINTC irqchip in kernel support

  RISC-V:

   - Fix steal time shared memory alignment checks

   - Fix vector context allocation leak

   - Fix array out-of-bounds in pmu_ctr_read() and pmu_fw_ctr_read_hi()

   - Fix double-free of sdata in kvm_pmu_clear_snapshot_area()

   - Fix integer overflow in kvm_pmu_validate_counter_mask()

   - Fix shift-out-of-bounds in make_xfence_request()

   - Fix lost write protection on huge pages during dirty logging

   - Split huge pages during fault handling for dirty logging

   - Skip CSR restore if VCPU is reloaded on the same core

   - Implement kvm_arch_has_default_irqchip() for KVM selftests

   - Factored-out ISA checks into separate sources

   - Added hideleg to struct kvm_vcpu_config

   - Factored-out VCPU config into separate sources

   - Support configuration of per-VM HGATP mode from KVM user space

  s390:

   - Support for ESA (31-bit) guests inside nested hypervisors

   - Remove restriction on memslot alignment, which is not needed
     anymore with the new gmap code

   - Fix LPSW/E to update the bear (which of course is the breaking
     event address register)

  x86:

   - Shut up various UBSAN warnings on reading module parameter before
     they were initialized

   - Don't zero-allocate page tables that are used for splitting
     hugepages in the TDP MMU, as KVM is guaranteed to set all SPTEs in
     the page table and thus write all bytes

   - As an optimization, bail early when trying to unsync 4KiB mappings
     if the target gfn can just be mapped with a 2MiB hugepage

  x86 generic:

   - Copy single-chunk MMIO write values into struct kvm_vcpu (more
     precisely struct kvm_mmio_fragment) to fix use-after-free stack
     bugs where KVM would dereference stack pointer after an exit to
     userspace

   - Clean up and comment the emulated MMIO code to try to make it
     easier to maintain (not necessarily "easy", but "easier")

   - Move VMXON+VMXOFF and EFER.SVME toggling out of KVM (not *all* of
     VMX and SVM enabling) as it is needed for trusted I/O

   - Advertise support for AVX512 Bit Matrix Multiply (BMM) instructions

   - Immediately fail the build if a required #define is missing in one
     of KVM's headers that is included multiple times

   - Reject SET_GUEST_DEBUG with -EBUSY if there's an already injected
     exception, mostly to prevent syzkaller from abusing the uAPI to
     trigger WARNs, but also because it can help prevent userspace from
     unintentionally crashing the VM

   - Exempt SMM from CPUID faulting on Intel, as per the spec

   - Misc hardening and cleanup changes

  x86 (AMD):

   - Fix and optimize IRQ window inhibit handling for AVIC; make it
     per-vCPU so that KVM doesn't prematurely re-enable AVIC if multiple
     vCPUs have to-be-injected IRQs

   - Clean up and optimize the OSVW handling, avoiding a bug in which
     KVM would overwrite state when enabling virtualization on multiple
     CPUs in parallel. This should not be a problem because OSVW should
     usually be the same for all CPUs

   - Drop a WARN in KVM_MEMORY_ENCRYPT_REG_REGION where KVM complains
     about a "too large" size based purely on user input

   - Clean up and harden the pinning code for KVM_MEMORY_ENCRYPT_REG_REGION

   - Disallow synchronizing a VMSA of an already-launched/encrypted
     vCPU, as doing so for an SNP guest will crash the host due to an
     RMP violation page fault

   - Overhaul KVM's APIs for detecting SEV+ guests so that VM-scoped
     queries are required to hold kvm->lock, and enforce it by lockdep.
     Fix various bugs where sev_guest() was not ensured to be stable for
     the whole duration of a function or ioctl

   - Convert a pile of kvm->lock SEV code to guard()

   - Play nicer with userspace that does not enable
     KVM_CAP_EXCEPTION_PAYLOAD, for which KVM needs to set CR2 and DR6
     as a response to ioctls such as KVM_GET_VCPU_EVENTS (even if the
     payload would end up in EXITINFO2 rather than CR2, for example).
     Only set CR2 and DR6 when consumption of the payload is imminent,
     but on the other hand force delivery of the payload in all paths
     where userspace retrieves CR2 or DR6

   - Use vcpu->arch.cr2 when updating vmcb12's CR2 on nested #VMEXIT
     instead of vmcb02->save.cr2. The value is out of sync after a
     save/restore or after a #PF is injected into L2

   - Fix a class of nSVM bugs where some fields written by the CPU are
     not synchronized from vmcb02 to cached vmcb12 after VMRUN, and so
     are not up-to-date when saved by KVM_GET_NESTED_STATE

   - Fix a class of bugs where the ordering between KVM_SET_NESTED_STATE
     and KVM_SET_{S}REGS could cause vmcb02 to be incorrectly
     initialized after save+restore

   - Add a variety of missing nSVM consistency checks

   - Fix several bugs where KVM failed to correctly update VMCB fields
     on nested #VMEXIT

   - Fix several bugs where KVM failed to correctly synthesize #UD or
     #GP for SVM-related instructions

   - Add support for save+restore of virtualized LBRs (on SVM)

   - Refactor various helpers and macros to improve clarity and
     (hopefully) make the code easier to maintain

   - Aggressively sanitize fields when copying from vmcb12, to guard
     against unintentionally allowing L1 to utilize yet-to-be-defined
     features

   - Fix several bugs where KVM botched rAX legality checks when
     emulating SVM instructions. There are remaining issues in that KVM
     doesn't handle size prefix overrides for 64-bit guests

   - Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails
     instead of somewhat arbitrarily synthesizing #GP (i.e. don't double
     down on AMD's architectural but sketchy behavior of generating #GP
     for "unsupported" addresses)

   - Cache all used vmcb12 fields to further harden against TOCTOU bugs

  x86 (Intel):

   - Drop obsolete branch hint prefixes from the VMX instruction macros

   - Use ASM_INPUT_RM() in __vmcs_writel() to coerce clang into using a
     register input when appropriate

   - Code cleanups

  guest_memfd:

   - Don't mark guest_memfd folios as accessed, as guest_memfd doesn't
     support reclaim, the memory is unevictable, and there is no storage
     to write back to

  LoongArch selftests:

   - Add KVM PMU test cases

  s390 selftests:

   - Enable more memory selftests

  x86 selftests:

   - Add support for Hygon CPUs in KVM selftests

   - Fix a bug in the MSR test where it would get false failures on
     AMD/Hygon CPUs with exactly one of RDPID or RDTSCP

   - Add an MADV_COLLAPSE testcase for guest_memfd as a regression test
     for a bug where the kernel would attempt to collapse guest_memfd
     folios against KVM's will"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (373 commits)
  KVM: x86: use inlines instead of macros for is_sev_*guest
  x86/virt: Treat SVM as unsupported when running as an SEV+ guest
  KVM: SEV: Goto an existing error label if charging misc_cg for an ASID fails
  KVM: SVM: Move lock-protected allocation of SEV ASID into a separate helper
  KVM: SEV: use mutex guard in snp_handle_guest_req()
  KVM: SEV: use mutex guard in sev_mem_enc_unregister_region()
  KVM: SEV: use mutex guard in sev_mem_enc_ioctl()
  KVM: SEV: use mutex guard in snp_launch_update()
  KVM: SEV: Assert that kvm->lock is held when querying SEV+ support
  KVM: SEV: Document that checking for SEV+ guests when reclaiming memory is "safe"
  KVM: SEV: Hide "struct kvm_sev_info" behind CONFIG_KVM_AMD_SEV=y
  KVM: SEV: WARN on unhandled VM type when initializing VM
  KVM: LoongArch: selftests: Add PMU overflow interrupt test
  KVM: LoongArch: selftests: Add basic PMU event counting test
  KVM: LoongArch: selftests: Add cpucfg read/write helpers
  LoongArch: KVM: Add DMSINTC inject msi to vCPU
  LoongArch: KVM: Add DMSINTC device support
  LoongArch: KVM: Make vcpu_is_preempted() as a macro rather than function
  LoongArch: KVM: Move host CSR_GSTAT save and restore in context switch
  LoongArch: KVM: Move host CSR_EENTRY save and restore in context switch
  ...
2026-04-17 07:18:03 -07:00
Catalin Marinas
0baba94a97 arm64: errata: Work around early CME DVMSync acknowledgement
C1-Pro acknowledges DVMSync messages before completing the SME/CME
memory accesses. Work around this by issuing an IPI to the affected CPUs
if they are running in EL0 with SME enabled.

Note that we avoid the local DSB in the IPI handler as the kernel runs
with SCTLR_EL1.IESB=1. This is sufficient to complete SME memory
accesses at EL0 on taking an exception to EL1. On the return to user
path, no barrier is necessary either. See the comment in
sme_set_active() and the more detailed explanation in the link below.

To avoid a potential IPI flood from malicious applications (e.g.
madvise(MADV_PAGEOUT) in a tight loop), track where a process is active
via mm_cpumask() and only interrupt those CPUs.

Link: https://lore.kernel.org/r/ablEXwhfKyJW1i7l@J2N7QTR9R3
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Mark Brown <broonie@kernel.org>
Reviewed-by: Will Deacon <will@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2026-04-10 19:46:14 +01:00
Catalin Marinas
480a9e57cc Merge branches 'for-next/misc', 'for-next/tlbflush', 'for-next/ttbr-macros-cleanup', 'for-next/kselftest', 'for-next/feat_lsui', 'for-next/mpam', 'for-next/hotplug-batched-tlbi', 'for-next/bbml2-fixes', 'for-next/sysreg', 'for-next/generic-entry' and 'for-next/acpi', remote-tracking branches 'arm64/for-next/perf' and 'arm64/for-next/read-once' into for-next/core
* arm64/for-next/perf:
  : Perf updates
  perf/arm-cmn: Fix resource_size_t printk specifier in arm_cmn_init_dtc()
  perf/arm-cmn: Fix incorrect error check for devm_ioremap()
  perf: add NVIDIA Tegra410 C2C PMU
  perf: add NVIDIA Tegra410 CPU Memory Latency PMU
  perf/arm_cspmu: nvidia: Add Tegra410 PCIE-TGT PMU
  perf/arm_cspmu: nvidia: Add Tegra410 PCIE PMU
  perf/arm_cspmu: Add arm_cspmu_acpi_dev_get
  perf/arm_cspmu: nvidia: Add Tegra410 UCF PMU
  perf/arm_cspmu: nvidia: Rename doc to Tegra241
  perf/arm-cmn: Stop claiming entire iomem region
  arm64: cpufeature: Use pmuv3_implemented() function
  arm64: cpufeature: Make PMUVer and PerfMon unsigned
  KVM: arm64: Read PMUVer as unsigned

* arm64/for-next/read-once:
  : Fixes for __READ_ONCE() with CONFIG_LTO=y
  arm64, compiler-context-analysis: Permit alias analysis through __READ_ONCE() with CONFIG_LTO=y
  arm64: Optimize __READ_ONCE() with CONFIG_LTO=y

* for-next/misc:
  : Miscellaneous cleanups/fixes
  arm64: rsi: use linear-map alias for realm config buffer
  arm64: Kconfig: fix duplicate word in CMDLINE help text
  arm64: mte: Skip TFSR_EL1 checks and barriers in synchronous tag check mode
  arm64/hwcap: Generate the KERNEL_HWCAP_ definitions for the hwcaps
  arm64: kexec: Remove duplicate allocation for trans_pgd
  arm64: mm: Use generic enum pgtable_level
  arm64: scs: Remove redundant save/restore of SCS SP on entry to/from EL0
  arm64: remove ARCH_INLINE_*

* for-next/tlbflush:
  : Refactor the arm64 TLB invalidation API and implementation
  arm64: mm: __ptep_set_access_flags must hint correct TTL
  arm64: mm: Provide level hint for flush_tlb_page()
  arm64: mm: Wrap flush_tlb_page() around __do_flush_tlb_range()
  arm64: mm: More flags for __flush_tlb_range()
  arm64: mm: Refactor __flush_tlb_range() to take flags
  arm64: mm: Refactor flush_tlb_page() to use __tlbi_level_asid()
  arm64: mm: Simplify __flush_tlb_range_limit_excess()
  arm64: mm: Simplify __TLBI_RANGE_NUM() macro
  arm64: mm: Re-implement the __flush_tlb_range_op macro in C
  arm64: mm: Inline __TLBI_VADDR_RANGE() into __tlbi_range()
  arm64: mm: Push __TLBI_VADDR() into __tlbi_level()
  arm64: mm: Implicitly invalidate user ASID based on TLBI operation
  arm64: mm: Introduce a C wrapper for by-range TLB invalidation
  arm64: mm: Re-implement the __tlbi_level macro as a C function

* for-next/ttbr-macros-cleanup:
  : Cleanups of the TTBR1_* macros
  arm64/mm: Directly use TTBRx_EL1_CnP
  arm64/mm: Directly use TTBRx_EL1_ASID_MASK
  arm64/mm: Describe TTBR1_BADDR_4852_OFFSET

* for-next/kselftest:
  : arm64 kselftest updates
  selftests/arm64: Implement cmpbr_sigill() to hwcap test

* for-next/feat_lsui:
  : Futex support using FEAT_LSUI instructions to avoid toggling PAN
  arm64: armv8_deprecated: Disable swp emulation when FEAT_LSUI present
  arm64: Kconfig: Add support for LSUI
  KVM: arm64: Use CAST instruction for swapping guest descriptor
  arm64: futex: Support futex with FEAT_LSUI
  arm64: futex: Refactor futex atomic operation
  KVM: arm64: kselftest: set_id_regs: Add test for FEAT_LSUI
  KVM: arm64: Expose FEAT_LSUI to guests
  arm64: cpufeature: Add FEAT_LSUI

* for-next/mpam: (40 commits)
  : Expose MPAM to user-space via resctrl:
  :  - Add architecture context-switch and hiding of the feature from KVM.
  :  - Add interface to allow MPAM to be exposed to user-space using resctrl.
  :  - Add errata workaoround for some existing platforms.
  :  - Add documentation for using MPAM and what shape of platforms can use resctrl
  arm64: mpam: Add initial MPAM documentation
  arm_mpam: Quirk CMN-650's CSU NRDY behaviour
  arm_mpam: Add workaround for T241-MPAM-6
  arm_mpam: Add workaround for T241-MPAM-4
  arm_mpam: Add workaround for T241-MPAM-1
  arm_mpam: Add quirk framework
  arm_mpam: resctrl: Call resctrl_init() on platforms that can support resctrl
  arm64: mpam: Select ARCH_HAS_CPU_RESCTRL
  arm_mpam: resctrl: Add empty definitions for assorted resctrl functions
  arm_mpam: resctrl: Update the rmid reallocation limit
  arm_mpam: resctrl: Add resctrl_arch_rmid_read()
  arm_mpam: resctrl: Allow resctrl to allocate monitors
  arm_mpam: resctrl: Add support for csu counters
  arm_mpam: resctrl: Add monitor initialisation and domain boilerplate
  arm_mpam: resctrl: Add kunit test for control format conversions
  arm_mpam: resctrl: Add support for 'MB' resource
  arm_mpam: resctrl: Wait for cacheinfo to be ready
  arm_mpam: resctrl: Add rmid index helpers
  arm_mpam: resctrl: Convert to/from MPAMs fixed-point formats
  arm_mpam: resctrl: Hide CDP emulation behind CONFIG_EXPERT
  ...

* for-next/hotplug-batched-tlbi:
  : arm64/mm: Enable batched TLB flush in unmap_hotplug_range()
  arm64/mm: Reject memory removal that splits a kernel leaf mapping
  arm64/mm: Enable batched TLB flush in unmap_hotplug_range()

* for-next/bbml2-fixes:
  : Fixes for realm guest and BBML2_NOABORT
  arm64: mm: Remove pmd_sect() and pud_sect()
  arm64: mm: Handle invalid large leaf mappings correctly
  arm64: mm: Fix rodata=full block mapping support for realm guests

* for-next/sysreg:
  : arm64 sysreg updates
  arm64/sysreg: Update ID_AA64SMFR0_EL1 description to DDI0601 2025-12
  arm64/sysreg: Update ID_AA64ZFR0_EL1 description to DDI0601 2025-12
  arm64/sysreg: Update ID_AA64FPFR0_EL1 description to DDI0601 2025-12
  arm64/sysreg: Update ID_AA64ISAR2_EL1 description to DDI0601 2025-12
  arm64/sysreg: Update ID_AA64ISAR0_EL1 description to DDI0601 2025-12
  arm64/sysreg: Update SMIDR_EL1 to DDI0601 2025-06

* for-next/generic-entry:
  : More arm64 refactoring towards using the generic entry code
  arm64: Check DAIF (and PMR) at task-switch time
  arm64: entry: Use split preemption logic
  arm64: entry: Use irqentry_{enter_from,exit_to}_kernel_mode()
  arm64: entry: Consistently prefix arm64-specific wrappers
  arm64: entry: Don't preempt with SError or Debug masked
  entry: Split preemption from irqentry_exit_to_kernel_mode()
  entry: Split kernel mode logic from irqentry_{enter,exit}()
  entry: Move irqentry_enter() prototype later
  entry: Remove local_irq_{enable,disable}_exit_to_user()
  entry: Fix stale comment for irqentry_enter()

* for-next/acpi:
  : arm64 ACPI updates
  ACPI: AGDI: fix missing newline in error message
2026-04-10 14:22:24 +01:00
Mark Brown
306736fd51 arm64/sysreg: Update ID_AA64SMFR0_EL1 description to DDI0601 2025-12
The 2025 extensions add FEAT_SME2P3, including LUT6.

Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2026-04-09 16:50:55 +01:00
Mark Brown
bf56250f34 arm64/sysreg: Update ID_AA64ZFR0_EL1 description to DDI0601 2025-12
The 2025 extensions add FEAT_SVE2P3 and FEAT_SVE_B16MM.

Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2026-04-09 16:50:55 +01:00
Mark Brown
d74576b51b arm64/sysreg: Update ID_AA64FPFR0_EL1 description to DDI0601 2025-12
The 2025 extensions add FEAT_F16MM and adjust some of the RES0 bits to be
RAZ instead as a placeholder for future extensions.

Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2026-04-09 16:50:55 +01:00
Mark Brown
bb5e1e5405 arm64/sysreg: Update ID_AA64ISAR2_EL1 description to DDI0601 2025-12
The 2025 extensions update the LUT field for new instructions added by
SVE and SME 2.3, there is no separate FEAT_ feature for these.

Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2026-04-09 16:50:55 +01:00
Mark Brown
b964aa8d68 arm64/sysreg: Update ID_AA64ISAR0_EL1 description to DDI0601 2025-12
The 2025 extensions add FEAT_F16F32DOT and FEAT_F16F32MM.

Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2026-04-09 16:50:55 +01:00
Mark Brown
abed23c3c4 arm64/hwcap: Generate the KERNEL_HWCAP_ definitions for the hwcaps
Currently for each hwcap we define both the HWCAPn_NAME definition which is
exposed to userspace and a kernel internal KERNEL_HWCAP_NAME definition
which we use internally. This is tedious and repetitive, instead use a
script to generate the KERNEL_HWCAP_ definitions from the UAPI definitions.

No functional changes intended.

Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2026-04-09 16:50:23 +01:00
Mark Brown
85b6f920a8 arm64/sysreg: Update SMIDR_EL1 to DDI0601 2025-06
Update the definition of SMIDR_EL1 in the sysreg definition to reflect the
information in DD0601 2025-06. This includes somewhat more generic ways of
describing the sharing of SMCUs, more information on supported priorities
and provides additional resolution for describing affinity groups.

Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2026-04-02 21:48:21 +01:00
Marc Zyngier
77acae60be arm64: Fix field references for ICH_PPI_DVIR[01]_EL2
The ICH_PPI_DVIR[01]_EL2 registers should refer to the ICH_PPI_DVIRx_EL2
fields, instead of ICH_PPI_DVIx_EL2.

Reviewed-by: Sascha Bischoff <sascha.bischoff@arm.com>
Fixes: 2808a83370 ("arm64/sysreg: Add remaining GICv5 ICC_ & ICH_ sysregs for KVM support")
Link: https://sashiko.dev/#/patchset/20260319154937.3619520-1-sascha.bischoff%40arm.com
Link: https://patch.msgid.link/20260401103611.357092-4-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-04-01 15:42:25 +01:00
Ben Horgan
29fa1be82b arm64/sysreg: Add MPAMSM_EL1 register
The MPAMSM_EL1 register determines the MPAM configuration for an SMCU. Add
the register definition.

Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Punit Agrawal <punit.agrawal@oss.qualcomm.com>
Tested-by: Jesse Chick <jessechick@os.amperecomputing.com>
Reviewed-by: Zeng Heng <zengheng4@huawei.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
2026-03-27 15:27:04 +00:00
Yeoreum Yun
7181f718cb arm64: cpufeature: Add FEAT_LSUI
Since Armv9.6, FEAT_LSUI introduces atomic instructions that allow
privileged code to access user memory without clearing the PSTATE.PAN
bit. Add CPU feature detection for FEAT_LSUI.

Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
[catalin.marinas@arm.com: Remove commit log references to SW_PAN]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2026-03-26 18:19:07 +00:00
Sascha Bischoff
2808a83370 arm64/sysreg: Add remaining GICv5 ICC_ & ICH_ sysregs for KVM support
Add the GICv5 system registers required to support native GICv5 guests
with KVM. Many of the GICv5 sysregs have already been added as part of
the host GICv5 driver, keeping this set relatively small. The
registers added in this change complete the set by adding those
required by KVM either directly (ICH_) or indirectly (FGTs for the
ICC_ sysregs).

The following system registers and their fields are added:

	ICC_APR_EL1
	ICC_HPPIR_EL1
	ICC_IAFFIDR_EL1
	ICH_APR_EL2
	ICH_CONTEXTR_EL2
	ICH_PPI_ACTIVER<n>_EL2
	ICH_PPI_DVI<n>_EL2
	ICH_PPI_ENABLER<n>_EL2
	ICH_PPI_PENDR<n>_EL2
	ICH_PPI_PRIORITYR<n>_EL2

Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260319154937.3619520-7-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-03-19 18:21:27 +00:00
Linus Torvalds
cb5573868e Loongarch:
- Add more CPUCFG mask bits.
 
 - Improve feature detection.
 
 - Add lazy load support for FPU and binary translation (LBT) register state.
 
 - Fix return value for memory reads from and writes to in-kernel devices.
 
 - Add support for detecting preemption from within a guest.
 
 - Add KVM steal time test case to tools/selftests.
 
 ARM:
 
 - Add support for FEAT_IDST, allowing ID registers that are not
   implemented to be reported as a normal trap rather than as an UNDEF
   exception.
 
 - Add sanitisation of the VTCR_EL2 register, fixing a number of
   UXN/PXN/XN bugs in the process.
 
 - Full handling of RESx bits, instead of only RES0, and resulting in
   SCTLR_EL2 being added to the list of sanitised registers.
 
 - More pKVM fixes for features that are not supposed to be exposed to
   guests.
 
 - Make sure that MTE being disabled on the pKVM host doesn't give it
   the ability to attack the hypervisor.
 
 - Allow pKVM's host stage-2 mappings to use the Force Write Back
   version of the memory attributes by using the "pass-through'
   encoding.
 
 - Fix trapping of ICC_DIR_EL1 on GICv5 hosts emulating GICv3 for the
   guest.
 
 - Preliminary work for guest GICv5 support.
 
 - A bunch of debugfs fixes, removing pointless custom iterators stored
   in guest data structures.
 
 - A small set of FPSIMD cleanups.
 
 - Selftest fixes addressing the incorrect alignment of page
   allocation.
 
 - Other assorted low-impact fixes and spelling fixes.
 
 RISC-V:
 
 - Fixes for issues discoverd by KVM API fuzzing in
   kvm_riscv_aia_imsic_has_attr(), kvm_riscv_aia_imsic_rw_attr(),
   and kvm_riscv_vcpu_aia_imsic_update()
 
 - Allow Zalasr, Zilsd and Zclsd extensions for Guest/VM
 
 - Transparent huge page support for hypervisor page tables
 
 - Adjust the number of available guest irq files based on MMIO
   register sizes found in the device tree or the ACPI tables
 
 - Add RISC-V specific paging modes to KVM selftests
 
 - Detect paging mode at runtime for selftests
 
 s390:
 
 - Performance improvement for vSIE (aka nested virtualization)
 
 - Completely new memory management.  s390 was a special snowflake that enlisted
   help from the architecture's page table management to build hypervisor
   page tables, in particular enabling sharing the last level of page
   tables.  This however was a lot of code (~3K lines) in order to support
   KVM, and also blocked several features.  The biggest advantages is
   that the page size of userspace is completely independent of the
   page size used by the guest: userspace can mix normal pages, THPs and
   hugetlbfs as it sees fit, and in fact transparent hugepages were not
   possible before.  It's also now possible to have nested guests and
   guests with huge pages running on the same host.
 
 - Maintainership change for s390 vfio-pci
 
 - Small quality of life improvement for protected guests
 
 x86:
 
 - Add support for giving the guest full ownership of PMU hardware (contexted
   switched around the fastpath run loop) and allowing direct access to data
   MSRs and PMCs (restricted by the vPMU model).  KVM still intercepts
   access to control registers, e.g. to enforce event filtering and to
   prevent the guest from profiling sensitive host state.  This is more
   accurate, since it has no risk of contention and thus dropped events, and
   also has significantly less overhead.
 
   For more information, see the commit message for merge commit bf2c3138ae
   ("Merge tag 'kvm-x86-pmu-6.20' of https://github.com/kvm-x86/linux into HEAD").
 
 - Disallow changing the virtual CPU model if L2 is active, for all the same
   reasons KVM disallows change the model after the first KVM_RUN.
 
 - Fix a bug where KVM would incorrectly reject host accesses to PV MSRs
   when running with KVM_CAP_ENFORCE_PV_FEATURE_CPUID enabled, even if those
   were advertised as supported to userspace,
 
 - Fix a bug with protected guest state (SEV-ES/SNP and TDX) VMs, where KVM
   would attempt to read CR3 configuring an async #PF entry.
 
 - Fail the build if EXPORT_SYMBOL_GPL or EXPORT_SYMBOL is used in KVM (for x86
   only) to enforce usage of EXPORT_SYMBOL_FOR_KVM_INTERNAL.  Only a few exports
   that are intended for external usage, and those are allowed explicitly.
 
 - When checking nested events after a vCPU is unblocked, ignore -EBUSY instead
   of WARNing.  Userspace can sometimes put the vCPU into what should be an
   impossible state, and spurious exit to userspace on -EBUSY does not really
   do anything to solve the issue.
 
 - Also throw in the towel and drop the WARN on INIT/SIPI being blocked when vCPU
   is in Wait-For-SIPI, which also resulted in playing whack-a-mole with syzkaller
   stuffing architecturally impossible states into KVM.
 
 - Add support for new Intel instructions that don't require anything beyond
   enumerating feature flags to userspace.
 
 - Grab SRCU when reading PDPTRs in KVM_GET_SREGS2.
 
 - Add WARNs to guard against modifying KVM's CPU caps outside of the intended
   setup flow, as nested VMX in particular is sensitive to unexpected changes
   in KVM's golden configuration.
 
 - Add a quirk to allow userspace to opt-in to actually suppress EOI broadcasts
   when the suppression feature is enabled by the guest (currently limited to
   split IRQCHIP, i.e. userspace I/O APIC).  Sadly, simply fixing KVM to honor
   Suppress EOI Broadcasts isn't an option as some userspaces have come to rely
   on KVM's buggy behavior (KVM advertises Supress EOI Broadcast irrespective
   of whether or not userspace I/O APIC supports Directed EOIs).
 
 - Clean up KVM's handling of marking mapped vCPU pages dirty.
 
 - Drop a pile of *ancient* sanity checks hidden behind in KVM's unused
   ASSERT() macro, most of which could be trivially triggered by the guest
   and/or user, and all of which were useless.
 
 - Fold "struct dest_map" into its sole user, "struct rtc_status", to make it
   more obvious what the weird parameter is used for, and to allow fropping
   these RTC shenanigans if CONFIG_KVM_IOAPIC=n.
 
 - Bury all of ioapic.h, i8254.h and related ioctls (including
   KVM_CREATE_IRQCHIP) behind CONFIG_KVM_IOAPIC=y.
 
 - Add a regression test for recent APICv update fixes.
 
 - Handle "hardware APIC ISR", a.k.a. SVI, updates in kvm_apic_update_apicv()
   to consolidate the updates, and to co-locate SVI updates with the updates
   for KVM's own cache of ISR information.
 
 - Drop a dead function declaration.
 
 - Minor cleanups.
 
 x86 (Intel):
 
 - Rework KVM's handling of VMCS updates while L2 is active to temporarily
   switch to vmcs01 instead of deferring the update until the next nested
   VM-Exit.  The deferred updates approach directly contributed to several
   bugs, was proving to be a maintenance burden due to the difficulty in
   auditing the correctness of deferred updates, and was polluting
   "struct nested_vmx" with a growing pile of booleans.
 
 - Fix an SGX bug where KVM would incorrectly try to handle EPCM page faults,
   and instead always reflect them into the guest.  Since KVM doesn't shadow
   EPCM entries, EPCM violations cannot be due to KVM interference and
   can't be resolved by KVM.
 
 - Fix a bug where KVM would register its posted interrupt wakeup handler even
   if loading kvm-intel.ko ultimately failed.
 
 - Disallow access to vmcb12 fields that aren't fully supported, mostly to
   avoid weirdness and complexity for FRED and other features, where KVM wants
   enable VMCS shadowing for fields that conditionally exist.
 
 - Print out the "bad" offsets and values if kvm-intel.ko refuses to load (or
   refuses to online a CPU) due to a VMCS config mismatch.
 
 x86 (AMD):
 
 - Drop a user-triggerable WARN on nested_svm_load_cr3() failure.
 
 - Add support for virtualizing ERAPS.  Note, correct virtualization of ERAPS
   relies on an upcoming, publicly announced change in the APM to reduce the
   set of conditions where hardware (i.e. KVM) *must* flush the RAP.
 
 - Ignore nSVM intercepts for instructions that are not supported according to
   L1's virtual CPU model.
 
 - Add support for expedited writes to the fast MMIO bus, a la VMX's fastpath
   for EPT Misconfig.
 
 - Don't set GIF when clearing EFER.SVME, as GIF exists independently of SVM,
   and allow userspace to restore nested state with GIF=0.
 
 - Treat exit_code as an unsigned 64-bit value through all of KVM.
 
 - Add support for fetching SNP certificates from userspace.
 
 - Fix a bug where KVM would use vmcb02 instead of vmcb01 when emulating VMLOAD
   or VMSAVE on behalf of L2.
 
 - Misc fixes and cleanups.
 
 x86 selftests:
 
 - Add a regression test for TPR<=>CR8 synchronization and IRQ masking.
 
 - Overhaul selftest's MMU infrastructure to genericize stage-2 MMU support,
   and extend x86's infrastructure to support EPT and NPT (for L2 guests).
 
 - Extend several nested VMX tests to also cover nested SVM.
 
 - Add a selftest for nested VMLOAD/VMSAVE.
 
 - Rework the nested dirty log test, originally added as a regression test for
   PML where KVM logged L2 GPAs instead of L1 GPAs, to improve test coverage
   and to hopefully make the test easier to understand and maintain.
 
 guest_memfd:
 
 - Remove kvm_gmem_populate()'s preparation tracking and half-baked hugepage
   handling.  SEV/SNP was the only user of the tracking and it can do it via
   the RMP.
 
 - Retroactively document and enforce (for SNP) that KVM_SEV_SNP_LAUNCH_UPDATE
   and KVM_TDX_INIT_MEM_REGION require the source page to be 4KiB aligned, to
   avoid non-trivial complexity for something that no known VMM seems to be
   doing and to avoid an API special case for in-place conversion, which
   simply can't support unaligned sources.
 
 - When populating guest_memfd memory, GUP the source page in common code and
   pass the refcounted page to the vendor callback, instead of letting vendor
   code do the heavy lifting.  Doing so avoids a looming deadlock bug with
   in-place due an AB-BA conflict betwee mmap_lock and guest_memfd's filemap
   invalidate lock.
 
 Generic:
 
 - Fix a bug where KVM would ignore the vCPU's selected address space when
   creating a vCPU-specific mapping of guest memory.  Actually this bug
   could not be hit even on x86, the only architecture with multiple
   address spaces, but it's a bug nevertheless.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCgAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmmNqwwUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroPaZAf/cJx5B67lnST272esz0j29MIuT/Ti
 jnf6PI9b7XubKYOtNvlu5ZW4Jsa5dqRG0qeO/JmcXDlwBf5/UkWOyvqIXyiuTl0l
 KcSUlKPtTgKZSoZpJpTppuuDE8FSYqEdcCmjNvoYzcJoPjmaeJbK6aqO0AkBbb6e
 L5InrLV7nV9iua6rFvA0s/G8/Eq2DG8M9hTRHe6NcI/z4hvslOudvpUXtC8Jygoo
 cV8vFavUwc+atrmvhAOLvSitnrjfNa4zcG6XMOlwXPfIdvi3zqTlQTgUpwGKiAGQ
 RIDUVZ/9bcWgJqbPRsdEWwaYRkNQWc5nmrAHRpEEaYV/NeBBNf4v6qfKSw==
 =SkJ1
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "Loongarch:

   - Add more CPUCFG mask bits

   - Improve feature detection

   - Add lazy load support for FPU and binary translation (LBT) register
     state

   - Fix return value for memory reads from and writes to in-kernel
     devices

   - Add support for detecting preemption from within a guest

   - Add KVM steal time test case to tools/selftests

  ARM:

   - Add support for FEAT_IDST, allowing ID registers that are not
     implemented to be reported as a normal trap rather than as an UNDEF
     exception

   - Add sanitisation of the VTCR_EL2 register, fixing a number of
     UXN/PXN/XN bugs in the process

   - Full handling of RESx bits, instead of only RES0, and resulting in
     SCTLR_EL2 being added to the list of sanitised registers

   - More pKVM fixes for features that are not supposed to be exposed to
     guests

   - Make sure that MTE being disabled on the pKVM host doesn't give it
     the ability to attack the hypervisor

   - Allow pKVM's host stage-2 mappings to use the Force Write Back
     version of the memory attributes by using the "pass-through'
     encoding

   - Fix trapping of ICC_DIR_EL1 on GICv5 hosts emulating GICv3 for the
     guest

   - Preliminary work for guest GICv5 support

   - A bunch of debugfs fixes, removing pointless custom iterators
     stored in guest data structures

   - A small set of FPSIMD cleanups

   - Selftest fixes addressing the incorrect alignment of page
     allocation

   - Other assorted low-impact fixes and spelling fixes

  RISC-V:

   - Fixes for issues discoverd by KVM API fuzzing in
     kvm_riscv_aia_imsic_has_attr(), kvm_riscv_aia_imsic_rw_attr(), and
     kvm_riscv_vcpu_aia_imsic_update()

   - Allow Zalasr, Zilsd and Zclsd extensions for Guest/VM

   - Transparent huge page support for hypervisor page tables

   - Adjust the number of available guest irq files based on MMIO
     register sizes found in the device tree or the ACPI tables

   - Add RISC-V specific paging modes to KVM selftests

   - Detect paging mode at runtime for selftests

  s390:

   - Performance improvement for vSIE (aka nested virtualization)

   - Completely new memory management. s390 was a special snowflake that
     enlisted help from the architecture's page table management to
     build hypervisor page tables, in particular enabling sharing the
     last level of page tables. This however was a lot of code (~3K
     lines) in order to support KVM, and also blocked several features.
     The biggest advantages is that the page size of userspace is
     completely independent of the page size used by the guest:
     userspace can mix normal pages, THPs and hugetlbfs as it sees fit,
     and in fact transparent hugepages were not possible before. It's
     also now possible to have nested guests and guests with huge pages
     running on the same host

   - Maintainership change for s390 vfio-pci

   - Small quality of life improvement for protected guests

  x86:

   - Add support for giving the guest full ownership of PMU hardware
     (contexted switched around the fastpath run loop) and allowing
     direct access to data MSRs and PMCs (restricted by the vPMU model).

     KVM still intercepts access to control registers, e.g. to enforce
     event filtering and to prevent the guest from profiling sensitive
     host state. This is more accurate, since it has no risk of
     contention and thus dropped events, and also has significantly less
     overhead.

     For more information, see the commit message for merge commit
     bf2c3138ae ("Merge tag 'kvm-x86-pmu-6.20' ...")

   - Disallow changing the virtual CPU model if L2 is active, for all
     the same reasons KVM disallows change the model after the first
     KVM_RUN

   - Fix a bug where KVM would incorrectly reject host accesses to PV
     MSRs when running with KVM_CAP_ENFORCE_PV_FEATURE_CPUID enabled,
     even if those were advertised as supported to userspace,

   - Fix a bug with protected guest state (SEV-ES/SNP and TDX) VMs,
     where KVM would attempt to read CR3 configuring an async #PF entry

   - Fail the build if EXPORT_SYMBOL_GPL or EXPORT_SYMBOL is used in KVM
     (for x86 only) to enforce usage of EXPORT_SYMBOL_FOR_KVM_INTERNAL.
     Only a few exports that are intended for external usage, and those
     are allowed explicitly

   - When checking nested events after a vCPU is unblocked, ignore
     -EBUSY instead of WARNing. Userspace can sometimes put the vCPU
     into what should be an impossible state, and spurious exit to
     userspace on -EBUSY does not really do anything to solve the issue

   - Also throw in the towel and drop the WARN on INIT/SIPI being
     blocked when vCPU is in Wait-For-SIPI, which also resulted in
     playing whack-a-mole with syzkaller stuffing architecturally
     impossible states into KVM

   - Add support for new Intel instructions that don't require anything
     beyond enumerating feature flags to userspace

   - Grab SRCU when reading PDPTRs in KVM_GET_SREGS2

   - Add WARNs to guard against modifying KVM's CPU caps outside of the
     intended setup flow, as nested VMX in particular is sensitive to
     unexpected changes in KVM's golden configuration

   - Add a quirk to allow userspace to opt-in to actually suppress EOI
     broadcasts when the suppression feature is enabled by the guest
     (currently limited to split IRQCHIP, i.e. userspace I/O APIC).
     Sadly, simply fixing KVM to honor Suppress EOI Broadcasts isn't an
     option as some userspaces have come to rely on KVM's buggy behavior
     (KVM advertises Supress EOI Broadcast irrespective of whether or
     not userspace I/O APIC supports Directed EOIs)

   - Clean up KVM's handling of marking mapped vCPU pages dirty

   - Drop a pile of *ancient* sanity checks hidden behind in KVM's
     unused ASSERT() macro, most of which could be trivially triggered
     by the guest and/or user, and all of which were useless

   - Fold "struct dest_map" into its sole user, "struct rtc_status", to
     make it more obvious what the weird parameter is used for, and to
     allow fropping these RTC shenanigans if CONFIG_KVM_IOAPIC=n

   - Bury all of ioapic.h, i8254.h and related ioctls (including
     KVM_CREATE_IRQCHIP) behind CONFIG_KVM_IOAPIC=y

   - Add a regression test for recent APICv update fixes

   - Handle "hardware APIC ISR", a.k.a. SVI, updates in
     kvm_apic_update_apicv() to consolidate the updates, and to
     co-locate SVI updates with the updates for KVM's own cache of ISR
     information

   - Drop a dead function declaration

   - Minor cleanups

  x86 (Intel):

   - Rework KVM's handling of VMCS updates while L2 is active to
     temporarily switch to vmcs01 instead of deferring the update until
     the next nested VM-Exit.

     The deferred updates approach directly contributed to several bugs,
     was proving to be a maintenance burden due to the difficulty in
     auditing the correctness of deferred updates, and was polluting
     "struct nested_vmx" with a growing pile of booleans

   - Fix an SGX bug where KVM would incorrectly try to handle EPCM page
     faults, and instead always reflect them into the guest. Since KVM
     doesn't shadow EPCM entries, EPCM violations cannot be due to KVM
     interference and can't be resolved by KVM

   - Fix a bug where KVM would register its posted interrupt wakeup
     handler even if loading kvm-intel.ko ultimately failed

   - Disallow access to vmcb12 fields that aren't fully supported,
     mostly to avoid weirdness and complexity for FRED and other
     features, where KVM wants enable VMCS shadowing for fields that
     conditionally exist

   - Print out the "bad" offsets and values if kvm-intel.ko refuses to
     load (or refuses to online a CPU) due to a VMCS config mismatch

  x86 (AMD):

   - Drop a user-triggerable WARN on nested_svm_load_cr3() failure

   - Add support for virtualizing ERAPS. Note, correct virtualization of
     ERAPS relies on an upcoming, publicly announced change in the APM
     to reduce the set of conditions where hardware (i.e. KVM) *must*
     flush the RAP

   - Ignore nSVM intercepts for instructions that are not supported
     according to L1's virtual CPU model

   - Add support for expedited writes to the fast MMIO bus, a la VMX's
     fastpath for EPT Misconfig

   - Don't set GIF when clearing EFER.SVME, as GIF exists independently
     of SVM, and allow userspace to restore nested state with GIF=0

   - Treat exit_code as an unsigned 64-bit value through all of KVM

   - Add support for fetching SNP certificates from userspace

   - Fix a bug where KVM would use vmcb02 instead of vmcb01 when
     emulating VMLOAD or VMSAVE on behalf of L2

   - Misc fixes and cleanups

  x86 selftests:

   - Add a regression test for TPR<=>CR8 synchronization and IRQ masking

   - Overhaul selftest's MMU infrastructure to genericize stage-2 MMU
     support, and extend x86's infrastructure to support EPT and NPT
     (for L2 guests)

   - Extend several nested VMX tests to also cover nested SVM

   - Add a selftest for nested VMLOAD/VMSAVE

   - Rework the nested dirty log test, originally added as a regression
     test for PML where KVM logged L2 GPAs instead of L1 GPAs, to
     improve test coverage and to hopefully make the test easier to
     understand and maintain

  guest_memfd:

   - Remove kvm_gmem_populate()'s preparation tracking and half-baked
     hugepage handling. SEV/SNP was the only user of the tracking and it
     can do it via the RMP

   - Retroactively document and enforce (for SNP) that
     KVM_SEV_SNP_LAUNCH_UPDATE and KVM_TDX_INIT_MEM_REGION require the
     source page to be 4KiB aligned, to avoid non-trivial complexity for
     something that no known VMM seems to be doing and to avoid an API
     special case for in-place conversion, which simply can't support
     unaligned sources

   - When populating guest_memfd memory, GUP the source page in common
     code and pass the refcounted page to the vendor callback, instead
     of letting vendor code do the heavy lifting. Doing so avoids a
     looming deadlock bug with in-place due an AB-BA conflict betwee
     mmap_lock and guest_memfd's filemap invalidate lock

  Generic:

   - Fix a bug where KVM would ignore the vCPU's selected address space
     when creating a vCPU-specific mapping of guest memory. Actually
     this bug could not be hit even on x86, the only architecture with
     multiple address spaces, but it's a bug nevertheless"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (267 commits)
  KVM: s390: Increase permitted SE header size to 1 MiB
  MAINTAINERS: Replace backup for s390 vfio-pci
  KVM: s390: vsie: Fix race in acquire_gmap_shadow()
  KVM: s390: vsie: Fix race in walk_guest_tables()
  KVM: s390: Use guest address to mark guest page dirty
  irqchip/riscv-imsic: Adjust the number of available guest irq files
  RISC-V: KVM: Transparent huge page support
  RISC-V: KVM: selftests: Add Zalasr extensions to get-reg-list test
  RISC-V: KVM: Allow Zalasr extensions for Guest/VM
  KVM: riscv: selftests: Add riscv vm satp modes
  KVM: riscv: selftests: add Zilsd and Zclsd extension to get-reg-list test
  riscv: KVM: allow Zilsd and Zclsd extensions for Guest/VM
  RISC-V: KVM: Skip IMSIC update if vCPU IMSIC state is not initialized
  RISC-V: KVM: Fix null pointer dereference in kvm_riscv_aia_imsic_rw_attr()
  RISC-V: KVM: Fix null pointer dereference in kvm_riscv_aia_imsic_has_attr()
  RISC-V: KVM: Remove unnecessary 'ret' assignment
  KVM: s390: Add explicit padding to struct kvm_s390_keyop
  KVM: LoongArch: selftests: Add steal time test case
  LoongArch: KVM: Add paravirt vcpu_is_preempted() support in guest side
  LoongArch: KVM: Add paravirt preempt feature in hypervisor side
  ...
2026-02-13 11:31:15 -08:00
Linus Torvalds
36ae1c45b2 Scheduler changes for v7.0:
Scheduler Kconfig space updates:
 
  - Further consolidate configurable preemption modes: reduce
    the number of architectures that are allowed to offer
    PREEMPT_NONE and PREEMPT_VOLUNTARY, reducing the number
    of preemption models from four to just two: 'full' and 'lazy'
    on up-to-date architectures (arm64, loongarch, powerpc,
    riscv, s390, x86).
 
    None and voluntary are only available as legacy features
    on platforms that don't implement lazy preemption yet,
    or which don't even support preemption.
 
    The goal is to eventually remove cond_resched() and
    voluntary preemption altogether.
 
    (Peter Zijlstra)
 
 RSEQ based 'scheduler time slice extension' support:
 
 This allows a thread to request a time slice extension when it
 enters a critical section to avoid contention on a resource when
 the thread is scheduled out inside of the critical section.
 
  - Add fields and constants for time slice extension
  - Provide static branch for time slice extensions
  - Add statistics for time slice extensions
  - Add prctl() to enable time slice extensions
  - Implement sys_rseq_slice_yield()
  - Implement syscall entry work for time slice extensions
  - Implement time slice extension enforcement timer
  - Reset slice extension when scheduled
  - Implement rseq_grant_slice_extension()
  - entry: Hook up rseq time slice extension
  - selftests: Implement time slice extension test
 
    (Thomas Gleixner)
 
  - Allow registering RSEQ with slice extension
  - Move slice_ext_nsec to debugfs
  - Lower default slice extension
  - selftests/rseq: Add rseq slice histogram script
 
    (Peter Zijlstra)
 
 Scheduler performance/scalability improvements:
 
  - Update rq->avg_idle when a task is moved to an idle CPU,
    which improves the scalability of various workloads.
    (Shubhang Kaushik)
 
  - Reorder fields in 'struct rq' for better caching
    (Blake Jones)
 
  - Fair scheduler SMP NOHZ balancing code speedups:
 
    - Move checking for nohz cpus after time check
    - Change likelyhood of nohz.nr_cpus
    - Remove nohz.nr_cpus and use weight of cpumask instead
 
      (Shrikanth Hegde)
 
  - Avoid false sharing for sched_clock_irqtime (Wangyang Guo)
 
  - Drop useless cpumask_empty() in find_energy_efficient_cpu()
  - Simplify task_numa_find_cpu()
  - Use cpumask_weight_and() in sched_balance_find_dst_group()
 
    (Yury Norov)
 
 DL scheduler updates:
 
  - Add a deadline server for sched_ext tasks (by Andrea Righi and
    Joel Fernandes, with fixes by Peter Zijlstra)
 
 RT scheduler updates:
 
  - Skip currently executing CPU in rto_next_cpu() (Chen Jinghuang)
 
 Entry code updates and performance improvements, which is part of the
 scheduler tree in this cycle due to interdependencies with the RSEQ
 based time slice extension work:
 
   - Remove unused syscall argument from syscall_trace_enter()
   - Rework syscall_exit_to_user_mode_work() for architecture reuse
   - Add arch_ptrace_report_syscall_entry/exit()
   - Inline syscall_exit_work() and syscall_trace_enter()
 
     (Jinjie Ruan)
 
 Scheduler core updates:
 
  - Rework sched_class::wakeup_preempt() and rq_modified_*()
  - Avoid rq->lock bouncing in sched_balance_newidle()
  - Rename rcu_dereference_check_sched_domain() =>
           rcu_dereference_sched_domain()
  - <linux/compiler_types.h>: Add the __signed_scalar_typeof() helper
 
    (Peter Zijlstra)
 
 Fair scheduler updates/refactoring:
 
  - Fold the sched_avg update
  - Change rcu_dereference_check_sched_domain() to rcu-sched
  - Switch to rcu_dereference_all()
  - Remove superfluous rcu_read_lock()
  - Limit hrtick work
 
    (Peter Zijlstra)
 
  - Join two #ifdef CONFIG_FAIR_GROUP_SCHED blocks
  - Clean up comments in 'struct cfs_rq'
  - Separate se->vlag from se->vprot
  - Rename cfs_rq::avg_load to cfs_rq::sum_weight
  - Rename cfs_rq::avg_vruntime to ::sum_w_vruntime & helper functions
  - Introduce and use the vruntime_cmp() and vruntime_op() wrappers
    for wrapped-signed aritmetics
  - Sort out 'blocked_load*' namespace noise
 
    (Ingo Molnar)
 
 Scheduler debugging code updates:
 
  - Export hidden tracepoints to modules (Gabriele Monaco)
 
  - Convert copy_from_user() + kstrtouint() to kstrtouint_from_user()
    (Fushuai Wang)
 
  - Add assertions to QUEUE_CLASS (Peter Zijlstra)
 
  - hrtimer: Fix tracing oddity (Thomas Gleixner)
 
 Misc fixes and cleanups:
 
  - Re-evaluate scheduling when migrating queued tasks out of
    throttled cgroups (Zicheng Qu)
 
  - Remove task_struct->faults_disabled_mapping (Christoph Hellwig)
 
  - Fix math notation errors in avg_vruntime comment (Zhan Xusheng)
 
  - sched/cpufreq: Use %pe format for PTR_ERR() printing (zenghongling)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmmJj+IRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1grtQ//WyXYGVE/WicdqslfaCY2Mr0uJnL0tLSM
 CJp+0LROdkmy+ChJmftO8RgjCUSsjhC4/xcBhUQXApf/ffQi3b2jH6nkTp/Z64Ms
 p2IXLkBiZjwdcO6fGbB0JE2G1J4hGRC5BlqfgkZzWidMf3kIbmrHg99mVWGzODLY
 N/cPW4d0WGf9TScl1FgEiOqgF3czMLlqvTDJqaFMpsTzSUcRBnrG4xushb4W/bBx
 573eqxgZJ6urNSGu8niY9PAl9F7gskXW3YxI3k8SH7VmJKSevWlwI9vMEhcRDzud
 E0XxD7J8iPOKtr7ypXm7anMBv4jWVUdAnPbYi4TDsyDDU/HguqMqT1McTGn8wQ+F
 jmdhmMC9/TEIzq93SNLbCYieibqDsmJoNVFFi0FWfPLMtYbcZd5a884SIz532vx4
 DegdlDXdazUwhxzDiQR3sq1CsHXpxNS2YdrpadAtF/r2gU86DQjsEew8yBvXi7bb
 Wrkzpax70sU1AFI23wJQkEb/OnnXyehAHAhhQN6GVvuiGr9P7C02WLEGLlmSmJrx
 zl2F750P76yhTfGcvTfJ/5LTfSB+yRozGvcdXnIkyzWotY6a2D1MKNusAfVax+IR
 kyfAWqVdxBhlKnqYbu92lTogvnPh3Lymd6G4TZZRkSH2jixyGd2oS7nZaDBAeBEM
 NHQtr9R+KyU=
 =Xj2f
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Scheduler Kconfig space updates:

   - Further consolidate configurable preemption modes (Peter Zijlstra)

     Reduce the number of architectures that are allowed to offer
     PREEMPT_NONE and PREEMPT_VOLUNTARY, reducing the number of
     preemption models from four to just two: 'full' and 'lazy' on
     up-to-date architectures (arm64, loongarch, powerpc, riscv, s390,
     x86).

     None and voluntary are only available as legacy features on
     platforms that don't implement lazy preemption yet, or which don't
     even support preemption.

     The goal is to eventually remove cond_resched() and voluntary
     preemption altogether.

  RSEQ based 'scheduler time slice extension' support (Thomas Gleixner
  and Peter Zijlstra):

  This allows a thread to request a time slice extension when it enters
  a critical section to avoid contention on a resource when the thread
  is scheduled out inside of the critical section.

   - Add fields and constants for time slice extension
   - Provide static branch for time slice extensions
   - Add statistics for time slice extensions
   - Add prctl() to enable time slice extensions
   - Implement sys_rseq_slice_yield()
   - Implement syscall entry work for time slice extensions
   - Implement time slice extension enforcement timer
   - Reset slice extension when scheduled
   - Implement rseq_grant_slice_extension()
   - entry: Hook up rseq time slice extension
   - selftests: Implement time slice extension test
   - Allow registering RSEQ with slice extension
   - Move slice_ext_nsec to debugfs
   - Lower default slice extension
   - selftests/rseq: Add rseq slice histogram script

  Scheduler performance/scalability improvements:

   - Update rq->avg_idle when a task is moved to an idle CPU, which
     improves the scalability of various workloads (Shubhang Kaushik)

   - Reorder fields in 'struct rq' for better caching (Blake Jones)

   - Fair scheduler SMP NOHZ balancing code speedups (Shrikanth Hegde):
      - Move checking for nohz cpus after time check
      - Change likelyhood of nohz.nr_cpus
      - Remove nohz.nr_cpus and use weight of cpumask instead

   - Avoid false sharing for sched_clock_irqtime (Wangyang Guo)

   - Cleanups (Yury Norov):
      - Drop useless cpumask_empty() in find_energy_efficient_cpu()
      - Simplify task_numa_find_cpu()
      - Use cpumask_weight_and() in sched_balance_find_dst_group()

  DL scheduler updates:

   - Add a deadline server for sched_ext tasks (by Andrea Righi and Joel
     Fernandes, with fixes by Peter Zijlstra)

  RT scheduler updates:

   - Skip currently executing CPU in rto_next_cpu() (Chen Jinghuang)

  Entry code updates and performance improvements (Jinjie Ruan)

  This is part of the scheduler tree in this cycle due to inter-
  dependencies with the RSEQ based time slice extension work:

    - Remove unused syscall argument from syscall_trace_enter()
    - Rework syscall_exit_to_user_mode_work() for architecture reuse
    - Add arch_ptrace_report_syscall_entry/exit()
    - Inline syscall_exit_work() and syscall_trace_enter()

  Scheduler core updates (Peter Zijlstra):

   - Rework sched_class::wakeup_preempt() and rq_modified_*()
   - Avoid rq->lock bouncing in sched_balance_newidle()
   - Rename rcu_dereference_check_sched_domain() =>
            rcu_dereference_sched_domain()
   - <linux/compiler_types.h>: Add the __signed_scalar_typeof() helper

  Fair scheduler updates/refactoring (Peter Zijlstra and Ingo Molnar):

   - Fold the sched_avg update
   - Change rcu_dereference_check_sched_domain() to rcu-sched
   - Switch to rcu_dereference_all()
   - Remove superfluous rcu_read_lock()
   - Limit hrtick work
   - Join two #ifdef CONFIG_FAIR_GROUP_SCHED blocks
   - Clean up comments in 'struct cfs_rq'
   - Separate se->vlag from se->vprot
   - Rename cfs_rq::avg_load to cfs_rq::sum_weight
   - Rename cfs_rq::avg_vruntime to ::sum_w_vruntime & helper functions
   - Introduce and use the vruntime_cmp() and vruntime_op() wrappers for
     wrapped-signed aritmetics
   - Sort out 'blocked_load*' namespace noise

  Scheduler debugging code updates:

   - Export hidden tracepoints to modules (Gabriele Monaco)

   - Convert copy_from_user() + kstrtouint() to kstrtouint_from_user()
     (Fushuai Wang)

   - Add assertions to QUEUE_CLASS (Peter Zijlstra)

   - hrtimer: Fix tracing oddity (Thomas Gleixner)

  Misc fixes and cleanups:

   - Re-evaluate scheduling when migrating queued tasks out of throttled
     cgroups (Zicheng Qu)

   - Remove task_struct->faults_disabled_mapping (Christoph Hellwig)

   - Fix math notation errors in avg_vruntime comment (Zhan Xusheng)

   - sched/cpufreq: Use %pe format for PTR_ERR() printing
     (zenghongling)"

* tag 'sched-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
  sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups
  sched/cpufreq: Use %pe format for PTR_ERR() printing
  sched/rt: Skip currently executing CPU in rto_next_cpu()
  sched/clock: Avoid false sharing for sched_clock_irqtime
  selftests/sched_ext: Add test for DL server total_bw consistency
  selftests/sched_ext: Add test for sched_ext dl_server
  sched/debug: Fix dl_server (re)start conditions
  sched/debug: Add support to change sched_ext server params
  sched_ext: Add a DL server for sched_ext tasks
  sched/debug: Stop and start server based on if it was active
  sched/debug: Fix updating of ppos on server write ops
  sched/deadline: Clear the defer params
  entry: Inline syscall_exit_work() and syscall_trace_enter()
  entry: Add arch_ptrace_report_syscall_entry/exit()
  entry: Rework syscall_exit_to_user_mode_work() for architecture reuse
  entry: Remove unused syscall argument from syscall_trace_enter()
  sched: remove task_struct->faults_disabled_mapping
  sched: Update rq->avg_idle when a task is moved to an idle CPU
  selftests/rseq: Add rseq slice histogram script
  hrtimer: Fix trace oddity
  ...
2026-02-10 12:50:10 -08:00
Marc Zyngier
1df3f01ebf Merge branch kvm-arm64/resx into kvmarm-master/next
* kvm-arm64/resx:
  : .
  : Add infrastructure to deal with the full gamut of RESx bits
  : for NV. As a result, it is now possible to have the expected
  : semantics for some bits such as SCTLR_EL2.SPAN.
  : .
  KVM: arm64: Add debugfs file dumping computed RESx values
  KVM: arm64: Add sanitisation to SCTLR_EL2
  KVM: arm64: Remove all traces of HCR_EL2.MIOCNCE
  KVM: arm64: Remove all traces of FEAT_TME
  KVM: arm64: Simplify handling of full register invalid constraint
  KVM: arm64: Get rid of FIXED_VALUE altogether
  KVM: arm64: Simplify handling of HCR_EL2.E2H RESx
  KVM: arm64: Move RESx into individual register descriptors
  KVM: arm64: Add RES1_WHEN_E2Hx constraints as configuration flags
  KVM: arm64: Add REQUIRES_E2H1 constraint as configuration flags
  KVM: arm64: Simplify FIXED_VALUE handling
  KVM: arm64: Convert HCR_EL2.RW to AS_RES1
  KVM: arm64: Correctly handle SCTLR_EL1 RES1 bits for unsupported features
  KVM: arm64: Allow RES1 bits to be inferred from configuration
  KVM: arm64: Inherit RESx bits from FGT register descriptors
  KVM: arm64: Extend unified RESx handling to runtime sanitisation
  KVM: arm64: Introduce data structure tracking both RES0 and RES1 bits
  KVM: arm64: Introduce standalone FGU computing primitive
  KVM: arm64: Remove duplicate configuration for SCTLR_EL1.{EE,E0E}
  arm64: Convert SCTLR_EL2 to sysreg infrastructure

Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-02-05 09:17:48 +00:00
Marc Zyngier
47e89febd3 Merge branch kvm-arm64/gicv5-prologue into kvmarm-master/next
* kvm-arm64/gicv5-prologue:
  : .
  : Prologue to GICv5 support, courtesy of Sascha Bischoff.
  :
  : This is preliminary work that sets the scene for the full-blow
  : support.
  : .
  irqchip/gic-v5: Check if impl is virt capable
  KVM: arm64: gic: Set vgic_model before initing private IRQs
  arm64/sysreg: Drop ICH_HFGRTR_EL2.ICC_HAPR_EL1 and make RES1
  KVM: arm64: gic-v3: Switch vGIC-v3 to use generated ICH_VMCR_EL2

Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-02-05 09:17:30 +00:00
Marc Zyngier
fb40cb15e8 KVM: arm64: Remove all traces of HCR_EL2.MIOCNCE
MIOCNCE had the potential to eat your data, and also was never
implemented by anyone. It's been retrospectively removed from
the architecture, and we're happy to follow that lead.

Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260202184329.2724080-19-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-02-05 09:02:13 +00:00
Marc Zyngier
d65bf6e317 KVM: arm64: Remove all traces of FEAT_TME
FEAT_TME has been dropped from the architecture. Retrospectively.
I'm sure someone is crying somewhere, but most of us won't.

Clean-up time.

Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260202184329.2724080-18-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-02-05 09:02:13 +00:00
Marc Zyngier
0a35bd285f arm64: Convert SCTLR_EL2 to sysreg infrastructure
Convert SCTLR_EL2 to the sysreg infrastructure, as per the 2025-12_rel
revision of the Registers.json file.

Note that we slightly deviate from the above, as we stick to the ARM
ARM M.a definition of SCTLR_EL2[9], which is RES0, in order to avoid
dragging the POE2 definitions...

Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260202184329.2724080-2-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-02-05 08:59:28 +00:00
Sascha Bischoff
b583177aaf arm64/sysreg: Drop ICH_HFGRTR_EL2.ICC_HAPR_EL1 and make RES1
The GICv5 architecture is dropping the ICC_HAPR_EL1 and ICV_HAPR_EL1
system registers. These registers were never added to the sysregs, but
the traps for them were.

Drop the trap bit from the ICH_HFGRTR_EL2 and make it Res1 as per the
upcoming GICv5 spec change. Additionally, update the EL2 setup code to
not attempt to set that bit.

Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260128175919.3828384-4-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-01-30 11:10:46 +00:00
Will Deacon
2f8aed5e97 Merge branch 'for-next/errata' into for-next/core
* for-next/errata:
  arm64: errata: Workaround for SI L1 downstream coherency issue
2026-01-29 12:05:33 +00:00
Lucas Wei
3fed7e0059 arm64: errata: Workaround for SI L1 downstream coherency issue
When software issues a Cache Maintenance Operation (CMO) targeting a
dirty cache line, the CPU and DSU cluster may optimize the operation by
combining the CopyBack Write and CMO into a single combined CopyBack
Write plus CMO transaction presented to the interconnect (MCN).
For these combined transactions, the MCN splits the operation into two
separate transactions, one Write and one CMO, and then propagates the
write and optionally the CMO to the downstream memory system or external
Point of Serialization (PoS).
However, the MCN may return an early CompCMO response to the DSU cluster
before the corresponding Write and CMO transactions have completed at
the external PoS or downstream memory. As a result, stale data may be
observed by external observers that are directly connected to the
external PoS or downstream memory.

This erratum affects any system topology in which the following
conditions apply:
 - The Point of Serialization (PoS) is located downstream of the
   interconnect.
 - A downstream observer accesses memory directly, bypassing the
   interconnect.

Conditions:
This erratum occurs only when all of the following conditions are met:
 1. Software executes a data cache maintenance operation, specifically,
    a clean or clean&invalidate by virtual address (DC CVAC or DC
    CIVAC), that hits on unique dirty data in the CPU or DSU cache.
    This results in a combined CopyBack and CMO being issued to the
    interconnect.
 2. The interconnect splits the combined transaction into separate Write
    and CMO transactions and returns an early completion response to the
    CPU or DSU before the write has completed at the downstream memory
    or PoS.
 3. A downstream observer accesses the affected memory address after the
    early completion response is issued but before the actual memory
    write has completed. This allows the observer to read stale data
    that has not yet been updated at the PoS or downstream memory.

The implementation of workaround put a second loop of CMOs at the same
virtual address whose operation meet erratum conditions to wait until
cache data be cleaned to PoC. This way of implementation mitigates
performance penalty compared to purely duplicate original CMO.

Signed-off-by: Lucas Wei <lucaswei@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
2026-01-23 13:30:38 +00:00
Marc Zyngier
cb6cd8a86d Merge branch kvm-arm64/feat_idst into kvmarm-master/next
* kvm-arm64/feat_idst:
  : .
  : Add support for FEAT_IDST, allowing ID registers that are not implemented
  : to be reported as a normal trap rather than as an UNDEF exception.
  : .
  KVM: arm64: selftests: Add a test for FEAT_IDST
  KVM: arm64: pkvm: Report optional ID register traps with a 0x18 syndrome
  KVM: arm64: pkvm: Add a generic synchronous exception injection primitive
  KVM: arm64: Force trap of GMID_EL1 when the guest doesn't have MTE
  KVM: arm64: Handle CSSIDR2_EL1 and SMIDR_EL1 in a generic way
  KVM: arm64: Handle FEAT_IDST for sysregs without specific handlers
  KVM: arm64: Add a generic synchronous exception injection primitive
  KVM: arm64: Add trap routing for GMID_EL1
  arm64: Repaint ID_AA64MMFR2_EL1.IDS description

Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-01-23 10:04:35 +00:00
Marc Zyngier
1c07f35ff7 Merge branch kvm-arm64/vtcr into kvmarm-master/next
* kvm-arm64/vtcr:
  : .
  : VTCR_EL2 conversion to the configuration-driven RESx framework,
  : fixing a couple of UXN/PXN/XN bugs in the process.
  : .
  KVM: arm64: nv: Return correct RES0 bits for FGT registers
  KVM: arm64: Always populate FGT masks at boot time
  KVM: arm64: Honor UX/PX attributes for EL2 S1 mappings
  KVM: arm64: Convert VTCR_EL2 to config-driven sanitisation
  KVM: arm64: Account for RES1 bits in DECLARE_FEAT_MAP() and co
  arm64: Convert VTCR_EL2 to sysreg infratructure
  arm64: Convert ID_AA64MMFR0_EL1.TGRAN{4,16,64}_2 to UnsignedEnum
  KVM: arm64: Invert KVM_PGTABLE_WALK_HANDLE_FAULT to fix pKVM walkers
  KVM: arm64: Don't blindly set set PSTATE.PAN on guest exit
  KVM: arm64: nv: Respect stage-2 write permssion when setting stage-1 AF
  KVM: arm64: Remove unused vcpu_{clear,set}_wfx_traps()
  KVM: arm64: Remove unused parameter in synchronize_vcpu_pstate()
  KVM: arm64: Remove extra argument for __pvkm_host_{share,unshare}_hyp()
  KVM: arm64: Inject UNDEF for a register trap without accessor
  KVM: arm64: Copy FGT traps to unprotected pKVM VCPU on VCPU load
  KVM: arm64: Fix EL2 S1 XN handling for hVHE setups
  KVM: arm64: gic: Check for vGICv3 when clearing TWI

Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-01-23 10:03:03 +00:00
Yicong Yang
58ce78667a arm64: Add support for FEAT_{LS64, LS64_V}
Armv8.7 introduces single-copy atomic 64-byte loads and stores
instructions and its variants named under FEAT_{LS64, LS64_V}.
These features are identified by ID_AA64ISAR1_EL1.LS64 and the
use of such instructions in userspace (EL0) can be trapped.

As st64bv (FEAT_LS64_V) and st64bv0 (FEAT_LS64_ACCDATA) can not be tell
apart, FEAT_LS64 and FEAT_LS64_ACCDATA which will be supported in later
patch will be exported to userspace, FEAT_LS64_V will be enabled only
in kernel.

In order to support the use of corresponding instructions in userspace:
- Make ID_AA64ISAR1_EL1.LS64 visbile to userspace
- Add identifying and enabling in the cpufeature list
- Expose these support of these features to userspace through HWCAP3
  and cpuinfo

ld64b/st64b (FEAT_LS64) and st64bv (FEAT_LS64_V) is intended for
special memory (device memory) so requires support by the CPU, system
and target memory location (device that support these instructions).
The HWCAP3_LS64, implies the support of CPU and system (since no
identification method from system, so SoC vendors should advertise
support in the CPU if system also support them).

Otherwise for ld64b/st64b the atomicity may not be guaranteed or a
DABT will be generated, so users (probably userspace driver developer)
should make sure the target memory (device) also have the support.
For st64bv 0xffffffffffffffff will be returned as status result for
unsupported memory so user should check it.

Document the restrictions along with HWCAP3_LS64.

Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Oliver Upton <oupton@kernel.org>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Zhou Wang <wangzhou1@hisilicon.com>
Signed-off-by: Will Deacon <will@kernel.org>
2026-01-22 13:24:49 +00:00
Thomas Gleixner
99d2592023 rseq: Implement sys_rseq_slice_yield()
Provide a new syscall which has the only purpose to yield the CPU after the
kernel granted a time slice extension.

sched_yield() is not suitable for that because it unconditionally
schedules, but the end of the time slice extension is not required to
schedule when the task was already preempted. This also allows to have a
strict check for termination to catch user space invoking random syscalls
including sched_yield() from a time slice extension region.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20251215155708.929634896@linutronix.de
2026-01-22 11:11:17 +01:00
Marc Zyngier
4a7fe842b8 arm64: Repaint ID_AA64MMFR2_EL1.IDS description
ID_AA64MMFR2_EL1.IDS, as described in the sysreg file, is pretty horrible
as it diesctly give the ESR value. Repaint it using the usual NI/IMP
identifiers to describe the absence/presence of FEAT_IDST.

Also add the new EL3 routing feature, even if we really don't care about it.

Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://patch.msgid.link/20260108173233.2911955-2-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-01-15 11:58:56 +00:00
Marc Zyngier
a035001dea arm64: Convert VTCR_EL2 to sysreg infratructure
Our definition of VTCR_EL2 is both partial (tons of fields are
missing) and totally inconsistent (some constants are shifted,
some are not). They are also expressed in terms of TCR, which is
rather inconvenient.

Replace the ad-hoc definitions with the the generated version.
This results in a bunch of additional changes to make the code
with the unshifted nature of generated enumerations.

The register data was extracted from the BSD licenced AARCHMRS
(AARCHMRS_OPENSOURCE_A_profile_FAT-2025-09_ASL0).

Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20251210173024.561160-4-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-01-15 10:58:20 +00:00
Marc Zyngier
f1640174c8 arm64: Convert ID_AA64MMFR0_EL1.TGRAN{4,16,64}_2 to UnsignedEnum
ID_AA64MMFR0_EL1.TGRAN{4,16,64}_2 are currently represented as unordered
enumerations. However, the architecture treats them as Unsigned,
as hinted to by the MRS data:

(FEAT_S2TGran4K <=> (((UInt(ID_AA64MMFR0_EL1.TGran4_2) == 0) &&
		       FEAT_TGran4K) ||
		     (UInt(ID_AA64MMFR0_EL1.TGran4_2) >= 2))))

and similar descriptions exist for 16 and 64k.

This is also confirmed by D24.1.3.3 ("Alternative ID scheme used for
ID_AA64MMFR0_EL1 stage 2 granule sizes") in the L.b revision of
the ARM ARM.

Turn these fields into UnsignedEnum so that we can use the above
description more or less literally.

Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20251210173024.561160-3-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-01-15 10:58:20 +00:00
Linus Torvalds
51d90a15fe ARM:
- Support for userspace handling of synchronous external aborts (SEAs),
   allowing the VMM to potentially handle the abort in a non-fatal
   manner.
 
 - Large rework of the VGIC's list register handling with the goal of
   supporting more active/pending IRQs than available list registers in
   hardware. In addition, the VGIC now supports EOImode==1 style
   deactivations for IRQs which may occur on a separate vCPU than the
   one that acked the IRQ.
 
 - Support for FEAT_XNX (user / privileged execute permissions) and
   FEAT_HAF (hardware update to the Access Flag) in the software page
   table walkers and shadow MMU.
 
 - Allow page table destruction to reschedule, fixing long need_resched
   latencies observed when destroying a large VM.
 
 - Minor fixes to KVM and selftests
 
 Loongarch:
 
 - Get VM PMU capability from HW GCFG register.
 
 - Add AVEC basic support.
 
 - Use 64-bit register definition for EIOINTC.
 
 - Add KVM timer test cases for tools/selftests.
 
 RISC/V:
 
 - SBI message passing (MPXY) support for KVM guest
 
 - Give a new, more specific error subcode for the case when in-kernel
   AIA virtualization fails to allocate IMSIC VS-file
 
 - Support KVM_DIRTY_LOG_INITIALLY_SET, enabling dirty log gradually
   in small chunks
 
 - Fix guest page fault within HLV* instructions
 
 - Flush VS-stage TLB after VCPU migration for Andes cores
 
 s390:
 
 - Always allocate ESCA (Extended System Control Area), instead of
   starting with the basic SCA and converting to ESCA with the
   addition of the 65th vCPU.  The price is increased number of
   exits (and worse performance) on z10 and earlier processor;
   ESCA was introduced by z114/z196 in 2010.
 
 - VIRT_XFER_TO_GUEST_WORK support
 
 - Operation exception forwarding support
 
 - Cleanups
 
 x86:
 
 - Skip the costly "zap all SPTEs" on an MMIO generation wrap if MMIO SPTE
   caching is disabled, as there can't be any relevant SPTEs to zap.
 
 - Relocate a misplaced export.
 
 - Fix an async #PF bug where KVM would clear the completion queue when the
   guest transitioned in and out of paging mode, e.g. when handling an SMI and
   then returning to paged mode via RSM.
 
 - Leave KVM's user-return notifier registered even when disabling
   virtualization, as long as kvm.ko is loaded.  On reboot/shutdown, keeping
   the notifier registered is ok; the kernel does not use the MSRs and the
   callback will run cleanly and restore host MSRs if the CPU manages to
   return to userspace before the system goes down.
 
 - Use the checked version of {get,put}_user().
 
 - Fix a long-lurking bug where KVM's lack of catch-up logic for periodic APIC
   timers can result in a hard lockup in the host.
 
 - Revert the periodic kvmclock sync logic now that KVM doesn't use a
   clocksource that's subject to NTP corrections.
 
 - Clean up KVM's handling of MMIO Stale Data and L1TF, and bury the latter
   behind CONFIG_CPU_MITIGATIONS.
 
 - Context switch XCR0, XSS, and PKRU outside of the entry/exit fast path;
   the only reason they were handled in the fast path was to paper of a bug
   in the core #MC code, and that has long since been fixed.
 
 - Add emulator support for AVX MOV instructions, to play nice with emulated
   devices whose guest drivers like to access PCI BARs with large multi-byte
   instructions.
 
 x86 (AMD):
 
 - Fix a few missing "VMCB dirty" bugs.
 
 - Fix the worst of KVM's lack of EFER.LMSLE emulation.
 
 - Add AVIC support for addressing 4k vCPUs in x2AVIC mode.
 
 - Fix incorrect handling of selective CR0 writes when checking intercepts
   during emulation of L2 instructions.
 
 - Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32] on
   VMRUN and #VMEXIT.
 
 - Fix a bug where KVM corrupt the guest code stream when re-injecting a soft
   interrupt if the guest patched the underlying code after the VM-Exit, e.g.
   when Linux patches code with a temporary INT3.
 
 - Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits to
   userspace, and extend KVM "support" to all policy bits that don't require
   any actual support from KVM.
 
 x86 (Intel):
 
 - Use the root role from kvm_mmu_page to construct EPTPs instead of the
   current vCPU state, partly as worthwhile cleanup, but mostly to pave the
   way for tracking per-root TLB flushes, and elide EPT flushes on pCPU
   migration if the root is clean from a previous flush.
 
 - Add a few missing nested consistency checks.
 
 - Rip out support for doing "early" consistency checks via hardware as the
   functionality hasn't been used in years and is no longer useful in general;
   replace it with an off-by-default module param to WARN if hardware fails
   a check that KVM does not perform.
 
 - Fix a currently-benign bug where KVM would drop the guest's SPEC_CTRL[63:32]
   on VM-Enter.
 
 - Misc cleanups.
 
 - Overhaul the TDX code to address systemic races where KVM (acting on behalf
   of userspace) could inadvertantly trigger lock contention in the TDX-Module;
   KVM was either working around these in weird, ugly ways, or was simply
   oblivious to them (though even Yan's devilish selftests could only break
   individual VMs, not the host kernel)
 
 - Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a TDX vCPU,
   if creating said vCPU failed partway through.
 
 - Fix a few sparse warnings (bad annotation, 0 != NULL).
 
 - Use struct_size() to simplify copying TDX capabilities to userspace.
 
 - Fix a bug where TDX would effectively corrupt user-return MSR values if the
   TDX Module rejects VP.ENTER and thus doesn't clobber host MSRs as expected.
 
 Selftests:
 
 - Fix a math goof in mmu_stress_test when running on a single-CPU system/VM.
 
 - Forcefully override ARCH from x86_64 to x86 to play nice with specifying
   ARCH=x86_64 on the command line.
 
 - Extend a bunch of nested VMX to validate nested SVM as well.
 
 - Add support for LA57 in the core VM_MODE_xxx macro, and add a test to
   verify KVM can save/restore nested VMX state when L1 is using 5-level
   paging, but L2 is not.
 
 - Clean up the guest paging code in anticipation of sharing the core logic for
   nested EPT and nested NPT.
 
 guest_memfd:
 
 - Add NUMA mempolicy support for guest_memfd, and clean up a variety of
   rough edges in guest_memfd along the way.
 
 - Define a CLASS to automatically handle get+put when grabbing a guest_memfd
   from a memslot to make it harder to leak references.
 
 - Enhance KVM selftests to make it easer to develop and debug selftests like
   those added for guest_memfd NUMA support, e.g. where test and/or KVM bugs
   often result in hard-to-debug SIGBUS errors.
 
 - Misc cleanups.
 
 Generic:
 
 - Use the recently-added WQ_PERCPU when creating the per-CPU workqueue for
   irqfd cleanup.
 
 - Fix a goof in the dirty ring documentation.
 
 - Fix choice of target for directed yield across different calls to
   kvm_vcpu_on_spin(); the function was always starting from the first
   vCPU instead of continuing the round-robin search.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCgAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmkvMa8UHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroMlFwf+Ow7zOYUuELSQ+Jn+hOYXiCNrdBDx
 ZamvMU8kLPr7XX0Zog6HgcMm//qyA6k5nSfqCjfsQZrIhRA/gWJ61jz1OX/Jxq18
 pJ9Vz6epnEPYiOtBwz+v8OS8MqDqVNzj2i6W1/cLPQE50c1Hhw64HWS5CSxDQiHW
 A7PVfl5YU12lW1vG3uE0sNESDt4Eh/spNM17iddXdF4ZUOGublserjDGjbc17E7H
 8BX3DkC2plqkJKwtjg0ae62hREkITZZc7RqsnftUkEhn0N0H9+rb6NKUyzIVh9NZ
 bCtCjtrKN9zfZ0Mujnms3ugBOVqNIputu/DtPnnFKXtXWSrHrgGSNv5ewA==
 =PEcw
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "ARM:

   - Support for userspace handling of synchronous external aborts
     (SEAs), allowing the VMM to potentially handle the abort in a
     non-fatal manner

   - Large rework of the VGIC's list register handling with the goal of
     supporting more active/pending IRQs than available list registers
     in hardware. In addition, the VGIC now supports EOImode==1 style
     deactivations for IRQs which may occur on a separate vCPU than the
     one that acked the IRQ

   - Support for FEAT_XNX (user / privileged execute permissions) and
     FEAT_HAF (hardware update to the Access Flag) in the software page
     table walkers and shadow MMU

   - Allow page table destruction to reschedule, fixing long
     need_resched latencies observed when destroying a large VM

   - Minor fixes to KVM and selftests

  Loongarch:

   - Get VM PMU capability from HW GCFG register

   - Add AVEC basic support

   - Use 64-bit register definition for EIOINTC

   - Add KVM timer test cases for tools/selftests

  RISC/V:

   - SBI message passing (MPXY) support for KVM guest

   - Give a new, more specific error subcode for the case when in-kernel
     AIA virtualization fails to allocate IMSIC VS-file

   - Support KVM_DIRTY_LOG_INITIALLY_SET, enabling dirty log gradually
     in small chunks

   - Fix guest page fault within HLV* instructions

   - Flush VS-stage TLB after VCPU migration for Andes cores

  s390:

   - Always allocate ESCA (Extended System Control Area), instead of
     starting with the basic SCA and converting to ESCA with the
     addition of the 65th vCPU. The price is increased number of exits
     (and worse performance) on z10 and earlier processor; ESCA was
     introduced by z114/z196 in 2010

   - VIRT_XFER_TO_GUEST_WORK support

   - Operation exception forwarding support

   - Cleanups

  x86:

   - Skip the costly "zap all SPTEs" on an MMIO generation wrap if MMIO
     SPTE caching is disabled, as there can't be any relevant SPTEs to
     zap

   - Relocate a misplaced export

   - Fix an async #PF bug where KVM would clear the completion queue
     when the guest transitioned in and out of paging mode, e.g. when
     handling an SMI and then returning to paged mode via RSM

   - Leave KVM's user-return notifier registered even when disabling
     virtualization, as long as kvm.ko is loaded. On reboot/shutdown,
     keeping the notifier registered is ok; the kernel does not use the
     MSRs and the callback will run cleanly and restore host MSRs if the
     CPU manages to return to userspace before the system goes down

   - Use the checked version of {get,put}_user()

   - Fix a long-lurking bug where KVM's lack of catch-up logic for
     periodic APIC timers can result in a hard lockup in the host

   - Revert the periodic kvmclock sync logic now that KVM doesn't use a
     clocksource that's subject to NTP corrections

   - Clean up KVM's handling of MMIO Stale Data and L1TF, and bury the
     latter behind CONFIG_CPU_MITIGATIONS

   - Context switch XCR0, XSS, and PKRU outside of the entry/exit fast
     path; the only reason they were handled in the fast path was to
     paper of a bug in the core #MC code, and that has long since been
     fixed

   - Add emulator support for AVX MOV instructions, to play nice with
     emulated devices whose guest drivers like to access PCI BARs with
     large multi-byte instructions

  x86 (AMD):

   - Fix a few missing "VMCB dirty" bugs

   - Fix the worst of KVM's lack of EFER.LMSLE emulation

   - Add AVIC support for addressing 4k vCPUs in x2AVIC mode

   - Fix incorrect handling of selective CR0 writes when checking
     intercepts during emulation of L2 instructions

   - Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32]
     on VMRUN and #VMEXIT

   - Fix a bug where KVM corrupt the guest code stream when re-injecting
     a soft interrupt if the guest patched the underlying code after the
     VM-Exit, e.g. when Linux patches code with a temporary INT3

   - Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits
     to userspace, and extend KVM "support" to all policy bits that
     don't require any actual support from KVM

  x86 (Intel):

   - Use the root role from kvm_mmu_page to construct EPTPs instead of
     the current vCPU state, partly as worthwhile cleanup, but mostly to
     pave the way for tracking per-root TLB flushes, and elide EPT
     flushes on pCPU migration if the root is clean from a previous
     flush

   - Add a few missing nested consistency checks

   - Rip out support for doing "early" consistency checks via hardware
     as the functionality hasn't been used in years and is no longer
     useful in general; replace it with an off-by-default module param
     to WARN if hardware fails a check that KVM does not perform

   - Fix a currently-benign bug where KVM would drop the guest's
     SPEC_CTRL[63:32] on VM-Enter

   - Misc cleanups

   - Overhaul the TDX code to address systemic races where KVM (acting
     on behalf of userspace) could inadvertantly trigger lock contention
     in the TDX-Module; KVM was either working around these in weird,
     ugly ways, or was simply oblivious to them (though even Yan's
     devilish selftests could only break individual VMs, not the host
     kernel)

   - Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a
     TDX vCPU, if creating said vCPU failed partway through

   - Fix a few sparse warnings (bad annotation, 0 != NULL)

   - Use struct_size() to simplify copying TDX capabilities to userspace

   - Fix a bug where TDX would effectively corrupt user-return MSR
     values if the TDX Module rejects VP.ENTER and thus doesn't clobber
     host MSRs as expected

  Selftests:

   - Fix a math goof in mmu_stress_test when running on a single-CPU
     system/VM

   - Forcefully override ARCH from x86_64 to x86 to play nice with
     specifying ARCH=x86_64 on the command line

   - Extend a bunch of nested VMX to validate nested SVM as well

   - Add support for LA57 in the core VM_MODE_xxx macro, and add a test
     to verify KVM can save/restore nested VMX state when L1 is using
     5-level paging, but L2 is not

   - Clean up the guest paging code in anticipation of sharing the core
     logic for nested EPT and nested NPT

  guest_memfd:

   - Add NUMA mempolicy support for guest_memfd, and clean up a variety
     of rough edges in guest_memfd along the way

   - Define a CLASS to automatically handle get+put when grabbing a
     guest_memfd from a memslot to make it harder to leak references

   - Enhance KVM selftests to make it easer to develop and debug
     selftests like those added for guest_memfd NUMA support, e.g. where
     test and/or KVM bugs often result in hard-to-debug SIGBUS errors

   - Misc cleanups

  Generic:

   - Use the recently-added WQ_PERCPU when creating the per-CPU
     workqueue for irqfd cleanup

   - Fix a goof in the dirty ring documentation

   - Fix choice of target for directed yield across different calls to
     kvm_vcpu_on_spin(); the function was always starting from the first
     vCPU instead of continuing the round-robin search"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (260 commits)
  KVM: arm64: at: Update AF on software walk only if VM has FEAT_HAFDBS
  KVM: arm64: at: Use correct HA bit in TCR_EL2 when regime is EL2
  KVM: arm64: Document KVM_PGTABLE_PROT_{UX,PX}
  KVM: arm64: Fix spelling mistake "Unexpeced" -> "Unexpected"
  KVM: arm64: Add break to default case in kvm_pgtable_stage2_pte_prot()
  KVM: arm64: Add endian casting to kvm_swap_s[12]_desc()
  KVM: arm64: Fix compilation when CONFIG_ARM64_USE_LSE_ATOMICS=n
  KVM: arm64: selftests: Add test for AT emulation
  KVM: arm64: nv: Expose hardware access flag management to NV guests
  KVM: arm64: nv: Implement HW access flag management in stage-2 SW PTW
  KVM: arm64: Implement HW access flag management in stage-1 SW PTW
  KVM: arm64: Propagate PTW errors up to AT emulation
  KVM: arm64: Add helper for swapping guest descriptor
  KVM: arm64: nv: Use pgtable definitions in stage-2 walk
  KVM: arm64: Handle endianness in read helper for emulated PTW
  KVM: arm64: nv: Stop passing vCPU through void ptr in S2 PTW
  KVM: arm64: Call helper for reading descriptors directly
  KVM: arm64: nv: Advertise support for FEAT_XNX
  KVM: arm64: Teach ptdump about FEAT_XNX permissions
  KVM: s390: Use generic VIRT_XFER_TO_GUEST_WORK functions
  ...
2025-12-05 17:01:20 -08:00
Linus Torvalds
44fc84337b arm64 updates for 6.19:
Core features:
 
  - Basic Arm MPAM (Memory system resource Partitioning And Monitoring)
    driver under drivers/resctrl/ which makes use of the fs/rectrl/ API
 
 Perf and PMU:
 
  - Avoid cycle counter on multi-threaded CPUs
 
  - Extend CSPMU device probing and add additional filtering support for
    NVIDIA implementations
 
  - Add support for the PMUs on the NoC S3 interconnect
 
  - Add additional compatible strings for new Cortex and C1 CPUs
 
  - Add support for data source filtering to the SPE driver
 
  - Add support for i.MX8QM and "DB" PMU in the imx PMU driver
 
 Memory managemennt:
 
  - Avoid broadcast TLBI if page reused in write fault
 
  - Elide TLB invalidation if the old PTE was not valid
 
  - Drop redundant cpu_set_*_tcr_t0sz() macros
 
  - Propagate pgtable_alloc() errors outside of __create_pgd_mapping()
 
  - Propagate return value from __change_memory_common()
 
 ACPI and EFI:
 
  - Call EFI runtime services without disabling preemption
 
  - Remove unused ACPI function
 
 Miscellaneous:
 
  - ptrace support to disable streaming on SME-only systems
 
  - Improve sysreg generation to include a 'Prefix' descriptor
 
  - Replace __ASSEMBLY__ with __ASSEMBLER__
 
  - Align register dumps in the kselftest zt-test
 
  - Remove some no longer used macros/functions
 
  - Various spelling corrections
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAmkvMjkACgkQa9axLQDI
 XvGaGg//dtT/ZAqrWa6Yniv1LOlh837C07YdxAYTTuJ+I87DnrxIqjwbW+ye+bF+
 61RTkioeCUm3PH+ncO9gPVNi4ASZ1db3/Rc8Fb6rr1TYOI1sMIeBsbbVdRJgsbX6
 zu9197jOBHscTAeDceB6jZBDyW8iSLINPZ7LN6lGxXsZM/Vn5zfE0heKEEio6Fsx
 +AzO2vos0XcwBR9vFGXtiCDx57T+/cXUtrWfA0Cjz4nvHSgD8+ghS+Jwv+kHMt1L
 zrarqbeQfj+Iixm9PVHiazv+8THo9QdNl1yGLxDmJ4LEVPewjW5jBs8+5e8e3/Gj
 p5JEvmSyWvKTTbFoM5vhxC72A7yuT1QwAk2iCyFIxMbQ25PndHboKVp/569DzOkT
 +6CjI88sVSP6D7bVlN6pFlzc/Fa07YagnDMnMCSfk4LBjUfE3jYb+usaFydyv/rl
 jwZbJrnSF/H+uQlyoJFgOEXSoQdDsll3dv6yEsUCwbd8RqXbAe3svbguOUHSdvIj
 sCViezGZQ7Rkn6D21AfF9j6e7ceaSDaf5DWMxPI3dAxFKG8TJbCBsToR59NnoSj+
 bNEozbZ1mCxmwH8i43wZ6P0RkClvJnoXcvRA+TJj02fSZACO39d3XDNswfXWL41r
 KiWGUJZyn2lPKtiAWVX6pSBtDJ+5rFhuoFgADLX6trkxDe9/EMQ=
 =4Sb6
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Catalin Marinas:
 "These are the arm64 updates for 6.19.

  The biggest part is the Arm MPAM driver under drivers/resctrl/.
  There's a patch touching mm/ to handle spurious faults for huge pmd
  (similar to the pte version). The corresponding arm64 part allows us
  to avoid the TLB maintenance if a (huge) page is reused after a write
  fault. There's EFI refactoring to allow runtime services with
  preemption enabled and the rest is the usual perf/PMU updates and
  several cleanups/typos.

  Summary:

  Core features:

   - Basic Arm MPAM (Memory system resource Partitioning And Monitoring)
     driver under drivers/resctrl/ which makes use of the fs/rectrl/ API

  Perf and PMU:

   - Avoid cycle counter on multi-threaded CPUs

   - Extend CSPMU device probing and add additional filtering support
     for NVIDIA implementations

   - Add support for the PMUs on the NoC S3 interconnect

   - Add additional compatible strings for new Cortex and C1 CPUs

   - Add support for data source filtering to the SPE driver

   - Add support for i.MX8QM and "DB" PMU in the imx PMU driver

  Memory managemennt:

   - Avoid broadcast TLBI if page reused in write fault

   - Elide TLB invalidation if the old PTE was not valid

   - Drop redundant cpu_set_*_tcr_t0sz() macros

   - Propagate pgtable_alloc() errors outside of __create_pgd_mapping()

   - Propagate return value from __change_memory_common()

  ACPI and EFI:

   - Call EFI runtime services without disabling preemption

   - Remove unused ACPI function

  Miscellaneous:

   - ptrace support to disable streaming on SME-only systems

   - Improve sysreg generation to include a 'Prefix' descriptor

   - Replace __ASSEMBLY__ with __ASSEMBLER__

   - Align register dumps in the kselftest zt-test

   - Remove some no longer used macros/functions

   - Various spelling corrections"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (94 commits)
  arm64/mm: Document why linear map split failure upon vm_reset_perms is not problematic
  arm64/pageattr: Propagate return value from __change_memory_common
  arm64/sysreg: Remove unused define ARM64_FEATURE_FIELD_BITS
  KVM: arm64: selftests: Consider all 7 possible levels of cache
  KVM: arm64: selftests: Remove ARM64_FEATURE_FIELD_BITS and its last user
  arm64: atomics: lse: Remove unused parameters from ATOMIC_FETCH_OP_AND macros
  Documentation/arm64: Fix the typo of register names
  ACPI: GTDT: Get rid of acpi_arch_timer_mem_init()
  perf: arm_spe: Add support for filtering on data source
  perf: Add perf_event_attr::config4
  perf/imx_ddr: Add support for PMU in DB (system interconnects)
  perf/imx_ddr: Get and enable optional clks
  perf/imx_ddr: Move ida_alloc() from ddr_perf_init() to ddr_perf_probe()
  dt-bindings: perf: fsl-imx-ddr: Add compatible string for i.MX8QM, i.MX8QXP and i.MX8DXL
  arm64: remove duplicate ARCH_HAS_MEM_ENCRYPT
  arm64: mm: use untagged address to calculate page index
  MAINTAINERS: new entry for MPAM Driver
  arm_mpam: Add kunit tests for props_mismatch()
  arm_mpam: Add kunit test for bitmap reset
  arm_mpam: Add helper to reset saved mbwu state
  ...
2025-12-02 17:03:55 -08:00
Oliver Upton
3eef0c83c3 Merge branch 'kvm-arm64/nv-xnx-haf' into kvmarm/next
* kvm-arm64/nv-xnx-haf: (22 commits)
  : Support for FEAT_XNX and FEAT_HAF in nested
  :
  : Add support for a couple of MMU-related features that weren't
  : implemented by KVM's software page table walk:
  :
  :  - FEAT_XNX: Allows the hypervisor to describe execute permissions
  :    separately for EL0 and EL1
  :
  :  - FEAT_HAF: Hardware update of the Access Flag, which in the context of
  :    nested means software walkers must also set the Access Flag.
  :
  : The series also adds some basic support for testing KVM's emulation of
  : the AT instruction, including the implementation detail that AT sets the
  : Access Flag in KVM.
  KVM: arm64: at: Update AF on software walk only if VM has FEAT_HAFDBS
  KVM: arm64: at: Use correct HA bit in TCR_EL2 when regime is EL2
  KVM: arm64: Document KVM_PGTABLE_PROT_{UX,PX}
  KVM: arm64: Fix spelling mistake "Unexpeced" -> "Unexpected"
  KVM: arm64: Add break to default case in kvm_pgtable_stage2_pte_prot()
  KVM: arm64: Add endian casting to kvm_swap_s[12]_desc()
  KVM: arm64: Fix compilation when CONFIG_ARM64_USE_LSE_ATOMICS=n
  KVM: arm64: selftests: Add test for AT emulation
  KVM: arm64: nv: Expose hardware access flag management to NV guests
  KVM: arm64: nv: Implement HW access flag management in stage-2 SW PTW
  KVM: arm64: Implement HW access flag management in stage-1 SW PTW
  KVM: arm64: Propagate PTW errors up to AT emulation
  KVM: arm64: Add helper for swapping guest descriptor
  KVM: arm64: nv: Use pgtable definitions in stage-2 walk
  KVM: arm64: Handle endianness in read helper for emulated PTW
  KVM: arm64: nv: Stop passing vCPU through void ptr in S2 PTW
  KVM: arm64: Call helper for reading descriptors directly
  KVM: arm64: nv: Advertise support for FEAT_XNX
  KVM: arm64: Teach ptdump about FEAT_XNX permissions
  KVM: arm64: nv: Forward FEAT_XNX permissions to the shadow stage-2
  ...

Signed-off-by: Oliver Upton <oupton@kernel.org>
2025-12-01 00:47:41 -08:00
Marc Zyngier
2a28810cbb KVM: arm64: GICv3: Detect and work around the lack of ICV_DIR_EL1 trapping
A long time ago, an unsuspecting architect forgot to add a trap
bit for ICV_DIR_EL1 in ICH_HCR_EL2. Which was unfortunate, but
what's a bit of spec between friends? Thankfully, this was fixed
in a later revision, and ARM "deprecates" the lack of trapping
ability.

Unfortuantely, a few (billion) CPUs went out with that defect,
anything ARMv8.0 from ARM, give or take. And on these CPUs,
you can't trap DIR on its own, full stop.

As the next best thing, we can trap everything in the common group,
which is a tad expensive, but hey ho, that's what you get. You can
otherwise recycle the HW in the neaby bin.

Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://msgid.link/20251120172540.2267180-7-maz@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
2025-11-24 14:29:11 -08:00
Oliver Upton
dc31124379 arm64: Detect FEAT_XNX
Detect the feature in anticipation of using it in KVM.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Tested-by: Marc Zyngier <maz@kernel.org>
Link: https://msgid.link/20251124190158.177318-2-oupton@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
2025-11-24 14:24:44 -08:00
Sascha Bischoff
a04fbfb8a1 arm64/sysreg: Add ICH_VMCR_EL2
Add the ICH_VMCR_EL2 register, which is required for the upcoming
GICv5 KVM support. This register has two different field encodings,
based on if it is used for GICv3 or GICv5-based VMs. The
GICv5-specific field encodings are generated with a FEAT_GCIE prefix.

This register is already described in the GICv3 KVM code
directly. This will be ported across to use the generated encodings as
part of an upcoming change.

Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-13 18:09:46 +00:00
Sascha Bischoff
a0b130eedd arm64/sysreg: Move generation of RES0/RES1/UNKN to function
The RESx and UNKN define generation happens in two places
(EndSysreg and EndSysregFields), and was using nearly identical
code. Split this out into a function, and call that instead, rather
then keeping the dupliated code.

There are no changes to the generated sysregs as part of this change.

Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-13 18:09:46 +00:00
Sascha Bischoff
fe2ef46995 arm64/sysreg: Support feature-specific fields with 'Prefix' descriptor
Some system register field encodings change based on, for example the
in-use architecture features, or the context in which they are
accessed. In order to support these different field encodings,
introduce the Prefix descriptor (Prefix, EndPrefix) for describing
such sysregs.

The Prefix descriptor can be used in the following way:

        Sysreg  EXAMPLE 0    1    2    3    4
        Prefix    FEAT_A
	Field   63:0    Foo
	EndPrefix
	Prefix    FEAT_B
	Field   63:1    Bar
 	Res0    0
        EndPrefix
        Field   63:0    Baz
        EndSysreg

This will generate a single set of system register encodings (REG_,
SYS_, ...), and then generate three sets of field definitions for the
system register called EXAMPLE. The first set is prefixed by FEAT_A,
e.g. FEAT_A_EXAMPLE_Foo. The second set is prefixed by FEAT_B, e.g.,
FEAT_B_EXAMPLE_Bar. The third set is not given a prefix at all,
e.g. EXAMPLE_BAZ. For each set, a corresponding set of defines for
Res0, Res1, and Unkn is generated.

The intent for the final prefix-less fields is to describe default or
legacy field encodings. This ensure that prefixed encodings can be
added to already-present sysregs without affecting existing legacy
code. Prefixed fields must be defined before those without a prefix,
and this is checked by the generator. This ensures consisnt ordering
within the sysregs definitions.

The Prefix descriptor can be used within Sysreg or SysregFields
blocks. Field, Res0, Res1, Unkn, Rax, SignedEnum, Enum can all be used
within a Prefix block. Fields and Mapping can not. Fields that vary
with features must be described as part of a SysregFields block,
instead. Mappings, which are just a code comment, make little sense in
this context, and have hence not been included.

There are no changes to the generated system register definitions as
part of this change.

Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-13 18:09:46 +00:00
Sascha Bischoff
0aab5772a5 arm64/sysreg: Fix checks for incomplete sysreg definitions
The checks for incomplete sysreg definitions were checking if the
next_bit was greater than 0, which is incorrect and missed occasions
where bit 0 hasn't been defined for a sysreg. The reason is that
next_bit is -1 when all bits have been processed (LSB - 1).

Change the checks to use >= 0, instead. Also, set next_bit in Mapping
to -1 instead of 0 to match these new checks.

There are no changes to the generated sysreg definitons as part of
this change, and conveniently no definitions lack definitions for bit
0.

Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-13 18:09:46 +00:00
Christian Brauner
b36d4b6aa8
arch: hookup listns() system call
Add the listns() system call to all architectures.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-20-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:18 +01:00
Linus Torvalds
f3826aa996 guest_memfd:
* Add support for host userspace mapping of guest_memfd-backed memory for VM
   types that do NOT use support KVM_MEMORY_ATTRIBUTE_PRIVATE (which isn't
   precisely the same thing as CoCo VMs, since x86's SEV-MEM and SEV-ES have
   no way to detect private vs. shared).
 
   This lays the groundwork for removal of guest memory from the kernel direct
   map, as well as for limited mmap() for guest_memfd-backed memory.
 
   For more information see:
   * a6ad54137a ("Merge branch 'guest-memfd-mmap' into HEAD", 2025-08-27)
   * https://github.com/firecracker-microvm/firecracker/tree/feature/secret-hiding
     (guest_memfd in Firecracker)
   * https://lore.kernel.org/all/20250221160728.1584559-1-roypat@amazon.co.uk/
     (direct map removal)
   * https://lore.kernel.org/all/20250328153133.3504118-1-tabba@google.com/
     (mmap support)
 
 ARM:
 
 * Add support for FF-A 1.2 as the secure memory conduit for pKVM,
   allowing more registers to be used as part of the message payload.
 
 * Change the way pKVM allocates its VM handles, making sure that the
   privileged hypervisor is never tricked into using uninitialised
   data.
 
 * Speed up MMIO range registration by avoiding unnecessary RCU
   synchronisation, which results in VMs starting much quicker.
 
 * Add the dump of the instruction stream when panic-ing in the EL2
   payload, just like the rest of the kernel has always done. This will
   hopefully help debugging non-VHE setups.
 
 * Add 52bit PA support to the stage-1 page-table walker, and make use
   of it to populate the fault level reported to the guest on failing
   to translate a stage-1 walk.
 
 * Add NV support to the GICv3-on-GICv5 emulation code, ensuring
   feature parity for guests, irrespective of the host platform.
 
 * Fix some really ugly architecture problems when dealing with debug
   in a nested VM. This has some bad performance impacts, but is at
   least correct.
 
 * Add enough infrastructure to be able to disable EL2 features and
   give effective values to the EL2 control registers. This then allows
   a bunch of features to be turned off, which helps cross-host
   migration.
 
 * Large rework of the selftest infrastructure to allow most tests to
   transparently run at EL2. This is the first step towards enabling
   NV testing.
 
 * Various fixes and improvements all over the map, including one BE
   fix, just in time for the removal of the feature.
 
 LoongArch:
 
 * Detect page table walk feature on new hardware
 
 * Add sign extension with kernel MMIO/IOCSR emulation
 
 * Improve in-kernel IPI emulation
 
 * Improve in-kernel PCH-PIC emulation
 
 * Move kvm_iocsr tracepoint out of generic code
 
 RISC-V:
 
 * Added SBI FWFT extension for Guest/VM with misaligned delegation and
   pointer masking PMLEN features
 
 * Added ONE_REG interface for SBI FWFT extension
 
 * Added Zicbop and bfloat16 extensions for Guest/VM
 
 * Enabled more common KVM selftests for RISC-V
 
 * Added SBI v3.0 PMU enhancements in KVM and perf driver
 
 s390:
 
 * Improve interrupt cpu for wakeup, in particular the heuristic to decide
   which vCPU to deliver a floating interrupt to.
 
 * Clear the PTE when discarding a swapped page because of CMMA; this
   bug was introduced in 6.16 when refactoring gmap code.
 
 x86 selftests:
 
 * Add #DE coverage in the fastops test (the only exception that's guest-
   triggerable in fastop-emulated instructions).
 
 * Fix PMU selftests errors encountered on Granite Rapids (GNR), Sierra
   Forest (SRF) and Clearwater Forest (CWF).
 
 * Minor cleanups and improvements
 
 x86 (guest side):
 
 * For the legacy PCI hole (memory between TOLUD and 4GiB) to UC when
   overriding guest MTRR for TDX/SNP to fix an issue where ACPI auto-mapping
   could map devices as WB and prevent the device drivers from mapping their
   devices with UC/UC-.
 
 * Make kvm_async_pf_task_wake() a local static helper and remove its
   export.
 
 * Use native qspinlocks when running in a VM with dedicated vCPU=>pCPU
   bindings even when PV_UNHALT is unsupported.
 
 Generic:
 
 * Remove a redundant __GFP_NOWARN from kvm_setup_async_pf() as __GFP_NOWARN is
   now included in GFP_NOWAIT.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmjcGSkUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroPSPAgAnJDswU4fZ5YdJr6jGzsbSQ6utlIV
 FeEltLKQIM7Aq/uvL6PLN5Kx1Pb/d9r9ag39mDT6lq9fOfJdOLjJr2SBXPTCsrPS
 6hyNL1mlgo5qzs54T8dkMbQThlSgA4zaehsc0zl8vnwil6ygoAdrtTHqZm6V0hu/
 F/sVlikCsLix1hC0KtzwscyWYcjWtXfVoi9eU5WY6ALpQaVXfRUtwyOhGDkldr+m
 i3iDiGiLAZ5Iu3igUCIOEzSSQY0FgLJpzbwJAeUxIvomDkHGJLaR14ijvM+NkRZi
 FBo2CLbjrwXb56Rbh2ABcq0CGJ3EiU3L+CC34UaRLzbtl/2BtpetkC3irA==
 =fyov
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "This excludes the bulk of the x86 changes, which I will send
  separately. They have two not complex but relatively unusual conflicts
  so I will wait for other dust to settle.

  guest_memfd:

   - Add support for host userspace mapping of guest_memfd-backed memory
     for VM types that do NOT use support KVM_MEMORY_ATTRIBUTE_PRIVATE
     (which isn't precisely the same thing as CoCo VMs, since x86's
     SEV-MEM and SEV-ES have no way to detect private vs. shared).

     This lays the groundwork for removal of guest memory from the
     kernel direct map, as well as for limited mmap() for
     guest_memfd-backed memory.

     For more information see:
       - commit a6ad54137a ("Merge branch 'guest-memfd-mmap' into HEAD")
       - guest_memfd in Firecracker:
           https://github.com/firecracker-microvm/firecracker/tree/feature/secret-hiding
       - direct map removal:
           https://lore.kernel.org/all/20250221160728.1584559-1-roypat@amazon.co.uk/
       - mmap support:
           https://lore.kernel.org/all/20250328153133.3504118-1-tabba@google.com/

  ARM:

   - Add support for FF-A 1.2 as the secure memory conduit for pKVM,
     allowing more registers to be used as part of the message payload.

   - Change the way pKVM allocates its VM handles, making sure that the
     privileged hypervisor is never tricked into using uninitialised
     data.

   - Speed up MMIO range registration by avoiding unnecessary RCU
     synchronisation, which results in VMs starting much quicker.

   - Add the dump of the instruction stream when panic-ing in the EL2
     payload, just like the rest of the kernel has always done. This
     will hopefully help debugging non-VHE setups.

   - Add 52bit PA support to the stage-1 page-table walker, and make use
     of it to populate the fault level reported to the guest on failing
     to translate a stage-1 walk.

   - Add NV support to the GICv3-on-GICv5 emulation code, ensuring
     feature parity for guests, irrespective of the host platform.

   - Fix some really ugly architecture problems when dealing with debug
     in a nested VM. This has some bad performance impacts, but is at
     least correct.

   - Add enough infrastructure to be able to disable EL2 features and
     give effective values to the EL2 control registers. This then
     allows a bunch of features to be turned off, which helps cross-host
     migration.

   - Large rework of the selftest infrastructure to allow most tests to
     transparently run at EL2. This is the first step towards enabling
     NV testing.

   - Various fixes and improvements all over the map, including one BE
     fix, just in time for the removal of the feature.

  LoongArch:

   - Detect page table walk feature on new hardware

   - Add sign extension with kernel MMIO/IOCSR emulation

   - Improve in-kernel IPI emulation

   - Improve in-kernel PCH-PIC emulation

   - Move kvm_iocsr tracepoint out of generic code

  RISC-V:

   - Added SBI FWFT extension for Guest/VM with misaligned delegation
     and pointer masking PMLEN features

   - Added ONE_REG interface for SBI FWFT extension

   - Added Zicbop and bfloat16 extensions for Guest/VM

   - Enabled more common KVM selftests for RISC-V

   - Added SBI v3.0 PMU enhancements in KVM and perf driver

  s390:

   - Improve interrupt cpu for wakeup, in particular the heuristic to
     decide which vCPU to deliver a floating interrupt to.

   - Clear the PTE when discarding a swapped page because of CMMA; this
     bug was introduced in 6.16 when refactoring gmap code.

  x86 selftests:

   - Add #DE coverage in the fastops test (the only exception that's
     guest- triggerable in fastop-emulated instructions).

   - Fix PMU selftests errors encountered on Granite Rapids (GNR),
     Sierra Forest (SRF) and Clearwater Forest (CWF).

   - Minor cleanups and improvements

  x86 (guest side):

   - For the legacy PCI hole (memory between TOLUD and 4GiB) to UC when
     overriding guest MTRR for TDX/SNP to fix an issue where ACPI
     auto-mapping could map devices as WB and prevent the device drivers
     from mapping their devices with UC/UC-.

   - Make kvm_async_pf_task_wake() a local static helper and remove its
     export.

   - Use native qspinlocks when running in a VM with dedicated
     vCPU=>pCPU bindings even when PV_UNHALT is unsupported.

  Generic:

   - Remove a redundant __GFP_NOWARN from kvm_setup_async_pf() as
     __GFP_NOWARN is now included in GFP_NOWAIT.

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (178 commits)
  KVM: s390: Fix to clear PTE when discarding a swapped page
  KVM: arm64: selftests: Cover ID_AA64ISAR3_EL1 in set_id_regs
  KVM: arm64: selftests: Remove a duplicate register listing in set_id_regs
  KVM: arm64: selftests: Cope with arch silliness in EL2 selftest
  KVM: arm64: selftests: Add basic test for running in VHE EL2
  KVM: arm64: selftests: Enable EL2 by default
  KVM: arm64: selftests: Initialize HCR_EL2
  KVM: arm64: selftests: Use the vCPU attr for setting nr of PMU counters
  KVM: arm64: selftests: Use hyp timer IRQs when test runs at EL2
  KVM: arm64: selftests: Select SMCCC conduit based on current EL
  KVM: arm64: selftests: Provide helper for getting default vCPU target
  KVM: arm64: selftests: Alias EL1 registers to EL2 counterparts
  KVM: arm64: selftests: Create a VGICv3 for 'default' VMs
  KVM: arm64: selftests: Add unsanitised helpers for VGICv3 creation
  KVM: arm64: selftests: Add helper to check for VGICv3 support
  KVM: arm64: selftests: Initialize VGICv3 only once
  KVM: arm64: selftests: Provide kvm_arch_vm_post_create() in library code
  KVM: selftests: Add ex_str() to print human friendly name of exception vectors
  selftests/kvm: remove stale TODO in xapic_state_test
  KVM: selftests: Handle Intel Atom errata that leads to PMU event overcount
  ...
2025-10-04 08:52:16 -07:00
Linus Torvalds
feafee2845 arm64 updates for 6.18
Confidential computing:
  - Add support for accepting secrets from firmware (e.g. ACPI CCEL)
    and mapping them with appropriate attributes.
 
 CPU features:
  - Advertise atomic floating-point instructions to userspace.
 
  - Extend Spectre workarounds to cover additional Arm CPU variants.
 
  - Extend list of CPUs that support break-before-make level 2 and
    guarantee not to generate TLB conflict aborts for changes of mapping
    granularity (BBML2_NOABORT).
 
  - Add GCS support to our uprobes implementation.
 
 Documentation:
  - Remove bogus SME documentation concerning register state when
    entering/exiting streaming mode.
 
 Entry code:
  - Switch over to the generic IRQ entry code (GENERIC_IRQ_ENTRY).
 
  - Micro-optimise syscall entry path with a compiler branch hint.
 
 Memory management:
  - Enable huge mappings in vmalloc space even when kernel page-table
    dumping is enabled.
 
  - Tidy up the types used in our early MMU setup code.
 
  - Rework rodata= for closer parity with the behaviour on x86.
 
  - For CPUs implementing BBML2_NOABORT, utilise block mappings in the
    linear map even when rodata= applies to virtual aliases.
 
  - Don't re-allocate the virtual region between '_text' and '_stext',
    as doing so confused tools parsing /proc/vmcore.
 
 Miscellaneous:
  - Clean-up Kconfig menuconfig text for architecture features.
 
  - Avoid redundant bitmap_empty() during determination of supported
    SME vector lengths.
 
  - Re-enable warnings when building the 32-bit vDSO object.
 
  - Avoid breaking our eggs at the wrong end.
 
 Perf and PMUs:
  - Support for v3 of the Hisilicon L3C PMU.
 
  - Support for Hisilicon's MN and NoC PMUs.
 
  - Support for Fujitsu's Uncore PMU.
 
  - Support for SPE's extended event filtering feature.
 
  - Preparatory work to enable data source filtering in SPE.
 
  - Support for multiple lanes in the DWC PCIe PMU.
 
  - Support for i.MX94 in the IMX DDR PMU driver.
 
  - MAINTAINERS update (Thank you, Yicong).
 
  - Minor driver fixes (PERF_IDX2OFF() overflow, CMN register offsets).
 
 Selftests:
  - Add basic LSFE check to the existing hwcaps test.
 
  - Support nolibc in GCS tests.
 
  - Extend SVE ptrace test to pass unsupported regsets and invalid vector
    lengths.
 
  - Minor cleanups (typos, cosmetic changes).
 
 System registers:
  - Fix ID_PFR1_EL1 definition.
 
  - Fix incorrect signedness of some fields in ID_AA64MMFR4_EL1.
 
  - Sync TCR_EL1 definition with the latest Arm ARM (L.b).
 
  - Be stricter about the input fed into our AWK sysreg generator script.
 
  - Typo fixes and removal of redundant definitions.
 
 ACPI, EFI and PSCI:
  - Decouple Arm's "Software Delegated Exception Interface" (SDEI)
    support from the ACPI GHES code so that it can be used by platforms
    booted with device-tree.
 
  - Remove unnecessary per-CPU tracking of the FPSIMD state across EFI
    runtime calls.
 
  - Fix a node refcount imbalance in the PSCI device-tree code.
 
 CPU Features:
  - Ensure register sanitisation is applied to fields in ID_AA64MMFR4.
 
  - Expose AIDR_EL1 to userspace via sysfs, primarily so that KVM guests
    can reliably query the underlying CPU types from the VMM.
 
  - Re-enabling of SME support (CONFIG_ARM64_SME) as a result of fixes
    to our context-switching, signal handling and ptrace code.
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmjWlMIQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNIVYB/sHaFYmToEaK4ldMfTn4ZQ8yr9ZuTMLr/ji
 zOm7wULP08wKIcp6t+1N6e7A8Wb3YSErxTywc4b9MqctIel1WvBMewjJd38xb2qO
 hFCuRuWemfyt6Rw/EGMCP54yueLzKbnN4q+/Aks8jedWtS+AXY6McGCJOzMjuECL
 zKb68Hqqb09YQ2b2BPSAiU9g42rotiIYppLaLbEdxUnw/eqfEN3wSrl92/LuBlqH
 r2wZDnJwA5q8Iy9HWlnhg4Jax0Cs86jSJPIQLB6v3WWhc3HR49pm5cqYzYkUht9L
 nA6eaWJiBl/+k3S+ftPKNzU8n04r15lqyNZE2FYfRk+0BPSacJOf
 =kO15
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "There's good stuff across the board, including some nice mm
  improvements for CPUs with the 'noabort' BBML2 feature and a clever
  patch to allow ptdump to play nicely with block mappings in the
  vmalloc area.

  Confidential computing:

   - Add support for accepting secrets from firmware (e.g. ACPI CCEL)
     and mapping them with appropriate attributes.

  CPU features:

   - Advertise atomic floating-point instructions to userspace

   - Extend Spectre workarounds to cover additional Arm CPU variants

   - Extend list of CPUs that support break-before-make level 2 and
     guarantee not to generate TLB conflict aborts for changes of
     mapping granularity (BBML2_NOABORT)

   - Add GCS support to our uprobes implementation.

  Documentation:

   - Remove bogus SME documentation concerning register state when
     entering/exiting streaming mode.

  Entry code:

   - Switch over to the generic IRQ entry code (GENERIC_IRQ_ENTRY)

   - Micro-optimise syscall entry path with a compiler branch hint.

  Memory management:

   - Enable huge mappings in vmalloc space even when kernel page-table
     dumping is enabled

   - Tidy up the types used in our early MMU setup code

   - Rework rodata= for closer parity with the behaviour on x86

   - For CPUs implementing BBML2_NOABORT, utilise block mappings in the
     linear map even when rodata= applies to virtual aliases

   - Don't re-allocate the virtual region between '_text' and '_stext',
     as doing so confused tools parsing /proc/vmcore.

  Miscellaneous:

   - Clean-up Kconfig menuconfig text for architecture features

   - Avoid redundant bitmap_empty() during determination of supported
     SME vector lengths

   - Re-enable warnings when building the 32-bit vDSO object

   - Avoid breaking our eggs at the wrong end.

  Perf and PMUs:

   - Support for v3 of the Hisilicon L3C PMU

   - Support for Hisilicon's MN and NoC PMUs

   - Support for Fujitsu's Uncore PMU

   - Support for SPE's extended event filtering feature

   - Preparatory work to enable data source filtering in SPE

   - Support for multiple lanes in the DWC PCIe PMU

   - Support for i.MX94 in the IMX DDR PMU driver

   - MAINTAINERS update (Thank you, Yicong)

   - Minor driver fixes (PERF_IDX2OFF() overflow, CMN register offsets).

  Selftests:

   - Add basic LSFE check to the existing hwcaps test

   - Support nolibc in GCS tests

   - Extend SVE ptrace test to pass unsupported regsets and invalid
     vector lengths

   - Minor cleanups (typos, cosmetic changes).

  System registers:

   - Fix ID_PFR1_EL1 definition

   - Fix incorrect signedness of some fields in ID_AA64MMFR4_EL1

   - Sync TCR_EL1 definition with the latest Arm ARM (L.b)

   - Be stricter about the input fed into our AWK sysreg generator
     script

   - Typo fixes and removal of redundant definitions.

  ACPI, EFI and PSCI:

   - Decouple Arm's "Software Delegated Exception Interface" (SDEI)
     support from the ACPI GHES code so that it can be used by platforms
     booted with device-tree

   - Remove unnecessary per-CPU tracking of the FPSIMD state across EFI
     runtime calls

   - Fix a node refcount imbalance in the PSCI device-tree code.

  CPU Features:

   - Ensure register sanitisation is applied to fields in ID_AA64MMFR4

   - Expose AIDR_EL1 to userspace via sysfs, primarily so that KVM
     guests can reliably query the underlying CPU types from the VMM

   - Re-enabling of SME support (CONFIG_ARM64_SME) as a result of fixes
     to our context-switching, signal handling and ptrace code"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (93 commits)
  arm64: cpufeature: Remove duplicate asm/mmu.h header
  arm64: Kconfig: Make CPU_BIG_ENDIAN depend on BROKEN
  perf/dwc_pcie: Fix use of uninitialized variable
  arm/syscalls: mark syscall invocation as likely in invoke_syscall
  Documentation: hisi-pmu: Add introduction to HiSilicon V3 PMU
  Documentation: hisi-pmu: Fix of minor format error
  drivers/perf: hisi: Add support for L3C PMU v3
  drivers/perf: hisi: Refactor the event configuration of L3C PMU
  drivers/perf: hisi: Extend the field of tt_core
  drivers/perf: hisi: Extract the event filter check of L3C PMU
  drivers/perf: hisi: Simplify the probe process of each L3C PMU version
  drivers/perf: hisi: Export hisi_uncore_pmu_isr()
  drivers/perf: hisi: Relax the event ID check in the framework
  perf: Fujitsu: Add the Uncore PMU driver
  arm64: map [_text, _stext) virtual address range non-executable+read-only
  arm64/sysreg: Update TCR_EL1 register
  arm64: Enable vmalloc-huge with ptdump
  arm64: cpufeature: add Neoverse-V3AE to BBML2 allow list
  arm64: errata: Apply workarounds for Neoverse-V3AE
  arm64: cputype: Add Neoverse-V3AE definitions
  ...
2025-09-29 18:48:39 -07:00
Will Deacon
c7c7eb4f0e Merge branch 'for-next/sysregs' into for-next/core
* for-next/sysregs:
  arm64/sysreg: Update TCR_EL1 register
  arm64: sysreg: Add validation checks to sysreg header generation script
  arm64: sysreg: Correct sign definitions for EIESB and DoubleLock
  arm64: sysreg: Fix and tidy up sysreg field definitions
2025-09-24 16:35:01 +01:00
Anshuman Khandual
14f158552e arm64/sysreg: Update TCR_EL1 register
Update TCR_EL1 register fields as per latest ARM ARM DDI 0487 L.B and while
here drop an explicit sysreg definition SYS_TCR_EL1 from sysreg.h, which is
now redundant.

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
2025-09-22 11:57:15 +01:00
James Clark
a7005ff2d0 arm64: sysreg: Add new PMSFCR_EL1 fields and PMSDSFR_EL1 register
Add new fields and register that are introduced for the features
FEAT_SPE_EFT (extended filtering) and FEAT_SPE_FDS (data source
filtering).

Tested-by: Leo Yan <leo.yan@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: James Clark <james.clark@linaro.org>
Signed-off-by: Will Deacon <will@kernel.org>
2025-09-18 14:17:02 +01:00
Sascha Bischoff
7847f51189 arm64: cpucaps: Add GICv5 Legacy vCPU interface (GCIE_LEGACY) capability
Implement the GCIE_LEGACY capability as a system feature to be able to
check for support from KVM. The type is explicitly
ARM64_CPUCAP_EARLY_LOCAL_CPU_FEATURE, which means that the capability
is enabled early if all boot CPUs support it. Additionally, if this
capability is enabled during boot, it prevents late onlining of CPUs
that lack it, thereby avoiding potential mismatched configurations
which would break KVM.

Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-09-17 17:41:02 +01:00
Fuad Tabba
382cbbe7fb arm64: sysreg: Add validation checks to sysreg header generation script
The gen_sysreg.awk script processes the system register specification in
the sysreg text file to generate C macro definitions. The current script
will silently accept certain errors in the specification file, leading
to incorrect header generation.

For example, a Sysreg or SysregFields can be accidentally duplicated,
causing its macros to be emitted twice. An Enum can contain duplicate
values for different items, which is architecturally incorrect.

Add checks to catch these errors at build time. The script now tracks
all seen Sysreg and SysregFields definitions and checks for duplicates.
It also tracks values within each Enum block to ensure entries are
unique.

Acked-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Fuad Tabba <tabba@google.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
2025-09-11 14:46:07 +01:00