Commit Graph

25896 Commits

Author SHA1 Message Date
Linus Torvalds
9702969978 slab updates for 7.0 part2

Merge tag 'slab-for-7.0-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab

Pull more slab updates from Vlastimil Babka:

 - Two stable fixes for kmalloc_nolock() usage from NMI context (Harry
   Yoo)

 - Allow kmalloc_nolock() allocations to be freed with kfree() and thus
   also kfree_rcu() and simplify slabobj_ext handling - we no longer
   need to track how it was allocated to use the matching freeing
   function (Harry Yoo)

* tag 'slab-for-7.0-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
  mm/slab: drop the OBJEXTS_NOSPIN_ALLOC flag from enum objext_flags
  mm/slab: allow freeing kmalloc_nolock()'d objects using kfree[_rcu]()
  mm/slab: use prandom if !allow_spin
  mm/slab: do not access current->mems_allowed_seq if !allow_spin
2026-02-16 13:41:38 -08:00
Linus Torvalds
787fe1d43a memblock: updates for 7.0-rc1
* update tools/include/linux/mm.h to fix memblock tests compilation
 * drop redundant struct page* parameter from memblock_free_pages() and get
   struct page from the pfn
 * add underflow detection for size calculation in memtest and warn about
   underflow when VM_DEBUG is enabled

Merge tag 'memblock-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock

Pull memblock updates from Mike Rapoport:

 - update tools/include/linux/mm.h to fix memblock tests compilation

 - drop redundant struct page* parameter from memblock_free_pages() and
   get struct page from the pfn

 - add underflow detection for size calculation in memtest and warn
   about underflow when VM_DEBUG is enabled

* tag 'memblock-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
  mm/memtest: add underflow detection for size calculation
  memblock: drop redundant 'struct page *' argument from memblock_free_pages()
  memblock test: include <linux/sizes.h> from tools mm.h stub
2026-02-14 12:39:34 -08:00
Linus Torvalds
44331bd6a6 Three MM hotfixes, all three are cc:stable.

Merge tag 'mm-hotfixes-stable-2026-02-13-07-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM fixes from Andrew Morton:
 "Three MM hotfixes, all three are cc:stable"

* tag 'mm-hotfixes-stable-2026-02-13-07-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  procfs: fix possible double mmput() in do_procmap_query()
  mm/page_alloc: skip debug_check_no_{obj,locks}_freed with FPI_TRYLOCK
  mm/hugetlb: restore failed global reservations to subpool
2026-02-13 12:13:27 -08:00
Linus Torvalds
cb5573868e Loongarch:
- Add more CPUCFG mask bits.
 
 - Improve feature detection.
 
 - Add lazy load support for FPU and binary translation (LBT) register state.
 
 - Fix return value for memory reads from and writes to in-kernel devices.
 
 - Add support for detecting preemption from within a guest.
 
 - Add KVM steal time test case to tools/selftests.
 
 ARM:
 
 - Add support for FEAT_IDST, allowing ID registers that are not
   implemented to be reported as a normal trap rather than as an UNDEF
   exception.
 
 - Add sanitisation of the VTCR_EL2 register, fixing a number of
   UXN/PXN/XN bugs in the process.
 
 - Full handling of RESx bits, instead of only RES0, and resulting in
   SCTLR_EL2 being added to the list of sanitised registers.
 
 - More pKVM fixes for features that are not supposed to be exposed to
   guests.
 
 - Make sure that MTE being disabled on the pKVM host doesn't give it
   the ability to attack the hypervisor.
 
 - Allow pKVM's host stage-2 mappings to use the Force Write Back
   version of the memory attributes by using the "pass-through"
   encoding.
 
 - Fix trapping of ICC_DIR_EL1 on GICv5 hosts emulating GICv3 for the
   guest.
 
 - Preliminary work for guest GICv5 support.
 
 - A bunch of debugfs fixes, removing pointless custom iterators stored
   in guest data structures.
 
 - A small set of FPSIMD cleanups.
 
 - Selftest fixes addressing the incorrect alignment of page
   allocation.
 
 - Other assorted low-impact fixes and spelling fixes.
 
 RISC-V:
 
 - Fixes for issues discovered by KVM API fuzzing in
   kvm_riscv_aia_imsic_has_attr(), kvm_riscv_aia_imsic_rw_attr(),
   and kvm_riscv_vcpu_aia_imsic_update()
 
 - Allow Zalasr, Zilsd and Zclsd extensions for Guest/VM
 
 - Transparent huge page support for hypervisor page tables
 
 - Adjust the number of available guest irq files based on MMIO
   register sizes found in the device tree or the ACPI tables
 
 - Add RISC-V specific paging modes to KVM selftests
 
 - Detect paging mode at runtime for selftests
 
 s390:
 
 - Performance improvement for vSIE (aka nested virtualization)
 
 - Completely new memory management.  s390 was a special snowflake that enlisted
   help from the architecture's page table management to build hypervisor
   page tables, in particular enabling sharing the last level of page
   tables.  This however was a lot of code (~3K lines) in order to support
   KVM, and also blocked several features.  The biggest advantage is
   that the page size of userspace is completely independent of the
   page size used by the guest: userspace can mix normal pages, THPs and
   hugetlbfs as it sees fit, and in fact transparent hugepages were not
   possible before.  It's also now possible to have nested guests and
   guests with huge pages running on the same host.
 
 - Maintainership change for s390 vfio-pci
 
 - Small quality of life improvement for protected guests
 
 x86:
 
 - Add support for giving the guest full ownership of PMU hardware (context
   switched around the fastpath run loop) and allowing direct access to data
   MSRs and PMCs (restricted by the vPMU model).  KVM still intercepts
   access to control registers, e.g. to enforce event filtering and to
   prevent the guest from profiling sensitive host state.  This is more
   accurate, since it has no risk of contention and thus dropped events, and
   also has significantly less overhead.
 
   For more information, see the commit message for merge commit bf2c3138ae
   ("Merge tag 'kvm-x86-pmu-6.20' of https://github.com/kvm-x86/linux into HEAD").
 
 - Disallow changing the virtual CPU model if L2 is active, for all the same
   reasons KVM disallows changing the model after the first KVM_RUN.
 
 - Fix a bug where KVM would incorrectly reject host accesses to PV MSRs
   when running with KVM_CAP_ENFORCE_PV_FEATURE_CPUID enabled, even if those
   were advertised as supported to userspace.
 
 - Fix a bug with protected guest state (SEV-ES/SNP and TDX) VMs, where KVM
   would attempt to read CR3 when configuring an async #PF entry.
 
 - Fail the build if EXPORT_SYMBOL_GPL or EXPORT_SYMBOL is used in KVM (for x86
   only) to enforce usage of EXPORT_SYMBOL_FOR_KVM_INTERNAL.  Only a few exports
   are intended for external usage, and those are allowed explicitly.
 
 - When checking nested events after a vCPU is unblocked, ignore -EBUSY instead
   of WARNing.  Userspace can sometimes put the vCPU into what should be an
   impossible state, and spurious exit to userspace on -EBUSY does not really
   do anything to solve the issue.
 
 - Also throw in the towel and drop the WARN on INIT/SIPI being blocked when vCPU
   is in Wait-For-SIPI, which also resulted in playing whack-a-mole with syzkaller
   stuffing architecturally impossible states into KVM.
 
 - Add support for new Intel instructions that don't require anything beyond
   enumerating feature flags to userspace.
 
 - Grab SRCU when reading PDPTRs in KVM_GET_SREGS2.
 
 - Add WARNs to guard against modifying KVM's CPU caps outside of the intended
   setup flow, as nested VMX in particular is sensitive to unexpected changes
   in KVM's golden configuration.
 
 - Add a quirk to allow userspace to opt-in to actually suppress EOI broadcasts
   when the suppression feature is enabled by the guest (currently limited to
   split IRQCHIP, i.e. userspace I/O APIC).  Sadly, simply fixing KVM to honor
   Suppress EOI Broadcasts isn't an option as some userspaces have come to rely
   on KVM's buggy behavior (KVM advertises Suppress EOI Broadcast irrespective
   of whether or not userspace I/O APIC supports Directed EOIs).
 
 - Clean up KVM's handling of marking mapped vCPU pages dirty.
 
 - Drop a pile of *ancient* sanity checks hidden behind KVM's unused
   ASSERT() macro, most of which could be trivially triggered by the guest
   and/or user, and all of which were useless.
 
 - Fold "struct dest_map" into its sole user, "struct rtc_status", to make it
   more obvious what the weird parameter is used for, and to allow dropping
   these RTC shenanigans if CONFIG_KVM_IOAPIC=n.
 
 - Bury all of ioapic.h, i8254.h and related ioctls (including
   KVM_CREATE_IRQCHIP) behind CONFIG_KVM_IOAPIC=y.
 
 - Add a regression test for recent APICv update fixes.
 
 - Handle "hardware APIC ISR", a.k.a. SVI, updates in kvm_apic_update_apicv()
   to consolidate the updates, and to co-locate SVI updates with the updates
   for KVM's own cache of ISR information.
 
 - Drop a dead function declaration.
 
 - Minor cleanups.
 
 x86 (Intel):
 
 - Rework KVM's handling of VMCS updates while L2 is active to temporarily
   switch to vmcs01 instead of deferring the update until the next nested
   VM-Exit.  The deferred updates approach directly contributed to several
   bugs, was proving to be a maintenance burden due to the difficulty in
   auditing the correctness of deferred updates, and was polluting
   "struct nested_vmx" with a growing pile of booleans.
 
 - Fix an SGX bug where KVM would incorrectly try to handle EPCM page faults,
   and instead always reflect them into the guest.  Since KVM doesn't shadow
   EPCM entries, EPCM violations cannot be due to KVM interference and
   can't be resolved by KVM.
 
 - Fix a bug where KVM would register its posted interrupt wakeup handler even
   if loading kvm-intel.ko ultimately failed.
 
 - Disallow access to vmcs12 fields that aren't fully supported, mostly to
   avoid weirdness and complexity for FRED and other features, where KVM wants
   to enable VMCS shadowing for fields that conditionally exist.
 
 - Print out the "bad" offsets and values if kvm-intel.ko refuses to load (or
   refuses to online a CPU) due to a VMCS config mismatch.
 
 x86 (AMD):
 
 - Drop a user-triggerable WARN on nested_svm_load_cr3() failure.
 
 - Add support for virtualizing ERAPS.  Note, correct virtualization of ERAPS
   relies on an upcoming, publicly announced change in the APM to reduce the
   set of conditions where hardware (i.e. KVM) *must* flush the RAP.
 
 - Ignore nSVM intercepts for instructions that are not supported according to
   L1's virtual CPU model.
 
 - Add support for expedited writes to the fast MMIO bus, a la VMX's fastpath
   for EPT Misconfig.
 
 - Don't set GIF when clearing EFER.SVME, as GIF exists independently of SVM,
   and allow userspace to restore nested state with GIF=0.
 
 - Treat exit_code as an unsigned 64-bit value through all of KVM.
 
 - Add support for fetching SNP certificates from userspace.
 
 - Fix a bug where KVM would use vmcb02 instead of vmcb01 when emulating VMLOAD
   or VMSAVE on behalf of L2.
 
 - Misc fixes and cleanups.
 
 x86 selftests:
 
 - Add a regression test for TPR<=>CR8 synchronization and IRQ masking.
 
 - Overhaul selftest's MMU infrastructure to genericize stage-2 MMU support,
   and extend x86's infrastructure to support EPT and NPT (for L2 guests).
 
 - Extend several nested VMX tests to also cover nested SVM.
 
 - Add a selftest for nested VMLOAD/VMSAVE.
 
 - Rework the nested dirty log test, originally added as a regression test for
   PML where KVM logged L2 GPAs instead of L1 GPAs, to improve test coverage
   and to hopefully make the test easier to understand and maintain.
 
 guest_memfd:
 
 - Remove kvm_gmem_populate()'s preparation tracking and half-baked hugepage
   handling.  SEV/SNP was the only user of the tracking and it can do it via
   the RMP.
 
 - Retroactively document and enforce (for SNP) that KVM_SEV_SNP_LAUNCH_UPDATE
   and KVM_TDX_INIT_MEM_REGION require the source page to be 4KiB aligned, to
   avoid non-trivial complexity for something that no known VMM seems to be
   doing and to avoid an API special case for in-place conversion, which
   simply can't support unaligned sources.
 
 - When populating guest_memfd memory, GUP the source page in common code and
   pass the refcounted page to the vendor callback, instead of letting vendor
   code do the heavy lifting.  Doing so avoids a looming deadlock bug with
   in-place conversion due to an AB-BA conflict between mmap_lock and guest_memfd's filemap
   invalidate lock.
 
 Generic:
 
 - Fix a bug where KVM would ignore the vCPU's selected address space when
   creating a vCPU-specific mapping of guest memory.  Actually this bug
   could not be hit even on x86, the only architecture with multiple
   address spaces, but it's a bug nevertheless.

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "Loongarch:

   - Add more CPUCFG mask bits

   - Improve feature detection

   - Add lazy load support for FPU and binary translation (LBT) register
     state

   - Fix return value for memory reads from and writes to in-kernel
     devices

   - Add support for detecting preemption from within a guest

   - Add KVM steal time test case to tools/selftests

  ARM:

   - Add support for FEAT_IDST, allowing ID registers that are not
     implemented to be reported as a normal trap rather than as an UNDEF
     exception

   - Add sanitisation of the VTCR_EL2 register, fixing a number of
     UXN/PXN/XN bugs in the process

   - Full handling of RESx bits, instead of only RES0, and resulting in
     SCTLR_EL2 being added to the list of sanitised registers

   - More pKVM fixes for features that are not supposed to be exposed to
     guests

   - Make sure that MTE being disabled on the pKVM host doesn't give it
     the ability to attack the hypervisor

   - Allow pKVM's host stage-2 mappings to use the Force Write Back
     version of the memory attributes by using the "pass-through"
     encoding

   - Fix trapping of ICC_DIR_EL1 on GICv5 hosts emulating GICv3 for the
     guest

   - Preliminary work for guest GICv5 support

   - A bunch of debugfs fixes, removing pointless custom iterators
     stored in guest data structures

   - A small set of FPSIMD cleanups

   - Selftest fixes addressing the incorrect alignment of page
     allocation

   - Other assorted low-impact fixes and spelling fixes

  RISC-V:

   - Fixes for issues discovered by KVM API fuzzing in
     kvm_riscv_aia_imsic_has_attr(), kvm_riscv_aia_imsic_rw_attr(), and
     kvm_riscv_vcpu_aia_imsic_update()

   - Allow Zalasr, Zilsd and Zclsd extensions for Guest/VM

   - Transparent huge page support for hypervisor page tables

   - Adjust the number of available guest irq files based on MMIO
     register sizes found in the device tree or the ACPI tables

   - Add RISC-V specific paging modes to KVM selftests

   - Detect paging mode at runtime for selftests

  s390:

   - Performance improvement for vSIE (aka nested virtualization)

   - Completely new memory management. s390 was a special snowflake that
     enlisted help from the architecture's page table management to
     build hypervisor page tables, in particular enabling sharing the
     last level of page tables. This however was a lot of code (~3K
     lines) in order to support KVM, and also blocked several features.
     The biggest advantage is that the page size of userspace is
     completely independent of the page size used by the guest:
     userspace can mix normal pages, THPs and hugetlbfs as it sees fit,
     and in fact transparent hugepages were not possible before. It's
     also now possible to have nested guests and guests with huge pages
     running on the same host

   - Maintainership change for s390 vfio-pci

   - Small quality of life improvement for protected guests

  x86:

   - Add support for giving the guest full ownership of PMU hardware
     (context switched around the fastpath run loop) and allowing
     direct access to data MSRs and PMCs (restricted by the vPMU model).

     KVM still intercepts access to control registers, e.g. to enforce
     event filtering and to prevent the guest from profiling sensitive
     host state. This is more accurate, since it has no risk of
     contention and thus dropped events, and also has significantly less
     overhead.

     For more information, see the commit message for merge commit
     bf2c3138ae ("Merge tag 'kvm-x86-pmu-6.20' ...")

   - Disallow changing the virtual CPU model if L2 is active, for all
     the same reasons KVM disallows changing the model after the first
     KVM_RUN

   - Fix a bug where KVM would incorrectly reject host accesses to PV
     MSRs when running with KVM_CAP_ENFORCE_PV_FEATURE_CPUID enabled,
     even if those were advertised as supported to userspace

   - Fix a bug with protected guest state (SEV-ES/SNP and TDX) VMs,
     where KVM would attempt to read CR3 when configuring an async #PF entry

   - Fail the build if EXPORT_SYMBOL_GPL or EXPORT_SYMBOL is used in KVM
     (for x86 only) to enforce usage of EXPORT_SYMBOL_FOR_KVM_INTERNAL.
     Only a few exports are intended for external usage, and those
     are allowed explicitly

   - When checking nested events after a vCPU is unblocked, ignore
     -EBUSY instead of WARNing. Userspace can sometimes put the vCPU
     into what should be an impossible state, and spurious exit to
     userspace on -EBUSY does not really do anything to solve the issue

   - Also throw in the towel and drop the WARN on INIT/SIPI being
     blocked when vCPU is in Wait-For-SIPI, which also resulted in
     playing whack-a-mole with syzkaller stuffing architecturally
     impossible states into KVM

   - Add support for new Intel instructions that don't require anything
     beyond enumerating feature flags to userspace

   - Grab SRCU when reading PDPTRs in KVM_GET_SREGS2

   - Add WARNs to guard against modifying KVM's CPU caps outside of the
     intended setup flow, as nested VMX in particular is sensitive to
     unexpected changes in KVM's golden configuration

   - Add a quirk to allow userspace to opt-in to actually suppress EOI
     broadcasts when the suppression feature is enabled by the guest
     (currently limited to split IRQCHIP, i.e. userspace I/O APIC).
     Sadly, simply fixing KVM to honor Suppress EOI Broadcasts isn't an
     option as some userspaces have come to rely on KVM's buggy behavior
     (KVM advertises Suppress EOI Broadcast irrespective of whether or
     not userspace I/O APIC supports Directed EOIs)

   - Clean up KVM's handling of marking mapped vCPU pages dirty

   - Drop a pile of *ancient* sanity checks hidden behind KVM's
     unused ASSERT() macro, most of which could be trivially triggered
     by the guest and/or user, and all of which were useless

   - Fold "struct dest_map" into its sole user, "struct rtc_status", to
     make it more obvious what the weird parameter is used for, and to
     allow dropping these RTC shenanigans if CONFIG_KVM_IOAPIC=n

   - Bury all of ioapic.h, i8254.h and related ioctls (including
     KVM_CREATE_IRQCHIP) behind CONFIG_KVM_IOAPIC=y

   - Add a regression test for recent APICv update fixes

   - Handle "hardware APIC ISR", a.k.a. SVI, updates in
     kvm_apic_update_apicv() to consolidate the updates, and to
     co-locate SVI updates with the updates for KVM's own cache of ISR
     information

   - Drop a dead function declaration

   - Minor cleanups

  x86 (Intel):

   - Rework KVM's handling of VMCS updates while L2 is active to
     temporarily switch to vmcs01 instead of deferring the update until
     the next nested VM-Exit.

     The deferred updates approach directly contributed to several bugs,
     was proving to be a maintenance burden due to the difficulty in
     auditing the correctness of deferred updates, and was polluting
     "struct nested_vmx" with a growing pile of booleans

   - Fix an SGX bug where KVM would incorrectly try to handle EPCM page
     faults, and instead always reflect them into the guest. Since KVM
     doesn't shadow EPCM entries, EPCM violations cannot be due to KVM
     interference and can't be resolved by KVM

   - Fix a bug where KVM would register its posted interrupt wakeup
     handler even if loading kvm-intel.ko ultimately failed

   - Disallow access to vmcs12 fields that aren't fully supported,
     mostly to avoid weirdness and complexity for FRED and other
     features, where KVM wants to enable VMCS shadowing for fields that
     conditionally exist

   - Print out the "bad" offsets and values if kvm-intel.ko refuses to
     load (or refuses to online a CPU) due to a VMCS config mismatch

  x86 (AMD):

   - Drop a user-triggerable WARN on nested_svm_load_cr3() failure

   - Add support for virtualizing ERAPS. Note, correct virtualization of
     ERAPS relies on an upcoming, publicly announced change in the APM
     to reduce the set of conditions where hardware (i.e. KVM) *must*
     flush the RAP

   - Ignore nSVM intercepts for instructions that are not supported
     according to L1's virtual CPU model

   - Add support for expedited writes to the fast MMIO bus, a la VMX's
     fastpath for EPT Misconfig

   - Don't set GIF when clearing EFER.SVME, as GIF exists independently
     of SVM, and allow userspace to restore nested state with GIF=0

   - Treat exit_code as an unsigned 64-bit value through all of KVM

   - Add support for fetching SNP certificates from userspace

   - Fix a bug where KVM would use vmcb02 instead of vmcb01 when
     emulating VMLOAD or VMSAVE on behalf of L2

   - Misc fixes and cleanups

  x86 selftests:

   - Add a regression test for TPR<=>CR8 synchronization and IRQ masking

   - Overhaul selftest's MMU infrastructure to genericize stage-2 MMU
     support, and extend x86's infrastructure to support EPT and NPT
     (for L2 guests)

   - Extend several nested VMX tests to also cover nested SVM

   - Add a selftest for nested VMLOAD/VMSAVE

   - Rework the nested dirty log test, originally added as a regression
     test for PML where KVM logged L2 GPAs instead of L1 GPAs, to
     improve test coverage and to hopefully make the test easier to
     understand and maintain

  guest_memfd:

   - Remove kvm_gmem_populate()'s preparation tracking and half-baked
     hugepage handling. SEV/SNP was the only user of the tracking and it
     can do it via the RMP

   - Retroactively document and enforce (for SNP) that
     KVM_SEV_SNP_LAUNCH_UPDATE and KVM_TDX_INIT_MEM_REGION require the
     source page to be 4KiB aligned, to avoid non-trivial complexity for
     something that no known VMM seems to be doing and to avoid an API
     special case for in-place conversion, which simply can't support
     unaligned sources

   - When populating guest_memfd memory, GUP the source page in common
     code and pass the refcounted page to the vendor callback, instead
     of letting vendor code do the heavy lifting. Doing so avoids a
     looming deadlock bug with in-place conversion due to an AB-BA
     conflict between mmap_lock and guest_memfd's filemap invalidate
     lock

  Generic:

   - Fix a bug where KVM would ignore the vCPU's selected address space
     when creating a vCPU-specific mapping of guest memory. Actually
     this bug could not be hit even on x86, the only architecture with
     multiple address spaces, but it's a bug nevertheless"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (267 commits)
  KVM: s390: Increase permitted SE header size to 1 MiB
  MAINTAINERS: Replace backup for s390 vfio-pci
  KVM: s390: vsie: Fix race in acquire_gmap_shadow()
  KVM: s390: vsie: Fix race in walk_guest_tables()
  KVM: s390: Use guest address to mark guest page dirty
  irqchip/riscv-imsic: Adjust the number of available guest irq files
  RISC-V: KVM: Transparent huge page support
  RISC-V: KVM: selftests: Add Zalasr extensions to get-reg-list test
  RISC-V: KVM: Allow Zalasr extensions for Guest/VM
  KVM: riscv: selftests: Add riscv vm satp modes
  KVM: riscv: selftests: add Zilsd and Zclsd extension to get-reg-list test
  riscv: KVM: allow Zilsd and Zclsd extensions for Guest/VM
  RISC-V: KVM: Skip IMSIC update if vCPU IMSIC state is not initialized
  RISC-V: KVM: Fix null pointer dereference in kvm_riscv_aia_imsic_rw_attr()
  RISC-V: KVM: Fix null pointer dereference in kvm_riscv_aia_imsic_has_attr()
  RISC-V: KVM: Remove unnecessary 'ret' assignment
  KVM: s390: Add explicit padding to struct kvm_s390_keyop
  KVM: LoongArch: selftests: Add steal time test case
  LoongArch: KVM: Add paravirt vcpu_is_preempted() support in guest side
  LoongArch: KVM: Add paravirt preempt feature in hypervisor side
  ...
2026-02-13 11:31:15 -08:00
Harry Yoo
338ad1e84d mm/page_alloc: skip debug_check_no_{obj,locks}_freed with FPI_TRYLOCK
When CONFIG_DEBUG_OBJECTS_FREE is enabled,
debug_check_no_{obj,locks}_freed() functions are called.

Since both of them spin on a lock, they are not safe to be called if the
FPI_TRYLOCK flag is specified.  This leads to a lockdep splat:

  ================================
  WARNING: inconsistent lock state
  6.19.0-rc5-slab-for-next+ #326 Tainted: G                 N
  --------------------------------
  inconsistent {INITIAL USE} -> {IN-NMI} usage.
  kunit_try_catch/9046 [HC2[2]:SC0[0]:HE0:SE1] takes:
  ffffffff84ed6bf8 (&obj_hash[i].lock){-.-.}-{2:2}, at: __debug_check_no_obj_freed+0xe0/0x300
  {INITIAL USE} state was registered at:
    lock_acquire+0xd9/0x2f0
    _raw_spin_lock_irqsave+0x4c/0x80
    __debug_object_init+0x9d/0x1f0
    debug_object_init+0x34/0x50
    __init_work+0x28/0x40
    init_cgroup_housekeeping+0x151/0x210
    init_cgroup_root+0x3d/0x140
    cgroup_init_early+0x30/0x240
    start_kernel+0x3e/0xcd0
    x86_64_start_reservations+0x18/0x30
    x86_64_start_kernel+0xf3/0x140
    common_startup_64+0x13e/0x148
  irq event stamp: 2998
  hardirqs last  enabled at (2997): [<ffffffff8298b77a>] exc_nmi+0x11a/0x240
  hardirqs last disabled at (2998): [<ffffffff8298b991>] sysvec_irq_work+0x11/0x110
  softirqs last  enabled at (1416): [<ffffffff813c1f72>] __irq_exit_rcu+0x132/0x1c0
  softirqs last disabled at (1303): [<ffffffff813c1f72>] __irq_exit_rcu+0x132/0x1c0

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(&obj_hash[i].lock);
    <Interrupt>
      lock(&obj_hash[i].lock);

   *** DEADLOCK ***

Rename free_pages_prepare() to __free_pages_prepare(), add an fpi_t
parameter, and skip those checks if FPI_TRYLOCK is set.  To keep the fpi_t
definition in mm/page_alloc.c, add a wrapper function free_pages_prepare()
that always passes FPI_NONE and use it in mm/compaction.c.
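
The wrapper pattern described above can be sketched in user-space C. This is
only an illustration, not the kernel code: the flag values, the counter, and
the stand-in debug_checks_that_spin() are assumptions made for the example,
and only the names free_pages_prepare(), __free_pages_prepare(), FPI_NONE and
FPI_TRYLOCK come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag type and values; the real fpi_t stays in mm/page_alloc.c. */
typedef unsigned int fpi_t;
#define FPI_NONE    ((fpi_t)0)
#define FPI_TRYLOCK ((fpi_t)1u << 0)   /* bit position is an assumption */

/* Counts how often the (simulated) spinning debug checks were run. */
static int debug_checks_run;

/* Stand-in for debug_check_no_{obj,locks}_freed(), which spin on a lock. */
static void debug_checks_that_spin(void)
{
	debug_checks_run++;
}

/* Inner helper: takes the flags and skips the spinning checks for trylock callers. */
static bool __free_pages_prepare(unsigned int order, fpi_t fpi_flags)
{
	(void)order;
	if (!(fpi_flags & FPI_TRYLOCK))
		debug_checks_that_spin();
	return true;
}

/* Wrapper for callers that cannot see fpi_t; always passes FPI_NONE. */
static bool free_pages_prepare(unsigned int order)
{
	return __free_pages_prepare(order, FPI_NONE);
}

/* Small self-check showing which path runs the debug checks. */
static void fpi_demo(void)
{
	debug_checks_run = 0;
	assert(free_pages_prepare(0));                   /* normal path: checks run */
	assert(debug_checks_run == 1);
	assert(__free_pages_prepare(0, FPI_TRYLOCK));    /* trylock path: checks skipped */
	assert(debug_checks_run == 1);
}
```

Calling fpi_demo() from a test program exercises both paths. The design choice
mirrored here is that external callers keep the old one-argument signature,
so the flag type does not have to leave its defining file.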

Link: https://lkml.kernel.org/r/20260209062639.16577-1-harry.yoo@oracle.com
Fixes: 8c57b687e8 ("mm, bpf: Introduce free_pages_nolock()")
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-12 15:40:15 -08:00
Joshua Hahn
1d3f9bb4c8 mm/hugetlb: restore failed global reservations to subpool
Commit a833a693a4 ("mm: hugetlb: fix incorrect fallback for subpool")
fixed an underflow error for hstate->resv_huge_pages caused by incorrectly
attributing globally requested pages to the subpool's reservation.

Unfortunately, this fix also introduced the opposite problem, which would
leave spool->used_hpages elevated if the globally requested pages could
not be acquired.  This is because while a subpool's reserve pages only
account for what is requested and allocated from the subpool, its "used"
counter keeps track of what is consumed in total, both from the subpool
and globally.  Thus, we need to adjust spool->used_hpages in the other
direction, and make sure that globally requested pages are uncharged from
the subpool's used counter.

Each failed allocation attempt increments the used_hpages counter by how
many pages were requested from the global pool.  Ultimately, this renders
the subpool unusable, as used_hpages approaches the max limit.

The issue can be reproduced as follows:
1. Allocate 4 hugetlb pages
2. Create a hugetlb mount with max=4, min=2
3. Consume 2 pages globally
4. Request 3 pages from the subpool (2 from subpool + 1 from global)
	4.1 hugepage_subpool_get_pages(spool, 3) succeeds.
		used_hpages += 3
	4.2 hugetlb_acct_memory(h, 1) fails: no global pages left
		used_hpages -= 2
5. Subpool now has used_hpages = 1, despite not being able to
   successfully allocate any hugepages. It believes it can now only
   allocate 3 more hugepages, not 4.
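
The arithmetic in the steps above can be replayed with a toy model. This is a
user-space sketch under stated assumptions, not the hugetlb code: struct
toy_subpool and toy_failed_request() are hypothetical names, and the model
only tracks the used_hpages counter for a request whose global part fails.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy subpool; field names follow the commit text, the struct is hypothetical. */
struct toy_subpool {
	long max_hpages;   /* the "max=" limit of the mount */
	long used_hpages;  /* pages charged to the subpool, from any source */
};

/*
 * Simulate one request of `total` pages where `from_global` of them must come
 * from the global pool and the global reservation fails.
 * Returns used_hpages after the error path has run.
 */
static long toy_failed_request(struct toy_subpool *sp, long total,
			       long from_global, bool with_fix)
{
	long from_subpool = total - from_global;

	assert(from_global >= 0 && from_global <= total);
	assert(sp->used_hpages + total <= sp->max_hpages);

	sp->used_hpages += total;         /* step 4.1: the subpool charge succeeds */

	/* step 4.2: the global reservation fails, so the charge is unwound */
	sp->used_hpages -= from_subpool;  /* behaviour described before the fix */
	if (with_fix)
		sp->used_hpages -= from_global;  /* also uncharge the global share */

	return sp->used_hpages;
}
```

With max=4 and a request for 3 pages of which 1 is global,
toy_failed_request() leaves used_hpages at 1 when with_fix is false, matching
step 5, and at 0 when with_fix is true.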

With each failed allocation attempt incrementing the used counter, the
subpool eventually reaches a point where its used counter equals its
max counter.  At that point, any future allocations that try to
allocate hugeTLB pages from the subpool will fail, despite the subpool
not having any of its hugeTLB pages consumed by any user.

Once this happens, there is no way to make the subpool usable again,
since there is no way to decrement the used counter as no process is
really consuming the hugeTLB pages.

The underflow issue that the original commit fixes still remains fixed
as well.

Without this fix, used_hpages would keep on leaking if
hugetlb_acct_memory() fails.

Link: https://lkml.kernel.org/r/20260116204037.2270096-1-joshua.hahnjy@gmail.com
Fixes: a833a693a4 ("mm: hugetlb: fix incorrect fallback for subpool")
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Acked-by: Usama Arif <usama.arif@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Ma Wupeng <mawupeng1@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-12 15:40:15 -08:00
Linus Torvalds
136114e0ab mm.git review status for linus..mm-nonmm-stable
Total patches:       107
 Reviews/patch:       1.07
 Reviewed rate:       67%
 
 - The 2 patch series "ocfs2: give ocfs2 the ability to reclaim
   suballocator free bg" from Heming Zhao saves disk space by teaching
   ocfs2 to reclaim suballocator block group space.
 
 - The 4 patch series "Add ARRAY_END(), and use it to fix off-by-one
   bugs" from Alejandro Colomar adds the ARRAY_END() macro and uses it in
   various places.
 
 - The 2 patch series "vmcoreinfo: support VMCOREINFO_BYTES larger than
   PAGE_SIZE" from Pnina Feder makes the vmcore code future-safe, if
   VMCOREINFO_BYTES ever exceeds the page size.
 
 - The 7 patch series "kallsyms: Prevent invalid access when showing
   module buildid" from Petr Mladek cleans up kallsyms code related to
   module buildid and fixes an invalid access crash when printing
   backtraces.
 
 - The 3 patch series "Address page fault in
   ima_restore_measurement_list()" from Harshit Mogalapalli fixes a
   kexec-related crash that can occur when booting the second-stage kernel
   on x86.
 
 - The 6 patch series "kho: ABI headers and Documentation updates" from
   Mike Rapoport updates the kexec handover ABI documentation.
 
 - The 4 patch series "Align atomic storage" from Finn Thain adds the
   __aligned attribute to atomic_t and atomic64_t definitions to get
   natural alignment of both types on csky, m68k, microblaze, nios2,
   openrisc and sh.
 
 - The 2 patch series "kho: clean up page initialization logic" from
   Pratyush Yadav simplifies the page initialization logic in
   kho_restore_page().
 
 - The 6 patch series "Unload linux/kernel.h" from Yury Norov moves
   several things out of kernel.h and into more appropriate places.
 
 - The 7 patch series "don't abuse task_struct.group_leader" from Oleg
   Nesterov removes the usage of ->group_leader when it is "obviously
   unnecessary".
 
 - The 5 patch series "list private v2 & luo flb" from Pasha Tatashin
   adds some infrastructure improvements to the live update orchestrator.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaY4giAAKCRDdBJ7gKXxA
 jgusAQDnKkP8UWTqXPC1jI+OrDJGU5ciAx8lzLeBVqMKzoYk9AD/TlhT2Nlx+Ef6
 0HCUHUD0FMvAw/7/Dfc6ZKxwBEIxyww=
 =mmsH
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:

 - "ocfs2: give ocfs2 the ability to reclaim suballocator free bg" saves
   disk space by teaching ocfs2 to reclaim suballocator block group
   space (Heming Zhao)

 - "Add ARRAY_END(), and use it to fix off-by-one bugs" adds the
   ARRAY_END() macro and uses it in various places (Alejandro Colomar)

 - "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE" makes
   the vmcore code future-safe, if VMCOREINFO_BYTES ever exceeds the
   page size (Pnina Feder)

 - "kallsyms: Prevent invalid access when showing module buildid" cleans
   up kallsyms code related to module buildid and fixes an invalid
   access crash when printing backtraces (Petr Mladek)

 - "Address page fault in ima_restore_measurement_list()" fixes a
   kexec-related crash that can occur when booting the second-stage
   kernel on x86 (Harshit Mogalapalli)

 - "kho: ABI headers and Documentation updates" updates the kexec
   handover ABI documentation (Mike Rapoport)

 - "Align atomic storage" adds the __aligned attribute to atomic_t and
   atomic64_t definitions to get natural alignment of both types on
   csky, m68k, microblaze, nios2, openrisc and sh (Finn Thain)

 - "kho: clean up page initialization logic" simplifies the page
   initialization logic in kho_restore_page() (Pratyush Yadav)

 - "Unload linux/kernel.h" moves several things out of kernel.h and into
   more appropriate places (Yury Norov)

 - "don't abuse task_struct.group_leader" removes the usage of
   ->group_leader when it is "obviously unnecessary" (Oleg Nesterov)

 - "list private v2 & luo flb" adds some infrastructure improvements to
   the live update orchestrator (Pasha Tatashin)

* tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (107 commits)
  watchdog/hardlockup: simplify perf event probe and remove per-cpu dependency
  procfs: fix missing RCU protection when reading real_parent in do_task_stat()
  watchdog/softlockup: fix sample ring index wrap in need_counting_irqs()
  kcsan, compiler_types: avoid duplicate type issues in BPF Type Format
  kho: fix doc for kho_restore_pages()
  tests/liveupdate: add in-kernel liveupdate test
  liveupdate: luo_flb: introduce File-Lifecycle-Bound global state
  liveupdate: luo_file: Use private list
  list: add kunit test for private list primitives
  list: add primitives for private list manipulations
  delayacct: fix uapi timespec64 definition
  panic: add panic_force_cpu= parameter to redirect panic to a specific CPU
  netclassid: use thread_group_leader(p) in update_classid_task()
  RDMA/umem: don't abuse current->group_leader
  drm/pan*: don't abuse current->group_leader
  drm/amd: kill the outdated "Only the pthreads threading model is supported" checks
  drm/amdgpu: don't abuse current->group_leader
  android/binder: use same_thread_group(proc->tsk, current) in binder_mmap()
  android/binder: don't abuse current->group_leader
  kho: skip memoryless NUMA nodes when reserving scratch areas
  ...
2026-02-12 12:13:01 -08:00
Linus Torvalds
4cff5c05e0 mm.git review status for linus..mm-stable
Everything:
 
 Total patches:       325
 Reviews/patch:       1.39
 Reviewed rate:       72%
 
 Excluding DAMON:
 
 Total patches:       262
 Reviews/patch:       1.63
 Reviewed rate:       82%
 
 Excluding DAMON and zram:
 
 Total patches:       248
 Reviews/patch:       1.72
 Reviewed rate:       86%
 
 - The 14 patch series "powerpc/64s: do not re-activate batched TLB
   flush" from Alexander Gordeev makes arch_{enter|leave}_lazy_mmu_mode()
   nest properly.
 
   It adds a generic enter/leave layer and switches architectures to use
   it.  Various hacks were removed in the process.
 
 - The 7 patch series "zram: introduce compressed data writeback" from
   Richard Chang and Sergey Senozhatsky implements data compression for
   zram writeback.
 
 - The 8 patch series "mm: folio_zero_user: clear page ranges" from David
   Hildenbrand adds clearing of contiguous page ranges for hugepages.
   Large improvements during demand faulting are demonstrated.
 
 - The 2 patch series "memcg cleanups" from Chen Ridong tidies up some
   memcg code.
 
 - The 12 patch series "mm/damon: introduce {,max_}nr_snapshots and
   tracepoint for damos stats" from SeongJae Park improves DAMOS stat's
   provided information, deterministic control, and readability.
 
 - The 3 patch series "selftests/mm: hugetlb cgroup charging: robustness
   fixes" from Li Wang fixes a few issues in the hugetlb cgroup charging
   selftests.
 
 - The 5 patch series "Fix va_high_addr_switch.sh test failure - again"
   from Chunyu Hu addresses several issues in the va_high_addr_switch test.
 
 - The 5 patch series "mm/damon/tests/core-kunit: extend existing test
   scenarios" from Shu Anzai improves the KUnit test coverage for DAMON.
 
 - The 2 patch series "mm/khugepaged: fix dirty page handling for
   MADV_COLLAPSE" from Shivank Garg fixes a glitch in khugepaged which was
   causing madvise(MADV_COLLAPSE) to transiently return -EAGAIN.
 
 - The 29 patch series "arch, mm: consolidate hugetlb early reservation"
   from Mike Rapoport reworks and consolidates a pile of straggly code
   related to reservation of hugetlb memory from bootmem and creation of
   CMA areas for hugetlb.
 
 - The 9 patch series "mm: clean up anon_vma implementation" from Lorenzo
   Stoakes cleans up the anon_vma implementation in various ways.
 
 - The 3 patch series "tweaks for __alloc_pages_slowpath()" from
   Vlastimil Babka does a little streamlining of the page allocator's
   slowpath code.
 
 - The 8 patch series "memcg: separate private and public ID namespaces"
   from Shakeel Butt cleans up the memcg ID code and prevents the
   internal-only private IDs from being exposed to userspace.
 
 - The 6 patch series "mm: hugetlb: allocate frozen gigantic folio" from
   Kefeng Wang cleans up the allocation of frozen folios and avoids some
   atomic refcount operations.
 
 - The 11 patch series "mm/damon: advance DAMOS-based LRU sorting" from
   SeongJae Park improves DAMOS's movement of memory between the active and
   inactive LRUs and adds auto-tuning of the ratio-based quotas and of
   monitoring intervals.
 
 - The 18 patch series "Support page table check on PowerPC" from Andrew
   Donnellan makes CONFIG_PAGE_TABLE_CHECK_ENFORCED work on powerpc.
 
 - The 3 patch series "nodemask: align nodes_and{,not} with underlying
   bitmap ops" from Yury Norov makes nodes_and() and nodes_andnot()
   propagate the return values from the underlying bit operations, enabling
   some cleanup in calling code.
 
 - The 5 patch series "mm/damon: hide kdamond and kdamond_lock from API
   callers" from SeongJae Park cleans up some DAMON internal interfaces.
 
 - The 4 patch series "mm/khugepaged: cleanups and scan limit fix" from
   Shivank Garg does some cleanup work in khugepaged and fixes a scan limit
   accounting issue.
 
 - The 24 patch series "mm: balloon infrastructure cleanups" from David
   Hildenbrand goes to town on the balloon infrastructure and its page
   migration function.  Mainly cleanups, also some locking simplification.
 
 - The 2 patch series "mm/vmscan: add tracepoint and reason for
   kswapd_failures reset" from Jiayuan Chen adds additional tracepoints to
   the page reclaim code.
 
 - The 3 patch series "Replace wq users and add WQ_PERCPU to
   alloc_workqueue() users" from Marco Crivellari is part of Marco's
   kernel-wide migration from the legacy workqueue APIs over to the
   preferred unbound workqueues.
 
 - The 9 patch series "Various mm kselftests improvements/fixes" from
   Kevin Brodsky provides various unrelated improvements/fixes for the mm
   kselftests.
 
 - The 5 patch series "mm: accelerate gigantic folio allocation" from
   Kefeng Wang greatly speeds up gigantic folio allocation, mainly by
   avoiding unnecessary work in pfn_range_valid_contig().
 
 - The 5 patch series "selftests/damon: improve leak detection and wss
   estimation reliability" from SeongJae Park improves the reliability of
   two of the DAMON selftests.
 
 - The 8 patch series "mm/damon: cleanup kdamond, damon_call(), damos
   filter and DAMON_MIN_REGION" from SeongJae Park does some cleanup work
   in the core DAMON code.
 
 - The 8 patch series "Docs/mm/damon: update intro, modules, maintainer
   profile, and misc" from SeongJae Park performs maintenance work on the
   DAMON documentation.
 
 - The 10 patch series "mm: add and use vma_assert_stabilised() helper"
   from Lorenzo Stoakes refactors and cleans up the core VMA code.  The
   main aim here is to be able to use the mmap write lock's lockdep state
   to perform various assertions regarding the locking which the VMA code
   requires.
 
 - The 19 patch series "mm, swap: swap table phase II: unify swapin use"
   from Kairui Song removes some old swap code (swap cache bypassing and
   swap synchronization) which wasn't working very well.  Various other
   cleanups and simplifications were made.  The end result is a 20% speedup
   in one benchmark.
 
 - The 8 patch series "enable PT_RECLAIM on more 64-bit architectures"
   from Qi Zheng makes PT_RECLAIM available on 64-bit alpha, loongarch,
   mips, parisc, and um.  Various cleanups were performed along the way.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaY1HfAAKCRDdBJ7gKXxA
 jqhZAP9H8ZlKKqCEgnr6U5XXmJ63Ep2FDQpl8p35yr9yVuU9+gEAgfyWiJ43l1fP
 rT0yjsUW3KQFBi/SEA3R6aYarmoIBgI=
 =+HLt
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2026-02-11-19-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - "powerpc/64s: do not re-activate batched TLB flush" makes
   arch_{enter|leave}_lazy_mmu_mode() nest properly (Alexander Gordeev)

   It adds a generic enter/leave layer and switches architectures to use
   it. Various hacks were removed in the process.

 - "zram: introduce compressed data writeback" implements data
   compression for zram writeback (Richard Chang and Sergey Senozhatsky)

 - "mm: folio_zero_user: clear page ranges" adds clearing of contiguous
   page ranges for hugepages. Large improvements during demand faulting
   are demonstrated (David Hildenbrand)

 - "memcg cleanups" tidies up some memcg code (Chen Ridong)

 - "mm/damon: introduce {,max_}nr_snapshots and tracepoint for damos
   stats" improves DAMOS stat's provided information, deterministic
   control, and readability (SeongJae Park)

 - "selftests/mm: hugetlb cgroup charging: robustness fixes" fixes a few
   issues in the hugetlb cgroup charging selftests (Li Wang)

 - "Fix va_high_addr_switch.sh test failure - again" addresses several
   issues in the va_high_addr_switch test (Chunyu Hu)

 - "mm/damon/tests/core-kunit: extend existing test scenarios" improves
   the KUnit test coverage for DAMON (Shu Anzai)

 - "mm/khugepaged: fix dirty page handling for MADV_COLLAPSE" fixes a
   glitch in khugepaged which was causing madvise(MADV_COLLAPSE) to
   transiently return -EAGAIN (Shivank Garg)

 - "arch, mm: consolidate hugetlb early reservation" reworks and
   consolidates a pile of straggly code related to reservation of
   hugetlb memory from bootmem and creation of CMA areas for hugetlb
   (Mike Rapoport)

 - "mm: clean up anon_vma implementation" cleans up the anon_vma
   implementation in various ways (Lorenzo Stoakes)

 - "tweaks for __alloc_pages_slowpath()" does a little streamlining of
   the page allocator's slowpath code (Vlastimil Babka)

 - "memcg: separate private and public ID namespaces" cleans up the
   memcg ID code and prevents the internal-only private IDs from being
   exposed to userspace (Shakeel Butt)

 - "mm: hugetlb: allocate frozen gigantic folio" cleans up the
   allocation of frozen folios and avoids some atomic refcount
   operations (Kefeng Wang)

 - "mm/damon: advance DAMOS-based LRU sorting" improves DAMOS's movement
   of memory between the active and inactive LRUs and adds auto-tuning
   of the ratio-based quotas and of monitoring intervals (SeongJae Park)

 - "Support page table check on PowerPC" makes
   CONFIG_PAGE_TABLE_CHECK_ENFORCED work on powerpc (Andrew Donnellan)

 - "nodemask: align nodes_and{,not} with underlying bitmap ops" makes
   nodes_and() and nodes_andnot() propagate the return values from the
   underlying bit operations, enabling some cleanup in calling code
   (Yury Norov)

 - "mm/damon: hide kdamond and kdamond_lock from API callers" cleans up
   some DAMON internal interfaces (SeongJae Park)

 - "mm/khugepaged: cleanups and scan limit fix" does some cleanup work
   in khugepaged and fixes a scan limit accounting issue (Shivank Garg)

 - "mm: balloon infrastructure cleanups" goes to town on the balloon
   infrastructure and its page migration function. Mainly cleanups, also
   some locking simplification (David Hildenbrand)

 - "mm/vmscan: add tracepoint and reason for kswapd_failures reset" adds
   additional tracepoints to the page reclaim code (Jiayuan Chen)

 - "Replace wq users and add WQ_PERCPU to alloc_workqueue() users" is
   part of Marco's kernel-wide migration from the legacy workqueue APIs
   over to the preferred unbound workqueues (Marco Crivellari)

 - "Various mm kselftests improvements/fixes" provides various unrelated
   improvements/fixes for the mm kselftests (Kevin Brodsky)

 - "mm: accelerate gigantic folio allocation" greatly speeds up gigantic
   folio allocation, mainly by avoiding unnecessary work in
   pfn_range_valid_contig() (Kefeng Wang)

 - "selftests/damon: improve leak detection and wss estimation
   reliability" improves the reliability of two of the DAMON selftests
   (SeongJae Park)

 - "mm/damon: cleanup kdamond, damon_call(), damos filter and
   DAMON_MIN_REGION" does some cleanup work in the core DAMON code
   (SeongJae Park)

 - "Docs/mm/damon: update intro, modules, maintainer profile, and misc"
   performs maintenance work on the DAMON documentation (SeongJae Park)

 - "mm: add and use vma_assert_stabilised() helper" refactors and cleans
   up the core VMA code. The main aim here is to be able to use the mmap
   write lock's lockdep state to perform various assertions regarding
   the locking which the VMA code requires (Lorenzo Stoakes)

 - "mm, swap: swap table phase II: unify swapin use" removes some old
   swap code (swap cache bypassing and swap synchronization) which
   wasn't working very well. Various other cleanups and simplifications
   were made. The end result is a 20% speedup in one benchmark (Kairui
   Song)

 - "enable PT_RECLAIM on more 64-bit architectures" makes PT_RECLAIM
   available on 64-bit alpha, loongarch, mips, parisc, and um. Various
   cleanups were performed along the way (Qi Zheng)

* tag 'mm-stable-2026-02-11-19-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (325 commits)
  mm/memory: handle non-split locks correctly in zap_empty_pte_table()
  mm: move pte table reclaim code to memory.c
  mm: make PT_RECLAIM depends on MMU_GATHER_RCU_TABLE_FREE
  mm: convert __HAVE_ARCH_TLB_REMOVE_TABLE to CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config
  um: mm: enable MMU_GATHER_RCU_TABLE_FREE
  parisc: mm: enable MMU_GATHER_RCU_TABLE_FREE
  mips: mm: enable MMU_GATHER_RCU_TABLE_FREE
  LoongArch: mm: enable MMU_GATHER_RCU_TABLE_FREE
  alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE
  mm: change mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h
  mm/damon/stat: remove __read_mostly from memory_idle_ms_percentiles
  zsmalloc: make common caches global
  mm: add SPDX id lines to some mm source files
  mm/zswap: use %pe to print error pointers
  mm/vmscan: use %pe to print error pointers
  mm/readahead: fix typo in comment
  mm: khugepaged: fix NR_FILE_PAGES and NR_SHMEM in collapse_file()
  mm: refactor vma_map_pages to use vm_insert_pages
  mm/damon: unify address range representation with damon_addr_range
  mm/cma: replace snprintf with strscpy in cma_new_area
  ...
2026-02-12 11:32:37 -08:00
Linus Torvalds
997f9640c9 fsverity updates for 7.0
fsverity cleanups, speedup, and memory usage optimization from
 Christoph Hellwig:
 
 - Move some logic into common code
 
 - Fix btrfs to reject truncates of fsverity files
 
 - Improve the readahead implementation
 
 - Store each inode's fsverity_info in a hash table instead of using a
   pointer in the filesystem-specific part of the inode.
 
   This optimizes for memory usage in the usual case where most files
   don't have fsverity enabled.
 
 - Look up the fsverity_info fewer times during verification, to
   amortize the hash table overhead
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCaY0nZhQcZWJpZ2dlcnNA
 a2VybmVsLm9yZwAKCRDzXCl4vpKOK/AVAP9wSLEYsG3dqnNIHjIvLeK+9NC3Ni4d
 m+fvT1JfuideOwEA9r2EfztusLU5iyqWJlHyxekibXItUDgYGltaYb7eXAU=
 =a+To
 -----END PGP SIGNATURE-----

Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux

Pull fsverity updates from Eric Biggers:
 "fsverity cleanups, speedup, and memory usage optimization from
  Christoph Hellwig:

   - Move some logic into common code

   - Fix btrfs to reject truncates of fsverity files

   - Improve the readahead implementation

   - Store each inode's fsverity_info in a hash table instead of using a
     pointer in the filesystem-specific part of the inode.

     This optimizes for memory usage in the usual case where most files
     don't have fsverity enabled.

   - Look up the fsverity_info fewer times during verification, to
     amortize the hash table overhead"

* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux:
  fsverity: remove inode from fsverity_verification_ctx
  fsverity: use a hashtable to find the fsverity_info
  btrfs: consolidate fsverity_info lookup
  f2fs: consolidate fsverity_info lookup
  ext4: consolidate fsverity_info lookup
  fs: consolidate fsverity_info lookup in buffer.c
  fsverity: push out fsverity_info lookup
  fsverity: deconstify the inode pointer in struct fsverity_info
  fsverity: kick off hash readahead at data I/O submission time
  ext4: move ->read_folio and ->readahead to readpage.c
  readahead: push invalidate_lock out of page_cache_ra_unbounded
  fsverity: don't issue readahead for non-ENOENT errors from __filemap_get_folio
  fsverity: start consolidating pagecache code
  fsverity: pass struct file to ->write_merkle_tree_block
  f2fs: don't build the fsverity work handler for !CONFIG_FS_VERITY
  ext4: don't build the fsverity work handler for !CONFIG_FS_VERITY
  fs,fsverity: clear out fsverity_info from common code
  fs,fsverity: reject size changes on fsverity files in setattr_prepare
2026-02-12 10:41:34 -08:00
Linus Torvalds
1e0ea4dff0 IOMMU Updates for Linux v7.0
Including:
 
 	- Core changes:
 	  - Rust bindings for IO-pgtable code
 	  - IOMMU page allocation debugging support
 	  - Disable ATS during PCI resets
 
 	- Intel VT-d changes:
 	  - Skip dev-iotlb flush for inaccessible PCIe device
 	  - Flush cache for PASID table before using it
 	  - Use right invalidation method for SVA and NESTED domains
 	  - Ensure atomicity in context and PASID entry updates
 
 	- AMD-Vi changes:
 	  - Support for nested translations
 	  - Other minor improvements
 
 	- ARM-SMMU-v2 changes:
 	  - Configure SoC-specific prefetcher settings for Qualcomm's "MDSS".
 
 	- ARM-SMMU-v3 changes:
 	  - Improve CMDQ locking fairness for pathetically small queue sizes.
 	  - Remove tracking of the IAS as this is only relevant for AArch32 and
 	    was causing C_BAD_STE errors.
 	  - Add device-tree support for NVIDIA's CMDQV extension.
 	  - Allow some hitless transitions for the 'MEV' and 'EATS' STE fields.
 	  - Don't disable ATS for nested S1-bypass domains.
 	  - Additions to the kunit selftests.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmmLDZwACgkQK/BELZcB
 GuNHgg//Yf9K/+T6+IOemA5Z8k3x2p39Q/Dv5x+SEGkh+CUh2C5dX97WD9LHntus
 1mgIHlSgbM3bgMB+XTS1Q5ghy1QH71XOMnGCPhthwg843iCP2CcrB84ZZKKnNmw9
 2YJdxYlNcbAMpvSd0F1XKaXoiNl9qzWx+QFtnVaTXMptNEhYOxMOlaZPtlEuwfJa
 T7h4cwtsiMDLWA4pw85y4hfvc5jKRv4dMoohin0lNEBpWkCfYE6b2Cjpff+9TtU2
 Jyvvcvyns0US3amEwPHlIyfTUPKdaq6Vv3NX8TkAJUhGyEzdfwEtzqAvWMvOEYFh
 HfnE/LjZZLB1CUkF5MTib9dBgJACf/jtvOtuh4wZkx+7O2WIR6Ebo41dtWBM6dxh
 cHGeeQGqxdDZ5UJbIonF8Am0lxsaZx2zs09tlHEMGl2pNDi6vUppk1iTOkv3Wog0
 zy4GhDBl0n/IcyCaIinnWck8C+BsAMcRGpDP2AB0I9/C2qpsaFY/NdNkbIGidhaJ
 3khdAcjWsNPiJPNbUx66n6t8RSXdYKUuhJq2a/GgYmtAjhRR9cJlupB8/QYCBS5j
 fxXpHp4xMtw+Cgj58xC+gYXDivQOEThPs/BhL/qrxOzWE03HWI15MFydqRFWicnI
 gJCZSevMncBfNUTIJUSUmuT7ukP40cnh58QBeRkTmKGcW6HjuyY=
 =W/nW
 -----END PGP SIGNATURE-----

Merge tag 'iommu-updates-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux

Pull iommu updates from Joerg Roedel:
 "Core changes:
   - Rust bindings for IO-pgtable code
   - IOMMU page allocation debugging support
   - Disable ATS during PCI resets

  Intel VT-d changes:
   - Skip dev-iotlb flush for inaccessible PCIe device
   - Flush cache for PASID table before using it
   - Use right invalidation method for SVA and NESTED domains
   - Ensure atomicity in context and PASID entry updates

  AMD-Vi changes:
   - Support for nested translations
   - Other minor improvements

  ARM-SMMU-v2 changes:
   - Configure SoC-specific prefetcher settings for Qualcomm's "MDSS"

  ARM-SMMU-v3 changes:
   - Improve CMDQ locking fairness for pathetically small queue sizes
   - Remove tracking of the IAS as this is only relevant for AArch32 and
     was causing C_BAD_STE errors
   - Add device-tree support for NVIDIA's CMDQV extension
   - Allow some hitless transitions for the 'MEV' and 'EATS' STE fields
   - Don't disable ATS for nested S1-bypass domains
   - Additions to the kunit selftests"

* tag 'iommu-updates-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (54 commits)
  iommupt: Always add IOVA range to iotlb_gather in gather_range_pages()
  iommu/amd: serialize sequence allocation under concurrent TLB invalidations
  iommu/amd: Fix type of type parameter to amd_iommufd_hw_info()
  iommu/arm-smmu-v3: Do not set disable_ats unless vSTE is Translate
  iommu/arm-smmu-v3-test: Add nested s1bypass/s1dssbypass coverage
  iommu/arm-smmu-v3: Mark EATS_TRANS safe when computing the update sequence
  iommu/arm-smmu-v3: Mark STE MEV safe when computing the update sequence
  iommu/arm-smmu-v3: Add update_safe bits to fix STE update sequence
  iommu/arm-smmu-v3: Add device-tree support for CMDQV driver
  iommu/tegra241-cmdqv: Decouple driver from ACPI
  iommu/arm-smmu-qcom: Restore ACTLR settings for MDSS on sa8775p
  iommu/vt-d: Fix race condition during PASID entry replacement
  iommu/vt-d: Clear Present bit before tearing down context entry
  iommu/vt-d: Clear Present bit before tearing down PASID entry
  iommu/vt-d: Flush piotlb for SVM and Nested domain
  iommu/vt-d: Flush cache for PASID table before using it
  iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable mode
  iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without scalable mode
  rust: iommu: fix `srctree` link warning
  rust: iommu: fix Rust formatting
  ...
2026-02-11 16:36:08 -08:00
Linus Torvalds
148f95f75c slab updates for 7.0
-----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCAA5FiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmmK68UbFIAAAAAABAAO
 bWFudTIsMi41KzEuMTEsMiwyAAoJELvgsHXSRYiamM8H/0eOKSvZG/C/HdTm36cy
 pVjOjgX9KmlHoeH1dOMjqgL2KfOIBis8j1GY0Q/qF1a86uzQa6uuz4XdmJeTUkEE
 YfzwOdaLIR0U6R/gIH9YPfyU9h3VBLUNtotDculntSO3ZgwY5QUHQHz+ROnVG5SU
 MSQ2oshSRkh06LRIlvbd0kLax8vZy8UjfYPonF33+XRya17nIY6V2DvqC0MDuEcM
 jWvbQfm5HTamTAlSV4bmJw+U/FehEdpC4U0ulsAtQILGpJvHCwqDGCNQRFkzcsaM
 yhi1JLFCZcoHqbQycZMNAypPERfIp8O5thSU6xU2AP/cNl2scR/7/MSuWOvjKBv4
 pKU=
 =u52A
 -----END PGP SIGNATURE-----

Merge tag 'slab-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab

Pull slab updates from Vlastimil Babka:

 - The percpu sheaves caching layer was introduced as opt-in in 6.18 and
   now we enable it for all caches and remove the previous cpu (partial)
   slab caching mechanism.

   Besides the lower locking overhead and much more likely fastpath when
   freeing, this removes the rather complicated code related to the cpu
   slab lockless fastpaths (using this_cpu_try_cmpxchg128/64) and all
   its complications for PREEMPT_RT or kmalloc_nolock().

   The lockless slab freelist+counters update operation using
   try_cmpxchg128/64 remains and is crucial for freeing remote NUMA
   objects, and to allow flushing objects from sheaves to slabs mostly
   without the node list_lock (Vlastimil Babka)

 - Eliminate slabobj_ext metadata overhead when possible. Instead of
   using kmalloc() to allocate the array for memcg and/or allocation
   profiling tag pointers, use leftover space in a slab or per-object
   padding due to alignment (Harry Yoo)

 - Various followup improvements to the above (Hao Li)
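
The leftover-space idea in the slabobj_ext item comes down to simple
arithmetic. The Python sketch below is illustrative only:
ext_array_fits_in_leftover() is a hypothetical helper, and the slab size,
object sizes and 16-byte per-object entry size are assumed example values,
not figures taken from the series.

```python
def ext_array_fits_in_leftover(slab_bytes: int, obj_size: int,
                               ext_entry_size: int):
    """Check whether a per-object metadata array fits in a slab's tail.

    Returns (objects, leftover_bytes, fits).
    """
    # Number of whole objects that fit in the slab.
    objects = slab_bytes // obj_size
    # Bytes at the end of the slab that no object occupies.
    leftover = slab_bytes - objects * obj_size
    # One metadata entry is needed per object.
    needed = objects * ext_entry_size
    return objects, leftover, leftover >= needed


# Assumed example: a 4096-byte slab with 16-byte metadata entries.
for obj_size in (1344, 512):
    objects, leftover, fits = ext_array_fits_in_leftover(4096, obj_size, 16)
    print(f"obj_size={obj_size}: objects={objects}, "
          f"leftover={leftover}, fits={fits}")
```

When the array does not fit, as in the 512-byte case where the objects
tile the slab exactly, the metadata has to live somewhere else, such as
per-object alignment padding or a separate allocation.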

* tag 'slab-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (39 commits)
  slub: let need_slab_obj_exts() return false if SLAB_NO_OBJ_EXT is set
  mm/slab: only allow SLAB_OBJ_EXT_IN_OBJ for unmergeable caches
  mm/slab: place slabobj_ext metadata in unused space within s->size
  mm/slab: move [__]ksize and slab_ksize() to mm/slub.c
  mm/slab: save memory by allocating slabobj_ext array from leftover
  mm/memcontrol,alloc_tag: handle slabobj_ext access under KASAN poison
  mm/slab: use stride to access slabobj_ext
  mm/slab: abstract slabobj_ext access via new slab_obj_ext() helper
  ext4: specify the free pointer offset for ext4_inode_cache
  mm/slab: allow specifying free pointer offset when using constructor
  mm/slab: use unsigned long for orig_size to ensure proper metadata align
  slub: clarify object field layout comments
  mm/slab: avoid allocating slabobj_ext array from its own slab
  slub: avoid list_lock contention from __refill_objects_any()
  mm/slub: cleanup and repurpose some stat items
  mm/slub: remove DEACTIVATE_TO_* stat items
  slab: remove frozen slab checks from __slab_free()
  slab: update overview comments
  slab: refill sheaves from all nodes
  slab: remove unused PREEMPT_RT specific macros
  ...
2026-02-11 14:12:50 -08:00
Linus Torvalds
ff661eeee2 cgroup: Changes for v6.20
- cpuset changes:
 
   - Continue separating v1 and v2 implementations by moving more
     v1-specific logic into cpuset-v1.c.
 
   - Improve partition handling. Sibling partitions are no longer
     invalidated on cpuset.cpus conflict, cpuset.cpus changes no longer
     fail in v2, and effective_xcpus computation is made consistent.
 
   - Fix partition effective CPUs overlap that caused a warning on cpuset
     removal when sibling partitions shared CPUs.
 
 - Increase the maximum cgroup subsystem count from 16 to 32 to
   accommodate future subsystem additions.
 
 - Misc cleanups and selftest improvements including switching to
   css_is_online() helper, removing dead code and stale documentation
   references, using lockdep_assert_cpuset_lock_held() consistently,
   and adding polling helpers for asynchronously updated cgroup
   statistics.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaYozIw4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGZQKAQD51KJQz4M79wf2yBhIBLOnM4aakMalhSwZNL4O
 JiGutwD+Ir33VzNX8aXBuDin9p4wI15O54PhqSenJbelKRQ3Dws=
 =gR7L
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:

 - cpuset changes:

    - Continue separating v1 and v2 implementations by moving more
      v1-specific logic into cpuset-v1.c

    - Improve partition handling. Sibling partitions are no longer
      invalidated on cpuset.cpus conflict, cpuset.cpus changes no longer
      fail in v2, and effective_xcpus computation is made consistent

    - Fix partition effective CPUs overlap that caused a warning on
      cpuset removal when sibling partitions shared CPUs

 - Increase the maximum cgroup subsystem count from 16 to 32 to
   accommodate future subsystem additions

 - Misc cleanups and selftest improvements including switching to
   css_is_online() helper, removing dead code and stale documentation
   references, using lockdep_assert_cpuset_lock_held() consistently, and
   adding polling helpers for asynchronously updated cgroup statistics

* tag 'cgroup-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
  cpuset: fix overlap of partition effective CPUs
  cgroup: increase maximum subsystem count from 16 to 32
  cgroup: Remove stale cpu.rt.max reference from documentation
  cpuset: replace direct lockdep_assert_held() with lockdep_assert_cpuset_lock_held()
  cgroup/cpuset: Move the v1 empty cpus/mems check to cpuset1_validate_change()
  cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus conflict
  cgroup/cpuset: Don't fail cpuset.cpus change in v2
  cgroup/cpuset: Consistently compute effective_xcpus in update_cpumasks_hier()
  cgroup/cpuset: Streamline rm_siblings_excl_cpus()
  cpuset: remove dead code in cpuset-v1.c
  cpuset: remove v1-specific code from generate_sched_domains
  cpuset: separate generate_sched_domains for v1 and v2
  cpuset: move update_domain_attr_tree to cpuset_v1.c
  cpuset: add cpuset1_init helper for v1 initialization
  cpuset: add cpuset1_online_css helper for v1-specific operations
  cpuset: add lockdep_assert_cpuset_lock_held helper
  cpuset: Remove unnecessary checks in rebuild_sched_domains_locked
  cgroup: switch to css_is_online() helper
  selftests: cgroup: Replace sleep with cg_read_key_long_poll() for waiting on nr_dying_descendants
  selftests: cgroup: make test_memcg_sock robust against delayed sock stats
  ...
2026-02-11 13:20:50 -08:00
Paolo Bonzini
b1195183ed - gmap rewrite: completely new memory management for kvm/s390
- vSIE improvement
 - maintainership change for s390 vfio-pci
 - small quality of life improvement for protected guests
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEoWuZBM6M3lCBSfTnuARItAMU6BMFAmmLOKAACgkQuARItAMU
 6BOkhQ/9G0Sr1bfkcSQvbszvjSoDsOCSm+oAm39679Dr4v+7SsItknTutYK0M7dM
 n6oY2kU1dveFsF0FwBAALh4LC0lYNEmz34eHxaFPFfgqQ9lX2fBAQuBSPq8uOz1L
 Pk6IIlqTls8TAvKF/KqTXCEULnPGVXD2KP4WaSir+T2vZr6N/mqB7cZPR23/rMAi
 +PF/UtmgbfG+eRjqA8QRdm8nnzTrM5cWe9roZXYsAXDLZh+EYYeG4d96GFTV8udY
 /6mV1YKP0Aa+youC5p4oIh1Iv7p/Yjv6RxPoEbW1O31M9yBDJmFpz4W5C/rdkzwI
 nOjQj9i7ZINXf83kAZMvFb8MdXlJzaw8rBUlWzxNfrsL4ga8Rp3xMNEdufWd3T5x
 zNFXr0ANuBifi0B0EasBWlYDRbK4WGAC4vnkgmxqP5t2JiAN+d0FXB8LaRyZvgs/
 tiwEDenCk1eDWEBcWbLnX7fGGDKDUNXVMAFTrGM1BMNZe6/IL/h/sypLuYSJ/d3Y
 VXDgZZyAWUVqjidDxrwurdjyzvbPd69GDbKjhTuUu4OdqUMucjjQf74w6m857Wn/
 9oLoR0p+8deb1SQ2RuB8sujcJiO9YHczwL8PLDa+bGw3jH6TRiMVVrt1HOw5QmfG
 QpwhKvrF2yPTgv5VZbFvYEvtITnBBfaepQe97pDDEfsHqPeStmI=
 =2eiy
 -----END PGP SIGNATURE-----

Merge tag 'kvm-s390-next-7.0-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD

- gmap rewrite: completely new memory management for kvm/s390
- vSIE improvement
- maintainership change for s390 vfio-pci
- small quality of life improvement for protected guests
2026-02-11 18:52:27 +01:00
Linus Torvalds
0923fd0419 Locking updates for v6.20:
Lock debugging:
 
  - Implement compiler-driven static analysis locking context
    checking, using the upcoming Clang 22 compiler's context
    analysis features. (Marco Elver)
 
    We removed Sparse context analysis support, because prior to
    removal even a defconfig kernel produced 1,700+ context
    tracking Sparse warnings, the overwhelming majority of which
    are false positives. On an allmodconfig kernel the number of
    false positive context tracking Sparse warnings grows to
    over 5,200... On the plus side of the balance actual locking
    bugs found by Sparse context analysis are also rather ... sparse:
    I found only 3 such commits in the last 3 years. So the
    rate of false positives and the maintenance overhead are
    rather high and there appears to be no active policy in
    place to achieve a zero-warnings baseline to move the
    annotations & fixers to developers who introduce new code.
 
    Clang context analysis is more complete and more aggressive
    in trying to find bugs, at least in principle. Plus it has
    a different model to enabling it: it's enabled subsystem by
    subsystem, which results in zero warnings on all relevant
    kernel builds (as far as our testing managed to cover it).
    Which allowed us to enable it by default, similar to other
    compiler warnings, with the expectation that there are no
    warnings going forward. This enforces a zero-warnings baseline
    on clang-22+ builds. (Which are still limited in distribution,
    admittedly.)
 
    Hopefully the Clang approach can lead to a more maintainable
    zero-warnings status quo and policy, with more and more
    subsystems and drivers enabling the feature. Context tracking
    can be enabled for all kernel code via WARN_CONTEXT_ANALYSIS_ALL=y
    (default disabled), but this will generate a lot of false positives.
 
    ( Having said that, Sparse support could still be added back,
      if anyone is interested - the removal patch is still
      relatively straightforward to revert at this stage. )
 
 Rust integration updates: (Alice Ryhl, Fujita Tomonori, Boqun Feng)
 
   - Add support for Atomic<i8/i16/bool> and replace most Rust native
     AtomicBool usages with Atomic<bool>
 
   - Clean up LockClassKey and improve its documentation
 
   - Add missing Send and Sync trait implementation for SetOnce
 
   - Make ARef Unpin as it is supposed to be
 
   - Add __rust_helper to a few Rust helpers as a preparation for
     helper LTO
 
   - Inline various lock related functions to avoid additional
     function calls.
 
 WW mutexes:
 
   - Extend ww_mutex tests and other test-ww_mutex updates (John Stultz)
 
 Misc fixes and cleanups:
 
   - rcu: Mark lockdep_assert_rcu_helper() __always_inline
     (Arnd Bergmann)
 
   - locking/local_lock: Include more missing headers (Peter Zijlstra)
 
   - seqlock: fix scoped_seqlock_read kernel-doc (Randy Dunlap)
 
   - rust: sync: Replace `kernel::c_str!` with C-Strings
     (Tamir Duberstein)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmmIXiURHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gH+A/9GX5UmU6+HuDfDrCtXm9GDve6wkwahvcW
 jLDxOYjs764I2BhyjZnjKjyF5zw60hbykem7Wcf5EV2YH30nM4XRgEWVJfkr1UAI
 Pra415X4DdOzZ6qYQIpO8Udt1LtR7BMSaXITVLJaLicxEoOVtq3SKxjqyhCFs7UW
 MfJdqleB+RMLqq3LlzgB4l43eKk1xyeHh+oQwI0RSxuIpVZme3p4TObnCKjIWnK7
 Ihd+dkgC852WBjANgNL7F/sd5UsF5QX3wjtOrLhMKvkIgTPdXln0g398pivjN/G/
 Kpnw18SFeb159JfJu8eMotsYvVnQ0D5aOcTBfL4qvOHCImhpcu2s6ik9BcXqt2yT
 8IiuWk9xEM3Ok+I/I4ClT5cf5GYpyigV2QsXxn+IjDX5Na8v4zlHh0r8SElP8fOt
 7dpQx7iw8UghAib3AzA3suN78Oh39m8l5BNobj7LAjnqOQcVvoPo4o7/48ntuH7A
 38EucFrXfxQBMfGbMwvxEmgYuX7MyVfQLaPE06MHy1BkZkffT8Um38TB0iNtZmtf
 WUx01yLKWYspehlwFi319uVI4/Zp7FnTfqa5uKv1oSXVdL9vZojSXUzrgDV7FVqT
 Z4xAAw/kwNHpUG7y0zNOqd6PukovG1t+CjbLvK+eHPwc5c0vEGG2oTRAfEvvP1z/
 kesYDmCyJnk=
 =N1gA
 -----END PGP SIGNATURE-----

Merge tag 'locking-core-2026-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Ingo Molnar:
 "Lock debugging:

   - Implement compiler-driven static analysis locking context checking,
     using the upcoming Clang 22 compiler's context analysis features
     (Marco Elver)

     We removed Sparse context analysis support, because prior to
     removal even a defconfig kernel produced 1,700+ context tracking
     Sparse warnings, the overwhelming majority of which are false
     positives. On an allmodconfig kernel the number of false positive
     context tracking Sparse warnings grows to over 5,200... On the plus
     side of the balance actual locking bugs found by Sparse context
     analysis are also rather ... sparse: I found only 3 such commits in
     the last 3 years. So the rate of false positives and the
     maintenance overhead are rather high and there appears to be no
     active policy in place to achieve a zero-warnings baseline to move
     the annotations & fixers to developers who introduce new code.

     Clang context analysis is more complete and more aggressive in
     trying to find bugs, at least in principle. Plus it has a different
     model to enabling it: it's enabled subsystem by subsystem, which
     results in zero warnings on all relevant kernel builds (as far as
     our testing managed to cover it). Which allowed us to enable it by
     default, similar to other compiler warnings, with the expectation
     that there are no warnings going forward. This enforces a
     zero-warnings baseline on clang-22+ builds (Which are still limited
     in distribution, admittedly)

     Hopefully the Clang approach can lead to a more maintainable
     zero-warnings status quo and policy, with more and more subsystems
     and drivers enabling the feature. Context tracking can be enabled
     for all kernel code via WARN_CONTEXT_ANALYSIS_ALL=y (default
     disabled), but this will generate a lot of false positives.

     ( Having said that, Sparse support could still be added back,
       if anyone is interested - the removal patch is still
       relatively straightforward to revert at this stage. )

  Rust integration updates: (Alice Ryhl, Fujita Tomonori, Boqun Feng)

    - Add support for Atomic<i8/i16/bool> and replace most Rust native
      AtomicBool usages with Atomic<bool>

    - Clean up LockClassKey and improve its documentation

    - Add missing Send and Sync trait implementation for SetOnce

    - Make ARef Unpin as it is supposed to be

    - Add __rust_helper to a few Rust helpers as a preparation for
      helper LTO

    - Inline various lock related functions to avoid additional function
      calls

  WW mutexes:

    - Extend ww_mutex tests and other test-ww_mutex updates (John
      Stultz)

  Misc fixes and cleanups:

    - rcu: Mark lockdep_assert_rcu_helper() __always_inline (Arnd
      Bergmann)

    - locking/local_lock: Include more missing headers (Peter Zijlstra)

    - seqlock: fix scoped_seqlock_read kernel-doc (Randy Dunlap)

    - rust: sync: Replace `kernel::c_str!` with C-Strings (Tamir
      Duberstein)"

* tag 'locking-core-2026-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (90 commits)
  locking/rwlock: Fix write_trylock_irqsave() with CONFIG_INLINE_WRITE_TRYLOCK
  rcu: Mark lockdep_assert_rcu_helper() __always_inline
  compiler-context-analysis: Remove __assume_ctx_lock from initializers
  tomoyo: Use scoped init guard
  crypto: Use scoped init guard
  kcov: Use scoped init guard
  compiler-context-analysis: Introduce scoped init guards
  cleanup: Make __DEFINE_LOCK_GUARD handle commas in initializers
  seqlock: fix scoped_seqlock_read kernel-doc
  tools: Update context analysis macros in compiler_types.h
  rust: sync: Replace `kernel::c_str!` with C-Strings
  rust: sync: Inline various lock related methods
  rust: helpers: Move #define __rust_helper out of atomic.c
  rust: wait: Add __rust_helper to helpers
  rust: time: Add __rust_helper to helpers
  rust: task: Add __rust_helper to helpers
  rust: sync: Add __rust_helper to helpers
  rust: refcount: Add __rust_helper to helpers
  rust: rcu: Add __rust_helper to helpers
  rust: processor: Add __rust_helper to helpers
  ...
2026-02-10 12:28:44 -08:00
Linus Torvalds
f17b474e36 bpf-next-7.0
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmmGmrgACgkQ6rmadz2v
 bTq6NxAAkCHosxzGn9GYYBV8xhrBJoJJDCyEbQ4nR0XNY+zaWnuykmiPP9w1aOAM
 zm/po3mQB2pZjetvlrPrgG5RLgBCAUHzqVGy0r+phUvD3vbohKlmSlMm2kiXOb9N
 T01BgLWsyqN2ZcNFvORdSsftqIJUHcXxU6RdupGD60sO5XM9ty5cwyewLX8GBOas
 UN2bOhbK2DpqYWUvtv+3Q3ykxoStMSkXZvDRurwLKl4RHeLjXZXPo8NjnfBlk/F2
 vdFo/F4NO4TmhOave6UPXvKb4yo9IlBRmiPAl0RmNKBxenY8j9XuV/xZxU6YgzDn
 +SQfDK+CKQ4IYIygE+fqd4e5CaQrnjmPPcIw12AB2CF0LimY9Xxyyk6FSAhMN7wm
 GTVh5K2C3Dk3OiRQk4G58EvQ5QcxzX98IeeCpcckMUkPsFWHRvF402WMUcv9SWpD
 DsxxPkfENY/6N67EvH0qcSe/ikdUorQKFl4QjXKwsMCd5WhToeP4Z7Ck1gVSNkAh
 9CX++mLzg333Lpsc4SSIuk9bEPpFa5cUIKUY7GCsCiuOXciPeMDP3cGSd5LioqxN
 qWljs4Z88QDM2LJpAh8g4m3sA7bMhES3nPmdlI5CfgBcVyLW8D8CqQq4GEZ1McwL
 Ky084+lEosugoVjRejrdMMEOsqAfcbkTr2b8jpuAZdwJKm6p/bw=
 =cBdK
 -----END PGP SIGNATURE-----

Merge tag 'bpf-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Pull bpf updates from Alexei Starovoitov:

 - Support associating BPF program with struct_ops (Amery Hung)

 - Switch BPF local storage to rqspinlock and remove recursion detection
   counters which were causing false positives (Amery Hung)

 - Fix live registers marking for indirect jumps (Anton Protopopov)

 - Introduce execution context detection BPF helpers (Changwoo Min)

 - Improve verifier precision for 32bit sign extension pattern
   (Cupertino Miranda)

 - Optimize BTF type lookup by sorting vmlinux BTF and doing binary
   search (Donglin Peng)

 - Allow states pruning for misc/invalid slots in iterator loops (Eduard
   Zingerman)

 - In preparation for ASAN support in BPF arenas teach libbpf to move
   global BPF variables to the end of the region and enable arena kfuncs
   while holding locks (Emil Tsalapatis)

 - Introduce support for implicit arguments in kfuncs and migrate a
   number of them to new API. This is a prerequisite for cgroup
   sub-schedulers in sched-ext (Ihor Solodrai)

 - Fix incorrect copied_seq calculation in sockmap (Jiayuan Chen)

 - Fix ORC stack unwind from kprobe_multi (Jiri Olsa)

 - Speed up fentry attach by using single ftrace direct ops in BPF
   trampolines (Jiri Olsa)

 - Require frozen map for calculating map hash (KP Singh)

 - Fix lock entry creation in TAS fallback in rqspinlock (Kumar
   Kartikeya Dwivedi)

 - Allow user space to select cpu in lookup/update operations on per-cpu
   array and hash maps (Leon Hwang)

 - Make kfuncs return trusted pointers by default (Matt Bobrowski)

 - Introduce "fsession" support where single BPF program is executed
   upon entry and exit from traced kernel function (Menglong Dong)

 - Allow bpf_timer and bpf_wq use in all programs types (Mykyta
   Yatsenko, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Alexei
   Starovoitov)

 - Make KF_TRUSTED_ARGS the default for all kfuncs and clean up their
   definition across the tree (Puranjay Mohan)

 - Allow BPF arena calls from non-sleepable context (Puranjay Mohan)

 - Improve register id comparison logic in the verifier and extend
   linked registers with negative offsets (Puranjay Mohan)

 - In preparation for BPF-OOM introduce kfuncs to access memcg events
   (Roman Gushchin)

 - Use CFI compatible destructor kfunc type (Sami Tolvanen)

 - Add bitwise tracking for BPF_END in the verifier (Tianci Cao)

 - Add range tracking for BPF_DIV and BPF_MOD in the verifier (Yazhou
   Tang)

 - Make BPF selftests work with 64k page size (Yonghong Song)

* tag 'bpf-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (268 commits)
  selftests/bpf: Fix outdated test on storage->smap
  selftests/bpf: Choose another percpu variable in bpf for btf_dump test
  selftests/bpf: Remove test_task_storage_map_stress_lookup
  selftests/bpf: Update task_local_storage/task_storage_nodeadlock test
  selftests/bpf: Update task_local_storage/recursion test
  selftests/bpf: Update sk_storage_omem_uncharge test
  bpf: Switch to bpf_selem_unlink_nofail in bpf_local_storage_{map_free, destroy}
  bpf: Support lockless unlink when freeing map or local storage
  bpf: Prepare for bpf_selem_unlink_nofail()
  bpf: Remove unused percpu counter from bpf_local_storage_map_free
  bpf: Remove cgroup local storage percpu counter
  bpf: Remove task local storage percpu counter
  bpf: Change local_storage->lock and b->lock to rqspinlock
  bpf: Convert bpf_selem_unlink to failable
  bpf: Convert bpf_selem_link_map to failable
  bpf: Convert bpf_selem_unlink_map to failable
  bpf: Select bpf_local_storage_map_bucket based on bpf_local_storage
  selftests/xsk: fix number of Tx frags in invalid packet
  selftests/xsk: properly handle batch ending in the middle of a packet
  bpf: Prevent reentrance into call_rcu_tasks_trace()
  ...
2026-02-10 11:26:21 -08:00
Harry Yoo
27125df9a5 mm/slab: drop the OBJEXTS_NOSPIN_ALLOC flag from enum objext_flags
OBJEXTS_NOSPIN_ALLOC was used to remember whether a slabobj_ext vector
was allocated via kmalloc_nolock(), so that free_slab_obj_exts() could
call kfree_nolock() instead of kfree().

Now that kfree() supports freeing kmalloc_nolock() objects, this flag is
no longer needed. Instead, pass the allow_spin parameter down to
free_slab_obj_exts() to determine whether kfree_nolock() or kfree()
should be called in the free path, and free one bit in
enum objext_flags.

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Hao Li <hao.li@linux.dev>
Link: https://patch.msgid.link/20260210044642.139482-3-harry.yoo@oracle.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-10 11:39:30 +01:00
Harry Yoo
c4d6d78298 mm/slab: allow freeing kmalloc_nolock()'d objects using kfree[_rcu]()
Slab objects that are allocated with kmalloc_nolock() must be freed
using kfree_nolock() because only a subset of alloc hooks are called,
since kmalloc_nolock() can't spin on a lock during allocation.

This imposes a limitation: such objects cannot be freed with kfree_rcu(),
forcing users to work around this limitation by calling call_rcu()
with a callback that frees the object using kfree_nolock().

Remove this limitation by teaching kmemleak to gracefully ignore cases
when kmemleak_free() or kmemleak_ignore() is called without a prior
kmemleak_alloc().

Unlike kmemleak, kfence already handles this case, because,
due to its design, only a subset of allocations are served from kfence.

With this change, kfree() and kfree_rcu() can be used to free objects
that are allocated using kmalloc_nolock().
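
The tolerance described above can be pictured with a small userspace sketch. This is not the kmemleak code: `struct tracker_sketch`, `track_alloc()` and `track_free()` are hypothetical names used only for illustration. The point is that a free of an object the tracker never saw is counted and ignored rather than treated as an error.

```c
#include <stdbool.h>
#include <stddef.h>

#define TRACK_MAX 16

/* Hypothetical tracker: remembers objects announced via track_alloc(). */
struct tracker_sketch {
	const void *objs[TRACK_MAX];
	size_t nr_untracked_frees;	/* frees seen without a prior alloc */
};

/* Record an allocation; returns false if p is NULL or the table is full. */
static bool track_alloc(struct tracker_sketch *t, const void *p)
{
	if (!p)
		return false;
	for (size_t i = 0; i < TRACK_MAX; i++) {
		if (!t->objs[i]) {
			t->objs[i] = p;
			return true;
		}
	}
	return false;
}

/*
 * Forget an object. An object that was never tracked (the analogue of one
 * whose allocation skipped the tracking hook) is ignored gracefully.
 * Returns true only if the object had been tracked.
 */
static bool track_free(struct tracker_sketch *t, const void *p)
{
	if (!p)
		return false;
	for (size_t i = 0; i < TRACK_MAX; i++) {
		if (t->objs[i] == p) {
			t->objs[i] = NULL;
			return true;
		}
	}
	t->nr_untracked_frees++;
	return false;
}
```

With this shape, a single free routine can serve both tracked and untracked objects, which is the property the commit message relies on.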

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20260210044642.139482-2-harry.yoo@oracle.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-10 11:39:30 +01:00
Harry Yoo
a1e244a9f1 mm/slab: use prandom if !allow_spin
When CONFIG_SLAB_FREELIST_RANDOM is enabled and get_random_u32()
is called in an NMI context, lockdep complains because it acquires
a local_lock:

  ================================
  WARNING: inconsistent lock state
  6.19.0-rc5-slab-for-next+ #325 Tainted: G                 N
  --------------------------------
  inconsistent {INITIAL USE} -> {IN-NMI} usage.
  kunit_try_catch/8312 [HC2[2]:SC0[0]:HE0:SE1] takes:
  ffff88a02ec49cc0 (batched_entropy_u32.lock){-.-.}-{3:3}, at: get_random_u32+0x7f/0x2e0
  {INITIAL USE} state was registered at:
    lock_acquire+0xd9/0x2f0
    get_random_u32+0x93/0x2e0
    __get_random_u32_below+0x17/0x70
    cache_random_seq_create+0x121/0x1c0
    init_cache_random_seq+0x5d/0x110
    do_kmem_cache_create+0x1e0/0xa30
    __kmem_cache_create_args+0x4ec/0x830
    create_kmalloc_caches+0xe6/0x130
    kmem_cache_init+0x1b1/0x660
    mm_core_init+0x1d8/0x4b0
    start_kernel+0x620/0xcd0
    x86_64_start_reservations+0x18/0x30
    x86_64_start_kernel+0xf3/0x140
    common_startup_64+0x13e/0x148
  irq event stamp: 76
  hardirqs last  enabled at (75): [<ffffffff8298b77a>] exc_nmi+0x11a/0x240
  hardirqs last disabled at (76): [<ffffffff8298b991>] sysvec_irq_work+0x11/0x110
  softirqs last  enabled at (0): [<ffffffff813b2dda>] copy_process+0xc7a/0x2350
  softirqs last disabled at (0): [<0000000000000000>] 0x0

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(batched_entropy_u32.lock);
    <Interrupt>
      lock(batched_entropy_u32.lock);

   *** DEADLOCK ***

Fix this by using pseudo-random number generator if !allow_spin.
This means kmalloc_nolock() users won't get truly random numbers,
but there is not much we can do about it.

Note that an NMI handler might interrupt prandom_u32_state() and
change the random state, but that's safe.
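
As a rough userspace illustration of the idea (not the slab code), the sketch below picks the random source based on an `allow_spin` argument. It assumes a plain xorshift32 generator in place of the kernel's pseudo-random state helper, and `locked_random_sketch()` is a hypothetical stand-in for a lock-taking source such as get_random_u32(); it returns a fixed value rather than real entropy.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical PRNG state; the seed must be non-zero for xorshift32. */
struct prng_sketch {
	uint32_t s;
};

/* xorshift32: takes no lock, so it is usable when spinning is not allowed. */
static uint32_t prng_sketch_next(struct prng_sketch *st)
{
	uint32_t x = st->s;

	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	st->s = x;
	return x;
}

/* Counts calls into the stand-in for the lock-taking random source. */
static unsigned int locked_source_calls;

/* Stand-in for a lock-taking source; fixed value, not real entropy. */
static uint32_t locked_random_sketch(void)
{
	locked_source_calls++;
	return 0x9e3779b9u;
}

/* Return a value below ceil (ceil must be non-zero). */
static uint32_t random_below_sketch(struct prng_sketch *st, bool allow_spin,
				    uint32_t ceil)
{
	uint32_t r = allow_spin ? locked_random_sketch()
				: prng_sketch_next(st);

	return r % ceil;
}
```

The trade-off matches the one stated above: callers that cannot spin get weaker randomness, but never touch the lock.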

Link: https://lore.kernel.org/all/0c33bdee-6de8-4d9f-92ca-4f72c1b6fb9f@suse.cz
Fixes: af92793e52 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Cc: stable@vger.kernel.org
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20260210081900.329447-3-harry.yoo@oracle.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-10 10:55:32 +01:00
Harry Yoo
144080a582 mm/slab: do not access current->mems_allowed_seq if !allow_spin
Lockdep complains when get_from_any_partial() is called in an NMI
context, because current->mems_allowed_seq is seqcount_spinlock_t and
not NMI-safe:

  ================================
  WARNING: inconsistent lock state
  6.19.0-rc5-kfree-rcu+ #315 Tainted: G                 N
  --------------------------------
  inconsistent {INITIAL USE} -> {IN-NMI} usage.
  kunit_try_catch/9989 [HC1[1]:SC0[0]:HE0:SE1] takes:
  ffff889085799820 (&____s->seqcount#3){.-.-}-{0:0}, at: ___slab_alloc+0x58f/0xc00
  {INITIAL USE} state was registered at:
    lock_acquire+0x185/0x320
    kernel_init_freeable+0x391/0x1150
    kernel_init+0x1f/0x220
    ret_from_fork+0x736/0x8f0
    ret_from_fork_asm+0x1a/0x30
  irq event stamp: 56
  hardirqs last  enabled at (55): [<ffffffff850a68d7>] _raw_spin_unlock_irq+0x27/0x70
  hardirqs last disabled at (56): [<ffffffff850858ca>] __schedule+0x2a8a/0x6630
  softirqs last  enabled at (0): [<ffffffff81536711>] copy_process+0x1dc1/0x6a10
  softirqs last disabled at (0): [<0000000000000000>] 0x0

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(&____s->seqcount#3);
    <Interrupt>
      lock(&____s->seqcount#3);

   *** DEADLOCK ***

According to Documentation/locking/seqlock.rst, seqcount_t is not
NMI-safe and seqcount_latch_t should be used when read path can interrupt
the write-side critical section. In this case, do not access
current->mems_allowed_seq and avoid retry.
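
A simplified userspace model of "skip the sequence counter and do not retry" might look as follows. It is only a sketch: `struct seq_sketch`, `lookup_sketch()` and the `writer_races` parameter (which simulates concurrent writers bumping the counter) are hypothetical, and no real memory ordering is modelled.

```c
#include <stdbool.h>

/* Hypothetical sequence counter; an even value means no writer is active. */
struct seq_sketch {
	unsigned int sequence;
};

static unsigned int read_begin_sketch(const struct seq_sketch *s)
{
	return s->sequence;
}

static bool read_retry_sketch(const struct seq_sketch *s, unsigned int start)
{
	return s->sequence != start;
}

/*
 * Perform a lookup and return the number of attempts made.
 * *writer_races simulates that many concurrent updates, one per attempt.
 * With allow_spin == false the counter is never read and the lookup is
 * tried exactly once, accepting a possibly stale result.
 */
static int lookup_sketch(struct seq_sketch *s, bool allow_spin,
			 int *writer_races)
{
	unsigned int start = 0;
	int attempts = 0;

	do {
		if (allow_spin)
			start = read_begin_sketch(s);
		attempts++;
		if (*writer_races > 0) {
			(*writer_races)--;
			s->sequence += 2;	/* a full write section passed */
		}
	} while (allow_spin && read_retry_sketch(s, start));

	return attempts;
}
```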

Fixes: af92793e52 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Cc: stable@vger.kernel.org
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20260210081900.329447-2-harry.yoo@oracle.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-10 10:55:31 +01:00
Vlastimil Babka
815c8e3551 Merge branch 'slab/for-7.0/sheaves' into slab/for-next
Merge series "slab: replace cpu (partial) slabs with sheaves".

The percpu sheaves caching layer was introduced as opt-in but the goal
was to eventually move all caches to them. This is the next step,
enabling sheaves for all caches (except the two bootstrap ones) and then
removing the per cpu (partial) slabs and lots of associated code.

Besides the lower locking overhead and much more likely fastpath when
freeing, this removes the rather complicated code related to the cpu
slab lockless fastpaths (using this_cpu_try_cmpxchg128/64) and all its
complications for PREEMPT_RT or kmalloc_nolock().

The lockless slab freelist+counters update operation using
try_cmpxchg128/64 remains and is crucial for freeing remote NUMA objects
and to allow flushing objects from sheaves to slabs mostly without the
node list_lock.

Link: https://lore.kernel.org/all/20260123-sheaves-for-all-v4-0-041323d506f7@suse.cz/
2026-02-10 09:10:00 +01:00
Linus Torvalds
d16738a4e7 The kthread code provides an infrastructure which manages the preferred
affinity of unbound kthreads (node or custom cpumask) against
 housekeeping (CPU isolation) constraints and CPU hotplug events.
 
 One crucial missing piece is the handling of cpuset: when an isolated
 partition is created, deleted, or its CPUs updated, all the unbound
 kthreads in the top cpuset become indifferently affine to _all_ the
 non-isolated CPUs, possibly breaking their preferred affinity along
 the way.
 
 Solve this by moving the kthreads affinity update from cpuset into the
 consolidated kthread code, so that preferred affinities are honoured
 and applied against the updated cpuset isolated partitions.
 
 The dispatch of the new isolated cpumasks to timers, workqueues and
 kthreads is performed by housekeeping, as per Tejun's nice
 suggestion.
 
 As a welcome side effect, HK_TYPE_DOMAIN then integrates both the set
 from boot defined domain isolation (through isolcpus=) and cpuset
 isolated partitions. Housekeeping cpumasks are now modifiable with a
 specific RCU based synchronization. A big step toward making nohz_full=
 also mutable through cpuset in the future.
 -----BEGIN PGP SIGNATURE-----
 
 iQJPBAABCAA5FiEEd76+gtGM8MbftQlOhSRUR1COjHcFAmmE0mYbFIAAAAAABAAO
 bWFudTIsMi41KzEuMTEsMiwyAAoJEIUkVEdQjox36eMP/0Ls/ArfYVi/MNAXWlpy
 rAt6m9Y/X9GBcDM/VI9BXq1ZX4qEr2XjJ8UUb8cM08uHEAt0ErlmpRxREwJFrKbI
 H4jzg5EwO0D0c6MnvgQJEAwkHxQVIjsxG9DovRIjxyW4ycx3aSsRg/f2VKyWoLvY
 7ZT7CbLFE+I/MQh2ZgUu/9pnCDQVR2anss2WYIej5mmgFL5pyEv3YvYgKYVyK08z
 sXyNxpP976g2d9ECJ9OtFJV9we6mlqxlG0MVCiv/Uxh7DBjxWWPsLvlmLAXggQ03
 +0GW+nnutDaKz83pgS7Z4zum/+Oa+I1dTLIN27pARUNcMCYip7njM2KNpJwPdov3
 +fAIODH2JVX1xewT+U1cCq6gdI55ejbwdQYGFV075dKBUxKQeIyrghvfC3Ga6aKQ
 Gw3y68jdrXOw6iyfHR5k/0Mnu2/FDKUW2fZxLKm55PvNZP5jQFmSlz9wyiwwyb3m
 UUSgThj6Ozodxks8hDX41rGVezCcm1ni+qNSiNIs8HPaaZQrwbnvKHQFBBJHQzJP
 rJ39VWBx3Hq/ly71BOR6pCzoZsfS1f85YKhJ4vsfjLO6BfhI16nBat89eROSRKcz
 XptyWqW0PgAD0teDuMCTPNuUym/viBHALXHKuSO12CIizacvftiGcmaQNPlLiiFZ
 /Dr2+aOhwYw3UD6djn3u94M9
 =nWGh
 -----END PGP SIGNATURE-----

Merge tag 'kthread-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks

Pull kthread updates from Frederic Weisbecker:
 "The kthread code provides an infrastructure which manages the
  preferred affinity of unbound kthreads (node or custom cpumask)
  against housekeeping (CPU isolation) constraints and CPU hotplug
  events.

  One crucial missing piece is the handling of cpuset: when an isolated
  partition is created, deleted, or its CPUs updated, all the unbound
  kthreads in the top cpuset become indifferently affine to _all_ the
  non-isolated CPUs, possibly breaking their preferred affinity along
  the way.

  Solve this by moving the kthreads affinity update from cpuset into the
  consolidated kthread code, so that preferred affinities are honoured
  and applied against the updated cpuset isolated partitions.

  The dispatch of the new isolated cpumasks to timers, workqueues and
  kthreads is performed by housekeeping, as per Tejun's nice
  suggestion.

  As a welcome side effect, HK_TYPE_DOMAIN then integrates both the set
  from boot defined domain isolation (through isolcpus=) and cpuset
  isolated partitions. Housekeeping cpumasks are now modifiable with a
  specific RCU based synchronization. A big step toward making
  nohz_full= also mutable through cpuset in the future"
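
One way to picture the combined HK_TYPE_DOMAIN set is as simple mask arithmetic. The sketch below is an assumption-laden illustration, not the housekeeping code: it models a cpumask as a 64-bit word, and `cpumask_sketch` and `hk_domain_sketch()` are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical cpumask: bit n set means CPU n is in the set (max 64 CPUs). */
typedef uint64_t cpumask_sketch;

/*
 * Housekeeping CPUs for domain purposes: every possible CPU that is
 * isolated neither at boot (isolcpus=) nor by a cpuset isolated partition.
 */
static cpumask_sketch hk_domain_sketch(cpumask_sketch possible,
				       cpumask_sketch boot_isolated,
				       cpumask_sketch cpuset_isolated)
{
	return possible & ~(boot_isolated | cpuset_isolated);
}
```

For example, with CPUs 0-7 possible, CPUs 2-3 isolated at boot and CPUs 4-5 in an isolated partition, the result covers CPUs 0, 1, 6 and 7.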

* tag 'kthread-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: (33 commits)
  doc: Add housekeeping documentation
  kthread: Document kthread_affine_preferred()
  kthread: Comment on the purpose and placement of kthread_affine_node() call
  kthread: Honour kthreads preferred affinity after cpuset changes
  sched/arm64: Move fallback task cpumask to HK_TYPE_DOMAIN
  sched: Switch the fallback task allowed cpumask to HK_TYPE_DOMAIN
  kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management
  kthread: Include kthreadd to the managed affinity list
  kthread: Include unbound kthreads in the managed affinity list
  kthread: Refine naming of affinity related fields
  PCI: Remove superfluous HK_TYPE_WQ check
  sched/isolation: Remove HK_TYPE_TICK test from cpu_is_isolated()
  cpuset: Remove cpuset_cpu_is_isolated()
  timers/migration: Remove superfluous cpuset isolation test
  cpuset: Propagate cpuset isolation update to timers through housekeeping
  cpuset: Propagate cpuset isolation update to workqueue through housekeeping
  PCI: Flush PCI probe workqueue on cpuset isolated partition change
  sched/isolation: Flush vmstat workqueues on cpuset isolated partition change
  sched/isolation: Flush memcg workqueues on cpuset isolated partition change
  cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset
  ...
2026-02-09 19:57:30 -08:00
Linus Torvalds
26c9342bb7 struct filename series
[mostly] sanitize struct filename handling
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaYlcJgAKCRBZ7Krx/gZQ
 6xlKAP9c9J13sJ/mcobsj1Ov7nSHISNbnYqvRRCu09Wq3UQvJgEApNQYOEdLtpff
 zUnWOAQ0nOKY7w9VMLkRRustXpuGjAc=
 =Fld4
 -----END PGP SIGNATURE-----

Merge tag 'pull-filename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs 'struct filename' updates from Al Viro:
 "[Mostly] sanitize struct filename handling"

* tag 'pull-filename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (68 commits)
  sysfs(2): fs_index() argument is _not_ a pathname
  alpha: switch osf_mount() to strndup_user()
  ksmbd: use CLASS(filename_kernel)
  mqueue: switch to CLASS(filename)
  user_statfs(): switch to CLASS(filename)
  statx: switch to CLASS(filename_maybe_null)
  quotactl_block(): switch to CLASS(filename)
  chroot(2): switch to CLASS(filename)
  move_mount(2): switch to CLASS(filename_maybe_null)
  namei.c: switch user pathname imports to CLASS(filename{,_flags})
  namei.c: convert getname_kernel() callers to CLASS(filename_kernel)
  do_f{chmod,chown,access}at(): use CLASS(filename_uflags)
  do_readlinkat(): switch to CLASS(filename_flags)
  do_sys_truncate(): switch to CLASS(filename)
  do_utimes_path(): switch to CLASS(filename_uflags)
  chdir(2): unspaghettify a bit...
  do_fchownat(): unspaghettify a bit...
  fspick(2): use CLASS(filename_flags)
  name_to_handle_at(): use CLASS(filename_uflags)
  vfs_open_tree(): use CLASS(filename_uflags)
  ...
2026-02-09 16:58:28 -08:00
Linus Torvalds
aa2a0fcd4c vfs-7.0-rc1.leases
Please consider pulling these changes from the signed vfs-7.0-rc1.leases tag.
 
 Thanks!
 Christian
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaYX49gAKCRCRxhvAZXjc
 olR/AP40iNOTRn7LosXbRWqGGZqzy9v64QYoLzk3QdsWuGmbRAD/egNQzof8mkAf
 IscefWTOjY7xyDzmEBEBnfHftgMiEwM=
 =zre0
 -----END PGP SIGNATURE-----

Merge tag 'vfs-7.0-rc1.leases' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs lease updates from Christian Brauner:
 "This contains updates for lease support to require filesystems to
  explicitly opt-in to lease support

  Currently kernel_setlease() falls through to generic_setlease() when a
  filesystem does not define ->setlease(), silently granting lease
  support to every filesystem regardless of whether it is prepared for
  it.

  This is a poor default: most filesystems never intended to support
  leases, and the silent fallthrough makes it impossible to distinguish
  "supports leases" from "never thought about it".

  This inverts the default. It adds explicit

	.setlease = generic_setlease,

  assignments to every in-tree filesystem that should retain lease
  support, then changes kernel_setlease() to return -EINVAL when
  ->setlease is NULL.

  With the new default in place, simple_nosetlease() is redundant and
  is removed along with all references to it"
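
The inverted default can be modelled in a few lines of userspace C. This is a sketch under stated assumptions, not the VFS code: the `*_sketch` types and functions are hypothetical stand-ins, and the real operations take different arguments.

```c
#include <errno.h>
#include <stddef.h>

struct file_sketch;

/* Hypothetical operations table with an optional setlease hook. */
struct file_ops_sketch {
	int (*setlease)(struct file_sketch *f, int arg);
};

struct file_sketch {
	const struct file_ops_sketch *ops;
	int lease;
};

/* Stand-in for the generic implementation a filesystem opts in to. */
static int generic_setlease_sketch(struct file_sketch *f, int arg)
{
	f->lease = arg;
	return 0;
}

/* New default: no hook means no lease support, reported as -EINVAL. */
static int kernel_setlease_sketch(struct file_sketch *f, int arg)
{
	if (!f->ops->setlease)
		return -EINVAL;
	return f->ops->setlease(f, arg);
}
```

A filesystem that wants leases sets the hook explicitly; one that leaves it NULL now fails the request instead of silently inheriting the generic behaviour.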

* tag 'vfs-7.0-rc1.leases' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (25 commits)
  fuse: add setlease file operation
  fs: remove simple_nosetlease()
  filelock: default to returning -EINVAL when ->setlease operation is NULL
  xfs: add setlease file operation
  ufs: add setlease file operation
  udf: add setlease file operation
  tmpfs: add setlease file operation
  squashfs: add setlease file operation
  overlayfs: add setlease file operation
  orangefs: add setlease file operation
  ocfs2: add setlease file operation
  ntfs3: add setlease file operation
  nilfs2: add setlease file operation
  jfs: add setlease file operation
  jffs2: add setlease file operation
  gfs2: add a setlease file operation
  fat: add setlease file operation
  f2fs: add setlease file operation
  exfat: add setlease file operation
  ext4: add setlease file operation
  ...
2026-02-09 11:59:07 -08:00
David Hildenbrand (Red Hat)
fb4ddf2085 mm/memory: handle non-split locks correctly in zap_empty_pte_table()
While we handle pte_lockptr() == pmd_lockptr() correctly in
zap_pte_table_if_empty(), we don't handle it in zap_empty_pte_table(),
making the spin_trylock() always fail and forcing us onto the slow path.

So let's handle the scenario where pte_lockptr() == pmd_lockptr() better,
which can only happen if CONFIG_SPLIT_PTE_PTLOCKS is not set.

This is only relevant once we unlock CONFIG_PT_RECLAIM on architectures
that are not x86-64.
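
Why the trylock always fails, and one way to handle the shared-lock case, can be shown with a tiny userspace model. It assumes a C11 `atomic_flag` as a stand-in for a spinlock; `struct lock_sketch`, `lock_pte_sketch()` and `unlock_pte_sketch()` are hypothetical names, and the caller is assumed to already hold the PMD lock.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical non-recursive lock built on a C11 atomic flag. */
struct lock_sketch {
	atomic_flag held;
};

static bool trylock_sketch(struct lock_sketch *l)
{
	return !atomic_flag_test_and_set(&l->held);
}

static void unlock_sketch(struct lock_sketch *l)
{
	atomic_flag_clear(&l->held);
}

/*
 * Caller holds pmd_lock. If the PTE table shares that lock, it is already
 * held and a trylock would fail, so report success without taking it again.
 */
static bool lock_pte_sketch(struct lock_sketch *pmd_lock,
			    struct lock_sketch *pte_lock)
{
	if (pte_lock == pmd_lock)
		return true;
	return trylock_sketch(pte_lock);
}

/* Release the PTE lock only if it is distinct from the PMD lock. */
static void unlock_pte_sketch(struct lock_sketch *pmd_lock,
			      struct lock_sketch *pte_lock)
{
	if (pte_lock != pmd_lock)
		unlock_sketch(pte_lock);
}
```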

Link: https://lkml.kernel.org/r/20260119220708.3438514-3-david@kernel.org
Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-06 15:47:19 -08:00
David Hildenbrand (Red Hat)
4c640eb418 mm: move pte table reclaim code to memory.c
Some cleanups for PT table reclaim code, triggered by a false-positive
warning we might start to see soon after we unlocked pt-reclaim on
architectures besides x86-64.


This patch (of 2):

The pte-table reclaim code is only called from memory.c, while zapping
pages, and it better also stays that way in the long run.  If we ever have
to call it from other files, we should expose proper high-level helpers
for zapping if the existing helpers are not good enough.

So, let's move the code over (it's not a lot) and slightly clean it up a
bit by:
- Renaming the functions.
- Dropping the "Check if it is empty PTE page" comment, which is now
  self-explaining given the function name.
- Making zap_pte_table_if_empty() return whether zapping worked so the
  caller can free it.
- Adding a comment in pte_table_reclaim_possible().
- Inlining free_pte() in the last remaining user.
- In zap_empty_pte_table(), switch from pmdp_get_lockless() to
  pmdp_get(), as we are holding the PMD PT lock.

By moving the code over, compilers can also easily figure out when
zap_empty_pte_table() does not initialize the pmdval variable, avoiding
false-positive warnings about the variable possibly not being initialized.

Link: https://lkml.kernel.org/r/20260119220708.3438514-1-david@kernel.org
Link: https://lkml.kernel.org/r/20260119220708.3438514-2-david@kernel.org
Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-06 15:47:19 -08:00
Qi Zheng
9c8c02df3f mm: make PT_RECLAIM depends on MMU_GATHER_RCU_TABLE_FREE
PT_RECLAIM can work on all architectures that support
MMU_GATHER_RCU_TABLE_FREE, except for those that have selected
HAVE_ARCH_TLB_REMOVE_TABLE, so make PT_RECLAIM depend on
MMU_GATHER_RCU_TABLE_FREE && !HAVE_ARCH_TLB_REMOVE_TABLE.

BTW, change PT_RECLAIM to be enabled by default, since nobody should want
to turn it off.

Link: https://lkml.kernel.org/r/83b034810935a9ff18e425b085e065bb0acb28f3.1769515122.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Magnus Lindholm <linmag7@gmail.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-06 15:47:19 -08:00
Qi Zheng
086498aed3 mm: convert __HAVE_ARCH_TLB_REMOVE_TABLE to CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config
For architectures that define __HAVE_ARCH_TLB_REMOVE_TABLE, the page
tables at the pmd/pud level are generally not of struct ptdesc type, and
do not have pt_rcu_head member, thus these architectures cannot support
PT_RECLAIM.

In preparation for enabling PT_RECLAIM on more architectures, convert
__HAVE_ARCH_TLB_REMOVE_TABLE to CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config,
so that we can make conditional judgments in Kconfig.

Link: https://lkml.kernel.org/r/5ebfa3d4b56e63c6906bda5eccaa9f7194d3a86b.1769515122.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Tested-by: Andreas Larsson <andreas@gaisler.com>	[sparc, UP&SMP]
Acked-by: Andreas Larsson <andreas@gaisler.com>		[sparc]
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Magnus Lindholm <linmag7@gmail.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-06 15:47:19 -08:00
Qi Zheng
d8b65654b1 mm: change mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h
Patch series "enable PT_RECLAIM on more 64-bit architectures", v4.

This series aims to enable PT_RECLAIM on more 64-bit architectures.

On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of
empty PTE page table pages (such as 100GB+).  To resolve this problem, we
need to enable PT_RECLAIM, which depends on MMU_GATHER_RCU_TABLE_FREE.

Architectures that define their own __tlb_remove_table() cannot support
PT_RECLAIM, since their page tables are not of type struct ptdesc.

Therefore, this series first enables MMU_GATHER_RCU_TABLE_FREE on all
64-bit architectures, then converts __HAVE_ARCH_TLB_REMOVE_TABLE to
CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config, and finally makes PT_RECLAIM
depend on MMU_GATHER_RCU_TABLE_FREE && !HAVE_ARCH_TLB_REMOVE_TABLE.  This
way, PT_RECLAIM can be enabled by default on most 64-bit architectures.

Of course, this will also be enabled on some 32-bit architectures that
already support MMU_GATHER_RCU_TABLE_FREE.  That's fine, PT_RECLAIM works
well on all 32-bit architectures as well.  Although the benefit isn't
significant, there's still memory that can be reclaimed.  Perhaps
PT_RECLAIM can be enabled on all 32-bit architectures in the future.


This patch (of 8):

Generally, the asm/tlb.h will include asm-generic/tlb.h, so change
mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h.  This is a
preparation for enabling CONFIG_PT_RECLAIM on other architectures, such as
alpha.

Link: https://lkml.kernel.org/r/cover.1769515122.git.zhengqi.arch@bytedance.com
Link: https://lkml.kernel.org/r/befca537d10c6bf8d531b1ee0a8af1e3b31352b0.1769515122.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Magnus Lindholm <linmag7@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: WANG Xuerui <kernel@xen0n.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-06 15:47:17 -08:00
Li RongQing
06f5ff36e4 mm/damon/stat: remove __read_mostly from memory_idle_ms_percentiles
The 'memory_idle_ms_percentiles' array in DAMON_STAT is updated frequently
by the kernel to reflect the latest idle time statistics.  Marking it as
'__read_mostly' is inappropriate for data that is regularly written to, as
it can lead to cache pollution in the read-mostly section.

Remove the '__read_mostly' annotation to accurately reflect the
variable's usage pattern.

Link: https://lkml.kernel.org/r/20260130085603.1814-1-lirongqing@baidu.com
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-06 15:47:17 -08:00
Sergey Senozhatsky
3881b00a2c zsmalloc: make common caches global
Currently, zsmalloc creates kmem_cache of handles and zspages for each
pool, which may be suboptimal from the memory usage point of view (extra
internal fragmentation per pool).  Systems that create multiple zsmalloc
pools may benefit from shared common zsmalloc caches.

Make handles and zspages kmem caches global.  The memory savings depend on
particular setup and data patterns and can be found via slabinfo.

Link: https://lkml.kernel.org/r/20260117025406.799428-1-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-06 15:47:17 -08:00
Tim Bird
ef24e0aa07 mm: add SPDX id lines to some mm source files
Some of the memory management source files are missing
SPDX-License-Identifier lines.  Add appropriate IDs
to these files (mostly GPL-2.0, but one LGPL-2.1).

Link: https://lkml.kernel.org/r/20260204213101.1754183-1-tim.bird@sony.com
Signed-off-by: Tim Bird <tim.bird@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-06 15:47:16 -08:00
Sahil Chandna
c69ca4e992 mm/zswap: use %pe to print error pointers
Use the %pe printk format specifier to report error pointers directly
instead of printing PTR_ERR() as a long value.  This improves clarity and
produces more readable error messages.

This instance was flagged by the Coccinelle script
(misc/ptr_err_to_pe.cocci) as an opportunity to adopt %pe.

Found by: make coccicheck MODE=report M=mm/
No functional change intended.
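
To illustrate the difference in output, here is a minimal userspace
sketch.  It is not kernel code: the sketch_*() helpers are hypothetical
stand-ins for ERR_PTR()/PTR_ERR()/IS_ERR() and for the symbolic-name
formatting that %pe can provide, and the errno table only covers three
values chosen for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer (models ERR_PTR()). */
void *sketch_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* Decode the errno value again (models PTR_ERR()). */
long sketch_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Error pointers occupy the top SKETCH_MAX_ERRNO addresses (models IS_ERR()). */
bool sketch_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)(-SKETCH_MAX_ERRNO);
}

/* Old style: print the raw number, as "%ld" with PTR_ERR() would. */
int sketch_format_long(char *buf, size_t len, const void *ptr)
{
	return snprintf(buf, len, "%ld", sketch_ptr_err(ptr));
}

/* %pe-like style: print the symbolic name where this sketch knows it. */
int sketch_format_pe(char *buf, size_t len, const void *ptr)
{
	if (!sketch_is_err(ptr))
		return snprintf(buf, len, "%p", ptr);

	switch (-sketch_ptr_err(ptr)) {
	case ENOMEM:
		return snprintf(buf, len, "-ENOMEM");
	case EINVAL:
		return snprintf(buf, len, "-EINVAL");
	case ENOENT:
		return snprintf(buf, len, "-ENOENT");
	default:
		/* Unknown to this sketch: fall back to the number. */
		return snprintf(buf, len, "%ld", sketch_ptr_err(ptr));
	}
}
```

For sketch_err_ptr(-ENOMEM), the old style yields the bare errno number,
while the %pe-like style yields "-ENOMEM".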

Link: https://lkml.kernel.org/r/581a26f22fb4c6ce04aeb7ee0d703fe64454ac7f.1770230135.git.chandna.sahil@gmail.com
Signed-off-by: Sahil Chandna <chandna.sahil@gmail.com>
Acked-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-06 15:47:16 -08:00
Sahil Chandna
4a8eabc6e4 mm/vmscan: use %pe to print error pointers
Use the %pe printk format specifier to report error pointers directly
instead of printing PTR_ERR() as a long value.  This improves clarity and
produces more readable error messages.

This instance was flagged by the Coccinelle script
(misc/ptr_err_to_pe.cocci) as an opportunity to adopt %pe.

Found by: make coccicheck MODE=report M=mm/
No functional change intended.

Link: https://lkml.kernel.org/r/80a6643657a60e75ddf48b4869b3e7fdc101f855.1770230135.git.chandna.sahil@gmail.com
Signed-off-by: Sahil Chandna <chandna.sahil@gmail.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-06 15:47:16 -08:00
Wilson Zeng
ad1e0c44a4 mm/readahead: fix typo in comment
Fix a typo in a comment: max_readhead -> max_readahead.

Link: https://lkml.kernel.org/r/20260127152535.321951-1-cheng20011202@gmail.com
Signed-off-by: Wilson Zeng <cheng20011202@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-06 15:47:16 -08:00
Shakeel Butt
4188b2592f mm: khugepaged: fix NR_FILE_PAGES and NR_SHMEM in collapse_file()
In META's fleet, we observed high-level cgroups showing zero file memcg
stats while their descendants had non-zero values.  Investigation using
drgn revealed that these parent cgroups actually had negative file stats,
aggregated from their children.

This issue became more frequent after deploying thp-always more widely,
pointing to a correlation with THP file collapsing.  The root cause is
that collapse_file() assumes old folios and the new THP belong to the same
node and memcg.  When this assumption breaks, stats become skewed.  The
bug affects not just memcg stats but also per-numa stats, and not just
NR_FILE_PAGES but also NR_SHMEM.

The assumption breaks in scenarios such as:

1. Small folios allocated on one node while the THP gets allocated on a
   different node.

2. A package downloader running in one cgroup populates the page cache,
   while a job in a different cgroup executes the downloaded binary.

3. A file shared between processes in different cgroups, where one
   process faults in the pages and khugepaged (or madvise(COLLAPSE))
   collapses them on behalf of the other.

Fix the accounting by explicitly incrementing stats for the new THP and
decrementing stats for the old folios being replaced.
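
The skew can be reproduced with a toy per-node counter.  This is a
userspace sketch under stated assumptions, not the collapse_file() code:
the "old" scheme shown (only the previously absent pages are added, on the
THP's node) is one reading of the same-node assumption described above,
THP_PAGES assumes 4 KiB base pages and 2 MiB THPs, and all names are
hypothetical.

```c
#include <assert.h>

#define NR_NODES  2
#define THP_PAGES 512	/* assumed: 2 MiB THP made of 4 KiB pages */

/* Toy stand-in for a per-node NR_FILE_PAGES counter. */
struct toy_stats {
	long nr_file_pages[NR_NODES];
};

/* Old scheme (as assumed here): only holes are charged, on the THP's node. */
void collapse_account_old(struct toy_stats *s, int new_node, long nr_none)
{
	s->nr_file_pages[new_node] += nr_none;
}

/*
 * Fixed scheme: charge the whole THP to its own node and uncharge every
 * old small folio from the node it was actually counted on.
 */
void collapse_account_new(struct toy_stats *s, int new_node,
			  const int *old_nodes, long nr_old)
{
	long i;

	s->nr_file_pages[new_node] += THP_PAGES;
	for (i = 0; i < nr_old; i++)
		s->nr_file_pages[old_nodes[i]] -= 1;
}

/* Evicting the THP later always uncharges all of its pages from its node. */
void evict_thp(struct toy_stats *s, int node)
{
	s->nr_file_pages[node] -= THP_PAGES;
}
```

Starting with 512 small folios counted on node 0 and a THP allocated on
node 1, the old scheme followed by eviction leaves node 0 at 512 and
node 1 at -512, whereas the fixed scheme returns both counters to zero.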

Link: https://lkml.kernel.org/r/20260130042925.2797946-1-shakeel.butt@linux.dev
Fixes: f3f0e1d215 ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Kiryl Shutsemau <kas@kernel.org>
Acked-by: David Hildenbrand (arm) <david@kernel.org>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-06 15:47:15 -08:00
Justin Green
cc5cbf37ce mm: refactor vma_map_pages to use vm_insert_pages
vma_map_pages currently calls vm_insert_page on each individual page in
the mapping, which creates significant overhead because we are repeatedly
spinlocking.  Instead, we should batch insert pages using vm_insert_pages,
which amortizes the cost of the spinlock.
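
The amortization argument can be sketched with a toy model that only
counts lock round trips.  The structure and function names are
hypothetical, and the batch size is an arbitrary parameter of the sketch;
it does not reflect how vm_insert_pages() actually sizes its batches.

```c
#include <assert.h>

/* Toy page table that only counts lock round trips and mapped pages. */
struct toy_pt {
	unsigned long lock_round_trips;
	unsigned long mapped;
};

/* Per-page insertion: one lock/unlock pair for every page. */
void toy_insert_each(struct toy_pt *pt, unsigned long nr_pages)
{
	unsigned long i;

	for (i = 0; i < nr_pages; i++) {
		pt->lock_round_trips++;
		pt->mapped++;
	}
}

/* Batched insertion: one lock/unlock pair per batch of up to 'batch' pages. */
void toy_insert_batched(struct toy_pt *pt, unsigned long nr_pages,
			unsigned long batch)
{
	assert(batch > 0);

	while (nr_pages) {
		unsigned long n = nr_pages < batch ? nr_pages : batch;

		pt->lock_round_trips++;
		pt->mapped += n;
		nr_pages -= n;
	}
}
```

Mapping 1000 pages costs 1000 round trips one page at a time, but only
two with a batch size of 512.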

Tested through watching hardware accelerated video on a MTK ChromeOS
device.  This particular path maps both a V4L2 buffer and a GEM allocated
buffer into userspace and converts the contents from one pixel format to
another.  Both vb2_mmap() and mtk_gem_object_mmap() exercise this pathway.

Link: https://lkml.kernel.org/r/20260128225648.2938636-1-greenjustin@chromium.org
Signed-off-by: Justin Green <greenjustin@chromium.org>
Acked-by: Brian Geffon <bgeffon@google.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Arjun Roy <arjunroy@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-06 15:47:15 -08:00
Enze Li
9a2791e748 mm/damon: unify address range representation with damon_addr_range
Currently, DAMON defines two identical structures for representing address
ranges: damon_system_ram_region and damon_addr_range.  Both structures
share the same semantic interpretation of a half-open interval [start,
end), where the start address is inclusive and the end address is
exclusive.
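
As a reminder of what the half-open convention implies, here is a small
userspace sketch.  The struct and helper names are hypothetical and only
mirror the [start, end) semantics described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Half-open address range [start, end): start inclusive, end exclusive. */
struct addr_range_sketch {
	unsigned long start;
	unsigned long end;
};

/* Size needs no "+ 1" because end is exclusive. */
unsigned long range_size(const struct addr_range_sketch *r)
{
	return r->end - r->start;
}

/* An address equal to end is NOT part of the range. */
bool range_contains(const struct addr_range_sketch *r, unsigned long addr)
{
	return r->start <= addr && addr < r->end;
}

/* Two ranges touch without overlapping when one ends where the other starts. */
bool ranges_adjacent(const struct addr_range_sketch *a,
		     const struct addr_range_sketch *b)
{
	return a->end == b->start;
}
```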

This duplication adds unnecessary redundancy and increases maintenance
overhead.  This patch replaces all uses of damon_system_ram_region with
the more generic damon_addr_range structure, ensuring a unified type
representation for address ranges within the DAMON subsystem.  The change
simplifies the codebase, improves readability, and avoids potential
inconsistencies in future modifications.

Link: https://lkml.kernel.org/r/20260129100845.281734-1-lienze@kylinos.cn
Signed-off-by: Enze Li <lienze@kylinos.cn>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-06 15:47:15 -08:00
Thorsten Blum
ad789a85b1 mm/cma: replace snprintf with strscpy in cma_new_area
Replace snprintf("%s", ...) with the faster and more direct strscpy().
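
For readers unfamiliar with strscpy(), a minimal userspace sketch of its
contract follows.  strscpy_sketch() is a hypothetical re-implementation
written for illustration, not the kernel's code: it always
NUL-terminates the destination (when size is non-zero), returns the
number of characters copied, and returns -E2BIG on truncation or when
size is zero.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Bounded string copy with strscpy()-like return values (sketch only). */
long strscpy_sketch(char *dst, const char *src, size_t size)
{
	size_t i;

	if (size == 0)
		return -E2BIG;

	/* Copy at most size - 1 characters, leaving room for the NUL. */
	for (i = 0; i < size - 1 && src[i] != '\0'; i++)
		dst[i] = src[i];
	dst[i] = '\0';

	/* If the source has more characters left, the copy was truncated. */
	return src[i] == '\0' ? (long)i : -E2BIG;
}
```

Unlike snprintf("%s", ...), no format string has to be parsed, and
truncation is reported directly through the return value.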

Link: https://lkml.kernel.org/r/20260126174516.236968-1-thorsten.blum@linux.dev
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-06 15:47:15 -08:00
Yosry Ahmed
e2c3b6b21c mm: zswap: use SG list decompression APIs from zsmalloc
Use the new zs_obj_read_sg_*() APIs in zswap_decompress(), instead of
zs_obj_read_*() APIs returning a linear address.  The SG list is passed
directly to the crypto API, simplifying the logic and dropping the
workaround that copies highmem addresses to a buffer.  The crypto API
should internally linearize the SG list if needed.

This avoids the memcpy() in zsmalloc for objects spanning multiple pages,
although an equivalent operation will be done internally by acomp/scomp. 
However, in the future compression algorithms could support handling
discontiguous SG lists, completely eliminating the copying for spanning
objects.

Zsmalloc fills an SG list up to 2 entries in size, so change the input SG
list to fit 2 entries.

Update the incompressible entries path to use memcpy_from_sglist() to copy
the data to the folio.  Opportunistically set dlen to PAGE_SIZE in the
same code path (rather than at the top of the function) to make it
clearer.

Drop the goto in zswap_compress() as the code now is simple enough for
an if-else statement instead.  Rename 'decomp_ret' to 'ret' and reuse it
to keep the intermediate return value of crypto_acomp_decompress() to keep
line lengths manageable.

No functional change intended.

Link: https://lkml.kernel.org/r/20260121013615.2906368-1-yosry.ahmed@linux.dev
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-06 15:47:14 -08:00
Linus Torvalds
3dc58c9ce1 A couple of late-breaking MM fixes. One against a new-in-this-cycle patch
and the other addresses a locking issue which has been there for 1+ years.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaYZRhQAKCRDdBJ7gKXxA
 jhxUAQCPsCziTG1PTXTY12tt4UIY0lppNXP7ltuddHBwry1RRgEAwMbAZq5tvx1S
 bVnCux9M4uIcBJbWak738Q4wMxcqigg=
 =8OWj
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2026-02-06-12-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull hotfixes from Andrew Morton:
 "A couple of late-breaking MM fixes. One against a new-in-this-cycle
  patch and the other addresses a locking issue which has been there for
  over a year"

* tag 'mm-hotfixes-stable-2026-02-06-12-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm/memory-failure: reject unsupported non-folio compound page
  procfs: avoid fetching build ID while holding VMA lock
2026-02-06 13:07:47 -08:00
Linus Torvalds
5ca98c22b5 slab fix for 6.19-rc8
-----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCAA5FiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmmFq/4bFIAAAAAABAAO
 bWFudTIsMi41KzEuMTEsMiwyAAoJELvgsHXSRYiajNsH/Rwx9KuJepaUyCIPHNNK
 +i7+uAhro222mXNRxz0JtC/uI+Fal/hD9EpnQ9pKfrrTJ7dBoZCP7oMekZXhZkto
 TzYLb7iSzUIdnvu4e3gcO5H88LCLr0gMLJwY+U1Uk51O5yYsl9LY+B5oC/foi0JV
 pVAjZofupRLUbu7aPzrz///ia0UYf8TmKZaGrnMa1d/kiK3tWl1G0Tvhph7e4Lgj
 AF4X3IFf3LsxhKEfkyqyNAXuAFBCB91Hm1wiqf5SDeBGXl5OIylVfj82aV/ZO5TI
 wjXrUBo/nP5y3xMzPvJfnZT/++aqJFDSC9eIxSelYxpqm5xlNS54Zeo5fFpasHPS
 Q+c=
 =MAii
 -----END PGP SIGNATURE-----

Merge tag 'slab-for-6.19-rc8-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab

Pull slab fix from Vlastimil Babka:
 "A stable fix for memory allocation profiling tag not being cleared
  when aborting an allocation due to memcg charge failure (Hao Ge)"

* tag 'slab-for-6.19-rc8-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
  mm/slab: Add alloc_tagging_slab_free_hook for memcg_alloc_abort_single
2026-02-06 09:56:03 -08:00
Joerg Roedel
ad09563660 Merge branches 'fixes', 'arm/smmu/updates', 'intel/vt-d', 'amd/amd-vi' and 'core' into next 2026-02-06 11:10:40 +01:00
Hao Li
98e99fc4ad slub: let need_slab_obj_exts() return false if SLAB_NO_OBJ_EXT is set
SLAB_NO_OBJ_EXT is set for boot caches, but need_slab_obj_exts() doesn't
check this flag. We should return false unconditionally when
SLAB_NO_OBJ_EXT is set.

Signed-off-by: Hao Li <hao.li@linux.dev>
Acked-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20260205120709.425719-1-hao.li@linux.dev
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-06 10:39:36 +01:00
Hao Ge
e6c53ead2d mm/slab: Add alloc_tagging_slab_free_hook for memcg_alloc_abort_single
When CONFIG_MEM_ALLOC_PROFILING_DEBUG is enabled, the following warning
may be noticed:

[ 3959.023862] ------------[ cut here ]------------
[ 3959.023891] alloc_tag was not cleared (got tag for lib/xarray.c:378)
[ 3959.023947] WARNING: ./include/linux/alloc_tag.h:155 at alloc_tag_add+0x128/0x178, CPU#6: mkfs.ntfs/113998
[ 3959.023978] Modules linked in: dns_resolver tun brd overlay exfat btrfs blake2b libblake2b xor xor_neon raid6_pq loop sctp ip6_udp_tunnel udp_tunnel ext4 crc16 mbcache jbd2 rfkill sunrpc vfat fat sg fuse nfnetlink sr_mod virtio_gpu cdrom drm_client_lib virtio_dma_buf drm_shmem_helper drm_kms_helper ghash_ce drm sm4 backlight virtio_net net_failover virtio_scsi failover virtio_console virtio_blk virtio_mmio dm_mirror dm_region_hash dm_log dm_multipath dm_mod i2c_dev aes_neon_bs aes_ce_blk [last unloaded: hwpoison_inject]
[ 3959.024170] CPU: 6 UID: 0 PID: 113998 Comm: mkfs.ntfs Kdump: loaded Tainted: G        W           6.19.0-rc7+ #7 PREEMPT(voluntary)
[ 3959.024182] Tainted: [W]=WARN
[ 3959.024186] Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
[ 3959.024192] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 3959.024199] pc : alloc_tag_add+0x128/0x178
[ 3959.024207] lr : alloc_tag_add+0x128/0x178
[ 3959.024214] sp : ffff80008b696d60
[ 3959.024219] x29: ffff80008b696d60 x28: 0000000000000000 x27: 0000000000000240
[ 3959.024232] x26: 0000000000000000 x25: 0000000000000240 x24: ffff800085d17860
[ 3959.024245] x23: 0000000000402800 x22: ffff0000c0012dc0 x21: 00000000000002d0
[ 3959.024257] x20: ffff0000e6ef3318 x19: ffff800085ae0410 x18: 0000000000000000
[ 3959.024269] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 3959.024281] x14: 0000000000000000 x13: 0000000000000001 x12: ffff600064101293
[ 3959.024292] x11: 1fffe00064101292 x10: ffff600064101292 x9 : dfff800000000000
[ 3959.024305] x8 : 00009fff9befed6e x7 : ffff000320809493 x6 : 0000000000000001
[ 3959.024316] x5 : ffff000320809490 x4 : ffff600064101293 x3 : ffff800080691838
[ 3959.024328] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0000d5bcd640
[ 3959.024340] Call trace:
[ 3959.024346]  alloc_tag_add+0x128/0x178 (P)
[ 3959.024355]  __alloc_tagging_slab_alloc_hook+0x11c/0x1a8
[ 3959.024362]  kmem_cache_alloc_lru_noprof+0x1b8/0x5e8
[ 3959.024369]  xas_alloc+0x304/0x4f0
[ 3959.024381]  xas_create+0x1e0/0x4a0
[ 3959.024388]  xas_store+0x68/0xda8
[ 3959.024395]  __filemap_add_folio+0x5b0/0xbd8
[ 3959.024409]  filemap_add_folio+0x16c/0x7e0
[ 3959.024416]  __filemap_get_folio_mpol+0x2dc/0x9e8
[ 3959.024424]  iomap_get_folio+0xfc/0x180
[ 3959.024435]  __iomap_get_folio+0x2f8/0x4b8
[ 3959.024441]  iomap_write_begin+0x198/0xc18
[ 3959.024448]  iomap_write_iter+0x2ec/0x8f8
[ 3959.024454]  iomap_file_buffered_write+0x19c/0x290
[ 3959.024461]  blkdev_write_iter+0x38c/0x978
[ 3959.024470]  vfs_write+0x4d4/0x928
[ 3959.024482]  ksys_write+0xfc/0x1f8
[ 3959.024489]  __arm64_sys_write+0x74/0xb0
[ 3959.024496]  invoke_syscall+0xd4/0x258
[ 3959.024507]  el0_svc_common.constprop.0+0xb4/0x240
[ 3959.024514]  do_el0_svc+0x48/0x68
[ 3959.024520]  el0_svc+0x40/0xf8
[ 3959.024526]  el0t_64_sync_handler+0xa0/0xe8
[ 3959.024533]  el0t_64_sync+0x1ac/0x1b0
[ 3959.024540] ---[ end trace 0000000000000000 ]---

When __memcg_slab_post_alloc_hook() fails, there are two different
free paths depending on whether size == 1 or size != 1. In the
kmem_cache_free_bulk() path, we do call alloc_tagging_slab_free_hook().
However, in memcg_alloc_abort_single() we don't, so the above warning will
be triggered on the next allocation.

Therefore, add alloc_tagging_slab_free_hook() to the
memcg_alloc_abort_single() path.
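
The sequence leading to the warning can be modelled in a few lines of
userspace C.  This is a sketch with hypothetical names (toy_obj_ext and
the toy_*() helpers); it only mimics the "tag must be cleared before the
slot is reused" rule and is not the slab code itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy per-object extension holding the allocation tag (an assumption). */
struct toy_obj_ext {
	const char *tag;	/* NULL means "no tag set" */
};

/* Alloc hook: returns false where the kernel would warn about a stale tag. */
bool toy_tag_add(struct toy_obj_ext *ext, const char *tag)
{
	bool clean = (ext->tag == NULL);

	ext->tag = tag;
	return clean;
}

/* Free hook: clears the tag so the slot can be reused. */
void toy_tag_free_hook(struct toy_obj_ext *ext)
{
	ext->tag = NULL;
}

/* Abort path before the fix: the object is released, the tag is not. */
void toy_abort_single_old(struct toy_obj_ext *ext)
{
	(void)ext;
}

/* Abort path after the fix: run the free hook first. */
void toy_abort_single_new(struct toy_obj_ext *ext)
{
	toy_tag_free_hook(ext);
}
```

After toy_abort_single_old(), the next toy_tag_add() on the same slot
reports a stale tag; after toy_abort_single_new() it does not.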

Fixes: 9f9796b413 ("mm, slab: move memcg charging to post-alloc hook")
Cc: stable@vger.kernel.org
Suggested-by: Hao Li <hao.li@linux.dev>
Signed-off-by: Hao Ge <hao.ge@linux.dev>
Reviewed-by: Hao Li <hao.li@linux.dev>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20260204101401.202762-1-hao.ge@linux.dev
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-06 09:51:08 +01:00
Miaohe Lin
ae9fd76c11 mm/memory-failure: reject unsupported non-folio compound page
When !CONFIG_TRANSPARENT_HUGEPAGE, a non-folio compound page can appear in
a userspace mapping via either vm_insert_*() functions or
vm_operations_struct->fault().  They are not folios, thus should not be
considered for folio operations like split.  To reject these pages, make
sure get_hwpoison_page() is always called as HWPoisonHandlable() will do
the right work.

[Some commit log borrowed from Zi Yan. Thanks.]

Link: https://lkml.kernel.org/r/20260205075328.523211-1-linmiaohe@huawei.com
Fixes: 689b898677 ("mm/memory-failure: improve large block size folio handling")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reported-by: 是参差 <shicenci@gmail.com>
Closes: https://lore.kernel.org/all/PS1PPF7E1D7501F1E4F4441E7ECD056DEADAB98A@PS1PPF7E1D7501F.apcprd02.prod.outlook.com/
Reviewed-by: Zi Yan <ziy@nvidia.com>
Tested-by: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-05 14:10:00 -08:00
Linus Torvalds
b20624608f 5 hotfixes. 2 are cc:stable, 2 are for MM.
All are singletons - please see the changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaYPcbwAKCRDdBJ7gKXxA
 jpM9AQDRiBlZRBdYY8/nS2zMc8hE7s5O3koXu/UMf2O01aJjsgD6AssmcJzkbLir
 O1mlBSD0wlR3TZLEqSOUYIxgw7evLww=
 =ZHeq
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2026-02-04-15-55' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "Five hotfixes.  Two are cc:stable, two are for MM.

  All are singletons - please see the changelogs for details"

* tag 'mm-hotfixes-stable-2026-02-04-15-55' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  Documentation: document liveupdate cmdline parameter
  mm, shmem: prevent infinite loop on truncate race
  mailmap: update Alexander Mikhalitsyn's emails
  liveupdate: luo_file: do not clear serialized_data on unfreeze
  x86/kfence: fix booting on 32bit non-PAE systems
2026-02-04 16:04:00 -08:00
Claudio Imbrenda
728b0e21b4 KVM: S390: Remove PGSTE code from linux/s390 mm
Remove the PGSTE config option.
Remove all code from linux/s390 mm that involves PGSTEs.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
2026-02-04 17:00:10 +01:00
Harry Yoo
2f35fee943 mm/slab: only allow SLAB_OBJ_EXT_IN_OBJ for unmergeable caches
While SLAB_OBJ_EXT_IN_OBJ reduces the memory overhead of accounting
slab objects, it prevents slab merging because merging can change
the metadata layout.

As pointed out by Vlastimil Babka, disabling merging solely for this memory
optimization may not be a net win, because disabling slab merging tends
to increase overall memory usage.

Restrict SLAB_OBJ_EXT_IN_OBJ to caches that are already unmergeable for
other reasons (e.g., those with constructors or SLAB_TYPESAFE_BY_RCU).

Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20260127103151.21883-3-harry.yoo@oracle.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-04 10:05:36 +01:00
Harry Yoo
a77d6d3386 mm/slab: place slabobj_ext metadata in unused space within s->size
When a cache has high s->align value and s->object_size is not aligned
to it, each object ends up with some unused space because of alignment.
If this wasted space is big enough, we can use it to store the
slabobj_ext metadata instead of wasting it.

On my system, this happens with caches like kmem_cache, mm_struct, pid,
task_struct, sighand_cache, xfs_inode, and others.

To place the slabobj_ext metadata within each object, the existing
slab_obj_ext() logic can still be used by setting:

  - slab->obj_exts = slab_address(slab) + (slabobj_ext offset)
  - stride = s->size

slab_obj_ext() doesn't need to know where the metadata is stored,
so this method works without adding extra overhead to slab_obj_ext().

A good example benefiting from this optimization is xfs_inode
(object_size: 992, align: 64). To measure memory savings, 2 million
files were created on XFS.
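
The arithmetic behind the xfs_inode example can be sketched as follows.
This is a simplified userspace model with hypothetical helper names: it
assumes an 8-byte extension record for the example, ignores any other
per-object metadata the allocator may keep in the slot, and places the
first record directly after the first object.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Round x up to a multiple of a (a must be a power of two). */
size_t align_up(size_t x, size_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* Bytes left unused at the end of each object slot due to alignment. */
size_t slot_padding(size_t object_size, size_t align)
{
	return align_up(object_size, align) - object_size;
}

/* Can a per-object record of ext_size bytes live in that padding? */
bool ext_fits_in_padding(size_t object_size, size_t align, size_t ext_size)
{
	return slot_padding(object_size, align) >= ext_size;
}

/*
 * Offset of the record for object idx from the start of the slab:
 * a base offset plus idx times the stride, with stride == slot size.
 */
size_t ext_offset(size_t ext_base, size_t stride, size_t idx)
{
	return ext_base + idx * stride;
}
```

With object_size 992 and align 64, each slot is 1024 bytes and has 32
bytes of padding, enough for the assumed 8-byte record; the same
base-plus-stride lookup then works whether the records live in a
separate array or inside the objects' padding.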

[ MEMCG=y, MEM_ALLOC_PROFILING=n ]

Before patch (creating ~2.64M directories on xfs):
  Slab:            5175976 kB
  SReclaimable:    3837524 kB
  SUnreclaim:      1338452 kB

After patch (creating ~2.64M directories on xfs):
  Slab:            5152912 kB
  SReclaimable:    3838568 kB
  SUnreclaim:      1314344 kB (-23.54 MiB)

Enjoy the memory savings!

Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20260113061845.159790-10-harry.yoo@oracle.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-04 10:05:36 +01:00
Harry Yoo
fab0694646 mm/slab: move [__]ksize and slab_ksize() to mm/slub.c
To access SLUB's internal implementation details beyond cache flags in
ksize(), move __ksize(), ksize(), and slab_ksize() to mm/slub.c.

[vbabka@suse.cz: also make __ksize() static and move its kerneldoc to
 ksize() ]

Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20260113061845.159790-9-harry.yoo@oracle.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-04 10:05:35 +01:00