linux

mirror of https://github.com/torvalds/linux.git synced 2026-07-27 01:32:21 +02:00

Author	SHA1	Message	Date
Reinette Chatre	b9f089723a	fs/resctrl: Fix double-add of pseudo-locked region's RMID to free list A pseudo-locked group's RMID is freed when it is created. On unmount rmdir_all_sub() unconditionally frees all RMID of all groups, resulting in a double-free of the pseudo-locked group's RMID. The consequence of this is that the original free results in the pseudo-locked group's RMID being added to the rmid_free_lru linked list and the second free then attempts to add the same RMID entry to the rmid_free_lru again. Do not double-free a pseudo-locked group's RMID. Fixes: `e0bdfe8e36` ("x86/intel_rdt: Support creation/removal of pseudo-locked region") Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@kernel.org> Link: https://patch.msgid.link/551432dd7e624a862b8e58314c38aaba0afff3e9.1783377598.git.reinette.chatre@intel.com	2026-07-06 19:05:49 -07:00
Tony Luck	52fce64860	fs/resctrl: Fix use-after-free during unmount During unmount or failure teardown all mon_data structures that contain monitoring event file private data are freed after which kernfs nodes are removed. However, the RDT_DELETED flag is never set for the statically allocated default resource group. A concurrent reader of an event file associated with the default resource group may, after dropping kernfs active protection, block on rdtgroup_mutex while unmount proceeds to free the file private data and destroy the kernfs node without waiting for the reader. When the mutex is released, the reader wakes up, observes that RDT_DELETED is not set for the default group, and dereferences the already-freed file private data. The scenario can be depicted as follows: CPU0 CPU1 /* * Default resource group's * monitoring data accessible via * kernfs file with kernfs_node::priv * pointing to a struct mon_data. * User opens the file for reading. / rdtgroup_mondata_show() / arch encounters fatal error / rdtgroup_kn_lock_live() resctrl_exit() atomic_inc(&rdtgroup_default.waitcount) cpus_read_lock() kernfs_break_active_protection(kn) mutex_lock(&rdtgroup_mutex) cpus_read_lock() resctrl_fs_teardown() mutex_lock(&rdtgroup_mutex) rmdir_all_sub() mon_put_kn_priv() / Delete all mon_data structures / rdtgroup_destroy_root() kernfs_destroy_root() rdtgroup_default.kn = NULL mutex_unlock(&rdtgroup_mutex) / * rdtgroup_default.flags is empty so * rdtgroup_kn_lock_live() returns * &rdtgroup_default / md = of->kn->priv; / md points to freed mon_data */ Set RDT_DELETED for the default group unconditionally since the flag does not lead to the freeing of this statically allocated group. Do not allow a new resctrl mount if there are any waiters on default group of previous mount. A new mount will re-initialize the default group that would appear to waiters from previous mount as though the default group is accessible causing them to access the mon_data structures from the previous mount that have been removed. Fixes: `2a65660385` ("x86/resctrl: Expand the width of domid by replacing mon_data_bits") Closes: https://sashiko.dev/#/patchset/20260508182143.14592-1-tony.luck%40intel.com?part=2 [1] Reported-by: Sashiko <sashiko-bot@kernel.org> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Chen Yu <yu.c.chen@intel.com> Cc: <stable@kernel.org> Link: https://patch.msgid.link/49a2ca3ca688f27e1a646cf90e1dc69287021127.1783377598.git.reinette.chatre@intel.com	2026-07-06 18:50:56 -07:00
Tony Luck	ca0676ae2e	fs/resctrl: Free mon_data structures on rdt_get_tree() failure If mkdir_mondata_all() or a subsequent call in rdt_get_tree() fails, the mon_data structures allocated by mon_get_kn_priv() are leaked. Add mon_put_kn_priv() to the out_mongrp error path to free the mon_data structures. Fixes: `2a65660385` ("x86/resctrl: Expand the width of domid by replacing mon_data_bits") Closes: https://lore.kernel.org/lkml/5d38c1fb-8f91-472b-8897-24b2f50c772b@intel.com/ Reported-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Chen Yu <yu.c.chen@intel.com> Reviewed-by: Ben Horgan <ben.horgan@arm.com> Cc: <stable@kernel.org> Link: https://patch.msgid.link/433623b7e3316ffd52323255d1aa4f156ad97cb1.1783377598.git.reinette.chatre@intel.com	2026-07-06 17:42:38 -07:00
Reinette Chatre	fc16126cc1	x86,fs/resctrl: Prevent out-of-bounds access while offlining CPU when SNC enabled The architecture updates the cpu_mask in a domain's header to track which online CPUs are associated with the domain. When this mask becomes empty the architecture initiates offline of the domain that includes calling on resctrl fs to offline the domain. If it is a monitoring domain in which LLC occupancy is tracked resctrl fs forces the limbo handler to clear all busy RMID state associated with the domain. The limbo handler always reads the current event value associated with a busy RMID irrespective of it being checked as part of regular "is it still busy" check or whether it will be forced released anyway. When reading an RMID on a system with SNC enabled the "logical RMID" is converted to the "physical RMID" and this conversion requires the NUMA node ID of the resctrl monitoring domain that is in turn determined by querying the NUMA node ID of any CPU belonging to the monitoring domain. When the monitoring domain is going offline its cpu_mask is empty causing the NUMA node ID query via cpu_to_node() to be done with "nr_cpu_ids" as argument resulting in an out-of-bounds access. Refactor the limbo handler to skip reading the RMID when the RMID will just be forced to no longer be dirty in the domain anyway. Add a safety check to the architecture's RMID reader to protect against this scenario. Fixes: `e13db55b5a` ("x86/resctrl: Introduce snc_nodes_per_l3_cache") Closes: https://sashiko.dev/#/patchset/cover.1780456704.git.reinette.chatre%40intel.com?part=9 Reported-by: Sashiko <sashiko-bot@kernel.org> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@kernel.org> Link: https://patch.msgid.link/16137433df42f85013b2f7a53626795cbd6637b9.1781029125.git.reinette.chatre@intel.com	2026-07-01 13:15:02 -07:00
Ben Horgan	3aec86e4ea	fs/resctrl: Continue counter allocation after failure In mbm_event mode, with mbm_assign_on_mkdir set to 1, when a user creates a new CTRL_MON or MON group resctrl attempts to allocate counters for each of the supported MBM events on each resctrl domain. As counters are limited, such allocation may fail and when it does counter allocations for the remaining domains are skipped even if the domains have available counters. Because of that, the user needs to view the resource group'smbm_L3_assignments file to get an accurate view of counter assignment in a new resource group and then manually create counters in the skipped domains with available counters. Writes to mbm_L3_assignments using the wildcard format, <event>:*=e, also skip counter allocation in other domains after a counter allocation failure. When handling a request to create counters in all domains it is unnecessary for a counter allocation in one domain to prevent counter allocation in other domains. Always attempt to allocate all the counters requested. [ bp: Massage commit message. ] Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Babu Moger <babu.moger@amd.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Link: https://lore.kernel.org/20260506082855.3694761-1-ben.horgan@arm.com	2026-05-08 12:07:43 +02:00
Ben Horgan	ee3d4c81d8	fs/resctrl: Add monitor property 'mbm_cntr_assign_fixed' Commit `3b497c3f4f` ("fs/resctrl: Introduce the interface to display monitoring modes") introduced CONFIG_RESCTRL_ASSIGN_FIXED but left adding the Kconfig entry until it was necessary. The counter assignment mode is fixed in MPAM, even when there are assignable counters, and so addressing this is needed to support MPAM. To avoid the burden of another Kconfig entry, replace CONFIG_RESCTRL_ASSIGN_FIXED with a new property in 'struct resctrl_mon', 'mbm_cntr_assign_fixed' to be set by the architecture. Do not request the architecture to change the counter assignment mode if it does not support doing so. Provide insight to user space about why such a request fails. Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Babu Moger <babu.moger@amd.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Link: https://lore.kernel.org/20260506082855.3694761-1-ben.horgan@arm.com	2026-05-07 16:29:14 +02:00
Ben Horgan	f52abe6502	fs/resctrl: Disallow the software controller when MBM counters are assignable The software controller requires that there is one MBM counter per monitor group that is assigned to the event backing the software controller, as per mba_MBps_event. When mbm_event mode is in use, it is not guaranteed that any particular event will have an assigned counter. Currently, only AMD systems support counter assignment, but the MBA delay is non-linear and so the software controller is never supported anyway. On MPAM systems, the MBA delay is linear and so the software controller could be supported. The MPAM driver, unless a need arises, will not support the 'default' mbm_assign_mode and will always use the 'mbm_event' mode for memory bandwidth monitoring. Rather than develop a way to guarantee the counter assignment requirements needed by the software controller, take the pragmatic approach. Don't allow the software controller to be used at the same time as 'mbm_event' mode. As MPAM is the only relevant architecture and it will use 'mbm_event' mode whenever there are assignable MBM counters, for simplicity's sake, don't allow the software controller when the MBM counters are assignable. Implement this by failing the mount if the user requests the software controller, the mba_MBps option, and the MBM counters are assignable. Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Babu Moger <babu.moger@amd.com> Tested-by: Babu Moger <babu.moger@amd.com> Link: https://lore.kernel.org/20260506082855.3694761-1-ben.horgan@arm.com	2026-05-06 19:06:57 +02:00
Ben Horgan	94a1206522	x86,fs/resctrl: Create 'event_filter' files read only if they're not configurable When the counter assignment mode is mbm_event resctrl assumes the MBM events are configurable and exposes the 'event_filter' files. These files live at info/L3_MON/event_configs/<event>/event_filter and are used to display and set the event configuration. The MPAM architecture has support for configuring the memory bandwidth utilization (MBWU) counters to only count reads or only count writes. However, in MPAM, this event filtering support is optional in the hardware (and not yet implemented in the MPAM driver) but MBM counter assignment is always possible for MPAM MBWU counters. In order to support mbm_event mode with MPAM, create the 'event_filter' files read only if the event configuration can't be changed. A user can still chmod the file and so also return early with an error from event_filter_write(). Introduce a new monitor property, mbm_cntr_configurable, to indicate whether or not assignable MBM counters are configurable. On x86, set this to true whenever mbm_cntr_assignable is true to keep existing behaviour. Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Babu Moger <babu.moger@amd.com> Tested-by: Babu Moger <babu.moger@amd.com> Link: https://lore.kernel.org/20260506082855.3694761-1-ben.horgan@arm.com	2026-05-06 19:06:57 +02:00
Ben Horgan	7625632fed	fs/resctrl: Tidy up the error path in resctrl_mkdir_event_configs() The error path in resctrl_mkdir_event_configs() is unnecessarily complicated. Simplify it to just return directly on error. Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Babu Moger <babu.moger@amd.com> Tested-by: Babu Moger <babu.moger@amd.com> Link: https://lore.kernel.org/r/20260506082855.3694761-1-ben.horgan@arm.com	2026-05-06 19:06:52 +02:00
Linus Torvalds	334fbe734e	mm.git review status for linus..mm-stable Everything: Total patches: 368 Reviews/patch: 1.56 Reviewed rate: 74% Excluding DAMON: Total patches: 316 Reviews/patch: 1.77 Reviewed rate: 81% Excluding DAMON and zram: Total patches: 306 Reviews/patch: 1.81 Reviewed rate: 82% Excluding DAMON, zram and maple_tree: Total patches: 276 Reviews/patch: 2.01 Reviewed rate: 91% Significant patch series in this merge: - The 30 patch series "maple_tree: Replace big node with maple copy" from Liam Howlett is mainly prepararatory work for ongoing development but it does reduce stack usage and is an improvement. - The 12 patch series "mm, swap: swap table phase III: remove swap_map" from Kairui Song offers memory savings by removing the static swap_map. It also yields some CPU savings and implements several cleanups. - The 2 patch series "mm: memfd_luo: preserve file seals" from Pratyush Yadav adds file seal preservation to LUO's memfd code. - The 2 patch series "mm: zswap: add per-memcg stat for incompressible pages" from Jiayuan Chen adds additional userspace stats reportng to zswap. - The 4 patch series "arch, mm: consolidate empty_zero_page" from Mike Rapoport implements some cleanups for our handling of ZERO_PAGE() and zero_pfn. - The 2 patch series "mm/kmemleak: Improve scan_should_stop() implementation" from Zhongqiu Han provides an robustness improvement and some cleanups in the kmemleak code. - The 4 patch series "Improve khugepaged scan logic" from Vernon Yang "improves the khugepaged scan logic and reduces CPU consumption by prioritizing scanning tasks that access memory frequently". - The 2 patch series "Make KHO Stateless" from Jason Miu simplifies Kexec Handover by "transitioning KHO from an xarray-based metadata tracking system with serialization to a radix tree data structure that can be passed directly to the next kernel" - The 3 patch series "mm: vmscan: add PID and cgroup ID to vmscan tracepoints" from Thomas Ballasi and Steven Rostedt enhances vmscan's tracepointing. - The 5 patch series "mm: arch/shstk: Common shadow stack mapping helper and VM_NOHUGEPAGE" from Catalin Marinas is a cleanup for the shadow stack code: remove per-arch code in favour of a generic implementation. - The 2 patch series "Fix KASAN support for KHO restored vmalloc regions" from Pasha Tatashin fixes a WARN() which can be emitted the KHO restores a vmalloc area. - The 4 patch series "mm: Remove stray references to pagevec" from Tal Zussman provides several cleanups, mainly udpating references to "struct pagevec", which became folio_batch three years ago. - The 17 patch series "mm: Eliminate fake head pages from vmemmap optimization" from Kiryl Shutsemau simplifies the HugeTLB vmemmap optimization (HVO) by changing how tail pages encode their relationship to the head page. - The 2 patch series "mm/damon/core: improve DAMOS quota efficiency for core layer filters" from SeongJae Park improves two problematic behaviors of DAMOS that makes it less efficient when core layer filters are used. - The 3 patch series "mm/damon: strictly respect min_nr_regions" from SeongJae Park improves DAMON usability by extending the treatment of the min_nr_regions user-settable parameter. - The 3 patch series "mm/page_alloc: pcp locking cleanup" from Vlastimil Babka is a proper fix for a previously hotfixed SMP=n issue. Code simplifications and cleanups ennsed. - The 16 patch series "mm: cleanups around unmapping / zapping" from David Hildenbrand implements "a bunch of cleanups around unmapping and zapping. Mostly simplifications, code movements, documentation and renaming of zapping functions". - The 6 patch series "support batched checking of the young flag for MGLRU" from Baolin Wang supports batched checking of the young flag for MGLRU. It's part cleanups; one benchmark shows large performance benefits for arm64. - The 5 patch series "memcg: obj stock and slab stat caching cleanups" from Johannes Weiner provides memcg cleanup and robustness improvements. - The 5 patch series "Allow order zero pages in page reporting" from Yuvraj Sakshith enhances page_reporting's free page reporting - it is presently and undesirably order-0 pages when reporting free memory. - The 6 patch series "mm: vma flag tweaks" from Lorenzo Stoakes is cleanup work following from the recent conversion of the VMA flags to a bitmap. - The 10 patch series "mm/damon: add optional debugging-purpose sanity checks" from SeongJae Park adds some more developer-facing debug checks into DAMON core. - The 2 patch series "mm/damon: test and document power-of-2 min_region_sz requirement" from SeongJae Park adds an additional DAMON kunit test and makes some adjustments to the addr_unit parameter handling. - The 3 patch series "mm/damon/core: make passed_sample_intervals comparisons overflow-safe" from SeongJae Park fixes a hard-to-hit time overflow issue in DAMON core. - The 7 patch series "mm/damon: improve/fixup/update ratio calculation, test and documentation" from SeongJae Park is a "batch of misc/minor improvements and fixups" for DAMON. - The 4 patch series "mm: move vma_(kernel\|mmu)_pagesize() out of hugetlb.c" from David Hildenbrand fixes a possible issue with dax-device when CONFIG_HUGETLB=n. Some code movement was required. - The 6 patch series "zram: recompression cleanups and tweaks" from Sergey Senozhatsky provides "a somewhat random mix of fixups, recompression cleanups and improvements" in the zram code. - The 11 patch series "mm/damon: support multiple goal-based quota tuning algorithms" from SeongJae Park extend DAMOS quotas goal auto-tuning to support multiple tuning algorithms that users can select. - The 4 patch series "mm: thp: reduce unnecessary start_stop_khugepaged()" from Breno Leitao fixes the khugpaged sysfs handling so we no longer spam the logs with reams of junk when starting/stopping khugepaged. - The 3 patch series "mm: improve map count checks" from Lorenzo Stoakes provides some cleanups and slight fixes in the mremap, mmap and vma code. - The 5 patch series "mm/damon: support addr_unit on default monitoring targets for modules" from SeongJae Park extends the use of DAMON core's addr_unit tunable. - The 5 patch series "mm: khugepaged cleanups and mTHP prerequisites" from Nico Pache provides cleanups in the khugepaged and is a base for Nico's planned khugepaged mTHP support. - The 15 patch series "mm: memory hot(un)plug and SPARSEMEM cleanups" from David Hildenbrand implements code movement and cleanups in the memhotplug and sparsemem code. - The 2 patch series "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and cleanup CONFIG_MIGRATION" from David Hildenbrand rationalizes some memhotplug Kconfig support. - The 6 patch series "change young flag check functions to return bool" from Baolin Wang is "a cleanup patchset to change all young flag check functions to return bool". - The 3 patch series "mm/damon/sysfs: fix memory leak and NULL dereference issues" from Josh Law and SeongJae Park fixes a few potential DAMON bugs. - The 25 patch series "mm/vma: convert vm_flags_t to vma_flags_t in vma code" from "converts a lot of the existing use of the legacy vm_flags_t data type to the new vma_flags_t type which replaces it". Mainly in the vma code. - The 21 patch series "mm: expand mmap_prepare functionality and usage" from Lorenzo Stoakes "expands the mmap_prepare functionality, which is intended to replace the deprecated f_op->mmap hook which has been the source of bugs and security issues for some time". Cleanups, documentation, extension of mmap_prepare into filesystem drivers. - The 13 patch series "mm/huge_memory: refactor zap_huge_pmd()" from Lorenzo Stoakes simplifies and cleans up zap_huge_pmd(). Additional cleanups around vm_normal_folio_pmd() and the softleaf functionality are performed. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCad3HDQAKCRDdBJ7gKXxA jrUQAPwNhPk5nPSxnyxjAeQtOBHqgCdnICeEismLajPKd9aYRgEA0s2XAu3tSUYi GrBnWImHG3s4ePQxVcPCegWTsOUrXgQ= =1Q7o -----END PGP SIGNATURE----- Merge tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "maple_tree: Replace big node with maple copy" (Liam Howlett) Mainly prepararatory work for ongoing development but it does reduce stack usage and is an improvement. - "mm, swap: swap table phase III: remove swap_map" (Kairui Song) Offers memory savings by removing the static swap_map. It also yields some CPU savings and implements several cleanups. - "mm: memfd_luo: preserve file seals" (Pratyush Yadav) File seal preservation to LUO's memfd code - "mm: zswap: add per-memcg stat for incompressible pages" (Jiayuan Chen) Additional userspace stats reportng to zswap - "arch, mm: consolidate empty_zero_page" (Mike Rapoport) Some cleanups for our handling of ZERO_PAGE() and zero_pfn - "mm/kmemleak: Improve scan_should_stop() implementation" (Zhongqiu Han) A robustness improvement and some cleanups in the kmemleak code - "Improve khugepaged scan logic" (Vernon Yang) Improve khugepaged scan logic and reduce CPU consumption by prioritizing scanning tasks that access memory frequently - "Make KHO Stateless" (Jason Miu) Simplify Kexec Handover by transitioning KHO from an xarray-based metadata tracking system with serialization to a radix tree data structure that can be passed directly to the next kernel - "mm: vmscan: add PID and cgroup ID to vmscan tracepoints" (Thomas Ballasi and Steven Rostedt) Enhance vmscan's tracepointing - "mm: arch/shstk: Common shadow stack mapping helper and VM_NOHUGEPAGE" (Catalin Marinas) Cleanup for the shadow stack code: remove per-arch code in favour of a generic implementation - "Fix KASAN support for KHO restored vmalloc regions" (Pasha Tatashin) Fix a WARN() which can be emitted the KHO restores a vmalloc area - "mm: Remove stray references to pagevec" (Tal Zussman) Several cleanups, mainly udpating references to "struct pagevec", which became folio_batch three years ago - "mm: Eliminate fake head pages from vmemmap optimization" (Kiryl Shutsemau) Simplify the HugeTLB vmemmap optimization (HVO) by changing how tail pages encode their relationship to the head page - "mm/damon/core: improve DAMOS quota efficiency for core layer filters" (SeongJae Park) Improve two problematic behaviors of DAMOS that makes it less efficient when core layer filters are used - "mm/damon: strictly respect min_nr_regions" (SeongJae Park) Improve DAMON usability by extending the treatment of the min_nr_regions user-settable parameter - "mm/page_alloc: pcp locking cleanup" (Vlastimil Babka) The proper fix for a previously hotfixed SMP=n issue. Code simplifications and cleanups ensued - "mm: cleanups around unmapping / zapping" (David Hildenbrand) A bunch of cleanups around unmapping and zapping. Mostly simplifications, code movements, documentation and renaming of zapping functions - "support batched checking of the young flag for MGLRU" (Baolin Wang) Batched checking of the young flag for MGLRU. It's part cleanups; one benchmark shows large performance benefits for arm64 - "memcg: obj stock and slab stat caching cleanups" (Johannes Weiner) memcg cleanup and robustness improvements - "Allow order zero pages in page reporting" (Yuvraj Sakshith) Enhance free page reporting - it is presently and undesirably order-0 pages when reporting free memory. - "mm: vma flag tweaks" (Lorenzo Stoakes) Cleanup work following from the recent conversion of the VMA flags to a bitmap - "mm/damon: add optional debugging-purpose sanity checks" (SeongJae Park) Add some more developer-facing debug checks into DAMON core - "mm/damon: test and document power-of-2 min_region_sz requirement" (SeongJae Park) An additional DAMON kunit test and makes some adjustments to the addr_unit parameter handling - "mm/damon/core: make passed_sample_intervals comparisons overflow-safe" (SeongJae Park) Fix a hard-to-hit time overflow issue in DAMON core - "mm/damon: improve/fixup/update ratio calculation, test and documentation" (SeongJae Park) A batch of misc/minor improvements and fixups for DAMON - "mm: move vma_(kernel\|mmu)_pagesize() out of hugetlb.c" (David Hildenbrand) Fix a possible issue with dax-device when CONFIG_HUGETLB=n. Some code movement was required. - "zram: recompression cleanups and tweaks" (Sergey Senozhatsky) A somewhat random mix of fixups, recompression cleanups and improvements in the zram code - "mm/damon: support multiple goal-based quota tuning algorithms" (SeongJae Park) Extend DAMOS quotas goal auto-tuning to support multiple tuning algorithms that users can select - "mm: thp: reduce unnecessary start_stop_khugepaged()" (Breno Leitao) Fix the khugpaged sysfs handling so we no longer spam the logs with reams of junk when starting/stopping khugepaged - "mm: improve map count checks" (Lorenzo Stoakes) Provide some cleanups and slight fixes in the mremap, mmap and vma code - "mm/damon: support addr_unit on default monitoring targets for modules" (SeongJae Park) Extend the use of DAMON core's addr_unit tunable - "mm: khugepaged cleanups and mTHP prerequisites" (Nico Pache) Cleanups to khugepaged and is a base for Nico's planned khugepaged mTHP support - "mm: memory hot(un)plug and SPARSEMEM cleanups" (David Hildenbrand) Code movement and cleanups in the memhotplug and sparsemem code - "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and cleanup CONFIG_MIGRATION" (David Hildenbrand) Rationalize some memhotplug Kconfig support - "change young flag check functions to return bool" (Baolin Wang) Cleanups to change all young flag check functions to return bool - "mm/damon/sysfs: fix memory leak and NULL dereference issues" (Josh Law and SeongJae Park) Fix a few potential DAMON bugs - "mm/vma: convert vm_flags_t to vma_flags_t in vma code" (Lorenzo Stoakes) Convert a lot of the existing use of the legacy vm_flags_t data type to the new vma_flags_t type which replaces it. Mainly in the vma code. - "mm: expand mmap_prepare functionality and usage" (Lorenzo Stoakes) Expand the mmap_prepare functionality, which is intended to replace the deprecated f_op->mmap hook which has been the source of bugs and security issues for some time. Cleanups, documentation, extension of mmap_prepare into filesystem drivers - "mm/huge_memory: refactor zap_huge_pmd()" (Lorenzo Stoakes) Simplify and clean up zap_huge_pmd(). Additional cleanups around vm_normal_folio_pmd() and the softleaf functionality are performed. * tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits) mm: fix deferred split queue races during migration mm/khugepaged: fix issue with tracking lock mm/huge_memory: add and use has_deposited_pgtable() mm/huge_memory: add and use normal_or_softleaf_folio_pmd() mm: add softleaf_is_valid_pmd_entry(), pmd_to_softleaf_folio() mm/huge_memory: separate out the folio part of zap_huge_pmd() mm/huge_memory: use mm instead of tlb->mm mm/huge_memory: remove unnecessary sanity checks mm/huge_memory: deduplicate zap deposited table call mm/huge_memory: remove unnecessary VM_BUG_ON_PAGE() mm/huge_memory: add a common exit path to zap_huge_pmd() mm/huge_memory: handle buggy PMD entry in zap_huge_pmd() mm/huge_memory: have zap_huge_pmd return a boolean, add kdoc mm/huge: avoid big else branch in zap_huge_pmd() mm/huge_memory: simplify vma_is_specal_huge() mm: on remap assert that input range within the proposed VMA mm: add mmap_action_map_kernel_pages[_full]() uio: replace deprecated mmap hook with mmap_prepare in uio_info drivers: hv: vmbus: replace deprecated mmap hook with mmap_prepare mm: allow handling of stacked mmap_prepare hooks in more drivers ...	2026-04-15 12:59:16 -07:00
Reinette Chatre	79727019ce	fs/resctrl: Add missing return value descriptions Using the stricter "./tools/docs/kernel-doc -Wall -v" to verify proper formatting of documentation comments includes warnings related to return markup on functions that are omitted during the default verification checks. This stricter verification reports a couple of missing return descriptions in resctrl: Warning: .../fs/resctrl/rdtgroup.c:1536 No description found for return value of 'rdtgroup_cbm_to_size' Warning: .../fs/resctrl/rdtgroup.c:3131 No description found for return value of 'mon_get_kn_priv' Warning: .../fs/resctrl/rdtgroup.c:3523 No description found for return value of 'cbm_ensure_valid' Warning: .../fs/resctrl/monitor.c:238 No description found for return value of 'resctrl_find_cleanest_closid' Add the missing return descriptions. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://patch.msgid.link/1c50b9f7c73251c007133590986f127e1af57780.1775576382.git.reinette.chatre@intel.com	2026-04-07 21:01:22 +02:00
Lorenzo Stoakes (Oracle)	0c2aa66357	mm: reintroduce vma_desc_test() as a singular flag test Similar to vma_flags_test(), we have previously renamed vma_desc_test() to vma_desc_test_any(). Now that is in place, we can reintroduce vma_desc_test() to explicitly check for a single VMA flag. As with vma_flags_test(), this is useful as often flag tests are against a single flag, and vma_desc_test_any(flags, VMA_READ_BIT) reads oddly and potentially causes confusion. As with vma_flags_test() a combination of sparse and vma_flags_t being a struct means that users cannot misuse this function without it getting flagged. Also update the VMA tests to reflect this change. Link: https://lkml.kernel.org/r/3a65ca23defb05060333f0586428fe279a484564.1772704455.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Babu Moger <babu.moger@amd.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chao Yu <chao@kernel.org> Cc: Chatre, Reinette <reinette.chatre@intel.com> Cc: Chunhai Guo <guochunhai@vivo.com> Cc: Damien Le Maol <dlemoal@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Dave Martin <dave.martin@arm.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hongbo Li <lihongbo22@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Morse <james.morse@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Johannes Thumshirn <jth@kernel.org> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naohiro Aota <naohiro.aota@wdc.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sandeep Dhavale <dhavale@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Yue Hu <zbestahu@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:19 -07:00
Lorenzo Stoakes (Oracle)	e650bb30ca	mm: rename VMA flag helpers to be more readable Patch series "mm: vma flag tweaks". The ongoing work around introducing non-system word VMA flags has introduced a number of helper functions and macros to make life easier when working with these flags and to make conversions from the legacy use of VM_xxx flags more straightforward. This series improves these to reduce confusion as to what they do and to improve consistency and readability. Firstly the series renames vma_flags_test() to vma_flags_test_any() to make it abundantly clear that this function tests whether any of the flags are set (as opposed to vma_flags_test_all()). It then renames vma_desc_test_flags() to vma_desc_test_any() for the same reason. Note that we drop the 'flags' suffix here, as vma_desc_test_any_flags() would be cumbersome and 'test' implies a flag test. Similarly, we rename vma_test_all_flags() to vma_test_all() for consistency. Next, we have a couple of instances (erofs, zonefs) where we are now testing for vma_desc_test_any(desc, VMA_SHARED_BIT) && vma_desc_test_any(desc, VMA_MAYWRITE_BIT). This is silly, so this series introduces vma_desc_test_all() so these callers can instead invoke vma_desc_test_all(desc, VMA_SHARED_BIT, VMA_MAYWRITE_BIT). We then observe that quite a few instances of vma_flags_test_any() and vma_desc_test_any() are in fact only testing against a single flag. Using the _any() variant here is just confusing - 'any' of single item reads strangely and is liable to cause confusion. So in these instances the series reintroduces vma_flags_test() and vma_desc_test() as helpers which test against a single flag. The fact that vma_flags_t is a struct and that vma_flag_t utilises sparse to avoid confusion with vm_flags_t makes it impossible for a user to misuse these helpers without it getting flagged somewhere. The series also updates __mk_vma_flags() and functions invoked by it to explicitly mark them always inline to match expectation and to be consistent with other VMA flag helpers. It also renames vma_flag_set() to vma_flags_set_flag() (a function only used by __mk_vma_flags()) to be consistent with other VMA flag helpers. Finally it updates the VMA tests for each of these changes, and introduces explicit tests for vma_flags_test() and vma_desc_test() to assert that they behave as expected. This patch (of 6): On reflection, it's confusing to have vma_flags_test() and vma_desc_test_flags() test whether any comma-separated VMA flag bit is set, while also having vma_flags_test_all() and vma_test_all_flags() separately test whether all flags are set. Firstly, rename vma_flags_test() to vma_flags_test_any() to eliminate this confusion. Secondly, since the VMA descriptor flag functions are becoming rather cumbersome, prefer vma_desc_test() to vma_desc_test_flags(), and also rename vma_desc_test_flags() to vma_desc_test_any(). Finally, rename vma_test_all_flags() to vma_test_all() to keep the VMA-specific helper consistent with the VMA descriptor naming convention and to help avoid confusion vs. vma_flags_test_all(). While we're here, also update whitespace to be consistent in helper functions. Link: https://lkml.kernel.org/r/cover.1772704455.git.ljs@kernel.org Link: https://lkml.kernel.org/r/0f9cb3c511c478344fac0b3b3b0300bb95be95e9.1772704455.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Suggested-by: Pedro Falcato <pfalcato@suse.de> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Babu Moger <babu.moger@amd.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chao Yu <chao@kernel.org> Cc: Chatre, Reinette <reinette.chatre@intel.com> Cc: Chunhai Guo <guochunhai@vivo.com> Cc: Damien Le Maol <dlemoal@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Dave Martin <dave.martin@arm.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hongbo Li <lihongbo22@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Morse <james.morse@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Johannes Thumshirn <jth@kernel.org> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naohiro Aota <naohiro.aota@wdc.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sandeep Dhavale <dhavale@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Yue Hu <zbestahu@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:18 -07:00
Aaron Tomlin	d2bf45d067	fs/resctrl: Add "" shorthand to set io_alloc CBM for all domains Configuring the io_alloc_cbm interface requires an explicit domain ID for each cache domain. On systems with high core counts and numerous cache clusters, this requirement becomes cumbersome for automation and management tasks that aim to apply a uniform policy. Introduce a wildcard domain ID selector "" for the io_alloc_cbm interface. This enables users to set the same Capacity Bitmask (CBM) across all cache domains in a single operation. Signed-off-by: Aaron Tomlin <atomlin@atomlin.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Babu Moger <babu.moger@amd.com> Tested-by: Babu Moger <babu.moger@amd.com> Link: https://patch.msgid.link/20260325001159.447075-3-atomlin@atomlin.com	2026-04-02 00:22:59 +02:00
Aaron Tomlin	d06b8e7c97	fs/resctrl: Report invalid domain ID when parsing io_alloc_cbm The last_cmd_status file is intended to report details about the most recent resctrl filesystem operation, specifically to aid in diagnosing failures. However, when parsing io_alloc_cbm, if a user provides a domain ID that does not exist in the resource, the operation fails with -EINVAL without updating last_cmd_status. This results in inconsistent behaviour where the system call returns an error, but last_cmd_status misleadingly reports "ok", leaving the user unaware that the failure was caused by an invalid domain ID. Write an error message to last_cmd_status when the target domain ID cannot be found. Fixes: `28fa2cce7a` ("fs/resctrl: Introduce interface to modify io_alloc capacity bitmasks") Suggested-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Aaron Tomlin <atomlin@atomlin.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Babu Moger <babu.moger@amd.com> Tested-by: Babu Moger <babu.moger@amd.com> Link: https://patch.msgid.link/20260325001159.447075-2-atomlin@atomlin.com	2026-04-02 00:11:35 +02:00
Linus Torvalds	bf4afc53b7	Convert 'alloc_obj' family to use the new default GFP_KERNEL argument This was done entirely with mindless brute force, using git grep -l '\<k[vmz]alloc_objs(., GFP_KERNEL)' \| xargs sed -i 's/\(alloc_objs(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 17:09:51 -08:00
Kees Cook	69050f8d6d	treewide: Replace kmalloc with kmalloc_obj for non-scalar types This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(PTR, FAM, COUNT, ...) (where TYPE may also be VAR) The resulting allocations no longer return "void ", instead returning "TYPE ". Signed-off-by: Kees Cook <kees@kernel.org>	2026-02-21 01:02:28 -08:00
Linus Torvalds	eeccf287a2	mm.git review status for linus..mm-stable Total patches: 36 Reviews/patch: 1.77 Reviewed rate: 83% - The 2 patch series "mm/vmscan: fix demotion targets checks in reclaim/demotion" from Bing Jiao fixes a couple of issues in the demotion code - pages were failed demotion and were finding themselves demoted into disallowed nodes. - The 11 patch series "Remove XA_ZERO from error recovery of dup_mmap()" from Liam Howlett fixes a rare mapledtree race and performs a number of cleanups. - The 13 patch series "mm: add bitmap VMA flag helpers and convert all mmap_prepare to use them" from Lorenzo Stoakes implements a lot of cleanups following on from the conversion of the VMA flags into a bitmap. - The 5 patch series "support batch checking of references and unmapping for large folios" from Baolin Wang implements batching to greatly improve the performance of reclaiming clean file-backed large folios. - The 3 patch series "selftests/mm: add memory failure selftests" from Miaohe Lin does as claimed. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaZaIEQAKCRDdBJ7gKXxA jj73AQCQDwLoipDiQRGyjB5BDYydymWuDoiB1tlDPHfYAP3b/QD/UQtVlOEXqwM3 naOKs3NQ1pwnfhDaQMirGw2eAnJ1SQY= =6Iif -----END PGP SIGNATURE----- Merge tag 'mm-stable-2026-02-18-19-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: - "mm/vmscan: fix demotion targets checks in reclaim/demotion" fixes a couple of issues in the demotion code - pages were failed demotion and were finding themselves demoted into disallowed nodes (Bing Jiao) - "Remove XA_ZERO from error recovery of dup_mmap()" fixes a rare mapledtree race and performs a number of cleanups (Liam Howlett) - "mm: add bitmap VMA flag helpers and convert all mmap_prepare to use them" implements a lot of cleanups following on from the conversion of the VMA flags into a bitmap (Lorenzo Stoakes) - "support batch checking of references and unmapping for large folios" implements batching to greatly improve the performance of reclaiming clean file-backed large folios (Baolin Wang) - "selftests/mm: add memory failure selftests" does as claimed (Miaohe Lin) * tag 'mm-stable-2026-02-18-19-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (36 commits) mm/page_alloc: clear page->private in free_pages_prepare() selftests/mm: add memory failure dirty pagecache test selftests/mm: add memory failure clean pagecache test selftests/mm: add memory failure anonymous page test mm: rmap: support batched unmapping for file large folios arm64: mm: implement the architecture-specific clear_flush_young_ptes() arm64: mm: support batch clearing of the young flag for large folios arm64: mm: factor out the address and ptep alignment into a new helper mm: rmap: support batched checks of the references for large folios tools/testing/vma: add VMA userland tests for VMA flag functions tools/testing/vma: separate out vma_internal.h into logical headers tools/testing/vma: separate VMA userland tests into separate files mm: make vm_area_desc utilise vma_flags_t only mm: update all remaining mmap_prepare users to use vma_flags_t mm: update shmem_[kernel]_file_*() functions to use vma_flags_t mm: update secretmem to use VMA flags on mmap_prepare mm: update hugetlbfs to use VMA flags on mmap_prepare mm: add basic VMA flag operation helper functions tools: bitmap: add missing bitmap_[subset(), andnot()] mm: add mk_vma_flags() bitmap flag macro helper ...	2026-02-18 20:50:32 -08:00
Lorenzo Stoakes	5bd2c0650a	mm: update all remaining mmap_prepare users to use vma_flags_t We will be shortly removing the vm_flags_t field from vm_area_desc so we need to update all mmap_prepare users to only use the dessc->vma_flags field. This patch achieves that and makes all ancillary changes required to make this possible. This lays the groundwork for future work to eliminate the use of vm_flags_t in vm_area_desc altogether and more broadly throughout the kernel. While we're here, we take the opportunity to replace VM_REMAP_FLAGS with VMA_REMAP_FLAGS, the vma_flags_t equivalent. No functional changes intended. Link: https://lkml.kernel.org/r/fb1f55323799f09fe6a36865b31550c9ec67c225.1769097829.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Damien Le Moal <dlemoal@kernel.org> [zonefs] Acked-by: "Darrick J. Wong" <djwong@kernel.org> Acked-by: Pedro Falcato <pfalcato@suse.de> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Yury Norov <ynorov@nvidia.com> Cc: Chris Mason <clm@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-02-12 15:42:58 -08:00
Tony Luck	d0891647fb	fs/resctrl: Move RMID initialization to first mount L3 monitor features are enumerated during resctrl initialization and rmid_ptrs[] that tracks all RMIDs and depends on the number of supported RMIDs is allocated during this time. Telemetry monitor features are enumerated during first resctrl mount and may support a different number of RMIDs compared to L3 monitor features. Delay allocation and initialization of rmid_ptrs[] until first mount. Since the number of RMIDs cannot change on later mounts, keep the same set of rmid_ptrs[] until resctrl_exit(). This is required because the limbo handler keeps running after resctrl is unmounted and needs to access rmid_ptrs[] as it keeps tracking busy RMIDs after unmount. Rename routines to match what they now do: dom_data_init() -> setup_rmid_lru_list() dom_data_exit() -> free_rmid_lru_list() Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com	2026-01-10 11:46:48 +01:00
Tony Luck	0ecc988b02	x86,fs/resctrl: Compute number of RMIDs as minimum across resources resctrl assumes that only the L3 resource supports monitor events, so it simply takes the rdt_resource::num_rmid from RDT_RESOURCE_L3 as the system's number of RMIDs. The addition of telemetry events in a different resource breaks that assumption. Compute the number of available RMIDs as the minimum value across all mon_capable resources (analogous to how the number of CLOSIDs is computed across alloc_capable resources). Note that mount time enumeration of the telemetry resource means that this number can be reduced. If this happens, then some memory will be wasted as the allocations for rdt_l3_mon_domain::mbm_states[] and rdt_l3_mon_domain::rmid_busy_llc created during resctrl initialization will be larger than needed. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com	2026-01-10 11:43:58 +01:00
Tony Luck	ee7f6af79f	fs/resctrl: Move allocation/free of closid_num_dirty_rmid[] closid_num_dirty_rmid[] and rmid_ptrs[] are allocated together during resctrl initialization and freed together during resctrl exit. Telemetry events are enumerated on resctrl mount so only at resctrl mount will the number of RMID supported by all monitoring resources and needed as size for rmid_ptrs[] be known. Separate closid_num_dirty_rmid[] and rmid_ptrs[] allocation and free in preparation for rmid_ptrs[] to be allocated on resctrl mount. Keep the rdtgroup_mutex protection around the allocation and free of closid_num_dirty_rmid[] as ARM needs this to guarantee memory ordering. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com	2026-01-10 11:33:14 +01:00
Tony Luck	67640e333b	x86/resctrl: Handle number of RMIDs supported by RDT_RESOURCE_PERF_PKG There are now three meanings for "number of RMIDs": 1) The number for legacy features enumerated by CPUID leaf 0xF. This is the maximum number of distinct values that can be loaded into MSR_IA32_PQR_ASSOC. Note that systems with Sub-NUMA Cluster mode enabled will force scaling down the CPUID enumerated value by the number of SNC nodes per L3-cache. 2) The number of registers in MMIO space for each event. This is enumerated in the XML files and is the value initialized into event_group::num_rmid. 3) The number of "hardware counters" (this isn't a strictly accurate description of how things work, but serves as a useful analogy that does describe the limitations) feeding to those MMIO registers. This is enumerated in telemetry_region::num_rmids returned by intel_pmt_get_regions_by_feature(). Event groups with insufficient "hardware counters" to track all RMIDs are difficult for users to use, since the system may reassign "hardware counters" at any time. This means that users cannot reliably collect two consecutive event counts to compute the rate at which events are occurring. Disable such event groups by default. The user may override this with a command line "rdt=" option. In this case limit an under-resourced event group's number of possible monitor resource groups to the lowest number of "hardware counters". Scan all enabled event groups and assign the RDT_RESOURCE_PERF_PKG resource "num_rmid" value to the smallest of these values as this value will be used later to compare against the number of RMIDs supported by other resources to determine how many monitoring resource groups are supported. N.B. Change type of resctrl_mon::num_rmid to u32 to match its usage and the type of event_group::num_rmid so that min(r->num_rmid, e->num_rmid) won't complain about mixing signed and unsigned types. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com	2026-01-10 11:20:14 +01:00
Tony Luck	f4e0cd80d3	x86,fs/resctrl: Handle domain creation/deletion for RDT_RESOURCE_PERF_PKG The L3 resource has several requirements for domains. There are per-domain structures that hold the 64-bit values of counters, and elements to keep track of the overflow and limbo threads. None of these are needed for the PERF_PKG resource. The hardware counters are wide enough that they do not wrap around for decades. Define a new rdt_perf_pkg_mon_domain structure which just consists of the standard rdt_domain_hdr to keep track of domain id and CPU mask. Update resctrl_online_mon_domain() for RDT_RESOURCE_PERF_PKG. The only action needed for this resource is to create and populate domain directories if a domain is added while resctrl is mounted. Similarly resctrl_offline_mon_domain() only needs to remove domain directories. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com	2026-01-09 23:36:41 +01:00
Tony Luck	93d9fd8999	fs/resctrl: Refactor rmdir_mondata_subdir_allrdtgrp() Clearing a monitor group's mon_data directory is complicated because of the support for Sub-NUMA Cluster (SNC) mode. Refactor the SNC case into a helper function to make it easier to add support for a new telemetry resource. Suggested-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com	2026-01-09 23:02:58 +01:00
Tony Luck	0ec1db4cac	fs/resctrl: Refactor mkdir_mondata_subdir() Population of a monitor group's mon_data directory is unreasonably complicated because of the support for Sub-NUMA Cluster (SNC) mode. Split out the SNC code into a helper function to make it easier to add support for a new telemetry resource. Move all the duplicated code to make and set owner of domain directories into the mon_add_all_files() helper and rename to _mkdir_mondata_subdir(). Suggested-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com	2026-01-09 23:02:57 +01:00
Tony Luck	51541f6ca7	x86/resctrl: Read telemetry events Introduce intel_aet_read_event() to read telemetry events for resource RDT_RESOURCE_PERF_PKG. There may be multiple aggregators tracking each package, so scan all of them and add up all counters. Aggregators may return an invalid data indication if they have received no records for a given RMID. The user will see "Unavailable" if none of the aggregators on a package provide valid counts. Resctrl now uses readq() so depends on X86_64. Update Kconfig. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com	2026-01-09 23:02:57 +01:00
Tony Luck	7e6df96145	x86/resctrl: Find and enable usable telemetry events Every event group has a private copy of the data of all telemetry event aggregators (aka "telemetry regions") tracking its feature type. Included may be regions that have the same feature type but tracking different GUID from the event group's. Traverse the event group's telemetry region data and mark all regions that are not usable by the event group as unusable by clearing those regions' MMIO addresses. A region is considered unusable if: 1) GUID does not match the GUID of the event group. 2) Package ID is invalid. 3) The enumerated size of the MMIO region does not match the expected value from the XML description file. Hereafter any telemetry region with an MMIO address is considered valid for the event group it is associated with. Enable all the event group's events as long as there is at least one usable region from where data for its events can be read. Enabling of an event can fail if the same event has already been enabled as part of another event group. It should never happen that the same event is described by different GUID supported by the same system so just WARN (via resctrl_enable_mon_event()) and skip the event. Note that it is architecturally possible that some telemetry events are only supported by a subset of the packages in the system. It is not expected that systems will ever do this. If they do the user will see event files in resctrl that always return "Unavailable". Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com	2026-01-09 23:02:45 +01:00
Tony Luck	8ccb1f8fa6	x86,fs/resctrl: Add architectural event pointer The resctrl file system layer passes the domain, RMID, and event id to the architecture to fetch an event counter. Fetching a telemetry event counter requires additional information that is private to the architecture, for example, the offset into MMIO space from where the counter should be read. Add mon_evt::arch_priv that architecture can use for any private data related to the event. The resctrl filesystem initializes mon_evt::arch_priv when the architecture enables the event and passes it back to architecture when needing to fetch an event counter. Suggested-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com	2026-01-09 16:37:08 +01:00
Tony Luck	8f6b6ad69b	x86,fs/resctrl: Fill in details of events for performance and energy GUIDs The telemetry event aggregators of the Intel Clearwater Forest CPU support two RMID-based feature types: "energy" with GUID 0x26696143¹, and "perf" with GUID 0x26557651². The event counter offsets in an aggregator's MMIO space are arranged in groups for each RMID. E.g., the "energy" counters for GUID 0x26696143 are arranged like this: MMIO offset:0x0000 Counter for RMID 0 PMT_EVENT_ENERGY MMIO offset:0x0008 Counter for RMID 0 PMT_EVENT_ACTIVITY MMIO offset:0x0010 Counter for RMID 1 PMT_EVENT_ENERGY MMIO offset:0x0018 Counter for RMID 1 PMT_EVENT_ACTIVITY ... MMIO offset:0x23F0 Counter for RMID 575 PMT_EVENT_ENERGY MMIO offset:0x23F8 Counter for RMID 575 PMT_EVENT_ACTIVITY After all counters there are three status registers that provide indications of how many times an aggregator was unable to process event counts, the time stamp for the most recent loss of data, and the time stamp of the most recent successful update. MMIO offset:0x2400 AGG_DATA_LOSS_COUNT MMIO offset:0x2408 AGG_DATA_LOSS_TIMESTAMP MMIO offset:0x2410 LAST_UPDATE_TIMESTAMP Define event_group structures for both of these aggregator types and define the events tracked by the aggregators in the file system code. PMT_EVENT_ENERGY and PMT_EVENT_ACTIVITY are produced in fixed point format. File system code must output as floating point values. ¹https://github.com/intel/Intel-PMT/blob/main/xml/CWF/OOBMSM/RMID-ENERGY/cwf_aggregator.xml ²https://github.com/intel/Intel-PMT/blob/main/xml/CWF/OOBMSM/RMID-PERF/cwf_aggregator.xml [ bp: Massage commit message. ] Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com	2026-01-09 16:37:07 +01:00
Tony Luck	db64994d11	fs/resctrl: Emphasize that L3 monitoring resource is required for summing domains The feature to sum event data across multiple domains supports systems with Sub-NUMA Cluster (SNC) mode enabled. The top-level monitoring files in each "mon_L3_XX" directory provide the sum of data across all SNC nodes sharing an L3 cache instance while the "mon_sub_L3_YY" sub-directories provide the event data of the individual nodes. SNC is only associated with the L3 resource and domains and as a result the flow handling the sum of event data implicitly assumes it is working with the L3 resource and domains. Reading of telemetry events does not require to sum event data so this feature can remain dedicated to SNC and keep the implicit assumption of working with the L3 resource and domains. Add a WARN to where the implicit assumption of working with the L3 resource is made and add comments on how the structure controlling the event sum feature is used. Suggested-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com	2026-01-09 16:37:07 +01:00
Tony Luck	2e53ad6668	x86,fs/resctrl: Add and initialize a resource for package scope monitoring Add a new PERF_PKG resource and introduce package level scope for monitoring telemetry events so that CPU hotplug notifiers can build domains at the package granularity. Use the physical package ID available via topology_physical_package_id() to identify the monitoring domains with package level scope. This enables user space to use: /sys/devices/system/cpu/cpuX/topology/physical_package_id to identify the monitoring domain a CPU is associated with. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com	2026-01-09 16:37:07 +01:00
Tony Luck	39208e73a4	x86,fs/resctrl: Add an architectural hook called for first mount Enumeration of Intel telemetry events is an asynchronous process involving several mutually dependent drivers added as auxiliary devices during the device_initcall() phase of Linux boot. The process finishes after the probe functions of these drivers completes. But this happens after resctrl_arch_late_init() is executed. Tracing the enumeration process shows that it does complete a full seven seconds before the earliest possible mount of the resctrl file system (when included in /etc/fstab for automatic mount by systemd). Add a hook for use by telemetry event enumeration and initialization and run it once at the beginning of resctrl mount without any locks held. The architecture is responsible for any required locking. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20260105191711.GBaVwON5nZn-uO6Sqg@fat_crate.local	2026-01-09 16:36:34 +01:00
Tony Luck	e37c9a3dc9	x86,fs/resctrl: Support binary fixed point event counters resctrl assumes that all monitor events can be displayed as unsigned decimal integers. Hardware architecture counters may provide some telemetry events with greater precision where the event is not a simple count, but is a measurement of some sort (e.g. Joules for energy consumed). Add a new argument to resctrl_enable_mon_event() for architecture code to inform the file system that the value for a counter is a fixed-point value with a specific number of binary places. Only allow architecture to use floating point format on events that the file system has marked with mon_evt::is_floating_point which reflects the contract with user space on how the event values are displayed. Display fixed point values with values rounded to ceil(binary_bits * log10(2)) decimal places. Special case for zero binary bits to print "{value}.0". Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com	2026-01-05 16:10:41 +01:00
Tony Luck	ab0308aee3	x86,fs/resctrl: Handle events that can be read from any CPU resctrl assumes that monitor events can only be read from a CPU in the cpumask_t set of each domain. This is true for x86 events accessed with an MSR interface, but may not be true for other access methods such as MMIO. Introduce and use flag mon_evt::any_cpu, settable by architecture, that indicates there are no restrictions on which CPU can read that event. This flag is not supported by the L3 event reading that requires to be run on a CPU that belongs to the L3 domain of the event being read. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com	2026-01-05 15:38:07 +01:00
Tony Luck	dd110880e8	fs/resctrl: Make event details accessible to functions when reading events Reading monitoring event data from MMIO requires more context than the event id to be able to read the correct memory location. struct mon_evt is the appropriate place for this event specific context. Prepare for addition of extra fields to struct mon_evt by changing the calling conventions to pass a pointer to the mon_evt structure instead of just the event id. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com	2026-01-05 15:25:22 +01:00
Tony Luck	9c214d10c5	x86,fs/resctrl: Rename some L3 specific functions With the arrival of monitor events tied to new domains associated with a different resource it would be clearer if the L3 resource specific functions are more accurately named. Rename three groups of functions: Functions that allocate/free architecture per-RMID MBM state information: arch_domain_mbm_alloc() -> l3_mon_domain_mbm_alloc() mon_domain_free() -> l3_mon_domain_free() Functions that allocate/free filesystem per-RMID MBM state information: domain_setup_mon_state() -> domain_setup_l3_mon_state() domain_destroy_mon_state() -> domain_destroy_l3_mon_state() Initialization/exit: rdt_get_mon_l3_config() -> rdt_get_l3_mon_config() resctrl_mon_resource_init() -> resctrl_l3_mon_resource_init() resctrl_mon_resource_exit() -> resctrl_l3_mon_resource_exit() Ensure kernel-doc descriptions of these functions' return values are present and correctly formatted. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com	2026-01-05 11:21:55 +01:00
Tony Luck	4bc3ef46ff	x86,fs/resctrl: Rename struct rdt_mon_domain and rdt_hw_mon_domain The upcoming telemetry event monitoring is not tied to the L3 resource and will have a new domain structure. Rename the L3 resource specific domain data structures to include "l3_" in their names to avoid confusion between the different resource specific domain structures: rdt_mon_domain -> rdt_l3_mon_domain rdt_hw_mon_domain -> rdt_hw_l3_mon_domain No functional change. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com	2026-01-05 11:17:25 +01:00
Tony Luck	6b10cf7b6e	x86,fs/resctrl: Use struct rdt_domain_hdr when reading counters Convert the whole call sequence from mon_event_read() to resctrl_arch_rmid_read() to pass resource independent struct rdt_domain_hdr instead of an L3 specific domain structure to prepare for monitoring events in other resources. This additional layer of indirection obscures which aspects of event counting depend on a valid domain. Event initialization, support for assignable counters, and normal event counting implicitly depend on a valid domain while summing of domains does not. Split summing domains from the core event counting handling to make their respective dependencies obvious. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com	2026-01-05 11:08:58 +01:00
Tony Luck	ad5c2ff75e	fs/resctrl: Split L3 dependent parts out of __mon_event_count() Carve out the L3 resource specific event reading code into a separate helper to support reading event data from a new monitoring resource. Suggested-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com	2026-01-05 10:18:33 +01:00
Tony Luck	97fec06d35	x86,fs/resctrl: Refactor domain create/remove using struct rdt_domain_hdr Up until now, all monitoring events were associated with the L3 resource and it made sense to use the L3 specific "struct rdt_mon_domain *" argument to functions operating on domains. Telemetry events will be tied to a new resource with its instances represented by a new domain structure that, just like struct rdt_mon_domain, starts with the generic struct rdt_domain_hdr. Prepare to support domains belonging to different resources by changing the calling convention of functions operating on domains. Pass the generic header and use that to find the domain specific structure where needed. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com	2026-01-04 13:48:11 +01:00
Tony Luck	03eb578b37	x86,fs/resctrl: Improve domain type checking Every resctrl resource has a list of domain structures. struct rdt_ctrl_domain and struct rdt_mon_domain both begin with struct rdt_domain_hdr with rdt_domain_hdr::type used in validity checks before accessing the domain of a particular type. Add the resource id to struct rdt_domain_hdr in preparation for a new monitoring domain structure that will be associated with a new monitoring resource. Improve existing domain validity checks with a new helper domain_header_is_valid() that checks both domain type and resource id. domain_header_is_valid() should be used before every call to container_of() that accesses a domain structure. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com	2026-01-04 07:30:10 +01:00
Linus Torvalds	7203ca412f	Significant patch series in this merge are as follows: - The 10 patch series "__vmalloc()/kvmalloc() and no-block support" from Uladzislau Rezki reworks the vmalloc() code to support non-blocking allocations (GFP_ATOIC, GFP_NOWAIT). - The 2 patch series "ksm: fix exec/fork inheritance" from xu xin fixes a rare case where the KSM MMF_VM_MERGE_ANY prctl state is not inherited across fork/exec. - The 4 patch series "mm/zswap: misc cleanup of code and documentations" from SeongJae Park does some light maintenance work on the zswap code. - The 5 patch series "mm/page_owner: add debugfs files 'show_handles' and 'show_stacks_handles'" from Mauricio Faria de Oliveira enhances the /sys/kernel/debug/page_owner debug feature. It adds unique identifiers to differentiate the various stack traces so that userspace monitoring tools can better match stack traces over time. - The 2 patch series "mm/page_alloc: pcp->batch cleanups" from Joshua Hahn makes some minor alterations to the page allocator's per-cpu-pages feature. - The 2 patch series "Improve UFFDIO_MOVE scalability by removing anon_vma lock" from Lokesh Gidra addresses a scalability issue in userfaultfd's UFFDIO_MOVE operation. - The 2 patch series "kasan: cleanups for kasan_enabled() checks" from Sabyrzhan Tasbolatov performs some cleanup in the KASAN code. - The 2 patch series "drivers/base/node: fold node register and unregister functions" from Donet Tom cleans up the NUMA node handling code a little. - The 4 patch series "mm: some optimizations for prot numa" from Kefeng Wang provides some cleanups and small optimizations to the NUMA allocation hinting code. - The 5 patch series "mm/page_alloc: Batch callers of free_pcppages_bulk" from Joshua Hahn addresses long lock hold times at boot on large machines. These were causing (harmless) softlockup warnings. - The 2 patch series "optimize the logic for handling dirty file folios during reclaim" from Baolin Wang removes some now-unnecessary work from page reclaim. - The 10 patch series "mm/damon: allow DAMOS auto-tuned for per-memcg per-node memory usage" from SeongJae Park enhances the DAMOS auto-tuning feature. - The 2 patch series "mm/damon: fixes for address alignment issues in DAMON_LRU_SORT and DAMON_RECLAIM" from Quanmin Yan fixes DAMON_LRU_SORT and DAMON_RECLAIM with certain userspace configuration. - The 15 patch series "expand mmap_prepare functionality, port more users" from Lorenzo Stoakes enhances the new(ish) file_operations.mmap_prepare() method and ports additional callsites from the old ->mmap() over to ->mmap_prepare(). - The 8 patch series "Fix stale IOTLB entries for kernel address space" from Lu Baolu fixes a bug (and possible security issue on non-x86) in the IOMMU code. In some situations the IOMMU could be left hanging onto a stale kernel pagetable entry. - The 4 patch series "mm/huge_memory: cleanup __split_unmapped_folio()" from Wei Yang cleans up and optimizes the folio splitting code. - The 5 patch series "mm, swap: misc cleanup and bugfix" from Kairui Song implements some cleanups and a minor fix in the swap discard code. - The 8 patch series "mm/damon: misc documentation fixups" from SeongJae Park does as advertised. - The 9 patch series "mm/damon: support pin-point targets removal" from SeongJae Park permits userspace to remove a specific monitoring target in the middle of the current targets list. - The 2 patch series "mm: MISC follow-up patches for linux/pgalloc.h" from Harry Yoo implements a couple of cleanups related to mm header file inclusion. - The 2 patch series "mm/swapfile.c: select swap devices of default priority round robin" from Baoquan He improves the selection of swap devices for NUMA machines. - The 3 patch series "mm: Convert memory block states (MEM_) macros to enums" from Israel Batista changes the memory block labels from macros to enums so they will appear in kernel debug info. - The 3 patch series "ksm: perform a range-walk to jump over holes in break_ksm" from Pedro Demarchi Gomes addresses an inefficiency when KSM unmerges an address range. - The 22 patch series "mm/damon/tests: fix memory bugs in kunit tests" from SeongJae Park fixes leaks and unhandled malloc() failures in DAMON userspace unit tests. - The 2 patch series "some cleanups for pageout()" from Baolin Wang cleans up a couple of minor things in the page scanner's writeback-for-eviction code. - The 2 patch series "mm/hugetlb: refactor sysfs/sysctl interfaces" from Hui Zhu moves hugetlb's sysfs/sysctl handling code into a new file. - The 9 patch series "introduce VM_MAYBE_GUARD and make it sticky" from Lorenzo Stoakes makes the VMA guard regions available in /proc/pid/smaps and improves the mergeability of guarded VMAs. - The 2 patch series "mm: perform guard region install/remove under VMA lock" from Lorenzo Stoakes reduces mmap lock contention for callers performing VMA guard region operations. - The 2 patch series "vma_start_write_killable" from Matthew Wilcox starts work in permitting applications to be killed when they are waiting on a read_lock on the VMA lock. - The 11 patch series "mm/damon/tests: add more tests for online parameters commit" from SeongJae Park adds additional userspace testing of DAMON's "commit" feature. - The 9 patch series "mm/damon: misc cleanups" from SeongJae Park does that. - The 2 patch series "make VM_SOFTDIRTY a sticky VMA flag" from Lorenzo Stoakes addresses the possible loss of a VMA's VM_SOFTDIRTY flag when that VMA is merged with another. - The 16 patch series "mm: support device-private THP" from Balbir Singh introduces support for Transparent Huge Page (THP) migration in zone device-private memory. - The 3 patch series "Optimize folio split in memory failure" from Zi Yan optimizes folio split operations in the memory failure code. - The 2 patch series "mm/huge_memory: Define split_type and consolidate split support checks" from Wei Yang provides some more cleanups in the folio splitting code. - The 16 patch series "mm: remove is_swap_[pte, pmd]() + non-swap entries, introduce leaf entries" from Lorenzo Stoakes cleans up our handling of pagetable leaf entries by introducing the concept of 'software leaf entries', of type softleaf_t. - The 4 patch series "reparent the THP split queue" from Muchun Song reparents the THP split queue to its parent memcg. This is in preparation for addressing the long-standing "dying memcg" problem, wherein dead memcg's linger for too long, consuming memory resources. - The 3 patch series "unify PMD scan results and remove redundant cleanup" from Wei Yang does a little cleanup in the hugepage collapse code. - The 6 patch series "zram: introduce writeback bio batching" from Sergey Senozhatsky improves zram writeback efficiency by introducing batched bio writeback support. - The 4 patch series "memcg: cleanup the memcg stats interfaces" from Shakeel Butt cleans up our handling of the interrupt safety of some memcg stats. - The 4 patch series "make vmalloc gfp flags usage more apparent" from Vishal Moola cleans up vmalloc's handling of incoming GFP flags. - The 6 patch series "mm: Add soft-dirty and uffd-wp support for RISC-V" from Chunyan Zhang teches soft dirty and userfaultfd write protect tracking to use RISC-V's Svrsw60t59b extension. - The 5 patch series "mm: swap: small fixes and comment cleanups" from Youngjun Park fixes a small bug and cleans up some of the swap code. - The 4 patch series "initial work on making VMA flags a bitmap" from Lorenzo Stoakes starts work on converting the vma struct's flags to a bitmap, so we stop running out of them, especially on 32-bit. - The 2 patch series "mm/swapfile: fix and cleanup swap list iterations" from Youngjun Park addresses a possible bug in the swap discard code and cleans things up a little. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaTEb0wAKCRDdBJ7gKXxA jjfIAP94W4EkCCwNOupnChoG+YWw/JW21anXt5NN+i5svn1yugEAwzvv6A+cAFng o+ug/fyrfPZG7PLp2R8WFyGIP0YoBA4= =IUzS -----END PGP SIGNATURE----- Merge tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "__vmalloc()/kvmalloc() and no-block support" (Uladzislau Rezki) Rework the vmalloc() code to support non-blocking allocations (GFP_ATOIC, GFP_NOWAIT) "ksm: fix exec/fork inheritance" (xu xin) Fix a rare case where the KSM MMF_VM_MERGE_ANY prctl state is not inherited across fork/exec "mm/zswap: misc cleanup of code and documentations" (SeongJae Park) Some light maintenance work on the zswap code "mm/page_owner: add debugfs files 'show_handles' and 'show_stacks_handles'" (Mauricio Faria de Oliveira) Enhance the /sys/kernel/debug/page_owner debug feature by adding unique identifiers to differentiate the various stack traces so that userspace monitoring tools can better match stack traces over time "mm/page_alloc: pcp->batch cleanups" (Joshua Hahn) Minor alterations to the page allocator's per-cpu-pages feature "Improve UFFDIO_MOVE scalability by removing anon_vma lock" (Lokesh Gidra) Address a scalability issue in userfaultfd's UFFDIO_MOVE operation "kasan: cleanups for kasan_enabled() checks" (Sabyrzhan Tasbolatov) "drivers/base/node: fold node register and unregister functions" (Donet Tom) Clean up the NUMA node handling code a little "mm: some optimizations for prot numa" (Kefeng Wang) Cleanups and small optimizations to the NUMA allocation hinting code "mm/page_alloc: Batch callers of free_pcppages_bulk" (Joshua Hahn) Address long lock hold times at boot on large machines. These were causing (harmless) softlockup warnings "optimize the logic for handling dirty file folios during reclaim" (Baolin Wang) Remove some now-unnecessary work from page reclaim "mm/damon: allow DAMOS auto-tuned for per-memcg per-node memory usage" (SeongJae Park) Enhance the DAMOS auto-tuning feature "mm/damon: fixes for address alignment issues in DAMON_LRU_SORT and DAMON_RECLAIM" (Quanmin Yan) Fix DAMON_LRU_SORT and DAMON_RECLAIM with certain userspace configuration "expand mmap_prepare functionality, port more users" (Lorenzo Stoakes) Enhance the new(ish) file_operations.mmap_prepare() method and port additional callsites from the old ->mmap() over to ->mmap_prepare() "Fix stale IOTLB entries for kernel address space" (Lu Baolu) Fix a bug (and possible security issue on non-x86) in the IOMMU code. In some situations the IOMMU could be left hanging onto a stale kernel pagetable entry "mm/huge_memory: cleanup __split_unmapped_folio()" (Wei Yang) Clean up and optimize the folio splitting code "mm, swap: misc cleanup and bugfix" (Kairui Song) Some cleanups and a minor fix in the swap discard code "mm/damon: misc documentation fixups" (SeongJae Park) "mm/damon: support pin-point targets removal" (SeongJae Park) Permit userspace to remove a specific monitoring target in the middle of the current targets list "mm: MISC follow-up patches for linux/pgalloc.h" (Harry Yoo) A couple of cleanups related to mm header file inclusion "mm/swapfile.c: select swap devices of default priority round robin" (Baoquan He) improve the selection of swap devices for NUMA machines "mm: Convert memory block states (MEM_) macros to enums" (Israel Batista) Change the memory block labels from macros to enums so they will appear in kernel debug info "ksm: perform a range-walk to jump over holes in break_ksm" (Pedro Demarchi Gomes) Address an inefficiency when KSM unmerges an address range "mm/damon/tests: fix memory bugs in kunit tests" (SeongJae Park) Fix leaks and unhandled malloc() failures in DAMON userspace unit tests "some cleanups for pageout()" (Baolin Wang) Clean up a couple of minor things in the page scanner's writeback-for-eviction code "mm/hugetlb: refactor sysfs/sysctl interfaces" (Hui Zhu) Move hugetlb's sysfs/sysctl handling code into a new file "introduce VM_MAYBE_GUARD and make it sticky" (Lorenzo Stoakes) Make the VMA guard regions available in /proc/pid/smaps and improves the mergeability of guarded VMAs "mm: perform guard region install/remove under VMA lock" (Lorenzo Stoakes) Reduce mmap lock contention for callers performing VMA guard region operations "vma_start_write_killable" (Matthew Wilcox) Start work on permitting applications to be killed when they are waiting on a read_lock on the VMA lock "mm/damon/tests: add more tests for online parameters commit" (SeongJae Park) Add additional userspace testing of DAMON's "commit" feature "mm/damon: misc cleanups" (SeongJae Park) "make VM_SOFTDIRTY a sticky VMA flag" (Lorenzo Stoakes) Address the possible loss of a VMA's VM_SOFTDIRTY flag when that VMA is merged with another "mm: support device-private THP" (Balbir Singh) Introduce support for Transparent Huge Page (THP) migration in zone device-private memory "Optimize folio split in memory failure" (Zi Yan) "mm/huge_memory: Define split_type and consolidate split support checks" (Wei Yang) Some more cleanups in the folio splitting code "mm: remove is_swap_[pte, pmd]() + non-swap entries, introduce leaf entries" (Lorenzo Stoakes) Clean up our handling of pagetable leaf entries by introducing the concept of 'software leaf entries', of type softleaf_t "reparent the THP split queue" (Muchun Song) Reparent the THP split queue to its parent memcg. This is in preparation for addressing the long-standing "dying memcg" problem, wherein dead memcg's linger for too long, consuming memory resources "unify PMD scan results and remove redundant cleanup" (Wei Yang) A little cleanup in the hugepage collapse code "zram: introduce writeback bio batching" (Sergey Senozhatsky) Improve zram writeback efficiency by introducing batched bio writeback support "memcg: cleanup the memcg stats interfaces" (Shakeel Butt) Clean up our handling of the interrupt safety of some memcg stats "make vmalloc gfp flags usage more apparent" (Vishal Moola) Clean up vmalloc's handling of incoming GFP flags "mm: Add soft-dirty and uffd-wp support for RISC-V" (Chunyan Zhang) Teach soft dirty and userfaultfd write protect tracking to use RISC-V's Svrsw60t59b extension "mm: swap: small fixes and comment cleanups" (Youngjun Park) Fix a small bug and clean up some of the swap code "initial work on making VMA flags a bitmap" (Lorenzo Stoakes) Start work on converting the vma struct's flags to a bitmap, so we stop running out of them, especially on 32-bit "mm/swapfile: fix and cleanup swap list iterations" (Youngjun Park) Address a possible bug in the swap discard code and clean things up a little [ This merge also reverts commit `ebb9aeb980` ("vfio/nvgrace-gpu: register device memory for poison handling") because it looks broken to me, I've asked for clarification - Linus ] * tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits) mm: fix vma_start_write_killable() signal handling mm/swapfile: use plist_for_each_entry in __folio_throttle_swaprate mm/swapfile: fix list iteration when next node is removed during discard fs/proc/task_mmu.c: fix make_uffd_wp_huge_pte() huge pte handling mm/kfence: add reboot notifier to disable KFENCE on shutdown memcg: remove inc/dec_lruvec_kmem_state helpers selftests/mm/uffd: initialize char variable to Null mm: fix DEBUG_RODATA_TEST indentation in Kconfig mm: introduce VMA flags bitmap type tools/testing/vma: eliminate dependency on vma->__vm_flags mm: simplify and rename mm flags function for clarity mm: declare VMA flags by bit zram: fix a spelling mistake mm/page_alloc: optimize lowmem_reserve max lookup using its semantic monotonicity mm/vmscan: skip increasing kswapd_failures when reclaim was boosted pagemap: update BUDDY flag documentation mm: swap: remove scan_swap_map_slots() references from comments mm: swap: change swap_alloc_slow() to void mm, swap: remove redundant comment for read_swap_cache_async mm, swap: use SWP_SOLIDSTATE to determine if swap is rotational ...	2025-12-05 13:52:43 -08:00
Linus Torvalds	2ae20d6510	- Add support for AMD's Smart Data Cache Injection feature which allows for direct insertion of data from I/O devices into the L3 cache, thus bypassing DRAM and saving its bandwidth; the resctrl side of the feature allows the size of the L3 used for data injection to be controlled - Add Intel Clearwater Forest to the list of CPUs which support Sub-NUMA clustering - Other fixes and cleanups -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmktpFQACgkQEsHwGGHe VUop4g/9GTb/5rcFMQzeGlG3USnJOqJ+SmiAalA9lm1c933en9tqUgL/K0C0xC6h yraB3ICuob1YayiZkBwKIOQiei9gmfhH/CGf5vLcZMM+D6fqvlk1D+C40SuFoDFV DOH3H2nYoJ3vbZRtRZsD3bv/djST/OVk28g7eY8OwpZIwN5VSFULJwjK1ePPy+nL l65s/yrgLY0oLDBCGxtJ9gVxjCBqAoqfbbwVbcJm5hXv+2sYk8BH6de/CU+0v/vo K6Qu4GbmWqDKYH9thjC4ZC/DPXjtoCxGkg/l1Af5T1PiZF0ZtgEZI6i9JTR33jYJ 7j6BpkCwPzY07MKj/Ub1RemlMfY4XMN/qssEfFmnwG+aMBtbojNAjdb00Pu9Ffn+ TKFKiZ6WBTcYhqPQsFVruwHh8wDbJp2/x/yBfjD4qovo1HuyCln4iGDmoFcU2wTD UlOXW89bxOT56A3FL77ElnOg9nRltvdKduOluGtkpSkmBbzmDfoXrhG2z9zuuAui FB6GT2c5MRVXEC4BY30xwQBG5MArVRMyz9uYDyXf9+KHhWVdmq9K0ZAkIaUmPCvy BvBXpRhfxm/dKJPhtSuUPhh5A+a87gqoiu1McaFoVGyjVJIJ5gflge8+/mLj1lQz kG56SnLOzdtcwKcmQ5ncv5EkrTBD1Ph12u1kcd+4IZwkpgGZteE= =o7Dg -----END PGP SIGNATURE----- Merge tag 'x86_cache_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 resource control updates from Borislav Petkov: - Add support for AMD's Smart Data Cache Injection feature which allows for direct insertion of data from I/O devices into the L3 cache, thus bypassing DRAM and saving its bandwidth; the resctrl side of the feature allows the size of the L3 used for data injection to be controlled - Add Intel Clearwater Forest to the list of CPUs which support Sub-NUMA clustering - Other fixes and cleanups * tag 'x86_cache_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: fs/resctrl: Update bit_usage to reflect io_alloc fs/resctrl: Introduce interface to modify io_alloc capacity bitmasks fs/resctrl: Modify struct rdt_parse_data to pass mode and CLOSID fs/resctrl: Introduce interface to display io_alloc CBMs fs/resctrl: Add user interface to enable/disable io_alloc feature fs/resctrl: Introduce interface to display "io_alloc" support x86,fs/resctrl: Implement "io_alloc" enable/disable handlers x86,fs/resctrl: Detect io_alloc feature x86/resctrl: Add SDCIAE feature in the command line options x86/cpufeatures: Add support for L3 Smart Data Cache Injection Allocation Enforcement fs/resctrl: Consider sparse masks when initializing new group's allocation x86/resctrl: Support Sub-NUMA Cluster (SNC) mode on Clearwater Forest	2025-12-02 11:55:58 -08:00
Babu Moger	ac7de456a3	fs/resctrl: Update bit_usage to reflect io_alloc The "shareable_bits" and "bit_usage" resctrl files associated with cache resources give insight into how instances of a cache is used. Update the annotated capacity bitmasks displayed by "bit_usage" to include the cache portions allocated for I/O via the "io_alloc" feature. "shareable_bits" is a global bitmask of shareable cache with I/O and can thus not present the per-domain I/O allocations possible with the "io_alloc" feature. Revise the "shareable_bits" documentation to direct users to "bit_usage" for accurate cache usage information. Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://patch.msgid.link/e02a0d424129fd7f3e45822a559b1c614ae4652a.1762995456.git.babu.moger@amd.com	2025-11-22 14:30:34 +01:00
Babu Moger	28fa2cce7a	fs/resctrl: Introduce interface to modify io_alloc capacity bitmasks The io_alloc feature in resctrl enables system software to configure the portion of the cache allocated for I/O traffic. When supported, the io_alloc_cbm file in resctrl provides access to capacity bitmasks (CBMs) allocated for I/O devices. Enable users to modify io_alloc CBMs by writing to the io_alloc_cbm resctrl file when the io_alloc feature is enabled. Mirror the CBMs between CDP_CODE and CDP_DATA when CDP is enabled to present consistent I/O allocation information to user space. Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://patch.msgid.link/67609641b03ccfba18a8ee0bf9dbd1f3dcbecda3.1762995456.git.babu.moger@amd.com	2025-11-22 14:28:31 +01:00
Babu Moger	af1242eeca	fs/resctrl: Modify struct rdt_parse_data to pass mode and CLOSID parse_cbm() requires resource group mode and CLOSID to validate the capacity bitmask (CBM). It is passed via struct rdtgroup in struct rdt_parse_data. The io_alloc feature also uses CBMs to indicate which portions of cache are allocated for I/O traffic. The CBMs are provided by user space and need to be validated the same as CBMs provided for general (CPU) cache allocation. parse_cbm() cannot be used as-is since io_alloc does not have rdtgroup context. Pass the resource group mode and CLOSID directly to parse_cbm() via struct rdt_parse_data, instead of through the rdtgroup struct, to facilitate calling parse_cbm() to verify the CBM of the io_alloc feature. Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://patch.msgid.link/f8ec6ab5cf594d906a3fe75f56793d5fbd63f38f.1762995456.git.babu.moger@amd.com	2025-11-22 13:10:12 +01:00
Babu Moger	77b6623262	fs/resctrl: Introduce interface to display io_alloc CBMs Introduce the "io_alloc_cbm" resctrl file to display the capacity bitmasks (CBMs) that represent the portions of each cache instance allocated for I/O traffic on a cache resource that supports the "io_alloc" feature. io_alloc_cbm resides in the info directory of a cache resource, for example, /sys/fs/resctrl/info/L3/. Since the resource name is part of the path, it is not necessary to display the resource name as done in the schemata file. When CDP is enabled, io_alloc routes traffic using the highest CLOSID associated with the CDP_CODE resource and that CLOSID becomes unusable for the CDP_DATA resource. The highest CLOSID of CDP_CODE and CDP_DATA resources will be kept in sync to ensure consistent user interface. In preparation for this, access the CBMs for I/O traffic through highest CLOSID of either CDP_CODE or CDP_DATA resource. Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://patch.msgid.link/55a3ff66a70e7ce8239f022e62b334e9d64af604.1762995456.git.babu.moger@amd.com	2025-11-22 11:37:21 +01:00
Babu Moger	9445c7059c	fs/resctrl: Add user interface to enable/disable io_alloc feature AMD's SDCIAE forces all SDCI lines to be placed into the L3 cache portions identified by the highest-supported L3_MASK_n register, where n is the maximum supported CLOSID. To support this, when io_alloc resctrl feature is enabled, reserve the highest CLOSID exclusively for I/O allocation traffic making it no longer available for general CPU cache allocation. Introduce user interface to enable/disable io_alloc feature and encourage users to enable io_alloc only when running workloads that can benefit from this functionality. On enable, initialize the io_alloc CLOSID with all usable CBMs across all the domains. Since CLOSIDs are managed by resctrl fs, it is least invasive to make "io_alloc is supported by maximum supported CLOSID" part of the initial resctrl fs support for io_alloc. Take care to minimally (only in error messages) expose this use of CLOSID for io_alloc to user space so that this is not required from other architectures that may support io_alloc differently in the future. When resctrl is mounted with "-o cdp" to enable code/data prioritization, there are two L3 resources that can support I/O allocation: L3CODE and L3DATA. From resctrl fs perspective the two resources share a CLOSID and the architecture's available CLOSID are halved to support this. The architecture's underlying CLOSID used by SDCIAE when CDP is enabled is the CLOSID associated with the CDP_CODE resource, but from resctrl's perspective there is only one CLOSID for both CDP_CODE and CDP_DATA. CDP_DATA is thus not usable for general (CPU) cache allocation nor I/O allocation. Keep the CDP_CODE and CDP_DATA I/O alloc status in sync to avoid any confusion to user space. That is, enabling io_alloc on CDP_CODE does so on CDP_DATA and vice-versa, and keep the I/O allocation CBMs of CDP_CODE and CDP_DATA in sync. Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://patch.msgid.link/c7d3037795e653e22b02d8fc73ca80d9b075031c.1762995456.git.babu.moger@amd.com	2025-11-21 23:01:54 +01:00
Babu Moger	48068e5650	fs/resctrl: Introduce interface to display "io_alloc" support Introduce the "io_alloc" resctrl file to the "info" area of a cache resource, for example /sys/fs/resctrl/info/L3/io_alloc. "io_alloc" indicates support for the "io_alloc" feature that allows direct insertion of data from I/O devices into the cache. Restrict exposing support for "io_alloc" to the L3 resource that is the only resource where this feature can be backed by AMD's L3 Smart Data Cache Injection Allocation Enforcement (SDCIAE). With that, the "io_alloc" file is only visible to user space if the L3 resource supports "io_alloc". Doing so makes the file visible for all cache resources though, for example also L2 cache (if it supports cache allocation). As a consequence, add capability for file to report expected "enabled" and "disabled", as well as "not supported". Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://patch.msgid.link/e8b116a8f424128b227734bb1d433c14af478d90.1762995456.git.babu.moger@amd.com	2025-11-21 22:49:42 +01:00

1 2

83 Commits