Commit Graph

384 Commits

Author SHA1 Message Date
Linus Torvalds
40735a683b mm.git review status for linus..mm-stable
Everything:
 
 Total patches:       121
 Reviews/patch:       2.11
 Reviewed rate:       90%
 
 Excluding DAMON:
 
 Total patches:       113
 Reviews/patch:       2.25
 Reviewed rate:       96%
 
 - The 33 patch series "Eliminate Dying Memory Cgroup" from Qi Zheng and
   Muchun Song addresses the longstanding "dying memcg problem".  A
   situation wherein a no-longer-used memory control group will hang around
   for an extended period pointlessly consuming memory.  The [0/N]
   changelog has a good overview of this work.
 
 - The 3 patch series "fix unexpected type conversions and potential
   overflows" from Qi Zheng fixes a couple of potential 32-bit/64-bit
   issues which were identified during review of the "Eliminate Dying
   Memory Cgroup" series.
 
 - The 6 patch series "kho: history: track previous kernel version and
   kexec boot count" from Breno Leitao uses Kexec Handover (KHO) to pass
   the previous kernel's version string and the number of kexec reboots
   since the last cold boot to the next kernel, and prints it at boot time.
 
 - The 4 patch series "liveupdate: prevent double preservation" from
   Pasha Tatashin teaches LUO to avoid managing the same file across
   different active sessions.
 
 - The 10 patch series "liveupdate: Fix module unloading and unregister
   API" from Pasha Tatashin addresses an issue with how LUO handles module
   reference counting and unregistration during module unloading.
 
 - The 2 patch series "zswap pool per-CPU acomp_ctx simplifications" from
   Kanchana Sridhar simplifies and cleans up the zswap crypto compression
   handling and improves the lifecycle management of zswap pool's per-CPU
   acomp_ctx resources.
 
 - The 2 patch series "mm/damon/core: fix damon_call()/damos_walk() vs
   kdmond exit race" from SeongJae Park addresses unlikely but possible
   leaks and deadlocks in damon_call() and damon_walk().
 
 - The 2 patch series "mm/damon/core: validate damos_quota_goal->nid"
   from SeongJae Park fixes a couple of root-only wild pointer
   dereferences.
 
 - The 2 patch series "Docs/admin-guide/mm/damon: warn commit_inputs vs
   other params race" from SeongJae Park updates the DAMON documentation to
   warn operators about potential races which can occur if the
   commit_inputs parameter is altered at the wrong time.
 
 - The 3 patch series "Minor hmm_test fixes and cleanups" from Alistair
   Popple implements two bugfixes a cleanup for the HMM kernel selftests.
 
 - The 6 patch series "Modify memfd_luo code" from Chenghao Duan provides
   cleanups, simplifications and speedups in the memfd_lou code.
 
 - The 4 patch series "mm, kvm: allow uffd support in guest_memfd" from
   Mike Rapoport enables support for userfaultfd in guest_memfd.
 
 - The 6 patch series "selftests/mm: skip several tests when thp is not
   available" from Chunyu Hu fixes several issues in the selftests code
   which were causing breakage when the tests were run on CONFIG_THP=n
   kernels.
 
 - The 2 patch series "mm/mprotect: micro-optimization work" from Pedro
   Falcato implements a couple of nice speedups for mprotect().
 
 - The 3 patch series "MAINTAINERS: update KHO and LIVE UPDATE entries"
   from Pratyush Yadav reflects upcoming changes in the maintenance of KHO,
   LUO, memfd_luo, kexec, crash, kdump and probably other kexec-based
   things - they are being moved out of mm.git and into a new git tree.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaeNL/wAKCRDdBJ7gKXxA
 jt7EAQCEEQvYYTjld+8HJKsCbavY4pEfci7z4SBiQyIPjRracQD/ZfjXnzL7ucc1
 b6q6G4TcslvIDBgzVkk9G2BVn2oCoAg=
 =3ozv
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2026-04-18-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull more MM updates from Andrew Morton:

 - "Eliminate Dying Memory Cgroup" (Qi Zheng and Muchun Song)

   Address the longstanding "dying memcg problem". A situation wherein a
   no-longer-used memory control group will hang around for an extended
   period pointlessly consuming memory

 - "fix unexpected type conversions and potential overflows" (Qi Zheng)

   Fix a couple of potential 32-bit/64-bit issues which were identified
   during review of the "Eliminate Dying Memory Cgroup" series

 - "kho: history: track previous kernel version and kexec boot count"
   (Breno Leitao)

   Use Kexec Handover (KHO) to pass the previous kernel's version string
   and the number of kexec reboots since the last cold boot to the next
   kernel, and print it at boot time

 - "liveupdate: prevent double preservation" (Pasha Tatashin)

   Teach LUO to avoid managing the same file across different active
   sessions

 - "liveupdate: Fix module unloading and unregister API" (Pasha
   Tatashin)

   Address an issue with how LUO handles module reference counting and
   unregistration during module unloading

 - "zswap pool per-CPU acomp_ctx simplifications" (Kanchana Sridhar)

   Simplify and clean up the zswap crypto compression handling and
   improve the lifecycle management of zswap pool's per-CPU acomp_ctx
   resources

 - "mm/damon/core: fix damon_call()/damos_walk() vs kdmond exit race"
   (SeongJae Park)

   Address unlikely but possible leaks and deadlocks in damon_call() and
   damon_walk()

 - "mm/damon/core: validate damos_quota_goal->nid" (SeongJae Park)

   Fix a couple of root-only wild pointer dereferences

 - "Docs/admin-guide/mm/damon: warn commit_inputs vs other params race"
   (SeongJae Park)

   Update the DAMON documentation to warn operators about potential
   races which can occur if the commit_inputs parameter is altered at
   the wrong time

 - "Minor hmm_test fixes and cleanups" (Alistair Popple)

   Bugfixes and a cleanup for the HMM kernel selftests

 - "Modify memfd_luo code" (Chenghao Duan)

   Cleanups, simplifications and speedups to the memfd_lou code

 - "mm, kvm: allow uffd support in guest_memfd" (Mike Rapoport)

   Support for userfaultfd in guest_memfd

 - "selftests/mm: skip several tests when thp is not available" (Chunyu
   Hu)

   Fix several issues in the selftests code which were causing breakage
   when the tests were run on CONFIG_THP=n kernels

 - "mm/mprotect: micro-optimization work" (Pedro Falcato)

   A couple of nice speedups for mprotect()

 - "MAINTAINERS: update KHO and LIVE UPDATE entries" (Pratyush Yadav)

   Document upcoming changes in the maintenance of KHO, LUO, memfd_luo,
   kexec, crash, kdump and probably other kexec-based things - they are
   being moved out of mm.git and into a new git tree

* tag 'mm-stable-2026-04-18-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (121 commits)
  MAINTAINERS: add page cache reviewer
  mm/vmscan: avoid false-positive -Wuninitialized warning
  MAINTAINERS: update Dave's kdump reviewer email address
  MAINTAINERS: drop include/linux/liveupdate from LIVE UPDATE
  MAINTAINERS: drop include/linux/kho/abi/ from KHO
  MAINTAINERS: update KHO and LIVE UPDATE maintainers
  MAINTAINERS: update kexec/kdump maintainers entries
  mm/migrate_device: remove dead migration entry check in migrate_vma_collect_huge_pmd()
  selftests: mm: skip charge_reserved_hugetlb without killall
  userfaultfd: allow registration of ranges below mmap_min_addr
  mm/vmstat: fix vmstat_shepherd double-scheduling vmstat_update
  mm/hugetlb: fix early boot crash on parameters without '=' separator
  zram: reject unrecognized type= values in recompress_store()
  docs: proc: document ProtectionKey in smaps
  mm/mprotect: special-case small folios when applying permissions
  mm/mprotect: move softleaf code out of the main function
  mm: remove '!root_reclaim' checking in should_abort_scan()
  mm/sparse: fix comment for section map alignment
  mm/page_io: use sio->len for PSWPIN accounting in sio_read_complete()
  selftests/mm: transhuge_stress: skip the test when thp not available
  ...
2026-04-19 08:01:17 -07:00
Linus Torvalds
9055c64567 memblock: updates for 7.0-rc1
* improve debugability of reserve_mem kernel parameter handling with print
   outs in case of a failure and debugfs info showing what was actually
   reserved
 * Make memblock_free_late() and free_reserved_area() use the same core
   logic for freeing the memory to buddy and ensure it takes care of
   updating memblock arrays when ARCH_KEEP_MEMBLOCK is enabled.
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEeOVYVaWZL5900a/pOQOGJssO/ZEFAmnjRmsQHHJwcHRAa2Vy
 bmVsLm9yZwAKCRA5A4Ymyw79kYh0CAC4NpZGFqpEBep1eQcfqsPH05dvp1LUXDNk
 i5GwS2ht/F5D9GcD+EyoYRQjRM8k+XZyOe3sqEF01Uav/rHAv3XrITg/pfiA92AR
 K7CvQv4NvyQqUNcv/mEb+P8niriJ4oHRXCag9inop1jo/x3Mym07oEy73rknAx9r
 ZQKwoFNOM/QQGVb9hZUANKCkE8cAsUXG89yEOH0n17FOahC0PZbK/vxjeO+br3IL
 HxEoC5l1j4cUauf8XEhsVXXdch0iqit/fB3ROePYFNCx7koVYHk6Yl1w++AM0RUA
 ypOmfPsSiqLY2ciuTIAnpTeMfQkkhEmMI3mp6T5BUBwSKJxLRaSM
 =c1xd
 -----END PGP SIGNATURE-----

Merge tag 'memblock-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock

Pull memblock updates from Mike Rapoport:

 - improve debuggability of reserve_mem kernel parameter handling with
   print outs in case of a failure and debugfs info showing what was
   actually reserved

 - Make memblock_free_late() and free_reserved_area() use the same core
   logic for freeing the memory to buddy and ensure it takes care of
   updating memblock arrays when ARCH_KEEP_MEMBLOCK is enabled.

* tag 'memblock-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
  x86/alternative: delay freeing of smp_locks section
  memblock: warn when freeing reserved memory before memory map is initialized
  memblock, treewide: make memblock_free() handle late freeing
  memblock: make free_reserved_area() update memblock if ARCH_KEEP_MEMBLOCK=y
  memblock: extract page freeing from free_reserved_area() into a helper
  memblock: make free_reserved_area() more robust
  mm: move free_reserved_area() to mm/memblock.c
  powerpc: opal-core: pair alloc_pages_exact() with free_pages_exact()
  powerpc: fadump: pair alloc_pages_exact() with free_pages_exact()
  memblock: reserve_mem: fix end caclulation in reserve_mem_release_by_name()
  memblock: move reserve_bootmem_range() to memblock.c and make it static
  memblock: Add reserve_mem debugfs info
  memblock: Print out errors on reserve_mem parser
2026-04-18 11:29:14 -07:00
Breno Leitao
85e4139282 kho: persist blob size in KHO FDT
kho_add_subtree() accepts a size parameter but only forwards it to
debugfs.  The size is not persisted in the KHO FDT, so it is lost across
kexec.  This makes it impossible for the incoming kernel to determine the
blob size without understanding the blob format.

Store the blob size as a "blob-size" property in the KHO FDT alongside the
"preserved-data" physical address.  This allows the receiving kernel to
recover the size for any blob regardless of format.

Also extend kho_retrieve_subtree() with an optional size output parameter
so callers can learn the blob size without needing to understand the blob
format.  Update all callers to pass NULL for the new parameter.

Link: https://lore.kernel.org/20260316-kho-v9-3-ed6dcd951988@debian.org
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18 00:10:48 -07:00
Breno Leitao
d9e4142e76 kho: add size parameter to kho_add_subtree()
Patch series "kho: history: track previous kernel version and kexec boot
count", v9.

Use Kexec Handover (KHO) to pass the previous kernel's version string and
the number of kexec reboots since the last cold boot to the next kernel,
and print it at boot time.

Example
=======
	[    0.000000] Linux version 6.19.0-rc3-upstream-00047-ge5d992347849
	...
	[    0.000000] KHO: exec from: 6.19.0-rc4-next-20260107upstream-00004-g3071b0dc4498 (count 1)

Motivation
==========

Bugs that only reproduce when kexecing from specific kernel versions are
difficult to diagnose.  These issues occur when a buggy kernel kexecs into
a new kernel, with the bug manifesting only in the second kernel.

Recent examples include:

 * eb22663125 ("x86/boot: Fix page table access in 5-level to 4-level paging transition")
 * 77d48d39e9 ("efistub/tpm: Use ACPI reclaim memory for event log to avoid corruption")
 * 64b45dd46e ("x86/efi: skip memattr table on kexec boot")

As kexec-based reboots become more common, these version-dependent bugs
are appearing more frequently.  At scale, correlating crashes to the
previous kernel version is challenging, especially when issues only occur
in specific transition scenarios.

Some bugs manifest only after multiple consecutive kexec reboots. 
Tracking the kexec count helps identify these cases (this metric is
already used by live update sub-system).

KHO provides a reliable mechanism to pass information between kernels.  By
carrying the previous kernel's release string and kexec count forward, we
can print this context at boot time to aid debugging.

The goal of this feature is to have this information being printed in
early boot, so, users can trace back kernel releases in kexec.  Systemd is
not helpful because we cannot assume that the previous kernel has systemd
or even write access to the disk (common when using Linux as bootloaders)


This patch (of 6):

kho_add_subtree() assumes the fdt argument is always an FDT and calls
fdt_totalsize() on it in the debugfs code path.  This assumption will
break if a caller passes arbitrary data instead of an FDT.

When CONFIG_KEXEC_HANDOVER_DEBUGFS is enabled, kho_debugfs_fdt_add() calls
__kho_debugfs_fdt_add(), which executes:

    f->wrapper.size = fdt_totalsize(fdt);

Fix this by adding an explicit size parameter to kho_add_subtree() so
callers specify the blob size.  This allows subtrees to contain arbitrary
data formats, not just FDTs.  Update all callers:

  - memblock.c: use fdt_totalsize(fdt)
  - luo_core.c: use fdt_totalsize(fdt_out)
  - test_kho.c: use fdt_totalsize()
  - kexec_handover.c (root fdt): use fdt_totalsize(kho_out.fdt)

Also update __kho_debugfs_fdt_add() to receive the size explicitly instead
of computing it internally via fdt_totalsize().  In kho_in_debugfs_init(),
pass fdt_totalsize() for the root FDT and sub-blobs since all current
users are FDTs.  A subsequent patch will persist the size in the KHO FDT
so the incoming side can handle non-FDT blobs correctly.

Link: https://lore.kernel.org/20260323110747.193569-1-duanchenghao@kylinos.cn
Link: https://lore.kernel.org/20260316-kho-v9-1-ed6dcd951988@debian.org
Signed-off-by: Breno Leitao <leitao@debian.org>
Suggested-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18 00:10:48 -07:00
Ard Biesheuvel
a142d0ae9f memblock: Permit existing reserved regions to be marked RSRV_KERN
Permit existing memblock reservations to be marked as RSRV_KERN. This
will be used by the EFI code on x86 to distinguish between reservations
of boot services data regions that have actual significance to the
kernel and regions that are reserved temporarily to work around buggy
firmware.

Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2026-04-09 17:14:36 +02:00
Mike Rapoport (Microsoft)
59bd1d914b memblock: warn when freeing reserved memory before memory map is initialized
When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, freeing of reserved
memory before the memory map is fully initialized in deferred_init_memmap()
would cause access to uninitialized struct pages and may crash when
accessing spurious list pointers, like was recently discovered during
discussion about memory leaks in x86 EFI code [1].

The trace below is from an attempt to call free_reserved_page() before
page_alloc_init_late():

[    0.076840] BUG: unable to handle page fault for address: ffffce1a005a0788
[    0.078226] #PF: supervisor read access in kernel mode
[    0.078226] #PF: error_code(0x0000) - not-present page
[    0.078226] PGD 0 P4D 0
[    0.078226] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[    0.078226] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.12.68-92.123.amzn2023.x86_64 #1
[    0.078226] Hardware name: Amazon EC2 t3a.nano/, BIOS 1.0 10/16/2017
[    0.078226] RIP: 0010:__list_del_entry_valid_or_report+0x32/0xb0
...
[    0.078226]  __free_one_page+0x170/0x520
[    0.078226]  free_pcppages_bulk+0x151/0x1e0
[    0.078226]  free_unref_page_commit+0x263/0x320
[    0.078226]  free_unref_page+0x2c8/0x5b0
[    0.078226]  ? srso_return_thunk+0x5/0x5f
[    0.078226]  free_reserved_page+0x1c/0x30
[    0.078226]  memblock_free_late+0x6c/0xc0

Currently there are not many callers of free_reserved_area() and they all
appear to be at the right timings.

Still, in order to protect against problematic code moves or additions of
new callers add a warning that will inform that reserved pages cannot be
freed until the memory map is fully initialized.

[1] https://lore.kernel.org/all/e5d5a1105d90ee1e7fe7eafaed2ed03bbad0c46b.camel@kernel.crashing.org/

Link: https://patch.msgid.link/20260323074836.3653702-10-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2026-04-01 11:20:15 +03:00
Mike Rapoport (Microsoft)
87ce9e83ab memblock, treewide: make memblock_free() handle late freeing
It shouldn't be responsibility of memblock users to detect if they free
memory allocated from memblock late and should use memblock_free_late().

Make memblock_free() and memblock_phys_free() take care of late memory
freeing and drop memblock_free_late().

Link: https://patch.msgid.link/20260323074836.3653702-9-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2026-04-01 11:20:15 +03:00
Mike Rapoport (Microsoft)
b2129a3951 memblock: make free_reserved_area() update memblock if ARCH_KEEP_MEMBLOCK=y
On architectures that keep memblock after boot, freeing of reserved memory
with free_reserved_area() is paired with an update of memblock arrays,
usually by a call to memblock_free().

Make free_reserved_area() directly update memblock.reserved when
ARCH_KEEP_MEMBLOCK is enabled.

Remove the now-redundant explicit memblock_free() call from
arm64::free_initmem() and the #ifdef CONFIG_ARCH_KEEP_MEMBLOCK block
from the generic free_initrd_mem().

Link: https://patch.msgid.link/20260323074836.3653702-8-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2026-04-01 11:20:15 +03:00
Mike Rapoport (Microsoft)
7fbc5e2612 memblock: extract page freeing from free_reserved_area() into a helper
There are two functions that release pages to the buddy allocator late in
the boot: free_reserved_area() and memblock_free_late().

Currently they are using different underlying functionality,
free_reserved_area() runs each page being freed via free_reserved_page()
and memblock_free_late() uses memblock_free_pages() -> __free_pages_core(),
but in the end they both boil down to a loop that frees a range page by
page.

Extract the loop frees pages from free_reserved_area() into a helper and
use that helper in memblock_free_late().

Link: https://patch.msgid.link/20260323074836.3653702-7-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2026-04-01 11:20:15 +03:00
Mike Rapoport (Microsoft)
b8de9573e6 memblock: make free_reserved_area() more robust
There are two potential problems in free_reserved_area():
* it may free a page with not-existent buddy page
* it may be passed a virtual address from an alias mapping that won't
  be properly translated by virt_to_page(), for example a symbol on arm64

While first issue is quite theoretical and the second one does not manifest
itself because all the callers do the right thing, it is easy to make
free_reserved_area() robust enough to avoid these potential issues.

Replace the loop by virtual address with a loop by pfn that uses
for_each_valid_pfn() and use __pa() or __pa_symbol() depending on the
virtual mapping alias to correctly determine the loop boundaries.

Link: https://patch.msgid.link/20260323074836.3653702-6-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2026-04-01 11:20:15 +03:00
Mike Rapoport (Microsoft)
0510bdab53 mm: move free_reserved_area() to mm/memblock.c
free_reserved_area() is related to memblock as it frees reserved memory
back to the buddy allocator, similar to what memblock_free_late() does.

Move free_reserved_area() to mm/memblock.c to prepare for further
consolidation of the functions that free reserved memory.

No functional changes.

Link: https://patch.msgid.link/20260323074836.3653702-5-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
2026-04-01 11:19:46 +03:00
Mike Rapoport (Microsoft)
c12c3e1507 memblock: reserve_mem: fix end caclulation in reserve_mem_release_by_name()
free_reserved_area() expects end parameter to point to the first address
after the area, but reserve_mem_release_by_name() passes it the last
address inside the area.

Remove subtraction of one in calculation of the area end.

Fixes: 74e2498ccf ("mm/memblock: Add reserved memory release function")
Link: https://patch.msgid.link/20260323074836.3653702-2-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2026-04-01 11:19:45 +03:00
Mike Rapoport (Microsoft)
8b7b85384f memblock: move reserve_bootmem_range() to memblock.c and make it static
reserve_bootmem_region() is only called from
memmap_init_reserved_pages() and it was in mm/mm_init.c because of its
dependecies on static init_deferred_page().

Since init_deferred_page() is not static anymore, move
reserve_bootmem_region(), rename it to memmap_init_reserved_range() and
make it static.

Update the comment describing it to better reflect what the function
does and drop bogus comment about reserved pages in free_bootmem_page().

Update memblock test stubs to reflect the core changes.

Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>
Link: https://patch.msgid.link/20260323072042.3651061-1-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2026-04-01 11:19:45 +03:00
Guilherme G. Piccoli
0709682cdb memblock: Add reserve_mem debugfs info
When using the "reserve_mem" parameter, users aim at having an
area that (hopefully) persists across boots, so pstore infrastructure
(like ramoops module) can make use of that to save oops/ftrace logs,
for example.

There is no easy way to determine if this kernel parameter is properly
set though; the kernel doesn't show information about this memory in
memblock debugfs, neither in /proc/iomem nor dmesg. This is a relevant
information for tools like kdumpst[0], to determine if it's reliable
to use the reserved area as ramoops persistent storage; checking only
/proc/cmdline is not sufficient as it doesn't tell if the reservation
effectively succeeded or not.

Add here a new file under memblock debugfs showing properly set memory
reservations, with name and size as passed to "reserve_mem". Notice that
if no "reserve_mem=" is passed on command-line or if the reservation
attempts fail, the file is not created.

[0] https://aur.archlinux.org/packages/kdumpst

Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Link: https://patch.msgid.link/20260324012839.1991765-2-gpiccoli@igalia.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2026-04-01 11:19:43 +03:00
Guilherme G. Piccoli
f7f4a21c2a memblock: Print out errors on reserve_mem parser
The parsing of kernel parameter "reserve_mem=" is subject to
multiple failures, like duplicate naming, malformed expression
or even lack of available memory. Right now, all of these fail
silently. Let's add some messages so the kernel log can provide
useful information in case of failures.

Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Link: https://patch.msgid.link/20260324012839.1991765-1-gpiccoli@igalia.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2026-04-01 11:19:40 +03:00
Linus Torvalds
787fe1d43a memblock: updates for 7.0-rc1
* update tools/include/linux/mm.h to fix memblock tests compilation
 * drop redundant struct page* parameter from memblock_free_pages() and get
   struct page from the pfn
 * add underflow detection for size calculation in memtest and warn about
   underflow when VM_DEBUG is enabled
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEeOVYVaWZL5900a/pOQOGJssO/ZEFAmmQIhoQHHJwcHRAa2Vy
 bmVsLm9yZwAKCRA5A4Ymyw79kWhYB/0aobkrfD4aW5Utfmzp08LdBwtfsOqEfKX6
 AdBGPdG+WB90auW4qwDupspqj2lYDpJ4QvETNP0B84ek62VEN+8YEbvcC4W70l4H
 nsrrnkTgwFGNXXxjr6tIQXu9hnC1o7eSuWhhYry4XG+JEKR3iah54JmbxcDrAEFj
 lb4BzdocDtF6J3EkOv5alaDfdwUxgA3C6Idp2mpVb4m7DMraGZMq3lm7EPYm22zb
 zo9v0nvXW9xtZfADQ6mRzp4uTjd/UAUH+YsU/u1S1f+JBN1bELXmFRf/X3CKBC6/
 AIO9FcHsfA0i1MhbeBizT9eUEFaNIRxbMAtWbfdHrQhaLWNvyPOU
 =Gz3z
 -----END PGP SIGNATURE-----

Merge tag 'memblock-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock

Pull memblock updates from Mike Rapoport:

 - update tools/include/linux/mm.h to fix memblock tests compilation

 - drop redundant struct page* parameter from memblock_free_pages() and
   get struct page from the pfn

 - add underflow detection for size calculation in memtest and warn
   about underflow when VM_DEBUG is enabled

* tag 'memblock-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
  mm/memtest: add underflow detection for size calculation
  memblock: drop redundant 'struct page *' argument from memblock_free_pages()
  memblock test: include <linux/sizes.h> from tools mm.h stub
2026-02-14 12:39:34 -08:00
Linus Torvalds
136114e0ab mm.git review status for linus..mm-nonmm-stable
Total patches:       107
 Reviews/patch:       1.07
 Reviewed rate:       67%
 
 - The 2 patch series "ocfs2: give ocfs2 the ability to reclaim
   suballocator free bg" from Heming Zhao saves disk space by teaching
   ocfs2 to reclaim suballocator block group space.
 
 - The 4 patch series "Add ARRAY_END(), and use it to fix off-by-one
   bugs" from Alejandro Colomar adds the ARRAY_END() macro and uses it in
   various places.
 
 - The 2 patch series "vmcoreinfo: support VMCOREINFO_BYTES larger than
   PAGE_SIZE" from Pnina Feder makes the vmcore code future-safe, if
   VMCOREINFO_BYTES ever exceeds the page size.
 
 - The 7 patch series "kallsyms: Prevent invalid access when showing
   module buildid" from Petr Mladek cleans up kallsyms code related to
   module buildid and fixes an invalid access crash when printing
   backtraces.
 
 - The 3 patch series "Address page fault in
   ima_restore_measurement_list()" from Harshit Mogalapalli fixes a
   kexec-related crash that can occur when booting the second-stage kernel
   on x86.
 
 - The 6 patch series "kho: ABI headers and Documentation updates" from
   Mike Rapoport updates the kexec handover ABI documentation.
 
 - The 4 patch series "Align atomic storage" from Finn Thain adds the
   __aligned attribute to atomic_t and atomic64_t definitions to get
   natural alignment of both types on csky, m68k, microblaze, nios2,
   openrisc and sh.
 
 - The 2 patch series "kho: clean up page initialization logic" from
   Pratyush Yadav simplifies the page initialization logic in
   kho_restore_page().
 
 - The 6 patch series "Unload linux/kernel.h" from Yury Norov moves
   several things out of kernel.h and into more appropriate places.
 
 - The 7 patch series "don't abuse task_struct.group_leader" from Oleg
   Nesterov removes the usage of ->group_leader when it is "obviously
   unnecessary".
 
 - The 5 patch series "list private v2 & luo flb" from Pasha Tatashin
   adds some infrastructure improvements to the live update orchestrator.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaY4giAAKCRDdBJ7gKXxA
 jgusAQDnKkP8UWTqXPC1jI+OrDJGU5ciAx8lzLeBVqMKzoYk9AD/TlhT2Nlx+Ef6
 0HCUHUD0FMvAw/7/Dfc6ZKxwBEIxyww=
 =mmsH
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:

 - "ocfs2: give ocfs2 the ability to reclaim suballocator free bg" saves
   disk space by teaching ocfs2 to reclaim suballocator block group
   space (Heming Zhao)

 - "Add ARRAY_END(), and use it to fix off-by-one bugs" adds the
   ARRAY_END() macro and uses it in various places (Alejandro Colomar)

 - "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE" makes
   the vmcore code future-safe, if VMCOREINFO_BYTES ever exceeds the
   page size (Pnina Feder)

 - "kallsyms: Prevent invalid access when showing module buildid" cleans
   up kallsyms code related to module buildid and fixes an invalid
   access crash when printing backtraces (Petr Mladek)

 - "Address page fault in ima_restore_measurement_list()" fixes a
   kexec-related crash that can occur when booting the second-stage
   kernel on x86 (Harshit Mogalapalli)

 - "kho: ABI headers and Documentation updates" updates the kexec
   handover ABI documentation (Mike Rapoport)

 - "Align atomic storage" adds the __aligned attribute to atomic_t and
   atomic64_t definitions to get natural alignment of both types on
   csky, m68k, microblaze, nios2, openrisc and sh (Finn Thain)

 - "kho: clean up page initialization logic" simplifies the page
   initialization logic in kho_restore_page() (Pratyush Yadav)

 - "Unload linux/kernel.h" moves several things out of kernel.h and into
   more appropriate places (Yury Norov)

 - "don't abuse task_struct.group_leader" removes the usage of
   ->group_leader when it is "obviously unnecessary" (Oleg Nesterov)

 - "list private v2 & luo flb" adds some infrastructure improvements to
   the live update orchestrator (Pasha Tatashin)

* tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (107 commits)
  watchdog/hardlockup: simplify perf event probe and remove per-cpu dependency
  procfs: fix missing RCU protection when reading real_parent in do_task_stat()
  watchdog/softlockup: fix sample ring index wrap in need_counting_irqs()
  kcsan, compiler_types: avoid duplicate type issues in BPF Type Format
  kho: fix doc for kho_restore_pages()
  tests/liveupdate: add in-kernel liveupdate test
  liveupdate: luo_flb: introduce File-Lifecycle-Bound global state
  liveupdate: luo_file: Use private list
  list: add kunit test for private list primitives
  list: add primitives for private list manipulations
  delayacct: fix uapi timespec64 definition
  panic: add panic_force_cpu= parameter to redirect panic to a specific CPU
  netclassid: use thread_group_leader(p) in update_classid_task()
  RDMA/umem: don't abuse current->group_leader
  drm/pan*: don't abuse current->group_leader
  drm/amd: kill the outdated "Only the pthreads threading model is supported" checks
  drm/amdgpu: don't abuse current->group_leader
  android/binder: use same_thread_group(proc->tsk, current) in binder_mmap()
  android/binder: don't abuse current->group_leader
  kho: skip memoryless NUMA nodes when reserving scratch areas
  ...
2026-02-12 12:13:01 -08:00
Mike Rapoport (Microsoft)
dd1e79ef6c kho/abi: add memblock ABI header
Introduce KHO ABI header describing preservation ABI for memblock's
reserve_mem regions and link the relevant documentation to KHO docs.

[lukas.bulwahn@redhat.com: MAINTAINERS: adjust file entry in MEMBLOCK AND MEMORY MANAGEMENT INITIALIZATION]
  Link: https://lkml.kernel.org/r/20260107090438.22901-1-lukas.bulwahn@redhat.com
[rppt@kernel.org: update reserved_mem node description, per Pratyush]
  Link: https://lkml.kernel.org/r/aW_M-HYZzx5SkbnZ@kernel.org
Link: https://lkml.kernel.org/r/20260105165839.285270-7-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Jason Miu <jasonmiu@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-26 19:07:12 -08:00
Kevin Lourenco
62451ae347 mm: fix minor spelling mistakes in comments
Correct several typos in comments across files in mm/

[akpm@linux-foundation.org: also fix comment grammar, per SeongJae]
Link: https://lkml.kernel.org/r/20251218150906.25042-1-klourencodev@gmail.com
Signed-off-by: Kevin Lourenco <klourencodev@gmail.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 19:24:48 -08:00
Shengming Hu
58e3e52654 memblock: drop redundant 'struct page *' argument from memblock_free_pages()
memblock_free_pages() currently takes both a struct page * and the
corresponding PFN. The page pointer is always derived from the PFN at
call sites (pfn_to_page(pfn)), making the parameter redundant and also
allowing accidental mismatches between the two arguments.

Simplify the interface by removing the struct page * argument and
deriving the page locally from the PFN, after the deferred struct page
initialization check. This keeps the behavior unchanged while making
the helper harder to misuse.

Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn>
Reviewed-by: David Hildenbrand (Red Hat) <david@kernel.org>
Link: https://patch.msgid.link/tencent_F741CE6ECC49EE099736685E60C0DBD4A209@qq.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2026-01-09 11:53:51 +02:00
Linus Torvalds
509d3f4584 Significant patch series in this pull request:
- The 6 patch series "panic: sys_info: Refactor and fix a potential
   issue" from Andy Shevchenko fixes a build issue and does some cleanup in
   ib/sys_info.c.
 
 - The 9 patch series "Implement mul_u64_u64_div_u64_roundup()" from
   David Laight enhances the 64-bit math code on behalf of a PWM driver and
   beefs up the test module for these library functions.
 
 - The 2 patch series "scripts/gdb/symbols: make BPF debug info available
   to GDB" from Ilya Leoshkevich makes BPF symbol names, sizes, and line
   numbers available to the GDB debugger.
 
 - The 4 patch series "Enable hung_task and lockup cases to dump system
   info on demand" from Feng Tang adds a sysctl which can be used to cause
   additional info dumping when the hung-task and lockup detectors fire.
 
 - The 6 patch series "lib/base64: add generic encoder/decoder, migrate
   users" from Kuan-Wei Chiu adds a general base64 encoder/decoder to lib/
   and migrates several users away from their private implementations.
 
 - The 2 patch series "rbree: inline rb_first() and rb_last()" from Eric
   Dumazet makes TCP a little faster.
 
 - The 9 patch series "liveupdate: Rework KHO for in-kernel users" from
   Pasha Tatashin reworks the KEXEC Handover interfaces in preparation for
   Live Update Orchestrator (LUO), and possibly for other future clients.
 
 - The 13 patch series "kho: simplify state machine and enable dynamic
   updates" from Pasha Tatashin increases the flexibility of KEXEC
   Handover.  Also preparation for LUO.
 
 - The 18 patch series "Live Update Orchestrator" from Pasha Tatashin is
   a major new feature targeted at cloud environments.  Quoting the [0/N]:
 
     This series introduces the Live Update Orchestrator, a kernel subsystem
     designed to facilitate live kernel updates using a kexec-based reboot.
     This capability is critical for cloud environments, allowing hypervisors
     to be updated with minimal downtime for running virtual machines.  LUO
     achieves this by preserving the state of selected resources, such as
     memory, devices and their dependencies, across the kernel transition.
 
     As a key feature, this series includes support for preserving memfd file
     descriptors, which allows critical in-memory data, such as guest RAM or
     any other large memory region, to be maintained in RAM across the kexec
     reboot.
 
   Mike Rappaport merits a mention here, for his extensive review and
   testing work.
 
 - The 3 patch series "kexec: reorganize kexec and kdump sysfs" from
   Sourabh Jain moves the kexec and kdump sysfs entries from /sys/kernel/
   to /sys/kernel/kexec/ and adds back-compatibility symlinks which can
   hopefully be removed one day.
 
 - The 2 patch series "kho: fixes for vmalloc restoration" from Mike
   Rapoport fixes a BUG which was being hit during KHO restoration of
   vmalloc() regions.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaTSAkQAKCRDdBJ7gKXxA
 jrkiAP9QKfsRv46XZaM5raScjY1ayjP+gqb2rgt6BQ/gZvb2+wD/cPAYOR6BiX52
 n0pVpQmG5P/KyOmpLztn96ejL4heKwQ=
 =JY96
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2025-12-06-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:

 - "panic: sys_info: Refactor and fix a potential issue" (Andy Shevchenko)
   fixes a build issue and does some cleanup in ib/sys_info.c

 - "Implement mul_u64_u64_div_u64_roundup()" (David Laight)
   enhances the 64-bit math code on behalf of a PWM driver and beefs up
   the test module for these library functions

 - "scripts/gdb/symbols: make BPF debug info available to GDB" (Ilya Leoshkevich)
   makes BPF symbol names, sizes, and line numbers available to the GDB
   debugger

 - "Enable hung_task and lockup cases to dump system info on demand" (Feng Tang)
   adds a sysctl which can be used to cause additional info dumping when
   the hung-task and lockup detectors fire

 - "lib/base64: add generic encoder/decoder, migrate users" (Kuan-Wei Chiu)
   adds a general base64 encoder/decoder to lib/ and migrates several
   users away from their private implementations

 - "rbree: inline rb_first() and rb_last()" (Eric Dumazet)
   makes TCP a little faster

 - "liveupdate: Rework KHO for in-kernel users" (Pasha Tatashin)
   reworks the KEXEC Handover interfaces in preparation for Live Update
   Orchestrator (LUO), and possibly for other future clients

 - "kho: simplify state machine and enable dynamic updates" (Pasha Tatashin)
   increases the flexibility of KEXEC Handover. Also preparation for LUO

 - "Live Update Orchestrator" (Pasha Tatashin)
   is a major new feature targeted at cloud environments. Quoting the
   cover letter:

      This series introduces the Live Update Orchestrator, a kernel
      subsystem designed to facilitate live kernel updates using a
      kexec-based reboot. This capability is critical for cloud
      environments, allowing hypervisors to be updated with minimal
      downtime for running virtual machines. LUO achieves this by
      preserving the state of selected resources, such as memory,
      devices and their dependencies, across the kernel transition.

      As a key feature, this series includes support for preserving
      memfd file descriptors, which allows critical in-memory data, such
      as guest RAM or any other large memory region, to be maintained in
      RAM across the kexec reboot.

   Mike Rappaport merits a mention here, for his extensive review and
   testing work.

 - "kexec: reorganize kexec and kdump sysfs" (Sourabh Jain)
   moves the kexec and kdump sysfs entries from /sys/kernel/ to
   /sys/kernel/kexec/ and adds back-compatibility symlinks which can
   hopefully be removed one day

 - "kho: fixes for vmalloc restoration" (Mike Rapoport)
   fixes a BUG which was being hit during KHO restoration of vmalloc()
   regions

* tag 'mm-nonmm-stable-2025-12-06-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (139 commits)
  calibrate: update header inclusion
  Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()"
  vmcoreinfo: track and log recoverable hardware errors
  kho: fix restoring of contiguous ranges of order-0 pages
  kho: kho_restore_vmalloc: fix initialization of pages array
  MAINTAINERS: TPM DEVICE DRIVER: update the W-tag
  init: replace simple_strtoul with kstrtoul to improve lpj_setup
  KHO: fix boot failure due to kmemleak access to non-PRESENT pages
  Documentation/ABI: new kexec and kdump sysfs interface
  Documentation/ABI: mark old kexec sysfs deprecated
  kexec: move sysfs entries to /sys/kernel/kexec
  test_kho: always print restore status
  kho: free chunks using free_page() instead of kfree()
  selftests/liveupdate: add kexec test for multiple and empty sessions
  selftests/liveupdate: add simple kexec-based selftest for LUO
  selftests/liveupdate: add userspace API selftests
  docs: add documentation for memfd preservation via LUO
  mm: memfd_luo: allow preserving memfd
  liveupdate: luo_file: add private argument to store runtime state
  mm: shmem: export some functions to internal.h
  ...
2025-12-06 14:01:20 -08:00
Pasha Tatashin
f5bfd4793a memblock: unpreserve memory in case of error
If there is an error half way through KHO memory preservation, we should
rollback and unpreserve everything that is partially preserved.

[akpm@linux-foundation.org: s/err_no_fdt_page/err_report/ in prepare_kho_fdt(), per Mike]
Link: https://lkml.kernel.org/r/20251101142325.1326536-5-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Suggested-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Simon Horman <horms@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27 14:24:32 -08:00
Mike Rapoport (Microsoft)
70f9133096 kho: drop notifiers
The KHO framework uses a notifier chain as the mechanism for clients to
participate in the finalization process.  While this works for a single,
central state machine, it is too restrictive for kernel-internal
components like pstore/reserve_mem or IMA.  These components need a
simpler, direct way to register their state for preservation (e.g., during
their initcall) without being part of a complex, shutdown-time notifier
sequence.  The notifier model forces all participants into a single
finalization flow and makes direct preservation from an arbitrary context
difficult.  This patch refactors the client participation model by
removing the notifier chain and introducing a direct API for managing FDT
subtrees.

The core kho_finalize() and kho_abort() state machine remains, but clients
now register their data with KHO beforehand.

Link: https://lkml.kernel.org/r/20251101142325.1326536-3-pasha.tatashin@soleen.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Co-developed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Pratyush Yadav <pratyush@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Simon Horman <horms@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27 14:24:32 -08:00
Akinobu Mita
c42af83c59 memblock: fix memblock_estimated_nr_free_pages() for soft-reserved memory
memblock_estimated_nr_free_pages() returns the difference between the total
size of the "memory" memblock type and the "reserved" memblock type.

The "soft-reserved" memory regions are added to the "reserved" memblock
type, but not to the "memory" memblock type. Therefore,
memblock_estimated_nr_free_pages() may return a smaller value than
expected, or if it underflows, an extremely large value.

/proc/sys/kernel/threads-max is determined by the value of
memblock_estimated_nr_free_pages().  This issue was discovered on machines
with CXL memory because kernel.threads-max was either smaller than expected
or extremely large for the installed DRAM size.

This fixes the issue by replacing memblock_reserved_size() with
memblock_reserved_kern_size() that tells how much memory was
reserved from the actual RAM.

Suggested-by: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Link: https://patch.msgid.link/20251111010010.7800-1-akinobu.mita@gmail.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2025-11-11 18:15:35 +02:00
Mike Rapoport (Microsoft)
8375b76517 kho: replace kho_preserve_phys() with kho_preserve_pages()
to make it clear that KHO operates on pages rather than on a random
physical address.

The kho_preserve_pages() will be also used in upcoming support for vmalloc
preservation.

Link: https://lkml.kernel.org/r/20250921054458.4043761-3-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-10-07 13:48:55 -07:00
Mike Rapoport (Microsoft)
e68f150bc1 memblock: drop for_each_free_mem_pfn_range_in_zone_from()
for_each_free_mem_pfn_range_in_zone_from() and its "backend" implementation
__next_mem_pfn_range_in_zone() were only used by deferred initialization of
the memory map.

Remove them as they are not used anymore.

Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2025-09-14 08:49:03 +03:00
Mike Rapoport (Microsoft)
b3dcc9d1d8 memblock: fix kernel-doc for MEMBLOCK_RSRV_NOINIT
The kernel-doc description of MEMBLOCK_RSRV_NOINIT and
memblock_reserved_mark_noinit() do not accurately describe their
functionality.

Expand their kernel doc to make it clear that the user of
MEMBLOCK_RSRV_NOINIT is responsible to properly initialize the struct pages
for such regions and add more details about effects of using this flag.

Reviewed-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/f8140a17-c4ec-489b-b314-d45abe48bf36@redhat.com
Link: https://lore.kernel.org/r/20250826071947.1949725-1-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2025-08-26 10:47:03 +03:00
Pratyush Brahma
4647c4dead mm: numa,memblock: Use SZ_1M macro to denote bytes to MB conversion
Replace the manual bitwise conversion of bytes to MB with
SZ_1M macro, a standard macro used within the mm subsystem,
to improve readability.

Signed-off-by: Pratyush Brahma <pratyush.brahma@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20250820-numa-memblks-refac-v2-1-43bf1af02acd@oss.qualcomm.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2025-08-20 16:31:23 +03:00
Alexander Graf
f992307802 memblock: add KHO support for reserve_mem
Linux has recently gained support for "reserve_mem": A mechanism to
allocate a region of memory early enough in boot that we can cross our
fingers and hope it stays at the same location during most boots, so we
can store for example ftrace buffers into it.

Thanks to KASLR, we can never be really sure that "reserve_mem"
allocations are static across kexec.  Let's teach it KHO awareness so that
it serializes its reservations on kexec exit and deserializes them again
on boot, preserving the exact same mapping across kexec.

This is an example user for KHO in the KHO patch set to ensure we have at
least one (not very controversial) user in the tree before extending KHO's
use to more subsystems.

Link: https://lkml.kernel.org/r/20250509074635.3187114-16-changyuanl@google.com
Signed-off-by: Alexander Graf <graf@amazon.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Co-developed-by: Changyuan Lyu <changyuanl@google.com>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ashish Kalra <ashish.kalra@amd.com>
Cc: Ben Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Gowans <jgowans@amazon.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pratyush Yadav <ptyadav@amazon.de>
Cc: Rob Herring <robh@kernel.org>
Cc: Saravana Kannan <saravanak@google.com>
Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Thomas Lendacky <thomas.lendacky@amd.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12 23:50:42 -07:00
Alexander Graf
c609c144b0 kexec: add KHO parsing support
When we have a KHO kexec, we get an FDT blob and scratch region to
populate the state of the system.  Provide helper functions that allow
architecture code to easily handle memory reservations based on them and
give device drivers visibility into the KHO FDT and memory reservations so
they can recover their own state.

Include a fix from Arnd Bergmann <arnd@arndb.de>
https://lore.kernel.org/lkml/20250424093302.3894961-1-arnd@kernel.org/.

Link: https://lkml.kernel.org/r/20250509074635.3187114-6-changyuanl@google.com
Signed-off-by: Alexander Graf <graf@amazon.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Co-developed-by: Changyuan Lyu <changyuanl@google.com>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: Ashish Kalra <ashish.kalra@amd.com>
Cc: Ben Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Gowans <jgowans@amazon.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pratyush Yadav <ptyadav@amazon.de>
Cc: Rob Herring <robh@kernel.org>
Cc: Saravana Kannan <saravanak@google.com>
Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Thomas Lendacky <thomas.lendacky@amd.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12 23:50:39 -07:00
Mike Rapoport (Microsoft)
b8a8f96a6d memblock: introduce memmap_init_kho_scratch()
With deferred initialization of struct page it will be necessary to
initialize memory map for KHO scratch regions early.

Add memmap_init_kho_scratch() method that will allow such initialization
in upcoming patches.

Link: https://lkml.kernel.org/r/20250509074635.3187114-4-changyuanl@google.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ashish Kalra <ashish.kalra@amd.com>
Cc: Ben Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Gowans <jgowans@amazon.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pratyush Yadav <ptyadav@amazon.de>
Cc: Rob Herring <robh@kernel.org>
Cc: Saravana Kannan <saravanak@google.com>
Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Thomas Lendacky <thomas.lendacky@amd.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12 23:50:39 -07:00
Alexander Graf
d59f43b574 memblock: add support for scratch memory
With KHO (Kexec HandOver), we need a way to ensure that the new kernel
does not allocate memory on top of any memory regions that the previous
kernel was handing over.  But to know where those are, we need to include
them in the memblock.reserved array which may not be big enough to hold
all ranges that need to be persisted across kexec.  To resize the array,
we need to allocate memory.  That brings us into a catch 22 situation.

The solution to that is limit memblock allocations to the scratch regions:
safe regions to operate in the case when there is memory that should
remain intact across kexec.

KHO provides several "scratch regions" as part of its metadata.  These
scratch regions are contiguous memory blocks that known not to contain any
memory that should be persisted across kexec.  These regions should be
large enough to accommodate all memblock allocations done by the kexeced
kernel.

We introduce a new memblock_set_scratch_only() function that allows KHO to
indicate that any memblock allocation must happen from the scratch
regions.

Later, we may want to perform another KHO kexec.  For that, we reuse the
same scratch regions.  To ensure that no eventually handed over data gets
allocated inside a scratch region, we flip the semantics of the scratch
region with memblock_clear_scratch_only(): After that call, no allocations
may happen from scratch memblock regions.  We will lift that restriction
in the next patch.

Link: https://lkml.kernel.org/r/20250509074635.3187114-3-changyuanl@google.com
Signed-off-by: Alexander Graf <graf@amazon.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ashish Kalra <ashish.kalra@amd.com>
Cc: Ben Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Gowans <jgowans@amazon.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pratyush Yadav <ptyadav@amazon.de>
Cc: Rob Herring <robh@kernel.org>
Cc: Saravana Kannan <saravanak@google.com>
Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Thomas Lendacky <thomas.lendacky@amd.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12 23:50:39 -07:00
Mike Rapoport (Microsoft)
4c78cc596b memblock: add MEMBLOCK_RSRV_KERN flag
Patch series "kexec: introduce Kexec HandOver (KHO)", v8.

Kexec today considers itself purely a boot loader: When we enter the new
kernel, any state the previous kernel left behind is irrelevant and the
new kernel reinitializes the system.

However, there are use cases where this mode of operation is not what we
actually want.  In virtualization hosts for example, we want to use kexec
to update the host kernel while virtual machine memory stays untouched. 
When we add device assignment to the mix, we also need to ensure that
IOMMU and VFIO states are untouched.  If we add PCIe peer to peer DMA, we
need to do the same for the PCI subsystem.  If we want to kexec while an
SEV-SNP enabled virtual machine is running, we need to preserve the VM
context pages and physical memory.  See "pkernfs: Persisting guest memory
and kernel/device state safely across kexec" Linux Plumbers Conference
2023 presentation for details:

  https://lpc.events/event/17/contributions/1485/

To start us on the journey to support all the use cases above, this patch
implements basic infrastructure to allow hand over of kernel state across
kexec (Kexec HandOver, aka KHO).  As a really simple example target, we
use memblock's reserve_mem.

With this patchset applied, memory that was reserved using "reserve_mem"
command line options remains intact after kexec and it is guaranteed to
reside at the same physical address.

== Alternatives ==

There are alternative approaches to (parts of) the problems above:

  * Memory Pools [1] - preallocated persistent memory region + allocator
  * PRMEM [2] - resizable persistent memory regions with fixed metadata
                pointer on the kernel command line + allocator
  * Pkernfs [3] - preallocated file system for in-kernel data with fixed
                  address location on the kernel command line
  * PKRAM [4] - handover of user space pages using a fixed metadata page
                specified via command line

All of the approaches above fundamentally have the same problem: They
require the administrator to explicitly carve out a physical memory
location because they have no mechanism outside of the kernel command line
to pass data (including memory reservations) between kexec'ing kernels.

KHO provides that base foundation.  We will determine later whether we
still need any of the approaches above for fast bulk memory handover of
for example IOMMU page tables.  But IMHO they would all be users of KHO,
with KHO providing the foundational primitive to pass metadata and bulk
memory reservations as well as provide easy versioning for data.

== Overview ==

We introduce a metadata file that the kernels pass between each other. 
How they pass it is architecture specific.  The file's format is a
Flattened Device Tree (fdt) which has a generator and parser already
included in Linux.  KHO is enabled in the kernel command line by `kho=on`.
When the root user enables KHO through
/sys/kernel/debug/kho/out/finalize, the kernel invokes callbacks to every
KHO users to register preserved memory regions, which contain drivers'
states.

When the actual kexec happens, the fdt is part of the image set that we
boot into.  In addition, we keep "scratch regions" available for kexec:
physically contiguous memory regions that are guaranteed to not have any
memory that KHO would preserve.  The new kernel bootstraps itself using
the scratch regions and sets all handed over memory as in use.  When
drivers initialize that support KHO, they introspect the fdt, restore
preserved memory regions, and retrieve their states stored in the
preserved memory.

== Limitations ==

Currently KHO is only implemented for file based kexec.  The kernel
interfaces in the patch set are already in place to support user space
kexec as well, but it is still not implemented it yet inside kexec tools.

== How to Use ==

To use the code, please boot the kernel with the "kho=on" command line
parameter.  KHO will automatically create scratch regions.  If you want to
set the scratch size explicitly you can use "kho_scratch=" command line
parameter.  For instance, "kho_scratch=16M,512M,256M" will reserve a 16
MiB low memory scratch area, a 512 MiB global scratch region, and 256 MiB
per NUMA node scratch regions on boot.

Make sure to have a reserved memory range requested with reserv_mem
command line option, for example, "reserve_mem=64m:4k:n1".

Then before you invoke file based "kexec -l", finalize KHO FDT:

  # echo 1 > /sys/kernel/debug/kho/out/finalize

You can preview the generated FDT using `dtc`,

  # dtc /sys/kernel/debug/kho/out/fdt
  # dtc /sys/kernel/debug/kho/out/sub_fdts/memblock

`dtc` is available on ubuntu by `sudo apt-get install device-tree-compiler`.

Now kexec into the new kernel,

  # kexec -l Image --initrd=initrd -s
  # kexec -e

(The order of KHO finalization and "kexec -l" does not matter.)

The new kernel will boot up and contain the previous kernel's reserve_mem
contents at the same physical address as the first kernel.

You can also review the FDT passed from the old kernel,

  # dtc /sys/kernel/debug/kho/in/fdt
  # dtc /sys/kernel/debug/kho/in/sub_fdts/memblock


This patch (of 17):

To denote areas that were reserved for kernel use either directly with
memblock_reserve_kern() or via memblock allocations.

Link: https://lore.kernel.org/lkml/20250424083258.2228122-1-changyuanl@google.com/
Link: https://lore.kernel.org/lkml/aAeaJ2iqkrv_ffhT@kernel.org/
Link: https://lore.kernel.org/lkml/35c58191-f774-40cf-8d66-d1e2aaf11a62@intel.com/
Link: https://lore.kernel.org/lkml/20250424093302.3894961-1-arnd@kernel.org/
Link: https://lkml.kernel.org/r/20250509074635.3187114-1-changyuanl@google.com
Link: https://lkml.kernel.org/r/20250509074635.3187114-2-changyuanl@google.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Co-developed-by: Changyuan Lyu <changyuanl@google.com>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ashish Kalra <ashish.kalra@amd.com>
Cc: Ben Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Gowans <jgowans@amazon.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pratyush Yadav <ptyadav@amazon.de>
Cc: Rob Herring <robh@kernel.org>
Cc: Saravana Kannan <saravanak@google.com>
Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Thomas Lendacky <thomas.lendacky@amd.com>
Cc: Will Deacon <will@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12 23:50:38 -07:00
Tom Lendacky
da8bf5daa5 memblock: Accept allocated memory before use in memblock_double_array()
When increasing the array size in memblock_double_array() and the slab
is not yet available, a call to memblock_find_in_range() is used to
reserve/allocate memory. However, the range returned may not have been
accepted, which can result in a crash when booting an SNP guest:

  RIP: 0010:memcpy_orig+0x68/0x130
  Code: ...
  RSP: 0000:ffffffff9cc03ce8 EFLAGS: 00010006
  RAX: ff11001ff83e5000 RBX: 0000000000000000 RCX: fffffffffffff000
  RDX: 0000000000000bc0 RSI: ffffffff9dba8860 RDI: ff11001ff83e5c00
  RBP: 0000000000002000 R08: 0000000000000000 R09: 0000000000002000
  R10: 000000207fffe000 R11: 0000040000000000 R12: ffffffff9d06ef78
  R13: ff11001ff83e5000 R14: ffffffff9dba7c60 R15: 0000000000000c00
  memblock_double_array+0xff/0x310
  memblock_add_range+0x1fb/0x2f0
  memblock_reserve+0x4f/0xa0
  memblock_alloc_range_nid+0xac/0x130
  memblock_alloc_internal+0x53/0xc0
  memblock_alloc_try_nid+0x3d/0xa0
  swiotlb_init_remap+0x149/0x2f0
  mem_init+0xb/0xb0
  mm_core_init+0x8f/0x350
  start_kernel+0x17e/0x5d0
  x86_64_start_reservations+0x14/0x30
  x86_64_start_kernel+0x92/0xa0
  secondary_startup_64_no_verify+0x194/0x19b

Mitigate this by calling accept_memory() on the memory range returned
before the slab is available.

Prior to v6.12, the accept_memory() interface used a 'start' and 'end'
parameter instead of 'start' and 'size', therefore the accept_memory()
call must be adjusted to specify 'start + size' for 'end' when applying
to kernels prior to v6.12.

Cc: stable@vger.kernel.org # see patch description, needs adjustments for <= 6.11
Fixes: dcdfdd40fa ("mm: Add support for unaccepted memory")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/da1ac73bf4ded761e21b4e4bb5178382a580cd73.1746725050.git.thomas.lendacky@amd.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2025-05-09 08:53:12 +03:00
Wei Yang
eac8ea8736 mm/memblock: repeat setting reserved region nid if array is doubled
Commit 61167ad5fe ("mm: pass nid to reserve_bootmem_region()") introduce
a way to set nid to all reserved region.

But there is a corner case it will leave some region with invalid nid.
When memblock_set_node() doubles the array of memblock.reserved, it may
lead to a new reserved region before current position. The new region
will be left with an invalid node id.

Repeat the process when detecting it.

Fixes: 61167ad5fe ("mm: pass nid to reserve_bootmem_region()")
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
CC: Mike Rapoport <rppt@kernel.org>
CC: Yajun Deng <yajun.deng@linux.dev>
CC: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250318071948.23854-3-richard.weiyang@gmail.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2025-04-07 09:28:01 +03:00
Wei Yang
06eaa824fd mm/memblock: pass size instead of end to memblock_set_node()
The second parameter of memblock_set_node() is size instead of end.

Since it iterates from lower address to higher address, finally the node
id is correct. But during the process, some of them are wrong.

Pass size instead of end.

Fixes: 61167ad5fe ("mm: pass nid to reserve_bootmem_region()")
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
CC: Mike Rapoport <rppt@kernel.org>
CC: Yajun Deng <yajun.deng@linux.dev>
CC: stable@vger.kernel.org
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20250318071948.23854-2-richard.weiyang@gmail.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2025-04-07 09:28:01 +03:00
Mike Rapoport (Microsoft)
7790c9c926 memblock: don't release high memory to page allocator when HIGHMEM is off
Nathan Chancellor reports the following crash on a MIPS system with
CONFIG_HIGHMEM=n:

  Linux version 6.14.0-rc6-00359-g6faea3422e3b (nathan@ax162) (mips-linux-gcc (GCC) 14.2.0, GNU ld (GNU Binutils) 2.42) #1 SMP Fri Mar 21 08:12:02 MST 2025
  earlycon: uart8250 at I/O port 0x3f8 (options '38400n8')
  printk: legacy bootconsole [uart8250] enabled
  Config serial console: console=ttyS0,38400n8r
  CPU0 revision is: 00019300 (MIPS 24Kc)
  FPU revision is: 00739300
  MIPS: machine is mti,malta
  Software DMA cache coherency enabled
  Initial ramdisk at: 0x8fad0000 (5360128 bytes)
  OF: reserved mem: Reserved memory: No reserved-memory node in the DT
  Primary instruction cache 2kB, VIPT, 2-way, linesize 16 bytes.
  Primary data cache 2kB, 2-way, VIPT, no aliases, linesize 16 bytes
  Zone ranges:
    DMA      [mem 0x0000000000000000-0x0000000000ffffff]
    Normal   [mem 0x0000000001000000-0x000000001fffffff]
  Movable zone start for each node
  Early memory node ranges
    node   0: [mem 0x0000000000000000-0x000000000fffffff]
    node   0: [mem 0x0000000090000000-0x000000009fffffff]
  Initmem setup node 0 [mem 0x0000000000000000-0x000000009fffffff]
  On node 0, zone Normal: 16384 pages in unavailable ranges
  random: crng init done
  percpu: Embedded 3 pages/cpu s18832 r8192 d22128 u49152
  Kernel command line: rd_start=0xffffffff8fad0000 rd_size=5360128  console=ttyS0,38400n8r
  printk: log buffer data + meta data: 32768 + 102400 = 135168 bytes
  Dentry cache hash table entries: 65536 (order: 4, 262144 bytes, linear)
  Inode-cache hash table entries: 32768 (order: 3, 131072 bytes, linear)
  Writing ErrCtl register=00000000
  Readback ErrCtl register=00000000
  Built 1 zonelists, mobility grouping on.  Total pages: 16384
  mem auto-init: stack:all(zero), heap alloc:off, heap free:off
  Unhandled kernel unaligned access[#1]:
  CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.14.0-rc6-00359-g6faea3422e3b #1
  Hardware name: mti,malta
  $ 0   : 00000000 00000001 81cb0880 00129027
  $ 4   : 00000001 0000000a 00000002 00129026
  $ 8   : ffffdfff 80101e00 00000002 00000000
  $12   : 81c9c224 81c63e68 00000002 00000000
  $16   : 805b1e00 00025800 81cb0880 00000002
  $20   : 00000000 81c63e64 0000000a 81f10000
  $24   : 81c63e64 81c63e60
  $28   : 81c60000 81c63de0 00000001 81cc9d20
  Hi    : 00000000
  Lo    : 00000000
  epc   : 814a227c __free_pages_ok+0x144/0x3c0
  ra    : 81cc9d20 memblock_free_all+0x1d4/0x27c
  Status: 10000002        KERNEL EXL
  Cause : 00800410 (ExcCode 04)
  BadVA : 00129026
  PrId  : 00019300 (MIPS 24Kc)
  Modules linked in:
  Process swapper (pid: 0, threadinfo=(ptrval), task=(ptrval), tls=00000000)
  Stack : 81f10000 805a9e00 81c80000 00000000 00000002 814aa240 000003ff 00000400
          00000000 81f10000 81c9c224 00003b1f 81c80000 81c63e60 81ca0000 81c63e64
          81f10000 0000000a 0000001f 81cc9d20 81f10000 81cc96d8 00000000 81c80000
          81c9c224 81c63e60 81c63e64 00000000 81f10000 00024000 00028000 00025c00
          90000000 a0000000 00000002 00000017 00000000 00000000 81f10000 81f10000
          ...
  Call Trace:
  [<814a227c>] __free_pages_ok+0x144/0x3c0
  [<81cc9d20>] memblock_free_all+0x1d4/0x27c
  [<81cc6764>] mm_core_init+0x100/0x138
  [<81cb4ba4>] start_kernel+0x4a0/0x6e4

  Code: 1080ffd5  02003825  2467ffff <8ce30000> 7c630500  1060ffd4  00000000  8ce30000  7c630180

The crash happens because commit 6faea3422e ("arch, mm: streamline
HIGHMEM freeing") too eagerly frees high memory to the page allocator even
when HIGHMEM is disabled.

Make sure that when CONFIG_HIGHMEM=n the high memory is not released to the
page allocator.

Link: https://lore.kernel.org/all/20250323190647.GA1009914@ax162
Link: https://lkml.kernel.org/r/20250325114928.1791109-3-rppt@kernel.org
Reported-by: Nathan Chancellor <nathan@kernel.org>
Fixes: 6faea3422e ("arch, mm: streamline HIGHMEM freeing")
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01 15:17:11 -07:00
Linus Torvalds
eb0ece1602 - The 6 patch series "Enable strict percpu address space checks" from
Uros Bizjak uses x86 named address space qualifiers to provide
   compile-time checking of percpu area accesses.
 
   This has caused a small amount of fallout - two or three issues were
   reported.  In all cases the calling code was founf to be incorrect.
 
 - The 4 patch series "Some cleanup for memcg" from Chen Ridong
   implements some relatively monir cleanups for the memcontrol code.
 
 - The 17 patch series "mm: fixes for device-exclusive entries (hmm)"
   from David Hildenbrand fixes a boatload of issues which David found then
   using device-exclusive PTE entries when THP is enabled.  More work is
   needed, but this makes thins better - our own HMM selftests now succeed.
 
 - The 2 patch series "mm: zswap: remove z3fold and zbud" from Yosry
   Ahmed remove the z3fold and zbud implementations.  They have been
   deprecated for half a year and nobody has complained.
 
 - The 5 patch series "mm: further simplify VMA merge operation" from
   Lorenzo Stoakes implements numerous simplifications in this area.  No
   runtime effects are anticipated.
 
 - The 4 patch series "mm/madvise: remove redundant mmap_lock operations
   from process_madvise()" from SeongJae Park rationalizes the locking in
   the madvise() implementation.  Performance gains of 20-25% were observed
   in one MADV_DONTNEED microbenchmark.
 
 - The 12 patch series "Tiny cleanup and improvements about SWAP code"
   from Baoquan He contains a number of touchups to issues which Baoquan
   noticed when working on the swap code.
 
 - The 2 patch series "mm: kmemleak: Usability improvements" from Catalin
   Marinas implements a couple of improvements to the kmemleak user-visible
   output.
 
 - The 2 patch series "mm/damon/paddr: fix large folios access and
   schemes handling" from Usama Arif provides a couple of fixes for DAMON's
   handling of large folios.
 
 - The 3 patch series "mm/damon/core: fix wrong and/or useless
   damos_walk() behaviors" from SeongJae Park fixes a few issues with the
   accuracy of kdamond's walking of DAMON regions.
 
 - The 3 patch series "expose mapping wrprotect, fix fb_defio use" from
   Lorenzo Stoakes changes the interaction between framebuffer deferred-io
   and core MM.  No functional changes are anticipated - this is
   preparatory work for the future removal of page structure fields.
 
 - The 4 patch series "mm/damon: add support for hugepage_size DAMOS
   filter" from Usama Arif adds a DAMOS filter which permits the filtering
   by huge page sizes.
 
 - The 4 patch series "mm: permit guard regions for file-backed/shmem
   mappings" from Lorenzo Stoakes extends the guard region feature from its
   present "anon mappings only" state.  The feature now covers shmem and
   file-backed mappings.
 
 - The 4 patch series "mm: batched unmap lazyfree large folios during
   reclamation" from Barry Song cleans up and speeds up the unmapping for
   pte-mapped large folios.
 
 - The 18 patch series "reimplement per-vma lock as a refcount" from
   Suren Baghdasaryan puts the vm_lock back into the vma.  Our reasons for
   pulling it out were largely bogus and that change made the code more
   messy.  This patchset provides small (0-10%) improvements on one
   microbenchmark.
 
 - The 5 patch series "Docs/mm/damon: misc DAMOS filters documentation
   fixes and improves" from SeongJae Park does some maintenance work on the
   DAMON docs.
 
 - The 27 patch series "hugetlb/CMA improvements for large systems" from
   Frank van der Linden addresses a pile of issues which have been observed
   when using CMA on large machines.
 
 - The 2 patch series "mm/damon: introduce DAMOS filter type for unmapped
   pages" from SeongJae Park enables users of DMAON/DAMOS to filter my the
   page's mapped/unmapped status.
 
 - The 19 patch series "zsmalloc/zram: there be preemption" from Sergey
   Senozhatsky teaches zram to run its compression and decompression
   operations preemptibly.
 
 - The 12 patch series "selftests/mm: Some cleanups from trying to run
   them" from Brendan Jackman fixes a pile of unrelated issues which
   Brendan encountered while runnimg our selftests.
 
 - The 2 patch series "fs/proc/task_mmu: add guard region bit to pagemap"
   from Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
   determine whether a particular page is a guard page.
 
 - The 7 patch series "mm, swap: remove swap slot cache" from Kairui Song
   removes the swap slot cache from the allocation path - it simply wasn't
   being effective.
 
 - The 5 patch series "mm: cleanups for device-exclusive entries (hmm)"
   from David Hildenbrand implements a number of unrelated cleanups in this
   code.
 
 - The 5 patch series "mm: Rework generic PTDUMP configs" from Anshuman
   Khandual implements a number of preparatoty cleanups to the
   GENERIC_PTDUMP Kconfig logic.
 
 - The 8 patch series "mm/damon: auto-tune aggregation interval" from
   SeongJae Park implements a feedback-driven automatic tuning feature for
   DAMON's aggregation interval tuning.
 
 - The 5 patch series "Fix lazy mmu mode" from Ryan Roberts fixes some
   issues in powerpc, sparc and x86 lazy MMU implementations.  Ryan did
   this in preparation for implementing lazy mmu mode for arm64 to optimize
   vmalloc.
 
 - The 2 patch series "mm/page_alloc: Some clarifications for migratetype
   fallback" from Brendan Jackman reworks some commentary to make the code
   easier to follow.
 
 - The 3 patch series "page_counter cleanup and size reduction" from
   Shakeel Butt cleans up the page_counter code and fixes a size increase
   which we accidentally added late last year.
 
 - The 3 patch series "Add a command line option that enables control of
   how many threads should be used to allocate huge pages" from Thomas
   Prescher does that.  It allows the careful operator to significantly
   reduce boot time by tuning the parallalization of huge page
   initialization.
 
 - The 3 patch series "Fix calculations in trace_balance_dirty_pages()
   for cgwb" from Tang Yizhou fixes the tracing output from the dirty page
   balancing code.
 
 - The 9 patch series "mm/damon: make allow filters after reject filters
   useful and intuitive" from SeongJae Park improves the handling of allow
   and reject filters.  Behaviour is made more consistent and the
   documention is updated accordingly.
 
 - The 5 patch series "Switch zswap to object read/write APIs" from Yosry
   Ahmed updates zswap to the new object read/write APIs and thus permits
   the removal of some legacy code from zpool and zsmalloc.
 
 - The 6 patch series "Some trivial cleanups for shmem" from Baolin Wang
   does as it claims.
 
 - The 20 patch series "fs/dax: Fix ZONE_DEVICE page reference counts"
   from Alistair Popple regularizes the weird ZONE_DEVICE page refcount
   handling in DAX, permittig the removal of a number of special-case
   checks.
 
 - The 4 patch series "refactor mremap and fix bug" from Lorenzo Stoakes
   is a preparatoty refactoring and cleanup of the mremap() code.
 
 - The 20 patch series "mm: MM owner tracking for large folios (!hugetlb)
   + CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
   which we determine whether a large folio is known to be mapped
   exclusively into a single MM.
 
 - The 8 patch series "mm/damon: add sysfs dirs for managing DAMOS
   filters based on handling layers" from SeongJae Park adds a couple of
   new sysfs directories to ease the management of DAMON/DAMOS filters.
 
 - The 13 patch series "arch, mm: reduce code duplication in mem_init()"
   from Mike Rapoport consolidates many per-arch implementations of
   mem_init() into code generic code, where that is practical.
 
 - The 13 patch series "mm/damon/sysfs: commit parameters online via
   damon_call()" from SeongJae Park continues the cleaning up of sysfs
   access to DAMON internal data.
 
 - The 3 patch series "mm: page_ext: Introduce new iteration API" from
   Luiz Capitulino reworks the page_ext initialization to fix a boot-time
   crash which was observed with an unusual combination of compile and
   cmdline options.
 
 - The 8 patch series "Buddy allocator like (or non-uniform) folio split"
   from Zi Yan reworks the code to split a folio into smaller folios.  The
   main benefit is lessened memory consumption: fewer post-split folios are
   generated.
 
 - The 2 patch series "Minimize xa_node allocation during xarry split"
   from Zi Yan reduces the number of xarray xa_nodes which are generated
   during an xarray split.
 
 - The 2 patch series "drivers/base/memory: Two cleanups" from Gavin Shan
   performs some maintenance work on the drivers/base/memory code.
 
 - The 3 patch series "Add tracepoints for lowmem reserves, watermarks
   and totalreserve_pages" from Martin Liu adds some more tracepoints to
   the page allocator code.
 
 - The 4 patch series "mm/madvise: cleanup requests validations and
   classifications" from SeongJae Park cleans up some warts which SeongJae
   observed during his earlier madvise work.
 
 - The 3 patch series "mm/hwpoison: Fix regressions in memory failure
   handling" from Shuai Xue addresses two quite serious regressions which
   Shuai has observed in the memory-failure implementation.
 
 - The 5 patch series "mm: reliable huge page allocator" from Johannes
   Weiner makes huge page allocations cheaper and more reliable by reducing
   fragmentation.
 
 - The 5 patch series "Minor memcg cleanups & prep for memdescs" from
   Matthew Wilcox is preparatory work for the future implementation of
   memdescs.
 
 - The 4 patch series "track memory used by balloon drivers" from Nico
   Pache introduces a way to track memory used by our various balloon
   drivers.
 
 - The 2 patch series "mm/damon: introduce DAMOS filter type for active
   pages" from Nhat Pham permits users to filter for active/inactive pages,
   separately for file and anon pages.
 
 - The 2 patch series "Adding Proactive Memory Reclaim Statistics" from
   Hao Jia separates the proactive reclaim statistics from the direct
   reclaim statistics.
 
 - The 2 patch series "mm/vmscan: don't try to reclaim hwpoison folio"
   from Jinjiang Tu fixes our handling of hwpoisoned pages within the
   reclaim code.
 -----BEGIN PGP SIGNATURE-----
 
 iHQEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ+nZaAAKCRDdBJ7gKXxA
 jsOWAPiP4r7CJHMZRK4eyJOkvS1a1r+TsIarrFZtjwvf/GIfAQCEG+JDxVfUaUSF
 Ee93qSSLR1BkNdDw+931Pu0mXfbnBw==
 =Pn2K
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - The series "Enable strict percpu address space checks" from Uros
   Bizjak uses x86 named address space qualifiers to provide
   compile-time checking of percpu area accesses.

   This has caused a small amount of fallout - two or three issues were
   reported. In all cases the calling code was found to be incorrect.

 - The series "Some cleanup for memcg" from Chen Ridong implements some
   relatively monir cleanups for the memcontrol code.

 - The series "mm: fixes for device-exclusive entries (hmm)" from David
   Hildenbrand fixes a boatload of issues which David found then using
   device-exclusive PTE entries when THP is enabled. More work is
   needed, but this makes thins better - our own HMM selftests now
   succeed.

 - The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed
   remove the z3fold and zbud implementations. They have been deprecated
   for half a year and nobody has complained.

 - The series "mm: further simplify VMA merge operation" from Lorenzo
   Stoakes implements numerous simplifications in this area. No runtime
   effects are anticipated.

 - The series "mm/madvise: remove redundant mmap_lock operations from
   process_madvise()" from SeongJae Park rationalizes the locking in the
   madvise() implementation. Performance gains of 20-25% were observed
   in one MADV_DONTNEED microbenchmark.

 - The series "Tiny cleanup and improvements about SWAP code" from
   Baoquan He contains a number of touchups to issues which Baoquan
   noticed when working on the swap code.

 - The series "mm: kmemleak: Usability improvements" from Catalin
   Marinas implements a couple of improvements to the kmemleak
   user-visible output.

 - The series "mm/damon/paddr: fix large folios access and schemes
   handling" from Usama Arif provides a couple of fixes for DAMON's
   handling of large folios.

 - The series "mm/damon/core: fix wrong and/or useless damos_walk()
   behaviors" from SeongJae Park fixes a few issues with the accuracy of
   kdamond's walking of DAMON regions.

 - The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo
   Stoakes changes the interaction between framebuffer deferred-io and
   core MM. No functional changes are anticipated - this is preparatory
   work for the future removal of page structure fields.

 - The series "mm/damon: add support for hugepage_size DAMOS filter"
   from Usama Arif adds a DAMOS filter which permits the filtering by
   huge page sizes.

 - The series "mm: permit guard regions for file-backed/shmem mappings"
   from Lorenzo Stoakes extends the guard region feature from its
   present "anon mappings only" state. The feature now covers shmem and
   file-backed mappings.

 - The series "mm: batched unmap lazyfree large folios during
   reclamation" from Barry Song cleans up and speeds up the unmapping
   for pte-mapped large folios.

 - The series "reimplement per-vma lock as a refcount" from Suren
   Baghdasaryan puts the vm_lock back into the vma. Our reasons for
   pulling it out were largely bogus and that change made the code more
   messy. This patchset provides small (0-10%) improvements on one
   microbenchmark.

 - The series "Docs/mm/damon: misc DAMOS filters documentation fixes and
   improves" from SeongJae Park does some maintenance work on the DAMON
   docs.

 - The series "hugetlb/CMA improvements for large systems" from Frank
   van der Linden addresses a pile of issues which have been observed
   when using CMA on large machines.

 - The series "mm/damon: introduce DAMOS filter type for unmapped pages"
   from SeongJae Park enables users of DMAON/DAMOS to filter my the
   page's mapped/unmapped status.

 - The series "zsmalloc/zram: there be preemption" from Sergey
   Senozhatsky teaches zram to run its compression and decompression
   operations preemptibly.

 - The series "selftests/mm: Some cleanups from trying to run them" from
   Brendan Jackman fixes a pile of unrelated issues which Brendan
   encountered while runnimg our selftests.

 - The series "fs/proc/task_mmu: add guard region bit to pagemap" from
   Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
   determine whether a particular page is a guard page.

 - The series "mm, swap: remove swap slot cache" from Kairui Song
   removes the swap slot cache from the allocation path - it simply
   wasn't being effective.

 - The series "mm: cleanups for device-exclusive entries (hmm)" from
   David Hildenbrand implements a number of unrelated cleanups in this
   code.

 - The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual
   implements a number of preparatoty cleanups to the GENERIC_PTDUMP
   Kconfig logic.

 - The series "mm/damon: auto-tune aggregation interval" from SeongJae
   Park implements a feedback-driven automatic tuning feature for
   DAMON's aggregation interval tuning.

 - The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in
   powerpc, sparc and x86 lazy MMU implementations. Ryan did this in
   preparation for implementing lazy mmu mode for arm64 to optimize
   vmalloc.

 - The series "mm/page_alloc: Some clarifications for migratetype
   fallback" from Brendan Jackman reworks some commentary to make the
   code easier to follow.

 - The series "page_counter cleanup and size reduction" from Shakeel
   Butt cleans up the page_counter code and fixes a size increase which
   we accidentally added late last year.

 - The series "Add a command line option that enables control of how
   many threads should be used to allocate huge pages" from Thomas
   Prescher does that. It allows the careful operator to significantly
   reduce boot time by tuning the parallalization of huge page
   initialization.

 - The series "Fix calculations in trace_balance_dirty_pages() for cgwb"
   from Tang Yizhou fixes the tracing output from the dirty page
   balancing code.

 - The series "mm/damon: make allow filters after reject filters useful
   and intuitive" from SeongJae Park improves the handling of allow and
   reject filters. Behaviour is made more consistent and the documention
   is updated accordingly.

 - The series "Switch zswap to object read/write APIs" from Yosry Ahmed
   updates zswap to the new object read/write APIs and thus permits the
   removal of some legacy code from zpool and zsmalloc.

 - The series "Some trivial cleanups for shmem" from Baolin Wang does as
   it claims.

 - The series "fs/dax: Fix ZONE_DEVICE page reference counts" from
   Alistair Popple regularizes the weird ZONE_DEVICE page refcount
   handling in DAX, permittig the removal of a number of special-case
   checks.

 - The series "refactor mremap and fix bug" from Lorenzo Stoakes is a
   preparatoty refactoring and cleanup of the mremap() code.

 - The series "mm: MM owner tracking for large folios (!hugetlb) +
   CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
   which we determine whether a large folio is known to be mapped
   exclusively into a single MM.

 - The series "mm/damon: add sysfs dirs for managing DAMOS filters based
   on handling layers" from SeongJae Park adds a couple of new sysfs
   directories to ease the management of DAMON/DAMOS filters.

 - The series "arch, mm: reduce code duplication in mem_init()" from
   Mike Rapoport consolidates many per-arch implementations of
   mem_init() into code generic code, where that is practical.

 - The series "mm/damon/sysfs: commit parameters online via
   damon_call()" from SeongJae Park continues the cleaning up of sysfs
   access to DAMON internal data.

 - The series "mm: page_ext: Introduce new iteration API" from Luiz
   Capitulino reworks the page_ext initialization to fix a boot-time
   crash which was observed with an unusual combination of compile and
   cmdline options.

 - The series "Buddy allocator like (or non-uniform) folio split" from
   Zi Yan reworks the code to split a folio into smaller folios. The
   main benefit is lessened memory consumption: fewer post-split folios
   are generated.

 - The series "Minimize xa_node allocation during xarry split" from Zi
   Yan reduces the number of xarray xa_nodes which are generated during
   an xarray split.

 - The series "drivers/base/memory: Two cleanups" from Gavin Shan
   performs some maintenance work on the drivers/base/memory code.

 - The series "Add tracepoints for lowmem reserves, watermarks and
   totalreserve_pages" from Martin Liu adds some more tracepoints to the
   page allocator code.

 - The series "mm/madvise: cleanup requests validations and
   classifications" from SeongJae Park cleans up some warts which
   SeongJae observed during his earlier madvise work.

 - The series "mm/hwpoison: Fix regressions in memory failure handling"
   from Shuai Xue addresses two quite serious regressions which Shuai
   has observed in the memory-failure implementation.

 - The series "mm: reliable huge page allocator" from Johannes Weiner
   makes huge page allocations cheaper and more reliable by reducing
   fragmentation.

 - The series "Minor memcg cleanups & prep for memdescs" from Matthew
   Wilcox is preparatory work for the future implementation of memdescs.

 - The series "track memory used by balloon drivers" from Nico Pache
   introduces a way to track memory used by our various balloon drivers.

 - The series "mm/damon: introduce DAMOS filter type for active pages"
   from Nhat Pham permits users to filter for active/inactive pages,
   separately for file and anon pages.

 - The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia
   separates the proactive reclaim statistics from the direct reclaim
   statistics.

 - The series "mm/vmscan: don't try to reclaim hwpoison folio" from
   Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim
   code.

* tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits)
  mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex()
  x86/mm: restore early initialization of high_memory for 32-bits
  mm/vmscan: don't try to reclaim hwpoison folio
  mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper
  cgroup: docs: add pswpin and pswpout items in cgroup v2 doc
  mm: vmscan: split proactive reclaim statistics from direct reclaim statistics
  selftests/mm: speed up split_huge_page_test
  selftests/mm: uffd-unit-tests support for hugepages > 2M
  docs/mm/damon/design: document active DAMOS filter type
  mm/damon: implement a new DAMOS filter type for active pages
  fs/dax: don't disassociate zero page entries
  MM documentation: add "Unaccepted" meminfo entry
  selftests/mm: add commentary about 9pfs bugs
  fork: use __vmalloc_node() for stack allocation
  docs/mm: Physical Memory: Populate the "Zones" section
  xen: balloon: update the NR_BALLOON_PAGES state
  hv_balloon: update the NR_BALLOON_PAGES state
  balloon_compaction: update the NR_BALLOON_PAGES state
  meminfo: add a per node counter for balloon drivers
  mm: remove references to folio in __memcg_kmem_uncharge_page()
  ...
2025-04-01 09:29:18 -07:00
Masami Hiramatsu (Google)
74e2498ccf mm/memblock: Add reserved memory release function
Add reserve_mem_release_by_name() to release a reserved memory region
with a given name. This allows us to release reserved memory which is
defined by kernel cmdline, after boot.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: linux-mm@kvack.org
Link: https://lore.kernel.org/173989133862.230693.14094993331347437600.stgit@devnote2
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-28 08:39:28 -04:00
Mike Rapoport (Microsoft)
6faea3422e arch, mm: streamline HIGHMEM freeing
All architectures that support HIGHMEM have their code that frees high
memory pages to the buddy allocator while __free_memory_core() is limited
to freeing only low memory.

There is no actual reason for that.  The memory map is completely ready by
the time memblock_free_all() is called and high pages can be released to
the buddy allocator along with low memory.

Remove low memory limit from __free_memory_core() and drop per-architecture
code that frees high memory pages.

Link: https://lkml.kernel.org/r/20250313135003.836600-12-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>	[x86]
Tested-by: Mark Brown <broonie@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Guo Ren (csky) <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russel King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17 22:06:53 -07:00
Guo Weikang
c6f239796b mm/memblock: add memblock_alloc_or_panic interface
Before SLUB initialization, various subsystems used memblock_alloc to
allocate memory.  In most cases, when memory allocation fails, an
immediate panic is required.  To simplify this behavior and reduce
repetitive checks, introduce `memblock_alloc_or_panic`.  This function
ensures that memory allocation failures result in a panic automatically,
improving code readability and consistency across subsystems that require
this behavior.

[guoweikang.kernel@gmail.com: arch/s390: save_area_alloc default failure behavior changed to panic]
  Link: https://lkml.kernel.org/r/20250109033136.2845676-1-guoweikang.kernel@gmail.com
  Link: https://lore.kernel.org/lkml/Z2fknmnNtiZbCc7x@kernel.org/
Link: https://lkml.kernel.org/r/20250102072528.650926-1-guoweikang.kernel@gmail.com
Signed-off-by: Guo Weikang <guoweikang.kernel@gmail.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>	[s390]
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25 20:22:38 -08:00
Mike Rapoport (Microsoft)
9cdc6423ac memblock: allow zero threshold in validate_numa_converage()
Currently memblock validate_numa_converage() returns false negative when
threshold set to zero.

Make the check if the memory size with invalid node ID is greater than
the threshold exclusive to fix that.

Link: https://lore.kernel.org/all/Z0mIDBD4KLyxyOCm@kernel.org/
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2024-12-01 21:08:56 +02:00
Linus Torvalds
aa486552a1 memblock: updates for 6.12-rc1
* new memblock_estimated_nr_free_pages() helper to replace totalram_pages()
   which is less accurate when CONFIG_DEFERRED_STRUCT_PAGE_INIT is set
 * fixes for memblock tests
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEeOVYVaWZL5900a/pOQOGJssO/ZEFAmbejv0QHHJwcHRAa2Vy
 bmVsLm9yZwAKCRA5A4Ymyw79kVVlB/4yOoCDvJyUocEY0/Zv5bdRGXlAI0Igp3VV
 E0rEpvIjTBWwp/KZziQ8zMFk5zL/Aqb081vRsCko0lh2wjD5tFgNWWJG/sryQ/tX
 vc88p83KEXxNy4QC1qCh8dvHGIZVuLQ8oWQ7QFuH2ResdOaLdcfnobcu6/W/pBE0
 60/0bNdNgFPgnCpFIcWvGFOqZ10akhw4xYrwRsCKAQEeqeKyQE/DBFUvNrqkOuNG
 +4k71X/9mcuEDBKGRCf5XzCf7nwk4k8pzOc4xMeEhAaaV2uZdENfQuu1Av7nqRah
 zhYveo0Wd0cnGWORBT/ddzPDeBjdP2ZM9qR70yoSj2mQ7a3ixLfd
 =wtsK
 -----END PGP SIGNATURE-----

Merge tag 'memblock-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock

Pull memblock updates from Mike Rapoport:

 - new memblock_estimated_nr_free_pages() helper to replace
   totalram_pages() which is less accurate when
   CONFIG_DEFERRED_STRUCT_PAGE_INIT is set

 - fixes for memblock tests

* tag 'memblock-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
  s390/mm: get estimated free pages by memblock api
  kernel/fork.c: get estimated free pages by memblock api
  mm/memblock: introduce a new helper memblock_estimated_nr_free_pages()
  memblock test: fix implicit declaration of function 'strscpy'
  memblock test: fix implicit declaration of function 'isspace'
  memblock test: fix implicit declaration of function 'memparse'
  memblock test: add the definition of __setup()
  memblock test: fix implicit declaration of function 'virt_to_phys'
  tools/testing: abstract two init.h into common include directory
  memblock tests: include export.h in linkage.h as kernel dose
  memblock tests: include memory_hotplug.h in mmzone.h as kernel dose
2024-09-25 11:35:19 -07:00
Kirill A. Shutemov
5adfeaecc4 mm: rework accept memory helpers
Make accept_memory() and range_contains_unaccepted_memory() take 'start'
and 'size' arguments instead of 'start' and 'end'.

Remove accept_page(), replacing it with direct calls to accept_memory(). 
The accept_page() name is going to be used for a different function.

Link: https://lkml.kernel.org/r/20240809114854.3745464-6-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01 20:26:07 -07:00
Wei Yang
d0f8a8973f mm/memblock: introduce a new helper memblock_estimated_nr_free_pages()
During bootup, system may need the number of free pages in the whole system
to do some calculation before all pages are freed to buddy system. Usually
this number is get from totalram_pages(). Since we plan to move the free
pages accounting in __free_pages_core(), this value may not represent
total free pages at the early stage, especially when
CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled.

Instead of using raw memblock api, let's introduce a new helper for user
to get the estimated number of free pages from memblock point of view.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
CC: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20240808001415.6298-1-richard.weiyang@gmail.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2024-08-11 19:18:52 +03:00
Linus Torvalds
b2fc97c186 memblock: updates for 6.11-rc1
* reserve_mem command line parameter to allow creation of named memory
   reservation at boot time.
   The driving use-case is to improve the ability of pstore to retain
   ramoops data across reboots.
 * cleaunps and small improvements in memblock and mm_init
 * new tests cases in memblock test suite
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEeOVYVaWZL5900a/pOQOGJssO/ZEFAmaXfoIQHHJwcHRAa2Vy
 bmVsLm9yZwAKCRA5A4Ymyw79kU5mCAC23vIrB8FRlORczMYj+V3VFss3OjKT92lS
 fHGwq2oxHW+rdDpHXFObHU0D3k8d2l5jyrENRAAyA02qR0L6Pv8Na6pGxQua1eic
 VIdw0PFQMsizD1AIj84Y6skkyyF/tvZHpmX0B12D5+Ur65DC/Z867Cm/lE33/fHv
 /1+QB0JlG7W+FzxVstYyebY5/DVkH+bC7/A57FE2oB4BRXjEd8v9tTHBS4kRSvrE
 zE2KFxeGajN749LHztIpIprPKehu8Gc3oLrYLNJO+uLFVCV8ey3OqVj0RXMG2wLl
 hmVYqhbZM/Uz59D/P8pULD49f1Thjv/5A/MvUZ3SxM6zpWlsincf
 =xrZd
 -----END PGP SIGNATURE-----

Merge tag 'memblock-v6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock

Pull memblock updates from Mike Rapoport:

 - 'reserve_mem' command line parameter to allow creation of named
   memory reservation at boot time.

   The driving use-case is to improve the ability of pstore to retain
   ramoops data across reboots.

 - cleanups and small improvements in memblock and mm_init

 - new tests cases in memblock test suite

* tag 'memblock-v6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
  memblock tests: fix implicit declaration of function 'numa_valid_node'
  memblock: Move late alloc warning down to phys alloc
  pstore/ramoops: Add ramoops.mem_name= command line option
  mm/memblock: Add "reserve_mem" to reserved named memory at boot up
  mm/mm_init.c: don't initialize page->lru again
  mm/mm_init.c: not always search next deferred_init_pfn from very beginning
  mm/mm_init.c: use deferred_init_mem_pfn_range_in_zone() to decide loop condition
  mm/mm_init.c: get the highest zone directly
  mm/mm_init.c: move nr_initialised reset down a bit
  mm/memblock: fix a typo in description of for_each_mem_region()
  mm/mm_init.c: use memblock_region_memory_base_pfn() to get startpfn
  mm/memblock: use PAGE_ALIGN_DOWN to get pgend in free_memmap
  mm/memblock: return true directly on finding overlap region
  memblock tests: add memblock_overlaps_region_checks
  mm/memblock: fix comment for memblock_isolate_range()
  memblock tests: add memblock_reserve_many_may_conflict_check()
  memblock tests: add memblock_reserve_all_locations_check()
  mm/memblock: remove empty dummy entry
2024-07-18 14:48:11 -07:00
James Gowans
94ff46de4a memblock: Move late alloc warning down to phys alloc
If a driver/subsystem tries to do an allocation after the memblock
allocations have been freed and the memory handed to the buddy
allocator, it will not actually be legal to use that allocation: the
buddy allocator owns the memory. Currently this mis-use is handled by
the memblock function which does allocations and returns virtual
addresses by printing a warning and doing a kmalloc instead. However
the physical allocation function does not to do this check - callers of
the physical alloc function are unprotected against mis-use.

Improve the error catching here by moving the check into the physical
allocation function which is used by the virtual addr allocation
function.

Signed-off-by: James Gowans <jgowans@amazon.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Alex Graf <graf@amazon.de>
Link: https://lore.kernel.org/r/20240619095555.85980-1-jgowans@amazon.com
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
2024-06-19 18:05:25 +03:00
Steven Rostedt (Google)
1e4c64b71c mm/memblock: Add "reserve_mem" to reserved named memory at boot up
In order to allow for requesting a memory region that can be used for
things like pstore on multiple machines where the memory layout is not the
same, add a new option to the kernel command line called "reserve_mem".

The format is:  reserve_mem=nn:align:name

Where it will find nn amount of memory at the given alignment of align.
The name field is to allow another subsystem to retrieve where the memory
was found. For example:

  reserve_mem=12M:4096:oops ramoops.mem_name=oops

Where ramoops.mem_name will tell ramoops that memory was reserved for it
via the reserve_mem option and it can find it by calling:

  if (reserve_mem_find_by_name("oops", &start, &size)) {
	// start holds the start address and size holds the size given

This is typically used for systems that do not wipe the RAM, and this
command line will try to reserve the same physical memory on soft reboots.
Note, it is not guaranteed to be the same location. For example, if KASLR
places the kernel at the location of where the RAM reservation was from a
previous boot, the new reservation will be at a different location.  Any
subsystem using this feature must add a way to verify that the contents of
the physical memory is from a previous boot, as there may be cases where
the memory will not be located at the same location.

Not all systems may work either. There could be bit flips if the reboot
goes through the BIOS. Using kexec to reboot the machine is likely to
have better results in such cases.

Link: https://lore.kernel.org/all/ZjJVnZUX3NZiGW6q@kernel.org/

Suggested-by: Mike Rapoport <rppt@kernel.org>
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20240613155527.437020271@goodmis.org
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
2024-06-19 10:59:49 +03:00
Mike Rapoport (IBM)
8043832e2a memblock: use numa_valid_node() helper to check for invalid node ID
Introduce numa_valid_node(nid) that verifies that nid is a valid node ID
and use that instead of comparing nid parameter with either NUMA_NO_NODE
or MAX_NUMNODES.

This makes the checks for valid node IDs consistent and more robust and
allows to get rid of multiple WARNings.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
2024-06-16 10:17:57 +03:00
Wei Yang
b73f6b98bb mm/memblock: use PAGE_ALIGN_DOWN to get pgend in free_memmap
Leverage the macro PAGE_ALIGN_DOWN to get pgend.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Link: https://lore.kernel.org/r/20240507075833.6346-7-richard.weiyang@gmail.com
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
2024-06-05 10:22:22 +03:00