Commit Graph

20 Commits

Author SHA1 Message Date
SeongJae Park
f98590bc08 mm/damon/stat: detect and use fresh enabled value
DAMON_STAT updates 'enabled' parameter value, which represents the running
status of its kdamond, when the user explicitly requests start/stop of the
kdamond.  The kdamond can, however, be stopped even if the user explicitly
requested the stop, if ctx->regions_score_histogram allocation failure at
beginning of the execution of the kdamond.  Hence, if the kdamond is
stopped by the allocation failure, the value of the parameter can be
stale.

Users could show the stale value and be confused.  The problem will only
rarely happen in real and common setups because the allocation is arguably
too small to fail.  Also, unlike the similar bugs that are now fixed in
DAMON_RECLAIM and DAMON_LRU_SORT, kdamond can be restarted in this case,
because DAMON_STAT force-updates the enabled parameter value for user
inputs.  The bug is a bug, though.

The issue stems from the fact that there are multiple events that can
change the status, and following all the events is challenging. 
Dynamically detect and use the fresh status for the parameters when those
are requested.

The issue was dicovered [1] by Sashiko.

Link: https://lore.kernel.org/20260419161003.79176-4-sj@kernel.org
Link: https://lore.kernel.org/20260416040602.88665-1-sj@kernel.org [1]
Fixes: 369c415e60 ("mm/damon: introduce DAMON_STAT module")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Liew Rui Yan <aethernet65535@gmail.com>
Cc: <stable@vger.kernel.org> # 6.17.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-27 05:54:27 -07:00
Linus Torvalds
40735a683b mm.git review status for linus..mm-stable
Everything:
 
 Total patches:       121
 Reviews/patch:       2.11
 Reviewed rate:       90%
 
 Excluding DAMON:
 
 Total patches:       113
 Reviews/patch:       2.25
 Reviewed rate:       96%
 
 - The 33 patch series "Eliminate Dying Memory Cgroup" from Qi Zheng and
   Muchun Song addresses the longstanding "dying memcg problem".  A
   situation wherein a no-longer-used memory control group will hang around
   for an extended period pointlessly consuming memory.  The [0/N]
   changelog has a good overview of this work.
 
 - The 3 patch series "fix unexpected type conversions and potential
   overflows" from Qi Zheng fixes a couple of potential 32-bit/64-bit
   issues which were identified during review of the "Eliminate Dying
   Memory Cgroup" series.
 
 - The 6 patch series "kho: history: track previous kernel version and
   kexec boot count" from Breno Leitao uses Kexec Handover (KHO) to pass
   the previous kernel's version string and the number of kexec reboots
   since the last cold boot to the next kernel, and prints it at boot time.
 
 - The 4 patch series "liveupdate: prevent double preservation" from
   Pasha Tatashin teaches LUO to avoid managing the same file across
   different active sessions.
 
 - The 10 patch series "liveupdate: Fix module unloading and unregister
   API" from Pasha Tatashin addresses an issue with how LUO handles module
   reference counting and unregistration during module unloading.
 
 - The 2 patch series "zswap pool per-CPU acomp_ctx simplifications" from
   Kanchana Sridhar simplifies and cleans up the zswap crypto compression
   handling and improves the lifecycle management of zswap pool's per-CPU
   acomp_ctx resources.
 
 - The 2 patch series "mm/damon/core: fix damon_call()/damos_walk() vs
   kdmond exit race" from SeongJae Park addresses unlikely but possible
   leaks and deadlocks in damon_call() and damon_walk().
 
 - The 2 patch series "mm/damon/core: validate damos_quota_goal->nid"
   from SeongJae Park fixes a couple of root-only wild pointer
   dereferences.
 
 - The 2 patch series "Docs/admin-guide/mm/damon: warn commit_inputs vs
   other params race" from SeongJae Park updates the DAMON documentation to
   warn operators about potential races which can occur if the
   commit_inputs parameter is altered at the wrong time.
 
 - The 3 patch series "Minor hmm_test fixes and cleanups" from Alistair
   Popple implements two bugfixes a cleanup for the HMM kernel selftests.
 
 - The 6 patch series "Modify memfd_luo code" from Chenghao Duan provides
   cleanups, simplifications and speedups in the memfd_lou code.
 
 - The 4 patch series "mm, kvm: allow uffd support in guest_memfd" from
   Mike Rapoport enables support for userfaultfd in guest_memfd.
 
 - The 6 patch series "selftests/mm: skip several tests when thp is not
   available" from Chunyu Hu fixes several issues in the selftests code
   which were causing breakage when the tests were run on CONFIG_THP=n
   kernels.
 
 - The 2 patch series "mm/mprotect: micro-optimization work" from Pedro
   Falcato implements a couple of nice speedups for mprotect().
 
 - The 3 patch series "MAINTAINERS: update KHO and LIVE UPDATE entries"
   from Pratyush Yadav reflects upcoming changes in the maintenance of KHO,
   LUO, memfd_luo, kexec, crash, kdump and probably other kexec-based
   things - they are being moved out of mm.git and into a new git tree.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaeNL/wAKCRDdBJ7gKXxA
 jt7EAQCEEQvYYTjld+8HJKsCbavY4pEfci7z4SBiQyIPjRracQD/ZfjXnzL7ucc1
 b6q6G4TcslvIDBgzVkk9G2BVn2oCoAg=
 =3ozv
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2026-04-18-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull more MM updates from Andrew Morton:

 - "Eliminate Dying Memory Cgroup" (Qi Zheng and Muchun Song)

   Address the longstanding "dying memcg problem". A situation wherein a
   no-longer-used memory control group will hang around for an extended
   period pointlessly consuming memory

 - "fix unexpected type conversions and potential overflows" (Qi Zheng)

   Fix a couple of potential 32-bit/64-bit issues which were identified
   during review of the "Eliminate Dying Memory Cgroup" series

 - "kho: history: track previous kernel version and kexec boot count"
   (Breno Leitao)

   Use Kexec Handover (KHO) to pass the previous kernel's version string
   and the number of kexec reboots since the last cold boot to the next
   kernel, and print it at boot time

 - "liveupdate: prevent double preservation" (Pasha Tatashin)

   Teach LUO to avoid managing the same file across different active
   sessions

 - "liveupdate: Fix module unloading and unregister API" (Pasha
   Tatashin)

   Address an issue with how LUO handles module reference counting and
   unregistration during module unloading

 - "zswap pool per-CPU acomp_ctx simplifications" (Kanchana Sridhar)

   Simplify and clean up the zswap crypto compression handling and
   improve the lifecycle management of zswap pool's per-CPU acomp_ctx
   resources

 - "mm/damon/core: fix damon_call()/damos_walk() vs kdmond exit race"
   (SeongJae Park)

   Address unlikely but possible leaks and deadlocks in damon_call() and
   damon_walk()

 - "mm/damon/core: validate damos_quota_goal->nid" (SeongJae Park)

   Fix a couple of root-only wild pointer dereferences

 - "Docs/admin-guide/mm/damon: warn commit_inputs vs other params race"
   (SeongJae Park)

   Update the DAMON documentation to warn operators about potential
   races which can occur if the commit_inputs parameter is altered at
   the wrong time

 - "Minor hmm_test fixes and cleanups" (Alistair Popple)

   Bugfixes and a cleanup for the HMM kernel selftests

 - "Modify memfd_luo code" (Chenghao Duan)

   Cleanups, simplifications and speedups to the memfd_lou code

 - "mm, kvm: allow uffd support in guest_memfd" (Mike Rapoport)

   Support for userfaultfd in guest_memfd

 - "selftests/mm: skip several tests when thp is not available" (Chunyu
   Hu)

   Fix several issues in the selftests code which were causing breakage
   when the tests were run on CONFIG_THP=n kernels

 - "mm/mprotect: micro-optimization work" (Pedro Falcato)

   A couple of nice speedups for mprotect()

 - "MAINTAINERS: update KHO and LIVE UPDATE entries" (Pratyush Yadav)

   Document upcoming changes in the maintenance of KHO, LUO, memfd_luo,
   kexec, crash, kdump and probably other kexec-based things - they are
   being moved out of mm.git and into a new git tree

* tag 'mm-stable-2026-04-18-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (121 commits)
  MAINTAINERS: add page cache reviewer
  mm/vmscan: avoid false-positive -Wuninitialized warning
  MAINTAINERS: update Dave's kdump reviewer email address
  MAINTAINERS: drop include/linux/liveupdate from LIVE UPDATE
  MAINTAINERS: drop include/linux/kho/abi/ from KHO
  MAINTAINERS: update KHO and LIVE UPDATE maintainers
  MAINTAINERS: update kexec/kdump maintainers entries
  mm/migrate_device: remove dead migration entry check in migrate_vma_collect_huge_pmd()
  selftests: mm: skip charge_reserved_hugetlb without killall
  userfaultfd: allow registration of ranges below mmap_min_addr
  mm/vmstat: fix vmstat_shepherd double-scheduling vmstat_update
  mm/hugetlb: fix early boot crash on parameters without '=' separator
  zram: reject unrecognized type= values in recompress_store()
  docs: proc: document ProtectionKey in smaps
  mm/mprotect: special-case small folios when applying permissions
  mm/mprotect: move softleaf code out of the main function
  mm: remove '!root_reclaim' checking in should_abort_scan()
  mm/sparse: fix comment for section map alignment
  mm/page_io: use sio->len for PSWPIN accounting in sio_read_complete()
  selftests/mm: transhuge_stress: skip the test when thp not available
  ...
2026-04-19 08:01:17 -07:00
Jackie Liu
e04ed278d2 mm/damon/stat: fix memory leak on damon_start() failure in damon_stat_start()
Destroy the DAMON context and reset the global pointer when damon_start()
fails.  Otherwise, the context allocated by damon_stat_build_ctx() is
leaked, and the stale damon_stat_context pointer will be overwritten on
the next enable attempt, making the old allocation permanently
unreachable.

Link: https://lore.kernel.org/20260331101553.88422-1-liu.yun@linux.dev
Fixes: 369c415e60 ("mm/damon: introduce DAMON_STAT module")
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> # 6.17.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18 00:10:51 -07:00
SeongJae Park
4c04c6b47c mm/damon/stat: deallocate damon_call() failure leaking damon_ctx
damon_stat_start() always allocates the module's damon_ctx object
(damon_stat_context).  Meanwhile, if damon_call() in the function fails,
the damon_ctx object is not deallocated.  Hence, if the damon_call() is
failed, and the user writes Y to “enabled” again, the previously
allocated damon_ctx object is leaked.

This cannot simply be fixed by deallocating the damon_ctx object when
damon_call() fails.  That's because damon_call() failure doesn't guarantee
the kdamond main function, which accesses the damon_ctx object, is
completely finished.  In other words, if damon_stat_start() deallocates
the damon_ctx object after damon_call() failure, the not-yet-terminated
kdamond could access the freed memory (use-after-free).

Fix the leak while avoiding the use-after-free by keeping returning
damon_stat_start() without deallocating the damon_ctx object after
damon_call() failure, but deallocating it when the function is invoked
again and the kdamond is completely terminated.  If the kdamond is not yet
terminated, simply return -EAGAIN, as the kdamond will soon be terminated.

The issue was discovered [1] by sashiko.

Link: https://lkml.kernel.org/r/20260402134418.74121-1-sj@kernel.org
Link: https://lore.kernel.org/20260401012428.86694-1-sj@kernel.org [1]
Fixes: 405f61996d ("mm/damon/stat: use damon_call() repeat mode instead of damon_callback")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> # 6.17.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-06 11:13:43 -07:00
SeongJae Park
84481e705a mm/damon/stat: monitor all System RAM resources
DAMON_STAT usage document (Documentation/admin-guide/mm/damon/stat.rst)
says it monitors the system's entire physical memory.  But, it is
monitoring only the biggest System RAM resource of the system.  When there
are multiple System RAM resources, this results in monitoring only an
unexpectedly small fraction of the physical memory.  For example, suppose
the system has a 500 GiB System RAM, 10 MiB non-System RAM, and 500 GiB
System RAM resources in order on the physical address space.  DAMON_STAT
will monitor only the first 500 GiB System RAM.  This situation is
particularly common on NUMA systems.

Select a physical address range that covers all System RAM areas of the
system, to fix this issue and make it work as documented.

[sj@kernel.org: return error if monitoring target region is invalid]
  Link: https://lkml.kernel.org/r/20260317053631.87907-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20260316235118.873-1-sj@kernel.org
Fixes: 369c415e60 ("mm/damon: introduce DAMON_STAT module")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org>	[6.17+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-23 09:35:05 -07:00
Kees Cook
189f164e57 Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL uses
Conversion performed via this Coccinelle script:

  // SPDX-License-Identifier: GPL-2.0-only
  // Options: --include-headers-for-types --all-includes --include-headers --keep-comments
  virtual patch

  @gfp depends on patch && !(file in "tools") && !(file in "samples")@
  identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex,
 		    kzalloc_obj,kzalloc_objs,kzalloc_flex,
		    kvmalloc_obj,kvmalloc_objs,kvmalloc_flex,
		    kvzalloc_obj,kvzalloc_objs,kvzalloc_flex};
  @@

  	ALLOC(...
  -		, GFP_KERNEL
  	)

  $ make coccicheck MODE=patch COCCI=gfp.cocci

Build and boot tested x86_64 with Fedora 42's GCC and Clang:

Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01
Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01

Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-22 08:26:33 -08:00
Kees Cook
69050f8d6d treewide: Replace kmalloc with kmalloc_obj for non-scalar types
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-21 01:02:28 -08:00
Li RongQing
06f5ff36e4 mm/damon/stat: remove __read_mostly from memory_idle_ms_percentiles
The 'memory_idle_ms_percentiles' array in DAMON_STAT is updated frequently
by the kernel to reflect the latest idle time statistics.  Marking it as
'__read_mostly' is inappropriate for data that is regularly written to, as
it can lead to cache pollution in the read-mostly section.

Remove the '__read_mostly' annotation to accurately reflect the
variable's usage pattern.

Link: https://lkml.kernel.org/r/20260130085603.1814-1-lirongqing@baidu.com
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-06 15:47:17 -08:00
SeongJae Park
cc1db8dff8 mm/damon: rename min_sz_region of damon_ctx to min_region_sz
'min_sz_region' field of 'struct damon_ctx' represents the minimum size of
each DAMON region for the context.  'struct damos_access_pattern' has a
field of the same name.  It confuses readers and makes 'grep' less optimal
for them.  Rename it to 'min_region_sz'.

Link: https://lkml.kernel.org/r/20260117175256.82826-9-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31 14:22:47 -08:00
Kevin Lourenco
5ec9bb6de4 mm/damon: fix typos in comments
Correct minor spelling mistakes in several files under mm/damon.  No
functional changes.

Link: https://lkml.kernel.org/r/20251217181216.47576-1-klourencodev@gmail.com
Signed-off-by: Kevin Lourenco <klourencodev@gmail.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 19:24:48 -08:00
JaeJoon Jung
9082f24bd3 mm/damon/stat: deduplicate intervals_goal setup in damon_stat_build_ctx()
The damon_stat_build_ctx() function sets the values of intervals_goal
structure members.  These values are applied to damon_ctx in
damon_set_attrs().  However, It is resetting the values that were already
applied previously to the same values.  I suggest removing this code as it
constitutes duplicate execution.

Link: https://patch.msgid.link/20251206011716.7185-1-rgbi3307@gmail.com
Link: https://lkml.kernel.org/r/20251216073440.40891-1-sj@kernel.org
Signed-off-by: JaeJoon Jung <rgbi3307@gmail.com>
Reviewed-by: Enze Li <lienze@kylinos.cn>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 19:24:46 -08:00
Quanmin Yan
e859a224fa mm/damon: add a min_sz_region parameter to damon_set_region_biggest_system_ram_default()
Patch series "mm/damon: fixes for address alignment issues in
DAMON_LRU_SORT and DAMON_RECLAIM", v2.

In DAMON_LRU_SORT and DAMON_RECLAIM, damon_set_regions() will apply
DAMON_MIN_REGION as the core address alignment, and the monitoring target
address ranges would be aligned on DAMON_MIN_REGION * addr_unit.  When
users 1) set addr_unit to a value larger than 1, and 2) set the monitoring
target address range as not aligned on DAMON_MIN_REGION * addr_unit, it
will cause DAMON_LRU_SORT and DAMON_RECLAIM to operate on unexpectedly
large physical address ranges.

For example, if the user sets the monitoring target address range to [4,
8) and addr_unit as 1024, the aimed monitoring target address range is [4
KiB, 8 KiB).  Assuming DAMON_MIN_REGION is 4096, so resulting target
address range will be [0, 4096) in the DAMON core layer address system,
and [0, 4 MiB) in the physical address space, which is an unexpected
range.

To fix the issue, add a min_sz_region parameter to
damon_set_region_biggest_system_ram_default() and use it when calling
damon_set_regions(), replacing the direct use of DAMON_MIN_REGION.


This patch (of 2):

In DAMON_LRU_SORT, damon_set_regions() will apply DAMON_MIN_REGION as the
core address alignment, and the monitoring target address ranges would be
aligned on DAMON_MIN_REGION * addr_unit.  When users 1) set addr_unit to a
value larger than 1, and 2) set the monitoring target address range as not
aligned on DAMON_MIN_REGION * addr_unit, it will cause DAMON_LRU_SORT to
operate on unexpectedly large physical address ranges.

For example, if the user sets the monitoring target address range to [4,
8) and addr_unit as 1024, the aimed monitoring target address range is [4
KiB, 8 KiB).  Assuming DAMON_MIN_REGION is 4096, so resulting target
address range will be [0, 4096) in the DAMON core layer address system,
and [0, 4 MiB) in the physical address space, which is an unexpected
range.

To fix the issue, add a min_sz_region parameter to
damon_set_region_biggest_system_ram_default() and use it when calling
damon_set_regions(), replacing the direct use of DAMON_MIN_REGION.

Link: https://lkml.kernel.org/r/20251020130125.2875164-1-yanquanmin1@huawei.com
Link: https://lkml.kernel.org/r/20251020130125.2875164-2-yanquanmin1@huawei.com
Fixes: 2e0fe9245d ("mm/damon/lru_sort: support addr_unit for DAMON_LRU_SORT")
Signed-off-by: Quanmin Yan <yanquanmin1@huawei.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: ze zuo <zuoze1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-16 17:28:09 -08:00
Quanmin Yan
2f6ce7e714 mm/damon/stat: change last_refresh_jiffies to a global variable
Patch series "mm/damon: fixes for the jiffies-related issues", v2.

On 32-bit systems, the kernel initializes jiffies to "-5 minutes" to make
jiffies wrap bugs appear earlier.  However, this may cause the
time_before() series of functions to return unexpected values, resulting
in DAMON not functioning as intended.  Meanwhile, similar issues exist in
some specific user operation scenarios.

This patchset addresses these issues.  The first patch is about the
DAMON_STAT module, and the second patch is about the core layer's sysfs.


This patch (of 2):

In DAMON_STAT's damon_stat_damon_call_fn(), time_before_eq() is used to
avoid unnecessarily frequent stat update.

On 32-bit systems, the kernel initializes jiffies to "-5 minutes" to make
jiffies wrap bugs appear earlier.  However, this causes time_before_eq()
in DAMON_STAT to unexpectedly return true during the first 5 minutes after
boot on 32-bit systems (see [1] for more explanation, which fixes another
jiffies-related issue before).  As a result, DAMON_STAT does not update
any monitoring results during that period, which becomes more confusing
when DAMON_STAT_ENABLED_DEFAULT is enabled.

There is also an issue unrelated to the system's word size[2]: if the user
stops DAMON_STAT just after last_refresh_jiffies is updated and restarts
it after 5 seconds or a longer delay, last_refresh_jiffies will retain an
older value, causing time_before_eq() to return false and the update to
happen earlier than expected.

Fix these issues by making last_refresh_jiffies a global variable and
initializing it each time DAMON_STAT is started.

Link: https://lkml.kernel.org/r/20251030020746.967174-2-yanquanmin1@huawei.com
Link: https://lkml.kernel.org/r/20250822025057.1740854-1-ekffu200098@gmail.com [1]
Link: https://lore.kernel.org/all/20251028143250.50144-1-sj@kernel.org/ [2]
Fixes: fabdd1e911 ("mm/damon/stat: calculate and expose estimated memory bandwidth")
Signed-off-by: Quanmin Yan <yanquanmin1@huawei.com>
Suggested-by: SeongJae Park <sj@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: ze zuo <zuoze1@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-09 21:19:45 -08:00
SeongJae Park
a983a26d52 mm/damon/stat: expose negative idle time
DAMON_STAT calculates the idle time of a region using the region's age if
the region's nr_accesses is zero.  If the nr_accesses value is non-zero
(positive), the idle time of the region becomes zero.

This means the users cannot know how warm and hot data is distributed,
using DAMON_STAT's memory_idle_ms_percentiles output.  The other stat,
namely estimated_memory_bandwidth, can help understanding how the overall
access temperature of the system is, but it is still very rough
information.  On production systems, actually, a significant portion of
the system memory is observed with zero idle time, and we cannot break it
down based on its internal hotness distribution.

Define the idle time of the region using its age, similar to those having
zero nr_accesses, but multiples '-1' to distinguish it.  And expose that
using the same parameter interface, memory_idle_ms_percentiles.

Link: https://lkml.kernel.org/r/20250916183127.65708-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:39 -07:00
SeongJae Park
cc7ceb1d14 mm/damon/stat: expose the current tuned aggregation interval
Patch series "mm/damon/stat: expose auto-tuned intervals and non-idle
ages".

DAMON_STAT is intentionally providing limited information for easy
consumption of the information.  From production fleet level usages, below
limitations are found, though.

The aggregation interval of DAMON_STAT represents the granularity of the
memory_idle_ms_percentiles.  But the interval is auto-tuned and not
exposed to users, so users cannot know the granularity.

All memory regions of non-zero (positive) nr_accesses are treated as
having zero idle time.  A significant portion of production systems have
such zero idle time.  Hence breakdown of warm and hot data is nearly
impossible.

Make following changes to overcome the limitations.  Expose the auto-tuned
aggregation interval with a new parameter named aggr_interval_us.  Expose
the age of non-zero nr_accesses (how long >0 access frequency the region
retained) regions as a negative idle time.


This patch (of 2):

DAMON_STAT calculates the idle time for a region as the region's age
multiplied by the aggregation interval.  That is, the aggregation interval
is the granularity of the idle time.  Since the aggregation interval is
auto-tuned and not exposed to users, however, users cannot easily know in
what granularity the stat is made.  Expose the tuned aggregation interval
in microseconds via a new parameter, aggr_interval_us.

Link: https://lkml.kernel.org/r/20250916183127.65708-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250916183127.65708-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:39 -07:00
SeongJae Park
b663f17b73 mm/damon/stat: use damon_initialized()
DAMON_STAT is assuming DAMON is ready to use in module_init time, and uses
its own hack to see if it is the time.  Use damon_initialized(), which is
a way for seeing if DAMON is ready to be used that is more reliable and
better to maintain instead of the hack.

Link: https://lkml.kernel.org/r/20250916033511.116366-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:37 -07:00
SeongJae Park
405f61996d mm/damon/stat: use damon_call() repeat mode instead of damon_callback
DAMON_STAT uses damon_callback for periodically reading DAMON internal
data.  Use its alternative, damon_call() repeat mode.

Link: https://lkml.kernel.org/r/20250712195016.151108-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-19 18:59:54 -07:00
SeongJae Park
e5d2585d9e mm/damon/stat: calculate and expose idle time percentiles
Knowing how much memory is how cold can be useful for understanding
coldness and utilization efficiency of memory.  The raw form of DAMON's
monitoring results has the information.  Convert the raw results into the
per-byte idle time distributions and expose it as percentiles metric to
users, as a read-only DAMON_STAT parameter.

In detail, the metrics are calculated as follows.  First, DAMON's
per-region access frequency and age information is converted into per-byte
idle time.  If access frequency of a region is higher than zero, every
byte of the region has zero idle time.  If the access frequency of a
region is zero, every byte of the region has idle time as the age of the
region.  Then the logic sorts the per-byte idle times and provides the
value at 0/100, 1/100, ..., 99/100 and 100/100 location of the sorted
array.

The metric can be easily aggregated and compared on large scale production
systems.  For example, if an average of 75-th percentile idle time of
machines that collected on similar time is two minutes, it means the
system's 25 percent memory is not accessed at all for two minutes or more
on average.  If a workload considers two minutes as unit work time, we can
conclude its working set size is only 75 percent of the memory.  If the
system utilizes proactive reclamation and it supports coldness-based
thresholds like DAMON_RECLAIM, the idle time percentiles can be used to
find a more safe or aggressive coldness threshold for aimed memory saving.

Link: https://lkml.kernel.org/r/20250604183127.13968-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09 22:41:55 -07:00
SeongJae Park
fabdd1e911 mm/damon/stat: calculate and expose estimated memory bandwidth
The raw form of DAMON's monitoring results captures many details of the
information.  However, not every bit of the information is always required
for understanding practical access patterns.  Especially on real world
production systems of high scale time and size, the raw form is difficult
to be aggregated and compared.

Convert the raw monitoring results into a single number metric, namely
estimated memory bandwidth and expose it to users as a read-only
DAMON_STAT parameter.  The metric represents access intensiveness
(hotness) of the system.  It can easily be aggregated and compared for
high level understanding of the access pattern on large systems.

Link: https://lkml.kernel.org/r/20250604183127.13968-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09 22:41:55 -07:00
SeongJae Park
369c415e60 mm/damon: introduce DAMON_STAT module
Patch series "mm/damon: introduce DAMON_STAT for simple and practical
access monitoring", v2.

DAMON-based access monitoring is not simple due to required DAMON control
and results visualizations.  Introduce a static kernel module for making
it simple.  The module can be enabled without manual setup and provides
access pattern metrics that easy to fetch and understand the practical
access pattern information, namely estimated memory bandwidth and memory
idle time percentiles.

Background and Problems
=======================

DAMON can be used for monitoring data access patterns of the system and
workloads.  Specifically, users can start DAMON to monitor access events
on specific address space with fine controls including address ranges to
monitor and time intervals between samplings and aggregations.  The
resulting access information snapshot contains access frequency
(nr_accesses) and how long the frequency was kept (age) for each byte.

The monitoring usage is not simple and practical enough for production
usage.  Users should first start DAMON with a number of parameters, and
wait until DAMON's monitoring results capture a reasonable amount of the
time data (age).  In production, such manual start and wait is impractical
to capture useful information from a high number of machines in a timely
manner.

The monitoring result is also too detailed to be used on production
environments.  The raw results are hard to be aggregated and/or compared
for production environments having a large scale of time, space and
machines fleet.

Users have to implement and use their own automation of DAMON control and
results processing.  It is repetitive and challenging since there is no
good reference or guideline for such automation.

Solution: DAMON_STAT
====================

Implement such automation in kernel space as a static kernel module,
namely DAMON_STAT.  It can be enabled at build, boot, or run time via its
build configuration or module parameter.  It monitors the entire physical
address space with monitoring intervals that auto-tuned for a reasonable
amount of access observations and minimum overhead.  It converts the raw
monitoring results into simpler metrics that can easily be aggregated and
compared, namely estimated memory bandwidth and idle time percentiles.

Understanding of the metrics and the user interface of DAMON_STAT is
essential.  Refer to the commit messages of the second and the third
patches of this patch series for more details about the metrics.  For the
user interface, the standard module parameters system is used.  Refer to
the fourth patch of this patch series for details of the user interface.

Discussions
===========

The module aims to be useful on production environments constructed with a
large number of machines that run a long time.  The auto-tuned monitoring
intervals ensure a reasonable quality of the outputs.  The auto-tuning
also ensures its overhead be reasonable and low enough to be enabled
always on the production.  The simplified monitoring results metrics can
be useful for showing both coldness (idle time percentiles) and hotness
(memory bandwidth) of the system's access pattern.  We expect the
information can be useful for assessing system memory utilization and
inspiring optimizations or investigations on both kernel and user space
memory management logics for large scale fleets.

We hence expect the module is good enough to be just used in most
environments.  For special cases that require a custom access monitoring
automation, users will still benefit by using DAMON_STAT as a reference or
a guideline for their specialized automation.


This patch (of 4):

To use DAMON for monitoring access patterns of the system, users should
manually start DAMON via DAMON sysfs ABI with a number of parameters for
specifying the monitoring target address space, address ranges, and
monitoring intervals.  After that, users should also wait until desired
amount of time data is captured into DAMON's monitoring results.  It is
bothersome and take a long time to be practical for access monitoring on
large fleet level production environments.

For access-aware system operations use cases like proactive cold memory
reclamation, similar problems existed.  We we solved those by introducing
dedicated static kernel modules such as DAMON_RECLAIM.

Implement such static kernel module for access monitoring, namely
DAMON_STAT.  It monitors the entire physical address space with auto-tuned
monitoring intervals.  The auto-tuning is set to capture 4 % of observable
access events in each snapshot while keeping the sampling intervals 5
milliseconds in minimum and 10 seconds in maximum.  From a few production
environments, we confirmed this setup provides high quality monitoring
results with minimum overheads.  The module therefore receives only one
user input, whether to enable or disable it.  It can be set on build or
boot time via build configuration or kernel boot command line.  It can
also be overridden at runtime.

Note that this commit only implements the DAMON control part of the
module.  Users could get the monitoring results via damon:damon_aggregated
tracepoint, but that's of course not the recommended way.  Following
commits will implement convenient and optimized ways for serving the
monitoring results to users.

[sj@kernel.org: use IS_ENABLED() for enabled initial value]
  Link: https://lkml.kernel.org/r/20250604205619.18929-1-sj@kernel.org
[sj@kernel.org: reset enabled when DAMON start failed]
  Link: https://lkml.kernel.org/r/20250706184750.36588-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250604183127.13968-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250604183127.13968-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09 22:41:55 -07:00