linux/include
SeongJae Park 33c3f6c2b4 mm/damon/core: fix damos_walk() vs kdamond_fn() exit race
When kdamond_fn() main loop is finished, the function cancels remaining
damos_walk() request and unset the damon_ctx->kdamond so that API callers
and API functions themselves can show the context is terminated. 
damos_walk() adds the caller's request to the queue first.  After that, it
shows if the kdamond of the damon_ctx is still running (damon_ctx->kdamond
is set).  Only if the kdamond is running, damos_walk() starts waiting for
the kdamond's handling of the newly added request.

The damos_walk() requests registration and damon_ctx->kdamond unset are
protected by different mutexes, though.  Hence, damos_walk() could race
with damon_ctx->kdamond unset, and result in deadlocks.

For example, let's suppose kdamond successfully finished the damow_walk()
request cancelling.  Right after that, damos_walk() is called for the
context.  It registers the new request, and shows the context is still
running, because damon_ctx->kdamond unset is not yet done.  Hence the
damos_walk() caller starts waiting for the handling of the request. 
However, the kdamond is already on the termination steps, so it never
handles the new request.  As a result, the damos_walk() caller thread
infinitely waits.

Fix this by introducing another damon_ctx field, namely
walk_control_obsolete.  It is protected by the
damon_ctx->walk_control_lock, which protects damos_walk() request
registration.  Initialize (unset) it in kdamond_fn() before letting
damon_start() returns and set it just before the cancelling of the
remaining damos_walk() request is executed.  damos_walk() reads the
obsolete field under the lock and avoids adding a new request.

After this change, only requests that are guaranteed to be handled or
cancelled are registered.  Hence the after-registration DAMON context
termination check is no longer needed.  Remove it together.

The issue is found by sashiko [1].


Link: https://lore.kernel.org/20260327233319.3528-3-sj@kernel.org
Link: https://lore.kernel.org/20260325141956.87144-1-sj@kernel.org [1]
Fixes: bf0eaba0ff ("mm/damon/core: implement damos_walk()")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> # 6.14.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18 00:10:51 -07:00
..
acpi mailbox: platform and core updates 2026-02-14 11:13:32 -08:00
asm-generic mm/mmu_gather: replace IPI with synchronize_rcu() when batch allocation fails 2026-04-05 13:53:05 -07:00
clocksource
crypto Networking changes for 7.0 2026-02-11 19:31:52 -08:00
cxl
drm drm/dp: Add definition for Panel Replay full-line granularity 2026-03-04 15:26:08 +02:00
dt-bindings phy-for-7.0 2026-02-17 11:40:04 -08:00
hyperv Revert "mshv: expose the scrub partition hypercall" 2026-03-11 16:54:24 +00:00
keys
kunit kunit: irq: Ensure timer doesn't fire too frequently 2026-02-24 14:44:21 -08:00
kvm
linux mm/damon/core: fix damos_walk() vs kdamond_fn() exit race 2026-04-18 00:10:51 -07:00
math-emu
media [GIT PULL for v7.0] media updates 2026-02-11 12:20:25 -08:00
memory
misc
net mm: introduce a new page type for page pool in page type 2026-04-05 13:53:19 -07:00
pcmcia
ras
rdma RDMA/core: Check id_priv->restricted_node_type in cma_listen_on_dev() 2026-02-25 07:50:10 -05:00
rv rv: Fix multiple definition of __pcpu_unique_da_mon_this 2026-02-20 13:12:00 +01:00
scsi SCSI misc on 20260212 2026-02-12 15:43:02 -08:00
soc
sound ASoC: Fixes for v7.0 2026-03-05 17:22:14 +01:00
target
trace mm: memcontrol: change val type to long in __mod_memcg_{lruvec_}state() 2026-04-18 00:10:48 -07:00
uapi ARM: 2026-03-15 12:22:10 -07:00
ufs
vdso
video
xen xen/xenbus: better handle backend crash 2026-03-04 15:31:40 +01:00
Kbuild