linux/include
Tejun Heo 93618edf75 cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated
A chain of commits going back to v7.0 reworked rmdir to satisfy the
controller invariant that a subsystem's ->css_offline() must not run while
tasks are still doing kernel-side work in the cgroup.

[1] d245698d72 ("cgroup: Defer task cgroup unlink until after the task is done switching out")
[2] a72f73c4dd ("cgroup: Don't expose dead tasks in cgroup")
[3] 1b164b876c ("cgroup: Wait for dying tasks to leave on rmdir")
[4] 4c56a8ac68 ("cgroup: Fix cgroup_drain_dying() testing the wrong condition")
[5] 13e786b64b ("cgroup: Increment nr_dying_subsys_* from rmdir context")

[1] moved task cset unlink from do_exit() to finish_task_switch() so a
task's cset link drops only after the task has fully stopped scheduling.
That made tasks past exit_signals() linger on cset->tasks until their final
context switch, which led to a series of problems as what userspace expected
to see after rmdir diverged from what the kernel needs to wait for. [2]-[5]
tried to bridge that divergence: [2] filtered the exiting tasks from
cgroup.procs; [3] had rmdir(2) sleep in TASK_UNINTERRUPTIBLE for them; [4]
fixed the wait's condition; [5] made nr_dying_subsys_* visible
synchronously.

The cgroup_drain_dying() wait in [3] turned out to be a dead end. When the
rmdir caller is also the reaper of a zombie that pins a pidns teardown (e.g.
host PID 1 systemd reaping orphan pids that were re-parented to it during
the same teardown), rmdir blocks in TASK_UNINTERRUPTIBLE waiting for those
pids to free, the pids can't free because PID 1 is the reaper and it's stuck
in rmdir, and the system A-A deadlocks. No internal lock ordering breaks
this; the wait itself is the bug.

The css killing side that drove the original reorder, however, can be made
cleanly asynchronous: ->css_offline() is already async, run from
css_killed_work_fn() driven by percpu_ref_kill_and_confirm(). The fix is to
make that chain start only after all tasks have left the cgroup. rmdir's
user-visible side then returns as soon as cgroup.procs and friends are
empty, while ->css_offline() still runs only after the cgroup is fully
drained.

Verified by the original reproducer (pidns teardown + zombie reaper, runs
under vng) which hangs vanilla and succeeds here, and by per-commit
deterministic repros for [2], [3], [4], [5] with a boot parameter that
widens the post-exit_signals() window so each state is reliably reachable.
Some stress tests on top of that.

cgroup_apply_control_disable() has the same shape of pre-existing race:
when a controller is disabled via subtree_control, kill_css() ran
synchronously while tasks past exit_signals() could still be linked to
the cgroup's csets, and ->css_offline() could fire before they drained.
This patch preserves the existing synchronous behavior at that call site
(kill_css_sync() + kill_css_finish() back-to-back) and a follow-up patch
will defer kill_css_finish() there using a per-css trigger.

This seems like the right approach and I don't see problems with it. The
changes are somewhat invasive but not excessively so, so backporting to
-stable should be okay. If something does turn out to be wrong, the fallback
is to revert the entire chain ([1]-[5]) and rework in the development branch
instead.

v2: Pin cgrp across the deferred destroy work with explicit
    cgroup_get()/cgroup_put() around queue_work() and the work_fn. v1
    wasn't actually broken (ordered cgroup_offline_wq + queue_work order
    in cgroup_task_dead() saved it) but the explicit ref removes the
    dependency on those non-obvious invariants. Also note the
    pre-existing cgroup_apply_control_disable() race in the description;
    a follow-up will defer kill_css_finish() there.

Fixes: 1b164b876c ("cgroup: Wait for dying tasks to leave on rmdir")
Cc: stable@vger.kernel.org # v7.0+
Reported-and-tested-by: Martin Pitt <martin@piware.de>
Link: https://lore.kernel.org/all/afHNg2VX2jy9bW7y@piware.de/
Link: https://lore.kernel.org/all/35e0670adb4abeab13da2c321582af9f@kernel.org/
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2026-05-04 08:52:26 -10:00
..
acpi Power management updates for 7.1-rc1 2026-04-13 19:47:52 -07:00
asm-generic mm.git review status for linus..mm-nonmm-stable 2026-04-16 20:11:56 -07:00
clocksource
crypto This update includes the following changes: 2026-04-15 15:22:26 -07:00
cxl
drm
dt-bindings Support for Mobileye EyeQ6Lplus 2026-04-17 08:53:23 -07:00
hyperv
keys
kunit
kvm
linux cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated 2026-05-04 08:52:26 -10:00
math-emu
media
memory
misc
net mm.git review status for linus..mm-stable 2026-04-15 12:59:16 -07:00
pcmcia
ras
rdma
rv
scsi
soc
sound ASoC: Updates for v7.1 2026-04-13 18:09:48 +02:00
target
trace Runtime Verification updates for 7.1: 2026-04-15 17:15:18 -07:00
uapi Arm: 2026-04-17 07:18:03 -07:00
ufs
vdso
video
xen xen/grant-table: guard gnttab_suspend/resume with CONFIG_HIBERNATE_CALLBACKS 2026-04-10 11:07:21 +02:00
Kbuild