linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-27 08:33:17 +02:00

Author	SHA1	Message	Date
sunshaojie	345f401666	cgroup/cpuset: Return only actually allocated CPUs during partition invalidation In update_parent_effective_cpumask() with partcmd_invalidate, the CPUs to return to the parent are computed as: adding = cpumask_and(tmp->addmask, xcpus, parent->effective_xcpus); where xcpus = user_xcpus(cs) which returns cs->exclusive_cpus (if set) or cs->cpus_allowed. When exclusive_cpus is not set, user_xcpus(cs) can contain CPUs that were never actually granted to the partition due to sibling exclusion in compute_excpus(). Consequently, the invalidation may return CPUs to the parent that remain in use by sibling partitions, causing overlapping effective_cpus and triggering the WARN_ON_ONCE(1) in generate_sched_domains(). Use cs->effective_xcpus instead, which reflects the CPUs actually granted to this partition. Reproducer (on a 4-CPU machine): cd /sys/fs/cgroup mkdir a1 b1 # a1 becomes partition root with CPUs 0-1 echo "0-1" > a1/cpuset.cpus echo "root" > a1/cpuset.cpus.partition # b1 becomes partition root with CPUs 1-2, but sibling exclusion # reduces its effective_xcpus to CPU 2 only echo "1-2" > b1/cpuset.cpus echo "root" > b1/cpuset.cpus.partition # b1 changes cpus_allowed to 0-1 -> partition invalidation echo "0-1" > b1/cpuset.cpus # Expected: CPUs 2-3 (only CPU 2 returned from b1) # Actual: CPUs 1-3 (CPU 0-1 returned, overlapping with a1) cat cpuset.cpus.effective dmesg will also show a WARNING from generate_sched_domains() reporting overlapping partition root effective_cpus. Fixes: `2a3602030d` ("cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus conflict") Cc: stable@vger.kernel.org # v7.0+ Signed-off-by: sunshaojie <sunshaojie@kylinos.cn> Tested-by: Chen Ridong <chenridong@huaweicloud.com> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-05-13 08:54:53 -10:00
Guopeng Zhang	5dd74441cb	cgroup/cpuset: Reserve DL bandwidth only for root-domain moves cpuset_can_attach() currently adds the bandwidth of all migrating SCHED_DEADLINE tasks to sum_migrate_dl_bw. If the source and destination cpuset effective CPU masks do not overlap, the whole sum is then reserved in the destination root domain. set_cpus_allowed_dl(), however, subtracts bandwidth from the source root domain only when the affinity change really moves the task between root domains. A DL task can move between cpusets that are still in the same root domain, so including that task in sum_migrate_dl_bw can reserve destination bandwidth without a matching source-side subtraction. Share the root-domain move test with set_cpus_allowed_dl(). Keep nr_migrate_dl_tasks counting all migrating deadline tasks for cpuset DL task accounting, but add to sum_migrate_dl_bw only for tasks that need a root-domain bandwidth move. Keep using the destination cpuset effective CPU mask and leave the broader can_attach()/attach() transaction model unchanged. Fixes: `2ef269ef1a` ("cgroup/cpuset: Free DL BW in case can_attach() fails") Cc: stable@vger.kernel.org # v6.10+ Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn> Reviewed-by: Waiman Long <longman@redhat.com> Acked-by: Juri Lelli <juri.lelli@redhat.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-05-11 10:27:14 -10:00
Guopeng Zhang	4a39eda5fd	cgroup/cpuset: Reset DL migration state on can_attach() failure cpuset_can_attach() accumulates temporary SCHED_DEADLINE migration state in the destination cpuset while walking the taskset. If a later task_can_attach() or security_task_setscheduler() check fails, cgroup_migrate_execute() treats cpuset as the failing subsystem and does not call cpuset_cancel_attach() for it. The partially accumulated state is then left behind and can be consumed by a later attach, corrupting cpuset DL task accounting and pending DL bandwidth accounting. Reset the pending DL migration state from the common error exit when ret is non-zero. Successful can_attach() keeps the state for cpuset_attach() or cpuset_cancel_attach(). Fixes: `2ef269ef1a` ("cgroup/cpuset: Free DL BW in case can_attach() fails") Cc: stable@vger.kernel.org # v6.10+ Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Reviewed-by: Waiman Long <longman@redhat.com>	2026-05-10 22:14:49 -10:00
Chen Wandun	dde2f938d0	cgroup/cpuset: move PF_EXITING check before __GFP_HARDWALL in cpuset_current_node_allowed() Since prepare_alloc_pages() unconditionally adds __GFP_HARDWALL for the fast path when cpusets are enabled, the __GFP_HARDWALL check in cpuset_current_node_allowed() causes the PF_EXITING escape path to be skipped on the first allocation attempt. This makes it unreachable in the common case, so dying tasks can get stuck in direct reclaim or even trigger OOM while trying to exit, despite being allowed to allocate from any node. Move the PF_EXITING check before __GFP_HARDWALL so that dying tasks can allocate memory from any node to exit quickly, even when cpusets are enabled. Also update the function comment to reflect the actual behavior of prepare_alloc_pages() and the corrected check ordering. Signed-off-by: Chen Wandun <chenwandun@lixiang.com> Acked-by: Michal Koutný <mkoutny@suse.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-05-07 11:57:31 -10:00
Guopeng Zhang	41d701ddc3	cgroup/cpuset: record DL BW alloc CPU for attach rollback cpuset_can_attach() allocates DL bandwidth only when migrating deadline tasks to a disjoint CPU mask, but cpuset_cancel_attach() rolls back based only on nr_migrate_dl_tasks. This makes the DL bandwidth alloc/free paths asymmetric: rollback can call dl_bw_free() even when no dl_bw_alloc() was done. Rollback also needs to undo the reservation against the same CPU/root domain that was charged. Record the CPU used by dl_bw_alloc() and use that state in cpuset_cancel_attach(). If no allocation happened, dl_bw_cpu stays at -1 and rollback skips dl_bw_free(). If allocation did happen, bandwidth is returned to the same CPU/root domain. Successful attach paths are unchanged. This only fixes failed attach rollback accounting. Fixes: `2ef269ef1a` ("cgroup/cpuset: Free DL BW in case can_attach() fails") Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-04-17 08:57:37 -10:00
Waiman Long	089f3fcd69	cgroup/cpuset: Skip security check for hotplug induced v1 task migration When a CPU hot removal causes a v1 cpuset to lose all its CPUs, the cpuset hotplug handler will schedule a work function to migrate tasks in that cpuset with no CPU to its ancestor to enable those tasks to continue running. If a strict security policy is in place, however, the task migration may fail when security_task_setscheduler() call in cpuset_can_attach() returns a -EACCES error. That will mean that those tasks will have no CPU to run on. The system administrators will have to explicitly intervene to either add CPUs to that cpuset or move the tasks elsewhere if they are aware of it. This problem was found by a reported test failure in the LTP's cpuset_hotplug_test.sh. Fix this problem by treating this special case as an exception to skip the setsched security check in cpuset_can_attach() when a v1 cpuset with tasks have no CPU left. With that patch applied, the cpuset_hotplug_test.sh test can be run successfully without failure. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-31 09:14:13 -10:00
Waiman Long	bbe5ab8191	cgroup/cpuset: Simplify setsched decision check in task iteration loop of cpuset_can_attach() Centralize the check required to run security_task_setscheduler() in the task iteration loop of cpuset_can_attach() outside of the loop as it has no dependency on the characteristics of the tasks themselves. There is no functional change. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-31 09:14:13 -10:00
Waiman Long	ca174c705d	cgroup/cpuset: Call rebuild_sched_domains() directly in hotplug Besides deferring the call to housekeeping_update(), commit `6df415aa46` ("cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue") also defers the rebuild_sched_domains() call to the workqueue. So a new offline CPU may still be in a sched domain or new online CPU not showing up in the sched domains for a short transition period. That could be a problem in some corner cases and can be the cause of a reported test failure[1]. Fix it by calling rebuild_sched_domains_cpuslocked() directly in hotplug as before. If isolated partition invalidation or recreation is being done, the housekeeping_update() call to update the housekeeping cpumasks will still be deferred to a workqueue. In commit `3bfe479671` ("cgroup/cpuset: Move housekeeping_update()/rebuild_sched_domains() together"), housekeeping_update() is called before rebuild_sched_domains() because it needs to access the HK_TYPE_DOMAIN housekeeping cpumask. That is now changed to use the static HK_TYPE_DOMAIN_BOOT cpumask as HK_TYPE_DOMAIN cpumask is now changeable at run time. As a result, we can move the rebuild_sched_domains() call before housekeeping_update() with the slight advantage that it will be done in the same cpus_read_lock critical section without the possibility of interference by a concurrent cpu hot add/remove operation. As it doesn't make sense to acquire cpuset_mutex/cpuset_top_mutex after calling housekeeping_update() and immediately release them again, move the cpuset_full_unlock() operation inside update_hk_sched_domains() and rename it to cpuset_update_sd_hk_unlock() to signify that it will release the full set of locks. [1] https://lore.kernel.org/lkml/1a89aceb-48db-4edd-a730-b445e41221fe@nvidia.com Fixes: `6df415aa46` ("cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue") Tested-by: Jon Hunter <jonathanh@nvidia.com> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-06 06:58:25 -10:00
Linus Torvalds	0031c06807	cgroup: Fixes for v7.0-rc2 - Fix circular locking dependency in cpuset partition code by deferring housekeeping_update() calls to a workqueue instead of calling them directly under cpus_read_lock. - Fix null-ptr-deref in rebuild_sched_domains_cpuslocked() when generate_sched_domains() returns NULL due to kmalloc failure. - Fix incorrect cpuset behavior for effective_xcpus in partition_xcpus_del() and cpuset_update_tasks_cpumask() in update_cpumasks_hier(). - Fix race between task migration and cgroup iteration. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaadVVQ4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGef0AQDLuJE3vzc2VeCBc4rGcj7ZSRmc3tc28lOqHRzi XEx1iwD+PeFcb9wt1CTqA5hAiIY1LGR/5iO1kTH7paRd16DBRAc= =S8WE -----END PGP SIGNATURE----- Merge tag 'cgroup-for-7.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - Fix circular locking dependency in cpuset partition code by deferring housekeeping_update() calls to a workqueue instead of calling them directly under cpus_read_lock - Fix null-ptr-deref in rebuild_sched_domains_cpuslocked() when generate_sched_domains() returns NULL due to kmalloc failure - Fix incorrect cpuset behavior for effective_xcpus in partition_xcpus_del() and cpuset_update_tasks_cpumask() in update_cpumasks_hier() - Fix race between task migration and cgroup iteration * tag 'cgroup-for-7.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/cpuset: fix null-ptr-deref in rebuild_sched_domains_cpuslocked cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue cgroup/cpuset: Move housekeeping_update()/rebuild_sched_domains() together kselftest/cgroup: Simplify test_cpuset_prs.sh by removing "S+" command cgroup/cpuset: Set isolated_cpus_updating only if isolated_cpus is changed cgroup/cpuset: Clarify exclusion rules for cpuset internal variables cgroup/cpuset: Fix incorrect use of cpuset_update_tasks_cpumask() in update_cpumasks_hier() cgroup/cpuset: Fix incorrect change to effective_xcpus in partition_xcpus_del() cgroup: fix race between task migration and iteration	2026-03-03 14:25:18 -08:00
Chen Ridong	085f067389	cgroup/cpuset: fix null-ptr-deref in rebuild_sched_domains_cpuslocked A null-pointer-dereference bug was reported by syzbot: Oops: general protection fault, probably for address 0xdffffc0000000000: KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] RIP: 0010:bitmap_subset include/linux/bitmap.h:433 [inline] RIP: 0010:cpumask_subset include/linux/cpumask.h:836 [inline] RIP: 0010:rebuild_sched_domains_locked kernel/cgroup/cpuset.c:967 RSP: 0018:ffffc90003ecfbc0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000020 RDX: ffff888028de0000 RSI: ffffffff8200f003 RDI: ffffffff8df14f28 RBP: 0000000000000000 R08: 0000000000000cc0 R09: 00000000ffffffff R10: ffffffff8e7d95b3 R11: 0000000000000001 R12: 0000000000000000 R13: 00000000000f4240 R14: dffffc0000000000 R15: 0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f463fff CR3: 000000003704c000 CR4: 00000000003526f0 Call Trace: <TASK> rebuild_sched_domains_cpuslocked kernel/cgroup/cpuset.c:983 [inline] rebuild_sched_domains+0x21/0x40 kernel/cgroup/cpuset.c:990 sched_rt_handler+0xb5/0xe0 kernel/sched/rt.c:2911 proc_sys_call_handler+0x47f/0x5a0 fs/proc/proc_sysctl.c:600 new_sync_write fs/read_write.c:595 [inline] vfs_write+0x6ac/0x1070 fs/read_write.c:688 ksys_write+0x12a/0x250 fs/read_write.c:740 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f The issue occurs when generate_sched_domains() returns ndoms = 1 and doms = NULL due to a kmalloc failure. This leads to a null-pointer dereference when accessing doms in rebuild_sched_domains_locked(). Fix this by adding a NULL check for doms before accessing it. Fixes: `6ee43047e8` ("cpuset: Remove unnecessary checks in rebuild_sched_domains_locked") Reported-by: syzbot+460792609a79c085f79f@syzkaller.appspotmail.com Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-25 07:39:04 -10:00
Waiman Long	a84097e625	cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock The current cpuset partition code is able to dynamically update the sched domains of a running system and the corresponding HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentially the "isolcpus=domain,..." boot command line feature at run time. The housekeeping cpumask update requires flushing a number of different workqueues which may not be safe with cpus_read_lock() held as the workqueue flushing code may acquire cpus_read_lock() or acquiring locks which have locking dependency with cpus_read_lock() down the chain. Below is an example of such circular locking problem. ====================================================== WARNING: possible circular locking dependency detected 6.18.0-test+ #2 Tainted: G S ------------------------------------------------------ test_cpuset_prs/10971 is trying to acquire lock: ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180 but task is already holding lock: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #4 (cpuset_mutex){+.+.}-{4:4}: -> #3 (cpu_hotplug_lock){++++}-{0:0}: -> #2 (rtnl_mutex){+.+.}-{4:4}: -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}: -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}: Chain exists of: (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex 5 locks held by test_cpuset_prs/10971: #0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0 #1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x260/0x5f0 #2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x2b6/0x5f0 #3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_partition_write+0x77/0x130 #4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130 Call Trace: <TASK> : touch_wq_lockdep_map+0x93/0x180 __flush_workqueue+0x111/0x10b0 housekeeping_update+0x12d/0x2d0 update_parent_effective_cpumask+0x595/0x2440 update_prstate+0x89d/0xce0 cpuset_partition_write+0xc5/0x130 cgroup_file_write+0x1a5/0x680 kernfs_fop_write_iter+0x3df/0x5f0 vfs_write+0x525/0xfd0 ksys_write+0xf9/0x1d0 do_syscall_64+0x95/0x520 entry_SYSCALL_64_after_hwframe+0x76/0x7e To avoid such a circular locking dependency problem, we have to call housekeeping_update() without holding the cpus_read_lock() and cpuset_mutex. The current set of wq's flushed by housekeeping_update() may not have work functions that call cpus_read_lock() directly, but we are likely to extend the list of wq's that are flushed in the future. Moreover, the current set of work functions may hold locks that may have cpu_hotplug_lock down the dependency chain. So housekeeping_update() is now called after releasing cpus_read_lock and cpuset_mutex at the end of a cpuset operation. These two locks are then re-acquired later before calling rebuild_sched_domains_locked(). To enable mutual exclusion between the housekeeping_update() call and other cpuset control file write actions, a new top level cpuset_top_mutex is introduced. This new mutex will be acquired first to allow sharing variables used by both code paths. However, cpuset update from CPU hotplug can still happen in parallel with the housekeeping_update() call, though that should be rare in production environment. As cpus_read_lock() is now no longer held when tmigr_isolated_exclude_cpumask() is called, it needs to acquire it directly. The lockdep_is_cpuset_held() is also updated to return true if either cpuset_top_mutex or cpuset_mutex is held. Fixes: `03ff735101` ("cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-23 10:46:49 -10:00
Waiman Long	6df415aa46	cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue The cpuset_handle_hotplug() may need to invoke housekeeping_update(), for instance, when an isolated partition is invalidated because its last active CPU has been put offline. As we are going to enable dynamic update to the nozh_full housekeeping cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug, allowing the CPU hotplug path to call into housekeeping_update() directly from update_isolation_cpumasks() will likely cause deadlock. So we have to defer any call to housekeeping_update() after the CPU hotplug operation has finished. This is now done via the workqueue where the update_hk_sched_domains() function will be invoked via the hk_sd_workfn(). An concurrent cpuset control file write may have executed the required update_hk_sched_domains() function before the work function is called. So the work function call may become a no-op when it is invoked. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-23 10:42:11 -10:00
Waiman Long	3bfe479671	cgroup/cpuset: Move housekeeping_update()/rebuild_sched_domains() together With the latest changes in sched/isolation.c, rebuild_sched_domains*() requires the HK_TYPE_DOMAIN housekeeping cpumask to be properly updated first, if needed, before the sched domains can be rebuilt. So the two naturally fit together. Do that by creating a new update_hk_sched_domains() helper to house both actions. The name of the isolated_cpus_updating flag to control the call to housekeeping_update() is now outdated. So change it to update_housekeeping to better reflect its purpose. Also move the call to update_hk_sched_domains() to the end of cpuset and hotplug operations before releasing the cpuset_mutex. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-23 10:42:08 -10:00
Waiman Long	14713ed9e9	cgroup/cpuset: Set isolated_cpus_updating only if isolated_cpus is changed As cpuset is updating HK_TYPE_DOMAIN housekeeping mask when there is a change in the set of isolated CPUs, making this change is now more costly than before. Right now, the isolated_cpus_updating flag can be set even if there is no real change in isolated_cpus. Put in additional checks to make sure that isolated_cpus_updating is set only if there is a real change in isolated_cpus. Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-23 10:41:09 -10:00
Waiman Long	17b1860034	cgroup/cpuset: Clarify exclusion rules for cpuset internal variables Clarify the locking rules associated with file level internal variables inside the cpuset code. There is no functional change. Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-23 10:41:04 -10:00
Waiman Long	68230aac8b	cgroup/cpuset: Fix incorrect use of cpuset_update_tasks_cpumask() in update_cpumasks_hier() Commit `e2ffe502ba` ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2") incorrectly changed the 2nd parameter of cpuset_update_tasks_cpumask() from tmp->new_cpus to cp->effective_cpus. This second parameter is just a temporary cpumask for internal use. The cpuset_update_tasks_cpumask() function was originally called update_tasks_cpumask() before commit `381b53c3b5` ("cgroup/cpuset: rename functions shared between v1 and v2"). This mistake can incorrectly change the effective_cpus of the cpuset when it is the top_cpuset or in arm64 architecture where task_cpu_possible_mask() may differ from cpu_possible_mask. So far top_cpuset hasn't been passed to update_cpumasks_hier() yet, but arm64 arch can still be impacted. Fix it by reverting the incorrect change. Fixes: `e2ffe502ba` ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-23 10:41:01 -10:00
Waiman Long	f9a1767ce3	cgroup/cpuset: Fix incorrect change to effective_xcpus in partition_xcpus_del() The effective_xcpus of a cpuset can contain offline CPUs. In partition_xcpus_del(), the xcpus parameter is incorrectly used as a temporary cpumask to mask out offline CPUs. As xcpus can be the effective_xcpus of a cpuset, this can result in unexpected changes in that cpumask. Fix this problem by not making any changes to the xcpus parameter. Fixes: `11e5f407b6` ("cgroup/cpuset: Keep track of CPUs in isolated partitions") Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-23 10:40:58 -10:00
Linus Torvalds	bf4afc53b7	Convert 'alloc_obj' family to use the new default GFP_KERNEL argument This was done entirely with mindless brute force, using git grep -l '\<k[vmz]alloc_objs(., GFP_KERNEL)' \| xargs sed -i 's/$alloc_objs(.*$, GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 17:09:51 -08:00
Kees Cook	69050f8d6d	treewide: Replace kmalloc with kmalloc_obj for non-scalar types This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(PTR, FAM, COUNT, ...) (where TYPE may also be VAR) The resulting allocations no longer return "void ", instead returning "TYPE ". Signed-off-by: Kees Cook <kees@kernel.org>	2026-02-21 01:02:28 -08:00
Linus Torvalds	eeccf287a2	mm.git review status for linus..mm-stable Total patches: 36 Reviews/patch: 1.77 Reviewed rate: 83% - The 2 patch series "mm/vmscan: fix demotion targets checks in reclaim/demotion" from Bing Jiao fixes a couple of issues in the demotion code - pages were failed demotion and were finding themselves demoted into disallowed nodes. - The 11 patch series "Remove XA_ZERO from error recovery of dup_mmap()" from Liam Howlett fixes a rare mapledtree race and performs a number of cleanups. - The 13 patch series "mm: add bitmap VMA flag helpers and convert all mmap_prepare to use them" from Lorenzo Stoakes implements a lot of cleanups following on from the conversion of the VMA flags into a bitmap. - The 5 patch series "support batch checking of references and unmapping for large folios" from Baolin Wang implements batching to greatly improve the performance of reclaiming clean file-backed large folios. - The 3 patch series "selftests/mm: add memory failure selftests" from Miaohe Lin does as claimed. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaZaIEQAKCRDdBJ7gKXxA jj73AQCQDwLoipDiQRGyjB5BDYydymWuDoiB1tlDPHfYAP3b/QD/UQtVlOEXqwM3 naOKs3NQ1pwnfhDaQMirGw2eAnJ1SQY= =6Iif -----END PGP SIGNATURE----- Merge tag 'mm-stable-2026-02-18-19-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: - "mm/vmscan: fix demotion targets checks in reclaim/demotion" fixes a couple of issues in the demotion code - pages were failed demotion and were finding themselves demoted into disallowed nodes (Bing Jiao) - "Remove XA_ZERO from error recovery of dup_mmap()" fixes a rare mapledtree race and performs a number of cleanups (Liam Howlett) - "mm: add bitmap VMA flag helpers and convert all mmap_prepare to use them" implements a lot of cleanups following on from the conversion of the VMA flags into a bitmap (Lorenzo Stoakes) - "support batch checking of references and unmapping for large folios" implements batching to greatly improve the performance of reclaiming clean file-backed large folios (Baolin Wang) - "selftests/mm: add memory failure selftests" does as claimed (Miaohe Lin) * tag 'mm-stable-2026-02-18-19-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (36 commits) mm/page_alloc: clear page->private in free_pages_prepare() selftests/mm: add memory failure dirty pagecache test selftests/mm: add memory failure clean pagecache test selftests/mm: add memory failure anonymous page test mm: rmap: support batched unmapping for file large folios arm64: mm: implement the architecture-specific clear_flush_young_ptes() arm64: mm: support batch clearing of the young flag for large folios arm64: mm: factor out the address and ptep alignment into a new helper mm: rmap: support batched checks of the references for large folios tools/testing/vma: add VMA userland tests for VMA flag functions tools/testing/vma: separate out vma_internal.h into logical headers tools/testing/vma: separate VMA userland tests into separate files mm: make vm_area_desc utilise vma_flags_t only mm: update all remaining mmap_prepare users to use vma_flags_t mm: update shmem_[kernel]_file_*() functions to use vma_flags_t mm: update secretmem to use VMA flags on mmap_prepare mm: update hugetlbfs to use VMA flags on mmap_prepare mm: add basic VMA flag operation helper functions tools: bitmap: add missing bitmap_[subset(), andnot()] mm: add mk_vma_flags() bitmap flag macro helper ...	2026-02-18 20:50:32 -08:00
Bing Jiao	1aceed565f	mm/vmscan: fix demotion targets checks in reclaim/demotion Patch series "mm/vmscan: fix demotion targets checks in reclaim/demotion", v9. This patch series addresses two issues in demote_folio_list(), can_demote(), and next_demotion_node() in reclaim/demotion. 1. demote_folio_list() and can_demote() do not correctly check demotion target against cpuset.mems_effective, which will cause (a) pages to be demoted to not-allowed nodes and (b) pages fail demotion even if the system still has allowed demotion nodes. Patch 1 fixes this bug by updating cpuset_node_allowed() and mem_cgroup_node_allowed() to return effective_mems, allowing directly logic-and operation against demotion targets. 2. next_demotion_node() returns a preferred demotion target, but it does not check the node against allowed nodes. Patch 2 ensures that next_demotion_node() filters against the allowed node mask and selects the closest demotion target to the source node. This patch (of 2): Fix two bugs in demote_folio_list() and can_demote() due to incorrect demotion target checks against cpuset.mems_effective in reclaim/demotion. Commit `7d709f49ba` ("vmscan,cgroup: apply mems_effective to reclaim") introduces the cpuset.mems_effective check and applies it to can_demote(). However: 1. It does not apply this check in demote_folio_list(), which leads to situations where pages are demoted to nodes that are explicitly excluded from the task's cpuset.mems. 2. It checks only the nodes in the immediate next demotion hierarchy and does not check all allowed demotion targets in can_demote(). This can cause pages to never be demoted if the nodes in the next demotion hierarchy are not set in mems_effective. These bugs break resource isolation provided by cpuset.mems. This is visible from userspace because pages can either fail to be demoted entirely or are demoted to nodes that are not allowed in multi-tier memory systems. To address these bugs, update cpuset_node_allowed() and mem_cgroup_node_allowed() to return effective_mems, allowing directly logic-and operation against demotion targets. Also update can_demote() and demote_folio_list() accordingly. Bug 1 reproduction: Assume a system with 4 nodes, where nodes 0-1 are top-tier and nodes 2-3 are far-tier memory. All nodes have equal capacity. Test script: echo 1 > /sys/kernel/mm/numa/demotion_enabled mkdir /sys/fs/cgroup/test echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control echo "0-2" > /sys/fs/cgroup/test/cpuset.mems echo $$ > /sys/fs/cgroup/test/cgroup.procs swapoff -a # Expectation: Should respect node 0-2 limit. # Observation: Node 3 shows significant allocation (MemFree drops) stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1 Bug 2 reproduction: Assume a system with 6 nodes, where nodes 0-2 are top-tier, node 3 is a far-tier node, and nodes 4-5 are the farthest-tier nodes. All nodes have equal capacity. Test script: echo 1 > /sys/kernel/mm/numa/demotion_enabled mkdir /sys/fs/cgroup/test echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control echo "0-2,4-5" > /sys/fs/cgroup/test/cpuset.mems echo $$ > /sys/fs/cgroup/test/cgroup.procs swapoff -a # Expectation: Pages are demoted to Nodes 4-5 # Observation: No pages are demoted before oom. stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1,2 Link: https://lkml.kernel.org/r/20260114205305.2869796-1-bingjiao@google.com Link: https://lkml.kernel.org/r/20260114205305.2869796-2-bingjiao@google.com Fixes: `7d709f49ba` ("vmscan,cgroup: apply mems_effective to reclaim") Signed-off-by: Bing Jiao <bingjiao@google.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Gregory Price <gourry@gourry.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-02-12 15:42:52 -08:00
Linus Torvalds	4cff5c05e0	mm.git review status for linus..mm-stable Everything: Total patches: 325 Reviews/patch: 1.39 Reviewed rate: 72% Excluding DAMON: Total patches: 262 Reviews/patch: 1.63 Reviewed rate: 82% Excluding DAMON and zram: Total patches: 248 Reviews/patch: 1.72 Reviewed rate: 86% - The 14 patch series "powerpc/64s: do not re-activate batched TLB flush" from Alexander Gordeev makes arch_{enter\|leave}_lazy_mmu_mode() nest properly. It adds a generic enter/leave layer and switches architectures to use it. Various hacks were removed in the process. - The 7 patch series "zram: introduce compressed data writeback" from Richard Chang and Sergey Senozhatsky implements data compression for zram writeback. - The 8 patch series "mm: folio_zero_user: clear page ranges" from David Hildenbrand adds clearing of contiguous page ranges for hugepages. Large improvements during demand faulting are demonstrated. - The 2 patch series "memcg cleanups" from Chen Ridong tideis up some memcg code. - The 12 patch series "mm/damon: introduce {,max_}nr_snapshots and tracepoint for damos stats" from SeongJae Park improves DAMOS stat's provided information, deterministic control, and readability. - The 3 patch series "selftests/mm: hugetlb cgroup charging: robustness fixes" from Li Wang fixes a few issues in the hugetlb cgroup charging selftests. - The 5 patch series "Fix va_high_addr_switch.sh test failure - again" from Chunyu Hu addresses several issues in the va_high_addr_switch test. - The 5 patch series "mm/damon/tests/core-kunit: extend existing test scenarios" from Shu Anzai improves the KUnit test coverage for DAMON. - The 2 patch series "mm/khugepaged: fix dirty page handling for MADV_COLLAPSE" from Shivank Garg fixes a glitch in khugepaged which was causing madvise(MADV_COLLAPSE) to transiently return -EAGAIN. - The 29 patch series "arch, mm: consolidate hugetlb early reservation" from Mike Rapoport reworks and consolidates a pile of straggly code related to reservation of hugetlb memory from bootmem and creation of CMA areas for hugetlb. - The 9 patch series "mm: clean up anon_vma implementation" from Lorenzo Stoakes cleans up the anon_vma implementation in various ways. - The 3 patch series "tweaks for __alloc_pages_slowpath()" from Vlastimil Babka does a little streamlining of the page allocator's slowpath code. - The 8 patch series "memcg: separate private and public ID namespaces" from Shakeel Butt cleans up the memcg ID code and prevents the internal-only private IDs from being exposed to userspace. - The 6 patch series "mm: hugetlb: allocate frozen gigantic folio" from Kefeng Wang cleans up the allocation of frozen folios and avoids some atomic refcount operations. - The 11 patch series "mm/damon: advance DAMOS-based LRU sorting" from SeongJae Park improves DAMOS's movement of memory betewwn the active and inactive LRUs and adds auto-tuning of the ratio-based quotas and of monitoring intervals. - The 18 patch series "Support page table check on PowerPC" from Andrew Donnellan makes CONFIG_PAGE_TABLE_CHECK_ENFORCED work on powerpc. - The 3 patch series "nodemask: align nodes_and{,not} with underlying bitmap ops" from Yury Norov makes nodes_and() and nodes_andnot() propagate the return values from the underlying bit operations, enabling some cleanup in calling code. - The 5 patch series "mm/damon: hide kdamond and kdamond_lock from API callers" from SeongJae Park cleans up some DAMON internal interfaces. - The 4 patch series "mm/khugepaged: cleanups and scan limit fix" from Shivank Garg does some cleanup work in khupaged and fixes a scan limit accounting issue. - The 24 patch series "mm: balloon infrastructure cleanups" from David Hildenbrand goes to town on the balloon infrastructure and its page migration function. Mainly cleanups, also some locking simplification. - The 2 patch series "mm/vmscan: add tracepoint and reason for kswapd_failures reset" from Jiayuan Chen adds additional tracepoints to the page reclaim code. - The 3 patch series "Replace wq users and add WQ_PERCPU to alloc_workqueue() users" from Marco Crivellari is part of Marco's kernel-wide migration from the legacy workqueue APIs over to the preferred unbound workqueues. - The 9 patch series "Various mm kselftests improvements/fixes" from Kevin Brodsky provides various unrelated improvements/fixes for the mm kselftests. - The 5 patch series "mm: accelerate gigantic folio allocation" from Kefeng Wang greatly speeds up gigantic folio allocation, mainly by avoiding unnecessary work in pfn_range_valid_contig(). - The 5 patch series "selftests/damon: improve leak detection and wss estimation reliability" from SeongJae Park improves the reliability of two of the DAMON selftests. - The 8 patch series "mm/damon: cleanup kdamond, damon_call(), damos filter and DAMON_MIN_REGION" from SeongJae Park does some cleanup work in the core DAMON code. - The 8 patch series "Docs/mm/damon: update intro, modules, maintainer profile, and misc" from SeongJae Park performs maintenance work on the DAMON documentation. - The 10 patch series "mm: add and use vma_assert_stabilised() helper" from Lorenzo Stoakes refactors and cleans up the core VMA code. The main aim here is to be able to use the mmap write lock's lockdep state to perform various assertions regarding the locking which the VMA code requires. - The 19 patch series "mm, swap: swap table phase II: unify swapin use" from Kairui Song removes some old swap code (swap cache bypassing and swap synchronization) which wasn't working very well. Various other cleanups and simplifications were made. The end result is a 20% speedup in one benchmark. - The 8 patch series "enable PT_RECLAIM on more 64-bit architectures" from Qi Zheng makes PT_RECLAIM available on 64-bit alpha, loongarch, mips, parisc, um, Various cleanups were performed along the way. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaY1HfAAKCRDdBJ7gKXxA jqhZAP9H8ZlKKqCEgnr6U5XXmJ63Ep2FDQpl8p35yr9yVuU9+gEAgfyWiJ43l1fP rT0yjsUW3KQFBi/SEA3R6aYarmoIBgI= =+HLt -----END PGP SIGNATURE----- Merge tag 'mm-stable-2026-02-11-19-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "powerpc/64s: do not re-activate batched TLB flush" makes arch_{enter\|leave}_lazy_mmu_mode() nest properly (Alexander Gordeev) It adds a generic enter/leave layer and switches architectures to use it. Various hacks were removed in the process. - "zram: introduce compressed data writeback" implements data compression for zram writeback (Richard Chang and Sergey Senozhatsky) - "mm: folio_zero_user: clear page ranges" adds clearing of contiguous page ranges for hugepages. Large improvements during demand faulting are demonstrated (David Hildenbrand) - "memcg cleanups" tidies up some memcg code (Chen Ridong) - "mm/damon: introduce {,max_}nr_snapshots and tracepoint for damos stats" improves DAMOS stat's provided information, deterministic control, and readability (SeongJae Park) - "selftests/mm: hugetlb cgroup charging: robustness fixes" fixes a few issues in the hugetlb cgroup charging selftests (Li Wang) - "Fix va_high_addr_switch.sh test failure - again" addresses several issues in the va_high_addr_switch test (Chunyu Hu) - "mm/damon/tests/core-kunit: extend existing test scenarios" improves the KUnit test coverage for DAMON (Shu Anzai) - "mm/khugepaged: fix dirty page handling for MADV_COLLAPSE" fixes a glitch in khugepaged which was causing madvise(MADV_COLLAPSE) to transiently return -EAGAIN (Shivank Garg) - "arch, mm: consolidate hugetlb early reservation" reworks and consolidates a pile of straggly code related to reservation of hugetlb memory from bootmem and creation of CMA areas for hugetlb (Mike Rapoport) - "mm: clean up anon_vma implementation" cleans up the anon_vma implementation in various ways (Lorenzo Stoakes) - "tweaks for __alloc_pages_slowpath()" does a little streamlining of the page allocator's slowpath code (Vlastimil Babka) - "memcg: separate private and public ID namespaces" cleans up the memcg ID code and prevents the internal-only private IDs from being exposed to userspace (Shakeel Butt) - "mm: hugetlb: allocate frozen gigantic folio" cleans up the allocation of frozen folios and avoids some atomic refcount operations (Kefeng Wang) - "mm/damon: advance DAMOS-based LRU sorting" improves DAMOS's movement of memory betewwn the active and inactive LRUs and adds auto-tuning of the ratio-based quotas and of monitoring intervals (SeongJae Park) - "Support page table check on PowerPC" makes CONFIG_PAGE_TABLE_CHECK_ENFORCED work on powerpc (Andrew Donnellan) - "nodemask: align nodes_and{,not} with underlying bitmap ops" makes nodes_and() and nodes_andnot() propagate the return values from the underlying bit operations, enabling some cleanup in calling code (Yury Norov) - "mm/damon: hide kdamond and kdamond_lock from API callers" cleans up some DAMON internal interfaces (SeongJae Park) - "mm/khugepaged: cleanups and scan limit fix" does some cleanup work in khupaged and fixes a scan limit accounting issue (Shivank Garg) - "mm: balloon infrastructure cleanups" goes to town on the balloon infrastructure and its page migration function. Mainly cleanups, also some locking simplification (David Hildenbrand) - "mm/vmscan: add tracepoint and reason for kswapd_failures reset" adds additional tracepoints to the page reclaim code (Jiayuan Chen) - "Replace wq users and add WQ_PERCPU to alloc_workqueue() users" is part of Marco's kernel-wide migration from the legacy workqueue APIs over to the preferred unbound workqueues (Marco Crivellari) - "Various mm kselftests improvements/fixes" provides various unrelated improvements/fixes for the mm kselftests (Kevin Brodsky) - "mm: accelerate gigantic folio allocation" greatly speeds up gigantic folio allocation, mainly by avoiding unnecessary work in pfn_range_valid_contig() (Kefeng Wang) - "selftests/damon: improve leak detection and wss estimation reliability" improves the reliability of two of the DAMON selftests (SeongJae Park) - "mm/damon: cleanup kdamond, damon_call(), damos filter and DAMON_MIN_REGION" does some cleanup work in the core DAMON code (SeongJae Park) - "Docs/mm/damon: update intro, modules, maintainer profile, and misc" performs maintenance work on the DAMON documentation (SeongJae Park) - "mm: add and use vma_assert_stabilised() helper" refactors and cleans up the core VMA code. The main aim here is to be able to use the mmap write lock's lockdep state to perform various assertions regarding the locking which the VMA code requires (Lorenzo Stoakes) - "mm, swap: swap table phase II: unify swapin use" removes some old swap code (swap cache bypassing and swap synchronization) which wasn't working very well. Various other cleanups and simplifications were made. The end result is a 20% speedup in one benchmark (Kairui Song) - "enable PT_RECLAIM on more 64-bit architectures" makes PT_RECLAIM available on 64-bit alpha, loongarch, mips, parisc, and um. Various cleanups were performed along the way (Qi Zheng) * tag 'mm-stable-2026-02-11-19-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (325 commits) mm/memory: handle non-split locks correctly in zap_empty_pte_table() mm: move pte table reclaim code to memory.c mm: make PT_RECLAIM depends on MMU_GATHER_RCU_TABLE_FREE mm: convert __HAVE_ARCH_TLB_REMOVE_TABLE to CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config um: mm: enable MMU_GATHER_RCU_TABLE_FREE parisc: mm: enable MMU_GATHER_RCU_TABLE_FREE mips: mm: enable MMU_GATHER_RCU_TABLE_FREE LoongArch: mm: enable MMU_GATHER_RCU_TABLE_FREE alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE mm: change mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h mm/damon/stat: remove __read_mostly from memory_idle_ms_percentiles zsmalloc: make common caches global mm: add SPDX id lines to some mm source files mm/zswap: use %pe to print error pointers mm/vmscan: use %pe to print error pointers mm/readahead: fix typo in comment mm: khugepaged: fix NR_FILE_PAGES and NR_SHMEM in collapse_file() mm: refactor vma_map_pages to use vm_insert_pages mm/damon: unify address range representation with damon_addr_range mm/cma: replace snprintf with strscpy in cma_new_area ...	2026-02-12 11:32:37 -08:00
Linus Torvalds	ff661eeee2	cgroup: Changes for v6.20 - cpuset changes: - Continue separating v1 and v2 implementations by moving more v1-specific logic into cpuset-v1.c. - Improve partition handling. Sibling partitions are no longer invalidated on cpuset.cpus conflict, cpuset.cpus changes no longer fail in v2, and effective_xcpus computation is made consistent. - Fix partition effective CPUs overlap that caused a warning on cpuset removal when sibling partitions shared CPUs. - Increase the maximum cgroup subsystem count from 16 to 32 to accommodate future subsystem additions. - Misc cleanups and selftest improvements including switching to css_is_online() helper, removing dead code and stale documentation references, using lockdep_assert_cpuset_lock_held() consistently, and adding polling helpers for asynchronously updated cgroup statistics. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaYozIw4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGZQKAQD51KJQz4M79wf2yBhIBLOnM4aakMalhSwZNL4O JiGutwD+Ir33VzNX8aXBuDin9p4wI15O54PhqSenJbelKRQ3Dws= =gR7L -----END PGP SIGNATURE----- Merge tag 'cgroup-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - cpuset changes: - Continue separating v1 and v2 implementations by moving more v1-specific logic into cpuset-v1.c - Improve partition handling. Sibling partitions are no longer invalidated on cpuset.cpus conflict, cpuset.cpus changes no longer fail in v2, and effective_xcpus computation is made consistent - Fix partition effective CPUs overlap that caused a warning on cpuset removal when sibling partitions shared CPUs - Increase the maximum cgroup subsystem count from 16 to 32 to accommodate future subsystem additions - Misc cleanups and selftest improvements including switching to css_is_online() helper, removing dead code and stale documentation references, using lockdep_assert_cpuset_lock_held() consistently, and adding polling helpers for asynchronously updated cgroup statistics * tag 'cgroup-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits) cpuset: fix overlap of partition effective CPUs cgroup: increase maximum subsystem count from 16 to 32 cgroup: Remove stale cpu.rt.max reference from documentation cpuset: replace direct lockdep_assert_held() with lockdep_assert_cpuset_lock_held() cgroup/cpuset: Move the v1 empty cpus/mems check to cpuset1_validate_change() cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus conflict cgroup/cpuset: Don't fail cpuset.cpus change in v2 cgroup/cpuset: Consistently compute effective_xcpus in update_cpumasks_hier() cgroup/cpuset: Streamline rm_siblings_excl_cpus() cpuset: remove dead code in cpuset-v1.c cpuset: remove v1-specific code from generate_sched_domains cpuset: separate generate_sched_domains for v1 and v2 cpuset: move update_domain_attr_tree to cpuset_v1.c cpuset: add cpuset1_init helper for v1 initialization cpuset: add cpuset1_online_css helper for v1-specific operations cpuset: add lockdep_assert_cpuset_lock_held helper cpuset: Remove unnecessary checks in rebuild_sched_domains_locked cgroup: switch to css_is_online() helper selftests: cgroup: Replace sleep with cg_read_key_long_poll() for waiting on nr_dying_descendants selftests: cgroup: make test_memcg_sock robust against delayed sock stats ...	2026-02-11 13:20:50 -08:00
Linus Torvalds	d16738a4e7	The kthread code provides an infrastructure which manages the preferred affinity of unbound kthreads (node or custom cpumask) against housekeeping (CPU isolation) constraints and CPU hotplug events. One crucial missing piece is the handling of cpuset: when an isolated partition is created, deleted, or its CPUs updated, all the unbound kthreads in the top cpuset become indifferently affine to _all_ the non-isolated CPUs, possibly breaking their preferred affinity along the way. Solve this with performing the kthreads affinity update from cpuset to the kthreads consolidated relevant code instead so that preferred affinities are honoured and applied against the updated cpuset isolated partitions. The dispatch of the new isolated cpumasks to timers, workqueues and kthreads is performed by housekeeping, as per the nice Tejun's suggestion. As a welcome side effect, HK_TYPE_DOMAIN then integrates both the set from boot defined domain isolation (through isolcpus=) and cpuset isolated partitions. Housekeeping cpumasks are now modifiable with a specific RCU based synchronization. A big step toward making nohz_full= also mutable through cpuset in the future. -----BEGIN PGP SIGNATURE----- iQJPBAABCAA5FiEEd76+gtGM8MbftQlOhSRUR1COjHcFAmmE0mYbFIAAAAAABAAO bWFudTIsMi41KzEuMTEsMiwyAAoJEIUkVEdQjox36eMP/0Ls/ArfYVi/MNAXWlpy rAt6m9Y/X9GBcDM/VI9BXq1ZX4qEr2XjJ8UUb8cM08uHEAt0ErlmpRxREwJFrKbI H4jzg5EwO0D0c6MnvgQJEAwkHxQVIjsxG9DovRIjxyW4ycx3aSsRg/f2VKyWoLvY 7ZT7CbLFE+I/MQh2ZgUu/9pnCDQVR2anss2WYIej5mmgFL5pyEv3YvYgKYVyK08z sXyNxpP976g2d9ECJ9OtFJV9we6mlqxlG0MVCiv/Uxh7DBjxWWPsLvlmLAXggQ03 +0GW+nnutDaKz83pgS7Z4zum/+Oa+I1dTLIN27pARUNcMCYip7njM2KNpJwPdov3 +fAIODH2JVX1xewT+U1cCq6gdI55ejbwdQYGFV075dKBUxKQeIyrghvfC3Ga6aKQ Gw3y68jdrXOw6iyfHR5k/0Mnu2/FDKUW2fZxLKm55PvNZP5jQFmSlz9wyiwwyb3m UUSgThj6Ozodxks8hDX41rGVezCcm1ni+qNSiNIs8HPaaZQrwbnvKHQFBBJHQzJP rJ39VWBx3Hq/ly71BOR6pCzoZsfS1f85YKhJ4vsfjLO6BfhI16nBat89eROSRKcz XptyWqW0PgAD0teDuMCTPNuUym/viBHALXHKuSO12CIizacvftiGcmaQNPlLiiFZ /Dr2+aOhwYw3UD6djn3u94M9 =nWGh -----END PGP SIGNATURE----- Merge tag 'kthread-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks Pull kthread updates from Frederic Weisbecker: "The kthread code provides an infrastructure which manages the preferred affinity of unbound kthreads (node or custom cpumask) against housekeeping (CPU isolation) constraints and CPU hotplug events. One crucial missing piece is the handling of cpuset: when an isolated partition is created, deleted, or its CPUs updated, all the unbound kthreads in the top cpuset become indifferently affine to _all_ the non-isolated CPUs, possibly breaking their preferred affinity along the way. Solve this with performing the kthreads affinity update from cpuset to the kthreads consolidated relevant code instead so that preferred affinities are honoured and applied against the updated cpuset isolated partitions. The dispatch of the new isolated cpumasks to timers, workqueues and kthreads is performed by housekeeping, as per the nice Tejun's suggestion. As a welcome side effect, HK_TYPE_DOMAIN then integrates both the set from boot defined domain isolation (through isolcpus=) and cpuset isolated partitions. Housekeeping cpumasks are now modifiable with a specific RCU based synchronization. A big step toward making nohz_full= also mutable through cpuset in the future" * tag 'kthread-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: (33 commits) doc: Add housekeeping documentation kthread: Document kthread_affine_preferred() kthread: Comment on the purpose and placement of kthread_affine_node() call kthread: Honour kthreads preferred affinity after cpuset changes sched/arm64: Move fallback task cpumask to HK_TYPE_DOMAIN sched: Switch the fallback task allowed cpumask to HK_TYPE_DOMAIN kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management kthread: Include kthreadd to the managed affinity list kthread: Include unbound kthreads in the managed affinity list kthread: Refine naming of affinity related fields PCI: Remove superfluous HK_TYPE_WQ check sched/isolation: Remove HK_TYPE_TICK test from cpu_is_isolated() cpuset: Remove cpuset_cpu_is_isolated() timers/migration: Remove superfluous cpuset isolation test cpuset: Propagate cpuset isolation update to timers through housekeeping cpuset: Propagate cpuset isolation update to workqueue through housekeeping PCI: Flush PCI probe workqueue on cpuset isolated partition change sched/isolation: Flush vmstat workqueues on cpuset isolated partition change sched/isolation: Flush memcg workqueues on cpuset isolated partition change cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset ...	2026-02-09 19:57:30 -08:00
Frederic Weisbecker	e894f63398	kthread: Honour kthreads preferred affinity after cpuset changes When cpuset isolated partitions get updated, unbound kthreads get indifferently affine to all non isolated CPUs, regardless of their individual affinity preferences. For example kswapd is a per-node kthread that prefers to be affine to the node it refers to. Whenever an isolated partition is created, updated or deleted, kswapd's node affinity is going to be broken if any CPU in the related node is not isolated because kswapd will be affine globally. Fix this with letting the consolidated kthread managed affinity code do the affinity update on behalf of cpuset. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Waiman Long <longman@redhat.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Marco Crivellari <marco.crivellari@suse.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: cgroups@vger.kernel.org	2026-02-03 15:23:35 +01:00
Frederic Weisbecker	6440966067	cpuset: Remove cpuset_cpu_is_isolated() The set of cpuset isolated CPUs is now included in HK_TYPE_DOMAIN housekeeping cpumask. There is no usecase left interested in just checking what is isolated by cpuset and not by the isolcpus= kernel boot parameter. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Waiman Long <longman@redhat.com> Cc: "Michal Koutný" <mkoutny@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Marco Crivellari <marco.crivellari@suse.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: cgroups@vger.kernel.org	2026-02-03 15:23:34 +01:00
Frederic Weisbecker	f5c145ae4f	cpuset: Propagate cpuset isolation update to timers through housekeeping Until now, cpuset would propagate isolated partition changes to timer migration so that unbound timers don't get migrated to isolated CPUs. Since housekeeping now centralizes, synchronize and propagates isolation cpumask changes, perform the work from that subsystem for consolidation and consistency purposes. Signed-off-by: Frederic Weisbecker <frederic@kernel.org>	2026-02-03 15:23:34 +01:00
Frederic Weisbecker	23f09dcc0a	cpuset: Propagate cpuset isolation update to workqueue through housekeeping Until now, cpuset would propagate isolated partition changes to workqueues so that unbound workers get properly reaffined. Since housekeeping now centralizes, synchronize and propagates isolation cpumask changes, perform the work from that subsystem for consolidation and consistency purposes. For simplification purpose, the target function is adapted to take the new housekeeping mask instead of the isolated mask. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Waiman Long <longman@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: "Michal Koutný" <mkoutny@suse.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Marco Crivellari <marco.crivellari@suse.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: cgroups@vger.kernel.org	2026-02-03 15:23:34 +01:00
Frederic Weisbecker	03ff735101	cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset Until now, HK_TYPE_DOMAIN used to only include boot defined isolated CPUs passed through isolcpus= boot option. Users interested in also knowing the runtime defined isolated CPUs through cpuset must use different APIs: cpuset_cpu_is_isolated(), cpu_is_isolated(), etc... There are many drawbacks to that approach: 1) Most interested subsystems want to know about all isolated CPUs, not just those defined on boot time. 2) cpuset_cpu_is_isolated() / cpu_is_isolated() are not synchronized with concurrent cpuset changes. 3) Further cpuset modifications are not propagated to subsystems Solve 1) and 2) and centralize all isolated CPUs within the HK_TYPE_DOMAIN housekeeping cpumask. Subsystems can rely on RCU to synchronize against concurrent changes. The propagation mentioned in 3) will be handled in further patches. [Chen Ridong: Fix cpu_hotplug_lock deadlock and use correct static branch API] Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Waiman Long <longman@redhat.com> Reviewed-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Chen Ridong <chenridong@huawei.com> Cc: "Michal Koutný" <mkoutny@suse.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Marco Crivellari <marco.crivellari@suse.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: cgroups@vger.kernel.org	2026-02-03 15:23:34 +01:00
Frederic Weisbecker	a7e546354d	cpuset: Provide lockdep check for cpuset lock held cpuset modifies partitions, including isolated, while holding the cpuset mutex. This means that holding the cpuset mutex is safe to synchronize against housekeeping cpumask changes. Provide a lockdep check to validate that. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: "Michal Koutný" <mkoutny@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Marco Crivellari <marco.crivellari@suse.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: cgroups@vger.kernel.org Cc: linux-kernel@vger.kernel.org	2026-02-03 15:23:33 +01:00
Frederic Weisbecker	0f4dfdc17b	cpuset: Convert boot_hk_cpus to use HK_TYPE_DOMAIN_BOOT boot_hk_cpus is an ad-hoc copy of HK_TYPE_DOMAIN_BOOT. Remove it and use the official version. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Phil Auld <pauld@redhat.com> Reviewed-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Marco Crivellari <marco.crivellari@suse.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Koutny <mkoutny@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: cgroups@vger.kernel.org	2026-02-03 15:23:33 +01:00
Chen Ridong	8b1f3c54f9	cpuset: fix overlap of partition effective CPUs A warning was detect: WARNING: kernel/cgroup/cpuset.c:825 at rebuild_sched_domains_locked Modules linked in: CPU: 12 UID: 0 PID: 681 Comm: rmdir 6.19.0-rc6-next-20260121+ RIP: 0010:rebuild_sched_domains_locked+0x309/0x4b0 RSP: 0018:ffffc900019bbd28 EFLAGS: 00000202 RAX: ffff888104413508 RBX: 0000000000000008 RCX: ffff888104413510 RDX: ffff888109b5f400 RSI: 000000000000ffcf RDI: 0000000000000001 RBP: 0000000000000002 R08: ffff888104413508 R09: 0000000000000002 R10: ffff888104413508 R11: 0000000000000001 R12: ffff888104413500 R13: 0000000000000002 R14: ffffc900019bbd78 R15: 0000000000000000 FS: 00007fe274b8d740(0000) GS:ffff8881b6b3c000(0000) knlGS: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe274c98b50 CR3: 00000001047a9000 CR4: 00000000000006f0 Call Trace: <TASK> update_prstate+0x1c7/0x580 cpuset_css_killed+0x2f/0x50 kill_css+0x32/0x180 cgroup_destroy_locked+0xa7/0x200 cgroup_rmdir+0x28/0x100 kernfs_iop_rmdir+0x4c/0x80 vfs_rmdir+0x12c/0x280 filename_rmdir+0x19e/0x200 __x64_sys_rmdir+0x23/0x40 do_syscall_64+0x6b/0x390 It can be reproduced by steps: # cd /sys/fs/cgroup/ # mkdir A1 # mkdir B1 # mkdir C1 # echo 1-3 > A1/cpuset.cpus # echo root > A1/cpuset.cpus.partition # echo 3-5 > B1/cpuset.cpus # echo root > B1/cpuset.cpus.partition # echo 6 > C1/cpuset.cpus # echo root > C1/cpuset.cpus.partition # rmdir A1/ # rmdir C1/ Both A1 and B1 were initially configured with CPU 3, which was exclusively assigned to A1's partition. When A1 was removed, CPU 3 was returned to the root pool. However, B1 incorrectly regained access to CPU 3 when update_cpumasks_hier was triggered during C1's removal, which also updated sibling configurations. The update_sibling_cpumasks function was called to synchronize siblings' effective CPUs due to changes in their parent's effective CPUs. However, parent effective CPU changes should not affect partition-effective CPUs. To fix this issue, update_cpumasks_hier should only be invoked when the sibling is not a valid partition in the update_sibling_cpumasks. Fixes: `2a3602030d` ("cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus conflict") Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-01 06:49:52 -10:00
Yury Norov	291487b753	cgroup: use nodes_and() output where appropriate Now that nodes_and() returns true if the result nodemask is not empty, drop useless nodes_intersects() in guarantee_online_mems() and nodes_empty() in update_nodemasks_hier(), which both are O(N). Link: https://lkml.kernel.org/r/20260114172217.861204-4-ynorov@nvidia.com Signed-off-by: Yury Norov <ynorov@nvidia.com> Reviewed-by: Gregory Price <gourry@gourry.net> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: David Hildenbrand <david@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-26 20:02:37 -08:00
Tim Bird	a1b3421a02	kernel: cgroup: Add SPDX-License-Identifier lines Add GPL-2.0 SPDX license id lines to a few old files, replacing the reference to the COPYING file. The COPYING file at the time of creation of these files (2007 and 2005) was GPL-v2.0, with an additional clause indicating that only v2 applied. Signed-off-by: Tim Bird <tim.bird@sony.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-01-15 22:03:09 -10:00
Zhao Mengmeng	090e0ae303	cpuset: replace direct lockdep_assert_held() with lockdep_assert_cpuset_lock_held() We already added lockdep_assert_cpuset_lock_held(), use this new function to keep consistency. Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-01-12 09:27:58 -10:00
Waiman Long	272bd81833	cgroup/cpuset: Move the v1 empty cpus/mems check to cpuset1_validate_change() As stated in commit `1c09b195d3` ("cpuset: fix a regression in validating config change"), it is not allowed to clear masks of a cpuset if there're tasks in it. This is specific to v1 since empty "cpuset.cpus" or "cpuset.mems" will cause the v2 cpuset to inherit the effective CPUs or memory nodes from its parent. So it is OK to have empty cpus or mems even if there are tasks in the cpuset. Move this empty cpus/mems check in validate_change() to cpuset1_validate_change() to allow more flexibility in setting cpus or mems in v2. cpuset_is_populated() needs to be moved into cpuset-internal.h as it is needed by the empty cpus/mems checking code. Also add a test case to test_cpuset_prs.sh to verify that. Reported-by: Chen Ridong <chenridong@huaweicloud.com> Closes: https://lore.kernel.org/lkml/7a3ec392-2e86-4693-aa9f-1e668a668b9c@huaweicloud.com/ Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-01-12 09:03:22 -10:00
Waiman Long	2a3602030d	cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus conflict Currently, when setting a cpuset's cpuset.cpus to a value that conflicts with the cpuset.cpus/cpuset.cpus.exclusive of a sibling partition, the sibling's partition state becomes invalid. This is overly harsh and is probably not necessary. The cpuset.cpus.exclusive control file, if set, will override the cpuset.cpus of the same cpuset when creating a cpuset partition. So cpuset.cpus has less priority than cpuset.cpus.exclusive in setting up a partition. However, it cannot override a conflicting cpuset.cpus file in a sibling cpuset and the partition creation process will fail. This is inconsistent. That will also make using cpuset.cpus.exclusive less valuable as a tool to set up cpuset partitions as the users have to check if such a cpuset.cpus conflict exists or not. Fix these problems by making sure that once a cpuset.cpus.exclusive is set without failure, it will always be allowed to form a valid partition as long as at least one CPU can be granted from its parent irrespective of the state of the siblings' cpuset.cpus values. Of course, setting cpuset.cpus.exclusive will fail if it conflicts with the cpuset.cpus.exclusive or the cpuset.cpus.exclusive.effective value of a sibling. Partition can still be created by setting only cpuset.cpus without setting cpuset.cpus.exclusive. However, any conflicting CPUs in sibling's cpuset.cpus.exclusive.effective and cpuset.cpus.exclusive values will be removed from its cpuset.cpus.exclusive.effective as long as there is still one or more CPUs left and can be granted from its parent. This CPU stripping is currently done in rm_siblings_excl_cpus(). The new code will now try its best to enable the creation of new partitions with only cpuset.cpus set without invalidating existing ones. However it is not guaranteed that all the CPUs requested in cpuset.cpus will be used in the new partition even when all these CPUs can be granted from the parent. This is similar to the fact that cpuset.cpus.effective may not be able to include all the CPUs requested in cpuset.cpus. In this case, the parent may not able to grant all the exclusive CPUs requested in cpuset.cpus to cpuset.cpus.exclusive.effective if some of them have already been granted to other partitions earlier. With the creation of multiple sibling partitions by setting only cpuset.cpus, this does have the side effect that their exact cpuset.cpus.exclusive.effective settings will depend on the order of partition creation if there are conflicts. Due to the exclusive nature of the CPUs in a partition, it is not easy to make it fair other than the old behavior of invalidating all the conflicting partitions. For example, # echo "0-2" > A1/cpuset.cpus # echo "root" > A1/cpuset.cpus.partition # cat A1/cpuset.cpus.partition root # cat A1/cpuset.cpus.exclusive.effective 0-2 # echo "2-4" > B1/cpuset.cpus # echo "root" > B1/cpuset.cpus.partition # cat B1/cpuset.cpus.partition root # cat B1/cpuset.cpus.exclusive.effective 3-4 # cat B1/cpuset.cpus.effective 3-4 For users who want to be sure that they can get most of the CPUs they want, cpuset.cpus.exclusive should be used instead if they can set it successfully without failure. Setting cpuset.cpus.exclusive will guarantee that sibling conflicts from then onward is no longer possible. To make this change, we have to separate out the is_cpu_exclusive() check in cpus_excl_conflict() into a cgroup v1 only cpuset1_cpus_excl_conflict() helper. The cpus_allowed_validate_change() helper is now no longer needed and can be removed. Some existing tests in test_cpuset_prs.sh are updated and new ones are added to reflect the new behavior. The cgroup-v2.rst doc file is also updated the clarify what exclusive CPUs will be used when a partition is created. Reported-by: Sun Shaojie <sunshaojie@kylinos.cn> Closes: https://lore.kernel.org/lkml/20251117015708.977585-1-sunshaojie@kylinos.cn/ Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-01-12 09:02:46 -10:00
Waiman Long	6e6f13f6d5	cgroup/cpuset: Don't fail cpuset.cpus change in v2 Commit `fe8cd2736e` ("cgroup/cpuset: Delay setting of CS_CPU_EXCLUSIVE until valid partition") introduced a new check to disallow the setting of a new cpuset.cpus.exclusive value that is a superset of a sibling's cpuset.cpus value so that there will at least be one CPU left in the sibling in case the cpuset becomes a valid partition root. This new check does have the side effect of failing a cpuset.cpus change that make it a subset of a sibling's cpuset.cpus.exclusive value. With v2, users are supposed to be allowed to set whatever value they want in cpuset.cpus without failure. To maintain this rule, the check is now restricted to only when cpuset.cpus.exclusive is being changed not when cpuset.cpus is changed. The cgroup-v2.rst doc file is also updated to reflect this change. Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-01-12 09:02:14 -10:00
Waiman Long	a1a01793ae	cgroup/cpuset: Consistently compute effective_xcpus in update_cpumasks_hier() Since commit `f62a5d3936` ("cgroup/cpuset: Remove remote_partition_check() & make update_cpumasks_hier() handle remote partition"), the compute_effective_exclusive_cpumask() helper was extended to strip exclusive CPUs from siblings when computing effective_xcpus (cpuset.cpus.exclusive.effective). This helper was later renamed to compute_excpus() in commit `86bbbd1f33` ("cpuset: Refactor exclusive CPU mask computation logic"). This helper is supposed to be used consistently to compute effective_xcpus. However, there is an exception within the callback critical section in update_cpumasks_hier() when exclusive_cpus of a valid partition root is empty. This can cause effective_xcpus value to differ depending on where exactly it is last computed. Fix this by using compute_excpus() in this case to give a consistent result. Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-01-12 09:01:44 -10:00
Waiman Long	18bc2425a8	cgroup/cpuset: Streamline rm_siblings_excl_cpus() If exclusive_cpus is set, effective_xcpus must be a subset of exclusive_cpus. Currently, rm_siblings_excl_cpus() checks both exclusive_cpus and effective_xcpus consecutively. It is simpler to check only exclusive_cpus if non-empty or just effective_xcpus otherwise. No functional change is expected. Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-01-12 09:01:40 -10:00
Chen Ridong	7cc1720589	cpuset: remove v1-specific code from generate_sched_domains Following the introduction of cpuset1_generate_sched_domains() for v1 in the previous patch, v1-specific logic can now be removed from the generic generate_sched_domains(). This patch cleans up the v1-only code and ensures uf_node is only visible when CONFIG_CPUSETS_V1=y. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-12-18 08:36:15 -10:00
Chen Ridong	6e1d31ce49	cpuset: separate generate_sched_domains for v1 and v2 The generate_sched_domains() function currently handles both v1 and v2 logic. However, the underlying mechanisms for building scheduler domains differ significantly between the two versions. For cpuset v2, scheduler domains are straightforwardly derived from valid partitions, whereas cpuset v1 employs a more complex union-find algorithm to merge overlapping cpusets. Co-locating these implementations complicates maintenance. This patch, along with subsequent ones, aims to separate the v1 and v2 logic. For ease of review, this patch first copies the generate_sched_domains() function into cpuset-v1.c as cpuset1_generate_sched_domains() and removes v2-specific code. Common helpers and top_cpuset are declared in cpuset-internal.h. When operating in v1 mode, the code now calls cpuset1_generate_sched_domains(). Currently there is some code duplication, which will be largely eliminated once v1-specific code is removed from v2 in the following patch. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-12-18 08:36:15 -10:00
Chen Ridong	cb33f8814c	cpuset: move update_domain_attr_tree to cpuset_v1.c Since relax_domain_level is only applicable to v1, move update_domain_attr_tree() to cpuset-v1.c, which solely updates relax_domain_level, Additionally, relax_domain_level is now initialized in cpuset1_inited. Accordingly, the initialization of relax_domain_level in top_cpuset is removed. The unnecessary remote_partition initialization in top_cpuset is also cleaned up. As a result, relax_domain_level can be defined in cpuset only when CONFIG_CPUSETS_V1=y. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-12-18 08:36:15 -10:00
Chen Ridong	4ef42c645f	cpuset: add cpuset1_init helper for v1 initialization This patch introduces the cpuset1_init helper in cpuset_v1.c to initialize v1-specific fields, including the fmeter and relax_domain_level members. The relax_domain_level related code will be moved to cpuset_v1.c in a subsequent patch. After this move, v1-specific members will only be visible when CONFIG_CPUSETS_V1=y. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-12-18 08:36:15 -10:00
Chen Ridong	56805c1bb1	cpuset: add cpuset1_online_css helper for v1-specific operations This commit introduces the cpuset1_online_css helper to centralize v1-specific handling during cpuset online. It performs operations such as updating the CS_SPREAD_PAGE, CS_SPREAD_SLAB, and CGRP_CPUSET_CLONE_CHILDREN flags, which are unique to the cpuset v1 control group interface. The helper is now placed in cpuset-v1.c to maintain clear separation between v1 and v2 logic. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-12-18 08:36:08 -10:00
Chen Ridong	14c11e1b2a	cpuset: add lockdep_assert_cpuset_lock_held helper Add lockdep_assert_cpuset_lock_held() to allow other subsystems to verify that cpuset_mutex is held. Suggested-by: Waiman Long <longman@redhat.com> Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-12-18 08:35:57 -10:00
Chen Ridong	aa7d3a56a2	cpuset: fix warning when disabling remote partition A warning was triggered as follows: WARNING: kernel/cgroup/cpuset.c:1651 at remote_partition_disable+0xf7/0x110 RIP: 0010:remote_partition_disable+0xf7/0x110 RSP: 0018:ffffc90001947d88 EFLAGS: 00000206 RAX: 0000000000007fff RBX: ffff888103b6e000 RCX: 0000000000006f40 RDX: 0000000000006f00 RSI: ffffc90001947da8 RDI: ffff888103b6e000 RBP: ffff888103b6e000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000001 R11: ffff88810b2e2728 R12: ffffc90001947da8 R13: 0000000000000000 R14: ffffc90001947da8 R15: ffff8881081f1c00 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f55c8bbe0b2 CR3: 000000010b14c000 CR4: 00000000000006f0 Call Trace: <TASK> update_prstate+0x2d3/0x580 cpuset_partition_write+0x94/0xf0 kernfs_fop_write_iter+0x147/0x200 vfs_write+0x35d/0x500 ksys_write+0x66/0xe0 do_syscall_64+0x6b/0x390 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7f55c8cd4887 Reproduction steps (on a 16-CPU machine): # cd /sys/fs/cgroup/ # mkdir A1 # echo +cpuset > A1/cgroup.subtree_control # echo "0-14" > A1/cpuset.cpus.exclusive # mkdir A1/A2 # echo "0-14" > A1/A2/cpuset.cpus.exclusive # echo "root" > A1/A2/cpuset.cpus.partition # echo 0 > /sys/devices/system/cpu/cpu15/online # echo member > A1/A2/cpuset.cpus.partition When CPU 15 is offlined, subpartitions_cpus gets cleared because no CPUs remain available for the top_cpuset, forcing partitions to share CPUs with the top_cpuset. In this scenario, disabling the remote partition triggers a warning stating that effective_xcpus is not a subset of subpartitions_cpus. Partitions should be invalidated in this case to inform users that the partition is now invalid(cpus are shared with top_cpuset). To fix this issue: 1. Only emit the warning only if subpartitions_cpus is not empty and the effective_xcpus is not a subset of subpartitions_cpus. 2. During the CPU hotplug process, invalidate partitions if subpartitions_cpus is empty. Fixes: `f62a5d3936` ("cgroup/cpuset: Remove remote_partition_check() & make update_cpumasks_hier() handle remote partition") Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-12-18 06:38:54 -10:00
Chen Ridong	6ee43047e8	cpuset: Remove unnecessary checks in rebuild_sched_domains_locked Commit `406100f3da` ("cpuset: fix race between hotplug work and later CPU offline") added a check for empty effective_cpus in partitions for cgroup v2. However, this check did not account for remote partitions, which were introduced later. After commit `2125c0034c` ("cgroup/cpuset: Make cpuset hotplug processing synchronous"), cpuset hotplug handling is now synchronous. This eliminates the race condition with subsequent CPU offline operations that the original check aimed to fix. Instead of extending the check to support remote partitions, this patch removes all the redundant effective_cpus check. Additionally, it adds a check and warning to verify that all generated sched domains consist of active CPUs, preventing partition_sched_domains from being invoked with offline CPUs. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-12-08 09:02:42 -10:00
Linus Torvalds	8449d3252c	cgroup: Changes for v6.19 - Defer task cgroup unlink until after the dying task's final context switch so that controllers see the cgroup properly populated until the task is truly gone. - cpuset cleanups and simplifications. Enforce that domain isolated CPUs stay in root or isolated partitions and fail if isolated+nohz_full would leave no housekeeping CPU. Fix sched/deadline root domain handling during CPU hot-unplug and race for tasks in attaching cpusets. - Misc fixes including memory reclaim protection documentation and selftest KTAP conformance. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaS3pEQ4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGYbrAP9H0kVyWH5tK9VhjSZyqidic8NuvtmNOyhIRrg0 8S8K0wD/YG9xlh2JUyRmS4B23ggc59+9y5xM2/sctrho51Pvsgg= =0MB+ -----END PGP SIGNATURE----- Merge tag 'cgroup-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Defer task cgroup unlink until after the dying task's final context switch so that controllers see the cgroup properly populated until the task is truly gone - cpuset cleanups and simplifications. Enforce that domain isolated CPUs stay in root or isolated partitions and fail if isolated+nohz_full would leave no housekeeping CPU. Fix sched/deadline root domain handling during CPU hot-unplug and race for tasks in attaching cpusets - Misc fixes including memory reclaim protection documentation and selftest KTAP conformance * tag 'cgroup-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits) cpuset: Treat cpusets in attaching as populated sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug cgroup/cpuset: Introduce cpuset_cpus_allowed_locked() docs: cgroup: No special handling of unpopulated memcgs docs: cgroup: Note about sibling relative reclaim protection docs: cgroup: Explain reclaim protection target selftests/cgroup: conform test to KTAP format output cpuset: remove need_rebuild_sched_domains cpuset: remove global remote_children list cpuset: simplify node setting on error cgroup: include missing header for struct irq_work cgroup: Fix sleeping from invalid context warning on PREEMPT_RT cgroup/cpuset: Globally track isolated_cpus update cgroup/cpuset: Ensure domain isolated CPUs stay in root or isolated partition cgroup/cpuset: Move up prstate_housekeeping_conflict() helper cgroup/cpuset: Fail if isolated and nohz_full don't leave any housekeeping cgroup/cpuset: Rename update_unbound_workqueue_cpumask() to update_isolation_cpumasks() cgroup: Defer task cgroup unlink until after the task is done switching out cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free() cgroup: Rename cgroup lifecycle hooks to cgroup_task_*() ...	2025-12-03 13:04:07 -08:00
Linus Torvalds	d42e504a55	Update to the time/timers core: - Prevent a thundering herd problem when the timekeeper CPU is delayed and a large number of CPUs compete to acquire jiffies_lock to do the update. Limit it to one CPU with a separate "uncontended" atomic variable. - A set of improvements for the timer migration mechanism: - Support imbalanced NUMA trees correctly - Support dynamic exclusion of CPUs from the migrator duty to allow the cpuset/isolation mechanism to exclude them from handling timers of remote idle CPUs. - The usual small updates, cleanups and enhancements -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmks7doTHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoaxrD/40nxx+8cEXsVbVLIkP2PQbd2Y8+7sk YbNu/Cb7j7Bg7R8YIs4p5GHk+7Yt/hNsW77SmbAzRPUyYYG6L3bUYlBa3yQlvIuo xRPbzGA+RJies9skIGHbQ8z6ig1zUASRJPcBYiuaVIAuQhCfLNc4Nii9cEWtjZ24 +5gfRwV+vy74ArWwRkwaGejDK1tav+gd62OkFQZC8WtjQ08ozGZ6VBJNg7nYq/gH FYO1rH2tQ/ZyjlO/x5NF8gFcjYD8iv5PDp8oH35MPx+XTdDccf0G3QB7ug0ffVdV b4gA6lZTAmpsu/NHb6ByN4i/kf3wf8la/i+EaAh/Ov7NW078gunvVKVA7jStcbBl ZgG5SRHiKRvQF/WXLGVQAnilRDZwRuS0nmJlqfExa44v23l5o3768RwdRYwQlv8g X5KSRl0jlVgVtZHgNBlZtgX9+rnQSr9sB5sVGBP2a6a1WhVXQV/2kp0wjdnU0mPw jLCnSdsHqBlSf9V7O/na823WCnBFb7blrLBXUoSbHBnICqtVFzhE1kBXWw3S7Kqh CiaWM+S4WfR0HRnUlWMTS8BZ82MgiDnd7nGUXWwXBbdqWmoj/9CoU6SZRjbMBkzi EY1XvmoYf6eSzdxfydI1hFi0/bbb8K9umHQlrpW3HeN9uXnVc0/+TroVPLuaKUdi 53ClqXjzE+CpJg== =lQKn -----END PGP SIGNATURE----- Merge tag 'timers-core-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer core updates from Thomas Gleixner: - Prevent a thundering herd problem when the timekeeper CPU is delayed and a large number of CPUs compete to acquire jiffies_lock to do the update. Limit it to one CPU with a separate "uncontended" atomic variable. - A set of improvements for the timer migration mechanism: - Support imbalanced NUMA trees correctly - Support dynamic exclusion of CPUs from the migrator duty to allow the cpuset/isolation mechanism to exclude them from handling timers of remote idle CPUs - The usual small updates, cleanups and enhancements * tag 'timers-core-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timers/migration: Exclude isolated cpus from hierarchy cpumask: Add initialiser to use cleanup helpers sched/isolation: Force housekeeping if isolcpus and nohz_full don't leave any cgroup/cpuset: Rename update_unbound_workqueue_cpumask() to update_isolation_cpumasks() timers/migration: Use scoped_guard on available flag set/clear timers/migration: Add mask for CPUs available in the hierarchy timers/migration: Rename 'online' bit to 'available' selftests/timers/nanosleep: Add tests for return of remaining time selftests/timers: Clean up kernel version check in posix_timers time: Fix a few typos in time[r] related code comments time: tick-oneshot: Add missing Return and parameter descriptions to kernel-doc hrtimer: Store time as ktime_t in restart block timers/migration: Remove dead code handling idle CPU checking for remote timers timers/migration: Remove unused "cpu" parameter from tmigr_get_group() timers/migration: Assert that hotplug preparing CPU is part of stable active hierarchy timers/migration: Fix imbalanced NUMA trees timers/migration: Remove locking on group connection timers/migration: Convert "while" loops to use "for" tick/sched: Limit non-timekeeper CPUs calling jiffies update	2025-12-02 09:58:33 -08:00

1 2 3 4 5 ...

322 Commits