linux/kernel
Frederic Weisbecker bd3c45dd01 timers/migration: Fix another hotplug activation race
The hotplug control CPU is assumed to be active in the hierarchy but
that doesn't imply that the root is active. If the current CPU is not
the one that activated the current hierarchy, and the CPU performing
this duty is still halfway through the tree, the root may still be
observed inactive. And this can break the activation of a new root as in
the following scenario:

1) Initially, the whole system has 64 CPUs and only CPU 63 is awake.

                   [GRP1:0]
                    active
                  /    |    \
                 /     |     \
         [GRP0:0]    [...]    [GRP0:7]
           idle      idle      active
         /   |   \               |
     CPU 0  CPU 1  ...         CPU 63
     idle   idle               active

2) CPU 63 goes idle _but_ due to a #VMEXIT it hasn't yet reached the
   [GRP1:0]->parent dereference (that would be NULL and stop the walk)
   in __walk_groups_from().

                   [GRP1:0]
                     idle
                  /    |    \
                 /     |     \
         [GRP0:0]    [...]    [GRP0:7]
           idle      idle       idle
         /   |   \                |
     CPU 0  CPU 1  ...         CPU 63
     idle   idle                idle

3) CPU 1 wakes up, activates GRP0:0 but didn't yet manage to propagate
   up to GRP1:0 due to yet another #VMEXIT.

                   [GRP1:0]
                     idle
                  /    |    \
                 /     |     \
         [GRP0:0]    [...]    [GRP0:7]
         active      idle       idle
         /   |   \                |
     CPU 0  CPU 1  ...         CPU 63
     idle  active               idle

3) CPU 0 wakes up and doesn't need to walk above GRP0:0 as it's CPU 1
   role.

                   [GRP1:0]
                     idle
                  /    |    \
                 /     |     \
         [GRP0:0]    [...]    [GRP0:7]
         active      idle       idle
         /   |   \                |
     CPU 0  CPU 1  ...         CPU 63
    active  active              idle

4) CPU 0 boots CPU 64. It creates a new root for it.

                             [GRP2:0]
                               idle
                           /          \
                          /            \
                   [GRP1:0]           [GRP1:1]
                   idle                 idle
                  /    |    \                \
                 /     |     \                \
         [GRP0:0]    [...]    [GRP0:7]      [GRP0:8]
         active      idle       idle          idle
         /   |   \                |            |
     CPU 0  CPU 1  ...         CPU 63        CPU 64
    active  active              idle         offline

5) CPU 0 activates the new root, but note that GRP1:0 is still idle,
   waiting for CPU 1 to resume from #VMEXIT and activate it.

                             [GRP2:0]
                              active
                           /          \
                          /            \
                   [GRP1:0]           [GRP1:1]
                   idle                 idle
                  /    |    \                \
                 /     |     \                \
         [GRP0:0]    [...]    [GRP0:7]      [GRP0:8]
         active      idle       idle          idle
         /   |   \                |            |
     CPU 0  CPU 1  ...         CPU 63        CPU 64
    active  active              idle         offline

6) CPU 63 resumes after #VMEXIT and sees the new GRP1:0 parent.
   Therefore it propagates the stale inactive state of GRP1:0 up to
   GRP2:0.

                             [GRP2:0]
                              idle
                           /          \
                          /            \
                   [GRP1:0]           [GRP1:1]
                   idle                 idle
                  /    |    \                \
                 /     |     \                \
         [GRP0:0]    [...]    [GRP0:7]      [GRP0:8]
         active      idle       idle          idle
         /   |   \                |            |
     CPU 0  CPU 1  ...         CPU 63        CPU 64
    active  active              idle         offline

7) CPU 1 resumes after #VMEXIT and finally activates GRP1:0. But it
   doesn't observe its parent link because no ordering enforced that.
   Therefore GRP2:0 is spuriously left idle.

                             [GRP2:0]
                              idle
                           /          \
                          /            \
                   [GRP1:0]           [GRP1:1]
                   active                 idle
                  /    |    \                \
                 /     |     \                \
         [GRP0:0]    [...]    [GRP0:7]      [GRP0:8]
         active      idle       idle          idle
         /   |   \                |            |
     CPU 0  CPU 1  ...         CPU 63        CPU 64
    active  active              idle         offline

Such races are highly theoretical and the problem would solve itself
once the old root ever becomes idle again. But it still leaves a taste
of discomfort.

Fix it with enforcing a fully ordered atomic read of the old root state
before propagating the activate state up to the new root. It has a two
directions ordering effect:

* Acquire + release of the latest old root state: If the hotplug control
  CPU is not the one that woke up the old root, make sure to acquire its
  active state and propagate it upwards through the ordered chain of
  activation (the acquire pairs with the cmpxchg() in tmigr_active_up()
  and subsequent releases will pair with atomic_read_acquire() and
  smp_mb__after_atomic() in tmigr_inactive_up()).

* Release: If the hotplug control CPU is not the one that must wake up
  the old root, but the CPU covering that is lagging behind its duty,
  publish the links from the old root to the new parents. This way the
  lagging CPU will propagate the active state itself.

Fixes: 7ee9887703 ("timers: Implement the hierarchical pull model")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260423165354.95152-2-frederic@kernel.org
2026-05-06 08:21:12 +02:00
..
bpf bpf-fixes 2026-04-17 15:58:22 -07:00
cgroup mm.git review status for linus..mm-stable 2026-04-19 08:01:17 -07:00
configs Remove WARN_ALL_UNSEEDED_RANDOM kernel config option 2026-02-23 11:18:48 -08:00
debug kgdb: update outdated references to kgdb_wait() 2026-04-21 16:41:54 +01:00
dma memblock: updates for 7.0-rc1 2026-04-18 11:29:14 -07:00
entry arm64 updates for 7.1: 2026-04-14 16:48:56 -07:00
events mm.git review status for linus..mm-stable 2026-04-15 12:59:16 -07:00
futex Locking updates for v7.1: 2026-04-14 12:36:25 -07:00
gcov Convert more 'alloc_obj' cases to default GFP_KERNEL arguments 2026-02-21 20:03:00 -08:00
irq genirq/chip: Invoke add_interrupt_randomness() in handle_percpu_devid_irq() 2026-04-02 23:03:29 +02:00
kcsan kcsan: test: Adjust "expect" allocation type for kmalloc_obj 2026-02-26 09:54:08 -08:00
livepatch Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
liveupdate mm.git review status for linus..mm-stable 2026-04-19 08:01:17 -07:00
locking locking/mutex: Fix ww_mutex wait_list operations 2026-04-23 10:05:49 +02:00
module module: Simplify warning on positive returns from module_init() 2026-04-04 00:04:48 +00:00
power Merge branches 'pm-cpuidle', 'pm-opp' and 'pm-sleep' 2026-04-10 12:37:27 +02:00
printk Merge branch 'rework/prb-fixes' into for-linus 2026-04-20 13:42:01 +02:00
rcu RCU changes for v7.1 2026-04-13 09:36:45 -07:00
sched tracing updates for v7.1: 2026-04-17 09:43:12 -07:00
time timers/migration: Fix another hotplug activation race 2026-05-06 08:21:12 +02:00
trace ring-buffer fix for 7.1: 2026-04-24 15:17:23 -07:00
unwind Convert more 'alloc_obj' cases to default GFP_KERNEL arguments 2026-02-21 20:03:00 -08:00
.gitignore kheaders: rebuild kheaders_data.tar.xz when a file is modified within a minute 2025-06-24 20:30:37 +09:00
acct.c vfs-7.1-rc1.misc 2026-04-13 14:20:11 -07:00
async.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
audit_fsnotify.c audit: widen ino fields to u64 2026-03-06 14:31:26 +01:00
audit_tree.c Convert 'alloc_flex' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
audit_watch.c audit: widen ino fields to u64 2026-03-06 14:31:26 +01:00
audit.c audit: handle unknown status requests in audit_receive_msg() 2026-03-10 15:22:43 -04:00
audit.h audit: widen ino fields to u64 2026-03-06 14:31:26 +01:00
auditfilter.c audit: fix coding style issues 2026-03-05 22:16:08 -05:00
auditsc.c audit: widen ino fields to u64 2026-03-06 14:31:26 +01:00
backtracetest.c
bounds.c x86/asm: Remove ANNOTATE_DATA_SPECIAL usage 2025-12-03 16:53:19 +01:00
capability.c
cfi.c cfi: Move BPF CFI types and helpers to generic code 2025-07-31 18:23:53 -07:00
compat.c
configs.c
context_tracking.c context_tracking: Remove rcu_task_trace_heavyweight_{enter,exit}() 2026-01-01 16:39:46 +08:00
cpu_pm.c syscore: Pass context data to callbacks 2025-11-14 10:01:52 +01:00
cpu.c SPDX updates for 7.0-rc1 2026-02-17 09:46:03 -08:00
crash_core_test.c crash: add KUnit tests for crash_exclude_mem_range 2025-09-13 17:32:55 -07:00
crash_core.c kernel/crash: remove inclusion of crypto/sha1.h 2026-03-27 21:19:46 -07:00
crash_dump_dm_crypt.c crash_dump/dm-crypt: don't print in arch-specific code 2026-04-02 23:36:24 -07:00
crash_reserve.c kernel/crash: remove inclusion of crypto/sha1.h 2026-03-27 21:19:46 -07:00
cred.c cred: remove unused set_security_override_from_ctx() 2026-01-06 20:52:57 -05:00
delayacct.c delayacct: fix uapi timespec64 definition 2026-02-08 00:13:32 -08:00
dma.c
elfcorehdr.c
exec_domain.c
exit.c mm.git review status for linus..mm-nonmm-stable 2026-04-16 20:11:56 -07:00
exit.h
extable.c
fail_function.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
fork.c mm.git review status for linus..mm-nonmm-stable 2026-04-16 20:11:56 -07:00
freezer.c freezer: Clarify that only cgroup1 freezer uses PM freezer 2025-10-30 20:10:27 +01:00
gen_kheaders.sh kheaders: make it possible to override TAR 2025-08-06 10:23:36 +09:00
groups.c treewide: Replace kmalloc with kmalloc_obj for non-scalar types 2026-02-21 01:02:28 -08:00
hung_task.c hung_task: explicitly report I/O wait state in log output 2026-03-27 21:19:40 -07:00
iomem.c
irq_work.c kernel: Use trace_call__##name() at guarded tracepoint call sites 2026-03-26 08:28:49 -04:00
jump_label.c jump_label: use ATOMIC_INIT() for initialization of .enabled 2026-03-16 13:16:48 +01:00
kallsyms_internal.h kallsyms: Get rid of kallsyms relative base 2026-01-22 15:58:22 -07:00
kallsyms_selftest.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
kallsyms_selftest.h
kallsyms.c mm.git review status for linus..mm-nonmm-stable 2026-02-12 12:13:01 -08:00
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.kexec liveupdate: kho: move to kernel/liveupdate 2025-11-27 14:24:33 -08:00
Kconfig.locks
Kconfig.preempt sched: Further restrict the preemption modes 2026-01-08 12:43:57 +01:00
kcov.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
kexec_core.c kernel/kexec: remove inclusion of crypto/hash.h 2026-03-27 21:19:46 -07:00
kexec_elf.c
kexec_file.c kexec: derive purgatory entry from symbol 2026-01-31 16:16:07 -08:00
kexec_internal.h kexec: enable CMA based contiguous allocation 2025-08-02 12:01:38 -07:00
kexec.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
kheaders.c
kprobes.c kprobes: Remove unneeded warnings from __arm_kprobe_ftrace() 2026-03-13 23:15:26 +09:00
kstack_erase.c sysctl: remove __user qualifier from stack_erasing_sysctl buffer argument 2025-11-27 15:44:53 +01:00
ksyms_common.c
ksysfs.c kernel: ksysfs: initialize kernel_kobj earlier 2026-04-03 19:39:52 +02:00
kthread.c kthread: consolidate kthread exit paths to prevent use-after-free 2026-02-26 10:45:49 +01:00
latencytop.c
Makefile kcov: Enable context analysis 2026-01-05 16:43:34 +01:00
module_signature.c module: Give 'enum pkey_id_type' a more specific name 2026-03-24 21:42:37 +00:00
notifier.c
nscommon.c nsfs: tighten permission checks for ns iteration ioctls 2026-02-27 22:00:08 +01:00
nsproxy.c vfs-7.1-rc1.mount.v2 2026-04-14 19:59:25 -07:00
nstree.c nstree: tighten permission checks for listing 2026-02-27 22:00:11 +01:00
padata.c padata: Put CPU offline callback in ONLINE section to allow failure 2026-03-22 11:17:59 +09:00
panic.c kernel/panic: mark init_taint_buf as __initdata and panic instead of warning in alloc_taint_buf() 2026-03-27 21:19:33 -07:00
params.c module: Clean up parse_args() arguments 2026-03-18 21:43:18 +00:00
pid_namespace.c pid_namespace: allow opening pid_for_children before init was created 2026-03-20 14:44:26 +01:00
pid_sysctl.h
pid.c mm.git review status for linus..mm-nonmm-stable 2026-04-16 20:11:56 -07:00
profile.c
ptrace.c clone: add CLONE_AUTOREAP 2026-03-11 23:14:02 +01:00
range.c
reboot.c treewide: Replace kmalloc with kmalloc_obj for non-scalar types 2026-02-21 01:02:28 -08:00
regset.c
relay.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
resource_kunit.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
resource.c PCI: Align head space better 2026-03-27 10:19:08 -05:00
rseq.c rseq: slice ext: Ensure rseq feature size differs from original rseq size 2026-02-23 11:19:19 +01:00
scftorture.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
scs.c scs: fix a wrong parameter in __scs_magic 2025-11-12 10:00:13 -08:00
seccomp.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
signal.c mm.git review status for linus..mm-nonmm-stable 2026-04-16 20:11:56 -07:00
smp.c tracing updates for v7.1: 2026-04-17 09:43:12 -07:00
smpboot.c
smpboot.h
softirq.c softirq: Prepare for deferred hrtimer rearming 2026-02-27 16:40:13 +01:00
stacktrace.c
static_call_inline.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
static_call.c
stop_machine.c sched/core: Fix migrate_swap() vs. hotplug 2025-07-01 15:02:03 +02:00
sys_ni.c rseq: Implement sys_rseq_slice_yield() 2026-01-22 11:11:17 +01:00
sys.c prctl: cfi: change the branch landing pad prctl()s to be more descriptive 2026-04-04 18:40:58 -06:00
sysctl-test.c
sysctl.c sysctl: fix uninitialized variable in proc_do_large_bitmap 2026-03-26 09:32:19 +01:00
task_work.c task_work: Fix NMI race condition 2025-10-29 10:29:54 +01:00
taskstats.c taskstats: set version in TGID exit notifications 2026-04-15 02:15:02 -07:00
torture.c torture: Avoid modulo-zero error in torture_hrtimeout_ns() 2026-03-30 15:48:14 -04:00
tracepoint.c tracepoint: balance regfunc() on func_add() failure in tracepoint_add_func() 2026-04-14 05:17:02 -04:00
tsacct.c tsacct: skip all kernel threads 2026-01-26 19:07:13 -08:00
ucount.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
uid16.c
uid16.h
umh.c treewide: Replace kmalloc with kmalloc_obj for non-scalar types 2026-02-21 01:02:28 -08:00
up.c
user_namespace.c Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL uses 2026-02-22 08:26:33 -08:00
user-return-notifier.c
user.c ns: drop custom reference count initialization for initial namespaces 2025-11-11 10:01:32 +01:00
utsname_sysctl.c
utsname.c namespace-6.18-rc1 2025-09-29 11:20:29 -07:00
vhost_task.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
vmcore_info.c mm.git review status for linus..mm-nonmm-stable 2026-04-16 20:11:56 -07:00
watch_queue.c Convert 'alloc_flex' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
watchdog_buddy.c watchdog/hardlockup: improve buddy system detection timeliness 2026-03-27 21:19:47 -07:00
watchdog_perf.c watchdog/hardlockup: simplify perf event probe and remove per-cpu dependency 2026-02-08 00:13:35 -08:00
watchdog.c watchdog/hardlockup: improve buddy system detection timeliness 2026-03-27 21:19:47 -07:00
workqueue_internal.h workqueue: Show in-flight work item duration in stall diagnostics 2026-03-05 07:27:48 -10:00
workqueue.c workqueue: Changes for v7.1 2026-04-15 10:32:08 -07:00