linux/kernel/time
Frederic Weisbecker bd3c45dd01 timers/migration: Fix another hotplug activation race
The hotplug control CPU is assumed to be active in the hierarchy but
that doesn't imply that the root is active. If the current CPU is not
the one that activated the current hierarchy, and the CPU performing
this duty is still halfway through the tree, the root may still be
observed inactive. And this can break the activation of a new root as in
the following scenario:

1) Initially, the whole system has 64 CPUs and only CPU 63 is awake.

                   [GRP1:0]
                    active
                  /    |    \
                 /     |     \
         [GRP0:0]    [...]    [GRP0:7]
           idle      idle      active
         /   |   \               |
     CPU 0  CPU 1  ...         CPU 63
     idle   idle               active

2) CPU 63 goes idle _but_ due to a #VMEXIT it hasn't yet reached the
   [GRP1:0]->parent dereference (that would be NULL and stop the walk)
   in __walk_groups_from().

                   [GRP1:0]
                     idle
                  /    |    \
                 /     |     \
         [GRP0:0]    [...]    [GRP0:7]
           idle      idle       idle
         /   |   \                |
     CPU 0  CPU 1  ...         CPU 63
     idle   idle                idle

3) CPU 1 wakes up, activates GRP0:0 but didn't yet manage to propagate
   up to GRP1:0 due to yet another #VMEXIT.

                   [GRP1:0]
                     idle
                  /    |    \
                 /     |     \
         [GRP0:0]    [...]    [GRP0:7]
         active      idle       idle
         /   |   \                |
     CPU 0  CPU 1  ...         CPU 63
     idle  active               idle

3) CPU 0 wakes up and doesn't need to walk above GRP0:0 as it's CPU 1
   role.

                   [GRP1:0]
                     idle
                  /    |    \
                 /     |     \
         [GRP0:0]    [...]    [GRP0:7]
         active      idle       idle
         /   |   \                |
     CPU 0  CPU 1  ...         CPU 63
    active  active              idle

4) CPU 0 boots CPU 64. It creates a new root for it.

                             [GRP2:0]
                               idle
                           /          \
                          /            \
                   [GRP1:0]           [GRP1:1]
                   idle                 idle
                  /    |    \                \
                 /     |     \                \
         [GRP0:0]    [...]    [GRP0:7]      [GRP0:8]
         active      idle       idle          idle
         /   |   \                |            |
     CPU 0  CPU 1  ...         CPU 63        CPU 64
    active  active              idle         offline

5) CPU 0 activates the new root, but note that GRP1:0 is still idle,
   waiting for CPU 1 to resume from #VMEXIT and activate it.

                             [GRP2:0]
                              active
                           /          \
                          /            \
                   [GRP1:0]           [GRP1:1]
                   idle                 idle
                  /    |    \                \
                 /     |     \                \
         [GRP0:0]    [...]    [GRP0:7]      [GRP0:8]
         active      idle       idle          idle
         /   |   \                |            |
     CPU 0  CPU 1  ...         CPU 63        CPU 64
    active  active              idle         offline

6) CPU 63 resumes after #VMEXIT and sees the new GRP1:0 parent.
   Therefore it propagates the stale inactive state of GRP1:0 up to
   GRP2:0.

                             [GRP2:0]
                              idle
                           /          \
                          /            \
                   [GRP1:0]           [GRP1:1]
                   idle                 idle
                  /    |    \                \
                 /     |     \                \
         [GRP0:0]    [...]    [GRP0:7]      [GRP0:8]
         active      idle       idle          idle
         /   |   \                |            |
     CPU 0  CPU 1  ...         CPU 63        CPU 64
    active  active              idle         offline

7) CPU 1 resumes after #VMEXIT and finally activates GRP1:0. But it
   doesn't observe its parent link because no ordering enforced that.
   Therefore GRP2:0 is spuriously left idle.

                             [GRP2:0]
                              idle
                           /          \
                          /            \
                   [GRP1:0]           [GRP1:1]
                   active                 idle
                  /    |    \                \
                 /     |     \                \
         [GRP0:0]    [...]    [GRP0:7]      [GRP0:8]
         active      idle       idle          idle
         /   |   \                |            |
     CPU 0  CPU 1  ...         CPU 63        CPU 64
    active  active              idle         offline

Such races are highly theoretical and the problem would solve itself
once the old root ever becomes idle again. But it still leaves a taste
of discomfort.

Fix it with enforcing a fully ordered atomic read of the old root state
before propagating the activate state up to the new root. It has a two
directions ordering effect:

* Acquire + release of the latest old root state: If the hotplug control
  CPU is not the one that woke up the old root, make sure to acquire its
  active state and propagate it upwards through the ordered chain of
  activation (the acquire pairs with the cmpxchg() in tmigr_active_up()
  and subsequent releases will pair with atomic_read_acquire() and
  smp_mb__after_atomic() in tmigr_inactive_up()).

* Release: If the hotplug control CPU is not the one that must wake up
  the old root, but the CPU covering that is lagging behind its duty,
  publish the links from the old root to the new parents. This way the
  lagging CPU will propagate the active state itself.

Fixes: 7ee9887703 ("timers: Implement the hierarchical pull model")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260423165354.95152-2-frederic@kernel.org
2026-05-06 08:21:12 +02:00
..
.kunitconfig time/kunit: Add .kunitconfig 2026-02-24 08:41:51 +01:00
alarmtimer.c Merge branch 'timers/urgent' into timers/core 2026-04-11 07:58:33 +02:00
clockevents.c clockevents: Add missing resets of the next_event_forced flag 2026-04-16 21:22:04 +02:00
clocksource-wdtest.c clocksource: Rewrite watchdog code completely 2026-03-20 13:36:32 +01:00
clocksource.c clocksource: Rewrite watchdog code completely 2026-03-20 13:36:32 +01:00
hrtimer.c Merge branch 'timers/urgent' into timers/core 2026-04-11 07:58:33 +02:00
itimer.c timers/itimer: Avoid direct access to hrtimer clockbase 2025-09-09 12:27:17 +02:00
jiffies.c Linux 7.0-rc4 2026-03-21 08:02:36 +01:00
Kconfig Update to the VDSO subsystem: 2026-04-14 10:53:44 -07:00
Makefile timens: Remove dependency on the vDSO 2026-03-26 15:44:23 +01:00
namespace_internal.h timens: Remove dependency on the vDSO 2026-03-26 15:44:23 +01:00
namespace_vdso.c mm.git review status for linus..mm-stable 2026-04-15 12:59:16 -07:00
namespace.c timens: Use task_lock guard in timens_get*() 2026-04-01 17:13:36 +02:00
ntp_internal.h ntp: Rename __do_adjtimex() to ntp_adjtimex() 2025-06-19 14:28:23 +02:00
ntp.c ntp: Use ktime_get_ntp_seconds() 2025-06-19 14:28:24 +02:00
posix-clock.c Convert 'alloc_obj' family to use the new default GFP_KERNEL argument 2026-02-21 17:09:51 -08:00
posix-cpu-timers.c hrtimer: Store time as ktime_t in restart block 2025-11-14 16:31:19 +01:00
posix-stubs.c
posix-timers.c posix-timers: Fix stale function name in comment 2026-03-28 14:18:13 +01:00
posix-timers.h timekeeping: Add minimal posix-timers support for auxiliary clocks 2025-06-27 20:13:12 +02:00
sched_clock.c time/sched_clock: Use ACCESS_PRIVATE() to evaluate hrtimer::function 2026-01-13 18:08:57 +01:00
sleep_timeout.c treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
test_udelay.c time: Add MODULE_DESCRIPTION() to time test modules 2024-06-03 11:18:50 +02:00
tick-broadcast-hrtimer.c clockevents: Remove redundant CLOCK_EVT_FEAT_KTIME 2026-02-27 16:40:06 +01:00
tick-broadcast.c clockevents: Add missing resets of the next_event_forced flag 2026-04-16 21:22:04 +02:00
tick-common.c clockevents: Prevent timer interrupt starvation 2026-04-10 22:45:38 +02:00
tick-internal.h cpufreq: ondemand: Simplify idle cputime granularity test 2026-01-28 22:24:58 +01:00
tick-legacy.c
tick-oneshot.c treewide: Update email address 2026-01-11 06:09:11 -10:00
tick-sched.c Merge branch 'timers/urgent' into timers/core 2026-04-11 07:58:33 +02:00
tick-sched.h tick/sched: Fix struct tick_sched doc warnings 2024-04-01 10:36:35 +02:00
time_test.c time/kunit: Document handling of negative years of is_leap() 2026-02-02 12:37:54 +01:00
time.c time/jiffies: Mark jiffies_64_to_clock_t() notrace 2026-03-11 10:33:12 +01:00
timeconst.bc
timeconv.c
timecounter.c time/timecounter: Inline timecounter_cyc2time() 2025-12-15 20:16:49 +01:00
timekeeping_debug.c timekeeping: Add percpu counter for tracking floor swap events 2024-10-10 10:20:46 +02:00
timekeeping_internal.h timekeeping: Provide ktime_get_ntp_seconds() 2025-06-19 14:28:23 +02:00
timekeeping.c Linux 7.0-rc4 2026-03-21 08:02:36 +01:00
timekeeping.h timekeeping: Provide infrastructure for coupled clockevents 2026-02-27 16:40:08 +01:00
timer_list.c timer_list: Print offset as signed integer 2026-03-12 12:15:54 +01:00
timer_migration.c timers/migration: Fix another hotplug activation race 2026-05-06 08:21:12 +02:00
timer_migration.h timers/migration: Rename 'online' bit to 'available' 2025-11-20 20:17:31 +01:00
timer.c timers: Get this_cpu once while clearing the idle state 2026-03-24 23:21:30 +01:00
vsyscall.c vdso/vsyscall: Avoid slow division loop in auxiliary clock update 2025-09-03 11:55:11 +02:00