linux/kernel/rcu
Frederic Weisbecker 55d4669ef1 rcu: Fix rcu_barrier() VS post CPUHP_TEARDOWN_CPU invocation
When rcu_barrier() calls rcu_rdp_cpu_online() and observes that a CPU's
bit is cleared in rnp->qsmaskinitnext, all accesses from the offline CPU
preceding CPUHP_TEARDOWN_CPU are visible to rcu_barrier(), including
callback expiration and counter updates.

However, interrupts can still fire after stop_machine() re-enables
interrupts and before rcutree_report_cpu_dead(). Accesses performed
between CPUHP_TEARDOWN_CPU and the clearing of rnp->qsmaskinitnext
are _NOT_ guaranteed to be seen by rcu_barrier() without proper
ordering. This matters most when all remaining callbacks are invoked in
that window, because the emptied list makes rcutree_migrate_callbacks()
bypass barrier_lock.

The following theoretical race example can make rcu_barrier() hang:

CPU 0                                               CPU 1
-----                                               -----
//cpu_down()
smpboot_park_threads()
//ksoftirqd is parked now
<IRQ>
rcu_sched_clock_irq()
   invoke_rcu_core()
do_softirq()
   rcu_core()
      rcu_do_batch()
         // callback storm
         // rcu_do_batch() returns
         // before completing all
         // of them
   // do_softirq also returns early because of
   // timeout. It defers to ksoftirqd but
   // it's parked
</IRQ>
stop_machine()
   take_cpu_down()
                                                    rcu_barrier()
                                                        spin_lock(barrier_lock)
                                                        // observes rcu_segcblist_n_cbs(&rdp->cblist) != 0
<IRQ>
do_softirq()
   rcu_core()
      rcu_do_batch()
         //completes all pending callbacks
         //smp_mb() implied _after_ callback number dec
</IRQ>

rcutree_report_cpu_dead()
   rnp->qsmaskinitnext &= ~rdp->grpmask;

rcutree_migrate_callbacks()
   // no callback, early return without locking
   // barrier_lock
                                                        //observes !rcu_rdp_cpu_online(rdp)
                                                        rcu_barrier_entrain()
                                                           rcu_segcblist_entrain()
                                                              // Observe rcu_segcblist_n_cbs(rsclp) == 0
                                                              // because no barrier between reading
                                                              // rnp->qsmaskinitnext and rsclp->len
                                                              rcu_segcblist_add_len()
                                                                 smp_mb__before_atomic()
                                                                 // will now observe the 0 count and empty
                                                                 // list, but too late, we enqueue regardless
                                                                 WRITE_ONCE(rsclp->len, rsclp->len + v);
                                                        // ignored barrier callback
                                                        // rcu barrier stall...

This could be solved with a read memory barrier, enforcing the message
passing between rnp->qsmaskinitnext and rsclp->len, matching the full
memory barrier after rsclp->len addition in rcu_segcblist_add_len()
performed at the end of rcu_do_batch().
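
The message-passing pattern described above can be sketched in userspace
with C11 atomics. This is only an illustrative model, not kernel code:
model_len and model_online are hypothetical stand-ins for rsclp->len and
the CPU's bit in rnp->qsmaskinitnext, and the fences only approximate
smp_mb() and smp_rmb().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for rsclp->len and the qsmaskinitnext bit. */
static atomic_long model_len = 3;
static atomic_bool model_online = true;

/* Outgoing CPU: finish all callbacks, then get marked offline. */
static void model_cpu_goes_offline(void)
{
	atomic_store_explicit(&model_len, 0, memory_order_relaxed);
	/* Models the full barrier that follows the length update. */
	atomic_thread_fence(memory_order_seq_cst);
	atomic_store_explicit(&model_online, false, memory_order_relaxed);
}

/*
 * Barrier side: returns -1 while the CPU still looks online, otherwise
 * the callback count read after the offline state was observed.
 */
static long model_barrier_reads_len(void)
{
	if (atomic_load_explicit(&model_online, memory_order_relaxed))
		return -1;
	/* Models the read barrier between the two loads. */
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&model_len, memory_order_relaxed);
}
```

With the two fences paired this way, a reader that sees the CPU offline
also sees the zero length stored before it. Without the reader-side
fence, the second load could return a stale nonzero count.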

However, rcu_barrier() is already complicated enough and probably
doesn't need more subtleties. CPU down is a slowpath and barrier_lock
is seldom contended. Solve the issue by unconditionally taking
barrier_lock in rcutree_migrate_callbacks(). This ensures that either
rcu_barrier() sees the empty queue or its entrained callback will be
migrated.
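
The locking approach can be illustrated with a minimal userspace model.
Everything here is an assumption made for illustration: struct
model_cpu, model_migrate_callbacks() and model_barrier_one_cpu() are
hypothetical names, a pthread mutex stands in for barrier_lock, and the
model only covers the offline-CPU path of rcu_barrier(). It is not the
actual change to tree.c.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical per-CPU state, not the kernel's struct rcu_data. */
struct model_cpu {
	long n_cbs;	/* models rcu_segcblist_n_cbs(&rdp->cblist) */
	bool online;	/* models rcu_rdp_cpu_online(rdp) */
};

static pthread_mutex_t barrier_lock = PTHREAD_MUTEX_INITIALIZER;
static long migrated_cbs;	/* callbacks adopted by a surviving CPU */

/* Models the fixed migration path: emptiness is checked under the lock. */
static void model_migrate_callbacks(struct model_cpu *dead)
{
	pthread_mutex_lock(&barrier_lock);
	if (dead->n_cbs != 0) {
		migrated_cbs += dead->n_cbs;
		dead->n_cbs = 0;
	}
	pthread_mutex_unlock(&barrier_lock);
}

/* Models rcu_barrier() visiting one offline CPU; true if it entrained. */
static bool model_barrier_one_cpu(struct model_cpu *rdp)
{
	bool entrained = false;

	pthread_mutex_lock(&barrier_lock);
	if (rdp->n_cbs != 0 && !rdp->online) {
		rdp->n_cbs++;	/* queue the barrier callback */
		entrained = true;
	}
	pthread_mutex_unlock(&barrier_lock);
	return entrained;
}
```

Because both paths serialize on the same lock, only two orderings are
possible: migration runs first and the barrier side sees an empty
queue, or the barrier side entrains first and migration then carries
its callback along.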

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2024-07-04 13:48:57 -07:00
Kconfig rcu: Create NEED_TASKS_RCU to factor out enablement logic 2024-04-15 11:29:48 +02:00
Kconfig.debug rcu: Restrict access to RCU CPU stall notifiers 2023-12-12 02:31:22 +05:30
Makefile rcuperf: Change rcuperf to rcuscale 2020-08-24 18:39:24 -07:00
rcu_segcblist.c rcu: Use rcu_segcblist_segempty() instead of open coding it 2023-10-04 17:33:18 +02:00
rcu_segcblist.h rcu: Throttle callback invocation based on number of ready callbacks 2023-01-03 17:28:34 -08:00
rcu.h rcutorture: Make rcutorture support print rcu-tasks gp state 2024-04-16 11:16:35 +02:00
rcuscale.c rcu: Rename jiffies_till_flush to jiffies_lazy_flush 2024-02-14 08:00:57 -08:00
rcutorture.c rcutorture: Use rcu_gp_slow_register/unregister() only for rcutype test 2024-04-16 11:16:36 +02:00
refscale.c refscale: Print out additional module parameters 2023-09-11 23:02:18 +02:00
srcutiny.c srcu: Make Tiny SRCU explicitly disable preemption 2024-04-15 11:29:48 +02:00
srcutree.c srcu: Disable interrupts directly in srcu_gp_end() 2024-06-18 10:00:48 -07:00
sync.c rcu: Eliminate lockless accesses to rcu_sync->gp_count 2024-07-04 13:48:57 -07:00
tasks.h Merge branches 'fixes.2024.04.15a', 'misc.2024.04.12a', 'rcu-sync-normal-improve.2024.04.15a', 'rcu-tasks.2024.04.15a' and 'rcutorture.2024.04.15a' into rcu-merge.2024.04.15a 2024-05-01 13:04:02 +02:00
tiny.c rcu: Make Tiny RCU explicitly disable preemption 2024-04-15 11:29:48 +02:00
tree_exp.h rcu: Reduce synchronize_rcu() latency 2024-04-15 19:47:49 +02:00
tree_nocb.h Merge branches 'rcu-doc.2024.02.14a', 'rcu-nocb.2024.02.14a', 'rcu-exp.2024.02.14a', 'rcu-tasks.2024.02.26a' and 'rcu-misc.2024.02.14a' into rcu.2024.02.26a 2024-02-26 17:37:25 -08:00
tree_plugin.h rcu: Add rcutree.nohz_full_patience_delay to reduce nohz_full OS jitter 2024-07-04 13:47:39 -07:00
tree_stall.h rcu: Fix buffer overflow in print_cpu_stall_info() 2024-04-15 19:43:50 +02:00
tree.c rcu: Fix rcu_barrier() VS post CPUHP_TEARDOWN_CPU invocation 2024-07-04 13:48:57 -07:00
tree.h rcu/tree: Reduce wake up for synchronize_rcu() common case 2024-06-18 09:59:40 -07:00
update.c rcu-tasks: Make Tasks RCU wait idly for grace-period delays 2024-04-09 15:11:49 +02:00