arm64: Use load LSE atomics for the non-return per-CPU atomic operations

The non-return per-CPU this_cpu_*() atomic operations are implemented as
STADD/STCLR/STSET when FEAT_LSE is available. On many microarchitecture
implementations, these instructions tend to be executed "far" in the
interconnect or memory subsystem (unless the data is already in the L1
cache). This is in general more efficient when there is contention as it
avoids bouncing cache lines between CPUs. The load atomics (e.g. LDADD
without XZR as destination), OTOH, tend to be executed "near" with the
data loaded into the L1 cache.

STADD instructions executed back to back, as in
srcu_read_{lock,unlock}*(), incur an additional overhead due to the
default posting behaviour on several CPU implementations. Since the
per-CPU atomics are unlikely to be used concurrently on the same memory
location, encourage the hardware to execute them "near" by issuing load
atomics - LDADD/LDCLR/LDSET - with the destination register unused (but
not XZR).

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/e7d539ed-ced0-4b96-8ecd-048a5b803b85@paulmck-laptop
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Palmer Dabbelt <palmer@dabbelt.com>
[will: Add comment and link to the discussion thread]
Signed-off-by: Will Deacon <will@kernel.org>
2026-05-30 18:13:41 +02:00 · 2025-11-06 15:52:13 +00:00
commit 535fdfc5a2
parent b98c94eed4
1 changed file with 11 additions and 4 deletions
--- a/arch/arm64/include/asm/percpu.h
+++ b/arch/arm64/include/asm/percpu.h
@@ -77,7 +77,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val)		\
 	"	stxr" #sfx "\t%w[loop], %" #w "[tmp], %[ptr]\n"		\
 	"	cbnz	%w[loop], 1b",					\
 	/* LSE atomics */						\
-		#op_lse "\t%" #w "[val], %[ptr]\n"			\
+		#op_lse "\t%" #w "[val], %" #w "[tmp], %[ptr]\n"	\
 		__nops(3))						\
 	: [loop] "=&r" (loop), [tmp] "=&r" (tmp),			\
 	  [ptr] "+Q"(*(u##sz *)ptr)					\
@@ -124,9 +124,16 @@ PERCPU_RW_OPS(8)
 PERCPU_RW_OPS(16)
 PERCPU_RW_OPS(32)
 PERCPU_RW_OPS(64)
-PERCPU_OP(add, add, stadd)
-PERCPU_OP(andnot, bic, stclr)
-PERCPU_OP(or, orr, stset)
+
+/*
+ * Use value-returning atomics for CPU-local ops as they are more likely
+ * to execute "near" to the CPU (e.g. in L1$).
+ *
+ * https://lore.kernel.org/r/e7d539ed-ced0-4b96-8ecd-048a5b803b85@paulmck-laptop
+ */
+PERCPU_OP(add, add, ldadd)
+PERCPU_OP(andnot, bic, ldclr)
+PERCPU_OP(or, orr, ldset)
 PERCPU_RET_OP(add, add, ldadd)
 
 #undef PERCPU_RW_OPS
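
For readers who want to see the idea outside the PERCPU_OP() macro, the following is a minimal userspace sketch of the technique the commit message describes: a non-return add that is issued as LDADD with a scratch destination register instead of STADD. The names sketch_add_noreturn_u32() and sketch_demo() are hypothetical and do not exist in the kernel. The sketch assumes a GCC- or Clang-compatible compiler; the LSE path is only built when the compiler targets AArch64 with LSE enabled (it checks __ARM_FEATURE_ATOMICS), and a plain atomic builtin is used elsewhere so the example still builds. Unlike the kernel code, it has no LL/SC alternative and no runtime patching.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not the kernel's __percpu_add_case_32():
 * add `val` to *ptr and discard the previous value.
 */
static inline void sketch_add_noreturn_u32(uint32_t *ptr, uint32_t val)
{
#if defined(__aarch64__) && defined(__ARM_FEATURE_ATOMICS)
	uint32_t tmp;	/* receives the old value; deliberately unused */

	/*
	 * Load form of the atomic add. The destination is a real
	 * register rather than WZR, so the instruction is LDADD and
	 * not its STADD alias.
	 */
	__asm__ __volatile__("ldadd %w[val], %w[tmp], %[ptr]"
			     : [tmp] "=&r" (tmp), [ptr] "+Q" (*ptr)
			     : [val] "r" (val));
#else
	/* Fallback (an assumption of this sketch) for other targets. */
	(void)__atomic_fetch_add(ptr, val, __ATOMIC_RELAXED);
#endif
}

/* Small usage example: two back-to-back updates of one counter. */
static inline uint32_t sketch_demo(void)
{
	uint32_t counter = 0;

	sketch_add_noreturn_u32(&counter, 5);
	sketch_add_noreturn_u32(&counter, 1);
	return counter;
}
```

The result is the same whichever instruction is used; per the commit message, the difference is only in where the hardware tends to perform the update, which is why the patch changes the mnemonic and adds the [tmp] operand without touching callers.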