arm64: Use load LSE atomics for the non-return per-CPU atomic operations

The non-return per-CPU this_cpu_*() atomic operations are implemented as
STADD/STCLR/STSET when FEAT_LSE is available. On many microarchitecture
implementations, these instructions tend to be executed "far" in the
interconnect or memory subsystem (unless the data is already in the L1
cache). This is in general more efficient when there is contention as it
avoids bouncing cache lines between CPUs. The load atomics (e.g. LDADD
without XZR as destination), OTOH, tend to be executed "near" with the
data loaded into the L1 cache.

STADD instructions executed back to back, as in srcu_read_{lock,unlock}*(),
incur additional overhead due to the default posting behaviour on several
CPU implementations. Since the per-CPU atomics are unlikely to be used
concurrently on the same memory location, encourage the hardware to
execute them "near" by issuing load atomics - LDADD/LDCLR/LDSET - with
the destination register unused (but not XZR).
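
As a rough illustration of the idea above, the following standalone sketch
issues LDADD with a real (but ignored) destination register. The helper name
percpu_add_near() is hypothetical and is not a kernel API; the sketch assumes
a GCC- or Clang-compatible compiler, uses the ACLE macro
__ARM_FEATURE_ATOMICS to detect LSE at build time, and falls back to the
compiler's __atomic_fetch_add() builtin elsewhere. The kernel itself selects
between LL/SC and LSE at runtime via alternatives, which is not modelled here.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel API): add val to *ptr and discard
 * the old value.
 */
static inline void percpu_add_near(uint64_t *ptr, uint64_t val)
{
#if defined(__aarch64__) && defined(__ARM_FEATURE_ATOMICS)
	uint64_t tmp;	/* receives the old value; intentionally unused */

	/*
	 * LDADD with a general-purpose destination register rather than
	 * XZR, i.e. not the STADD alias.
	 */
	asm volatile("ldadd	%[val], %[tmp], %[ptr]"
		     : [tmp] "=&r" (tmp), [ptr] "+Q" (*ptr)
		     : [val] "r" (val));
	(void)tmp;
#else
	/* Portable fallback so the sketch also builds without LSE. */
	(void)__atomic_fetch_add(ptr, val, __ATOMIC_RELAXED);
#endif
}
```

Whether the hardware actually executes the operation "near" remains
implementation-dependent; the sketch only shows the instruction form.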

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/e7d539ed-ced0-4b96-8ecd-048a5b803b85@paulmck-laptop
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Palmer Dabbelt <palmer@dabbelt.com>
[will: Add comment and link to the discussion thread]
Signed-off-by: Will Deacon <will@kernel.org>
Catalin Marinas 2025-11-06 15:52:13 +00:00 committed by Will Deacon
parent b98c94eed4
commit 535fdfc5a2

@@ -77,7 +77,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val) \
 " stxr" #sfx "\t%w[loop], %" #w "[tmp], %[ptr]\n" \
 " cbnz %w[loop], 1b", \
 /* LSE atomics */ \
- #op_lse "\t%" #w "[val], %[ptr]\n" \
+ #op_lse "\t%" #w "[val], %" #w "[tmp], %[ptr]\n" \
 __nops(3)) \
 : [loop] "=&r" (loop), [tmp] "=&r" (tmp), \
 [ptr] "+Q"(*(u##sz *)ptr) \
@@ -124,9 +124,16 @@ PERCPU_RW_OPS(8)
 PERCPU_RW_OPS(16)
 PERCPU_RW_OPS(32)
 PERCPU_RW_OPS(64)
-PERCPU_OP(add, add, stadd)
-PERCPU_OP(andnot, bic, stclr)
-PERCPU_OP(or, orr, stset)
+
+/*
+ * Use value-returning atomics for CPU-local ops as they are more likely
+ * to execute "near" to the CPU (e.g. in L1$).
+ *
+ * https://lore.kernel.org/r/e7d539ed-ced0-4b96-8ecd-048a5b803b85@paulmck-laptop
+ */
+PERCPU_OP(add, add, ldadd)
+PERCPU_OP(andnot, bic, ldclr)
+PERCPU_OP(or, orr, ldset)
 PERCPU_RET_OP(add, add, ldadd)
 
 #undef PERCPU_RW_OPS