mirror of
https://github.com/torvalds/linux.git
On systems where many CPUs share one LLC, unbound workqueues using
WQ_AFFN_CACHE collapse to a single worker pool, causing heavy spinlock
contention on pool->lock. For example, Chuck Lever measured 39% of
cycles lost to native_queued_spin_lock_slowpath on a 12-core shared-L3
NFS-over-RDMA system. The existing affinity hierarchy (cpu, smt, cache,
numa, system) offers no intermediate option between per-LLC and
per-SMT-core granularity.

Add WQ_AFFN_CACHE_SHARD, which subdivides each LLC into groups of at
most wq_cache_shard_size cores (default 8, tunable via boot parameter).
Shards are always split on core (SMT group) boundaries so that
Hyper-Threading siblings are never placed in different pods. Cores are
distributed across shards as evenly as possible -- for example, 36
cores in a single LLC with max shard size 8 produce 5 shards of
8+7+7+7+7 cores.

The implementation follows the same comparator pattern as the other
affinity scopes: precompute_cache_shard_ids() pre-fills the
cpu_shard_id[] array from the already-initialized WQ_AFFN_CACHE and
WQ_AFFN_SMT topology, and cpus_share_cache_shard() is passed to
init_pod_type().

Benchmarks on NVIDIA Grace (72 CPUs, single LLC, 50k items/thread) show
cache_shard delivers ~5x the throughput and ~6.5x lower p50 latency
than the cache scope on this single-LLC system.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>