linux

mirror of https://github.com/torvalds/linux.git synced 2026-06-08 14:42:37 +02:00

Marco Elver 0912037fec perf/hw_breakpoint: Reduce contention with large number of tasks

While optimizing task_bp_pinned()'s runtime complexity to O(1) on average helps reduce time spent in the critical section, we still suffer due to serializing everything via 'nr_bp_mutex'. Indeed, a profile shows that now contention is the biggest issue:

    95.93%  [kernel]  [k] osq_lock
     0.70%  [kernel]  [k] mutex_spin_on_owner
     0.22%  [kernel]  [k] smp_cfm_core_cond
     0.18%  [kernel]  [k] task_bp_pinned
     0.18%  [kernel]  [k] rhashtable_jhash2
     0.15%  [kernel]  [k] queued_spin_lock_slowpath

when running the breakpoint benchmark with (system with 256 CPUs):

 | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
 |      Total time: 0.207 [sec]
 |
 |      108.267188 usecs/op
 |     6929.100000 usecs/op/cpu

The main concern for synchronizing the breakpoint constraints data is that a consistent snapshot of the per-CPU and per-task data is observed.

The access pattern is as follows:

 1. If the target is a task: the task's pinned breakpoints are counted, checked for space, and then appended to; only bp_cpuinfo::cpu_pinned is used to check for conflicts with CPU-only breakpoints; bp_cpuinfo::tsk_pinned are incremented/decremented, but otherwise unused.

 2. If the target is a CPU: bp_cpuinfo::cpu_pinned are counted, along with bp_cpuinfo::tsk_pinned; after a successful check, cpu_pinned is incremented. No per-task breakpoints are checked.

Since rhltable safely synchronizes insertions/deletions, we can allow concurrency as follows:

 1. If the target is a task: independent tasks may update and check the constraints concurrently, but same-task target calls need to be serialized; since bp_cpuinfo::tsk_pinned is only updated, but not checked, these modifications can happen concurrently by switching tsk_pinned to atomic_t.

 2. If the target is a CPU: access to the per-CPU constraints needs to be serialized with other CPU-target and task-target callers (to stabilize the bp_cpuinfo::tsk_pinned snapshot).

We can allow the above concurrency by introducing a per-CPU constraints data reader-writer lock (bp_cpuinfo_sem), and per-task mutexes (reuses task_struct::perf_event_mutex):

 1. If the target is a task: acquires perf_event_mutex, and acquires bp_cpuinfo_sem as a reader. The choice of percpu-rwsem minimizes contention in the presence of many read-lock but few write-lock acquisitions: we assume many orders of magnitude more task target breakpoints creations/destructions than CPU target breakpoints.

 2. If the target is a CPU: acquires bp_cpuinfo_sem as a writer.

With these changes, contention with thousands of tasks is reduced to the point where waiting on locking no longer dominates the profile:

 | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
 |      Total time: 0.077 [sec]
 |
 |       40.201563 usecs/op
 |     2572.900000 usecs/op/cpu

    21.54%  [kernel]  [k] task_bp_pinned
    20.18%  [kernel]  [k] rhashtable_jhash2
     6.81%  [kernel]  [k] toggle_bp_slot
     5.47%  [kernel]  [k] queued_spin_lock_slowpath
     3.75%  [kernel]  [k] smp_cfm_core_cond
     3.48%  [kernel]  [k] bcmp

On this particular setup that's a speedup of 2.7x.

We're also getting closer to the theoretical ideal performance through optimizations in hw_breakpoint.c -- constraints accounting disabled:

 | perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
 |      Total time: 0.067 [sec]
 |
 |       35.286458 usecs/op
 |     2258.333333 usecs/op/cpu

Which means the current implementation is ~12% slower than the theoretical ideal.

For reference, performance without any breakpoints:

 | $> bench -r 30 breakpoint thread -b 0 -p 64 -t 64
 | # Running 'breakpoint/thread' benchmark:
 | # Created/joined 30 threads with 0 breakpoints and 64 parallelism
 |      Total time: 0.060 [sec]
 |
 |       31.365625 usecs/op
 |     2007.400000 usecs/op/cpu

On a system with 256 CPUs, the theoretical ideal is only ~12% slower than no breakpoints at all; the current implementation is ~28% slower.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20220829124719.675715-12-elver@google.com

2022-08-30 10:56:24 +02:00
..
bpf	net: Fix suspicious RCU usage in bpf_sk_reuseport_detach()	2022-08-17 16:42:59 -07:00
cgroup	Various fixes: a deadline scheduler fix, a migration fix, a Sparse fix and a comment fix.	2022-08-06 17:34:06 -07:00
configs	xen: branch for v6.0-rc1b	2022-08-14 09:28:54 -07:00
debug	Modules updates for v5.19-rc1	2022-05-26 17:13:43 -07:00
dma	remoteproc updates for v5.20	2022-08-08 15:16:29 -07:00
entry	context_tracking: Take NMI eqs entrypoints over RCU	2022-07-05 13:32:59 -07:00
events	perf/hw_breakpoint: Reduce contention with large number of tasks	2022-08-30 10:56:24 +02:00
futex	drm for 5.19-rc1	2022-05-25 16:18:27 -07:00
gcov	gcov: Remove compiler version check	2021-12-02 17:25:21 +09:00
irq	irqchip/genirq updates for 5.20:	2022-07-28 12:36:35 +02:00
kcsan	kcsan: test: Add a .kunitconfig to run KCSAN tests	2022-07-22 09:22:59 -06:00
livepatch	Livepatching changes for 5.19	2022-06-02 08:55:01 -07:00
locking	locking/percpu-rwsem: Add percpu_is_write_locked() and percpu_is_read_locked()	2022-08-30 10:56:23 +02:00
module	Modules updates for 6.0	2022-08-08 14:12:19 -07:00
power	Char / Misc driver changes for 6.0-rc1	2022-08-04 11:05:48 -07:00
printk	printk: do not wait for consoles when suspended	2022-07-15 10:52:11 +02:00
rcu	- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe	2022-08-05 16:32:45 -07:00
sched	Various fixes: a deadline scheduler fix, a migration fix, a Sparse fix and a comment fix.	2022-08-06 17:34:06 -07:00
time	time: Correct the prototype of ns_to_kernel_old_timeval and ns_to_timespec64	2022-08-09 20:02:13 +02:00
trace	Various fixes for tracing:	2022-08-21 14:49:42 -07:00
.gitignore
acct.c	kernel/acct: move acct sysctls to its own file	2022-04-06 13:43:44 -07:00
async.c	Revert "module, async: async_synchronize_full() on module init iff async is used"	2022-02-03 11:20:34 -08:00
audit_fsnotify.c	fsnotify: make allow_dups a property of the group	2022-04-25 14:37:18 +02:00
audit_tree.c	audit: use fsnotify group lock helpers	2022-04-25 14:37:28 +02:00
audit_watch.c	fsnotify: pass flags argument to fsnotify_alloc_group()	2022-04-25 14:37:12 +02:00
audit.c	audit: make is_audit_feature_set() static	2022-06-13 14:08:57 -04:00
audit.h	audit: log AUDIT_TIME_* records only from rules	2022-02-22 13:51:40 -05:00
auditfilter.c	audit/stable-5.17 PR 20220110	2022-01-11 13:08:21 -08:00
auditsc.c	audit, io_uring, io-wq: Fix memory leak in io_sq_thread() and io_wqe_worker()	2022-08-04 08:33:54 -06:00
backtracetest.c
bounds.c
capability.c	xfs: don't generate selinux audit messages for capability testing	2022-03-09 10:32:06 -08:00
cfi.c	context_tracking: Take IRQ eqs entrypoints over RCU	2022-07-05 13:32:59 -07:00
compat.c	arch: remove compat_alloc_user_space	2021-09-08 15:32:35 -07:00
configs.c
context_tracking.c	MAINTAINERS: Add Paul as context tracking maintainer	2022-07-05 13:33:00 -07:00
cpu_pm.c	context_tracking: Take IRQ eqs entrypoints over RCU	2022-07-05 13:32:59 -07:00
cpu.c	Intel Trust Domain Extensions	2022-05-23 17:51:12 -07:00
crash_core.c	kdump: round up the total memory size to 128M for crashkernel reservation	2022-07-17 17:31:40 -07:00
crash_dump.c
cred.c	x86: Mark __invalid_creds() __noreturn	2022-03-15 10:32:44 +01:00
delayacct.c	delayacct: track delays from write-protect copy	2022-06-01 15:55:25 -07:00
dma.c
exec_domain.c
exit.c	exit: Fix typo in comment: s/sub-theads/sub-threads	2022-08-03 10:44:54 +02:00
extable.c	context_tracking: Take NMI eqs entrypoints over RCU	2022-07-05 13:32:59 -07:00
fail_function.c
fork.c	Tracing updates for 5.20 / 6.0	2022-08-05 09:41:12 -07:00
freezer.c
gen_kheaders.sh	kheaders: Have cpio unconditionally replace files	2022-05-08 03:16:59 +09:00
groups.c	security: Add LSM hook to setgroups() syscall	2022-07-15 18:21:49 +00:00
hung_task.c	kernel/hung_task: fix address space of proc_dohung_task_timeout_secs	2022-07-29 18:12:35 -07:00
iomem.c
irq_work.c	irq_work: use kasan_record_aux_stack_noalloc() record callstack	2022-04-15 14:49:55 -07:00
jump_label.c	jump_label: make initial NOP patching the special case	2022-06-24 09:48:55 +02:00
kallsyms_internal.h	kallsyms: move declarations to internal header	2022-07-17 17:31:39 -07:00
kallsyms.c	Updates to various subsystems which I help look after. lib, ocfs2,	2022-08-07 10:03:24 -07:00
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt	Revert "signal, x86: Delay calling signals in atomic on RT enabled kernels"	2022-03-31 10:36:55 +02:00
kcov.c	kcov: update pos before writing pc in trace function	2022-05-25 13:05:42 -07:00
kexec_core.c	kexec: drop weak attribute from functions	2022-07-15 12:21:16 -04:00
kexec_elf.c
kexec_file.c	Updates to various subsystems which I help look after. lib, ocfs2,	2022-08-07 10:03:24 -07:00
kexec_internal.h
kexec.c	kexec: avoid compat_alloc_user_space	2021-09-08 15:32:34 -07:00
kheaders.c
kmod.c
kprobes.c	kprobes: Forbid probing on trampoline and BPF code areas	2022-08-02 11:47:29 +02:00
ksysfs.c	kernel/ksysfs.c: use helper macro __ATTR_RW	2022-03-23 19:00:33 -07:00
kthread.c	kthread: make it clear that kthread_create_on_node() might be terminated by any fatal signal	2022-06-16 19:11:30 -07:00
latencytop.c	latencytop: move sysctl to its own file	2022-04-21 11:40:59 -07:00
Makefile	kernel: remove platform_has() infrastructure	2022-08-01 07:42:56 +02:00
module_signature.c
notifier.c	notifier: Add blocking/atomic_notifier_chain_register_unique_prio()	2022-05-19 19:30:30 +02:00
nsproxy.c	fs/exec: allow to unshare a time namespace on vfork+exec	2022-06-15 07:58:04 -07:00
padata.c	padata: replace cpumask_weight with cpumask_empty in padata.c	2022-01-31 11:21:46 +11:00
panic.c	linux-kselftest-kunit-5.20-rc1	2022-08-02 19:34:45 -07:00
params.c	kobject: remove kset from struct kset_uevent_ops callbacks	2021-12-28 11:26:18 +01:00
pid_namespace.c	kernel: pid_namespace: use NULL instead of using plain integer as pointer	2022-04-29 14:38:00 -07:00
pid.c	pid: add pidfd_get_task() helper	2021-10-14 13:29:18 +02:00
profile.c	profile: setup_profiling_timer() is moslty not implemented	2022-07-29 18:12:36 -07:00
ptrace.c	ptrace: fix clearing of JOBCTL_TRACED in ptrace_unfreeze_traced()	2022-07-09 11:06:19 -07:00
range.c
reboot.c	Merge branch 'rework/kthreads' into for-linus	2022-06-23 19:11:28 +02:00
regset.c
relay.c	relay: remove redundant assignment to pointer buf	2022-05-12 20:38:37 -07:00
resource_kunit.c
resource.c	resource: Introduce alloc_free_mem_region()	2022-07-21 17:19:25 -07:00
rseq.c	rseq: Kill process when unknown flags are encountered in ABI structures	2022-08-01 15:21:42 +02:00
scftorture.c	scftorture: Fix distribution of short handler delays	2022-04-11 17:07:29 -07:00
scs.c	kasan, vmalloc: only tag normal vmalloc allocations	2022-03-24 19:06:48 -07:00
seccomp.c	seccomp: Add wait_killable semantic to seccomp user notifier	2022-05-03 14:11:58 -07:00
signal.c	signal handling: don't use BUG_ON() for debugging	2022-07-07 09:53:43 -07:00
smp.c	locking/csd_lock: Change csdlock_debug from early_param to __setup	2022-07-19 11:40:00 -07:00
smpboot.c	cpu/hotplug: Allow the CPU in CPU_UP_PREPARE state to be brought up again.	2022-04-12 14:13:01 +02:00
smpboot.h
softirq.c	context_tracking: Take IRQ eqs entrypoints over RCU	2022-07-05 13:32:59 -07:00
stackleak.c	stackleak: add on/off stack variants	2022-05-08 01:33:09 -07:00
stacktrace.c	uaccess: remove CONFIG_SET_FS	2022-02-25 09:36:06 +01:00
static_call_inline.c	static_call: Don't make __static_call_return0 static	2022-04-05 09:59:38 +02:00
static_call.c	static_call: Don't make __static_call_return0 static	2022-04-05 09:59:38 +02:00
stop_machine.c	Scheduler changes in this cycle were:	2022-05-24 11:11:13 -07:00
sys_ni.c	mm/mempolicy: wire up syscall set_mempolicy_home_node	2022-01-15 16:30:30 +02:00
sys.c	arm64/sme: Implement vector length configuration prctl()s	2022-04-22 18:50:54 +01:00
sysctl-test.c
sysctl.c	kernel/sysctl.c: Remove trailing white space	2022-08-08 09:01:36 -07:00
task_work.c	task_work: allow TWA_SIGNAL without a rescheduling IPI	2022-04-30 08:39:32 -06:00
taskstats.c	kernel: make taskstats available from all net namespaces	2022-04-29 14:38:03 -07:00
torture.c	torture: Wake up kthreads after storing task_struct pointer	2022-02-01 17:24:39 -08:00
tracepoint.c
tsacct.c	taskstats: version 12 with thread group and exe info	2022-04-29 14:38:03 -07:00
ucount.c	ucounts: Handle wrapping in is_ucounts_overlimit	2022-02-17 09:11:57 -06:00
uid16.c
uid16.h
umh.c	kthread: Don't allocate kthread_struct for init and umh	2022-05-06 14:49:44 -05:00
up.c
user_namespace.c	ucounts: Fix systemd LimitNPROC with private users regression	2022-02-25 10:40:14 -06:00
user-return-notifier.c
user.c	fs/epoll: use a per-cpu counter for user's watches count	2021-09-08 11:50:27 -07:00
usermode_driver.c	blob_to_mnt(): kern_unmount() is needed to undo kern_mount()	2022-05-19 23:25:47 -04:00
utsname_sysctl.c
utsname.c
watch_queue.c	This was a moderately busy cycle for documentation, but nothing all that	2022-08-02 19:24:24 -07:00
watchdog_hld.c	Revert "printk: add functions to prefer direct printing"	2022-06-23 18:41:40 +02:00
watchdog.c	powerpc updates for 6.0	2022-08-06 16:38:17 -07:00
workqueue_internal.h	workqueue: Assign a color to barrier work items	2021-08-17 07:49:10 -10:00
workqueue.c	drm for 5.20/6.0	2022-08-03 19:52:08 -07:00