mirror of
https://github.com/torvalds/linux.git
synced 2026-06-08 14:42:37 +02:00
While optimizing task_bp_pinned()'s runtime complexity to O(1) on
average helps reduce time spent in the critical section, we still suffer
due to serializing everything via 'nr_bp_mutex'. Indeed, a profile shows
that now contention is the biggest issue:
95.93% [kernel] [k] osq_lock
0.70% [kernel] [k] mutex_spin_on_owner
0.22% [kernel] [k] smp_cfm_core_cond
0.18% [kernel] [k] task_bp_pinned
0.18% [kernel] [k] rhashtable_jhash2
0.15% [kernel] [k] queued_spin_lock_slowpath
when running the breakpoint benchmark with (system with 256 CPUs):
| $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 30 threads with 4 breakpoints and 64 parallelism
| Total time: 0.207 [sec]
|
| 108.267188 usecs/op
| 6929.100000 usecs/op/cpu
The main concern for synchronizing the breakpoint constraints data is
that a consistent snapshot of the per-CPU and per-task data is observed.
The access pattern is as follows:
1. If the target is a task: the task's pinned breakpoints are counted,
checked for space, and then appended to; only bp_cpuinfo::cpu_pinned
is used to check for conflicts with CPU-only breakpoints;
bp_cpuinfo::tsk_pinned are incremented/decremented, but otherwise
unused.
2. If the target is a CPU: bp_cpuinfo::cpu_pinned are counted, along
with bp_cpuinfo::tsk_pinned; after a successful check, cpu_pinned is
incremented. No per-task breakpoints are checked.
Since rhltable safely synchronizes insertions/deletions, we can allow
concurrency as follows:
1. If the target is a task: independent tasks may update and check the
constraints concurrently, but same-task target calls need to be
serialized; since bp_cpuinfo::tsk_pinned is only updated, but not
checked, these modifications can happen concurrently by switching
tsk_pinned to atomic_t.
2. If the target is a CPU: access to the per-CPU constraints needs to
be serialized with other CPU-target and task-target callers (to
stabilize the bp_cpuinfo::tsk_pinned snapshot).
We can allow the above concurrency by introducing a per-CPU constraints
data reader-writer lock (bp_cpuinfo_sem), and per-task mutexes (reuses
task_struct::perf_event_mutex):
1. If the target is a task: acquires perf_event_mutex, and acquires
bp_cpuinfo_sem as a reader. The choice of percpu-rwsem minimizes
contention in the presence of many read-lock but few write-lock
acquisitions: we assume many orders of magnitude more task target
breakpoints creations/destructions than CPU target breakpoints.
2. If the target is a CPU: acquires bp_cpuinfo_sem as a writer.
With these changes, contention with thousands of tasks is reduced to the
point where waiting on locking no longer dominates the profile:
| $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 30 threads with 4 breakpoints and 64 parallelism
| Total time: 0.077 [sec]
|
| 40.201563 usecs/op
| 2572.900000 usecs/op/cpu
21.54% [kernel] [k] task_bp_pinned
20.18% [kernel] [k] rhashtable_jhash2
6.81% [kernel] [k] toggle_bp_slot
5.47% [kernel] [k] queued_spin_lock_slowpath
3.75% [kernel] [k] smp_cfm_core_cond
3.48% [kernel] [k] bcmp
On this particular setup that's a speedup of 2.7x.
We're also getting closer to the theoretical ideal performance through
optimizations in hw_breakpoint.c -- constraints accounting disabled:
| perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 30 threads with 4 breakpoints and 64 parallelism
| Total time: 0.067 [sec]
|
| 35.286458 usecs/op
| 2258.333333 usecs/op/cpu
Which means the current implementation is ~12% slower than the
theoretical ideal.
For reference, performance without any breakpoints:
| $> bench -r 30 breakpoint thread -b 0 -p 64 -t 64
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 30 threads with 0 breakpoints and 64 parallelism
| Total time: 0.060 [sec]
|
| 31.365625 usecs/op
| 2007.400000 usecs/op/cpu
On a system with 256 CPUs, the theoretical ideal is only ~12% slower
than no breakpoints at all; the current implementation is ~28% slower.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20220829124719.675715-12-elver@google.com
|
||
|---|---|---|
| .. | ||
| bpf | ||
| cgroup | ||
| configs | ||
| debug | ||
| dma | ||
| entry | ||
| events | ||
| futex | ||
| gcov | ||
| irq | ||
| kcsan | ||
| livepatch | ||
| locking | ||
| module | ||
| power | ||
| printk | ||
| rcu | ||
| sched | ||
| time | ||
| trace | ||
| .gitignore | ||
| acct.c | ||
| async.c | ||
| audit_fsnotify.c | ||
| audit_tree.c | ||
| audit_watch.c | ||
| audit.c | ||
| audit.h | ||
| auditfilter.c | ||
| auditsc.c | ||
| backtracetest.c | ||
| bounds.c | ||
| capability.c | ||
| cfi.c | ||
| compat.c | ||
| configs.c | ||
| context_tracking.c | ||
| cpu_pm.c | ||
| cpu.c | ||
| crash_core.c | ||
| crash_dump.c | ||
| cred.c | ||
| delayacct.c | ||
| dma.c | ||
| exec_domain.c | ||
| exit.c | ||
| extable.c | ||
| fail_function.c | ||
| fork.c | ||
| freezer.c | ||
| gen_kheaders.sh | ||
| groups.c | ||
| hung_task.c | ||
| iomem.c | ||
| irq_work.c | ||
| jump_label.c | ||
| kallsyms_internal.h | ||
| kallsyms.c | ||
| kcmp.c | ||
| Kconfig.freezer | ||
| Kconfig.hz | ||
| Kconfig.locks | ||
| Kconfig.preempt | ||
| kcov.c | ||
| kexec_core.c | ||
| kexec_elf.c | ||
| kexec_file.c | ||
| kexec_internal.h | ||
| kexec.c | ||
| kheaders.c | ||
| kmod.c | ||
| kprobes.c | ||
| ksysfs.c | ||
| kthread.c | ||
| latencytop.c | ||
| Makefile | ||
| module_signature.c | ||
| notifier.c | ||
| nsproxy.c | ||
| padata.c | ||
| panic.c | ||
| params.c | ||
| pid_namespace.c | ||
| pid.c | ||
| profile.c | ||
| ptrace.c | ||
| range.c | ||
| reboot.c | ||
| regset.c | ||
| relay.c | ||
| resource_kunit.c | ||
| resource.c | ||
| rseq.c | ||
| scftorture.c | ||
| scs.c | ||
| seccomp.c | ||
| signal.c | ||
| smp.c | ||
| smpboot.c | ||
| smpboot.h | ||
| softirq.c | ||
| stackleak.c | ||
| stacktrace.c | ||
| static_call_inline.c | ||
| static_call.c | ||
| stop_machine.c | ||
| sys_ni.c | ||
| sys.c | ||
| sysctl-test.c | ||
| sysctl.c | ||
| task_work.c | ||
| taskstats.c | ||
| torture.c | ||
| tracepoint.c | ||
| tsacct.c | ||
| ucount.c | ||
| uid16.c | ||
| uid16.h | ||
| umh.c | ||
| up.c | ||
| user_namespace.c | ||
| user-return-notifier.c | ||
| user.c | ||
| usermode_driver.c | ||
| utsname_sysctl.c | ||
| utsname.c | ||
| watch_queue.c | ||
| watchdog_hld.c | ||
| watchdog.c | ||
| workqueue_internal.h | ||
| workqueue.c | ||