linux/kernel
Linus Torvalds 797479da0a signal: avoid double atomic counter increments for user accounting
[ Upstream commit fda31c5029 ]

When queueing a signal, we increment both the users count of pending
signals (for RLIMIT_SIGPENDING tracking) and we increment the refcount
of the user struct itself (because we keep a reference to the user in
the signal structure in order to correctly account for it when freeing).

That turns out to be fairly expensive, because both of them are atomic
updates, and particularly under extreme signal handling pressure on big
machines, you can get a lot of cache contention on the user struct.
That can then cause horrid cacheline ping-pong when you do these
multiple accesses.

So change the reference counting to only pin the user for the _first_
pending signal, and to unpin it when the last pending signal is
dequeued.  That means that when a user sees a lot of concurrent signal
queuing - which is the only situation when this matters - the only
atomic access needed is generally the 'sigpending' count update.

This was noticed because of a particularly odd timing artifact on a
dual-socket 96C/192T Cascade Lake platform: when you get into bad
contention, on that machine for some reason seems to be much worse when
the contention happens in the upper 32-byte half of the cacheline.

As a result, the kernel test robot will-it-scale 'signal1' benchmark had
an odd performance regression simply due to random alignment of the
'struct user_struct' (and pointed to a completely unrelated and
apparently nonsensical commit for the regression).

Avoiding the double increments (and decrements on the dequeueing side,
of course) makes for much less contention and hugely improved
performance on that will-it-scale microbenchmark.

Quoting Feng Tang:

 "It makes a big difference, that the performance score is tripled! bump
  from original 17000 to 54000. Also the gap between 5.0-rc6 and
  5.0-rc6+Jiri's patch is reduced to around 2%"

[ The "2% gap" is the odd cacheline placement difference on that
  platform: under the extreme contention case, the effect of which half
  of the cacheline was hot was 5%, so with the reduced contention the
  odd timing artifact is reduced too ]

It does help in the non-contended case too, but is not nearly as
noticeable.

Reported-and-tested-by: Feng Tang <feng.tang@intel.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Philip Li <philip.li@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-03-20 11:55:53 +01:00
..
bpf bpf, offload: Replace bitwise AND by logical AND in bpf_prog_offload_info_fill 2020-02-28 16:38:59 +01:00
cgroup cgroup: Iterate tasks that did not finish do_exit() 2020-03-18 07:14:19 +01:00
configs
debug kdb: do a sanity check on the cpu in kdb_per_cpu() 2020-01-27 14:50:48 +01:00
dma dma-debug: add a schedule point in debug_dma_dump_mappings() 2020-01-04 19:12:43 +01:00
events perf/core: Fix mlock accounting in perf_mmap() 2020-02-11 04:34:19 -08:00
gcov
irq genirq/proc: Reject invalid affinity masks (again) 2020-02-28 16:38:59 +01:00
livepatch livepatch: Nullify obj->mod in klp_module_coming()'s error path 2019-10-07 18:57:10 +02:00
locking locking/spinlock/debug: Fix various data races 2020-01-12 12:17:05 +01:00
power PM / hibernate: memory_bm_find_bit(): Tighten node optimisation 2020-01-09 10:18:58 +01:00
printk printk: fix exclusive_console replaying 2020-02-11 04:33:51 -08:00
rcu rcu: Avoid data-race in rcu_gp_fqs_check_wake() 2020-02-11 04:33:55 -08:00
sched sched/fair: Fix O(nr_cgroups) in the load balancing path 2020-03-05 16:42:21 +01:00
time clocksource: Prevent double add_timer_on() for watchdog_timer 2020-02-11 04:34:18 -08:00
trace tracing: Disable trace_printk() on post poned tests 2020-03-05 16:42:18 +01:00
.gitignore
acct.c acct_on(): don't mess with freeze protection 2019-05-31 06:46:05 -07:00
async.c
audit_fsnotify.c
audit_tree.c audit: Embed key into chunk 2019-12-13 08:51:11 +01:00
audit_watch.c audit_get_nd(): don't unlock parent too early 2019-12-13 08:51:02 +01:00
audit.c audit: always check the netlink payload length in audit_receive_msg() 2020-03-05 16:42:23 +01:00
audit.h
auditfilter.c audit: fix error handling in audit_data_to_entry() 2020-03-05 16:42:17 +01:00
auditsc.c audit: print empty EXECVE args 2019-12-01 09:17:17 +01:00
backtracetest.c
bounds.c
capability.c LSM: generalize flag passing to security_capable 2020-01-23 08:21:29 +01:00
compat.c
configs.c
context_tracking.c
cpu_pm.c
cpu.c cpu/hotplug, stop_machine: Fix stop_machine vs hotplug order 2020-02-24 08:34:35 +01:00
crash_core.c
crash_dump.c
cred.c memcg: account security cred as well to kmemcg 2020-01-09 10:19:00 +01:00
delayacct.c
dma.c
elfcore.c kernel/elfcore.c: include proper prototypes 2019-10-11 18:21:23 +02:00
exec_domain.c
exit.c exit: panic before exit_mm() on global init exit 2020-01-09 10:19:02 +01:00
extable.c
fail_function.c
fork.c fork,memcg: alloc_thread_stack_node needs to set tsk->stack 2020-01-27 14:50:58 +01:00
freezer.c
futex.c futex: Prevent robust futex exit race 2019-12-01 09:17:38 +01:00
groups.c
hung_task.c kernel: hung_task.c: disable on suspend 2019-04-20 09:16:02 +02:00
iomem.c
irq_work.c irq_work: Do not raise an IPI when queueing work on the local CPU 2019-05-31 06:46:19 -07:00
jump_label.c jump_label: move 'asm goto' support test to Kconfig 2019-06-04 08:02:34 +02:00
kallsyms.c kallsyms: Don't let kallsyms_lookup_size_offset() fail on retrieving the first symbol 2019-09-21 07:17:02 +02:00
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt
kcov.c
kexec_core.c kexec: Allocate decrypted control pages for kdump if SME is enabled 2019-11-24 08:20:29 +01:00
kexec_file.c
kexec_internal.h
kexec.c
kmod.c
kprobes.c kprobes: Fix optimize_kprobe()/unoptimize_kprobe() cancellation logic 2020-03-11 14:14:47 +01:00
ksysfs.c
kthread.c
latencytop.c
Makefile y2038: futex: Move compat implementation into futex.c 2019-12-01 09:17:38 +01:00
memremap.c mm/memory_hotplug: shrink zones when offlining memory 2020-01-29 16:43:27 +01:00
module_signing.c
module-internal.h
module.c module: avoid setting info->name early in case we can fall back to info->mod->name 2020-02-24 08:34:49 +01:00
notifier.c
nsproxy.c
padata.c padata: fix null pointer deref of pd->pinst 2020-02-14 16:33:28 -05:00
panic.c kernel/panic.c: do not append newline to the stack protector panic string 2019-12-01 09:17:10 +01:00
params.c
pid_namespace.c signal/pid_namespace: Fix reboot_pid_ns to use send_sig not force_sig 2019-07-26 09:14:01 +02:00
pid.c
profile.c
ptrace.c ptrace: reintroduce usage of subjective credentials in ptrace_has_cap() 2020-01-23 08:21:29 +01:00
range.c
reboot.c
relay.c
resource.c resource: fix locking in find_next_iomem_res() 2019-09-16 08:22:20 +02:00
rseq.c
seccomp.c LSM: generalize flag passing to security_capable 2020-01-23 08:21:29 +01:00
signal.c signal: avoid double atomic counter increments for user accounting 2020-03-20 11:55:53 +01:00
smp.c
smpboot.c
smpboot.h
softirq.c
stacktrace.c
stop_machine.c
sys_ni.c
sys.c kernel/sys.c: prctl: fix false positive in validate_prctl_map() 2019-06-15 11:54:01 +02:00
sysctl_binary.c
sysctl.c kernel: sysctl: make drop_caches write-only 2020-01-04 19:13:17 +01:00
task_work.c
taskstats.c taskstats: fix data-race 2020-01-09 10:18:59 +01:00
test_kprobes.c
torture.c
tracepoint.c
tsacct.c
ucount.c
uid16.c
uid16.h
umh.c
up.c
user_namespace.c
user-return-notifier.c
user.c
utsname_sysctl.c
utsname.c
watchdog_hld.c
watchdog.c watchdog/softlockup: Enforce that timestamp is valid on boot 2020-02-24 08:34:49 +01:00
workqueue_internal.h
workqueue.c workqueue: don't use wq_select_unbound_cpu() for bound works 2020-03-18 07:14:20 +01:00