Add a tracepoint for pause and resume which measures the
duration of time to perform the entire operation, the
cpus acted upon with this event, and the current state
of the active cpu mask. This should be sufficient
for testing pause performance.
Bug: 175959069
Change-Id: I9fc269c7d09ac78ec31612d3c552044b72b0e6e3
Signed-off-by: Stephen Dickey <dickey@codeaurora.org>
Buffers that are passed to read_actions_logged() and write_actions_logged()
are in kernel memory; the sysctl core takes care of copying from/to
userspace.
Fixes: 32927393dc ("sysctl: pass kernel pointers to ->proc_handler")
Reviewed-by: Tyler Hicks <code@tyhicks.com>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201120170545.1419332-1-jannh@google.com
(cherry picked from commit fab686eb03)
Signed-off-by: Jeff Vander Stoep <jeffv@google.com>
Change-Id: I378d1eb73ff01928d3255191ed4443cd7b87c720
Bug: 176068146
SECCOMP_CACHE will only operate on syscalls that do not access
any syscall arguments or instruction pointer. To facilitate
this we need a static analyser to know whether a filter will
return allow regardless of syscall arguments for a given
architecture number / syscall number pair. This is implemented
here with a pseudo-emulator, and stored in a per-filter bitmap.
In order to build this bitmap at filter attach time, each filter is
emulated for every syscall (under each possible architecture), and
checked for any accesses of struct seccomp_data that are not the "arch"
nor "nr" (syscall) members. If only "arch" and "nr" are examined, and
the program returns allow, then we can be sure that the filter must
return allow independent from syscall arguments.
Nearly all seccomp filters are built from these cBPF instructions:
BPF_LD | BPF_W | BPF_ABS
BPF_JMP | BPF_JEQ | BPF_K
BPF_JMP | BPF_JGE | BPF_K
BPF_JMP | BPF_JGT | BPF_K
BPF_JMP | BPF_JSET | BPF_K
BPF_JMP | BPF_JA
BPF_RET | BPF_K
BPF_ALU | BPF_AND | BPF_K
Each of these instructions are emulated. Any weirdness or loading
from a syscall argument will cause the emulator to bail.
The emulation is also halted if it reaches a return. In that case,
if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good.
Emulator structure and comments are from Kees [1] and Jann [2].
Emulation is done at attach time. If a filter depends on more
filters, and if the dependee does not guarantee to allow the
syscall, then we skip the emulation of this syscall.
[1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/
[2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/
Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Reviewed-by: Jann Horn <jannh@google.com>
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/71c7be2db5ee08905f41c3be5c1ad6e2601ce88f.1602431034.git.yifeifz2@illinois.edu
(cherry picked from commit 8e01b51a31)
Signed-off-by: Jeff Vander Stoep <jeffv@google.com>
Change-Id: I5047f7f0d6502e5de6c047743f1053fda3025a6e
Bug: 176068146
The overhead of running Seccomp filters has been part of some past
discussions [1][2][3]. Oftentimes, the filters have a large number
of instructions that check syscall numbers one by one and jump based
on that. Some users chain BPF filters which further enlarge the
overhead. A recent work [6] comprehensively measures the Seccomp
overhead and shows that the overhead is non-negligible and has a
non-trivial impact on application performance.
We observed some common filters, such as docker's [4] or
systemd's [5], will make most decisions based only on the syscall
numbers, and as past discussions considered, a bitmap where each bit
represents a syscall makes most sense for these filters.
The fast (common) path for seccomp should be that the filter permits
the syscall to pass through, and failing seccomp is expected to be
an exceptional case; it is not expected for userspace to call a
denylisted syscall over and over.
When it can be concluded that an allow must occur for the given
architecture and syscall pair (this determination is introduced in
the next commit), seccomp will immediately allow the syscall,
bypassing further BPF execution.
Each architecture number has its own bitmap. The architecture
number in seccomp_data is checked against the defined architecture
number constant before proceeding to test the bit against the
bitmap with the syscall number as the index of the bit in the
bitmap, and if the bit is set, seccomp returns allow. The bitmaps
are all clear in this patch and will be initialized in the next
commit.
When only one architecture exists, the check against architecture
number is skipped, suggested by Kees Cook [7].
[1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/
[2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
[3] https://github.com/seccomp/libseccomp/issues/116
[4] ae0ef82b90/profiles/seccomp/default.json
[5] 6743a1caf4/src/shared/seccomp-util.c (L270)
[6] Draco: Architectural and Operating System Support for System Call Security
https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020
[7] https://lore.kernel.org/bpf/202010091614.8BB0EB64@keescook/
Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/10f91a367ec4fcdea7fc3f086de3f5f13a4a7436.1602431034.git.yifeifz2@illinois.edu
(cherry picked from commit f9d480b6ff)A
Signed-off-by: Jeff Vander Stoep <jeffv@google.com>
Change-Id: I50b6682e17dc6e91b5e92017361200d722282825
Bug: 176068146
Export hrtimer_expire_entry/exit tracepoints, so that vendor modules
can register probes for these tracepoints.
Bug: 175936268
Change-Id: I739f369d3b56e09f8e9061fefdf25830e37e987e
Signed-off-by: Changki Kim <changki.kim@samsung.com>
Export workqueue_execute_start/end tracepoints, so that vendor modules
can register probes for these tracepoints.
Bug: 175936268
Change-Id: Ib4c8f39ff8305a1d52fbca9d06b5e792396a3a2d
Signed-off-by: Changki Kim <changki.kim@samsung.com>
Export irq_handle_exit tracepoint, so that vendor modules
can register probes for this tracepoint.
Bug: 175936268
Change-Id: I8e1eaffb7dd2f257e9c09412aad54ecca62bf019
Signed-off-by: Changki Kim <changki.kim@samsung.com>
Add a restricted vendor hook to check whether a set of tasks can
move to other cgorup.
Bug: 175808144
Signed-off-by: Choonghoon Park <choong.park@samsung.com>
Change-Id: If7bac83e0d2d1069b1436331989c3926645eab19
Export irq_handle_entry tracepoint, so that vendor modules
can register probes for this tracepoint.
Bug: 175806230
Change-Id: Iacc331f923d27f1a17065d6c0315c0c054af313e
Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
Move function tracer options to Kconfig to make it easier to add
new methods for generating __mcount_loc, and to make the options
available also when building kernel modules.
Note that FTRACE_MCOUNT_USE_* options are updated on rebuild and
therefore, work even if the .config was generated in a different
environment.
Bug: 145210207
Change-Id: I6fc38abde50b602788148cb236aba1261affa896
Link: https://lore.kernel.org/lkml/20201211184633.3213045-1-samitolvanen@google.com/
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
This symbol is needed for certain vendor modules which cannot rely on
only the sysfs interface for uclamp alone.
Bug: 170697030
Signed-off-by: J. Avila <elavila@google.com>
Change-Id: Ic904ade83a45d259cfc95501e6b81e6c5a0e90a0
place_entity() vendor hook is meant to tweak vruntime by vendor
modules as needed, but with current form of the hook that is not
possible as vruntime is passed by it's value. Fix it by switching
to pass by reference.
Bug: 175448877
Change-Id: Ibb51592f94da31019fa98a6767d080ec61daafe6
Signed-off-by: Satya Durga Srinivasu Prabhala <satyap@codeaurora.org>
Export sysctl_sched_latency symbol to be able to access from vendor modules.
Bug: 175448877
Change-Id: Ieae39579f4adfe2bb97d0ee6b1970dd904aafdda
Signed-off-by: Satya Durga Srinivasu Prabhala <satyap@codeaurora.org>
sched_uclamp_used will be used by vendor modules for enhancing placements.
Bug: 175448875
Change-Id: Ib6aa7839ae4d72de6490e2d1ff92729830830e3b
Signed-off-by: Abhijeet Dharmapurikar <adharmap@codeaurora.org>
(cherry picked from commit 7923ce03f7)
Below symbols would be used by vendor modules for tracking tasks
in various cgroups
1. cgroup_taskset_first
2. cgroup_taskset_next
Bug: 175045928
Change-Id: I6d89148eb4c71174a02a27acab196ff940be9082
Signed-off-by: Abhijeet Dharmapurikar <adharmap@codeaurora.org>
(cherry picked from commit 7333bd73c0)
- Correct a few problems in the x86 and the generic membarrier
implementation. Small corrections for assumptions about visibility
which have turned out not to be true.
- Make the PAT bits for memory encryption correct vs. 4K and 2M/1G page
table entries as they are at a different location.
- Fix a concurrency issue in the the local bandwidth readout of resource
control leading to incorrect values
- Fix the ordering of allocating a vector for an interrupt. The order
missed to respect the provided cpumask when the first attempt of
allocating node local in the mask fails. It then tries the node instead
of trying the full provided mask first. This leads to erroneous error
messages and breaking the (user) supplied affinity request. Reorder it.
- Make the INT3 padding detection in optprobe work correctly.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/WOw0THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYodUhEACW02Al+mlpZjOhzIGwNUALGX7YGEOi
wJoiIDTV2vOtKhcJ/AOVn05CznoWXa6PN1+7pm0nR2M1XaiXZ3B6HMjy1+0rpSqK
5dCCEqr+QowwMRsKd4hnuJowR/7Og66FiYIcknDJg/1egg+RXy7yKds577W6/KW0
2VV/x35xDywERiQ28qIftQ4L7NREyiioomN/GFpeoQf8tQF06Rb4t12T5pdA6D8S
/C1wj0IKVRMJo/7HbB/X6skxPOK0PWAEBpaT0+Q66VKSNnFVN5ap+rGTKdUoTmuM
NdxHOukpHyFCQtbRPOjRUeeSZJLKSX5oXsBO1GUvyyxcT2XNlTOdNamYLWyO+RfV
uP97qhzYDKL+cgDkwypJ0WOzOb9EIXOh4P9BTnJFBGhc4EQwen3cpb3CyWWftjnv
/obXiRDnAOE6P6H2AGiwrK3Pny+SvgrFYKMe+iy+ntToz1yrDh7ZrZ0DQKVoFWEr
n3qUnlPZmVvRzHIRVYoK69nS/UgmNN0LssavzRBzab3BcK93f23QkW86P42kNCLa
9kL+ZgGwlpqUZZR/p3pHq9Mv2ZXGEdUYY99h2vy1fgMwOw/RQZnPDAj/131UXlsU
4DL0mAlTK1/0/81H6/V1GDBkMkO+hN4x6Y3asi7bPHEKEzlLe+P1tUt3YELd3CEC
dc8ebICG1InXsw==
=t31y
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2020-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"A set of x86 and membarrier fixes:
- Correct a few problems in the x86 and the generic membarrier
implementation. Small corrections for assumptions about visibility
which have turned out not to be true.
- Make the PAT bits for memory encryption correct vs 4K and 2M/1G
page table entries as they are at a different location.
- Fix a concurrency issue in the the local bandwidth readout of
resource control leading to incorrect values
- Fix the ordering of allocating a vector for an interrupt. The order
missed to respect the provided cpumask when the first attempt of
allocating node local in the mask fails. It then tries the node
instead of trying the full provided mask first. This leads to
erroneous error messages and breaking the (user) supplied affinity
request. Reorder it.
- Make the INT3 padding detection in optprobe work correctly"
* tag 'x86-urgent-2020-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/kprobes: Fix optprobe to detect INT3 padding correctly
x86/apic/vector: Fix ordering in vector assignment
x86/resctrl: Fix incorrect local bandwidth when mba_sc is enabled
x86/mm/mem_encrypt: Fix definition of PMD_FLAGS_DEC_WP
membarrier: Execute SYNC_CORE on the calling thread
membarrier: Explicitly sync remote cores when SYNC_CORE is requested
membarrier: Add an actual barrier before rseq_preempt()
x86/membarrier: Get rid of a dubious optimization
Remove bpf_ prefix, which causes these helpers to be reported in verifier
dump as bpf_bpf_this_cpu_ptr() and bpf_bpf_per_cpu_ptr(), respectively. Lets
fix it as long as it is still possible before UAPI freezes on these helpers.
Fixes: eaa6bcb71e ("bpf: Introduce bpf_per_cpu_ptr()")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kernel/elfcore.c only contains weak symbols, which triggers a bug with
clang in combination with recordmcount:
Cannot find symbol for section 2: .text.
kernel/elfcore.o: failed
Move the empty stubs into linux/elfcore.h as inline functions. As only
two architectures use these, just use the architecture specific Kconfig
symbols to key off the declaration.
Link: https://lkml.kernel.org/r/20201204165742.3815221-2-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Nathan Chancellor <natechancellor@gmail.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Barret Rhoden <brho@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Steps on the way to 5.10-rc8/final
Resolves conflicts with:
net/xfrm/xfrm_state.c
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I06495103fc0f2531b241b3577820f4461ba83dd5
Pull networking fixes from David Miller:
1) IPsec compat fixes, from Dmitry Safonov.
2) Fix memory leak in xfrm_user_policy(). Fix from Yu Kuai.
3) Fix polling in xsk sockets by using sk_poll_wait() instead of
datagram_poll() which keys off of sk_wmem_alloc and such which xsk
sockets do not update. From Xuan Zhuo.
4) Missing init of rekey_data in cfgh80211, from Sara Sharon.
5) Fix destroy of timer before init, from Davide Caratti.
6) Missing CRYPTO_CRC32 selects in ethernet driver Kconfigs, from Arnd
Bergmann.
7) Missing error return in rtm_to_fib_config() switch case, from Zhang
Changzhong.
8) Fix some src/dest address handling in vrf and add a testcase. From
Stephen Suryaputra.
9) Fix multicast handling in Seville switches driven by mscc-ocelot
driver. From Vladimir Oltean.
10) Fix proto value passed to skb delivery demux in udp, from Xin Long.
11) HW pkt counters not reported correctly in enetc driver, from Claudiu
Manoil.
12) Fix deadlock in bridge, from Joseph Huang.
13) Missing of_node_pur() in dpaa2 driver, fromn Christophe JAILLET.
14) Fix pid fetching in bpftool when there are a lot of results, from
Andrii Nakryiko.
15) Fix long timeouts in nft_dynset, from Pablo Neira Ayuso.
16) Various stymmac fixes, from Fugang Duan.
17) Fix null deref in tipc, from Cengiz Can.
18) When mss is biog, coose more resonable rcvq_space in tcp, fromn Eric
Dumazet.
19) Revert a geneve change that likely isnt necessary, from Jakub
Kicinski.
20) Avoid premature rx buffer reuse in various Intel driversm from Björn
Töpel.
21) retain EcT bits during TIS reflection in tcp, from Wei Wang.
22) Fix Tso deferral wrt. cwnd limiting in tcp, from Neal Cardwell.
23) MPLS_OPT_LSE_LABEL attribute is 342 ot 8 bits, from Guillaume Nault
24) Fix propagation of 32-bit signed bounds in bpf verifier and add test
cases, from Alexei Starovoitov.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
selftests: fix poll error in udpgro.sh
selftests/bpf: Fix "dubious pointer arithmetic" test
selftests/bpf: Fix array access with signed variable test
selftests/bpf: Add test for signed 32-bit bound check bug
bpf: Fix propagation of 32-bit signed bounds from 64-bit bounds.
MAINTAINERS: Add entry for Marvell Prestera Ethernet Switch driver
net: sched: Fix dump of MPLS_OPT_LSE_LABEL attribute in cls_flower
net/mlx4_en: Handle TX error CQE
net/mlx4_en: Avoid scheduling restart task if it is already running
tcp: fix cwnd-limited bug for TSO deferral where we send nothing
net: flow_offload: Fix memory leak for indirect flow block
tcp: Retain ECT bits for tos reflection
ethtool: fix stack overflow in ethnl_parse_bitset()
e1000e: fix S0ix flow to allow S0i3.2 subset entry
ice: avoid premature Rx buffer reuse
ixgbe: avoid premature Rx buffer reuse
i40e: avoid premature Rx buffer reuse
igb: avoid transmit queue timeout in xdp path
igb: use xdp_do_flush
igb: skb add metasize for xdp
...
Alexei Starovoitov says:
====================
pull-request: bpf 2020-12-10
The following pull-request contains BPF updates for your *net* tree.
We've added 21 non-merge commits during the last 12 day(s) which contain
a total of 21 files changed, 163 insertions(+), 88 deletions(-).
The main changes are:
1) Fix propagation of 32-bit signed bounds from 64-bit bounds, from Alexei.
2) Fix ring_buffer__poll() return value, from Andrii.
3) Fix race in lwt_bpf, from Cong.
4) Fix test_offload, from Toke.
5) Various xsk fixes.
Please consider pulling these changes from:
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git
Thanks a lot!
Also thanks to reporters, reviewers and testers of commits in this pull-request:
Cong Wang, Hulk Robot, Jakub Kicinski, Jean-Philippe Brucker, John
Fastabend, Magnus Karlsson, Maxim Mikityanskiy, Yonghong Song
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The 64-bit signed bounds should not affect 32-bit signed bounds unless the
verifier knows that upper 32-bits are either all 1s or all 0s. For example the
register with smin_value==1 doesn't mean that s32_min_value is also equal to 1,
since smax_value could be larger than 32-bit subregister can hold.
The verifier refines the smax/s32_max return value from certain helpers in
do_refine_retval_range(). Teach the verifier to recognize that smin/s32_min
value is also bounded. When both smin and smax bounds fit into 32-bit
subregister the verifier can propagate those bounds.
Fixes: 3f50f132d8 ("bpf: Verifier, do explicit ALU32 bounds tracking")
Reported-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add a restricted vendor hook when a set of tasks change the cgroups in
the cpu controller. This facilitates various scheduler value adds.
Bug: 175045928
Change-Id: I544046d631f4d6c9bc2b999e054b5a296ec31a81
Signed-off-by: Abhijeet Dharmapurikar <adharmap@codeaurora.org>
This scheduler interface has been brought to allow the hotplug feature
paused_cpus to force tasks migration. It depends on migrate_tasks() which
is only available when CONFIG_HOTPLUG_CPU is set. sched_cpu_drain*()
functions should then also depend on this configuration.
Bug: 161210528
Fixes: e19b8ce (ANDROID: cpu/hotplug: add migration to paused_cpus)
Change-Id: I250cd0a8f382811604445600acc33ab4df4ad4da
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
membarrier()'s MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE is documented as
syncing the core on all sibling threads but not necessarily the calling
thread. This behavior is fundamentally buggy and cannot be used safely.
Suppose a user program has two threads. Thread A is on CPU 0 and thread B
is on CPU 1. Thread A modifies some text and calls
membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE).
Then thread B executes the modified code. If, at any point after
membarrier() decides which CPUs to target, thread A could be preempted and
replaced by thread B on CPU 0. This could even happen on exit from the
membarrier() syscall. If this happens, thread B will end up running on CPU
0 without having synced.
In principle, this could be fixed by arranging for the scheduler to issue
sync_core_before_usermode() whenever switching between two threads in the
same mm if there is any possibility of a concurrent membarrier() call, but
this would have considerable overhead. Instead, make membarrier() sync the
calling CPU as well.
As an optimization, this avoids an extra smp_mb() in the default
barrier-only mode and an extra rseq preempt on the caller.
Fixes: 70216e18e5 ("membarrier: Provide core serializing command, *_SYNC_CORE")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/r/250ded637696d490c69bef1877148db86066881c.1607058304.git.luto@kernel.org
membarrier() does not explicitly sync_core() remote CPUs; instead, it
relies on the assumption that an IPI will result in a core sync. On x86,
this may be true in practice, but it's not architecturally reliable. In
particular, the SDM and APM do not appear to guarantee that interrupt
delivery is serializing. While IRET does serialize, IPI return can
schedule, thereby switching to another task in the same mm that was
sleeping in a syscall. The new task could then SYSRET back to usermode
without ever executing IRET.
Make this more robust by explicitly calling sync_core_before_usermode()
on remote cores. (This also helps people who search the kernel tree for
instances of sync_core() and sync_core_before_usermode() -- one might be
surprised that the core membarrier code doesn't currently show up in a
such a search.)
Fixes: 70216e18e5 ("membarrier: Provide core serializing command, *_SYNC_CORE")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/776b448d5f7bd6b12690707f5ed67bcda7f1d427.1607058304.git.luto@kernel.org
It seems that most RSEQ membarrier users will expect any stores done before
the membarrier() syscall to be visible to the target task(s). While this
is extremely likely to be true in practice, nothing actually guarantees it
by a strict reading of the x86 manuals. Rather than providing this
guarantee by accident and potentially causing a problem down the road, just
add an explicit barrier.
Fixes: 70216e18e5 ("membarrier: Provide core serializing command, *_SYNC_CORE")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/d3e7197e034fa4852afcf370ca49c30496e58e40.1607058304.git.luto@kernel.org
This reverts commit 1243dc518c.
percpu_rwsem is an rcu based lock. Under loaded conditions
this lock will require that each cpu perform a switch, causing
exessive delays with long-running tasks on those CPUs.
In the case of hotplug/pause, this can slow down the ability
to activate/deactivate a CPU.
Revert the change from cpuset_mutex to percpu_rwsem, to
eliminate that overhead, and revert to a global lock.
Bug: 161210528
Change-Id: Id00dcaa6d601b561d1321d3e944b6c52e9663f1a
Signed-off-by: Stephen Dickey <dickey@codeaurora.org>
Incorporate a vendor hook in the resume cpus path
so that vendor specific activities may take place.
Bug: 161210528
Change-Id: I74d03247491b004e891dbcfe06a478d00a95ba9f
Signed-off-by: Stephen Dickey <dickey@codeaurora.org>
In the resume_cpus() path, cpus cannot be taken
advantage of until the cpus write lock is acquired,
and cpus are activated and domains rebuilt. This
can incurr significant delay in the unpause operation.
Additionally, if scheduled through the kworker thread,
the wait time for rebuilding sched domains becomes
large due to a busy system that can prevent the kworker
from executing.
Activate the cpus and call the cpuset_hotplug_workfn
directly within resume_cpus prior to getting the cpus
write lock, thereby eliminating delays associated
with scheduling this activity.
Bug: 161210528
Change-Id: Ie2521f28ed9078b22d421d792f08413016d4dd62
Signed-off-by: Stephen Dickey <dickey@codeaurora.org>
Signed-off-by: Todd Kjos <tkjos@google.com>
paused_cpus intending to force CPUs to go idle as quickly as possible,
adding a migration step, to drain the rq from any running task.
Two steps are actually needed. The first one, "lazy", will run before the
cpu_active_mask has been synced. The second one will run after. It is
possible for another CPU, to observe an outdated version of that mask and
to enqueue a task on a rq that has just been marked inactive. The second
migration is there to catch any of those spurious move, while the first
one will drain the rq as quickly as possible to let the CPU reach an idle
state.
Bug: 161210528
Change-Id: Ie26c2e4c42665dd61d41a899a84536e56bf2b887
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
pause_cpus intends to have a way to force a CPU to go idle and to resume
as quickly as possible, with as little disruption as possible on the
system. This is a way of saving energy or meet thermal constraints, for
which a full CPU hotunplug is too slow. A paused CPU is simply deactivated
from the scheduler point of view. This corresponds to the first hotunplug
step.
Each pause operation still needs some heavy synchronization. Allowing to
pause several CPUs in one go mitigate that issue.
Paused CPUs can be resumed with resume_cpus(), which also takes a cpumask
as an input.
Few limitations:
* It isn't possible to pause a CPU which is running SCHED_DEADLINE task.
* A paused CPU will be removed from any cpuset it is part of. Resuming
the CPU won't put back this CPU in the cpuset if using cgroup1.
Cgroup2 doesn't have this limitation.
* per-CPU kthreads are still allowed to run on a paused CPU.
Bug: 161210528
Change-Id: I1f5cb28190f8ec979bb8640a89b022f2f7266bcf
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Todd Kjos <tkjos@google.com>
In the event of a partial _cpu_down, (i.e. _cpu_down(target) where
target > CPUHP_AP_OFFLINE), the cpu_online_mask won't be aligned with
cpu_active_mask. This is an issue when trying to offline the last CPU
from cpu_active_mask, while num_online_cpus() > 1.
Protect against this case by checking num_active_cpus() instead of
num_online_cpus().
Bug: 161210528
Change-Id: Ibe7d9ef69e5f91e99be0d98076614a7654bda094
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
In the event of a partial hotunplug, a stable state with a CPU set in the
online_mask and cleared from active_mask can happen. An online CPU, from a
scheduler point of view, should be part of the cpu_active_mask.
Bug: 161210528
Change-Id: I0d0aa6fca4c6dc145634c4aad6519045e0afc8e2
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Adding support in update_max_interval() for incomplete HP _cpu_down, where
cpu_active_mask != cpu_online_mask. This situation can happen in the event
of a partial _cpu_down. i.e. _cpu_down(target) where
target > CPUHP_AP_OFFLINE.
Bug: 161210528
Change-Id: Ia422057c65f16dc9aa8f6d272098b2308b00f0ac
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
In the event of a partial hotunplug, a stable state with a CPU set in the
online mask but cleared in the active can happen. This is problematic for
the window between the active mask clearing and the sched domains rebuild.
RT could bounce back a task, migrated off a hotunplugged CPU. Introducing
an intersection between lowest_mask and the cpu_active_mask to prevent a
such situation.
Bug: 161210528
Change-Id: I4f8cb782c2ca560c297b7f4bdb2336918c83a5a1
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
This new interface allows to trigger a stopper on a given CPU and wait
for the end of the work in a separated function cpu_stop_work_wait().
This differs from stop_one_cpu_nowait() by allowing the usage of the
cpu_stop completion mechanism.
Bug: 161210528
Change-Id: Ida51371e32897d008ece0639190fc21feabb0f28
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
While writing an application that requires user stack trace option
to work in instances, I found that the instance option has a bug
that makes it a nop. The check for performing the user stack trace
in an instance, checks the top level options (not the instance options)
to determine if a user stack trace should be performed or not.
This is not only incorrect, but also confusing for users. It confused
me for a bit!
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCX85BzBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qv7XAQCzfq6DpHUv1gVaBkGytOszLW4IuEtd
/09jDPbVSG8T2QEA+e6fpBP1aIixgLN6+vFVl3wAK3SaIQeIA/blkbBNFQ0=
=o1Ew
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Fix userstacktrace option for instances
While writing an application that requires user stack trace option to
work in instances, I found that the instance option has a bug that
makes it a nop. The check for performing the user stack trace in an
instance, checks the top level options (not the instance options) to
determine if a user stack trace should be performed or not.
This is not only incorrect, but also confusing for users. It confused
me for a bit!"
* tag 'trace-v5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix userstacktrace option for instances
- Make multiqueue devices which use the managed interrupt affinity
infrastructure work on PowerPC/Pseries. PowerPC does not use the
generic infrastructure for setting up PCI/MSI interrupts and the
multiqueue changes failed to update the legacy PCI/MSI infrastructure.
Make this work by passing the affinity setup information down to the
mapping and allocation functions.
- Move Jason Cooper from MAINTAINERS to CREDITS as his mail is bouncing
and he's not reachable. We hope all is well with him and say thanks
for his work over the years.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/M1GwTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoW8xD/4uG/0ayYgSdRf4nXcyXu4JKoHV5oK5
y7IWY9s04fqTFbVO2fRaD1hBYHavWfdV80obP8dJio1g6R1BqzZiEVUmCdWI0tHJ
recAsGYxqPrNj9soHEZ7ZmuGX6VhuzQj57srU+lhzsqk+88uY/n1d/TlrHCH7miU
0cfBSoolP2l2p6UYHvXfH2wk1hRHg8sySOfxGSp6KSrewoOwAOT2CCNX8gIcmy1n
dUsJaHEFzU547p55zDs5DTHfM0yJdsqqUpdxvpiZWpZhsIzoQvd8taiH7/uaRGqd
yJI4sMWudJUGGas2Vq0yjG6L0uAJ7M+kjqodJzn0hAKq6MhAIKaPMEbpPx9TuZYb
zZg9ce5o4LwzTphNPmcEMCjpPKRGNiEbcl1XY4qhQWnBvuOb1mIBFV4+6srd+Lpg
o7kEt+XyjKZARgw01yDf9tHSJYOcBQuHqGUdRZQAWSCThizpQsOZJaUWB8l4mLDy
fScYx4cH12oPmCg3Fdd22oq7JN0ed9O3M7BLuzmI006uSWsB8fbfEcM+k+g+63Go
xpHYKM6VOzLlspFPFvVo3nwzvc787he2I9tIOPtCU0Hl4BBjjfGyB9FcnhNShNoe
dfVVzTaEWmHNGonTI61suwZJeyOTudzHqbgw5rAmqmJeV5R8gFSiGmzINlqoWZX6
TWDVz4Y1ma8QwA==
=/DKJ
-----END PGP SIGNATURE-----
Merge tag 'irq-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
"A set of updates for the interrupt subsystem:
- Make multiqueue devices which use the managed interrupt affinity
infrastructure work on PowerPC/Pseries. PowerPC does not use the
generic infrastructure for setting up PCI/MSI interrupts and the
multiqueue changes failed to update the legacy PCI/MSI
infrastructure. Make this work by passing the affinity setup
information down to the mapping and allocation functions.
- Move Jason Cooper from MAINTAINERS to CREDITS as his mail is
bouncing and he's not reachable. We hope all is well with him and
say thanks for his work over the years"
* tag 'irq-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
powerpc/pseries: Pass MSI affinity to irq_create_mapping()
genirq/irqdomain: Add an irq_create_mapping_affinity() function
MAINTAINERS: Move Jason Cooper to CREDITS
Three commits fixing possible missed TLB invalidations for multi-threaded
processes when CPUs are hotplugged in and out.
A fix for a host crash triggerable by host userspace (qemu) in KVM on Power9.
A fix for a host crash in machine check handling when running HPT guests on a
HPT host.
One commit fixing potential missed TLB invalidations when using the hash MMU on
Power9 or later.
A regression fix for machines with CPUs on node 0 but no memory.
Thanks to:
Aneesh Kumar K.V, Cédric Le Goater, Greg Kurz, Milan Mohanty, Milton Miller,
Nicholas Piggin, Paul Mackerras, Srikar Dronamraju.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl/LcwsTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgOeiD/wKGX8eE7AJ5ZxoFLwpGEJhp9QgMDhe
nP82CkKobwMM3UCbde9MC8PqYGC7/7PhRPM0GI03uh6EfeHUtle7AZlBAlZoGaeJ
MwdQBQrZSqf1QJOyhUEa6CI0XTfCEOrsw+AkZQKdsv9JLcFBz7IyfP61gf7MHfyo
QKlfYYilXHbJ7M9oiM9gKUdtrpPfMGH0YnIp0FR+JowJAWUfFY626H9j7chNwWK+
7nrphtLHwsBVNtIoKWvPocuLKPsziOqXWnOP/do/RuCoKXMbGjtOJHhUgEYC5PM7
eQug43YDaws4K1fxaHvQto/u92nL2GFY6FfKNeJ5FcQYgCIvi/T8jzEsJyqGbpVz
YihZj1MbhhGr/neVtJW4SbdCTCU7R7X9QBy4He6XoWHR0fNoQDQvjNT/ziiuHiN0
tU+Y9aoHwI/0Pb44ceiQ/T10nxYtk+6Cj5Cm9Ll7MvfjUsE/BpxlYdi+KMqRSGOb
itOwFLQpgy28feMRKGZNKFURwTophASFaKO88yhjeSnlcGqxvicSIUpz8UD1jxwt
o/tsger09ZXqBYVdVKLpqbKsifVbzUfJmmycvuDF37B+VjwHACP+VZltwdOqnX13
BM9ndcDW2p6UnNLfs47FWJM+czmShrgwqI/W7qcCFleYL3r5XOS8hJHfgvJEcE04
n7A9cNvK5q6nvg==
=tIAZ
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.10-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"Some more powerpc fixes for 5.10:
- Three commits fixing possible missed TLB invalidations for
multi-threaded processes when CPUs are hotplugged in and out.
- A fix for a host crash triggerable by host userspace (qemu) in KVM
on Power9.
- A fix for a host crash in machine check handling when running HPT
guests on a HPT host.
- One commit fixing potential missed TLB invalidations when using the
hash MMU on Power9 or later.
- A regression fix for machines with CPUs on node 0 but no memory.
Thanks to Aneesh Kumar K.V, Cédric Le Goater, Greg Kurz, Milan
Mohanty, Milton Miller, Nicholas Piggin, Paul Mackerras, and Srikar
Dronamraju"
* tag 'powerpc-5.10-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s/powernv: Fix memory corruption when saving SLB entries on MCE
KVM: PPC: Book3S HV: XIVE: Fix vCPU id sanity check
powerpc/numa: Fix a regression on memoryless node 0
powerpc/64s: Trim offlined CPUs from mm_cpumasks
kernel/cpu: add arch override for clear_tasks_mm_cpumask() mm handling
powerpc/64s/pseries: Fix hash tlbiel_all_isa300 for guest kernels
powerpc/64s: Fix hash ISA v3.0 TLBIEL instruction generation
When the instances were able to use their own options, the userstacktrace
option was left hardcoded for the top level. This made the instance
userstacktrace option bascially into a nop, and will confuse users that set
it, but nothing happens (I was confused when it happened to me!)
Cc: stable@vger.kernel.org
Fixes: 16270145ce ("tracing: Add trace options for core options to instances")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
We have debug infrastructure built on top of preempt/irq disable/enable
events. This requires modifications to the kernel tracing code. Since
this is not feasible with GKI, we started with registering to the
existing preemptirq trace events. However the performance of wide
variety of use cases are regressed as the rate of preemptirq events
is super high and generic trace events are slow.
Since GKI allows optimized trace events via restricted trace hooks,
add the same for preemptirq event.
Bug: 174541725
Change-Id: Ic8d3cdd1c1aa6a9267d0b755694fedffa2ea8e36
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>