-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmEKcCUACgkQONu9yGCS
aT7sMw/7BNJDmX9w+p1lgTIJJzSuz8C/eNgbeZgK7CE4DovO+WL9oEm53vqYcDDo
j5REnrRhxcBYxwG/GXl1Oniv1wHqw0SplV+5G2NH1RMy23eSFGCw+8G+YOEJnU3P
94hJuEs/43Py7eZV/VtyO2UMdDRnGI6MlNvu18YjnRJcdqIIl2gln1G8wbyySYVb
wR1rudvtiEdrmTQr7qGxeIrZNKGwFl0KxEl8j9X/aqxvfe8PRVYKlmtwblf5rybe
TElQxz2XGRgk8g2yWQmnNoU6rfFHdZ4lTnCwfpFA1XE6/HBA64/1p22QTJUZvyOU
pbQc1MRaoUncGV9UFAMY1j38JFsVar7YHHOcpp9YIJOjoyiAw4aatGDcntdWDCiG
X1mCSLs10/xGRPaJJXulp786MH4aTR5qIeoNg8mu3Z3In4ElbBW5xr0wa3N8gs3O
lEnK/gT2MHiQ1boa+Qy3W+XZmOjWtL69JgbOyRcOYS6lkHL4DFlGL2Nn5u8qGfL4
hzohJzH36W5SUHDQiYTt1wLNu4iHpAECjxcnk9fCvlcHA5Yu1bqgyQ62i3C9RA6a
/aO0B0yraHmvCAboemDsESwylxmpiRB3caqKtzlaZjoiOfPydcBwJM46ZfbzLNPh
l+/YKK2tLOXWyRIhEv8183tVeu7mZ02xjsetPtLltZPJqR+SJKE=
=8nLw
-----END PGP SIGNATURE-----
Merge 5.10.56 into android12-5.10-lts
Changes in 5.10.56
selftest: fix build error in tools/testing/selftests/vm/userfaultfd.c
io_uring: fix null-ptr-deref in io_sq_offload_start()
x86/asm: Ensure asm/proto.h can be included stand-alone
pipe: make pipe writes always wake up readers
btrfs: fix rw device counting in __btrfs_free_extra_devids
btrfs: mark compressed range uptodate only if all bio succeed
Revert "ACPI: resources: Add checks for ACPI IRQ override"
ACPI: DPTF: Fix reading of attributes
x86/kvm: fix vcpu-id indexed array sizes
KVM: add missing compat KVM_CLEAR_DIRTY_LOG
ocfs2: fix zero out valid data
ocfs2: issue zeroout to EOF blocks
can: j1939: j1939_xtp_rx_dat_one(): fix rxtimer value between consecutive TP.DT to 750ms
can: raw: raw_setsockopt(): fix raw_rcv panic for sock UAF
can: peak_usb: pcan_usb_handle_bus_evt(): fix reading rxerr/txerr values
can: mcba_usb_start(): add missing urb->transfer_dma initialization
can: usb_8dev: fix memory leak
can: ems_usb: fix memory leak
can: esd_usb2: fix memory leak
alpha: register early reserved memory in memblock
HID: wacom: Re-enable touch by default for Cintiq 24HDT / 27QHDT
NIU: fix incorrect error return, missed in previous revert
drm/amd/display: ensure dentist display clock update finished in DCN20
drm/amdgpu: Avoid printing of stack contents on firmware load error
drm/amdgpu: Fix resource leak on probe error path
blk-iocost: fix operation ordering in iocg_wake_fn()
nfc: nfcsim: fix use after free during module unload
cfg80211: Fix possible memory leak in function cfg80211_bss_update
RDMA/bnxt_re: Fix stats counters
bpf: Fix OOB read when printing XDP link fdinfo
mac80211: fix enabling 4-address mode on a sta vif after assoc
netfilter: conntrack: adjust stop timestamp to real expiry value
netfilter: nft_nat: allow to specify layer 4 protocol NAT only
i40e: Fix logic of disabling queues
i40e: Fix firmware LLDP agent related warning
i40e: Fix queue-to-TC mapping on Tx
i40e: Fix log TC creation failure when max num of queues is exceeded
tipc: fix implicit-connect for SYN+
tipc: fix sleeping in tipc accept routine
net: Set true network header for ECN decapsulation
net: qrtr: fix memory leaks
ionic: remove intr coalesce update from napi
ionic: fix up dim accounting for tx and rx
ionic: count csum_none when offload enabled
tipc: do not write skb_shinfo frags when doing decrytion
octeontx2-pf: Fix interface down flag on error
mlx4: Fix missing error code in mlx4_load_one()
KVM: x86: Check the right feature bit for MSR_KVM_ASYNC_PF_ACK access
net: llc: fix skb_over_panic
drm/msm/dpu: Fix sm8250_mdp register length
drm/msm/dp: Initialize the INTF_CONFIG register
skmsg: Make sk_psock_destroy() static
net/mlx5: Fix flow table chaining
net/mlx5e: Fix nullptr in mlx5e_hairpin_get_mdev()
sctp: fix return value check in __sctp_rcv_asconf_lookup
tulip: windbond-840: Fix missing pci_disable_device() in probe and remove
sis900: Fix missing pci_disable_device() in probe and remove
can: hi311x: fix a signedness bug in hi3110_cmd()
bpf: Introduce BPF nospec instruction for mitigating Spectre v4
bpf: Fix leakage due to insufficient speculative store bypass mitigation
bpf: Remove superfluous aux sanitation on subprog rejection
bpf: verifier: Allocate idmap scratch in verifier env
bpf: Fix pointer arithmetic mask tightening under state pruning
SMB3: fix readpage for large swap cache
powerpc/pseries: Fix regression while building external modules
Revert "perf map: Fix dso->nsinfo refcounting"
i40e: Add additional info to PHY type error
can: j1939: j1939_session_deactivate(): clarify lifetime of session object
Linux 5.10.56
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ib3c9244afb7ee5d6ee8d3235efe8956898f486c4
commit e042aa532c upstream.
In 7fedb63a83 ("bpf: Tighten speculative pointer arithmetic mask") we
narrowed the offset mask for unprivileged pointer arithmetic in order to
mitigate a corner case where in the speculative domain it is possible to
advance, for example, the map value pointer by up to value_size-1 out-of-
bounds in order to leak kernel memory via side-channel to user space.
The verifier's state pruning for scalars leaves one corner case open
where in the first verification path R_x holds an unknown scalar with an
aux->alu_limit of e.g. 7, and in a second verification path that same
register R_x, here denoted as R_x', holds an unknown scalar which has
tighter bounds and would thus satisfy range_within(R_x, R_x') as well as
tnum_in(R_x, R_x') for state pruning, yielding an aux->alu_limit of 3:
Given the second path fits the register constraints for pruning, the final
generated mask from aux->alu_limit will remain at 7. While technically
not wrong for the non-speculative domain, it would however be possible
to craft similar cases where the mask would be too wide as in 7fedb63a83.
One way to fix it is to detect the presence of unknown scalar map pointer
arithmetic and force a deeper search on unknown scalars to ensure that
we do not run into a masking mismatch.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c9e73e3d2b upstream.
func_states_equal makes a very short lived allocation for idmap,
probably because it's too large to fit on the stack. However the
function is called quite often, leading to a lot of alloc / free
churn. Replace the temporary allocation with dedicated scratch
space in struct bpf_verifier_env.
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://lore.kernel.org/bpf/20210429134656.122225-4-lmb@cloudflare.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 59089a189e upstream.
Follow-up to fe9a5ca7e3 ("bpf: Do not mark insn as seen under speculative
path verification"). The sanitize_insn_aux_data() helper does not serve a
particular purpose in today's code. The original intention for the helper
was that if function-by-function verification fails, a given program would
be cleared from temporary insn_aux_data[], and then its verification would
be re-attempted in the context of the main program a second time.
However, a failure in do_check_subprogs() will skip do_check_main() and
propagate the error to the user instead, thus such situation can never occur.
Given its interaction is not compatible to the Spectre v1 mitigation (due to
comparing aux->seen with env->pass_cnt), just remove sanitize_insn_aux_data()
to avoid future bugs in this area.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 2039f26f3a ]
Spectre v4 gadgets make use of memory disambiguation, which is a set of
techniques that execute memory access instructions, that is, loads and
stores, out of program order; Intel's optimization manual, section 2.4.4.5:
A load instruction micro-op may depend on a preceding store. Many
microarchitectures block loads until all preceding store addresses are
known. The memory disambiguator predicts which loads will not depend on
any previous stores. When the disambiguator predicts that a load does
not have such a dependency, the load takes its data from the L1 data
cache. Eventually, the prediction is verified. If an actual conflict is
detected, the load and all succeeding instructions are re-executed.
af86ca4e30 ("bpf: Prevent memory disambiguation attack") tried to mitigate
this attack by sanitizing the memory locations through preemptive "fast"
(low latency) stores of zero prior to the actual "slow" (high latency) store
of a pointer value such that upon dependency misprediction the CPU then
speculatively executes the load of the pointer value and retrieves the zero
value instead of the attacker controlled scalar value previously stored at
that location, meaning, subsequent access in the speculative domain is then
redirected to the "zero page".
The sanitized preemptive store of zero prior to the actual "slow" store is
done through a simple ST instruction based on r10 (frame pointer) with
relative offset to the stack location that the verifier has been tracking
on the original used register for STX, which does not have to be r10. Thus,
there are no memory dependencies for this store, since it's only using r10
and immediate constant of zero; hence af86ca4e30 /assumed/ a low latency
operation.
However, a recent attack demonstrated that this mitigation is not sufficient
since the preemptive store of zero could also be turned into a "slow" store
and is thus bypassed as well:
[...]
// r2 = oob address (e.g. scalar)
// r7 = pointer to map value
31: (7b) *(u64 *)(r10 -16) = r2
// r9 will remain "fast" register, r10 will become "slow" register below
32: (bf) r9 = r10
// JIT maps BPF reg to x86 reg:
// r9 -> r15 (callee saved)
// r10 -> rbp
// train store forward prediction to break dependency link between both r9
// and r10 by evicting them from the predictor's LRU table.
33: (61) r0 = *(u32 *)(r7 +24576)
34: (63) *(u32 *)(r7 +29696) = r0
35: (61) r0 = *(u32 *)(r7 +24580)
36: (63) *(u32 *)(r7 +29700) = r0
37: (61) r0 = *(u32 *)(r7 +24584)
38: (63) *(u32 *)(r7 +29704) = r0
39: (61) r0 = *(u32 *)(r7 +24588)
40: (63) *(u32 *)(r7 +29708) = r0
[...]
543: (61) r0 = *(u32 *)(r7 +25596)
544: (63) *(u32 *)(r7 +30716) = r0
// prepare call to bpf_ringbuf_output() helper. the latter will cause rbp
// to spill to stack memory while r13/r14/r15 (all callee saved regs) remain
// in hardware registers. rbp becomes slow due to push/pop latency. below is
// disasm of bpf_ringbuf_output() helper for better visual context:
//
// ffffffff8117ee20: 41 54 push r12
// ffffffff8117ee22: 55 push rbp
// ffffffff8117ee23: 53 push rbx
// ffffffff8117ee24: 48 f7 c1 fc ff ff ff test rcx,0xfffffffffffffffc
// ffffffff8117ee2b: 0f 85 af 00 00 00 jne ffffffff8117eee0 <-- jump taken
// [...]
// ffffffff8117eee0: 49 c7 c4 ea ff ff ff mov r12,0xffffffffffffffea
// ffffffff8117eee7: 5b pop rbx
// ffffffff8117eee8: 5d pop rbp
// ffffffff8117eee9: 4c 89 e0 mov rax,r12
// ffffffff8117eeec: 41 5c pop r12
// ffffffff8117eeee: c3 ret
545: (18) r1 = map[id:4]
547: (bf) r2 = r7
548: (b7) r3 = 0
549: (b7) r4 = 4
550: (85) call bpf_ringbuf_output#194288
// instruction 551 inserted by verifier \
551: (7a) *(u64 *)(r10 -16) = 0 | /both/ are now slow stores here
// storing map value pointer r7 at fp-16 | since value of r10 is "slow".
552: (7b) *(u64 *)(r10 -16) = r7 /
// following "fast" read to the same memory location, but due to dependency
// misprediction it will speculatively execute before insn 551/552 completes.
553: (79) r2 = *(u64 *)(r9 -16)
// in speculative domain contains attacker controlled r2. in non-speculative
// domain this contains r7, and thus accesses r7 +0 below.
554: (71) r3 = *(u8 *)(r2 +0)
// leak r3
As can be seen, the current speculative store bypass mitigation which the
verifier inserts at line 551 is insufficient since /both/, the write of
the zero sanitation as well as the map value pointer are a high latency
instruction due to prior memory access via push/pop of r10 (rbp) in contrast
to the low latency read in line 553 as r9 (r15) which stays in hardware
registers. Thus, architecturally, fp-16 is r7, however, microarchitecturally,
fp-16 can still be r2.
Initial thoughts to address this issue was to track spilled pointer loads
from stack and enforce their load via LDX through r10 as well so that /both/
the preemptive store of zero /as well as/ the load use the /same/ register
such that a dependency is created between the store and load. However, this
option is not sufficient either since it can be bypassed as well under
speculation. An updated attack with pointer spill/fills now _all_ based on
r10 would look as follows:
[...]
// r2 = oob address (e.g. scalar)
// r7 = pointer to map value
[...]
// longer store forward prediction training sequence than before.
2062: (61) r0 = *(u32 *)(r7 +25588)
2063: (63) *(u32 *)(r7 +30708) = r0
2064: (61) r0 = *(u32 *)(r7 +25592)
2065: (63) *(u32 *)(r7 +30712) = r0
2066: (61) r0 = *(u32 *)(r7 +25596)
2067: (63) *(u32 *)(r7 +30716) = r0
// store the speculative load address (scalar) this time after the store
// forward prediction training.
2068: (7b) *(u64 *)(r10 -16) = r2
// preoccupy the CPU store port by running sequence of dummy stores.
2069: (63) *(u32 *)(r7 +29696) = r0
2070: (63) *(u32 *)(r7 +29700) = r0
2071: (63) *(u32 *)(r7 +29704) = r0
2072: (63) *(u32 *)(r7 +29708) = r0
2073: (63) *(u32 *)(r7 +29712) = r0
2074: (63) *(u32 *)(r7 +29716) = r0
2075: (63) *(u32 *)(r7 +29720) = r0
2076: (63) *(u32 *)(r7 +29724) = r0
2077: (63) *(u32 *)(r7 +29728) = r0
2078: (63) *(u32 *)(r7 +29732) = r0
2079: (63) *(u32 *)(r7 +29736) = r0
2080: (63) *(u32 *)(r7 +29740) = r0
2081: (63) *(u32 *)(r7 +29744) = r0
2082: (63) *(u32 *)(r7 +29748) = r0
2083: (63) *(u32 *)(r7 +29752) = r0
2084: (63) *(u32 *)(r7 +29756) = r0
2085: (63) *(u32 *)(r7 +29760) = r0
2086: (63) *(u32 *)(r7 +29764) = r0
2087: (63) *(u32 *)(r7 +29768) = r0
2088: (63) *(u32 *)(r7 +29772) = r0
2089: (63) *(u32 *)(r7 +29776) = r0
2090: (63) *(u32 *)(r7 +29780) = r0
2091: (63) *(u32 *)(r7 +29784) = r0
2092: (63) *(u32 *)(r7 +29788) = r0
2093: (63) *(u32 *)(r7 +29792) = r0
2094: (63) *(u32 *)(r7 +29796) = r0
2095: (63) *(u32 *)(r7 +29800) = r0
2096: (63) *(u32 *)(r7 +29804) = r0
2097: (63) *(u32 *)(r7 +29808) = r0
2098: (63) *(u32 *)(r7 +29812) = r0
// overwrite scalar with dummy pointer; same as before, also including the
// sanitation store with 0 from the current mitigation by the verifier.
2099: (7a) *(u64 *)(r10 -16) = 0 | /both/ are now slow stores here
2100: (7b) *(u64 *)(r10 -16) = r7 | since store unit is still busy.
// load from stack intended to bypass stores.
2101: (79) r2 = *(u64 *)(r10 -16)
2102: (71) r3 = *(u8 *)(r2 +0)
// leak r3
[...]
Looking at the CPU microarchitecture, the scheduler might issue loads (such
as seen in line 2101) before stores (line 2099,2100) because the load execution
units become available while the store execution unit is still busy with the
sequence of dummy stores (line 2069-2098). And so the load may use the prior
stored scalar from r2 at address r10 -16 for speculation. The updated attack
may work less reliable on CPU microarchitectures where loads and stores share
execution resources.
This concludes that the sanitizing with zero stores from af86ca4e30 ("bpf:
Prevent memory disambiguation attack") is insufficient. Moreover, the detection
of stack reuse from af86ca4e30 where previously data (STACK_MISC) has been
written to a given stack slot where a pointer value is now to be stored does
not have sufficient coverage as precondition for the mitigation either; for
several reasons outlined as follows:
1) Stack content from prior program runs could still be preserved and is
therefore not "random", best example is to split a speculative store
bypass attack between tail calls, program A would prepare and store the
oob address at a given stack slot and then tail call into program B which
does the "slow" store of a pointer to the stack with subsequent "fast"
read. From program B PoV such stack slot type is STACK_INVALID, and
therefore also must be subject to mitigation.
2) The STACK_SPILL must not be coupled to register_is_const(&stack->spilled_ptr)
condition, for example, the previous content of that memory location could
also be a pointer to map or map value. Without the fix, a speculative
store bypass is not mitigated in such precondition and can then lead to
a type confusion in the speculative domain leaking kernel memory near
these pointer types.
While brainstorming on various alternative mitigation possibilities, we also
stumbled upon a retrospective from Chrome developers [0]:
[...] For variant 4, we implemented a mitigation to zero the unused memory
of the heap prior to allocation, which cost about 1% when done concurrently
and 4% for scavenging. Variant 4 defeats everything we could think of. We
explored more mitigations for variant 4 but the threat proved to be more
pervasive and dangerous than we anticipated. For example, stack slots used
by the register allocator in the optimizing compiler could be subject to
type confusion, leading to pointer crafting. Mitigating type confusion for
stack slots alone would have required a complete redesign of the backend of
the optimizing compiler, perhaps man years of work, without a guarantee of
completeness. [...]
From BPF side, the problem space is reduced, however, options are rather
limited. One idea that has been explored was to xor-obfuscate pointer spills
to the BPF stack:
[...]
// preoccupy the CPU store port by running sequence of dummy stores.
[...]
2106: (63) *(u32 *)(r7 +29796) = r0
2107: (63) *(u32 *)(r7 +29800) = r0
2108: (63) *(u32 *)(r7 +29804) = r0
2109: (63) *(u32 *)(r7 +29808) = r0
2110: (63) *(u32 *)(r7 +29812) = r0
// overwrite scalar with dummy pointer; xored with random 'secret' value
// of 943576462 before store ...
2111: (b4) w11 = 943576462
2112: (af) r11 ^= r7
2113: (7b) *(u64 *)(r10 -16) = r11
2114: (79) r11 = *(u64 *)(r10 -16)
2115: (b4) w2 = 943576462
2116: (af) r2 ^= r11
// ... and restored with the same 'secret' value with the help of AX reg.
2117: (71) r3 = *(u8 *)(r2 +0)
[...]
While the above would not prevent speculation, it would make data leakage
infeasible by directing it to random locations. In order to be effective
and prevent type confusion under speculation, such random secret would have
to be regenerated for each store. The additional complexity involved for a
tracking mechanism that prevents jumps such that restoring spilled pointers
would not get corrupted is not worth the gain for unprivileged. Hence, the
fix in here eventually opted for emitting a non-public BPF_ST | BPF_NOSPEC
instruction which the x86 JIT translates into a lfence opcode. Inserting the
latter in between the store and load instruction is one of the mitigations
options [1]. The x86 instruction manual notes:
[...] An LFENCE that follows an instruction that stores to memory might
complete before the data being stored have become globally visible. [...]
The latter meaning that the preceding store instruction finished execution
and the store is at minimum guaranteed to be in the CPU's store queue, but
it's not guaranteed to be in that CPU's L1 cache at that point (globally
visible). The latter would only be guaranteed via sfence. So the load which
is guaranteed to execute after the lfence for that local CPU would have to
rely on store-to-load forwarding. [2], in section 2.3 on store buffers says:
[...] For every store operation that is added to the ROB, an entry is
allocated in the store buffer. This entry requires both the virtual and
physical address of the target. Only if there is no free entry in the store
buffer, the frontend stalls until there is an empty slot available in the
store buffer again. Otherwise, the CPU can immediately continue adding
subsequent instructions to the ROB and execute them out of order. On Intel
CPUs, the store buffer has up to 56 entries. [...]
One small upside on the fix is that it lifts constraints from af86ca4e30
where the sanitize_stack_off relative to r10 must be the same when coming
from different paths. The BPF_ST | BPF_NOSPEC gets emitted after a BPF_STX
or BPF_ST instruction. This happens either when we store a pointer or data
value to the BPF stack for the first time, or upon later pointer spills.
The former needs to be enforced since otherwise stale stack data could be
leaked under speculation as outlined earlier. For non-x86 JITs the BPF_ST |
BPF_NOSPEC mapping is currently optimized away, but others could emit a
speculation barrier as well if necessary. For real-world unprivileged
programs e.g. generated by LLVM, pointer spill/fill is only generated upon
register pressure and LLVM only tries to do that for pointers which are not
used often. The program main impact will be the initial BPF_ST | BPF_NOSPEC
sanitation for the STACK_INVALID case when the first write to a stack slot
occurs e.g. upon map lookup. In future we might refine ways to mitigate
the latter cost.
[0] https://arxiv.org/pdf/1902.05178.pdf
[1] https://msrc-blog.microsoft.com/2018/05/21/analysis-and-mitigation-of-speculative-store-bypass-cve-2018-3639/
[2] https://arxiv.org/pdf/1905.05725.pdf
Fixes: af86ca4e30 ("bpf: Prevent memory disambiguation attack")
Fixes: f7cf25b202 ("bpf: track spill/fill of constants")
Co-developed-by: Piotr Krysiuk <piotras@gmail.com>
Co-developed-by: Benedict Schlueter <benedict.schlueter@rub.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Piotr Krysiuk <piotras@gmail.com>
Signed-off-by: Benedict Schlueter <benedict.schlueter@rub.de>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f5e81d1117 ]
In case of JITs, each of the JIT backends compiles the BPF nospec instruction
/either/ to a machine instruction which emits a speculation barrier /or/ to
/no/ machine instruction in case the underlying architecture is not affected
by Speculative Store Bypass or has different mitigations in place already.
This covers both x86 and (implicitly) arm64: In case of x86, we use 'lfence'
instruction for mitigation. In case of arm64, we rely on the firmware mitigation
as controlled via the ssbd kernel parameter. Whenever the mitigation is enabled,
it works for all of the kernel code with no need to provide any additional
instructions here (hence only comment in arm64 JIT). Other archs can follow
as needed. The BPF nospec instruction is specifically targeting Spectre v4
since i) we don't use a serialization barrier for the Spectre v1 case, and
ii) mitigation instructions for v1 and v4 might be different on some archs.
The BPF nospec is required for a future commit, where the BPF verifier does
annotate intermediate BPF programs with speculation barriers.
Co-developed-by: Piotr Krysiuk <piotras@gmail.com>
Co-developed-by: Benedict Schlueter <benedict.schlueter@rub.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Piotr Krysiuk <piotras@gmail.com>
Signed-off-by: Benedict Schlueter <benedict.schlueter@rub.de>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmEE6r8ACgkQONu9yGCS
aT54gw/7B1RbIpcTJlhtiQVYvwMJTX/ZgoxkSsffz6G7TXaBbg/x29cuc3cI9IUJ
fr99zp8gQbQMibXqCaIPBloetSdquX9G+gOmzVSIS+ac2Wx/pH29qRz1fslWq8Mj
hOg8t2Ce+LoAgGZzSXswhvIL1Zep8HZDlRKKcolmiGEJFBr/4nPfEVOeFNpiTUYF
cb5RtB1nRR0DJa8wWr18S7v2/TBdR9V7TgGuwRnpVviIQjjfE/S+lmbgtzHgm5Tf
cBtP7Ta/eIGeLQIyQEoEWV2ViJRmSGk4s2dOcQ/kUMje5de+J03Hkp4jeJbB2ndx
UUos2Xm4D1j4jsIBZqJKV+Fzf9RzdNXQp4VJIfzvbDbZn8y9xHRcXm/ChNBz/B5L
8oKcdM9Sf8DxiztlZ4c40IndYjwp0QxlubCYDfwYQfYB/6LLrffrgPZu/TgWw9lj
lXOBUZq7yHuo9OSAIfGCCtKKnzA63wLv3zTmNmSO9fDoXUl0VHBrZC6apgB6AC9T
rzvOOgRC6NfWViZl2E3Op7XfOg2jstpRXUr7a0HSaQp4QfmlLJ9HnDywQfNQeX4+
HyAxhXo8gHPDgh8DK7yRObBhu24yQcy+ukesQwkSS2m0j2ilfQszk0uP1bfFOuq2
G5Z8VMQPsIMsSijEyuonLRVUPfp8herhWMP0YsZZeEnxZvaSEaM=
=li6t
-----END PGP SIGNATURE-----
Merge 5.10.55 into android12-5.10-lts
Changes in 5.10.55
tools: Allow proper CC/CXX/... override with LLVM=1 in Makefile.include
io_uring: fix link timeout refs
KVM: x86: determine if an exception has an error code only when injecting it.
af_unix: fix garbage collect vs MSG_PEEK
workqueue: fix UAF in pwq_unbound_release_workfn()
cgroup1: fix leaked context root causing sporadic NULL deref in LTP
net/802/mrp: fix memleak in mrp_request_join()
net/802/garp: fix memleak in garp_request_join()
net: annotate data race around sk_ll_usec
sctp: move 198 addresses from unusable to private scope
rcu-tasks: Don't delete holdouts within trc_inspect_reader()
rcu-tasks: Don't delete holdouts within trc_wait_for_one_reader()
ipv6: allocate enough headroom in ip6_finish_output2()
drm/ttm: add a check against null pointer dereference
hfs: add missing clean-up in hfs_fill_super
hfs: fix high memory mapping in hfs_bnode_read
hfs: add lock nesting notation to hfs_find_init
firmware: arm_scmi: Fix possible scmi_linux_errmap buffer overflow
firmware: arm_scmi: Fix range check for the maximum number of pending messages
cifs: fix the out of range assignment to bit fields in parse_server_interfaces
iomap: remove the length variable in iomap_seek_data
iomap: remove the length variable in iomap_seek_hole
ARM: dts: versatile: Fix up interrupt controller node names
ipv6: ip6_finish_output2: set sk into newly allocated nskb
Linux 5.10.55
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I2d673bdde784b3689af73289305091dbd4ead042
[ Upstream commit a9ab9cce93 ]
Invoking trc_del_holdout() from within trc_wait_for_one_reader() is
only a performance optimization because the RCU Tasks Trace grace-period
kthread will eventually do this within check_all_holdout_tasks_trace().
But it is not a particularly important performance optimization because
it only applies to the grace-period kthread, of which there is but one.
This commit therefore removes this invocation of trc_del_holdout() in
favor of the one in check_all_holdout_tasks_trace() in the grace-period
kthread.
Reported-by: "Xu, Yanfei" <yanfei.xu@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1d10bf55d8 ]
As Yanfei pointed out, although invoking trc_del_holdout() is safe
from the viewpoint of the integrity of the holdout list itself,
the put_task_struct() invoked by trc_del_holdout() can result in
use-after-free errors due to later accesses to this task_struct structure
by the RCU Tasks Trace grace-period kthread.
This commit therefore removes this call to trc_del_holdout() from
trc_inspect_reader() in favor of the grace-period thread's existing call
to trc_del_holdout(), thus eliminating that particular class of
use-after-free errors.
Reported-by: "Xu, Yanfei" <yanfei.xu@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 1e7107c5ef upstream.
Richard reported sporadic (roughly one in 10 or so) null dereferences and
other strange behaviour for a set of automated LTP tests. Things like:
BUG: kernel NULL pointer dereference, address: 0000000000000008
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 0 PID: 1516 Comm: umount Not tainted 5.10.0-yocto-standard #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-48-gd9c812dda519-prebuilt.qemu.org 04/01/2014
RIP: 0010:kernfs_sop_show_path+0x1b/0x60
...or these others:
RIP: 0010:do_mkdirat+0x6a/0xf0
RIP: 0010:d_alloc_parallel+0x98/0x510
RIP: 0010:do_readlinkat+0x86/0x120
There were other less common instances of some kind of a general scribble
but the common theme was mount and cgroup and a dubious dentry triggering
the NULL dereference. I was only able to reproduce it under qemu by
replicating Richard's setup as closely as possible - I never did get it
to happen on bare metal, even while keeping everything else the same.
In commit 71d883c37e ("cgroup_do_mount(): massage calling conventions")
we see this as a part of the overall change:
--------------
struct cgroup_subsys *ss;
- struct dentry *dentry;
[...]
- dentry = cgroup_do_mount(&cgroup_fs_type, fc->sb_flags, root,
- CGROUP_SUPER_MAGIC, ns);
[...]
- if (percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
- struct super_block *sb = dentry->d_sb;
- dput(dentry);
+ ret = cgroup_do_mount(fc, CGROUP_SUPER_MAGIC, ns);
+ if (!ret && percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
+ struct super_block *sb = fc->root->d_sb;
+ dput(fc->root);
deactivate_locked_super(sb);
msleep(10);
return restart_syscall();
}
--------------
In changing from the local "*dentry" variable to using fc->root, we now
export/leave that dentry pointer in the file context after doing the dput()
in the unlikely "is_dying" case. With LTP doing a crazy amount of back to
back mount/unmount [testcases/bin/cgroup_regression_5_1.sh] the unlikely
becomes slightly likely and then bad things happen.
A fix would be to not leave the stale reference in fc->root as follows:
--------------
dput(fc->root);
+ fc->root = NULL;
deactivate_locked_super(sb);
--------------
...but then we are just open-coding a duplicate of fc_drop_locked() so we
simply use that instead.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: stable@vger.kernel.org # v5.1+
Reported-by: Richard Purdie <richard.purdie@linuxfoundation.org>
Fixes: 71d883c37e ("cgroup_do_mount(): massage calling conventions")
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Sync up with android12-5.10 for the following commits:
2a8bfea53d UPSTREAM: kernel/irq: export irq_gc_set_wake
36fbb55631 FROMGIT: procfs: prevent unpriveleged processes accessing fdinfo dir
fc2d64ec5d FROMGIT: f2fs: don't sleep while grabing nat_tree_lock
cf1646cba3 FROMLIST: scsi: ufs: Allow async suspend/resume callbacks
98afdd197f ANDROID: ABI: update generic symbol list and ABI XML
e2e063f507 ANDROID: scsi: ufs: add vendor hook to override key reprogramming
198e728044 ANDROID: GKI: Add rockchip symbol list
Change-Id: Ib1c25472ab1be58611ee2fa61047ceef48f00e7e
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmEBT00ACgkQONu9yGCS
aT7svA/9HCRwW+pK3UpK1+0FK7gGH8DA3jSONj775zVEKhboDZNIwZsDqG0Ly+jm
/JejWXPKZlekaDXgzBfZY3H59xgij/VwYHe8p7cdxfi1TlhmAQwFjLNZnWav8as6
IyNkpsDJn8fMXmfHDi2u3cb8wrVi/aQDzTlwu88cUtREyZCaaYlo0Fdv9MJyhww/
p6LWPYQoZ8TmFY+Y/2ORVxFos2UVuU0hhhMdGt9LrX2WNEGRNUZUqmbhcXYfdsX0
ckSHbijIcWdcka3nQ6yOvdxw75rTqd8c/bP0y+yAteeJ0CykjVnI2cdK+M2ZEi4j
/JqpGJrRWhsZf5MiO8b3k+I1K62JDa1GYBQ9Amp8FKKzjYLPTNeFAP9IsMyDc4oi
oW98XM7XzoSEU9t/FSAIGT0hYK9k+lnPxw623LhxD6x3VPynnNAnQsLr+HirOgG6
mZ79L4ZFu3lUvVsCuCgKn/uxwDopUNlhqo5B2/4M2kSWwe2Xu5bExpGc2bT9xCOP
6fF9DmvmpG1UPGCXrOqaxemyEPmHqmyjKJpxDt6vZqlOL9vqHez4WmEEM1C+E2NZ
5VKKbBk/KZDxNX9EiFOtI2HRFb1cghoI2Hcb/QjRoB9Dv3a6cHgjxDl0eKm8SiDN
+1ytV0IFH3fT4aRiXJ7I3GBwkjKcDaX0sjYwtnCx9s5XZmm9PRQ=
=HAyL
-----END PGP SIGNATURE-----
Merge 5.10.54 into android12-5.10-lts
Changes in 5.10.54
igc: Fix use-after-free error during reset
igb: Fix use-after-free error during reset
igc: change default return of igc_read_phy_reg()
ixgbe: Fix an error handling path in 'ixgbe_probe()'
igc: Fix an error handling path in 'igc_probe()'
igb: Fix an error handling path in 'igb_probe()'
fm10k: Fix an error handling path in 'fm10k_probe()'
e1000e: Fix an error handling path in 'e1000_probe()'
iavf: Fix an error handling path in 'iavf_probe()'
igb: Check if num of q_vectors is smaller than max before array access
igb: Fix position of assignment to *ring
gve: Fix an error handling path in 'gve_probe()'
net: add kcov handle to skb extensions
bonding: fix suspicious RCU usage in bond_ipsec_add_sa()
bonding: fix null dereference in bond_ipsec_add_sa()
ixgbevf: use xso.real_dev instead of xso.dev in callback functions of struct xfrmdev_ops
bonding: fix suspicious RCU usage in bond_ipsec_del_sa()
bonding: disallow setting nested bonding + ipsec offload
bonding: Add struct bond_ipesc to manage SA
bonding: fix suspicious RCU usage in bond_ipsec_offload_ok()
bonding: fix incorrect return value of bond_ipsec_offload_ok()
ipv6: fix 'disable_policy' for fwd packets
stmmac: platform: Fix signedness bug in stmmac_probe_config_dt()
selftests: icmp_redirect: remove from checking for IPv6 route get
selftests: icmp_redirect: IPv6 PMTU info should be cleared after redirect
pwm: sprd: Ensure configuring period and duty_cycle isn't wrongly skipped
cxgb4: fix IRQ free race during driver unload
mptcp: fix warning in __skb_flow_dissect() when do syn cookie for subflow join
nvme-pci: do not call nvme_dev_remove_admin from nvme_remove
KVM: x86/pmu: Clear anythread deprecated bit when 0xa leaf is unsupported on the SVM
perf inject: Fix dso->nsinfo refcounting
perf map: Fix dso->nsinfo refcounting
perf probe: Fix dso->nsinfo refcounting
perf env: Fix sibling_dies memory leak
perf test session_topology: Delete session->evlist
perf test event_update: Fix memory leak of evlist
perf dso: Fix memory leak in dso__new_map()
perf test maps__merge_in: Fix memory leak of maps
perf env: Fix memory leak of cpu_pmu_caps
perf report: Free generated help strings for sort option
perf script: Fix memory 'threads' and 'cpus' leaks on exit
perf lzma: Close lzma stream on exit
perf probe-file: Delete namelist in del_events() on the error path
perf data: Close all files in close_dir()
perf sched: Fix record failure when CONFIG_SCHEDSTATS is not set
ASoC: wm_adsp: Correct wm_coeff_tlv_get handling
spi: imx: add a check for speed_hz before calculating the clock
spi: stm32: fixes pm_runtime calls in probe/remove
regulator: hi6421: Use correct variable type for regmap api val argument
regulator: hi6421: Fix getting wrong drvdata
spi: mediatek: fix fifo rx mode
ASoC: rt5631: Fix regcache sync errors on resume
bpf, test: fix NULL pointer dereference on invalid expected_attach_type
bpf: Fix tail_call_reachable rejection for interpreter when jit failed
xdp, net: Fix use-after-free in bpf_xdp_link_release
timers: Fix get_next_timer_interrupt() with no timers pending
liquidio: Fix unintentional sign extension issue on left shift of u16
s390/bpf: Perform r1 range checking before accessing jit->seen_reg[r1]
bpf, sockmap: Fix potential memory leak on unlikely error case
bpf, sockmap, tcp: sk_prot needs inuse_idx set for proc stats
bpf, sockmap, udp: sk_prot needs inuse_idx set for proc stats
bpftool: Check malloc return value in mount_bpffs_for_pin
net: fix uninit-value in caif_seqpkt_sendmsg
usb: hso: fix error handling code of hso_create_net_device
dma-mapping: handle vmalloc addresses in dma_common_{mmap,get_sgtable}
efi/tpm: Differentiate missing and invalid final event log table.
net: decnet: Fix sleeping inside in af_decnet
KVM: PPC: Book3S: Fix CONFIG_TRANSACTIONAL_MEM=n crash
KVM: PPC: Fix kvm_arch_vcpu_ioctl vcpu_load leak
net: sched: fix memory leak in tcindex_partial_destroy_work
sctp: trim optlen when it's a huge value in sctp_setsockopt
netrom: Decrease sock refcount when sock timers expire
scsi: iscsi: Fix iface sysfs attr detection
scsi: target: Fix protect handling in WRITE SAME(32)
spi: cadence: Correct initialisation of runtime PM again
ACPI: Kconfig: Fix table override from built-in initrd
bnxt_en: don't disable an already disabled PCI device
bnxt_en: Refresh RoCE capabilities in bnxt_ulp_probe()
bnxt_en: Add missing check for BNXT_STATE_ABORT_ERR in bnxt_fw_rset_task()
bnxt_en: Validate vlan protocol ID on RX packets
bnxt_en: Check abort error state in bnxt_half_open_nic()
net: hisilicon: rename CACHE_LINE_MASK to avoid redefinition
net/tcp_fastopen: fix data races around tfo_active_disable_stamp
ALSA: hda: intel-dsp-cfg: add missing ElkhartLake PCI ID
net: hns3: fix possible mismatches resp of mailbox
net: hns3: fix rx VLAN offload state inconsistent issue
spi: spi-bcm2835: Fix deadlock
net/sched: act_skbmod: Skip non-Ethernet packets
ipv6: fix another slab-out-of-bounds in fib6_nh_flush_exceptions
ceph: don't WARN if we're still opening a session to an MDS
nvme-pci: don't WARN_ON in nvme_reset_work if ctrl.state is not RESETTING
Revert "USB: quirks: ignore remote wake-up on Fibocom L850-GL LTE modem"
afs: Fix tracepoint string placement with built-in AFS
r8169: Avoid duplicate sysfs entry creation error
nvme: set the PRACT bit when using Write Zeroes with T10 PI
sctp: update active_key for asoc when old key is being replaced
tcp: disable TFO blackhole logic by default
net: dsa: sja1105: make VID 4095 a bridge VLAN too
net: sched: cls_api: Fix the the wrong parameter
drm/panel: raspberrypi-touchscreen: Prevent double-free
cifs: only write 64kb at a time when fallocating a small region of a file
cifs: fix fallocate when trying to allocate a hole.
proc: Avoid mixing integer types in mem_rw()
mmc: core: Don't allocate IDA for OF aliases
s390/ftrace: fix ftrace_update_ftrace_func implementation
s390/boot: fix use of expolines in the DMA code
ALSA: usb-audio: Add missing proc text entry for BESPOKEN type
ALSA: usb-audio: Add registration quirk for JBL Quantum headsets
ALSA: sb: Fix potential ABBA deadlock in CSP driver
ALSA: hda/realtek: Fix pop noise and 2 Front Mic issues on a machine
ALSA: hdmi: Expose all pins on MSI MS-7C94 board
ALSA: pcm: Call substream ack() method upon compat mmap commit
ALSA: pcm: Fix mmap capability check
Revert "usb: renesas-xhci: Fix handling of unknown ROM state"
usb: xhci: avoid renesas_usb_fw.mem when it's unusable
xhci: Fix lost USB 2 remote wake
KVM: PPC: Book3S: Fix H_RTAS rets buffer overflow
KVM: PPC: Book3S HV Nested: Sanitise H_ENTER_NESTED TM state
usb: hub: Disable USB 3 device initiated lpm if exit latency is too high
usb: hub: Fix link power management max exit latency (MEL) calculations
USB: usb-storage: Add LaCie Rugged USB3-FW to IGNORE_UAS
usb: max-3421: Prevent corruption of freed memory
usb: renesas_usbhs: Fix superfluous irqs happen after usb_pkt_pop()
USB: serial: option: add support for u-blox LARA-R6 family
USB: serial: cp210x: fix comments for GE CS1000
USB: serial: cp210x: add ID for CEL EM3588 USB ZigBee stick
usb: gadget: Fix Unbalanced pm_runtime_enable in tegra_xudc_probe
usb: dwc2: gadget: Fix GOUTNAK flow for Slave mode.
usb: dwc2: gadget: Fix sending zero length packet in DDMA mode.
usb: typec: stusb160x: register role switch before interrupt registration
firmware/efi: Tell memblock about EFI iomem reservations
tracepoints: Update static_call before tp_funcs when adding a tracepoint
tracing/histogram: Rename "cpu" to "common_cpu"
tracing: Fix bug in rb_per_cpu_empty() that might cause deadloop.
tracing: Synthetic event field_pos is an index not a boolean
btrfs: check for missing device in btrfs_trim_fs
media: ngene: Fix out-of-bounds bug in ngene_command_config_free_buf()
ixgbe: Fix packet corruption due to missing DMA sync
bus: mhi: core: Validate channel ID when processing command completions
posix-cpu-timers: Fix rearm racing against process tick
selftest: use mmap instead of posix_memalign to allocate memory
io_uring: explicitly count entries for poll reqs
io_uring: remove double poll entry on arm failure
userfaultfd: do not untag user pointers
memblock: make for_each_mem_range() traverse MEMBLOCK_HOTPLUG regions
hugetlbfs: fix mount mode command line processing
rbd: don't hold lock_rwsem while running_list is being drained
rbd: always kick acquire on "acquired" and "released" notifications
misc: eeprom: at24: Always append device id even if label property is set.
nds32: fix up stack guard gap
driver core: Prevent warning when removing a device link from unregistered consumer
drm: Return -ENOTTY for non-drm ioctls
drm/amdgpu: update golden setting for sienna_cichlid
net: dsa: mv88e6xxx: enable SerDes RX stats for Topaz
net: dsa: mv88e6xxx: enable SerDes PCS register dump via ethtool -d on Topaz
PCI: Mark AMD Navi14 GPU ATS as broken
bonding: fix build issue
skbuff: Release nfct refcount on napi stolen or re-used skbs
Documentation: Fix intiramfs script name
perf inject: Close inject.output on exit
usb: ehci: Prevent missed ehci interrupts with edge-triggered MSI
drm/i915/gvt: Clear d3_entered on elsp cmd submission.
sfc: ensure correct number of XDP queues
xhci: add xhci_get_virt_ep() helper
skbuff: Fix build with SKB extensions disabled
Linux 5.10.54
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ifd2823b47ab1544cd1f168b138624ffe060a471e
commit 1a3402d93c upstream.
Since the process wide cputime counter is started locklessly from
posix_cpu_timer_rearm(), it can be concurrently stopped by operations
on other timers from the same thread group, such as in the following
unlucky scenario:
CPU 0 CPU 1
----- -----
timer_settime(TIMER B)
posix_cpu_timer_rearm(TIMER A)
cpu_clock_sample_group()
(pct->timers_active already true)
handle_posix_cpu_timers()
check_process_timers()
stop_process_timers()
pct->timers_active = false
arm_timer(TIMER A)
tick -> run_posix_cpu_timers()
// sees !pct->timers_active, ignore
// our TIMER A
Fix this with simply locking process wide cputime counting start and
timer arm in the same block.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Fixes: 60f2ceaa81 ("posix-cpu-timers: Remove unnecessary locking around cpu_clock_sample_group")
Cc: stable@vger.kernel.org
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 67f0d6d988 upstream.
The "rb_per_cpu_empty()" misinterpret the condition (as not-empty) when
"head_page" and "commit_page" of "struct ring_buffer_per_cpu" points to
the same buffer page, whose "buffer_data_page" is empty and "read" field
is non-zero.
An error scenario could be constructed as followed (kernel perspective):
1. All pages in the buffer has been accessed by reader(s) so that all of
them will have non-zero "read" field.
2. Read and clear all buffer pages so that "rb_num_of_entries()" will
return 0 rendering there's no more data to read. It is also required
that the "read_page", "commit_page" and "tail_page" points to the same
page, while "head_page" is the next page of them.
3. Invoke "ring_buffer_lock_reserve()" with large enough "length"
so that it shot pass the end of current tail buffer page. Now the
"head_page", "commit_page" and "tail_page" points to the same page.
4. Discard current event with "ring_buffer_discard_commit()", so that
"head_page", "commit_page" and "tail_page" points to a page whose buffer
data page is now empty.
When the error scenario has been constructed, "tracing_read_pipe" will
be trapped inside a deadloop: "trace_empty()" returns 0 since
"rb_per_cpu_empty()" returns 0 when it hits the CPU containing such
constructed ring buffer. Then "trace_find_next_entry_inc()" always
return NULL since "rb_num_of_entries()" reports there's no more entry
to read. Finally "trace_seq_to_user()" returns "-EBUSY" spanking
"tracing_read_pipe" back to the start of the "waitagain" loop.
I've also written a proof-of-concept script to construct the scenario
and trigger the bug automatically, you can use it to trace and validate
my reasoning above:
https://github.com/aegistudio/RingBufferDetonator.git
Tests has been carried out on linux kernel 5.14-rc2
(2734d6c1b1), my fixed version
of kernel (for testing whether my update fixes the bug) and
some older kernels (for range of affected kernels). Test result is
also attached to the proof-of-concept repository.
Link: https://lore.kernel.org/linux-trace-devel/YPaNxsIlb2yjSi5Y@aegistudio/
Link: https://lore.kernel.org/linux-trace-devel/YPgrN85WL9VyrZ55@aegistudio
Cc: stable@vger.kernel.org
Fixes: bf41a158ca ("ring-buffer: make reentrant")
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Haoran Luo <www@aegistudio.net>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1e3bac71c5 upstream.
Currently the histogram logic allows the user to write "cpu" in as an
event field, and it will record the CPU that the event happened on.
The problem with this is that there's a lot of events that have "cpu"
as a real field, and using "cpu" as the CPU it ran on, makes it
impossible to run histograms on the "cpu" field of events.
For example, if I want to have a histogram on the count of the
workqueue_queue_work event on its cpu field, running:
># echo 'hist:keys=cpu' > events/workqueue/workqueue_queue_work/trigger
Gives a misleading and wrong result.
Change the command to "common_cpu" as no event should have "common_*"
fields as that's a reserved name for fields used by all events. And
this makes sense here as common_cpu would be a field used by all events.
Now we can even do:
># echo 'hist:keys=common_cpu,cpu if cpu < 100' > events/workqueue/workqueue_queue_work/trigger
># cat events/workqueue/workqueue_queue_work/hist
# event histogram
#
# trigger info: hist:keys=common_cpu,cpu:vals=hitcount:sort=hitcount:size=2048 if cpu < 100 [active]
#
{ common_cpu: 0, cpu: 2 } hitcount: 1
{ common_cpu: 0, cpu: 4 } hitcount: 1
{ common_cpu: 7, cpu: 7 } hitcount: 1
{ common_cpu: 0, cpu: 7 } hitcount: 1
{ common_cpu: 0, cpu: 1 } hitcount: 1
{ common_cpu: 0, cpu: 6 } hitcount: 2
{ common_cpu: 0, cpu: 5 } hitcount: 2
{ common_cpu: 1, cpu: 1 } hitcount: 4
{ common_cpu: 6, cpu: 6 } hitcount: 4
{ common_cpu: 5, cpu: 5 } hitcount: 14
{ common_cpu: 4, cpu: 4 } hitcount: 26
{ common_cpu: 0, cpu: 0 } hitcount: 39
{ common_cpu: 2, cpu: 2 } hitcount: 184
Now for backward compatibility, I added a trick. If "cpu" is used, and
the field is not found, it will fall back to "common_cpu" and work as
it did before. This way, it will still work for old programs that use
"cpu" to get the actual CPU, but if the event has a "cpu" as a field, it
will get that event's "cpu" field, which is probably what it wants
anyway.
I updated the tracefs/README to include documentation about both the
common_timestamp and the common_cpu. This way, if that text is present in
the README, then an application can know that common_cpu is supported over
just plain "cpu".
Link: https://lkml.kernel.org/r/20210721110053.26b4f641@oasis.local.home
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Fixes: 8b7622bf94 ("tracing: Add cpu field for hist triggers")
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 352384d5c8 upstream.
Because of the significant overhead that retpolines pose on indirect
calls, the tracepoint code was updated to use the new "static_calls" that
can modify the running code to directly call a function instead of using
an indirect caller, and this function can be changed at runtime.
In the tracepoint code that calls all the registered callbacks that are
attached to a tracepoint, the following is done:
it_func_ptr = rcu_dereference_raw((&__tracepoint_##name)->funcs);
if (it_func_ptr) {
__data = (it_func_ptr)->data;
static_call(tp_func_##name)(__data, args);
}
If there's just a single callback, the static_call is updated to just call
that callback directly. Once another handler is added, then the static
caller is updated to call the iterator, that simply loops over all the
funcs in the array and calls each of the callbacks like the old method
using indirect calling.
The issue was discovered with a race between updating the funcs array and
updating the static_call. The funcs array was updated first and then the
static_call was updated. This is not an issue as long as the first element
in the old array is the same as the first element in the new array. But
that assumption is incorrect, because callbacks also have a priority
field, and if there's a callback added that has a higher priority than the
callback on the old array, then it will become the first callback in the
new array. This means that it is possible to call the old callback with
the new callback data element, which can cause a kernel panic.
static_call = callback1()
funcs[] = {callback1,data1};
callback2 has higher priority than callback1
CPU 1 CPU 2
----- -----
new_funcs = {callback2,data2},
{callback1,data1}
rcu_assign_pointer(tp->funcs, new_funcs);
/*
* Now tp->funcs has the new array
* but the static_call still calls callback1
*/
it_func_ptr = tp->funcs [ new_funcs ]
data = it_func_ptr->data [ data2 ]
static_call(callback1, data);
/* Now callback1 is called with
* callback2's data */
[ KERNEL PANIC ]
update_static_call(iterator);
To prevent this from happening, always switch the static_call to the
iterator before assigning the tp->funcs to the new array. The iterator will
always properly match the callback with its data.
To trigger this bug:
In one terminal:
while :; do hackbench 50; done
In another terminal
echo 1 > /sys/kernel/tracing/events/sched/sched_waking/enable
while :; do
echo 1 > /sys/kernel/tracing/set_event_pid;
sleep 0.5
echo 0 > /sys/kernel/tracing/set_event_pid;
sleep 0.5
done
And it doesn't take long to crash. This is because the set_event_pid adds
a callback to the sched_waking tracepoint with a high priority, which will
be called before the sched_waking trace event callback is called.
Note, the removal to a single callback updates the array first, before
changing the static_call to single callback, which is the proper order as
the first element in the array is the same as what the static_call is
being changed to.
Link: https://lore.kernel.org/io-uring/4ebea8f0-58c9-e571-fd30-0ce4f6f09c70@samba.org/
Cc: stable@vger.kernel.org
Fixes: d25e37d89d ("tracepoint: Optimize using static_call()")
Reported-by: Stefan Metzmacher <metze@samba.org>
tested-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 40ac971eab ]
xen-swiotlb can use vmalloc backed addresses for dma coherent allocations
and uses the common helpers. Properly handle them to unbreak Xen on
ARM platforms.
Fixes: 1b65c4e5a9 ("swiotlb-xen: use xen_alloc/free_coherent_pages")
Signed-off-by: Roman Skakun <roman_skakun@epam.com>
Reviewed-by: Andrii Anisov <andrii_anisov@epam.com>
[hch: split the patch, renamed the helpers]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit aebacb7f6c ]
31cd0e119d ("timers: Recalculate next timer interrupt only when
necessary") subtly altered get_next_timer_interrupt()'s behaviour. The
function no longer consistently returns KTIME_MAX with no timers
pending.
In order to decide if there are any timers pending we check whether the
next expiry will happen NEXT_TIMER_MAX_DELTA jiffies from now.
Unfortunately, the next expiry time and the timer base clock are no
longer updated in unison. The former changes upon certain timer
operations (enqueue, expire, detach), whereas the latter keeps track of
jiffies as they move forward. Ultimately breaking the logic above.
A simplified example:
- Upon entering get_next_timer_interrupt() with:
jiffies = 1
base->clk = 0;
base->next_expiry = NEXT_TIMER_MAX_DELTA;
'base->next_expiry == base->clk + NEXT_TIMER_MAX_DELTA', the function
returns KTIME_MAX.
- 'base->clk' is updated to the jiffies value.
- The next time we enter get_next_timer_interrupt(), taking into account
no timer operations happened:
base->clk = 1;
base->next_expiry = NEXT_TIMER_MAX_DELTA;
'base->next_expiry != base->clk + NEXT_TIMER_MAX_DELTA', the function
returns a valid expire time, which is incorrect.
This ultimately might unnecessarily rearm sched's timer on nohz_full
setups, and add latency to the system[1].
So, introduce 'base->timers_pending'[2], update it every time
'base->next_expiry' changes, and use it in get_next_timer_interrupt().
[1] See tick_nohz_stop_tick().
[2] A quick pahole check on x86_64 and arm64 shows it doesn't make
'struct timer_base' any bigger.
Fixes: 31cd0e119d ("timers: Recalculate next timer interrupt only when necessary")
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
This reverts commit a9f36bf361 which is
commit f263a81451 upstream.
It breaks the Android KABI and is not needed for Android devices at this
point in time.
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Id5d1b6407f9bd5228630b9a813e9799dbc448b96
commit f263a81451 upstream.
Subprograms are calling map_poke_track(), but on program release there is no
hook to call map_poke_untrack(). However, on program release, the aux memory
(and poke descriptor table) is freed even though we still have a reference to
it in the element list of the map aux data. When we run map_poke_run(), we then
end up accessing free'd memory, triggering KASAN in prog_array_map_poke_run():
[...]
[ 402.824689] BUG: KASAN: use-after-free in prog_array_map_poke_run+0xc2/0x34e
[ 402.824698] Read of size 4 at addr ffff8881905a7940 by task hubble-fgs/4337
[ 402.824705] CPU: 1 PID: 4337 Comm: hubble-fgs Tainted: G I 5.12.0+ #399
[ 402.824715] Call Trace:
[ 402.824719] dump_stack+0x93/0xc2
[ 402.824727] print_address_description.constprop.0+0x1a/0x140
[ 402.824736] ? prog_array_map_poke_run+0xc2/0x34e
[ 402.824740] ? prog_array_map_poke_run+0xc2/0x34e
[ 402.824744] kasan_report.cold+0x7c/0xd8
[ 402.824752] ? prog_array_map_poke_run+0xc2/0x34e
[ 402.824757] prog_array_map_poke_run+0xc2/0x34e
[ 402.824765] bpf_fd_array_map_update_elem+0x124/0x1a0
[...]
The elements concerned are walked as follows:
for (i = 0; i < elem->aux->size_poke_tab; i++) {
poke = &elem->aux->poke_tab[i];
[...]
The access to size_poke_tab is a 4 byte read, verified by checking offsets
in the KASAN dump:
[ 402.825004] The buggy address belongs to the object at ffff8881905a7800
which belongs to the cache kmalloc-1k of size 1024
[ 402.825008] The buggy address is located 320 bytes inside of
1024-byte region [ffff8881905a7800, ffff8881905a7c00)
The pahole output of bpf_prog_aux:
struct bpf_prog_aux {
[...]
/* --- cacheline 5 boundary (320 bytes) --- */
u32 size_poke_tab; /* 320 4 */
[...]
In general, subprograms do not necessarily manage their own data structures.
For example, BTF func_info and linfo are just pointers to the main program
structure. This allows reference counting and cleanup to be done on the latter
which simplifies their management a bit. The aux->poke_tab struct, however,
did not follow this logic. The initial proposed fix for this use-after-free
bug further embedded poke data tracking into the subprogram with proper
reference counting. However, Daniel and Alexei questioned why we were treating
these objects special; I agree, its unnecessary. The fix here removes the per
subprogram poke table allocation and map tracking and instead simply points
the aux->poke_tab pointer at the main programs poke table. This way, map
tracking is simplified to the main program and we do not need to manage them
per subprogram.
This also means, bpf_prog_free_deferred(), which unwinds the program reference
counting and kfrees objects, needs to ensure that we don't try to double free
the poke_tab when free'ing the subprog structures. This is easily solved by
NULL'ing the poke_tab pointer. The second detail is to ensure that per
subprogram JIT logic only does fixups on poke_tab[] entries it owns. To do
this, we add a pointer in the poke structure to point at the subprogram value
so JITs can easily check while walking the poke_tab structure if the current
entry belongs to the current program. The aux pointer is stable and therefore
suitable for such comparison. On the jit_subprogs() error path, we omit
cleaning up the poke->aux field because these are only ever referenced from
the JIT side, but on error we will never make it to the JIT, so its fine to
leave them dangling. Removing these pointers would complicate the error path
for no reason. However, we do need to untrack all poke descriptors from the
main program as otherwise they could race with the freeing of JIT memory from
the subprograms. Lastly, a748c6975d ("bpf: propagate poke descriptors to
subprograms") had an off-by-one on the subprogram instruction index range
check as it was testing 'insn_idx >= subprog_start && insn_idx <= subprog_end'.
However, subprog_end is the next subprogram's start instruction.
Fixes: a748c6975d ("bpf: propagate poke descriptors to subprograms")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210707223848.14580-2-john.fastabend@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 72d0ad7cb5 ]
The time remaining until expiry of the refresh_timer can be negative.
Casting the type to an unsigned 64-bit value will cause integer
underflow, making the runtime_refresh_within return false instead of
true. These situations are rare, but they do happen.
This does not cause user-facing issues or errors; other than
possibly unthrottling cfs_rq's using runtime from the previous period(s),
making the CFS bandwidth enforcement less strict in those (special)
situations.
Signed-off-by: Odin Ugedal <odin@uged.al>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Link: https://lore.kernel.org/r/20210629121452.18429-1-odin@uged.al
Signed-off-by: Sasha Levin <sashal@kernel.org>
Sync up with android12-5.10 for the following commits:
90732d426e UPSTREAM: mm/compaction: correct deferral logic for proactive compaction
d0a88ae479 ANDROID: Enable GKI Dr. No Enforcement
e6a59da61e ANDROID: media: v4l2-core: Fix deadlock in vendor hook
fdb1cfe2d3 FROMGIT: dma_buf: remove dmabuf sysfs teardown before release
1dc0dd2573 ANDROID: update mtk symbol list
a669748346 UPSTREAM: mfd: syscon: Free the allocated name field of struct regmap_config
cffe67a351 ANDROID: Give UIC cmd timeout a larger value
3bca0b5344 ANDROID: binder: retry security_secid_to_secctx()
9fdfeda4c9 ANDROID: update new gki symbol for mtk
1dd167be9f ANDROID: mm, kasan: fix for "integrate page_alloc init with HW_TAGS"
f215850ea0 UPSTREAM: kasan: fix conflict with page poisoning
a3da917661 ANDROID: GKI: Export two more mm symbols for GKI
0497b9601b ANDROID: Update symbol list for mtk
0f27e1d317 FROMLIST: kfence: skip all GFP_ZONEMASK allocations
8478d8dc53 FROMLIST: kfence: move the size check to the beginning of __kfence_alloc()
ddabf14bea ANDROID: ABI: update allowed list for galaxy
11cec52238 FROMGIT: f2fs: add sysfs nodes to get GC info for each GC mode
fdc46110cb ANDROID: abi_gki_aarch64_qcom: Add android_debug_for_each_module
0bb433e014 ANDROID: debug_symbols: Add android_debug_for_each_module
a0429aa1d0 ANDROID: ABI: Update ABI for symbol list updates
c68e5ca9f8 ANDROID: GKI: Update symbols to symbol list
aac5a77959 ANDROID: Update symbol list for mtk
42fc1faf44 UPSTREAM: block, bfq: set next_rq to waker_bfqq->next_rq in waker injection
e37cc8a09f UPSTREAM: mm/mremap: hold the rmap lock in write mode when moving page table entries.
9b136eab76 ANDROID: pstore/ram: Add backward compatibility for ramoops reserved region
df97297651 ANDROID: Update symbol list for mtk
4322bb07e8 ANDROID: vendor_hooks: Modify the function name
007213b1d0 BACKPORT: FROMLIST: kasan: add memzero int for unaligned size at DEBUG
7acbce0bf4 BACKPORT: FROMLIST: mm: move helper to check slub_debug_enabled
d6eeae59b5 ANDROID: ABI: initial update allowed list for galaxy
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I6d28f1be3cbd8a1fed86d2d647e6c2bb55221b0a
This effectively locks down OWNERS approval to a small group to guard
the code base against unintentional breakages.
Bug: 194314089
Signed-off-by: Matthias Maennich <maennich@google.com>
Change-Id: Ifd1ea97639a622320ea83f901f6451e2e52b38d4
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmD22M4ACgkQONu9yGCS
aT6MbxAAmAmm5sj9fMOk4ijcRz+CbpTZZdSS5VT3BbJJrTrZ1qk9KO0/uAhY1SXe
uuP8Q7BD2llWQ6hQiJj7vZnXKiJsBrngH//4QFa72TjXxa1ENPcw/UceLxMgT6e1
0r74mcikr4f2A6kFoAexyteHfzm0D+8ZfgzJtTKPbBPqCqRIaZFO84xyTclvNJ6h
0Xx2dSNWDU1dXvUe+43ggaZUFNe4SHGupgsc3GSsxPkyKTzrXMGsRh4m3p94t4py
WQULkq/57JaeJwQcxWMOPqLIHF/IWtZSfr+YHx+q1zvThK9uUsAd3e8B8r6FJswV
xmDIDkCw3VIRxTiYhYKFqiDZOanlxYOtWXCXAyI+YV/JIT4NGHapg91qEq5PR9Ti
tkSmAbwaON9LzshzWuKjDHMDJO6x/i3YJmaXuVgBciD56xIvCMKiiardpey2574o
M5m8fLcXQXQBPY/WrusAn312/PanUxgstn6EJgInq0M4GoMj1aUb+W8luQfqq0G5
gYNjzTCEQErTITAyha9SLQvBVH3snfHpKoDL669HK6woq2GH2K48YIKxCam5trlH
9P9LCXBhSHe19/r2qLotcMW7XmuSsu77FetZtxFJQeWVqRokmyHzZnTbeoQAxhpa
xGkvnbnyLettCCtm/JAmwnASqIYnUxuvgj7zcPb5PGnhwk1rfXU=
=b0QD
-----END PGP SIGNATURE-----
Merge 5.10.52 into android12-5.10-lts
Changes in 5.10.52
certs: add 'x509_revocation_list' to gitignore
cifs: handle reconnect of tcon when there is no cached dfs referral
KVM: mmio: Fix use-after-free Read in kvm_vm_ioctl_unregister_coalesced_mmio
KVM: x86: Use guest MAXPHYADDR from CPUID.0x8000_0008 iff TDP is enabled
KVM: x86/mmu: Do not apply HPA (memory encryption) mask to GPAs
KVM: nSVM: Check the value written to MSR_VM_HSAVE_PA
KVM: X86: Disable hardware breakpoints unconditionally before kvm_x86->run()
scsi: core: Fix bad pointer dereference when ehandler kthread is invalid
scsi: zfcp: Report port fc_security as unknown early during remote cable pull
tracing: Do not reference char * as a string in histograms
drm/i915/gtt: drop the page table optimisation
drm/i915/gt: Fix -EDEADLK handling regression
cgroup: verify that source is a string
fbmem: Do not delete the mode that is still in use
drm/dp_mst: Do not set proposed vcpi directly
drm/dp_mst: Avoid to mess up payload table by ports in stale topology
drm/dp_mst: Add missing drm parameters to recently added call to drm_dbg_kms()
drm/ingenic: Fix non-OSD mode
drm/ingenic: Switch IPU plane to type OVERLAY
Revert "drm/ast: Remove reference to struct drm_device.pdev"
net: bridge: multicast: fix PIM hello router port marking race
net: bridge: multicast: fix MRD advertisement router port marking race
leds: tlc591xx: fix return value check in tlc591xx_probe()
ASoC: Intel: sof_sdw: add mutual exclusion between PCH DMIC and RT715
dmaengine: fsl-qdma: check dma_set_mask return value
scsi: arcmsr: Fix the wrong CDB payload report to IOP
srcu: Fix broken node geometry after early ssp init
rcu: Reject RCU_LOCKDEP_WARN() false positives
tty: serial: fsl_lpuart: fix the potential risk of division or modulo by zero
serial: fsl_lpuart: disable DMA for console and fix sysrq
misc/libmasm/module: Fix two use after free in ibmasm_init_one
misc: alcor_pci: fix null-ptr-deref when there is no PCI bridge
ASoC: intel/boards: add missing MODULE_DEVICE_TABLE
partitions: msdos: fix one-byte get_unaligned()
iio: gyro: fxa21002c: Balance runtime pm + use pm_runtime_resume_and_get().
iio: magn: bmc150: Balance runtime pm + use pm_runtime_resume_and_get()
ALSA: usx2y: Avoid camelCase
ALSA: usx2y: Don't call free_pages_exact() with NULL address
Revert "ALSA: bebob/oxfw: fix Kconfig entry for Mackie d.2 Pro"
usb: common: usb-conn-gpio: fix NULL pointer dereference of charger
w1: ds2438: fixing bug that would always get page0
scsi: arcmsr: Fix doorbell status being updated late on ARC-1886
scsi: hisi_sas: Propagate errors in interrupt_init_v1_hw()
scsi: lpfc: Fix "Unexpected timeout" error in direct attach topology
scsi: lpfc: Fix crash when lpfc_sli4_hba_setup() fails to initialize the SGLs
scsi: core: Cap scsi_host cmd_per_lun at can_queue
ALSA: ac97: fix PM reference leak in ac97_bus_remove()
tty: serial: 8250: serial_cs: Fix a memory leak in error handling path
scsi: mpt3sas: Fix deadlock while cancelling the running firmware event
scsi: core: Fixup calling convention for scsi_mode_sense()
scsi: scsi_dh_alua: Check for negative result value
fs/jfs: Fix missing error code in lmLogInit()
scsi: megaraid_sas: Fix resource leak in case of probe failure
scsi: megaraid_sas: Early detection of VD deletion through RaidMap update
scsi: megaraid_sas: Handle missing interrupts while re-enabling IRQs
scsi: iscsi: Add iscsi_cls_conn refcount helpers
scsi: iscsi: Fix conn use after free during resets
scsi: iscsi: Fix shost->max_id use
scsi: qedi: Fix null ref during abort handling
scsi: qedi: Fix race during abort timeouts
scsi: qedi: Fix TMF session block/unblock use
scsi: qedi: Fix cleanup session block/unblock use
mfd: da9052/stmpe: Add and modify MODULE_DEVICE_TABLE
mfd: cpcap: Fix cpcap dmamask not set warnings
ASoC: img: Fix PM reference leak in img_i2s_in_probe()
fsi: Add missing MODULE_DEVICE_TABLE
serial: tty: uartlite: fix console setup
s390/sclp_vt220: fix console name to match device
s390: disable SSP when needed
selftests: timers: rtcpie: skip test if default RTC device does not exist
ALSA: sb: Fix potential double-free of CSP mixer elements
powerpc/ps3: Add dma_mask to ps3_dma_region
iommu/arm-smmu: Fix arm_smmu_device refcount leak when arm_smmu_rpm_get fails
iommu/arm-smmu: Fix arm_smmu_device refcount leak in address translation
ASoC: soc-pcm: fix the return value in dpcm_apply_symmetry()
gpio: zynq: Check return value of pm_runtime_get_sync
gpio: zynq: Check return value of irq_get_irq_data
scsi: storvsc: Correctly handle multiple flags in srb_status
ALSA: ppc: fix error return code in snd_pmac_probe()
selftests/powerpc: Fix "no_handler" EBB selftest
gpio: pca953x: Add support for the On Semi pca9655
powerpc/mm/book3s64: Fix possible build error
ASoC: soc-core: Fix the error return code in snd_soc_of_parse_audio_routing()
habanalabs/gaudi: set the correct cpu_id on MME2_QM failure
habanalabs: remove node from list before freeing the node
s390/processor: always inline stap() and __load_psw_mask()
s390/ipl_parm: fix program check new psw handling
s390/mem_detect: fix diag260() program check new psw handling
s390/mem_detect: fix tprot() program check new psw handling
Input: hideep - fix the uninitialized use in hideep_nvm_unlock()
ALSA: bebob: add support for ToneWeal FW66
ALSA: usb-audio: scarlett2: Fix 18i8 Gen 2 PCM Input count
ALSA: usb-audio: scarlett2: Fix data_mutex lock
ALSA: usb-audio: scarlett2: Fix scarlett2_*_ctl_put() return values
usb: gadget: f_hid: fix endianness issue with descriptors
usb: gadget: hid: fix error return code in hid_bind()
powerpc/boot: Fixup device-tree on little endian
ASoC: Intel: kbl_da7219_max98357a: shrink platform_id below 20 characters
backlight: lm3630a: Fix return code of .update_status() callback
ALSA: hda: Add IRQ check for platform_get_irq()
ALSA: usb-audio: scarlett2: Fix 6i6 Gen 2 line out descriptions
ALSA: firewire-motu: fix detection for S/PDIF source on optical interface in v2 protocol
leds: turris-omnia: add missing MODULE_DEVICE_TABLE
staging: rtl8723bs: fix macro value for 2.4Ghz only device
intel_th: Wait until port is in reset before programming it
i2c: core: Disable client irq on reboot/shutdown
phy: intel: Fix for warnings due to EMMC clock 175Mhz change in FIP
lib/decompress_unlz4.c: correctly handle zero-padding around initrds.
kcov: add __no_sanitize_coverage to fix noinstr for all architectures
power: supply: sc27xx: Add missing MODULE_DEVICE_TABLE
power: supply: sc2731_charger: Add missing MODULE_DEVICE_TABLE
pwm: spear: Don't modify HW state in .remove callback
PCI: ftpci100: Rename macro name collision
power: supply: ab8500: Avoid NULL pointers
PCI: hv: Fix a race condition when removing the device
power: supply: max17042: Do not enforce (incorrect) interrupt trigger type
power: reset: gpio-poweroff: add missing MODULE_DEVICE_TABLE
ARM: 9087/1: kprobes: test-thumb: fix for LLVM_IAS=1
PCI/P2PDMA: Avoid pci_get_slot(), which may sleep
NFSv4: Fix delegation return in cases where we have to retry
PCI: pciehp: Ignore Link Down/Up caused by DPC
watchdog: Fix possible use-after-free in wdt_startup()
watchdog: sc520_wdt: Fix possible use-after-free in wdt_turnoff()
watchdog: Fix possible use-after-free by calling del_timer_sync()
watchdog: imx_sc_wdt: fix pretimeout
watchdog: iTCO_wdt: Account for rebooting on second timeout
x86/fpu: Return proper error codes from user access functions
remoteproc: core: Fix cdev remove and rproc del
PCI: tegra: Add missing MODULE_DEVICE_TABLE
orangefs: fix orangefs df output.
ceph: remove bogus checks and WARN_ONs from ceph_set_page_dirty
drm/gma500: Add the missed drm_gem_object_put() in psb_user_framebuffer_create()
NFS: nfs_find_open_context() may only select open files
power: supply: charger-manager: add missing MODULE_DEVICE_TABLE
power: supply: ab8500: add missing MODULE_DEVICE_TABLE
drm/amdkfd: fix sysfs kobj leak
pwm: img: Fix PM reference leak in img_pwm_enable()
pwm: tegra: Don't modify HW state in .remove callback
ACPI: AMBA: Fix resource name in /proc/iomem
ACPI: video: Add quirk for the Dell Vostro 3350
PCI: rockchip: Register IRQ handlers after device and data are ready
virtio-blk: Fix memory leak among suspend/resume procedure
virtio_net: Fix error handling in virtnet_restore()
virtio_console: Assure used length from device is limited
f2fs: atgc: fix to set default age threshold
NFSD: Fix TP_printk() format specifier in nfsd_clid_class
x86/signal: Detect and prevent an alternate signal stack overflow
f2fs: add MODULE_SOFTDEP to ensure crc32 is included in the initramfs
f2fs: compress: fix to disallow temp extension
remoteproc: k3-r5: Fix an error message
PCI/sysfs: Fix dsm_label_utf16s_to_utf8s() buffer overrun
power: supply: rt5033_battery: Fix device tree enumeration
NFSv4: Initialise connection to the server in nfs4_alloc_client()
NFSv4: Fix an Oops in pnfs_mark_request_commit() when doing O_DIRECT
misc: alcor_pci: fix inverted branch condition
um: fix error return code in slip_open()
um: fix error return code in winch_tramp()
ubifs: Fix off-by-one error
ubifs: journal: Fix error return code in ubifs_jnl_write_inode()
watchdog: aspeed: fix hardware timeout calculation
watchdog: jz4740: Fix return value check in jz4740_wdt_probe()
SUNRPC: prevent port reuse on transports which don't request it.
nfs: fix acl memory leak of posix_acl_create()
ubifs: Set/Clear I_LINKABLE under i_lock for whiteout inode
PCI: iproc: Fix multi-MSI base vector number allocation
PCI: iproc: Support multi-MSI only on uniprocessor kernel
f2fs: fix to avoid adding tab before doc section
x86/fpu: Fix copy_xstate_to_kernel() gap handling
x86/fpu: Limit xstate copy size in xstateregs_set()
PCI: intel-gw: Fix INTx enable
pwm: imx1: Don't disable clocks at device remove time
PCI: tegra194: Fix tegra_pcie_ep_raise_msi_irq() ill-defined shift
vdpa/mlx5: Fix umem sizes assignments on VQ create
vdpa/mlx5: Fix possible failure in umem size calculation
virtio_net: move tx vq operation under tx queue lock
nvme-tcp: can't set sk_user_data without write_lock
nfsd: Reduce contention for the nfsd_file nf_rwsem
ALSA: isa: Fix error return code in snd_cmi8330_probe()
vdpa/mlx5: Clear vq ready indication upon device reset
NFSv4/pnfs: Fix the layout barrier update
NFSv4/pnfs: Fix layoutget behaviour after invalidation
NFSv4/pNFS: Don't call _nfs4_pnfs_v3_ds_connect multiple times
hexagon: handle {,SOFT}IRQENTRY_TEXT in linker script
hexagon: use common DISCARDS macro
ARM: dts: gemini-rut1xx: remove duplicate ethernet node
reset: RESET_BRCMSTB_RESCAL should depend on ARCH_BRCMSTB
reset: RESET_INTEL_GW should depend on X86
reset: a10sr: add missing of_match_table reference
ARM: exynos: add missing of_node_put for loop iteration
ARM: dts: exynos: fix PWM LED max brightness on Odroid XU/XU3
ARM: dts: exynos: fix PWM LED max brightness on Odroid HC1
ARM: dts: exynos: fix PWM LED max brightness on Odroid XU4
memory: stm32-fmc2-ebi: add missing of_node_put for loop iteration
memory: atmel-ebi: add missing of_node_put for loop iteration
reset: brcmstb: Add missing MODULE_DEVICE_TABLE
memory: pl353: Fix error return code in pl353_smc_probe()
ARM: dts: sun8i: h3: orangepi-plus: Fix ethernet phy-mode
rtc: fix snprintf() checking in is_rtc_hctosys()
arm64: dts: renesas: v3msk: Fix memory size
ARM: dts: r8a7779, marzen: Fix DU clock names
arm64: dts: ti: j7200-main: Enable USB2 PHY RX sensitivity workaround
arm64: dts: renesas: Add missing opp-suspend properties
arm64: dts: renesas: r8a7796[01]: Fix OPP table entry voltages
ARM: dts: stm32: Connect PHY IRQ line on DH STM32MP1 SoM
ARM: dts: stm32: Rework LAN8710Ai PHY reset on DHCOM SoM
arm64: dts: qcom: trogdor: Add no-hpd to DSI bridge node
firmware: tegra: Fix error return code in tegra210_bpmp_init()
firmware: arm_scmi: Reset Rx buffer to max size during async commands
dt-bindings: i2c: at91: fix example for scl-gpios
ARM: dts: BCM5301X: Fixup SPI binding
reset: bail if try_module_get() fails
arm64: dts: renesas: r8a779a0: Drop power-domains property from GIC node
arm64: dts: ti: k3-j721e-main: Fix external refclk input to SERDES
memory: fsl_ifc: fix leak of IO mapping on probe failure
memory: fsl_ifc: fix leak of private memory on probe failure
arm64: dts: allwinner: a64-sopine-baseboard: change RGMII mode to TXID
ARM: dts: dra7: Fix duplicate USB4 target module node
ARM: dts: am335x: align ti,pindir-d0-out-d1-in property with dt-shema
ARM: dts: am437x: align ti,pindir-d0-out-d1-in property with dt-shema
thermal/drivers/sprd: Add missing MODULE_DEVICE_TABLE
ARM: dts: imx6q-dhcom: Fix ethernet reset time properties
ARM: dts: imx6q-dhcom: Fix ethernet plugin detection problems
ARM: dts: imx6q-dhcom: Add gpios pinctrl for i2c bus recovery
thermal/drivers/rcar_gen3_thermal: Fix coefficient calculations
firmware: turris-mox-rwtm: fix reply status decoding function
firmware: turris-mox-rwtm: report failures better
firmware: turris-mox-rwtm: fail probing when firmware does not support hwrng
firmware: turris-mox-rwtm: show message about HWRNG registration
arm64: dts: rockchip: Re-add regulator-boot-on, regulator-always-on for vdd_gpu on rk3399-roc-pc
arm64: dts: rockchip: Re-add regulator-always-on for vcc_sdio for rk3399-roc-pc
scsi: be2iscsi: Fix an error handling path in beiscsi_dev_probe()
sched/uclamp: Ignore max aggregation if rq is idle
jump_label: Fix jump_label_text_reserved() vs __init
static_call: Fix static_call_text_reserved() vs __init
mips: always link byteswap helpers into decompressor
mips: disable branch profiling in boot/decompress.o
MIPS: vdso: Invalid GIC access through VDSO
scsi: scsi_dh_alua: Fix signedness bug in alua_rtpg()
seq_file: disallow extremely large seq buffer allocations
Linux 5.10.52
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ic1b04661728db8b0e060ca6935783e15a22210da
[ Upstream commit 2bee6d16e4 ]
It turns out that static_call_text_reserved() was reporting __init
text as being reserved past the time when the __init text was freed
and re-used.
This is mostly harmless and will at worst result in refusing a kprobe.
Fixes: 6333e8f73b ("static_call: Avoid kprobes on inline static_call()s")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20210628113045.106211657@infradead.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 9e667624c2 ]
It turns out that jump_label_text_reserved() was reporting __init text
as being reserved past the time when the __init text was freed and
re-used.
For a long time, this resulted in, at worst, not being able to kprobe
text that happened to land at the re-used address. However a recent
commit e7bf1ba97a ("jump_label, x86: Emit short JMP") made it a
fatal mistake because it now needs to read the instruction in order to
determine the conflict -- an instruction that's no longer there.
Fixes: 4c3ef6d793 ("jump label: Add jump_label_text_reserved() to reserve jump points")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20210628113045.045141693@infradead.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3e1493f463 ]
When a task wakes up on an idle rq, uclamp_rq_util_with() would max
aggregate with rq value. But since there is no task enqueued yet, the
values are stale based on the last task that was running. When the new
task actually wakes up and enqueued, then the rq uclamp values should
reflect that of the newly woken up task effective uclamp values.
This is a problem particularly for uclamp_max because it default to
1024. If a task p with uclamp_max = 512 wakes up, then max aggregation
would ignore the capping that should apply when this task is enqueued,
which is wrong.
Fix that by ignoring max aggregation if the rq is idle since in that
case the effective uclamp value of the rq will be the ones of the task
that will wake up.
Fixes: 9d20ad7dfc ("sched/uclamp: Add uclamp_util_with()")
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
[qias: Changelog]
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Link: https://lore.kernel.org/r/20210630141204.8197-1-xuewen.yan94@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3066820034 ]
If another lockdep report runs concurrently with an RCU lockdep report
from RCU_LOCKDEP_WARN(), the following sequence of events can occur:
1. debug_lockdep_rcu_enabled() sees that lockdep is enabled
when called from (say) synchronize_rcu().
2. Lockdep is disabled by a concurrent lockdep report.
3. debug_lockdep_rcu_enabled() evaluates its lockdep-expression
argument, for example, lock_is_held(&rcu_bh_lock_map).
4. Because lockdep is now disabled, lock_is_held() plays it safe and
returns the constant 1.
5. But in this case, the constant 1 is not safe, because invoking
synchronize_rcu() under rcu_read_lock_bh() is disallowed.
6. debug_lockdep_rcu_enabled() wrongly invokes lockdep_rcu_suspicious(),
resulting in a false-positive splat.
This commit therefore changes RCU_LOCKDEP_WARN() to check
debug_lockdep_rcu_enabled() after checking the lockdep expression,
so that any "safe" returns from lock_is_held() are rejected by
debug_lockdep_rcu_enabled(). This requires memory ordering, which is
supplied by READ_ONCE(debug_locks). The resulting volatile accesses
prevent the compiler from reordering and the fact that only one variable
is being accessed prevents the underlying hardware from reordering.
The combination works for IA64, which can reorder reads to the same
location, but this is defeated by the volatile accesses, which compile
to load instructions that provide ordering.
Reported-by: syzbot+dde0cc33951735441301@syzkaller.appspotmail.com
Reported-by: Matthew Wilcox <willy@infradead.org>
Reported-by: syzbot+88e4f02896967fe1ab0d@syzkaller.appspotmail.com
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b5befe842e ]
An srcu_struct structure that is initialized before rcu_init_geometry()
will have its srcu_node hierarchy based on CONFIG_NR_CPUS. Once
rcu_init_geometry() is called, this hierarchy is compressed as needed
for the actual maximum number of CPUs for this system.
Later on, that srcu_struct structure is confused, sometimes referring
to its initial CONFIG_NR_CPUS-based hierarchy, and sometimes instead
to the new num_possible_cpus() hierarchy. For example, each of its
->mynode fields continues to reference the original leaf rcu_node
structures, some of which might no longer exist. On the other hand,
srcu_for_each_node_breadth_first() traverses to the new node hierarchy.
There are at least two bad possible outcomes to this:
1) a) A callback enqueued early on an srcu_data structure (call it
*sdp) is recorded pending on sdp->mynode->srcu_data_have_cbs in
srcu_funnel_gp_start() with sdp->mynode pointing to a deep leaf
(say 3 levels).
b) The grace period ends after rcu_init_geometry() shrinks the
nodes level to a single one. srcu_gp_end() walks through the new
srcu_node hierarchy without ever reaching the old leaves so the
callback is never executed.
This is easily reproduced on an 8 CPUs machine with CONFIG_NR_CPUS >= 32
and "rcupdate.rcu_self_test=1". The srcu_barrier() after early tests
verification never completes and the boot hangs:
[ 5413.141029] INFO: task swapper/0:1 blocked for more than 4915 seconds.
[ 5413.147564] Not tainted 5.12.0-rc4+ #28
[ 5413.151927] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5413.159753] task:swapper/0 state:D stack: 0 pid: 1 ppid: 0 flags:0x00004000
[ 5413.168099] Call Trace:
[ 5413.170555] __schedule+0x36c/0x930
[ 5413.174057] ? wait_for_completion+0x88/0x110
[ 5413.178423] schedule+0x46/0xf0
[ 5413.181575] schedule_timeout+0x284/0x380
[ 5413.185591] ? wait_for_completion+0x88/0x110
[ 5413.189957] ? mark_held_locks+0x61/0x80
[ 5413.193882] ? mark_held_locks+0x61/0x80
[ 5413.197809] ? _raw_spin_unlock_irq+0x24/0x50
[ 5413.202173] ? wait_for_completion+0x88/0x110
[ 5413.206535] wait_for_completion+0xb4/0x110
[ 5413.210724] ? srcu_torture_stats_print+0x110/0x110
[ 5413.215610] srcu_barrier+0x187/0x200
[ 5413.219277] ? rcu_tasks_verify_self_tests+0x50/0x50
[ 5413.224244] ? rdinit_setup+0x2b/0x2b
[ 5413.227907] rcu_verify_early_boot_tests+0x2d/0x40
[ 5413.232700] do_one_initcall+0x63/0x310
[ 5413.236541] ? rdinit_setup+0x2b/0x2b
[ 5413.240207] ? rcu_read_lock_sched_held+0x52/0x80
[ 5413.244912] kernel_init_freeable+0x253/0x28f
[ 5413.249273] ? rest_init+0x250/0x250
[ 5413.252846] kernel_init+0xa/0x110
[ 5413.256257] ret_from_fork+0x22/0x30
2) An srcu_struct structure that is initialized before rcu_init_geometry()
and used afterward will always have stale rdp->mynode references,
resulting in callbacks to be missed in srcu_gp_end(), just like in
the previous scenario.
This commit therefore causes init_srcu_struct_nodes to initialize the
geometry, if needed. This ensures that the srcu_node hierarchy is
properly built and distributed from the get-go.
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 3b0462726e upstream.
The following sequence can be used to trigger a UAF:
int fscontext_fd = fsopen("cgroup");
int fd_null = open("/dev/null, O_RDONLY);
int fsconfig(fscontext_fd, FSCONFIG_SET_FD, "source", fd_null);
close_range(3, ~0U, 0);
The cgroup v1 specific fs parser expects a string for the "source"
parameter. However, it is perfectly legitimate to e.g. specify a file
descriptor for the "source" parameter. The fs parser doesn't know what
a filesystem allows there. So it's a bug to assume that "source" is
always of type fs_value_is_string when it can reasonably also be
fs_value_is_file.
This assumption in the cgroup code causes a UAF because struct
fs_parameter uses a union for the actual value. Access to that union is
guarded by the param->type member. Since the cgroup paramter parser
didn't check param->type but unconditionally moved param->string into
fc->source a close on the fscontext_fd would trigger a UAF during
put_fs_context() which frees fc->source thereby freeing the file stashed
in param->file causing a UAF during a close of the fd_null.
Fix this by verifying that param->type is actually a string and report
an error if not.
In follow up patches I'll add a new generic helper that can be used here
and by other filesystems instead of this error-prone copy-pasta fix.
But fixing it in here first makes backporting a it to stable a lot
easier.
Fixes: 8d2451f499 ("cgroup1: switch to option-by-option parsing")
Reported-by: syzbot+283ce5a46486d6acdbaf@syzkaller.appspotmail.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: <stable@kernel.org>
Cc: syzkaller-bugs <syzkaller-bugs@googlegroups.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmD1LZUACgkQONu9yGCS
aT7gERAArjYemBSkD4/nq5HvxoVu7ueEyqI2orJCyB6b5npPrBZjlKna3SuYuNUF
CmcX5Y2Ynxd3gWJvYFEJdAAQAEtKAzPdPv0QJ+KiLSP2bEZ4Q5grEJXfzcgxcB4L
fUfaCZnUwjoII5xzW1+U2zCJ7YtGx5hZySLKMxSEc0IzawDlfMx4HdwlohXzczsA
Zq3/sTJcW30PWSp6MSuMOH//lPh7sAoCnksAv4Yb8MZYZjC8JNnKFn+IwRUGWEMZ
sFtNbq51sMgGq4TjYBIdO6wBElP4dgWJhYc4cO0667cDvgp6iod/bKlOLJSiwIVX
uPWkyPihH9ZUvNVY0TxjjnxS1rnM0QhH+cEXNGn+SE7KaNmzsI2MaR+DVA2yWcFr
9edqTq5x2CJJ0R/oXHP4nYFtsvlV/QcirlrF0OZHYwz84b16f37Ac7tHOQpr3kNO
N29AW0l5XmpxbfHgo1Iaoi02seouLC47vRkvjTpS9mValGUC0ciXTb97CK4FremE
34rskxIRWBU8HECYioFOeHTAi0+xb/9tOj87BnB5CJ28CD2Md27TTdorsISnqPb0
/ER89QfVtlJLi17wGB0rlAm8fDF3Cy6BnA57QIql1z1NbJGc1cxenEAFiIOnbQ5G
t6SLs3mgqpQizsKUFYimCWa04ZzY4Bg8H9bAI+M+9w7J6yujQzI=
=dGMa
-----END PGP SIGNATURE-----
Merge 5.10.51 into android12-5.10-lts
Changes in 5.10.51
drm/mxsfb: Don't select DRM_KMS_FB_HELPER
drm/zte: Don't select DRM_KMS_FB_HELPER
drm/ast: Fixed CVE for DP501
drm/amd/display: fix HDCP reset sequence on reinitialize
drm/amd/amdgpu/sriov disable all ip hw status by default
drm/vc4: fix argument ordering in vc4_crtc_get_margins()
drm/bridge: nwl-dsi: Force a full modeset when crtc_state->active is changed to be true
net: pch_gbe: Use proper accessors to BE data in pch_ptp_match()
drm/amd/display: fix use_max_lb flag for 420 pixel formats
clk: renesas: rcar-usb2-clock-sel: Fix error handling in .probe()
hugetlb: clear huge pte during flush function on mips platform
atm: iphase: fix possible use-after-free in ia_module_exit()
mISDN: fix possible use-after-free in HFC_cleanup()
atm: nicstar: Fix possible use-after-free in nicstar_cleanup()
net: Treat __napi_schedule_irqoff() as __napi_schedule() on PREEMPT_RT
drm/mediatek: Fix PM reference leak in mtk_crtc_ddp_hw_init()
net: mdio: ipq8064: add regmap config to disable REGCACHE
drm/bridge: lt9611: Add missing MODULE_DEVICE_TABLE
reiserfs: add check for invalid 1st journal block
drm/virtio: Fix double free on probe failure
net: mdio: provide shim implementation of devm_of_mdiobus_register
net/sched: cls_api: increase max_reclassify_loop
pinctrl: equilibrium: Add missing MODULE_DEVICE_TABLE
drm/scheduler: Fix hang when sched_entity released
drm/sched: Avoid data corruptions
udf: Fix NULL pointer dereference in udf_symlink function
drm/vc4: Fix clock source for VEC PixelValve on BCM2711
drm/vc4: hdmi: Fix PM reference leak in vc4_hdmi_encoder_pre_crtc_co()
e100: handle eeprom as little endian
igb: handle vlan types with checker enabled
igb: fix assignment on big endian machines
drm/bridge: cdns: Fix PM reference leak in cdns_dsi_transfer()
clk: renesas: r8a77995: Add ZA2 clock
net/mlx5e: IPsec/rep_tc: Fix rep_tc_update_skb drops IPsec packet
net/mlx5: Fix lag port remapping logic
drm: rockchip: add missing registers for RK3188
drm: rockchip: add missing registers for RK3066
net: stmmac: the XPCS obscures a potential "PHY not found" error
RDMA/rtrs: Change MAX_SESS_QUEUE_DEPTH
clk: tegra: Fix refcounting of gate clocks
clk: tegra: Ensure that PLLU configuration is applied properly
drm: bridge: cdns-mhdp8546: Fix PM reference leak in
virtio-net: Add validation for used length
ipv6: use prandom_u32() for ID generation
MIPS: cpu-probe: Fix FPU detection on Ingenic JZ4760(B)
MIPS: ingenic: Select CPU_SUPPORTS_CPUFREQ && MIPS_EXTERNAL_TIMER
drm/amd/display: Avoid HDCP over-read and corruption
drm/amdgpu: remove unsafe optimization to drop preamble ib
net: tcp better handling of reordering then loss cases
RDMA/cxgb4: Fix missing error code in create_qp()
dm space maps: don't reset space map allocation cursor when committing
dm writecache: don't split bios when overwriting contiguous cache content
dm: Fix dm_accept_partial_bio() relative to zone management commands
net: bridge: mrp: Update ring transitions.
pinctrl: mcp23s08: fix race condition in irq handler
ice: set the value of global config lock timeout longer
ice: fix clang warning regarding deadcode.DeadStores
virtio_net: Remove BUG() to avoid machine dead
net: mscc: ocelot: check return value after calling platform_get_resource()
net: bcmgenet: check return value after calling platform_get_resource()
net: mvpp2: check return value after calling platform_get_resource()
net: micrel: check return value after calling platform_get_resource()
net: moxa: Use devm_platform_get_and_ioremap_resource()
drm/amd/display: Fix DCN 3.01 DSCCLK validation
drm/amd/display: Update scaling settings on modeset
drm/amd/display: Release MST resources on switch from MST to SST
drm/amd/display: Set DISPCLK_MAX_ERRDET_CYCLES to 7
drm/amd/display: Fix off-by-one error in DML
net: phy: realtek: add delay to fix RXC generation issue
selftests: Clean forgotten resources as part of cleanup()
net: sgi: ioc3-eth: check return value after calling platform_get_resource()
drm/amdkfd: use allowed domain for vmbo validation
fjes: check return value after calling platform_get_resource()
selinux: use __GFP_NOWARN with GFP_NOWAIT in the AVC
r8169: avoid link-up interrupt issue on RTL8106e if user enables ASPM
drm/amd/display: Verify Gamma & Degamma LUT sizes in amdgpu_dm_atomic_check
xfrm: Fix error reporting in xfrm_state_construct.
dm writecache: commit just one block, not a full page
wlcore/wl12xx: Fix wl12xx get_mac error if device is in ELP
wl1251: Fix possible buffer overflow in wl1251_cmd_scan
cw1200: add missing MODULE_DEVICE_TABLE
drm/amdkfd: fix circular locking on get_wave_state
drm/amdkfd: Fix circular lock in nocpsch path
bpf: Fix up register-based shifts in interpreter to silence KUBSAN
ice: fix incorrect payload indicator on PTYPE
ice: mark PTYPE 2 as reserved
mt76: mt7615: fix fixed-rate tx status reporting
net: fix mistake path for netdev_features_strings
net: ipa: Add missing of_node_put() in ipa_firmware_load()
net: sched: fix error return code in tcf_del_walker()
io_uring: fix false WARN_ONCE
drm/amdgpu: fix bad address translation for sienna_cichlid
drm/amdkfd: Walk through list with dqm lock hold
mt76: mt7915: fix IEEE80211_HE_PHY_CAP7_MAX_NC for station mode
rtl8xxxu: Fix device info for RTL8192EU devices
MIPS: add PMD table accounting into MIPS'pmd_alloc_one
net: fec: add ndo_select_queue to fix TX bandwidth fluctuations
atm: nicstar: use 'dma_free_coherent' instead of 'kfree'
atm: nicstar: register the interrupt handler in the right place
vsock: notify server to shutdown when client has pending signal
RDMA/rxe: Don't overwrite errno from ib_umem_get()
iwlwifi: mvm: don't change band on bound PHY contexts
iwlwifi: mvm: fix error print when session protection ends
iwlwifi: pcie: free IML DMA memory allocation
iwlwifi: pcie: fix context info freeing
sfc: avoid double pci_remove of VFs
sfc: error code if SRIOV cannot be disabled
wireless: wext-spy: Fix out-of-bounds warning
cfg80211: fix default HE tx bitrate mask in 2G band
mac80211: consider per-CPU statistics if present
mac80211_hwsim: add concurrent channels scanning support over virtio
IB/isert: Align target max I/O size to initiator size
media, bpf: Do not copy more entries than user space requested
net: ip: avoid OOM kills with large UDP sends over loopback
RDMA/cma: Fix rdma_resolve_route() memory leak
Bluetooth: btusb: Fixed too many in-token issue for Mediatek Chip.
Bluetooth: Fix the HCI to MGMT status conversion table
Bluetooth: Fix alt settings for incoming SCO with transparent coding format
Bluetooth: Shutdown controller after workqueues are flushed or cancelled
Bluetooth: btusb: Add a new QCA_ROME device (0cf3:e500)
Bluetooth: L2CAP: Fix invalid access if ECRED Reconfigure fails
Bluetooth: L2CAP: Fix invalid access on ECRED Connection response
Bluetooth: btusb: Add support USB ALT 3 for WBS
Bluetooth: mgmt: Fix the command returns garbage parameter value
Bluetooth: btusb: fix bt fiwmare downloading failure issue for qca btsoc.
sched/fair: Ensure _sum and _avg values stay consistent
bpf: Fix false positive kmemleak report in bpf_ringbuf_area_alloc()
flow_offload: action should not be NULL when it is referenced
sctp: validate from_addr_param return
sctp: add size validation when walking chunks
MIPS: loongsoon64: Reserve memory below starting pfn to prevent Oops
MIPS: set mips32r5 for virt extensions
selftests/resctrl: Fix incorrect parsing of option "-t"
MIPS: MT extensions are not available on MIPS32r1
ath11k: unlock on error path in ath11k_mac_op_add_interface()
arm64: dts: rockchip: add rk3328 dwc3 usb controller node
arm64: dts: rockchip: Enable USB3 for rk3328 Rock64
loop: fix I/O error on fsync() in detached loop devices
mm,hwpoison: return -EBUSY when migration fails
io_uring: simplify io_remove_personalities()
io_uring: Convert personality_idr to XArray
io_uring: convert io_buffer_idr to XArray
scsi: iscsi: Fix race condition between login and sync thread
scsi: iscsi: Fix iSCSI cls conn state
powerpc/mm: Fix lockup on kernel exec fault
powerpc/barrier: Avoid collision with clang's __lwsync macro
powerpc/powernv/vas: Release reference to tgid during window close
drm/amdgpu: Update NV SIMD-per-CU to 2
drm/amdgpu: enable sdma0 tmz for Raven/Renoir(V2)
drm/radeon: Add the missed drm_gem_object_put() in radeon_user_framebuffer_create()
drm/radeon: Call radeon_suspend_kms() in radeon_pci_shutdown() for Loongson64
drm/vc4: txp: Properly set the possible_crtcs mask
drm/vc4: crtc: Skip the TXP
drm/vc4: hdmi: Prevent clock unbalance
drm/dp: Handle zeroed port counts in drm_dp_read_downstream_info()
drm/rockchip: dsi: remove extra component_del() call
drm/amd/display: fix incorrrect valid irq check
pinctrl/amd: Add device HID for new AMD GPIO controller
drm/amd/display: Reject non-zero src_y and src_x for video planes
drm/tegra: Don't set allow_fb_modifiers explicitly
drm/msm/mdp4: Fix modifier support enabling
drm/arm/malidp: Always list modifiers
drm/nouveau: Don't set allow_fb_modifiers explicitly
drm/i915/display: Do not zero past infoframes.vsc
mmc: sdhci-acpi: Disable write protect detection on Toshiba Encore 2 WT8-B
mmc: sdhci: Fix warning message when accessing RPMB in HS400 mode
mmc: core: clear flags before allowing to retune
mmc: core: Allow UHS-I voltage switch for SDSC cards if supported
ata: ahci_sunxi: Disable DIPM
arm64: tlb: fix the TTL value of tlb_get_level
cpu/hotplug: Cure the cpusets trainwreck
clocksource/arm_arch_timer: Improve Allwinner A64 timer workaround
fpga: stratix10-soc: Add missing fpga_mgr_free() call
ASoC: tegra: Set driver_name=tegra for all machine drivers
i40e: fix PTP on 5Gb links
qemu_fw_cfg: Make fw_cfg_rev_attr a proper kobj_attribute
ipmi/watchdog: Stop watchdog timer when the current action is 'none'
thermal/drivers/int340x/processor_thermal: Fix tcc setting
ubifs: Fix races between xattr_{set|get} and listxattr operations
power: supply: ab8500: Fix an old bug
mfd: syscon: Free the allocated name field of struct regmap_config
nvmem: core: add a missing of_node_put
lkdtm/bugs: XFAIL UNALIGNED_LOAD_STORE_WRITE
selftests/lkdtm: Fix expected text for CR4 pinning
extcon: intel-mrfld: Sync hardware and software state on init
seq_buf: Fix overflow in seq_buf_putmem_hex()
rq-qos: fix missed wake-ups in rq_qos_throttle try two
tracing: Simplify & fix saved_tgids logic
tracing: Resize tgid_map to pid_max, not PID_MAX_DEFAULT
ipack/carriers/tpci200: Fix a double free in tpci200_pci_probe
coresight: Propagate symlink failure
coresight: tmc-etf: Fix global-out-of-bounds in tmc_update_etf_buffer()
dm zoned: check zone capacity
dm writecache: flush origin device when writing and cache is full
dm btree remove: assign new_root only when removal succeeds
PCI: Leave Apple Thunderbolt controllers on for s2idle or standby
PCI: aardvark: Fix checking for PIO Non-posted Request
PCI: aardvark: Implement workaround for the readback value of VEND_ID
media: subdev: disallow ioctl for saa6588/davinci
media: dtv5100: fix control-request directions
media: zr364xx: fix memory leak in zr364xx_start_readpipe
media: gspca/sq905: fix control-request direction
media: gspca/sunplus: fix zero-length control requests
media: uvcvideo: Fix pixel format change for Elgato Cam Link 4K
io_uring: fix clear IORING_SETUP_R_DISABLED in wrong function
dm writecache: write at least 4k when committing
pinctrl: mcp23s08: Fix missing unlock on error in mcp23s08_irq()
drm/ast: Remove reference to struct drm_device.pdev
jfs: fix GPF in diFree
smackfs: restrict bytes count in smk_set_cipso()
ext4: fix memory leak in ext4_fill_super
f2fs: fix to avoid racing on fsync_entry_slab by multi filesystem instances
Linux 5.10.51
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Icb10fed733a0050848ecc23db13ae3d134895acd
commit 4030a6e6a6 upstream.
Currently tgid_map is sized at PID_MAX_DEFAULT entries, which means that
on systems where pid_max is configured higher than PID_MAX_DEFAULT the
ftrace record-tgid option doesn't work so well. Any tasks with PIDs
higher than PID_MAX_DEFAULT are simply not recorded in tgid_map, and
don't show up in the saved_tgids file.
In particular since systemd v243 & above configure pid_max to its
highest possible 1<<22 value by default on 64 bit systems this renders
the record-tgids option of little use.
Increase the size of tgid_map to the configured pid_max instead,
allowing it to cover the full range of PIDs up to the maximum value of
PID_MAX_LIMIT if the system is configured that way.
On 64 bit systems with pid_max == PID_MAX_LIMIT this will increase the
size of tgid_map from 256KiB to 16MiB. Whilst this 64x increase in
memory overhead sounds significant 64 bit systems are presumably best
placed to accommodate it, and since tgid_map is only allocated when the
record-tgid option is actually used presumably the user would rather it
spends sufficient memory to actually record the tgids they expect.
The size of tgid_map could also increase for CONFIG_BASE_SMALL=y
configurations, but these seem unlikely to be systems upon which people
are both configuring a large pid_max and running ftrace with record-tgid
anyway.
Of note is that we only allocate tgid_map once, the first time that the
record-tgid option is enabled. Therefore its size is only set once, to
the value of pid_max at the time the record-tgid option is first
enabled. If a user increases pid_max after that point, the saved_tgids
file will not contain entries for any tasks with pids beyond the earlier
value of pid_max.
Link: https://lkml.kernel.org/r/20210701172407.889626-2-paulburton@google.com
Fixes: d914ba37d7 ("tracing: Add support for recording tgid of tasks")
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Paul Burton <paulburton@google.com>
[ Fixed comment coding style ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b81b3e959a upstream.
The tgid_map array records a mapping from pid to tgid, where the index
of an entry within the array is the pid & the value stored at that index
is the tgid.
The saved_tgids_next() function iterates over pointers into the tgid_map
array & dereferences the pointers which results in the tgid, but then it
passes that dereferenced value to trace_find_tgid() which treats it as a
pid & does a further lookup within the tgid_map array. It seems likely
that the intent here was to skip over entries in tgid_map for which the
recorded tgid is zero, but instead we end up skipping over entries for
which the thread group leader hasn't yet had its own tgid recorded in
tgid_map.
A minimal fix would be to remove the call to trace_find_tgid, turning:
if (trace_find_tgid(*ptr))
into:
if (*ptr)
..but it seems like this logic can be much simpler if we simply let
seq_read() iterate over the whole tgid_map array & filter out empty
entries by returning SEQ_SKIP from saved_tgids_show(). Here we take that
approach, removing the incorrect logic here entirely.
Link: https://lkml.kernel.org/r/20210630003406.4013668-1-paulburton@google.com
Fixes: d914ba37d7 ("tracing: Add support for recording tgid of tasks")
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Paul Burton <paulburton@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 11c7aa0dde upstream.
Commit 545fbd0775 ("rq-qos: fix missed wake-ups in rq_qos_throttle")
tried to fix a problem that a process could be sleeping in rq_qos_wait()
without anyone to wake it up. However the fix is not complete and the
following can still happen:
CPU1 (waiter1) CPU2 (waiter2) CPU3 (waker)
rq_qos_wait() rq_qos_wait()
acquire_inflight_cb() -> fails
acquire_inflight_cb() -> fails
completes IOs, inflight
decreased
prepare_to_wait_exclusive()
prepare_to_wait_exclusive()
has_sleeper = !wq_has_single_sleeper() -> true as there are two sleepers
has_sleeper = !wq_has_single_sleeper() -> true
io_schedule() io_schedule()
Deadlock as now there's nobody to wakeup the two waiters. The logic
automatically blocking when there are already sleepers is really subtle
and the only way to make it work reliably is that we check whether there
are some waiters in the queue when adding ourselves there. That way, we
are guaranteed that at least the first process to enter the wait queue
will recheck the waiting condition before going to sleep and thus
guarantee forward progress.
Fixes: 545fbd0775 ("rq-qos: fix missed wake-ups in rq_qos_throttle")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210607112613.25344-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b22afcdf04 upstream.
Alexey and Joshua tried to solve a cpusets related hotplug problem which is
user space visible and results in unexpected behaviour for some time after
a CPU has been plugged in and the corresponding uevent was delivered.
cpusets delegate the hotplug work (rebuilding cpumasks etc.) to a
workqueue. This is done because the cpusets code has already a lock
nesting of cgroups_mutex -> cpu_hotplug_lock. A synchronous callback or
waiting for the work to finish with cpu_hotplug_lock held can and will
deadlock because that results in the reverse lock order.
As a consequence the uevent can be delivered before cpusets have consistent
state which means that a user space invocation of sched_setaffinity() to
move a task to the plugged CPU fails up to the point where the scheduled
work has been processed.
The same is true for CPU unplug, but that does not create user observable
failure (yet).
It's still inconsistent to claim that an operation is finished before it
actually is and that's the real issue at hand. uevents just make it
reliably observable.
Obviously the problem should be fixed in cpusets/cgroups, but untangling
that is pretty much impossible because according to the changelog of the
commit which introduced this 8 years ago:
3a5a6d0c2b03("cpuset: don't nest cgroup_mutex inside get_online_cpus()")
the lock order cgroups_mutex -> cpu_hotplug_lock is a design decision and
the whole code is built around that.
So bite the bullet and invoke the relevant cpuset function, which waits for
the work to finish, in _cpu_up/down() after dropping cpu_hotplug_lock and
only when tasks are not frozen by suspend/hibernate because that would
obviously wait forever.
Waiting there with cpu_add_remove_lock, which is protecting the present
and possible CPU maps, held is not a problem at all because neither work
queues nor cpusets/cgroups have any lockchains related to that lock.
Waiting in the hotplug machinery is not problematic either because there
are already state callbacks which wait for hardware queues to drain. It
makes the operations slightly slower, but hotplug is slow anyway.
This ensures that state is consistent before returning from a hotplug
up/down operation. It's still inconsistent during the operation, but that's
a different story.
Add a large comment which explains why this is done and why this is not a
dump ground for the hack of the day to work around half thought out locking
schemes. Document also the implications vs. hotplug operations and
serialization or the lack of it.
Thanks to Alexy and Joshua for analyzing why this temporary
sched_setaffinity() failure happened.
Fixes: 3a5a6d0c2b03("cpuset: don't nest cgroup_mutex inside get_online_cpus()")
Reported-by: Alexey Klimov <aklimov@redhat.com>
Reported-by: Joshua Baker <jobaker@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Alexey Klimov <aklimov@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87tuowcnv3.ffs@nanos.tec.linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit ccff81e1d0 ]
kmemleak scans struct page, but it does not scan the page content. If we
allocate some memory with kmalloc(), then allocate page with alloc_page(),
and if we put kmalloc pointer somewhere inside that page, kmemleak will
report kmalloc pointer as a false positive.
We can instruct kmemleak to scan the memory area by calling kmemleak_alloc()
and kmemleak_free(), but part of struct bpf_ringbuf is mmaped to user space,
and if struct bpf_ringbuf changes we would have to revisit and review size
argument in kmemleak_alloc(), because we do not want kmemleak to scan the
user space memory. Let's simplify things and use kmemleak_not_leak() here.
For posterity, also adding additional prior analysis from Andrii:
I think either kmemleak or syzbot are misreporting this. I've added a
bunch of printks around all allocations performed by BPF ringbuf. [...]
On repro side I get these two warnings:
[vmuser@archvm bpf]$ sudo ./repro
BUG: memory leak
unreferenced object 0xffff88810d538c00 (size 64):
comm "repro", pid 2140, jiffies 4294692933 (age 14.540s)
hex dump (first 32 bytes):
00 af 19 04 00 ea ff ff c0 ae 19 04 00 ea ff ff ................
80 ae 19 04 00 ea ff ff c0 29 2e 04 00 ea ff ff .........)......
backtrace:
[<0000000077bfbfbd>] __bpf_map_area_alloc+0x31/0xc0
[<00000000587fa522>] ringbuf_map_alloc.cold.4+0x48/0x218
[<0000000044d49e96>] __do_sys_bpf+0x359/0x1d90
[<00000000f601d565>] do_syscall_64+0x2d/0x40
[<0000000043d3112a>] entry_SYSCALL_64_after_hwframe+0x44/0xae
BUG: memory leak
unreferenced object 0xffff88810d538c80 (size 64):
comm "repro", pid 2143, jiffies 4294699025 (age 8.448s)
hex dump (first 32 bytes):
80 aa 19 04 00 ea ff ff 00 ab 19 04 00 ea ff ff ................
c0 ab 19 04 00 ea ff ff 80 44 28 04 00 ea ff ff .........D(.....
backtrace:
[<0000000077bfbfbd>] __bpf_map_area_alloc+0x31/0xc0
[<00000000587fa522>] ringbuf_map_alloc.cold.4+0x48/0x218
[<0000000044d49e96>] __do_sys_bpf+0x359/0x1d90
[<00000000f601d565>] do_syscall_64+0x2d/0x40
[<0000000043d3112a>] entry_SYSCALL_64_after_hwframe+0x44/0xae
Note that both reported leaks (ffff88810d538c80 and ffff88810d538c00)
correspond to pages array bpf_ringbuf is allocating and tracking properly
internally. Note also that syzbot repro doesn't close FD of created BPF
ringbufs, and even when ./repro itself exits with error, there are still
two forked processes hanging around in my system. So clearly ringbuf maps
are alive at that point. So reporting any memory leak looks weird at that
point, because that memory is being used by active referenced BPF ringbuf.
It's also a question why repro doesn't clean up its forks. But if I do a
`pkill repro`, I do see that all the allocated memory is /properly/ cleaned
up [and the] "leaks" are deallocated properly.
BTW, if I add close() right after bpf() syscall in syzbot repro, I see that
everything is immediately deallocated, like designed. And no memory leak
is reported. So I don't think the problem is anywhere in bpf_ringbuf code,
rather in the leak detection and/or repro itself.
Reported-by: syzbot+5d895828587f49e7fe9b@syzkaller.appspotmail.com
Signed-off-by: Rustam Kovhaev <rkovhaev@gmail.com>
[ Daniel: also included analysis from Andrii to the commit log ]
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: syzbot+5d895828587f49e7fe9b@syzkaller.appspotmail.com
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/CAEf4BzYk+dqs+jwu6VKXP-RttcTEGFe+ySTGWT9CRNkagDiJVA@mail.gmail.com
Link: https://lore.kernel.org/lkml/YNTAqiE7CWJhOK2M@nuc10
Link: https://lore.kernel.org/lkml/20210615101515.GC26027@arm.com
Link: https://syzkaller.appspot.com/bug?extid=5d895828587f49e7fe9b
Link: https://lore.kernel.org/bpf/20210626181156.1873604-1-rkovhaev@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1c35b07e6d ]
The _sum and _avg values are in general sync together with the PELT
divider. They are however not always completely in perfect sync,
resulting in situations where _sum gets to zero while _avg stays
positive. Such situations are undesirable.
This comes from the fact that PELT will increase period_contrib, also
increasing the PELT divider, without updating _sum and _avg values to
stay in perfect sync where (_sum == _avg * divider). However, such PELT
change will never lower _sum, making it impossible to end up in a
situation where _sum is zero and _avg is not.
Therefore, we need to ensure that when subtracting load outside PELT,
that when _sum is zero, _avg is also set to zero. This occurs when
(_sum < _avg * divider), and the subtracted (_avg * divider) is bigger
or equal to the current _sum, while the subtracted _avg is smaller than
the current _avg.
Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Odin Ugedal <odin@uged.al>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Link: https://lore.kernel.org/r/20210624111815.57937-1-odin@uged.al
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 28131e9d93 ]
syzbot reported a shift-out-of-bounds that KUBSAN observed in the
interpreter:
[...]
UBSAN: shift-out-of-bounds in kernel/bpf/core.c:1420:2
shift exponent 255 is too large for 64-bit type 'long long unsigned int'
CPU: 1 PID: 11097 Comm: syz-executor.4 Not tainted 5.12.0-rc2-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack+0x141/0x1d7 lib/dump_stack.c:120
ubsan_epilogue+0xb/0x5a lib/ubsan.c:148
__ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:327
___bpf_prog_run.cold+0x19/0x56c kernel/bpf/core.c:1420
__bpf_prog_run32+0x8f/0xd0 kernel/bpf/core.c:1735
bpf_dispatcher_nop_func include/linux/bpf.h:644 [inline]
bpf_prog_run_pin_on_cpu include/linux/filter.h:624 [inline]
bpf_prog_run_clear_cb include/linux/filter.h:755 [inline]
run_filter+0x1a1/0x470 net/packet/af_packet.c:2031
packet_rcv+0x313/0x13e0 net/packet/af_packet.c:2104
dev_queue_xmit_nit+0x7c2/0xa90 net/core/dev.c:2387
xmit_one net/core/dev.c:3588 [inline]
dev_hard_start_xmit+0xad/0x920 net/core/dev.c:3609
__dev_queue_xmit+0x2121/0x2e00 net/core/dev.c:4182
__bpf_tx_skb net/core/filter.c:2116 [inline]
__bpf_redirect_no_mac net/core/filter.c:2141 [inline]
__bpf_redirect+0x548/0xc80 net/core/filter.c:2164
____bpf_clone_redirect net/core/filter.c:2448 [inline]
bpf_clone_redirect+0x2ae/0x420 net/core/filter.c:2420
___bpf_prog_run+0x34e1/0x77d0 kernel/bpf/core.c:1523
__bpf_prog_run512+0x99/0xe0 kernel/bpf/core.c:1737
bpf_dispatcher_nop_func include/linux/bpf.h:644 [inline]
bpf_test_run+0x3ed/0xc50 net/bpf/test_run.c:50
bpf_prog_test_run_skb+0xabc/0x1c50 net/bpf/test_run.c:582
bpf_prog_test_run kernel/bpf/syscall.c:3127 [inline]
__do_sys_bpf+0x1ea9/0x4f00 kernel/bpf/syscall.c:4406
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xae
[...]
Generally speaking, KUBSAN reports from the kernel should be fixed.
However, in case of BPF, this particular report caused concerns since
the large shift is not wrong from BPF point of view, just undefined.
In the verifier, K-based shifts that are >= {64,32} (depending on the
bitwidth of the instruction) are already rejected. The register-based
cases were not given their content might not be known at verification
time. Ideas such as verifier instruction rewrite with an additional
AND instruction for the source register were brought up, but regularly
rejected due to the additional runtime overhead they incur.
As Edward Cree rightly put it:
Shifts by more than insn bitness are legal in the BPF ISA; they are
implementation-defined behaviour [of the underlying architecture],
rather than UB, and have been made legal for performance reasons.
Each of the JIT backends compiles the BPF shift operations to machine
instructions which produce implementation-defined results in such a
case; the resulting contents of the register may be arbitrary but
program behaviour as a whole remains defined.
Guard checks in the fast path (i.e. affecting JITted code) will thus
not be accepted.
The case of division by zero is not truly analogous here, as division
instructions on many of the JIT-targeted architectures will raise a
machine exception / fault on division by zero, whereas (to the best
of my knowledge) none will do so on an out-of-bounds shift.
Given the KUBSAN report only affects the BPF interpreter, but not JITs,
one solution is to add the ANDs with 63 or 31 into ___bpf_prog_run().
That would make the shifts defined, and thus shuts up KUBSAN, and the
compiler would optimize out the AND on any CPU that interprets the shift
amounts modulo the width anyway (e.g., confirmed from disassembly that
on x86-64 and arm64 the generated interpreter code is the same before
and after this fix).
The BPF interpreter is slow path, and most likely compiled out anyway
as distros select BPF_JIT_ALWAYS_ON to avoid speculative execution of
BPF instructions by the interpreter. Given the main argument was to
avoid sacrificing performance, the fact that the AND is optimized away
from compiler for mainstream archs helps as well as a solution moving
forward. Also add a comment on LSH/RSH/ARSH translation for JIT authors
to provide guidance when they see the ___bpf_prog_run() interpreter
code and use it as a model for a new JIT backend.
Reported-by: syzbot+bed360704c521841c85d@syzkaller.appspotmail.com
Reported-by: Kurt Manucredo <fuzzybritches0@gmail.com>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Co-developed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: syzbot+bed360704c521841c85d@syzkaller.appspotmail.com
Cc: Edward Cree <ecree.xilinx@gmail.com>
Link: https://lore.kernel.org/bpf/0000000000008f912605bd30d5d7@google.com
Link: https://lore.kernel.org/bpf/bac16d8d-c174-bdc4-91bd-bfa62b410190@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Allow vendors to obtain a list of modules loaded at given time. Vendor
modules are able to register on part of notifier chain
(register_module_notifer), but a vendor module would never see modules
which are loaded before the one which registers on the notifier chain.
The kernel doesn't offer load order control, so a hook is necessary to
iterate through currently loaded kernel modules.
Bug: 193552324
Change-Id: I3b01cc1b90f8c0c7c21a37992cc7d607316efc7b
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
This change introduces a prctl that allows the user program to control
which PAC keys are enabled in a particular task. The main reason
why this is useful is to enable a userspace ABI that uses PAC to
sign and authenticate function pointers and other pointers exposed
outside of the function, while still allowing binaries conforming
to the ABI to interoperate with legacy binaries that do not sign or
authenticate pointers.
The idea is that a dynamic loader or early startup code would issue
this prctl very early after establishing that a process may load legacy
binaries, but before executing any PAC instructions.
This change adds a small amount of overhead to kernel entry and exit
due to additional required instruction sequences.
On a DragonBoard 845c (Cortex-A75) with the powersave governor, the
overhead of similar instruction sequences was measured as 4.9ns when
simulating the common case where IA is left enabled, or 43.7ns when
simulating the uncommon case where IA is disabled. These numbers can
be seen as the worst case scenario, since in more realistic scenarios
a better performing governor would be used and a newer chip would be
used that would support PAC unlike Cortex-A75 and would be expected
to be faster than Cortex-A75.
On an Apple M1 under a hypervisor, the overhead of the entry/exit
instruction sequences introduced by this patch was measured as 0.3ns
in the case where IA is left enabled, and 33.0ns in the case where
IA is disabled.
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Dave Martin <Dave.Martin@arm.com>
Link: https://linux-review.googlesource.com/id/Ibc41a5e6a76b275efbaa126b31119dc197b927a5
Link: https://lore.kernel.org/r/d6609065f8f40397a4124654eb68c9f490b4d477.1616123271.git.pcc@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Bug: 192536783
(cherry picked from commit 201698626f)
Change-Id: Ic0a21c92a22575f9ec3599fb67bd2931a50b9f04
[quic_eberman@quicinc.com: Resolved merge conflict in
arch/arm64/kernel/process.c]
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
Signed-off-by: Peter Collingbourne <pcc@google.com>
Psi polling mechanism is trying to minimize the number of wakeups to
run psi_poll_work and is currently relying on timer_pending() to detect
when this work is already scheduled. This provides a window of opportunity
for psi_group_change to schedule an immediate psi_poll_work after
poll_timer_fn got called but before psi_poll_work could reschedule itself.
Below is the depiction of this entire window:
poll_timer_fn
wake_up_interruptible(&group->poll_wait);
psi_poll_worker
wait_event_interruptible(group->poll_wait, ...)
psi_poll_work
psi_schedule_poll_work
if (timer_pending(&group->poll_timer)) return;
...
mod_timer(&group->poll_timer, jiffies + delay);
Prior to 461daba06b we used to rely on poll_scheduled atomic which was
reset and set back inside psi_poll_work and therefore this race window
was much smaller.
The larger window causes increased number of wakeups and our partners
report visible power regression of ~10mA after applying 461daba06b.
Bring back the poll_scheduled atomic and make this race window even
narrower by resetting poll_scheduled only when we reach polling expiration
time. This does not completely eliminate the possibility of extra wakeups
caused by a race with psi_group_change however it will limit it to the
worst case scenario of one extra wakeup per every tracking window (0.5s
in the worst case).
This patch also ensures correct ordering between clearing poll_scheduled
flag and obtaining changed_states using memory barrier. Correct ordering
between updating changed_states and setting poll_scheduled is ensured by
atomic_xchg operation.
By tracing the number of immediate rescheduling attempts performed by
psi_group_change and the number of these attempts being blocked due to
psi monitor being already active, we can assess the effects of this change:
Before the patch:
Run#1 Run#2 Run#3
Immediate reschedules attempted: 684365 1385156 1261240
Immediate reschedules blocked: 682846 1381654 1258682
Immediate reschedules (delta): 1519 3502 2558
Immediate reschedules (% of attempted): 0.22% 0.25% 0.20%
After the patch:
Run#1 Run#2 Run#3
Immediate reschedules attempted: 882244 770298 426218
Immediate reschedules blocked: 881996 769796 426074
Immediate reschedules (delta): 248 502 144
Immediate reschedules (% of attempted): 0.03% 0.07% 0.03%
The number of non-blocked immediate reschedules dropped from 0.22-0.25%
to 0.03-0.07%. The drop is attributed to the decrease in the race window
size and the fact that we allow this race only when psi monitors reach
polling window expiration time.
Fixes: 461daba06b ("psi: eliminate kthread_worker from psi trigger scheduling mechanism")
Reported-by: Kathleen Chang <yt.chang@mediatek.com>
Reported-by: Wenju Xu <wenju.xu@mediatek.com>
Reported-by: Jonathan Chen <jonathan.jmchen@mediatek.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: SH Chen <show-hong.chen@mediatek.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/patchwork/patch/1455172/
Bug: 191127654
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ie61547ca043e702442a9c6db1468cfb60ff2e729
This reverts commit b2c4d9a33c.
It breaks the abi and is not needed on Android kernels.
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I295a6c8088e4297400c1feebd78abd2f6c5edd7e
This reverts commit 0855952ed4.
It breaks the abi and is not needed on Android kernels.
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I65001e80f3776ae4ab56be5da555dba394278c68