Commit Graph

35699 Commits

Author SHA1 Message Date
Greg Kroah-Hartman
8b444656fa This is the 5.10.56 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmEKcCUACgkQONu9yGCS
 aT7sMw/7BNJDmX9w+p1lgTIJJzSuz8C/eNgbeZgK7CE4DovO+WL9oEm53vqYcDDo
 j5REnrRhxcBYxwG/GXl1Oniv1wHqw0SplV+5G2NH1RMy23eSFGCw+8G+YOEJnU3P
 94hJuEs/43Py7eZV/VtyO2UMdDRnGI6MlNvu18YjnRJcdqIIl2gln1G8wbyySYVb
 wR1rudvtiEdrmTQr7qGxeIrZNKGwFl0KxEl8j9X/aqxvfe8PRVYKlmtwblf5rybe
 TElQxz2XGRgk8g2yWQmnNoU6rfFHdZ4lTnCwfpFA1XE6/HBA64/1p22QTJUZvyOU
 pbQc1MRaoUncGV9UFAMY1j38JFsVar7YHHOcpp9YIJOjoyiAw4aatGDcntdWDCiG
 X1mCSLs10/xGRPaJJXulp786MH4aTR5qIeoNg8mu3Z3In4ElbBW5xr0wa3N8gs3O
 lEnK/gT2MHiQ1boa+Qy3W+XZmOjWtL69JgbOyRcOYS6lkHL4DFlGL2Nn5u8qGfL4
 hzohJzH36W5SUHDQiYTt1wLNu4iHpAECjxcnk9fCvlcHA5Yu1bqgyQ62i3C9RA6a
 /aO0B0yraHmvCAboemDsESwylxmpiRB3caqKtzlaZjoiOfPydcBwJM46ZfbzLNPh
 l+/YKK2tLOXWyRIhEv8183tVeu7mZ02xjsetPtLltZPJqR+SJKE=
 =8nLw
 -----END PGP SIGNATURE-----

Merge 5.10.56 into android12-5.10-lts

Changes in 5.10.56
	selftest: fix build error in tools/testing/selftests/vm/userfaultfd.c
	io_uring: fix null-ptr-deref in io_sq_offload_start()
	x86/asm: Ensure asm/proto.h can be included stand-alone
	pipe: make pipe writes always wake up readers
	btrfs: fix rw device counting in __btrfs_free_extra_devids
	btrfs: mark compressed range uptodate only if all bio succeed
	Revert "ACPI: resources: Add checks for ACPI IRQ override"
	ACPI: DPTF: Fix reading of attributes
	x86/kvm: fix vcpu-id indexed array sizes
	KVM: add missing compat KVM_CLEAR_DIRTY_LOG
	ocfs2: fix zero out valid data
	ocfs2: issue zeroout to EOF blocks
	can: j1939: j1939_xtp_rx_dat_one(): fix rxtimer value between consecutive TP.DT to 750ms
	can: raw: raw_setsockopt(): fix raw_rcv panic for sock UAF
	can: peak_usb: pcan_usb_handle_bus_evt(): fix reading rxerr/txerr values
	can: mcba_usb_start(): add missing urb->transfer_dma initialization
	can: usb_8dev: fix memory leak
	can: ems_usb: fix memory leak
	can: esd_usb2: fix memory leak
	alpha: register early reserved memory in memblock
	HID: wacom: Re-enable touch by default for Cintiq 24HDT / 27QHDT
	NIU: fix incorrect error return, missed in previous revert
	drm/amd/display: ensure dentist display clock update finished in DCN20
	drm/amdgpu: Avoid printing of stack contents on firmware load error
	drm/amdgpu: Fix resource leak on probe error path
	blk-iocost: fix operation ordering in iocg_wake_fn()
	nfc: nfcsim: fix use after free during module unload
	cfg80211: Fix possible memory leak in function cfg80211_bss_update
	RDMA/bnxt_re: Fix stats counters
	bpf: Fix OOB read when printing XDP link fdinfo
	mac80211: fix enabling 4-address mode on a sta vif after assoc
	netfilter: conntrack: adjust stop timestamp to real expiry value
	netfilter: nft_nat: allow to specify layer 4 protocol NAT only
	i40e: Fix logic of disabling queues
	i40e: Fix firmware LLDP agent related warning
	i40e: Fix queue-to-TC mapping on Tx
	i40e: Fix log TC creation failure when max num of queues is exceeded
	tipc: fix implicit-connect for SYN+
	tipc: fix sleeping in tipc accept routine
	net: Set true network header for ECN decapsulation
	net: qrtr: fix memory leaks
	ionic: remove intr coalesce update from napi
	ionic: fix up dim accounting for tx and rx
	ionic: count csum_none when offload enabled
	tipc: do not write skb_shinfo frags when doing decrytion
	octeontx2-pf: Fix interface down flag on error
	mlx4: Fix missing error code in mlx4_load_one()
	KVM: x86: Check the right feature bit for MSR_KVM_ASYNC_PF_ACK access
	net: llc: fix skb_over_panic
	drm/msm/dpu: Fix sm8250_mdp register length
	drm/msm/dp: Initialize the INTF_CONFIG register
	skmsg: Make sk_psock_destroy() static
	net/mlx5: Fix flow table chaining
	net/mlx5e: Fix nullptr in mlx5e_hairpin_get_mdev()
	sctp: fix return value check in __sctp_rcv_asconf_lookup
	tulip: windbond-840: Fix missing pci_disable_device() in probe and remove
	sis900: Fix missing pci_disable_device() in probe and remove
	can: hi311x: fix a signedness bug in hi3110_cmd()
	bpf: Introduce BPF nospec instruction for mitigating Spectre v4
	bpf: Fix leakage due to insufficient speculative store bypass mitigation
	bpf: Remove superfluous aux sanitation on subprog rejection
	bpf: verifier: Allocate idmap scratch in verifier env
	bpf: Fix pointer arithmetic mask tightening under state pruning
	SMB3: fix readpage for large swap cache
	powerpc/pseries: Fix regression while building external modules
	Revert "perf map: Fix dso->nsinfo refcounting"
	i40e: Add additional info to PHY type error
	can: j1939: j1939_session_deactivate(): clarify lifetime of session object
	Linux 5.10.56

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ib3c9244afb7ee5d6ee8d3235efe8956898f486c4
2021-08-04 15:02:23 +02:00
Daniel Borkmann
be561c0154 bpf: Fix pointer arithmetic mask tightening under state pruning
commit e042aa532c upstream.

In 7fedb63a83 ("bpf: Tighten speculative pointer arithmetic mask") we
narrowed the offset mask for unprivileged pointer arithmetic in order to
mitigate a corner case where in the speculative domain it is possible to
advance, for example, the map value pointer by up to value_size-1 out-of-
bounds in order to leak kernel memory via side-channel to user space.

The verifier's state pruning for scalars leaves one corner case open
where in the first verification path R_x holds an unknown scalar with an
aux->alu_limit of e.g. 7, and in a second verification path that same
register R_x, here denoted as R_x', holds an unknown scalar which has
tighter bounds and would thus satisfy range_within(R_x, R_x') as well as
tnum_in(R_x, R_x') for state pruning, yielding an aux->alu_limit of 3:
Given the second path fits the register constraints for pruning, the final
generated mask from aux->alu_limit will remain at 7. While technically
not wrong for the non-speculative domain, it would however be possible
to craft similar cases where the mask would be too wide as in 7fedb63a83.

One way to fix it is to detect the presence of unknown scalar map pointer
arithmetic and force a deeper search on unknown scalars to ensure that
we do not run into a masking mismatch.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-04 12:46:45 +02:00
Lorenz Bauer
ffb9d5c48b bpf: verifier: Allocate idmap scratch in verifier env
commit c9e73e3d2b upstream.

func_states_equal makes a very short lived allocation for idmap,
probably because it's too large to fit on the stack. However the
function is called quite often, leading to a lot of alloc / free
churn. Replace the temporary allocation with dedicated scratch
space in struct bpf_verifier_env.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://lore.kernel.org/bpf/20210429134656.122225-4-lmb@cloudflare.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-04 12:46:45 +02:00
Daniel Borkmann
a11ca29c65 bpf: Remove superfluous aux sanitation on subprog rejection
commit 59089a189e upstream.

Follow-up to fe9a5ca7e3 ("bpf: Do not mark insn as seen under speculative
path verification"). The sanitize_insn_aux_data() helper does not serve a
particular purpose in today's code. The original intention for the helper
was that if function-by-function verification fails, a given program would
be cleared from temporary insn_aux_data[], and then its verification would
be re-attempted in the context of the main program a second time.

However, a failure in do_check_subprogs() will skip do_check_main() and
propagate the error to the user instead, thus such situation can never occur.
Given its interaction is not compatible to the Spectre v1 mitigation (due to
comparing aux->seen with env->pass_cnt), just remove sanitize_insn_aux_data()
to avoid future bugs in this area.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-04 12:46:44 +02:00
Daniel Borkmann
0e9280654a bpf: Fix leakage due to insufficient speculative store bypass mitigation
[ Upstream commit 2039f26f3a ]

Spectre v4 gadgets make use of memory disambiguation, which is a set of
techniques that execute memory access instructions, that is, loads and
stores, out of program order; Intel's optimization manual, section 2.4.4.5:

  A load instruction micro-op may depend on a preceding store. Many
  microarchitectures block loads until all preceding store addresses are
  known. The memory disambiguator predicts which loads will not depend on
  any previous stores. When the disambiguator predicts that a load does
  not have such a dependency, the load takes its data from the L1 data
  cache. Eventually, the prediction is verified. If an actual conflict is
  detected, the load and all succeeding instructions are re-executed.

af86ca4e30 ("bpf: Prevent memory disambiguation attack") tried to mitigate
this attack by sanitizing the memory locations through preemptive "fast"
(low latency) stores of zero prior to the actual "slow" (high latency) store
of a pointer value such that upon dependency misprediction the CPU then
speculatively executes the load of the pointer value and retrieves the zero
value instead of the attacker controlled scalar value previously stored at
that location, meaning, subsequent access in the speculative domain is then
redirected to the "zero page".

The sanitized preemptive store of zero prior to the actual "slow" store is
done through a simple ST instruction based on r10 (frame pointer) with
relative offset to the stack location that the verifier has been tracking
on the original used register for STX, which does not have to be r10. Thus,
there are no memory dependencies for this store, since it's only using r10
and immediate constant of zero; hence af86ca4e30 /assumed/ a low latency
operation.

However, a recent attack demonstrated that this mitigation is not sufficient
since the preemptive store of zero could also be turned into a "slow" store
and is thus bypassed as well:

  [...]
  // r2 = oob address (e.g. scalar)
  // r7 = pointer to map value
  31: (7b) *(u64 *)(r10 -16) = r2
  // r9 will remain "fast" register, r10 will become "slow" register below
  32: (bf) r9 = r10
  // JIT maps BPF reg to x86 reg:
  //  r9  -> r15 (callee saved)
  //  r10 -> rbp
  // train store forward prediction to break dependency link between both r9
  // and r10 by evicting them from the predictor's LRU table.
  33: (61) r0 = *(u32 *)(r7 +24576)
  34: (63) *(u32 *)(r7 +29696) = r0
  35: (61) r0 = *(u32 *)(r7 +24580)
  36: (63) *(u32 *)(r7 +29700) = r0
  37: (61) r0 = *(u32 *)(r7 +24584)
  38: (63) *(u32 *)(r7 +29704) = r0
  39: (61) r0 = *(u32 *)(r7 +24588)
  40: (63) *(u32 *)(r7 +29708) = r0
  [...]
  543: (61) r0 = *(u32 *)(r7 +25596)
  544: (63) *(u32 *)(r7 +30716) = r0
  // prepare call to bpf_ringbuf_output() helper. the latter will cause rbp
  // to spill to stack memory while r13/r14/r15 (all callee saved regs) remain
  // in hardware registers. rbp becomes slow due to push/pop latency. below is
  // disasm of bpf_ringbuf_output() helper for better visual context:
  //
  // ffffffff8117ee20: 41 54                 push   r12
  // ffffffff8117ee22: 55                    push   rbp
  // ffffffff8117ee23: 53                    push   rbx
  // ffffffff8117ee24: 48 f7 c1 fc ff ff ff  test   rcx,0xfffffffffffffffc
  // ffffffff8117ee2b: 0f 85 af 00 00 00     jne    ffffffff8117eee0 <-- jump taken
  // [...]
  // ffffffff8117eee0: 49 c7 c4 ea ff ff ff  mov    r12,0xffffffffffffffea
  // ffffffff8117eee7: 5b                    pop    rbx
  // ffffffff8117eee8: 5d                    pop    rbp
  // ffffffff8117eee9: 4c 89 e0              mov    rax,r12
  // ffffffff8117eeec: 41 5c                 pop    r12
  // ffffffff8117eeee: c3                    ret
  545: (18) r1 = map[id:4]
  547: (bf) r2 = r7
  548: (b7) r3 = 0
  549: (b7) r4 = 4
  550: (85) call bpf_ringbuf_output#194288
  // instruction 551 inserted by verifier    \
  551: (7a) *(u64 *)(r10 -16) = 0            | /both/ are now slow stores here
  // storing map value pointer r7 at fp-16   | since value of r10 is "slow".
  552: (7b) *(u64 *)(r10 -16) = r7           /
  // following "fast" read to the same memory location, but due to dependency
  // misprediction it will speculatively execute before insn 551/552 completes.
  553: (79) r2 = *(u64 *)(r9 -16)
  // in speculative domain contains attacker controlled r2. in non-speculative
  // domain this contains r7, and thus accesses r7 +0 below.
  554: (71) r3 = *(u8 *)(r2 +0)
  // leak r3

As can be seen, the current speculative store bypass mitigation which the
verifier inserts at line 551 is insufficient since /both/, the write of
the zero sanitation as well as the map value pointer are a high latency
instruction due to prior memory access via push/pop of r10 (rbp) in contrast
to the low latency read in line 553 as r9 (r15) which stays in hardware
registers. Thus, architecturally, fp-16 is r7, however, microarchitecturally,
fp-16 can still be r2.

Initial thoughts to address this issue was to track spilled pointer loads
from stack and enforce their load via LDX through r10 as well so that /both/
the preemptive store of zero /as well as/ the load use the /same/ register
such that a dependency is created between the store and load. However, this
option is not sufficient either since it can be bypassed as well under
speculation. An updated attack with pointer spill/fills now _all_ based on
r10 would look as follows:

  [...]
  // r2 = oob address (e.g. scalar)
  // r7 = pointer to map value
  [...]
  // longer store forward prediction training sequence than before.
  2062: (61) r0 = *(u32 *)(r7 +25588)
  2063: (63) *(u32 *)(r7 +30708) = r0
  2064: (61) r0 = *(u32 *)(r7 +25592)
  2065: (63) *(u32 *)(r7 +30712) = r0
  2066: (61) r0 = *(u32 *)(r7 +25596)
  2067: (63) *(u32 *)(r7 +30716) = r0
  // store the speculative load address (scalar) this time after the store
  // forward prediction training.
  2068: (7b) *(u64 *)(r10 -16) = r2
  // preoccupy the CPU store port by running sequence of dummy stores.
  2069: (63) *(u32 *)(r7 +29696) = r0
  2070: (63) *(u32 *)(r7 +29700) = r0
  2071: (63) *(u32 *)(r7 +29704) = r0
  2072: (63) *(u32 *)(r7 +29708) = r0
  2073: (63) *(u32 *)(r7 +29712) = r0
  2074: (63) *(u32 *)(r7 +29716) = r0
  2075: (63) *(u32 *)(r7 +29720) = r0
  2076: (63) *(u32 *)(r7 +29724) = r0
  2077: (63) *(u32 *)(r7 +29728) = r0
  2078: (63) *(u32 *)(r7 +29732) = r0
  2079: (63) *(u32 *)(r7 +29736) = r0
  2080: (63) *(u32 *)(r7 +29740) = r0
  2081: (63) *(u32 *)(r7 +29744) = r0
  2082: (63) *(u32 *)(r7 +29748) = r0
  2083: (63) *(u32 *)(r7 +29752) = r0
  2084: (63) *(u32 *)(r7 +29756) = r0
  2085: (63) *(u32 *)(r7 +29760) = r0
  2086: (63) *(u32 *)(r7 +29764) = r0
  2087: (63) *(u32 *)(r7 +29768) = r0
  2088: (63) *(u32 *)(r7 +29772) = r0
  2089: (63) *(u32 *)(r7 +29776) = r0
  2090: (63) *(u32 *)(r7 +29780) = r0
  2091: (63) *(u32 *)(r7 +29784) = r0
  2092: (63) *(u32 *)(r7 +29788) = r0
  2093: (63) *(u32 *)(r7 +29792) = r0
  2094: (63) *(u32 *)(r7 +29796) = r0
  2095: (63) *(u32 *)(r7 +29800) = r0
  2096: (63) *(u32 *)(r7 +29804) = r0
  2097: (63) *(u32 *)(r7 +29808) = r0
  2098: (63) *(u32 *)(r7 +29812) = r0
  // overwrite scalar with dummy pointer; same as before, also including the
  // sanitation store with 0 from the current mitigation by the verifier.
  2099: (7a) *(u64 *)(r10 -16) = 0         | /both/ are now slow stores here
  2100: (7b) *(u64 *)(r10 -16) = r7        | since store unit is still busy.
  // load from stack intended to bypass stores.
  2101: (79) r2 = *(u64 *)(r10 -16)
  2102: (71) r3 = *(u8 *)(r2 +0)
  // leak r3
  [...]

Looking at the CPU microarchitecture, the scheduler might issue loads (such
as seen in line 2101) before stores (line 2099,2100) because the load execution
units become available while the store execution unit is still busy with the
sequence of dummy stores (line 2069-2098). And so the load may use the prior
stored scalar from r2 at address r10 -16 for speculation. The updated attack
may work less reliable on CPU microarchitectures where loads and stores share
execution resources.

This concludes that the sanitizing with zero stores from af86ca4e30 ("bpf:
Prevent memory disambiguation attack") is insufficient. Moreover, the detection
of stack reuse from af86ca4e30 where previously data (STACK_MISC) has been
written to a given stack slot where a pointer value is now to be stored does
not have sufficient coverage as precondition for the mitigation either; for
several reasons outlined as follows:

 1) Stack content from prior program runs could still be preserved and is
    therefore not "random", best example is to split a speculative store
    bypass attack between tail calls, program A would prepare and store the
    oob address at a given stack slot and then tail call into program B which
    does the "slow" store of a pointer to the stack with subsequent "fast"
    read. From program B PoV such stack slot type is STACK_INVALID, and
    therefore also must be subject to mitigation.

 2) The STACK_SPILL must not be coupled to register_is_const(&stack->spilled_ptr)
    condition, for example, the previous content of that memory location could
    also be a pointer to map or map value. Without the fix, a speculative
    store bypass is not mitigated in such precondition and can then lead to
    a type confusion in the speculative domain leaking kernel memory near
    these pointer types.

While brainstorming on various alternative mitigation possibilities, we also
stumbled upon a retrospective from Chrome developers [0]:

  [...] For variant 4, we implemented a mitigation to zero the unused memory
  of the heap prior to allocation, which cost about 1% when done concurrently
  and 4% for scavenging. Variant 4 defeats everything we could think of. We
  explored more mitigations for variant 4 but the threat proved to be more
  pervasive and dangerous than we anticipated. For example, stack slots used
  by the register allocator in the optimizing compiler could be subject to
  type confusion, leading to pointer crafting. Mitigating type confusion for
  stack slots alone would have required a complete redesign of the backend of
  the optimizing compiler, perhaps man years of work, without a guarantee of
  completeness. [...]

From BPF side, the problem space is reduced, however, options are rather
limited. One idea that has been explored was to xor-obfuscate pointer spills
to the BPF stack:

  [...]
  // preoccupy the CPU store port by running sequence of dummy stores.
  [...]
  2106: (63) *(u32 *)(r7 +29796) = r0
  2107: (63) *(u32 *)(r7 +29800) = r0
  2108: (63) *(u32 *)(r7 +29804) = r0
  2109: (63) *(u32 *)(r7 +29808) = r0
  2110: (63) *(u32 *)(r7 +29812) = r0
  // overwrite scalar with dummy pointer; xored with random 'secret' value
  // of 943576462 before store ...
  2111: (b4) w11 = 943576462
  2112: (af) r11 ^= r7
  2113: (7b) *(u64 *)(r10 -16) = r11
  2114: (79) r11 = *(u64 *)(r10 -16)
  2115: (b4) w2 = 943576462
  2116: (af) r2 ^= r11
  // ... and restored with the same 'secret' value with the help of AX reg.
  2117: (71) r3 = *(u8 *)(r2 +0)
  [...]

While the above would not prevent speculation, it would make data leakage
infeasible by directing it to random locations. In order to be effective
and prevent type confusion under speculation, such random secret would have
to be regenerated for each store. The additional complexity involved for a
tracking mechanism that prevents jumps such that restoring spilled pointers
would not get corrupted is not worth the gain for unprivileged. Hence, the
fix in here eventually opted for emitting a non-public BPF_ST | BPF_NOSPEC
instruction which the x86 JIT translates into a lfence opcode. Inserting the
latter in between the store and load instruction is one of the mitigations
options [1]. The x86 instruction manual notes:

  [...] An LFENCE that follows an instruction that stores to memory might
  complete before the data being stored have become globally visible. [...]

The latter meaning that the preceding store instruction finished execution
and the store is at minimum guaranteed to be in the CPU's store queue, but
it's not guaranteed to be in that CPU's L1 cache at that point (globally
visible). The latter would only be guaranteed via sfence. So the load which
is guaranteed to execute after the lfence for that local CPU would have to
rely on store-to-load forwarding. [2], in section 2.3 on store buffers says:

  [...] For every store operation that is added to the ROB, an entry is
  allocated in the store buffer. This entry requires both the virtual and
  physical address of the target. Only if there is no free entry in the store
  buffer, the frontend stalls until there is an empty slot available in the
  store buffer again. Otherwise, the CPU can immediately continue adding
  subsequent instructions to the ROB and execute them out of order. On Intel
  CPUs, the store buffer has up to 56 entries. [...]

One small upside on the fix is that it lifts constraints from af86ca4e30
where the sanitize_stack_off relative to r10 must be the same when coming
from different paths. The BPF_ST | BPF_NOSPEC gets emitted after a BPF_STX
or BPF_ST instruction. This happens either when we store a pointer or data
value to the BPF stack for the first time, or upon later pointer spills.
The former needs to be enforced since otherwise stale stack data could be
leaked under speculation as outlined earlier. For non-x86 JITs the BPF_ST |
BPF_NOSPEC mapping is currently optimized away, but others could emit a
speculation barrier as well if necessary. For real-world unprivileged
programs e.g. generated by LLVM, pointer spill/fill is only generated upon
register pressure and LLVM only tries to do that for pointers which are not
used often. The program main impact will be the initial BPF_ST | BPF_NOSPEC
sanitation for the STACK_INVALID case when the first write to a stack slot
occurs e.g. upon map lookup. In future we might refine ways to mitigate
the latter cost.

  [0] https://arxiv.org/pdf/1902.05178.pdf
  [1] https://msrc-blog.microsoft.com/2018/05/21/analysis-and-mitigation-of-speculative-store-bypass-cve-2018-3639/
  [2] https://arxiv.org/pdf/1905.05725.pdf

Fixes: af86ca4e30 ("bpf: Prevent memory disambiguation attack")
Fixes: f7cf25b202 ("bpf: track spill/fill of constants")
Co-developed-by: Piotr Krysiuk <piotras@gmail.com>
Co-developed-by: Benedict Schlueter <benedict.schlueter@rub.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Piotr Krysiuk <piotras@gmail.com>
Signed-off-by: Benedict Schlueter <benedict.schlueter@rub.de>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-08-04 12:46:44 +02:00
Daniel Borkmann
bea9e2fd18 bpf: Introduce BPF nospec instruction for mitigating Spectre v4
[ Upstream commit f5e81d1117 ]

In case of JITs, each of the JIT backends compiles the BPF nospec instruction
/either/ to a machine instruction which emits a speculation barrier /or/ to
/no/ machine instruction in case the underlying architecture is not affected
by Speculative Store Bypass or has different mitigations in place already.

This covers both x86 and (implicitly) arm64: In case of x86, we use 'lfence'
instruction for mitigation. In case of arm64, we rely on the firmware mitigation
as controlled via the ssbd kernel parameter. Whenever the mitigation is enabled,
it works for all of the kernel code with no need to provide any additional
instructions here (hence only comment in arm64 JIT). Other archs can follow
as needed. The BPF nospec instruction is specifically targeting Spectre v4
since i) we don't use a serialization barrier for the Spectre v1 case, and
ii) mitigation instructions for v1 and v4 might be different on some archs.

The BPF nospec is required for a future commit, where the BPF verifier does
annotate intermediate BPF programs with speculation barriers.

Co-developed-by: Piotr Krysiuk <piotras@gmail.com>
Co-developed-by: Benedict Schlueter <benedict.schlueter@rub.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Piotr Krysiuk <piotras@gmail.com>
Signed-off-by: Benedict Schlueter <benedict.schlueter@rub.de>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-08-04 12:46:44 +02:00
Greg Kroah-Hartman
1afedcdcf8 This is the 5.10.55 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmEE6r8ACgkQONu9yGCS
 aT54gw/7B1RbIpcTJlhtiQVYvwMJTX/ZgoxkSsffz6G7TXaBbg/x29cuc3cI9IUJ
 fr99zp8gQbQMibXqCaIPBloetSdquX9G+gOmzVSIS+ac2Wx/pH29qRz1fslWq8Mj
 hOg8t2Ce+LoAgGZzSXswhvIL1Zep8HZDlRKKcolmiGEJFBr/4nPfEVOeFNpiTUYF
 cb5RtB1nRR0DJa8wWr18S7v2/TBdR9V7TgGuwRnpVviIQjjfE/S+lmbgtzHgm5Tf
 cBtP7Ta/eIGeLQIyQEoEWV2ViJRmSGk4s2dOcQ/kUMje5de+J03Hkp4jeJbB2ndx
 UUos2Xm4D1j4jsIBZqJKV+Fzf9RzdNXQp4VJIfzvbDbZn8y9xHRcXm/ChNBz/B5L
 8oKcdM9Sf8DxiztlZ4c40IndYjwp0QxlubCYDfwYQfYB/6LLrffrgPZu/TgWw9lj
 lXOBUZq7yHuo9OSAIfGCCtKKnzA63wLv3zTmNmSO9fDoXUl0VHBrZC6apgB6AC9T
 rzvOOgRC6NfWViZl2E3Op7XfOg2jstpRXUr7a0HSaQp4QfmlLJ9HnDywQfNQeX4+
 HyAxhXo8gHPDgh8DK7yRObBhu24yQcy+ukesQwkSS2m0j2ilfQszk0uP1bfFOuq2
 G5Z8VMQPsIMsSijEyuonLRVUPfp8herhWMP0YsZZeEnxZvaSEaM=
 =li6t
 -----END PGP SIGNATURE-----

Merge 5.10.55 into android12-5.10-lts

Changes in 5.10.55
	tools: Allow proper CC/CXX/... override with LLVM=1 in Makefile.include
	io_uring: fix link timeout refs
	KVM: x86: determine if an exception has an error code only when injecting it.
	af_unix: fix garbage collect vs MSG_PEEK
	workqueue: fix UAF in pwq_unbound_release_workfn()
	cgroup1: fix leaked context root causing sporadic NULL deref in LTP
	net/802/mrp: fix memleak in mrp_request_join()
	net/802/garp: fix memleak in garp_request_join()
	net: annotate data race around sk_ll_usec
	sctp: move 198 addresses from unusable to private scope
	rcu-tasks: Don't delete holdouts within trc_inspect_reader()
	rcu-tasks: Don't delete holdouts within trc_wait_for_one_reader()
	ipv6: allocate enough headroom in ip6_finish_output2()
	drm/ttm: add a check against null pointer dereference
	hfs: add missing clean-up in hfs_fill_super
	hfs: fix high memory mapping in hfs_bnode_read
	hfs: add lock nesting notation to hfs_find_init
	firmware: arm_scmi: Fix possible scmi_linux_errmap buffer overflow
	firmware: arm_scmi: Fix range check for the maximum number of pending messages
	cifs: fix the out of range assignment to bit fields in parse_server_interfaces
	iomap: remove the length variable in iomap_seek_data
	iomap: remove the length variable in iomap_seek_hole
	ARM: dts: versatile: Fix up interrupt controller node names
	ipv6: ip6_finish_output2: set sk into newly allocated nskb
	Linux 5.10.55

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I2d673bdde784b3689af73289305091dbd4ead042
2021-07-31 08:51:04 +02:00
Paul E. McKenney
86cb49e731 rcu-tasks: Don't delete holdouts within trc_wait_for_one_reader()
[ Upstream commit a9ab9cce93 ]

Invoking trc_del_holdout() from within trc_wait_for_one_reader() is
only a performance optimization because the RCU Tasks Trace grace-period
kthread will eventually do this within check_all_holdout_tasks_trace().
But it is not a particularly important performance optimization because
it only applies to the grace-period kthread, of which there is but one.
This commit therefore removes this invocation of trc_del_holdout() in
favor of the one in check_all_holdout_tasks_trace() in the grace-period
kthread.

Reported-by: "Xu, Yanfei" <yanfei.xu@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-31 08:16:11 +02:00
Paul E. McKenney
55ddab2bfd rcu-tasks: Don't delete holdouts within trc_inspect_reader()
[ Upstream commit 1d10bf55d8 ]

As Yanfei pointed out, although invoking trc_del_holdout() is safe
from the viewpoint of the integrity of the holdout list itself,
the put_task_struct() invoked by trc_del_holdout() can result in
use-after-free errors due to later accesses to this task_struct structure
by the RCU Tasks Trace grace-period kthread.

This commit therefore removes this call to trc_del_holdout() from
trc_inspect_reader() in favor of the grace-period thread's existing call
to trc_del_holdout(), thus eliminating that particular class of
use-after-free errors.

Reported-by: "Xu, Yanfei" <yanfei.xu@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-31 08:16:11 +02:00
Paul Gortmaker
df34f88862 cgroup1: fix leaked context root causing sporadic NULL deref in LTP
commit 1e7107c5ef upstream.

Richard reported sporadic (roughly one in 10 or so) null dereferences and
other strange behaviour for a set of automated LTP tests.  Things like:

   BUG: kernel NULL pointer dereference, address: 0000000000000008
   #PF: supervisor read access in kernel mode
   #PF: error_code(0x0000) - not-present page
   PGD 0 P4D 0
   Oops: 0000 [#1] PREEMPT SMP PTI
   CPU: 0 PID: 1516 Comm: umount Not tainted 5.10.0-yocto-standard #1
   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-48-gd9c812dda519-prebuilt.qemu.org 04/01/2014
   RIP: 0010:kernfs_sop_show_path+0x1b/0x60

...or these others:

   RIP: 0010:do_mkdirat+0x6a/0xf0
   RIP: 0010:d_alloc_parallel+0x98/0x510
   RIP: 0010:do_readlinkat+0x86/0x120

There were other less common instances of some kind of a general scribble
but the common theme was mount and cgroup and a dubious dentry triggering
the NULL dereference.  I was only able to reproduce it under qemu by
replicating Richard's setup as closely as possible - I never did get it
to happen on bare metal, even while keeping everything else the same.

In commit 71d883c37e ("cgroup_do_mount(): massage calling conventions")
we see this as a part of the overall change:

   --------------
           struct cgroup_subsys *ss;
   -       struct dentry *dentry;

   [...]

   -       dentry = cgroup_do_mount(&cgroup_fs_type, fc->sb_flags, root,
   -                                CGROUP_SUPER_MAGIC, ns);

   [...]

   -       if (percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
   -               struct super_block *sb = dentry->d_sb;
   -               dput(dentry);
   +       ret = cgroup_do_mount(fc, CGROUP_SUPER_MAGIC, ns);
   +       if (!ret && percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
   +               struct super_block *sb = fc->root->d_sb;
   +               dput(fc->root);
                   deactivate_locked_super(sb);
                   msleep(10);
                   return restart_syscall();
           }
   --------------

In changing from the local "*dentry" variable to using fc->root, we now
export/leave that dentry pointer in the file context after doing the dput()
in the unlikely "is_dying" case.   With LTP doing a crazy amount of back to
back mount/unmount [testcases/bin/cgroup_regression_5_1.sh] the unlikely
becomes slightly likely and then bad things happen.

A fix would be to not leave the stale reference in fc->root as follows:

   --------------
                  dput(fc->root);
  +               fc->root = NULL;
                  deactivate_locked_super(sb);
   --------------

...but then we are just open-coding a duplicate of fc_drop_locked() so we
simply use that instead.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: stable@vger.kernel.org      # v5.1+
Reported-by: Richard Purdie <richard.purdie@linuxfoundation.org>
Fixes: 71d883c37e ("cgroup_do_mount(): massage calling conventions")
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-31 08:16:11 +02:00
Yang Yingliang
dcd00801f3 workqueue: fix UAF in pwq_unbound_release_workfn()
commit b42b0bddcb upstream.

I got a UAF report when doing fuzz test:

[  152.880091][ T8030] ==================================================================
[  152.881240][ T8030] BUG: KASAN: use-after-free in pwq_unbound_release_workfn+0x50/0x190
[  152.882442][ T8030] Read of size 4 at addr ffff88810d31bd00 by task kworker/3:2/8030
[  152.883578][ T8030]
[  152.883932][ T8030] CPU: 3 PID: 8030 Comm: kworker/3:2 Not tainted 5.13.0+ #249
[  152.885014][ T8030] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[  152.886442][ T8030] Workqueue: events pwq_unbound_release_workfn
[  152.887358][ T8030] Call Trace:
[  152.887837][ T8030]  dump_stack_lvl+0x75/0x9b
[  152.888525][ T8030]  ? pwq_unbound_release_workfn+0x50/0x190
[  152.889371][ T8030]  print_address_description.constprop.10+0x48/0x70
[  152.890326][ T8030]  ? pwq_unbound_release_workfn+0x50/0x190
[  152.891163][ T8030]  ? pwq_unbound_release_workfn+0x50/0x190
[  152.891999][ T8030]  kasan_report.cold.15+0x82/0xdb
[  152.892740][ T8030]  ? pwq_unbound_release_workfn+0x50/0x190
[  152.893594][ T8030]  __asan_load4+0x69/0x90
[  152.894243][ T8030]  pwq_unbound_release_workfn+0x50/0x190
[  152.895057][ T8030]  process_one_work+0x47b/0x890
[  152.895778][ T8030]  worker_thread+0x5c/0x790
[  152.896439][ T8030]  ? process_one_work+0x890/0x890
[  152.897163][ T8030]  kthread+0x223/0x250
[  152.897747][ T8030]  ? set_kthread_struct+0xb0/0xb0
[  152.898471][ T8030]  ret_from_fork+0x1f/0x30
[  152.899114][ T8030]
[  152.899446][ T8030] Allocated by task 8884:
[  152.900084][ T8030]  kasan_save_stack+0x21/0x50
[  152.900769][ T8030]  __kasan_kmalloc+0x88/0xb0
[  152.901416][ T8030]  __kmalloc+0x29c/0x460
[  152.902014][ T8030]  alloc_workqueue+0x111/0x8e0
[  152.902690][ T8030]  __btrfs_alloc_workqueue+0x11e/0x2a0
[  152.903459][ T8030]  btrfs_alloc_workqueue+0x6d/0x1d0
[  152.904198][ T8030]  scrub_workers_get+0x1e8/0x490
[  152.904929][ T8030]  btrfs_scrub_dev+0x1b9/0x9c0
[  152.905599][ T8030]  btrfs_ioctl+0x122c/0x4e50
[  152.906247][ T8030]  __x64_sys_ioctl+0x137/0x190
[  152.906916][ T8030]  do_syscall_64+0x34/0xb0
[  152.907535][ T8030]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  152.908365][ T8030]
[  152.908688][ T8030] Freed by task 8884:
[  152.909243][ T8030]  kasan_save_stack+0x21/0x50
[  152.909893][ T8030]  kasan_set_track+0x20/0x30
[  152.910541][ T8030]  kasan_set_free_info+0x24/0x40
[  152.911265][ T8030]  __kasan_slab_free+0xf7/0x140
[  152.911964][ T8030]  kfree+0x9e/0x3d0
[  152.912501][ T8030]  alloc_workqueue+0x7d7/0x8e0
[  152.913182][ T8030]  __btrfs_alloc_workqueue+0x11e/0x2a0
[  152.913949][ T8030]  btrfs_alloc_workqueue+0x6d/0x1d0
[  152.914703][ T8030]  scrub_workers_get+0x1e8/0x490
[  152.915402][ T8030]  btrfs_scrub_dev+0x1b9/0x9c0
[  152.916077][ T8030]  btrfs_ioctl+0x122c/0x4e50
[  152.916729][ T8030]  __x64_sys_ioctl+0x137/0x190
[  152.917414][ T8030]  do_syscall_64+0x34/0xb0
[  152.918034][ T8030]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  152.918872][ T8030]
[  152.919203][ T8030] The buggy address belongs to the object at ffff88810d31bc00
[  152.919203][ T8030]  which belongs to the cache kmalloc-512 of size 512
[  152.921155][ T8030] The buggy address is located 256 bytes inside of
[  152.921155][ T8030]  512-byte region [ffff88810d31bc00, ffff88810d31be00)
[  152.922993][ T8030] The buggy address belongs to the page:
[  152.923800][ T8030] page:ffffea000434c600 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10d318
[  152.925249][ T8030] head:ffffea000434c600 order:2 compound_mapcount:0 compound_pincount:0
[  152.926399][ T8030] flags: 0x57ff00000010200(slab|head|node=1|zone=2|lastcpupid=0x7ff)
[  152.927515][ T8030] raw: 057ff00000010200 dead000000000100 dead000000000122 ffff888009c42c80
[  152.928716][ T8030] raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
[  152.929890][ T8030] page dumped because: kasan: bad access detected
[  152.930759][ T8030]
[  152.931076][ T8030] Memory state around the buggy address:
[  152.931851][ T8030]  ffff88810d31bc00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  152.932967][ T8030]  ffff88810d31bc80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  152.934068][ T8030] >ffff88810d31bd00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  152.935189][ T8030]                    ^
[  152.935763][ T8030]  ffff88810d31bd80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  152.936847][ T8030]  ffff88810d31be00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  152.937940][ T8030] ==================================================================

If apply_wqattrs_prepare() fails in alloc_workqueue(), it will call put_pwq()
which invoke a work queue to call pwq_unbound_release_workfn() and use the 'wq'.
The 'wq' allocated in alloc_workqueue() will be freed in error path when
apply_wqattrs_prepare() fails. So it will lead a UAF.

CPU0                                          CPU1
alloc_workqueue()
alloc_and_link_pwqs()
apply_wqattrs_prepare() fails
apply_wqattrs_cleanup()
schedule_work(&pwq->unbound_release_work)
kfree(wq)
                                              worker_thread()
                                              pwq_unbound_release_workfn() <- trigger uaf here

If apply_wqattrs_prepare() fails, the new pwq are not linked, it doesn't
hold any reference to the 'wq', 'wq' is invalid to access in the worker,
so add check pwq if linked to fix this.

Fixes: 2d5f0764b5 ("workqueue: split apply_workqueue_attrs() into 3 stages")
Cc: stable@vger.kernel.org # v4.2+
Reported-by: Hulk Robot <hulkci@huawei.com>
Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Tested-by: Pavel Skripkin <paskripkin@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-31 08:16:11 +02:00
Greg Kroah-Hartman
57e177ea01 Merge branch 'android12-5.10' into android12-5.10-lts
Sync up with android12-5.10 for the following commits:

2a8bfea53d UPSTREAM: kernel/irq: export irq_gc_set_wake
36fbb55631 FROMGIT: procfs: prevent unpriveleged processes accessing fdinfo dir
fc2d64ec5d FROMGIT: f2fs: don't sleep while grabing nat_tree_lock
cf1646cba3 FROMLIST: scsi: ufs: Allow async suspend/resume callbacks
98afdd197f ANDROID: ABI: update generic symbol list and ABI XML
e2e063f507 ANDROID: scsi: ufs: add vendor hook to override key reprogramming
198e728044 ANDROID: GKI: Add rockchip symbol list

Change-Id: Ib1c25472ab1be58611ee2fa61047ceef48f00e7e
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2021-07-30 13:34:01 +02:00
Jianqun Xu
2a8bfea53d UPSTREAM: kernel/irq: export irq_gc_set_wake
Module driver may use irq_gc_set_wake.

Bug: 194515348
Change-Id: I52f43e1dff15d987532395e5151e65419b5904b2
Signed-off-by: Jianqun Xu <jay.xu@rock-chips.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210305080658.2422114-1-jay.xu@rock-chips.com
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
(cherry picked from commit 024c79520f)
Signed-off-by: Kever Yang <kever.yang@rock-chips.com>
2021-07-30 06:41:28 +00:00
Greg Kroah-Hartman
e4cac2c332 This is the 5.10.54 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmEBT00ACgkQONu9yGCS
 aT7svA/9HCRwW+pK3UpK1+0FK7gGH8DA3jSONj775zVEKhboDZNIwZsDqG0Ly+jm
 /JejWXPKZlekaDXgzBfZY3H59xgij/VwYHe8p7cdxfi1TlhmAQwFjLNZnWav8as6
 IyNkpsDJn8fMXmfHDi2u3cb8wrVi/aQDzTlwu88cUtREyZCaaYlo0Fdv9MJyhww/
 p6LWPYQoZ8TmFY+Y/2ORVxFos2UVuU0hhhMdGt9LrX2WNEGRNUZUqmbhcXYfdsX0
 ckSHbijIcWdcka3nQ6yOvdxw75rTqd8c/bP0y+yAteeJ0CykjVnI2cdK+M2ZEi4j
 /JqpGJrRWhsZf5MiO8b3k+I1K62JDa1GYBQ9Amp8FKKzjYLPTNeFAP9IsMyDc4oi
 oW98XM7XzoSEU9t/FSAIGT0hYK9k+lnPxw623LhxD6x3VPynnNAnQsLr+HirOgG6
 mZ79L4ZFu3lUvVsCuCgKn/uxwDopUNlhqo5B2/4M2kSWwe2Xu5bExpGc2bT9xCOP
 6fF9DmvmpG1UPGCXrOqaxemyEPmHqmyjKJpxDt6vZqlOL9vqHez4WmEEM1C+E2NZ
 5VKKbBk/KZDxNX9EiFOtI2HRFb1cghoI2Hcb/QjRoB9Dv3a6cHgjxDl0eKm8SiDN
 +1ytV0IFH3fT4aRiXJ7I3GBwkjKcDaX0sjYwtnCx9s5XZmm9PRQ=
 =HAyL
 -----END PGP SIGNATURE-----

Merge 5.10.54 into android12-5.10-lts

Changes in 5.10.54
	igc: Fix use-after-free error during reset
	igb: Fix use-after-free error during reset
	igc: change default return of igc_read_phy_reg()
	ixgbe: Fix an error handling path in 'ixgbe_probe()'
	igc: Fix an error handling path in 'igc_probe()'
	igb: Fix an error handling path in 'igb_probe()'
	fm10k: Fix an error handling path in 'fm10k_probe()'
	e1000e: Fix an error handling path in 'e1000_probe()'
	iavf: Fix an error handling path in 'iavf_probe()'
	igb: Check if num of q_vectors is smaller than max before array access
	igb: Fix position of assignment to *ring
	gve: Fix an error handling path in 'gve_probe()'
	net: add kcov handle to skb extensions
	bonding: fix suspicious RCU usage in bond_ipsec_add_sa()
	bonding: fix null dereference in bond_ipsec_add_sa()
	ixgbevf: use xso.real_dev instead of xso.dev in callback functions of struct xfrmdev_ops
	bonding: fix suspicious RCU usage in bond_ipsec_del_sa()
	bonding: disallow setting nested bonding + ipsec offload
	bonding: Add struct bond_ipesc to manage SA
	bonding: fix suspicious RCU usage in bond_ipsec_offload_ok()
	bonding: fix incorrect return value of bond_ipsec_offload_ok()
	ipv6: fix 'disable_policy' for fwd packets
	stmmac: platform: Fix signedness bug in stmmac_probe_config_dt()
	selftests: icmp_redirect: remove from checking for IPv6 route get
	selftests: icmp_redirect: IPv6 PMTU info should be cleared after redirect
	pwm: sprd: Ensure configuring period and duty_cycle isn't wrongly skipped
	cxgb4: fix IRQ free race during driver unload
	mptcp: fix warning in __skb_flow_dissect() when do syn cookie for subflow join
	nvme-pci: do not call nvme_dev_remove_admin from nvme_remove
	KVM: x86/pmu: Clear anythread deprecated bit when 0xa leaf is unsupported on the SVM
	perf inject: Fix dso->nsinfo refcounting
	perf map: Fix dso->nsinfo refcounting
	perf probe: Fix dso->nsinfo refcounting
	perf env: Fix sibling_dies memory leak
	perf test session_topology: Delete session->evlist
	perf test event_update: Fix memory leak of evlist
	perf dso: Fix memory leak in dso__new_map()
	perf test maps__merge_in: Fix memory leak of maps
	perf env: Fix memory leak of cpu_pmu_caps
	perf report: Free generated help strings for sort option
	perf script: Fix memory 'threads' and 'cpus' leaks on exit
	perf lzma: Close lzma stream on exit
	perf probe-file: Delete namelist in del_events() on the error path
	perf data: Close all files in close_dir()
	perf sched: Fix record failure when CONFIG_SCHEDSTATS is not set
	ASoC: wm_adsp: Correct wm_coeff_tlv_get handling
	spi: imx: add a check for speed_hz before calculating the clock
	spi: stm32: fixes pm_runtime calls in probe/remove
	regulator: hi6421: Use correct variable type for regmap api val argument
	regulator: hi6421: Fix getting wrong drvdata
	spi: mediatek: fix fifo rx mode
	ASoC: rt5631: Fix regcache sync errors on resume
	bpf, test: fix NULL pointer dereference on invalid expected_attach_type
	bpf: Fix tail_call_reachable rejection for interpreter when jit failed
	xdp, net: Fix use-after-free in bpf_xdp_link_release
	timers: Fix get_next_timer_interrupt() with no timers pending
	liquidio: Fix unintentional sign extension issue on left shift of u16
	s390/bpf: Perform r1 range checking before accessing jit->seen_reg[r1]
	bpf, sockmap: Fix potential memory leak on unlikely error case
	bpf, sockmap, tcp: sk_prot needs inuse_idx set for proc stats
	bpf, sockmap, udp: sk_prot needs inuse_idx set for proc stats
	bpftool: Check malloc return value in mount_bpffs_for_pin
	net: fix uninit-value in caif_seqpkt_sendmsg
	usb: hso: fix error handling code of hso_create_net_device
	dma-mapping: handle vmalloc addresses in dma_common_{mmap,get_sgtable}
	efi/tpm: Differentiate missing and invalid final event log table.
	net: decnet: Fix sleeping inside in af_decnet
	KVM: PPC: Book3S: Fix CONFIG_TRANSACTIONAL_MEM=n crash
	KVM: PPC: Fix kvm_arch_vcpu_ioctl vcpu_load leak
	net: sched: fix memory leak in tcindex_partial_destroy_work
	sctp: trim optlen when it's a huge value in sctp_setsockopt
	netrom: Decrease sock refcount when sock timers expire
	scsi: iscsi: Fix iface sysfs attr detection
	scsi: target: Fix protect handling in WRITE SAME(32)
	spi: cadence: Correct initialisation of runtime PM again
	ACPI: Kconfig: Fix table override from built-in initrd
	bnxt_en: don't disable an already disabled PCI device
	bnxt_en: Refresh RoCE capabilities in bnxt_ulp_probe()
	bnxt_en: Add missing check for BNXT_STATE_ABORT_ERR in bnxt_fw_rset_task()
	bnxt_en: Validate vlan protocol ID on RX packets
	bnxt_en: Check abort error state in bnxt_half_open_nic()
	net: hisilicon: rename CACHE_LINE_MASK to avoid redefinition
	net/tcp_fastopen: fix data races around tfo_active_disable_stamp
	ALSA: hda: intel-dsp-cfg: add missing ElkhartLake PCI ID
	net: hns3: fix possible mismatches resp of mailbox
	net: hns3: fix rx VLAN offload state inconsistent issue
	spi: spi-bcm2835: Fix deadlock
	net/sched: act_skbmod: Skip non-Ethernet packets
	ipv6: fix another slab-out-of-bounds in fib6_nh_flush_exceptions
	ceph: don't WARN if we're still opening a session to an MDS
	nvme-pci: don't WARN_ON in nvme_reset_work if ctrl.state is not RESETTING
	Revert "USB: quirks: ignore remote wake-up on Fibocom L850-GL LTE modem"
	afs: Fix tracepoint string placement with built-in AFS
	r8169: Avoid duplicate sysfs entry creation error
	nvme: set the PRACT bit when using Write Zeroes with T10 PI
	sctp: update active_key for asoc when old key is being replaced
	tcp: disable TFO blackhole logic by default
	net: dsa: sja1105: make VID 4095 a bridge VLAN too
	net: sched: cls_api: Fix the the wrong parameter
	drm/panel: raspberrypi-touchscreen: Prevent double-free
	cifs: only write 64kb at a time when fallocating a small region of a file
	cifs: fix fallocate when trying to allocate a hole.
	proc: Avoid mixing integer types in mem_rw()
	mmc: core: Don't allocate IDA for OF aliases
	s390/ftrace: fix ftrace_update_ftrace_func implementation
	s390/boot: fix use of expolines in the DMA code
	ALSA: usb-audio: Add missing proc text entry for BESPOKEN type
	ALSA: usb-audio: Add registration quirk for JBL Quantum headsets
	ALSA: sb: Fix potential ABBA deadlock in CSP driver
	ALSA: hda/realtek: Fix pop noise and 2 Front Mic issues on a machine
	ALSA: hdmi: Expose all pins on MSI MS-7C94 board
	ALSA: pcm: Call substream ack() method upon compat mmap commit
	ALSA: pcm: Fix mmap capability check
	Revert "usb: renesas-xhci: Fix handling of unknown ROM state"
	usb: xhci: avoid renesas_usb_fw.mem when it's unusable
	xhci: Fix lost USB 2 remote wake
	KVM: PPC: Book3S: Fix H_RTAS rets buffer overflow
	KVM: PPC: Book3S HV Nested: Sanitise H_ENTER_NESTED TM state
	usb: hub: Disable USB 3 device initiated lpm if exit latency is too high
	usb: hub: Fix link power management max exit latency (MEL) calculations
	USB: usb-storage: Add LaCie Rugged USB3-FW to IGNORE_UAS
	usb: max-3421: Prevent corruption of freed memory
	usb: renesas_usbhs: Fix superfluous irqs happen after usb_pkt_pop()
	USB: serial: option: add support for u-blox LARA-R6 family
	USB: serial: cp210x: fix comments for GE CS1000
	USB: serial: cp210x: add ID for CEL EM3588 USB ZigBee stick
	usb: gadget: Fix Unbalanced pm_runtime_enable in tegra_xudc_probe
	usb: dwc2: gadget: Fix GOUTNAK flow for Slave mode.
	usb: dwc2: gadget: Fix sending zero length packet in DDMA mode.
	usb: typec: stusb160x: register role switch before interrupt registration
	firmware/efi: Tell memblock about EFI iomem reservations
	tracepoints: Update static_call before tp_funcs when adding a tracepoint
	tracing/histogram: Rename "cpu" to "common_cpu"
	tracing: Fix bug in rb_per_cpu_empty() that might cause deadloop.
	tracing: Synthetic event field_pos is an index not a boolean
	btrfs: check for missing device in btrfs_trim_fs
	media: ngene: Fix out-of-bounds bug in ngene_command_config_free_buf()
	ixgbe: Fix packet corruption due to missing DMA sync
	bus: mhi: core: Validate channel ID when processing command completions
	posix-cpu-timers: Fix rearm racing against process tick
	selftest: use mmap instead of posix_memalign to allocate memory
	io_uring: explicitly count entries for poll reqs
	io_uring: remove double poll entry on arm failure
	userfaultfd: do not untag user pointers
	memblock: make for_each_mem_range() traverse MEMBLOCK_HOTPLUG regions
	hugetlbfs: fix mount mode command line processing
	rbd: don't hold lock_rwsem while running_list is being drained
	rbd: always kick acquire on "acquired" and "released" notifications
	misc: eeprom: at24: Always append device id even if label property is set.
	nds32: fix up stack guard gap
	driver core: Prevent warning when removing a device link from unregistered consumer
	drm: Return -ENOTTY for non-drm ioctls
	drm/amdgpu: update golden setting for sienna_cichlid
	net: dsa: mv88e6xxx: enable SerDes RX stats for Topaz
	net: dsa: mv88e6xxx: enable SerDes PCS register dump via ethtool -d on Topaz
	PCI: Mark AMD Navi14 GPU ATS as broken
	bonding: fix build issue
	skbuff: Release nfct refcount on napi stolen or re-used skbs
	Documentation: Fix intiramfs script name
	perf inject: Close inject.output on exit
	usb: ehci: Prevent missed ehci interrupts with edge-triggered MSI
	drm/i915/gvt: Clear d3_entered on elsp cmd submission.
	sfc: ensure correct number of XDP queues
	xhci: add xhci_get_virt_ep() helper
	skbuff: Fix build with SKB extensions disabled
	Linux 5.10.54

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ifd2823b47ab1544cd1f168b138624ffe060a471e
2021-07-28 15:23:47 +02:00
Frederic Weisbecker
6e81e2c38a posix-cpu-timers: Fix rearm racing against process tick
commit 1a3402d93c upstream.

Since the process wide cputime counter is started locklessly from
posix_cpu_timer_rearm(), it can be concurrently stopped by operations
on other timers from the same thread group, such as in the following
unlucky scenario:

         CPU 0                                CPU 1
         -----                                -----
                                           timer_settime(TIMER B)
   posix_cpu_timer_rearm(TIMER A)
       cpu_clock_sample_group()
           (pct->timers_active already true)

                                           handle_posix_cpu_timers()
                                               check_process_timers()
                                                   stop_process_timers()
                                                       pct->timers_active = false
       arm_timer(TIMER A)

   tick -> run_posix_cpu_timers()
       // sees !pct->timers_active, ignore
       // our TIMER A

Fix this with simply locking process wide cputime counting start and
timer arm in the same block.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Fixes: 60f2ceaa81 ("posix-cpu-timers: Remove unnecessary locking around cpu_clock_sample_group")
Cc: stable@vger.kernel.org
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-28 14:35:45 +02:00
Steven Rostedt (VMware)
552b053f1a tracing: Synthetic event field_pos is an index not a boolean
commit 3b13911a2f upstream.

Performing the following:

 ># echo 'wakeup_lat s32 pid; u64 delta; char wake_comm[]' > synthetic_events
 ># echo 'hist:keys=pid:__arg__1=common_timestamp.usecs' > events/sched/sched_waking/trigger
 ># echo 'hist:keys=next_pid:pid=next_pid,delta=common_timestamp.usecs-$__arg__1:onmatch(sched.sched_waking).trace(wakeup_lat,$pid,$delta,prev_comm)'\
      > events/sched/sched_switch/trigger
 ># echo 1 > events/synthetic/enable

Crashed the kernel:

 BUG: kernel NULL pointer dereference, address: 000000000000001b
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] PREEMPT SMP
 CPU: 7 PID: 0 Comm: swapper/7 Not tainted 5.13.0-rc5-test+ #104
 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
 RIP: 0010:strlen+0x0/0x20
 Code: f6 82 80 2b 0b bc 20 74 11 0f b6 50 01 48 83 c0 01 f6 82 80 2b 0b bc
  20 75 ef c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 <80> 3f 00 74 10
  48 89 f8 48 83 c0 01 80 38 9 f8 c3 31
 RSP: 0018:ffffaa75000d79d0 EFLAGS: 00010046
 RAX: 0000000000000002 RBX: ffff9cdb55575270 RCX: 0000000000000000
 RDX: ffff9cdb58c7a320 RSI: ffffaa75000d7b40 RDI: 000000000000001b
 RBP: ffffaa75000d7b40 R08: ffff9cdb40a4f010 R09: ffffaa75000d7ab8
 R10: ffff9cdb4398c700 R11: 0000000000000008 R12: ffff9cdb58c7a320
 R13: ffff9cdb55575270 R14: ffff9cdb58c7a000 R15: 0000000000000018
 FS:  0000000000000000(0000) GS:ffff9cdb5aa00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 000000000000001b CR3: 00000000c0612006 CR4: 00000000001706e0
 Call Trace:
  trace_event_raw_event_synth+0x90/0x1d0
  action_trace+0x5b/0x70
  event_hist_trigger+0x4bd/0x4e0
  ? cpumask_next_and+0x20/0x30
  ? update_sd_lb_stats.constprop.0+0xf6/0x840
  ? __lock_acquire.constprop.0+0x125/0x550
  ? find_held_lock+0x32/0x90
  ? sched_clock_cpu+0xe/0xd0
  ? lock_release+0x155/0x440
  ? update_load_avg+0x8c/0x6f0
  ? enqueue_entity+0x18a/0x920
  ? __rb_reserve_next+0xe5/0x460
  ? ring_buffer_lock_reserve+0x12a/0x3f0
  event_triggers_call+0x52/0xe0
  trace_event_buffer_commit+0x1ae/0x240
  trace_event_raw_event_sched_switch+0x114/0x170
  __traceiter_sched_switch+0x39/0x50
  __schedule+0x431/0xb00
  schedule_idle+0x28/0x40
  do_idle+0x198/0x2e0
  cpu_startup_entry+0x19/0x20
  secondary_startup_64_no_verify+0xc2/0xcb

The reason is that the dynamic events array keeps track of the field
position of the fields array, via the field_pos variable in the
synth_field structure. Unfortunately, that field is a boolean for some
reason, which means any field_pos greater than 1 will be a bug (in this
case it was 2).

Link: https://lkml.kernel.org/r/20210721191008.638bce34@oasis.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Fixes: bd82631d7c ("tracing: Add support for dynamic strings to synthetic events")
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-28 14:35:45 +02:00
Haoran Luo
757bdba802 tracing: Fix bug in rb_per_cpu_empty() that might cause deadloop.
commit 67f0d6d988 upstream.

The "rb_per_cpu_empty()" misinterpret the condition (as not-empty) when
"head_page" and "commit_page" of "struct ring_buffer_per_cpu" points to
the same buffer page, whose "buffer_data_page" is empty and "read" field
is non-zero.

An error scenario could be constructed as followed (kernel perspective):

1. All pages in the buffer has been accessed by reader(s) so that all of
them will have non-zero "read" field.

2. Read and clear all buffer pages so that "rb_num_of_entries()" will
return 0 rendering there's no more data to read. It is also required
that the "read_page", "commit_page" and "tail_page" points to the same
page, while "head_page" is the next page of them.

3. Invoke "ring_buffer_lock_reserve()" with large enough "length"
so that it shot pass the end of current tail buffer page. Now the
"head_page", "commit_page" and "tail_page" points to the same page.

4. Discard current event with "ring_buffer_discard_commit()", so that
"head_page", "commit_page" and "tail_page" points to a page whose buffer
data page is now empty.

When the error scenario has been constructed, "tracing_read_pipe" will
be trapped inside a deadloop: "trace_empty()" returns 0 since
"rb_per_cpu_empty()" returns 0 when it hits the CPU containing such
constructed ring buffer. Then "trace_find_next_entry_inc()" always
return NULL since "rb_num_of_entries()" reports there's no more entry
to read. Finally "trace_seq_to_user()" returns "-EBUSY" spanking
"tracing_read_pipe" back to the start of the "waitagain" loop.

I've also written a proof-of-concept script to construct the scenario
and trigger the bug automatically, you can use it to trace and validate
my reasoning above:

  https://github.com/aegistudio/RingBufferDetonator.git

Tests has been carried out on linux kernel 5.14-rc2
(2734d6c1b1), my fixed version
of kernel (for testing whether my update fixes the bug) and
some older kernels (for range of affected kernels). Test result is
also attached to the proof-of-concept repository.

Link: https://lore.kernel.org/linux-trace-devel/YPaNxsIlb2yjSi5Y@aegistudio/
Link: https://lore.kernel.org/linux-trace-devel/YPgrN85WL9VyrZ55@aegistudio

Cc: stable@vger.kernel.org
Fixes: bf41a158ca ("ring-buffer: make reentrant")
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Haoran Luo <www@aegistudio.net>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-28 14:35:45 +02:00
Steven Rostedt (VMware)
a5e1aff589 tracing/histogram: Rename "cpu" to "common_cpu"
commit 1e3bac71c5 upstream.

Currently the histogram logic allows the user to write "cpu" in as an
event field, and it will record the CPU that the event happened on.

The problem with this is that there's a lot of events that have "cpu"
as a real field, and using "cpu" as the CPU it ran on, makes it
impossible to run histograms on the "cpu" field of events.

For example, if I want to have a histogram on the count of the
workqueue_queue_work event on its cpu field, running:

 ># echo 'hist:keys=cpu' > events/workqueue/workqueue_queue_work/trigger

Gives a misleading and wrong result.

Change the command to "common_cpu" as no event should have "common_*"
fields as that's a reserved name for fields used by all events. And
this makes sense here as common_cpu would be a field used by all events.

Now we can even do:

 ># echo 'hist:keys=common_cpu,cpu if cpu < 100' > events/workqueue/workqueue_queue_work/trigger
 ># cat events/workqueue/workqueue_queue_work/hist
 # event histogram
 #
 # trigger info: hist:keys=common_cpu,cpu:vals=hitcount:sort=hitcount:size=2048 if cpu < 100 [active]
 #

 { common_cpu:          0, cpu:          2 } hitcount:          1
 { common_cpu:          0, cpu:          4 } hitcount:          1
 { common_cpu:          7, cpu:          7 } hitcount:          1
 { common_cpu:          0, cpu:          7 } hitcount:          1
 { common_cpu:          0, cpu:          1 } hitcount:          1
 { common_cpu:          0, cpu:          6 } hitcount:          2
 { common_cpu:          0, cpu:          5 } hitcount:          2
 { common_cpu:          1, cpu:          1 } hitcount:          4
 { common_cpu:          6, cpu:          6 } hitcount:          4
 { common_cpu:          5, cpu:          5 } hitcount:         14
 { common_cpu:          4, cpu:          4 } hitcount:         26
 { common_cpu:          0, cpu:          0 } hitcount:         39
 { common_cpu:          2, cpu:          2 } hitcount:        184

Now for backward compatibility, I added a trick. If "cpu" is used, and
the field is not found, it will fall back to "common_cpu" and work as
it did before. This way, it will still work for old programs that use
"cpu" to get the actual CPU, but if the event has a "cpu" as a field, it
will get that event's "cpu" field, which is probably what it wants
anyway.

I updated the tracefs/README to include documentation about both the
common_timestamp and the common_cpu. This way, if that text is present in
the README, then an application can know that common_cpu is supported over
just plain "cpu".

Link: https://lkml.kernel.org/r/20210721110053.26b4f641@oasis.local.home

Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Fixes: 8b7622bf94 ("tracing: Add cpu field for hist triggers")
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-28 14:35:45 +02:00
Steven Rostedt (VMware)
0edad8b9f6 tracepoints: Update static_call before tp_funcs when adding a tracepoint
commit 352384d5c8 upstream.

Because of the significant overhead that retpolines pose on indirect
calls, the tracepoint code was updated to use the new "static_calls" that
can modify the running code to directly call a function instead of using
an indirect caller, and this function can be changed at runtime.

In the tracepoint code that calls all the registered callbacks that are
attached to a tracepoint, the following is done:

	it_func_ptr = rcu_dereference_raw((&__tracepoint_##name)->funcs);
	if (it_func_ptr) {
		__data = (it_func_ptr)->data;
		static_call(tp_func_##name)(__data, args);
	}

If there's just a single callback, the static_call is updated to just call
that callback directly. Once another handler is added, then the static
caller is updated to call the iterator, that simply loops over all the
funcs in the array and calls each of the callbacks like the old method
using indirect calling.

The issue was discovered with a race between updating the funcs array and
updating the static_call. The funcs array was updated first and then the
static_call was updated. This is not an issue as long as the first element
in the old array is the same as the first element in the new array. But
that assumption is incorrect, because callbacks also have a priority
field, and if there's a callback added that has a higher priority than the
callback on the old array, then it will become the first callback in the
new array. This means that it is possible to call the old callback with
the new callback data element, which can cause a kernel panic.

	static_call = callback1()
	funcs[] = {callback1,data1};
	callback2 has higher priority than callback1

	CPU 1				CPU 2
	-----				-----

   new_funcs = {callback2,data2},
               {callback1,data1}

   rcu_assign_pointer(tp->funcs, new_funcs);

  /*
   * Now tp->funcs has the new array
   * but the static_call still calls callback1
   */

				it_func_ptr = tp->funcs [ new_funcs ]
				data = it_func_ptr->data [ data2 ]
				static_call(callback1, data);

				/* Now callback1 is called with
				 * callback2's data */

				[ KERNEL PANIC ]

   update_static_call(iterator);

To prevent this from happening, always switch the static_call to the
iterator before assigning the tp->funcs to the new array. The iterator will
always properly match the callback with its data.

To trigger this bug:

  In one terminal:

    while :; do hackbench 50; done

  In another terminal

    echo 1 > /sys/kernel/tracing/events/sched/sched_waking/enable
    while :; do
        echo 1 > /sys/kernel/tracing/set_event_pid;
        sleep 0.5
        echo 0 > /sys/kernel/tracing/set_event_pid;
        sleep 0.5
   done

And it doesn't take long to crash. This is because the set_event_pid adds
a callback to the sched_waking tracepoint with a high priority, which will
be called before the sched_waking trace event callback is called.

Note, the removal to a single callback updates the array first, before
changing the static_call to single callback, which is the proper order as
the first element in the array is the same as what the static_call is
being changed to.

Link: https://lore.kernel.org/io-uring/4ebea8f0-58c9-e571-fd30-0ce4f6f09c70@samba.org/

Cc: stable@vger.kernel.org
Fixes: d25e37d89d ("tracepoint: Optimize using static_call()")
Reported-by: Stefan Metzmacher <metze@samba.org>
tested-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-28 14:35:45 +02:00
Roman Skakun
8983766903 dma-mapping: handle vmalloc addresses in dma_common_{mmap,get_sgtable}
[ Upstream commit 40ac971eab ]

xen-swiotlb can use vmalloc backed addresses for dma coherent allocations
and uses the common helpers.  Properly handle them to unbreak Xen on
ARM platforms.

Fixes: 1b65c4e5a9 ("swiotlb-xen: use xen_alloc/free_coherent_pages")
Signed-off-by: Roman Skakun <roman_skakun@epam.com>
Reviewed-by: Andrii Anisov <andrii_anisov@epam.com>
[hch: split the patch, renamed the helpers]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-28 14:35:38 +02:00
Nicolas Saenz Julienne
0ff2ea9d8f timers: Fix get_next_timer_interrupt() with no timers pending
[ Upstream commit aebacb7f6c ]

31cd0e119d ("timers: Recalculate next timer interrupt only when
necessary") subtly altered get_next_timer_interrupt()'s behaviour. The
function no longer consistently returns KTIME_MAX with no timers
pending.

In order to decide if there are any timers pending we check whether the
next expiry will happen NEXT_TIMER_MAX_DELTA jiffies from now.
Unfortunately, the next expiry time and the timer base clock are no
longer updated in unison. The former changes upon certain timer
operations (enqueue, expire, detach), whereas the latter keeps track of
jiffies as they move forward. Ultimately breaking the logic above.

A simplified example:

- Upon entering get_next_timer_interrupt() with:

	jiffies = 1
	base->clk = 0;
	base->next_expiry = NEXT_TIMER_MAX_DELTA;

  'base->next_expiry == base->clk + NEXT_TIMER_MAX_DELTA', the function
  returns KTIME_MAX.

- 'base->clk' is updated to the jiffies value.

- The next time we enter get_next_timer_interrupt(), taking into account
  no timer operations happened:

	base->clk = 1;
	base->next_expiry = NEXT_TIMER_MAX_DELTA;

  'base->next_expiry != base->clk + NEXT_TIMER_MAX_DELTA', the function
  returns a valid expire time, which is incorrect.

This ultimately might unnecessarily rearm sched's timer on nohz_full
setups, and add latency to the system[1].

So, introduce 'base->timers_pending'[2], update it every time
'base->next_expiry' changes, and use it in get_next_timer_interrupt().

[1] See tick_nohz_stop_tick().
[2] A quick pahole check on x86_64 and arm64 shows it doesn't make
    'struct timer_base' any bigger.

Fixes: 31cd0e119d ("timers: Recalculate next timer interrupt only when necessary")
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-28 14:35:37 +02:00
Daniel Borkmann
39f1735c81 bpf: Fix tail_call_reachable rejection for interpreter when jit failed
[ Upstream commit 5dd0a6b858 ]

During testing of f263a81451 ("bpf: Track subprog poke descriptors correctly
and fix use-after-free") under various failure conditions, for example, when
jit_subprogs() fails and tries to clean up the program to be run under the
interpreter, we ran into the following freeze:

  [...]
  #127/8 tailcall_bpf2bpf_3:FAIL
  [...]
  [   92.041251] BUG: KASAN: slab-out-of-bounds in ___bpf_prog_run+0x1b9d/0x2e20
  [   92.042408] Read of size 8 at addr ffff88800da67f68 by task test_progs/682
  [   92.043707]
  [   92.044030] CPU: 1 PID: 682 Comm: test_progs Tainted: G   O   5.13.0-53301-ge6c08cb33a30-dirty #87
  [   92.045542] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
  [   92.046785] Call Trace:
  [   92.047171]  ? __bpf_prog_run_args64+0xc0/0xc0
  [   92.047773]  ? __bpf_prog_run_args32+0x8b/0xb0
  [   92.048389]  ? __bpf_prog_run_args64+0xc0/0xc0
  [   92.049019]  ? ktime_get+0x117/0x130
  [...] // few hundred [similar] lines more
  [   92.659025]  ? ktime_get+0x117/0x130
  [   92.659845]  ? __bpf_prog_run_args64+0xc0/0xc0
  [   92.660738]  ? __bpf_prog_run_args32+0x8b/0xb0
  [   92.661528]  ? __bpf_prog_run_args64+0xc0/0xc0
  [   92.662378]  ? print_usage_bug+0x50/0x50
  [   92.663221]  ? print_usage_bug+0x50/0x50
  [   92.664077]  ? bpf_ksym_find+0x9c/0xe0
  [   92.664887]  ? ktime_get+0x117/0x130
  [   92.665624]  ? kernel_text_address+0xf5/0x100
  [   92.666529]  ? __kernel_text_address+0xe/0x30
  [   92.667725]  ? unwind_get_return_address+0x2f/0x50
  [   92.668854]  ? ___bpf_prog_run+0x15d4/0x2e20
  [   92.670185]  ? ktime_get+0x117/0x130
  [   92.671130]  ? __bpf_prog_run_args64+0xc0/0xc0
  [   92.672020]  ? __bpf_prog_run_args32+0x8b/0xb0
  [   92.672860]  ? __bpf_prog_run_args64+0xc0/0xc0
  [   92.675159]  ? ktime_get+0x117/0x130
  [   92.677074]  ? lock_is_held_type+0xd5/0x130
  [   92.678662]  ? ___bpf_prog_run+0x15d4/0x2e20
  [   92.680046]  ? ktime_get+0x117/0x130
  [   92.681285]  ? __bpf_prog_run32+0x6b/0x90
  [   92.682601]  ? __bpf_prog_run64+0x90/0x90
  [   92.683636]  ? lock_downgrade+0x370/0x370
  [   92.684647]  ? mark_held_locks+0x44/0x90
  [   92.685652]  ? ktime_get+0x117/0x130
  [   92.686752]  ? lockdep_hardirqs_on+0x79/0x100
  [   92.688004]  ? ktime_get+0x117/0x130
  [   92.688573]  ? __cant_migrate+0x2b/0x80
  [   92.689192]  ? bpf_test_run+0x2f4/0x510
  [   92.689869]  ? bpf_test_timer_continue+0x1c0/0x1c0
  [   92.690856]  ? rcu_read_lock_bh_held+0x90/0x90
  [   92.691506]  ? __kasan_slab_alloc+0x61/0x80
  [   92.692128]  ? eth_type_trans+0x128/0x240
  [   92.692737]  ? __build_skb+0x46/0x50
  [   92.693252]  ? bpf_prog_test_run_skb+0x65e/0xc50
  [   92.693954]  ? bpf_prog_test_run_raw_tp+0x2d0/0x2d0
  [   92.694639]  ? __fget_light+0xa1/0x100
  [   92.695162]  ? bpf_prog_inc+0x23/0x30
  [   92.695685]  ? __sys_bpf+0xb40/0x2c80
  [   92.696324]  ? bpf_link_get_from_fd+0x90/0x90
  [   92.697150]  ? mark_held_locks+0x24/0x90
  [   92.698007]  ? lockdep_hardirqs_on_prepare+0x124/0x220
  [   92.699045]  ? finish_task_switch+0xe6/0x370
  [   92.700072]  ? lockdep_hardirqs_on+0x79/0x100
  [   92.701233]  ? finish_task_switch+0x11d/0x370
  [   92.702264]  ? __switch_to+0x2c0/0x740
  [   92.703148]  ? mark_held_locks+0x24/0x90
  [   92.704155]  ? __x64_sys_bpf+0x45/0x50
  [   92.705146]  ? do_syscall_64+0x35/0x80
  [   92.706953]  ? entry_SYSCALL_64_after_hwframe+0x44/0xae
  [...]

Turns out that the program rejection from e411901c0b ("bpf: allow for tailcalls
in BPF subprograms for x64 JIT") is buggy since env->prog->aux->tail_call_reachable
is never true. Commit ebf7d1f508 ("bpf, x64: rework pro/epilogue and tailcall
handling in JIT") added a tracker into check_max_stack_depth() which propagates
the tail_call_reachable condition throughout the subprograms. This info is then
assigned to the subprogram's func[i]->aux->tail_call_reachable. However, in the
case of the rejection check upon JIT failure, env->prog->aux->tail_call_reachable
is used. func[0]->aux->tail_call_reachable which represents the main program's
information did not propagate this to the outer env->prog->aux, though. Add this
propagation into check_max_stack_depth() where it needs to belong so that the
check can be done reliably.

Fixes: ebf7d1f508 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT")
Fixes: e411901c0b ("bpf: allow for tailcalls in BPF subprograms for x64 JIT")
Co-developed-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/bpf/618c34e3163ad1a36b1e82377576a6081e182f25.1626123173.git.daniel@iogearbox.net
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-28 14:35:37 +02:00
Greg Kroah-Hartman
67e686fc73 Revert "bpf: Track subprog poke descriptors correctly and fix use-after-free"
This reverts commit a9f36bf361 which is
commit f263a81451 upstream.

It breaks the Android KABI and is not needed for Android devices at this
point in time.

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Id5d1b6407f9bd5228630b9a813e9799dbc448b96
2021-07-26 11:53:25 +02:00
Greg Kroah-Hartman
afe9ed0e13 This is the 5.10.53 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmD9WtYACgkQONu9yGCS
 aT6IkRAAnW25XGt27ELqFNo2vCXI38YFEyNpe0aN8y2A4X1MXu8EKNunQNFKYVt9
 5sGrSjt30qy0A9pbeajqXfnGmCNe5R825lRCQia08/+b/qyfaF358DdAI59q5zd1
 0IhInfBo+TJyLLxltt0f2yGQJfo2DfmRuOpr0EXrE7Mo7lTqDqrpmDbLl+fZ8Ugd
 SaXRZMHLsH18Q8XuFbbBnU57tgGsIys9ie397jf3jYZBpfnzWjfsKbM4BnBz3fAv
 bUnZ5P1G8Bwps30ES8Qpc6ZoWMYl6MkEPNc+4f57+amvJME45MePUvgDRrQV3Z0P
 bveovBal8c13aO46DIa+bW0Wevy1pR/OcNQ78r8I1Sq4gp6rIp0oj6Q3rEPNfJqf
 8s1/9FNbZVRkz/TA2vtILGhIlJfMm7QcESM8WA1kA5EN4er35DGlvN69kpspxhoh
 FBrgmAdMQQgmiCtLR1QllP5PVDaTNQlsXR3yeASc54UrEwuyFe/ELE5YEb7vvXIQ
 u7JF+zYbxVFqyUCtyoBEy6eoAe/jGAI/kiCvifitR33289lOItz5n64CQzCex54R
 WR1wX+rl9Jf/SWHuBh/JgCqS6x8O2+1WfpABmEWwT6m1leC6v0LG0CJoywGdHf1I
 +IlIsThsoPIMd0xFzd7GlV4+ot2ZHsxyhY7dDnkIfCZtFLKubuU=
 =PNDL
 -----END PGP SIGNATURE-----

Merge 5.10.53 into android12-5.10-lts

Changes in 5.10.53
	ARM: dts: gemini: rename mdio to the right name
	ARM: dts: gemini: add device_type on pci
	ARM: dts: rockchip: Fix thermal sensor cells o rk322x
	ARM: dts: rockchip: fix pinctrl sleep nodename for rk3036-kylin and rk3288
	arm64: dts: rockchip: fix pinctrl sleep nodename for rk3399.dtsi
	ARM: dts: rockchip: Fix the timer clocks order
	ARM: dts: rockchip: Fix IOMMU nodes properties on rk322x
	ARM: dts: rockchip: Fix power-controller node names for rk3066a
	ARM: dts: rockchip: Fix power-controller node names for rk3188
	ARM: dts: rockchip: Fix power-controller node names for rk3288
	arm64: dts: rockchip: Fix power-controller node names for px30
	arm64: dts: rockchip: Fix power-controller node names for rk3328
	arm64: dts: rockchip: Fix power-controller node names for rk3399
	reset: ti-syscon: fix to_ti_syscon_reset_data macro
	ARM: brcmstb: dts: fix NAND nodes names
	ARM: Cygnus: dts: fix NAND nodes names
	ARM: NSP: dts: fix NAND nodes names
	ARM: dts: BCM63xx: Fix NAND nodes names
	ARM: dts: Hurricane 2: Fix NAND nodes names
	ARM: dts: imx6: phyFLEX: Fix UART hardware flow control
	ARM: imx: pm-imx5: Fix references to imx5_cpu_suspend_info
	arm64: dts: rockchip: fix regulator-gpio states array
	ARM: dts: ux500: Fix interrupt cells
	ARM: dts: ux500: Rename gpio-controller node
	ARM: dts: ux500: Fix orientation of accelerometer
	ARM: dts: imx6dl-riotboard: configure PHY clock and set proper EEE value
	rtc: mxc_v2: add missing MODULE_DEVICE_TABLE
	kbuild: sink stdout from cmd for silent build
	ARM: dts: am57xx-cl-som-am57x: fix ti,no-reset-on-init flag for gpios
	ARM: dts: am437x-gp-evm: fix ti,no-reset-on-init flag for gpios
	ARM: dts: am335x: fix ti,no-reset-on-init flag for gpios
	ARM: dts: OMAP2+: Replace underscores in sub-mailbox node names
	arm64: dts: ti: k3-am654x/j721e/j7200-common-proc-board: Fix MCU_RGMII1_TXC direction
	ARM: tegra: wm8903: Fix polarity of headphones-detection GPIO in device-trees
	ARM: tegra: nexus7: Correct 3v3 regulator GPIO of PM269 variant
	arm64: dts: qcom: sc7180: Move rmtfs memory region
	ARM: dts: stm32: Remove extra size-cells on dhcom-pdk2
	ARM: dts: stm32: Fix touchscreen node on dhcom-pdk2
	ARM: dts: stm32: fix stm32mp157c-odyssey card detect pin
	ARM: dts: stm32: fix gpio-keys node on STM32 MCU boards
	ARM: dts: stm32: fix RCC node name on stm32f429 MCU
	ARM: dts: stm32: fix timer nodes on STM32 MCU to prevent warnings
	memory: tegra: Fix compilation warnings on 64bit platforms
	firmware: arm_scmi: Add SMCCC discovery dependency in Kconfig
	firmware: arm_scmi: Fix the build when CONFIG_MAILBOX is not selected
	ARM: dts: bcm283x: Fix up MMC node names
	ARM: dts: bcm283x: Fix up GPIO LED node names
	arm64: dts: juno: Update SCPI nodes as per the YAML schema
	ARM: dts: rockchip: fix supply properties in io-domains nodes
	ARM: dts: stm32: fix i2c node name on stm32f746 to prevent warnings
	ARM: dts: stm32: move stmmac axi config in ethernet node on stm32mp15
	ARM: dts: stm32: fix the Odyssey SoM eMMC VQMMC supply
	ARM: dts: stm32: Drop unused linux,wakeup from touchscreen node on DHCOM SoM
	ARM: dts: stm32: Rename spi-flash/mx66l51235l@N to flash@N on DHCOM SoM
	ARM: dts: stm32: fix stpmic node for stm32mp1 boards
	ARM: OMAP2+: Block suspend for am3 and am4 if PM is not configured
	soc/tegra: fuse: Fix Tegra234-only builds
	firmware: tegra: bpmp: Fix Tegra234-only builds
	arm64: dts: ls208xa: remove bus-num from dspi node
	arm64: dts: imx8mq: assign PCIe clocks
	thermal/core: Correct function name thermal_zone_device_unregister()
	thermal/drivers/rcar_gen3_thermal: Do not shadow rcar_gen3_ths_tj_1
	thermal/drivers/imx_sc: Add missing of_node_put for loop iteration
	thermal/drivers/sprd: Add missing of_node_put for loop iteration
	kbuild: mkcompile_h: consider timestamp if KBUILD_BUILD_TIMESTAMP is set
	arch/arm64/boot/dts/marvell: fix NAND partitioning scheme
	rtc: max77686: Do not enforce (incorrect) interrupt trigger type
	scsi: aic7xxx: Fix unintentional sign extension issue on left shift of u8
	scsi: libsas: Add LUN number check in .slave_alloc callback
	scsi: libfc: Fix array index out of bound exception
	scsi: qedf: Add check to synchronize abort and flush
	sched/fair: Fix CFS bandwidth hrtimer expiry type
	perf/x86/intel/uncore: Clean up error handling path of iio mapping
	thermal/core/thermal_of: Stop zone device before unregistering it
	s390/traps: do not test MONITOR CALL without CONFIG_BUG
	s390: introduce proper type handling call_on_stack() macro
	cifs: prevent NULL deref in cifs_compose_mount_options()
	firmware: turris-mox-rwtm: add marvell,armada-3700-rwtm-firmware compatible string
	arm64: dts: marvell: armada-37xx: move firmware node to generic dtsi file
	Revert "swap: fix do_swap_page() race with swapoff"
	f2fs: Show casefolding support only when supported
	mm/thp: simplify copying of huge zero page pmd when fork
	mm/userfaultfd: fix uffd-wp special cases for fork()
	mm/page_alloc: fix memory map initialization for descending nodes
	usb: cdns3: Enable TDL_CHK only for OUT ep
	net: bcmgenet: ensure EXT_ENERGY_DET_MASK is clear
	net: dsa: mv88e6xxx: enable .port_set_policy() on Topaz
	net: dsa: mv88e6xxx: use correct .stats_set_histogram() on Topaz
	net: dsa: mv88e6xxx: enable .rmu_disable() on Topaz
	net: dsa: mv88e6xxx: enable devlink ATU hash param for Topaz
	net: ipv6: fix return value of ip6_skb_dst_mtu
	netfilter: ctnetlink: suspicious RCU usage in ctnetlink_dump_helpinfo
	net/sched: act_ct: fix err check for nf_conntrack_confirm
	vmxnet3: fix cksum offload issues for tunnels with non-default udp ports
	net/sched: act_ct: remove and free nf_table callbacks
	net: bridge: sync fdb to new unicast-filtering ports
	net: netdevsim: use xso.real_dev instead of xso.dev in callback functions of struct xfrmdev_ops
	net: bcmgenet: Ensure all TX/RX queues DMAs are disabled
	net: ip_tunnel: fix mtu calculation for ETHER tunnel devices
	net: moxa: fix UAF in moxart_mac_probe
	net: qcom/emac: fix UAF in emac_remove
	net: ti: fix UAF in tlan_remove_one
	net: send SYNACK packet with accepted fwmark
	net: validate lwtstate->data before returning from skb_tunnel_info()
	Revert "mm/shmem: fix shmem_swapin() race with swapoff"
	net: dsa: properly check for the bridge_leave methods in dsa_switch_bridge_leave()
	net: fddi: fix UAF in fza_probe
	dma-buf/sync_file: Don't leak fences on merge failure
	kbuild: do not suppress Kconfig prompts for silent build
	ARM: dts: aspeed: Fix AST2600 machines line names
	ARM: dts: tacoma: Add phase corrections for eMMC
	tcp: consistently disable header prediction for mptcp
	tcp: annotate data races around tp->mtu_info
	tcp: fix tcp_init_transfer() to not reset icsk_ca_initialized
	ipv6: tcp: drop silly ICMPv6 packet too big messages
	tcp: call sk_wmem_schedule before sk_mem_charge in zerocopy path
	tools: bpf: Fix error in 'make -C tools/ bpf_install'
	bpftool: Properly close va_list 'ap' by va_end() on error
	bpf: Track subprog poke descriptors correctly and fix use-after-free
	perf test bpf: Free obj_buf
	drm/panel: nt35510: Do not fail if DSI read fails
	udp: annotate data races around unix_sk(sk)->gso_size
	Linux 5.10.53

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Iac8fe9cd2abb2d8dd993967205a97c89f01f3647
2021-07-25 15:37:14 +02:00
John Fastabend
a9f36bf361 bpf: Track subprog poke descriptors correctly and fix use-after-free
commit f263a81451 upstream.

Subprograms are calling map_poke_track(), but on program release there is no
hook to call map_poke_untrack(). However, on program release, the aux memory
(and poke descriptor table) is freed even though we still have a reference to
it in the element list of the map aux data. When we run map_poke_run(), we then
end up accessing free'd memory, triggering KASAN in prog_array_map_poke_run():

  [...]
  [  402.824689] BUG: KASAN: use-after-free in prog_array_map_poke_run+0xc2/0x34e
  [  402.824698] Read of size 4 at addr ffff8881905a7940 by task hubble-fgs/4337
  [  402.824705] CPU: 1 PID: 4337 Comm: hubble-fgs Tainted: G          I       5.12.0+ #399
  [  402.824715] Call Trace:
  [  402.824719]  dump_stack+0x93/0xc2
  [  402.824727]  print_address_description.constprop.0+0x1a/0x140
  [  402.824736]  ? prog_array_map_poke_run+0xc2/0x34e
  [  402.824740]  ? prog_array_map_poke_run+0xc2/0x34e
  [  402.824744]  kasan_report.cold+0x7c/0xd8
  [  402.824752]  ? prog_array_map_poke_run+0xc2/0x34e
  [  402.824757]  prog_array_map_poke_run+0xc2/0x34e
  [  402.824765]  bpf_fd_array_map_update_elem+0x124/0x1a0
  [...]

The elements concerned are walked as follows:

    for (i = 0; i < elem->aux->size_poke_tab; i++) {
           poke = &elem->aux->poke_tab[i];
    [...]

The access to size_poke_tab is a 4 byte read, verified by checking offsets
in the KASAN dump:

  [  402.825004] The buggy address belongs to the object at ffff8881905a7800
                 which belongs to the cache kmalloc-1k of size 1024
  [  402.825008] The buggy address is located 320 bytes inside of
                 1024-byte region [ffff8881905a7800, ffff8881905a7c00)

The pahole output of bpf_prog_aux:

  struct bpf_prog_aux {
    [...]
    /* --- cacheline 5 boundary (320 bytes) --- */
    u32                        size_poke_tab;        /*   320     4 */
    [...]

In general, subprograms do not necessarily manage their own data structures.
For example, BTF func_info and linfo are just pointers to the main program
structure. This allows reference counting and cleanup to be done on the latter
which simplifies their management a bit. The aux->poke_tab struct, however,
did not follow this logic. The initial proposed fix for this use-after-free
bug further embedded poke data tracking into the subprogram with proper
reference counting. However, Daniel and Alexei questioned why we were treating
these objects special; I agree, its unnecessary. The fix here removes the per
subprogram poke table allocation and map tracking and instead simply points
the aux->poke_tab pointer at the main programs poke table. This way, map
tracking is simplified to the main program and we do not need to manage them
per subprogram.

This also means, bpf_prog_free_deferred(), which unwinds the program reference
counting and kfrees objects, needs to ensure that we don't try to double free
the poke_tab when free'ing the subprog structures. This is easily solved by
NULL'ing the poke_tab pointer. The second detail is to ensure that per
subprogram JIT logic only does fixups on poke_tab[] entries it owns. To do
this, we add a pointer in the poke structure to point at the subprogram value
so JITs can easily check while walking the poke_tab structure if the current
entry belongs to the current program. The aux pointer is stable and therefore
suitable for such comparison. On the jit_subprogs() error path, we omit
cleaning up the poke->aux field because these are only ever referenced from
the JIT side, but on error we will never make it to the JIT, so its fine to
leave them dangling. Removing these pointers would complicate the error path
for no reason. However, we do need to untrack all poke descriptors from the
main program as otherwise they could race with the freeing of JIT memory from
the subprograms. Lastly, a748c6975d ("bpf: propagate poke descriptors to
subprograms") had an off-by-one on the subprogram instruction index range
check as it was testing 'insn_idx >= subprog_start && insn_idx <= subprog_end'.
However, subprog_end is the next subprogram's start instruction.

Fixes: a748c6975d ("bpf: propagate poke descriptors to subprograms")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210707223848.14580-2-john.fastabend@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-25 14:36:21 +02:00
Odin Ugedal
892387e761 sched/fair: Fix CFS bandwidth hrtimer expiry type
[ Upstream commit 72d0ad7cb5 ]

The time remaining until expiry of the refresh_timer can be negative.
Casting the type to an unsigned 64-bit value will cause integer
underflow, making the runtime_refresh_within return false instead of
true. These situations are rare, but they do happen.

This does not cause user-facing issues or errors; other than
possibly unthrottling cfs_rq's using runtime from the previous period(s),
making the CFS bandwidth enforcement less strict in those (special)
situations.

Signed-off-by: Odin Ugedal <odin@uged.al>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Link: https://lore.kernel.org/r/20210629121452.18429-1-odin@uged.al
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-25 14:36:17 +02:00
Greg Kroah-Hartman
c0dd8de281 Merge branch 'android12-5.10' into android12-5.10-lts
Sync up with android12-5.10 for the following commits:

90732d426e UPSTREAM: mm/compaction: correct deferral logic for proactive compaction
d0a88ae479 ANDROID: Enable GKI Dr. No Enforcement
e6a59da61e ANDROID: media: v4l2-core: Fix deadlock in vendor hook
fdb1cfe2d3 FROMGIT: dma_buf: remove dmabuf sysfs teardown before release
1dc0dd2573 ANDROID: update mtk symbol list
a669748346 UPSTREAM: mfd: syscon: Free the allocated name field of struct regmap_config
cffe67a351 ANDROID: Give UIC cmd timeout a larger value
3bca0b5344 ANDROID: binder: retry security_secid_to_secctx()
9fdfeda4c9 ANDROID: update new gki symbol for mtk
1dd167be9f ANDROID: mm, kasan: fix for "integrate page_alloc init with HW_TAGS"
f215850ea0 UPSTREAM: kasan: fix conflict with page poisoning
a3da917661 ANDROID: GKI: Export two more mm symbols for GKI
0497b9601b ANDROID: Update symbol list for mtk
0f27e1d317 FROMLIST: kfence: skip all GFP_ZONEMASK allocations
8478d8dc53 FROMLIST: kfence: move the size check to the beginning of __kfence_alloc()
ddabf14bea ANDROID: ABI: update allowed list for galaxy
11cec52238 FROMGIT: f2fs: add sysfs nodes to get GC info for each GC mode
fdc46110cb ANDROID: abi_gki_aarch64_qcom: Add android_debug_for_each_module
0bb433e014 ANDROID: debug_symbols: Add android_debug_for_each_module
a0429aa1d0 ANDROID: ABI: Update ABI for symbol list updates
c68e5ca9f8 ANDROID: GKI: Update symbols to symbol list
aac5a77959 ANDROID: Update symbol list for mtk
42fc1faf44 UPSTREAM: block, bfq: set next_rq to waker_bfqq->next_rq in waker injection
e37cc8a09f UPSTREAM: mm/mremap: hold the rmap lock in write mode when moving page table entries.
9b136eab76 ANDROID: pstore/ram: Add backward compatibility for ramoops reserved region
df97297651 ANDROID: Update symbol list for mtk
4322bb07e8 ANDROID: vendor_hooks: Modify the function name
007213b1d0 BACKPORT: FROMLIST: kasan: add memzero int for unaligned size at DEBUG
7acbce0bf4 BACKPORT: FROMLIST: mm: move helper to check slub_debug_enabled
d6eeae59b5 ANDROID: ABI: initial update allowed list for galaxy

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I6d28f1be3cbd8a1fed86d2d647e6c2bb55221b0a
2021-07-22 13:32:27 +02:00
Matthias Maennich
d0a88ae479 ANDROID: Enable GKI Dr. No Enforcement
This effectively locks down OWNERS approval to a small group to guard
the code base against unintentional breakages.

Bug: 194314089
Signed-off-by: Matthias Maennich <maennich@google.com>
Change-Id: Ifd1ea97639a622320ea83f901f6451e2e52b38d4
2021-07-21 20:51:47 +01:00
Greg Kroah-Hartman
51ab149d5f This is the 5.10.52 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmD22M4ACgkQONu9yGCS
 aT6MbxAAmAmm5sj9fMOk4ijcRz+CbpTZZdSS5VT3BbJJrTrZ1qk9KO0/uAhY1SXe
 uuP8Q7BD2llWQ6hQiJj7vZnXKiJsBrngH//4QFa72TjXxa1ENPcw/UceLxMgT6e1
 0r74mcikr4f2A6kFoAexyteHfzm0D+8ZfgzJtTKPbBPqCqRIaZFO84xyTclvNJ6h
 0Xx2dSNWDU1dXvUe+43ggaZUFNe4SHGupgsc3GSsxPkyKTzrXMGsRh4m3p94t4py
 WQULkq/57JaeJwQcxWMOPqLIHF/IWtZSfr+YHx+q1zvThK9uUsAd3e8B8r6FJswV
 xmDIDkCw3VIRxTiYhYKFqiDZOanlxYOtWXCXAyI+YV/JIT4NGHapg91qEq5PR9Ti
 tkSmAbwaON9LzshzWuKjDHMDJO6x/i3YJmaXuVgBciD56xIvCMKiiardpey2574o
 M5m8fLcXQXQBPY/WrusAn312/PanUxgstn6EJgInq0M4GoMj1aUb+W8luQfqq0G5
 gYNjzTCEQErTITAyha9SLQvBVH3snfHpKoDL669HK6woq2GH2K48YIKxCam5trlH
 9P9LCXBhSHe19/r2qLotcMW7XmuSsu77FetZtxFJQeWVqRokmyHzZnTbeoQAxhpa
 xGkvnbnyLettCCtm/JAmwnASqIYnUxuvgj7zcPb5PGnhwk1rfXU=
 =b0QD
 -----END PGP SIGNATURE-----

Merge 5.10.52 into android12-5.10-lts

Changes in 5.10.52
	certs: add 'x509_revocation_list' to gitignore
	cifs: handle reconnect of tcon when there is no cached dfs referral
	KVM: mmio: Fix use-after-free Read in kvm_vm_ioctl_unregister_coalesced_mmio
	KVM: x86: Use guest MAXPHYADDR from CPUID.0x8000_0008 iff TDP is enabled
	KVM: x86/mmu: Do not apply HPA (memory encryption) mask to GPAs
	KVM: nSVM: Check the value written to MSR_VM_HSAVE_PA
	KVM: X86: Disable hardware breakpoints unconditionally before kvm_x86->run()
	scsi: core: Fix bad pointer dereference when ehandler kthread is invalid
	scsi: zfcp: Report port fc_security as unknown early during remote cable pull
	tracing: Do not reference char * as a string in histograms
	drm/i915/gtt: drop the page table optimisation
	drm/i915/gt: Fix -EDEADLK handling regression
	cgroup: verify that source is a string
	fbmem: Do not delete the mode that is still in use
	drm/dp_mst: Do not set proposed vcpi directly
	drm/dp_mst: Avoid to mess up payload table by ports in stale topology
	drm/dp_mst: Add missing drm parameters to recently added call to drm_dbg_kms()
	drm/ingenic: Fix non-OSD mode
	drm/ingenic: Switch IPU plane to type OVERLAY
	Revert "drm/ast: Remove reference to struct drm_device.pdev"
	net: bridge: multicast: fix PIM hello router port marking race
	net: bridge: multicast: fix MRD advertisement router port marking race
	leds: tlc591xx: fix return value check in tlc591xx_probe()
	ASoC: Intel: sof_sdw: add mutual exclusion between PCH DMIC and RT715
	dmaengine: fsl-qdma: check dma_set_mask return value
	scsi: arcmsr: Fix the wrong CDB payload report to IOP
	srcu: Fix broken node geometry after early ssp init
	rcu: Reject RCU_LOCKDEP_WARN() false positives
	tty: serial: fsl_lpuart: fix the potential risk of division or modulo by zero
	serial: fsl_lpuart: disable DMA for console and fix sysrq
	misc/libmasm/module: Fix two use after free in ibmasm_init_one
	misc: alcor_pci: fix null-ptr-deref when there is no PCI bridge
	ASoC: intel/boards: add missing MODULE_DEVICE_TABLE
	partitions: msdos: fix one-byte get_unaligned()
	iio: gyro: fxa21002c: Balance runtime pm + use pm_runtime_resume_and_get().
	iio: magn: bmc150: Balance runtime pm + use pm_runtime_resume_and_get()
	ALSA: usx2y: Avoid camelCase
	ALSA: usx2y: Don't call free_pages_exact() with NULL address
	Revert "ALSA: bebob/oxfw: fix Kconfig entry for Mackie d.2 Pro"
	usb: common: usb-conn-gpio: fix NULL pointer dereference of charger
	w1: ds2438: fixing bug that would always get page0
	scsi: arcmsr: Fix doorbell status being updated late on ARC-1886
	scsi: hisi_sas: Propagate errors in interrupt_init_v1_hw()
	scsi: lpfc: Fix "Unexpected timeout" error in direct attach topology
	scsi: lpfc: Fix crash when lpfc_sli4_hba_setup() fails to initialize the SGLs
	scsi: core: Cap scsi_host cmd_per_lun at can_queue
	ALSA: ac97: fix PM reference leak in ac97_bus_remove()
	tty: serial: 8250: serial_cs: Fix a memory leak in error handling path
	scsi: mpt3sas: Fix deadlock while cancelling the running firmware event
	scsi: core: Fixup calling convention for scsi_mode_sense()
	scsi: scsi_dh_alua: Check for negative result value
	fs/jfs: Fix missing error code in lmLogInit()
	scsi: megaraid_sas: Fix resource leak in case of probe failure
	scsi: megaraid_sas: Early detection of VD deletion through RaidMap update
	scsi: megaraid_sas: Handle missing interrupts while re-enabling IRQs
	scsi: iscsi: Add iscsi_cls_conn refcount helpers
	scsi: iscsi: Fix conn use after free during resets
	scsi: iscsi: Fix shost->max_id use
	scsi: qedi: Fix null ref during abort handling
	scsi: qedi: Fix race during abort timeouts
	scsi: qedi: Fix TMF session block/unblock use
	scsi: qedi: Fix cleanup session block/unblock use
	mfd: da9052/stmpe: Add and modify MODULE_DEVICE_TABLE
	mfd: cpcap: Fix cpcap dmamask not set warnings
	ASoC: img: Fix PM reference leak in img_i2s_in_probe()
	fsi: Add missing MODULE_DEVICE_TABLE
	serial: tty: uartlite: fix console setup
	s390/sclp_vt220: fix console name to match device
	s390: disable SSP when needed
	selftests: timers: rtcpie: skip test if default RTC device does not exist
	ALSA: sb: Fix potential double-free of CSP mixer elements
	powerpc/ps3: Add dma_mask to ps3_dma_region
	iommu/arm-smmu: Fix arm_smmu_device refcount leak when arm_smmu_rpm_get fails
	iommu/arm-smmu: Fix arm_smmu_device refcount leak in address translation
	ASoC: soc-pcm: fix the return value in dpcm_apply_symmetry()
	gpio: zynq: Check return value of pm_runtime_get_sync
	gpio: zynq: Check return value of irq_get_irq_data
	scsi: storvsc: Correctly handle multiple flags in srb_status
	ALSA: ppc: fix error return code in snd_pmac_probe()
	selftests/powerpc: Fix "no_handler" EBB selftest
	gpio: pca953x: Add support for the On Semi pca9655
	powerpc/mm/book3s64: Fix possible build error
	ASoC: soc-core: Fix the error return code in snd_soc_of_parse_audio_routing()
	habanalabs/gaudi: set the correct cpu_id on MME2_QM failure
	habanalabs: remove node from list before freeing the node
	s390/processor: always inline stap() and __load_psw_mask()
	s390/ipl_parm: fix program check new psw handling
	s390/mem_detect: fix diag260() program check new psw handling
	s390/mem_detect: fix tprot() program check new psw handling
	Input: hideep - fix the uninitialized use in hideep_nvm_unlock()
	ALSA: bebob: add support for ToneWeal FW66
	ALSA: usb-audio: scarlett2: Fix 18i8 Gen 2 PCM Input count
	ALSA: usb-audio: scarlett2: Fix data_mutex lock
	ALSA: usb-audio: scarlett2: Fix scarlett2_*_ctl_put() return values
	usb: gadget: f_hid: fix endianness issue with descriptors
	usb: gadget: hid: fix error return code in hid_bind()
	powerpc/boot: Fixup device-tree on little endian
	ASoC: Intel: kbl_da7219_max98357a: shrink platform_id below 20 characters
	backlight: lm3630a: Fix return code of .update_status() callback
	ALSA: hda: Add IRQ check for platform_get_irq()
	ALSA: usb-audio: scarlett2: Fix 6i6 Gen 2 line out descriptions
	ALSA: firewire-motu: fix detection for S/PDIF source on optical interface in v2 protocol
	leds: turris-omnia: add missing MODULE_DEVICE_TABLE
	staging: rtl8723bs: fix macro value for 2.4Ghz only device
	intel_th: Wait until port is in reset before programming it
	i2c: core: Disable client irq on reboot/shutdown
	phy: intel: Fix for warnings due to EMMC clock 175Mhz change in FIP
	lib/decompress_unlz4.c: correctly handle zero-padding around initrds.
	kcov: add __no_sanitize_coverage to fix noinstr for all architectures
	power: supply: sc27xx: Add missing MODULE_DEVICE_TABLE
	power: supply: sc2731_charger: Add missing MODULE_DEVICE_TABLE
	pwm: spear: Don't modify HW state in .remove callback
	PCI: ftpci100: Rename macro name collision
	power: supply: ab8500: Avoid NULL pointers
	PCI: hv: Fix a race condition when removing the device
	power: supply: max17042: Do not enforce (incorrect) interrupt trigger type
	power: reset: gpio-poweroff: add missing MODULE_DEVICE_TABLE
	ARM: 9087/1: kprobes: test-thumb: fix for LLVM_IAS=1
	PCI/P2PDMA: Avoid pci_get_slot(), which may sleep
	NFSv4: Fix delegation return in cases where we have to retry
	PCI: pciehp: Ignore Link Down/Up caused by DPC
	watchdog: Fix possible use-after-free in wdt_startup()
	watchdog: sc520_wdt: Fix possible use-after-free in wdt_turnoff()
	watchdog: Fix possible use-after-free by calling del_timer_sync()
	watchdog: imx_sc_wdt: fix pretimeout
	watchdog: iTCO_wdt: Account for rebooting on second timeout
	x86/fpu: Return proper error codes from user access functions
	remoteproc: core: Fix cdev remove and rproc del
	PCI: tegra: Add missing MODULE_DEVICE_TABLE
	orangefs: fix orangefs df output.
	ceph: remove bogus checks and WARN_ONs from ceph_set_page_dirty
	drm/gma500: Add the missed drm_gem_object_put() in psb_user_framebuffer_create()
	NFS: nfs_find_open_context() may only select open files
	power: supply: charger-manager: add missing MODULE_DEVICE_TABLE
	power: supply: ab8500: add missing MODULE_DEVICE_TABLE
	drm/amdkfd: fix sysfs kobj leak
	pwm: img: Fix PM reference leak in img_pwm_enable()
	pwm: tegra: Don't modify HW state in .remove callback
	ACPI: AMBA: Fix resource name in /proc/iomem
	ACPI: video: Add quirk for the Dell Vostro 3350
	PCI: rockchip: Register IRQ handlers after device and data are ready
	virtio-blk: Fix memory leak among suspend/resume procedure
	virtio_net: Fix error handling in virtnet_restore()
	virtio_console: Assure used length from device is limited
	f2fs: atgc: fix to set default age threshold
	NFSD: Fix TP_printk() format specifier in nfsd_clid_class
	x86/signal: Detect and prevent an alternate signal stack overflow
	f2fs: add MODULE_SOFTDEP to ensure crc32 is included in the initramfs
	f2fs: compress: fix to disallow temp extension
	remoteproc: k3-r5: Fix an error message
	PCI/sysfs: Fix dsm_label_utf16s_to_utf8s() buffer overrun
	power: supply: rt5033_battery: Fix device tree enumeration
	NFSv4: Initialise connection to the server in nfs4_alloc_client()
	NFSv4: Fix an Oops in pnfs_mark_request_commit() when doing O_DIRECT
	misc: alcor_pci: fix inverted branch condition
	um: fix error return code in slip_open()
	um: fix error return code in winch_tramp()
	ubifs: Fix off-by-one error
	ubifs: journal: Fix error return code in ubifs_jnl_write_inode()
	watchdog: aspeed: fix hardware timeout calculation
	watchdog: jz4740: Fix return value check in jz4740_wdt_probe()
	SUNRPC: prevent port reuse on transports which don't request it.
	nfs: fix acl memory leak of posix_acl_create()
	ubifs: Set/Clear I_LINKABLE under i_lock for whiteout inode
	PCI: iproc: Fix multi-MSI base vector number allocation
	PCI: iproc: Support multi-MSI only on uniprocessor kernel
	f2fs: fix to avoid adding tab before doc section
	x86/fpu: Fix copy_xstate_to_kernel() gap handling
	x86/fpu: Limit xstate copy size in xstateregs_set()
	PCI: intel-gw: Fix INTx enable
	pwm: imx1: Don't disable clocks at device remove time
	PCI: tegra194: Fix tegra_pcie_ep_raise_msi_irq() ill-defined shift
	vdpa/mlx5: Fix umem sizes assignments on VQ create
	vdpa/mlx5: Fix possible failure in umem size calculation
	virtio_net: move tx vq operation under tx queue lock
	nvme-tcp: can't set sk_user_data without write_lock
	nfsd: Reduce contention for the nfsd_file nf_rwsem
	ALSA: isa: Fix error return code in snd_cmi8330_probe()
	vdpa/mlx5: Clear vq ready indication upon device reset
	NFSv4/pnfs: Fix the layout barrier update
	NFSv4/pnfs: Fix layoutget behaviour after invalidation
	NFSv4/pNFS: Don't call _nfs4_pnfs_v3_ds_connect multiple times
	hexagon: handle {,SOFT}IRQENTRY_TEXT in linker script
	hexagon: use common DISCARDS macro
	ARM: dts: gemini-rut1xx: remove duplicate ethernet node
	reset: RESET_BRCMSTB_RESCAL should depend on ARCH_BRCMSTB
	reset: RESET_INTEL_GW should depend on X86
	reset: a10sr: add missing of_match_table reference
	ARM: exynos: add missing of_node_put for loop iteration
	ARM: dts: exynos: fix PWM LED max brightness on Odroid XU/XU3
	ARM: dts: exynos: fix PWM LED max brightness on Odroid HC1
	ARM: dts: exynos: fix PWM LED max brightness on Odroid XU4
	memory: stm32-fmc2-ebi: add missing of_node_put for loop iteration
	memory: atmel-ebi: add missing of_node_put for loop iteration
	reset: brcmstb: Add missing MODULE_DEVICE_TABLE
	memory: pl353: Fix error return code in pl353_smc_probe()
	ARM: dts: sun8i: h3: orangepi-plus: Fix ethernet phy-mode
	rtc: fix snprintf() checking in is_rtc_hctosys()
	arm64: dts: renesas: v3msk: Fix memory size
	ARM: dts: r8a7779, marzen: Fix DU clock names
	arm64: dts: ti: j7200-main: Enable USB2 PHY RX sensitivity workaround
	arm64: dts: renesas: Add missing opp-suspend properties
	arm64: dts: renesas: r8a7796[01]: Fix OPP table entry voltages
	ARM: dts: stm32: Connect PHY IRQ line on DH STM32MP1 SoM
	ARM: dts: stm32: Rework LAN8710Ai PHY reset on DHCOM SoM
	arm64: dts: qcom: trogdor: Add no-hpd to DSI bridge node
	firmware: tegra: Fix error return code in tegra210_bpmp_init()
	firmware: arm_scmi: Reset Rx buffer to max size during async commands
	dt-bindings: i2c: at91: fix example for scl-gpios
	ARM: dts: BCM5301X: Fixup SPI binding
	reset: bail if try_module_get() fails
	arm64: dts: renesas: r8a779a0: Drop power-domains property from GIC node
	arm64: dts: ti: k3-j721e-main: Fix external refclk input to SERDES
	memory: fsl_ifc: fix leak of IO mapping on probe failure
	memory: fsl_ifc: fix leak of private memory on probe failure
	arm64: dts: allwinner: a64-sopine-baseboard: change RGMII mode to TXID
	ARM: dts: dra7: Fix duplicate USB4 target module node
	ARM: dts: am335x: align ti,pindir-d0-out-d1-in property with dt-shema
	ARM: dts: am437x: align ti,pindir-d0-out-d1-in property with dt-shema
	thermal/drivers/sprd: Add missing MODULE_DEVICE_TABLE
	ARM: dts: imx6q-dhcom: Fix ethernet reset time properties
	ARM: dts: imx6q-dhcom: Fix ethernet plugin detection problems
	ARM: dts: imx6q-dhcom: Add gpios pinctrl for i2c bus recovery
	thermal/drivers/rcar_gen3_thermal: Fix coefficient calculations
	firmware: turris-mox-rwtm: fix reply status decoding function
	firmware: turris-mox-rwtm: report failures better
	firmware: turris-mox-rwtm: fail probing when firmware does not support hwrng
	firmware: turris-mox-rwtm: show message about HWRNG registration
	arm64: dts: rockchip: Re-add regulator-boot-on, regulator-always-on for vdd_gpu on rk3399-roc-pc
	arm64: dts: rockchip: Re-add regulator-always-on for vcc_sdio for rk3399-roc-pc
	scsi: be2iscsi: Fix an error handling path in beiscsi_dev_probe()
	sched/uclamp: Ignore max aggregation if rq is idle
	jump_label: Fix jump_label_text_reserved() vs __init
	static_call: Fix static_call_text_reserved() vs __init
	mips: always link byteswap helpers into decompressor
	mips: disable branch profiling in boot/decompress.o
	MIPS: vdso: Invalid GIC access through VDSO
	scsi: scsi_dh_alua: Fix signedness bug in alua_rtpg()
	seq_file: disallow extremely large seq buffer allocations
	Linux 5.10.52

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ic1b04661728db8b0e060ca6935783e15a22210da
2021-07-20 16:36:53 +02:00
Peter Zijlstra
53c5c2496f static_call: Fix static_call_text_reserved() vs __init
[ Upstream commit 2bee6d16e4 ]

It turns out that static_call_text_reserved() was reporting __init
text as being reserved past the time when the __init text was freed
and re-used.

This is mostly harmless and will at worst result in refusing a kprobe.

Fixes: 6333e8f73b ("static_call: Avoid kprobes on inline static_call()s")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20210628113045.106211657@infradead.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20 16:05:58 +02:00
Peter Zijlstra
59ae35884c jump_label: Fix jump_label_text_reserved() vs __init
[ Upstream commit 9e667624c2 ]

It turns out that jump_label_text_reserved() was reporting __init text
as being reserved past the time when the __init text was freed and
re-used.

For a long time, this resulted in, at worst, not being able to kprobe
text that happened to land at the re-used address. However a recent
commit e7bf1ba97a ("jump_label, x86: Emit short JMP") made it a
fatal mistake because it now needs to read the instruction in order to
determine the conflict -- an instruction that's no longer there.

Fixes: 4c3ef6d793 ("jump label: Add jump_label_text_reserved() to reserve jump points")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20210628113045.045141693@infradead.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20 16:05:58 +02:00
Xuewen Yan
143a6b8ec5 sched/uclamp: Ignore max aggregation if rq is idle
[ Upstream commit 3e1493f463 ]

When a task wakes up on an idle rq, uclamp_rq_util_with() would max
aggregate with rq value. But since there is no task enqueued yet, the
values are stale based on the last task that was running. When the new
task actually wakes up and enqueued, then the rq uclamp values should
reflect that of the newly woken up task effective uclamp values.

This is a problem particularly for uclamp_max because it default to
1024. If a task p with uclamp_max = 512 wakes up, then max aggregation
would ignore the capping that should apply when this task is enqueued,
which is wrong.

Fix that by ignoring max aggregation if the rq is idle since in that
case the effective uclamp value of the rq will be the ones of the task
that will wake up.

Fixes: 9d20ad7dfc ("sched/uclamp: Add uclamp_util_with()")
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
[qias: Changelog]
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Link: https://lore.kernel.org/r/20210630141204.8197-1-xuewen.yan94@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20 16:05:58 +02:00
Paul E. McKenney
35a35909ec rcu: Reject RCU_LOCKDEP_WARN() false positives
[ Upstream commit 3066820034 ]

If another lockdep report runs concurrently with an RCU lockdep report
from RCU_LOCKDEP_WARN(), the following sequence of events can occur:

1.	debug_lockdep_rcu_enabled() sees that lockdep is enabled
	when called from (say) synchronize_rcu().

2.	Lockdep is disabled by a concurrent lockdep report.

3.	debug_lockdep_rcu_enabled() evaluates its lockdep-expression
	argument, for example, lock_is_held(&rcu_bh_lock_map).

4.	Because lockdep is now disabled, lock_is_held() plays it safe and
	returns the constant 1.

5.	But in this case, the constant 1 is not safe, because invoking
	synchronize_rcu() under rcu_read_lock_bh() is disallowed.

6.	debug_lockdep_rcu_enabled() wrongly invokes lockdep_rcu_suspicious(),
	resulting in a false-positive splat.

This commit therefore changes RCU_LOCKDEP_WARN() to check
debug_lockdep_rcu_enabled() after checking the lockdep expression,
so that any "safe" returns from lock_is_held() are rejected by
debug_lockdep_rcu_enabled().  This requires memory ordering, which is
supplied by READ_ONCE(debug_locks).  The resulting volatile accesses
prevent the compiler from reordering and the fact that only one variable
is being accessed prevents the underlying hardware from reordering.
The combination works for IA64, which can reorder reads to the same
location, but this is defeated by the volatile accesses, which compile
to load instructions that provide ordering.

Reported-by: syzbot+dde0cc33951735441301@syzkaller.appspotmail.com
Reported-by: Matthew Wilcox <willy@infradead.org>
Reported-by: syzbot+88e4f02896967fe1ab0d@syzkaller.appspotmail.com
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20 16:05:38 +02:00
Frederic Weisbecker
23597afbe0 srcu: Fix broken node geometry after early ssp init
[ Upstream commit b5befe842e ]

An srcu_struct structure that is initialized before rcu_init_geometry()
will have its srcu_node hierarchy based on CONFIG_NR_CPUS.  Once
rcu_init_geometry() is called, this hierarchy is compressed as needed
for the actual maximum number of CPUs for this system.

Later on, that srcu_struct structure is confused, sometimes referring
to its initial CONFIG_NR_CPUS-based hierarchy, and sometimes instead
to the new num_possible_cpus() hierarchy.  For example, each of its
->mynode fields continues to reference the original leaf rcu_node
structures, some of which might no longer exist.  On the other hand,
srcu_for_each_node_breadth_first() traverses to the new node hierarchy.

There are at least two bad possible outcomes to this:

1) a) A callback enqueued early on an srcu_data structure (call it
      *sdp) is recorded pending on sdp->mynode->srcu_data_have_cbs in
      srcu_funnel_gp_start() with sdp->mynode pointing to a deep leaf
      (say 3 levels).

   b) The grace period ends after rcu_init_geometry() shrinks the
      nodes level to a single one.  srcu_gp_end() walks through the new
      srcu_node hierarchy without ever reaching the old leaves so the
      callback is never executed.

   This is easily reproduced on an 8 CPUs machine with CONFIG_NR_CPUS >= 32
   and "rcupdate.rcu_self_test=1". The srcu_barrier() after early tests
   verification never completes and the boot hangs:

	[ 5413.141029] INFO: task swapper/0:1 blocked for more than 4915 seconds.
	[ 5413.147564]       Not tainted 5.12.0-rc4+ #28
	[ 5413.151927] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
	[ 5413.159753] task:swapper/0       state:D stack:    0 pid:    1 ppid:     0 flags:0x00004000
	[ 5413.168099] Call Trace:
	[ 5413.170555]  __schedule+0x36c/0x930
	[ 5413.174057]  ? wait_for_completion+0x88/0x110
	[ 5413.178423]  schedule+0x46/0xf0
	[ 5413.181575]  schedule_timeout+0x284/0x380
	[ 5413.185591]  ? wait_for_completion+0x88/0x110
	[ 5413.189957]  ? mark_held_locks+0x61/0x80
	[ 5413.193882]  ? mark_held_locks+0x61/0x80
	[ 5413.197809]  ? _raw_spin_unlock_irq+0x24/0x50
	[ 5413.202173]  ? wait_for_completion+0x88/0x110
	[ 5413.206535]  wait_for_completion+0xb4/0x110
	[ 5413.210724]  ? srcu_torture_stats_print+0x110/0x110
	[ 5413.215610]  srcu_barrier+0x187/0x200
	[ 5413.219277]  ? rcu_tasks_verify_self_tests+0x50/0x50
	[ 5413.224244]  ? rdinit_setup+0x2b/0x2b
	[ 5413.227907]  rcu_verify_early_boot_tests+0x2d/0x40
	[ 5413.232700]  do_one_initcall+0x63/0x310
	[ 5413.236541]  ? rdinit_setup+0x2b/0x2b
	[ 5413.240207]  ? rcu_read_lock_sched_held+0x52/0x80
	[ 5413.244912]  kernel_init_freeable+0x253/0x28f
	[ 5413.249273]  ? rest_init+0x250/0x250
	[ 5413.252846]  kernel_init+0xa/0x110
	[ 5413.256257]  ret_from_fork+0x22/0x30

2) An srcu_struct structure that is initialized before rcu_init_geometry()
   and used afterward will always have stale rdp->mynode references,
   resulting in callbacks to be missed in srcu_gp_end(), just like in
   the previous scenario.

This commit therefore causes init_srcu_struct_nodes to initialize the
geometry, if needed.  This ensures that the srcu_node hierarchy is
properly built and distributed from the get-go.

Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20 16:05:38 +02:00
Christian Brauner
811763e3be cgroup: verify that source is a string
commit 3b0462726e upstream.

The following sequence can be used to trigger a UAF:

    int fscontext_fd = fsopen("cgroup");
    int fd_null = open("/dev/null, O_RDONLY);
    int fsconfig(fscontext_fd, FSCONFIG_SET_FD, "source", fd_null);
    close_range(3, ~0U, 0);

The cgroup v1 specific fs parser expects a string for the "source"
parameter.  However, it is perfectly legitimate to e.g.  specify a file
descriptor for the "source" parameter.  The fs parser doesn't know what
a filesystem allows there.  So it's a bug to assume that "source" is
always of type fs_value_is_string when it can reasonably also be
fs_value_is_file.

This assumption in the cgroup code causes a UAF because struct
fs_parameter uses a union for the actual value.  Access to that union is
guarded by the param->type member.  Since the cgroup paramter parser
didn't check param->type but unconditionally moved param->string into
fc->source a close on the fscontext_fd would trigger a UAF during
put_fs_context() which frees fc->source thereby freeing the file stashed
in param->file causing a UAF during a close of the fd_null.

Fix this by verifying that param->type is actually a string and report
an error if not.

In follow up patches I'll add a new generic helper that can be used here
and by other filesystems instead of this error-prone copy-pasta fix.
But fixing it in here first makes backporting a it to stable a lot
easier.

Fixes: 8d2451f499 ("cgroup1: switch to option-by-option parsing")
Reported-by: syzbot+283ce5a46486d6acdbaf@syzkaller.appspotmail.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: <stable@kernel.org>
Cc: syzkaller-bugs <syzkaller-bugs@googlegroups.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-20 16:05:36 +02:00
Steven Rostedt (VMware)
905169794d tracing: Do not reference char * as a string in histograms
commit 704adfb5a9 upstream.

The histogram logic was allowing events with char * pointers to be used as
normal strings. But it was easy to crash the kernel with:

 # echo 'hist:keys=filename' > events/syscalls/sys_enter_openat/trigger

And open some files, and boom!

 BUG: unable to handle page fault for address: 00007f2ced0c3280
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 1173fa067 P4D 1173fa067 PUD 1171b6067 PMD 1171dd067 PTE 0
 Oops: 0000 [#1] PREEMPT SMP
 CPU: 6 PID: 1810 Comm: cat Not tainted 5.13.0-rc5-test+ #61
 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01
v03.03 07/14/2016
 RIP: 0010:strlen+0x0/0x20
 Code: f6 82 80 2a 0b a9 20 74 11 0f b6 50 01 48 83 c0 01 f6 82 80 2a 0b
a9 20 75 ef c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 <80> 3f 00 74
10 48 89 f8 48 83 c0 01 80 38 00 75 f7 48 29 f8 c3

 RSP: 0018:ffffbdbf81567b50 EFLAGS: 00010246
 RAX: 0000000000000003 RBX: ffff93815cdb3800 RCX: ffff9382401a22d0
 RDX: 0000000000000100 RSI: 0000000000000000 RDI: 00007f2ced0c3280
 RBP: 0000000000000100 R08: ffff9382409ff074 R09: ffffbdbf81567c98
 R10: ffff9382409ff074 R11: 0000000000000000 R12: ffff9382409ff074
 R13: 0000000000000001 R14: ffff93815a744f00 R15: 00007f2ced0c3280
 FS:  00007f2ced0f8580(0000) GS:ffff93825a800000(0000)
knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f2ced0c3280 CR3: 0000000107069005 CR4: 00000000001706e0
 Call Trace:
  event_hist_trigger+0x463/0x5f0
  ? find_held_lock+0x32/0x90
  ? sched_clock_cpu+0xe/0xd0
  ? lock_release+0x155/0x440
  ? kernel_init_free_pages+0x6d/0x90
  ? preempt_count_sub+0x9b/0xd0
  ? kernel_init_free_pages+0x6d/0x90
  ? get_page_from_freelist+0x12c4/0x1680
  ? __rb_reserve_next+0xe5/0x460
  ? ring_buffer_lock_reserve+0x12a/0x3f0
  event_triggers_call+0x52/0xe0
  ftrace_syscall_enter+0x264/0x2c0
  syscall_trace_enter.constprop.0+0x1ee/0x210
  do_syscall_64+0x1c/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xae

Where it triggered a fault on strlen(key) where key was the filename.

The reason is that filename is a char * to user space, and the histogram
code just blindly dereferenced it, with obvious bad results.

I originally tried to use strncpy_from_user/kernel_nofault() but found
that there's other places that its dereferenced and not worth the effort.

Just do not allow "char *" to act like strings.

Link: https://lkml.kernel.org/r/20210715000206.025df9d2@rorschach.local.home

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: stable@vger.kernel.org
Acked-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Tom Zanussi <zanussi@kernel.org>
Fixes: 79e577cbce ("tracing: Support string type key properly")
Fixes: 5967bd5c42 ("tracing: Let filter_assign_type() detect FILTER_PTR_STRING")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-20 16:05:36 +02:00
Greg Kroah-Hartman
8db62be3c3 This is the 5.10.51 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmD1LZUACgkQONu9yGCS
 aT7gERAArjYemBSkD4/nq5HvxoVu7ueEyqI2orJCyB6b5npPrBZjlKna3SuYuNUF
 CmcX5Y2Ynxd3gWJvYFEJdAAQAEtKAzPdPv0QJ+KiLSP2bEZ4Q5grEJXfzcgxcB4L
 fUfaCZnUwjoII5xzW1+U2zCJ7YtGx5hZySLKMxSEc0IzawDlfMx4HdwlohXzczsA
 Zq3/sTJcW30PWSp6MSuMOH//lPh7sAoCnksAv4Yb8MZYZjC8JNnKFn+IwRUGWEMZ
 sFtNbq51sMgGq4TjYBIdO6wBElP4dgWJhYc4cO0667cDvgp6iod/bKlOLJSiwIVX
 uPWkyPihH9ZUvNVY0TxjjnxS1rnM0QhH+cEXNGn+SE7KaNmzsI2MaR+DVA2yWcFr
 9edqTq5x2CJJ0R/oXHP4nYFtsvlV/QcirlrF0OZHYwz84b16f37Ac7tHOQpr3kNO
 N29AW0l5XmpxbfHgo1Iaoi02seouLC47vRkvjTpS9mValGUC0ciXTb97CK4FremE
 34rskxIRWBU8HECYioFOeHTAi0+xb/9tOj87BnB5CJ28CD2Md27TTdorsISnqPb0
 /ER89QfVtlJLi17wGB0rlAm8fDF3Cy6BnA57QIql1z1NbJGc1cxenEAFiIOnbQ5G
 t6SLs3mgqpQizsKUFYimCWa04ZzY4Bg8H9bAI+M+9w7J6yujQzI=
 =dGMa
 -----END PGP SIGNATURE-----

Merge 5.10.51 into android12-5.10-lts

Changes in 5.10.51
	drm/mxsfb: Don't select DRM_KMS_FB_HELPER
	drm/zte: Don't select DRM_KMS_FB_HELPER
	drm/ast: Fixed CVE for DP501
	drm/amd/display: fix HDCP reset sequence on reinitialize
	drm/amd/amdgpu/sriov disable all ip hw status by default
	drm/vc4: fix argument ordering in vc4_crtc_get_margins()
	drm/bridge: nwl-dsi: Force a full modeset when crtc_state->active is changed to be true
	net: pch_gbe: Use proper accessors to BE data in pch_ptp_match()
	drm/amd/display: fix use_max_lb flag for 420 pixel formats
	clk: renesas: rcar-usb2-clock-sel: Fix error handling in .probe()
	hugetlb: clear huge pte during flush function on mips platform
	atm: iphase: fix possible use-after-free in ia_module_exit()
	mISDN: fix possible use-after-free in HFC_cleanup()
	atm: nicstar: Fix possible use-after-free in nicstar_cleanup()
	net: Treat __napi_schedule_irqoff() as __napi_schedule() on PREEMPT_RT
	drm/mediatek: Fix PM reference leak in mtk_crtc_ddp_hw_init()
	net: mdio: ipq8064: add regmap config to disable REGCACHE
	drm/bridge: lt9611: Add missing MODULE_DEVICE_TABLE
	reiserfs: add check for invalid 1st journal block
	drm/virtio: Fix double free on probe failure
	net: mdio: provide shim implementation of devm_of_mdiobus_register
	net/sched: cls_api: increase max_reclassify_loop
	pinctrl: equilibrium: Add missing MODULE_DEVICE_TABLE
	drm/scheduler: Fix hang when sched_entity released
	drm/sched: Avoid data corruptions
	udf: Fix NULL pointer dereference in udf_symlink function
	drm/vc4: Fix clock source for VEC PixelValve on BCM2711
	drm/vc4: hdmi: Fix PM reference leak in vc4_hdmi_encoder_pre_crtc_co()
	e100: handle eeprom as little endian
	igb: handle vlan types with checker enabled
	igb: fix assignment on big endian machines
	drm/bridge: cdns: Fix PM reference leak in cdns_dsi_transfer()
	clk: renesas: r8a77995: Add ZA2 clock
	net/mlx5e: IPsec/rep_tc: Fix rep_tc_update_skb drops IPsec packet
	net/mlx5: Fix lag port remapping logic
	drm: rockchip: add missing registers for RK3188
	drm: rockchip: add missing registers for RK3066
	net: stmmac: the XPCS obscures a potential "PHY not found" error
	RDMA/rtrs: Change MAX_SESS_QUEUE_DEPTH
	clk: tegra: Fix refcounting of gate clocks
	clk: tegra: Ensure that PLLU configuration is applied properly
	drm: bridge: cdns-mhdp8546: Fix PM reference leak in
	virtio-net: Add validation for used length
	ipv6: use prandom_u32() for ID generation
	MIPS: cpu-probe: Fix FPU detection on Ingenic JZ4760(B)
	MIPS: ingenic: Select CPU_SUPPORTS_CPUFREQ && MIPS_EXTERNAL_TIMER
	drm/amd/display: Avoid HDCP over-read and corruption
	drm/amdgpu: remove unsafe optimization to drop preamble ib
	net: tcp better handling of reordering then loss cases
	RDMA/cxgb4: Fix missing error code in create_qp()
	dm space maps: don't reset space map allocation cursor when committing
	dm writecache: don't split bios when overwriting contiguous cache content
	dm: Fix dm_accept_partial_bio() relative to zone management commands
	net: bridge: mrp: Update ring transitions.
	pinctrl: mcp23s08: fix race condition in irq handler
	ice: set the value of global config lock timeout longer
	ice: fix clang warning regarding deadcode.DeadStores
	virtio_net: Remove BUG() to avoid machine dead
	net: mscc: ocelot: check return value after calling platform_get_resource()
	net: bcmgenet: check return value after calling platform_get_resource()
	net: mvpp2: check return value after calling platform_get_resource()
	net: micrel: check return value after calling platform_get_resource()
	net: moxa: Use devm_platform_get_and_ioremap_resource()
	drm/amd/display: Fix DCN 3.01 DSCCLK validation
	drm/amd/display: Update scaling settings on modeset
	drm/amd/display: Release MST resources on switch from MST to SST
	drm/amd/display: Set DISPCLK_MAX_ERRDET_CYCLES to 7
	drm/amd/display: Fix off-by-one error in DML
	net: phy: realtek: add delay to fix RXC generation issue
	selftests: Clean forgotten resources as part of cleanup()
	net: sgi: ioc3-eth: check return value after calling platform_get_resource()
	drm/amdkfd: use allowed domain for vmbo validation
	fjes: check return value after calling platform_get_resource()
	selinux: use __GFP_NOWARN with GFP_NOWAIT in the AVC
	r8169: avoid link-up interrupt issue on RTL8106e if user enables ASPM
	drm/amd/display: Verify Gamma & Degamma LUT sizes in amdgpu_dm_atomic_check
	xfrm: Fix error reporting in xfrm_state_construct.
	dm writecache: commit just one block, not a full page
	wlcore/wl12xx: Fix wl12xx get_mac error if device is in ELP
	wl1251: Fix possible buffer overflow in wl1251_cmd_scan
	cw1200: add missing MODULE_DEVICE_TABLE
	drm/amdkfd: fix circular locking on get_wave_state
	drm/amdkfd: Fix circular lock in nocpsch path
	bpf: Fix up register-based shifts in interpreter to silence KUBSAN
	ice: fix incorrect payload indicator on PTYPE
	ice: mark PTYPE 2 as reserved
	mt76: mt7615: fix fixed-rate tx status reporting
	net: fix mistake path for netdev_features_strings
	net: ipa: Add missing of_node_put() in ipa_firmware_load()
	net: sched: fix error return code in tcf_del_walker()
	io_uring: fix false WARN_ONCE
	drm/amdgpu: fix bad address translation for sienna_cichlid
	drm/amdkfd: Walk through list with dqm lock hold
	mt76: mt7915: fix IEEE80211_HE_PHY_CAP7_MAX_NC for station mode
	rtl8xxxu: Fix device info for RTL8192EU devices
	MIPS: add PMD table accounting into MIPS'pmd_alloc_one
	net: fec: add ndo_select_queue to fix TX bandwidth fluctuations
	atm: nicstar: use 'dma_free_coherent' instead of 'kfree'
	atm: nicstar: register the interrupt handler in the right place
	vsock: notify server to shutdown when client has pending signal
	RDMA/rxe: Don't overwrite errno from ib_umem_get()
	iwlwifi: mvm: don't change band on bound PHY contexts
	iwlwifi: mvm: fix error print when session protection ends
	iwlwifi: pcie: free IML DMA memory allocation
	iwlwifi: pcie: fix context info freeing
	sfc: avoid double pci_remove of VFs
	sfc: error code if SRIOV cannot be disabled
	wireless: wext-spy: Fix out-of-bounds warning
	cfg80211: fix default HE tx bitrate mask in 2G band
	mac80211: consider per-CPU statistics if present
	mac80211_hwsim: add concurrent channels scanning support over virtio
	IB/isert: Align target max I/O size to initiator size
	media, bpf: Do not copy more entries than user space requested
	net: ip: avoid OOM kills with large UDP sends over loopback
	RDMA/cma: Fix rdma_resolve_route() memory leak
	Bluetooth: btusb: Fixed too many in-token issue for Mediatek Chip.
	Bluetooth: Fix the HCI to MGMT status conversion table
	Bluetooth: Fix alt settings for incoming SCO with transparent coding format
	Bluetooth: Shutdown controller after workqueues are flushed or cancelled
	Bluetooth: btusb: Add a new QCA_ROME device (0cf3:e500)
	Bluetooth: L2CAP: Fix invalid access if ECRED Reconfigure fails
	Bluetooth: L2CAP: Fix invalid access on ECRED Connection response
	Bluetooth: btusb: Add support USB ALT 3 for WBS
	Bluetooth: mgmt: Fix the command returns garbage parameter value
	Bluetooth: btusb: fix bt fiwmare downloading failure issue for qca btsoc.
	sched/fair: Ensure _sum and _avg values stay consistent
	bpf: Fix false positive kmemleak report in bpf_ringbuf_area_alloc()
	flow_offload: action should not be NULL when it is referenced
	sctp: validate from_addr_param return
	sctp: add size validation when walking chunks
	MIPS: loongsoon64: Reserve memory below starting pfn to prevent Oops
	MIPS: set mips32r5 for virt extensions
	selftests/resctrl: Fix incorrect parsing of option "-t"
	MIPS: MT extensions are not available on MIPS32r1
	ath11k: unlock on error path in ath11k_mac_op_add_interface()
	arm64: dts: rockchip: add rk3328 dwc3 usb controller node
	arm64: dts: rockchip: Enable USB3 for rk3328 Rock64
	loop: fix I/O error on fsync() in detached loop devices
	mm,hwpoison: return -EBUSY when migration fails
	io_uring: simplify io_remove_personalities()
	io_uring: Convert personality_idr to XArray
	io_uring: convert io_buffer_idr to XArray
	scsi: iscsi: Fix race condition between login and sync thread
	scsi: iscsi: Fix iSCSI cls conn state
	powerpc/mm: Fix lockup on kernel exec fault
	powerpc/barrier: Avoid collision with clang's __lwsync macro
	powerpc/powernv/vas: Release reference to tgid during window close
	drm/amdgpu: Update NV SIMD-per-CU to 2
	drm/amdgpu: enable sdma0 tmz for Raven/Renoir(V2)
	drm/radeon: Add the missed drm_gem_object_put() in radeon_user_framebuffer_create()
	drm/radeon: Call radeon_suspend_kms() in radeon_pci_shutdown() for Loongson64
	drm/vc4: txp: Properly set the possible_crtcs mask
	drm/vc4: crtc: Skip the TXP
	drm/vc4: hdmi: Prevent clock unbalance
	drm/dp: Handle zeroed port counts in drm_dp_read_downstream_info()
	drm/rockchip: dsi: remove extra component_del() call
	drm/amd/display: fix incorrrect valid irq check
	pinctrl/amd: Add device HID for new AMD GPIO controller
	drm/amd/display: Reject non-zero src_y and src_x for video planes
	drm/tegra: Don't set allow_fb_modifiers explicitly
	drm/msm/mdp4: Fix modifier support enabling
	drm/arm/malidp: Always list modifiers
	drm/nouveau: Don't set allow_fb_modifiers explicitly
	drm/i915/display: Do not zero past infoframes.vsc
	mmc: sdhci-acpi: Disable write protect detection on Toshiba Encore 2 WT8-B
	mmc: sdhci: Fix warning message when accessing RPMB in HS400 mode
	mmc: core: clear flags before allowing to retune
	mmc: core: Allow UHS-I voltage switch for SDSC cards if supported
	ata: ahci_sunxi: Disable DIPM
	arm64: tlb: fix the TTL value of tlb_get_level
	cpu/hotplug: Cure the cpusets trainwreck
	clocksource/arm_arch_timer: Improve Allwinner A64 timer workaround
	fpga: stratix10-soc: Add missing fpga_mgr_free() call
	ASoC: tegra: Set driver_name=tegra for all machine drivers
	i40e: fix PTP on 5Gb links
	qemu_fw_cfg: Make fw_cfg_rev_attr a proper kobj_attribute
	ipmi/watchdog: Stop watchdog timer when the current action is 'none'
	thermal/drivers/int340x/processor_thermal: Fix tcc setting
	ubifs: Fix races between xattr_{set|get} and listxattr operations
	power: supply: ab8500: Fix an old bug
	mfd: syscon: Free the allocated name field of struct regmap_config
	nvmem: core: add a missing of_node_put
	lkdtm/bugs: XFAIL UNALIGNED_LOAD_STORE_WRITE
	selftests/lkdtm: Fix expected text for CR4 pinning
	extcon: intel-mrfld: Sync hardware and software state on init
	seq_buf: Fix overflow in seq_buf_putmem_hex()
	rq-qos: fix missed wake-ups in rq_qos_throttle try two
	tracing: Simplify & fix saved_tgids logic
	tracing: Resize tgid_map to pid_max, not PID_MAX_DEFAULT
	ipack/carriers/tpci200: Fix a double free in tpci200_pci_probe
	coresight: Propagate symlink failure
	coresight: tmc-etf: Fix global-out-of-bounds in tmc_update_etf_buffer()
	dm zoned: check zone capacity
	dm writecache: flush origin device when writing and cache is full
	dm btree remove: assign new_root only when removal succeeds
	PCI: Leave Apple Thunderbolt controllers on for s2idle or standby
	PCI: aardvark: Fix checking for PIO Non-posted Request
	PCI: aardvark: Implement workaround for the readback value of VEND_ID
	media: subdev: disallow ioctl for saa6588/davinci
	media: dtv5100: fix control-request directions
	media: zr364xx: fix memory leak in zr364xx_start_readpipe
	media: gspca/sq905: fix control-request direction
	media: gspca/sunplus: fix zero-length control requests
	media: uvcvideo: Fix pixel format change for Elgato Cam Link 4K
	io_uring: fix clear IORING_SETUP_R_DISABLED in wrong function
	dm writecache: write at least 4k when committing
	pinctrl: mcp23s08: Fix missing unlock on error in mcp23s08_irq()
	drm/ast: Remove reference to struct drm_device.pdev
	jfs: fix GPF in diFree
	smackfs: restrict bytes count in smk_set_cipso()
	ext4: fix memory leak in ext4_fill_super
	f2fs: fix to avoid racing on fsync_entry_slab by multi filesystem instances
	Linux 5.10.51

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Icb10fed733a0050848ecc23db13ae3d134895acd
2021-07-19 17:29:53 +02:00
Paul Burton
eb81b5a37d tracing: Resize tgid_map to pid_max, not PID_MAX_DEFAULT
commit 4030a6e6a6 upstream.

Currently tgid_map is sized at PID_MAX_DEFAULT entries, which means that
on systems where pid_max is configured higher than PID_MAX_DEFAULT the
ftrace record-tgid option doesn't work so well. Any tasks with PIDs
higher than PID_MAX_DEFAULT are simply not recorded in tgid_map, and
don't show up in the saved_tgids file.

In particular since systemd v243 & above configure pid_max to its
highest possible 1<<22 value by default on 64 bit systems this renders
the record-tgids option of little use.

Increase the size of tgid_map to the configured pid_max instead,
allowing it to cover the full range of PIDs up to the maximum value of
PID_MAX_LIMIT if the system is configured that way.

On 64 bit systems with pid_max == PID_MAX_LIMIT this will increase the
size of tgid_map from 256KiB to 16MiB. Whilst this 64x increase in
memory overhead sounds significant 64 bit systems are presumably best
placed to accommodate it, and since tgid_map is only allocated when the
record-tgid option is actually used presumably the user would rather it
spends sufficient memory to actually record the tgids they expect.

The size of tgid_map could also increase for CONFIG_BASE_SMALL=y
configurations, but these seem unlikely to be systems upon which people
are both configuring a large pid_max and running ftrace with record-tgid
anyway.

Of note is that we only allocate tgid_map once, the first time that the
record-tgid option is enabled. Therefore its size is only set once, to
the value of pid_max at the time the record-tgid option is first
enabled. If a user increases pid_max after that point, the saved_tgids
file will not contain entries for any tasks with pids beyond the earlier
value of pid_max.

Link: https://lkml.kernel.org/r/20210701172407.889626-2-paulburton@google.com

Fixes: d914ba37d7 ("tracing: Add support for recording tgid of tasks")
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Paul Burton <paulburton@google.com>
[ Fixed comment coding style ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-19 09:45:01 +02:00
Paul Burton
3cda5b7f4e tracing: Simplify & fix saved_tgids logic
commit b81b3e959a upstream.

The tgid_map array records a mapping from pid to tgid, where the index
of an entry within the array is the pid & the value stored at that index
is the tgid.

The saved_tgids_next() function iterates over pointers into the tgid_map
array & dereferences the pointers which results in the tgid, but then it
passes that dereferenced value to trace_find_tgid() which treats it as a
pid & does a further lookup within the tgid_map array. It seems likely
that the intent here was to skip over entries in tgid_map for which the
recorded tgid is zero, but instead we end up skipping over entries for
which the thread group leader hasn't yet had its own tgid recorded in
tgid_map.

A minimal fix would be to remove the call to trace_find_tgid, turning:

  if (trace_find_tgid(*ptr))

into:

  if (*ptr)

..but it seems like this logic can be much simpler if we simply let
seq_read() iterate over the whole tgid_map array & filter out empty
entries by returning SEQ_SKIP from saved_tgids_show(). Here we take that
approach, removing the incorrect logic here entirely.

Link: https://lkml.kernel.org/r/20210630003406.4013668-1-paulburton@google.com

Fixes: d914ba37d7 ("tracing: Add support for recording tgid of tasks")
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Paul Burton <paulburton@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-19 09:45:00 +02:00
Jan Kara
8cc58a6e2c rq-qos: fix missed wake-ups in rq_qos_throttle try two
commit 11c7aa0dde upstream.

Commit 545fbd0775 ("rq-qos: fix missed wake-ups in rq_qos_throttle")
tried to fix a problem that a process could be sleeping in rq_qos_wait()
without anyone to wake it up. However the fix is not complete and the
following can still happen:

CPU1 (waiter1)		CPU2 (waiter2)		CPU3 (waker)
rq_qos_wait()		rq_qos_wait()
  acquire_inflight_cb() -> fails
			  acquire_inflight_cb() -> fails

						completes IOs, inflight
						  decreased
  prepare_to_wait_exclusive()
			  prepare_to_wait_exclusive()
  has_sleeper = !wq_has_single_sleeper() -> true as there are two sleepers
			  has_sleeper = !wq_has_single_sleeper() -> true
  io_schedule()		  io_schedule()

Deadlock as now there's nobody to wakeup the two waiters. The logic
automatically blocking when there are already sleepers is really subtle
and the only way to make it work reliably is that we check whether there
are some waiters in the queue when adding ourselves there. That way, we
are guaranteed that at least the first process to enter the wait queue
will recheck the waiting condition before going to sleep and thus
guarantee forward progress.

Fixes: 545fbd0775 ("rq-qos: fix missed wake-ups in rq_qos_throttle")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210607112613.25344-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-19 09:45:00 +02:00
Thomas Gleixner
b5e26be407 cpu/hotplug: Cure the cpusets trainwreck
commit b22afcdf04 upstream.

Alexey and Joshua tried to solve a cpusets related hotplug problem which is
user space visible and results in unexpected behaviour for some time after
a CPU has been plugged in and the corresponding uevent was delivered.

cpusets delegate the hotplug work (rebuilding cpumasks etc.) to a
workqueue. This is done because the cpusets code has already a lock
nesting of cgroups_mutex -> cpu_hotplug_lock. A synchronous callback or
waiting for the work to finish with cpu_hotplug_lock held can and will
deadlock because that results in the reverse lock order.

As a consequence the uevent can be delivered before cpusets have consistent
state which means that a user space invocation of sched_setaffinity() to
move a task to the plugged CPU fails up to the point where the scheduled
work has been processed.

The same is true for CPU unplug, but that does not create user observable
failure (yet).

It's still inconsistent to claim that an operation is finished before it
actually is and that's the real issue at hand. uevents just make it
reliably observable.

Obviously the problem should be fixed in cpusets/cgroups, but untangling
that is pretty much impossible because according to the changelog of the
commit which introduced this 8 years ago:

 3a5a6d0c2b03("cpuset: don't nest cgroup_mutex inside get_online_cpus()")

the lock order cgroups_mutex -> cpu_hotplug_lock is a design decision and
the whole code is built around that.

So bite the bullet and invoke the relevant cpuset function, which waits for
the work to finish, in _cpu_up/down() after dropping cpu_hotplug_lock and
only when tasks are not frozen by suspend/hibernate because that would
obviously wait forever.

Waiting there with cpu_add_remove_lock, which is protecting the present
and possible CPU maps, held is not a problem at all because neither work
queues nor cpusets/cgroups have any lockchains related to that lock.

Waiting in the hotplug machinery is not problematic either because there
are already state callbacks which wait for hardware queues to drain. It
makes the operations slightly slower, but hotplug is slow anyway.

This ensures that state is consistent before returning from a hotplug
up/down operation. It's still inconsistent during the operation, but that's
a different story.

Add a large comment which explains why this is done and why this is not a
dump ground for the hack of the day to work around half thought out locking
schemes. Document also the implications vs. hotplug operations and
serialization or the lack of it.

Thanks to Alexy and Joshua for analyzing why this temporary
sched_setaffinity() failure happened.

Fixes: 3a5a6d0c2b03("cpuset: don't nest cgroup_mutex inside get_online_cpus()")
Reported-by: Alexey Klimov <aklimov@redhat.com>
Reported-by: Joshua Baker <jobaker@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Alexey Klimov <aklimov@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87tuowcnv3.ffs@nanos.tec.linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-19 09:44:59 +02:00
Rustam Kovhaev
a61af01141 bpf: Fix false positive kmemleak report in bpf_ringbuf_area_alloc()
[ Upstream commit ccff81e1d0 ]

kmemleak scans struct page, but it does not scan the page content. If we
allocate some memory with kmalloc(), then allocate page with alloc_page(),
and if we put kmalloc pointer somewhere inside that page, kmemleak will
report kmalloc pointer as a false positive.

We can instruct kmemleak to scan the memory area by calling kmemleak_alloc()
and kmemleak_free(), but part of struct bpf_ringbuf is mmaped to user space,
and if struct bpf_ringbuf changes we would have to revisit and review size
argument in kmemleak_alloc(), because we do not want kmemleak to scan the
user space memory. Let's simplify things and use kmemleak_not_leak() here.

For posterity, also adding additional prior analysis from Andrii:

  I think either kmemleak or syzbot are misreporting this. I've added a
  bunch of printks around all allocations performed by BPF ringbuf. [...]
  On repro side I get these two warnings:

  [vmuser@archvm bpf]$ sudo ./repro
  BUG: memory leak
  unreferenced object 0xffff88810d538c00 (size 64):
    comm "repro", pid 2140, jiffies 4294692933 (age 14.540s)
    hex dump (first 32 bytes):
      00 af 19 04 00 ea ff ff c0 ae 19 04 00 ea ff ff  ................
      80 ae 19 04 00 ea ff ff c0 29 2e 04 00 ea ff ff  .........)......
    backtrace:
      [<0000000077bfbfbd>] __bpf_map_area_alloc+0x31/0xc0
      [<00000000587fa522>] ringbuf_map_alloc.cold.4+0x48/0x218
      [<0000000044d49e96>] __do_sys_bpf+0x359/0x1d90
      [<00000000f601d565>] do_syscall_64+0x2d/0x40
      [<0000000043d3112a>] entry_SYSCALL_64_after_hwframe+0x44/0xae

  BUG: memory leak
  unreferenced object 0xffff88810d538c80 (size 64):
    comm "repro", pid 2143, jiffies 4294699025 (age 8.448s)
    hex dump (first 32 bytes):
      80 aa 19 04 00 ea ff ff 00 ab 19 04 00 ea ff ff  ................
      c0 ab 19 04 00 ea ff ff 80 44 28 04 00 ea ff ff  .........D(.....
    backtrace:
      [<0000000077bfbfbd>] __bpf_map_area_alloc+0x31/0xc0
      [<00000000587fa522>] ringbuf_map_alloc.cold.4+0x48/0x218
      [<0000000044d49e96>] __do_sys_bpf+0x359/0x1d90
      [<00000000f601d565>] do_syscall_64+0x2d/0x40
      [<0000000043d3112a>] entry_SYSCALL_64_after_hwframe+0x44/0xae

  Note that both reported leaks (ffff88810d538c80 and ffff88810d538c00)
  correspond to pages array bpf_ringbuf is allocating and tracking properly
  internally. Note also that syzbot repro doesn't close FD of created BPF
  ringbufs, and even when ./repro itself exits with error, there are still
  two forked processes hanging around in my system. So clearly ringbuf maps
  are alive at that point. So reporting any memory leak looks weird at that
  point, because that memory is being used by active referenced BPF ringbuf.

  It's also a question why repro doesn't clean up its forks. But if I do a
  `pkill repro`, I do see that all the allocated memory is /properly/ cleaned
  up [and the] "leaks" are deallocated properly.

  BTW, if I add close() right after bpf() syscall in syzbot repro, I see that
  everything is immediately deallocated, like designed. And no memory leak
  is reported. So I don't think the problem is anywhere in bpf_ringbuf code,
  rather in the leak detection and/or repro itself.

Reported-by: syzbot+5d895828587f49e7fe9b@syzkaller.appspotmail.com
Signed-off-by: Rustam Kovhaev <rkovhaev@gmail.com>
[ Daniel: also included analysis from Andrii to the commit log ]
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: syzbot+5d895828587f49e7fe9b@syzkaller.appspotmail.com
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/CAEf4BzYk+dqs+jwu6VKXP-RttcTEGFe+ySTGWT9CRNkagDiJVA@mail.gmail.com
Link: https://lore.kernel.org/lkml/YNTAqiE7CWJhOK2M@nuc10
Link: https://lore.kernel.org/lkml/20210615101515.GC26027@arm.com
Link: https://syzkaller.appspot.com/bug?extid=5d895828587f49e7fe9b
Link: https://lore.kernel.org/bpf/20210626181156.1873604-1-rkovhaev@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-19 09:44:54 +02:00
Odin Ugedal
20285dc271 sched/fair: Ensure _sum and _avg values stay consistent
[ Upstream commit 1c35b07e6d ]

The _sum and _avg values are in general sync together with the PELT
divider. They are however not always completely in perfect sync,
resulting in situations where _sum gets to zero while _avg stays
positive. Such situations are undesirable.

This comes from the fact that PELT will increase period_contrib, also
increasing the PELT divider, without updating _sum and _avg values to
stay in perfect sync where (_sum == _avg * divider). However, such PELT
change will never lower _sum, making it impossible to end up in a
situation where _sum is zero and _avg is not.

Therefore, we need to ensure that when subtracting load outside PELT,
that when _sum is zero, _avg is also set to zero. This occurs when
(_sum < _avg * divider), and the subtracted (_avg * divider) is bigger
or equal to the current _sum, while the subtracted _avg is smaller than
the current _avg.

Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Odin Ugedal <odin@uged.al>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Link: https://lore.kernel.org/r/20210624111815.57937-1-odin@uged.al
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-19 09:44:54 +02:00
Daniel Borkmann
2e66c36f13 bpf: Fix up register-based shifts in interpreter to silence KUBSAN
[ Upstream commit 28131e9d93 ]

syzbot reported a shift-out-of-bounds that KUBSAN observed in the
interpreter:

  [...]
  UBSAN: shift-out-of-bounds in kernel/bpf/core.c:1420:2
  shift exponent 255 is too large for 64-bit type 'long long unsigned int'
  CPU: 1 PID: 11097 Comm: syz-executor.4 Not tainted 5.12.0-rc2-syzkaller #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  Call Trace:
   __dump_stack lib/dump_stack.c:79 [inline]
   dump_stack+0x141/0x1d7 lib/dump_stack.c:120
   ubsan_epilogue+0xb/0x5a lib/ubsan.c:148
   __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:327
   ___bpf_prog_run.cold+0x19/0x56c kernel/bpf/core.c:1420
   __bpf_prog_run32+0x8f/0xd0 kernel/bpf/core.c:1735
   bpf_dispatcher_nop_func include/linux/bpf.h:644 [inline]
   bpf_prog_run_pin_on_cpu include/linux/filter.h:624 [inline]
   bpf_prog_run_clear_cb include/linux/filter.h:755 [inline]
   run_filter+0x1a1/0x470 net/packet/af_packet.c:2031
   packet_rcv+0x313/0x13e0 net/packet/af_packet.c:2104
   dev_queue_xmit_nit+0x7c2/0xa90 net/core/dev.c:2387
   xmit_one net/core/dev.c:3588 [inline]
   dev_hard_start_xmit+0xad/0x920 net/core/dev.c:3609
   __dev_queue_xmit+0x2121/0x2e00 net/core/dev.c:4182
   __bpf_tx_skb net/core/filter.c:2116 [inline]
   __bpf_redirect_no_mac net/core/filter.c:2141 [inline]
   __bpf_redirect+0x548/0xc80 net/core/filter.c:2164
   ____bpf_clone_redirect net/core/filter.c:2448 [inline]
   bpf_clone_redirect+0x2ae/0x420 net/core/filter.c:2420
   ___bpf_prog_run+0x34e1/0x77d0 kernel/bpf/core.c:1523
   __bpf_prog_run512+0x99/0xe0 kernel/bpf/core.c:1737
   bpf_dispatcher_nop_func include/linux/bpf.h:644 [inline]
   bpf_test_run+0x3ed/0xc50 net/bpf/test_run.c:50
   bpf_prog_test_run_skb+0xabc/0x1c50 net/bpf/test_run.c:582
   bpf_prog_test_run kernel/bpf/syscall.c:3127 [inline]
   __do_sys_bpf+0x1ea9/0x4f00 kernel/bpf/syscall.c:4406
   do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
   entry_SYSCALL_64_after_hwframe+0x44/0xae
  [...]

Generally speaking, KUBSAN reports from the kernel should be fixed.
However, in case of BPF, this particular report caused concerns since
the large shift is not wrong from BPF point of view, just undefined.
In the verifier, K-based shifts that are >= {64,32} (depending on the
bitwidth of the instruction) are already rejected. The register-based
cases were not given their content might not be known at verification
time. Ideas such as verifier instruction rewrite with an additional
AND instruction for the source register were brought up, but regularly
rejected due to the additional runtime overhead they incur.

As Edward Cree rightly put it:

  Shifts by more than insn bitness are legal in the BPF ISA; they are
  implementation-defined behaviour [of the underlying architecture],
  rather than UB, and have been made legal for performance reasons.
  Each of the JIT backends compiles the BPF shift operations to machine
  instructions which produce implementation-defined results in such a
  case; the resulting contents of the register may be arbitrary but
  program behaviour as a whole remains defined.

  Guard checks in the fast path (i.e. affecting JITted code) will thus
  not be accepted.

  The case of division by zero is not truly analogous here, as division
  instructions on many of the JIT-targeted architectures will raise a
  machine exception / fault on division by zero, whereas (to the best
  of my knowledge) none will do so on an out-of-bounds shift.

Given the KUBSAN report only affects the BPF interpreter, but not JITs,
one solution is to add the ANDs with 63 or 31 into ___bpf_prog_run().
That would make the shifts defined, and thus shuts up KUBSAN, and the
compiler would optimize out the AND on any CPU that interprets the shift
amounts modulo the width anyway (e.g., confirmed from disassembly that
on x86-64 and arm64 the generated interpreter code is the same before
and after this fix).

The BPF interpreter is slow path, and most likely compiled out anyway
as distros select BPF_JIT_ALWAYS_ON to avoid speculative execution of
BPF instructions by the interpreter. Given the main argument was to
avoid sacrificing performance, the fact that the AND is optimized away
from compiler for mainstream archs helps as well as a solution moving
forward. Also add a comment on LSH/RSH/ARSH translation for JIT authors
to provide guidance when they see the ___bpf_prog_run() interpreter
code and use it as a model for a new JIT backend.

Reported-by: syzbot+bed360704c521841c85d@syzkaller.appspotmail.com
Reported-by: Kurt Manucredo <fuzzybritches0@gmail.com>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Co-developed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: syzbot+bed360704c521841c85d@syzkaller.appspotmail.com
Cc: Edward Cree <ecree.xilinx@gmail.com>
Link: https://lore.kernel.org/bpf/0000000000008f912605bd30d5d7@google.com
Link: https://lore.kernel.org/bpf/bac16d8d-c174-bdc4-91bd-bfa62b410190@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-19 09:44:50 +02:00
Elliot Berman
0bb433e014 ANDROID: debug_symbols: Add android_debug_for_each_module
Allow vendors to obtain a list of modules loaded at given time. Vendor
modules are able to register on part of notifier chain
(register_module_notifer), but a vendor module would never see modules
which are loaded before the one which registers on the notifier chain.
The kernel doesn't offer load order control, so a hook is necessary to
iterate through currently loaded kernel modules.

Bug: 193552324
Change-Id: I3b01cc1b90f8c0c7c21a37992cc7d607316efc7b
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
2021-07-15 13:59:25 -07:00
Greg Kroah-Hartman
1328352dcd Merge branch 'android12-5.10' into android12-5.10-lts
Sync up with android12-5.10 for the following commits:

870488eb07 ANDROID: GKI: 7/14/2021 KMI update
e9742a9ea5 ANDROID: Update the ABI symbol list
ec2190fd3f FROMLIST: arm64: avoid double ISB on kernel entry
98b2c1dd1c FROMLIST: arm64: mte: optimize GCR_EL1 modification on kernel entry/exit
a20103c331 BACKPORT: FROMLIST: arm64: mte: avoid TFSR related operations unless in async mode
3972be647a FROMLIST: Documentation: document the preferred tag checking mode feature
5adf29adb5 FROMLIST: arm64: mte: introduce a per-CPU tag checking mode preference
ce5ba15abc FROMLIST: arm64: move preemption disablement to prctl handlers
6c08feaa27 FROMLIST: arm64: mte: change ASYNC and SYNC TCF settings into bitfields
f438cf16cd FROMLIST: arm64: mte: rename gcr_user_excl to mte_ctrl
a4c9e551b6 BACKPORT: arm64: pac: Optimize kernel entry/exit key installation code paths
50829b8901 BACKPORT: arm64: Introduce prctl(PR_PAC_{SET,GET}_ENABLED_KEYS)
6119d18df7 ANDROID: cleancache: add oem data to cleancache_ops
a0c429e8e1 ANDROID: blkdev: add oem data to block_device_operations
26cd2564e1 FROMLIST: psi: stop relying on timer_pending for poll_work rescheduling
e85b291d7d ANDROID: GKI: Enable CONFIG_MEMCG
0ed7424fa0 ANDROID: GKI: net: add vendor hooks for 'struct sock' lifecycle
4d30956478 ANDROID: GKI: net: add vendor hooks for 'struct nf_conn' lifecycle
7786463e48 ANDROID: GKI: add vendor padding variable in struct sock
280c9b98aa ANDROID: GKI: add vendor padding variable in struct nf_conn
9d1b55d20a ANDROID: vendor_hooks: add a field in mem_cgroup
65115fdbf8 ANDROID: vendor_hooks: add a field in pglist_data
26920e0f3a FROMLIST: usb: dwc3: avoid NULL access of usb_gadget_driver
5bb2dd8d39 FROMGIT: usb: dwc3: dwc3-qcom: Enable tx-fifo-resize property by default
79274dbb00 FROMGIT: usb: dwc3: Resize TX FIFOs to meet EP bursting requirements
1e11f36199 FROMGIT: usb: gadget: configfs: Check USB configuration before adding
6da5e7afbf FROMGIT: usb: gadget: udc: core: Introduce check_config to verify USB configuration
2ed5fbf261 ANDROID: GKI: fscrypt: add OEM data to struct fscrypt_operations
194fd9239a ANDROID: GKI: fscrypt: add ABI padding to struct fscrypt_operations
8011eb2215 ANDROID: mm: provision to add shmem pages to inactive file lru head
9bb1247653 ANDROID: GKI: Enable CONFIG_CGROUP_NET_PRIO
a1ce719ca7 ANDROID: Delete the DMA-BUF attachment sysfs statistics
a2b3afb2f7 ANDROID: android: Add symbols to debug_symbols driver
914a7b14a0 UPSTREAM: USB: UDC core: Add udc_async_callbacks gadget op
9af9ef8dfa ANDROID: vendor_hooks: Add oem data to file struct
37485a3025 ANDROID: add kabi padding for structures for the android12 release
429c78f9b0 ANDROID: GKI: device.h: add Android ABI padding to some structures
aea5e1c230 ANDROID: GKI: elevator: add Android ABI padding to some structures
1b79ef2754 ANDROID: GKI: scsi: add Android ABI padding to some structures
33175403b9 ANDROID: GKI: workqueue.h: add Android ABI padding to some structures
d5c344a498 ANDROID: GKI: sched: add Android ABI padding to some structures
9c4854fa5a ANDROID: GKI: phy: add Android ABI padding to some structures
f4872b2353 ANDROID: GKI: fs.h: add Android ABI padding to some structures
48cddc7c42 ANDROID: GKI: dentry: add Android ABI padding to some structures
b9081a2925 ANDROID: GKI: bio: add Android ABI padding to some structures
99bf8cf8fa ANDROID: GKI: ufs: add Android ABI padding to some structures
9df147298f ANDROID: Update the generic symbol list
12f48605e8 ANDROID: mm: cma do not sleep for __GFP_NORETRY
0e688e972d ANDROID: mm: cma: skip problematic pageblock
9938b82be1 ANDROID: mm: bail out tlb free batching on page zapping when cma is going on
c8578a3e90 ANDROID: mm: lru_cache_disable skips lru cache drainnig
c01ce3b5ef ANDROID: mm: do not try test_page_isoalte if migration fails
675e504598 ANDROID: mm: add cma allocation statistics
b1e4543c27 UPSTREAM: mm, page_alloc: move draining pcplists to page isolation users
13bc06efd9 ANDROID: ALSA: compress: add vendor hook to support pause in draining
2faed77792 ANDROID: vendor_hooks: add vendor hook in blk_mq_rq_ctx_init()
292baba45a ANDROID: abi_gki_aarch64_qcom: Add I3C core symbols to qcom tree
eecc725a8e ANDROID: vendor_hooks: add vendor hook in blk_mq_alloc_rqs()
9c2958f454 ANDROID: GKI: Export put_task_stack symbol
288805c86a ANDROID: abi_gki_aarch64_qcom: Add idr_alloc_u32
e8516fd3af ANDROID: sound: usb: add vendor hook for cpu suspend support
d820d22b5d ANDROID: mm: page_pinner: use EXPORT_SYMBOL_GPL
efc09793ea ANDROID: GKI: update allowed GKI symbol for Exynosauto SoC
67e3e39eb1 ANDROID: GKI: sync allowed list for exynosauto SoC
d25e256373 ANDROID: ABI: add new symbols required by fips140.ko
50661975be ANDROID: fips140: add/update module help text
b7397e89db ANDROID: fips140: add power-up cryptographic self-tests
bd7d13c36e ANDROID: arm64: disable LSE when building the FIPS140 module
1061ef0493 ANDROID: jump_label: disable jump labels in fips140.ko
dcf509fea7 ANDROID: ipv6: add vendor hook for gen ipv6 link-local addr
018332e871 ANDROID: Revert "scsi: block: Do not accept any requests while suspended"
2ad2c3a25b ANDROID: abi_gki_aarch64_qcom: whitelist vm_event_states
7bcfde2601 ANDROID: ashmem: Export is_ashmem_file

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I7d60121fa9f25f007dab97dd666adcdb1964afc8
2021-07-15 17:17:09 +02:00
Peter Collingbourne
50829b8901 BACKPORT: arm64: Introduce prctl(PR_PAC_{SET,GET}_ENABLED_KEYS)
This change introduces a prctl that allows the user program to control
which PAC keys are enabled in a particular task. The main reason
why this is useful is to enable a userspace ABI that uses PAC to
sign and authenticate function pointers and other pointers exposed
outside of the function, while still allowing binaries conforming
to the ABI to interoperate with legacy binaries that do not sign or
authenticate pointers.

The idea is that a dynamic loader or early startup code would issue
this prctl very early after establishing that a process may load legacy
binaries, but before executing any PAC instructions.

This change adds a small amount of overhead to kernel entry and exit
due to additional required instruction sequences.

On a DragonBoard 845c (Cortex-A75) with the powersave governor, the
overhead of similar instruction sequences was measured as 4.9ns when
simulating the common case where IA is left enabled, or 43.7ns when
simulating the uncommon case where IA is disabled. These numbers can
be seen as the worst case scenario, since in more realistic scenarios
a better performing governor would be used and a newer chip would be
used that would support PAC unlike Cortex-A75 and would be expected
to be faster than Cortex-A75.

On an Apple M1 under a hypervisor, the overhead of the entry/exit
instruction sequences introduced by this patch was measured as 0.3ns
in the case where IA is left enabled, and 33.0ns in the case where
IA is disabled.

Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Dave Martin <Dave.Martin@arm.com>
Link: https://linux-review.googlesource.com/id/Ibc41a5e6a76b275efbaa126b31119dc197b927a5
Link: https://lore.kernel.org/r/d6609065f8f40397a4124654eb68c9f490b4d477.1616123271.git.pcc@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

Bug: 192536783
(cherry picked from commit 201698626f)
Change-Id: Ic0a21c92a22575f9ec3599fb67bd2931a50b9f04
[quic_eberman@quicinc.com: Resolved merge conflict in
 arch/arm64/kernel/process.c]
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
Signed-off-by: Peter Collingbourne <pcc@google.com>
2021-07-14 20:52:05 -07:00
Suren Baghdasaryan
26cd2564e1 FROMLIST: psi: stop relying on timer_pending for poll_work rescheduling
Psi polling mechanism is trying to minimize the number of wakeups to
run psi_poll_work and is currently relying on timer_pending() to detect
when this work is already scheduled. This provides a window of opportunity
for psi_group_change to schedule an immediate psi_poll_work after
poll_timer_fn got called but before psi_poll_work could reschedule itself.
Below is the depiction of this entire window:

poll_timer_fn
  wake_up_interruptible(&group->poll_wait);

psi_poll_worker
  wait_event_interruptible(group->poll_wait, ...)
  psi_poll_work
    psi_schedule_poll_work
      if (timer_pending(&group->poll_timer)) return;
      ...
      mod_timer(&group->poll_timer, jiffies + delay);

Prior to 461daba06b we used to rely on poll_scheduled atomic which was
reset and set back inside psi_poll_work and therefore this race window
was much smaller.
The larger window causes increased number of wakeups and our partners
report visible power regression of ~10mA after applying 461daba06b.
Bring back the poll_scheduled atomic and make this race window even
narrower by resetting poll_scheduled only when we reach polling expiration
time. This does not completely eliminate the possibility of extra wakeups
caused by a race with psi_group_change however it will limit it to the
worst case scenario of one extra wakeup per every tracking window (0.5s
in the worst case).
This patch also ensures correct ordering between clearing poll_scheduled
flag and obtaining changed_states using memory barrier. Correct ordering
between updating changed_states and setting poll_scheduled is ensured by
atomic_xchg operation.
By tracing the number of immediate rescheduling attempts performed by
psi_group_change and the number of these attempts being blocked due to
psi monitor being already active, we can assess the effects of this change:

Before the patch:
                                           Run#1    Run#2      Run#3
Immediate reschedules attempted:           684365   1385156    1261240
Immediate reschedules blocked:             682846   1381654    1258682
Immediate reschedules (delta):             1519     3502       2558
Immediate reschedules (% of attempted):    0.22%    0.25%      0.20%

After the patch:
                                           Run#1    Run#2      Run#3
Immediate reschedules attempted:           882244   770298    426218
Immediate reschedules blocked:             881996   769796    426074
Immediate reschedules (delta):             248      502       144
Immediate reschedules (% of attempted):    0.03%    0.07%     0.03%

The number of non-blocked immediate reschedules dropped from 0.22-0.25%
to 0.03-0.07%. The drop is attributed to the decrease in the race window
size and the fact that we allow this race only when psi monitors reach
polling window expiration time.

Fixes: 461daba06b ("psi: eliminate kthread_worker from psi trigger scheduling mechanism")
Reported-by: Kathleen Chang <yt.chang@mediatek.com>
Reported-by: Wenju Xu <wenju.xu@mediatek.com>
Reported-by: Jonathan Chen <jonathan.jmchen@mediatek.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: SH Chen <show-hong.chen@mediatek.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Link: https://lore.kernel.org/patchwork/patch/1455172/
Bug: 191127654
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ie61547ca043e702442a9c6db1468cfb60ff2e729
2021-07-14 20:52:04 -07:00
Greg Kroah-Hartman
11b396dfd9 Revert "Add a reference to ucounts for each cred"
This reverts commit b2c4d9a33c.

It breaks the abi and is not needed on Android kernels.

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I295a6c8088e4297400c1feebd78abd2f6c5edd7e
2021-07-14 19:46:17 +02:00
Greg Kroah-Hartman
049c7d395d Revert "cred: add missing return error code when set_cred_ucounts() failed"
This reverts commit 0855952ed4.

It breaks the abi and is not needed on Android kernels.

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I65001e80f3776ae4ab56be5da555dba394278c68
2021-07-14 19:46:05 +02:00