Commit Graph

4511 Commits

Author SHA1 Message Date
Leon Hwang
01793ed86b bpf, verifier: Correct tail_call_reachable for bpf prog
It's confusing to inspect 'prog->aux->tail_call_reachable' with drgn[0],
when bpf prog has tail call but 'tail_call_reachable' is false.

This patch corrects 'tail_call_reachable' when bpf prog has tail call.

Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
Link: https://lore.kernel.org/r/20240610124224.34673-2-hffilwlqm@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-20 19:48:29 -07:00
Jakub Kicinski
a6ec08beec Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

drivers/net/ethernet/broadcom/bnxt/bnxt.c
  1e7962114c ("bnxt_en: Restore PTP tx_avail count in case of skb_pad() error")
  165f87691a ("bnxt_en: add timestamping statistics support")

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-20 13:49:59 -07:00
Alexei Starovoitov
b90d77e5fd bpf: Fix remap of arena.
The bpf arena logic didn't account for mremap operation. Add a refcnt for
multiple mmap events to prevent use-after-free in arena_vm_close.

Fixes: 317460317a ("bpf: Introduce bpf_arena.")
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Barret Rhoden <brho@google.com>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Closes: https://lore.kernel.org/bpf/Zmuw29IhgyPNKnIM@xpf.sh.intel.com
Link: https://lore.kernel.org/bpf/20240617171812.76634-1-alexei.starovoitov@gmail.com
2024-06-18 17:19:46 +02:00
Yonghong Song
44b7f7151d bpf: Add missed var_off setting in coerce_subreg_to_size_sx()
In coerce_subreg_to_size_sx(), for the case where upper
sign extension bits are the same for smax32 and smin32
values, we missed to setup properly. This is especially
problematic if both smax32 and smin32's sign extension
bits are 1.

The following is a simple example illustrating the inconsistent
verifier states due to missed var_off:

  0: (85) call bpf_get_prandom_u32#7    ; R0_w=scalar()
  1: (bf) r3 = r0                       ; R0_w=scalar(id=1) R3_w=scalar(id=1)
  2: (57) r3 &= 15                      ; R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=15,var_off=(0x0; 0xf))
  3: (47) r3 |= 128                     ; R3_w=scalar(smin=umin=smin32=umin32=128,smax=umax=smax32=umax32=143,var_off=(0x80; 0xf))
  4: (bc) w7 = (s8)w3
  REG INVARIANTS VIOLATION (alu): range bounds violation u64=[0xffffff80, 0x8f] s64=[0xffffff80, 0x8f]
    u32=[0xffffff80, 0x8f] s32=[0x80, 0xffffff8f] var_off=(0x80, 0xf)

The var_off=(0x80, 0xf) is not correct, and the correct one should
be var_off=(0xffffff80; 0xf) since from insn 3, we know that at
insn 4, the sign extension bits will be 1. This patch fixed this
issue by setting var_off properly.

Fixes: 8100928c88 ("bpf: Support new sign-extension mov insns")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240615174632.3995278-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-17 10:45:46 -07:00
Yonghong Song
380d5f89a4 bpf: Add missed var_off setting in set_sext32_default_val()
Zac reported a verification failure and Alexei reproduced the issue
with a simple reproducer ([1]). The verification failure is due to missed
setting for var_off.

The following is the reproducer in [1]:
  0: R1=ctx() R10=fp0
  0: (71) r3 = *(u8 *)(r10 -387)        ;
     R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) R10=fp0
  1: (bc) w7 = (s8)w3                   ;
     R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff))
     R7_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=127,var_off=(0x0; 0x7f))
  2: (36) if w7 >= 0x2533823b goto pc-3
     mark_precise: frame0: last_idx 2 first_idx 0 subseq_idx -1
     mark_precise: frame0: regs=r7 stack= before 1: (bc) w7 = (s8)w3
     mark_precise: frame0: regs=r3 stack= before 0: (71) r3 = *(u8 *)(r10 -387)
  2: R7_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=127,var_off=(0x0; 0x7f))
  3: (b4) w0 = 0                        ; R0_w=0
  4: (95) exit

Note that after insn 1, the var_off for R7 is (0x0; 0x7f). This is not correct
since upper 24 bits of w7 could be 0 or 1. So correct var_off should be
(0x0; 0xffffffff). Missing var_off setting in set_sext32_default_val() caused later
incorrect analysis in zext_32_to_64(dst_reg) and reg_bounds_sync(dst_reg).

To fix the issue, set var_off correctly in set_sext32_default_val(). The correct
reg state after insn 1 becomes:
  1: (bc) w7 = (s8)w3                   ;
     R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff))
     R7_w=scalar(smin=0,smax=umax=0xffffffff,smin32=-128,smax32=127,var_off=(0x0; 0xffffffff))
and at insn 2, the verifier correctly determines either branch is possible.

  [1] https://lore.kernel.org/bpf/CAADnVQLPU0Shz7dWV4bn2BgtGdxN3uFHPeobGBA72tpg5Xoykw@mail.gmail.com/

Fixes: 8100928c88 ("bpf: Support new sign-extension mov insns")
Reported-by: Zac Ecob <zacecob@protonmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240615174626.3994813-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-17 10:45:46 -07:00
Alexei Starovoitov
98d7ca374b bpf: Track delta between "linked" registers.
Compilers can generate the code
  r1 = r2
  r1 += 0x1
  if r2 < 1000 goto ...
  use knowledge of r2 range in subsequent r1 operations

So remember constant delta between r2 and r1 and update r1 after 'if' condition.

Unfortunately LLVM still uses this pattern for loops with 'can_loop' construct:
for (i = 0; i < 1000 && can_loop; i++)

The "undo" pass was introduced in LLVM
https://reviews.llvm.org/D121937
to prevent this optimization, but it cannot cover all cases.
Instead of fighting middle end optimizer in BPF backend teach the verifier
about this pattern.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240613013815.953-3-alexei.starovoitov@gmail.com
2024-06-14 21:52:39 +02:00
Vadim Fedorenko
65d6d61d25 bpf: crypto: make state and IV dynptr nullable
Some ciphers do not require state and IV buffer, but with current
implementation 0-sized dynptr is always needed. With adjustment to
verifier we can provide NULL instead of 0-sized dynptr. Make crypto
kfuncs ready for this.

Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Link: https://lore.kernel.org/r/20240613211817.1551967-3-vadfed@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-13 16:33:04 -07:00
Vadim Fedorenko
a90797993a bpf: verifier: make kfuncs args nullalble
Some arguments to kfuncs might be NULL in some cases. But currently it's
not possible to pass NULL to any BTF structures because the check for
the suffix is located after all type checks. Move it to earlier place
to allow nullable args.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Link: https://lore.kernel.org/r/20240613211817.1551967-2-vadfed@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-13 16:33:04 -07:00
Maciej Żenczykowski
b99a95bc56 bpf: fix UML x86_64 compile failure
pcpu_hot (defined in arch/x86) is not available on user mode linux (ARCH=um)

Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Fixes: 1ae6921009 ("bpf: inline bpf_get_smp_processor_id() helper")
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Link: https://lore.kernel.org/r/20240613173146.2524647-1-maze@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-13 11:24:45 -07:00
Daniel Xu
78746f93e9 bpf: Fix bpf_dynptr documentation comments
The function argument names were changed but the doc comment was not.
Fix htmldocs build warning by updating doc comments.

Fixes: cce4c40b96 ("bpf: treewide: Align kfunc signatures to prog point-of-view")
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/d0b0eb05f91e12e5795966153b11998d3fc1d433.1718295425.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-13 11:18:43 -07:00
Daniel Borkmann
e73cd1cfc2 bpf: Reduce stack consumption in check_stack_write_fixed_off
The fake_reg moved into env->fake_reg given it consumes a lot of stack
space (120 bytes). Migrate the fake_reg in check_stack_write_fixed_off()
as well now that we have it.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20240613115310.25383-2-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-13 11:16:01 -07:00
Daniel Borkmann
9242480126 bpf: Fix reg_set_min_max corruption of fake_reg
Juan reported that after doing some changes to buzzer [0] and implementing
a new fuzzing strategy guided by coverage, they noticed the following in
one of the probes:

  [...]
  13: (79) r6 = *(u64 *)(r0 +0)         ; R0=map_value(ks=4,vs=8) R6_w=scalar()
  14: (b7) r0 = 0                       ; R0_w=0
  15: (b4) w0 = -1                      ; R0_w=0xffffffff
  16: (74) w0 >>= 1                     ; R0_w=0x7fffffff
  17: (5c) w6 &= w0                     ; R0_w=0x7fffffff R6_w=scalar(smin=smin32=0,smax=umax=umax32=0x7fffffff,var_off=(0x0; 0x7fffffff))
  18: (44) w6 |= 2                      ; R6_w=scalar(smin=umin=smin32=umin32=2,smax=umax=umax32=0x7fffffff,var_off=(0x2; 0x7ffffffd))
  19: (56) if w6 != 0x7ffffffd goto pc+1
  REG INVARIANTS VIOLATION (true_reg2): range bounds violation u64=[0x7fffffff, 0x7ffffffd] s64=[0x7fffffff, 0x7ffffffd] u32=[0x7fffffff, 0x7ffffffd] s32=[0x7fffffff, 0x7ffffffd] var_off=(0x7fffffff, 0x0)
  REG INVARIANTS VIOLATION (false_reg1): range bounds violation u64=[0x7fffffff, 0x7ffffffd] s64=[0x7fffffff, 0x7ffffffd] u32=[0x7fffffff, 0x7ffffffd] s32=[0x7fffffff, 0x7ffffffd] var_off=(0x7fffffff, 0x0)
  REG INVARIANTS VIOLATION (false_reg2): const tnum out of sync with range bounds u64=[0x0, 0xffffffffffffffff] s64=[0x8000000000000000, 0x7fffffffffffffff] u32=[0x0, 0xffffffff] s32=[0x80000000, 0x7fffffff] var_off=(0x7fffffff, 0x0)
  19: R6_w=0x7fffffff
  20: (95) exit

  from 19 to 21: R0=0x7fffffff R6=scalar(smin=umin=smin32=umin32=2,smax=umax=smax32=umax32=0x7ffffffe,var_off=(0x2; 0x7ffffffd)) R7=map_ptr(ks=4,vs=8) R9=ctx() R10=fp0 fp-24=map_ptr(ks=4,vs=8) fp-40=mmmmmmmm
  21: R0=0x7fffffff R6=scalar(smin=umin=smin32=umin32=2,smax=umax=smax32=umax32=0x7ffffffe,var_off=(0x2; 0x7ffffffd)) R7=map_ptr(ks=4,vs=8) R9=ctx() R10=fp0 fp-24=map_ptr(ks=4,vs=8) fp-40=mmmmmmmm
  21: (14) w6 -= 2147483632             ; R6_w=scalar(smin=umin=umin32=2,smax=umax=0xffffffff,smin32=0x80000012,smax32=14,var_off=(0x2; 0xfffffffd))
  22: (76) if w6 s>= 0xe goto pc+1      ; R6_w=scalar(smin=umin=umin32=2,smax=umax=0xffffffff,smin32=0x80000012,smax32=13,var_off=(0x2; 0xfffffffd))
  23: (95) exit

  from 22 to 24: R0=0x7fffffff R6_w=14 R7=map_ptr(ks=4,vs=8) R9=ctx() R10=fp0 fp-24=map_ptr(ks=4,vs=8) fp-40=mmmmmmmm
  24: R0=0x7fffffff R6_w=14 R7=map_ptr(ks=4,vs=8) R9=ctx() R10=fp0 fp-24=map_ptr(ks=4,vs=8) fp-40=mmmmmmmm
  24: (14) w6 -= 14                     ; R6_w=0
  [...]

What can be seen here is a register invariant violation on line 19. After
the binary-or in line 18, the verifier knows that bit 2 is set but knows
nothing about the rest of the content which was loaded from a map value,
meaning, range is [2,0x7fffffff] with var_off=(0x2; 0x7ffffffd). When in
line 19 the verifier analyzes the branch, it splits the register states
in reg_set_min_max() into the registers of the true branch (true_reg1,
true_reg2) and the registers of the false branch (false_reg1, false_reg2).

Since the test is w6 != 0x7ffffffd, the src_reg is a known constant.
Internally, the verifier creates a "fake" register initialized as scalar
to the value of 0x7ffffffd, and then passes it onto reg_set_min_max(). Now,
for line 19, it is mathematically impossible to take the false branch of
this program, yet the verifier analyzes it. It is impossible because the
second bit of r6 will be set due to the prior or operation and the
constant in the condition has that bit unset (hex(fd) == binary(1111 1101).

When the verifier first analyzes the false / fall-through branch, it will
compute an intersection between the var_off of r6 and of the constant. This
is because the verifier creates a "fake" register initialized to the value
of the constant. The intersection result later refines both registers in
regs_refine_cond_op():

  [...]
  t = tnum_intersect(tnum_subreg(reg1->var_off), tnum_subreg(reg2->var_off));
  reg1->var_off = tnum_with_subreg(reg1->var_off, t);
  reg2->var_off = tnum_with_subreg(reg2->var_off, t);
  [...]

Since the verifier is analyzing the false branch of the conditional jump,
reg1 is equal to false_reg1 and reg2 is equal to false_reg2, i.e. the reg2
is the "fake" register that was meant to hold a constant value. The resulting
var_off of the intersection says that both registers now hold a known value
of var_off=(0x7fffffff, 0x0) or in other words: this operation manages to
make the verifier think that the "constant" value that was passed in the
jump operation now holds a different value.

Normally this would not be an issue since it should not influence the true
branch, however, false_reg2 and true_reg2 are pointers to the same "fake"
register. Meaning, the false branch can influence the results of the true
branch. In line 24, the verifier assumes R6_w=0, but the actual runtime
value in this case is 1. The fix is simply not passing in the same "fake"
register location as inputs to reg_set_min_max(), but instead making a
copy. Moving the fake_reg into the env also reduces stack consumption by
120 bytes. With this, the verifier successfully rejects invalid accesses
from the test program.

  [0] https://github.com/google/buzzer

Fixes: 67420501e8 ("bpf: generalize reg_set_min_max() to handle non-const register comparisons")
Reported-by: Juan José López Jaimez <jjlopezjaimez@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20240613115310.25383-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-13 11:16:01 -07:00
Daniel Xu
cce4c40b96 bpf: treewide: Align kfunc signatures to prog point-of-view
Previously, kfunc declarations in bpf_kfuncs.h (and others) used "user
facing" types for kfuncs prototypes while the actual kfunc definitions
used "kernel facing" types. More specifically: bpf_dynptr vs
bpf_dynptr_kern, __sk_buff vs sk_buff, and xdp_md vs xdp_buff.

It wasn't an issue before, as the verifier allows aliased types.
However, since we are now generating kfunc prototypes in vmlinux.h (in
addition to keeping bpf_kfuncs.h around), this conflict creates
compilation errors.

Fix this conflict by using "user facing" types in kfunc definitions.
This results in more casts, but otherwise has no additional runtime
cost.

Note, similar to 5b268d1ebc ("bpf: Have bpf_rdonly_cast() take a const
pointer"), we also make kfuncs take const arguments where appropriate in
order to make the kfunc more permissive.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/b58346a63a0e66bc9b7504da751b526b0b189a67.1718207789.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-12 11:01:31 -07:00
Daniel Xu
ec209ad863 bpf: verifier: Relax caller requirements for kfunc projection type args
Currently, if a kfunc accepts a projection type as an argument (eg
struct __sk_buff *), the caller must exactly provide exactly the same
type with provable provenance.

However in practice, kfuncs that accept projection types _must_ cast to
the underlying type before use b/c projection type layouts are
completely made up. Thus, it is ok to relax the verifier rules around
implicit conversions.

We will use this functionality in the next commit when we align kfuncs
to user-facing types.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/e2c025cb09ccfd4af1ec9e18284dc3cecff7514d.1718207789.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-12 11:01:31 -07:00
Jakub Kicinski
b1156532bc bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZmIsRAAKCRDbK58LschI
 g4SSAP0bkl6rPMn7zp1h+/l7hlvpp2aVOmasBTe8hIhAGUbluwD/TGq4sNsGgXFI
 i4tUtFRhw8pOjy2guy6526qyJvBs8wY=
 =WMhY
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2024-06-06

We've added 54 non-merge commits during the last 10 day(s) which contain
a total of 50 files changed, 1887 insertions(+), 527 deletions(-).

The main changes are:

1) Add a user space notification mechanism via epoll when a struct_ops
   object is getting detached/unregistered, from Kui-Feng Lee.

2) Big batch of BPF selftest refactoring for sockmap and BPF congctl
   tests, from Geliang Tang.

3) Add BTF field (type and string fields, right now) iterator support
   to libbpf instead of using existing callback-based approaches,
   from Andrii Nakryiko.

4) Extend BPF selftests for the latter with a new btf_field_iter
   selftest, from Alan Maguire.

5) Add new kfuncs for a generic, open-coded bits iterator,
   from Yafang Shao.

6) Fix BPF selftests' kallsyms_find() helper under kernels configured
   with CONFIG_LTO_CLANG_THIN, from Yonghong Song.

7) Remove a bunch of unused structs in BPF selftests,
   from David Alan Gilbert.

8) Convert test_sockmap section names into names understood by libbpf
   so it can deduce program type and attach type, from Jakub Sitnicki.

9) Extend libbpf with the ability to configure log verbosity
   via LIBBPF_LOG_LEVEL environment variable, from Mykyta Yatsenko.

10) Fix BPF selftests with regards to bpf_cookie and find_vma flakiness
    in nested VMs, from Song Liu.

11) Extend riscv32/64 JITs to introduce shift/add helpers to generate Zba
    optimization, from Xiao Wang.

12) Enable BPF programs to declare arrays and struct fields with kptr,
    bpf_rb_root, and bpf_list_head, from Kui-Feng Lee.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (54 commits)
  selftests/bpf: Drop useless arguments of do_test in bpf_tcp_ca
  selftests/bpf: Use start_test in test_dctcp in bpf_tcp_ca
  selftests/bpf: Use start_test in test_dctcp_fallback in bpf_tcp_ca
  selftests/bpf: Add start_test helper in bpf_tcp_ca
  selftests/bpf: Use connect_to_fd_opts in do_test in bpf_tcp_ca
  libbpf: Auto-attach struct_ops BPF maps in BPF skeleton
  selftests/bpf: Add btf_field_iter selftests
  selftests/bpf: Fix send_signal test with nested CONFIG_PARAVIRT
  libbpf: Remove callback-based type/string BTF field visitor helpers
  bpftool: Use BTF field iterator in btfgen
  libbpf: Make use of BTF field iterator in BTF handling code
  libbpf: Make use of BTF field iterator in BPF linker code
  libbpf: Add BTF field iterator
  selftests/bpf: Ignore .llvm.<hash> suffix in kallsyms_find()
  selftests/bpf: Fix bpf_cookie and find_vma in nested VM
  selftests/bpf: Test global bpf_list_head arrays.
  selftests/bpf: Test global bpf_rb_root arrays and fields in nested struct types.
  selftests/bpf: Test kptr arrays and kptrs in nested struct fields.
  bpf: limit the number of levels of a nested struct type.
  bpf: look into the types of the fields of a struct type recursively.
  ...
====================

Link: https://lore.kernel.org/r/20240606223146.23020-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-10 18:02:14 -07:00
Jakub Kicinski
62b5bf58b9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts.

Adjacent changes:

drivers/net/ethernet/pensando/ionic/ionic_txrx.c
  d9c0420999 ("ionic: Mark error paths in the data path as unlikely")
  491aee894a ("ionic: fix kernel panic in XDP_TX action")

net/ipv6/ip6_fib.c
  b4cb4a1391 ("net: use unrcu_pointer() helper")
  b01e1c0307 ("ipv6: fix possible race in __fib6_drop_pcpu_from()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-06 12:06:56 -07:00
Kui-Feng Lee
f19caf57d8 bpf: limit the number of levels of a nested struct type.
Limit the number of levels looking into struct types to avoid running out
of stack space.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240523174202.461236-7-thinker.li@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-03 20:52:42 -07:00
Kui-Feng Lee
64e8ee8148 bpf: look into the types of the fields of a struct type recursively.
The verifier has field information for specific special types, such as
kptr, rbtree root, and list head. These types are handled
differently. However, we did not previously examine the types of fields of
a struct type variable. Field information records were not generated for
the kptrs, rbtree roots, and linked_list heads that are not located at the
outermost struct type of a variable.

For example,

  struct A {
    struct task_struct __kptr * task;
  };

  struct B {
    struct A mem_a;
  }

  struct B var_b;

It did not examine "struct A" so as not to generate field information for
the kptr in "struct A" for "var_b".

This patch enables BPF programs to define fields of these special types in
a struct type other than the direct type of a variable or in a struct type
that is the type of a field in the value type of a map.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240523174202.461236-6-thinker.li@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-03 20:52:42 -07:00
Kui-Feng Lee
994796c025 bpf: create repeated fields for arrays.
The verifier uses field information for certain special types, such as
kptr, rbtree root, and list head. These types are treated
differently. However, we did not previously support these types in
arrays. This update examines arrays and duplicates field information the
same number of times as the length of the array if the element type is one
of the special types.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240523174202.461236-5-thinker.li@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-03 20:52:42 -07:00
Kui-Feng Lee
a7db0d4f87 bpf: refactor btf_find_struct_field() and btf_find_datasec_var().
Move common code of the two functions to btf_find_field_one().

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240523174202.461236-4-thinker.li@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-03 20:52:42 -07:00
Kui-Feng Lee
482f713379 bpf: Remove unnecessary call to btf_field_type_size().
field->size has been initialized by bpf_parse_fields() with the value
returned by btf_field_type_size(). Use it instead of calling
btf_field_type_size() again.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240523174202.461236-3-thinker.li@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-03 20:52:42 -07:00
Kui-Feng Lee
c95a3be45a bpf: Remove unnecessary checks on the offset of btf_field.
reg_find_field_offset() always return a btf_field with a matching offset
value. Checking the offset of the returned btf_field is unnecessary.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240523174202.461236-2-thinker.li@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-03 20:52:42 -07:00
Cong Wang
2884dc7d08 bpf: Fix a potential use-after-free in bpf_link_free()
After commit 1a80dbcb2d, bpf_link can be freed by
link->ops->dealloc_deferred, but the code still tests and uses
link->ops->dealloc afterward, which leads to a use-after-free as
reported by syzbot. Actually, one of them should be sufficient, so
just call one of them instead of both. Also add a WARN_ON() in case
of any problematic implementation.

Fixes: 1a80dbcb2d ("bpf: support deferring bpf_link dealloc to after RCU grace period")
Reported-by: syzbot+1989ee16d94720836244@syzkaller.appspotmail.com
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20240602182703.207276-1-xiyou.wangcong@gmail.com
2024-06-03 18:16:19 +02:00
Thorsten Blum
2317dc2c22 bpf, devmap: Remove unnecessary if check in for loop
The iterator variable dst cannot be NULL and the if check can be removed.
Remove it and fix the following Coccinelle/coccicheck warning reported
by itnull.cocci:

	ERROR: iterator variable bound on line 762 cannot be NULL

Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20240529101900.103913-2-thorsten.blum@toblux.com
2024-06-03 17:09:23 +02:00
Jiri Olsa
aeb8fe0283 bpf: Fix bpf_session_cookie BTF_ID in special_kfunc_set list
The bpf_session_cookie is unavailable for !CONFIG_FPROBE as reported
by Sebastian [1].

To fix that we remove CONFIG_FPROBE ifdef for session kfuncs, which
is fine, because there's filter for session programs.

Then based on bpf_trace.o dependency:
  obj-$(CONFIG_BPF_EVENTS) += bpf_trace.o

we add bpf_session_cookie BTF_ID in special_kfunc_set list dependency
on CONFIG_BPF_EVENTS.

[1] https://lore.kernel.org/bpf/20240531071557.MvfIqkn7@linutronix.de/T/#m71c6d5ec71db2967288cb79acedc15cc5dbfeec5
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Fixes: 5c919acef8 ("bpf: Add support for kprobe session cookie")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20240531194500.2967187-1-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-31 14:54:48 -07:00
Jakub Kicinski
e19de2064f Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

drivers/net/ethernet/ti/icssg/icssg_classifier.c
  abd5576b9c ("net: ti: icssg-prueth: Add support for ICSSG switch firmware")
  56a5cf538c ("net: ti: icssg-prueth: Fix start counter for ft1 filter")
https://lore.kernel.org/all/20240531123822.3bb7eadf@canb.auug.org.au/

No other adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-31 14:10:28 -07:00
Kui-Feng Lee
67c3e8353f bpf: export bpf_link_inc_not_zero.
bpf_link_inc_not_zero() will be used by kernel modules.  We will use it in
bpf_testmod.c later.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240530065946.979330-5-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-05-30 15:34:13 -07:00
Kui-Feng Lee
1adddc97aa bpf: support epoll from bpf struct_ops links.
Add epoll support to bpf struct_ops links to trigger EPOLLHUP event upon
detachment.

This patch implements the "poll" of the "struct file_operations" for BPF
links and introduces a new "poll" operator in the "struct bpf_link_ops". By
implementing "poll" of "struct bpf_link_ops" for the links of struct_ops,
the file descriptor of a struct_ops link can be added to an epoll file
descriptor to receive EPOLLHUP events.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240530065946.979330-4-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-05-30 15:34:13 -07:00
Kui-Feng Lee
6fb2544ea1 bpf: enable detaching links of struct_ops objects.
Implement the detach callback in bpf_link_ops for struct_ops so that user
programs can detach a struct_ops link. The subsystems that struct_ops
objects are registered to can also use this callback to detach the links
being passed to them.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240530065946.979330-3-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-05-30 15:34:13 -07:00
Kui-Feng Lee
73287fe228 bpf: pass bpf_struct_ops_link to callbacks in bpf_struct_ops.
Pass an additional pointer of bpf_struct_ops_link to callback function reg,
unreg, and update provided by subsystems defined in bpf_struct_ops. A
bpf_struct_ops_map can be registered for multiple links. Passing a pointer
of bpf_struct_ops_link helps subsystems to distinguish them.

This pointer will be used in the later patches to let the subsystem
initiate a detachment on a link that was registered to it previously.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240530065946.979330-2-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-05-30 15:34:13 -07:00
Yafang Shao
4665415975 bpf: Add bits iterator
Add three new kfuncs for the bits iterator:
- bpf_iter_bits_new
  Initialize a new bits iterator for a given memory area. Due to the
  limitation of bpf memalloc, the max number of words (8-byte units) that
  can be iterated over is limited to (4096 / 8).
- bpf_iter_bits_next
  Get the next bit in a bpf_iter_bits
- bpf_iter_bits_destroy
  Destroy a bpf_iter_bits

The bits iterator facilitates the iteration of the bits of a memory area,
such as cpumask. It can be used in any context and on any address.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240517023034.48138-2-laoar.shao@gmail.com
2024-05-29 16:01:47 -07:00
Jakub Kicinski
4b3529edbb bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZlWtmQAKCRDbK58LschI
 g0TUAQDT76jx7Rq1DShCtZ3eqiBMNkYczK8b+GqNsSG8YGduaAEA1jn/GN+H65Rh
 atQZ/pYAfLZflMV04+XE0GyBr5q1uQg=
 =NczG
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2024-05-28

We've added 23 non-merge commits during the last 11 day(s) which contain
a total of 45 files changed, 696 insertions(+), 277 deletions(-).

The main changes are:

1) Rename skb's mono_delivery_time to tstamp_type for extensibility
   and add SKB_CLOCK_TAI type support to bpf_skb_set_tstamp(),
   from Abhishek Chauhan.

2) Add netfilter CT zone ID and direction to bpf_ct_opts so that arbitrary
   CT zones can be used from XDP/tc BPF netfilter CT helper functions,
   from Brad Cowie.

3) Several tweaks to the instruction-set.rst IETF doc to address
   the Last Call review comments, from Dave Thaler.

4) Small batch of riscv64 BPF JIT optimizations in order to emit more
   compressed instructions to the JITed image for better icache efficiency,
   from Xiao Wang.

5) Sort bpftool C dump output from BTF, aiming to simplify vmlinux.h
   diffing and forcing more natural type definitions ordering,
   from Mykyta Yatsenko.

6) Use DEV_STATS_INC() macro in BPF redirect helpers to silence
   a syzbot/KCSAN race report for the tx_errors counter,
   from Jiang Yunshui.

7) Un-constify bpf_func_info in bpftool to fix compilation with LLVM 17+
   which started treating const structs as constants and thus breaking
   full BTF program name resolution, from Ivan Babrou.

8) Fix up BPF program numbers in test_sockmap selftest in order to reduce
   some of the test-internal array sizes, from Geliang Tang.

9) Small cleanup in Makefile.btf script to use test-ge check for v1.25-only
   pahole, from Alan Maguire.

10) Fix bpftool's make dependencies for vmlinux.h in order to avoid needless
    rebuilds in some corner cases, from Artem Savkov.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (23 commits)
  bpf, net: Use DEV_STAT_INC()
  bpf, docs: Fix instruction.rst indentation
  bpf, docs: Clarify call local offset
  bpf, docs: Add table captions
  bpf, docs: clarify sign extension of 64-bit use of 32-bit imm
  bpf, docs: Use RFC 2119 language for ISA requirements
  bpf, docs: Move sentence about returning R0 to abi.rst
  bpf: constify member bpf_sysctl_kern:: Table
  riscv, bpf: Try RVC for reg move within BPF_CMPXCHG JIT
  riscv, bpf: Use STACK_ALIGN macro for size rounding up
  riscv, bpf: Optimize zextw insn with Zba extension
  selftests/bpf: Handle forwarding of UDP CLOCK_TAI packets
  net: Add additional bit to support clockid_t timestamp type
  net: Rename mono_delivery_time to tstamp_type for scalabilty
  selftests/bpf: Update tests for new ct zone opts for nf_conntrack kfuncs
  net: netfilter: Make ct zone opts configurable for bpf ct helpers
  selftests/bpf: Fix prog numbers in test_sockmap
  bpf: Remove unused variable "prev_state"
  bpftool: Un-const bpf_func_info to fix it for llvm 17 and newer
  bpf: Fix order of args in call to bpf_map_kvcalloc
  ...
====================

Link: https://lore.kernel.org/r/20240528105924.30905-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-28 07:27:29 -07:00
Jakub Kicinski
2786ae339e bpf-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZlTGFAAKCRDbK58LschI
 g5NXAP0QRn8nBSxJHIswFSOwRiCyhOhR7YL2P0c+RGcRMA+ZSAD9E1cwsYXsPu3L
 ummQ52AMaMfouHg6aW+rFIoupkGSnwc=
 =QctA
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2024-05-27

We've added 15 non-merge commits during the last 7 day(s) which contain
a total of 18 files changed, 583 insertions(+), 55 deletions(-).

The main changes are:

1) Fix broken BPF multi-uprobe PID filtering logic which filtered by thread
   while the promise was to filter by process, from Andrii Nakryiko.

2) Fix the recent influx of syzkaller reports to sockmap which triggered
   a locking rule violation by performing a map_delete, from Jakub Sitnicki.

3) Fixes to netkit driver in particular on skb->pkt_type override upon pass
   verdict, from Daniel Borkmann.

4) Fix an integer overflow in resolve_btfids which can wrongly trigger build
   failures, from Friedrich Vock.

5) Follow-up fixes for ARC JIT reported by static analyzers,
   from Shahab Vahedi.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Cover verifier checks for mutating sockmap/sockhash
  Revert "bpf, sockmap: Prevent lock inversion deadlock in map delete elem"
  bpf: Allow delete from sockmap/sockhash only if update is allowed
  selftests/bpf: Add netkit test for pkt_type
  selftests/bpf: Add netkit tests for mac address
  netkit: Fix pkt_type override upon netkit pass verdict
  netkit: Fix setting mac address in l2 mode
  ARC, bpf: Fix issues reported by the static analyzers
  selftests/bpf: extend multi-uprobe tests with USDTs
  selftests/bpf: extend multi-uprobe tests with child thread case
  libbpf: detect broken PID filtering logic for multi-uprobe
  bpf: remove unnecessary rcu_read_{lock,unlock}() in multi-uprobe attach logic
  bpf: fix multi-uprobe PID filtering logic
  bpf: Fix potential integer overflow in resolve_btfids
  MAINTAINERS: Add myself as reviewer of ARM64 BPF JIT
====================

Link: https://lore.kernel.org/r/20240527203551.29712-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-27 16:26:30 -07:00
Jakub Sitnicki
98e948fb60 bpf: Allow delete from sockmap/sockhash only if update is allowed
We have seen an influx of syzkaller reports where a BPF program attached to
a tracepoint triggers a locking rule violation by performing a map_delete
on a sockmap/sockhash.

We don't intend to support this artificial use scenario. Extend the
existing verifier allowed-program-type check for updating sockmap/sockhash
to also cover deleting from a map.

From now on only BPF programs which were previously allowed to update
sockmap/sockhash can delete from these map types.

Fixes: ff91059932 ("bpf, sockmap: Prevent lock inversion deadlock in map delete elem")
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Reported-by: syzbot+ec941d6e24f633a59172@syzkaller.appspotmail.com
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: syzbot+ec941d6e24f633a59172@syzkaller.appspotmail.com
Acked-by: John Fastabend <john.fastabend@gmail.com>
Closes: https://syzkaller.appspot.com/bug?extid=ec941d6e24f633a59172
Link: https://lore.kernel.org/bpf/20240527-sockmap-verify-deletes-v1-1-944b372f2101@cloudflare.com
2024-05-27 19:33:40 +02:00
Linus Torvalds
61307b7be4 The usual shower of singleton fixes and minor series all over MM,
documented (hopefully adequately) in the respective changelogs.  Notable
 series include:
 
 - Lucas Stach has provided some page-mapping
   cleanup/consolidation/maintainability work in the series "mm/treewide:
   Remove pXd_huge() API".
 
 - In the series "Allow migrate on protnone reference with
   MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's
   MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one
   test.
 
 - In their series "Memory allocation profiling" Kent Overstreet and
   Suren Baghdasaryan have contributed a means of determining (via
   /proc/allocinfo) whereabouts in the kernel memory is being allocated:
   number of calls and amount of memory.
 
 - Matthew Wilcox has provided the series "Various significant MM
   patches" which does a number of rather unrelated things, but in largely
   similar code sites.
 
 - In his series "mm: page_alloc: freelist migratetype hygiene" Johannes
   Weiner has fixed the page allocator's handling of migratetype requests,
   with resulting improvements in compaction efficiency.
 
 - In the series "make the hugetlb migration strategy consistent" Baolin
   Wang has fixed a hugetlb migration issue, which should improve hugetlb
   allocation reliability.
 
 - Liu Shixin has hit an I/O meltdown caused by readahead in a
   memory-tight memcg.  Addressed in the series "Fix I/O high when memory
   almost met memcg limit".
 
 - In the series "mm/filemap: optimize folio adding and splitting" Kairui
   Song has optimized pagecache insertion, yielding ~10% performance
   improvement in one test.
 
 - Baoquan He has cleaned up and consolidated the early zone
   initialization code in the series "mm/mm_init.c: refactor
   free_area_init_core()".
 
 - Baoquan has also redone some MM initializatio code in the series
   "mm/init: minor clean up and improvement".
 
 - MM helper cleanups from Christoph Hellwig in his series "remove
   follow_pfn".
 
 - More cleanups from Matthew Wilcox in the series "Various page->flags
   cleanups".
 
 - Vlastimil Babka has contributed maintainability improvements in the
   series "memcg_kmem hooks refactoring".
 
 - More folio conversions and cleanups in Matthew Wilcox's series
 
 	"Convert huge_zero_page to huge_zero_folio"
 	"khugepaged folio conversions"
 	"Remove page_idle and page_young wrappers"
 	"Use folio APIs in procfs"
 	"Clean up __folio_put()"
 	"Some cleanups for memory-failure"
 	"Remove page_mapping()"
 	"More folio compat code removal"
 
 - David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb
   functions to work on folis".
 
 - Code consolidation and cleanup work related to GUP's handling of
   hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2".
 
 - Rick Edgecombe has developed some fixes to stack guard gaps in the
   series "Cover a guard gap corner case".
 
 - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series
   "mm/ksm: fix ksm exec support for prctl".
 
 - Baolin Wang has implemented NUMA balancing for multi-size THPs.  This
   is a simple first-cut implementation for now.  The series is "support
   multi-size THP numa balancing".
 
 - Cleanups to vma handling helper functions from Matthew Wilcox in the
   series "Unify vma_address and vma_pgoff_address".
 
 - Some selftests maintenance work from Dev Jain in the series
   "selftests/mm: mremap_test: Optimizations and style fixes".
 
 - Improvements to the swapping of multi-size THPs from Ryan Roberts in
   the series "Swap-out mTHP without splitting".
 
 - Kefeng Wang has significantly optimized the handling of arm64's
   permission page faults in the series
 
 	"arch/mm/fault: accelerate pagefault when badaccess"
 	"mm: remove arch's private VM_FAULT_BADMAP/BADACCESS"
 
 - GUP cleanups from David Hildenbrand in "mm/gup: consistently call it
   GUP-fast".
 
 - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to
   use struct vm_fault".
 
 - selftests build fixes from John Hubbard in the series "Fix
   selftests/mm build without requiring "make headers"".
 
 - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the
   series "Improved Memory Tier Creation for CPUless NUMA Nodes".  Fixes
   the initialization code so that migration between different memory types
   works as intended.
 
 - David Hildenbrand has improved follow_pte() and fixed an errant driver
   in the series "mm: follow_pte() improvements and acrn follow_pte()
   fixes".
 
 - David also did some cleanup work on large folio mapcounts in his
   series "mm: mapcount for large folios + page_mapcount() cleanups".
 
 - Folio conversions in KSM in Alex Shi's series "transfer page to folio
   in KSM".
 
 - Barry Song has added some sysfs stats for monitoring multi-size THP's
   in the series "mm: add per-order mTHP alloc and swpout counters".
 
 - Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled
   and limit checking cleanups".
 
 - Matthew Wilcox has been looking at buffer_head code and found the
   documentation to be lacking.  The series is "Improve buffer head
   documentation".
 
 - Multi-size THPs get more work, this time from Lance Yang.  His series
   "mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes
   the freeing of these things.
 
 - Kemeng Shi has added more userspace-visible writeback instrumentation
   in the series "Improve visibility of writeback".
 
 - Kemeng Shi then sent some maintenance work on top in the series "Fix
   and cleanups to page-writeback".
 
 - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the
   series "Improve anon_vma scalability for anon VMAs".  Intel's test bot
   reported an improbable 3x improvement in one test.
 
 - SeongJae Park adds some DAMON feature work in the series
 
 	"mm/damon: add a DAMOS filter type for page granularity access recheck"
 	"selftests/damon: add DAMOS quota goal test"
 
 - Also some maintenance work in the series
 
 	"mm/damon/paddr: simplify page level access re-check for pageout"
 	"mm/damon: misc fixes and improvements"
 
 - David Hildenbrand has disabled some known-to-fail selftests ni the
   series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL".
 
 - memcg metadata storage optimizations from Shakeel Butt in "memcg:
   reduce memory consumption by memcg stats".
 
 - DAX fixes and maintenance work from Vishal Verma in the series
   "dax/bus.c: Fixups for dax-bus locking".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZkgQYwAKCRDdBJ7gKXxA
 jrdKAP9WVJdpEcXxpoub/vVE0UWGtffr8foifi9bCwrQrGh5mgEAx7Yf0+d/oBZB
 nvA4E0DcPrUAFy144FNM0NTCb7u9vAw=
 =V3R/
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull mm updates from Andrew Morton:
 "The usual shower of singleton fixes and minor series all over MM,
  documented (hopefully adequately) in the respective changelogs.
  Notable series include:

   - Lucas Stach has provided some page-mapping cleanup/consolidation/
     maintainability work in the series "mm/treewide: Remove pXd_huge()
     API".

   - In the series "Allow migrate on protnone reference with
     MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's
     MPOL_PREFERRED_MANY mode, yielding almost doubled performance in
     one test.

   - In their series "Memory allocation profiling" Kent Overstreet and
     Suren Baghdasaryan have contributed a means of determining (via
     /proc/allocinfo) whereabouts in the kernel memory is being
     allocated: number of calls and amount of memory.

   - Matthew Wilcox has provided the series "Various significant MM
     patches" which does a number of rather unrelated things, but in
     largely similar code sites.

   - In his series "mm: page_alloc: freelist migratetype hygiene"
     Johannes Weiner has fixed the page allocator's handling of
     migratetype requests, with resulting improvements in compaction
     efficiency.

   - In the series "make the hugetlb migration strategy consistent"
     Baolin Wang has fixed a hugetlb migration issue, which should
     improve hugetlb allocation reliability.

   - Liu Shixin has hit an I/O meltdown caused by readahead in a
     memory-tight memcg. Addressed in the series "Fix I/O high when
     memory almost met memcg limit".

   - In the series "mm/filemap: optimize folio adding and splitting"
     Kairui Song has optimized pagecache insertion, yielding ~10%
     performance improvement in one test.

   - Baoquan He has cleaned up and consolidated the early zone
     initialization code in the series "mm/mm_init.c: refactor
     free_area_init_core()".

   - Baoquan has also redone some MM initializatio code in the series
     "mm/init: minor clean up and improvement".

   - MM helper cleanups from Christoph Hellwig in his series "remove
     follow_pfn".

   - More cleanups from Matthew Wilcox in the series "Various
     page->flags cleanups".

   - Vlastimil Babka has contributed maintainability improvements in the
     series "memcg_kmem hooks refactoring".

   - More folio conversions and cleanups in Matthew Wilcox's series:
	"Convert huge_zero_page to huge_zero_folio"
	"khugepaged folio conversions"
	"Remove page_idle and page_young wrappers"
	"Use folio APIs in procfs"
	"Clean up __folio_put()"
	"Some cleanups for memory-failure"
	"Remove page_mapping()"
	"More folio compat code removal"

   - David Hildenbrand chipped in with "fs/proc/task_mmu: convert
     hugetlb functions to work on folis".

   - Code consolidation and cleanup work related to GUP's handling of
     hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2".

   - Rick Edgecombe has developed some fixes to stack guard gaps in the
     series "Cover a guard gap corner case".

   - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the
     series "mm/ksm: fix ksm exec support for prctl".

   - Baolin Wang has implemented NUMA balancing for multi-size THPs.
     This is a simple first-cut implementation for now. The series is
     "support multi-size THP numa balancing".

   - Cleanups to vma handling helper functions from Matthew Wilcox in
     the series "Unify vma_address and vma_pgoff_address".

   - Some selftests maintenance work from Dev Jain in the series
     "selftests/mm: mremap_test: Optimizations and style fixes".

   - Improvements to the swapping of multi-size THPs from Ryan Roberts
     in the series "Swap-out mTHP without splitting".

   - Kefeng Wang has significantly optimized the handling of arm64's
     permission page faults in the series
	"arch/mm/fault: accelerate pagefault when badaccess"
	"mm: remove arch's private VM_FAULT_BADMAP/BADACCESS"

   - GUP cleanups from David Hildenbrand in "mm/gup: consistently call
     it GUP-fast".

   - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault
     path to use struct vm_fault".

   - selftests build fixes from John Hubbard in the series "Fix
     selftests/mm build without requiring "make headers"".

   - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the
     series "Improved Memory Tier Creation for CPUless NUMA Nodes".
     Fixes the initialization code so that migration between different
     memory types works as intended.

   - David Hildenbrand has improved follow_pte() and fixed an errant
     driver in the series "mm: follow_pte() improvements and acrn
     follow_pte() fixes".

   - David also did some cleanup work on large folio mapcounts in his
     series "mm: mapcount for large folios + page_mapcount() cleanups".

   - Folio conversions in KSM in Alex Shi's series "transfer page to
     folio in KSM".

   - Barry Song has added some sysfs stats for monitoring multi-size
     THP's in the series "mm: add per-order mTHP alloc and swpout
     counters".

   - Some zswap cleanups from Yosry Ahmed in the series "zswap
     same-filled and limit checking cleanups".

   - Matthew Wilcox has been looking at buffer_head code and found the
     documentation to be lacking. The series is "Improve buffer head
     documentation".

   - Multi-size THPs get more work, this time from Lance Yang. His
     series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free"
     optimizes the freeing of these things.

   - Kemeng Shi has added more userspace-visible writeback
     instrumentation in the series "Improve visibility of writeback".

   - Kemeng Shi then sent some maintenance work on top in the series
     "Fix and cleanups to page-writeback".

   - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in
     the series "Improve anon_vma scalability for anon VMAs". Intel's
     test bot reported an improbable 3x improvement in one test.

   - SeongJae Park adds some DAMON feature work in the series
	"mm/damon: add a DAMOS filter type for page granularity access recheck"
	"selftests/damon: add DAMOS quota goal test"

   - Also some maintenance work in the series
	"mm/damon/paddr: simplify page level access re-check for pageout"
	"mm/damon: misc fixes and improvements"

   - David Hildenbrand has disabled some known-to-fail selftests ni the
     series "selftests: mm: cow: flag vmsplice() hugetlb tests as
     XFAIL".

   - memcg metadata storage optimizations from Shakeel Butt in "memcg:
     reduce memory consumption by memcg stats".

   - DAX fixes and maintenance work from Vishal Verma in the series
     "dax/bus.c: Fixups for dax-bus locking""

* tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits)
  memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order
  selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime
  mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp
  mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault
  selftests: cgroup: add tests to verify the zswap writeback path
  mm: memcg: make alloc_mem_cgroup_per_node_info() return bool
  mm/damon/core: fix return value from damos_wmark_metric_value
  mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED
  selftests: cgroup: remove redundant enabling of memory controller
  Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree
  Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT
  Docs/mm/damon/design: use a list for supported filters
  Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command
  Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file
  selftests/damon: classify tests for functionalities and regressions
  selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None'
  selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts
  selftests/damon/_damon_sysfs: check errors from nr_schemes file reads
  mm/damon/core: initialize ->esz_bp from damos_quota_init_priv()
  selftests/damon: add a test for DAMOS quota goal
  ...
2024-05-19 09:21:03 -07:00
Mohammad Shehar Yaar Tausif
6f130e4d4a bpf: Fix order of args in call to bpf_map_kvcalloc
The original function call passed size of smap->bucket before the number of
buckets which raises the error 'calloc-transposed-args' on compilation.

Signed-off-by: Mohammad Shehar Yaar Tausif <sheharyaar48@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240516072411.42016-1-sheharyaar48@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-18 10:46:16 -07:00
Linus Torvalds
f08a1e912d Including fix from Andrii for the issue mentioned in our net-next PR,
the rest is unremarkable.
 
 Current release - regressions:
 
  - virtio_net: fix missed error path rtnl_unlock after control queue
    locking rework
 
 Current release - new code bugs:
 
  - bpf: fix KASAN slab-out-of-bounds in percpu_array_map_gen_lookup,
    caused by missing nested map handling
 
  - drv: dsa: correct initialization order for KSZ88x3 ports
 
 Previous releases - regressions:
 
  - af_packet: do not call packet_read_pending() from tpacket_destruct_skb()
    fix performance regression
 
  - ipv6: fix route deleting failure when metric equals 0, don't assume
    0 means not set / default in this case
 
 Previous releases - always broken:
 
  - bridge: couple of syzbot-driven fixes
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmZHtJQACgkQMUZtbf5S
 Irsfyw//ZhCFzvXKLENNalHHMXwq7DsuXe6KWlljEOzLHH0/9QqqNC9ggYKRI5OE
 unB//YC3sqtAUczQnED+UOh553Pu4Kvq9334LTX5m4HJQTYLLq1aGM/UZplsBTHx
 3MsXUApYFth8pqCZvIcKOZcOddeViBfzEQ9jEAsgIyaqFy3XaiH4Zf6pJAAMyUbE
 19CRiK/1TYNrX01XPOeV/9vYGj9rzepo6S5zpHKsWsFZArCcRPBsea/KWYYfLjW7
 ExA2Cb+eUnPkNL4bTeH6dwgQGVL8Jo/OsKmsa/tdQffnj1pshdePXtP3TBEynMJF
 jSSwwUMq56yE+uok4karE3wIhciUEYvTwfgt5FErYVqfqDiX1+7AZGtdZVDX/mgH
 F0etKHDhX9F1zxHVMFwOMA4rLN6cvfpe7Pg+dt4B9E0o18SyNekOM1Ngdu/1ALtd
 QV41JFHweHInDRrMLdj4aWW4EYPR5SUuvg66Pec4T7x5hAAapzIJySS+RIydC+ND
 guPztYxO5cn5Q7kug1FyUBSXUXZxuCNRACb6zD4/4wbVRZhz7l3OTcd13QADCiwv
 Tr61r2bS1Bp/HZ3iIHBY85JnKMvpdwNXN2SPsYQQwVrv9FLj9iskH9kjwqVDG4ja
 W3ivZZM+CcZbnB81JynK7Ge54PT+SiPy3Nw4RIVxFl1QlzXC21E=
 =7eys
 -----END PGP SIGNATURE-----

Merge tag 'net-6.10-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Current release - regressions:

   - virtio_net: fix missed error path rtnl_unlock after control queue
     locking rework

  Current release - new code bugs:

   - bpf: fix KASAN slab-out-of-bounds in percpu_array_map_gen_lookup,
     caused by missing nested map handling

   - drv: dsa: correct initialization order for KSZ88x3 ports

  Previous releases - regressions:

   - af_packet: do not call packet_read_pending() from
     tpacket_destruct_skb() fix performance regression

   - ipv6: fix route deleting failure when metric equals 0, don't assume
     0 means not set / default in this case

  Previous releases - always broken:

   - bridge: couple of syzbot-driven fixes"

* tag 'net-6.10-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (30 commits)
  selftests: net: local_termination: annotate the expected failures
  net: dsa: microchip: Correct initialization order for KSZ88x3 ports
  MAINTAINERS: net: Update reviewers for TI's Ethernet drivers
  dt-bindings: net: ti: Update maintainers list
  l2tp: fix ICMP error handling for UDP-encap sockets
  net: txgbe: fix to control VLAN strip
  net: wangxun: match VLAN CTAG and STAG features
  net: wangxun: fix to change Rx features
  af_packet: do not call packet_read_pending() from tpacket_destruct_skb()
  virtio_net: Fix missed rtnl_unlock
  netrom: fix possible dead-lock in nr_rt_ioctl()
  idpf: don't skip over ethtool tcp-data-split setting
  dt-bindings: net: qcom: ethernet: Allow dma-coherent
  bonding: fix oops during rmmod
  net/ipv6: Fix route deleting failure when metric equals 0
  selftests/net: reduce xfrm_policy test time
  selftests/bpf: Adjust btf_dump test to reflect recent change in file_operations
  selftests/bpf: Adjust test_access_variable_array after a kernel function name change
  selftests/net/lib: no need to record ns name if it already exist
  net: qrtr: ns: Fix module refcnt
  ...
2024-05-17 18:57:14 -07:00
Linus Torvalds
91b6163be4 sysctl changes for v6.10-rc1
Summary
 * Removed sentinel elements from ctl_table structs in kernel/*
 
   Removing sentinels in ctl_table arrays reduces the build time size and
   runtime memory consumed by ~64 bytes per array. Removals for net/, io_uring/,
   mm/, ipc/ and security/ are set to go into mainline through their respective
   subsystems making the next release the most likely place where the final
   series that removes the check for proc_name == NULL will land. This PR adds
   to removals already in arch/, drivers/ and fs/.
 
 * Adjusted ctl_table definitions and references to allow constification
 
   Adjustments:
     - Removing unused ctl_table function arguments
     - Moving non-const elements from ctl_table to ctl_table_header
     - Making ctl_table pointers const in ctl_table_root structure
 
   Making the static ctl_table structs const will increase safety by keeping the
   pointers to proc_handler functions in .rodata. Though no ctl_tables where
   made const in this PR, the ground work for making that possible has started
   with these changes sent by Thomas Weißschuh.
 
 Testing
 * These changes went into linux-next after v6.9-rc4; giving it a good month of
   testing.
 -----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEErkcJVyXmMSXOyyeQupfNUreWQU8FAmZFvBMACgkQupfNUreW
 QU/eGAv9EWeiXKxr3EVSMAsb9MWbJq7C99I/pd5hMf+qH4PgJpKDH7w/sb2e8h8+
 unGiW83ikgrtph7OS4/xM3Y9r3Nvzd6C/OztqgMnNKeRFdMgP7wu9HaSNs05ordb
 CqJdhvL93quc5HxrGTS9sdLK/wLJWOHwuWMXhX4qS44JNxTdPV2q10Rb7DZyHZ6O
 C9qp61L2Q2CrnOBKIx8MoeCh20ynJQAo3b0pTN63ZYF4D0vqCcnYNNTPkge4ID8/
 ULJoP5hAQY0vJ4g4fC4Gmooa5GECpm8MfZUf3SdgPyauqM/sm3dVdsLXAWD4Phcp
 TsG2a/5KMYwnLHrUGwDW7bFfEemRU88h0Iam56+SKMl1kMlEpWaLL9ApQXoHFayG
 e10izS+i/nlQiqYIHtuczCoTimT4/LGnonCLcdA//C3XzBT5MnOd7xsjuaQSpFWl
 /CV9SZa4ABwzX7u2jty8ik90iihLCFQyKj1d9m1mDVbgb6r3iUOxVuHBgMtY7MF7
 eyaEmV7l
 =/rQW
 -----END PGP SIGNATURE-----

Merge tag 'sysctl-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl

Pull sysctl updates from Joel Granados:

 - Remove sentinel elements from ctl_table structs in kernel/*

   Removing sentinels in ctl_table arrays reduces the build time size
   and runtime memory consumed by ~64 bytes per array. Removals for
   net/, io_uring/, mm/, ipc/ and security/ are set to go into mainline
   through their respective subsystems making the next release the most
   likely place where the final series that removes the check for
   proc_name == NULL will land.

   This adds to removals already in arch/, drivers/ and fs/.

 - Adjust ctl_table definitions and references to allow constification
     - Remove unused ctl_table function arguments
     - Move non-const elements from ctl_table to ctl_table_header
     - Make ctl_table pointers const in ctl_table_root structure

   Making the static ctl_table structs const will increase safety by
   keeping the pointers to proc_handler functions in .rodata. Though no
   ctl_tables where made const in this PR, the ground work for making
   that possible has started with these changes sent by Thomas
   Weißschuh.

* tag 'sysctl-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
  sysctl: drop now unnecessary out-of-bounds check
  sysctl: move sysctl type to ctl_table_header
  sysctl: drop sysctl_is_perm_empty_ctl_table
  sysctl: treewide: constify argument ctl_table_root::permissions(table)
  sysctl: treewide: drop unused argument ctl_table_root::set_ownership(table)
  bpf: Remove the now superfluous sentinel elements from ctl_table array
  delayacct: Remove the now superfluous sentinel elements from ctl_table array
  kprobes: Remove the now superfluous sentinel elements from ctl_table array
  printk: Remove the now superfluous sentinel elements from ctl_table array
  scheduler: Remove the now superfluous sentinel elements from ctl_table array
  seccomp: Remove the now superfluous sentinel elements from ctl_table array
  timekeeping: Remove the now superfluous sentinel elements from ctl_table array
  ftrace: Remove the now superfluous sentinel elements from ctl_table array
  umh: Remove the now superfluous sentinel elements from ctl_table array
  kernel misc: Remove the now superfluous sentinel elements from ctl_table array
2024-05-17 17:31:24 -07:00
Linus Torvalds
a49468240e Modules changes for v6.10-rc1
Finally something fun. Mike Rapoport does some cleanup to allow us to
 take out module_alloc() out of modules into a new paint shedded execmem_alloc()
 and execmem_free() so to make emphasis these helpers are actually used outside
 of modules. It starts with a no-functional changes API rename / placeholders
 to then allow architectures to define their requirements into a new shiny
 struct execmem_info with ranges, and requirements for those ranges. Archs
 now can intitialize this execmem_info as the last part of mm_core_init() if
 they have to diverge from the norm. Each range is a known type clearly
 articulated and spelled out in enum execmem_type.
 
 Although a lot of this is major cleanup and prep work for future enhancements an
 immediate clear gain is we get to enable KPROBES without MODULES now. That is
 ultimately what motiviated to pick this work up again, now with smaller goal as
 concrete stepping stone.
 
 This has been sitting on linux-next for a little less than a month, a few issues
 were found already and fixed, in particular an odd mips boot issue. Arch folks
 reviewed the code too. This is ready for wider exposure and testing.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmZDHfMSHG1jZ3JvZkBr
 ZXJuZWwub3JnAAoJEM4jHQowkoinfIwP/iFsr89v9BjWdRTqzufuHwjOxvFymWxU
 BbEpOppRny3CckDU9ag9hLIlUaSL1Bg56Zb+znzp5stKOoiQYMDBvjSYdfybPxW2
 mRS6SClMF1ubWbzdysdp5Ld9u8T0MQPCLX+P2pKhZRGi0wjkBf5WEkTje+muJKI3
 4vYkXS7bNhuTwRQ+EGfze4+AeleGdQJKDWFY00TW9mZTTBADjfHyYU5o0m9ijf5l
 3V/weUznODvjVJStbIF7wEQ845Ae02LN1zXfsloIOuBMhcMju+x8IjPgPbD0KhX2
 yA48q7mVWkirYp0L5GSQchtqV1GBiP0NK1xXWEpyx6EqQZ4RJCsQhlhjijoExYBR
 ylP4bqiGVuE3IN075X0OzGCnmOStuzwssfDmug0sMAZH/MvmOQ21WzZdet2nLMas
 wwJArHqZsBI9BnBlvH9ZM4Y9f1zC7iR1wULaNGwXLPx34X9PIch8Yk+RElP1kMFQ
 +YrjOuWPjl63pmSkrkk+Pe2eesMPcPB41M6Q2iCjDlp0iBp63LIx2XISUbTf0ljM
 EsI4ZQseYpx+BmC7AuQfmXvEOjuXII9z072/artVWcB2u/87ixIprnqZVhcs/spy
 73DnXB4ufor2PCCC5Xrb/6kT6G+PzF3VwTbHQ1D+fYZ5n2qdyG+LKxgXbtxsRVTp
 oUg+Z/AJaCMt
 =Nsg4
 -----END PGP SIGNATURE-----

Merge tag 'modules-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux

Pull modules updates from Luis Chamberlain:
 "Finally something fun. Mike Rapoport does some cleanup to allow us to
  take out module_alloc() out of modules into a new paint shedded
  execmem_alloc() and execmem_free() so to make emphasis these helpers
  are actually used outside of modules.

  It starts with a non-functional changes API rename / placeholders to
  then allow architectures to define their requirements into a new shiny
  struct execmem_info with ranges, and requirements for those ranges.

  Archs now can intitialize this execmem_info as the last part of
  mm_core_init() if they have to diverge from the norm. Each range is a
  known type clearly articulated and spelled out in enum execmem_type.

  Although a lot of this is major cleanup and prep work for future
  enhancements an immediate clear gain is we get to enable KPROBES
  without MODULES now. That is ultimately what motiviated to pick this
  work up again, now with smaller goal as concrete stepping stone"

* tag 'modules-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
  bpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES of
  kprobes: remove dependency on CONFIG_MODULES
  powerpc: use CONFIG_EXECMEM instead of CONFIG_MODULES where appropriate
  x86/ftrace: enable dynamic ftrace without CONFIG_MODULES
  arch: make execmem setup available regardless of CONFIG_MODULES
  powerpc: extend execmem_params for kprobes allocations
  arm64: extend execmem_info for generated code allocations
  riscv: extend execmem_params for generated code allocations
  mm/execmem, arch: convert remaining overrides of module_alloc to execmem
  mm/execmem, arch: convert simple overrides of module_alloc to execmem
  mm: introduce execmem_alloc() and execmem_free()
  module: make module_memory_{alloc,free} more self-contained
  sparc: simplify module_alloc()
  nios2: define virtual address space for modules
  mips: module: rename MODULE_START to MODULES_VADDR
  arm64: module: remove unneeded call to kasan_alloc_module_shadow()
  kallsyms: replace deprecated strncpy with strscpy
  module: allow UNUSED_KSYMS_WHITELIST to be relative against objtree.
2024-05-15 14:05:08 -07:00
Andrii Nakryiko
9ee9822908 bpf: save extended inner map info for percpu array maps as well
ARRAY_OF_MAPS and HASH_OF_MAPS map types have special logic to save
a few extra fields required for correct operations of ARRAY maps, when
they are used as inner maps. PERCPU_ARRAY maps have similar
requirements as they now support generating inline element lookup
logic. So make sure that both classes of maps are handled correctly.

Reported-by: Jakub Kicinski <kuba@kernel.org>
Fixes: db69718b8e ("bpf: inline bpf_map_lookup_elem() for PERCPU_ARRAY maps")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20240515062440.846086-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-15 09:34:54 -07:00
Linus Torvalds
1b294a1f35 Networking changes for 6.10.
Core & protocols
 ----------------
 
  - Complete rework of garbage collection of AF_UNIX sockets.
    AF_UNIX is prone to forming reference count cycles due to fd passing
    functionality. New method based on Tarjan's Strongly Connected Components
    algorithm should be both faster and remove a lot of workarounds
    we accumulated over the years.
 
  - Add TCP fraglist GRO support, allowing chaining multiple TCP packets
    and forwarding them together. Useful for small switches / routers which
    lack basic checksum offload in some scenarios (e.g. PPPoE).
 
  - Support using SMP threads for handling packet backlog i.e. packet
    processing from software interfaces and old drivers which don't
    use NAPI. This helps move the processing out of the softirq jumble.
 
  - Continue work of converting from rtnl lock to RCU protection.
    Don't require rtnl lock when reading: IPv6 routing FIB, IPv6 address
    labels, netdev threaded NAPI sysfs files, bonding driver's sysfs files,
    MPLS devconf, IPv4 FIB rules, netns IDs, tcp metrics, TC Qdiscs,
    neighbor entries, ARP entries via ioctl(SIOCGARP), a lot of the link
    information available via rtnetlink.
 
  - Small optimizations from Eric to UDP wake up handling, memory accounting,
    RPS/RFS implementation, TCP packet sizing etc.
 
  - Allow direct page recycling in the bulk API used by XDP, for +2% PPS.
 
  - Support peek with an offset on TCP sockets.
 
  - Add MPTCP APIs for querying last time packets were received/sent/acked,
    and whether MPTCP "upgrade" succeeded on a TCP socket.
 
  - Add intra-node communication shortcut to improve SMC performance.
 
  - Add IPv6 (and IPv{4,6}-over-IPv{4,6}) support to the GTP protocol driver.
 
  - Add HSR-SAN (RedBOX) mode of operation to the HSR protocol driver.
 
  - Add reset reasons for tracing what caused a TCP reset to be sent.
 
  - Introduce direction attribute for xfrm (IPSec) states.
    State can be used either for input or output packet processing.
 
 Things we sprinkled into general kernel code
 --------------------------------------------
 
  - Add bitmap_{read,write}(), bitmap_size(), expose BYTES_TO_BITS().
    This required touch-ups and renaming of a few existing users.
 
  - Add Endian-dependent __counted_by_{le,be} annotations.
 
  - Make building selftests "quieter" by printing summaries like
    "CC object.o" rather than full commands with all the arguments.
 
 Netfilter
 ---------
 
  - Use GFP_KERNEL to clone elements, to deal better with OOM situations
    and avoid failures in the .commit step.
 
 BPF
 ---
 
  - Add eBPF JIT for ARCv2 CPUs.
 
  - Support attaching kprobe BPF programs through kprobe_multi link in
    a session mode, meaning, a BPF program is attached to both function entry
    and return, the entry program can decide if the return program gets
    executed and the entry program can share u64 cookie value with return
    program. "Session mode" is a common use-case for tetragon and bpftrace.
 
  - Add the ability to specify and retrieve BPF cookie for raw tracepoint
    programs in order to ease migration from classic to raw tracepoints.
 
  - Add an internal-only BPF per-CPU instruction for resolving per-CPU
    memory addresses and implement support in x86, ARM64 and RISC-V JITs.
    This allows inlining functions which need to access per-CPU state.
 
  - Optimize x86 BPF JIT's emit_mov_imm64, and add support for various
    atomics in bpf_arena which can be JITed as a single x86 instruction.
    Support BPF arena on ARM64.
 
  - Add a new bpf_wq API for deferring events and refactor process-context
    bpf_timer code to keep common code where possible.
 
  - Harden the BPF verifier's and/or/xor value tracking.
 
  - Introduce crypto kfuncs to let BPF programs call kernel crypto APIs.
 
  - Support bpf_tail_call_static() helper for BPF programs with GCC 13.
 
  - Add bpf_preempt_{disable,enable}() kfuncs in order to allow a BPF
    program to have code sections where preemption is disabled.
 
 Driver API
 ----------
 
  - Skip software TC processing completely if all installed rules are
    marked as HW-only, instead of checking the HW-only flag rule by rule.
 
  - Add support for configuring PoE (Power over Ethernet), similar to
    the already existing support for PoDL (Power over Data Line) config.
 
  - Initial bits of a queue control API, for now allowing a single queue
    to be reset without disturbing packet flow to other queues.
 
  - Common (ethtool) statistics for hardware timestamping.
 
 Tests and tooling
 -----------------
 
  - Remove the need to create a config file to run the net forwarding tests
    so that a naive "make run_tests" can exercise them.
 
  - Define a method of writing tests which require an external endpoint
    to communicate with (to send/receive data towards the test machine).
    Add a few such tests.
 
  - Create a shared code library for writing Python tests. Expose the YAML
    Netlink library from tools/ to the tests for easy Netlink access.
 
  - Move netfilter tests under net/, extend them, separate performance tests
    from correctness tests, and iron out issues found by running them
    "on every commit".
 
  - Refactor BPF selftests to use common network helpers.
 
  - Further work filling in YAML definitions of Netlink messages for:
    nftables, team driver, bonding interfaces, vlan interfaces, VF info,
    TC u32 mark, TC police action.
 
  - Teach Python YAML Netlink to decode attribute policies.
 
  - Extend the definition of the "indexed array" construct in the specs
    to cover arrays of scalars rather than just nests.
 
  - Add hyperlinks between definitions in generated Netlink docs.
 
 Drivers
 -------
 
  - Make sure unsupported flower control flags are rejected by drivers,
    and make more drivers report errors directly to the application rather
    than dmesg (large number of driver changes from Asbjørn Sloth Tønnesen).
 
  - Ethernet high-speed NICs:
    - Broadcom (bnxt):
      - support multiple RSS contexts and steering traffic to them
      - support XDP metadata
      - make page pool allocations more NUMA aware
    - Intel (100G, ice, idpf):
      - extract datapath code common among Intel drivers into a library
      - use fewer resources in switchdev by sharing queues with the PF
      - add PFCP filter support
      - add Ethernet filter support
      - use a spinlock instead of HW lock in PTP clock ops
      - support 5 layer Tx scheduler topology
    - nVidia/Mellanox:
      - 800G link modes and 100G SerDes speeds
      - per-queue IRQ coalescing configuration
    - Marvell Octeon:
      - support offloading TC packet mark action
 
  - Ethernet NICs consumer, embedded and virtual:
    - stop lying about skb->truesize in USB Ethernet drivers, it messes up
      TCP memory calculations
    - Google cloud vNIC:
      - support changing ring size via ethtool
      - support ring reset using the queue control API
    - VirtIO net:
      - expose flow hash from RSS to XDP
      - per-queue statistics
      - add selftests
    - Synopsys (stmmac):
      - support controllers which require an RX clock signal from the MII
        bus to perform their hardware initialization
    - TI:
      - icssg_prueth: support ICSSG-based Ethernet on AM65x SR1.0 devices
      - icssg_prueth: add SW TX / RX Coalescing based on hrtimers
      - cpsw: minimal XDP support
    - Renesas (ravb):
      - support describing the MDIO bus
    - Realtek (r8169):
      - add support for RTL8168M
    - Microchip Sparx5:
      - matchall and flower actions mirred and redirect
 
  - Ethernet switches:
    - nVidia/Mellanox:
      - improve events processing performance
    - Marvell:
      - add support for MV88E6250 family internal PHYs
    - Microchip:
      - add DCB and DSCP mapping support for KSZ switches
      - vsc73xx: convert to PHYLINK
    - Realtek:
      - rtl8226b/rtl8221b: add C45 instances and SerDes switching
 
  - Many driver changes related to PHYLIB and PHYLINK deprecated API cleanup.
 
  - Ethernet PHYs:
    - Add a new driver for Airoha EN8811H 2.5 Gigabit PHY.
    - micrel: lan8814: add support for PPS out and external timestamp trigger
 
  - WiFi:
    - Disable Wireless Extensions (WEXT) in all Wi-Fi 7 devices drivers.
      Modern devices can only be configured using nl80211.
    - mac80211/cfg80211
      - handle color change per link for WiFi 7 Multi-Link Operation
    - Intel (iwlwifi):
      - don't support puncturing in 5 GHz
      - support monitor mode on passive channels
      - BZ-W device support
      - P2P with HE/EHT support
      - re-add support for firmware API 90
      - provide channel survey information for Automatic Channel Selection
    - MediaTek (mt76):
      - mt7921 LED control
      - mt7925 EHT radiotap support
      - mt7920e PCI support
    - Qualcomm (ath11k):
      - P2P support for QCA6390, WCN6855 and QCA2066
      - support hibernation
      - ieee80211-freq-limit Device Tree property support
    - Qualcomm (ath12k):
      - refactoring in preparation of multi-link support
      - suspend and hibernation support
      - ACPI support
      - debugfs support, including dfs_simulate_radar support
    - RealTek:
      - rtw88: RTL8723CS SDIO device support
      - rtw89: RTL8922AE Wi-Fi 7 PCI device support
      - rtw89: complete features of new WiFi 7 chip 8922AE including
        BT-coexistence and Wake-on-WLAN
      - rtw89: use BIOS ACPI settings to set TX power and channels
      - rtl8xxxu: enable Management Frame Protection (MFP) support
 
  - Bluetooth:
    - support for Intel BlazarI and Filmore Peak2 (BE201)
    - support for MediaTek MT7921S SDIO
    - initial support for Intel PCIe BT driver
    - remove HCI_AMP support
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmZD6sQACgkQMUZtbf5S
 IrtLYw/+I73ePGIye37o2jpbodcLAUZVfF3r6uYUzK8hokEcKD0QVJa9w7PizLZ3
 UO45ClOXFLJCkfP4reFenLfxGCel2AJI+F7VFl2xaO2XgrcH/lnVrHqKZEAEXjls
 KoYMnShIolv7h2MKP6hHtyTi2j1wvQUKsZC71o9/fuW+4fUT8gECx1YtYcL73wrw
 gEMdlUgBYC3jiiCUHJIFX6iPJ2t/TC+q1eIIF2K/Osrk2kIqQhzoozcL4vpuAZQT
 99ljx/qRelXa8oppDb7nM5eulg7WY8ZqxEfFZphTMC5nLEGzClxuOTTl2kDYI/D/
 UZmTWZDY+F5F0xvNk2gH84qVJXBOVDoobpT7hVA/tDuybobc/kvGDzRayEVqVzKj
 Q0tPlJs+xBZpkK5TVnxaFLJVOM+p1Xosxy3kNVXmuYNBvT/R89UbJiCrUKqKZF+L
 z/1mOYUv8UklHqYAeuJSptHvqJjTGa/fsEYP7dAUBbc1N2eVB8mzZ4mgU5rYXbtC
 E6UXXiWnoSRm8bmco9QmcWWoXt5UGEizHSJLz6t1R5Df/YmXhWlytll5aCwY1ksf
 FNoL7S4u7AZThL1Nwi7yUs4CAjhk/N4aOsk+41S0sALCx30BJuI6UdesAxJ0lu+Z
 fwCQYbs27y4p7mBLbkYwcQNxAxGm7PSK4yeyRIy2njiyV4qnLf8=
 =EsC2
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core & protocols:

   - Complete rework of garbage collection of AF_UNIX sockets.

     AF_UNIX is prone to forming reference count cycles due to fd
     passing functionality. New method based on Tarjan's Strongly
     Connected Components algorithm should be both faster and remove a
     lot of workarounds we accumulated over the years.

   - Add TCP fraglist GRO support, allowing chaining multiple TCP
     packets and forwarding them together. Useful for small switches /
     routers which lack basic checksum offload in some scenarios (e.g.
     PPPoE).

   - Support using SMP threads for handling packet backlog i.e. packet
     processing from software interfaces and old drivers which don't use
     NAPI. This helps move the processing out of the softirq jumble.

   - Continue work of converting from rtnl lock to RCU protection.

     Don't require rtnl lock when reading: IPv6 routing FIB, IPv6
     address labels, netdev threaded NAPI sysfs files, bonding driver's
     sysfs files, MPLS devconf, IPv4 FIB rules, netns IDs, tcp metrics,
     TC Qdiscs, neighbor entries, ARP entries via ioctl(SIOCGARP), a lot
     of the link information available via rtnetlink.

   - Small optimizations from Eric to UDP wake up handling, memory
     accounting, RPS/RFS implementation, TCP packet sizing etc.

   - Allow direct page recycling in the bulk API used by XDP, for +2%
     PPS.

   - Support peek with an offset on TCP sockets.

   - Add MPTCP APIs for querying last time packets were received/sent/acked
     and whether MPTCP "upgrade" succeeded on a TCP socket.

   - Add intra-node communication shortcut to improve SMC performance.

   - Add IPv6 (and IPv{4,6}-over-IPv{4,6}) support to the GTP protocol
     driver.

   - Add HSR-SAN (RedBOX) mode of operation to the HSR protocol driver.

   - Add reset reasons for tracing what caused a TCP reset to be sent.

   - Introduce direction attribute for xfrm (IPSec) states. State can be
     used either for input or output packet processing.

  Things we sprinkled into general kernel code:

   - Add bitmap_{read,write}(), bitmap_size(), expose BYTES_TO_BITS().

     This required touch-ups and renaming of a few existing users.

   - Add Endian-dependent __counted_by_{le,be} annotations.

   - Make building selftests "quieter" by printing summaries like
     "CC object.o" rather than full commands with all the arguments.

  Netfilter:

   - Use GFP_KERNEL to clone elements, to deal better with OOM
     situations and avoid failures in the .commit step.

  BPF:

   - Add eBPF JIT for ARCv2 CPUs.

   - Support attaching kprobe BPF programs through kprobe_multi link in
     a session mode, meaning, a BPF program is attached to both function
     entry and return, the entry program can decide if the return
     program gets executed and the entry program can share u64 cookie
     value with return program. "Session mode" is a common use-case for
     tetragon and bpftrace.

   - Add the ability to specify and retrieve BPF cookie for raw
     tracepoint programs in order to ease migration from classic to raw
     tracepoints.

   - Add an internal-only BPF per-CPU instruction for resolving per-CPU
     memory addresses and implement support in x86, ARM64 and RISC-V
     JITs. This allows inlining functions which need to access per-CPU
     state.

   - Optimize x86 BPF JIT's emit_mov_imm64, and add support for various
     atomics in bpf_arena which can be JITed as a single x86
     instruction. Support BPF arena on ARM64.

   - Add a new bpf_wq API for deferring events and refactor
     process-context bpf_timer code to keep common code where possible.

   - Harden the BPF verifier's and/or/xor value tracking.

   - Introduce crypto kfuncs to let BPF programs call kernel crypto
     APIs.

   - Support bpf_tail_call_static() helper for BPF programs with GCC 13.

   - Add bpf_preempt_{disable,enable}() kfuncs in order to allow a BPF
     program to have code sections where preemption is disabled.

  Driver API:

   - Skip software TC processing completely if all installed rules are
     marked as HW-only, instead of checking the HW-only flag rule by
     rule.

   - Add support for configuring PoE (Power over Ethernet), similar to
     the already existing support for PoDL (Power over Data Line)
     config.

   - Initial bits of a queue control API, for now allowing a single
     queue to be reset without disturbing packet flow to other queues.

   - Common (ethtool) statistics for hardware timestamping.

  Tests and tooling:

   - Remove the need to create a config file to run the net forwarding
     tests so that a naive "make run_tests" can exercise them.

   - Define a method of writing tests which require an external endpoint
     to communicate with (to send/receive data towards the test
     machine). Add a few such tests.

   - Create a shared code library for writing Python tests. Expose the
     YAML Netlink library from tools/ to the tests for easy Netlink
     access.

   - Move netfilter tests under net/, extend them, separate performance
     tests from correctness tests, and iron out issues found by running
     them "on every commit".

   - Refactor BPF selftests to use common network helpers.

   - Further work filling in YAML definitions of Netlink messages for:
     nftables, team driver, bonding interfaces, vlan interfaces, VF
     info, TC u32 mark, TC police action.

   - Teach Python YAML Netlink to decode attribute policies.

   - Extend the definition of the "indexed array" construct in the specs
     to cover arrays of scalars rather than just nests.

   - Add hyperlinks between definitions in generated Netlink docs.

  Drivers:

   - Make sure unsupported flower control flags are rejected by drivers,
     and make more drivers report errors directly to the application
     rather than dmesg (large number of driver changes from Asbjørn
     Sloth Tønnesen).

   - Ethernet high-speed NICs:
      - Broadcom (bnxt):
         - support multiple RSS contexts and steering traffic to them
         - support XDP metadata
         - make page pool allocations more NUMA aware
      - Intel (100G, ice, idpf):
         - extract datapath code common among Intel drivers into a library
         - use fewer resources in switchdev by sharing queues with the PF
         - add PFCP filter support
         - add Ethernet filter support
         - use a spinlock instead of HW lock in PTP clock ops
         - support 5 layer Tx scheduler topology
      - nVidia/Mellanox:
         - 800G link modes and 100G SerDes speeds
         - per-queue IRQ coalescing configuration
      - Marvell Octeon:
         - support offloading TC packet mark action

   - Ethernet NICs consumer, embedded and virtual:
      - stop lying about skb->truesize in USB Ethernet drivers, it
        messes up TCP memory calculations
      - Google cloud vNIC:
         - support changing ring size via ethtool
         - support ring reset using the queue control API
      - VirtIO net:
         - expose flow hash from RSS to XDP
         - per-queue statistics
         - add selftests
      - Synopsys (stmmac):
         - support controllers which require an RX clock signal from the
           MII bus to perform their hardware initialization
      - TI:
         - icssg_prueth: support ICSSG-based Ethernet on AM65x SR1.0 devices
         - icssg_prueth: add SW TX / RX Coalescing based on hrtimers
         - cpsw: minimal XDP support
      - Renesas (ravb):
         - support describing the MDIO bus
      - Realtek (r8169):
         - add support for RTL8168M
      - Microchip Sparx5:
         - matchall and flower actions mirred and redirect

   - Ethernet switches:
      - nVidia/Mellanox:
         - improve events processing performance
      - Marvell:
         - add support for MV88E6250 family internal PHYs
      - Microchip:
         - add DCB and DSCP mapping support for KSZ switches
         - vsc73xx: convert to PHYLINK
      - Realtek:
         - rtl8226b/rtl8221b: add C45 instances and SerDes switching

   - Many driver changes related to PHYLIB and PHYLINK deprecated API
     cleanup

   - Ethernet PHYs:
      - Add a new driver for Airoha EN8811H 2.5 Gigabit PHY.
      - micrel: lan8814: add support for PPS out and external timestamp trigger

   - WiFi:
      - Disable Wireless Extensions (WEXT) in all Wi-Fi 7 devices
        drivers. Modern devices can only be configured using nl80211.
      - mac80211/cfg80211
         - handle color change per link for WiFi 7 Multi-Link Operation
      - Intel (iwlwifi):
         - don't support puncturing in 5 GHz
         - support monitor mode on passive channels
         - BZ-W device support
         - P2P with HE/EHT support
         - re-add support for firmware API 90
         - provide channel survey information for Automatic Channel Selection
      - MediaTek (mt76):
         - mt7921 LED control
         - mt7925 EHT radiotap support
         - mt7920e PCI support
      - Qualcomm (ath11k):
         - P2P support for QCA6390, WCN6855 and QCA2066
         - support hibernation
         - ieee80211-freq-limit Device Tree property support
      - Qualcomm (ath12k):
         - refactoring in preparation of multi-link support
         - suspend and hibernation support
         - ACPI support
         - debugfs support, including dfs_simulate_radar support
      - RealTek:
         - rtw88: RTL8723CS SDIO device support
         - rtw89: RTL8922AE Wi-Fi 7 PCI device support
         - rtw89: complete features of new WiFi 7 chip 8922AE including
           BT-coexistence and Wake-on-WLAN
         - rtw89: use BIOS ACPI settings to set TX power and channels
         - rtl8xxxu: enable Management Frame Protection (MFP) support

   - Bluetooth:
      - support for Intel BlazarI and Filmore Peak2 (BE201)
      - support for MediaTek MT7921S SDIO
      - initial support for Intel PCIe BT driver
      - remove HCI_AMP support"

* tag 'net-next-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1827 commits)
  selftests: netfilter: fix packetdrill conntrack testcase
  net: gro: fix napi_gro_cb zeroed alignment
  Bluetooth: btintel_pcie: Refactor and code cleanup
  Bluetooth: btintel_pcie: Fix warning reported by sparse
  Bluetooth: hci_core: Fix not handling hdev->le_num_of_adv_sets=1
  Bluetooth: btintel: Fix compiler warning for multi_v7_defconfig config
  Bluetooth: btintel_pcie: Fix compiler warnings
  Bluetooth: btintel_pcie: Add *setup* function to download firmware
  Bluetooth: btintel_pcie: Add support for PCIe transport
  Bluetooth: btintel: Export few static functions
  Bluetooth: HCI: Remove HCI_AMP support
  Bluetooth: L2CAP: Fix div-by-zero in l2cap_le_flowctl_init()
  Bluetooth: qca: Fix error code in qca_read_fw_build_info()
  Bluetooth: hci_conn: Use __counted_by() and avoid -Wfamnae warning
  Bluetooth: btintel: Add support for Filmore Peak2 (BE201)
  Bluetooth: btintel: Add support for BlazarI
  LE Create Connection command timeout increased to 20 secs
  dt-bindings: net: bluetooth: Add MediaTek MT7921S SDIO Bluetooth
  Bluetooth: compute LE flow credits based on recvbuf space
  Bluetooth: hci_sync: Use cmd->num_cis instead of magic number
  ...
2024-05-14 19:42:24 -07:00
Jakub Kicinski
654de42f3f Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Merge in late fixes to prepare for the 6.10 net-next PR.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-14 10:53:19 -07:00
Mike Rapoport (IBM)
2c9e5d4a00 bpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES of
BPF just-in-time compiler depended on CONFIG_MODULES because it used
module_alloc() to allocate memory for the generated code.

Since code allocations are now implemented with execmem, drop dependency of
CONFIG_BPF_JIT on CONFIG_MODULES and make it select CONFIG_EXECMEM.

Suggested-by: Björn Töpel <bjorn@kernel.org>
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-05-14 00:36:29 -07:00
Mike Rapoport (IBM)
12af2b83d0 mm: introduce execmem_alloc() and execmem_free()
module_alloc() is used everywhere as a mean to allocate memory for code.

Beside being semantically wrong, this unnecessarily ties all subsystems
that need to allocate code, such as ftrace, kprobes and BPF to modules and
puts the burden of code allocation to the modules code.

Several architectures override module_alloc() because of various
constraints where the executable memory can be located and this causes
additional obstacles for improvements of code allocation.

Start splitting code allocation from modules by introducing execmem_alloc()
and execmem_free() APIs.

Initially, execmem_alloc() is a wrapper for module_alloc() and
execmem_free() is a replacement of module_memfree() to allow updating all
call sites to use the new APIs.

Since architectures define different restrictions on placement,
permissions, alignment and other parameters for memory that can be used by
different subsystems that allocate executable memory, execmem_alloc() takes
a type argument, that will be used to identify the calling subsystem and to
allow architectures define parameters for ranges suitable for that
subsystem.

No functional changes.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-05-14 00:31:43 -07:00
Jakub Kicinski
6e62702feb bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZkGcZAAKCRDbK58LschI
 g6o6APwLsqhrM2w71VUN5ciCxu4H5VDtZp6wkdqtVbxxU4qNxQEApKgYgKt8ZLF3
 Kily5c7m+S4ZXhMX21rb8JhSAz0dfQk=
 =5Dk7
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2024-05-13

We've added 119 non-merge commits during the last 14 day(s) which contain
a total of 134 files changed, 9462 insertions(+), 4742 deletions(-).

The main changes are:

1) Add BPF JIT support for 32-bit ARCv2 processors, from Shahab Vahedi.

2) Add BPF range computation improvements to the verifier in particular
   around XOR and OR operators, refactoring of checks for range computation
   and relaxing MUL range computation so that src_reg can also be an unknown
   scalar, from Cupertino Miranda.

3) Add support to attach kprobe BPF programs through kprobe_multi link in
   a session mode, meaning, a BPF program is attached to both function entry
   and return, the entry program can decide if the return program gets
   executed and the entry program can share u64 cookie value with return
   program. Session mode is a common use-case for tetragon and bpftrace,
   from Jiri Olsa.

4) Fix a potential overflow in libbpf's ring__consume_n() and improve libbpf
   as well as BPF selftest's struct_ops handling, from Andrii Nakryiko.

5) Improvements to BPF selftests in context of BPF gcc backend,
   from Jose E. Marchesi & David Faust.

6) Migrate remaining BPF selftest tests from test_sock_addr.c to prog_test-
   -style in order to retire the old test, run it in BPF CI and additionally
   expand test coverage, from Jordan Rife.

7) Big batch for BPF selftest refactoring in order to remove duplicate code
   around common network helpers, from Geliang Tang.

8) Another batch of improvements to BPF selftests to retire obsolete
   bpf_tcp_helpers.h as everything is available vmlinux.h,
   from Martin KaFai Lau.

9) Fix BPF map tear-down to not walk the map twice on free when both timer
   and wq is used, from Benjamin Tissoires.

10) Fix BPF verifier assumptions about socket->sk that it can be non-NULL,
    from Alexei Starovoitov.

11) Change BTF build scripts to using --btf_features for pahole v1.26+,
    from Alan Maguire.

12) Small improvements to BPF reusing struct_size() and krealloc_array(),
    from Andy Shevchenko.

13) Fix s390 JIT to emit a barrier for BPF_FETCH instructions,
    from Ilya Leoshkevich.

14) Extend TCP ->cong_control() callback in order to feed in ack and
    flag parameters and allow write-access to tp->snd_cwnd_stamp
    from BPF program, from Miao Xu.

15) Add support for internal-only per-CPU instructions to inline
    bpf_get_smp_processor_id() helper call for arm64 and riscv64 BPF JITs,
    from Puranjay Mohan.

16) Follow-up to remove the redundant ethtool.h from tooling infrastructure,
    from Tushar Vyavahare.

17) Extend libbpf to support "module:<function>" syntax for tracing
    programs, from Viktor Malik.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (119 commits)
  bpf: make list_for_each_entry portable
  bpf: ignore expected GCC warning in test_global_func10.c
  bpf: disable strict aliasing in test_global_func9.c
  selftests/bpf: Free strdup memory in xdp_hw_metadata
  selftests/bpf: Fix a few tests for GCC related warnings.
  bpf: avoid gcc overflow warning in test_xdp_vlan.c
  tools: remove redundant ethtool.h from tooling infra
  selftests/bpf: Expand ATTACH_REJECT tests
  selftests/bpf: Expand getsockname and getpeername tests
  sefltests/bpf: Expand sockaddr hook deny tests
  selftests/bpf: Expand sockaddr program return value tests
  selftests/bpf: Retire test_sock_addr.(c|sh)
  selftests/bpf: Remove redundant sendmsg test cases
  selftests/bpf: Migrate ATTACH_REJECT test cases
  selftests/bpf: Migrate expected_attach_type tests
  selftests/bpf: Migrate wildcard destination rewrite test
  selftests/bpf: Migrate sendmsg6 v4 mapped address tests
  selftests/bpf: Migrate sendmsg deny test cases
  selftests/bpf: Migrate WILDCARD_IP test
  selftests/bpf: Handle SYSCALL_EPERM and SYSCALL_ENOTSUPP test cases
  ...
====================

Link: https://lore.kernel.org/r/20240513134114.17575-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-13 16:41:10 -07:00
Linus Torvalds
c0b9620bc3 RCU pull request for v6.10
This pull request contains the following branches:
 
 fixes.2024.04.15a: Fix a lockdep complain for lazy-preemptible kernel,
 remove redundant BH disable for TINY_RCU, remove redundant READ_ONCE()
 in tree.c, fix false positives KCSAN splat and fix buffer overflow in
 the print_cpu_stall_info().
 
 misc.2024.04.12a: Misc updates related to bpf, tracing and update the
 MAINTAINERS file.
 
 rcu-sync-normal-improve.2024.04.15a: An improvement of a normal
 synchronize_rcu() call in terms of latency. It maintains a separate
 track for sync. users only. This approach bypasses per-cpu nocb-lists
 thus sync-users do not depend on nocb-list length and how fast regular
 callbacks are processed.
 
 rcu-tasks.2024.04.15a: RCU tasks, switch tasks RCU grace periods to
 sleep at TASK_IDLE priority, fix some comments, add some diagnostic
 warning to the exit_tasks_rcu_start() and fix a buffer overflow in
 the show_rcu_tasks_trace_gp_kthread().
 
 rcutorture.2024.04.15a: Increase memory to guest OS, fix a Tasks
 Rude RCU testing, some updates for TREE09, dump mode information
 to debug GP kthread state, remove redundant READ_ONCE(), fix some
 comments about RCU_TORTURE_PIPE_LEN and pipe_count, remove some
 redundant pointer initialization, fix a hung splat task by when
 the rcutorture tests start to exit, fix invalid context warning,
 add '--do-kvfree' parameter to torture test and use slow register
 unregister callbacks only for rcutype test.
 -----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEEu6QRe/mAUYNn5U0PBYqkjnKWLM8FAmYzsmUACgkQBYqkjnKW
 LM/FAwv+LcIJ9lO/wzUpnH3d3djBOPmyu7Us8ERNY5lcVZ+neS2m3vxq0kOk/cnV
 RGgZc7qjWqMQ9hAx/MmIodmiw036ceRDe5CP/Ec/TYx68m+NPG3VnP08s/xLXLlx
 n8aSJJu37y0ElMQMwvuQaoNJ2xqlZ8AHCR6iaqJtzmPBR6zHLyeCPVpdPJQfcSO7
 +9ABzqo8isGxeuaAE7y0WUp0ZsSpdYvdext5SStjtvZ+hKERdVluhBF+OxZIZByp
 RSBoZJrbTKKpzTUBSE0ci+mlfqBPmSVjjqvygscuwOoKhm+601E51DYb1QXkGujq
 vuc1f/c7VjTAXyvs9k4An2x3XcN5SFhA6Bhc+L6aU/UJBzAWrJJkVOwS79gHNSn1
 qshyhpDLE8MiBEi0QxaEmBZLkz3BX1aYbQA0+5wvgoz0u8QglrpRrPRIWUWC0wvq
 SOLIibZkJuPUOZuD5AP4tg80swTuSCvyWuiKUVRnJK9FsYKdcyNUCnOLIwUzQlrg
 1/hatlvS
 =cq8V
 -----END PGP SIGNATURE-----

Merge tag 'rcu.next.v6.10' of https://github.com/urezki/linux

Pull RCU updates from Uladzislau Rezki:

 - Fix a lockdep complain for lazy-preemptible kernel, remove redundant
   BH disable for TINY_RCU, remove redundant READ_ONCE() in tree.c, fix
   false positives KCSAN splat and fix buffer overflow in the
   print_cpu_stall_info().

 - Misc updates related to bpf, tracing and update the MAINTAINERS file.

 - An improvement of a normal synchronize_rcu() call in terms of
   latency. It maintains a separate track for sync. users only. This
   approach bypasses per-cpu nocb-lists thus sync-users do not depend on
   nocb-list length and how fast regular callbacks are processed.

 - RCU tasks: switch tasks RCU grace periods to sleep at TASK_IDLE
   priority, fix some comments, add some diagnostic warning to the
   exit_tasks_rcu_start() and fix a buffer overflow in the
   show_rcu_tasks_trace_gp_kthread().

 - RCU torture: Increase memory to guest OS, fix a Tasks Rude RCU
   testing, some updates for TREE09, dump mode information to debug GP
   kthread state, remove redundant READ_ONCE(), fix some comments about
   RCU_TORTURE_PIPE_LEN and pipe_count, remove some redundant pointer
   initialization, fix a hung splat task by when the rcutorture tests
   start to exit, fix invalid context warning, add '--do-kvfree'
   parameter to torture test and use slow register unregister callbacks
   only for rcutype test.

* tag 'rcu.next.v6.10' of https://github.com/urezki/linux: (48 commits)
  rcutorture: Use rcu_gp_slow_register/unregister() only for rcutype test
  torture: Scale --do-kvfree test time
  rcutorture: Fix invalid context warning when enable srcu barrier testing
  rcutorture: Make stall-tasks directly exit when rcutorture tests end
  rcutorture: Removing redundant function pointer initialization
  rcutorture: Make rcutorture support print rcu-tasks gp state
  rcutorture: Use the gp_kthread_dbg operation specified by cur_ops
  rcutorture: Re-use value stored to ->rtort_pipe_count instead of re-reading
  rcutorture: Fix rcu_torture_one_read() pipe_count overflow comment
  rcutorture: Remove extraneous rcu_torture_pipe_update_one() READ_ONCE()
  rcu: Allocate WQ with WQ_MEM_RECLAIM bit set
  rcu: Support direct wake-up of synchronize_rcu() users
  rcu: Add a trace event for synchronize_rcu_normal()
  rcu: Reduce synchronize_rcu() latency
  rcu: Fix buffer overflow in print_cpu_stall_info()
  rcu: Mollify sparse with RCU guard
  rcu-tasks: Fix show_rcu_tasks_trace_gp_kthread buffer overflow
  rcu-tasks: Fix the comments for tasks_rcu_exit_srcu_stall_timer
  rcu-tasks: Replace exit_tasks_rcu_start() initialization with WARN_ON_ONCE()
  rcu: Remove redundant CONFIG_PROVE_RCU #if condition
  ...
2024-05-13 09:49:06 -07:00
Puranjay Mohan
2ddec2c80b riscv, bpf: inline bpf_get_smp_processor_id()
Inline the calls to bpf_get_smp_processor_id() in the riscv bpf jit.

RISCV saves the pointer to the CPU's task_struct in the TP (thread
pointer) register. This makes it trivial to get the CPU's processor id.
As thread_info is the first member of task_struct, we can read the
processor id from TP + offsetof(struct thread_info, cpu).

          RISCV64 JIT output for `call bpf_get_smp_processor_id`
	  ======================================================

                Before                           After
               --------                         -------

         auipc   t1,0x848c                  ld    a5,32(tp)
         jalr    604(t1)
         mv      a5,a0

Benchmark using [1] on Qemu.

./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc

+---------------+------------------+------------------+--------------+
|      Name     |     Before       |       After      |   % change   |
|---------------+------------------+------------------+--------------|
| glob-arr-inc  | 1.077 ± 0.006M/s | 1.336 ± 0.010M/s |   + 24.04%   |
| arr-inc       | 1.078 ± 0.002M/s | 1.332 ± 0.015M/s |   + 23.56%   |
| hash-inc      | 0.494 ± 0.004M/s | 0.653 ± 0.001M/s |   + 32.18%   |
+---------------+------------------+------------------+--------------+

NOTE: This benchmark includes changes from this patch and the previous
      patch that implemented the per-cpu insn.

[1] https://github.com/anakryiko/linux/commit/8dec900975ef

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Björn Töpel <bjorn@kernel.org>
Link: https://lore.kernel.org/r/20240502151854.9810-3-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-12 16:54:34 -07:00
Haiyue Wang
75b0fbf15d bpf: Remove redundant page mask of vmf->address
As the comment described in "struct vm_fault":
	".address"      : 'Faulting virtual address - masked'
	".real_address" : 'Faulting virtual address - unmasked'

The link [1] said: "Whatever the routes, all architectures end up to the
invocation of handle_mm_fault() which, in turn, (likely) ends up calling
__handle_mm_fault() to carry out the actual work of allocating the page
tables."

  __handle_mm_fault() does address assignment:
	.address = address & PAGE_MASK,
	.real_address = address,

This is debug dump by running `./test_progs -a "*arena*"`:

[   69.767494] arena fault: vmf->address = 10000001d000, vmf->real_address = 10000001d008
[   69.767496] arena fault: vmf->address = 10000001c000, vmf->real_address = 10000001c008
[   69.767499] arena fault: vmf->address = 10000001b000, vmf->real_address = 10000001b008
[   69.767501] arena fault: vmf->address = 10000001a000, vmf->real_address = 10000001a008
[   69.767504] arena fault: vmf->address = 100000019000, vmf->real_address = 100000019008
[   69.769388] arena fault: vmf->address = 10000001e000, vmf->real_address = 10000001e1e8

So we can use the value of 'vmf->address' to do BPF arena kernel address
space cast directly.

[1] https://docs.kernel.org/mm/page_tables.html

Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
Link: https://lore.kernel.org/r/20240507063358.8048-1-haiyue.wang@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-07 14:13:17 -07:00
Cupertino Miranda
41d047a871 bpf/verifier: relax MUL range computation check
MUL instruction required that src_reg would be a known value (i.e.
src_reg would be a const value). The condition in this case can be
relaxed, since the range computation algorithm used in current code
already supports a proper range computation for any valid range value on
its operands.

Signed-off-by: Cupertino Miranda <cupertino.miranda@oracle.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: David Faust <david.faust@oracle.com>
Cc: Jose Marchesi <jose.marchesi@oracle.com>
Cc: Elena Zannoni <elena.zannoni@oracle.com>
Link: https://lore.kernel.org/r/20240506141849.185293-6-cupertino.miranda@oracle.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-06 17:09:12 -07:00
Cupertino Miranda
138cc42c05 bpf/verifier: improve XOR and OR range computation
Range for XOR and OR operators would not be attempted unless src_reg
would resolve to a single value, i.e. a known constant value.
This condition is unnecessary, and the following XOR/OR operator
handling could compute a possible better range.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>

Signed-off-by: Cupertino Miranda <cupertino.miranda@oracle.com
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: David Faust <david.faust@oracle.com>
Cc: Jose Marchesi <jose.marchesi@oracle.com>
Cc: Elena Zannoni <elena.zannoni@oracle.com>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Link: https://lore.kernel.org/r/20240506141849.185293-4-cupertino.miranda@oracle.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-06 17:09:11 -07:00
Cupertino Miranda
0922c78f59 bpf/verifier: refactor checks for range computation
Split range computation checks in its own function, isolating pessimitic
range set for dst_reg and failing return to a single point.

Signed-off-by: Cupertino Miranda <cupertino.miranda@oracle.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: David Faust <david.faust@oracle.com>
Cc: Jose Marchesi <jose.marchesi@oracle.com>
Cc: Elena Zannoni <elena.zannoni@oracle.com>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>

bpf/verifier: improve code after range computation recent changes.
Link: https://lore.kernel.org/r/20240506141849.185293-3-cupertino.miranda@oracle.com

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-06 17:09:11 -07:00
Cupertino Miranda
d786957ebd bpf/verifier: replace calls to mark_reg_unknown.
In order to further simplify the code in adjust_scalar_min_max_vals all
the calls to mark_reg_unknown are replaced by __mark_reg_unknown.

static void mark_reg_unknown(struct bpf_verifier_env *env,
  			     struct bpf_reg_state *regs, u32 regno)
{
	if (WARN_ON(regno >= MAX_BPF_REG)) {
		... mark all regs not init ...
		return;
    }
	__mark_reg_unknown(env, regs + regno);
}

The 'regno >= MAX_BPF_REG' does not apply to
adjust_scalar_min_max_vals(), because it is only called from the
following stack:
  - check_alu_op
    - adjust_reg_min_max_vals
      - adjust_scalar_min_max_vals

The check_alu_op() does check_reg_arg() which verifies that both src and
dst register numbers are within bounds.

Signed-off-by: Cupertino Miranda <cupertino.miranda@oracle.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: David Faust <david.faust@oracle.com>
Cc: Jose Marchesi <jose.marchesi@oracle.com>
Cc: Elena Zannoni <elena.zannoni@oracle.com>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Link: https://lore.kernel.org/r/20240506141849.185293-2-cupertino.miranda@oracle.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-06 17:09:11 -07:00
Jakub Kicinski
e958da0ddb Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

include/linux/filter.h
kernel/bpf/core.c
  66e13b615a ("bpf: verifier: prevent userspace memory access")
  d503a04f8b ("bpf: Add support for certain atomics in bpf_arena to x86 JIT")
https://lore.kernel.org/all/20240429114939.210328b0@canb.auug.org.au/

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02 12:06:25 -07:00
Vadim Fedorenko
ac2f438c2a bpf: crypto: fix build when CONFIG_CRYPTO=m
Crypto subsytem can be build as a module. In this case we still have to
build BPF crypto framework otherwise the build will fail.

Fixes: 3e1c6f3540 ("bpf: make common crypto API for TC/XDP programs")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202405011634.4JK40epY-lkp@intel.com/
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Link: https://lore.kernel.org/r/20240501170130.1682309-1-vadfed@meta.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-05-01 13:32:26 -07:00
Stanislav Fomichev
543576ec15 bpf: Add BPF_PROG_TYPE_CGROUP_SKB attach type enforcement in BPF_LINK_CREATE
bpf_prog_attach uses attach_type_to_prog_type to enforce proper
attach type for BPF_PROG_TYPE_CGROUP_SKB. link_create uses
bpf_prog_get and relies on bpf_prog_attach_check_attach_type
to properly verify prog_type <> attach_type association.

Add missing attach_type enforcement for the link_create case.
Otherwise, it's currently possible to attach cgroup_skb prog
types to other cgroup hooks.

Fixes: af6eea5743 ("bpf: Implement bpf_link-based cgroup BPF program attachment")
Link: https://lore.kernel.org/bpf/0000000000004792a90615a1dde0@google.com/
Reported-by: syzbot+838346b979830606c854@syzkaller.appspotmail.com
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240426231621.2716876-2-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-04-30 10:43:37 -07:00
Jiri Olsa
5c919acef8 bpf: Add support for kprobe session cookie
Adding support for cookie within the session of kprobe multi
entry and return program.

The session cookie is u64 value and can be retrieved be new
kfunc bpf_session_cookie, which returns pointer to the cookie
value. The bpf program can use the pointer to store (on entry)
and load (on return) the value.

The cookie value is implemented via fprobe feature that allows
to share values between entry and return ftrace fprobe callbacks.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240430112830.1184228-4-jolsa@kernel.org
2024-04-30 09:45:53 -07:00
Jiri Olsa
adf46d88ae bpf: Add support for kprobe session context
Adding struct bpf_session_run_ctx object to hold session related
data, which is atm is_return bool and data pointer coming in
following changes.

Placing bpf_session_run_ctx layer in between bpf_run_ctx and
bpf_kprobe_multi_run_ctx so the session data can be retrieved
regardless of if it's kprobe_multi or uprobe_multi link, which
support is coming in future. This way both kprobe_multi and
uprobe_multi can use same kfuncs to access the session data.

Adding bpf_session_is_return kfunc that returns true if the
bpf program is executed from the exit probe of the kprobe multi
link attached in wrapper mode. It returns false otherwise.

Adding new kprobe hook for kprobe program type.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240430112830.1184228-3-jolsa@kernel.org
2024-04-30 09:45:53 -07:00
Jiri Olsa
535a3692ba bpf: Add support for kprobe session attach
Adding support to attach bpf program for entry and return probe
of the same function. This is common use case which at the moment
requires to create two kprobe multi links.

Adding new BPF_TRACE_KPROBE_SESSION attach type that instructs
kernel to attach single link program to both entry and exit probe.

It's possible to control execution of the bpf program on return
probe simply by returning zero or non zero from the entry bpf
program execution to execute or not the bpf program on return
probe respectively.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240430112830.1184228-2-jolsa@kernel.org
2024-04-30 09:45:53 -07:00
Benjamin Tissoires
a891711d01 bpf: Do not walk twice the hash map on free
If someone stores both a timer and a workqueue in a hash map, on free, we
would walk it twice.

Add a check in htab_free_malloced_timers_or_wq and free the timers and
workqueues if they are present.

Fixes: 246331e3f1 ("bpf: allow struct bpf_wq to be embedded in arraymaps and hashmaps")
Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/20240430-bpf-next-v3-2-27afe7f3b17c@kernel.org
2024-04-30 16:28:46 +02:00
Benjamin Tissoires
b98a5c68cc bpf: Do not walk twice the map on free
If someone stores both a timer and a workqueue in a map, on free
we would walk it twice.

Add a check in array_map_free_timers_wq and free the timers and
workqueues if they are present.

Fixes: 246331e3f1 ("bpf: allow struct bpf_wq to be embedded in arraymaps and hashmaps")
Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/20240430-bpf-next-v3-1-27afe7f3b17c@kernel.org
2024-04-30 16:28:33 +02:00
Andy Shevchenko
a3034872cd bpf: Switch to krealloc_array()
Let the krealloc_array() copy the original data and
check for a multiplication overflow.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20240429120005.3539116-1-andriy.shevchenko@linux.intel.com
2024-04-29 16:13:14 -07:00
Andy Shevchenko
cb01621b6d bpf: Use struct_size()
Use struct_size() instead of hand writing it.
This is less verbose and more robust.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20240429121323.3818497-1-andriy.shevchenko@linux.intel.com
2024-04-29 16:12:03 -07:00
Alexei Starovoitov
0db63c0b86 bpf: Fix verifier assumptions about socket->sk
The verifier assumes that 'sk' field in 'struct socket' is valid
and non-NULL when 'socket' pointer itself is trusted and non-NULL.
That may not be the case when socket was just created and
passed to LSM socket_accept hook.
Fix this verifier assumption and adjust tests.

Reported-by: Liam Wisehart <liamwisehart@meta.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Fixes: 6fcd486b3a ("bpf: Refactor RCU enforcement in the verifier.")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20240427002544.68803-1-alexei.starovoitov@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-04-29 14:16:41 -07:00
Jakub Kicinski
89de2db193 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZi9+AAAKCRDbK58LschI
 g0nEAP487m7L0nLVriC2oIOWsi29tklW3etm6DO7gmGRGIHgrgEAnMyV1xBj3bGj
 v6jJwDcybCym1hLx+1x1JCZ4eoAFswE=
 =xbna
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2024-04-29

We've added 147 non-merge commits during the last 32 day(s) which contain
a total of 158 files changed, 9400 insertions(+), 2213 deletions(-).

The main changes are:

1) Add an internal-only BPF per-CPU instruction for resolving per-CPU
   memory addresses and implement support in x86 BPF JIT. This allows
   inlining per-CPU array and hashmap lookups
   and the bpf_get_smp_processor_id() helper, from Andrii Nakryiko.

2) Add BPF link support for sk_msg and sk_skb programs, from Yonghong Song.

3) Optimize x86 BPF JIT's emit_mov_imm64, and add support for various
   atomics in bpf_arena which can be JITed as a single x86 instruction,
   from Alexei Starovoitov.

4) Add support for passing mark with bpf_fib_lookup helper,
   from Anton Protopopov.

5) Add a new bpf_wq API for deferring events and refactor sleepable
   bpf_timer code to keep common code where possible,
   from Benjamin Tissoires.

6) Fix BPF_PROG_TEST_RUN infra with regards to bpf_dummy_struct_ops programs
   to check when NULL is passed for non-NULLable parameters,
   from Eduard Zingerman.

7) Harden the BPF verifier's and/or/xor value tracking,
   from Harishankar Vishwanathan.

8) Introduce crypto kfuncs to make BPF programs able to utilize the kernel
   crypto subsystem, from Vadim Fedorenko.

9) Various improvements to the BPF instruction set standardization doc,
   from Dave Thaler.

10) Extend libbpf APIs to partially consume items from the BPF ringbuffer,
    from Andrea Righi.

11) Bigger batch of BPF selftests refactoring to use common network helpers
    and to drop duplicate code, from Geliang Tang.

12) Support bpf_tail_call_static() helper for BPF programs with GCC 13,
    from Jose E. Marchesi.

13) Add bpf_preempt_{disable,enable}() kfuncs in order to allow a BPF
    program to have code sections where preemption is disabled,
    from Kumar Kartikeya Dwivedi.

14) Allow invoking BPF kfuncs from BPF_PROG_TYPE_SYSCALL programs,
    from David Vernet.

15) Extend the BPF verifier to allow different input maps for a given
    bpf_for_each_map_elem() helper call in a BPF program, from Philo Lu.

16) Add support for PROBE_MEM32 and bpf_addr_space_cast instructions
    for riscv64 and arm64 JITs to enable BPF Arena, from Puranjay Mohan.

17) Shut up a false-positive KMSAN splat in interpreter mode by unpoison
    the stack memory, from Martin KaFai Lau.

18) Improve xsk selftest coverage with new tests on maximum and minimum
    hardware ring size configurations, from Tushar Vyavahare.

19) Various ReST man pages fixes as well as documentation and bash completion
    improvements for bpftool, from Rameez Rehman & Quentin Monnet.

20) Fix libbpf with regards to dumping subsequent char arrays,
    from Quentin Deslandes.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (147 commits)
  bpf, docs: Clarify PC use in instruction-set.rst
  bpf_helpers.h: Define bpf_tail_call_static when building with GCC
  bpf, docs: Add introduction for use in the ISA Internet Draft
  selftests/bpf: extend BPF_SOCK_OPS_RTT_CB test for srtt and mrtt_us
  bpf: add mrtt and srtt as BPF_SOCK_OPS_RTT_CB args
  selftests/bpf: dummy_st_ops should reject 0 for non-nullable params
  bpf: check bpf_dummy_struct_ops program params for test runs
  selftests/bpf: do not pass NULL for non-nullable params in dummy_st_ops
  selftests/bpf: adjust dummy_st_ops_success to detect additional error
  bpf: mark bpf_dummy_struct_ops.test_1 parameter as nullable
  selftests/bpf: Add ring_buffer__consume_n test.
  bpf: Add bpf_guard_preempt() convenience macro
  selftests: bpf: crypto: add benchmark for crypto functions
  selftests: bpf: crypto skcipher algo selftests
  bpf: crypto: add skcipher to bpf crypto
  bpf: make common crypto API for TC/XDP programs
  bpf: update the comment for BTF_FIELDS_MAX
  selftests/bpf: Fix wq test.
  selftests/bpf: Use make_sockaddr in test_sock_addr
  selftests/bpf: Use connect_to_addr in test_sock_addr
  ...
====================

Link: https://lore.kernel.org/r/20240429131657.19423-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-29 13:12:19 -07:00
Puranjay Mohan
66e13b615a bpf: verifier: prevent userspace memory access
With BPF_PROBE_MEM, BPF allows de-referencing an untrusted pointer. To
thwart invalid memory accesses, the JITs add an exception table entry
for all such accesses. But in case the src_reg + offset is a userspace
address, the BPF program might read that memory if the user has
mapped it.

Make the verifier add guard instructions around such memory accesses and
skip the load if the address falls into the userspace region.

The JITs need to implement bpf_arch_uaddress_limit() to define where
the userspace addresses end for that architecture or TASK_SIZE is taken
as default.

The implementation is as follows:

REG_AX =  SRC_REG
if(offset)
	REG_AX += offset;
REG_AX >>= 32;
if (REG_AX <= (uaddress_limit >> 32))
	DST_REG = 0;
else
	DST_REG = *(size *)(SRC_REG + offset);

Comparing just the upper 32 bits of the load address with the upper
32 bits of uaddress_limit implies that the values are being aligned down
to a 4GB boundary before comparison.

The above means that all loads with address <= uaddress_limit + 4GB are
skipped. This is acceptable because there is a large hole (much larger
than 4GB) between userspace and kernel space memory, therefore a
correctly functioning BPF program should not access this 4GB memory
above the userspace.

Let's analyze what this patch does to the following fentry program
dereferencing an untrusted pointer:

  SEC("fentry/tcp_v4_connect")
  int BPF_PROG(fentry_tcp_v4_connect, struct sock *sk)
  {
                *(volatile long *)sk;
                return 0;
  }

    BPF Program before              |           BPF Program after
    ------------------              |           -----------------

  0: (79) r1 = *(u64 *)(r1 +0)          0: (79) r1 = *(u64 *)(r1 +0)
  -----------------------------------------------------------------------
  1: (79) r1 = *(u64 *)(r1 +0) --\      1: (bf) r11 = r1
  ----------------------------\   \     2: (77) r11 >>= 32
  2: (b7) r0 = 0               \   \    3: (b5) if r11 <= 0x8000 goto pc+2
  3: (95) exit                  \   \-> 4: (79) r1 = *(u64 *)(r1 +0)
                                 \      5: (05) goto pc+1
                                  \     6: (b7) r1 = 0
                                   \--------------------------------------
                                        7: (b7) r0 = 0
                                        8: (95) exit

As you can see from above, in the best case (off=0), 5 extra instructions
are emitted.

Now, we analyze the same program after it has gone through the JITs of
ARM64 and RISC-V architectures. We follow the single load instruction
that has the untrusted pointer and see what instrumentation has been
added around it.

                                x86-64 JIT
                                ==========
     JIT's Instrumentation
          (upstream)
     ---------------------

   0:   nopl   0x0(%rax,%rax,1)
   5:   xchg   %ax,%ax
   7:   push   %rbp
   8:   mov    %rsp,%rbp
   b:   mov    0x0(%rdi),%rdi
  ---------------------------------
   f:   movabs $0x800000000000,%r11
  19:   cmp    %r11,%rdi
  1c:   jb     0x000000000000002a
  1e:   mov    %rdi,%r11
  21:   add    $0x0,%r11
  28:   jae    0x000000000000002e
  2a:   xor    %edi,%edi
  2c:   jmp    0x0000000000000032
  2e:   mov    0x0(%rdi),%rdi
  ---------------------------------
  32:   xor    %eax,%eax
  34:   leave
  35:   ret

The x86-64 JIT already emits some instructions to protect against user
memory access. This patch doesn't make any changes for the x86-64 JIT.

                                  ARM64 JIT
                                  =========

        No Intrumentation                       Verifier's Instrumentation
           (upstream)                                  (This patch)
        -----------------                       --------------------------

   0:   add     x9, x30, #0x0                0:   add     x9, x30, #0x0
   4:   nop                                  4:   nop
   8:   paciasp                              8:   paciasp
   c:   stp     x29, x30, [sp, #-16]!        c:   stp     x29, x30, [sp, #-16]!
  10:   mov     x29, sp                     10:   mov     x29, sp
  14:   stp     x19, x20, [sp, #-16]!       14:   stp     x19, x20, [sp, #-16]!
  18:   stp     x21, x22, [sp, #-16]!       18:   stp     x21, x22, [sp, #-16]!
  1c:   stp     x25, x26, [sp, #-16]!       1c:   stp     x25, x26, [sp, #-16]!
  20:   stp     x27, x28, [sp, #-16]!       20:   stp     x27, x28, [sp, #-16]!
  24:   mov     x25, sp                     24:   mov     x25, sp
  28:   mov     x26, #0x0                   28:   mov     x26, #0x0
  2c:   sub     x27, x25, #0x0              2c:   sub     x27, x25, #0x0
  30:   sub     sp, sp, #0x0                30:   sub     sp, sp, #0x0
  34:   ldr     x0, [x0]                    34:   ldr     x0, [x0]
--------------------------------------------------------------------------------
  38:   ldr     x0, [x0] ----------\        38:   add     x9, x0, #0x0
-----------------------------------\\       3c:   lsr     x9, x9, #32
  3c:   mov     x7, #0x0            \\      40:   cmp     x9, #0x10, lsl #12
  40:   mov     sp, sp               \\     44:   b.ls    0x0000000000000050
  44:   ldp     x27, x28, [sp], #16   \\--> 48:   ldr     x0, [x0]
  48:   ldp     x25, x26, [sp], #16    \    4c:   b       0x0000000000000054
  4c:   ldp     x21, x22, [sp], #16     \   50:   mov     x0, #0x0
  50:   ldp     x19, x20, [sp], #16      \---------------------------------------
  54:   ldp     x29, x30, [sp], #16         54:   mov     x7, #0x0
  58:   add     x0, x7, #0x0                58:   mov     sp, sp
  5c:   autiasp                             5c:   ldp     x27, x28, [sp], #16
  60:   ret                                 60:   ldp     x25, x26, [sp], #16
  64:   nop                                 64:   ldp     x21, x22, [sp], #16
  68:   ldr     x10, 0x0000000000000070     68:   ldp     x19, x20, [sp], #16
  6c:   br      x10                         6c:   ldp     x29, x30, [sp], #16
                                            70:   add     x0, x7, #0x0
                                            74:   autiasp
                                            78:   ret
                                            7c:   nop
                                            80:   ldr     x10, 0x0000000000000088
                                            84:   br      x10

There are 6 extra instructions added in ARM64 in the best case. This will
become 7 in the worst case (off != 0).

                           RISC-V JIT (RISCV_ISA_C Disabled)
                           ==========

        No Intrumentation           Verifier's Instrumentation
           (upstream)                      (This patch)
        -----------------           --------------------------

   0:   nop                            0:   nop
   4:   nop                            4:   nop
   8:   li      a6, 33                 8:   li      a6, 33
   c:   addi    sp, sp, -16            c:   addi    sp, sp, -16
  10:   sd      s0, 8(sp)             10:   sd      s0, 8(sp)
  14:   addi    s0, sp, 16            14:   addi    s0, sp, 16
  18:   ld      a0, 0(a0)             18:   ld      a0, 0(a0)
---------------------------------------------------------------
  1c:   ld      a0, 0(a0) --\         1c:   mv      t0, a0
--------------------------\  \        20:   srli    t0, t0, 32
  20:   li      a5, 0      \  \       24:   lui     t1, 4096
  24:   ld      s0, 8(sp)   \  \      28:   sext.w  t1, t1
  28:   addi    sp, sp, 16   \  \     2c:   bgeu    t1, t0, 12
  2c:   sext.w  a0, a5        \  \--> 30:   ld      a0, 0(a0)
  30:   ret                    \      34:   j       8
                                \     38:   li      a0, 0
                                 \------------------------------
                                      3c:   li      a5, 0
                                      40:   ld      s0, 8(sp)
                                      44:   addi    sp, sp, 16
                                      48:   sext.w  a0, a5
                                      4c:   ret

There are 7 extra instructions added in RISC-V.

Fixes: 8008342853 ("bpf, arm64: Add BPF exception tables")
Reported-by: Breno Leitao <leitao@debian.org>
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Link: https://lore.kernel.org/r/20240424100210.11982-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-26 09:45:18 -07:00
Rick Edgecombe
529ce23a76 mm: switch mm->get_unmapped_area() to a flag
The mm_struct contains a function pointer *get_unmapped_area(), which is
set to either arch_get_unmapped_area() or arch_get_unmapped_area_topdown()
during the initialization of the mm.

Since the function pointer only ever points to two functions that are
named the same across all arch's, a function pointer is not really
required.  In addition future changes will want to add versions of the
functions that take additional arguments.  So to save a pointers worth of
bytes in mm_struct, and prevent adding additional function pointers to
mm_struct in future changes, remove it and keep the information about
which get_unmapped_area() to use in a flag.

Add the new flag to MMF_INIT_MASK so it doesn't get clobbered on fork by
mmf_init_flags().  Most MM flags get clobbered on fork.  In the
pre-existing behavior mm->get_unmapped_area() would get copied to the new
mm in dup_mm(), so not clobbering the flag preserves the existing behavior
around inheriting the topdown-ness.

Introduce a helper, mm_get_unmapped_area(), to easily convert code that
refers to the old function pointer to instead select and call either
arch_get_unmapped_area() or arch_get_unmapped_area_topdown() based on the
flag.  Then drop the mm->get_unmapped_area() function pointer.  Leave the
get_unmapped_area() pointer in struct file_operations alone.  The main
purpose of this change is to reorganize in preparation for future changes,
but it also converts the calls of mm->get_unmapped_area() from indirect
branches into a direct ones.

The stress-ng bigheap benchmark calls realloc a lot, which calls through
get_unmapped_area() in the kernel.  On x86, the change yielded a ~1%
improvement there on a retpoline config.

In testing a few x86 configs, removing the pointer unfortunately didn't
result in any actual size reductions in the compiled layout of mm_struct. 
But depending on compiler or arch alignment requirements, the change could
shrink the size of mm_struct.

Link: https://lkml.kernel.org/r/20240326021656.202649-3-rick.p.edgecombe@intel.com
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Deepak Gupta <debug@rivosinc.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Helge Deller <deller@gmx.de>
Cc: H. Peter Anvin (Intel) <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:25 -07:00
Yosry Ahmed
91b71e78b8 mm: memcg: add NULL check to obj_cgroup_put()
9 out of 16 callers perform a NULL check before calling obj_cgroup_put(). 
Move the NULL check in the function, similar to mem_cgroup_put().  The
unlikely() NULL check in current_objcg_update() was left alone to avoid
dropping the unlikey() annotation as this a fast path.

Link: https://lkml.kernel.org/r/20240316015803.2777252-1-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:55:43 -07:00
Vadim Fedorenko
3e1c6f3540 bpf: make common crypto API for TC/XDP programs
Add crypto API support to BPF to be able to decrypt or encrypt packets
in TC/XDP BPF programs. Special care should be taken for initialization
part of crypto algo because crypto alloc) doesn't work with preemtion
disabled, it can be run only in sleepable BPF program. Also async crypto
is not supported because of the very same issue - TC/XDP BPF programs
are not sleepable.

Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Link: https://lore.kernel.org/r/20240422225024.2847039-2-vadfed@meta.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-04-24 16:01:10 -07:00
Kumar Kartikeya Dwivedi
fc7566ad0a bpf: Introduce bpf_preempt_[disable,enable] kfuncs
Introduce two new BPF kfuncs, bpf_preempt_disable and
bpf_preempt_enable. These kfuncs allow disabling preemption in BPF
programs. Nesting is allowed, since the intended use cases includes
building native BPF spin locks without kernel helper involvement. Apart
from that, this can be used to per-CPU data structures for cases where
programs (or userspace) may preempt one or the other. Currently, while
per-CPU access is stable, whether it will be consistent is not
guaranteed, as only migration is disabled for BPF programs.

Global functions are disallowed from being called, but support for them
will be added as a follow up not just preempt kfuncs, but rcu_read_lock
kfuncs as well. Static subprog calls are permitted. Sleepable helpers
and kfuncs are disallowed in non-preemptible regions.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20240424031315.2757363-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-24 09:47:49 -07:00
Alexei Starovoitov
dc92febf7b bpf: Don't check for recursion in bpf_wq_work.
__bpf_prog_enter_sleepable_recur does recursion check which is not applicable
to wq callback. The callback function is part of bpf program and bpf prog might
be running on the same cpu. So recursion check would incorrectly prevent
callback from running. The code can call __bpf_prog_enter_sleepable(), but
run_ctx would be fake, hence use explicit rcu_read_lock_trace();
migrate_disable(); to address this problem. Another reason to open code is
__bpf_prog_enter* are not available in !JIT configs.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202404241719.IIGdpAku-lkp@intel.com/
Closes: https://lore.kernel.org/oe-kbuild-all/202404241811.FFV4Bku3-lkp@intel.com/
Fixes: eb48f6cd41 ("bpf: wq: add bpf_wq_init")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-24 09:06:50 -07:00
Joel Granados
1adb825af9 bpf: Remove the now superfluous sentinel elements from ctl_table array
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which
will reduce the overall build time size of the kernel and run time
memory bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

Remove sentinel element from bpf_syscall_table.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Joel Granados <j.granados@samsung.com>
2024-04-24 09:43:54 +02:00
Benjamin Tissoires
8e83da9732 bpf: add bpf_wq_start
again, copy/paste from bpf_timer_start().

Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
Link: https://lore.kernel.org/r/20240420-bpf_wq-v2-15-6c986a5a741f@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-23 19:46:57 -07:00
Benjamin Tissoires
81f1d7a583 bpf: wq: add bpf_wq_set_callback_impl
To support sleepable async callbacks, we need to tell push_async_cb()
whether the cb is sleepable or not.

The verifier now detects that we are in bpf_wq_set_callback_impl and
can allow a sleepable callback to happen.

Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
Link: https://lore.kernel.org/r/20240420-bpf_wq-v2-13-6c986a5a741f@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-23 19:46:57 -07:00
Benjamin Tissoires
eb48f6cd41 bpf: wq: add bpf_wq_init
We need to teach the verifier about the second argument which is declared
as void * but which is of type KF_ARG_PTR_TO_MAP. We could have dropped
this extra case if we declared the second argument as struct bpf_map *,
but that means users will have to do extra casting to have their program
compile.

We also need to duplicate the timer code for the checking if the map
argument is matching the provided workqueue.

Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
Link: https://lore.kernel.org/r/20240420-bpf_wq-v2-11-6c986a5a741f@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-23 19:46:57 -07:00
Benjamin Tissoires
246331e3f1 bpf: allow struct bpf_wq to be embedded in arraymaps and hashmaps
Currently bpf_wq_cancel_and_free() is just a placeholder as there is
no memory allocation for bpf_wq just yet.

Again, duplication of the bpf_timer approach

Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
Link: https://lore.kernel.org/r/20240420-bpf_wq-v2-9-6c986a5a741f@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-23 18:31:25 -07:00
Benjamin Tissoires
d940c9b94d bpf: add support for KF_ARG_PTR_TO_WORKQUEUE
Introduce support for KF_ARG_PTR_TO_WORKQUEUE. The kfuncs will use bpf_wq
as argument and that will be recognized as workqueue argument by verifier.
bpf_wq_kern casting can happen inside kfunc, but using bpf_wq in
argument makes life easier for users who work with non-kern type in BPF
progs.

Duplicate process_timer_func into process_wq_func.
meta argument is only needed to ensure bpf_wq_init's workqueue and map
arguments are coming from the same map (map_uid logic is necessary for
correct inner-map handling), so also amend check_kfunc_args() to
match what helpers functions check is doing.

Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
Link: https://lore.kernel.org/r/20240420-bpf_wq-v2-8-6c986a5a741f@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-23 18:31:25 -07:00
Benjamin Tissoires
ad2c03e691 bpf: verifier: bail out if the argument is not a map
When a kfunc is declared with a KF_ARG_PTR_TO_MAP, we should have
reg->map_ptr set to a non NULL value, otherwise, that means that the
underlying type is not a map.

Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
Link: https://lore.kernel.org/r/20240420-bpf_wq-v2-7-6c986a5a741f@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-23 18:31:24 -07:00
Benjamin Tissoires
d56b63cf0c bpf: add support for bpf_wq user type
Mostly a copy/paste from the bpf_timer API, without the initialization
and free, as they will be done in a separate patch.

Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
Link: https://lore.kernel.org/r/20240420-bpf_wq-v2-5-6c986a5a741f@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-23 18:31:24 -07:00
Benjamin Tissoires
fc22d9495f bpf: replace bpf_timer_cancel_and_free with a generic helper
Same reason than most bpf_timer* functions, we need almost the same for
workqueues.
So extract the generic part out of it so bpf_wq_cancel_and_free can reuse
it.

Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
Link: https://lore.kernel.org/r/20240420-bpf_wq-v2-4-6c986a5a741f@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-23 18:31:24 -07:00
Benjamin Tissoires
073f11b026 bpf: replace bpf_timer_set_callback with a generic helper
In the same way we have a generic __bpf_async_init(), we also need
to share code between timer and workqueue for the set_callback call.

We just add an unused flags parameter, as it will be used for workqueues.

Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
Link: https://lore.kernel.org/r/20240420-bpf_wq-v2-3-6c986a5a741f@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-23 18:31:24 -07:00
Benjamin Tissoires
56b4a177ae bpf: replace bpf_timer_init with a generic helper
No code change except for the new flags argument being stored in the
local data struct.

Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
Link: https://lore.kernel.org/r/20240420-bpf_wq-v2-2-6c986a5a741f@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-23 18:31:24 -07:00
Benjamin Tissoires
be2749beff bpf: make timer data struct more generic
To be able to add workqueues and reuse most of the timer code, we need
to make bpf_hrtimer more generic.

There is no code change except that the new struct gets a new u64 flags
attribute. We are still below 2 cache lines, so this shouldn't impact
the current running codes.

The ordering is also changed. Everything related to async callback
is now on top of bpf_hrtimer.

Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
Link: https://lore.kernel.org/r/20240420-bpf_wq-v2-1-6c986a5a741f@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-23 18:31:24 -07:00
Rafael Passos
a7de265cb2 bpf: Fix typos in comments
Found the following typos in comments, and fixed them:

s/unpriviledged/unprivileged/
s/reponsible/responsible/
s/possiblities/possibilities/
s/Divison/Division/
s/precsion/precision/
s/havea/have a/
s/reponsible/responsible/
s/responsibile/responsible/
s/tigher/tighter/
s/respecitve/respective/

Signed-off-by: Rafael Passos <rafael@rcpassos.me>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/6af7deb4-bb24-49e8-b3f1-8dd410597337@smtp-relay.sendinblue.com
2024-04-22 17:48:08 +02:00
Rafael Passos
e1a7545981 bpf: Fix typo in function save_aux_ptr_type
I found this typo in the save_aux_ptr_type function.
s/allow_trust_missmatch/allow_trust_mismatch/
I did not find this anywhere else in the codebase.

Signed-off-by: Rafael Passos <rafael@rcpassos.me>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/fbe1d636-8172-4698-9a5a-5a3444b55322@smtp-relay.sendinblue.com
2024-04-22 17:12:05 +02:00
Harishankar Vishwanathan
1f586614f3 bpf: Harden and/or/xor value tracking in verifier
This patch addresses a latent unsoundness issue in the
scalar(32)_min_max_and/or/xor functions. While it is not a bugfix,
it ensures that the functions produce sound outputs for all inputs.

The issue occurs in these functions when setting signed bounds. The
following example illustrates the issue for scalar_min_max_and(),
but it applies to the other functions.

In scalar_min_max_and() the following clause is executed when ANDing
positive numbers:

  /* ANDing two positives gives a positive, so safe to
   * cast result into s64.
   */
  dst_reg->smin_value = dst_reg->umin_value;
  dst_reg->smax_value = dst_reg->umax_value;

However, if umin_value and umax_value of dst_reg cross the sign boundary
(i.e., if (s64)dst_reg->umin_value > (s64)dst_reg->umax_value), then we
will end up with smin_value > smax_value, which is unsound.

Previous works [1, 2] have discovered and reported this issue. Our tool
Agni [2, 3] consideres it a false positive. This is because, during the
verification of the abstract operator scalar_min_max_and(), Agni restricts
its inputs to those passing through reg_bounds_sync(). This mimics
real-world verifier behavior, as reg_bounds_sync() is invariably executed
at the tail of every abstract operator. Therefore, such behavior is
unlikely in an actual verifier execution.

However, it is still unsound for an abstract operator to set signed bounds
such that smin_value > smax_value. This patch fixes it, making the abstract
operator sound for all (well-formed) inputs.

It is worth noting that while the previous code updated the signed bounds
(using the output unsigned bounds) only when the *input signed* bounds
were positive, the new code updates them whenever the *output unsigned*
bounds do not cross the sign boundary.

An alternative approach to fix this latent unsoundness would be to
unconditionally set the signed bounds to unbounded [S64_MIN, S64_MAX], and
let reg_bounds_sync() refine the signed bounds using the unsigned bounds
and the tnum. We found that our approach produces more precise (tighter)
bounds.

For example, consider these inputs to BPF_AND:

  /* dst_reg */
  var_off.value: 8608032320201083347
  var_off.mask: 615339716653692460
  smin_value: 8070450532247928832
  smax_value: 8070450532247928832
  umin_value: 13206380674380886586
  umax_value: 13206380674380886586
  s32_min_value: -2110561598
  s32_max_value: -133438816
  u32_min_value: 4135055354
  u32_max_value: 4135055354

  /* src_reg */
  var_off.value: 8584102546103074815
  var_off.mask: 9862641527606476800
  smin_value: 2920655011908158522
  smax_value: 7495731535348625717
  umin_value: 7001104867969363969
  umax_value: 8584102543730304042
  s32_min_value: -2097116671
  s32_max_value: 71704632
  u32_min_value: 1047457619
  u32_max_value: 4268683090

After going through tnum_and() -> scalar32_min_max_and() ->
scalar_min_max_and() -> reg_bounds_sync(), our patch produces the following
bounds for s32:

  s32_min_value: -1263875629
  s32_max_value: -159911942

Whereas, setting the signed bounds to unbounded in scalar_min_max_and()
produces:

  s32_min_value: -1263875629
  s32_max_value: -1

As observed, our patch produces a tighter s32 bound. We also confirmed
using Agni and SMT verification that our patch always produces signed
bounds that are equal to or more precise than setting the signed bounds to
unbounded in scalar_min_max_and().

  [1] https://sanjit-bhat.github.io/assets/pdf/ebpf-verifier-range-analysis22.pdf
  [2] https://link.springer.com/chapter/10.1007/978-3-031-37709-9_12
  [3] https://github.com/bpfverif/agni

Co-developed-by: Matan Shachnai <m.shachnai@rutgers.edu>
Signed-off-by: Matan Shachnai <m.shachnai@rutgers.edu>
Co-developed-by: Srinivas Narayana <srinivas.narayana@rutgers.edu>
Signed-off-by: Srinivas Narayana <srinivas.narayana@rutgers.edu>
Co-developed-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu>
Signed-off-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu>
Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240402212039.51815-1-harishankar.vishwanathan@gmail.com
Link: https://lore.kernel.org/bpf/20240416115303.331688-1-harishankar.vishwanathan@gmail.com
2024-04-16 17:55:27 +02:00
Ard Biesheuvel
fc5eb4a84e btf: Avoid weak external references
If the BTF code is enabled in the build configuration, the start/stop
BTF markers are guaranteed to exist. Only when CONFIG_DEBUG_INFO_BTF=n,
the references in btf_parse_vmlinux() will remain unsatisfied, relying
on the weak linkage of the external references to avoid breaking the
build.

Avoid GOT based relocations to these markers in the final executable by
dropping the weak attribute and instead, make btf_parse_vmlinux() return
ERR_PTR(-ENOENT) directly if CONFIG_DEBUG_INFO_BTF is not enabled to
begin with.  The compiler will drop any subsequent references to
__start_BTF and __stop_BTF in that case, allowing the link to succeed.

Note that Clang will notice that taking the address of __start_BTF can
no longer yield NULL, so testing for that condition becomes unnecessary.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20240415162041.2491523-8-ardb+git@google.com
2024-04-16 16:35:13 +02:00
Anton Protopopov
37eacb9f6e bpf: Fix a verifier verbose message
Long ago a map file descriptor in a pseudo ldimm64 instruction could
only be present as an immediate value insn[0].imm, and thus this value
was used in a verbose verifier message printed when the file descriptor
wasn't valid. Since addition of BPF_PSEUDO_MAP_IDX_VALUE/BPF_PSEUDO_MAP_IDX
the insn[0].imm field can also contain an index pointing to the file
descriptor in the attr.fd_array array. However, if the file descriptor
is invalid, the verifier still prints the verbose message containing
value of insn[0].imm. Patch the verifier message to always print the
actual file descriptor value.

Fixes: 387544bfa2 ("bpf: Introduce fd_idx")
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240412141100.3562942-1-aspsk@isovalent.com
2024-04-12 18:37:20 +02:00
Paul E. McKenney
1e52af7f02 bpf: Choose RCU Tasks based on TASKS_RCU rather than PREEMPTION
The advent of CONFIG_PREEMPT_AUTO, AKA lazy preemption, will mean that
even kernels built with CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY
might see the occasional preemption, and that this preemption just might
happen within a trampoline.

Therefore, update bpf_tramp_image_put() to choose call_rcu_tasks()
based on CONFIG_TASKS_RCU instead of CONFIG_PREEMPTION.

This change might enable further simplifications, but the goal of this
effort is to make the code safe, not necessarily optimal.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Ankur Arora <ankur.a.arora@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <bpf@vger.kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-04-12 11:23:25 +02:00
Yonghong Song
699c23f02c bpf: Add bpf_link support for sk_msg and sk_skb progs
Add bpf_link support for sk_msg and sk_skb programs. We have an
internal request to support bpf_link for sk_msg programs so user
space can have a uniform handling with bpf_link based libbpf
APIs. Using bpf_link based libbpf API also has a benefit which
makes system robust by decoupling prog life cycle and
attachment life cycle.

Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240410043527.3737160-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-10 19:52:25 -07:00
Alexei Starovoitov
d503a04f8b bpf: Add support for certain atomics in bpf_arena to x86 JIT
Support atomics in bpf_arena that can be JITed as a single x86 instruction.
Instructions that are JITed as loops are not supported at the moment,
since they require more complex extable and loop logic.

JITs can choose to do smarter things with bpf_jit_supports_insn().
Like arm64 may decide to support all bpf atomics instructions
when emit_lse_atomic is available and none in ll_sc mode.

bpf_jit_supports_percpu_insn(), bpf_jit_supports_ptr_xchg() and
other such callbacks can be replaced with bpf_jit_supports_insn()
in the future.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240405231134.17274-1-alexei.starovoitov@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-04-09 10:24:26 -07:00
Paul E. McKenney
b993115b44 bpf: Select new NEED_TASKS_RCU Kconfig option
Currently, if a Kconfig option depends on TASKS_RCU, it conditionally does
"select TASKS_RCU if PREEMPTION".  This works, but requires any change in
this enablement logic to be replicated across all such "select" clauses.
A new NEED_TASKS_RCU Kconfig option has been created to allow this
enablement logic to be in one place in kernel/rcu/Kconfig.

Therefore, make BPF select the new NEED_TASKS_RCU Kconfig option.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: <bpf@vger.kernel.org>
Cc: Ankur Arora <ankur.a.arora@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-04-09 15:13:05 +02:00
David Vernet
a8e03b6bbb bpf: Allow invoking kfuncs from BPF_PROG_TYPE_SYSCALL progs
Currently, a set of core BPF kfuncs (e.g. bpf_task_*, bpf_cgroup_*,
bpf_cpumask_*, etc) cannot be invoked from BPF_PROG_TYPE_SYSCALL
programs. The whitelist approach taken for enabling kfuncs makes sense:
it not safe to call these kfuncs from every program type. For example,
it may not be safe to call bpf_task_acquire() in an fentry to
free_task().

BPF_PROG_TYPE_SYSCALL, on the other hand, is a perfectly safe program
type from which to invoke these kfuncs, as it's a very controlled
environment, and we should never be able to run into any of the typical
problems such as recursive invoations, acquiring references on freeing
kptrs, etc. Being able to invoke these kfuncs would be useful, as
BPF_PROG_TYPE_SYSCALL can be invoked with BPF_PROG_RUN, and would
therefore enable user space programs to synchronously call into BPF to
manipulate these kptrs.

This patch therefore enables invoking the aforementioned core kfuncs
from BPF_PROG_TYPE_SYSCALL progs.

Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20240405143041.632519-2-void@manifault.com
2024-04-05 10:56:09 -07:00
Philo Lu
9d482da9e1 bpf: allow invoking bpf_for_each_map_elem with different maps
Taking different maps within a single bpf_for_each_map_elem call is not
allowed before, because from the second map,
bpf_insn_aux_data->map_ptr_state will be marked as *poison*. In fact
both map_ptr and state are needed to support this use case: map_ptr is
used by set_map_elem_callback_state() while poison state is needed to
determine whether to use direct call.

Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240405025536.18113-3-lulie@linux.alibaba.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-05 10:31:17 -07:00
Philo Lu
0a525621b7 bpf: store both map ptr and state in bpf_insn_aux_data
Currently, bpf_insn_aux_data->map_ptr_state is used to store either
map_ptr or its poison state (i.e., BPF_MAP_PTR_POISON). Thus
BPF_MAP_PTR_POISON must be checked before reading map_ptr. In certain
cases, we may need valid map_ptr even in case of poison state.
This will be explained in next patch with bpf_for_each_map_elem()
helper.

This patch changes map_ptr_state into a new struct including both map
pointer and its state (poison/unpriv). It's in the same union with
struct bpf_loop_inline_state, so there is no extra memory overhead.
Besides, macros BPF_MAP_PTR_UNPRIV/BPF_MAP_PTR_POISON/BPF_MAP_PTR are no
longer needed.

This patch does not change any existing functionality.

Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240405025536.18113-2-lulie@linux.alibaba.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-05 10:31:17 -07:00
Arnd Bergmann
58babe2718 bpf: fix perf_snapshot_branch_stack link failure
The newly added code to handle bpf_get_branch_snapshot fails to link when
CONFIG_PERF_EVENTS is disabled:

aarch64-linux-ld: kernel/bpf/verifier.o: in function `do_misc_fixups':
verifier.c:(.text+0x1090c): undefined reference to `__SCK__perf_snapshot_branch_stack'

Add a build-time check for that Kconfig symbol around the code to
remove the link time dependency.

Fixes: 314a53623c ("bpf: inline bpf_get_branch_snapshot() helper")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20240405142637.577046-1-arnd@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-05 08:39:15 -07:00
Andrii Nakryiko
1f2a74b41e bpf: prevent r10 register from being marked as precise
r10 is a special register that is not under BPF program's control and is
always effectively precise. The rest of precision logic assumes that
only r0-r9 SCALAR registers are marked as precise, so prevent r10 from
being marked precise.

This can happen due to signed cast instruction allowing to do something
like `r0 = (s8)r10;`, which later, if r0 needs to be precise, would lead
to an attempt to mark r10 as precise.

Prevent this with an extra check during instruction backtracking.

Fixes: 8100928c88 ("bpf: Support new sign-extension mov insns")
Reported-by: syzbot+148110ee7cf72f39f33e@syzkaller.appspotmail.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240404214536.3551295-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-04 18:31:08 -07:00
Jakub Kicinski
cf1ca1f66d Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

net/ipv4/ip_gre.c
  17af420545 ("erspan: make sure erspan_base_hdr is present in skb->head")
  5832c4a77d ("ip_tunnel: convert __be16 tunnel flags to bitmaps")
https://lore.kernel.org/all/20240402103253.3b54a1cf@canb.auug.org.au/

Adjacent changes:

net/ipv6/ip6_fib.c
  d21d40605b ("ipv6: Fix infinite recursion in fib6_dump_done().")
  5fc68320c1 ("ipv6: remove RTNL protection from inet6_dump_fib()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-04 18:01:07 -07:00
Linus Torvalds
c88b9b4cde Including fixes from netfilter, bluetooth and bpf.
Fairly usual collection of driver and core fixes. The large selftest
 accompanying one of the fixes is also becoming a common occurrence.
 
 Current release - regressions:
 
  - ipv6: fix infinite recursion in fib6_dump_done()
 
  - net/rds: fix possible null-deref in newly added error path
 
 Current release - new code bugs:
 
  - net: do not consume a full cacheline for system_page_pool
 
  - bpf: fix bpf_arena-related file descriptor leaks in the verifier
 
  - drv: ice: fix freeing uninitialized pointers, fixing misuse of
    the newfangled __free() auto-cleanup
 
 Previous releases - regressions:
 
  - x86/bpf: fixes the BPF JIT with retbleed=stuff
 
  - xen-netfront: add missing skb_mark_for_recycle, fix page pool
    accounting leaks, revealed by recently added explicit warning
 
  - tcp: fix bind() regression for v6-only wildcard and v4-mapped-v6
    non-wildcard addresses
 
  - Bluetooth:
    - replace "hci_qca: Set BDA quirk bit if fwnode exists in DT"
      with better workarounds to un-break some buggy Qualcomm devices
    - set conn encrypted before conn establishes, fix re-connecting
      to some headsets which use slightly unusual sequence of msgs
 
  - mptcp:
    - prevent BPF accessing lowat from a subflow socket
    - don't account accept() of non-MPC client as fallback to TCP
 
  - drv: mana: fix Rx DMA datasize and skb_over_panic
 
  - drv: i40e: fix VF MAC filter removal
 
 Previous releases - always broken:
 
  - gro: various fixes related to UDP tunnels - netns crossing problems,
    incorrect checksum conversions, and incorrect packet transformations
    which may lead to panics
 
  - bpf: support deferring bpf_link dealloc to after RCU grace period
 
  - nf_tables:
    - release batch on table validation from abort path
    - release mutex after nft_gc_seq_end from abort path
    - flush pending destroy work before exit_net release
 
  - drv: r8169: skip DASH fw status checks when DASH is disabled
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmYO91wACgkQMUZtbf5S
 IrvHBQ/+PH/hobI+o3aLqwtdVlyxhmA31bVQ0I3aTIZV7c3ideMBcfgYa8TiZM2g
 pLiBiWoJXCN0h33wgUmlUee+sBvpoPCdPjGD/g99OJyKWjVt2D7ObnSwxMfjHUoq
 dtcN2JupqHP0SHz6wPPCmnWtTLxSGUsDdKjmkHQcCRhQIGTYFkYyHcOmPgNbBjaB
 6jvmH1kE9WQTFD8QcOMaZmXQ5omoafpxxQLsgundtOWxPWHL7XNvk0B5k/ESDRG1
 ujbxwtNnOESzpxZMQ6OyZlsnN/1tWfnEvLJFYVwf9BMrOlahJT/f5b/EJ9/Xy4dC
 zkAp7Tul3uAvNRKhBNhVBTWQbnIykmiNMp1VBFmiScQAy8hcnX+6d4LKTIHxbXZK
 V3AqcUS6YU2nyMdLRkhvq9f3uxD6hcY19gQdyqgCUPOtyUAs/JPv7lXQjCuuEqkq
 urEZkigUApnEqPIrIqANJ7nXUy3U0K8qU6evOZoGZ5OdiKeNKC3+tIr+g2f1ZUZq
 a7Dkat7JH9WQ7IG8Geody6Z30K9EpSqYMTKzB5wTfmuqw6cV8bl9OAW9UOSRK0GL
 pyG8GwpkpFPkNiZdu9Zt44Pno5xdLIa1+C3QZR0r5CJWYAzCbI80MppP5veF9Mw+
 v+2v8iBWuh9iv0AUj9KJOwG5QQ+EXLUuSlhtx/DFnmn2CJ9plXI=
 =6bQI
 -----END PGP SIGNATURE-----

Merge tag 'net-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from netfilter, bluetooth and bpf.

  Fairly usual collection of driver and core fixes. The large selftest
  accompanying one of the fixes is also becoming a common occurrence.

  Current release - regressions:

   - ipv6: fix infinite recursion in fib6_dump_done()

   - net/rds: fix possible null-deref in newly added error path

  Current release - new code bugs:

   - net: do not consume a full cacheline for system_page_pool

   - bpf: fix bpf_arena-related file descriptor leaks in the verifier

   - drv: ice: fix freeing uninitialized pointers, fixing misuse of the
     newfangled __free() auto-cleanup

  Previous releases - regressions:

   - x86/bpf: fixes the BPF JIT with retbleed=stuff

   - xen-netfront: add missing skb_mark_for_recycle, fix page pool
     accounting leaks, revealed by recently added explicit warning

   - tcp: fix bind() regression for v6-only wildcard and v4-mapped-v6
     non-wildcard addresses

   - Bluetooth:
      - replace "hci_qca: Set BDA quirk bit if fwnode exists in DT" with
        better workarounds to un-break some buggy Qualcomm devices
      - set conn encrypted before conn establishes, fix re-connecting to
        some headsets which use slightly unusual sequence of msgs

   - mptcp:
      - prevent BPF accessing lowat from a subflow socket
      - don't account accept() of non-MPC client as fallback to TCP

   - drv: mana: fix Rx DMA datasize and skb_over_panic

   - drv: i40e: fix VF MAC filter removal

  Previous releases - always broken:

   - gro: various fixes related to UDP tunnels - netns crossing
     problems, incorrect checksum conversions, and incorrect packet
     transformations which may lead to panics

   - bpf: support deferring bpf_link dealloc to after RCU grace period

   - nf_tables:
      - release batch on table validation from abort path
      - release mutex after nft_gc_seq_end from abort path
      - flush pending destroy work before exit_net release

   - drv: r8169: skip DASH fw status checks when DASH is disabled"

* tag 'net-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
  netfilter: validate user input for expected length
  net/sched: act_skbmod: prevent kernel-infoleak
  net: usb: ax88179_178a: avoid the interface always configured as random address
  net: dsa: sja1105: Fix parameters order in sja1110_pcs_mdio_write_c45()
  net: ravb: Always update error counters
  net: ravb: Always process TX descriptor ring
  netfilter: nf_tables: discard table flag update with pending basechain deletion
  netfilter: nf_tables: Fix potential data-race in __nft_flowtable_type_get()
  netfilter: nf_tables: reject new basechain after table flag update
  netfilter: nf_tables: flush pending destroy work before exit_net release
  netfilter: nf_tables: release mutex after nft_gc_seq_end from abort path
  netfilter: nf_tables: release batch on table validation from abort path
  Revert "tg3: Remove residual error handling in tg3_suspend"
  tg3: Remove residual error handling in tg3_suspend
  net: mana: Fix Rx DMA datasize and skb_over_panic
  net/sched: fix lockdep splat in qdisc_tree_reduce_backlog()
  net: phy: micrel: lan8814: Fix when enabling/disabling 1-step timestamping
  net: stmmac: fix rx queue priority assignment
  net: txgbe: fix i2c dev name cannot match clkdev
  net: fec: Set mac_managed_pm during probe
  ...
2024-04-04 14:49:10 -07:00
Andrii Nakryiko
314a53623c bpf: inline bpf_get_branch_snapshot() helper
Inline bpf_get_branch_snapshot() helper using architecture-agnostic
inline BPF code which calls directly into underlying callback of
perf_snapshot_branch_stack static call. This callback is set early
during kernel initialization and is never updated or reset, so it's ok
to fetch actual implementation using static_call_query() and call
directly into it.

This change eliminates a full function call and saves one LBR entry
in PERF_SAMPLE_BRANCH_ANY LBR mode.

Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240404002640.1774210-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-04 13:08:01 -07:00
Alexei Starovoitov
af682b767a bpf: Optimize emit_mov_imm64().
Turned out that bpf prog callback addresses, bpf prog addresses
used in bpf_trampoline, and in other cases the 64-bit address
can be represented as sign extended 32-bit value.

According to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82339
"Skylake has 0.64c throughput for mov r64, imm64, vs. 0.25 for mov r32, imm32."
So use shorter encoding and faster instruction when possible.

Special care is needed in jit_subprogs(), since bpf_pseudo_func()
instruction cannot change its size during the last step of JIT.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/CAADnVQKFfpY-QZBrOU2CG8v2du8Lgyb7MNVmOZVK_yTyOdNbBA@mail.gmail.com
Link: https://lore.kernel.org/bpf/20240401233800.42737-1-alexei.starovoitov@gmail.com
2024-04-04 16:13:26 +02:00
Andrii Nakryiko
0b56e637f7 bpf: inline bpf_map_lookup_elem() helper for PERCPU_HASH map
Using new per-CPU BPF instruction, partially inline
bpf_map_lookup_elem() helper for per-CPU hashmap BPF map. Just like for
normal HASH map, we still generate a call into __htab_map_lookup_elem(),
but after that we resolve per-CPU element address using a new
instruction, saving on extra functions calls.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20240402021307.1012571-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-03 10:29:56 -07:00
Andrii Nakryiko
db69718b8e bpf: inline bpf_map_lookup_elem() for PERCPU_ARRAY maps
Using new per-CPU BPF instruction implement inlining for per-CPU ARRAY
map lookup helper, if BPF JIT support is present.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20240402021307.1012571-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-03 10:29:56 -07:00
Andrii Nakryiko
1ae6921009 bpf: inline bpf_get_smp_processor_id() helper
If BPF JIT supports per-CPU MOV instruction, inline bpf_get_smp_processor_id()
to eliminate unnecessary function calls.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20240402021307.1012571-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-03 10:29:56 -07:00
Andrii Nakryiko
7bdbf74463 bpf: add special internal-only MOV instruction to resolve per-CPU addrs
Add a new BPF instruction for resolving absolute addresses of per-CPU
data from their per-CPU offsets. This instruction is internal-only and
users are not allowed to use them directly. They will only be used for
internal inlining optimizations for now between BPF verifier and BPF JITs.

We use a special BPF_MOV | BPF_ALU64 | BPF_X form with insn->off field
set to BPF_ADDR_PERCPU = -1. I used negative offset value to distinguish
them from positive ones used by user-exposed instructions.

Such instruction performs a resolution of a per-CPU offset stored in
a register to a valid kernel address which can be dereferenced. It is
useful in any use case where absolute address of a per-CPU data has to
be resolved (e.g., in inlining bpf_map_lookup_elem()).

BPF disassembler is also taught to recognize them to support dumping
final BPF assembly code (non-JIT'ed version).

Add arch-specific way for BPF JITs to mark support for this instructions.

This patch also adds support for these instructions in x86-64 BPF JIT.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20240402021307.1012571-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-03 10:29:55 -07:00
Justin Stitt
2e114248e0 bpf: Replace deprecated strncpy with strscpy
strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.

bpf sym names get looked up and compared/cleaned with various string
apis. This suggests they need to be NUL-terminated (strncpy() suggests
this but does not guarantee it).

|	static int compare_symbol_name(const char *name, char *namebuf)
|	{
|		cleanup_symbol_name(namebuf);
|		return strcmp(name, namebuf);
|	}

|	static void cleanup_symbol_name(char *s)
|	{
|		...
|		res = strstr(s, ".llvm.");
|		...
|	}

Use strscpy() as this method guarantees NUL-termination on the
destination buffer.

This patch also replaces two uses of strncpy() used in log.c. These are
simple replacements as postfix has been zero-initialized on the stack
and has source arguments with a size less than the destination's size.

Note that this patch uses the new 2-argument version of strscpy
introduced in commit e6584c3964 ("string: Allow 2-argument strscpy()").

Signed-off-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
Link: https://github.com/KSPP/linux/issues/90
Link: https://lore.kernel.org/bpf/20240402-strncpy-kernel-bpf-core-c-v1-1-7cb07a426e78@google.com
2024-04-03 16:57:41 +02:00
Jose Fernandez
ce09cbdd98 bpf: Improve program stats run-time calculation
This patch improves the run-time calculation for program stats by
capturing the duration as soon as possible after the program returns.

Previously, the duration included u64_stats_t operations. While the
instrumentation overhead is part of the total time spent when stats are
enabled, distinguishing between the program's native execution time and
the time spent due to instrumentation is crucial for accurate
performance analysis.

By making this change, the patch facilitates more precise optimization
of BPF programs, enabling users to understand their performance in
environments without stats enabled.

I used a virtualized environment to measure the run-time over one minute
for a basic raw_tracepoint/sys_enter program, which just increments a
local counter. Although the virtualization introduced some performance
degradation that could affect the results, I observed approximately a
16% decrease in average run-time reported by stats with this change
(310 -> 260 nsec).

Signed-off-by: Jose Fernandez <josef@netflix.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240402034010.25060-1-josef@netflix.com
2024-04-02 16:51:15 +02:00
Anton Protopopov
9dc182c58b bpf: Add a verbose message if map limit is reached
When more than 64 maps are used by a program and its subprograms the
verifier returns -E2BIG. Add a verbose message which highlights the
source of the error and also print the actual limit.

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240402073347.195920-1-aspsk@isovalent.com
2024-04-02 16:12:00 +02:00
Linus Torvalds
7e40c2100c Kbuild fixes for v6.9
- Deduplicate Kconfig entries for CONFIG_CXL_PMU
 
  - Fix unselectable choice entry in MIPS Kconfig, and forbid this
    structure
 
  - Remove unused include/asm-generic/export.h
 
  - Fix a NULL pointer dereference bug in modpost
 
  - Enable -Woverride-init warning consistently with W=1
 
  - Drop KCSAN flags from *.mod.c files
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmYJVK0VHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsGf/sP/3GOk//cQGwPyWCgtCEUo6T4yyD7
 1m2TTR0JQk/lcohSFtYk0I20rhKRqU6yAMAERmyehI66D2QY7lhiYVc16ram5y04
 x0nWxd9IqerIlGJtaWePOvNqKdCw2EP9fS9NKz58rEDMGlsSf0Rd3NEdSsWoH8td
 dECtt8yCawENAMStb/rAfsnL6kn2JIhVMyqwo0RdQfiaVT5Zk6Qgpko0Oq0ncRP2
 qdNgHbvnJdKMy81FHSBAi0QEZOYvhFNX+E+6lFfWEsX6xT+wvXddCNQzJf/YV3Cw
 Klw1tGveV7UGzlZ4fsnFrv4V6g1KO2AD3342efdDo++ypBEBpImVODc+Rp0jE9Nk
 OgdOQRe2k9a5keH0LWY0ehvDbQlSbfNxk0wNtAfo5Kk5e41nHmHJBWCwGG+cXrjJ
 mPJjSrTpuNVSaGV0kt3EskHbDBeBmIIg+5QPbldmW2qcC88kWoavkyLD3WPFsg/a
 CAuR/HqH7MDfxzvsqTCjonlVcyDKX6aW66LrQ1NCtmphI4F8mdKp746CzGlziuIm
 gjYJL/UWVlx0VebMo8dwDpaHvez4/4s6xAJcyqtA+TS5HbrQWKQuwFkiv4iWQxNd
 MvyVdzgKhcMdoXhfFpUZ0LlFvHGefJ+Z6N1FQLoQJkTirt5aqRbEAjP0VXwQB4eH
 zYygkhvvtiH9/STu
 =tx+2
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-fixes-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

 - Deduplicate Kconfig entries for CONFIG_CXL_PMU

 - Fix unselectable choice entry in MIPS Kconfig, and forbid this
   structure

 - Remove unused include/asm-generic/export.h

 - Fix a NULL pointer dereference bug in modpost

 - Enable -Woverride-init warning consistently with W=1

 - Drop KCSAN flags from *.mod.c files

* tag 'kbuild-fixes-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kconfig: Fix typo HEIGTH to HEIGHT
  Documentation/llvm: Note s390 LLVM=1 support with LLVM 18.1.0 and newer
  kbuild: Disable KCSAN for autogenerated *.mod.c intermediaries
  kbuild: make -Woverride-init warnings more consistent
  modpost: do not make find_tosym() return NULL
  export.h: remove include/asm-generic/export.h
  kconfig: do not reparent the menu inside a choice block
  MIPS: move unselectable FIT_IMAGE_FDT_EPM5 out of the "System type" choice
  cxl: remove CONFIG_CXL_PMU entry in drivers/cxl/Kconfig
2024-03-31 11:23:51 -07:00
Arnd Bergmann
c40845e319 kbuild: make -Woverride-init warnings more consistent
The -Woverride-init warn about code that may be intentional or not,
but the inintentional ones tend to be real bugs, so there is a bit of
disagreement on whether this warning option should be enabled by default
and we have multiple settings in scripts/Makefile.extrawarn as well as
individual subsystems.

Older versions of clang only supported -Wno-initializer-overrides with
the same meaning as gcc's -Woverride-init, though all supported versions
now work with both. Because of this difference, an earlier cleanup of
mine accidentally turned the clang warning off for W=1 builds and only
left it on for W=2, while it's still enabled for gcc with W=1.

There is also one driver that only turns the warning off for newer
versions of gcc but not other compilers, and some but not all the
Makefiles still use a cc-disable-warning conditional that is no
longer needed with supported compilers here.

Address all of the above by removing the special cases for clang
and always turning the warning off unconditionally where it got
in the way, using the syntax that is supported by both compilers.

Fixes: 2cd3271b7a ("kbuild: avoid duplicate warning options")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Hamza Mahfooz <hamza.mahfooz@amd.com>
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Andrew Jeffery <andrew@codeconstruct.com.au>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-03-31 11:32:26 +09:00
Alexei Starovoitov
59f2f84117 bpf: Avoid kfree_rcu() under lock in bpf_lpm_trie.
syzbot reported the following lock sequence:
cpu 2:
  grabs timer_base lock
    spins on bpf_lpm lock

cpu 1:
  grab rcu krcp lock
    spins on timer_base lock

cpu 0:
  grab bpf_lpm lock
    spins on rcu krcp lock

bpf_lpm lock can be the same.
timer_base lock can also be the same due to timer migration.
but rcu krcp lock is always per-cpu, so it cannot be the same lock.
Hence it's a false positive.
To avoid lockdep complaining move kfree_rcu() after spin_unlock.

Reported-by: syzbot+1fa663a2100308ab6eab@syzkaller.appspotmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240329171439.37813-1-alexei.starovoitov@gmail.com
2024-03-29 11:10:41 -07:00
Anton Protopopov
6dae957c8e bpf: fix possible file descriptor leaks in verifier
The resolve_pseudo_ldimm64() function might have leaked file
descriptors when BPF_MAP_TYPE_ARENA was used in a program (some
error paths missed a corresponding fdput). Add missing fdputs.

v2:
  remove unrelated changes from the fix

Fixes: 6082b6c328 ("bpf: Recognize addr_space_cast instruction in the verifier.")
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20240329071106.67968-1-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-29 09:19:55 -07:00
Martin KaFai Lau
e8742081db bpf: Mark bpf prog stack with kmsan_unposion_memory in interpreter mode
syzbot reported uninit memory usages during map_{lookup,delete}_elem.

==========
BUG: KMSAN: uninit-value in __dev_map_lookup_elem kernel/bpf/devmap.c:441 [inline]
BUG: KMSAN: uninit-value in dev_map_lookup_elem+0xf3/0x170 kernel/bpf/devmap.c:796
__dev_map_lookup_elem kernel/bpf/devmap.c:441 [inline]
dev_map_lookup_elem+0xf3/0x170 kernel/bpf/devmap.c:796
____bpf_map_lookup_elem kernel/bpf/helpers.c:42 [inline]
bpf_map_lookup_elem+0x5c/0x80 kernel/bpf/helpers.c:38
___bpf_prog_run+0x13fe/0xe0f0 kernel/bpf/core.c:1997
__bpf_prog_run256+0xb5/0xe0 kernel/bpf/core.c:2237
==========

The reproducer should be in the interpreter mode.

The C reproducer is trying to run the following bpf prog:

    0: (18) r0 = 0x0
    2: (18) r1 = map[id:49]
    4: (b7) r8 = 16777216
    5: (7b) *(u64 *)(r10 -8) = r8
    6: (bf) r2 = r10
    7: (07) r2 += -229
            ^^^^^^^^^^

    8: (b7) r3 = 8
    9: (b7) r4 = 0
   10: (85) call dev_map_lookup_elem#1543472
   11: (95) exit

It is due to the "void *key" (r2) passed to the helper. bpf allows uninit
stack memory access for bpf prog with the right privileges. This patch
uses kmsan_unpoison_memory() to mark the stack as initialized.

This should address different syzbot reports on the uninit "void *key"
argument during map_{lookup,delete}_elem.

Reported-by: syzbot+603bcd9b0bf1d94dbb9b@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/000000000000f9ce6d061494e694@google.com/
Reported-by: syzbot+eb02dc7f03dce0ef39f3@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/000000000000a5c69c06147c2238@google.com/
Reported-by: syzbot+b4e65ca24fd4d0c734c3@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/000000000000ac56fb06143b6cfa@google.com/
Reported-by: syzbot+d2b113dc9fea5e1d2848@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/0000000000000d69b206142d1ff7@google.com/
Reported-by: syzbot+1a3cf6f08d68868f9db3@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/0000000000006f876b061478e878@google.com/
Tested-by: syzbot+1a3cf6f08d68868f9db3@syzkaller.appspotmail.com
Suggested-by: Yonghong Song <yonghong.song@linux.dev>
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240328185801.1843078-1-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-28 19:00:59 -07:00
Andrii Nakryiko
1a80dbcb2d bpf: support deferring bpf_link dealloc to after RCU grace period
BPF link for some program types is passed as a "context" which can be
used by those BPF programs to look up additional information. E.g., for
multi-kprobes and multi-uprobes, link is used to fetch BPF cookie values.

Because of this runtime dependency, when bpf_link refcnt drops to zero
there could still be active BPF programs running accessing link data.

This patch adds generic support to defer bpf_link dealloc callback to
after RCU GP, if requested. This is done by exposing two different
deallocation callbacks, one synchronous and one deferred. If deferred
one is provided, bpf_link_free() will schedule dealloc_deferred()
callback to happen after RCU GP.

BPF is using two flavors of RCU: "classic" non-sleepable one and RCU
tasks trace one. The latter is used when sleepable BPF programs are
used. bpf_link_free() accommodates that by checking underlying BPF
program's sleepable flag, and goes either through normal RCU GP only for
non-sleepable, or through RCU tasks trace GP *and* then normal RCU GP
(taking into account rcu_trace_implies_rcu_gp() optimization), if BPF
program is sleepable.

We use this for multi-kprobe and multi-uprobe links, which dereference
link during program run. We also preventively switch raw_tp link to use
deferred dealloc callback, as upcoming changes in bpf-next tree expose
raw_tp link data (specifically, cookie value) to BPF program at runtime
as well.

Fixes: 0dcac27254 ("bpf: Add multi kprobe link")
Fixes: 89ae89f53d ("bpf: Add multi uprobe link")
Reported-by: syzbot+981935d9485a560bfbcb@syzkaller.appspotmail.com
Reported-by: syzbot+2cb5a6c573e98db598cc@syzkaller.appspotmail.com
Reported-by: syzbot+62d8b26793e8a2bd0516@syzkaller.appspotmail.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20240328052426.3042617-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-28 18:47:45 -07:00
Yafang Shao
ee3bad033d bpf: Mitigate latency spikes associated with freeing non-preallocated htab
Following the recent upgrade of one of our BPF programs, we encountered
significant latency spikes affecting other applications running on the same
host. After thorough investigation, we identified that these spikes were
primarily caused by the prolonged duration required to free a
non-preallocated htab with approximately 2 million keys.

Notably, our kernel configuration lacks the presence of CONFIG_PREEMPT. In
scenarios where kernel execution extends excessively, other threads might
be starved of CPU time, resulting in latency issues across the system. To
mitigate this, we've adopted a proactive approach by incorporating
cond_resched() calls within the kernel code. This ensures that during
lengthy kernel operations, the scheduler is invoked periodically to provide
opportunities for other threads to execute.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240327032022.78391-1-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-28 18:31:40 -07:00
Andrii Nakryiko
3124591f68 bpf: add bpf_modify_return_test_tp() kfunc triggering tracepoint
Add a simple bpf_modify_return_test_tp() kfunc, available to all program
types, that is useful for various testing and benchmarking scenarios, as
it allows to trigger most tracing BPF program types from BPF side,
allowing to do complex testing and benchmarking scenarios.

It is also attachable to for fmod_ret programs, making it a good and
simple way to trigger fmod_ret program under test/benchmark.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240326162151.3981687-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-28 18:31:40 -07:00
Haiyue Wang
55fc888ded bpf,arena: Use helper sizeof_field in struct accessors
Use the well defined helper sizeof_field() to calculate the size of a
struct member, instead of doing custom calculations.

Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
Link: https://lore.kernel.org/r/20240327065334.8140-1-haiyue.wang@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-28 18:31:36 -07:00
Mykyta Yatsenko
786bf0e7e2 bpf: improve error message for unsupported helper
BPF verifier emits "unknown func" message when given BPF program type
does not support BPF helper. This message may be confusing for users, as
important context that helper is unknown only to current program type is
not provided.

This patch changes message to "program of this type cannot use helper "
and aligns dependent code in libbpf and tests. Any suggestions on
improving/changing this message are welcome.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Quentin Monnet <qmo@kernel.org>
Link: https://lore.kernel.org/r/20240325152210.377548-1-yatsenko@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-28 18:30:53 -07:00
Jakub Kicinski
5e47fbe5ce Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts, or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-28 17:25:57 -07:00
Andrei Matei
ecc6a21018 bpf: Protect against int overflow for stack access size
This patch re-introduces protection against the size of access to stack
memory being negative; the access size can appear negative as a result
of overflowing its signed int representation. This should not actually
happen, as there are other protections along the way, but we should
protect against it anyway. One code path was missing such protections
(fixed in the previous patch in the series), causing out-of-bounds array
accesses in check_stack_range_initialized(). This patch causes the
verification of a program with such a non-sensical access size to fail.

This check used to exist in a more indirect way, but was inadvertendly
removed in a833a17aea.

Fixes: a833a17aea ("bpf: Fix verification of indirect var-off stack access")
Reported-by: syzbot+33f4297b5f927648741a@syzkaller.appspotmail.com
Reported-by: syzbot+aafd0513053a1cbf52ef@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/CAADnVQLORV5PT0iTAhRER+iLBTkByCYNBYyvBSgjN1T31K+gOw@mail.gmail.com/
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
Link: https://lore.kernel.org/r/20240327024245.318299-3-andreimatei1@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-27 09:56:36 -07:00
Andrei Matei
a8d89feba7 bpf: Check bloom filter map value size
This patch adds a missing check to bloom filter creating, rejecting
values above KMALLOC_MAX_SIZE. This brings the bloom map in line with
many other map types.

The lack of this protection can cause kernel crashes for value sizes
that overflow int's. Such a crash was caught by syzkaller. The next
patch adds more guard-rails at a lower level.

Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240327024245.318299-2-andreimatei1@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-27 09:56:17 -07:00
Hari Bathini
96b98a6552 bpf: fix warning for crash_kexec
With [1], crash dump specific code is moved out of CONFIG_KEXEC_CORE
and placed under CONFIG_CRASH_DUMP, where it is more appropriate.
And since CONFIG_KEXEC & !CONFIG_CRASH_DUMP build option is supported
with that, it led to the below warning:

  "WARN: resolve_btfids: unresolved symbol crash_kexec"

Fix it by using the appropriate #ifdef.

[1] https://lore.kernel.org/all/20240124051254.67105-1-bhe@redhat.com/

Acked-by: Baoquan He <bhe@redhat.com>
Fixes: 02aff84805 ("crash: split crash dumping code out from kexec_core.c")
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Link: https://lore.kernel.org/r/20240319080152.36987-1-hbathini@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-27 08:52:24 -07:00
Jakub Kicinski
2a702c2e57 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZgHylwAKCRDbK58LschI
 gzmaAPwKhDFFSU/DU08k22muJxLIXVR7Xx04baJ9mPiFrqZyyAEA8RFNamC7wZIB
 AnfwwoDjfDTP60rlXFaEf8UT5PpA7Ao=
 =/KF6
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2024-03-25

We've added 38 non-merge commits during the last 13 day(s) which contain
a total of 50 files changed, 867 insertions(+), 274 deletions(-).

The main changes are:

1) Add the ability to specify and retrieve BPF cookie also for raw
   tracepoint programs in order to ease migration from classic to raw
   tracepoints, from Andrii Nakryiko.

2) Allow the use of bpf_get_{ns_,}current_pid_tgid() helper for all
   program types and add additional BPF selftests, from Yonghong Song.

3) Several improvements to bpftool and its build, for example, enabling
   libbpf logs when loading pid_iter in debug mode, from Quentin Monnet.

4) Check the return code of all BPF-related set_memory_*() functions during
   load and bail out in case they fail, from Christophe Leroy.

5) Avoid a goto in regs_refine_cond_op() such that the verifier can
   be better integrated into Agni tool which doesn't support backedges
   yet, from Harishankar Vishwanathan.

6) Add a small BPF trie perf improvement by always inlining
   longest_prefix_match, from Jesper Dangaard Brouer.

7) Small BPF selftest refactor in bpf_tcp_ca.c to utilize start_server()
   helper instead of open-coding it, from Geliang Tang.

8) Improve test_tc_tunnel.sh BPF selftest to prevent client connect
   before the server bind, from Alessandro Carminati.

9) Fix BPF selftest benchmark for older glibc and use syscall(SYS_gettid)
   instead of gettid(), from Alan Maguire.

10) Implement a backward-compatible method for struct_ops types with
    additional fields which are not present in older kernels,
    from Kui-Feng Lee.

11) Add a small helper to check if an instruction is addr_space_cast
    from as(0) to as(1) and utilize it in x86-64 JIT, from Puranjay Mohan.

12) Small cleanup to remove unnecessary error check in
    bpf_struct_ops_map_update_elem, from Martin KaFai Lau.

13) Improvements to libbpf fd validity checks for BPF map/programs,
    from Mykyta Yatsenko.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (38 commits)
  selftests/bpf: Fix flaky test btf_map_in_map/lookup_update
  bpf: implement insn_is_cast_user() helper for JITs
  bpf: Avoid get_kernel_nofault() to fetch kprobe entry IP
  selftests/bpf: Use start_server in bpf_tcp_ca
  bpf: Sync uapi bpf.h to tools directory
  libbpf: Add new sec_def "sk_skb/verdict"
  selftests/bpf: Mark uprobe trigger functions with nocf_check attribute
  selftests/bpf: Use syscall(SYS_gettid) instead of gettid() wrapper in bench
  bpf-next: Avoid goto in regs_refine_cond_op()
  bpftool: Clean up HOST_CFLAGS, HOST_LDFLAGS for bootstrap bpftool
  selftests/bpf: scale benchmark counting by using per-CPU counters
  bpftool: Remove unnecessary source files from bootstrap version
  bpftool: Enable libbpf logs when loading pid_iter in debug mode
  selftests/bpf: add raw_tp/tp_btf BPF cookie subtests
  libbpf: add support for BPF cookie for raw_tp/tp_btf programs
  bpf: support BPF cookie in raw tracepoint (raw_tp, tp_btf) programs
  bpf: pass whole link instead of prog when triggering raw tracepoint
  bpf: flatten bpf_probe_register call chain
  selftests/bpf: Prevent client connect before server bind in test_tc_tunnel.sh
  selftests/bpf: Add a sk_msg prog bpf_get_ns_current_pid_tgid() test
  ...
====================

Link: https://lore.kernel.org/r/20240325233940.7154-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-27 07:52:34 -07:00
Paolo Abeni
37ccdf7f11 bpf-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZgHmTAAKCRDbK58LschI
 g1gWAP9HjAWE/Sy0B2t9opIiTqRzdMJLYs2B4OFeHRI6+qQg0gD6A4jsKEh/xmtG
 Hhjw+AElJRFZ3SUIT4mZlljzUHIYYAA=
 =T0lM
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2024-03-25

The following pull-request contains BPF updates for your *net* tree.

We've added 17 non-merge commits during the last 12 day(s) which contain
a total of 19 files changed, 184 insertions(+), 61 deletions(-).

The main changes are:

1) Fix an arm64 BPF JIT bug in BPF_LDX_MEMSX implementation's offset handling
   found via test_bpf module, from Puranjay Mohan.

2) Various fixups to the BPF arena code in particular in the BPF verifier and
   around BPF selftests to match latest corresponding LLVM implementation,
   from Puranjay Mohan and Alexei Starovoitov.

3) Fix xsk to not assume that metadata is always requested in TX completion,
   from Stanislav Fomichev.

4) Fix riscv BPF JIT's kfunc parameter incompatibility between BPF and the riscv
   ABI which requires sign-extension on int/uint, from Pu Lehui.

5) Fix s390x BPF JIT's bpf_plt pointer arithmetic which triggered a crash when
   testing struct_ops, from Ilya Leoshkevich.

6) Fix libbpf's arena mmap handling which had incorrect u64-to-pointer cast on
   32-bit architectures, from Andrii Nakryiko.

7) Fix libbpf to define MFD_CLOEXEC when not available, from Arnaldo Carvalho de Melo.

8) Fix arm64 BPF JIT implementation for 32bit unconditional bswap which
   resulted in an incorrect swap as indicated by test_bpf, from Artem Savkov.

9) Fix BPF man page build script to use silent mode, from Hangbin Liu.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  riscv, bpf: Fix kfunc parameters incompatibility between bpf and riscv abi
  bpf: verifier: reject addr_space_cast insn without arena
  selftests/bpf: verifier_arena: fix mmap address for arm64
  bpf: verifier: fix addr_space_cast from as(1) to as(0)
  libbpf: Define MFD_CLOEXEC if not available
  arm64: bpf: fix 32bit unconditional bswap
  bpf, arm64: fix bug in BPF_LDX_MEMSX
  libbpf: fix u64-to-pointer cast on 32-bit arches
  s390/bpf: Fix bpf_plt pointer arithmetic
  xsk: Don't assume metadata is always requested in TX completion
  selftests/bpf: Add arena test case for 4Gbyte corner case
  selftests/bpf: Remove hard coded PAGE_SIZE macro.
  libbpf, selftests/bpf: Adjust libbpf, bpftool, selftests to match LLVM
  bpf: Clarify bpf_arena comments.
  MAINTAINERS: Update email address for Quentin Monnet
  scripts/bpf_doc: Use silent mode when exec make cmd
  bpf: Temporarily disable atomic operations in BPF arena
====================

Link: https://lore.kernel.org/r/20240325213520.26688-1-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-03-26 12:55:18 +01:00
Puranjay Mohan
122fdbd2a0 bpf: verifier: reject addr_space_cast insn without arena
The verifier allows using the addr_space_cast instruction in a program
that doesn't have an associated arena. This was caught in the form an
invalid memory access in do_misc_fixups() when while converting
addr_space_cast to a normal 32-bit mov, env->prog->aux->arena was
dereferenced to check for BPF_F_NO_USER_CONV flag.

Reject programs that include the addr_space_cast instruction but don't
have an associated arena.

root@rv-tester:~# ./reproducer
 Unable to handle kernel access to user memory without uaccess routines at virtual address 0000000000000030
 Oops [#1]
 [<ffffffff8017eeaa>] do_misc_fixups+0x43c/0x1168
 [<ffffffff801936d6>] bpf_check+0xda8/0x22b6
 [<ffffffff80174b32>] bpf_prog_load+0x486/0x8dc
 [<ffffffff80176566>] __sys_bpf+0xbd8/0x214e
 [<ffffffff80177d14>] __riscv_sys_bpf+0x22/0x2a
 [<ffffffff80d2493a>] do_trap_ecall_u+0x102/0x17c
 [<ffffffff80d3048c>] ret_from_exception+0x0/0x64

Fixes: 6082b6c328 ("bpf: Recognize addr_space_cast instruction in the verifier.")
Reported-by: xingwei lee <xrivendell7@gmail.com>
Reported-by: yue sun <samsun1006219@gmail.com>
Closes: https://lore.kernel.org/bpf/CABOYnLz09O1+2gGVJuCxd_24a-7UueXzV-Ff+Fr+h5EKFDiYCQ@mail.gmail.com/
Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Link: https://lore.kernel.org/r/20240322153518.11555-1-puranjay12@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-22 20:44:09 -07:00
Puranjay Mohan
f7f5d1808b bpf: verifier: fix addr_space_cast from as(1) to as(0)
The verifier currently converts addr_space_cast from as(1) to as(0) that
is: BPF_ALU64 | BPF_MOV | BPF_X with off=1 and imm=1
to
BPF_ALU | BPF_MOV | BPF_X with imm=1 (32-bit mov)

Because of this imm=1, the JITs that have bpf_jit_needs_zext() == true,
interpret the converted instruction as BPF_ZEXT_REG(DST) which is a
special form of mov32, used for doing explicit zero extension on dst.
These JITs will just zero extend the dst reg and will not move the src to
dst before the zext.

Fix do_misc_fixups() to set imm=0 when converting addr_space_cast to a
normal mov32.

The JITs that have bpf_jit_needs_zext() == true rely on the verifier to
emit zext instructions. Mark dst_reg as subreg when doing cast from
as(1) to as(0) so the verifier emits a zext instruction after the mov.

Fixes: 6082b6c328 ("bpf: Recognize addr_space_cast instruction in the verifier.")
Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Link: https://lore.kernel.org/r/20240321153939.113996-1-puranjay12@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-22 20:36:36 -07:00
Harishankar Vishwanathan
4c2a26fc80 bpf-next: Avoid goto in regs_refine_cond_op()
In case of GE/GT/SGE/JST instructions, regs_refine_cond_op()
reuses the logic that does analysis of LE/LT/SLE/SLT instructions.
This commit avoids the use of a goto to perform the reuse.

Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240321002955.808604-1-harishankar.vishwanathan@gmail.com
2024-03-21 11:56:26 -07:00
Yan Zhai
00bf631224 bpf: report RCU QS in cpumap kthread
When there are heavy load, cpumap kernel threads can be busy polling
packets from redirect queues and block out RCU tasks from reaching
quiescent states. It is insufficient to just call cond_resched() in such
context. Periodically raise a consolidated RCU QS before cond_resched
fixes the problem.

Fixes: 6710e11269 ("bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP")
Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/c17b9f1517e19d813da3ede5ed33ee18496bb5d8.1710877680.git.yan@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-20 21:05:43 -07:00
Andrii Nakryiko
68ca5d4eeb bpf: support BPF cookie in raw tracepoint (raw_tp, tp_btf) programs
Wire up BPF cookie for raw tracepoint programs (both BTF and non-BTF
aware variants). This brings them up to part w.r.t. BPF cookie usage
with classic tracepoint and fentry/fexit programs.

Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Message-ID: <20240319233852.1977493-4-andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-19 23:05:34 -07:00
Andrii Nakryiko
d4dfc5700e bpf: pass whole link instead of prog when triggering raw tracepoint
Instead of passing prog as an argument to bpf_trace_runX() helpers, that
are called from tracepoint triggering calls, store BPF link itself
(struct bpf_raw_tp_link for raw tracepoints). This will allow to pass
extra information like BPF cookie into raw tracepoint registration.

Instead of replacing `struct bpf_prog *prog = __data;` with
corresponding `struct bpf_raw_tp_link *link = __data;` assignment in
`__bpf_trace_##call` I just passed `__data` through into underlying
bpf_trace_runX() call. This works well because we implicitly cast `void *`,
and it also avoids naming clashes with arguments coming from
tracepoint's "proto" list. We could have run into the same problem with
"prog", we just happened to not have a tracepoint that has "prog" input
argument. We are less lucky with "link", as there are tracepoints using
"link" argument name already. So instead of trying to avoid naming
conflicts, let's just remove intermediate local variable. It doesn't
hurt readibility, it's either way a bit of a maze of calls and macros,
that requires careful reading.

Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Message-ID: <20240319233852.1977493-3-andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-19 23:05:33 -07:00
Yonghong Song
eb166e522c bpf: Allow helper bpf_get_[ns_]current_pid_tgid() for all prog types
Currently bpf_get_current_pid_tgid() is allowed in tracing, cgroup
and sk_msg progs while bpf_get_ns_current_pid_tgid() is only allowed
in tracing progs.

We have an internal use case where for an application running
in a container (with pid namespace), user wants to get
the pid associated with the pid namespace in a cgroup bpf
program. Currently, cgroup bpf progs already allow
bpf_get_current_pid_tgid(). Let us allow bpf_get_ns_current_pid_tgid()
as well.

With auditing the code, bpf_get_current_pid_tgid() is also used
by sk_msg prog. But there are no side effect to expose these two
helpers to all prog types since they do not reveal any kernel specific
data. The detailed discussion is in [1].

So with this patch, both bpf_get_current_pid_tgid() and bpf_get_ns_current_pid_tgid()
are put in bpf_base_func_proto(), making them available to all
program types.

  [1] https://lore.kernel.org/bpf/20240307232659.1115872-1-yonghong.song@linux.dev/

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20240315184854.2975190-1-yonghong.song@linux.dev
2024-03-19 14:24:07 -07:00
Jesper Dangaard Brouer
1a4a0cb798 bpf/lpm_trie: Inline longest_prefix_match for fastpath
The BPF map type LPM (Longest Prefix Match) is used heavily
in production by multiple products that have BPF components.
Perf data shows trie_lookup_elem() and longest_prefix_match()
being part of kernels perf top.

For every level in the LPM tree trie_lookup_elem() calls out
to longest_prefix_match().  The compiler is free to inline this
call, but chooses not to inline, because other slowpath callers
(that can be invoked via syscall) exists like trie_update_elem(),
trie_delete_elem() or trie_get_next_key().

 bcc/tools/funccount -Ti 1 'trie_lookup_elem|longest_prefix_match.isra.0'
 FUNC                                    COUNT
 trie_lookup_elem                       664945
 longest_prefix_match.isra.0           8101507

Observation on a single random machine shows a factor 12 between
the two functions. Given an average of 12 levels in the trie being
searched.

This patch force inlining longest_prefix_match(), but only for
the lookup fastpath to balance object instruction size.

In production with AMD CPUs, measuring the function latency of
'trie_lookup_elem' (bcc/tools/funclatency) we are seeing an improvement
function latency reduction 7-8% with this patch applied (to production
kernels 6.6 and 6.1). Analyzing perf data, we can explain this rather
large improvement due to reducing the overhead for AMD side-channel
mitigation SRSO (Speculative Return Stack Overflow).

Fixes: fb3bd914b3 ("x86/srso: Add a Speculative RAS Overflow mitigation")
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/171076828575.2141737.18370644069389889027.stgit@firesoul
2024-03-19 13:46:03 +01:00
Christophe Leroy
c733239f8f bpf: Check return from set_memory_rox()
arch_protect_bpf_trampoline() and alloc_new_pack() call
set_memory_rox() which can fail, leading to unprotected memory.

Take into account return from set_memory_rox() function and add
__must_check flag to arch_protect_bpf_trampoline().

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/fe1c163c83767fde5cab31d209a4a6be3ddb3a73.1710574353.git.christophe.leroy@csgroup.eu
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-03-18 14:18:47 -07:00
Christophe Leroy
e3362acd79 bpf: Remove arch_unprotect_bpf_trampoline()
Last user of arch_unprotect_bpf_trampoline() was removed by
commit 187e2af05a ("bpf: struct_ops supports more than one page for
trampolines.")

Remove arch_unprotect_bpf_trampoline()

Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Fixes: 187e2af05a ("bpf: struct_ops supports more than one page for trampolines.")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Link: https://lore.kernel.org/r/42c635bb54d3af91db0f9b85d724c7c290069f67.1710574353.git.christophe.leroy@csgroup.eu
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-03-18 14:18:47 -07:00
Martin KaFai Lau
7f3edd0c72 bpf: Remove unnecessary err < 0 check in bpf_struct_ops_map_update_elem
There is a "if (err)" check earlier, so the "if (err < 0)"
check that this patch removing is unnecessary. It was my overlook
when making adjustments to the bpf_struct_ops_prepare_trampoline()
such that the caller does not have to worry about the new page when
the function returns error.

Fixes: 187e2af05a ("bpf: struct_ops supports more than one page for trampolines.")
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20240315192112.2825039-1-martin.lau@linux.dev
2024-03-18 12:48:10 -07:00
Alexei Starovoitov
ee498a38f3 bpf: Clarify bpf_arena comments.
Clarify two bpf_arena comments, use existing SZ_4G #define,
improve page_cnt check.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20240315021834.62988-2-alexei.starovoitov@gmail.com
2024-03-15 14:23:45 -07:00
Christophe Leroy
7d2cc63eca bpf: Take return from set_memory_ro() into account with bpf_prog_lock_ro()
set_memory_ro() can fail, leaving memory unprotected.

Check its return and take it into account as an error.

Link: https://github.com/KSPP/linux/issues/7
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: linux-hardening@vger.kernel.org <linux-hardening@vger.kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Message-ID: <286def78955e04382b227cb3e4b6ba272a7442e3.1709850515.git.christophe.leroy@csgroup.eu>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-14 19:28:52 -07:00
Andrii Nakryiko
4d8926a040 bpf: preserve sleepable bit in subprog info
Copy over main program's sleepable bit into subprog's info. This might
be important for, e.g., freplace cases.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Message-ID: <20240314000127.3881569-1-andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-14 19:28:16 -07:00
Puranjay Mohan
44d79142ed bpf: Temporarily disable atomic operations in BPF arena
Currently, the x86 JIT handling PROBE_MEM32 tagged accesses is not
equipped to handle atomic accesses into PTR_TO_ARENA, as no PROBE_MEM32
tagging is performed and no handling is enabled for them.

This will lead to unsafety as the offset into arena will dereferenced
directly without turning it into a base + offset access into the arena
region.

Since the changes to the x86 JIT will be fairly involved, for now,
temporarily disallow use of PTR_TO_ARENA as the destination operand for
atomics until support is added to the JIT backend.

Fixes: 2fe99eb0cc ("bpf: Add x86-64 JIT support for PROBE_MEM32 pseudo instructions.")
Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Message-ID: <20240314174931.98702-1-puranjay12@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-14 12:04:45 -07:00
Jakub Kicinski
5f20e6ab1f for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmXvm7IACgkQ6rmadz2v
 bTqdMA//VMHNHVLb4oROoXyQD9fw2mCmIUEKzP88RXfqcxsfEX7HF+k8B5ZTk0ro
 CHXTAnc79+Qqg0j24bkQKxup/fKBQVw9D+Ia4b3ytlm1I2MtyU/16xNEzVhAPU2D
 iKk6mVBsEdCbt/GjpWORy/VVnZlZpC7BOpZLxsbbxgXOndnCegyjXzSnLGJGxdvi
 zkrQTn2SrFzLi6aNpVLqrv6Nks6HJusfCKsIrtlbkQ85dulasHOtwK9s6GF60nte
 aaho+MPx3L+lWEgapsm8rR779pHaYIB/GbZUgEPxE/xUJ/V8BzDgFNLMzEiIBRMN
 a0zZam11BkBzCfcO9gkvDRByaei/dZz2jdqfU4GlHklFj1WFfz8Q7fRLEPINksvj
 WXLgJADGY5mtGbjG21FScThxzj+Ruqwx0a13ddlyI/W+P3y5yzSWsLwJG5F9p0oU
 6nlkJ4U8yg+9E1ie5ae0TibqvRJzXPjfOERZGwYDSVvfQGzv1z+DGSOPMmgNcWYM
 dIaO+A/+NS3zdbk8+1PP2SBbhHPk6kWyCUByWc7wMzCPTiwriFGY/DD2sN+Fsufo
 zorzfikUQOlTfzzD5jbmT49U8hUQUf6QIWsu7BijSiHaaC7am4S8QB2O6ibJMqdv
 yNiwvuX+ThgVIY3QKrLLqL0KPGeKMR5mtfq6rrwSpfp/b4g27FE=
 =eFgA
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Alexei Starovoitov says:

====================
pull-request: bpf-next 2024-03-11

We've added 59 non-merge commits during the last 9 day(s) which contain
a total of 88 files changed, 4181 insertions(+), 590 deletions(-).

The main changes are:

1) Enforce VM_IOREMAP flag and range in ioremap_page_range and introduce
   VM_SPARSE kind and vm_area_[un]map_pages to be used in bpf_arena,
   from Alexei.

2) Introduce bpf_arena which is sparse shared memory region between bpf
   program and user space where structures inside the arena can have
   pointers to other areas of the arena, and pointers work seamlessly for
   both user-space programs and bpf programs, from Alexei and Andrii.

3) Introduce may_goto instruction that is a contract between the verifier
   and the program. The verifier allows the program to loop assuming it's
   behaving well, but reserves the right to terminate it, from Alexei.

4) Use IETF format for field definitions in the BPF standard
   document, from Dave.

5) Extend struct_ops libbpf APIs to allow specify version suffixes for
   stuct_ops map types, share the same BPF program between several map
   definitions, and other improvements, from Eduard.

6) Enable struct_ops support for more than one page in trampolines,
   from Kui-Feng.

7) Support kCFI + BPF on riscv64, from Puranjay.

8) Use bpf_prog_pack for arm64 bpf trampoline, from Puranjay.

9) Fix roundup_pow_of_two undefined behavior on 32-bit archs, from Toke.
====================

Link: https://lore.kernel.org/r/20240312003646.8692-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11 18:06:04 -07:00
Andrii Nakryiko
66c8473135 bpf: move sleepable flag from bpf_prog_aux to bpf_prog
prog->aux->sleepable is checked very frequently as part of (some) BPF
program run hot paths. So this extra aux indirection seems wasteful and
on busy systems might cause unnecessary memory cache misses.

Let's move sleepable flag into prog itself to eliminate unnecessary
pointer dereference.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Message-ID: <20240309004739.2961431-1-andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-11 16:41:25 -07:00
Puranjay Mohan
d6170e4aaf bpf: hardcode BPF_PROG_PACK_SIZE to 2MB * num_possible_nodes()
On some architectures like ARM64, PMD_SIZE can be really large in some
configurations. Like with CONFIG_ARM64_64K_PAGES=y the PMD_SIZE is
512MB.

Use 2MB * num_possible_nodes() as the size for allocations done through
the prog pack allocator. On most architectures, PMD_SIZE will be equal
to 2MB in case of 4KB pages and will be greater than 2MB for bigger page
sizes.

Fixes: ea2babac63 ("bpf: Simplify bpf_prog_pack_[size|mask]")
Reported-by: "kernelci.org bot" <bot@kernelci.org>
Closes: https://lore.kernel.org/all/7e216c88-77ee-47b8-becc-a0f780868d3c@sirena.org.uk/
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202403092219.dhgcuz2G-lkp@intel.com/
Suggested-by: Song Liu <song@kernel.org>
Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Message-ID: <20240311122722.86232-1-puranjay12@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-11 16:33:30 -07:00
Alexei Starovoitov
2edc3de6fb bpf: Recognize btf_decl_tag("arg: Arena") as PTR_TO_ARENA.
In global bpf functions recognize btf_decl_tag("arg:arena") as PTR_TO_ARENA.

Note, when the verifier sees:

__weak void foo(struct bar *p)

it recognizes 'p' as PTR_TO_MEM and 'struct bar' has to be a struct with scalars.
Hence the only way to use arena pointers in global functions is to tag them with "arg:arena".

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/20240308010812.89848-7-alexei.starovoitov@gmail.com
2024-03-11 15:37:24 -07:00
Alexei Starovoitov
6082b6c328 bpf: Recognize addr_space_cast instruction in the verifier.
rY = addr_space_cast(rX, 0, 1) tells the verifier that rY->type = PTR_TO_ARENA.
Any further operations on PTR_TO_ARENA register have to be in 32-bit domain.

The verifier will mark load/store through PTR_TO_ARENA with PROBE_MEM32.
JIT will generate them as kern_vm_start + 32bit_addr memory accesses.

rY = addr_space_cast(rX, 1, 0) tells the verifier that rY->type = unknown scalar.
If arena->map_flags has BPF_F_NO_USER_CONV set then convert cast_user to mov32 as well.
Otherwise JIT will convert it to:
  rY = (u32)rX;
  if (rY)
     rY |= arena->user_vm_start & ~(u64)~0U;

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240308010812.89848-6-alexei.starovoitov@gmail.com
2024-03-11 15:37:24 -07:00
Alexei Starovoitov
142fd4d2dc bpf: Add x86-64 JIT support for bpf_addr_space_cast instruction.
LLVM generates bpf_addr_space_cast instruction while translating
pointers between native (zero) address space and
__attribute__((address_space(N))).
The addr_space=1 is reserved as bpf_arena address space.

rY = addr_space_cast(rX, 0, 1) is processed by the verifier and
converted to normal 32-bit move: wX = wY

rY = addr_space_cast(rX, 1, 0) has to be converted by JIT:

aux_reg = upper_32_bits of arena->user_vm_start
aux_reg <<= 32
wX = wY // clear upper 32 bits of dst register
if (wX) // if not zero add upper bits of user_vm_start
  wX |= aux_reg

JIT can do it more efficiently:

mov dst_reg32, src_reg32  // 32-bit move
shl dst_reg, 32
or dst_reg, user_vm_start
rol dst_reg, 32
xor r11, r11
test dst_reg32, dst_reg32 // check if lower 32-bit are zero
cmove r11, dst_reg	  // if so, set dst_reg to zero
			  // Intel swapped src/dst register encoding in CMOVcc

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240308010812.89848-5-alexei.starovoitov@gmail.com
2024-03-11 15:37:24 -07:00
Alexei Starovoitov
667a86ad9b bpf: Disasm support for addr_space_cast instruction.
LLVM generates rX = addr_space_cast(rY, dst_addr_space, src_addr_space)
instruction when pointers in non-zero address space are used by the bpf
program. Recognize this insn in uapi and in bpf disassembler.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/20240308010812.89848-3-alexei.starovoitov@gmail.com
2024-03-11 15:37:24 -07:00
Alexei Starovoitov
317460317a bpf: Introduce bpf_arena.
Introduce bpf_arena, which is a sparse shared memory region between the bpf
program and user space.

Use cases:
1. User space mmap-s bpf_arena and uses it as a traditional mmap-ed
   anonymous region, like memcached or any key/value storage. The bpf
   program implements an in-kernel accelerator. XDP prog can search for
   a key in bpf_arena and return a value without going to user space.
2. The bpf program builds arbitrary data structures in bpf_arena (hash
   tables, rb-trees, sparse arrays), while user space consumes it.
3. bpf_arena is a "heap" of memory from the bpf program's point of view.
   The user space may mmap it, but bpf program will not convert pointers
   to user base at run-time to improve bpf program speed.

Initially, the kernel vm_area and user vma are not populated. User space
can fault in pages within the range. While servicing a page fault,
bpf_arena logic will insert a new page into the kernel and user vmas. The
bpf program can allocate pages from that region via
bpf_arena_alloc_pages(). This kernel function will insert pages into the
kernel vm_area. The subsequent fault-in from user space will populate that
page into the user vma. The BPF_F_SEGV_ON_FAULT flag at arena creation time
can be used to prevent fault-in from user space. In such a case, if a page
is not allocated by the bpf program and not present in the kernel vm_area,
the user process will segfault. This is useful for use cases 2 and 3 above.

bpf_arena_alloc_pages() is similar to user space mmap(). It allocates pages
either at a specific address within the arena or allocates a range with the
maple tree. bpf_arena_free_pages() is analogous to munmap(), which frees
pages and removes the range from the kernel vm_area and from user process
vmas.

bpf_arena can be used as a bpf program "heap" of up to 4GB. The speed of
bpf program is more important than ease of sharing with user space. This is
use case 3. In such a case, the BPF_F_NO_USER_CONV flag is recommended.
It will tell the verifier to treat the rX = bpf_arena_cast_user(rY)
instruction as a 32-bit move wX = wY, which will improve bpf prog
performance. Otherwise, bpf_arena_cast_user is translated by JIT to
conditionally add the upper 32 bits of user vm_start (if the pointer is not
NULL) to arena pointers before they are stored into memory. This way, user
space sees them as valid 64-bit pointers.

Diff https://github.com/llvm/llvm-project/pull/84410 enables LLVM BPF
backend generate the bpf_addr_space_cast() instruction to cast pointers
between address_space(1) which is reserved for bpf_arena pointers and
default address space zero. All arena pointers in a bpf program written in
C language are tagged as __attribute__((address_space(1))). Hence, clang
provides helpful diagnostics when pointers cross address space. Libbpf and
the kernel support only address_space == 1. All other address space
identifiers are reserved.

rX = bpf_addr_space_cast(rY, /* dst_as */ 1, /* src_as */ 0) tells the
verifier that rX->type = PTR_TO_ARENA. Any further operations on
PTR_TO_ARENA register have to be in the 32-bit domain. The verifier will
mark load/store through PTR_TO_ARENA with PROBE_MEM32. JIT will generate
them as kern_vm_start + 32bit_addr memory accesses. The behavior is similar
to copy_from_kernel_nofault() except that no address checks are necessary.
The address is guaranteed to be in the 4GB range. If the page is not
present, the destination register is zeroed on read, and the operation is
ignored on write.

rX = bpf_addr_space_cast(rY, 0, 1) tells the verifier that rX->type =
unknown scalar. If arena->map_flags has BPF_F_NO_USER_CONV set, then the
verifier converts such cast instructions to mov32. Otherwise, JIT will emit
native code equivalent to:
rX = (u32)rY;
if (rY)
  rX |= clear_lo32_bits(arena->user_vm_start); /* replace hi32 bits in rX */

After such conversion, the pointer becomes a valid user pointer within
bpf_arena range. The user process can access data structures created in
bpf_arena without any additional computations. For example, a linked list
built by a bpf program can be walked natively by user space.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Barret Rhoden <brho@google.com>
Link: https://lore.kernel.org/bpf/20240308010812.89848-2-alexei.starovoitov@gmail.com
2024-03-11 15:37:23 -07:00
Eric Dumazet
aa70d2d16f net: move skbuff_cache(s) to net_hotdata
skbuff_cache, skbuff_fclone_cache and skb_small_head_cache
are used in rx/tx fast paths.

Move them to net_hotdata for better cache locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-11-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07 21:12:42 -08:00
Toke Høiland-Jørgensen
7a4b21250b bpf: Fix stackmap overflow check on 32-bit arches
The stackmap code relies on roundup_pow_of_two() to compute the number
of hash buckets, and contains an overflow check by checking if the
resulting value is 0. However, on 32-bit arches, the roundup code itself
can overflow by doing a 32-bit left-shift of an unsigned long value,
which is undefined behaviour, so it is not guaranteed to truncate
neatly. This was triggered by syzbot on the DEVMAP_HASH type, which
contains the same check, copied from the hashtab code.

The commit in the fixes tag actually attempted to fix this, but the fix
did not account for the UB, so the fix only works on CPUs where an
overflow does result in a neat truncation to zero, which is not
guaranteed. Checking the value before rounding does not have this
problem.

Fixes: 6183f4d3a0 ("bpf: Check for integer overflow when using roundup_pow_of_two()")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Bui Quang Minh <minhquangbui99@gmail.com>
Message-ID: <20240307120340.99577-4-toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-07 20:06:25 -08:00
Toke Høiland-Jørgensen
6787d916c2 bpf: Fix hashtab overflow check on 32-bit arches
The hashtab code relies on roundup_pow_of_two() to compute the number of
hash buckets, and contains an overflow check by checking if the
resulting value is 0. However, on 32-bit arches, the roundup code itself
can overflow by doing a 32-bit left-shift of an unsigned long value,
which is undefined behaviour, so it is not guaranteed to truncate
neatly. This was triggered by syzbot on the DEVMAP_HASH type, which
contains the same check, copied from the hashtab code. So apply the same
fix to hashtab, by moving the overflow check to before the roundup.

Fixes: daaf427c6a ("bpf: fix arraymap NULL deref and missing overflow and zero size checks")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Message-ID: <20240307120340.99577-3-toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-07 20:05:56 -08:00
Toke Høiland-Jørgensen
281d464a34 bpf: Fix DEVMAP_HASH overflow check on 32-bit arches
The devmap code allocates a number hash buckets equal to the next power
of two of the max_entries value provided when creating the map. When
rounding up to the next power of two, the 32-bit variable storing the
number of buckets can overflow, and the code checks for overflow by
checking if the truncated 32-bit value is equal to 0. However, on 32-bit
arches the rounding up itself can overflow mid-way through, because it
ends up doing a left-shift of 32 bits on an unsigned long value. If the
size of an unsigned long is four bytes, this is undefined behaviour, so
there is no guarantee that we'll end up with a nice and tidy 0-value at
the end.

Syzbot managed to turn this into a crash on arm32 by creating a
DEVMAP_HASH with max_entries > 0x80000000 and then trying to update it.
Fix this by moving the overflow check to before the rounding up
operation.

Fixes: 6f9d451ab1 ("xdp: Add devmap_hash map type for looking up devices by hashed index")
Link: https://lore.kernel.org/r/000000000000ed666a0611af6818@google.com
Reported-and-tested-by: syzbot+8cd36f6b65f3cafd400a@syzkaller.appspotmail.com
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Message-ID: <20240307120340.99577-2-toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-07 20:02:38 -08:00
Alexei Starovoitov
fe5064158c bpf: Tell bpf programs kernel's PAGE_SIZE
vmlinux BTF includes all kernel enums.
Add __PAGE_SIZE = PAGE_SIZE enum, so that bpf programs
that include vmlinux.h can easily access it.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20240307031228.42896-7-alexei.starovoitov@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-03-07 14:58:48 -08:00
Alexei Starovoitov
cf2c2e4a3d bpf: Plumb get_unmapped_area() callback into bpf_map_ops
Subsequent patches introduce bpf_arena that imposes special alignment
requirements on address selection.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20240307031228.42896-4-alexei.starovoitov@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-03-07 14:58:48 -08:00
Alexei Starovoitov
8d94f1357c bpf: Recognize '__map' suffix in kfunc arguments
Recognize 'void *p__map' kfunc argument as 'struct bpf_map *p__map'.
It allows kfunc to have 'void *' argument for maps, since bpf progs
will call them as:
struct {
        __uint(type, BPF_MAP_TYPE_ARENA);
	...
} arena SEC(".maps");

bpf_kfunc_with_map(... &arena ...);

Underneath libbpf will load CONST_PTR_TO_MAP into the register via ld_imm64
insn. If kfunc was defined with 'struct bpf_map *' it would pass the
verifier as well, but bpf prog would need to type cast the argument
(void *)&arena, which is not clean.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20240307031228.42896-3-alexei.starovoitov@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-03-07 14:58:48 -08:00
Alexei Starovoitov
88d1d4a7ee bpf: Allow kfuncs return 'void *'
Recognize return of 'void *' from kfunc as returning unknown scalar.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20240307031228.42896-2-alexei.starovoitov@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-03-07 14:58:48 -08:00
Jakub Kicinski
e3afe5dd3a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts.

Adjacent changes:

net/core/page_pool_user.c
  0b11b1c5c3 ("netdev: let netlink core handle -EMSGSIZE errors")
  429679dcf7 ("page_pool: fix netlink dump stop/resume")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07 10:29:36 -08:00
Eduard Zingerman
bd70a8fb7c bpf: Allow all printable characters in BTF DATASEC names
The intent is to allow libbpf to use SEC("?.struct_ops") to identify
struct_ops maps that are optional, e.g. like in the following BPF code:

    SEC("?.struct_ops")
    struct test_ops optional_map = { ... };

Which yields the following BTF:

    ...
    [13] DATASEC '?.struct_ops' size=0 vlen=...
    ...

To load such BTF libbpf rewrites DATASEC name before load.
After this patch the rewrite won't be necessary.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240306104529.6453-15-eddyz87@gmail.com
2024-03-06 15:18:16 -08:00
Alexei Starovoitov
4f81c16f50 bpf: Recognize that two registers are safe when their ranges match
When open code iterators, bpf_loop or may_goto are used the following two
states are equivalent and safe to prune the search:

cur state: fp-8_w=scalar(id=3,smin=umin=smin32=umin32=2,smax=umax=smax32=umax32=11,var_off=(0x0; 0xf))
old state: fp-8_rw=scalar(id=2,smin=umin=smin32=umin32=1,smax=umax=smax32=umax32=11,var_off=(0x0; 0xf))

In other words "exact" state match should ignore liveness and precision
marks, since open coded iterator logic didn't complete their propagation,
reg_old->type == NOT_INIT && reg_cur->type != NOT_INIT is also not safe to
prune while looping, but range_within logic that applies to scalars,
ptr_to_mem, map_value, pkt_ptr is safe to rely on.

Avoid doing such comparison when regular infinite loop detection logic is
used, otherwise bounded loop logic will declare such "infinite loop" as
false positive. Such example is in progs/verifier_loops1.c
not_an_inifinite_loop().

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Tested-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240306031929.42666-3-alexei.starovoitov@gmail.com
2024-03-06 15:18:00 -08:00
Alexei Starovoitov
011832b97b bpf: Introduce may_goto instruction
Introduce may_goto instruction that from the verifier pov is similar to
open coded iterators bpf_for()/bpf_repeat() and bpf_loop() helper, but it
doesn't iterate any objects.
In assembly 'may_goto' is a nop most of the time until bpf runtime has to
terminate the program for whatever reason. In the current implementation
may_goto has a hidden counter, but other mechanisms can be used.
For programs written in C the later patch introduces 'cond_break' macro
that combines 'may_goto' with 'break' statement and has similar semantics:
cond_break is a nop until bpf runtime has to break out of this loop.
It can be used in any normal "for" or "while" loop, like

  for (i = zero; i < cnt; cond_break, i++) {

The verifier recognizes that may_goto is used in the program, reserves
additional 8 bytes of stack, initializes them in subprog prologue, and
replaces may_goto instruction with:
aux_reg = *(u64 *)(fp - 40)
if aux_reg == 0 goto pc+off
aux_reg -= 1
*(u64 *)(fp - 40) = aux_reg

may_goto instruction can be used by LLVM to implement __builtin_memcpy,
__builtin_strcmp.

may_goto is not a full substitute for bpf_for() macro.
bpf_for() doesn't have induction variable that verifiers sees,
so 'i' in bpf_for(i, 0, 100) is seen as imprecise and bounded.

But when the code is written as:
for (i = 0; i < 100; cond_break, i++)
the verifier see 'i' as precise constant zero,
hence cond_break (aka may_goto) doesn't help to converge the loop.
A static or global variable can be used as a workaround:
static int zero = 0;
for (i = zero; i < 100; cond_break, i++) // works!

may_goto works well with arena pointers that don't need to be bounds
checked on access. Load/store from arena returns imprecise unbounded
scalar and loops with may_goto pass the verifier.

Reserve new opcode BPF_JMP | BPF_JCOND for may_goto insn.
JCOND stands for conditional pseudo jump.
Since goto_or_nop insn was proposed, it may use the same opcode.
may_goto vs goto_or_nop can be distinguished by src_reg:
code = BPF_JMP | BPF_JCOND
src_reg = 0 - may_goto
src_reg = 1 - goto_or_nop

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Tested-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240306031929.42666-2-alexei.starovoitov@gmail.com
2024-03-06 15:17:31 -08:00
Toke Høiland-Jørgensen
2487007aa3 cpumap: Zero-initialise xdp_rxq_info struct before running XDP program
When running an XDP program that is attached to a cpumap entry, we don't
initialise the xdp_rxq_info data structure being used in the xdp_buff
that backs the XDP program invocation. Tobias noticed that this leads to
random values being returned as the xdp_md->rx_queue_index value for XDP
programs running in a cpumap.

This means we're basically returning the contents of the uninitialised
memory, which is bad. Fix this by zero-initialising the rxq data
structure before running the XDP program.

Fixes: 9216477449 ("bpf: cpumap: Add the possibility to attach an eBPF program to cpumap")
Reported-by: Tobias Böhm <tobias@aibor.de>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20240305213132.11955-1-toke@redhat.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-03-05 16:48:53 -08:00
Eduard Zingerman
e9a8e5a587 bpf: check bpf_func_state->callback_depth when pruning states
When comparing current and cached states verifier should consider
bpf_func_state->callback_depth. Current state cannot be pruned against
cached state, when current states has more iterations left compared to
cached state. Current state has more iterations left when it's
callback_depth is smaller.

Below is an example illustrating this bug, minimized from mailing list
discussion [0] (assume that BPF_F_TEST_STATE_FREQ is set).
The example is not a safe program: if loop_cb point (1) is followed by
loop_cb point (2), then division by zero is possible at point (4).

    struct ctx {
    	__u64 a;
    	__u64 b;
    	__u64 c;
    };

    static void loop_cb(int i, struct ctx *ctx)
    {
    	/* assume that generated code is "fallthrough-first":
    	 * if ... == 1 goto
    	 * if ... == 2 goto
    	 * <default>
    	 */
    	switch (bpf_get_prandom_u32()) {
    	case 1:  /* 1 */ ctx->a = 42; return 0; break;
    	case 2:  /* 2 */ ctx->b = 42; return 0; break;
    	default: /* 3 */ ctx->c = 42; return 0; break;
    	}
    }

    SEC("tc")
    __failure
    __flag(BPF_F_TEST_STATE_FREQ)
    int test(struct __sk_buff *skb)
    {
    	struct ctx ctx = { 7, 7, 7 };

    	bpf_loop(2, loop_cb, &ctx, 0);              /* 0 */
    	/* assume generated checks are in-order: .a first */
    	if (ctx.a == 42 && ctx.b == 42 && ctx.c == 7)
    		asm volatile("r0 /= 0;":::"r0");    /* 4 */
    	return 0;
    }

Prior to this commit verifier built the following checkpoint tree for
this example:

 .------------------------------------- Checkpoint / State name
 |    .-------------------------------- Code point number
 |    |   .---------------------------- Stack state {ctx.a,ctx.b,ctx.c}
 |    |   |        .------------------- Callback depth in frame #0
 v    v   v        v
   - (0) {7P,7P,7},depth=0
     - (3) {7P,7P,7},depth=1
       - (0) {7P,7P,42},depth=1
         - (3) {7P,7,42},depth=2
           - (0) {7P,7,42},depth=2      loop terminates because of depth limit
             - (4) {7P,7,42},depth=0    predicted false, ctx.a marked precise
             - (6) exit
(a)      - (2) {7P,7,42},depth=2
           - (0) {7P,42,42},depth=2     loop terminates because of depth limit
             - (4) {7P,42,42},depth=0   predicted false, ctx.a marked precise
             - (6) exit
(b)      - (1) {7P,7P,42},depth=2
           - (0) {42P,7P,42},depth=2    loop terminates because of depth limit
             - (4) {42P,7P,42},depth=0  predicted false, ctx.{a,b} marked precise
             - (6) exit
     - (2) {7P,7,7},depth=1             considered safe, pruned using checkpoint (a)
(c)  - (1) {7P,7P,7},depth=1            considered safe, pruned using checkpoint (b)

Here checkpoint (b) has callback_depth of 2, meaning that it would
never reach state {42,42,7}.
While checkpoint (c) has callback_depth of 1, and thus
could yet explore the state {42,42,7} if not pruned prematurely.
This commit makes forbids such premature pruning,
allowing verifier to explore states sub-tree starting at (c):

(c)  - (1) {7,7,7P},depth=1
       - (0) {42P,7,7P},depth=1
         ...
         - (2) {42,7,7},depth=2
           - (0) {42,42,7},depth=2      loop terminates because of depth limit
             - (4) {42,42,7},depth=0    predicted true, ctx.{a,b,c} marked precise
               - (5) division by zero

[0] https://lore.kernel.org/bpf/9b251840-7cb8-4d17-bd23-1fc8071d8eef@linux.dev/

Fixes: bb124da69c ("bpf: keep track of max number of bpf_loop callback iterations")
Suggested-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240222154121.6991-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-05 16:15:56 -08:00
Kui-Feng Lee
187e2af05a bpf: struct_ops supports more than one page for trampolines.
The BPF struct_ops previously only allowed one page of trampolines.
Each function pointer of a struct_ops is implemented by a struct_ops
bpf program. Each struct_ops bpf program requires a trampoline.
The following selftest patch shows each page can hold a little more
than 20 trampolines.

While one page is more than enough for the tcp-cc usecase,
the sched_ext use case shows that one page is not always enough and hits
the one page limit. This patch overcomes the one page limit by allocating
another page when needed and it is limited to a total of
MAX_IMAGE_PAGES (8) pages which is more than enough for
reasonable usages.

The variable st_map->image has been changed to st_map->image_pages, and
its type has been changed to an array of pointers to pages.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240224223418.526631-3-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-03-04 14:09:20 -08:00
Kui-Feng Lee
73e4f9e615 bpf, net: validate struct_ops when updating value.
Perform all validations when updating values of struct_ops maps. Doing
validation in st_ops->reg() and st_ops->update() is not necessary anymore.
However, tcp_register_congestion_control() has been called in various
places. It still needs to do validations.

Cc: netdev@vger.kernel.org
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240224223418.526631-2-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-03-04 10:03:57 -08:00
Eric Dumazet
80bfab79b8 net: adopt skb_network_offset() and similar helpers
This is a cleanup patch, making code a bit more concise.

1) Use skb_network_offset(skb) in place of
       (skb_network_header(skb) - skb->data)

2) Use -skb_network_offset(skb) in place of
       (skb->data - skb_network_header(skb))

3) Use skb_transport_offset(skb) in place of
       (skb_transport_header(skb) - skb->data)

4) Use skb_inner_transport_offset(skb) in place of
       (skb_inner_transport_header(skb) - skb->data)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Edward Cree <ecree.xilinx@gmail.com> # for sfc
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-04 08:47:06 +00:00
Jakub Kicinski
4b2765ae41 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZeEKVAAKCRDbK58LschI
 g7oYAQD5Jlv4fIVTvxvfZrTTZ2tU+OsPa75mc8SDKwpash3YygEA8kvESy8+t6pg
 D6QmSf1DIZdFoSp/bV+pfkNWMeR8gwg=
 =mTAj
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2024-02-29

We've added 119 non-merge commits during the last 32 day(s) which contain
a total of 150 files changed, 3589 insertions(+), 995 deletions(-).

The main changes are:

1) Extend the BPF verifier to enable static subprog calls in spin lock
   critical sections, from Kumar Kartikeya Dwivedi.

2) Fix confusing and incorrect inference of PTR_TO_CTX argument type
   in BPF global subprogs, from Andrii Nakryiko.

3) Larger batch of riscv BPF JIT improvements and enabling inlining
   of the bpf_kptr_xchg() for RV64, from Pu Lehui.

4) Allow skeleton users to change the values of the fields in struct_ops
   maps at runtime, from Kui-Feng Lee.

5) Extend the verifier's capabilities of tracking scalars when they
   are spilled to stack, especially when the spill or fill is narrowing,
   from Maxim Mikityanskiy & Eduard Zingerman.

6) Various BPF selftest improvements to fix errors under gcc BPF backend,
   from Jose E. Marchesi.

7) Avoid module loading failure when the module trying to register
   a struct_ops has its BTF section stripped, from Geliang Tang.

8) Annotate all kfuncs in .BTF_ids section which eventually allows
   for automatic kfunc prototype generation from bpftool, from Daniel Xu.

9) Several updates to the instruction-set.rst IETF standardization
   document, from Dave Thaler.

10) Shrink the size of struct bpf_map resp. bpf_array,
    from Alexei Starovoitov.

11) Initial small subset of BPF verifier prepwork for sleepable bpf_timer,
    from Benjamin Tissoires.

12) Fix bpftool to be more portable to musl libc by using POSIX's
    basename(), from Arnaldo Carvalho de Melo.

13) Add libbpf support to gcc in CORE macro definitions,
    from Cupertino Miranda.

14) Remove a duplicate type check in perf_event_bpf_event,
    from Florian Lehner.

15) Fix bpf_spin_{un,}lock BPF helpers to actually annotate them
    with notrace correctly, from Yonghong Song.

16) Replace the deprecated bpf_lpm_trie_key 0-length array with flexible
    array to fix build warnings, from Kees Cook.

17) Fix resolve_btfids cross-compilation to non host-native endianness,
    from Viktor Malik.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (119 commits)
  selftests/bpf: Test if shadow types work correctly.
  bpftool: Add an example for struct_ops map and shadow type.
  bpftool: Generated shadow variables for struct_ops maps.
  libbpf: Convert st_ops->data to shadow type.
  libbpf: Set btf_value_type_id of struct bpf_map for struct_ops.
  bpf: Replace bpf_lpm_trie_key 0-length array with flexible array
  bpf, arm64: use bpf_prog_pack for memory management
  arm64: patching: implement text_poke API
  bpf, arm64: support exceptions
  arm64: stacktrace: Implement arch_bpf_stack_walk() for the BPF JIT
  bpf: add is_async_callback_calling_insn() helper
  bpf: introduce in_sleepable() helper
  bpf: allow more maps in sleepable bpf programs
  selftests/bpf: Test case for lacking CFI stub functions.
  bpf: Check cfi_stubs before registering a struct_ops type.
  bpf: Clarify batch lookup/lookup_and_delete semantics
  bpf, docs: specify which BPF_ABS and BPF_IND fields were zero
  bpf, docs: Fix typos in instruction-set.rst
  selftests/bpf: update tcp_custom_syncookie to use scalar packet offset
  bpf: Shrink size of struct bpf_map/bpf_array.
  ...
====================

Link: https://lore.kernel.org/r/20240301001625.8800-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-02 20:50:59 -08:00
Kees Cook
896880ff30 bpf: Replace bpf_lpm_trie_key 0-length array with flexible array
Replace deprecated 0-length array in struct bpf_lpm_trie_key with
flexible array. Found with GCC 13:

../kernel/bpf/lpm_trie.c:207:51: warning: array subscript i is outside array bounds of 'const __u8[0]' {aka 'const unsigned char[]'} [-Warray-bounds=]
  207 |                                        *(__be16 *)&key->data[i]);
      |                                                   ^~~~~~~~~~~~~
../include/uapi/linux/swab.h:102:54: note: in definition of macro '__swab16'
  102 | #define __swab16(x) (__u16)__builtin_bswap16((__u16)(x))
      |                                                      ^
../include/linux/byteorder/generic.h:97:21: note: in expansion of macro '__be16_to_cpu'
   97 | #define be16_to_cpu __be16_to_cpu
      |                     ^~~~~~~~~~~~~
../kernel/bpf/lpm_trie.c:206:28: note: in expansion of macro 'be16_to_cpu'
  206 |                 u16 diff = be16_to_cpu(*(__be16 *)&node->data[i]
^
      |                            ^~~~~~~~~~~
In file included from ../include/linux/bpf.h:7:
../include/uapi/linux/bpf.h:82:17: note: while referencing 'data'
   82 |         __u8    data[0];        /* Arbitrary size */
      |                 ^~~~

And found at run-time under CONFIG_FORTIFY_SOURCE:

  UBSAN: array-index-out-of-bounds in kernel/bpf/lpm_trie.c:218:49
  index 0 is out of range for type '__u8 [*]'

Changing struct bpf_lpm_trie_key is difficult since has been used by
userspace. For example, in Cilium:

	struct egress_gw_policy_key {
	        struct bpf_lpm_trie_key lpm_key;
	        __u32 saddr;
	        __u32 daddr;
	};

While direct references to the "data" member haven't been found, there
are static initializers what include the final member. For example,
the "{}" here:

        struct egress_gw_policy_key in_key = {
                .lpm_key = { 32 + 24, {} },
                .saddr   = CLIENT_IP,
                .daddr   = EXTERNAL_SVC_IP & 0Xffffff,
        };

To avoid the build time and run time warnings seen with a 0-sized
trailing array for struct bpf_lpm_trie_key, introduce a new struct
that correctly uses a flexible array for the trailing bytes,
struct bpf_lpm_trie_key_u8. As part of this, include the "header"
portion (which is just the "prefixlen" member), so it can be used
by anything building a bpf_lpr_trie_key that has trailing members that
aren't a u8 flexible array (like the self-test[1]), which is named
struct bpf_lpm_trie_key_hdr.

Unfortunately, C++ refuses to parse the __struct_group() helper, so
it is not possible to define struct bpf_lpm_trie_key_hdr directly in
struct bpf_lpm_trie_key_u8, so we must open-code the union directly.

Adjust the kernel code to use struct bpf_lpm_trie_key_u8 through-out,
and for the selftest to use struct bpf_lpm_trie_key_hdr. Add a comment
to the UAPI header directing folks to the two new options.

Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Closes: https://paste.debian.net/hidden/ca500597/
Link: https://lore.kernel.org/all/202206281009.4332AA33@keescook/ [1]
Link: https://lore.kernel.org/bpf/20240222155612.it.533-kees@kernel.org
2024-02-29 22:52:43 +01:00
Benjamin Tissoires
2ab256e932 bpf: add is_async_callback_calling_insn() helper
Currently we have a special case for BPF_FUNC_timer_set_callback,
let's introduce a helper we can extend for the kfunc that will come in
a later patch

Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
Link: https://lore.kernel.org/r/20240221-hid-bpf-sleepable-v3-3-1fb378ca6301@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-22 17:48:53 -08:00
Benjamin Tissoires
dfe6625df4 bpf: introduce in_sleepable() helper
No code change, but it'll allow to have only one place to change
everything when we add in_sleepable in cur_state.

Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
Link: https://lore.kernel.org/r/20240221-hid-bpf-sleepable-v3-2-1fb378ca6301@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-22 17:47:15 -08:00
Benjamin Tissoires
55bad79e33 bpf: allow more maps in sleepable bpf programs
These 2 maps types are required for HID-BPF when a user wants to do
IO with a device from a sleepable tracing point.

Allowing BPF_MAP_TYPE_QUEUE (and therefore BPF_MAP_TYPE_STACK) allows
for a BPF program to prepare from an IRQ the list of HID commands to send
back to the device and then these commands can be retrieved from the
sleepable trace point.

Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
Link: https://lore.kernel.org/r/20240221-hid-bpf-sleepable-v3-1-1fb378ca6301@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-22 17:42:23 -08:00
Jakub Kicinski
fecc51559a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

net/ipv4/udp.c
  f796feabb9 ("udp: add local "peek offset enabled" flag")
  56667da739 ("net: implement lockless setsockopt(SO_PEEK_OFF)")

Adjacent changes:

net/unix/garbage.c
  aa82ac51d6 ("af_unix: Drop oob_skb ref before purging queue in GC.")
  11498715f2 ("af_unix: Remove io_uring code for GC.")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-22 15:29:26 -08:00
Kui-Feng Lee
3e0008336a bpf: Check cfi_stubs before registering a struct_ops type.
Recently, st_ops->cfi_stubs was introduced. However, the upcoming new
struct_ops support (e.g. sched_ext) is not aware of this and does not
provide its own cfi_stubs. The kernel ends up NULL dereferencing the
st_ops->cfi_stubs.

Considering struct_ops supports kernel module now, this NULL check
is necessary. This patch is to reject struct_ops registration
that does not provide a cfi_stubs.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240222021105.1180475-2-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-02-22 12:26:40 -08:00
Yafang Shao
5f2ae606cb bpf: Fix an issue due to uninitialized bpf_iter_task
Failure to initialize it->pos, coupled with the presence of an invalid
value in the flags variable, can lead to it->pos referencing an invalid
task, potentially resulting in a kernel panic. To mitigate this risk, it's
crucial to ensure proper initialization of it->pos to NULL.

Fixes: ac8148d957 ("bpf: bpf_iter_task_next: use next_task(kit->task) rather than next_task(kit->pos)")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/bpf/20240217114152.1623-2-laoar.shao@gmail.com
2024-02-19 12:28:15 +01:00
Martin KaFai Lau
0281b919e1 bpf: Fix racing between bpf_timer_cancel_and_free and bpf_timer_cancel
The following race is possible between bpf_timer_cancel_and_free
and bpf_timer_cancel. It will lead a UAF on the timer->timer.

bpf_timer_cancel();
	spin_lock();
	t = timer->time;
	spin_unlock();

					bpf_timer_cancel_and_free();
						spin_lock();
						t = timer->timer;
						timer->timer = NULL;
						spin_unlock();
						hrtimer_cancel(&t->timer);
						kfree(t);

	/* UAF on t */
	hrtimer_cancel(&t->timer);

In bpf_timer_cancel_and_free, this patch frees the timer->timer
after a rcu grace period. This requires a rcu_head addition
to the "struct bpf_hrtimer". Another kfree(t) happens in bpf_timer_init,
this does not need a kfree_rcu because it is still under the
spin_lock and timer->timer has not been visible by others yet.

In bpf_timer_cancel, rcu_read_lock() is added because this helper
can be used in a non rcu critical section context (e.g. from
a sleepable bpf prog). Other timer->timer usages in helpers.c
have been audited, bpf_timer_cancel() is the only place where
timer->timer is used outside of the spin_lock.

Another solution considered is to mark a t->flag in bpf_timer_cancel
and clear it after hrtimer_cancel() is done.  In bpf_timer_cancel_and_free,
it busy waits for the flag to be cleared before kfree(t). This patch
goes with a straight forward solution and frees timer->timer after
a rcu grace period.

Fixes: b00628b1c7 ("bpf: Introduce bpf timers.")
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20240215211218.990808-1-martin.lau@linux.dev
2024-02-19 12:26:46 +01:00
Yonghong Song
682158ab53 bpf: Fix test verif_scale_strobemeta_subprogs failure due to llvm19
With latest llvm19, I hit the following selftest failures with

  $ ./test_progs -j
  libbpf: prog 'on_event': BPF program load failed: Permission denied
  libbpf: prog 'on_event': -- BEGIN PROG LOAD LOG --
  combined stack size of 4 calls is 544. Too large
  verification time 1344153 usec
  stack depth 24+440+0+32
  processed 51008 insns (limit 1000000) max_states_per_insn 19 total_states 1467 peak_states 303 mark_read 146
  -- END PROG LOAD LOG --
  libbpf: prog 'on_event': failed to load: -13
  libbpf: failed to load object 'strobemeta_subprogs.bpf.o'
  scale_test:FAIL:expect_success unexpected error: -13 (errno 13)
  #498     verif_scale_strobemeta_subprogs:FAIL

The verifier complains too big of the combined stack size (544 bytes) which
exceeds the maximum stack limit 512. This is a regression from llvm19 ([1]).

In the above error log, the original stack depth is 24+440+0+32.
To satisfy interpreter's need, in verifier the stack depth is adjusted to
32+448+32+32=544 which exceeds 512, hence the error. The same adjusted
stack size is also used for jit case.

But the jitted codes could use smaller stack size.

  $ egrep -r stack_depth | grep round_up
  arm64/net/bpf_jit_comp.c:       ctx->stack_size = round_up(prog->aux->stack_depth, 16);
  loongarch/net/bpf_jit.c:        bpf_stack_adjust = round_up(ctx->prog->aux->stack_depth, 16);
  powerpc/net/bpf_jit_comp.c:     cgctx.stack_size = round_up(fp->aux->stack_depth, 16);
  riscv/net/bpf_jit_comp32.c:             round_up(ctx->prog->aux->stack_depth, STACK_ALIGN);
  riscv/net/bpf_jit_comp64.c:     bpf_stack_adjust = round_up(ctx->prog->aux->stack_depth, 16);
  s390/net/bpf_jit_comp.c:        u32 stack_depth = round_up(fp->aux->stack_depth, 8);
  sparc/net/bpf_jit_comp_64.c:            stack_needed += round_up(stack_depth, 16);
  x86/net/bpf_jit_comp.c:         EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
  x86/net/bpf_jit_comp.c: int tcc_off = -4 - round_up(stack_depth, 8);
  x86/net/bpf_jit_comp.c:                     round_up(stack_depth, 8));
  x86/net/bpf_jit_comp.c: int tcc_off = -4 - round_up(stack_depth, 8);
  x86/net/bpf_jit_comp.c:         EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8));

In the above, STACK_ALIGN in riscv/net/bpf_jit_comp32.c is defined as 16.
So stack is aligned in either 8 or 16, x86/s390 having 8-byte stack alignment and
the rest having 16-byte alignment.

This patch calculates total stack depth based on 16-byte alignment if jit is requested.
For the above failing case, the new stack size will be 32+448+0+32=512 and no verification
failure. llvm19 regression will be discussed separately in llvm upstream.

The verifier change caused three test failures as these tests compared messages
with stack size. More specifically,
  - test_global_funcs/global_func1: fail with interpreter mode and success with jit mode.
    Adjusted stack sizes so both jit and interpreter modes will fail.
  - async_stack_depth/{pseudo_call_check, async_call_root_check}: since jit and interpreter
    will calculate different stack sizes, the failure msg is adjusted to omit those
    specific stack size numbers.

  [1] https://lore.kernel.org/bpf/32bde0f0-1881-46c9-931a-673be566c61d@linux.dev/

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240214232951.4113094-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-15 13:45:27 -08:00
Andrii Nakryiko
57354f5fde bpf: improve duplicate source code line detection
Verifier log avoids printing the same source code line multiple times
when a consecutive block of BPF assembly instructions are covered by the
same original (C) source code line. This greatly improves verifier log
legibility.

Unfortunately, this check is imperfect and in production applications it
quite often happens that verifier log will have multiple duplicated
source lines emitted, for no apparently good reason. E.g., this is
excerpt from a real-world BPF application (with register states omitted
for clarity):

BEFORE
======
; for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) { @ strobemeta_probe.bpf.c:394
5369: (07) r8 += 2                    ;
5370: (07) r7 += 16                   ;
; for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) { @ strobemeta_probe.bpf.c:394
5371: (07) r9 += 1                    ;
5372: (79) r4 = *(u64 *)(r10 -32)     ;
; for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) { @ strobemeta_probe.bpf.c:394
5373: (55) if r9 != 0xf goto pc+2
; if (i >= map->cnt) @ strobemeta_probe.bpf.c:396
5376: (79) r1 = *(u64 *)(r10 -40)     ;
5377: (79) r1 = *(u64 *)(r1 +8)       ;
; if (i >= map->cnt) @ strobemeta_probe.bpf.c:396
5378: (dd) if r1 s<= r9 goto pc-5     ;
; descr->key_lens[i] = 0; @ strobemeta_probe.bpf.c:398
5379: (b4) w1 = 0                     ;
5380: (6b) *(u16 *)(r8 -30) = r1      ;
; task, data, off, STROBE_MAX_STR_LEN, map->entries[i].key); @ strobemeta_probe.bpf.c:400
5381: (79) r3 = *(u64 *)(r7 -8)       ;
5382: (7b) *(u64 *)(r10 -24) = r6     ;
; task, data, off, STROBE_MAX_STR_LEN, map->entries[i].key); @ strobemeta_probe.bpf.c:400
5383: (bc) w6 = w6                    ;
; barrier_var(payload_off); @ strobemeta_probe.bpf.c:280
5384: (bf) r2 = r6                    ;
5385: (bf) r1 = r4                    ;

As can be seen, line 394 is emitted thrice, 396 is emitted twice, and
line 400 is duplicated as well. Note that there are no intermingling
other lines of source code in between these duplicates, so the issue is
not compiler reordering assembly instruction such that multiple original
source code lines are in effect.

It becomes more obvious what's going on if we look at *full* original line info
information (using btfdump for this, [0]):

  #2764: line: insn #5363 --> 394:3 @ ./././strobemeta_probe.bpf.c
            for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) {
  #2765: line: insn #5373 --> 394:21 @ ./././strobemeta_probe.bpf.c
            for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) {
  #2766: line: insn #5375 --> 394:47 @ ./././strobemeta_probe.bpf.c
            for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) {
  #2767: line: insn #5377 --> 394:3 @ ./././strobemeta_probe.bpf.c
            for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) {
  #2768: line: insn #5378 --> 414:10 @ ./././strobemeta_probe.bpf.c
            return off;

We can see that there are four line info records covering
instructions #5363 through #5377 (instruction indices are shifted due to
subprog instruction being appended to main program), all of them are
pointing to the same C source code line #394. But each of them points to
a different part of that line, which is denoted by differing column
numbers (3, 21, 47, 3).

But verifier log doesn't distinguish between parts of the same source code line
and doesn't emit this column number information, so for end user it's just a
repetitive visual noise. So let's improve the detection of repeated source code
line and avoid this.

With the changes in this patch, we get this output for the same piece of BPF
program log:

AFTER
=====
; for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) { @ strobemeta_probe.bpf.c:394
5369: (07) r8 += 2                    ;
5370: (07) r7 += 16                   ;
5371: (07) r9 += 1                    ;
5372: (79) r4 = *(u64 *)(r10 -32)     ;
5373: (55) if r9 != 0xf goto pc+2
; if (i >= map->cnt) @ strobemeta_probe.bpf.c:396
5376: (79) r1 = *(u64 *)(r10 -40)     ;
5377: (79) r1 = *(u64 *)(r1 +8)       ;
5378: (dd) if r1 s<= r9 goto pc-5     ;
; descr->key_lens[i] = 0; @ strobemeta_probe.bpf.c:398
5379: (b4) w1 = 0                     ;
5380: (6b) *(u16 *)(r8 -30) = r1      ;
; task, data, off, STROBE_MAX_STR_LEN, map->entries[i].key); @ strobemeta_probe.bpf.c:400
5381: (79) r3 = *(u64 *)(r7 -8)       ;
5382: (7b) *(u64 *)(r10 -24) = r6     ;
5383: (bc) w6 = w6                    ;
; barrier_var(payload_off); @ strobemeta_probe.bpf.c:280
5384: (bf) r2 = r6                    ;
5385: (bf) r1 = r4                    ;

All the duplication is gone and the log is cleaner and less distracting.

  [0] https://github.com/anakryiko/btfdump

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240214174100.2847419-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-15 13:00:48 -08:00
Andrii Nakryiko
a4561f5afe bpf: Use O(log(N)) binary search to find line info record
Real-world BPF applications keep growing in size. Medium-sized production
application can easily have 50K+ verified instructions, and its line
info section in .BTF.ext has more than 3K entries.

When verifier emits log with log_level>=1, it annotates assembly code
with matched original C source code. Currently it uses linear search
over line info records to find a match. As complexity of BPF
applications grows, this O(K * N) approach scales poorly.

So, let's instead of linear O(N) search for line info record use faster
equivalent O(log(N)) binary search algorithm. It's not a plain binary
search, as we don't look for exact match. It's an upper bound search
variant, looking for rightmost line info record that starts at or before
given insn_off.

Some unscientific measurements were done before and after this change.
They were done in VM and fluctuate a bit, but overall the speed up is
undeniable.

BASELINE
========
File                              Program           Duration (us)   Insns
--------------------------------  ----------------  -------------  ------
katran.bpf.o                      balancer_ingress        2497130  343552
pyperf600.bpf.linked3.o           on_event               12389611  627288
strobelight_pyperf_libbpf.o       on_py_event              387399   52445
--------------------------------  ----------------  -------------  ------

BINARY SEARCH
=============

File                              Program           Duration (us)   Insns
--------------------------------  ----------------  -------------  ------
katran.bpf.o                      balancer_ingress        2339312  343552
pyperf600.bpf.linked3.o           on_event                5602203  627288
strobelight_pyperf_libbpf.o       on_py_event              294761   52445
--------------------------------  ----------------  -------------  ------

While Katran's speed up is pretty modest (about 105ms, or 6%), for
production pyperf BPF program (on_py_event) it's much greater already,
going from 387ms down to 295ms (23% improvement).

Looking at BPF selftests's biggest pyperf example, we can see even more
dramatic improvement, shaving more than 50% of time, going from 12.3s
down to 5.6s.

Different amount of improvement is the function of overall amount of BPF
assembly instructions in .bpf.o files (which contributes to how much
line info records there will be and thus, on average, how much time linear
search will take), among other things:

$ llvm-objdump -d katran.bpf.o | wc -l
3863
$ llvm-objdump -d strobelight_pyperf_libbpf.o | wc -l
6997
$ llvm-objdump -d pyperf600.bpf.linked3.o | wc -l
87854

Granted, this only applies to debugging cases (e.g., using veristat, or
failing verification in production), but seems worth doing to improve
overall developer experience anyways.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20240214002311.2197116-1-andrii@kernel.org
2024-02-14 23:53:42 +01:00
Andrii Nakryiko
7cc13adbd0 bpf: emit source code file name and line number in verifier log
As BPF applications grow in size and complexity and are separated into
multiple .bpf.c files that are statically linked together, it becomes
harder and harder to match verifier's BPF assembly level output to
original C code. While often annotated C source code is unique enough to
be able to identify the file it belongs to, quite often this is actually
problematic as parts of source code can be quite generic.

Long story short, it is very useful to see source code file name and
line number information along with the original C code. Verifier already
knows this information, we just need to output it.

This patch extends verifier log with file name and line number
information, emitted next to original (presumably C) source code,
annotating BPF assembly output, like so:

  ; <original C code> @ <filename>.bpf.c:<line>

If file name has directory names in it, they are stripped away. This
should be fine in practice as file names tend to be pretty unique with
C code anyways, and keeping log size smaller is always good.

In practice this might look something like below, where some code is
coming from application files, while others are from libbpf's usdt.bpf.h
header file:

  ; if (STROBEMETA_READ( @ strobemeta_probe.bpf.c:534
  5592: (79) r1 = *(u64 *)(r10 -56)     ; R1_w=mem_or_null(id=1589,sz=7680) R10=fp0
  5593: (7b) *(u64 *)(r10 -56) = r1     ; R1_w=mem_or_null(id=1589,sz=7680) R10=fp0
  5594: (79) r3 = *(u64 *)(r10 -8)      ; R3_w=scalar() R10=fp0 fp-8=mmmmmmmm

  ...

  170: (71) r1 = *(u8 *)(r8 +15)        ; frame1: R1_w=scalar(...) R8_w=map_value(map=__bpf_usdt_spec,ks=4,vs=208)
  171: (67) r1 <<= 56                   ; frame1: R1_w=scalar(...)
  172: (c7) r1 s>>= 56                  ; frame1: R1_w=scalar(smin=smin32=-128,smax=smax32=127)
  ; val <<= arg_spec->arg_bitshift; @ usdt.bpf.h:183
  173: (67) r1 <<= 32                   ; frame1: R1_w=scalar(...)
  174: (77) r1 >>= 32                   ; frame1: R1_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
  175: (79) r2 = *(u64 *)(r10 -8)       ; frame1: R2_w=scalar() R10=fp0 fp-8=mmmmmmmm
  176: (6f) r2 <<= r1                   ; frame1: R1_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R2_w=scalar()
  177: (7b) *(u64 *)(r10 -8) = r2       ; frame1: R2_w=scalar(id=61) R10=fp0 fp-8_w=scalar(id=61)
  ; if (arg_spec->arg_signed) @ usdt.bpf.h:184
  178: (bf) r3 = r2                     ; frame1: R2_w=scalar(id=61) R3_w=scalar(id=61)
  179: (7f) r3 >>= r1                   ; frame1: R1_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R3_w=scalar()
  ; if (arg_spec->arg_signed) @ usdt.bpf.h:184
  180: (71) r4 = *(u8 *)(r8 +14)
  181: safe

log_fixup tests needed a minor adjustment as verifier log output
increased a bit and that test is quite sensitive to such changes.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240212235944.2816107-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-13 18:51:32 -08:00
Andrii Nakryiko
879bbe7aa4 bpf: don't infer PTR_TO_CTX for programs with unnamed context type
For program types that don't have named context type name (e.g., BPF
iterator programs or tracepoint programs), ctx_tname will be a non-NULL
empty string. For such programs it shouldn't be possible to have
PTR_TO_CTX argument for global subprogs based on type name alone.
arg:ctx tag is the only way to have PTR_TO_CTX passed into global
subprog for such program types.

Fix this loophole, which currently would assume PTR_TO_CTX whenever
user uses a pointer to anonymous struct as an argument to their global
subprogs. This happens in practice with the following (quite common, in
practice) approach:

typedef struct { /* anonymous */
    int x;
} my_type_t;

int my_subprog(my_type_t *arg) { ... }

User's intent is to have PTR_TO_MEM argument for `arg`, but verifier
will complain about expecting PTR_TO_CTX.

This fix also closes unintended s390x-specific KPROBE handling of
PTR_TO_CTX case. Selftest change is necessary to accommodate this.

Fixes: 91cc1a9974 ("bpf: Annotate context types")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240212233221.2575350-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-13 18:46:47 -08:00
Andrii Nakryiko
824c58fb10 bpf: handle bpf_user_pt_regs_t typedef explicitly for PTR_TO_CTX global arg
Expected canonical argument type for global function arguments
representing PTR_TO_CTX is `bpf_user_pt_regs_t *ctx`. This currently
works on s390x by accident because kernel resolves such typedef to
underlying struct (which is anonymous on s390x), and erroneously
accepting it as expected context type. We are fixing this problem next,
which would break s390x arch, so we need to handle `bpf_user_pt_regs_t`
case explicitly for KPROBE programs.

Fixes: 91cc1a9974 ("bpf: Annotate context types")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240212233221.2575350-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-13 18:46:47 -08:00
Andrii Nakryiko
fb5b86cfd4 bpf: simplify btf_get_prog_ctx_type() into btf_is_prog_ctx_type()
Return result of btf_get_prog_ctx_type() is never used and callers only
check NULL vs non-NULL case to determine if given type matches expected
PTR_TO_CTX type. So rename function to `btf_is_prog_ctx_type()` and
return a simple true/false. We'll use this simpler interface to handle
kprobe program type's special typedef case in the next patch.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240212233221.2575350-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-13 18:46:46 -08:00
Oliver Crumrine
32e18e7688 bpf: remove check in __cgroup_bpf_run_filter_skb
Originally, this patch removed a redundant check in
BPF_CGROUP_RUN_PROG_INET_EGRESS, as the check was already being done in
the function it called, __cgroup_bpf_run_filter_skb. For v2, it was
reccomended that I remove the check from __cgroup_bpf_run_filter_skb,
and add the checks to the other macro that calls that function,
BPF_CGROUP_RUN_PROG_INET_INGRESS.

To sum it up, checking that the socket exists and that it is a full
socket is now part of both macros BPF_CGROUP_RUN_PROG_INET_EGRESS and
BPF_CGROUP_RUN_PROG_INET_INGRESS, and it is no longer part of the
function they call, __cgroup_bpf_run_filter_skb.

v3->v4: Fixed weird merge conflict.
v2->v3: Sent to bpf-next instead of generic patch
v1->v2: Addressed feedback about where check should be removed.

Signed-off-by: Oliver Crumrine <ozlinuxc@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/7lv62yiyvmj5a7eozv2iznglpkydkdfancgmbhiptrgvgan5sy@3fl3onchgdz3
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-02-13 15:41:17 -08:00
Kui-Feng Lee
1611603537 bpf: Create argument information for nullable arguments.
Collect argument information from the type information of stub functions to
mark arguments of BPF struct_ops programs with PTR_MAYBE_NULL if they are
nullable.  A nullable argument is annotated by suffixing "__nullable" at
the argument name of stub function.

For nullable arguments, this patch sets a struct bpf_ctx_arg_aux to label
their reg_type with PTR_TO_BTF_ID | PTR_TRUSTED | PTR_MAYBE_NULL. This
makes the verifier to check programs and ensure that they properly check
the pointer. The programs should check if the pointer is null before
accessing the pointed memory.

The implementer of a struct_ops type should annotate the arguments that can
be null. The implementer should define a stub function (empty) as a
placeholder for each defined operator. The name of a stub function should
be in the pattern "<st_op_type>__<operator name>". For example, for
test_maybe_null of struct bpf_testmod_ops, it's stub function name should
be "bpf_testmod_ops__test_maybe_null". You mark an argument nullable by
suffixing the argument name with "__nullable" at the stub function.

Since we already has stub functions for kCFI, we just reuse these stub
functions with the naming convention mentioned earlier. These stub
functions with the naming convention is only required if there are nullable
arguments to annotate. For functions having not nullable arguments, stub
functions are not necessary for the purpose of this patch.

This patch will prepare a list of struct bpf_ctx_arg_aux, aka arg_info, for
each member field of a struct_ops type.  "arg_info" will be assigned to
"prog->aux->ctx_arg_info" of BPF struct_ops programs in
check_struct_ops_btf_id() so that it can be used by btf_ctx_access() later
to set reg_type properly for the verifier.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240209023750.1153905-4-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-02-13 15:16:44 -08:00
Kui-Feng Lee
6115a0aeef bpf: Move __kfunc_param_match_suffix() to btf.c.
Move __kfunc_param_match_suffix() to btf.c and rename it as
btf_param_match_suffix(). It can be reused by bpf_struct_ops later.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240209023750.1153905-3-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-02-13 15:16:44 -08:00
Kui-Feng Lee
77c0208e19 bpf: add btf pointer to struct bpf_ctx_arg_aux.
Enable the providers to use types defined in a module instead of in the
kernel (btf_vmlinux).

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240209023750.1153905-2-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-02-13 15:16:44 -08:00
Hari Bathini
11f522256e bpf: Fix warning for bpf_cpumask in verifier
Compiling with CONFIG_BPF_SYSCALL & !CONFIG_BPF_JIT throws the below
warning:

  "WARN: resolve_btfids: unresolved symbol bpf_cpumask"

Fix it by adding the appropriate #ifdef.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20240208100115.602172-1-hbathini@linux.ibm.com
2024-02-13 11:13:39 -08:00
Yonghong Song
178c54666f bpf: Mark bpf_spin_{lock,unlock}() helpers with notrace correctly
Currently tracing is supposed not to allow for bpf_spin_{lock,unlock}()
helper calls. This is to prevent deadlock for the following cases:
  - there is a prog (prog-A) calling bpf_spin_{lock,unlock}().
  - there is a tracing program (prog-B), e.g., fentry, attached
    to bpf_spin_lock() and/or bpf_spin_unlock().
  - prog-B calls bpf_spin_{lock,unlock}().
For such a case, when prog-A calls bpf_spin_{lock,unlock}(),
a deadlock will happen.

The related source codes are below in kernel/bpf/helpers.c:
  notrace BPF_CALL_1(bpf_spin_lock, struct bpf_spin_lock *, lock)
  notrace BPF_CALL_1(bpf_spin_unlock, struct bpf_spin_lock *, lock)
notrace is supposed to prevent fentry prog from attaching to
bpf_spin_{lock,unlock}().

But actually this is not the case and fentry prog can successfully
attached to bpf_spin_lock(). Siddharth Chintamaneni reported
the issue in [1]. The following is the macro definition for
above BPF_CALL_1:
  #define BPF_CALL_x(x, name, ...)                                               \
        static __always_inline                                                 \
        u64 ____##name(__BPF_MAP(x, __BPF_DECL_ARGS, __BPF_V, __VA_ARGS__));   \
        typedef u64 (*btf_##name)(__BPF_MAP(x, __BPF_DECL_ARGS, __BPF_V, __VA_ARGS__)); \
        u64 name(__BPF_REG(x, __BPF_DECL_REGS, __BPF_N, __VA_ARGS__));         \
        u64 name(__BPF_REG(x, __BPF_DECL_REGS, __BPF_N, __VA_ARGS__))          \
        {                                                                      \
                return ((btf_##name)____##name)(__BPF_MAP(x,__BPF_CAST,__BPF_N,__VA_ARGS__));\
        }                                                                      \
        static __always_inline                                                 \
        u64 ____##name(__BPF_MAP(x, __BPF_DECL_ARGS, __BPF_V, __VA_ARGS__))

  #define BPF_CALL_1(name, ...)   BPF_CALL_x(1, name, __VA_ARGS__)

The notrace attribute is actually applied to the static always_inline function
____bpf_spin_{lock,unlock}(). The actual callback function
bpf_spin_{lock,unlock}() is not marked with notrace, hence
allowing fentry prog to attach to two helpers, and this
may cause the above mentioned deadlock. Siddharth Chintamaneni
actually has a reproducer in [2].

To fix the issue, a new macro NOTRACE_BPF_CALL_1 is introduced which
will add notrace attribute to the original function instead of
the hidden always_inline function and this fixed the problem.

  [1] https://lore.kernel.org/bpf/CAE5sdEigPnoGrzN8WU7Tx-h-iFuMZgW06qp0KHWtpvoXxf1OAQ@mail.gmail.com/
  [2] https://lore.kernel.org/bpf/CAE5sdEg6yUc_Jz50AnUXEEUh6O73yQ1Z6NV2srJnef0ZrQkZew@mail.gmail.com/

Fixes: d83525ca62 ("bpf: introduce bpf_spin_lock")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20240207070102.335167-1-yonghong.song@linux.dev
2024-02-13 11:11:25 -08:00
Daniel Xu
5b268d1ebc bpf: Have bpf_rdonly_cast() take a const pointer
Since 20d59ee551 ("libbpf: add bpf_core_cast() macro"), libbpf is now
exporting a const arg version of bpf_rdonly_cast(). This causes the
following conflicting type error when generating kfunc prototypes from
BTF:

In file included from skeleton/pid_iter.bpf.c:5:
/home/dxu/dev/linux/tools/bpf/bpftool/bootstrap/libbpf/include/bpf/bpf_core_read.h:297:14: error: conflicting types for 'bpf_rdonly_cast'
extern void *bpf_rdonly_cast(const void *obj__ign, __u32 btf_id__k) __ksym __weak;
             ^
./vmlinux.h:135625:14: note: previous declaration is here
extern void *bpf_rdonly_cast(void *obj__ign, u32 btf_id__k) __weak __ksym;

This is b/c the kernel defines bpf_rdonly_cast() with non-const arg.
Since const arg is more permissive and thus backwards compatible, we
change the kernel definition as well to avoid conflicting type errors.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/dfd3823f11ffd2d4c838e961d61ec9ae8a646773.1707080349.git.dxu@dxuuu.xyz
2024-02-13 11:05:26 -08:00
Marco Elver
68bc61c26c bpf: Allow compiler to inline most of bpf_local_storage_lookup()
In various performance profiles of kernels with BPF programs attached,
bpf_local_storage_lookup() appears as a significant portion of CPU
cycles spent. To enable the compiler generate more optimal code, turn
bpf_local_storage_lookup() into a static inline function, where only the
cache insertion code path is outlined

Notably, outlining cache insertion helps avoid bloating callers by
duplicating setting up calls to raw_spin_{lock,unlock}_irqsave() (on
architectures which do not inline spin_lock/unlock, such as x86), which
would cause the compiler produce worse code by deciding to outline
otherwise inlinable functions. The call overhead is neutral, because we
make 2 calls either way: either calling raw_spin_lock_irqsave() and
raw_spin_unlock_irqsave(); or call __bpf_local_storage_insert_cache(),
which calls raw_spin_lock_irqsave(), followed by a tail-call to
raw_spin_unlock_irqsave() where the compiler can perform TCO and (in
optimized uninstrumented builds) turns it into a plain jump. The call to
__bpf_local_storage_insert_cache() can be elided entirely if
cacheit_lockit is a false constant expression.

Based on results from './benchs/run_bench_local_storage.sh' (21 trials,
reboot between each trial; x86 defconfig + BPF, clang 16) this produces
improvements in throughput and latency in the majority of cases, with an
average (geomean) improvement of 8%:

+---- Hashmap Control --------------------
|
| + num keys: 10
| :                                         <before>             | <after>
| +-+ hashmap (control) sequential get    +----------------------+----------------------
|   +- hits throughput                    | 14.789 M ops/s       | 14.745 M ops/s (  ~  )
|   +- hits latency                       | 67.679 ns/op         | 67.879 ns/op   (  ~  )
|   +- important_hits throughput          | 14.789 M ops/s       | 14.745 M ops/s (  ~  )
|
| + num keys: 1000
| :                                         <before>             | <after>
| +-+ hashmap (control) sequential get    +----------------------+----------------------
|   +- hits throughput                    | 12.233 M ops/s       | 12.170 M ops/s (  ~  )
|   +- hits latency                       | 81.754 ns/op         | 82.185 ns/op   (  ~  )
|   +- important_hits throughput          | 12.233 M ops/s       | 12.170 M ops/s (  ~  )
|
| + num keys: 10000
| :                                         <before>             | <after>
| +-+ hashmap (control) sequential get    +----------------------+----------------------
|   +- hits throughput                    | 7.220 M ops/s        | 7.204 M ops/s  (  ~  )
|   +- hits latency                       | 138.522 ns/op        | 138.842 ns/op  (  ~  )
|   +- important_hits throughput          | 7.220 M ops/s        | 7.204 M ops/s  (  ~  )
|
| + num keys: 100000
| :                                         <before>             | <after>
| +-+ hashmap (control) sequential get    +----------------------+----------------------
|   +- hits throughput                    | 5.061 M ops/s        | 5.165 M ops/s  (+2.1%)
|   +- hits latency                       | 198.483 ns/op        | 194.270 ns/op  (-2.1%)
|   +- important_hits throughput          | 5.061 M ops/s        | 5.165 M ops/s  (+2.1%)
|
| + num keys: 4194304
| :                                         <before>             | <after>
| +-+ hashmap (control) sequential get    +----------------------+----------------------
|   +- hits throughput                    | 2.864 M ops/s        | 2.882 M ops/s  (  ~  )
|   +- hits latency                       | 365.220 ns/op        | 361.418 ns/op  (-1.0%)
|   +- important_hits throughput          | 2.864 M ops/s        | 2.882 M ops/s  (  ~  )
|
+---- Local Storage ----------------------
|
| + num_maps: 1
| :                                         <before>             | <after>
| +-+ local_storage cache sequential get  +----------------------+----------------------
|   +- hits throughput                    | 33.005 M ops/s       | 39.068 M ops/s (+18.4%)
|   +- hits latency                       | 30.300 ns/op         | 25.598 ns/op   (-15.5%)
|   +- important_hits throughput          | 33.005 M ops/s       | 39.068 M ops/s (+18.4%)
| :
| :                                         <before>             | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
|   +- hits throughput                    | 37.151 M ops/s       | 44.926 M ops/s (+20.9%)
|   +- hits latency                       | 26.919 ns/op         | 22.259 ns/op   (-17.3%)
|   +- important_hits throughput          | 37.151 M ops/s       | 44.926 M ops/s (+20.9%)
|
| + num_maps: 10
| :                                         <before>             | <after>
| +-+ local_storage cache sequential get  +----------------------+----------------------
|   +- hits throughput                    | 32.288 M ops/s       | 38.099 M ops/s (+18.0%)
|   +- hits latency                       | 30.972 ns/op         | 26.248 ns/op   (-15.3%)
|   +- important_hits throughput          | 3.229 M ops/s        | 3.810 M ops/s  (+18.0%)
| :
| :                                         <before>             | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
|   +- hits throughput                    | 34.473 M ops/s       | 41.145 M ops/s (+19.4%)
|   +- hits latency                       | 29.010 ns/op         | 24.307 ns/op   (-16.2%)
|   +- important_hits throughput          | 12.312 M ops/s       | 14.695 M ops/s (+19.4%)
|
| + num_maps: 16
| :                                         <before>             | <after>
| +-+ local_storage cache sequential get  +----------------------+----------------------
|   +- hits throughput                    | 32.524 M ops/s       | 38.341 M ops/s (+17.9%)
|   +- hits latency                       | 30.748 ns/op         | 26.083 ns/op   (-15.2%)
|   +- important_hits throughput          | 2.033 M ops/s        | 2.396 M ops/s  (+17.9%)
| :
| :                                         <before>             | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
|   +- hits throughput                    | 34.575 M ops/s       | 41.338 M ops/s (+19.6%)
|   +- hits latency                       | 28.925 ns/op         | 24.193 ns/op   (-16.4%)
|   +- important_hits throughput          | 11.001 M ops/s       | 13.153 M ops/s (+19.6%)
|
| + num_maps: 17
| :                                         <before>             | <after>
| +-+ local_storage cache sequential get  +----------------------+----------------------
|   +- hits throughput                    | 28.861 M ops/s       | 32.756 M ops/s (+13.5%)
|   +- hits latency                       | 34.649 ns/op         | 30.530 ns/op   (-11.9%)
|   +- important_hits throughput          | 1.700 M ops/s        | 1.929 M ops/s  (+13.5%)
| :
| :                                         <before>             | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
|   +- hits throughput                    | 31.529 M ops/s       | 36.110 M ops/s (+14.5%)
|   +- hits latency                       | 31.719 ns/op         | 27.697 ns/op   (-12.7%)
|   +- important_hits throughput          | 9.598 M ops/s        | 10.993 M ops/s (+14.5%)
|
| + num_maps: 24
| :                                         <before>             | <after>
| +-+ local_storage cache sequential get  +----------------------+----------------------
|   +- hits throughput                    | 18.602 M ops/s       | 19.937 M ops/s (+7.2%)
|   +- hits latency                       | 53.767 ns/op         | 50.166 ns/op   (-6.7%)
|   +- important_hits throughput          | 0.776 M ops/s        | 0.831 M ops/s  (+7.2%)
| :
| :                                         <before>             | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
|   +- hits throughput                    | 21.718 M ops/s       | 23.332 M ops/s (+7.4%)
|   +- hits latency                       | 46.047 ns/op         | 42.865 ns/op   (-6.9%)
|   +- important_hits throughput          | 6.110 M ops/s        | 6.564 M ops/s  (+7.4%)
|
| + num_maps: 32
| :                                         <before>             | <after>
| +-+ local_storage cache sequential get  +----------------------+----------------------
|   +- hits throughput                    | 14.118 M ops/s       | 14.626 M ops/s (+3.6%)
|   +- hits latency                       | 70.856 ns/op         | 68.381 ns/op   (-3.5%)
|   +- important_hits throughput          | 0.442 M ops/s        | 0.458 M ops/s  (+3.6%)
| :
| :                                         <before>             | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
|   +- hits throughput                    | 17.111 M ops/s       | 17.906 M ops/s (+4.6%)
|   +- hits latency                       | 58.451 ns/op         | 55.865 ns/op   (-4.4%)
|   +- important_hits throughput          | 4.776 M ops/s        | 4.998 M ops/s  (+4.6%)
|
| + num_maps: 100
| :                                         <before>             | <after>
| +-+ local_storage cache sequential get  +----------------------+----------------------
|   +- hits throughput                    | 5.281 M ops/s        | 5.528 M ops/s  (+4.7%)
|   +- hits latency                       | 192.398 ns/op        | 183.059 ns/op  (-4.9%)
|   +- important_hits throughput          | 0.053 M ops/s        | 0.055 M ops/s  (+4.9%)
| :
| :                                         <before>             | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
|   +- hits throughput                    | 6.265 M ops/s        | 6.498 M ops/s  (+3.7%)
|   +- hits latency                       | 161.436 ns/op        | 152.877 ns/op  (-5.3%)
|   +- important_hits throughput          | 1.636 M ops/s        | 1.697 M ops/s  (+3.7%)
|
| + num_maps: 1000
| :                                         <before>             | <after>
| +-+ local_storage cache sequential get  +----------------------+----------------------
|   +- hits throughput                    | 0.355 M ops/s        | 0.354 M ops/s  (  ~  )
|   +- hits latency                       | 2826.538 ns/op       | 2827.139 ns/op (  ~  )
|   +- important_hits throughput          | 0.000 M ops/s        | 0.000 M ops/s  (  ~  )
| :
| :                                         <before>             | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
|   +- hits throughput                    | 0.404 M ops/s        | 0.403 M ops/s  (  ~  )
|   +- hits latency                       | 2481.190 ns/op       | 2487.555 ns/op (  ~  )
|   +- important_hits throughput          | 0.102 M ops/s        | 0.101 M ops/s  (  ~  )

The on_lookup test in {cgrp,task}_ls_recursion.c is removed
because the bpf_local_storage_lookup is no longer traceable
and adding tracepoint will make the compiler generate worse
code: https://lore.kernel.org/bpf/ZcJmok64Xqv6l4ZS@elver.google.com/

Signed-off-by: Marco Elver <elver@google.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240207122626.3508658-1-elver@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-02-11 14:06:24 -08:00
Geliang Tang
947e56f82f bpf, btf: Check btf for register_bpf_struct_ops
Similar to the handling in the functions __register_btf_kfunc_id_set()
and register_btf_id_dtor_kfuncs(), this patch uses the newly added
helper check_btf_kconfigs() to handle module with its btf section
stripped.

While at it, the patch also adds the missed IS_ERR() check to fix the
commit f6be98d199 ("bpf, net: switch to dynamic registration")

Fixes: f6be98d199 ("bpf, net: switch to dynamic registration")
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Link: https://lore.kernel.org/r/69082b9835463fe36f9e354bddf2d0a97df39c2b.1707373307.git.tanggeliang@kylinos.cn
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-02-08 11:37:49 -08:00
Geliang Tang
9e60b0e025 bpf, btf: Add check_btf_kconfigs helper
This patch extracts duplicate code on error path when btf_get_module_btf()
returns NULL from the functions __register_btf_kfunc_id_set() and
register_btf_id_dtor_kfuncs() into a new helper named check_btf_kconfigs()
to check CONFIG_DEBUG_INFO_BTF and CONFIG_DEBUG_INFO_BTF_MODULES in it.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/fa5537fc55f1e4d0bfd686598c81b7ab9dbd82b7.1707373307.git.tanggeliang@kylinos.cn
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-02-08 11:22:56 -08:00
Geliang Tang
b9a395f0f7 bpf, btf: Fix return value of register_btf_id_dtor_kfuncs
The same as __register_btf_kfunc_id_set(), to let the modules with
stripped btf section loaded, this patch changes the return value of
register_btf_id_dtor_kfuncs() too from -ENOENT to 0 when btf is NULL.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Link: https://lore.kernel.org/r/eab65586d7fb0e72f2707d3747c7d4a5d60c823f.1707373307.git.tanggeliang@kylinos.cn
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-02-08 11:05:07 -08:00
Masahiro Yamada
e55dad12ab bpf: Merge two CONFIG_BPF entries
'config BPF' exists in both init/Kconfig and kernel/bpf/Kconfig.

Commit b24abcff91 ("bpf, kconfig: Add consolidated menu entry for bpf
with core options") added the second one to kernel/bpf/Kconfig instead
of moving the existing one.

Merge them together.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20240204075634.32969-1-masahiroy@kernel.org
2024-02-07 16:38:20 -08:00
Kumar Kartikeya Dwivedi
6fceea0fa5 bpf: Transfer RCU lock state between subprog calls
Allow transferring an imbalanced RCU lock state between subprog calls
during verification. This allows patterns where a subprog call returns
with an RCU lock held, or a subprog call releases an RCU lock held by
the caller. Currently, the verifier would end up complaining if the RCU
lock is not released when processing an exit from a subprog, which is
non-ideal if its execution is supposed to be enclosed in an RCU read
section of the caller.

Instead, simply only check whether we are processing exit for frame#0
and do not complain on an active RCU lock otherwise. We only need to
update the check when processing BPF_EXIT insn, as copy_verifier_state
is already set up to do the right thing.

Suggested-by: David Vernet <void@manifault.com>
Tested-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20240205055646.1112186-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-05 20:00:14 -08:00
Kumar Kartikeya Dwivedi
a44b1334aa bpf: Allow calling static subprogs while holding a bpf_spin_lock
Currently, calling any helpers, kfuncs, or subprogs except the graph
data structure (lists, rbtrees) API kfuncs while holding a bpf_spin_lock
is not allowed. One of the original motivations of this decision was to
force the BPF programmer's hand into keeping the bpf_spin_lock critical
section small, and to ensure the execution time of the program does not
increase due to lock waiting times. In addition to this, some of the
helpers and kfuncs may be unsafe to call while holding a bpf_spin_lock.

However, when it comes to subprog calls, atleast for static subprogs,
the verifier is able to explore their instructions during verification.
Therefore, it is similar in effect to having the same code inlined into
the critical section. Hence, not allowing static subprog calls in the
bpf_spin_lock critical section is mostly an annoyance that needs to be
worked around, without providing any tangible benefit.

Unlike static subprog calls, global subprog calls are not safe to permit
within the critical section, as the verifier does not explore them
during verification, therefore whether the same lock will be taken
again, or unlocked, cannot be ascertained.

Therefore, allow calling static subprogs within a bpf_spin_lock critical
section, and only reject it in case the subprog linkage is global.

Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20240204222349.938118-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-05 19:58:47 -08:00
Kui-Feng Lee
df9705eaa0 bpf: Remove an unnecessary check.
The "i" here is always equal to "btf_type_vlen(t)" since
the "for_each_member()" loop never breaks.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240203055119.2235598-1-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-02-05 10:25:08 -08:00
Andrii Nakryiko
1eb986746a bpf: don't emit warnings intended for global subprogs for static subprogs
When btf_prepare_func_args() was generalized to handle both static and
global subprogs, a few warnings/errors that are meant only for global
subprog cases started to be emitted for static subprogs, where they are
sort of expected and irrelavant.

Stop polutting verifier logs with irrelevant scary-looking messages.

Fixes: e26080d0da ("bpf: prepare btf_prepare_func_args() for handling static subprogs")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240202190529.2374377-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-02 18:08:59 -08:00
Andrii Nakryiko
8f13c34087 bpf: handle trusted PTR_TO_BTF_ID_OR_NULL in argument check logic
Add PTR_TRUSTED | PTR_MAYBE_NULL modifiers for PTR_TO_BTF_ID to
check_reg_type() to support passing trusted nullable PTR_TO_BTF_ID
registers into global functions accepting `__arg_trusted __arg_nullable`
arguments. This hasn't been caught earlier because tests were either
passing known non-NULL PTR_TO_BTF_ID registers or known NULL (SCALAR)
registers.

When utilizing this functionality in complicated real-world BPF
application that passes around PTR_TO_BTF_ID_OR_NULL, it became apparent
that verifier rejects valid case because check_reg_type() doesn't handle
this case explicitly. Existing check_reg_type() logic is already
anticipating this combination, so we just need to explicitly list this
combo in the switch statement.

Fixes: e2b3c4ff5d ("bpf: add __arg_trusted global func arg tag")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240202190529.2374377-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-02 18:08:58 -08:00
Eduard Zingerman
6efbde200b bpf: Handle scalar spill vs all MISC in stacksafe()
When check_stack_read_fixed_off() reads value from an spi
all stack slots of which are set to STACK_{MISC,INVALID},
the destination register is set to unbound SCALAR_VALUE.

Exploit this fact by allowing stacksafe() to use a fake
unbound scalar register to compare 'mmmm mmmm' stack value
in old state vs spilled 64-bit scalar in current state
and vice versa.

Veristat results after this patch show some gains:

./veristat -C -e file,prog,states -f 'states_pct>10'  not-opt after
File                     Program                States   (DIFF)
-----------------------  ---------------------  ---------------
bpf_overlay.o            tail_rev_nodeport_lb4    -45 (-15.85%)
bpf_xdp.o                tail_lb_ipv4            -541 (-19.57%)
pyperf100.bpf.o          on_event                -680 (-10.42%)
pyperf180.bpf.o          on_event               -2164 (-19.62%)
pyperf600.bpf.o          on_event               -9799 (-24.84%)
strobemeta.bpf.o         on_event               -9157 (-65.28%)
xdp_synproxy_kern.bpf.o  syncookie_tc             -54 (-19.29%)
xdp_synproxy_kern.bpf.o  syncookie_xdp            -74 (-24.50%)

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240127175237.526726-6-maxtram95@gmail.com
2024-02-02 13:22:14 -08:00
Maxim Mikityanskiy
c1e6148cb4 bpf: Preserve boundaries and track scalars on narrowing fill
When the width of a fill is smaller than the width of the preceding
spill, the information about scalar boundaries can still be preserved,
as long as it's coerced to the right width (done by coerce_reg_to_size).
Even further, if the actual value fits into the fill width, the ID can
be preserved as well for further tracking of equal scalars.

Implement the above improvements, which makes narrowing fills behave the
same as narrowing spills and MOVs between registers.

Two tests are adjusted to accommodate for endianness differences and to
take into account that it's now allowed to do a narrowing fill from the
least significant bits.

reg_bounds_sync is added to coerce_reg_to_size to correctly adjust
umin/umax boundaries after the var_off truncation, for example, a 64-bit
value 0xXXXXXXXX00000000, when read as a 32-bit, gets umin = 0, umax =
0xFFFFFFFF, var_off = (0x0; 0xffffffff00000000), which needs to be
synced down to umax = 0, otherwise reg_bounds_sanity_check doesn't pass.

Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240127175237.526726-4-maxtram95@gmail.com
2024-02-02 13:22:14 -08:00
Maxim Mikityanskiy
e67ddd9b1c bpf: Track spilled unbounded scalars
Support the pattern where an unbounded scalar is spilled to the stack,
then boundary checks are performed on the src register, after which the
stack frame slot is refilled into a register.

Before this commit, the verifier didn't treat the src register and the
stack slot as related if the src register was an unbounded scalar. The
register state wasn't copied, the id wasn't preserved, and the stack
slot was marked as STACK_MISC. Subsequent boundary checks on the src
register wouldn't result in updating the boundaries of the spilled
variable on the stack.

After this commit, the verifier will preserve the bond between src and
dst even if src is unbounded, which permits to do boundary checks on src
and refill dst later, still remembering its boundaries. Such a pattern
is sometimes generated by clang when compiling complex long functions.

One test is adjusted to reflect that now unbounded scalars are tracked.

Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240127175237.526726-2-maxtram95@gmail.com
2024-02-02 13:22:14 -08:00
Matt Bobrowski
1581e5118e bpf: Minor clean-up to sleepable_lsm_hooks BTF set
There's already one main CONFIG_SECURITY_NETWORK ifdef block within
the sleepable_lsm_hooks BTF set. Consolidate this duplicated ifdef
block as there's no need for it and all things guarded by it should
remain in one place in this specific context.

Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/Zbt1smz43GDMbVU3@google.com
2024-02-01 18:37:45 +01:00
Daniel Xu
6f3189f38a bpf: treewide: Annotate BPF kfuncs in BTF
This commit marks kfuncs as such inside the .BTF_ids section. The upshot
of these annotations is that we'll be able to automatically generate
kfunc prototypes for downstream users. The process is as follows:

1. In source, use BTF_KFUNCS_START/END macro pair to mark kfuncs
2. During build, pahole injects into BTF a "bpf_kfunc" BTF_DECL_TAG for
   each function inside BTF_KFUNCS sets
3. At runtime, vmlinux or module BTF is made available in sysfs
4. At runtime, bpftool (or similar) can look at provided BTF and
   generate appropriate prototypes for functions with "bpf_kfunc" tag

To ensure future kfunc are similarly tagged, we now also return error
inside kfunc registration for untagged kfuncs. For vmlinux kfuncs,
we also WARN(), as initcall machinery does not handle errors.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Acked-by: Benjamin Tissoires <bentiss@kernel.org>
Link: https://lore.kernel.org/r/e55150ceecbf0a5d961e608941165c0bee7bc943.1706491398.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-31 20:40:56 -08:00
Andrii Nakryiko
8f2b44cd9d bpf: add arg:nullable tag to be combined with trusted pointers
Add ability to mark arg:trusted arguments with optional arg:nullable
tag to mark it as PTR_TO_BTF_ID_OR_NULL variant, which will allow
callers to pass NULL, and subsequently will force global subprog's code
to do NULL check. This allows to have "optional" PTR_TO_BTF_ID values
passed into global subprogs.

For now arg:nullable cannot be combined with anything else.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240130000648.2144827-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-30 09:41:50 -08:00
Andrii Nakryiko
e2b3c4ff5d bpf: add __arg_trusted global func arg tag
Add support for passing PTR_TO_BTF_ID registers to global subprogs.
Currently only PTR_TRUSTED flavor of PTR_TO_BTF_ID is supported.
Non-NULL semantics is assumed, so caller will be forced to prove
PTR_TO_BTF_ID can't be NULL.

Note, we disallow global subprogs to destroy passed in PTR_TO_BTF_ID
arguments, even the trusted one. We achieve that by not setting
ref_obj_id when validating subprog code. This basically enforces (in
Rust terms) borrowing semantics vs move semantics. Borrowing semantics
seems to be a better fit for isolated global subprog validation
approach.

Implementation-wise, we utilize existing logic for matching
user-provided BTF type to kernel-side BTF type, used by BPF CO-RE logic
and following same matching rules. We enforce a unique match for types.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240130000648.2144827-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-30 09:41:50 -08:00
Haiyue Wang
6668e818f9 bpf,token: Use BIT_ULL() to convert the bit mask
Replace the '(1ULL << *)' with the macro BIT_ULL(nr).

Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240127134901.3698613-1-haiyue.wang@intel.com
2024-01-29 20:04:55 -08:00
Andrii Nakryiko
add9c58cd4 bpf: move arg:ctx type enforcement check inside the main logic loop
Now that bpf and bpf-next trees converged and we don't run the risk of
merge conflicts, move btf_validate_prog_ctx_type() into its most logical
place inside the main logic loop.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240125205510.3642094-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-29 12:34:13 -08:00
Kui-Feng Lee
e6be8cd5d3 bpf: Fix error checks against bpf_get_btf_vmlinux().
In bpf_struct_ops_map_alloc, it needs to check for NULL in the returned
pointer of bpf_get_btf_vmlinux() when CONFIG_DEBUG_INFO_BTF is not set.
ENOTSUPP is used to preserve the same behavior before the
struct_ops kmod support.

In the function check_struct_ops_btf_id(), instead of redoing the
bpf_get_btf_vmlinux() that has already been done in syscall.c, the fix
here is to check for prog->aux->attach_btf_id.
BPF_PROG_TYPE_STRUCT_OPS must require attach_btf_id and syscall.c
guarantees a valid attach_btf as long as attach_btf_id is set.
When attach_btf_id is not set, this patch returns -ENOTSUPP
because it is what the selftest in test_libbpf_probe_prog_types()
and libbpf_probes.c are expecting for feature probing purpose.

Changes from v1:

 - Remove an unnecessary NULL check in check_struct_ops_btf_id()

Reported-by: syzbot+88f0aafe5f950d7489d7@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/00000000000040d68a060fc8db8c@google.com/
Reported-by: syzbot+1336f3d4b10bcda75b89@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/00000000000026353b060fc21c07@google.com/
Fixes: fcc2c1fb06 ("bpf: pass attached BTF to the bpf_struct_ops subsystem")
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240126023113.1379504-1-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-25 20:49:59 -08:00
Andrii Nakryiko
6c1752e0b6 bpf: Support symbolic BPF FS delegation mount options
Besides already supported special "any" value and hex bit mask, support
string-based parsing of delegation masks based on exact enumerator
names. Utilize BTF information of `enum bpf_cmd`, `enum bpf_map_type`,
`enum bpf_prog_type`, and `enum bpf_attach_type` types to find supported
symbolic names (ignoring __MAX_xxx guard values and stripping repetitive
prefixes like BPF_ for cmd and attach types, BPF_MAP_TYPE_ for maps, and
BPF_PROG_TYPE_ for prog types). The case doesn't matter, but it is
normalized to lower case in mount option output. So "PROG_LOAD",
"prog_load", and "MAP_create" are all valid values to specify for
delegate_cmds options, "array" is among supported for map types, etc.

Besides supporting string values, we also support multiple values
specified at the same time, using colon (':') separator.

There are corresponding changes on bpf_show_options side to use known
values to print them in human-readable format, falling back to hex mask
printing, if there are any unrecognized bits. This shouldn't be
necessary when enum BTF information is present, but in general we should
always be able to fall back to this even if kernel was built without BTF.
As mentioned, emitted symbolic names are normalized to be all lower case.

Example below shows various ways to specify delegate_cmds options
through mount command and how mount options are printed back:

12/14 14:39:07.604
vmuser@archvm:~/local/linux/tools/testing/selftests/bpf
$ mount | rg token

  $ sudo mkdir -p /sys/fs/bpf/token
  $ sudo mount -t bpf bpffs /sys/fs/bpf/token \
               -o delegate_cmds=prog_load:MAP_CREATE \
               -o delegate_progs=kprobe \
               -o delegate_attachs=xdp
  $ mount | grep token
  bpffs on /sys/fs/bpf/token type bpf (rw,relatime,delegate_cmds=map_create:prog_load,delegate_progs=kprobe,delegate_attachs=xdp)

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-20-andrii@kernel.org
2024-01-24 16:21:02 -08:00
Andrii Nakryiko
aeaa97b006 bpf: Fail BPF_TOKEN_CREATE if no delegation option was set on BPF FS
It's quite confusing in practice when it's possible to successfully
create a BPF token from BPF FS that didn't have any of delegate_xxx
mount options set up. While it's not wrong, it's actually more
meaningful to reject BPF_TOKEN_CREATE with specific error code (-ENOENT)
to let user-space know that no token delegation is setup up.

So, instead of creating empty BPF token that will be always ignored
because it doesn't have any of the allow_xxx bits set, reject it with
-ENOENT. If we ever need empty BPF token to be possible, we can support
that with extra flag passed into BPF_TOKEN_CREATE.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-19-andrii@kernel.org
2024-01-24 16:21:02 -08:00
Andrii Nakryiko
f568a3d49a bpf,lsm: Add BPF token LSM hooks
Wire up bpf_token_create and bpf_token_free LSM hooks, which allow to
allocate LSM security blob (we add `void *security` field to struct
bpf_token for that), but also control who can instantiate BPF token.
This follows existing pattern for BPF map and BPF prog.

Also add security_bpf_token_allow_cmd() and security_bpf_token_capable()
LSM hooks that allow LSM implementation to control and negate (if
necessary) BPF token's delegation of a specific bpf_cmd and capability,
respectively.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-12-andrii@kernel.org
2024-01-24 16:21:01 -08:00
Andrii Nakryiko
a2431c7eab bpf,lsm: Refactor bpf_map_alloc/bpf_map_free LSM hooks
Similarly to bpf_prog_alloc LSM hook, rename and extend bpf_map_alloc
hook into bpf_map_create, taking not just struct bpf_map, but also
bpf_attr and bpf_token, to give a fuller context to LSMs.

Unlike bpf_prog_alloc, there is no need to move the hook around, as it
currently is firing right before allocating BPF map ID and FD, which
seems to be a sweet spot.

But like bpf_prog_alloc/bpf_prog_free combo, make sure that bpf_map_free
LSM hook is called even if bpf_map_create hook returned error, as if few
LSMs are combined together it could be that one LSM successfully
allocated security blob for its needs, while subsequent LSM rejected BPF
map creation. The former LSM would still need to free up LSM blob, so we
need to ensure security_bpf_map_free() is called regardless of the
outcome.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-11-andrii@kernel.org
2024-01-24 16:21:01 -08:00
Andrii Nakryiko
1b67772e4e bpf,lsm: Refactor bpf_prog_alloc/bpf_prog_free LSM hooks
Based on upstream discussion ([0]), rework existing
bpf_prog_alloc_security LSM hook. Rename it to bpf_prog_load and instead
of passing bpf_prog_aux, pass proper bpf_prog pointer for a full BPF
program struct. Also, we pass bpf_attr union with all the user-provided
arguments for BPF_PROG_LOAD command.  This will give LSMs as much
information as we can basically provide.

The hook is also BPF token-aware now, and optional bpf_token struct is
passed as a third argument. bpf_prog_load LSM hook is called after
a bunch of sanity checks were performed, bpf_prog and bpf_prog_aux were
allocated and filled out, but right before performing full-fledged BPF
verification step.

bpf_prog_free LSM hook is now accepting struct bpf_prog argument, for
consistency. SELinux code is adjusted to all new names, types, and
signatures.

Note, given that bpf_prog_load (previously bpf_prog_alloc) hook can be
used by some LSMs to allocate extra security blob, but also by other
LSMs to reject BPF program loading, we need to make sure that
bpf_prog_free LSM hook is called after bpf_prog_load/bpf_prog_alloc one
*even* if the hook itself returned error. If we don't do that, we run
the risk of leaking memory. This seems to be possible today when
combining SELinux and BPF LSM, as one example, depending on their
relative ordering.

Also, for BPF LSM setup, add bpf_prog_load and bpf_prog_free to
sleepable LSM hooks list, as they are both executed in sleepable
context. Also drop bpf_prog_load hook from untrusted, as there is no
issue with refcount or anything else anymore, that originally forced us
to add it to untrusted list in c0c852dd18 ("bpf: Do not mark certain LSM
hook arguments as trusted"). We now trigger this hook much later and it
should not be an issue anymore.

  [0] https://lore.kernel.org/bpf/9fe88aef7deabbe87d3fc38c4aea3c69.paul@paul-moore.com/

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-10-andrii@kernel.org
2024-01-24 16:21:01 -08:00
Andrii Nakryiko
d79a354975 bpf: Consistently use BPF token throughout BPF verifier logic
Remove remaining direct queries to perfmon_capable() and bpf_capable()
in BPF verifier logic and instead use BPF token (if available) to make
decisions about privileges.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-9-andrii@kernel.org
2024-01-24 16:21:01 -08:00
Andrii Nakryiko
bbc1d24724 bpf: Take into account BPF token when fetching helper protos
Instead of performing unconditional system-wide bpf_capable() and
perfmon_capable() calls inside bpf_base_func_proto() function (and other
similar ones) to determine eligibility of a given BPF helper for a given
program, use previously recorded BPF token during BPF_PROG_LOAD command
handling to inform the decision.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-8-andrii@kernel.org
2024-01-24 16:21:01 -08:00
Andrii Nakryiko
caf8f28e03 bpf: Add BPF token support to BPF_PROG_LOAD command
Add basic support of BPF token to BPF_PROG_LOAD. BPF_F_TOKEN_FD flag
should be set in prog_flags field when providing prog_token_fd.

Wire through a set of allowed BPF program types and attach types,
derived from BPF FS at BPF token creation time. Then make sure we
perform bpf_token_capable() checks everywhere where it's relevant.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-7-andrii@kernel.org
2024-01-24 16:21:01 -08:00
Andrii Nakryiko
9ea7c4bf17 bpf: Add BPF token support to BPF_BTF_LOAD command
Accept BPF token FD in BPF_BTF_LOAD command to allow BTF data loading
through delegated BPF token. BPF_F_TOKEN_FD flag has to be specified
when passing BPF token FD. Given BPF_BTF_LOAD command didn't have flags
field before, we also add btf_flags field.

BTF loading is a pretty straightforward operation, so as long as BPF
token is created with allow_cmds granting BPF_BTF_LOAD command, kernel
proceeds to parsing BTF data and creating BTF object.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-6-andrii@kernel.org
2024-01-24 16:21:01 -08:00
Andrii Nakryiko
a177fc2bf6 bpf: Add BPF token support to BPF_MAP_CREATE command
Allow providing token_fd for BPF_MAP_CREATE command to allow controlled
BPF map creation from unprivileged process through delegated BPF token.
New BPF_F_TOKEN_FD flag is added to specify together with BPF token FD
for BPF_MAP_CREATE command.

Wire through a set of allowed BPF map types to BPF token, derived from
BPF FS at BPF token creation time. This, in combination with allowed_cmds
allows to create a narrowly-focused BPF token (controlled by privileged
agent) with a restrictive set of BPF maps that application can attempt
to create.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-5-andrii@kernel.org
2024-01-24 16:21:01 -08:00
Andrii Nakryiko
35f96de041 bpf: Introduce BPF token object
Add new kind of BPF kernel object, BPF token. BPF token is meant to
allow delegating privileged BPF functionality, like loading a BPF
program or creating a BPF map, from privileged process to a *trusted*
unprivileged process, all while having a good amount of control over which
privileged operations could be performed using provided BPF token.

This is achieved through mounting BPF FS instance with extra delegation
mount options, which determine what operations are delegatable, and also
constraining it to the owning user namespace (as mentioned in the
previous patch).

BPF token itself is just a derivative from BPF FS and can be created
through a new bpf() syscall command, BPF_TOKEN_CREATE, which accepts BPF
FS FD, which can be attained through open() API by opening BPF FS mount
point. Currently, BPF token "inherits" delegated command, map types,
prog type, and attach type bit sets from BPF FS as is. In the future,
having an BPF token as a separate object with its own FD, we can allow
to further restrict BPF token's allowable set of things either at the
creation time or after the fact, allowing the process to guard itself
further from unintentionally trying to load undesired kind of BPF
programs. But for now we keep things simple and just copy bit sets as is.

When BPF token is created from BPF FS mount, we take reference to the
BPF super block's owning user namespace, and then use that namespace for
checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
capabilities that are normally only checked against init userns (using
capable()), but now we check them using ns_capable() instead (if BPF
token is provided). See bpf_token_capable() for details.

Such setup means that BPF token in itself is not sufficient to grant BPF
functionality. User namespaced process has to *also* have necessary
combination of capabilities inside that user namespace. So while
previously CAP_BPF was useless when granted within user namespace, now
it gains a meaning and allows container managers and sys admins to have
a flexible control over which processes can and need to use BPF
functionality within the user namespace (i.e., container in practice).
And BPF FS delegation mount options and derived BPF tokens serve as
a per-container "flag" to grant overall ability to use bpf() (plus further
restrict on which parts of bpf() syscalls are treated as namespaced).

Note also, BPF_TOKEN_CREATE command itself requires ns_capable(CAP_BPF)
within the BPF FS owning user namespace, rounding up the ns_capable()
story of BPF token. Also creating BPF token in init user namespace is
currently not supported, given BPF token doesn't have any effect in init
user namespace anyways.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-4-andrii@kernel.org
2024-01-24 16:21:01 -08:00
Andrii Nakryiko
6fe01d3cbb bpf: Add BPF token delegation mount options to BPF FS
Add few new mount options to BPF FS that allow to specify that a given
BPF FS instance allows creation of BPF token (added in the next patch),
and what sort of operations are allowed under BPF token. As such, we get
4 new mount options, each is a bit mask
  - `delegate_cmds` allow to specify which bpf() syscall commands are
    allowed with BPF token derived from this BPF FS instance;
  - if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
    a set of allowable BPF map types that could be created with BPF token;
  - if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
    a set of allowable BPF program types that could be loaded with BPF token;
  - if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
    a set of allowable BPF program attach types that could be loaded with
    BPF token; delegate_progs and delegate_attachs are meant to be used
    together, as full BPF program type is, in general, determined
    through both program type and program attach type.

Currently, these mount options accept the following forms of values:
  - a special value "any", that enables all possible values of a given
  bit set;
  - numeric value (decimal or hexadecimal, determined by kernel
  automatically) that specifies a bit mask value directly;
  - all the values for a given mount option are combined, if specified
  multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
  delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
  mask.

Ideally, more convenient (for humans) symbolic form derived from
corresponding UAPI enums would be accepted (e.g., `-o
delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
it requires a bunch of UAPI header churn, so I postponed it until this
feature lands upstream or at least there is a definite consensus that
this feature is acceptable and is going to make it, just to minimize
amount of wasted effort and not increase amount of non-essential code to
be reviewed.

Attentive reader will notice that BPF FS is now marked as
FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
user namespace as long as the process has sufficient *namespaced*
capabilities within that user namespace. But in reality we still
restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
to allow creating BPF FS context object (i.e., fsopen("bpf")) from
inside unprivileged process inside non-init userns, to capture that
userns as the owning userns. It will still be required to pass this
context object back to privileged process to instantiate and mount it.

This manipulation is important, because capturing non-init userns as the
owning userns of BPF FS instance (super block) allows to use that userns
to constraint BPF token to that userns later on (see next patch). So
creating BPF FS with delegation inside unprivileged userns will restrict
derived BPF token objects to only "work" inside that intended userns,
making it scoped to a intended "container". Also, setting these
delegation options requires capable(CAP_SYS_ADMIN), so unprivileged
process cannot set this up without involvement of a privileged process.

There is a set of selftests at the end of the patch set that simulates
this sequence of steps and validates that everything works as intended.
But careful review is requested to make sure there are no missed gaps in
the implementation and testing.

This somewhat subtle set of aspects is the result of previous
discussions ([0]) about various user namespace implications and
interactions with BPF token functionality and is necessary to contain
BPF token inside intended user namespace.

  [0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-3-andrii@kernel.org
2024-01-24 16:21:00 -08:00
Andrii Nakryiko
ed1ad5a741 bpf: Align CAP_NET_ADMIN checks with bpf_capable() approach
Within BPF syscall handling code CAP_NET_ADMIN checks stand out a bit
compared to CAP_BPF and CAP_PERFMON checks. For the latter, CAP_BPF or
CAP_PERFMON are checked first, but if they are not set, CAP_SYS_ADMIN
takes over and grants whatever part of BPF syscall is required.

Similar kind of checks that involve CAP_NET_ADMIN are not so consistent.
One out of four uses does follow CAP_BPF/CAP_PERFMON model: during
BPF_PROG_LOAD, if the type of BPF program is "network-related" either
CAP_NET_ADMIN or CAP_SYS_ADMIN is required to proceed.

But in three other cases CAP_NET_ADMIN is required even if CAP_SYS_ADMIN
is set:
  - when creating DEVMAP/XDKMAP/CPU_MAP maps;
  - when attaching CGROUP_SKB programs;
  - when handling BPF_PROG_QUERY command.

This patch is changing the latter three cases to follow BPF_PROG_LOAD
model, that is allowing to proceed under either CAP_NET_ADMIN or
CAP_SYS_ADMIN.

This also makes it cleaner in subsequent BPF token patches to switch
wholesomely to a generic bpf_token_capable(int cap) check, that always
falls back to CAP_SYS_ADMIN if requested capability is missing.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-2-andrii@kernel.org
2024-01-24 16:21:00 -08:00
Kui-Feng Lee
7c81c2490c bpf: export btf_ctx_access to modules.
The module requires the use of btf_ctx_access() to invoke
bpf_tracing_btf_ctx_access() from a module. This function is valuable for
implementing validation functions that ensure proper access to ctx.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-14-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-23 17:12:52 -08:00
Kui-Feng Lee
f6be98d199 bpf, net: switch to dynamic registration
Replace the static list of struct_ops types with per-btf struct_ops_tab to
enable dynamic registration.

Both bpf_dummy_ops and bpf_tcp_ca now utilize the registration function
instead of being listed in bpf_struct_ops_types.h.

Cc: netdev@vger.kernel.org
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-12-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-23 17:12:46 -08:00
Kui-Feng Lee
612d087d4b bpf: validate value_type
A value_type should consist of three components: refcnt, state, and data.
refcnt and state has been move to struct bpf_struct_ops_common_value to
make it easier to check the value type.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-11-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-23 16:37:45 -08:00
Kui-Feng Lee
e3f87fdfed bpf: hold module refcnt in bpf_struct_ops map creation and prog verification.
To ensure that a module remains accessible whenever a struct_ops object of
a struct_ops type provided by the module is still in use.

struct bpf_struct_ops_map doesn't hold a refcnt to btf anymore since a
module will hold a refcnt to it's btf already. But, struct_ops programs are
different. They hold their associated btf, not the module since they need
only btf to assure their types (signatures).

However, verifier holds the refcnt of the associated module of a struct_ops
type temporarily when verify a struct_ops prog. Verifier needs the help
from the verifier operators (struct bpf_verifier_ops) provided by the owner
module to verify data access of a prog, provide information, and generate
code.

This patch also add a count of links (links_cnt) to bpf_struct_ops_map. It
avoids bpf_struct_ops_map_put_progs() from accessing btf after calling
module_put() in bpf_struct_ops_map_free().

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-10-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-23 16:37:44 -08:00
Kui-Feng Lee
fcc2c1fb06 bpf: pass attached BTF to the bpf_struct_ops subsystem
Pass the fd of a btf from the userspace to the bpf() syscall, and then
convert the fd into a btf. The btf is generated from the module that
defines the target BPF struct_ops type.

In order to inform the kernel about the module that defines the target
struct_ops type, the userspace program needs to provide a btf fd for the
respective module's btf. This btf contains essential information on the
types defined within the module, including the target struct_ops type.

A btf fd must be provided to the kernel for struct_ops maps and for the bpf
programs attached to those maps.

In the case of the bpf programs, the attach_btf_obj_fd parameter is passed
as part of the bpf_attr and is converted into a btf. This btf is then
stored in the prog->aux->attach_btf field. Here, it just let the verifier
access attach_btf directly.

In the case of struct_ops maps, a btf fd is passed as value_type_btf_obj_fd
of bpf_attr. The bpf_struct_ops_map_alloc() function converts the fd to a
btf and stores it as st_map->btf. A flag BPF_F_VTYPE_BTF_OBJ_FD is added
for map_flags to indicate that the value of value_type_btf_obj_fd is set.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-9-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-23 16:37:44 -08:00
Kui-Feng Lee
689423db3b bpf: lookup struct_ops types from a given module BTF.
This is a preparation for searching for struct_ops types from a specified
module. BTF is always btf_vmlinux now. This patch passes a pointer of BTF
to bpf_struct_ops_find_value() and bpf_struct_ops_find(). Once the new
registration API of struct_ops types is used, other BTFs besides
btf_vmlinux can also be passed to them.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-8-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-23 16:37:44 -08:00
Kui-Feng Lee
1338b93346 bpf: pass btf object id in bpf_map_info.
Include btf object id (btf_obj_id) in bpf_map_info so that tools (ex:
bpftools struct_ops dump) know the correct btf from the kernel to look up
type information of struct_ops types.

Since struct_ops types can be defined and registered in a module. The
type information of a struct_ops type are defined in the btf of the
module defining it.  The userspace tools need to know which btf is for
the module defining a struct_ops type.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-7-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-23 16:37:44 -08:00
Kui-Feng Lee
47f4f657ac bpf: make struct_ops_map support btfs other than btf_vmlinux.
Once new struct_ops can be registered from modules, btf_vmlinux is no
longer the only btf that struct_ops_map would face.  st_map should remember
what btf it should use to get type information.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-6-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-23 16:37:44 -08:00
Kui-Feng Lee
e61995111a bpf: add struct_ops_tab to btf.
Maintain a registry of registered struct_ops types in the per-btf (module)
struct_ops_tab. This registry allows for easy lookup of struct_ops types
that are registered by a specific module.

It is a preparation work for supporting kernel module struct_ops in a
latter patch. Each struct_ops will be registered under its own kernel
module btf and will be stored in the newly added btf->struct_ops_tab. The
bpf verifier and bpf syscall (e.g. prog and map cmd) can find the
struct_ops and its btf type/size/id... information from
btf->struct_ops_tab.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-5-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-23 16:37:44 -08:00
Kui-Feng Lee
4c5763ed99 bpf, net: introduce bpf_struct_ops_desc.
Move some of members of bpf_struct_ops to bpf_struct_ops_desc.  type_id is
unavailabe in bpf_struct_ops anymore. Modules should get it from the btf
received by kmod's init function.

Cc: netdev@vger.kernel.org
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-4-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-23 16:37:44 -08:00
Kui-Feng Lee
9567839538 bpf: get type information with BTF_ID_LIST
Get ready to remove bpf_struct_ops_init() in the future. By using
BTF_ID_LIST, it is possible to gather type information while building
instead of runtime.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-3-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-23 16:37:43 -08:00
Kui-Feng Lee
3b1f89e747 bpf: refactory struct_ops type initialization to a function.
Move the majority of the code to bpf_struct_ops_init_one(), which can then
be utilized for the initialization of newly registered dynamically
allocated struct_ops types in the following patches.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-2-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-23 16:37:43 -08:00
Jiri Olsa
d5c16492c6 bpf: Add cookie to perf_event bpf_link_info records
At the moment we don't store cookie for perf_event probes,
while we do that for the rest of the probes.

Adding cookie fields to struct bpf_link_info perf event
probe records:

  perf_event.uprobe
  perf_event.kprobe
  perf_event.tracepoint
  perf_event.perf_event

And the code to store that in bpf_link_info struct.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20240119110505.400573-2-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 16:05:27 -08:00
Hao Sun
2ce793ebe2 bpf: Refactor ptr alu checking rules to allow alu explicitly
Current checking rules are structured to disallow alu on particular ptr
types explicitly, so default cases are allowed implicitly. This may lead
to newly added ptr types being allowed unexpectedly. So restruture it to
allow alu explicitly. The tradeoff is mainly a bit more cases added in
the switch. The following table from Eduard summarizes the rules:

        | Pointer type        | Arithmetics allowed |
        |---------------------+---------------------|
        | PTR_TO_CTX          | yes                 |
        | CONST_PTR_TO_MAP    | conditionally       |
        | PTR_TO_MAP_VALUE    | yes                 |
        | PTR_TO_MAP_KEY      | yes                 |
        | PTR_TO_STACK        | yes                 |
        | PTR_TO_PACKET_META  | yes                 |
        | PTR_TO_PACKET       | yes                 |
        | PTR_TO_PACKET_END   | no                  |
        | PTR_TO_FLOW_KEYS    | conditionally       |
        | PTR_TO_SOCKET       | no                  |
        | PTR_TO_SOCK_COMMON  | no                  |
        | PTR_TO_TCP_SOCK     | no                  |
        | PTR_TO_TP_BUFFER    | yes                 |
        | PTR_TO_XDP_SOCK     | no                  |
        | PTR_TO_BTF_ID       | yes                 |
        | PTR_TO_MEM          | yes                 |
        | PTR_TO_BUF          | yes                 |
        | PTR_TO_FUNC         | yes                 |
        | CONST_PTR_TO_DYNPTR | yes                 |

The refactored rules are equivalent to the original one. Note that
PTR_TO_FUNC and CONST_PTR_TO_DYNPTR are not reject here because: (1)
check_mem_access() rejects load/store on those ptrs, and those ptrs
with offset passing to calls are rejected check_func_arg_reg_off();
(2) someone may rely on the verifier not rejecting programs earily.

Signed-off-by: Hao Sun <sunhao.th@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240117094012.36798-1-sunhao.th@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 15:01:39 -08:00
Yonghong Song
9a4c57f52b bpf: Track aligned st store as imprecise spilled registers
With patch set [1], precision backtracing supports register spill/fill
to/from the stack. The patch [2] allows initial imprecise register spill
with content 0. This is a common case for cpuv3 and lower for
initializing the stack variables with pattern
  r1 = 0
  *(u64 *)(r10 - 8) = r1
and the [2] has demonstrated good verification improvement.

For cpuv4, the initialization could be
  *(u64 *)(r10 - 8) = 0
The current verifier marks the r10-8 contents with STACK_ZERO.
Similar to [2], let us permit the above insn to behave like
imprecise register spill which can reduce number of verified states.
The change is in function check_stack_write_fixed_off().

Before this patch, spilled zero will be marked as STACK_ZERO
which can provide precise values. In check_stack_write_var_off(),
STACK_ZERO will be maintained if writing a const zero
so later it can provide precise values if needed.

The above handling of '*(u64 *)(r10 - 8) = 0' as a spill
will have issues in check_stack_write_var_off() as the spill
will be converted to STACK_MISC and the precise value 0
is lost. To fix this issue, if the spill slots with const
zero and the BPF_ST write also with const zero, the spill slots
are preserved, which can later provide precise values
if needed. Without the change in check_stack_write_var_off(),
the test_verifier subtest 'BPF_ST_MEM stack imm zero, variable offset'
will fail.

I checked cpuv3 and cpuv4 with and without this patch with veristat.
There is no state change for cpuv3 since '*(u64 *)(r10 - 8) = 0'
is only generated with cpuv4.

For cpuv4:
$ ../veristat -C old.cpuv4.csv new.cpuv4.csv -e file,prog,insns,states -f 'insns_diff!=0'
File                                        Program              Insns (A)  Insns (B)  Insns    (DIFF)  States (A)  States (B)  States (DIFF)
------------------------------------------  -------------------  ---------  ---------  ---------------  ----------  ----------  -------------
local_storage_bench.bpf.linked3.o           get_local                  228        168    -60 (-26.32%)          17          14   -3 (-17.65%)
pyperf600_bpf_loop.bpf.linked3.o            on_event                  6066       4889  -1177 (-19.40%)         403         321  -82 (-20.35%)
test_cls_redirect.bpf.linked3.o             cls_redirect             35483      35387     -96 (-0.27%)        2179        2177    -2 (-0.09%)
test_l4lb_noinline.bpf.linked3.o            balancer_ingress          4494       4522     +28 (+0.62%)         217         219    +2 (+0.92%)
test_l4lb_noinline_dynptr.bpf.linked3.o     balancer_ingress          1432       1455     +23 (+1.61%)          92          94    +2 (+2.17%)
test_xdp_noinline.bpf.linked3.o             balancer_ingress_v6       3462       3458      -4 (-0.12%)         216         216    +0 (+0.00%)
verifier_iterating_callbacks.bpf.linked3.o  widening                    52         41    -11 (-21.15%)           4           3   -1 (-25.00%)
xdp_synproxy_kern.bpf.linked3.o             syncookie_tc             12412      11719    -693 (-5.58%)         345         330   -15 (-4.35%)
xdp_synproxy_kern.bpf.linked3.o             syncookie_xdp            12478      11794    -684 (-5.48%)         346         331   -15 (-4.34%)

test_l4lb_noinline and test_l4lb_noinline_dynptr has minor regression, but
pyperf600_bpf_loop and local_storage_bench gets pretty good improvement.

  [1] https://lore.kernel.org/all/20231205184248.1502704-1-andrii@kernel.org/
  [2] https://lore.kernel.org/all/20231205184248.1502704-9-andrii@kernel.org/

Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Tested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240110051348.2737007-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:23 -08:00
Maxim Mikityanskiy
8ecfc371d8 bpf: Assign ID to scalars on spill
Currently, when a scalar bounded register is spilled to the stack, its
ID is preserved, but only if was already assigned, i.e. if this register
was MOVed before.

Assign an ID on spill if none is set, so that equal scalars could be
tracked if a register is spilled to the stack and filled into another
register.

One test is adjusted to reflect the change in register IDs.

Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240108205209.838365-9-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:23 -08:00
Maxim Mikityanskiy
87e51ac6cb bpf: Add the get_reg_width function
Put calculation of the register value width into a dedicated function.
This function will also be used in a following commit.

Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Link: https://lore.kernel.org/r/20240108205209.838365-8-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:23 -08:00
Maxim Mikityanskiy
8e0e074aaf bpf: Add the assign_scalar_id_before_mov function
Extract the common code that generates a register ID for src_reg before
MOV if needed into a new function. This function will also be used in
a following commit.

Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240108205209.838365-7-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:22 -08:00
Eduard Zingerman
d5b892fd60 bpf: make infinite loop detection in is_state_visited() exact
Current infinite loops detection mechanism is speculative:
- first, states_maybe_looping() check is done which simply does memcmp
  for R1-R10 in current frame;
- second, states_equal(..., exact=false) is called. With exact=false
  states_equal() would compare scalars for equality only if in old
  state scalar has precision mark.

Such logic might be problematic if compiler makes some unlucky stack
spill/fill decisions. An artificial example of a false positive looks
as follows:

        r0 = ... unknown scalar ...
        r0 &= 0xff;
        *(u64 *)(r10 - 8) = r0;
        r0 = 0;
    loop:
        r0 = *(u64 *)(r10 - 8);
        if r0 > 10 goto exit_;
        r0 += 1;
        *(u64 *)(r10 - 8) = r0;
        r0 = 0;
        goto loop;

This commit updates call to states_equal to use exact=true, forcing
all scalar comparisons to be exact.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240108205209.838365-3-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:22 -08:00
Andrii Nakryiko
522bb2c1f8 bpf: support multiple tags per argument
Add ability to iterate multiple decl_tag types pointed to the same
function argument. Use this to support multiple __arg_xxx tags per
global subprog argument.

We leave btf_find_decl_tag_value() intact, but change its implementation
to use a new btf_find_next_decl_tag() which can be straightforwardly
used to find next BTF type ID of a matching btf_decl_tag type.
btf_prepare_func_args() is switched from btf_find_decl_tag_value() to
btf_find_next_decl_tag() to gain multiple tags per argument support.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240105000909.2818934-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:21 -08:00
Andrii Nakryiko
54c11ec493 bpf: prepare btf_prepare_func_args() for multiple tags per argument
Add btf_arg_tag flags enum to be able to record multiple tags per
argument. Also streamline pointer argument processing some more.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240105000909.2818934-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:21 -08:00
Andrii Nakryiko
18810ad392 bpf: make sure scalar args don't accept __arg_nonnull tag
Move scalar arg processing in btf_prepare_func_args() after all pointer
arg processing is done. This makes it easier to do validation. One
example of unintended behavior right now is ability to specify
__arg_nonnull for integer/enum arguments. This patch fixes this.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240105000909.2818934-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:21 -08:00
Hou Tao
7c05e7f3e7 bpf: Support inlining bpf_kptr_xchg() helper
The motivation of inlining bpf_kptr_xchg() comes from the performance
profiling of bpf memory allocator benchmark. The benchmark uses
bpf_kptr_xchg() to stash the allocated objects and to pop the stashed
objects for free. After inling bpf_kptr_xchg(), the performance for
object free on 8-CPUs VM increases about 2%~10%. The inline also has
downside: both the kasan and kcsan checks on the pointer will be
unavailable.

bpf_kptr_xchg() can be inlined by converting the calling of
bpf_kptr_xchg() into an atomic_xchg() instruction. But the conversion
depends on two conditions:
1) JIT backend supports atomic_xchg() on pointer-sized word
2) For the specific arch, the implementation of xchg is the same as
   atomic_xchg() on pointer-sized words.

It seems most 64-bit JIT backends satisfies these two conditions. But
as a precaution, defining a weak function bpf_jit_supports_ptr_xchg()
to state whether such conversion is safe and only supporting inline for
64-bit host.

For x86-64, it supports BPF_XCHG atomic operation and both xchg() and
atomic_xchg() use arch_xchg() to implement the exchange, so enabling the
inline of bpf_kptr_xchg() on x86-64 first.

Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20240105104819.3916743-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23 14:40:21 -08:00
Andrii Nakryiko
0ba971511d bpf: enforce types for __arg_ctx-tagged arguments in global subprogs
Add enforcement of expected types for context arguments tagged with
arg:ctx (__arg_ctx) tag.

First, any program type will accept generic `void *` context type when
combined with __arg_ctx tag.

Besides accepting "canonical" struct names and `void *`, for a bunch of
program types for which program context is actually a named struct, we
allows a bunch of pragmatic exceptions to match real-world and expected
usage:

  - for both kprobes and perf_event we allow `bpf_user_pt_regs_t *` as
    canonical context argument type, where `bpf_user_pt_regs_t` is a
    *typedef*, not a struct;
  - for kprobes, we also always accept `struct pt_regs *`, as that's what
    actually is passed as a context to any kprobe program;
  - for perf_event, we resolve typedefs (unless it's `bpf_user_pt_regs_t`)
    down to actual struct type and accept `struct pt_regs *`, or
    `struct user_pt_regs *`, or `struct user_regs_struct *`, depending
    on the actual struct type kernel architecture points `bpf_user_pt_regs_t`
    typedef to; otherwise, canonical `struct bpf_perf_event_data *` is
    expected;
  - for raw_tp/raw_tp.w programs, `u64/long *` are accepted, as that's
    what's expected with BPF_PROG() usage; otherwise, canonical
    `struct bpf_raw_tracepoint_args *` is expected;
  - tp_btf supports both `struct bpf_raw_tracepoint_args *` and `u64 *`
    formats, both are coded as expections as tp_btf is actually a TRACING
    program type, which has no canonical context type;
  - iterator programs accept `struct bpf_iter__xxx *` structs, currently
    with no further iterator-type specific enforcement;
  - fentry/fexit/fmod_ret/lsm/struct_ops all accept `u64 *`;
  - classic tracepoint programs, as well as syscall and freplace
    programs allow any user-provided type.

In all other cases kernel will enforce exact match of struct name to
expected canonical type. And if user-provided type doesn't match that
expectation, verifier will emit helpful message with expected type name.

Note a bit unnatural way the check is done after processing all the
arguments. This is done to avoid conflict between bpf and bpf-next
trees. Once trees converge, a small follow up patch will place a simple
btf_validate_prog_ctx_type() check into a proper ARG_PTR_TO_CTX branch
(which bpf-next tree patch refactored already), removing duplicated
arg:ctx detection logic.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240118033143.3384355-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-17 20:20:06 -08:00
Andrii Nakryiko
66967a32d3 bpf: extract bpf_ctx_convert_map logic and make it more reusable
Refactor btf_get_prog_ctx_type() a bit to allow reuse of
bpf_ctx_convert_map logic in more than one places. Simplify interface by
returning btf_type instead of btf_member (field reference in BTF).

To do the above we need to touch and start untangling
btf_translate_to_vmlinux() implementation. We do the bare minimum to
not regress anything for btf_translate_to_vmlinux(), but its
implementation is very questionable for what it claims to be doing.
Mapping kfunc argument types to kernel corresponding types conceptually
is quite different from recognizing program context types. Fixing this
is out of scope for this change though.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240118033143.3384355-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-17 20:20:05 -08:00
Hao Sun
22c7fa171a bpf: Reject variable offset alu on PTR_TO_FLOW_KEYS
For PTR_TO_FLOW_KEYS, check_flow_keys_access() only uses fixed off
for validation. However, variable offset ptr alu is not prohibited
for this ptr kind. So the variable offset is not checked.

The following prog is accepted:

  func#0 @0
  0: R1=ctx() R10=fp0
  0: (bf) r6 = r1                       ; R1=ctx() R6_w=ctx()
  1: (79) r7 = *(u64 *)(r6 +144)        ; R6_w=ctx() R7_w=flow_keys()
  2: (b7) r8 = 1024                     ; R8_w=1024
  3: (37) r8 /= 1                       ; R8_w=scalar()
  4: (57) r8 &= 1024                    ; R8_w=scalar(smin=smin32=0,
  smax=umax=smax32=umax32=1024,var_off=(0x0; 0x400))
  5: (0f) r7 += r8
  mark_precise: frame0: last_idx 5 first_idx 0 subseq_idx -1
  mark_precise: frame0: regs=r8 stack= before 4: (57) r8 &= 1024
  mark_precise: frame0: regs=r8 stack= before 3: (37) r8 /= 1
  mark_precise: frame0: regs=r8 stack= before 2: (b7) r8 = 1024
  6: R7_w=flow_keys(smin=smin32=0,smax=umax=smax32=umax32=1024,var_off
  =(0x0; 0x400)) R8_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=1024,
  var_off=(0x0; 0x400))
  6: (79) r0 = *(u64 *)(r7 +0)          ; R0_w=scalar()
  7: (95) exit

This prog loads flow_keys to r7, and adds the variable offset r8
to r7, and finally causes out-of-bounds access:

  BUG: unable to handle page fault for address: ffffc90014c80038
  [...]
  Call Trace:
   <TASK>
   bpf_dispatcher_nop_func include/linux/bpf.h:1231 [inline]
   __bpf_prog_run include/linux/filter.h:651 [inline]
   bpf_prog_run include/linux/filter.h:658 [inline]
   bpf_prog_run_pin_on_cpu include/linux/filter.h:675 [inline]
   bpf_flow_dissect+0x15f/0x350 net/core/flow_dissector.c:991
   bpf_prog_test_run_flow_dissector+0x39d/0x620 net/bpf/test_run.c:1359
   bpf_prog_test_run kernel/bpf/syscall.c:4107 [inline]
   __sys_bpf+0xf8f/0x4560 kernel/bpf/syscall.c:5475
   __do_sys_bpf kernel/bpf/syscall.c:5561 [inline]
   __se_sys_bpf kernel/bpf/syscall.c:5559 [inline]
   __x64_sys_bpf+0x73/0xb0 kernel/bpf/syscall.c:5559
   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
   do_syscall_64+0x3f/0x110 arch/x86/entry/common.c:83
   entry_SYSCALL_64_after_hwframe+0x63/0x6b

Fix this by rejecting ptr alu with variable offset on flow_keys.
Applying the patch rejects the program with "R7 pointer arithmetic
on flow_keys prohibited".

Fixes: d58e468b11 ("flow_dissector: implements flow dissector BPF hook")
Signed-off-by: Hao Sun <sunhao.th@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20240115082028.9992-1-sunhao.th@gmail.com
2024-01-16 17:12:29 +01:00
Linus Torvalds
3e7aeb78ab Networking changes for 6.8.
Core & protocols
 ----------------
 
  - Analyze and reorganize core networking structs (socks, netdev,
    netns, mibs) to optimize cacheline consumption and set up
    build time warnings to safeguard against future header changes.
    This improves TCP performances with many concurrent connections
    up to 40%.
 
  - Add page-pool netlink-based introspection, exposing the
    memory usage and recycling stats. This helps indentify
    bad PP users and possible leaks.
 
  - Refine TCP/DCCP source port selection to no longer favor even
    source port at connect() time when IP_LOCAL_PORT_RANGE is set.
    This lowers the time taken by connect() for hosts having
    many active connections to the same destination.
 
  - Refactor the TCP bind conflict code, shrinking related socket
    structs.
 
  - Refactor TCP SYN-Cookie handling, as a preparation step to
    allow arbitrary SYN-Cookie processing via eBPF.
 
  - Tune optmem_max for 0-copy usage, increasing the default value
    to 128KB and namespecifying it.
 
  - Allow coalescing for cloned skbs coming from page pools, improving
    RX performances with some common configurations.
 
  - Reduce extension header parsing overhead at GRO time.
 
  - Add bridge MDB bulk deletion support, allowing user-space to
    request the deletion of matching entries.
 
  - Reorder nftables struct members, to keep data accessed by the
    datapath first.
 
  - Introduce TC block ports tracking and use. This allows supporting
    multicast-like behavior at the TC layer.
 
  - Remove UAPI support for retired TC qdiscs (dsmark, CBQ and ATM) and
    classifiers (RSVP and tcindex).
 
  - More data-race annotations.
 
  - Extend the diag interface to dump TCP bound-only sockets.
 
  - Conditional notification of events for TC qdisc class and actions.
 
  - Support for WPAN dynamic associations with nearby devices, to form
    a sub-network using a specific PAN ID.
 
  - Implement SMCv2.1 virtual ISM device support.
 
  - Add support for Batman-avd mulicast packet type.
 
 BPF
 ---
 
  - Tons of verifier improvements:
    - BPF register bounds logic and range support along with a large
      test suite
    - log improvements
    - complete precision tracking support for register spills
    - track aligned STACK_ZERO cases as imprecise spilled registers. It
      improves the verifier "instructions processed" metric from single
      digit to 50-60% for some programs
    - support for user's global BPF subprogram arguments with few
      commonly requested annotations for a better developer experience
    - support tracking of BPF_JNE which helps cases when the compiler
      transforms (unsigned) "a > 0" into "if a == 0 goto xxx" and the
      like
    - several fixes
 
  - Add initial TX metadata implementation for AF_XDP with support in
    mlx5 and stmmac drivers. Two types of offloads are supported right
    now, that is, TX timestamp and TX checksum offload.
 
  - Fix kCFI bugs in BPF all forms of indirect calls from BPF into
    kernel and from kernel into BPF work with CFI enabled. This allows
    BPF to work with CONFIG_FINEIBT=y.
 
  - Change BPF verifier logic to validate global subprograms lazily
    instead of unconditionally before the main program, so they can be
    guarded using BPF CO-RE techniques.
 
  - Support uid/gid options when mounting bpffs.
 
  - Add a new kfunc which acquires the associated cgroup of a task
    within a specific cgroup v1 hierarchy where the latter is identified
    by its id.
 
  - Extend verifier to allow bpf_refcount_acquire() of a map value field
    obtained via direct load which is a use-case needed in sched_ext.
 
  - Add BPF link_info support for uprobe multi link along with bpftool
    integration for the latter.
 
  - Support for VLAN tag in XDP hints.
 
  - Remove deprecated bpfilter kernel leftovers given the project
    is developed in user-space (https://github.com/facebook/bpfilter).
 
 Misc
 ----
 
  - Support for parellel TC self-tests execution.
 
  - Increase MPTCP self-tests coverage.
 
  - Updated the bridge documentation, including several so-far
    undocumented features.
 
  - Convert all the net self-tests to run in unique netns, to
    avoid random failures due to conflict and allow concurrent
    runs.
 
  - Add TCP-AO self-tests.
 
  - Add kunit tests for both cfg80211 and mac80211.
 
  - Autogenerate Netlink families documentation from YAML spec.
 
  - Add yml-gen support for fixed headers and recursive nests, the
    tool can now generate user-space code for all genetlink families
    for which we have specs.
 
  - A bunch of additional module descriptions fixes.
 
  - Catch incorrect freeing of pages belonging to a page pool.
 
 Driver API
 ----------
 
  - Rust abstractions for network PHY drivers; do not cover yet the
    full C API, but already allow implementing functional PHY drivers
    in rust.
 
  - Introduce queue and NAPI support in the netdev Netlink interface,
    allowing complete access to the device <> NAPIs <> queues
    relationship.
 
  - Introduce notifications filtering for devlink to allow control
    application scale to thousands of instances.
 
  - Improve PHY validation, requesting rate matching information for
    each ethtool link mode supported by both the PHY and host.
 
  - Add support for ethtool symmetric-xor RSS hash.
 
  - ACPI based Wifi band RFI (WBRF) mitigation feature for the AMD
    platform.
 
  - Expose pin fractional frequency offset value over new DPLL generic
    netlink attribute.
 
  - Convert older drivers to platform remove callback returning void.
 
  - Add support for PHY package MMD read/write.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - Octeon CN10K devices
    - Broadcom 5760X P7
    - Qualcomm SM8550 SoC
    - Texas Instrument DP83TG720S PHY
 
  - Bluetooth:
    - IMC Networks Bluetooth radio
 
 Removed
 -------
 
  - WiFi:
    - libertas 16-bit PCMCIA support
    - Atmel at76c50x drivers
    - HostAP ISA/PCMCIA style 802.11b driver
    - zd1201 802.11b USB dongles
    - Orinoco ISA/PCMCIA 802.11b driver
    - Aviator/Raytheon driver
    - Planet WL3501 driver
    - RNDIS USB 802.11b driver
 
 Drivers
 -------
 
  - Ethernet high-speed NICs:
    - Intel (100G, ice, idpf):
      - allow one by one port representors creation and removal
      - add temperature and clock information reporting
      - add get/set for ethtool's header split ringparam
      - add again FW logging
      - adds support switchdev hardware packet mirroring
      - iavf: implement symmetric-xor RSS hash
      - igc: add support for concurrent physical and free-running timers
      - i40e: increase the allowable descriptors
    - nVidia/Mellanox:
      - Preparation for Socket-Direct multi-dev netdev. That will allow
        in future releases combining multiple PFs devices attached to
        different NUMA nodes under the same netdev
    - Broadcom (bnxt):
      - TX completion handling improvements
      - add basic ntuple filter support
      - reduce MSIX vectors usage for MQPRIO offload
      - add VXLAN support, USO offload and TX coalesce completion for P7
    - Marvell Octeon EP:
      - xmit-more support
      - add PF-VF mailbox support and use it for FW notifications for VFs
    - Wangxun (ngbe/txgbe):
      - implement ethtool functions to operate pause param, ring param,
        coalesce channel number and msglevel
    - Netronome/Corigine (nfp):
      - add flow-steering support
      - support UDP segmentation offload
 
  - Ethernet NICs embedded, slower, virtual:
    - Xilinx AXI: remove duplicate DMA code adopting the dma engine driver
    - stmmac: add support for HW-accelerated VLAN stripping
    - TI AM654x sw: add mqprio, frame preemption & coalescing
    - gve: add support for non-4k page sizes.
    - virtio-net: support dynamic coalescing moderation
 
  - nVidia/Mellanox Ethernet datacenter switches:
    - allow firmware upgrade without a reboot
    - more flexible support for bridge flooding via the compressed
      FID flooding mode
 
  - Ethernet embedded switches:
    - Microchip:
      - fine-tune flow control and speed configurations in KSZ8xxx
      - KSZ88X3: enable setting rmii reference
    - Renesas:
      - add jumbo frames support
    - Marvell:
      - 88E6xxx: add "eth-mac" and "rmon" stats support
 
  - Ethernet PHYs:
    - aquantia: add firmware load support
    - at803x: refactor the driver to simplify adding support for more
      chip variants
    - NXP C45 TJA11xx: Add MACsec offload support
 
  - Wifi:
    - MediaTek (mt76):
      - NVMEM EEPROM improvements
      - mt7996 Extremely High Throughput (EHT) improvements
      - mt7996 Wireless Ethernet Dispatcher (WED) support
      - mt7996 36-bit DMA support
    - Qualcomm (ath12k):
      - support for a single MSI vector
      - WCN7850: support AP mode
    - Intel (iwlwifi):
      - new debugfs file fw_dbg_clear
      - allow concurrent P2P operation on DFS channels
 
  - Bluetooth:
    - QCA2066: support HFP offload
    - ISO: more broadcast-related improvements
    - NXP: better recovery in case receiver/transmitter get out of sync
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmWdamsSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkGC4P/2xjLzdw22ckSssuE9ORbGko9SNjnqHk
 PQh1E+26BHiCg5KB8VvzMsL78E79MRNXEattSW+1g7dhCvln3oi+Vd0WkdRkgt35
 98Iv18zLbbwFAJeyKvmLAPAkQkMLtVj19QILBBRrugF+egEZgVSE3JBcTAiKv2ZQ
 HzkabA171Ri6LpCcEEtY5XuaKvimGnGzF8YMFf8rX0wtqd2p5kbY9aMe47WAGxvU
 Vf9548XvH+A5yVH2/4/gujtUOpA/RHuhuCMb+oo0cZ+VCC1x9MGzoXzj6r87OTkf
 k2W1whNzcGoin92f+9Lk1JYMuiGKBH4QVaDdNXJnYFSJWPTE7RvRsPzYTSD4/GzK
 yEZbzSJXpy/2vDQm16NoAxl7evRs8Sorzkw4LQRviZHI/5SAkK2ZQiCK5CO8QSYy
 C1LELcV5kn6Foe24xWnrWLjAGug9oJnYoGPMU5gvPmFJMvUMXqm5rmbBgUWL5Rxw
 q1M6gVzabCyWUy6z2G2vaqW2ZntNVvCkdsLtIX0XZkcTzNoP0MA+TuhyGz4wbiuo
 PeyQp/mbGnDgCYggqKIA0YWrTVxkhFrKN520cbO8qXBQytV9oFbM/0/+C0/r/5WX
 pL1JVzLrh6l5ME7EIQfha8UOF9j8q4ueSwb40P3AR2NaZiDABM0zfUZ6+sx+91WF
 ucqPEcZB5cRE
 =1bW6
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Paolo Abeni:
 "The most interesting thing is probably the networking structs
  reorganization and a significant amount of changes is around
  self-tests.

  Core & protocols:

   - Analyze and reorganize core networking structs (socks, netdev,
     netns, mibs) to optimize cacheline consumption and set up build
     time warnings to safeguard against future header changes

     This improves TCP performances with many concurrent connections up
     to 40%

   - Add page-pool netlink-based introspection, exposing the memory
     usage and recycling stats. This helps indentify bad PP users and
     possible leaks

   - Refine TCP/DCCP source port selection to no longer favor even
     source port at connect() time when IP_LOCAL_PORT_RANGE is set. This
     lowers the time taken by connect() for hosts having many active
     connections to the same destination

   - Refactor the TCP bind conflict code, shrinking related socket
     structs

   - Refactor TCP SYN-Cookie handling, as a preparation step to allow
     arbitrary SYN-Cookie processing via eBPF

   - Tune optmem_max for 0-copy usage, increasing the default value to
     128KB and namespecifying it

   - Allow coalescing for cloned skbs coming from page pools, improving
     RX performances with some common configurations

   - Reduce extension header parsing overhead at GRO time

   - Add bridge MDB bulk deletion support, allowing user-space to
     request the deletion of matching entries

   - Reorder nftables struct members, to keep data accessed by the
     datapath first

   - Introduce TC block ports tracking and use. This allows supporting
     multicast-like behavior at the TC layer

   - Remove UAPI support for retired TC qdiscs (dsmark, CBQ and ATM) and
     classifiers (RSVP and tcindex)

   - More data-race annotations

   - Extend the diag interface to dump TCP bound-only sockets

   - Conditional notification of events for TC qdisc class and actions

   - Support for WPAN dynamic associations with nearby devices, to form
     a sub-network using a specific PAN ID

   - Implement SMCv2.1 virtual ISM device support

   - Add support for Batman-avd mulicast packet type

  BPF:

   - Tons of verifier improvements:
       - BPF register bounds logic and range support along with a large
         test suite
       - log improvements
       - complete precision tracking support for register spills
       - track aligned STACK_ZERO cases as imprecise spilled registers.
         This improves the verifier "instructions processed" metric from
         single digit to 50-60% for some programs
       - support for user's global BPF subprogram arguments with few
         commonly requested annotations for a better developer
         experience
       - support tracking of BPF_JNE which helps cases when the compiler
         transforms (unsigned) "a > 0" into "if a == 0 goto xxx" and the
         like
       - several fixes

   - Add initial TX metadata implementation for AF_XDP with support in
     mlx5 and stmmac drivers. Two types of offloads are supported right
     now, that is, TX timestamp and TX checksum offload

   - Fix kCFI bugs in BPF all forms of indirect calls from BPF into
     kernel and from kernel into BPF work with CFI enabled. This allows
     BPF to work with CONFIG_FINEIBT=y

   - Change BPF verifier logic to validate global subprograms lazily
     instead of unconditionally before the main program, so they can be
     guarded using BPF CO-RE techniques

   - Support uid/gid options when mounting bpffs

   - Add a new kfunc which acquires the associated cgroup of a task
     within a specific cgroup v1 hierarchy where the latter is
     identified by its id

   - Extend verifier to allow bpf_refcount_acquire() of a map value
     field obtained via direct load which is a use-case needed in
     sched_ext

   - Add BPF link_info support for uprobe multi link along with bpftool
     integration for the latter

   - Support for VLAN tag in XDP hints

   - Remove deprecated bpfilter kernel leftovers given the project is
     developed in user-space (https://github.com/facebook/bpfilter)

  Misc:

   - Support for parellel TC self-tests execution

   - Increase MPTCP self-tests coverage

   - Updated the bridge documentation, including several so-far
     undocumented features

   - Convert all the net self-tests to run in unique netns, to avoid
     random failures due to conflict and allow concurrent runs

   - Add TCP-AO self-tests

   - Add kunit tests for both cfg80211 and mac80211

   - Autogenerate Netlink families documentation from YAML spec

   - Add yml-gen support for fixed headers and recursive nests, the tool
     can now generate user-space code for all genetlink families for
     which we have specs

   - A bunch of additional module descriptions fixes

   - Catch incorrect freeing of pages belonging to a page pool

  Driver API:

   - Rust abstractions for network PHY drivers; do not cover yet the
     full C API, but already allow implementing functional PHY drivers
     in rust

   - Introduce queue and NAPI support in the netdev Netlink interface,
     allowing complete access to the device <> NAPIs <> queues
     relationship

   - Introduce notifications filtering for devlink to allow control
     application scale to thousands of instances

   - Improve PHY validation, requesting rate matching information for
     each ethtool link mode supported by both the PHY and host

   - Add support for ethtool symmetric-xor RSS hash

   - ACPI based Wifi band RFI (WBRF) mitigation feature for the AMD
     platform

   - Expose pin fractional frequency offset value over new DPLL generic
     netlink attribute

   - Convert older drivers to platform remove callback returning void

   - Add support for PHY package MMD read/write

  New hardware / drivers:

   - Ethernet:
       - Octeon CN10K devices
       - Broadcom 5760X P7
       - Qualcomm SM8550 SoC
       - Texas Instrument DP83TG720S PHY

   - Bluetooth:
       - IMC Networks Bluetooth radio

  Removed:

   - WiFi:
       - libertas 16-bit PCMCIA support
       - Atmel at76c50x drivers
       - HostAP ISA/PCMCIA style 802.11b driver
       - zd1201 802.11b USB dongles
       - Orinoco ISA/PCMCIA 802.11b driver
       - Aviator/Raytheon driver
       - Planet WL3501 driver
       - RNDIS USB 802.11b driver

  Driver updates:

   - Ethernet high-speed NICs:
       - Intel (100G, ice, idpf):
          - allow one by one port representors creation and removal
          - add temperature and clock information reporting
          - add get/set for ethtool's header split ringparam
          - add again FW logging
          - adds support switchdev hardware packet mirroring
          - iavf: implement symmetric-xor RSS hash
          - igc: add support for concurrent physical and free-running
            timers
          - i40e: increase the allowable descriptors
       - nVidia/Mellanox:
          - Preparation for Socket-Direct multi-dev netdev. That will
            allow in future releases combining multiple PFs devices
            attached to different NUMA nodes under the same netdev
       - Broadcom (bnxt):
          - TX completion handling improvements
          - add basic ntuple filter support
          - reduce MSIX vectors usage for MQPRIO offload
          - add VXLAN support, USO offload and TX coalesce completion
            for P7
       - Marvell Octeon EP:
          - xmit-more support
          - add PF-VF mailbox support and use it for FW notifications
            for VFs
       - Wangxun (ngbe/txgbe):
          - implement ethtool functions to operate pause param, ring
            param, coalesce channel number and msglevel
       - Netronome/Corigine (nfp):
          - add flow-steering support
          - support UDP segmentation offload

   - Ethernet NICs embedded, slower, virtual:
       - Xilinx AXI: remove duplicate DMA code adopting the dma engine
         driver
       - stmmac: add support for HW-accelerated VLAN stripping
       - TI AM654x sw: add mqprio, frame preemption & coalescing
       - gve: add support for non-4k page sizes.
       - virtio-net: support dynamic coalescing moderation

   - nVidia/Mellanox Ethernet datacenter switches:
       - allow firmware upgrade without a reboot
       - more flexible support for bridge flooding via the compressed
         FID flooding mode

   - Ethernet embedded switches:
       - Microchip:
          - fine-tune flow control and speed configurations in KSZ8xxx
          - KSZ88X3: enable setting rmii reference
       - Renesas:
          - add jumbo frames support
       - Marvell:
          - 88E6xxx: add "eth-mac" and "rmon" stats support

   - Ethernet PHYs:
       - aquantia: add firmware load support
       - at803x: refactor the driver to simplify adding support for more
         chip variants
       - NXP C45 TJA11xx: Add MACsec offload support

   - Wifi:
       - MediaTek (mt76):
          - NVMEM EEPROM improvements
          - mt7996 Extremely High Throughput (EHT) improvements
          - mt7996 Wireless Ethernet Dispatcher (WED) support
          - mt7996 36-bit DMA support
       - Qualcomm (ath12k):
          - support for a single MSI vector
          - WCN7850: support AP mode
       - Intel (iwlwifi):
          - new debugfs file fw_dbg_clear
          - allow concurrent P2P operation on DFS channels

   - Bluetooth:
       - QCA2066: support HFP offload
       - ISO: more broadcast-related improvements
       - NXP: better recovery in case receiver/transmitter get out of sync"

* tag 'net-next-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1714 commits)
  lan78xx: remove redundant statement in lan78xx_get_eee
  lan743x: remove redundant statement in lan743x_ethtool_get_eee
  bnxt_en: Fix RCU locking for ntuple filters in bnxt_rx_flow_steer()
  bnxt_en: Fix RCU locking for ntuple filters in bnxt_srxclsrldel()
  bnxt_en: Remove unneeded variable in bnxt_hwrm_clear_vnic_filter()
  tcp: Revert no longer abort SYN_SENT when receiving some ICMP
  Revert "mlx5 updates 2023-12-20"
  Revert "net: stmmac: Enable Per DMA Channel interrupt"
  ipvlan: Remove usage of the deprecated ida_simple_xx() API
  ipvlan: Fix a typo in a comment
  net/sched: Remove ipt action tests
  net: stmmac: Use interrupt mode INTM=1 for per channel irq
  net: stmmac: Add support for TX/RX channel interrupt
  net: stmmac: Make MSI interrupt routine generic
  dt-bindings: net: snps,dwmac: per channel irq
  net: phy: at803x: make read_status more generic
  net: phy: at803x: add support for cdt cross short test for qca808x
  net: phy: at803x: refactor qca808x cable test get status function
  net: phy: at803x: generalize cdt fault length function
  net: ethernet: cortina: Drop TSO support
  ...
2024-01-11 10:07:29 -08:00
Jiri Olsa
715d82ba63 bpf: Fix re-attachment branch in bpf_tracing_prog_attach
The following case can cause a crash due to missing attach_btf:

1) load rawtp program
2) load fentry program with rawtp as target_fd
3) create tracing link for fentry program with target_fd = 0
4) repeat 3

In the end we have:

- prog->aux->dst_trampoline == NULL
- tgt_prog == NULL (because we did not provide target_fd to link_create)
- prog->aux->attach_btf == NULL (the program was loaded with attach_prog_fd=X)
- the program was loaded for tgt_prog but we have no way to find out which one

    BUG: kernel NULL pointer dereference, address: 0000000000000058
    Call Trace:
     <TASK>
     ? __die+0x20/0x70
     ? page_fault_oops+0x15b/0x430
     ? fixup_exception+0x22/0x330
     ? exc_page_fault+0x6f/0x170
     ? asm_exc_page_fault+0x22/0x30
     ? bpf_tracing_prog_attach+0x279/0x560
     ? btf_obj_id+0x5/0x10
     bpf_tracing_prog_attach+0x439/0x560
     __sys_bpf+0x1cf4/0x2de0
     __x64_sys_bpf+0x1c/0x30
     do_syscall_64+0x41/0xf0
     entry_SYSCALL_64_after_hwframe+0x6e/0x76

Return -EINVAL in this situation.

Fixes: f3a9507554 ("bpf: Allow trampoline re-attach for tracing and lsm programs")
Cc: stable@vger.kernel.org
Signed-off-by: Jiri Olsa <olsajiri@gmail.com>
Acked-by: Jiri Olsa <olsajiri@gmail.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Dmitrii Dolgov <9erthalion6@gmail.com>
Link: https://lore.kernel.org/r/20240103190559.14750-4-9erthalion6@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-04 20:40:19 -08:00
Dmitrii Dolgov
19bfcdf949 bpf: Relax tracing prog recursive attach rules
Currently, it's not allowed to attach an fentry/fexit prog to another
one fentry/fexit. At the same time it's not uncommon to see a tracing
program with lots of logic in use, and the attachment limitation
prevents usage of fentry/fexit for performance analysis (e.g. with
"bpftool prog profile" command) in this case. An example could be
falcosecurity libs project that uses tp_btf tracing programs.

Following the corresponding discussion [1], the reason for that is to
avoid tracing progs call cycles without introducing more complex
solutions. But currently it seems impossible to load and attach tracing
programs in a way that will form such a cycle. The limitation is coming
from the fact that attach_prog_fd is specified at the prog load (thus
making it impossible to attach to a program loaded after it in this
way), as well as tracing progs not implementing link_detach.

Replace "no same type" requirement with verification that no more than
one level of attachment nesting is allowed. In this way only one
fentry/fexit program could be attached to another fentry/fexit to cover
profiling use case, and still no cycle could be formed. To implement,
add a new field into bpf_prog_aux to track nested attachment for tracing
programs.

[1]: https://lore.kernel.org/bpf/20191108064039.2041889-16-ast@kernel.org/

Acked-by: Jiri Olsa <olsajiri@gmail.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Dmitrii Dolgov <9erthalion6@gmail.com>
Link: https://lore.kernel.org/r/20240103190559.14750-2-9erthalion6@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-04 20:31:34 -08:00
Yonghong Song
9ddf872b47 bpf: Remove unnecessary cpu == 0 check in memalloc
After merging the patch set [1] to reduce memory usage
for bpf_global_percpu_ma, Alexei found a redundant check (cpu == 0)
in function bpf_mem_alloc_percpu_unit_init() ([2]).
Indeed, the check is unnecessary since c->unit_size will
be all NULL or all non-NULL for all cpus before
for_each_possible_cpu() loop.
Removing the check makes code less confusing.

  [1] https://lore.kernel.org/all/20231222031729.1287957-1-yonghong.song@linux.dev/
  [2] https://lore.kernel.org/all/20231222031745.1289082-1-yonghong.song@linux.dev/

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240104165744.702239-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-04 10:18:14 -08:00
Yonghong Song
5c1a376532 bpf: Limit up to 512 bytes for bpf_global_percpu_ma allocation
For percpu data structure allocation with bpf_global_percpu_ma,
the maximum data size is 4K. But for a system with large
number of cpus, bigger data size (e.g., 2K, 4K) might consume
a lot of memory. For example, the percpu memory consumption
with unit size 2K and 1024 cpus will be 2K * 1K * 1k = 2GB
memory.

We should discourage such usage. Let us limit the maximum data
size to be 512 for bpf_global_percpu_ma allocation.

Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231222031801.1290841-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-03 21:08:26 -08:00
Yonghong Song
0e2ba9f96f bpf: Use smaller low/high marks for percpu allocation
Currently, refill low/high marks are set with the assumption
of normal non-percpu memory allocation. For example, for
an allocation size 256, for non-percpu memory allocation,
low mark is 32 and high mark is 96, resulting in the
batch allocation of 48 elements and the allocated memory
will be 48 * 256 = 12KB for this particular cpu.
Assuming an 128-cpu system, the total memory consumption
across all cpus will be 12K * 128 = 1.5MB memory.

This might be okay for non-percpu allocation, but may not be
good for percpu allocation, which will consume 1.5MB * 128 = 192MB
memory in the worst case if every cpu has a chance of memory
allocation.

In practice, percpu allocation is very rare compared to
non-percpu allocation. So let us have smaller low/high marks
which can avoid unnecessary memory consumption.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231222031755.1289671-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-03 21:08:25 -08:00
Yonghong Song
5b95e638f1 bpf: Refill only one percpu element in memalloc
Typically for percpu map element or data structure, once allocated,
most operations are lookup or in-place update. Deletion are really
rare. Currently, for percpu data strcture, 4 elements will be
refilled if the size is <= 256. Let us just do with one element
for percpu data. For example, for size 256 and 128 cpus, the
potential saving will be 3 * 256 * 128 * 128 = 12MB.

Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231222031750.1289290-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-03 21:08:25 -08:00
Yonghong Song
c39aa3b289 bpf: Allow per unit prefill for non-fix-size percpu memory allocator
Commit 41a5db8d81 ("Add support for non-fix-size percpu mem allocation")
added support for non-fix-size percpu memory allocation.
Such allocation will allocate percpu memory for all buckets on all
cpus and the memory consumption is in the order to quadratic.
For example, let us say, 4 cpus, unit size 16 bytes, so each
cpu has 16 * 4 = 64 bytes, with 4 cpus, total will be 64 * 4 = 256 bytes.
Then let us say, 8 cpus with the same unit size, each cpu
has 16 * 8 = 128 bytes, with 8 cpus, total will be 128 * 8 = 1024 bytes.
So if the number of cpus doubles, the number of memory consumption
will be 4 times. So for a system with large number of cpus, the
memory consumption goes up quickly with quadratic order.
For example, for 4KB percpu allocation, 128 cpus. The total memory
consumption will 4KB * 128 * 128 = 64MB. Things will become
worse if the number of cpus is bigger (e.g., 512, 1024, etc.)

In Commit 41a5db8d81, the non-fix-size percpu memory allocation is
done in boot time, so for system with large number of cpus, the initial
percpu memory consumption is very visible. For example, for 128 cpu
system, the total percpu memory allocation will be at least
(16 + 32 + 64 + 96 + 128 + 196 + 256 + 512 + 1024 + 2048 + 4096)
  * 128 * 128 = ~138MB.
which is pretty big. It will be even bigger for larger number of cpus.

Note that the current prefill also allocates 4 entries if the unit size
is less than 256. So on top of 138MB memory consumption, this will
add more consumption with
3 * (16 + 32 + 64 + 96 + 128 + 196 + 256) * 128 * 128 = ~38MB.
Next patch will try to reduce this memory consumption.

Later on, Commit 1fda5bb66a ("bpf: Do not allocate percpu memory
at init stage") moved the non-fix-size percpu memory allocation
to bpf verificaiton stage. Once a particular bpf_percpu_obj_new()
is called by bpf program, the memory allocator will try to fill in
the cache with all sizes, causing the same amount of percpu memory
consumption as in the boot stage.

To reduce the initial percpu memory consumption for non-fix-size
percpu memory allocation, instead of filling the cache with all
supported allocation sizes, this patch intends to fill the cache
only for the requested size. As typically users will not use large
percpu data structure, this can save memory significantly.
For example, the allocation size is 64 bytes with 128 cpus.
Then total percpu memory amount will be 64 * 128 * 128 = 1MB,
much less than previous 138MB.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231222031745.1289082-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-03 21:08:25 -08:00
Yonghong Song
9fc8e80204 bpf: Add objcg to bpf_mem_alloc
The objcg is a bpf_mem_alloc level property since all bpf_mem_cache's
are with the same objcg. This patch made such a property explicit.
The next patch will use this property to save and restore objcg
for percpu unit allocator.

Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231222031739.1288590-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-03 21:08:25 -08:00
Yonghong Song
9beda16c25 bpf: Avoid unnecessary extra percpu memory allocation
Currently, for percpu memory allocation, say if the user
requests allocation size to be 32 bytes, the actually
calculated size will be 40 bytes and it further rounds
to 64 bytes, and eventually 64 bytes are allocated,
wasting 32-byte memory.

Change bpf_mem_alloc() to calculate the cache index
based on the user-provided allocation size so unnecessary
extra memory can be avoided.

Suggested-by: Hou Tao <houtao1@huawei.com>
Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231222031734.1288400-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-03 21:08:25 -08:00
Andrei Matei
8a021e7fa1 bpf: Simplify checking size of helper accesses
This patch simplifies the verification of size arguments associated to
pointer arguments to helpers and kfuncs. Many helpers take a pointer
argument followed by the size of the memory access performed to be
performed through that pointer. Before this patch, the handling of the
size argument in check_mem_size_reg() was confusing and wasteful: if the
size register's lower bound was 0, then the verification was done twice:
once considering the size of the access to be the lower-bound of the
respective argument, and once considering the upper bound (even if the
two are the same). The upper bound checking is a super-set of the
lower-bound checking(*), except: the only point of the lower-bound check
is to handle the case where zero-sized-accesses are explicitly not
allowed and the lower-bound is zero. This static condition is now
checked explicitly, replacing a much more complex, expensive and
confusing verification call to check_helper_mem_access().

Error messages change in this patch. Before, messages about illegal
zero-size accesses depended on the type of the pointer and on other
conditions, and sometimes the message was plain wrong: in some tests
that changed you'll see that the old message was something like "R1 min
value is outside of the allowed memory range", where R1 is the pointer
register; the error was wrongly claiming that the pointer was bad
instead of the size being bad. Other times the information that the size
came for a register with a possible range of values was wrong, and the
error presented the size as a fixed zero. Now the errors refer to the
right register. However, the old error messages did contain useful
information about the pointer register which is now lost; recovering
this information was deemed not important enough.

(*) Besides standing to reason that the checks for a bigger size access
are a super-set of the checks for a smaller size access, I have also
mechanically verified this by reading the code for all types of
pointers. I could convince myself that it's true for all but
PTR_TO_BTF_ID (check_ptr_to_btf_access). There, simply looking
line-by-line does not immediately prove what we want. If anyone has any
qualms, let me know.

Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231221232225.568730-2-andreimatei1@gmail.com
2024-01-03 10:37:56 -08:00
Kent Overstreet
1e2f2d3199 Kill sched.h dependency on rcupdate.h
by moving cond_resched_rcu() to rcupdate_wait.h, we can kill another big
sched.h dependency.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-27 11:50:20 -05:00
Simon Horman
5abde62465 bpf: Avoid unnecessary use of comma operator in verifier
Although it does not seem to have any untoward side-effects, the use
of ';' to separate to assignments seems more appropriate than ','.

Flagged by clang-17 -Wcomma

No functional change intended. Compile tested only.

Signed-off-by: Simon Horman <horms@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/bpf/20231221-bpf-verifier-comma-v1-1-cde2530912e9@kernel.org
2023-12-21 22:40:25 +01:00
Daniel Borkmann
b08c8fc041 bpf: Re-support uid and gid when mounting bpffs
For a clean, conflict-free revert of the token-related patches in commit
d17aff807f ("Revert BPF token-related functionality"), the bpf fs commit
750e785796 ("bpf: Support uid and gid when mounting bpffs") was undone
temporarily as well.

This patch manually re-adds the functionality from the original one back
in 750e785796, no other functional changes intended.

Testing:

  # mount -t bpf -o uid=65534,gid=65534 bpffs ./foo
  # ls -la . | grep foo
  drwxrwxrwt   2 nobody nogroup          0 Dec 20 13:16 foo
  # mount -t bpf
  bpffs on /root/foo type bpf (rw,relatime,uid=65534,gid=65534)

Also, passing invalid arguments for uid/gid are properly rejected as expected.

Fixes: d17aff807f ("Revert BPF token-related functionality")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Cc: Jie Jiang <jiejiang@chromium.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/bpf/20231220133805.20953-1-daniel@iogearbox.net
2023-12-21 14:24:30 +01:00
Hou Tao
7ac5c53e00 bpf: Use c->unit_size to select target cache during free
At present, bpf memory allocator uses check_obj_size() to ensure that
ksize() of allocated pointer is equal with the unit_size of used
bpf_mem_cache. Its purpose is to prevent bpf_mem_free() from selecting
a bpf_mem_cache which has different unit_size compared with the
bpf_mem_cache used for allocation. But as reported by lkp, the return
value of ksize() or kmalloc_size_roundup() may change due to slab merge
and it will lead to the warning report in check_obj_size().

The reported warning happened as follows:
(1) in bpf_mem_cache_adjust_size(), kmalloc_size_roundup(96) returns the
object_size of kmalloc-96 instead of kmalloc-cg-96. The object_size of
kmalloc-96 is 96, so size_index for 96 is not adjusted accordingly.
(2) the object_size of kmalloc-cg-96 is adjust from 96 to 128 due to
slab merge in __kmem_cache_alias(). For SLAB, SLAB_HWCACHE_ALIGN is
enabled by default for kmalloc slab, so align is 64 and size is 128 for
kmalloc-cg-96. SLUB has a similar merge logic, but its object_size will
not be changed, because its align is 8 under x86-64.
(3) when unit_alloc() does kmalloc_node(96, __GFP_ACCOUNT, node),
ksize() returns 128 instead of 96 for the returned pointer.
(4) the warning in check_obj_size() is triggered.

Considering the slab merge can happen in anytime (e.g, a slab created in
a new module), the following case is also possible: during the
initialization of bpf_global_ma, there is no slab merge and ksize() for
a 96-bytes object returns 96. But after that a new slab created by a
kernel module is merged to kmalloc-cg-96 and the object_size of
kmalloc-cg-96 is adjust from 96 to 128 (which is possible for x86-64 +
CONFIG_SLAB, because its alignment requirement is 64 for 96-bytes slab).
So soon or later, when bpf_global_ma frees a 96-byte-sized pointer
which is allocated from bpf_mem_cache with unit_size=96, bpf_mem_free()
will free the pointer through a bpf_mem_cache in which unit_size is 128,
because the return value of ksize() changes. The warning for the
mismatch will be triggered again.

A feasible fix is introducing similar APIs compared with ksize() and
kmalloc_size_roundup() to return the actually-allocated size instead of
size which may change due to slab merge, but it will introduce
unnecessary dependency on the implementation details of mm subsystem.

As for now the pointer of bpf_mem_cache is saved in the 8-bytes area
(or 4-bytes under 32-bit host) above the returned pointer, using
unit_size in the saved bpf_mem_cache to select the target cache instead
of inferring the size from the pointer itself. Beside no extra
dependency on mm subsystem, the performance for bpf_mem_free_rcu() is
also improved as shown below.

Before applying the patch, the performances of bpf_mem_alloc() and
bpf_mem_free_rcu() on 8-CPUs VM with one producer are as follows:

kmalloc : alloc 11.69 ± 0.28M/s free 29.58 ± 0.93M/s
percpu  : alloc 14.11 ± 0.52M/s free 14.29 ± 0.99M/s

After apply the patch, the performance for bpf_mem_free_rcu() increases
9% and 146% for kmalloc memory and per-cpu memory respectively:

kmalloc: alloc 11.01 ± 0.03M/s free   32.42 ± 0.48M/s
percpu:  alloc 12.84 ± 0.12M/s free   35.24 ± 0.23M/s

After the fixes, there is no need to adjust size_index to fix the
mismatch between allocation and free, so remove it as well. Also return
NULL instead of ZERO_SIZE_PTR for zero-sized alloc in bpf_mem_alloc(),
because there is no bpf_mem_cache pointer saved above ZERO_SIZE_PTR.

Fixes: 9077fc228f ("bpf: Use kmalloc_size_roundup() to adjust size_index")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/bpf/202310302113.9f8fe705-oliver.sang@intel.com
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231216131052.27621-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-20 13:25:46 -08:00
Andrii Nakryiko
a64bfe6186 bpf: add support for passing dynptr pointer to global subprog
Add ability to pass a pointer to dynptr into global functions.
This allows to have global subprogs that accept and work with generic
dynptrs that are created by caller. Dynptr argument is detected based on
the name of a struct type, if it's "bpf_dynptr", it's assumed to be
a proper dynptr pointer. Both actual struct and forward struct
declaration types are supported.

This is conceptually exactly the same semantics as
bpf_user_ringbuf_drain()'s use of dynptr to pass a variable-sized
pointer to ringbuf record. So we heavily rely on CONST_PTR_TO_DYNPTR
bits of already existing logic in the verifier.

During global subprog validation, we mark such CONST_PTR_TO_DYNPTR as
having LOCAL type, as that's the most unassuming type of dynptr and it
doesn't have any special helpers that can try to free or acquire extra
references (unlike skb, xdp, or ringbuf dynptr). So that seems like a safe
"choice" to make from correctness standpoint. It's still possible to
pass any type of dynptr to such subprog, though, because generic dynptr
helpers, like getting data/slice pointers, read/write memory copying
routines, dynptr adjustment and getter routines all work correctly with
any type of dynptr.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231215011334.2307144-8-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-19 18:06:46 -08:00
Andrii Nakryiko
94e1c70a34 bpf: support 'arg:xxx' btf_decl_tag-based hints for global subprog args
Add support for annotating global BPF subprog arguments to provide more
information about expected semantics of the argument. Currently,
verifier relies purely on argument's BTF type information, and supports
three general use cases: scalar, pointer-to-context, and
pointer-to-fixed-size-memory.

Scalar and pointer-to-fixed-mem work well in practice and are quite
natural to use. But pointer-to-context is a bit problematic, as typical
BPF users don't realize that they need to use a special type name to
signal to verifier that argument is not just some pointer, but actually
a PTR_TO_CTX. Further, even if users do know which type to use, it is
limiting in situations where the same BPF program logic is used across
few different program types. Common case is kprobes, tracepoints, and
perf_event programs having a helper to send some data over BPF perf
buffer. bpf_perf_event_output() requires `ctx` argument, and so it's
quite cumbersome to share such global subprog across few BPF programs of
different types, necessitating extra static subprog that is context
type-agnostic.

Long story short, there is a need to go beyond types and allow users to
add hints to global subprog arguments to define expectations.

This patch adds such support for two initial special tags:
  - pointer to context;
  - non-null qualifier for generic pointer arguments.

All of the above came up in practice already and seem generally useful
additions. Non-null qualifier is an often requested feature, which
currently has to be worked around by having unnecessary NULL checks
inside subprogs even if we know that arguments are never NULL. Pointer
to context was discussed earlier.

As for implementation, we utilize btf_decl_tag attribute and set up an
"arg:xxx" convention to specify argument hint. As such:
  - btf_decl_tag("arg:ctx") is a PTR_TO_CTX hint;
  - btf_decl_tag("arg:nonnull") marks pointer argument as not allowed to
    be NULL, making NULL check inside global subprog unnecessary.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231215011334.2307144-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-19 18:06:46 -08:00
Andrii Nakryiko
f18c3d88de bpf: reuse subprog argument parsing logic for subprog call checks
Remove duplicated BTF parsing logic when it comes to subprog call check.
Instead, use (potentially cached) results of btf_prepare_func_args() to
abstract away expectations of each subprog argument in generic terms
(e.g., "this is pointer to context", or "this is a pointer to memory of
size X"), and then use those simple high-level argument type
expectations to validate actual register states to check if they match
expectations.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231215011334.2307144-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-19 18:06:46 -08:00
Andrii Nakryiko
c5a7244759 bpf: move subprog call logic back to verifier.c
Subprog call logic in btf_check_subprog_call() currently has both a lot
of BTF parsing logic (which is, presumably, what justified putting it
into btf.c), but also a bunch of register state checks, some of each
utilize deep verifier logic helpers, necessarily exported from
verifier.c: check_ptr_off_reg(), check_func_arg_reg_off(),
and check_mem_reg().

Going forward, btf_check_subprog_call() will have a minimum of
BTF-related logic, but will get more internal verifier logic related to
register state manipulation. So move it into verifier.c to minimize
amount of verifier-specific logic exposed to btf.c.

We do this move before refactoring btf_check_func_arg_match() to
preserve as much history post-refactoring as possible.

No functional changes.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231215011334.2307144-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-19 18:06:46 -08:00
Andrii Nakryiko
e26080d0da bpf: prepare btf_prepare_func_args() for handling static subprogs
Generalize btf_prepare_func_args() to support both global and static
subprogs. We are going to utilize this property in the next patch,
reusing btf_prepare_func_args() for subprog call logic instead of
reparsing BTF information in a completely separate implementation.

btf_prepare_func_args() now detects whether subprog is global or static
makes slight logic adjustments for static func cases, like not failing
fatally (-EFAULT) for conditions that are allowable for static subprogs.

Somewhat subtle (but major!) difference is the handling of pointer arguments.
Both global and static functions need to handle special context
arguments (which are pointers to predefined type names), but static
subprogs give up on any other pointers, falling back to marking subprog
as "unreliable", disabling the use of BTF type information altogether.

For global functions, though, we are assuming that such pointers to
unrecognized types are just pointers to fixed-sized memory region (or
error out if size cannot be established, like for `void *` pointers).

This patch accommodates these small differences and sets up a stage for
refactoring in the next patch, eliminating a separate BTF-based parsing
logic in btf_check_func_arg_match().

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231215011334.2307144-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-19 18:06:46 -08:00
Andrii Nakryiko
5eccd2db42 bpf: reuse btf_prepare_func_args() check for main program BTF validation
Instead of btf_check_subprog_arg_match(), use btf_prepare_func_args()
logic to validate "trustworthiness" of main BPF program's BTF information,
if it is present.

We ignored results of original BTF check anyway, often times producing
confusing and ominously-sounding "reg type unsupported for arg#0
function" message, which has no apparent effect on program correctness
and verification process.

All the -EFAULT returning sanity checks are already performed in
check_btf_info_early(), so there is zero reason to have this duplication
of logic between btf_check_subprog_call() and btf_check_subprog_arg_match().
Dropping btf_check_subprog_arg_match() simplifies
btf_check_func_arg_match() further removing `bool processing_call` flag.

One subtle bit that was done by btf_check_subprog_arg_match() was
potentially marking main program's BTF as unreliable. We do this
explicitly now with a dedicated simple check, preserving the original
behavior, but now based on well factored btf_prepare_func_args() logic.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231215011334.2307144-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-19 18:06:46 -08:00
Andrii Nakryiko
4ba1d0f234 bpf: abstract away global subprog arg preparation logic from reg state setup
btf_prepare_func_args() is used to understand expectations and
restrictions on global subprog arguments. But current implementation is
hard to extend, as it intermixes BTF-based func prototype parsing and
interpretation logic with setting up register state at subprog entry.

Worse still, those registers are not completely set up inside
btf_prepare_func_args(), requiring some more logic later in
do_check_common(). Like calling mark_reg_unknown() and similar
initialization operations.

This intermixing of BTF interpretation and register state setup is
problematic. First, it causes duplication of BTF parsing logic for global
subprog verification (to set up initial state of global subprog) and
global subprog call sites analysis (when we need to check that whatever
is being passed into global subprog matches expectations), performed in
btf_check_subprog_call().

Given we want to extend global func argument with tags later, this
duplication is problematic. So refactor btf_prepare_func_args() to do
only BTF-based func proto and args parsing, returning high-level
argument "expectations" only, with no regard to specifics of register
state. I.e., if it's a context argument, instead of setting register
state to PTR_TO_CTX, we return ARG_PTR_TO_CTX enum for that argument as
"an argument specification" for further processing inside
do_check_common(). Similarly for SCALAR arguments, PTR_TO_MEM, etc.

This allows to reuse btf_prepare_func_args() in following patches at
global subprog call site analysis time. It also keeps register setup
code consistently in one place, do_check_common().

Besides all this, we cache this argument specs information inside
env->subprog_info, eliminating the need to redo these potentially
expensive BTF traversals, especially if BPF program's BTF is big and/or
there are lots of global subprog calls.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231215011334.2307144-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-19 18:06:46 -08:00
Menglong Dong
d028f87517 bpf: make the verifier tracks the "not equal" for regs
We can derive some new information for BPF_JNE in regs_refine_cond_op().
Take following code for example:

  /* The type of "a" is u32 */
  if (a > 0 && a < 100) {
    /* the range of the register for a is [0, 99], not [1, 99],
     * and will cause the following error:
     *
     *   invalid zero-sized read
     *
     * as a can be 0.
     */
    bpf_skb_store_bytes(skb, xx, xx, a, 0);
  }

In the code above, "a > 0" will be compiled to "jmp xxx if a == 0". In the
TRUE branch, the dst_reg will be marked as known to 0. However, in the
fallthrough(FALSE) branch, the dst_reg will not be handled, which makes
the [min, max] for a is [0, 99], not [1, 99].

For BPF_JNE, we can reduce the range of the dst reg if the src reg is a
const and is exactly the edge of the dst reg.

Signed-off-by: Menglong Dong <menglong8.dong@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20231219134800.1550388-2-menglong8.dong@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-19 17:18:55 -08:00
Andrii Nakryiko
d17aff807f Revert BPF token-related functionality
This patch includes the following revert (one  conflicting BPF FS
patch and three token patch sets, represented by merge commits):
  - revert 0f5d5454c7 "Merge branch 'bpf-fs-mount-options-parsing-follow-ups'";
  - revert 750e785796 "bpf: Support uid and gid when mounting bpffs";
  - revert 733763285a "Merge branch 'bpf-token-support-in-libbpf-s-bpf-object'";
  - revert c35919dcce "Merge branch 'bpf-token-and-bpf-fs-based-delegation'".

Link: https://lore.kernel.org/bpf/CAHk-=wg7JuFYwGy=GOMbRCtOL+jwSQsdUaBsRWkDVYbxipbM5A@mail.gmail.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2023-12-19 08:23:03 -08:00
Jakub Kicinski
c49b292d03 netdev
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmWAz2EACgkQ6rmadz2v
 bToqrw/9EwroZCc8GEHOKAlb/fzrMvn92rLo0ZW/cGN84QJPnx4zM6Zo0+fgLaaN
 oqqztwMUwdzGC3uX3FfVXaaLKbJ/MeHeL9BXFZNW8zkRHciw4R7kIBhOdPnHyET7
 uT+rQ4xPe1Mt7e9PjepKlSL5mEsxWfBkdUgsdn19Z2Vjdfr9mZMhYWYMJGcfTCD1
 TwxHKBPhq5fN3IsshmMBB8IrRp1HStUKb65MgZ4dI22LJXxTsFkx5XMFXcmuqvkH
 NhKj8jDcPEEh31bYcb6aG2Z4onw5F2lquygjk1Qyy5cyw45m/ipJKAXKdAyvJG+R
 VZCWOET/9wbRwFSK5wxwihCuKghFiofK52i2PcGtXZh0PCouyZZneSJOKM0yVWKO
 BvuJBxK4ETRnQyN6ZxhuJiEXG3/YMBBhyR2TX1LntVK9ct/k7qFVzATG49J39/sR
 SYMbptBRj4a5oMJ1qn0nFVEDFkg0jTnTDNnsEpcz60Ayt6EsJ1XosO5yz2huf861
 xgRMTKMseyG1/uV45tQ8ZPzbSPpBxjUi9Dl3coYsIm1a+y6clWUXcarONY5KVrpS
 CR98DuFgl+E7dXuisd/Kz2p2KxxSPq8nytsmLlgOvrUqhwiXqB+TKN8EHgIapVOt
 l1A5LrzXFTcGlT9MlaWBqEIy83Bu1nqQqbxrAFOE0k8A5jomXaw=
 =stU2
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Alexei Starovoitov says:

====================
pull-request: bpf-next 2023-12-18

This PR is larger than usual and contains changes in various parts
of the kernel.

The main changes are:

1) Fix kCFI bugs in BPF, from Peter Zijlstra.

End result: all forms of indirect calls from BPF into kernel
and from kernel into BPF work with CFI enabled. This allows BPF
to work with CONFIG_FINEIBT=y.

2) Introduce BPF token object, from Andrii Nakryiko.

It adds an ability to delegate a subset of BPF features from privileged
daemon (e.g., systemd) through special mount options for userns-bound
BPF FS to a trusted unprivileged application. The design accommodates
suggestions from Christian Brauner and Paul Moore.

Example:
$ sudo mkdir -p /sys/fs/bpf/token
$ sudo mount -t bpf bpffs /sys/fs/bpf/token \
             -o delegate_cmds=prog_load:MAP_CREATE \
             -o delegate_progs=kprobe \
             -o delegate_attachs=xdp

3) Various verifier improvements and fixes, from Andrii Nakryiko, Andrei Matei.

 - Complete precision tracking support for register spills
 - Fix verification of possibly-zero-sized stack accesses
 - Fix access to uninit stack slots
 - Track aligned STACK_ZERO cases as imprecise spilled registers.
   It improves the verifier "instructions processed" metric from single
   digit to 50-60% for some programs.
 - Fix verifier retval logic

4) Support for VLAN tag in XDP hints, from Larysa Zaremba.

5) Allocate BPF trampoline via bpf_prog_pack mechanism, from Song Liu.

End result: better memory utilization and lower I$ miss for calls to BPF
via BPF trampoline.

6) Fix race between BPF prog accessing inner map and parallel delete,
from Hou Tao.

7) Add bpf_xdp_get_xfrm_state() kfunc, from Daniel Xu.

It allows BPF interact with IPSEC infra. The intent is to support
software RSS (via XDP) for the upcoming ipsec pcpu work.
Experiments on AWS demonstrate single tunnel pcpu ipsec reaching
line rate on 100G ENA nics.

8) Expand bpf_cgrp_storage to support cgroup1 non-attach, from Yafang Shao.

9) BPF file verification via fsverity, from Song Liu.

It allows BPF progs get fsverity digest.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (164 commits)
  bpf: Ensure precise is reset to false in __mark_reg_const_zero()
  selftests/bpf: Add more uprobe multi fail tests
  bpf: Fail uprobe multi link with negative offset
  selftests/bpf: Test the release of map btf
  s390/bpf: Fix indirect trampoline generation
  selftests/bpf: Temporarily disable dummy_struct_ops test on s390
  x86/cfi,bpf: Fix bpf_exception_cb() signature
  bpf: Fix dtor CFI
  cfi: Add CFI_NOSEAL()
  x86/cfi,bpf: Fix bpf_struct_ops CFI
  x86/cfi,bpf: Fix bpf_callback_t CFI
  x86/cfi,bpf: Fix BPF JIT call
  cfi: Flip headers
  selftests/bpf: Add test for abnormal cnt during multi-kprobe attachment
  selftests/bpf: Don't use libbpf_get_error() in kprobe_multi_test
  selftests/bpf: Add test for abnormal cnt during multi-uprobe attachment
  bpf: Limit the number of kprobes when attaching program to multiple kprobes
  bpf: Limit the number of uprobes when attaching program to multiple uprobes
  bpf: xdp: Register generic_kfunc_set with XDP programs
  selftests/bpf: utilize string values for delegate_xxx mount options
  ...
====================

Link: https://lore.kernel.org/r/20231219000520.34178-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-18 16:46:08 -08:00
Andrii Nakryiko
8e432e6197 bpf: Ensure precise is reset to false in __mark_reg_const_zero()
It is safe to always start with imprecise SCALAR_VALUE register.
Previously __mark_reg_const_zero() relied on caller to reset precise
mark, but it's very error prone and we already missed it in a few
places. So instead make __mark_reg_const_zero() reset precision always,
as it's a safe default for SCALAR_VALUE. Explanation is basically the
same as for why we are resetting (or rather not setting) precision in
current state. If necessary, precision propagation will set it to
precise correctly.

As such, also remove a big comment about forward precision propagation
in mark_reg_stack_read() and avoid unnecessarily setting precision to
true after reading from STACK_ZERO stack. Again, precision propagation
will correctly handle this, if that SCALAR_VALUE register will ever be
needed to be precise.

Reported-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231218173601.53047-1-andrii@kernel.org
2023-12-18 23:54:21 +01:00
Alexei Starovoitov
852486b35f x86/cfi,bpf: Fix bpf_exception_cb() signature
As per the earlier patches, BPF sub-programs have bpf_callback_t
signature and CFI expects callers to have matching signature. This is
violated by bpf_prog_aux::bpf_exception_cb().

[peterz: Changelog]
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/CAADnVQ+Z7UcXXBBhMubhcMM=R-dExk-uHtfOLtoLxQ1XxEpqEA@mail.gmail.com
Link: https://lore.kernel.org/r/20231215092707.910319166@infradead.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-15 16:25:55 -08:00
Peter Zijlstra
e4c0033989 bpf: Fix dtor CFI
Ensure the various dtor functions match their prototype and retain
their CFI signatures, since they don't have their address taken, they
are prone to not getting CFI, making them impossible to call
indirectly.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20231215092707.799451071@infradead.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-15 16:25:55 -08:00
Peter Zijlstra
2cd3e3772e x86/cfi,bpf: Fix bpf_struct_ops CFI
BPF struct_ops uses __arch_prepare_bpf_trampoline() to write
trampolines for indirect function calls. These tramplines much have
matching CFI.

In order to obtain the correct CFI hash for the various methods, add a
matching structure that contains stub functions, the compiler will
generate correct CFI which we can pilfer for the trampolines.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20231215092707.566977112@infradead.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-15 16:25:55 -08:00
Peter Zijlstra
4f9087f166 x86/cfi,bpf: Fix BPF JIT call
The current BPF call convention is __nocfi, except when it calls !JIT things,
then it calls regular C functions.

It so happens that with FineIBT the __nocfi and C calling conventions are
incompatible. Specifically __nocfi will call at func+0, while FineIBT will have
endbr-poison there, which is not a valid indirect target. Causing #CP.

Notably this only triggers on IBT enabled hardware, which is probably why this
hasn't been reported (also, most people will have JIT on anyway).

Implement proper CFI prologues for the BPF JIT codegen and drop __nocfi for
x86.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20231215092707.345270396@infradead.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-15 16:25:55 -08:00
Daniel Xu
7489723c2e bpf: xdp: Register generic_kfunc_set with XDP programs
Registering generic_kfunc_set with XDP programs enables some of the
newer BPF features inside XDP -- namely tree based data structures and
BPF exceptions.

The current motivation for this commit is to enable assertions inside
XDP bpf progs. Assertions are a standard and useful tool to encode
intent.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/d07d4614b81ca6aada44fcb89bb6b618fb66e4ca.1702594357.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-14 19:12:16 -08:00
Andrii Nakryiko
c5707b2146 bpf: support symbolic BPF FS delegation mount options
Besides already supported special "any" value and hex bit mask, support
string-based parsing of delegation masks based on exact enumerator
names. Utilize BTF information of `enum bpf_cmd`, `enum bpf_map_type`,
`enum bpf_prog_type`, and `enum bpf_attach_type` types to find supported
symbolic names (ignoring __MAX_xxx guard values and stripping repetitive
prefixes like BPF_ for cmd and attach types, BPF_MAP_TYPE_ for maps, and
BPF_PROG_TYPE_ for prog types). The case doesn't matter, but it is
normalized to lower case in mount option output. So "PROG_LOAD",
"prog_load", and "MAP_create" are all valid values to specify for
delegate_cmds options, "array" is among supported for map types, etc.

Besides supporting string values, we also support multiple values
specified at the same time, using colon (':') separator.

There are corresponding changes on bpf_show_options side to use known
values to print them in human-readable format, falling back to hex mask
printing, if there are any unrecognized bits. This shouldn't be
necessary when enum BTF information is present, but in general we should
always be able to fall back to this even if kernel was built without BTF.
As mentioned, emitted symbolic names are normalized to be all lower case.

Example below shows various ways to specify delegate_cmds options
through mount command and how mount options are printed back:

12/14 14:39:07.604
vmuser@archvm:~/local/linux/tools/testing/selftests/bpf
$ mount | rg token

  $ sudo mkdir -p /sys/fs/bpf/token
  $ sudo mount -t bpf bpffs /sys/fs/bpf/token \
               -o delegate_cmds=prog_load:MAP_CREATE \
               -o delegate_progs=kprobe \
               -o delegate_attachs=xdp
  $ mount | grep token
  bpffs on /sys/fs/bpf/token type bpf (rw,relatime,delegate_cmds=map_create:prog_load,delegate_progs=kprobe,delegate_attachs=xdp)

Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231214225016.1209867-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-14 17:30:27 -08:00
Yonghong Song
59e5791f59 bpf: Fix a race condition between btf_put() and map_free()
When running `./test_progs -j` in my local vm with latest kernel,
I once hit a kasan error like below:

  [ 1887.184724] BUG: KASAN: slab-use-after-free in bpf_rb_root_free+0x1f8/0x2b0
  [ 1887.185599] Read of size 4 at addr ffff888106806910 by task kworker/u12:2/2830
  [ 1887.186498]
  [ 1887.186712] CPU: 3 PID: 2830 Comm: kworker/u12:2 Tainted: G           OEL     6.7.0-rc3-00699-g90679706d486-dirty #494
  [ 1887.188034] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  [ 1887.189618] Workqueue: events_unbound bpf_map_free_deferred
  [ 1887.190341] Call Trace:
  [ 1887.190666]  <TASK>
  [ 1887.190949]  dump_stack_lvl+0xac/0xe0
  [ 1887.191423]  ? nf_tcp_handle_invalid+0x1b0/0x1b0
  [ 1887.192019]  ? panic+0x3c0/0x3c0
  [ 1887.192449]  print_report+0x14f/0x720
  [ 1887.192930]  ? preempt_count_sub+0x1c/0xd0
  [ 1887.193459]  ? __virt_addr_valid+0xac/0x120
  [ 1887.194004]  ? bpf_rb_root_free+0x1f8/0x2b0
  [ 1887.194572]  kasan_report+0xc3/0x100
  [ 1887.195085]  ? bpf_rb_root_free+0x1f8/0x2b0
  [ 1887.195668]  bpf_rb_root_free+0x1f8/0x2b0
  [ 1887.196183]  ? __bpf_obj_drop_impl+0xb0/0xb0
  [ 1887.196736]  ? preempt_count_sub+0x1c/0xd0
  [ 1887.197270]  ? preempt_count_sub+0x1c/0xd0
  [ 1887.197802]  ? _raw_spin_unlock+0x1f/0x40
  [ 1887.198319]  bpf_obj_free_fields+0x1d4/0x260
  [ 1887.198883]  array_map_free+0x1a3/0x260
  [ 1887.199380]  bpf_map_free_deferred+0x7b/0xe0
  [ 1887.199943]  process_scheduled_works+0x3a2/0x6c0
  [ 1887.200549]  worker_thread+0x633/0x890
  [ 1887.201047]  ? __kthread_parkme+0xd7/0xf0
  [ 1887.201574]  ? kthread+0x102/0x1d0
  [ 1887.202020]  kthread+0x1ab/0x1d0
  [ 1887.202447]  ? pr_cont_work+0x270/0x270
  [ 1887.202954]  ? kthread_blkcg+0x50/0x50
  [ 1887.203444]  ret_from_fork+0x34/0x50
  [ 1887.203914]  ? kthread_blkcg+0x50/0x50
  [ 1887.204397]  ret_from_fork_asm+0x11/0x20
  [ 1887.204913]  </TASK>
  [ 1887.204913]  </TASK>
  [ 1887.205209]
  [ 1887.205416] Allocated by task 2197:
  [ 1887.205881]  kasan_set_track+0x3f/0x60
  [ 1887.206366]  __kasan_kmalloc+0x6e/0x80
  [ 1887.206856]  __kmalloc+0xac/0x1a0
  [ 1887.207293]  btf_parse_fields+0xa15/0x1480
  [ 1887.207836]  btf_parse_struct_metas+0x566/0x670
  [ 1887.208387]  btf_new_fd+0x294/0x4d0
  [ 1887.208851]  __sys_bpf+0x4ba/0x600
  [ 1887.209292]  __x64_sys_bpf+0x41/0x50
  [ 1887.209762]  do_syscall_64+0x4c/0xf0
  [ 1887.210222]  entry_SYSCALL_64_after_hwframe+0x63/0x6b
  [ 1887.210868]
  [ 1887.211074] Freed by task 36:
  [ 1887.211460]  kasan_set_track+0x3f/0x60
  [ 1887.211951]  kasan_save_free_info+0x28/0x40
  [ 1887.212485]  ____kasan_slab_free+0x101/0x180
  [ 1887.213027]  __kmem_cache_free+0xe4/0x210
  [ 1887.213514]  btf_free+0x5b/0x130
  [ 1887.213918]  rcu_core+0x638/0xcc0
  [ 1887.214347]  __do_softirq+0x114/0x37e

The error happens at bpf_rb_root_free+0x1f8/0x2b0:

  00000000000034c0 <bpf_rb_root_free>:
  ; {
    34c0: f3 0f 1e fa                   endbr64
    34c4: e8 00 00 00 00                callq   0x34c9 <bpf_rb_root_free+0x9>
    34c9: 55                            pushq   %rbp
    34ca: 48 89 e5                      movq    %rsp, %rbp
  ...
  ;       if (rec && rec->refcount_off >= 0 &&
    36aa: 4d 85 ed                      testq   %r13, %r13
    36ad: 74 a9                         je      0x3658 <bpf_rb_root_free+0x198>
    36af: 49 8d 7d 10                   leaq    0x10(%r13), %rdi
    36b3: e8 00 00 00 00                callq   0x36b8 <bpf_rb_root_free+0x1f8>
                                        <==== kasan function
    36b8: 45 8b 7d 10                   movl    0x10(%r13), %r15d
                                        <==== use-after-free load
    36bc: 45 85 ff                      testl   %r15d, %r15d
    36bf: 78 8c                         js      0x364d <bpf_rb_root_free+0x18d>

So the problem is at rec->refcount_off in the above.

I did some source code analysis and find the reason.
                                  CPU A                        CPU B
  bpf_map_put:
    ...
    btf_put with rcu callback
    ...
    bpf_map_free_deferred
      with system_unbound_wq
    ...                          ...                           ...
    ...                          btf_free_rcu:                 ...
    ...                          ...                           bpf_map_free_deferred:
    ...                          ...
    ...         --------->       btf_struct_metas_free()
    ...         | race condition ...
    ...         --------->                                     map->ops->map_free()
    ...
    ...                          btf->struct_meta_tab = NULL

In the above, map_free() corresponds to array_map_free() and eventually
calling bpf_rb_root_free() which calls:
  ...
  __bpf_obj_drop_impl(obj, field->graph_root.value_rec, false);
  ...

Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with following code:

  meta = btf_find_struct_meta(btf, btf_id);
  if (!meta)
    return -EFAULT;
  rec->fields[i].graph_root.value_rec = meta->record;

So basically, 'value_rec' is a pointer to the record in struct_metas_tab.
And it is possible that that particular record has been freed by
btf_struct_metas_free() and hence we have a kasan error here.

Actually it is very hard to reproduce the failure with current bpf/bpf-next
code, I only got the above error once. To increase reproducibility, I added
a delay in bpf_map_free_deferred() to delay map->ops->map_free(), which
significantly increased reproducibility.

  diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
  index 5e43ddd1b83f..aae5b5213e93 100644
  --- a/kernel/bpf/syscall.c
  +++ b/kernel/bpf/syscall.c
  @@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
        struct bpf_map *map = container_of(work, struct bpf_map, work);
        struct btf_record *rec = map->record;

  +     mdelay(100);
        security_bpf_map_free(map);
        bpf_map_release_memcg(map);
        /* implementation dependent freeing */

Hao also provided test cases ([1]) for easily reproducing the above issue.

There are two ways to fix the issue, the v1 of the patch ([2]) moving
btf_put() after map_free callback, and the v5 of the patch ([3]) using
a kptr style fix which tries to get a btf reference during
map_check_btf(). Each approach has its pro and cons. The first approach
delays freeing btf while the second approach needs to acquire reference
depending on context which makes logic not very elegant and may
complicate things with future new data structures. Alexei
suggested in [4] going back to v1 which is what this patch
tries to do.

Rerun './test_progs -j' with the above mdelay() hack for a couple
of times and didn't observe the error for the above rb_root test cases.
Running Hou's test ([1]) is also successful.

  [1] https://lore.kernel.org/bpf/20231207141500.917136-1-houtao@huaweicloud.com/
  [2] v1: https://lore.kernel.org/bpf/20231204173946.3066377-1-yonghong.song@linux.dev/
  [3] v5: https://lore.kernel.org/bpf/20231208041621.2968241-1-yonghong.song@linux.dev/
  [4] v4: https://lore.kernel.org/bpf/CAADnVQJ3FiXUhZJwX_81sjZvSYYKCFB3BT6P8D59RS2Gu+0Z7g@mail.gmail.com/

Cc: Hou Tao <houtao@huaweicloud.com>
Fixes: 958cf2e273 ("bpf: Introduce bpf_obj_new")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231214203815.1469107-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-14 17:10:32 -08:00
Hou Tao
dc68540913 bpf: Use GFP_KERNEL in bpf_event_entry_gen()
rcu_read_lock() is no longer held when invoking bpf_event_entry_gen()
which is called by perf_event_fd_array_get_ptr(), so using GFP_KERNEL
instead of GFP_ATOMIC to reduce the possibility of failures due to
out-of-memory.

Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231214043010.3458072-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-13 20:49:11 -08:00
Hou Tao
8f82583f95 bpf: Reduce the scope of rcu_read_lock when updating fd map
There is no rcu-read-lock requirement for ops->map_fd_get_ptr() or
ops->map_fd_put_ptr(), so doesn't use rcu-read-lock for these two
callbacks.

For bpf_fd_array_map_update_elem(), accessing array->ptrs doesn't need
rcu-read-lock because array->ptrs must still be allocated. For
bpf_fd_htab_map_update_elem(), htab_map_update_elem() only requires
rcu-read-lock to be held to avoid the WARN_ON_ONCE(), so only use
rcu_read_lock() during the invocation of htab_map_update_elem().

Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231214043010.3458072-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-13 20:49:11 -08:00
Hou Tao
2a0c6b41ee bpf: Update the comments in maybe_wait_bpf_programs()
Since commit 638e4b825d ("bpf: Allows per-cpu maps and map-in-map in
sleepable programs"), sleepable BPF program can also use map-in-map, but
maybe_wait_bpf_programs() doesn't handle it accordingly. The main reason
is that using synchronize_rcu_tasks_trace() to wait for the completions
of these sleepable BPF programs may incur a very long delay and
userspace may think it is hung, so the wait for sleepable BPF programs
is skipped. Update the comments in maybe_wait_bpf_programs() to reflect
the reason.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20231211083447.1921178-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-13 17:01:42 -08:00
Matt Bobrowski
b13cddf633 bpf: add small subset of SECURITY_PATH hooks to BPF sleepable_lsm_hooks list
security_path_* based LSM hooks appear to be generally missing from
the sleepable_lsm_hooks list. Initially add a small subset of them to
the preexisting sleepable_lsm_hooks list so that sleepable BPF helpers
like bpf_d_path() can be used from sleepable BPF LSM based programs.

The security_path_* hooks added in this patch are similar to the
security_inode_* counterparts that already exist in the
sleepable_lsm_hooks list, and are called in roughly similar points and
contexts. Presumably, making them OK to be also annotated as
sleepable.

Building a kernel with DEBUG_ATOMIC_SLEEP options enabled and running
reasonable workloads stimulating activity that would be intercepted by
such security hooks didn't show any splats.

Notably, I haven't added all the security_path_* LSM hooks that are
available as I don't need them at this point in time.

Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
Acked-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/r/ZXM3IHHXpNY9y82a@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-13 16:56:19 -08:00
Andrii Nakryiko
f5fdb51fb9 bpf: fail BPF_TOKEN_CREATE if no delegation option was set on BPF FS
It's quite confusing in practice when it's possible to successfully
create a BPF token from BPF FS that didn't have any of delegate_xxx
mount options set up. While it's not wrong, it's actually more
meaningful to reject BPF_TOKEN_CREATE with specific error code (-ENOENT)
to let user-space know that no token delegation is setup up.

So, instead of creating empty BPF token that will be always ignored
because it doesn't have any of the allow_xxx bits set, reject it with
-ENOENT. If we ever need empty BPF token to be possible, we can support
that with extra flag passed into BPF_TOKEN_CREATE.

Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231213190842.3844987-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-13 15:47:04 -08:00
Jie Jiang
750e785796 bpf: Support uid and gid when mounting bpffs
Parse uid and gid in bpf_parse_param() so that they can be passed in as
the `data` parameter when mount() bpffs. This will be useful when we
want to control which user/group has the control to the mounted bpffs,
otherwise a separate chown() call will be needed.

Signed-off-by: Jie Jiang <jiejiang@chromium.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Mike Frysinger <vapier@chromium.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231212093923.497838-1-jiejiang@chromium.org
2023-12-13 15:37:42 -08:00
Andrei Matei
745e031130 bpf: Comment on check_mem_size_reg
This patch adds a comment to check_mem_size_reg -- a function whose
meaning is not very transparent. The function implicitly deals with two
registers connected by convention, which is not obvious.

Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231210225149.67639-1-andreimatei1@gmail.com
2023-12-12 15:35:32 -08:00
Yang Li
56c26d5ad8 bpf: Remove unused backtrack_state helper functions
The function are defined in the verifier.c file, but not called
elsewhere, so delete the unused function.

kernel/bpf/verifier.c:3448:20: warning: unused function 'bt_set_slot'
kernel/bpf/verifier.c:3453:20: warning: unused function 'bt_clear_slot'
kernel/bpf/verifier.c:3488:20: warning: unused function 'bt_is_slot_set'

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20231212005436.103829-1-yang.lee@linux.alibaba.com

Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=7714
2023-12-12 09:52:07 -08:00
Andrii Nakryiko
1a1ad782dc bpf: tidy up exception callback management a bit
Use the fact that we are passing subprog index around and have
a corresponding struct bpf_subprog_info in bpf_verifier_env for each
subprogram. We don't need to separately pass around a flag whether
subprog is exception callback or not, each relevant verifier function
can determine this using provided subprog index if we maintain
bpf_subprog_info properly.

Also move out exception callback-specific logic from
btf_prepare_func_args(), keeping it generic. We can enforce all these
restriction right before exception callback verification pass. We add
out parameter, arg_cnt, for now, but this will be unnecessary with
subsequent refactoring and will be removed.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231204233931.49758-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-11 19:23:32 -08:00
Andrii Nakryiko
22b769bb4f bpf: emit more dynptr information in verifier log
Emit dynptr type for CONST_PTR_TO_DYNPTR register. Also emit id,
ref_obj_id, and dynptr_id fields for STACK_DYNPTR stack slots.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231204233931.49758-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-11 19:21:22 -08:00
Andrii Nakryiko
1e68485d82 bpf: log PTR_TO_MEM memory size in verifier log
Emit valid memory size addressable through PTR_TO_MEM register.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231204233931.49758-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-11 19:21:22 -08:00
David Vernet
a6de18f310 bpf: Add bpf_cpumask_weight() kfunc
It can be useful to query how many bits are set in a cpumask. For
example, if you want to perform special logic for the last remaining
core that's set in a mask. Let's therefore add a new
bpf_cpumask_weight() kfunc which checks how many bits are set in a mask.

Signed-off-by: David Vernet <void@manifault.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231207210843.168466-2-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-09 21:37:33 -08:00
Andrii Nakryiko
482d548d40 bpf: handle fake register spill to stack with BPF_ST_MEM instruction
When verifier validates BPF_ST_MEM instruction that stores known
constant to stack (e.g., *(u64 *)(r10 - 8) = 123), it effectively spills
a fake register with a constant (but initially imprecise) value to
a stack slot. Because read-side logic treats it as a proper register
fill from stack slot, we need to mark such stack slot initialization as
INSN_F_STACK_ACCESS instruction to stop precision backtracking from
missing it.

Fixes: 41f6f64e69 ("bpf: support non-r10 register spill/fill to/from stack in precision tracking")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231209010958.66758-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-09 18:43:20 -08:00
Hou Tao
06e5c999f1 bpf: Set uattr->batch.count as zero before batched update or deletion
generic_map_{delete,update}_batch() doesn't set uattr->batch.count as
zero before it tries to allocate memory for key. If the memory
allocation fails, the value of uattr->batch.count will be incorrect.

Fix it by setting uattr->batch.count as zero beore batched update or
deletion.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231208102355.2628918-6-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-09 18:12:34 -08:00
Hou Tao
67ad2c73ff bpf: Only call maybe_wait_bpf_programs() when map operation succeeds
There is no need to call maybe_wait_bpf_programs() if update or deletion
operation fails. So only call maybe_wait_bpf_programs() if update or
deletion operation succeeds.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231208102355.2628918-5-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-09 18:12:34 -08:00
Hou Tao
012772581d bpf: Add missed maybe_wait_bpf_programs() for htab of maps
When doing batched lookup and deletion operations on htab of maps,
maybe_wait_bpf_programs() is needed to ensure all programs don't use the
inner map after the bpf syscall returns.

Instead of adding the wait in __htab_map_lookup_and_delete_batch(),
adding the wait in bpf_map_do_batch() and also removing the calling of
maybe_wait_bpf_programs() from generic_map_{delete,update}_batch().

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231208102355.2628918-4-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-09 18:12:34 -08:00
Hou Tao
37ba5b59d6 bpf: Call maybe_wait_bpf_programs() only once for generic_map_update_batch()
Just like commit 9087c6ff8d ("bpf: Call maybe_wait_bpf_programs() only
once from generic_map_delete_batch()"), there is also no need to call
maybe_wait_bpf_programs() for each update in batched update, so only
call it once in generic_map_update_batch().

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231208102355.2628918-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-09 18:12:33 -08:00
Hou Tao
c26f2a8901 bpf: Remove unnecessary wait from bpf_map_copy_value()
Both map_lookup_elem() and generic_map_lookup_batch() use
bpf_map_copy_value() to lookup and copy the value, and there is no
update operation in bpf_map_copy_value(), so just remove the invocation
of maybe_wait_bpf_programs() from it.

Fixes: 15c14a3dca ("bpf: Add bpf_map_{value_size, update_value, map_copy_value} functions")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231208102355.2628918-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-09 18:12:33 -08:00
Yafang Shao
73d9eb340d bpf: Enable bpf_cgrp_storage for cgroup1 non-attach case
In the current cgroup1 environment, associating operations between cgroups
and applications in a BPF program requires storing a mapping of cgroup_id
to application either in a hash map or maintaining it in userspace.
However, by enabling bpf_cgrp_storage for cgroup1, it becomes possible to
conveniently store application-specific information in cgroup-local storage
and utilize it within BPF programs. Furthermore, enabling this feature for
cgroup1 involves minor modifications for the non-attach case, streamlining
the process.

However, when it comes to enabling this functionality for the cgroup1
attach case, it presents challenges. Therefore, the decision is to focus on
enabling it solely for the cgroup1 non-attach case at present. If
attempting to attach to a cgroup1 fd, the operation will simply fail with
the error code -EBADF.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231206115326.4295-2-laoar.shao@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-12-08 17:08:18 -08:00
Andrei Matei
2929bfac00 bpf: Minor cleanup around stack bounds
Push the rounding up of stack offsets into the function responsible for
growing the stack, rather than relying on all the callers to do it.
Uncertainty about whether the callers did it or not tripped up people in
a previous review.

Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20231208032519.260451-4-andreimatei1@gmail.com
2023-12-08 14:19:00 -08:00
Andrei Matei
6b4a64bafd bpf: Fix accesses to uninit stack slots
Privileged programs are supposed to be able to read uninitialized stack
memory (ever since 6715df8d5) but, before this patch, these accesses
were permitted inconsistently. In particular, accesses were permitted
above state->allocated_stack, but not below it. In other words, if the
stack was already "large enough", the access was permitted, but
otherwise the access was rejected instead of being allowed to "grow the
stack". This undesired rejection was happening in two places:
- in check_stack_slot_within_bounds()
- in check_stack_range_initialized()
This patch arranges for these accesses to be permitted. A bunch of tests
that were relying on the old rejection had to change; all of them were
changed to add also run unprivileged, in which case the old behavior
persists. One tests couldn't be updated - global_func16 - because it
can't run unprivileged for other reasons.

This patch also fixes the tracking of the stack size for variable-offset
reads. This second fix is bundled in the same commit as the first one
because they're inter-related. Before this patch, writes to the stack
using registers containing a variable offset (as opposed to registers
with fixed, known values) were not properly contributing to the
function's needed stack size. As a result, it was possible for a program
to verify, but then to attempt to read out-of-bounds data at runtime
because a too small stack had been allocated for it.

Each function tracks the size of the stack it needs in
bpf_subprog_info.stack_depth, which is maintained by
update_stack_depth(). For regular memory accesses, check_mem_access()
was calling update_state_depth() but it was passing in only the fixed
part of the offset register, ignoring the variable offset. This was
incorrect; the minimum possible value of that register should be used
instead.

This tracking is now fixed by centralizing the tracking of stack size in
grow_stack_state(), and by lifting the calls to grow_stack_state() to
check_stack_access_within_bounds() as suggested by Andrii. The code is
now simpler and more convincingly tracks the correct maximum stack size.
check_stack_range_initialized() can now rely on enough stack having been
allocated for the access; this helps with the fix for the first issue.

A few tests were changed to also check the stack depth computation. The
one that fails without this patch is verifier_var_off:stack_write_priv_vs_unpriv.

Fixes: 01f810ace9 ("bpf: Allow variable-offset stack access")
Reported-by: Hao Sun <sunhao.th@gmail.com>
Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231208032519.260451-3-andreimatei1@gmail.com

Closes: https://lore.kernel.org/bpf/CABWLsev9g8UP_c3a=1qbuZUi20tGoUXoU07FPf-5FLvhOKOY+Q@mail.gmail.com/
2023-12-08 14:19:00 -08:00
Jakub Kicinski
2483e7f04c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

drivers/net/ethernet/stmicro/stmmac/dwmac5.c
drivers/net/ethernet/stmicro/stmmac/dwmac5.h
drivers/net/ethernet/stmicro/stmmac/dwxgmac2_core.c
drivers/net/ethernet/stmicro/stmmac/hwif.h
  37e4b8df27 ("net: stmmac: fix FPE events losing")
  c3f3b97238 ("net: stmmac: Refactor EST implementation")
https://lore.kernel.org/all/20231206110306.01e91114@canb.auug.org.au/

Adjacent changes:

net/ipv4/tcp_ao.c
  9396c4ee93 ("net/tcp: Don't store TCP-AO maclen on reqsk")
  7b0f570f87 ("tcp: Move TCP-AO bits from cookie_v[46]_check() to tcp_ao_syncookie().")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-07 17:53:17 -08:00
Andrei Matei
1d38a9ee81 bpf: Guard stack limits against 32bit overflow
This patch promotes the arithmetic around checking stack bounds to be
done in the 64-bit domain, instead of the current 32bit. The arithmetic
implies adding together a 64-bit register with a int offset. The
register was checked to be below 1<<29 when it was variable, but not
when it was fixed. The offset either comes from an instruction (in which
case it is 16 bit), from another register (in which case the caller
checked it to be below 1<<29 [1]), or from the size of an argument to a
kfunc (in which case it can be a u32 [2]). Between the register being
inconsistently checked to be below 1<<29, and the offset being up to an
u32, it appears that we were open to overflowing the `int`s which were
currently used for arithmetic.

[1] 815fb87b75/kernel/bpf/verifier.c (L7494-L7498)
[2] 815fb87b75/kernel/bpf/verifier.c (L11904)

Reported-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231207041150.229139-4-andreimatei1@gmail.com
2023-12-07 13:58:10 -08:00
Andrei Matei
a833a17aea bpf: Fix verification of indirect var-off stack access
This patch fixes a bug around the verification of possibly-zero-sized
stack accesses. When the access was done through a var-offset stack
pointer, check_stack_access_within_bounds was incorrectly computing the
maximum-offset of a zero-sized read to be the same as the register's min
offset. Instead, we have to take in account the register's maximum
possible value. The patch also simplifies how the max offset is checked;
the check is now simpler than for min offset.

The bug was allowing accesses to erroneously pass the
check_stack_access_within_bounds() checks, only to later crash in
check_stack_range_initialized() when all the possibly-affected stack
slots are iterated (this time with a correct max offset).
check_stack_range_initialized() is relying on
check_stack_access_within_bounds() for its accesses to the
stack-tracking vector to be within bounds; in the case of zero-sized
accesses, we were essentially only verifying that the lowest possible
slot was within bounds. We would crash when the max-offset of the stack
pointer was >= 0 (which shouldn't pass verification, and hopefully is
not something anyone's code attempts to do in practice).

Thanks Hao for reporting!

Fixes: 01f810ace9 ("bpf: Allow variable-offset stack access")
Reported-by: Hao Sun <sunhao.th@gmail.com>
Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231207041150.229139-2-andreimatei1@gmail.com

Closes: https://lore.kernel.org/bpf/CACkBjsZGEUaRCHsmaX=h-efVogsRfK1FPxmkgb0Os_frnHiNdw@mail.gmail.com/
2023-12-07 13:57:53 -08:00
Song Liu
26ef208c20 bpf: Use arch_bpf_trampoline_size
Instead of blindly allocating PAGE_SIZE for each trampoline, check the size
of the trampoline with arch_bpf_trampoline_size(). This size is saved in
bpf_tramp_image->size, and used for modmem charge/uncharge. The fallback
arch_alloc_bpf_trampoline() still allocates a whole page because we need to
use set_memory_* to protect the memory.

struct_ops trampoline still uses a whole page for multiple trampolines.

With this size check at caller (regular trampoline and struct_ops
trampoline), remove arch_bpf_trampoline_size() from
arch_prepare_bpf_trampoline() in archs.

Also, update bpf_image_ksym_add() to handle symbol of different sizes.

Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>  # on s390x
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Björn Töpel <bjorn@rivosinc.com>
Tested-by: Björn Töpel <bjorn@rivosinc.com> # on riscv
Link: https://lore.kernel.org/r/20231206224054.492250-7-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 17:17:20 -08:00
Song Liu
96d1b7c081 bpf: Add arch_bpf_trampoline_size()
This helper will be used to calculate the size of the trampoline before
allocating the memory.

arch_prepare_bpf_trampoline() for arm64 and riscv64 can use
arch_bpf_trampoline_size() to check the trampoline fits in the image.

OTOH, arch_prepare_bpf_trampoline() for s390 has to call the JIT process
twice, so it cannot use arch_bpf_trampoline_size().

Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>  # on s390x
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Björn Töpel <bjorn@rivosinc.com>
Tested-by: Björn Töpel <bjorn@rivosinc.com> # on riscv
Link: https://lore.kernel.org/r/20231206224054.492250-6-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 17:17:20 -08:00
Song Liu
82583daa2e bpf: Add helpers for trampoline image management
As BPF trampoline of different archs moves from bpf_jit_[alloc|free]_exec()
to bpf_prog_pack_[alloc|free](), we need to use different _alloc, _free for
different archs during the transition. Add the following helpers for this
transition:

void *arch_alloc_bpf_trampoline(unsigned int size);
void arch_free_bpf_trampoline(void *image, unsigned int size);
void arch_protect_bpf_trampoline(void *image, unsigned int size);
void arch_unprotect_bpf_trampoline(void *image, unsigned int size);

The fallback version of these helpers require size <= PAGE_SIZE, but they
are only called with size == PAGE_SIZE. They will be called with size <
PAGE_SIZE when arch_bpf_trampoline_size() helper is introduced later.

Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>  # on s390x
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20231206224054.492250-4-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 17:17:20 -08:00
Song Liu
7a3d9a159b bpf: Adjust argument names of arch_prepare_bpf_trampoline()
We are using "im" for "struct bpf_tramp_image" and "tr" for "struct
bpf_trampoline" in most of the code base. The only exception is the
prototype and fallback version of arch_prepare_bpf_trampoline(). Update
them to match the rest of the code base.

We mix "orig_call" and "func_addr" for the argument in different versions
of arch_prepare_bpf_trampoline(). s/orig_call/func_addr/g so they match.

Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>  # on s390x
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20231206224054.492250-3-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 17:17:20 -08:00
Song Liu
f08a1c6582 bpf: Let bpf_prog_pack_free handle any pointer
Currently, bpf_prog_pack_free only can only free pointer to struct
bpf_binary_header, which is not flexible. Add a size argument to
bpf_prog_pack_free so that it can handle any pointer.

Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>  # on s390x
Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20231206224054.492250-2-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 17:17:20 -08:00
Jiri Olsa
4b7de80160 bpf: Fix prog_array_map_poke_run map poke update
Lee pointed out issue found by syscaller [0] hitting BUG in prog array
map poke update in prog_array_map_poke_run function due to error value
returned from bpf_arch_text_poke function.

There's race window where bpf_arch_text_poke can fail due to missing
bpf program kallsym symbols, which is accounted for with check for
-EINVAL in that BUG_ON call.

The problem is that in such case we won't update the tail call jump
and cause imbalance for the next tail call update check which will
fail with -EBUSY in bpf_arch_text_poke.

I'm hitting following race during the program load:

  CPU 0                             CPU 1

  bpf_prog_load
    bpf_check
      do_misc_fixups
        prog_array_map_poke_track

                                    map_update_elem
                                      bpf_fd_array_map_update_elem
                                        prog_array_map_poke_run

                                          bpf_arch_text_poke returns -EINVAL

    bpf_prog_kallsyms_add

After bpf_arch_text_poke (CPU 1) fails to update the tail call jump, the next
poke update fails on expected jump instruction check in bpf_arch_text_poke
with -EBUSY and triggers the BUG_ON in prog_array_map_poke_run.

Similar race exists on the program unload.

Fixing this by moving the update to bpf_arch_poke_desc_update function which
makes sure we call __bpf_arch_text_poke that skips the bpf address check.

Each architecture has slightly different approach wrt looking up bpf address
in bpf_arch_text_poke, so instead of splitting the function or adding new
'checkip' argument in previous version, it seems best to move the whole
map_poke_run update as arch specific code.

  [0] https://syzkaller.appspot.com/bug?extid=97a4fe20470e9bc30810

Fixes: ebf7d1f508 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT")
Reported-by: syzbot+97a4fe20470e9bc30810@syzkaller.appspotmail.com
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Cc: Lee Jones <lee@kernel.org>
Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/bpf/20231206083041.1306660-2-jolsa@kernel.org
2023-12-06 22:40:16 +01:00
Andrii Nakryiko
d734ca7b33 bpf,lsm: add BPF token LSM hooks
Wire up bpf_token_create and bpf_token_free LSM hooks, which allow to
allocate LSM security blob (we add `void *security` field to struct
bpf_token for that), but also control who can instantiate BPF token.
This follows existing pattern for BPF map and BPF prog.

Also add security_bpf_token_allow_cmd() and security_bpf_token_capable()
LSM hooks that allow LSM implementation to control and negate (if
necessary) BPF token's delegation of a specific bpf_cmd and capability,
respectively.

Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-12-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 10:03:00 -08:00
Andrii Nakryiko
66d636d70a bpf,lsm: refactor bpf_map_alloc/bpf_map_free LSM hooks
Similarly to bpf_prog_alloc LSM hook, rename and extend bpf_map_alloc
hook into bpf_map_create, taking not just struct bpf_map, but also
bpf_attr and bpf_token, to give a fuller context to LSMs.

Unlike bpf_prog_alloc, there is no need to move the hook around, as it
currently is firing right before allocating BPF map ID and FD, which
seems to be a sweet spot.

But like bpf_prog_alloc/bpf_prog_free combo, make sure that bpf_map_free
LSM hook is called even if bpf_map_create hook returned error, as if few
LSMs are combined together it could be that one LSM successfully
allocated security blob for its needs, while subsequent LSM rejected BPF
map creation. The former LSM would still need to free up LSM blob, so we
need to ensure security_bpf_map_free() is called regardless of the
outcome.

Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-11-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 10:02:59 -08:00
Andrii Nakryiko
c3dd6e94df bpf,lsm: refactor bpf_prog_alloc/bpf_prog_free LSM hooks
Based on upstream discussion ([0]), rework existing
bpf_prog_alloc_security LSM hook. Rename it to bpf_prog_load and instead
of passing bpf_prog_aux, pass proper bpf_prog pointer for a full BPF
program struct. Also, we pass bpf_attr union with all the user-provided
arguments for BPF_PROG_LOAD command.  This will give LSMs as much
information as we can basically provide.

The hook is also BPF token-aware now, and optional bpf_token struct is
passed as a third argument. bpf_prog_load LSM hook is called after
a bunch of sanity checks were performed, bpf_prog and bpf_prog_aux were
allocated and filled out, but right before performing full-fledged BPF
verification step.

bpf_prog_free LSM hook is now accepting struct bpf_prog argument, for
consistency. SELinux code is adjusted to all new names, types, and
signatures.

Note, given that bpf_prog_load (previously bpf_prog_alloc) hook can be
used by some LSMs to allocate extra security blob, but also by other
LSMs to reject BPF program loading, we need to make sure that
bpf_prog_free LSM hook is called after bpf_prog_load/bpf_prog_alloc one
*even* if the hook itself returned error. If we don't do that, we run
the risk of leaking memory. This seems to be possible today when
combining SELinux and BPF LSM, as one example, depending on their
relative ordering.

Also, for BPF LSM setup, add bpf_prog_load and bpf_prog_free to
sleepable LSM hooks list, as they are both executed in sleepable
context. Also drop bpf_prog_load hook from untrusted, as there is no
issue with refcount or anything else anymore, that originally forced us
to add it to untrusted list in c0c852dd18 ("bpf: Do not mark certain LSM
hook arguments as trusted"). We now trigger this hook much later and it
should not be an issue anymore.

  [0] https://lore.kernel.org/bpf/9fe88aef7deabbe87d3fc38c4aea3c69.paul@paul-moore.com/

Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-10-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 10:02:59 -08:00
Andrii Nakryiko
8062fb12de bpf: consistently use BPF token throughout BPF verifier logic
Remove remaining direct queries to perfmon_capable() and bpf_capable()
in BPF verifier logic and instead use BPF token (if available) to make
decisions about privileges.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-9-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 10:02:59 -08:00
Andrii Nakryiko
4cbb270e11 bpf: take into account BPF token when fetching helper protos
Instead of performing unconditional system-wide bpf_capable() and
perfmon_capable() calls inside bpf_base_func_proto() function (and other
similar ones) to determine eligibility of a given BPF helper for a given
program, use previously recorded BPF token during BPF_PROG_LOAD command
handling to inform the decision.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-8-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 10:02:59 -08:00
Andrii Nakryiko
e1cef620f5 bpf: add BPF token support to BPF_PROG_LOAD command
Add basic support of BPF token to BPF_PROG_LOAD. Wire through a set of
allowed BPF program types and attach types, derived from BPF FS at BPF
token creation time. Then make sure we perform bpf_token_capable()
checks everywhere where it's relevant.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 10:02:59 -08:00
Andrii Nakryiko
ee54b1a910 bpf: add BPF token support to BPF_BTF_LOAD command
Accept BPF token FD in BPF_BTF_LOAD command to allow BTF data loading
through delegated BPF token. BTF loading is a pretty straightforward
operation, so as long as BPF token is created with allow_cmds granting
BPF_BTF_LOAD command, kernel proceeds to parsing BTF data and creating
BTF object.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 10:02:59 -08:00
Andrii Nakryiko
688b7270b3 bpf: add BPF token support to BPF_MAP_CREATE command
Allow providing token_fd for BPF_MAP_CREATE command to allow controlled
BPF map creation from unprivileged process through delegated BPF token.

Wire through a set of allowed BPF map types to BPF token, derived from
BPF FS at BPF token creation time. This, in combination with allowed_cmds
allows to create a narrowly-focused BPF token (controlled by privileged
agent) with a restrictive set of BPF maps that application can attempt
to create.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 10:02:59 -08:00
Andrii Nakryiko
4527358b76 bpf: introduce BPF token object
Add new kind of BPF kernel object, BPF token. BPF token is meant to
allow delegating privileged BPF functionality, like loading a BPF
program or creating a BPF map, from privileged process to a *trusted*
unprivileged process, all while having a good amount of control over which
privileged operations could be performed using provided BPF token.

This is achieved through mounting BPF FS instance with extra delegation
mount options, which determine what operations are delegatable, and also
constraining it to the owning user namespace (as mentioned in the
previous patch).

BPF token itself is just a derivative from BPF FS and can be created
through a new bpf() syscall command, BPF_TOKEN_CREATE, which accepts BPF
FS FD, which can be attained through open() API by opening BPF FS mount
point. Currently, BPF token "inherits" delegated command, map types,
prog type, and attach type bit sets from BPF FS as is. In the future,
having an BPF token as a separate object with its own FD, we can allow
to further restrict BPF token's allowable set of things either at the
creation time or after the fact, allowing the process to guard itself
further from unintentionally trying to load undesired kind of BPF
programs. But for now we keep things simple and just copy bit sets as is.

When BPF token is created from BPF FS mount, we take reference to the
BPF super block's owning user namespace, and then use that namespace for
checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
capabilities that are normally only checked against init userns (using
capable()), but now we check them using ns_capable() instead (if BPF
token is provided). See bpf_token_capable() for details.

Such setup means that BPF token in itself is not sufficient to grant BPF
functionality. User namespaced process has to *also* have necessary
combination of capabilities inside that user namespace. So while
previously CAP_BPF was useless when granted within user namespace, now
it gains a meaning and allows container managers and sys admins to have
a flexible control over which processes can and need to use BPF
functionality within the user namespace (i.e., container in practice).
And BPF FS delegation mount options and derived BPF tokens serve as
a per-container "flag" to grant overall ability to use bpf() (plus further
restrict on which parts of bpf() syscalls are treated as namespaced).

Note also, BPF_TOKEN_CREATE command itself requires ns_capable(CAP_BPF)
within the BPF FS owning user namespace, rounding up the ns_capable()
story of BPF token.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 10:02:59 -08:00
Andrii Nakryiko
40bba140c6 bpf: add BPF token delegation mount options to BPF FS
Add few new mount options to BPF FS that allow to specify that a given
BPF FS instance allows creation of BPF token (added in the next patch),
and what sort of operations are allowed under BPF token. As such, we get
4 new mount options, each is a bit mask
  - `delegate_cmds` allow to specify which bpf() syscall commands are
    allowed with BPF token derived from this BPF FS instance;
  - if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
    a set of allowable BPF map types that could be created with BPF token;
  - if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
    a set of allowable BPF program types that could be loaded with BPF token;
  - if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
    a set of allowable BPF program attach types that could be loaded with
    BPF token; delegate_progs and delegate_attachs are meant to be used
    together, as full BPF program type is, in general, determined
    through both program type and program attach type.

Currently, these mount options accept the following forms of values:
  - a special value "any", that enables all possible values of a given
  bit set;
  - numeric value (decimal or hexadecimal, determined by kernel
  automatically) that specifies a bit mask value directly;
  - all the values for a given mount option are combined, if specified
  multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
  delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
  mask.

Ideally, more convenient (for humans) symbolic form derived from
corresponding UAPI enums would be accepted (e.g., `-o
delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
it requires a bunch of UAPI header churn, so I postponed it until this
feature lands upstream or at least there is a definite consensus that
this feature is acceptable and is going to make it, just to minimize
amount of wasted effort and not increase amount of non-essential code to
be reviewed.

Attentive reader will notice that BPF FS is now marked as
FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
user namespace as long as the process has sufficient *namespaced*
capabilities within that user namespace. But in reality we still
restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
to allow creating BPF FS context object (i.e., fsopen("bpf")) from
inside unprivileged process inside non-init userns, to capture that
userns as the owning userns. It will still be required to pass this
context object back to privileged process to instantiate and mount it.

This manipulation is important, because capturing non-init userns as the
owning userns of BPF FS instance (super block) allows to use that userns
to constraint BPF token to that userns later on (see next patch). So
creating BPF FS with delegation inside unprivileged userns will restrict
derived BPF token objects to only "work" inside that intended userns,
making it scoped to a intended "container". Also, setting these
delegation options requires capable(CAP_SYS_ADMIN), so unprivileged
process cannot set this up without involvement of a privileged process.

There is a set of selftests at the end of the patch set that simulates
this sequence of steps and validates that everything works as intended.
But careful review is requested to make sure there are no missed gaps in
the implementation and testing.

This somewhat subtle set of aspects is the result of previous
discussions ([0]) about various user namespace implications and
interactions with BPF token functionality and is necessary to contain
BPF token inside intended user namespace.

  [0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/

Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 10:02:58 -08:00
Andrii Nakryiko
909fa05dd3 bpf: align CAP_NET_ADMIN checks with bpf_capable() approach
Within BPF syscall handling code CAP_NET_ADMIN checks stand out a bit
compared to CAP_BPF and CAP_PERFMON checks. For the latter, CAP_BPF or
CAP_PERFMON are checked first, but if they are not set, CAP_SYS_ADMIN
takes over and grants whatever part of BPF syscall is required.

Similar kind of checks that involve CAP_NET_ADMIN are not so consistent.
One out of four uses does follow CAP_BPF/CAP_PERFMON model: during
BPF_PROG_LOAD, if the type of BPF program is "network-related" either
CAP_NET_ADMIN or CAP_SYS_ADMIN is required to proceed.

But in three other cases CAP_NET_ADMIN is required even if CAP_SYS_ADMIN
is set:
  - when creating DEVMAP/XDKMAP/CPU_MAP maps;
  - when attaching CGROUP_SKB programs;
  - when handling BPF_PROG_QUERY command.

This patch is changing the latter three cases to follow BPF_PROG_LOAD
model, that is allowing to proceed under either CAP_NET_ADMIN or
CAP_SYS_ADMIN.

This also makes it cleaner in subsequent BPF token patches to switch
wholesomely to a generic bpf_token_capable(int cap) check, that always
falls back to CAP_SYS_ADMIN if requested capability is missing.

Cc: Jakub Kicinski <kuba@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 10:02:58 -08:00
Andrii Nakryiko
18a433b620 bpf: track aligned STACK_ZERO cases as imprecise spilled registers
Now that precision backtracing is supporting register spill/fill to/from
stack, there is another oportunity to be exploited here: minimizing
precise STACK_ZERO cases. With a simple code change we can rely on
initially imprecise register spill tracking for cases when register
spilled to stack was a known zero.

This is a very common case for initializing on the stack variables,
including rather large structures. Often times zero has no special
meaning for the subsequent BPF program logic and is often overwritten
with non-zero values soon afterwards. But due to STACK_ZERO vs
STACK_MISC tracking, such initial zero initialization actually causes
duplication of verifier states as STACK_ZERO is clearly different than
STACK_MISC or spilled SCALAR_VALUE register.

The effect of this (now) trivial change is huge, as can be seen below.
These are differences between BPF selftests, Cilium, and Meta-internal
BPF object files relative to previous patch in this series. You can see
improvements ranging from single-digit percentage improvement for
instructions and states, all the way to 50-60% reduction for some of
Meta-internal host agent programs, and even some Cilium programs.

For Meta-internal ones I left only the differences for largest BPF
object files by states/instructions, as there were too many differences
in the overall output. All the differences were improvements, reducting
number of states and thus instructions validated.

Note, Meta-internal BPF object file names are not printed below.
Many copies of balancer_ingress are actually many different
configurations of Katran, so they are different BPF programs, which
explains state reduction going from -16% all the way to 31%, depending
on BPF program logic complexity.

I also tooked a closer look at a few small-ish BPF programs to validate
the behavior. Let's take bpf_iter_netrlink.bpf.o (first row below).
While it's just 8 vs 5 states, verifier log is still pretty long to
include it here. But the reduction in states is due to the following
piece of C code:

        unsigned long ino;

	...

        sk = s->sk_socket;
        if (!sk) {
                ino = 0;
        } else {
                inode = SOCK_INODE(sk);
                bpf_probe_read_kernel(&ino, sizeof(ino), &inode->i_ino);
        }
        BPF_SEQ_PRINTF(seq, "%-8u %-8lu\n", s->sk_drops.counter, ino);
	return 0;

You can see that in some situations `ino` is zero-initialized, while in
others it's unknown value filled out by bpf_probe_read_kernel(). Before
this change code after if/else branches have to be validated twice. Once
with (precise) ino == 0, due to eager STACK_ZERO logic, and then again
for when ino is just STACK_MISC. But BPF_SEQ_PRINTF() doesn't care about
precise value of ino, so with the change in this patch verifier is able
to prune states from after one of the branches, reducing number of total
states (and instructions) required for successful validation.

Similar principle applies to bigger real-world applications, just at
a much larger scale.

SELFTESTS
=========
File                                     Program                  Insns (A)  Insns (B)  Insns    (DIFF)  States (A)  States (B)  States (DIFF)
---------------------------------------  -----------------------  ---------  ---------  ---------------  ----------  ----------  -------------
bpf_iter_netlink.bpf.linked3.o           dump_netlink                   148        104    -44 (-29.73%)           8           5   -3 (-37.50%)
bpf_iter_unix.bpf.linked3.o              dump_unix                     8474       8404     -70 (-0.83%)         151         147    -4 (-2.65%)
bpf_loop.bpf.linked3.o                   stack_check                    560        324   -236 (-42.14%)          42          24  -18 (-42.86%)
local_storage_bench.bpf.linked3.o        get_local                      120         77    -43 (-35.83%)           9           6   -3 (-33.33%)
loop6.bpf.linked3.o                      trace_virtqueue_add_sgs      10167       9868    -299 (-2.94%)         226         206   -20 (-8.85%)
pyperf600_bpf_loop.bpf.linked3.o         on_event                      4872       3423  -1449 (-29.74%)         322         229  -93 (-28.88%)
strobemeta.bpf.linked3.o                 on_event                    180697     176036   -4661 (-2.58%)        4780        4734   -46 (-0.96%)
test_cls_redirect.bpf.linked3.o          cls_redirect                 65594      65401    -193 (-0.29%)        4230        4212   -18 (-0.43%)
test_global_func_args.bpf.linked3.o      test_cls                       145        136      -9 (-6.21%)          10           9   -1 (-10.00%)
test_l4lb.bpf.linked3.o                  balancer_ingress              4760       2612  -2148 (-45.13%)         113         102   -11 (-9.73%)
test_l4lb_noinline.bpf.linked3.o         balancer_ingress              4845       4877     +32 (+0.66%)         219         221    +2 (+0.91%)
test_l4lb_noinline_dynptr.bpf.linked3.o  balancer_ingress              2072       2087     +15 (+0.72%)          97          98    +1 (+1.03%)
test_seg6_loop.bpf.linked3.o             __add_egr_x                  12440       9975  -2465 (-19.82%)         364         353   -11 (-3.02%)
test_tcp_hdr_options.bpf.linked3.o       estab                         2558       2572     +14 (+0.55%)         179         180    +1 (+0.56%)
test_xdp_dynptr.bpf.linked3.o            _xdp_tx_iptunnel               645        596     -49 (-7.60%)          26          24    -2 (-7.69%)
test_xdp_noinline.bpf.linked3.o          balancer_ingress_v6           3520       3516      -4 (-0.11%)         216         216    +0 (+0.00%)
xdp_synproxy_kern.bpf.linked3.o          syncookie_tc                 82661      81241   -1420 (-1.72%)        5073        5155   +82 (+1.62%)
xdp_synproxy_kern.bpf.linked3.o          syncookie_xdp                84964      82297   -2667 (-3.14%)        5130        5157   +27 (+0.53%)

META-INTERNAL
=============
Program                                 Insns (A)  Insns (B)  Insns      (DIFF)  States (A)  States (B)  States   (DIFF)
--------------------------------------  ---------  ---------  -----------------  ----------  ----------  ---------------
balancer_ingress                            27925      23608    -4317 (-15.46%)        1488        1482      -6 (-0.40%)
balancer_ingress                            31824      27546    -4278 (-13.44%)        1658        1652      -6 (-0.36%)
balancer_ingress                            32213      27935    -4278 (-13.28%)        1689        1683      -6 (-0.36%)
balancer_ingress                            32213      27935    -4278 (-13.28%)        1689        1683      -6 (-0.36%)
balancer_ingress                            31824      27546    -4278 (-13.44%)        1658        1652      -6 (-0.36%)
balancer_ingress                            38647      29562    -9085 (-23.51%)        2069        1835   -234 (-11.31%)
balancer_ingress                            38647      29562    -9085 (-23.51%)        2069        1835   -234 (-11.31%)
balancer_ingress                            40339      30792    -9547 (-23.67%)        2193        1934   -259 (-11.81%)
balancer_ingress                            37321      29055    -8266 (-22.15%)        1972        1795    -177 (-8.98%)
balancer_ingress                            38176      29753    -8423 (-22.06%)        2008        1831    -177 (-8.81%)
balancer_ingress                            29193      20910    -8283 (-28.37%)        1599        1422   -177 (-11.07%)
balancer_ingress                            30013      21452    -8561 (-28.52%)        1645        1447   -198 (-12.04%)
balancer_ingress                            28691      24290    -4401 (-15.34%)        1545        1531     -14 (-0.91%)
balancer_ingress                            34223      28965    -5258 (-15.36%)        1984        1875    -109 (-5.49%)
balancer_ingress                            35481      26158    -9323 (-26.28%)        2095        1806   -289 (-13.79%)
balancer_ingress                            35481      26158    -9323 (-26.28%)        2095        1806   -289 (-13.79%)
balancer_ingress                            35868      26455    -9413 (-26.24%)        2140        1827   -313 (-14.63%)
balancer_ingress                            35868      26455    -9413 (-26.24%)        2140        1827   -313 (-14.63%)
balancer_ingress                            35481      26158    -9323 (-26.28%)        2095        1806   -289 (-13.79%)
balancer_ingress                            35481      26158    -9323 (-26.28%)        2095        1806   -289 (-13.79%)
balancer_ingress                            34844      29485    -5359 (-15.38%)        2036        1918    -118 (-5.80%)
fbflow_egress                                3256       2652     -604 (-18.55%)         218         192    -26 (-11.93%)
fbflow_ingress                               1026        944       -82 (-7.99%)          70          63     -7 (-10.00%)
sslwall_tc_egress                            8424       7360    -1064 (-12.63%)         498         458     -40 (-8.03%)
syar_accept_protect                         15040       9539    -5501 (-36.58%)         364         220   -144 (-39.56%)
syar_connect_tcp_v6                         15036       9535    -5501 (-36.59%)         360         216   -144 (-40.00%)
syar_connect_udp_v4                         15039       9538    -5501 (-36.58%)         361         217   -144 (-39.89%)
syar_connect_connect4_protect4              24805      15833    -8972 (-36.17%)         756         480   -276 (-36.51%)
syar_lsm_file_open                         167772     151813    -15959 (-9.51%)        1836        1667    -169 (-9.20%)
syar_namespace_create_new                   14805       9304    -5501 (-37.16%)         353         209   -144 (-40.79%)
syar_python3_detect                         17531      12030    -5501 (-31.38%)         391         247   -144 (-36.83%)
syar_ssh_post_fork                          16412      10911    -5501 (-33.52%)         405         261   -144 (-35.56%)
syar_enter_execve                           14728       9227    -5501 (-37.35%)         345         201   -144 (-41.74%)
syar_enter_execveat                         14728       9227    -5501 (-37.35%)         345         201   -144 (-41.74%)
syar_exit_execve                            16622      11121    -5501 (-33.09%)         376         232   -144 (-38.30%)
syar_exit_execveat                          16622      11121    -5501 (-33.09%)         376         232   -144 (-38.30%)
syar_syscalls_kill                          15288       9787    -5501 (-35.98%)         398         254   -144 (-36.18%)
syar_task_enter_pivot_root                  14898       9397    -5501 (-36.92%)         357         213   -144 (-40.34%)
syar_syscalls_setreuid                      16678      11177    -5501 (-32.98%)         429         285   -144 (-33.57%)
syar_syscalls_setuid                        16678      11177    -5501 (-32.98%)         429         285   -144 (-33.57%)
syar_syscalls_process_vm_readv              14959       9458    -5501 (-36.77%)         364         220   -144 (-39.56%)
syar_syscalls_process_vm_writev             15757      10256    -5501 (-34.91%)         390         246   -144 (-36.92%)
do_uprobe                                   15519      10018    -5501 (-35.45%)         373         229   -144 (-38.61%)
edgewall                                   179715      55783  -123932 (-68.96%)       12607        3999  -8608 (-68.28%)
bictcp_state                                 7570       4131    -3439 (-45.43%)         496         269   -227 (-45.77%)
cubictcp_state                               7570       4131    -3439 (-45.43%)         496         269   -227 (-45.77%)
tcp_rate_skb_delivered                        447        272     -175 (-39.15%)          29          18    -11 (-37.93%)
kprobe__bbr_set_state                        4566       2615    -1951 (-42.73%)         209         124    -85 (-40.67%)
kprobe__bictcp_state                         4566       2615    -1951 (-42.73%)         209         124    -85 (-40.67%)
inet_sock_set_state                          1501       1337     -164 (-10.93%)          93          85      -8 (-8.60%)
tcp_retransmit_skb                           1145        981     -164 (-14.32%)          67          59     -8 (-11.94%)
tcp_retransmit_synack                        1183        951     -232 (-19.61%)          67          55    -12 (-17.91%)
bpf_tcptuner                                 1459       1187     -272 (-18.64%)          99          80    -19 (-19.19%)
tw_egress                                     801        776       -25 (-3.12%)          69          66      -3 (-4.35%)
tw_ingress                                    795        770       -25 (-3.14%)          69          66      -3 (-4.35%)
ttls_tc_ingress                             19025      19383      +358 (+1.88%)         470         465      -5 (-1.06%)
ttls_nat_egress                               490        299     -191 (-38.98%)          33          20    -13 (-39.39%)
ttls_nat_ingress                              448        285     -163 (-36.38%)          32          21    -11 (-34.38%)
tw_twfw_egress                             511127     212071  -299056 (-58.51%)       16733        8504  -8229 (-49.18%)
tw_twfw_ingress                            500095     212069  -288026 (-57.59%)       16223        8504  -7719 (-47.58%)
tw_twfw_tc_eg                              511113     212064  -299049 (-58.51%)       16732        8504  -8228 (-49.18%)
tw_twfw_tc_in                              500095     212069  -288026 (-57.59%)       16223        8504  -7719 (-47.58%)
tw_twfw_egress                              12632      12435      -197 (-1.56%)         276         260     -16 (-5.80%)
tw_twfw_ingress                             12631      12454      -177 (-1.40%)         278         261     -17 (-6.12%)
tw_twfw_tc_eg                               12595      12435      -160 (-1.27%)         274         259     -15 (-5.47%)
tw_twfw_tc_in                               12631      12454      -177 (-1.40%)         278         261     -17 (-6.12%)
tw_xdp_dump                                   266        209      -57 (-21.43%)           9           8     -1 (-11.11%)

CILIUM
=========
File           Program                           Insns (A)  Insns (B)  Insns     (DIFF)  States (A)  States (B)  States  (DIFF)
-------------  --------------------------------  ---------  ---------  ----------------  ----------  ----------  --------------
bpf_host.o     cil_to_netdev                          6047       4578   -1469 (-24.29%)         362         249  -113 (-31.22%)
bpf_host.o     handle_lxc_traffic                     2227       1585    -642 (-28.83%)         156         103   -53 (-33.97%)
bpf_host.o     tail_handle_ipv4_from_netdev           2244       1458    -786 (-35.03%)         163         106   -57 (-34.97%)
bpf_host.o     tail_handle_nat_fwd_ipv4              21022      10479  -10543 (-50.15%)        1289         670  -619 (-48.02%)
bpf_host.o     tail_handle_nat_fwd_ipv6              15433      11375   -4058 (-26.29%)         905         643  -262 (-28.95%)
bpf_host.o     tail_ipv4_host_policy_ingress          2219       1367    -852 (-38.40%)         161          96   -65 (-40.37%)
bpf_host.o     tail_nodeport_nat_egress_ipv4         22460      19862   -2598 (-11.57%)        1469        1293  -176 (-11.98%)
bpf_host.o     tail_nodeport_nat_ingress_ipv4         5526       3534   -1992 (-36.05%)         366         243  -123 (-33.61%)
bpf_host.o     tail_nodeport_nat_ingress_ipv6         5132       4256    -876 (-17.07%)         241         219    -22 (-9.13%)
bpf_host.o     tail_nodeport_nat_ipv6_egress          3702       3542     -160 (-4.32%)         215         205    -10 (-4.65%)
bpf_lxc.o      tail_handle_nat_fwd_ipv4              21022      10479  -10543 (-50.15%)        1289         670  -619 (-48.02%)
bpf_lxc.o      tail_handle_nat_fwd_ipv6              15433      11375   -4058 (-26.29%)         905         643  -262 (-28.95%)
bpf_lxc.o      tail_ipv4_ct_egress                    5073       3374   -1699 (-33.49%)         262         172   -90 (-34.35%)
bpf_lxc.o      tail_ipv4_ct_ingress                   5093       3385   -1708 (-33.54%)         262         172   -90 (-34.35%)
bpf_lxc.o      tail_ipv4_ct_ingress_policy_only       5093       3385   -1708 (-33.54%)         262         172   -90 (-34.35%)
bpf_lxc.o      tail_ipv6_ct_egress                    4593       3878    -715 (-15.57%)         194         151   -43 (-22.16%)
bpf_lxc.o      tail_ipv6_ct_ingress                   4606       3891    -715 (-15.52%)         194         151   -43 (-22.16%)
bpf_lxc.o      tail_ipv6_ct_ingress_policy_only       4606       3891    -715 (-15.52%)         194         151   -43 (-22.16%)
bpf_lxc.o      tail_nodeport_nat_ingress_ipv4         5526       3534   -1992 (-36.05%)         366         243  -123 (-33.61%)
bpf_lxc.o      tail_nodeport_nat_ingress_ipv6         5132       4256    -876 (-17.07%)         241         219    -22 (-9.13%)
bpf_overlay.o  tail_handle_nat_fwd_ipv4              20524      10114  -10410 (-50.72%)        1271         638  -633 (-49.80%)
bpf_overlay.o  tail_nodeport_nat_egress_ipv4         22718      19490   -3228 (-14.21%)        1475        1275  -200 (-13.56%)
bpf_overlay.o  tail_nodeport_nat_ingress_ipv4         5526       3534   -1992 (-36.05%)         366         243  -123 (-33.61%)
bpf_overlay.o  tail_nodeport_nat_ingress_ipv6         5132       4256    -876 (-17.07%)         241         219    -22 (-9.13%)
bpf_overlay.o  tail_nodeport_nat_ipv6_egress          3638       3548      -90 (-2.47%)         209         203     -6 (-2.87%)
bpf_overlay.o  tail_rev_nodeport_lb4                  4368       3820    -548 (-12.55%)         248         215   -33 (-13.31%)
bpf_overlay.o  tail_rev_nodeport_lb6                  2867       2428    -439 (-15.31%)         167         140   -27 (-16.17%)
bpf_sock.o     cil_sock6_connect                      1718       1703      -15 (-0.87%)         100          99     -1 (-1.00%)
bpf_xdp.o      tail_handle_nat_fwd_ipv4              12917      12443     -474 (-3.67%)         875         849    -26 (-2.97%)
bpf_xdp.o      tail_handle_nat_fwd_ipv6              13515      13264     -251 (-1.86%)         715         702    -13 (-1.82%)
bpf_xdp.o      tail_lb_ipv4                          39492      36367    -3125 (-7.91%)        2430        2251   -179 (-7.37%)
bpf_xdp.o      tail_lb_ipv6                          80441      78058    -2383 (-2.96%)        3647        3523   -124 (-3.40%)
bpf_xdp.o      tail_nodeport_ipv6_dsr                 1038        901    -137 (-13.20%)          61          55     -6 (-9.84%)
bpf_xdp.o      tail_nodeport_nat_egress_ipv4         13027      12096     -931 (-7.15%)         868         809    -59 (-6.80%)
bpf_xdp.o      tail_nodeport_nat_ingress_ipv4         7617       5900   -1717 (-22.54%)         522         413  -109 (-20.88%)
bpf_xdp.o      tail_nodeport_nat_ingress_ipv6         7575       7395     -180 (-2.38%)         383         374     -9 (-2.35%)
bpf_xdp.o      tail_rev_nodeport_lb4                  6808       6739      -69 (-1.01%)         403         396     -7 (-1.74%)
bpf_xdp.o      tail_rev_nodeport_lb6                 16173      15847     -326 (-2.02%)        1010         990    -20 (-1.98%)

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231205184248.1502704-9-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-05 13:40:21 -08:00
Andrii Nakryiko
e322f0bcb8 bpf: preserve constant zero when doing partial register restore
Similar to special handling of STACK_ZERO, when reading 1/2/4 bytes from
stack from slot that has register spilled into it and that register has
a constant value zero, preserve that zero and mark spilled register as
precise for that. This makes spilled const zero register and STACK_ZERO
cases equivalent in their behavior.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231205184248.1502704-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-05 13:40:21 -08:00
Andrii Nakryiko
eaf18febd6 bpf: preserve STACK_ZERO slots on partial reg spills
Instead of always forcing STACK_ZERO slots to STACK_MISC, preserve it in
situations where this is possible. E.g., when spilling register as
1/2/4-byte subslots on the stack, all the remaining bytes in the stack
slot do not automatically become unknown. If we knew they contained
zeroes, we can preserve those STACK_ZERO markers.

Add a helper mark_stack_slot_misc(), similar to scrub_spilled_slot(),
but that doesn't overwrite either STACK_INVALID nor STACK_ZERO. Note
that we need to take into account possibility of being in unprivileged
mode, in which case STACK_INVALID is forced to STACK_MISC for correctness,
as treating STACK_INVALID as equivalent STACK_MISC is only enabled in
privileged mode.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231205184248.1502704-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-05 13:40:20 -08:00
Andrii Nakryiko
ab125ed3ec bpf: fix check for attempt to corrupt spilled pointer
When register is spilled onto a stack as a 1/2/4-byte register, we set
slot_type[BPF_REG_SIZE - 1] (plus potentially few more below it,
depending on actual spill size). So to check if some stack slot has
spilled register we need to consult slot_type[7], not slot_type[0].

To avoid the need to remember and double-check this in the future, just
use is_spilled_reg() helper.

Fixes: 27113c59b6 ("bpf: Check the other end of slot_type for STACK_SPILL")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231205184248.1502704-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-05 13:40:20 -08:00
Andrii Nakryiko
41f6f64e69 bpf: support non-r10 register spill/fill to/from stack in precision tracking
Use instruction (jump) history to record instructions that performed
register spill/fill to/from stack, regardless if this was done through
read-only r10 register, or any other register after copying r10 into it
*and* potentially adjusting offset.

To make this work reliably, we push extra per-instruction flags into
instruction history, encoding stack slot index (spi) and stack frame
number in extra 10 bit flags we take away from prev_idx in instruction
history. We don't touch idx field for maximum performance, as it's
checked most frequently during backtracking.

This change removes basically the last remaining practical limitation of
precision backtracking logic in BPF verifier. It fixes known
deficiencies, but also opens up new opportunities to reduce number of
verified states, explored in the subsequent patches.

There are only three differences in selftests' BPF object files
according to veristat, all in the positive direction (less states).

File                                    Program        Insns (A)  Insns (B)  Insns  (DIFF)  States (A)  States (B)  States (DIFF)
--------------------------------------  -------------  ---------  ---------  -------------  ----------  ----------  -------------
test_cls_redirect_dynptr.bpf.linked3.o  cls_redirect        2987       2864  -123 (-4.12%)         240         231    -9 (-3.75%)
xdp_synproxy_kern.bpf.linked3.o         syncookie_tc       82848      82661  -187 (-0.23%)        5107        5073   -34 (-0.67%)
xdp_synproxy_kern.bpf.linked3.o         syncookie_xdp      85116      84964  -152 (-0.18%)        5162        5130   -32 (-0.62%)

Note, I avoided renaming jmp_history to more generic insn_hist to
minimize number of lines changed and potential merge conflicts between
bpf and bpf-next trees.

Notice also cur_hist_entry pointer reset to NULL at the beginning of
instruction verification loop. This pointer avoids the problem of
relying on last jump history entry's insn_idx to determine whether we
already have entry for current instruction or not. It can happen that we
added jump history entry because current instruction is_jmp_point(), but
also we need to add instruction flags for stack access. In this case, we
don't want to entries, so we need to reuse last added entry, if it is
present.

Relying on insn_idx comparison has the same ambiguity problem as the one
that was fixed recently in [0], so we avoid that.

  [0] https://patchwork.kernel.org/project/netdevbpf/patch/20231110002638.4168352-3-andrii@kernel.org/

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Reported-by: Tao Lyu <tao.lyu@epfl.ch>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231205184248.1502704-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-05 13:40:20 -08:00
Hou Tao
af66bfd3c8 bpf: Optimize the free of inner map
When removing the inner map from the outer map, the inner map will be
freed after one RCU grace period and one RCU tasks trace grace
period, so it is certain that the bpf program, which may access the
inner map, has exited before the inner map is freed.

However there is no need to wait for one RCU tasks trace grace period if
the outer map is only accessed by non-sleepable program. So adding
sleepable_refcnt in bpf_map and increasing sleepable_refcnt when adding
the outer map into env->used_maps for sleepable program. Although the
max number of bpf program is INT_MAX - 1, the number of bpf programs
which are being loaded may be greater than INT_MAX, so using atomic64_t
instead of atomic_t for sleepable_refcnt. When removing the inner map
from the outer map, using sleepable_refcnt to decide whether or not a
RCU tasks trace grace period is needed before freeing the inner map.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231204140425.1480317-6-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-04 17:50:26 -08:00
Hou Tao
8766733641 bpf: Defer the free of inner map when necessary
When updating or deleting an inner map in map array or map htab, the map
may still be accessed by non-sleepable program or sleepable program.
However bpf_map_fd_put_ptr() decreases the ref-counter of the inner map
directly through bpf_map_put(), if the ref-counter is the last one
(which is true for most cases), the inner map will be freed by
ops->map_free() in a kworker. But for now, most .map_free() callbacks
don't use synchronize_rcu() or its variants to wait for the elapse of a
RCU grace period, so after the invocation of ops->map_free completes,
the bpf program which is accessing the inner map may incur
use-after-free problem.

Fix the free of inner map by invoking bpf_map_free_deferred() after both
one RCU grace period and one tasks trace RCU grace period if the inner
map has been removed from the outer map before. The deferment is
accomplished by using call_rcu() or call_rcu_tasks_trace() when
releasing the last ref-counter of bpf map. The newly-added rcu_head
field in bpf_map shares the same storage space with work field to
reduce the size of bpf_map.

Fixes: bba1dc0b55 ("bpf: Remove redundant synchronize_rcu.")
Fixes: 638e4b825d ("bpf: Allows per-cpu maps and map-in-map in sleepable programs")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231204140425.1480317-5-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-04 17:50:26 -08:00
Hou Tao
79d93b3c6f bpf: Set need_defer as false when clearing fd array during map free
Both map deletion operation, map release and map free operation use
fd_array_map_delete_elem() to remove the element from fd array and
need_defer is always true in fd_array_map_delete_elem(). For the map
deletion operation and map release operation, need_defer=true is
necessary, because the bpf program, which accesses the element in fd
array, may still alive. However for map free operation, it is certain
that the bpf program which owns the fd array has already been exited, so
setting need_defer as false is appropriate for map free operation.

So fix it by adding need_defer parameter to bpf_fd_array_map_clear() and
adding a new helper __fd_array_map_delete_elem() to handle the map
deletion, map release and map free operations correspondingly.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231204140425.1480317-4-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-04 17:50:26 -08:00
Hou Tao
20c20bd11a bpf: Add map and need_defer parameters to .map_fd_put_ptr()
map is the pointer of outer map, and need_defer needs some explanation.
need_defer tells the implementation to defer the reference release of
the passed element and ensure that the element is still alive before
the bpf program, which may manipulate it, exits.

The following three cases will invoke map_fd_put_ptr() and different
need_defer values will be passed to these callers:

1) release the reference of the old element in the map during map update
   or map deletion. The release must be deferred, otherwise the bpf
   program may incur use-after-free problem, so need_defer needs to be
   true.
2) release the reference of the to-be-added element in the error path of
   map update. The to-be-added element is not visible to any bpf
   program, so it is OK to pass false for need_defer parameter.
3) release the references of all elements in the map during map release.
   Any bpf program which has access to the map must have been exited and
   released, so need_defer=false will be OK.

These two parameters will be used by the following patches to fix the
potential use-after-free problem for map-in-map.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231204140425.1480317-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-04 17:50:26 -08:00
Hou Tao
169410eba2 bpf: Check rcu_read_lock_trace_held() before calling bpf map helpers
These three bpf_map_{lookup,update,delete}_elem() helpers are also
available for sleepable bpf program, so add the corresponding lock
assertion for sleepable bpf program, otherwise the following warning
will be reported when a sleepable bpf program manipulates bpf map under
interpreter mode (aka bpf_jit_enable=0):

  WARNING: CPU: 3 PID: 4985 at kernel/bpf/helpers.c:40 ......
  CPU: 3 PID: 4985 Comm: test_progs Not tainted 6.6.0+ #2
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ......
  RIP: 0010:bpf_map_lookup_elem+0x54/0x60
  ......
  Call Trace:
   <TASK>
   ? __warn+0xa5/0x240
   ? bpf_map_lookup_elem+0x54/0x60
   ? report_bug+0x1ba/0x1f0
   ? handle_bug+0x40/0x80
   ? exc_invalid_op+0x18/0x50
   ? asm_exc_invalid_op+0x1b/0x20
   ? __pfx_bpf_map_lookup_elem+0x10/0x10
   ? rcu_lockdep_current_cpu_online+0x65/0xb0
   ? rcu_is_watching+0x23/0x50
   ? bpf_map_lookup_elem+0x54/0x60
   ? __pfx_bpf_map_lookup_elem+0x10/0x10
   ___bpf_prog_run+0x513/0x3b70
   __bpf_prog_run32+0x9d/0xd0
   ? __bpf_prog_enter_sleepable_recur+0xad/0x120
   ? __bpf_prog_enter_sleepable_recur+0x3e/0x120
   bpf_trampoline_6442580665+0x4d/0x1000
   __x64_sys_getpgid+0x5/0x30
   ? do_syscall_64+0x36/0xb0
   entry_SYSCALL_64_after_hwframe+0x6e/0x76
   </TASK>

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231204140425.1480317-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-04 17:50:26 -08:00
Andrei Matei
5bd90cdc65 bpf: Minor logging improvement
One place where we were logging a register was only logging the variable
part, not also the fixed part.

Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231204011248.2040084-1-andreimatei1@gmail.com
2023-12-04 15:57:27 +01:00
Andrii Nakryiko
81eff2e364 bpf: simplify tnum output if a fully known constant
Emit tnum representation as just a constant if all bits are known.
Use decimal-vs-hex logic to determine exact format of emitted
constant value, just like it's done for register range values.
For that move tnum_strn() to kernel/bpf/log.c to reuse decimal-vs-hex
determination logic and constants.

Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231202175705.885270-12-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-02 11:36:51 -08:00
Andrii Nakryiko
eabe518de5 bpf: enforce precision of R0 on program/async callback return
Given we enforce a valid range for program and async callback return
value, we must mark R0 as precise to avoid incorrect state pruning.

Fixes: b5dc0163d8 ("bpf: precise scalar_value tracking")
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231202175705.885270-9-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-02 11:36:51 -08:00
Andrii Nakryiko
0ef24c8dfa bpf: unify async callback and program retval checks
Use common logic to verify program return values and async callback
return values. This allows to avoid duplication of any extra steps
necessary, like precision marking, which will be added in the next
patch.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231202175705.885270-8-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-02 11:36:50 -08:00
Andrii Nakryiko
c871d0e00f bpf: enforce precise retval range on program exit
Similarly to subprog/callback logic, enforce return value of BPF program
using more precise smin/smax range.

We need to adjust a bunch of tests due to a changed format of an error
message.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231202175705.885270-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-02 11:36:50 -08:00
Andrii Nakryiko
8fa4ecd49b bpf: enforce exact retval range on subprog/callback exit
Instead of relying on potentially imprecise tnum representation of
expected return value range for callbacks and subprogs, validate that
smin/smax range satisfy exact expected range of return values.

E.g., if callback would need to return [0, 2] range, tnum can't
represent this precisely and instead will allow [0, 3] range. By
checking smin/smax range, we can make sure that subprog/callback indeed
returns only valid [0, 2] range.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231202175705.885270-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-02 11:36:50 -08:00
Andrii Nakryiko
0acd03a5bd bpf: enforce precision of R0 on callback return
Given verifier checks actual value, r0 has to be precise, so we need to
propagate precision properly. r0 also has to be marked as read,
otherwise subsequent state comparisons will ignore such register as
unimportant and precision won't really help here.

Fixes: 69c087ba62 ("bpf: Add bpf_for_each_map_elem() helper")
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231202175705.885270-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-02 11:36:50 -08:00
Andrii Nakryiko
5fad52bee3 bpf: provide correct register name for exception callback retval check
bpf_throw() is checking R1, so let's report R1 in the log.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231202175705.885270-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-02 11:36:50 -08:00
Yonghong Song
dfce9cb314 bpf: Fix a verifier bug due to incorrect branch offset comparison with cpu=v4
Bpf cpu=v4 support is introduced in [1] and Commit 4cd58e9af8
("bpf: Support new 32bit offset jmp instruction") added support for new
32bit offset jmp instruction. Unfortunately, in function
bpf_adj_delta_to_off(), for new branch insn with 32bit offset, the offset
(plus/minor a small delta) compares to 16-bit offset bound
[S16_MIN, S16_MAX], which caused the following verification failure:
  $ ./test_progs-cpuv4 -t verif_scale_pyperf180
  ...
  insn 10 cannot be patched due to 16-bit range
  ...
  libbpf: failed to load object 'pyperf180.bpf.o'
  scale_test:FAIL:expect_success unexpected error: -12 (errno 12)
  #405     verif_scale_pyperf180:FAIL

Note that due to recent llvm18 development, the patch [2] (already applied
in bpf-next) needs to be applied to bpf tree for testing purpose.

The fix is rather simple. For 32bit offset branch insn, the adjusted
offset compares to [S32_MIN, S32_MAX] and then verification succeeded.

  [1] https://lore.kernel.org/all/20230728011143.3710005-1-yonghong.song@linux.dev
  [2] https://lore.kernel.org/bpf/20231110193644.3130906-1-yonghong.song@linux.dev

Fixes: 4cd58e9af8 ("bpf: Support new 32bit offset jmp instruction")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231201024640.3417057-1-yonghong.song@linux.dev
2023-12-01 15:41:50 -08:00
Jakub Kicinski
753c8608f3 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZWiCPAAKCRDbK58LschI
 g4djAQC1FdqCRIFkhbiIRNHTgHjnfQShELQbd9ofJqzylLqmmgD+JI1E7D9SXagm
 pIXQ26EGmq8/VcCT3VLncA8EsC76Gg4=
 =Xowm
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-11-30

We've added 30 non-merge commits during the last 7 day(s) which contain
a total of 58 files changed, 1598 insertions(+), 154 deletions(-).

The main changes are:

1) Add initial TX metadata implementation for AF_XDP with support in mlx5
   and stmmac drivers. Two types of offloads are supported right now, that
   is, TX timestamp and TX checksum offload, from Stanislav Fomichev with
   stmmac implementation from Song Yoong Siang.

2) Change BPF verifier logic to validate global subprograms lazily instead
   of unconditionally before the main program, so they can be guarded using
   BPF CO-RE techniques, from Andrii Nakryiko.

3) Add BPF link_info support for uprobe multi link along with bpftool
   integration for the latter, from Jiri Olsa.

4) Use pkg-config in BPF selftests to determine ld flags which is
   in particular needed for linking statically, from Akihiko Odaki.

5) Fix a few BPF selftest failures to adapt to the upcoming LLVM18,
   from Yonghong Song.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (30 commits)
  bpf/tests: Remove duplicate JSGT tests
  selftests/bpf: Add TX side to xdp_hw_metadata
  selftests/bpf: Convert xdp_hw_metadata to XDP_USE_NEED_WAKEUP
  selftests/bpf: Add TX side to xdp_metadata
  selftests/bpf: Add csum helpers
  selftests/xsk: Support tx_metadata_len
  xsk: Add option to calculate TX checksum in SW
  xsk: Validate xsk_tx_metadata flags
  xsk: Document tx_metadata_len layout
  net: stmmac: Add Tx HWTS support to XDP ZC
  net/mlx5e: Implement AF_XDP TX timestamp and checksum offload
  tools: ynl: Print xsk-features from the sample
  xsk: Add TX timestamp and TX checksum offload support
  xsk: Support tx_metadata_len
  selftests/bpf: Use pkg-config for libelf
  selftests/bpf: Override PKG_CONFIG for static builds
  selftests/bpf: Choose pkg-config for the target
  bpftool: Add support to display uprobe_multi links
  selftests/bpf: Add link_info test for uprobe_multi link
  selftests/bpf: Use bpf_link__destroy in fill_link_info tests
  ...
====================

Conflicts:

Documentation/netlink/specs/netdev.yaml:
  839ff60df3 ("net: page_pool: add nlspec for basic access to page pools")
  48eb03dd26 ("xsk: Add TX timestamp and TX checksum offload support")
https://lore.kernel.org/all/20231201094705.1ee3cab8@canb.auug.org.au/

While at it also regen, tree is dirty after:
  48eb03dd26 ("xsk: Add TX timestamp and TX checksum offload support")
looks like code wasn't re-rendered after "render-max" was removed.

Link: https://lore.kernel.org/r/20231130145708.32573-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-30 16:58:42 -08:00
Jakub Kicinski
975f2d73a9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-30 16:11:19 -08:00
Hou Tao
75a442581d bpf: Add missed allocation hint for bpf_mem_cache_alloc_flags()
bpf_mem_cache_alloc_flags() may call __alloc() directly when there is no
free object in free list, but it doesn't initialize the allocation hint
for the returned pointer. It may lead to bad memory dereference when
freeing the pointer, so fix it by initializing the allocation hint.

Fixes: 822fb26bdb ("bpf: Add a hint to allocated objects.")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231111043821.2258513-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-26 18:00:26 -08:00
Andrii Nakryiko
2afae08c9d bpf: Validate global subprogs lazily
Slightly change BPF verifier logic around eagerness and order of global
subprog validation. Instead of going over every global subprog eagerly
and validating it before main (entry) BPF program is verified, turn it
around. Validate main program first, mark subprogs that were called from
main program for later verification, but otherwise assume it is valid.
Afterwards, go over marked global subprogs and validate those,
potentially marking some more global functions as being called. Continue
this process until all (transitively) callable global subprogs are
validated. It's a BFS traversal at its heart and will always converge.

This is an important change because it allows to feature-gate some
subprograms that might not be verifiable on some older kernel, depending
on supported set of features.

E.g., at some point, global functions were allowed to accept a pointer
to memory, which size is identified by user-provided type.
Unfortunately, older kernels don't support this feature. With BPF CO-RE
approach, the natural way would be to still compile BPF object file once
and guard calls to this global subprog with some CO-RE check or using
.rodata variables. That's what people do to guard usage of new helpers
or kfuncs, and any other new BPF-side feature that might be missing on
old kernels.

That's currently impossible to do with global subprogs, unfortunately,
because they are eagerly and unconditionally validated. This patch set
aims to change this, so that in the future when global funcs gain new
features, those can be guarded using BPF CO-RE techniques in the same
fashion as any other new kernel feature.

Two selftests had to be adjusted in sync with these changes.

test_global_func12 relied on eager global subprog validation failing
before main program failure is detected (unknown return value). Fix by
making sure that main program is always valid.

verifier_subprog_precision's parent_stack_slot_precise subtest relied on
verifier checkpointing heuristic to do a checkpoint at instruction #5,
but that's no longer true because we don't have enough jumps validated
before reaching insn #5 due to global subprogs being validated later.

Other than that, no changes, as one would expect.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231124035937.403208-3-andrii@kernel.org
2023-11-24 10:40:06 +01:00
Andrii Nakryiko
491dd8edec bpf: Emit global subprog name in verifier logs
We have the name, instead of emitting just func#N to identify global
subprog, augment verifier log messages with actual function name to make
it more user-friendly.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231124035937.403208-2-andrii@kernel.org
2023-11-24 10:40:06 +01:00
Jakub Kicinski
45c226dde7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

drivers/net/ethernet/intel/ice/ice_main.c
  c9663f79cd ("ice: adjust switchdev rebuild path")
  7758017911 ("ice: restore timestamp configuration after device reset")
https://lore.kernel.org/all/20231121211259.3348630-1-anthony.l.nguyen@intel.com/

Adjacent changes:

kernel/bpf/verifier.c
  bb124da69c ("bpf: keep track of max number of bpf_loop callback iterations")
  5f99f312bd ("bpf: add register bounds sanity checks and sanitization")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-23 12:20:58 -08:00
Jakub Kicinski
53475287da bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZV0kjgAKCRDbK58LschI
 gy0EAP9XwncW2OhO72DpITluFzvWPgB0N97OANKBXjzKJrRAlQD/aUe9nlvBQuad
 WsbMKLeC4wvI2X/4PEIR4ukbuZ3ypAA=
 =LMVg
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-11-21

We've added 85 non-merge commits during the last 12 day(s) which contain
a total of 63 files changed, 4464 insertions(+), 1484 deletions(-).

The main changes are:

1) Huge batch of verifier changes to improve BPF register bounds logic
   and range support along with a large test suite, and verifier log
   improvements, all from Andrii Nakryiko.

2) Add a new kfunc which acquires the associated cgroup of a task within
   a specific cgroup v1 hierarchy where the latter is identified by its id,
   from Yafang Shao.

3) Extend verifier to allow bpf_refcount_acquire() of a map value field
   obtained via direct load which is a use-case needed in sched_ext,
   from Dave Marchevsky.

4) Fix bpf_get_task_stack() helper to add the correct crosstask check
   for the get_perf_callchain(), from Jordan Rome.

5) Fix BPF task_iter internals where lockless usage of next_thread()
   was wrong. The rework also simplifies the code, from Oleg Nesterov.

6) Fix uninitialized tail padding via LIBBPF_OPTS_RESET, and another
   fix for certain BPF UAPI structs to fix verifier failures seen
   in bpf_dynptr usage, from Yonghong Song.

7) Add BPF selftest fixes for map_percpu_stats flakes due to per-CPU BPF
   memory allocator not being able to allocate per-CPU pointer successfully,
   from Hou Tao.

8) Add prep work around dynptr and string handling for kfuncs which
   is later going to be used by file verification via BPF LSM and fsverity,
   from Song Liu.

9) Improve BPF selftests to update multiple prog_tests to use ASSERT_*
   macros, from Yuran Pereira.

10) Optimize LPM trie lookup to check prefixlen before walking the trie,
    from Florian Lehner.

11) Consolidate virtio/9p configs from BPF selftests in config.vm file
    given they are needed consistently across archs, from Manu Bretelle.

12) Small BPF verifier refactor to remove register_is_const(),
    from Shung-Hsi Yu.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (85 commits)
  selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in vmlinux
  selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in bpf_obj_id
  selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in bind_perm
  selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in bpf_tcp_ca
  selftests/bpf: reduce verboseness of reg_bounds selftest logs
  bpf: bpf_iter_task_next: use next_task(kit->task) rather than next_task(kit->pos)
  bpf: bpf_iter_task_next: use __next_thread() rather than next_thread()
  bpf: task_group_seq_get_next: use __next_thread() rather than next_thread()
  bpf: emit frameno for PTR_TO_STACK regs if it differs from current one
  bpf: smarter verifier log number printing logic
  bpf: omit default off=0 and imm=0 in register state log
  bpf: emit map name in register state if applicable and available
  bpf: print spilled register state in stack slot
  bpf: extract register state printing
  bpf: move verifier state printing code to kernel/bpf/log.c
  bpf: move verbose_linfo() into kernel/bpf/log.c
  bpf: rename BPF_F_TEST_SANITY_STRICT to BPF_F_TEST_REG_INVARIANTS
  bpf: Remove test for MOVSX32 with offset=32
  selftests/bpf: add iter test requiring range x range logic
  veristat: add ability to set BPF_F_TEST_SANITY_STRICT flag with -r flag
  ...
====================

Link: https://lore.kernel.org/r/20231122000500.28126-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-21 17:53:20 -08:00
Eduard Zingerman
bb124da69c bpf: keep track of max number of bpf_loop callback iterations
In some cases verifier can't infer convergence of the bpf_loop()
iteration. E.g. for the following program:

    static int cb(__u32 idx, struct num_context* ctx)
    {
        ctx->i++;
        return 0;
    }

    SEC("?raw_tp")
    int prog(void *_)
    {
        struct num_context ctx = { .i = 0 };
        __u8 choice_arr[2] = { 0, 1 };

        bpf_loop(2, cb, &ctx, 0);
        return choice_arr[ctx.i];
    }

Each 'cb' simulation would eventually return to 'prog' and reach
'return choice_arr[ctx.i]' statement. At which point ctx.i would be
marked precise, thus forcing verifier to track multitude of separate
states with {.i=0}, {.i=1}, ... at bpf_loop() callback entry.

This commit allows "brute force" handling for such cases by limiting
number of callback body simulations using 'umax' value of the first
bpf_loop() parameter.

For this, extend bpf_func_state with 'callback_depth' field.
Increment this field when callback visiting state is pushed to states
traversal stack. For frame #N it's 'callback_depth' field counts how
many times callback with frame depth N+1 had been executed.
Use bpf_func_state specifically to allow independent tracking of
callback depths when multiple nested bpf_loop() calls are present.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-11-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-20 18:36:40 -08:00
Eduard Zingerman
cafe2c2150 bpf: widening for callback iterators
Callbacks are similar to open coded iterators, so add imprecise
widening logic for callback body processing. This makes callback based
loops behave identically to open coded iterators, e.g. allowing to
verify programs like below:

  struct ctx { u32 i; };
  int cb(u32 idx, struct ctx* ctx)
  {
          ++ctx->i;
          return 0;
  }
  ...
  struct ctx ctx = { .i = 0 };
  bpf_loop(100, cb, &ctx, 0);
  ...

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-9-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-20 18:36:40 -08:00
Eduard Zingerman
ab5cfac139 bpf: verify callbacks as if they are called unknown number of times
Prior to this patch callbacks were handled as regular function calls,
execution of callback body was modeled exactly once.
This patch updates callbacks handling logic as follows:
- introduces a function push_callback_call() that schedules callback
  body verification in env->head stack;
- updates prepare_func_exit() to reschedule callback body verification
  upon BPF_EXIT;
- as calls to bpf_*_iter_next(), calls to callback invoking functions
  are marked as checkpoints;
- is_state_visited() is updated to stop callback based iteration when
  some identical parent state is found.

Paths with callback function invoked zero times are now verified first,
which leads to necessity to modify some selftests:
- the following negative tests required adding release/unlock/drop
  calls to avoid previously masked unrelated error reports:
  - cb_refs.c:underflow_prog
  - exceptions_fail.c:reject_rbtree_add_throw
  - exceptions_fail.c:reject_with_cp_reference
- the following precision tracking selftests needed change in expected
  log trace:
  - verifier_subprog_precision.c:callback_result_precise
    (note: r0 precision is no longer propagated inside callback and
           I think this is a correct behavior)
  - verifier_subprog_precision.c:parent_callee_saved_reg_precise_with_callback
  - verifier_subprog_precision.c:parent_stack_slot_precise_with_callback

Reported-by: Andrew Werner <awerner32@gmail.com>
Closes: https://lore.kernel.org/bpf/CA+vRuzPChFNXmouzGG+wsy=6eMcfr1mFG0F3g7rbg-sedGKW3w@mail.gmail.com/
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-7-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-20 18:35:44 -08:00
Eduard Zingerman
58124a98cb bpf: extract setup_func_entry() utility function
Move code for simulated stack frame creation to a separate utility
function. This function would be used in the follow-up change for
callbacks handling.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-6-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-20 18:33:35 -08:00
Eduard Zingerman
683b96f960 bpf: extract __check_reg_arg() utility function
Split check_reg_arg() into two utility functions:
- check_reg_arg() operating on registers from current verifier state;
- __check_reg_arg() operating on a specific set of registers passed as
  a parameter;

The __check_reg_arg() function would be used by a follow-up change for
callbacks handling.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-5-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-20 18:33:35 -08:00
Oleg Nesterov
ac8148d957 bpf: bpf_iter_task_next: use next_task(kit->task) rather than next_task(kit->pos)
This looks more clear and simplifies the code. While at it, remove the
unnecessary initialization of pos/task at the start of bpf_iter_task_new().

Note that we can even kill kit->task, we can just use pos->group_leader,
but I don't understand the BUILD_BUG_ON() checks in bpf_iter_task_new().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231114163239.GA903@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-19 11:43:44 -08:00
Oleg Nesterov
5a34f9dabd bpf: bpf_iter_task_next: use __next_thread() rather than next_thread()
Lockless use of next_thread() should be avoided, kernel/bpf/task_iter.c
is the last user and the usage is wrong.

bpf_iter_task_next() can loop forever, "kit->pos == kit->task" can never
happen if kit->pos execs. Change this code to use __next_thread().

With or without this change the usage of kit->pos/task and next_task()
doesn't look nice, see the next patch.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231114163237.GA897@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-19 11:43:44 -08:00
Oleg Nesterov
2d1618054f bpf: task_group_seq_get_next: use __next_thread() rather than next_thread()
Lockless use of next_thread() should be avoided, kernel/bpf/task_iter.c
is the last user and the usage is wrong.

task_group_seq_get_next() can return the group leader twice if it races
with mt-thread exec which changes the group->leader's pid.

Change the main loop to use __next_thread(), kill "next_tid == common->pid"
check.

__next_thread() can't loop forever, we can also change this code to retry
if next_tid == 0.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231114163234.GA890@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-19 11:43:44 -08:00
Andrii Nakryiko
46862ee854 bpf: emit frameno for PTR_TO_STACK regs if it differs from current one
It's possible to pass a pointer to parent's stack to child subprogs. In
such case verifier state output is ambiguous not showing whether
register container a pointer to "current" stack, belonging to current
subprog (frame), or it's actually a pointer to one of parent frames.

So emit this information if frame number differs between the state which
register is part of. E.g., if current state is in frame 2 and it has
a register pointing to stack in grand parent state (frame #0), we'll see
something like 'R1=fp[0]-16', while "local stack pointer" will be just
'R2=fp-16'.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-9-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18 11:39:59 -08:00
Andrii Nakryiko
0f8dbdbc64 bpf: smarter verifier log number printing logic
Instead of always printing numbers as either decimals (and in some
cases, like for "imm=%llx", in hexadecimals), decide the form based on
actual values. For numbers in a reasonably small range (currently,
[0, U16_MAX] for unsigned values, and [S16_MIN, S16_MAX] for signed ones),
emit them as decimals. In all other cases, even for signed values,
emit them in hexadecimals.

For large values hex form is often times way more useful: it's easier to
see an exact difference between 0xffffffff80000000 and 0xffffffff7fffffff,
than between 18446744071562067966 and 18446744071562067967, as one
particular example.

Small values representing small pointer offsets or application
constants, on the other hand, are way more useful to be represented in
decimal notation.

Adjust reg_bounds register state parsing logic to take into account this
change.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-8-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18 11:39:59 -08:00
Andrii Nakryiko
1db747d75b bpf: omit default off=0 and imm=0 in register state log
Simplify BPF verifier log further by omitting default (and frequently
irrelevant) off=0 and imm=0 parts for non-SCALAR_VALUE registers. As can
be seen from fixed tests, this is often a visual noise for PTR_TO_CTX
register and even for PTR_TO_PACKET registers.

Omitting default values follows the rest of register state logic: we
omit default values to keep verifier log succinct and to highlight
interesting state that deviates from default one. E.g., we do the same
for var_off, when it's unknown, which gives no additional information.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18 11:39:59 -08:00
Andrii Nakryiko
0c95c9fdb6 bpf: emit map name in register state if applicable and available
In complicated real-world applications, whenever debugging some
verification error through verifier log, it often would be very useful
to see map name for PTR_TO_MAP_VALUE register. Usually this needs to be
inferred from key/value sizes and maybe trying to guess C code location,
but it's not always clear.

Given verifier has the name, and it's never too long, let's just emit it
for ptr_to_map_key, ptr_to_map_value, and const_ptr_to_map registers. We
reshuffle the order a bit, so that map name, key size, and value size
appear before offset and immediate values, which seems like a more
logical order.

Current output:

  R1_w=map_ptr(map=array_map,ks=4,vs=8,off=0,imm=0)

But we'll get rid of useless off=0 and imm=0 parts in the next patch.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18 11:39:59 -08:00
Andrii Nakryiko
67d43dfbb4 bpf: print spilled register state in stack slot
Print the same register state representation when printing stack state,
as we do for normal registers. Note that if stack slot contains
subregister spill (1, 2, or 4 byte long), we'll still emit "m0?" mask
for those bytes that are not part of spilled register.

While means we can get something like fp-8=0000scalar() for a 4-byte
spill with other 4 bytes still being STACK_ZERO.

Some example before and after, taken from the log of
pyperf_subprogs.bpf.o:

49: (7b) *(u64 *)(r10 -256) = r1      ; frame1: R1_w=ctx(off=0,imm=0) R10=fp0 fp-256_w=ctx
49: (7b) *(u64 *)(r10 -256) = r1      ; frame1: R1_w=ctx(off=0,imm=0) R10=fp0 fp-256_w=ctx(off=0,imm=0)

150: (7b) *(u64 *)(r10 -264) = r0     ; frame1: R0_w=map_value_or_null(id=6,off=0,ks=192,vs=4,imm=0) R10=fp0 fp-264_w=map_value_or_null
150: (7b) *(u64 *)(r10 -264) = r0     ; frame1: R0_w=map_value_or_null(id=6,off=0,ks=192,vs=4,imm=0) R10=fp0 fp-264_w=map_value_or_null(id=6,off=0,ks=192,vs=4,imm=0)

5192: (61) r1 = *(u32 *)(r10 -272)    ; frame1: R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=15,var_off=(0x0; 0xf)) R10=fp0 fp-272=
5192: (61) r1 = *(u32 *)(r10 -272)    ; frame1: R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=15,var_off=(0x0; 0xf)) R10=fp0 fp-272=????scalar(smin=smin32=0,smax=umax=smax32=umax32=15,var_off=(0x0; 0xf))

While at it, do a few other simple clean ups:
  - skip slot if it's not scratched before detecting whether it's valid;
  - move taking spilled_reg pointer outside of switch (only DYNPTR has
    to adjust that to get to the "main" slot);
  - don't recalculate types_buf second time for MISC/ZERO/default case.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18 11:39:59 -08:00
Andrii Nakryiko
009f5465be bpf: extract register state printing
Extract printing register state representation logic into a separate
helper, as we are going to reuse it for spilled register state printing
in the next patch. This also nicely reduces code nestedness.

No functional changes.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18 11:39:59 -08:00
Andrii Nakryiko
42feb6620a bpf: move verifier state printing code to kernel/bpf/log.c
Move a good chunk of code from verifier.c to log.c: verifier state
verbose printing logic. This is an important and very much
logging/debugging oriented code. It fits the overlall log.c's focus on
verifier logging, and moving it allows to keep growing it without
unnecessarily adding to verifier.c code that otherwise contains a core
verification logic.

There are not many shared dependencies between this code and the rest of
verifier.c code, except a few single-line helpers for various register
type checks and a bit of state "scratching" helpers. We move all such
trivial helpers into include/bpf/bpf_verifier.h as static inlines.

No functional changes in this patch.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18 11:39:59 -08:00
Andrii Nakryiko
db840d389b bpf: move verbose_linfo() into kernel/bpf/log.c
verifier.c is huge. Let's try to move out parts that are logging-related
into log.c, as we previously did with bpf_log() and other related stuff.
This patch moves line info verbose output routines: it's pretty
self-contained and isolated code, so there is no problem with this.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18 11:39:58 -08:00
Andrii Nakryiko
ff8867af01 bpf: rename BPF_F_TEST_SANITY_STRICT to BPF_F_TEST_REG_INVARIANTS
Rename verifier internal flag BPF_F_TEST_SANITY_STRICT to more neutral
BPF_F_TEST_REG_INVARIANTS. This is a follow up to [0].

A few selftests and veristat need to be adjusted in the same patch as
well.

  [0] https://patchwork.kernel.org/project/netdevbpf/patch/20231112010609.848406-5-andrii@kernel.org/

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231117171404.225508-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-17 10:30:02 -08:00
Andrii Nakryiko
cf5fe3c71c bpf: make __reg{32,64}_deduce_bounds logic more robust
This change doesn't seem to have any effect on selftests and production
BPF object files, but we preemptively try to make it more robust.

First, "learn sign from signed bounds" comment is misleading, as we are
learning not just sign, but also values.

Second, we simplify the check for determining whether entire range is
positive or negative similarly to other checks added earlier, using
appropriate u32/u64 cast and single comparisons. As explain in comments
in __reg64_deduce_bounds(), the checks are equivalent.

Last but not least, smin/smax and s32_min/s32_max reassignment based on
min/max of both umin/umax and smin/smax (and 32-bit equivalents) is hard
to explain and justify. We are updating unsigned bounds from signed
bounds, why would we update signed bounds at the same time? This might
be correct, but it's far from obvious why and the code or comments don't
try to justify this. Given we've added a separate deduction of signed
bounds from unsigned bounds earlier, this seems at least redundant, if
not just wrong.

In short, we remove doubtful pieces, and streamline the rest to follow
the logic and approach of the rest of reg_bounds_sync() checks.

Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231112010609.848406-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15 12:03:42 -08:00
Andrii Nakryiko
3cf98cf594 bpf: remove redundant s{32,64} -> u{32,64} deduction logic
Equivalent checks were recently added in more succinct and, arguably,
safer form in:
  - f188765f23a5 ("bpf: derive smin32/smax32 from umin32/umax32 bounds");
  - 2e74aef782d3 ("bpf: derive smin/smax from umin/max bounds").

The checks we are removing in this patch set do similar checks to detect
if entire u32/u64 range has signed bit set or not set, but does it with
two separate checks.

Further, we forcefully overwrite either smin or smax (and 32-bit equvalents)
without applying normal min/max intersection logic. It's not clear why
that would be correct in all cases and seems to work by accident. This
logic is also "gated" by previous signed -> unsigned derivation, which
returns early.

All this is quite confusing and seems error-prone, while we already have
at least equivalent checks happening earlier. So remove this duplicate
and error-prone logic to simplify things a bit.

Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231112010609.848406-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15 12:03:42 -08:00
Andrii Nakryiko
5f99f312bd bpf: add register bounds sanity checks and sanitization
Add simple sanity checks that validate well-formed ranges (min <= max)
across u64, s64, u32, and s32 ranges. Also for cases when the value is
constant (either 64-bit or 32-bit), we validate that ranges and tnums
are in agreement.

These bounds checks are performed at the end of BPF_ALU/BPF_ALU64
operations, on conditional jumps, and for LDX instructions (where subreg
zero/sign extension is probably the most important to check). This
covers most of the interesting cases.

Also, we validate the sanity of the return register when manually
adjusting it for some special helpers.

By default, sanity violation will trigger a warning in verifier log and
resetting register bounds to "unbounded" ones. But to aid development
and debugging, BPF_F_TEST_SANITY_STRICT flag is added, which will
trigger hard failure of verification with -EFAULT on register bounds
violations. This allows selftests to catch such issues. veristat will
also gain a CLI option to enable this behavior.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20231112010609.848406-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15 12:03:42 -08:00
Andrii Nakryiko
be41a203bb bpf: enhance BPF_JEQ/BPF_JNE is_branch_taken logic
Use 32-bit subranges to prune some 64-bit BPF_JEQ/BPF_JNE conditions
that otherwise would be "inconclusive" (i.e., is_branch_taken() would
return -1). This can happen, for example, when registers are initialized
as 64-bit u64/s64, then compared for inequality as 32-bit subregisters,
and then followed by 64-bit equality/inequality check. That 32-bit
inequality can establish some pattern for lower 32 bits of a register
(e.g., s< 0 condition determines whether the bit #31 is zero or not),
while overall 64-bit value could be anything (according to a value range
representation).

This is not a fancy quirky special case, but actually a handling that's
necessary to prevent correctness issue with BPF verifier's range
tracking: set_range_min_max() assumes that register ranges are
non-overlapping, and if that condition is not guaranteed by
is_branch_taken() we can end up with invalid ranges, where min > max.

  [0] https://lore.kernel.org/bpf/CACkBjsY2q1_fUohD7hRmKGqv1MV=eP2f6XK8kjkYNw7BaiF8iQ@mail.gmail.com/

Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231112010609.848406-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15 12:03:42 -08:00
Andrii Nakryiko
96381879a3 bpf: generalize is_scalar_branch_taken() logic
Generalize is_branch_taken logic for SCALAR_VALUE register to handle
cases when both registers are not constants. Previously supported
<range> vs <scalar> cases are a natural subset of more generic <range>
vs <range> set of cases.

Generalized logic relies on straightforward segment intersection checks.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20231112010609.848406-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15 12:03:41 -08:00
Andrii Nakryiko
67420501e8 bpf: generalize reg_set_min_max() to handle non-const register comparisons
Generalize bounds adjustment logic of reg_set_min_max() to handle not
just register vs constant case, but in general any register vs any
register cases. For most of the operations it's trivial extension based
on range vs range comparison logic, we just need to properly pick
min/max of a range to compare against min/max of the other range.

For BPF_JSET we keep the original capabilities, just make sure JSET is
integrated in the common framework. This is manifested in the
internal-only BPF_JSET + BPF_X "opcode" to allow for simpler and more
uniform rev_opcode() handling. See the code for details. This allows to
reuse the same code exactly both for TRUE and FALSE branches without
explicitly handling both conditions with custom code.

Note also that now we don't need a special handling of BPF_JEQ/BPF_JNE
case none of the registers are constants. This is now just a normal
generic case handled by reg_set_min_max().

To make tnum handling cleaner, tnum_with_subreg() helper is added, as
that's a common operator when dealing with 32-bit subregister bounds.
This keeps the overall logic much less noisy when it comes to tnums.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20231112010609.848406-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15 12:03:41 -08:00
Yonghong Song
1fda5bb66a bpf: Do not allocate percpu memory at init stage
Kirill Shutemov reported significant percpu memory consumption increase after
booting in 288-cpu VM ([1]) due to commit 41a5db8d81 ("bpf: Add support for
non-fix-size percpu mem allocation"). The percpu memory consumption is
increased from 111MB to 969MB. The number is from /proc/meminfo.

I tried to reproduce the issue with my local VM which at most supports upto
255 cpus. With 252 cpus, without the above commit, the percpu memory
consumption immediately after boot is 57MB while with the above commit the
percpu memory consumption is 231MB.

This is not good since so far percpu memory from bpf memory allocator is not
widely used yet. Let us change pre-allocation in init stage to on-demand
allocation when verifier detects there is a need of percpu memory for bpf
program. With this change, percpu memory consumption after boot can be reduced
signicantly.

  [1] https://lore.kernel.org/lkml/20231109154934.4saimljtqx625l3v@box.shutemov.name/

Fixes: 41a5db8d81 ("bpf: Add support for non-fix-size percpu mem allocation")
Reported-and-tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231111013928.948838-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15 07:51:06 -08:00
Yafang Shao
fe977716b4 bpf: Add a new kfunc for cgroup1 hierarchy
A new kfunc is added to acquire cgroup1 of a task:

- bpf_task_get_cgroup1
  Acquires the associated cgroup of a task whithin a specific cgroup1
  hierarchy. The cgroup1 hierarchy is identified by its hierarchy ID.

This new kfunc enables the tracing of tasks within a designated
container or cgroup directory in BPF programs.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20231111090034.4248-2-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-14 08:56:56 -08:00
Jordan Rome
b8e3a87a62 bpf: Add crosstask check to __bpf_get_stack
Currently get_perf_callchain only supports user stack walking for
the current task. Passing the correct *crosstask* param will return
0 frames if the task passed to __bpf_get_stack isn't the current
one instead of a single incorrect frame/address. This change
passes the correct *crosstask* param but also does a preemptive
check in __bpf_get_stack if the task is current and returns
-EOPNOTSUPP if it is not.

This issue was found using bpf_get_task_stack inside a BPF
iterator ("iter/task"), which iterates over all tasks.
bpf_get_task_stack works fine for fetching kernel stacks
but because get_perf_callchain relies on the caller to know
if the requested *task* is the current one (via *crosstask*)
it was failing in a confusing way.

It might be possible to get user stacks for all tasks utilizing
something like access_process_vm but that requires the bpf
program calling bpf_get_task_stack to be sleepable and would
therefore be a breaking change.

Fixes: fa28dcb82a ("bpf: Introduce helper bpf_get_task_stack()")
Signed-off-by: Jordan Rome <jordalgo@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231108112334.3433136-1-jordalgo@meta.com
2023-11-10 11:06:10 -08:00
Andrii Nakryiko
10e14e9652 bpf: fix control-flow graph checking in privileged mode
When BPF program is verified in privileged mode, BPF verifier allows
bounded loops. This means that from CFG point of view there are
definitely some back-edges. Original commit adjusted check_cfg() logic
to not detect back-edges in control flow graph if they are resulting
from conditional jumps, which the idea that subsequent full BPF
verification process will determine whether such loops are bounded or
not, and either accept or reject the BPF program. At least that's my
reading of the intent.

Unfortunately, the implementation of this idea doesn't work correctly in
all possible situations. Conditional jump might not result in immediate
back-edge, but just a few unconditional instructions later we can arrive
at back-edge. In such situations check_cfg() would reject BPF program
even in privileged mode, despite it might be bounded loop. Next patch
adds one simple program demonstrating such scenario.

To keep things simple, instead of trying to detect back edges in
privileged mode, just assume every back edge is valid and let subsequent
BPF verification prove or reject bounded loops.

Note a few test changes. For unknown reason, we have a few tests that
are specified to detect a back-edge in a privileged mode, but looking at
their code it seems like the right outcome is passing check_cfg() and
letting subsequent verification to make a decision about bounded or not
bounded looping.

Bounded recursion case is also interesting. The example should pass, as
recursion is limited to just a few levels and so we never reach maximum
number of nested frames and never exhaust maximum stack depth. But the
way that max stack depth logic works today it falsely detects this as
exceeding max nested frame count. This patch series doesn't attempt to
fix this orthogonal problem, so we just adjust expected verifier failure.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Fixes: 2589726d12 ("bpf: introduce bounded loops")
Reported-by: Hao Sun <sunhao.th@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231110061412.2995786-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09 22:57:24 -08:00
Andrii Nakryiko
4bb7ea946a bpf: fix precision backtracking instruction iteration
Fix an edge case in __mark_chain_precision() which prematurely stops
backtracking instructions in a state if it happens that state's first
and last instruction indexes are the same. This situations doesn't
necessarily mean that there were no instructions simulated in a state,
but rather that we starting from the instruction, jumped around a bit,
and then ended up at the same instruction before checkpointing or
marking precision.

To distinguish between these two possible situations, we need to consult
jump history. If it's empty or contain a single record "bridging" parent
state and first instruction of processed state, then we indeed
backtracked all instructions in this state. But if history is not empty,
we are definitely not done yet.

Move this logic inside get_prev_insn_idx() to contain it more nicely.
Use -ENOENT return code to denote "we are out of instructions"
situation.

This bug was exposed by verifier_loop1.c's bounded_recursion subtest, once
the next fix in this patch set is applied.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Fixes: b5dc0163d8 ("bpf: precise scalar_value tracking")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231110002638.4168352-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09 20:11:20 -08:00
Andrii Nakryiko
3feb263bb5 bpf: handle ldimm64 properly in check_cfg()
ldimm64 instructions are 16-byte long, and so have to be handled
appropriately in check_cfg(), just like the rest of BPF verifier does.

This has implications in three places:
  - when determining next instruction for non-jump instructions;
  - when determining next instruction for callback address ldimm64
    instructions (in visit_func_call_insn());
  - when checking for unreachable instructions, where second half of
    ldimm64 is expected to be unreachable;

We take this also as an opportunity to report jump into the middle of
ldimm64. And adjust few test_verifier tests accordingly.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Reported-by: Hao Sun <sunhao.th@gmail.com>
Fixes: 475fb78fbf ("bpf: verifier (add branch/goto checks)")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231110002638.4168352-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09 20:11:20 -08:00
Dave Marchevsky
1b12171533 bpf: Mark direct ld of stashed bpf_{rb,list}_node as non-owning ref
This patch enables the following pattern:

  /* mapval contains a __kptr pointing to refcounted local kptr */
  mapval = bpf_map_lookup_elem(&map, &idx);
  if (!mapval || !mapval->some_kptr) { /* omitted */ }

  p = bpf_refcount_acquire(&mapval->some_kptr);

Currently this doesn't work because bpf_refcount_acquire expects an
owning or non-owning ref. The verifier defines non-owning ref as a type:

  PTR_TO_BTF_ID | MEM_ALLOC | NON_OWN_REF

while mapval->some_kptr is PTR_TO_BTF_ID | PTR_UNTRUSTED. It's possible
to do the refcount_acquire by first bpf_kptr_xchg'ing mapval->some_kptr
into a temp kptr, refcount_acquiring that, and xchg'ing back into
mapval, but this is unwieldy and shouldn't be necessary.

This patch modifies btf_ld_kptr_type such that user-allocated types are
marked MEM_ALLOC and if those types have a bpf_{rb,list}_node they're
marked NON_OWN_REF as well. Additionally, due to changes to
bpf_obj_drop_impl earlier in this series, rcu_protected_object now
returns true for all user-allocated types, resulting in
mapval->some_kptr being marked MEM_RCU.

After this patch's changes, mapval->some_kptr is now:

  PTR_TO_BTF_ID | MEM_ALLOC | NON_OWN_REF | MEM_RCU

which results in it passing the non-owning ref test, and the motivating
example passing verification.

Future work will likely get rid of special non-owning ref lifetime logic
in the verifier, at which point we'll be able to delete the NON_OWN_REF
flag entirely.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20231107085639.3016113-6-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09 19:07:51 -08:00
Dave Marchevsky
790ce3cfef bpf: Move GRAPH_{ROOT,NODE}_MASK macros into btf_field_type enum
This refactoring patch removes the unused BPF_GRAPH_NODE_OR_ROOT
btf_field_type and moves BPF_GRAPH_{NODE,ROOT} macros into the
btf_field_type enum. Further patches in the series will use
BPF_GRAPH_NODE, so let's move this useful definition out of btf.c.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20231107085639.3016113-5-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09 19:07:51 -08:00
Dave Marchevsky
649924b76a bpf: Use bpf_mem_free_rcu when bpf_obj_dropping non-refcounted nodes
The use of bpf_mem_free_rcu to free refcounted local kptrs was added
in commit 7e26cd12ad ("bpf: Use bpf_mem_free_rcu when
bpf_obj_dropping refcounted nodes"). In the cover letter for the
series containing that patch [0] I commented:

    Perhaps it makes sense to move to mem_free_rcu for _all_
    non-owning refs in the future, not just refcounted. This might
    allow custom non-owning ref lifetime + invalidation logic to be
    entirely subsumed by MEM_RCU handling. IMO this needs a bit more
    thought and should be tackled outside of a fix series, so it's not
    attempted here.

It's time to start moving in the "non-owning refs have MEM_RCU
lifetime" direction. As mentioned in that comment, using
bpf_mem_free_rcu for all local kptrs - not just refcounted - is
necessarily the first step towards that goal. This patch does so.

After this patch the memory pointed to by all local kptrs will not be
reused until RCU grace period elapses. The verifier's understanding of
non-owning ref validity and the clobbering logic it uses to enforce
that understanding are not changed here, that'll happen gradually in
future work, including further patches in the series.

  [0]: https://lore.kernel.org/all/20230821193311.3290257-1-davemarchevsky@fb.com/

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20231107085639.3016113-4-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09 19:07:51 -08:00
Dave Marchevsky
1500a5d9f4 bpf: Add KF_RCU flag to bpf_refcount_acquire_impl
Refcounted local kptrs are kptrs to user-defined types with a
bpf_refcount field. Recent commits ([0], [1]) modified the lifetime of
refcounted local kptrs such that the underlying memory is not reused
until RCU grace period has elapsed.

Separately, verification of bpf_refcount_acquire calls currently
succeeds for MAYBE_NULL non-owning reference input, which is a problem
as bpf_refcount_acquire_impl has no handling for this case.

This patch takes advantage of aforementioned lifetime changes to tag
bpf_refcount_acquire_impl kfunc KF_RCU, thereby preventing MAYBE_NULL
input to the kfunc. The KF_RCU flag applies to all kfunc params; it's
fine for it to apply to the void *meta__ign param as that's populated by
the verifier and is tagged __ign regardless.

  [0]: commit 7e26cd12ad ("bpf: Use bpf_mem_free_rcu when
       bpf_obj_dropping refcounted nodes") is the actual change to
       allocation behaivor
  [1]: commit 0816b8c6bf ("bpf: Consider non-owning refs to refcounted
       nodes RCU protected") modified verifier understanding of
       refcounted local kptrs to match [0]'s changes

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Fixes: 7c50b1cb76 ("bpf: Add bpf_refcount_acquire kfunc")
Link: https://lore.kernel.org/r/20231107085639.3016113-2-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09 19:07:51 -08:00
Shung-Hsi Yu
82ce364c60 bpf: replace register_is_const() with is_reg_const()
The addition of is_reg_const() in commit 171de12646d2 ("bpf: generalize
is_branch_taken to handle all conditional jumps in one place") has made the
register_is_const() redundant. Give the former has more feature, plus the
fact the latter is only used in one place, replace register_is_const() with
is_reg_const(), and remove the definition of register_is_const.

This requires moving the definition of is_reg_const() further up. And since
the comment of reg_const_value() reference is_reg_const(), move it up as
well.

Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231108140043.12282-1-shung-hsi.yu@suse.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09 19:07:51 -08:00
Song Liu
045edee19d bpf: Introduce KF_ARG_PTR_TO_CONST_STR
Similar to ARG_PTR_TO_CONST_STR for BPF helpers, KF_ARG_PTR_TO_CONST_STR
specifies kfunc args that point to const strings. Annotation "__str" is
used to specify kfunc arg of type KF_ARG_PTR_TO_CONST_STR. Also, add
documentation for the "__str" annotation.

bpf_get_file_xattr() will be the first kfunc that uses this type.

Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://lore.kernel.org/bpf/20231107045725.2278852-4-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09 19:07:38 -08:00
Song Liu
0b51940729 bpf: Factor out helper check_reg_const_str()
ARG_PTR_TO_CONST_STR is used to specify constant string args for BPF
helpers. The logic that verifies a reg is ARG_PTR_TO_CONST_STR is
implemented in check_func_arg().

As we introduce kfuncs with constant string args, it is necessary to
do the same check for kfuncs (in check_kfunc_args). Factor out the logic
for ARG_PTR_TO_CONST_STR to a new check_reg_const_str() so that it can be
reused.

check_func_arg() ensures check_reg_const_str() is only called with reg of
type PTR_TO_MAP_VALUE. Add a redundent type check in check_reg_const_str()
to avoid misuse in the future. Other than this redundent check, there is
no change in behavior.

Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://lore.kernel.org/bpf/20231107045725.2278852-3-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09 19:07:38 -08:00
Song Liu
74523c06ae bpf: Add __bpf_dynptr_data* for in kernel use
Different types of bpf dynptr have different internal data storage.
Specifically, SKB and XDP type of dynptr may have non-continuous data.
Therefore, it is not always safe to directly access dynptr->data.

Add __bpf_dynptr_data and __bpf_dynptr_data_rw to replace direct access to
dynptr->data.

Update bpf_verify_pkcs7_signature to use __bpf_dynptr_data instead of
dynptr->data.

Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://lore.kernel.org/bpf/20231107045725.2278852-2-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09 19:07:38 -08:00
Florian Lehner
9b75dbeb36 bpf, lpm: Fix check prefixlen before walking trie
When looking up an element in LPM trie, the condition 'matchlen ==
trie->max_prefixlen' will never return true, if key->prefixlen is larger
than trie->max_prefixlen. Consequently all elements in the LPM trie will
be visited and no element is returned in the end.

To resolve this, check key->prefixlen first before walking the LPM trie.

Fixes: b95a5c4db0 ("bpf: add a longest prefix match trie map implementation")
Signed-off-by: Florian Lehner <dev@der-flo.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231105085801.3742-1-dev@der-flo.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09 19:07:38 -08:00
Andrii Nakryiko
4621202adc bpf: generalize reg_set_min_max() to handle two sets of two registers
Change reg_set_min_max() to take FALSE/TRUE sets of two registers each,
instead of assuming that we are always comparing to a constant. For now
we still assume that right-hand side registers are constants (and make
sure that's the case by swapping src/dst regs, if necessary), but
subsequent patches will remove this limitation.

reg_set_min_max() is now called unconditionally for any register
comparison, so that might include pointer vs pointer. This makes it
consistent with is_branch_taken() generality. But we currently only
support adjustments based on SCALAR vs SCALAR comparisons, so
reg_set_min_max() has to guard itself againts pointers.

Taking two by two registers allows to further unify and simplify
check_cond_jmp_op() logic. We utilize fake register for BPF_K
conditional jump case, just like with is_branch_taken() part.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231102033759.2541186-18-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09 18:58:40 -08:00
Andrii Nakryiko
811476e9cc bpf: prepare reg_set_min_max for second set of registers
Similarly to is_branch_taken()-related refactorings, start preparing
reg_set_min_max() to handle more generic case of two non-const
registers. Start with renaming arguments to accommodate later addition
of second register as an input argument.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231102033759.2541186-17-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09 18:58:40 -08:00
Andrii Nakryiko
4d345887d2 bpf: unify 32-bit and 64-bit is_branch_taken logic
Combine 32-bit and 64-bit is_branch_taken logic for SCALAR_VALUE
registers. It makes it easier to see parallels between two domains
(32-bit and 64-bit), and makes subsequent refactoring more
straightforward.

No functional changes.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231102033759.2541186-16-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09 18:58:40 -08:00
Andrii Nakryiko
b74c2a842b bpf: generalize is_branch_taken to handle all conditional jumps in one place
Make is_branch_taken() a single entry point for branch pruning decision
making, handling both pointer vs pointer, pointer vs scalar, and scalar
vs scalar cases in one place. This also nicely cleans up check_cond_jmp_op().

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231102033759.2541186-15-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09 18:58:40 -08:00
Andrii Nakryiko
c697289efe bpf: move is_branch_taken() down
Move is_branch_taken() slightly down. In subsequent patched we'll need
both flip_opcode() and is_pkt_ptr_branch_taken() for is_branch_taken(),
but instead of sprinkling forward declarations around, it makes more
sense to move is_branch_taken() lower below is_pkt_ptr_branch_taken(),
and also keep it closer to very tightly related reg_set_min_max(), as
they are two critical parts of the same SCALAR range tracking logic.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231102033759.2541186-14-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09 18:58:39 -08:00
Andrii Nakryiko
c31534267c bpf: generalize is_branch_taken() to work with two registers
While still assuming that second register is a constant, generalize
is_branch_taken-related code to accept two registers instead of register
plus explicit constant value. This also, as a side effect, allows to
simplify check_cond_jmp_op() by unifying BPF_K case with BPF_X case, for
which we use a fake register to represent BPF_K's imm constant as
a register.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20231102033759.2541186-13-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09 18:58:39 -08:00
Andrii Nakryiko
c2a3ab0946 bpf: rename is_branch_taken reg arguments to prepare for the second one
Just taking mundane refactoring bits out into a separate patch. No
functional changes.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20231102033759.2541186-12-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09 18:58:39 -08:00
Andrii Nakryiko
9e314f5d86 bpf: drop knowledge-losing __reg_combine_{32,64}_into_{64,32} logic
When performing 32-bit conditional operation operating on lower 32 bits
of a full 64-bit register, register full value isn't changed. We just
potentially gain new knowledge about that register's lower 32 bits.

Unfortunately, __reg_combine_{32,64}_into_{64,32} logic that
reg_set_min_max() performs as a last step, can lose information in some
cases due to __mark_reg64_unbounded() and __reg_assign_32_into_64().
That's bad and completely unnecessary. Especially __reg_assign_32_into_64()
looks completely out of place here, because we are not performing
zero-extending subregister assignment during conditional jump.

So this patch replaced __reg_combine_* with just a normal
reg_bounds_sync() which will do a proper job of deriving u64/s64 bounds
from u32/s32, and vice versa (among all other combinations).

__reg_combine_64_into_32() is also used in one more place,
coerce_reg_to_size(), while handling 1- and 2-byte register loads.
Looking into this, it seems like besides marking subregister as
unbounded before performing reg_bounds_sync(), we were also performing
deduction of smin32/smax32 and umin32/umax32 bounds from respective
smin/smax and umin/umax bounds. It's now redundant as reg_bounds_sync()
performs all the same logic more generically (e.g., without unnecessary
assumption that upper 32 bits of full register should be zero).

Long story short, we remove __reg_combine_64_into_32() completely, and
coerce_reg_to_size() now only does resetting subreg to unbounded and then
performing reg_bounds_sync() to recover as much information as possible
from 64-bit umin/umax and smin/smax bounds, set explicitly in
coerce_reg_to_size() earlier.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20231102033759.2541186-10-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09 18:58:39 -08:00
Andrii Nakryiko
d7f0087381 bpf: try harder to deduce register bounds from different numeric domains
There are cases (caught by subsequent reg_bounds tests in selftests/bpf)
where performing one round of __reg_deduce_bounds() doesn't propagate
all the information from, say, s32 to u32 bounds and than from newly
learned u32 bounds back to u64 and s64. So perform __reg_deduce_bounds()
twice to make sure such derivations are propagated fully after
reg_bounds_sync().

One such example is test `(s64)[0xffffffff00000001; 0] (u64)<
0xffffffff00000000` from selftest patch from this patch set. It demonstrates an
intricate dance of u64 -> s64 -> u64 -> u32 bounds adjustments, which requires
two rounds of __reg_deduce_bounds(). Here are corresponding refinement log from
selftest, showing evolution of knowledge.

REFINING (FALSE R1) (u64)SRC=[0xffffffff00000000; U64_MAX] (u64)DST_OLD=[0; U64_MAX] (u64)DST_NEW=[0xffffffff00000000; U64_MAX]
REFINING (FALSE R1) (u64)SRC=[0xffffffff00000000; U64_MAX] (s64)DST_OLD=[0xffffffff00000001; 0] (s64)DST_NEW=[0xffffffff00000001; -1]
REFINING (FALSE R1) (s64)SRC=[0xffffffff00000001; -1] (u64)DST_OLD=[0xffffffff00000000; U64_MAX] (u64)DST_NEW=[0xffffffff00000001; U64_MAX]
REFINING (FALSE R1) (u64)SRC=[0xffffffff00000001; U64_MAX] (u32)DST_OLD=[0; U32_MAX] (u32)DST_NEW=[1; U32_MAX]

R1 initially has smin/smax set to [0xffffffff00000001; -1], while umin/umax is
unknown. After (u64)< comparison, in FALSE branch we gain knowledge that
umin/umax is [0xffffffff00000000; U64_MAX]. That causes smin/smax to learn that
zero can't happen and upper bound is -1. Then smin/smax is adjusted from
umin/umax improving lower bound from 0xffffffff00000000 to 0xffffffff00000001.
And then eventually umin32/umax32 bounds are drived from umin/umax and become
[1; U32_MAX].

Selftest in the last patch is actually implementing a multi-round fixed-point
convergence logic, but so far all the tests are handled by two rounds of
reg_bounds_sync() on the verifier state, so we keep it simple for now.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231102033759.2541186-9-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09 18:58:39 -08:00
Andrii Nakryiko
c51d5ad654 bpf: improve deduction of 64-bit bounds from 32-bit bounds
Add a few interesting cases in which we can tighten 64-bit bounds based
on newly learnt information about 32-bit bounds. E.g., when full u64/s64
registers are used in BPF program, and then eventually compared as
u32/s32. The latter comparison doesn't change the value of full
register, but it does impose new restrictions on possible lower 32 bits
of such full registers. And we can use that to derive additional full
register bounds information.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20231102033759.2541186-8-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09 18:58:39 -08:00
Andrii Nakryiko
6593f2e674 bpf: add special smin32/smax32 derivation from 64-bit bounds
Add a special case where we can derive valid s32 bounds from umin/umax
or smin/smax by stitching together negative s32 subrange and
non-negative s32 subrange. That requires upper 32 bits to form a [N, N+1]
range in u32 domain (taking into account wrap around, so 0xffffffff
to 0x00000000 is a valid [N, N+1] range in this sense). See code comment
for concrete examples.

Eduard Zingerman also provided an alternative explanation ([0]) for more
mathematically inclined readers:

Suppose:
. there are numbers a, b, c
. 2**31 <= b < 2**32
. 0 <= c < 2**31
. umin = 2**32 * a + b
. umax = 2**32 * (a + 1) + c

The number of values in the range represented by [umin; umax] is:
. N = umax - umin + 1 = 2**32 + c - b + 1
. min(N) = 2**32 + 0 - (2**32-1) + 1 = 2, with b = 2**32-1, c = 0
. max(N) = 2**32 + (2**31 - 1) - 2**31 + 1 = 2**32, with b = 2**31, c = 2**31-1

Hence [(s32)b; (s32)c] forms a valid range.

  [0] https://lore.kernel.org/bpf/d7af631802f0cfae20df77fe70068702d24bbd31.camel@gmail.com/

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231102033759.2541186-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09 18:58:39 -08:00
Andrii Nakryiko
c1efab6468 bpf: derive subreg bounds from full bounds when upper 32 bits are constant
Comments in code try to explain the idea behind why this is correct.
Please check the code and comments.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231102033759.2541186-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09 18:58:39 -08:00
Andrii Nakryiko
d540517990 bpf: derive smin32/smax32 from umin32/umax32 bounds
All the logic that applies to u64 vs s64, equally applies for u32 vs s32
relationships (just taken in a smaller 32-bit numeric space). So do the
same deduction of smin32/smax32 from umin32/umax32, if we can.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231102033759.2541186-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09 18:58:39 -08:00
Andrii Nakryiko
93f7378734 bpf: derive smin/smax from umin/max bounds
Add smin/smax derivation from appropriate umin/umax values. Previously the
logic was surprisingly asymmetric, trying to derive umin/umax from smin/smax
(if possible), but not trying to do the same in the other direction. A simple
addition to __reg64_deduce_bounds() fixes this.

Added also generic comment about u64/s64 ranges and their relationship.
Hopefully that helps readers to understand all the bounds deductions
a bit better.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231102033759.2541186-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09 18:58:39 -08:00
Chuyi Zhou
0de4f50de2 bpf: Let verifier consider {task,cgroup} is trusted in bpf_iter_reg
BTF_TYPE_SAFE_TRUSTED(struct bpf_iter__task) in verifier.c wanted to
teach BPF verifier that bpf_iter__task -> task is a trusted ptr. But it
doesn't work well.

The reason is, bpf_iter__task -> task would go through btf_ctx_access()
which enforces the reg_type of 'task' is ctx_arg_info->reg_type, and in
task_iter.c, we actually explicitly declare that the
ctx_arg_info->reg_type is PTR_TO_BTF_ID_OR_NULL.

Actually we have a previous case like this[1] where PTR_TRUSTED is added to
the arg flag for map_iter.

This patch sets ctx_arg_info->reg_type is PTR_TO_BTF_ID_OR_NULL |
PTR_TRUSTED in task_reg_info.

Similarly, bpf_cgroup_reg_info -> cgroup is also PTR_TRUSTED since we are
under the protection of cgroup_mutex and we would check cgroup_is_dead()
in __cgroup_iter_seq_show().

This patch is to improve the user experience of the newly introduced
bpf_iter_css_task kfunc before hitting the mainline. The Fixes tag is
pointing to the commit introduced the bpf_iter_css_task kfunc.

Link[1]:https://lore.kernel.org/all/20230706133932.45883-3-aspsk@isovalent.com/

Fixes: 9c66dc94b6 ("bpf: Introduce css_task open-coded iterator kfuncs")
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231107132204.912120-2-zhouchuyi@bytedance.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-11-07 15:24:25 -08:00
Shung-Hsi Yu
291d044fd5 bpf: Fix precision tracking for BPF_ALU | BPF_TO_BE | BPF_END
BPF_END and BPF_NEG has a different specification for the source bit in
the opcode compared to other ALU/ALU64 instructions, and is either
reserved or use to specify the byte swap endianness. In both cases the
source bit does not encode source operand location, and src_reg is a
reserved field.

backtrack_insn() currently does not differentiate BPF_END and BPF_NEG
from other ALU/ALU64 instructions, which leads to r0 being incorrectly
marked as precise when processing BPF_ALU | BPF_TO_BE | BPF_END
instructions. This commit teaches backtrack_insn() to correctly mark
precision for such case.

While precise tracking of BPF_NEG and other BPF_END instructions are
correct and does not need fixing, this commit opt to process all BPF_NEG
and BPF_END instructions within the same if-clause to better align with
current convention used in the verifier (e.g. check_alu_op).

Fixes: b5dc0163d8 ("bpf: precise scalar_value tracking")
Cc: stable@vger.kernel.org
Reported-by: Mohamed Mahmoud <mmahmoud@redhat.com>
Closes: https://lore.kernel.org/r/87jzrrwptf.fsf@toke.dk
Tested-by: Toke Høiland-Jørgensen <toke@redhat.com>
Tested-by: Tao Lyu <tao.lyu@epfl.ch>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20231102053913.12004-2-shung-hsi.yu@suse.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-01 22:54:27 -07:00
Chuyi Zhou
3091b66749 bpf: Relax allowlist for css_task iter
The newly added open-coded css_task iter would try to hold the global
css_set_lock in bpf_iter_css_task_new, so the bpf side has to be careful in
where it allows to use this iter. The mainly concern is dead locking on
css_set_lock. check_css_task_iter_allowlist() in verifier enforced css_task
can only be used in bpf_lsm hooks and sleepable bpf_iter.

This patch relax the allowlist for css_task iter. Any lsm and any iter
(even non-sleepable) and any sleepable are safe since they would not hold
the css_set_lock before entering BPF progs context.

This patch also fixes the misused BPF_TRACE_ITER in
check_css_task_iter_allowlist which compared bpf_prog_type with
bpf_attach_type.

Fixes: 9c66dc94b6 ("bpf: Introduce css_task open-coded iterator kfuncs")
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231031050438.93297-2-zhouchuyi@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-01 22:49:20 -07:00
Hou Tao
fd381ce60a bpf: Check map->usercnt after timer->timer is assigned
When there are concurrent uref release and bpf timer init operations,
the following sequence diagram is possible. It will break the guarantee
provided by bpf_timer: bpf_timer will still be alive after userspace
application releases or unpins the map. It also will lead to kmemleak
for old kernel version which doesn't release bpf_timer when map is
released.

bpf program X:

bpf_timer_init()
  lock timer->lock
    read timer->timer as NULL
    read map->usercnt != 0

                process Y:

                close(map_fd)
                  // put last uref
                  bpf_map_put_uref()
                    atomic_dec_and_test(map->usercnt)
                      array_map_free_timers()
                        bpf_timer_cancel_and_free()
                          // just return
                          read timer->timer is NULL

    t = bpf_map_kmalloc_node()
    timer->timer = t
  unlock timer->lock

Fix the problem by checking map->usercnt after timer->timer is assigned,
so when there are concurrent uref release and bpf timer init, either
bpf_timer_cancel_and_free() from uref release reads a no-NULL timer
or the newly-added atomic64_read() returns a zero usercnt.

Because atomic_dec_and_test(map->usercnt) and READ_ONCE(timer->timer)
in bpf_timer_cancel_and_free() are not protected by a lock, so add
a memory barrier to guarantee the order between map->usercnt and
timer->timer. Also use WRITE_ONCE(timer->timer, x) to match the lockless
read of timer->timer in bpf_timer_cancel_and_free().

Reported-by: Hsin-Wei Hung <hsinweih@uci.edu>
Closes: https://lore.kernel.org/bpf/CABcoxUaT2k9hWsS1tNgXyoU3E-=PuOgMn737qK984fbFmfYixQ@mail.gmail.com
Fixes: b00628b1c7 ("bpf: Introduce bpf timers.")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231030063616.1653024-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-01 22:37:31 -07:00
Dave Marchevsky
391145ba2a bpf: Add __bpf_kfunc_{start,end}_defs macros
BPF kfuncs are meant to be called from BPF programs. Accordingly, most
kfuncs are not called from anywhere in the kernel, which the
-Wmissing-prototypes warning is unhappy about. We've peppered
__diag_ignore_all("-Wmissing-prototypes", ... everywhere kfuncs are
defined in the codebase to suppress this warning.

This patch adds two macros meant to bound one or many kfunc definitions.
All existing kfunc definitions which use these __diag calls to suppress
-Wmissing-prototypes are migrated to use the newly-introduced macros.
A new __diag_ignore_all - for "-Wmissing-declarations" - is added to the
__bpf_kfunc_start_defs macro based on feedback from Andrii on an earlier
version of this patch [0] and another recent mailing list thread [1].

In the future we might need to ignore different warnings or do other
kfunc-specific things. This change will make it easier to make such
modifications for all kfunc defs.

  [0]: https://lore.kernel.org/bpf/CAEf4BzaE5dRWtK6RPLnjTW-MW9sx9K3Fn6uwqCTChK2Dcb1Xig@mail.gmail.com/
  [1]: https://lore.kernel.org/bpf/ZT+2qCc%2FaXep0%2FLf@krava/

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Cc: Jiri Olsa <olsajiri@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20231031215625.2343848-1-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-01 22:33:53 -07:00
Hao Sun
811c363645 bpf: Fix check_stack_write_fixed_off() to correctly spill imm
In check_stack_write_fixed_off(), imm value is cast to u32 before being
spilled to the stack. Therefore, the sign information is lost, and the
range information is incorrect when load from the stack again.

For the following prog:
0: r2 = r10
1: *(u64*)(r2 -40) = -44
2: r0 = *(u64*)(r2 - 40)
3: if r0 s<= 0xa goto +2
4: r0 = 1
5: exit
6: r0  = 0
7: exit

The verifier gives:
func#0 @0
0: R1=ctx(off=0,imm=0) R10=fp0
0: (bf) r2 = r10                      ; R2_w=fp0 R10=fp0
1: (7a) *(u64 *)(r2 -40) = -44        ; R2_w=fp0 fp-40_w=4294967252
2: (79) r0 = *(u64 *)(r2 -40)         ; R0_w=4294967252 R2_w=fp0
fp-40_w=4294967252
3: (c5) if r0 s< 0xa goto pc+2
mark_precise: frame0: last_idx 3 first_idx 0 subseq_idx -1
mark_precise: frame0: regs=r0 stack= before 2: (79) r0 = *(u64 *)(r2 -40)
3: R0_w=4294967252
4: (b7) r0 = 1                        ; R0_w=1
5: (95) exit
verification time 7971 usec
stack depth 40
processed 6 insns (limit 1000000) max_states_per_insn 0 total_states 0
peak_states 0 mark_read 0

So remove the incorrect cast, since imm field is declared as s32, and
__mark_reg_known() takes u64, so imm would be correctly sign extended
by compiler.

Fixes: ecdf985d76 ("bpf: track immediate values written to stack by BPF_ST instruction")
Cc: stable@vger.kernel.org
Signed-off-by: Hao Sun <sunhao.th@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231101-fix-check-stack-write-v3-1-f05c2b1473d5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-01 22:30:27 -07:00
Matthieu Baerts
05670f81d1 bpf: fix compilation error without CGROUPS
Our MPTCP CI complained [1] -- and KBuild too -- that it was no longer
possible to build the kernel without CONFIG_CGROUPS:

  kernel/bpf/task_iter.c: In function 'bpf_iter_css_task_new':
  kernel/bpf/task_iter.c:919:14: error: 'CSS_TASK_ITER_PROCS' undeclared (first use in this function)
    919 |         case CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED:
        |              ^~~~~~~~~~~~~~~~~~~
  kernel/bpf/task_iter.c:919:14: note: each undeclared identifier is reported only once for each function it appears in
  kernel/bpf/task_iter.c:919:36: error: 'CSS_TASK_ITER_THREADED' undeclared (first use in this function)
    919 |         case CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED:
        |                                    ^~~~~~~~~~~~~~~~~~~~~~
  kernel/bpf/task_iter.c:927:60: error: invalid application of 'sizeof' to incomplete type 'struct css_task_iter'
    927 |         kit->css_it = bpf_mem_alloc(&bpf_global_ma, sizeof(struct css_task_iter));
        |                                                            ^~~~~~
  kernel/bpf/task_iter.c:930:9: error: implicit declaration of function 'css_task_iter_start'; did you mean 'task_seq_start'? [-Werror=implicit-function-declaration]
    930 |         css_task_iter_start(css, flags, kit->css_it);
        |         ^~~~~~~~~~~~~~~~~~~
        |         task_seq_start
  kernel/bpf/task_iter.c: In function 'bpf_iter_css_task_next':
  kernel/bpf/task_iter.c:940:16: error: implicit declaration of function 'css_task_iter_next'; did you mean 'class_dev_iter_next'? [-Werror=implicit-function-declaration]
    940 |         return css_task_iter_next(kit->css_it);
        |                ^~~~~~~~~~~~~~~~~~
        |                class_dev_iter_next
  kernel/bpf/task_iter.c:940:16: error: returning 'int' from a function with return type 'struct task_struct *' makes pointer from integer without a cast [-Werror=int-conversion]
    940 |         return css_task_iter_next(kit->css_it);
        |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  kernel/bpf/task_iter.c: In function 'bpf_iter_css_task_destroy':
  kernel/bpf/task_iter.c:949:9: error: implicit declaration of function 'css_task_iter_end' [-Werror=implicit-function-declaration]
    949 |         css_task_iter_end(kit->css_it);
        |         ^~~~~~~~~~~~~~~~~

This patch simply surrounds with a #ifdef the new code requiring CGroups
support. It seems enough for the compiler and this is similar to
bpf_iter_css_{new,next,destroy}() functions where no other #ifdef have
been added in kernel/bpf/helpers.c and in the selftests.

Fixes: 9c66dc94b6 ("bpf: Introduce css_task open-coded iterator kfuncs")
Link: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/6665206927
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202310260528.aHWgVFqq-lkp@intel.com/
Signed-off-by: Matthieu Baerts <matttbe@kernel.org>
[ added missing ifdefs for BTF_ID cgroup definitions ]
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20231101181601.1493271-1-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-01 22:28:25 -07:00
Linus Torvalds
f5277ad1e9 for-6.7/io_uring-sockopt-2023-10-30
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmU/vdwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpr2rD/0astIsj/AACVSPzHARg9lnhkIvUeweMSSl
 CjifLTzK3a9E3R2IrC4sflObUKIEL3fste0Lva141eNULZvBJ6cQJDvY7Bp72Bkc
 CTPEwEQiwDJKLhTzQh3gY0H0+nFMWwEm1uc4dyeNAft/R9bPP/qOq62ttCoCp9+S
 1UoFmTlJE3bhejyS7fytoGZvKqhkpdR7rtbR4ya7CXWPoAG+v9amo8fputbxm0dj
 WECpKdd65JHWwYV4rbPA69T7jZ9V0oUsLen9RJ9BmjMLOFggHYqQdvEwG0Htirhw
 t5uaXqSvc8pXsJhKXMS3tXCrLNtBha5nlWHBpSE+6ovcmKiRzFjUaRXkRbcIrOAx
 ljIm0HHto1+xv0pDrNl3/lIjv5dpNOEauqqgMeYytQJIHa0JpSWbYzvjwQ8EZXQv
 WWDiRfH5Z0/3BsFdOCVqd8mTt4Pbksp2VFcxGkojRtSqSr4CML3mPZSmqGcs3nE6
 Fc16XXw7oLEWoF1tQYMP6KG0cVLem4on28c8CcVMJ/pRvcun3jBCif2gmMHJkWyA
 a6Uq116amqQ61f1p+EQ3ChqyTA5uALrXPmovu6Ne3Y/btW5yG4+Vu7AsPLjPHdFN
 oGHjOPV77XQzEqzUWRXmXPecZ+QifkcCV/8kbqtEHQqk5n+HUKQZmpC8+014ms3V
 Af6LYI/vYg==
 =sk8+
 -----END PGP SIGNATURE-----

Merge tag 'for-6.7/io_uring-sockopt-2023-10-30' of git://git.kernel.dk/linux

Pull io_uring {get,set}sockopt support from Jens Axboe:
 "This adds support for using getsockopt and setsockopt via io_uring.

  The main use cases for this is to enable use of direct descriptors,
  rather than first instantiating a normal file descriptor, doing the
  option tweaking needed, then turning it into a direct descriptor. With
  this support, we can avoid needing a regular file descriptor
  completely.

  The net and bpf bits have been signed off on their side"

* tag 'for-6.7/io_uring-sockopt-2023-10-30' of git://git.kernel.dk/linux:
  selftests/bpf/sockopt: Add io_uring support
  io_uring/cmd: Introduce SOCKET_URING_OP_SETSOCKOPT
  io_uring/cmd: Introduce SOCKET_URING_OP_GETSOCKOPT
  io_uring/cmd: return -EOPNOTSUPP if net is disabled
  selftests/net: Extract uring helpers to be reusable
  tools headers: Grab copy of io_uring.h
  io_uring/cmd: Pass compat mode in issue_flags
  net/socket: Break down __sys_getsockopt
  net/socket: Break down __sys_setsockopt
  bpf: Add sockptr support for setsockopt
  bpf: Add sockptr support for getsockopt
2023-11-01 11:16:34 -10:00
Linus Torvalds
89ed67ef12 Networking changes for 6.7.
Core & protocols
 ----------------
 
  - Support usec resolution of TCP timestamps, enabled selectively by
    a route attribute.
 
  - Defer regular TCP ACK while processing socket backlog, try to send
    a cumulative ACK at the end. Increase single TCP flow performance
    on a 200Gbit NIC by 20% (100Gbit -> 120Gbit).
 
  - The Fair Queuing (FQ) packet scheduler:
    - add built-in 3 band prio / WRR scheduling
    - support bypass if the qdisc is mostly idle (5% speed up for TCP RR)
    - improve inactive flow reporting
    - optimize the layout of structures for better cache locality
 
  - Support TCP Authentication Option (RFC 5925, TCP-AO), a more modern
    replacement for the old MD5 option.
 
  - Add more retransmission timeout (RTO) related statistics to TCP_INFO.
 
  - Support sending fragmented skbs over vsock sockets.
 
  - Make sure we send SIGPIPE for vsock sockets if socket was shutdown().
 
  - Add sysctl for ignoring lower limit on lifetime in Router
    Advertisement PIO, based on an in-progress IETF draft.
 
  - Add sysctl to control activation of TCP ping-pong mode.
 
  - Add sysctl to make connection timeout in MPTCP configurable.
 
  - Support rcvlowat and notsent_lowat on MPTCP sockets, to help apps
    limit the number of wakeups.
 
  - Support netlink GET for MDB (multicast forwarding), allowing user
    space to request a single MDB entry instead of dumping the entire
    table.
 
  - Support selective FDB flushing in the VXLAN tunnel driver.
 
  - Allow limiting learned FDB entries in bridges, prevent OOM attacks.
 
  - Allow controlling via configfs netconsole targets which were created
    via the kernel cmdline at boot, rather than via configfs at runtime.
 
  - Support multiple PTP timestamp event queue readers with different
    filters.
 
  - MCTP over I3C.
 
 BPF
 ---
 
  - Add new veth-like netdevice where BPF program defines the logic
    of the xmit routine. It can operate in L3 and L2 mode.
 
  - Support exceptions - allow asserting conditions which should
    never be true but are hard for the verifier to infer.
    With some extra flexibility around handling of the exit / failure.
    https://lwn.net/Articles/938435/
 
  - Add support for local per-cpu kptr, allow allocating and storing
    per-cpu objects in maps. Access to those objects operates on
    the value for the current CPU. This allows to deprecate local
    one-off implementations of per-CPU storage like
    BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE maps.
 
  - Extend cgroup BPF sockaddr hooks for UNIX sockets. The use case is
    for systemd to re-implement the LogNamespace feature which allows
    running multiple instances of systemd-journald to process the logs
    of different services.
 
  - Enable open-coded task_vma iteration, after maple tree conversion
    made it hard to directly walk VMAs in tracing programs.
 
  - Add open-coded task, css_task and css iterator support.
    One of the use cases is customizable OOM victim selection via BPF.
 
  - Allow source address selection with bpf_*_fib_lookup().
 
  - Add ability to pin BPF timer to the current CPU.
 
  - Prevent creation of infinite loops by combining tail calls and
    fentry/fexit programs.
 
  - Add missed stats for kprobes to retrieve the number of missed kprobe
    executions and subsequent executions of BPF programs.
 
  - Inherit system settings for CPU security mitigations.
 
  - Add BPF v4 CPU instruction support for arm32 and s390x.
 
 Changes to common code
 ----------------------
 
  - overflow: add DEFINE_FLEX() for on-stack definition of structs
    with flexible array members.
 
  - Process doc update with more guidance for reviewers.
 
 Driver API
 ----------
 
  - Simplify locking in WiFi (cfg80211 and mac80211 layers), use wiphy
    mutex in most places and remove a lot of smaller locks.
 
  - Create a common DPLL configuration API. Allow configuring
    and querying state of PLL circuits used for clock syntonization,
    in network time distribution.
 
  - Unify fragmented and full page allocation APIs in page pool code.
    Let drivers be ignorant of PAGE_SIZE.
 
  - Rework PHY state machine to avoid races with calls to phy_stop().
 
  - Notify DSA drivers of MAC address changes on user ports, improve
    correctness of offloads which depend on matching port MAC addresses.
 
  - Allow antenna control on injected WiFi frames.
 
  - Reduce the number of variants of napi_schedule().
 
  - Simplify error handling when composing devlink health messages.
 
 Misc
 ----
 
  - A lot of KCSAN data race "fixes", from Eric.
 
  - A lot of __counted_by() annotations, from Kees.
 
  - A lot of strncpy -> strscpy and printf format fixes.
 
  - Replace master/slave terminology with conduit/user in DSA drivers.
 
  - Handful of KUnit tests for netdev and WiFi core.
 
 Removed
 -------
 
  - AppleTalk COPS.
 
  - AppleTalk ipddp.
 
  - TI AR7 CPMAC Ethernet driver.
 
 Drivers
 -------
 
  - Ethernet high-speed NICs:
    - Intel (100G, ice, idpf):
      - add a driver for the Intel E2000 IPUs
      - make CRC/FCS stripping configurable
      - cross-timestamping for E823 devices
      - basic support for E830 devices
      - use aux-bus for managing client drivers
      - i40e: report firmware versions via devlink
    - nVidia/Mellanox:
      - support 4-port NICs
      - increase max number of channels to 256
      - optimize / parallelize SF creation flow
    - Broadcom (bnxt):
      - enhance NIC temperature reporting
      - support PAM4 speeds and lane configuration
    - Marvell OcteonTX2:
      - PTP pulse-per-second output support
      - enable hardware timestamping for VFs
    - Solarflare/AMD:
      - conntrack NAT offload and offload for tunnels
    - Wangxun (ngbe/txgbe):
      - expose HW statistics
    - Pensando/AMD:
      - support PCI level reset
      - narrow down the condition under which skbs are linearized
    - Netronome/Corigine (nfp):
      - support CHACHA20-POLY1305 crypto in IPsec offload
 
  - Ethernet NICs embedded, slower, virtual:
    - Synopsys (stmmac):
      - add Loongson-1 SoC support
      - enable use of HW queues with no offload capabilities
      - enable PPS input support on all 5 channels
      - increase TX coalesce timer to 5ms
    - RealTek USB (r8152): improve efficiency of Rx by using GRO frags
    - xen: support SW packet timestamping
    - add drivers for implementations based on TI's PRUSS (AM64x EVM)
 
  - nVidia/Mellanox Ethernet datacenter switches:
    - avoid poor HW resource use on Spectrum-4 by better block selection
      for IPv6 multicast forwarding and ordering of blocks in ACL region
 
  - Ethernet embedded switches:
    - Microchip:
      - support configuring the drive strength for EMI compliance
      - ksz9477: partial ACL support
      - ksz9477: HSR offload
      - ksz9477: Wake on LAN
    - Realtek:
      - rtl8366rb: respect device tree config of the CPU port
 
  - Ethernet PHYs:
    - support Broadcom BCM5221 PHYs
    - TI dp83867: support hardware LED blinking
 
  - CAN:
    - add support for Linux-PHY based CAN transceivers
    - at91_can: clean up and use rx-offload helpers
 
  - WiFi:
    - MediaTek (mt76):
      - new sub-driver for mt7925 USB/PCIe devices
      - HW wireless <> Ethernet bridging in MT7988 chips
      - mt7603/mt7628 stability improvements
    - Qualcomm (ath12k):
      - WCN7850:
        - enable 320 MHz channels in 6 GHz band
        - hardware rfkill support
        - enable IEEE80211_HW_SINGLE_SCAN_ON_ALL_BANDS
          to make scan faster
        - read board data variant name from SMBIOS
      - QCN9274: mesh support
    - RealTek (rtw89):
      - TDMA-based multi-channel concurrency (MCC)
    - Silicon Labs (wfx):
      - Remain-On-Channel (ROC) support
 
  - Bluetooth:
    - ISO: many improvements for broadcast support
    - mark BCM4378/BCM4387 as BROKEN_LE_CODED
    - add support for QCA2066
    - btmtksdio: enable Bluetooth wakeup from suspend
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmU8XsYACgkQMUZtbf5S
 Irv19RAAnud/24OOF5XMEJkIcYlnfqximh4XO6PujRSYkSkOUJdZTF6iJPgf3pSP
 YpwoHYbYKHYfeOf8+3bTNESiQNSnoVmvmvwiS6/7lZ3behHUrGLQzW9Htc3EZyWH
 2h6QkDZ5OOjfg0bwYSfp3vXkmMH2k8WE9Y0NvCkhcohqZi13Rmp14RnyPmNb2d1V
 yZRYDMSM133KqE6gnBr1Ct65IEvnKeGlCUN2mTGqOJgdn6DZMsyxvtt0y4rmN7Ab
 41+CgPU5SfxfbYpW+Dl2HJpgfte3WrC57KC6AM0PAPJzPmQWgeB/m9mjz/apj6Bg
 bhsEIo7FdvbCnQm3yWPhK2OgCAcSwLr8jfGMU+Q+W4VnL5SRRR3Rm0zjsze+kHNP
 OfqJgxzl3DpvoJqVBy1h5FGcZt0XHwhksm4cTxWqIahsF+veY0ECBXbuBBQx9XTF
 Y7INfI8ulg7wISJs+CJfIClYkgOibTw2u8taBS5ikbtgxNqp5D4QqODn7UefQap1
 PR/IDYODF+zRgmMJLeBqSa6fij6BkfOEDiOWak5kggBoZdtbtmeKI6tzze06CNdW
 lWv1WEhRufxnwK+IuWsEkjhiMbs2WGLvkJ5JbgQV9BfqHfIfiqBCrcWtT/WbQnGt
 lmU46CXh1t/FZEqbmK9h+8vsIIfrcDl6jb5npEiKPRG00vDKRTM=
 =46nS
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core & protocols:

   - Support usec resolution of TCP timestamps, enabled selectively by a
     route attribute.

   - Defer regular TCP ACK while processing socket backlog, try to send
     a cumulative ACK at the end. Increase single TCP flow performance
     on a 200Gbit NIC by 20% (100Gbit -> 120Gbit).

   - The Fair Queuing (FQ) packet scheduler:
       - add built-in 3 band prio / WRR scheduling
       - support bypass if the qdisc is mostly idle (5% speed up for TCP RR)
       - improve inactive flow reporting
       - optimize the layout of structures for better cache locality

   - Support TCP Authentication Option (RFC 5925, TCP-AO), a more modern
     replacement for the old MD5 option.

   - Add more retransmission timeout (RTO) related statistics to
     TCP_INFO.

   - Support sending fragmented skbs over vsock sockets.

   - Make sure we send SIGPIPE for vsock sockets if socket was
     shutdown().

   - Add sysctl for ignoring lower limit on lifetime in Router
     Advertisement PIO, based on an in-progress IETF draft.

   - Add sysctl to control activation of TCP ping-pong mode.

   - Add sysctl to make connection timeout in MPTCP configurable.

   - Support rcvlowat and notsent_lowat on MPTCP sockets, to help apps
     limit the number of wakeups.

   - Support netlink GET for MDB (multicast forwarding), allowing user
     space to request a single MDB entry instead of dumping the entire
     table.

   - Support selective FDB flushing in the VXLAN tunnel driver.

   - Allow limiting learned FDB entries in bridges, prevent OOM attacks.

   - Allow controlling via configfs netconsole targets which were
     created via the kernel cmdline at boot, rather than via configfs at
     runtime.

   - Support multiple PTP timestamp event queue readers with different
     filters.

   - MCTP over I3C.

  BPF:

   - Add new veth-like netdevice where BPF program defines the logic of
     the xmit routine. It can operate in L3 and L2 mode.

   - Support exceptions - allow asserting conditions which should never
     be true but are hard for the verifier to infer. With some extra
     flexibility around handling of the exit / failure:

          https://lwn.net/Articles/938435/

   - Add support for local per-cpu kptr, allow allocating and storing
     per-cpu objects in maps. Access to those objects operates on the
     value for the current CPU.

     This allows to deprecate local one-off implementations of per-CPU
     storage like BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE maps.

   - Extend cgroup BPF sockaddr hooks for UNIX sockets. The use case is
     for systemd to re-implement the LogNamespace feature which allows
     running multiple instances of systemd-journald to process the logs
     of different services.

   - Enable open-coded task_vma iteration, after maple tree conversion
     made it hard to directly walk VMAs in tracing programs.

   - Add open-coded task, css_task and css iterator support. One of the
     use cases is customizable OOM victim selection via BPF.

   - Allow source address selection with bpf_*_fib_lookup().

   - Add ability to pin BPF timer to the current CPU.

   - Prevent creation of infinite loops by combining tail calls and
     fentry/fexit programs.

   - Add missed stats for kprobes to retrieve the number of missed
     kprobe executions and subsequent executions of BPF programs.

   - Inherit system settings for CPU security mitigations.

   - Add BPF v4 CPU instruction support for arm32 and s390x.

  Changes to common code:

   - overflow: add DEFINE_FLEX() for on-stack definition of structs with
     flexible array members.

   - Process doc update with more guidance for reviewers.

  Driver API:

   - Simplify locking in WiFi (cfg80211 and mac80211 layers), use wiphy
     mutex in most places and remove a lot of smaller locks.

   - Create a common DPLL configuration API. Allow configuring and
     querying state of PLL circuits used for clock syntonization, in
     network time distribution.

   - Unify fragmented and full page allocation APIs in page pool code.
     Let drivers be ignorant of PAGE_SIZE.

   - Rework PHY state machine to avoid races with calls to phy_stop().

   - Notify DSA drivers of MAC address changes on user ports, improve
     correctness of offloads which depend on matching port MAC
     addresses.

   - Allow antenna control on injected WiFi frames.

   - Reduce the number of variants of napi_schedule().

   - Simplify error handling when composing devlink health messages.

  Misc:

   - A lot of KCSAN data race "fixes", from Eric.

   - A lot of __counted_by() annotations, from Kees.

   - A lot of strncpy -> strscpy and printf format fixes.

   - Replace master/slave terminology with conduit/user in DSA drivers.

   - Handful of KUnit tests for netdev and WiFi core.

  Removed:

   - AppleTalk COPS.

   - AppleTalk ipddp.

   - TI AR7 CPMAC Ethernet driver.

  Drivers:

   - Ethernet high-speed NICs:
      - Intel (100G, ice, idpf):
         - add a driver for the Intel E2000 IPUs
         - make CRC/FCS stripping configurable
         - cross-timestamping for E823 devices
         - basic support for E830 devices
         - use aux-bus for managing client drivers
         - i40e: report firmware versions via devlink
      - nVidia/Mellanox:
         - support 4-port NICs
         - increase max number of channels to 256
         - optimize / parallelize SF creation flow
      - Broadcom (bnxt):
         - enhance NIC temperature reporting
         - support PAM4 speeds and lane configuration
      - Marvell OcteonTX2:
         - PTP pulse-per-second output support
         - enable hardware timestamping for VFs
      - Solarflare/AMD:
         - conntrack NAT offload and offload for tunnels
      - Wangxun (ngbe/txgbe):
         - expose HW statistics
      - Pensando/AMD:
         - support PCI level reset
         - narrow down the condition under which skbs are linearized
      - Netronome/Corigine (nfp):
         - support CHACHA20-POLY1305 crypto in IPsec offload

   - Ethernet NICs embedded, slower, virtual:
      - Synopsys (stmmac):
         - add Loongson-1 SoC support
         - enable use of HW queues with no offload capabilities
         - enable PPS input support on all 5 channels
         - increase TX coalesce timer to 5ms
      - RealTek USB (r8152): improve efficiency of Rx by using GRO frags
      - xen: support SW packet timestamping
      - add drivers for implementations based on TI's PRUSS (AM64x EVM)

   - nVidia/Mellanox Ethernet datacenter switches:
      - avoid poor HW resource use on Spectrum-4 by better block
        selection for IPv6 multicast forwarding and ordering of blocks
        in ACL region

   - Ethernet embedded switches:
      - Microchip:
         - support configuring the drive strength for EMI compliance
         - ksz9477: partial ACL support
         - ksz9477: HSR offload
         - ksz9477: Wake on LAN
      - Realtek:
         - rtl8366rb: respect device tree config of the CPU port

   - Ethernet PHYs:
      - support Broadcom BCM5221 PHYs
      - TI dp83867: support hardware LED blinking

   - CAN:
      - add support for Linux-PHY based CAN transceivers
      - at91_can: clean up and use rx-offload helpers

   - WiFi:
      - MediaTek (mt76):
         - new sub-driver for mt7925 USB/PCIe devices
         - HW wireless <> Ethernet bridging in MT7988 chips
         - mt7603/mt7628 stability improvements
      - Qualcomm (ath12k):
         - WCN7850:
            - enable 320 MHz channels in 6 GHz band
            - hardware rfkill support
            - enable IEEE80211_HW_SINGLE_SCAN_ON_ALL_BANDS to
              make scan faster
            - read board data variant name from SMBIOS
        - QCN9274: mesh support
      - RealTek (rtw89):
         - TDMA-based multi-channel concurrency (MCC)
      - Silicon Labs (wfx):
         - Remain-On-Channel (ROC) support

   - Bluetooth:
      - ISO: many improvements for broadcast support
      - mark BCM4378/BCM4387 as BROKEN_LE_CODED
      - add support for QCA2066
      - btmtksdio: enable Bluetooth wakeup from suspend"

* tag 'net-next-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1816 commits)
  net: pcs: xpcs: Add 2500BASE-X case in get state for XPCS drivers
  net: bpf: Use sockopt_lock_sock() in ip_sock_set_tos()
  net: mana: Use xdp_set_features_flag instead of direct assignment
  vxlan: Cleanup IFLA_VXLAN_PORT_RANGE entry in vxlan_get_size()
  iavf: delete the iavf client interface
  iavf: add a common function for undoing the interrupt scheme
  iavf: use unregister_netdev
  iavf: rely on netdev's own registered state
  iavf: fix the waiting time for initial reset
  iavf: in iavf_down, don't queue watchdog_task if comms failed
  iavf: simplify mutex_trylock+sleep loops
  iavf: fix comments about old bit locks
  doc/netlink: Update schema to support cmd-cnt-name and cmd-max-name
  tools: ynl: introduce option to process unknown attributes or types
  ipvlan: properly track tx_errors
  netdevsim: Block until all devices are released
  nfp: using napi_build_skb() to replace build_skb()
  net: dsa: microchip: ksz9477: Fix spelling mistake "Enery" -> "Energy"
  net: dsa: microchip: Ensure Stable PME Pin State for Wake-on-LAN
  net: dsa: microchip: Refactor switch shutdown routine for WoL preparation
  ...
2023-10-31 05:10:11 -10:00
Linus Torvalds
14ab6d425e vfs-6.7.ctime
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTppYgAKCRCRxhvAZXjc
 okIHAP9anLz1QDyMLH12ASuHjgBc0Of3jcB6NB97IWGpL4O21gEA46ohaD+vcJuC
 YkBLU3lXqQ87nfu28ExFAzh10hG2jwM=
 =m4pB
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull vfs inode time accessor updates from Christian Brauner:
 "This finishes the conversion of all inode time fields to accessor
  functions as discussed on list. Changing timestamps manually as we
  used to do before is error prone. Using accessors function makes this
  robust.

  It does not contain the switch of the time fields to discrete 64 bit
  integers to replace struct timespec and free up space in struct inode.
  But after this, the switch can be trivially made and the patch should
  only affect the vfs if we decide to do it"

* tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (86 commits)
  fs: rename inode i_atime and i_mtime fields
  security: convert to new timestamp accessors
  selinux: convert to new timestamp accessors
  apparmor: convert to new timestamp accessors
  sunrpc: convert to new timestamp accessors
  mm: convert to new timestamp accessors
  bpf: convert to new timestamp accessors
  ipc: convert to new timestamp accessors
  linux: convert to new timestamp accessors
  zonefs: convert to new timestamp accessors
  xfs: convert to new timestamp accessors
  vboxsf: convert to new timestamp accessors
  ufs: convert to new timestamp accessors
  udf: convert to new timestamp accessors
  ubifs: convert to new timestamp accessors
  tracefs: convert to new timestamp accessors
  sysv: convert to new timestamp accessors
  squashfs: convert to new timestamp accessors
  server: convert to new timestamp accessors
  client: convert to new timestamp accessors
  ...
2023-10-30 09:47:13 -10:00
Linus Torvalds
3b3f874cc1 vfs-6.7.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTpoQAAKCRCRxhvAZXjc
 ovFNAQDgIRjXfZ1Ku+USxsRRdqp8geJVaNc3PuMmYhOYhUenqgEAmC1m+p0y31dS
 P6+HlL16Mqgu0tpLCcJK9BibpDZ0Ew4=
 =7yD1
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "This contains the usual miscellaneous features, cleanups, and fixes
  for vfs and individual fses.

  Features:

   - Rename and export helpers that get write access to a mount. They
     are used in overlayfs to get write access to the upper mount.

   - Print the pretty name of the root device on boot failure. This
     helps in scenarios where we would usually only print
     "unknown-block(1,2)".

   - Add an internal SB_I_NOUMASK flag. This is another part in the
     endless POSIX ACL saga in a way.

     When POSIX ACLs are enabled via SB_POSIXACL the vfs cannot strip
     the umask because if the relevant inode has POSIX ACLs set it might
     take the umask from there. But if the inode doesn't have any POSIX
     ACLs set then we apply the umask in the filesytem itself. So we end
     up with:

      (1) no SB_POSIXACL -> strip umask in vfs
      (2) SB_POSIXACL    -> strip umask in filesystem

     The umask semantics associated with SB_POSIXACL allowed filesystems
     that don't even support POSIX ACLs at all to raise SB_POSIXACL
     purely to avoid umask stripping. That specifically means NFS v4 and
     Overlayfs. NFS v4 does it because it delegates this to the server
     and Overlayfs because it needs to delegate umask stripping to the
     upper filesystem, i.e., the filesystem used as the writable layer.

     This went so far that SB_POSIXACL is raised eve on kernels that
     don't even have POSIX ACL support at all.

     Stop this blatant abuse and add SB_I_NOUMASK which is an internal
     superblock flag that filesystems can raise to opt out of umask
     handling. That should really only be the two mentioned above. It's
     not that we want any filesystems to do this. Ideally we have all
     umask handling always in the vfs.

   - Make overlayfs use SB_I_NOUMASK too.

   - Now that we have SB_I_NOUMASK, stop checking for SB_POSIXACL in
     IS_POSIXACL() if the kernel doesn't have support for it. This is a
     very old patch but it's only possible to do this now with the wider
     cleanup that was done.

   - Follow-up work on fake path handling from last cycle. Citing mostly
     from Amir:

     When overlayfs was first merged, overlayfs files of regular files
     and directories, the ones that are installed in file table, had a
     "fake" path, namely, f_path is the overlayfs path and f_inode is
     the "real" inode on the underlying filesystem.

     In v6.5, we took another small step by introducing of the
     backing_file container and the file_real_path() helper. This change
     allowed vfs and filesystem code to get the "real" path of an
     overlayfs backing file. With this change, we were able to make
     fsnotify work correctly and report events on the "real" filesystem
     objects that were accessed via overlayfs.

     This method works fine, but it still leaves the vfs vulnerable to
     new code that is not aware of files with fake path. A recent
     example is commit db1d1e8b98 ("IMA: use vfs_getattr_nosec to get
     the i_version"). This commit uses direct referencing to f_path in
     IMA code that otherwise uses file_inode() and file_dentry() to
     reference the filesystem objects that it is measuring.

     This contains work to switch things around: instead of having
     filesystem code opt-in to get the "real" path, have generic code
     opt-in for the "fake" path in the few places that it is needed.

     Is it far more likely that new filesystems code that does not use
     the file_dentry() and file_real_path() helpers will end up causing
     crashes or averting LSM/audit rules if we keep the "fake" path
     exposed by default.

     This change already makes file_dentry() moot, but for now we did
     not change this helper just added a WARN_ON() in ovl_d_real() to
     catch if we have made any wrong assumptions.

     After the dust settles on this change, we can make file_dentry() a
     plain accessor and we can drop the inode argument to ->d_real().

   - Switch struct file to SLAB_TYPESAFE_BY_RCU. This looks like a small
     change but it really isn't and I would like to see everyone on
     their tippie toes for any possible bugs from this work.

     Essentially we've been doing most of what SLAB_TYPESAFE_BY_RCU for
     files since a very long time because of the nasty interactions
     between the SCM_RIGHTS file descriptor garbage collection. So
     extending it makes a lot of sense but it is a subtle change. There
     are almost no places that fiddle with file rcu semantics directly
     and the ones that did mess around with struct file internal under
     rcu have been made to stop doing that because it really was always
     dodgy.

     I forgot to put in the link tag for this change and the discussion
     in the commit so adding it into the merge message:

       https://lore.kernel.org/r/20230926162228.68666-1-mjguzik@gmail.com

  Cleanups:

   - Various smaller pipe cleanups including the removal of a spin lock
     that was only used to protect against writes without pipe_lock()
     from O_NOTIFICATION_PIPE aka watch queues. As that was never
     implemented remove the additional locking from pipe_write().

   - Annotate struct watch_filter with the new __counted_by attribute.

   - Clarify do_unlinkat() cleanup so that it doesn't look like an extra
     iput() is done that would cause issues.

   - Simplify file cleanup when the file has never been opened.

   - Use module helper instead of open-coding it.

   - Predict error unlikely for stale retry.

   - Use WRITE_ONCE() for mount expiry field instead of just commenting
     that one hopes the compiler doesn't get smart.

  Fixes:

   - Fix readahead on block devices.

   - Fix writeback when layztime is enabled and inodes whose timestamp
     is the only thing that changed reside on wb->b_dirty_time. This
     caused excessively large zombie memory cgroup when lazytime was
     enabled as such inodes weren't handled fast enough.

   - Convert BUG_ON() to WARN_ON_ONCE() in open_last_lookups()"

* tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (26 commits)
  file, i915: fix file reference for mmap_singleton()
  vfs: Convert BUG_ON to WARN_ON_ONCE in open_last_lookups
  writeback, cgroup: switch inodes with dirty timestamps to release dying cgwbs
  chardev: Simplify usage of try_module_get()
  ovl: rely on SB_I_NOUMASK
  fs: fix umask on NFS with CONFIG_FS_POSIX_ACL=n
  fs: store real path instead of fake path in backing file f_path
  fs: create helper file_user_path() for user displayed mapped file path
  fs: get mnt_writers count for an open backing file's real path
  vfs: stop counting on gcc not messing with mnt_expiry_mark if not asked
  vfs: predict the error in retry_estale as unlikely
  backing file: free directly
  vfs: fix readahead(2) on block devices
  io_uring: use files_lookup_fd_locked()
  file: convert to SLAB_TYPESAFE_BY_RCU
  vfs: shave work on failed file open
  fs: simplify misleading code to remove ambiguity regarding ihold()/iput()
  watch_queue: Annotate struct watch_filter with __counted_by
  fs/pipe: use spinlock in pipe_read() only if there is a watch_queue
  fs/pipe: remove unnecessary spinlock from pipe_write()
  ...
2023-10-30 09:14:19 -10:00
Hou Tao
c421c12586 bpf: Add more WARN_ON_ONCE checks for mismatched alloc and free
There are two possible mismatched alloc and free cases in BPF memory
allocator:

1) allocate from cache X but free by cache Y with a different unit_size
2) allocate from per-cpu cache but free by kmalloc cache or vice versa

So add more WARN_ON_ONCE checks in free_bulk() and __free_by_rcu() to
spot these mismatched alloc and free early.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231021014959.3563841-1-houtao@huaweicloud.com
2023-10-26 14:23:33 +02:00
Daniel Borkmann
35dfaad718 netkit, bpf: Add bpf programmable net device
This work adds a new, minimal BPF-programmable device called "netkit"
(former PoC code-name "meta") we recently presented at LSF/MM/BPF. The
core idea is that BPF programs are executed within the drivers xmit routine
and therefore e.g. in case of containers/Pods moving BPF processing closer
to the source.

One of the goals was that in case of Pod egress traffic, this allows to
move BPF programs from hostns tcx ingress into the device itself, providing
earlier drop or forward mechanisms, for example, if the BPF program
determines that the skb must be sent out of the node, then a redirect to
the physical device can take place directly without going through per-CPU
backlog queue. This helps to shift processing for such traffic from softirq
to process context, leading to better scheduling decisions/performance (see
measurements in the slides).

In this initial version, the netkit device ships as a pair, but we plan to
extend this further so it can also operate in single device mode. The pair
comes with a primary and a peer device. Only the primary device, typically
residing in hostns, can manage BPF programs for itself and its peer. The
peer device is designated for containers/Pods and cannot attach/detach
BPF programs. Upon the device creation, the user can set the default policy
to 'pass' or 'drop' for the case when no BPF program is attached.

Additionally, the device can be operated in L3 (default) or L2 mode. The
management of BPF programs is done via bpf_mprog, so that multi-attach is
supported right from the beginning with similar API and dependency controls
as tcx. For details on the latter see commit 053c8e1f23 ("bpf: Add generic
attach/detach/query API for multi-progs"). tc BPF compatibility is provided,
so that existing programs can be easily migrated.

Going forward, we plan to use netkit devices in Cilium as the main device
type for connecting Pods. They will be operated in L3 mode in order to
simplify a Pod's neighbor management and the peer will operate in default
drop mode, so that no traffic is leaving between the time when a Pod is
brought up by the CNI plugin and programs attached by the agent.
Additionally, the programs we attach via tcx on the physical devices are
using bpf_redirect_peer() for inbound traffic into netkit device, hence the
latter is also supporting the ndo_get_peer_dev callback. Similarly, we use
bpf_redirect_neigh() for the way out, pushing from netkit peer to phys device
directly. Also, BIG TCP is supported on netkit device. For the follow-up
work in single device mode, we plan to convert Cilium's cilium_host/_net
devices into a single one.

An extensive test suite for checking device operations and the BPF program
and link management API comes as BPF selftests in this series.

Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://github.com/borkmann/iproute2/tree/pr/netkit
Link: http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf (24ff.)
Link: https://lore.kernel.org/r/20231024214904.29825-2-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-24 16:06:03 -07:00
Andrii Nakryiko
42d31dd601 bpf: Improve JEQ/JNE branch taken logic
When determining if an if/else branch will always or never be taken, use
signed range knowledge in addition to currently used unsigned range knowledge.
If either signed or unsigned range suggests that condition is always/never
taken, return corresponding branch_taken verdict.

Current use of unsigned range for this seems arbitrary and unnecessarily
incomplete. It is possible for *signed* operations to be performed on
register, which could "invalidate" unsigned range for that register. In such
case branch_taken will be artificially useless, even if we can still tell
that some constant is outside of register value range based on its signed
bounds.

veristat-based validation shows zero differences across selftests, Cilium,
and Meta-internal BPF object files.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/bpf/20231022205743.72352-2-andrii@kernel.org
2023-10-24 14:45:51 +02:00
Paul E. McKenney
06646da014 bpf: Fold smp_mb__before_atomic() into atomic_set_release()
The bpf_user_ringbuf_drain() BPF_CALL function uses an atomic_set()
immediately preceded by smp_mb__before_atomic() so as to order storing
of ring-buffer consumer and producer positions prior to the atomic_set()
call's clearing of the ->busy flag, as follows:

        smp_mb__before_atomic();
        atomic_set(&rb->busy, 0);

Although this works given current architectures and implementations, and
given that this only needs to order prior writes against a later write.
However, it does so by accident because the smp_mb__before_atomic()
is only guaranteed to work with read-modify-write atomic operations, and
not at all with things like atomic_set() and atomic_read().

Note especially that smp_mb__before_atomic() will not, repeat *not*,
order the prior write to "a" before the subsequent non-read-modify-write
atomic read from "b", even on strongly ordered systems such as x86:

        WRITE_ONCE(a, 1);
        smp_mb__before_atomic();
        r1 = atomic_read(&b);

Therefore, replace the smp_mb__before_atomic() and atomic_set() with
atomic_set_release() as follows:

        atomic_set_release(&rb->busy, 0);

This is no slower (and sometimes is faster) than the original, and also
provides a formal guarantee of ordering that the original lacks.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/ec86d38e-cfb4-44aa-8fdb-6c925922d93c@paulmck-laptop
2023-10-24 14:26:07 +02:00
Song Liu
d35381aa73 bpf: Fix unnecessary -EBUSY from htab_lock_bucket
htab_lock_bucket uses the following logic to avoid recursion:

1. preempt_disable();
2. check percpu counter htab->map_locked[hash] for recursion;
   2.1. if map_lock[hash] is already taken, return -BUSY;
3. raw_spin_lock_irqsave();

However, if an IRQ hits between 2 and 3, BPF programs attached to the IRQ
logic will not able to access the same hash of the hashtab and get -EBUSY.

This -EBUSY is not really necessary. Fix it by disabling IRQ before
checking map_locked:

1. preempt_disable();
2. local_irq_save();
3. check percpu counter htab->map_locked[hash] for recursion;
   3.1. if map_lock[hash] is already taken, return -BUSY;
4. raw_spin_lock().

Similarly, use raw_spin_unlock() and local_irq_restore() in
htab_unlock_bucket().

Fixes: 20b6cc34ea ("bpf: Avoid hashtab deadlock with map_locked")
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/7a9576222aa40b1c84ad3a9ba3e64011d1a04d41.camel@linux.ibm.com
Link: https://lore.kernel.org/bpf/20231012055741.3375999-1-song@kernel.org
2023-10-24 14:25:55 +02:00
Eduard Zingerman
b4d8239534 bpf: print full verifier states on infinite loop detection
Additional logging in is_state_visited(): if infinite loop is detected
print full verifier state for both current and equivalent states.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231024000917.12153-8-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-23 21:49:32 -07:00
Eduard Zingerman
2a0992829e bpf: correct loop detection for iterators convergence
It turns out that .branches > 0 in is_state_visited() is not a
sufficient condition to identify if two verifier states form a loop
when iterators convergence is computed. This commit adds logic to
distinguish situations like below:

 (I)            initial       (II)            initial
                  |                             |
                  V                             V
     .---------> hdr                           ..
     |            |                             |
     |            V                             V
     |    .------...                    .------..
     |    |       |                     |       |
     |    V       V                     V       V
     |   ...     ...               .-> hdr     ..
     |    |       |                |    |       |
     |    V       V                |    V       V
     |   succ <- cur               |   succ <- cur
     |    |                        |    |
     |    V                        |    V
     |   ...                       |   ...
     |    |                        |    |
     '----'                        '----'

For both (I) and (II) successor 'succ' of the current state 'cur' was
previously explored and has branches count at 0. However, loop entry
'hdr' corresponding to 'succ' might be a part of current DFS path.
If that is the case 'succ' and 'cur' are members of the same loop
and have to be compared exactly.

Co-developed-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Co-developed-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Reviewed-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231024000917.12153-6-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-23 21:49:32 -07:00
Eduard Zingerman
2793a8b015 bpf: exact states comparison for iterator convergence checks
Convergence for open coded iterators is computed in is_state_visited()
by examining states with branches count > 1 and using states_equal().
states_equal() computes sub-state relation using read and precision marks.
Read and precision marks are propagated from children states,
thus are not guaranteed to be complete inside a loop when branches
count > 1. This could be demonstrated using the following unsafe program:

     1. r7 = -16
     2. r6 = bpf_get_prandom_u32()
     3. while (bpf_iter_num_next(&fp[-8])) {
     4.   if (r6 != 42) {
     5.     r7 = -32
     6.     r6 = bpf_get_prandom_u32()
     7.     continue
     8.   }
     9.   r0 = r10
    10.   r0 += r7
    11.   r8 = *(u64 *)(r0 + 0)
    12.   r6 = bpf_get_prandom_u32()
    13. }

Here verifier would first visit path 1-3, create a checkpoint at 3
with r7=-16, continue to 4-7,3 with r7=-32.

Because instructions at 9-12 had not been visitied yet existing
checkpoint at 3 does not have read or precision mark for r7.
Thus states_equal() would return true and verifier would discard
current state, thus unsafe memory access at 11 would not be caught.

This commit fixes this loophole by introducing exact state comparisons
for iterator convergence logic:
- registers are compared using regs_exact() regardless of read or
  precision marks;
- stack slots have to have identical type.

Unfortunately, this is too strict even for simple programs like below:

    i = 0;
    while(iter_next(&it))
      i++;

At each iteration step i++ would produce a new distinct state and
eventually instruction processing limit would be reached.

To avoid such behavior speculatively forget (widen) range for
imprecise scalar registers, if those registers were not precise at the
end of the previous iteration and do not match exactly.

This a conservative heuristic that allows to verify wide range of
programs, however it precludes verification of programs that conjure
an imprecise value on the first loop iteration and use it as precise
on the second.

Test case iter_task_vma_for_each() presents one of such cases:

        unsigned int seen = 0;
        ...
        bpf_for_each(task_vma, vma, task, 0) {
                if (seen >= 1000)
                        break;
                ...
                seen++;
        }

Here clang generates the following code:

<LBB0_4>:
      24:       r8 = r6                          ; stash current value of
                ... body ...                       'seen'
      29:       r1 = r10
      30:       r1 += -0x8
      31:       call bpf_iter_task_vma_next
      32:       r6 += 0x1                        ; seen++;
      33:       if r0 == 0x0 goto +0x2 <LBB0_6>  ; exit on next() == NULL
      34:       r7 += 0x10
      35:       if r8 < 0x3e7 goto -0xc <LBB0_4> ; loop on seen < 1000

<LBB0_6>:
      ... exit ...

Note that counter in r6 is copied to r8 and then incremented,
conditional jump is done using r8. Because of this precision mark for
r6 lags one state behind of precision mark on r8 and widening logic
kicks in.

Adding barrier_var(seen) after conditional is sufficient to force
clang use the same register for both counting and conditional jump.

This issue was discussed in the thread [1] which was started by
Andrew Werner <awerner32@gmail.com> demonstrating a similar bug
in callback functions handling. The callbacks would be addressed
in a followup patch.

[1] https://lore.kernel.org/bpf/97a90da09404c65c8e810cf83c94ac703705dc0e.camel@gmail.com/

Co-developed-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Co-developed-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231024000917.12153-4-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-23 21:49:31 -07:00
Eduard Zingerman
4c97259abc bpf: extract same_callsites() as utility function
Extract same_callsites() from clean_live_states() as a utility function.
This function would be used by the next patch in the set.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231024000917.12153-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-23 21:49:31 -07:00
Eduard Zingerman
3c4e420cb6 bpf: move explored_state() closer to the beginning of verifier.c
Subsequent patches would make use of explored_state() function.
Move it up to avoid adding unnecessary prototype.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231024000917.12153-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-23 21:49:31 -07:00
Daniel Borkmann
b63dadd6f9 bpf, tcx: Get rid of tcx_link_const
Small clean up to get rid of the extra tcx_link_const() and only retain
the tcx_link().

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20231023185015.21152-1-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-23 15:01:53 -07:00
Hou Tao
e383a45902 bpf: Use bpf_global_percpu_ma for per-cpu kptr in __bpf_obj_drop_impl()
The following warning was reported when running "./test_progs -t
test_bpf_ma/percpu_free_through_map_free":

  ------------[ cut here ]------------
  WARNING: CPU: 1 PID: 68 at kernel/bpf/memalloc.c:342
  CPU: 1 PID: 68 Comm: kworker/u16:2 Not tainted 6.6.0-rc2+ #222
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
  Workqueue: events_unbound bpf_map_free_deferred
  RIP: 0010:bpf_mem_refill+0x21c/0x2a0
  ......
  Call Trace:
   <IRQ>
   ? bpf_mem_refill+0x21c/0x2a0
   irq_work_single+0x27/0x70
   irq_work_run_list+0x2a/0x40
   irq_work_run+0x18/0x40
   __sysvec_irq_work+0x1c/0xc0
   sysvec_irq_work+0x73/0x90
   </IRQ>
   <TASK>
   asm_sysvec_irq_work+0x1b/0x20
  RIP: 0010:unit_free+0x50/0x80
   ......
   bpf_mem_free+0x46/0x60
   __bpf_obj_drop_impl+0x40/0x90
   bpf_obj_free_fields+0x17d/0x1a0
   array_map_free+0x6b/0x170
   bpf_map_free_deferred+0x54/0xa0
   process_scheduled_works+0xba/0x370
   worker_thread+0x16d/0x2e0
   kthread+0x105/0x140
   ret_from_fork+0x39/0x60
   ret_from_fork_asm+0x1b/0x30
   </TASK>
  ---[ end trace 0000000000000000 ]---

The reason is simple: __bpf_obj_drop_impl() does not know the freeing
field is a per-cpu pointer and it uses bpf_global_ma to free the
pointer. Because bpf_global_ma is not a per-cpu allocator, so ksize() is
used to select the corresponding cache. The bpf_mem_cache with 16-bytes
unit_size will always be selected to do the unmatched free and it will
trigger the warning in free_bulk() eventually.

Because per-cpu kptr doesn't support list or rb-tree now, so fix the
problem by only checking whether or not the type of kptr is per-cpu in
bpf_obj_free_fields(), and using bpf_global_percpu_ma to these kptrs.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231020133202.4043247-7-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-20 14:15:13 -07:00
Hou Tao
e581a3461d bpf: Move the declaration of __bpf_obj_drop_impl() to bpf.h
both syscall.c and helpers.c have the declaration of
__bpf_obj_drop_impl(), so just move it to a common header file.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231020133202.4043247-6-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-20 14:15:13 -07:00
Hou Tao
3f2189e4f7 bpf: Use pcpu_alloc_size() in bpf_mem_free{_rcu}()
For bpf_global_percpu_ma, the pointer passed to bpf_mem_free_rcu() is
allocated by kmalloc() and its size is fixed (16-bytes on x86-64). So
no matter which cache allocates the dynamic per-cpu area, on x86-64
cache[2] will always be used to free the per-cpu area.

Fix the unbalance by checking whether the bpf memory allocator is
per-cpu or not and use pcpu_alloc_size() instead of ksize() to
find the correct cache for per-cpu free.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231020133202.4043247-5-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-20 14:15:13 -07:00
Hou Tao
baa8fdecd8 bpf: Re-enable unit_size checking for global per-cpu allocator
With pcpu_alloc_size() in place, check whether or not the size of
the dynamic per-cpu area is matched with unit_size.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231020133202.4043247-4-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-20 14:15:13 -07:00
Chuyi Zhou
cb3ecf7915 bpf: Let bpf_iter_task_new accept null task ptr
When using task_iter to iterate all threads of a specific task, we enforce
that the user must pass a valid task pointer to ensure safety. However,
when iterating all threads/process in the system, BPF verifier still
require a valid ptr instead of "nullable" pointer, even though it's
pointless, which is a kind of surprising from usability standpoint. It
would be nice if we could let that kfunc accept a explicit null pointer
when we are using BPF_TASK_ITER_ALL_{PROCS, THREADS} and a valid pointer
when using BPF_TASK_ITER_THREAD.

Given a trival kfunc:
	__bpf_kfunc void FN(struct TYPE_A *obj);

BPF Prog would reject a nullptr for obj. The error info is:
"arg#x pointer type xx xx must point to scalar, or struct with scalar"
reported by get_kfunc_ptr_arg_type(). The reg->type is SCALAR_VALUE and
the btf type of ref_t is not scalar or scalar_struct which leads to the
rejection of get_kfunc_ptr_arg_type.

This patch add "__nullable" annotation:
	__bpf_kfunc void FN(struct TYPE_A *obj__nullable);
Here __nullable indicates obj can be optional, user can pass a explicit
nullptr or a normal TYPE_A pointer. In get_kfunc_ptr_arg_type(), we will
detect whether the current arg is optional and register is null, If so,
return a new kfunc_ptr_arg_type KF_ARG_PTR_TO_NULL and skip to the next
arg in check_kfunc_args().

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231018061746.111364-7-zhouchuyi@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-19 17:02:46 -07:00
Chuyi Zhou
dfab99df14 bpf: teach the verifier to enforce css_iter and task_iter in RCU CS
css_iter and task_iter should be used in rcu section. Specifically, in
sleepable progs explicit bpf_rcu_read_lock() is needed before use these
iters. In normal bpf progs that have implicit rcu_read_lock(), it's OK to
use them directly.

This patch adds a new a KF flag KF_RCU_PROTECTED for bpf_iter_task_new and
bpf_iter_css_new. It means the kfunc should be used in RCU CS. We check
whether we are in rcu cs before we want to invoke this kfunc. If the rcu
protection is guaranteed, we would let st->type = PTR_TO_STACK | MEM_RCU.
Once user do rcu_unlock during the iteration, state MEM_RCU of regs would
be cleared. is_iter_reg_valid_init() will reject if reg->type is UNTRUSTED.

It is worth noting that currently, bpf_rcu_read_unlock does not
clear the state of the STACK_ITER reg, since bpf_for_each_spilled_reg
only considers STACK_SPILL. This patch also let bpf_for_each_spilled_reg
search STACK_ITER.

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231018061746.111364-6-zhouchuyi@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-19 17:02:46 -07:00
Chuyi Zhou
7251d0905e bpf: Introduce css open-coded iterator kfuncs
This Patch adds kfuncs bpf_iter_css_{new,next,destroy} which allow
creation and manipulation of struct bpf_iter_css in open-coded iterator
style. These kfuncs actually wrapps css_next_descendant_{pre, post}.
css_iter can be used to:

1) iterating a sepcific cgroup tree with pre/post/up order

2) iterating cgroup_subsystem in BPF Prog, like
for_each_mem_cgroup_tree/cpuset_for_each_descendant_pre in kernel.

The API design is consistent with cgroup_iter. bpf_iter_css_new accepts
parameters defining iteration order and starting css. Here we also reuse
BPF_CGROUP_ITER_DESCENDANTS_PRE, BPF_CGROUP_ITER_DESCENDANTS_POST,
BPF_CGROUP_ITER_ANCESTORS_UP enums.

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20231018061746.111364-5-zhouchuyi@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-19 17:02:46 -07:00
Chuyi Zhou
c68a78ffe2 bpf: Introduce task open coded iterator kfuncs
This patch adds kfuncs bpf_iter_task_{new,next,destroy} which allow
creation and manipulation of struct bpf_iter_task in open-coded iterator
style. BPF programs can use these kfuncs or through bpf_for_each macro to
iterate all processes in the system.

The API design keep consistent with SEC("iter/task"). bpf_iter_task_new()
accepts a specific task and iterating type which allows:

1. iterating all process in the system (BPF_TASK_ITER_ALL_PROCS)

2. iterating all threads in the system (BPF_TASK_ITER_ALL_THREADS)

3. iterating all threads of a specific task (BPF_TASK_ITER_PROC_THREADS)

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Link: https://lore.kernel.org/r/20231018061746.111364-4-zhouchuyi@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-19 17:02:46 -07:00
Chuyi Zhou
9c66dc94b6 bpf: Introduce css_task open-coded iterator kfuncs
This patch adds kfuncs bpf_iter_css_task_{new,next,destroy} which allow
creation and manipulation of struct bpf_iter_css_task in open-coded
iterator style. These kfuncs actually wrapps css_task_iter_{start,next,
end}. BPF programs can use these kfuncs through bpf_for_each macro for
iteration of all tasks under a css.

css_task_iter_*() would try to get the global spin-lock *css_set_lock*, so
the bpf side has to be careful in where it allows to use this iter.
Currently we only allow it in bpf_lsm and bpf iter-s.

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20231018061746.111364-3-zhouchuyi@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-19 17:02:46 -07:00
Breno Leitao
3f31e0d14d bpf: Add sockptr support for setsockopt
The whole network stack uses sockptr, and while it doesn't move to
something more modern, let's use sockptr in setsockptr BPF hooks, so, it
could be used by other callers.

The main motivation for this change is to use it in the io_uring
{g,s}etsockopt(), which will use a userspace pointer for *optval, but, a
kernel value for optlen.

Link: https://lore.kernel.org/all/ZSArfLaaGcfd8LH8@gmail.com/

Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20231016134750.1381153-3-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-19 14:05:37 -06:00
Breno Leitao
a615f67e1a bpf: Add sockptr support for getsockopt
The whole network stack uses sockptr, and while it doesn't move to
something more modern, let's use sockptr in getsockptr BPF hooks, so, it
could be used by other callers.

The main motivation for this change is to use it in the io_uring
{g,s}etsockopt(), which will use a userspace pointer for *optval, but, a
kernel value for optlen.

Link: https://lore.kernel.org/all/ZSArfLaaGcfd8LH8@gmail.com/

Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20231016134750.1381153-2-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-19 14:05:28 -06:00
Christian Brauner
0ede61d858
file: convert to SLAB_TYPESAFE_BY_RCU
In recent discussions around some performance improvements in the file
handling area we discussed switching the file cache to rely on
SLAB_TYPESAFE_BY_RCU which allows us to get rid of call_rcu() based
freeing for files completely. This is a pretty sensitive change overall
but it might actually be worth doing.

The main downside is the subtlety. The other one is that we should
really wait for Jann's patch to land that enables KASAN to handle
SLAB_TYPESAFE_BY_RCU UAFs. Currently it doesn't but a patch for this
exists.

With SLAB_TYPESAFE_BY_RCU objects may be freed and reused multiple times
which requires a few changes. So it isn't sufficient anymore to just
acquire a reference to the file in question under rcu using
atomic_long_inc_not_zero() since the file might have already been
recycled and someone else might have bumped the reference.

In other words, callers might see reference count bumps from newer
users. For this reason it is necessary to verify that the pointer is the
same before and after the reference count increment. This pattern can be
seen in get_file_rcu() and __files_get_rcu().

In addition, it isn't possible to access or check fields in struct file
without first aqcuiring a reference on it. Not doing that was always
very dodgy and it was only usable for non-pointer data in struct file.
With SLAB_TYPESAFE_BY_RCU it is necessary that callers first acquire a
reference under rcu or they must hold the files_lock of the fdtable.
Failing to do either one of this is a bug.

Thanks to Jann for pointing out that we need to ensure memory ordering
between reallocations and pointer check by ensuring that all subsequent
loads have a dependency on the second load in get_file_rcu() and
providing a fixup that was folded into this patch.

Cc: Jann Horn <jannh@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-19 11:02:48 +02:00
Jeff Layton
1bc628a756
bpf: convert to new timestamp accessors
Convert to using the new inode timestamp accessor functions.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20231004185347.80880-79-jlayton@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-18 14:08:30 +02:00
Yafang Shao
29a7e00ffa bpf: Fix missed rcu read lock in bpf_task_under_cgroup()
When employed within a sleepable program not under RCU protection, the
use of 'bpf_task_under_cgroup()' may trigger a warning in the kernel log,
particularly when CONFIG_PROVE_RCU is enabled:

  [ 1259.662357] WARNING: suspicious RCU usage
  [ 1259.662358] 6.5.0+ #33 Not tainted
  [ 1259.662360] -----------------------------
  [ 1259.662361] include/linux/cgroup.h:423 suspicious rcu_dereference_check() usage!

Other info that might help to debug this:

  [ 1259.662366] rcu_scheduler_active = 2, debug_locks = 1
  [ 1259.662368] 1 lock held by trace/72954:
  [ 1259.662369]  #0: ffffffffb5e3eda0 (rcu_read_lock_trace){....}-{0:0}, at: __bpf_prog_enter_sleepable+0x0/0xb0

Stack backtrace:

  [ 1259.662385] CPU: 50 PID: 72954 Comm: trace Kdump: loaded Not tainted 6.5.0+ #33
  [ 1259.662391] Call Trace:
  [ 1259.662393]  <TASK>
  [ 1259.662395]  dump_stack_lvl+0x6e/0x90
  [ 1259.662401]  dump_stack+0x10/0x20
  [ 1259.662404]  lockdep_rcu_suspicious+0x163/0x1b0
  [ 1259.662412]  task_css_set.part.0+0x23/0x30
  [ 1259.662417]  bpf_task_under_cgroup+0xe7/0xf0
  [ 1259.662422]  bpf_prog_7fffba481a3bcf88_lsm_run+0x5c/0x93
  [ 1259.662431]  bpf_trampoline_6442505574+0x60/0x1000
  [ 1259.662439]  bpf_lsm_bpf+0x5/0x20
  [ 1259.662443]  ? security_bpf+0x32/0x50
  [ 1259.662452]  __sys_bpf+0xe6/0xdd0
  [ 1259.662463]  __x64_sys_bpf+0x1a/0x30
  [ 1259.662467]  do_syscall_64+0x38/0x90
  [ 1259.662472]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
  [ 1259.662479] RIP: 0033:0x7f487baf8e29
  [...]
  [ 1259.662504]  </TASK>

This issue can be reproduced by executing a straightforward program, as
demonstrated below:

SEC("lsm.s/bpf")
int BPF_PROG(lsm_run, int cmd, union bpf_attr *attr, unsigned int size)
{
        struct cgroup *cgrp = NULL;
        struct task_struct *task;
        int ret = 0;

        if (cmd != BPF_LINK_CREATE)
                return 0;

        // The cgroup2 should be mounted first
        cgrp = bpf_cgroup_from_id(1);
        if (!cgrp)
                goto out;
        task = bpf_get_current_task_btf();
        if (bpf_task_under_cgroup(task, cgrp))
                ret = -1;
        bpf_cgroup_release(cgrp);

out:
        return ret;
}

After running the program, if you subsequently execute another BPF program,
you will encounter the warning.

It's worth noting that task_under_cgroup_hierarchy() is also utilized by
bpf_current_task_under_cgroup(). However, bpf_current_task_under_cgroup()
doesn't exhibit this issue because it cannot be used in sleepable BPF
programs.

Fixes: b5ad4cdc46 ("bpf: Add bpf_task_under_cgroup() kfunc")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Cc: Feng Zhou <zhoufeng.zf@bytedance.com>
Cc: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/bpf/20231007135945.4306-1-laoar.shao@gmail.com
2023-10-17 18:27:41 +02:00
Sebastian Andrzej Siewior
9a675ba55a net, bpf: Add a warning if NAPI cb missed xdp_do_flush().
A few drivers were missing a xdp_do_flush() invocation after
XDP_REDIRECT.

Add three helper functions each for one of the per-CPU lists. Return
true if the per-CPU list is non-empty and flush the list.

Add xdp_do_check_flushed() which invokes each helper functions and
creates a warning if one of the functions had a non-empty list.

Hide everything behind CONFIG_DEBUG_NET.

Suggested-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20231016125738.Yt79p1uF@linutronix.de
2023-10-17 15:02:03 +02:00
Jakub Kicinski
a3c2dd9648 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZS1d4wAKCRDbK58LschI
 g4DSAP441CdKh8fd+wNKUSKHFbpCQ6EvocR6Nf+Sj2DFUx/w/QEA7mfju7Abqjc3
 xwDEx0BuhrjMrjV5MmEpxc7lYl9XcQU=
 =vuWk
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-10-16

We've added 90 non-merge commits during the last 25 day(s) which contain
a total of 120 files changed, 3519 insertions(+), 895 deletions(-).

The main changes are:

1) Add missed stats for kprobes to retrieve the number of missed kprobe
   executions and subsequent executions of BPF programs, from Jiri Olsa.

2) Add cgroup BPF sockaddr hooks for unix sockets. The use case is
   for systemd to reimplement the LogNamespace feature which allows
   running multiple instances of systemd-journald to process the logs
   of different services, from Daan De Meyer.

3) Implement BPF CPUv4 support for s390x BPF JIT, from Ilya Leoshkevich.

4) Improve BPF verifier log output for scalar registers to better
   disambiguate their internal state wrt defaults vs min/max values
   matching, from Andrii Nakryiko.

5) Extend the BPF fib lookup helpers for IPv4/IPv6 to support retrieving
   the source IP address with a new BPF_FIB_LOOKUP_SRC flag,
   from Martynas Pumputis.

6) Add support for open-coded task_vma iterator to help with symbolization
   for BPF-collected user stacks, from Dave Marchevsky.

7) Add libbpf getters for accessing individual BPF ring buffers which
   is useful for polling them individually, for example, from Martin Kelly.

8) Extend AF_XDP selftests to validate the SHARED_UMEM feature,
   from Tushar Vyavahare.

9) Improve BPF selftests cross-building support for riscv arch,
   from Björn Töpel.

10) Add the ability to pin a BPF timer to the same calling CPU,
   from David Vernet.

11) Fix libbpf's bpf_tracing.h macros for riscv to use the generic
   implementation of PT_REGS_SYSCALL_REGS() to access syscall arguments,
   from Alexandre Ghiti.

12) Extend libbpf to support symbol versioning for uprobes, from Hengqi Chen.

13) Fix bpftool's skeleton code generation to guarantee that ELF data
    is 8 byte aligned, from Ian Rogers.

14) Inherit system-wide cpu_mitigations_off() setting for Spectre v1/v4
    security mitigations in BPF verifier, from Yafang Shao.

15) Annotate struct bpf_stack_map with __counted_by attribute to prepare
    BPF side for upcoming __counted_by compiler support, from Kees Cook.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (90 commits)
  bpf: Ensure proper register state printing for cond jumps
  bpf: Disambiguate SCALAR register state output in verifier logs
  selftests/bpf: Make align selftests more robust
  selftests/bpf: Improve missed_kprobe_recursion test robustness
  selftests/bpf: Improve percpu_alloc test robustness
  selftests/bpf: Add tests for open-coded task_vma iter
  bpf: Introduce task_vma open-coded iterator kfuncs
  selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c
  bpf: Don't explicitly emit BTF for struct btf_iter_num
  bpf: Change syscall_nr type to int in struct syscall_tp_t
  net/bpf: Avoid unused "sin_addr_len" warning when CONFIG_CGROUP_BPF is not set
  bpf: Avoid unnecessary audit log for CPU security mitigations
  selftests/bpf: Add tests for cgroup unix socket address hooks
  selftests/bpf: Make sure mount directory exists
  documentation/bpf: Document cgroup unix socket address hooks
  bpftool: Add support for cgroup unix socket address hooks
  libbpf: Add support for cgroup unix socket address hooks
  bpf: Implement cgroup sockaddr hooks for unix sockets
  bpf: Add bpf_sock_addr_set_sun_path() to allow writing unix sockaddr from bpf
  bpf: Propagate modified uaddrlen from cgroup sockaddr programs
  ...
====================

Link: https://lore.kernel.org/r/20231016204803.30153-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-16 21:05:33 -07:00
Andrii Nakryiko
1a8a315f00 bpf: Ensure proper register state printing for cond jumps
Verifier emits relevant register state involved in any given instruction
next to it after `;` to the right, if possible. Or, worst case, on the
separate line repeating instruction index.

E.g., a nice and simple case would be:

  2: (d5) if r0 s<= 0x0 goto pc+1       ; R0_w=0

But if there is some intervening extra output (e.g., precision
backtracking log) involved, we are supposed to see the state after the
precision backtrack log:

  4: (75) if r0 s>= 0x0 goto pc+1
  mark_precise: frame0: last_idx 4 first_idx 0 subseq_idx -1
  mark_precise: frame0: regs=r0 stack= before 2: (d5) if r0 s<= 0x0 goto pc+1
  mark_precise: frame0: regs=r0 stack= before 1: (b7) r0 = 0
  6: R0_w=0

First off, note that in `6: R0_w=0` instruction index corresponds to the
next instruction, not to the conditional jump instruction itself, which
is wrong and we'll get to that.

But besides that, the above is a happy case that does work today. Yet,
if it so happens that precision backtracking had to traverse some of the
parent states, this `6: R0_w=0` state output would be missing.

This is due to a quirk of print_verifier_state() routine, which performs
mark_verifier_state_clean(env) at the end. This marks all registers as
"non-scratched", which means that subsequent logic to print *relevant*
registers (that is, "scratched ones") fails and doesn't see anything
relevant to print and skips the output altogether.

print_verifier_state() is used both to print instruction context, but
also to print an **entire** verifier state indiscriminately, e.g.,
during precision backtracking (and in a few other situations, like
during entering or exiting subprogram).  Which means if we have to print
entire parent state before getting to printing instruction context
state, instruction context is marked as clean and is omitted.

Long story short, this is definitely not intentional. So we fix this
behavior in this patch by teaching print_verifier_state() to clear
scratch state only if it was used to print instruction state, not the
parent/callback state. This is determined by print_all option, so if
it's not set, we don't clear scratch state. This fixes missing
instruction state for these cases.

As for the mismatched instruction index, we fix that by making sure we
call print_insn_state() early inside check_cond_jmp_op() before we
adjusted insn_idx based on jump branch taken logic. And with that we get
desired correct information:

  9: (16) if w4 == 0x1 goto pc+9
  mark_precise: frame0: last_idx 9 first_idx 9 subseq_idx -1
  mark_precise: frame0: parent state regs=r4 stack=: R2_w=1944 R4_rw=P1 R10=fp0
  mark_precise: frame0: last_idx 8 first_idx 0 subseq_idx 9
  mark_precise: frame0: regs=r4 stack= before 8: (66) if w4 s> 0x3 goto pc+5
  mark_precise: frame0: regs=r4 stack= before 7: (b7) r4 = 1
  9: R4=1

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20231011223728.3188086-6-andrii@kernel.org
2023-10-16 13:49:18 +02:00
Andrii Nakryiko
72f8a1de4a bpf: Disambiguate SCALAR register state output in verifier logs
Currently the way that verifier prints SCALAR_VALUE register state (and
PTR_TO_PACKET, which can have var_off and ranges info as well) is very
ambiguous.

In the name of brevity we are trying to eliminate "unnecessary" output
of umin/umax, smin/smax, u32_min/u32_max, and s32_min/s32_max values, if
possible. Current rules are that if any of those have their default
value (which for mins is the minimal value of its respective types: 0,
S32_MIN, or S64_MIN, while for maxs it's U32_MAX, S32_MAX, S64_MAX, or
U64_MAX) *OR* if there is another min/max value that as matching value.
E.g., if smin=100 and umin=100, we'll emit only umin=10, omitting smin
altogether. This approach has a few problems, being both ambiguous and
sort-of incorrect in some cases.

Ambiguity is due to missing value could be either default value or value
of umin/umax or smin/smax. This is especially confusing when we mix
signed and unsigned ranges. Quite often, umin=0 and smin=0, and so we'll
have only `umin=0` leaving anyone reading verifier log to guess whether
smin is actually 0 or it's actually -9223372036854775808 (S64_MIN). And
often times it's important to know, especially when debugging tricky
issues.

"Sort-of incorrectness" comes from mixing negative and positive values.
E.g., if umin is some large positive number, it can be equal to smin
which is, interpreted as signed value, is actually some negative value.
Currently, that smin will be omitted and only umin will be emitted with
a large positive value, giving an impression that smin is also positive.

Anyway, ambiguity is the biggest issue making it impossible to have an
exact understanding of register state, preventing any sort of automated
testing of verifier state based on verifier log. This patch is
attempting to rectify the situation by removing ambiguity, while
minimizing the verboseness of register state output.

The rules are straightforward:
  - if some of the values are missing, then it definitely has a default
  value. I.e., `umin=0` means that umin is zero, but smin is actually
  S64_MIN;
  - all the various boundaries that happen to have the same value are
  emitted in one equality separated sequence. E.g., if umin and smin are
  both 100, we'll emit `smin=umin=100`, making this explicit;
  - we do not mix negative and positive values together, and even if
  they happen to have the same bit-level value, they will be emitted
  separately with proper sign. I.e., if both umax and smax happen to be
  0xffffffffffffffff, we'll emit them both separately as
  `smax=-1,umax=18446744073709551615`;
  - in the name of a bit more uniformity and consistency,
  {u32,s32}_{min,max} are renamed to {s,u}{min,max}32, which seems to
  improve readability.

The above means that in case of all 4 ranges being, say, [50, 100] range,
we'd previously see hugely ambiguous:

    R1=scalar(umin=50,umax=100)

Now, we'll be more explicit:

    R1=scalar(smin=umin=smin32=umin32=50,smax=umax=smax32=umax32=100)

This is slightly more verbose, but distinct from the case when we don't
know anything about signed boundaries and 32-bit boundaries, which under
new rules will match the old case:

    R1=scalar(umin=50,umax=100)

Also, in the name of simplicity of implementation and consistency, order
for {s,u}32_{min,max} are emitted *before* var_off. Previously they were
emitted afterwards, for unclear reasons.

This patch also includes a few fixes to selftests that expect exact
register state to accommodate slight changes to verifier format. You can
see that the changes are pretty minimal in common cases.

Note, the special case when SCALAR_VALUE register is a known constant
isn't changed, we'll emit constant value once, interpreted as signed
value.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20231011223728.3188086-5-andrii@kernel.org
2023-10-16 13:49:18 +02:00
Dave Marchevsky
4ac4546821 bpf: Introduce task_vma open-coded iterator kfuncs
This patch adds kfuncs bpf_iter_task_vma_{new,next,destroy} which allow
creation and manipulation of struct bpf_iter_task_vma in open-coded
iterator style. BPF programs can use these kfuncs directly or through
bpf_for_each macro for natural-looking iteration of all task vmas.

The implementation borrows heavily from bpf_find_vma helper's locking -
differing only in that it holds the mmap_read lock for all iterations
while the helper only executes its provided callback on a maximum of 1
vma. Aside from locking, struct vma_iterator and vma_next do all the
heavy lifting.

A pointer to an inner data struct, struct bpf_iter_task_vma_data, is the
only field in struct bpf_iter_task_vma. This is because the inner data
struct contains a struct vma_iterator (not ptr), whose size is likely to
change under us. If bpf_iter_task_vma_kern contained vma_iterator directly
such a change would require change in opaque bpf_iter_task_vma struct's
size. So better to allocate vma_iterator using BPF allocator, and since
that alloc must already succeed, might as well allocate all iter fields,
thereby freezing struct bpf_iter_task_vma size.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231013204426.1074286-4-davemarchevsky@fb.com
2023-10-13 15:48:58 -07:00
Dave Marchevsky
f10ca5da5b bpf: Don't explicitly emit BTF for struct btf_iter_num
Commit 6018e1f407 ("bpf: implement numbers iterator") added the
BTF_TYPE_EMIT line that this patch is modifying. The struct btf_iter_num
doesn't exist, so only a forward declaration is emitted in BTF:

  FWD 'btf_iter_num' fwd_kind=struct

That commit was probably hoping to ensure that struct bpf_iter_num is
emitted in vmlinux BTF. A previous version of this patch changed the
line to emit the correct type, but Yonghong confirmed that it would
definitely be emitted regardless in [0], so this patch simply removes
the line.

This isn't marked "Fixes" because the extraneous btf_iter_num FWD wasn't
causing any issues that I noticed, aside from mild confusion when I
looked through the code.

  [0]: https://lore.kernel.org/bpf/25d08207-43e6-36a8-5e0f-47a913d4cda5@linux.dev/

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231013204426.1074286-2-davemarchevsky@fb.com
2023-10-13 15:48:58 -07:00
Jakub Kicinski
0e6bb5b7f4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts.

Adjacent changes:

kernel/bpf/verifier.c
  829955981c ("bpf: Fix verifier log for async callback return values")
  a923819fb2 ("bpf: Treat first argument as return value for bpf_throw")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-12 17:07:34 -07:00
Daan De Meyer
859051dd16 bpf: Implement cgroup sockaddr hooks for unix sockets
These hooks allows intercepting connect(), getsockname(),
getpeername(), sendmsg() and recvmsg() for unix sockets. The unix
socket hooks get write access to the address length because the
address length is not fixed when dealing with unix sockets and
needs to be modified when a unix socket address is modified by
the hook. Because abstract socket unix addresses start with a
NUL byte, we cannot recalculate the socket address in kernelspace
after running the hook by calculating the length of the unix socket
path using strlen().

These hooks can be used when users want to multiplex syscall to a
single unix socket to multiple different processes behind the scenes
by redirecting the connect() and other syscalls to process specific
sockets.

We do not implement support for intercepting bind() because when
using bind() with unix sockets with a pathname address, this creates
an inode in the filesystem which must be cleaned up. If we rewrite
the address, the user might try to clean up the wrong file, leaking
the socket in the filesystem where it is never cleaned up. Until we
figure out a solution for this (and a use case for intercepting bind()),
we opt to not allow rewriting the sockaddr in bind() calls.

We also implement recvmsg() support for connected streams so that
after a connect() that is modified by a sockaddr hook, any corresponding
recmvsg() on the connected socket can also be modified to make the
connected program think it is connected to the "intended" remote.

Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com>
Link: https://lore.kernel.org/r/20231011185113.140426-5-daan.j.demeyer@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-11 17:27:47 -07:00
Daan De Meyer
53e380d214 bpf: Add bpf_sock_addr_set_sun_path() to allow writing unix sockaddr from bpf
As prep for adding unix socket support to the cgroup sockaddr hooks,
let's add a kfunc bpf_sock_addr_set_sun_path() that allows modifying a unix
sockaddr from bpf. While this is already possible for AF_INET and AF_INET6,
we'll need this kfunc when we add unix socket support since modifying the
address for those requires modifying both the address and the sockaddr
length.

Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com>
Link: https://lore.kernel.org/r/20231011185113.140426-4-daan.j.demeyer@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-11 16:29:25 -07:00
Daan De Meyer
fefba7d1ae bpf: Propagate modified uaddrlen from cgroup sockaddr programs
As prep for adding unix socket support to the cgroup sockaddr hooks,
let's propagate the sockaddr length back to the caller after running
a bpf cgroup sockaddr hook program. While not important for AF_INET or
AF_INET6, the sockaddr length is important when working with AF_UNIX
sockaddrs as the size of the sockaddr cannot be determined just from the
address family or the sockaddr's contents.

__cgroup_bpf_run_filter_sock_addr() is modified to take the uaddrlen as
an input/output argument. After running the program, the modified sockaddr
length is stored in the uaddrlen pointer.

Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com>
Link: https://lore.kernel.org/r/20231011185113.140426-3-daan.j.demeyer@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-11 15:03:40 -07:00
David Vernet
829955981c bpf: Fix verifier log for async callback return values
The verifier, as part of check_return_code(), verifies that async
callbacks such as from e.g. timers, will return 0. It does this by
correctly checking that R0->var_off is in tnum_const(0), which
effectively checks that it's in a range of 0. If this condition fails,
however, it prints an error message which says that the value should
have been in (0x0; 0x1). This results in possibly confusing output such
as the following in which an async callback returns 1:

  At async callback the register R0 has value (0x1; 0x0) should have been in (0x0; 0x1)

The fix is easy -- we should just pass the tnum_const(0) as the correct
range to verbose_invalid_scalar(), which will then print the following:

  At async callback the register R0 has value (0x1; 0x0) should have been in (0x0; 0x0)

Fixes: bfc6bb74e4 ("bpf: Implement verifier support for validation of async callbacks.")
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231009161414.235829-1-void@manifault.com
2023-10-09 23:10:58 +02:00
David Vernet
d6247ecb6c bpf: Add ability to pin bpf timer to calling CPU
BPF supports creating high resolution timers using bpf_timer_* helper
functions. Currently, only the BPF_F_TIMER_ABS flag is supported, which
specifies that the timeout should be interpreted as absolute time. It
would also be useful to be able to pin that timer to a core. For
example, if you wanted to make a subset of cores run without timer
interrupts, and only have the timer be invoked on a single core.

This patch adds support for this with a new BPF_F_TIMER_CPU_PIN flag.
When specified, the HRTIMER_MODE_PINNED flag is passed to
hrtimer_start(). A subsequent patch will update selftests to validate.

Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20231004162339.200702-2-void@manifault.com
2023-10-09 16:28:49 +02:00
Lorenz Bauer
ba62d61128 bpf: Refuse unused attributes in bpf_prog_{attach,detach}
The recently added tcx attachment extended the BPF UAPI for attaching and
detaching by a couple of fields. Those fields are currently only supported
for tcx, other types like cgroups and flow dissector silently ignore the
new fields except for the new flags.

This is problematic once we extend bpf_mprog to older attachment types, since
it's hard to figure out whether the syscall really was successful if the
kernel silently ignores non-zero values.

Explicitly reject non-zero fields relevant to bpf_mprog for attachment types
which don't use the latter yet.

Fixes: e420bed025 ("bpf: Add fd-based tcx multi-prog infra with link support")
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20231006220655.1653-3-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-06 17:11:21 -07:00
Daniel Borkmann
edfa9af0a7 bpf: Handle bpf_mprog_query with NULL entry
Improve consistency for bpf_mprog_query() API and let the latter also handle
a NULL entry as can be the case for tcx. Instead of returning -ENOENT, we
copy a count of 0 and revision of 1 to user space, so that this can be fed
into a subsequent bpf_mprog_attach() call as expected_revision. A BPF self-
test as part of this series has been added to assert this case.

Suggested-by: Lorenz Bauer <lmb@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20231006220655.1653-2-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-06 17:11:20 -07:00
Daniel Borkmann
a4fe78386a bpf: Fix BPF_PROG_QUERY last field check
While working on the ebpf-go [0] library integration for bpf_mprog and tcx,
Lorenz noticed that two subsequent BPF_PROG_QUERY requests currently fail. A
typical workflow is to first gather the bpf_mprog count without passing program/
link arrays, followed by the second request which contains the actual array
pointers.

The initial call populates count and revision fields. The second call gets
rejected due to a BPF_PROG_QUERY_LAST_FIELD bug which should point to
query.revision instead of query.link_attach_flags since the former is really
the last member.

It was not noticed in libbpf as bpf_prog_query_opts() always calls bpf(2) with
an on-stack bpf_attr that is memset() each time (and therefore query.revision
was reset to zero).

  [0] https://ebpf-go.dev

Fixes: e420bed025 ("bpf: Add fd-based tcx multi-prog infra with link support")
Reported-by: Lorenz Bauer <lmb@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20231006220655.1653-1-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-06 17:11:20 -07:00
Kees Cook
84cb9cbd91 bpf: Annotate struct bpf_stack_map with __counted_by
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for
array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).

As found with Coccinelle [1], add __counted_by for struct bpf_stack_map.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci [1]
Link: https://lore.kernel.org/bpf/20231006201657.work.531-kees@kernel.org
2023-10-06 23:44:35 +02:00
Jakub Kicinski
2606cf059c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts (or adjacent changes of note).

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-05 13:16:47 -07:00
Hou Tao
9077fc228f bpf: Use kmalloc_size_roundup() to adjust size_index
Commit d52b59315b ("bpf: Adjust size_index according to the value of
KMALLOC_MIN_SIZE") uses KMALLOC_MIN_SIZE to adjust size_index, but as
reported by Nathan, the adjustment is not enough, because
__kmalloc_minalign() also decides the minimal alignment of slab object
as shown in new_kmalloc_cache() and its value may be greater than
KMALLOC_MIN_SIZE (e.g., 64 bytes vs 8 bytes under a riscv QEMU VM).

Instead of invoking __kmalloc_minalign() in bpf subsystem to find the
maximal alignment, just using kmalloc_size_roundup() directly to get the
corresponding slab object size for each allocation size. If these two
sizes are unmatched, adjust size_index to select a bpf_mem_cache with
unit_size equal to the object_size of the underlying slab cache for the
allocation size.

Fixes: 822fb26bdb ("bpf: Add a hint to allocated objects.")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Closes: https://lore.kernel.org/bpf/20230914181407.GA1000274@dev-arch.thelio-3990X/
Signed-off-by: Hou Tao <houtao1@huawei.com>
Tested-by: Emil Renner Berthing <emil.renner.berthing@canonical.com>
Link: https://lore.kernel.org/r/20230928101558.2594068-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-30 09:39:28 -07:00
Daniel Borkmann
f9b0e1088b bpf, mprog: Fix maximum program check on mprog attachment
After Paul's recent improvement to syzkaller to improve coverage for
bpf_mprog and tcx, it hit a splat that the program limit was surpassed.
What happened is that the maximum number of progs got added, followed
by another prog add request which adds with BPF_F_BEFORE flag relative
to the last program in the array. The idx >= bpf_mprog_max() check in
bpf_mprog_attach() still passes because the index is below the maximum
but the maximum will be surpassed. We need to add a check upfront for
insertions to catch this situation.

Fixes: 053c8e1f23 ("bpf: Add generic attach/detach/query API for multi-progs")
Reported-by: syzbot+baa44e3dbbe48e05c1ad@syzkaller.appspotmail.com
Reported-by: syzbot+b97d20ed568ce0951a06@syzkaller.appspotmail.com
Reported-by: syzbot+2558ca3567a77b7af4e3@syzkaller.appspotmail.com
Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: syzbot+baa44e3dbbe48e05c1ad@syzkaller.appspotmail.com
Tested-by: syzbot+b97d20ed568ce0951a06@syzkaller.appspotmail.com
Link: https://github.com/google/syzkaller/pull/4207
Link: https://lore.kernel.org/bpf/20230929204121.20305-1-daniel@iogearbox.net
2023-09-29 15:49:57 -07:00
Jiri Olsa
3acf8ace68 bpf: Add missed value to kprobe perf link info
Add missed value to kprobe attached through perf link info to
hold the stats of missed kprobe handler execution.

The kprobe's missed counter gets incremented when kprobe handler
is not executed due to another kprobe running on the same cpu.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230920213145.1941596-4-jolsa@kernel.org
2023-09-25 16:37:44 -07:00
Ilya Leoshkevich
577c06af81 bpf: Disable zero-extension for BPF_MEMSX
On the architectures that use bpf_jit_needs_zext(), e.g., s390x, the
verifier incorrectly inserts a zero-extension after BPF_MEMSX, leading
to miscompilations like the one below:

      24:       89 1a ff fe 00 00 00 00 "r1 = *(s16 *)(r10 - 2);"       # zext_dst set
   0x3ff7fdb910e:       lgh     %r2,-2(%r13,%r0)                        # load halfword
   0x3ff7fdb9114:       llgfr   %r2,%r2                                 # wrong!
      25:       65 10 00 03 00 00 7f ff if r1 s> 32767 goto +3 <l0_1>   # check_cond_jmp_op()

Disable such zero-extensions. The JITs need to insert sign-extension
themselves, if necessary.

Suggested-by: Puranjay Mohan <puranjay12@gmail.com>
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Puranjay Mohan <puranjay12@gmail.com>
Link: https://lore.kernel.org/r/20230919101336.2223655-2-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-21 14:21:59 -07:00
Paolo Abeni
e9cbc89067 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-09-21 21:49:45 +02:00
Andrii Nakryiko
81335f90e8 bpf: unconditionally reset backtrack_state masks on global func exit
In mark_chain_precision() logic, when we reach the entry to a global
func, it is expected that R1-R5 might be still requested to be marked
precise. This would correspond to some integer input arguments being
tracked as precise. This is all expected and handled as a special case.

What's not expected is that we'll leave backtrack_state structure with
some register bits set. This is because for subsequent precision
propagations backtrack_state is reused without clearing masks, as all
code paths are carefully written in a way to leave empty backtrack_state
with zeroed out masks, for speed.

The fix is trivial, we always clear register bit in the register mask, and
then, optionally, set reg->precise if register is SCALAR_VALUE type.

Reported-by: Chris Mason <clm@meta.com>
Fixes: be2ef81615 ("bpf: allow precision tracking for programs with subprogs")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230918210110.2241458-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-20 03:26:25 -07:00
Alexei Starovoitov
aec42f3623 bpf: Remove unused variables.
Remove unused prev_offset, min_size, krec_size variables.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202309190634.fL17FWoT-lkp@intel.com/
Fixes: aaa619ebcc ("bpf: Refactor check_btf_func and split into two phases")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-19 02:26:47 -07:00
Kumar Kartikeya Dwivedi
7d3460632d bpf: Fix bpf_throw warning on 32-bit arch
On 32-bit architectures, the pointer width is 32-bit, while we try to
cast from a u64 down to it, the compiler complains on mismatch in
integer size. Fix this by first casting to long which should match
the pointer width on targets supported by Linux.

Fixes: ec5290a178 ("bpf: Prevent KASAN false positive with bpf_throw")
Reported-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Link: https://lore.kernel.org/r/20230918155233.297024-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-19 02:07:36 -07:00
David S. Miller
685c6d5b2c Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
The following pull-request contains BPF updates for your *net-next* tree.

We've added 73 non-merge commits during the last 9 day(s) which contain
a total of 79 files changed, 5275 insertions(+), 600 deletions(-).

The main changes are:

1) Basic BTF validation in libbpf, from Andrii Nakryiko.

2) bpf_assert(), bpf_throw(), exceptions in bpf progs, from Kumar Kartikeya Dwivedi.

3) next_thread cleanups, from Oleg Nesterov.

4) Add mcpu=v4 support to arm32, from Puranjay Mohan.

5) Add support for __percpu pointers in bpf progs, from Yonghong Song.

6) Fix bpf tailcall interaction with bpf trampoline, from Leon Hwang.

7) Raise irq_work in bpf_mem_alloc while irqs are disabled to improve refill probabablity, from Hou Tao.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git

Thanks a lot!

Also thanks to reporters, reviewers and testers of commits in this pull-request:

Alan Maguire, Andrey Konovalov, Dave Marchevsky, "Eric W. Biederman",
Jiri Olsa, Maciej Fijalkowski, Quentin Monnet, Russell King (Oracle),
Song Liu, Stanislav Fomichev, Yonghong Song
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-09-17 15:12:06 +01:00
Kumar Kartikeya Dwivedi
06d686f771 bpf: Fix kfunc callback register type handling
The kfunc code to handle KF_ARG_PTR_TO_CALLBACK does not check the reg
type before using reg->subprogno. This can accidently permit invalid
pointers from being passed into callback helpers (e.g. silently from
different paths). Likewise, reg->subprogno from the per-register type
union may not be meaningful either. We need to reject any other type
except PTR_TO_FUNC.

Acked-by: Dave Marchevsky <davemarchevsky@fb.com>
Fixes: 5d92ddc3de ("bpf: Add callback validation to kfunc verifier logic")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230912233214.1518551-14-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16 09:36:43 -07:00
Kumar Kartikeya Dwivedi
fd548e1a46 bpf: Disallow fentry/fexit/freplace for exception callbacks
During testing, it was discovered that extensions to exception callbacks
had no checks, upon running a testcase, the kernel ended up running off
the end of a program having final call as bpf_throw, and hitting int3
instructions.

The reason is that while the default exception callback would have reset
the stack frame to return back to the main program's caller, the
replacing extension program will simply return back to bpf_throw, which
will instead return back to the program and the program will continue
execution, now in an undefined state where anything could happen.

The way to support extensions to an exception callback would be to mark
the BPF_PROG_TYPE_EXT main subprog as an exception_cb, and prevent it
from calling bpf_throw. This would make the JIT produce a prologue that
restores saved registers and reset the stack frame. But let's not do
that until there is a concrete use case for this, and simply disallow
this for now.

Similar issues will exist for fentry and fexit cases, where trampoline
saves data on the stack when invoking exception callback, which however
will then end up resetting the stack frame, and on return, the fexit
program will never will invoked as the return address points to the main
program's caller in the kernel. Instead of additional complexity and
back and forth between the two stacks to enable such a use case, simply
forbid it.

One key point here to note is that currently X86_TAIL_CALL_OFFSET didn't
require any modifications, even though we emit instructions before the
corresponding endbr64 instruction. This is because we ensure that a main
subprog never serves as an exception callback, and therefore the
exception callback (which will be a global subprog) can never serve as
the tail call target, eliminating any discrepancies. However, once we
support a BPF_PROG_TYPE_EXT to also act as an exception callback, it
will end up requiring change to the tail call offset to account for the
extra instructions. For simplicitly, tail calls could be disabled for
such targets.

Noting the above, it appears better to wait for a concrete use case
before choosing to permit extension programs to replace exception
callbacks.

As a precaution, we disable fentry and fexit for exception callbacks as
well.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230912233214.1518551-13-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16 09:36:32 -07:00
Kumar Kartikeya Dwivedi
66d9111f35 bpf: Detect IP == ksym.end as part of BPF program
Now that bpf_throw kfunc is the first such call instruction that has
noreturn semantics within the verifier, this also kicks in dead code
elimination in unprecedented ways. For one, any instruction following
a bpf_throw call will never be marked as seen. Moreover, if a callchain
ends up throwing, any instructions after the call instruction to the
eventually throwing subprog in callers will also never be marked as
seen.

The tempting way to fix this would be to emit extra 'int3' instructions
which bump the jited_len of a program, and ensure that during runtime
when a program throws, we can discover its boundaries even if the call
instruction to bpf_throw (or to subprogs that always throw) is emitted
as the final instruction in the program.

An example of such a program would be this:

do_something():
	...
	r0 = 0
	exit

foo():
	r1 = 0
	call bpf_throw
	r0 = 0
	exit

bar(cond):
	if r1 != 0 goto pc+2
	call do_something
	exit
	call foo
	r0 = 0  // Never seen by verifier
	exit	//

main(ctx):
	r1 = ...
	call bar
	r0 = 0
	exit

Here, if we do end up throwing, the stacktrace would be the following:

bpf_throw
foo
bar
main

In bar, the final instruction emitted will be the call to foo, as such,
the return address will be the subsequent instruction (which the JIT
emits as int3 on x86). This will end up lying outside the jited_len of
the program, thus, when unwinding, we will fail to discover the return
address as belonging to any program and end up in a panic due to the
unreliable stack unwinding of BPF programs that we never expect.

To remedy this case, make bpf_prog_ksym_find treat IP == ksym.end as
part of the BPF program, so that is_bpf_text_address returns true when
such a case occurs, and we are able to unwind reliably when the final
instruction ends up being a call instruction.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230912233214.1518551-12-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16 09:34:22 -07:00
Kumar Kartikeya Dwivedi
ec5290a178 bpf: Prevent KASAN false positive with bpf_throw
The KASAN stack instrumentation when CONFIG_KASAN_STACK is true poisons
the stack of a function when it is entered and unpoisons it when
leaving. However, in the case of bpf_throw, we will never return as we
switch our stack frame to the BPF exception callback. Later, this
discrepancy will lead to confusing KASAN splats when kernel resumes
execution on return from the BPF program.

Fix this by unpoisoning everything below the stack pointer of the BPF
program, which should cover the range that would not be unpoisoned. An
example splat is below:

BUG: KASAN: stack-out-of-bounds in stack_trace_consume_entry+0x14e/0x170
Write of size 8 at addr ffffc900013af958 by task test_progs/227

CPU: 0 PID: 227 Comm: test_progs Not tainted 6.5.0-rc2-g43f1c6c9052a-dirty #26
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-2.fc39 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x4a/0x80
 print_report+0xcf/0x670
 ? arch_stack_walk+0x79/0x100
 kasan_report+0xda/0x110
 ? stack_trace_consume_entry+0x14e/0x170
 ? stack_trace_consume_entry+0x14e/0x170
 ? __pfx_stack_trace_consume_entry+0x10/0x10
 stack_trace_consume_entry+0x14e/0x170
 ? __sys_bpf+0xf2e/0x41b0
 arch_stack_walk+0x8b/0x100
 ? __sys_bpf+0xf2e/0x41b0
 ? bpf_prog_test_run_skb+0x341/0x1c70
 ? bpf_prog_test_run_skb+0x341/0x1c70
 stack_trace_save+0x9b/0xd0
 ? __pfx_stack_trace_save+0x10/0x10
 ? __kasan_slab_free+0x109/0x180
 ? bpf_prog_test_run_skb+0x341/0x1c70
 ? __sys_bpf+0xf2e/0x41b0
 ? __x64_sys_bpf+0x78/0xc0
 ? do_syscall_64+0x3c/0x90
 ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8
 kasan_save_stack+0x33/0x60
 ? kasan_save_stack+0x33/0x60
 ? kasan_set_track+0x25/0x30
 ? kasan_save_free_info+0x2b/0x50
 ? __kasan_slab_free+0x109/0x180
 ? kmem_cache_free+0x191/0x460
 ? bpf_prog_test_run_skb+0x341/0x1c70
 kasan_set_track+0x25/0x30
 kasan_save_free_info+0x2b/0x50
 __kasan_slab_free+0x109/0x180
 kmem_cache_free+0x191/0x460
 bpf_prog_test_run_skb+0x341/0x1c70
 ? __pfx_bpf_prog_test_run_skb+0x10/0x10
 ? __fget_light+0x51/0x220
 __sys_bpf+0xf2e/0x41b0
 ? __might_fault+0xa2/0x170
 ? __pfx___sys_bpf+0x10/0x10
 ? lock_release+0x1de/0x620
 ? __might_fault+0xcd/0x170
 ? __pfx_lock_release+0x10/0x10
 ? __pfx_blkcg_maybe_throttle_current+0x10/0x10
 __x64_sys_bpf+0x78/0xc0
 ? syscall_enter_from_user_mode+0x20/0x50
 do_syscall_64+0x3c/0x90
 entry_SYSCALL_64_after_hwframe+0x6e/0xd8
RIP: 0033:0x7f0fbb38880d
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d
89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f3 45 12 00 f7 d8 64
89 01 48
RSP: 002b:00007ffe13907de8 EFLAGS: 00000206 ORIG_RAX: 0000000000000141
RAX: ffffffffffffffda RBX: 00007ffe13908708 RCX: 00007f0fbb38880d
RDX: 0000000000000050 RSI: 00007ffe13907e20 RDI: 000000000000000a
RBP: 00007ffe13907e00 R08: 0000000000000000 R09: 00007ffe13907e20
R10: 0000000000000064 R11: 0000000000000206 R12: 0000000000000003
R13: 0000000000000000 R14: 00007f0fbb532000 R15: 0000000000cfbd90
 </TASK>

The buggy address belongs to stack of task test_progs/227
KASAN internal error: frame info validation failed; invalid marker: 0

The buggy address belongs to the virtual mapping at
 [ffffc900013a8000, ffffc900013b1000) created by:
 kernel_clone+0xcd/0x600

The buggy address belongs to the physical page:
page:00000000b70f4332 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11418f
flags: 0x2fffe0000000000(node=0|zone=2|lastcpupid=0x7fff)
page_type: 0xffffffff()
raw: 02fffe0000000000 0000000000000000 dead000000000122 0000000000000000
raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffffc900013af800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffffc900013af880: 00 00 00 f1 f1 f1 f1 00 00 00 f3 f3 f3 f3 f3 00
>ffffc900013af900: 00 00 00 00 00 00 00 00 00 00 00 f1 00 00 00 00
                                                    ^
 ffffc900013af980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffffc900013afa00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================
Disabling lock debugging due to kernel taint

Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Andrey Konovalov <andreyknvl@gmail.com>
Link: https://lore.kernel.org/r/20230912233214.1518551-11-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16 09:34:22 -07:00
Kumar Kartikeya Dwivedi
a923819fb2 bpf: Treat first argument as return value for bpf_throw
In case of the default exception callback, change the behavior of
bpf_throw, where the passed cookie value is no longer ignored, but
is instead the return value of the default exception callback. As
such, we need to place restrictions on the value being passed into
bpf_throw in such a case, only allowing those permitted by the
check_return_code function.

Thus, bpf_throw can now control the return value of the program from
each call site without having the user install a custom exception
callback just to override the return value when an exception is thrown.

We also modify the hidden subprog instructions to now move BPF_REG_1 to
BPF_REG_0, so as to set the return value before exit in the default
callback.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230912233214.1518551-9-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16 09:34:21 -07:00
Kumar Kartikeya Dwivedi
b62bf8a5e9 bpf: Perform CFG walk for exception callback
Since exception callbacks are not referenced using bpf_pseudo_func and
bpf_pseudo_call instructions, check_cfg traversal will never explore
instructions of the exception callback. Even after adding the subprog,
the program will then fail with a 'unreachable insn' error.

We thus need to begin walking from the start of the exception callback
again in check_cfg after a complete CFG traversal finishes, so as to
explore the CFG rooted at the exception callback.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230912233214.1518551-8-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16 09:34:21 -07:00
Kumar Kartikeya Dwivedi
b9ae0c9dd0 bpf: Add support for custom exception callbacks
By default, the subprog generated by the verifier to handle a thrown
exception hardcodes a return value of 0. To allow user-defined logic
and modification of the return value when an exception is thrown,
introduce the 'exception_callback:' declaration tag, which marks a
callback as the default exception handler for the program.

The format of the declaration tag is 'exception_callback:<value>', where
<value> is the name of the exception callback. Each main program can be
tagged using this BTF declaratiion tag to associate it with an exception
callback. In case the tag is absent, the default callback is used.

As such, the exception callback cannot be modified at runtime, only set
during verification.

Allowing modification of the callback for the current program execution
at runtime leads to issues when the programs begin to nest, as any
per-CPU state maintaing this information will have to be saved and
restored. We don't want it to stay in bpf_prog_aux as this takes a
global effect for all programs. An alternative solution is spilling
the callback pointer at a known location on the program stack on entry,
and then passing this location to bpf_throw as a parameter.

However, since exceptions are geared more towards a use case where they
are ideally never invoked, optimizing for this use case and adding to
the complexity has diminishing returns.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230912233214.1518551-7-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16 09:34:21 -07:00
Kumar Kartikeya Dwivedi
aaa619ebcc bpf: Refactor check_btf_func and split into two phases
This patch splits the check_btf_info's check_btf_func check into two
separate phases.  The first phase sets up the BTF and prepares
func_info, but does not perform any validation of required invariants
for subprogs just yet. This is left to the second phase, which happens
where check_btf_info executes currently, and performs the line_info and
CO-RE relocation.

The reason to perform this split is to obtain the userspace supplied
func_info information before we perform the add_subprog call, where we
would now require finding and adding subprogs that may not have a
bpf_pseudo_call or bpf_pseudo_func instruction in the program.

We require this as we want to enable userspace to supply exception
callbacks that can override the default hidden subprogram generated by
the verifier (which performs a hardcoded action). In such a case, the
exception callback may never be referenced in an instruction, but will
still be suitably annotated (by way of BTF declaration tags). For
finding this exception callback, we would require the program's BTF
information, and the supplied func_info information which maps BTF type
IDs to subprograms.

Since the exception callback won't actually be referenced through
instructions, later checks in check_cfg and do_check_subprogs will not
verify the subprog. This means that add_subprog needs to add them in the
add_subprog_and_kfunc phase before we move forward, which is why the BTF
and func_info are required at that point.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230912233214.1518551-6-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16 09:34:21 -07:00
Kumar Kartikeya Dwivedi
f18b03faba bpf: Implement BPF exceptions
This patch implements BPF exceptions, and introduces a bpf_throw kfunc
to allow programs to throw exceptions during their execution at runtime.
A bpf_throw invocation is treated as an immediate termination of the
program, returning back to its caller within the kernel, unwinding all
stack frames.

This allows the program to simplify its implementation, by testing for
runtime conditions which the verifier has no visibility into, and assert
that they are true. In case they are not, the program can simply throw
an exception from the other branch.

BPF exceptions are explicitly *NOT* an unlikely slowpath error handling
primitive, and this objective has guided design choices of the
implementation of the them within the kernel (with the bulk of the cost
for unwinding the stack offloaded to the bpf_throw kfunc).

The implementation of this mechanism requires use of add_hidden_subprog
mechanism introduced in the previous patch, which generates a couple of
instructions to move R1 to R0 and exit. The JIT then rewrites the
prologue of this subprog to take the stack pointer and frame pointer as
inputs and reset the stack frame, popping all callee-saved registers
saved by the main subprog. The bpf_throw function then walks the stack
at runtime, and invokes this exception subprog with the stack and frame
pointers as parameters.

Reviewers must take note that currently the main program is made to save
all callee-saved registers on x86_64 during entry into the program. This
is because we must do an equivalent of a lightweight context switch when
unwinding the stack, therefore we need the callee-saved registers of the
caller of the BPF program to be able to return with a sane state.

Note that we have to additionally handle r12, even though it is not used
by the program, because when throwing the exception the program makes an
entry into the kernel which could clobber r12 after saving it on the
stack. To be able to preserve the value we received on program entry, we
push r12 and restore it from the generated subprogram when unwinding the
stack.

For now, bpf_throw invocation fails when lingering resources or locks
exist in that path of the program. In a future followup, bpf_throw will
be extended to perform frame-by-frame unwinding to release lingering
resources for each stack frame, removing this limitation.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230912233214.1518551-5-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16 09:34:21 -07:00
Kumar Kartikeya Dwivedi
335d1c5b54 bpf: Implement support for adding hidden subprogs
Introduce support in the verifier for generating a subprogram and
include it as part of a BPF program dynamically after the do_check phase
is complete. The first user will be the next patch which generates
default exception callbacks if none are set for the program. The phase
of invocation will be do_misc_fixups. Note that this is an internal
verifier function, and should be used with instruction blocks which
uphold the invariants stated in check_subprogs.

Since these subprogs are always appended to the end of the instruction
sequence of the program, it becomes relatively inexpensive to do the
related adjustments to the subprog_info of the program. Only the fake
exit subprogram is shifted forward, making room for our new subprog.

This is useful to insert a new subprogram, get it JITed, and obtain its
function pointer. The next patch will use this functionality to insert a
default exception callback which will be invoked after unwinding the
stack.

Note that these added subprograms are invisible to userspace, and never
reported in BPF_OBJ_GET_INFO_BY_ID etc. For now, only a single
subprogram is supported, but more can be easily supported in the future.

To this end, two function counts are introduced now, the existing
func_cnt, and real_func_cnt, the latter including hidden programs. This
allows us to conver the JIT code to use the real_func_cnt for management
of resources while syscall path continues working with existing
func_cnt.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230912233214.1518551-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16 09:34:21 -07:00
Kumar Kartikeya Dwivedi
fd5d27b701 arch/x86: Implement arch_bpf_stack_walk
The plumbing for offline unwinding when we throw an exception in
programs would require walking the stack, hence introduce a new
arch_bpf_stack_walk function. This is provided when the JIT supports
exceptions, i.e. bpf_jit_supports_exceptions is true. The arch-specific
code is really minimal, hence it should be straightforward to extend
this support to other architectures as well, as it reuses the logic of
arch_stack_walk, but allowing access to unwind_state data.

Once the stack pointer and frame pointer are known for the main subprog
during the unwinding, we know the stack layout and location of any
callee-saved registers which must be restored before we return back to
the kernel. This handling will be added in the subsequent patches.

Note that while we primarily unwind through BPF frames, which are
effectively CONFIG_UNWINDER_FRAME_POINTER, we still need one of this or
CONFIG_UNWINDER_ORC to be able to unwind through the bpf_throw frame
from which we begin walking the stack. We also require both sp and bp
(stack and frame pointers) from the unwind_state structure, which are
only available when one of these two options are enabled.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230912233214.1518551-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16 09:34:21 -07:00
David S. Miller
1612cc4b14 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================

The following pull-request contains BPF updates for your *net* tree.

We've added 21 non-merge commits during the last 8 day(s) which contain
a total of 21 files changed, 450 insertions(+), 36 deletions(-).

The main changes are:

1) Adjust bpf_mem_alloc buckets to match ksize(), from Hou Tao.

2) Check whether override is allowed in kprobe mult, from Jiri Olsa.

3) Fix btf_id symbol generation with ld.lld, from Jiri and Nick.

4) Fix potential deadlock when using queue and stack maps from NMI, from Toke Høiland-Jørgensen.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!

Also thanks to reporters, reviewers and testers of commits in this pull-request:

Alan Maguire, Biju Das, Björn Töpel, Dan Carpenter, Daniel Borkmann,
Eduard Zingerman, Hsin-Wei Hung, Marcus Seyfarth, Nathan Chancellor,
Satya Durga Srinivasu Prabhala, Song Liu, Stephen Rothwell
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-09-16 11:16:00 +01:00
Larysa Zaremba
9b2b86332a bpf: Allow to use kfunc XDP hints and frags together
There is no fundamental reason, why multi-buffer XDP and XDP kfunc RX hints
cannot coexist in a single program.

Allow those features to be used together by modifying the flags condition
for dev-bound-only programs, segments are still prohibited for fully
offloaded programs, hence additional check.

Suggested-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/CAKH8qBuzgtJj=OKMdsxEkyML36VsAuZpcrsXcyqjdKXSJCBq=Q@mail.gmail.com/
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230915083914.65538-1-larysa.zaremba@intel.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-09-15 11:39:46 -07:00
Stanislav Fomichev
a9c2a60854 bpf: expose information about supported xdp metadata kfunc
Add new xdp-rx-metadata-features member to netdev netlink
which exports a bitmask of supported kfuncs. Most of the patch
is autogenerated (headers), the only relevant part is netdev.yaml
and the changes in netdev-genl.c to marshal into netlink.

Example output on veth:

$ ip link add veth0 type veth peer name veth1 # ifndex == 12
$ ./tools/net/ynl/samples/netdev 12

Select ifc ($ifindex; or 0 = dump; or -2 ntf check): 12
   veth1[12]    xdp-features (23): basic redirect rx-sg xdp-rx-metadata-features (3): timestamp hash xdp-zc-max-segs=0

Cc: netdev@vger.kernel.org
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230913171350.369987-3-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-09-15 11:26:58 -07:00
Stanislav Fomichev
fc45c5b642 bpf: make it easier to add new metadata kfunc
No functional changes.

Instead of having hand-crafted code in bpf_dev_bound_resolve_kfunc,
move kfunc <> xmo handler relationship into XDP_METADATA_KFUNC_xxx.
This way, any time new kfunc is added, we don't have to touch
bpf_dev_bound_resolve_kfunc.

Also document XDP_METADATA_KFUNC_xxx arguments since we now have
more than two and it might be confusing what is what.

Cc: netdev@vger.kernel.org
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230913171350.369987-2-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-09-15 11:26:58 -07:00
Hou Tao
dca7acd84e bpf: Skip unit_size checking for global per-cpu allocator
For global per-cpu allocator, the size of free object in free list
doesn't match with unit_size and now there is no way to get the size of
per-cpu pointer saved in free object, so just skip the checking.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/bpf/20230913133436.0eeec4cb@canb.auug.org.au/
Signed-off-by: Hou Tao <houtao1@huawei.com>
Tested-by: Biju Das <biju.das.jz@bp.renesas.com>
Link: https://lore.kernel.org/r/20230913135943.3137292-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-15 10:22:24 -07:00
Song Liu
5c04433daf bpf: Charge modmem for struct_ops trampoline
Current code charges modmem for regular trampoline, but not for struct_ops
trampoline. Add bpf_jit_[charge|uncharge]_modmem() to struct_ops so the
trampoline is charged in both cases.

Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230914222542.2986059-1-song@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-09-14 15:30:45 -07:00
Randy Dunlap
214bfd267f bpf, cgroup: fix multiple kernel-doc warnings
Fix missing or extra function parameter kernel-doc warnings
in cgroup.c:

kernel/bpf/cgroup.c:1359: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_skb'
kernel/bpf/cgroup.c:1359: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_skb'
kernel/bpf/cgroup.c:1439: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_sk'
kernel/bpf/cgroup.c:1439: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_sk'
kernel/bpf/cgroup.c:1467: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_sock_addr'
kernel/bpf/cgroup.c:1467: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_sock_addr'
kernel/bpf/cgroup.c:1512: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_sock_ops'
kernel/bpf/cgroup.c:1512: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_sock_ops'
kernel/bpf/cgroup.c:1685: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_sysctl'
kernel/bpf/cgroup.c:1685: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_sysctl'
kernel/bpf/cgroup.c:795: warning: Excess function parameter 'type' description in '__cgroup_bpf_replace'
kernel/bpf/cgroup.c:795: warning: Function parameter or member 'new_prog' not described in '__cgroup_bpf_replace'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: bpf@vger.kernel.org
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230912060812.1715-1-rdunlap@infradead.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-12 13:19:07 -07:00
Christophe JAILLET
a8f1257286 bpf: Fix a erroneous check after snprintf()
snprintf() does not return negative error code on error, it returns the
number of characters which *would* be generated for the given input.

Fix the error handling check.

Fixes: 57539b1c0a ("bpf: Enable annotating trusted nested pointers")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/393bdebc87b22563c08ace094defa7160eb7a6c0.1694190795.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-12 13:15:46 -07:00
Leon Hwang
2b5dcb31a1 bpf, x64: Fix tailcall infinite loop
From commit ebf7d1f508 ("bpf, x64: rework pro/epilogue and tailcall
handling in JIT"), the tailcall on x64 works better than before.

From commit e411901c0b ("bpf: allow for tailcalls in BPF subprograms
for x64 JIT"), tailcall is able to run in BPF subprograms on x64.

From commit 5b92a28aae ("bpf: Support attaching tracing BPF program
to other BPF programs"), BPF program is able to trace other BPF programs.

How about combining them all together?

1. FENTRY/FEXIT on a BPF subprogram.
2. A tailcall runs in the BPF subprogram.
3. The tailcall calls the subprogram's caller.

As a result, a tailcall infinite loop comes up. And the loop would halt
the machine.

As we know, in tail call context, the tail_call_cnt propagates by stack
and rax register between BPF subprograms. So do in trampolines.

Fixes: ebf7d1f508 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT")
Fixes: e411901c0b ("bpf: allow for tailcalls in BPF subprograms for x64 JIT")
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
Link: https://lore.kernel.org/r/20230912150442.2009-3-hffilwlqm@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-12 13:06:12 -07:00
Eduard Zingerman
1a49f4195d bpf: Avoid dummy bpf_offload_netdev in __bpf_prog_dev_bound_init
Fix for a bug observable under the following sequence of events:
1. Create a network device that does not support XDP offload.
2. Load a device bound XDP program with BPF_F_XDP_DEV_BOUND_ONLY flag
   (such programs are not offloaded).
3. Load a device bound XDP program with zero flags
   (such programs are offloaded).

At step (2) __bpf_prog_dev_bound_init() associates with device (1)
a dummy bpf_offload_netdev struct with .offdev field set to NULL.
At step (3) __bpf_prog_dev_bound_init() would reuse dummy struct
allocated at step (2).
However, downstream usage of the bpf_offload_netdev assumes that
.offdev field can't be NULL, e.g. in bpf_prog_offload_verifier_prep().

Adjust __bpf_prog_dev_bound_init() to require bpf_offload_netdev
with non-NULL .offdev for offloaded BPF programs.

Fixes: 2b3486bc2d ("bpf: Introduce device-bound XDP programs")
Reported-by: syzbot+291100dcb32190ec02a8@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/000000000000d97f3c060479c4f8@google.com/
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230912005539.2248244-2-eddyz87@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-09-11 22:06:06 -07:00
Toke Høiland-Jørgensen
a34a9f1a19 bpf: Avoid deadlock when using queue and stack maps from NMI
Sysbot discovered that the queue and stack maps can deadlock if they are
being used from a BPF program that can be called from NMI context (such as
one that is attached to a perf HW counter event). To fix this, add an
in_nmi() check and use raw_spin_trylock() in NMI context, erroring out if
grabbing the lock fails.

Fixes: f1a2e44a3a ("bpf: add queue and stack maps")
Reported-by: Hsin-Wei Hung <hsinweih@uci.edu>
Tested-by: Hsin-Wei Hung <hsinweih@uci.edu>
Co-developed-by: Hsin-Wei Hung <hsinweih@uci.edu>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20230911132815.717240-1-toke@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-11 19:04:49 -07:00
Hou Tao
c930472552 bpf: Ensure unit_size is matched with slab cache object size
Add extra check in bpf_mem_alloc_init() to ensure the unit_size of
bpf_mem_cache is matched with the object_size of underlying slab cache.
If these two sizes are unmatched, print a warning once and return
-EINVAL in bpf_mem_alloc_init(), so the mismatch can be found early and
the potential issue can be prevented.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230908133923.2675053-4-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-11 12:41:37 -07:00
Hou Tao
b1d53958b6 bpf: Don't prefill for unused bpf_mem_cache
When the unit_size of a bpf_mem_cache is unmatched with the object_size
of the underlying slab cache, the bpf_mem_cache will not be used, and
the allocation will be redirected to a bpf_mem_cache with a bigger
unit_size instead, so there is no need to prefill for these
unused bpf_mem_caches.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230908133923.2675053-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-11 12:41:37 -07:00
Hou Tao
d52b59315b bpf: Adjust size_index according to the value of KMALLOC_MIN_SIZE
The following warning was reported when running "./test_progs -a
link_api -a linked_list" on a RISC-V QEMU VM:

  ------------[ cut here ]------------
  WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342 bpf_mem_refill
  Modules linked in: bpf_testmod(OE)
  CPU: 3 PID: 261 Comm: test_progs- ... 6.5.0-rc5-01743-gdcb152bb8328 #2
  Hardware name: riscv-virtio,qemu (DT)
  epc : bpf_mem_refill+0x1fc/0x206
   ra : irq_work_single+0x68/0x70
  epc : ffffffff801b1bc4 ra : ffffffff8015fe84 sp : ff2000000001be20
   gp : ffffffff82d26138 tp : ff6000008477a800 t0 : 0000000000046600
   t1 : ffffffff812b6ddc t2 : 0000000000000000 s0 : ff2000000001be70
   s1 : ff5ffffffffe8998 a0 : ff5ffffffffe8998 a1 : ff600003fef4b000
   a2 : 000000000000003f a3 : ffffffff80008250 a4 : 0000000000000060
   a5 : 0000000000000080 a6 : 0000000000000000 a7 : 0000000000735049
   s2 : ff5ffffffffe8998 s3 : 0000000000000022 s4 : 0000000000001000
   s5 : 0000000000000007 s6 : ff5ffffffffe8570 s7 : ffffffff82d6bd30
   s8 : 000000000000003f s9 : ffffffff82d2c5e8 s10: 000000000000ffff
   s11: ffffffff82d2c5d8 t3 : ffffffff81ea8f28 t4 : 0000000000000000
   t5 : ff6000008fd28278 t6 : 0000000000040000
  [<ffffffff801b1bc4>] bpf_mem_refill+0x1fc/0x206
  [<ffffffff8015fe84>] irq_work_single+0x68/0x70
  [<ffffffff8015feb4>] irq_work_run_list+0x28/0x36
  [<ffffffff8015fefa>] irq_work_run+0x38/0x66
  [<ffffffff8000828a>] handle_IPI+0x3a/0xb4
  [<ffffffff800a5c3a>] handle_percpu_devid_irq+0xa4/0x1f8
  [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
  [<ffffffff800ae570>] ipi_mux_process+0xac/0xfa
  [<ffffffff8000a8ea>] sbi_ipi_handle+0x2e/0x88
  [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
  [<ffffffff807ee70e>] riscv_intc_irq+0x36/0x4e
  [<ffffffff812b5d3a>] handle_riscv_irq+0x54/0x86
  [<ffffffff812b6904>] do_irq+0x66/0x98
  ---[ end trace 0000000000000000 ]---

The warning is due to WARN_ON_ONCE(tgt->unit_size != c->unit_size) in
free_bulk(). The direct reason is that a object is allocated and
freed by bpf_mem_caches with different unit_size.

The root cause is that KMALLOC_MIN_SIZE is 64 and there is no 96-bytes
slab cache in the specific VM. When linked_list test allocates a
72-bytes object through bpf_obj_new(), bpf_global_ma will allocate it
from a bpf_mem_cache with 96-bytes unit_size, but this bpf_mem_cache is
backed by 128-bytes slab cache. When the object is freed, bpf_mem_free()
uses ksize() to choose the corresponding bpf_mem_cache. Because the
object is allocated from 128-bytes slab cache, ksize() returns 128,
bpf_mem_free() chooses a 128-bytes bpf_mem_cache to free the object and
triggers the warning.

A similar warning will also be reported when using CONFIG_SLAB instead
of CONFIG_SLUB in a x86-64 kernel. Because CONFIG_SLUB defines
KMALLOC_MIN_SIZE as 8 but CONFIG_SLAB defines KMALLOC_MIN_SIZE as 32.

An alternative fix is to use kmalloc_size_round() in bpf_mem_alloc() to
choose a bpf_mem_cache which has the same unit_size with the backing
slab cache, but it may introduce performance degradation, so fix the
warning by adjusting the indexes in size_index according to the value of
KMALLOC_MIN_SIZE just like setup_kmalloc_cache_index_table() does.

Fixes: 822fb26bdb ("bpf: Add a hint to allocated objects.")
Reported-by: Björn Töpel <bjorn@kernel.org>
Closes: https://lore.kernel.org/bpf/87jztjmmy4.fsf@all.your.base.are.belong.to.us
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230908133923.2675053-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-11 12:41:36 -07:00
Linus Torvalds
1b37a0a2d4 RISC-V Patches for the 6.6 Merge Window, Part 2 (try 2)
* The kernel now dynamically probes for misaligned access speed, as
   opposed to relying on a table of known implementations.
 * Support for non-coherent devices on systems using the Andes AX45MP
   core, including the RZ/Five SoCs.
 * Support for the V extension in ptrace(), again.
 * Support for KASLR.
 * Support for the BPF prog pack allocator in RISC-V.
 * A handful of bug fixes and cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEKzw3R0RoQ7JKlDp6LhMZ81+7GIkFAmT8eV0THHBhbG1lckBk
 YWJiZWx0LmNvbQAKCRAuExnzX7sYiQYTD/9V6asKMDdWUV+gti/gRvJsiYUjIrrK
 h4MB8hL3fHfCLBpTD4rU6K1Gx6hzPjGsxIuQyAq/hf752KB/9XUiIVziRBv2ZEBb
 GuTFCXfg0QXBUlxBZzFw5SKUuKXgRaMAQ14qjy3tfLk31YMQmBtAlEPdDM8mZOCQ
 zNI3bbdn6zASeaSMh7hwBoOJWP2ACoOEW7RcD44EDT8jb3YW5rEF86x0XtYLgJb6
 xhaR4ieIdaOLxz2RbjXj0GcPIBfhTxZbwN3fLlD8PxuGqCKn5kN03bPPwP9tMTAc
 z02EgVcSDvJWpYikuuTkPMxpSi18OZPJ6eriwOv5ccP5NXQScO09iGo7IZEM7OzO
 j1IrIXyncU4BhxlpWombU454Va+ezUlfh9uh+MrJ+Bnve3T3S9ax7AV4S8vkJZlT
 bnmJVS/g7L/7nxTQdJ3zoAo2WzFQXL0C8SR5tGo/3aRk0uYoliHy/W419f55F9GZ
 rFcc+LMqai8N4bLN3whaK0NnuodNWHoNlpcd/5ncJwecswuDkah3LWcd4rwBrWhu
 8RIkIfpdr/vTQjUVXVLeMHdKB+lST3iF1feMqJj0PfTyvTZi5yfSppjAfkAdVq+9
 lHqAjsaGdiCrOtLxb0oBR2PTDQPAm2gN2meuSMommDQR6Vul8K5WcQml9Zx9QEWA
 eDXWYDZISKYJbA==
 =s89m
 -----END PGP SIGNATURE-----

Merge tag 'riscv-for-linus-6.6-mw2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

Pull more RISC-V updates from Palmer Dabbelt:

 - The kernel now dynamically probes for misaligned access speed, as
   opposed to relying on a table of known implementations.

 - Support for non-coherent devices on systems using the Andes AX45MP
   core, including the RZ/Five SoCs.

 - Support for the V extension in ptrace(), again.

 - Support for KASLR.

 - Support for the BPF prog pack allocator in RISC-V.

 - A handful of bug fixes and cleanups.

* tag 'riscv-for-linus-6.6-mw2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (25 commits)
  soc: renesas: Kconfig: For ARCH_R9A07G043 select the required configs if dependencies are met
  riscv: Kconfig.errata: Add dependency for RISCV_SBI in ERRATA_ANDES config
  riscv: Kconfig.errata: Drop dependency for MMU in ERRATA_ANDES_CMO config
  riscv: Kconfig: Select DMA_DIRECT_REMAP only if MMU is enabled
  bpf, riscv: use prog pack allocator in the BPF JIT
  riscv: implement a memset like function for text
  riscv: extend patch_text_nosync() for multiple pages
  bpf: make bpf_prog_pack allocator portable
  riscv: libstub: Implement KASLR by using generic functions
  libstub: Fix compilation warning for rv32
  arm64: libstub: Move KASLR handling functions to kaslr.c
  riscv: Dump out kernel offset information on panic
  riscv: Introduce virtual kernel mapping KASLR
  RISC-V: Add ptrace support for vectors
  soc: renesas: Kconfig: Select the required configs for RZ/Five SoC
  cache: Add L2 cache management for Andes AX45MP RISC-V core
  dt-bindings: cache: andestech,ax45mp-cache: Add DT binding documentation for L2 cache controller
  riscv: mm: dma-noncoherent: nonstandard cache operations support
  riscv: errata: Add Andes alternative ports
  riscv: asm: vendorid_list: Add Andes Technology to the vendors list
  ...
2023-09-09 14:25:11 -07:00
Palmer Dabbelt
77eea559ba
Merge patch series "bpf, riscv: use BPF prog pack allocator in BPF JIT"
Puranjay Mohan <puranjay12@gmail.com> says:

Here is some data to prove the V2 fixes the problem:

Without this series:
root@rv-selftester:~/src/kselftest/bpf# time ./test_tag
test_tag: OK (40945 tests)

real    7m47.562s
user    0m24.145s
sys     6m37.064s

With this series applied:
root@rv-selftester:~/src/selftest/bpf# time ./test_tag
test_tag: OK (40945 tests)

real    7m29.472s
user    0m25.865s
sys     6m18.401s

BPF programs currently consume a page each on RISCV. For systems with many BPF
programs, this adds significant pressure to instruction TLB. High iTLB pressure
usually causes slow down for the whole system.

Song Liu introduced the BPF prog pack allocator[1] to mitigate the above issue.
It packs multiple BPF programs into a single huge page. It is currently only
enabled for the x86_64 BPF JIT.

I enabled this allocator on the ARM64 BPF JIT[2]. It is being reviewed now.

This patch series enables the BPF prog pack allocator for the RISCV BPF JIT.

======================================================
Performance Analysis of prog pack allocator on RISCV64
======================================================

Test setup:
===========

Host machine: Debian GNU/Linux 11 (bullseye)
Qemu Version: QEMU emulator version 8.0.3 (Debian 1:8.0.3+dfsg-1)
u-boot-qemu Version: 2023.07+dfsg-1
opensbi Version: 1.3-1

To test the performance of the BPF prog pack allocator on RV, a stresser
tool[4] linked below was built. This tool loads 8 BPF programs on the system and
triggers 5 of them in an infinite loop by doing system calls.

The runner script starts 20 instances of the above which loads 8*20=160 BPF
programs on the system, 5*20=100 of which are being constantly triggered.
The script is passed a command which would be run in the above environment.

The script was run with following perf command:
./run.sh "perf stat -a \
        -e iTLB-load-misses \
        -e dTLB-load-misses  \
        -e dTLB-store-misses \
        -e instructions \
        --timeout 60000"

The output of the above command is discussed below before and after enabling the
BPF prog pack allocator.

The tests were run on qemu-system-riscv64 with 8 cpus, 16G memory. The rootfs
was created using Bjorn's riscv-cross-builder[5] docker container linked below.

Results
=======

Before enabling prog pack allocator:
------------------------------------

Performance counter stats for 'system wide':

           4939048      iTLB-load-misses
           5468689      dTLB-load-misses
            465234      dTLB-store-misses
     1441082097998      instructions

      60.045791200 seconds time elapsed

After enabling prog pack allocator:
-----------------------------------

Performance counter stats for 'system wide':

           3430035      iTLB-load-misses
           5008745      dTLB-load-misses
            409944      dTLB-store-misses
     1441535637988      instructions

      60.046296600 seconds time elapsed

Improvements in metrics
=======================

It was expected that the iTLB-load-misses would decrease as now a single huge
page is used to keep all the BPF programs compared to a single page for each
program earlier.

--------------------------------------------
The improvement in iTLB-load-misses: -30.5 %
--------------------------------------------

I repeated this expriment more than 100 times in different setups and the
improvement was always greater than 30%.

This patch series is boot tested on the Starfive VisionFive 2 board[6].
The performance analysis was not done on the board because it doesn't
expose iTLB-load-misses, etc. The stresser program was run on the board to test
the loading and unloading of BPF programs

[1] https://lore.kernel.org/bpf/20220204185742.271030-1-song@kernel.org/
[2] https://lore.kernel.org/all/20230626085811.3192402-1-puranjay12@gmail.com/
[3] https://lore.kernel.org/all/20230626085811.3192402-2-puranjay12@gmail.com/
[4] https://github.com/puranjaymohan/BPF-Allocator-Bench
[5] https://github.com/bjoto/riscv-cross-builder
[6] https://www.starfivetech.com/en/site/boards

* b4-shazam-merge:
  bpf, riscv: use prog pack allocator in the BPF JIT
  riscv: implement a memset like function for text
  riscv: extend patch_text_nosync() for multiple pages
  bpf: make bpf_prog_pack allocator portable

Link: https://lore.kernel.org/r/20230831131229.497941-1-puranjay12@gmail.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-09-08 11:25:25 -07:00
Oleg Nesterov
780aa8dfcb bpf: task_group_seq_get_next: simplify the "next tid" logic
Kill saved_tid. It looks ugly to update *tid and then restore the
previous value if __task_pid_nr_ns() returns 0. Change this code
to update *tid and common->pid_visiting once before return.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230905154656.GA24950@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08 08:42:19 -07:00
Oleg Nesterov
0ee9808b0a bpf: task_group_seq_get_next: kill next_task
It only adds the unnecessary confusion and compicates the "retry" code.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230905154654.GA24945@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08 08:42:19 -07:00
Oleg Nesterov
87abbf7a54 bpf: task_group_seq_get_next: fix the skip_if_dup_files check
Unless I am notally confused it is wrong. We are going to return or
skip next_task so we need to check next_task-files, not task->files.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230905154651.GA24940@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08 08:42:19 -07:00
Oleg Nesterov
4981921350 bpf: task_group_seq_get_next: cleanup the usage of get/put_task_struct
get_pid_task() makes no sense, the code does put_task_struct() soon after.
Use find_task_by_pid_ns() instead of find_pid_ns + get_pid_task and kill
put_task_struct(), this allows to do get_task_struct() only once before
return.

While at it, kill the unnecessary "if (!pid)" check in the "if (!*tid)"
block, this matches the next usage of find_pid_ns() + get_pid_task() in
this function.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230905154649.GA24935@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08 08:42:19 -07:00
Oleg Nesterov
1a00ef57d9 bpf: task_group_seq_get_next: cleanup the usage of next_thread()
1. find_pid_ns() + get_pid_task() under rcu_read_lock() guarantees that we
   can safely iterate the task->thread_group list. Even if this task exits
   right after get_pid_task() (or goto retry) and pid_alive() returns 0.

   Kill the unnecessary pid_alive() check.

2. next_thread() simply can't return NULL, kill the bogus "if (!next_task)"
   check.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230905154646.GA24928@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08 08:42:19 -07:00
Hou Tao
62cf51cb0e bpf: Enable IRQ after irq_work_raise() completes in unit_free{_rcu}()
Both unit_free() and unit_free_rcu() invoke irq_work_raise() to free
freed objects back to slab and the invocation may also be preempted by
unit_alloc() and unit_alloc() may return NULL unexpectedly as shown in
the following case:

task A         task B

unit_free()
  // high_watermark = 48
  // free_cnt = 49 after free
  irq_work_raise()
    // mark irq work as IRQ_WORK_PENDING
    irq_work_claim()

               // task B preempts task A
               unit_alloc()
                 // free_cnt = 48 after alloc

               // does unit_alloc() 32-times
	       ......
	       // free_cnt = 16

	       unit_alloc()
	         // free_cnt = 15 after alloc
                 // irq work is already PENDING,
                 // so just return
                 irq_work_raise()

	       // does unit_alloc() 15-times
               ......
	       // free_cnt = 0

               unit_alloc()
                 // free_cnt = 0 before alloc
                 return NULL

Fix it by enabling IRQ after irq_work_raise() completes.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230901111954.1804721-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08 08:42:19 -07:00
Hou Tao
566f6de3ce bpf: Enable IRQ after irq_work_raise() completes in unit_alloc()
When doing stress test for qp-trie, bpf_mem_alloc() returned NULL
unexpectedly because all qp-trie operations were initiated from
bpf syscalls and there was still available free memory. bpf_obj_new()
has the same problem as shown by the following selftest.

The failure is due to the preemption. irq_work_raise() will invoke
irq_work_claim() first to mark the irq work as pending and then inovke
__irq_work_queue_local() to raise an IPI. So when the current task
which is invoking irq_work_raise() is preempted by other task,
unit_alloc() may return NULL for preemption task as shown below:

task A         task B

unit_alloc()
  // low_watermark = 32
  // free_cnt = 31 after alloc
  irq_work_raise()
    // mark irq work as IRQ_WORK_PENDING
    irq_work_claim()

	       // task B preempts task A
	       unit_alloc()
	         // free_cnt = 30 after alloc
	         // irq work is already PENDING,
	         // so just return
	         irq_work_raise()
	       // does unit_alloc() 30-times
	       ......
	       unit_alloc()
	         // free_cnt = 0 before alloc
	         return NULL

Fix it by enabling IRQ after irq_work_raise() completes. An alternative
fix is using preempt_{disable|enable}_notrace() pair, but it may have
extra overhead. Another feasible fix is to only disable preemption or
IRQ before invoking irq_work_queue() and enable preemption or IRQ after
the invocation completes, but it can't handle the case when
c->low_watermark is 1.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230901111954.1804721-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08 08:42:18 -07:00
Yonghong Song
5b221ecb3a bpf: Mark OBJ_RELEASE argument as MEM_RCU when possible
In previous selftests/bpf patch, we have
  p = bpf_percpu_obj_new(struct val_t);
  if (!p)
          goto out;

  p1 = bpf_kptr_xchg(&e->pc, p);
  if (p1) {
          /* race condition */
          bpf_percpu_obj_drop(p1);
  }

  p = e->pc;
  if (!p)
          goto out;

After bpf_kptr_xchg(), we need to re-read e->pc into 'p'.
This is due to that the second argument of bpf_kptr_xchg() is marked
OBJ_RELEASE and it will be marked as invalid after the call.
So after bpf_kptr_xchg(), 'p' is an unknown scalar,
and the bpf program needs to reread from the map value.

This patch checks if the 'p' has type MEM_ALLOC and MEM_PERCPU,
and if 'p' is RCU protected. If this is the case, 'p' can be marked
as MEM_RCU. MEM_ALLOC needs to be removed since 'p' is not
an owning reference any more. Such a change makes re-read
from the map value unnecessary.

Note that re-reading 'e->pc' after bpf_kptr_xchg() might get
a different value from 'p' if immediately before 'p = e->pc',
another cpu may do another bpf_kptr_xchg() and swap in another value
into 'e->pc'. If this is the case, then 'p = e->pc' may
get either 'p' or another value, and race condition already exists.
So removing direct re-reading seems fine too.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230827152816.2000760-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08 08:42:18 -07:00
Yonghong Song
01cc55af93 bpf: Add bpf_this_cpu_ptr/bpf_per_cpu_ptr support for allocated percpu obj
The bpf helpers bpf_this_cpu_ptr() and bpf_per_cpu_ptr() are re-purposed
for allocated percpu objects. For an allocated percpu obj,
the reg type is 'PTR_TO_BTF_ID | MEM_PERCPU | MEM_RCU'.

The return type for these two re-purposed helpera is
'PTR_TO_MEM | MEM_RCU | MEM_ALLOC'.
The MEM_ALLOC allows that the per-cpu data can be read and written.

Since the memory allocator bpf_mem_alloc() returns
a ptr to a percpu ptr for percpu data, the first argument
of bpf_this_cpu_ptr() and bpf_per_cpu_ptr() is patched
with a dereference before passing to the helper func.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230827152749.1997202-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08 08:42:17 -07:00
Yonghong Song
36d8bdf75a bpf: Add alloc/xchg/direct_access support for local percpu kptr
Add two new kfunc's, bpf_percpu_obj_new_impl() and
bpf_percpu_obj_drop_impl(), to allocate a percpu obj.
Two functions are very similar to bpf_obj_new_impl()
and bpf_obj_drop_impl(). The major difference is related
to percpu handling.

    bpf_rcu_read_lock()
    struct val_t __percpu_kptr *v = map_val->percpu_data;
    ...
    bpf_rcu_read_unlock()

For a percpu data map_val like above 'v', the reg->type
is set as
	PTR_TO_BTF_ID | MEM_PERCPU | MEM_RCU
if inside rcu critical section.

MEM_RCU marking here is similar to NON_OWN_REF as 'v'
is not a owning reference. But NON_OWN_REF is
trusted and typically inside the spinlock while
MEM_RCU is under rcu read lock. RCU is preferred here
since percpu data structures mean potential concurrent
access into its contents.

Also, bpf_percpu_obj_new_impl() is restricted such that
no pointers or special fields are allowed. Therefore,
the bpf_list_head and bpf_rb_root will not be supported
in this patch set to avoid potential memory leak issue
due to racing between bpf_obj_free_fields() and another
bpf_kptr_xchg() moving an allocated object to
bpf_list_head and bpf_rb_root.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230827152744.1996739-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08 08:42:17 -07:00
Yonghong Song
55db92f42f bpf: Add BPF_KPTR_PERCPU as a field type
BPF_KPTR_PERCPU represents a percpu field type like below

  struct val_t {
    ... fields ...
  };
  struct t {
    ...
    struct val_t __percpu_kptr *percpu_data_ptr;
    ...
  };

where
  #define __percpu_kptr __attribute__((btf_type_tag("percpu_kptr")))

While BPF_KPTR_REF points to a trusted kernel object or a trusted
local object, BPF_KPTR_PERCPU points to a trusted local
percpu object.

This patch added basic support for BPF_KPTR_PERCPU
related to percpu_kptr field parsing, recording and free operations.
BPF_KPTR_PERCPU also supports the same map types
as BPF_KPTR_REF does.

Note that unlike a local kptr, it is possible that
a BPF_KTPR_PERCPU struct may not contain any
special fields like other kptr, bpf_spin_lock, bpf_list_head, etc.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230827152739.1996391-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08 08:42:17 -07:00
Yonghong Song
41a5db8d81 bpf: Add support for non-fix-size percpu mem allocation
This is needed for later percpu mem allocation when the
allocation is done by bpf program. For such cases, a global
bpf_global_percpu_ma is added where a flexible allocation
size is needed.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230827152734.1995725-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08 08:42:17 -07:00
Linus Torvalds
73be7fb14e Including fixes from netfilter and bpf.
Current release - regressions:
 
  - eth: stmmac: fix failure to probe without MAC interface specified
 
 Current release - new code bugs:
 
  - docs: netlink: fix missing classic_netlink doc reference
 
 Previous releases - regressions:
 
  - deal with integer overflows in kmalloc_reserve()
 
  - use sk_forward_alloc_get() in sk_get_meminfo()
 
  - bpf_sk_storage: fix the missing uncharge in sk_omem_alloc
 
  - fib: avoid warn splat in flow dissector after packet mangling
 
  - skb_segment: call zero copy functions before using skbuff frags
 
  - eth: sfc: check for zero length in EF10 RX prefix
 
 Previous releases - always broken:
 
  - af_unix: fix msg_controllen test in scm_pidfd_recv() for
    MSG_CMSG_COMPAT
 
  - xsk: fix xsk_build_skb() dereferencing possible ERR_PTR()
 
  - netfilter:
    - nft_exthdr: fix non-linear header modification
    - xt_u32, xt_sctp: validate user space input
    - nftables: exthdr: fix 4-byte stack OOB write
    - nfnetlink_osf: avoid OOB read
    - one more fix for the garbage collection work from last release
 
  - igmp: limit igmpv3_newpack() packet size to IP_MAX_MTU
 
  - bpf, sockmap: fix preempt_rt splat when using raw_spin_lock_t
 
  - handshake: fix null-deref in handshake_nl_done_doit()
 
  - ip: ignore dst hint for multipath routes to ensure packets
    are hashed across the nexthops
 
  - phy: micrel:
    - correct bit assignments for cable test errata
    - disable EEE according to the KSZ9477 errata
 
 Misc:
 
  - docs/bpf: document compile-once-run-everywhere (CO-RE) relocations
 
  - Revert "net: macsec: preserve ingress frame ordering", it appears
    to have been developed against an older kernel, problem doesn't
    exist upstream
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmT6R6wACgkQMUZtbf5S
 IrsmTg//TgmRjxSZ0lrPQtJwZR/eN3ZR2oQG3rwnssCx+YgHEGGxQsfT4KHEMacR
 ZgGDZVTpthUJkkACBPi8ZMoy++RdjEmlCcanfeDkGHoYGtiX1lhkofhLMn1KUHbI
 rIbP9EdNKxQT0SsBlw/U28pD5jKyqOgL23QobEwmcjLTdMpamb+qIsD6/xNv9tEj
 Tu4BdCIkhjxnBD622hsE3pFTG7oSn2WM6rf5NT1E43mJ3W8RrMcydSB27J7Oryo9
 l3nYMAhz0vQINS2WQ9eCT1/7GI6gg1nDtxFtrnV7ASvxayRBPIUr4kg1vT+Tixsz
 CZMnwVamEBIYl9agmj7vSji7d5nOUgXPhtWhwWUM2tRoGdeGw3vSi1pgDvRiUCHE
 PJ4UHv7goa2AgnOlOQCFtRybAu+9nmSGm7V+GkeGLnH7xbFsEa5smQ/+FSPJs8Dn
 Yf4q5QAhdN8tdnofRlrN/nCssoDF3cfmBsTJ7wo5h71gW+BWhsP58eDCJlXd/r8k
 +Qnvoe2kw27ktFR1tjsUDZ0AcSmeVARNwmXCOBYZsG4tEek8pLyj008mDvJvdfyn
 PGPn7Eo5DyaERlHVmPuebHXSyniDEPe2GLTmlHcGiRpGspoUHbB+HRiDAuRLMB9g
 pkL8RHpNfppnuUXeUoNy3rgEkYwlpTjZX0QHC6N8NQ76ccB6CNM=
 =YpmE
 -----END PGP SIGNATURE-----

Merge tag 'net-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking updates from Jakub Kicinski:
 "Including fixes from netfilter and bpf.

  Current release - regressions:

   - eth: stmmac: fix failure to probe without MAC interface specified

  Current release - new code bugs:

   - docs: netlink: fix missing classic_netlink doc reference

  Previous releases - regressions:

   - deal with integer overflows in kmalloc_reserve()

   - use sk_forward_alloc_get() in sk_get_meminfo()

   - bpf_sk_storage: fix the missing uncharge in sk_omem_alloc

   - fib: avoid warn splat in flow dissector after packet mangling

   - skb_segment: call zero copy functions before using skbuff frags

   - eth: sfc: check for zero length in EF10 RX prefix

  Previous releases - always broken:

   - af_unix: fix msg_controllen test in scm_pidfd_recv() for
     MSG_CMSG_COMPAT

   - xsk: fix xsk_build_skb() dereferencing possible ERR_PTR()

   - netfilter:
      - nft_exthdr: fix non-linear header modification
      - xt_u32, xt_sctp: validate user space input
      - nftables: exthdr: fix 4-byte stack OOB write
      - nfnetlink_osf: avoid OOB read
      - one more fix for the garbage collection work from last release

   - igmp: limit igmpv3_newpack() packet size to IP_MAX_MTU

   - bpf, sockmap: fix preempt_rt splat when using raw_spin_lock_t

   - handshake: fix null-deref in handshake_nl_done_doit()

   - ip: ignore dst hint for multipath routes to ensure packets are
     hashed across the nexthops

   - phy: micrel:
      - correct bit assignments for cable test errata
      - disable EEE according to the KSZ9477 errata

  Misc:

   - docs/bpf: document compile-once-run-everywhere (CO-RE) relocations

   - Revert "net: macsec: preserve ingress frame ordering", it appears
     to have been developed against an older kernel, problem doesn't
     exist upstream"

* tag 'net-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (95 commits)
  net: enetc: distinguish error from valid pointers in enetc_fixup_clear_rss_rfs()
  Revert "net: team: do not use dynamic lockdep key"
  net: hns3: remove GSO partial feature bit
  net: hns3: fix the port information display when sfp is absent
  net: hns3: fix invalid mutex between tc qdisc and dcb ets command issue
  net: hns3: fix debugfs concurrency issue between kfree buffer and read
  net: hns3: fix byte order conversion issue in hclge_dbg_fd_tcam_read()
  net: hns3: Support query tx timeout threshold by debugfs
  net: hns3: fix tx timeout issue
  net: phy: Provide Module 4 KSZ9477 errata (DS80000754C)
  netfilter: nf_tables: Unbreak audit log reset
  netfilter: ipset: add the missing IP_SET_HASH_WITH_NET0 macro for ip_set_hash_netportnet.c
  netfilter: nft_set_rbtree: skip sync GC for new elements in this transaction
  netfilter: nf_tables: uapi: Describe NFTA_RULE_CHAIN_ID
  netfilter: nfnetlink_osf: avoid OOB read
  netfilter: nftables: exthdr: fix 4-byte stack OOB write
  selftests/bpf: Check bpf_sk_storage has uncharged sk_omem_alloc
  bpf: bpf_sk_storage: Fix the missing uncharge in sk_omem_alloc
  bpf: bpf_sk_storage: Fix invalid wait context lockdep report
  s390/bpf: Pass through tail call counter in trampolines
  ...
2023-09-07 18:33:07 -07:00
Puranjay Mohan
20e490adea
bpf: make bpf_prog_pack allocator portable
The bpf_prog_pack allocator currently uses module_alloc() and
module_memfree() to allocate and free memory. This is not portable
because different architectures use different methods for allocating
memory for BPF programs. Like ARM64 and riscv use vmalloc()/vfree().

Use bpf_jit_alloc_exec() and bpf_jit_free_exec() for memory management
in bpf_prog_pack allocator. Other architectures can override these with
their implementation and will be able to use bpf_prog_pack directly.

On architectures that don't override bpf_jit_alloc/free_exec() this is
basically a NOP.

Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Björn Töpel <bjorn@kernel.org>
Tested-by: Björn Töpel <bjorn@rivosinc.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20230831131229.497941-2-puranjay12@gmail.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-09-06 06:26:05 -07:00
Martin KaFai Lau
55d49f750b bpf: bpf_sk_storage: Fix the missing uncharge in sk_omem_alloc
The commit c83597fa5d ("bpf: Refactor some inode/task/sk storage functions
for reuse"), refactored the bpf_{sk,task,inode}_storage_free() into
bpf_local_storage_unlink_nolock() which then later renamed to
bpf_local_storage_destroy(). The commit accidentally passed the
"bool uncharge_mem = false" argument to bpf_selem_unlink_storage_nolock()
which then stopped the uncharge from happening to the sk->sk_omem_alloc.

This missing uncharge only happens when the sk is going away (during
__sk_destruct).

This patch fixes it by always passing "uncharge_mem = true". It is a
noop to the task/inode/cgroup storage because they do not have the
map_local_storage_(un)charge enabled in the map_ops. A followup patch
will be done in bpf-next to remove the uncharge_mem argument.

A selftest is added in the next patch.

Fixes: c83597fa5d ("bpf: Refactor some inode/task/sk storage functions for reuse")
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230901231129.578493-3-martin.lau@linux.dev
2023-09-06 11:08:14 +02:00
Martin KaFai Lau
a96a44aba5 bpf: bpf_sk_storage: Fix invalid wait context lockdep report
'./test_progs -t test_local_storage' reported a splat:

[   27.137569] =============================
[   27.138122] [ BUG: Invalid wait context ]
[   27.138650] 6.5.0-03980-gd11ae1b16b0a #247 Tainted: G           O
[   27.139542] -----------------------------
[   27.140106] test_progs/1729 is trying to lock:
[   27.140713] ffff8883ef047b88 (stock_lock){-.-.}-{3:3}, at: local_lock_acquire+0x9/0x130
[   27.141834] other info that might help us debug this:
[   27.142437] context-{5:5}
[   27.142856] 2 locks held by test_progs/1729:
[   27.143352]  #0: ffffffff84bcd9c0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x4/0x40
[   27.144492]  #1: ffff888107deb2c0 (&storage->lock){..-.}-{2:2}, at: bpf_local_storage_update+0x39e/0x8e0
[   27.145855] stack backtrace:
[   27.146274] CPU: 0 PID: 1729 Comm: test_progs Tainted: G           O       6.5.0-03980-gd11ae1b16b0a #247
[   27.147550] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[   27.149127] Call Trace:
[   27.149490]  <TASK>
[   27.149867]  dump_stack_lvl+0x130/0x1d0
[   27.152609]  dump_stack+0x14/0x20
[   27.153131]  __lock_acquire+0x1657/0x2220
[   27.153677]  lock_acquire+0x1b8/0x510
[   27.157908]  local_lock_acquire+0x29/0x130
[   27.159048]  obj_cgroup_charge+0xf4/0x3c0
[   27.160794]  slab_pre_alloc_hook+0x28e/0x2b0
[   27.161931]  __kmem_cache_alloc_node+0x51/0x210
[   27.163557]  __kmalloc+0xaa/0x210
[   27.164593]  bpf_map_kzalloc+0xbc/0x170
[   27.165147]  bpf_selem_alloc+0x130/0x510
[   27.166295]  bpf_local_storage_update+0x5aa/0x8e0
[   27.167042]  bpf_fd_sk_storage_update_elem+0xdb/0x1a0
[   27.169199]  bpf_map_update_value+0x415/0x4f0
[   27.169871]  map_update_elem+0x413/0x550
[   27.170330]  __sys_bpf+0x5e9/0x640
[   27.174065]  __x64_sys_bpf+0x80/0x90
[   27.174568]  do_syscall_64+0x48/0xa0
[   27.175201]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[   27.175932] RIP: 0033:0x7effb40e41ad
[   27.176357] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d8
[   27.179028] RSP: 002b:00007ffe64c21fc8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141
[   27.180088] RAX: ffffffffffffffda RBX: 00007ffe64c22768 RCX: 00007effb40e41ad
[   27.181082] RDX: 0000000000000020 RSI: 00007ffe64c22008 RDI: 0000000000000002
[   27.182030] RBP: 00007ffe64c21ff0 R08: 0000000000000000 R09: 00007ffe64c22788
[   27.183038] R10: 0000000000000064 R11: 0000000000000202 R12: 0000000000000000
[   27.184006] R13: 00007ffe64c22788 R14: 00007effb42a1000 R15: 0000000000000000
[   27.184958]  </TASK>

It complains about acquiring a local_lock while holding a raw_spin_lock.
It means it should not allocate memory while holding a raw_spin_lock
since it is not safe for RT.

raw_spin_lock is needed because bpf_local_storage supports tracing
context. In particular for task local storage, it is easy to
get a "current" task PTR_TO_BTF_ID in tracing bpf prog.
However, task (and cgroup) local storage has already been moved to
bpf mem allocator which can be used after raw_spin_lock.

The splat is for the sk storage. For sk (and inode) storage,
it has not been moved to bpf mem allocator. Using raw_spin_lock or not,
kzalloc(GFP_ATOMIC) could theoretically be unsafe in tracing context.
However, the local storage helper requires a verifier accepted
sk pointer (PTR_TO_BTF_ID), it is hypothetical if that (mean running
a bpf prog in a kzalloc unsafe context and also able to hold a verifier
accepted sk pointer) could happen.

This patch avoids kzalloc after raw_spin_lock to silent the splat.
There is an existing kzalloc before the raw_spin_lock. At that point,
a kzalloc is very likely required because a lookup has just been done
before. Thus, this patch always does the kzalloc before acquiring
the raw_spin_lock and remove the later kzalloc usage after the
raw_spin_lock. After this change, it will have a charge and then
uncharge during the syscall bpf_map_update_elem() code path.
This patch opts for simplicity and not continue the old
optimization to save one charge and uncharge.

This issue is dated back to the very first commit of bpf_sk_storage
which had been refactored multiple times to create task, inode, and
cgroup storage. This patch uses a Fixes tag with a more recent
commit that should be easier to do backport.

Fixes: b00fa38a9c ("bpf: Enable non-atomic allocations in local storage")
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230901231129.578493-2-martin.lau@linux.dev
2023-09-06 11:07:54 +02:00
Sebastian Andrzej Siewior
6764e767f4 bpf: Assign bpf_tramp_run_ctx::saved_run_ctx before recursion check.
__bpf_prog_enter_recur() assigns bpf_tramp_run_ctx::saved_run_ctx before
performing the recursion check which means in case of a recursion
__bpf_prog_exit_recur() uses the previously set bpf_tramp_run_ctx::saved_run_ctx
value.

__bpf_prog_enter_sleepable_recur() assigns bpf_tramp_run_ctx::saved_run_ctx
after the recursion check which means in case of a recursion
__bpf_prog_exit_sleepable_recur() uses an uninitialized value. This does not
look right. If I read the entry trampoline code right, then bpf_tramp_run_ctx
isn't initialized upfront.

Align __bpf_prog_enter_sleepable_recur() with __bpf_prog_enter_recur() and
set bpf_tramp_run_ctx::saved_run_ctx before the recursion check is made.
Remove the assignment of saved_run_ctx in kern_sys_bpf() since it happens
a few cycles later.

Fixes: e384c7b7b4 ("bpf, x86: Create bpf_tramp_run_ctx on the caller thread's stack")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20230830080405.251926-3-bigeasy@linutronix.de
2023-09-06 10:44:28 +02:00
Sebastian Andrzej Siewior
7645629f7d bpf: Invoke __bpf_prog_exit_sleepable_recur() on recursion in kern_sys_bpf().
If __bpf_prog_enter_sleepable_recur() detects recursion then it returns
0 without undoing rcu_read_lock_trace(), migrate_disable() or
decrementing the recursion counter. This is fine in the JIT case because
the JIT code will jump in the 0 case to the end and invoke the matching
exit trampoline (__bpf_prog_exit_sleepable_recur()).

This is not the case in kern_sys_bpf() which returns directly to the
caller with an error code.

Add __bpf_prog_exit_sleepable_recur() as clean up in the recursion case.

Fixes: b1d18a7574 ("bpf: Extend sys_bpf commands for bpf_syscall programs.")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20230830080405.251926-2-bigeasy@linutronix.de
2023-09-06 10:39:31 +02:00
Linus Torvalds
b70100f2e6 Probes updates for v6.6:
- kprobes: use struct_size() for variable size kretprobe_instance
   data structure.
 
 - eprobe: Simplify trace_eprobe list iteration.
 
 - probe events: Data structure field access support on BTF argument.
   . Update BTF argument support on the functions in the kernel loadable
     modules (only loaded modules are supported).
   . Move generic BTF access function (search function prototype and get
     function parameters) to a separated file.
   . Add a function to search a member of data structure in BTF.
   . Support accessing BTF data structure member from probe args by
     C-like arrow('->') and dot('.') operators. e.g.
     't sched_switch next=next->pid vruntime=next->se.vruntime'
   . Support accessing BTF data structure member from $retval. e.g.
       'f getname_flags%return +0($retval->name):string'
   . Add string type checking if BTF type info is available.
     This will reject if user specify ":string" type for non "char
     pointer" type.
   . Automatically assume the fprobe event as a function return event
     if $retval is used.
 
 - selftests/ftrace: Add BTF data field access test cases.
 
 - Documentation: Update fprobe event example with BTF data field.
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmTycQkbHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bqS8H/jeR1JhOzIXOvTw7XCFm
 MrSY/SKi8tQfV6lau2UmoYdbYvYjpqL34XLOQPNf2/lrcL2M9aNYXk9fbhlW8enx
 vkMyKQ0E5anixkF4vsTbEl9DaprxbpsPVACmZ/7VjQk2JuXIdyaNk8hno9LgIcEq
 udztb0o2HmDFqAXfRi0LvlSTAIwvXZ+usmEvYpaq1g2WwrCe7NHEYl42vMpj+h4H
 9l4t5rA9JyPPX4yQUjtKGW5eRVTwDTm/Gn6DRzYfYzkkiBZv27qfovzBOt672LgG
 hyot+u7XeKvZx3jjnF7+mRWoH/m0dqyhyi/nPhpIE09VhgwclrbGAcDuR1x6sp01
 PHY=
 =hBDN
 -----END PGP SIGNATURE-----

Merge tag 'probes-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes updates from Masami Hiramatsu:

 - kprobes: use struct_size() for variable size kretprobe_instance data
   structure.

 - eprobe: Simplify trace_eprobe list iteration.

 - probe events: Data structure field access support on BTF argument.

     - Update BTF argument support on the functions in the kernel
       loadable modules (only loaded modules are supported).

     - Move generic BTF access function (search function prototype and
       get function parameters) to a separated file.

     - Add a function to search a member of data structure in BTF.

     - Support accessing BTF data structure member from probe args by
       C-like arrow('->') and dot('.') operators. e.g.
          't sched_switch next=next->pid vruntime=next->se.vruntime'

     - Support accessing BTF data structure member from $retval. e.g.
          'f getname_flags%return +0($retval->name):string'

     - Add string type checking if BTF type info is available. This will
       reject if user specify ":string" type for non "char pointer"
       type.

     - Automatically assume the fprobe event as a function return event
       if $retval is used.

 - selftests/ftrace: Add BTF data field access test cases.

 - Documentation: Update fprobe event example with BTF data field.

* tag 'probes-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  Documentation: tracing: Update fprobe event example with BTF field
  selftests/ftrace: Add BTF fields access testcases
  tracing/fprobe-event: Assume fprobe is a return event by $retval
  tracing/probes: Add string type check with BTF
  tracing/probes: Support BTF field access from $retval
  tracing/probes: Support BTF based data structure field access
  tracing/probes: Add a function to search a member of a struct/union
  tracing/probes: Move finding func-proto API and getting func-param API to trace_btf
  tracing/probes: Support BTF argument on module functions
  tracing/eprobe: Iterate trace_eprobe directly
  kernel: kprobes: Use struct_size()
2023-09-02 11:10:50 -07:00
Linus Torvalds
bd6c11bc43 Networking changes for 6.6.
Core
 ----
 
  - Increase size limits for to-be-sent skb frag allocations. This
    allows tun, tap devices and packet sockets to better cope with large
    writes operations.
 
  - Store netdevs in an xarray, to simplify iterating over netdevs.
 
  - Refactor nexthop selection for multipath routes.
 
  - Improve sched class lifetime handling.
 
  - Add backup nexthop ID support for bridge.
 
  - Implement drop reasons support in openvswitch.
 
  - Several data races annotations and fixes.
 
  - Constify the sk parameter of routing functions.
 
  - Prepend kernel version to netconsole message.
 
 Protocols
 ---------
 
  - Implement support for TCP probing the peer being under memory
    pressure.
 
  - Remove hard coded limitation on IPv6 specific info placement
    inside the socket struct.
 
  - Get rid of sysctl_tcp_adv_win_scale and use an auto-estimated
    per socket scaling factor.
 
  - Scaling-up the IPv6 expired route GC via a separated list of
    expiring routes.
 
  - In-kernel support for the TLS alert protocol.
 
  - Better support for UDP reuseport with connected sockets.
 
  - Add NEXT-C-SID support for SRv6 End.X behavior, reducing the SR
    header size.
 
  - Get rid of additional ancillary per MPTCP connection struct socket.
 
  - Implement support for BPF-based MPTCP packet schedulers.
 
  - Format MPTCP subtests selftests results in TAP.
 
  - Several new SMC 2.1 features including unique experimental options,
    max connections per lgr negotiation, max links per lgr negotiation.
 
 BPF
 ---
 
  - Multi-buffer support in AF_XDP.
 
  - Add multi uprobe BPF links for attaching multiple uprobes
    and usdt probes, which is significantly faster and saves extra fds.
 
  - Implement an fd-based tc BPF attach API (TCX) and BPF link support on
    top of it.
 
  - Add SO_REUSEPORT support for TC bpf_sk_assign.
 
  - Support new instructions from cpu v4 to simplify the generated code and
    feature completeness, for x86, arm64, riscv64.
 
  - Support defragmenting IPv(4|6) packets in BPF.
 
  - Teach verifier actual bounds of bpf_get_smp_processor_id()
    and fix perf+libbpf issue related to custom section handling.
 
  - Introduce bpf map element count and enable it for all program types.
 
  - Add a BPF hook in sys_socket() to change the protocol ID
    from IPPROTO_TCP to IPPROTO_MPTCP to cover migration for legacy.
 
  - Introduce bpf_me_mcache_free_rcu() and fix OOM under stress.
 
  - Add uprobe support for the bpf_get_func_ip helper.
 
  - Check skb ownership against full socket.
 
  - Support for up to 12 arguments in BPF trampoline.
 
  - Extend link_info for kprobe_multi and perf_event links.
 
 Netfilter
 ---------
 
  - Speed-up process exit by aborting ruleset validation if a
    fatal signal is pending.
 
  - Allow NLA_POLICY_MASK to be used with BE16/BE32 types.
 
 Driver API
 ----------
 
  - Page pool optimizations, to improve data locality and cache usage.
 
  - Introduce ndo_hwtstamp_get() and ndo_hwtstamp_set() to avoid the need
    for raw ioctl() handling in drivers.
 
  - Simplify genetlink dump operations (doit/dumpit) providing them
    the common information already populated in struct genl_info.
 
  - Extend and use the yaml devlink specs to [re]generate the split ops.
 
  - Introduce devlink selective dumps, to allow SF filtering SF based on
    handle and other attributes.
 
  - Add yaml netlink spec for netlink-raw families, allow route, link and
    address related queries via the ynl tool.
 
  - Remove phylink legacy mode support.
 
  - Support offload LED blinking to phy.
 
  - Add devlink port function attributes for IPsec.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - Broadcom ASP 2.0 (72165) ethernet controller
    - MediaTek MT7988 SoC
    - Texas Instruments AM654 SoC
    - Texas Instruments IEP driver
    - Atheros qca8081 phy
    - Marvell 88Q2110 phy
    - NXP TJA1120 phy
 
  - WiFi:
    - MediaTek mt7981 support
 
  - Can:
    - Kvaser SmartFusion2 PCI Express devices
    - Allwinner T113 controllers
    - Texas Instruments tcan4552/4553 chips
 
  - Bluetooth:
    - Intel Gale Peak
    - Qualcomm WCN3988 and WCN7850
    - NXP AW693 and IW624
    - Mediatek MT2925
 
 Drivers
 -------
 
  - Ethernet NICs:
    - nVidia/Mellanox:
      - mlx5:
        - support UDP encapsulation in packet offload mode
        - IPsec packet offload support in eswitch mode
        - improve aRFS observability by adding new set of counters
        - extends MACsec offload support to cover RoCE traffic
        - dynamic completion EQs
      - mlx4:
        - convert to use auxiliary bus instead of custom interface logic
    - Intel
      - ice:
        - implement switchdev bridge offload, even for LAG interfaces
        - implement SRIOV support for LAG interfaces
      - igc:
        - add support for multiple in-flight TX timestamps
    - Broadcom:
      - bnxt:
        - use the unified RX page pool buffers for XDP and non-XDP
        - use the NAPI skb allocation cache
    - OcteonTX2:
      - support Round Robin scheduling HTB offload
      - TC flower offload support for SPI field
    - Freescale:
      -  add XDP_TX feature support
    - AMD:
      - ionic: add support for PCI FLR event
      - sfc:
        - basic conntrack offload
        - introduce eth, ipv4 and ipv6 pedit offloads
    - ST Microelectronics:
      - stmmac: maximze PTP timestamping resolution
 
  - Virtual NICs:
    - Microsoft vNIC:
      - batch ringing RX queue doorbell on receiving packets
      - add page pool for RX buffers
    - Virtio vNIC:
      - add per queue interrupt coalescing support
    - Google vNIC:
      - add queue-page-list mode support
 
  - Ethernet high-speed switches:
    - nVidia/Mellanox (mlxsw):
      - add port range matching tc-flower offload
      - permit enslavement to netdevices with uppers
 
  - Ethernet embedded switches:
    - Marvell (mv88e6xxx):
      - convert to phylink_pcs
    - Renesas:
      - r8A779fx: add speed change support
      - rzn1: enables vlan support
 
  - Ethernet PHYs:
    - convert mv88e6xxx to phylink_pcs
 
  - WiFi:
    - Qualcomm Wi-Fi 7 (ath12k):
      - extremely High Throughput (EHT) PHY support
    - RealTek (rtl8xxxu):
      - enable AP mode for: RTL8192FU, RTL8710BU (RTL8188GU),
        RTL8192EU and RTL8723BU
    - RealTek (rtw89):
      - Introduce Time Averaged SAR (TAS) support
 
  - Connector:
    - support for event filtering
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmTt1ZoSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkgFUP/REFaYWdWUvAzmWeezyx9dqgZMfSOjWq
 9QvySiA94OAOcjIYkb7wfzQ5BBAZqaBQ/f8XqWwS1EDDDEBs8sP1cxmABKwW7Hsr
 qFRu2sOqLzKBk223d0jIgEocfQaFpGbF71gXoTlDivBjBi5UxWm9bF0XnbYWcKgO
 /QEvzNosi9uNdi85Fzmv62J6YzAdidEpwGsM7X2CfejwNRmStxAEg/NwvRR0Hyiq
 OJCo97omEgTRaUle8nc64PDx33u4h5kQ1BkaeHEv0rbE3hftFC2YPKn/InmqSFGz
 6ew2xnrGPR37LCuAiCcIIv6yR7K0eu0iYJ7jXwZxBDqxGavEPuwWGBoCP6qFiitH
 ZLWhIrAUrdmSbySkTOCONhJ475qFAuQoYHYpZnX/bJZUHlSsb/9lwDJYJQGpVfd1
 /daqJVSb7lhaifmNO1iNd/ibCIXq9zapwtkRwA897M8GkZBTsnVvazFld1Em+Se3
 Bx6DSDUVBqVQ9fpZG2IAGD6odDwOzC1lF2IoceFvK9Ff6oE0psI+A0qNLMkHxZbW
 Qlo7LsNe53hpoCC+yHTfXX7e/X8eNt0EnCGOQJDusZ0Nr3K7H4LKFA0i8UBUK05n
 4lKnnaSQW7GQgdofLWt103OMDR9GoDxpFsm7b1X9+AEk6Fz6tq50wWYeMZETUKYP
 DCW8VGFOZjZM
 =9CsR
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Paolo Abeni:
 "Core:

   - Increase size limits for to-be-sent skb frag allocations. This
     allows tun, tap devices and packet sockets to better cope with
     large writes operations

   - Store netdevs in an xarray, to simplify iterating over netdevs

   - Refactor nexthop selection for multipath routes

   - Improve sched class lifetime handling

   - Add backup nexthop ID support for bridge

   - Implement drop reasons support in openvswitch

   - Several data races annotations and fixes

   - Constify the sk parameter of routing functions

   - Prepend kernel version to netconsole message

  Protocols:

   - Implement support for TCP probing the peer being under memory
     pressure

   - Remove hard coded limitation on IPv6 specific info placement inside
     the socket struct

   - Get rid of sysctl_tcp_adv_win_scale and use an auto-estimated per
     socket scaling factor

   - Scaling-up the IPv6 expired route GC via a separated list of
     expiring routes

   - In-kernel support for the TLS alert protocol

   - Better support for UDP reuseport with connected sockets

   - Add NEXT-C-SID support for SRv6 End.X behavior, reducing the SR
     header size

   - Get rid of additional ancillary per MPTCP connection struct socket

   - Implement support for BPF-based MPTCP packet schedulers

   - Format MPTCP subtests selftests results in TAP

   - Several new SMC 2.1 features including unique experimental options,
     max connections per lgr negotiation, max links per lgr negotiation

  BPF:

   - Multi-buffer support in AF_XDP

   - Add multi uprobe BPF links for attaching multiple uprobes and usdt
     probes, which is significantly faster and saves extra fds

   - Implement an fd-based tc BPF attach API (TCX) and BPF link support
     on top of it

   - Add SO_REUSEPORT support for TC bpf_sk_assign

   - Support new instructions from cpu v4 to simplify the generated code
     and feature completeness, for x86, arm64, riscv64

   - Support defragmenting IPv(4|6) packets in BPF

   - Teach verifier actual bounds of bpf_get_smp_processor_id() and fix
     perf+libbpf issue related to custom section handling

   - Introduce bpf map element count and enable it for all program types

   - Add a BPF hook in sys_socket() to change the protocol ID from
     IPPROTO_TCP to IPPROTO_MPTCP to cover migration for legacy

   - Introduce bpf_me_mcache_free_rcu() and fix OOM under stress

   - Add uprobe support for the bpf_get_func_ip helper

   - Check skb ownership against full socket

   - Support for up to 12 arguments in BPF trampoline

   - Extend link_info for kprobe_multi and perf_event links

  Netfilter:

   - Speed-up process exit by aborting ruleset validation if a fatal
     signal is pending

   - Allow NLA_POLICY_MASK to be used with BE16/BE32 types

  Driver API:

   - Page pool optimizations, to improve data locality and cache usage

   - Introduce ndo_hwtstamp_get() and ndo_hwtstamp_set() to avoid the
     need for raw ioctl() handling in drivers

   - Simplify genetlink dump operations (doit/dumpit) providing them the
     common information already populated in struct genl_info

   - Extend and use the yaml devlink specs to [re]generate the split ops

   - Introduce devlink selective dumps, to allow SF filtering SF based
     on handle and other attributes

   - Add yaml netlink spec for netlink-raw families, allow route, link
     and address related queries via the ynl tool

   - Remove phylink legacy mode support

   - Support offload LED blinking to phy

   - Add devlink port function attributes for IPsec

  New hardware / drivers:

   - Ethernet:
      - Broadcom ASP 2.0 (72165) ethernet controller
      - MediaTek MT7988 SoC
      - Texas Instruments AM654 SoC
      - Texas Instruments IEP driver
      - Atheros qca8081 phy
      - Marvell 88Q2110 phy
      - NXP TJA1120 phy

   - WiFi:
      - MediaTek mt7981 support

   - Can:
      - Kvaser SmartFusion2 PCI Express devices
      - Allwinner T113 controllers
      - Texas Instruments tcan4552/4553 chips

   - Bluetooth:
      - Intel Gale Peak
      - Qualcomm WCN3988 and WCN7850
      - NXP AW693 and IW624
      - Mediatek MT2925

  Drivers:

   - Ethernet NICs:
      - nVidia/Mellanox:
         - mlx5:
            - support UDP encapsulation in packet offload mode
            - IPsec packet offload support in eswitch mode
            - improve aRFS observability by adding new set of counters
            - extends MACsec offload support to cover RoCE traffic
            - dynamic completion EQs
         - mlx4:
            - convert to use auxiliary bus instead of custom interface
              logic
      - Intel
         - ice:
            - implement switchdev bridge offload, even for LAG
              interfaces
            - implement SRIOV support for LAG interfaces
         - igc:
            - add support for multiple in-flight TX timestamps
      - Broadcom:
         - bnxt:
            - use the unified RX page pool buffers for XDP and non-XDP
            - use the NAPI skb allocation cache
      - OcteonTX2:
         - support Round Robin scheduling HTB offload
         - TC flower offload support for SPI field
      - Freescale:
         - add XDP_TX feature support
      - AMD:
         - ionic: add support for PCI FLR event
         - sfc:
            - basic conntrack offload
            - introduce eth, ipv4 and ipv6 pedit offloads
      - ST Microelectronics:
         - stmmac: maximze PTP timestamping resolution

   - Virtual NICs:
      - Microsoft vNIC:
         - batch ringing RX queue doorbell on receiving packets
         - add page pool for RX buffers
      - Virtio vNIC:
         - add per queue interrupt coalescing support
      - Google vNIC:
         - add queue-page-list mode support

   - Ethernet high-speed switches:
      - nVidia/Mellanox (mlxsw):
         - add port range matching tc-flower offload
         - permit enslavement to netdevices with uppers

   - Ethernet embedded switches:
      - Marvell (mv88e6xxx):
         - convert to phylink_pcs
      - Renesas:
         - r8A779fx: add speed change support
         - rzn1: enables vlan support

   - Ethernet PHYs:
      - convert mv88e6xxx to phylink_pcs

   - WiFi:
      - Qualcomm Wi-Fi 7 (ath12k):
         - extremely High Throughput (EHT) PHY support
      - RealTek (rtl8xxxu):
         - enable AP mode for: RTL8192FU, RTL8710BU (RTL8188GU),
           RTL8192EU and RTL8723BU
      - RealTek (rtw89):
         - Introduce Time Averaged SAR (TAS) support

   - Connector:
      - support for event filtering"

* tag 'net-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1806 commits)
  net: ethernet: mtk_wed: minor change in wed_{tx,rx}info_show
  net: ethernet: mtk_wed: add some more info in wed_txinfo_show handler
  net: stmmac: clarify difference between "interface" and "phy_interface"
  r8152: add vendor/device ID pair for D-Link DUB-E250
  devlink: move devlink_notify_register/unregister() to dev.c
  devlink: move small_ops definition into netlink.c
  devlink: move tracepoint definitions into core.c
  devlink: push linecard related code into separate file
  devlink: push rate related code into separate file
  devlink: push trap related code into separate file
  devlink: use tracepoint_enabled() helper
  devlink: push region related code into separate file
  devlink: push param related code into separate file
  devlink: push resource related code into separate file
  devlink: push dpipe related code into separate file
  devlink: move and rename devlink_dpipe_send_and_alloc_skb() helper
  devlink: push shared buffer related code into separate file
  devlink: push port related code into separate file
  devlink: push object register/unregister notifications into separate helpers
  inet: fix IP_TRANSPARENT error handling
  ...
2023-08-29 11:33:01 -07:00
Linus Torvalds
615e95831e v6.6-vfs.ctime
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXTKAAKCRCRxhvAZXjc
 oifJAQCzi/p+AdQu8LA/0XvR7fTwaq64ZDCibU4BISuLGT2kEgEAuGbuoFZa0rs2
 XYD/s4+gi64p9Z01MmXm2XO1pu3GPg0=
 =eJz5
 -----END PGP SIGNATURE-----

Merge tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs timestamp updates from Christian Brauner:
 "This adds VFS support for multi-grain timestamps and converts tmpfs,
  xfs, ext4, and btrfs to use them. This carries acks from all relevant
  filesystems.

  The VFS always uses coarse-grained timestamps when updating the ctime
  and mtime after a change. This has the benefit of allowing filesystems
  to optimize away a lot of metadata updates, down to around 1 per
  jiffy, even when a file is under heavy writes.

  Unfortunately, this has always been an issue when we're exporting via
  NFSv3, which relies on timestamps to validate caches. A lot of changes
  can happen in a jiffy, so timestamps aren't sufficient to help the
  client decide to invalidate the cache.

  Even with NFSv4, a lot of exported filesystems don't properly support
  a change attribute and are subject to the same problems with timestamp
  granularity. Other applications have similar issues with timestamps
  (e.g., backup applications).

  If we were to always use fine-grained timestamps, that would improve
  the situation, but that becomes rather expensive, as the underlying
  filesystem would have to log a lot more metadata updates.

  This introduces fine-grained timestamps that are used when they are
  actively queried.

  This uses the 31st bit of the ctime tv_nsec field to indicate that
  something has queried the inode for the mtime or ctime. When this flag
  is set, on the next mtime or ctime update, the kernel will fetch a
  fine-grained timestamp instead of the usual coarse-grained one.

  As POSIX generally mandates that when the mtime changes, the ctime
  must also change the kernel always stores normalized ctime values, so
  only the first 30 bits of the tv_nsec field are ever used.

  Filesytems can opt into this behavior by setting the FS_MGTIME flag in
  the fstype. Filesystems that don't set this flag will continue to use
  coarse-grained timestamps.

  Various preparatory changes, fixes and cleanups are included:

   - Fixup all relevant places where POSIX requires updating ctime
     together with mtime. This is a wide-range of places and all
     maintainers provided necessary Acks.

   - Add new accessors for inode->i_ctime directly and change all
     callers to rely on them. Plain accesses to inode->i_ctime are now
     gone and it is accordingly rename to inode->__i_ctime and commented
     as requiring accessors.

   - Extend generic_fillattr() to pass in a request mask mirroring in a
     sense the statx() uapi. This allows callers to pass in a request
     mask to only get a subset of attributes filled in.

   - Rework timestamp updates so it's possible to drop the @now
     parameter the update_time() inode operation and associated helpers.

   - Add inode_update_timestamps() and convert all filesystems to it
     removing a bunch of open-coding"

* tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (107 commits)
  btrfs: convert to multigrain timestamps
  ext4: switch to multigrain timestamps
  xfs: switch to multigrain timestamps
  tmpfs: add support for multigrain timestamps
  fs: add infrastructure for multigrain timestamps
  fs: drop the timespec64 argument from update_time
  xfs: have xfs_vn_update_time gets its own timestamp
  fat: make fat_update_time get its own timestamp
  fat: remove i_version handling from fat_update_time
  ubifs: have ubifs_update_time use inode_update_timestamps
  btrfs: have it use inode_update_timestamps
  fs: drop the timespec64 arg from generic_update_time
  fs: pass the request_mask to generic_fillattr
  fs: remove silly warning from current_time
  gfs2: fix timestamp handling on quota inodes
  fs: rename i_ctime field to __i_ctime
  selinux: convert to ctime accessor functions
  security: convert to ctime accessor functions
  apparmor: convert to ctime accessor functions
  sunrpc: convert to ctime accessor functions
  ...
2023-08-28 09:31:32 -07:00
Dave Marchevsky
5861d1e8db bpf: Allow bpf_spin_{lock,unlock} in sleepable progs
Commit 9e7a4d9831 ("bpf: Allow LSM programs to use bpf spin locks")
disabled bpf_spin_lock usage in sleepable progs, stating:

 Sleepable LSM programs can be preempted which means that allowng spin
 locks will need more work (disabling preemption and the verifier
 ensuring that no sleepable helpers are called when a spin lock is
 held).

This patch disables preemption before grabbing bpf_spin_lock. The second
requirement above "no sleepable helpers are called when a spin lock is
held" is implicitly enforced by current verifier logic due to helper
calls in spin_lock CS being disabled except for a few exceptions, none
of which sleep.

Due to above preemption changes, bpf_spin_lock CS can also be considered
a RCU CS, so verifier's in_rcu_cs check is modified to account for this.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230821193311.3290257-7-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-25 09:23:17 -07:00
Dave Marchevsky
0816b8c6bf bpf: Consider non-owning refs to refcounted nodes RCU protected
An earlier patch in the series ensures that the underlying memory of
nodes with bpf_refcount - which can have multiple owners - is not reused
until RCU grace period has elapsed. This prevents
use-after-free with non-owning references that may point to
recently-freed memory. While RCU read lock is held, it's safe to
dereference such a non-owning ref, as by definition RCU GP couldn't have
elapsed and therefore underlying memory couldn't have been reused.

From the perspective of verifier "trustedness" non-owning refs to
refcounted nodes are now trusted only in RCU CS and therefore should no
longer pass is_trusted_reg, but rather is_rcu_reg. Let's mark them
MEM_RCU in order to reflect this new state.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230821193311.3290257-6-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-25 09:23:16 -07:00
Dave Marchevsky
ba2464c86f bpf: Reenable bpf_refcount_acquire
Now that all reported issues are fixed, bpf_refcount_acquire can be
turned back on. Also reenable all bpf_refcount-related tests which were
disabled.

This a revert of:
 * commit f3514a5d67 ("selftests/bpf: Disable newly-added 'owner' field test until refcount re-enabled")
 * commit 7deca5eae8 ("bpf: Disable bpf_refcount_acquire kfunc calls until race conditions are fixed")

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230821193311.3290257-5-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-25 09:23:16 -07:00
Dave Marchevsky
7e26cd12ad bpf: Use bpf_mem_free_rcu when bpf_obj_dropping refcounted nodes
This is the final fix for the use-after-free scenario described in
commit 7793fc3bab ("bpf: Make bpf_refcount_acquire fallible for
non-owning refs"). That commit, by virtue of changing
bpf_refcount_acquire's refcount_inc to a refcount_inc_not_zero, fixed
the "refcount incr on 0" splat. The not_zero check in
refcount_inc_not_zero, though, still occurs on memory that could have
been free'd and reused, so the commit didn't properly fix the root
cause.

This patch actually fixes the issue by free'ing using the recently-added
bpf_mem_free_rcu, which ensures that the memory is not reused until
RCU grace period has elapsed. If that has happened then
there are no non-owning references alive that point to the
recently-free'd memory, so it can be safely reused.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230821193311.3290257-4-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-25 09:23:16 -07:00
Dave Marchevsky
f0d991a070 bpf: Ensure kptr_struct_meta is non-NULL for collection insert and refcount_acquire
It's straightforward to prove that kptr_struct_meta must be non-NULL for
any valid call to these kfuncs:

  * btf_parse_struct_metas in btf.c creates a btf_struct_meta for any
    struct in user BTF with a special field (e.g. bpf_refcount,
    {rb,list}_node). These are stored in that BTF's struct_meta_tab.

  * __process_kf_arg_ptr_to_graph_node in verifier.c ensures that nodes
    have {rb,list}_node field and that it's at the correct offset.
    Similarly, check_kfunc_args ensures bpf_refcount field existence for
    node param to bpf_refcount_acquire.

  * So a btf_struct_meta must have been created for the struct type of
    node param to these kfuncs

  * That BTF and its struct_meta_tab are guaranteed to still be around.
    Any arbitrary {rb,list} node the BPF program interacts with either:
    came from bpf_obj_new or a collection removal kfunc in the same
    program, in which case the BTF is associated with the program and
    still around; or came from bpf_kptr_xchg, in which case the BTF was
    associated with the map and is still around

Instead of silently continuing with NULL struct_meta, which caused
confusing bugs such as those addressed by commit 2140a6e342 ("bpf: Set
kptr_struct_meta for node param to list and rbtree insert funcs"), let's
error out. Then, at runtime, we can confidently say that the
implementations of these kfuncs were given a non-NULL kptr_struct_meta,
meaning that special-field-specific functionality like
bpf_obj_free_fields and the bpf_obj_drop change introduced later in this
series are guaranteed to execute.

This patch doesn't change functionality, just makes it easier to reason
about existing functionality.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230821193311.3290257-2-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-25 09:23:16 -07:00
Yonghong Song
393dc4bd92 bpf: Remove a WARN_ON_ONCE warning related to local kptr
Currently, in function bpf_obj_free_fields(), for local kptr,
a warning will be issued if the struct does not contain any
special fields. But actually the kernel seems totally okay
with a local kptr without any special fields. Permitting
no special fields also aligns with future percpu kptr which
also allows no special fields.

Acked-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230824063417.201925-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-24 08:15:16 -07:00
Yafang Shao
d75e30dddf bpf: Fix issue in verifying allow_ptr_leaks
After we converted the capabilities of our networking-bpf program from
cap_sys_admin to cap_net_admin+cap_bpf, our networking-bpf program
failed to start. Because it failed the bpf verifier, and the error log
is "R3 pointer comparison prohibited".

A simple reproducer as follows,

SEC("cls-ingress")
int ingress(struct __sk_buff *skb)
{
	struct iphdr *iph = (void *)(long)skb->data + sizeof(struct ethhdr);

	if ((long)(iph + 1) > (long)skb->data_end)
		return TC_ACT_STOLEN;
	return TC_ACT_OK;
}

Per discussion with Yonghong and Alexei [1], comparison of two packet
pointers is not a pointer leak. This patch fixes it.

Our local kernel is 6.1.y and we expect this fix to be backported to
6.1.y, so stable is CCed.

[1]. https://lore.kernel.org/bpf/CAADnVQ+Nmspr7Si+pxWn8zkE7hX-7s93ugwC+94aXSy4uQ9vBg@mail.gmail.com/

Suggested-by: Yonghong Song <yonghong.song@linux.dev>
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230823020703.3790-2-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-23 09:37:29 -07:00
Masami Hiramatsu (Google)
b1d1e90490 tracing/probes: Support BTF argument on module functions
Since the btf returned from bpf_get_btf_vmlinux() only covers functions in
the vmlinux, BTF argument is not available on the functions in the modules.
Use bpf_find_btf_id() instead of bpf_get_btf_vmlinux()+btf_find_name_kind()
so that BTF argument can find the correct struct btf and btf_type in it.
With this fix, fprobe events can use `$arg*` on module functions as below

 # grep nf_log_ip_packet /proc/kallsyms
ffffffffa0005c00 t nf_log_ip_packet	[nf_log_syslog]
ffffffffa0005bf0 t __pfx_nf_log_ip_packet	[nf_log_syslog]
 # echo 'f nf_log_ip_packet $arg*' > dynamic_events
 # cat dynamic_events
f:fprobes/nf_log_ip_packet__entry nf_log_ip_packet net=net pf=pf hooknum=hooknum skb=skb in=in out=out loginfo=loginfo prefix=prefix

To support the module's btf which is removable, the struct btf needs to be
ref-counted. So this also records the btf in the traceprobe_parse_context
and returns the refcount when the parse has done.

Link: https://lore.kernel.org/all/169272154223.160970.3507930084247934031.stgit@devnote2/

Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-08-23 09:39:15 +09:00
Kumar Kartikeya Dwivedi
6785b2edf4 bpf: Fix check_func_arg_reg_off bug for graph root/node
The commit being fixed introduced a hunk into check_func_arg_reg_off
that bypasses reg->off == 0 enforcement when offset points to a graph
node or root. This might possibly be done for treating bpf_rbtree_remove
and others as KF_RELEASE and then later check correct reg->off in helper
argument checks.

But this is not the case, those helpers are already not KF_RELEASE and
permit non-zero reg->off and verify it later to match the subobject in
BTF type.

However, this logic leads to bpf_obj_drop permitting free of register
arguments with non-zero offset when they point to a graph root or node
within them, which is not ok.

For instance:

struct foo {
	int i;
	int j;
	struct bpf_rb_node node;
};

struct foo *f = bpf_obj_new(typeof(*f));
if (!f) ...
bpf_obj_drop(f); // OK
bpf_obj_drop(&f->i); // still ok from verifier PoV
bpf_obj_drop(&f->node); // Not OK, but permitted right now

Fix this by dropping the whole part of code altogether.

Fixes: 6a3cd3318f ("bpf: Migrate release_on_unlock logic to non-owning ref semantics")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230822175140.1317749-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-22 12:52:48 -07:00
Yonghong Song
ab6c637ad0 bpf: Fix a bpf_kptr_xchg() issue with local kptr
When reviewing local percpu kptr support, Alexei discovered a bug
wherea bpf_kptr_xchg() may succeed even if the map value kptr type and
locally allocated obj type do not match ([1]). Missed struct btf_id
comparison is the reason for the bug. This patch added such struct btf_id
comparison and will flag verification failure if types do not match.

  [1] https://lore.kernel.org/bpf/20230819002907.io3iphmnuk43xblu@macbook-pro-8.dhcp.thefacebook.com/#t

Reported-by: Alexei Starovoitov <ast@kernel.org>
Fixes: 738c96d5e2 ("bpf: Allow local kptrs to be exchanged via bpf_kptr_xchg")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230822050053.2886960-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-22 09:43:55 -07:00
Jiri Olsa
b733eeade4 bpf: Add pid filter support for uprobe_multi link
Adding support to specify pid for uprobe_multi link and the uprobes
are created only for task with given pid value.

Using the consumer.filter filter callback for that, so the task gets
filtered during the uprobe installation.

We still need to check the task during runtime in the uprobe handler,
because the handler could get executed if there's another system
wide consumer on the same uprobe (thanks Oleg for the insight).

Cc: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230809083440.3209381-6-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-21 15:51:25 -07:00
Jiri Olsa
0b779b61f6 bpf: Add cookies support for uprobe_multi link
Adding support to specify cookies array for uprobe_multi link.

The cookies array share indexes and length with other uprobe_multi
arrays (offsets/ref_ctr_offsets).

The cookies[i] value defines cookie for i-the uprobe and will be
returned by bpf_get_attach_cookie helper when called from ebpf
program hooked to that specific uprobe.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230809083440.3209381-5-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-21 15:51:25 -07:00
Jiri Olsa
89ae89f53d bpf: Add multi uprobe link
Adding new multi uprobe link that allows to attach bpf program
to multiple uprobes.

Uprobes to attach are specified via new link_create uprobe_multi
union:

  struct {
    __aligned_u64   path;
    __aligned_u64   offsets;
    __aligned_u64   ref_ctr_offsets;
    __u32           cnt;
    __u32           flags;
  } uprobe_multi;

Uprobes are defined for single binary specified in path and multiple
calling sites specified in offsets array with optional reference
counters specified in ref_ctr_offsets array. All specified arrays
have length of 'cnt'.

The 'flags' supports single bit for now that marks the uprobe as
return probe.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230809083440.3209381-4-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-21 15:51:25 -07:00
Jiri Olsa
3505cb9fa2 bpf: Add attach_type checks under bpf_prog_attach_check_attach_type
Add extra attach_type checks from link_create under
bpf_prog_attach_check_attach_type.

Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230809083440.3209381-3-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-21 15:51:25 -07:00
Hou Tao
c2e42ddf26 bpf, cpumask: Clean up bpf_cpu_map_entry directly in cpu_map_free
After synchronous_rcu(), both the dettached XDP program and
xdp_do_flush() are completed, and the only user of bpf_cpu_map_entry
will be cpu_map_kthread_run(), so instead of calling
__cpu_map_entry_replace() to stop kthread and cleanup entry after a RCU
grace period, do these things directly.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20230816045959.358059-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-21 15:21:16 -07:00
Hou Tao
8f8500a247 bpf, cpumap: Use queue_rcu_work() to remove unnecessary rcu_barrier()
As for now __cpu_map_entry_replace() uses call_rcu() to wait for the
inflight xdp program to exit the RCU read critical section, and then
launch kworker cpu_map_kthread_stop() to call kthread_stop() to flush
all pending xdp frames or skbs.

But it is unnecessary to use rcu_barrier() in cpu_map_kthread_stop() to
wait for the completion of __cpu_map_entry_free(), because rcu_barrier()
will wait for all pending RCU callbacks and cpu_map_kthread_stop() only
needs to wait for the completion of a specific __cpu_map_entry_free().

So use queue_rcu_work() to replace call_rcu(), schedule_work() and
rcu_barrier(). queue_rcu_work() will queue a __cpu_map_entry_free()
kworker after a RCU grace period. Because __cpu_map_entry_free() is
running in a kworker context, so it is OK to do all of these freeing
procedures include kthread_stop() in it.

After the update, there is no need to do reference-counting for
bpf_cpu_map_entry, because bpf_cpu_map_entry is freed directly in
__cpu_map_entry_free(), so just remove it.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20230816045959.358059-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-21 15:21:16 -07:00
Yafang Shao
0aa35162d2 bpf: Fix uninitialized symbol in bpf_perf_link_fill_kprobe()
The commit 1b715e1b0e ("bpf: Support ->fill_link_info for perf_event") leads
to the following Smatch static checker warning:

    kernel/bpf/syscall.c:3416 bpf_perf_link_fill_kprobe()
    error: uninitialized symbol 'type'.

That can happens when uname is NULL. So fix it by verifying the uname when we
really need to fill it.

Fixes: 1b715e1b0e ("bpf: Support ->fill_link_info for perf_event")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Closes: https://lore.kernel.org/bpf/85697a7e-f897-4f74-8b43-82721bebc462@kili.mountain
Link: https://lore.kernel.org/bpf/20230813141900.1268-2-laoar.shao@gmail.com
2023-08-16 16:44:23 +02:00
David Vernet
8ba651ed7f bpf: Support default .validate() and .update() behavior for struct_ops links
Currently, if a struct_ops map is loaded with BPF_F_LINK, it must also
define the .validate() and .update() callbacks in its corresponding
struct bpf_struct_ops in the kernel. Enabling struct_ops link is useful
in its own right to ensure that the map is unloaded if an application
crashes. For example, with sched_ext, we want to automatically unload
the host-wide scheduler if the application crashes. We would likely
never support updating elements of a sched_ext struct_ops map, so we'd
have to implement these callbacks showing that they _can't_ support
element updates just to benefit from the basic lifetime management of
struct_ops links.

Let's enable struct_ops maps to work with BPF_F_LINK even if they
haven't defined these callbacks, by assuming that a struct_ops map
element cannot be updated by default.

Acked-by: Kui-Feng Lee <thinker.li@gmail.com>
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230814185908.700553-2-void@manifault.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-14 22:23:39 -07:00
Yue Haibing
526bc5ba19 bpf: lru: Remove unused declaration bpf_lru_promote()
Commit 3a08c2fd76 ("bpf: LRU List") declared but never implemented this.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Link: https://lore.kernel.org/r/20230808145531.19692-1-yuehaibing@huawei.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-08 17:21:42 -07:00
Yonghong Song
db2baf82b0 bpf: Fix an incorrect verification success with movsx insn
syzbot reports a verifier bug which triggers a runtime panic.
The test bpf program is:
   0: (62) *(u32 *)(r10 -8) = 553656332
   1: (bf) r1 = (s16)r10
   2: (07) r1 += -8
   3: (b7) r2 = 3
   4: (bd) if r2 <= r1 goto pc+0
   5: (85) call bpf_trace_printk#-138320
   6: (b7) r0 = 0
   7: (95) exit

At insn 1, the current implementation keeps 'r1' as a frame pointer,
which caused later bpf_trace_printk helper call crash since frame
pointer address is not valid any more. Note that at insn 4,
the 'pointer vs. scalar' comparison is allowed for privileged
prog run.

To fix the problem with above insn 1, the fix in the patch adopts
similar pattern to existing 'R1 = (u32) R2' handling. For unprivileged
prog run, verification will fail with 'R<num> sign-extension part of pointer'.
For privileged prog run, the dst_reg 'r1' will be marked as
an unknown scalar, so later 'bpf_trace_pointk' helper will complain
since it expected certain pointers.

Reported-by: syzbot+d61b595e9205573133b3@syzkaller.appspotmail.com
Fixes: 8100928c88 ("bpf: Support new sign-extension mov insns")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230807175721.671696-1-yonghong.song@linux.dev
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-07 16:23:35 -07:00
Yang Yingliang
1e8e2efb34 bpf: change bpf_alu_sign_string and bpf_movsx_string to static
The bpf_alu_sign_string and bpf_movsx_string introduced in commit
f835bb6222 ("bpf: Add kernel/bpftool asm support for new instructions")
are only used in disasm.c now, change them to static.

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202308050615.wxAn1v2J-lkp@intel.com/
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230803023128.3753323-1-yangyingliang@huawei.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-04 16:15:50 -07:00
Kui-Feng Lee
5426700e68 bpf: fix bpf_dynptr_slice() to stop return an ERR_PTR.
Verify if the pointer obtained from bpf_xdp_pointer() is either an error or
NULL before returning it.

The function bpf_dynptr_slice() mistakenly returned an ERR_PTR. Instead of
solely checking for NULL, it should also verify if the pointer returned by
bpf_xdp_pointer() is an error or NULL.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/bpf/d1360219-85c3-4a03-9449-253ea905f9d1@moroto.mountain/
Fixes: 66e3a13e7c ("bpf: Add bpf_dynptr_slice and bpf_dynptr_slice_rdwr")
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230803231206.1060485-1-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-04 14:53:15 -07:00
Daniel Borkmann
d210f9735e bpf: Fix mprog detachment for empty mprog entry
syzbot reported an UBSAN array-index-out-of-bounds access in bpf_mprog_read()
upon bpf_mprog_detach(). While it did not have a reproducer, I was able to
manually reproduce through an empty mprog entry which just has miniq present.

The latter is important given otherwise we get an ENOENT error as tcx detaches
the whole mprog entry. The index 4294967295 was triggered via NULL dtuple.prog
which then attempts to detach from the back. bpf_mprog_fetch() in this case
did hit the idx == total and therefore tried to grab the entry at idx -1.

Fix it by adding an explicit bpf_mprog_total() check in bpf_mprog_detach() and
bail out early with ENOENT.

Fixes: 053c8e1f23 ("bpf: Add generic attach/detach/query API for multi-progs")
Reported-by: syzbot+0c06ba0f831fe07a8f27@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20230804131112.11012-1-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-04 09:35:39 -07:00
Li kunyu
5964d1e459 bpf: bpf_struct_ops: Remove unnecessary initial values of variables
err and tlinks is assigned first, so it does not need to initialize the
assignment.

Signed-off-by: Li kunyu <kunyu@nfschina.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230804175929.2867-1-kunyu@nfschina.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-03 17:54:35 -07:00
Jakub Kicinski
d07b7b32da pull-request: bpf-next 2023-08-03
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRdM/uy1Ege0+EN1fNar9k/UBDW4wUCZMvevwAKCRBar9k/UBDW
 42Z0AP90hLZ9OmoghYAlALHLl8zqXuHCV8OeFXR5auqG+kkcCwEAx6h99vnh4zgP
 Tngj6Yid60o39/IZXXblhV37HfSiyQ8=
 =/kVE
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Martin KaFai Lau says:

====================
pull-request: bpf-next 2023-08-03

We've added 54 non-merge commits during the last 10 day(s) which contain
a total of 84 files changed, 4026 insertions(+), 562 deletions(-).

The main changes are:

1) Add SO_REUSEPORT support for TC bpf_sk_assign from Lorenz Bauer,
   Daniel Borkmann

2) Support new insns from cpu v4 from Yonghong Song

3) Non-atomically allocate freelist during prefill from YiFei Zhu

4) Support defragmenting IPv(4|6) packets in BPF from Daniel Xu

5) Add tracepoint to xdp attaching failure from Leon Hwang

6) struct netdev_rx_queue and xdp.h reshuffling to reduce
   rebuild time from Jakub Kicinski

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (54 commits)
  net: invert the netdevice.h vs xdp.h dependency
  net: move struct netdev_rx_queue out of netdevice.h
  eth: add missing xdp.h includes in drivers
  selftests/bpf: Add testcase for xdp attaching failure tracepoint
  bpf, xdp: Add tracepoint to xdp attaching failure
  selftests/bpf: fix static assert compilation issue for test_cls_*.c
  bpf: fix bpf_probe_read_kernel prototype mismatch
  riscv, bpf: Adapt bpf trampoline to optimized riscv ftrace framework
  libbpf: fix typos in Makefile
  tracing: bpf: use struct trace_entry in struct syscall_tp_t
  bpf, devmap: Remove unused dtab field from bpf_dtab_netdev
  bpf, cpumap: Remove unused cmap field from bpf_cpu_map_entry
  netfilter: bpf: Only define get_proto_defrag_hook() if necessary
  bpf: Fix an array-index-out-of-bounds issue in disasm.c
  net: remove duplicate INDIRECT_CALLABLE_DECLARE of udp[6]_ehashfn
  docs/bpf: Fix malformed documentation
  bpf: selftests: Add defrag selftests
  bpf: selftests: Support custom type and proto for client sockets
  bpf: selftests: Support not connecting client socket
  netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
  ...
====================

Link: https://lore.kernel.org/r/20230803174845.825419-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-03 15:34:36 -07:00
Jakub Kicinski
35b1b1fd96 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

net/dsa/port.c
  9945c1fb03 ("net: dsa: fix older DSA drivers using phylink")
  a88dd75384 ("net: dsa: remove legacy_pre_march2020 detection")
https://lore.kernel.org/all/20230731102254.2c9868ca@canb.auug.org.au/

net/xdp/xsk.c
  3c5b4d69c3 ("net: annotate data-races around sk->sk_mark")
  b7f72a30e9 ("xsk: introduce wrappers and helpers for supporting multi-buffer in Tx path")
https://lore.kernel.org/all/20230731102631.39988412@canb.auug.org.au/

drivers/net/ethernet/broadcom/bnxt/bnxt.c
  37b61cda9c ("bnxt: don't handle XDP in netpoll")
  2b56b3d992 ("eth: bnxt: handle invalid Tx completions more gracefully")
https://lore.kernel.org/all/20230801101708.1dc7faac@canb.auug.org.au/

Adjacent changes:

drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_fs.c
  62da08331f ("net/mlx5e: Set proper IPsec source port in L4 selector")
  fbd517549c ("net/mlx5e: Add function to get IPsec offload namespace")

drivers/net/ethernet/sfc/selftest.c
  55c1528f9b ("sfc: fix field-spanning memcpy in selftest")
  ae9d445cd4 ("sfc: Miscellaneous comment removals")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-03 14:34:37 -07:00
Jakub Kicinski
680ee0456a net: invert the netdevice.h vs xdp.h dependency
xdp.h is far more specific and is included in only 67 other
files vs netdevice.h's 1538 include sites.
Make xdp.h include netdevice.h, instead of the other way around.
This decreases the incremental allmodconfig builds size when
xdp.h is touched from 5947 to 662 objects.

Move bpf_prog_run_xdp() to xdp.h, seems appropriate and filter.h
is a mega-header in its own right so it's nice to avoid xdp.h
getting included there as well.

The only unfortunate part is that the typedef for xdp_features_t
has to move to netdevice.h, since its embedded in struct netdevice.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20230803010230.1755386-4-kuba@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-03 08:38:07 -07:00
Arnd Bergmann
6a5a148aaf bpf: fix bpf_probe_read_kernel prototype mismatch
bpf_probe_read_kernel() has a __weak definition in core.c and another
definition with an incompatible prototype in kernel/trace/bpf_trace.c,
when CONFIG_BPF_EVENTS is enabled.

Since the two are incompatible, there cannot be a shared declaration in
a header file, but the lack of a prototype causes a W=1 warning:

kernel/bpf/core.c:1638:12: error: no previous prototype for 'bpf_probe_read_kernel' [-Werror=missing-prototypes]

On 32-bit architectures, the local prototype

u64 __weak bpf_probe_read_kernel(void *dst, u32 size, const void *unsafe_ptr)

passes arguments in other registers as the one in bpf_trace.c

BPF_CALL_3(bpf_probe_read_kernel, void *, dst, u32, size,
            const void *, unsafe_ptr)

which uses 64-bit arguments in pairs of registers.

As both versions of the function are fairly simple and only really
differ in one line, just move them into a header file as an inline
function that does not add any overhead for the bpf_trace.c callers
and actually avoids a function call for the other one.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/ac25cb0f-b804-1649-3afb-1dc6138c2716@iogearbox.net/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230801111449.185301-1-arnd@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-02 14:14:17 -07:00
Hou Tao
1ea66e89f6 bpf, devmap: Remove unused dtab field from bpf_dtab_netdev
Commit 96360004b8 ("xdp: Make devmap flush_list common for all map
instances") removes the use of bpf_dtab_netdev::dtab in bq_enqueue(),
so just remove dtab from bpf_dtab_netdev.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20230728014942.892272-3-houtao@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-31 18:26:08 -07:00
Hou Tao
2d20bfc315 bpf, cpumap: Remove unused cmap field from bpf_cpu_map_entry
Since commit cdfafe98ca ("xdp: Make cpumap flush_list common for all
map instances"), cmap is no longer used, so just remove it.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230728014942.892272-2-houtao@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-31 18:26:08 -07:00
Yonghong Song
e99688eba2 bpf: Fix an array-index-out-of-bounds issue in disasm.c
syzbot reported an array-index-out-of-bounds when printing out bpf
insns. Further investigation shows the insn is illegal but
is printed out due to log level 1 or 2 before actual insn verification
in do_check().

This particular illegal insn is a MOVSX insn with offset value 2.
The legal offset value for MOVSX should be 8, 16 and 32.
The disasm sign-extension-size array index is calculated as
 (insn->off / 8) - 1
and offset value 2 gives an out-of-bound index -1.

Tighten the checking for MOVSX insn in disasm.c to avoid
array-index-out-of-bounds issue.

Reported-by: syzbot+3758842a6c01012aa73b@syzkaller.appspotmail.com
Fixes: f835bb6222 ("bpf: Add kernel/bpftool asm support for new instructions")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230731204534.1975311-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-31 17:35:02 -07:00
Hou Tao
7c62b75cd1 bpf, cpumap: Handle skb as well when clean up ptr_ring
The following warning was reported when running xdp_redirect_cpu with
both skb-mode and stress-mode enabled:

  ------------[ cut here ]------------
  Incorrect XDP memory type (-2128176192) usage
  WARNING: CPU: 7 PID: 1442 at net/core/xdp.c:405
  Modules linked in:
  CPU: 7 PID: 1442 Comm: kworker/7:0 Tainted: G  6.5.0-rc2+ #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
  Workqueue: events __cpu_map_entry_free
  RIP: 0010:__xdp_return+0x1e4/0x4a0
  ......
  Call Trace:
   <TASK>
   ? show_regs+0x65/0x70
   ? __warn+0xa5/0x240
   ? __xdp_return+0x1e4/0x4a0
   ......
   xdp_return_frame+0x4d/0x150
   __cpu_map_entry_free+0xf9/0x230
   process_one_work+0x6b0/0xb80
   worker_thread+0x96/0x720
   kthread+0x1a5/0x1f0
   ret_from_fork+0x3a/0x70
   ret_from_fork_asm+0x1b/0x30
   </TASK>

The reason for the warning is twofold. One is due to the kthread
cpu_map_kthread_run() is stopped prematurely. Another one is
__cpu_map_ring_cleanup() doesn't handle skb mode and treats skbs in
ptr_ring as XDP frames.

Prematurely-stopped kthread will be fixed by the preceding patch and
ptr_ring will be empty when __cpu_map_ring_cleanup() is called. But
as the comments in __cpu_map_ring_cleanup() said, handling and freeing
skbs in ptr_ring as well to "catch any broken behaviour gracefully".

Fixes: 11941f8a85 ("bpf: cpumap: Implement generic cpumap")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20230729095107.1722450-3-houtao@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-31 15:37:12 -07:00
Hou Tao
640a604585 bpf, cpumap: Make sure kthread is running before map update returns
The following warning was reported when running stress-mode enabled
xdp_redirect_cpu with some RT threads:

  ------------[ cut here ]------------
  WARNING: CPU: 4 PID: 65 at kernel/bpf/cpumap.c:135
  CPU: 4 PID: 65 Comm: kworker/4:1 Not tainted 6.5.0-rc2+ #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
  Workqueue: events cpu_map_kthread_stop
  RIP: 0010:put_cpu_map_entry+0xda/0x220
  ......
  Call Trace:
   <TASK>
   ? show_regs+0x65/0x70
   ? __warn+0xa5/0x240
   ......
   ? put_cpu_map_entry+0xda/0x220
   cpu_map_kthread_stop+0x41/0x60
   process_one_work+0x6b0/0xb80
   worker_thread+0x96/0x720
   kthread+0x1a5/0x1f0
   ret_from_fork+0x3a/0x70
   ret_from_fork_asm+0x1b/0x30
   </TASK>

The root cause is the same as commit 4369016497 ("bpf: cpumap: Fix memory
leak in cpu_map_update_elem"). The kthread is stopped prematurely by
kthread_stop() in cpu_map_kthread_stop(), and kthread() doesn't call
cpu_map_kthread_run() at all but XDP program has already queued some
frames or skbs into ptr_ring. So when __cpu_map_ring_cleanup() checks
the ptr_ring, it will find it was not emptied and report a warning.

An alternative fix is to use __cpu_map_ring_cleanup() to drop these
pending frames or skbs when kthread_stop() returns -EINTR, but it may
confuse the user, because these frames or skbs have been handled
correctly by XDP program. So instead of dropping these frames or skbs,
just make sure the per-cpu kthread is running before
__cpu_map_entry_alloc() returns.

After apply the fix, the error handle for kthread_stop() will be
unnecessary because it will always return 0, so just remove it.

Fixes: 6710e11269 ("bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Pu Lehui <pulehui@huawei.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20230729095107.1722450-2-houtao@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-31 15:37:12 -07:00
Martin KaFai Lau
079082c60a tcx: Fix splat during dev unregister
During unregister_netdevice_many_notify(), the ordering of our concerned
function calls is like this:

  unregister_netdevice_many_notify
    dev_shutdown
	qdisc_put
            clsact_destroy
    tcx_uninstall

The syzbot reproducer triggered a case that the qdisc refcnt is not
zero during dev_shutdown().

tcx_uninstall() will then WARN_ON_ONCE(tcx_entry(entry)->miniq_active)
because the miniq is still active and the entry should not be freed.
The latter assumed that qdisc destruction happens before tcx teardown.

This fix is to avoid tcx_uninstall() doing tcx_entry_free() when the
miniq is still alive and let the clsact_destroy() do the free later, so
that we do not assume any specific ordering for either of them.

If still active, tcx_uninstall() does clear the entry when flushing out
the prog/link. clsact_destroy() will then notice the "!tcx_entry_is_active()"
and then does the tcx_entry_free() eventually.

Fixes: e420bed025 ("bpf: Add fd-based tcx multi-prog infra with link support")
Reported-by: syzbot+376a289e86a0fd02b9ba@syzkaller.appspotmail.com
Reported-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: syzbot+376a289e86a0fd02b9ba@syzkaller.appspotmail.com
Tested-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/222255fe07cb58f15ee662e7ee78328af5b438e4.1690549248.git.daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-31 14:44:02 -07:00
YiFei Zhu
d1a02358d4 bpf: Non-atomically allocate freelist during prefill
In internal testing of test_maps, we sometimes observed failures like:
  test_maps: test_maps.c:173: void test_hashmap_percpu(unsigned int, void *):
    Assertion `bpf_map_update_elem(fd, &key, value, BPF_ANY) == 0' failed.
where the errno is ENOMEM. After some troubleshooting and enabling
the warnings, we saw:
  [   91.304708] percpu: allocation failed, size=8 align=8 atomic=1, atomic alloc failed, no space left
  [   91.304716] CPU: 51 PID: 24145 Comm: test_maps Kdump: loaded Tainted: G                 N 6.1.38-smp-DEV #7
  [   91.304719] Hardware name: Google Astoria/astoria, BIOS 0.20230627.0-0 06/27/2023
  [   91.304721] Call Trace:
  [   91.304724]  <TASK>
  [   91.304730]  [<ffffffffa7ef83b9>] dump_stack_lvl+0x59/0x88
  [   91.304737]  [<ffffffffa7ef83f8>] dump_stack+0x10/0x18
  [   91.304738]  [<ffffffffa75caa0c>] pcpu_alloc+0x6fc/0x870
  [   91.304741]  [<ffffffffa75ca302>] __alloc_percpu_gfp+0x12/0x20
  [   91.304743]  [<ffffffffa756785e>] alloc_bulk+0xde/0x1e0
  [   91.304746]  [<ffffffffa7566c02>] bpf_mem_alloc_init+0xd2/0x2f0
  [   91.304747]  [<ffffffffa7547c69>] htab_map_alloc+0x479/0x650
  [   91.304750]  [<ffffffffa751d6e0>] map_create+0x140/0x2e0
  [   91.304752]  [<ffffffffa751d413>] __sys_bpf+0x5a3/0x6c0
  [   91.304753]  [<ffffffffa751c3ec>] __x64_sys_bpf+0x1c/0x30
  [   91.304754]  [<ffffffffa7ef847a>] do_syscall_64+0x5a/0x80
  [   91.304756]  [<ffffffffa800009b>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

This makes sense, because in atomic context, percpu allocation would
not create new chunks; it would only create in non-atomic contexts.
And if during prefill all precpu chunks are full, -ENOMEM would
happen immediately upon next unit_alloc.

Prefill phase does not actually run in atomic context, so we can
use this fact to allocate non-atomically with GFP_KERNEL instead
of GFP_NOWAIT. This avoids the immediate -ENOMEM.

GFP_NOWAIT has to be used in unit_alloc when bpf program runs
in atomic context. Even if bpf program runs in non-atomic context,
in most cases, rcu read lock is enabled for the program so
GFP_NOWAIT is still needed. This is often also the case for
BPF_MAP_UPDATE_ELEM syscalls.

Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230728043359.3324347-1-zhuyifei@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28 09:41:10 -07:00
Yonghong Song
09fedc7318 bpf: Fix compilation warning with -Wparentheses
The kernel test robot reported compilation warnings when -Wparentheses is
added to KBUILD_CFLAGS with gcc compiler. The following is the error message:

  .../bpf-next/kernel/bpf/verifier.c: In function ‘coerce_reg_to_size_sx’:
  .../bpf-next/kernel/bpf/verifier.c:5901:14:
    error: suggest parentheses around comparison in operand of ‘==’ [-Werror=parentheses]
    if (s64_max >= 0 == s64_min >= 0) {
        ~~~~~~~~^~~~
  .../bpf-next/kernel/bpf/verifier.c: In function ‘coerce_subreg_to_size_sx’:
  .../bpf-next/kernel/bpf/verifier.c:5965:14:
    error: suggest parentheses around comparison in operand of ‘==’ [-Werror=parentheses]
    if (s32_min >= 0 == s32_max >= 0) {
        ~~~~~~~~^~~~

To fix the issue, add proper parentheses for the above '>=' condition
to silence the warning/error.

I tried a few clang compilers like clang16 and clang18 and they do not emit
such warnings with -Wparentheses.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202307281133.wi0c4SqG-lkp@intel.com/
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230728055740.2284534-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28 08:54:04 -07:00
Yonghong Song
f835bb6222 bpf: Add kernel/bpftool asm support for new instructions
Add asm support for new instructions so kernel verifier and bpftool
xlated insn dumps can have proper asm syntax for new instructions.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:54:02 -07:00
Yonghong Song
4cd58e9af8 bpf: Support new 32bit offset jmp instruction
Add interpreter/jit/verifier support for 32bit offset jmp instruction.
If a conditional jmp instruction needs more than 16bit offset,
it can be simulated with a conditional jmp + a 32bit jmp insn.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011231.3716103-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:52:33 -07:00
Yonghong Song
7058e3a31e bpf: Fix jit blinding with new sdiv/smov insns
Handle new insns properly in bpf_jit_blind_insn() function.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011225.3715812-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:52:33 -07:00
Yonghong Song
ec0e2da95f bpf: Support new signed div/mod instructions.
Add interpreter/jit support for new signed div/mod insns.
The new signed div/mod instructions are encoded with
unsigned div/mod instructions plus insn->off == 1.
Also add basic verifier support to ensure new insns get
accepted.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011219.3714605-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:52:33 -07:00
Yonghong Song
0845c3db7b bpf: Support new unconditional bswap instruction
The existing 'be' and 'le' insns will do conditional bswap
depends on host endianness. This patch implements
unconditional bswap insns.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011213.3712808-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:52:33 -07:00
Yonghong Song
1f1e864b65 bpf: Handle sign-extenstin ctx member accesses
Currently, if user accesses a ctx member with signed types,
the compiler will generate an unsigned load followed by
necessary left and right shifts.

With the introduction of sign-extension load, compiler may
just emit a ldsx insn instead. Let us do a final movsx sign
extension to the final unsigned ctx load result to
satisfy original sign extension requirement.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011207.3712528-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:52:33 -07:00
Yonghong Song
8100928c88 bpf: Support new sign-extension mov insns
Add interpreter/jit support for new sign-extension mov insns.
The original 'MOV' insn is extended to support reg-to-reg
signed version for both ALU and ALU64 operations. For ALU mode,
the insn->off value of 8 or 16 indicates sign-extension
from 8- or 16-bit value to 32-bit value. For ALU64 mode,
the insn->off value of 8/16/32 indicates sign-extension
from 8-, 16- or 32-bit value to 64-bit value.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011202.3712300-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:52:33 -07:00
Yonghong Song
1f9a1ea821 bpf: Support new sign-extension load insns
Add interpreter/jit support for new sign-extension load insns
which adds a new mode (BPF_MEMSX).
Also add verifier support to recognize these insns and to
do proper verification with new insns. In verifier, besides
to deduce proper bounds for the dst_reg, probed memory access
is also properly handled.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011156.3711870-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:52:33 -07:00
Arnd Bergmann
63e2da3b7f bpf: work around -Wuninitialized warning
Splitting these out into separate helper functions means that we
actually pass an uninitialized variable into another function call
if dec_active() happens to not be inlined, and CONFIG_PREEMPT_RT
is disabled:

kernel/bpf/memalloc.c: In function 'add_obj_to_free_list':
kernel/bpf/memalloc.c:200:9: error: 'flags' is used uninitialized [-Werror=uninitialized]
  200 |         dec_active(c, flags);

Avoid this by passing the flags by reference, so they either get
initialized and dereferenced through a pointer, or the pointer never
gets accessed at all.

Fixes: 18e027b1c7 ("bpf: Factor out inc/dec of active flag into helpers.")
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20230725202653.2905259-1-arnd@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-25 17:14:18 -07:00
Jeff Layton
417d2b6b11 bpf: convert to ctime accessor functions
In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode->i_ctime.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230705190309.579783-84-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-07-24 10:30:07 +02:00
Jakub Kicinski
59be3baa8d Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-20 15:52:55 -07:00
Alexei Starovoitov
6f5a630d7c bpf, net: Introduce skb_pointer_if_linear().
Network drivers always call skb_header_pointer() with non-null buffer.
Remove !buffer check to prevent accidental misuse of skb_header_pointer().
Introduce skb_pointer_if_linear() instead.

Reported-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230718234021.43640-1-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 10:27:33 -07:00
Daniel Borkmann
e420bed025 bpf: Add fd-based tcx multi-prog infra with link support
This work refactors and adds a lightweight extension ("tcx") to the tc BPF
ingress and egress data path side for allowing BPF program management based
on fds via bpf() syscall through the newly added generic multi-prog API.
The main goal behind this work which we also presented at LPC [0] last year
and a recent update at LSF/MM/BPF this year [3] is to support long-awaited
BPF link functionality for tc BPF programs, which allows for a model of safe
ownership and program detachment.

Given the rise in tc BPF users in cloud native environments, this becomes
necessary to avoid hard to debug incidents either through stale leftover
programs or 3rd party applications accidentally stepping on each others toes.
As a recap, a BPF link represents the attachment of a BPF program to a BPF
hook point. The BPF link holds a single reference to keep BPF program alive.
Moreover, hook points do not reference a BPF link, only the application's
fd or pinning does. A BPF link holds meta-data specific to attachment and
implements operations for link creation, (atomic) BPF program update,
detachment and introspection. The motivation for BPF links for tc BPF programs
is multi-fold, for example:

  - From Meta: "It's especially important for applications that are deployed
    fleet-wide and that don't "control" hosts they are deployed to. If such
    application crashes and no one notices and does anything about that, BPF
    program will keep running draining resources or even just, say, dropping
    packets. We at FB had outages due to such permanent BPF attachment
    semantics. With fd-based BPF link we are getting a framework, which allows
    safe, auto-detachable behavior by default, unless application explicitly
    opts in by pinning the BPF link." [1]

  - From Cilium-side the tc BPF programs we attach to host-facing veth devices
    and phys devices build the core datapath for Kubernetes Pods, and they
    implement forwarding, load-balancing, policy, EDT-management, etc, within
    BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently
    experienced hard-to-debug issues in a user's staging environment where
    another Kubernetes application using tc BPF attached to the same prio/handle
    of cls_bpf, accidentally wiping all Cilium-based BPF programs from underneath
    it. The goal is to establish a clear/safe ownership model via links which
    cannot accidentally be overridden. [0,2]

BPF links for tc can co-exist with non-link attachments, and the semantics are
in line also with XDP links: BPF links cannot replace other BPF links, BPF
links cannot replace non-BPF links, non-BPF links cannot replace BPF links and
lastly only non-BPF links can replace non-BPF links. In case of Cilium, this
would solve mentioned issue of safe ownership model as 3rd party applications
would not be able to accidentally wipe Cilium programs, even if they are not
BPF link aware.

Earlier attempts [4] have tried to integrate BPF links into core tc machinery
to solve cls_bpf, which has been intrusive to the generic tc kernel API with
extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could
be wiped from the qdisc also. Locking a tc BPF program in place this way, is
getting into layering hacks given the two object models are vastly different.

We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF
attach API, so that the BPF link implementation blends in naturally similar to
other link types which are fd-based and without the need for changing core tc
internal APIs. BPF programs for tc can then be successively migrated from classic
cls_bpf to the new tc BPF link without needing to change the program's source
code, just the BPF loader mechanics for attaching is sufficient.

For the current tc framework, there is no change in behavior with this change
and neither does this change touch on tc core kernel APIs. The gist of this
patch is that the ingress and egress hook have a lightweight, qdisc-less
extension for BPF to attach its tc BPF programs, in other words, a minimal
entry point for tc BPF. The name tcx has been suggested from discussion of
earlier revisions of this work as a good fit, and to more easily differ between
the classic cls_bpf attachment and the fd-based one.

For the ingress and egress tcx points, the device holds a cache-friendly array
with program pointers which is separated from control plane (slow-path) data.
Earlier versions of this work used priority to determine ordering and expression
of dependencies similar as with classic tc, but it was challenged that for
something more future-proof a better user experience is required. Hence this
resulted in the design and development of the generic attach/detach/query API
for multi-progs. See prior patch with its discussion on the API design. tcx is
the first user and later we plan to integrate also others, for example, one
candidate is multi-prog support for XDP which would benefit and have the same
'look and feel' from API perspective.

The goal with tcx is to have maximum compatibility to existing tc BPF programs,
so they don't need to be rewritten specifically. Compatibility to call into
classic tcf_classify() is also provided in order to allow successive migration
or both to cleanly co-exist where needed given its all one logical tc layer and
the tcx plus classic tc cls/act build one logical overall processing pipeline.

tcx supports the simplified return codes TCX_NEXT which is non-terminating (go
to next program) and terminating ones with TCX_PASS, TCX_DROP, TCX_REDIRECT.
The fd-based API is behind a static key, so that when unused the code is also
not entered. The struct tcx_entry's program array is currently static, but
could be made dynamic if necessary at a point in future. The a/b pair swap
design has been chosen so that for detachment there are no allocations which
otherwise could fail.

The work has been tested with tc-testing selftest suite which all passes, as
well as the tc BPF tests from the BPF CI, and also with Cilium's L4LB.

Thanks also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews
of this work.

  [0] https://lpc.events/event/16/contributions/1353/
  [1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com
  [2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog
  [3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
  [4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230719140858.13224-3-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 10:07:27 -07:00
Daniel Borkmann
053c8e1f23 bpf: Add generic attach/detach/query API for multi-progs
This adds a generic layer called bpf_mprog which can be reused by different
attachment layers to enable multi-program attachment and dependency resolution.
In-kernel users of the bpf_mprog don't need to care about the dependency
resolution internals, they can just consume it with few API calls.

The initial idea of having a generic API sparked out of discussion [0] from an
earlier revision of this work where tc's priority was reused and exposed via
BPF uapi as a way to coordinate dependencies among tc BPF programs, similar
as-is for classic tc BPF. The feedback was that priority provides a bad user
experience and is hard to use [1], e.g.:

  I cannot help but feel that priority logic copy-paste from old tc, netfilter
  and friends is done because "that's how things were done in the past". [...]
  Priority gets exposed everywhere in uapi all the way to bpftool when it's
  right there for users to understand. And that's the main problem with it.

  The user don't want to and don't need to be aware of it, but uapi forces them
  to pick the priority. [...] Your cover letter [0] example proves that in
  real life different service pick the same priority. They simply don't know
  any better. Priority is an unnecessary magic that apps _have_ to pick, so
  they just copy-paste and everyone ends up using the same.

The course of the discussion showed more and more the need for a generic,
reusable API where the "same look and feel" can be applied for various other
program types beyond just tc BPF, for example XDP today does not have multi-
program support in kernel, but also there was interest around this API for
improving management of cgroup program types. Such common multi-program
management concept is useful for BPF management daemons or user space BPF
applications coordinating internally about their attachments.

Both from Cilium and Meta side [2], we've collected the following requirements
for a generic attach/detach/query API for multi-progs which has been implemented
as part of this work:

  - Support prog-based attach/detach and link API
  - Dependency directives (can also be combined):
    - BPF_F_{BEFORE,AFTER} with relative_{fd,id} which can be {prog,link,none}
      - BPF_F_ID flag as {fd,id} toggle; the rationale for id is so that user
        space application does not need CAP_SYS_ADMIN to retrieve foreign fds
        via bpf_*_get_fd_by_id()
      - BPF_F_LINK flag as {prog,link} toggle
      - If relative_{fd,id} is none, then BPF_F_BEFORE will just prepend, and
        BPF_F_AFTER will just append for attaching
      - Enforced only at attach time
    - BPF_F_REPLACE with replace_bpf_fd which can be prog, links have their
      own infra for replacing their internal prog
    - If no flags are set, then it's default append behavior for attaching
  - Internal revision counter and optionally being able to pass expected_revision
  - User space application can query current state with revision, and pass it
    along for attachment to assert current state before doing updates
  - Query also gets extension for link_ids array and link_attach_flags:
    - prog_ids are always filled with program IDs
    - link_ids are filled with link IDs when link was used, otherwise 0
    - {prog,link}_attach_flags for holding {prog,link}-specific flags
  - Must be easy to integrate/reuse for in-kernel users

The uapi-side changes needed for supporting bpf_mprog are rather minimal,
consisting of the additions of the attachment flags, revision counter, and
expanding existing union with relative_{fd,id} member.

The bpf_mprog framework consists of an bpf_mprog_entry object which holds
an array of bpf_mprog_fp (fast-path structure). The bpf_mprog_cp (control-path
structure) is part of bpf_mprog_bundle. Both have been separated, so that
fast-path gets efficient packing of bpf_prog pointers for maximum cache
efficiency. Also, array has been chosen instead of linked list or other
structures to remove unnecessary indirections for a fast point-to-entry in
tc for BPF.

The bpf_mprog_entry comes as a pair via bpf_mprog_bundle so that in case of
updates the peer bpf_mprog_entry is populated and then just swapped which
avoids additional allocations that could otherwise fail, for example, in
detach case. bpf_mprog_{fp,cp} arrays are currently static, but they could
be converted to dynamic allocation if necessary at a point in future.
Locking is deferred to the in-kernel user of bpf_mprog, for example, in case
of tcx which uses this API in the next patch, it piggybacks on rtnl.

An extensive test suite for checking all aspects of this API for prog-based
attach/detach and link API comes as BPF selftests in this series.

Thanks also to Andrii Nakryiko for early API discussions wrt Meta's BPF prog
management.

  [0] https://lore.kernel.org/bpf/20221004231143.19190-1-daniel@iogearbox.net
  [1] https://lore.kernel.org/bpf/CAADnVQ+gEY3FjCR=+DmjDR4gp5bOYZUFJQXj4agKFHT9CQPZBw@mail.gmail.com
  [2] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20230719140858.13224-2-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 10:07:27 -07:00
Anton Protopopov
72829b1c1f bpf: allow any program to use the bpf_map_sum_elem_count kfunc
Register the bpf_map_sum_elem_count func for all programs, and update the
map_ptr subtest of the test_progs test to test the new functionality.

The usage is allowed as long as the pointer to the map is trusted (when
using tracing programs) or is a const pointer to map, as in the following
example:

    struct {
            __uint(type, BPF_MAP_TYPE_HASH);
            ...
    } hash SEC(".maps");

    ...

    static inline int some_bpf_prog(void)
    {
            struct bpf_map *map = (struct bpf_map *)&hash;
            __s64 count;

            count = bpf_map_sum_elem_count(map);

            ...
    }

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230719092952.41202-5-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 09:48:53 -07:00
Anton Protopopov
9c29804961 bpf: make an argument const in the bpf_map_sum_elem_count kfunc
We use the map pointer only to read the counter values, no locking
involved, so mark the argument as const.

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230719092952.41202-4-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 09:48:52 -07:00
Anton Protopopov
5ba190c29c bpf: consider CONST_PTR_TO_MAP as trusted pointer to struct bpf_map
Add the BTF id of struct bpf_map to the reg2btf_ids array. This makes the
values of the CONST_PTR_TO_MAP type to be considered as trusted by kfuncs.
This, in turn, allows users to execute trusted kfuncs which accept `struct
bpf_map *` arguments from non-tracing programs.

While exporting the btf_bpf_map_id variable, save some bytes by defining
it as BTF_ID_LIST_GLOBAL_SINGLE (which is u32[1]) and not as BTF_ID_LIST
(which is u32[64]).

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230719092952.41202-3-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 09:48:52 -07:00
Anton Protopopov
831deb2976 bpf: consider types listed in reg2btf_ids as trusted
The reg2btf_ids array contains a list of types for which we can (and need)
to find a corresponding static BTF id. All the types in the list can be
considered as trusted for purposes of kfuncs.

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230719092952.41202-2-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 09:48:52 -07:00
Dave Marchevsky
c3c510ce43 bpf: Add 'owner' field to bpf_{list,rb}_node
As described by Kumar in [0], in shared ownership scenarios it is
necessary to do runtime tracking of {rb,list} node ownership - and
synchronize updates using this ownership information - in order to
prevent races. This patch adds an 'owner' field to struct bpf_list_node
and bpf_rb_node to implement such runtime tracking.

The owner field is a void * that describes the ownership state of a
node. It can have the following values:

  NULL           - the node is not owned by any data structure
  BPF_PTR_POISON - the node is in the process of being added to a data
                   structure
  ptr_to_root    - the pointee is a data structure 'root'
                   (bpf_rb_root / bpf_list_head) which owns this node

The field is initially NULL (set by bpf_obj_init_field default behavior)
and transitions states in the following sequence:

  Insertion: NULL -> BPF_PTR_POISON -> ptr_to_root
  Removal:   ptr_to_root -> NULL

Before a node has been successfully inserted, it is not protected by any
root's lock, and therefore two programs can attempt to add the same node
to different roots simultaneously. For this reason the intermediate
BPF_PTR_POISON state is necessary. For removal, the node is protected
by some root's lock so this intermediate hop isn't necessary.

Note that bpf_list_pop_{front,back} helpers don't need to check owner
before removing as the node-to-be-removed is not passed in as input and
is instead taken directly from the list. Do the check anyways and
WARN_ON_ONCE in this unexpected scenario.

Selftest changes in this patch are entirely mechanical: some BTF
tests have hardcoded struct sizes for structs that contain
bpf_{list,rb}_node fields, those were adjusted to account for the new
sizes. Selftest additions to validate the owner field are added in a
further patch in the series.

  [0]: https://lore.kernel.org/bpf/d7hyspcow5wtjcmw4fugdgyp3fwhljwuscp3xyut5qnwivyeru@ysdq543otzv2

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Suggested-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230718083813.3416104-4-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-18 17:23:10 -07:00
Dave Marchevsky
0a1f7bfe35 bpf: Introduce internal definitions for UAPI-opaque bpf_{rb,list}_node
Structs bpf_rb_node and bpf_list_node are opaquely defined in
uapi/linux/bpf.h, as BPF program writers are not expected to touch their
fields - nor does the verifier allow them to do so.

Currently these structs are simple wrappers around structs rb_node and
list_head and linked_list / rbtree implementation just casts and passes
to library functions for those data structures. Later patches in this
series, though, will add an "owner" field to bpf_{rb,list}_node, such
that they're not just wrapping an underlying node type. Moreover, the
bpf linked_list and rbtree implementations will deal with these owner
pointers directly in a few different places.

To avoid having to do

  void *owner = (void*)bpf_list_node + sizeof(struct list_head)

with opaque UAPI node types, add bpf_{list,rb}_node_kern struct
definitions to internal headers and modify linked_list and rbtree to use
the internal types where appropriate.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230718083813.3416104-3-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-18 17:23:10 -07:00
Kumar Kartikeya Dwivedi
b5e9ad522c bpf: Repeat check_max_stack_depth for async callbacks
While the check_max_stack_depth function explores call chains emanating
from the main prog, which is typically enough to cover all possible call
chains, it doesn't explore those rooted at async callbacks unless the
async callback will have been directly called, since unlike non-async
callbacks it skips their instruction exploration as they don't
contribute to stack depth.

It could be the case that the async callback leads to a callchain which
exceeds the stack depth, but this is never reachable while only
exploring the entry point from main subprog. Hence, repeat the check for
the main subprog *and* all async callbacks marked by the symbolic
execution pass of the verifier, as execution of the program may begin at
any of them.

Consider functions with following stack depths:
main: 256
async: 256
foo: 256

main:
    rX = async
    bpf_timer_set_callback(...)

async:
    foo()

Here, async is not descended as it does not contribute to stack depth of
main (since it is referenced using bpf_pseudo_func and not
bpf_pseudo_call). However, when async is invoked asynchronously, it will
end up breaching the MAX_BPF_STACK limit by calling foo.

Hence, in addition to main, we also need to explore call chains
beginning at all async callback subprogs in a program.

Fixes: 7ddc80a476 ("bpf: Teach stack depth check about async callbacks.")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230717161530.1238-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-18 15:21:09 -07:00
Kumar Kartikeya Dwivedi
ba7b3e7d5f bpf: Fix subprog idx logic in check_max_stack_depth
The assignment to idx in check_max_stack_depth happens once we see a
bpf_pseudo_call or bpf_pseudo_func. This is not an issue as the rest of
the code performs a few checks and then pushes the frame to the frame
stack, except the case of async callbacks. If the async callback case
causes the loop iteration to be skipped, the idx assignment will be
incorrect on the next iteration of the loop. The value stored in the
frame stack (as the subprogno of the current subprog) will be incorrect.

This leads to incorrect checks and incorrect tail_call_reachable
marking. Save the target subprog in a new variable and only assign to
idx once we are done with the is_async_cb check which may skip pushing
of frame to the frame stack and subsequent stack depth checks and tail
call markings.

Fixes: 7ddc80a476 ("bpf: Teach stack depth check about async callbacks.")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230717161530.1238-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-18 15:21:09 -07:00
Jakub Kicinski
d2afa89f66 for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmSwqwoACgkQ6rmadz2v
 bTqOHRAAn+fzTLqUqsveFQcxOkie5MPHxKoOTjG4+yFR7rzPkU6Mn5RX3w5yFzSn
 RqutwykF9OgipAzC3QXv4pRJuq6Gia5nvwUSDP4CX273ljyeF54DK7HfopE1+YrK
 HXyBWZvVvMZP6q7qQyQ3qtbHZSjs5XP/M6YBlJ5zo/BTLFCyvbSDP14YKEqcBkWG
 ld72ElXFxlnr/zEfRjzBCfMlbmgeHLO0SiHS/9827zEmNP1AAH5/ETA7/rJ7yCJs
 QNQUIoJWob8xm5FMJ6CU/+sOqXR1CY053meGJFFBX5pvVD/CLRhrwHn0IMCyQqmh
 wKR5waeXhpl/CKNeFuxXVMNFiXbqBb/0LYJaJtrMysjMLTsQ9X7NkrDBa/+kYGyZ
 +ghGlaMQvPqUGg0rLH2nl9JNB8Ne/8prLMsAKUWnPuOo+Q03j054gnqhGeNtDd5b
 gpSk+7x93PlhGcegBV1Wk8dkiGC5V9nTVNxg40XQUCs4k9L/8Vjc35Tjqx7nBTNH
 DiFD24DDKUZacw9L6nEqvLF/N2fiRjtUZnVPC0yn/annyBcfX1s+ZH2Tu1F6Qk38
 QMfBCnt12exmsiDoxdzzGJtjHnS/k5fsaKjlR21mOyMrIH7ipltr5UHHrdr1hBP6
 24uSeTImvQQKDi+9IuXN127jZDOupKqVS6csrA0ZXrlKWh2HR+U=
 =GVUB
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Alexei Starovoitov says:

====================
pull-request: bpf-next 2023-07-13

We've added 67 non-merge commits during the last 15 day(s) which contain
a total of 106 files changed, 4444 insertions(+), 619 deletions(-).

The main changes are:

1) Fix bpftool build in presence of stale vmlinux.h,
   from Alexander Lobakin.

2) Introduce bpf_me_mcache_free_rcu() and fix OOM under stress,
   from Alexei Starovoitov.

3) Teach verifier actual bounds of bpf_get_smp_processor_id()
   and fix perf+libbpf issue related to custom section handling,
   from Andrii Nakryiko.

4) Introduce bpf map element count, from Anton Protopopov.

5) Check skb ownership against full socket, from Kui-Feng Lee.

6) Support for up to 12 arguments in BPF trampoline, from Menglong Dong.

7) Export rcu_request_urgent_qs_task, from Paul E. McKenney.

8) Fix BTF walking of unions, from Yafang Shao.

9) Extend link_info for kprobe_multi and perf_event links,
   from Yafang Shao.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (67 commits)
  selftests/bpf: Add selftest for PTR_UNTRUSTED
  bpf: Fix an error in verifying a field in a union
  selftests/bpf: Add selftests for nested_trust
  bpf: Fix an error around PTR_UNTRUSTED
  selftests/bpf: add testcase for TRACING with 6+ arguments
  bpf, x86: allow function arguments up to 12 for TRACING
  bpf, x86: save/restore regs with BPF_DW size
  bpftool: Use "fallthrough;" keyword instead of comments
  bpf: Add object leak check.
  bpf: Convert bpf_cpumask to bpf_mem_cache_free_rcu.
  bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().
  selftests/bpf: Improve test coverage of bpf_mem_alloc.
  rcu: Export rcu_request_urgent_qs_task()
  bpf: Allow reuse from waiting_for_gp_ttrace list.
  bpf: Add a hint to allocated objects.
  bpf: Change bpf_mem_cache draining process.
  bpf: Further refactor alloc_bulk().
  bpf: Factor out inc/dec of active flag into helpers.
  bpf: Refactor alloc_bulk().
  bpf: Let free_all() return the number of freed elements.
  ...
====================

Link: https://lore.kernel.org/r/20230714020910.80794-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-13 19:13:24 -07:00
Yafang Shao
33937607ef bpf: Fix an error in verifying a field in a union
We are utilizing BPF LSM to monitor BPF operations within our container
environment. When we add support for raw_tracepoint, it hits below
error.

; (const void *)attr->raw_tracepoint.name);
27: (79) r3 = *(u64 *)(r2 +0)
access beyond the end of member map_type (mend:4) in struct (anon) with off 0 size 8

It can be reproduced with below BPF prog.

SEC("lsm/bpf")
int BPF_PROG(bpf_audit, int cmd, union bpf_attr *attr, unsigned int size)
{
	switch (cmd) {
	case BPF_RAW_TRACEPOINT_OPEN:
		bpf_printk("raw_tracepoint is %s", attr->raw_tracepoint.name);
		break;
	default:
		break;
	}
	return 0;
}

The reason is that when accessing a field in a union, such as bpf_attr,
if the field is located within a nested struct that is not the first
member of the union, it can result in incorrect field verification.

  union bpf_attr {
      struct {
          __u32 map_type; <<<< Actually it will find that field.
          __u32 key_size;
          __u32 value_size;
         ...
      };
      ...
      struct {
          __u64 name;    <<<< We want to verify this field.
          __u32 prog_fd;
      } raw_tracepoint;
  };

Considering the potential deep nesting levels, finding a perfect
solution to address this issue has proven challenging. Therefore, I
propose a solution where we simply skip the verification process if the
field in question is located within a union.

Fixes: 7e3617a72d ("bpf: Add array support to btf_struct_access")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20230713025642.27477-4-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-13 16:24:29 -07:00
Yafang Shao
7ce4dc3e4a bpf: Fix an error around PTR_UNTRUSTED
Per discussion with Alexei, the PTR_UNTRUSTED flag should not been
cleared when we start to walk a new struct, because the struct in
question may be a struct nested in a union. We should also check and set
this flag before we walk its each member, in case itself is a union.
We will clear this flag if the field is BTF_TYPE_SAFE_RCU_OR_NULL.

Fixes: 6fcd486b3a ("bpf: Refactor RCU enforcement in the verifier.")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20230713025642.27477-2-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-13 16:24:29 -07:00
Hou Tao
4ed8b5bcfa bpf: Add object leak check.
The object leak check is cheap. Do it unconditionally to spot difficult races
in bpf_mem_alloc.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230706033447.54696-15-alexei.starovoitov@gmail.com
2023-07-12 23:45:23 +02:00
Alexei Starovoitov
8e07bb9ebc bpf: Convert bpf_cpumask to bpf_mem_cache_free_rcu.
Convert bpf_cpumask to bpf_mem_cache_free_rcu.
Note that migrate_disable() in bpf_cpumask_release() is still necessary, since
bpf_cpumask_release() is a dtor. bpf_obj_free_fields() can be converted to do
migrate_disable() there in a follow up.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20230706033447.54696-14-alexei.starovoitov@gmail.com
2023-07-12 23:45:23 +02:00
Alexei Starovoitov
5af6807bdb bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().
Introduce bpf_mem_[cache_]free_rcu() similar to kfree_rcu().
Unlike bpf_mem_[cache_]free() that links objects for immediate reuse into
per-cpu free list the _rcu() flavor waits for RCU grace period and then moves
objects into free_by_rcu_ttrace list where they are waiting for RCU
task trace grace period to be freed into slab.

The life cycle of objects:
alloc: dequeue free_llist
free: enqeueu free_llist
free_rcu: enqueue free_by_rcu -> waiting_for_gp
free_llist above high watermark -> free_by_rcu_ttrace
after RCU GP waiting_for_gp -> free_by_rcu_ttrace
free_by_rcu_ttrace -> waiting_for_gp_ttrace -> slab

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20230706033447.54696-13-alexei.starovoitov@gmail.com
2023-07-12 23:45:23 +02:00
Alexei Starovoitov
04fabf00b4 bpf: Allow reuse from waiting_for_gp_ttrace list.
alloc_bulk() can reuse elements from free_by_rcu_ttrace.
Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230706033447.54696-10-alexei.starovoitov@gmail.com
2023-07-12 23:45:23 +02:00
Alexei Starovoitov
822fb26bdb bpf: Add a hint to allocated objects.
To address OOM issue when one cpu is allocating and another cpu is freeing add
a target bpf_mem_cache hint to allocated objects and when local cpu free_llist
overflows free to that bpf_mem_cache. The hint addresses the OOM while
maintaining the same performance for common case when alloc/free are done on the
same cpu.

Note that do_call_rcu_ttrace() now has to check 'draining' flag in one more case,
since do_call_rcu_ttrace() is called not only for current cpu.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20230706033447.54696-9-alexei.starovoitov@gmail.com
2023-07-12 23:45:23 +02:00
Alexei Starovoitov
d114dde245 bpf: Change bpf_mem_cache draining process.
The next patch will introduce cross-cpu llist access and existing
irq_work_sync() + drain_mem_cache() + rcu_barrier_tasks_trace() mechanism will
not be enough, since irq_work_sync() + drain_mem_cache() on cpu A won't
guarantee that llist on cpu A are empty. The free_bulk() on cpu B might add
objects back to llist of cpu A. Add 'bool draining' flag.
The modified sequence looks like:
for_each_cpu:
  WRITE_ONCE(c->draining, true); // do_call_rcu_ttrace() won't be doing call_rcu() any more
  irq_work_sync(); // wait for irq_work callback (free_bulk) to finish
  drain_mem_cache(); // free all objects
rcu_barrier_tasks_trace(); // wait for RCU callbacks to execute

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20230706033447.54696-8-alexei.starovoitov@gmail.com
2023-07-12 23:45:22 +02:00
Alexei Starovoitov
7468048237 bpf: Further refactor alloc_bulk().
In certain scenarios alloc_bulk() might be taking free objects mainly from
free_by_rcu_ttrace list. In such case get_memcg() and set_active_memcg() are
redundant, but they show up in perf profile. Split the loop and only set memcg
when allocating from slab. No performance difference in this patch alone, but
it helps in combination with further patches.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20230706033447.54696-7-alexei.starovoitov@gmail.com
2023-07-12 23:45:22 +02:00
Alexei Starovoitov
18e027b1c7 bpf: Factor out inc/dec of active flag into helpers.
Factor out local_inc/dec_return(&c->active) into helpers.
No functional changes.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20230706033447.54696-6-alexei.starovoitov@gmail.com
2023-07-12 23:45:22 +02:00
Alexei Starovoitov
05ae68656a bpf: Refactor alloc_bulk().
Factor out inner body of alloc_bulk into separate helper.
No functional changes.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20230706033447.54696-5-alexei.starovoitov@gmail.com
2023-07-12 23:45:22 +02:00
Alexei Starovoitov
9de3e81521 bpf: Let free_all() return the number of freed elements.
Let free_all() helper return the number of freed elements.
It's not used in this patch, but helps in debug/development of bpf_mem_alloc.

For example this diff for __free_rcu():
-       free_all(llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size);
+       printk("cpu %d freed %d objs after tasks trace\n", raw_smp_processor_id(),
+       	free_all(llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size));

would show how busy RCU tasks trace is.
In artificial benchmark where one cpu is allocating and different cpu is freeing
the RCU tasks trace won't be able to keep up and the list of objects
would keep growing from thousands to millions and eventually OOMing.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20230706033447.54696-4-alexei.starovoitov@gmail.com
2023-07-12 23:45:22 +02:00
Alexei Starovoitov
a80672d7e1 bpf: Simplify code of destroy_mem_alloc() with kmemdup().
Use kmemdup() to simplify the code.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20230706033447.54696-3-alexei.starovoitov@gmail.com
2023-07-12 23:45:22 +02:00
Alexei Starovoitov
12c8d0f4c8 bpf: Rename few bpf_mem_alloc fields.
Rename:
-       struct rcu_head rcu;
-       struct llist_head free_by_rcu;
-       struct llist_head waiting_for_gp;
-       atomic_t call_rcu_in_progress;
+       struct llist_head free_by_rcu_ttrace;
+       struct llist_head waiting_for_gp_ttrace;
+       struct rcu_head rcu_ttrace;
+       atomic_t call_rcu_ttrace_in_progress;
...
-	static void do_call_rcu(struct bpf_mem_cache *c)
+	static void do_call_rcu_ttrace(struct bpf_mem_cache *c)

to better indicate intended use.

The 'tasks trace' is shortened to 'ttrace' to reduce verbosity.
No functional changes.

Later patches will add free_by_rcu/waiting_for_gp fields to be used with normal RCU.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20230706033447.54696-2-alexei.starovoitov@gmail.com
2023-07-12 23:45:22 +02:00
Andrii Nakryiko
f42bcd168d bpf: teach verifier actual bounds of bpf_get_smp_processor_id() result
bpf_get_smp_processor_id() helper returns current CPU on which BPF
program runs. It can't return value that is bigger than maximum allowed
number of CPUs (minus one, due to zero indexing). Teach BPF verifier to
recognize that. This makes it possible to use bpf_get_smp_processor_id()
result to index into arrays without extra checks, as demonstrated in
subsequent selftests/bpf patch.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230711232400.1658562-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-12 07:57:18 -07:00
Yafang Shao
1b715e1b0e bpf: Support ->fill_link_info for perf_event
By introducing support for ->fill_link_info to the perf_event link, users
gain the ability to inspect it using `bpftool link show`. While the current
approach involves accessing this information via `bpftool perf show`,
consolidating link information for all link types in one place offers
greater convenience. Additionally, this patch extends support to the
generic perf event, which is not currently accommodated by
`bpftool perf show`. While only the perf type and config are exposed to
userspace, other attributes such as sample_period and sample_freq are
ignored. It's important to note that if kptr_restrict is not permitted, the
probed address will not be exposed, maintaining security measures.

A new enum bpf_perf_event_type is introduced to help the user understand
which struct is relevant.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230709025630.3735-9-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-11 20:07:51 -07:00
Yafang Shao
57d4853765 bpf: Add a common helper bpf_copy_to_user()
Add a common helper bpf_copy_to_user(), which will be used at multiple
places.
No functional change.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230709025630.3735-8-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-11 20:07:51 -07:00
Pu Lehui
4369016497 bpf: cpumap: Fix memory leak in cpu_map_update_elem
Syzkaller reported a memory leak as follows:

BUG: memory leak
unreferenced object 0xff110001198ef748 (size 192):
  comm "syz-executor.3", pid 17672, jiffies 4298118891 (age 9.906s)
  hex dump (first 32 bytes):
    00 00 00 00 4a 19 00 00 80 ad e3 e4 fe ff c0 00  ....J...........
    00 b2 d3 0c 01 00 11 ff 28 f5 8e 19 01 00 11 ff  ........(.......
  backtrace:
    [<ffffffffadd28087>] __cpu_map_entry_alloc+0xf7/0xb00
    [<ffffffffadd28d8e>] cpu_map_update_elem+0x2fe/0x3d0
    [<ffffffffadc6d0fd>] bpf_map_update_value.isra.0+0x2bd/0x520
    [<ffffffffadc7349b>] map_update_elem+0x4cb/0x720
    [<ffffffffadc7d983>] __se_sys_bpf+0x8c3/0xb90
    [<ffffffffb029cc80>] do_syscall_64+0x30/0x40
    [<ffffffffb0400099>] entry_SYSCALL_64_after_hwframe+0x61/0xc6

BUG: memory leak
unreferenced object 0xff110001198ef528 (size 192):
  comm "syz-executor.3", pid 17672, jiffies 4298118891 (age 9.906s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffffadd281f0>] __cpu_map_entry_alloc+0x260/0xb00
    [<ffffffffadd28d8e>] cpu_map_update_elem+0x2fe/0x3d0
    [<ffffffffadc6d0fd>] bpf_map_update_value.isra.0+0x2bd/0x520
    [<ffffffffadc7349b>] map_update_elem+0x4cb/0x720
    [<ffffffffadc7d983>] __se_sys_bpf+0x8c3/0xb90
    [<ffffffffb029cc80>] do_syscall_64+0x30/0x40
    [<ffffffffb0400099>] entry_SYSCALL_64_after_hwframe+0x61/0xc6

BUG: memory leak
unreferenced object 0xff1100010fd93d68 (size 8):
  comm "syz-executor.3", pid 17672, jiffies 4298118891 (age 9.906s)
  hex dump (first 8 bytes):
    00 00 00 00 00 00 00 00                          ........
  backtrace:
    [<ffffffffade5db3e>] kvmalloc_node+0x11e/0x170
    [<ffffffffadd28280>] __cpu_map_entry_alloc+0x2f0/0xb00
    [<ffffffffadd28d8e>] cpu_map_update_elem+0x2fe/0x3d0
    [<ffffffffadc6d0fd>] bpf_map_update_value.isra.0+0x2bd/0x520
    [<ffffffffadc7349b>] map_update_elem+0x4cb/0x720
    [<ffffffffadc7d983>] __se_sys_bpf+0x8c3/0xb90
    [<ffffffffb029cc80>] do_syscall_64+0x30/0x40
    [<ffffffffb0400099>] entry_SYSCALL_64_after_hwframe+0x61/0xc6

In the cpu_map_update_elem flow, when kthread_stop is called before
calling the threadfn of rcpu->kthread, since the KTHREAD_SHOULD_STOP bit
of kthread has been set by kthread_stop, the threadfn of rcpu->kthread
will never be executed, and rcpu->refcnt will never be 0, which will
lead to the allocated rcpu, rcpu->queue and rcpu->queue->queue cannot be
released.

Calling kthread_stop before executing kthread's threadfn will return
-EINTR. We can complete the release of memory resources in this state.

Fixes: 6710e11269 ("bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP")
Signed-off-by: Pu Lehui <pulehui@huawei.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230711115848.2701559-1-pulehui@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-11 19:57:03 -07:00
Anton Protopopov
515ee52b22 bpf: make preloaded map iterators to display map elements count
Add another column to the /sys/fs/bpf/maps.debug iterator to display
cur_entries, the current number of entries in the map as is returned
by the bpf_map_sum_elem_count kfunc. Also fix formatting.

Example:

    # cat /sys/fs/bpf/maps.debug
      id name             max_entries  cur_entries
       2 iterator.rodata            1            0
     125 cilium_auth_map       524288          666
     126 cilium_runtime_          256            0
     127 cilium_signals            32            0
     128 cilium_node_map        16384         1344
     129 cilium_events             32            0
     ...

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230706133932.45883-5-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-06 12:42:25 -07:00
Anton Protopopov
9bc421b6be bpf: populate the per-cpu insertions/deletions counters for hashmaps
Initialize and utilize the per-cpu insertions/deletions counters for hash-based
maps. Non-trivial changes only apply to the preallocated maps for which the
{inc,dec}_elem_count functions are not called, as there's no need in counting
elements to sustain proper map operations.

To increase/decrease percpu counters for preallocated maps we add raw calls to
the bpf_map_{inc,dec}_elem_count functions so that the impact is minimal. For
dynamically allocated maps we add corresponding calls to the existing
{inc,dec}_elem_count functions.

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230706133932.45883-4-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-06 12:42:25 -07:00
Anton Protopopov
803370d3d3 bpf: add a new kfunc to return current bpf_map elements count
A bpf_map_sum_elem_count kfunc was added to simplify getting the sum of the map
per-cpu element counters. If a map doesn't implement the counter, then the
function will always return 0.

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230706133932.45883-3-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-06 12:42:25 -07:00
Kumar Kartikeya Dwivedi
5415ccd50a bpf: Fix max stack depth check for async callbacks
The check_max_stack_depth pass happens after the verifier's symbolic
execution, and attempts to walk the call graph of the BPF program,
ensuring that the stack usage stays within bounds for all possible call
chains. There are two cases to consider: bpf_pseudo_func and
bpf_pseudo_call. In the former case, the callback pointer is loaded into
a register, and is assumed that it is passed to some helper later which
calls it (however there is no way to be sure), but the check remains
conservative and accounts the stack usage anyway. For this particular
case, asynchronous callbacks are skipped as they execute asynchronously
when their corresponding event fires.

The case of bpf_pseudo_call is simpler and we know that the call is
definitely made, hence the stack depth of the subprog is accounted for.

However, the current check still skips an asynchronous callback even if
a bpf_pseudo_call was made for it. This is erroneous, as it will miss
accounting for the stack usage of the asynchronous callback, which can
be used to breach the maximum stack depth limit.

Fix this by only skipping asynchronous callbacks when the instruction is
not a pseudo call to the subprog.

Fixes: 7ddc80a476 ("bpf: Teach stack depth check about async callbacks.")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230705144730.235802-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-05 19:14:54 -07:00
Hou Tao
cf6eeb8f9d bpf: Remove unnecessary ring buffer size check
The theoretical maximum size of ring buffer is about 64GB, but now the
size of ring buffer is specified by max_entries in bpf_attr and its
maximum value is (4GB - 1), and it won't be possible for overflow.

So just remove the unnecessary size check in ringbuf_map_alloc() but
keep the comments for possible extension in future.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Closes: https://lore.kernel.org/bpf/9c636a63-1f3d-442d-9223-96c2dccb9469@moroto.mountain
Link: https://lore.kernel.org/bpf/20230704074014.216616-1-houtao@huaweicloud.com
2023-07-05 14:09:45 +02:00
SeongJae Park
3de4d22cc9 bpf, btf: Warn but return no error for NULL btf from __register_btf_kfunc_id_set()
__register_btf_kfunc_id_set() assumes .BTF to be part of the module's .ko
file if CONFIG_DEBUG_INFO_BTF is enabled. If that's not the case, the
function prints an error message and return an error. As a result, such
modules cannot be loaded.

However, the section could be stripped out during a build process. It would
be better to let the modules loaded, because their basic functionalities
have no problem [0], though the BTF functionalities will not be supported.
Make the function to lower the level of the message from error to warn, and
return no error.

  [0] https://lore.kernel.org/bpf/20220219082037.ow2kbq5brktf4f2u@apollo.legion

Fixes: c446fdacb1 ("bpf: fix register_btf_kfunc_id_set for !CONFIG_DEBUG_INFO_BTF")
Reported-by: Alexander Egorenkov <Alexander.Egorenkov@ibm.com>
Suggested-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/87y228q66f.fsf@oc8242746057.ibm.com
Link: https://lore.kernel.org/bpf/20220219082037.ow2kbq5brktf4f2u@apollo.legion
Link: https://lore.kernel.org/bpf/20230701171447.56464-1-sj@kernel.org
2023-07-03 18:48:09 +02:00
Stanislav Fomichev
819d43428a bpf: Resolve modifiers when walking structs
It is impossible to use skb_frag_t in the tracing program. Resolve typedefs
when walking structs.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20230626212522.2414485-1-sdf@google.com
2023-06-30 10:35:59 +02:00
Fangrui Song
bbaf1ff06a bpf: Replace deprecated -target with --target= for Clang
The -target option has been deprecated since clang 3.4 in 2013. Therefore, use
the preferred --target=bpf form instead. This also matches how we use --target=
in scripts/Makefile.clang.

Signed-off-by: Fangrui Song <maskray@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Quentin Monnet <quentin@isovalent.com>
Link: 274b6f0c87
Link: https://lore.kernel.org/bpf/20230624001856.1903733-1-maskray@google.com
2023-06-29 15:46:17 +02:00
Jakub Kicinski
a685d0df75 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZJX+ygAKCRDbK58LschI
 g0/2AQDHg12smf9mPfK9wOFDNRIIX8r2iufB8LUFQMzCwltN6gEAkAdkAyfbof7P
 TMaNUiHABijAFtChxoSI35j3OOSRrwE=
 =GJgN
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-06-23

We've added 49 non-merge commits during the last 24 day(s) which contain
a total of 70 files changed, 1935 insertions(+), 442 deletions(-).

The main changes are:

1) Extend bpf_fib_lookup helper to allow passing the route table ID,
   from Louis DeLosSantos.

2) Fix regsafe() in verifier to call check_ids() for scalar registers,
   from Eduard Zingerman.

3) Extend the set of cpumask kfuncs with bpf_cpumask_first_and()
   and a rework of bpf_cpumask_any*() kfuncs. Additionally,
   add selftests, from David Vernet.

4) Fix socket lookup BPF helpers for tc/XDP to respect VRF bindings,
   from Gilad Sever.

5) Change bpf_link_put() to use workqueue unconditionally to fix it
   under PREEMPT_RT, from Sebastian Andrzej Siewior.

6) Follow-ups to address issues in the bpf_refcount shared ownership
   implementation, from Dave Marchevsky.

7) A few general refactorings to BPF map and program creation permissions
   checks which were part of the BPF token series, from Andrii Nakryiko.

8) Various fixes for benchmark framework and add a new benchmark
   for BPF memory allocator to BPF selftests, from Hou Tao.

9) Documentation improvements around iterators and trusted pointers,
   from Anton Protopopov.

10) Small cleanup in verifier to improve allocated object check,
    from Daniel T. Lee.

11) Improve performance of bpf_xdp_pointer() by avoiding access
    to shared_info when XDP packet does not have frags,
    from Jesper Dangaard Brouer.

12) Silence a harmless syzbot-reported warning in btf_type_id_size(),
    from Yonghong Song.

13) Remove duplicate bpfilter_umh_cleanup in favor of umd_cleanup_helper,
    from Jarkko Sakkinen.

14) Fix BPF selftests build for resolve_btfids under custom HOSTCFLAGS,
    from Viktor Malik.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (49 commits)
  bpf, docs: Document existing macros instead of deprecated
  bpf, docs: BPF Iterator Document
  selftests/bpf: Fix compilation failure for prog vrf_socket_lookup
  selftests/bpf: Add vrf_socket_lookup tests
  bpf: Fix bpf socket lookup from tc/xdp to respect socket VRF bindings
  bpf: Call __bpf_sk_lookup()/__bpf_skc_lookup() directly via TC hookpoint
  bpf: Factor out socket lookup functions for the TC hookpoint.
  selftests/bpf: Set the default value of consumer_cnt as 0
  selftests/bpf: Ensure that next_cpu() returns a valid CPU number
  selftests/bpf: Output the correct error code for pthread APIs
  selftests/bpf: Use producer_cnt to allocate local counter array
  xsk: Remove unused inline function xsk_buff_discard()
  bpf: Keep BPF_PROG_LOAD permission checks clear of validations
  bpf: Centralize permissions checks for all BPF map types
  bpf: Inline map creation logic in map_create() function
  bpf: Move unprivileged checks into map_create() and bpf_prog_load()
  bpf: Remove in_atomic() from bpf_link_put().
  selftests/bpf: Verify that check_ids() is used for scalars in regsafe()
  bpf: Verify scalar ids mapping in regsafe() using check_ids()
  selftests/bpf: Check if mark_chain_precision() follows scalar ids
  ...
====================

Link: https://lore.kernel.org/r/20230623211256.8409-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-24 14:52:28 -07:00
Jakub Kicinski
a7384f3918 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

tools/testing/selftests/net/fcnal-test.sh
  d7a2fc1437 ("selftests: net: fcnal-test: check if FIPS mode is enabled")
  dd017c72dd ("selftests: fcnal: Test SO_DONTROUTE on TCP sockets.")
https://lore.kernel.org/all/5007b52c-dd16-dbf6-8d64-b9701bfa498b@tessares.net/
https://lore.kernel.org/all/20230619105427.4a0df9b3@canb.auug.org.au/

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-22 18:40:38 -07:00
Jiri Olsa
db8eae6bc5 bpf: Force kprobe multi expected_attach_type for kprobe_multi link
We currently allow to create perf link for program with
expected_attach_type == BPF_TRACE_KPROBE_MULTI.

This will cause crash when we call helpers like get_attach_cookie or
get_func_ip in such program, because it will call the kprobe_multi's
version (current->bpf_ctx context setup) of those helpers while it
expects perf_link's current->bpf_ctx context setup.

Making sure that we use BPF_TRACE_KPROBE_MULTI expected_attach_type
only for programs attaching through kprobe_multi link.

Fixes: ca74823c6e ("bpf: Add cookie support to programs attached with kprobe multi link")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230618131414.75649-1-jolsa@kernel.org
2023-06-21 10:40:26 +02:00
Florent Revest
9724160b39 bpf/btf: Accept function names that contain dots
When building a kernel with LLVM=1, LLVM_IAS=0 and CONFIG_KASAN=y, LLVM
leaves DWARF tags for the "asan.module_ctor" & co symbols. In turn,
pahole creates BTF_KIND_FUNC entries for these and this makes the BTF
metadata validation fail because they contain a dot.

In a dramatic turn of event, this BTF verification failure can cause
the netfilter_bpf initialization to fail, causing netfilter_core to
free the netfilter_helper hashmap and netfilter_ftp to trigger a
use-after-free. The risk of u-a-f in netfilter will be addressed
separately but the existence of "asan.module_ctor" debug info under some
build conditions sounds like a good enough reason to accept functions
that contain dots in BTF.

Although using only LLVM=1 is the recommended way to compile clang-based
kernels, users can certainly do LLVM=1, LLVM_IAS=0 as well and we still
try to support that combination according to Nick. To clarify:

  - > v5.10 kernel, LLVM=1 (LLVM_IAS=0 is not the default) is recommended,
    but user can still have LLVM=1, LLVM_IAS=0 to trigger the issue

  - <= 5.10 kernel, LLVM=1 (LLVM_IAS=0 is the default) is recommended in
    which case GNU as will be used

Fixes: 1dc9285184 ("bpf: kernel side support for BTF Var and DataSec")
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Cc: Yonghong Song <yhs@meta.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/bpf/20230615145607.3469985-1-revest@chromium.org
2023-06-21 10:32:22 +02:00
Andrii Nakryiko
7f6719f7a8 bpf: Keep BPF_PROG_LOAD permission checks clear of validations
Move out flags validation and license checks out of the permission
checks. They were intermingled, which makes subsequent changes harder.
Clean this up: perform straightforward flag validation upfront, and
fetch and check license later, right where we use it. Also consolidate
capabilities check in one block, right after basic attribute sanity
checks.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230613223533.3689589-5-andrii@kernel.org
2023-06-19 14:04:04 +02:00
Andrii Nakryiko
6c3eba1c5e bpf: Centralize permissions checks for all BPF map types
This allows to do more centralized decisions later on, and generally
makes it very explicit which maps are privileged and which are not
(e.g., LRU_HASH and LRU_PERCPU_HASH, which are privileged HASH variants,
as opposed to unprivileged HASH and HASH_PERCPU; now this is explicit
and easy to verify).

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230613223533.3689589-4-andrii@kernel.org
2023-06-19 14:04:04 +02:00
Andrii Nakryiko
22db41226b bpf: Inline map creation logic in map_create() function
Currently find_and_alloc_map() performs two separate functions: some
argument sanity checking and partial map creation workflow hanling.
Neither of those functions are self-sufficient and are augmented by
further checks and initialization logic in the caller (map_create()
function). So unify all the sanity checks, permission checks, and
creation and initialization logic in one linear piece of code in
map_create() instead. This also make it easier to further enhance
permission checks and keep them located in one place.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230613223533.3689589-3-andrii@kernel.org
2023-06-19 14:04:04 +02:00
Andrii Nakryiko
1d28635abc bpf: Move unprivileged checks into map_create() and bpf_prog_load()
Make each bpf() syscall command a bit more self-contained, making it
easier to further enhance it. We move sysctl_unprivileged_bpf_disabled
handling down to map_create() and bpf_prog_load(), two special commands
in this regard.

Also swap the order of checks, calling bpf_capable() only if
sysctl_unprivileged_bpf_disabled is true, avoiding unnecessary audit
messages.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230613223533.3689589-2-andrii@kernel.org
2023-06-19 14:04:04 +02:00
Sebastian Andrzej Siewior
ab5d47bd41 bpf: Remove in_atomic() from bpf_link_put().
bpf_free_inode() is invoked as a RCU callback. Usually RCU callbacks are
invoked within softirq context. By setting rcutree.use_softirq=0 boot
option the RCU callbacks will be invoked in a per-CPU kthread with
bottom halves disabled which implies a RCU read section.

On PREEMPT_RT the context remains fully preemptible. The RCU read
section however does not allow schedule() invocation. The latter happens
in mutex_lock() performed by bpf_trampoline_unlink_prog() originated
from bpf_link_put().

It was pointed out that the bpf_link_put() invocation should not be
delayed if originated from close(). It was also pointed out that other
invocations from within a syscall should also avoid the workqueue.
Everyone else should use workqueue by default to remain safe in the
future (while auditing the code, every caller was preemptible except for
the RCU case).

Let bpf_link_put() use the worker unconditionally. Add
bpf_link_put_direct() which will directly free the resources and is used
by close() and from within __sys_bpf().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230614083430.oENawF8f@linutronix.de
2023-06-16 09:28:02 -07:00
Eduard Zingerman
1ffc85d929 bpf: Verify scalar ids mapping in regsafe() using check_ids()
Make sure that the following unsafe example is rejected by verifier:

1: r9 = ... some pointer with range X ...
2: r6 = ... unbound scalar ID=a ...
3: r7 = ... unbound scalar ID=b ...
4: if (r6 > r7) goto +1
5: r6 = r7
6: if (r6 > X) goto ...
--- checkpoint ---
7: r9 += r7
8: *(u64 *)r9 = Y

This example is unsafe because not all execution paths verify r7 range.
Because of the jump at (4) the verifier would arrive at (6) in two states:
I.  r6{.id=b}, r7{.id=b} via path 1-6;
II. r6{.id=a}, r7{.id=b} via path 1-4, 6.

Currently regsafe() does not call check_ids() for scalar registers,
thus from POV of regsafe() states (I) and (II) are identical. If the
path 1-6 is taken by verifier first, and checkpoint is created at (6)
the path [1-4, 6] would be considered safe.

Changes in this commit:
- check_ids() is modified to disallow mapping multiple old_id to the
  same cur_id.
- check_scalar_ids() is added, unlike check_ids() it treats ID zero as
  a unique scalar ID.
- check_scalar_ids() needs to generate temporary unique IDs, field
  'tmp_id_gen' is added to bpf_verifier_env::idmap_scratch to
  facilitate this.
- regsafe() is updated to:
  - use check_scalar_ids() for precise scalar registers.
  - compare scalar registers using memcmp only for explore_alu_limits
    branch. This simplifies control flow for scalar case, and has no
    measurable performance impact.
- check_alu_op() is updated to avoid generating bpf_reg_state::id for
  constant scalar values when processing BPF_MOV. ID is needed to
  propagate range information for identical values, but there is
  nothing to propagate for constants.

Fixes: 75748837b7 ("bpf: Propagate scalar ranges through register assignments.")
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230613153824.3324830-4-eddyz87@gmail.com
2023-06-13 15:15:08 -07:00
Eduard Zingerman
904e6ddf41 bpf: Use scalar ids in mark_chain_precision()
Change mark_chain_precision() to track precision in situations
like below:

    r2 = unknown value
    ...
  --- state #0 ---
    ...
    r1 = r2                 // r1 and r2 now share the same ID
    ...
  --- state #1 {r1.id = A, r2.id = A} ---
    ...
    if (r2 > 10) goto exit; // find_equal_scalars() assigns range to r1
    ...
  --- state #2 {r1.id = A, r2.id = A} ---
    r3 = r10
    r3 += r1                // need to mark both r1 and r2

At the beginning of the processing of each state, ensure that if a
register with a scalar ID is marked as precise, all registers sharing
this ID are also marked as precise.

This property would be used by a follow-up change in regsafe().

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230613153824.3324830-2-eddyz87@gmail.com
2023-06-13 15:14:27 -07:00
Krister Johansen
0108a4e9f3 bpf: ensure main program has an extable
When subprograms are in use, the main program is not jit'd after the
subprograms because jit_subprogs sets a value for prog->bpf_func upon
success.  Subsequent calls to the JIT are bypassed when this value is
non-NULL.  This leads to a situation where the main program and its
func[0] counterpart are both in the bpf kallsyms tree, but only func[0]
has an extable.  Extables are only created during JIT.  Now there are
two nearly identical program ksym entries in the tree, but only one has
an extable.  Depending upon how the entries are placed, there's a chance
that a fault will call search_extable on the aux with the NULL entry.

Since jit_subprogs already copies state from func[0] to the main
program, include the extable pointer in this state duplication.
Additionally, ensure that the copy of the main program in func[0] is not
added to the bpf_prog_kallsyms table. Instead, let the main program get
added later in bpf_prog_load().  This ensures there is only a single
copy of the main program in the kallsyms table, and that its tag matches
the tag observed by tooling like bpftool.

Cc: stable@vger.kernel.org
Fixes: 1c2a088a66 ("bpf: x64: add JIT support for multi-function programs")
Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/6de9b2f4b4724ef56efbb0339daaa66c8b68b1e7.1686616663.git.kjlx@templeofstupid.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-06-13 15:13:52 -07:00
David Vernet
f983be9173 bpf: Replace bpf_cpumask_any* with bpf_cpumask_any_distribute*
We currently export the bpf_cpumask_any() and bpf_cpumask_any_and()
kfuncs. Intuitively, one would expect these to choose any CPU in the
cpumask, but what they actually do is alias to cpumask_first() and
cpmkas_first_and().

This is useless given that we already export bpf_cpumask_first() and
bpf_cpumask_first_and(), so this patch replaces them with kfuncs that
call cpumask_any_distribute() and cpumask_any_and_distribute(), which
actually choose any CPU from the cpumask (or the AND of two cpumasks for
the latter).

Signed-off-by: David Vernet <void@manifault.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230610035053.117605-3-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-06-12 15:09:11 -07:00
David Vernet
5ba3a7a851 bpf: Add bpf_cpumask_first_and() kfunc
We currently provide bpf_cpumask_first(), bpf_cpumask_any(), and
bpf_cpumask_any_and() kfuncs. bpf_cpumask_any() and
bpf_cpumask_any_and() are confusing misnomers in that they actually just
call cpumask_first() and cpumask_first_and() respectively.

We'll replace them with bpf_cpumask_any_distribute() and
bpf_cpumask_any_distribute_and() kfuncs in a subsequent patch, so let's
ensure feature parity by adding a bpf_cpumask_first_and() kfunc to
account for bpf_cpumask_any_and() being removed.

Signed-off-by: David Vernet <void@manifault.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230610035053.117605-1-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-06-12 15:09:11 -07:00
Arnd Bergmann
ba49f97688 bpf: Hide unused bpf_patch_call_args
This function is only used when CONFIG_BPF_JIT_ALWAYS_ON is disabled, but
CONFIG_BPF_SYSCALL is enabled. When both are turned off, the prototype is
missing but the unused function is still compiled, as seen from this W=1
warning:

  [...]
  kernel/bpf/core.c:2075:6: error: no previous prototype for 'bpf_patch_call_args' [-Werror=missing-prototypes]
  [...]

Add a matching #ifdef for the definition to leave it out.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20230602135128.1498362-1-arnd@kernel.org
2023-06-12 19:00:08 +02:00
Jakub Kicinski
449f6bc17a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

net/sched/sch_taprio.c
  d636fc5dd6 ("net: sched: add rcu annotations around qdisc->qdisc_sleeping")
  dced11ef84 ("net/sched: taprio: don't overwrite "sch" variable in taprio_dump_class_stats()")

net/ipv4/sysctl_net_ipv4.c
  e209fee411 ("net/ipv4: ping_group_range: allow GID from 2147483648 to 4294967294")
  ccce324dab ("tcp: make the first N SYN RTO backoffs linear")
https://lore.kernel.org/all/20230605100816.08d41a7b@canb.auug.org.au/

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-08 11:35:14 -07:00
Maxim Mikityanskiy
713274f1f2 bpf: Fix verifier id tracking of scalars on spill
The following scenario describes a bug in the verifier where it
incorrectly concludes about equivalent scalar IDs which could lead to
verifier bypass in privileged mode:

1. Prepare a 32-bit rogue number.
2. Put the rogue number into the upper half of a 64-bit register, and
   roll a random (unknown to the verifier) bit in the lower half. The
   rest of the bits should be zero (although variations are possible).
3. Assign an ID to the register by MOVing it to another arbitrary
   register.
4. Perform a 32-bit spill of the register, then perform a 32-bit fill to
   another register. Due to a bug in the verifier, the ID will be
   preserved, although the new register will contain only the lower 32
   bits, i.e. all zeros except one random bit.

At this point there are two registers with different values but the same
ID, which means the integrity of the verifier state has been corrupted.

5. Compare the new 32-bit register with 0. In the branch where it's
   equal to 0, the verifier will believe that the original 64-bit
   register is also 0, because it has the same ID, but its actual value
   still contains the rogue number in the upper half.
   Some optimizations of the verifier prevent the actual bypass, so
   extra care is needed: the comparison must be between two registers,
   and both branches must be reachable (this is why one random bit is
   needed). Both branches are still suitable for the bypass.
6. Right shift the original register by 32 bits to pop the rogue number.
7. Use the rogue number as an offset with any pointer. The verifier will
   believe that the offset is 0, while in reality it's the given number.

The fix is similar to the 32-bit BPF_MOV handling in check_alu_op for
SCALAR_VALUE. If the spill is narrowing the actual register value, don't
keep the ID, make sure it's reset to 0.

Fixes: 354e8f1970 ("bpf: Support <8-byte scalar spill and refill")
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Andrii Nakryiko <andrii@kernel.org> # Checked veristat delta
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20230607123951.558971-2-maxtram95@gmail.com
2023-06-08 10:27:43 +02:00
Hou Tao
aa7881fcfe bpf: Factor out a common helper free_all()
Factor out a common helper free_all() to free all normal elements or
per-cpu elements on a lock-less list.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230606035310.4026145-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-06-06 13:40:07 -07:00
Florian Westphal
132328e8e8 bpf: netfilter: Add BPF_NETFILTER bpf_attach_type
Andrii Nakryiko writes:

 And we currently don't have an attach type for NETLINK BPF link.
 Thankfully it's not too late to add it. I see that link_create() in
 kernel/bpf/syscall.c just bypasses attach_type check. We shouldn't
 have done that. Instead we need to add BPF_NETLINK attach type to enum
 bpf_attach_type. And wire all that properly throughout the kernel and
 libbpf itself.

This adds BPF_NETFILTER and uses it.  This breaks uabi but this
wasn't in any non-rc release yet, so it should be fine.

v2: check link_attack prog type in link_create too

Fixes: 84601d6ee6 ("bpf: add bpf_link support for BPF_NETFILTER programs")
Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/CAEf4BzZ69YgrQW7DHCJUT_X+GqMq_ZQQPBwopaJJVGFD5=d5Vg@mail.gmail.com/
Link: https://lore.kernel.org/bpf/20230605131445.32016-1-fw@strlen.de
2023-06-05 15:01:43 -07:00
David Vernet
51302c951c bpf: Teach verifier that trusted PTR_TO_BTF_ID pointers are non-NULL
In reg_type_not_null(), we currently assume that a pointer may be NULL
if it has the PTR_MAYBE_NULL modifier, or if it doesn't belong to one of
several base type of pointers that are never NULL-able. For example,
PTR_TO_CTX, PTR_TO_MAP_VALUE, etc.

It turns out that in some cases, PTR_TO_BTF_ID can never be NULL as
well, though we currently don't specify it. For example, if you had the
following program:

SEC("tc")
long example_refcnt_fail(void *ctx)
{
	struct bpf_cpumask *mask1, *mask2;

	mask1 = bpf_cpumask_create();
	mask2 = bpf_cpumask_create();

        if (!mask1 || !mask2)
		goto error_release;

	bpf_cpumask_test_cpu(0, (const struct cpumask *)mask1);
	bpf_cpumask_test_cpu(0, (const struct cpumask *)mask2);

error_release:
	if (mask1)
		bpf_cpumask_release(mask1);
	if (mask2)
		bpf_cpumask_release(mask2);
	return ret;
}

The verifier will incorrectly fail to load the program, thinking
(unintuitively) that we have a possibly-unreleased reference if the mask
is NULL, because we (correctly) don't issue a bpf_cpumask_release() on
the NULL path.

The reason the verifier gets confused is due to the fact that we don't
explicitly tell the verifier that trusted PTR_TO_BTF_ID pointers can
never be NULL. Basically, if we successfully get past the if check
(meaning both pointers go from ptr_or_null_bpf_cpumask to
ptr_bpf_cpumask), the verifier will correctly assume that the references
need to be dropped on any possible branch that leads to program exit.
However, it will _incorrectly_ think that the ptr == NULL branch is
possible, and will erroneously detect it as a branch on which we failed
to drop the reference.

The solution is of course to teach the verifier that trusted
PTR_TO_BTF_ID pointers can never be NULL, so that it doesn't incorrectly
think it's possible for the reference to be present on the ptr == NULL
branch.

A follow-on patch will add a selftest that verifies this behavior.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230602150112.1494194-1-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-06-05 14:36:57 -07:00
Daniel T. Lee
503e4def54 bpf: Replace open code with for allocated object check
>From commit 282de143ea ("bpf: Introduce allocated objects support"),
With this allocated object with BPF program, (PTR_TO_BTF_ID | MEM_ALLOC)
has been a way of indicating to check the type is the allocated object.

commit d8939cb0a0 ("bpf: Loosen alloc obj test in verifier's
reg_btf_record")
>From the commit, there has been helper function for checking this, named
type_is_ptr_alloc_obj(). But still, some of the code use open code to
retrieve this info. This commit replaces the open code with the
type_is_alloc(), and the type_is_ptr_alloc_obj() function.

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230527122706.59315-1-danieltimlee@gmail.com
2023-06-05 14:33:17 -07:00
Dave Marchevsky
7793fc3bab bpf: Make bpf_refcount_acquire fallible for non-owning refs
This patch fixes an incorrect assumption made in the original
bpf_refcount series [0], specifically that the BPF program calling
bpf_refcount_acquire on some node can always guarantee that the node is
alive. In that series, the patch adding failure behavior to rbtree_add
and list_push_{front, back} breaks this assumption for non-owning
references.

Consider the following program:

  n = bpf_kptr_xchg(&mapval, NULL);
  /* skip error checking */

  bpf_spin_lock(&l);
  if(bpf_rbtree_add(&t, &n->rb, less)) {
    bpf_refcount_acquire(n);
    /* Failed to add, do something else with the node */
  }
  bpf_spin_unlock(&l);

It's incorrect to assume that bpf_refcount_acquire will always succeed in this
scenario. bpf_refcount_acquire is being called in a critical section
here, but the lock being held is associated with rbtree t, which isn't
necessarily the lock associated with the tree that the node is already
in. So after bpf_rbtree_add fails to add the node and calls bpf_obj_drop
in it, the program has no ownership of the node's lifetime. Therefore
the node's refcount can be decr'd to 0 at any time after the failing
rbtree_add. If this happens before the refcount_acquire above, the node
might be free'd, and regardless refcount_acquire will be incrementing a
0 refcount.

Later patches in the series exercise this scenario, resulting in the
expected complaint from the kernel (without this patch's changes):

  refcount_t: addition on 0; use-after-free.
  WARNING: CPU: 1 PID: 207 at lib/refcount.c:25 refcount_warn_saturate+0xbc/0x110
  Modules linked in: bpf_testmod(O)
  CPU: 1 PID: 207 Comm: test_progs Tainted: G           O       6.3.0-rc7-02231-g723de1a718a2-dirty #371
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
  RIP: 0010:refcount_warn_saturate+0xbc/0x110
  Code: 6f 64 f6 02 01 e8 84 a3 5c ff 0f 0b eb 9d 80 3d 5e 64 f6 02 00 75 94 48 c7 c7 e0 13 d2 82 c6 05 4e 64 f6 02 01 e8 64 a3 5c ff <0f> 0b e9 7a ff ff ff 80 3d 38 64 f6 02 00 0f 85 6d ff ff ff 48 c7
  RSP: 0018:ffff88810b9179b0 EFLAGS: 00010082
  RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000000
  RDX: 0000000000000202 RSI: 0000000000000008 RDI: ffffffff857c3680
  RBP: ffff88810027d3c0 R08: ffffffff8125f2a4 R09: ffff88810b9176e7
  R10: ffffed1021722edc R11: 746e756f63666572 R12: ffff88810027d388
  R13: ffff88810027d3c0 R14: ffffc900005fe030 R15: ffffc900005fe048
  FS:  00007fee0584a700(0000) GS:ffff88811b280000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00005634a96f6c58 CR3: 0000000108ce9002 CR4: 0000000000770ee0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  PKRU: 55555554
  Call Trace:
   <TASK>
   bpf_refcount_acquire_impl+0xb5/0xc0

  (rest of output snipped)

The patch addresses this by changing bpf_refcount_acquire_impl to use
refcount_inc_not_zero instead of refcount_inc and marking
bpf_refcount_acquire KF_RET_NULL.

For owning references, though, we know the above scenario is not possible
and thus that bpf_refcount_acquire will always succeed. Some verifier
bookkeeping is added to track "is input owning ref?" for bpf_refcount_acquire
calls and return false from is_kfunc_ret_null for bpf_refcount_acquire on
owning refs despite it being marked KF_RET_NULL.

Existing selftests using bpf_refcount_acquire are modified where
necessary to NULL-check its return value.

  [0]: https://lore.kernel.org/bpf/20230415201811.343116-1-davemarchevsky@fb.com/

Fixes: d2dcc67df9 ("bpf: Migrate bpf_rbtree_add and bpf_list_push_{front,back} to possibly fail")
Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230602022647.1571784-5-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-06-05 13:17:20 -07:00
Dave Marchevsky
cc0d76cafe bpf: Fix __bpf_{list,rbtree}_add's beginning-of-node calculation
Given the pointer to struct bpf_{rb,list}_node within a local kptr and
the byte offset of that field within the kptr struct, the calculation changed
by this patch is meant to find the beginning of the kptr so that it can
be passed to bpf_obj_drop.

Unfortunately instead of doing

  ptr_to_kptr = ptr_to_node_field - offset_bytes

the calculation is erroneously doing

  ptr_to_ktpr = ptr_to_node_field - (offset_bytes * sizeof(struct bpf_rb_node))

or the bpf_list_node equivalent.

This patch fixes the calculation.

Fixes: d2dcc67df9 ("bpf: Migrate bpf_rbtree_add and bpf_list_push_{front,back} to possibly fail")
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230602022647.1571784-4-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-06-05 13:17:19 -07:00
Dave Marchevsky
2140a6e342 bpf: Set kptr_struct_meta for node param to list and rbtree insert funcs
In verifier.c, fixup_kfunc_call uses struct bpf_insn_aux_data's
kptr_struct_meta field to pass information about local kptr types to
various helpers and kfuncs at runtime. The recent bpf_refcount series
added a few functions to the set that need this information:

  * bpf_refcount_acquire
    * Needs to know where the refcount field is in order to increment
  * Graph collection insert kfuncs: bpf_rbtree_add, bpf_list_push_{front,back}
    * Were migrated to possibly fail by the bpf_refcount series. If
      insert fails, the input node is bpf_obj_drop'd. bpf_obj_drop needs
      the kptr_struct_meta in order to decr refcount and properly free
      special fields.

Unfortunately the verifier handling of collection insert kfuncs was not
modified to actually populate kptr_struct_meta. Accordingly, when the
node input to those kfuncs is passed to bpf_obj_drop, it is done so
without the information necessary to decr refcount.

This patch fixes the issue by populating kptr_struct_meta for those
kfuncs.

Fixes: d2dcc67df9 ("bpf: Migrate bpf_rbtree_add and bpf_list_push_{front,back} to possibly fail")
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230602022647.1571784-3-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-06-05 13:17:19 -07:00
Rhys Rustad-Elliott
cba41bb78d bpf: Fix elem_size not being set for inner maps
Commit d937bc3449 ("bpf: make uniform use of array->elem_size
everywhere in arraymap.c") changed array_map_gen_lookup to use
array->elem_size instead of round_up(map->value_size, 8) as the element
size when generating code to access a value in an array map.

array->elem_size, however, is not set by bpf_map_meta_alloc when
initializing an BPF_MAP_TYPE_ARRAY_OF_MAPS or BPF_MAP_TYPE_HASH_OF_MAPS.
This results in array_map_gen_lookup incorrectly outputting code that
always accesses index 0 in the array (as the index will be calculated
via a multiplication with the element size, which is incorrectly set to
0).

Set elem_size on the bpf_array object when allocating an array or hash
of maps to fix this.

Fixes: d937bc3449 ("bpf: make uniform use of array->elem_size everywhere in arraymap.c")
Signed-off-by: Rhys Rustad-Elliott <me@rhysre.net>
Link: https://lore.kernel.org/r/20230602190110.47068-2-me@rhysre.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-06-02 16:22:12 -07:00
Azeem Shaikh
ffadc37252 bpf: Replace all non-returning strlcpy with strscpy
strlcpy() reads the entire source buffer first. This read may exceed the
destination size limit. This is both inefficient and can lead to linear
read overflows if a source string is not NUL-terminated [1]. This is not
the case here, however, in an effort to remove strlcpy() completely [2],
lets replace strlcpy() here with strscpy(). No return values were used,
so a direct replacement is safe.

  [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
  [2] https://github.com/KSPP/linux/issues/89

Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/bpf/20230530155659.309657-1-azeemshaikh38@gmail.com
2023-05-31 13:04:20 +02:00
Yonghong Song
e6c2f594ed bpf: Silence a warning in btf_type_id_size()
syzbot reported a warning in [1] with the following stacktrace:
  WARNING: CPU: 0 PID: 5005 at kernel/bpf/btf.c:1988 btf_type_id_size+0x2d9/0x9d0 kernel/bpf/btf.c:1988
  ...
  RIP: 0010:btf_type_id_size+0x2d9/0x9d0 kernel/bpf/btf.c:1988
  ...
  Call Trace:
   <TASK>
   map_check_btf kernel/bpf/syscall.c:1024 [inline]
   map_create+0x1157/0x1860 kernel/bpf/syscall.c:1198
   __sys_bpf+0x127f/0x5420 kernel/bpf/syscall.c:5040
   __do_sys_bpf kernel/bpf/syscall.c:5162 [inline]
   __se_sys_bpf kernel/bpf/syscall.c:5160 [inline]
   __x64_sys_bpf+0x79/0xc0 kernel/bpf/syscall.c:5160
   do_syscall_x64 arch/x86/entry/common.c:50 [inline]
   do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
   entry_SYSCALL_64_after_hwframe+0x63/0xcd

With the following btf
  [1] DECL_TAG 'a' type_id=4 component_idx=-1
  [2] PTR '(anon)' type_id=0
  [3] TYPE_TAG 'a' type_id=2
  [4] VAR 'a' type_id=3, linkage=static
and when the bpf_attr.btf_key_type_id = 1 (DECL_TAG),
the following WARN_ON_ONCE in btf_type_id_size() is triggered:
  if (WARN_ON_ONCE(!btf_type_is_modifier(size_type) &&
                   !btf_type_is_var(size_type)))
          return NULL;

Note that 'return NULL' is the correct behavior as we don't want
a DECL_TAG type to be used as a btf_{key,value}_type_id even
for the case like 'DECL_TAG -> STRUCT'. So there
is no correctness issue here, we just want to silence warning.

To silence the warning, I added DECL_TAG as one of kinds in
btf_type_nosize() which will cause btf_type_id_size() returning
NULL earlier without the warning.

  [1] https://lore.kernel.org/bpf/000000000000e0df8d05fc75ba86@google.com/

Reported-by: syzbot+958967f249155967d42a@syzkaller.appspotmail.com
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230530205029.264910-1-yhs@fb.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-05-30 14:34:46 -07:00
Jakub Kicinski
75455b906d bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZHEm+wAKCRDbK58LschI
 gyIKAQCqO7B4sIu8hYVxBTwfHV2tIuXSMSCV4P9e78NUOPcO2QEAvLP/WVSjB0Bm
 vpyTKKM22SpZvPe/jSp52j6t20N+qAc=
 =HFxD
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-05-26

We've added 54 non-merge commits during the last 10 day(s) which contain
a total of 76 files changed, 2729 insertions(+), 1003 deletions(-).

The main changes are:

1) Add the capability to destroy sockets in BPF through a new kfunc,
   from Aditi Ghag.

2) Support O_PATH fds in BPF_OBJ_PIN and BPF_OBJ_GET commands,
   from Andrii Nakryiko.

3) Add capability for libbpf to resize datasec maps when backed via mmap,
   from JP Kobryn.

4) Move all the test kfuncs for CI out of the kernel and into bpf_testmod,
   from Jiri Olsa.

5) Big batch of xsk selftest improvements to prep for multi-buffer testing,
   from Magnus Karlsson.

6) Show the target_{obj,btf}_id in tracing link's fdinfo and dump it
   via bpftool, from Yafang Shao.

7) Various misc BPF selftest improvements to work with upcoming LLVM 17,
   from Yonghong Song.

8) Extend bpftool to specify netdevice for resolving XDP hints,
   from Larysa Zaremba.

9) Document masking in shift operations for the insn set document,
   from Dave Thaler.

10) Extend BPF selftests to check xdp_feature support for bond driver,
    from Lorenzo Bianconi.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (54 commits)
  bpf: Fix bad unlock balance on freeze_mutex
  libbpf: Ensure FD >= 3 during bpf_map__reuse_fd()
  libbpf: Ensure libbpf always opens files with O_CLOEXEC
  selftests/bpf: Check whether to run selftest
  libbpf: Change var type in datasec resize func
  bpf: drop unnecessary bpf_capable() check in BPF_MAP_FREEZE command
  libbpf: Selftests for resizing datasec maps
  libbpf: Add capability for resizing datasec maps
  selftests/bpf: Add path_fd-based BPF_OBJ_PIN and BPF_OBJ_GET tests
  libbpf: Add opts-based bpf_obj_pin() API and add support for path_fd
  bpf: Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
  libbpf: Start v1.3 development cycle
  bpf: Validate BPF object in BPF_OBJ_PIN before calling LSM
  bpftool: Specify XDP Hints ifname when loading program
  selftests/bpf: Add xdp_feature selftest for bond device
  selftests/bpf: Test bpf_sock_destroy
  selftests/bpf: Add helper to get port using getsockname
  bpf: Add bpf_sock_destroy kfunc
  bpf: Add kfunc filter function to 'struct btf_kfunc_id_set'
  bpf: udp: Implement batching for sockets iterator
  ...
====================

Link: https://lore.kernel.org/r/20230526222747.17775-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-26 17:26:01 -07:00
Daniel Borkmann
4266f41fea bpf: Fix bad unlock balance on freeze_mutex
Commit c4c84f6fb2 ("bpf: drop unnecessary bpf_capable() check in
BPF_MAP_FREEZE command") moved the permissions check outside of the
freeze_mutex in the map_freeze() handler. The error paths still jumps
to the err_put which tries to unlock the freeze_mutex even though it
was not locked in the first place. Fix it.

Fixes: c4c84f6fb2 ("bpf: drop unnecessary bpf_capable() check in BPF_MAP_FREEZE command")
Reported-by: syzbot+8982e75c2878b9ffeac5@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2023-05-26 12:16:12 +02:00
Jakub Kicinski
d4031ec844 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

net/ipv4/raw.c
  3632679d9e ("ipv{4,6}/raw: fix output xfrm lookup wrt protocol")
  c85be08fc4 ("raw: Stop using RTO_ONLINK.")
https://lore.kernel.org/all/20230525110037.2b532b83@canb.auug.org.au/

Adjacent changes:

drivers/net/ethernet/freescale/fec_main.c
  9025944fdd ("net: fec: add dma_wmb to ensure correct descriptor values")
  144470c88c ("net: fec: using the standard return codes when xdp xmit errors")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-25 19:57:39 -07:00
Andrii Nakryiko
c4c84f6fb2 bpf: drop unnecessary bpf_capable() check in BPF_MAP_FREEZE command
Seems like that extra bpf_capable() check in BPF_MAP_FREEZE handler was
unintentionally left when we switched to a model that all BPF map
operations should be allowed regardless of CAP_BPF (or any other
capabilities), as long as process got BPF map FD somehow.

This patch replaces bpf_capable() check in BPF_MAP_FREEZE handler with
writeable access check, given conceptually freezing the map is modifying
it: map becomes unmodifiable for subsequent updates.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230524225421.1587859-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-05-25 10:08:20 -07:00
Andrii Nakryiko
cb8edce280 bpf: Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall
forces users to specify pinning location as a string-based absolute or
relative (to current working directory) path. This has various
implications related to security (e.g., symlink-based attacks), forces
BPF FS to be exposed in the file system, which can cause races with
other applications.

One of the feedbacks we got from folks working with containers heavily
was that inability to use purely FD-based location specification was an
unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET
commands. This patch closes this oversight, adding path_fd field to
BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by
*at() syscalls for dirfd + pathname combinations.

This now allows interesting possibilities like working with detached BPF
FS mount (e.g., to perform multiple pinnings without running a risk of
someone interfering with them), and generally making pinning/getting
more secure and not prone to any races and/or security attacks.

This is demonstrated by a selftest added in subsequent patch that takes
advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate
creating detached BPF FS mount, pinning, and then getting BPF map out of
it, all while never exposing this private instance of BPF FS to outside
worlds.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20230523170013.728457-4-andrii@kernel.org
2023-05-23 23:31:42 +02:00
Andrii Nakryiko
e7d85427ef bpf: Validate BPF object in BPF_OBJ_PIN before calling LSM
Do a sanity check whether provided file-to-be-pinned is actually a BPF
object (prog, map, btf) before calling security_path_mknod LSM hook. If
it's not, LSM hook doesn't have to be triggered, as the operation has no
chance of succeeding anyways.

Suggested-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20230522232917.2454595-2-andrii@kernel.org
2023-05-23 16:56:37 +02:00
Anton Protopopov
b34ffb0c6d bpf: fix a memory leak in the LRU and LRU_PERCPU hash maps
The LRU and LRU_PERCPU maps allocate a new element on update before locking the
target hash table bucket. Right after that the maps try to lock the bucket.
If this fails, then maps return -EBUSY to the caller without releasing the
allocated element. This makes the element untracked: it doesn't belong to
either of free lists, and it doesn't belong to the hash table, so can't be
re-used; this eventually leads to the permanent -ENOMEM on LRU map updates,
which is unexpected. Fix this by returning the element to the local free list
if bucket locking fails.

Fixes: 20b6cc34ea ("bpf: Avoid hashtab deadlock with map_locked")
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230522154558.2166815-1-aspsk@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-05-22 10:26:39 -07:00
Aditi Ghag
e924e80ee6 bpf: Add kfunc filter function to 'struct btf_kfunc_id_set'
This commit adds the ability to filter kfuncs to certain BPF program
types. This is required to limit bpf_sock_destroy kfunc implemented in
follow-up commits to programs with attach type 'BPF_TRACE_ITER'.

The commit adds a callback filter to 'struct btf_kfunc_id_set'.  The
filter has access to the `bpf_prog` construct including its properties
such as `expected_attached_type`.

Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
Link: https://lore.kernel.org/r/20230519225157.760788-7-aditi.ghag@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-05-19 22:44:14 -07:00
Yafang Shao
e859e42951 bpf: Show target_{obj,btf}_id in tracing link fdinfo
The target_btf_id can help us understand which kernel function is
linked by a tracing prog. The target_btf_id and target_obj_id have
already been exposed to userspace, so we just need to show them.

The result as follows,

$ cat /proc/10673/fdinfo/10
pos:    0
flags:  02000000
mnt_id: 15
ino:    2094
link_type:      tracing
link_id:        2
prog_tag:       a04f5eef06a7f555
prog_id:        13
attach_type:    24
target_obj_id:  1
target_btf_id:  13964

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230517103126.68372-2-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-05-19 10:06:44 -07:00
Will Deacon
0613d8ca9a bpf: Fix mask generation for 32-bit narrow loads of 64-bit fields
A narrow load from a 64-bit context field results in a 64-bit load
followed potentially by a 64-bit right-shift and then a bitwise AND
operation to extract the relevant data.

In the case of a 32-bit access, an immediate mask of 0xffffffff is used
to construct a 64-bit BPP_AND operation which then sign-extends the mask
value and effectively acts as a glorified no-op. For example:

0:	61 10 00 00 00 00 00 00	r0 = *(u32 *)(r1 + 0)

results in the following code generation for a 64-bit field:

	ldr	x7, [x7]	// 64-bit load
	mov	x10, #0xffffffffffffffff
	and	x7, x7, x10

Fix the mask generation so that narrow loads always perform a 32-bit AND
operation:

	ldr	x7, [x7]	// 64-bit load
	mov	w10, #0xffffffff
	and	w7, w7, w10

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Krzesimir Nowak <krzesimir@kinvolk.io>
Cc: Andrey Ignatov <rdna@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Fixes: 31fd85816d ("bpf: permits narrower load from bpf program context fields")
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20230518102528.1341-1-will@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-05-19 09:58:37 -07:00
Andrii Nakryiko
cff36398bd bpf: drop unnecessary user-triggerable WARN_ONCE in verifierl log
It's trivial for user to trigger "verifier log line truncated" warning,
as verifier has a fixed-sized buffer of 1024 bytes (as of now), and there are at
least two pieces of user-provided information that can be output through
this buffer, and both can be arbitrarily sized by user:
  - BTF names;
  - BTF.ext source code lines strings.

Verifier log buffer should be properly sized for typical verifier state
output. But it's sort-of expected that this buffer won't be long enough
in some circumstances. So let's drop the check. In any case code will
work correctly, at worst truncating a part of a single line output.

Reported-by: syzbot+8b2a08dfbd25fd933d75@syzkaller.appspotmail.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230516180409.3549088-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-05-16 22:34:50 -07:00
Jakub Kicinski
a0e35a648f bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZGKqEAAKCRDbK58LschI
 g6LYAQDp1jAszCOkmJ8VUA0ZyC5NAFDv+7y9Nd1toYWYX1btzAEAkf8+5qBJ1qmI
 P5M0hjMTbH4MID9Aql10ZbMHheyOBAo=
 =NUQM
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-05-16

We've added 57 non-merge commits during the last 19 day(s) which contain
a total of 63 files changed, 3293 insertions(+), 690 deletions(-).

The main changes are:

1) Add precision propagation to verifier for subprogs and callbacks,
   from Andrii Nakryiko.

2) Improve BPF's {g,s}setsockopt() handling with wrong option lengths,
   from Stanislav Fomichev.

3) Utilize pahole v1.25 for the kernel's BTF generation to filter out
   inconsistent function prototypes, from Alan Maguire.

4) Various dyn-pointer verifier improvements to relax restrictions,
   from Daniel Rosenberg.

5) Add a new bpf_task_under_cgroup() kfunc for designated task,
   from Feng Zhou.

6) Unblock tests for arm64 BPF CI after ftrace supporting direct call,
   from Florent Revest.

7) Add XDP hint kfunc metadata for RX hash/timestamp for igc,
   from Jesper Dangaard Brouer.

8) Add several new dyn-pointer kfuncs to ease their usability,
   from Joanne Koong.

9) Add in-depth LRU internals description and dot function graph,
   from Joe Stringer.

10) Fix KCSAN report on bpf_lru_list when accessing node->ref,
    from Martin KaFai Lau.

11) Only dump unprivileged_bpf_disabled log warning upon write,
    from Kui-Feng Lee.

12) Extend test_progs to directly passing allow/denylist file,
    from Stephen Veiss.

13) Fix BPF trampoline memleak upon failure attaching to fentry,
    from Yafang Shao.

14) Fix emitting struct bpf_tcp_sock type in vmlinux BTF,
    from Yonghong Song.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (57 commits)
  bpf: Fix memleak due to fentry attach failure
  bpf: Remove bpf trampoline selector
  bpf, arm64: Support struct arguments in the BPF trampoline
  bpftool: JIT limited misreported as negative value on aarch64
  bpf: fix calculation of subseq_idx during precision backtracking
  bpf: Remove anonymous union in bpf_kfunc_call_arg_meta
  bpf: Document EFAULT changes for sockopt
  selftests/bpf: Correctly handle optlen > 4096
  selftests/bpf: Update EFAULT {g,s}etsockopt selftests
  bpf: Don't EFAULT for {g,s}setsockopt with wrong optlen
  libbpf: fix offsetof() and container_of() to work with CO-RE
  bpf: Address KCSAN report on bpf_lru_list
  bpf: Add --skip_encoding_btf_inconsistent_proto, --btf_gen_optimized to pahole flags for v1.25
  selftests/bpf: Accept mem from dynptr in helper funcs
  bpf: verifier: Accept dynptr mem as mem in helpers
  selftests/bpf: Check overflow in optional buffer
  selftests/bpf: Test allowing NULL buffer in dynptr slice
  bpf: Allow NULL buffers in bpf_dynptr_slice(_rw)
  selftests/bpf: Add testcase for bpf_task_under_cgroup
  bpf: Add bpf_task_under_cgroup() kfunc
  ...
====================

Link: https://lore.kernel.org/r/20230515225603.27027-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-16 19:50:05 -07:00
Yafang Shao
108598c39e bpf: Fix memleak due to fentry attach failure
If it fails to attach fentry, the allocated bpf trampoline image will be
left in the system. That can be verified by checking /proc/kallsyms.

This meamleak can be verified by a simple bpf program as follows:

  SEC("fentry/trap_init")
  int fentry_run()
  {
      return 0;
  }

It will fail to attach trap_init because this function is freed after
kernel init, and then we can find the trampoline image is left in the
system by checking /proc/kallsyms.

  $ tail /proc/kallsyms
  ffffffffc0613000 t bpf_trampoline_6442453466_1  [bpf]
  ffffffffc06c3000 t bpf_trampoline_6442453466_1  [bpf]

  $ bpftool btf dump file /sys/kernel/btf/vmlinux | grep "FUNC 'trap_init'"
  [2522] FUNC 'trap_init' type_id=119 linkage=static

  $ echo $((6442453466 & 0x7fffffff))
  2522

Note that there are two left bpf trampoline images, that is because the
libbpf will fallback to raw tracepoint if -EINVAL is returned.

Fixes: e21aa34178 ("bpf: Fix fexit trampoline.")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <song@kernel.org>
Cc: Jiri Olsa <olsajiri@gmail.com>
Link: https://lore.kernel.org/bpf/20230515130849.57502-2-laoar.shao@gmail.com
2023-05-15 23:41:59 +02:00
Yafang Shao
47e79cbeea bpf: Remove bpf trampoline selector
After commit e21aa34178 ("bpf: Fix fexit trampoline."), the selector is only
used to indicate how many times the bpf trampoline image are updated and been
displayed in the trampoline ksym name. After the trampoline is freed, the
selector will start from 0 again. So the selector is a useless value to the
user. We can remove it.

If the user want to check whether the bpf trampoline image has been updated
or not, the user can compare the address. Each time the trampoline image is
updated, the address will change consequently. Jiri also pointed out another
issue that perf is still using the old name "bpf_trampoline_%lu", so this
change can fix the issue in perf.

Fixes: e21aa34178 ("bpf: Fix fexit trampoline.")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <song@kernel.org>
Cc: Jiri Olsa <olsajiri@gmail.com>
Link: https://lore.kernel.org/bpf/ZFvOOlrmHiY9AgXE@krava
Link: https://lore.kernel.org/bpf/20230515130849.57502-3-laoar.shao@gmail.com
2023-05-15 22:18:19 +02:00
Andrii Nakryiko
d84b1a6708 bpf: fix calculation of subseq_idx during precision backtracking
Subsequent instruction index (subseq_idx) is an index of an instruction
that was verified/executed by verifier after the currently processed
instruction. It is maintained during precision backtracking processing
and is used to detect various subprog calling conditions.

This patch fixes the bug with incorrectly resetting subseq_idx to -1
when going from child state to parent state during backtracking. If we
don't maintain correct subseq_idx we can misidentify subprog calls
leading to precision tracking bugs.

One such case was triggered by test_global_funcs/global_func9 test where
global subprog call happened to be the very last instruction in parent
state, leading to subseq_idx==-1, triggering WARN_ONCE:

  [   36.045754] verifier backtracking bug
  [   36.045764] WARNING: CPU: 13 PID: 2073 at kernel/bpf/verifier.c:3503 __mark_chain_precision+0xcc6/0xde0
  [   36.046819] Modules linked in: aesni_intel(E) crypto_simd(E) cryptd(E) kvm_intel(E) kvm(E) irqbypass(E) i2c_piix4(E) serio_raw(E) i2c_core(E) crc32c_intel)
  [   36.048040] CPU: 13 PID: 2073 Comm: test_progs Tainted: G        W  OE      6.3.0-07976-g4d585f48ee6b-dirty #972
  [   36.048783] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
  [   36.049648] RIP: 0010:__mark_chain_precision+0xcc6/0xde0
  [   36.050038] Code: 3d 82 c6 05 bb 35 32 02 01 e8 66 21 ec ff 0f 0b b8 f2 ff ff ff e9 30 f5 ff ff 48 c7 c7 f3 61 3d 82 4c 89 0c 24 e8 4a 21 ec ff <0f> 0b 4c0

With the fix precision tracking across multiple states works correctly now:

mark_precise: frame0: last_idx 45 first_idx 38 subseq_idx -1
mark_precise: frame0: regs=r8 stack= before 44: (61) r7 = *(u32 *)(r10 -4)
mark_precise: frame0: regs=r8 stack= before 43: (85) call pc+41
mark_precise: frame0: regs=r8 stack= before 42: (07) r1 += -48
mark_precise: frame0: regs=r8 stack= before 41: (bf) r1 = r10
mark_precise: frame0: regs=r8 stack= before 40: (63) *(u32 *)(r10 -48) = r1
mark_precise: frame0: regs=r8 stack= before 39: (b4) w1 = 0
mark_precise: frame0: regs=r8 stack= before 38: (85) call pc+38
mark_precise: frame0: parent state regs=r8 stack=:  R0_w=scalar() R1_w=map_value(off=4,ks=4,vs=8,imm=0) R6=1 R7_w=scalar() R8_r=P0 R10=fpm
mark_precise: frame0: last_idx 36 first_idx 28 subseq_idx 38
mark_precise: frame0: regs=r8 stack= before 36: (18) r1 = 0xffff888104f2ed14
mark_precise: frame0: regs=r8 stack= before 35: (85) call pc+33
mark_precise: frame0: regs=r8 stack= before 33: (18) r1 = 0xffff888104f2ed10
mark_precise: frame0: regs=r8 stack= before 32: (85) call pc+36
mark_precise: frame0: regs=r8 stack= before 31: (07) r1 += -4
mark_precise: frame0: regs=r8 stack= before 30: (bf) r1 = r10
mark_precise: frame0: regs=r8 stack= before 29: (63) *(u32 *)(r10 -4) = r7
mark_precise: frame0: regs=r8 stack= before 28: (4c) w7 |= w0
mark_precise: frame0: parent state regs=r8 stack=:  R0_rw=scalar() R6=1 R7_rw=scalar() R8_rw=P0 R10=fp0 fp-48_r=mmmmmmmm
mark_precise: frame0: last_idx 27 first_idx 16 subseq_idx 28
mark_precise: frame0: regs=r8 stack= before 27: (85) call pc+31
mark_precise: frame0: regs=r8 stack= before 26: (b7) r1 = 0
mark_precise: frame0: regs=r8 stack= before 25: (b7) r8 = 0

Note how subseq_idx starts out as -1, then is preserved as 38 and then 28 as we
go up the parent state chain.

Reported-by: Alexei Starovoitov <ast@kernel.org>
Fixes: fde2a3882b ("bpf: support precision propagation in the presence of subprogs")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230515180710.1535018-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-05-15 12:06:31 -07:00
Dave Marchevsky
4d585f48ee bpf: Remove anonymous union in bpf_kfunc_call_arg_meta
For kfuncs like bpf_obj_drop and bpf_refcount_acquire - which take
user-defined types as input - the verifier needs to track the specific
type passed in when checking a particular kfunc call. This requires
tracking (btf, btf_id) tuple. In commit 7c50b1cb76
("bpf: Add bpf_refcount_acquire kfunc") I added an anonymous union with
inner structs named after the specific kfuncs tracking this information,
with the goal of making it more obvious which kfunc this data was being
tracked / expected to be tracked on behalf of.

In a recent series adding a new user of this tuple, Alexei mentioned
that he didn't like this union usage as it doesn't really help with
readability or bug-proofing ([0]). In an offline convo we agreed to
have the tuple be fields (arg_btf, arg_btf_id), with comments in
bpf_kfunc_call_arg_meta definition enumerating the uses of the fields by
kfunc-specific handling logic. Such a pattern is used by struct
bpf_reg_state without trouble.

Accordingly, this patch removes the anonymous union in favor of arg_btf
and arg_btf_id fields and comment enumerating their current uses. The
patch also removes struct btf_and_id, which was only being used by the
removed union's inner structs.

This is a mechanical change, existing linked_list and rbtree tests will
validate that correct (btf, btf_id) are being passed.

  [0]: https://lore.kernel.org/bpf/20230505021707.vlyiwy57vwxglbka@dhcp-172-26-102-232.dhcp.thefacebook.com

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230510213047.1633612-1-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-05-15 07:17:12 -07:00
Jakub Kicinski
e1505c1cc8 bpf: netdev: init the offload table earlier
Some netdevices may get unregistered before late_initcall(),
we have to move the hashtable init earlier.

Fixes: f1fc43d039 ("bpf: Move offload initialization into late_initcall")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217399
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230505215836.491485-1-kuba@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-05-15 07:07:41 -07:00
Stanislav Fomichev
29ebbba7d4 bpf: Don't EFAULT for {g,s}setsockopt with wrong optlen
With the way the hooks implemented right now, we have a special
condition: optval larger than PAGE_SIZE will expose only first 4k into
BPF; any modifications to the optval are ignored. If the BPF program
doesn't handle this condition by resetting optlen to 0,
the userspace will get EFAULT.

The intention of the EFAULT was to make it apparent to the
developers that the program is doing something wrong.
However, this inadvertently might affect production workloads
with the BPF programs that are not too careful (i.e., returning EFAULT
for perfectly valid setsockopt/getsockopt calls).

Let's try to minimize the chance of BPF program screwing up userspace
by ignoring the output of those BPF programs (instead of returning
EFAULT to the userspace). pr_info_once those cases to
the dmesg to help with figuring out what's going wrong.

Fixes: 0d01da6afc ("bpf: implement getsockopt and setsockopt hooks")
Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230511170456.1759459-2-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-05-13 16:20:15 -07:00
Martin KaFai Lau
ee9fd0ac30 bpf: Address KCSAN report on bpf_lru_list
KCSAN reported a data-race when accessing node->ref.
Although node->ref does not have to be accurate,
take this chance to use a more common READ_ONCE() and WRITE_ONCE()
pattern instead of data_race().

There is an existing bpf_lru_node_is_ref() and bpf_lru_node_set_ref().
This patch also adds bpf_lru_node_clear_ref() to do the
WRITE_ONCE(node->ref, 0) also.

==================================================================
BUG: KCSAN: data-race in __bpf_lru_list_rotate / __htab_lru_percpu_map_update_elem

write to 0xffff888137038deb of 1 bytes by task 11240 on cpu 1:
__bpf_lru_node_move kernel/bpf/bpf_lru_list.c:113 [inline]
__bpf_lru_list_rotate_active kernel/bpf/bpf_lru_list.c:149 [inline]
__bpf_lru_list_rotate+0x1bf/0x750 kernel/bpf/bpf_lru_list.c:240
bpf_lru_list_pop_free_to_local kernel/bpf/bpf_lru_list.c:329 [inline]
bpf_common_lru_pop_free kernel/bpf/bpf_lru_list.c:447 [inline]
bpf_lru_pop_free+0x638/0xe20 kernel/bpf/bpf_lru_list.c:499
prealloc_lru_pop kernel/bpf/hashtab.c:290 [inline]
__htab_lru_percpu_map_update_elem+0xe7/0x820 kernel/bpf/hashtab.c:1316
bpf_percpu_hash_update+0x5e/0x90 kernel/bpf/hashtab.c:2313
bpf_map_update_value+0x2a9/0x370 kernel/bpf/syscall.c:200
generic_map_update_batch+0x3ae/0x4f0 kernel/bpf/syscall.c:1687
bpf_map_do_batch+0x2d9/0x3d0 kernel/bpf/syscall.c:4534
__sys_bpf+0x338/0x810
__do_sys_bpf kernel/bpf/syscall.c:5096 [inline]
__se_sys_bpf kernel/bpf/syscall.c:5094 [inline]
__x64_sys_bpf+0x43/0x50 kernel/bpf/syscall.c:5094
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

read to 0xffff888137038deb of 1 bytes by task 11241 on cpu 0:
bpf_lru_node_set_ref kernel/bpf/bpf_lru_list.h:70 [inline]
__htab_lru_percpu_map_update_elem+0x2f1/0x820 kernel/bpf/hashtab.c:1332
bpf_percpu_hash_update+0x5e/0x90 kernel/bpf/hashtab.c:2313
bpf_map_update_value+0x2a9/0x370 kernel/bpf/syscall.c:200
generic_map_update_batch+0x3ae/0x4f0 kernel/bpf/syscall.c:1687
bpf_map_do_batch+0x2d9/0x3d0 kernel/bpf/syscall.c:4534
__sys_bpf+0x338/0x810
__do_sys_bpf kernel/bpf/syscall.c:5096 [inline]
__se_sys_bpf kernel/bpf/syscall.c:5094 [inline]
__x64_sys_bpf+0x43/0x50 kernel/bpf/syscall.c:5094
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

value changed: 0x01 -> 0x00

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 11241 Comm: syz-executor.3 Not tainted 6.3.0-rc7-syzkaller-00136-g6a66fdd29ea1 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/30/2023
==================================================================

Reported-by: syzbot+ebe648a84e8784763f82@syzkaller.appspotmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230511043748.1384166-1-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-05-12 12:01:03 -07:00
Daniel Rosenberg
2012c867c8 bpf: verifier: Accept dynptr mem as mem in helpers
This allows using memory retrieved from dynptrs with helper functions
that accept ARG_PTR_TO_MEM. For instance, results from bpf_dynptr_data
can be passed along to bpf_strncmp.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Link: https://lore.kernel.org/r/20230506013134.2492210-5-drosen@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-05-06 16:42:57 -07:00
Daniel Rosenberg
3bda08b636 bpf: Allow NULL buffers in bpf_dynptr_slice(_rw)
bpf_dynptr_slice(_rw) uses a user provided buffer if it can not provide
a pointer to a block of contiguous memory. This buffer is unused in the
case of local dynptrs, and may be unused in other cases as well. There
is no need to require the buffer, as the kfunc can just return NULL if
it was needed and not provided.

This adds another kfunc annotation, __opt, which combines with __sz and
__szk to allow the buffer associated with the size to be NULL. If the
buffer is NULL, the verifier does not check that the buffer is of
sufficient size.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Link: https://lore.kernel.org/r/20230506013134.2492210-2-drosen@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-05-06 16:42:57 -07:00
Feng Zhou
b5ad4cdc46 bpf: Add bpf_task_under_cgroup() kfunc
Add a kfunc that's similar to the bpf_current_task_under_cgroup.
The difference is that it is a designated task.

When hook sched related functions, sometimes it is necessary to
specify a task instead of the current task.

Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230506031545.35991-2-zhoufeng.zf@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-05-06 13:56:38 -07:00
Andrii Nakryiko
fde2a3882b bpf: support precision propagation in the presence of subprogs
Add support precision backtracking in the presence of subprogram frames in
jump history.

This means supporting a few different kinds of subprogram invocation
situations, all requiring a slightly different handling in precision
backtracking handling logic:
  - static subprogram calls;
  - global subprogram calls;
  - callback-calling helpers/kfuncs.

For each of those we need to handle a few precision propagation cases:
  - what to do with precision of subprog returns (r0);
  - what to do with precision of input arguments;
  - for all of them callee-saved registers in caller function should be
    propagated ignoring subprog/callback part of jump history.

N.B. Async callback-calling helpers (currently only
bpf_timer_set_callback()) are transparent to all this because they set
a separate async callback environment and thus callback's history is not
shared with main program's history. So as far as all the changes in this
commit goes, such helper is just a regular helper.

Let's look at all these situation in more details. Let's start with
static subprogram being called, using an exxerpt of a simple main
program and its static subprog, indenting subprog's frame slightly to
make everything clear.

frame 0				frame 1			precision set
=======				=======			=============

 9: r6 = 456;
10: r1 = 123;						fr0: r6
11: call pc+10;						fr0: r1, r6
				22: r0 = r1;		fr0: r6;     fr1: r1
				23: exit		fr0: r6;     fr1: r0
12: r1 = <map_pointer>					fr0: r0, r6
13: r1 += r0;						fr0: r0, r6
14: r1 += r6;						fr0: r6
15: exit

As can be seen above main function is passing 123 as single argument to
an identity (`return x;`) subprog. Returned value is used to adjust map
pointer offset, which forces r0 to be marked as precise. Then
instruction #14 does the same for callee-saved r6, which will have to be
backtracked all the way to instruction #9. For brevity, precision sets
for instruction #13 and #14 are combined in the diagram above.

First, for subprog calls, r0 returned from subprog (in frame 0) has to
go into subprog's frame 1, and should be cleared from frame 0. So we go
back into subprog's frame knowing we need to mark r0 precise. We then
see that insn #22 sets r0 from r1, so now we care about marking r1
precise.  When we pop up from subprog's frame back into caller at
insn #11 we keep r1, as it's an argument-passing register, so we eventually
find `10: r1 = 123;` and satify precision propagation chain for insn #13.

This example demonstrates two sets of rules:
  - r0 returned after subprog call has to be moved into subprog's r0 set;
  - *static* subprog arguments (r1-r5) are moved back to caller precision set.

Let's look at what happens with callee-saved precision propagation. Insn #14
mark r6 as precise. When we get into subprog's frame, we keep r6 in
frame 0's precision set *only*. Subprog itself has its own set of
independent r6-r10 registers and is not affected. When we eventually
made our way out of subprog frame we keep r6 in precision set until we
reach `9: r6 = 456;`, satisfying propagation. r6-r10 propagation is
perhaps the simplest aspect, it always stays in its original frame.

That's pretty much all we have to do to support precision propagation
across *static subprog* invocation.

Let's look at what happens when we have global subprog invocation.

frame 0				frame 1			precision set
=======				=======			=============

 9: r6 = 456;
10: r1 = 123;						fr0: r6
11: call pc+10; # global subprog			fr0: r6
12: r1 = <map_pointer>					fr0: r0, r6
13: r1 += r0;						fr0: r0, r6
14: r1 += r6;						fr0: r6;
15: exit

Starting from insn #13, r0 has to be precise. We backtrack all the way
to insn #11 (call pc+10) and see that subprog is global, so was already
validated in isolation. As opposed to static subprog, global subprog
always returns unknown scalar r0, so that satisfies precision
propagation and we drop r0 from precision set. We are done for insns #13.

Now for insn #14. r6 is in precision set, we backtrack to `call pc+10;`.
Here we need to recognize that this is effectively both exit and entry
to global subprog, which means we stay in caller's frame. So we carry on
with r6 still in precision set, until we satisfy it at insn #9. The only
hard part with global subprogs is just knowing when it's a global func.

Lastly, callback-calling helpers and kfuncs do simulate subprog calls,
so jump history will have subprog instructions in between caller
program's instructions, but the rules of propagating r0 and r1-r5
differ, because we don't actually directly call callback. We actually
call helper/kfunc, which at runtime will call subprog, so the only
difference between normal helper/kfunc handling is that we need to make
sure to skip callback simulatinog part of jump history.
Let's look at an example to make this clearer.

frame 0				frame 1			precision set
=======				=======			=============

 8: r6 = 456;
 9: r1 = 123;						fr0: r6
10: r2 = &callback;					fr0: r6
11: call bpf_loop;					fr0: r6
				22: r0 = r1;		fr0: r6      fr1:
				23: exit		fr0: r6      fr1:
12: r1 = <map_pointer>					fr0: r0, r6
13: r1 += r0;						fr0: r0, r6
14: r1 += r6;						fr0: r6;
15: exit

Again, insn #13 forces r0 to be precise. As soon as we get to `23: exit`
we see that this isn't actually a static subprog call (it's `call
bpf_loop;` helper call instead). So we clear r0 from precision set.

For callee-saved register, there is no difference: it stays in frame 0's
precision set, we go through insn #22 and #23, ignoring them until we
get back to caller frame 0, eventually satisfying precision backtrack
logic at insn #8 (`r6 = 456;`).

Assuming callback needed to set r0 as precise at insn #23, we'd
backtrack to insn #22, switching from r0 to r1, and then at the point
when we pop back to frame 0 at insn #11, we'll clear r1-r5 from
precision set, as we don't really do a subprog call directly, so there
is no input argument precision propagation.

That's pretty much it. With these changes, it seems like the only still
unsupported situation for precision backpropagation is the case when
program is accessing stack through registers other than r10. This is
still left as unsupported (though rare) case for now.

As for results. For selftests, few positive changes for bigger programs,
cls_redirect in dynptr variant benefitting the most:

[vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results.csv ~/subprog-precise-after-results.csv -f @veristat.cfg -e file,prog,insns -f 'insns_diff!=0'
File                                      Program        Insns (A)  Insns (B)  Insns     (DIFF)
----------------------------------------  -------------  ---------  ---------  ----------------
pyperf600_bpf_loop.bpf.linked1.o          on_event            2060       2002      -58 (-2.82%)
test_cls_redirect_dynptr.bpf.linked1.o    cls_redirect       15660       2914  -12746 (-81.39%)
test_cls_redirect_subprogs.bpf.linked1.o  cls_redirect       61620      59088    -2532 (-4.11%)
xdp_synproxy_kern.bpf.linked1.o           syncookie_tc      109980      86278  -23702 (-21.55%)
xdp_synproxy_kern.bpf.linked1.o           syncookie_xdp      97716      85147  -12569 (-12.86%)

Cilium progress don't really regress. They don't use subprogs and are
mostly unaffected, but some other fixes and improvements could have
changed something. This doesn't appear to be the case:

[vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-cilium.csv ~/subprog-precise-after-results-cilium.csv -e file,prog,insns -f 'insns_diff!=0'
File           Program                         Insns (A)  Insns (B)  Insns (DIFF)
-------------  ------------------------------  ---------  ---------  ------------
bpf_host.o     tail_nodeport_nat_ingress_ipv6       4983       5003  +20 (+0.40%)
bpf_lxc.o      tail_nodeport_nat_ingress_ipv6       4983       5003  +20 (+0.40%)
bpf_overlay.o  tail_nodeport_nat_ingress_ipv6       4983       5003  +20 (+0.40%)
bpf_xdp.o      tail_handle_nat_fwd_ipv6            12475      12504  +29 (+0.23%)
bpf_xdp.o      tail_nodeport_nat_ingress_ipv6       6363       6371   +8 (+0.13%)

Looking at (somewhat anonymized) Meta production programs, we see mostly
insignificant variation in number of instructions, with one program
(syar_bind6_protect6) benefitting the most at -17%.

[vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-fbcode.csv ~/subprog-precise-after-results-fbcode.csv -e prog,insns -f 'insns_diff!=0'
Program                   Insns (A)  Insns (B)  Insns     (DIFF)
------------------------  ---------  ---------  ----------------
on_request_context_event        597        585      -12 (-2.01%)
read_async_py_stack           43789      43657     -132 (-0.30%)
read_sync_py_stack            35041      37599    +2558 (+7.30%)
rrm_usdt                        946        940       -6 (-0.63%)
sysarmor_inet6_bind           28863      28249     -614 (-2.13%)
sysarmor_inet_bind            28845      28240     -605 (-2.10%)
syar_bind4_protect4          154145     147640    -6505 (-4.22%)
syar_bind6_protect6          165242     137088  -28154 (-17.04%)
syar_task_exit_setgid         21289      19720    -1569 (-7.37%)
syar_task_exit_setuid         21290      19721    -1569 (-7.37%)
do_uprobe                     19967      19413     -554 (-2.77%)
tw_twfw_ingress              215877     204833   -11044 (-5.12%)
tw_twfw_tc_in                215877     204833   -11044 (-5.12%)

But checking duration (wall clock) differences, that is the actual time taken
by verifier to validate programs, we see a sometimes dramatic improvements, all
the way to about 16x improvements:

[vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-meta.csv ~/subprog-precise-after-results-meta.csv -e prog,duration -s duration_diff^ | head -n20
Program                                   Duration (us) (A)  Duration (us) (B)  Duration (us) (DIFF)
----------------------------------------  -----------------  -----------------  --------------------
tw_twfw_ingress                                     4488374             272836    -4215538 (-93.92%)
tw_twfw_tc_in                                       4339111             268175    -4070936 (-93.82%)
tw_twfw_egress                                      3521816             270751    -3251065 (-92.31%)
tw_twfw_tc_eg                                       3472878             284294    -3188584 (-91.81%)
balancer_ingress                                     343119             291391      -51728 (-15.08%)
syar_bind6_protect6                                   78992              64782      -14210 (-17.99%)
ttls_tc_ingress                                       11739               8176       -3563 (-30.35%)
kprobe__security_inode_link                           13864              11341       -2523 (-18.20%)
read_sync_py_stack                                    21927              19442       -2485 (-11.33%)
read_async_py_stack                                   30444              28136        -2308 (-7.58%)
syar_task_exit_setuid                                 10256               8440       -1816 (-17.71%)

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230505043317.3629845-9-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-05-04 22:35:35 -07:00
Andrii Nakryiko
c50c0b57a5 bpf: fix mark_all_scalars_precise use in mark_chain_precision
When precision backtracking bails out due to some unsupported sequence
of instructions (e.g., stack access through register other than r10), we
need to mark all SCALAR registers as precise to be safe. Currently,
though, we mark SCALARs precise only starting from the state we detected
unsupported condition, which could be one of the parent states of the
actual current state. This will leave some registers potentially not
marked as precise, even though they should. So make sure we start
marking scalars as precise from current state (env->cur_state).

Further, we don't currently detect a situation when we end up with some
stack slots marked as needing precision, but we ran out of available
states to find the instructions that populate those stack slots. This is
akin the `i >= func->allocated_stack / BPF_REG_SIZE` check and should be
handled similarly by falling back to marking all SCALARs precise. Add
this check when we run out of states.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230505043317.3629845-8-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-05-04 22:35:35 -07:00
Andrii Nakryiko
f655badf2a bpf: fix propagate_precision() logic for inner frames
Fix propagate_precision() logic to perform propagation of all necessary
registers and stack slots across all active frames *in one batch step*.

Doing this for each register/slot in each individual frame is wasteful,
but the main problem is that backtracking of instruction in any frame
except the deepest one just doesn't work. This is due to backtracking
logic relying on jump history, and available jump history always starts
(or ends, depending how you view it) in current frame. So, if
prog A (frame #0) called subprog B (frame #1) and we need to propagate
precision of, say, register R6 (callee-saved) within frame #0, we
actually don't even know where jump history that corresponds to prog
A even starts. We'd need to skip subprog part of jump history first to
be able to do this.

Luckily, with struct backtrack_state and __mark_chain_precision()
handling bitmasks tracking/propagation across all active frames at the
same time (added in previous patch), propagate_precision() can be both
fixed and sped up by setting all the necessary bits across all frames
and then performing one __mark_chain_precision() pass. This makes it
unnecessary to skip subprog parts of jump history.

We also improve logging along the way, to clearly specify which
registers' and slots' precision markings are propagated within which
frame. Each frame will have dedicated line and all registers and stack
slots from that frame will be reported in format similar to precision
backtrack regs/stack logging. E.g.:

frame 1: propagating r1,r2,r3,fp-8,fp-16
frame 0: propagating r3,r9,fp-120

Fixes: 529409ea92 ("bpf: propagate precision across all frames, not just the last one")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230505043317.3629845-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-05-04 22:35:35 -07:00
Andrii Nakryiko
1ef22b6865 bpf: maintain bitmasks across all active frames in __mark_chain_precision
Teach __mark_chain_precision logic to maintain register/stack masks
across all active frames when going from child state to parent state.
Currently this should be mostly no-op, as precision backtracking usually
bails out when encountering subprog entry/exit.

It's not very apparent from the diff due to increased indentation, but
the logic remains the same, except everything is done on specific `fr`
frame index. Calls to bt_clear_reg() and bt_clear_slot() are replaced
with frame-specific bt_clear_frame_reg() and bt_clear_frame_slot(),
where frame index is passed explicitly, instead of using current frame
number.

We also adjust logging to emit affected frame number. And we also add
better logging of human-readable register and stack slot masks, similar
to previous patch.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230505043317.3629845-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-05-04 22:35:35 -07:00
Andrii Nakryiko
d9439c21a9 bpf: improve precision backtrack logging
Add helper to format register and stack masks in more human-readable
format. Adjust logging a bit during backtrack propagation and especially
during forcing precision fallback logic to make it clearer what's going
on (with log_level=2, of course), and also start reporting affected
frame depth. This is in preparation for having more than one active
frame later when precision propagation between subprog calls is added.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230505043317.3629845-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-05-04 22:35:35 -07:00
Andrii Nakryiko
407958a0e9 bpf: encapsulate precision backtracking bookkeeping
Add struct backtrack_state and straightforward API around it to keep
track of register and stack masks used and maintained during precision
backtracking process. Having this logic separately allow to keep
high-level backtracking algorithm cleaner, but also it sets us up to
cleanly keep track of register and stack masks per frame, allowing (with
some further logic adjustments) to perform precision backpropagation
across multiple frames (i.e., subprog calls).

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230505043317.3629845-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-05-04 22:35:35 -07:00
Andrii Nakryiko
e0bf462276 bpf: mark relevant stack slots scratched for register read instructions
When handling instructions that read register slots, mark relevant stack
slots as scratched so that verifier log would contain those slots' states, in
addition to currently emitted registers with stack slot offsets.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230505043317.3629845-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-05-04 22:35:34 -07:00
Kui-Feng Lee
fedf99200a bpf: Print a warning only if writing to unprivileged_bpf_disabled.
Only print the warning message if you are writing to
"/proc/sys/kernel/unprivileged_bpf_disabled".

The kernel may print an annoying warning when you read
"/proc/sys/kernel/unprivileged_bpf_disabled" saying

  WARNING: Unprivileged eBPF is enabled with eIBRS on, data leaks possible
  via Spectre v2 BHB attacks!

However, this message is only meaningful when the feature is
disabled or enabled.

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20230502181418.308479-1-kuifeng@meta.com
2023-05-02 16:20:31 -07:00
Linus Torvalds
86e98ed15b cgroup changes for v6.4-rc1
* cpuset changes including the fix for an incorrect interaction with CPU
   hotplug and an optimization.
 
 * Other doc and cosmetic changes.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZErfng4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGVVtAQCDycK4VSgc4nsFPG1vh1Oy1A723ciEUwAbKmV/
 F1n7xwEA68FiDvE29LpMJJuYP9HnX0A5zRMyNnb52kN9jmgcEQI=
 =ALol
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:

 - cpuset changes including the fix for an incorrect interaction with
   CPU hotplug and an optimization

 - Other doc and cosmetic changes

* tag 'cgroup-for-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  docs: cgroup-v1/cpusets: update libcgroup project link
  cgroup/cpuset: Minor updates to test_cpuset_prs.sh
  cgroup/cpuset: Include offline CPUs when tasks' cpumasks in top_cpuset are updated
  cgroup/cpuset: Skip task update if hotplug doesn't affect current cpuset
  cpuset: Clean up cpuset_node_allowed
  cgroup: bpf: use cgroup_lock()/cgroup_unlock() wrappers
2023-04-29 10:05:22 -07:00
Joanne Koong
361f129f3c bpf: Add bpf_dynptr_clone
The cloned dynptr will point to the same data as its parent dynptr,
with the same type, offset, size and read-only properties.

Any writes to a dynptr will be reflected across all instances
(by 'instance', this means any dynptrs that point to the same
underlying data).

Please note that data slice and dynptr invalidations will affect all
instances as well. For example, if bpf_dynptr_write() is called on an
skb-type dynptr, all data slices of dynptr instances to that skb
will be invalidated as well (eg data slices of any clones, parents,
grandparents, ...). Another example is if a ringbuf dynptr is submitted,
any instance of that dynptr will be invalidated.

Changing the view of the dynptr (eg advancing the offset or
trimming the size) will only affect that dynptr and not affect any
other instances.

One example use case where cloning may be helpful is for hashing or
iterating through dynptr data. Cloning will allow the user to maintain
the original view of the dynptr for future use, while also allowing
views to smaller subsets of the data after the offset is advanced or the
size is trimmed.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230420071414.570108-5-joannelkoong@gmail.com
2023-04-27 10:40:47 +02:00
Joanne Koong
26662d7347 bpf: Add bpf_dynptr_size
bpf_dynptr_size returns the number of usable bytes in a dynptr.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20230420071414.570108-4-joannelkoong@gmail.com
2023-04-27 10:40:41 +02:00
Joanne Koong
540ccf96dd bpf: Add bpf_dynptr_is_null and bpf_dynptr_is_rdonly
bpf_dynptr_is_null returns true if the dynptr is null / invalid
(determined by whether ptr->data is NULL), else false if
the dynptr is a valid dynptr.

bpf_dynptr_is_rdonly returns true if the dynptr is read-only,
else false if the dynptr is read-writable. If the dynptr is
null / invalid, false is returned by default.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20230420071414.570108-3-joannelkoong@gmail.com
2023-04-27 10:40:36 +02:00
Joanne Koong
987d0242d1 bpf: Add bpf_dynptr_adjust
Add a new kfunc

int bpf_dynptr_adjust(struct bpf_dynptr_kern *ptr, u32 start, u32 end);

which adjusts the dynptr to reflect the new [start, end) interval.
In particular, it advances the offset of the dynptr by "start" bytes,
and if end is less than the size of the dynptr, then this will trim the
dynptr accordingly.

Adjusting the dynptr interval may be useful in certain situations.
For example, when hashing which takes in generic dynptrs, if the dynptr
points to a struct but only a certain memory region inside the struct
should be hashed, adjust can be used to narrow in on the
specific region to hash.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230420071414.570108-2-joannelkoong@gmail.com
2023-04-27 10:40:31 +02:00
Linus Torvalds
6e98b09da9 Networking changes for 6.4.
Core
 ----
 
  - Introduce a config option to tweak MAX_SKB_FRAGS. Increasing the
    default value allows for better BIG TCP performances.
 
  - Reduce compound page head access for zero-copy data transfers.
 
  - RPS/RFS improvements, avoiding unneeded NET_RX_SOFTIRQ when possible.
 
  - Threaded NAPI improvements, adding defer skb free support and unneeded
    softirq avoidance.
 
  - Address dst_entry reference count scalability issues, via false
    sharing avoidance and optimize refcount tracking.
 
  - Add lockless accesses annotation to sk_err[_soft].
 
  - Optimize again the skb struct layout.
 
  - Extends the skb drop reasons to make it usable by multiple
    subsystems.
 
  - Better const qualifier awareness for socket casts.
 
 BPF
 ---
 
  - Add skb and XDP typed dynptrs which allow BPF programs for more
    ergonomic and less brittle iteration through data and variable-sized
    accesses.
 
  - Add a new BPF netfilter program type and minimal support to hook
    BPF programs to netfilter hooks such as prerouting or forward.
 
  - Add more precise memory usage reporting for all BPF map types.
 
  - Adds support for using {FOU,GUE} encap with an ipip device operating
    in collect_md mode and add a set of BPF kfuncs for controlling encap
    params.
 
  - Allow BPF programs to detect at load time whether a particular kfunc
    exists or not, and also add support for this in light skeleton.
 
  - Bigger batch of BPF verifier improvements to prepare for upcoming BPF
    open-coded iterators allowing for less restrictive looping capabilities.
 
  - Rework RCU enforcement in the verifier, add kptr_rcu and enforce BPF
    programs to NULL-check before passing such pointers into kfunc.
 
  - Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and in
    local storage maps.
 
  - Enable RCU semantics for task BPF kptrs and allow referenced kptr
    tasks to be stored in BPF maps.
 
  - Add support for refcounted local kptrs to the verifier for allowing
    shared ownership, useful for adding a node to both the BPF list and
    rbtree.
 
  - Add BPF verifier support for ST instructions in convert_ctx_access()
    which will help new -mcpu=v4 clang flag to start emitting them.
 
  - Add ARM32 USDT support to libbpf.
 
  - Improve bpftool's visual program dump which produces the control
    flow graph in a DOT format by adding C source inline annotations.
 
 Protocols
 ---------
 
  - IPv4: Allow adding to IPv4 address a 'protocol' tag. Such value
    indicates the provenance of the IP address.
 
  - IPv6: optimize route lookup, dropping unneeded R/W lock acquisition.
 
  - Add the handshake upcall mechanism, allowing the user-space
    to implement generic TLS handshake on kernel's behalf.
 
  - Bridge: support per-{Port, VLAN} neighbor suppression, increasing
    resilience to nodes failures.
 
  - SCTP: add support for Fair Capacity and Weighted Fair Queueing
    schedulers.
 
  - MPTCP: delay first subflow allocation up to its first usage. This
    will allow for later better LSM interaction.
 
  - xfrm: Remove inner/outer modes from input/output path. These are
    not needed anymore.
 
  - WiFi:
    - reduced neighbor report (RNR) handling for AP mode
    - HW timestamping support
    - support for randomized auth/deauth TA for PASN privacy
    - per-link debugfs for multi-link
    - TC offload support for mac80211 drivers
    - mac80211 mesh fast-xmit and fast-rx support
    - enable Wi-Fi 7 (EHT) mesh support
 
 Netfilter
 ---------
 
  - Add nf_tables 'brouting' support, to force a packet to be routed
    instead of being bridged.
 
  - Update bridge netfilter and ovs conntrack helpers to handle
    IPv6 Jumbo packets properly, i.e. fetch the packet length
    from hop-by-hop extension header. This is needed for BIT TCP
    support.
 
  - The iptables 32bit compat interface isn't compiled in by default
    anymore.
 
  - Move ip(6)tables builtin icmp matches to the udptcp one.
    This has the advantage that icmp/icmpv6 match doesn't load the
    iptables/ip6tables modules anymore when iptables-nft is used.
 
  - Extended netlink error report for netdevice in flowtables and
    netdev/chains. Allow for incrementally add/delete devices to netdev
    basechain. Allow to create netdev chain without device.
 
 Driver API
 ----------
 
  - Remove redundant Device Control Error Reporting Enable, as PCI core
    has already error reporting enabled at enumeration time.
 
  - Move Multicast DB netlink handlers to core, allowing devices other
    then bridge to use them.
 
  - Allow the page_pool to directly recycle the pages from safely
    localized NAPI.
 
  - Implement lockless TX queue stop/wake combo macros, allowing for
    further code de-duplication and sanitization.
 
  - Add YNL support for user headers and struct attrs.
 
  - Add partial YNL specification for devlink.
 
  - Add partial YNL specification for ethtool.
 
  - Add tc-mqprio and tc-taprio support for preemptible traffic classes.
 
  - Add tx push buf len param to ethtool, specifies the maximum number
    of bytes of a transmitted packet a driver can push directly to the
    underlying device.
 
  - Add basic LED support for switch/phy.
 
  - Add NAPI documentation, stop relaying on external links.
 
  - Convert dsa_master_ioctl() to netdev notifier. This is a preparatory
    work to make the hardware timestamping layer selectable by user
    space.
 
  - Add transceiver support and improve the error messages for CAN-FD
    controllers.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - AMD/Pensando core device support
    - MediaTek MT7981 SoC
    - MediaTek MT7988 SoC
    - Broadcom BCM53134 embedded switch
    - Texas Instruments CPSW9G ethernet switch
    - Qualcomm EMAC3 DWMAC ethernet
    - StarFive JH7110 SoC
    - NXP CBTX ethernet PHY
 
  - WiFi:
    - Apple M1 Pro/Max devices
    - RealTek rtl8710bu/rtl8188gu
    - RealTek rtl8822bs, rtl8822cs and rtl8821cs SDIO chipset
 
  - Bluetooth:
    - Realtek RTL8821CS, RTL8851B, RTL8852BS
    - Mediatek MT7663, MT7922
    - NXP w8997
    - Actions Semi ATS2851
    - QTI WCN6855
    - Marvell 88W8997
 
  - Can:
    - STMicroelectronics bxcan stm32f429
 
 Drivers
 -------
  - Ethernet NICs:
    - Intel (1G, icg):
      - add tracking and reporting of QBV config errors.
      - add support for configuring max SDU for each Tx queue.
    - Intel (100G, ice):
      - refactor mailbox overflow detection to support Scalable IOV
      - GNSS interface optimization
    - Intel (i40e):
      - support XDP multi-buffer
    - nVidia/Mellanox:
      - add the support for linux bridge multicast offload
      - enable TC offload for egress and engress MACVLAN over bond
      - add support for VxLAN GBP encap/decap flows offload
      - extend packet offload to fully support libreswan
      - support tunnel mode in mlx5 IPsec packet offload
      - extend XDP multi-buffer support
      - support MACsec VLAN offload
      - add support for dynamic msix vectors allocation
      - drop RX page_cache and fully use page_pool
      - implement thermal zone to report NIC temperature
    - Netronome/Corigine:
      - add support for multi-zone conntrack offload
    - Solarflare/Xilinx:
      - support offloading TC VLAN push/pop actions to the MAE
      - support TC decap rules
      - support unicast PTP
 
  - Other NICs:
    - Broadcom (bnxt): enforce software based freq adjustments only
 		on shared PHC NIC
    - RealTek (r8169): refactor to addess ASPM issues during NAPI poll.
    - Micrel (lan8841): add support for PTP_PF_PEROUT
    - Cadence (macb): enable PTP unicast
    - Engleder (tsnep): add XDP socket zero-copy support
    - virtio-net: implement exact header length guest feature
    - veth: add page_pool support for page recycling
    - vxlan: add MDB data path support
    - gve: add XDP support for GQI-QPL format
    - geneve: accept every ethertype
    - macvlan: allow some packets to bypass broadcast queue
    - mana: add support for jumbo frame
 
  - Ethernet high-speed switches:
    - Microchip (sparx5): Add support for TC flower templates.
 
  - Ethernet embedded switches:
    - Broadcom (b54):
      - configure 6318 and 63268 RGMII ports
    - Marvell (mv88e6xxx):
      - faster C45 bus scan
    - Microchip:
      - lan966x:
        - add support for IS1 VCAP
        - better TX/RX from/to CPU performances
      - ksz9477: add ETS Qdisc support
      - ksz8: enhance static MAC table operations and error handling
      - sama7g5: add PTP capability
    - NXP (ocelot):
      - add support for external ports
      - add support for preemptible traffic classes
    - Texas Instruments:
      - add CPSWxG SGMII support for J7200 and J721E
 
  - Intel WiFi (iwlwifi):
    - preparation for Wi-Fi 7 EHT and multi-link support
    - EHT (Wi-Fi 7) sniffer support
    - hardware timestamping support for some devices/firwmares
    - TX beacon protection on newer hardware
 
  - Qualcomm 802.11ax WiFi (ath11k):
    - MU-MIMO parameters support
    - ack signal support for management packets
 
  - RealTek WiFi (rtw88):
    - SDIO bus support
    - better support for some SDIO devices
      (e.g. MAC address from efuse)
 
  - RealTek WiFi (rtw89):
    - HW scan support for 8852b
    - better support for 6 GHz scanning
    - support for various newer firmware APIs
    - framework firmware backwards compatibility
 
  - MediaTek WiFi (mt76):
    - P2P support
    - mesh A-MSDU support
    - EHT (Wi-Fi 7) support
    - coredump support
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmRI/mUSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkgO0QAJGxpuN67YgYV0BIM+/atWKEEexJYG7B
 9MMpU4jMO3EW/pUS5t7VRsBLUybLYVPmqCZoHodObDfnu59jiPOegb6SikJv/ZwJ
 Zw62PVk5MvDnQjlu4e6kDcGwkplteN08TlgI+a49BUTedpdFitrxHAYGW8f2fRO6
 cK2XSld+ZucMoym5vRwf8yWS1BwdxnslPMxDJ+/8ZbWBZv44qAnG2vMB/kIx7ObC
 Vel/4m6MzTwVsLYBsRvcwMVbNNlZ9GuhztlTzEbfGA4ZhTadIAMgb5VTWXB84Ws7
 Aic5wTdli+q+x6/2cxhbyeoVuB9HHObYmLBAciGg4GNljP5rnQBY3X3+KVZ/x9TI
 HQB7CmhxmAZVrO9pLARFV+ECrMTH2/dy3NyrZ7uYQ3WPOXJi8hJZjOTO/eeEGL7C
 eTjdz0dZBWIBK2gON/6s4nExXVQUTEF2ZsPi52jTTClKjfe5pz/ddeFQIWaY1DTm
 pInEiWPAvd28JyiFmhFNHsuIBCjX/Zqe2JuMfMBeBibDAC09o/OGdKJYUI15AiRf
 F46Pdb7use/puqfrYW44kSAfaPYoBiE+hj1RdeQfen35xD9HVE4vdnLNeuhRlFF9
 aQfyIRHYQofkumRDr5f8JEY66cl9NiKQ4IVW1xxQfYDNdC6wQqREPG1md7rJVMrJ
 vP7ugFnttneg
 =ITVa
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Paolo Abeni:
 "Core:

   - Introduce a config option to tweak MAX_SKB_FRAGS. Increasing the
     default value allows for better BIG TCP performances

   - Reduce compound page head access for zero-copy data transfers

   - RPS/RFS improvements, avoiding unneeded NET_RX_SOFTIRQ when
     possible

   - Threaded NAPI improvements, adding defer skb free support and
     unneeded softirq avoidance

   - Address dst_entry reference count scalability issues, via false
     sharing avoidance and optimize refcount tracking

   - Add lockless accesses annotation to sk_err[_soft]

   - Optimize again the skb struct layout

   - Extends the skb drop reasons to make it usable by multiple
     subsystems

   - Better const qualifier awareness for socket casts

  BPF:

   - Add skb and XDP typed dynptrs which allow BPF programs for more
     ergonomic and less brittle iteration through data and
     variable-sized accesses

   - Add a new BPF netfilter program type and minimal support to hook
     BPF programs to netfilter hooks such as prerouting or forward

   - Add more precise memory usage reporting for all BPF map types

   - Adds support for using {FOU,GUE} encap with an ipip device
     operating in collect_md mode and add a set of BPF kfuncs for
     controlling encap params

   - Allow BPF programs to detect at load time whether a particular
     kfunc exists or not, and also add support for this in light
     skeleton

   - Bigger batch of BPF verifier improvements to prepare for upcoming
     BPF open-coded iterators allowing for less restrictive looping
     capabilities

   - Rework RCU enforcement in the verifier, add kptr_rcu and enforce
     BPF programs to NULL-check before passing such pointers into kfunc

   - Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and
     in local storage maps

   - Enable RCU semantics for task BPF kptrs and allow referenced kptr
     tasks to be stored in BPF maps

   - Add support for refcounted local kptrs to the verifier for allowing
     shared ownership, useful for adding a node to both the BPF list and
     rbtree

   - Add BPF verifier support for ST instructions in
     convert_ctx_access() which will help new -mcpu=v4 clang flag to
     start emitting them

   - Add ARM32 USDT support to libbpf

   - Improve bpftool's visual program dump which produces the control
     flow graph in a DOT format by adding C source inline annotations

  Protocols:

   - IPv4: Allow adding to IPv4 address a 'protocol' tag. Such value
     indicates the provenance of the IP address

   - IPv6: optimize route lookup, dropping unneeded R/W lock acquisition

   - Add the handshake upcall mechanism, allowing the user-space to
     implement generic TLS handshake on kernel's behalf

   - Bridge: support per-{Port, VLAN} neighbor suppression, increasing
     resilience to nodes failures

   - SCTP: add support for Fair Capacity and Weighted Fair Queueing
     schedulers

   - MPTCP: delay first subflow allocation up to its first usage. This
     will allow for later better LSM interaction

   - xfrm: Remove inner/outer modes from input/output path. These are
     not needed anymore

   - WiFi:
      - reduced neighbor report (RNR) handling for AP mode
      - HW timestamping support
      - support for randomized auth/deauth TA for PASN privacy
      - per-link debugfs for multi-link
      - TC offload support for mac80211 drivers
      - mac80211 mesh fast-xmit and fast-rx support
      - enable Wi-Fi 7 (EHT) mesh support

  Netfilter:

   - Add nf_tables 'brouting' support, to force a packet to be routed
     instead of being bridged

   - Update bridge netfilter and ovs conntrack helpers to handle IPv6
     Jumbo packets properly, i.e. fetch the packet length from
     hop-by-hop extension header. This is needed for BIT TCP support

   - The iptables 32bit compat interface isn't compiled in by default
     anymore

   - Move ip(6)tables builtin icmp matches to the udptcp one. This has
     the advantage that icmp/icmpv6 match doesn't load the
     iptables/ip6tables modules anymore when iptables-nft is used

   - Extended netlink error report for netdevice in flowtables and
     netdev/chains. Allow for incrementally add/delete devices to netdev
     basechain. Allow to create netdev chain without device

  Driver API:

   - Remove redundant Device Control Error Reporting Enable, as PCI core
     has already error reporting enabled at enumeration time

   - Move Multicast DB netlink handlers to core, allowing devices other
     then bridge to use them

   - Allow the page_pool to directly recycle the pages from safely
     localized NAPI

   - Implement lockless TX queue stop/wake combo macros, allowing for
     further code de-duplication and sanitization

   - Add YNL support for user headers and struct attrs

   - Add partial YNL specification for devlink

   - Add partial YNL specification for ethtool

   - Add tc-mqprio and tc-taprio support for preemptible traffic classes

   - Add tx push buf len param to ethtool, specifies the maximum number
     of bytes of a transmitted packet a driver can push directly to the
     underlying device

   - Add basic LED support for switch/phy

   - Add NAPI documentation, stop relaying on external links

   - Convert dsa_master_ioctl() to netdev notifier. This is a
     preparatory work to make the hardware timestamping layer selectable
     by user space

   - Add transceiver support and improve the error messages for CAN-FD
     controllers

  New hardware / drivers:

   - Ethernet:
      - AMD/Pensando core device support
      - MediaTek MT7981 SoC
      - MediaTek MT7988 SoC
      - Broadcom BCM53134 embedded switch
      - Texas Instruments CPSW9G ethernet switch
      - Qualcomm EMAC3 DWMAC ethernet
      - StarFive JH7110 SoC
      - NXP CBTX ethernet PHY

   - WiFi:
      - Apple M1 Pro/Max devices
      - RealTek rtl8710bu/rtl8188gu
      - RealTek rtl8822bs, rtl8822cs and rtl8821cs SDIO chipset

   - Bluetooth:
      - Realtek RTL8821CS, RTL8851B, RTL8852BS
      - Mediatek MT7663, MT7922
      - NXP w8997
      - Actions Semi ATS2851
      - QTI WCN6855
      - Marvell 88W8997

   - Can:
      - STMicroelectronics bxcan stm32f429

  Drivers:

   - Ethernet NICs:
      - Intel (1G, icg):
         - add tracking and reporting of QBV config errors
         - add support for configuring max SDU for each Tx queue
      - Intel (100G, ice):
         - refactor mailbox overflow detection to support Scalable IOV
         - GNSS interface optimization
      - Intel (i40e):
         - support XDP multi-buffer
      - nVidia/Mellanox:
         - add the support for linux bridge multicast offload
         - enable TC offload for egress and engress MACVLAN over bond
         - add support for VxLAN GBP encap/decap flows offload
         - extend packet offload to fully support libreswan
         - support tunnel mode in mlx5 IPsec packet offload
         - extend XDP multi-buffer support
         - support MACsec VLAN offload
         - add support for dynamic msix vectors allocation
         - drop RX page_cache and fully use page_pool
         - implement thermal zone to report NIC temperature
      - Netronome/Corigine:
         - add support for multi-zone conntrack offload
      - Solarflare/Xilinx:
         - support offloading TC VLAN push/pop actions to the MAE
         - support TC decap rules
         - support unicast PTP

   - Other NICs:
      - Broadcom (bnxt): enforce software based freq adjustments only on
        shared PHC NIC
      - RealTek (r8169): refactor to addess ASPM issues during NAPI poll
      - Micrel (lan8841): add support for PTP_PF_PEROUT
      - Cadence (macb): enable PTP unicast
      - Engleder (tsnep): add XDP socket zero-copy support
      - virtio-net: implement exact header length guest feature
      - veth: add page_pool support for page recycling
      - vxlan: add MDB data path support
      - gve: add XDP support for GQI-QPL format
      - geneve: accept every ethertype
      - macvlan: allow some packets to bypass broadcast queue
      - mana: add support for jumbo frame

   - Ethernet high-speed switches:
      - Microchip (sparx5): Add support for TC flower templates

   - Ethernet embedded switches:
      - Broadcom (b54):
         - configure 6318 and 63268 RGMII ports
      - Marvell (mv88e6xxx):
         - faster C45 bus scan
      - Microchip:
         - lan966x:
            - add support for IS1 VCAP
            - better TX/RX from/to CPU performances
         - ksz9477: add ETS Qdisc support
         - ksz8: enhance static MAC table operations and error handling
         - sama7g5: add PTP capability
      - NXP (ocelot):
         - add support for external ports
         - add support for preemptible traffic classes
      - Texas Instruments:
         - add CPSWxG SGMII support for J7200 and J721E

   - Intel WiFi (iwlwifi):
      - preparation for Wi-Fi 7 EHT and multi-link support
      - EHT (Wi-Fi 7) sniffer support
      - hardware timestamping support for some devices/firwmares
      - TX beacon protection on newer hardware

   - Qualcomm 802.11ax WiFi (ath11k):
      - MU-MIMO parameters support
      - ack signal support for management packets

   - RealTek WiFi (rtw88):
      - SDIO bus support
      - better support for some SDIO devices (e.g. MAC address from
        efuse)

   - RealTek WiFi (rtw89):
      - HW scan support for 8852b
      - better support for 6 GHz scanning
      - support for various newer firmware APIs
      - framework firmware backwards compatibility

   - MediaTek WiFi (mt76):
      - P2P support
      - mesh A-MSDU support
      - EHT (Wi-Fi 7) support
      - coredump support"

* tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2078 commits)
  net: phy: hide the PHYLIB_LEDS knob
  net: phy: marvell-88x2222: remove unnecessary (void*) conversions
  tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.
  net: amd: Fix link leak when verifying config failed
  net: phy: marvell: Fix inconsistent indenting in led_blink_set
  lan966x: Don't use xdp_frame when action is XDP_TX
  tsnep: Add XDP socket zero-copy TX support
  tsnep: Add XDP socket zero-copy RX support
  tsnep: Move skb receive action to separate function
  tsnep: Add functions for queue enable/disable
  tsnep: Rework TX/RX queue initialization
  tsnep: Replace modulo operation with mask
  net: phy: dp83867: Add led_brightness_set support
  net: phy: Fix reading LED reg property
  drivers: nfc: nfcsim: remove return value check of `dev_dir`
  net: phy: dp83867: Remove unnecessary (void*) conversions
  net: ethtool: coalesce: try to make user settings stick twice
  net: mana: Check if netdev/napi_alloc_frag returns single page
  net: mana: Rename mana_refill_rxoob and remove some empty lines
  net: veth: add page_pool stats
  ...
2023-04-26 16:07:23 -07:00
Linus Torvalds
df45da57cb arm64 updates for 6.4
ACPI:
 	* Improve error reporting when failing to manage SDEI on AGDI device
 	  removal
 
 Assembly routines:
 	* Improve register constraints so that the compiler can make use of
 	  the zero register instead of moving an immediate #0 into a GPR
 
 	* Allow the compiler to allocate the registers used for CAS
 	  instructions
 
 CPU features and system registers:
 	* Cleanups to the way in which CPU features are identified from the
 	  ID register fields
 
 	* Extend system register definition generation to handle Enum types
 	  when defining shared register fields
 
 	* Generate definitions for new _EL2 registers and add new fields
 	  for ID_AA64PFR1_EL1
 
 	* Allow SVE to be disabled separately from SME on the kernel
 	  command-line
 
 Tracing:
 	* Support for "direct calls" in ftrace, which enables BPF tracing
 	  for arm64
 
 Kdump:
 	* Don't bother unmapping the crashkernel from the linear mapping,
 	  which then allows us to use huge (block) mappings and reduce
 	  TLB pressure when a crashkernel is loaded.
 
 Memory management:
 	* Try again to remove data cache invalidation from the coherent DMA
 	  allocation path
 
 	* Simplify the fixmap code by mapping at page granularity
 
 	* Allow the kfence pool to be allocated early, preventing the rest
 	  of the linear mapping from being forced to page granularity
 
 Perf and PMU:
 	* Move CPU PMU code out to drivers/perf/ where it can be reused
 	  by the 32-bit ARM architecture when running on ARMv8 CPUs
 
 	* Fix race between CPU PMU probing and pKVM host de-privilege
 
 	* Add support for Apple M2 CPU PMU
 
 	* Adjust the generic PERF_COUNT_HW_BRANCH_INSTRUCTIONS event
 	  dynamically, depending on what the CPU actually supports
 
 	* Minor fixes and cleanups to system PMU drivers
 
 Stack tracing:
 	* Use the XPACLRI instruction to strip PAC from pointers, rather
 	  than rolling our own function in C
 
 	* Remove redundant PAC removal for toolchains that handle this in
 	  their builtins
 
 	* Make backtracing more resilient in the face of instrumentation
 
 Miscellaneous:
 	* Fix single-step with KGDB
 
 	* Remove harmless warning when 'nokaslr' is passed on the kernel
 	  command-line
 
 	* Minor fixes and cleanups across the board
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmRChcwQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNCgBCADFvkYY9ESztSnd3EpiMbbAzgRCQBiA5H7U
 F2Wc+hIWgeAeUEttSH22+F16r6Jb0gbaDvsuhtN2W/rwQhKNbCU0MaUME05MPmg2
 AOp+RZb2vdT5i5S5dC6ZM6G3T6u9O78LBWv2JWBdd6RIybamEn+RL00ep2WAduH7
 n1FgTbsKgnbScD2qd4K1ejZ1W/BQMwYulkNpyTsmCIijXM12lkzFlxWnMtky3uhR
 POpawcIZzXvWI02QAX+SIdynGChQV3VP+dh9GuFbt7ASigDEhgunvfUYhZNSaqf4
 +/q0O8toCtmQJBUhF0DEDSB5T8SOz5v9CKxKuwfaX6Trq0ixFQpZ
 =78L9
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "ACPI:

   - Improve error reporting when failing to manage SDEI on AGDI device
     removal

  Assembly routines:

   - Improve register constraints so that the compiler can make use of
     the zero register instead of moving an immediate #0 into a GPR

   - Allow the compiler to allocate the registers used for CAS
     instructions

  CPU features and system registers:

   - Cleanups to the way in which CPU features are identified from the
     ID register fields

   - Extend system register definition generation to handle Enum types
     when defining shared register fields

   - Generate definitions for new _EL2 registers and add new fields for
     ID_AA64PFR1_EL1

   - Allow SVE to be disabled separately from SME on the kernel
     command-line

  Tracing:

   - Support for "direct calls" in ftrace, which enables BPF tracing for
     arm64

  Kdump:

   - Don't bother unmapping the crashkernel from the linear mapping,
     which then allows us to use huge (block) mappings and reduce TLB
     pressure when a crashkernel is loaded.

  Memory management:

   - Try again to remove data cache invalidation from the coherent DMA
     allocation path

   - Simplify the fixmap code by mapping at page granularity

   - Allow the kfence pool to be allocated early, preventing the rest of
     the linear mapping from being forced to page granularity

  Perf and PMU:

   - Move CPU PMU code out to drivers/perf/ where it can be reused by
     the 32-bit ARM architecture when running on ARMv8 CPUs

   - Fix race between CPU PMU probing and pKVM host de-privilege

   - Add support for Apple M2 CPU PMU

   - Adjust the generic PERF_COUNT_HW_BRANCH_INSTRUCTIONS event
     dynamically, depending on what the CPU actually supports

   - Minor fixes and cleanups to system PMU drivers

  Stack tracing:

   - Use the XPACLRI instruction to strip PAC from pointers, rather than
     rolling our own function in C

   - Remove redundant PAC removal for toolchains that handle this in
     their builtins

   - Make backtracing more resilient in the face of instrumentation

  Miscellaneous:

   - Fix single-step with KGDB

   - Remove harmless warning when 'nokaslr' is passed on the kernel
     command-line

   - Minor fixes and cleanups across the board"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (72 commits)
  KVM: arm64: Ensure CPU PMU probes before pKVM host de-privilege
  arm64: kexec: include reboot.h
  arm64: delete dead code in this_cpu_set_vectors()
  arm64/cpufeature: Use helper macro to specify ID register for capabilites
  drivers/perf: hisi: add NULL check for name
  drivers/perf: hisi: Remove redundant initialized of pmu->name
  arm64/cpufeature: Consistently use symbolic constants for min_field_value
  arm64/cpufeature: Pull out helper for CPUID register definitions
  arm64/sysreg: Convert HFGITR_EL2 to automatic generation
  ACPI: AGDI: Improve error reporting for problems during .remove()
  arm64: kernel: Fix kernel warning when nokaslr is passed to commandline
  perf/arm-cmn: Fix port detection for CMN-700
  arm64: kgdb: Set PSTATE.SS to 1 to re-enable single-step
  arm64: move PAC masks to <asm/pointer_auth.h>
  arm64: use XPACLRI to strip PAC
  arm64: avoid redundant PAC stripping in __builtin_return_address()
  arm64/sme: Fix some comments of ARM SME
  arm64/signal: Alloc tpidr2 sigframe after checking system_supports_tpidr2()
  arm64/signal: Use system_supports_tpidr2() to check TPIDR2
  arm64/idreg: Don't disable SME when disabling SVE
  ...
2023-04-25 12:39:01 -07:00
Linus Torvalds
ef36b9afc2 fget() to fdget() conversions
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZEYCQAAKCRBZ7Krx/gZQ
 64FdAQDZ2hTDyZEWPt486dWYPYpiKyaGFXSXDGo7wgP0fiwxXQEA/mROKb6JqYw6
 27mZ9A7qluT8r3AfTTQ0D+Yse/dr4AM=
 =GA9W
 -----END PGP SIGNATURE-----

Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs fget updates from Al Viro:
 "fget() to fdget() conversions"

* tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fuse_dev_ioctl(): switch to fdget()
  cgroup_get_from_fd(): switch to fdget_raw()
  bpf: switch to fdget_raw()
  build_mount_idmapped(): switch to fdget()
  kill the last remaining user of proc_ns_fget()
  SVM-SEV: convert the rest of fget() uses to fdget() in there
  convert sgx_set_attribute() to fdget()/fdput()
  convert setns(2) to fdget()/fdput()
2023-04-24 19:14:20 -07:00
Yafang Shao
a0c109dcaf bpf: Add __rcu_read_{lock,unlock} into btf id deny list
The tracing recursion prevention mechanism must be protected by rcu, that
leaves __rcu_read_{lock,unlock} unprotected by this mechanism. If we trace
them, the recursion will happen. Let's add them into the btf id deny list.

When CONFIG_PREEMPT_RCU is enabled, it can be reproduced with a simple bpf
program as such:
  SEC("fentry/__rcu_read_lock")
  int fentry_run()
  {
      return 0;
  }

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20230424161104.3737-2-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-24 14:16:01 -07:00
Dave Marchevsky
7deca5eae8 bpf: Disable bpf_refcount_acquire kfunc calls until race conditions are fixed
As reported by Kumar in [0], the shared ownership implementation for BPF
programs has some race conditions which need to be addressed before it
can safely be used. This patch does so in a minimal way instead of
ripping out shared ownership entirely, as proper fixes for the issues
raised will follow ASAP, at which point this patch's commit can be
reverted to re-enable shared ownership.

The patch removes the ability to call bpf_refcount_acquire_impl from BPF
programs. Programs can only bump refcount and obtain a new owning
reference using this kfunc, so removing the ability to call it
effectively disables shared ownership.

Instead of changing success / failure expectations for
bpf_refcount-related selftests, this patch just disables them from
running for now.

  [0]: https://lore.kernel.org/bpf/d7hyspcow5wtjcmw4fugdgyp3fwhljwuscp3xyut5qnwivyeru@ysdq543otzv2/

Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230424204321.2680232-1-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-24 14:02:11 -07:00
Jakub Kicinski
9a82cdc28f bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZELn8wAKCRDbK58LschI
 g1khAQC1nmXPuKjM4EAfFK8Ysb3KoF8ADmpE97n+/HEDydCagwD/bX0+NABR75Nh
 ueGcoU1TcfcbshDzrH0s+C95owZDZw4=
 =BeZM
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-04-21

We've added 71 non-merge commits during the last 8 day(s) which contain
a total of 116 files changed, 13397 insertions(+), 8896 deletions(-).

The main changes are:

1) Add a new BPF netfilter program type and minimal support to hook
   BPF programs to netfilter hooks such as prerouting or forward,
   from Florian Westphal.

2) Fix race between btf_put and btf_idr walk which caused a deadlock,
   from Alexei Starovoitov.

3) Second big batch to migrate test_verifier unit tests into test_progs
   for ease of readability and debugging, from Eduard Zingerman.

4) Add support for refcounted local kptrs to the verifier for allowing
   shared ownership, useful for adding a node to both the BPF list and
   rbtree, from Dave Marchevsky.

5) Migrate bpf_for(), bpf_for_each() and bpf_repeat() macros from BPF
  selftests into libbpf-provided bpf_helpers.h header and improve
  kfunc handling, from Andrii Nakryiko.

6) Support 64-bit pointers to kfuncs needed for archs like s390x,
   from Ilya Leoshkevich.

7) Support BPF progs under getsockopt with a NULL optval,
   from Stanislav Fomichev.

8) Improve verifier u32 scalar equality checking in order to enable
   LLVM transformations which earlier had to be disabled specifically
   for BPF backend, from Yonghong Song.

9) Extend bpftool's struct_ops object loading to support links,
   from Kui-Feng Lee.

10) Add xsk selftest follow-up fixes for hugepage allocated umem,
    from Magnus Karlsson.

11) Support BPF redirects from tc BPF to ifb devices,
    from Daniel Borkmann.

12) Add BPF support for integer type when accessing variable length
    arrays, from Feng Zhou.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (71 commits)
  selftests/bpf: verifier/value_ptr_arith converted to inline assembly
  selftests/bpf: verifier/value_illegal_alu converted to inline assembly
  selftests/bpf: verifier/unpriv converted to inline assembly
  selftests/bpf: verifier/subreg converted to inline assembly
  selftests/bpf: verifier/spin_lock converted to inline assembly
  selftests/bpf: verifier/sock converted to inline assembly
  selftests/bpf: verifier/search_pruning converted to inline assembly
  selftests/bpf: verifier/runtime_jit converted to inline assembly
  selftests/bpf: verifier/regalloc converted to inline assembly
  selftests/bpf: verifier/ref_tracking converted to inline assembly
  selftests/bpf: verifier/map_ptr_mixing converted to inline assembly
  selftests/bpf: verifier/map_in_map converted to inline assembly
  selftests/bpf: verifier/lwt converted to inline assembly
  selftests/bpf: verifier/loops1 converted to inline assembly
  selftests/bpf: verifier/jeq_infer_not_null converted to inline assembly
  selftests/bpf: verifier/direct_packet_access converted to inline assembly
  selftests/bpf: verifier/d_path converted to inline assembly
  selftests/bpf: verifier/ctx converted to inline assembly
  selftests/bpf: verifier/btf_ctx_access converted to inline assembly
  selftests/bpf: verifier/bpf_get_stack converted to inline assembly
  ...
====================

Link: https://lore.kernel.org/r/20230421211035.9111-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-21 20:32:37 -07:00
Florian Westphal
fd9c663b9a bpf: minimal support for programs hooked into netfilter framework
This adds minimal support for BPF_PROG_TYPE_NETFILTER bpf programs
that will be invoked via the NF_HOOK() points in the ip stack.

Invocation incurs an indirect call.  This is not a necessity: Its
possible to add 'DEFINE_BPF_DISPATCHER(nf_progs)' and handle the
program invocation with the same method already done for xdp progs.

This isn't done here to keep the size of this chunk down.

Verifier restricts verdicts to either DROP or ACCEPT.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20230421170300.24115-3-fw@strlen.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-21 11:34:14 -07:00
Florian Westphal
84601d6ee6 bpf: add bpf_link support for BPF_NETFILTER programs
Add bpf_link support skeleton.  To keep this reviewable, no bpf program
can be invoked yet, if a program is attached only a c-stub is called and
not the actual bpf program.

Defaults to 'y' if both netfilter and bpf syscall are enabled in kconfig.

Uapi example usage:
	union bpf_attr attr = { };

	attr.link_create.prog_fd = progfd;
	attr.link_create.attach_type = 0; /* unused */
	attr.link_create.netfilter.pf = PF_INET;
	attr.link_create.netfilter.hooknum = NF_INET_LOCAL_IN;
	attr.link_create.netfilter.priority = -128;

	err = bpf(BPF_LINK_CREATE, &attr, sizeof(attr));

... this would attach progfd to ipv4:input hook.

Such hook gets removed automatically if the calling program exits.

BPF_NETFILTER program invocation is added in followup change.

NF_HOOK_OP_BPF enum will eventually be read from nfnetlink_hook, it
allows to tell userspace which program is attached at the given hook
when user runs 'nft hook list' command rather than just the priority
and not-very-helpful 'this hook runs a bpf prog but I can't tell which
one'.

Will also be used to disallow registration of two bpf programs with
same priority in a followup patch.

v4: arm32 cmpxchg only supports 32bit operand
    s/prio/priority/
v3: restrict prog attachment to ip/ip6 for now, lets lift restrictions if
    more use cases pop up (arptables, ebtables, netdev ingress/egress etc).

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20230421170300.24115-2-fw@strlen.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-21 11:34:14 -07:00
Stanislav Fomichev
00e74ae086 bpf: Don't EFAULT for getsockopt with optval=NULL
Some socket options do getsockopt with optval=NULL to estimate the size
of the final buffer (which is returned via optlen). This breaks BPF
getsockopt assumptions about permitted optval buffer size. Let's enforce
these assumptions only when non-NULL optval is provided.

Fixes: 0d01da6afc ("bpf: implement getsockopt and setsockopt hooks")
Reported-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/ZD7Js4fj5YyI2oLd@google.com/T/#mb68daf700f87a9244a15d01d00c3f0e5b08f49f7
Link: https://lore.kernel.org/bpf/20230418225343.553806-2-sdf@google.com
2023-04-21 17:09:53 +02:00
Dave Marchevsky
4ab07209d5 bpf: Fix bpf_refcount_acquire's refcount_t address calculation
When calculating the address of the refcount_t struct within a local
kptr, bpf_refcount_acquire_impl should add refcount_off bytes to the
address of the local kptr. Due to some missing parens, the function is
incorrectly adding sizeof(refcount_t) * refcount_off bytes. This patch
fixes the calculation.

Due to the incorrect calculation, bpf_refcount_acquire_impl was trying
to refcount_inc some memory well past the end of local kptrs, resulting
in kasan and refcount complaints, as reported in [0]. In that thread,
Florian and Eduard discovered that bpf selftests written in the new
style - with __success and an expected __retval, specifically - were
not actually being run. As a result, selftests added in bpf_refcount
series weren't really exercising this behavior, and thus didn't unearth
the bug.

With this fixed behavior it's safe to revert commit 7c4b96c000
("selftests/bpf: disable program test run for progs/refcounted_kptr.c"),
this patch does so.

  [0] https://lore.kernel.org/bpf/ZEEp+j22imoN6rn9@strlen.de/

Fixes: 7c50b1cb76 ("bpf: Add bpf_refcount_acquire kfunc")
Reported-by: Florian Westphal <fw@strlen.de>
Reported-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20230421074431.3548349-1-davemarchevsky@fb.com
2023-04-21 16:31:37 +02:00
Alexei Starovoitov
acf1c3d68e bpf: Fix race between btf_put and btf_idr walk.
Florian and Eduard reported hard dead lock:
[   58.433327]  _raw_spin_lock_irqsave+0x40/0x50
[   58.433334]  btf_put+0x43/0x90
[   58.433338]  bpf_find_btf_id+0x157/0x240
[   58.433353]  btf_parse_fields+0x921/0x11c0

This happens since btf->refcount can be 1 at the time of btf_put() and
btf_put() will call btf_free_id() which will try to grab btf_idr_lock
and will dead lock.
Avoid the issue by doing btf_put() without locking.

Fixes: 3d78417b60 ("bpf: Add bpf_btf_find_by_name_kind() helper.")
Fixes: 1e89106da2 ("bpf: Add bpf_core_add_cands() and wire it into bpf_core_apply_relo_insn().")
Reported-by: Florian Westphal <fw@strlen.de>
Reported-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20230421014901.70908-1-alexei.starovoitov@gmail.com
2023-04-21 16:27:56 +02:00
Al Viro
1d0027dc9a bpf: switch to fdget_raw()
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-04-20 22:55:35 -04:00
Jakub Kicinski
681c5b51dc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Adjacent changes:

net/mptcp/protocol.h
  63740448a3 ("mptcp: fix accept vs worker race")
  2a6a870e44 ("mptcp: stops worker on unaccepted sockets at listener close")
  ddb1a072f8 ("mptcp: move first subflow allocation at mpc access time")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-20 16:29:51 -07:00
Feng Zhou
2569c7b872 bpf: support access variable length array of integer type
After this commit:
bpf: Support variable length array in tracing programs (9c5f8a1008)
Trace programs can access variable length array, but for structure
type. This patch adds support for integer type.

Example:
Hook load_balance
struct sched_domain {
	...
	unsigned long span[];
}

The access: sd->span[0].

Co-developed-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
Link: https://lore.kernel.org/r/20230420032735.27760-2-zhoufeng.zf@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-19 21:29:39 -07:00
Daniel Borkmann
71b547f561 bpf: Fix incorrect verifier pruning due to missing register precision taints
Juan Jose et al reported an issue found via fuzzing where the verifier's
pruning logic prematurely marks a program path as safe.

Consider the following program:

   0: (b7) r6 = 1024
   1: (b7) r7 = 0
   2: (b7) r8 = 0
   3: (b7) r9 = -2147483648
   4: (97) r6 %= 1025
   5: (05) goto pc+0
   6: (bd) if r6 <= r9 goto pc+2
   7: (97) r6 %= 1
   8: (b7) r9 = 0
   9: (bd) if r6 <= r9 goto pc+1
  10: (b7) r6 = 0
  11: (b7) r0 = 0
  12: (63) *(u32 *)(r10 -4) = r0
  13: (18) r4 = 0xffff888103693400 // map_ptr(ks=4,vs=48)
  15: (bf) r1 = r4
  16: (bf) r2 = r10
  17: (07) r2 += -4
  18: (85) call bpf_map_lookup_elem#1
  19: (55) if r0 != 0x0 goto pc+1
  20: (95) exit
  21: (77) r6 >>= 10
  22: (27) r6 *= 8192
  23: (bf) r1 = r0
  24: (0f) r0 += r6
  25: (79) r3 = *(u64 *)(r0 +0)
  26: (7b) *(u64 *)(r1 +0) = r3
  27: (95) exit

The verifier treats this as safe, leading to oob read/write access due
to an incorrect verifier conclusion:

  func#0 @0
  0: R1=ctx(off=0,imm=0) R10=fp0
  0: (b7) r6 = 1024                     ; R6_w=1024
  1: (b7) r7 = 0                        ; R7_w=0
  2: (b7) r8 = 0                        ; R8_w=0
  3: (b7) r9 = -2147483648              ; R9_w=-2147483648
  4: (97) r6 %= 1025                    ; R6_w=scalar()
  5: (05) goto pc+0
  6: (bd) if r6 <= r9 goto pc+2         ; R6_w=scalar(umin=18446744071562067969,var_off=(0xffffffff00000000; 0xffffffff)) R9_w=-2147483648
  7: (97) r6 %= 1                       ; R6_w=scalar()
  8: (b7) r9 = 0                        ; R9=0
  9: (bd) if r6 <= r9 goto pc+1         ; R6=scalar(umin=1) R9=0
  10: (b7) r6 = 0                       ; R6_w=0
  11: (b7) r0 = 0                       ; R0_w=0
  12: (63) *(u32 *)(r10 -4) = r0
  last_idx 12 first_idx 9
  regs=1 stack=0 before 11: (b7) r0 = 0
  13: R0_w=0 R10=fp0 fp-8=0000????
  13: (18) r4 = 0xffff8ad3886c2a00      ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
  15: (bf) r1 = r4                      ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
  16: (bf) r2 = r10                     ; R2_w=fp0 R10=fp0
  17: (07) r2 += -4                     ; R2_w=fp-4
  18: (85) call bpf_map_lookup_elem#1   ; R0=map_value_or_null(id=1,off=0,ks=4,vs=48,imm=0)
  19: (55) if r0 != 0x0 goto pc+1       ; R0=0
  20: (95) exit

  from 19 to 21: R0=map_value(off=0,ks=4,vs=48,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0 fp-8=mmmm????
  21: (77) r6 >>= 10                    ; R6_w=0
  22: (27) r6 *= 8192                   ; R6_w=0
  23: (bf) r1 = r0                      ; R0=map_value(off=0,ks=4,vs=48,imm=0) R1_w=map_value(off=0,ks=4,vs=48,imm=0)
  24: (0f) r0 += r6
  last_idx 24 first_idx 19
  regs=40 stack=0 before 23: (bf) r1 = r0
  regs=40 stack=0 before 22: (27) r6 *= 8192
  regs=40 stack=0 before 21: (77) r6 >>= 10
  regs=40 stack=0 before 19: (55) if r0 != 0x0 goto pc+1
  parent didn't have regs=40 stack=0 marks: R0_rw=map_value_or_null(id=1,off=0,ks=4,vs=48,imm=0) R6_rw=P0 R7=0 R8=0 R9=0 R10=fp0 fp-8=mmmm????
  last_idx 18 first_idx 9
  regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1
  regs=40 stack=0 before 17: (07) r2 += -4
  regs=40 stack=0 before 16: (bf) r2 = r10
  regs=40 stack=0 before 15: (bf) r1 = r4
  regs=40 stack=0 before 13: (18) r4 = 0xffff8ad3886c2a00
  regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0
  regs=40 stack=0 before 11: (b7) r0 = 0
  regs=40 stack=0 before 10: (b7) r6 = 0
  25: (79) r3 = *(u64 *)(r0 +0)         ; R0_w=map_value(off=0,ks=4,vs=48,imm=0) R3_w=scalar()
  26: (7b) *(u64 *)(r1 +0) = r3         ; R1_w=map_value(off=0,ks=4,vs=48,imm=0) R3_w=scalar()
  27: (95) exit

  from 9 to 11: R1=ctx(off=0,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0
  11: (b7) r0 = 0                       ; R0_w=0
  12: (63) *(u32 *)(r10 -4) = r0
  last_idx 12 first_idx 11
  regs=1 stack=0 before 11: (b7) r0 = 0
  13: R0_w=0 R10=fp0 fp-8=0000????
  13: (18) r4 = 0xffff8ad3886c2a00      ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
  15: (bf) r1 = r4                      ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
  16: (bf) r2 = r10                     ; R2_w=fp0 R10=fp0
  17: (07) r2 += -4                     ; R2_w=fp-4
  18: (85) call bpf_map_lookup_elem#1
  frame 0: propagating r6
  last_idx 19 first_idx 11
  regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1
  regs=40 stack=0 before 17: (07) r2 += -4
  regs=40 stack=0 before 16: (bf) r2 = r10
  regs=40 stack=0 before 15: (bf) r1 = r4
  regs=40 stack=0 before 13: (18) r4 = 0xffff8ad3886c2a00
  regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0
  regs=40 stack=0 before 11: (b7) r0 = 0
  parent didn't have regs=40 stack=0 marks: R1=ctx(off=0,imm=0) R6_r=P0 R7=0 R8=0 R9=0 R10=fp0
  last_idx 9 first_idx 9
  regs=40 stack=0 before 9: (bd) if r6 <= r9 goto pc+1
  parent didn't have regs=40 stack=0 marks: R1=ctx(off=0,imm=0) R6_rw=Pscalar() R7_w=0 R8_w=0 R9_rw=0 R10=fp0
  last_idx 8 first_idx 0
  regs=40 stack=0 before 8: (b7) r9 = 0
  regs=40 stack=0 before 7: (97) r6 %= 1
  regs=40 stack=0 before 6: (bd) if r6 <= r9 goto pc+2
  regs=40 stack=0 before 5: (05) goto pc+0
  regs=40 stack=0 before 4: (97) r6 %= 1025
  regs=40 stack=0 before 3: (b7) r9 = -2147483648
  regs=40 stack=0 before 2: (b7) r8 = 0
  regs=40 stack=0 before 1: (b7) r7 = 0
  regs=40 stack=0 before 0: (b7) r6 = 1024
  19: safe
  frame 0: propagating r6
  last_idx 9 first_idx 0
  regs=40 stack=0 before 6: (bd) if r6 <= r9 goto pc+2
  regs=40 stack=0 before 5: (05) goto pc+0
  regs=40 stack=0 before 4: (97) r6 %= 1025
  regs=40 stack=0 before 3: (b7) r9 = -2147483648
  regs=40 stack=0 before 2: (b7) r8 = 0
  regs=40 stack=0 before 1: (b7) r7 = 0
  regs=40 stack=0 before 0: (b7) r6 = 1024

  from 6 to 9: safe
  verification time 110 usec
  stack depth 4
  processed 36 insns (limit 1000000) max_states_per_insn 0 total_states 3 peak_states 3 mark_read 2

The verifier considers this program as safe by mistakenly pruning unsafe
code paths. In the above func#0, code lines 0-10 are of interest. In line
0-3 registers r6 to r9 are initialized with known scalar values. In line 4
the register r6 is reset to an unknown scalar given the verifier does not
track modulo operations. Due to this, the verifier can also not determine
precisely which branches in line 6 and 9 are taken, therefore it needs to
explore them both.

As can be seen, the verifier starts with exploring the false/fall-through
paths first. The 'from 19 to 21' path has both r6=0 and r9=0 and the pointer
arithmetic on r0 += r6 is therefore considered safe. Given the arithmetic,
r6 is correctly marked for precision tracking where backtracking kicks in
where it walks back the current path all the way where r6 was set to 0 in
the fall-through branch.

Next, the pruning logics pops the path 'from 9 to 11' from the stack. Also
here, the state of the registers is the same, that is, r6=0 and r9=0, so
that at line 19 the path can be pruned as it is considered safe. It is
interesting to note that the conditional in line 9 turned r6 into a more
precise state, that is, in the fall-through path at the beginning of line
10, it is R6=scalar(umin=1), and in the branch-taken path (which is analyzed
here) at the beginning of line 11, r6 turned into a known const r6=0 as
r9=0 prior to that and therefore (unsigned) r6 <= 0 concludes that r6 must
be 0 (**):

  [...]                                 ; R6_w=scalar()
  9: (bd) if r6 <= r9 goto pc+1         ; R6=scalar(umin=1) R9=0
  [...]

  from 9 to 11: R1=ctx(off=0,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0
  [...]

The next path is 'from 6 to 9'. The verifier considers the old and current
state equivalent, and therefore prunes the search incorrectly. Looking into
the two states which are being compared by the pruning logic at line 9, the
old state consists of R6_rwD=Pscalar() R9_rwD=0 R10=fp0 and the new state
consists of R1=ctx(off=0,imm=0) R6_w=scalar(umax=18446744071562067968)
R7_w=0 R8_w=0 R9_w=-2147483648 R10=fp0. While r6 had the reg->precise flag
correctly set in the old state, r9 did not. Both r6'es are considered as
equivalent given the old one is a superset of the current, more precise one,
however, r9's actual values (0 vs 0x80000000) mismatch. Given the old r9
did not have reg->precise flag set, the verifier does not consider the
register as contributing to the precision state of r6, and therefore it
considered both r9 states as equivalent. However, for this specific pruned
path (which is also the actual path taken at runtime), register r6 will be
0x400 and r9 0x80000000 when reaching line 21, thus oob-accessing the map.

The purpose of precision tracking is to initially mark registers (including
spilled ones) as imprecise to help verifier's pruning logic finding equivalent
states it can then prune if they don't contribute to the program's safety
aspects. For example, if registers are used for pointer arithmetic or to pass
constant length to a helper, then the verifier sets reg->precise flag and
backtracks the BPF program instruction sequence and chain of verifier states
to ensure that the given register or stack slot including their dependencies
are marked as precisely tracked scalar. This also includes any other registers
and slots that contribute to a tracked state of given registers/stack slot.
This backtracking relies on recorded jmp_history and is able to traverse
entire chain of parent states. This process ends only when all the necessary
registers/slots and their transitive dependencies are marked as precise.

The backtrack_insn() is called from the current instruction up to the first
instruction, and its purpose is to compute a bitmask of registers and stack
slots that need precision tracking in the parent's verifier state. For example,
if a current instruction is r6 = r7, then r6 needs precision after this
instruction and r7 needs precision before this instruction, that is, in the
parent state. Hence for the latter r7 is marked and r6 unmarked.

For the class of jmp/jmp32 instructions, backtrack_insn() today only looks
at call and exit instructions and for all other conditionals the masks
remain as-is. However, in the given situation register r6 has a dependency
on r9 (as described above in **), so also that one needs to be marked for
precision tracking. In other words, if an imprecise register influences a
precise one, then the imprecise register should also be marked precise.
Meaning, in the parent state both dest and src register need to be tracked
for precision and therefore the marking must be more conservative by setting
reg->precise flag for both. The precision propagation needs to cover both
for the conditional: if the src reg was marked but not the dst reg and vice
versa.

After the fix the program is correctly rejected:

  func#0 @0
  0: R1=ctx(off=0,imm=0) R10=fp0
  0: (b7) r6 = 1024                     ; R6_w=1024
  1: (b7) r7 = 0                        ; R7_w=0
  2: (b7) r8 = 0                        ; R8_w=0
  3: (b7) r9 = -2147483648              ; R9_w=-2147483648
  4: (97) r6 %= 1025                    ; R6_w=scalar()
  5: (05) goto pc+0
  6: (bd) if r6 <= r9 goto pc+2         ; R6_w=scalar(umin=18446744071562067969,var_off=(0xffffffff80000000; 0x7fffffff),u32_min=-2147483648) R9_w=-2147483648
  7: (97) r6 %= 1                       ; R6_w=scalar()
  8: (b7) r9 = 0                        ; R9=0
  9: (bd) if r6 <= r9 goto pc+1         ; R6=scalar(umin=1) R9=0
  10: (b7) r6 = 0                       ; R6_w=0
  11: (b7) r0 = 0                       ; R0_w=0
  12: (63) *(u32 *)(r10 -4) = r0
  last_idx 12 first_idx 9
  regs=1 stack=0 before 11: (b7) r0 = 0
  13: R0_w=0 R10=fp0 fp-8=0000????
  13: (18) r4 = 0xffff9290dc5bfe00      ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
  15: (bf) r1 = r4                      ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
  16: (bf) r2 = r10                     ; R2_w=fp0 R10=fp0
  17: (07) r2 += -4                     ; R2_w=fp-4
  18: (85) call bpf_map_lookup_elem#1   ; R0=map_value_or_null(id=1,off=0,ks=4,vs=48,imm=0)
  19: (55) if r0 != 0x0 goto pc+1       ; R0=0
  20: (95) exit

  from 19 to 21: R0=map_value(off=0,ks=4,vs=48,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0 fp-8=mmmm????
  21: (77) r6 >>= 10                    ; R6_w=0
  22: (27) r6 *= 8192                   ; R6_w=0
  23: (bf) r1 = r0                      ; R0=map_value(off=0,ks=4,vs=48,imm=0) R1_w=map_value(off=0,ks=4,vs=48,imm=0)
  24: (0f) r0 += r6
  last_idx 24 first_idx 19
  regs=40 stack=0 before 23: (bf) r1 = r0
  regs=40 stack=0 before 22: (27) r6 *= 8192
  regs=40 stack=0 before 21: (77) r6 >>= 10
  regs=40 stack=0 before 19: (55) if r0 != 0x0 goto pc+1
  parent didn't have regs=40 stack=0 marks: R0_rw=map_value_or_null(id=1,off=0,ks=4,vs=48,imm=0) R6_rw=P0 R7=0 R8=0 R9=0 R10=fp0 fp-8=mmmm????
  last_idx 18 first_idx 9
  regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1
  regs=40 stack=0 before 17: (07) r2 += -4
  regs=40 stack=0 before 16: (bf) r2 = r10
  regs=40 stack=0 before 15: (bf) r1 = r4
  regs=40 stack=0 before 13: (18) r4 = 0xffff9290dc5bfe00
  regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0
  regs=40 stack=0 before 11: (b7) r0 = 0
  regs=40 stack=0 before 10: (b7) r6 = 0
  25: (79) r3 = *(u64 *)(r0 +0)         ; R0_w=map_value(off=0,ks=4,vs=48,imm=0) R3_w=scalar()
  26: (7b) *(u64 *)(r1 +0) = r3         ; R1_w=map_value(off=0,ks=4,vs=48,imm=0) R3_w=scalar()
  27: (95) exit

  from 9 to 11: R1=ctx(off=0,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0
  11: (b7) r0 = 0                       ; R0_w=0
  12: (63) *(u32 *)(r10 -4) = r0
  last_idx 12 first_idx 11
  regs=1 stack=0 before 11: (b7) r0 = 0
  13: R0_w=0 R10=fp0 fp-8=0000????
  13: (18) r4 = 0xffff9290dc5bfe00      ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
  15: (bf) r1 = r4                      ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
  16: (bf) r2 = r10                     ; R2_w=fp0 R10=fp0
  17: (07) r2 += -4                     ; R2_w=fp-4
  18: (85) call bpf_map_lookup_elem#1
  frame 0: propagating r6
  last_idx 19 first_idx 11
  regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1
  regs=40 stack=0 before 17: (07) r2 += -4
  regs=40 stack=0 before 16: (bf) r2 = r10
  regs=40 stack=0 before 15: (bf) r1 = r4
  regs=40 stack=0 before 13: (18) r4 = 0xffff9290dc5bfe00
  regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0
  regs=40 stack=0 before 11: (b7) r0 = 0
  parent didn't have regs=40 stack=0 marks: R1=ctx(off=0,imm=0) R6_r=P0 R7=0 R8=0 R9=0 R10=fp0
  last_idx 9 first_idx 9
  regs=40 stack=0 before 9: (bd) if r6 <= r9 goto pc+1
  parent didn't have regs=240 stack=0 marks: R1=ctx(off=0,imm=0) R6_rw=Pscalar() R7_w=0 R8_w=0 R9_rw=P0 R10=fp0
  last_idx 8 first_idx 0
  regs=240 stack=0 before 8: (b7) r9 = 0
  regs=40 stack=0 before 7: (97) r6 %= 1
  regs=40 stack=0 before 6: (bd) if r6 <= r9 goto pc+2
  regs=240 stack=0 before 5: (05) goto pc+0
  regs=240 stack=0 before 4: (97) r6 %= 1025
  regs=240 stack=0 before 3: (b7) r9 = -2147483648
  regs=40 stack=0 before 2: (b7) r8 = 0
  regs=40 stack=0 before 1: (b7) r7 = 0
  regs=40 stack=0 before 0: (b7) r6 = 1024
  19: safe

  from 6 to 9: R1=ctx(off=0,imm=0) R6_w=scalar(umax=18446744071562067968) R7_w=0 R8_w=0 R9_w=-2147483648 R10=fp0
  9: (bd) if r6 <= r9 goto pc+1
  last_idx 9 first_idx 0
  regs=40 stack=0 before 6: (bd) if r6 <= r9 goto pc+2
  regs=240 stack=0 before 5: (05) goto pc+0
  regs=240 stack=0 before 4: (97) r6 %= 1025
  regs=240 stack=0 before 3: (b7) r9 = -2147483648
  regs=40 stack=0 before 2: (b7) r8 = 0
  regs=40 stack=0 before 1: (b7) r7 = 0
  regs=40 stack=0 before 0: (b7) r6 = 1024
  last_idx 9 first_idx 0
  regs=200 stack=0 before 6: (bd) if r6 <= r9 goto pc+2
  regs=240 stack=0 before 5: (05) goto pc+0
  regs=240 stack=0 before 4: (97) r6 %= 1025
  regs=240 stack=0 before 3: (b7) r9 = -2147483648
  regs=40 stack=0 before 2: (b7) r8 = 0
  regs=40 stack=0 before 1: (b7) r7 = 0
  regs=40 stack=0 before 0: (b7) r6 = 1024
  11: R6=scalar(umax=18446744071562067968) R9=-2147483648
  11: (b7) r0 = 0                       ; R0_w=0
  12: (63) *(u32 *)(r10 -4) = r0
  last_idx 12 first_idx 11
  regs=1 stack=0 before 11: (b7) r0 = 0
  13: R0_w=0 R10=fp0 fp-8=0000????
  13: (18) r4 = 0xffff9290dc5bfe00      ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
  15: (bf) r1 = r4                      ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
  16: (bf) r2 = r10                     ; R2_w=fp0 R10=fp0
  17: (07) r2 += -4                     ; R2_w=fp-4
  18: (85) call bpf_map_lookup_elem#1   ; R0_w=map_value_or_null(id=3,off=0,ks=4,vs=48,imm=0)
  19: (55) if r0 != 0x0 goto pc+1       ; R0_w=0
  20: (95) exit

  from 19 to 21: R0=map_value(off=0,ks=4,vs=48,imm=0) R6=scalar(umax=18446744071562067968) R7=0 R8=0 R9=-2147483648 R10=fp0 fp-8=mmmm????
  21: (77) r6 >>= 10                    ; R6_w=scalar(umax=18014398507384832,var_off=(0x0; 0x3fffffffffffff))
  22: (27) r6 *= 8192                   ; R6_w=scalar(smax=9223372036854767616,umax=18446744073709543424,var_off=(0x0; 0xffffffffffffe000),s32_max=2147475456,u32_max=-8192)
  23: (bf) r1 = r0                      ; R0=map_value(off=0,ks=4,vs=48,imm=0) R1_w=map_value(off=0,ks=4,vs=48,imm=0)
  24: (0f) r0 += r6
  last_idx 24 first_idx 21
  regs=40 stack=0 before 23: (bf) r1 = r0
  regs=40 stack=0 before 22: (27) r6 *= 8192
  regs=40 stack=0 before 21: (77) r6 >>= 10
  parent didn't have regs=40 stack=0 marks: R0_rw=map_value(off=0,ks=4,vs=48,imm=0) R6_r=Pscalar(umax=18446744071562067968) R7=0 R8=0 R9=-2147483648 R10=fp0 fp-8=mmmm????
  last_idx 19 first_idx 11
  regs=40 stack=0 before 19: (55) if r0 != 0x0 goto pc+1
  regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1
  regs=40 stack=0 before 17: (07) r2 += -4
  regs=40 stack=0 before 16: (bf) r2 = r10
  regs=40 stack=0 before 15: (bf) r1 = r4
  regs=40 stack=0 before 13: (18) r4 = 0xffff9290dc5bfe00
  regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0
  regs=40 stack=0 before 11: (b7) r0 = 0
  parent didn't have regs=40 stack=0 marks: R1=ctx(off=0,imm=0) R6_rw=Pscalar(umax=18446744071562067968) R7_w=0 R8_w=0 R9_w=-2147483648 R10=fp0
  last_idx 9 first_idx 0
  regs=40 stack=0 before 9: (bd) if r6 <= r9 goto pc+1
  regs=240 stack=0 before 6: (bd) if r6 <= r9 goto pc+2
  regs=240 stack=0 before 5: (05) goto pc+0
  regs=240 stack=0 before 4: (97) r6 %= 1025
  regs=240 stack=0 before 3: (b7) r9 = -2147483648
  regs=40 stack=0 before 2: (b7) r8 = 0
  regs=40 stack=0 before 1: (b7) r7 = 0
  regs=40 stack=0 before 0: (b7) r6 = 1024
  math between map_value pointer and register with unbounded min value is not allowed
  verification time 886 usec
  stack depth 4
  processed 49 insns (limit 1000000) max_states_per_insn 1 total_states 5 peak_states 5 mark_read 2

Fixes: b5dc0163d8 ("bpf: precise scalar_value tracking")
Reported-by: Juan Jose Lopez Jaimez <jjlopezjaimez@google.com>
Reported-by: Meador Inge <meadori@google.com>
Reported-by: Simon Scannell <simonscannell@google.com>
Reported-by: Nenad Stojanovski <thenenadx@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Reviewed-by: Juan Jose Lopez Jaimez <jjlopezjaimez@google.com>
Reviewed-by: Meador Inge <meadori@google.com>
Reviewed-by: Simon Scannell <simonscannell@google.com>
2023-04-19 10:18:18 -07:00
Yonghong Song
3be49f7955 bpf: Improve verifier u32 scalar equality checking
In [1], I tried to remove bpf-specific codes to prevent certain
llvm optimizations, and add llvm TTI (target transform info) hooks
to prevent those optimizations. During this process, I found
if I enable llvm SimplifyCFG:shouldFoldTwoEntryPHINode
transformation, I will hit the following verification failure with selftests:

  ...
  8: (18) r1 = 0xffffc900001b2230       ; R1_w=map_value(off=560,ks=4,vs=564,imm=0)
  10: (61) r1 = *(u32 *)(r1 +0)         ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
  ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC)
  11: (79) r2 = *(u64 *)(r6 +152)       ; R2_w=scalar() R6=ctx(off=0,imm=0)
  ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC)
  12: (55) if r2 != 0xb9fbeef goto pc+10        ; R2_w=195018479
  13: (bc) w2 = w1                      ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
  ; if (test < __NR_TESTS)
  14: (a6) if w1 < 0x9 goto pc+1 16: R0=2 R1_w=scalar(umax=8,var_off=(0x0; 0xf)) R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R6=ctx(off=0,imm=0) R10=fp0
  ;
  16: (27) r2 *= 28                     ; R2_w=scalar(umax=120259084260,var_off=(0x0; 0x1ffffffffc),s32_max=2147483644,u32_max=-4)
  17: (18) r3 = 0xffffc900001b2118      ; R3_w=map_value(off=280,ks=4,vs=564,imm=0)
  19: (0f) r3 += r2                     ; R2_w=scalar(umax=120259084260,var_off=(0x0; 0x1ffffffffc),s32_max=2147483644,u32_max=-4) R3_w=map_value(off=280,ks=4,vs=564,umax=120259084260,var_off=(0x0; 0x1ffffffffc),s32_max=2147483644,u32_max=-4)
  20: (61) r2 = *(u32 *)(r3 +0)
  R3 unbounded memory access, make sure to bounds check any such access
  processed 97 insns (limit 1000000) max_states_per_insn 1 total_states 10 peak_states 10 mark_read 6
  -- END PROG LOAD LOG --
  libbpf: prog 'ingress_fwdns_prio100': failed to load: -13
  libbpf: failed to load object 'test_tc_dtime'
  libbpf: failed to load BPF skeleton 'test_tc_dtime': -13
  ...

At insn 14, with condition 'w1 < 9', register r1 is changed from an arbitrary
u32 value to `scalar(umax=8,var_off=(0x0; 0xf))`. Register r2, however, remains
as an arbitrary u32 value. Current verifier won't claim r1/r2 equality if
the previous mov is alu32 ('w2 = w1').

If r1 upper 32bit value is not 0, we indeed cannot clamin r1/r2 equality
after 'w2 = w1'. But in this particular case, we know r1 upper 32bit value
is 0, so it is safe to claim r1/r2 equality. This patch exactly did this.
For a 32bit subreg mov, if the src register upper 32bit is 0,
it is okay to claim equality between src and dst registers.

With this patch, the above verification sequence becomes

  ...
  8: (18) r1 = 0xffffc9000048e230       ; R1_w=map_value(off=560,ks=4,vs=564,imm=0)
  10: (61) r1 = *(u32 *)(r1 +0)         ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
  ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC)
  11: (79) r2 = *(u64 *)(r6 +152)       ; R2_w=scalar() R6=ctx(off=0,imm=0)
  ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC)
  12: (55) if r2 != 0xb9fbeef goto pc+10        ; R2_w=195018479
  13: (bc) w2 = w1                      ; R1_w=scalar(id=6,umax=4294967295,var_off=(0x0; 0xffffffff)) R2_w=scalar(id=6,umax=4294967295,var_off=(0x0; 0xffffffff))
  ; if (test < __NR_TESTS)
  14: (a6) if w1 < 0x9 goto pc+1        ; R1_w=scalar(id=6,umin=9,umax=4294967295,var_off=(0x0; 0xffffffff))
  ...
  from 14 to 16: R0=2 R1_w=scalar(id=6,umax=8,var_off=(0x0; 0xf)) R2_w=scalar(id=6,umax=8,var_off=(0x0; 0xf)) R6=ctx(off=0,imm=0) R10=fp0
  16: (27) r2 *= 28                     ; R2_w=scalar(umax=224,var_off=(0x0; 0xfc))
  17: (18) r3 = 0xffffc9000048e118      ; R3_w=map_value(off=280,ks=4,vs=564,imm=0)
  19: (0f) r3 += r2
  20: (61) r2 = *(u32 *)(r3 +0)         ; R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R3_w=map_value(off=280,ks=4,vs=564,umax=224,var_off=(0x0; 0xfc),s32_max=252,u32_max=252)
  ...

and eventually the bpf program can be verified successfully.

  [1] https://reviews.llvm.org/D147968

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230417222134.359714-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-17 15:50:02 -07:00
Sean Young
69a8c792cd bpf: lirc program type should not require SYS_CAP_ADMIN
Make it possible to load lirc program type with just CAP_BPF. There is
nothing exceptional about lirc programs that means they require
SYS_CAP_ADMIN.

In order to attach or detach a lirc program type you need permission to
open /dev/lirc0; if you have permission to do that, you can alter all
sorts of lirc receiving options. Changing the IR protocol decoder is no
different.

Right now on a typical distribution /dev/lirc devices are only
read/write by root. Ideally we would make them group read/write like
other devices so that local users can use them without becoming root.

Signed-off-by: Sean Young <sean@mess.org>
Link: https://lore.kernel.org/r/ZD0ArKpwnDBJZsrE@gofer.mess.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-17 13:21:14 -07:00
David Vernet
7b4ddf3920 bpf: Remove KF_KPTR_GET kfunc flag
We've managed to improve the UX for kptrs significantly over the last 9
months. All of the existing use cases which previously had KF_KPTR_GET
kfuncs (struct bpf_cpumask *, struct task_struct *, and struct cgroup *)
have all been updated to be synchronized using RCU. In other words,
their KF_KPTR_GET kfuncs have been removed in favor of KF_RCU |
KF_ACQUIRE kfuncs, with the pointers themselves also being readable from
maps in an RCU read region thanks to the types being RCU safe.

While KF_KPTR_GET was a logical starting point for kptrs, it's become
clear that they're not the correct abstraction. KF_KPTR_GET is a flag
that essentially does nothing other than enforcing that the argument to
a function is a pointer to a referenced kptr map value. At first glance,
that's a useful thing to guarantee to a kfunc. It gives kfuncs the
ability to try and acquire a reference on that kptr without requiring
the BPF prog to do something like this:

struct kptr_type *in_map, *new = NULL;

in_map = bpf_kptr_xchg(&map->value, NULL);
if (in_map) {
        new = bpf_kptr_type_acquire(in_map);
        in_map = bpf_kptr_xchg(&map->value, in_map);
        if (in_map)
                bpf_kptr_type_release(in_map);
}

That's clearly a pretty ugly (and racy) UX, and if using KF_KPTR_GET is
the only alternative, it's better than nothing. However, the problem
with any KF_KPTR_GET kfunc lies in the fact that it always requires some
kind of synchronization in order to safely do an opportunistic acquire
of the kptr in the map. This is because a BPF program running on another
CPU could do a bpf_kptr_xchg() on that map value, and free the kptr
after it's been read by the KF_KPTR_GET kfunc. For example, the
now-removed bpf_task_kptr_get() kfunc did the following:

struct task_struct *bpf_task_kptr_get(struct task_struct **pp)
{
            struct task_struct *p;

        rcu_read_lock();
        p = READ_ONCE(*pp);
        /* If p is non-NULL, it could still be freed by another CPU,
         * so we have to do an opportunistic refcount_inc_not_zero()
         * and return NULL if the task will be freed after the
         * current RCU read region.
         */
        |f (p && !refcount_inc_not_zero(&p->rcu_users))
                p = NULL;
        rcu_read_unlock();

        return p;
}

In other words, the kfunc uses RCU to ensure that the task remains valid
after it's been peeked from the map. However, this is completely
redundant with just defining a KF_RCU kfunc that itself does a
refcount_inc_not_zero(), which is exactly what bpf_task_acquire() now
does.

So, the question of whether KF_KPTR_GET is useful is actually, "Are
there any synchronization mechanisms / safety flags that are required by
certain kptrs, but which are not provided by the verifier to kfuncs?"
The answer to that question today is "No", because every kptr we
currently care about is RCU protected.

Even if the answer ever became "yes", the proper way to support that
referenced kptr type would be to add support for whatever
synchronization mechanism it requires in the verifier, rather than
giving kfuncs a flag that says, "Here's a pointer to a referenced kptr
in a map, do whatever you need to do."

With all that said -- so as to allow us to consolidate the kfunc API,
and simplify the verifier a bit, this patch removes KF_KPTR_GET, and all
relevant logic from the verifier.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230416084928.326135-3-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-16 08:51:24 -07:00
Dave Marchevsky
3e81740a90 bpf: Centralize btf_field-specific initialization logic
All btf_fields in an object are 0-initialized by memset in
bpf_obj_init. This might not be a valid initial state for some field
types, in which case kfuncs that use the type will properly initialize
their input if it's been 0-initialized. Some BPF graph collection types
and kfuncs do this: bpf_list_{head,node} and bpf_rb_node.

An earlier patch in this series added the bpf_refcount field, for which
the 0 state indicates that the refcounted object should be free'd.
bpf_obj_init treats this field specially, setting refcount to 1 instead
of relying on scattered "refcount is 0? Must have just been initialized,
let's set to 1" logic in kfuncs.

This patch extends this treatment to list and rbtree field types,
allowing most scattered initialization logic in kfuncs to be removed.

Note that bpf_{list_head,rb_root} may be inside a BPF map, in which case
they'll be 0-initialized without passing through the newly-added logic,
so scattered initialization logic must remain for these collection root
types.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230415201811.343116-9-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-15 17:36:50 -07:00
Dave Marchevsky
404ad75a36 bpf: Migrate bpf_rbtree_remove to possibly fail
This patch modifies bpf_rbtree_remove to account for possible failure
due to the input rb_node already not being in any collection.
The function can now return NULL, and does when the aforementioned
scenario occurs. As before, on successful removal an owning reference to
the removed node is returned.

Adding KF_RET_NULL to bpf_rbtree_remove's kfunc flags - now KF_RET_NULL |
KF_ACQUIRE - provides the desired verifier semantics:

  * retval must be checked for NULL before use
  * if NULL, retval's ref_obj_id is released
  * retval is a "maybe acquired" owning ref, not a non-owning ref,
    so it will live past end of critical section (bpf_spin_unlock), and
    thus can be checked for NULL after the end of the CS

BPF programs must add checks
============================

This does change bpf_rbtree_remove's verifier behavior. BPF program
writers will need to add NULL checks to their programs, but the
resulting UX looks natural:

  bpf_spin_lock(&glock);

  n = bpf_rbtree_first(&ghead);
  if (!n) { /* ... */}
  res = bpf_rbtree_remove(&ghead, &n->node);

  bpf_spin_unlock(&glock);

  if (!res)  /* Newly-added check after this patch */
    return 1;

  n = container_of(res, /* ... */);
  /* Do something else with n */
  bpf_obj_drop(n);
  return 0;

The "if (!res)" check above is the only addition necessary for the above
program to pass verification after this patch.

bpf_rbtree_remove no longer clobbers non-owning refs
====================================================

An issue arises when bpf_rbtree_remove fails, though. Consider this
example:

  struct node_data {
    long key;
    struct bpf_list_node l;
    struct bpf_rb_node r;
    struct bpf_refcount ref;
  };

  long failed_sum;

  void bpf_prog()
  {
    struct node_data *n = bpf_obj_new(/* ... */);
    struct bpf_rb_node *res;
    n->key = 10;

    bpf_spin_lock(&glock);

    bpf_list_push_back(&some_list, &n->l); /* n is now a non-owning ref */
    res = bpf_rbtree_remove(&some_tree, &n->r, /* ... */);
    if (!res)
      failed_sum += n->key;  /* not possible */

    bpf_spin_unlock(&glock);
    /* if (res) { do something useful and drop } ... */
  }

The bpf_rbtree_remove in this example will always fail. Similarly to
bpf_spin_unlock, bpf_rbtree_remove is a non-owning reference
invalidation point. The verifier clobbers all non-owning refs after a
bpf_rbtree_remove call, so the "failed_sum += n->key" line will fail
verification, and in fact there's no good way to get information about
the node which failed to add after the invalidation. This patch removes
non-owning reference invalidation from bpf_rbtree_remove to allow the
above usecase to pass verification. The logic for why this is now
possible is as follows:

Before this series, bpf_rbtree_add couldn't fail and thus assumed that
its input, a non-owning reference, was in the tree. But it's easy to
construct an example where two non-owning references pointing to the same
underlying memory are acquired and passed to rbtree_remove one after
another (see rbtree_api_release_aliasing in
selftests/bpf/progs/rbtree_fail.c).

So it was necessary to clobber non-owning refs to prevent this
case and, more generally, to enforce "non-owning ref is definitely
in some collection" invariant. This series removes that invariant and
the failure / runtime checking added in this patch provide a clean way
to deal with the aliasing issue - just fail to remove.

Because the aliasing issue prevented by clobbering non-owning refs is no
longer an issue, this patch removes the invalidate_non_owning_refs
call from verifier handling of bpf_rbtree_remove. Note that
bpf_spin_unlock - the other caller of invalidate_non_owning_refs -
clobbers non-owning refs for a different reason, so its clobbering
behavior remains unchanged.

No BPF program changes are necessary for programs to remain valid as a
result of this clobbering change. A valid program before this patch
passed verification with its non-owning refs having shorter (or equal)
lifetimes due to more aggressive clobbering.

Also, update existing tests to check bpf_rbtree_remove retval for NULL
where necessary, and move rbtree_api_release_aliasing from
progs/rbtree_fail.c to progs/rbtree.c since it's now expected to pass
verification.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230415201811.343116-8-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-15 17:36:50 -07:00
Dave Marchevsky
d2dcc67df9 bpf: Migrate bpf_rbtree_add and bpf_list_push_{front,back} to possibly fail
Consider this code snippet:

  struct node {
    long key;
    bpf_list_node l;
    bpf_rb_node r;
    bpf_refcount ref;
  }

  int some_bpf_prog(void *ctx)
  {
    struct node *n = bpf_obj_new(/*...*/), *m;

    bpf_spin_lock(&glock);

    bpf_rbtree_add(&some_tree, &n->r, /* ... */);
    m = bpf_refcount_acquire(n);
    bpf_rbtree_add(&other_tree, &m->r, /* ... */);

    bpf_spin_unlock(&glock);

    /* ... */
  }

After bpf_refcount_acquire, n and m point to the same underlying memory,
and that node's bpf_rb_node field is being used by the some_tree insert,
so overwriting it as a result of the second insert is an error. In order
to properly support refcounted nodes, the rbtree and list insert
functions must be allowed to fail. This patch adds such support.

The kfuncs bpf_rbtree_add, bpf_list_push_{front,back} are modified to
return an int indicating success/failure, with 0 -> success, nonzero ->
failure.

bpf_obj_drop on failure
=======================

Currently the only reason an insert can fail is the example above: the
bpf_{list,rb}_node is already in use. When such a failure occurs, the
insert kfuncs will bpf_obj_drop the input node. This allows the insert
operations to logically fail without changing their verifier owning ref
behavior, namely the unconditional release_reference of the input
owning ref.

With insert that always succeeds, ownership of the node is always passed
to the collection, since the node always ends up in the collection.

With a possibly-failed insert w/ bpf_obj_drop, ownership of the node
is always passed either to the collection (success), or to bpf_obj_drop
(failure). Regardless, it's correct to continue unconditionally
releasing the input owning ref, as something is always taking ownership
from the calling program on insert.

Keeping owning ref behavior unchanged results in a nice default UX for
insert functions that can fail. If the program's reaction to a failed
insert is "fine, just get rid of this owning ref for me and let me go
on with my business", then there's no reason to check for failure since
that's default behavior. e.g.:

  long important_failures = 0;

  int some_bpf_prog(void *ctx)
  {
    struct node *n, *m, *o; /* all bpf_obj_new'd */

    bpf_spin_lock(&glock);
    bpf_rbtree_add(&some_tree, &n->node, /* ... */);
    bpf_rbtree_add(&some_tree, &m->node, /* ... */);
    if (bpf_rbtree_add(&some_tree, &o->node, /* ... */)) {
      important_failures++;
    }
    bpf_spin_unlock(&glock);
  }

If we instead chose to pass ownership back to the program on failed
insert - by returning NULL on success or an owning ref on failure -
programs would always have to do something with the returned ref on
failure. The most likely action is probably "I'll just get rid of this
owning ref and go about my business", which ideally would look like:

  if (n = bpf_rbtree_add(&some_tree, &n->node, /* ... */))
    bpf_obj_drop(n);

But bpf_obj_drop isn't allowed in a critical section and inserts must
occur within one, so in reality error handling would become a
hard-to-parse mess.

For refcounted nodes, we can replicate the "pass ownership back to
program on failure" logic with this patch's semantics, albeit in an ugly
way:

  struct node *n = bpf_obj_new(/* ... */), *m;

  bpf_spin_lock(&glock);

  m = bpf_refcount_acquire(n);
  if (bpf_rbtree_add(&some_tree, &n->node, /* ... */)) {
    /* Do something with m */
  }

  bpf_spin_unlock(&glock);
  bpf_obj_drop(m);

bpf_refcount_acquire is used to simulate "return owning ref on failure".
This should be an uncommon occurrence, though.

Addition of two verifier-fixup'd args to collection inserts
===========================================================

The actual bpf_obj_drop kfunc is
bpf_obj_drop_impl(void *, struct btf_struct_meta *), with bpf_obj_drop
macro populating the second arg with 0 and the verifier later filling in
the arg during insn fixup.

Because bpf_rbtree_add and bpf_list_push_{front,back} now might do
bpf_obj_drop, these kfuncs need a btf_struct_meta parameter that can be
passed to bpf_obj_drop_impl.

Similarly, because the 'node' param to those insert functions is the
bpf_{list,rb}_node within the node type, and bpf_obj_drop expects a
pointer to the beginning of the node, the insert functions need to be
able to find the beginning of the node struct. A second
verifier-populated param is necessary: the offset of {list,rb}_node within the
node type.

These two new params allow the insert kfuncs to correctly call
__bpf_obj_drop_impl:

  beginning_of_node = bpf_rb_node_ptr - offset
  if (already_inserted)
    __bpf_obj_drop_impl(beginning_of_node, btf_struct_meta->record);

Similarly to other kfuncs with "hidden" verifier-populated params, the
insert functions are renamed with _impl prefix and a macro is provided
for common usage. For example, bpf_rbtree_add kfunc is now
bpf_rbtree_add_impl and bpf_rbtree_add is now a macro which sets
"hidden" args to 0.

Due to the two new args BPF progs will need to be recompiled to work
with the new _impl kfuncs.

This patch also rewrites the "hidden argument" explanation to more
directly say why the BPF program writer doesn't need to populate the
arguments with anything meaningful.

How does this new logic affect non-owning references?
=====================================================

Currently, non-owning refs are valid until the end of the critical
section in which they're created. We can make this guarantee because, if
a non-owning ref exists, the referent was added to some collection. The
collection will drop() its nodes when it goes away, but it can't go away
while our program is accessing it, so that's not a problem. If the
referent is removed from the collection in the same CS that it was added
in, it can't be bpf_obj_drop'd until after CS end. Those are the only
two ways to free the referent's memory and neither can happen until
after the non-owning ref's lifetime ends.

On first glance, having these collection insert functions potentially
bpf_obj_drop their input seems like it breaks the "can't be
bpf_obj_drop'd until after CS end" line of reasoning. But we care about
the memory not being _freed_ until end of CS end, and a previous patch
in the series modified bpf_obj_drop such that it doesn't free refcounted
nodes until refcount == 0. So the statement can be more accurately
rewritten as "can't be free'd until after CS end".

We can prove that this rewritten statement holds for any non-owning
reference produced by collection insert functions:

* If the input to the insert function is _not_ refcounted
  * We have an owning reference to the input, and can conclude it isn't
    in any collection
    * Inserting a node in a collection turns owning refs into
      non-owning, and since our input type isn't refcounted, there's no
      way to obtain additional owning refs to the same underlying
      memory
  * Because our node isn't in any collection, the insert operation
    cannot fail, so bpf_obj_drop will not execute
  * If bpf_obj_drop is guaranteed not to execute, there's no risk of
    memory being free'd

* Otherwise, the input to the insert function is refcounted
  * If the insert operation fails due to the node's list_head or rb_root
    already being in some collection, there was some previous successful
    insert which passed refcount to the collection
  * We have an owning reference to the input, it must have been
    acquired via bpf_refcount_acquire, which bumped the refcount
  * refcount must be >= 2 since there's a valid owning reference and the
    node is already in a collection
  * Insert triggering bpf_obj_drop will decr refcount to >= 1, never
    resulting in a free

So although we may do bpf_obj_drop during the critical section, this
will never result in memory being free'd, and no changes to non-owning
ref logic are needed in this patch.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230415201811.343116-6-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-15 17:36:50 -07:00
Dave Marchevsky
7c50b1cb76 bpf: Add bpf_refcount_acquire kfunc
Currently, BPF programs can interact with the lifetime of refcounted
local kptrs in the following ways:

  bpf_obj_new  - Initialize refcount to 1 as part of new object creation
  bpf_obj_drop - Decrement refcount and free object if it's 0
  collection add - Pass ownership to the collection. No change to
                   refcount but collection is responsible for
		   bpf_obj_dropping it

In order to be able to add a refcounted local kptr to multiple
collections we need to be able to increment the refcount and acquire a
new owning reference. This patch adds a kfunc, bpf_refcount_acquire,
implementing such an operation.

bpf_refcount_acquire takes a refcounted local kptr and returns a new
owning reference to the same underlying memory as the input. The input
can be either owning or non-owning. To reinforce why this is safe,
consider the following code snippets:

  struct node *n = bpf_obj_new(typeof(*n)); // A
  struct node *m = bpf_refcount_acquire(n); // B

In the above snippet, n will be alive with refcount=1 after (A), and
since nothing changes that state before (B), it's obviously safe. If
n is instead added to some rbtree, we can still safely refcount_acquire
it:

  struct node *n = bpf_obj_new(typeof(*n));
  struct node *m;

  bpf_spin_lock(&glock);
  bpf_rbtree_add(&groot, &n->node, less);   // A
  m = bpf_refcount_acquire(n);              // B
  bpf_spin_unlock(&glock);

In the above snippet, after (A) n is a non-owning reference, and after
(B) m is an owning reference pointing to the same memory as n. Although
n has no ownership of that memory's lifetime, it's guaranteed to be
alive until the end of the critical section, and n would be clobbered if
we were past the end of the critical section, so it's safe to bump
refcount.

Implementation details:

* From verifier's perspective, bpf_refcount_acquire handling is similar
  to bpf_obj_new and bpf_obj_drop. Like the former, it returns a new
  owning reference matching input type, although like the latter, type
  can be inferred from concrete kptr input. Verifier changes in
  {check,fixup}_kfunc_call and check_kfunc_args are largely copied from
  aforementioned functions' verifier changes.

* An exception to the above is the new KF_ARG_PTR_TO_REFCOUNTED_KPTR
  arg, indicated by new "__refcounted_kptr" kfunc arg suffix. This is
  necessary in order to handle both owning and non-owning input without
  adding special-casing to "__alloc" arg handling. Also a convenient
  place to confirm that input type has bpf_refcount field.

* The implemented kfunc is actually bpf_refcount_acquire_impl, with
  'hidden' second arg that the verifier sets to the type's struct_meta
  in fixup_kfunc_call.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230415201811.343116-5-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-15 17:36:50 -07:00
Dave Marchevsky
1512217c47 bpf: Support refcounted local kptrs in existing semantics
A local kptr is considered 'refcounted' when it is of a type that has a
bpf_refcount field. When such a kptr is created, its refcount should be
initialized to 1; when destroyed, the object should be free'd only if a
refcount decr results in 0 refcount.

Existing logic always frees the underlying memory when destroying a
local kptr, and 0-initializes all btf_record fields. This patch adds
checks for "is local kptr refcounted?" and new logic for that case in
the appropriate places.

This patch focuses on changing existing semantics and thus conspicuously
does _not_ provide a way for BPF programs in increment refcount. That
follows later in the series.

__bpf_obj_drop_impl is modified to do the right thing when it sees a
refcounted type. Container types for graph nodes (list, tree, stashed in
map) are migrated to use __bpf_obj_drop_impl as a destructor for their
nodes instead of each having custom destruction code in their _free
paths. Now that "drop" isn't a synonym for "free" when the type is
refcounted it makes sense to centralize this logic.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230415201811.343116-4-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-15 17:36:49 -07:00
Dave Marchevsky
d54730b50b bpf: Introduce opaque bpf_refcount struct and add btf_record plumbing
A 'struct bpf_refcount' is added to the set of opaque uapi/bpf.h types
meant for use in BPF programs. Similarly to other opaque types like
bpf_spin_lock and bpf_rbtree_node, the verifier needs to know where in
user-defined struct types a bpf_refcount can be located, so necessary
btf_record plumbing is added to enable this. bpf_refcount is sized to
hold a refcount_t.

Similarly to bpf_spin_lock, the offset of a bpf_refcount is cached in
btf_record as refcount_off in addition to being in the field array.
Caching refcount_off makes sense for this field because further patches
in the series will modify functions that take local kptrs (e.g.
bpf_obj_drop) to change their behavior if the type they're operating on
is refcounted. So enabling fast "is this type refcounted?" checks is
desirable.

No such verifier behavior changes are introduced in this patch, just
logic to recognize 'struct bpf_refcount' in btf_record.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230415201811.343116-3-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-15 17:36:49 -07:00
Dave Marchevsky
cd2a807901 bpf: Remove btf_field_offs, use btf_record's fields instead
The btf_field_offs struct contains (offset, size) for btf_record fields,
sorted by offset. btf_field_offs is always used in conjunction with
btf_record, which has btf_field 'fields' array with (offset, type), the
latter of which btf_field_offs' size is derived from via
btf_field_type_size.

This patch adds a size field to struct btf_field and sorts btf_record's
fields by offset, making it possible to get rid of btf_field_offs. Less
data duplication and less code complexity results.

Since btf_field_offs' lifetime closely followed the btf_record used to
populate it, most complexity wins are from removal of initialization
code like:

  if (btf_record_successfully_initialized) {
    foffs = btf_parse_field_offs(rec);
    if (IS_ERR_OR_NULL(foffs))
      // free the btf_record and return err
  }

Other changes in this patch are pretty mechanical:

  * foffs->field_off[i] -> rec->fields[i].offset
  * foffs->field_sz[i] -> rec->fields[i].size
  * Sort rec->fields in btf_parse_fields before returning
    * It's possible that this is necessary independently of other
      changes in this patch. btf_record_find in syscall.c expects
      btf_record's fields to be sorted by offset, yet there's no
      explicit sorting of them before this patch, record's fields are
      populated in the order they're read from BTF struct definition.
      BTF docs don't say anything about the sortedness of struct fields.
  * All functions taking struct btf_field_offs * input now instead take
    struct btf_record *. All callsites of these functions already have
    access to the correct btf_record.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230415201811.343116-2-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-15 17:36:49 -07:00
Ilya Leoshkevich
1cf3bfc60f bpf: Support 64-bit pointers to kfuncs
test_ksyms_module fails to emit a kfunc call targeting a module on
s390x, because the verifier stores the difference between kfunc
address and __bpf_call_base in bpf_insn.imm, which is s32, and modules
are roughly (1 << 42) bytes away from the kernel on s390x.

Fix by keeping BTF id in bpf_insn.imm for BPF_PSEUDO_KFUNC_CALLs,
and storing the absolute address in bpf_kfunc_desc.

Introduce bpf_jit_supports_far_kfunc_call() in order to limit this new
behavior to the s390x JIT. Otherwise other JITs need to be modified,
which is not desired.

Introduce bpf_get_kfunc_addr() instead of exposing both
find_kfunc_desc() and struct bpf_kfunc_desc.

In addition to sorting kfuncs by imm, also sort them by offset, in
order to handle conflicting imms from different modules. Do this on
all architectures in order to simplify code.

Factor out resolving specialized kfuncs (XPD and dynptr) from
fixup_kfunc_call(). This was required in the first place, because
fixup_kfunc_call() uses find_kfunc_desc(), which returns a const
pointer, so it's not possible to modify kfunc addr without stripping
const, which is not nice. It also removes repetition of code like:

	if (bpf_jit_supports_far_kfunc_call())
		desc->addr = func;
	else
		insn->imm = BPF_CALL_IMM(func);

and separates kfunc_desc_tab fixups from kfunc_call fixups.

Suggested-by: Jiri Olsa <olsajiri@gmail.com>
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230412230632.885985-1-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-13 21:36:41 -07:00
Yafang
c11bd04648 bpf: Add preempt_count_{sub,add} into btf id deny list
The recursion check in __bpf_prog_enter* and __bpf_prog_exit*
leave preempt_count_{sub,add} unprotected. When attaching trampoline to
them we get panic as follows,

[  867.843050] BUG: TASK stack guard page was hit at 0000000009d325cf (stack is 0000000046a46a15..00000000537e7b28)
[  867.843064] stack guard page: 0000 [#1] PREEMPT SMP NOPTI
[  867.843067] CPU: 8 PID: 11009 Comm: trace Kdump: loaded Not tainted 6.2.0+ #4
[  867.843100] Call Trace:
[  867.843101]  <TASK>
[  867.843104]  asm_exc_int3+0x3a/0x40
[  867.843108] RIP: 0010:preempt_count_sub+0x1/0xa0
[  867.843135]  __bpf_prog_enter_recur+0x17/0x90
[  867.843148]  bpf_trampoline_6442468108_0+0x2e/0x1000
[  867.843154]  ? preempt_count_sub+0x1/0xa0
[  867.843157]  preempt_count_sub+0x5/0xa0
[  867.843159]  ? migrate_enable+0xac/0xf0
[  867.843164]  __bpf_prog_exit_recur+0x2d/0x40
[  867.843168]  bpf_trampoline_6442468108_0+0x55/0x1000
...
[  867.843788]  preempt_count_sub+0x5/0xa0
[  867.843793]  ? migrate_enable+0xac/0xf0
[  867.843829]  __bpf_prog_exit_recur+0x2d/0x40
[  867.843837] BUG: IRQ stack guard page was hit at 0000000099bd8228 (stack is 00000000b23e2bc4..000000006d95af35)
[  867.843841] BUG: IRQ stack guard page was hit at 000000005ae07924 (stack is 00000000ffd69623..0000000014eb594c)
[  867.843843] BUG: IRQ stack guard page was hit at 00000000028320f0 (stack is 00000000034b6438..0000000078d1bcec)
[  867.843842]  bpf_trampoline_6442468108_0+0x55/0x1000
...

That is because in __bpf_prog_exit_recur, the preempt_count_{sub,add} are
called after prog->active is decreased.

Fixing this by adding these two functions into btf ids deny list.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Yafang <laoar.shao@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jiri Olsa <olsajiri@gmail.com>
Acked-by: Hao Luo <haoluo@google.com>
Link: https://lore.kernel.org/r/20230413025248.79764-1-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-13 21:20:21 -07:00
Jakub Kicinski
c2865b1122 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZDhSiwAKCRDbK58LschI
 g8cbAQCH4xrquOeDmYyGXFQGchHZAIj++tKg8ABU4+hYeJtrlwEA6D4W6wjoSZRk
 mLSptZ9qro8yZA86BvyPvlBT1h9ELQA=
 =StAc
 -----END PGP SIGNATURE-----

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-04-13

We've added 260 non-merge commits during the last 36 day(s) which contain
a total of 356 files changed, 21786 insertions(+), 11275 deletions(-).

The main changes are:

1) Rework BPF verifier log behavior and implement it as a rotating log
   by default with the option to retain old-style fixed log behavior,
   from Andrii Nakryiko.

2) Adds support for using {FOU,GUE} encap with an ipip device operating
   in collect_md mode and add a set of BPF kfuncs for controlling encap
   params, from Christian Ehrig.

3) Allow BPF programs to detect at load time whether a particular kfunc
   exists or not, and also add support for this in light skeleton,
   from Alexei Starovoitov.

4) Optimize hashmap lookups when key size is multiple of 4,
   from Anton Protopopov.

5) Enable RCU semantics for task BPF kptrs and allow referenced kptr
   tasks to be stored in BPF maps, from David Vernet.

6) Add support for stashing local BPF kptr into a map value via
   bpf_kptr_xchg(). This is useful e.g. for rbtree node creation
   for new cgroups, from Dave Marchevsky.

7) Fix BTF handling of is_int_ptr to skip modifiers to work around
   tracing issues where a program cannot be attached, from Feng Zhou.

8) Migrate a big portion of test_verifier unit tests over to
   test_progs -a verifier_* via inline asm to ease {read,debug}ability,
   from Eduard Zingerman.

9) Several updates to the instruction-set.rst documentation
   which is subject to future IETF standardization
   (https://lwn.net/Articles/926882/), from Dave Thaler.

10) Fix BPF verifier in the __reg_bound_offset's 64->32 tnum sub-register
    known bits information propagation, from Daniel Borkmann.

11) Add skb bitfield compaction work related to BPF with the overall goal
    to make more of the sk_buff bits optional, from Jakub Kicinski.

12) BPF selftest cleanups for build id extraction which stand on its own
    from the upcoming integration work of build id into struct file object,
    from Jiri Olsa.

13) Add fixes and optimizations for xsk descriptor validation and several
    selftest improvements for xsk sockets, from Kal Conley.

14) Add BPF links for struct_ops and enable switching implementations
    of BPF TCP cong-ctls under a given name by replacing backing
    struct_ops map, from Kui-Feng Lee.

15) Remove a misleading BPF verifier env->bypass_spec_v1 check on variable
    offset stack read as earlier Spectre checks cover this,
    from Luis Gerhorst.

16) Fix issues in copy_from_user_nofault() for BPF and other tracers
    to resemble copy_from_user_nmi() from safety PoV, from Florian Lehner
    and Alexei Starovoitov.

17) Add --json-summary option to test_progs in order for CI tooling to
    ease parsing of test results, from Manu Bretelle.

18) Batch of improvements and refactoring to prep for upcoming
    bpf_local_storage conversion to bpf_mem_cache_{alloc,free} allocator,
    from Martin KaFai Lau.

19) Improve bpftool's visual program dump which produces the control
    flow graph in a DOT format by adding C source inline annotations,
    from Quentin Monnet.

20) Fix attaching fentry/fexit/fmod_ret/lsm to modules by extracting
    the module name from BTF of the target and searching kallsyms of
    the correct module, from Viktor Malik.

21) Improve BPF verifier handling of '<const> <cond> <non_const>'
    to better detect whether in particular jmp32 branches are taken,
    from Yonghong Song.

22) Allow BPF TCP cong-ctls to write app_limited of struct tcp_sock.
    A built-in cc or one from a kernel module is already able to write
    to app_limited, from Yixin Shen.

Conflicts:

Documentation/bpf/bpf_devel_QA.rst
  b7abcd9c65 ("bpf, doc: Link to submitting-patches.rst for general patch submission info")
  0f10f647f4 ("bpf, docs: Use internal linking for link to netdev subsystem doc")
https://lore.kernel.org/all/20230307095812.236eb1be@canb.auug.org.au/

include/net/ip_tunnels.h
  bc9d003dc4 ("ip_tunnel: Preserve pointer const in ip_tunnel_info_opts")
  ac931d4cde ("ipip,ip_tunnel,sit: Add FOU support for externally controlled ipip devices")
https://lore.kernel.org/all/20230413161235.4093777-1-broonie@kernel.org/

net/bpf/test_run.c
  e5995bc7e2 ("bpf, test_run: fix crashes due to XDP frame overwriting/corruption")
  294635a816 ("bpf, test_run: fix &xdp_frame misplacement for LIVE_FRAMES")
https://lore.kernel.org/all/20230320102619.05b80a98@canb.auug.org.au/
====================

Link: https://lore.kernel.org/r/20230413191525.7295-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13 16:43:38 -07:00
David Vernet
6499fe6edc bpf: Remove bpf_cgroup_kptr_get() kfunc
Now that bpf_cgroup_acquire() is KF_RCU | KF_RET_NULL,
bpf_cgroup_kptr_get() is redundant. Let's remove it, and update
selftests to instead use bpf_cgroup_acquire() where appropriate. The
next patch will update the BPF documentation to not mention
bpf_cgroup_kptr_get().

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230411041633.179404-2-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-12 12:57:54 -07:00
David Vernet
1d71283987 bpf: Make bpf_cgroup_acquire() KF_RCU | KF_RET_NULL
struct cgroup is already an RCU-safe type in the verifier. We can
therefore update bpf_cgroup_acquire() to be KF_RCU | KF_RET_NULL, and
subsequently remove bpf_cgroup_kptr_get(). This patch does the first of
these by updating bpf_cgroup_acquire() to be KF_RCU | KF_RET_NULL, and
also updates selftests accordingly.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230411041633.179404-1-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-12 12:57:54 -07:00
Alexei Starovoitov
10fd5f70c3 bpf: Handle NULL in bpf_local_storage_free.
During OOM bpf_local_storage_alloc() may fail to allocate 'storage' and
call to bpf_local_storage_free() with NULL pointer will cause a crash like:
[ 271718.917646] BUG: kernel NULL pointer dereference, address: 00000000000000a0
[ 271719.019620] RIP: 0010:call_rcu+0x2d/0x240
[ 271719.216274]  bpf_local_storage_alloc+0x19e/0x1e0
[ 271719.250121]  bpf_local_storage_update+0x33b/0x740

Fixes: 7e30a8477b ("bpf: Add bpf_local_storage_free()")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230412171252.15635-1-alexei.starovoitov@gmail.com
2023-04-12 10:27:50 -07:00
Feng Zhou
91f2dc6838 bpf/btf: Fix is_int_ptr()
When tracing a kernel function with arg type is u32*, btf_ctx_access()
would report error: arg2 type INT is not a struct.

The commit bb6728d756 ("bpf: Allow access to int pointer arguments
in tracing programs") added support for int pointer, but did not skip
modifiers before checking it's type. This patch fixes it.

Fixes: bb6728d756 ("bpf: Allow access to int pointer arguments in tracing programs")
Co-developed-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20230410085908.98493-2-zhoufeng.zf@bytedance.com
2023-04-11 20:29:30 +02:00
Andrii Nakryiko
fac08d45e2 bpf: Relax log_buf NULL conditions when log_level>0 is requested
Drop the log_size>0 and log_buf!=NULL condition when log_level>0. This
allows users to request log_true_size of a full log without providing
actual (even if small) log buffer. Verifier log handling code was mostly
ready to handle NULL log->ubuf, so only few small changes were necessary
to prevent NULL log->ubuf from causing problems.

Note, that if user provided NULL log_buf with log_level>0 we don't
consider this a log truncation, and thus won't return -ENOSPC.

We also enforce that either (log_buf==NULL && log_size==0) or
(log_buf!=NULL && log_size>0).

Suggested-by: Lorenz Bauer <lmb@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/bpf/20230406234205.323208-15-andrii@kernel.org
2023-04-11 18:05:44 +02:00
Andrii Nakryiko
bdcab4144f bpf: Simplify internal verifier log interface
Simplify internal verifier log API down to bpf_vlog_init() and
bpf_vlog_finalize(). The former handles input arguments validation in
one place and makes it easier to change it. The latter subsumes -ENOSPC
(truncation) and -EFAULT handling and simplifies both caller's code
(bpf_check() and btf_parse()).

For btf_parse(), this patch also makes sure that verifier log
finalization happens even if there is some error condition during BTF
verification process prior to normal finalization step.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/bpf/20230406234205.323208-14-andrii@kernel.org
2023-04-11 18:05:44 +02:00
Andrii Nakryiko
47a71c1f9a bpf: Add log_true_size output field to return necessary log buffer size
Add output-only log_true_size and btf_log_true_size field to
BPF_PROG_LOAD and BPF_BTF_LOAD commands, respectively. It will return
the size of log buffer necessary to fit in all the log contents at
specified log_level. This is very useful for BPF loader libraries like
libbpf to be able to size log buffer correctly, but could be used by
users directly, if necessary, as well.

This patch plumbs all this through the code, taking into account actual
bpf_attr size provided by user to determine if these new fields are
expected by users. And if they are, set them from kernel on return.

We refactory btf_parse() function to accommodate this, moving attr and
uattr handling inside it. The rest is very straightforward code, which
is split from the logging accounting changes in the previous patch to
make it simpler to review logic vs UAPI changes.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/bpf/20230406234205.323208-13-andrii@kernel.org
2023-04-11 18:05:43 +02:00
Andrii Nakryiko
fa1c7d5cc4 bpf: Keep track of total log content size in both fixed and rolling modes
Change how we do accounting in BPF_LOG_FIXED mode and adopt log->end_pos
as *logical* log position. This means that we can go beyond physical log
buffer size now and be able to tell what log buffer size should be to
fit entire log contents without -ENOSPC.

To do this for BPF_LOG_FIXED mode, we need to remove a short-circuiting
logic of not vsnprintf()'ing further log content once we filled up
user-provided buffer, which is done by bpf_verifier_log_needed() checks.
We modify these checks to always keep going if log->level is non-zero
(i.e., log is requested), even if log->ubuf was NULL'ed out due to
copying data to user-space, or if entire log buffer is physically full.
We adopt bpf_verifier_vlog() routine to work correctly with
log->ubuf == NULL condition, performing log formatting into temporary
kernel buffer, doing all the necessary accounting, but just avoiding
copying data out if buffer is full or NULL'ed out.

With these changes, it's now possible to do this sort of determination of
log contents size in both BPF_LOG_FIXED and default rolling log mode.
We need to keep in mind bpf_vlog_reset(), though, which shrinks log
contents after successful verification of a particular code path. This
log reset means that log->end_pos isn't always increasing, so to return
back to users what should be the log buffer size to fit all log content
without causing -ENOSPC even in the presence of log resetting, we need
to keep maximum over "lifetime" of logging. We do this accounting in
bpf_vlog_update_len_max() helper.

A related and subtle aspect is that with this logical log->end_pos even in
BPF_LOG_FIXED mode we could temporary "overflow" buffer, but then reset
it back with bpf_vlog_reset() to a position inside user-supplied
log_buf. In such situation we still want to properly maintain
terminating zero. We will eventually return -ENOSPC even if final log
buffer is small (we detect this through log->len_max check). This
behavior is simpler to reason about and is consistent with current
behavior of verifier log. Handling of this required a small addition to
bpf_vlog_reset() logic to avoid doing put_user() beyond physical log
buffer dimensions.

Another issue to keep in mind is that we limit log buffer size to 32-bit
value and keep such log length as u32, but theoretically verifier could
produce huge log stretching beyond 4GB. Instead of keeping (and later
returning) 64-bit log length, we cap it at UINT_MAX. Current UAPI makes
it impossible to specify log buffer size bigger than 4GB anyways, so we
don't really loose anything here and keep everything consistently 32-bit
in UAPI. This property will be utilized in next patch.

Doing the same determination of maximum log buffer for rolling mode is
trivial, as log->end_pos and log->start_pos are already logical
positions, so there is nothing new there.

These changes do incidentally fix one small issue with previous logging
logic. Previously, if use provided log buffer of size N, and actual log
output was exactly N-1 bytes + terminating \0, kernel logic coun't
distinguish this condition from log truncation scenario which would end
up with truncated log contents of N-1 bytes + terminating \0 as well.

But now with log->end_pos being logical position that could go beyond
actual log buffer size, we can distinguish these two conditions, which
we do in this patch. This plays nicely with returning log_size_actual
(implemented in UAPI in the next patch), as we can now guarantee that if
user takes such log_size_actual and provides log buffer of that exact
size, they will not get -ENOSPC in return.

All in all, all these changes do conceptually unify fixed and rolling
log modes much better, and allow a nice feature requested by users:
knowing what should be the size of the buffer to avoid -ENOSPC.

We'll plumb this through the UAPI and the code in the next patch.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/bpf/20230406234205.323208-12-andrii@kernel.org
2023-04-11 18:05:43 +02:00
Andrii Nakryiko
8a6ca6bc55 bpf: Simplify logging-related error conditions handling
Move log->level == 0 check into bpf_vlog_truncated() instead of doing it
explicitly. Also remove unnecessary goto in kernel/bpf/verifier.c.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/bpf/20230406234205.323208-11-andrii@kernel.org
2023-04-11 18:05:43 +02:00
Andrii Nakryiko
cbedb42a0d bpf: Avoid incorrect -EFAULT error in BPF_LOG_KERNEL mode
If verifier log is in BPF_LOG_KERNEL mode, no log->ubuf is expected and
it stays NULL throughout entire verification process. Don't erroneously
return -EFAULT in such case.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/bpf/20230406234205.323208-10-andrii@kernel.org
2023-04-11 18:05:43 +02:00
Andrii Nakryiko
971fb5057d bpf: Fix missing -EFAULT return on user log buf error in btf_parse()
btf_parse() is missing -EFAULT error return if log->ubuf was NULL-ed out
due to error while copying data into user-provided buffer. Add it, but
handle a special case of BPF_LOG_KERNEL in which log->ubuf is always NULL.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/bpf/20230406234205.323208-9-andrii@kernel.org
2023-04-11 18:05:43 +02:00
Andrii Nakryiko
24bc80887a bpf: Ignore verifier log reset in BPF_LOG_KERNEL mode
Verifier log position reset is meaningless in BPF_LOG_KERNEL mode, so
just exit early in bpf_vlog_reset() if log->level is BPF_LOG_KERNEL.

This avoid meaningless put_user() into NULL log->ubuf.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/bpf/20230406234205.323208-8-andrii@kernel.org
2023-04-11 18:05:43 +02:00
Andrii Nakryiko
1216640938 bpf: Switch BPF verifier log to be a rotating log by default
Currently, if user-supplied log buffer to collect BPF verifier log turns
out to be too small to contain full log, bpf() syscall returns -ENOSPC,
fails BPF program verification/load, and preserves first N-1 bytes of
the verifier log (where N is the size of user-supplied buffer).

This is problematic in a bunch of common scenarios, especially when
working with real-world BPF programs that tend to be pretty complex as
far as verification goes and require big log buffers. Typically, it's
when debugging tricky cases at log level 2 (verbose). Also, when BPF program
is successfully validated, log level 2 is the only way to actually see
verifier state progression and all the important details.

Even with log level 1, it's possible to get -ENOSPC even if the final
verifier log fits in log buffer, if there is a code path that's deep
enough to fill up entire log, even if normally it would be reset later
on (there is a logic to chop off successfully validated portions of BPF
verifier log).

In short, it's not always possible to pre-size log buffer. Also, what's
worse, in practice, the end of the log most often is way more important
than the beginning, but verifier stops emitting log as soon as initial
log buffer is filled up.

This patch switches BPF verifier log behavior to effectively behave as
rotating log. That is, if user-supplied log buffer turns out to be too
short, verifier will keep overwriting previously written log,
effectively treating user's log buffer as a ring buffer. -ENOSPC is
still going to be returned at the end, to notify user that log contents
was truncated, but the important last N bytes of the log would be
returned, which might be all that user really needs. This consistent
-ENOSPC behavior, regardless of rotating or fixed log behavior, allows
to prevent backwards compatibility breakage. The only user-visible
change is which portion of verifier log user ends up seeing *if buffer
is too small*. Given contents of verifier log itself is not an ABI,
there is no breakage due to this behavior change. Specialized tools that
rely on specific contents of verifier log in -ENOSPC scenario are
expected to be easily adapted to accommodate old and new behaviors.

Importantly, though, to preserve good user experience and not require
every user-space application to adopt to this new behavior, before
exiting to user-space verifier will rotate log (in place) to make it
start at the very beginning of user buffer as a continuous
zero-terminated string. The contents will be a chopped off N-1 last
bytes of full verifier log, of course.

Given beginning of log is sometimes important as well, we add
BPF_LOG_FIXED (which equals 8) flag to force old behavior, which allows
tools like veristat to request first part of verifier log, if necessary.
BPF_LOG_FIXED flag is also a simple and straightforward way to check if
BPF verifier supports rotating behavior.

On the implementation side, conceptually, it's all simple. We maintain
64-bit logical start and end positions. If we need to truncate the log,
start position will be adjusted accordingly to lag end position by
N bytes. We then use those logical positions to calculate their matching
actual positions in user buffer and handle wrap around the end of the
buffer properly. Finally, right before returning from bpf_check(), we
rotate user log buffer contents in-place as necessary, to make log
contents contiguous. See comments in relevant functions for details.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/bpf/20230406234205.323208-4-andrii@kernel.org
2023-04-11 18:05:43 +02:00
Andrii Nakryiko
03cc3aa6a5 bpf: Remove minimum size restrictions on verifier log buffer
It's not clear why we have 128 as minimum size, but it makes testing
harder and seems unnecessary, as we carefully handle truncation
scenarios and use proper snprintf variants. So remove this limitation
and just enforce positive length for log buffer.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/bpf/20230406234205.323208-3-andrii@kernel.org
2023-04-11 18:05:42 +02:00
Andrii Nakryiko
4294a0a7ab bpf: Split off basic BPF verifier log into separate file
kernel/bpf/verifier.c file is large and growing larger all the time. So
it's good to start splitting off more or less self-contained parts into
separate files to keep source code size (somewhat) somewhat under
control.

This patch is a one step in this direction, moving some of BPF verifier log
routines into a separate kernel/bpf/log.c. Right now it's most low-level
and isolated routines to append data to log, reset log to previous
position, etc. Eventually we could probably move verifier state
printing logic here as well, but this patch doesn't attempt to do that
yet.

Subsequent patches will add more logic to verifier log management, so
having basics in a separate file will make sure verifier.c doesn't grow
more with new changes.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/bpf/20230406234205.323208-2-andrii@kernel.org
2023-04-11 18:05:42 +02:00
Barret Rhoden
f3f2134977 bpf: ensure all memory is initialized in bpf_get_current_comm
BPF helpers that take an ARG_PTR_TO_UNINIT_MEM must ensure that all of
the memory is set, including beyond the end of the string.

Signed-off-by: Barret Rhoden <brho@google.com>
Link: https://lore.kernel.org/r/20230407001808.1622968-1-brho@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-06 18:48:24 -07:00
Yonghong Song
953d9f5bea bpf: Improve handling of pattern '<const> <cond_op> <non_const>' in verifier
Currently, the verifier does not handle '<const> <cond_op> <non_const>' well.
For example,
  ...
  10: (79) r1 = *(u64 *)(r10 -16)       ; R1_w=scalar() R10=fp0
  11: (b7) r2 = 0                       ; R2_w=0
  12: (2d) if r2 > r1 goto pc+2
  13: (b7) r0 = 0
  14: (95) exit
  15: (65) if r1 s> 0x1 goto pc+3
  16: (0f) r0 += r1
  ...
At insn 12, verifier decides both true and false branch are possible, but
actually only false branch is possible.

Currently, the verifier already supports patterns '<non_const> <cond_op> <const>.
Add support for patterns '<const> <cond_op> <non_const>' in a similar way.

Also fix selftest 'verifier_bounds_mix_sign_unsign/bounds checks mixing signed and unsigned, variant 10'
due to this change.

Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Dave Marchevsky <davemarchevsky@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230406164505.1046801-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-06 15:26:08 -07:00
Yonghong Song
13fbcee557 bpf: Improve verifier JEQ/JNE insn branch taken checking
Currently, for BPF_JEQ/BPF_JNE insn, verifier determines
whether the branch is taken or not only if both operands
are constants. Therefore, for the following code snippet,
  0: (85) call bpf_ktime_get_ns#5       ; R0_w=scalar()
  1: (a5) if r0 < 0x3 goto pc+2         ; R0_w=scalar(umin=3)
  2: (b7) r2 = 2                        ; R2_w=2
  3: (1d) if r0 == r2 goto pc+2 6

At insn 3, since r0 is not a constant, verifier assumes both branch
can be taken which may lead inproper verification failure.

Add comparing umin/umax value and the constant. If the umin value
is greater than the constant, or umax value is smaller than the constant,
for JEQ the branch must be not-taken, and for JNE the branch must be taken.
The jmp32 mode JEQ/JNE branch taken checking is also handled similarly.

The following lists the veristat result w.r.t. changed number
of processes insns during verification:

File                                                   Program                                               Insns (A)  Insns (B)  Insns    (DIFF)
-----------------------------------------------------  ----------------------------------------------------  ---------  ---------  ---------------
test_cls_redirect.bpf.linked3.o                        cls_redirect                                              64980      73472  +8492 (+13.07%)
test_seg6_loop.bpf.linked3.o                           __add_egr_x                                               12425      12423      -2 (-0.02%)
test_tcp_hdr_options.bpf.linked3.o                     estab                                                      2634       2558     -76 (-2.89%)
test_parse_tcp_hdr_opt.bpf.linked3.o                   xdp_ingress_v6                                             1421       1420      -1 (-0.07%)
test_parse_tcp_hdr_opt_dynptr.bpf.linked3.o            xdp_ingress_v6                                             1238       1237      -1 (-0.08%)
test_tc_dtime.bpf.linked3.o                            egress_fwdns_prio100                                        414        411      -3 (-0.72%)

Mostly a small improvement but test_cls_redirect.bpf.linked3.o has a 13% regression.
I checked with verifier log and found it this is due to pruning.
For some JEQ/JNE branches impacted by this patch,
one branch is explored and the other has state equivalence and
pruned.

Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Dave Marchevsky <davemarchevsky@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230406164455.1045294-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-06 15:26:08 -07:00
Alexei Starovoitov
afeebf9f57 bpf: Undo strict enforcement for walking untagged fields.
The commit 6fcd486b3a ("bpf: Refactor RCU enforcement in the verifier.")
broke several tracing bpf programs. Even in clang compiled kernels there are
many fields that are not marked with __rcu that are safe to read and pass into
helpers, but the verifier doesn't know that they're safe. Aggressively marking
them as PTR_UNTRUSTED was premature.

Fixes: 6fcd486b3a ("bpf: Refactor RCU enforcement in the verifier.")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20230404045029.82870-8-alexei.starovoitov@gmail.com
2023-04-04 16:57:24 -07:00
Alexei Starovoitov
30ee9821f9 bpf: Allowlist few fields similar to __rcu tag.
Allow bpf program access cgrp->kn, mm->exe_file, skb->sk, req->sk.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20230404045029.82870-7-alexei.starovoitov@gmail.com
2023-04-04 16:57:21 -07:00
Alexei Starovoitov
add68b843f bpf: Refactor NULL-ness check in check_reg_type().
check_reg_type() unconditionally disallows PTR_TO_BTF_ID | PTR_MAYBE_NULL.
It's problematic for helpers that allow ARG_PTR_TO_BTF_ID_OR_NULL like
bpf_sk_storage_get(). Allow passing PTR_TO_BTF_ID | PTR_MAYBE_NULL into such
helpers. That technically includes bpf_kptr_xchg() helper, but in practice:
  bpf_kptr_xchg(..., bpf_cpumask_create());
is still disallowed because bpf_cpumask_create() returns ref counted pointer
with ref_obj_id > 0.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20230404045029.82870-6-alexei.starovoitov@gmail.com
2023-04-04 16:57:18 -07:00
Alexei Starovoitov
91571a515d bpf: Teach verifier that certain helpers accept NULL pointer.
bpf_[sk|inode|task|cgrp]_storage_[get|delete]() and bpf_get_socket_cookie() helpers
perform run-time check that sk|inode|task|cgrp pointer != NULL.
Teach verifier about this fact and allow bpf programs to pass
PTR_TO_BTF_ID | PTR_MAYBE_NULL into such helpers.
It will be used in the subsequent patch that will do
bpf_sk_storage_get(.., skb->sk, ...);
Even when 'skb' pointer is trusted the 'sk' pointer may be NULL.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20230404045029.82870-5-alexei.starovoitov@gmail.com
2023-04-04 16:57:16 -07:00
Alexei Starovoitov
63260df139 bpf: Refactor btf_nested_type_is_trusted().
btf_nested_type_is_trusted() tries to find a struct member at corresponding offset.
It works for flat structures and falls apart in more complex structs with nested structs.
The offset->member search is already performed by btf_struct_walk() including nested structs.
Reuse this work and pass {field name, field btf id} into btf_nested_type_is_trusted()
instead of offset to make BTF_TYPE_SAFE*() logic more robust.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20230404045029.82870-4-alexei.starovoitov@gmail.com
2023-04-04 16:57:14 -07:00
Alexei Starovoitov
b7e852a9ec bpf: Remove unused arguments from btf_struct_access().
Remove unused arguments from btf_struct_access() callback.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20230404045029.82870-3-alexei.starovoitov@gmail.com
2023-04-04 16:57:10 -07:00
Alexei Starovoitov
7d64c51328 bpf: Invoke btf_struct_access() callback only for writes.
Remove duplicated if (atype == BPF_READ) btf_struct_access() from
btf_struct_access() callback and invoke it only for writes. This is
possible to do because currently btf_struct_access() custom callback
always delegates to generic btf_struct_access() helper for BPF_READ
accesses.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20230404045029.82870-2-alexei.starovoitov@gmail.com
2023-04-04 16:57:03 -07:00
Dave Marchevsky
f6a6a5a976 bpf: Fix struct_meta lookup for bpf_obj_free_fields kfunc call
bpf_obj_drop_impl has a void return type. In check_kfunc_call, the "else
if" which sets insn_aux->kptr_struct_meta for bpf_obj_drop_impl is
surrounded by a larger if statement which checks btf_type_is_ptr. As a
result:

  * The bpf_obj_drop_impl-specific code will never execute
  * The btf_struct_meta input to bpf_obj_drop is always NULL
  * __bpf_obj_drop_impl will always see a NULL btf_record when called
    from BPF program, and won't call bpf_obj_free_fields
  * program-allocated kptrs which have fields that should be cleaned up
    by bpf_obj_free_fields may instead leak resources

This patch adds a btf_type_is_void branch to the larger if and moves
special handling for bpf_obj_drop_impl there, fixing the issue.

Fixes: ac9f06050a ("bpf: Introduce bpf_obj_drop")
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230403200027.2271029-1-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-03 14:54:21 -07:00
Anton Protopopov
92b2e810f0 bpf: compute hashes in bloom filter similar to hashmap
If the value size in a bloom filter is a multiple of 4, then the jhash2()
function is used to compute hashes. The length parameter of this function
equals to the number of 32-bit words in input. Compute it in the hot path
instead of pre-computing it, as this is translated to one extra shift to
divide the length by four vs. one extra memory load of a pre-computed length.

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230402114340.3441-1-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-02 08:44:49 -07:00
Anton Protopopov
5b85575ad4 bpf: optimize hashmap lookups when key_size is divisible by 4
The BPF hashmap uses the jhash() hash function. There is an optimized version
of this hash function which may be used if hash size is a multiple of 4. Apply
this optimization to the hashmap in a similar way as it is done in the bloom
filter map.

On practice the optimization is only noticeable for smaller key sizes, which,
however, is sufficient for many applications. An example is listed in the
following table of measurements (a hashmap of 65536 elements was used):

    --------------------------------------------------------------------
    | key_size | fullness | lookups /sec | lookups (opt) /sec |   gain |
    --------------------------------------------------------------------
    |        4 |      25% |      42.990M |            46.000M |   7.0% |
    |        4 |      50% |      37.910M |            39.094M |   3.1% |
    |        4 |      75% |      34.486M |            36.124M |   4.7% |
    |        4 |     100% |      31.760M |            32.719M |   3.0% |
    --------------------------------------------------------------------
    |        8 |      25% |      43.855M |            49.626M |  13.2% |
    |        8 |      50% |      38.328M |            42.152M |  10.0% |
    |        8 |      75% |      34.483M |            38.088M |  10.5% |
    |        8 |     100% |      31.306M |            34.686M |  10.8% |
    --------------------------------------------------------------------
    |       12 |      25% |      38.398M |            43.770M |  14.0% |
    |       12 |      50% |      33.336M |            37.712M |  13.1% |
    |       12 |      75% |      29.917M |            34.440M |  15.1% |
    |       12 |     100% |      27.322M |            30.480M |  11.6% |
    --------------------------------------------------------------------
    |       16 |      25% |      41.491M |            41.921M |   1.0% |
    |       16 |      50% |      36.206M |            36.474M |   0.7% |
    |       16 |      75% |      32.529M |            33.027M |   1.5% |
    |       16 |     100% |      29.581M |            30.325M |   2.5% |
    --------------------------------------------------------------------
    |       20 |      25% |      34.240M |            36.787M |   7.4% |
    |       20 |      50% |      30.328M |            32.663M |   7.7% |
    |       20 |      75% |      27.536M |            29.354M |   6.6% |
    |       20 |     100% |      24.847M |            26.505M |   6.7% |
    --------------------------------------------------------------------
    |       24 |      25% |      36.329M |            40.608M |  11.8% |
    |       24 |      50% |      31.444M |            35.059M |  11.5% |
    |       24 |      75% |      28.426M |            31.452M |  10.6% |
    |       24 |     100% |      26.278M |            28.741M |   9.4% |
    --------------------------------------------------------------------
    |       28 |      25% |      31.540M |            31.944M |   1.3% |
    |       28 |      50% |      27.739M |            28.063M |   1.2% |
    |       28 |      75% |      24.993M |            25.814M |   3.3% |
    |       28 |     100% |      23.513M |            23.500M |  -0.1% |
    --------------------------------------------------------------------
    |       32 |      25% |      32.116M |            33.953M |   5.7% |
    |       32 |      50% |      28.879M |            29.859M |   3.4% |
    |       32 |      75% |      26.227M |            26.948M |   2.7% |
    |       32 |     100% |      23.829M |            24.613M |   3.3% |
    --------------------------------------------------------------------
    |       64 |      25% |      22.535M |            22.554M |   0.1% |
    |       64 |      50% |      20.471M |            20.675M |   1.0% |
    |       64 |      75% |      19.077M |            19.146M |   0.4% |
    |       64 |     100% |      17.710M |            18.131M |   2.4% |
    --------------------------------------------------------------------

The following script was used to gather the results (SMT & frequency off):

    cd tools/testing/selftests/bpf
    for key_size in 4 8 12 16 20 24 28 32 64; do
            for nr_entries in `seq 16384 16384 65536`; do
                    fullness=$(printf '%3s' $((nr_entries*100/65536)))
                    echo -n "key_size=$key_size: $fullness% full: "
                    sudo ./bench -d2 -a bpf-hashmap-lookup --key_size=$key_size --nr_entries=$nr_entries --max_entries=65536 --nr_loops=2000000 --map_flags=0x40 | grep cpu
            done
            echo
    done

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230401200602.3275-1-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-01 15:08:19 -07:00
David Vernet
f85671c6ef bpf: Remove now-defunct task kfuncs
In commit 22df776a9a ("tasks: Extract rcu_users out of union"), the
'refcount_t rcu_users' field was extracted out of a union with the
'struct rcu_head rcu' field. This allows us to safely perform a
refcount_inc_not_zero() on task->rcu_users when acquiring a reference on
a task struct. A prior patch leveraged this by making struct task_struct
an RCU-protected object in the verifier, and by bpf_task_acquire() to
use the task->rcu_users field for synchronization.

Now that we can use RCU to protect tasks, we no longer need
bpf_task_kptr_get(), or bpf_task_acquire_not_zero(). bpf_task_kptr_get()
is truly completely unnecessary, as we can just use RCU to get the
object. bpf_task_acquire_not_zero() is now equivalent to
bpf_task_acquire().

In addition to these changes, this patch also updates the associated
selftests to no longer use these kfuncs.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230331195733.699708-3-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-01 09:07:20 -07:00
David Vernet
d02c48fa11 bpf: Make struct task_struct an RCU-safe type
struct task_struct objects are a bit interesting in terms of how their
lifetime is protected by refcounts. task structs have two refcount
fields:

1. refcount_t usage: Protects the memory backing the task struct. When
   this refcount drops to 0, the task is immediately freed, without
   waiting for an RCU grace period to elapse. This is the field that
   most callers in the kernel currently use to ensure that a task
   remains valid while it's being referenced, and is what's currently
   tracked with bpf_task_acquire() and bpf_task_release().

2. refcount_t rcu_users: A refcount field which, when it drops to 0,
   schedules an RCU callback that drops a reference held on the 'usage'
   field above (which is acquired when the task is first created). This
   field therefore provides a form of RCU protection on the task by
   ensuring that at least one 'usage' refcount will be held until an RCU
   grace period has elapsed. The qualifier "a form of" is important
   here, as a task can remain valid after task->rcu_users has dropped to
   0 and the subsequent RCU gp has elapsed.

In terms of BPF, we want to use task->rcu_users to protect tasks that
function as referenced kptrs, and to allow tasks stored as referenced
kptrs in maps to be accessed with RCU protection.

Let's first determine whether we can safely use task->rcu_users to
protect tasks stored in maps. All of the bpf_task* kfuncs can only be
called from tracepoint, struct_ops, or BPF_PROG_TYPE_SCHED_CLS, program
types. For tracepoint and struct_ops programs, the struct task_struct
passed to a program handler will always be trusted, so it will always be
safe to call bpf_task_acquire() with any task passed to a program.
Note, however, that we must update bpf_task_acquire() to be KF_RET_NULL,
as it is possible that the task has exited by the time the program is
invoked, even if the pointer is still currently valid because the main
kernel holds a task->usage refcount. For BPF_PROG_TYPE_SCHED_CLS, tasks
should never be passed as an argument to the any program handlers, so it
should not be relevant.

The second question is whether it's safe to use RCU to access a task
that was acquired with bpf_task_acquire(), and stored in a map. Because
bpf_task_acquire() now uses task->rcu_users, it follows that if the task
is present in the map, that it must have had at least one
task->rcu_users refcount by the time the current RCU cs was started.
Therefore, it's safe to access that task until the end of the current
RCU cs.

With all that said, this patch makes struct task_struct is an
RCU-protected object. In doing so, we also change bpf_task_acquire() to
be KF_ACQUIRE | KF_RCU | KF_RET_NULL, and adjust any selftests as
necessary. A subsequent patch will remove bpf_task_kptr_get(), and
bpf_task_acquire_not_zero() respectively.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230331195733.699708-2-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-01 09:07:20 -07:00
David Vernet
e4c2acab95 bpf: Handle PTR_MAYBE_NULL case in PTR_TO_BTF_ID helper call arg
When validating a helper function argument, we use check_reg_type() to
ensure that the register containing the argument is of the correct type.
When the register's base type is PTR_TO_BTF_ID, there is some
supplemental logic where we do extra checks for various combinations of
PTR_TO_BTF_ID type modifiers. For example, for PTR_TO_BTF_ID,
PTR_TO_BTF_ID | PTR_TRUSTED, and PTR_TO_BTF_ID | MEM_RCU, we call
map_kptr_match_type() for bpf_kptr_xchg() calls, and
btf_struct_ids_match() for other helper calls.

When an unhandled PTR_TO_BTF_ID type modifier combination is passed to
check_reg_type(), the verifier fails with an internal verifier error
message. This can currently be triggered by passing a PTR_MAYBE_NULL
pointer to helper functions (currently just bpf_kptr_xchg()) with an
ARG_PTR_TO_BTF_ID_OR_NULL arg type. For example, by callin
bpf_kptr_xchg(&v->kptr, bpf_cpumask_create()).

Whether or not passing a PTR_MAYBE_NULL arg to an
ARG_PTR_TO_BTF_ID_OR_NULL argument is valid is an interesting question.
In a vacuum, it seems fine. A helper function with an
ARG_PTR_TO_BTF_ID_OR_NULL arg would seem to be implying that it can
handle either a NULL or non-NULL arg, and has logic in place to detect
and gracefully handle each. This is the case for bpf_kptr_xchg(), which
of course simply does an xchg(). On the other hand, bpf_kptr_xchg() also
specifies OBJ_RELEASE, and refcounting semantics for a PTR_MAYBE_NULL
pointer is different than handling it for a NULL _OR_ non-NULL pointer.
For example, with a non-NULL arg, we should always fail if there was not
a nonzero refcount for the value in the register being passed to the
helper. For PTR_MAYBE_NULL on the other hand, it's unclear. If the
pointer is NULL it would be fine, but if it's not NULL, it would be
incorrect to load the program.

The current solution to this is to just fail if PTR_MAYBE_NULL is
passed, and to instead require programs to have a NULL check to
explicitly handle the NULL and non-NULL cases. This seems reasonable.
Not only would it possibly be quite complicated to correctly handle
PTR_MAYBE_NULL refcounting in the verifier, but it's also an arguably
odd programming pattern in general to not explicitly handle the NULL
case anyways. For example, it seems odd to not care about whether a
pointer you're passing to bpf_kptr_xchg() was successfully allocated in
a program such as the following:

private(MASK) static struct bpf_cpumask __kptr * global_mask;

SEC("tp_btf/task_newtask")
int BPF_PROG(example, struct task_struct *task, u64 clone_flags)
{
        struct bpf_cpumask *prev;

	/* bpf_cpumask_create() returns PTR_MAYBE_NULL */
	prev = bpf_kptr_xchg(&global_mask, bpf_cpumask_create());
	if (prev)
		bpf_cpumask_release(prev);

	return 0;
}

This patch therefore updates the verifier to explicitly check for
PTR_MAYBE_NULL in check_reg_type(), and fail gracefully if it's
observed. This isn't really "fixing" anything unsafe or incorrect. We're
just updating the verifier to fail gracefully, and explicitly handle
this pattern rather than unintentionally falling back to an internal
verifier error path. A subsequent patch will update selftests.

Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230330145203.80506-1-void@manifault.com
2023-03-30 14:12:22 -07:00
Martin KaFai Lau
6ae9d5e99e bpf: Use bpf_mem_cache_alloc/free for bpf_local_storage
This patch uses bpf_mem_cache_alloc/free for allocating and freeing
bpf_local_storage for task and cgroup storage.

The changes are similar to the previous patch. A few things that
worth to mention for bpf_local_storage:

The local_storage is freed when the last selem is deleted.
Before deleting a selem from local_storage, it needs to retrieve the
local_storage->smap because the bpf_selem_unlink_storage_nolock()
may have set it to NULL. Note that local_storage->smap may have
already been NULL when the selem created this local_storage has
been removed. In this case, call_rcu will be used to free the
local_storage.
Also, the bpf_ma (true or false) value is needed before calling
bpf_local_storage_free(). The bpf_ma can either be obtained from
the local_storage->smap (if available) or any of its selem's smap.
A new helper check_storage_bpf_ma() is added to obtain
bpf_ma for a deleting bpf_local_storage.

When bpf_local_storage_alloc getting a reused memory, all
fields are either in the correct values or will be initialized.
'cache[]' must already be all NULLs. 'list' must be empty.
Others will be initialized.

Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20230322215246.1675516-4-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-25 19:52:52 -07:00
Martin KaFai Lau
08a7ce384e bpf: Use bpf_mem_cache_alloc/free in bpf_local_storage_elem
This patch uses bpf_mem_alloc for the task and cgroup local storage that
the bpf prog can easily get a hold of the storage owner's PTR_TO_BTF_ID.
eg. bpf_get_current_task_btf() can be used in some of the kmalloc code
path which will cause deadlock/recursion. bpf_mem_cache_alloc is
deadlock free and will solve a legit use case in [1].

For sk storage, its batch creation benchmark shows a few percent
regression when the sk create/destroy batch size is larger than 32.
The sk creation/destruction happens much more often and
depends on external traffic. Considering it is hypothetical
to be able to cause deadlock with sk storage, it can cross
the bridge to use bpf_mem_alloc till a legit (ie. useful)
use case comes up.

For inode storage, bpf_local_storage_destroy() is called before
waiting for a rcu gp and its memory cannot be reused immediately.
inode stays with kmalloc/kfree after the rcu [or tasks_trace] gp.

A 'bool bpf_ma' argument is added to bpf_local_storage_map_alloc().
Only task and cgroup storage have 'bpf_ma == true' which
means to use bpf_mem_cache_alloc/free(). This patch only changes
selem to use bpf_mem_alloc for task and cgroup. The next patch
will change the local_storage to use bpf_mem_alloc also for
task and cgroup.

Here is some more details on the changes:

* memory allocation:
After bpf_mem_cache_alloc(), the SDATA(selem)->data is zero-ed because
bpf_mem_cache_alloc() could return a reused selem. It is to keep
the existing bpf_map_kzalloc() behavior. Only SDATA(selem)->data
is zero-ed. SDATA(selem)->data is the visible part to the bpf prog.
No need to use zero_map_value() to do the zeroing because
bpf_selem_free(..., reuse_now = true) ensures no bpf prog is using
the selem before returning the selem through bpf_mem_cache_free().
For the internal fields of selem, they will be initialized when
linking to the new smap and the new local_storage.

When 'bpf_ma == false', nothing changes in this patch. It will
stay with the bpf_map_kzalloc().

* memory free:
The bpf_selem_free() and bpf_selem_free_rcu() are modified to handle
the bpf_ma == true case.

For the common selem free path where its owner is also being destroyed,
the mem is freed in bpf_local_storage_destroy(), the owner (task
and cgroup) has gone through a rcu gp. The memory can be reused
immediately, so bpf_local_storage_destroy() will call
bpf_selem_free(..., reuse_now = true) which will do
bpf_mem_cache_free() for immediate reuse consideration.

An exception is the delete elem code path. The delete elem code path
is called from the helper bpf_*_storage_delete() and the syscall
bpf_map_delete_elem(). This path is an unusual case for local
storage because the common use case is to have the local storage
staying with its owner life time so that the bpf prog and the user
space does not have to monitor the owner's destruction. For the delete
elem path, the selem cannot be reused immediately because there could
be bpf prog using it. It will call bpf_selem_free(..., reuse_now = false)
and it will wait for a rcu tasks trace gp before freeing the elem. The
rcu callback is changed to do bpf_mem_cache_raw_free() instead of kfree().

When 'bpf_ma == false', it should be the same as before.
__bpf_selem_free() is added to do the kfree_rcu and call_tasks_trace_rcu().
A few words on the 'reuse_now == true'. When 'reuse_now == true',
it is still racing with bpf_local_storage_map_free which is under rcu
protection, so it still needs to wait for a rcu gp instead of kfree().
Otherwise, the selem may be reused by slab for a totally different struct
while the bpf_local_storage_map_free() is still using it (as a
rcu reader). For the inode case, there may be other rcu readers also.
In short, when bpf_ma == false and reuse_now == true => vanilla rcu.

[1]: https://lore.kernel.org/bpf/20221118190109.1512674-1-namhyung@kernel.org/

Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20230322215246.1675516-3-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-25 19:52:52 -07:00
Martin KaFai Lau
e65a5c6edb bpf: Add a few bpf mem allocator functions
This patch adds a few bpf mem allocator functions which will
be used in the bpf_local_storage in a later patch.

bpf_mem_cache_alloc_flags(..., gfp_t flags) is added. When the
flags == GFP_KERNEL, it will fallback to __alloc(..., GFP_KERNEL).
bpf_local_storage knows its running context is sleepable (GFP_KERNEL)
and provides a better guarantee on memory allocation.

bpf_local_storage has some uncommon cases that its selem
cannot be reused immediately. It handles its own
rcu_head and goes through a rcu_trace gp and then free it.
bpf_mem_cache_raw_free() is added for direct free purpose
without leaking the LLIST_NODE_SZ internal knowledge.
During free time, the 'struct bpf_mem_alloc *ma' is no longer
available. However, the caller should know if it is
percpu memory or not and it can call different raw_free functions.
bpf_local_storage does not support percpu value, so only
the non-percpu 'bpf_mem_cache_raw_free()' is added in
this patch.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20230322215246.1675516-2-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-25 19:52:51 -07:00
David Vernet
6c831c4684 bpf: Treat KF_RELEASE kfuncs as KF_TRUSTED_ARGS
KF_RELEASE kfuncs are not currently treated as having KF_TRUSTED_ARGS,
even though they have a superset of the requirements of KF_TRUSTED_ARGS.
Like KF_TRUSTED_ARGS, KF_RELEASE kfuncs require a 0-offset argument, and
don't allow NULL-able arguments. Unlike KF_TRUSTED_ARGS which require
_either_ an argument with ref_obj_id > 0, _or_ (ref->type &
BPF_REG_TRUSTED_MODIFIERS) (and no unsafe modifiers allowed), KF_RELEASE
only allows for ref_obj_id > 0.  Because KF_RELEASE today doesn't
automatically imply KF_TRUSTED_ARGS, some of these requirements are
enforced in different ways that can make the behavior of the verifier
feel unpredictable. For example, a KF_RELEASE kfunc with a NULL-able
argument will currently fail in the verifier with a message like, "arg#0
is ptr_or_null_ expected ptr_ or socket" rather than "Possibly NULL
pointer passed to trusted arg0". Our intention is the same, but the
semantics are different due to implemenetation details that kfunc authors
and BPF program writers should not need to care about.

Let's make the behavior of the verifier more consistent and intuitive by
having KF_RELEASE kfuncs imply the presence of KF_TRUSTED_ARGS. Our
eventual goal is to have all kfuncs assume KF_TRUSTED_ARGS by default
anyways, so this takes us a step in that direction.

Note that it does not make sense to assume KF_TRUSTED_ARGS for all
KF_ACQUIRE kfuncs. KF_ACQUIRE kfuncs can have looser semantics than
KF_RELEASE, with e.g. KF_RCU | KF_RET_NULL. We may want to have
KF_ACQUIRE imply KF_TRUSTED_ARGS _unless_ KF_RCU is specified, but that
can be left to another patch set, and there are no such subtleties to
address for KF_RELEASE.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230325213144.486885-4-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-25 16:56:22 -07:00
David Vernet
fb2211a57c bpf: Remove now-unnecessary NULL checks for KF_RELEASE kfuncs
Now that we're not invoking kfunc destructors when the kptr in a map was
NULL, we no longer require NULL checks in many of our KF_RELEASE kfuncs.
This patch removes those NULL checks.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230325213144.486885-3-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-25 16:56:22 -07:00
David Vernet
1431d0b584 bpf: Only invoke kptr dtor following non-NULL xchg
When a map value is being freed, we loop over all of the fields of the
corresponding BPF object and issue the appropriate cleanup calls
corresponding to the field's type. If the field is a referenced kptr, we
atomically xchg the value out of the map, and invoke the kptr's
destructor on whatever was there before (or bpf_obj_drop() it if it was
a local kptr).

Currently, we always invoke the destructor (either bpf_obj_drop() or the
kptr's registered destructor) on any KPTR_REF-type field in a map, even
if there wasn't a value in the map. This means that any function serving
as the kptr's KF_RELEASE destructor must always treat the argument as
possibly NULL, as the following can and regularly does happen:

void *xchgd_field;

/* No value was in the map, so xchgd_field is NULL */
xchgd_field = (void *)xchg(unsigned long *field_ptr, 0);
field->kptr.dtor(xchgd_field);

These are odd semantics to impose on KF_RELEASE kfuncs -- BPF programs
are prohibited by the verifier from passing NULL pointers to KF_RELEASE
kfuncs, so it doesn't make sense to require this of BPF programs, but
not the main kernel destructor path. It's also unnecessary to invoke any
cleanup logic for local kptrs. If there is no object there, there's
nothing to drop.

So as to allow KF_RELEASE kfuncs to fully assume that an argument is
non-NULL, this patch updates a KPTR_REF's destructor to only be invoked
when a non-NULL value is xchg'd out of the kptr map field.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230325213144.486885-2-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-25 16:56:22 -07:00
Martin KaFai Lau
55fbae0547 bpf: Check IS_ERR for the bpf_map_get() return value
This patch fixes a mistake in checking NULL instead of
checking IS_ERR for the bpf_map_get() return value.

It also fixes the return value in link_update_map() from -EINVAL
to PTR_ERR(*_map).

Reported-by: syzbot+71ccc0fe37abb458406b@syzkaller.appspotmail.com
Fixes: 68b04864ca ("bpf: Create links for BPF struct_ops maps.")
Fixes: aef56f2e91 ("bpf: Update the struct_ops of a bpf_link.")
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Kui-Feng Lee <kuifeng@meta.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230324184241.1387437-1-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-24 12:40:47 -07:00
Jakub Kicinski
dc0a7b5200 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:

drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
  6e9d51b1a5 ("net/mlx5e: Initialize link speed to zero")
  1bffcea429 ("net/mlx5e: Add devlink hairpin queues parameters")
https://lore.kernel.org/all/20230324120623.4ebbc66f@canb.auug.org.au/
https://lore.kernel.org/all/20230321211135.47711-1-saeed@kernel.org/

Adjacent changes:

drivers/net/phy/phy.c
  323fe43cf9 ("net: phy: Improved PHY error reporting in state machine")
  4203d84032 ("net: phy: Ensure state transitions are processed from phy_stop()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-24 10:10:20 -07:00
Kui-Feng Lee
aef56f2e91 bpf: Update the struct_ops of a bpf_link.
By improving the BPF_LINK_UPDATE command of bpf(), it should allow you
to conveniently switch between different struct_ops on a single
bpf_link. This would enable smoother transitions from one struct_ops
to another.

The struct_ops maps passing along with BPF_LINK_UPDATE should have the
BPF_F_LINK flag.

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230323032405.3735486-6-kuifeng@meta.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-03-22 22:53:02 -07:00
Kui-Feng Lee
68b04864ca bpf: Create links for BPF struct_ops maps.
Make bpf_link support struct_ops.  Previously, struct_ops were always
used alone without any associated links. Upon updating its value, a
struct_ops would be activated automatically. Yet other BPF program
types required to make a bpf_link with their instances before they
could become active. Now, however, you can create an inactive
struct_ops, and create a link to activate it later.

With bpf_links, struct_ops has a behavior similar to other BPF program
types. You can pin/unpin them from their links and the struct_ops will
be deactivated when its link is removed while previously need someone
to delete the value for it to be deactivated.

bpf_links are responsible for registering their associated
struct_ops. You can only use a struct_ops that has the BPF_F_LINK flag
set to create a bpf_link, while a structs without this flag behaves in
the same manner as before and is registered upon updating its value.

The BPF_LINK_TYPE_STRUCT_OPS serves a dual purpose. Not only is it
used to craft the links for BPF struct_ops programs, but also to
create links for BPF struct_ops them-self.  Since the links of BPF
struct_ops programs are only used to create trampolines internally,
they are never seen in other contexts. Thus, they can be reused for
struct_ops themself.

To maintain a reference to the map supporting this link, we add
bpf_struct_ops_link as an additional type. The pointer of the map is
RCU and won't be necessary until later in the patchset.

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
Link: https://lore.kernel.org/r/20230323032405.3735486-4-kuifeng@meta.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-03-22 22:53:02 -07:00
Kui-Feng Lee
b671c2067a bpf: Retire the struct_ops map kvalue->refcnt.
We have replaced kvalue-refcnt with synchronize_rcu() to wait for an
RCU grace period.

Maintenance of kvalue->refcnt was a complicated task, as we had to
simultaneously keep track of two reference counts: one for the
reference count of bpf_map. When the kvalue->refcnt reaches zero, we
also have to reduce the reference count on bpf_map - yet these steps
are not performed in an atomic manner and require us to be vigilant
when managing them. By eliminating kvalue->refcnt, we can make our
maintenance more straightforward as the refcount of bpf_map is now
solely managed!

To prevent the trampoline image of a struct_ops from being released
while it is still in use, we wait for an RCU grace period. The
setsockopt(TCP_CONGESTION, "...") command allows you to change your
socket's congestion control algorithm and can result in releasing the
old struct_ops implementation. It is fine. However, this function is
exposed through bpf_setsockopt(), it may be accessed by BPF programs
as well. To ensure that the trampoline image belonging to struct_op
can be safely called while its method is in use, the trampoline
safeguarde the BPF program with rcu_read_lock(). Doing so prevents any
destruction of the associated images before returning from a
trampoline and requires us to wait for an RCU grace period.

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
Link: https://lore.kernel.org/r/20230323032405.3735486-2-kuifeng@meta.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-03-22 22:51:47 -07:00
Andrii Nakryiko
b63cbc490e bpf: remember meta->iter info only for initialized iters
For iter_new() functions iterator state's slot might not be yet
initialized, in which case iter_get_spi() will return -ERANGE. This is
expected and is handled properly. But for iter_next() and iter_destroy()
cases iter slot is supposed to be initialized and correct, so -ERANGE is
not possible.

Move meta->iter.{spi,frameno} initialization into iter_next/iter_destroy
handling branch to make it more explicit that valid information will be
remembered in meta->iter block for subsequent use in process_iter_next_call(),
avoiding confusingly looking -ERANGE assignment for meta->iter.spi.

Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230322232502.836171-1-andrii@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-03-22 17:04:47 -07:00
Daniel Borkmann
7be14c1c90 bpf: Fix __reg_bound_offset 64->32 var_off subreg propagation
Xu reports that after commit 3f50f132d8 ("bpf: Verifier, do explicit ALU32
bounds tracking"), the following BPF program is rejected by the verifier:

   0: (61) r2 = *(u32 *)(r1 +0)          ; R2_w=pkt(off=0,r=0,imm=0)
   1: (61) r3 = *(u32 *)(r1 +4)          ; R3_w=pkt_end(off=0,imm=0)
   2: (bf) r1 = r2
   3: (07) r1 += 1
   4: (2d) if r1 > r3 goto pc+8
   5: (71) r1 = *(u8 *)(r2 +0)           ; R1_w=scalar(umax=255,var_off=(0x0; 0xff))
   6: (18) r0 = 0x7fffffffffffff10
   8: (0f) r1 += r0                      ; R1_w=scalar(umin=0x7fffffffffffff10,umax=0x800000000000000f)
   9: (18) r0 = 0x8000000000000000
  11: (07) r0 += 1
  12: (ad) if r0 < r1 goto pc-2
  13: (b7) r0 = 0
  14: (95) exit

And the verifier log says:

  func#0 @0
  0: R1=ctx(off=0,imm=0) R10=fp0
  0: (61) r2 = *(u32 *)(r1 +0)          ; R1=ctx(off=0,imm=0) R2_w=pkt(off=0,r=0,imm=0)
  1: (61) r3 = *(u32 *)(r1 +4)          ; R1=ctx(off=0,imm=0) R3_w=pkt_end(off=0,imm=0)
  2: (bf) r1 = r2                       ; R1_w=pkt(off=0,r=0,imm=0) R2_w=pkt(off=0,r=0,imm=0)
  3: (07) r1 += 1                       ; R1_w=pkt(off=1,r=0,imm=0)
  4: (2d) if r1 > r3 goto pc+8          ; R1_w=pkt(off=1,r=1,imm=0) R3_w=pkt_end(off=0,imm=0)
  5: (71) r1 = *(u8 *)(r2 +0)           ; R1_w=scalar(umax=255,var_off=(0x0; 0xff)) R2_w=pkt(off=0,r=1,imm=0)
  6: (18) r0 = 0x7fffffffffffff10       ; R0_w=9223372036854775568
  8: (0f) r1 += r0                      ; R0_w=9223372036854775568 R1_w=scalar(umin=9223372036854775568,umax=9223372036854775823,s32_min=-240,s32_max=15)
  9: (18) r0 = 0x8000000000000000       ; R0_w=-9223372036854775808
  11: (07) r0 += 1                      ; R0_w=-9223372036854775807
  12: (ad) if r0 < r1 goto pc-2         ; R0_w=-9223372036854775807 R1_w=scalar(umin=9223372036854775568,umax=9223372036854775809)
  13: (b7) r0 = 0                       ; R0_w=0
  14: (95) exit

  from 12 to 11: R0_w=-9223372036854775807 R1_w=scalar(umin=9223372036854775810,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0
  11: (07) r0 += 1                      ; R0_w=-9223372036854775806
  12: (ad) if r0 < r1 goto pc-2         ; R0_w=-9223372036854775806 R1_w=scalar(umin=9223372036854775810,umax=9223372036854775810,var_off=(0x8000000000000000; 0xffffffff))
  13: safe

  [...]

  from 12 to 11: R0_w=-9223372036854775795 R1=scalar(umin=9223372036854775822,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0
  11: (07) r0 += 1                      ; R0_w=-9223372036854775794
  12: (ad) if r0 < r1 goto pc-2         ; R0_w=-9223372036854775794 R1=scalar(umin=9223372036854775822,umax=9223372036854775822,var_off=(0x8000000000000000; 0xffffffff))
  13: safe

  from 12 to 11: R0_w=-9223372036854775794 R1=scalar(umin=9223372036854775823,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0
  11: (07) r0 += 1                      ; R0_w=-9223372036854775793
  12: (ad) if r0 < r1 goto pc-2         ; R0_w=-9223372036854775793 R1=scalar(umin=9223372036854775823,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff))
  13: safe

  from 12 to 11: R0_w=-9223372036854775793 R1=scalar(umin=9223372036854775824,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0
  11: (07) r0 += 1                      ; R0_w=-9223372036854775792
  12: (ad) if r0 < r1 goto pc-2         ; R0_w=-9223372036854775792 R1=scalar(umin=9223372036854775824,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff))
  13: safe

  [...]

The 64bit umin=9223372036854775810 bound continuously bumps by +1 while
umax=9223372036854775823 stays as-is until the verifier complexity limit
is reached and the program gets finally rejected. During this simulation,
the umin also eventually surpasses umax. Looking at the first 'from 12
to 11' output line from the loop, R1 has the following state:

  R1_w=scalar(umin=0x8000000000000002 (9223372036854775810),
              umax=0x800000000000000f (9223372036854775823),
          var_off=(0x8000000000000000;
                           0xffffffff))

The var_off has technically not an inconsistent state but it's very
imprecise and far off surpassing 64bit umax bounds whereas the expected
output with refined known bits in var_off should have been like:

  R1_w=scalar(umin=0x8000000000000002 (9223372036854775810),
              umax=0x800000000000000f (9223372036854775823),
          var_off=(0x8000000000000000;
                                  0xf))

In the above log, var_off stays as var_off=(0x8000000000000000; 0xffffffff)
and does not converge into a narrower mask where more bits become known,
eventually transforming R1 into a constant upon umin=9223372036854775823,
umax=9223372036854775823 case where the verifier would have terminated and
let the program pass.

The __reg_combine_64_into_32() marks the subregister unknown and propagates
64bit {s,u}min/{s,u}max bounds to their 32bit equivalents iff they are within
the 32bit universe. The question came up whether __reg_combine_64_into_32()
should special case the situation that when 64bit {s,u}min bounds have
the same value as 64bit {s,u}max bounds to then assign the latter as
well to the 32bit reg->{s,u}32_{min,max}_value. As can be seen from the
above example however, that is just /one/ special case and not a /generic/
solution given above example would still not be addressed this way and
remain at an imprecise var_off=(0x8000000000000000; 0xffffffff).

The improvement is needed in __reg_bound_offset() to refine var32_off with
the updated var64_off instead of the prior reg->var_off. The reg_bounds_sync()
code first refines information about the register's min/max bounds via
__update_reg_bounds() from the current var_off, then in __reg_deduce_bounds()
from sign bit and with the potentially learned bits from bounds it'll
update the var_off tnum in __reg_bound_offset(). For example, intersecting
with the old var_off might have improved bounds slightly, e.g. if umax
was 0x7f...f and var_off was (0; 0xf...fc), then new var_off will then
result in (0; 0x7f...fc). The intersected var64_off holds then the
universe which is a superset of var32_off. The point for the latter is
not to broaden, but to further refine known bits based on the intersection
of var_off with 32 bit bounds, so that we later construct the final var_off
from upper and lower 32 bits. The final __update_reg_bounds() can then
potentially still slightly refine bounds if more bits became known from the
new var_off.

After the improvement, we can see R1 converging successively:

  func#0 @0
  0: R1=ctx(off=0,imm=0) R10=fp0
  0: (61) r2 = *(u32 *)(r1 +0)          ; R1=ctx(off=0,imm=0) R2_w=pkt(off=0,r=0,imm=0)
  1: (61) r3 = *(u32 *)(r1 +4)          ; R1=ctx(off=0,imm=0) R3_w=pkt_end(off=0,imm=0)
  2: (bf) r1 = r2                       ; R1_w=pkt(off=0,r=0,imm=0) R2_w=pkt(off=0,r=0,imm=0)
  3: (07) r1 += 1                       ; R1_w=pkt(off=1,r=0,imm=0)
  4: (2d) if r1 > r3 goto pc+8          ; R1_w=pkt(off=1,r=1,imm=0) R3_w=pkt_end(off=0,imm=0)
  5: (71) r1 = *(u8 *)(r2 +0)           ; R1_w=scalar(umax=255,var_off=(0x0; 0xff)) R2_w=pkt(off=0,r=1,imm=0)
  6: (18) r0 = 0x7fffffffffffff10       ; R0_w=9223372036854775568
  8: (0f) r1 += r0                      ; R0_w=9223372036854775568 R1_w=scalar(umin=9223372036854775568,umax=9223372036854775823,s32_min=-240,s32_max=15)
  9: (18) r0 = 0x8000000000000000       ; R0_w=-9223372036854775808
  11: (07) r0 += 1                      ; R0_w=-9223372036854775807
  12: (ad) if r0 < r1 goto pc-2         ; R0_w=-9223372036854775807 R1_w=scalar(umin=9223372036854775568,umax=9223372036854775809)
  13: (b7) r0 = 0                       ; R0_w=0
  14: (95) exit

  from 12 to 11: R0_w=-9223372036854775807 R1_w=scalar(umin=9223372036854775810,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0
  11: (07) r0 += 1                      ; R0_w=-9223372036854775806
  12: (ad) if r0 < r1 goto pc-2         ; R0_w=-9223372036854775806 R1_w=-9223372036854775806
  13: safe

  from 12 to 11: R0_w=-9223372036854775806 R1_w=scalar(umin=9223372036854775811,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0
  11: (07) r0 += 1                      ; R0_w=-9223372036854775805
  12: (ad) if r0 < r1 goto pc-2         ; R0_w=-9223372036854775805 R1_w=-9223372036854775805
  13: safe

  [...]

  from 12 to 11: R0_w=-9223372036854775798 R1=scalar(umin=9223372036854775819,umax=9223372036854775823,var_off=(0x8000000000000008; 0x7),s32_min=8,s32_max=15,u32_min=8,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0
  11: (07) r0 += 1                      ; R0_w=-9223372036854775797
  12: (ad) if r0 < r1 goto pc-2         ; R0_w=-9223372036854775797 R1=-9223372036854775797
  13: safe

  from 12 to 11: R0_w=-9223372036854775797 R1=scalar(umin=9223372036854775820,umax=9223372036854775823,var_off=(0x800000000000000c; 0x3),s32_min=12,s32_max=15,u32_min=12,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0
  11: (07) r0 += 1                      ; R0_w=-9223372036854775796
  12: (ad) if r0 < r1 goto pc-2         ; R0_w=-9223372036854775796 R1=-9223372036854775796
  13: safe

  from 12 to 11: R0_w=-9223372036854775796 R1=scalar(umin=9223372036854775821,umax=9223372036854775823,var_off=(0x800000000000000c; 0x3),s32_min=12,s32_max=15,u32_min=12,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0
  11: (07) r0 += 1                      ; R0_w=-9223372036854775795
  12: (ad) if r0 < r1 goto pc-2         ; R0_w=-9223372036854775795 R1=-9223372036854775795
  13: safe

  from 12 to 11: R0_w=-9223372036854775795 R1=scalar(umin=9223372036854775822,umax=9223372036854775823,var_off=(0x800000000000000e; 0x1),s32_min=14,s32_max=15,u32_min=14,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0
  11: (07) r0 += 1                      ; R0_w=-9223372036854775794
  12: (ad) if r0 < r1 goto pc-2         ; R0_w=-9223372036854775794 R1=-9223372036854775794
  13: safe

  from 12 to 11: R0_w=-9223372036854775794 R1=-9223372036854775793 R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0
  11: (07) r0 += 1                      ; R0_w=-9223372036854775793
  12: (ad) if r0 < r1 goto pc-2
  last_idx 12 first_idx 12
  parent didn't have regs=1 stack=0 marks: R0_rw=P-9223372036854775801 R1_r=scalar(umin=9223372036854775815,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0
  last_idx 11 first_idx 11
  regs=1 stack=0 before 11: (07) r0 += 1
  parent didn't have regs=1 stack=0 marks: R0_rw=P-9223372036854775805 R1_rw=scalar(umin=9223372036854775812,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0
  last_idx 12 first_idx 0
  regs=1 stack=0 before 12: (ad) if r0 < r1 goto pc-2
  regs=1 stack=0 before 11: (07) r0 += 1
  regs=1 stack=0 before 12: (ad) if r0 < r1 goto pc-2
  regs=1 stack=0 before 11: (07) r0 += 1
  regs=1 stack=0 before 12: (ad) if r0 < r1 goto pc-2
  regs=1 stack=0 before 11: (07) r0 += 1
  regs=1 stack=0 before 9: (18) r0 = 0x8000000000000000
  last_idx 12 first_idx 12
  parent didn't have regs=2 stack=0 marks: R0_rw=P-9223372036854775801 R1_r=Pscalar(umin=9223372036854775815,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0
  last_idx 11 first_idx 11
  regs=2 stack=0 before 11: (07) r0 += 1
  parent didn't have regs=2 stack=0 marks: R0_rw=P-9223372036854775805 R1_rw=Pscalar(umin=9223372036854775812,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0
  last_idx 12 first_idx 0
  regs=2 stack=0 before 12: (ad) if r0 < r1 goto pc-2
  regs=2 stack=0 before 11: (07) r0 += 1
  regs=2 stack=0 before 12: (ad) if r0 < r1 goto pc-2
  regs=2 stack=0 before 11: (07) r0 += 1
  regs=2 stack=0 before 12: (ad) if r0 < r1 goto pc-2
  regs=2 stack=0 before 11: (07) r0 += 1
  regs=2 stack=0 before 9: (18) r0 = 0x8000000000000000
  regs=2 stack=0 before 8: (0f) r1 += r0
  regs=3 stack=0 before 6: (18) r0 = 0x7fffffffffffff10
  regs=2 stack=0 before 5: (71) r1 = *(u8 *)(r2 +0)
  13: safe

  from 4 to 13: safe
  verification time 322 usec
  stack depth 0
  processed 56 insns (limit 1000000) max_states_per_insn 1 total_states 3 peak_states 3 mark_read 1

This also fixes up a test case along with this improvement where we match
on the verifier log. The updated log now has a refined var_off, too.

Fixes: 3f50f132d8 ("bpf: Verifier, do explicit ALU32 bounds tracking")
Reported-by: Xu Kuohai <xukuohai@huaweicloud.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20230314203424.4015351-2-xukuohai@huaweicloud.com
Link: https://lore.kernel.org/bpf/20230322213056.2470-1-daniel@iogearbox.net
2023-03-22 16:49:25 -07:00
JP Kobryn
d7ba4cc900 bpf: return long from bpf_map_ops funcs
This patch changes the return types of bpf_map_ops functions to long, where
previously int was returned. Using long allows for bpf programs to maintain
the sign bit in the absence of sign extension during situations where
inlined bpf helper funcs make calls to the bpf_map_ops funcs and a negative
error is returned.

The definitions of the helper funcs are generated from comments in the bpf
uapi header at `include/uapi/linux/bpf.h`. The return type of these
helpers was previously changed from int to long in commit bdb7b79b4c. For
any case where one of the map helpers call the bpf_map_ops funcs that are
still returning 32-bit int, a compiler might not include sign extension
instructions to properly convert the 32-bit negative value a 64-bit
negative value.

For example:
bpf assembly excerpt of an inlined helper calling a kernel function and
checking for a specific error:

; err = bpf_map_update_elem(&mymap, &key, &val, BPF_NOEXIST);
  ...
  46:	call   0xffffffffe103291c	; htab_map_update_elem
; if (err && err != -EEXIST) {
  4b:	cmp    $0xffffffffffffffef,%rax ; cmp -EEXIST,%rax

kernel function assembly excerpt of return value from
`htab_map_update_elem` returning 32-bit int:

movl $0xffffffef, %r9d
...
movl %r9d, %eax

...results in the comparison:
cmp $0xffffffffffffffef, $0x00000000ffffffef

Fixes: bdb7b79b4c ("bpf: Switch most helper return values from 32-bit int to 64-bit long")
Tested-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Link: https://lore.kernel.org/r/20230322194754.185781-3-inwardvessel@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-22 15:11:30 -07:00
Alexei Starovoitov
1057d29945 bpf: Teach the verifier to recognize rdonly_mem as not null.
Teach the verifier to recognize PTR_TO_MEM | MEM_RDONLY as not NULL
otherwise if (!bpf_ksym_exists(known_kfunc)) doesn't go through
dead code elimination.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20230321203854.3035-3-alexei.starovoitov@gmail.com
2023-03-22 09:31:05 -07:00
Daniel Borkmann
10ec8ca8ec bpf: Adjust insufficient default bpf_jit_limit
We've seen recent AWS EKS (Kubernetes) user reports like the following:

  After upgrading EKS nodes from v20230203 to v20230217 on our 1.24 EKS
  clusters after a few days a number of the nodes have containers stuck
  in ContainerCreating state or liveness/readiness probes reporting the
  following error:

    Readiness probe errored: rpc error: code = Unknown desc = failed to
    exec in container: failed to start exec "4a11039f730203ffc003b7[...]":
    OCI runtime exec failed: exec failed: unable to start container process:
    unable to init seccomp: error loading seccomp filter into kernel:
    error loading seccomp filter: errno 524: unknown

  However, we had not been seeing this issue on previous AMIs and it only
  started to occur on v20230217 (following the upgrade from kernel 5.4 to
  5.10) with no other changes to the underlying cluster or workloads.

  We tried the suggestions from that issue (sysctl net.core.bpf_jit_limit=452534528)
  which helped to immediately allow containers to be created and probes to
  execute but after approximately a day the issue returned and the value
  returned by cat /proc/vmallocinfo | grep bpf_jit | awk '{s+=$2} END {print s}'
  was steadily increasing.

I tested bpf tree to observe bpf_jit_charge_modmem, bpf_jit_uncharge_modmem
their sizes passed in as well as bpf_jit_current under tcpdump BPF filter,
seccomp BPF and native (e)BPF programs, and the behavior all looks sane
and expected, that is nothing "leaking" from an upstream perspective.

The bpf_jit_limit knob was originally added in order to avoid a situation
where unprivileged applications loading BPF programs (e.g. seccomp BPF
policies) consuming all the module memory space via BPF JIT such that loading
of kernel modules would be prevented. The default limit was defined back in
2018 and while good enough back then, we are generally seeing far more BPF
consumers today.

Adjust the limit for the BPF JIT pool from originally 1/4 to now 1/2 of the
module memory space to better reflect today's needs and avoid more users
running into potentially hard to debug issues.

Fixes: fdadd04931 ("bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K")
Reported-by: Stephen Haynes <sh@synk.net>
Reported-by: Lefteris Alexakis <lefteris.alexakis@kpn.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://github.com/awslabs/amazon-eks-ami/issues/1179
Link: https://github.com/awslabs/amazon-eks-ami/issues/1219
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230320143725.8394-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-21 12:43:05 -07:00
Florent Revest
da8bdfbd42 ftrace: Rename _ftrace_direct_multi APIs to _ftrace_direct APIs
Now that the original _ftrace_direct APIs are gone, the "_multi"
suffixes only add confusion.

Link: https://lkml.kernel.org/r/20230321140424.345218-5-revest@chromium.org

Signed-off-by: Florent Revest <revest@chromium.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-21 13:43:32 -04:00
Florent Revest
59495740f7 ftrace: Let unregister_ftrace_direct_multi() call ftrace_free_filter()
A common pattern when using the ftrace_direct_multi API is to unregister
the ops and also immediately free its filter. We've noticed it's very
easy for users to miss calling ftrace_free_filter().

This adds a "free_filters" argument to unregister_ftrace_direct_multi()
to both remind the user they should free filters and also to make their
life easier.

Link: https://lkml.kernel.org/r/20230321140424.345218-2-revest@chromium.org

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Florent Revest <revest@chromium.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-21 13:42:11 -04:00
Alexei Starovoitov
58aa2afbb1 bpf: Allow ld_imm64 instruction to point to kfunc.
Allow ld_imm64 insn with BPF_PSEUDO_BTF_ID to hold the address of kfunc. The
ld_imm64 pointing to a valid kfunc will be seen as non-null PTR_TO_MEM by
is_branch_taken() logic of the verifier, while libbpf will resolve address to
unknown kfunc as ld_imm64 reg, 0 which will also be recognized by
is_branch_taken() and the verifier will proceed dead code elimination. BPF
programs can use this logic to detect at load time whether kfunc is present in
the kernel with bpf_ksym_exists() macro that is introduced in the next patches.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20230317201920.62030-2-alexei.starovoitov@gmail.com
2023-03-17 15:44:26 -07:00
Kamalesh Babulal
4cdb91b0de cgroup: bpf: use cgroup_lock()/cgroup_unlock() wrappers
Replace mutex_[un]lock() with cgroup_[un]lock() wrappers to stay
consistent across cgroup core and other subsystem code, while
operating on the cgroup_mutex.

Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-03-17 12:07:13 -10:00
Viktor Malik
bd5314f8dd kallsyms, bpf: Move find_kallsyms_symbol_value out of internal header
Moving find_kallsyms_symbol_value from kernel/module/internal.h to
include/linux/module.h. The reason is that internal.h is not prepared to
be included when CONFIG_MODULES=n. find_kallsyms_symbol_value is used by
kernel/bpf/verifier.c and including internal.h from it (without modules)
leads into a compilation error:

  In file included from ../include/linux/container_of.h:5,
                   from ../include/linux/list.h:5,
                   from ../include/linux/timer.h:5,
                   from ../include/linux/workqueue.h:9,
                   from ../include/linux/bpf.h:10,
                   from ../include/linux/bpf-cgroup.h:5,
                   from ../kernel/bpf/verifier.c:7:
  ../kernel/bpf/../module/internal.h: In function 'mod_find':
  ../include/linux/container_of.h:20:54: error: invalid use of undefined type 'struct module'
     20 |         static_assert(__same_type(*(ptr), ((type *)0)->member) ||       \
        |                                                      ^~
  [...]

This patch fixes the above error.

Fixes: 31bf1dbccf ("bpf: Fix attaching fentry/fexit/fmod_ret/lsm to modules")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Viktor Malik <vmalik@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/oe-kbuild-all/202303161404.OrmfCy09-lkp@intel.com/
Link: https://lore.kernel.org/bpf/20230317095601.386738-1-vmalik@redhat.com
2023-03-17 13:45:51 +01:00
Luis Gerhorst
082cdc69a4 bpf: Remove misleading spec_v1 check on var-offset stack read
For every BPF_ADD/SUB involving a pointer, adjust_ptr_min_max_vals()
ensures that the resulting pointer has a constant offset if
bypass_spec_v1 is false. This is ensured by calling sanitize_check_bounds()
which in turn calls check_stack_access_for_ptr_arithmetic(). There,
-EACCESS is returned if the register's offset is not constant, thereby
rejecting the program.

In summary, an unprivileged user must never be able to create stack
pointers with a variable offset. That is also the case, because a
respective check in check_stack_write() is missing. If they were able
to create a variable-offset pointer, users could still use it in a
stack-write operation to trigger unsafe speculative behavior [1].

Because unprivileged users must already be prevented from creating
variable-offset stack pointers, viable options are to either remove
this check (replacing it with a clarifying comment), or to turn it
into a "verifier BUG"-message, also adding a similar check in
check_stack_write() (for consistency, as a second-level defense).
This patch implements the first option to reduce verifier bloat.

This check was introduced by commit 01f810ace9 ("bpf: Allow
variable-offset stack access") which correctly notes that
"variable-offset reads and writes are disallowed (they were already
disallowed for the indirect access case) because the speculative
execution checking code doesn't support them". However, it does not
further discuss why the check in check_stack_read() is necessary.
The code which made this check obsolete was also introduced in this
commit.

I have compiled ~650 programs from the Linux selftests, Linux samples,
Cilium, and libbpf/examples projects and confirmed that none of these
trigger the check in check_stack_read() [2]. Instead, all of these
programs are, as expected, already rejected when constructing the
variable-offset pointers. Note that the check in
check_stack_access_for_ptr_arithmetic() also prints "off=%d" while the
code removed by this patch does not (the error removed does not appear
in the "verification_error" values). For reproducibility, the
repository linked includes the raw data and scripts used to create
the plot.

  [1] https://arxiv.org/pdf/1807.03757.pdf
  [2] 53dc19fcf4/data/plots/23-02-26_23-56_bpftool/bpftool/0004-errors.pdf

Fixes: 01f810ace9 ("bpf: Allow variable-offset stack access")
Signed-off-by: Luis Gerhorst <gerhorst@cs.fau.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230315165358.23701-1-gerhorst@cs.fau.de
2023-03-16 22:05:50 +01:00
David Vernet
1b403ce77d bpf: Remove bpf_cpumask_kptr_get() kfunc
Now that struct bpf_cpumask is RCU safe, there's no need for this kfunc.
Rather than doing the following:

private(MASK) static struct bpf_cpumask __kptr *global;

int BPF_PROG(prog, s32 cpu, ...)
{
	struct bpf_cpumask *cpumask;

	bpf_rcu_read_lock();
	cpumask = bpf_cpumask_kptr_get(&global);
	if (!cpumask) {
		bpf_rcu_read_unlock();
		return -1;
	}
	bpf_cpumask_setall(cpumask);
	...
	bpf_cpumask_release(cpumask);
	bpf_rcu_read_unlock();
}

Programs can instead simply do (assume same global cpumask):

int BPF_PROG(prog, ...)
{
	struct bpf_cpumask *cpumask;

	bpf_rcu_read_lock();
	cpumask = global;
	if (!cpumask) {
		bpf_rcu_read_unlock();
		return -1;
	}
	bpf_cpumask_setall(cpumask);
	...
	bpf_rcu_read_unlock();
}

In other words, no extra atomic acquire / release, and less boilerplate
code.

This patch removes both the kfunc, as well as its selftests and
documentation.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230316054028.88924-5-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-16 12:28:30 -07:00
David Vernet
63d2d83d21 bpf: Mark struct bpf_cpumask as rcu protected
struct bpf_cpumask is a BPF-wrapper around the struct cpumask type which
can be instantiated by a BPF program, and then queried as a cpumask in
similar fashion to normal kernel code. The previous patch in this series
makes the type fully RCU safe, so the type can be included in the
rcu_protected_type BTF ID list.

A subsequent patch will remove bpf_cpumask_kptr_get(), as it's no longer
useful now that we can just treat the type as RCU safe by default and do
our own if check.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230316054028.88924-3-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-16 12:28:30 -07:00
David Vernet
77473d1a96 bpf: Free struct bpf_cpumask in call_rcu handler
The struct bpf_cpumask type uses the bpf_mem_cache_{alloc,free}() APIs
to allocate and free its cpumasks. The bpf_mem allocator may currently
immediately reuse some memory when its freed, without waiting for an RCU
read cycle to elapse. We want to be able to treat struct bpf_cpumask
objects as completely RCU safe.

This is necessary for two reasons:

1. bpf_cpumask_kptr_get() currently does an RCU-protected
   refcnt_inc_not_zero(). This of course assumes that the underlying
   memory is not reused, and is therefore unsafe in its current form.

2. We want to be able to get rid of bpf_cpumask_kptr_get() entirely, and
   intead use the superior kptr RCU semantics now afforded by the
   verifier.

This patch fixes (1), and enables (2), by making struct bpf_cpumask RCU
safe. A subsequent patch will update the verifier to allow struct
bpf_cpumask * pointers to be passed to KF_RCU kfuncs, and then a latter
patch will remove bpf_cpumask_kptr_get().

Fixes: 516f4d3397 ("bpf: Enable cpumasks to be queried and used as kptrs")
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230316054028.88924-2-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-16 12:28:30 -07:00
Viktor Malik
31bf1dbccf bpf: Fix attaching fentry/fexit/fmod_ret/lsm to modules
This resolves two problems with attachment of fentry/fexit/fmod_ret/lsm
to functions located in modules:

1. The verifier tries to find the address to attach to in kallsyms. This
   is always done by searching the entire kallsyms, not respecting the
   module in which the function is located. Such approach causes an
   incorrect attachment address to be computed if the function to attach
   to is shadowed by a function of the same name located earlier in
   kallsyms.

2. If the address to attach to is located in a module, the module
   reference is only acquired in register_fentry. If the module is
   unloaded between the place where the address is found
   (bpf_check_attach_target in the verifier) and register_fentry, it is
   possible that another module is loaded to the same address which may
   lead to potential errors.

Since the attachment must contain the BTF of the program to attach to,
we extract the module from it and search for the function address in the
correct module (resolving problem no. 1). Then, the module reference is
taken directly in bpf_check_attach_target and stored in the bpf program
(in bpf_prog_aux). The reference is only released when the program is
unloaded (resolving problem no. 2).

Signed-off-by: Viktor Malik <vmalik@redhat.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/3f6a9d8ae850532b5ef864ef16327b0f7a669063.1678432753.git.vmalik@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-15 18:38:21 -07:00
Alexei Starovoitov
3e30be4288 bpf: Allow helpers access trusted PTR_TO_BTF_ID.
The verifier rejects the code:
  bpf_strncmp(task->comm, 16, "my_task");
with the message:
  16: (85) call bpf_strncmp#182
  R1 type=trusted_ptr_ expected=fp, pkt, pkt_meta, map_key, map_value, mem, ringbuf_mem, buf

Teach the verifier that such access pattern is safe.
Do not allow untrusted and legacy ptr_to_btf_id to be passed into helpers.

Reported-by: David Vernet <void@manifault.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230313235845.61029-3-alexei.starovoitov@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-03-13 23:08:21 -07:00
Alexei Starovoitov
c9267aa8b7 bpf: Fix bpf_strncmp proto.
bpf_strncmp() doesn't write into its first argument.
Make sure that the verifier knows about it.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230313235845.61029-2-alexei.starovoitov@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-03-13 23:08:21 -07:00
Dave Marchevsky
9e36a204bd bpf: Disable migration when freeing stashed local kptr using obj drop
When a local kptr is stashed in a map and freed when the map goes away,
currently an error like the below appears:

[   39.195695] BUG: using smp_processor_id() in preemptible [00000000] code: kworker/u32:15/2875
[   39.196549] caller is bpf_mem_free+0x56/0xc0
[   39.196958] CPU: 15 PID: 2875 Comm: kworker/u32:15 Tainted: G           O       6.2.0-13016-g22df776a9a86 #4477
[   39.197897] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
[   39.198949] Workqueue: events_unbound bpf_map_free_deferred
[   39.199470] Call Trace:
[   39.199703]  <TASK>
[   39.199911]  dump_stack_lvl+0x60/0x70
[   39.200267]  check_preemption_disabled+0xbf/0xe0
[   39.200704]  bpf_mem_free+0x56/0xc0
[   39.201032]  ? bpf_obj_new_impl+0xa0/0xa0
[   39.201430]  bpf_obj_free_fields+0x1cd/0x200
[   39.201838]  array_map_free+0xad/0x220
[   39.202193]  ? finish_task_switch+0xe5/0x3c0
[   39.202614]  bpf_map_free_deferred+0xea/0x210
[   39.203006]  ? lockdep_hardirqs_on_prepare+0xe/0x220
[   39.203460]  process_one_work+0x64f/0xbe0
[   39.203822]  ? pwq_dec_nr_in_flight+0x110/0x110
[   39.204264]  ? do_raw_spin_lock+0x107/0x1c0
[   39.204662]  ? lockdep_hardirqs_on_prepare+0xe/0x220
[   39.205107]  worker_thread+0x74/0x7a0
[   39.205451]  ? process_one_work+0xbe0/0xbe0
[   39.205818]  kthread+0x171/0x1a0
[   39.206111]  ? kthread_complete_and_exit+0x20/0x20
[   39.206552]  ret_from_fork+0x1f/0x30
[   39.206886]  </TASK>

This happens because the call to __bpf_obj_drop_impl I added in the patch
adding support for stashing local kptrs doesn't disable migration. Prior
to that patch, __bpf_obj_drop_impl logic only ran when called by a BPF
progarm, whereas now it can be called from map free path, so it's
necessary to explicitly disable migration.

Also, refactor a bit to just call __bpf_obj_drop_impl directly instead
of bothering w/ dtor union and setting pointer-to-obj_drop.

Fixes: c8e1875409 ("bpf: Support __kptr to local kptrs")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230313214641.3731908-1-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-13 16:55:04 -07:00
Alexei Starovoitov
a33a6eaa19 Merge branch 'bpf: Allow reads from uninit stack'
Merge commit bf9bec4cb3 ("Merge branch 'bpf: Allow reads from uninit stack'")
from bpf-next to bpf tree to address verification issues in some programs
due to stack usage.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-13 13:21:22 -07:00
Andrii Nakryiko
34f0677e7a bpf: fix precision propagation verbose logging
Fix wrong order of frame index vs register/slot index in precision
propagation verbose (level 2) output. It's wrong and very confusing as is.

Fixes: 529409ea92 ("bpf: propagate precision across all frames, not just the last one")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230313184017.4083374-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-13 11:46:44 -07:00
Dave Marchevsky
738c96d5e2 bpf: Allow local kptrs to be exchanged via bpf_kptr_xchg
The previous patch added necessary plumbing for verifier and runtime to
know what to do with non-kernel PTR_TO_BTF_IDs in map values, but didn't
provide any way to get such local kptrs into a map value. This patch
modifies verifier handling of bpf_kptr_xchg to allow MEM_ALLOC kptr
types.

check_reg_type is modified accept MEM_ALLOC-flagged input to
bpf_kptr_xchg despite such types not being in btf_ptr_types. This could
have been done with a MAYBE_MEM_ALLOC equivalent to MAYBE_NULL, but
bpf_kptr_xchg is the only helper that I can forsee using
MAYBE_MEM_ALLOC, so keep it special-cased for now.

The verifier tags bpf_kptr_xchg retval MEM_ALLOC if and only if the BTF
associated with the retval is not kernel BTF.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230310230743.2320707-3-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10 16:38:05 -08:00
Dave Marchevsky
c8e1875409 bpf: Support __kptr to local kptrs
If a PTR_TO_BTF_ID type comes from program BTF - not vmlinux or module
BTF - it must have been allocated by bpf_obj_new and therefore must be
free'd with bpf_obj_drop. Such a PTR_TO_BTF_ID is considered a "local
kptr" and is tagged with MEM_ALLOC type tag by bpf_obj_new.

This patch adds support for treating __kptr-tagged pointers to "local
kptrs" as having an implicit bpf_obj_drop destructor for referenced kptr
acquire / release semantics. Consider the following example:

  struct node_data {
          long key;
          long data;
          struct bpf_rb_node node;
  };

  struct map_value {
          struct node_data __kptr *node;
  };

  struct {
          __uint(type, BPF_MAP_TYPE_ARRAY);
          __type(key, int);
          __type(value, struct map_value);
          __uint(max_entries, 1);
  } some_nodes SEC(".maps");

If struct node_data had a matching definition in kernel BTF, the verifier would
expect a destructor for the type to be registered. Since struct node_data does
not match any type in kernel BTF, the verifier knows that there is no kfunc
that provides a PTR_TO_BTF_ID to this type, and that such a PTR_TO_BTF_ID can
only come from bpf_obj_new. So instead of searching for a registered dtor,
a bpf_obj_drop dtor can be assumed.

This allows the runtime to properly destruct such kptrs in
bpf_obj_free_fields, which enables maps to clean up map_vals w/ such
kptrs when going away.

Implementation notes:
  * "kernel_btf" variable is renamed to "kptr_btf" in btf_parse_kptr.
    Before this patch, the variable would only ever point to vmlinux or
    module BTFs, but now it can point to some program BTF for local kptr
    type. It's later used to populate the (btf, btf_id) pair in kptr btf
    field.
  * It's necessary to btf_get the program BTF when populating btf_field
    for local kptr. btf_record_free later does a btf_put.
  * Behavior for non-local referenced kptrs is not modified, as
    bpf_find_btf_id helper only searches vmlinux and module BTFs for
    matching BTF type. If such a type is found, btf_field_kptr's btf will
    pass btf_is_kernel check, and the associated release function is
    some one-argument dtor. If btf_is_kernel check fails, associated
    release function is two-arg bpf_obj_drop_impl. Before this patch
    only btf_field_kptr's w/ kernel or module BTFs were created.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230310230743.2320707-2-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10 16:38:05 -08:00
Dave Marchevsky
74843b57ec bpf: Change btf_record_find enum parameter to field_mask
btf_record_find's 3rd parameter can be multiple enum btf_field_type's
masked together. The function is called with BPF_KPTR in two places in
verifier.c, so it works with masked values already.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230309180111.1618459-4-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10 12:16:37 -08:00
Dave Marchevsky
a4aa38897b bpf: btf: Remove unused btf_field_info_type enum
This enum was added and used in commit aa3496accc ("bpf: Refactor kptr_off_tab
into btf_record"). Later refactoring in commit db55911782 ("bpf: Consolidate
spin_lock, timer management into btf_record") resulted in the enum
values no longer being used anywhere.

Let's remove them.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230309180111.1618459-3-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10 12:16:37 -08:00
Dave Marchevsky
b32a5dae44 bpf: verifier: Rename kernel_type_name helper to btf_type_name
kernel_type_name was introduced in commit 9e15db6613 ("bpf: Implement accurate raw_tp context access via BTF")
with type signature:

  const char *kernel_type_name(u32 id)

At that time the function used global btf_vmlinux BTF for all id lookups. Later,
in commit 22dc4a0f5e ("bpf: Remove hard-coded btf_vmlinux assumption from BPF verifier"),
the type signature was changed to:

  static const char *kernel_type_name(const struct btf* btf, u32 id)

With the btf parameter used for lookups instead of global btf_vmlinux.

The helper will function as expected for type name lookup using non-kernel BTFs,
and will be used for such in further patches in the series. Let's rename it to
avoid incorrect assumptions that might arise when seeing the current name.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230309180111.1618459-2-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10 12:16:37 -08:00
Martin KaFai Lau
7e30a8477b bpf: Add bpf_local_storage_free()
This patch refactors local_storage freeing logic into
bpf_local_storage_free(). It is a preparation work for a later
patch that uses bpf_mem_cache_alloc/free. The other kfree(local_storage)
cases are also changed to bpf_local_storage_free(..., reuse_now = true).

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20230308065936.1550103-12-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10 11:05:29 -08:00
Martin KaFai Lau
1288aaa278 bpf: Add bpf_local_storage_rcu callback
The existing bpf_local_storage_free_rcu is renamed to
bpf_local_storage_free_trace_rcu. A new bpf_local_storage_rcu
callback is added to do the kfree instead of using kfree_rcu.
It is a preparation work for a later patch using
bpf_mem_cache_alloc/free.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20230308065936.1550103-11-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10 11:05:29 -08:00
Martin KaFai Lau
c0d63f3091 bpf: Add bpf_selem_free()
This patch refactors the selem freeing logic into bpf_selem_free().
It is a preparation work for a later patch using
bpf_mem_cache_alloc/free. The other kfree(selem) cases
are also changed to bpf_selem_free(..., reuse_now = true).

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20230308065936.1550103-10-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10 11:05:28 -08:00
Martin KaFai Lau
f8ccf30c17 bpf: Add bpf_selem_free_rcu callback
Add bpf_selem_free_rcu() callback to do the kfree() instead
of using kfree_rcu. It is a preparation work for using
bpf_mem_cache_alloc/free in a later patch.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20230308065936.1550103-9-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10 11:05:28 -08:00
Martin KaFai Lau
c609981342 bpf: Remove bpf_selem_free_fields*_rcu
This patch removes the bpf_selem_free_fields*_rcu. The
bpf_obj_free_fields() can be done before the call_rcu_trasks_trace()
and kfree_rcu(). It is needed when a later patch uses
bpf_mem_cache_alloc/free. In bpf hashtab, bpf_obj_free_fields()
is also called before calling bpf_mem_cache_free. The discussion
can be found in
https://lore.kernel.org/bpf/f67021ee-21d9-bfae-6134-4ca542fab843@linux.dev/

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20230308065936.1550103-8-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10 11:05:28 -08:00
Martin KaFai Lau
a47eabf216 bpf: Repurpose use_trace_rcu to reuse_now in bpf_local_storage
This patch re-purpose the use_trace_rcu to mean
if the freed memory can be reused immediately or not.
The use_trace_rcu is renamed to reuse_now. Other than
the boolean test is reversed, it should be a no-op.

The following explains the reason for the rename and how it will
be used in a later patch.

In a later patch, bpf_mem_cache_alloc/free will be used
in the bpf_local_storage. The bpf mem allocator will reuse
the freed memory immediately. Some of the free paths in
bpf_local_storage does not support memory to be reused immediately.
These paths are the "delete" elem cases from the bpf_*_storage_delete()
helper and the map_delete_elem() syscall. Note that "delete" elem
before the owner's (sk/task/cgrp/inode) lifetime ended is not
the common usage for the local storage.

The common free path, bpf_local_storage_destroy(), can reuse the
memory immediately. This common path means the storage stays with
its owner until the owner is destroyed.

The above mentioned "delete" elem paths that cannot
reuse immediately always has the 'use_trace_rcu ==  true'.
The cases that is safe for immediate reuse always have
'use_trace_rcu == false'. Instead of adding another arg
in a later patch, this patch re-purpose this arg
to reuse_now and have the test logic reversed.

In a later patch, 'reuse_now == true' will free to the
bpf_mem_cache_free() where the memory can be reused
immediately. 'reuse_now == false' will go through the
call_rcu_tasks_trace().

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20230308065936.1550103-7-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10 11:05:28 -08:00
Martin KaFai Lau
fc6652aab6 bpf: Remember smap in bpf_local_storage
This patch remembers which smap triggers the allocation
of a 'struct bpf_local_storage' object. The local_storage is
allocated during the very first selem added to the owner.
The smap pointer is needed when using the bpf_mem_cache_free
in a later patch because it needs to free to the correct
smap's bpf_mem_alloc object.

When a selem is being removed, it needs to check if it is
the selem that triggers the creation of the local_storage.
If it is, the local_storage->smap pointer will be reset to NULL.
This NULL reset is done under the local_storage->lock in
bpf_selem_unlink_storage_nolock() when a selem is being removed.
Also note that the local_storage may not go away even
local_storage->smap is NULL because there may be other
selem still stored in the local_storage.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20230308065936.1550103-6-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10 11:05:28 -08:00
Martin KaFai Lau
121f31f3e0 bpf: Remove the preceding __ from __bpf_selem_unlink_storage
__bpf_selem_unlink_storage is taking the spin lock and there is
no name collision also. Having the preceding '__' is confusing
when reviewing the later patch.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20230308065936.1550103-5-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10 11:05:28 -08:00
Martin KaFai Lau
62827d612a bpf: Remove __bpf_local_storage_map_alloc
bpf_local_storage_map_alloc() is the only caller of
__bpf_local_storage_map_alloc().  The remaining logic in
bpf_local_storage_map_alloc() is only a one liner setting
the smap->cache_idx.

Remove __bpf_local_storage_map_alloc() to simplify code.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20230308065936.1550103-4-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10 11:05:28 -08:00
Martin KaFai Lau
2ffcb6fc50 bpf: Refactor codes into bpf_local_storage_destroy
This patch first renames bpf_local_storage_unlink_nolock to
bpf_local_storage_destroy(). It better reflects that it is only
used when the storage's owner (sk/task/cgrp/inode) is being kfree().

All bpf_local_storage_destroy's caller is taking the spin lock and
then free the storage. This patch also moves these two steps into
the bpf_local_storage_destroy.

This is a preparation work for a later patch that uses
bpf_mem_cache_alloc/free in the bpf_local_storage.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20230308065936.1550103-3-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10 11:05:28 -08:00
Martin KaFai Lau
4cbd23cc92 bpf: Move a few bpf_local_storage functions to static scope
This patch moves the bpf_local_storage_free_rcu() and
bpf_selem_unlink_map() to static because they are
not used outside of bpf_local_storage.c.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20230308065936.1550103-2-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10 11:05:28 -08:00
Andrii Nakryiko
52c2b005a3 bpf: take into account liveness when propagating precision
When doing state comparison, if old state has register that is not
marked as REG_LIVE_READ, then we just skip comparison, regardless what's
the state of corresponing register in current state. This is because not
REG_LIVE_READ register is irrelevant for further program execution and
correctness. All good here.

But when we get to precision propagation, after two states were declared
equivalent, we don't take into account old register's liveness, and thus
attempt to propagate precision for register in current state even if
that register in old state was not REG_LIVE_READ anymore. This is bad,
because register in current state could be anything at all and this
could cause -EFAULT due to internal logic bugs.

Fix by taking into account REG_LIVE_READ liveness mark to keep the logic
in state comparison in sync with precision propagation.

Fixes: a3ce685dd0 ("bpf: fix precision tracking")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230309224131.57449-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10 10:11:42 -08:00
Andrii Nakryiko
4b5ce570db bpf: ensure state checkpointing at iter_next() call sites
State equivalence check and checkpointing performed in is_state_visited()
employs certain heuristics to try to save memory by avoiding state checkpoints
if not enough jumps and instructions happened since last checkpoint. This leads
to unpredictability of whether a particular instruction will be checkpointed
and how regularly. While normally this is not causing much problems (except
inconveniences for predictable verifier tests, which we overcome with
BPF_F_TEST_STATE_FREQ flag), turns out it's not the case for open-coded
iterators.

Checking and saving state checkpoints at iter_next() call is crucial for fast
convergence of open-coded iterator loop logic, so we need to force it. If we
don't do that, is_state_visited() might skip saving a checkpoint, causing
unnecessarily long sequence of not checkpointed instructions and jumps, leading
to exhaustion of jump history buffer, and potentially other undesired outcomes.
It is expected that with correct open-coded iterators convergence will happen
quickly, so we don't run a risk of exhausting memory.

This patch adds, in addition to prune and jump instruction marks, also a
"forced checkpoint" mark, and makes sure that any iter_next() call instruction
is marked as such.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230310060149.625887-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10 08:31:42 -08:00
Jakub Kicinski
d0ddf5065f Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Documentation/bpf/bpf_devel_QA.rst
  b7abcd9c65 ("bpf, doc: Link to submitting-patches.rst for general patch submission info")
  d56b0c461d ("bpf, docs: Fix link to netdev-FAQ target")
https://lore.kernel.org/all/20230307095812.236eb1be@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-09 22:22:11 -08:00
Andrii Nakryiko
6018e1f407 bpf: implement numbers iterator
Implement the first open-coded iterator type over a range of integers.

It's public API consists of:
  - bpf_iter_num_new() constructor, which accepts [start, end) range
    (that is, start is inclusive, end is exclusive).
  - bpf_iter_num_next() which will keep returning read-only pointer to int
    until the range is exhausted, at which point NULL will be returned.
    If bpf_iter_num_next() is kept calling after this, NULL will be
    persistently returned.
  - bpf_iter_num_destroy() destructor, which needs to be called at some
    point to clean up iterator state. BPF verifier enforces that iterator
    destructor is called at some point before BPF program exits.

Note that `start = end = X` is a valid combination to setup an empty
iterator. bpf_iter_num_new() will return 0 (success) for any such
combination.

If bpf_iter_num_new() detects invalid combination of input arguments, it
returns error, resets iterator state to, effectively, empty iterator, so
any subsequent call to bpf_iter_num_next() will keep returning NULL.

BPF verifier has no knowledge that returned integers are in the
[start, end) value range, as both `start` and `end` are not statically
known and enforced: they are runtime values.

While the implementation is pretty trivial, some care needs to be taken
to avoid overflows and underflows. Subsequent selftests will validate
correctness of [start, end) semantics, especially around extremes
(INT_MIN and INT_MAX).

Similarly to bpf_loop(), we enforce that no more than BPF_MAX_LOOPS can
be specified.

bpf_iter_num_{new,next,destroy}() is a logical evolution from bounded
BPF loops and bpf_loop() helper and is the basis for implementing
ergonomic BPF loops with no statically known or verified bounds.
Subsequent patches implement bpf_for() macro, demonstrating how this can
be wrapped into something that works and feels like a normal for() loop
in C language.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230308184121.1165081-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-08 16:19:51 -08:00
Andrii Nakryiko
06accc8779 bpf: add support for open-coded iterator loops
Teach verifier about the concept of the open-coded (or inline) iterators.

This patch adds generic iterator loop verification logic, new STACK_ITER
stack slot type to contain iterator state, and necessary kfunc plumbing
for iterator's constructor, destructor and next methods. Next patch
implements first specific iterator (numbers iterator for implementing
for() loop logic). Such split allows to have more focused commits for
verifier logic and separate commit that we could point later to
demonstrating  what does it take to add a new kind of iterator.

Each kind of iterator has its own associated struct bpf_iter_<type>,
where <type> denotes a specific type of iterator. struct bpf_iter_<type>
state is supposed to live on BPF program stack, so there will be no way
to change its size later on without breaking backwards compatibility, so
choose wisely! But given this struct is specific to a given <type> of
iterator, this allows a lot of flexibility: simple iterators could be
fine with just one stack slot (8 bytes), like numbers iterator in the
next patch, while some other more complicated iterators might need way
more to keep their iterator state. Either way, such design allows to
avoid runtime memory allocations, which otherwise would be necessary if
we fixed on-the-stack size and it turned out to be too small for a given
iterator implementation.

The way BPF verifier logic is implemented, there are no artificial
restrictions on a number of active iterators, it should work correctly
using multiple active iterators at the same time. This also means you
can have multiple nested iteration loops. struct bpf_iter_<type>
reference can be safely passed to subprograms as well.

General flow is easiest to demonstrate with a simple example using
number iterator implemented in next patch. Here's the simplest possible
loop:

  struct bpf_iter_num it;
  int *v;

  bpf_iter_num_new(&it, 2, 5);
  while ((v = bpf_iter_num_next(&it))) {
      bpf_printk("X = %d", *v);
  }
  bpf_iter_num_destroy(&it);

Above snippet should output "X = 2", "X = 3", "X = 4". Note that 5 is
exclusive and is not returned. This matches similar APIs (e.g., slices
in Go or Rust) that implement a range of elements, where end index is
non-inclusive.

In the above example, we see a trio of function:
  - constructor, bpf_iter_num_new(), which initializes iterator state
  (struct bpf_iter_num it) on the stack. If any of the input arguments
  are invalid, constructor should make sure to still initialize it such
  that subsequent bpf_iter_num_next() calls will return NULL. I.e., on
  error, return error and construct empty iterator.
  - next method, bpf_iter_num_next(), which accepts pointer to iterator
  state and produces an element. Next method should always return
  a pointer. The contract between BPF verifier is that next method will
  always eventually return NULL when elements are exhausted. Once NULL is
  returned, subsequent next calls should keep returning NULL. In the
  case of numbers iterator, bpf_iter_num_next() returns a pointer to an int
  (storage for this integer is inside the iterator state itself),
  which can be dereferenced after corresponding NULL check.
  - once done with the iterator, it's mandated that user cleans up its
  state with the call to destructor, bpf_iter_num_destroy() in this
  case. Destructor frees up any resources and marks stack space used by
  struct bpf_iter_num as usable for something else.

Any other iterator implementation will have to implement at least these
three methods. It is enforced that for any given type of iterator only
applicable constructor/destructor/next are callable. I.e., verifier
ensures you can't pass number iterator state into, say, cgroup
iterator's next method.

It is important to keep the naming pattern consistent to be able to
create generic macros to help with BPF iter usability. E.g., one
of the follow up patches adds generic bpf_for_each() macro to bpf_misc.h
in selftests, which allows to utilize iterator "trio" nicely without
having to code the above somewhat tedious loop explicitly every time.
This is enforced at kfunc registration point by one of the previous
patches in this series.

At the implementation level, iterator state tracking for verification
purposes is very similar to dynptr. We add STACK_ITER stack slot type,
reserve necessary number of slots, depending on
sizeof(struct bpf_iter_<type>), and keep track of necessary extra state
in the "main" slot, which is marked with non-zero ref_obj_id. Other
slots are also marked as STACK_ITER, but have zero ref_obj_id. This is
simpler than having a separate "is_first_slot" flag.

Another big distinction is that STACK_ITER is *always refcounted*, which
simplifies implementation without sacrificing usability. So no need for
extra "iter_id", no need to anticipate reuse of STACK_ITER slots for new
constructors, etc. Keeping it simple here.

As far as the verification logic goes, there are two extensive comments:
in process_iter_next_call() and iter_active_depths_differ() explaining
some important and sometimes subtle aspects. Please refer to them for
details.

But from 10,000-foot point of view, next methods are the points of
forking a verification state, which are conceptually similar to what
verifier is doing when validating conditional jump. We branch out at
a `call bpf_iter_<type>_next` instruction and simulate two outcomes:
NULL (iteration is done) and non-NULL (new element is returned). NULL is
simulated first and is supposed to reach exit without looping. After
that non-NULL case is validated and it either reaches exit (for trivial
examples with no real loop), or reaches another `call bpf_iter_<type>_next`
instruction with the state equivalent to already (partially) validated
one. State equivalency at that point means we technically are going to
be looping forever without "breaking out" out of established "state
envelope" (i.e., subsequent iterations don't add any new knowledge or
constraints to the verifier state, so running 1, 2, 10, or a million of
them doesn't matter). But taking into account the contract stating that
iterator next method *has to* return NULL eventually, we can conclude
that loop body is safe and will eventually terminate. Given we validated
logic outside of the loop (NULL case), and concluded that loop body is
safe (though potentially looping many times), verifier can claim safety
of the overall program logic.

The rest of the patch is necessary plumbing for state tracking, marking,
validation, and necessary further kfunc plumbing to allow implementing
iterator constructor, destructor, and next methods.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230308184121.1165081-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-08 16:19:50 -08:00
Andrii Nakryiko
215bf4962f bpf: add iterator kfuncs registration and validation logic
Add ability to register kfuncs that implement BPF open-coded iterator
contract and enforce naming and function proto convention. Enforcement
happens at the time of kfunc registration and significantly simplifies
the rest of iterators logic in the verifier.

More details follow in subsequent patches, but we enforce the following
conditions.

All kfuncs (constructor, next, destructor) have to be named consistenly
as bpf_iter_<type>_{new,next,destroy}(), respectively. <type> represents
iterator type, and iterator state should be represented as a matching
`struct bpf_iter_<type>` state type. Also, all iter kfuncs should have
a pointer to this `struct bpf_iter_<type>` as the very first argument.

Additionally:
  - Constructor, i.e., bpf_iter_<type>_new(), can have arbitrary extra
  number of arguments. Return type is not enforced either.
  - Next method, i.e., bpf_iter_<type>_next(), has to return a pointer
  type and should have exactly one argument: `struct bpf_iter_<type> *`
  (const/volatile/restrict and typedefs are ignored).
  - Destructor, i.e., bpf_iter_<type>_destroy(), should return void and
  should have exactly one argument, similar to the next method.
  - struct bpf_iter_<type> size is enforced to be positive and
  a multiple of 8 bytes (to fit stack slots correctly).

Such strictness and consistency allows to build generic helpers
abstracting important, but boilerplate, details to be able to use
open-coded iterators effectively and ergonomically (see bpf_for_each()
in subsequent patches). It also simplifies the verifier logic in some
places. At the same time, this doesn't hurt generality of possible
iterator implementations. Win-win.

Constructor kfunc is marked with a new KF_ITER_NEW flags, next method is
marked with KF_ITER_NEXT (and should also have KF_RET_NULL, of course),
while destructor kfunc is marked as KF_ITER_DESTROY.

Additionally, we add a trivial kfunc name validation: it should be
a valid non-NULL and non-empty string.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230308184121.1165081-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-08 16:19:50 -08:00
Andrii Nakryiko
07236eab7a bpf: factor out fetching basic kfunc metadata
Factor out logic to fetch basic kfunc metadata based on struct bpf_insn.
This is not exactly short or trivial code to just copy/paste and this
information is sometimes necessary in other parts of the verifier logic.
Subsequent patches will rely on this to determine if an instruction is
a kfunc call to iterator next method.

No functional changes intended, including that verbose() warning
behavior when kfunc is not allowed for a particular program type.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230308184121.1165081-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-08 16:19:50 -08:00
Yafang Shao
6b4a6ea2c6 bpf: enforce all maps having memory usage callback
We have implemented memory usage callback for all maps, and we enforce
any newly added map having a callback as well. We check this callback at
map creation time. If it doesn't have the callback, we will return
EINVAL.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20230305124615.12358-19-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-07 09:33:43 -08:00
Yafang Shao
9629363cd0 bpf: offload map memory usage
A new helper is introduced to calculate offload map memory usage. But
currently the memory dynamically allocated in netdev dev_ops, like
nsim_map_update_elem, is not counted. Let's just put it aside now.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20230305124615.12358-18-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-07 09:33:43 -08:00
Yafang Shao
7490b7f1c0 bpf, net: bpf_local_storage memory usage
A new helper is introduced into bpf_local_storage map to calculate the
memory usage. This helper is also used by other maps like
bpf_cgrp_storage, bpf_inode_storage, bpf_task_storage and etc.

Note that currently the dynamically allocated storage elements are not
counted in the usage, since it will take extra runtime overhead in the
elements update or delete path. So let's put it aside now, and implement
it in the future when someone really need it.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20230305124615.12358-15-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-07 09:33:43 -08:00
Yafang Shao
2f536977d6 bpf: local_storage memory usage
A new helper is introduced to calculate local_storage map memory usage.
Currently the dynamically allocated elements are not counted, since it
will take runtime overhead in the element update or delete path. So
let's put it aside currently, and implement it in the future if the user
really needs it.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20230305124615.12358-14-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-07 09:33:43 -08:00
Yafang Shao
f062226d8d bpf: bpf_struct_ops memory usage
A new helper is introduced to calculate bpf_struct_ops memory usage.

The result as follows,

- before
1: struct_ops  name count_map  flags 0x0
        key 4B  value 256B  max_entries 1  memlock 4096B
        btf_id 73

- after
1: struct_ops  name count_map  flags 0x0
        key 4B  value 256B  max_entries 1  memlock 5016B
        btf_id 73

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20230305124615.12358-13-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-07 09:33:43 -08:00
Yafang Shao
c6e66b42a3 bpf: queue_stack_maps memory usage
A new helper is introduced to calculate queue_stack_maps memory usage.

The result as follows,

- before
20: queue  name count_map  flags 0x0
        key 0B  value 4B  max_entries 65536  memlock 266240B
21: stack  name count_map  flags 0x0
        key 0B  value 4B  max_entries 65536  memlock 266240B

- after
20: queue  name count_map  flags 0x0
        key 0B  value 4B  max_entries 65536  memlock 524288B
21: stack  name count_map  flags 0x0
        key 0B  value 4B  max_entries 65536  memlock 524288B

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20230305124615.12358-12-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-07 09:33:42 -08:00
Yafang Shao
fa5e83df17 bpf: devmap memory usage
A new helper is introduced to calculate the memory usage of devmap and
devmap_hash. The number of dynamically allocated elements are recored
for devmap_hash already, but not for devmap. To track the memory size of
dynamically allocated elements, this patch also count the numbers for
devmap.

The result as follows,
- before
40: devmap  name count_map  flags 0x80
        key 4B  value 4B  max_entries 65536  memlock 524288B
41: devmap_hash  name count_map  flags 0x80
        key 4B  value 4B  max_entries 65536  memlock 524288B

- after
40: devmap  name count_map  flags 0x80  <<<< no elements
        key 4B  value 4B  max_entries 65536  memlock 524608B
41: devmap_hash  name count_map  flags 0x80 <<<< no elements
        key 4B  value 4B  max_entries 65536  memlock 524608B

Note that the number of buckets is same with max_entries for devmap_hash
in this case.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20230305124615.12358-11-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-07 09:33:42 -08:00
Yafang Shao
835f1fca95 bpf: cpumap memory usage
A new helper is introduced to calculate cpumap memory usage. The size of
cpu_entries can be dynamically changed when we update or delete a cpumap
element, but this patch doesn't include the memory size of cpu_entry
yet. We can dynamically calculate the memory usage when we alloc or free
a cpu_entry, but it will take extra runtime overhead, so let just put it
aside currently. Note that the size of different cpu_entry may be
different as well.

The result as follows,
- before
48: cpumap  name count_map  flags 0x4
        key 4B  value 4B  max_entries 64  memlock 4096B

- after
48: cpumap  name count_map  flags 0x4
        key 4B  value 4B  max_entries 64  memlock 832B

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20230305124615.12358-10-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-07 09:33:42 -08:00
Yafang Shao
71a49abe73 bpf: bloom_filter memory usage
Introduce a new helper to calculate the bloom_filter memory usage.

The result as follows,
- before
16: bloom_filter  flags 0x0
        key 0B  value 8B  max_entries 65536  memlock 524288B

- after
16: bloom_filter  flags 0x0
        key 0B  value 8B  max_entries 65536  memlock 65856B

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20230305124615.12358-9-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-07 09:33:42 -08:00
Yafang Shao
2f7e4ab2ca bpf: ringbuf memory usage
A new helper ringbuf_map_mem_usage() is introduced to calculate ringbuf
memory usage.

The result as follows,
- before
15: ringbuf  name count_map  flags 0x0
        key 0B  value 0B  max_entries 65536  memlock 0B

- after
15: ringbuf  name count_map  flags 0x0
        key 0B  value 0B  max_entries 65536  memlock 78424B

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230305124615.12358-8-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-07 09:33:42 -08:00
Yafang Shao
2e89caf055 bpf: reuseport_array memory usage
A new helper is introduced to calculate reuseport_array memory usage.

The result as follows,
- before
14: reuseport_sockarray  name count_map  flags 0x0
        key 4B  value 8B  max_entries 65536  memlock 1048576B

- after
14: reuseport_sockarray  name count_map  flags 0x0
        key 4B  value 8B  max_entries 65536  memlock 524544B

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20230305124615.12358-7-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-07 09:33:42 -08:00
Yafang Shao
cbb9b6068c bpf: stackmap memory usage
A new helper is introduced to get stackmap memory usage. Some small
memory allocations are ignored as their memory size is quite small
compared to the totol usage.

The result as follows,
- before
16: stack_trace  name count_map  flags 0x0
        key 4B  value 8B  max_entries 65536  memlock 1048576B

- after
16: stack_trace  name count_map  flags 0x0
        key 4B  value 8B  max_entries 65536  memlock 2097472B

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20230305124615.12358-6-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-07 09:33:42 -08:00
Yafang Shao
1746d0555a bpf: arraymap memory usage
Introduce array_map_mem_usage() to calculate arraymap memory usage. In
this helper, some small memory allocations are ignored, like the
allocation of struct bpf_array_aux in prog_array. The inner_map_meta in
array_of_map is also ignored.

The result as follows,

- before
11: array  name count_map  flags 0x0
        key 4B  value 4B  max_entries 65536  memlock 524288B
12: percpu_array  name count_map  flags 0x0
        key 4B  value 4B  max_entries 65536  memlock 8912896B
13: perf_event_array  name count_map  flags 0x0
        key 4B  value 4B  max_entries 65536  memlock 524288B
14: prog_array  name count_map  flags 0x0
        key 4B  value 4B  max_entries 65536  memlock 524288B
15: cgroup_array  name count_map  flags 0x0
        key 4B  value 4B  max_entries 65536  memlock 524288B

- after
11: array  name count_map  flags 0x0
        key 4B  value 4B  max_entries 65536  memlock 524608B
12: percpu_array  name count_map  flags 0x0
        key 4B  value 4B  max_entries 65536  memlock 17301824B
13: perf_event_array  name count_map  flags 0x0
        key 4B  value 4B  max_entries 65536  memlock 524608B
14: prog_array  name count_map  flags 0x0
        key 4B  value 4B  max_entries 65536  memlock 524608B
15: cgroup_array  name count_map  flags 0x0
        key 4B  value 4B  max_entries 65536  memlock 524608B

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20230305124615.12358-5-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-07 09:33:42 -08:00
Yafang Shao
304849a27b bpf: hashtab memory usage
htab_map_mem_usage() is introduced to calculate hashmap memory usage. In
this helper, some small memory allocations are ignore, as their size is
quite small compared with the total size. The inner_map_meta in
hash_of_map is also ignored.

The result for hashtab as follows,

- before this change
1: hash  name count_map  flags 0x1  <<<< no prealloc, fully set
        key 16B  value 24B  max_entries 1048576  memlock 41943040B
2: hash  name count_map  flags 0x1  <<<< no prealloc, none set
        key 16B  value 24B  max_entries 1048576  memlock 41943040B
3: hash  name count_map  flags 0x0  <<<< prealloc
        key 16B  value 24B  max_entries 1048576  memlock 41943040B

The memlock is always a fixed size whatever it is preallocated or
not, and whatever the count of allocated elements is.

- after this change
1: hash  name count_map  flags 0x1    <<<< non prealloc, fully set
        key 16B  value 24B  max_entries 1048576  memlock 117441536B
2: hash  name count_map  flags 0x1    <<<< non prealloc, non set
        key 16B  value 24B  max_entries 1048576  memlock 16778240B
3: hash  name count_map  flags 0x0    <<<< prealloc
        key 16B  value 24B  max_entries 1048576  memlock 109056000B

The memlock now is hashtab actually allocated.

The result for percpu hash map as follows,
- before this change
4: percpu_hash  name count_map  flags 0x0       <<<< prealloc
        key 16B  value 24B  max_entries 1048576  memlock 822083584B
5: percpu_hash  name count_map  flags 0x1       <<<< no prealloc
        key 16B  value 24B  max_entries 1048576  memlock 822083584B

- after this change
4: percpu_hash  name count_map  flags 0x0
        key 16B  value 24B  max_entries 1048576  memlock 897582080B
5: percpu_hash  name count_map  flags 0x1
        key 16B  value 24B  max_entries 1048576  memlock 922748736B

At worst, the difference can be 10x, for example,
- before this change
6: hash  name count_map  flags 0x0
        key 4B  value 4B  max_entries 1048576  memlock 8388608B

- after this change
6: hash  name count_map  flags 0x0
        key 4B  value 4B  max_entries 1048576  memlock 83889408B

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230305124615.12358-4-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-07 09:33:42 -08:00
Yafang Shao
41d5941e7f bpf: lpm_trie memory usage
trie_mem_usage() is introduced to calculate the lpm_trie memory usage.
Some small memory allocations are ignored. The inner node is also
ignored.

The result as follows,

- before
10: lpm_trie  flags 0x1
        key 8B  value 8B  max_entries 65536  memlock 1048576B

- after
10: lpm_trie  flags 0x1
        key 8B  value 8B  max_entries 65536  memlock 2291536B

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20230305124615.12358-3-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-07 09:33:42 -08:00
Yafang Shao
90a5527d76 bpf: add new map ops ->map_mem_usage
Add a new map ops ->map_mem_usage to print the memory usage of a
bpf map.

This is a preparation for the followup change.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20230305124615.12358-2-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-07 09:33:41 -08:00
Jakub Kicinski
36e5e391a2 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZAZsBwAKCRDbK58LschI
 g3W1AQCQnO6pqqX5Q2aYDAZPlZRtV2TRLjuqrQE0dHW/XLAbBgD/bgsAmiKhPSCG
 2mTt6izpTQVlZB0e8KcDIvbYd9CE3Qc=
 =EjJQ
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-03-06

We've added 85 non-merge commits during the last 13 day(s) which contain
a total of 131 files changed, 7102 insertions(+), 1792 deletions(-).

The main changes are:

1) Add skb and XDP typed dynptrs which allow BPF programs for more
   ergonomic and less brittle iteration through data and variable-sized
   accesses, from Joanne Koong.

2) Bigger batch of BPF verifier improvements to prepare for upcoming BPF
   open-coded iterators allowing for less restrictive looping capabilities,
   from Andrii Nakryiko.

3) Rework RCU enforcement in the verifier, add kptr_rcu and enforce BPF
   programs to NULL-check before passing such pointers into kfunc,
   from Alexei Starovoitov.

4) Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and in
   local storage maps, from Kumar Kartikeya Dwivedi.

5) Add BPF verifier support for ST instructions in convert_ctx_access()
   which will help new -mcpu=v4 clang flag to start emitting them,
   from Eduard Zingerman.

6) Make uprobe attachment Android APK aware by supporting attachment
   to functions inside ELF objects contained in APKs via function names,
   from Daniel Müller.

7) Add a new flag BPF_F_TIMER_ABS flag for bpf_timer_start() helper
   to start the timer with absolute expiration value instead of relative
   one, from Tero Kristo.

8) Add a new kfunc bpf_cgroup_from_id() to look up cgroups via id,
   from Tejun Heo.

9) Extend libbpf to support users manually attaching kprobes/uprobes
   in the legacy/perf/link mode, from Menglong Dong.

10) Implement workarounds in the mips BPF JIT for DADDI/R4000,
   from Jiaxun Yang.

11) Enable mixing bpf2bpf and tailcalls for the loongarch BPF JIT,
    from Hengqi Chen.

12) Extend BPF instruction set doc with describing the encoding of BPF
    instructions in terms of how bytes are stored under big/little endian,
    from Jose E. Marchesi.

13) Follow-up to enable kfunc support for riscv BPF JIT, from Pu Lehui.

14) Fix bpf_xdp_query() backwards compatibility on old kernels,
    from Yonghong Song.

15) Fix BPF selftest cross compilation with CLANG_CROSS_FLAGS,
    from Florent Revest.

16) Improve bpf_cpumask_ma to only allocate one bpf_mem_cache,
    from Hou Tao.

17) Fix BPF verifier's check_subprogs to not unnecessarily mark
    a subprogram with has_tail_call, from Ilya Leoshkevich.

18) Fix arm syscall regs spec in libbpf's bpf_tracing.h, from Puranjay Mohan.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (85 commits)
  selftests/bpf: Add test for legacy/perf kprobe/uprobe attach mode
  selftests/bpf: Split test_attach_probe into multi subtests
  libbpf: Add support to set kprobe/uprobe attach mode
  tools/resolve_btfids: Add /libsubcmd to .gitignore
  bpf: add support for fixed-size memory pointer returns for kfuncs
  bpf: generalize dynptr_get_spi to be usable for iters
  bpf: mark PTR_TO_MEM as non-null register type
  bpf: move kfunc_call_arg_meta higher in the file
  bpf: ensure that r0 is marked scratched after any function call
  bpf: fix visit_insn()'s detection of BPF_FUNC_timer_set_callback helper
  bpf: clean up visit_insn()'s instruction processing
  selftests/bpf: adjust log_fixup's buffer size for proper truncation
  bpf: honor env->test_state_freq flag in is_state_visited()
  selftests/bpf: enhance align selftest's expected log matching
  bpf: improve regsafe() checks for PTR_TO_{MEM,BUF,TP_BUFFER}
  bpf: improve stack slot state printing
  selftests/bpf: Disassembler tests for verifier.c:convert_ctx_access()
  selftests/bpf: test if pointer type is tracked for BPF_ST_MEM
  bpf: allow ctx writes using BPF_ST_MEM instruction
  bpf: Use separate RCU callbacks for freeing selem
  ...
====================

Link: https://lore.kernel.org/r/20230307004346.27578-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-06 20:36:39 -08:00
Jakub Kicinski
757b56a6c7 bpf-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZAZZ1wAKCRDbK58LschI
 g4fcAQDYVsICeBDmhdBdZs7Kb91/s6SrU6B0jy4zs0gOIBBOhgD7B3jt3dMTD2tp
 rPLHlv6uUoYS7mbZsrZi/XjVw8UmewM=
 =VUnr
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2023-03-06

We've added 8 non-merge commits during the last 7 day(s) which contain
a total of 9 files changed, 64 insertions(+), 18 deletions(-).

The main changes are:

1) Fix BTF resolver for DATASEC sections when a VAR points at a modifier,
   that is, keep resolving such instances instead of bailing out,
   from Lorenz Bauer.

2) Fix BPF test framework with regards to xdp_frame info misplacement
   in the "live packet" code, from Alexander Lobakin.

3) Fix an infinite loop in BPF sockmap code for TCP/UDP/AF_UNIX,
   from Liu Jian.

4) Fix a build error for riscv BPF JIT under PERF_EVENTS=n,
   from Randy Dunlap.

5) Several BPF doc fixes with either broken links or external instead
   of internal doc links, from Bagas Sanjaya.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: check that modifier resolves after pointer
  btf: fix resolving BTF_KIND_VAR after ARRAY, STRUCT, UNION, PTR
  bpf, test_run: fix &xdp_frame misplacement for LIVE_FRAMES
  bpf, doc: Link to submitting-patches.rst for general patch submission info
  bpf, doc: Do not link to docs.kernel.org for kselftest link
  bpf, sockmap: Fix an infinite loop error when len is 0 in tcp_bpf_recvmsg_parser()
  riscv, bpf: Fix patch_text implicit declaration
  bpf, docs: Fix link to BTF doc
====================

Link: https://lore.kernel.org/r/20230306215944.11981-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-06 20:28:00 -08:00
Lorenz Bauer
9b459804ff btf: fix resolving BTF_KIND_VAR after ARRAY, STRUCT, UNION, PTR
btf_datasec_resolve contains a bug that causes the following BTF
to fail loading:

    [1] DATASEC a size=2 vlen=2
        type_id=4 offset=0 size=1
        type_id=7 offset=1 size=1
    [2] INT (anon) size=1 bits_offset=0 nr_bits=8 encoding=(none)
    [3] PTR (anon) type_id=2
    [4] VAR a type_id=3 linkage=0
    [5] INT (anon) size=1 bits_offset=0 nr_bits=8 encoding=(none)
    [6] TYPEDEF td type_id=5
    [7] VAR b type_id=6 linkage=0

This error message is printed during btf_check_all_types:

    [1] DATASEC a size=2 vlen=2
        type_id=7 offset=1 size=1 Invalid type

By tracing btf_*_resolve we can pinpoint the problem:

    btf_datasec_resolve(depth: 1, type_id: 1, mode: RESOLVE_TBD) = 0
        btf_var_resolve(depth: 2, type_id: 4, mode: RESOLVE_TBD) = 0
            btf_ptr_resolve(depth: 3, type_id: 3, mode: RESOLVE_PTR) = 0
        btf_var_resolve(depth: 2, type_id: 4, mode: RESOLVE_PTR) = 0
    btf_datasec_resolve(depth: 1, type_id: 1, mode: RESOLVE_PTR) = -22

The last invocation of btf_datasec_resolve should invoke btf_var_resolve
by means of env_stack_push, instead it returns EINVAL. The reason is that
env_stack_push is never executed for the second VAR.

    if (!env_type_is_resolve_sink(env, var_type) &&
        !env_type_is_resolved(env, var_type_id)) {
        env_stack_set_next_member(env, i + 1);
        return env_stack_push(env, var_type, var_type_id);
    }

env_type_is_resolve_sink() changes its behaviour based on resolve_mode.
For RESOLVE_PTR, we can simplify the if condition to the following:

    (btf_type_is_modifier() || btf_type_is_ptr) && !env_type_is_resolved()

Since we're dealing with a VAR the clause evaluates to false. This is
not sufficient to trigger the bug however. The log output and EINVAL
are only generated if btf_type_id_size() fails.

    if (!btf_type_id_size(btf, &type_id, &type_size)) {
        btf_verifier_log_vsi(env, v->t, vsi, "Invalid type");
        return -EINVAL;
    }

Most types are sized, so for example a VAR referring to an INT is not a
problem. The bug is only triggered if a VAR points at a modifier. Since
we skipped btf_var_resolve that modifier was also never resolved, which
means that btf_resolved_type_id returns 0 aka VOID for the modifier.
This in turn causes btf_type_id_size to return NULL, triggering EINVAL.

To summarise, the following conditions are necessary:

- VAR pointing at PTR, STRUCT, UNION or ARRAY
- Followed by a VAR pointing at TYPEDEF, VOLATILE, CONST, RESTRICT or
  TYPE_TAG

The fix is to reset resolve_mode to RESOLVE_TBD before attempting to
resolve a VAR from a DATASEC.

Fixes: 1dc9285184 ("bpf: kernel side support for BTF Var and DataSec")
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/r/20230306112138.155352-2-lmb@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-03-06 11:44:13 -08:00
Andrii Nakryiko
f4b4eee616 bpf: add support for fixed-size memory pointer returns for kfuncs
Support direct fixed-size (and for now, read-only) memory access when
kfunc's return type is a pointer to non-struct type. Calculate type size
and let BPF program access that many bytes directly. This is crucial for
numbers iterator.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230302235015.2044271-13-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-04 11:14:32 -08:00
Andrii Nakryiko
a461f5adf1 bpf: generalize dynptr_get_spi to be usable for iters
Generalize the logic of fetching special stack slot object state using
spi (stack slot index). This will be used by STACK_ITER logic next.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230302235015.2044271-12-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-04 11:14:32 -08:00
Andrii Nakryiko
d5271c5b19 bpf: mark PTR_TO_MEM as non-null register type
PTR_TO_MEM register without PTR_MAYBE_NULL is indeed non-null. This is
important for BPF verifier to be able to prune guaranteed not to be
taken branches. This is always the case with open-coded iterators.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230302235015.2044271-11-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-04 11:14:32 -08:00
Andrii Nakryiko
d0e1ac2279 bpf: move kfunc_call_arg_meta higher in the file
Move struct bpf_kfunc_call_arg_meta higher in the file and put it next
to struct bpf_call_arg_meta, so it can be used from more functions.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230302235015.2044271-10-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-04 11:14:32 -08:00
Andrii Nakryiko
553a64a85c bpf: ensure that r0 is marked scratched after any function call
r0 is important (unless called function is void-returning, but that's
taken care of by print_verifier_state() anyways) in verifier logs.
Currently for helpers we seem to print it in verifier log, but for
kfuncs we don't.

Instead of figuring out where in the maze of code we accidentally set r0
as scratched for helpers and why we don't do that for kfuncs, just
enforce that after any function call r0 is marked as scratched.

Also, perhaps, we should reconsider "scratched" terminology, as it's
mightily confusing. "Touched" would seem more appropriate. But I left
that for follow ups for now.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230302235015.2044271-9-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-04 11:14:32 -08:00
Andrii Nakryiko
c1ee85a980 bpf: fix visit_insn()'s detection of BPF_FUNC_timer_set_callback helper
It's not correct to assume that any BPF_CALL instruction is a helper
call. Fix visit_insn()'s detection of bpf_timer_set_callback() helper by
also checking insn->code == 0. For kfuncs insn->code would be set to
BPF_PSEUDO_KFUNC_CALL, and for subprog calls it will be BPF_PSEUDO_CALL.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230302235015.2044271-8-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-04 11:14:32 -08:00
Andrii Nakryiko
653ae3a874 bpf: clean up visit_insn()'s instruction processing
Instead of referencing processed instruction repeatedly as insns[t]
throughout entire visit_insn() function, take a local insn pointer and
work with it in a cleaner way.

It makes enhancing this function further a bit easier as well.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230302235015.2044271-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-04 11:14:32 -08:00
Andrii Nakryiko
98ddcf389d bpf: honor env->test_state_freq flag in is_state_visited()
env->test_state_freq flag can be set by user by passing
BPF_F_TEST_STATE_FREQ program flag. This is used in a bunch of selftests
to have predictable state checkpoints at every jump and so on.

Currently, bounded loop handling heuristic ignores this flag if number
of processed jumps and/or number of processed instructions is below some
thresholds, which throws off that reliable state checkpointing.

Honor this flag in all circumstances by disabling heuristic if
env->test_state_freq is set.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230302235015.2044271-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-04 11:14:31 -08:00
Andrii Nakryiko
567da5d253 bpf: improve regsafe() checks for PTR_TO_{MEM,BUF,TP_BUFFER}
Teach regsafe() logic to handle PTR_TO_MEM, PTR_TO_BUF, and
PTR_TO_TP_BUFFER similarly to PTR_TO_MAP_{KEY,VALUE}. That is, instead of
exact match for var_off and range, use tnum_in() and range_within()
checks, allowing more general verified state to subsume more specific
current state. This allows to match wider range of valid and safe
states, speeding up verification and detecting wider range of equivalent
states for upcoming open-coded iteration looping logic.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230302235015.2044271-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-04 11:14:31 -08:00
Andrii Nakryiko
d54e0f6c1a bpf: improve stack slot state printing
Improve stack slot state printing to provide more useful and relevant
information, especially for dynptrs. While previously we'd see something
like:

  8: (85) call bpf_ringbuf_reserve_dynptr#198   ; R0_w=scalar() fp-8_w=dddddddd fp-16_w=dddddddd refs=2

Now we'll see way more useful:

  8: (85) call bpf_ringbuf_reserve_dynptr#198   ; R0_w=scalar() fp-16_w=dynptr_ringbuf(ref_id=2) refs=2

I experimented with printing the range of slots taken by dynptr,
something like:

  fp-16..8_w=dynptr_ringbuf(ref_id=2)

But it felt very awkward and pretty useless. So we print the lowest
address (most negative offset) only.

The general structure of this code is now also set up for easier
extension and will accommodate ITER slots naturally.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230302235015.2044271-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-04 11:14:31 -08:00
Eduard Zingerman
0d80a619c1 bpf: allow ctx writes using BPF_ST_MEM instruction
Lift verifier restriction to use BPF_ST_MEM instructions to write to
context data structures. This requires the following changes:
 - verifier.c:do_check() for BPF_ST updated to:
   - no longer forbid writes to registers of type PTR_TO_CTX;
   - track dst_reg type in the env->insn_aux_data[...].ptr_type field
     (same way it is done for BPF_STX and BPF_LDX instructions).
 - verifier.c:convert_ctx_access() and various callbacks invoked by
   it are updated to handled BPF_ST instruction alongside BPF_STX.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230304011247.566040-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-03 21:41:46 -08:00
Kumar Kartikeya Dwivedi
e768e3c5aa bpf: Use separate RCU callbacks for freeing selem
Martin suggested that instead of using a byte in the hole (which he has
a use for in his future patch) in bpf_local_storage_elem, we can
dispatch a different call_rcu callback based on whether we need to free
special fields in bpf_local_storage_elem data. The free path, described
in commit 9db44fdd81 ("bpf: Support kptrs in local storage maps"),
only waits for call_rcu callbacks when there are special (kptrs, etc.)
fields in the map value, hence it is necessary that we only access
smap in this case.

Therefore, dispatch different RCU callbacks based on the BPF map has a
valid btf_record, which dereference and use smap's btf_record only when
it is valid.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230303141542.300068-1-memxor@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-03-03 09:45:27 -08:00
Alexei Starovoitov
6fcd486b3a bpf: Refactor RCU enforcement in the verifier.
bpf_rcu_read_lock/unlock() are only available in clang compiled kernels. Lack
of such key mechanism makes it impossible for sleepable bpf programs to use RCU
pointers.

Allow bpf_rcu_read_lock/unlock() in GCC compiled kernels (though GCC doesn't
support btf_type_tag yet) and allowlist certain field dereferences in important
data structures like tast_struct, cgroup, socket that are used by sleepable
programs either as RCU pointer or full trusted pointer (which is valid outside
of RCU CS). Use BTF_TYPE_SAFE_RCU and BTF_TYPE_SAFE_TRUSTED macros for such
tagging. They will be removed once GCC supports btf_type_tag.

With that refactor check_ptr_to_btf_access(). Make it strict in enforcing
PTR_TRUSTED and PTR_UNTRUSTED while deprecating old PTR_TO_BTF_ID without
modifier flags. There is a chance that this strict enforcement might break
existing programs (especially on GCC compiled kernels), but this cleanup has to
start sooner than later. Note PTR_TO_CTX access still yields old deprecated
PTR_TO_BTF_ID. Once it's converted to strict PTR_TRUSTED or PTR_UNTRUSTED the
kfuncs and helpers will be able to default to KF_TRUSTED_ARGS. KF_RCU will
remain as a weaker version of KF_TRUSTED_ARGS where obj refcnt could be 0.

Adjust rcu_read_lock selftest to run on gcc and clang compiled kernels.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20230303041446.3630-7-alexei.starovoitov@gmail.com
2023-03-03 17:42:20 +01:00
Alexei Starovoitov
20c09d92fa bpf: Introduce kptr_rcu.
The life time of certain kernel structures like 'struct cgroup' is protected by RCU.
Hence it's safe to dereference them directly from __kptr tagged pointers in bpf maps.
The resulting pointer is MEM_RCU and can be passed to kfuncs that expect KF_RCU.
Derefrence of other kptr-s returns PTR_UNTRUSTED.

For example:
struct map_value {
   struct cgroup __kptr *cgrp;
};

SEC("tp_btf/cgroup_mkdir")
int BPF_PROG(test_cgrp_get_ancestors, struct cgroup *cgrp_arg, const char *path)
{
  struct cgroup *cg, *cg2;

  cg = bpf_cgroup_acquire(cgrp_arg); // cg is PTR_TRUSTED and ref_obj_id > 0
  bpf_kptr_xchg(&v->cgrp, cg);

  cg2 = v->cgrp; // This is new feature introduced by this patch.
  // cg2 is PTR_MAYBE_NULL | MEM_RCU.
  // When cg2 != NULL, it's a valid cgroup, but its percpu_ref could be zero

  if (cg2)
    bpf_cgroup_ancestor(cg2, level); // safe to do.
}

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20230303041446.3630-4-alexei.starovoitov@gmail.com
2023-03-03 17:42:20 +01:00
Alexei Starovoitov
8d093b4e95 bpf: Mark cgroups and dfl_cgrp fields as trusted.
bpf programs sometimes do:
bpf_cgrp_storage_get(&map, task->cgroups->dfl_cgrp, ...);
It is safe to do, because cgroups->dfl_cgrp pointer is set diring init and
never changes. The task->cgroups is also never NULL. It is also set during init
and will change when task switches cgroups. For any trusted task pointer
dereference of cgroups and dfl_cgrp should yield trusted pointers. The verifier
wasn't aware of this. Hence in gcc compiled kernels task->cgroups dereference
was producing PTR_TO_BTF_ID without modifiers while in clang compiled kernels
the verifier recognizes __rcu tag in cgroups field and produces
PTR_TO_BTF_ID | MEM_RCU | MAYBE_NULL.
Tag cgroups and dfl_cgrp as trusted to equalize clang and gcc behavior.
When GCC supports btf_type_tag such tagging will done directly in the type.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: David Vernet <void@manifault.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/bpf/20230303041446.3630-3-alexei.starovoitov@gmail.com
2023-03-03 17:42:20 +01:00
Alexei Starovoitov
03b77e17ae bpf: Rename __kptr_ref -> __kptr and __kptr -> __kptr_untrusted.
__kptr meant to store PTR_UNTRUSTED kernel pointers inside bpf maps.
The concept felt useful, but didn't get much traction,
since bpf_rdonly_cast() was added soon after and bpf programs received
a simpler way to access PTR_UNTRUSTED kernel pointers
without going through restrictive __kptr usage.

Rename __kptr_ref -> __kptr and __kptr -> __kptr_untrusted to indicate
its intended usage.
The main goal of __kptr_untrusted was to read/write such pointers
directly while bpf_kptr_xchg was a mechanism to access refcnted
kernel pointers. The next patch will allow RCU protected __kptr access
with direct read. At that point __kptr_untrusted will be deprecated.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20230303041446.3630-2-alexei.starovoitov@gmail.com
2023-03-03 17:42:20 +01:00
Tero Kristo
f71f853049 bpf: Add support for absolute value BPF timers
Add a new flag BPF_F_TIMER_ABS that can be passed to bpf_timer_start()
to start an absolute value timer instead of the default relative value.
This makes the timer expire at an exact point in time, instead of a time
with latencies induced by both the BPF and timer subsystems.

Suggested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com>
Link: https://lore.kernel.org/r/20230302114614.2985072-2-tero.kristo@linux.intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-02 22:41:32 -08:00
Tejun Heo
c501bf55c8 bpf: Make bpf_get_current_[ancestor_]cgroup_id() available for all program types
These helpers are safe to call from any context and there's no reason to
restrict access to them. Remove them from bpf_trace and filter lists and add
to bpf_base_func_proto() under perfmon_capable().

v2: After consulting with Andrii, relocated in bpf_base_func_proto() so that
    they require bpf_capable() but not perfomon_capable() as it doesn't read
    from or affect others on the system.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/ZAD8QyoszMZiTzBY@slm.duckdns.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-02 22:37:07 -08:00
Joanne Koong
c45eac537b bpf: Fix bpf_dynptr_slice{_rdwr} to return NULL instead of 0
Change bpf_dynptr_slice and bpf_dynptr_slice_rdwr to return NULL instead
of 0, in accordance with the codebase guidelines.

Fixes: 66e3a13e7c ("bpf: Add bpf_dynptr_slice and bpf_dynptr_slice_rdwr")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230302053014.1726219-1-joannelkoong@gmail.com
2023-03-01 21:36:29 -08:00
David Vernet
7ce60b110e bpf: Fix doxygen comments for dynptr slice kfuncs
In commit 66e3a13e7c ("bpf: Add bpf_dynptr_slice and
bpf_dynptr_slice_rdwr"), the bpf_dynptr_slice() and
bpf_dynptr_slice_rdwr() kfuncs were added to BPF. These kfuncs included
doxygen headers, but unfortunately those headers are not properly
formatted according to [0], and causes the following warnings during the
docs build:

./kernel/bpf/helpers.c:2225: warning: \
    Excess function parameter 'returns' description in 'bpf_dynptr_slice'
./kernel/bpf/helpers.c:2303: warning: \
    Excess function parameter 'returns' description in 'bpf_dynptr_slice_rdwr'
...

This patch fixes those doxygen comments.

[0]: https://docs.kernel.org/doc-guide/kernel-doc.html#function-documentation

Fixes: 66e3a13e7c ("bpf: Add bpf_dynptr_slice and bpf_dynptr_slice_rdwr")
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230301194910.602738-1-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-01 16:00:10 -08:00
Kumar Kartikeya Dwivedi
9db44fdd81 bpf: Support kptrs in local storage maps
Enable support for kptrs in local storage maps by wiring up the freeing
of these kptrs from map value. Freeing of bpf_local_storage_map is only
delayed in case there are special fields, therefore bpf_selem_free_*
path can also only dereference smap safely in that case. This is
recorded using a bool utilizing a hole in bpF_local_storage_elem. It
could have been tagged in the pointer value smap using the lowest bit
(since alignment > 1), but since there was already a hole I went with
the simpler option. Only the map structure freeing is delayed using RCU
barriers, as the buckets aren't used when selem is being freed, so they
can be freed once all readers of the bucket lists can no longer access
it.

Cc: Martin KaFai Lau <martin.lau@kernel.org>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230225154010.391965-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-01 10:24:33 -08:00
Kumar Kartikeya Dwivedi
65334e64a4 bpf: Support kptrs in percpu hashmap and percpu LRU hashmap
Enable support for kptrs in percpu BPF hashmap and percpu BPF LRU
hashmap by wiring up the freeing of these kptrs from percpu map
elements.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230225154010.391965-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-01 10:24:33 -08:00
Joanne Koong
66e3a13e7c bpf: Add bpf_dynptr_slice and bpf_dynptr_slice_rdwr
Two new kfuncs are added, bpf_dynptr_slice and bpf_dynptr_slice_rdwr.
The user must pass in a buffer to store the contents of the data slice
if a direct pointer to the data cannot be obtained.

For skb and xdp type dynptrs, these two APIs are the only way to obtain
a data slice. However, for other types of dynptrs, there is no
difference between bpf_dynptr_slice(_rdwr) and bpf_dynptr_data.

For skb type dynptrs, the data is copied into the user provided buffer
if any of the data is not in the linear portion of the skb. For xdp type
dynptrs, the data is copied into the user provided buffer if the data is
between xdp frags.

If the skb is cloned and a call to bpf_dynptr_data_rdwr is made, then
the skb will be uncloned (see bpf_unclone_prologue()).

Please note that any bpf_dynptr_write() automatically invalidates any prior
data slices of the skb dynptr. This is because the skb may be cloned or
may need to pull its paged buffer into the head. As such, any
bpf_dynptr_write() will automatically have its prior data slices
invalidated, even if the write is to data in the skb head of an uncloned
skb. Please note as well that any other helper calls that change the
underlying packet buffer (eg bpf_skb_pull_data()) invalidates any data
slices of the skb dynptr as well, for the same reasons.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://lore.kernel.org/r/20230301154953.641654-10-joannelkoong@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-01 09:55:24 -08:00
Joanne Koong
05421aecd4 bpf: Add xdp dynptrs
Add xdp dynptrs, which are dynptrs whose underlying pointer points
to a xdp_buff. The dynptr acts on xdp data. xdp dynptrs have two main
benefits. One is that they allow operations on sizes that are not
statically known at compile-time (eg variable-sized accesses).
Another is that parsing the packet data through dynptrs (instead of
through direct access of xdp->data and xdp->data_end) can be more
ergonomic and less brittle (eg does not need manual if checking for
being within bounds of data_end).

For reads and writes on the dynptr, this includes reading/writing
from/to and across fragments. Data slices through the bpf_dynptr_data
API are not supported; instead bpf_dynptr_slice() and
bpf_dynptr_slice_rdwr() should be used.

For examples of how xdp dynptrs can be used, please see the attached
selftests.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://lore.kernel.org/r/20230301154953.641654-9-joannelkoong@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-01 09:55:24 -08:00
Joanne Koong
b5964b968a bpf: Add skb dynptrs
Add skb dynptrs, which are dynptrs whose underlying pointer points
to a skb. The dynptr acts on skb data. skb dynptrs have two main
benefits. One is that they allow operations on sizes that are not
statically known at compile-time (eg variable-sized accesses).
Another is that parsing the packet data through dynptrs (instead of
through direct access of skb->data and skb->data_end) can be more
ergonomic and less brittle (eg does not need manual if checking for
being within bounds of data_end).

For bpf prog types that don't support writes on skb data, the dynptr is
read-only (bpf_dynptr_write() will return an error)

For reads and writes through the bpf_dynptr_read() and bpf_dynptr_write()
interfaces, reading and writing from/to data in the head as well as from/to
non-linear paged buffers is supported. Data slices through the
bpf_dynptr_data API are not supported; instead bpf_dynptr_slice() and
bpf_dynptr_slice_rdwr() (added in subsequent commit) should be used.

For examples of how skb dynptrs can be used, please see the attached
selftests.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://lore.kernel.org/r/20230301154953.641654-8-joannelkoong@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-01 09:55:24 -08:00
Joanne Koong
d96d937d7c bpf: Add __uninit kfunc annotation
This patch adds __uninit as a kfunc annotation.

This will be useful for scenarios such as for example in dynptrs,
indicating whether the dynptr should be checked by the verifier as an
initialized or an uninitialized dynptr.

Without this annotation, the alternative would be needing to hard-code
in the verifier the specific kfunc to indicate that arg should be
treated as an uninitialized arg.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://lore.kernel.org/r/20230301154953.641654-7-joannelkoong@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-01 09:55:24 -08:00
Joanne Koong
485ec51ef9 bpf: Refactor verifier dynptr into get_dynptr_arg_reg
This commit refactors the logic for determining which register in a
function is the dynptr into "get_dynptr_arg_reg". This will be used
in the future when the dynptr reg for BPF_FUNC_dynptr_write will need
to be obtained in order to support writes for skb dynptrs.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://lore.kernel.org/r/20230301154953.641654-6-joannelkoong@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-01 09:55:23 -08:00
Joanne Koong
1d18feb2c9 bpf: Allow initializing dynptrs in kfuncs
This change allows kfuncs to take in an uninitialized dynptr as a
parameter. Before this change, only helper functions could successfully
use uninitialized dynptrs. This change moves the memory access check
(including stack state growing and slot marking) into
process_dynptr_func(), which both helpers and kfuncs call into.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://lore.kernel.org/r/20230301154953.641654-4-joannelkoong@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-01 09:55:23 -08:00
Joanne Koong
7e0dac2807 bpf: Refactor process_dynptr_func
This change cleans up process_dynptr_func's flow to be more intuitive
and updates some comments with more context.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://lore.kernel.org/r/20230301154953.641654-3-joannelkoong@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-01 09:55:23 -08:00
Joanne Koong
2f46439346 bpf: Support "sk_buff" and "xdp_buff" as valid kfunc arg types
The bpf mirror of the in-kernel sk_buff and xdp_buff data structures are
__sk_buff and xdp_md. Currently, when we pass in the program ctx to a
kfunc where the program ctx is a skb or xdp buffer, we reject the
program if the in-kernel definition is sk_buff/xdp_buff instead of
__sk_buff/xdp_md.

This change allows "sk_buff <--> __sk_buff" and "xdp_buff <--> xdp_md"
to be recognized as valid matches. The user program may pass in their
program ctx as a __sk_buff or xdp_md, and the in-kernel definition
of the kfunc may define this arg as a sk_buff or xdp_buff.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://lore.kernel.org/r/20230301154953.641654-2-joannelkoong@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-01 09:55:23 -08:00
David Vernet
30a2d8328d bpf: Fix bpf_cgroup_from_id() doxygen header
In commit 332ea1f697 ("bpf: Add bpf_cgroup_from_id() kfunc"), a new
bpf_cgroup_from_id() kfunc was added which allows a BPF program to
lookup and acquire a reference to a cgroup from a cgroup id. The
commit's doxygen comment seems to have copy-pasted fields, which causes
BPF kfunc helper documentation to fail to render:

<snip>/helpers.c:2114: warning: Excess function parameter 'cgrp'...
<snip>/helpers.c:2114: warning: Excess function parameter 'level'...

<snip>

<snip>/helpers.c:2114: warning: Excess function parameter 'level'...

This patch fixes the doxygen header.

Fixes: 332ea1f697 ("bpf: Add bpf_cgroup_from_id() kfunc")
Signed-off-by: David Vernet <void@manifault.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230228152845.294695-1-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-28 08:42:11 -08:00
Linus Torvalds
3822a7c409 - Daniel Verkamp has contributed a memfd series ("mm/memfd: add
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
   memfd creation time, with the option of sealing the state of the X bit.
 
 - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
   thread-safe for pmd unshare") which addresses a rare race condition
   related to PMD unsharing.
 
 - Several folioification patch serieses from Matthew Wilcox, Vishal
   Moola, Sidhartha Kumar and Lorenzo Stoakes
 
 - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which
   does perform some memcg maintenance and cleanup work.
 
 - SeongJae Park has added DAMOS filtering to DAMON, with the series
   "mm/damon/core: implement damos filter".  These filters provide users
   with finer-grained control over DAMOS's actions.  SeongJae has also done
   some DAMON cleanup work.
 
 - Kairui Song adds a series ("Clean up and fixes for swap").
 
 - Vernon Yang contributed the series "Clean up and refinement for maple
   tree".
 
 - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series.  It
   adds to MGLRU an LRU of memcgs, to improve the scalability of global
   reclaim.
 
 - David Hildenbrand has added some userfaultfd cleanup work in the
   series "mm: uffd-wp + change_protection() cleanups".
 
 - Christoph Hellwig has removed the generic_writepages() library
   function in the series "remove generic_writepages".
 
 - Baolin Wang has performed some maintenance on the compaction code in
   his series "Some small improvements for compaction".
 
 - Sidhartha Kumar is doing some maintenance work on struct page in his
   series "Get rid of tail page fields".
 
 - David Hildenbrand contributed some cleanup, bugfixing and
   generalization of pte management and of pte debugging in his series "mm:
   support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap
   PTEs".
 
 - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
   flag in the series "Discard __GFP_ATOMIC".
 
 - Sergey Senozhatsky has improved zsmalloc's memory utilization with his
   series "zsmalloc: make zspage chain size configurable".
 
 - Joey Gouly has added prctl() support for prohibiting the creation of
   writeable+executable mappings.  The previous BPF-based approach had
   shortcomings.  See "mm: In-kernel support for memory-deny-write-execute
   (MDWE)".
 
 - Waiman Long did some kmemleak cleanup and bugfixing in the series
   "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
 
 - T.J.  Alumbaugh has contributed some MGLRU cleanup work in his series
   "mm: multi-gen LRU: improve".
 
 - Jiaqi Yan has provided some enhancements to our memory error
   statistics reporting, mainly by presenting the statistics on a per-node
   basis.  See the series "Introduce per NUMA node memory error
   statistics".
 
 - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
   regression in compaction via his series "Fix excessive CPU usage during
   compaction".
 
 - Christoph Hellwig does some vmalloc maintenance work in the series
   "cleanup vfree and vunmap".
 
 - Christoph Hellwig has removed block_device_operations.rw_page() in ths
   series "remove ->rw_page".
 
 - We get some maple_tree improvements and cleanups in Liam Howlett's
   series "VMA tree type safety and remove __vma_adjust()".
 
 - Suren Baghdasaryan has done some work on the maintainability of our
   vm_flags handling in the series "introduce vm_flags modifier functions".
 
 - Some pagemap cleanup and generalization work in Mike Rapoport's series
   "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and
   "fixups for generic implementation of pfn_valid()"
 
 - Baoquan He has done some work to make /proc/vmallocinfo and
   /proc/kcore better represent the real state of things in his series
   "mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
 
 - Jason Gunthorpe rationalized the GUP system's interface to the rest of
   the kernel in the series "Simplify the external interface for GUP".
 
 - SeongJae Park wishes to migrate people from DAMON's debugfs interface
   over to its sysfs interface.  To support this, we'll temporarily be
   printing warnings when people use the debugfs interface.  See the series
   "mm/damon: deprecate DAMON debugfs interface".
 
 - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
   and clean-ups" series.
 
 - Huang Ying has provided a dramatic reduction in migration's TLB flush
   IPI rates with the series "migrate_pages(): batch TLB flushing".
 
 - Arnd Bergmann has some objtool fixups in "objtool warning fixes".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY/PoPQAKCRDdBJ7gKXxA
 jlvpAPsFECUBBl20qSue2zCYWnHC7Yk4q9ytTkPB/MMDrFEN9wD/SNKEm2UoK6/K
 DmxHkn0LAitGgJRS/W9w81yrgig9tAQ=
 =MlGs
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Daniel Verkamp has contributed a memfd series ("mm/memfd: add
   F_SEAL_EXEC") which permits the setting of the memfd execute bit at
   memfd creation time, with the option of sealing the state of the X
   bit.

 - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
   thread-safe for pmd unshare") which addresses a rare race condition
   related to PMD unsharing.

 - Several folioification patch serieses from Matthew Wilcox, Vishal
   Moola, Sidhartha Kumar and Lorenzo Stoakes

 - Johannes Weiner has a series ("mm: push down lock_page_memcg()")
   which does perform some memcg maintenance and cleanup work.

 - SeongJae Park has added DAMOS filtering to DAMON, with the series
   "mm/damon/core: implement damos filter".

   These filters provide users with finer-grained control over DAMOS's
   actions. SeongJae has also done some DAMON cleanup work.

 - Kairui Song adds a series ("Clean up and fixes for swap").

 - Vernon Yang contributed the series "Clean up and refinement for maple
   tree".

 - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
   adds to MGLRU an LRU of memcgs, to improve the scalability of global
   reclaim.

 - David Hildenbrand has added some userfaultfd cleanup work in the
   series "mm: uffd-wp + change_protection() cleanups".

 - Christoph Hellwig has removed the generic_writepages() library
   function in the series "remove generic_writepages".

 - Baolin Wang has performed some maintenance on the compaction code in
   his series "Some small improvements for compaction".

 - Sidhartha Kumar is doing some maintenance work on struct page in his
   series "Get rid of tail page fields".

 - David Hildenbrand contributed some cleanup, bugfixing and
   generalization of pte management and of pte debugging in his series
   "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with
   swap PTEs".

 - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
   flag in the series "Discard __GFP_ATOMIC".

 - Sergey Senozhatsky has improved zsmalloc's memory utilization with
   his series "zsmalloc: make zspage chain size configurable".

 - Joey Gouly has added prctl() support for prohibiting the creation of
   writeable+executable mappings.

   The previous BPF-based approach had shortcomings. See "mm: In-kernel
   support for memory-deny-write-execute (MDWE)".

 - Waiman Long did some kmemleak cleanup and bugfixing in the series
   "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".

 - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
   "mm: multi-gen LRU: improve".

 - Jiaqi Yan has provided some enhancements to our memory error
   statistics reporting, mainly by presenting the statistics on a
   per-node basis. See the series "Introduce per NUMA node memory error
   statistics".

 - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
   regression in compaction via his series "Fix excessive CPU usage
   during compaction".

 - Christoph Hellwig does some vmalloc maintenance work in the series
   "cleanup vfree and vunmap".

 - Christoph Hellwig has removed block_device_operations.rw_page() in
   ths series "remove ->rw_page".

 - We get some maple_tree improvements and cleanups in Liam Howlett's
   series "VMA tree type safety and remove __vma_adjust()".

 - Suren Baghdasaryan has done some work on the maintainability of our
   vm_flags handling in the series "introduce vm_flags modifier
   functions".

 - Some pagemap cleanup and generalization work in Mike Rapoport's
   series "mm, arch: add generic implementation of pfn_valid() for
   FLATMEM" and "fixups for generic implementation of pfn_valid()"

 - Baoquan He has done some work to make /proc/vmallocinfo and
   /proc/kcore better represent the real state of things in his series
   "mm/vmalloc.c: allow vread() to read out vm_map_ram areas".

 - Jason Gunthorpe rationalized the GUP system's interface to the rest
   of the kernel in the series "Simplify the external interface for
   GUP".

 - SeongJae Park wishes to migrate people from DAMON's debugfs interface
   over to its sysfs interface. To support this, we'll temporarily be
   printing warnings when people use the debugfs interface. See the
   series "mm/damon: deprecate DAMON debugfs interface".

 - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
   and clean-ups" series.

 - Huang Ying has provided a dramatic reduction in migration's TLB flush
   IPI rates with the series "migrate_pages(): batch TLB flushing".

 - Arnd Bergmann has some objtool fixups in "objtool warning fixes".

* tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits)
  include/linux/migrate.h: remove unneeded externs
  mm/memory_hotplug: cleanup return value handing in do_migrate_range()
  mm/uffd: fix comment in handling pte markers
  mm: change to return bool for isolate_movable_page()
  mm: hugetlb: change to return bool for isolate_hugetlb()
  mm: change to return bool for isolate_lru_page()
  mm: change to return bool for folio_isolate_lru()
  objtool: add UACCESS exceptions for __tsan_volatile_read/write
  kmsan: disable ftrace in kmsan core code
  kasan: mark addr_has_metadata __always_inline
  mm: memcontrol: rename memcg_kmem_enabled()
  sh: initialize max_mapnr
  m68k/nommu: add missing definition of ARCH_PFN_OFFSET
  mm: percpu: fix incorrect size in pcpu_obj_full_size()
  maple_tree: reduce stack usage with gcc-9 and earlier
  mm: page_alloc: call panic() when memoryless node allocation fails
  mm: multi-gen LRU: avoid futile retries
  migrate_pages: move THP/hugetlb migration support check to simplify code
  migrate_pages: batch flushing TLB
  migrate_pages: share more code between _unmap and _move
  ...
2023-02-23 17:09:35 -08:00
Tejun Heo
332ea1f697 bpf: Add bpf_cgroup_from_id() kfunc
cgroup ID is an userspace-visible 64bit value uniquely identifying a given
cgroup. As the IDs are used widely, it's useful to be able to look up the
matching cgroups. Add bpf_cgroup_from_id().

v2: Separate out selftest into its own patch as suggested by Alexei.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/Y/bBaG96t0/gQl9/@slm.duckdns.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-23 08:33:05 -08:00
Ilya Leoshkevich
df2ccc180a bpf: Check for helper calls in check_subprogs()
The condition src_reg != BPF_PSEUDO_CALL && imm == BPF_FUNC_tail_call
may be satisfied by a kfunc call. This would lead to unnecessarily
setting has_tail_call. Use src_reg == 0 instead.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230220163756.753713-1-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-22 13:08:52 -08:00
Hou Tao
5d5de3a431 bpf: Only allocate one bpf_mem_cache for bpf_cpumask_ma
The size of bpf_cpumask is fixed, so there is no need to allocate many
bpf_mem_caches for bpf_cpumask_ma, just one bpf_mem_cache is enough.
Also add comments for bpf_mem_alloc_init() in bpf_mem_alloc.h to prevent
future miuse.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230216024821.2202916-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-22 12:59:32 -08:00
Kumar Kartikeya Dwivedi
dbd8d22863 bpf: Wrap register invalidation with a helper
Typically, verifier should use env->allow_ptr_leaks when invaliding
registers for users that don't have CAP_PERFMON or CAP_SYS_ADMIN to
avoid leaking the pointer value. This is similar in spirit to
c67cae551f ("bpf: Tighten ptr_to_btf_id checks."). In a lot of the
existing checks, we know the capabilities are present, hence we don't do
the check.

Instead of being inconsistent in the application of the check, wrap the
action of invalidating a register into a helper named 'mark_invalid_reg'
and use it in a uniform fashion to replace open coded invalidation
operations, so that the check is always made regardless of the call site
and we don't have to remember whether it needs to be done or not for
each case.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230221200646.2500777-7-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-22 12:50:34 -08:00
Kumar Kartikeya Dwivedi
da03e43a8c bpf: Fix check_reg_type for PTR_TO_BTF_ID
The current code does type matching for the case where reg->type is
PTR_TO_BTF_ID or has the PTR_TRUSTED flag. However, this only needs to
occur for non-MEM_ALLOC and non-MEM_PERCPU cases, but will include both
as per the current code.

The MEM_ALLOC case with or without PTR_TRUSTED needs to be handled
specially by the code for type_is_alloc case, while MEM_PERCPU case must
be ignored. Hence, to restore correct behavior and for clarity,
explicitly list out the handled PTR_TO_BTF_ID types which should be
handled for each case using a switch statement.

Helpers currently only take:
	PTR_TO_BTF_ID
	PTR_TO_BTF_ID | PTR_TRUSTED
	PTR_TO_BTF_ID | MEM_RCU
	PTR_TO_BTF_ID | MEM_ALLOC
	PTR_TO_BTF_ID | MEM_PERCPU
	PTR_TO_BTF_ID | MEM_PERCPU | PTR_TRUSTED

This fix was also described (for the MEM_ALLOC case) in [0].

  [0]: https://lore.kernel.org/bpf/20221121160657.h6z7xuvedybp5y7s@apollo

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230221200646.2500777-6-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-22 12:50:15 -08:00
Kumar Kartikeya Dwivedi
521d3c0a17 bpf: Remove unused MEM_ALLOC | PTR_TRUSTED checks
The plan is to supposedly tag everything with PTR_TRUSTED eventually,
however those changes should bring in their respective code, instead
of leaving it around right now. It is arguable whether PTR_TRUSTED is
required for all types, when it's only use case is making PTR_TO_BTF_ID
a bit stronger, while all other types are trusted by default.

Hence, just drop the two instances which do not occur in the verifier
for now to avoid reader confusion.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230221200646.2500777-5-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-22 12:49:52 -08:00
Kumar Kartikeya Dwivedi
0a09a2f933 bpf: Annotate data races in bpf_local_storage
There are a few cases where hlist_node is checked to be unhashed without
holding the lock protecting its modification. In this case, one must use
hlist_unhashed_lockless to avoid load tearing and KCSAN reports. Fix
this by using lockless variant in places not protected by the lock.

Since this is not prompted by any actual KCSAN reports but only from
code review, I have not included a fixes tag.

Cc: Martin KaFai Lau <martin.lau@kernel.org>
Cc: KP Singh <kpsingh@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230221200646.2500777-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-22 12:49:34 -08:00
Eduard Zingerman
6715df8d5d bpf: Allow reads from uninit stack
This commits updates the following functions to allow reads from
uninitialized stack locations when env->allow_uninit_stack option is
enabled:
- check_stack_read_fixed_off()
- check_stack_range_initialized(), called from:
  - check_stack_read_var_off()
  - check_helper_mem_access()

Such change allows to relax logic in stacksafe() to treat STACK_MISC
and STACK_INVALID in a same way and make the following stack slot
configurations equivalent:

  |  Cached state    |  Current state   |
  |   stack slot     |   stack slot     |
  |------------------+------------------|
  | STACK_INVALID or | STACK_INVALID or |
  | STACK_MISC       | STACK_SPILL   or |
  |                  | STACK_MISC    or |
  |                  | STACK_ZERO    or |
  |                  | STACK_DYNPTR     |

This leads to significant verification speed gains (see below).

The idea was suggested by Andrii Nakryiko [1] and initial patch was
created by Alexei Starovoitov [2].

Currently the env->allow_uninit_stack is allowed for programs loaded
by users with CAP_PERFMON or CAP_SYS_ADMIN capabilities.

A number of test cases from verifier/*.c were expecting uninitialized
stack access to be an error. These test cases were updated to execute
in unprivileged mode (thus preserving the tests).

The test progs/test_global_func10.c expected "invalid indirect read
from stack" error message because of the access to uninitialized
memory region. This error is no longer possible in privileged mode.
The test is updated to provoke an error "invalid indirect access to
stack" because of access to invalid stack address (such error is not
verified by progs/test_global_func*.c series of tests).

The following tests had to be removed because these can't be made
unprivileged:
- verifier/sock.c:
  - "sk_storage_get(map, skb->sk, &stack_value, 1): partially init
  stack_value"
  BPF_PROG_TYPE_SCHED_CLS programs are not executed in unprivileged mode.
- verifier/var_off.c:
  - "indirect variable-offset stack access, max_off+size > max_initialized"
  - "indirect variable-offset stack access, uninitialized"
  These tests verify that access to uninitialized stack values is
  detected when stack offset is not a constant. However, variable
  stack access is prohibited in unprivileged mode, thus these tests
  are no longer valid.

 * * *

Here is veristat log comparing this patch with current master on a
set of selftest binaries listed in tools/testing/selftests/bpf/veristat.cfg
and cilium BPF binaries (see [3]):

$ ./veristat -e file,prog,states -C -f 'states_pct<-30' master.log current.log
File                        Program                     States (A)  States (B)  States    (DIFF)
--------------------------  --------------------------  ----------  ----------  ----------------
bpf_host.o                  tail_handle_ipv6_from_host         349         244    -105 (-30.09%)
bpf_host.o                  tail_handle_nat_fwd_ipv4          1320         895    -425 (-32.20%)
bpf_lxc.o                   tail_handle_nat_fwd_ipv4          1320         895    -425 (-32.20%)
bpf_sock.o                  cil_sock4_connect                   70          48     -22 (-31.43%)
bpf_sock.o                  cil_sock4_sendmsg                   68          46     -22 (-32.35%)
bpf_xdp.o                   tail_handle_nat_fwd_ipv4          1554         803    -751 (-48.33%)
bpf_xdp.o                   tail_lb_ipv4                      6457        2473   -3984 (-61.70%)
bpf_xdp.o                   tail_lb_ipv6                      7249        3908   -3341 (-46.09%)
pyperf600_bpf_loop.bpf.o    on_event                           287         145    -142 (-49.48%)
strobemeta.bpf.o            on_event                         15915        4772  -11143 (-70.02%)
strobemeta_nounroll2.bpf.o  on_event                         17087        3820  -13267 (-77.64%)
xdp_synproxy_kern.bpf.o     syncookie_tc                     21271        6635  -14636 (-68.81%)
xdp_synproxy_kern.bpf.o     syncookie_xdp                    23122        6024  -17098 (-73.95%)
--------------------------  --------------------------  ----------  ----------  ----------------

Note: I limited selection by states_pct<-30%.

Inspection of differences in pyperf600_bpf_loop behavior shows that
the following patch for the test removes almost all differences:

    - a/tools/testing/selftests/bpf/progs/pyperf.h
    + b/tools/testing/selftests/bpf/progs/pyperf.h
    @ -266,8 +266,8 @ int __on_event(struct bpf_raw_tracepoint_args *ctx)
            }

            if (event->pthread_match || !pidData->use_tls) {
    -               void* frame_ptr;
    -               FrameData frame;
    +               void* frame_ptr = 0;
    +               FrameData frame = {};
                    Symbol sym = {};
                    int cur_cpu = bpf_get_smp_processor_id();

W/o this patch the difference comes from the following pattern
(for different variables):

    static bool get_frame_data(... FrameData *frame ...)
    {
        ...
        bpf_probe_read_user(&frame->f_code, ...);
        if (!frame->f_code)
            return false;
        ...
        bpf_probe_read_user(&frame->co_name, ...);
        if (frame->co_name)
            ...;
    }

    int __on_event(struct bpf_raw_tracepoint_args *ctx)
    {
        FrameData frame;
        ...
        get_frame_data(... &frame ...) // indirectly via a bpf_loop & callback
        ...
    }

    SEC("raw_tracepoint/kfree_skb")
    int on_event(struct bpf_raw_tracepoint_args* ctx)
    {
        ...
        ret |= __on_event(ctx);
        ret |= __on_event(ctx);
        ...
    }

With regards to value `frame->co_name` the following is important:
- Because of the conditional `if (!frame->f_code)` each call to
  __on_event() produces two states, one with `frame->co_name` marked
  as STACK_MISC, another with it as is (and marked STACK_INVALID on a
  first call).
- The call to bpf_probe_read_user() does not mark stack slots
  corresponding to `&frame->co_name` as REG_LIVE_WRITTEN but it marks
  these slots as BPF_MISC, this happens because of the following loop
  in the check_helper_call():

	for (i = 0; i < meta.access_size; i++) {
		err = check_mem_access(env, insn_idx, meta.regno, i, BPF_B,
				       BPF_WRITE, -1, false);
		if (err)
			return err;
	}

  Note the size of the write, it is a one byte write for each byte
  touched by a helper. The BPF_B write does not lead to write marks
  for the target stack slot.
- Which means that w/o this patch when second __on_event() call is
  verified `if (frame->co_name)` will propagate read marks first to a
  stack slot with STACK_MISC marks and second to a stack slot with
  STACK_INVALID marks and these states would be considered different.

[1] https://lore.kernel.org/bpf/CAEf4BzY3e+ZuC6HUa8dCiUovQRg2SzEk7M-dSkqNZyn=xEmnPA@mail.gmail.com/
[2] https://lore.kernel.org/bpf/CAADnVQKs2i1iuZ5SUGuJtxWVfGYR9kDgYKhq3rNV+kBLQCu7rA@mail.gmail.com/
[3] git@github.com:anakryiko/cilium.git

Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Co-developed-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230219200427.606541-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-22 12:34:50 -08:00
Linus Torvalds
f3dd0c5337 bpf: add missing header file include
Commit 74e19ef0ff ("uaccess: Add speculation barrier to
copy_from_user()") built fine on x86-64 and arm64, and that's the extent
of my local build testing.

It turns out those got the <linux/nospec.h> include incidentally through
other header files (<linux/kvm_host.h> in particular), but that was not
true of other architectures, resulting in build errors

  kernel/bpf/core.c: In function ‘___bpf_prog_run’:
  kernel/bpf/core.c:1913:3: error: implicit declaration of function ‘barrier_nospec’

so just make sure to explicitly include the proper <linux/nospec.h>
header file to make everybody see it.

Fixes: 74e19ef0ff ("uaccess: Add speculation barrier to copy_from_user()")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Viresh Kumar <viresh.kumar@linaro.org>
Reported-by: Huacai Chen <chenhuacai@loongson.cn>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-02-22 09:52:32 -08:00
Linus Torvalds
5b7c4cabbb Networking changes for 6.3.
Core
 ----
 
  - Add dedicated kmem_cache for typical/small skb->head, avoid having
    to access struct page at kfree time, and improve memory use.
 
  - Introduce sysctl to set default RPS configuration for new netdevs.
 
  - Define Netlink protocol specification format which can be used
    to describe messages used by each family and auto-generate parsers.
    Add tools for generating kernel data structures and uAPI headers.
 
  - Expose all net/core sysctls inside netns.
 
  - Remove 4s sleep in netpoll if carrier is instantly detected on boot.
 
  - Add configurable limit of MDB entries per port, and port-vlan.
 
  - Continue populating drop reasons throughout the stack.
 
  - Retire a handful of legacy Qdiscs and classifiers.
 
 Protocols
 ---------
 
  - Support IPv4 big TCP (TSO frames larger than 64kB).
 
  - Add IP_LOCAL_PORT_RANGE socket option, to control local port range
    on socket by socket basis.
 
  - Track and report in procfs number of MPTCP sockets used.
 
  - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP
    path manager.
 
  - IPv6: don't check net.ipv6.route.max_size and rely on garbage
    collection to free memory (similarly to IPv4).
 
  - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986).
 
  - ICMP: add per-rate limit counters.
 
  - Add support for user scanning requests in ieee802154.
 
  - Remove static WEP support.
 
  - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate
    reporting.
 
  - WiFi 7 EHT channel puncturing support (client & AP).
 
 BPF
 ---
 
  - Add a rbtree data structure following the "next-gen data structure"
    precedent set by recently added linked list, that is, by using
    kfunc + kptr instead of adding a new BPF map type.
 
  - Expose XDP hints via kfuncs with initial support for RX hash and
    timestamp metadata.
 
  - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key
    to better support decap on GRE tunnel devices not operating
    in collect metadata.
 
  - Improve x86 JIT's codegen for PROBE_MEM runtime error checks.
 
  - Remove the need for trace_printk_lock for bpf_trace_printk
    and bpf_trace_vprintk helpers.
 
  - Extend libbpf's bpf_tracing.h support for tracing arguments of
    kprobes/uprobes and syscall as a special case.
 
  - Significantly reduce the search time for module symbols
    by livepatch and BPF.
 
  - Enable cpumasks to be used as kptrs, which is useful for tracing
    programs tracking which tasks end up running on which CPUs in
    different time intervals.
 
  - Add support for BPF trampoline on s390x and riscv64.
 
  - Add capability to export the XDP features supported by the NIC.
 
  - Add __bpf_kfunc tag for marking kernel functions as kfuncs.
 
  - Add cgroup.memory=nobpf kernel parameter option to disable BPF
    memory accounting for container environments.
 
 Netfilter
 ---------
 
  - Remove the CLUSTERIP target. It has been marked as obsolete
    for years, and we still have WARN splats wrt. races of
    the out-of-band /proc interface installed by this target.
 
  - Add 'destroy' commands to nf_tables. They are identical to
    the existing 'delete' commands, but do not return an error if
    the referenced object (set, chain, rule...) did not exist.
 
 Driver API
 ----------
 
  - Improve cpumask_local_spread() locality to help NICs set the right
    IRQ affinity on AMD platforms.
 
  - Separate C22 and C45 MDIO bus transactions more clearly.
 
  - Introduce new DCB table to control DSCP rewrite on egress.
 
  - Support configuration of Physical Layer Collision Avoidance (PLCA)
    Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of
    shared medium Ethernet.
 
  - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing
    preemption of low priority frames by high priority frames.
 
  - Add support for controlling MACSec offload using netlink SET.
 
  - Rework devlink instance refcounts to allow registration and
    de-registration under the instance lock. Split the code into multiple
    files, drop some of the unnecessarily granular locks and factor out
    common parts of netlink operation handling.
 
  - Add TX frame aggregation parameters (for USB drivers).
 
  - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning
    messages with notifications for debug.
 
  - Allow offloading of UDP NEW connections via act_ct.
 
  - Add support for per action HW stats in TC.
 
  - Support hardware miss to TC action (continue processing in SW from
    a specific point in the action chain).
 
  - Warn if old Wireless Extension user space interface is used with
    modern cfg80211/mac80211 drivers. Do not support Wireless Extensions
    for Wi-Fi 7 devices at all. Everyone should switch to using nl80211
    interface instead.
 
  - Improve the CAN bit timing configuration. Use extack to return error
    messages directly to user space, update the SJW handling, including
    the definition of a new default value that will benefit CAN-FD
    controllers, by increasing their oscillator tolerance.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - nVidia BlueField-3 support (control traffic driver)
    - Ethernet support for imx93 SoCs
    - Motorcomm yt8531 gigabit Ethernet PHY
    - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA)
    - Microchip LAN8841 PHY (incl. cable diagnostics and PTP)
    - Amlogic gxl MDIO mux
 
  - WiFi:
    - RealTek RTL8188EU (rtl8xxxu)
    - Qualcomm Wi-Fi 7 devices (ath12k)
 
  - CAN:
    - Renesas R-Car V4H
 
 Drivers
 -------
 
  - Bluetooth:
    - Set Per Platform Antenna Gain (PPAG) for Intel controllers.
 
  - Ethernet NICs:
    - Intel (1G, igc):
      - support TSN / Qbv / packet scheduling features of i226 model
    - Intel (100G, ice):
      - use GNSS subsystem instead of TTY
      - multi-buffer XDP support
      - extend support for GPIO pins to E823 devices
    - nVidia/Mellanox:
      - update the shared buffer configuration on PFC commands
      - implement PTP adjphase function for HW offset control
      - TC support for Geneve and GRE with VF tunnel offload
      - more efficient crypto key management method
      - multi-port eswitch support
    - Netronome/Corigine:
      - add DCB IEEE support
      - support IPsec offloading for NFP3800
    - Freescale/NXP (enetc):
      - enetc: support XDP_REDIRECT for XDP non-linear buffers
      - enetc: improve reconfig, avoid link flap and waiting for idle
      - enetc: support MAC Merge layer
    - Other NICs:
      - sfc/ef100: add basic devlink support for ef100
      - ionic: rx_push mode operation (writing descriptors via MMIO)
      - bnxt: use the auxiliary bus abstraction for RDMA
      - r8169: disable ASPM and reset bus in case of tx timeout
      - cpsw: support QSGMII mode for J721e CPSW9G
      - cpts: support pulse-per-second output
      - ngbe: add an mdio bus driver
      - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing
      - r8152: handle devices with FW with NCM support
      - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation
      - virtio-net: support multi buffer XDP
      - virtio/vsock: replace virtio_vsock_pkt with sk_buff
      - tsnep: XDP support
 
  - Ethernet high-speed switches:
    - nVidia/Mellanox (mlxsw):
      - add support for latency TLV (in FW control messages)
    - Microchip (sparx5):
      - separate explicit and implicit traffic forwarding rules, make
        the implicit rules always active
      - add support for egress DSCP rewrite
      - IS0 VCAP support (Ingress Classification)
      - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS etc.)
      - ES2 VCAP support (Egress Access Control)
      - support for Per-Stream Filtering and Policing (802.1Q, 8.6.5.1)
 
  - Ethernet embedded switches:
    - Marvell (mv88e6xxx):
      - add MAB (port auth) offload support
      - enable PTP receive for mv88e6390
    - NXP (ocelot):
      - support MAC Merge layer
      - support for the the vsc7512 internal copper phys
    - Microchip:
      - lan9303: convert to PHYLINK
      - lan966x: support TC flower filter statistics
      - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x
      - lan937x: support Credit Based Shaper configuration
      - ksz9477: support Energy Efficient Ethernet
    - other:
      - qca8k: convert to regmap read/write API, use bulk operations
      - rswitch: Improve TX timestamp accuracy
 
  - Intel WiFi (iwlwifi):
    - EHT (Wi-Fi 7) rate reporting
    - STEP equalizer support: transfer some STEP (connection to radio
      on platforms with integrated wifi) related parameters from the
      BIOS to the firmware.
 
  - Qualcomm 802.11ax WiFi (ath11k):
    - IPQ5018 support
    - Fine Timing Measurement (FTM) responder role support
    - channel 177 support
 
  - MediaTek WiFi (mt76):
    - per-PHY LED support
    - mt7996: EHT (Wi-Fi 7) support
    - Wireless Ethernet Dispatch (WED) reset support
    - switch to using page pool allocator
 
  - RealTek WiFi (rtw89):
    - support new version of Bluetooth co-existance
 
  - Mobile:
    - rmnet: support TX aggregation.
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmP1VIYACgkQMUZtbf5S
 IrvsChAApz0rNL/sPKxXTEfxZ1tN7D3sYxYKQPomxvl5BV+MvicrLddJy3KmzEFK
 nnJNO3nuRNuH422JQ/ylZ4mGX1opa6+5QJb0UINImXUI7Fm8HHBIuPGkv7d5CheZ
 7JexFqjPJXUy9nPyh1Rra+IA9AcRd2U7jeGEZR38wb99bHJQj5Bzdk20WArEB0el
 n44aqg49LXH71bSeXRz77x5SjkwVtYiccQxLcnmTbjLU2xVraLvI2J+wAhHnVXWW
 9lrU1+V4Ex2Xcd1xR0L0cHeK+meP1TrPRAeF+JDpVI3a/zJiE7cZjfHdG/jH5xWl
 leZJqghVozrZQNtewWWO7XhUFhMDgFu3W/1vNLjSHPZEqaz1JpM67J1+ql6s63l4
 LMWoXbcYZz+SL9ZRCoPkbGue/5fKSHv8/Jl9Sh58+eTS+c/zgN8uFGRNFXLX1+EP
 n8uvt985PxMd6x1+dHumhOUzxnY4Sfi1vjitSunTsNFQ3Cmp4SO0IfBVJWfLUCuC
 xz5hbJGJJbSpvUsO+HWyCg83E5OWghRE/Onpt2jsQSZCrO9HDg4FRTEf3WAMgaqc
 edb5KfbRZPTJQM08gWdluXzSk1nw3FNP2tXW4XlgUrEbjb+fOk0V9dQg2gyYTxQ1
 Nhvn8ZQPi6/GMMELHAIPGmmW1allyOGiAzGlQsv8EmL+OFM6WDI=
 =xXhC
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core:

   - Add dedicated kmem_cache for typical/small skb->head, avoid having
     to access struct page at kfree time, and improve memory use.

   - Introduce sysctl to set default RPS configuration for new netdevs.

   - Define Netlink protocol specification format which can be used to
     describe messages used by each family and auto-generate parsers.
     Add tools for generating kernel data structures and uAPI headers.

   - Expose all net/core sysctls inside netns.

   - Remove 4s sleep in netpoll if carrier is instantly detected on
     boot.

   - Add configurable limit of MDB entries per port, and port-vlan.

   - Continue populating drop reasons throughout the stack.

   - Retire a handful of legacy Qdiscs and classifiers.

  Protocols:

   - Support IPv4 big TCP (TSO frames larger than 64kB).

   - Add IP_LOCAL_PORT_RANGE socket option, to control local port range
     on socket by socket basis.

   - Track and report in procfs number of MPTCP sockets used.

   - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path
     manager.

   - IPv6: don't check net.ipv6.route.max_size and rely on garbage
     collection to free memory (similarly to IPv4).

   - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986).

   - ICMP: add per-rate limit counters.

   - Add support for user scanning requests in ieee802154.

   - Remove static WEP support.

   - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate
     reporting.

   - WiFi 7 EHT channel puncturing support (client & AP).

  BPF:

   - Add a rbtree data structure following the "next-gen data structure"
     precedent set by recently added linked list, that is, by using
     kfunc + kptr instead of adding a new BPF map type.

   - Expose XDP hints via kfuncs with initial support for RX hash and
     timestamp metadata.

   - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to
     better support decap on GRE tunnel devices not operating in collect
     metadata.

   - Improve x86 JIT's codegen for PROBE_MEM runtime error checks.

   - Remove the need for trace_printk_lock for bpf_trace_printk and
     bpf_trace_vprintk helpers.

   - Extend libbpf's bpf_tracing.h support for tracing arguments of
     kprobes/uprobes and syscall as a special case.

   - Significantly reduce the search time for module symbols by
     livepatch and BPF.

   - Enable cpumasks to be used as kptrs, which is useful for tracing
     programs tracking which tasks end up running on which CPUs in
     different time intervals.

   - Add support for BPF trampoline on s390x and riscv64.

   - Add capability to export the XDP features supported by the NIC.

   - Add __bpf_kfunc tag for marking kernel functions as kfuncs.

   - Add cgroup.memory=nobpf kernel parameter option to disable BPF
     memory accounting for container environments.

  Netfilter:

   - Remove the CLUSTERIP target. It has been marked as obsolete for
     years, and we still have WARN splats wrt races of the out-of-band
     /proc interface installed by this target.

   - Add 'destroy' commands to nf_tables. They are identical to the
     existing 'delete' commands, but do not return an error if the
     referenced object (set, chain, rule...) did not exist.

  Driver API:

   - Improve cpumask_local_spread() locality to help NICs set the right
     IRQ affinity on AMD platforms.

   - Separate C22 and C45 MDIO bus transactions more clearly.

   - Introduce new DCB table to control DSCP rewrite on egress.

   - Support configuration of Physical Layer Collision Avoidance (PLCA)
     Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of
     shared medium Ethernet.

   - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing
     preemption of low priority frames by high priority frames.

   - Add support for controlling MACSec offload using netlink SET.

   - Rework devlink instance refcounts to allow registration and
     de-registration under the instance lock. Split the code into
     multiple files, drop some of the unnecessarily granular locks and
     factor out common parts of netlink operation handling.

   - Add TX frame aggregation parameters (for USB drivers).

   - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning
     messages with notifications for debug.

   - Allow offloading of UDP NEW connections via act_ct.

   - Add support for per action HW stats in TC.

   - Support hardware miss to TC action (continue processing in SW from
     a specific point in the action chain).

   - Warn if old Wireless Extension user space interface is used with
     modern cfg80211/mac80211 drivers. Do not support Wireless
     Extensions for Wi-Fi 7 devices at all. Everyone should switch to
     using nl80211 interface instead.

   - Improve the CAN bit timing configuration. Use extack to return
     error messages directly to user space, update the SJW handling,
     including the definition of a new default value that will benefit
     CAN-FD controllers, by increasing their oscillator tolerance.

  New hardware / drivers:

   - Ethernet:
      - nVidia BlueField-3 support (control traffic driver)
      - Ethernet support for imx93 SoCs
      - Motorcomm yt8531 gigabit Ethernet PHY
      - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA)
      - Microchip LAN8841 PHY (incl. cable diagnostics and PTP)
      - Amlogic gxl MDIO mux

   - WiFi:
      - RealTek RTL8188EU (rtl8xxxu)
      - Qualcomm Wi-Fi 7 devices (ath12k)

   - CAN:
      - Renesas R-Car V4H

  Drivers:

   - Bluetooth:
      - Set Per Platform Antenna Gain (PPAG) for Intel controllers.

   - Ethernet NICs:
      - Intel (1G, igc):
         - support TSN / Qbv / packet scheduling features of i226 model
      - Intel (100G, ice):
         - use GNSS subsystem instead of TTY
         - multi-buffer XDP support
         - extend support for GPIO pins to E823 devices
      - nVidia/Mellanox:
         - update the shared buffer configuration on PFC commands
         - implement PTP adjphase function for HW offset control
         - TC support for Geneve and GRE with VF tunnel offload
         - more efficient crypto key management method
         - multi-port eswitch support
      - Netronome/Corigine:
         - add DCB IEEE support
         - support IPsec offloading for NFP3800
      - Freescale/NXP (enetc):
         - support XDP_REDIRECT for XDP non-linear buffers
         - improve reconfig, avoid link flap and waiting for idle
         - support MAC Merge layer
      - Other NICs:
         - sfc/ef100: add basic devlink support for ef100
         - ionic: rx_push mode operation (writing descriptors via MMIO)
         - bnxt: use the auxiliary bus abstraction for RDMA
         - r8169: disable ASPM and reset bus in case of tx timeout
         - cpsw: support QSGMII mode for J721e CPSW9G
         - cpts: support pulse-per-second output
         - ngbe: add an mdio bus driver
         - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing
         - r8152: handle devices with FW with NCM support
         - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation
         - virtio-net: support multi buffer XDP
         - virtio/vsock: replace virtio_vsock_pkt with sk_buff
         - tsnep: XDP support

   - Ethernet high-speed switches:
      - nVidia/Mellanox (mlxsw):
         - add support for latency TLV (in FW control messages)
      - Microchip (sparx5):
         - separate explicit and implicit traffic forwarding rules, make
           the implicit rules always active
         - add support for egress DSCP rewrite
         - IS0 VCAP support (Ingress Classification)
         - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS
           etc.)
         - ES2 VCAP support (Egress Access Control)
         - support for Per-Stream Filtering and Policing (802.1Q,
           8.6.5.1)

   - Ethernet embedded switches:
      - Marvell (mv88e6xxx):
         - add MAB (port auth) offload support
         - enable PTP receive for mv88e6390
      - NXP (ocelot):
         - support MAC Merge layer
         - support for the the vsc7512 internal copper phys
      - Microchip:
         - lan9303: convert to PHYLINK
         - lan966x: support TC flower filter statistics
         - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x
         - lan937x: support Credit Based Shaper configuration
         - ksz9477: support Energy Efficient Ethernet
      - other:
         - qca8k: convert to regmap read/write API, use bulk operations
         - rswitch: Improve TX timestamp accuracy

   - Intel WiFi (iwlwifi):
      - EHT (Wi-Fi 7) rate reporting
      - STEP equalizer support: transfer some STEP (connection to radio
        on platforms with integrated wifi) related parameters from the
        BIOS to the firmware.

   - Qualcomm 802.11ax WiFi (ath11k):
      - IPQ5018 support
      - Fine Timing Measurement (FTM) responder role support
      - channel 177 support

   - MediaTek WiFi (mt76):
      - per-PHY LED support
      - mt7996: EHT (Wi-Fi 7) support
      - Wireless Ethernet Dispatch (WED) reset support
      - switch to using page pool allocator

   - RealTek WiFi (rtw89):
      - support new version of Bluetooth co-existance

   - Mobile:
      - rmnet: support TX aggregation"

* tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits)
  page_pool: add a comment explaining the fragment counter usage
  net: ethtool: fix __ethtool_dev_mm_supported() implementation
  ethtool: pse-pd: Fix double word in comments
  xsk: add linux/vmalloc.h to xsk.c
  sefltests: netdevsim: wait for devlink instance after netns removal
  selftest: fib_tests: Always cleanup before exit
  net/mlx5e: Align IPsec ASO result memory to be as required by hardware
  net/mlx5e: TC, Set CT miss to the specific ct action instance
  net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG
  net/mlx5: Refactor tc miss handling to a single function
  net/mlx5: Kconfig: Make tc offload depend on tc skb extension
  net/sched: flower: Support hardware miss to tc action
  net/sched: flower: Move filter handle initialization earlier
  net/sched: cls_api: Support hardware miss to tc action
  net/sched: Rename user cookie and act cookie
  sfc: fix builds without CONFIG_RTC_LIB
  sfc: clean up some inconsistent indentings
  net/mlx4_en: Introduce flexible array to silence overflow warning
  net: lan966x: Fix possible deadlock inside PTP
  net/ulp: Remove redundant ->clone() test in inet_clone_ulp().
  ...
2023-02-21 18:24:12 -08:00
Dave Hansen
74e19ef0ff uaccess: Add speculation barrier to copy_from_user()
The results of "access_ok()" can be mis-speculated.  The result is that
you can end speculatively:

	if (access_ok(from, size))
		// Right here

even for bad from/size combinations.  On first glance, it would be ideal
to just add a speculation barrier to "access_ok()" so that its results
can never be mis-speculated.

But there are lots of system calls just doing access_ok() via
"copy_to_user()" and friends (example: fstat() and friends).  Those are
generally not problematic because they do not _consume_ data from
userspace other than the pointer.  They are also very quick and common
system calls that should not be needlessly slowed down.

"copy_from_user()" on the other hand uses a user-controller pointer and
is frequently followed up with code that might affect caches.  Take
something like this:

	if (!copy_from_user(&kernelvar, uptr, size))
		do_something_with(kernelvar);

If userspace passes in an evil 'uptr' that *actually* points to a kernel
addresses, and then do_something_with() has cache (or other)
side-effects, it could allow userspace to infer kernel data values.

Add a barrier to the common copy_from_user() code to prevent
mis-speculated values which happen after the copy.

Also add a stub for architectures that do not define barrier_nospec().
This makes the macro usable in generic code.

Since the barrier is now usable in generic code, the x86 #ifdef in the
BPF code can also go away.

Reported-by: Jordy Zomer <jordyzomer@google.com>
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>   # BPF bits
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-02-21 14:45:22 -08:00
Linus Torvalds
05e6295f7b fs.idmapped.v6.3
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY+5NlQAKCRCRxhvAZXjc
 orOaAP9i2h3OJy95nO2Fpde0Bt2UT+oulKCCcGlvXJ8/+TQpyQD/ZQq47gFQ0EAz
 Br5NxeyGeecAb0lHpFz+CpLGsxMrMwQ=
 =+BG5
 -----END PGP SIGNATURE-----

Merge tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping

Pull vfs idmapping updates from Christian Brauner:

 - Last cycle we introduced the dedicated struct mnt_idmap type for
   mount idmapping and the required infrastucture in 256c8aed2b ("fs:
   introduce dedicated idmap type for mounts"). As promised in last
   cycle's pull request message this converts everything to rely on
   struct mnt_idmap.

   Currently we still pass around the plain namespace that was attached
   to a mount. This is in general pretty convenient but it makes it easy
   to conflate namespaces that are relevant on the filesystem with
   namespaces that are relevant on the mount level. Especially for
   non-vfs developers without detailed knowledge in this area this was a
   potential source for bugs.

   This finishes the conversion. Instead of passing the plain namespace
   around this updates all places that currently take a pointer to a
   mnt_userns with a pointer to struct mnt_idmap.

   Now that the conversion is done all helpers down to the really
   low-level helpers only accept a struct mnt_idmap argument instead of
   two namespace arguments.

   Conflating mount and other idmappings will now cause the compiler to
   complain loudly thus eliminating the possibility of any bugs. This
   makes it impossible for filesystem developers to mix up mount and
   filesystem idmappings as they are two distinct types and require
   distinct helpers that cannot be used interchangeably.

   Everything associated with struct mnt_idmap is moved into a single
   separate file. With that change no code can poke around in struct
   mnt_idmap. It can only be interacted with through dedicated helpers.
   That means all filesystems are and all of the vfs is completely
   oblivious to the actual implementation of idmappings.

   We are now also able to extend struct mnt_idmap as we see fit. For
   example, we can decouple it completely from namespaces for users that
   don't require or don't want to use them at all. We can also extend
   the concept of idmappings so we can cover filesystem specific
   requirements.

   In combination with the vfs{g,u}id_t work we finished in v6.2 this
   makes this feature substantially more robust and thus difficult to
   implement wrong by a given filesystem and also protects the vfs.

 - Enable idmapped mounts for tmpfs and fulfill a longstanding request.

   A long-standing request from users had been to make it possible to
   create idmapped mounts for tmpfs. For example, to share the host's
   tmpfs mount between multiple sandboxes. This is a prerequisite for
   some advanced Kubernetes cases. Systemd also has a range of use-cases
   to increase service isolation. And there are more users of this.

   However, with all of the other work going on this was way down on the
   priority list but luckily someone other than ourselves picked this
   up.

   As usual the patch is tiny as all the infrastructure work had been
   done multiple kernel releases ago. In addition to all the tests that
   we already have I requested that Rodrigo add a dedicated tmpfs
   testsuite for idmapped mounts to xfstests. It is to be included into
   xfstests during the v6.3 development cycle. This should add a slew of
   additional tests.

* tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (26 commits)
  shmem: support idmapped mounts for tmpfs
  fs: move mnt_idmap
  fs: port vfs{g,u}id helpers to mnt_idmap
  fs: port fs{g,u}id helpers to mnt_idmap
  fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap
  fs: port i_{g,u}id_{needs_}update() to mnt_idmap
  quota: port to mnt_idmap
  fs: port privilege checking helpers to mnt_idmap
  fs: port inode_owner_or_capable() to mnt_idmap
  fs: port inode_init_owner() to mnt_idmap
  fs: port acl to mnt_idmap
  fs: port xattr to mnt_idmap
  fs: port ->permission() to pass mnt_idmap
  fs: port ->fileattr_set() to pass mnt_idmap
  fs: port ->set_acl() to pass mnt_idmap
  fs: port ->get_acl() to pass mnt_idmap
  fs: port ->tmpfile() to pass mnt_idmap
  fs: port ->rename() to pass mnt_idmap
  fs: port ->mknod() to pass mnt_idmap
  fs: port ->mkdir() to pass mnt_idmap
  ...
2023-02-20 11:53:11 -08:00
Andrii Nakryiko
d384dce281 bpf: Fix global subprog context argument resolution logic
KPROBE program's user-facing context type is defined as typedef
bpf_user_pt_regs_t. This leads to a problem when trying to passing
kprobe/uprobe/usdt context argument into global subprog, as kernel
always strip away mods and typedefs of user-supplied type, but takes
expected type from bpf_ctx_convert as is, which causes mismatch.

Current way to work around this is to define a fake struct with the same
name as expected typedef:

  struct bpf_user_pt_regs_t {};

  __noinline my_global_subprog(struct bpf_user_pt_regs_t *ctx) { ... }

This patch fixes the issue by resolving expected type, if it's not
a struct. It still leaves the above work-around working for backwards
compatibility.

Fixes: 91cc1a9974 ("bpf: Annotate context types")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230216045954.3002473-2-andrii@kernel.org
2023-02-17 21:20:44 +01:00
Hou Tao
997849c4b9 bpf: Zeroing allocated object from slab in bpf memory allocator
Currently the freed element in bpf memory allocator may be immediately
reused, for htab map the reuse will reinitialize special fields in map
value (e.g., bpf_spin_lock), but lookup procedure may still access
these special fields, and it may lead to hard-lockup as shown below:

 NMI backtrace for cpu 16
 CPU: 16 PID: 2574 Comm: htab.bin Tainted: G             L     6.1.0+ #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
 RIP: 0010:queued_spin_lock_slowpath+0x283/0x2c0
 ......
 Call Trace:
  <TASK>
  copy_map_value_locked+0xb7/0x170
  bpf_map_copy_value+0x113/0x3c0
  __sys_bpf+0x1c67/0x2780
  __x64_sys_bpf+0x1c/0x20
  do_syscall_64+0x30/0x60
  entry_SYSCALL_64_after_hwframe+0x46/0xb0
 ......
  </TASK>

For htab map, just like the preallocated case, these is no need to
initialize these special fields in map value again once these fields
have been initialized. For preallocated htab map, these fields are
initialized through __GFP_ZERO in bpf_map_area_alloc(), so do the
similar thing for non-preallocated htab in bpf memory allocator. And
there is no need to use __GFP_ZERO for per-cpu bpf memory allocator,
because __alloc_percpu_gfp() does it implicitly.

Fixes: 0fd7c5d433 ("bpf: Optimize call_rcu in non-preallocated hash map.")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230215082132.3856544-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-15 15:40:06 -08:00
Eduard Zingerman
31ff213512 bpf: BPF_ST with variable offset should preserve STACK_ZERO marks
BPF_STX instruction preserves STACK_ZERO marks for variable offset
writes in situations like below:

  *(u64*)(r10 - 8) = 0   ; STACK_ZERO marks for fp[-8]
  r0 = random(-7, -1)    ; some random number in range of [-7, -1]
  r0 += r10              ; r0 is now a variable offset pointer to stack
  r1 = 0
  *(u8*)(r0) = r1        ; BPF_STX writing zero, STACK_ZERO mark for
                         ; fp[-8] is preserved

This commit updates verifier.c:check_stack_write_var_off() to process
BPF_ST in a similar manner, e.g. the following example:

  *(u64*)(r10 - 8) = 0   ; STACK_ZERO marks for fp[-8]
  r0 = random(-7, -1)    ; some random number in range of [-7, -1]
  r0 += r10              ; r0 is now variable offset pointer to stack
  *(u8*)(r0) = 0         ; BPF_ST writing zero, STACK_ZERO mark for
                         ; fp[-8] is preserved

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230214232030.1502829-4-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-15 11:48:47 -08:00
Eduard Zingerman
ecdf985d76 bpf: track immediate values written to stack by BPF_ST instruction
For aligned stack writes using BPF_ST instruction track stored values
in a same way BPF_STX is handled, e.g. make sure that the following
commands produce similar verifier knowledge:

  fp[-8] = 42;             r1 = 42;
                       fp[-8] = r1;

This covers two cases:
 - non-null values written to stack are stored as spill of fake
   registers;
 - null values written to stack are stored as STACK_ZERO marks.

Previously both cases above used STACK_MISC marks instead.

Some verifier test cases relied on the old logic to obtain STACK_MISC
marks for some stack values. These test cases are updated in the same
commit to avoid failures during bisect.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230214232030.1502829-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-15 11:48:47 -08:00
Dave Marchevsky
a40d363243 bpf: Special verifier handling for bpf_rbtree_{remove, first}
Newly-added bpf_rbtree_{remove,first} kfuncs have some special properties
that require handling in the verifier:

  * both bpf_rbtree_remove and bpf_rbtree_first return the type containing
    the bpf_rb_node field, with the offset set to that field's offset,
    instead of a struct bpf_rb_node *
    * mark_reg_graph_node helper added in previous patch generalizes
      this logic, use it

  * bpf_rbtree_remove's node input is a node that's been inserted
    in the tree - a non-owning reference.

  * bpf_rbtree_remove must invalidate non-owning references in order to
    avoid aliasing issue. Use previously-added
    invalidate_non_owning_refs helper to mark this function as a
    non-owning ref invalidation point.

  * Unlike other functions, which convert one of their input arg regs to
    non-owning reference, bpf_rbtree_first takes no arguments and just
    returns a non-owning reference (possibly null)
    * For now verifier logic for this is special-cased instead of
      adding new kfunc flag.

This patch, along with the previous one, complete special verifier
handling for all rbtree API functions added in this series.

With functional verifier handling of rbtree_remove, under current
non-owning reference scheme, a node type with both bpf_{list,rb}_node
fields could cause the verifier to accept programs which remove such
nodes from collections they haven't been added to.

In order to prevent this, this patch adds a check to btf_parse_fields
which rejects structs with both bpf_{list,rb}_node fields. This is a
temporary measure that can be removed after "collection identity"
followup. See comment added in btf_parse_fields. A linked_list BTF test
exercising the new check is added in this patch as well.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230214004017.2534011-6-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-13 19:40:53 -08:00
Dave Marchevsky
5d92ddc3de bpf: Add callback validation to kfunc verifier logic
Some BPF helpers take a callback function which the helper calls. For
each helper that takes such a callback, there's a special call to
__check_func_call with a callback-state-setting callback that sets up
verifier bpf_func_state for the callback's frame.

kfuncs don't have any of this infrastructure yet, so let's add it in
this patch, following existing helper pattern as much as possible. To
validate functionality of this added plumbing, this patch adds
callback handling for the bpf_rbtree_add kfunc and hopes to lay
groundwork for future graph datastructure callbacks.

In the "general plumbing" category we have:

  * check_kfunc_call doing callback verification right before clearing
    CALLER_SAVED_REGS, exactly like check_helper_call
  * recognition of func_ptr BTF types in kfunc args as
    KF_ARG_PTR_TO_CALLBACK + propagation of subprogno for this arg type

In the "rbtree_add / graph datastructure-specific plumbing" category:

  * Since bpf_rbtree_add must be called while the spin_lock associated
    with the tree is held, don't complain when callback's func_state
    doesn't unlock it by frame exit
  * Mark rbtree_add callback's args with ref_set_non_owning
    to prevent rbtree api functions from being called in the callback.
    Semantically this makes sense, as less() takes no ownership of its
    args when determining which comes first.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230214004017.2534011-5-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-13 19:40:53 -08:00
Dave Marchevsky
cd6791b4b6 bpf: Add support for bpf_rb_root and bpf_rb_node in kfunc args
Now that we find bpf_rb_root and bpf_rb_node in structs, let's give args
that contain those types special classification and properly handle
these types when checking kfunc args.

"Properly handling" these types largely requires generalizing similar
handling for bpf_list_{head,node}, with little new logic added in this
patch.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230214004017.2534011-4-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-13 19:40:53 -08:00
Dave Marchevsky
bd1279ae8a bpf: Add bpf_rbtree_{add,remove,first} kfuncs
This patch adds implementations of bpf_rbtree_{add,remove,first}
and teaches verifier about their BTF_IDs as well as those of
bpf_rb_{root,node}.

All three kfuncs have some nonstandard component to their verification
that needs to be addressed in future patches before programs can
properly use them:

  * bpf_rbtree_add:     Takes 'less' callback, need to verify it

  * bpf_rbtree_first:   Returns ptr_to_node_type(off=rb_node_off) instead
                        of ptr_to_rb_node(off=0). Return value ref is
			non-owning.

  * bpf_rbtree_remove:  Returns ptr_to_node_type(off=rb_node_off) instead
                        of ptr_to_rb_node(off=0). 2nd arg (node) is a
			non-owning reference.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230214004017.2534011-3-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-13 19:40:48 -08:00
Dave Marchevsky
9c395c1b99 bpf: Add basic bpf_rb_{root,node} support
This patch adds special BPF_RB_{ROOT,NODE} btf_field_types similar to
BPF_LIST_{HEAD,NODE}, adds the necessary plumbing to detect the new
types, and adds bpf_rb_root_free function for freeing bpf_rb_root in
map_values.

structs bpf_rb_root and bpf_rb_node are opaque types meant to
obscure structs rb_root_cached rb_node, respectively.

btf_struct_access will prevent BPF programs from touching these special
fields automatically now that they're recognized.

btf_check_and_fixup_fields now groups list_head and rb_root together as
"graph root" fields and {list,rb}_node as "graph node", and does same
ownership cycle checking as before. Note that this function does _not_
prevent ownership type mixups (e.g. rb_root owning list_node) - that's
handled by btf_parse_graph_root.

After this patch, a bpf program can have a struct bpf_rb_root in a
map_value, but not add anything to nor do anything useful with it.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230214004017.2534011-2-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-13 19:31:13 -08:00
Dave Marchevsky
6a3cd3318f bpf: Migrate release_on_unlock logic to non-owning ref semantics
This patch introduces non-owning reference semantics to the verifier,
specifically linked_list API kfunc handling. release_on_unlock logic for
refs is refactored - with small functional changes - to implement these
semantics, and bpf_list_push_{front,back} are migrated to use them.

When a list node is pushed to a list, the program still has a pointer to
the node:

  n = bpf_obj_new(typeof(*n));

  bpf_spin_lock(&l);
  bpf_list_push_back(&l, n);
  /* n still points to the just-added node */
  bpf_spin_unlock(&l);

What the verifier considers n to be after the push, and thus what can be
done with n, are changed by this patch.

Common properties both before/after this patch:
  * After push, n is only a valid reference to the node until end of
    critical section
  * After push, n cannot be pushed to any list
  * After push, the program can read the node's fields using n

Before:
  * After push, n retains the ref_obj_id which it received on
    bpf_obj_new, but the associated bpf_reference_state's
    release_on_unlock field is set to true
    * release_on_unlock field and associated logic is used to implement
      "n is only a valid ref until end of critical section"
  * After push, n cannot be written to, the node must be removed from
    the list before writing to its fields
  * After push, n is marked PTR_UNTRUSTED

After:
  * After push, n's ref is released and ref_obj_id set to 0. NON_OWN_REF
    type flag is added to reg's type, indicating that it's a non-owning
    reference.
    * NON_OWN_REF flag and logic is used to implement "n is only a
      valid ref until end of critical section"
  * n can be written to (except for special fields e.g. bpf_list_node,
    timer, ...)

Summary of specific implementation changes to achieve the above:

  * release_on_unlock field, ref_set_release_on_unlock helper, and logic
    to "release on unlock" based on that field are removed

  * The anonymous active_lock struct used by bpf_verifier_state is
    pulled out into a named struct bpf_active_lock.

  * NON_OWN_REF type flag is introduced along with verifier logic
    changes to handle non-owning refs

  * Helpers are added to use NON_OWN_REF flag to implement non-owning
    ref semantics as described above
    * invalidate_non_owning_refs - helper to clobber all non-owning refs
      matching a particular bpf_active_lock identity. Replaces
      release_on_unlock logic in process_spin_lock.
    * ref_set_non_owning - set NON_OWN_REF type flag after doing some
      sanity checking
    * ref_convert_owning_non_owning - convert owning reference w/
      specified ref_obj_id to non-owning references. Set NON_OWN_REF
      flag for each reg with that ref_obj_id and 0-out its ref_obj_id

  * Update linked_list selftests to account for minor semantic
    differences introduced by this patch
    * Writes to a release_on_unlock node ref are not allowed, while
      writes to non-owning reference pointees are. As a result the
      linked_list "write after push" failure tests are no longer scenarios
      that should fail.
    * The test##missing_lock##op and test##incorrect_lock##op
      macro-generated failure tests need to have a valid node argument in
      order to have the same error output as before. Otherwise
      verification will fail early and the expected error output won't be seen.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230212092715.1422619-2-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-13 13:37:37 -08:00
Yafang Shao
bf39650824 bpf: allow to disable bpf prog memory accounting
We can simply disable the bpf prog memory accouting by not setting the
GFP_ACCOUNT.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://lore.kernel.org/r/20230210154734.4416-5-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-10 18:59:57 -08:00
Yafang Shao
ee53cbfb1e bpf: allow to disable bpf map memory accounting
We can simply set root memcg as the map's memcg to disable bpf memory
accounting. bpf_map_area_alloc is a little special as it gets the memcg
from current rather than from the map, so we need to disable GFP_ACCOUNT
specifically for it.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://lore.kernel.org/r/20230210154734.4416-4-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-10 18:59:57 -08:00
Yafang Shao
ddef81b5fd bpf: use bpf_map_kvcalloc in bpf_local_storage
Introduce new helper bpf_map_kvcalloc() for the memory allocation in
bpf_local_storage(). Then the allocation will charge the memory from the
map instead of from current, though currently they are the same thing as
it is only used in map creation path now. By charging map's memory into
the memcg from the map, it will be more clear.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://lore.kernel.org/r/20230210154734.4416-3-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-10 18:59:56 -08:00
Jakub Kicinski
de42873367 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY+bZrwAKCRDbK58LschI
 gzi4AP4+TYo0jnSwwkrOoN9l4f5VO9X8osmj3CXfHBv7BGWVxAD/WnvA3TDZyaUd
 agIZTkRs6BHF9He8oROypARZxTeMLwM=
 =nO1C
 -----END PGP SIGNATURE-----

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-02-11

We've added 96 non-merge commits during the last 14 day(s) which contain
a total of 152 files changed, 4884 insertions(+), 962 deletions(-).

There is a minor conflict in drivers/net/ethernet/intel/ice/ice_main.c
between commit 5b246e533d ("ice: split probe into smaller functions")
from the net-next tree and commit 66c0e13ad2 ("drivers: net: turn on
XDP features") from the bpf-next tree. Remove the hunk given ice_cfg_netdev()
is otherwise there a 2nd time, and add XDP features to the existing
ice_cfg_netdev() one:

        [...]
        ice_set_netdev_features(netdev);
        netdev->xdp_features = NETDEV_XDP_ACT_BASIC | NETDEV_XDP_ACT_REDIRECT |
                               NETDEV_XDP_ACT_XSK_ZEROCOPY;
        ice_set_ops(netdev);
        [...]

Stephen's merge conflict mail:
https://lore.kernel.org/bpf/20230207101951.21a114fa@canb.auug.org.au/

The main changes are:

1) Add support for BPF trampoline on s390x which finally allows to remove many
   test cases from the BPF CI's DENYLIST.s390x, from Ilya Leoshkevich.

2) Add multi-buffer XDP support to ice driver, from Maciej Fijalkowski.

3) Add capability to export the XDP features supported by the NIC.
   Along with that, add a XDP compliance test tool,
   from Lorenzo Bianconi & Marek Majtyka.

4) Add __bpf_kfunc tag for marking kernel functions as kfuncs,
   from David Vernet.

5) Add a deep dive documentation about the verifier's register
   liveness tracking algorithm, from Eduard Zingerman.

6) Fix and follow-up cleanups for resolve_btfids to be compiled
   as a host program to avoid cross compile issues,
   from Jiri Olsa & Ian Rogers.

7) Batch of fixes to the BPF selftest for xdp_hw_metadata which resulted
   when testing on different NICs, from Jesper Dangaard Brouer.

8) Fix libbpf to better detect kernel version code on Debian, from Hao Xiang.

9) Extend libbpf to add an option for when the perf buffer should
   wake up, from Jon Doron.

10) Follow-up fix on xdp_metadata selftest to just consume on TX
    completion, from Stanislav Fomichev.

11) Extend the kfuncs.rst document with description on kfunc
    lifecycle & stability expectations, from David Vernet.

12) Fix bpftool prog profile to skip attaching to offline CPUs,
    from Tonghao Zhang.

====================

Link: https://lore.kernel.org/r/20230211002037.8489-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-10 17:51:27 -08:00
Jakub Kicinski
025a785ff0 net: skbuff: drop the word head from skb cache
skbuff_head_cache is misnamed (perhaps for historical reasons?)
because it does not hold heads. Head is the buffer which skb->data
points to, and also where shinfo lives. struct sk_buff is a metadata
structure, not the head.

Eric recently added skb_small_head_cache (which allocates actual
head buffers), let that serve as an excuse to finally clean this up :)

Leave the user-space visible name intact, it could possibly be uAPI.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-10 09:10:28 +00:00
Suren Baghdasaryan
1c71222e5f mm: replace vma->vm_flags direct modifications with modifier calls
Replace direct modifications to vma->vm_flags with calls to modifier
functions to be able to track flag changes and to keep vma locking
correctness.

[akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo]
Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Oskolkov <posk@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-09 16:51:39 -08:00
Lorenzo Bianconi
b9d460c924 bpf: devmap: check XDP features in __xdp_enqueue routine
Check if the destination device implements ndo_xdp_xmit callback relying
on NETDEV_XDP_ACT_NDO_XMIT flags. Moreover, check if the destination device
supports XDP non-linear frame in __xdp_enqueue and is_valid_dst routines.
This patch allows to perform XDP_REDIRECT on non-linear XDP buffers.

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/r/26a94c33520c0bfba021b3fbb2cb8c1e69bf53b8.1675245258.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-02 20:48:24 -08:00
Tobias Klauser
158e5e9eea bpf: Drop always true do_idr_lock parameter to bpf_map_free_id
The do_idr_lock parameter to bpf_map_free_id was introduced by commit
bd5f5f4ecb ("bpf: Add BPF_MAP_GET_FD_BY_ID"). However, all callers set
do_idr_lock = true since commit 1e0bd5a091 ("bpf: Switch bpf_map ref
counter to atomic64_t so bpf_map_inc() never fails").

While at it also inline __bpf_map_put into its only caller bpf_map_put
now that do_idr_lock can be dropped from its signature.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Link: https://lore.kernel.org/r/20230202141921.4424-1-tklauser@distanz.ch
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-02 20:26:12 -08:00
Jakub Kicinski
82b4a9412b Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
net/core/gro.c
  7d2c89b325 ("skb: Do mix page pool and page referenced frags in GRO")
  b1a78b9b98 ("net: add support for ipv4 big tcp")
https://lore.kernel.org/all/20230203094454.5766f160@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-02 14:49:55 -08:00
David Vernet
400031e05a bpf: Add __bpf_kfunc tag to all kfuncs
Now that we have the __bpf_kfunc tag, we should use add it to all
existing kfuncs to ensure that they'll never be elided in LTO builds.

Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230201173016.342758-4-void@manifault.com
2023-02-02 00:25:14 +01:00
Ilya Leoshkevich
49f67f393f bpf: btf: Add BTF_FMODEL_SIGNED_ARG flag
s390x eBPF JIT needs to know whether a function return value is signed
and which function arguments are signed, in order to generate code
compliant with the s390x ABI.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230128000650.1516334-26-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-28 12:45:15 -08:00
Ilya Leoshkevich
0f0e5f5bd5 bpf: iterators: Split iterators.lskel.h into little- and big- endian versions
iterators.lskel.h is little-endian, therefore bpf iterator is currently
broken on big-endian systems. Introduce a big-endian version and add
instructions regarding its generation. Unfortunately bpftool's
cross-endianness capabilities are limited to BTF right now, so the
procedure requires access to a big-endian machine or a configured
emulator.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230128000650.1516334-25-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-28 12:45:15 -08:00
David Vernet
cb4a21ea59 bpf: Build-time assert that cpumask offset is zero
The first element of a struct bpf_cpumask is a cpumask_t. This is done
to allow struct bpf_cpumask to be cast to a struct cpumask. If this
element were ever moved to another field, any BPF program passing a
struct bpf_cpumask * to a kfunc expecting a const struct cpumask * would
immediately fail to load. Add a build-time assertion so this is
assumption is captured and verified.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230128141537.100777-1-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-28 12:21:40 -08:00
Jakub Kicinski
2d104c390f bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY9RqJgAKCRDbK58LschI
 gw2IAP9G5uhFO5abBzYLupp6SY3T5j97MUvPwLfFqUEt7EXmuwEA2lCUEWeW0KtR
 QX+QmzCa6iHxrW7WzP4DUYLue//FJQY=
 =yYqA
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
bpf-next 2023-01-28

We've added 124 non-merge commits during the last 22 day(s) which contain
a total of 124 files changed, 6386 insertions(+), 1827 deletions(-).

The main changes are:

1) Implement XDP hints via kfuncs with initial support for RX hash and
   timestamp metadata kfuncs, from Stanislav Fomichev and
   Toke Høiland-Jørgensen.
   Measurements on overhead: https://lore.kernel.org/bpf/875yellcx6.fsf@toke.dk

2) Extend libbpf's bpf_tracing.h support for tracing arguments of
   kprobes/uprobes and syscall as a special case, from Andrii Nakryiko.

3) Significantly reduce the search time for module symbols by livepatch
   and BPF, from Jiri Olsa and Zhen Lei.

4) Enable cpumasks to be used as kptrs, which is useful for tracing
   programs tracking which tasks end up running on which CPUs
   in different time intervals, from David Vernet.

5) Fix several issues in the dynptr processing such as stack slot liveness
   propagation, missing checks for PTR_TO_STACK variable offset, etc,
   from Kumar Kartikeya Dwivedi.

6) Various performance improvements, fixes, and introduction of more
   than just one XDP program to XSK selftests, from Magnus Karlsson.

7) Big batch to BPF samples to reduce deprecated functionality,
   from Daniel T. Lee.

8) Enable struct_ops programs to be sleepable in verifier,
   from David Vernet.

9) Reduce pr_warn() noise on BTF mismatches when they are expected under
   the CONFIG_MODULE_ALLOW_BTF_MISMATCH config anyway, from Connor O'Brien.

10) Describe modulo and division by zero behavior of the BPF runtime
    in BPF's instruction specification document, from Dave Thaler.

11) Several improvements to libbpf API documentation in libbpf.h,
    from Grant Seltzer.

12) Improve resolve_btfids header dependencies related to subcmd and add
    proper support for HOSTCC, from Ian Rogers.

13) Add ipip6 and ip6ip decapsulation support for bpf_skb_adjust_room()
    helper along with BPF selftests, from Ziyang Xuan.

14) Simplify the parsing logic of structure parameters for BPF trampoline
    in the x86-64 JIT compiler, from Pu Lehui.

15) Get BTF working for kernels with CONFIG_RUST enabled by excluding
    Rust compilation units with pahole, from Martin Rodriguez Reboredo.

16) Get bpf_setsockopt() working for kTLS on top of TCP sockets,
    from Kui-Feng Lee.

17) Disable stack protection for BPF objects in bpftool given BPF backends
    don't support it, from Holger Hoffstätte.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (124 commits)
  selftest/bpf: Make crashes more debuggable in test_progs
  libbpf: Add documentation to map pinning API functions
  libbpf: Fix malformed documentation formatting
  selftests/bpf: Properly enable hwtstamp in xdp_hw_metadata
  selftests/bpf: Calls bpf_setsockopt() on a ktls enabled socket.
  bpf: Check the protocol of a sock to agree the calls to bpf_setsockopt().
  bpf/selftests: Verify struct_ops prog sleepable behavior
  bpf: Pass const struct bpf_prog * to .check_member
  libbpf: Support sleepable struct_ops.s section
  bpf: Allow BPF_PROG_TYPE_STRUCT_OPS programs to be sleepable
  selftests/bpf: Fix vmtest static compilation error
  tools/resolve_btfids: Alter how HOSTCC is forced
  tools/resolve_btfids: Install subcmd headers
  bpf/docs: Document the nocast aliasing behavior of ___init
  bpf/docs: Document how nested trusted fields may be defined
  bpf/docs: Document cpumask kfuncs in a new file
  selftests/bpf: Add selftest suite for cpumask kfuncs
  selftests/bpf: Add nested trust selftests suite
  bpf: Enable cpumasks to be queried and used as kptrs
  bpf: Disallow NULLable pointers for trusted kfuncs
  ...
====================

Link: https://lore.kernel.org/r/20230128004827.21371-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-28 00:00:14 -08:00
Kui-Feng Lee
5416c9aea8 bpf: Fix the kernel crash caused by bpf_setsockopt().
The kernel crash was caused by a BPF program attached to the
"lsm_cgroup/socket_sock_rcv_skb" hook, which performed a call to
`bpf_setsockopt()` in order to set the TCP_NODELAY flag as an
example. Flags like TCP_NODELAY can prompt the kernel to flush a
socket's outgoing queue, and this hook
"lsm_cgroup/socket_sock_rcv_skb" is frequently triggered by
softirqs. The issue was that in certain circumstances, when
`tcp_write_xmit()` was called to flush the queue, it would also allow
BH (bottom-half) to run. This could lead to our program attempting to
flush the same socket recursively, which caused a `skbuff` to be
unlinked twice.

`security_sock_rcv_skb()` is triggered by `tcp_filter()`. This occurs
before the sock ownership is checked in `tcp_v4_rcv()`. Consequently,
if a bpf program runs on `security_sock_rcv_skb()` while under softirq
conditions, it may not possess the lock needed for `bpf_setsockopt()`,
thus presenting an issue.

The patch fixes this issue by ensuring that a BPF program attached to
the "lsm_cgroup/socket_sock_rcv_skb" hook is not allowed to call
`bpf_setsockopt()`.

The differences from v1 are
 - changing commit log to explain holding the lock of the sock,
 - emphasizing that TCP_NODELAY is not the only flag, and
 - adding the fixes tag.

v1: https://lore.kernel.org/bpf/20230125000244.1109228-1-kuifeng@meta.com/

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
Fixes: 9113d7e48e ("bpf: expose bpf_{g,s}etsockopt to lsm cgroup")
Link: https://lore.kernel.org/r/20230127001732.4162630-1-kuifeng@meta.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-01-26 23:26:40 -08:00
David Vernet
51a52a29eb bpf: Pass const struct bpf_prog * to .check_member
The .check_member field of struct bpf_struct_ops is currently passed the
member's btf_type via const struct btf_type *t, and a const struct
btf_member *member. This allows the struct_ops implementation to check
whether e.g. an ops is supported, but it would be useful to also enforce
that the struct_ops prog being loaded for that member has other
qualities, like being sleepable (or not). This patch therefore updates
the .check_member() callback to also take a const struct bpf_prog *prog
argument.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230125164735.785732-4-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-25 10:25:57 -08:00
David Vernet
1e12d3ef47 bpf: Allow BPF_PROG_TYPE_STRUCT_OPS programs to be sleepable
BPF struct_ops programs currently cannot be marked as sleepable. This
need not be the case -- struct_ops programs can be sleepable, and e.g.
invoke kfuncs that export the KF_SLEEPABLE flag. So as to allow future
struct_ops programs to invoke such kfuncs, this patch updates the
verifier to allow struct_ops programs to be sleepable. A follow-on patch
will add support to libbpf for specifying struct_ops.s as a sleepable
struct_ops program, and then another patch will add testcases to the
dummy_st_ops selftest suite which test sleepable struct_ops behavior.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230125164735.785732-2-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-25 10:25:57 -08:00
David Vernet
bdbda39584 bpf/docs: Document cpumask kfuncs in a new file
Now that we've added a series of new cpumask kfuncs, we should document
them so users can easily use them. This patch adds a new cpumasks.rst
file to document them.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230125143816.721952-6-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-25 07:57:50 -08:00
David Vernet
516f4d3397 bpf: Enable cpumasks to be queried and used as kptrs
Certain programs may wish to be able to query cpumasks. For example, if
a program that is tracing percpu operations wishes to track which tasks
end up running on which CPUs, it could be useful to associate that with
the tasks' cpumasks. Similarly, programs tracking NUMA allocations, CPU
scheduling domains, etc, could potentially benefit from being able to
see which CPUs a task could be migrated to.

This patch enables these types of use cases by introducing a series of
bpf_cpumask_* kfuncs. Amongst these kfuncs, there are two separate
"classes" of operations:

1. kfuncs which allow the caller to allocate and mutate their own
   cpumask kptrs in the form of a struct bpf_cpumask * object. Such
   kfuncs include e.g. bpf_cpumask_create() to allocate the cpumask, and
   bpf_cpumask_or() to mutate it. "Regular" cpumasks such as p->cpus_ptr
   may not be passed to these kfuncs, and the verifier will ensure this
   is the case by comparing BTF IDs.

2. Read-only operations which operate on const struct cpumask *
   arguments. For example, bpf_cpumask_test_cpu(), which tests whether a
   CPU is set in the cpumask. Any trusted struct cpumask * or struct
   bpf_cpumask * may be passed to these kfuncs. The verifier allows
   struct bpf_cpumask * even though the kfunc is defined with struct
   cpumask * because the first element of a struct bpf_cpumask is a
   cpumask_t, so it is safe to cast.

A follow-on patch will add selftests which validate these kfuncs, and
another will document them.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230125143816.721952-3-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-25 07:57:49 -08:00
David Vernet
caf713c338 bpf: Disallow NULLable pointers for trusted kfuncs
KF_TRUSTED_ARGS kfuncs currently have a subtle and insidious bug in
validating pointers to scalars. Say that you have a kfunc like the
following, which takes an array as the first argument:

bool bpf_cpumask_empty(const struct cpumask *cpumask)
{
	return cpumask_empty(cpumask);
}

...
BTF_ID_FLAGS(func, bpf_cpumask_empty, KF_TRUSTED_ARGS)
...

If a BPF program were to invoke the kfunc with a NULL argument, it would
crash the kernel. The reason is that struct cpumask is defined as a
bitmap, which is itself defined as an array, and is accessed as a memory
address by bitmap operations. So when the verifier analyzes the
register, it interprets it as a pointer to a scalar struct, which is an
array of size 8. check_mem_reg() then sees that the register is NULL and
returns 0, and the kfunc crashes when it passes it down to the cpumask
wrappers.

To fix this, this patch adds a check for KF_ARG_PTR_TO_MEM which
verifies that the register doesn't contain a possibly-NULL pointer if
the kfunc is KF_TRUSTED_ARGS.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230125143816.721952-2-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-25 07:57:49 -08:00
David Vernet
b613d335a7 bpf: Allow trusted args to walk struct when checking BTF IDs
When validating BTF types for KF_TRUSTED_ARGS kfuncs, the verifier
currently enforces that the top-level type must match when calling
the kfunc. In other words, the verifier does not allow the BPF program
to pass a bitwise equivalent struct, despite it being allowed according
to the C standard.

For example, if you have the following type:

struct  nf_conn___init {
	struct nf_conn ct;
};

The C standard stipulates that it would be safe to pass a struct
nf_conn___init to a kfunc expecting a struct nf_conn. The verifier
currently disallows this, however, as semantically kfuncs may want to
enforce that structs that have equivalent types according to the C
standard, but have different BTF IDs, are not able to be passed to
kfuncs expecting one or the other. For example, struct nf_conn___init
may not be queried / looked up, as it is allocated but may not yet be
fully initialized.

On the other hand, being able to pass types that are equivalent
according to the C standard will be useful for other types of kfunc /
kptrs enabled by BPF.  For example, in a follow-on patch, a series of
kfuncs will be added which allow programs to do bitwise queries on
cpumasks that are either allocated by the program (in which case they'll
be a 'struct bpf_cpumask' type that wraps a cpumask_t as its first
element), or a cpumask that was allocated by the main kernel (in which
case it will just be a straight cpumask_t, as in task->cpus_ptr).

Having the two types of cpumasks allows us to distinguish between the
two for when a cpumask is read-only vs. mutatable. A struct bpf_cpumask
can be mutated by e.g. bpf_cpumask_clear(), whereas a regular cpumask_t
cannot be. On the other hand, a struct bpf_cpumask can of course be
queried in the exact same manner as a cpumask_t, with e.g.
bpf_cpumask_test_cpu().

If we were to enforce that top level types match, then a user that's
passing a struct bpf_cpumask to a read-only cpumask_t argument would
have to cast with something like bpf_cast_to_kern_ctx() (which itself
would need to be updated to expect the alias, and currently it only
accommodates a single alias per prog type). Additionally, not specifying
KF_TRUSTED_ARGS is not an option, as some kfuncs take one argument as a
struct bpf_cpumask *, and another as a struct cpumask *
(i.e. cpumask_t).

In order to enable this, this patch relaxes the constraint that a
KF_TRUSTED_ARGS kfunc must have strict type matching, and instead only
enforces strict type matching if a type is observed to be a "no-cast
alias" (i.e., that the type names are equivalent, but one is suffixed
with ___init).

Additionally, in order to try and be conservative and match existing
behavior / expectations, this patch also enforces strict type checking
for acquire kfuncs. We were already enforcing it for release kfuncs, so
this should also improve the consistency of the semantics for kfuncs.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230120192523.3650503-3-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-24 20:15:13 -08:00
David Vernet
57539b1c0a bpf: Enable annotating trusted nested pointers
In kfuncs, a "trusted" pointer is a pointer that the kfunc can assume is
safe, and which the verifier will allow to be passed to a
KF_TRUSTED_ARGS kfunc. Currently, a KF_TRUSTED_ARGS kfunc disallows any
pointer to be passed at a nonzero offset, but sometimes this is in fact
safe if the "nested" pointer's lifetime is inherited from its parent.
For example, the const cpumask_t *cpus_ptr field in a struct task_struct
will remain valid until the task itself is destroyed, and thus would
also be safe to pass to a KF_TRUSTED_ARGS kfunc.

While it would be conceptually simple to enable this by using BTF tags,
gcc unfortunately does not yet support this. In the interim, this patch
enables support for this by using a type-naming convention. A new
BTF_TYPE_SAFE_NESTED macro is defined in verifier.c which allows a
developer to specify the nested fields of a type which are considered
trusted if its parent is also trusted. The verifier is also updated to
account for this. A patch with selftests will be added in a follow-on
change, along with documentation for this feature.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230120192523.3650503-2-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-24 20:15:01 -08:00
Toke Høiland-Jørgensen
fd7c211d68 bpf: Support consuming XDP HW metadata from fext programs
Instead of rejecting the attaching of PROG_TYPE_EXT programs to XDP
programs that consume HW metadata, implement support for propagating the
offload information. The extension program doesn't need to set a flag or
ifindex, these will just be propagated from the target by the verifier.
We need to create a separate offload object for the extension program,
though, since it can be reattached to a different program later (which
means we can't just inherit the offload information from the target).

An additional check is added on attach that the new target is compatible
with the offload information in the extension prog.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230119221536.3349901-9-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-01-23 09:38:11 -08:00
Stanislav Fomichev
3d76a4d3d4 bpf: XDP metadata RX kfuncs
Define a new kfunc set (xdp_metadata_kfunc_ids) which implements all possible
XDP metatada kfuncs. Not all devices have to implement them. If kfunc is not
supported by the target device, the default implementation is called instead.
The verifier, at load time, replaces a call to the generic kfunc with a call
to the per-device one. Per-device kfunc pointers are stored in separate
struct xdp_metadata_ops.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230119221536.3349901-8-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-01-23 09:38:11 -08:00
Stanislav Fomichev
2b3486bc2d bpf: Introduce device-bound XDP programs
New flag BPF_F_XDP_DEV_BOUND_ONLY plus all the infra to have a way
to associate a netdev with a BPF program at load time.

netdevsim checks are dropped in favor of generic check in dev_xdp_attach.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230119221536.3349901-6-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-01-23 09:38:10 -08:00
Stanislav Fomichev
89bbc53a4d bpf: Reshuffle some parts of bpf/offload.c
To avoid adding forward declarations in the main patch, shuffle
some code around. No functional changes.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230119221536.3349901-5-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-01-23 09:38:10 -08:00
Stanislav Fomichev
f1fc43d039 bpf: Move offload initialization into late_initcall
So we don't have to initialize it manually from several paths.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230119221536.3349901-4-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-01-23 09:38:10 -08:00
Stanislav Fomichev
9d03ebc71a bpf: Rename bpf_{prog,map}_is_dev_bound to is_offloaded
BPF offloading infra will be reused to implement
bound-but-not-offloaded bpf programs. Rename existing
helpers for clarity. No functional changes.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230119221536.3349901-3-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-01-23 09:38:10 -08:00
Kumar Kartikeya Dwivedi
1ee72bcbe4 bpf: Avoid recomputing spi in process_dynptr_func
Currently, process_dynptr_func first calls dynptr_get_spi and then
is_dynptr_reg_valid_init and is_dynptr_reg_valid_uninit have to call it
again to obtain the spi value. Instead of doing this twice, reuse the
already obtained value (which is by default 0, and is only set for
PTR_TO_STACK, and only used in that case in aforementioned functions).
The input value for these two functions will either be -ERANGE or >= 1,
and can either be permitted or rejected based on the respective check.

Suggested-by: Joanne Koong <joannelkoong@gmail.com>
Acked-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230121002241.2113993-8-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-20 17:55:03 -08:00
Kumar Kartikeya Dwivedi
f5b625e5f8 bpf: Combine dynptr_get_spi and is_spi_bounds_valid
Currently, a check on spi resides in dynptr_get_spi, while others
checking its validity for being within the allocated stack slots happens
in is_spi_bounds_valid. Almost always barring a couple of cases (where
being beyond allocated stack slots is not an error as stack slots need
to be populated), both are used together to make checks. Hence, subsume
the is_spi_bounds_valid check in dynptr_get_spi, and return -ERANGE to
specially distinguish the case where spi is valid but not within
allocated slots in the stack state.

The is_spi_bounds_valid function is still kept around as it is a generic
helper that will be useful for other objects on stack similar to dynptr
in the future.

Suggested-by: Joanne Koong <joannelkoong@gmail.com>
Acked-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230121002241.2113993-7-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-20 17:55:03 -08:00
Kumar Kartikeya Dwivedi
379d4ba831 bpf: Allow reinitializing unreferenced dynptr stack slots
Consider a program like below:

void prog(void)
{
	{
		struct bpf_dynptr ptr;
		bpf_dynptr_from_mem(...);
	}
	...
	{
		struct bpf_dynptr ptr;
		bpf_dynptr_from_mem(...);
	}
}

Here, the C compiler based on lifetime rules in the C standard would be
well within in its rights to share stack storage for dynptr 'ptr' as
their lifetimes do not overlap in the two distinct scopes. Currently,
such an example would be rejected by the verifier, but this is too
strict. Instead, we should allow reinitializing over dynptr stack slots
and forget information about the old dynptr object.

The destroy_if_dynptr_stack_slot function already makes necessary checks
to avoid overwriting referenced dynptr slots. This is done to present a
better error message instead of forgetting dynptr information on stack
and preserving reference state, leading to an inevitable but
undecipherable error at the end about an unreleased reference which has
to be associated back to its allocating call instruction to make any
sense to the user.

Acked-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230121002241.2113993-6-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-20 17:55:03 -08:00
Kumar Kartikeya Dwivedi
f8064ab90d bpf: Invalidate slices on destruction of dynptrs on stack
The previous commit implemented destroy_if_dynptr_stack_slot. It
destroys the dynptr which given spi belongs to, but still doesn't
invalidate the slices that belong to such a dynptr. While for the case
of referenced dynptr, we don't allow their overwrite and return an error
early, we still allow it and destroy the dynptr for unreferenced dynptr.

To be able to enable precise and scoped invalidation of dynptr slices in
this case, we must be able to associate the source dynptr of slices that
have been obtained using bpf_dynptr_data. When doing destruction, only
slices belonging to the dynptr being destructed should be invalidated,
and nothing else. Currently, dynptr slices belonging to different
dynptrs are indistinguishible.

Hence, allocate a unique id to each dynptr (CONST_PTR_TO_DYNPTR and
those on stack). This will be stored as part of reg->id. Whenever using
bpf_dynptr_data, transfer this unique dynptr id to the returned
PTR_TO_MEM_OR_NULL slice pointer, and store it in a new per-PTR_TO_MEM
dynptr_id register state member.

Finally, after establishing such a relationship between dynptrs and
their slices, implement precise invalidation logic that only invalidates
slices belong to the destroyed dynptr in destroy_if_dynptr_stack_slot.

Acked-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230121002241.2113993-5-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-20 17:55:03 -08:00
Kumar Kartikeya Dwivedi
ef8fc7a07c bpf: Fix partial dynptr stack slot reads/writes
Currently, while reads are disallowed for dynptr stack slots, writes are
not. Reads don't work from both direct access and helpers, while writes
do work in both cases, but have the effect of overwriting the slot_type.

While this is fine, handling for a few edge cases is missing. Firstly,
a user can overwrite the stack slots of dynptr partially.

Consider the following layout:
spi: [d][d][?]
      2  1  0

First slot is at spi 2, second at spi 1.
Now, do a write of 1 to 8 bytes for spi 1.

This will essentially either write STACK_MISC for all slot_types or
STACK_MISC and STACK_ZERO (in case of size < BPF_REG_SIZE partial write
of zeroes). The end result is that slot is scrubbed.

Now, the layout is:
spi: [d][m][?]
      2  1  0

Suppose if user initializes spi = 1 as dynptr.
We get:
spi: [d][d][d]
      2  1  0

But this time, both spi 2 and spi 1 have first_slot = true.

Now, when passing spi 2 to dynptr helper, it will consider it as
initialized as it does not check whether second slot has first_slot ==
false. And spi 1 should already work as normal.

This effectively replaced size + offset of first dynptr, hence allowing
invalid OOB reads and writes.

Make a few changes to protect against this:
When writing to PTR_TO_STACK using BPF insns, when we touch spi of a
STACK_DYNPTR type, mark both first and second slot (regardless of which
slot we touch) as STACK_INVALID. Reads are already prevented.

Second, prevent writing	to stack memory from helpers if the range may
contain any STACK_DYNPTR slots. Reads are already prevented.

For helpers, we cannot allow it to destroy dynptrs from the writes as
depending on arguments, helper may take uninit_mem and dynptr both at
the same time. This would mean that helper may write to uninit_mem
before it reads the dynptr, which would be bad.

PTR_TO_MEM: [?????dd]

Depending on the code inside the helper, it may end up overwriting the
dynptr contents first and then read those as the dynptr argument.

Verifier would only simulate destruction when it does byte by byte
access simulation in check_helper_call for meta.access_size, and
fail to catch this case, as it happens after argument checks.

The same would need to be done for any other non-trivial objects created
on the stack in the future, such as bpf_list_head on stack, or
bpf_rb_root on stack.

A common misunderstanding in the current code is that MEM_UNINIT means
writes, but note that writes may also be performed even without
MEM_UNINIT in case of helpers, in that case the code after handling meta
&& meta->raw_mode will complain when it sees STACK_DYNPTR. So that
invalid read case also covers writes to potential STACK_DYNPTR slots.
The only loophole was in case of meta->raw_mode which simulated writes
through instructions which could overwrite them.

A future series sequenced after this will focus on the clean up of
helper access checks and bugs around that.

Fixes: 97e03f5210 ("bpf: Add verifier support for dynptrs")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230121002241.2113993-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-20 17:55:03 -08:00
Kumar Kartikeya Dwivedi
79168a669d bpf: Fix missing var_off check for ARG_PTR_TO_DYNPTR
Currently, the dynptr function is not checking the variable offset part
of PTR_TO_STACK that it needs to check. The fixed offset is considered
when computing the stack pointer index, but if the variable offset was
not a constant (such that it could not be accumulated in reg->off), we
will end up a discrepency where runtime pointer does not point to the
actual stack slot we mark as STACK_DYNPTR.

It is impossible to precisely track dynptr state when variable offset is
not constant, hence, just like bpf_timer, kptr, bpf_spin_lock, etc.
simply reject the case where reg->var_off is not constant. Then,
consider both reg->off and reg->var_off.value when computing the stack
pointer index.

A new helper dynptr_get_spi is introduced to hide over these details
since the dynptr needs to be located in multiple places outside the
process_dynptr_func checks, hence once we know it's a PTR_TO_STACK, we
need to enforce these checks in all places.

Note that it is disallowed for unprivileged users to have a non-constant
var_off, so this problem should only be possible to trigger from
programs having CAP_PERFMON. However, its effects can vary.

Without the fix, it is possible to replace the contents of the dynptr
arbitrarily by making verifier mark different stack slots than actual
location and then doing writes to the actual stack address of dynptr at
runtime.

Fixes: 97e03f5210 ("bpf: Add verifier support for dynptrs")
Acked-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230121002241.2113993-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-20 17:55:02 -08:00
Kumar Kartikeya Dwivedi
d6fefa1105 bpf: Fix state pruning for STACK_DYNPTR stack slots
The root of the problem is missing liveness marking for STACK_DYNPTR
slots. This leads to all kinds of problems inside stacksafe.

The verifier by default inside stacksafe ignores spilled_ptr in stack
slots which do not have REG_LIVE_READ marks. Since this is being checked
in the 'old' explored state, it must have already done clean_live_states
for this old bpf_func_state. Hence, it won't be receiving any more
liveness marks from to be explored insns (it has received REG_LIVE_DONE
marking from liveness point of view).

What this means is that verifier considers that it's safe to not compare
the stack slot if was never read by children states. While liveness
marks are usually propagated correctly following the parentage chain for
spilled registers (SCALAR_VALUE and PTR_* types), the same is not the
case for STACK_DYNPTR.

clean_live_states hence simply rewrites these stack slots to the type
STACK_INVALID since it sees no REG_LIVE_READ marks.

The end result is that we will never see STACK_DYNPTR slots in explored
state. Even if verifier was conservatively matching !REG_LIVE_READ
slots, very next check continuing the stacksafe loop on seeing
STACK_INVALID would again prevent further checks.

Now as long as verifier stores an explored state which we can compare to
when reaching a pruning point, we can abuse this bug to make verifier
prune search for obviously unsafe paths using STACK_DYNPTR slots
thinking they are never used hence safe.

Doing this in unprivileged mode is a bit challenging. add_new_state is
only set when seeing BPF_F_TEST_STATE_FREQ (which requires privileges)
or when jmps_processed difference is >= 2 and insn_processed difference
is >= 8. So coming up with the unprivileged case requires a little more
work, but it is still totally possible. The test case being discussed
below triggers the heuristic even in unprivileged mode.

However, it no longer works since commit
8addbfc7b3 ("bpf: Gate dynptr API behind CAP_BPF").

Let's try to study the test step by step.

Consider the following program (C style BPF ASM):

0  r0 = 0;
1  r6 = &ringbuf_map;
3  r1 = r6;
4  r2 = 8;
5  r3 = 0;
6  r4 = r10;
7  r4 -= -16;
8  call bpf_ringbuf_reserve_dynptr;
9  if r0 == 0 goto pc+1;
10 goto pc+1;
11 *(r10 - 16) = 0xeB9F;
12 r1 = r10;
13 r1 -= -16;
14 r2 = 0;
15 call bpf_ringbuf_discard_dynptr;
16 r0 = 0;
17 exit;

We know that insn 12 will be a pruning point, hence if we force
add_new_state for it, it will first verify the following path as
safe in straight line exploration:
0 1 3 4 5 6 7 8 9 -> 10 -> (12) 13 14 15 16 17

Then, when we arrive at insn 12 from the following path:
0 1 3 4 5 6 7 8 9 -> 11 (12)

We will find a state that has been verified as safe already at insn 12.
Since register state is same at this point, regsafe will pass. Next, in
stacksafe, for spi = 0 and spi = 1 (location of our dynptr) is skipped
seeing !REG_LIVE_READ. The rest matches, so stacksafe returns true.
Next, refsafe is also true as reference state is unchanged in both
states.

The states are considered equivalent and search is pruned.

Hence, we are able to construct a dynptr with arbitrary contents and use
the dynptr API to operate on this arbitrary pointer and arbitrary size +
offset.

To fix this, first define a mark_dynptr_read function that propagates
liveness marks whenever a valid initialized dynptr is accessed by dynptr
helpers. REG_LIVE_WRITTEN is marked whenever we initialize an
uninitialized dynptr. This is done in mark_stack_slots_dynptr. It allows
screening off mark_reg_read and not propagating marks upwards from that
point.

This ensures that we either set REG_LIVE_READ64 on both dynptr slots, or
none, so clean_live_states either sets both slots to STACK_INVALID or
none of them. This is the invariant the checks inside stacksafe rely on.

Next, do a complete comparison of both stack slots whenever they have
STACK_DYNPTR. Compare the dynptr type stored in the spilled_ptr, and
also whether both form the same first_slot. Only then is the later path
safe.

Fixes: 97e03f5210 ("bpf: Add verifier support for dynptrs")
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230121002241.2113993-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-20 17:55:02 -08:00
Jakub Kicinski
b3c588cd55 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/ipa/ipa_interrupt.c
drivers/net/ipa/ipa_interrupt.h
  9ec9b2a308 ("net: ipa: disable ipa interrupt during suspend")
  8e461e1f09 ("net: ipa: introduce ipa_interrupt_enable()")
  d50ed35587 ("net: ipa: enable IPA interrupt handlers separate from registration")
https://lore.kernel.org/all/20230119114125.5182c7ab@canb.auug.org.au/
https://lore.kernel.org/all/79e46152-8043-a512-79d9-c3b905462774@tessares.net/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-20 12:28:23 -08:00
Jiri Olsa
74bc3a5acc bpf: Add missing btf_put to register_btf_id_dtor_kfuncs
We take the BTF reference before we register dtors and we need
to put it back when it's done.

We probably won't se a problem with kernel BTF, but module BTF
would stay loaded (because of the extra ref) even when its module
is removed.

Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Fixes: 5ce937d613 ("bpf: Populate pairs of btf_id and destructor kfunc in btf")
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230120122148.1522359-1-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-20 08:19:56 -08:00
Eduard Zingerman
71f656a501 bpf: Fix to preserve reg parent/live fields when copying range info
Register range information is copied in several places. The intent is
to transfer range/id information from one register/stack spill to
another. Currently this is done using direct register assignment, e.g.:

static void find_equal_scalars(..., struct bpf_reg_state *known_reg)
{
	...
	struct bpf_reg_state *reg;
	...
			*reg = *known_reg;
	...
}

However, such assignments also copy the following bpf_reg_state fields:

struct bpf_reg_state {
	...
	struct bpf_reg_state *parent;
	...
	enum bpf_reg_liveness live;
	...
};

Copying of these fields is accidental and incorrect, as could be
demonstrated by the following example:

     0: call ktime_get_ns()
     1: r6 = r0
     2: call ktime_get_ns()
     3: r7 = r0
     4: if r0 > r6 goto +1             ; r0 & r6 are unbound thus generated
                                       ; branch states are identical
     5: *(u64 *)(r10 - 8) = 0xdeadbeef ; 64-bit write to fp[-8]
    --- checkpoint ---
     6: r1 = 42                        ; r1 marked as written
     7: *(u8 *)(r10 - 8) = r1          ; 8-bit write, fp[-8] parent & live
                                       ; overwritten
     8: r2 = *(u64 *)(r10 - 8)
     9: r0 = 0
    10: exit

This example is unsafe because 64-bit write to fp[-8] at (5) is
conditional, thus not all bytes of fp[-8] are guaranteed to be set
when it is read at (8). However, currently the example passes
verification.

First, the execution path 1-10 is examined by verifier.
Suppose that a new checkpoint is created by is_state_visited() at (6).
After checkpoint creation:
- r1.parent points to checkpoint.r1,
- fp[-8].parent points to checkpoint.fp[-8].
At (6) the r1.live is set to REG_LIVE_WRITTEN.
At (7) the fp[-8].parent is set to r1.parent and fp[-8].live is set to
REG_LIVE_WRITTEN, because of the following code called in
check_stack_write_fixed_off():

static void save_register_state(struct bpf_func_state *state,
				int spi, struct bpf_reg_state *reg,
				int size)
{
	...
	state->stack[spi].spilled_ptr = *reg;  // <--- parent & live copied
	if (size == BPF_REG_SIZE)
		state->stack[spi].spilled_ptr.live |= REG_LIVE_WRITTEN;
	...
}

Note the intent to mark stack spill as written only if 8 bytes are
spilled to a slot, however this intent is spoiled by a 'live' field copy.
At (8) the checkpoint.fp[-8] should be marked as REG_LIVE_READ but
this does not happen:
- fp[-8] in a current state is already marked as REG_LIVE_WRITTEN;
- fp[-8].parent points to checkpoint.r1, parentage chain is used by
  mark_reg_read() to mark checkpoint states.
At (10) the verification is finished for path 1-10 and jump 4-6 is
examined. The checkpoint.fp[-8] never gets REG_LIVE_READ mark and this
spill is pruned from the cached states by clean_live_states(). Hence
verifier state obtained via path 1-4,6 is deemed identical to one
obtained via path 1-6 and program marked as safe.

Note: the example should be executed with BPF_F_TEST_STATE_FREQ flag
set to force creation of intermediate verifier states.

This commit revisits the locations where bpf_reg_state instances are
copied and replaces the direct copies with a call to a function
copy_register_state(dst, src) that preserves 'parent' and 'live'
fields of the 'dst'.

Fixes: 679c782de1 ("bpf/verifier: per-register parent pointers")
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230106142214.1040390-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-19 15:19:23 -08:00
Christian Brauner
f2d40141d5
fs: port inode_init_owner() to mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:28 +01:00
Christian Brauner
4609e1f18e
fs: port ->permission() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:28 +01:00
Christian Brauner
c54bd91e9e
fs: port ->mkdir() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:26 +01:00
Christian Brauner
7a77db9551
fs: port ->symlink() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:25 +01:00
Hou Tao
36024d023d bpf: Fix off-by-one error in bpf_mem_cache_idx()
According to the definition of sizes[NUM_CACHES], when the size passed
to bpf_mem_cache_size() is 256, it should return 6 instead 7.

Fixes: 7c8199e24f ("bpf: Introduce any context BPF specific memory allocator.")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230118084630.3750680-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-18 18:36:26 -08:00
Jiri Olsa
700e6f853e bpf: Do not allow to load sleepable BPF_TRACE_RAW_TP program
Currently we allow to load any tracing program as sleepable,
but BPF_TRACE_RAW_TP can't sleep. Making the check explicit
for tracing programs attach types, so sleepable BPF_TRACE_RAW_TP
will fail to load.

Updating the verifier error to mention iter programs as well.

Acked-by: Song Liu <song@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230117223705.440975-1-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-17 16:56:04 -08:00
Luis Gerhorst
e4f4db4779 bpf: Fix pointer-leak due to insufficient speculative store bypass mitigation
To mitigate Spectre v4, 2039f26f3a ("bpf: Fix leakage due to
insufficient speculative store bypass mitigation") inserts lfence
instructions after 1) initializing a stack slot and 2) spilling a
pointer to the stack.

However, this does not cover cases where a stack slot is first
initialized with a pointer (subject to sanitization) but then
overwritten with a scalar (not subject to sanitization because
the slot was already initialized). In this case, the second write
may be subject to speculative store bypass (SSB) creating a
speculative pointer-as-scalar type confusion. This allows the
program to subsequently leak the numerical pointer value using,
for example, a branch-based cache side channel.

To fix this, also sanitize scalars if they write a stack slot
that previously contained a pointer. Assuming that pointer-spills
are only generated by LLVM on register-pressure, the performance
impact on most real-world BPF programs should be small.

The following unprivileged BPF bytecode drafts a minimal exploit
and the mitigation:

  [...]
  // r6 = 0 or 1 (skalar, unknown user input)
  // r7 = accessible ptr for side channel
  // r10 = frame pointer (fp), to be leaked
  //
  r9 = r10 # fp alias to encourage ssb
  *(u64 *)(r9 - 8) = r10 // fp[-8] = ptr, to be leaked
  // lfence added here because of pointer spill to stack.
  //
  // Ommitted: Dummy bpf_ringbuf_output() here to train alias predictor
  // for no r9-r10 dependency.
  //
  *(u64 *)(r10 - 8) = r6 // fp[-8] = scalar, overwrites ptr
  // 2039f26f3aca: no lfence added because stack slot was not STACK_INVALID,
  // store may be subject to SSB
  //
  // fix: also add an lfence when the slot contained a ptr
  //
  r8 = *(u64 *)(r9 - 8)
  // r8 = architecturally a scalar, speculatively a ptr
  //
  // leak ptr using branch-based cache side channel:
  r8 &= 1 // choose bit to leak
  if r8 == 0 goto SLOW // no mispredict
  // architecturally dead code if input r6 is 0,
  // only executes speculatively iff ptr bit is 1
  r8 = *(u64 *)(r7 + 0) # encode bit in cache (0: slow, 1: fast)
SLOW:
  [...]

After running this, the program can time the access to *(r7 + 0) to
determine whether the chosen pointer bit was 0 or 1. Repeat this 64
times to recover the whole address on amd64.

In summary, sanitization can only be skipped if one scalar is
overwritten with another scalar. Scalar-confusion due to speculative
store bypass can not lead to invalid accesses because the pointer
bounds deducted during verification are enforced using branchless
logic. See 979d63d50c ("bpf: prevent out of bounds speculation on
pointer arithmetic") for details.

Do not make the mitigation depend on !env->allow_{uninit_stack,ptr_leaks}
because speculative leaks are likely unexpected if these were enabled.
For example, leaking the address to a protected log file may be acceptable
while disabling the mitigation might unintentionally leak the address
into the cached-state of a map that is accessible to unprivileged
processes.

Fixes: 2039f26f3a ("bpf: Fix leakage due to insufficient speculative store bypass mitigation")
Signed-off-by: Luis Gerhorst <gerhorst@cs.fau.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Henriette Hofmeier <henriette.hofmeier@rub.de>
Link: https://lore.kernel.org/bpf/edc95bad-aada-9cfc-ffe2-fa9bb206583c@cs.fau.de
Link: https://lore.kernel.org/bpf/20230109150544.41465-1-gerhorst@cs.fau.de
2023-01-13 17:18:35 +01:00
Tonghao Zhang
9f907439dc bpf: hash map, avoid deadlock with suitable hash mask
The deadlock still may occur while accessed in NMI and non-NMI
context. Because in NMI, we still may access the same bucket but with
different map_locked index.

For example, on the same CPU, .max_entries = 2, we update the hash map,
with key = 4, while running bpf prog in NMI nmi_handle(), to update
hash map with key = 20, so it will have the same bucket index but have
different map_locked index.

To fix this issue, using min mask to hash again.

Fixes: 20b6cc34ea ("bpf: Avoid hashtab deadlock with map_locked")
Signed-off-by: Tonghao Zhang <tong@infragraf.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Hou Tao <houtao1@huawei.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230111092903.92389-1-tong@infragraf.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-01-12 18:55:42 -08:00
Connor O'Brien
9cb61e50bf bpf: btf: limit logging of ignored BTF mismatches
Enabling CONFIG_MODULE_ALLOW_BTF_MISMATCH is an indication that BTF
mismatches are expected and module loading should proceed
anyway. Logging with pr_warn() on every one of these "benign"
mismatches creates unnecessary noise when many such modules are
loaded. Instead, handle this case with a single log warning that BTF
info may be unavailable.

Mismatches also result in calls to __btf_verifier_log() via
__btf_verifier_log_type() or btf_verifier_log_member(), adding several
additional lines of logging per mismatched module. Add checks to these
paths to skip logging for module BTF mismatches in the "allow
mismatch" case.

All existing logging behavior is preserved in the default
CONFIG_MODULE_ALLOW_BTF_MISMATCH=n case.

Signed-off-by: Connor O'Brien <connoro@google.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230107025331.3240536-1-connoro@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-01-10 15:58:30 -08:00
Paul Moore
e7895f017b bpf: remove the do_idr_lock parameter from bpf_prog_free_id()
It was determined that the do_idr_lock parameter to
bpf_prog_free_id() was not necessary as it should always be true.

Suggested-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230106154400.74211-2-paul@paul-moore.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-09 19:47:59 -08:00
Paul Moore
ef01f4e25c bpf: restore the ebpf program ID for BPF_AUDIT_UNLOAD and PERF_BPF_EVENT_PROG_UNLOAD
When changing the ebpf program put() routines to support being called
from within IRQ context the program ID was reset to zero prior to
calling the perf event and audit UNLOAD record generators, which
resulted in problems as the ebpf program ID was bogus (always zero).
This patch addresses this problem by removing an unnecessary call to
bpf_prog_free_id() in __bpf_prog_offload_destroy() and adjusting
__bpf_prog_put() to only call bpf_prog_free_id() after audit and perf
have finished their bpf program unload tasks in
bpf_prog_put_deferred().  For the record, no one can determine, or
remember, why it was necessary to free the program ID, and remove it
from the IDR, prior to executing bpf_prog_put_deferred();
regardless, both Stanislav and Alexei agree that the approach in this
patch should be safe.

It is worth noting that when moving the bpf_prog_free_id() call, the
do_idr_lock parameter was forced to true as the ebpf devs determined
this was the correct as the do_idr_lock should always be true.  The
do_idr_lock parameter will be removed in a follow-up patch, but it
was kept here to keep the patch small in an effort to ease any stable
backports.

I also modified the bpf_audit_prog() logic used to associate the
AUDIT_BPF record with other associated records, e.g. @ctx != NULL.
Instead of keying off the operation, it now keys off the execution
context, e.g. '!in_irg && !irqs_disabled()', which is much more
appropriate and should help better connect the UNLOAD operations with
the associated audit state (other audit records).

Cc: stable@vger.kernel.org
Fixes: d809e134be ("bpf: Prepare bpf_prog_put() to be called from irq context.")
Reported-by: Burn Alting <burn.alting@iinet.net.au>
Reported-by: Jiri Olsa <olsajiri@gmail.com>
Suggested-by: Stanislav Fomichev <sdf@google.com>
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230106154400.74211-1-paul@paul-moore.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-09 19:47:58 -08:00
Hao Sun
d3178e8a43 bpf: Skip invalid kfunc call in backtrack_insn
The verifier skips invalid kfunc call in check_kfunc_call(), which
would be captured in fixup_kfunc_call() if such insn is not eliminated
by dead code elimination. However, this can lead to the following
warning in backtrack_insn(), also see [1]:

  ------------[ cut here ]------------
  verifier backtracking bug
  WARNING: CPU: 6 PID: 8646 at kernel/bpf/verifier.c:2756 backtrack_insn
  kernel/bpf/verifier.c:2756
	__mark_chain_precision kernel/bpf/verifier.c:3065
	mark_chain_precision kernel/bpf/verifier.c:3165
	adjust_reg_min_max_vals kernel/bpf/verifier.c:10715
	check_alu_op kernel/bpf/verifier.c:10928
	do_check kernel/bpf/verifier.c:13821 [inline]
	do_check_common kernel/bpf/verifier.c:16289
  [...]

So make backtracking conservative with this by returning ENOTSUPP.

  [1] https://lore.kernel.org/bpf/CACkBjsaXNceR8ZjkLG=dT3P=4A8SBsg0Z5h5PWLryF5=ghKq=g@mail.gmail.com/

Reported-by: syzbot+4da3ff23081bafe74fc2@syzkaller.appspotmail.com
Signed-off-by: Hao Sun <sunhao.th@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20230104014709.9375-1-sunhao.th@gmail.com
2023-01-06 18:49:37 +01:00
Jakub Kicinski
4aea86b403 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-05 15:34:11 -08:00
Jakub Kicinski
d75858ef10 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY7X/4wAKCRDbK58LschI
 g7gzAQCjKsLtAWg1OplW+B7pvEPwkQ8g3O1+PYWlToCUACTlzQD+PEMrqGnxB573
 oQAk6I2yOTwLgvlHkrm+TIdKSouI4gs=
 =2hUY
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
bpf-next 2023-01-04

We've added 45 non-merge commits during the last 21 day(s) which contain
a total of 50 files changed, 1454 insertions(+), 375 deletions(-).

The main changes are:

1) Fixes, improvements and refactoring of parts of BPF verifier's
   state equivalence checks, from Andrii Nakryiko.

2) Fix a few corner cases in libbpf's BTF-to-C converter in particular
   around padding handling and enums, also from Andrii Nakryiko.

3) Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to better
  support decap on GRE tunnel devices not operating in collect metadata,
  from Christian Ehrig.

4) Improve x86 JIT's codegen for PROBE_MEM runtime error checks,
   from Dave Marchevsky.

5) Remove the need for trace_printk_lock for bpf_trace_printk
   and bpf_trace_vprintk helpers, from Jiri Olsa.

6) Add proper documentation for BPF_MAP_TYPE_SOCK{MAP,HASH} maps,
   from Maryam Tahhan.

7) Improvements in libbpf's btf_parse_elf error handling, from Changbin Du.

8) Bigger batch of improvements to BPF tracing code samples,
   from Daniel T. Lee.

9) Add LoongArch support to libbpf's bpf_tracing helper header,
   from Hengqi Chen.

10) Fix a libbpf compiler warning in perf_event_open_probe on arm32,
    from Khem Raj.

11) Optimize bpf_local_storage_elem by removing 56 bytes of padding,
    from Martin KaFai Lau.

12) Use pkg-config to locate libelf for resolve_btfids build,
    from Shen Jiamin.

13) Various libbpf improvements around API documentation and errno
    handling, from Xin Liu.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (45 commits)
  libbpf: Return -ENODATA for missing btf section
  libbpf: Add LoongArch support to bpf_tracing.h
  libbpf: Restore errno after pr_warn.
  libbpf: Added the description of some API functions
  libbpf: Fix invalid return address register in s390
  samples/bpf: Use BPF_KSYSCALL macro in syscall tracing programs
  samples/bpf: Fix tracex2 by using BPF_KSYSCALL macro
  samples/bpf: Change _kern suffix to .bpf with syscall tracing program
  samples/bpf: Use vmlinux.h instead of implicit headers in syscall tracing program
  samples/bpf: Use kyscall instead of kprobe in syscall tracing program
  bpf: rename list_head -> graph_root in field info types
  libbpf: fix errno is overwritten after being closed.
  bpf: fix regs_exact() logic in regsafe() to remap IDs correctly
  bpf: perform byte-by-byte comparison only when necessary in regsafe()
  bpf: reject non-exact register type matches in regsafe()
  bpf: generalize MAYBE_NULL vs non-MAYBE_NULL rule
  bpf: reorganize struct bpf_reg_state fields
  bpf: teach refsafe() to take into account ID remapping
  bpf: Remove unused field initialization in bpf's ctl_table
  selftests/bpf: Add jit probe_mem corner case tests to s390x denylist
  ...
====================

Link: https://lore.kernel.org/r/20230105000926.31350-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-04 20:21:25 -08:00
Dave Marchevsky
30465003ad bpf: rename list_head -> graph_root in field info types
Many of the structs recently added to track field info for linked-list
head are useful as-is for rbtree root. So let's do a mechanical renaming
of list_head-related types and fields:

include/linux/bpf.h:
  struct btf_field_list_head -> struct btf_field_graph_root
  list_head -> graph_root in struct btf_field union
kernel/bpf/btf.c:
  list_head -> graph_root in struct btf_field_info

This is a nonfunctional change, functionality to actually use these
fields for rbtree will be added in further patches.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20221217082506.1570898-5-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-28 20:14:22 -08:00
Kees Cook
45435d8da7 bpf: Always use maximal size for copy_array()
Instead of counting on prior allocations to have sized allocations to
the next kmalloc bucket size, always perform a krealloc that is at least
ksize(dst) in size (which is a no-op), so the size can be correctly
tracked by all the various allocation size trackers (KASAN,
__alloc_size, etc).

Reported-by: Hyunwoo Kim <v4bel@theori.io>
Link: https://lore.kernel.org/bpf/20221223094551.GA1439509@ubuntu
Fixes: ceb35b666d ("bpf/verifier: Use kmalloc_size_roundup() to match ksize() usage")
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: bpf@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221223182836.never.866-kees@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-28 14:54:53 -08:00
Kui-Feng Lee
7ff94f276f bpf: keep a reference to the mm, in case the task is dead.
Fix the system crash that happens when a task iterator travel through
vma of tasks.

In task iterators, we used to access mm by following the pointer on
the task_struct; however, the death of a task will clear the pointer,
even though we still hold the task_struct.  That can cause an
unexpected crash for a null pointer when an iterator is visiting a
task that dies during the visit.  Keeping a reference of mm on the
iterator ensures we always have a valid pointer to mm.

Co-developed-by: Song Liu <song@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
Reported-by: Nathan Slingerland <slinger@meta.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221216221855.4122288-2-kuifeng@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-28 14:11:48 -08:00
Chuang Wang
9ed1d9aeef bpf: Fix panic due to wrong pageattr of im->image
In the scenario where livepatch and kretfunc coexist, the pageattr of
im->image is rox after arch_prepare_bpf_trampoline in
bpf_trampoline_update, and then modify_fentry or register_fentry returns
-EAGAIN from bpf_tramp_ftrace_ops_func, the BPF_TRAMP_F_ORIG_STACK flag
will be configured, and arch_prepare_bpf_trampoline will be re-executed.

At this time, because the pageattr of im->image is rox,
arch_prepare_bpf_trampoline will read and write im->image, which causes
a fault. as follows:

  insmod livepatch-sample.ko    # samples/livepatch/livepatch-sample.c
  bpftrace -e 'kretfunc:cmdline_proc_show {}'

BUG: unable to handle page fault for address: ffffffffa0206000
PGD 322d067 P4D 322d067 PUD 322e063 PMD 1297e067 PTE d428061
Oops: 0003 [#1] PREEMPT SMP PTI
CPU: 2 PID: 270 Comm: bpftrace Tainted: G            E K    6.1.0 #5
RIP: 0010:arch_prepare_bpf_trampoline+0xed/0x8c0
RSP: 0018:ffffc90001083ad8 EFLAGS: 00010202
RAX: ffffffffa0206000 RBX: 0000000000000020 RCX: 0000000000000000
RDX: ffffffffa0206001 RSI: ffffffffa0206000 RDI: 0000000000000030
RBP: ffffc90001083b70 R08: 0000000000000066 R09: ffff88800f51b400
R10: 000000002e72c6e5 R11: 00000000d0a15080 R12: ffff8880110a68c8
R13: 0000000000000000 R14: ffff88800f51b400 R15: ffffffff814fec10
FS:  00007f87bc0dc780(0000) GS:ffff88803e600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffa0206000 CR3: 0000000010b70000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
 bpf_trampoline_update+0x25a/0x6b0
 __bpf_trampoline_link_prog+0x101/0x240
 bpf_trampoline_link_prog+0x2d/0x50
 bpf_tracing_prog_attach+0x24c/0x530
 bpf_raw_tp_link_attach+0x73/0x1d0
 __sys_bpf+0x100e/0x2570
 __x64_sys_bpf+0x1c/0x30
 do_syscall_64+0x5b/0x80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

With this patch, when modify_fentry or register_fentry returns -EAGAIN
from bpf_tramp_ftrace_ops_func, the pageattr of im->image will be reset
to nx+rw.

Cc: stable@vger.kernel.org
Fixes: 00963a2e75 ("bpf: Support bpf_trampoline on functions with IPMODIFY (e.g. livepatch)")
Signed-off-by: Chuang Wang <nashuiliang@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20221224133146.780578-1-nashuiliang@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-28 13:46:28 -08:00
Andrii Nakryiko
4633a00682 bpf: fix regs_exact() logic in regsafe() to remap IDs correctly
Comparing IDs exactly between two separate states is not just
suboptimal, but also incorrect in some cases. So update regs_exact()
check to do byte-by-byte memcmp() only up to id/ref_obj_id. For id and
ref_obj_id perform proper check_ids() checks, taking into account idmap.

This change makes more states equivalent improving insns and states
stats across a bunch of selftest BPF programs:

File                                         Program                           Insns (A)  Insns (B)  Insns   (DIFF)  States (A)  States (B)  States (DIFF)
-------------------------------------------  --------------------------------  ---------  ---------  --------------  ----------  ----------  -------------
cgrp_kfunc_success.bpf.linked1.o             test_cgrp_get_release                   141        137     -4 (-2.84%)          13          13    +0 (+0.00%)
cgrp_kfunc_success.bpf.linked1.o             test_cgrp_xchg_release                  142        139     -3 (-2.11%)          14          13    -1 (-7.14%)
connect6_prog.bpf.linked1.o                  connect_v6_prog                         139        102   -37 (-26.62%)           9           6   -3 (-33.33%)
ima.bpf.linked1.o                            bprm_creds_for_exec                      68         61    -7 (-10.29%)           6           5   -1 (-16.67%)
linked_list.bpf.linked1.o                    global_list_in_list                     569        499   -70 (-12.30%)          60          52   -8 (-13.33%)
linked_list.bpf.linked1.o                    global_list_push_pop                    167        150   -17 (-10.18%)          18          16   -2 (-11.11%)
linked_list.bpf.linked1.o                    global_list_push_pop_multiple           881        815    -66 (-7.49%)          74          63  -11 (-14.86%)
linked_list.bpf.linked1.o                    inner_map_list_in_list                  579        534    -45 (-7.77%)          61          55    -6 (-9.84%)
linked_list.bpf.linked1.o                    inner_map_list_push_pop                 190        181     -9 (-4.74%)          19          18    -1 (-5.26%)
linked_list.bpf.linked1.o                    inner_map_list_push_pop_multiple        916        850    -66 (-7.21%)          75          64  -11 (-14.67%)
linked_list.bpf.linked1.o                    map_list_in_list                        588        525   -63 (-10.71%)          62          55   -7 (-11.29%)
linked_list.bpf.linked1.o                    map_list_push_pop                       183        174     -9 (-4.92%)          18          17    -1 (-5.56%)
linked_list.bpf.linked1.o                    map_list_push_pop_multiple              909        843    -66 (-7.26%)          75          64  -11 (-14.67%)
map_kptr.bpf.linked1.o                       test_map_kptr                           264        256     -8 (-3.03%)          26          26    +0 (+0.00%)
map_kptr.bpf.linked1.o                       test_map_kptr_ref                        95         91     -4 (-4.21%)           9           8   -1 (-11.11%)
task_kfunc_success.bpf.linked1.o             test_task_xchg_release                  139        136     -3 (-2.16%)          14          13    -1 (-7.14%)
test_bpf_nf.bpf.linked1.o                    nf_skb_ct_test                          815        509  -306 (-37.55%)          57          30  -27 (-47.37%)
test_bpf_nf.bpf.linked1.o                    nf_xdp_ct_test                          815        509  -306 (-37.55%)          57          30  -27 (-47.37%)
test_cls_redirect.bpf.linked1.o              cls_redirect                          78925      78390   -535 (-0.68%)        4782        4704   -78 (-1.63%)
test_cls_redirect_subprogs.bpf.linked1.o     cls_redirect                          64901      63897  -1004 (-1.55%)        4612        4470  -142 (-3.08%)
test_sk_lookup.bpf.linked1.o                 access_ctx_sk                           181         95   -86 (-47.51%)          19          10   -9 (-47.37%)
test_sk_lookup.bpf.linked1.o                 ctx_narrow_access                       447        437    -10 (-2.24%)          38          37    -1 (-2.63%)
test_sk_lookup_kern.bpf.linked1.o            sk_lookup_success                       148        133   -15 (-10.14%)          14          12   -2 (-14.29%)
test_tcp_check_syncookie_kern.bpf.linked1.o  check_syncookie_clsact                  304        300     -4 (-1.32%)          23          22    -1 (-4.35%)
test_tcp_check_syncookie_kern.bpf.linked1.o  check_syncookie_xdp                     304        300     -4 (-1.32%)          23          22    -1 (-4.35%)
test_verify_pkcs7_sig.bpf.linked1.o          bpf                                      87         76   -11 (-12.64%)           7           6   -1 (-14.29%)
-------------------------------------------  --------------------------------  ---------  ---------  --------------  ----------  ----------  -------------

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20221223054921.958283-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-27 17:37:07 -08:00
Andrii Nakryiko
4a95c85c99 bpf: perform byte-by-byte comparison only when necessary in regsafe()
Extract byte-by-byte comparison of bpf_reg_state in regsafe() into
a helper function, which makes it more convenient to use it "on demand"
only for registers that benefit from such checks, instead of doing it
all the time, even if result of such comparison is ignored.

Also, remove WARN_ON_ONCE(1)+return false dead code. There is no risk of
missing some case as compiler will warn about non-void function not
returning value in some branches (and that under assumption that default
case is removed in the future).

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20221223054921.958283-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-27 17:37:07 -08:00
Andrii Nakryiko
910f699966 bpf: reject non-exact register type matches in regsafe()
Generalize the (somewhat implicit) rule of regsafe(), which states that
if register types in old and current states do not match *exactly*, they
can't be safely considered equivalent.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20221223054921.958283-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-27 17:37:07 -08:00
Andrii Nakryiko
7f4ce97cd5 bpf: generalize MAYBE_NULL vs non-MAYBE_NULL rule
Make generic check to prevent XXX_OR_NULL and XXX register types to be
intermixed. While technically in some situations it could be safe, it's
impossible to enforce due to the loss of an ID when converting
XXX_OR_NULL to its non-NULL variant. So prevent this in general, not
just for PTR_TO_MAP_KEY and PTR_TO_MAP_VALUE.

PTR_TO_MAP_KEY_OR_NULL and PTR_TO_MAP_VALUE_OR_NULL checks, which were
previously special-cased, are simplified to generic check that takes
into account range_within() and tnum_in(). This is correct as BPF
verifier doesn't allow arithmetic on XXX_OR_NULL register types, so
var_off and ranges should stay zero. But even if in the future this
restriction is lifted, it's even more important to enforce that var_off
and ranges are compatible, otherwise it's possible to construct case
where this can be exploited to bypass verifier's memory range safety
checks.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20221223054921.958283-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-27 17:37:07 -08:00
Andrii Nakryiko
a73bf9f2d9 bpf: reorganize struct bpf_reg_state fields
Move id and ref_obj_id fields after scalar data section (var_off and
ranges). This is necessary to simplify next patch which will change
regsafe()'s logic to be safer, as it makes the contents that has to be
an exact match (type-specific parts, off, type, and var_off+ranges)
a single sequential block of memory, while id and ref_obj_id should
always be remapped and thus can't be memcp()'ed.

There are few places that assume that var_off is after id/ref_obj_id to
clear out id/ref_obj_id with the single memset(0). These are changed to
explicitly zero-out id/ref_obj_id fields. Other places are adjusted to
preserve exact byte-by-byte comparison behavior.

No functional changes.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20221223054921.958283-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-27 17:37:07 -08:00
Andrii Nakryiko
e8f55fcf77 bpf: teach refsafe() to take into account ID remapping
states_equal() check performs ID mapping between old and new states to
establish a 1-to-1 correspondence between IDs, even if their absolute
numberic values across two equivalent states differ. This is important
both for correctness and to avoid unnecessary work when two states are
equivalent.

With recent changes we partially fixed this logic by maintaining ID map
across all function frames. This patch also makes refsafe() check take
into account (and maintain) ID map, making states_equal() behavior more
optimal and correct.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20221223054921.958283-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-27 17:37:07 -08:00
David S. Miller
be1236fce5 bpf-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY6YkXgAKCRDbK58LschI
 g25kAP4jYi+YomSlmGUzN/fUbEIHkXXyh85Yh2/yHGYdVuIuvwEA0uXeC7JHQTca
 dkcyYvgY6zJwFBV0lAVnhTRzFirFkQk=
 =THs1
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
The following pull-request contains BPF updates for your *net* tree.

We've added 7 non-merge commits during the last 5 day(s) which contain
a total of 11 files changed, 231 insertions(+), 3 deletions(-).

The main changes are:

1) Fix a splat in bpf_skb_generic_pop() under CHECKSUM_PARTIAL due to
   misuse of skb_postpull_rcsum(), from Jakub Kicinski with test case
   from Martin Lau.

2) Fix BPF verifier's nullness propagation when registers are of
   type PTR_TO_BTF_ID, from Hao Sun.

3) Fix bpftool build for JIT disassembler under statically built
   libllvm, from Anton Protopopov.

4) Fix warnings reported by resolve_btfids when building vmlinux
   with CONFIG_SECURITY_NETWORK disabled, from Hou Tao.

5) Minor fix up for BPF selftest gitignore, from Stanislav Fomichev.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-24 09:39:02 +00:00
Hao Sun
8374bfd5a3 bpf: fix nullness propagation for reg to reg comparisons
After befae75856, the verifier would propagate null information after
JEQ/JNE, e.g., if two pointers, one is maybe_null and the other is not,
the former would be marked as non-null in eq path. However, as comment
"PTR_TO_BTF_ID points to a kernel struct that does not need to be null
checked by the BPF program ... The verifier must keep this in mind and
can make no assumptions about null or non-null when doing branch ...".
If one pointer is maybe_null and the other is PTR_TO_BTF, the former is
incorrectly marked non-null. The following BPF prog can trigger a
null-ptr-deref, also see this report for more details[1]:

	0: (18) r1 = map_fd	        ; R1_w=map_ptr(ks=4, vs=4)
	2: (79) r6 = *(u64 *)(r1 +8)    ; R6_w=bpf_map->inner_map_data
					; R6 is PTR_TO_BTF_ID
					; equals to null at runtime
	3: (bf) r2 = r10
	4: (07) r2 += -4
	5: (62) *(u32 *)(r2 +0) = 0
	6: (85) call bpf_map_lookup_elem#1    ; R0_w=map_value_or_null
	7: (1d) if r6 == r0 goto pc+1
	8: (95) exit
	; from 7 to 9: R0=map_value R6=ptr_bpf_map
	9: (61) r0 = *(u32 *)(r0 +0)          ; null-ptr-deref
	10: (95) exit

So, make the verifier propagate nullness information for reg to reg
comparisons only if neither reg is PTR_TO_BTF_ID.

[1] https://lore.kernel.org/bpf/CACkBjsaFJwjC5oiw-1KXvcazywodwXo4zGYsRHwbr2gSG9WcSw@mail.gmail.com/T/#u

Fixes: befae75856 ("bpf: propagate nullness information for reg to reg comparisons")
Signed-off-by: Hao Sun <sunhao.th@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221222024414.29539-1-sunhao.th@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-12-22 17:19:06 -08:00
Ricardo Ribalda
cfca00767f bpf: Remove unused field initialization in bpf's ctl_table
Maxlen is used by standard proc_handlers such as proc_dointvec(), but in this
case we have our own proc_handler via bpf_stats_handler(). Therefore, remove
the initialization.

Signed-off-by: Ricardo Ribalda <ribalda@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20221221-bpf-syscall-v1-0-9550f5f2c3fc@chromium.org
2022-12-22 15:48:11 +01:00
Linus Torvalds
609d3bc623 Including fixes from bpf, netfilter and can.
Current release - regressions:
 
  - bpf: synchronize dispatcher update with bpf_dispatcher_xdp_func
 
  - rxrpc:
   - fix security setting propagation
   - fix null-deref in rxrpc_unuse_local()
   - fix switched parameters in peer tracing
 
 Current release - new code bugs:
 
  - rxrpc:
    - fix I/O thread startup getting skipped
    - fix locking issues in rxrpc_put_peer_locked()
    - fix I/O thread stop
    - fix uninitialised variable in rxperf server
    - fix the return value of rxrpc_new_incoming_call()
 
  - microchip: vcap: fix initialization of value and mask
 
  - nfp: fix unaligned io read of capabilities word
 
 Previous releases - regressions:
 
  - stop in-kernel socket users from corrupting socket's task_frag
 
  - stream: purge sk_error_queue in sk_stream_kill_queues()
 
  - openvswitch: fix flow lookup to use unmasked key
 
  - dsa: mv88e6xxx: avoid reg_lock deadlock in mv88e6xxx_setup_port()
 
  - devlink:
    - hold region lock when flushing snapshots
    - protect devlink dump by the instance lock
 
 Previous releases - always broken:
 
  - bpf:
    - prevent leak of lsm program after failed attach
    - resolve fext program type when checking map compatibility
 
  - skbuff: account for tail adjustment during pull operations
 
  - macsec: fix net device access prior to holding a lock
 
  - bonding: switch back when high prio link up
 
  - netfilter: flowtable: really fix NAT IPv6 offload
 
  - enetc: avoid buffer leaks on xdp_do_redirect() failure
 
  - unix: fix race in SOCK_SEQPACKET's unix_dgram_sendmsg()
 
  - dsa: microchip: remove IRQF_TRIGGER_FALLING in request_threaded_irq
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmOiGa4ACgkQMUZtbf5S
 IrvetBAAg/AjgG51gboLsuGjgRSwAi5T6ijgVR+pW+kMuoOdaamOF+h/zC1ox/H9
 QrWvTBipy+EqSD8bM4Xz0FNgidch8X4iWYhKGZuBht/4NP5FOzPUG2mNlUy5ANGq
 QZcCw6CUsir8HTb+IJpFEIq0JMwzKCm3WyAkYjEj4iuft0Y93cAgjkMVwoX0RERO
 o/pslC5dsozCLJxEglpw1aJq7aoroNuRSGSXl95nv8fU3UxmUXajnA3HNscXImdV
 6uqSIuyPIaGocpCBPRKUQd0sctkTY4cm8wmxxMCDVsBRVusoaq5eg1VRvxJm9Rxj
 gvDvHvfhnEuSigFF5A+paBp4c+i3C8g/UTBJTtptdAC+Y2tt4UT3Q5aaazYUOAqd
 W4TSJ3bk5zhkhpRF9clb0fNQaM1HOT4rkDEEGTfVN62dtHfPKpNwYufQKaYHdVj1
 RJ3ooH6c7TMVaRs6ZgEWNYToKZj94SIfPhfEhuqWXdNMDBkUMp2BXFFOp9fZDWju
 PsMQrRD7n6+XXpNvScYtnJDORqfIL9yHGZE9kxZA5QSDl9cnPA3SUbNruQPlXHrl
 w0yQlYuG3gcciua4dXaLfz1iN4rPdenuYhVBHhztEwDKl+b61CVQYlOHGkXPVURp
 oft74qCCFbva+Hf/7jENQotjT1tLfxAGdUARuFeDBueJgDRAPsw=
 =goV5
 -----END PGP SIGNATURE-----

Merge tag 'net-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf, netfilter and can.

  Current release - regressions:

   - bpf: synchronize dispatcher update with bpf_dispatcher_xdp_func

   - rxrpc:
      - fix security setting propagation
      - fix null-deref in rxrpc_unuse_local()
      - fix switched parameters in peer tracing

  Current release - new code bugs:

   - rxrpc:
      - fix I/O thread startup getting skipped
      - fix locking issues in rxrpc_put_peer_locked()
      - fix I/O thread stop
      - fix uninitialised variable in rxperf server
      - fix the return value of rxrpc_new_incoming_call()

   - microchip: vcap: fix initialization of value and mask

   - nfp: fix unaligned io read of capabilities word

  Previous releases - regressions:

   - stop in-kernel socket users from corrupting socket's task_frag

   - stream: purge sk_error_queue in sk_stream_kill_queues()

   - openvswitch: fix flow lookup to use unmasked key

   - dsa: mv88e6xxx: avoid reg_lock deadlock in mv88e6xxx_setup_port()

   - devlink:
      - hold region lock when flushing snapshots
      - protect devlink dump by the instance lock

  Previous releases - always broken:

   - bpf:
      - prevent leak of lsm program after failed attach
      - resolve fext program type when checking map compatibility

   - skbuff: account for tail adjustment during pull operations

   - macsec: fix net device access prior to holding a lock

   - bonding: switch back when high prio link up

   - netfilter: flowtable: really fix NAT IPv6 offload

   - enetc: avoid buffer leaks on xdp_do_redirect() failure

   - unix: fix race in SOCK_SEQPACKET's unix_dgram_sendmsg()

   - dsa: microchip: remove IRQF_TRIGGER_FALLING in
     request_threaded_irq"

* tag 'net-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (64 commits)
  net: fec: check the return value of build_skb()
  net: simplify sk_page_frag
  Treewide: Stop corrupting socket's task_frag
  net: Introduce sk_use_task_frag in struct sock.
  mctp: Remove device type check at unregister
  net: dsa: microchip: remove IRQF_TRIGGER_FALLING in request_threaded_irq
  can: kvaser_usb: hydra: help gcc-13 to figure out cmd_len
  can: flexcan: avoid unbalanced pm_runtime_enable warning
  Documentation: devlink: add missing toc entry for etas_es58x devlink doc
  mctp: serial: Fix starting value for frame check sequence
  nfp: fix unaligned io read of capabilities word
  net: stream: purge sk_error_queue in sk_stream_kill_queues()
  myri10ge: Fix an error handling path in myri10ge_probe()
  net: microchip: vcap: Fix initialization of value and mask
  rxrpc: Fix the return value of rxrpc_new_incoming_call()
  rxrpc: rxperf: Fix uninitialised variable
  rxrpc: Fix I/O thread stop
  rxrpc: Fix switched parameters in peer tracing
  rxrpc: Fix locking issues in rxrpc_put_peer_locked()
  rxrpc: Fix I/O thread startup getting skipped
  ...
2022-12-21 08:41:32 -08:00
Martin KaFai Lau
552d42a356 bpf: Reduce smap->elem_size
'struct bpf_local_storage_elem' has an unused 56 byte padding at the
end due to struct's cache-line alignment requirement. This padding
space is overlapped by storage value contents, so if we use sizeof()
to calculate the total size, we overinflate it by 56 bytes. Use
offsetof() instead to calculate more exact memory use.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20221221013036.3427431-1-martin.lau@linux.dev
2022-12-21 15:42:39 +01:00
Jiri Olsa
e2bb9e01d5 bpf: Remove trace_printk_lock
Both bpf_trace_printk and bpf_trace_vprintk helpers use static buffer guarded
with trace_printk_lock spin lock.

The spin lock contention causes issues with bpf programs attached to
contention_begin tracepoint [1][2].

Andrii suggested we could get rid of the contention by using trylock, but we
could actually get rid of the spinlock completely by using percpu buffers the
same way as for bin_args in bpf_bprintf_prepare function.

Adding new return 'buf' argument to struct bpf_bprintf_data and making
bpf_bprintf_prepare to return also the buffer for printk helpers.

  [1] https://lore.kernel.org/bpf/CACkBjsakT_yWxnSWr4r-0TpPvbKm9-OBmVUhJb7hV3hY8fdCkw@mail.gmail.com/
  [2] https://lore.kernel.org/bpf/CACkBjsaCsTovQHFfkqJKto6S4Z8d02ud1D7MPESrHa1cVNNTrw@mail.gmail.com/

Reported-by: Hao Sun <sunhao.th@gmail.com>
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20221215214430.1336195-4-jolsa@kernel.org
2022-12-19 22:08:37 +01:00
Jiri Olsa
f19a405045 bpf: Do cleanup in bpf_bprintf_cleanup only when needed
Currently we always cleanup/decrement bpf_bprintf_nest_level variable
in bpf_bprintf_cleanup if it's > 0.

There's possible scenario where this could cause a problem, when
bpf_bprintf_prepare does not get bin_args buffer (because num_args is 0)
and following bpf_bprintf_cleanup call decrements bpf_bprintf_nest_level
variable, like:

  in task context:
    bpf_bprintf_prepare(num_args != 0) increments 'bpf_bprintf_nest_level = 1'
    -> first irq :
       bpf_bprintf_prepare(num_args == 0)
       bpf_bprintf_cleanup decrements 'bpf_bprintf_nest_level = 0'
    -> second irq:
       bpf_bprintf_prepare(num_args != 0) bpf_bprintf_nest_level = 1
       gets same buffer as task context above

Adding check to bpf_bprintf_cleanup and doing the real cleanup only if we
got bin_args data in the first place.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20221215214430.1336195-3-jolsa@kernel.org
2022-12-19 22:08:06 +01:00
Jiri Olsa
78aa1cc940 bpf: Add struct for bin_args arg in bpf_bprintf_prepare
Adding struct bpf_bprintf_data to hold bin_args argument for
bpf_bprintf_prepare function.

We will add another return argument to bpf_bprintf_prepare and
pass the struct to bpf_bprintf_cleanup for proper cleanup in
following changes.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20221215214430.1336195-2-jolsa@kernel.org
2022-12-19 22:07:52 +01:00
Hou Tao
cc07482246 bpf: Define sock security related BTF IDs under CONFIG_SECURITY_NETWORK
There are warnings reported from resolve_btfids when building vmlinux
with CONFIG_SECURITY_NETWORK disabled:

  WARN: resolve_btfids: unresolved symbol bpf_lsm_sk_free_security
  WARN: resolve_btfids: unresolved symbol bpf_lsm_sk_alloc_security

So only define BTF IDs for these LSM hooks when CONFIG_SECURITY_NETWORK
is enabled.

Fixes: c0c852dd18 ("bpf: Do not mark certain LSM hook arguments as trusted")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20221217062144.2507222-1-houtao@huaweicloud.com
2022-12-19 22:02:17 +01:00
Linus Torvalds
4f292c4de4 New Feature:
* Randomize the per-cpu entry areas
 Cleanups:
 * Have CR3_ADDR_MASK use PHYSICAL_PAGE_MASK instead of open
   coding it
 * Move to "native" set_memory_rox() helper
 * Clean up pmd_get_atomic() and i386-PAE
 * Remove some unused page table size macros
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmOc53UACgkQaDWVMHDJ
 krCUHw//SGZ+La0hLZLAiAiZTXLZZHpYkOmg1Oj1+11qSU11uZzTFqDpauhaKpRS
 cJCSh+D+RXe5e2ipgt0+Zl0hESLt7pJf8258OE4ra0DL/IlyO9uqruAs9Kn3eRS/
 Fk76nG8gdEU+JKJqpG02GqOLslYQuIy96n9hpuj1x25b614+uezPfC7S4XEat0NT
 MbJQ+jnVDf16aJIJkzT+iSwhubDVeh+bSHeO0SSCzX23WLUqDeg5NvlyxoCHGbBh
 UpUTWggV/0pYAkBKRHToeJs8qTWREwuuH/8JGewpe9A0tjdB5wyZfNL2PuracweN
 9MauXC3T5f0+Ca4yIIaPq1fF7Ny/PR2dBFihk27rOD0N7tjaZxNwal2pB1sZcmvZ
 +PAokjyTPVH5ZXjkMYGGAUe1jyjwr2+TgFSZxhTnDuGtyVQiY4pihGKOifLCX6tv
 x6khvYeTBw7wfaDRtKEAf+2kLHYn+71HszHP/8bNKX9T03h+Zf0i1wdZu5xbM5Gc
 VK2wR7bCC+UftJJYG0pldcHg2qaF19RBHK2tLwp7zngUv7lTbkKfkgKjre73KV2a
 D4b76lrqdUMo6UYwYdw7WtDyarZS4OVLq2DcNhwwMddBCaX8kyN5a4AqwQlZYJ0u
 dM+kuMofE8U3yMxmMhJimkZUsj09yLHIqfynY0jbAcU3nhKZZNY=
 =wwVF
 -----END PGP SIGNATURE-----

Merge tag 'x86_mm_for_6.2_v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 mm updates from Dave Hansen:
 "New Feature:

   - Randomize the per-cpu entry areas

  Cleanups:

   - Have CR3_ADDR_MASK use PHYSICAL_PAGE_MASK instead of open coding it

   - Move to "native" set_memory_rox() helper

   - Clean up pmd_get_atomic() and i386-PAE

   - Remove some unused page table size macros"

* tag 'x86_mm_for_6.2_v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits)
  x86/mm: Ensure forced page table splitting
  x86/kasan: Populate shadow for shared chunk of the CPU entry area
  x86/kasan: Add helpers to align shadow addresses up and down
  x86/kasan: Rename local CPU_ENTRY_AREA variables to shorten names
  x86/mm: Populate KASAN shadow for entire per-CPU range of CPU entry area
  x86/mm: Recompute physical address for every page of per-CPU CEA mapping
  x86/mm: Rename __change_page_attr_set_clr(.checkalias)
  x86/mm: Inhibit _PAGE_NX changes from cpa_process_alias()
  x86/mm: Untangle __change_page_attr_set_clr(.checkalias)
  x86/mm: Add a few comments
  x86/mm: Fix CR3_ADDR_MASK
  x86/mm: Remove P*D_PAGE_MASK and P*D_PAGE_SIZE macros
  mm: Convert __HAVE_ARCH_P..P_GET to the new style
  mm: Remove pointless barrier() after pmdp_get_lockless()
  x86/mm/pae: Get rid of set_64bit()
  x86_64: Remove pointless set_64bit() usage
  x86/mm/pae: Be consistent with pXXp_get_and_clear()
  x86/mm/pae: Use WRITE_ONCE()
  x86/mm/pae: Don't (ab)use atomic64
  mm/gup: Fix the lockless PMD access
  ...
2022-12-17 14:06:53 -06:00
Peter Zijlstra
d48567c9a0 mm: Introduce set_memory_rox()
Because endlessly repeating:

	set_memory_ro()
	set_memory_x()

is getting tedious.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/Y1jek64pXOsougmz@hirez.programming.kicks-ass.net
2022-12-15 10:37:26 -08:00
Toke Høiland-Jørgensen
1c123c567f bpf: Resolve fext program type when checking map compatibility
The bpf_prog_map_compatible() check makes sure that BPF program types are
not mixed inside BPF map types that can contain programs (tail call maps,
cpumaps and devmaps). It does this by setting the fields of the map->owner
struct to the values of the first program being checked against, and
rejecting any subsequent programs if the values don't match.

One of the values being set in the map owner struct is the program type,
and since the code did not resolve the prog type for fext programs, the map
owner type would be set to PROG_TYPE_EXT and subsequent loading of programs
of the target type into the map would fail.

This bug is seen in particular for XDP programs that are loaded as
PROG_TYPE_EXT using libxdp; these cannot insert programs into devmaps and
cpumaps because the check fails as described above.

Fix the bug by resolving the fext program type to its target program type
as elsewhere in the verifier.

v3:
- Add Yonghong's ACK

Fixes: f45d5b6ce2 ("bpf: generalise tail call map compatibility check")
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20221214230254.790066-1-toke@redhat.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-12-14 21:30:27 -08:00
Jiri Olsa
4121d4481b bpf: Synchronize dispatcher update with bpf_dispatcher_xdp_func
Hao Sun reported crash in dispatcher image [1].

Currently we don't have any sync between bpf_dispatcher_update and
bpf_dispatcher_xdp_func, so following race is possible:

 cpu 0:                               cpu 1:

 bpf_prog_run_xdp
   ...
   bpf_dispatcher_xdp_func
     in image at offset 0x0

                                      bpf_dispatcher_update
                                        update image at offset 0x800
                                      bpf_dispatcher_update
                                        update image at offset 0x0

     in image at offset 0x0 -> crash

Fixing this by synchronizing dispatcher image update (which is done
in bpf_dispatcher_update function) with bpf_dispatcher_xdp_func that
reads and execute the dispatcher image.

Calling synchronize_rcu after updating and installing new image ensures
that readers leave old image before it's changed in the next dispatcher
update. The update itself is locked with dispatcher's mutex.

The bpf_prog_run_xdp is called under local_bh_disable and synchronize_rcu
will wait for it to leave [2].

[1] https://lore.kernel.org/bpf/Y5SFho7ZYXr9ifRn@krava/T/#m00c29ece654bc9f332a17df493bbca33e702896c
[2] https://lore.kernel.org/bpf/0B62D35A-E695-4B7A-A0D4-774767544C1A@gmail.com/T/#mff43e2c003ae99f4a38f353c7969be4c7162e877

Reported-by: Hao Sun <sunhao.th@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20221214123542.1389719-1-jolsa@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-12-14 12:02:14 -08:00
Milan Landaverde
e89f3edffb bpf: prevent leak of lsm program after failed attach
In [0], we added the ability to bpf_prog_attach LSM programs to cgroups,
but in our validation to make sure the prog is meant to be attached to
BPF_LSM_CGROUP, we return too early if the check fails. This results in
lack of decrementing prog's refcnt (through bpf_prog_put)
leaving the LSM program alive past the point of the expected lifecycle.
This fix allows for the decrement to take place.

[0] https://lore.kernel.org/all/20220628174314.1216643-4-sdf@google.com/

Fixes: 69fd337a97 ("bpf: per-cgroup lsm flavor")
Signed-off-by: Milan Landaverde <milan@mdaverde.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20221213175714.31963-1-milan@mdaverde.com
2022-12-14 18:35:41 +01:00
Linus Torvalds
7e68dd7d07 Networking changes for 6.2.
Core
 ----
  - Allow live renaming when an interface is up
 
  - Add retpoline wrappers for tc, improving considerably the
    performances of complex queue discipline configurations.
 
  - Add inet drop monitor support.
 
  - A few GRO performance improvements.
 
  - Add infrastructure for atomic dev stats, addressing long standing
    data races.
 
  - De-duplicate common code between OVS and conntrack offloading
    infrastructure.
 
  - A bunch of UBSAN_BOUNDS/FORTIFY_SOURCE improvements.
 
  - Netfilter: introduce packet parser for tunneled packets
 
  - Replace IPVS timer-based estimators with kthreads to scale up
    the workload with the number of available CPUs.
 
  - Add the helper support for connection-tracking OVS offload.
 
 BPF
 ---
  - Support for user defined BPF objects: the use case is to allocate
    own objects, build own object hierarchies and use the building
    blocks to build own data structures flexibly, for example, linked
    lists in BPF.
 
  - Make cgroup local storage available to non-cgroup attached BPF
    programs.
 
  - Avoid unnecessary deadlock detection and failures wrt BPF task
    storage helpers.
 
  - A relevant bunch of BPF verifier fixes and improvements.
 
  - Veristat tool improvements to support custom filtering, sorting,
    and replay of results.
 
  - Add LLVM disassembler as default library for dumping JITed code.
 
  - Lots of new BPF documentation for various BPF maps.
 
  - Add bpf_rcu_read_{,un}lock() support for sleepable programs.
 
  - Add RCU grace period chaining to BPF to wait for the completion
    of access from both sleepable and non-sleepable BPF programs.
 
  - Add support storing struct task_struct objects as kptrs in maps.
 
  - Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer
    values.
 
  - Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions.
 
 Protocols
 ---------
  - TCP: implement Protective Load Balancing across switch links.
 
  - TCP: allow dynamically disabling TCP-MD5 static key, reverting
    back to fast[er]-path.
 
  - UDP: Introduce optional per-netns hash lookup table.
 
  - IPv6: simplify and cleanup sockets disposal.
 
  - Netlink: support different type policies for each generic
    netlink operation.
 
  - MPTCP: add MSG_FASTOPEN and FastOpen listener side support.
 
  - MPTCP: add netlink notification support for listener sockets
    events.
 
  - SCTP: add VRF support, allowing sctp sockets binding to VRF
    devices.
 
  - Add bridging MAC Authentication Bypass (MAB) support.
 
  - Extensions for Ethernet VPN bridging implementation to better
    support multicast scenarios.
 
  - More work for Wi-Fi 7 support, comprising conversion of all
    the existing drivers to internal TX queue usage.
 
  - IPSec: introduce a new offload type (packet offload) allowing
    complete header processing and crypto offloading.
 
  - IPSec: extended ack support for more descriptive XFRM error
    reporting.
 
  - RXRPC: increase SACK table size and move processing into a
    per-local endpoint kernel thread, reducing considerably the
    required locking.
 
  - IEEE 802154: synchronous send frame and extended filtering
    support, initial support for scanning available 15.4 networks.
 
  - Tun: bump the link speed from 10Mbps to 10Gbps.
 
  - Tun/VirtioNet: implement UDP segmentation offload support.
 
 Driver API
 ----------
 
  - PHY/SFP: improve power level switching between standard
    level 1 and the higher power levels.
 
  - New API for netdev <-> devlink_port linkage.
 
  - PTP: convert existing drivers to new frequency adjustment
    implementation.
 
  - DSA: add support for rx offloading.
 
  - Autoload DSA tagging driver when dynamically changing protocol.
 
  - Add new PCP and APPTRUST attributes to Data Center Bridging.
 
  - Add configuration support for 800Gbps link speed.
 
  - Add devlink port function attribute to enable/disable RoCE and
    migratable.
 
  - Extend devlink-rate to support strict prioriry and weighted fair
    queuing.
 
  - Add devlink support to directly reading from region memory.
 
  - New device tree helper to fetch MAC address from nvmem.
 
  - New big TCP helper to simplify temporary header stripping.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - Marvel Octeon CNF95N and CN10KB Ethernet Switches.
    - Marvel Prestera AC5X Ethernet Switch.
    - WangXun 10 Gigabit NIC.
    - Motorcomm yt8521 Gigabit Ethernet.
    - Microchip ksz9563 Gigabit Ethernet Switch.
    - Microsoft Azure Network Adapter.
    - Linux Automation 10Base-T1L adapter.
 
  - PHY:
    - Aquantia AQR112 and AQR412.
    - Motorcomm YT8531S.
 
  - PTP:
    - Orolia ART-CARD.
 
  - WiFi:
    - MediaTek Wi-Fi 7 (802.11be) devices.
    - RealTek rtw8821cu, rtw8822bu, rtw8822cu and rtw8723du USB
      devices.
 
  - Bluetooth:
    - Broadcom BCM4377/4378/4387 Bluetooth chipsets.
    - Realtek RTL8852BE and RTL8723DS.
    - Cypress.CYW4373A0 WiFi + Bluetooth combo device.
 
 Drivers
 -------
  - CAN:
    - gs_usb: bus error reporting support.
    - kvaser_usb: listen only and bus error reporting support.
 
  - Ethernet NICs:
    - Intel (100G):
      - extend action skbedit to RX queue mapping.
      - implement devlink-rate support.
      - support direct read from memory.
    - nVidia/Mellanox (mlx5):
      - SW steering improvements, increasing rules update rate.
      - Support for enhanced events compression.
      - extend H/W offload packet manipulation capabilities.
      - implement IPSec packet offload mode.
    - nVidia/Mellanox (mlx4):
      - better big TCP support.
    - Netronome Ethernet NICs (nfp):
      - IPsec offload support.
      - add support for multicast filter.
    - Broadcom:
      - RSS and PTP support improvements.
    - AMD/SolarFlare:
      - netlink extened ack improvements.
      - add basic flower matches to offload, and related stats.
    - Virtual NICs:
      - ibmvnic: introduce affinity hint support.
    - small / embedded:
      - FreeScale fec: add initial XDP support.
      - Marvel mv643xx_eth: support MII/GMII/RGMII modes for Kirkwood.
      - TI am65-cpsw: add suspend/resume support.
      - Mediatek MT7986: add RX wireless wthernet dispatch support.
      - Realtek 8169: enable GRO software interrupt coalescing per
        default.
 
  - Ethernet high-speed switches:
    - Microchip (sparx5):
      - add support for Sparx5 TC/flower H/W offload via VCAP.
    - Mellanox mlxsw:
      - add 802.1X and MAC Authentication Bypass offload support.
      - add ip6gre support.
 
  - Embedded Ethernet switches:
    - Mediatek (mtk_eth_soc):
      - improve PCS implementation, add DSA untag support.
      - enable flow offload support.
    - Renesas:
      - add rswitch R-Car Gen4 gPTP support.
    - Microchip (lan966x):
      - add full XDP support.
      - add TC H/W offload via VCAP.
      - enable PTP on bridge interfaces.
    - Microchip (ksz8):
      - add MTU support for KSZ8 series.
 
  - Qualcomm 802.11ax WiFi (ath11k):
    - support configuring channel dwell time during scan.
 
  - MediaTek WiFi (mt76):
    - enable Wireless Ethernet Dispatch (WED) offload support.
    - add ack signal support.
    - enable coredump support.
    - remain_on_channel support.
 
  - Intel WiFi (iwlwifi):
    - enable Wi-Fi 7 Extremely High Throughput (EHT) PHY capabilities.
    - 320 MHz channels support.
 
  - RealTek WiFi (rtw89):
    - new dynamic header firmware format support.
    - wake-over-WLAN support.
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmOYXUcSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOk8zQP/R7BZtbJMTPiWkRnSoKHnAyupDVwrz5U
 ktukLkwPsCyJuEbAjgxrxf4EEEQ9uq2FFlxNSYuKiiQMqIpFxV6KED7LCUygn4Tc
 kxtkp0Q+5XiqisWlQmtfExf2OjuuPqcjV9tWCDBI6GebKUbfNwY/eI44RcMu4BSv
 DzIlW5GkX/kZAPqnnuqaLsN3FudDTJHGEAD7NbA++7wJ076RWYSLXlFv0Z+SCSPS
 H8/PEG0/ZK/65rIWMAFRClJ9BNIDwGVgp0GrsIvs1gqbRUOlA1hl1rDM21TqtNFf
 5QPQT7sIfTcCE/nerxKJD5JE3JyP+XRlRn96PaRw3rt4MgI6I/EOj/HOKQ5tMCNc
 oPiqb7N70+hkLZyr42qX+vN9eDPjp2koEQm7EO2Zs+/534/zWDs24Zfk/Aa1ps0I
 Fa82oGjAgkBhGe/FZ6i5cYoLcyxqRqZV1Ws9XQMl72qRC7/BwvNbIW6beLpCRyeM
 yYIU+0e9dEm+wHQEdh2niJuVtR63hy8tvmPx56lyh+6u0+pondkwbfSiC5aD3kAC
 ikKsN5DyEsdXyiBAlytCEBxnaOjQy4RAz+3YXSiS0eBNacXp03UUrNGx4Pzpu/D0
 QLFJhBnMFFCgy5to8/DvKnrTPgZdSURwqbIUcZdvU21f1HLR8tUTpaQnYffc/Whm
 V8gnt1EL+0cc
 =CbJC
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Paolo Abeni:
 "Core:

   - Allow live renaming when an interface is up

   - Add retpoline wrappers for tc, improving considerably the
     performances of complex queue discipline configurations

   - Add inet drop monitor support

   - A few GRO performance improvements

   - Add infrastructure for atomic dev stats, addressing long standing
     data races

   - De-duplicate common code between OVS and conntrack offloading
     infrastructure

   - A bunch of UBSAN_BOUNDS/FORTIFY_SOURCE improvements

   - Netfilter: introduce packet parser for tunneled packets

   - Replace IPVS timer-based estimators with kthreads to scale up the
     workload with the number of available CPUs

   - Add the helper support for connection-tracking OVS offload

  BPF:

   - Support for user defined BPF objects: the use case is to allocate
     own objects, build own object hierarchies and use the building
     blocks to build own data structures flexibly, for example, linked
     lists in BPF

   - Make cgroup local storage available to non-cgroup attached BPF
     programs

   - Avoid unnecessary deadlock detection and failures wrt BPF task
     storage helpers

   - A relevant bunch of BPF verifier fixes and improvements

   - Veristat tool improvements to support custom filtering, sorting,
     and replay of results

   - Add LLVM disassembler as default library for dumping JITed code

   - Lots of new BPF documentation for various BPF maps

   - Add bpf_rcu_read_{,un}lock() support for sleepable programs

   - Add RCU grace period chaining to BPF to wait for the completion of
     access from both sleepable and non-sleepable BPF programs

   - Add support storing struct task_struct objects as kptrs in maps

   - Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer
     values

   - Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions

  Protocols:

   - TCP: implement Protective Load Balancing across switch links

   - TCP: allow dynamically disabling TCP-MD5 static key, reverting back
     to fast[er]-path

   - UDP: Introduce optional per-netns hash lookup table

   - IPv6: simplify and cleanup sockets disposal

   - Netlink: support different type policies for each generic netlink
     operation

   - MPTCP: add MSG_FASTOPEN and FastOpen listener side support

   - MPTCP: add netlink notification support for listener sockets events

   - SCTP: add VRF support, allowing sctp sockets binding to VRF devices

   - Add bridging MAC Authentication Bypass (MAB) support

   - Extensions for Ethernet VPN bridging implementation to better
     support multicast scenarios

   - More work for Wi-Fi 7 support, comprising conversion of all the
     existing drivers to internal TX queue usage

   - IPSec: introduce a new offload type (packet offload) allowing
     complete header processing and crypto offloading

   - IPSec: extended ack support for more descriptive XFRM error
     reporting

   - RXRPC: increase SACK table size and move processing into a
     per-local endpoint kernel thread, reducing considerably the
     required locking

   - IEEE 802154: synchronous send frame and extended filtering support,
     initial support for scanning available 15.4 networks

   - Tun: bump the link speed from 10Mbps to 10Gbps

   - Tun/VirtioNet: implement UDP segmentation offload support

  Driver API:

   - PHY/SFP: improve power level switching between standard level 1 and
     the higher power levels

   - New API for netdev <-> devlink_port linkage

   - PTP: convert existing drivers to new frequency adjustment
     implementation

   - DSA: add support for rx offloading

   - Autoload DSA tagging driver when dynamically changing protocol

   - Add new PCP and APPTRUST attributes to Data Center Bridging

   - Add configuration support for 800Gbps link speed

   - Add devlink port function attribute to enable/disable RoCE and
     migratable

   - Extend devlink-rate to support strict prioriry and weighted fair
     queuing

   - Add devlink support to directly reading from region memory

   - New device tree helper to fetch MAC address from nvmem

   - New big TCP helper to simplify temporary header stripping

  New hardware / drivers:

   - Ethernet:
      - Marvel Octeon CNF95N and CN10KB Ethernet Switches
      - Marvel Prestera AC5X Ethernet Switch
      - WangXun 10 Gigabit NIC
      - Motorcomm yt8521 Gigabit Ethernet
      - Microchip ksz9563 Gigabit Ethernet Switch
      - Microsoft Azure Network Adapter
      - Linux Automation 10Base-T1L adapter

   - PHY:
      - Aquantia AQR112 and AQR412
      - Motorcomm YT8531S

   - PTP:
      - Orolia ART-CARD

   - WiFi:
      - MediaTek Wi-Fi 7 (802.11be) devices
      - RealTek rtw8821cu, rtw8822bu, rtw8822cu and rtw8723du USB
        devices

   - Bluetooth:
      - Broadcom BCM4377/4378/4387 Bluetooth chipsets
      - Realtek RTL8852BE and RTL8723DS
      - Cypress.CYW4373A0 WiFi + Bluetooth combo device

  Drivers:

   - CAN:
      - gs_usb: bus error reporting support
      - kvaser_usb: listen only and bus error reporting support

   - Ethernet NICs:
      - Intel (100G):
         - extend action skbedit to RX queue mapping
         - implement devlink-rate support
         - support direct read from memory
      - nVidia/Mellanox (mlx5):
         - SW steering improvements, increasing rules update rate
         - Support for enhanced events compression
         - extend H/W offload packet manipulation capabilities
         - implement IPSec packet offload mode
      - nVidia/Mellanox (mlx4):
         - better big TCP support
      - Netronome Ethernet NICs (nfp):
         - IPsec offload support
         - add support for multicast filter
      - Broadcom:
         - RSS and PTP support improvements
      - AMD/SolarFlare:
         - netlink extened ack improvements
         - add basic flower matches to offload, and related stats
      - Virtual NICs:
         - ibmvnic: introduce affinity hint support
      - small / embedded:
         - FreeScale fec: add initial XDP support
         - Marvel mv643xx_eth: support MII/GMII/RGMII modes for Kirkwood
         - TI am65-cpsw: add suspend/resume support
         - Mediatek MT7986: add RX wireless wthernet dispatch support
         - Realtek 8169: enable GRO software interrupt coalescing per
           default

   - Ethernet high-speed switches:
      - Microchip (sparx5):
         - add support for Sparx5 TC/flower H/W offload via VCAP
      - Mellanox mlxsw:
         - add 802.1X and MAC Authentication Bypass offload support
         - add ip6gre support

   - Embedded Ethernet switches:
      - Mediatek (mtk_eth_soc):
         - improve PCS implementation, add DSA untag support
         - enable flow offload support
      - Renesas:
         - add rswitch R-Car Gen4 gPTP support
      - Microchip (lan966x):
         - add full XDP support
         - add TC H/W offload via VCAP
         - enable PTP on bridge interfaces
      - Microchip (ksz8):
         - add MTU support for KSZ8 series

   - Qualcomm 802.11ax WiFi (ath11k):
      - support configuring channel dwell time during scan

   - MediaTek WiFi (mt76):
      - enable Wireless Ethernet Dispatch (WED) offload support
      - add ack signal support
      - enable coredump support
      - remain_on_channel support

   - Intel WiFi (iwlwifi):
      - enable Wi-Fi 7 Extremely High Throughput (EHT) PHY capabilities
      - 320 MHz channels support

   - RealTek WiFi (rtw89):
      - new dynamic header firmware format support
      - wake-over-WLAN support"

* tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2002 commits)
  ipvs: fix type warning in do_div() on 32 bit
  net: lan966x: Remove a useless test in lan966x_ptp_add_trap()
  net: ipa: add IPA v4.7 support
  dt-bindings: net: qcom,ipa: Add SM6350 compatible
  bnxt: Use generic HBH removal helper in tx path
  IPv6/GRO: generic helper to remove temporary HBH/jumbo header in driver
  selftests: forwarding: Add bridge MDB test
  selftests: forwarding: Rename bridge_mdb test
  bridge: mcast: Support replacement of MDB port group entries
  bridge: mcast: Allow user space to specify MDB entry routing protocol
  bridge: mcast: Allow user space to add (*, G) with a source list and filter mode
  bridge: mcast: Add support for (*, G) with a source list and filter mode
  bridge: mcast: Avoid arming group timer when (S, G) corresponds to a source
  bridge: mcast: Add a flag for user installed source entries
  bridge: mcast: Expose __br_multicast_del_group_src()
  bridge: mcast: Expose br_multicast_new_group_src()
  bridge: mcast: Add a centralized error path
  bridge: mcast: Place netlink policy before validation functions
  bridge: mcast: Split (*, G) and (S, G) addition into different functions
  bridge: mcast: Do not derive entry type from its filter mode
  ...
2022-12-13 15:47:48 -08:00
Linus Torvalds
268325bda5 Random number generator updates for Linux 6.2-rc1.
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmOU+U8ACgkQSfxwEqXe
 A67NnQ//Y5DltmvibyPd7r1TFT2gUYv+Rx3sUV9ZE1NYptd/SWhhcL8c5FZ70Fuw
 bSKCa1uiWjOxosjXT1kGrWq3de7q7oUpAPSOGxgxzoaNURIt58N/ajItCX/4Au8I
 RlGAScHy5e5t41/26a498kB6qJ441fBEqCYKQpPLINMBAhe8TQ+NVp0rlpUwNHFX
 WrUGg4oKWxdBIW3HkDirQjJWDkkAiklRTifQh/Al4b6QDbOnRUGGCeckNOhixsvS
 waHWTld+Td8jRrA4b82tUb2uVZ2/b8dEvj/A8CuTv4yC0lywoyMgBWmJAGOC+UmT
 ZVNdGW02Jc2T+Iap8ZdsEmeLHNqbli4+IcbY5xNlov+tHJ2oz41H9TZoYKbudlr6
 /ReAUPSn7i50PhbQlEruj3eg+M2gjOeh8OF8UKwwRK8PghvyWQ1ScW0l3kUhPIhI
 PdIG6j4+D2mJc1FIj2rTVB+Bg933x6S+qx4zDxGlNp62AARUFYf6EgyD6aXFQVuX
 RxcKb6cjRuFkzFiKc8zkqg5edZH+IJcPNuIBmABqTGBOxbZWURXzIQvK/iULqZa4
 CdGAFIs6FuOh8pFHLI3R4YoHBopbHup/xKDEeAO9KZGyeVIuOSERDxxo5f/ITzcq
 APvT77DFOEuyvanr8RMqqh0yUjzcddXqw9+ieufsAyDwjD9DTuE=
 =QRhK
 -----END PGP SIGNATURE-----

Merge tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random

Pull random number generator updates from Jason Donenfeld:

 - Replace prandom_u32_max() and various open-coded variants of it,
   there is now a new family of functions that uses fast rejection
   sampling to choose properly uniformly random numbers within an
   interval:

       get_random_u32_below(ceil) - [0, ceil)
       get_random_u32_above(floor) - (floor, U32_MAX]
       get_random_u32_inclusive(floor, ceil) - [floor, ceil]

   Coccinelle was used to convert all current users of
   prandom_u32_max(), as well as many open-coded patterns, resulting in
   improvements throughout the tree.

   I'll have a "late" 6.1-rc1 pull for you that removes the now unused
   prandom_u32_max() function, just in case any other trees add a new
   use case of it that needs to converted. According to linux-next,
   there may be two trivial cases of prandom_u32_max() reintroductions
   that are fixable with a 's/.../.../'. So I'll have for you a final
   conversion patch doing that alongside the removal patch during the
   second week.

   This is a treewide change that touches many files throughout.

 - More consistent use of get_random_canary().

 - Updates to comments, documentation, tests, headers, and
   simplification in configuration.

 - The arch_get_random*_early() abstraction was only used by arm64 and
   wasn't entirely useful, so this has been replaced by code that works
   in all relevant contexts.

 - The kernel will use and manage random seeds in non-volatile EFI
   variables, refreshing a variable with a fresh seed when the RNG is
   initialized. The RNG GUID namespace is then hidden from efivarfs to
   prevent accidental leakage.

   These changes are split into random.c infrastructure code used in the
   EFI subsystem, in this pull request, and related support inside of
   EFISTUB, in Ard's EFI tree. These are co-dependent for full
   functionality, but the order of merging doesn't matter.

 - Part of the infrastructure added for the EFI support is also used for
   an improvement to the way vsprintf initializes its siphash key,
   replacing an sleep loop wart.

 - The hardware RNG framework now always calls its correct random.c
   input function, add_hwgenerator_randomness(), rather than sometimes
   going through helpers better suited for other cases.

 - The add_latent_entropy() function has long been called from the fork
   handler, but is a no-op when the latent entropy gcc plugin isn't
   used, which is fine for the purposes of latent entropy.

   But it was missing out on the cycle counter that was also being mixed
   in beside the latent entropy variable. So now, if the latent entropy
   gcc plugin isn't enabled, add_latent_entropy() will expand to a call
   to add_device_randomness(NULL, 0), which adds a cycle counter,
   without the absent latent entropy variable.

 - The RNG is now reseeded from a delayed worker, rather than on demand
   when used. Always running from a worker allows it to make use of the
   CPU RNG on platforms like S390x, whose instructions are too slow to
   do so from interrupts. It also has the effect of adding in new inputs
   more frequently with more regularity, amounting to a long term
   transcript of random values. Plus, it helps a bit with the upcoming
   vDSO implementation (which isn't yet ready for 6.2).

 - The jitter entropy algorithm now tries to execute on many different
   CPUs, round-robining, in hopes of hitting even more memory latencies
   and other unpredictable effects. It also will mix in a cycle counter
   when the entropy timer fires, in addition to being mixed in from the
   main loop, to account more explicitly for fluctuations in that timer
   firing. And the state it touches is now kept within the same cache
   line, so that it's assured that the different execution contexts will
   cause latencies.

* tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: (23 commits)
  random: include <linux/once.h> in the right header
  random: align entropy_timer_state to cache line
  random: mix in cycle counter when jitter timer fires
  random: spread out jitter callback to different CPUs
  random: remove extraneous period and add a missing one in comments
  efi: random: refresh non-volatile random seed when RNG is initialized
  vsprintf: initialize siphash key using notifier
  random: add back async readiness notifier
  random: reseed in delayed work rather than on-demand
  random: always mix cycle counter in add_latent_entropy()
  hw_random: use add_hwgenerator_randomness() for early entropy
  random: modernize documentation comment on get_random_bytes()
  random: adjust comment to account for removed function
  random: remove early archrandom abstraction
  random: use random.trust_{bootloader,cpu} command line option only
  stackprotector: actually use get_random_canary()
  stackprotector: move get_random_canary() into stackprotector.h
  treewide: use get_random_u32_inclusive() when possible
  treewide: use get_random_u32_{above,below}() instead of manual loop
  treewide: use get_random_u32_below() instead of deprecated function
  ...
2022-12-12 16:22:22 -08:00
Eduard Zingerman
4ea2bb158b bpf: use check_ids() for active_lock comparison
An update for verifier.c:states_equal()/regsafe() to use check_ids()
for active spin lock comparisons. This fixes the issue reported by
Kumar Kartikeya Dwivedi in [1] using technique suggested by Edward Cree.

W/o this commit the verifier might be tricked to accept the following
program working with a map containing spin locks:

  0: r9 = map_lookup_elem(...)  ; Returns PTR_TO_MAP_VALUE_OR_NULL id=1.
  1: r8 = map_lookup_elem(...)  ; Returns PTR_TO_MAP_VALUE_OR_NULL id=2.
  2: if r9 == 0 goto exit       ; r9 -> PTR_TO_MAP_VALUE.
  3: if r8 == 0 goto exit       ; r8 -> PTR_TO_MAP_VALUE.
  4: r7 = ktime_get_ns()        ; Unbound SCALAR_VALUE.
  5: r6 = ktime_get_ns()        ; Unbound SCALAR_VALUE.
  6: bpf_spin_lock(r8)          ; active_lock.id == 2.
  7: if r6 > r7 goto +1         ; No new information about the state
                                ; is derived from this check, thus
                                ; produced verifier states differ only
                                ; in 'insn_idx'.
  8: r9 = r8                    ; Optionally make r9.id == r8.id.
  --- checkpoint ---            ; Assume is_state_visisted() creates a
                                ; checkpoint here.
  9: bpf_spin_unlock(r9)        ; (a,b) active_lock.id == 2.
                                ; (a) r9.id == 2, (b) r9.id == 1.
 10: exit(0)

Consider two verification paths:
(a) 0-10
(b) 0-7,9-10

The path (a) is verified first. If checkpoint is created at (8)
the (b) would assume that (8) is safe because regsafe() does not
compare register ids for registers of type PTR_TO_MAP_VALUE.

[1] https://lore.kernel.org/bpf/20221111202719.982118-1-memxor@gmail.com/

Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Suggested-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20221209135733.28851-6-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-10 13:20:53 -08:00
Eduard Zingerman
5dd9cdbc9d bpf: states_equal() must build idmap for all function frames
verifier.c:states_equal() must maintain register ID mapping across all
function frames. Otherwise the following example might be erroneously
marked as safe:

main:
    fp[-24] = map_lookup_elem(...)  ; frame[0].fp[-24].id == 1
    fp[-32] = map_lookup_elem(...)  ; frame[0].fp[-32].id == 2
    r1 = &fp[-24]
    r2 = &fp[-32]
    call foo()
    r0 = 0
    exit

foo:
  0: r9 = r1
  1: r8 = r2
  2: r7 = ktime_get_ns()
  3: r6 = ktime_get_ns()
  4: if (r6 > r7) goto skip_assign
  5: r9 = r8

skip_assign:                ; <--- checkpoint
  6: r9 = *r9               ; (a) frame[1].r9.id == 2
                            ; (b) frame[1].r9.id == 1

  7: if r9 == 0 goto exit:  ; mark_ptr_or_null_regs() transfers != 0 info
                            ; for all regs sharing ID:
                            ;   (a) r9 != 0 => &frame[0].fp[-32] != 0
                            ;   (b) r9 != 0 => &frame[0].fp[-24] != 0

  8: r8 = *r8               ; (a) r8 == &frame[0].fp[-32]
                            ; (b) r8 == &frame[0].fp[-32]
  9: r0 = *r8               ; (a) safe
                            ; (b) unsafe

exit:
 10: exit

While processing call to foo() verifier considers the following
execution paths:

(a) 0-10
(b) 0-4,6-10
(There is also path 0-7,10 but it is not interesting for the issue at
 hand. (a) is verified first.)

Suppose that checkpoint is created at (6) when path (a) is verified,
next path (b) is verified and (6) is reached.

If states_equal() maintains separate 'idmap' for each frame the
mapping at (6) for frame[1] would be empty and
regsafe(r9)::check_ids() would add a pair 2->1 and return true,
which is an error.

If states_equal() maintains single 'idmap' for all frames the mapping
at (6) would be { 1->1, 2->2 } and regsafe(r9)::check_ids() would
return false when trying to add a pair 2->1.

This issue was suggested in the following discussion:
https://lore.kernel.org/bpf/CAEf4BzbFB5g4oUfyxk9rHy-PJSLQ3h8q9mV=rVoXfr_JVm8+1Q@mail.gmail.com/

Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20221209135733.28851-4-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-10 13:20:53 -08:00
Eduard Zingerman
7c884339bb bpf: regsafe() must not skip check_ids()
The verifier.c:regsafe() has the following shortcut:

	equal = memcmp(rold, rcur, offsetof(struct bpf_reg_state, parent)) == 0;
	...
	if (equal)
		return true;

Which is executed regardless old register type. This is incorrect for
register types that might have an ID checked by check_ids(), namely:
 - PTR_TO_MAP_KEY
 - PTR_TO_MAP_VALUE
 - PTR_TO_PACKET_META
 - PTR_TO_PACKET

The following pattern could be used to exploit this:

  0: r9 = map_lookup_elem(...)  ; Returns PTR_TO_MAP_VALUE_OR_NULL id=1.
  1: r8 = map_lookup_elem(...)  ; Returns PTR_TO_MAP_VALUE_OR_NULL id=2.
  2: r7 = ktime_get_ns()        ; Unbound SCALAR_VALUE.
  3: r6 = ktime_get_ns()        ; Unbound SCALAR_VALUE.
  4: if r6 > r7 goto +1         ; No new information about the state
                                ; is derived from this check, thus
                                ; produced verifier states differ only
                                ; in 'insn_idx'.
  5: r9 = r8                    ; Optionally make r9.id == r8.id.
  --- checkpoint ---            ; Assume is_state_visisted() creates a
                                ; checkpoint here.
  6: if r9 == 0 goto <exit>     ; Nullness info is propagated to all
                                ; registers with matching ID.
  7: r1 = *(u64 *) r8           ; Not always safe.

Verifier first visits path 1-7 where r8 is verified to be not null
at (6). Later the jump from 4 to 6 is examined. The checkpoint for (6)
looks as follows:
  R8_rD=map_value_or_null(id=2,off=0,ks=4,vs=8,imm=0)
  R9_rwD=map_value_or_null(id=2,off=0,ks=4,vs=8,imm=0)
  R10=fp0

The current state is:
  R0=... R6=... R7=... fp-8=...
  R8=map_value_or_null(id=2,off=0,ks=4,vs=8,imm=0)
  R9=map_value_or_null(id=1,off=0,ks=4,vs=8,imm=0)
  R10=fp0

Note that R8 states are byte-to-byte identical, so regsafe() would
exit early and skip call to check_ids(), thus ID mapping 2->2 will not
be added to 'idmap'. Next, states for R9 are compared: these are not
identical and check_ids() is executed, but 'idmap' is empty, so
check_ids() adds mapping 2->1 to 'idmap' and returns success.

This commit pushes the 'equal' down to register types that don't need
check_ids().

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20221209135733.28851-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-10 13:20:52 -08:00
Kumar Kartikeya Dwivedi
76d16077be bpf: Use memmove for bpf_dynptr_{read,write}
It may happen that destination buffer memory overlaps with memory dynptr
points to. Hence, we must use memmove to correctly copy from dynptr to
destination buffer, or source buffer to dynptr.

This actually isn't a problem right now, as memcpy implementation falls
back to memmove on detecting overlap and warns about it, but we
shouldn't be relying on that.

Acked-by: Joanne Koong <joannelkoong@gmail.com>
Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221207204141.308952-7-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-08 18:39:28 -08:00
Kumar Kartikeya Dwivedi
f6ee298fa1 bpf: Move PTR_TO_STACK alignment check to process_dynptr_func
After previous commit, we are minimizing helper specific assumptions
from check_func_arg_reg_off, making it generic, and offloading checks
for a specific argument type to their respective functions called after
check_func_arg_reg_off has been called.

This allows relying on a consistent set of guarantees after that call
and then relying on them in code that deals with registers for each
argument type later. This is in line with how process_spin_lock,
process_timer_func, process_kptr_func check reg->var_off to be constant.
The same reasoning is used here to move the alignment check into
process_dynptr_func. Note that it also needs to check for constant
var_off, and accumulate the constant var_off when computing the spi in
get_spi, but that fix will come in later changes.

Acked-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221207204141.308952-6-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-08 18:39:28 -08:00
Kumar Kartikeya Dwivedi
184c9bdb8f bpf: Rework check_func_arg_reg_off
While check_func_arg_reg_off is the place which performs generic checks
needed by various candidates of reg->type, there is some handling for
special cases, like ARG_PTR_TO_DYNPTR, OBJ_RELEASE, and
ARG_PTR_TO_RINGBUF_MEM.

This commit aims to streamline these special cases and instead leave
other things up to argument type specific code to handle. The function
will be restrictive by default, and cover all possible cases when
OBJ_RELEASE is set, without having to update the function again (and
missing to do that being a bug).

This is done primarily for two reasons: associating back reg->type to
its argument leaves room for the list getting out of sync when a new
reg->type is supported by an arg_type.

The other case is ARG_PTR_TO_RINGBUF_MEM. The problem there is something
we already handle, whenever a release argument is expected, it should
be passed as the pointer that was received from the acquire function.
Hence zero fixed and variable offset.

There is nothing special about ARG_PTR_TO_RINGBUF_MEM, where technically
its target register type PTR_TO_MEM | MEM_RINGBUF can already be passed
with non-zero offset to other helper functions, which makes sense.

Hence, lift the arg_type_is_release check for reg->off and cover all
possible register types, instead of duplicating the same kind of check
twice for current OBJ_RELEASE arg_types (alloc_mem and ptr_to_btf_id).

For the release argument, arg_type_is_dynptr is the special case, where
we go to actual object being freed through the dynptr, so the offset of
the pointer still needs to allow fixed and variable offset and
process_dynptr_func will verify them later for the release argument case
as well.

This is not specific to ARG_PTR_TO_DYNPTR though, we will need to make
this exception for any future object on the stack that needs to be
released. In this sense, PTR_TO_STACK as a candidate for object on stack
argument is a special case for release offset checks, and they need to
be done by the helper releasing the object on stack.

Since the check has been lifted above all register type checks, remove
the duplicated check that is being done for PTR_TO_BTF_ID.

Acked-by: Joanne Koong <joannelkoong@gmail.com>
Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221207204141.308952-5-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-08 18:39:06 -08:00
Kumar Kartikeya Dwivedi
2706053173 bpf: Rework process_dynptr_func
Recently, user ringbuf support introduced a PTR_TO_DYNPTR register type
for use in callback state, because in case of user ringbuf helpers,
there is no dynptr on the stack that is passed into the callback. To
reflect such a state, a special register type was created.

However, some checks have been bypassed incorrectly during the addition
of this feature. First, for arg_type with MEM_UNINIT flag which
initialize a dynptr, they must be rejected for such register type.
Secondly, in the future, there are plans to add dynptr helpers that
operate on the dynptr itself and may change its offset and other
properties.

In all of these cases, PTR_TO_DYNPTR shouldn't be allowed to be passed
to such helpers, however the current code simply returns 0.

The rejection for helpers that release the dynptr is already handled.

For fixing this, we take a step back and rework existing code in a way
that will allow fitting in all classes of helpers and have a coherent
model for dealing with the variety of use cases in which dynptr is used.

First, for ARG_PTR_TO_DYNPTR, it can either be set alone or together
with a DYNPTR_TYPE_* constant that denotes the only type it accepts.

Next, helpers which initialize a dynptr use MEM_UNINIT to indicate this
fact. To make the distinction clear, use MEM_RDONLY flag to indicate
that the helper only operates on the memory pointed to by the dynptr,
not the dynptr itself. In C parlance, it would be equivalent to taking
the dynptr as a point to const argument.

When either of these flags are not present, the helper is allowed to
mutate both the dynptr itself and also the memory it points to.
Currently, the read only status of the memory is not tracked in the
dynptr, but it would be trivial to add this support inside dynptr state
of the register.

With these changes and renaming PTR_TO_DYNPTR to CONST_PTR_TO_DYNPTR to
better reflect its usage, it can no longer be passed to helpers that
initialize a dynptr, i.e. bpf_dynptr_from_mem, bpf_ringbuf_reserve_dynptr.

A note to reviewers is that in code that does mark_stack_slots_dynptr,
and unmark_stack_slots_dynptr, we implicitly rely on the fact that
PTR_TO_STACK reg is the only case that can reach that code path, as one
cannot pass CONST_PTR_TO_DYNPTR to helpers that don't set MEM_RDONLY. In
both cases such helpers won't be setting that flag.

The next patch will add a couple of selftest cases to make sure this
doesn't break.

Fixes: 2057156738 ("bpf: Add bpf_user_ringbuf_drain() helper")
Acked-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221207204141.308952-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-08 18:25:31 -08:00
Kumar Kartikeya Dwivedi
ac50fe51ce bpf: Propagate errors from process_* checks in check_func_arg
Currently, we simply ignore the errors in process_spin_lock,
process_timer_func, process_kptr_func, process_dynptr_func. Instead,
bubble up the error by storing and checking err variable.

Acked-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221207204141.308952-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-08 18:25:31 -08:00
Kumar Kartikeya Dwivedi
6b75bd3d03 bpf: Refactor ARG_PTR_TO_DYNPTR checks into process_dynptr_func
ARG_PTR_TO_DYNPTR is akin to ARG_PTR_TO_TIMER, ARG_PTR_TO_KPTR, where
the underlying register type is subjected to more special checks to
determine the type of object represented by the pointer and its state
consistency.

Move dynptr checks to their own 'process_dynptr_func' function so that
is consistent and in-line with existing code. This also makes it easier
to reuse this code for kfunc handling.

Then, reuse this consolidated function in kfunc dynptr handling too.
Note that for kfuncs, the arg_type constraint of DYNPTR_TYPE_LOCAL has
been lifted.

Acked-by: David Vernet <void@manifault.com>
Acked-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221207204141.308952-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-08 18:25:31 -08:00
Hou Tao
822ed78fab bpf: Skip rcu_barrier() if rcu_trace_implies_rcu_gp() is true
If there are pending rcu callback, free_mem_alloc() will use
rcu_barrier_tasks_trace() and rcu_barrier() to wait for the pending
__free_rcu_tasks_trace() and __free_rcu() callback.

If rcu_trace_implies_rcu_gp() is true, there will be no pending
__free_rcu(), so it will be OK to skip rcu_barrier() as well.

Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20221209010947.3130477-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-08 17:50:17 -08:00
Hou Tao
0893d6007d bpf: Reuse freed element in free_by_rcu during allocation
When there are batched freeing operations on a specific CPU, part of
the freed elements ((high_watermark - lower_watermark) / 2 + 1) will be
indirectly moved into waiting_for_gp list through free_by_rcu list.
After call_rcu_in_progress becomes false again, the remaining elements
in free_by_rcu list will be moved to waiting_for_gp list by the next
invocation of free_bulk(). However if the expiration of RCU tasks trace
grace period is relatively slow, none element in free_by_rcu list will
be moved.

So instead of invoking __alloc_percpu_gfp() or kmalloc_node() to
allocate a new object, in alloc_bulk() just check whether or not there is
freed element in free_by_rcu list and reuse it if available.

Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20221209010947.3130477-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-08 17:50:17 -08:00
Yang Jihong
c2cc0ce72a bpf: Fix comment error in fixup_kfunc_call function
insn->imm for kfunc is the relative address of __bpf_call_base,
instead of __bpf_base_call, Fix the comment error.

Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Link: https://lore.kernel.org/r/20221208013724.257848-1-yangjihong1@huawei.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-08 10:05:05 -08:00
Björn Töpel
d35af0a7fe bpf: Do not zero-extend kfunc return values
In BPF all global functions, and BPF helpers return a 64-bit
value. For kfunc calls, this is not the case, and they can return
e.g. 32-bit values.

The return register R0 for kfuncs calls can therefore be marked as
subreg_def != DEF_NOT_SUBREG. In general, if a register is marked with
subreg_def != DEF_NOT_SUBREG, some archs (where bpf_jit_needs_zext()
returns true) require the verifier to insert explicit zero-extension
instructions.

For kfuncs calls, however, the caller should do sign/zero extension
for return values. In other words, the compiler is responsible to
insert proper instructions, not the verifier.

An example, provided by Yonghong Song:

$ cat t.c
extern unsigned foo(void);
unsigned bar1(void) {
     return foo();
}
unsigned bar2(void) {
     if (foo()) return 10; else return 20;
}

$ clang -target bpf -mcpu=v3 -O2 -c t.c && llvm-objdump -d t.o
t.o:    file format elf64-bpf

Disassembly of section .text:

0000000000000000 <bar1>:
	0:       85 10 00 00 ff ff ff ff call -0x1
	1:       95 00 00 00 00 00 00 00 exit

0000000000000010 <bar2>:
	2:       85 10 00 00 ff ff ff ff call -0x1
	3:       bc 01 00 00 00 00 00 00 w1 = w0
	4:       b4 00 00 00 14 00 00 00 w0 = 0x14
	5:       16 01 01 00 00 00 00 00 if w1 == 0x0 goto +0x1 <LBB1_2>
	6:       b4 00 00 00 0a 00 00 00 w0 = 0xa

0000000000000038 <LBB1_2>:
	7:       95 00 00 00 00 00 00 00 exit

If the return value of 'foo()' is used in the BPF program, the proper
zero-extension will be done.

Currently, the verifier correctly marks, say, a 32-bit return value as
subreg_def != DEF_NOT_SUBREG, but will fail performing the actual
zero-extension, due to a verifier bug in
opt_subreg_zext_lo32_rnd_hi32(). load_reg is not properly set to R0,
and the following path will be taken:

		if (WARN_ON(load_reg == -1)) {
			verbose(env, "verifier bug. zext_dst is set, but no reg is defined\n");
			return -EFAULT;
		}

A longer discussion from v1 can be found in the link below.

Correct the verifier by avoiding doing explicit zero-extension of R0
for kfunc calls. Note that R0 will still be marked as a sub-register
for return values smaller than 64-bit.

Fixes: 83a2881903 ("bpf: Account for BPF_FETCH in insn_has_def32()")
Link: https://lore.kernel.org/bpf/20221202103620.1915679-1-bjorn@kernel.org/
Suggested-by: Yonghong Song <yhs@meta.com>
Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221207103540.396496-1-bjorn@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-08 09:59:27 -08:00
David Vernet
36aa10ffd6 bpf/docs: Document struct cgroup * kfuncs
bpf_cgroup_acquire(), bpf_cgroup_release(), bpf_cgroup_kptr_get(), and
bpf_cgroup_ancestor(), are kfuncs that were recently added to
kernel/bpf/helpers.c. These are "core" kfuncs in that they're available
for use in any tracepoint or struct_ops BPF program. Though they have no
ABI stability guarantees, we should still document them. This patch adds
a struct cgroup * subsection to the Core kfuncs section which describes
each of these kfuncs.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20221207204911.873646-3-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-07 17:11:24 -08:00
David Vernet
25c5e92d19 bpf/docs: Document struct task_struct * kfuncs
bpf_task_acquire(), bpf_task_release(), and bpf_task_from_pid() are
kfuncs that were recently added to kernel/bpf/helpers.c. These are
"core" kfuncs in that they're available for use for any tracepoint or
struct_ops BPF program. Though they have no ABI stability guarantees, we
should still document them. This patch adds a new Core kfuncs section to
the BPF kfuncs doc, and adds entries for all of these task kfuncs.

Note that bpf_task_kptr_get() is not documented, as it still returns
NULL while we're working to resolve how it can use RCU to ensure struct
task_struct * lifetime.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20221207204911.873646-2-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-07 17:09:13 -08:00
Andrii Nakryiko
dcb2288b1f bpf: Remove unused insn_cnt argument from visit_[func_call_]insn()
Number of total instructions in BPF program (including subprogs) can and
is accessed from env->prog->len. visit_func_call_insn() doesn't do any
checks against insn_cnt anymore, relying on push_insn() to do this check
internally. So remove unnecessary insn_cnt input argument from
visit_func_call_insn() and visit_insn() functions.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20221207195534.2866030-1-andrii@kernel.org
2022-12-07 22:58:41 +01:00
Alexei Starovoitov
0a6ea1ce82 for-alexei-2022120701
-----BEGIN PGP SIGNATURE-----
 
 iQJSBAABCAA8FiEEoEVH9lhNrxiMPSyI7MXwXhnZSjYFAmOQpWweHGJlbmphbWlu
 LnRpc3NvaXJlc0ByZWRoYXQuY29tAAoJEOzF8F4Z2Uo23ooQAJR4JBv+WKxyDplY
 m2Kk1t156kenJNhyRojwNWlYk7S0ziClwfjnJEsiki4S0RAwHcVNuuMLjKSjcDIP
 TFrs3kFIlgLITpkPFdMIqMniq0Fynb3N5QDsaohQPQvtLeDx5ASH9D6J+20bcdky
 PE+xOo1Nkn1DpnBiGX7P6irMsqrm5cXfBES2u9c7He9VLThviP2v+TvB80gmRi7w
 zUU4Uikcr8wlt+9MZoLVoVwAOg5aZmVa/9ogNqaT+cKnW6hQ+3CymxiyiyOdRrAQ
 e521+GhQOVTiM0w5C6BwhMx+Wu8r0Qz4Vp49UWf04U/KU+M1TzqAk1z7Vvt72TCr
 965qb19TSRNTGQzebAIRd09mFb/nech54dhpyceONBGnUs9r2dDWjfDd/PA7e2WO
 FbDE0HGnz/XK7GUrk/BXWU+n9VA7itnhJzB+zr3i6IKFgwwDJ1V4e81CWdBEsp9I
 WNDC8LF2bcgHvzFVC23AkKujmbirS6K4Wq+R0f2PISQIs2FdUBl1mgjh2E47lK8E
 zCozMRf9bMya5aGkd4S4dtn0NFGByFSXod2TMgfHPvBz06t6YG00DajALzcE5l8U
 GAoP5Nz9hRSbmHJCNMqy0SN0WN9Cz+JIFx5Vlb9az3lduRRBOVptgnjx9LOjErVr
 +aWWxuQgoHZmB5Ja5WNVN1lIf39/
 =FX5W
 -----END PGP SIGNATURE-----

Merge "do not rely on ALLOW_ERROR_INJECTION for fmod_ret" into bpf-next

Merge commit 5b481acab4 ("bpf: do not rely on ALLOW_ERROR_INJECTION for fmod_ret")
from hid tree into bpf-next.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-07 13:49:21 -08:00
Benjamin Tissoires
5b481acab4 bpf: do not rely on ALLOW_ERROR_INJECTION for fmod_ret
The current way of expressing that a non-bpf kernel component is willing
to accept that bpf programs can be attached to it and that they can change
the return value is to abuse ALLOW_ERROR_INJECTION.
This is debated in the link below, and the result is that it is not a
reasonable thing to do.

Reuse the kfunc declaration structure to also tag the kernel functions
we want to be fmodret. This way we can control from any subsystem which
functions are being modified by bpf without touching the verifier.

Link: https://lore.kernel.org/all/20221121104403.1545f9b5@gandalf.local.home/
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20221206145936.922196-2-benjamin.tissoires@redhat.com
2022-12-07 15:31:08 +01:00
Andrii Nakryiko
618945fbed bpf: remove unnecessary prune and jump points
Don't mark some instructions as jump points when there are actually no
jumps and instructions are just processed sequentially. Such case is
handled naturally by precision backtracking logic without the need to
update jump history. See get_prev_insn_idx(). It goes back linearly by
one instruction, unless current top of jmp_history is pointing to
current instruction. In such case we use `st->jmp_history[cnt - 1].prev_idx`
to find instruction from which we jumped to the current instruction
non-linearly.

Also remove both jump and prune point marking for instruction right
after unconditional jumps, as program flow can get to the instruction
right after unconditional jump instruction only if there is a jump to
that instruction from somewhere else in the program. In such case we'll
mark such instruction as prune/jump point because it's a destination of
a jump.

This change has no changes in terms of number of instructions or states
processes across Cilium and selftests programs.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20221206233345.438540-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-06 19:14:38 -08:00
Andrii Nakryiko
a095f42105 bpf: mostly decouple jump history management from is_state_visited()
Jump history updating and state equivalence checks are conceptually
independent, so move push_jmp_history() out of is_state_visited(). Also
make a decision whether to perform state equivalence checks or not one
layer higher in do_check(), keeping is_state_visited() unconditionally
performing state checks.

push_jmp_history() should be performed after state checks. There is just
one small non-uniformity. When is_state_visited() finds already
validated equivalent state, it propagates precision marks to current
state's parent chain. For this to work correctly, jump history has to be
updated, so is_state_visited() is doing that internally.

But if no equivalent verified state is found, jump history has to be
updated in a newly cloned child state, so is_jmp_point()
+ push_jmp_history() is performed after is_state_visited() exited with
zero result, which means "proceed with validation".

This change has no functional changes. It's not strictly necessary, but
feels right to decouple these two processes.

Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20221206233345.438540-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-06 19:14:38 -08:00
Andrii Nakryiko
bffdeaa8a5 bpf: decouple prune and jump points
BPF verifier marks some instructions as prune points. Currently these
prune points serve two purposes.

It's a point where verifier tries to find previously verified state and
check current state's equivalence to short circuit verification for
current code path.

But also currently it's a point where jump history, used for precision
backtracking, is updated. This is done so that non-linear flow of
execution could be properly backtracked.

Such coupling is coincidental and unnecessary. Some prune points are not
part of some non-linear jump path, so don't need update of jump history.
On the other hand, not all instructions which have to be recorded in
jump history necessarily are good prune points.

This patch splits prune and jump points into independent flags.
Currently all prune points are marked as jump points to minimize amount
of changes in this patch, but next patch will perform some optimization
of prune vs jmp point placement.

No functional changes are intended.

Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20221206233345.438540-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-06 19:14:38 -08:00
Dave Marchevsky
d8939cb0a0 bpf: Loosen alloc obj test in verifier's reg_btf_record
btf->struct_meta_tab is populated by btf_parse_struct_metas in btf.c.
There, a BTF record is created for any type containing a spin_lock or
any next-gen datastructure node/head.

Currently, for non-MAP_VALUE types, reg_btf_record will only search for
a record using struct_meta_tab if the reg->type exactly matches
(PTR_TO_BTF_ID | MEM_ALLOC). This exact match is too strict: an
"allocated obj" type - returned from bpf_obj_new - might pick up other
flags while working its way through the program.

Loosen the check to be exact for base_type and just use MEM_ALLOC mask
for type_flag.

This patch is marked Fixes as the original intent of reg_btf_record was
unlikely to have been to fail finding btf_record for valid alloc obj
types with additional flags, some of which (e.g. PTR_UNTRUSTED)
are valid register type states for alloc obj independent of this series.
However, I didn't find a specific broken repro case outside of this
series' added functionality, so it's possible that nothing was
triggering this logic error before.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Fixes: 4e814da0d5 ("bpf: Allow locking bpf_spin_lock in allocated objects")
Link: https://lore.kernel.org/r/20221206231000.3180914-2-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-06 18:40:25 -08:00
David Vernet
156ed20d22 bpf: Don't use rcu_users to refcount in task kfuncs
A series of prior patches added some kfuncs that allow struct
task_struct * objects to be used as kptrs. These kfuncs leveraged the
'refcount_t rcu_users' field of the task for performing refcounting.
This field was used instead of 'refcount_t usage', as we wanted to
leverage the safety provided by RCU for ensuring a task's lifetime.

A struct task_struct is refcounted by two different refcount_t fields:

1. p->usage:     The "true" refcount field which task lifetime. The
		 task is freed as soon as this refcount drops to 0.

2. p->rcu_users: An "RCU users" refcount field which is statically
		 initialized to 2, and is co-located in a union with
		 a struct rcu_head field (p->rcu). p->rcu_users
		 essentially encapsulates a single p->usage
		 refcount, and when p->rcu_users goes to 0, an RCU
		 callback is scheduled on the struct rcu_head which
		 decrements the p->usage refcount.

Our logic was that by using p->rcu_users, we would be able to use RCU to
safely issue refcount_inc_not_zero() a task's rcu_users field to
determine if a task could still be acquired, or was exiting.
Unfortunately, this does not work due to p->rcu_users and p->rcu sharing
a union. When p->rcu_users goes to 0, an RCU callback is scheduled to
drop a single p->usage refcount, and because the fields share a union,
the refcount immediately becomes nonzero again after the callback is
scheduled.

If we were to split the fields out of the union, this wouldn't be a
problem. Doing so should also be rather non-controversial, as there are
a number of places in struct task_struct that have padding which we
could use to avoid growing the structure by splitting up the fields.

For now, so as to fix the kfuncs to be correct, this patch instead
updates bpf_task_acquire() and bpf_task_release() to use the p->usage
field for refcounting via the get_task_struct() and put_task_struct()
functions. Because we can no longer rely on RCU, the change also guts
the bpf_task_acquire_not_zero() and bpf_task_kptr_get() functions
pending a resolution on the above problem.

In addition, the task fixes the kfunc and rcu_read_lock selftests to
expect this new behavior.

Fixes: 90660309b0 ("bpf: Add kfuncs for storing struct task_struct * as a kptr")
Fixes: fca1aa7551 ("bpf: Handle MEM_RCU type properly")
Reported-by: Matus Jokay <matus.jokay@stuba.sk>
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20221206210538.597606-1-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-06 16:40:16 -08:00
Yonghong Song
2c40d97da1 bpf: Enable sleeptable support for cgrp local storage
Similar to sk/inode/task local storage, enable sleepable support for
cgrp local storage.

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221201050444.2785007-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-04 16:51:18 -08:00
Yonghong Song
c0c852dd18 bpf: Do not mark certain LSM hook arguments as trusted
Martin mentioned that the verifier cannot assume arguments from
LSM hook sk_alloc_security being trusted since after the hook
is called, the sk ref_count is set to 1. This will overwrite
the ref_count changed by the bpf program and may cause ref_count
underflow later on.

I then further checked some other hooks. For example,
for bpf_lsm_file_alloc() hook in fs/file_table.c,

        f->f_cred = get_cred(cred);
        error = security_file_alloc(f);
        if (unlikely(error)) {
                file_free_rcu(&f->f_rcuhead);
                return ERR_PTR(error);
        }

        atomic_long_set(&f->f_count, 1);

The input parameter 'f' to security_file_alloc() cannot be trusted
as well.

Specifically, I investiaged bpf_map/bpf_prog/file/sk/task alloc/free
lsm hooks. Except bpf_map_alloc and task_alloc, arguments for all other
hooks should not be considered as trusted. This may not be a complete
list, but it covers common usage for sk and task.

Fixes: 3f00c52393 ("bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs")
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221203204954.2043348-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-04 12:59:58 -08:00
Yonghong Song
fca1aa7551 bpf: Handle MEM_RCU type properly
Commit 9bb00b2895 ("bpf: Add kfunc bpf_rcu_read_lock/unlock()")
introduced MEM_RCU and bpf_rcu_read_lock/unlock() support. In that
commit, a rcu pointer is tagged with both MEM_RCU and PTR_TRUSTED
so that it can be passed into kfuncs or helpers as an argument.

Martin raised a good question in [1] such that the rcu pointer,
although being able to accessing the object, might have reference
count of 0. This might cause a problem if the rcu pointer is passed
to a kfunc which expects trusted arguments where ref count should
be greater than 0.

This patch makes the following changes related to MEM_RCU pointer:
  - MEM_RCU pointer might be NULL (PTR_MAYBE_NULL).
  - Introduce KF_RCU so MEM_RCU ptr can be acquired with
    a KF_RCU tagged kfunc which assumes ref count of rcu ptr
    could be zero.
  - For mem access 'b = ptr->a', say 'ptr' is a MEM_RCU ptr, and
    'a' is tagged with __rcu as well. Let us mark 'b' as
    MEM_RCU | PTR_MAYBE_NULL.

 [1] https://lore.kernel.org/bpf/ac70f574-4023-664e-b711-e0d3b18117fd@linux.dev/

Fixes: 9bb00b2895 ("bpf: Add kfunc bpf_rcu_read_lock/unlock()")
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221203184602.477272-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-04 12:52:40 -08:00
Dave Marchevsky
1f82dffc10 bpf: Fix release_on_unlock release logic for multiple refs
Consider a verifier state with three acquired references, all with
release_on_unlock = true:

            idx  0 1 2
  state->refs = [2 4 6]

(with 2, 4, and 6 being the ref ids).

When bpf_spin_unlock is called, process_spin_lock will loop through all
acquired_refs and, for each ref, if it's release_on_unlock, calls
release_reference on it. That function in turn calls
release_reference_state, which removes the reference from state->refs by
swapping the reference state with the last reference state in
refs array and decrements acquired_refs count.

process_spin_lock's loop logic, which is essentially:

  for (i = 0; i < state->acquired_refs; i++) {
    if (!state->refs[i].release_on_unlock)
      continue;
    release_reference(state->refs[i].id);
  }

will fail to release release_on_unlock references which are swapped from
the end. Running this logic on our example demonstrates:

  state->refs = [2 4 6] (start of idx=0 iter)
    release state->refs[0] by swapping w/ state->refs[2]

  state->refs = [6 4]   (start of idx=1)
    release state->refs[1], no need to swap as it's the last idx

  state->refs = [6]     (start of idx=2, loop terminates)

ref_id 6 should have been removed but was skipped.

Fix this by looping from back-to-front, which results in refs that are
candidates for removal being swapped with refs which have already been
examined and kept.

If we modify our initial example such that ref 6 is replaced with ref 7,
which is _not_ release_on_unlock, and loop from the back, we'd see:

  state->refs = [2 4 7] (start of idx=2)

  state->refs = [2 4 7] (start of idx=1)

  state->refs = [2 7]   (start of idx=0, refs 7 and 4 swapped)

  state->refs = [7]     (after idx=0, 7 and 2 swapped, loop terminates)

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Fixes: 534e86bc6c ("bpf: Add 'release on unlock' logic for bpf_list_push_{front,back}")
Link: https://lore.kernel.org/r/20221201183406.1203621-1-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-01 19:38:17 -08:00
Yonghong Song
3144bfa507 bpf: Fix a compilation failure with clang lto build
When building the kernel with clang lto (CONFIG_LTO_CLANG_FULL=y), the
following compilation error will appear:

  $ make LLVM=1 LLVM_IAS=1 -j
  ...
  ld.lld: error: ld-temp.o <inline asm>:26889:1: symbol 'cgroup_storage_map_btf_ids' is already defined
  cgroup_storage_map_btf_ids:;
  ^
  make[1]: *** [/.../bpf-next/scripts/Makefile.vmlinux_o:61: vmlinux.o] Error 1

In local_storage.c, we have
  BTF_ID_LIST_SINGLE(cgroup_storage_map_btf_ids, struct, bpf_local_storage_map)
Commit c4bcfb38a9 ("bpf: Implement cgroup storage available to
non-cgroup-attached bpf progs") added the above identical BTF_ID_LIST_SINGLE
definition in bpf_cgrp_storage.c. With duplicated definitions, llvm linker
complains with lto build.

Also, extracting btf_id of 'struct bpf_local_storage_map' is defined four times
for sk, inode, task and cgrp local storages. Let us define a single global one
with a different name than cgroup_storage_map_btf_ids, which also fixed
the lto compilation error.

Fixes: c4bcfb38a9 ("bpf: Implement cgroup storage available to non-cgroup-attached bpf progs")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20221130052147.1591625-1-yhs@fb.com
2022-11-30 17:13:25 -08:00
Alexei Starovoitov
c67cae551f bpf: Tighten ptr_to_btf_id checks.
The networking programs typically don't require CAP_PERFMON, but through kfuncs
like bpf_cast_to_kern_ctx() they can access memory through PTR_TO_BTF_ID. In
such case enforce CAP_PERFMON.
Also make sure that only GPL programs can access kernel data structures.
All kfuncs require GPL already.

Also remove allow_ptr_to_map_access. It's the same as allow_ptr_leaks and
different name for the same check only causes confusion.

Fixes: fd264ca020 ("bpf: Add a kfunc to type cast from bpf uapi ctx to kernel ctx")
Fixes: 50c6b8a9ae ("selftests/bpf: Add a test for btf_type_tag "percpu"")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20221125220617.26846-1-alexei.starovoitov@gmail.com
2022-11-30 15:33:48 -08:00
Jakub Kicinski
f2bb566f5c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
tools/lib/bpf/ringbuf.c
  927cbb478a ("libbpf: Handle size overflow for ringbuf mmap")
  b486d19a0a ("libbpf: checkpatch: Fixed code alignments in ringbuf.c")
https://lore.kernel.org/all/20221121122707.44d1446a@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-29 13:04:52 -08:00
Jakub Kicinski
d6dc62fca6 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY4AC5QAKCRDbK58LschI
 g1e0AQCfAqduTy7mYd02jDNCV0wLphNp9FbPiP9OrQT37ABpKAEA1ulj1X59bX3d
 HnZdDKuatcPZT9MV5hDLM7MFJ9GjOA4=
 =fNmM
 -----END PGP SIGNATURE-----

Daniel Borkmann says:

====================
bpf-next 2022-11-25

We've added 101 non-merge commits during the last 11 day(s) which contain
a total of 109 files changed, 8827 insertions(+), 1129 deletions(-).

The main changes are:

1) Support for user defined BPF objects: the use case is to allocate own
   objects, build own object hierarchies and use the building blocks to
   build own data structures flexibly, for example, linked lists in BPF,
   from Kumar Kartikeya Dwivedi.

2) Add bpf_rcu_read_{,un}lock() support for sleepable programs,
   from Yonghong Song.

3) Add support storing struct task_struct objects as kptrs in maps,
   from David Vernet.

4) Batch of BPF map documentation improvements, from Maryam Tahhan
   and Donald Hunter.

5) Improve BPF verifier to propagate nullness information for branches
   of register to register comparisons, from Eduard Zingerman.

6) Fix cgroup BPF iter infra to hold reference on the start cgroup,
   from Hou Tao.

7) Fix BPF verifier to not mark fentry/fexit program arguments as trusted
   given it is not the case for them, from Alexei Starovoitov.

8) Improve BPF verifier's realloc handling to better play along with dynamic
   runtime analysis tools like KASAN and friends, from Kees Cook.

9) Remove legacy libbpf mode support from bpftool,
   from Sahid Orentino Ferdjaoui.

10) Rework zero-len skb redirection checks to avoid potentially breaking
    existing BPF test infra users, from Stanislav Fomichev.

11) Two small refactorings which are independent and have been split out
    of the XDP queueing RFC series, from Toke Høiland-Jørgensen.

12) Fix a memory leak in LSM cgroup BPF selftest, from Wang Yufen.

13) Documentation on how to run BPF CI without patch submission,
    from Daniel Müller.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================

Link: https://lore.kernel.org/r/20221125012450.441-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-28 19:42:17 -08:00
Alexei Starovoitov
c6b0337f01 bpf: Don't mark arguments to fentry/fexit programs as trusted.
The PTR_TRUSTED flag should only be applied to pointers where the verifier can
guarantee that such pointers are valid.
The fentry/fexit/fmod_ret programs are not in this category.
Only arguments of SEC("tp_btf") and SEC("iter") programs are trusted
(which have BPF_TRACE_RAW_TP and BPF_TRACE_ITER attach_type correspondingly)

This bug was masked because convert_ctx_accesses() was converting trusted
loads into BPF_PROBE_MEM loads. Fix it as well.
The loads from trusted pointers don't need exception handling.

Fixes: 3f00c52393 ("bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20221124215314.55890-1-alexei.starovoitov@gmail.com
2022-11-24 23:47:09 +01:00
Yonghong Song
9bb00b2895 bpf: Add kfunc bpf_rcu_read_lock/unlock()
Add two kfunc's bpf_rcu_read_lock() and bpf_rcu_read_unlock(). These two kfunc's
can be used for all program types. The following is an example about how
rcu pointer are used w.r.t. bpf_rcu_read_lock()/bpf_rcu_read_unlock().

  struct task_struct {
    ...
    struct task_struct              *last_wakee;
    struct task_struct __rcu        *real_parent;
    ...
  };

Let us say prog does 'task = bpf_get_current_task_btf()' to get a
'task' pointer. The basic rules are:
  - 'real_parent = task->real_parent' should be inside bpf_rcu_read_lock
    region. This is to simulate rcu_dereference() operation. The
    'real_parent' is marked as MEM_RCU only if (1). task->real_parent is
    inside bpf_rcu_read_lock region, and (2). task is a trusted ptr. So
    MEM_RCU marked ptr can be 'trusted' inside the bpf_rcu_read_lock region.
  - 'last_wakee = real_parent->last_wakee' should be inside bpf_rcu_read_lock
    region since it tries to access rcu protected memory.
  - the ptr 'last_wakee' will be marked as PTR_UNTRUSTED since in general
    it is not clear whether the object pointed by 'last_wakee' is valid or
    not even inside bpf_rcu_read_lock region.

The verifier will reset all rcu pointer register states to untrusted
at bpf_rcu_read_unlock() kfunc call site, so any such rcu pointer
won't be trusted any more outside the bpf_rcu_read_lock() region.

The current implementation does not support nested rcu read lock
region in the prog.

Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221124053217.2373910-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-24 12:54:13 -08:00
Yonghong Song
01685c5bdd bpf: Introduce might_sleep field in bpf_func_proto
Introduce bpf_func_proto->might_sleep to indicate a particular helper
might sleep. This will make later check whether a helper might be
sleepable or not easier.

Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221124053211.2373553-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-24 12:27:13 -08:00
David Vernet
3f0e6f2b41 bpf: Add bpf_task_from_pid() kfunc
Callers can currently store tasks as kptrs using bpf_task_acquire(),
bpf_task_kptr_get(), and bpf_task_release(). These are useful if a
caller already has a struct task_struct *, but there may be some callers
who only have a pid, and want to look up the associated struct
task_struct * from that to e.g. find task->comm.

This patch therefore adds a new bpf_task_from_pid() kfunc which allows
BPF programs to get a struct task_struct * kptr from a pid.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20221122145300.251210-2-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-23 17:45:23 -08:00
Stanislav Fomichev
5bad3587b7 bpf: Unify and simplify btf_func_proto_check error handling
Replace 'err = x; break;' with 'return x;'.

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20221124002838.2700179-1-sdf@google.com
2022-11-24 01:43:22 +01:00
Stanislav Fomichev
f17472d459 bpf: Prevent decl_tag from being referenced in func_proto arg
Syzkaller managed to hit another decl_tag issue:

  btf_func_proto_check kernel/bpf/btf.c:4506 [inline]
  btf_check_all_types kernel/bpf/btf.c:4734 [inline]
  btf_parse_type_sec+0x1175/0x1980 kernel/bpf/btf.c:4763
  btf_parse kernel/bpf/btf.c:5042 [inline]
  btf_new_fd+0x65a/0xb00 kernel/bpf/btf.c:6709
  bpf_btf_load+0x6f/0x90 kernel/bpf/syscall.c:4342
  __sys_bpf+0x50a/0x6c0 kernel/bpf/syscall.c:5034
  __do_sys_bpf kernel/bpf/syscall.c:5093 [inline]
  __se_sys_bpf kernel/bpf/syscall.c:5091 [inline]
  __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:5091
  do_syscall_64+0x54/0x70 arch/x86/entry/common.c:48

This seems similar to commit ea68376c8b ("bpf: prevent decl_tag from being
referenced in func_proto") but for the argument.

Reported-by: syzbot+8dd0551dda6020944c5d@syzkaller.appspotmail.com
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20221123035422.872531-2-sdf@google.com
2022-11-24 00:48:50 +01:00
David Vernet
2fcc6081a7 bpf: Don't use idx variable when registering kfunc dtors
In commit fda01efc61 ("bpf: Enable cgroups to be used as kptrs"), I
added an 'int idx' variable to kfunc_init() which was meant to
dynamically set the index of the btf id entries of the
'generic_dtor_ids' array. This was done to make the code slightly less
brittle as the struct cgroup * kptr kfuncs such as bpf_cgroup_aquire()
are compiled out if CONFIG_CGROUPS is not defined. This, however, causes
an lkp build warning:

>> kernel/bpf/helpers.c:2005:40: warning: multiple unsequenced
   modifications to 'idx' [-Wunsequenced]
	.btf_id       = generic_dtor_ids[idx++],

Fix the warning by just hard-coding the indices.

Fixes: fda01efc61 ("bpf: Enable cgroups to be used as kptrs")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: David Vernet <void@manifault.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221123135253.637525-1-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-23 11:58:34 -08:00
David Vernet
5ca7867078 bpf: Add bpf_cgroup_ancestor() kfunc
struct cgroup * objects have a variably sized struct cgroup *ancestors[]
field which stores pointers to their ancestor cgroups. If using a cgroup
as a kptr, it can be useful to access these ancestors, but doing so
requires variable offset accesses for PTR_TO_BTF_ID, which is currently
unsupported.

This is a very useful field to access for cgroup kptrs, as programs may
wish to walk their ancestor cgroups when determining e.g. their
proportional cpu.weight. So as to enable this functionality with cgroup
kptrs before var_off is supported for PTR_TO_BTF_ID, this patch adds a
bpf_cgroup_ancestor() kfunc which accesses the cgroup node on behalf of
the caller, and acquires a reference on it. Once var_off is supported
for PTR_TO_BTF_ID, and fields inside a struct can be marked as trusted
so they retain the PTR_TRUSTED modifier when walked, this can be
removed.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20221122055458.173143-4-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-22 14:45:41 -08:00
David Vernet
fda01efc61 bpf: Enable cgroups to be used as kptrs
Now that tasks can be used as kfuncs, and the PTR_TRUSTED flag is
available for us to easily add basic acquire / get / release kfuncs, we
can do the same for cgroups. This patch set adds the following kfuncs
which enable using cgroups as kptrs:

struct cgroup *bpf_cgroup_acquire(struct cgroup *cgrp);
struct cgroup *bpf_cgroup_kptr_get(struct cgroup **cgrpp);
void bpf_cgroup_release(struct cgroup *cgrp);

A follow-on patch will add a selftest suite which validates these
kfuncs.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20221122055458.173143-2-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-22 14:45:41 -08:00
Xu Kuohai
836e49e103 bpf: Do not copy spin lock field from user in bpf_selem_alloc
bpf_selem_alloc function is used by inode_storage, sk_storage and
task_storage maps to set map value, for these map types, there may
be a spin lock in the map value, so if we use memcpy to copy the whole
map value from user, the spin lock field may be initialized incorrectly.

Since the spin lock field is zeroed by kzalloc, call copy_map_value
instead of memcpy to skip copying the spin lock field to fix it.

Fixes: 6ac99e8f23 ("bpf: Introduce bpf sk local storage")
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20221114134720.1057939-2-xukuohai@huawei.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-21 11:45:37 -08:00
Hou Tao
1a5160d4d8 bpf: Pin the start cgroup in cgroup_iter_seq_init()
bpf_iter_attach_cgroup() has already acquired an extra reference for the
start cgroup, but the reference may be released if the iterator link fd
is closed after the creation of iterator fd, and it may lead to
user-after-free problem when reading the iterator fd.

An alternative fix is pinning iterator link when opening iterator,
but it will make iterator link being still visible after the close of
iterator link fd and the behavior is different with other link types, so
just fixing it by acquiring another reference for the start cgroup.

Fixes: d4ccaf58a8 ("bpf: Introduce cgroup iter")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20221121073440.1828292-2-houtao@huaweicloud.com
2022-11-21 17:40:42 +01:00
Kees Cook
ceb35b666d bpf/verifier: Use kmalloc_size_roundup() to match ksize() usage
Most allocation sites in the kernel want an explicitly sized allocation
(and not "more"), and that dynamic runtime analysis tools (e.g. KASAN,
UBSAN_BOUNDS, FORTIFY_SOURCE, etc) are looking for precise bounds checking
(i.e. not something that is rounded up). A tiny handful of allocations
were doing an implicit alloc/realloc loop that actually depended on
ksize(), and didn't actually always call realloc. This has created a
long series of bugs and problems over many years related to the runtime
bounds checking, so these callers are finally being adjusted to _not_
depend on the ksize() side-effect, by doing one of several things:

- tracking the allocation size precisely and just never calling ksize()
  at all [1].

- always calling realloc and not using ksize() at all. (This solution
  ends up actually be a subset of the next solution.)

- using kmalloc_size_roundup() to explicitly round up the desired
  allocation size immediately [2].

The bpf/verifier case is this another of this latter case, and is the
last outstanding case to be fixed in the kernel.

Because some of the dynamic bounds checking depends on the size being an
_argument_ to an allocator function (i.e. see the __alloc_size attribute),
the ksize() users are rare, and it could waste local variables, it
was been deemed better to explicitly separate the rounding up from the
allocation itself [3].

Round up allocations with kmalloc_size_roundup() so that the verifier's
use of ksize() is always accurate.

[1] e.g.:
    https://git.kernel.org/linus/712f210a457d
    https://git.kernel.org/linus/72c08d9f4c72

[2] e.g.:
    https://git.kernel.org/netdev/net-next/c/12d6c1d3a2ad
    https://git.kernel.org/netdev/net-next/c/ab3f7828c979
    https://git.kernel.org/netdev/net-next/c/d6dd508080a3

[3] https://lore.kernel.org/lkml/0ea1fc165a6c6117f982f4f135093e69cb884930.camel@redhat.com/

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20221118183409.give.387-kees@kernel.org
2022-11-21 15:07:04 +01:00
Yonghong Song
a35b9af4ec bpf: Add a kfunc for generic type cast
Implement bpf_rdonly_cast() which tries to cast the object
to a specified type. This tries to support use case like below:
  #define skb_shinfo(SKB) ((struct skb_shared_info *)(skb_end_pointer(SKB)))
where skb_end_pointer(SKB) is a 'unsigned char *' and needs to
be casted to 'struct skb_shared_info *'.

The signature of bpf_rdonly_cast() looks like
   void *bpf_rdonly_cast(void *obj, __u32 btf_id)
The function returns the same 'obj' but with PTR_TO_BTF_ID with
btf_id. The verifier will ensure btf_id being a struct type.

Since the supported type cast may not reflect what the 'obj'
represents, the returned btf_id is marked as PTR_UNTRUSTED, so
the return value and subsequent pointer chasing cannot be
used as helper/kfunc arguments.

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221120195437.3114585-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-20 15:45:26 -08:00
Yonghong Song
fd264ca020 bpf: Add a kfunc to type cast from bpf uapi ctx to kernel ctx
Implement bpf_cast_to_kern_ctx() kfunc which does a type cast
of a uapi ctx object to the corresponding kernel ctx. Previously
if users want to access some data available in kctx but not
in uapi ctx, bpf_probe_read_kernel() helper is needed.
The introduction of bpf_cast_to_kern_ctx() allows direct
memory access which makes code simpler and easier to understand.

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221120195432.3113982-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-20 15:43:37 -08:00
Yonghong Song
cfe1456440 bpf: Add support for kfunc set with common btf_ids
Later on, we will introduce kfuncs bpf_cast_to_kern_ctx() and
bpf_rdonly_cast() which apply to all program types. Currently kfunc set
only supports individual prog types. This patch added support for kfunc
applying to all program types.

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221120195426.3113828-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-20 15:43:37 -08:00
Kumar Kartikeya Dwivedi
e181d3f143 bpf: Disallow bpf_obj_new_impl call when bpf_mem_alloc_init fails
In the unlikely event that bpf_global_ma is not correctly initialized,
instead of checking the boolean everytime bpf_obj_new_impl is called,
simply check it while loading the program and return an error if
bpf_global_ma_set is false.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221120212610.2361700-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-20 15:38:28 -08:00
David Vernet
90660309b0 bpf: Add kfuncs for storing struct task_struct * as a kptr
Now that BPF supports adding new kernel functions with kfuncs, and
storing kernel objects in maps with kptrs, we can add a set of kfuncs
which allow struct task_struct objects to be stored in maps as
referenced kptrs. The possible use cases for doing this are plentiful.
During tracing, for example, it would be useful to be able to collect
some tasks that performed a certain operation, and then periodically
summarize who they are, which cgroup they're in, how much CPU time
they've utilized, etc.

In order to enable this, this patch adds three new kfuncs:

struct task_struct *bpf_task_acquire(struct task_struct *p);
struct task_struct *bpf_task_kptr_get(struct task_struct **pp);
void bpf_task_release(struct task_struct *p);

A follow-on patch will add selftests validating these kfuncs.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20221120051004.3605026-4-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-20 09:16:21 -08:00
David Vernet
3f00c52393 bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs
Kfuncs currently support specifying the KF_TRUSTED_ARGS flag to signal
to the verifier that it should enforce that a BPF program passes it a
"safe", trusted pointer. Currently, "safe" means that the pointer is
either PTR_TO_CTX, or is refcounted. There may be cases, however, where
the kernel passes a BPF program a safe / trusted pointer to an object
that the BPF program wishes to use as a kptr, but because the object
does not yet have a ref_obj_id from the perspective of the verifier, the
program would be unable to pass it to a KF_ACQUIRE | KF_TRUSTED_ARGS
kfunc.

The solution is to expand the set of pointers that are considered
trusted according to KF_TRUSTED_ARGS, so that programs can invoke kfuncs
with these pointers without getting rejected by the verifier.

There is already a PTR_UNTRUSTED flag that is set in some scenarios,
such as when a BPF program reads a kptr directly from a map
without performing a bpf_kptr_xchg() call. These pointers of course can
and should be rejected by the verifier. Unfortunately, however,
PTR_UNTRUSTED does not cover all the cases for safety that need to
be addressed to adequately protect kfuncs. Specifically, pointers
obtained by a BPF program "walking" a struct are _not_ considered
PTR_UNTRUSTED according to BPF. For example, say that we were to add a
kfunc called bpf_task_acquire(), with KF_ACQUIRE | KF_TRUSTED_ARGS, to
acquire a struct task_struct *. If we only used PTR_UNTRUSTED to signal
that a task was unsafe to pass to a kfunc, the verifier would mistakenly
allow the following unsafe BPF program to be loaded:

SEC("tp_btf/task_newtask")
int BPF_PROG(unsafe_acquire_task,
             struct task_struct *task,
             u64 clone_flags)
{
        struct task_struct *acquired, *nested;

        nested = task->last_wakee;

        /* Would not be rejected by the verifier. */
        acquired = bpf_task_acquire(nested);
        if (!acquired)
                return 0;

        bpf_task_release(acquired);
        return 0;
}

To address this, this patch defines a new type flag called PTR_TRUSTED
which tracks whether a PTR_TO_BTF_ID pointer is safe to pass to a
KF_TRUSTED_ARGS kfunc or a BPF helper function. PTR_TRUSTED pointers are
passed directly from the kernel as a tracepoint or struct_ops callback
argument. Any nested pointer that is obtained from walking a PTR_TRUSTED
pointer is no longer PTR_TRUSTED. From the example above, the struct
task_struct *task argument is PTR_TRUSTED, but the 'nested' pointer
obtained from 'task->last_wakee' is not PTR_TRUSTED.

A subsequent patch will add kfuncs for storing a task kfunc as a kptr,
and then another patch will add selftests to validate.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20221120051004.3605026-3-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-20 09:16:21 -08:00
David Vernet
ef66c5475d bpf: Allow multiple modifiers in reg_type_str() prefix
reg_type_str() in the verifier currently only allows a single register
type modifier to be present in the 'prefix' string which is eventually
stored in the env type_str_buf. This currently works fine because there
are no overlapping type modifiers, but once PTR_TRUSTED is added, that
will no longer be the case. This patch updates reg_type_str() to support
having multiple modifiers in the prefix string, and updates the size of
type_str_buf to be 128 bytes.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20221120051004.3605026-2-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-20 09:16:21 -08:00
Kumar Kartikeya Dwivedi
c22dfdd215 bpf: Add comments for map BTF matching requirement for bpf_list_head
The old behavior of bpf_map_meta_equal was that it compared timer_off
to be equal (but not spin_lock_off, because that was not allowed), and
did memcmp of kptr_off_tab.

Now, we memcmp the btf_record of two bpf_map structs, which has all
fields.

We preserve backwards compat as we kzalloc the array, so if only spin
lock and timer exist in map, we only compare offset while the rest of
unused members in the btf_field struct are zeroed out.

In case of kptr, btf and everything else is of vmlinux or module, so as
long type is same it will match, since kernel btf, module, dtor pointer
will be same across maps.

Now with list_head in the mix, things are a bit complicated. We
implicitly add a requirement that both BTFs are same, because struct
btf_field_list_head has btf and value_rec members.

We obviously shouldn't force BTFs to be equal by default, as that breaks
backwards compatibility.

Currently it is only implicitly required due to list_head matching
struct btf and value_rec member. value_rec points back into a btf_record
stashed in the map BTF (btf member of btf_field_list_head). So that
pointer and btf member has to match exactly.

Document all these subtle details so that things don't break in the
future when touching this code.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-19-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:22:14 -08:00
Kumar Kartikeya Dwivedi
534e86bc6c bpf: Add 'release on unlock' logic for bpf_list_push_{front,back}
This commit implements the delayed release logic for bpf_list_push_front
and bpf_list_push_back.

Once a node has been added to the list, it's pointer changes to
PTR_UNTRUSTED. However, it is only released once the lock protecting the
list is unlocked. For such PTR_TO_BTF_ID | MEM_ALLOC with PTR_UNTRUSTED
set but an active ref_obj_id, it is still permitted to read them as long
as the lock is held. Writing to them is not allowed.

This allows having read access to push items we no longer own until we
release the lock guarding the list, allowing a little more flexibility
when working with these APIs.

Note that enabling write support has fairly tricky interactions with
what happens inside the critical section. Just as an example, currently,
bpf_obj_drop is not permitted, but if it were, being able to write to
the PTR_UNTRUSTED pointer while the object gets released back to the
memory allocator would violate safety properties we wish to guarantee
(i.e. not crashing the kernel). The memory could be reused for a
different type in the BPF program or even in the kernel as it gets
eventually kfree'd.

Not enabling bpf_obj_drop inside the critical section would appear to
prevent all of the above, but that is more of an artifical limitation
right now. Since the write support is tangled with how we handle
potential aliasing of nodes inside the critical section that may or may
not be part of the list anymore, it has been deferred to a future patch.

Acked-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-18-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:22:14 -08:00
Kumar Kartikeya Dwivedi
8cab76ec63 bpf: Introduce single ownership BPF linked list API
Add a linked list API for use in BPF programs, where it expects
protection from the bpf_spin_lock in the same allocation as the
bpf_list_head. For now, only one bpf_spin_lock can be present hence that
is assumed to be the one protecting the bpf_list_head.

The following functions are added to kick things off:

// Add node to beginning of list
void bpf_list_push_front(struct bpf_list_head *head, struct bpf_list_node *node);

// Add node to end of list
void bpf_list_push_back(struct bpf_list_head *head, struct bpf_list_node *node);

// Remove node at beginning of list and return it
struct bpf_list_node *bpf_list_pop_front(struct bpf_list_head *head);

// Remove node at end of list and return it
struct bpf_list_node *bpf_list_pop_back(struct bpf_list_head *head);

The lock protecting the bpf_list_head needs to be taken for all
operations. The verifier ensures that the lock that needs to be taken is
always held, and only the correct lock is taken for these operations.
These checks are made statically by relying on the reg->id preserved for
registers pointing into regions having both bpf_spin_lock and the
objects protected by it. The comment over check_reg_allocation_locked in
this change describes the logic in detail.

Note that bpf_list_push_front and bpf_list_push_back are meant to
consume the object containing the node in the 1st argument, however that
specific mechanism is intended to not release the ref_obj_id directly
until the bpf_spin_unlock is called. In this commit, nothing is done,
but the next commit will be introducing logic to handle this case, so it
has been left as is for now.

bpf_list_pop_front and bpf_list_pop_back delete the first or last item
of the list respectively, and return pointer to the element at the
list_node offset. The user can then use container_of style macro to get
the actual entry type. The verifier however statically knows the actual
type, so the safety properties are still preserved.

With these additions, programs can now manage their own linked lists and
store their objects in them.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-17-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:22:14 -08:00
Kumar Kartikeya Dwivedi
df57f38a0d bpf: Permit NULL checking pointer with non-zero fixed offset
Pointer increment on seeing PTR_MAYBE_NULL is already protected against,
hence make an exception for PTR_TO_BTF_ID | MEM_ALLOC while still
keeping the warning for other unintended cases that might creep in.

bpf_list_pop_{front,_back} helpers planned to be introduced in next
commit will return a MEM_ALLOC register with incremented offset pointing
to bpf_list_node field. The user is supposed to then obtain the pointer
to the entry using container_of after NULL checking it. The current
restrictions trigger a warning when doing the NULL checking. Revisiting
the reason, it is meant as an assertion which seems to actually work and
catch the bad case.

Hence, under no other circumstances can reg->off be non-zero for a
register that has the PTR_MAYBE_NULL type flag set.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-16-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:22:14 -08:00
Kumar Kartikeya Dwivedi
ac9f06050a bpf: Introduce bpf_obj_drop
Introduce bpf_obj_drop, which is the kfunc used to free allocated
objects (allocated using bpf_obj_new). Pairing with bpf_obj_new, it
implicitly destructs the fields part of object automatically without
user intervention.

Just like the previous patch, btf_struct_meta that is needed to free up
the special fields is passed as a hidden argument to the kfunc.

For the user, a convenience macro hides over the kernel side kfunc which
is named bpf_obj_drop_impl.

Continuing the previous example:

void prog(void) {
	struct foo *f;

	f = bpf_obj_new(typeof(*f));
	if (!f)
		return;
	bpf_obj_drop(f);
}

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-15-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:22:14 -08:00
Kumar Kartikeya Dwivedi
958cf2e273 bpf: Introduce bpf_obj_new
Introduce type safe memory allocator bpf_obj_new for BPF programs. The
kernel side kfunc is named bpf_obj_new_impl, as passing hidden arguments
to kfuncs still requires having them in prototype, unlike BPF helpers
which always take 5 arguments and have them checked using bpf_func_proto
in verifier, ignoring unset argument types.

Introduce __ign suffix to ignore a specific kfunc argument during type
checks, then use this to introduce support for passing type metadata to
the bpf_obj_new_impl kfunc.

The user passes BTF ID of the type it wants to allocates in program BTF,
the verifier then rewrites the first argument as the size of this type,
after performing some sanity checks (to ensure it exists and it is a
struct type).

The second argument is also fixed up and passed by the verifier. This is
the btf_struct_meta for the type being allocated. It would be needed
mostly for the offset array which is required for zero initializing
special fields while leaving the rest of storage in unitialized state.

It would also be needed in the next patch to perform proper destruction
of the object's special fields.

Under the hood, bpf_obj_new will call bpf_mem_alloc and bpf_mem_free,
using the any context BPF memory allocator introduced recently. To this
end, a global instance of the BPF memory allocator is initialized on
boot to be used for this purpose. This 'bpf_global_ma' serves all
allocations for bpf_obj_new. In the future, bpf_obj_new variants will
allow specifying a custom allocator.

Note that now that bpf_obj_new can be used to allocate objects that can
be linked to BPF linked list (when future linked list helpers are
available), we need to also free the elements using bpf_mem_free.
However, since the draining of elements is done outside the
bpf_spin_lock, we need to do migrate_disable around the call since
bpf_list_head_free can be called from map free path where migration is
enabled. Otherwise, when called from BPF programs migration is already
disabled.

A convenience macro is included in the bpf_experimental.h header to hide
over the ugly details of the implementation, leading to user code
looking similar to a language level extension which allocates and
constructs fields of a user type.

struct bar {
	struct bpf_list_node node;
};

struct foo {
	struct bpf_spin_lock lock;
	struct bpf_list_head head __contains(bar, node);
};

void prog(void) {
	struct foo *f;

	f = bpf_obj_new(typeof(*f));
	if (!f)
		return;
	...
}

A key piece of this story is still missing, i.e. the free function,
which will come in the next patch.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-14-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:22:14 -08:00
Kumar Kartikeya Dwivedi
a50388dbb3 bpf: Support constant scalar arguments for kfuncs
Allow passing known constant scalars as arguments to kfuncs that do not
represent a size parameter. We use mark_chain_precision for the constant
scalar argument to mark it precise. This makes the search pruning
optimization of verifier more conservative for such kfunc calls, and
each non-distinct argument is considered unequivalent.

We will use this support to then expose a bpf_obj_new function where it
takes the local type ID of a type in program BTF, and returns a
PTR_TO_BTF_ID | MEM_ALLOC to the local type, and allows programs to
allocate their own objects.

Each type ID resolves to a distinct type with a possibly distinct size,
hence the type ID constant matters in terms of program safety and its
precision needs to be checked between old and cur states inside regsafe.
The use of mark_chain_precision enables this.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-13-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:22:13 -08:00
Kumar Kartikeya Dwivedi
00b85860fe bpf: Rewrite kfunc argument handling
As we continue to add more features, argument types, kfunc flags, and
different extensions to kfuncs, the code to verify the correctness of
the kfunc prototype wrt the passed in registers has become ad-hoc and
ugly to read. To make life easier, and make a very clear split between
different stages of argument processing, move all the code into
verifier.c and refactor into easier to read helpers and functions.

This also makes sharing code within the verifier easier with kfunc
argument processing. This will be more and more useful in later patches
as we are now moving to implement very core BPF helpers as kfuncs, to
keep them experimental before baking into UAPI.

Remove all kfunc related bits now from btf_check_func_arg_match, as
users have been converted away to refactored kfunc argument handling.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-12-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:16:49 -08:00
Kumar Kartikeya Dwivedi
b7ff97925b bpf: Allow locking bpf_spin_lock in inner map values
There is no need to restrict users from locking bpf_spin_lock in map
values of inner maps. Each inner map lookup gets a unique reg->id
assigned to the returned PTR_TO_MAP_VALUE which will be preserved after
the NULL check. Distinct lookups into different inner map get unique
IDs, and distinct lookups into same inner map also get unique IDs.

Hence, lift the restriction by removing the check return -ENOTSUPP in
map_in_map.c. Later commits will add comprehensive test cases to ensure
that invalid cases are rejected.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-11-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:11:32 -08:00
Kumar Kartikeya Dwivedi
d0d78c1df9 bpf: Allow locking bpf_spin_lock global variables
Global variables reside in maps accessible using direct_value_addr
callbacks, so giving each load instruction's rewrite a unique reg->id
disallows us from holding locks which are global.

The reason for preserving reg->id as a unique value for registers that
may point to spin lock is that two separate lookups are treated as two
separate memory regions, and any possible aliasing is ignored for the
purposes of spin lock correctness.

This is not great especially for the global variable case, which are
served from maps that have max_entries == 1, i.e. they always lead to
map values pointing into the same map value.

So refactor the active_spin_lock into a 'active_lock' structure which
represents the lock identity, and instead of the reg->id, remember two
fields, a pointer and the reg->id. The pointer will store reg->map_ptr
or reg->btf. It's only necessary to distinguish for the id == 0 case of
global variables, but always setting the pointer to a non-NULL value and
using the pointer to check whether the lock is held simplifies code in
the verifier.

This is generic enough to allow it for global variables, map lookups,
and allocated objects at the same time.

Note that while whether a lock is held can be answered by just comparing
active_lock.ptr to NULL, to determine whether the register is pointing
to the same held lock requires comparing _both_ ptr and id.

Finally, as a result of this refactoring, pseudo load instructions are
not given a unique reg->id, as they are doing lookup for the same map
value (max_entries is never greater than 1).

Essentially, we consider that the tuple of (ptr, id) will always be
unique for any kind of argument to bpf_spin_{lock,unlock}.

Note that this can be extended in the future to also remember offset
used for locking, so that we can introduce multiple bpf_spin_lock fields
in the same allocation.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-10-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:11:32 -08:00
Kumar Kartikeya Dwivedi
4e814da0d5 bpf: Allow locking bpf_spin_lock in allocated objects
Allow locking a bpf_spin_lock in an allocated object, in addition to
already supported map value pointers. The handling is similar to that of
map values, by just preserving the reg->id of PTR_TO_BTF_ID | MEM_ALLOC
as well, and adjusting process_spin_lock to work with them and remember
the id in verifier state.

Refactor the existing process_spin_lock to work with PTR_TO_BTF_ID |
MEM_ALLOC in addition to PTR_TO_MAP_VALUE. We need to update the
reg_may_point_to_spin_lock which is used in mark_ptr_or_null_reg to
preserve reg->id, that will be used in env->cur_state->active_spin_lock
to remember the currently held spin lock.

Also update the comment describing bpf_spin_lock implementation details
to also talk about PTR_TO_BTF_ID | MEM_ALLOC type.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-9-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:11:32 -08:00
Kumar Kartikeya Dwivedi
865ce09a49 bpf: Verify ownership relationships for user BTF types
Ensure that there can be no ownership cycles among different types by
way of having owning objects that can hold some other type as their
element. For instance, a map value can only hold allocated objects, but
these are allowed to have another bpf_list_head. To prevent unbounded
recursion while freeing resources, elements of bpf_list_head in local
kptrs can never have a bpf_list_head which are part of list in a map
value. Later patches will verify this by having dedicated BTF selftests.

Also, to make runtime destruction easier, once btf_struct_metas is fully
populated, we can stash the metadata of the value type directly in the
metadata of the list_head fields, as that allows easier access to the
value type's layout to destruct it at runtime from the btf_field entry
of the list head itself.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-8-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:11:32 -08:00
Kumar Kartikeya Dwivedi
8ffa5cc142 bpf: Recognize lock and list fields in allocated objects
Allow specifying bpf_spin_lock, bpf_list_head, bpf_list_node fields in a
allocated object.

Also update btf_struct_access to reject direct access to these special
fields.

A bpf_list_head allows implementing map-in-map style use cases, where an
allocated object with bpf_list_head is linked into a list in a map
value. This would require embedding a bpf_list_node, support for which
is also included. The bpf_spin_lock is used to protect the bpf_list_head
and other data.

While we strictly don't require to hold a bpf_spin_lock while touching
the bpf_list_head in such objects, as when have access to it, we have
complete ownership of the object, the locking constraint is still kept
and may be conditionally lifted in the future.

Note that the specification of such types can be done just like map
values, e.g.:

struct bar {
	struct bpf_list_node node;
};

struct foo {
	struct bpf_spin_lock lock;
	struct bpf_list_head head __contains(bar, node);
	struct bpf_list_node node;
};

struct map_value {
	struct bpf_spin_lock lock;
	struct bpf_list_head head __contains(foo, node);
};

To recognize such types in user BTF, we build a btf_struct_metas array
of metadata items corresponding to each BTF ID. This is done once during
the btf_parse stage to avoid having to do it each time during the
verification process's requirement to inspect the metadata.

Moreover, the computed metadata needs to be passed to some helpers in
future patches which requires allocating them and storing them in the
BTF that is pinned by the program itself, so that valid access can be
assumed to such data during program runtime.

A key thing to note is that once a btf_struct_meta is available for a
type, both the btf_record and btf_field_offs should be available. It is
critical that btf_field_offs is available in case special fields are
present, as we extensively rely on special fields being zeroed out in
map values and allocated objects in later patches. The code ensures that
by bailing out in case of errors and ensuring both are available
together. If the record is not available, the special fields won't be
recognized, so not having both is also fine (in terms of being a
verification error and not a runtime bug).

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-7-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:11:32 -08:00
Kumar Kartikeya Dwivedi
282de143ea bpf: Introduce allocated objects support
Introduce support for representing pointers to objects allocated by the
BPF program, i.e. PTR_TO_BTF_ID that point to a type in program BTF.
This is indicated by the presence of MEM_ALLOC type flag in reg->type to
avoid having to check btf_is_kernel when trying to match argument types
in helpers.

Whenever walking such types, any pointers being walked will always yield
a SCALAR instead of pointer. In the future we might permit kptr inside
such allocated objects (either kernel or program allocated), and it will
then form a PTR_TO_BTF_ID of the respective type.

For now, such allocated objects will always be referenced in verifier
context, hence ref_obj_id == 0 for them is a bug. It is allowed to write
to such objects, as long fields that are special are not touched
(support for which will be added in subsequent patches). Note that once
such a pointer is marked PTR_UNTRUSTED, it is no longer allowed to write
to it.

No PROBE_MEM handling is therefore done for loads into this type unless
PTR_UNTRUSTED is part of the register type, since they can never be in
an undefined state, and their lifetime will always be valid.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-6-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:11:32 -08:00
Kumar Kartikeya Dwivedi
f73e601aaf bpf: Populate field_offs for inner_map_meta
Far too much code simply assumes that both btf_record and btf_field_offs
are set to valid pointers together, or both are unset. They go together
hand in hand as btf_record describes the special fields and
btf_field_offs is compact representation for runtime copying/zeroing.

It is very difficult to make this clear in the code when the only
exception to this universal invariant is inner_map_meta which is used
as reg->map_ptr in the verifier. This is simply a bug waiting to happen,
as in verifier context we cannot easily distinguish if PTR_TO_MAP_VALUE
is coming from an inner map, and if we ever end up using field_offs for
any reason in the future, we will silently ignore the special fields for
inner map case (as NULL is not an error but unset field_offs).

Hence, simply copy field_offs from inner map together with btf_record.

While at it, refactor code to unwind properly on errors with gotos.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-5-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:11:32 -08:00
Kumar Kartikeya Dwivedi
d48995723c bpf: Free inner_map_meta when btf_record_dup fails
Whenever btf_record_dup fails, we must free inner_map_meta that was
allocated before.

This fixes a memory leak (in case of errors) during inner map creation.

Fixes: aa3496accc ("bpf: Refactor kptr_off_tab into btf_record")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:11:31 -08:00
Kumar Kartikeya Dwivedi
d7f5ef653c bpf: Do btf_record_free outside map_free callback
Since the commit being fixed, we now miss freeing btf_record for local
storage maps which will have a btf_record populated in case they have
bpf_spin_lock element.

This was missed because I made the choice of offloading the job to free
kptr_off_tab (now btf_record) to the map_free callback when adding
support for kptrs.

Revisiting the reason for this decision, there is the possibility that
the btf_record gets used inside map_free callback (e.g. in case of maps
embedding kptrs) to iterate over them and free them, hence doing it
before the map_free callback would be leaking special field memory, and
do invalid memory access. The btf_record keeps module references which
is critical to ensure the dtor call made for referenced kptr is safe to
do.

If doing it after map_free callback, the map area is already freed, so
we cannot access bpf_map structure anymore.

To fix this and prevent such lapses in future, move bpf_map_free_record
out of the map_free callback, and do it after map_free by remembering
the btf_record pointer. There is no need to access bpf_map structure in
that case, and we can avoid missing this case when support for new map
types is added for other special fields.

Since a btf_record and its btf_field_offs are used together, for
consistency delay freeing of field_offs as well. While not a problem
right now, a lot of code assumes that either both record and field_offs
are set or none at once.

Note that in case of map of maps (outer maps), inner_map_meta->record is
only used during verification, not to free fields in map value, hence we
simply keep the bpf_map_free_record call as is in bpf_map_meta_free and
never touch map->inner_map_meta in bpf_map_free_deferred.

Add a comment making note of these details.

Fixes: db55911782 ("bpf: Consolidate spin_lock, timer management into btf_record")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:11:31 -08:00
Kumar Kartikeya Dwivedi
c237bfa528 bpf: Fix early return in map_check_btf
Instead of returning directly with -EOPNOTSUPP for the timer case, we
need to free the btf_record before returning to userspace.

Fixes: db55911782 ("bpf: Consolidate spin_lock, timer management into btf_record")
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:11:31 -08:00
Jakub Kicinski
224b744abf Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
include/linux/bpf.h
  1f6e04a1c7 ("bpf: Fix offset calculation error in __copy_map_value and zero_map_value")
  aa3496accc ("bpf: Refactor kptr_off_tab into btf_record")
  f71b2f6417 ("bpf: Refactor map->off_arr handling")
https://lore.kernel.org/all/20221114095000.67a73239@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 18:30:39 -08:00
Jason A. Donenfeld
8032bf1233 treewide: use get_random_u32_below() instead of deprecated function
This is a simple mechanical transformation done by:

@@
expression E;
@@
- prandom_u32_max
+ get_random_u32_below
  (E)

Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Reviewed-by: SeongJae Park <sj@kernel.org> # for damon
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-11-18 02:15:15 +01:00
Hou Tao
3af43ba4c6 bpf: Pass map file to .map_update_batch directly
Currently bpf_map_do_batch() first invokes fdget(batch.map_fd) to get
the target map file, then it invokes generic_map_update_batch() to do
batch update. generic_map_update_batch() will get the target map file
by using fdget(batch.map_fd) again and pass it to bpf_map_update_value().

The problem is map file returned by the second fdget() may be NULL or a
totally different file compared by map file in bpf_map_do_batch(). The
reason is that the first fdget() only guarantees the liveness of struct
file instead of file descriptor and the file description may be released
by concurrent close() through pick_file().

It doesn't incur any problem as for now, because maps with batch update
support don't use map file in .map_fd_get_ptr() ops. But it is better to
fix the potential access of an invalid map file.

Using __bpf_map_get() again in generic_map_update_batch() can not fix
the problem, because batch.map_fd may be closed and reopened, and the
returned map file may be different with map file got in
bpf_map_do_batch(), so just passing the map file directly to
.map_update_batch() in bpf_map_do_batch().

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20221116075059.1551277-1-houtao@huaweicloud.com
2022-11-17 17:12:35 +01:00
Eduard Zingerman
befae75856 bpf: propagate nullness information for reg to reg comparisons
Propagate nullness information for branches of register to register
equality compare instructions. The following rules are used:
- suppose register A maybe null
- suppose register B is not null
- for JNE A, B, ... - A is not null in the false branch
- for JEQ A, B, ... - A is not null in the true branch

E.g. for program like below:

  r6 = skb->sk;
  r7 = sk_fullsock(r6);
  r0 = sk_fullsock(r6);
  if (r0 == 0) return 0;    (a)
  if (r0 != r7) return 0;   (b)
  *r7->type;                (c)
  return 0;

It is safe to dereference r7 at point (c), because of (a) and (b).

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221115224859.2452988-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-15 17:38:36 -08:00
Toke Høiland-Jørgensen
32637e3300 bpf: Expand map key argument of bpf_redirect_map to u64
For queueing packets in XDP we want to add a new redirect map type with
support for 64-bit indexes. To prepare fore this, expand the width of the
'key' argument to the bpf_redirect_map() helper. Since BPF registers are
always 64-bit, this should be safe to do after the fact.

Acked-by: Song Liu <song@kernel.org>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20221108140601.149971-3-toke@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-15 09:00:27 -08:00
Kumar Kartikeya Dwivedi
6728aea721 bpf: Refactor btf_struct_access
Instead of having to pass multiple arguments that describe the register,
pass the bpf_reg_state into the btf_struct_access callback. Currently,
all call sites simply reuse the btf and btf_id of the reg they want to
check the access of. The only exception to this pattern is the callsite
in check_ptr_to_map_access, hence for that case create a dummy reg to
simulate PTR_TO_BTF_ID access.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221114191547.1694267-8-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-14 21:52:45 -08:00
Kumar Kartikeya Dwivedi
894f2a8b16 bpf: Rename MEM_ALLOC to MEM_RINGBUF
Currently, verifier uses MEM_ALLOC type tag to specially tag memory
returned from bpf_ringbuf_reserve helper. However, this is currently
only used for this purpose and there is an implicit assumption that it
only refers to ringbuf memory (e.g. the check for ARG_PTR_TO_ALLOC_MEM
in check_func_arg_reg_off).

Hence, rename MEM_ALLOC to MEM_RINGBUF to indicate this special
relationship and instead open the use of MEM_ALLOC for more generic
allocations made for user types.

Also, since ARG_PTR_TO_ALLOC_MEM_OR_NULL is unused, simply drop it.

Finally, update selftests using 'alloc_' verifier string to 'ringbuf_'.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221114191547.1694267-7-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-14 21:52:45 -08:00
Kumar Kartikeya Dwivedi
2de2669b4e bpf: Rename RET_PTR_TO_ALLOC_MEM
Currently, the verifier has two return types, RET_PTR_TO_ALLOC_MEM, and
RET_PTR_TO_ALLOC_MEM_OR_NULL, however the former is confusingly named to
imply that it carries MEM_ALLOC, while only the latter does. This causes
confusion during code review leading to conclusions like that the return
value of RET_PTR_TO_DYNPTR_MEM_OR_NULL (which is RET_PTR_TO_ALLOC_MEM |
PTR_MAYBE_NULL) may be consumable by bpf_ringbuf_{submit,commit}.

Rename it to make it clear MEM_ALLOC needs to be tacked on top of
RET_PTR_TO_MEM.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221114191547.1694267-6-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-14 21:52:45 -08:00
Kumar Kartikeya Dwivedi
f0c5941ff5 bpf: Support bpf_list_head in map values
Add the support on the map side to parse, recognize, verify, and build
metadata table for a new special field of the type struct bpf_list_head.
To parameterize the bpf_list_head for a certain value type and the
list_node member it will accept in that value type, we use BTF
declaration tags.

The definition of bpf_list_head in a map value will be done as follows:

struct foo {
	struct bpf_list_node node;
	int data;
};

struct map_value {
	struct bpf_list_head head __contains(foo, node);
};

Then, the bpf_list_head only allows adding to the list 'head' using the
bpf_list_node 'node' for the type struct foo.

The 'contains' annotation is a BTF declaration tag composed of four
parts, "contains:name:node" where the name is then used to look up the
type in the map BTF, with its kind hardcoded to BTF_KIND_STRUCT during
the lookup. The node defines name of the member in this type that has
the type struct bpf_list_node, which is actually used for linking into
the linked list. For now, 'kind' part is hardcoded as struct.

This allows building intrusive linked lists in BPF, using container_of
to obtain pointer to entry, while being completely type safe from the
perspective of the verifier. The verifier knows exactly the type of the
nodes, and knows that list helpers return that type at some fixed offset
where the bpf_list_node member used for this list exists. The verifier
also uses this information to disallow adding types that are not
accepted by a certain list.

For now, no elements can be added to such lists. Support for that is
coming in future patches, hence draining and freeing items is done with
a TODO that will be resolved in a future patch.

Note that the bpf_list_head_free function moves the list out to a local
variable under the lock and releases it, doing the actual draining of
the list items outside the lock. While this helps with not holding the
lock for too long pessimizing other concurrent list operations, it is
also necessary for deadlock prevention: unless every function called in
the critical section would be notrace, a fentry/fexit program could
attach and call bpf_map_update_elem again on the map, leading to the
same lock being acquired if the key matches and lead to a deadlock.
While this requires some special effort on part of the BPF programmer to
trigger and is highly unlikely to occur in practice, it is always better
if we can avoid such a condition.

While notrace would prevent this, doing the draining outside the lock
has advantages of its own, hence it is used to also fix the deadlock
related problem.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221114191547.1694267-5-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-14 21:52:45 -08:00
Kumar Kartikeya Dwivedi
2d57725257 bpf: Remove BPF_MAP_OFF_ARR_MAX
In f71b2f6417 ("bpf: Refactor map->off_arr handling"), map->off_arr
was refactored to be btf_field_offs. The number of field offsets is
equal to maximum possible fields limited by BTF_FIELDS_MAX. Hence, reuse
BTF_FIELDS_MAX as spin_lock and timer no longer are to be handled
specially for offset sorting, fix the comment, and remove incorrect
WARN_ON as its rec->cnt can never exceed this value. The reason to keep
separate constant was the it was always more 2 more than total kptrs.
This is no longer the case.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221114191547.1694267-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-14 21:52:45 -08:00
Jakub Kicinski
f4c4ca70de bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEET63h6RnJhTJHuKTjXOwUVIRcSScFAmNu2EkACgkQXOwUVIRc
 SSebKhAA0ffmp5jJgEJpQYNABGLYIJcwKkBrGClDbMJLtwCjevGZJajT9fpbCLb1
 eK6EIhdfR0NTO+0KtUVkZ8WMa81OmLEJYdTNtJfNE23ENMpssiAWhlhDF8AoXeKv
 Bo3j719gn3Cw9PWXQoircH3wpj+5RMDnjxy4iYlA5yNrvzC7XVmssMF+WALvQnuK
 CGrfR57hxdgmphmasRqeCzEoriwihwPsG3k6eQN8rf7ZytLhs90tMVgT9L3Cd2u9
 DafA0Xl8mZdz2mHhThcJhQVq4MUymZj44ufuHDiOs1j6nhUlWToyQuvegPOqxKti
 uLGtZul0ls+3UP0Lbrv1oEGU/MWMxyDz4IBc0EVs0k3ItQbmSKs6r9WuPFGd96Sb
 GHk68qFVySeLGN0LfKe3rCHJ9ZoIOPYJg9qT8Rd5bOhetgGwSsxZTxUI39BxkFup
 CEqwIDnts1TMU37GDjj+vssKW91k4jEzMZVtRfsL3J36aJs28k/Ez4AqLXg6WU6u
 ADqFaejVPcXbN9rX90onIYxxiL28gZSeT+i8qOPELZtqTQmNWz+tC/ySVuWnD8Mn
 Nbs7PZ1IWiNZpsKS8pZnpd6j4mlBeJnwXkPKiFy+xHGuwRSRdYl6G9e5CtlRely/
 rwQ8DtaOpRYMrGhnmBEdAOCa9t/iqzrzHzjoigjJ7iAST4ToJ5s=
 =Y+/e
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Andrii Nakryiko says:

====================
bpf-next 2022-11-11

We've added 49 non-merge commits during the last 9 day(s) which contain
a total of 68 files changed, 3592 insertions(+), 1371 deletions(-).

The main changes are:

1) Veristat tool improvements to support custom filtering, sorting, and replay
   of results, from Andrii Nakryiko.

2) BPF verifier precision tracking fixes and improvements,
   from Andrii Nakryiko.

3) Lots of new BPF documentation for various BPF maps, from Dave Tucker,
   Donald Hunter, Maryam Tahhan, Bagas Sanjaya.

4) BTF dedup improvements and libbpf's hashmap interface clean ups, from
   Eduard Zingerman.

5) Fix veth driver panic if XDP program is attached before veth_open, from
   John Fastabend.

6) BPF verifier clean ups and fixes in preparation for follow up features,
   from Kumar Kartikeya Dwivedi.

7) Add access to hwtstamp field from BPF sockops programs,
   from Martin KaFai Lau.

8) Various fixes for BPF selftests and samples, from Artem Savkov,
   Domenico Cerasuolo, Kang Minchul, Rong Tao, Yang Jihong.

9) Fix redirection to tunneling device logic, preventing skb->len == 0, from
   Stanislav Fomichev.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (49 commits)
  selftests/bpf: fix veristat's singular file-or-prog filter
  selftests/bpf: Test skops->skb_hwtstamp
  selftests/bpf: Fix incorrect ASSERT in the tcp_hdr_options test
  bpf: Add hwtstamp field for the sockops prog
  selftests/bpf: Fix xdp_synproxy compilation failure in 32-bit arch
  bpf, docs: Document BPF_MAP_TYPE_ARRAY
  docs/bpf: Document BPF map types QUEUE and STACK
  docs/bpf: Document BPF ARRAY_OF_MAPS and HASH_OF_MAPS
  docs/bpf: Document BPF_MAP_TYPE_CPUMAP map
  docs/bpf: Document BPF_MAP_TYPE_LPM_TRIE map
  libbpf: Hashmap.h update to fix build issues using LLVM14
  bpf: veth driver panics when xdp prog attached before veth_open
  selftests: Fix test group SKIPPED result
  selftests/bpf: Tests for btf_dedup_resolve_fwds
  libbpf: Resolve unambigous forward declarations
  libbpf: Hashmap interface update to allow both long and void* keys/values
  samples/bpf: Fix sockex3 error: Missing BPF prog type
  selftests/bpf: Fix u32 variable compared with less than zero
  Documentation: bpf: Escape underscore in BPF type name prefix
  selftests/bpf: Use consistent build-id type for liburandom_read.so
  ...
====================

Link: https://lore.kernel.org/r/20221111233733.1088228-1-andrii@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-11 18:33:04 -08:00
Xu Kuohai
4b45cd81f7 bpf: Initialize same number of free nodes for each pcpu_freelist
pcpu_freelist_populate() initializes nr_elems / num_possible_cpus() + 1
free nodes for some CPUs, and then possibly one CPU with fewer nodes,
followed by remaining cpus with 0 nodes. For example, when nr_elems == 256
and num_possible_cpus() == 32, CPU 0~27 each gets 9 free nodes, CPU 28 gets
4 free nodes, CPU 29~31 get 0 free nodes, while in fact each CPU should get
8 nodes equally.

This patch initializes nr_elems / num_possible_cpus() free nodes for each
CPU firstly, then allocates the remaining free nodes by one for each CPU
until no free nodes left.

Fixes: e19494edab ("bpf: introduce percpu_freelist")
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20221110122128.105214-1-xukuohai@huawei.com
2022-11-11 12:05:14 -08:00
Maryam Tahhan
161939abc8 docs/bpf: Document BPF_MAP_TYPE_CPUMAP map
Add documentation for BPF_MAP_TYPE_CPUMAP including
kernel version introduced, usage and examples.

Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20221107165207.2682075-2-mtahhan@redhat.com
2022-11-11 11:32:54 -08:00
Jakub Kicinski
966a9b4903 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/can/pch_can.c
  ae64438be1 ("can: dev: fix skb drop check")
  1dd1b521be ("can: remove obsolete PCH CAN driver")
https://lore.kernel.org/all/20221110102509.1f7d63cc@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-10 17:43:53 -08:00
Wang Yufen
eb86559a69 bpf: Fix memory leaks in __check_func_call
kmemleak reports this issue:

unreferenced object 0xffff88817139d000 (size 2048):
  comm "test_progs", pid 33246, jiffies 4307381979 (age 45851.820s)
  hex dump (first 32 bytes):
    01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<0000000045f075f0>] kmalloc_trace+0x27/0xa0
    [<0000000098b7c90a>] __check_func_call+0x316/0x1230
    [<00000000b4c3c403>] check_helper_call+0x172e/0x4700
    [<00000000aa3875b7>] do_check+0x21d8/0x45e0
    [<000000001147357b>] do_check_common+0x767/0xaf0
    [<00000000b5a595b4>] bpf_check+0x43e3/0x5bc0
    [<0000000011e391b1>] bpf_prog_load+0xf26/0x1940
    [<0000000007f765c0>] __sys_bpf+0xd2c/0x3650
    [<00000000839815d6>] __x64_sys_bpf+0x75/0xc0
    [<00000000946ee250>] do_syscall_64+0x3b/0x90
    [<0000000000506b7f>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

The root case here is: In function prepare_func_exit(), the callee is
not released in the abnormal scenario after "state->curframe--;". To
fix, move "state->curframe--;" to the very bottom of the function,
right when we free callee and reset frame[] pointer to NULL, as Andrii
suggested.

In addition, function __check_func_call() has a similar problem. In
the abnormal scenario before "state->curframe++;", the callee also
should be released by free_func_state().

Fixes: 69c087ba62 ("bpf: Add bpf_for_each_map_elem() helper")
Fixes: fd978bf7fd ("bpf: Add reference tracking to verifier")
Signed-off-by: Wang Yufen <wangyufen@huawei.com>
Link: https://lore.kernel.org/r/1667884291-15666-1-git-send-email-wangyufen@huawei.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-11-08 09:21:35 -08:00
Nathan Chancellor
a679120edf bpf: Add explicit cast to 'void *' for __BPF_DISPATCHER_UPDATE()
When building with clang:

  kernel/bpf/dispatcher.c:126:33: error: pointer type mismatch ('void *' and 'unsigned int (*)(const void *, const struct bpf_insn *, bpf_func_t)' (aka 'unsigned int (*)(const void *, const struct bpf_insn *, unsigned int (*)(const void *, const struct bpf_insn *))')) [-Werror,-Wpointer-type-mismatch]
          __BPF_DISPATCHER_UPDATE(d, new ?: &bpf_dispatcher_nop_func);
                                     ~~~ ^  ~~~~~~~~~~~~~~~~~~~~~~~~
  ./include/linux/bpf.h:1045:54: note: expanded from macro '__BPF_DISPATCHER_UPDATE'
          __static_call_update((_d)->sc_key, (_d)->sc_tramp, (_new))
                                                              ^~~~
  1 error generated.

The warning is pointing out that the type of new ('void *') and
&bpf_dispatcher_nop_func are not compatible, which could have side
effects coming out of a conditional operator due to promotion rules.

Add the explicit cast to 'void *' to make it clear that this is
expected, as __BPF_DISPATCHER_UPDATE() expands to a call to
__static_call_update(), which expects a 'void *' as its final argument.

Fixes: c86df29d11 ("bpf: Convert BPF_DISPATCHER to use static_call() (not ftrace)")
Link: https://github.com/ClangBuiltLinux/linux/issues/1755
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: "kernelci.org bot" <bot@kernelci.org>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Björn Töpel <bjorn@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221107170711.42409-1-nathan@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-11-07 16:07:35 -08:00
Peter Zijlstra
c86df29d11 bpf: Convert BPF_DISPATCHER to use static_call() (not ftrace)
The dispatcher function is currently abusing the ftrace __fentry__
call location for its own purposes -- this obviously gives trouble
when the dispatcher and ftrace are both in use.

A previous solution tried using __attribute__((patchable_function_entry()))
which works, except it is GCC-8+ only, breaking the build on the
earlier still supported compilers. Instead use static_call() -- which
has its own annotations and does not conflict with ftrace -- to
rewrite the dispatch function.

By using: return static_call()(ctx, insni, bpf_func) you get a perfect
forwarding tail call as function body (iow a single jmp instruction).
By having the default static_call() target be bpf_dispatcher_nop_func()
it retains the default behaviour (an indirect call to the argument
function). Only once a dispatcher program is attached is the target
rewritten to directly call the JIT'ed image.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Björn Töpel <bjorn@kernel.org>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Björn Töpel <bjorn@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lkml.kernel.org/r/Y1/oBlK0yFk5c/Im@hirez.programming.kicks-ass.net
Link: https://lore.kernel.org/bpf/20221103120647.796772565@infradead.org
2022-11-04 23:13:57 +01:00
Peter Zijlstra
18acb7fac2 bpf: Revert ("Fix dispatcher patchable function entry to 5 bytes nop")
Because __attribute__((patchable_function_entry)) is only available
since GCC-8 this solution fails to build on the minimum required GCC
version.

Undo these changes so we might try again -- without cluttering up the
patches with too many changes.

This is an almost complete revert of:

  dbe69b2998 ("bpf: Fix dispatcher patchable function entry to 5 bytes nop")
  ceea991a01 ("bpf: Move bpf_dispatcher function out of ftrace locations")

(notably the arch/x86/Kconfig hunk is kept).

Reported-by: David Laight <David.Laight@aculab.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Björn Töpel <bjorn@kernel.org>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Björn Töpel <bjorn@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lkml.kernel.org/r/439d8dc735bb4858875377df67f1b29a@AcuMS.aculab.com
Link: https://lore.kernel.org/bpf/20221103120647.728830733@infradead.org
2022-11-04 23:13:08 +01:00
Andrii Nakryiko
7a830b53c1 bpf: aggressively forget precise markings during state checkpointing
Exploit the property of about-to-be-checkpointed state to be able to
forget all precise markings up to that point even more aggressively. We
now clear all potentially inherited precise markings right before
checkpointing and branching off into child state. If any of children
states require precise knowledge of any SCALAR register, those will be
propagated backwards later on before this state is finalized, preserving
correctness.

There is a single selftests BPF program change, but tremendous one: 25x
reduction in number of verified instructions and states in
trace_virtqueue_add_sgs.

Cilium results are more modest, but happen across wider range of programs.

SELFTESTS RESULTS
=================

$ ./veristat -C -e file,prog,insns,states ~/imprecise-early-results.csv ~/imprecise-aggressive-results.csv | grep -v '+0'
File                 Program                  Total insns (A)  Total insns (B)  Total insns (DIFF)  Total states (A)  Total states (B)  Total states (DIFF)
-------------------  -----------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------
loop6.bpf.linked1.o  trace_virtqueue_add_sgs           398057            15114   -382943 (-96.20%)              8717               336      -8381 (-96.15%)
-------------------  -----------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------

CILIUM RESULTS
==============

$ ./veristat -C -e file,prog,insns,states ~/imprecise-early-results-cilium.csv ~/imprecise-aggressive-results-cilium.csv | grep -v '+0'
File           Program                           Total insns (A)  Total insns (B)  Total insns (DIFF)  Total states (A)  Total states (B)  Total states (DIFF)
-------------  --------------------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------
bpf_host.o     tail_handle_nat_fwd_ipv4                    23426            23221       -205 (-0.88%)              1537              1515         -22 (-1.43%)
bpf_host.o     tail_handle_nat_fwd_ipv6                    13009            12904       -105 (-0.81%)               719               708         -11 (-1.53%)
bpf_host.o     tail_nodeport_nat_ingress_ipv6               5261             5196        -65 (-1.24%)               247               243          -4 (-1.62%)
bpf_host.o     tail_nodeport_nat_ipv6_egress                3446             3406        -40 (-1.16%)               203               198          -5 (-2.46%)
bpf_lxc.o      tail_handle_nat_fwd_ipv4                    23426            23221       -205 (-0.88%)              1537              1515         -22 (-1.43%)
bpf_lxc.o      tail_handle_nat_fwd_ipv6                    13009            12904       -105 (-0.81%)               719               708         -11 (-1.53%)
bpf_lxc.o      tail_ipv4_ct_egress                          5074             4897       -177 (-3.49%)               255               248          -7 (-2.75%)
bpf_lxc.o      tail_ipv4_ct_ingress                         5100             4923       -177 (-3.47%)               255               248          -7 (-2.75%)
bpf_lxc.o      tail_ipv4_ct_ingress_policy_only             5100             4923       -177 (-3.47%)               255               248          -7 (-2.75%)
bpf_lxc.o      tail_ipv6_ct_egress                          4558             4536        -22 (-0.48%)               188               187          -1 (-0.53%)
bpf_lxc.o      tail_ipv6_ct_ingress                         4578             4556        -22 (-0.48%)               188               187          -1 (-0.53%)
bpf_lxc.o      tail_ipv6_ct_ingress_policy_only             4578             4556        -22 (-0.48%)               188               187          -1 (-0.53%)
bpf_lxc.o      tail_nodeport_nat_ingress_ipv6               5261             5196        -65 (-1.24%)               247               243          -4 (-1.62%)
bpf_overlay.o  tail_nodeport_nat_ingress_ipv6               5261             5196        -65 (-1.24%)               247               243          -4 (-1.62%)
bpf_overlay.o  tail_nodeport_nat_ipv6_egress                3482             3442        -40 (-1.15%)               204               201          -3 (-1.47%)
bpf_xdp.o      tail_nodeport_nat_egress_ipv4               17200            15619      -1581 (-9.19%)              1111              1010        -101 (-9.09%)
-------------  --------------------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20221104163649.121784-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-04 11:51:45 -07:00
Andrii Nakryiko
f63181b6ae bpf: stop setting precise in current state
Setting reg->precise to true in current state is not necessary from
correctness standpoint, but it does pessimise the whole precision (or
rather "imprecision", because that's what we want to keep as much as
possible) tracking. Why is somewhat subtle and my best attempt to
explain this is recorded in an extensive comment for __mark_chain_precise()
function. Some more careful thinking and code reading is probably required
still to grok this completely, unfortunately. Whiteboarding and a bunch
of extra handwaiving in person would be even more helpful, but is deemed
impractical in Git commit.

Next patch pushes this imprecision property even further, building on top of
the insights described in this patch.

End results are pretty nice, we get reduction in number of total instructions
and states verified due to a better states reuse, as some of the states are now
more generic and permissive due to less unnecessary precise=true requirements.

SELFTESTS RESULTS
=================

$ ./veristat -C -e file,prog,insns,states ~/subprog-precise-results.csv ~/imprecise-early-results.csv | grep -v '+0'
File                                     Program                 Total insns (A)  Total insns (B)  Total insns (DIFF)  Total states (A)  Total states (B)  Total states (DIFF)
---------------------------------------  ----------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------
bpf_iter_ksym.bpf.linked1.o              dump_ksym                           347              285       -62 (-17.87%)                20                19          -1 (-5.00%)
pyperf600_bpf_loop.bpf.linked1.o         on_event                           3678             3736        +58 (+1.58%)               276               285          +9 (+3.26%)
setget_sockopt.bpf.linked1.o             skops_sockopt                      4038             3947        -91 (-2.25%)               347               343          -4 (-1.15%)
test_l4lb.bpf.linked1.o                  balancer_ingress                   4559             2611     -1948 (-42.73%)               118               105        -13 (-11.02%)
test_l4lb_noinline.bpf.linked1.o         balancer_ingress                   6279             6268        -11 (-0.18%)               237               236          -1 (-0.42%)
test_misc_tcp_hdr_options.bpf.linked1.o  misc_estab                         1307             1303         -4 (-0.31%)               100                99          -1 (-1.00%)
test_sk_lookup.bpf.linked1.o             ctx_narrow_access                   456              447         -9 (-1.97%)                39                38          -1 (-2.56%)
test_sysctl_loop1.bpf.linked1.o          sysctl_tcp_mem                     1389             1384         -5 (-0.36%)                26                25          -1 (-3.85%)
test_tc_dtime.bpf.linked1.o              egress_fwdns_prio101                518              485        -33 (-6.37%)                51                46          -5 (-9.80%)
test_tc_dtime.bpf.linked1.o              egress_host                         519              468        -51 (-9.83%)                50                44         -6 (-12.00%)
test_tc_dtime.bpf.linked1.o              ingress_fwdns_prio101               842             1000      +158 (+18.76%)                73                88        +15 (+20.55%)
xdp_synproxy_kern.bpf.linked1.o          syncookie_tc                     405757           373173     -32584 (-8.03%)             25735             22882      -2853 (-11.09%)
xdp_synproxy_kern.bpf.linked1.o          syncookie_xdp                    479055           371590   -107465 (-22.43%)             29145             22207      -6938 (-23.81%)
---------------------------------------  ----------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------

Slight regression in test_tc_dtime.bpf.linked1.o/ingress_fwdns_prio101
is left for a follow up, there might be some more precision-related bugs
in existing BPF verifier logic.

CILIUM RESULTS
==============

$ ./veristat -C -e file,prog,insns,states ~/subprog-precise-results-cilium.csv ~/imprecise-early-results-cilium.csv | grep -v '+0'
File           Program                         Total insns (A)  Total insns (B)  Total insns (DIFF)  Total states (A)  Total states (B)  Total states (DIFF)
-------------  ------------------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------
bpf_host.o     cil_from_host                               762              556      -206 (-27.03%)                43                37         -6 (-13.95%)
bpf_host.o     tail_handle_nat_fwd_ipv4                  23541            23426       -115 (-0.49%)              1538              1537          -1 (-0.07%)
bpf_host.o     tail_nodeport_nat_egress_ipv4             33592            33566        -26 (-0.08%)              2163              2161          -2 (-0.09%)
bpf_lxc.o      tail_handle_nat_fwd_ipv4                  23541            23426       -115 (-0.49%)              1538              1537          -1 (-0.07%)
bpf_overlay.o  tail_nodeport_nat_egress_ipv4             33581            33543        -38 (-0.11%)              2160              2157          -3 (-0.14%)
bpf_xdp.o      tail_handle_nat_fwd_ipv4                  21659            20920       -739 (-3.41%)              1440              1376         -64 (-4.44%)
bpf_xdp.o      tail_handle_nat_fwd_ipv6                  17084            17039        -45 (-0.26%)               907               905          -2 (-0.22%)
bpf_xdp.o      tail_lb_ipv4                              73442            73430        -12 (-0.02%)              4370              4369          -1 (-0.02%)
bpf_xdp.o      tail_lb_ipv6                             152114           151895       -219 (-0.14%)              6493              6479         -14 (-0.22%)
bpf_xdp.o      tail_nodeport_nat_egress_ipv4             17377            17200       -177 (-1.02%)              1125              1111         -14 (-1.24%)
bpf_xdp.o      tail_nodeport_nat_ingress_ipv6             6405             6397         -8 (-0.12%)               309               308          -1 (-0.32%)
bpf_xdp.o      tail_rev_nodeport_lb4                      7126             6934       -192 (-2.69%)               414               402         -12 (-2.90%)
bpf_xdp.o      tail_rev_nodeport_lb6                     18059            17905       -154 (-0.85%)              1105              1096          -9 (-0.81%)
-------------  ------------------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20221104163649.121784-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-04 11:51:45 -07:00
Andrii Nakryiko
be2ef81615 bpf: allow precision tracking for programs with subprogs
Stop forcing precise=true for SCALAR registers when BPF program has any
subprograms. Current restriction means that any BPF program, as soon as
it uses subprograms, will end up not getting any of the precision
tracking benefits in reduction of number of verified states.

This patch keeps the fallback mark_all_scalars_precise() behavior if
precise marking has to cross function frames. E.g., if subprogram
requires R1 (first input arg) to be marked precise, ideally we'd need to
backtrack to the parent function and keep marking R1 and its
dependencies as precise. But right now we give up and force all the
SCALARs in any of the current and parent states to be forced to
precise=true. We can lift that restriction in the future.

But this patch fixes two issues identified when trying to enable
precision tracking for subprogs.

First, prevent "escaping" from top-most state in a global subprog. While
with entry-level BPF program we never end up requesting precision for
R1-R5 registers, because R2-R5 are not initialized (and so not readable
in correct BPF program), and R1 is PTR_TO_CTX, not SCALAR, and so is
implicitly precise. With global subprogs, though, it's different, as
global subprog a) can have up to 5 SCALAR input arguments, which might
get marked as precise=true and b) it is validated in isolation from its
main entry BPF program. b) means that we can end up exhausting parent
state chain and still not mark all registers in reg_mask as precise,
which would lead to verifier bug warning.

To handle that, we need to consider two cases. First, if the very first
state is not immediately "checkpointed" (i.e., stored in state lookup
hashtable), it will get correct first_insn_idx and last_insn_idx
instruction set during state checkpointing. As such, this case is
already handled and __mark_chain_precision() already handles that by
just doing nothing when we reach to the very first parent state.
st->parent will be NULL and we'll just stop. Perhaps some extra check
for reg_mask and stack_mask is due here, but this patch doesn't address
that issue.

More problematic second case is when global function's initial state is
immediately checkpointed before we manage to process the very first
instruction. This is happening because when there is a call to global
subprog from the main program the very first subprog's instruction is
marked as pruning point, so before we manage to process first
instruction we have to check and checkpoint state. This patch adds
a special handling for such "empty" state, which is identified by having
st->last_insn_idx set to -1. In such case, we check that we are indeed
validating global subprog, and with some sanity checking we mark input
args as precise if requested.

Note that we also initialize state->first_insn_idx with correct start
insn_idx offset. For main program zero is correct value, but for any
subprog it's quite confusing to not have first_insn_idx set. This
doesn't have any functional impact, but helps with debugging and state
printing. We also explicitly initialize state->last_insns_idx instead of
relying on is_state_visited() to do this with env->prev_insns_idx, which
will be -1 on the very first instruction. This concludes necessary
changes to handle specifically global subprog's precision tracking.

Second identified problem was missed handling of BPF helper functions
that call into subprogs (e.g., bpf_loop and few others). From precision
tracking and backtracking logic's standpoint those are effectively calls
into subprogs and should be called as BPF_PSEUDO_CALL calls.

This patch takes the least intrusive way and just checks against a short
list of current BPF helpers that do call subprogs, encapsulated in
is_callback_calling_function() function. But to prevent accidentally
forgetting to add new BPF helpers to this "list", we also do a sanity
check in __check_func_call, which has to be called for each such special
BPF helper, to validate that BPF helper is indeed recognized as
callback-calling one. This should catch any missed checks in the future.
Adding some special flags to be added in function proto definitions
seemed like an overkill in this case.

With the above changes, it's possible to remove forceful setting of
reg->precise to true in __mark_reg_unknown, which turns on precision
tracking both inside subprogs and entry progs that have subprogs. No
warnings or errors were detected across all the selftests, but also when
validating with veristat against internal Meta BPF objects and Cilium
objects. Further, in some BPF programs there are noticeable reduction in
number of states and instructions validated due to more effective
precision tracking, especially benefiting syncookie test.

$ ./veristat -C -e file,prog,insns,states ~/baseline-results.csv ~/subprog-precise-results.csv  | grep -v '+0'
File                                      Program                     Total insns (A)  Total insns (B)  Total insns (DIFF)  Total states (A)  Total states (B)  Total states (DIFF)
----------------------------------------  --------------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------
pyperf600_bpf_loop.bpf.linked1.o          on_event                               3966             3678       -288 (-7.26%)               306               276         -30 (-9.80%)
pyperf_global.bpf.linked1.o               on_event                               7563             7530        -33 (-0.44%)               520               517          -3 (-0.58%)
pyperf_subprogs.bpf.linked1.o             on_event                              36358            36934       +576 (+1.58%)              2499              2531         +32 (+1.28%)
setget_sockopt.bpf.linked1.o              skops_sockopt                          3965             4038        +73 (+1.84%)               343               347          +4 (+1.17%)
test_cls_redirect_subprogs.bpf.linked1.o  cls_redirect                          64965            64901        -64 (-0.10%)              4619              4612          -7 (-0.15%)
test_misc_tcp_hdr_options.bpf.linked1.o   misc_estab                             1491             1307      -184 (-12.34%)               110               100         -10 (-9.09%)
test_pkt_access.bpf.linked1.o             test_pkt_access                         354              349         -5 (-1.41%)                25                24          -1 (-4.00%)
test_sock_fields.bpf.linked1.o            egress_read_sock_fields                 435              375       -60 (-13.79%)                22                20          -2 (-9.09%)
test_sysctl_loop2.bpf.linked1.o           sysctl_tcp_mem                         1508             1501         -7 (-0.46%)                29                28          -1 (-3.45%)
test_tc_dtime.bpf.linked1.o               egress_fwdns_prio100                    468              435        -33 (-7.05%)                45                41          -4 (-8.89%)
test_tc_dtime.bpf.linked1.o               ingress_fwdns_prio100                   398              408        +10 (+2.51%)                42                39          -3 (-7.14%)
test_tc_dtime.bpf.linked1.o               ingress_fwdns_prio101                  1096              842      -254 (-23.18%)                97                73        -24 (-24.74%)
test_tcp_hdr_options.bpf.linked1.o        estab                                  2758             2408      -350 (-12.69%)               208               181        -27 (-12.98%)
test_urandom_usdt.bpf.linked1.o           urand_read_with_sema                    466              448        -18 (-3.86%)                31                28          -3 (-9.68%)
test_urandom_usdt.bpf.linked1.o           urand_read_without_sema                 466              448        -18 (-3.86%)                31                28          -3 (-9.68%)
test_urandom_usdt.bpf.linked1.o           urandlib_read_with_sema                 466              448        -18 (-3.86%)                31                28          -3 (-9.68%)
test_urandom_usdt.bpf.linked1.o           urandlib_read_without_sema              466              448        -18 (-3.86%)                31                28          -3 (-9.68%)
test_xdp_noinline.bpf.linked1.o           balancer_ingress_v6                    4302             4294         -8 (-0.19%)               257               256          -1 (-0.39%)
xdp_synproxy_kern.bpf.linked1.o           syncookie_tc                         583722           405757   -177965 (-30.49%)             35846             25735     -10111 (-28.21%)
xdp_synproxy_kern.bpf.linked1.o           syncookie_xdp                        609123           479055   -130068 (-21.35%)             35452             29145      -6307 (-17.79%)
----------------------------------------  --------------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20221104163649.121784-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-04 11:51:45 -07:00
Andrii Nakryiko
529409ea92 bpf: propagate precision across all frames, not just the last one
When equivalent completed state is found and it has additional precision
restrictions, BPF verifier propagates precision to
currently-being-verified state chain (i.e., including parent states) so
that if some of the states in the chain are not yet completed, necessary
precision restrictions are enforced.

Unfortunately, right now this happens only for the last frame (deepest
active subprogram's frame), not all the frames. This can lead to
incorrect matching of states due to missing precision marker. Currently
this doesn't seem possible as BPF verifier forces everything to precise
when validated BPF program has any subprograms. But with the next patch
lifting this restriction, this becomes problematic.

In fact, without this fix, we'll start getting failure in one of the
existing test_verifier test cases:

  #906/p precise: cross frame pruning FAIL
  Unexpected success to load!
  verification time 48 usec
  stack depth 0+0
  processed 26 insns (limit 1000000) max_states_per_insn 3 total_states 17 peak_states 17 mark_read 8

This patch adds precision propagation across all frames.

Fixes: a3ce685dd0 ("bpf: fix precision tracking")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20221104163649.121784-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-04 11:51:45 -07:00
Andrii Nakryiko
a3b666bfa9 bpf: propagate precision in ALU/ALU64 operations
When processing ALU/ALU64 operations (apart from BPF_MOV, which is
handled correctly already; and BPF_NEG and BPF_END are special and don't
have source register), if destination register is already marked
precise, this causes problem with potentially missing precision tracking
for the source register. E.g., when we have r1 >>= r5 and r1 is marked
precise, but r5 isn't, this will lead to r5 staying as imprecise. This
is due to the precision backtracking logic stopping early when it sees
r1 is already marked precise. If r1 wasn't precise, we'd keep
backtracking and would add r5 to the set of registers that need to be
marked precise. So there is a discrepancy here which can lead to invalid
and incompatible states matched due to lack of precision marking on r5.
If r1 wasn't precise, precision backtracking would correctly mark both
r1 and r5 as precise.

This is simple to fix, though. During the forward instruction simulation
pass, for arithmetic operations of `scalar <op>= scalar` form (where
<op> is ALU or ALU64 operations), if destination register is already
precise, mark source register as precise. This applies only when both
involved registers are SCALARs. `ptr += scalar` and `scalar += ptr`
cases are already handled correctly.

This does have (negative) effect on some selftest programs and few
Cilium programs.  ~/baseline-tmp-results.csv are veristat results with
this patch, while ~/baseline-results.csv is without it. See post
scriptum for instructions on how to make Cilium programs testable with
veristat. Correctness has a price.

$ ./veristat -C -e file,prog,insns,states ~/baseline-results.csv ~/baseline-tmp-results.csv | grep -v '+0'
File                     Program               Total insns (A)  Total insns (B)  Total insns (DIFF)  Total states (A)  Total states (B)  Total states (DIFF)
-----------------------  --------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------
bpf_cubic.bpf.linked1.o  bpf_cubic_cong_avoid              997             1700      +703 (+70.51%)                62                90        +28 (+45.16%)
test_l4lb.bpf.linked1.o  balancer_ingress                 4559             5469      +910 (+19.96%)               118               126          +8 (+6.78%)
-----------------------  --------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------

$ ./veristat -C -e file,prog,verdict,insns,states ~/baseline-results-cilium.csv ~/baseline-tmp-results-cilium.csv | grep -v '+0'
File           Program                         Total insns (A)  Total insns (B)  Total insns (DIFF)  Total states (A)  Total states (B)  Total states (DIFF)
-------------  ------------------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------
bpf_host.o     tail_nodeport_nat_ingress_ipv6             4448             5261      +813 (+18.28%)               234               247         +13 (+5.56%)
bpf_host.o     tail_nodeport_nat_ipv6_egress              3396             3446        +50 (+1.47%)               201               203          +2 (+1.00%)
bpf_lxc.o      tail_nodeport_nat_ingress_ipv6             4448             5261      +813 (+18.28%)               234               247         +13 (+5.56%)
bpf_overlay.o  tail_nodeport_nat_ingress_ipv6             4448             5261      +813 (+18.28%)               234               247         +13 (+5.56%)
bpf_xdp.o      tail_lb_ipv4                              71736            73442      +1706 (+2.38%)              4295              4370         +75 (+1.75%)
-------------  ------------------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------

P.S. To make Cilium ([0]) programs libbpf-compatible and thus
veristat-loadable, apply changes from topmost commit in [1], which does
minimal changes to Cilium source code, mostly around SEC() annotations
and BPF map definitions.

  [0] https://github.com/cilium/cilium/
  [1] https://github.com/anakryiko/cilium/commits/libbpf-friendliness

Fixes: b5dc0163d8 ("bpf: precise scalar_value tracking")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20221104163649.121784-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-04 11:51:45 -07:00
Kumar Kartikeya Dwivedi
f71b2f6417 bpf: Refactor map->off_arr handling
Refactor map->off_arr handling into generic functions that can work on
their own without hardcoding map specific code. The btf_fields_offs
structure is now returned from btf_parse_field_offs, which can be reused
later for types in program BTF.

All functions like copy_map_value, zero_map_value call generic
underlying functions so that they can also be reused later for copying
to values allocated in programs which encode specific fields.

Later, some helper functions will also require access to this
btf_field_offs structure to be able to skip over special fields at
runtime.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221103191013.1236066-9-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-03 23:09:40 -07:00
Kumar Kartikeya Dwivedi
db55911782 bpf: Consolidate spin_lock, timer management into btf_record
Now that kptr_off_tab has been refactored into btf_record, and can hold
more than one specific field type, accomodate bpf_spin_lock and
bpf_timer as well.

While they don't require any more metadata than offset, having all
special fields in one place allows us to share the same code for
allocated user defined types and handle both map values and these
allocated objects in a similar fashion.

As an optimization, we still keep spin_lock_off and timer_off offsets in
the btf_record structure, just to avoid having to find the btf_field
struct each time their offset is needed. This is mostly needed to
manipulate such objects in a map value at runtime. It's ok to hardcode
just one offset as more than one field is disallowed.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221103191013.1236066-8-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-03 22:19:40 -07:00
Kumar Kartikeya Dwivedi
aa3496accc bpf: Refactor kptr_off_tab into btf_record
To prepare the BPF verifier to handle special fields in both map values
and program allocated types coming from program BTF, we need to refactor
the kptr_off_tab handling code into something more generic and reusable
across both cases to avoid code duplication.

Later patches also require passing this data to helpers at runtime, so
that they can work on user defined types, initialize them, destruct
them, etc.

The main observation is that both map values and such allocated types
point to a type in program BTF, hence they can be handled similarly. We
can prepare a field metadata table for both cases and store them in
struct bpf_map or struct btf depending on the use case.

Hence, refactor the code into generic btf_record and btf_field member
structs. The btf_record represents the fields of a specific btf_type in
user BTF. The cnt indicates the number of special fields we successfully
recognized, and field_mask is a bitmask of fields that were found, to
enable quick determination of availability of a certain field.

Subsequently, refactor the rest of the code to work with these generic
types, remove assumptions about kptr and kptr_off_tab, rename variables
to more meaningful names, etc.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221103191013.1236066-7-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-03 21:44:53 -07:00
Kumar Kartikeya Dwivedi
a28ace782e bpf: Drop reg_type_may_be_refcounted_or_null
It is not scalable to maintain a list of types that can have non-zero
ref_obj_id. It is never set for scalars anyway, so just remove the
conditional on register types and print it whenever it is non-zero.

Acked-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20221103191013.1236066-6-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-03 19:31:18 -07:00
Kumar Kartikeya Dwivedi
f5e477a861 bpf: Fix slot type check in check_stack_write_var_off
For the case where allow_ptr_leaks is false, code is checking whether
slot type is STACK_INVALID and STACK_SPILL and rejecting other cases.
This is a consequence of incorrectly checking for register type instead
of the slot type (NOT_INIT and SCALAR_VALUE respectively). Fix the
check.

Fixes: 01f810ace9 ("bpf: Allow variable-offset stack access")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221103191013.1236066-5-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-03 19:31:18 -07:00
Kumar Kartikeya Dwivedi
261f4664ca bpf: Clobber stack slot when writing over spilled PTR_TO_BTF_ID
When support was added for spilled PTR_TO_BTF_ID to be accessed by
helper memory access, the stack slot was not overwritten to STACK_MISC
(and that too is only safe when env->allow_ptr_leaks is true).

This means that helpers who take ARG_PTR_TO_MEM and write to it may
essentially overwrite the value while the verifier continues to track
the slot for spilled register.

This can cause issues when PTR_TO_BTF_ID is spilled to stack, and then
overwritten by helper write access, which can then be passed to BPF
helpers or kfuncs.

Handle this by falling back to the case introduced in a later commit,
which will also handle PTR_TO_BTF_ID along with other pointer types,
i.e. cd17d38f8b ("bpf: Permits pointers on stack for helper calls").

Finally, include a comment on why REG_LIVE_WRITTEN is not being set when
clobber is set to true. In short, the reason is that while when clobber
is unset, we know that we won't be writing, when it is true, we *may*
write to any of the stack slots in that range. It may be a partial or
complete write, to just one or many stack slots.

We cannot be sure, hence to be conservative, we leave things as is and
never set REG_LIVE_WRITTEN for any stack slot. However, clobber still
needs to reset them to STACK_MISC assuming writes happened. However read
marks still need to be propagated upwards from liveness point of view,
as parent stack slot's contents may still continue to matter to child
states.

Cc: Yonghong Song <yhs@meta.com>
Fixes: 1d68f22b3d ("bpf: Handle spilled PTR_TO_BTF_ID properly when checking stack_boundary")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221103191013.1236066-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-03 19:31:18 -07:00
Kumar Kartikeya Dwivedi
23da464dd6 bpf: Allow specifying volatile type modifier for kptrs
This is useful in particular to mark the pointer as volatile, so that
compiler treats each load and store to the field as a volatile access.
The alternative is having to define and use READ_ONCE and WRITE_ONCE in
the BPF program.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20221103191013.1236066-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-03 19:31:17 -07:00
Youlin Li
f1db20814a bpf: Fix wrong reg type conversion in release_reference()
Some helper functions will allocate memory. To avoid memory leaks, the
verifier requires the eBPF program to release these memories by calling
the corresponding helper functions.

When a resource is released, all pointer registers corresponding to the
resource should be invalidated. The verifier use release_references() to
do this job, by apply  __mark_reg_unknown() to each relevant register.

It will give these registers the type of SCALAR_VALUE. A register that
will contain a pointer value at runtime, but of type SCALAR_VALUE, which
may allow the unprivileged user to get a kernel pointer by storing this
register into a map.

Using __mark_reg_not_init() while NOT allow_ptr_leaks can mitigate this
problem.

Fixes: fd978bf7fd ("bpf: Add reference tracking to verifier")
Signed-off-by: Youlin Li <liulin063@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20221103093440.3161-1-liulin063@gmail.com
2022-11-04 00:24:12 +01:00
Jakub Kicinski
b54a0d4094 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY2GuKgAKCRDbK58LschI
 gy32AP9PI0e/bUGDExKJ8g97PeeEtnpj4TTI6g+XKILtYnyXlgD/Rk4j2D/f3IBF
 Ha9TmqYvAUim+U/g50vUrNuoNLNJ5w8=
 =OKC1
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
bpf-next 2022-11-02

We've added 70 non-merge commits during the last 14 day(s) which contain
a total of 96 files changed, 3203 insertions(+), 640 deletions(-).

The main changes are:

1) Make cgroup local storage available to non-cgroup attached BPF programs
   such as tc BPF ones, from Yonghong Song.

2) Avoid unnecessary deadlock detection and failures wrt BPF task storage
   helpers, from Martin KaFai Lau.

3) Add LLVM disassembler as default library for dumping JITed code
   in bpftool, from Quentin Monnet.

4) Various kprobe_multi_link fixes related to kernel modules,
   from Jiri Olsa.

5) Optimize x86-64 JIT with emitting BMI2-based shift instructions,
   from Jie Meng.

6) Improve BPF verifier's memory type compatibility for map key/value
   arguments, from Dave Marchevsky.

7) Only create mmap-able data section maps in libbpf when data is exposed
   via skeletons, from Andrii Nakryiko.

8) Add an autoattach option for bpftool to load all object assets,
   from Wang Yufen.

9) Various memory handling fixes for libbpf and BPF selftests,
   from Xu Kuohai.

10) Initial support for BPF selftest's vmtest.sh on arm64,
    from Manu Bretelle.

11) Improve libbpf's BTF handling to dedup identical structs,
    from Alan Maguire.

12) Add BPF CI and denylist documentation for BPF selftests,
    from Daniel Müller.

13) Check BPF cpumap max_entries before doing allocation work,
    from Florian Lehner.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (70 commits)
  samples/bpf: Fix typo in README
  bpf: Remove the obsolte u64_stats_fetch_*_irq() users.
  bpf: check max_entries before allocating memory
  bpf: Fix a typo in comment for DFS algorithm
  bpftool: Fix spelling mistake "disasembler" -> "disassembler"
  selftests/bpf: Fix bpftool synctypes checking failure
  selftests/bpf: Panic on hard/soft lockup
  docs/bpf: Add documentation for new cgroup local storage
  selftests/bpf: Add test cgrp_local_storage to DENYLIST.s390x
  selftests/bpf: Add selftests for new cgroup local storage
  selftests/bpf: Fix test test_libbpf_str/bpf_map_type_str
  bpftool: Support new cgroup local storage
  libbpf: Support new cgroup local storage
  bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
  bpf: Refactor some inode/task/sk storage functions for reuse
  bpf: Make struct cgroup btf id global
  selftests/bpf: Tracing prog can still do lookup under busy lock
  selftests/bpf: Ensure no task storage failure for bpf_lsm.s prog due to deadlock detection
  bpf: Add new bpf_task_storage_delete proto with no deadlock detection
  bpf: bpf_task_storage_delete_recur does lookup first before the deadlock check
  ...
====================

Link: https://lore.kernel.org/r/20221102062120.5724-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-02 08:18:27 -07:00
Kees Cook
42378a9ca5 bpf, verifier: Fix memory leak in array reallocation for stack state
If an error (NULL) is returned by krealloc(), callers of realloc_array()
were setting their allocation pointers to NULL, but on error krealloc()
does not touch the original allocation. This would result in a memory
resource leak. Instead, free the old allocation on the error handling
path.

The memory leak information is as follows as also reported by Zhengchao:

  unreferenced object 0xffff888019801800 (size 256):
  comm "bpf_repo", pid 6490, jiffies 4294959200 (age 17.170s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000b211474b>] __kmalloc_node_track_caller+0x45/0xc0
    [<0000000086712a0b>] krealloc+0x83/0xd0
    [<00000000139aab02>] realloc_array+0x82/0xe2
    [<00000000b1ca41d1>] grow_stack_state+0xfb/0x186
    [<00000000cd6f36d2>] check_mem_access.cold+0x141/0x1341
    [<0000000081780455>] do_check_common+0x5358/0xb350
    [<0000000015f6b091>] bpf_check.cold+0xc3/0x29d
    [<000000002973c690>] bpf_prog_load+0x13db/0x2240
    [<00000000028d1644>] __sys_bpf+0x1605/0x4ce0
    [<00000000053f29bd>] __x64_sys_bpf+0x75/0xb0
    [<0000000056fedaf5>] do_syscall_64+0x35/0x80
    [<000000002bd58261>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: c69431aab6 ("bpf: verifier: Improve function state reallocation")
Reported-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reported-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Bill Wendling <morbo@google.com>
Cc: Lorenz Bauer <oss@lmb.io>
Link: https://lore.kernel.org/bpf/20221029025433.2533810-1-keescook@chromium.org
2022-11-01 14:29:16 +01:00
Thomas Gleixner
97c4090bad bpf: Remove the obsolte u64_stats_fetch_*_irq() users.
Now that the 32bit UP oddity is gone and 32bit uses always a sequence
count, there is no need for the fetch_irq() variants anymore.

Convert to the regular interface.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/bpf/20221026123110.331690-1-bigeasy@linutronix.de
2022-10-31 23:40:24 +01:00
Florian Lehner
e39e739ab5 bpf: check max_entries before allocating memory
For maps of type BPF_MAP_TYPE_CPUMAP memory is allocated first before
checking the max_entries argument. If then max_entries is greater than
NR_CPUS additional work needs to be done to free allocated memory before
an error is returned.
This changes moves the check on max_entries before the allocation
happens.

Signed-off-by: Florian Lehner <dev@der-flo.net>
Link: https://lore.kernel.org/r/20221028183405.59554-1-dev@der-flo.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-10-28 15:45:58 -07:00
Xu Kuohai
b6d207999c bpf: Fix a typo in comment for DFS algorithm
There is a typo in comment for DFS algorithm in bpf/verifier.c. The top
element should not be popped until all its neighbors have been checked.
Fix it.

Fixes: 475fb78fbf ("bpf: verifier (add branch/goto checks)")
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20221027034458.2925218-1-xukuohai@huaweicloud.com
2022-10-27 13:10:32 -07:00
Yonghong Song
c4bcfb38a9 bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
Similar to sk/inode/task storage, implement similar cgroup local storage.

There already exists a local storage implementation for cgroup-attached
bpf programs.  See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper
bpf_get_local_storage(). But there are use cases such that non-cgroup
attached bpf progs wants to access cgroup local storage data. For example,
tc egress prog has access to sk and cgroup. It is possible to use
sk local storage to emulate cgroup local storage by storing data in socket.
But this is a waste as it could be lots of sockets belonging to a particular
cgroup. Alternatively, a separate map can be created with cgroup id as the key.
But this will introduce additional overhead to manipulate the new map.
A cgroup local storage, similar to existing sk/inode/task storage,
should help for this use case.

The life-cycle of storage is managed with the life-cycle of the
cgroup struct.  i.e. the storage is destroyed along with the owning cgroup
with a call to bpf_cgrp_storage_free() when cgroup itself
is deleted.

The userspace map operations can be done by using a cgroup fd as a key
passed to the lookup, update and delete operations.

Typically, the following code is used to get the current cgroup:
    struct task_struct *task = bpf_get_current_task_btf();
    ... task->cgroups->dfl_cgrp ...
and in structure task_struct definition:
    struct task_struct {
        ....
        struct css_set __rcu            *cgroups;
        ....
    }
With sleepable program, accessing task->cgroups is not protected by rcu_read_lock.
So the current implementation only supports non-sleepable program and supporting
sleepable program will be the next step together with adding rcu_read_lock
protection for rcu tagged structures.

Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup local
storage support, the new map name BPF_MAP_TYPE_CGRP_STORAGE is used
for cgroup storage available to non-cgroup-attached bpf programs. The old
cgroup storage supports bpf_get_local_storage() helper to get the cgroup data.
The new cgroup storage helper bpf_cgrp_storage_get() can provide similar
functionality. While old cgroup storage pre-allocates storage memory, the new
mechanism can also pre-allocate with a user space bpf_map_update_elem() call
to avoid potential run-time memory allocation failure.
Therefore, the new cgroup storage can provide all functionality w.r.t.
the old one. So in uapi bpf.h, the old BPF_MAP_TYPE_CGROUP_STORAGE is alias to
BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED to indicate the old cgroup storage can
be deprecated since the new one can provide the same functionality.

Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221026042850.673791-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-25 23:19:19 -07:00
Yonghong Song
c83597fa5d bpf: Refactor some inode/task/sk storage functions for reuse
Refactor codes so that inode/task/sk storage implementation
can maximally share the same code. I also added some comments
in new function bpf_local_storage_unlink_nolock() to make
codes easy to understand. There is no functionality change.

Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221026042845.672944-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-25 23:19:19 -07:00
Yonghong Song
5e67b8ef12 bpf: Make struct cgroup btf id global
Make struct cgroup btf id global so later patch can reuse
the same btf id.

Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221026042840.672602-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-25 23:19:19 -07:00
Martin KaFai Lau
8a7dac37f2 bpf: Add new bpf_task_storage_delete proto with no deadlock detection
The bpf_lsm and bpf_iter do not recur that will cause a deadlock.
The situation is similar to the bpf_pid_task_storage_delete_elem()
which is called from the syscall map_delete_elem.  It does not need
deadlock detection.  Otherwise, it will cause unnecessary failure
when calling the bpf_task_storage_delete() helper.

This patch adds bpf_task_storage_delete proto that does not do deadlock
detection.  It will be used by bpf_lsm and bpf_iter program.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20221025184524.3526117-8-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-25 23:11:46 -07:00
Martin KaFai Lau
fda64ae0bb bpf: bpf_task_storage_delete_recur does lookup first before the deadlock check
Similar to the earlier change in bpf_task_storage_get_recur.
This patch changes bpf_task_storage_delete_recur such that it
does the lookup first.  It only returns -EBUSY if it needs to
take the spinlock to do the deletion when potential deadlock
is detected.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20221025184524.3526117-7-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-25 23:11:46 -07:00
Martin KaFai Lau
4279adb094 bpf: Add new bpf_task_storage_get proto with no deadlock detection
The bpf_lsm and bpf_iter do not recur that will cause a deadlock.
The situation is similar to the bpf_pid_task_storage_lookup_elem()
which is called from the syscall map_lookup_elem.  It does not need
deadlock detection.  Otherwise, it will cause unnecessary failure
when calling the bpf_task_storage_get() helper.

This patch adds bpf_task_storage_get proto that does not do deadlock
detection.  It will be used by bpf_lsm and bpf_iter programs.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20221025184524.3526117-6-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-25 23:11:46 -07:00
Martin KaFai Lau
e8b02296a6 bpf: Avoid taking spinlock in bpf_task_storage_get if potential deadlock is detected
bpf_task_storage_get() does a lookup and optionally inserts
new data if BPF_LOCAL_STORAGE_GET_F_CREATE is present.

During lookup, it will cache the lookup result and caching requires to
acquire a spinlock.  When potential deadlock is detected (by the
bpf_task_storage_busy pcpu-counter added in
commit bc235cdb42 ("bpf: Prevent deadlock from recursive bpf_task_storage_[get|delete]")),
the current behavior is returning NULL immediately to avoid deadlock.  It is
too pessimistic.  This patch will go ahead to do a lookup (which is a
lockless operation) but it will avoid caching it in order to avoid
acquiring the spinlock.

When lookup fails to find the data and BPF_LOCAL_STORAGE_GET_F_CREATE
is set, an insertion is needed and this requires acquiring a spinlock.
This patch will still return NULL when a potential deadlock is detected.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20221025184524.3526117-5-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-25 23:11:46 -07:00
Martin KaFai Lau
6d65500c34 bpf: Refactor the core bpf_task_storage_get logic into a new function
This patch creates a new function __bpf_task_storage_get() and
moves the core logic of the existing bpf_task_storage_get()
into this new function.   This new function will be shared
by another new helper proto in the latter patch.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20221025184524.3526117-4-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-25 23:11:46 -07:00
Martin KaFai Lau
0593dd34e5 bpf: Append _recur naming to the bpf_task_storage helper proto
This patch adds the "_recur" naming to the bpf_task_storage_{get,delete}
proto.  In a latter patch, they will only be used by the tracing
programs that requires a deadlock detection because a tracing
prog may use bpf_task_storage_{get,delete} recursively and cause a
deadlock.

Another following patch will add a different helper proto for the non
tracing programs because they do not need the deadlock prevention.
This patch does this rename to prepare for this future proto
additions.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20221025184524.3526117-3-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-25 23:11:46 -07:00
Martin KaFai Lau
271de525e1 bpf: Remove prog->active check for bpf_lsm and bpf_iter
The commit 64696c40d0 ("bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampoline")
removed prog->active check for struct_ops prog.  The bpf_lsm
and bpf_iter is also using trampoline.  Like struct_ops, the bpf_lsm
and bpf_iter have fixed hooks for the prog to attach.  The
kernel does not call the same hook in a recursive way.
This patch also removes the prog->active check for
bpf_lsm and bpf_iter.

A later patch has a test to reproduce the recursion issue
for a sleepable bpf_lsm program.

This patch appends the '_recur' naming to the existing
enter and exit functions that track the prog->active counter.
New __bpf_prog_{enter,exit}[_sleepable] function are
added to skip the prog->active tracking. The '_struct_ops'
version is also removed.

It also moves the decision on picking the enter and exit function to
the new bpf_trampoline_{enter,exit}().  It returns the '_recur' ones
for all tracing progs to use.  For bpf_lsm, bpf_iter,
struct_ops (no prog->active tracking after 64696c40d0), and
bpf_lsm_cgroup (no prog->active tracking after 69fd337a97),
it will return the functions that don't track the prog->active.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20221025184524.3526117-2-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-25 23:11:46 -07:00
Jakub Kicinski
96917bb3a3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
include/linux/net.h
  a5ef058dc4 ("net: introduce and use custom sockopt socket flag")
  e993ffe3da ("net: flag sockets supporting msghdr originated zerocopy")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 13:44:11 -07:00
Jakub Kicinski
e28c44450b bpf-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmNVkYkACgkQ6rmadz2v
 bTqzHw/+NYMwfLm5Ck+BK0+HiYU5VVLoG4jp8G7B3sJL/6nUDduajzoqa+nM19Xl
 +HEjbMza7CizmhkCRkzIs1VVtx8mtvKdTxbhvm77SU2+GBn+X1es+XhtFd4EOpok
 MINNHs+cOC/HlnPD/QbFgvxKiKkjyjWxInjUp6Y/mLMcKCn7l9KOkc07/la9Dj3j
 RI0gXCywq1pJaPuTCnt0/wcYLJvzn6QsZnKmmksQwt59GQqOd11HWid3rBWZhDp6
 beEoHDIMGHROtu60vm4DB0p4l6tauZfeXyPCeu3Tx5ZSsypJIyU1iTdKiIUjG963
 ilpy55nrX9bWxadB7LIKHyYfW3in4o+D1ZZaUvLIato/69CZJZ7Uc4kU1RF4Ay1F
 Df1Fmal2WeNAxxETPmQPvVeCePvQvwLHl4KNogdZZvd/67cyc1cDhnuTJp37iPak
 FALHaaw0VOrTdTvxsWym7yEbkhPbCHpPrKYFZFHgGrRTFk/GM2k38mM07lcLxFGw
 aKyooS+eoIZMEgtK5Hma2wpeIVSlkJiJk1d0K20OxdnIUyYEsMXmI+uV1gMxq/8z
 EHNi0+296xOoxy22I1Bd5Tu7fIeecHFN44q7YFmpGsB54UNLpFsP0vYUmYT/6hLI
 Y0KVZu4c3oQDX7ttifMvkeOCURDJBPrZx37bpNpNXF55fB5ehNk=
 =eV7W
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Alexei Starovoitov says:

====================
pull-request: bpf 2022-10-23

We've added 7 non-merge commits during the last 18 day(s) which contain
a total of 8 files changed, 69 insertions(+), 5 deletions(-).

The main changes are:

1) Wait for busy refill_work when destroying bpf memory allocator, from Hou.

2) Allow bpf_user_ringbuf_drain() callbacks to return 1, from David.

3) Fix dispatcher patchable function entry to 5 bytes nop, from Jiri.

4) Prevent decl_tag from being referenced in func_proto, from Stanislav.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Use __llist_del_all() whenever possbile during memory draining
  bpf: Wait for busy refill_work when destroying bpf memory allocator
  bpf: Fix dispatcher patchable function entry to 5 bytes nop
  bpf: prevent decl_tag from being referenced in func_proto
  selftests/bpf: Add reproducer for decl_tag in func_proto return type
  selftests/bpf: Make bpf_user_ringbuf_drain() selftest callback return 1
  bpf: Allow bpf_user_ringbuf_drain() callbacks to return 1
====================

Link: https://lore.kernel.org/r/20221023192244.81137-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 10:32:01 -07:00
Dave Marchevsky
d167330409 bpf: Consider all mem_types compatible for map_{key,value} args
After the previous patch, which added PTR_TO_MEM | MEM_ALLOC type
map_key_value_types, the only difference between map_key_value_types and
mem_types sets is PTR_TO_BUF and PTR_TO_MEM, which are in the latter set
but not the former.

Helpers which expect ARG_PTR_TO_MAP_KEY or ARG_PTR_TO_MAP_VALUE
already effectively expect a valid blob of arbitrary memory that isn't
necessarily explicitly associated with a map. When validating a
PTR_TO_MAP_{KEY,VALUE} arg, the verifier expects meta->map_ptr to have
already been set, either by an earlier ARG_CONST_MAP_PTR arg, or custom
logic like that in process_timer_func or process_kptr_func.

So let's get rid of map_key_value_types and just use mem_types for those
args.

This has the effect of adding PTR_TO_BUF and PTR_TO_MEM to the set of
compatible types for ARG_PTR_TO_MAP_KEY and ARG_PTR_TO_MAP_VALUE.

PTR_TO_BUF is used by various bpf_iter implementations to represent a
chunk of valid r/w memory in ctx args for iter prog.

PTR_TO_MEM is used by networking, tracing, and ringbuf helpers to
represent a chunk of valid memory. The PTR_TO_MEM | MEM_ALLOC
type added in previous commit is specific to ringbuf helpers.
Presence or absence of MEM_ALLOC doesn't change the validity of using
PTR_TO_MEM as a map_{key,val} input.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20221020160721.4030492-2-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-21 19:23:28 -07:00
Dave Marchevsky
9ef40974a8 bpf: Allow ringbuf memory to be used as map key
This patch adds support for the following pattern:

  struct some_data *data = bpf_ringbuf_reserve(&ringbuf, sizeof(struct some_data, 0));
  if (!data)
    return;
  bpf_map_lookup_elem(&another_map, &data->some_field);
  bpf_ringbuf_submit(data);

Currently the verifier does not consider bpf_ringbuf_reserve's
PTR_TO_MEM | MEM_ALLOC ret type a valid key input to bpf_map_lookup_elem.
Since PTR_TO_MEM is by definition a valid region of memory, it is safe
to use it as a key for lookups.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20221020160721.4030492-1-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-21 19:22:42 -07:00
Hou Tao
fa4447cb73 bpf: Use __llist_del_all() whenever possbile during memory draining
Except for waiting_for_gp list, there are no concurrent operations on
free_by_rcu, free_llist and free_llist_extra lists, so use
__llist_del_all() instead of llist_del_all(). waiting_for_gp list can be
deleted by RCU callback concurrently, so still use llist_del_all().

Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20221021114913.60508-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-21 19:17:38 -07:00
Hou Tao
3d05818707 bpf: Wait for busy refill_work when destroying bpf memory allocator
A busy irq work is an unfinished irq work and it can be either in the
pending state or in the running state. When destroying bpf memory
allocator, refill_work may be busy for PREEMPT_RT kernel in which irq
work is invoked in a per-CPU RT-kthread. It is also possible for kernel
with arch_irq_work_has_interrupt() being false (e.g. 1-cpu arm32 host or
mips) and irq work is inovked in timer interrupt.

The busy refill_work leads to various issues. The obvious one is that
there will be concurrent operations on free_by_rcu and free_list between
irq work and memory draining. Another one is call_rcu_in_progress will
not be reliable for the checking of pending RCU callback because
do_call_rcu() may have not been invoked by irq work yet. The other is
there will be use-after-free if irq work is freed before the callback
of irq work is invoked as shown below:

 BUG: kernel NULL pointer dereference, address: 0000000000000000
 #PF: supervisor instruction fetch in kernel mode
 #PF: error_code(0x0010) - not-present page
 PGD 12ab94067 P4D 12ab94067 PUD 1796b4067 PMD 0
 Oops: 0010 [#1] PREEMPT_RT SMP
 CPU: 5 PID: 64 Comm: irq_work/5 Not tainted 6.0.0-rt11+ #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
 RIP: 0010:0x0
 Code: Unable to access opcode bytes at 0xffffffffffffffd6.
 RSP: 0018:ffffadc080293e78 EFLAGS: 00010286
 RAX: 0000000000000000 RBX: ffffcdc07fb6a388 RCX: ffffa05000a2e000
 RDX: ffffa05000a2e000 RSI: ffffffff96cc9827 RDI: ffffcdc07fb6a388
 ......
 Call Trace:
  <TASK>
  irq_work_single+0x24/0x60
  irq_work_run_list+0x24/0x30
  run_irq_workd+0x23/0x30
  smpboot_thread_fn+0x203/0x300
  kthread+0x126/0x150
  ret_from_fork+0x1f/0x30
  </TASK>

Considering the ease of concurrency handling, no overhead for
irq_work_sync() under non-PREEMPT_RT kernel and has-irq-work-interrupt
kernel and the short wait time used for irq_work_sync() under PREEMPT_RT
(When running two test_maps on PREEMPT_RT kernel and 72-cpus host, the
max wait time is about 8ms and the 99th percentile is 10us), just using
irq_work_sync() to wait for busy refill_work to complete before memory
draining and memory freeing.

Fixes: 7c8199e24f ("bpf: Introduce any context BPF specific memory allocator.")
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20221021114913.60508-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-21 19:17:38 -07:00
Jiri Olsa
dbe69b2998 bpf: Fix dispatcher patchable function entry to 5 bytes nop
The patchable_function_entry(5) might output 5 single nop
instructions (depends on toolchain), which will clash with
bpf_arch_text_poke check for 5 bytes nop instruction.

Adding early init call for dispatcher that checks and change
the patchable entry into expected 5 nop instruction if needed.

There's no need to take text_mutex, because we are using it
in early init call which is called at pre-smp time.

Fixes: ceea991a01 ("bpf: Move bpf_dispatcher function out of ftrace locations")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221018075934.574415-1-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-20 18:57:51 -07:00
Jakub Kicinski
94adb5e29e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-20 17:49:10 -07:00
Jakub Kicinski
3566a79c9e bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY08JQQAKCRDbK58LschI
 g0M0AQCWGrJcnQFut1qwR9efZUadwxtKGAgpaA/8Smd8+v7c8AD/SeHQuGfkFiD6
 rx18hv1mTfG0HuPnFQy6YZQ98vmznwE=
 =DaeS
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2022-10-18

We've added 33 non-merge commits during the last 14 day(s) which contain
a total of 31 files changed, 874 insertions(+), 538 deletions(-).

The main changes are:

1) Add RCU grace period chaining to BPF to wait for the completion
   of access from both sleepable and non-sleepable BPF programs,
   from Hou Tao & Paul E. McKenney.

2) Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer
   values. In the wild we have seen OS vendors doing buggy backports
   where helper call numbers mismatched. This is an attempt to make
   backports more foolproof, from Andrii Nakryiko.

3) Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions,
   from Roberto Sassu.

4) Fix libbpf's BTF dumper for structs with padding-only fields,
   from Eduard Zingerman.

5) Fix various libbpf bugs which have been found from fuzzing with
   malformed BPF object files, from Shung-Hsi Yu.

6) Clean up an unneeded check on existence of SSE2 in BPF x86-64 JIT,
   from Jie Meng.

7) Fix various ASAN bugs in both libbpf and selftests when running
   the BPF selftest suite on arm64, from Xu Kuohai.

8) Fix missing bpf_iter_vma_offset__destroy() call in BPF iter selftest
   and use in-skeleton link pointer to remove an explicit bpf_link__destroy(),
   from Jiri Olsa.

9) Fix BPF CI breakage by pointing to iptables-legacy instead of relying
   on symlinked iptables which got upgraded to iptables-nft,
   from Martin KaFai Lau.

10) Minor BPF selftest improvements all over the place, from various others.

* tag 'for-netdev' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (33 commits)
  bpf/docs: Update README for most recent vmtest.sh
  bpf: Use rcu_trace_implies_rcu_gp() for program array freeing
  bpf: Use rcu_trace_implies_rcu_gp() in local storage map
  bpf: Use rcu_trace_implies_rcu_gp() in bpf memory allocator
  rcu-tasks: Provide rcu_trace_implies_rcu_gp()
  selftests/bpf: Use sys_pidfd_open() helper when possible
  libbpf: Fix null-pointer dereference in find_prog_by_sec_insn()
  libbpf: Deal with section with no data gracefully
  libbpf: Use elf_getshdrnum() instead of e_shnum
  selftest/bpf: Fix error usage of ASSERT_OK in xdp_adjust_tail.c
  selftests/bpf: Fix error failure of case test_xdp_adjust_tail_grow
  selftest/bpf: Fix memory leak in kprobe_multi_test
  selftests/bpf: Fix memory leak caused by not destroying skeleton
  libbpf: Fix memory leak in parse_usdt_arg()
  libbpf: Fix use-after-free in btf_dump_name_dups
  selftests/bpf: S/iptables/iptables-legacy/ in the bpf_nf and xdp_synproxy test
  selftests/bpf: Alphabetize DENYLISTs
  selftests/bpf: Add tests for _opts variants of bpf_*_get_fd_by_id()
  libbpf: Introduce bpf_link_get_fd_by_id_opts()
  libbpf: Introduce bpf_btf_get_fd_by_id_opts()
  ...
====================

Link: https://lore.kernel.org/r/20221018210631.11211-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-18 18:56:43 -07:00
Hou Tao
4835f9ee98 bpf: Use rcu_trace_implies_rcu_gp() for program array freeing
To support both sleepable and normal uprobe bpf program, the freeing of
trace program array chains a RCU-tasks-trace grace period and a normal
RCU grace period one after the other.

With the introduction of rcu_trace_implies_rcu_gp(),
__bpf_prog_array_free_sleepable_cb() can check whether or not a normal
RCU grace period has also passed after a RCU-tasks-trace grace period
has passed. If it is true, it is safe to invoke kfree() directly.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20221014113946.965131-5-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-18 10:27:02 -07:00
Hou Tao
d39d1445d3 bpf: Use rcu_trace_implies_rcu_gp() in local storage map
Local storage map is accessible for both sleepable and non-sleepable bpf
program, and its memory is freed by using both call_rcu_tasks_trace() and
kfree_rcu() to wait for both RCU-tasks-trace grace period and RCU grace
period to pass.

With the introduction of rcu_trace_implies_rcu_gp(), both
bpf_selem_free_rcu() and bpf_local_storage_free_rcu() can check whether
or not a normal RCU grace period has also passed after a RCU-tasks-trace
grace period has passed. If it is true, it is safe to call kfree()
directly.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20221014113946.965131-4-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-18 10:27:02 -07:00
Hou Tao
59be91e5e7 bpf: Use rcu_trace_implies_rcu_gp() in bpf memory allocator
The memory free logic in bpf memory allocator chains a RCU Tasks Trace
grace period and a normal RCU grace period one after the other, so it
can ensure that both sleepable and non-sleepable programs have finished.

With the introduction of rcu_trace_implies_rcu_gp(),
__free_rcu_tasks_trace() can check whether or not a normal RCU grace
period has also passed after a RCU Tasks Trace grace period has passed.
If it is true, freeing these elements directly, else freeing through
call_rcu().

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20221014113946.965131-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-18 10:27:02 -07:00
Linus Torvalds
bb1a114646 cgroup fixes for v6.1-rc1
* Fix a recent regression where a sleeping kernfs function is called with
   css_set_lock (spinlock) held.
 
 * Revert the commit to enable cgroup1 support for cgroup_get_from_fd/file().
   Multiple users assume that the lookup only works for cgroup2 and breaks
   when fed a cgroup1 file. Instead, introduce a separate set of functions to
   lookup both v1 and v2 and use them where the user explicitly wants to
   support both versions.
 
 * Compat update for tools/perf/util/bpf_skel/bperf_cgroup.bpf.c.
 
 * Add Josef Bacik as a blkcg maintainer.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCY03MlA4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGTkUAQD7fNcSPuc2m/BvW+gySKQkp9PZMA6E6yOIqirc
 QKmIgwEAwWECW7RR1alhOGD50RtYkuYVsLD1+6Ka4eMHe+EhwA4=
 =kGLI
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fixes from Tejun Heo:

 - Fix a recent regression where a sleeping kernfs function is called
   with css_set_lock (spinlock) held

 - Revert the commit to enable cgroup1 support for cgroup_get_from_fd/file()

   Multiple users assume that the lookup only works for cgroup2 and
   breaks when fed a cgroup1 file. Instead, introduce a separate set of
   functions to lookup both v1 and v2 and use them where the user
   explicitly wants to support both versions.

 - Compat update for tools/perf/util/bpf_skel/bperf_cgroup.bpf.c.

 - Add Josef Bacik as a blkcg maintainer.

* tag 'cgroup-for-6.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  blkcg: Update MAINTAINERS entry
  mm: cgroup: fix comments for get from fd/file helpers
  perf stat: Support old kernels for bperf cgroup counting
  bpf: cgroup_iter: support cgroup1 using cgroup fd
  cgroup: add cgroup_v1v2_get_from_[fd/file]()
  Revert "cgroup: enable cgroup_get_from_file() on cgroup1"
  cgroup: Reorganize css_set_lock and kernfs path processing
2022-10-17 18:52:43 -07:00
Stanislav Fomichev
ea68376c8b bpf: prevent decl_tag from being referenced in func_proto
Syzkaller was able to hit the following issue:

------------[ cut here ]------------
WARNING: CPU: 0 PID: 3609 at kernel/bpf/btf.c:1946
btf_type_id_size+0x2d5/0x9d0 kernel/bpf/btf.c:1946
Modules linked in:
CPU: 0 PID: 3609 Comm: syz-executor361 Not tainted
6.0.0-syzkaller-02734-g0326074ff465 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 09/22/2022
RIP: 0010:btf_type_id_size+0x2d5/0x9d0 kernel/bpf/btf.c:1946
Code: ef e8 7f 8e e4 ff 41 83 ff 0b 77 28 f6 44 24 10 18 75 3f e8 6d 91
e4 ff 44 89 fe bf 0e 00 00 00 e8 20 8e e4 ff e8 5b 91 e4 ff <0f> 0b 45
31 f6 e9 98 02 00 00 41 83 ff 12 74 18 e8 46 91 e4 ff 44
RSP: 0018:ffffc90003cefb40 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000000
RDX: ffff8880259c0000 RSI: ffffffff81968415 RDI: 0000000000000005
RBP: ffff88801270ca00 R08: 0000000000000005 R09: 000000000000000e
R10: 0000000000000011 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000011 R14: ffff888026ee6424 R15: 0000000000000011
FS:  000055555641b300(0000) GS:ffff8880b9a00000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000f2e258 CR3: 000000007110e000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 btf_func_proto_check kernel/bpf/btf.c:4447 [inline]
 btf_check_all_types kernel/bpf/btf.c:4723 [inline]
 btf_parse_type_sec kernel/bpf/btf.c:4752 [inline]
 btf_parse kernel/bpf/btf.c:5026 [inline]
 btf_new_fd+0x1926/0x1e70 kernel/bpf/btf.c:6892
 bpf_btf_load kernel/bpf/syscall.c:4324 [inline]
 __sys_bpf+0xb7d/0x4cf0 kernel/bpf/syscall.c:5010
 __do_sys_bpf kernel/bpf/syscall.c:5069 [inline]
 __se_sys_bpf kernel/bpf/syscall.c:5067 [inline]
 __x64_sys_bpf+0x75/0xb0 kernel/bpf/syscall.c:5067
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f0fbae41c69
Code: 28 c3 e8 2a 14 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 89
f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01
f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffc8aeb6228 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f0fbae41c69
RDX: 0000000000000020 RSI: 0000000020000140 RDI: 0000000000000012
RBP: 00007f0fbae05e10 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000ffffffff R11: 0000000000000246 R12: 00007f0fbae05ea0
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
 </TASK>

Looks like it tries to create a func_proto which return type is
decl_tag. For the details, see Martin's spot on analysis in [0].

0: https://lore.kernel.org/bpf/CAKH8qBuQDLva_hHxxBuZzyAcYNO4ejhovz6TQeVSk8HY-2SO6g@mail.gmail.com/T/#mea6524b3fcd6298347432226e81b1e6155efc62c

Cc: Yonghong Song <yhs@fb.com>
Cc: Martin KaFai Lau <martin.lau@kernel.org>
Fixes: bd16dee66a ("bpf: Add BTF_KIND_DECL_TAG typedef support")
Reported-by: syzbot+d8bd751aef7c6b39a344@syzkaller.appspotmail.com
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221015002444.2680969-2-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-10-17 10:56:12 -07:00
David Vernet
c92a7a5224 bpf: Allow bpf_user_ringbuf_drain() callbacks to return 1
The bpf_user_ringbuf_drain() helper function allows a BPF program to
specify a callback that is invoked when draining entries from a
BPF_MAP_TYPE_USER_RINGBUF ring buffer map. The API is meant to allow the
callback to return 0 if it wants to continue draining samples, and 1 if
it's done draining. Unfortunately, bpf_user_ringbuf_drain() landed shortly
after commit 1bfe26fb08 ("bpf: Add verifier support for custom
callback return range"), which changed the default behavior of callbacks
to only support returning 0.

This patch corrects that oversight by allowing bpf_user_ringbuf_drain()
callbacks to return 0 or 1. A follow-on patch will update the
user_ringbuf selftests to also return 1 from a bpf_user_ringbuf_drain()
callback to prevent this from regressing in the future.

Fixes: 2057156738 ("bpf: Add bpf_user_ringbuf_drain() helper")
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20221012232015.1510043-2-void@manifault.com
2022-10-13 08:27:37 -07:00
Jason A. Donenfeld
a251c17aa5 treewide: use get_random_u32() when possible
The prandom_u32() function has been a deprecated inline wrapper around
get_random_u32() for several releases now, and compiles down to the
exact same code. Replace the deprecated wrapper with a direct call to
the real function. The same also applies to get_random_int(), which is
just a wrapper around get_random_u32(). This was done as a basic find
and replace.

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz> # for ext4
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake
Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Acked-by: Helge Deller <deller@gmx.de> # for parisc
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-10-11 17:42:58 -06:00
Jason A. Donenfeld
81895a65ec treewide: use prandom_u32_max() when possible, part 1
Rather than incurring a division or requesting too many random bytes for
the given range, use the prandom_u32_max() function, which only takes
the minimum required bytes from the RNG and avoids divisions. This was
done mechanically with this coccinelle script:

@basic@
expression E;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
typedef u64;
@@
(
- ((T)get_random_u32() % (E))
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ((E) - 1))
+ prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2)
|
- ((u64)(E) * get_random_u32() >> 32)
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ~PAGE_MASK)
+ prandom_u32_max(PAGE_SIZE)
)

@multi_line@
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
identifier RAND;
expression E;
@@

-       RAND = get_random_u32();
        ... when != RAND
-       RAND %= (E);
+       RAND = prandom_u32_max(E);

// Find a potential literal
@literal_mask@
expression LITERAL;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
position p;
@@

        ((T)get_random_u32()@p & (LITERAL))

// Add one to the literal.
@script:python add_one@
literal << literal_mask.LITERAL;
RESULT;
@@

value = None
if literal.startswith('0x'):
        value = int(literal, 16)
elif literal[0] in '123456789':
        value = int(literal, 10)
if value is None:
        print("I don't know how to handle %s" % (literal))
        cocci.include_match(False)
elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1:
        print("Skipping 0x%x for cleanup elsewhere" % (value))
        cocci.include_match(False)
elif value & (value + 1) != 0:
        print("Skipping 0x%x because it's not a power of two minus one" % (value))
        cocci.include_match(False)
elif literal.startswith('0x'):
        coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1))
else:
        coccinelle.RESULT = cocci.make_expr("%d" % (value + 1))

// Replace the literal mask with the calculated result.
@plus_one@
expression literal_mask.LITERAL;
position literal_mask.p;
expression add_one.RESULT;
identifier FUNC;
@@

-       (FUNC()@p & (LITERAL))
+       prandom_u32_max(RESULT)

@collapse_ret@
type T;
identifier VAR;
expression E;
@@

 {
-       T VAR;
-       VAR = (E);
-       return VAR;
+       return E;
 }

@drop_var@
type T;
identifier VAR;
@@

 {
-       T VAR;
        ... when != VAR
 }

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: KP Singh <kpsingh@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap
Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-10-11 17:42:55 -06:00
Yosry Ahmed
35256d673a bpf: cgroup_iter: support cgroup1 using cgroup fd
Use cgroup_v1v2_get_from_fd() in cgroup_iter to support attaching to
both cgroup v1 and v2 using fds.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-10-11 06:50:15 -10:00