Commit Graph

1428970 Commits

Author SHA1 Message Date
Amery Hung
bb6d9f5cf1 selftests/bpf: Simplify task_local_data memory allocation
Simplify data allocation by always using aligned_alloc() and passing
size_pot, the size rounded up to the nearest power of two, as the alignment.

Currently, aligned_alloc(page_size, size) is only intended to be used
with memory allocators that can fulfill the request without rounding
size up to page_size to conserve memory. This is enabled by defining
TLD_DATA_USE_ALIGNED_ALLOC. The reason to align to page_size is due to
the limitation of UPTR where only a page can be pinned to the kernel.
Otherwise, malloc(size * 2) is used to allocate memory for data.

However, we don't need to call aligned_alloc(page_size, size) to get
a contiguous memory of size bytes within a page. aligned_alloc(size_pot,
...) will also do the trick. Therefore, just use aligned_alloc(size_pot,
...) universally.

As for the size argument, create a new option,
TLD_DONT_ROUND_UP_DATA_SIZE, to specify not rounding up the size.
This preserves the current TLD_DATA_USE_ALIGNED_ALLOC behavior, allowing
memory allocators with low overhead aligned_alloc() to not waste memory.
To enable this, users need to make sure it is not undefined behavior
for the memory allocator when size is not an integral multiple of
alignment.

Compared to the current implementation, !TLD_DATA_USE_ALIGNED_ALLOC
used to always waste size bytes of memory due to malloc(size * 2).
Now the worst case becomes size - 1 and the best case is 0 when the size
is already a power of two.
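
The within-a-page argument can be illustrated with a small userspace sketch. It is not the selftest code: round_up_pot(), alloc_within_page() and stays_in_page() are hypothetical helpers, and the sketch assumes size_pot does not exceed the page size and that the allocator accepts any power-of-two alignment. A block that starts on a size_pot boundary and is at most size_pot bytes long cannot straddle a page boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: round size up to the nearest power of two. */
static size_t round_up_pot(size_t size)
{
	size_t pot = 1;

	assert(size > 0);
	while (pot < size)
		pot <<= 1;
	return pot;
}

/*
 * Hypothetical helper: allocate at least size bytes aligned to size_pot.
 * The size is rounded up as well, so it stays a multiple of the alignment.
 */
static void *alloc_within_page(size_t size)
{
	size_t size_pot = round_up_pot(size);

	return aligned_alloc(size_pot, size_pot);
}

/* Return 1 if [p, p + size) lies within a single page. */
static int stays_in_page(const void *p, size_t size, size_t page_size)
{
	uintptr_t start = (uintptr_t)p;

	return start / page_size == (start + size - 1) / page_size;
}
```

Passing the unrounded size as the second argument would correspond to the TLD_DONT_ROUND_UP_DATA_SIZE case, which is only safe with allocators that accept a size that is not a multiple of the alignment.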

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Link: https://lore.kernel.org/r/20260331213555.1993883-3-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-02 15:11:08 -07:00
Amery Hung
7c8ca532a7 selftests/bpf: Fix task_local_data data allocation size
Currently, when allocating memory for data, the size of tld_data_u->start
is not taken into account. This may cause OOB access. Fix it by adding
the size of the non-flexible array part of tld_data_u.

Besides, explicitly align tld_data_u->data to 8 bytes in case some
fields are added before data in the future. It could break the
assumption that every data field is 8 byte aligned and
sizeof(tld_data_u) will no longer be equal to
offsetof(struct tld_data_u, data), which we use interchangeably.
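
A minimal sketch of the sizing rule, using a hypothetical stand-in for tld_data_u (the field type of start is an assumption, and demo_alloc_size() is not part of the selftest):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for tld_data_u; the field types are assumptions. */
struct demo_data_u {
	uint64_t start;
	/* Explicit alignment keeps data 8-byte aligned even if fields are added. */
	char data[] __attribute__((aligned(8)));
};

/* The two expressions are only interchangeable while this holds. */
static_assert(sizeof(struct demo_data_u) == offsetof(struct demo_data_u, data),
	      "header size must equal the offset of data");

/* Bytes to allocate: the fixed header plus the payload. */
static size_t demo_alloc_size(size_t data_size)
{
	return offsetof(struct demo_data_u, data) + data_size;
}
```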

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Acked-by: Sun Jian <sun.jian.kdev@gmail.com>
Link: https://lore.kernel.org/r/20260331213555.1993883-2-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-02 15:11:08 -07:00
Andrii Nakryiko
e8aec1058c Merge branch 'libbpf-clarify-raw-address-single-kprobe-attach-behavior'
Hoyeon Lee says:

====================
libbpf: clarify raw-address single kprobe attach behavior

Today libbpf documents single-kprobe attach through func_name, with an
optional offset. For the PMU-based path, func_name = NULL with an
absolute address in offset already works as well, but that is not
described in the API.

This patchset clarifies this behavior. First commit fixes kprobe
and uprobe attach error handling to use direct error codes. Next adds
kprobe API comments for the raw-address form and rejects it explicitly
for legacy tracefs/debugfs kprobes. Last adds PERF and LINK selftests
for the raw-address form, and checks that LEGACY rejects it.
---
Changes in v7:
- Change selftest line wrapping and assertions

Changes in v6:
- Split the kprobe/uprobe direct error-code fix into a separate patch

Changes in v5:
- Add kprobe API docs, use -EOPNOTSUPP, and switch selftests to LIBBPF_OPTS

Changes in v4:
- Inline raw-address error formatting and remove the probe_target buffer

Changes in v3:
- Drop bpf_kprobe_opts.addr and reuse offset when func_name is NULL
- Make legacy tracefs/debugfs kprobes reject the raw-address form
- Update selftests to cover PERF/LINK raw-address attach and LEGACY reject

Changes in v2:
- Fix line wrapping and indentation
====================

Link: https://patch.msgid.link/20260401143116.185049-1-hoyeon.lee@suse.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2026-04-02 13:23:19 -07:00
Hoyeon Lee
9d77cefe8f selftests/bpf: Add test for raw-address single kprobe attach
Currently, attach_probe covers manual single-kprobe attaches by
func_name, but not the raw-address form that the PMU-based
single-kprobe path can accept.

This commit adds PERF and LINK raw-address coverage. It resolves
SYS_NANOSLEEP_KPROBE_NAME through kallsyms, passes the absolute address
in bpf_kprobe_opts.offset with func_name = NULL, and verifies that
kprobe and kretprobe are still triggered. It also verifies that LEGACY
rejects the same form.

Signed-off-by: Hoyeon Lee <hoyeon.lee@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20260401143116.185049-4-hoyeon.lee@suse.com
2026-04-02 13:23:19 -07:00
Hoyeon Lee
e1621c7528 libbpf: Clarify raw-address single kprobe attach behavior
bpf_program__attach_kprobe_opts() documents single-kprobe attach
through func_name, with an optional offset. For the PMU-based path,
func_name = NULL with an absolute address in offset already works as
well, but that is not described in the API.

This commit clarifies this existing non-legacy behavior. For PMU-based
attach, callers can use func_name = NULL with an absolute address in
offset as the raw-address form. For legacy tracefs/debugfs kprobes,
reject this form explicitly.
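
The rule can be modeled in a few lines. This is a hypothetical model and not libbpf code: demo_check_kprobe_target() does not exist in libbpf, and returning -EOPNOTSUPP for the legacy rejection is an assumption based on the v5 note in the series cover letter.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model: decide whether a single-kprobe target is acceptable
 * for the given attach path.
 */
static int demo_check_kprobe_target(const char *func_name, size_t offset,
				    bool legacy)
{
	/* With func_name == NULL, offset carries the absolute address. */
	(void)offset;

	if (!func_name && legacy)
		return -EOPNOTSUPP; /* legacy tracefs/debugfs path rejects it */
	return 0;
}
```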

Signed-off-by: Hoyeon Lee <hoyeon.lee@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20260401143116.185049-3-hoyeon.lee@suse.com
2026-04-02 13:23:19 -07:00
Hoyeon Lee
f547cf7947 libbpf: Use direct error codes for kprobe/uprobe attach
The perf_event_open_probe() and perf_event_{k,u}probe_open_legacy() helpers
return negative error codes directly on failure. This commit
changes bpf_program__attach_{k,u}probe_opts() to use those return
values directly instead of re-reading possibly changed errno.
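
Why re-reading errno is fragile can be shown with a self-contained sketch. All demo_* functions are hypothetical stand-ins, not libbpf functions; demo_report_failure() simulates any later call that overwrites errno.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a helper that returns -errno on failure. */
static int demo_open_probe(void)
{
	errno = ENOENT;
	return -ENOENT;
}

/* Hypothetical stand-in for later work (logging, cleanup) that clobbers errno. */
static void demo_report_failure(void)
{
	errno = EBADF;
}

/* Fragile: re-reads errno after other calls may have changed it. */
static int demo_attach_via_errno(void)
{
	int pfd = demo_open_probe();

	if (pfd < 0) {
		demo_report_failure();
		return -errno;
	}
	return pfd;
}

/* Robust: propagates the helper's return value directly. */
static int demo_attach_direct(void)
{
	int pfd = demo_open_probe();

	if (pfd < 0) {
		int err = pfd;

		demo_report_failure();
		return err;
	}
	return pfd;
}
```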

Signed-off-by: Hoyeon Lee <hoyeon.lee@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20260401143116.185049-2-hoyeon.lee@suse.com
2026-04-02 13:23:19 -07:00
Mykyta Yatsenko
1cc96e0e20 libbpf: Fix BTF handling in bpf_program__clone()
Align bpf_program__clone() with bpf_object_load_prog() by gating
BTF func/line info on FEAT_BTF_FUNC kernel support, and resolve
caller-provided prog_btf_fd before checking obj->btf so that callers
with their own BTF can use clone() even when the object has no BTF
loaded.

While at it, treat func_info and line_info fields as atomic groups
to prevent mismatches between pointer and count from different sources.

Move bpf_program__clone() to libbpf 1.8.

Fixes: 970bd2dced ("libbpf: Introduce bpf_program__clone()")
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260401151640.356419-1-mykyta.yatsenko5@gmail.com
2026-04-02 13:02:46 -07:00
Alexei Starovoitov
e25cfbec08 Merge branch 'bpf-migrate-bpf_task_work-and-file-dynptr-to-kmalloc_nolock'
Mykyta Yatsenko says:

====================
bpf: Migrate bpf_task_work and file dynptr to kmalloc_nolock

Now that kmalloc can be used from NMI context via kmalloc_nolock(),
migrate BPF internal allocations away from bpf_mem_alloc to use the
standard slab allocator.

Use kfree_rcu() for deferred freeing, which waits for a regular RCU
grace period before the memory is reclaimed. Sleepable BPF programs
hold rcu_read_lock_trace but not regular rcu_read_lock, so patch 1
adds explicit rcu_read_lock/unlock around the pointer-to-refcount
window to prevent kfree_rcu from freeing memory while a sleepable
program is still between reading the pointer and acquiring a
reference.

Patch 1 migrates bpf_task_work_ctx from bpf_mem_alloc/bpf_mem_free to
kmalloc_nolock/kfree_rcu.

Patch 2 migrates bpf_dynptr_file_impl from bpf_mem_alloc/bpf_mem_free
to kmalloc_nolock/kfree_nolock.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
Changes in v2:
- Switch to scoped_guard in patch 1 (Kumar)
- Remove rcu gp wait in patch 2 (Kumar)
- Defer to irq_work when irqs disabled in patch 1
- use bpf_map_kmalloc_nolock() for bpf_task_work
- use kmalloc_nolock() for file dynptr
- Link to v1: https://lore.kernel.org/all/20260325-kmalloc_special-v1-0-269666afb1ea@meta.com/
====================

Link: https://patch.msgid.link/20260330-kmalloc_special-v2-0-c90403f92ff0@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-02 09:31:49 -07:00
Mykyta Yatsenko
cc878b4144 bpf: Migrate dynptr file to kmalloc_nolock
Replace bpf_mem_alloc/bpf_mem_free with kmalloc_nolock/kfree_nolock for
bpf_dynptr_file_impl, continuing the migration away from bpf_mem_alloc
now that kmalloc can be used from NMI context.

freader_cleanup() runs before kfree_nolock() while the dynptr still
holds exclusive access, so plain kfree_nolock() is safe — no concurrent
readers can access the object.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260330-kmalloc_special-v2-2-c90403f92ff0@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-02 09:31:42 -07:00
Mykyta Yatsenko
90f51ebff2 bpf: Migrate bpf_task_work to kmalloc_nolock
Replace bpf_mem_alloc/bpf_mem_free with
kmalloc_nolock/kfree_rcu for bpf_task_work_ctx.

Replace guard(rcu_tasks_trace)() with guard(rcu)() in
bpf_task_work_irq(). The function only accesses ctx struct members
(not map values), so tasks trace protection is not needed - regular
RCU is sufficient since ctx is freed via kfree_rcu. The guard in
bpf_task_work_callback() remains as tasks trace since it accesses map
values from process context.

Sleepable BPF programs hold rcu_read_lock_trace but not
regular rcu_read_lock. Since kfree_rcu
waits for a regular RCU grace period, the ctx memory can be freed
while a sleepable program is still running. Add scoped_guard(rcu)
around the pointer read and refcount tryget in
bpf_task_work_acquire_ctx to close this race window.
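
A tryget only takes a reference if the count has not already dropped to zero; in the kernel, the surrounding RCU read-side section keeps the memory valid while that check runs. The sketch below models just the tryget step in userspace with C11 atomics. demo_ctx and demo_tryget() are hypothetical, and the RCU protection itself is not modeled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical context object with a reference count. */
struct demo_ctx {
	atomic_int refcnt;
};

/* Take a reference unless the count already reached zero. */
static bool demo_tryget(struct demo_ctx *ctx)
{
	int old = atomic_load(&ctx->refcnt);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&ctx->refcnt, &old, old + 1))
			return true;
	}
	return false;
}
```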

Since kfree_rcu uses call_rcu internally which is not safe from
NMI context, defer destruction via irq_work when irqs are disabled.

For the lost-cmpxchg path the ctx was never published, so
kfree_nolock is safe.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260330-kmalloc_special-v2-1-c90403f92ff0@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-02 09:31:42 -07:00
Alexei Starovoitov
f760104402 Merge branch 'bpf-fix-abuse-of-kprobe_write_ctx-via-freplace'
Leon Hwang says:

====================
bpf: Fix abuse of kprobe_write_ctx via freplace

The potential issue of kprobe_write_ctx+freplace was mentioned in
"bpf: Disallow !kprobe_write_ctx progs tail-calling kprobe_write_ctx progs" [1].

This is a real issue: the test in patch #2 verifies that kprobe_write_ctx=false
kprobe progs can be abused to modify struct pt_regs via kprobe_write_ctx=true
freplace progs.

When struct pt_regs is modified, bpf_prog_test_run_opts() gets -EFAULT instead
of 0.

test_freplace_kprobe_write_ctx:FAIL:bpf_prog_test_run_opts unexpected error: -14 (errno 14)

We will disallow attaching freplace programs on kprobe programs with different
kprobe_write_ctx values.

Links:
[1] https://lore.kernel.org/bpf/CAP01T74w4KVMn9bEwpQXrk+bqcUxzb6VW1SQ_QvNy0A4EY-9Jg@mail.gmail.com/

Changes:
v2 -> v3:
* Add comment to the rejection of kprobe_write_ctx (per Jiri).
* Use libbpf_get_error() instead of errno in test (per Jiri).
* Collect Acked-by tags from Jiri and Song, thanks.
v2: https://lore.kernel.org/bpf/20260326141718.17731-1-leon.hwang@linux.dev/

v1 -> v2:
* Drop patch #1 in v1, as it wasn't an issue (per Toke).
* Check kprobe_write_ctx value at attach time instead of at load time, to
  prevent attaching kprobe_write_ctx=true freplace progs on
  kprobe_write_ctx=false kprobe progs (per Gemini/sashiko).
* Move kprobe_write_ctx test code to attach_probe.c and kprobe_write_ctx.c.
v1: https://lore.kernel.org/bpf/20260324150444.68166-1-leon.hwang@linux.dev/
====================

Link: https://patch.msgid.link/20260331145353.87606-1-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-02 09:29:49 -07:00
Leon Hwang
da77f3a9aa selftests/bpf: Add test to verify the fix of kprobe_write_ctx abuse
Add a test to verify the issue: kprobe_write_ctx can be abused to modify
struct pt_regs of kernel functions via kprobe_write_ctx=true freplace
progs.

Without the fix, the issue is verified:

kprobe_write_ctx=true freplace prog is allowed to attach to
kprobe_write_ctx=false kprobe prog. Then, the first arg of
bpf_fentry_test1 will be set as 0, and bpf_prog_test_run_opts() gets
-EFAULT instead of 0.

With the fix, the issue is rejected at attach time.

Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260331145353.87606-3-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-02 09:29:49 -07:00
Leon Hwang
611fe4b79a bpf: Fix abuse of kprobe_write_ctx via freplace
uprobe programs are allowed to modify struct pt_regs.

Since the actual program type of uprobe is KPROBE, it can be abused to
modify struct pt_regs via kprobe+freplace when the kprobe attaches to
kernel functions.

For example,

SEC("?kprobe")
int kprobe(struct pt_regs *regs)
{
	return 0;
}

SEC("?freplace")
int freplace_kprobe(struct pt_regs *regs)
{
	regs->di = 0;
	return 0;
}

freplace_kprobe prog will attach to kprobe prog.
kprobe prog will attach to a kernel function.

Without this patch, when the kernel function runs, its first arg will
always be set as 0 via the freplace_kprobe prog.

To fix the abuse of kprobe_write_ctx=true via kprobe+freplace, disallow
attaching freplace programs on kprobe programs with different
kprobe_write_ctx values.
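
The attach-time rule can be sketched as a simple comparison. This is a hypothetical model rather than the verifier code: demo_check_freplace() and its parameters are illustrative, and the -EINVAL error code is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical model: an freplace prog may only attach to a kprobe prog
 * whose kprobe_write_ctx value matches its own.
 */
static int demo_check_freplace(bool tgt_is_kprobe, bool tgt_write_ctx,
			       bool prog_write_ctx)
{
	if (tgt_is_kprobe && tgt_write_ctx != prog_write_ctx)
		return -EINVAL; /* assumed error code */
	return 0;
}
```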

Fixes: 7384893d97 ("bpf: Allow uprobe program to change context registers")
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260331145353.87606-2-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-02 09:29:49 -07:00
Mykyta Yatsenko
0eeb0094ba selftests/bpf: Suppress veristat error messages in non-verbose mode
When running veristat across many BPF objects, expected load failures
produce noisy stderr output that obscures actual issues. Gate these
diagnostic messages behind --verbose.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260331172634.57402-2-mykyta.yatsenko5@gmail.com
2026-03-31 15:55:47 -07:00
Menglong Dong
3e6475dc60 selftests/bpf: Test access to ringbuf position with map pointer
Add a test that accesses the bpf_ringbuf through the map pointer.
Both "consumer_pos" and "producer_pos" are accessed in this test. We reserve
128 bytes in the ringbuf to test producer_pos, which should then be
"128 + BPF_RINGBUF_HDR_SZ".

It will be helpful if we want to evaluate the usage of the ringbuf in bpf
prog with the consumer and producer position.
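
The expected position can be worked out with a short sketch. It assumes an 8-byte record header (DEMO_RINGBUF_HDR_SZ stands in for BPF_RINGBUF_HDR_SZ) and that record lengths are rounded up to 8 bytes; the helper name is hypothetical.

```c
#include <assert.h>

#define DEMO_RINGBUF_HDR_SZ 8UL /* assumed to match BPF_RINGBUF_HDR_SZ */

/* Producer position after reserving size bytes starting at producer_pos. */
static unsigned long demo_pos_after_reserve(unsigned long producer_pos,
					    unsigned long size)
{
	/* Header plus payload, rounded up to a multiple of 8 bytes. */
	unsigned long len = (size + DEMO_RINGBUF_HDR_SZ + 7UL) & ~7UL;

	return producer_pos + len;
}
```

Under these assumptions, reserving 128 bytes in an empty ring buffer moves producer_pos to 136.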

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/bpf/20260331070434.10037-1-dongml2@chinatelecom.cn
2026-03-31 15:47:14 -07:00
Eyal Birger
f9a80c7ce4 bpf: Clarify BPF_RB_NO_WAKEUP behavior for bpf_ringbuf_discard()
Clarify bpf_ringbuf_discard() documentation for BPF_RB_NO_WAKEUP.

Discarded ring buffer records are still left in the ring buffer and are
only skipped when user space consumes them. This can matter when
BPF_RB_NO_WAKEUP is used: a later submit relying on adaptive wakeup
might not wake the consumer, because the discarded record still needs to
be consumed first.

Scenario:

epoll_wait(rb_fd);                     // blocks

rec = bpf_ringbuf_reserve(&rb, ...);
bpf_ringbuf_discard(rec, BPF_RB_NO_WAKEUP);

rec = bpf_ringbuf_reserve(&rb, ...);
bpf_ringbuf_submit(rec, 0);           // valid record, but no wakeup

Document this in bpf_ringbuf_discard() to make the interaction between
discarded records, user-space consumption, and adaptive wakeups explicit.
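
The scenario can be traced with a simplified model of the adaptive wakeup decision. The model is an assumption about the rule (wake only when the consumer has caught up to the record being committed), and the demo_* names and flag values are illustrative rather than kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_RB_NO_WAKEUP    (1UL << 0) /* illustrative flag values */
#define DEMO_RB_FORCE_WAKEUP (1UL << 1)

/*
 * Simplified model: should committing (submit or discard) the record at
 * rec_pos wake the consumer currently at cons_pos?
 */
static bool demo_should_wake(unsigned long cons_pos, unsigned long rec_pos,
			     unsigned long flags)
{
	if (flags & DEMO_RB_FORCE_WAKEUP)
		return true;
	if (flags & DEMO_RB_NO_WAKEUP)
		return false;
	/* Adaptive: wake only if the consumer has caught up to this record. */
	return cons_pos == rec_pos;
}
```

In this model the discard at position 0 with the no-wakeup flag does not notify, and the later submit at position 16 does not either, because the consumer is still at 0 behind the discarded record.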

Reported-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260331130612.3762433-1-eyal.birger@gmail.com

----

v2: adapt wording per feedback from Andrii.
2026-03-31 15:46:34 -07:00
Jiri Olsa
9eccdd38fb bpf: Fix block device hooks names
Use proper names for block device hooks.

Fixes: 46df585fcf ("bpf: classify block device hooks appropriately")
Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Closes: https://lore.kernel.org/bpf/acrVKUy_EPiFFmV9@krava/T/#m7c7906a1ff4029e29185aec3266dbf5c8996dbf7
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/bpf/20260330210344.3073712-1-jolsa@kernel.org
2026-03-31 11:11:42 +02:00
haoyu.lu
b6b5e0ebd4 bpf,arc_jit: Fix missing newline in pr_err messages
Add missing newline to pr_err messages in ARC JIT.

Fixes: f122668ddc ("ARC: Add eBPF JIT support")
Signed-off-by: haoyu.lu <hechushiguitu666@gmail.com>
Link: https://lore.kernel.org/r/20260324122703.641-1-hechushiguitu666@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-29 09:59:00 -07:00
Daniel Borkmann
398ad123e8 selftests/bpf: Add few tests for alu32 shift value tracking and zext
Add a few more alu32 shift tests using div-by-zero on provably dead paths
to check both verifier correctness and JIT translation (runtime) correctness.

If the verifier mistracks the result, it rejects due to the div by 0;
if the JIT computes a wrong value, then runtime hits the dead path and
retval changes.
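
The values such tests expect can be reproduced with a userspace model of 32-bit shifts whose results are zero-extended to 64 bits. This is a sketch, not the selftest: the demo_* helpers are hypothetical, and masking the shift amount to 5 bits is an assumption about alu32 semantics.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit left shift; the upper 32 bits of the result are zero. */
static uint64_t demo_alu32_lsh(uint64_t dst, uint32_t shift)
{
	return (uint32_t)((uint32_t)dst << (shift & 31));
}

/* 32-bit logical right shift, zero-extended. */
static uint64_t demo_alu32_rsh(uint64_t dst, uint32_t shift)
{
	return (uint32_t)dst >> (shift & 31);
}

/* 32-bit arithmetic right shift, written without signed shifts. */
static uint64_t demo_alu32_arsh(uint64_t dst, uint32_t shift)
{
	uint32_t v = (uint32_t)dst;
	uint32_t res;

	shift &= 31;
	res = v >> shift;
	/* Replicate the sign bit into the vacated high bits. */
	if ((v & 0x80000000u) && shift)
		res |= ~(0xffffffffu >> shift);
	return res;
}
```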

  # LDLIBS=-static PKG_CONFIG='pkg-config --static' ./vmtest.sh -- ./test_progs -t verifier_subreg
  [...]
  #644/76  verifier_subreg/arsh32_imm1_value:OK
  #644/77  verifier_subreg/lsh32_reg0_zero_extend_check:OK
  #644/78  verifier_subreg/rsh32_reg0_zero_extend_check:OK
  #644/79  verifier_subreg/arsh32_reg0_zero_extend_check:OK
  #644/80  verifier_subreg/lsh32_imm31_value:OK
  #644/81  verifier_subreg/rsh32_imm31_value:OK
  #644/82  verifier_subreg/arsh32_imm31_value:OK
  #644/83  verifier_subreg/lsh32_unknown_precise_bounds:OK
  #644/84  verifier_subreg/rsh32_unknown_bounds:OK
  #644     verifier_subreg:OK
  Summary: 1/84 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260327220629.343327-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-29 09:57:39 -07:00
Ihor Solodrai
101a9d9df8 selftests/bpf: Update kfuncs using btf_struct_meta to new variants
Update selftests to use the new non-_impl kfuncs marked with
KF_IMPLICIT_ARGS by removing redundant declarations and macros from
bpf_experimental.h (the new kfuncs are present in the vmlinux.h) and
updating relevant callsites.

Fix spin_lock verifier-log matching for lock_id_kptr_preserve by
accepting variable instruction numbers. The calls to kfuncs with
implicit arguments do not have register moves (e.g. r5 = 0)
corresponding to dummy arguments anymore, so the order of instructions
has shifted.

Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260327203241.3365046-2-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-29 09:56:06 -07:00
Ihor Solodrai
d457072576 bpf: Support struct btf_struct_meta via KF_IMPLICIT_ARGS
The following kfuncs currently accept void *meta__ign argument:
  * bpf_obj_new_impl
  * bpf_obj_drop_impl
  * bpf_percpu_obj_new_impl
  * bpf_percpu_obj_drop_impl
  * bpf_refcount_acquire_impl
  * bpf_list_push_back_impl
  * bpf_list_push_front_impl
  * bpf_rbtree_add_impl

The __ign suffix is an indicator for the verifier to skip the argument
in check_kfunc_args(). Then, in fixup_kfunc_call() the verifier may
set the value of this argument to struct btf_struct_meta *
kptr_struct_meta from insn_aux_data.

BPF programs must pass a dummy NULL value when calling these kfuncs.

Additionally, the list and rbtree _impl kfuncs also accept an implicit
u64 argument, which doesn't require __ign suffix because it's a
scalar, and BPF programs explicitly pass 0.

Add new kfuncs with KF_IMPLICIT_ARGS [1], that correspond to each
_impl kfunc accepting meta__ign. The existing _impl kfuncs remain
unchanged for backwards compatibility.

To support this, add "btf_struct_meta" to the list of recognized
implicit argument types in resolve_btfids.

Implement is_kfunc_arg_implicit() in the verifier, which determines
implicit args by inspecting the non-_impl BTF prototype of the
kfunc.

Update the special_kfunc_list in the verifier and relevant checks to
support both the old _impl and the new KF_IMPLICIT_ARGS variants of
btf_struct_meta users.

[1] https://lore.kernel.org/bpf/20260120222638.3976562-1-ihor.solodrai@linux.dev/

Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260327203241.3365046-1-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-29 09:56:06 -07:00
Alexei Starovoitov
5e961eebef Merge branch 'bpf-classify-block-device-hooks-and-add-selftests'
Christian Brauner says:

====================
bpf: classify block device hooks and add selftests

A bunch of new hooks for managing block devices were added a while ago
but they weren't appropriately classified. Classify them and add a test
program so we catch regressions.

Note that for whatever reason building the bpf selftests locally seems
to fail for all kinds of arcane reasons for me. That might just be my
fault. I added a pr against the ci to have the selftests run but to test
this meaningfully it needs veritysetup and dmverity support. I'm not
sure if that's available already.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Changes in v2:
- No changes.
- Link to v1: https://patch.msgid.link/20260220-work-bpf-bdev-v1-0-c53e852c4702@kernel.org

---
====================

Link: https://patch.msgid.link/20260326-work-bpf-bdev-v2-0-5e3c58963987@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-27 09:05:13 -07:00
Christian Brauner
96f4c251a0 selftests/bpf: add block device management selftests
Add selftests to test block device tracking for bpf lsm programs.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20260326-work-bpf-bdev-v2-2-5e3c58963987@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-27 09:05:13 -07:00
Christian Brauner
46df585fcf bpf: classify block device hooks appropriately
A bunch of new hooks for managing block devices were added a while ago
but they weren't actually appropriately classified.

* bpf_lsm_bdev_alloc() is called when the inode for the block
  device is allocated. This happens from a sleepable context so mark the
  function as sleepable. When this function is called the memory for the
  block device storage embedded into the inode is zeroed. That block
  device cannot be meaningfully referenced or interacted with at this
  point. So mark it as untrusted for now.

* bpf_lsm_bdev_free() is called when the inode for the block
  device is freed. A bunch of memory associated with the block device
  has already been freed and there are dangling pointers in there. So mark
  it as untrusted. It cannot be meaningfully referenced or interacted
  with anymore. It is also called from sb->s_op->free_inode(), which
  means it runs in rcu context (most of the time). So leave it as
  non-sleepable.

* bpf_lsm_bdev_setintegrity() is called when a dm-verity device
  is instantiated (glossing over details for simplicity of the commit
  message). The block device is very much alive so it remains a trusted
  hook. It's also called with device mapper's suspend lock held and so
  the hook is able to sleep so mark it sleepable.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20260326-work-bpf-bdev-v2-1-5e3c58963987@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-27 09:05:13 -07:00
Andrii Nakryiko
01504da43e Merge branch 'add-btf-layout-to-btf'
Alan Maguire says:

====================
Add BTF layout to BTF

Update struct btf_header to add a new "layout" section containing
a description of how to parse the BTF kinds known about at BTF
encoding time.  This provides the opportunity for tools that might
not know all of these kinds - as is the case when older tools run
on more newly-generated BTF - to still parse the BTF provided,
even if it cannot all be used.
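
How a parser could step over a kind it does not know can be sketched as follows. The struct and helper are hypothetical: the names info_sz and elem_sz follow the terms used in the RFC notes further down, the 12-byte common type header is an assumption about struct btf_type, and the real layout entry may carry additional fields.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_BTF_TYPE_SZ 12U /* assumed size of the common type header */

/* Hypothetical per-kind layout entry. */
struct demo_layout {
	uint8_t info_sz; /* bytes of a single object following the header */
	uint8_t elem_sz; /* bytes of each of the vlen trailing elements */
};

/* Total bytes occupied by one type record of a given kind. */
static uint32_t demo_type_record_size(const struct demo_layout *l,
				      uint32_t vlen)
{
	return DEMO_BTF_TYPE_SZ + l->info_sz + vlen * l->elem_sz;
}
```

With the size known, a tool can advance to the next type record even when it cannot interpret the kind itself.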

The ideas here were discussed at [1], with further discussion
at [2].

Patches for pahole that will enable the layout addition during BTF
generation are at [3], but even absent these the addition of the
layout feature in the final patch in this series should not break
anything since such unknown features are simply ignored during pahole
BTF generation.

Separately tested sanitization of BTF location info with separate
small series which simulates feature absence to support testing of
features for older kernels; will follow up with that shortly.

Changes since v15 [4]:

- Fixed endian issues for layout section by swapping flags fields
  where needed (sashiko.dev, patch 2)
- Fixed string size issue with swapped endian case, use btf->magic
  for comparison to determine endian mismatch (bpf review bot,
  sashiko.dev, patch 6)

Changes since v14 [5]:

- Fix potential overflow for swapped endian case (BPF review bot,
  patch 2)
- Add global: keyword to libbpf.map (sashiko.dev, patch 4)
- Fix endian issues in sanitization; we use the endian safe
  btf->hdr and check for endian mismatch between it and raw original
  BTF header to inform how we write the change str_off. Also fix
  potential truncation issues due to not including hdr->type_off
  (sashiko.dev, patch 6)
- Fix issues with selftests raw BTF file interactions (sashiko.dev,
  patch 8)
- Drop feature test test since it will be covered by another series

Changes since v13 [6]:

- add feature check/sanitization of BTF with layout info (Andrii,
  patch 6)
- added feature check test for layout support (patch 9)

Changes since v12 [7]:

- add logging of layout off/len to kernel header logging (review bot,
  patch 6)
- add mode to open() in selftest (review bot, patch 7)

Changes since v11 [8]:

- Revert unneeded changes to btf__new_empty() (Eduard, review bot,
  patch 4)
- Reorder btf_parse_layout_sec() checks to ensure min size check
  occurs before multiple check (review bot, patch 6)

Changes since v10 [9]:

- deal with read overflow with small header (review bot, patch 2)
- validate layout length is a multiple of sizeof(struct btf_layout)
  (review bot, patch 6)
- fix comment style (Alexei, patches 4,7)
- remove bpftool BTF metadata subcommands for now (Alexei)

Changes since v9 [10]:

- fix memcpy header size overrun (review bot, patch 2)
- return size computation directly (Andrii, patch 333)
- revert to original unknown kind logging (Alexei/review bot,
  patch 6)
- gap-checking logic can be simplified now that we have
  4-byte aligned types and layout together (patch 6)
- fix naming of layout offset, unconditionally emit a layouts
  array in json (Quentin, review bot, patch 8)
- fix metadata output in man page to include flags (review bot,
  patch 9)

Changes since v8 [11]:

- updated name from "kind_layout" to "layout" (Andrii)
- moved layout info to in between types and strings since
  both types and layout info align on 4 bytes (Andrii)
- use embedded btf_header (Eduard)
- when consulting layout, fall back to base BTF if none found in
  split BTF; this will allow us to only encode layout info in
  vmlinux rather than repeating it for each module.

Changes since v7 [12]:
- Fixed comment style in UAPI headers (Mykyta, patch 1)
- Simplify calculation of header size using min() (Mykyta, patch 2)
- simplify computation of bounds for kind (Mykyta, patch 3)
- Added utility functions for updating type, string offsets when
  data is added; this simplifies the code and encapsulates such
  updates more clearly (patch 2)

Changes since v6 [13]:

- BPF review bot caught some memory leaks around freeing
  of kind layout; more importantly, it noted that we were
  breaking with the contiguous BTF representation for
  btf_new_empty_opts(). Doing so meant that freeing kind_layout
  could not be predicated on having btf->modifiable set, so
  adopted the contiguous raw data layout for BTF to be
  consistent with type/string storage (patches 2,4)
- Moved checks for kind overflow prior to referencing kinds
  to avoid any risk of overrun (patches 3, 8)
- Tightened up kind layout header offset/len header validation
  to catch invalid combinations early in btf_parse_hdr()
  (patch 2)
- Fixed selftest to verify calloc success (patch 7)

Changes since v5 [14]:

- removed flags field from kind layout; it is not really workable
  since we would have to define semantics of all possible future
  flags today to be usable. Instead stick to parsing only, which
  means each kind just needs the length of the singular and
  vlen-specified objects (Alexei)
- added documentation for bpftool BTF metadata dump (Quentin, patch 9)

Changes since v4 [15]:

- removed CRC generation since it is not needed to handle modules
  built at different time than kernel; distilled base BTF supports
  this now
- fixed up bpftool display of empty kind names, comment/documentation
  indentation (Quentin, patches 8, 9)

Changes since v3 [16]:

- fixed mismerge issues with kbuild changes for BTF generation
  (patches 9, 14)
- fixed a few small issues in libbpf with kind layout representation
  (patches 2, 4)

Changes since v2 [17]:

- drop "optional" kind flag (Andrii, patch 1)
- allocate "struct btf_header" for struct btf to ensure
  we can always access new fields (Andrii, patch 2)
- use an internal BTF kind array in btf.c to simplify
  kind encoding (Andrii, patch 2)
- drop use of kind layout information for in-kernel parsing,
  since the kernel needs to be strict in what it accepts
  (Andrii, patch 6)
- added CRC verification for BTF objects and for matching
  with base object (Alexei, patches 7,8)
- fixed bpftool json output (Quentin, patch 10)
- added standalone module BTF support, tests (patches 13-17)

Changes since RFC
- Terminology change from meta -> kind_layout
  (Alexei and Andrii)
- Simplify representation, removing meta header
  and just having kind layout section (Alexei)
- Fixed bpftool to have JSON support, support
  prefix match, documented changes (Quentin)
- Separated metadata opts into add_kind_layout
  and add_crc
- Added additional positive/negative tests
  to cover basic unknown kind, one with an
  info_sz object following it and one with
  N elem_sz elements following it.
- Updated pahole-flags to use help output
  rather than version to see if features
  are present

[1] https://lore.kernel.org/bpf/CAEf4BzYjWHRdNNw4B=eOXOs_ONrDwrgX4bn=Nuc1g8JPFC34MA@mail.gmail.com/
[2] https://lore.kernel.org/bpf/20230531201936.1992188-1-alan.maguire@oracle.com/
[3] https://lore.kernel.org/dwarves/20260226085240.1908874-1-alan.maguire@oracle.com/
[4] https://lore.kernel.org/bpf/20260324174450.1570809-1-alan.maguire@oracle.com/
[5] https://lore.kernel.org/bpf/20260318132927.1142388-1-alan.maguire@oracle.com/
[6] https://lore.kernel.org/bpf/20260306113630.1281527-1-alan.maguire@oracle.com/
[7] https://lore.kernel.org/bpf/20260303182003.117483-1-alan.maguire@oracle.com/
[8] https://lore.kernel.org/bpf/20260302114059.3697879-1-alan.maguire@oracle.com/
[9] https://lore.kernel.org/bpf/20260227100426.2585191-1-alan.maguire@oracle.com/
[10] https://lore.kernel.org/bpf/20260226085624.1909682-1-alan.maguire@oracle.com/
[11] https://lore.kernel.org/bpf/20251215091730.1188790-1-alan.maguire@oracle.com/
[12] https://lore.kernel.org/dwarves/20251211164646.1219122-1-alan.maguire@oracle.com/
[13] https://lore.kernel.org/bpf/20251210203243.814529-1-alan.maguire@oracle.com/
[14] https://lore.kernel.org/bpf/20250528095743.791722-1-alan.maguire@oracle.com/
[15] https://lore.kernel.org/bpf/20231112124834.388735-1-alan.maguire@oracle.com/
[16] https://lore.kernel.org/bpf/20231110110304.63910-1-alan.maguire@oracle.com/
[17] https://lore.kernel.org/bpf/20230616171728.530116-1-alan.maguire@oracle.com/
====================

Link: https://patch.msgid.link/20260326145444.2076244-1-alan.maguire@oracle.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2026-03-26 13:53:57 -07:00
Alan Maguire
5e1942eb1c kbuild, bpf: Specify "layout" optional feature
The "layout" feature will add metadata about BTF kinds to the
generated BTF; its absence in pahole will not trigger an error so it
is safe to add unconditionally as it will simply be ignored if pahole
does not support it.

Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260326145444.2076244-10-alan.maguire@oracle.com
2026-03-26 13:53:57 -07:00
Alan Maguire
0467491617 selftests/bpf: Test kind encoding/decoding
Verify that btf__new_empty_opts() adds layouts for all supported kinds,
and after adding kind-related types for an unknown kind, ensure that
parsing uses this info when that kind is encountered rather than
giving up.

Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260326145444.2076244-9-alan.maguire@oracle.com
2026-03-26 13:53:57 -07:00
Alan Maguire
626e88c070 btf: Support kernel parsing of BTF with layout info
Validate layout if present, but because the kernel must be
strict in what it accepts, reject BTF with unsupported kinds,
even if they are in the layout information.

Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260326145444.2076244-8-alan.maguire@oracle.com
2026-03-26 13:53:56 -07:00
Alan Maguire
081677d03d libbpf: Support sanitization of BTF layout for older kernels
Add a FEAT_BTF_LAYOUT feature check which checks if the
kernel supports BTF layout information.  Also sanitize
BTF if it contains layout data but the kernel does not
support it.  The sanitization requires rewriting raw
BTF data to update the header and eliminate the layout
section (since it lies between the types and strings),
so refactor sanitization to do the raw BTF retrieval
and creation of updated BTF, returning that new BTF
on success.
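
As a rough illustration of the rewrite described above, the sketch below
copies a raw BTF blob while dropping the layout section and shrinking the
header. The struct hdr_sketch, the OLD_HDR_LEN constant and the helper
strip_layout() are assumptions made for this example only; they are not
libbpf's internal code, and the sketch ignores endianness and the other
sanitization steps.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout-extended header: the classic BTF header fields
 * followed by two assumed trailing fields for the layout section.
 */
struct hdr_sketch {
	uint16_t magic;
	uint8_t version;
	uint8_t flags;
	uint32_t hdr_len;
	uint32_t type_off, type_len;     /* offsets are relative to end of header */
	uint32_t str_off, str_len;
	uint32_t layout_off, layout_len;
};

#define OLD_HDR_LEN 24u /* header size without the two layout fields */

/* True if [off, off + len) fits inside `avail` bytes. */
static int section_ok(uint32_t off, uint32_t len, size_t avail)
{
	return off <= avail && len <= avail - off;
}

/* Return a malloc()ed copy of `raw` with the old header size and no layout
 * section (caller frees), or NULL on malformed input or allocation failure.
 */
static uint8_t *strip_layout(const uint8_t *raw, size_t raw_sz, size_t *out_sz)
{
	struct hdr_sketch h;
	const uint8_t *data;
	uint8_t *out;

	assert(raw && out_sz);
	if (raw_sz < sizeof(h))
		return NULL;
	memcpy(&h, raw, sizeof(h));
	if (h.hdr_len < sizeof(h) || h.hdr_len > raw_sz)
		return NULL;
	data = raw + h.hdr_len;
	if (!section_ok(h.type_off, h.type_len, raw_sz - h.hdr_len) ||
	    !section_ok(h.str_off, h.str_len, raw_sz - h.hdr_len))
		return NULL;

	out = malloc(OLD_HDR_LEN + (size_t)h.type_len + h.str_len);
	if (!out)
		return NULL;

	/* Copy types, then strings directly after them; the layout bytes
	 * that sat in between are simply not copied.
	 */
	memcpy(out + OLD_HDR_LEN, data + h.type_off, h.type_len);
	memcpy(out + OLD_HDR_LEN + h.type_len, data + h.str_off, h.str_len);

	/* Rewrite the header so it describes the new, layout-free blob. */
	h.hdr_len = OLD_HDR_LEN;
	h.type_off = 0;
	h.str_off = h.type_len;
	memcpy(out, &h, OLD_HDR_LEN);

	*out_sz = OLD_HDR_LEN + (size_t)h.type_len + h.str_len;
	return out;
}
```

Returning a fresh buffer rather than editing in place mirrors the
"return that new BTF on success" shape of the refactor: the caller keeps
the original untouched if anything fails.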

Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260326145444.2076244-7-alan.maguire@oracle.com
2026-03-26 13:53:56 -07:00
Alan Maguire
6ad8928599 libbpf: BTF validation can use layout for unknown kinds
BTF parsing can use layout to navigate unknown kinds, so
btf_validate_type() should take layout information into
account to avoid failure when an unrecognized kind is met.

Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260326145444.2076244-6-alan.maguire@oracle.com
2026-03-26 13:53:56 -07:00
Alan Maguire
d686d92c40 libbpf: Add layout encoding support
Support encoding of BTF layout data via btf__new_empty_opts().

Current supported opts are base_btf and add_layout.

Layout information is maintained in btf.c in the layouts[] array;
when BTF is created with the add_layout option it represents the
current view of supported BTF kinds.

Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260326145444.2076244-5-alan.maguire@oracle.com
2026-03-26 13:53:56 -07:00
Alan Maguire
2ecbe53e0e libbpf: Use layout to compute an unknown kind size
This allows BTF parsing to proceed even if we do not know the
kind.  Fall back to base BTF layout if layout information is
not in split BTF.
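
The fallback order might look roughly like the sketch below. The types
btf_sketch and layout_rec and the helper find_layout() are hypothetical
stand-ins, not libbpf's internal structures; the only point shown is the
lookup order (the BTF's own layout section first, then its base).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-kind layout record. */
struct layout_rec {
	uint8_t info_sz;
	uint8_t elem_sz;
	uint16_t flags;
};

/* Hypothetical BTF object: split BTF points at its base. */
struct btf_sketch {
	const struct btf_sketch *base;   /* NULL for base BTF */
	const struct layout_rec *layout; /* NULL when there is no layout section */
	uint32_t nr_kinds;               /* number of entries in layout[] */
};

/* Return the layout record for `kind` from the first BTF in the
 * split -> base chain that carries a layout section, or NULL if the
 * kind cannot be described.
 */
static const struct layout_rec *find_layout(const struct btf_sketch *btf,
					    uint32_t kind)
{
	for (; btf; btf = btf->base) {
		/* A BTF object must never be its own base. */
		assert(btf->base != btf);
		if (btf->layout)
			return kind < btf->nr_kinds ? &btf->layout[kind] : NULL;
	}
	return NULL;
}
```

A NULL result corresponds to the case where parsing still has to give up:
the kind is unknown and no layout section describes it.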

Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260326145444.2076244-4-alan.maguire@oracle.com
2026-03-26 13:53:56 -07:00
Alan Maguire
087f3964f4 libbpf: Support layout section handling in BTF
Support reading in the layout section, fixing endianness issues on read;
also support writing the layout section to a raw BTF object.
There is not yet an API to populate the layout with meaningful
information.

As part of this, we need to consider multiple valid BTF header
sizes: the original header and the layout-extended header.
To support this, the "struct btf" representation is modified
to contain a "struct btf_header" and we copy the valid
portion from the raw data to it; this means we can always safely
check fields like btf->hdr.layout_len.

Note that parsed-in BTF may have extra header information beyond
sizeof(struct btf_header); if so, we make that BTF ineligible
for modification by setting btf->has_hdr_extra.

Ensure that we handle endianness issues for the BTF layout section,
though the only field that currently needs this (flags) is unused.

Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260326145444.2076244-3-alan.maguire@oracle.com
2026-03-26 13:53:56 -07:00
Alan Maguire
222edc843c btf: Add BTF kind layout encoding to UAPI
BTF kind layouts provide information to parse BTF kinds. By separating
parsing BTF from using all the information it provides, we allow BTF
to encode new features even if they cannot be used by readers. This
will be helpful in particular for cases where older tools are used
to parse newer BTF with kinds the older tools do not recognize;
the BTF can still be parsed in such cases using kind layout.

The intent is to support encoding of kind layouts optionally so that
tools like pahole can add this information. For each kind, we record

- length of singular element following struct btf_type
- length of each of the btf_vlen() elements following
- a (currently unused) flags field
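
To illustrate how a reader could use these three fields, the sketch below
computes how many bytes one type record occupies. The struct name
kind_layout_sketch, the field widths, the BTF_TYPE_HDR_SZ constant and the
helper type_record_size() are assumptions made for this example; they are
not the UAPI definition added by this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-kind layout record; widths are assumed. */
struct kind_layout_sketch {
	uint8_t info_sz;  /* bytes of the single element after struct btf_type */
	uint8_t elem_sz;  /* bytes of each of the vlen elements that follow */
	uint16_t flags;   /* currently unused */
};

/* Size of the fixed struct btf_type part: three 32-bit words. */
#define BTF_TYPE_HDR_SZ 12u

/* Bytes occupied by one type record of the kind described by `l`,
 * where `vlen` is the element count taken from the type's info word.
 */
static size_t type_record_size(const struct kind_layout_sketch *l,
			       uint32_t vlen)
{
	assert(l != NULL);
	return BTF_TYPE_HDR_SZ + l->info_sz + (size_t)vlen * l->elem_sz;
}
```

For example, a kind with no singular element and 12-byte elements, with
vlen = 3, occupies 12 + 0 + 3 * 12 = 48 bytes, so a parser that does not
recognize the kind can still skip to the next record.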

The ideas here were discussed at [1], [2]; hence

Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260326145444.2076244-2-alan.maguire@oracle.com

[1] https://lore.kernel.org/bpf/CAEf4BzYjWHRdNNw4B=eOXOs_ONrDwrgX4bn=Nuc1g8JPFC34MA@mail.gmail.com/
[2] https://lore.kernel.org/bpf/20230531201936.1992188-1-alan.maguire@oracle.com/
2026-03-26 13:53:56 -07:00
Alexei Starovoitov
400ff899c3 selftests/bpf: Make reg_bounds test more robust
The verifier log output may contain multiple lines that start with
18: (bf) r0 = r6
Teach reg_bounds to look for lines that have ';' in them,
since the reg_bounds test is looking for:
18: (bf) r0 = r6       ; R0=... R6=...

Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260325012242.45606-1-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-25 08:50:33 -07:00
Alexei Starovoitov
9f7d8fa681 selftests/bpf: Test variable length stack write
Add a test to make sure that variable length stack writes
scrub STACK_SPILL into STACK_MISC.

Tested-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260324215938.81733-2-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-24 17:00:16 -07:00
Alexei Starovoitov
4639eb9e30 bpf: Fix variable length stack write over spilled pointers
Scrub slots if a variable-offset stack write goes over spilled pointers.
Otherwise is_spilled_reg() may be true while spilled_ptr.type == NOT_INIT,
and a valid program is rejected by check_stack_read_fixed_off()
with an obscure "invalid size of register fill" message.

Fixes: 01f810ace9 ("bpf: Allow variable-offset stack access")
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260324215938.81733-1-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-24 17:00:11 -07:00
David Carlier
8ed82f807b bpf: Use RCU-safe iteration in dev_map_redirect_multi() SKB path
The DEVMAP_HASH branch in dev_map_redirect_multi() uses
hlist_for_each_entry_safe() to iterate hash buckets, but this function
runs under RCU protection (called from xdp_do_generic_redirect_map()
in softirq context). Concurrent writers (__dev_map_hash_update_elem,
dev_map_hash_delete_elem) modify the list using RCU primitives
(hlist_add_head_rcu, hlist_del_rcu).

hlist_for_each_entry_safe() performs plain pointer dereferences without
rcu_dereference(), missing the acquire barrier needed to pair with
writers' rcu_assign_pointer(). On weakly-ordered architectures (ARM64,
POWER), a reader can observe a partially-constructed node. It also
defeats CONFIG_PROVE_RCU lockdep validation and KCSAN data-race
detection.

Replace with hlist_for_each_entry_rcu() using rcu_read_lock_bh_held()
as the lockdep condition, consistent with the rcu_dereference_check()
used in the DEVMAP (non-hash) branch of the same functions. Also fix
the same incorrect lockdep_is_held(&dtab->index_lock) condition in
dev_map_enqueue_multi(), where the lock is not held either.

Fixes: e624d4ed4a ("xdp: Extend xdp_redirect_map with broadcast support")
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260320072645.16731-1-devnexen@gmail.com
2026-03-24 15:17:20 -07:00
Sun Jian
7f5b0a60a8 selftests/bpf: move trampoline_count to dedicated bpf_testmod target
trampoline_count fills all trampoline attachment slots for a single
target function and verifies that one extra attach fails with -E2BIG.

It currently targets bpf_modify_return_test, which is also used by
other selftests such as modify_return, get_func_ip_test, and
get_func_args_test. When such tests run in parallel, they can contend
for the same per-function trampoline quota and cause unexpected attach
failures. This issue is currently masked by harness serialization.

Move trampoline_count to a dedicated bpf_testmod target and register it
for fmod_ret attachment. Also route the final trigger through
trigger_module_test_read(), so the execution path exercises the same
dedicated target.

This keeps the test semantics unchanged while isolating it from other
selftests, so it no longer needs to run in serial mode. Remove the
TODO comment as well.

Tested:
  ./test_progs -t trampoline_count -vv
  ./test_progs -j$(nproc) -t trampoline_count -vv
  ./test_progs -j$(nproc) -t \
    trampoline_count,modify_return,get_func_ip_test,get_func_args_test -vv
  20 runs of:
    ./test_progs -j$(nproc) -t \
      trampoline_count,modify_return,get_func_ip_test,get_func_args_test

Signed-off-by: Sun Jian <sun.jian.kdev@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260324044949.869801-1-sun.jian.kdev@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-24 13:39:32 -07:00
Jiayuan Chen
d9d7125e44 selftests/bpf: Fix sockmap_multi_channels reliability
Previously I added a FIONREAD test for sockmap, but it can occasionally
fail in CI [1].

The test sends 10 bytes in two segments (2 + 8). For UDP, FIONREAD only
reports the length of the first datagram, not the total queued data.
The original code used recv_timeout() expecting all 10 bytes, but under
high system load, the second datagram may not yet be processed by the
protocol stack, so recv would only return the first 2-byte datagram,
causing a size mismatch failure.

Fix this by receiving exactly the expected bytes (matching FIONREAD) in
the first recv. The remaining datagram is then consumed in a second recv
block, which is only reachable for UDP since TCP's expected already
equals sizeof(buf).
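
The receive pattern described above can be sketched in plain userspace C,
outside the selftest harness. The function name udp_fionread_demo() and
its return convention are made up for this illustration, and the sketch
assumes Linux, where FIONREAD on a UDP socket reports the size of the next
pending datagram. It sends 2 + 8 bytes over loopback, reads exactly what
FIONREAD reports, then waits separately for the second datagram.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <poll.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 0 if both datagrams arrive with the expected sizes, 1 if
 * loopback UDP is unavailable in this environment, -1 on a mismatch.
 */
static int udp_fionread_demo(void)
{
	struct sockaddr_in addr = { .sin_family = AF_INET };
	socklen_t alen = sizeof(addr);
	struct pollfd pfd;
	char buf[10];
	int rx, tx, avail = 0, ret = 1;

	rx = socket(AF_INET, SOCK_DGRAM, 0);
	tx = socket(AF_INET, SOCK_DGRAM, 0);
	if (rx < 0 || tx < 0)
		goto out;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	if (bind(rx, (struct sockaddr *)&addr, sizeof(addr)) ||
	    getsockname(rx, (struct sockaddr *)&addr, &alen) ||
	    connect(tx, (struct sockaddr *)&addr, sizeof(addr)))
		goto out;
	/* Two datagrams: 2 bytes, then 8 bytes. */
	if (send(tx, "ab", 2, 0) != 2 || send(tx, "cdefghij", 8, 0) != 8)
		goto out;

	ret = -1;
	pfd.fd = rx;
	pfd.events = POLLIN;
	if (poll(&pfd, 1, 1000) != 1 || ioctl(rx, FIONREAD, &avail))
		goto out;
	assert(avail <= (int)sizeof(buf));
	/* First recv: exactly the byte count FIONREAD reported. */
	if (avail != 2 || recv(rx, buf, avail, 0) != 2)
		goto out;
	/* The second datagram may show up later; wait for it on its own. */
	if (poll(&pfd, 1, 1000) != 1 || recv(rx, buf, sizeof(buf), 0) != 8)
		goto out;
	ret = 0;
out:
	if (rx >= 0)
		close(rx);
	if (tx >= 0)
		close(tx);
	return ret;
}
```

Polling before each recv is what removes the dependency on both datagrams
having been processed by the stack at the same moment.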

Test:
./test_progs -a sockmap_basic
410/1   sockmap_basic/sockmap create_update_free:OK
...
Summary: 1/35 PASSED, 0 SKIPPED, 0 FAILED

[1] https://github.com/kernel-patches/bpf/actions/runs/22919385910/job/66515395423

Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
Fixes: 17e2ce02bf ("selftests/bpf: Add tests for FIONREAD and copied_seq")
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Link: https://lore.kernel.org/r/20260312072549.6766-1-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-24 13:38:43 -07:00
Jiayuan Chen
2790db208b selftests/bpf: Improve tc_tunnel test reliability
A test failure was discovered in BPF CI [1] caused by connection timeout.
The current test timeout of 500ms is insufficient for CI environments,
particularly under high load.

While the optimal timeout is unclear, this test was converted from the
original test_tc_tunnel.sh script. The original script used nc with "-w 1"
to specify a 1-second timeout [2]. Therefore, this test restores the
timeout to 1s.

Test:
./test_progs -a tc_tunnel
 #478/1   tc_tunnel/ipip_none:OK
 #478/2   tc_tunnel/ipip6_none:OK
 #478/3   tc_tunnel/ip6tnl_none:OK
 #478/4   tc_tunnel/sit_none:OK
 #478/5   tc_tunnel/vxlan_eth:OK
 #478/6   tc_tunnel/ip6vxlan_eth:OK
 #478/7   tc_tunnel/gre_none:OK
 #478/8   tc_tunnel/gre_eth:OK
 #478/9   tc_tunnel/gre_mpls:OK
 #478/10  tc_tunnel/ip6gre_none:OK
 #478/11  tc_tunnel/ip6gre_eth:OK
 #478/12  tc_tunnel/ip6gre_mpls:OK
 #478/13  tc_tunnel/udp_none:OK
 #478/14  tc_tunnel/udp_eth:OK
 #478/15  tc_tunnel/udp_mpls:OK
 #478/16  tc_tunnel/ip6udp_none:OK
 #478/17  tc_tunnel/ip6udp_eth:OK
 #478/18  tc_tunnel/ip6udp_mpls:OK
 #478     tc_tunnel:OK
 Summary: 1/18 PASSED, 0 SKIPPED, 0 FAILED

[1] https://github.com/kernel-patches/bpf/actions/runs/22674350732/job/65728072723
[2] https://lore.kernel.org/all/20251027-tc_tunnel-v3-4-505c12019f9d@bootlin.com/

Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Link: https://lore.kernel.org/r/20260312083615.31835-1-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-24 13:38:04 -07:00
Kexin Sun
70b5f3f782 bpf: update outdated comment for refactored btf_check_kfunc_arg_match()
The function btf_check_kfunc_arg_match() was refactored into
check_kfunc_args() by commit 00b85860fe ("bpf: Rewrite kfunc
argument handling").  Update the comment accordingly.

Assisted-by: unnamed:deepseek-v3.2 coccinelle
Signed-off-by: Kexin Sun <kexinsun@smail.nju.edu.cn>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://lore.kernel.org/r/20260321105658.6006-1-kexinsun@smail.nju.edu.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-24 13:37:29 -07:00
Alexei Starovoitov
70275388ae Merge branch 'bpf-add-multi-level-pointer-parameter-support-for-trampolines'
Slava Imameev says:

====================
bpf: Add multi-level pointer parameter support for trampolines

This is v6 of a series adding support for new pointer types for
trampoline parameters.

Originally, only support for multi-level pointers was proposed.
As suggested during review, it was extended to single-level pointers.
During discussion, it was proposed to add support for any single or
multi-level pointer type that is not a single-level pointer to a
structure, with the condition if (!btf_type_is_struct(t)). The safety
of this condition is based on BTF data verification performed for
modules and programs, and vmlinux BTF being trusted to not contain
invalid types, so it is not possible for invalid types, like
PTR->DATASEC, PTR->FUNC, PTR->VAR and corresponding multi-level
pointers, to reach btf_ctx_access.

These changes appear to be a safe extension since any future support
for arrays and output values would require annotation (similar to
Microsoft SAL), which differentiates between current unannotated
scalar cases and new annotated cases.

This series adds BPF verifier support for single- and multi-level
pointer parameters and return values in BPF trampolines. The
implementation treats these parameters as SCALAR_VALUE.

This is consistent with existing pointers to int and void that are
already treated as SCALAR.

This provides consistent logic for single- and multi-level pointers:
if the type is treated as SCALAR for a single-level pointer, the
same applies to multi-level pointers, except for pointers to structs
which are currently PTR_TO_BTF_ID. However, in the case of
multi-level pointers, they are treated as scalar since the verifier
lacks the context to infer the size of their target memory regions.

Background:

Prior to these changes, accessing multi-level pointer parameters or
return values through BPF trampoline context arrays resulted in
verification failures in btf_ctx_access, producing errors such as:

func '%s' arg%d type %s is not a struct

For example, consider a BPF program that logs an input parameter of
type struct posix_acl **:

SEC("fentry/__posix_acl_chmod")
int BPF_PROG(trace_posix_acl_chmod, struct posix_acl **ppacl, gfp_t gfp,
             umode_t mode)
{
    bpf_printk("__posix_acl_chmod ppacl = %px\n", ppacl);
    return 0;
}

This program failed BPF verification with the following error:

libbpf: prog 'trace_posix_acl_chmod': -- BEGIN PROG LOAD LOG --
0: R1=ctx() R10=fp0
; int BPF_PROG(trace_posix_acl_chmod, struct posix_acl **ppacl,
gfp_t gfp, umode_t mode) @ posix_acl_monitor.bpf.c:23
0: (79) r6 = *(u64 *)(r1 +16)         ; R1=ctx() R6_w=scalar()
1: (79) r1 = *(u64 *)(r1 +0)
func '__posix_acl_chmod' arg0 type PTR is not a struct
invalid bpf_context access off=0 size=8
processed 2 insns (limit 1000000) max_states_per_insn 0 total_states 0
peak_states 0 mark_read 0
-- END PROG LOAD LOG --

The common workaround involved using helper functions to fetch
parameter values by passing the address of the context array entry:

SEC("fentry/__posix_acl_chmod")
int BPF_PROG(trace_posix_acl_chmod, struct posix_acl **ppacl, gfp_t gfp,
             umode_t mode)
{
    struct posix_acl **pp;
    bpf_probe_read_kernel(&pp, sizeof(pp), &ctx[0]);
    bpf_printk("__posix_acl_chmod %px\n", pp);
    return 0;
}

This approach introduced helper call overhead and created
inconsistency with parameter access patterns.

Improvements:

With this patch, trampoline programs can directly access multi-level
pointer parameters, eliminating helper call overhead and explicit ctx
access while ensuring consistent parameter handling. For example, the
following ctx access with a helper call:

SEC("fentry/__posix_acl_chmod")
int BPF_PROG(trace_posix_acl_chmod, struct posix_acl **ppacl, gfp_t gfp,
             umode_t mode)
{
    struct posix_acl **pp;
    bpf_probe_read_kernel(&pp, sizeof(pp), &ctx[0]);
    bpf_printk("__posix_acl_chmod %px\n", pp);
    ...
}

is replaced by a load instruction:

SEC("fentry/__posix_acl_chmod")
int BPF_PROG(trace_posix_acl_chmod, struct posix_acl **ppacl, gfp_t gfp,
             umode_t mode)
{
    bpf_printk("__posix_acl_chmod %px\n", ppacl);
    ...
}

The bpf_core_cast macro can be used for deeper level dereferences.

v1 -> v2:
* corrected maintainer's email
v2 -> v3:
* Addressed reviewers' feedback:
	* Changed the register type from PTR_TO_MEM to SCALAR_VALUE.
	* Modified tests to accommodate SCALAR_VALUE handling.
* Fixed a compilation error for loongarch
	* https://lore.kernel.org/oe-kbuild-all/202602181710.tEK6nOl6-lkp@intel.com/
* Addressed AI bot review
	* Added a comment to address a NULL pointer case
	* Removed WARN_ON
	* Fixed a comment
v3 -> v4:
* Added more consistent support for single and multi-level pointers
as suggested by reviewers.
	* added single level pointers to enum 32 and 64
	* added single level pointers to functions
	* harmonized support for single and multi-level pointer types
	* added new tests to support the above changes
* Removed create_bad_kaddr that allocated and invalidated kernel VA
for tests, and replaced it with hardcoded values similar to
bpf_testmod_return_ptr as suggested by reviewers.
v4 -> v5:
* As suggested, extended support to single-level pointers and
covered all supported valid pointer (single- and multi-level) types
with a wider condition if (!btf_type_is_struct(t)).
* As requested, simplified tests by keeping only tests that check
the verifier log for scalar().
v5 -> v6:
* Fixed the test message based on the bot's feedback.
====================

Link: https://patch.msgid.link/20260314082127.7939-1-slava.imameev@crowdstrike.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-24 13:36:32 -07:00
Slava Imameev
e8571de534 selftests/bpf: Add trampolines single and multi-level pointer params test coverage
Add single and multi-level pointer parameters and return value test
coverage for BPF trampolines. Includes verifier tests for single and
multi-level pointers. The tests check verifier logs for pointers
inferred as scalar() type.

Signed-off-by: Slava Imameev <slava.imameev@crowdstrike.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260314082127.7939-3-slava.imameev@crowdstrike.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-24 13:36:32 -07:00
Slava Imameev
4145203841 bpf: Support pointer param types via SCALAR_VALUE for trampolines
Add BPF verifier support for single- and multi-level pointer
parameters and return values in BPF trampolines by treating these
parameters as SCALAR_VALUE.

This extends the existing support for int and void pointers that are
already treated as SCALAR_VALUE.

This provides consistent logic for single and multi-level pointers:
if a type is treated as SCALAR for a single-level pointer, the same
applies to multi-level pointers. The exception is pointer-to-struct,
which is currently PTR_TO_BTF_ID for single-level but treated as
scalar for multi-level pointers since the verifier lacks context
to infer the size of target memory regions.

Safety is ensured by existing BTF verification, which rejects invalid
pointer types at the BTF verification stage.

Signed-off-by: Slava Imameev <slava.imameev@crowdstrike.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260314082127.7939-2-slava.imameev@crowdstrike.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-24 13:36:31 -07:00
Alexei Starovoitov
7b4f1a29c7 selftests/bpf: Test 32-bit scalar spill pruning in stacksafe()
Add a test verifying that stacksafe() correctly handles 32-bit scalar
spills when comparing stack states for equivalence during state pruning.

A 32-bit scalar spill creates slot[0-3] = STACK_INVALID and
slot[4-7] = STACK_SPILL. Without the im=4 check in stacksafe(), the
STACK_SPILL vs STACK_MISC mismatch at byte 4 causes pruning to fail,
forcing the verifier to re-explore a path that is provably safe.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260323022410.75444-2-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-24 12:10:38 -07:00
Alexei Starovoitov
596bef1d71 bpf: Support 32-bit scalar spills in stacksafe()
v1->v2: updated comments
v1: https://lore.kernel.org/bpf/20260322225124.14005-1-alexei.starovoitov@gmail.com/

Commit 6efbde200b ("bpf: Handle scalar spill vs all MISC in stacksafe()")
made stacksafe() recognize only full 64-bit scalar spills when
comparing stack states for equivalence during state pruning and
missed 32-bit scalar spills. When a 32-bit scalar is spilled,
check_stack_write_fixed_off() -> save_register_state() calls
mark_stack_slot_misc() for slot[0-3], which preserves STACK_INVALID
and STACK_ZERO (on a fresh stack slot[0-3] remain STACK_INVALID),
sets slot[4-7] = STACK_SPILL, and updates spilled_ptr.

The im=4 path is only reached when im=0 fails: The loop at im=0 already
attempts the 64-bit scalar-spill/all-MISC check. If it matches, i advances
by 7, skipping the entire 8-byte slot. So im=4 is only reached when bytes
0-3 are neither a scalar spill nor all-MISC — they must pass individual
byte-by-byte comparison first. Then bytes 4-7 get the scalar-unit
treatment.

is_spilled_scalar_after(stack, 4): slot_type[4] == STACK_SPILL from a
64-bit spill would have been caught at im=0 (unless it's a pointer spill,
in which case spilled_ptr.type != SCALAR_VALUE -> returns false at im=4
too). A partial overwrite of a 64-bit spill invalidates the entire slot in
check_stack_write_fixed_off().

is_stack_misc_after(stack, 4): Only checks bytes 4-7 are MISC/INVALID,
returns &unbound_reg. Comparing two unbound regs via regsafe() is safe.
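
The two predicates discussed above can be modelled in a few lines of
userspace C. This is a toy model under stated assumptions, not the
verifier code: the struct stack_slot_model, the reduced enums and the
helper tails_comparable() are invented for illustration, and the real
checks go on to compare the registers via regsafe().

```c
#include <assert.h>
#include <stdbool.h>

/* Reduced, illustrative copies of the verifier's slot and register types. */
enum slot_type { STACK_INVALID, STACK_SPILL, STACK_MISC, STACK_ZERO };
enum reg_type { NOT_INIT, SCALAR_VALUE, PTR_TO_STACK };

#define SLOT_SZ 8

struct stack_slot_model {
	enum slot_type slot_type[SLOT_SZ];
	enum reg_type spilled_type; /* stands in for spilled_ptr.type */
};

/* True if byte `im` starts a scalar spill (im = 0: 64-bit, im = 4: 32-bit). */
static bool spilled_scalar_after(const struct stack_slot_model *s, int im)
{
	assert(im >= 0 && im < SLOT_SZ);
	return s->slot_type[im] == STACK_SPILL &&
	       s->spilled_type == SCALAR_VALUE;
}

/* True if bytes im..7 are all MISC or INVALID. */
static bool stack_misc_after(const struct stack_slot_model *s, int im)
{
	assert(im >= 0 && im < SLOT_SZ);
	for (int i = im; i < SLOT_SZ; i++)
		if (s->slot_type[i] != STACK_MISC &&
		    s->slot_type[i] != STACK_INVALID)
			return false;
	return true;
}

/* Toy "scalar unit" test: both slot tails starting at byte `im` are
 * either a scalar spill or all-MISC, so they could be compared as scalars.
 */
static bool tails_comparable(const struct stack_slot_model *a,
			     const struct stack_slot_model *b, int im)
{
	return (spilled_scalar_after(a, im) || stack_misc_after(a, im)) &&
	       (spilled_scalar_after(b, im) || stack_misc_after(b, im));
}
```

In this model a 32-bit scalar spill (bytes 0-3 INVALID, bytes 4-7 SPILL)
compared with an all-MISC slot fails the check at im = 0 but passes at
im = 4, while a pointer spill fails at both offsets.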

Changes to cilium programs:
File             Program                            Insns (A)  Insns (B)  Insns     (DIFF)
_______________  _________________________________  _________  _________  ________________
bpf_host.o       cil_host_policy                        49351      45811    -3540 (-7.17%)
bpf_host.o       cil_to_host                             2384       2270     -114 (-4.78%)
bpf_host.o       cil_to_netdev                         112051     100269  -11782 (-10.51%)
bpf_host.o       tail_handle_ipv4_cont_from_host        61175      60910     -265 (-0.43%)
bpf_host.o       tail_handle_ipv4_cont_from_netdev       9381       8873     -508 (-5.42%)
bpf_host.o       tail_handle_ipv4_from_host             12994       7066   -5928 (-45.62%)
bpf_host.o       tail_handle_ipv4_from_netdev           85015      59875  -25140 (-29.57%)
bpf_host.o       tail_handle_ipv6_cont_from_host        24732      23527    -1205 (-4.87%)
bpf_host.o       tail_handle_ipv6_cont_from_netdev       9463       8953     -510 (-5.39%)
bpf_host.o       tail_handle_ipv6_from_host             12477      11787     -690 (-5.53%)
bpf_host.o       tail_handle_ipv6_from_netdev           30814      30017     -797 (-2.59%)
bpf_host.o       tail_handle_nat_fwd_ipv4                8943       8860      -83 (-0.93%)
bpf_host.o       tail_handle_snat_fwd_ipv4              64716      61625    -3091 (-4.78%)
bpf_host.o       tail_handle_snat_fwd_ipv6              48299      30797  -17502 (-36.24%)
bpf_host.o       tail_ipv4_host_policy_ingress          21591      20017    -1574 (-7.29%)
bpf_host.o       tail_ipv6_host_policy_ingress          21177      20693     -484 (-2.29%)
bpf_host.o       tail_nodeport_nat_egress_ipv4          16588      16543      -45 (-0.27%)
bpf_host.o       tail_nodeport_nat_ingress_ipv4         39200      36116    -3084 (-7.87%)
bpf_host.o       tail_nodeport_nat_ingress_ipv6         50102      48003    -2099 (-4.19%)
bpf_lxc.o        tail_handle_ipv4_cont                 113092      96891  -16201 (-14.33%)
bpf_lxc.o        tail_handle_ipv6                        6727       6701      -26 (-0.39%)
bpf_lxc.o        tail_handle_ipv6_cont                  25567      21805   -3762 (-14.71%)
bpf_lxc.o        tail_ipv4_ct_egress                    28843      15970  -12873 (-44.63%)
bpf_lxc.o        tail_ipv4_ct_ingress                   16691      10213   -6478 (-38.81%)
bpf_lxc.o        tail_ipv4_ct_ingress_policy_only       16691      10213   -6478 (-38.81%)
bpf_lxc.o        tail_ipv4_policy                        6776       6622     -154 (-2.27%)
bpf_lxc.o        tail_ipv4_to_endpoint                   7523       7219     -304 (-4.04%)
bpf_lxc.o        tail_ipv6_ct_egress                    10275       9999     -276 (-2.69%)
bpf_lxc.o        tail_ipv6_ct_ingress                    6466       6438      -28 (-0.43%)
bpf_lxc.o        tail_ipv6_ct_ingress_policy_only        6466       6438      -28 (-0.43%)
bpf_lxc.o        tail_ipv6_policy                        6859       5159   -1700 (-24.78%)
bpf_lxc.o        tail_ipv6_to_endpoint                   7039       4427   -2612 (-37.11%)
bpf_lxc.o        tail_nodeport_ipv6_dsr                  1175       1033    -142 (-12.09%)
bpf_lxc.o        tail_nodeport_nat_egress_ipv4          16318      16292      -26 (-0.16%)
bpf_lxc.o        tail_nodeport_nat_ingress_ipv4         18907      18490     -417 (-2.21%)
bpf_lxc.o        tail_nodeport_nat_ingress_ipv6         14624      14556      -68 (-0.46%)
bpf_lxc.o        tail_nodeport_rev_dnat_ipv4             4776       4588     -188 (-3.94%)
bpf_overlay.o    tail_handle_inter_cluster_revsnat      15733      15498     -235 (-1.49%)
bpf_overlay.o    tail_handle_ipv4                      124682     105717  -18965 (-15.21%)
bpf_overlay.o    tail_handle_ipv6                       16201      15801     -400 (-2.47%)
bpf_overlay.o    tail_handle_snat_fwd_ipv4              21280      19323    -1957 (-9.20%)
bpf_overlay.o    tail_handle_snat_fwd_ipv6              20824      20822       -2 (-0.01%)
bpf_overlay.o    tail_nodeport_ipv6_dsr                  1175       1033    -142 (-12.09%)
bpf_overlay.o    tail_nodeport_nat_egress_ipv4          16293      16267      -26 (-0.16%)
bpf_overlay.o    tail_nodeport_nat_ingress_ipv4         20841      20737     -104 (-0.50%)
bpf_overlay.o    tail_nodeport_nat_ingress_ipv6         14678      14629      -49 (-0.33%)
bpf_sock.o       cil_sock4_connect                       1678       1623      -55 (-3.28%)
bpf_sock.o       cil_sock4_sendmsg                       1791       1736      -55 (-3.07%)
bpf_sock.o       cil_sock6_connect                       3641       3600      -41 (-1.13%)
bpf_sock.o       cil_sock6_recvmsg                       2048       1899     -149 (-7.28%)
bpf_sock.o       cil_sock6_sendmsg                       3755       3721      -34 (-0.91%)
bpf_wireguard.o  tail_handle_ipv4                       31180      27484   -3696 (-11.85%)
bpf_wireguard.o  tail_handle_ipv6                       12095      11760     -335 (-2.77%)
bpf_wireguard.o  tail_nodeport_ipv6_dsr                  1232       1094    -138 (-11.20%)
bpf_wireguard.o  tail_nodeport_nat_egress_ipv4          16071      16061      -10 (-0.06%)
bpf_wireguard.o  tail_nodeport_nat_ingress_ipv4         20804      20565     -239 (-1.15%)
bpf_wireguard.o  tail_nodeport_nat_ingress_ipv6         13490      12224    -1266 (-9.38%)
bpf_xdp.o        tail_lb_ipv4                           49695      42673   -7022 (-14.13%)
bpf_xdp.o        tail_lb_ipv6                          122683      87896  -34787 (-28.36%)
bpf_xdp.o        tail_nodeport_ipv6_dsr                  1833       1862      +29 (+1.58%)
bpf_xdp.o        tail_nodeport_nat_egress_ipv4           6999       6990       -9 (-0.13%)
bpf_xdp.o        tail_nodeport_nat_ingress_ipv4         28903      28780     -123 (-0.43%)
bpf_xdp.o        tail_nodeport_nat_ingress_ipv6        200361     197771    -2590 (-1.29%)
bpf_xdp.o        tail_nodeport_rev_dnat_ipv4             4606       4454     -152 (-3.30%)

Changes to sched-ext:
File                       Program           Insns (A)  Insns (B)  Insns    (DIFF)
_________________________  ________________  _________  _________  _______________
scx_arena_selftests.bpf.o  arena_selftest       236305     236251     -54 (-0.02%)
scx_chaos.bpf.o            chaos_dispatch        12282       8013  -4269 (-34.76%)
scx_chaos.bpf.o            chaos_enqueue         11398       7126  -4272 (-37.48%)
scx_chaos.bpf.o            chaos_init             3854       3828     -26 (-0.67%)
scx_flash.bpf.o            flash_init             1015        979     -36 (-3.55%)
scx_flatcg.bpf.o           fcg_dispatch           1143       1100     -43 (-3.76%)
scx_lavd.bpf.o             lavd_enqueue          35487      35472     -15 (-0.04%)
scx_lavd.bpf.o             lavd_init             21127      21107     -20 (-0.09%)
scx_p2dq.bpf.o             p2dq_enqueue          10210       7854  -2356 (-23.08%)
scx_p2dq.bpf.o             p2dq_init              3233       3207     -26 (-0.80%)
scx_qmap.bpf.o             qmap_init             20285      20230     -55 (-0.27%)
scx_rusty.bpf.o            rusty_select_cpu       1165       1148     -17 (-1.46%)
scxtop.bpf.o               on_sched_switch        2369       2355     -14 (-0.59%)

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260323022410.75444-1-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-24 11:59:52 -07:00
Kumar Kartikeya Dwivedi
02bcf8ef26 bpf: Update MAINTAINERS file for general BPF entry
Per discussion with Alexei, add Eduard and myself as maintainers under
BPF [GENERAL]. While at it, drop R entries for reviewers who have been
inactive.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260324152230.2916217-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-24 09:05:11 -07:00
Varun R Mallya
b43d574c00 selftests/bpf: Add test for struct_ops __ref argument in any position
Add a selftest to verify that the verifier correctly identifies refcounted
arguments in struct_ops programs, even when they are not the first
argument. This ensures that the restriction on tail calls for programs
with __ref arguments is properly enforced regardless of which argument
they appear in.

This test verifies the fix for check_struct_ops_btf_id() proposed by
Keisuke Nishimura [0], which corrected a bug where only the first
argument was checked for the refcounted flag.

The test includes:
- An update to bpf_testmod to add 'test_refcounted_multi', an operator with
  three arguments where the third is tagged with "__ref".
- A BPF program 'test_refcounted_multi' that attempts a tail call.
- A test runner that asserts the verifier rejects the program with
  "program with __ref argument cannot tail call".

[0]: https://lore.kernel.org/bpf/20260320130219.63711-1-keisuke.nishimura@inria.fr/

Signed-off-by: Varun R Mallya <varunrmallya@gmail.com>
Link: https://lore.kernel.org/r/20260321214038.80479-1-varunrmallya@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-24 08:51:23 -07:00
Keisuke Nishimura
25e3e1f109 bpf: Fix refcount check in check_struct_ops_btf_id()
The current implementation only checks whether the first argument is
refcounted. Fix this by iterating over all arguments.
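
As a rough illustration of the difference between the two behaviors, here is a
minimal userspace sketch. It is not the kernel code: struct arg_info and both
helper names are hypothetical stand-ins for the per-argument metadata that
check_struct_ops_btf_id() inspects. It only contrasts a first-argument-only
check with a scan over every argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for per-argument metadata. */
struct arg_info {
	bool refcounted;	/* true if the argument is tagged "__ref" */
};

/* Old shape: only the first argument is examined. */
static bool has_ref_arg_first_only(const struct arg_info *args, size_t nr_args)
{
	assert(args != NULL || nr_args == 0);
	return nr_args > 0 && args[0].refcounted;
}

/* Fixed shape: every argument is examined. */
static bool has_ref_arg_any(const struct arg_info *args, size_t nr_args)
{
	assert(args != NULL || nr_args == 0);
	for (size_t i = 0; i < nr_args; i++) {
		if (args[i].refcounted)
			return true;
	}
	return false;
}
```

With three arguments where only the third is refcounted, the first helper
reports no refcounted argument while the second one finds it, which is the
case the fix is meant to cover.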

Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr>
Fixes: 38f1e66abd ("bpf: Do not allow tail call in strcut_ops program with __ref argument")
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Amery Hung <ameryhung@gmail.com>
Link: https://lore.kernel.org/r/20260320130219.63711-1-keisuke.nishimura@inria.fr
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-24 08:50:20 -07:00