linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-31 10:33:41 +02:00

Author	SHA1	Message	Date
Jiayuan Chen	a0c584fc18	bpf: Fix use-after-free in offloaded map/prog info fill When querying info for an offloaded BPF map or program, bpf_map_offload_info_fill_ns() and bpf_prog_offload_info_fill_ns() obtain the network namespace with get_net(dev_net(offmap->netdev)). However, the associated netdev's netns may be racing with teardown during netns destruction. If the netns refcount has already reached 0, get_net() performs a refcount_t increment on 0, triggering: refcount_t: addition on 0; use-after-free. Although rtnl_lock and bpf_devs_lock ensure the netdev pointer remains valid, they cannot prevent the netns refcount from reaching zero. Fix this by using maybe_get_net() instead of get_net(). maybe_get_net() uses refcount_inc_not_zero() and returns NULL if the refcount is already zero, which causes ns_get_path_cb() to fail and the caller to return -ENOENT -- the correct behavior when the netns is being destroyed. Fixes: `675fc275a3` ("bpf: offload: report device information for offloaded programs") Fixes: `52775b33bb` ("bpf: offload: report device information about offloaded maps") Reported-by: Yinhao Hu <dddddd@hust.edu.cn> Reported-by: Kaiyan Mei <M202472210@hust.edu.cn> Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn> Closes: https://lore.kernel.org/bpf/f0aa3678-79c9-47ae-9e8c-02a3d1df160a@hust.edu.cn/ Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260409023733.168050-1-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-09 13:24:32 -07:00
Daniel Borkmann	9f118095dd	bpf: Drop pkt_end markers on arithmetic to prevent is_pkt_ptr_branch_taken When a pkt pointer acquires AT_PKT_END or BEYOND_PKT_END range from a comparison, and then, known-constant arithmetic is performed, adjust_ptr_min_max_vals() copies the stale range via dst_reg->raw = ptr_reg->raw without clearing the negative reg->range sentinel values. This lets is_pkt_ptr_branch_taken() choose one branch direction and skip going through the other. Fix this by clearing negative pkt range values (that is, AT_PKT_END and BEYOND_PKT_END) after arithmetic on pkt pointers. This ensures is_pkt_ptr_branch_taken() returns unknown and both branches are properly verified. Fixes: `6d94e741a8` ("bpf: Support for pointers beyond pkt_end.") Reported-by: STAR Labs SG <info@starlabs.sg> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260409155016.536608-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-09 13:11:31 -07:00
Daniel Borkmann	9dba0ae973	bpf: Remove static qualifier from local subprog pointer The local subprog pointer in create_jt() and visit_abnormal_return_insn() was declared static. It is unconditionally assigned via bpf_find_containing_subprog() before every use. Thus, the static qualifier serves no purpose and rather creates confusion. Just remove it. Fixes: `e40f5a6bf8` ("bpf: correct stack liveness for tail calls") Fixes: `493d9e0d60` ("bpf, x86: add support for indirect jumps") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Anton Protopopov <a.s.protopopov@gmail.com> Link: https://lore.kernel.org/r/20260408191242.526279-3-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-08 18:43:28 -07:00
Daniel Borkmann	ee861486e3	bpf: Fix ld_{abs,ind} failure path analysis in subprogs Usage of ld_{abs,ind} instructions got extended into subprogs some time ago via commit `09b28d76ea` ("bpf: Add abnormal return checks."). These are only allowed in subprograms when the latter are BTF annotated and have scalar return types. The code generator in bpf_gen_ld_abs() has an abnormal exit path (r0=0 + exit) from legacy cBPF times. While the enforcement is on scalar return types, the verifier must also simulate the path of abnormal exit if the packet data load via ld_{abs,ind} failed. This is currently not the case. Fix it by having the verifier simulate both success and failure paths, and extend it in similar ways as we do for tail calls. The success path (r0=unknown, continue to next insn) is pushed onto stack for later validation and the r0=0 and return to the caller is done on the fall-through side. Fixes: `09b28d76ea` ("bpf: Add abnormal return checks.") Reported-by: STAR Labs SG <info@starlabs.sg> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260408191242.526279-2-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-08 18:43:28 -07:00
Daniel Borkmann	6bd96e40f3	bpf: Propagate error from visit_tailcall_insn Commit `e40f5a6bf8` ("bpf: correct stack liveness for tail calls") added visit_tailcall_insn() but did not check its return value. Fixes: `e40f5a6bf8` ("bpf: correct stack liveness for tail calls") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260408191242.526279-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-08 18:43:28 -07:00
Kumar Kartikeya Dwivedi	4f64d5b664	bpf: Make find_linfo widely available Move find_linfo() as bpf_find_linfo() into core.c to allow for its use in the verifier in subsequent patches. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/20260408021359.3786905-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-08 18:09:56 -07:00
Kumar Kartikeya Dwivedi	fbb98834a9	bpf: Extract bpf_get_linfo_file_line Extract bpf_get_linfo_file_line as its own function so that the logic to obtain the file, line, and line number for a given program can be shared in subsequent patches. Reviewed-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260408021359.3786905-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-08 18:09:56 -07:00
Amery Hung	017f5c4ef7	bpf: Allow overwriting referenced dynptr when refcnt > 1 The verifier currently does not allow overwriting a referenced dynptr's stack slot to prevent resource leak. This is because referenced dynptr holds additional resources that requires calling specific helpers to release. This limitation can be relaxed when there are multiple copies of the same dynptr. Whether it is the orignial dynptr or one of its clones, as long as there exists at least one other dynptr with the same ref_obj_id (to be used to release the reference), its stack slot should be allowed to be overwritten. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260406150548.1354271-2-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-07 18:20:49 -07:00
Daniel Borkmann	1b327732c8	bpf: Clear delta when clearing reg id for non-{add,sub} ops When a non-{add,sub} alu op such as xor is performed on a scalar register that previously had a BPF_ADD_CONST delta, the else path in adjust_reg_min_max_vals() only clears dst_reg->id but leaves dst_reg->delta unchanged. This stale delta can propagate via assign_scalar_id_before_mov() when the register is later used in a mov. It gets a fresh id but keeps the stale delta from the old (now-cleared) BPF_ADD_CONST. This stale delta can later propagate leading to a verifier-vs- runtime value mismatch. The clear_id label already correctly clears both delta and id. Make the else path consistent by also zeroing the delta when id is cleared. More generally, this introduces a helper clear_scalar_id() which internally takes care of zeroing. There are various other locations in the verifier where only the id is cleared. By using the helper we catch all current and future locations. Fixes: `98d7ca374b` ("bpf: Track delta between "linked" registers.") Reported-by: STAR Labs SG <info@starlabs.sg> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260407192421.508817-2-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-07 18:15:42 -07:00
Daniel Borkmann	d7f14173c0	bpf: Fix linked reg delta tracking when src_reg == dst_reg Consider the case of rX += rX where src_reg and dst_reg are pointers to the same bpf_reg_state in adjust_reg_min_max_vals(). The latter first modifies the dst_reg in-place, and later in the delta tracking, the subsequent is_reg_const(src_reg)/reg_const_value(src_reg) reads the post-{add,sub} value instead of the original source. This is problematic since it sets an incorrect delta, which sync_linked_regs() then propagates to linked registers, thus creating a verifier-vs-runtime mismatch. Fix it by just skipping this corner case. Fixes: `98d7ca374b` ("bpf: Track delta between "linked" registers.") Reported-by: STAR Labs SG <info@starlabs.sg> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260407192421.508817-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-07 18:15:42 -07:00
Kumar Kartikeya Dwivedi	57b23c0f61	bpf: Retire rcu_trace_implies_rcu_gp() RCU Tasks Trace grace period implies RCU grace period, and this guarantee is expected to remain in the future. Only BPF is the user of this predicate, hence retire the API and clean up all in-tree users. RCU Tasks Trace is now implemented on SRCU-fast and its grace period mechanism always has at least one call to synchronize_rcu() as it is required for SRCU-fast's correctness (it replaces the smp_mb() that SRCU-fast readers skip). So, RCU-tt GP will always imply RCU GP. Reviewed-by: Puranjay Mohan <puranjay@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260407162234.785270-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-07 12:24:49 -07:00
Jiayuan Chen	beaf0e96b1	bpf: Drop task_to_inode and inet_conn_established from lsm sleepable hooks bpf_lsm_task_to_inode() is called under rcu_read_lock() and bpf_lsm_inet_conn_established() is called from softirq context, so neither hook can be used by sleepable LSM programs. Fixes: `423f16108c` ("bpf: Augment the set of sleepable LSM hooks") Reported-by: Quan Sun <2022090917019@std.uestc.edu.cn> Reported-by: Yinhao Hu <dddddd@hust.edu.cn> Reported-by: Kaiyan Mei <M202472210@hust.edu.cn> Reported-by: Dongliang Mu <dzm91@hust.edu.cn> Closes: https://lore.kernel.org/bpf/3ab69731-24d1-431a-a351-452aafaaf2a5@std.uestc.edu.cn/T/#u Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://lore.kernel.org/r/20260407122334.344072-1-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-07 07:57:07 -07:00
Anton Protopopov	43cd9d9520	bpf: Do not ignore offsets for loads from insn_arrays When a pointer to PTR_TO_INSN is dereferenced, the offset field of the BPF_LDX_MEM instruction can be nonzero. Patch the verifier to not ignore this field. Reported-by: Jiyong Yang <ksur673@gmail.com> Fixes: `493d9e0d60` ("bpf, x86: add support for indirect jumps") Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Link: https://lore.kernel.org/r/20260406160141.36943-2-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-06 18:38:32 -07:00
Gustavo A. R. Silva	18474aed5d	bpf: Avoid -Wflex-array-members-not-at-end warnings Apparently, struct bpf_empty_prog_array exists entirely to populate a single element of "items" in a global variable. "null_prog" is only used during the initializer. None of this is needed; globals will be correctly sized with an array initializer of a flexible-array member. So, remove struct bpf_empty_prog_array and adjust the rest of the code, accordingly. With these changes, fix the following warnings: ./include/linux/bpf.h:2369:31: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/acr7Whmn0br3xeBP@kspp Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-06 18:37:52 -07:00
Kumar Kartikeya Dwivedi	f25777056e	bpf: Enable unaligned accesses for syscall ctx Don't reject usage of fixed unaligned offsets for syscall ctx. Tests will be added in later commits. Unaligned offsets already work for variable offsets. Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260406194403.1649608-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-06 15:27:26 -07:00
Kumar Kartikeya Dwivedi	ae5ef001aa	bpf: Support variable offsets for syscall PTR_TO_CTX Allow accessing PTR_TO_CTX with variable offsets in syscall programs. Fixed offsets are already enabled for all program types that do not convert their ctx accesses, since the changes we made in the commit `de6c7d99f8` ("bpf: Relax fixed offset check for PTR_TO_CTX"). Note that we also lift the restriction on passing syscall context into helpers, which was not permitted before, and passing modified syscall context into kfuncs. The structure of check_mem_access can be mostly shared and preserved, but we must use check_mem_region_access to correctly verify access with variable offsets. The check made in check_helper_mem_access is hardened to only allow PTR_TO_CTX for syscall programs to be passed in as helper memory. This was the original intention of the existing code anyway, and it makes little sense for other program types' context to be utilized as a memory buffer. In case a convincing example presents itself in the future, this check can be relaxed further. We also no longer use the last-byte access to simulate helper memory access, but instead go through check_mem_region_access. Since this no longer updates our max_ctx_offset, we must do so manually, to keep track of the maximum offset at which the program ctx may be accessed. Take care to ensure that when arg_type is ARG_PTR_TO_CTX, we do not relax any fixed or variable offset constraints around PTR_TO_CTX even in syscall programs, and require them to be passed unmodified. There are several reasons why this is necessary. First, if we pass a modified ctx, then the global subprog's accesses will not update the max_ctx_offset to its true maximum offset, and can lead to out of bounds accesses. Second, tail called program (or extension program replacing global subprog) where their max_ctx_offset exceeds the program they are being called from can also cause issues. For the latter, unmodified PTR_TO_CTX is the first requirement for the fix, the second is ensuring max_ctx_offset >= the program they are being called from, which has to be a separate change not made in this commit. All in all, we can hint using arg_type when we expect ARG_PTR_TO_CTX and make our relaxation around offsets conditional on it. Drop coverage of syscall tests from verifier_ctx.c temporarily for negative cases until they are updated in subsequent commits. Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260406194403.1649608-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-06 15:27:26 -07:00
MingTao Huang	a1aa9ef47c	bpf: Fix stale offload->prog pointer after constant blinding When a dev-bound-only BPF program (BPF_F_XDP_DEV_BOUND_ONLY) undergoes JIT compilation with constant blinding enabled (bpf_jit_harden >= 2), bpf_jit_blind_constants() clones the program. The original prog is then freed in bpf_jit_prog_release_other(), which updates aux->prog to point to the surviving clone, but fails to update offload->prog. This leaves offload->prog pointing to the freed original program. When the network namespace is subsequently destroyed, cleanup_net() triggers bpf_dev_bound_netdev_unregister(), which iterates ondev->progs and calls __bpf_prog_offload_destroy(offload->prog). Accessing the freed prog causes a page fault: BUG: unable to handle page fault for address: ffffc900085f1038 Workqueue: netns cleanup_net RIP: 0010:__bpf_prog_offload_destroy+0xc/0x80 Call Trace: __bpf_offload_dev_netdev_unregister+0x257/0x350 bpf_dev_bound_netdev_unregister+0x4a/0x90 unregister_netdevice_many_notify+0x2a2/0x660 ... cleanup_net+0x21a/0x320 The test sequence that triggers this reliably is: 1. Set net.core.bpf_jit_harden=2 (echo 2 > /proc/sys/net/core/bpf_jit_harden) 2. Run xdp_metadata selftest, which creates a dev-bound-only XDP program on a veth inside a netns (./test_progs -t xdp_metadata) 3. cleanup_net -> page fault in __bpf_prog_offload_destroy Dev-bound-only programs are unique in that they have an offload structure but go through the normal JIT path instead of bpf_prog_offload_compile(). This means they are subject to constant blinding's prog clone-and-replace, while also having offload->prog that must stay in sync. Fix this by updating offload->prog in bpf_jit_prog_release_other(), alongside the existing aux->prog update. Both are back-pointers to the prog that must be kept in sync when the prog is replaced. Fixes: `2b3486bc2d` ("bpf: Introduce device-bound XDP programs") Signed-off-by: MingTao Huang <mintaohuang@tencent.com> Link: https://lore.kernel.org/r/tencent_BCF692F45859CCE6C22B7B0B64827947D406@qq.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-05 18:48:09 -07:00
Weiming Shi	5828b9e5b2	bpf: fix end-of-list detection in cgroup_storage_get_next_key() list_next_entry() never returns NULL -- when the current element is the last entry it wraps to the list head via container_of(). The subsequent NULL check is therefore dead code and get_next_key() never returns -ENOENT for the last element, instead reading storage->key from a bogus pointer that aliases internal map fields and copying the result to userspace. Replace it with list_entry_is_head() so the function correctly returns -ENOENT when there are no more entries. Fixes: `de9cbbaadb` ("bpf: introduce cgroup storage maps") Reported-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Weiming Shi <bestswngs@gmail.com> Reviewed-by: Sun Jian <sun.jian.kdev@gmail.com> Acked-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/20260403132951.43533-2-bestswngs@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-05 18:45:05 -07:00
Mykyta Yatsenko	07738bc566	bpf: Use copy_map_value_locked() in alloc_htab_elem() for BPF_F_LOCK When a BPF_F_LOCK update races with a concurrent delete, the freed element can be immediately recycled by alloc_htab_elem(). The fast path in htab_map_update_elem() performs a lockless lookup and then calls copy_map_value_locked() under the element's spin_lock. If alloc_htab_elem() recycles the same memory, it overwrites the value with plain copy_map_value(), without taking the spin_lock, causing torn writes. Use copy_map_value_locked() when BPF_F_LOCK is set so the new element's value is written under the embedded spin_lock, serializing against any stale lock holders. Fixes: `96049f3afd` ("bpf: introduce BPF_F_LOCK flag") Reported-by: Aaron Esau <aaron1esau@gmail.com> Closes: https://lore.kernel.org/all/CADucPGRvSRpkneb94dPP08YkOHgNgBnskTK6myUag_Mkjimihg@mail.gmail.com/ Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/20260401-bpf_map_torn_writes-v1-1-782d071c55e7@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-05 18:37:32 -07:00
David Hildenbrand (Arm)	0326440c35	mm: rename zap_page_range_single() to zap_vma_range() Let's rename it to make it better match our new naming scheme. While at it, polish the kerneldoc. [akpm@linux-foundation.org: fix rustfmtcheck] Link: https://lkml.kernel.org/r/20260227200848.114019-15-david@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: Puranjay Mohan <puranjay@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Arve <arve@android.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: Dave Airlie <airlied@gmail.com> Cc: David Ahern <dsahern@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hartley Sweeten <hsweeten@visionengravers.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ian Abbott <abbotti@mev.co.uk> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jakub Kacinski <kuba@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jann Horn <jannh@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Namhyung kim <namhyung@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Todd Kjos <tkjos@android.com> Cc: Tvrtko Ursulin <tursulin@ursulin.net> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:15 -07:00
David Hildenbrand (Arm)	de008c9ba5	mm/memory: remove "zap_details" parameter from zap_page_range_single() Nobody except memory.c should really set that parameter to non-NULL. So let's just drop it and make unmap_mapping_range_vma() use zap_page_range_single_batched() instead. [david@kernel.org: format on a single line] Link: https://lkml.kernel.org/r/8a27e9ac-2025-4724-a46d-0a7c90894ba7@kernel.org Link: https://lkml.kernel.org/r/20260227200848.114019-3-david@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: Puranjay Mohan <puranjay@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Arve <arve@android.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: Dave Airlie <airlied@gmail.com> Cc: David Ahern <dsahern@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hartley Sweeten <hsweeten@visionengravers.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ian Abbott <abbotti@mev.co.uk> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jakub Kacinski <kuba@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jann Horn <jannh@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Namhyung kim <namhyung@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Todd Kjos <tkjos@android.com> Cc: Tvrtko Ursulin <tursulin@ursulin.net> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-04-05 13:53:13 -07:00
Alexei Starovoitov	1a1cadbd5d	bpf: Add helper and kfunc stack access size resolution The static stack liveness analysis needs to know how many bytes a helper or kfunc accesses through a stack pointer argument, so it can precisely mark the affected stack slots as stack 'def' or 'use'. Add bpf_helper_stack_access_bytes() and bpf_kfunc_stack_access_bytes() which resolve the access size for a given call argument. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260403024422.87231-7-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-03 08:34:44 -07:00
Alexei Starovoitov	19dbb13474	bpf: Move verifier helpers to header Move several helpers to header as preparation for the subsequent stack liveness patches. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260403024422.87231-6-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-03 08:34:41 -07:00
Alexei Starovoitov	f1606dd0ac	bpf: Add bpf_compute_const_regs() and bpf_prune_dead_branches() passes Add two passes before the main verifier pass: bpf_compute_const_regs() is a forward dataflow analysis that tracks register values in R0-R9 across the program using fixed-point iteration in reverse postorder. Each register is tracked with a six-state lattice: UNVISITED -> CONST(val) / MAP_PTR(map_index) / MAP_VALUE(map_index, offset) / SUBPROG(num) -> UNKNOWN At merge points, if two paths produce the same state and value for a register, it stays; otherwise it becomes UNKNOWN. The analysis handles: - MOV, ADD, SUB, AND with immediate or register operands - LD_IMM64 for plain constants, map FDs, map values, and subprogs - LDX from read-only maps: constant-folds the load by reading the map value directly via bpf_map_direct_read() Results that fit in 32 bits are stored per-instruction in insn_aux_data and bitmasks. bpf_prune_dead_branches() uses the computed constants to evaluate conditional branches. When both operands of a conditional jump are known constants, the branch outcome is determined statically and the instruction is rewritten to an unconditional jump. The CFG postorder is then recomputed to reflect new control flow. This eliminates dead edges so that subsequent liveness analysis doesn't propagate through dead code. Also add runtime sanity check to validate that precomputed constants match the verifier's tracked state. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260403024422.87231-5-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-03 08:34:36 -07:00
Alexei Starovoitov	e6898ec751	bpf: Sort subprogs in topological order after check_cfg() Add a pass that sorts subprogs in topological order so that iterating subprog_topo_order[] walks leaf subprogs first, then their callers. This is computed as a DFS post-order traversal of the CFG. The pass runs after check_cfg() to ensure the CFG has been validated before traversing and after postorder has been computed to avoid walking dead code. Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260403024422.87231-3-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-03 08:34:30 -07:00
Alexei Starovoitov	503d21ef8e	bpf: Do register range validation early Instead of checking src/dst range multiple times during the main verifier pass do them once. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260403024422.87231-2-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-03 08:34:26 -07:00
Alexei Starovoitov	891a05ccba	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf 7.0-rc6+ Cross-merge BPF and other fixes after downstream PR. Minor conflict in kernel/bpf/verifier.c Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-03 08:14:13 -07:00
Harishankar Vishwanathan	b254c6d816	bpf: Simulate branches to prune based on range violations This patch fixes the invariant violations that can happen after we refine ranges & tnum based on an incorrectly-detected branch condition. For example, the branch is always true, but we miss it in is_branch_taken; we then refine based on the branch being false and end up with incoherent ranges (e.g. umax < umin). To avoid this, we can simulate the refinement on both branches. More specifically, this patch simulates both branches taken using regs_refine_cond_op and reg_bounds_sync. If the resulting register states are ill-formed on one of the branches, is_branch_taken can mark that branch as "never taken". On a more formal note, we can deduce a branch is not taken when regs_refine_cond_op or reg_bounds_sync returns an ill-formed state because the branch operators are sound (verified with Agni [1]). Soundness means that the verifier is guaranteed to produce sound outputs on the taken branches. On the non-taken branch (explored because of imprecision in the bounds), the verifier is free to produce any output. We use ill-formedness as a signal that the branch is dead and prune that branch. This patch moves the refinement logic for both branches from reg_set_min_max to their own function, simulate_both_branches_taken, which is called from is_scalar_branch_taken. As a result, reg_set_min_max now only runs sanity checks and has been renamed to reg_bounds_sanity_check_branches to reflect that. We have had five patches fixing specific cases of invariant violations in the past, all added with selftests: - commit `fbc7aef517` ("bpf: Fix u32/s32 bounds when ranges cross min/max boundary") - commit `efc11a6678` ("bpf: Improve bounds when tnum has a single possible value") - commit `f41345f47f` ("bpf: Use tnums for JEQ/JNE is_branch_taken logic") - commit `00bf8d0c6c` ("bpf: Improve bounds when s64 crosses sign boundary") - commit `6279846b9b` ("bpf: Forget ranges when refining tnum after JSET") To confirm that this patch addresses all invariant violations, we have also reverted those five commits and verified that their related selftests don't cause any invariant violation warnings anymore. Those selftests still fail but only because of misdetected branches or less-precise bounds than expected. This demonstrates that the current patch is enough to avoid the invariant violation warning AND that the previous five patches are still useful to improve branch detection. In addition to the selftests, this change was also tested with the Cilium complexity test suite: all programs were successfully loaded and it didn't change the number of processed instructions. Link: https://github.com/bpfverif/agni [1] Reported-by: syzbot+c950cc277150935cc0b5@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=c950cc277150935cc0b5 Co-developed-by: Paul Chaignon <paul.chaignon@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Co-developed-by: Srinivas Narayana <srinivas.narayana@rutgers.edu> Signed-off-by: Srinivas Narayana <srinivas.narayana@rutgers.edu> Co-developed-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu> Signed-off-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu> Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/a166b54a3cbbbdbcdf8a87f53045f1097176218b.1775142354.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 18:23:25 -07:00
Harishankar Vishwanathan	a2a14e874b	bpf: Exit early if reg_bounds_sync gets invalid inputs In the subsequent commit, to prune dead branches we will rely on detecting ill-formed ranges using range_bounds_violations() (e.g., umin > umax) after refining register bounds using regs_refine_cond_op(). However, reg_bounds_sync() can sometimes "repair" ill-formed bounds, potentially masking a violation that was produced by regs_refine_cond_op(). This commit modifies reg_bounds_sync() to exit early if an invariant violation is already present in the input. This ensures ill-formed reg_states remain ill-formed after reg_bounds_sync(), allowing simulate_both_branches_taken() to correctly identify dead branches with a single check to range_bounds_violation(). Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/73127d628841c59cb7423d6bdcd204bf90bcdc80.1775142354.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 18:23:25 -07:00
Paul Chaignon	ec1d77cb0e	bpf: Use bpf_verifier_env buffers for reg_set_min_max In a subsequent patch, the regs_refine_cond_op and reg_bounds_sync functions will be called in is_branch_taken instead of reg_set_min_max, to simulate each branch's outcome. Since they will run before we branch out, these two functions will need to work on temporary registers for the two branches. This refactoring patch prepares for that change, by introducing the temporary registers on bpf_verifier_env and using them in reg_set_min_max. This change also allows us to save one fake_reg slot as we don't need to allocate an additional temporary buffer in case of a BPF_K condition. Finally, you may notice that this patch removes the check for "false_reg1 == false_reg2" in reg_set_min_max. That check was introduced in commit `d43ad9da80` ("bpf: Skip bounds adjustment for conditional jumps on same scalar register") to avoid an invariant violation. Given that "env->false_reg1 == env->false_reg2" doesn't make sense and invariant violations are addressed in a subsequent commit, this patch just removes the check. Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Co-developed-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/260b0270052944a420e1c56e6a92df4d43cadf03.1775142354.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 18:23:25 -07:00
Harishankar Vishwanathan	a1311b94ef	bpf: Refactor reg_bounds_sanity_check This commit refactors reg_bounds_sanity_check to factor out the logic that performs the sanity check from the logic that does the reporting. Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/198ec3e69343e2c46dc9cbe2b1bc9be9ae2df5bd.1775142354.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 18:23:24 -07:00
Daniel Borkmann	179ee84a89	bpf: Fix incorrect pruning due to atomic fetch precision tracking When backtrack_insn encounters a BPF_STX instruction with BPF_ATOMIC and BPF_FETCH, the src register (or r0 for BPF_CMPXCHG) also acts as a destination, thus receiving the old value from the memory location. The current backtracking logic does not account for this. It treats atomic fetch operations the same as regular stores where the src register is only an input. This leads the backtrack_insn to fail to propagate precision to the stack location, which is then not marked as precise! Later, the verifier's path pruning can incorrectly consider two states equivalent when they differ in terms of stack state. Meaning, two branches can be treated as equivalent and thus get pruned when they should not be seen as such. Fix it as follows: Extend the BPF_LDX handling in backtrack_insn to also cover atomic fetch operations via is_atomic_fetch_insn() helper. When the fetch dst register is being tracked for precision, clear it, and propagate precision over to the stack slot. For non-stack memory, the precision walk stops at the atomic instruction, same as regular BPF_LDX. This covers all fetch variants. Before: 0: (b7) r1 = 8 ; R1=8 1: (7b) (u64 )(r10 -8) = r1 ; R1=8 R10=fp0 fp-8=8 2: (b7) r2 = 0 ; R2=0 3: (db) r2 = atomic64_fetch_add((u64 )(r10 -8), r2) ; R2=8 R10=fp0 fp-8=mmmmmmmm 4: (bf) r3 = r10 ; R3=fp0 R10=fp0 5: (0f) r3 += r2 mark_precise: frame0: last_idx 5 first_idx 0 subseq_idx -1 mark_precise: frame0: regs=r2 stack= before 4: (bf) r3 = r10 mark_precise: frame0: regs=r2 stack= before 3: (db) r2 = atomic64_fetch_add((u64 )(r10 -8), r2) mark_precise: frame0: regs=r2 stack= before 2: (b7) r2 = 0 6: R2=8 R3=fp8 6: (b7) r0 = 0 ; R0=0 7: (95) exit After: 0: (b7) r1 = 8 ; R1=8 1: (7b) (u64 )(r10 -8) = r1 ; R1=8 R10=fp0 fp-8=8 2: (b7) r2 = 0 ; R2=0 3: (db) r2 = atomic64_fetch_add((u64 )(r10 -8), r2) ; R2=8 R10=fp0 fp-8=mmmmmmmm 4: (bf) r3 = r10 ; R3=fp0 R10=fp0 5: (0f) r3 += r2 mark_precise: frame0: last_idx 5 first_idx 0 subseq_idx -1 mark_precise: frame0: regs=r2 stack= before 4: (bf) r3 = r10 mark_precise: frame0: regs=r2 stack= before 3: (db) r2 = atomic64_fetch_add((u64 )(r10 -8), r2) mark_precise: frame0: regs= stack=-8 before 2: (b7) r2 = 0 mark_precise: frame0: regs= stack=-8 before 1: (7b) (u64 )(r10 -8) = r1 mark_precise: frame0: regs=r1 stack= before 0: (b7) r1 = 8 6: R2=8 R3=fp8 6: (b7) r0 = 0 ; R0=0 7: (95) exit Fixes: `5ffa25502b` ("bpf: Add instructions for atomic_[cmp]xchg") Fixes: `5ca419f286` ("bpf: Add BPF_FETCH field / create atomic_fetch_add instruction") Reported-by: STAR Labs SG <info@starlabs.sg> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260331222020.401848-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 09:57:59 -07:00
Qi Tang	b0db1accbc	bpf: reject direct access to nullable PTR_TO_BUF pointers check_mem_access() matches PTR_TO_BUF via base_type() which strips PTR_MAYBE_NULL, allowing direct dereference without a null check. Map iterator ctx->key and ctx->value are PTR_TO_BUF \| PTR_MAYBE_NULL. On stop callbacks these are NULL, causing a kernel NULL dereference. Add a type_may_be_null() guard to the PTR_TO_BUF branch, matching the existing PTR_TO_BTF_ID pattern. Fixes: `20b2aff4bc` ("bpf: Introduce MEM_RDONLY flag") Signed-off-by: Qi Tang <tpluszz77@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260402092923.38357-2-tpluszz77@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 09:47:13 -07:00
Mykyta Yatsenko	cc878b4144	bpf: Migrate dynptr file to kmalloc_nolock Replace bpf_mem_alloc/bpf_mem_free with kmalloc_nolock/kfree_nolock for bpf_dynptr_file_impl, continuing the migration away from bpf_mem_alloc now that kmalloc can be used from NMI context. freader_cleanup() runs before kfree_nolock() while the dynptr still holds exclusive access, so plain kfree_nolock() is safe — no concurrent readers can access the object. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260330-kmalloc_special-v2-2-c90403f92ff0@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 09:31:42 -07:00
Mykyta Yatsenko	90f51ebff2	bpf: Migrate bpf_task_work to kmalloc_nolock Replace bpf_mem_alloc/bpf_mem_free with kmalloc_nolock/kfree_rcu for bpf_task_work_ctx. Replace guard(rcu_tasks_trace)() with guard(rcu)() in bpf_task_work_irq(). The function only accesses ctx struct members (not map values), so tasks trace protection is not needed - regular RCU is sufficient since ctx is freed via kfree_rcu. The guard in bpf_task_work_callback() remains as tasks trace since it accesses map values from process context. Sleepable BPF programs hold rcu_read_lock_trace but not regular rcu_read_lock. Since kfree_rcu waits for a regular RCU grace period, the ctx memory can be freed while a sleepable program is still running. Add scoped_guard(rcu) around the pointer read and refcount tryget in bpf_task_work_acquire_ctx to close this race window. Since kfree_rcu uses call_rcu internally which is not safe from NMI context, defer destruction via irq_work when irqs are disabled. For the lost-cmpxchg path the ctx was never published, so kfree_nolock is safe. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260330-kmalloc_special-v2-1-c90403f92ff0@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 09:31:42 -07:00
Leon Hwang	611fe4b79a	bpf: Fix abuse of kprobe_write_ctx via freplace uprobe programs are allowed to modify struct pt_regs. Since the actual program type of uprobe is KPROBE, it can be abused to modify struct pt_regs via kprobe+freplace when the kprobe attaches to kernel functions. For example, SEC("?kprobe") int kprobe(struct pt_regs regs) { return 0; } SEC("?freplace") int freplace_kprobe(struct pt_regs regs) { regs->di = 0; return 0; } freplace_kprobe prog will attach to kprobe prog. kprobe prog will attach to a kernel function. Without this patch, when the kernel function runs, its first arg will always be set as 0 via the freplace_kprobe prog. To fix the abuse of kprobe_write_ctx=true via kprobe+freplace, disallow attaching freplace programs on kprobe programs with different kprobe_write_ctx values. Fixes: `7384893d97` ("bpf: Allow uprobe program to change context registers") Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20260331145353.87606-2-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-02 09:29:49 -07:00
Kumar Kartikeya Dwivedi	c76fef7dcd	bpf: Fix grace period wait for tracepoint bpf_link Recently, tracepoints were switched from using disabled preemption (which acts as RCU read section) to SRCU-fast when they are not faultable. This means that to do a proper grace period wait for programs running in such tracepoints, we must use SRCU's grace period wait. This is only for non-faultable tracepoints, faultable ones continue using RCU Tasks Trace. However, bpf_link_free() currently does call_rcu() for all cases when the link is non-sleepable (hence, for tracepoints, non-faultable). Fix this by doing a call_srcu() grace period wait. As far RCU Tasks Trace gp -> RCU gp chaining is concerned, it is deemed unnecessary for tracepoint programs. The link and program are either accessed under RCU Tasks Trace protection, or SRCU-fast protection now. The earlier logic of chaining both RCU Tasks Trace and RCU gp waits was to generalize the logic, even if it conceded an extra RCU gp wait, however that is unnecessary for tracepoints even before this change. In practice no cost was paid since rcu_trace_implies_rcu_gp() was always true. Hence we need not chaining any RCU gp after the SRCU gp. For instance, in the non-faultable raw tracepoint, the RCU read section of the program in __bpf_trace_run() is enclosed in the SRCU gp, likewise for faultable raw tracepoint, the program is under the RCU Tasks Trace protection. Hence, the outermost scope can be waited upon to ensure correctness. Also, sleepable programs cannot be attached to non-faultable tracepoints, so whenever program or link is sleepable, only RCU Tasks Trace protection is being used for the link and prog. Fixes: `a46023d561` ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast") Reviewed-by: Sun Jian <sun.jian.kdev@gmail.com> Reviewed-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20260331211021.1632902-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-31 16:01:13 -07:00
Alexei Starovoitov	a8502a79e8	bpf: Fix regsafe() for pointers to packet In case rold->reg->range == BEYOND_PKT_END && rcur->reg->range == N regsafe() may return true which may lead to current state with valid packet range not being explored. Fix the bug. Fixes: `6d94e741a8` ("bpf: Support for pointers beyond pkt_end.") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Amery Hung <ameryhung@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20260331204228.26726-1-alexei.starovoitov@gmail.com	2026-03-31 15:18:10 -07:00
Jiri Olsa	9eccdd38fb	bpf: Fix block device hooks names Use proper names for block device hooks names. Fixes: `46df585fcf` ("bpf: classify block device hooks appropriately") Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Closes: https://lore.kernel.org/bpf/acrVKUy_EPiFFmV9@krava/T/#m7c7906a1ff4029e29185aec3266dbf5c8996dbf7 Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/bpf/20260330210344.3073712-1-jolsa@kernel.org	2026-03-31 11:11:42 +02:00
Ihor Solodrai	d457072576	bpf: Support struct btf_struct_meta via KF_IMPLICIT_ARGS The following kfuncs currently accept void meta__ign argument: bpf_obj_new_impl * bpf_obj_drop_impl * bpf_percpu_obj_new_impl * bpf_percpu_obj_drop_impl * bpf_refcount_acquire_impl * bpf_list_push_back_impl * bpf_list_push_front_impl * bpf_rbtree_add_impl The __ign suffix is an indicator for the verifier to skip the argument in check_kfunc_args(). Then, in fixup_kfunc_call() the verifier may set the value of this argument to struct btf_struct_meta * kptr_struct_meta from insn_aux_data. BPF programs must pass a dummy NULL value when calling these kfuncs. Additionally, the list and rbtree _impl kfuncs also accept an implicit u64 argument, which doesn't require __ign suffix because it's a scalar, and BPF programs explicitly pass 0. Add new kfuncs with KF_IMPLICIT_ARGS [1], that correspond to each _impl kfunc accepting meta__ign. The existing _impl kfuncs remain unchanged for backwards compatibility. To support this, add "btf_struct_meta" to the list of recognized implicit argument types in resolve_btfids. Implement is_kfunc_arg_implicit() in the verifier, that determines implicit args by inspecting both a non-_impl BTF prototype of the kfunc. Update the special_kfunc_list in the verifier and relevant checks to support both the old _impl and the new KF_IMPLICIT_ARGS variants of btf_struct_meta users. [1] https://lore.kernel.org/bpf/20260120222638.3976562-1-ihor.solodrai@linux.dev/ Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260327203241.3365046-1-ihor.solodrai@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-29 09:56:06 -07:00
Christian Brauner	46df585fcf	bpf: classify block device hooks appropriately A bunch of new hooks for managing block devices were added a while ago but they weren't actually appropriately classified. * bpf_lsm_bdev_alloc() is called when the inode for the block device is allocated. This happens from a sleepable context so mark the function as sleepable. When this function is called the memory for the block device storage embedded into the inode is zeroed. That block device cannot be meaningfully reference or interacted with at this point. So mark it as untrusted for now. * bpf_lsm_bdev_free() is called when the inode for the block device is freed. A bunch of memory associated with the block device has already been freed and there's dangling pointers in there. So mark it as untrusted. It cannot be meaningfully referenced or interacted with anymore. It is also called from sb->s_op->free_inode:: which means it runs in rcu context (most of the times). So leave it as non-sleepable. * bpf_lsm_bdev_setintegrity() is called when a dm-verity device is instantiated (glossing over details for simplicity of the commit message). The block device is very much alive so it remains a trusted hook. It's also called with device mapper's suspend lock held and so the hook is able to sleep so mark it sleepable. Signed-off-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20260326-work-bpf-bdev-v2-1-5e3c58963987@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-27 09:05:13 -07:00
Alan Maguire	626e88c070	btf: Support kernel parsing of BTF with layout info Validate layout if present, but because the kernel must be strict in what it accepts, reject BTF with unsupported kinds, even if they are in the layout information. Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20260326145444.2076244-8-alan.maguire@oracle.com	2026-03-26 13:53:56 -07:00
Alexei Starovoitov	4639eb9e30	bpf: Fix variable length stack write over spilled pointers Scrub slots if variable-offset stack write goes over spilled pointers. Otherwise is_spilled_reg() may == true && spilled_ptr.type == NOT_INIT and valid program is rejected by check_stack_read_fixed_off() with obscure "invalid size of register fill" message. Fixes: `01f810ace9` ("bpf: Allow variable-offset stack access") Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260324215938.81733-1-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-24 17:00:11 -07:00
David Carlier	8ed82f807b	bpf: Use RCU-safe iteration in dev_map_redirect_multi() SKB path The DEVMAP_HASH branch in dev_map_redirect_multi() uses hlist_for_each_entry_safe() to iterate hash buckets, but this function runs under RCU protection (called from xdp_do_generic_redirect_map() in softirq context). Concurrent writers (__dev_map_hash_update_elem, dev_map_hash_delete_elem) modify the list using RCU primitives (hlist_add_head_rcu, hlist_del_rcu). hlist_for_each_entry_safe() performs plain pointer dereferences without rcu_dereference(), missing the acquire barrier needed to pair with writers' rcu_assign_pointer(). On weakly-ordered architectures (ARM64, POWER), a reader can observe a partially-constructed node. It also defeats CONFIG_PROVE_RCU lockdep validation and KCSAN data-race detection. Replace with hlist_for_each_entry_rcu() using rcu_read_lock_bh_held() as the lockdep condition, consistent with the rcu_dereference_check() used in the DEVMAP (non-hash) branch of the same functions. Also fix the same incorrect lockdep_is_held(&dtab->index_lock) condition in dev_map_enqueue_multi(), where the lock is not held either. Fixes: `e624d4ed4a` ("xdp: Extend xdp_redirect_map with broadcast support") Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260320072645.16731-1-devnexen@gmail.com	2026-03-24 15:17:20 -07:00
Kexin Sun	70b5f3f782	bpf: update outdated comment for refactored btf_check_kfunc_arg_match() The function btf_check_kfunc_arg_match() was refactored into check_kfunc_args() by commit `00b85860fe` ("bpf: Rewrite kfunc argument handling"). Update the comment accordingly. Assisted-by: unnamed:deepseek-v3.2 coccinelle Signed-off-by: Kexin Sun <kexinsun@smail.nju.edu.cn> Acked-by: Yonghong Song <yonghong.song@linux.dev> Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://lore.kernel.org/r/20260321105658.6006-1-kexinsun@smail.nju.edu.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-24 13:37:29 -07:00
Slava Imameev	4145203841	bpf: Support pointer param types via SCALAR_VALUE for trampolines Add BPF verifier support for single- and multi-level pointer parameters and return values in BPF trampolines by treating these parameters as SCALAR_VALUE. This extends the existing support for int and void pointers that are already treated as SCALAR_VALUE. This provides consistent logic for single and multi-level pointers: if a type is treated as SCALAR for a single-level pointer, the same applies to multi-level pointers. The exception is pointer-to-struct, which is currently PTR_TO_BTF_ID for single-level but treated as scalar for multi-level pointers since the verifier lacks context to infer the size of target memory regions. Safety is ensured by existing BTF verification, which rejects invalid pointer types at the BTF verification stage. Signed-off-by: Slava Imameev <slava.imameev@crowdstrike.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260314082127.7939-2-slava.imameev@crowdstrike.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-24 13:36:31 -07:00
Alexei Starovoitov	596bef1d71	bpf: Support 32-bit scalar spills in stacksafe() v1->v2: updated comments v1: https://lore.kernel.org/bpf/20260322225124.14005-1-alexei.starovoitov@gmail.com/ The commit `6efbde200b` ("bpf: Handle scalar spill vs all MISC in stacksafe()") in stacksafe() only recognized full 64-bit scalar spills when comparing stack states for equivalence during state pruning and missed 32-bit scalar spill. When 32-bit scalar is spilled the check_stack_write_fixed_off() -> save_register_state() calls mark_stack_slot_misc() for slot[0-3], which preserves STACK_INVALID and STACK_ZERO (on a fresh stack slot[0-3] remain STACK_INVALID), sets slot[4-7] = STACK_SPILL, and updates spilled_ptr. The im=4 path is only reached when im=0 fails: The loop at im=0 already attempts the 64-bit scalar-spill/all-MISC check. If it matches, i advances by 7, skipping the entire 8-byte slot. So im=4 is only reached when bytes 0-3 are neither a scalar spill nor all-MISC — they must pass individual byte-by-byte comparison first. Then bytes 4-7 get the scalar-unit treatment. is_spilled_scalar_after(stack, 4): slot_type[4] == STACK_SPILL from a 64-bit spill would have been caught at im=0 (unless it's a pointer spill, in which case spilled_ptr.type != SCALAR_VALUE -> returns false at im=4 too). A partial overwrite of a 64-bit spill invalidates the entire slot in check_stack_write_fixed_off(). is_stack_misc_after(stack, 4): Only checks bytes 4-7 are MISC/INVALID, returns &unbound_reg. Comparing two unbound regs via regsafe() is safe. Changes to cilium programs: File Program Insns (A) Insns (B) Insns (DIFF) _______________ _________________________________ _________ _________ ________________ bpf_host.o cil_host_policy 49351 45811 -3540 (-7.17%) bpf_host.o cil_to_host 2384 2270 -114 (-4.78%) bpf_host.o cil_to_netdev 112051 100269 -11782 (-10.51%) bpf_host.o tail_handle_ipv4_cont_from_host 61175 60910 -265 (-0.43%) bpf_host.o tail_handle_ipv4_cont_from_netdev 9381 8873 -508 (-5.42%) bpf_host.o tail_handle_ipv4_from_host 12994 7066 -5928 (-45.62%) bpf_host.o tail_handle_ipv4_from_netdev 85015 59875 -25140 (-29.57%) bpf_host.o tail_handle_ipv6_cont_from_host 24732 23527 -1205 (-4.87%) bpf_host.o tail_handle_ipv6_cont_from_netdev 9463 8953 -510 (-5.39%) bpf_host.o tail_handle_ipv6_from_host 12477 11787 -690 (-5.53%) bpf_host.o tail_handle_ipv6_from_netdev 30814 30017 -797 (-2.59%) bpf_host.o tail_handle_nat_fwd_ipv4 8943 8860 -83 (-0.93%) bpf_host.o tail_handle_snat_fwd_ipv4 64716 61625 -3091 (-4.78%) bpf_host.o tail_handle_snat_fwd_ipv6 48299 30797 -17502 (-36.24%) bpf_host.o tail_ipv4_host_policy_ingress 21591 20017 -1574 (-7.29%) bpf_host.o tail_ipv6_host_policy_ingress 21177 20693 -484 (-2.29%) bpf_host.o tail_nodeport_nat_egress_ipv4 16588 16543 -45 (-0.27%) bpf_host.o tail_nodeport_nat_ingress_ipv4 39200 36116 -3084 (-7.87%) bpf_host.o tail_nodeport_nat_ingress_ipv6 50102 48003 -2099 (-4.19%) bpf_lxc.o tail_handle_ipv4_cont 113092 96891 -16201 (-14.33%) bpf_lxc.o tail_handle_ipv6 6727 6701 -26 (-0.39%) bpf_lxc.o tail_handle_ipv6_cont 25567 21805 -3762 (-14.71%) bpf_lxc.o tail_ipv4_ct_egress 28843 15970 -12873 (-44.63%) bpf_lxc.o tail_ipv4_ct_ingress 16691 10213 -6478 (-38.81%) bpf_lxc.o tail_ipv4_ct_ingress_policy_only 16691 10213 -6478 (-38.81%) bpf_lxc.o tail_ipv4_policy 6776 6622 -154 (-2.27%) bpf_lxc.o tail_ipv4_to_endpoint 7523 7219 -304 (-4.04%) bpf_lxc.o tail_ipv6_ct_egress 10275 9999 -276 (-2.69%) bpf_lxc.o tail_ipv6_ct_ingress 6466 6438 -28 (-0.43%) bpf_lxc.o tail_ipv6_ct_ingress_policy_only 6466 6438 -28 (-0.43%) bpf_lxc.o tail_ipv6_policy 6859 5159 -1700 (-24.78%) bpf_lxc.o tail_ipv6_to_endpoint 7039 4427 -2612 (-37.11%) bpf_lxc.o tail_nodeport_ipv6_dsr 1175 1033 -142 (-12.09%) bpf_lxc.o tail_nodeport_nat_egress_ipv4 16318 16292 -26 (-0.16%) bpf_lxc.o tail_nodeport_nat_ingress_ipv4 18907 18490 -417 (-2.21%) bpf_lxc.o tail_nodeport_nat_ingress_ipv6 14624 14556 -68 (-0.46%) bpf_lxc.o tail_nodeport_rev_dnat_ipv4 4776 4588 -188 (-3.94%) bpf_overlay.o tail_handle_inter_cluster_revsnat 15733 15498 -235 (-1.49%) bpf_overlay.o tail_handle_ipv4 124682 105717 -18965 (-15.21%) bpf_overlay.o tail_handle_ipv6 16201 15801 -400 (-2.47%) bpf_overlay.o tail_handle_snat_fwd_ipv4 21280 19323 -1957 (-9.20%) bpf_overlay.o tail_handle_snat_fwd_ipv6 20824 20822 -2 (-0.01%) bpf_overlay.o tail_nodeport_ipv6_dsr 1175 1033 -142 (-12.09%) bpf_overlay.o tail_nodeport_nat_egress_ipv4 16293 16267 -26 (-0.16%) bpf_overlay.o tail_nodeport_nat_ingress_ipv4 20841 20737 -104 (-0.50%) bpf_overlay.o tail_nodeport_nat_ingress_ipv6 14678 14629 -49 (-0.33%) bpf_sock.o cil_sock4_connect 1678 1623 -55 (-3.28%) bpf_sock.o cil_sock4_sendmsg 1791 1736 -55 (-3.07%) bpf_sock.o cil_sock6_connect 3641 3600 -41 (-1.13%) bpf_sock.o cil_sock6_recvmsg 2048 1899 -149 (-7.28%) bpf_sock.o cil_sock6_sendmsg 3755 3721 -34 (-0.91%) bpf_wireguard.o tail_handle_ipv4 31180 27484 -3696 (-11.85%) bpf_wireguard.o tail_handle_ipv6 12095 11760 -335 (-2.77%) bpf_wireguard.o tail_nodeport_ipv6_dsr 1232 1094 -138 (-11.20%) bpf_wireguard.o tail_nodeport_nat_egress_ipv4 16071 16061 -10 (-0.06%) bpf_wireguard.o tail_nodeport_nat_ingress_ipv4 20804 20565 -239 (-1.15%) bpf_wireguard.o tail_nodeport_nat_ingress_ipv6 13490 12224 -1266 (-9.38%) bpf_xdp.o tail_lb_ipv4 49695 42673 -7022 (-14.13%) bpf_xdp.o tail_lb_ipv6 122683 87896 -34787 (-28.36%) bpf_xdp.o tail_nodeport_ipv6_dsr 1833 1862 +29 (+1.58%) bpf_xdp.o tail_nodeport_nat_egress_ipv4 6999 6990 -9 (-0.13%) bpf_xdp.o tail_nodeport_nat_ingress_ipv4 28903 28780 -123 (-0.43%) bpf_xdp.o tail_nodeport_nat_ingress_ipv6 200361 197771 -2590 (-1.29%) bpf_xdp.o tail_nodeport_rev_dnat_ipv4 4606 4454 -152 (-3.30%) Changes to sched-ext: File Program Insns (A) Insns (B) Insns (DIFF) _________________________ ________________ _________ _________ _______________ scx_arena_selftests.bpf.o arena_selftest 236305 236251 -54 (-0.02%) scx_chaos.bpf.o chaos_dispatch 12282 8013 -4269 (-34.76%) scx_chaos.bpf.o chaos_enqueue 11398 7126 -4272 (-37.48%) scx_chaos.bpf.o chaos_init 3854 3828 -26 (-0.67%) scx_flash.bpf.o flash_init 1015 979 -36 (-3.55%) scx_flatcg.bpf.o fcg_dispatch 1143 1100 -43 (-3.76%) scx_lavd.bpf.o lavd_enqueue 35487 35472 -15 (-0.04%) scx_lavd.bpf.o lavd_init 21127 21107 -20 (-0.09%) scx_p2dq.bpf.o p2dq_enqueue 10210 7854 -2356 (-23.08%) scx_p2dq.bpf.o p2dq_init 3233 3207 -26 (-0.80%) scx_qmap.bpf.o qmap_init 20285 20230 -55 (-0.27%) scx_rusty.bpf.o rusty_select_cpu 1165 1148 -17 (-1.46%) scxtop.bpf.o on_sched_switch 2369 2355 -14 (-0.59%) Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260323022410.75444-1-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-24 11:59:52 -07:00
Keisuke Nishimura	25e3e1f109	bpf: Fix refcount check in check_struct_ops_btf_id() The current implementation only checks whether the first argument is refcounted. Fix this by iterating over all arguments. Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr> Fixes: `38f1e66abd` ("bpf: Do not allow tail call in strcut_ops program with __ref argument") Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Amery Hung <ameryhung@gmail.com> Link: https://lore.kernel.org/r/20260320130219.63711-1-keisuke.nishimura@inria.fr Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-24 08:50:20 -07:00
Weixie Cui	ad2f7ed0ee	bpf: propagate kvmemdup_bpfptr errors from bpf_prog_verify_signature kvmemdup_bpfptr() returns -EFAULT when the user pointer cannot be copied, and -ENOMEM on allocation failure. The error path always returned -ENOMEM, misreporting bad addresses as out-of-memory. Return PTR_ERR(sig) so user space gets the correct errno. Signed-off-by: Weixie Cui <cuiweixie@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/tencent_C9C5B2B28413D6303D505CD02BFEA4708C07@qq.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-24 08:48:51 -07:00
Hao Sun	833ef4a954	bpf: Simplify tnum_step() Simplify tnum_step() from a 10-variable algorithm into a straight line sequence of bitwise operations. Problem Reduction: tnum_step(): Given a tnum `(tval, tmask)` where `tval & tmask == 0`, and a value `z` with `tval ≤ z < (tval \| tmask)`, find the smallest `r > z`, a tnum-satisfying value, i.e., `r & ~tmask == tval`. Every tnum-satisfying value has the form tval \| s where s is a subset of tmask bits (s & ~tmask == 0). Since tval and tmask are disjoint: tval \| s = tval + s Similarly z = tval + d where d = z - tval, so r > z becomes: tval + s > tval + d s > d The problem reduces to: find the smallest s, a subset of tmask, such that s > d. Notice that `s` must be a subset of tmask, the problem now is simplified. Algorithm: The mask bits of `d` form a "counter" that we want to increment by one, but the counter has gaps at the fixed-bit positions. A normal +1 would stop at the first 0-bit it meets; we need it to skip over fixed-bit gaps and land on the next mask bit. Step 1 -- plug the gaps: d \| carry_mask \| ~tmask - ~tmask fills all fixed-bit positions with 1. - carry_mask = (1 << fls64(d & ~tmask)) - 1 fills all positions (including mask positions) below the highest non-mask bit of d. After this, the only remaining 0s are mask bits above the highest non-mask bit of d where d is also 0 -- exactly the positions where the carry can validly land. Step 2 -- increment: (d \| carry_mask \| ~tmask) + 1 Adding 1 flips all trailing 1s to 0 and sets the first 0 to 1. Since every gap has been plugged, that first 0 is guaranteed to be a mask bit above all non-mask bits of d. Step 3 -- mask: ((d \| carry_mask \| ~tmask) + 1) & tmask Strip the scaffolding, keeping only mask bits. Call the result inc. Step 4 -- result: tval \| inc Reattach the fixed bits. A simple 8-bit example: tmask: 1 1 0 1 0 1 1 0 d: 1 0 1 0 0 0 1 0 (d = 162) ^ non-mask 1 at bit 5 With carry_mask = 0b00111111 (smeared from bit 5): d\|carry\|~tm 1 0 1 1 1 1 1 1 + 1 1 1 0 0 0 0 0 0 & tmask 1 1 0 0 0 0 0 0 The patch passes my local test: test_verifier, test_progs for `-t verifier` and `-t reg_bounds`. CBMC shows the new code is equiv to original one[1], and a lean4 proof of correctness is available[2]: theorem tnumStep_correct (tval tmask z : BitVec 64) -- Precondition: valid tnum and input z (h_consistent : (tval &&& tmask) = 0) (h_lo : tval ≤ z) (h_hi : z < (tval \|\|\| tmask)) : -- Postcondition: r must be: -- (1) tnum member -- (2) z < r -- (3) for any other member w > z, r <= w let r := tnumStep tval tmask z satisfiesTnum64 r tval tmask ∧ tval ≤ r ∧ r ≤ (tval \|\|\| tmask) ∧ z < r ∧ ∀ w, satisfiesTnum64 w tval tmask → z < w → r ≤ w := by -- unfold definition unfold tnumStep satisfiesTnum64 simp only [] refine ⟨?_, ?_, ?_, ?_, ?_⟩ -- the solver proves each conjunct · bv_decide · bv_decide · bv_decide · bv_decide · intro w hw1 hw2; bv_decide [1] https://github.com/eddyz87/tnum-step-verif/blob/master/main.c [2] https://pastebin.com/raw/czHKiyY0 Signed-off-by: Hao Sun <hao.sun@inf.ethz.ch> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Reviewed-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Link: https://lore.kernel.org/r/20260320162336.166542-1-hao.sun@inf.ethz.ch Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-24 08:45:29 -07:00
Carlos Llamas	9b0cf064ea	bpf: Switch CONFIG_CFI_CLANG to CONFIG_CFI This was renamed in commit `23ef9d4397` ("kcfi: Rename CONFIG_CFI_CLANG to CONFIG_CFI") as it is now a compiler-agnostic option. Using the wrong name results in the code getting compiled out. Meaning the CFI failures for btf_dtor_kfunc_t would still trigger. Fixes: `99fde4d062` ("bpf, btf: Enforce destructor kfunc type with CFI") Signed-off-by: Carlos Llamas <cmllamas@google.com> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20260312183818.2721750-1-cmllamas@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-24 08:42:24 -07:00
Eric Biggers	a02327413a	bpf: Remove inclusions of crypto/sha1.h Since commit `603b441623` ("bpf: Update the bpf_prog_calc_tag to use SHA256") made BPF program tags use SHA-256 instead of SHA-1, the header <crypto/sha1.h> no longer needs to be included. Remove the relevant inclusions so that they no longer unnecessarily come up in searches for which kernel code is still using the obsolete SHA-1 algorithm. Since net/ipv6/addrconf.c was relying on the transitive inclusion of <crypto/sha1.h> (for an unrelated purpose) via <linux/filter.h>, make it include <crypto/sha1.h> explicitly in order to keep that file building. Signed-off-by: Eric Biggers <ebiggers@kernel.org> Acked-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/20260314214555.112386-1-ebiggers@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-24 08:40:45 -07:00
Alexei Starovoitov	bfec8e88ff	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf 7.0-rc5 Cross-merge BPF and other fixes after downstream PR. Minor conflicts in: tools/testing/selftests/bpf/progs/exceptions_fail.c tools/testing/selftests/bpf/progs/verifier_bounds.c Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-22 19:33:29 -07:00
Daniel Borkmann	bc308be380	bpf: Fix sync_linked_regs regarding BPF_ADD_CONST32 zext propagation Jenny reported that in sync_linked_regs() the BPF_ADD_CONST32 flag is checked on known_reg (the register narrowed by a conditional branch) instead of reg (the linked target register created by an alu32 operation). Example case with reg: 1. r6 = bpf_get_prandom_u32() 2. r7 = r6 (linked, same id) 3. w7 += 5 (alu32 -- r7 gets BPF_ADD_CONST32, zero-extended by CPU) 4. if w6 < 0xFFFFFFFC goto safe (narrows r6 to [0xFFFFFFFC, 0xFFFFFFFF]) 5. sync_linked_regs() propagates to r7 but does NOT call zext_32_to_64() 6. Verifier thinks r7 is [0x100000001, 0x100000004] instead of [1, 4] Since known_reg above does not have BPF_ADD_CONST32 set above, zext_32_to_64() is never called on alu32-derived linked registers. This causes the verifier to track incorrect 64-bit bounds, while the CPU correctly zero-extends the 32-bit result. The code checking known_reg->id was correct however (see scalars_alu32_wrap selftest case), but the real fix needs to handle both directions - zext propagation should be done when either register has BPF_ADD_CONST32, since the linked relationship involves a 32-bit operation regardless of which side has the flag. Example case with known_reg (exercised also by scalars_alu32_wrap): 1. r1 = r0; w1 += 0x100 (alu32 -- r1 gets BPF_ADD_CONST32) 2. if r1 > 0x80 - known_reg = r1 (has BPF_ADD_CONST32), reg = r0 (doesn't) Hence, fix it by checking for (reg->id \| known_reg->id) & BPF_ADD_CONST32. Moreover, sync_linked_regs() also has a soundness issue when two linked registers used different ALU widths: one with BPF_ADD_CONST32 and the other with BPF_ADD_CONST64. The delta relationship between linked registers assumes the same arithmetic width though. When one register went through alu32 (CPU zero-extends the 32-bit result) and the other went through alu64 (no zero-extension), the propagation produces incorrect bounds. Example: r6 = bpf_get_prandom_u32() // fully unknown if r6 >= 0x100000000 goto out // constrain r6 to [0, U32_MAX] r7 = r6 w7 += 1 // alu32: r7.id = N \| BPF_ADD_CONST32 r8 = r6 r8 += 2 // alu64: r8.id = N \| BPF_ADD_CONST64 if r7 < 0xFFFFFFFF goto out // narrows r7 to [0xFFFFFFFF, 0xFFFFFFFF] At the branch on r7, sync_linked_regs() runs with known_reg=r7 (BPF_ADD_CONST32) and reg=r8 (BPF_ADD_CONST64). The delta path computes: r8 = r7 + (delta_r8 - delta_r7) = 0xFFFFFFFF + (2 - 1) = 0x100000000 Then, because known_reg->id has BPF_ADD_CONST32, zext_32_to_64(r8) is called, truncating r8 to [0, 0]. But r8 used a 64-bit ALU op -- the CPU does NOT zero-extend it. The actual CPU value of r8 is 0xFFFFFFFE + 2 = 0x100000000, not 0. The verifier now underestimates r8's 64-bit bounds, which is a soundness violation. Fix sync_linked_regs() by skipping propagation when the two registers have mixed ALU widths (one BPF_ADD_CONST32, the other BPF_ADD_CONST64). Lastly, fix regsafe() used for path pruning: the existing checks used "& BPF_ADD_CONST" to test for offset linkage, which treated BPF_ADD_CONST32 and BPF_ADD_CONST64 as equivalent. Fixes: `7a433e5193` ("bpf: Support negative offsets, BPF_SUB, and alu32 for linked register tracking") Reported-by: Jenny Guanni Qu <qguanni@gmail.com> Co-developed-by: Puranjay Mohan <puranjay@kernel.org> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260319211507.213816-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-21 13:19:40 -07:00
Daniel Wade	c845894ebd	bpf: Fix unsound scalar forking in maybe_fork_scalars() for BPF_OR maybe_fork_scalars() is called for both BPF_AND and BPF_OR when the source operand is a constant. When dst has signed range [-1, 0], it forks the verifier state: the pushed path gets dst = 0, the current path gets dst = -1. For BPF_AND this is correct: 0 & K == 0. For BPF_OR this is wrong: 0 \| K == K, not 0. The pushed path therefore tracks dst as 0 when the runtime value is K, producing an exploitable verifier/runtime divergence that allows out-of-bounds map access. Fix this by passing env->insn_idx (instead of env->insn_idx + 1) to push_stack(), so the pushed path re-executes the ALU instruction with dst = 0 and naturally computes the correct result for any opcode. Fixes: `bffacdb80b` ("bpf: Recognize special arithmetic shift in the verifier") Signed-off-by: Daniel Wade <danjwade95@gmail.com> Reviewed-by: Amery Hung <ameryhung@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260314021521.128361-2-danjwade95@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-21 13:14:28 -07:00
Jenny Guanni Qu	c77b30bd1d	bpf: Fix undefined behavior in interpreter sdiv/smod for INT_MIN The BPF interpreter's signed 32-bit division and modulo handlers use the kernel abs() macro on s32 operands. The abs() macro documentation (include/linux/math.h) explicitly states the result is undefined when the input is the type minimum. When DST contains S32_MIN (0x80000000), abs((s32)DST) triggers undefined behavior and returns S32_MIN unchanged on arm64/x86. This value is then sign-extended to u64 as 0xFFFFFFFF80000000, causing do_div() to compute the wrong result. The verifier's abstract interpretation (scalar32_min_max_sdiv) computes the mathematically correct result for range tracking, creating a verifier/interpreter mismatch that can be exploited for out-of-bounds map value access. Introduce abs_s32() which handles S32_MIN correctly by casting to u32 before negating, avoiding signed overflow entirely. Replace all 8 abs((s32)...) call sites in the interpreter's sdiv32/smod32 handlers. s32 is the only affected case -- the s64 division/modulo handlers do not use abs(). Fixes: `ec0e2da95f` ("bpf: Support new signed div/mod instructions.") Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Jenny Guanni Qu <qguanni@gmail.com> Link: https://lore.kernel.org/r/20260311011116.2108005-2-qguanni@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-21 13:12:16 -07:00
Puranjay Mohan	a2542a91aa	bpf: Consolidate sleepable checks in check_func_call() The sleepable context check for global function calls in check_func_call() open-codes the same checks that in_sleepable_context() already performs. Replace the open-coded check with a call to in_sleepable_context() and use non_sleepable_context_description() for the error message, consistent with check_helper_call() and check_kfunc_call(). Note that in_sleepable_context() also checks active_locks, which overlaps with the existing active_locks check above it. However, the two checks serve different purposes: the active_locks check rejects all global function calls while holding a lock (not just sleepable ones), so it must remain as a separate guard. Update the expected error messages in the irq and preempt_lock selftests to match. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260318174327.3151925-4-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-21 13:09:35 -07:00
Puranjay Mohan	cd9840c413	bpf: Consolidate sleepable checks in check_kfunc_call() check_kfunc_call() has multiple scattered checks that reject sleepable kfuncs in various non-sleepable contexts (RCU, preempt-disabled, IRQ- disabled). These are the same conditions already checked by in_sleepable_context(), so replace them with a single consolidated check. This also simplifies the preempt lock tracking by flattening the nested if/else structure into a linear chain: preempt_disable increments, preempt_enable checks for underflow and decrements. The sleepable check is kept as a separate block since it is logically distinct from the lock accounting. No functional change since in_sleepable_context() checks all the same state (active_rcu_locks, active_preempt_locks, active_locks, active_irq_id, in_sleepable). Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260318174327.3151925-3-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-21 13:09:35 -07:00
Puranjay Mohan	a0d06cf102	bpf: Consolidate sleepable checks in check_helper_call() check_helper_call() prints the error message for every env->cur_state->active* element when calling a sleepable helper. Consolidate all of them into a single print statement. The check for env->cur_state->active_locks was not part of the removed print statements and will not be triggered with the consolidated print as well because it is checked in do_check() before check_helper_call() is even reached. Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260318174327.3151925-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-21 13:09:34 -07:00
Ihor Solodrai	6c2128505f	bpf: Fix exception exit lock checking for subprogs process_bpf_exit_full() passes check_lock = !curframe to check_resource_leak(), which is false in cases when bpf_throw() is called from a static subprog. This makes check_resource_leak() to skip validation of active_rcu_locks, active_preempt_locks, and active_irq_id on exception exits from subprogs. At runtime bpf_throw() unwinds the stack via ORC without releasing any user-acquired locks, which may cause various issues as the result. Fix by setting check_lock = true for exception exits regardless of curframe, since exceptions bypass all intermediate frame cleanup. Update the error message prefix to "bpf_throw" for exception exits to distinguish them from normal BPF_EXIT. Fix reject_subprog_with_rcu_read_lock test which was previously passing for the wrong reason. Test program returned directly from the subprog call without closing the RCU section, so the error was triggered by the unclosed RCU lock on normal exit, not by bpf_throw. Update __msg annotations for affected tests to match the new "bpf_throw" error prefix. The spin_lock case is not affected because they are already checked [1] at the call site in do_check_insn() before bpf_throw can run. [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/bpf/verifier.c?h=v7.0-rc4#n21098 Assisted-by: Claude:claude-opus-4-6 Fixes: `f18b03faba` ("bpf: Implement BPF exceptions") Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260320000809.643798-1-ihor.solodrai@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-21 12:51:44 -07:00
Amery Hung	4b21ea5024	bpf: Add warning to detect memory leak in bpf_selem_unlink_nofail() While very unlikely, local storage theoretically may leak memory of the size of "struct bpf_local_storage" when destroy() fails to grab local_storage->lock and initializes selem->local_storage before other racing map_free() see it. Warn the user to allow debugging the issue instead of leaking the memory silently. Note that test_maps in bpf selftests already stress tested bpf_selem_unlink_nofail() by creating 4096 sockets and then immediately destroying them in multiple threads. With 64 threads, 64 x 4096 socket local storages were created and destroyed during the test and no warning in the function were triggered. Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://patch.msgid.link/20260318224219.615105-1-ameryhung@gmail.com	2026-03-19 12:14:28 -07:00
Amery Hung	350de5b8a9	bpf: Do not allow deleting local storage in NMI Currently, local storage may deadlock when deferring freeing selem or local storage through kfree_rcu(), call_rcu() or call_rcu_tasks_trace() in NMI or reentrant. Since deleting selem in NMI is an unlikely use case, partially mitigate it by returning error when calling from bpf_xxx_storage_delete() helpers in NMI. Note that, it is still possible to deadlock through reentrant. A full mitigation requires returning error when irqs_disabled() is true, which, however is too heavy-handed for bpf_xxx_storage_delete(). The long-term solution requires _nolock versions of call_rcu. Another possible solution is to defer the free through irq_work [0], but it would grow the size of selem, which is non-ideal. The check is only needed in bpf_selem_unlink(), which is used by helpers and syscalls. bpf_selem_unlink_nofail() is fine as it is called during map and owner tear down that never run in NMI or reentrant. [0] https://lore.kernel.org/bpf/20260205190233.912-1-alexei.starovoitov@gmail.com/ Fixes: `a10787e6d5` ("bpf: Enable task local storage for tracing programs") Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://patch.msgid.link/20260319025716.2361065-1-ameryhung@gmail.com	2026-03-19 11:26:05 -07:00
Kumar Kartikeya Dwivedi	146bd2a87a	bpf: Release module BTF IDR before module unload Gregory reported in [0] that the global_map_resize test when run in repeatedly ends up failing during program load. This stems from the fact that BTF reference has not dropped to zero after the previous run's module is unloaded, and the older module's BTF is still discoverable and visible. Later, in libbpf, load_module_btfs() will find the ID for this stale BTF, open its fd, and then it will be used during program load where later steps taking module reference using btf_try_get_module() fail since the underlying module for the BTF is gone. Logically, once a module is unloaded, it's associated BTF artifacts should become hidden. The BTF object inside the kernel may still remain alive as long its reference counts are alive, but it should no longer be discoverable. To fix this, let us call btf_free_id() from the MODULE_STATE_GOING case for the module unload to free the BTF associated IDR entry, and disable its discovery once module unload returns to user space. If a race happens during unload, the outcome is non-deterministic anyway. However, user space should be able to rely on the guarantee that once it has synchronously established a successful module unload, no more stale artifacts associated with this module can be obtained subsequently. Note that we must be careful to not invoke btf_free_id() in btf_put() when btf_is_module() is true now. There could be a window where the module unload drops a non-terminal reference, frees the IDR, but the same ID gets reused and the second unconditional btf_free_id() ends up releasing an unrelated entry. To avoid a special case for btf_is_module() case, set btf->id to zero to make btf_free_id() idempotent, such that we can unconditionally invoke it from btf_put(), and also from the MODULE_STATE_GOING case. Since zero is an invalid IDR, the idr_remove() should be a noop. Note that we can be sure that by the time we reach final btf_put() for btf_is_module() case, the btf_free_id() is already done, since the module itself holds the BTF reference, and it will call this function for the BTF before dropping its own reference. [0]: https://lore.kernel.org/bpf/cover.1773170190.git.grbell@redhat.com Fixes: `36e68442d1` ("bpf: Load and verify kernel module BTFs") Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Suggested-by: Martin KaFai Lau <martin.lau@kernel.org> Reported-by: Gregory Bell <grbell@redhat.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260312205307.1346991-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-18 17:26:40 -07:00
Emil Tsalapatis	ad95d3c758	bpf: Only enforce 8 frame call stack limit for all-static stacks The BPF verifier currently enforces a call stack depth of 8 frames, regardless of the actual stack space consumption of those frames. The limit is necessary for static call stacks, because the bookkeeping data structures used by the verifier when stepping into static functions during verification only support 8 stack frames. However, this limitation only matters for static stack frames: Global subprogs are verified by themselves and do not require limiting the call depth. Relax this limitation to only apply to static stack frames. Verification now only fails when there is a sequence of 8 calls to non-global subprogs. Calling into a global subprog resets the counter. This allows deeper call stacks, provided all frames still fit in the stack. The change does not increase the maximum size of the call stack, only the maximum number of frames we can place in it. Also change the progs/test_global_func3.c selftest to use static functions, since with the new patch it would otherwise unexpectedly pass verification. Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260316161225.128011-2-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-16 11:26:41 -07:00
Paul Chaignon	9e5fcb003a	bpf: Avoid one round of bounds deduction In commit `5dbb19b16a` ("bpf: Add third round of bounds deduction"), I added a new round of bounds deduction because two rounds were not enough to converge to a fixed point. This commit slightly refactor the bounds deduction logic such that two rounds are enough. In [1], Eduard noticed that after we improved the refinement logic, a third call to the bounds deduction (__reg_deduce_bounds) was needed to converge to a fixed point. More specifically, we needed this third call to improve the s64 range using the s32 range. We added the third call and postponed a more detailed analysis of the refinement logic. I've been looking into this more recently. The register refinement consists of the following calls. __update_reg_bounds(); 3 x __reg_deduce_bounds() { deduce_bounds_32_from_64(); deduce_bounds_32_from_32(); deduce_bounds_64_from_64(); deduce_bounds_64_from_32(); }; __reg_bound_offset(); __update_reg_bounds(); From this, we can observe that we first improve the 32bit ranges from the 64bit ranges in deduce_bounds_32_from_64, then improve the 64bit ranges on their own in deduce_bounds_64_from_64. Intuitively, if we were to improve the 64bit ranges on their own before we use them to improve the 32bit ranges, we may reach a fixed point earlier. In a similar manner, using CBMC, Eduard found that it's best to improve the 32bit ranges on their own after we've improve them using the 64bit ranges. That is, running deduce_bounds_32_from_32 after deduce_bounds_32_from_64. These changes allow us to lose one call to __reg_deduce_bounds. Without this reordering, the test "verifier_bounds/bounds deduction cross sign boundary, negative overlap" fails when removing one call to __reg_deduce_bounds. In some cases, this change can even improve precision a little bit, as illustrated in the new selftest in the next patch. As expected, this change didn't have any impact on the number of instructions processed when running it through the Cilium complexity test suite [2]. Link: https://lore.kernel.org/bpf/aIKtSK9LjQXB8FLY@mail.gmail.com/ [1] Link: https://pchaigno.github.io/test-verifier-complexity.html [2] Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Co-developed-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/1b00d2749ec4c774c3ada84e265ac7fda72cfe56.1773401138.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-13 19:09:35 -07:00
Eduard Zingerman	879cace976	bpf: better naming for __reg_deduce_bounds() parts This renaming will also help reshuffle the different parts in the subsequent patch. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/a988ecf2c57e265b97917136b14b421038534e8c.1773401138.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-13 19:09:35 -07:00
Sachin Kumar	2321a9596d	bpf: Fix constant blinding for PROBE_MEM32 stores BPF_ST \| BPF_PROBE_MEM32 immediate stores are not handled by bpf_jit_blind_insn(), allowing user-controlled 32-bit immediates to survive unblinded into JIT-compiled native code when bpf_jit_harden >= 1. The root cause is that convert_ctx_accesses() rewrites BPF_ST\|BPF_MEM to BPF_ST\|BPF_PROBE_MEM32 for arena pointer stores during verification, before bpf_jit_blind_constants() runs during JIT compilation. The blinding switch only matches BPF_ST\|BPF_MEM (mode 0x60), not BPF_ST\|BPF_PROBE_MEM32 (mode 0xa0). The instruction falls through unblinded. Add BPF_ST\|BPF_PROBE_MEM32 cases to bpf_jit_blind_insn() alongside the existing BPF_ST\|BPF_MEM cases. The blinding transformation is identical: load the blinded immediate into BPF_REG_AX via mov+xor, then convert the immediate store to a register store (BPF_STX). The rewritten STX instruction must preserve the BPF_PROBE_MEM32 mode so the architecture JIT emits the correct arena addressing (R12-based on x86-64). Cannot use the BPF_STX_MEM() macro here because it hardcodes BPF_MEM mode; construct the instruction directly instead. Fixes: `6082b6c328` ("bpf: Recognize addr_space_cast instruction in the verifier.") Reviewed-by: Puranjay Mohan <puranjay@kernel.org> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Sachin Kumar <xcyfun@protonmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/Y6IT5VvNRchPBLI5D7JZHBzZrU9rb0ycRJPJzJSXGj7kJlX8RJwZFSM2YZjcDxoQKABkxt1T8Os2gi23PYyFuQe6KkZGWVyfz8K5afdy9ak=@protonmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-10 11:58:17 -07:00
Cupertino Miranda	2f4cb53eed	bpf: detect non null pointer with register operand in JEQ/JNE. This patch adds support to validate a pointer as not null when its value is compared to a register whose value the verifier knows to be null. Initial pattern only verifies against an immediate operand. Signed-off-by: Cupertino Miranda <cupertino.miranda@oracle.com> Cc: David Faust <david.faust@oracle.com> Cc: Jose Marchesi <jose.marchesi@oracle.com> Cc: Elena Zannoni <elena.zannoni@oracle.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260304195018.181396-3-cupertino.miranda@oracle.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-10 11:51:18 -07:00
Yazhou Tang	a3125bc018	bpf: Reset register ID for BPF_END value tracking When a register undergoes a BPF_END (byte swap) operation, its scalar value is mutated in-place. If this register previously shared a scalar ID with another register (e.g., after an `r1 = r0` assignment), this tie must be broken. Currently, the verifier misses resetting `dst_reg->id` to 0 for BPF_END. Consequently, if a conditional jump checks the swapped register, the verifier incorrectly propagates the learned bounds to the linked register, leading to false confidence in the linked register's value and potentially allowing out-of-bounds memory accesses. Fix this by explicitly resetting `dst_reg->id` to 0 in the BPF_END case to break the scalar tie, similar to how BPF_NEG handles it via `__mark_reg_known`. Fixes: `9d21199842` ("bpf: Add bitwise tracking for BPF_END") Closes: https://lore.kernel.org/bpf/AMBPR06MB108683CFEB1CB8D9E02FC95ECF17EA@AMBPR06MB10868.eurprd06.prod.outlook.com/ Link: https://lore.kernel.org/bpf/4be25f7442a52244d0dd1abb47bc6750e57984c9.camel@gmail.com/ Reported-by: Guillaume Laporte <glapt.pro@outlook.com> Co-developed-by: Tianci Cao <ziye@zju.edu.cn> Signed-off-by: Tianci Cao <ziye@zju.edu.cn> Co-developed-by: Shenghao Yuan <shenghaoyuan0928@163.com> Signed-off-by: Shenghao Yuan <shenghaoyuan0928@163.com> Signed-off-by: Yazhou Tang <tangyazhou518@outlook.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260304083228.142016-2-tangyazhou@zju.edu.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-10 11:46:31 -07:00
Viktor Malik	20c2e102a2	bpf: Always allow fmod_ret programs on syscalls fmod_ret BPF programs can only be attached to selected functions. For convenience, the error injection list was originally used (along with functions prefixed with "security_"), which contains syscalls and several other functions. When error injection is disabled (CONFIG_FUNCTION_ERROR_INJECTION=n), that list is empty and fmod_ret programs are effectively unavailable for most of the functions. In such a case, at least enable fmod_ret programs on syscalls. Signed-off-by: Viktor Malik <vmalik@redhat.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/472310f9a5f4944ad03214e4d943a4830fd8eb76.1773055375.git.vmalik@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-09 09:28:42 -07:00
Viktor Malik	16d9c56606	bpf: Always allow sleepable programs on syscalls Sleepable BPF programs can only be attached to selected functions. For convenience, the error injection list was originally used, which contains syscalls and several other functions. When error injection is disabled (CONFIG_FUNCTION_ERROR_INJECTION=n), that list is empty and sleepable tracing programs are effectively unavailable. In such a case, at least enable sleepable programs on syscalls. For discussion why syscalls were chosen, see [1]. To detect that a function is a syscall handler, we check for arch-specific prefixes for the most common architectures. Unfortunately, the prefixes are hard-coded in arch syscall code so we need to hard-code them, too. [1] https://lore.kernel.org/bpf/CAADnVQK6qP8izg+k9yV0vdcT-+=axtFQ2fKw7D-2Ei-V6WS5Dw@mail.gmail.com/ Signed-off-by: Viktor Malik <vmalik@redhat.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/2704a8512746655037e3c02b471b31bd0d76c8db.1773055375.git.vmalik@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-09 09:28:42 -07:00
Alexei Starovoitov	099bded752	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf 7.0-rc3 Cross-merge BPF and other fixes after downstream PR. No conflicts. Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-08 17:46:38 -07:00
Linus Torvalds	8b7f4cd3ac	bpf-fixes -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmmsahoACgkQ6rmadz2v bToD4Q/9Hr2lAELsTRZIENBTYYTfk48Rzrr+rJbZWLfX4nOu9RwhVboTw0gvJr8r QEteYO8+gP5hrIuifcC+vdzW12CpgziBK7/Qj2NB7MGSX0ZCLK0ShQK6TAbNE8uZ xA4qMcJp0BWJpgfU51z4Ur5nMzG0Th2XlY+SGwHYZ/53qRMmP7Tr8R9wPltKuEll J+tS6BE6bFKMQe1CuIYlYfYj9kO3j4yx9S48j1e8x9LXc2/2arfsOIZQ6L11rnxZ YTa9jbcRL0YdLy7d4hMLyPQquF9hiHDrJLJih+YNmbyOCPGD3HP0ib2c2g0ip+qk WlNFLc2K9w0eS02uuAytaxwArdOpDUHG+skWe+dGt95Bk3Xld0hW/JSI5hrohM1E 0YKjuZeGGvTkI5FJ5kk+BWoApy+FveiN1w7VsHZgr1tix5NRZRgJP/5swM+0eSp/ neTzp08fC10gi+GOnUpdF/lI4wEayoGMpo68BTGiM/hP6Qh268wNWgTHaau3Sk/z wfJb90VTcDHk+N8mnbi4970RZ6OdqX2n7aBy8D7hBW31kHhP27zofzYX51CFD8Ln 5NQDU3wX/hUw9tNwoghCNe12//gfDBi6rO/GoEztNFfVVef4QuzsfwKKdxzUMc9a jZvyDyGh87rGLauDWylq4s23vgxgIG/p114uv+eFp3AMkOfT/Dk= =knWl -----END PGP SIGNATURE----- Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull bpf fixes from Alexei Starovoitov: - Fix u32/s32 bounds when ranges cross min/max boundary (Eduard Zingerman) - Fix precision backtracking with linked registers (Eduard Zingerman) - Fix linker flags detection for resolve_btfids (Ihor Solodrai) - Fix race in update_ftrace_direct_add/del (Jiri Olsa) - Fix UAF in bpf_trampoline_link_cgroup_shim (Lang Xu) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: resolve_btfids: Fix linker flags detection selftests/bpf: add reproducer for spurious precision propagation through calls bpf: collect only live registers in linked regs Revert "selftests/bpf: Update reg_bound range refinement logic" selftests/bpf: test refining u32/s32 bounds when ranges cross min/max boundary bpf: Fix u32/s32 bounds when ranges cross min/max boundary bpf: Fix a UAF issue in bpf_trampoline_link_cgroup_shim ftrace: Add missing ftrace_lock to update_ftrace_direct_add/del	2026-03-07 12:20:37 -08:00
Eduard Zingerman	2658a1720a	bpf: collect only live registers in linked regs Fix an inconsistency between func_states_equal() and collect_linked_regs(): - regsafe() uses check_ids() to verify that cached and current states have identical register id mapping. - func_states_equal() calls regsafe() only for registers computed as live by compute_live_registers(). - clean_live_states() is supposed to remove dead registers from cached states, but it can skip states belonging to an iterator-based loop. - collect_linked_regs() collects all registers sharing the same id, ignoring the marks computed by compute_live_registers(). Linked registers are stored in the state's jump history. - backtrack_insn() marks all linked registers for an instruction as precise whenever one of the linked registers is precise. The above might lead to a scenario: - There is an instruction I with register rY known to be dead at I. - Instruction I is reached via two paths: first A, then B. - On path A: - There is an id link between registers rX and rY. - Checkpoint C is created at I. - Linked register set {rX, rY} is saved to the jump history. - rX is marked as precise at I, causing both rX and rY to be marked precise at C. - On path B: - There is no id link between registers rX and rY, otherwise register states are sub-states of those in C. - Because rY is dead at I, check_ids() returns true. - Current state is considered equal to checkpoint C, propagate_precision() propagates spurious precision mark for register rY along the path B. - Depending on a program, this might hit verifier_bug() in the backtrack_insn(), e.g. if rY ∈ [r1..r5] and backtrack_insn() spots a function call. The reproducer program is in the next patch. This was hit by sched_ext scx_lavd scheduler code. Changes in tests: - verifier_scalar_ids.c selftests need modification to preserve some registers as live for __msg() checks. - exceptions_assert.c adjusted to match changes in the verifier log, R0 is dead after conditional instruction and thus does not get range. - precise.c adjusted to match changes in the verifier log, register r9 is dead after comparison and it's range is not important for test. Reported-by: Emil Tsalapatis <emil@etsalapatis.com> Fixes: `0fb3cf6110` ("bpf: use register liveness information for func_states_equal") Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260306-linked-regs-and-propagate-precision-v1-1-18e859be570d@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-06 21:49:40 -08:00
Eduard Zingerman	fbc7aef517	bpf: Fix u32/s32 bounds when ranges cross min/max boundary Same as in __reg64_deduce_bounds(), refine s32/u32 ranges in __reg32_deduce_bounds() in the following situations: - s32 range crosses U32_MAX/0 boundary, positive part of the s32 range overlaps with u32 range: 0 U32_MAX \| [xxxxxxxxxxxxxx u32 range xxxxxxxxxxxxxx] \| \|----------------------------\|----------------------------\| \|xxxxx s32 range xxxxxxxxx] [xxxxxxx\| 0 S32_MAX S32_MIN -1 - s32 range crosses U32_MAX/0 boundary, negative part of the s32 range overlaps with u32 range: 0 U32_MAX \| [xxxxxxxxxxxxxx u32 range xxxxxxxxxxxxxx] \| \|----------------------------\|----------------------------\| \|xxxxxxxxx] [xxxxxxxxxxxx s32 range \| 0 S32_MAX S32_MIN -1 - No refinement if ranges overlap in two intervals. This helps for e.g. consider the following program: call %[bpf_get_prandom_u32]; w0 &= 0xffffffff; if w0 < 0x3 goto 1f; // on fall-through u32 range [3..U32_MAX] if w0 s> 0x1 goto 1f; // on fall-through s32 range [S32_MIN..1] if w0 s< 0x0 goto 1f; // range can be narrowed to [S32_MIN..-1] r10 = 0; 1: ...; The reg_bounds.c selftest is updated to incorporate identical logic, refinement based on non-overflowing range halves: ((x ∩ [0, smax]) ∩ (y ∩ [0, smax])) ∪ ((x ∩ [smin,-1]) ∩ (y ∩ [smin,-1])) Reported-by: Andrea Righi <arighi@nvidia.com> Reported-by: Emil Tsalapatis <emil@etsalapatis.com> Closes: https://lore.kernel.org/bpf/aakqucg4vcujVwif@gpd4/T/ Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260306-bpf-32-bit-range-overflow-v3-1-f7f67e060a6b@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-06 18:16:06 -08:00
Christian Loehle	7fe44c4388	bpf: drop kthread_exit from noreturn_deny kthread_exit became a macro to do_exit in commit `28aaa9c399` ("kthread: consolidate kthread exit paths to prevent use-after-free"), so there is no kthread_exit function BTF ID to resolve. Remove it from noreturn_deny to avoid resolve_btfids unresolved symbol warnings. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-03-06 08:25:54 -08:00
Lang Xu	56145d2373	bpf: Fix a UAF issue in bpf_trampoline_link_cgroup_shim The root cause of this bug is that when 'bpf_link_put' reduces the refcount of 'shim_link->link.link' to zero, the resource is considered released but may still be referenced via 'tr->progs_hlist' in 'cgroup_shim_find'. The actual cleanup of 'tr->progs_hlist' in 'bpf_shim_tramp_link_release' is deferred. During this window, another process can cause a use-after-free via 'bpf_trampoline_link_cgroup_shim'. Based on Martin KaFai Lau's suggestions, I have created a simple patch. To fix this: Add an atomic non-zero check in 'bpf_trampoline_link_cgroup_shim'. Only increment the refcount if it is not already zero. Testing: I verified the fix by adding a delay in 'bpf_shim_tramp_link_release' to make the bug easier to trigger: static void bpf_shim_tramp_link_release(struct bpf_link link) { / ... */ if (!shim_link->trampoline) return; + msleep(100); WARN_ON_ONCE(bpf_trampoline_unlink_prog(&shim_link->link, shim_link->trampoline, NULL)); bpf_trampoline_put(shim_link->trampoline); } Before the patch, running a PoC easily reproduced the crash(almost 100%) with a call trace similar to KaiyanM's report. After the patch, the bug no longer occurs even after millions of iterations. Fixes: `69fd337a97` ("bpf: per-cgroup lsm flavor") Reported-by: Kaiyan Mei <M202472210@hust.edu.cn> Closes: https://lore.kernel.org/bpf/3c4ebb0b.46ff8.19abab8abe2.Coremail.kaiyanm@hust.edu.cn/ Signed-off-by: Lang Xu <xulang@uniontech.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/279EEE1BA1DDB49D+20260303095217.34436-1-xulang@uniontech.com	2026-03-03 15:13:51 -08:00
Emil Tsalapatis	8446ded1e1	bpf: Allow void global functions in the verifier Global subprogs are currently not allowed to return void. Adjust verifier logic to allow global functions with a void return type. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260228184759.108145-5-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-03 08:47:23 -08:00
Eduard Zingerman	69ca55e631	bpf: extract check_global_subprog_return_code() for clarity From: Eduard Zingerman <eddyz87@gmail.com> Both main progs and subprogs use the same function in the verifier, check_return_code, to verify the type and value range of the register being returned. However, subprogs only need a subset of the logic in check_return_code. this also goes the way - check_return_code explicitly checks whether it is handling a subprogram in multiple places, complicating the logic. Separate the handling of the two into separate functions. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260228184759.108145-4-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-03 08:47:23 -08:00
Eduard Zingerman	63ec296239	bpf: Extract program_returns_void() for clarity From: Eduard Zingerman <eddyz87@gmail.com> The check_return_code function has explicit checks on whether a program type can return void. Factor this logic out to reuse it later for both main progs and subprogs. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260228184759.108145-3-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-03 08:47:23 -08:00
Emil Tsalapatis	83419c8fdb	bpf: Factor out program return value calculation Factor the return value range calculation logic in check_return_code out of the function in preparation for separating the return value validation logic for BPF_EXIT and bpf_throw(). No functional changes. The change made in return_retval_code's handling of PROG_TRACING program types (not error'ing on the default case) is a no-op because the match on the program attach type is exhaustive. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260228184759.108145-2-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-03 08:47:22 -08:00
Kumar Kartikeya Dwivedi	de6c7d99f8	bpf: Relax fixed offset check for PTR_TO_CTX The limitation on fixed offsets stems from the fact that certain program types rewrite the accesses to the context structure and translate them to accesses to the real underlying type. Technically, in the past, we could have stashed the register offset in insn_aux and made rewrites work, but we've never needed it in the past since the offset for such context structures easily fit in the s16 signed instruction offset. Regardless, the consequence is that for program types where the program type's verifier ops doesn't supply a convert_ctx_access callback, we unnecessarily reject accesses with a modified ctx pointer (i.e., one whose offset has been shifted) in check_ptr_off_reg. Make an exception for such program types (like syscall, tracepoint, raw_tp, etc.). Pass in fixed_off_ok as true to __check_ptr_off_reg for such cases, and accumulate the register offset into insn->off passed to check_ctx_access. In particular, the accumulation is critical since we need to correctly track the max_ctx_offset which is used for bounds checking the buffer for syscall programs at runtime. Reported-by: Tejun Heo <tj@kernel.org> Reported-by: Dan Schatzberg <dschatzberg@meta.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260227005725.1247305-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-03 08:45:16 -08:00
Anand Kumar Shaw	39948c2d42	bpf: Add missing XDP_ABORTED handling in cpumap cpu_map_bpf_prog_run_xdp() handles XDP_PASS, XDP_REDIRECT, and XDP_DROP but is missing an XDP_ABORTED case. Without it, XDP_ABORTED falls into the default case which logs a misleading "invalid XDP action" warning instead of tracing the abort via trace_xdp_exception(). Add the missing XDP_ABORTED case with trace_xdp_exception(), matching the handling already present in the skb path (cpu_map_bpf_prog_run_skb), devmap (dev_map_bpf_prog_run), and the generic XDP path (do_xdp_generic). Also pass xdpf->dev_rx instead of NULL to bpf_warn_invalid_xdp_action() in the default case, so the warning includes the actual device name. This aligns with the generic XDP path in net/core/dev.c which already passes the real device. Signed-off-by: Anand Kumar Shaw <anandkrshawheritage@gmail.com> Link: https://lore.kernel.org/r/20260218042924.42931-1-anandkrshawheritage@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-03 08:37:21 -08:00
Ilya Leoshkevich	6fe54677bc	s390: Introduce bpf_get_lowcore() kfunc Implementing BPF version of preempt_count() requires accessing lowcore from BPF. Since lowcore can be relocated, open-coding (struct lowcore *)0 does not work, so add a kfunc. Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Link: https://lore.kernel.org/r/20260217160813.100855-2-iii@linux.ibm.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-03 08:35:07 -08:00
Alexei Starovoitov	309d8808ee	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf before 7.0-rc2 Cross-merge BPF and other fixes after downstream PR. No conflicts. Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-03-01 09:04:00 -08:00
Paul Chaignon	efc11a6678	bpf: Improve bounds when tnum has a single possible value We're hitting an invariant violation in Cilium that sometimes leads to BPF programs being rejected and Cilium failing to start [1]. The following extract from verifier logs shows what's happening: from 201 to 236: R1=0 R6=ctx() R7=1 R9=scalar(smin=umin=smin32=umin32=3584,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100)) R10=fp0 236: R1=0 R6=ctx() R7=1 R9=scalar(smin=umin=smin32=umin32=3584,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100)) R10=fp0 ; if (magic == MARK_MAGIC_HOST \|\| magic == MARK_MAGIC_OVERLAY \|\| magic == MARK_MAGIC_ENCRYPT) @ bpf_host.c:1337 236: (16) if w9 == 0xe00 goto pc+45 ; R9=scalar(smin=umin=smin32=umin32=3585,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100)) 237: (16) if w9 == 0xf00 goto pc+1 verifier bug: REG INVARIANTS VIOLATION (false_reg1): range bounds violation u64=[0xe01, 0xe00] s64=[0xe01, 0xe00] u32=[0xe01, 0xe00] s32=[0xe01, 0xe00] var_off=(0xe00, 0x0) We reach instruction 236 with two possible values for R9, 0xe00 and 0xf00. This is perfectly reflected in the tnum, but of course the ranges are less accurate and cover [0xe00; 0xf00]. Taking the fallthrough path at instruction 236 allows the verifier to reduce the range to [0xe01; 0xf00]. The tnum is however not updated. With these ranges, at instruction 237, the verifier is not able to deduce that R9 is always equal to 0xf00. Hence the fallthrough pass is explored first, the verifier refines the bounds using the assumption that R9 != 0xf00, and ends up with an invariant violation. This pattern of impossible branch + bounds refinement is common to all invariant violations seen so far. The long-term solution is likely to rely on the refinement + invariant violation check to detect dead branches, as started by Eduard. To fix the current issue, we need something with less refactoring that we can backport. This patch uses the tnum_step helper introduced in the previous patch to detect the above situation. In particular, three cases are now detected in the bounds refinement: 1. The u64 range and the tnum only overlap in umin. u64: ---[xxxxxx]----- tnum: --xx----------x- 2. The u64 range and the tnum only overlap in the maximum value represented by the tnum, called tmax. u64: ---[xxxxxx]----- tnum: xx-----x-------- 3. The u64 range and the tnum only overlap in between umin (excluded) and umax. u64: ---[xxxxxx]----- tnum: xx----x-------x- To detect these three cases, we call tnum_step(tnum, umin), which returns the smallest member of the tnum greater than umin, called tnum_next here. We're in case (1) if umin is part of the tnum and tnum_next is greater than umax. We're in case (2) if umin is not part of the tnum and tnum_next is equal to tmax. Finally, we're in case (3) if umin is not part of the tnum, tnum_next is inferior or equal to umax, and calling tnum_step a second time gives us a value past umax. This change implements these three cases. With it, the above bytecode looks as follows: 0: (85) call bpf_get_prandom_u32#7 ; R0=scalar() 1: (47) r0 \|= 3584 ; R0=scalar(smin=0x8000000000000e00,umin=umin32=3584,smin32=0x80000e00,var_off=(0xe00; 0xfffffffffffff1ff)) 2: (57) r0 &= 3840 ; R0=scalar(smin=umin=smin32=umin32=3584,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100)) 3: (15) if r0 == 0xe00 goto pc+2 ; R0=3840 4: (15) if r0 == 0xf00 goto pc+1 4: R0=3840 6: (95) exit In addition to the new selftests, this change was also verified with Agni [3]. For the record, the raw SMT is available at [4]. The property it verifies is that: If a concrete value x is contained in all input abstract values, after __update_reg_bounds, it will continue to be contained in all output abstract values. Link: https://github.com/cilium/cilium/issues/44216 [1] Link: https://pchaigno.github.io/test-verifier-complexity.html [2] Link: https://github.com/bpfverif/agni [3] Link: https://pastebin.com/raw/naCfaqNx [4] Fixes: `0df1a55afa` ("bpf: Warn on internal verifier errors") Acked-by: Eduard Zingerman <eddyz87@gmail.com> Tested-by: Marco Schirrmeister <mschirrmeister@gmail.com> Co-developed-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/ef254c4f68be19bd393d450188946821c588565d.1772225741.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-02-27 16:11:50 -08:00
Harishankar Vishwanathan	76e954155b	bpf: Introduce tnum_step to step through tnum's members This commit introduces tnum_step(), a function that, when given t, and a number z returns the smallest member of t larger than z. The number z must be greater or equal to the smallest member of t and less than the largest member of t. The first step is to compute j, a number that keeps all of t's known bits, and matches all unknown bits to z's bits. Since j is a member of the t, it is already a candidate for result. However, we want our result to be (minimally) greater than z. There are only two possible cases: (1) Case j <= z. In this case, we want to increase the value of j and make it > z. (2) Case j > z. In this case, we want to decrease the value of j while keeping it > z. (Case 1) j <= z t = xx11x0x0 z = 10111101 (189) j = 10111000 (184) ^ k (Case 1.1) Let's first consider the case where j < z. We will address j == z later. Since z > j, there had to be a bit position that was 1 in z and a 0 in j, beyond which all positions of higher significance are equal in j and z. Further, this position could not have been unknown in a, because the unknown positions of a match z. This position had to be a 1 in z and known 0 in t. Let k be position of the most significant 1-to-0 flip. In our example, k = 3 (starting the count at 1 at the least significant bit). Setting (to 1) the unknown bits of t in positions of significance smaller than k will not produce a result > z. Hence, we must set/unset the unknown bits at positions of significance higher than k. Specifically, we look for the next larger combination of 1s and 0s to place in those positions, relative to the combination that exists in z. We can achieve this by concatenating bits at unknown positions of t into an integer, adding 1, and writing the bits of that result back into the corresponding bit positions previously extracted from z. >From our example, considering only positions of significance greater than k: t = xx..x z = 10..1 + 1 ----- 11..0 This is the exact combination 1s and 0s we need at the unknown bits of t in positions of significance greater than k. Further, our result must only increase the value minimally above z. Hence, unknown bits in positions of significance smaller than k should remain 0. We finally have, result = 11110000 (240) (Case 1.2) Now consider the case when j = z, for example t = 1x1x0xxx z = 10110100 (180) j = 10110100 (180) Matching the unknown bits of the t to the bits of z yielded exactly z. To produce a number greater than z, we must set/unset the unknown bits in t, and all the unknown bits of t candidates for being set/unset. We can do this similar to Case 1.1, by adding 1 to the bits extracted from the masked bit positions of z. Essentially, this case is equivalent to Case 1.1, with k = 0. t = 1x1x0xxx z = .0.1.100 + 1 --------- .0.1.101 This is the exact combination of bits needed in the unknown positions of t. After recalling the known positions of t, we get result = 10110101 (181) (Case 2) j > z t = x00010x1 z = 10000010 (130) j = 10001011 (139) ^ k Since j > z, there had to be a bit position which was 0 in z, and a 1 in j, beyond which all positions of higher significance are equal in j and z. This position had to be a 0 in z and known 1 in t. Let k be the position of the most significant 0-to-1 flip. In our example, k = 4. Because of the 0-to-1 flip at position k, a member of t can become greater than z if the bits in positions greater than k are themselves >= to z. To make that member minimally greater than z, the bits in positions greater than k must be exactly = z. Hence, we simply match all of t's unknown bits in positions more significant than k to z's bits. In positions less significant than k, we set all t's unknown bits to 0 to retain minimality. In our example, in positions of greater significance than k (=4), t=x000. These positions are matched with z (1000) to produce 1000. In positions of lower significance than k, t=10x1. All unknown bits are set to 0 to produce 1001. The final result is: result = 10001001 (137) This concludes the computation for a result > z that is a member of t. The procedure for tnum_step() in this commit implements the idea described above. As a proof of correctness, we verified the algorithm against a logical specification of tnum_step. The specification asserts the following about the inputs t, z and output res that: 1. res is a member of t, and 2. res is strictly greater than z, and 3. there does not exist another value res2 such that 3a. res2 is also a member of t, and 3b. res2 is greater than z 3c. res2 is smaller than res We checked the implementation against this logical specification using an SMT solver. The verification formula in SMTLIB format is available at [1]. The verification returned an "unsat": indicating that no input assignment exists for which the implementation and the specification produce different outputs. In addition, we also automatically generated the logical encoding of the C implementation using Agni [2] and verified it against the same specification. This verification also returned an "unsat", confirming that the implementation is equivalent to the specification. The formula for this check is also available at [3]. Link: https://pastebin.com/raw/2eRWbiit [1] Link: https://github.com/bpfverif/agni [2] Link: https://pastebin.com/raw/EztVbBJ2 [3] Co-developed-by: Srinivas Narayana <srinivas.narayana@rutgers.edu> Signed-off-by: Srinivas Narayana <srinivas.narayana@rutgers.edu> Co-developed-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu> Signed-off-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu> Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Link: https://lore.kernel.org/r/93fdf71910411c0f19e282ba6d03b4c65f9c5d73.1772225741.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-02-27 16:11:50 -08:00
Jiayuan Chen	1872e75375	bpf: Fix race in devmap on PREEMPT_RT On PREEMPT_RT kernels, the per-CPU xdp_dev_bulk_queue (bq) can be accessed concurrently by multiple preemptible tasks on the same CPU. The original code assumes bq_enqueue() and __dev_flush() run atomically with respect to each other on the same CPU, relying on local_bh_disable() to prevent preemption. However, on PREEMPT_RT, local_bh_disable() only calls migrate_disable() (when PREEMPT_RT_NEEDS_BH_LOCK is not set) and does not disable preemption, which allows CFS scheduling to preempt a task during bq_xmit_all(), enabling another task on the same CPU to enter bq_enqueue() and operate on the same per-CPU bq concurrently. This leads to several races: 1. Double-free / use-after-free on bq->q[]: bq_xmit_all() snapshots cnt = bq->count, then iterates bq->q[0..cnt-1] to transmit frames. If preempted after the snapshot, a second task can call bq_enqueue() -> bq_xmit_all() on the same bq, transmitting (and freeing) the same frames. When the first task resumes, it operates on stale pointers in bq->q[], causing use-after-free. 2. bq->count and bq->q[] corruption: concurrent bq_enqueue() modifying bq->count and bq->q[] while bq_xmit_all() is reading them. 3. dev_rx/xdp_prog teardown race: __dev_flush() clears bq->dev_rx and bq->xdp_prog after bq_xmit_all(). If preempted between bq_xmit_all() return and bq->dev_rx = NULL, a preempting bq_enqueue() sees dev_rx still set (non-NULL), skips adding bq to the flush_list, and enqueues a frame. When __dev_flush() resumes, it clears dev_rx and removes bq from the flush_list, orphaning the newly enqueued frame. 4. __list_del_clearprev() on flush_node: similar to the cpumap race, both tasks can call __list_del_clearprev() on the same flush_node, the second dereferences the prev pointer already set to NULL. The race between task A (__dev_flush -> bq_xmit_all) and task B (bq_enqueue -> bq_xmit_all) on the same CPU: Task A (xdp_do_flush) Task B (ndo_xdp_xmit redirect) ---------------------- -------------------------------- __dev_flush(flush_list) bq_xmit_all(bq) cnt = bq->count /* e.g. 16 / / start iterating bq->q[] / <-- CFS preempts Task A --> bq_enqueue(dev, xdpf) bq->count == DEV_MAP_BULK_SIZE bq_xmit_all(bq, 0) cnt = bq->count / same 16! / ndo_xdp_xmit(bq->q[]) / frames freed by driver / bq->count = 0 <-- Task A resumes --> ndo_xdp_xmit(bq->q[]) / use-after-free: frames already freed! */ Fix this by adding a local_lock_t to xdp_dev_bulk_queue and acquiring it in bq_enqueue() and __dev_flush(). These paths already run under local_bh_disable(), so use local_lock_nested_bh() which on non-RT is a pure annotation with no overhead, and on PREEMPT_RT provides a per-CPU sleeping lock that serializes access to the bq. Fixes: `3253cb49cb` ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT") Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://lore.kernel.org/r/20260225121459.183121-3-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-02-27 16:08:10 -08:00
Jiayuan Chen	869c63d597	bpf: Fix race in cpumap on PREEMPT_RT On PREEMPT_RT kernels, the per-CPU xdp_bulk_queue (bq) can be accessed concurrently by multiple preemptible tasks on the same CPU. The original code assumes bq_enqueue() and __cpu_map_flush() run atomically with respect to each other on the same CPU, relying on local_bh_disable() to prevent preemption. However, on PREEMPT_RT, local_bh_disable() only calls migrate_disable() (when PREEMPT_RT_NEEDS_BH_LOCK is not set) and does not disable preemption, which allows CFS scheduling to preempt a task during bq_flush_to_queue(), enabling another task on the same CPU to enter bq_enqueue() and operate on the same per-CPU bq concurrently. This leads to several races: 1. Double __list_del_clearprev(): after bq->count is reset in bq_flush_to_queue(), a preempting task can call bq_enqueue() -> bq_flush_to_queue() on the same bq when bq->count reaches CPU_MAP_BULK_SIZE. Both tasks then call __list_del_clearprev() on the same bq->flush_node, the second call dereferences the prev pointer that was already set to NULL by the first. 2. bq->count and bq->q[] races: concurrent bq_enqueue() can corrupt the packet queue while bq_flush_to_queue() is processing it. The race between task A (__cpu_map_flush -> bq_flush_to_queue) and task B (bq_enqueue -> bq_flush_to_queue) on the same CPU: Task A (xdp_do_flush) Task B (cpu_map_enqueue) ---------------------- ------------------------ bq_flush_to_queue(bq) spin_lock(&q->producer_lock) /* flush bq->q[] to ptr_ring / bq->count = 0 spin_unlock(&q->producer_lock) bq_enqueue(rcpu, xdpf) <-- CFS preempts Task A --> bq->q[bq->count++] = xdpf / ... more enqueues until full ... / bq_flush_to_queue(bq) spin_lock(&q->producer_lock) / flush to ptr_ring / spin_unlock(&q->producer_lock) __list_del_clearprev(flush_node) / sets flush_node.prev = NULL / <-- Task A resumes --> __list_del_clearprev(flush_node) flush_node.prev->next = ... / prev is NULL -> kernel oops */ Fix this by adding a local_lock_t to xdp_bulk_queue and acquiring it in bq_enqueue() and __cpu_map_flush(). These paths already run under local_bh_disable(), so use local_lock_nested_bh() which on non-RT is a pure annotation with no overhead, and on PREEMPT_RT provides a per-CPU sleeping lock that serializes access to the bq. To reproduce, insert an mdelay(100) between bq->count = 0 and __list_del_clearprev() in bq_flush_to_queue(), then run reproducer provided by syzkaller. Fixes: `3253cb49cb` ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT") Reported-by: syzbot+2b3391f44313b3983e91@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/69369331.a70a0220.38f243.009d.GAE@google.com/T/ Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://lore.kernel.org/r/20260225121459.183121-2-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-02-27 16:07:14 -08:00
Kumar Kartikeya Dwivedi	baa35b3cb6	bpf: Retire rcu_trace_implies_rcu_gp() from local storage This assumption will always hold going forward, hence just remove the various checks and assume it is true with a comment for the uninformed reader. Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260227224806.646888-5-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-02-27 15:39:00 -08:00
Kumar Kartikeya Dwivedi	f41deee082	bpf: Delay freeing fields in local storage Currently, when use_kmalloc_nolock is false, the freeing of fields for a local storage selem is done eagerly before waiting for the RCU or RCU tasks trace grace period to elapse. This opens up a window where the program which has access to the selem can recreate the fields after the freeing of fields is done eagerly, causing memory leaks when the element is finally freed and returned to the kernel. Make a few changes to address this. First, delay the freeing of fields until after the grace periods have expired using a __bpf_selem_free_rcu wrapper which is eventually invoked after transitioning through the necessary number of grace period waits. Replace usage of the kfree_rcu with call_rcu to be able to take a custom callback. Finally, care needs to be taken to extend the rcu barriers for all cases, and not just when use_kmalloc_nolock is true, as RCU and RCU tasks trace callbacks can be in flight for either case and access the smap field, which is used to obtain the BTF record to walk over special fields in the map value. While we're at it, drop migrate_disable() from bpf_selem_free_rcu, since migration should be disabled for RCU callbacks already. Fixes: `9bac675e63` ("bpf: Postpone bpf_obj_free_fields to the rcu callback") Reviewed-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260227224806.646888-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-02-27 15:39:00 -08:00
Kumar Kartikeya Dwivedi	ae51772b1e	bpf: Lose const-ness of map in map_check_btf() BPF hash map may now use the map_check_btf() callback to decide whether to set a dtor on its bpf_mem_alloc or not. Unlike C++ where members can opt out of const-ness using mutable, we must lose the const qualifier on the callback such that we can avoid the ugly cast. Make the change and adjust all existing users, and lose the comment in hashtab.c. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260227224806.646888-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-02-27 15:39:00 -08:00
Kumar Kartikeya Dwivedi	1df97a7453	bpf: Register dtor for freeing special fields There is a race window where BPF hash map elements can leak special fields if the program with access to the map value recreates these special fields between the check_and_free_fields done on the map value and its eventual return to the memory allocator. Several ways were explored prior to this patch, most notably [0] tried to use a poison value to reject attempts to recreate special fields for map values that have been logically deleted but still accessible to BPF programs (either while sitting in the free list or when reused). While this approach works well for task work, timers, wq, etc., it is harder to apply the idea to kptrs, which have a similar race and failure mode. Instead, we change bpf_mem_alloc to allow registering destructor for allocated elements, such that when they are returned to the allocator, any special fields created while they were accessible to programs in the mean time will be freed. If these values get reused, we do not free the fields again before handing the element back. The special fields thus may remain initialized while the map value sits in a free list. When bpf_mem_alloc is retired in the future, a similar concept can be introduced to kmalloc_nolock-backed kmem_cache, paired with the existing idea of a constructor. Note that the destructor registration happens in map_check_btf, after the BTF record is populated and (at that point) avaiable for inspection and duplication. Duplication is necessary since the freeing of embedded bpf_mem_alloc can be decoupled from actual map lifetime due to logic introduced to reduce the cost of rcu_barrier()s in mem alloc free path in `9f2c6e96c6` ("bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc."). As such, once all callbacks are done, we must also free the duplicated record. To remove dependency on the bpf_map itself, also stash the key size of the map to obtain value from htab_elem long after the map is gone. [0]: https://lore.kernel.org/bpf/20260216131341.1285427-1-mykyta.yatsenko5@gmail.com Fixes: `14a324f6a6` ("bpf: Wire up freeing of referenced kptr") Fixes: `1bfbc267ec` ("bpf: Enable bpf_timer and bpf_wq in any context") Reported-by: Alexei Starovoitov <ast@kernel.org> Tested-by: syzbot@syzkaller.appspotmail.com Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260227224806.646888-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-02-27 15:39:00 -08:00
Kohei Enju	b7bf516c3e	bpf: Fix stack-out-of-bounds write in devmap get_upper_ifindexes() iterates over all upper devices and writes their indices into an array without checking bounds. Also the callers assume that the max number of upper devices is MAX_NEST_DEV and allocate excluded_devices[1+MAX_NEST_DEV] on the stack, but that assumption is not correct and the number of upper devices could be larger than MAX_NEST_DEV (e.g., many macvlans), causing a stack-out-of-bounds write. Add a max parameter to get_upper_ifindexes() to avoid the issue. When there are too many upper devices, return -EOVERFLOW and abort the redirect. To reproduce, create more than MAX_NEST_DEV(8) macvlans on a device with an XDP program attached using BPF_F_BROADCAST \| BPF_F_EXCLUDE_INGRESS. Then send a packet to the device to trigger the XDP redirect path. Reported-by: syzbot+10cc7f13760b31bd2e61@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/698c4ce3.050a0220.340abe.000b.GAE@google.com/T/ Fixes: `aeea1b86f9` ("bpf, devmap: Exclude XDP broadcast to master device") Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Kohei Enju <kohei@enjuk.jp> Link: https://lore.kernel.org/r/20260225053506.4738-1-kohei@enjuk.jp Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-02-26 11:25:53 -08:00
Hari Bathini	3733f4be28	bpf: Do not increment tailcall count when prog is NULL Currently, tailcall count is incremented in the interpreter even when tailcall fails due to non-existent prog. Fix this by holding off on the tailcall count increment until after NULL check on the prog. Suggested-by: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Link: https://lore.kernel.org/r/20260220062959.195101-1-hbathini@linux.ibm.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-02-24 10:34:16 -08:00
Kaitao Cheng	fb1590448f	bpf: allow using bpf_kptr_xchg even if the MEM_RCU flag is set For the following scenario: struct tree_node { struct bpf_refcount ref; struct bpf_rb_node node; struct node_data __kptr * node_data; u64 key; }; This means node_data would have the type PTR_TO_BTF_ID \| MEM_ALLOC \| NON_OWN_REF \| MEM_RCU. When traversing an rbtree using bpf_rbtree_left/right, if we need to use bpf_kptr_xchg to read the __kptr pointer, we still need to follow the remove-read-add sequence. This patch allows us to use bpf_kptr_xchg to directly read the __kptr pointer without any prior operations. Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn> Signed-off-by: Feng Yang <yangfeng@kylinos.cn> Link: https://lore.kernel.org/r/20260214124042.62229-5-pilgrimtao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-02-23 17:37:06 -08:00
Kaitao Cheng	ee9886c40a	bpf: allow using bpf_kptr_xchg even if the NON_OWN_REF flag is set When traversing an rbtree using bpf_rbtree_left/right, if bpf_kptr_xchg is used to access the __kptr pointer contained in a node, it currently requires first removing the node with bpf_rbtree_remove and clearing the NON_OWN_REF flag, then re-adding the node to the original rbtree with bpf_rbtree_add after usage. This process significantly degrades rbtree traversal performance. The patch enables accessing __kptr pointers with the NON_OWN_REF flag set while holding the lock, eliminating the need for this remove-read-add sequence. Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn> Signed-off-by: Feng Yang <yangfeng@kylinos.cn> Link: https://lore.kernel.org/r/20260214124042.62229-3-pilgrimtao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-02-23 17:37:06 -08:00
Kaitao Cheng	964c074768	bpf: allow calling bpf_kptr_xchg while holding a lock For the following scenario: struct tree_node { struct bpf_rb_node node; struct request __kptr *req; u64 key; }; struct bpf_rb_root tree_root __contains(tree_node, node); struct bpf_spin_lock tree_lock; If we need to traverse all nodes in the rbtree, retrieve the __kptr pointer from each node, and read kernel data from the referenced object, using bpf_kptr_xchg appears unavoidable. This patch skips the BPF verifier checks for bpf_kptr_xchg when called while holding a lock. Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn> Link: https://lore.kernel.org/r/20260214124042.62229-2-pilgrimtao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-02-23 17:37:06 -08:00
Alexei Starovoitov	3ecf0b4a0e	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after 7.0-rc1 Cross-merge trees after 7.0-rc1. No conflicts. Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-02-23 08:06:33 -08:00
Kees Cook	189f164e57	Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL uses Conversion performed via this Coccinelle script: // SPDX-License-Identifier: GPL-2.0-only // Options: --include-headers-for-types --all-includes --include-headers --keep-comments virtual patch @gfp depends on patch && !(file in "tools") && !(file in "samples")@ identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex, kzalloc_obj,kzalloc_objs,kzalloc_flex, kvmalloc_obj,kvmalloc_objs,kvmalloc_flex, kvzalloc_obj,kvzalloc_objs,kvzalloc_flex}; @@ ALLOC(... - , GFP_KERNEL ) $ make coccicheck MODE=patch COCCI=gfp.cocci Build and boot tested x86_64 with Fedora 42's GCC and Clang: Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-22 08:26:33 -08:00

1 2 3 4 5 ...

4511 Commits