linux

mirror of https://github.com/torvalds/linux.git synced 2026-06-01 11:03:43 +02:00

Author	SHA1	Message	Date
Leon Hwang	8526397c3c	bpf: Copy map value using copy_map_value_long for percpu_cgroup_storage maps Copy map value using 'copy_map_value_long()'. It's to keep consistent style with the way of other percpu maps. No functional change intended. Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20260107022022.12843-5-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-06 20:48:32 -08:00
Leon Hwang	c6936161fd	bpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu_hash and lru_percpu_hash maps Introduce BPF_F_ALL_CPUS flag support for percpu_hash and lru_percpu_hash maps to allow updating values for all CPUs with a single value for both update_elem and update_batch APIs. Introduce BPF_F_CPU flag support for percpu_hash and lru_percpu_hash maps to allow: * update value for specified CPU for both update_elem and update_batch APIs. * lookup value for specified CPU for both lookup_elem and lookup_batch APIs. The BPF_F_CPU flag is passed via: * map_flags along with embedded cpu info. * elem_flags along with embedded cpu info. Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20260107022022.12843-4-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-06 20:48:32 -08:00
Leon Hwang	8eb76cb03f	bpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu_array maps Introduce support for the BPF_F_ALL_CPUS flag in percpu_array maps to allow updating values for all CPUs with a single value for both update_elem and update_batch APIs. Introduce support for the BPF_F_CPU flag in percpu_array maps to allow: * update value for specified CPU for both update_elem and update_batch APIs. * lookup value for specified CPU for both lookup_elem and lookup_batch APIs. The BPF_F_CPU flag is passed via: * map_flags of lookup_elem and update_elem APIs along with embedded cpu info. * elem_flags of lookup_batch and update_batch APIs along with embedded cpu info. Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20260107022022.12843-3-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-06 20:48:32 -08:00
Leon Hwang	2b421662c7	bpf: Introduce BPF_F_CPU and BPF_F_ALL_CPUS flags Introduce BPF_F_CPU and BPF_F_ALL_CPUS flags and check them for following APIs: * 'map_lookup_elem()' * 'map_update_elem()' * 'generic_map_lookup_batch()' * 'generic_map_update_batch()' And, get the correct value size for these APIs. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20260107022022.12843-2-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-06 20:48:32 -08:00
Emil Tsalapatis	39f77533b6	bpf: Allow calls to arena functions while holding spinlocks The bpf_arena_*_pages() kfuncs can be called from sleepable contexts, but the verifier still prevents BPF programs from calling them while holding a spinlock. Amend the verifier to allow for BPF programs calling arena page management functions while holding a lock. Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260106-arena-under-lock-v2-2-378e9eab3066@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-06 17:44:00 -08:00
Emil Tsalapatis	b25b48c7d3	bpf: Check active lock count in in_sleepable_context() The in_sleepable_context() function is used to specialize the BPF code in do_misc_fixups(). With the addition of nonsleepable arena kfuncs, there are kfuncs whose specialization depends on whether we are holding a lock. We should use the nonsleepable version while holding a lock and the sleepable one when not. Add a check for active_locks to account for locking when specializing arena kfuncs. Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260106-arena-under-lock-v2-1-378e9eab3066@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-06 17:43:19 -08:00
Puranjay Mohan	a069190b59	bpf: Replace __opt annotation with __nullable for kfuncs The __opt annotation was originally introduced specifically for buffer/size argument pairs in bpf_dynptr_slice() and bpf_dynptr_slice_rdwr(), allowing the buffer pointer to be NULL while still validating the size as a constant. The __nullable annotation serves the same purpose but is more general and is already used throughout the BPF subsystem for raw tracepoints, struct_ops, and other kfuncs. This patch unifies the two annotations by replacing __opt with __nullable. The key change is in the verifier's get_kfunc_ptr_arg_type() function, where mem/size pair detection is now performed before the nullable check. This ensures that buffer/size pairs are correctly classified as KF_ARG_PTR_TO_MEM_SIZE even when the buffer is nullable, while adding an !arg_mem_size condition to the nullable check prevents interference with mem/size pair handling. When processing KF_ARG_PTR_TO_MEM_SIZE arguments, the verifier now uses is_kfunc_arg_nullable() instead of the removed is_kfunc_arg_optional() to determine whether to skip size validation for NULL buffers. This is the first documentation added for the __nullable annotation, which has been in use since it was introduced but was previously undocumented. No functional changes to verifier behavior - nullable buffer/size pairs continue to work exactly as before. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20260102221513.1961781-1-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-02 15:51:34 -08:00
Puranjay Mohan	e66fe1bc6d	bpf: arena: Reintroduce memcg accounting When arena allocations were converted from bpf_map_alloc_pages() to kmalloc_nolock() to support non-sleepable contexts, memcg accounting was inadvertently lost. This commit restores proper memory accounting for all arena-related allocations. All arena related allocations are accounted into memcg of the process that created bpf_arena. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20260102200230.25168-3-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-02 14:31:59 -08:00
Puranjay Mohan	817593af7b	bpf: syscall: Introduce memcg enter/exit helpers Introduce bpf_map_memcg_enter() and bpf_map_memcg_exit() helpers to reduce code duplication in memcg context management. bpf_map_memcg_enter() gets the memcg from the map, sets it as active, and returns both the previous and the now active memcg. bpf_map_memcg_exit() restores the previous active memcg and releases the reference obtained during enter. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20260102200230.25168-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-02 14:31:59 -08:00
Puranjay Mohan	7646c7afd9	bpf: Remove redundant KF_TRUSTED_ARGS flag from all kfuncs Now that KF_TRUSTED_ARGS is the default for all kfuncs, remove the explicit KF_TRUSTED_ARGS flag from all kfunc definitions and remove the flag itself. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20260102180038.2708325-3-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-02 12:04:28 -08:00
Puranjay Mohan	1a5c01d250	bpf: Make KF_TRUSTED_ARGS the default for all kfuncs Change the verifier to make trusted args the default requirement for all kfuncs by removing is_kfunc_trusted_args() assuming it be to always return true. This works because: 1. Context pointers (xdp_md, __sk_buff, etc.) are handled through their own KF_ARG_PTR_TO_CTX case label and bypass the trusted check 2. Struct_ops callback arguments are already marked as PTR_TRUSTED during initialization and pass is_trusted_reg() 3. KF_RCU kfuncs are handled separately via is_kfunc_rcu() checks at call sites (always checked with \|\| alongside is_kfunc_trusted_args) This simple change makes all kfuncs require trusted args by default while maintaining correct behavior for all existing special cases. Note: This change means kfuncs that previously accepted NULL pointers without KF_TRUSTED_ARGS will now reject NULL at verification time. Several netfilter kfuncs are affected: bpf_xdp_ct_lookup(), bpf_skb_ct_lookup(), bpf_xdp_ct_alloc(), and bpf_skb_ct_alloc() all accept NULL for their bpf_tuple and opts parameters internally (checked in __bpf_nf_ct_lookup), but after this change the verifier rejects NULL before the kfunc is even called. This is acceptable because these kfuncs don't work with NULL parameters in their proper usage. Now they will be rejected rather than returning an error, which shouldn't make a difference to BPF programs that were using these kfuncs properly. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20260102180038.2708325-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-02 12:04:28 -08:00
Eduard Zingerman	840692326e	bpf: allow states pruning for misc/invalid slots in iterator loops Within an iterator or callback based loop, it should be safe to prune the current state if the old state stack slot is marked as STACK_INVALID or STACK_MISC: - either all branches of the old state lead to a program exit; - or some branch of the old state leads the current state. This is the same logic as applied in non-loop cases when states_equal() is called in NOT_EXACT mode. The test case that exercises stacksafe() and demonstrates the difference in verification performance is included in the next patch. I'm not sure if it is possible to prepare a test case that exercises regsafe(); it appears that the compute_live_registers() pass makes this impossible. Nevertheless, for code readability reasons, I think that stacksafe() and regsafe() should handle STACK_INVALID / NOT_INIT symmetrically. Hence, this commit changes both functions. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251230-loop-stack-misc-pruning-v1-1-585cfd6cec51@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-12-31 09:01:13 -08:00
Eduard Zingerman	f597664454	bpf: bpf_scc_visit instance and backedges accumulation for bpf_loop() Calls like bpf_loop() or bpf_for_each_map_elem() introduce loops that are not explicitly present in the control-flow graph. The verifier processes such calls by repeatedly interpreting the callback function body within the same verification path (until the current state converges with a previous state). Such loops require a bpf_scc_visit instance in order to allow the accumulation of the state graph backedges. Otherwise, certain checkpoint states created within the bodies of such loops will have incomplete precision marks. See the next patch for an example of a program that leads to the verifier accepting an unsafe program. Fixes: `96c6aa4c63` ("bpf: compute SCCs in program control flow graph") Fixes: `c9e31900b5` ("bpf: propagate read/precision marks over state graph backedges") Reported-by: Breno Leitao <leitao@debian.org> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Tested-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/20251229-scc-for-callbacks-v1-1-ceadfe679900@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-12-30 15:42:42 -08:00
Puranjay Mohan	b8467290ed	bpf: arena: make arena kfuncs any context safe Make arena related kfuncs any context safe by the following changes: bpf_arena_alloc_pages() and bpf_arena_reserve_pages(): Replace the usage of the mutex with a rqspinlock for range tree and use kmalloc_nolock() wherever needed. Use free_pages_nolock() to free pages from any context. apply_range_set/clear_cb() with apply_to_page_range() has already made populating the vm_area in bpf_arena_alloc_pages() any context safe. bpf_arena_free_pages(): defer the main logic to a workqueue if it is called from a non-sleepable context. specialize_kfunc() is used to replace the sleepable arena_free_pages() with bpf_arena_free_pages_non_sleepable() when the verifier detects the call is from a non-sleepable context. In the non-sleepable case, arena_free_pages() queues the address and the page count to be freed to a lock-less list of struct arena_free_spans and raises an irq_work. The irq_work handler calls schedules_work() as it is safe to be called from irq context. arena_free_worker() (the work queue handler) iterates these spans and clears ptes, flushes tlb, zaps pages, and calls __free_page(). Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20251222195022.431211-4-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-12-23 11:30:00 -08:00
Puranjay Mohan	360c35f8ff	bpf: arena: use kmalloc_nolock() in place of kvcalloc() To make arena_alloc_pages() safe to be called from any context, replace kvcalloc() with kmalloc_nolock() so as it doesn't sleep or take any locks. kmalloc_nolock() returns NULL for allocations larger than KMALLOC_MAX_CACHE_SIZE, which is (PAGE_SIZE * 2) = 8KB on systems with 4KB pages. So, round down the allocation done by kmalloc_nolock to 1024 * 8 and reuse the array in a loop. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20251222195022.431211-3-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-12-23 11:29:59 -08:00
Puranjay Mohan	c336b0b327	bpf: arena: populate vm_area without allocating memory vm_area_map_pages() may allocate memory while inserting pages into bpf arena's vm_area. In order to make bpf_arena_alloc_pages() kfunc non-sleepable change bpf arena to populate pages without allocating memory: - at arena creation time populate all page table levels except the last level - when new pages need to be inserted call apply_to_page_range() again with apply_range_set_cb() which will only set_pte_at() those pages and will not allocate memory. - when freeing pages call apply_to_existing_page_range with apply_range_clear_cb() to clear the pte for the page to be removed. This doesn't free intermediate page table levels. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20251222195022.431211-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-12-23 11:29:59 -08:00
Daniel Gomez	ac1c5bc7c4	bpf: crypto: replace -EEXIST with -EBUSY The -EEXIST error code is reserved by the module loading infrastructure to indicate that a module is already loaded. When a module's init function returns -EEXIST, userspace tools like kmod interpret this as "module already loaded" and treat the operation as successful, returning 0 to the user even though the module initialization actually failed. This follows the precedent set by commit `54416fd767` ("netfilter: conntrack: helper: Replace -EEXIST by -EBUSY") which fixed the same issue in nf_conntrack_helper_register(). This affects bpf_crypto_skcipher module. While the configuration required to build it as a module is unlikely in practice, it is technically possible, so fix it for correctness. Signed-off-by: Daniel Gomez <da.gomez@samsung.com> Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://lore.kernel.org/r/20251220-dev-module-init-eexists-bpf-v1-1-7f186663dbe7@samsung.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-12-22 22:25:09 -08:00
Puranjay Mohan	342297d511	bpf: allow calling kfuncs from raw_tp programs Associate raw tracepoint program type with the kfunc tracing hook. This allows calling kfuncs from raw_tp programs. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20251222133250.1890587-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-12-22 22:23:38 -08:00
Matt Bobrowski	94e948b7e6	bpf: annotate file argument as __nullable in bpf_lsm_mmap_file As reported in [0], anonymous memory mappings are not backed by a struct file instance. Consequently, the struct file pointer passed to the security_mmap_file() LSM hook is NULL in such cases. The BPF verifier is currently unaware of this, allowing BPF LSM programs to dereference this struct file pointer without needing to perform an explicit NULL check. This leads to potential NULL pointer dereference and a kernel crash. Add a strong override for bpf_lsm_mmap_file() which annotates the struct file pointer parameter with the __nullable suffix. This explicitly informs the BPF verifier that this pointer (PTR_MAYBE_NULL) can be NULL, forcing BPF LSM programs to perform a check on it before dereferencing it. [0] https://lore.kernel.org/bpf/5e460d3c.4c3e9.19adde547d8.Coremail.kaiyanm@hust.edu.cn/ Reported-by: Kaiyan Mei <M202472210@hust.edu.cn> Reported-by: Yinhao Hu <dddddd@hust.edu.cn> Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn> Closes: https://lore.kernel.org/bpf/5e460d3c.4c3e9.19adde547d8.Coremail.kaiyanm@hust.edu.cn/ Signed-off-by: Matt Bobrowski <mattbobrowski@google.com> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20251216133000.3690723-1-mattbobrowski@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-12-21 10:56:33 -08:00
Puranjay Mohan	c3e34f88f9	bpf: arm64: Optimize recursion detection by not using atomics BPF programs detect recursion using a per-CPU 'active' flag in struct bpf_prog. The trampoline currently sets/clears this flag with atomic operations. On some arm64 platforms (e.g., Neoverse V2 with LSE), per-CPU atomic operations are relatively slow. Unlike x86_64 - where per-CPU updates can avoid cross-core atomicity, arm64 LSE atomics are always atomic across all cores, which is unnecessary overhead for strictly per-CPU state. This patch removes atomics from the recursion detection path on arm64 by changing 'active' to a per-CPU array of four u8 counters, one per context: {NMI, hard-irq, soft-irq, normal}. The running context uses a non-atomic increment/decrement on its element. After increment, recursion is detected by reading the array as a u32 and verifying that only the expected element changed; any change in another element indicates inter-context recursion, and a value > 1 in the same element indicates same-context recursion. For example, starting from {0,0,0,0}, a normal-context trigger changes the array to {0,0,0,1}. If an NMI arrives on the same CPU and triggers the program, the array becomes {1,0,0,1}. When the NMI context checks the u32 against the expected mask for normal (0x00000001), it observes 0x01000001 and correctly reports recursion. Same-context recursion is detected analogously. Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20251219184422.2899902-3-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-12-21 10:54:37 -08:00
Puranjay Mohan	93f0d09697	bpf: move recursion detection logic to helpers BPF programs detect recursion by doing atomic inc/dec on a per-cpu active counter from the trampoline. Create two helpers for operations on this active counter, this makes it easy to changes the recursion detection logic in future. This commit makes no functional changes. Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20251219184422.2899902-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-12-21 10:54:37 -08:00
Alexei Starovoitov	ec439c3801	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after 6.19-rc1 Cross-merge BPF and other fixes after downstream PR. Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-12-16 21:29:38 -08:00
Linus Torvalds	ea1013c153	bpf-fixes -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmlCBmwACgkQ6rmadz2v bToUZA//ZY0IE1x1nCixEAqGF/nGpDzVX4YQQfjrUoXQOD4ykzt35yTNXl6B1IVA dliVSI6kUtdoThUa7xJUxMSkDsVBsEMT/zYXQEXJG1zXvJANCB9wTzsC3OCBWbXt BRczcEkq0OHC9/l5CrILR6ocwxKGDIMIysfeOSABgfqckSEhylWy3+EWZQCk08ka gNpXlDJUG7dYpcZD/zhuC7e5Rg1uNvN7WiTv+Biig8xZCsEtYOq+qC5C/sOnsypI nqfECfbx48cVl49SjatdgquuHn/INESdLRCHisshkurA2Mp5PQuCmrwlXbv4JG59 v9b7lsFQlkpvEXMdo9VYe6K2gjfkOPRdWsVPu2oXA1qISRmrDqX8cKOpapUIwRhL p3ASruMOnz0KFqVaET8+5u2SwtALeW+c+1p1aHMfVGF/qbXuyG05qBkLoGFJR+Xr WznXUXY80Z7pjD57SpA6U3DigAkGqKCBXUwdifaOq8HQonwsnQGqkW/3NngNULGP IC4u0JXn61VgQsM/kAw+ucc4bdKI0g4oKJR56lT48elrj6Yxrjpde4oOqzZ0IQKu VQ0YnzWqqT2tjh4YNMOwkNPbFR4ALd329zI6TUkWib/jByEBNcfjSj9BRANd1KSx JgSHAE6agrbl6h3nOx584YCasX3Zq+nfv1Sj4Z/5GaHKKW3q/Vw= =wHLt -----END PGP SIGNATURE----- Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull bpf fixes from Alexei Starovoitov: - Fix BPF builds due to -fms-extensions. selftests (Alexei Starovoitov), bpftool (Quentin Monnet). - Fix build of net/smc when CONFIG_BPF_SYSCALL=y, but CONFIG_BPF_JIT=n (Geert Uytterhoeven) - Fix livepatch/BPF interaction and support reliable unwinding through BPF stack frames (Josh Poimboeuf) - Do not audit capability check in arm64 JIT (Ondrej Mosnacek) - Fix truncated dmabuf BPF iterator reads (T.J. Mercier) - Fix verifier assumptions of bpf_d_path's output buffer (Shuran Liu) - Fix warnings in libbpf when built with -Wdiscarded-qualifiers under C23 (Mikhail Gavrilov) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: add regression test for bpf_d_path() bpf: Fix verifier assumptions of bpf_d_path's output buffer selftests/bpf: Add test for truncated dmabuf_iter reads bpf: Fix truncated dmabuf iterator reads x86/unwind/orc: Support reliable unwinding through BPF stack frames bpf: Add bpf_has_frame_pointer() bpf, arm64: Do not audit capability check in do_jit() libbpf: Fix -Wdiscarded-qualifiers under C23 bpftool: Fix build warnings due to MS extensions net: smc: SMC_HS_CTRL_BPF should depend on BPF_JIT selftests/bpf: Add -fms-extensions to bpf build flags	2025-12-17 15:54:58 +12:00
Emil Tsalapatis	12a1fe6e12	bpf/verifier: Do not limit maximum direct offset into arena map The verifier currently limits direct offsets into a map to 512MiB to avoid overflow during pointer arithmetic. However, this prevents arena maps from using direct addressing instructions to access data at the end of > 512MiB arena maps. This is necessary when moving arena globals to the end of the arena instead of the front. Refactor the verifier code to remove the offset calculation during direct value access calculations. This is possible because the only two map types that implement .map_direct_value_addr() are arrays and arenas, and they both do their own internal checks to ensure the offset is within bounds. Adjust selftests that expect the old error. These tests still fail because the verifier identifies the access as out of bounds for the map, so change them to expect an "invalid access to map value pointer" error instead. Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20251216173325.98465-3-emil@etsalapatis.com	2025-12-16 10:42:55 -08:00
T.J. Mercier	6f0b824a61	bpf: Fix bpf_seq_read docs for increased buffer size Commit `af65320948` ("bpf: Bump iter seq size to support BTF representation of large data structures") increased the fixed buffer size from PAGE_SIZE to PAGE_SIZE << 3, but the docs for the function didn't get updated at the same time. Update them. Signed-off-by: T.J. Mercier <tjmercier@google.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20251207091005.2829703-1-tjmercier@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-12-13 18:57:53 -08:00
Cupertino Miranda	d18dec4b89	bpf: verifier improvement in 32bit shift sign extension pattern This patch improves the verifier to correctly compute bounds for sign extension compiler pattern composed of left shift by 32bits followed by a sign right shift by 32bits. Pattern in the verifier was limitted to positive value bounds and would reset bound computation for negative values. New code allows both positive and negative values for sign extension without compromising bound computation and verifier to pass. This change is required by GCC which generate such pattern, and was detected in the context of systemd, as described in the following GCC bugzilla: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119731 Three new tests were added in verifier_subreg.c. Signed-off-by: Cupertino Miranda <cupertino.miranda@oracle.com> Signed-off-by: Andrew Pinski <andrew.pinski@oss.qualcomm.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Cc: David Faust <david.faust@oracle.com> Cc: Jose Marchesi <jose.marchesi@oracle.com> Cc: Elena Zannoni <elena.zannoni@oracle.com> Link: https://lore.kernel.org/r/20251202180220.11128-2-cupertino.miranda@oracle.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-12-10 00:12:09 -08:00
Kohei Enju	48e11bad9a	bpf: cpumap: propagate underlying error in cpu_map_update_elem() After commit `9216477449` ("bpf: cpumap: Add the possibility to attach an eBPF program to cpumap"), __cpu_map_entry_alloc() may fail with errors other than -ENOMEM, such as -EBADF or -EINVAL. However, __cpu_map_entry_alloc() returns NULL on all failures, and cpu_map_update_elem() unconditionally converts this NULL into -ENOMEM. As a result, user space always receives -ENOMEM regardless of the actual underlying error. Examples of unexpected behavior: - Nonexistent fd : -ENOMEM (should be -EBADF) - Non-BPF fd : -ENOMEM (should be -EINVAL) - Bad attach type : -ENOMEM (should be -EINVAL) Change __cpu_map_entry_alloc() to return ERR_PTR(err) instead of NULL and have cpu_map_update_elem() propagate this error. Signed-off-by: Kohei Enju <enjuk@amazon.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/20251208131449.73036-2-enjuk@amazon.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-12-09 23:53:27 -08:00
T.J. Mercier	234483565d	bpf: Fix truncated dmabuf iterator reads If there is a large number (hundreds) of dmabufs allocated, the text output generated from dmabuf_iter_seq_show can exceed common user buffer sizes (e.g. PAGE_SIZE) necessitating multiple start/stop cycles to iterate through all dmabufs. However the dmabuf iterator currently returns NULL in dmabuf_iter_seq_start for all non-zero pos values, which results in the truncation of the output before all dmabufs are handled. After dma_buf_iter_begin / dma_buf_iter_next, the refcount of the buffer is elevated so that the BPF iterator program can run without holding any locks. When a stop occurs, instead of immediately dropping the reference on the buffer, stash a pointer to the buffer in seq->priv until either start is called or the iterator is released. This also enables the resumption of iteration without first walking through the list of dmabufs based on the pos value. Fixes: `76ea955349` ("bpf: Add dmabuf iterator") Signed-off-by: T.J. Mercier <tjmercier@google.com> Link: https://lore.kernel.org/r/20251204000348.1413593-1-tjmercier@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-12-09 23:48:34 -08:00
Josh Poimboeuf	ca45c84afb	bpf: Add bpf_has_frame_pointer() Introduce a bpf_has_frame_pointer() helper that unwinders can call to determine whether a given instruction pointer is within the valid frame pointer region of a BPF JIT program or trampoline (i.e., after the prologue, before the epilogue). This will enable livepatch (with the ORC unwinder) to reliably unwind through BPF JIT frames. Acked-by: Song Liu <song@kernel.org> Acked-and-tested-by: Andrey Grodzovsky <andrey.grodzovsky@crowdstrike.com> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/r/fd2bc5b4e261a680774b28f6100509fd5ebad2f0.1764818927.git.jpoimboe@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Jiri Olsa <jolsa@kernel.org>	2025-12-09 23:29:42 -08:00
Amery Hung	b5709f6d26	bpf: Support associating BPF program with struct_ops Add a new BPF command BPF_PROG_ASSOC_STRUCT_OPS to allow associating a BPF program with a struct_ops map. This command takes a file descriptor of a struct_ops map and a BPF program and set prog->aux->st_ops_assoc to the kdata of the struct_ops map. The command does not accept a struct_ops program nor a non-struct_ops map. Programs of a struct_ops map is automatically associated with the map during map update. If a program is shared between two struct_ops maps, prog->aux->st_ops_assoc will be poisoned to indicate that the associated struct_ops is ambiguous. The pointer, once poisoned, cannot be reset since we have lost track of associated struct_ops. For other program types, the associated struct_ops map, once set, cannot be changed later. This restriction may be lifted in the future if there is a use case. A kernel helper bpf_prog_get_assoc_struct_ops() can be used to retrieve the associated struct_ops pointer. The returned pointer, if not NULL, is guaranteed to be valid and point to a fully updated struct_ops struct. For struct_ops program reused in multiple struct_ops map, the return will be NULL. prog->aux->st_ops_assoc is protected by bumping the refcount for non-struct_ops programs and RCU for struct_ops programs. Since it would be inefficient to track programs associated with a struct_ops map, every non-struct_ops program will bump the refcount of the map to make sure st_ops_assoc stays valid. For a struct_ops program, it is protected by RCU as map_free will wait for an RCU grace period before disassociating the program with the map. The helper must be called in BPF program context or RCU read-side critical section. struct_ops implementers should note that the struct_ops returned may not be initialized nor attached yet. The struct_ops implementer will be responsible for tracking and checking the state of the associated struct_ops map if the use case expects an initialized or attached struct_ops. Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/bpf/20251203233748.668365-3-ameryhung@gmail.com	2025-12-05 16:17:57 -08:00
Amery Hung	1588c81b9f	bpf: Allow verifier to fixup kernel module kfuncs Allow verifier to fixup kfuncs in kernel module to support kfuncs with __prog arguments. Currently, special kfuncs and kfuncs with __prog arguments are kernel kfuncs. Allowing kernel module kfuncs should not affect existing kfunc fixup as kernel module kfuncs have BTF IDs greater than kernel kfuncs' BTF IDs. Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20251203233748.668365-2-ameryhung@gmail.com	2025-12-05 16:17:57 -08:00
Linus Torvalds	7cd122b552	Some filesystems use a kinda-sorta controlled dentry refcount leak to pin dentries of created objects in dcache (and undo it when removing those). Reference is grabbed and not released, but it's not actually _stored_ anywhere. That works, but it's hard to follow and verify; among other things, we have no way to tell _which_ of the increments is intended to be an unpaired one. Worse, on removal we need to decide whether the reference had already been dropped, which can be non-trivial if that removal is on umount and we need to figure out if this dentry is pinned due to e.g. unlink() not done. Usually that is handled by using kill_litter_super() as ->kill_sb(), but there are open-coded special cases of the same (consider e.g. /proc/self). Things get simpler if we introduce a new dentry flag (DCACHE_PERSISTENT) marking those "leaked" dentries. Having it set claims responsibility for +1 in refcount. The end result this series is aiming for: * get these unbalanced dget() and dput() replaced with new primitives that would, in addition to adjusting refcount, set and clear persistency flag. * instead of having kill_litter_super() mess with removing the remaining "leaked" references (e.g. for all tmpfs files that hadn't been removed prior to umount), have the regular shrink_dcache_for_umount() strip DCACHE_PERSISTENT of all dentries, dropping the corresponding reference if it had been set. After that kill_litter_super() becomes an equivalent of kill_anon_super(). Doing that in a single step is not feasible - it would affect too many places in too many filesystems. It has to be split into a series. This work has really started early in 2024; quite a few preliminary pieces have already gone into mainline. This chunk is finally getting to the meat of that stuff - infrastructure and most of the conversions to it. Some pieces are still sitting in the local branches, but the bulk of that stuff is here. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaTEq1wAKCRBZ7Krx/gZQ 643uAQC1rRslhw5l7OjxEpIYbGG4M+QaadN4Nf5Sr2SuTRaPJQD/W4oj/u4C2eCw Dd3q071tqyvm/PXNgN2EEnIaxlFUlwc= =rKq+ -----END PGP SIGNATURE----- Merge tag 'pull-persistency' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull persistent dentry infrastructure and conversion from Al Viro: "Some filesystems use a kinda-sorta controlled dentry refcount leak to pin dentries of created objects in dcache (and undo it when removing those). A reference is grabbed and not released, but it's not actually _stored_ anywhere. That works, but it's hard to follow and verify; among other things, we have no way to tell _which_ of the increments is intended to be an unpaired one. Worse, on removal we need to decide whether the reference had already been dropped, which can be non-trivial if that removal is on umount and we need to figure out if this dentry is pinned due to e.g. unlink() not done. Usually that is handled by using kill_litter_super() as ->kill_sb(), but there are open-coded special cases of the same (consider e.g. /proc/self). Things get simpler if we introduce a new dentry flag (DCACHE_PERSISTENT) marking those "leaked" dentries. Having it set claims responsibility for +1 in refcount. The end result this series is aiming for: - get these unbalanced dget() and dput() replaced with new primitives that would, in addition to adjusting refcount, set and clear persistency flag. - instead of having kill_litter_super() mess with removing the remaining "leaked" references (e.g. for all tmpfs files that hadn't been removed prior to umount), have the regular shrink_dcache_for_umount() strip DCACHE_PERSISTENT of all dentries, dropping the corresponding reference if it had been set. After that kill_litter_super() becomes an equivalent of kill_anon_super(). Doing that in a single step is not feasible - it would affect too many places in too many filesystems. It has to be split into a series. This work has really started early in 2024; quite a few preliminary pieces have already gone into mainline. This chunk is finally getting to the meat of that stuff - infrastructure and most of the conversions to it. Some pieces are still sitting in the local branches, but the bulk of that stuff is here" * tag 'pull-persistency' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits) d_make_discardable(): warn if given a non-persistent dentry kill securityfs_recursive_remove() convert securityfs get rid of kill_litter_super() convert rust_binderfs convert nfsctl convert rpc_pipefs convert hypfs hypfs: swich hypfs_create_u64() to returning int hypfs: switch hypfs_create_str() to returning int hypfs: don't pin dentries twice convert gadgetfs gadgetfs: switch to simple_remove_by_name() convert functionfs functionfs: switch to simple_remove_by_name() functionfs: fix the open/removal races functionfs: need to cancel ->reset_work in ->kill_sb() functionfs: don't bother with ffs->ref in ffs_data_{opened,closed}() functionfs: don't abuse ffs_data_closed() on fs shutdown convert selinuxfs ...	2025-12-05 14:36:21 -08:00
Linus Torvalds	7203ca412f	Significant patch series in this merge are as follows: - The 10 patch series "__vmalloc()/kvmalloc() and no-block support" from Uladzislau Rezki reworks the vmalloc() code to support non-blocking allocations (GFP_ATOIC, GFP_NOWAIT). - The 2 patch series "ksm: fix exec/fork inheritance" from xu xin fixes a rare case where the KSM MMF_VM_MERGE_ANY prctl state is not inherited across fork/exec. - The 4 patch series "mm/zswap: misc cleanup of code and documentations" from SeongJae Park does some light maintenance work on the zswap code. - The 5 patch series "mm/page_owner: add debugfs files 'show_handles' and 'show_stacks_handles'" from Mauricio Faria de Oliveira enhances the /sys/kernel/debug/page_owner debug feature. It adds unique identifiers to differentiate the various stack traces so that userspace monitoring tools can better match stack traces over time. - The 2 patch series "mm/page_alloc: pcp->batch cleanups" from Joshua Hahn makes some minor alterations to the page allocator's per-cpu-pages feature. - The 2 patch series "Improve UFFDIO_MOVE scalability by removing anon_vma lock" from Lokesh Gidra addresses a scalability issue in userfaultfd's UFFDIO_MOVE operation. - The 2 patch series "kasan: cleanups for kasan_enabled() checks" from Sabyrzhan Tasbolatov performs some cleanup in the KASAN code. - The 2 patch series "drivers/base/node: fold node register and unregister functions" from Donet Tom cleans up the NUMA node handling code a little. - The 4 patch series "mm: some optimizations for prot numa" from Kefeng Wang provides some cleanups and small optimizations to the NUMA allocation hinting code. - The 5 patch series "mm/page_alloc: Batch callers of free_pcppages_bulk" from Joshua Hahn addresses long lock hold times at boot on large machines. These were causing (harmless) softlockup warnings. - The 2 patch series "optimize the logic for handling dirty file folios during reclaim" from Baolin Wang removes some now-unnecessary work from page reclaim. - The 10 patch series "mm/damon: allow DAMOS auto-tuned for per-memcg per-node memory usage" from SeongJae Park enhances the DAMOS auto-tuning feature. - The 2 patch series "mm/damon: fixes for address alignment issues in DAMON_LRU_SORT and DAMON_RECLAIM" from Quanmin Yan fixes DAMON_LRU_SORT and DAMON_RECLAIM with certain userspace configuration. - The 15 patch series "expand mmap_prepare functionality, port more users" from Lorenzo Stoakes enhances the new(ish) file_operations.mmap_prepare() method and ports additional callsites from the old ->mmap() over to ->mmap_prepare(). - The 8 patch series "Fix stale IOTLB entries for kernel address space" from Lu Baolu fixes a bug (and possible security issue on non-x86) in the IOMMU code. In some situations the IOMMU could be left hanging onto a stale kernel pagetable entry. - The 4 patch series "mm/huge_memory: cleanup __split_unmapped_folio()" from Wei Yang cleans up and optimizes the folio splitting code. - The 5 patch series "mm, swap: misc cleanup and bugfix" from Kairui Song implements some cleanups and a minor fix in the swap discard code. - The 8 patch series "mm/damon: misc documentation fixups" from SeongJae Park does as advertised. - The 9 patch series "mm/damon: support pin-point targets removal" from SeongJae Park permits userspace to remove a specific monitoring target in the middle of the current targets list. - The 2 patch series "mm: MISC follow-up patches for linux/pgalloc.h" from Harry Yoo implements a couple of cleanups related to mm header file inclusion. - The 2 patch series "mm/swapfile.c: select swap devices of default priority round robin" from Baoquan He improves the selection of swap devices for NUMA machines. - The 3 patch series "mm: Convert memory block states (MEM_) macros to enums" from Israel Batista changes the memory block labels from macros to enums so they will appear in kernel debug info. - The 3 patch series "ksm: perform a range-walk to jump over holes in break_ksm" from Pedro Demarchi Gomes addresses an inefficiency when KSM unmerges an address range. - The 22 patch series "mm/damon/tests: fix memory bugs in kunit tests" from SeongJae Park fixes leaks and unhandled malloc() failures in DAMON userspace unit tests. - The 2 patch series "some cleanups for pageout()" from Baolin Wang cleans up a couple of minor things in the page scanner's writeback-for-eviction code. - The 2 patch series "mm/hugetlb: refactor sysfs/sysctl interfaces" from Hui Zhu moves hugetlb's sysfs/sysctl handling code into a new file. - The 9 patch series "introduce VM_MAYBE_GUARD and make it sticky" from Lorenzo Stoakes makes the VMA guard regions available in /proc/pid/smaps and improves the mergeability of guarded VMAs. - The 2 patch series "mm: perform guard region install/remove under VMA lock" from Lorenzo Stoakes reduces mmap lock contention for callers performing VMA guard region operations. - The 2 patch series "vma_start_write_killable" from Matthew Wilcox starts work in permitting applications to be killed when they are waiting on a read_lock on the VMA lock. - The 11 patch series "mm/damon/tests: add more tests for online parameters commit" from SeongJae Park adds additional userspace testing of DAMON's "commit" feature. - The 9 patch series "mm/damon: misc cleanups" from SeongJae Park does that. - The 2 patch series "make VM_SOFTDIRTY a sticky VMA flag" from Lorenzo Stoakes addresses the possible loss of a VMA's VM_SOFTDIRTY flag when that VMA is merged with another. - The 16 patch series "mm: support device-private THP" from Balbir Singh introduces support for Transparent Huge Page (THP) migration in zone device-private memory. - The 3 patch series "Optimize folio split in memory failure" from Zi Yan optimizes folio split operations in the memory failure code. - The 2 patch series "mm/huge_memory: Define split_type and consolidate split support checks" from Wei Yang provides some more cleanups in the folio splitting code. - The 16 patch series "mm: remove is_swap_[pte, pmd]() + non-swap entries, introduce leaf entries" from Lorenzo Stoakes cleans up our handling of pagetable leaf entries by introducing the concept of 'software leaf entries', of type softleaf_t. - The 4 patch series "reparent the THP split queue" from Muchun Song reparents the THP split queue to its parent memcg. This is in preparation for addressing the long-standing "dying memcg" problem, wherein dead memcg's linger for too long, consuming memory resources. - The 3 patch series "unify PMD scan results and remove redundant cleanup" from Wei Yang does a little cleanup in the hugepage collapse code. - The 6 patch series "zram: introduce writeback bio batching" from Sergey Senozhatsky improves zram writeback efficiency by introducing batched bio writeback support. - The 4 patch series "memcg: cleanup the memcg stats interfaces" from Shakeel Butt cleans up our handling of the interrupt safety of some memcg stats. - The 4 patch series "make vmalloc gfp flags usage more apparent" from Vishal Moola cleans up vmalloc's handling of incoming GFP flags. - The 6 patch series "mm: Add soft-dirty and uffd-wp support for RISC-V" from Chunyan Zhang teches soft dirty and userfaultfd write protect tracking to use RISC-V's Svrsw60t59b extension. - The 5 patch series "mm: swap: small fixes and comment cleanups" from Youngjun Park fixes a small bug and cleans up some of the swap code. - The 4 patch series "initial work on making VMA flags a bitmap" from Lorenzo Stoakes starts work on converting the vma struct's flags to a bitmap, so we stop running out of them, especially on 32-bit. - The 2 patch series "mm/swapfile: fix and cleanup swap list iterations" from Youngjun Park addresses a possible bug in the swap discard code and cleans things up a little. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaTEb0wAKCRDdBJ7gKXxA jjfIAP94W4EkCCwNOupnChoG+YWw/JW21anXt5NN+i5svn1yugEAwzvv6A+cAFng o+ug/fyrfPZG7PLp2R8WFyGIP0YoBA4= =IUzS -----END PGP SIGNATURE----- Merge tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "__vmalloc()/kvmalloc() and no-block support" (Uladzislau Rezki) Rework the vmalloc() code to support non-blocking allocations (GFP_ATOIC, GFP_NOWAIT) "ksm: fix exec/fork inheritance" (xu xin) Fix a rare case where the KSM MMF_VM_MERGE_ANY prctl state is not inherited across fork/exec "mm/zswap: misc cleanup of code and documentations" (SeongJae Park) Some light maintenance work on the zswap code "mm/page_owner: add debugfs files 'show_handles' and 'show_stacks_handles'" (Mauricio Faria de Oliveira) Enhance the /sys/kernel/debug/page_owner debug feature by adding unique identifiers to differentiate the various stack traces so that userspace monitoring tools can better match stack traces over time "mm/page_alloc: pcp->batch cleanups" (Joshua Hahn) Minor alterations to the page allocator's per-cpu-pages feature "Improve UFFDIO_MOVE scalability by removing anon_vma lock" (Lokesh Gidra) Address a scalability issue in userfaultfd's UFFDIO_MOVE operation "kasan: cleanups for kasan_enabled() checks" (Sabyrzhan Tasbolatov) "drivers/base/node: fold node register and unregister functions" (Donet Tom) Clean up the NUMA node handling code a little "mm: some optimizations for prot numa" (Kefeng Wang) Cleanups and small optimizations to the NUMA allocation hinting code "mm/page_alloc: Batch callers of free_pcppages_bulk" (Joshua Hahn) Address long lock hold times at boot on large machines. These were causing (harmless) softlockup warnings "optimize the logic for handling dirty file folios during reclaim" (Baolin Wang) Remove some now-unnecessary work from page reclaim "mm/damon: allow DAMOS auto-tuned for per-memcg per-node memory usage" (SeongJae Park) Enhance the DAMOS auto-tuning feature "mm/damon: fixes for address alignment issues in DAMON_LRU_SORT and DAMON_RECLAIM" (Quanmin Yan) Fix DAMON_LRU_SORT and DAMON_RECLAIM with certain userspace configuration "expand mmap_prepare functionality, port more users" (Lorenzo Stoakes) Enhance the new(ish) file_operations.mmap_prepare() method and port additional callsites from the old ->mmap() over to ->mmap_prepare() "Fix stale IOTLB entries for kernel address space" (Lu Baolu) Fix a bug (and possible security issue on non-x86) in the IOMMU code. In some situations the IOMMU could be left hanging onto a stale kernel pagetable entry "mm/huge_memory: cleanup __split_unmapped_folio()" (Wei Yang) Clean up and optimize the folio splitting code "mm, swap: misc cleanup and bugfix" (Kairui Song) Some cleanups and a minor fix in the swap discard code "mm/damon: misc documentation fixups" (SeongJae Park) "mm/damon: support pin-point targets removal" (SeongJae Park) Permit userspace to remove a specific monitoring target in the middle of the current targets list "mm: MISC follow-up patches for linux/pgalloc.h" (Harry Yoo) A couple of cleanups related to mm header file inclusion "mm/swapfile.c: select swap devices of default priority round robin" (Baoquan He) improve the selection of swap devices for NUMA machines "mm: Convert memory block states (MEM_) macros to enums" (Israel Batista) Change the memory block labels from macros to enums so they will appear in kernel debug info "ksm: perform a range-walk to jump over holes in break_ksm" (Pedro Demarchi Gomes) Address an inefficiency when KSM unmerges an address range "mm/damon/tests: fix memory bugs in kunit tests" (SeongJae Park) Fix leaks and unhandled malloc() failures in DAMON userspace unit tests "some cleanups for pageout()" (Baolin Wang) Clean up a couple of minor things in the page scanner's writeback-for-eviction code "mm/hugetlb: refactor sysfs/sysctl interfaces" (Hui Zhu) Move hugetlb's sysfs/sysctl handling code into a new file "introduce VM_MAYBE_GUARD and make it sticky" (Lorenzo Stoakes) Make the VMA guard regions available in /proc/pid/smaps and improves the mergeability of guarded VMAs "mm: perform guard region install/remove under VMA lock" (Lorenzo Stoakes) Reduce mmap lock contention for callers performing VMA guard region operations "vma_start_write_killable" (Matthew Wilcox) Start work on permitting applications to be killed when they are waiting on a read_lock on the VMA lock "mm/damon/tests: add more tests for online parameters commit" (SeongJae Park) Add additional userspace testing of DAMON's "commit" feature "mm/damon: misc cleanups" (SeongJae Park) "make VM_SOFTDIRTY a sticky VMA flag" (Lorenzo Stoakes) Address the possible loss of a VMA's VM_SOFTDIRTY flag when that VMA is merged with another "mm: support device-private THP" (Balbir Singh) Introduce support for Transparent Huge Page (THP) migration in zone device-private memory "Optimize folio split in memory failure" (Zi Yan) "mm/huge_memory: Define split_type and consolidate split support checks" (Wei Yang) Some more cleanups in the folio splitting code "mm: remove is_swap_[pte, pmd]() + non-swap entries, introduce leaf entries" (Lorenzo Stoakes) Clean up our handling of pagetable leaf entries by introducing the concept of 'software leaf entries', of type softleaf_t "reparent the THP split queue" (Muchun Song) Reparent the THP split queue to its parent memcg. This is in preparation for addressing the long-standing "dying memcg" problem, wherein dead memcg's linger for too long, consuming memory resources "unify PMD scan results and remove redundant cleanup" (Wei Yang) A little cleanup in the hugepage collapse code "zram: introduce writeback bio batching" (Sergey Senozhatsky) Improve zram writeback efficiency by introducing batched bio writeback support "memcg: cleanup the memcg stats interfaces" (Shakeel Butt) Clean up our handling of the interrupt safety of some memcg stats "make vmalloc gfp flags usage more apparent" (Vishal Moola) Clean up vmalloc's handling of incoming GFP flags "mm: Add soft-dirty and uffd-wp support for RISC-V" (Chunyan Zhang) Teach soft dirty and userfaultfd write protect tracking to use RISC-V's Svrsw60t59b extension "mm: swap: small fixes and comment cleanups" (Youngjun Park) Fix a small bug and clean up some of the swap code "initial work on making VMA flags a bitmap" (Lorenzo Stoakes) Start work on converting the vma struct's flags to a bitmap, so we stop running out of them, especially on 32-bit "mm/swapfile: fix and cleanup swap list iterations" (Youngjun Park) Address a possible bug in the swap discard code and clean things up a little [ This merge also reverts commit `ebb9aeb980` ("vfio/nvgrace-gpu: register device memory for poison handling") because it looks broken to me, I've asked for clarification - Linus ] * tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits) mm: fix vma_start_write_killable() signal handling mm/swapfile: use plist_for_each_entry in __folio_throttle_swaprate mm/swapfile: fix list iteration when next node is removed during discard fs/proc/task_mmu.c: fix make_uffd_wp_huge_pte() huge pte handling mm/kfence: add reboot notifier to disable KFENCE on shutdown memcg: remove inc/dec_lruvec_kmem_state helpers selftests/mm/uffd: initialize char variable to Null mm: fix DEBUG_RODATA_TEST indentation in Kconfig mm: introduce VMA flags bitmap type tools/testing/vma: eliminate dependency on vma->__vm_flags mm: simplify and rename mm flags function for clarity mm: declare VMA flags by bit zram: fix a spelling mistake mm/page_alloc: optimize lowmem_reserve max lookup using its semantic monotonicity mm/vmscan: skip increasing kswapd_failures when reclaim was boosted pagemap: update BUDDY flag documentation mm: swap: remove scan_swap_map_slots() references from comments mm: swap: change swap_alloc_slow() to void mm, swap: remove redundant comment for read_swap_cache_async mm, swap: use SWP_SOLIDSTATE to determine if swap is rotational ...	2025-12-05 13:52:43 -08:00
Linus Torvalds	8f7aa3d3c7	Networking changes for 6.19. Core & protocols ---------------- - Replace busylock at the Tx queuing layer with a lockless list. Resulting in a 300% (4x) improvement on heavy TX workloads, sending twice the number of packets per second, for half the cpu cycles. - Allow constantly busy flows to migrate to a more suitable CPU/NIC queue. Normally we perform queue re-selection when flow comes out of idle, but under extreme circumstances the flows may be constantly busy. Add sysctl to allow periodic rehashing even if it'd risk packet reordering. - Optimize the NAPI skb cache, make it larger, use it in more paths. - Attempt returning Tx skbs to the originating CPU (like we already did for Rx skbs). - Various data structure layout and prefetch optimizations from Eric. - Remove ktime_get() from the recvmsg() fast path, ktime_get() is sadly quite expensive on recent AMD machines. - Extend threaded NAPI polling to allow the kthread busy poll for packets. - Make MPTCP use Rx backlog processing. This lowers the lock pressure, improving the Rx performance. - Support memcg accounting of MPTCP socket memory. - Allow admin to opt sockets out of global protocol memory accounting (using a sysctl or BPF-based policy). The global limits are a poor fit for modern container workloads, where limits are imposed using cgroups. - Improve heuristics for when to kick off AF_UNIX garbage collection. - Allow users to control TCP SACK compression, and default to 33% of RTT. - Add tcp_rcvbuf_low_rtt sysctl to let datacenter users avoid unnecessarily aggressive rcvbuf growth and overshot when the connection RTT is low. - Preserve skb metadata space across skb_push / skb_pull operations. - Support for IPIP encapsulation in the nftables flowtable offload. - Support appending IP interface information to ICMP messages (RFC 5837). - Support setting max record size in TLS (RFC 8449). - Remove taking rtnl_lock from RTM_GETNEIGHTBL and RTM_SETNEIGHTBL. - Use a dedicated lock (and RCU) in MPLS, instead of rtnl_lock. - Let users configure the number of write buffers in SMC. - Add new struct sockaddr_unsized for sockaddr of unknown length, from Kees. - Some conversions away from the crypto_ahash API, from Eric Biggers. - Some preparations for slimming down struct page. - YAML Netlink protocol spec for WireGuard. - Add a tool on top of YAML Netlink specs/lib for reporting commonly computed derived statistics and summarized system state. Driver API ---------- - Add CAN XL support to the CAN Netlink interface. - Add uAPI for reporting PHY Mean Square Error (MSE) diagnostics, as defined by the OPEN Alliance's "Advanced diagnostic features for 100BASE-T1 automotive Ethernet PHYs" specification. - Add DPLL phase-adjust-gran pin attribute (and implement it in zl3073x). - Refactor xfrm_input lock to reduce contention when NIC offloads IPsec and performs RSS. - Add info to devlink params whether the current setting is the default or a user override. Allow resetting back to default. - Add standard device stats for PSP crypto offload. - Leverage DSA frame broadcast to implement simple HSR frame duplication for a lot of switches without dedicated HSR offload. - Add uAPI defines for 1.6Tbps link modes. Device drivers -------------- - Add Motorcomm YT921x gigabit Ethernet switch support. - Add MUCSE driver for N500/N210 1GbE NIC series. - Convert drivers to support dedicated ops for timestamping control, and away from the direct IOCTL handling. While at it support GET operations for PHY timestamping. - Add (and convert most drivers to) a dedicated ethtool callback for reading the Rx ring count. - Significant refactoring efforts in the STMMAC driver, which supports Synopsys turn-key MAC IP integrated into a ton of SoCs. - Ethernet high-speed NICs: - Broadcom (bnxt): - support PPS in/out on all pins - Intel (100G, ice, idpf): - ice: implement standard ethtool and timestamping stats - i40e: support setting the max number of MAC addresses per VF - iavf: support RSS of GTP tunnels for 5G and LTE deployments - nVidia/Mellanox (mlx5): - reduce downtime on interface reconfiguration - disable being an XDP redirect target by default (same as other drivers) to avoid wasting resources if feature is unused - Meta (fbnic): - add support for Linux-managed PCS on 25G, 50G, and 100G links - Wangxun: - support Rx descriptor merge, and Tx head writeback - support Rx coalescing offload - support 25G SPF and 40G QSFP modules - Ethernet virtual: - Google (gve): - allow ethtool to configure rx_buf_len - implement XDP HW RX Timestamping support for DQ descriptor format - Microsoft vNIC (mana): - support HW link state events - handle hardware recovery events when probing the device - Ethernet NICs consumer, and embedded: - usbnet: add support for Byte Queue Limits (BQL) - AMD (amd-xgbe): - add device selftests - NXP (enetc): - add i.MX94 support - Broadcom integrated MACs (bcmgenet, bcmasp): - bcmasp: add support for PHY-based Wake-on-LAN - Broadcom switches (b53): - support port isolation - support BCM5389/97/98 and BCM63XX ARL formats - Lantiq/MaxLinear switches: - support bridge FDB entries on the CPU port - use regmap for register access - allow user to enable/disable learning - support Energy Efficient Ethernet - support configuring RMII clock delays - add tagging driver for MaxLinear GSW1xx switches - Synopsys (stmmac): - support using the HW clock in free running mode - add Eswin EIC7700 support - add Rockchip RK3506 support - add Altera Agilex5 support - Cadence (macb): - cleanup and consolidate descriptor and DMA address handling - add EyeQ5 support - TI: - icssg-prueth: support AF_XDP - Airoha access points: - add missing Ethernet stats and link state callback - add AN7583 support - support out-of-order Tx completion processing - Power over Ethernet: - pd692x0: preserve PSE configuration across reboots - add support for TPS23881B devices - Ethernet PHYs: - Open Alliance OATC14 10BASE-T1S PHY cable diagnostic support - Support 50G SerDes and 100G interfaces in Linux-managed PHYs - micrel: - support for non PTP SKUs of lan8814 - enable in-band auto-negotiation on lan8814 - realtek: - cable testing support on RTL8224 - interrupt support on RTL8221B - motorcomm: support for PHY LEDs on YT853 - microchip: support for LAN867X Rev.D0 PHYs w/ SQI and cable diag - mscc: support for PHY LED control - CAN drivers: - m_can: add support for optional reset and system wake up - remove can_change_mtu() obsoleted by core handling - mcp251xfd: support GPIO controller functionality - Bluetooth: - add initial support for PASTa - WiFi: - split ieee80211.h file, it's way too big - improvements in VHT radiotap reporting, S1G, Channel Switch Announcement handling, rate tracking in mesh networks - improve multi-radio monitor mode support, and add a cfg80211 debugfs interface for it - HT action frame handling on 6 GHz - initial chanctx work towards NAN - MU-MIMO sniffer improvements - WiFi drivers: - RealTek (rtw89): - support USB devices RTL8852AU and RTL8852CU - initial work for RTL8922DE - improved injection support - Intel: - iwlwifi: new sniffer API support - MediaTek (mt76): - WED support for >32-bit DMA - airoha NPU support - regdomain improvements - continued WiFi7/MLO work - Qualcomm/Atheros: - ath10k: factory test support - ath11k: TX power insertion support - ath12k: BSS color change support - ath12k: statistics improvements - brcmfmac: Acer A1 840 tablet quirk - rtl8xxxu: 40 MHz connection fixes/support Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmkveRQACgkQMUZtbf5S IrvY7A/+Nb0o4BxLHjPkAl1m3t3q2d0Y29B7SNkwnwEtxAV8EkNeZ3GWrdtDnTQY MYhmc7LEzvz8/lihapr7UJkcokzSASUV54hbez5jDBKC8EEoyUk8FdWDPerwlcRI zmCFNAVFyh9GX8i7wcrzKbDTHT5+GZLbSlGl9U5mhLsDdRlJgH7d8PJ7vWcmtLFY XN0paDyaeHfCl8wReWNAYx4C/I0ODOvlscpO0tnAKhB0ngJbQCKY2t6tn3rOYdif ZSQ5KwVRnJtQ4fYOFMOy9+FSCjVXtyrxF8KLxD+mqom2ZhmO00UpOMl09tqhq3uT WnvwoHUVBt6F+iITHwg5kMgIDPUq1kpUvL4S4UbVSuUm9ZKD+4KRU2ZHRBYMx+MU bsqmtY8/IULClUoRz+tZhltA8eb0NEqNZE2JPOFDiJHn1YiCCkFwxibhir893oM3 sB7x65D7LQI2ty2BBGVGYnwYDPtyaxOA/s3WTwPvLEi3+Y/TGNIIrS9lBLA4U+Yr Gi93WQGVjttMmVyaHgXBUGmi3L52hvolm0AZ8zSRGrnIEpecjhly2KfYuaOzuxXC IHEQ6AFLdRh6JzafXGb/mQwGCHNmhwsY8A49i94fakWQamaL/L6A+1dyPu4LXMqi NwqCmlVb/LKGlfNG+V4wT27srJ+yBA2Vk3tpR1sZQQytFh0LKHI= =UoDR -----END PGP SIGNATURE----- Merge tag 'net-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Replace busylock at the Tx queuing layer with a lockless list. Resulting in a 300% (4x) improvement on heavy TX workloads, sending twice the number of packets per second, for half the cpu cycles. - Allow constantly busy flows to migrate to a more suitable CPU/NIC queue. Normally we perform queue re-selection when flow comes out of idle, but under extreme circumstances the flows may be constantly busy. Add sysctl to allow periodic rehashing even if it'd risk packet reordering. - Optimize the NAPI skb cache, make it larger, use it in more paths. - Attempt returning Tx skbs to the originating CPU (like we already did for Rx skbs). - Various data structure layout and prefetch optimizations from Eric. - Remove ktime_get() from the recvmsg() fast path, ktime_get() is sadly quite expensive on recent AMD machines. - Extend threaded NAPI polling to allow the kthread busy poll for packets. - Make MPTCP use Rx backlog processing. This lowers the lock pressure, improving the Rx performance. - Support memcg accounting of MPTCP socket memory. - Allow admin to opt sockets out of global protocol memory accounting (using a sysctl or BPF-based policy). The global limits are a poor fit for modern container workloads, where limits are imposed using cgroups. - Improve heuristics for when to kick off AF_UNIX garbage collection. - Allow users to control TCP SACK compression, and default to 33% of RTT. - Add tcp_rcvbuf_low_rtt sysctl to let datacenter users avoid unnecessarily aggressive rcvbuf growth and overshot when the connection RTT is low. - Preserve skb metadata space across skb_push / skb_pull operations. - Support for IPIP encapsulation in the nftables flowtable offload. - Support appending IP interface information to ICMP messages (RFC 5837). - Support setting max record size in TLS (RFC 8449). - Remove taking rtnl_lock from RTM_GETNEIGHTBL and RTM_SETNEIGHTBL. - Use a dedicated lock (and RCU) in MPLS, instead of rtnl_lock. - Let users configure the number of write buffers in SMC. - Add new struct sockaddr_unsized for sockaddr of unknown length, from Kees. - Some conversions away from the crypto_ahash API, from Eric Biggers. - Some preparations for slimming down struct page. - YAML Netlink protocol spec for WireGuard. - Add a tool on top of YAML Netlink specs/lib for reporting commonly computed derived statistics and summarized system state. Driver API: - Add CAN XL support to the CAN Netlink interface. - Add uAPI for reporting PHY Mean Square Error (MSE) diagnostics, as defined by the OPEN Alliance's "Advanced diagnostic features for 100BASE-T1 automotive Ethernet PHYs" specification. - Add DPLL phase-adjust-gran pin attribute (and implement it in zl3073x). - Refactor xfrm_input lock to reduce contention when NIC offloads IPsec and performs RSS. - Add info to devlink params whether the current setting is the default or a user override. Allow resetting back to default. - Add standard device stats for PSP crypto offload. - Leverage DSA frame broadcast to implement simple HSR frame duplication for a lot of switches without dedicated HSR offload. - Add uAPI defines for 1.6Tbps link modes. Device drivers: - Add Motorcomm YT921x gigabit Ethernet switch support. - Add MUCSE driver for N500/N210 1GbE NIC series. - Convert drivers to support dedicated ops for timestamping control, and away from the direct IOCTL handling. While at it support GET operations for PHY timestamping. - Add (and convert most drivers to) a dedicated ethtool callback for reading the Rx ring count. - Significant refactoring efforts in the STMMAC driver, which supports Synopsys turn-key MAC IP integrated into a ton of SoCs. - Ethernet high-speed NICs: - Broadcom (bnxt): - support PPS in/out on all pins - Intel (100G, ice, idpf): - ice: implement standard ethtool and timestamping stats - i40e: support setting the max number of MAC addresses per VF - iavf: support RSS of GTP tunnels for 5G and LTE deployments - nVidia/Mellanox (mlx5): - reduce downtime on interface reconfiguration - disable being an XDP redirect target by default (same as other drivers) to avoid wasting resources if feature is unused - Meta (fbnic): - add support for Linux-managed PCS on 25G, 50G, and 100G links - Wangxun: - support Rx descriptor merge, and Tx head writeback - support Rx coalescing offload - support 25G SPF and 40G QSFP modules - Ethernet virtual: - Google (gve): - allow ethtool to configure rx_buf_len - implement XDP HW RX Timestamping support for DQ descriptor format - Microsoft vNIC (mana): - support HW link state events - handle hardware recovery events when probing the device - Ethernet NICs consumer, and embedded: - usbnet: add support for Byte Queue Limits (BQL) - AMD (amd-xgbe): - add device selftests - NXP (enetc): - add i.MX94 support - Broadcom integrated MACs (bcmgenet, bcmasp): - bcmasp: add support for PHY-based Wake-on-LAN - Broadcom switches (b53): - support port isolation - support BCM5389/97/98 and BCM63XX ARL formats - Lantiq/MaxLinear switches: - support bridge FDB entries on the CPU port - use regmap for register access - allow user to enable/disable learning - support Energy Efficient Ethernet - support configuring RMII clock delays - add tagging driver for MaxLinear GSW1xx switches - Synopsys (stmmac): - support using the HW clock in free running mode - add Eswin EIC7700 support - add Rockchip RK3506 support - add Altera Agilex5 support - Cadence (macb): - cleanup and consolidate descriptor and DMA address handling - add EyeQ5 support - TI: - icssg-prueth: support AF_XDP - Airoha access points: - add missing Ethernet stats and link state callback - add AN7583 support - support out-of-order Tx completion processing - Power over Ethernet: - pd692x0: preserve PSE configuration across reboots - add support for TPS23881B devices - Ethernet PHYs: - Open Alliance OATC14 10BASE-T1S PHY cable diagnostic support - Support 50G SerDes and 100G interfaces in Linux-managed PHYs - micrel: - support for non PTP SKUs of lan8814 - enable in-band auto-negotiation on lan8814 - realtek: - cable testing support on RTL8224 - interrupt support on RTL8221B - motorcomm: support for PHY LEDs on YT853 - microchip: support for LAN867X Rev.D0 PHYs w/ SQI and cable diag - mscc: support for PHY LED control - CAN drivers: - m_can: add support for optional reset and system wake up - remove can_change_mtu() obsoleted by core handling - mcp251xfd: support GPIO controller functionality - Bluetooth: - add initial support for PASTa - WiFi: - split ieee80211.h file, it's way too big - improvements in VHT radiotap reporting, S1G, Channel Switch Announcement handling, rate tracking in mesh networks - improve multi-radio monitor mode support, and add a cfg80211 debugfs interface for it - HT action frame handling on 6 GHz - initial chanctx work towards NAN - MU-MIMO sniffer improvements - WiFi drivers: - RealTek (rtw89): - support USB devices RTL8852AU and RTL8852CU - initial work for RTL8922DE - improved injection support - Intel: - iwlwifi: new sniffer API support - MediaTek (mt76): - WED support for >32-bit DMA - airoha NPU support - regdomain improvements - continued WiFi7/MLO work - Qualcomm/Atheros: - ath10k: factory test support - ath11k: TX power insertion support - ath12k: BSS color change support - ath12k: statistics improvements - brcmfmac: Acer A1 840 tablet quirk - rtl8xxxu: 40 MHz connection fixes/support" * tag 'net-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1381 commits) net: page_pool: sanitise allocation order net: page pool: xa init with destroy on pp init net/mlx5e: Support XDP target xmit with dummy program net/mlx5e: Update XDP features in switch channels selftests/tc-testing: Test CAKE scheduler when enqueue drops packets net/sched: sch_cake: Fix incorrect qlen reduction in cake_drop wireguard: netlink: generate netlink code wireguard: uapi: generate header with ynl-gen wireguard: uapi: move flag enums wireguard: uapi: move enum wg_cmd wireguard: netlink: add YNL specification selftests: drv-net: Fix tolerance calculation in devlink_rate_tc_bw.py selftests: drv-net: Fix and clarify TC bandwidth split in devlink_rate_tc_bw.py selftests: drv-net: Set shell=True for sysfs writes in devlink_rate_tc_bw.py selftests: drv-net: Use Iperf3Runner in devlink_rate_tc_bw.py selftests: drv-net: introduce Iperf3Runner for measurement use cases selftests: drv-net: Add devlink_rate_tc_bw.py to TEST_PROGS net: ps3_gelic_net: Use napi_alloc_skb() and napi_gro_receive() Documentation: net: dsa: mention simple HSR offload helpers Documentation: net: dsa: mention availability of RedBox ...	2025-12-03 17:24:33 -08:00
Linus Torvalds	015e7b0b0e	bpf-next-6.19 -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmktzC4ACgkQ6rmadz2v bTpA1w/+PZ45N3y6O+NQVIpBlpnHG7DEMK7Lw19On0xVLwH+XPHz6J5PEfzjyJR1 SCbsV30qkJ1YCtgRHHf+ZCuWPWm58hY8dXYwSDyjNavdQyVGOdf17aBu9pvH45NW K20OhwQHpCHWIfDlijjPkDdiHnYf5S7Xy6ctt/3ztF0pMDHIaghGxJymG4wULcDT iLKnT37kwO8b2ihmw/HbcZPQYMWfHRye7X009K+wCv0dnhJ6q/Ny1m+Pg4kF92e6 ON/RY26ep2dq7LpaNWa1rI1yOgFlI7uUlVojqrAuAb+xrg+64wUDBxeijvE37EN1 s/+PuEKAR6xwz1dbY2cWAI0D633saz24UdV6kCBW9HrjHKVRQ7ZSsBF9ENkS4DTK nowx4wOe1ZHc/6YgTktZp9LEn/0YrmQtFxjqEAJiYUgD18FrBrSjmhHpBiL+HghP sTqy41qDQGoKtg3bRu42Co9wmNeeLsnxT8NQExCmTYQ4ufpdA/VMQux9cBVX3GBq EchJb465+AcvvCJUiKbnHLxDsHCQz1YYytz3RqyFLgGDFZnHOE0FjwPJmM8I5kkK gvDB3ZYdO3Halm8BZfZZBnKv5uK7myuAWwqRLgMRanuZcRgmIV1oUP5EP88HdH75 fB20vZSVcfzB17SLyhiM20ivEWodJa9VCLEw9WDOmDoml+33Pks= =kaCJ -----END PGP SIGNATURE----- Merge tag 'bpf-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Pull bpf updates from Alexei Starovoitov: - Convert selftests/bpf/test_tc_edt and test_tc_tunnel from .sh to test_progs runner (Alexis Lothoré) - Convert selftests/bpf/test_xsk to test_progs runner (Bastien Curutchet) - Replace bpf memory allocator with kmalloc_nolock() in bpf_local_storage (Amery Hung), and in bpf streams and range tree (Puranjay Mohan) - Introduce support for indirect jumps in BPF verifier and x86 JIT (Anton Protopopov) and arm64 JIT (Puranjay Mohan) - Remove runqslower bpf tool (Hoyeon Lee) - Fix corner cases in the verifier to close several syzbot reports (Eduard Zingerman, KaFai Wan) - Several improvements in deadlock detection in rqspinlock (Kumar Kartikeya Dwivedi) - Implement "jmp" mode for BPF trampoline and corresponding DYNAMIC_FTRACE_WITH_JMP. It improves "fexit" program type performance from 80 M/s to 136 M/s. With Steven's Ack. (Menglong Dong) - Add ability to test non-linear skbs in BPF_PROG_TEST_RUN (Paul Chaignon) - Do not let BPF_PROG_TEST_RUN emit invalid GSO types to stack (Daniel Borkmann) - Generalize buildid reader into bpf_dynptr (Mykyta Yatsenko) - Optimize bpf_map_update_elem() for map-in-map types (Ritesh Oedayrajsingh Varma) - Introduce overwrite mode for BPF ring buffer (Xu Kuohai) * tag 'bpf-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (169 commits) bpf: optimize bpf_map_update_elem() for map-in-map types bpf: make kprobe_multi_link_prog_run always_inline selftests/bpf: do not hardcode target rate in test_tc_edt BPF program selftests/bpf: remove test_tc_edt.sh selftests/bpf: integrate test_tc_edt into test_progs selftests/bpf: rename test_tc_edt.bpf.c section to expose program type selftests/bpf: Add success stats to rqspinlock stress test rqspinlock: Precede non-head waiter queueing with AA check rqspinlock: Disable spinning for trylock fallback rqspinlock: Use trylock fallback when per-CPU rqnode is busy rqspinlock: Perform AA checks immediately rqspinlock: Enclose lock/unlock within lock entry acquisitions bpf: Remove runqslower tool selftests/bpf: Remove usage of lsm/file_alloc_security in selftest bpf: Disable file_alloc_security hook bpf: check for insn arrays in check_ptr_alignment bpf: force BPF_F_RDONLY_PROG on insn array creation bpf: Fix exclusive map memory leak selftests/bpf: Make CS length configurable for rqspinlock stress test selftests/bpf: Add lock wait time stats to rqspinlock stress test ...	2025-12-03 16:54:54 -08:00
Linus Torvalds	7f8d5f70ff	Tree wide cleanup of the remaining users of in_irq() which got replaced by in_hardirq() and marked deprecated in 2020. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmkvDhUTHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoW+XD/959KAIm2JpcEYUWuNBmlhEyuYWvPLw ZyOiraLYBNyWmfCO/Yz4Ff8VZSR9gdWQoNfvBb8uxkbSXa0UOEUhCbzWsuoTnqR5 ObTIHCJ9QmPlRiFDvs4Sf5TGmy/4nXh6/PoH3JykNdlD3rZMTxiAz/k6QuO/S2iu ykA+DNtNL7jDkQHzrWa3rf597BkBN1Z+hUD8zHRt8LYKRfmLYWjCMggjPLMnuqcn 240fnV/FubCLd9f5ZgNxHQMQCQH2qB7GYMk08YwXwCZQqIIXWqbNnhedkkNO3kWq Sws4TEO6yg9pgTFqkuiDU5QgYEboRY4pDT45KSkdTHHGZl2OAAl3eVIGCto72UEI Eyzn4k900hZ1iI/Rad5mx3D4XJZEXFgEbXhjph0odn6jVvmSj+Fmg3J67u1niO2a obzB+xeaIkbGNQIgJFy8+A9SSnZckvuPlXdZdUxS2S95zH7f9+vBY8HWJMuyursa 3AJAKa82mN1i3A9FdSuMTdttQWkDmrwPKVzxvixs1mBu7kB70XaRIKsPjZj7LH6X CiqP9Kt5FO0hVA7K+nKTeUA5DdjB4HzYzOgMqzFUhExY3hksVsj8rQEO6B0bCp9t CfITA3BvU7GXxhXZHOq3dABQ21J/ZHgeuK3QdQSnOxSQOv2ElYIdKvYirJy2QdS1 tSM3O3GXb4zWDg== =6LKf -----END PGP SIGNATURE----- Merge tag 'core-core-2025-12-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core irq cleanup from Thomas Gleixner: "Tree wide cleanup of the remaining users of in_irq() which got replaced by in_hardirq() and marked deprecated in 2020" * tag 'core-core-2025-12-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: treewide: Remove in_irq()	2025-12-02 10:18:49 -08:00
Linus Torvalds	6c26fbe8c9	Performance events changes for v6.19: Callchain support: - Add support for deferred user-space stack unwinding for perf, enabled on x86. (Peter Zijlstra, Steven Rostedt) - unwind_user/x86: Enable frame pointer unwinding on x86 (Josh Poimboeuf) x86 PMU support and infrastructure: - x86/insn: Simplify for_each_insn_prefix() (Peter Zijlstra) - x86/insn,uprobes,alternative: Unify insn_is_nop() (Peter Zijlstra) Intel PMU driver: - Large series to prepare for and implement architectural PEBS support for Intel platforms such as Clearwater Forest (CWF) and Panther Lake (PTL). (Dapeng Mi, Kan Liang) - Check dynamic constraints (Kan Liang) - Optimize PEBS extended config (Peter Zijlstra) - cstates: Remove PC3 support from LunarLake (Zhang Rui) - cstates: Add Pantherlake support (Zhang Rui) - cstates: Clearwater Forest support (Zide Chen) AMD PMU driver: - x86/amd: Check event before enable to avoid GPF (George Kennedy) Fixes and cleanups: - task_work: Fix NMI race condition (Peter Zijlstra) - perf/x86: Fix NULL event access and potential PEBS record loss (Dapeng Mi) - Misc other fixes and cleanups. (Dapeng Mi, Ingo Molnar, Peter Zijlstra) Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmktcU0RHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1gNKw//ThLmbkoGJ0/yLOdEcW8rA/7HB43Oz6j9 k0Vs7zwDBMRFP4zQg2XeF5SH7CWS9p/nI3eMhorgmH77oJCvXJxVtD5991zmlZhf eafOar5ZMVaoMz+tK8WWiENZyuN0bt0mumZmz9svXR3KV1S/q18XZ8bCas0itwnq D0T3Gqi/Z39gJIy7bHNgLoFY2zvI9b2EJNDKlzHk3NJ7UamA4GuMHN0cM2dIzKGK 2L+wXOe2BH9YYzYrz/cdKq7sBMjOvFsCQ/5jh23A2Yu6JI4nJbw0WmexZRK1OWCp GAdMjBuqIShibLRxK746WRO9iut49uTsah4iSG80hXzhpwf7VaegOarost1nLaqm zweIOr3iwJRf273r6IqRuaporVHpQYMj2w2H63z36sQtGtkKHNyxZ50b6bqpwwjU LikLEJ9Bmh3mlvlXsOx2wX6dTb1fUk+cy2ezCDKUHqOLjqy4dM8V+jYhuRO4yxXz mj9aHZKgyuREt8yo/3nLqAzF5Okj9cXp7H6F1hCKWuCoAhNXkrvYcvbg8h6aRxOX 2vGhMYjpElkl/DG6OWCSwuqCt9nVEC/dazW9fKQjh4S0CFOVopaMGSkGcS/xUPub 92J4XMDEJX4RJ6dfspeQr97+1fETXEIWNv4WbKnDjqJlAucU1gnOTprVnAYUjcWw 74320FjGN1E= =/8GE -----END PGP SIGNATURE----- Merge tag 'perf-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull performance events updates from Ingo Molnar: "Callchain support: - Add support for deferred user-space stack unwinding for perf, enabled on x86. (Peter Zijlstra, Steven Rostedt) - unwind_user/x86: Enable frame pointer unwinding on x86 (Josh Poimboeuf) x86 PMU support and infrastructure: - x86/insn: Simplify for_each_insn_prefix() (Peter Zijlstra) - x86/insn,uprobes,alternative: Unify insn_is_nop() (Peter Zijlstra) Intel PMU driver: - Large series to prepare for and implement architectural PEBS support for Intel platforms such as Clearwater Forest (CWF) and Panther Lake (PTL). (Dapeng Mi, Kan Liang) - Check dynamic constraints (Kan Liang) - Optimize PEBS extended config (Peter Zijlstra) - cstates: - Remove PC3 support from LunarLake (Zhang Rui) - Add Pantherlake support (Zhang Rui) - Clearwater Forest support (Zide Chen) AMD PMU driver: - x86/amd: Check event before enable to avoid GPF (George Kennedy) Fixes and cleanups: - task_work: Fix NMI race condition (Peter Zijlstra) - perf/x86: Fix NULL event access and potential PEBS record loss (Dapeng Mi) - Misc other fixes and cleanups (Dapeng Mi, Ingo Molnar, Peter Zijlstra)" * tag 'perf-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits) perf/x86/intel: Fix and clean up intel_pmu_drain_arch_pebs() type use perf/x86/intel: Optimize PEBS extended config perf/x86/intel: Check PEBS dyn_constraints perf/x86/intel: Add a check for dynamic constraints perf/x86/intel: Add counter group support for arch-PEBS perf/x86/intel: Setup PEBS data configuration and enable legacy groups perf/x86/intel: Update dyn_constraint base on PEBS event precise level perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR perf/x86/intel: Process arch-PEBS records or record fragments perf/x86/intel/ds: Factor out PEBS group processing code to functions perf/x86/intel/ds: Factor out PEBS record processing code to functions perf/x86/intel: Initialize architectural PEBS perf/x86/intel: Correct large PEBS flag check perf/x86/intel: Replace x86_pmu.drain_pebs calling with static call perf/x86: Fix NULL event access and potential PEBS record loss perf/x86: Remove redundant is_x86_event() prototype entry,unwind/deferred: Fix unwind_reset_info() placement unwind_user/x86: Fix arch=um build perf: Support deferred user unwind unwind_user/x86: Teach FP unwind about start of function ...	2025-12-01 20:42:01 -08:00
Linus Torvalds	1b5dd29869	vfs-6.19-rc1.fd_prepare.fs -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZwAKCRCRxhvAZXjc op0AAP4oNVJkFyvgKoPos5K2EGFB8M7merGhpYtsOoeg8UK6OwD/UySQErHsXQDR sUDDa5uFOhfrkcfM8REtAN4wF8p5qAc= =QgFD -----END PGP SIGNATURE----- Merge tag 'vfs-6.19-rc1.fd_prepare.fs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull fd prepare updates from Christian Brauner: "This adds the FD_ADD() and FD_PREPARE() primitive. They simplify the common pattern of get_unused_fd_flags() + create file + fd_install() that is used extensively throughout the kernel and currently requires cumbersome cleanup paths. FD_ADD() - For simple cases where a file is installed immediately: fd = FD_ADD(O_CLOEXEC, vfio_device_open_file(device)); if (fd < 0) vfio_device_put_registration(device); return fd; FD_PREPARE() - For cases requiring access to the fd or file, or additional work before publishing: FD_PREPARE(fdf, O_CLOEXEC, sync_file->file); if (fdf.err) { fput(sync_file->file); return fdf.err; } data.fence = fd_prepare_fd(fdf); if (copy_to_user((void __user )arg, &data, sizeof(data))) return -EFAULT; return fd_publish(fdf); The primitives are centered around struct fd_prepare. FD_PREPARE() encapsulates all allocation and cleanup logic and must be followed by a call to fd_publish() which associates the fd with the file and installs it into the caller's fdtable. If fd_publish() isn't called, both are deallocated automatically. FD_ADD() is a shorthand that does fd_publish() immediately and never exposes the struct to the caller. I've implemented this in a way that it's compatible with the cleanup infrastructure while also being usable separately. IOW, it's centered around struct fd_prepare which is aliased to class_fd_prepare_t and so we can make use of all the basica guard infrastructure" tag 'vfs-6.19-rc1.fd_prepare.fs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (42 commits) io_uring: convert io_create_mock_file() to FD_PREPARE() file: convert replace_fd() to FD_PREPARE() vfio: convert vfio_group_ioctl_get_device_fd() to FD_ADD() tty: convert ptm_open_peer() to FD_ADD() ntsync: convert ntsync_obj_get_fd() to FD_PREPARE() media: convert media_request_alloc() to FD_PREPARE() hv: convert mshv_ioctl_create_partition() to FD_ADD() gpio: convert linehandle_create() to FD_PREPARE() pseries: port papr_rtas_setup_file_interface() to FD_ADD() pseries: convert papr_platform_dump_create_handle() to FD_ADD() spufs: convert spufs_gang_open() to FD_PREPARE() papr-hvpipe: convert papr_hvpipe_dev_create_handle() to FD_PREPARE() spufs: convert spufs_context_open() to FD_PREPARE() net/socket: convert __sys_accept4_file() to FD_ADD() net/socket: convert sock_map_fd() to FD_ADD() net/kcm: convert kcm_ioctl() to FD_PREPARE() net/handshake: convert handshake_nl_accept_doit() to FD_PREPARE() secretmem: convert memfd_secret() to FD_ADD() memfd: convert memfd_create() to FD_ADD() bpf: convert bpf_token_create() to FD_PREPARE() ...	2025-12-01 17:32:07 -08:00
Ritesh Oedayrajsingh Varma	ff34657aa7	bpf: optimize bpf_map_update_elem() for map-in-map types Updating a BPF_MAP_TYPE_HASH_OF_MAPS or BPF_MAP_TYPE_ARRAY_OF_MAPS via bpf_map_update_elem() is very expensive. In one of our workloads, we're inserting ~1400 maps of type BPF_MAP_TYPE_ARRAY into a BPF_MAP_TYPE_ARRAY_OF_MAPS. This takes ~21 seconds on a single thread, with an average of ~15ms per call: Function Name: map_update_elem Number of calls: 1369 Total time: 21s 182ms 966µs Maximum: 47ms 937µs Average: 15ms 473µs Minimum: 7µs Profiling shows that nearly all of this time is going to synchronize_rcu(), via maybe_wait_bpf_programs() in map_update_elem(). The call to synchronize_rcu() is done to ensure that after bpf_map_update_elem() returns, no BPF programs are still looking at the old value of the map, per commit `1ae80cf319` ("bpf: wait for running BPF programs when updating map-in-map"). As discussed on the bpf mailing list, replace synchronize_rcu() with synchronize_rcu_expedited(). This is 175x faster: it now takes an average of 88 microseconds per call, for a total of 127 milliseconds in the same benchmark: Function Name: map_update_elem Number of calls: 1439 Total time: 127ms 626µs Maximum: 445µs Average: 88µs Minimum: 10µs Link: https://lore.kernel.org/bpf/CAH6OuBR=w2kybK6u7aH_35B=Bo1PCukeMZefR=7V4Z2tJNK--Q@mail.gmail.com/ Signed-off-by: Ritesh Oedayrajsingh Varma <ritesh@superluminal.eu> Link: https://lore.kernel.org/r/20251128000422.20462-1-ritesh@superluminal.eu Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-29 09:48:41 -08:00
Kumar Kartikeya Dwivedi	087849cca3	rqspinlock: Precede non-head waiter queueing with AA check While previous commits sufficiently address the deadlocks, there are still scenarios where queueing of waiters in NMIs can exacerbate the possibility of timeouts. Consider the case below: CPU 0 <NMI> res_spin_lock(A) -> becomes non-head waiter </NMI> lock owner in CS or pending waiter spinning CPU 1 res_spin_lock(A) -> head waiter spinning on owner/pending bits In such a scenario, the non-head waiter in NMI on CPU 0 will not poll for deadlocks or timeout since it will simply queue behind previous waiter (head on CPU 1), and also not enter the trylock fallback since no rqspinlock queue waiter is active on CPU 0. In such a scenario, the transaction initiated by the head waiter on CPU 1 will timeout, signalling the NMI and ending the cyclic dependency, but it will cost 250 ms of time. Instead, the NMI on CPU 0 could simply check for the presence of an AA deadlock and only proceed with queueing on success. Add such a check right before any form of queueing is initiated. The reason the AA deadlock check is not used in conjunction with in_nmi() is that a similar case could occur due to a reentrant path in the owner's critical section, and unconditionally checking for AA before entering the queueing path avoids expensive timeouts. Non-NMI reentrancy only happens at controlled points in the slow path (with specific tracepoints which do not impede the forward progress of a waiter loop), or in the owner CS, while NMIs can land anywhere. While this check is only needed for non-head waiter queueing, checking whether we are head or not is racy without xchg_tail, and after that point, we are already queued, hence for simplicity we must invoke the check unconditionally. Note that a more contrived case could still be constructed by using two locks, and interrupting the progress of the respective owners by non-head waiters of the other lock, in an ABBA fashion, which would still not be covered by the current set of checks and conditions. It would still lead to a timeout though, and not a deadlock. An ABBA check cannot happen optimistically before the queueing, since it can be racy, and needs to be happen continuously during the waiting period, which would then require an unlinking step for queued NMI/reentrant waiters. This is beyond the scope of this patch. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20251128232802.1031906-6-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-29 09:35:36 -08:00
Kumar Kartikeya Dwivedi	30dc2f7025	rqspinlock: Disable spinning for trylock fallback The original trylock fallback was inherited from qspinlock, and then reused for the reentrant NMIs while the slow path is active. However, under contention, it is very unlikely for the trylock to succeed in taking the lock. In addition, a trylock also has no fairness guarantees, and thus is prone to starvation issues under extreme scenarios. The original qspinlock had no choice in terms of returning an error the caller; if the node count was breached, it had to fall back to trylock to attempt to take the lock. In case of rqspinlock, we do have the option of returning to the user. Thus, simply attempt the trylock once, and instead of spinning, return an error in case the lock cannot be taken. This ends up significantly reducing the time spent in the trylock fallback, since we no longer wait for the timeout duration trying to aimlessly acquire the lock when there's a high-probability that under contention, it won't be available to us anyway. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20251128232802.1031906-5-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-29 09:35:36 -08:00
Kumar Kartikeya Dwivedi	81d5a6a438	rqspinlock: Use trylock fallback when per-CPU rqnode is busy In addition to deferring to the trylock fallback in NMIs, only do so when an rqspinlock waiter is queued on the current CPU. This is detected by noticing a non-zero node index. This allows NMI waiters to join the waiter queue if it isn't interrupting an existing rqspinlock waiter, and increase the chances of fairly obtaining the lock, performing deadlock detection as the head, and not being starved while attempting the trylock. The trylock path in particular is unlikely to succeed under contention, as it relies on the lock word becoming 0, which indicates no contention. This means that the most likely result for NMIs attempting a trylock is a timeout under contention if they don't hit an AA or ABBA case. The core problem being addressed through the fixed commit was removing the dependency edge between an NMI queue waiter and the queue waiter it is interrupting. Whenever a circular dependency forms, and with no way to break it (as non-head waiters don't poll for deadlocks or timeouts), we would enter into a deadlock. A trylock either breaks such an edge by probing for deadlocks, and finally terminating the waiting loop using a timeout. By excluding queueing on CPUs where the node index is non-zero for NMIs, this sort of dependency is broken. The CPU enters the trylock path for those cases, and falls back to deadlock checks and timeouts. However, in other case where it doesn't interrupt the CPU in the slow path while its queued on the lock, it can join the queue as a normal waiter, and avoid trylock associated starvation and subsequent timeouts. There are a few remaining cases here that matter: the NMI can still preempt the owner in its critical section, and if it queues as a non-head waiter, it can end up impeding the progress of the owner. While this won't deadlock, since the head waiter will eventually signal the NMI waiter to either stop (due to a timeout), it can still lead to long timeouts. These gaps will be addressed in subsequent commits. Note that while the node count detection approach is less conservative than simply deferring NMIs to trylock, it is going to return errors where attempts to lock B in NMI happen while waiters for lock A are in a lower context on the same CPU. However, this only occurs when the lower context is queued in the slow path, and the NMI attempt can proceed without failure in all other cases. To continue to prevent AA deadlocks (or ABBA in a similar NMI interrupting lower context pattern), we'd need a more fleshed out algorithm to unlink NMI waiters after they queue and detect such cases. However, all that complexity isn't appealing yet to reduce the failure rate in the small window inside the slow path. It is important to note that reentrancy in the slow path can also happen through trace_contention_{begin,end}, but in those cases, unlike an NMI, the forward progress of the head waiter (or the predecessor in general) is not being blocked. Fixes: `0d80e7f951` ("rqspinlock: Choose trylock fallback for NMI waiters") Reported-by: Ritesh Oedayrajsingh Varma <ritesh@superluminal.eu> Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20251128232802.1031906-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-29 09:35:35 -08:00
Kumar Kartikeya Dwivedi	5860f5ce47	rqspinlock: Perform AA checks immediately Currently, while we enter the check_timeout call immediately due to the way the ts.spin is initialized, we still invoke the AA and ABBA checks in the second invocation, and only initialize the timestamp in the first one. Since each iteration is at least done with a 1ms delay, this can add delays in detection of AA deadlocks, up to a ms. Rework check_timeout() to avoid this. First, call check_deadlock_AA() while initializing the timestamps for the wait period. This also means that we only do it once per waiting period, instead of every invocation. Finally, drop check_deadlock() and call check_deadlock_ABBA() directly. To save on unnecessary ktime_get_mono_fast_ns() in case of AA deadlock, sample the time only if it returns 0. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20251128232802.1031906-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-29 09:35:35 -08:00
Kumar Kartikeya Dwivedi	beb7021a60	rqspinlock: Enclose lock/unlock within lock entry acquisitions Ritesh reported that timeouts occurred frequently for rqspinlock despite reentrancy on the same lock on the same CPU in [0]. This patch closes one of the races leading to this behavior, and reduces the frequency of timeouts. We currently have a tiny window between the fast-path cmpxchg and the grabbing of the lock entry where an NMI could land, attempt the same lock that was just acquired, and end up timing out. This is not ideal. Instead, move the lock entry acquisition from the fast path to before the cmpxchg, and remove the grabbing of the lock entry in the slow path, assuming it was already taken by the fast path. The TAS fallback is invoked directly without being preceded by the typical fast path, therefore we must continue to grab the deadlock detection entry in that case. Case on lock leading to missed AA: cmpxchg lock A <NMI> ... rqspinlock acquisition of A ... timeout </NMI> grab_held_lock_entry(A) There is a similar case when unlocking the lock. If the NMI lands between the WRITE_ONCE and smp_store_release, it is possible that we end up in a situation where the NMI fails to diagnose the AA condition, leading to a timeout. Case on unlock leading to missed AA: WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL) <NMI> ... rqspinlock acquisition of A ... timeout </NMI> smp_store_release(A->locked, 0) The patch changes the order on unlock to smp_store_release() succeeded by WRITE_ONCE() of NULL. This avoids the missed AA detection described above, but may lead to a false positive if the NMI lands between these two statements, which is acceptable (and preferred over a timeout). The original intention of the reverse order on unlock was to prevent the following possible misdiagnosis of an ABBA scenario: grab entry A lock A grab entry B lock B unlock B smp_store_release(B->locked, 0) grab entry B lock B grab entry A lock A ! <detect ABBA> WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL) If the store release were is after the WRITE_ONCE, the other CPU would not observe B in the table of the CPU unlocking the lock B. However, since the threads are obviously participating in an ABBA deadlock, it is no longer appealing to use the order above since it may lead to a 250 ms timeout due to missed AA detection. [0]: https://lore.kernel.org/bpf/CAH6OuBTjG+N=+GGwcpOUbeDN563oz4iVcU3rbse68egp9wj9_A@mail.gmail.com Fixes: `0d80e7f951` ("rqspinlock: Choose trylock fallback for NMI waiters") Reported-by: Ritesh Oedayrajsingh Varma <ritesh@superluminal.eu> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20251128232802.1031906-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-29 09:35:35 -08:00
Amery Hung	b4bf1d23dc	bpf: Disable file_alloc_security hook A use-after-free bug may be triggered by calling bpf_inode_storage_get() in a BPF LSM program hooked to file_alloc_security. Disable the hook to prevent this from happening. The cause of the bug is shown in the trace below. In alloc_file(), a file struct is first allocated through kmem_cache_alloc(). Then, file_alloc_security hook is invoked. Since the zero initialization or assignment of f->f_inode happen after this LSM hook, a BPF program may get a dangeld inode pointer by walking the file struct. alloc_file() -> alloc_empty_file() -> f = kmem_cache_alloc() -> init_file() -> security_file_alloc() // f->f_inode not init-ed yet! -> f->f_inode = NULL; -> file_init_path() -> f->f_inode = path->dentry->d_inode Reported-by: Kaiyan Mei <M202472210@hust.edu.cn> Reported-by: Yinhao Hu <dddddd@hust.edu.cn> Reported-by: Dongliang Mu <dzm91@hust.edu.cn> Closes: https://lore.kernel.org/bpf/1d2d1968.47cd3.19ab9528e94.Coremail.kaiyanm@hust.edu.cn/ Signed-off-by: Amery Hung <ameryhung@gmail.com> Link: https://lore.kernel.org/r/20251126202927.2584874-1-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-28 15:18:28 -08:00
Anton Protopopov	e3ea26add6	bpf: check for insn arrays in check_ptr_alignment Do not abuse the strict_alignment_once flag, and check if the map is an instruction array inside the check_ptr_alignment() function. Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Link: https://lore.kernel.org/r/20251128063224.1305482-3-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-28 15:15:43 -08:00
Anton Protopopov	7feff23cdf	bpf: force BPF_F_RDONLY_PROG on insn array creation The original implementation added a hack to check_mem_access() to prevent programs from writing into insn arrays. To get rid of this hack, enforce BPF_F_RDONLY_PROG on map creation. Also fix the corresponding selftest, as the error message changes with this patch. Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Link: https://lore.kernel.org/r/20251128063224.1305482-2-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-28 15:15:43 -08:00
Christian Brauner	981bec8f69	bpf: convert bpf_token_create() to FD_PREPARE() Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-24-b6efa1706cfd@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-28 12:42:33 +01:00
Christian Brauner	798c2da490	bpf: convert bpf_iter_new_fd() to FD_PREPARE() Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-23-b6efa1706cfd@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-28 12:42:33 +01:00
Edward Adam Davis	688b745401	bpf: Fix exclusive map memory leak When excl_prog_hash is 0 and excl_prog_hash_size is non-zero, the map also needs to be freed. Otherwise, the map memory will not be reclaimed, just like the memory leak problem reported by syzbot [1]. syzbot reported: BUG: memory leak backtrace (crc 7b9fb9b4): map_create+0x322/0x11e0 kernel/bpf/syscall.c:1512 __sys_bpf+0x3556/0x3610 kernel/bpf/syscall.c:6131 Fixes: `baefdbdf68` ("bpf: Implement exclusive map creation") Reported-by: syzbot+cf08c551fecea9fd1320@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=cf08c551fecea9fd1320 Tested-by: syzbot+cf08c551fecea9fd1320@syzkaller.appspotmail.com Signed-off-by: Edward Adam Davis <eadavis@qq.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/tencent_3F226F882CE56DCC94ACE90EED1ECCFC780A@qq.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-26 11:23:27 -08:00
Leon Hwang	8f6ddc0587	bpf: Introduce internal bpf_map_check_op_flags helper function It is to unify map flags checking for lookup_elem, update_elem, lookup_batch and update_batch APIs. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20251125145857.98134-2-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-25 15:27:48 -08:00
Menglong Dong	402e44b31e	bpf: implement "jmp" mode for trampoline Implement the "jmp" mode for the bpf trampoline. For the ftrace_managed case, we need only to set the FTRACE_OPS_FL_JMP on the tr->fops if "jmp" is needed. For the bpf poke case, we will check the origin poke type with the "origin_flags", and current poke type with "tr->flags". The function bpf_trampoline_update_fentry() is introduced to do the job. The "jmp" mode will only be enabled with CONFIG_DYNAMIC_FTRACE_WITH_JMP enabled and BPF_TRAMP_F_SHARE_IPMODIFY is not set. With BPF_TRAMP_F_SHARE_IPMODIFY, we need to get the origin call ip from the stack, so we can't use the "jmp" mode. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20251118123639.688444-7-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-24 09:47:04 -08:00
Menglong Dong	ae4a3160d1	bpf: specify the old and new poke_type for bpf_arch_text_poke In the origin logic, the bpf_arch_text_poke() assume that the old and new instructions have the same opcode. However, they can have different opcode if we want to replace a "call" insn with a "jmp" insn. Therefore, add the new function parameter "old_t" along with the "new_t", which are used to indicate the old and new poke type. Meanwhile, adjust the implement of bpf_arch_text_poke() for all the archs. "BPF_MOD_NOP" is added to make the code more readable. In bpf_arch_text_poke(), we still check if the new and old address is NULL to determine if nop insn should be used, which I think is more safe. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Link: https://lore.kernel.org/r/20251118123639.688444-6-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-24 09:47:03 -08:00
Anton Protopopov	fad804002e	bpf: cleanup aux->used_maps after jit In commit `b4ce5923e7` ("bpf, x86: add new map type: instructions array") env->used_map was copied to func[i]->aux->used_maps before jitting. Clear these fields out after jitting such that pointer to freed memory (env->used_maps is freed later) are not kept in a live data structure. The reason why the copies were initially added is explained in https://lore.kernel.org/bpf/20251105090410.1250500-1-a.s.protopopov@gmail.com Suggested-by: Alexei Starovoitov <ast@kernel.org> Fixes: `b4ce5923e7` ("bpf, x86: add new map type: instructions array") Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Link: https://lore.kernel.org/r/20251124151515.2543403-1-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-24 09:39:55 -08:00
Puranjay Mohan	4167096cb9	bpf: support nested rcu critical sections Currently, nested rcu critical sections are rejected by the verifier and rcu_lock state is managed by a boolean variable. Add support for nested rcu critical sections by make active_rcu_locks a counter similar to active_preempt_locks. bpf_rcu_read_lock() increments this counter and bpf_rcu_read_unlock() decrements it, MEM_RCU -> PTR_UNTRUSTED transition happens when active_rcu_locks drops to 0. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251117200411.25563-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-21 18:34:59 -08:00
Eduard Zingerman	e40f5a6bf8	bpf: correct stack liveness for tail calls This updates bpf_insn_successors() reflecting that control flow might jump over the instructions between tail call and function exit, verifier might assume that some writes to parent stack always happen, which is not the case. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Martin Teichmann <martin.teichmann@xfel.eu> Link: https://lore.kernel.org/r/20251119160355.1160932-4-martin.teichmann@xfel.eu Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-21 17:45:30 -08:00
Martin Teichmann	e3245f8990	bpf: properly verify tail call behavior A successful ebpf tail call does not return to the caller, but to the caller-of-the-caller, often just finishing the ebpf program altogether. Any restrictions that the verifier needs to take into account - notably the fact that the tail call might have modified packet pointers - are to be checked on the caller-of-the-caller. Checking it on the caller made the verifier refuse perfectly fine programs that would use the packet pointers after a tail call, which is no problem as this code is only executed if the tail call was unsuccessful, i.e. nothing happened. This patch simulates the behavior of a tail call in the verifier. A conditional jump to the code after the tail call is added for the case of an unsucessful tail call, and a return to the caller is simulated for a successful tail call. For the successful case we assume that the tail call returns an int, as tail calls are currently only allowed in functions that return and int. We always assume that the tail call modified the packet pointers, as we do not know what the tail call did. For the unsuccessful case we know nothing happened, so we do not need to add new constraints. This approach also allows to check other problems that may occur with tail calls, namely we are now able to check that precision is properly propagated into subprograms using tail calls, as well as checking the live slots in such a subprogram. Fixes: `1a4607ffba` ("bpf: consider that tail calls invalidate packet pointers") Link: https://lore.kernel.org/bpf/20251029105828.1488347-1-martin.teichmann@xfel.eu/ Signed-off-by: Martin Teichmann <martin.teichmann@xfel.eu> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251119160355.1160932-2-martin.teichmann@xfel.eu Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-21 17:45:30 -08:00
Anton Protopopov	4dd3a48d13	bpf: Add a check to make static analysers happy In [1] Dan Carpenter reported that the following code makes the Smatch static analyser unhappy: 17904 value = map->ops->map_lookup_elem(map, &i); 17905 if (!value) 17906 return -EINVAL; --> 17907 items[i - start] = value->xlated_off; The analyser assumes that the `value` variable may contain an error and thus it should be properly checked before the dereference. On practice this will never happen as array maps do not return error values in map_lookup_elem, but to make the Smatch and other possible analysers happy this patch adds a formal check. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/bpf/aR2BN1Ix--8tmVrN@stanley.mountain/ [1] Fixes: `493d9e0d60` ("bpf, x86: add support for indirect jumps") Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Link: https://lore.kernel.org/r/20251119112517.1091793-1-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-21 17:01:14 -08:00
Jakub Kicinski	9e203721ec	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.18-rc7). No conflicts, adjacent changes: tools/testing/selftests/net/af_unix/Makefile `e1bb28bf13` ("selftest: af_unix: Add test for SO_PEEK_OFF.") `45a1cd8346` ("selftests: af_unix: Add tests for ECONNRESET and EOF semantics") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-20 09:13:26 -08:00
Amery Hung	f484f4a3e0	bpf: Replace bpf memory allocator with kmalloc_nolock() in local storage Replace bpf memory allocator with kmalloc_nolock() to reduce memory wastage due to preallocation. In bpf_selem_free(), an selem now needs to wait for a RCU grace period before being freed when reuse_now == true. Therefore, rcu_barrier() should be always be called in bpf_local_storage_map_free(). In bpf_local_storage_free(), since smap->storage_ma is no longer needed to return the memory, the function is now independent from smap. Remove the outdated comment in bpf_local_storage_alloc(). We already free selem after an RCU grace period in bpf_local_storage_update() when bpf_local_storage_alloc() failed the cmpxchg since commit `c0d63f3091` ("bpf: Add bpf_selem_free()"). Signed-off-by: Amery Hung <ameryhung@gmail.com> Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20251114201329.3275875-5-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-18 16:20:25 -08:00
Amery Hung	39a460c425	bpf: Save memory alloction info in bpf_local_storage Save the memory allocation method used for bpf_local_storage in the struct explicitly so that we don't need to go through the hassle to find out the info. When a later patch replaces BPF memory allocator with kmalloc_noloc(), bpf_local_storage_free() will no longer need smap->storage_ma to return the memory and completely remove the dependency on smap in bpf_local_storage_free(). Signed-off-by: Amery Hung <ameryhung@gmail.com> Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20251114201329.3275875-4-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-18 16:20:25 -08:00
Amery Hung	e76a33e1c7	bpf: Remove smap argument from bpf_selem_free() Since selem already saves a pointer to smap, use it instead of an additional argument in bpf_selem_free(). This requires moving the SDATA(selem)->smap assignment from bpf_selem_link_map() to bpf_selem_alloc() since bpf_selem_free() may be called without the selem being linked to smap in bpf_local_storage_update(). Signed-off-by: Amery Hung <ameryhung@gmail.com> Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20251114201329.3275875-3-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-18 16:20:25 -08:00
Amery Hung	0e854e5535	bpf: Always charge/uncharge memory when allocating/unlinking storage elements Since commit `a96a44aba5` ("bpf: bpf_sk_storage: Fix invalid wait context lockdep report"), {charge,uncharge}_mem are always true when allocating a bpf_local_storage_elem or unlinking a bpf_local_storage_elem from local storage, so drop these arguments. No functional change. Signed-off-by: Amery Hung <ameryhung@gmail.com> Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20251114201329.3275875-2-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-18 16:20:25 -08:00
Pu Lehui	7dc211c115	bpf: Fix invalid prog->stats access when update_effective_progs fails Syzkaller triggers an invalid memory access issue following fault injection in update_effective_progs. The issue can be described as follows: __cgroup_bpf_detach update_effective_progs compute_effective_progs bpf_prog_array_alloc <-- fault inject purge_effective_progs /* change to dummy_bpf_prog / array->items[index] = &dummy_bpf_prog.prog ---softirq start--- __do_softirq ... __cgroup_bpf_run_filter_skb __bpf_prog_run_save_cb bpf_prog_run stats = this_cpu_ptr(prog->stats) / invalid memory access */ flags = u64_stats_update_begin_irqsave(&stats->syncp) ---softirq end--- static_branch_dec(&cgroup_bpf_enabled_key[atype]) The reason is that fault injection caused update_effective_progs to fail and then changed the original prog into dummy_bpf_prog.prog in purge_effective_progs. Then a softirq came, and accessing the members of dummy_bpf_prog.prog in the softirq triggers invalid mem access. To fix it, skip updating stats when stats is NULL. Fixes: `492ecee892` ("bpf: enable program stats") Signed-off-by: Pu Lehui <pulehui@huawei.com> Link: https://lore.kernel.org/r/20251115102343.2200727-1-pulehui@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-17 22:35:51 -08:00
Ryan Roberts	9ac09bb9fe	mm: consistently use current->mm in mm_get_unmapped_area() mm_get_unmapped_area() is a wrapper around arch_get_unmapped_area() / arch_get_unmapped_area_topdown(), both of which search current->mm for some free space. Neither take an mm_struct - they implicitly operate on current->mm. But the wrapper takes an mm_struct and uses it to decide whether to search bottom up or top down. All callers pass in current->mm for this, so everything is working consistently. But it feels like an accident waiting to happen; eventually someone will call that function with a different mm, expecting to find free space in it, but what gets returned is free space in the current mm. So let's simplify by removing the parameter and have the wrapper use current->mm to decide which end to start at. Now everything is consistent and self-documenting. Link: https://lkml.kernel.org/r/20251003155306.2147572-1-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-16 17:27:57 -08:00
Al Viro	c5055286f8	convert bpf object creation goes through the normal VFS paths or approximation thereof (user_path_create()/done_path_create() in case of bpf_obj_do_pin(), open-coded simple_{start,done}_creating() in bpf_iter_link_pin_kernel() at mount time), removals go entirely through the normal VFS paths (and ->unlink() is simple_unlink() there). Enough to have bpf_dentry_finalize() use d_make_persistent() instead of dget() and we are done. Convert bpf_iter_link_pin_kernel() to simple_{start,done}_creating(), while we are at it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-11-16 01:35:03 -05:00
Altgelt, Max (Nextron)	4722981cca	bpf: don't skip other information if xlated_prog_insns is skipped If xlated_prog_insns should not be exposed, other information (such as func_info) still can and should be filled in. Therefore, instead of directly terminating in this case, continue with the normal flow. Signed-off-by: Max Altgelt <max.altgelt@nextron-systems.com> Link: https://lore.kernel.org/r/efd00fcec5e3e247af551632726e2a90c105fbd8.camel@nextron-systems.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-14 18:55:06 -08:00
Puranjay Mohan	4f7bc83b98	bpf: verifier: Move desc->imm setup to sort_kfunc_descs_by_imm_off() Metadata about a kfunc call is added to the kfunc_tab in add_kfunc_call() but the call instruction itself could get removed by opt_remove_dead_code() later if it is not reachable. If the call instruction is removed, specialize_kfunc() is never called for it and the desc->imm in the kfunc_tab is never initialized for this kfunc call. In this case, sort_kfunc_descs_by_imm_off(env->prog); in do_misc_fixups() doesn't sort the table correctly. This is a problem for s390 as its JIT uses this table to find the addresses for kfuncs, and if this table is not sorted properly, JIT may fail to find addresses for valid kfunc calls. This was exposed by: commit `d869d56ca8` ("bpf: verifier: refactor kfunc specialization") as before this commit, desc->imm was initialised in add_kfunc_call() which happens before dead code elimination. Move desc->imm setup down to sort_kfunc_descs_by_imm_off(), this fixes the problem and also saves us from having the same logic in add_kfunc_call() and specialize_kfunc(). Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251114154023.12801-1-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-14 17:55:18 -08:00
Alexei Starovoitov	e47b68bda4	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after 6.18-rc5+ Cross-merge BPF and other fixes after downstream PR. Minor conflict in kernel/bpf/helpers.c Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-14 17:43:41 -08:00
Menglong Dong	fea3f5e83c	bpf: Handle return value of ftrace_set_filter_ip in register_fentry The error that returned by ftrace_set_filter_ip() in register_fentry() is not handled properly. Just fix it. Fixes: `00963a2e75` ("bpf: Support bpf_trampoline on functions with IPMODIFY (e.g. livepatch)") Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20251110120705.1553694-1-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-14 13:31:30 -08:00
Eduard Zingerman	e5d2e34e72	bpf: Add missing checks to avoid verbose verifier log There are a few places where log level is not checked before calling "verbose()". This forces programs working only at BPF_LOG_LEVEL_STATS (e.g. veristat) to allocate unnecessarily large log buffers. Add missing checks. Reported-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251114200542.912386-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-14 13:29:12 -08:00
Sahil Chandna	c1da3df719	bpf: Prevent nesting overflow in bpf_try_get_buffers bpf_try_get_buffers() returns one of multiple per-CPU buffers based on a per-CPU nesting counter. This mechanism expects that buffers are not endlessly acquired before being returned. migrate_disable() ensures that a task remains on the same CPU, but it does not prevent the task from being preempted by another task on that CPU. Without disabled preemption, a task may be preempted while holding a buffer, allowing another task to run on same CPU and acquire an additional buffer. Several such preemptions can cause the per-CPU nest counter to exceed MAX_BPRINTF_NEST_LEVEL and trigger the warning in bpf_try_get_buffers(). Adding preempt_disable()/preempt_enable() around buffer acquisition and release prevents this task preemption and preserves the intended bounded nesting behavior. Reported-by: syzbot+b0cff308140f79a9c4cb@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68f6a4c8.050a0220.1be48.0011.GAE@google.com/ Fixes: `4223bf833c` ("bpf: Remove preempt_disable in bpf_try_get_buffers") Suggested-by: Yonghong Song <yonghong.song@linux.dev> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Sahil Chandna <chandna.sahil@gmail.com> Link: https://lore.kernel.org/r/20251114064922.11650-1-chandna.sahil@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-14 13:06:47 -08:00
Eduard Zingerman	b0c8e6d3d8	bpf: account for current allocated stack depth in widen_imprecise_scalars() The usage pattern for widen_imprecise_scalars() looks as follows: prev_st = find_prev_entry(env, ...); queued_st = push_stack(...); widen_imprecise_scalars(env, prev_st, queued_st); Where prev_st is an ancestor of the queued_st in the explored states tree. This ancestor is not guaranteed to have same allocated stack depth as queued_st. E.g. in the following case: def main(): for i in 1..2: foo(i) // same callsite, differnt param def foo(i): if i == 1: use 128 bytes of stack iterator based loop Here, for a second 'foo' call prev_st->allocated_stack is 128, while queued_st->allocated_stack is much smaller. widen_imprecise_scalars() needs to take this into account and avoid accessing bpf_verifier_state->frame[*]->stack out of bounds. Fixes: `2793a8b015` ("bpf: exact states comparison for iterator convergence checks") Reported-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251114025730.772723-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-14 09:26:05 -08:00
Leon Hwang	6af6e49a76	bpf: Free special fields when update [lru_,]percpu_hash maps As [lru_,]percpu_hash maps support BPF_KPTR_{REF,PERCPU}, missing calls to 'bpf_obj_free_fields()' in 'pcpu_copy_value()' could cause the memory referenced by BPF_KPTR_{REF,PERCPU} fields to be held until the map gets freed. Fix this by calling 'bpf_obj_free_fields()' after 'copy_map_value[,_long]()' in 'pcpu_copy_value()'. Fixes: `65334e64a4` ("bpf: Support kptrs in percpu hashmap and percpu LRU hashmap") Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20251105151407.12723-2-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-13 09:14:15 -08:00
Kumar Kartikeya Dwivedi	3249e8a17e	bpf: Adjust return value for queue destruction in rqspinlock Return -ETIMEDOUT whenever non-head waiters are signalled by head, and fix oversight in commit `7bd6e5ce5b` ("rqspinlock: Disable queue destruction for deadlocks"). We no longer signal on deadlocks. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Reviewed-by: Amery Hung <ameryhung@gmail.com> Link: https://lore.kernel.org/r/20251111013827.1853484-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-12 11:17:39 -08:00
D. Wythe	07c428ece3	bpf: Export necessary symbols for modules with struct_ops Exports three necessary symbols for implementing struct_ops with tristate subsystem. To hold or release refcnt of struct_ops refcnt by inline funcs bpf_try_module_get and bpf_module_put which use bpf_struct_ops_get(put) conditionally. And to copy obj name from one to the other with effective checks by bpf_obj_name_cpy. Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20251107035632.115950-2-alibuda@linux.alibaba.com	2025-11-10 11:07:34 -08:00
Jakub Sitnicki	f38499ff45	bpf: Unclone skb head on bpf_dynptr_write to skb metadata Currently bpf_dynptr_from_skb_meta() marks the dynptr as read-only when the skb is cloned, preventing writes to metadata. Remove this restriction and unclone the skb head on bpf_dynptr_write() to metadata, now that the metadata is preserved during uncloning. This makes metadata dynptr consistent with skb dynptr, allowing writes regardless of whether the skb is cloned. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-3-5ceb08a9b37b@cloudflare.com	2025-11-10 10:52:31 -08:00
Puranjay Mohan	f8c67d8550	bpf: Use kmalloc_nolock() in range tree The range tree uses bpf_mem_alloc() that is safe to be called from all contexts and uses a pre-allocated pool of memory to serve these allocations. Replace bpf_mem_alloc() with kmalloc_nolock() as it can be called safely from all contexts and is more scalable than bpf_mem_alloc(). Remove the migrate_disable/enable pairs as they were only needed for bpf_mem_alloc() as it does per-cpu operations, kmalloc_nolock() doesn't need this. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20251106170608.4800-1-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-06 15:55:19 -08:00
Jakub Kicinski	1ec9871fbb	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.18-rc5). Conflicts: drivers/net/wireless/ath/ath12k/mac.c `9222582ec5` ("Revert "wifi: ath12k: Fix missing station power save configuration"") `6917e268c4` ("wifi: ath12k: Defer vdev bring-up until CSA finalize to avoid stale beacon") https://lore.kernel.org/11cece9f7e36c12efd732baa5718239b1bf8c950.camel@sipsolutions.net Adjacent changes: drivers/net/ethernet/intel/Kconfig `b1d16f7c00` ("libie: depend on DEBUG_FS when building LIBIE_FWLOG") `93f53db9f9` ("ice: switch to Page Pool") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-06 09:27:40 -08:00
Anton Protopopov	bc414d3583	bpf: disasm: add support for BPF_JMP\|BPF_JA\|BPF_X Add support for indirect jump instruction. Example output from bpftool: 0: (79) r3 = (u64 )(r1 +0) 1: (25) if r3 > 0x4 goto pc+666 2: (67) r3 <<= 3 3: (18) r1 = 0xffffbeefspameggs 5: (0f) r1 += r3 6: (79) r1 = (u64 )(r1 +0) 7: (0d) gotox r1 Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251105090410.1250500-10-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-05 17:53:23 -08:00
Anton Protopopov	493d9e0d60	bpf, x86: add support for indirect jumps Add support for a new instruction BPF_JMP\|BPF_X\|BPF_JA, SRC=0, DST=Rx, off=0, imm=0 which does an indirect jump to a location stored in Rx. The register Rx should have type PTR_TO_INSN. This new type assures that the Rx register contains a value (or a range of values) loaded from a correct jump table – map of type instruction array. For example, for a C switch LLVM will generate the following code: 0: r3 = r1 # "switch (r3)" 1: if r3 > 0x13 goto +0x666 # check r3 boundaries 2: r3 <<= 0x3 # adjust to an index in array of addresses 3: r1 = 0xbeef ll # r1 is PTR_TO_MAP_VALUE, r1->map_ptr=M 5: r1 += r3 # r1 inherits boundaries from r3 6: r1 = (u64 )(r1 + 0x0) # r1 now has type INSN_TO_PTR 7: gotox r1 # jit will generate proper code Here the gotox instruction corresponds to one particular map. This is possible however to have a gotox instruction which can be loaded from different maps, e.g. 0: r1 &= 0x1 1: r2 <<= 0x3 2: r3 = 0x0 ll # load from map M_1 4: r3 += r2 5: if r1 == 0x0 goto +0x4 6: r1 <<= 0x3 7: r3 = 0x0 ll # load from map M_2 9: r3 += r1 A: r1 = (u64 )(r3 + 0x0) B: gotox r1 # jump to target loaded from M_1 or M_2 During check_cfg stage the verifier will collect all the maps which point to inside the subprog being verified. When building the config, the high 16 bytes of the insn_state are used, so this patch (theoretically) supports jump tables of up to 2^16 slots. During the later stage, in check_indirect_jump, it is checked that the register Rx was loaded from a particular instruction array. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251105090410.1250500-9-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-05 17:53:23 -08:00
Anton Protopopov	30ec0ec09b	bpf: support instructions arrays with constants blinding When bpf_jit_harden is enabled, all constants in the BPF code are blinded to prevent JIT spraying attacks. This happens during JIT phase. Adjust all the related instruction arrays accordingly. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251105090410.1250500-6-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-05 17:53:22 -08:00
Anton Protopopov	b4ce5923e7	bpf, x86: add new map type: instructions array On bpf(BPF_PROG_LOAD) syscall user-supplied BPF programs are translated by the verifier into "xlated" BPF programs. During this process the original instructions offsets might be adjusted and/or individual instructions might be replaced by new sets of instructions, or deleted. Add a new BPF map type which is aimed to keep track of how, for a given program, the original instructions were relocated during the verification. Also, besides keeping track of the original -> xlated mapping, make x86 JIT to build the xlated -> jitted mapping for every instruction listed in an instruction array. This is required for every future application of instruction arrays: static keys, indirect jumps and indirect calls. A map of the BPF_MAP_TYPE_INSN_ARRAY type must be created with a u32 keys and value of size 8. The values have different semantics for userspace and for BPF space. For userspace a value consists of two u32 values – xlated and jitted offsets. For BPF side the value is a real pointer to a jitted instruction. On map creation/initialization, before loading the program, each element of the map should be initialized to point to an instruction offset within the program. Before the program load such maps should be made frozen. After the program verification xlated and jitted offsets can be read via the bpf(2) syscall. If a tracked instruction is removed by the verifier, then the xlated offset is set to (u32)-1 which is considered to be too big for a valid BPF program offset. One such a map can, obviously, be used to track one and only one BPF program. If the verification process was unsuccessful, then the same map can be re-used to verify the program with a different log level. However, if the program was loaded fine, then such a map, being frozen in any case, can't be reused by other programs even after the program release. Example. Consider the following original and xlated programs: Original prog: Xlated prog: 0: r1 = 0x0 0: r1 = 0 1: (u32 )(r10 - 0x4) = r1 1: (u32 )(r10 -4) = r1 2: r2 = r10 2: r2 = r10 3: r2 += -0x4 3: r2 += -4 4: r1 = 0x0 ll 4: r1 = map[id:88] 6: call 0x1 6: r1 += 272 7: r0 = (u32 )(r2 +0) 8: if r0 >= 0x1 goto pc+3 9: r0 <<= 3 10: r0 += r1 11: goto pc+1 12: r0 = 0 7: r6 = r0 13: r6 = r0 8: if r6 == 0x0 goto +0x2 14: if r6 == 0x0 goto pc+4 9: call 0x76 15: r0 = 0xffffffff8d2079c0 17: r0 = (u64 )(r0 +0) 10: (u64 )(r6 + 0x0) = r0 18: (u64 )(r6 +0) = r0 11: r0 = 0x0 19: r0 = 0x0 12: exit 20: exit An instruction array map, containing, e.g., instructions [0,4,7,12] will be translated by the verifier to [0,4,13,20]. A map with index 5 (the middle of 16-byte instruction) or indexes greater than 12 (outside the program boundaries) would be rejected. The functionality provided by this patch will be extended in consequent patches to implement BPF Static Keys, indirect jumps, and indirect calls. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251105090410.1250500-2-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-05 17:31:25 -08:00
Kees Cook	c1a799eef6	bpf: Convert bpf_sock_addr_kern "uaddr" to sockaddr_unsized Change struct bpf_sock_addr_kern to use sockaddr_unsized for the "uaddr" field instead of sockaddr. This improves type safety in the BPF cgroup socket address filtering code. The casting in __cgroup_bpf_run_filter_sock_addr() is updated to match the new type, removing an unnecessary cast in the initialization and updating the conditional assignment to use the appropriate sockaddr_unsized cast. Additionally rename the "unspec" variable to "storage" to better align with its usage. No binary changes expected. Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20251104002617.2752303-7-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-04 19:10:33 -08:00
Kees Cook	8116d803e7	bpf: Convert cgroup sockaddr filters to use sockaddr_unsized consistently Update BPF cgroup sockaddr filtering infrastructure to use sockaddr_unsized consistently throughout the call chain, removing redundant explicit casts from callers. No binary changes expected. Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20251104002617.2752303-6-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-04 19:10:33 -08:00
Mykyta Yatsenko	137cc92ffe	bpf: add _impl suffix for bpf_stream_vprintk() kfunc Rename bpf_stream_vprintk() to bpf_stream_vprintk_impl(). This makes bpf_stream_vprintk() follow the already established "_impl" suffix-based naming convention for kfuncs with the bpf_prog_aux argument provided by the verifier implicitly. This convention will be taken advantage of with the upcoming KF_IMPLICIT_ARGS feature to preserve backwards compatibility to BPF programs. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/20251104-implv2-v3-2-4772b9ae0e06@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Ihor Solodrai <ihor.solodrai@linux.dev>	2025-11-04 17:50:25 -08:00
Mykyta Yatsenko	ea0714d61d	bpf:add _impl suffix for bpf_task_work_schedule* kfuncs Rename: bpf_task_work_schedule_resume()->bpf_task_work_schedule_resume_impl() bpf_task_work_schedule_signal()->bpf_task_work_schedule_signal_impl() This aligns task work scheduling kfuncs with the established naming scheme for kfuncs with the bpf_prog_aux argument provided by the verifier implicitly. This convention will be taken advantage of with the upcoming KF_IMPLICIT_ARGS feature to preserve backwards compatibility to BPF programs. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/20251104-implv2-v3-1-4772b9ae0e06@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Ihor Solodrai <ihor.solodrai@linux.dev>	2025-11-04 17:50:25 -08:00
KaFai Wan	d43ad9da80	bpf: Skip bounds adjustment for conditional jumps on same scalar register When conditional jumps are performed on the same scalar register (e.g., r0 <= r0, r0 > r0, r0 < r0), the BPF verifier incorrectly attempts to adjust the register's min/max bounds. This leads to invalid range bounds and triggers a BUG warning. The problematic BPF program: 0: call bpf_get_prandom_u32 1: w8 = 0x80000000 2: r0 &= r8 3: if r0 > r0 goto <exit> The instruction 3 triggers kernel warning: 3: if r0 > r0 goto <exit> true_reg1: range bounds violation u64=[0x1, 0x0] s64=[0x1, 0x0] u32=[0x1, 0x0] s32=[0x1, 0x0] var_off=(0x0, 0x0) true_reg2: const tnum out of sync with range bounds u64=[0x0, 0xffffffffffffffff] s64=[0x8000000000000000, 0x7fffffffffffffff] var_off=(0x0, 0x0) Comparing a register with itself should not change its bounds and for most comparison operations, comparing a register with itself has a known result (e.g., r0 == r0 is always true, r0 < r0 is always false). Fix this by: 1. Enhance is_scalar_branch_taken() to properly handle branch direction computation for same register comparisons across all BPF jump operations 2. Adds early return in reg_set_min_max() to avoid bounds adjustment for unknown branch directions (e.g., BPF_JSET) on the same register The fix ensures that unnecessary bounds adjustments are skipped, preventing the verifier bug while maintaining correct branch direction analysis. Reported-by: Kaiyan Mei <M202472210@hust.edu.cn> Reported-by: Yinhao Hu <dddddd@hust.edu.cn> Closes: https://lore.kernel.org/all/1881f0f5.300df.199f2576a01.Coremail.kaiyanm@hust.edu.cn/ Signed-off-by: KaFai Wan <kafai.wan@linux.dev> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251103063108.1111764-2-kafai.wan@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-03 17:43:28 -08:00
Song Liu	56b3c85e15	ftrace: Fix BPF fexit with livepatch When livepatch is attached to the same function as bpf trampoline with a fexit program, bpf trampoline code calls register_ftrace_direct() twice. The first time will fail with -EAGAIN, and the second time it will succeed. This requires register_ftrace_direct() to unregister the address on the first attempt. Otherwise, the bpf trampoline cannot attach. Here is an easy way to reproduce this issue: insmod samples/livepatch/livepatch-sample.ko bpftrace -e 'fexit:cmdline_proc_show {}' ERROR: Unable to attach probe: fexit:vmlinux:cmdline_proc_show... Fix this by cleaning up the hash when register_ftrace_function_nolock hits errors. Also, move the code that resets ops->func and ops->trampoline to the error path of register_ftrace_direct(); and add a helper function reset_direct() in register_ftrace_direct() and unregister_ftrace_direct(). Fixes: `d05cb47066` ("ftrace: Fix modification of direct_function hash while in use") Cc: stable@vger.kernel.org # v6.6+ Reported-by: Andrey Grodzovsky <andrey.grodzovsky@crowdstrike.com> Closes: https://lore.kernel.org/live-patching/c5058315a39d4615b333e485893345be@crowdstrike.com/ Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-and-tested-by: Andrey Grodzovsky <andrey.grodzovsky@crowdstrike.com> Signed-off-by: Song Liu <song@kernel.org> Reviewed-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20251027175023.1521602-2-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-03 17:22:06 -08:00
Alexei Starovoitov	5dae7453ec	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after 6.18-rc4 Cross-merge BPF and other fixes after downstream PR. No conflicts. Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-03 14:59:55 -08:00
Puranjay Mohan	5701d5aefa	bpf: Use kmalloc_nolock() in bpf streams BPF stream kfuncs need to be non-sleeping as they can be called from programs running in any context, this requires a way to allocate memory from any context. Currently, this is done by a custom per-CPU NMI-safe bump allocation mechanism, backed by alloc_pages_nolock() and free_pages_nolock() primitives. As kmalloc_nolock() and kfree_nolock() primitives are available now, the custom allocator can be removed in favor of these. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20251023161448.4263-1-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-29 18:19:46 -07:00
Kumar Kartikeya Dwivedi	7bd6e5ce5b	rqspinlock: Disable queue destruction for deadlocks Disable propagation and unwinding of the waiter queue in case the head waiter detects a deadlock condition, but keep it enabled in case of the timeout fallback. Currently, when the head waiter experiences an AA deadlock, it will signal all its successors in the queue to exit with an error. This is not ideal for cases where the same lock is held in contexts which can cause errors in an unrestricted fashion (e.g., BPF programs, or kernel paths invoked through BPF programs), and core kernel logic which is written in a correct fashion and does not expect deadlocks. The same reasoning can be extended to ABBA situations. Depending on the actual runtime schedule, one or both of the head waiters involved in an ABBA situation can detect and exit directly without terminating their waiter queue. If the ABBA situation manifests again, the waiters will keep exiting until progress can be made, or a timeout is triggered in case of more complicated locking dependencies. We still preserve the queue destruction in case of timeouts, as either the locking dependencies are too complex to be captured by AA and ABBA heuristics, or the owner is perpetually stuck. As such, it would be unwise to continue to apply the timeout for each new head waiter without terminating the queue, since we may end up waiting for more than 250 ms in aggregate with all participants in the locking transaction. The patch itself is fairly simple; we can simply signal our successor to become the next head waiter, and leave the queue without attempting to acquire the lock. With this change, the behavior for waiters in case of deadlocks experienced by a predecessor changes. It is guaranteed that call sites will no longer receive errors if the predecessors encounter deadlocks and the successors do not participate in one. This should lower the failure rate for waiters that are not doing improper locking opreations, just because they were unlucky to queue behind a misbehaving waiter. However, timeouts are still a possibility, hence they must be accounted for, so users cannot rely upon errors not occuring at all. Suggested-by: Amery Hung <ameryhung@gmail.com> Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20251029181828.231529-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-29 18:17:56 -07:00
Peter Zijlstra	c69993ecdd	perf: Support deferred user unwind Add support for deferred userspace unwind to perf. Where perf currently relies on in-place stack unwinding; from NMI context and all that. This moves the userspace part of the unwind to right before the return-to-userspace. This has two distinct benefits, the biggest is that it moves the unwind to a faultable context. It becomes possible to fault in debug info (.eh_frame, SFrame etc.) that might not otherwise be readily available. And secondly, it de-duplicates the user callchain where multiple samples happen during the same kernel entry. To facilitate this the perf interface is extended with a new record type: PERF_RECORD_CALLCHAIN_DEFERRED and two new attribute flags: perf_event_attr::defer_callchain - to request the user unwind be deferred perf_event_attr::defer_output - to request PERF_RECORD_CALLCHAIN_DEFERRED records The existing PERF_RECORD_SAMPLE callchain section gets a new context type: PERF_CONTEXT_USER_DEFERRED After which will come a single entry, denoting the 'cookie' of the deferred callchain that should be attached here, matching the 'cookie' field of the above mentioned PERF_RECORD_CALLCHAIN_DEFERRED. The 'defer_callchain' flag is expected on all events with PERF_SAMPLE_CALLCHAIN. The 'defer_output' flag is expect on the event responsible for collecting side-band events (like mmap, comm etc.). Setting 'defer_output' on multiple events will get you duplicated PERF_RECORD_CALLCHAIN_DEFERRED records. Based on earlier patches by Josh and Steven. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251023150002.GR4067720@noisy.programming.kicks-ass.net	2025-10-29 10:29:58 +01:00
Arnaud Lecomte	23f852daa4	bpf: Fix stackmap overflow check in __bpf_get_stackid() Syzkaller reported a KASAN slab-out-of-bounds write in __bpf_get_stackid() when copying stack trace data. The issue occurs when the perf trace contains more stack entries than the stack map bucket can hold, leading to an out-of-bounds write in the bucket's data array. Fixes: `ee2a098851` ("bpf: Adjust BPF stack helper functions to accommodate skip > 0") Reported-by: syzbot+c9b724fbb41cf2538b7b@syzkaller.appspotmail.com Signed-off-by: Arnaud Lecomte <contact@arnaud-lcm.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/bpf/20251025192941.1500-1-contact@arnaud-lcm.com Closes: https://syzkaller.appspot.com/bug?extid=c9b724fbb41cf2538b7b	2025-10-28 09:20:27 -07:00
Arnaud Lecomte	e17d62fedd	bpf: Refactor stack map trace depth calculation into helper function Extract the duplicated maximum allowed depth computation for stack traces stored in BPF stacks from bpf_get_stackid() and __bpf_get_stack() into a dedicated stack_map_calculate_max_depth() helper function. This unifies the logic for: - The max depth computation - Enforcing the sysctl_perf_event_max_stack limit No functional changes for existing code paths. Signed-off-by: Arnaud Lecomte <contact@arnaud-lcm.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/bpf/20251025192858.31424-1-contact@arnaud-lcm.com	2025-10-28 09:20:27 -07:00
Xu Kuohai	feeaf1346f	bpf: Add overwrite mode for BPF ring buffer When the BPF ring buffer is full, a new event cannot be recorded until one or more old events are consumed to make enough space for it. In cases such as fault diagnostics, where recent events are more useful than older ones, this mechanism may lead to critical events being lost. So add overwrite mode for BPF ring buffer to address it. In this mode, the new event overwrites the oldest event when the buffer is full. The basic idea is as follows: 1. producer_pos tracks the next position to record new event. When there is enough free space, producer_pos is simply advanced by producer to make space for the new event. 2. To avoid waiting for consumer when the buffer is full, a new variable, overwrite_pos, is introduced for producer. It points to the oldest event committed in the buffer. It is advanced by producer to discard one or more oldest events to make space for the new event when the buffer is full. 3. pending_pos tracks the oldest event to be committed. pending_pos is never passed by producer_pos, so multiple producers never write to the same position at the same time. The following example diagrams show how it works in a 4096-byte ring buffer. 1. At first, {producer,overwrite,pending,consumer}_pos are all set to 0. 0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ \| \| \| \| \| \| +-----------------------------------------------------------------------+ ^ \| \| producer_pos = 0 overwrite_pos = 0 pending_pos = 0 consumer_pos = 0 2. Now reserve a 512-byte event A. There is enough free space, so A is allocated at offset 0. And producer_pos is advanced to 512, the end of A. Since A is not submitted, the BUSY bit is set. 0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ \| \| \| \| A \| \| \| [BUSY] \| \| +-----------------------------------------------------------------------+ ^ ^ \| \| \| \| \| producer_pos = 512 \| overwrite_pos = 0 pending_pos = 0 consumer_pos = 0 3. Reserve event B, size 1024. B is allocated at offset 512 with BUSY bit set, and producer_pos is advanced to the end of B. 0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ \| \| \| \| \| A \| B \| \| \| [BUSY] \| [BUSY] \| \| +-----------------------------------------------------------------------+ ^ ^ \| \| \| \| \| producer_pos = 1536 \| overwrite_pos = 0 pending_pos = 0 consumer_pos = 0 4. Reserve event C, size 2048. C is allocated at offset 1536, and producer_pos is advanced to 3584. 0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ \| \| \| \| \| \| A \| B \| C \| \| \| [BUSY] \| [BUSY] \| [BUSY] \| \| +-----------------------------------------------------------------------+ ^ ^ \| \| \| \| \| producer_pos = 3584 \| overwrite_pos = 0 pending_pos = 0 consumer_pos = 0 5. Submit event A. The BUSY bit of A is cleared. B becomes the oldest event to be committed, so pending_pos is advanced to 512, the start of B. 0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ \| \| \| \| \| \| A \| B \| C \| \| \| \| [BUSY] \| [BUSY] \| \| +-----------------------------------------------------------------------+ ^ ^ ^ \| \| \| \| \| \| \| pending_pos = 512 producer_pos = 3584 \| overwrite_pos = 0 consumer_pos = 0 6. Submit event B. The BUSY bit of B is cleared, and pending_pos is advanced to the start of C, which is now the oldest event to be committed. 0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ \| \| \| \| \| \| A \| B \| C \| \| \| \| \| [BUSY] \| \| +-----------------------------------------------------------------------+ ^ ^ ^ \| \| \| \| \| \| \| pending_pos = 1536 producer_pos = 3584 \| overwrite_pos = 0 consumer_pos = 0 7. Reserve event D, size 1536 (3 * 512). There are 2048 bytes not being written between producer_pos (currently 3584) and pending_pos, so D is allocated at offset 3584, and producer_pos is advanced by 1536 (from 3584 to 5120). Since event D will overwrite all bytes of event A and the first 512 bytes of event B, overwrite_pos is advanced to the start of event C, the oldest event that is not overwritten. 0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ \| \| \| \| \| \| D End \| \| C \| D Begin\| \| [BUSY] \| \| [BUSY] \| [BUSY] \| +-----------------------------------------------------------------------+ ^ ^ ^ \| \| \| \| \| pending_pos = 1536 \| \| overwrite_pos = 1536 \| \| \| producer_pos=5120 \| consumer_pos = 0 8. Reserve event E, size 1024. Although there are 512 bytes not being written between producer_pos and pending_pos, E cannot be reserved, as it would overwrite the first 512 bytes of event C, which is still being written. 9. Submit event C and D. pending_pos is advanced to the end of D. 0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ \| \| \| \| \| \| D End \| \| C \| D Begin\| \| \| \| \| \| +-----------------------------------------------------------------------+ ^ ^ ^ \| \| \| \| \| overwrite_pos = 1536 \| \| \| producer_pos=5120 \| pending_pos=5120 \| consumer_pos = 0 The performance data for overwrite mode will be provided in a follow-up patch that adds overwrite-mode benchmarks. A sample of performance data for non-overwrite mode, collected on an x86_64 CPU and an arm64 CPU, before and after this patch, is shown below. As we can see, no obvious performance regression occurs. - x86_64 (AMD EPYC 9654) Before: Ringbuf, multi-producer contention ================================== rb-libbpf nr_prod 1 11.623 ± 0.027M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 15.812 ± 0.014M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 7.871 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 6.703 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 2.896 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 2.054 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 1.864 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 1.580 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 1.484 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 1.369 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 1.316 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 1.272 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 1.239 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 1.226 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 1.213 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 1.193 ± 0.001M/s (drops 0.000 ± 0.000M/s) After: Ringbuf, multi-producer contention ================================== rb-libbpf nr_prod 1 11.845 ± 0.036M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 15.889 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 8.155 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 6.708 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 2.918 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 2.065 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 1.870 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 1.582 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 1.482 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 1.372 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 1.323 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 1.264 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 1.236 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 1.209 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 1.189 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 1.165 ± 0.002M/s (drops 0.000 ± 0.000M/s) - arm64 (HiSilicon Kunpeng 920) Before: Ringbuf, multi-producer contention ================================== rb-libbpf nr_prod 1 11.310 ± 0.623M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 9.947 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 6.634 ± 0.011M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 4.502 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 3.888 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 3.372 ± 0.005M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 3.189 ± 0.010M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 2.998 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 3.086 ± 0.018M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 2.845 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 2.815 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 2.771 ± 0.009M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 2.814 ± 0.011M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 2.752 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 2.695 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 2.710 ± 0.006M/s (drops 0.000 ± 0.000M/s) After: Ringbuf, multi-producer contention ================================== rb-libbpf nr_prod 1 11.283 ± 0.550M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 9.993 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 6.898 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 5.257 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 3.830 ± 0.005M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 3.528 ± 0.013M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 3.265 ± 0.018M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 2.990 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 2.929 ± 0.014M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 2.898 ± 0.010M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 2.818 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 2.789 ± 0.012M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 2.770 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 2.651 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 2.669 ± 0.005M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 2.695 ± 0.009M/s (drops 0.000 ± 0.000M/s) Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20251018035738.4039621-2-xukuohai@huaweicloud.com	2025-10-27 19:42:39 -07:00
Mykyta Yatsenko	2c52e8943a	bpf: dispatch to sleepable file dynptr File dynptr reads may sleep when the requested folios are not in the page cache. To avoid sleeping in non-sleepable contexts while still supporting valid sleepable use, given that dynptrs are non-sleepable by default, enable sleeping only when bpf_dynptr_from_file() is invoked from a sleepable context. This change: * Introduces a sleepable constructor: bpf_dynptr_from_file_sleepable() * Override non-sleepable constructor with sleepable if it's always called in sleepable context Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251026203853.135105-10-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-27 09:56:27 -07:00
Mykyta Yatsenko	d869d56ca8	bpf: verifier: refactor kfunc specialization Move kfunc specialization (function address substitution) to later stage of verification to support a new use case, where we need to take into consideration whether kfunc is called in sleepable context. Minor refactoring in add_kfunc_call(), making sure that if function fails, kfunc desc is not added to tab->descs (previously it could be added or not, depending on what failed). Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251026203853.135105-9-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-27 09:56:27 -07:00
Mykyta Yatsenko	e3e36edb1b	bpf: add kfuncs and helpers support for file dynptrs Add support for file dynptr. Introduce struct bpf_dynptr_file_impl to hold internal state for file dynptrs, with 64-bit size and offset support. Introduce lifecycle management kfuncs: - bpf_dynptr_from_file() for initialization - bpf_dynptr_file_discard() for destruction Extend existing helpers to support file dynptrs in: - bpf_dynptr_read() - bpf_dynptr_slice() Write helpers (bpf_dynptr_write() and bpf_dynptr_data()) are not modified, as file dynptr is read-only. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251026203853.135105-8-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-27 09:56:27 -07:00
Mykyta Yatsenko	8d8771dc03	bpf: add plumbing for file-backed dynptr Add the necessary verifier plumbing for the new file-backed dynptr type. Introduce two kfuncs for its lifecycle management: * bpf_dynptr_from_file() for initialization * bpf_dynptr_file_discard() for destruction Currently there is no mechanism for kfunc to release dynptr, this patch add one: * Dynptr release function sets meta->release_regno * Call unmark_stack_slots_dynptr() if meta->release_regno is set and dynptr ref_obj_id is set as well. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251026203853.135105-7-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-27 09:56:27 -07:00
Mykyta Yatsenko	9cba966f1c	bpf: verifier: centralize const dynptr check in unmark_stack_slots_dynptr() Move the const dynptr check into unmark_stack_slots_dynptr() so callers don’t have to duplicate it. This puts the validation next to the code that manipulates dynptr stack slots and allows upcoming changes to reuse it directly. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251026203853.135105-6-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-27 09:56:27 -07:00
Mykyta Yatsenko	531b87d865	bpf: widen dynptr size/offset to 64 bit Dynptr currently caps size and offset at 24 bits, which isn’t sufficient for file-backed use cases; even 32 bits can be limiting. Refactor dynptr helpers/kfuncs to use 64-bit size and offset, ensuring consistency across the APIs. This change does not affect internals of xdp, skb or other dynptrs, which continue to behave as before. Also it does not break binary compatibility. The widening enables large-file access support via dynptr, implemented in the next patches. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251026203853.135105-3-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-27 09:56:26 -07:00
Matthew Wilcox (Oracle)	70e0a80a1f	treewide: Remove in_irq() This old alias for in_hardirq() has been marked as deprecated since 2020; remove the stragglers. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251024180654.1691095-1-willy@infradead.org	2025-10-24 21:39:27 +02:00
Malin Jonsson	8ce93aabbf	bpf: Conditionally include dynptr copy kfuncs Since commit `a498ee7576` ("bpf: Implement dynptr copy kfuncs"), if CONFIG_BPF_EVENTS is not enabled, but BPF_SYSCALL and DEBUG_INFO_BTF are, the build will break like so: BTFIDS vmlinux.unstripped WARN: resolve_btfids: unresolved symbol bpf_probe_read_user_str_dynptr WARN: resolve_btfids: unresolved symbol bpf_probe_read_user_dynptr WARN: resolve_btfids: unresolved symbol bpf_probe_read_kernel_str_dynptr WARN: resolve_btfids: unresolved symbol bpf_probe_read_kernel_dynptr WARN: resolve_btfids: unresolved symbol bpf_copy_from_user_task_str_dynptr WARN: resolve_btfids: unresolved symbol bpf_copy_from_user_task_dynptr WARN: resolve_btfids: unresolved symbol bpf_copy_from_user_str_dynptr WARN: resolve_btfids: unresolved symbol bpf_copy_from_user_dynptr make[2]: * [scripts/Makefile.vmlinux:72: vmlinux.unstripped] Error 255 make[2]: * Deleting file 'vmlinux.unstripped' make[1]: * [/repo/malin/upstream/linux/Makefile:1242: vmlinux] Error 2 make: * [Makefile:248: __sub-make] Error 2 Guard these symbols with #ifdef CONFIG_BPF_EVENTS to resolve the problem. Fixes: `a498ee7576` ("bpf: Implement dynptr copy kfuncs") Reported-by: Yong Gu <yong.g.gu@ericsson.com> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Malin Jonsson <malin.jonsson@est.tech> Link: https://lore.kernel.org/r/20251024151436.139131-1-malin.jonsson@est.tech Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-24 09:44:47 -07:00
Anton Protopopov	2f69c56854	bpf: make bpf_insn_successors to return a pointer The bpf_insn_successors() function is used to return successors to a BPF instruction. So far, an instruction could have 0, 1 or 2 successors. Prepare the verifier code to introduction of instructions with more than 2 successors (namely, indirect jumps). To do this, introduce a new struct, struct bpf_iarray, containing an array of bpf instruction indexes and make bpf_insn_successors to return a pointer of that type. The storage for all instructions is allocated in the env->succ, which holds an array of size 2, to be used for all instructions. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251019202145.3944697-10-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-21 11:20:23 -07:00
Anton Protopopov	44481e4925	bpf: generalize and export map_get_next_key for arrays The kernel/bpf/array.c file defines the array_map_get_next_key() function which finds the next key for array maps. It actually doesn't use any map fields besides the generic max_entries field. Generalize it, and export as bpf_array_get_next_key() such that it can be re-used by other array-like maps. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251019202145.3944697-4-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-21 11:17:25 -07:00
Anton Protopopov	f7d72d0b3f	bpf: save the start of functions in bpf_prog_aux Introduce a new subprog_start field in bpf_prog_aux. This field may be used by JIT compilers wanting to know the real absolute xlated offset of the function being jitted. The func_info[func_id] may have served this purpose, but func_info may be NULL, so JIT compilers can't rely on it. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251019202145.3944697-3-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-21 11:17:25 -07:00
Anton Protopopov	6ea5fc92a0	bpf: fix the return value of push_stack In [1] Eduard mentioned that on push_stack failure verifier code should return -ENOMEM instead of -EFAULT. After checking with the other call sites I've found that code randomly returns either -ENOMEM or -EFAULT. This patch unifies the return values for the push_stack (and similar push_async_cb) functions such that error codes are always assigned properly. [1] https://lore.kernel.org/bpf/20250615085943.3871208-1-a.s.protopopov@gmail.com Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251019202145.3944697-2-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-21 11:17:25 -07:00
Noorain Eqbal	4e90776383	bpf: Sync pending IRQ work before freeing ring buffer Fix a race where irq_work can be queued in bpf_ringbuf_commit() but the ring buffer is freed before the work executes. In the syzbot reproducer, a BPF program attached to sched_switch triggers bpf_ringbuf_commit(), queuing an irq_work. If the ring buffer is freed before this work executes, the irq_work thread may accesses freed memory. Calling `irq_work_sync(&rb->work)` ensures that all pending irq_work complete before freeing the buffer. Fixes: `457f44363a` ("bpf: Implement BPF ring buffer and verifier support for it") Reported-by: syzbot+2617fc732430968b45d2@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=2617fc732430968b45d2 Tested-by: syzbot+2617fc732430968b45d2@syzkaller.appspotmail.com Signed-off-by: Noorain Eqbal <nooraineqbal@gmail.com> Link: https://lore.kernel.org/r/20251020180301.103366-1-nooraineqbal@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-21 09:57:48 -07:00
Shardul Bankar	96d31dff3f	bpf: Clarify get_outer_instance() handling in propagate_to_outer_instance() propagate_to_outer_instance() calls get_outer_instance() and uses the returned pointer to reset and commit stack write marks. Under normal conditions, update_instance() guarantees that an outer instance exists, so get_outer_instance() cannot return an ERR_PTR. However, explicitly checking for IS_ERR(outer_instance) makes this code more robust and self-documenting. It reduces cognitive load when reading the control flow and silences potential false-positive reports from static analysis or automated tooling. No functional change intended. Signed-off-by: Shardul Bankar <shardulsb08@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251021080849.860072-1-shardulsb08@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-21 09:39:05 -07:00
Yafang Shao	7484e7cd8a	bpf: mark vma->{vm_mm,vm_file} as __safe_trusted_or_null The vma->vm_mm might be NULL and it can be accessed outside of RCU. Thus, we can mark it as trusted_or_null. With this change, BPF helpers can safely access vma->vm_mm to retrieve the associated mm_struct from the VMA. Then we can make policy decision from the VMA. The "trusted" annotation enables direct access to vma->vm_mm within kfuncs marked with KF_TRUSTED_ARGS or KF_RCU, such as bpf_task_get_cgroup1() and bpf_task_under_cgroup(). Conversely, "null" enforcement requires all callsites using vma->vm_mm to perform NULL checks. The lsm selftest must be modified because it directly accesses vma->vm_mm without a NULL pointer check; otherwise it will break due to this change. For the VMA based THP policy, the use case is as follows, @mm = @vma->vm_mm; // vm_area_struct::vm_mm is trusted or null if (!@mm) return; bpf_rcu_read_lock(); // rcu lock must be held to dereference the owner @owner = @mm->owner; // mm_struct::owner is rcu trusted or null if (!@owner) goto out; @cgroup1 = bpf_task_get_cgroup1(@owner, MEMCG_HIERARCHY_ID); /* make the decision based on the @cgroup1 attribute / bpf_cgroup_release(@cgroup1); // release the associated cgroup out: bpf_rcu_read_unlock(); PSI memory information can be obtained from the associated cgroup to inform policy decisions. Since upstream PSI support is currently limited to cgroup v2, the following example demonstrates cgroup v2 implementation: @owner = @mm->owner; if (@owner) { // @ancestor_cgid is user-configured @ancestor = bpf_cgroup_from_id(@ancestor_cgid); if (bpf_task_under_cgroup(@owner, @ancestor)) { @psi_group = @ancestor->psi; / Extract PSI metrics from @psi_group and * implement policy logic based on the values */ } } The vma::vm_file can also be marked with __safe_trusted_or_null. No additional selftests are required since vma->vm_file and vma->vm_mm are already validated in the existing selftest suite. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Link: https://lore.kernel.org/r/20251016063929.13830-3-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-18 19:23:08 -07:00
Yafang Shao	ec8e3e27a1	bpf: mark mm->owner as __safe_rcu_or_null When CONFIG_MEMCG is enabled, we can access mm->owner under RCU. The owner can be NULL. With this change, BPF helpers can safely access mm->owner to retrieve the associated task from the mm. We can then make policy decision based on the task attribute. The typical use case is as follows, bpf_rcu_read_lock(); // rcu lock must be held for rcu trusted field @owner = @mm->owner; // mm_struct::owner is rcu trusted or null if (!@owner) goto out; /* Do something based on the task attribute */ out: bpf_rcu_read_unlock(); Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/r/20251016063929.13830-2-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-18 19:23:08 -07:00
Alexei Starovoitov	50de48a4dd	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf at 6.18-rc2 Cross-merge BPF and other fixes after downstream PR. No conflicts. Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-18 18:20:57 -07:00
Shardul Bankar	f6fddc6df3	bpf: Fix memory leak in __lookup_instance error path When __lookup_instance() allocates a func_instance structure but fails to allocate the must_write_set array, it returns an error without freeing the previously allocated func_instance. This causes a memory leak of 192 bytes (sizeof(struct func_instance)) each time this error path is triggered. Fix by freeing 'result' on must_write_set allocation failure. Fixes: `b3698c356a` ("bpf: callchain sensitive stack liveness tracking using CFG") Reported-by: BPF Runtime Fuzzer (BRF) Signed-off-by: Shardul Bankar <shardulsb08@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://patch.msgid.link/20251016063330.4107547-1-shardulsb08@gmail.com	2025-10-16 10:45:17 -07:00
Andrii Nakryiko	48a97ffc6c	bpf: Consistently use bpf_rcu_lock_held() everywhere We have many places which open-code what's now is bpf_rcu_lock_held() macro, so replace all those places with a clean and short macro invocation. For that, move bpf_rcu_lock_held() macro into include/linux/bpf.h. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20251014201403.4104511-1-andrii@kernel.org	2025-10-15 12:26:12 +02:00
Alexei Starovoitov	5fb750e8a9	bpf: Replace bpf_map_kmalloc_node() with kmalloc_nolock() to allocate bpf_async_cb structures. The following kmemleak splat: [ 8.105530] kmemleak: Trying to color unknown object at 0xff11000100e918c0 as Black [ 8.106521] Call Trace: [ 8.106521] <TASK> [ 8.106521] dump_stack_lvl+0x4b/0x70 [ 8.106521] kvfree_call_rcu+0xcb/0x3b0 [ 8.106521] ? hrtimer_cancel+0x21/0x40 [ 8.106521] bpf_obj_free_fields+0x193/0x200 [ 8.106521] htab_map_update_elem+0x29c/0x410 [ 8.106521] bpf_prog_cfc8cd0f42c04044_overwrite_cb+0x47/0x4b [ 8.106521] bpf_prog_8c30cd7c4db2e963_overwrite_timer+0x65/0x86 [ 8.106521] bpf_prog_test_run_syscall+0xe1/0x2a0 happens due to the combination of features and fixes, but mainly due to commit `6d78b4473c` ("bpf: Tell memcg to use allow_spinning=false path in bpf_timer_init()") It's using __GFP_HIGH, which instructs slub/kmemleak internals to skip kmemleak_alloc_recursive() on allocation, so subsequent kfree_rcu()-> kvfree_call_rcu()->kmemleak_ignore() complains with the above splat. To fix this imbalance, replace bpf_map_kmalloc_node() with kmalloc_nolock() and kfree_rcu() with call_rcu() + kfree_nolock() to make sure that the objects allocated with kmalloc_nolock() are freed with kfree_nolock() rather than the implicit kfree() that kfree_rcu() uses internally. Note, the kmalloc_nolock() happens under bpf_spin_lock_irqsave(), so it will always fail in PREEMPT_RT. This is not an issue at the moment, since bpf_timers are disabled in PREEMPT_RT. In the future bpf_spin_lock will be replaced with state machine similar to bpf_task_work. Fixes: `6d78b4473c` ("bpf: Tell memcg to use allow_spinning=false path in bpf_timer_init()") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Harry Yoo <harry.yoo@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: linux-mm@kvack.org Link: https://lore.kernel.org/bpf/20251015000700.28988-1-alexei.starovoitov@gmail.com	2025-10-15 12:22:22 +02:00
Alexei Starovoitov	39e9d5f630	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf before 6.18-rc1 Cross-merge BPF and other fixes after downstream PR. No conflicts. Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-11 18:27:47 -07:00
Mykyta Yatsenko	4c97c4b149	bpf: Extract internal structs validation logic into helpers The arraymap and hashtab duplicate the logic that checks for and frees internal structs (timer, workqueue, task_work) based on BTF record flags. Centralize this by introducing two helpers: * bpf_map_has_internal_structs(map) Returns true if the map value contains any of internal structs: BPF_TIMER \| BPF_WORKQUEUE \| BPF_TASK_WORK. * bpf_map_free_internal_structs(map, obj) Frees the internal structs for a single value object. Convert arraymap and both the prealloc/malloc hashtab paths to use the new generic functions. This keeps the functionality for when/how to free these special fields in one place and makes it easier to add support for new internal structs in the future without touching every map implementation. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251010164606.147298-3-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-10 11:13:28 -07:00
Mykyta Yatsenko	5f8d411729	bpf: Fix handling maps with no BTF and non-constant offsets for the bpf_wq Fix handling maps with no BTF and non-constant offsets for the bpf_wq. This de-duplicates logic with other internal structs (task_work, timer), keeps error reporting consistent, and makes future changes to the layout handling centralized. Fixes: `d940c9b94d` ("bpf: add support for KF_ARG_PTR_TO_WORKQUEUE") Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251010164606.147298-1-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-10 11:08:10 -07:00
KaFai Wan	4f375ade6a	bpf: Avoid RCU context warning when unpinning htab with internal structs When unpinning a BPF hash table (htab or htab_lru) that contains internal structures (timer, workqueue, or task_work) in its values, a BUG warning is triggered: BUG: sleeping function called from invalid context at kernel/bpf/hashtab.c:244 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 14, name: ksoftirqd/0 ... The issue arises from the interaction between BPF object unpinning and RCU callback mechanisms: 1. BPF object unpinning uses ->free_inode() which schedules cleanup via call_rcu(), deferring the actual freeing to an RCU callback that executes within the RCU_SOFTIRQ context. 2. During cleanup of hash tables containing internal structures, htab_map_free_internal_structs() is invoked, which includes cond_resched() or cond_resched_rcu() calls to yield the CPU during potentially long operations. However, cond_resched() or cond_resched_rcu() cannot be safely called from atomic RCU softirq context, leading to the BUG warning when attempting to reschedule. Fix this by changing from ->free_inode() to ->destroy_inode() and rename bpf_free_inode() to bpf_destroy_inode() for BPF objects (prog, map, link). This allows direct inode freeing without RCU callback scheduling, avoiding the invalid context warning. Reported-by: Le Chen <tom2cat@sjtu.edu.cn> Closes: https://lore.kernel.org/all/1444123482.1827743.1750996347470.JavaMail.zimbra@sjtu.edu.cn/ Fixes: `68134668c1` ("bpf: Add map side support for bpf timers.") Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: KaFai Wan <kafai.wan@linux.dev> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20251008102628.808045-2-kafai.wan@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-10 10:10:08 -07:00
Rong Tao	b5b693f735	bpf: add bpf_strcasestr,bpf_strncasestr kfuncs bpf_strcasestr() and bpf_strncasestr() functions perform same like bpf_strstr() and bpf_strnstr() except ignoring the case of the characters. Signed-off-by: Rong Tao <rongtao@cestc.cn> Link: https://lore.kernel.org/r/tencent_B01165355D42A8B8BF5E8D0A21EE1A88090A@qq.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-10 10:05:32 -07:00
Kumar Kartikeya Dwivedi	f233d48559	bpf: Refactor storage_get_func_atomic to generic non_sleepable flag Rename the storage_get_func_atomic flag to a more generic non_sleepable flag that tracks whether a helper or kfunc may be called from a non-sleepable context. This makes the flag more broadly applicable beyond just storage_get helpers. See [0] for more context. The flag is now set unconditionally for all helpers and kfuncs when: - RCU critical section is active. - Preemption is disabled. - IRQs are disabled. - In a non-sleepable context within a sleepable program (e.g., timer callbacks), which is indicated by !in_sleepable(). Previously, the flag was only set for storage_get helpers in these contexts. With this change, it can be used by any code that needs to differentiate between sleepable and non-sleepable contexts at the per-instruction level. The existing usage in do_misc_fixups() for storage_get helpers is preserved by checking is_storage_get_function() before using the flag. [0]: https://lore.kernel.org/bpf/CAP01T76cbaNi4p-y8E0sjE2NXSra2S=Uja8G4hSQDu_SbXxREQ@mail.gmail.com Cc: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> Link: https://lore.kernel.org/r/20251007220349.3852807-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-10 10:04:51 -07:00
Kumar Kartikeya Dwivedi	469d638d15	bpf: Fix sleepable context for async callbacks Fix the BPF verifier to correctly determine the sleepable context of async callbacks based on the async primitive type rather than the arming program's context. The bug is in in_sleepable() which uses OR logic to check if the current execution context is sleepable. When a sleepable program arms a timer callback, the callback's state correctly has in_sleepable=false, but in_sleepable() would still return true due to env->prog->sleepable being true. This incorrectly allows sleepable helpers like bpf_copy_from_user() inside timer callbacks when armed from sleepable programs, even though timer callbacks always execute in non-sleepable context. Fix in_sleepable() to rely solely on env->cur_state->in_sleepable, and initialize state->in_sleepable to env->prog->sleepable in do_check_common() for the main program entry. This ensures the sleepable context is properly tracked per verification state rather than being overridden by the program's sleepability. The env->cur_state NULL check in in_sleepable() was only needed for do_misc_fixups() which runs after verification when env->cur_state is set to NULL. Update do_misc_fixups() to use env->prog->sleepable directly for the storage_get_function check, and remove the redundant NULL check from in_sleepable(). Introduce is_async_cb_sleepable() helper to explicitly determine async callback sleepability based on the primitive type: - bpf_timer callbacks are never sleepable - bpf_wq and bpf_task_work callbacks are always sleepable Add verifier_bug() check to catch unhandled async callback types, ensuring future additions cannot be silently mishandled. Move the is_task_work_add_kfunc() forward declaration to the top alongside other callback-related helpers. We update push_async_cb() to adjust to the new changes. At the same time, while simplifying in_sleepable(), we notice a problem in do_misc_fixups. Fix storage_get helpers to use GFP_ATOMIC when called from non-sleepable contexts within sleepable programs, such as bpf_timer callbacks. Currently, the check in do_misc_fixups assumes that env->prog->sleepable, previously in_sleepable(env) which only resolved to this check before last commit, holds across the program's execution, but that is not true. Instead, the func_atomic bit must be set whenever we see the function being called in an atomic context. Previously, this is being done when the helper is invoked in atomic contexts in sleepable programs, we can simply just set the value to true without doing an in_sleepable() check. We must also do a standalone in_sleepable() check to handle cases where the async callback itself is armed from a sleepable program, but is itself non-sleepable (e.g., timer callback) and invokes such a helper, thus needing the func_atomic bit to be true for the said call. Adjust do_misc_fixups() to drop any checks regarding sleepable nature of the program, and just depend on the func_atomic bit to decide which GFP flag to pass. Fixes: `81f1d7a583` ("bpf: wq: add bpf_wq_set_callback_impl") Fixes: `b00fa38a9c` ("bpf: Enable non-atomic allocations in local storage") Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20251007220349.3852807-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-10 10:04:51 -07:00
Siddharth Chintamaneni	56b4d16239	bpf: Cleanup unused func args in rqspinlock implementation cleanup unused function args in check_deadlock* functions. Fixes: `31158ad02d` ("rqspinlock: Add deadlock detection and recovery") Signed-off-by: Siddharth Chintamaneni <sidchintamaneni@gmail.com> Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20251001172702.122838-1-sidchintamaneni@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-07 15:30:43 -07:00
Linus Torvalds	cbf33b8e0b	bpf-fixes -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmjfBtMACgkQ6rmadz2v bToTuA/+JizieMIW6eqzCDrrhj45DmuKu6gUMrkmM0xte3QCoLly5FDGJ9HMmrqf L5xfz/TAExg/Nh9a0zpCGE7/fMr9P10Z5vVHMe/I06rcW+/fDrAERXDurlfpfliv /7KqF/lq1D6kp7bcBzWX4dzkT66Nh/Ax6ZJKIXF/K3iwFXBb1A/95Wwgqxuc3xNM n0kthEhnX6x5cojy6fla2zYNoebNLVxPvafTcAtnE0QM8vxdOl0Q7KWrdQpVE5V/ z6WF2M822eTQeUIy6k/ZRckFlmEwa++I+w5EnygBFP8INUHARdYNBQpQVFkg31jz GYGlbTs7jaDbMNWkEhKyDbuhL5Xq9/5FGOaLI9CvDtpPjbfYTZMGtIn9L902XtqE VEfBcTA/g/qd8Urx5622sfPwau64LrSAKBRE/4CJ7/dsnPx4BdKavsA5DMVuyRIm Au6tOxlRFhPWhultnmmbFBjwn33xTleYNUF9wpWQnijNq+Ph29gKZk8Ac5eKKS6A dTSCxv7LGafezN/evr779vXohZyF9vWfqHdITVzTo40tsSbi5v4uyfJ2qzoaS7Gf UQ4F2nkjqddhG/O1W2TY8aFzmWXP2q92V6yqkcq5zjA3D1Ue6JOxRzG/3kNzgsLX Gf1DmPosUyeLNpwdDYRD83Dk7tJ7rAI+x3i6Iwh8qivIhC4CzrU= =Kzf9 -----END PGP SIGNATURE----- Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull bpf fixes from Alexei Starovoitov: - Fix selftests/bpf (typo, conflicts) and unbreak BPF CI (Jiri Olsa) - Remove linux/unaligned.h dependency for libbpf_sha256 (Andrii Nakryiko) and add a test (Eric Biggers) - Reject negative offsets for ALU operations in the verifier (Yazhou Tang) and add a test (Eduard Zingerman) - Skip scalar adjustment for BPF_NEG operation if destination register is a pointer (Brahmajit Das) and add a test (KaFai Wan) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: libbpf: Fix missing #pragma in libbpf_utils.c selftests/bpf: Add tests for rejection of ALU ops with negative offsets selftests/bpf: Add test for libbpf_sha256() bpf: Reject negative offsets for ALU ops libbpf: remove linux/unaligned.h dependency for libbpf_sha256() libbpf: move libbpf_sha256() implementation into libbpf_utils.c libbpf: move libbpf_errstr() into libbpf_utils.c libbpf: remove unused libbpf_strerror_r and STRERR_BUFSIZE libbpf: make libbpf_errno.c into more generic libbpf_utils.c selftests/bpf: Add test for BPF_NEG alu on CONST_PTR_TO_MAP bpf: Skip scalar adjustment for BPF_NEG if dst is a pointer selftests/bpf: Fix realloc size in bpf_get_addrs selftests/bpf: Fix typo in subtest_basic_usdt after merge conflict selftests/bpf: Fix open-coded gettid syscall in uprobe syscall tests	2025-10-03 19:38:19 -07:00
Linus Torvalds	24d9e8b3c9	slab updates for 6.18 -----BEGIN PGP SIGNATURE----- iQFPBAABCAA5FiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmja74IbFIAAAAAABAAO bWFudTIsMi41KzEuMTEsMiwyAAoJELvgsHXSRYiacR4H/04aBsr7LZnTJVeZLQwK HKoOwXBqiQyqPdjKXGKnp7Mh9gRp2W3V11VsYTuDJNUS+Vz5YXW0z8cRnUfZ3SYs l+GZC3vZeAy2EVJE1U6Mb673hU8vziI80IO2q/tGzaj9a+wC3L0lemc+YFQTwG+u pMtt8zU2vHRjgkx8TNNqJBBOLLDV+RzIl8pqXVnh4eju6x6ZdreGnjXaePYMdjG0 fXLf9XwIeWREqbfeOCEOB50Ts71kkdiOeskwnJyfCTDT8WTu3zC/dICqfh66e3Gg 8hQKvMsuKpm/FwbtgdB0WvaDjENH6PmY+ubLYVxwvNpcsTSqfe0IYGm+HpUP+TPf m+Y= =w+JL -----END PGP SIGNATURE----- Merge tag 'slab-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab updates from Vlastimil Babka: - A new layer for caching objects for allocation and free via percpu arrays called sheaves. The aim is to combine the good parts of SLAB (lower-overhead and simpler percpu caching, compared to SLUB) without the past issues with arrays for freeing remote NUMA node objects and their flushing. It also allows more efficient kfree_rcu(), and cheaper object preallocations for cases where the exact number of objects is unknown, but an upper bound is. Currently VMAs and maple nodes are using this new caching, with a plan to enable it for all caches and remove the complex SLUB fastpath based on cpu (partial) slabs and this_cpu_cmpxchg_double(). (Vlastimil Babka, with Liam Howlett and Pedro Falcato for the maple tree changes) - Re-entrant kmalloc_nolock(), which allows opportunistic allocations from NMI and tracing/kprobe contexts. Building on prior page allocator and memcg changes, it will result in removing BPF-specific caches on top of slab (Alexei Starovoitov) - Various fixes and cleanups. (Kuan-Wei Chiu, Matthew Wilcox, Suren Baghdasaryan, Ye Liu) * tag 'slab-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (40 commits) slab: Introduce kmalloc_nolock() and kfree_nolock(). slab: Reuse first bit for OBJEXTS_ALLOC_FAIL slab: Make slub local_(try)lock more precise for LOCKDEP mm: Introduce alloc_frozen_pages_nolock() mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock(). locking/local_lock: Introduce local_lock_is_locked(). maple_tree: Convert forking to use the sheaf interface maple_tree: Add single node allocation support to maple state maple_tree: Prefilled sheaf conversion and testing tools/testing: Add support for prefilled slab sheafs maple_tree: Replace mt_free_one() with kfree() maple_tree: Use kfree_rcu in ma_free_rcu testing/radix-tree/maple: Hack around kfree_rcu not existing tools/testing: include maple-shim.c in maple.c maple_tree: use percpu sheaves for maple_node_cache mm, vma: use percpu sheaves for vm_area_struct cache tools/testing: Add support for changes to slab for sheaves slab: allow NUMA restricted allocations to use percpu sheaves tools/testing/vma: Implement vm_refcnt reset slab: skip percpu sheaves for remote object freeing ...	2025-10-02 15:58:05 -07:00
Yazhou Tang	55c0ced59f	bpf: Reject negative offsets for ALU ops When verifying BPF programs, the check_alu_op() function validates instructions with ALU operations. The 'offset' field in these instructions is a signed 16-bit integer. The existing check 'insn->off > 1' was intended to ensure the offset is either 0, or 1 for BPF_MOD/BPF_DIV. However, because 'insn->off' is signed, this check incorrectly accepts all negative values (e.g., -1). This commit tightens the validation by changing the condition to '(insn->off != 0 && insn->off != 1)'. This ensures that any value other than the explicitly permitted 0 and 1 is rejected, hardening the verifier against malformed BPF programs. Co-developed-by: Shenghao Yuan <shenghaoyuan0928@163.com> Signed-off-by: Shenghao Yuan <shenghaoyuan0928@163.com> Co-developed-by: Tianci Cao <ziye@zju.edu.cn> Signed-off-by: Tianci Cao <ziye@zju.edu.cn> Signed-off-by: Yazhou Tang <tangyazhou518@outlook.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Fixes: `ec0e2da95f` ("bpf: Support new signed div/mod instructions.") Link: https://lore.kernel.org/r/tencent_70D024BAE70A0A309A4781694C7B764B0608@qq.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-01 15:43:13 -07:00
Brahmajit Das	34904582b5	bpf: Skip scalar adjustment for BPF_NEG if dst is a pointer In check_alu_op(), the verifier currently calls check_reg_arg() and adjust_scalar_min_max_vals() unconditionally for BPF_NEG operations. However, if the destination register holds a pointer, these scalar adjustments are unnecessary and potentially incorrect. This patch adds a check to skip the adjustment logic when the destination register contains a pointer. Reported-by: syzbot+d36d5ae81e1b0a53ef58@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=d36d5ae81e1b0a53ef58 Fixes: `aced132599` ("bpf: Add range tracking for BPF_NEG") Suggested-by: KaFai Wan <kafai.wan@linux.dev> Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Brahmajit Das <listout@listout.xyz> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251001191739.2323644-2-listout@listout.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-01 13:53:20 -07:00
Linus Torvalds	ae28ed4578	bpf-next-6.18 -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmjZH40ACgkQ6rmadz2v bTrG7w//X/5CyDoKIYJCqynYRdMtfqYuCe8Jhud4p5++iBVqkDyS6Y8EFLqZVyg/ UHTqaSE4Nz8/pma0WSjhUYn6Chs1AeH+Rw/g109SovE/YGkek2KNwY3o2hDrtPMX +oD0my8qF2HLKgEyteXXyZ5Ju+AaF92JFiGko4/wNTX8O99F9nyz2pTkrctS9Vl9 VwuTxrEXpmhqrhP3WCxkfNfcbs9HP+AALpgOXZKdMI6T4KI0N1gnJ0ZWJbiXZ8oT tug0MTPkNRidYMl0wHY2LZ6ZG8Q3a7Sgc+M0xFzaHGvGlJbBg1HjsDMtT6j34CrG TIVJ/O8F6EJzAnQ5Hio0FJk8IIgMRgvng5Kd5GXidU+mE6zokTyHIHOXitYkBQNH Hk+lGA7+E2cYqUqKvB5PFoyo+jlucuIH7YwrQlyGfqz+98n65xCgZKcmdVXr0hdB 9v3WmwJFtVIoPErUvBC3KRANQYhFk4eVk1eiGV/20+eIVyUuNbX6wqSWSA9uEXLy n5fm/vlk4RjZmrPZHxcJ0dsl9LTF1VvQQHkgoC1Sz/Cc+jA6k4I+ECVHAqEbk36p 1TUF52yPOD2ViaJKkj+962JaaaXlUn6+Dq7f1GMP6VuyHjz4gsI3mOo4XarqNdWd c7TnYmlGO/cGwqd4DdbmWiF1DDsrBcBzdbC8+FgffxQHLPXGzUg= =LeQi -----END PGP SIGNATURE----- Merge tag 'bpf-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Pull bpf updates from Alexei Starovoitov: - Support pulling non-linear xdp data with bpf_xdp_pull_data() kfunc (Amery Hung) Applied as a stable branch in bpf-next and net-next trees. - Support reading skb metadata via bpf_dynptr (Jakub Sitnicki) Also a stable branch in bpf-next and net-next trees. - Enforce expected_attach_type for tailcall compatibility (Daniel Borkmann) - Replace path-sensitive with path-insensitive live stack analysis in the verifier (Eduard Zingerman) This is a significant change in the verification logic. More details, motivation, long term plans are in the cover letter/merge commit. - Support signed BPF programs (KP Singh) This is another major feature that took years to materialize. Algorithm details are in the cover letter/marge commit - Add support for may_goto instruction to s390 JIT (Ilya Leoshkevich) - Add support for may_goto instruction to arm64 JIT (Puranjay Mohan) - Fix USDT SIB argument handling in libbpf (Jiawei Zhao) - Allow uprobe-bpf program to change context registers (Jiri Olsa) - Support signed loads from BPF arena (Kumar Kartikeya Dwivedi and Puranjay Mohan) - Allow access to union arguments in tracing programs (Leon Hwang) - Optimize rcu_read_lock() + migrate_disable() combination where it's used in BPF subsystem (Menglong Dong) - Introduce bpf_task_work_schedule() kfuncs to schedule deferred execution of BPF callback in the context of a specific task using the kernel’s task_work infrastructure (Mykyta Yatsenko) - Enforce RCU protection for KF_RCU_PROTECTED kfuncs (Kumar Kartikeya Dwivedi) - Add stress test for rqspinlock in NMI (Kumar Kartikeya Dwivedi) - Improve the precision of tnum multiplier verifier operation (Nandakumar Edamana) - Use tnums to improve is_branch_taken() logic (Paul Chaignon) - Add support for atomic operations in arena in riscv JIT (Pu Lehui) - Report arena faults to BPF error stream (Puranjay Mohan) - Search for tracefs at /sys/kernel/tracing first in bpftool (Quentin Monnet) - Add bpf_strcasecmp() kfunc (Rong Tao) - Support lookup_and_delete_elem command in BPF_MAP_STACK_TRACE (Tao Chen) tag 'bpf-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (197 commits) libbpf: Replace AF_ALG with open coded SHA-256 selftests/bpf: Add stress test for rqspinlock in NMI selftests/bpf: Add test case for different expected_attach_type bpf: Enforce expected_attach_type for tailcall compatibility bpftool: Remove duplicate string.h header bpf: Remove duplicate crypto/sha2.h header libbpf: Fix error when st-prefix_ops and ops from differ btf selftests/bpf: Test changing packet data from kfunc selftests/bpf: Add stacktrace map lookup_and_delete_elem test case selftests/bpf: Refactor stacktrace_map case with skeleton bpf: Add lookup_and_delete_elem for BPF_MAP_STACK_TRACE selftests/bpf: Fix flaky bpf_cookie selftest selftests/bpf: Test changing packet data from global functions with a kfunc bpf: Emit struct bpf_xdp_sock type in vmlinux BTF selftests/bpf: Task_work selftest cleanup fixes MAINTAINERS: Delete inactive maintainers from AF_XDP bpf: Mark kfuncs as __noclone selftests/bpf: Add kprobe multi write ctx attach test selftests/bpf: Add kprobe write ctx attach test selftests/bpf: Add uprobe context ip register change test ...	2025-09-30 17:58:11 -07:00
Linus Torvalds	e4dcbdff11	Performance events updates for v6.18: Core perf code updates: - Convert mmap() related reference counts to refcount_t. This is in reaction to the recently fixed refcount bugs, which could have been detected earlier and could have mitigated the bug somewhat. (Thomas Gleixner, Peter Zijlstra) - Clean up and simplify the callchain code, in preparation for sframes. (Steven Rostedt, Josh Poimboeuf) Uprobes updates: - Add support to optimize usdt probes on x86-64, which gives a substantial speedup. (Jiri Olsa) - Cleanups and fixes on x86 (Peter Zijlstra) PMU driver updates: - Various optimizations and fixes to the Intel PMU driver (Dapeng Mi) Misc cleanups and fixes: - Remove redundant __GFP_NOWARN (Qianfeng Rong) Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmjWpGIRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1iHvxAAvO8qWbbhUdF3EZaFU0Wx6oh5KBhImU49 VZ107xe9llA0Szy3hIl1YpdOQA2NAHtma6We/ebonrPVTTkcSCGq8absc+GahA3I CHIomx2hjD0OQ01aHvTqgHJUdFUQQ0yzE3+FY6Tsn05JsNZvDmqpAMIoMQT0LuuG 7VvVRLBuDXtuMtNmGaGCvfDGKTZkGGxD6iZS1iWHuixvVAz4IECK0vYqSyh31UGA w9Jwa0thwjKm2EZTmcSKaHSM2zw3N8QXJ3SNPPThuMrtO6QDz2+3Da9kO+vhGcRP Jls9KnWC2wxNxqIs3dr80Mzn4qMplc67Ekx2tUqX4tYEGGtJQxW6tm3JOKKIgFMI g/KF9/WJPXp0rVI9mtoQkgndzyswR/ZJBAwfEQu+nAqlp3gmmQR9+MeYPCyNnyhB 2g22PTMbXkihJmRPAVeH+WhwFy1YY3nsRhh61ha3/N0ULXTHUh0E+hWwUVMifYSV SwXqQx4srlo6RJJNTji1d6R3muNjXCQNEsJ0lCOX6ajVoxWZsPH2x7/W1A8LKmY+ FLYQUi6X9ogQbOO3WxCjUhzp5nMTNA2vvo87MUzDlZOCLPqYZmqcjntHuXwdjPyO lPcfTzc2nK1Ud26bG3+p2Bk3fjqkX9XcTMFniOvjKfffEfwpAq4xRPBQ3uRlzn0V pf9067JYF+c= =sVXH -----END PGP SIGNATURE----- Merge tag 'perf-core-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull performance events updates from Ingo Molnar: "Core perf code updates: - Convert mmap() related reference counts to refcount_t. This is in reaction to the recently fixed refcount bugs, which could have been detected earlier and could have mitigated the bug somewhat (Thomas Gleixner, Peter Zijlstra) - Clean up and simplify the callchain code, in preparation for sframes (Steven Rostedt, Josh Poimboeuf) Uprobes updates: - Add support to optimize usdt probes on x86-64, which gives a substantial speedup (Jiri Olsa) - Cleanups and fixes on x86 (Peter Zijlstra) PMU driver updates: - Various optimizations and fixes to the Intel PMU driver (Dapeng Mi) Misc cleanups and fixes: - Remove redundant __GFP_NOWARN (Qianfeng Rong)" * tag 'perf-core-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits) selftests/bpf: Fix uprobe_sigill test for uprobe syscall error value uprobes/x86: Return error from uprobe syscall when not called from trampoline perf: Skip user unwind if the task is a kernel thread perf: Simplify get_perf_callchain() user logic perf: Use current->flags & PF_KTHREAD\|PF_USER_WORKER instead of current->mm == NULL perf: Have get_perf_callchain() return NULL if crosstask and user are set perf: Remove get_perf_callchain() init_nr argument perf/x86: Print PMU counters bitmap in x86_pmu_show_pmu_cap() perf/x86/intel: Add ICL_FIXED_0_ADAPTIVE bit into INTEL_FIXED_BITS_MASK perf/x86/intel: Change macro GLOBAL_CTRL_EN_PERF_METRICS to BIT_ULL(48) perf/x86: Add PERF_CAP_PEBS_TIMING_INFO flag perf/x86/intel: Fix IA32_PMC_x_CFG_B MSRs access error perf/x86/intel: Use early_initcall() to hook bts_init() uprobes: Remove redundant __GFP_NOWARN selftests/seccomp: validate uprobe syscall passes through seccomp seccomp: passthrough uprobe systemcall without filtering selftests/bpf: Fix uprobe syscall shadow stack test selftests/bpf: Change test_uretprobe_regs_change for uprobe and uretprobe selftests/bpf: Add uprobe_regs_equal test selftests/bpf: Add optimized usdt variant for basic usdt test ...	2025-09-30 11:11:21 -07:00
Linus Torvalds	6c7340a7a8	Scheduler updates for v6.18: Core scheduler changes: - Make migrate_{en,dis}able() inline, to improve performance (Menglong Dong) - Move STDL_INIT() functions out-of-line (Peter Zijlstra) - Unify the SCHED_{SMT,CLUSTER,MC} Kconfig (Peter Zijlstra) Fair scheduling: - Defer throttling when tasks exit to user-space, to reduce the chance & impact of throttle-preemption with held locks and other resources. (Aaron Lu, Valentin Schneider) - Get rid of sched_domains_curr_level hack for tl->cpumask(), as the warning was getting triggered on certain topologies. (Peter Zijlstra) Misc cleanups & fixes: - Header cleanups (Menglong Dong) - Fix race in push_dl_task() (Harshit Agarwal) Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmjWn0IRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1i9ng/+KJYEsNilPA6nGzZLurU3HXm6S5NrNBPc xJU43eo3QdFUymKqYaCoXCwaMah3m8fFW+DC8ZoEvWC5wHnlIw6+u/ZEtsWQ4R8N 8+sjQddQeb9Gez+nkC7pXX8h1nqlhHyGWU88+9hOyEEMk21KgH7tJ5gOuGQdPhiA v24D8dkS1sT6CAeqZAycYIkos+kfNNV+wAEQCXAxfvSSslpzJVZcj9kdQInxF4O4 9+WrvFZFVYJWdJgmgfbpBMw7N8a4Gpv+Lr6U0ZVXHNeLjNdn60I+5Me+9BW9GRcO pPL50RfGgGgVIqyEHvBhhz8p0tlKUJo9NkRJ9hmNB5SKT3P442SU789ppTYgI1il 5P4g3TiuomqPVuo99/mYHYOpT6NkYC1fkwMP9Hfk4NRUX0EA6lKXelVnMXaC3Z7R U+0xVYtK4BBLRFBo4HIEM1HChSIcOnutuufFX4lPPt+ewnAmJwSt619w+lBF0v+n cpiLIlQ74LrYlrULw3bznnwzh5dV8XHm4CT3XAShm6qB97/SaLr0rI5W7njdSvte ym9z4yFWidawowvLWjFX1CykSnX/T5MV+uzuIH64MEIA39B4VKJWCAC3l3AMJn/5 Ckhf93B5uG0q+wUtE66x/JrN7wtbz9PMpwh01fXNWs/eWc1VKkVrvq+PmHODngmz drSUoI1P570= =8ZTZ -----END PGP SIGNATURE----- Merge tag 'sched-core-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Core scheduler changes: - Make migrate_{en,dis}able() inline, to improve performance (Menglong Dong) - Move STDL_INIT() functions out-of-line (Peter Zijlstra) - Unify the SCHED_{SMT,CLUSTER,MC} Kconfig (Peter Zijlstra) Fair scheduling: - Defer throttling to when tasks exit to user-space, to reduce the chance & impact of throttle-preemption with held locks and other resources (Aaron Lu, Valentin Schneider) - Get rid of sched_domains_curr_level hack for tl->cpumask(), as the warning was getting triggered on certain topologies (Peter Zijlstra) Misc cleanups & fixes: - Header cleanups (Menglong Dong) - Fix race in push_dl_task() (Harshit Agarwal)" * tag 'sched-core-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Fix some typos in include/linux/preempt.h sched: Make migrate_{en,dis}able() inline rcu: Replace preempt.h with sched.h in include/linux/rcupdate.h arch: Add the macro COMPILE_OFFSETS to all the asm-offsets.c sched/fair: Do not balance task to a throttled cfs_rq sched/fair: Do not special case tasks in throttled hierarchy sched/fair: update_cfs_group() for throttled cfs_rqs sched/fair: Propagate load for throttled cfs_rq sched/fair: Get rid of throttled_lb_pair() sched/fair: Task based throttle time accounting sched/fair: Switch to task based throttle model sched/fair: Implement throttle task work and related helpers sched/fair: Add related data structure for task based throttle sched: Unify the SCHED_{SMT,CLUSTER,MC} Kconfig sched: Move STDL_INIT() functions out-of-line sched/fair: Get rid of sched_domains_curr_level hack for tl->cpumask() sched/deadline: Fix race in push_dl_task()	2025-09-30 10:35:11 -07:00
Linus Torvalds	449c2b302c	vfs-6.18-rc1.async -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZQkAAKCRCRxhvAZXjc oijOAQCw+gE9GkvYJ7TOCiqVFj8RZZ1tb7NiCdBRCa4wXM7KMwD9Gwim//G8Hjyy dNPZejphGdxv99Skwp/0nfyDGL6x7w4= =zIJf -----END PGP SIGNATURE----- Merge tag 'vfs-6.18-rc1.async' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs async directory updates from Christian Brauner: "This contains further preparatory changes for the asynchronous directory locking scheme: - Add lookup_one_positive_killable() which allows overlayfs to perform lookup that won't block on a fatal signal - Unify the mount idmap handling in struct renamedata as a rename can only happen within a single mount - Introduce kern_path_parent() for audit which sets the path to the parent and returns a dentry for the target without holding any locks on return - Rename kern_path_locked() as it is only used to prepare for the removal of an object from the filesystem: kern_path_locked() => start_removing_path() kern_path_create() => start_creating_path() user_path_create() => start_creating_user_path() user_path_locked_at() => start_removing_user_path_at() done_path_create() => end_creating_path() NA => end_removing_path()" * tag 'vfs-6.18-rc1.async' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: debugfs: rename start_creating() to debugfs_start_creating() VFS: rename kern_path_locked() and related functions. VFS/audit: introduce kern_path_parent() for audit VFS: unify old_mnt_idmap and new_mnt_idmap in renamedata VFS: discard err2 in filename_create() VFS/ovl: add lookup_one_positive_killable()	2025-09-29 11:55:15 -07:00
Linus Torvalds	b7ce6fa90f	vfs-6.18-rc1.misc -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZQMQAKCRCRxhvAZXjc omNLAQCgrwzd9sa1JTlixweu3OAxQlSEbLuMpEv7Ztm+B7Wz0AD9HtwPC44Kev03 GbMcB2DCFLC4evqYECj6IG7NBmoKsAs= =1ICf -----END PGP SIGNATURE----- Merge tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual selections of misc updates for this cycle. Features: - Add "initramfs_options" parameter to set initramfs mount options. This allows to add specific mount options to the rootfs to e.g., limit the memory size - Add RWF_NOSIGNAL flag for pwritev2() Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE signal from being raised when writing on disconnected pipes or sockets. The flag is handled directly by the pipe filesystem and converted to the existing MSG_NOSIGNAL flag for sockets - Allow to pass pid namespace as procfs mount option Ever since the introduction of pid namespaces, procfs has had very implicit behaviour surrounding them (the pidns used by a procfs mount is auto-selected based on the mounting process's active pidns, and the pidns itself is basically hidden once the mount has been constructed) This implicit behaviour has historically meant that userspace was required to do some special dances in order to configure the pidns of a procfs mount as desired. Examples include: * In order to bypass the mnt_too_revealing() check, Kubernetes creates a procfs mount from an empty pidns so that user namespaced containers can be nested (without this, the nested containers would fail to mount procfs) But this requires forking off a helper process because you cannot just one-shot this using mount(2) * Container runtimes in general need to fork into a container before configuring its mounts, which can lead to security issues in the case of shared-pidns containers (a privileged process in the pidns can interact with your container runtime process) While SUID_DUMP_DISABLE and user namespaces make this less of an issue, the strict need for this due to a minor uAPI wart is kind of unfortunate Things would be much easier if there was a way for userspace to just specify the pidns they want. So this pull request contains changes to implement a new "pidns" argument which can be set using fsconfig(2): fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd); fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0); or classic mount(2) / mount(8): // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid"); Cleanups: - Remove the last references to EXPORT_OP_ASYNC_LOCK - Make file_remove_privs_flags() static - Remove redundant __GFP_NOWARN when GFP_NOWAIT is used - Use try_cmpxchg() in start_dir_add() - Use try_cmpxchg() in sb_init_done_wq() - Replace offsetof() with struct_size() in ioctl_file_dedupe_range() - Remove vfs_ioctl() export - Replace rwlock() with spinlock in epoll code as rwlock causes priority inversion on preempt rt kernels - Make ns_entries in fs/proc/namespaces const - Use a switch() statement() in init_special_inode() just like we do in may_open() - Use struct_size() in dir_add() in the initramfs code - Use str_plural() in rd_load_image() - Replace strcpy() with strscpy() in find_link() - Rename generic_delete_inode() to inode_just_drop() and generic_drop_inode() to inode_generic_drop() - Remove unused arguments from fcntl_{g,s}et_rw_hint() Fixes: - Document @name parameter for name_contains_dotdot() helper - Fix spelling mistake - Always return zero from replace_fd() instead of the file descriptor number - Limit the size for copy_file_range() in compat mode to prevent a signed overflow - Fix debugfs mount options not being applied - Verify the inode mode when loading it from disk in minixfs - Verify the inode mode when loading it from disk in cramfs - Don't trigger automounts with RESOLVE_NO_XDEV If openat2() was called with RESOLVE_NO_XDEV it didn't traverse through automounts, but could still trigger them - Add FL_RECLAIM flag to show_fl_flags() macro so it appears in tracepoints - Fix unused variable warning in rd_load_image() on s390 - Make INITRAMFS_PRESERVE_MTIME depend on BLK_DEV_INITRD - Use ns_capable_noaudit() when determining net sysctl permissions - Don't call path_put() under namespace semaphore in listmount() and statmount()" * tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits) fcntl: trim arguments listmount: don't call path_put() under namespace semaphore statmount: don't call path_put() under namespace semaphore pid: use ns_capable_noaudit() when determining net sysctl permissions fs: rename generic_delete_inode() and generic_drop_inode() init: INITRAMFS_PRESERVE_MTIME should depend on BLK_DEV_INITRD initramfs: Replace strcpy() with strscpy() in find_link() initrd: Use str_plural() in rd_load_image() initramfs: Use struct_size() helper to improve dir_add() initrd: Fix unused variable warning in rd_load_image() on s390 fs: use the switch statement in init_special_inode() fs/proc/namespaces: make ns_entries const filelock: add FL_RECLAIM to show_fl_flags() macro eventpoll: Replace rwlock with spinlock selftests/proc: add tests for new pidns APIs procfs: add "pidns" mount option pidns: move is-ancestor logic to helper openat2: don't trigger automounts with RESOLVE_NO_XDEV namei: move cross-device check to __traverse_mounts namei: remove LOOKUP_NO_XDEV check from handle_mounts ...	2025-09-29 09:03:07 -07:00
Alexei Starovoitov	99253de51f	mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock(). Change alloc_pages_nolock() to default to __GFP_COMP when allocating pages, since upcoming reentrant alloc_slab_page() needs __GFP_COMP. Also allow __GFP_ACCOUNT flag to be specified, since most of BPF infra needs __GFP_ACCOUNT except BPF streams. Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2025-09-29 09:42:35 +02:00
Daniel Borkmann	4540aed51b	bpf: Enforce expected_attach_type for tailcall compatibility Yinhao et al. recently reported: Our fuzzer tool discovered an uninitialized pointer issue in the bpf_prog_test_run_xdp() function within the Linux kernel's BPF subsystem. This leads to a NULL pointer dereference when a BPF program attempts to deference the txq member of struct xdp_buff object. The test initializes two programs of BPF_PROG_TYPE_XDP: progA acts as the entry point for bpf_prog_test_run_xdp() and its expected_attach_type can neither be of be BPF_XDP_DEVMAP nor BPF_XDP_CPUMAP. progA calls into a slot of a tailcall map it owns. progB's expected_attach_type must be BPF_XDP_DEVMAP to pass xdp_is_valid_access() validation. The program returns struct xdp_md's egress_ifindex, and the latter is only allowed to be accessed under mentioned expected_attach_type. progB is then inserted into the tailcall which progA calls. The underlying issue goes beyond XDP though. Another example are programs of type BPF_PROG_TYPE_CGROUP_SOCK_ADDR. sock_addr_is_valid_access() as well as sock_addr_func_proto() have different logic depending on the programs' expected_attach_type. Similarly, a program attached to BPF_CGROUP_INET4_GETPEERNAME should not be allowed doing a tailcall into a program which calls bpf_bind() out of BPF which is only enabled for BPF_CGROUP_INET4_CONNECT. In short, specifying expected_attach_type allows to open up additional functionality or restrictions beyond what the basic bpf_prog_type enables. The use of tailcalls must not violate these constraints. Fix it by enforcing expected_attach_type in __bpf_prog_map_compatible(). Note that we only enforce this for tailcall maps, but not for BPF devmaps or cpumaps: There, the programs are invoked through dev_map_bpf_prog_run() and cpu_map_bpf_prog_run() which set up a new environment / context and therefore these situations are not prone to this issue. Fixes: `5e43f899b0` ("bpf: Check attach type at prog load time") Reported-by: Yinhao Hu <dddddd@hust.edu.cn> Reported-by: Kaiyan Mei <M202472210@hust.edu.cn> Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20250926171201.188490-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-27 06:24:27 -07:00
Tao Chen	17f0d1f632	bpf: Add lookup_and_delete_elem for BPF_MAP_STACK_TRACE The stacktrace map can be easily full, which will lead to failure in obtaining the stack. In addition to increasing the size of the map, another solution is to delete the stack_id after looking it up from the user, so extend the existing bpf_map_lookup_and_delete_elem() functionality to stacktrace map types. Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250925175030.1615837-1-chen.dylane@linux.dev	2025-09-25 16:12:14 -07:00
Menglong Dong	378b770819	sched: Make migrate_{en,dis}able() inline For now, migrate_enable and migrate_disable are global, which makes them become hotspots in some case. Take BPF for example, the function calling to migrate_enable and migrate_disable in BPF trampoline can introduce significant overhead, and following is the 'perf top' of FENTRY's benchmark (./tools/testing/selftests/bpf/bench trig-fentry): 54.63% bpf_prog_2dcccf652aac1793_bench_trigger_fentry [k] bpf_prog_2dcccf652aac1793_bench_trigger_fentry 10.43% [kernel] [k] migrate_enable 10.07% bpf_trampoline_6442517037 [k] bpf_trampoline_6442517037 8.06% [kernel] [k] __bpf_prog_exit_recur 4.11% libc.so.6 [.] syscall 2.15% [kernel] [k] entry_SYSCALL_64 1.48% [kernel] [k] memchr_inv 1.32% [kernel] [k] fput 1.16% [kernel] [k] _copy_to_user 0.73% [kernel] [k] bpf_prog_test_run_raw_tp So in this commit, we make migrate_enable/migrate_disable inline to obtain better performance. The struct rq is defined internally in kernel/sched/sched.h, and the field "nr_pinned" is accessed in migrate_enable/migrate_disable, which makes it hard to make them inline. Alexei Starovoitov suggests to generate the offset of "nr_pinned" in [1], so we can define the migrate_enable/migrate_disable in include/linux/sched.h and access "this_rq()->nr_pinned" with "(void *)this_rq() + RQ_nr_pinned". The offset of "nr_pinned" is generated in include/generated/rq-offsets.h by kernel/sched/rq-offsets.c. Generally speaking, we move the definition of migrate_enable and migrate_disable to include/linux/sched.h from kernel/sched/core.c. The calling to __set_cpus_allowed_ptr() is leaved in ___migrate_enable(). The "struct rq" is not available in include/linux/sched.h, so we can't access the "runqueues" with this_cpu_ptr(), as the compilation will fail in this_cpu_ptr() -> raw_cpu_ptr() -> __verify_pcpu_ptr(): typeof((ptr) + 0) So we introduce the this_rq_raw() and access the runqueues with arch_raw_cpu_ptr/PERCPU_PTR directly. The variable "runqueues" is not visible in the kernel modules, and export it is not a good idea. As Peter Zijlstra advised in [2], we define and export migrate_enable/migrate_disable in kernel/sched/core.c too, and use them for the modules. Before this patch, the performance of BPF FENTRY is: fentry : 113.030 ± 0.149M/s fentry : 112.501 ± 0.187M/s fentry : 112.828 ± 0.267M/s fentry : 115.287 ± 0.241M/s After this patch, the performance of BPF FENTRY increases to: fentry : 143.644 ± 0.670M/s fentry : 149.764 ± 0.362M/s fentry : 149.642 ± 0.156M/s fentry : 145.263 ± 0.221M/s Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/bpf/CAADnVQ+5sEDKHdsJY5ZsfGDO_1SEhhQWHrt2SMBG5SYyQ+jt7w@mail.gmail.com/ [1] Link: https://lore.kernel.org/all/20250819123214.GH4067720@noisy.programming.kicks-ass.net/ [2]	2025-09-25 09:57:16 +02:00
Martin KaFai Lau	34f033a6c9	Merge branch 'bpf-next/xdp_pull_data' into 'bpf-next/master' Merge the xdp_pull_data stable branch into the master branch. No conflict. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2025-09-23 16:23:58 -07:00
Amery Hung	0e7a733ab3	bpf: Clear packet pointers after changing packet data in kfuncs bpf_xdp_pull_data() may change packet data and therefore packet pointers need to be invalidated. Add bpf_xdp_pull_data() to the special kfunc list instead of introducing a new KF_ flag until there are more kfuncs changing packet data. Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20250922233356.3356453-5-ameryhung@gmail.com	2025-09-23 13:35:12 -07:00
Leon Hwang	ccb4f5d91e	bpf: Allow union argument in trampoline based programs Currently, functions with 'union' arguments cannot be traced with fentry/fexit: bpftrace -e 'fentry:release_pages { exit(); }' -v The function release_pages arg0 type UNION is unsupported. The type of the 'release_pages' arg0 is defined as: typedef union { struct page pages; struct folio folios; struct encoded_page **encoded_pages; } release_pages_arg __attribute__ ((__transparent_union__)); This patch relaxes the restriction by allowing function arguments of type 'union' to be traced in verifier. Reviewed-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20250919044110.23729-2-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-23 12:07:46 -07:00
Kumar Kartikeya Dwivedi	a91ae3c893	bpf, x86: Add support for signed arena loads Currently, signed load instructions into arena memory are unsupported. The compiler is free to generate these, and on GCC-14 we see a corresponding error when it happens. The hurdle in supporting them is deciding which unused opcode to use to mark them for the JIT's own consumption. After much thinking, it appears 0xc0 / BPF_NOSPEC can be combined with load instructions to identify signed arena loads. Use this to recognize and JIT them appropriately, and remove the verifier side limitation on the program if the JIT supports them. Co-developed-by: Puranjay Mohan <puranjay@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20250923110157.18326-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-23 12:00:22 -07:00
Mykyta Yatsenko	38aa7003e3	bpf: task work scheduling kfuncs Implementation of the new bpf_task_work_schedule kfuncs, that let a BPF program schedule task_work callbacks for a target task: * bpf_task_work_schedule_signal() - schedules with TWA_SIGNAL * bpf_task_work_schedule_resume() - schedules with TWA_RESUME Each map value should embed a struct bpf_task_work, which the kernel side pairs with struct bpf_task_work_kern, containing a pointer to struct bpf_task_work_ctx, that maintains metadata relevant for the concrete callback scheduling. A small state machine and refcounting scheme ensures safe reuse and teardown. State transitions: _______________________________ \| \| v \| [standby] ---> [pending] --> [scheduling] --> [scheduled] ^ \|________________\|_________ \| \| \| v \| [running] \|_______________________________________________________\| All states may transition into FREED state: [pending] [scheduling] [scheduled] [running] [standby] -> [freed] A FREED terminal state coordinates with map-value deletion (bpf_task_work_cancel_and_free()). Scheduling itself is deferred via irq_work to keep the kfunc callable from NMI context. Lifetime is guarded with refcount_t + RCU Tasks Trace. Main components: * struct bpf_task_work_context – Metadata and state management per task work. * enum bpf_task_work_state – A state machine to serialize work scheduling and execution. * bpf_task_work_schedule() – The central helper that initiates scheduling. * bpf_task_work_acquire_ctx() - Attempts to take ownership of the context, pointed by passed struct bpf_task_work, allocates new context if none exists yet. * bpf_task_work_callback() – Invoked when the actual task_work runs. * bpf_task_work_irq() – An intermediate step (runs in softirq context) to enqueue task work. * bpf_task_work_cancel_and_free() – Cleanup for deleted BPF map entries. Flow of successful task work scheduling 1) bpf_task_work_schedule_* is called from BPF code. 2) Transition state from STANDBY to PENDING, mark context as owned by this task work scheduler 3) irq_work_queue() schedules bpf_task_work_irq(). 4) Transition state from PENDING to SCHEDULING (noop if transition successful) 5) bpf_task_work_irq() attempts task_work_add(). If successful, state transitions to SCHEDULED. 6) Task work calls bpf_task_work_callback(), which transition state to RUNNING. 7) BPF callback is executed 8) Context is cleaned up, refcounts released, context state set back to STANDBY. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Reviewed-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250923112404.668720-8-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-23 07:34:39 -07:00
Mykyta Yatsenko	5e8134f50d	bpf: extract map key pointer calculation Calculation of the BPF map key, given the pointer to a value is duplicated in a couple of places in helpers already, in the next patch another use case is introduced as well. This patch extracts that functionality into a separate function. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250923112404.668720-7-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-23 07:34:38 -07:00
Mykyta Yatsenko	5c8fd7e2b5	bpf: bpf task work plumbing This patch adds necessary plumbing in verifier, syscall and maps to support handling new kfunc bpf_task_work_schedule and kernel structure bpf_task_work. The idea is similar to how we already handle bpf_wq and bpf_timer. verifier changes validate calls to bpf_task_work_schedule to make sure it is safe and expected invariants hold. btf part is required to detect bpf_task_work structure inside map value and store its offset, which will be used in the next patch to calculate key and value addresses. arraymap and hashtab changes are needed to handle freeing of the bpf_task_work: run code needed to deinitialize it, for example cancel task_work callback if possible. The use of bpf_task_work and proper implementation for kfuncs are introduced in the next patch. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250923112404.668720-6-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-23 07:34:38 -07:00
Mykyta Yatsenko	d2699bdb6e	bpf: verifier: permit non-zero returns from async callbacks The verifier currently enforces a zero return value for all async callbacks—a constraint originally introduced for bpf_timer. That restriction is too narrow for other async use cases. Relax the rule by allowing non-zero return codes from async callbacks in general, while preserving the zero-return requirement for bpf_timer to maintain its existing semantics. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250923112404.668720-5-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-23 07:34:38 -07:00
Mykyta Yatsenko	acc3a0d250	bpf: htab: extract helper for freeing special structs Extract the cleanup of known embedded structs into the dedicated helper. Remove duplication and introduce a single source of truth for freeing special embedded structs in hashtab. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250923112404.668720-4-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-23 07:34:38 -07:00
Mykyta Yatsenko	5eab266b80	bpf: extract generic helper from process_timer_func() Refactor the verifier by pulling the common logic from process_timer_func() into a dedicated helper. This allows reusing process_async_func() helper for verifying bpf_task_work struct in the next patch. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Tested-by: syzbot@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/20250923112404.668720-3-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-23 07:34:38 -07:00
Mykyta Yatsenko	f902132616	bpf: refactor special field-type detection Reduce code duplication in detection of the known special field types in map values. This refactoring helps to avoid copying a chunk of code in the next patch of the series. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20250923112404.668720-2-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-23 07:34:38 -07:00
NeilBrown	3d18f80ce1	VFS: rename kern_path_locked() and related functions. kern_path_locked() is now only used to prepare for removing an object from the filesystem (and that is the only credible reason for wanting a positive locked dentry). Thus it corresponds to kern_path_create() and so should have a corresponding name. Unfortunately the name "kern_path_create" is somewhat misleading as it doesn't actually create anything. The recently added simple_start_creating() provides a better pattern I believe. The "start" can be matched with "end" to bracket the creating or removing. So this patch changes names: kern_path_locked -> start_removing_path kern_path_create -> start_creating_path user_path_create -> start_creating_user_path user_path_locked_at -> start_removing_user_path_at done_path_create -> end_creating_path and also introduces end_removing_path() which is identical to end_creating_path(). __start_removing_path (which was __kern_path_locked) is enhanced to call mnt_want_write() for consistency with the start_creating_path(). Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: NeilBrown <neil@brown.name> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-23 12:37:36 +02:00
KP Singh	3492715683	bpf: Implement signature verification for BPF programs This patch extends the BPF_PROG_LOAD command by adding three new fields to `union bpf_attr` in the user-space API: - signature: A pointer to the signature blob. - signature_size: The size of the signature blob. - keyring_id: The serial number of a loaded kernel keyring (e.g., the user or session keyring) containing the trusted public keys. When a BPF program is loaded with a signature, the kernel: 1. Retrieves the trusted keyring using the provided `keyring_id`. 2. Verifies the supplied signature against the BPF program's instruction buffer. 3. If the signature is valid and was generated by a key in the trusted keyring, the program load proceeds. 4. If no signature is provided, the load proceeds as before, allowing for backward compatibility. LSMs can chose to restrict unsigned programs and implement a security policy. 5. If signature verification fails for any reason, the program is not loaded. Tested-by: syzbot@syzkaller.appspotmail.com Signed-off-by: KP Singh <kpsingh@kernel.org> Link: https://lore.kernel.org/r/20250921160120.9711-2-kpsingh@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-22 18:58:03 -07:00
Eduard Zingerman	79f047c7d9	bpf: table based bpf_insn_successors() Converting bpf_insn_successors() to use lookup table makes it ~1.5 times faster. Also remove unnecessary conditionals: - `idx + 1 < prog->len` is unnecessary because after check_cfg() all jump targets are guaranteed to be within a program; - `i == 0 \|\| succ[0] != dst` is unnecessary because any client of bpf_insn_successors() can handle duplicate edges: - compute_live_registers() - compute_scc() Moving bpf_insn_successors() to liveness.c allows its inlining in liveness.c:__update_stack_liveness(). Such inlining speeds up __update_stack_liveness() by ~40%. bpf_insn_successors() is used in both verifier.c and liveness.c. perf shows such move does not negatively impact users in verifier.c, as these are executed only once before main varification pass. Unlike __update_stack_liveness() which can be triggered multiple times. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-10-c3cd27bacc60@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-19 09:27:23 -07:00
Eduard Zingerman	107e169799	bpf: disable and remove registers chain based liveness Remove register chain based liveness tracking: - struct bpf_reg_state->{parent,live} fields are no longer needed; - REG_LIVE_WRITTEN marks are superseded by bpf_mark_stack_write() calls; - mark_reg_read() calls are superseded by bpf_mark_stack_read(); - log.c:print_liveness() is superseded by logging in liveness.c; - propagate_liveness() is superseded by bpf_update_live_stack(); - no need to establish register chains in is_state_visited() anymore; - fix a bunch of tests expecting "_w" suffixes in verifier log messages. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-9-c3cd27bacc60@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-19 09:27:23 -07:00
Eduard Zingerman	ccf25a67c7	bpf: signal error if old liveness is more conservative than new Unlike the new algorithm, register chain based liveness tracking is fully path sensitive, and thus should be strictly more accurate. Validate the new algorithm by signaling an error whenever it considers a stack slot dead while the old algorithm considers it alive. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-8-c3cd27bacc60@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-19 09:27:23 -07:00
Eduard Zingerman	e41c237953	bpf: enable callchain sensitive stack liveness tracking Allocate analysis instance: - Add bpf_stack_liveness_{init,free}() calls to bpf_check(). Notify the instance about any stack reads and writes: - Add bpf_mark_stack_write() call at every location where REG_LIVE_WRITTEN is recorded for a stack slot. - Add bpf_mark_stack_read() call at every location mark_reg_read() is called. - Both bpf_mark_stack_{read,write}() rely on env->liveness->cur_instance callchain being in sync with env->cur_state. It is possible to update env->liveness->cur_instance every time a mark read/write is called, but that costs a hash table lookup and is noticeable in the performance profile. Hence, manually reset env->liveness->cur_instance whenever the verifier changes env->cur_state call stack: - call bpf_reset_live_stack_callchain() when the verifier enters a subprogram; - call bpf_update_live_stack() when the verifier exits a subprogram (it implies the reset). Make sure bpf_update_live_stack() is called for a callchain before issuing liveness queries. And make sure that bpf_update_live_stack() is called for any callee callchain first: - Add bpf_update_live_stack() call at every location that processes BPF_EXIT: - exit from a subprogram; - before pop_stack() call. This makes sure that bpf_update_live_stack() is called for callee callchains before caller callchains. Make sure must_write marks are set to zero for instructions that do not always access the stack: - Wrap do_check_insn() with bpf_reset_stack_write_marks() / bpf_commit_stack_write_marks() calls. Any calls to bpf_mark_stack_write() are accumulated between this pair of calls. If no bpf_mark_stack_write() calls were made it means that the instruction does not access stack (at-least on the current verification path) and it is important to record this fact. Finally, use bpf_live_stack_query_init() / bpf_stack_slot_alive() to query stack liveness info. The manual tracking of the correct order for callee/caller bpf_update_live_stack() calls is a bit convoluted and may warrant some automation in future revisions. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-7-c3cd27bacc60@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-19 09:27:23 -07:00
Eduard Zingerman	b3698c356a	bpf: callchain sensitive stack liveness tracking using CFG This commit adds a flow-sensitive, context-sensitive, path-insensitive data flow analysis for live stack slots: - flow-sensitive: uses program control flow graph to compute data flow values; - context-sensitive: collects data flow values for each possible call chain in a program; - path-insensitive: does not distinguish between separate control flow graph paths reaching the same instruction. Compared to the current path-sensitive analysis, this approach trades some precision for not having to enumerate every path in the program. This gives a theoretical capability to run the analysis before main verification pass. See cover letter for motivation. The basic idea is as follows: - Data flow values indicate stack slots that might be read and stack slots that are definitely written. - Data flow values are collected for each (call chain, instruction number) combination in the program. - Within a subprogram, data flow values are propagated using control flow graph. - Data flow values are transferred from entry instructions of callee subprograms to call sites in caller subprograms. In other words, a tree of all possible call chains is constructed. Each node of this tree represents a subprogram. Read and write marks are collected for each instruction of each node. Live stack slots are first computed for lower level nodes. Then, information about outer stack slots that might be read or are definitely written by a subprogram is propagated one level up, to the corresponding call instructions of the upper nodes. Procedure repeats until root node is processed. In the absence of value range analysis, stack read/write marks are collected during main verification pass, and data flow computation is triggered each time verifier.c:states_equal() needs to query the information. Implementation details are documented in kernel/bpf/liveness.c. Quantitative data about verification performance changes and memory consumption is in the cover letter. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-6-c3cd27bacc60@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-19 09:27:23 -07:00
Eduard Zingerman	efcda22aa5	bpf: compute instructions postorder per subprogram The next patch would require doing postorder traversal of individual subprograms. Facilitate this by moving env->cfg.insn_postorder computation from check_cfg() to a separate pass, as check_cfg() descends into called subprograms (and it needs to, because of merge_callee_effects() logic). env->cfg.insn_postorder is used only by compute_live_registers(), this function does not track cross subprogram dependencies, thus the change does not affect it's operation. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-5-c3cd27bacc60@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-19 09:27:23 -07:00
Eduard Zingerman	3b20d3c120	bpf: declare a few utility functions as internal api Namely, rename the following functions and add prototypes to bpf_verifier.h: - find_containing_subprog -> bpf_find_containing_subprog - insn_successors -> bpf_insn_successors - calls_callback -> bpf_calls_callback - fmt_stack_mask -> bpf_fmt_stack_mask Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-4-c3cd27bacc60@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-19 09:27:22 -07:00
Eduard Zingerman	12a23f93a5	bpf: remove redundant REG_LIVE_READ check in stacksafe() stacksafe() is called in exact == NOT_EXACT mode only for states that had been porcessed by clean_verifier_states(). The latter replaces dead stack spills with a series of STACK_INVALID masks. Such masks are already handled by stacksafe(). Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-3-c3cd27bacc60@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-19 09:27:22 -07:00
Eduard Zingerman	6cd21eb9ad	bpf: use compute_live_registers() info in clean_func_state Prepare for bpf_reg_state->live field removal by leveraging insn_aux_data->live_regs_before instead of bpf_reg_state->live in compute_live_registers(). This is similar to logic in func_states_equal(). No changes in verification performance for selftests or sched_ext. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-2-c3cd27bacc60@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-19 09:27:22 -07:00
Eduard Zingerman	daf4c2929f	bpf: bpf_verifier_state->cleaned flag instead of REG_LIVE_DONE Prepare for bpf_reg_state->live field removal by introducing a separate flag to track if clean_verifier_state() had been applied to the state. No functional changes. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-1-c3cd27bacc60@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-19 09:27:22 -07:00
KP Singh	8cd189e414	bpf: Move the signature kfuncs to helpers.c No functional changes, except for the addition of the headers for the kfuncs so that they can be used for signature verification. Signed-off-by: KP Singh <kpsingh@kernel.org> Link: https://lore.kernel.org/r/20250914215141.15144-8-kpsingh@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-18 19:11:42 -07:00
KP Singh	ea2e6467ac	bpf: Return hashes of maps in BPF_OBJ_GET_INFO_BY_FD Currently only array maps are supported, but the implementation can be extended for other maps and objects. The hash is memoized only for exclusive and frozen maps as their content is stable until the exclusive program modifies the map. This is required for BPF signing, enabling a trusted loader program to verify a map's integrity. The loader retrieves the map's runtime hash from the kernel and compares it against an expected hash computed at build time. Signed-off-by: KP Singh <kpsingh@kernel.org> Link: https://lore.kernel.org/r/20250914215141.15144-7-kpsingh@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-18 19:11:42 -07:00
KP Singh	baefdbdf68	bpf: Implement exclusive map creation Exclusive maps allow maps to only be accessed by program with a program with a matching hash which is specified in the excl_prog_hash attr. For the signing use-case, this allows the trusted loader program to load the map and verify the integrity Signed-off-by: KP Singh <kpsingh@kernel.org> Link: https://lore.kernel.org/r/20250914215141.15144-3-kpsingh@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-18 19:11:42 -07:00
KP Singh	603b441623	bpf: Update the bpf_prog_calc_tag to use SHA256 Exclusive maps restrict map access to specific programs using a hash. The current hash used for this is SHA1, which is prone to collisions. This patch uses SHA256, which is more resilient against collisions. This new hash is stored in bpf_prog and used by the verifier to determine if a program can access a given exclusive map. The original 64-bit tags are kept, as they are used by users as a short, possibly colliding program identifier for non-security purposes. Signed-off-by: KP Singh <kpsingh@kernel.org> Link: https://lore.kernel.org/r/20250914215141.15144-2-kpsingh@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-18 19:10:20 -07:00
Kumar Kartikeya Dwivedi	1512231b6c	bpf: Enforce RCU protection for KF_RCU_PROTECTED Currently, KF_RCU_PROTECTED only applies to iterator APIs and that too in a convoluted fashion: the presence of this flag on the kfunc is used to set MEM_RCU in iterator type, and the lack of RCU protection results in an error only later, once next() or destroy() methods are invoked on the iterator. While there is no bug, this is certainly a bit unintuitive, and makes the enforcement of the flag iterator specific. In the interest of making this flag useful for other upcoming kfuncs, e.g. scx_bpf_cpu_curr() [0][1], add enforcement for invoking the kfunc in an RCU critical section in general. This would also mean that iterator APIs using KF_RCU_PROTECTED will error out earlier, instead of throwing an error for lack of RCU CS protection when next() or destroy() methods are invoked. In addition to this, if the kfuncs tagged KF_RCU_PROTECTED return a pointer value, ensure that this pointer value is only usable in an RCU critical section. There might be edge cases where the return value is special and doesn't need to imply MEM_RCU semantics, but in general, the assumption should hold for the majority of kfuncs, and we can revisit things if necessary later. [0]: https://lore.kernel.org/all/20250903212311.369697-3-christian.loehle@arm.com [1]: https://lore.kernel.org/all/20250909195709.92669-1-arighi@nvidia.com Tested-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250917032755.4068726-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-18 15:36:17 -07:00
Eduard Zingerman	a3c73d629e	bpf: dont report verifier bug for missing bpf_scc_visit on speculative path Syzbot generated a program that triggers a verifier_bug() call in maybe_exit_scc(). maybe_exit_scc() assumes that, when called for a state with insn_idx in some SCC, there should be an instance of struct bpf_scc_visit allocated for that SCC. Turns out the assumption does not hold for speculative execution paths. See example in the next patch. maybe_scc_exit() is called from update_branch_counts() for states that reach branch count of zero, meaning that path exploration for a particular path is finished. Path exploration can finish in one of three ways: a. Verification error is found. In this case, update_branch_counts() is called only for non-speculative paths. b. Top level BPF_EXIT is reached. Such instructions are never a part of an SCC, so compute_scc_callchain() in maybe_scc_exit() will return false, and maybe_scc_exit() will return early. c. A checkpoint is reached and matched. Checkpoints are created by is_state_visited(), which calls maybe_enter_scc(), which allocates bpf_scc_visit instances for checkpoints within SCCs. Hence, for non-speculative symbolic execution paths, the assumption still holds: if maybe_scc_exit() is called for a state within an SCC, bpf_scc_visit instance must exist. This patch removes the verifier_bug() call for speculative paths. Fixes: `c9e31900b5` ("bpf: propagate read/precision marks over state graph backedges") Reported-by: syzbot+3afc814e8df1af64b653@syzkaller.appspotmail.com Closes: https://lore.kernel.org/bpf/68c85acd.050a0220.2ff435.03a4.GAE@google.com/ Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250916212251.3490455-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-17 11:19:58 -07:00
Eduard Zingerman	b13448dd64	bpf: potential double-free of env->insn_aux_data Function bpf_patch_insn_data() has the following structure: static struct bpf_prog bpf_patch_insn_data(... env ...) { struct bpf_prog new_prog; struct bpf_insn_aux_data *new_data = NULL; if (len > 1) { new_data = vrealloc(...); // <--------- (1) if (!new_data) return NULL; env->insn_aux_data = new_data; // <---- (2) } new_prog = bpf_patch_insn_single(env->prog, off, patch, len); if (IS_ERR(new_prog)) { ... vfree(new_data); // <----------------- (3) return NULL; } ... happy path ... } In case if bpf_patch_insn_single() returns an error the `new_data` allocated at (1) will be freed at (3). However, at (2) this pointer is stored in `env->insn_aux_data`. Which is freed unconditionally by verifier.c:bpf_check() on both happy and error paths. Thus, leading to double-free. Fix this by removing vfree() call at (3), ownership over `new_data` is already passed to `env->insn_aux_data` at this point. Fixes: `77620d1267` ("bpf: use realloc in bpf_patch_insn_data") Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250912-patch-insn-data-double-free-v1-1-af05bd85a21a@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-15 13:04:21 -07:00
Kumar Kartikeya Dwivedi	2c89513395	bpf: Do not limit bpf_cgroup_from_id to current's namespace The bpf_cgroup_from_id kfunc relies on cgroup_get_from_id to obtain the cgroup corresponding to a given cgroup ID. This helper can be called in a lot of contexts where the current thread can be random. A recent example was its use in sched_ext's ops.tick(), to obtain the root cgroup pointer. Since the current task can be whatever random user space task preempted by the timer tick, this makes the behavior of the helper unreliable. Refactor out __cgroup_get_from_id as the non-namespace aware version of cgroup_get_from_id, and change bpf_cgroup_from_id to make use of it. There is no compatibility breakage here, since changing the namespace against which the lookup is being done to the root cgroup namespace only permits a wider set of lookups to succeed now. The cgroup IDs across namespaces are globally unique, and thus don't need to be retranslated. Reported-by: Dan Schatzberg <dschatzberg@meta.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250915032618.1551762-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-15 10:53:15 -07:00
Mateusz Guzik	f99b391778	fs: rename generic_delete_inode() and generic_drop_inode() generic_delete_inode() is rather misleading for what the routine is doing. inode_just_drop() should be much clearer. The new naming is inconsistent with generic_drop_inode(), so rename that one as well with inode_ as the suffix. No functional changes. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-15 16:09:42 +02:00
Puranjay Mohan	5c5240d020	bpf: Report arena faults to BPF stderr Begin reporting arena page faults and the faulting address to BPF program's stderr, this patch adds support in the arm64 and x86-64 JITs, support for other archs can be added later. The fault handlers receive the 32 bit address in the arena region so the upper 32 bits of user_vm_start is added to it before printing the address. This is what the user would expect to see as this is what is printed by bpf_printk() is you pass it an address returned by bpf_arena_alloc_pages(); Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20250911145808.58042-4-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-11 13:00:43 -07:00
Puranjay Mohan	70f23546d2	bpf: core: introduce main_prog_aux for stream access BPF streams are only valid for the main programs, to make it easier to access streams from subprogs, introduce main_prog_aux in struct bpf_prog_aux. prog->aux->main_prog_aux = prog->aux, for main programs and prog->aux->main_prog_aux = main_prog->aux, for subprograms. Make bpf_prog_find_from_stack() use the added main_prog_aux to return the mainprog when a subprog is found on the stack. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250911145808.58042-3-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-11 13:00:43 -07:00
Alexei Starovoitov	5d87e96a49	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc5 Cross-merge BPF and other fixes after downstream PR. No conflicts. Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-11 09:34:37 -07:00
Leon Hwang	e25ddfb388	bpf: Reject bpf_timer for PREEMPT_RT When enable CONFIG_PREEMPT_RT, the kernel will warn when run timer selftests by './test_progs -t timer': BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48 In order to avoid such warning, reject bpf_timer in verifier when PREEMPT_RT is enabled. Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20250910125740.52172-2-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-10 12:34:09 -07:00
Peilin Ye	6d78b4473c	bpf: Tell memcg to use allow_spinning=false path in bpf_timer_init() Currently, calling bpf_map_kmalloc_node() from __bpf_async_init() can cause various locking issues; see the following stack trace (edited for style) as one example: ... [10.011566] do_raw_spin_lock.cold [10.011570] try_to_wake_up (5) double-acquiring the same [10.011575] kick_pool rq_lock, causing a hardlockup [10.011579] __queue_work [10.011582] queue_work_on [10.011585] kernfs_notify [10.011589] cgroup_file_notify [10.011593] try_charge_memcg (4) memcg accounting raises an [10.011597] obj_cgroup_charge_pages MEMCG_MAX event [10.011599] obj_cgroup_charge_account [10.011600] __memcg_slab_post_alloc_hook [10.011603] __kmalloc_node_noprof ... [10.011611] bpf_map_kmalloc_node [10.011612] __bpf_async_init [10.011615] bpf_timer_init (3) BPF calls bpf_timer_init() [10.011617] bpf_prog_xxxxxxxxxxxxxxxx_fcg_runnable [10.011619] bpf__sched_ext_ops_runnable [10.011620] enqueue_task_scx (2) BPF runs with rq_lock held [10.011622] enqueue_task [10.011626] ttwu_do_activate [10.011629] sched_ttwu_pending (1) grabs rq_lock ... The above was reproduced on bpf-next (`b338cf849e`) by modifying ./tools/sched_ext/scx_flatcg.bpf.c to call bpf_timer_init() during ops.runnable(), and hacking the memcg accounting code a bit to make a bpf_timer_init() call more likely to raise an MEMCG_MAX event. We have also run into other similar variants (both internally and on bpf-next), including double-acquiring cgroup_file_kn_lock, the same worker_pool::lock, etc. As suggested by Shakeel, fix this by using __GFP_HIGH instead of GFP_ATOMIC in __bpf_async_init(), so that e.g. if try_charge_memcg() raises an MEMCG_MAX event, we call __memcg_memory_event() with @allow_spinning=false and avoid calling cgroup_file_notify() there. Depends on mm patch "memcg: skip cgroup_file_notify if spinning is not allowed": https://lore.kernel.org/bpf/20250905201606.66198-1-shakeel.butt@linux.dev/ v0 approach s/bpf_map_kmalloc_node/bpf_mem_alloc/ https://lore.kernel.org/bpf/20250905061919.439648-1-yepeilin@google.com/ v1 approach: https://lore.kernel.org/bpf/20250905234547.862249-1-yepeilin@google.com/ Fixes: `b00628b1c7` ("bpf: Introduce bpf timers.") Suggested-by: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Peilin Ye <yepeilin@google.com> Link: https://lore.kernel.org/r/20250909095222.2121438-1-yepeilin@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-09 15:24:34 -07:00
KaFai Wan	df0cb5cb50	bpf: Allow fall back to interpreter for programs with stack size <= 512 OpenWRT users reported regression on ARMv6 devices after updating to latest HEAD, where tcpdump filter: tcpdump "not ether host 3c37121a2b3c and not ether host 184ecbca2a3a \ and not ether host 14130b4d3f47 and not ether host f0f61cf440b7 \ and not ether host a84b4dedf471 and not ether host d022be17e1d7 \ and not ether host 5c497967208b and not ether host 706655784d5b" fails with warning: "Kernel filter failed: No error information" when using config: # CONFIG_BPF_JIT_ALWAYS_ON is not set CONFIG_BPF_JIT_DEFAULT_ON=y The issue arises because commits: 1. "bpf: Fix array bounds error with may_goto" changed default runtime to __bpf_prog_ret0_warn when jit_requested = 1 2. "bpf: Avoid __bpf_prog_ret0_warn when jit fails" returns error when jit_requested = 1 but jit fails This change restores interpreter fallback capability for BPF programs with stack size <= 512 bytes when jit fails. Reported-by: Felix Fietkau <nbd@nbd.name> Closes: https://lore.kernel.org/bpf/2e267b4b-0540-45d8-9310-e127bf95fc63@nbd.name/ Fixes: `6ebc5030e0` ("bpf: Fix array bounds error with may_goto") Signed-off-by: KaFai Wan <kafai.wan@linux.dev> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250909144614.2991253-1-kafai.wan@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-09 15:12:16 -07:00
Kumar Kartikeya Dwivedi	0d80e7f951	rqspinlock: Choose trylock fallback for NMI waiters Currently, out of all 3 types of waiters in the rqspinlock slow path (i.e., pending bit waiter, wait queue head waiter, and wait queue non-head waiter), only the pending bit waiter and wait queue head waiters apply deadlock checks and a timeout on their waiting loop. The assumption here was that the wait queue head's forward progress would be sufficient to identify cases where the lock owner or pending bit waiter is stuck, and non-head waiters relying on the head waiter would prove to be sufficient for their own forward progress. However, the head waiter itself can be preempted by a non-head waiter for the same lock (AA) or a different lock (ABBA) in a manner that impedes its forward progress. In such a case, non-head waiters not performing deadlock and timeout checks becomes insufficient, and the system can enter a state of lockup. This is typically not a concern with non-NMI lock acquisitions, as lock holders which in run in different contexts (IRQ, non-IRQ) use "irqsave" variants of the lock APIs, which naturally excludes such lock holders from preempting one another on the same CPU. It might seem likely that a similar case may occur for rqspinlock when programs are attached to contention tracepoints (begin, end), however, these tracepoints either precede the enqueue into the wait queue, or succeed it, therefore cannot be used to preempt a head waiter's waiting loop. We must still be careful against nested kprobe and fentry programs that may attach to the middle of the head's waiting loop to stall forward progress and invoke another rqspinlock acquisition that proceeds as a non-head waiter. To this end, drop CC_FLAGS_FTRACE from the rqspinlock.o object file. For now, this issue is resolved by falling back to a repeated trylock on the lock word from NMI context, while performing the deadlock checks to break out early in case forward progress is impossible, and use the timeout as a final fallback. A more involved fix to terminate the queue when such a condition occurs will be made as a follow up. A selftest to stress this aspect of nested NMI/non-NMI locking attempts will be added in a subsequent patch to the bpf-next tree when this fix lands and trees are synchronized. Reported-by: Josef Bacik <josef@toxicpanda.com> Fixes: `164c246571` ("rqspinlock: Protect waiters in queue from stalls") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250909184959.3509085-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-09 15:10:28 -07:00
Rong Tao	7edfc02470	bpf: Fix bpf_strnstr() to handle suffix match cases better bpf_strnstr() should not treat the ending '\0' of s2 as a matching character if the parameter 'len' equal to s2 string length, for example: 1. bpf_strnstr("openat", "open", 4) = -ENOENT 2. bpf_strnstr("openat", "open", 5) = 0 This patch makes (1) return 0, fix just the `len == strlen(s2)` case. And fix a more general case when s2 is a suffix of the first len characters of s1. Fixes: `e91370550f` ("bpf: Add kfuncs for read-only string operations") Signed-off-by: Rong Tao <rongtao@cestc.cn> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/tencent_17DC57B9D16BC443837021BEACE84B7C1507@qq.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-09 15:07:58 -07:00
Daniel Borkmann	f9bb6ffa7f	bpf: Fix out-of-bounds dynptr write in bpf_crypto_crypt Stanislav reported that in bpf_crypto_crypt() the destination dynptr's size is not validated to be at least as large as the source dynptr's size before calling into the crypto backend with 'len = src_len'. This can result in an OOB write when the destination is smaller than the source. Concretely, in mentioned function, psrc and pdst are both linear buffers fetched from each dynptr: psrc = __bpf_dynptr_data(src, src_len); [...] pdst = __bpf_dynptr_data_rw(dst, dst_len); [...] err = decrypt ? ctx->type->decrypt(ctx->tfm, psrc, pdst, src_len, piv) : ctx->type->encrypt(ctx->tfm, psrc, pdst, src_len, piv); The crypto backend expects pdst to be large enough with a src_len length that can be written. Add an additional src_len > dst_len check and bail out if it's the case. Note that these kfuncs are accessible under root privileges only. Fixes: `3e1c6f3540` ("bpf: make common crypto API for TC/XDP programs") Reported-by: Stanislav Fort <disclosure@aisle.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Vadim Fedorenko <vadim.fedorenko@linux.dev> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://lore.kernel.org/r/20250829143657.318524-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-09 15:07:57 -07:00
Marco Crivellari	a857210b10	bpf: WQ_PERCPU added to alloc_workqueue users Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistentcy cannot be addressed without refactoring the API. alloc_workqueue() treats all queues as per-CPU by default, while unbound workqueues must opt-in via WQ_UNBOUND. This default is suboptimal: most workloads benefit from unbound queues, allowing the scheduler to place worker threads where they’re needed and reducing noise when CPUs are isolated. This default is suboptimal: most workloads benefit from unbound queues, allowing the scheduler to place worker threads where they’re needed and reducing noise when CPUs are isolated. This patch adds a new WQ_PERCPU flag to explicitly request the use of the per-CPU behavior. Both flags coexist for one release cycle to allow callers to transition their calls. Once migration is complete, WQ_UNBOUND can be removed and unbound will become the implicit default. With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND), any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND must now use WQ_PERCPU. All existing users have been updated accordingly. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Link: https://lore.kernel.org/r/20250905085309.94596-4-marco.crivellari@suse.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-08 10:04:37 -07:00
Marco Crivellari	0409819a00	bpf: replace use of system_unbound_wq with system_dfl_wq Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistentcy cannot be addressed without refactoring the API. system_unbound_wq should be the default workqueue so as not to enforce locality constraints for random work whenever it's not required. Adding system_dfl_wq to encourage its use when unbound work should be used. queue_work() / queue_delayed_work() / mod_delayed_work() will now use the new unbound wq: whether the user still use the old wq a warn will be printed along with a wq redirect to the new one. The old system_unbound_wq will be kept for a few release cycles. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Link: https://lore.kernel.org/r/20250905085309.94596-3-marco.crivellari@suse.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-08 10:04:37 -07:00
Marco Crivellari	34f86083a4	bpf: replace use of system_wq with system_percpu_wq Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistentcy cannot be addressed without refactoring the API. system_wq is a per-CPU worqueue, yet nothing in its name tells about that CPU affinity constraint, which is very often not required by users. Make it clear by adding a system_percpu_wq. queue_work() / queue_delayed_work() mod_delayed_work() will now use the new per-cpu wq: whether the user still stick on the old name a warn will be printed along a wq redirect to the new one. This patch add the new system_percpu_wq except for mm, fs and net subsystem, whom are handled in separated patches. The old wq will be kept for a few release cylces. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Link: https://lore.kernel.org/r/20250905085309.94596-2-marco.crivellari@suse.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-08 10:04:37 -07:00
Rong Tao	19559e8441	bpf: add bpf_strcasecmp kfunc bpf_strcasecmp() function performs same like bpf_strcmp() except ignoring the case of the characters. Signed-off-by: Rong Tao <rongtao@cestc.cn> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Viktor Malik <vmalik@redhat.com> Link: https://lore.kernel.org/r/tencent_292BD3682A628581AA904996D8E59F4ACD06@qq.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-09-04 09:00:57 -07:00
Feng Yang	e4980fa646	bpf: Replace kvfree with kfree for kzalloc memory These pointers are allocated by kzalloc. Therefore, replace kvfree() with kfree() to avoid unnecessary is_vmalloc_addr() check in kvfree(). This is the remaining unmodified part from [1]. Signed-off-by: Feng Yang <yangfeng@kylinos.cn> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20250811123949.552885-1-rongqianfeng@vivo.com [1] Link: https://lore.kernel.org/bpf/20250827032812.498216-1-yangfeng59949@163.com	2025-09-02 17:29:52 +02:00
Nandakumar Edamana	1df7dad4d5	bpf: Improve the general precision of tnum_mul Drop the value-mask decomposition technique and adopt straightforward long-multiplication with a twist: when LSB(a) is uncertain, find the two partial products (for LSB(a) = known 0 and LSB(a) = known 1) and take a union. Experiment shows that applying this technique in long multiplication improves the precision in a significant number of cases (at the cost of losing precision in a relatively lower number of cases). Signed-off-by: Nandakumar Edamana <nandakumar@nandakumar.co.in> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Reviewed-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20250826034524.2159515-1-nandakumar@nandakumar.co.in	2025-08-27 15:00:26 -07:00
Josh Poimboeuf	e649bcda25	perf: Remove get_perf_callchain() init_nr argument The 'init_nr' argument has double duty: it's used to initialize both the number of contexts and the number of stack entries. That's confusing and the callers always pass zero anyway. Hard code the zero. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Namhyung Kim <Namhyung@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/r/20250820180428.259565081@kernel.org	2025-08-26 09:51:12 +02:00
Menglong Dong	8e4f0b1ebc	bpf: use rcu_read_lock_dont_migrate() for trampoline.c Use rcu_read_lock_dont_migrate() and rcu_read_unlock_migrate() in trampoline.c to obtain better performance when PREEMPT_RCU is not enabled. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Link: https://lore.kernel.org/r/20250821090609.42508-8-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-08-25 18:52:16 -07:00
Menglong Dong	427a36bb55	bpf: use rcu_read_lock_dont_migrate() for bpf_prog_run_array_cg() Use rcu_read_lock_dont_migrate() and rcu_read_unlock_migrate() in bpf_prog_run_array_cg to obtain better performance when PREEMPT_RCU is not enabled. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Link: https://lore.kernel.org/r/20250821090609.42508-7-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-08-25 18:52:16 -07:00
Menglong Dong	cf4303b70d	bpf: use rcu_read_lock_dont_migrate() for bpf_task_storage_free() Use rcu_read_lock_dont_migrate() and rcu_read_unlock_migrate() in bpf_task_storage_free to obtain better performance when PREEMPT_RCU is not enabled. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Link: https://lore.kernel.org/r/20250821090609.42508-6-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-08-25 18:52:16 -07:00
Menglong Dong	68748f0397	bpf: use rcu_read_lock_dont_migrate() for bpf_iter_run_prog() Use rcu_read_lock_dont_migrate() and rcu_read_unlock_migrate() in bpf_iter_run_prog to obtain better performance when PREEMPT_RCU is not enabled. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Link: https://lore.kernel.org/r/20250821090609.42508-5-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-08-25 18:52:16 -07:00
Menglong Dong	f2fa9b9069	bpf: use rcu_read_lock_dont_migrate() for bpf_inode_storage_free() Use rcu_read_lock_dont_migrate() and rcu_read_unlock_migrate() in bpf_inode_storage_free to obtain better performance when PREEMPT_RCU is not enabled. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Link: https://lore.kernel.org/r/20250821090609.42508-4-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-08-25 18:52:16 -07:00
Menglong Dong	8c0afc7c9c	bpf: use rcu_read_lock_dont_migrate() for bpf_cgrp_storage_free() Use rcu_read_lock_dont_migrate() and rcu_read_unlock_migrate() in bpf_cgrp_storage_free to obtain better performance when PREEMPT_RCU is not enabled. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Link: https://lore.kernel.org/r/20250821090609.42508-3-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-08-25 18:52:16 -07:00
Tao Chen	4223bf833c	bpf: Remove preempt_disable in bpf_try_get_buffers Now BPF program will run with migration disabled, so it is safe to access this_cpu_inc_return(bpf_bprintf_nest_level). Fixes: `d9c9e4db18` ("bpf: Factorize bpf_trace_printk and bpf_seq_printf") Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250819125638.2544715-1-chen.dylane@linux.dev	2025-08-22 11:44:09 -07:00
Eric Biggers	d47cc4dea1	bpf: Use sha1() instead of sha1_transform() in bpf_prog_calc_tag() Now that there's a proper SHA-1 library API, just use that instead of the low-level SHA-1 compression function. This eliminates the need for bpf_prog_calc_tag() to implement the SHA-1 padding itself. No functional change; the computed tags remain the same. Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20250811201615.564461-1-ebiggers@kernel.org	2025-08-22 11:40:05 -07:00
Paul Chaignon	f41345f47f	bpf: Use tnums for JEQ/JNE is_branch_taken logic In the following toy program (reg states minimized for readability), R0 and R1 always have different values at instruction 6. This is obvious when reading the program but cannot be guessed from ranges alone as they overlap (R0 in [0; 0xc0000000], R1 in [1024; 0xc0000400]). 0: call bpf_get_prandom_u32#7 ; R0_w=scalar() 1: w0 = w0 ; R0_w=scalar(var_off=(0x0; 0xffffffff)) 2: r0 >>= 30 ; R0_w=scalar(var_off=(0x0; 0x3)) 3: r0 <<= 30 ; R0_w=scalar(var_off=(0x0; 0xc0000000)) 4: r1 = r0 ; R1_w=scalar(var_off=(0x0; 0xc0000000)) 5: r1 += 1024 ; R1_w=scalar(var_off=(0x400; 0xc0000000)) 6: if r1 != r0 goto pc+1 Looking at tnums however, we can deduce that R1 is always different from R0 because their tnums don't agree on known bits. This patch uses this logic to improve is_scalar_branch_taken in case of BPF_JEQ and BPF_JNE. This change has a tiny impact on complexity, which was measured with the Cilium complexity CI test. That test covers 72 programs with various build and load time configurations for a total of 970 test cases. For 80% of test cases, the patch has no impact. On the other test cases, the patch decreases complexity by only 0.08% on average. In the best case, the verifier needs to walk 3% less instructions and, in the worst case, 1.5% more. Overall, the patch has a small positive impact, especially for our largest programs. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/be3ee70b6e489c49881cb1646114b1d861b5c334.1755694147.git.paul.chaignon@gmail.com	2025-08-22 18:12:24 +02:00
Martin KaFai Lau	5c42715e63	Merge branch 'bpf-next/skb-meta-dynptr' into 'bpf-next/master' Merge 'skb-meta-dynptr' branch into 'master' branch. No conflict. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2025-08-18 17:59:26 -07:00
Jakub Sitnicki	6877cd392b	bpf: Enable read/write access to skb metadata through a dynptr Now that we can create a dynptr to skb metadata, make reads to the metadata area possible with bpf_dynptr_read() or through a bpf_dynptr_slice(), and make writes to the metadata area possible with bpf_dynptr_write() or through a bpf_dynptr_slice_rdwr(). Note that for cloned skbs which share data with the original, we limit the skb metadata dynptr to be read-only since we don't unclone on a bpf_dynptr_write to metadata. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20250814-skb-metadata-thru-dynptr-v7-2-8a39e636e0fb@cloudflare.com	2025-08-18 10:29:42 -07:00
Jakub Sitnicki	89d912e494	bpf: Add dynptr type for skb metadata Add a dynptr type, similar to skb dynptr, but for the skb metadata access. The dynptr provides an alternative to __sk_buff->data_meta for accessing the custom metadata area allocated using the bpf_xdp_adjust_meta() helper. More importantly, it abstracts away the fact where the storage for the custom metadata lives, which opens up the way to persist the metadata by relocating it as the skb travels through the network stack layers. Writes to skb metadata invalidate any existing skb payload and metadata slices. While this is more restrictive that needed at the moment, it leaves the door open to reallocating the metadata on writes, and should be only a minor inconvenience to the users. Only the program types which can access __sk_buff->data_meta today are allowed to create a dynptr for skb metadata at the moment. We need to modify the network stack to persist the metadata across layers before opening up access to other BPF hooks. Once more BPF hooks gain access to skb_meta dynptr, we will also need to add a read-only variant of the helper similar to bpf_dynptr_from_skb_rdonly. skb_meta dynptr ops are stubbed out and implemented by subsequent changes. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com> Link: https://patch.msgid.link/20250814-skb-metadata-thru-dynptr-v7-1-8a39e636e0fb@cloudflare.com	2025-08-18 10:29:42 -07:00
Anton Protopopov	dbe99ea541	bpf: Add a verbose message when the BTF limit is reached When a BPF program which is being loaded reaches the map limit (MAX_USED_MAPS) or the BTF limit (MAX_USED_BTFS) the -E2BIG is returned. However, in the former case there is an accompanying verifier verbose message, and in the latter case there is not. Add a verbose message to make the behaviour symmetrical. Reported-by: Kevin Sheldrake <kevin.sheldrake@isovalent.com> Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20250816151554.902995-1-a.s.protopopov@gmail.com	2025-08-18 17:27:01 +02:00
Fushuai Wang	d87fdb1f27	bpf: Replace get_next_cpu() with cpumask_next_wrap() The get_next_cpu() function was only used in one place to find the next possible CPU, which can be replaced by cpumask_next_wrap(). Signed-off-by: Fushuai Wang <wangfushuai@baidu.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20250818032344.23229-1-wangfushuai@baidu.com	2025-08-18 15:11:02 +02:00
Jiri Olsa	e4414b01c1	bpf: Check the helper function is valid in get_helper_proto kernel test robot reported verifier bug [1] where the helper func pointer could be NULL due to disabled config option. As Alexei suggested we could check on that in get_helper_proto directly. Marking tail_call helper func with BPF_PTR_POISON, because it is unused by design. [1] https://lore.kernel.org/oe-lkp/202507160818.68358831-lkp@intel.com Reported-by: kernel test robot <oliver.sang@intel.com> Reported-by: syzbot+a9ed3d9132939852d0df@syzkaller.appspotmail.com Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Paul Chaignon <paul.chaignon@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20250814200655.945632-1-jolsa@kernel.org Closes: https://lore.kernel.org/oe-lkp/202507160818.68358831-lkp@intel.com	2025-08-15 11:16:56 +02:00
Jesper Dangaard Brouer	2b986b9e91	bpf, cpumap: Disable page_pool direct xdp_return need larger scope When running an XDP bpf_prog on the remote CPU in cpumap code then we must disable the direct return optimization that xdp_return can perform for mem_type page_pool. This optimization assumes code is still executing under RX-NAPI of the original receiving CPU, which isn't true on this remote CPU. The cpumap code already disabled this via helpers xdp_set_return_frame_no_direct() and xdp_clear_return_frame_no_direct(), but the scope didn't include xdp_do_flush(). When doing XDP_REDIRECT towards e.g devmap this causes the function bq_xmit_all() to run with direct return optimization enabled. This can lead to hard to find bugs. The issue only happens when bq_xmit_all() cannot ndo_xdp_xmit all frames and them frees them via xdp_return_frame_rx_napi(). Fix by expanding scope to include xdp_do_flush(). This was found by Dragos Tatulea. Fixes: `11941f8a85` ("bpf: cpumap: Implement generic cpumap") Reported-by: Dragos Tatulea <dtatulea@nvidia.com> Reported-by: Chris Arges <carges@cloudflare.com> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Chris Arges <carges@cloudflare.com> Link: https://patch.msgid.link/175519587755.3008742.1088294435150406835.stgit@firesoul	2025-08-15 11:08:08 +02:00
Qianfeng Rong	bf0c2a84df	bpf: Replace kvfree with kfree for kzalloc memory The 'backedge' pointer is allocated with kzalloc(), which returns physically contiguous memory. Using kvfree() to deallocate such memory is functionally safe but semantically incorrect. Replace kvfree() with kfree() to avoid unnecessary is_vmalloc_addr() check in kvfree(). Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20250811123949.552885-1-rongqianfeng@vivo.com	2025-08-12 15:55:01 -07:00
Qianfeng Rong	3e2b799008	bpf: Remove redundant __GFP_NOWARN Commit `16f5dfbc85` ("gfp: include __GFP_NOWARN in GFP_NOWAIT") made GFP_NOWAIT implicitly include __GFP_NOWARN. Therefore, explicit __GFP_NOWARN combined with GFP_NOWAIT (e.g., `GFP_NOWAIT \| __GFP_NOWARN`) is now redundant. Let's clean up these redundant flags across subsystems. No functional changes. Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20250804122731.460158-1-rongqianfeng@vivo.com	2025-08-12 14:56:04 -07:00
Martin KaFai Lau	9e293d47bf	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Cross merge bpf/master after 6.17-rc1. No conflict. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2025-08-12 12:14:02 -07:00
Eduard Zingerman	77620d1267	bpf: use realloc in bpf_patch_insn_data Avoid excessive vzalloc/vfree calls when patching instructions in do_misc_fixups(). bpf_patch_insn_data() uses vzalloc to allocate new memory for env->insn_aux_data for each patch as follows: struct bpf_prog *bpf_patch_insn_data(env, ...) { ... new_data = vzalloc(... O(program size) ...); ... adjust_insn_aux_data(env, new_data, ...); ... } void adjust_insn_aux_data(env, new_data, ...) { ... memcpy(new_data, env->insn_aux_data); vfree(env->insn_aux_data); env->insn_aux_data = new_data; ... } The vzalloc/vfree pair is hot in perf report collected for e.g. pyperf180 test case. It can be replaced with a call to vrealloc in order to reduce the number of actual memory allocations. This is a stop-gap solution, as bpf_patch_insn_data is still hot in the profile. More comprehansive solutions had been discussed before e.g. as in [1]. [1] https://lore.kernel.org/bpf/CAEf4BzY_E8MSL4mD0UPuuiDcbJhh9e2xQo2=5w+ppRWWiYSGvQ@mail.gmail.com/ Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Tested-by: Anton Protopopov <a.s.protopopov@gmail.com> Link: https://lore.kernel.org/r/20250807010205.3210608-3-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-08-07 09:17:02 -07:00
Eduard Zingerman	cb070a8156	bpf: removed unused 'env' parameter from is_reg64 and insn_has_def32 Parameter 'env' is not used by is_reg64() and insn_has_def32() functions. Remove the parameter to make it clear that neither function depends on 'env' state, e.g. env->insn_aux_data. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250807010205.3210608-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-08-07 09:17:02 -07:00
Amery Hung	d87a513d09	bpf: Allow struct_ops to get map id by kdata Add bpf_struct_ops_id() to enable struct_ops implementors to use struct_ops map id as the unique id of a struct_ops in their subsystem. A subsystem that wishes to create a mapping between id and struct_ops instance pointer can update the mapping accordingly during bpf_struct_ops::reg(), unreg(), and update(). Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20250806162540.681679-2-ameryhung@gmail.com	2025-08-06 13:39:58 -07:00
Eduard Zingerman	1b30d44417	bpf: Fix memory leak of bpf_scc_info objects env->scc_info array contains references to bpf_scc_info objects allocated lazily in verifier.c:scc_visit_alloc(). env->scc_cnt was supposed to track env->scc_info array size in order to free referenced objects in verifier.c:free_states(). Fix initialization of env->scc_cnt that was omitted in verifier.c:compute_scc(). To reproduce the bug: - build with CONFIG_DEBUG_KMEMLEAK - boot and load bpf program with loops, e.g.: ./veristat -q pyperf180.bpf.o - initiate memleak scan and check results: echo scan > /sys/kernel/debug/kmemleak cat /sys/kernel/debug/kmemleak Fixes: `c9e31900b5` ("bpf: propagate read/precision marks over state graph backedges") Reported-by: Jens Axboe <axboe@kernel.dk> Closes: https://lore.kernel.org/bpf/CAADnVQKXUWg9uRCPD5ebRXwN4dmBCRUFFM7kN=GxymYz3zU25A@mail.gmail.com/T/ Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Tested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250801232330.1800436-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-08-02 09:04:57 -07:00
Paul Chaignon	f914876eec	bpf: Improve ctx access verifier error message We've already had two "error during ctx access conversion" warnings triggered by syzkaller. Let's improve the error message by dumping the cnt variable so that we can more easily differentiate between the different error cases. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/cc94316c30dd76fae4a75a664b61a2dbfe68e205.1754039605.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-08-01 09:22:44 -07:00
Daniel Borkmann	abad3d0bad	bpf: Fix oob access in cgroup local storage Lonial reported that an out-of-bounds access in cgroup local storage can be crafted via tail calls. Given two programs each utilizing a cgroup local storage with a different value size, and one program doing a tail call into the other. The verifier will validate each of the indivial programs just fine. However, in the runtime context the bpf_cg_run_ctx holds an bpf_prog_array_item which contains the BPF program as well as any cgroup local storage flavor the program uses. Helpers such as bpf_get_local_storage() pick this up from the runtime context: ctx = container_of(current->bpf_ctx, struct bpf_cg_run_ctx, run_ctx); storage = ctx->prog_item->cgroup_storage[stype]; if (stype == BPF_CGROUP_STORAGE_SHARED) ptr = &READ_ONCE(storage->buf)->data[0]; else ptr = this_cpu_ptr(storage->percpu_buf); For the second program which was called from the originally attached one, this means bpf_get_local_storage() will pick up the former program's map, not its own. With mismatching sizes, this can result in an unintended out-of-bounds access. To fix this issue, we need to extend bpf_map_owner with an array of storage_cookie[] to match on i) the exact maps from the original program if the second program was using bpf_get_local_storage(), or ii) allow the tail call combination if the second program was not using any of the cgroup local storage maps. Fixes: `7d9c342789` ("bpf: Make cgroup storages shared between programs on the same cgroup") Reported-by: Lonial Con <kongln9170@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20250730234733.530041-4-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-31 11:30:05 -07:00
Daniel Borkmann	fd1c98f0ef	bpf: Move bpf map owner out of common struct Given this is only relevant for BPF tail call maps, it is adding up space and penalizing other map types. We also need to extend this with further objects to track / compare to. Therefore, lets move this out into a separate structure and dynamically allocate it only for BPF tail call maps. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20250730234733.530041-2-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-31 11:30:05 -07:00
Daniel Borkmann	12df58ad29	bpf: Add cookie object to bpf maps Add a cookie to BPF maps to uniquely identify BPF maps for the timespan when the node is up. This is different to comparing a pointer or BPF map id which could get rolled over and reused. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20250730234733.530041-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-31 11:30:05 -07:00
Linus Torvalds	d9104cec3e	bpf-next-6.17 -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmiINnEACgkQ6rmadz2v bToBnA/9F+A3R6rTwGk4HK3xpfc/nm2Tanl3oRN7S2ub/mskDOtWSIyG6cVFZ0UG 1fK6IkByyRIpAF/5qhdlw8drRXHkQtGLA0lP2L9llm4X1mHLofB18y9OeLrDE1WN KwNP06+IGX9W802lCGSIXOY+VmRscVfXSMokyQt2ilHplKjOnDqJcYkWupi3T2rC mz79FY9aEl2YrIcpj9RXz+8nwP49pZBuW2P0IM5PAIj4BJBXShrUp8T1nz94okNe NFsnAyRxjWpUT0McEgtA9WvpD9lZqujYD8Qp0KlGZWmI3vNpV5d9S1+dBcEb1n7q dyNMkTF3oRrJhhg4VqoHc6fVpzSEoZ9ZxV5Hx4cs+ganH75D4YbdGqx/7mR3DUgH MZh6rHF1pGnK7TAm7h5gl3ZRAOkZOaahbe1i01NKo9CEe5fSh3AqMyzJYoyGHRKi xDN39eQdWBNA+hm1VkbK2Bv93Rbjrka2Kj+D3sSSO9Bo/u3ntcknr7LW39idKz62 Q8dkKHcCEtun7gjk0YXPF013y81nEohj1C+52BmJ2l5JitM57xfr6YOaQpu7DPDE AJbHx6ASxKdyEETecd0b+cXUPQ349zmRXy0+CDMAGKpBicC0H0mHhL14cwOY1Hfu EIpIjmIJGI3JNF6T5kybcQGSBOYebdV0FFgwSllzPvuYt7YsHCs= =/O3j -----END PGP SIGNATURE----- Merge tag 'bpf-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Pull bpf updates from Alexei Starovoitov: - Remove usermode driver (UMD) framework (Thomas Weißschuh) - Introduce Strongly Connected Component (SCC) in the verifier to detect loops and refine register liveness (Eduard Zingerman) - Allow 'void ' cast using bpf_rdonly_cast() and corresponding '__arg_untrusted' for global function parameters (Eduard Zingerman) - Improve precision for BPF_ADD and BPF_SUB operations in the verifier (Harishankar Vishwanathan) - Teach the verifier that constant pointer to a map cannot be NULL (Ihor Solodrai) - Introduce BPF streams for error reporting of various conditions detected by BPF runtime (Kumar Kartikeya Dwivedi) - Teach the verifier to insert runtime speculation barrier (lfence on x86) to mitigate speculative execution instead of rejecting the programs (Luis Gerhorst) - Various improvements for 'veristat' (Mykyta Yatsenko) - For CONFIG_DEBUG_KERNEL config warn on internal verifier errors to improve bug detection by syzbot (Paul Chaignon) - Support BPF private stack on arm64 (Puranjay Mohan) - Introduce bpf_cgroup_read_xattr() kfunc to read xattr of cgroup's node (Song Liu) - Introduce kfuncs for read-only string opreations (Viktor Malik) - Implement show_fdinfo() for bpf_links (Tao Chen) - Reduce verifier's stack consumption (Yonghong Song) - Implement mprog API for cgroup-bpf programs (Yonghong Song) tag 'bpf-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (192 commits) selftests/bpf: Migrate fexit_noreturns case into tracing_failure test suite selftests/bpf: Add selftest for attaching tracing programs to functions in deny list bpf: Add log for attaching tracing programs to functions in deny list bpf: Show precise rejected function when attaching fexit/fmod_ret to __noreturn functions bpf: Fix various typos in verifier.c comments bpf: Add third round of bounds deduction selftests/bpf: Test invariants on JSLT crossing sign selftests/bpf: Test cross-sign 64bits range refinement selftests/bpf: Update reg_bound range refinement logic bpf: Improve bounds when s64 crosses sign boundary bpf: Simplify bounds refinement from s32 selftests/bpf: Enable private stack tests for arm64 bpf, arm64: JIT support for private stack bpf: Move bpf_jit_get_prog_name() to core.c bpf, arm64: Fix fp initialization for exception boundary umd: Remove usermode driver framework bpf/preload: Don't select USERMODE_DRIVER selftests/bpf: Fix test dynptr/test_dynptr_memset_xdp_chunks failure selftests/bpf: Fix test dynptr/test_dynptr_copy_xdp failure selftests/bpf: Increase xdp data size for arm64 64K page size ...	2025-07-30 09:58:50 -07:00
Linus Torvalds	8be4d31cb8	Networking changes for 6.17. Core & protocols ---------------- - Wrap datapath globals into net_aligned_data, to avoid false sharing. - Preserve MSG_ZEROCOPY in forwarding (e.g. out of a container). - Add SO_INQ and SCM_INQ support to AF_UNIX. - Add SIOCINQ support to AF_VSOCK. - Add TCP_MAXSEG sockopt to MPTCP. - Add IPv6 force_forwarding sysctl to enable forwarding per interface. - Make TCP validation of whether packet fully fits in the receive window and the rcv_buf more strict. With increased use of HW aggregation a single "packet" can be multiple 100s of kB. - Add MSG_MORE flag to optimize large TCP transmissions via sockmap, improves latency up to 33% for sockmap users. - Convert TCP send queue handling from tasklet to BH workque. - Improve BPF iteration over TCP sockets to see each socket exactly once. - Remove obsolete and unused TCP RFC3517/RFC6675 loss recovery code. - Support enabling kernel threads for NAPI processing on per-NAPI instance basis rather than a whole device. Fully stop the kernel NAPI thread when threaded NAPI gets disabled. Previously thread would stick around until ifdown due to tricky synchronization. - Allow multicast routing to take effect on locally-generated packets. - Add output interface argument for End.X in segment routing. - MCTP: add support for gateway routing, improve bind() handling. - Don't require rtnl_lock when fetching an IPv6 neighbor over Netlink. - Add a new neighbor flag ("extern_valid"), which cedes refresh responsibilities to userspace. This is needed for EVPN multi-homing where a neighbor entry for a multi-homed host needs to be synced across all the VTEPs among which the host is multi-homed. - Support NUD_PERMANENT for proxy neighbor entries. - Add a new queuing discipline for IETF RFC9332 DualQ Coupled AQM. - Add sequence numbers to netconsole messages. Unregister netconsole's console when all net targets are removed. Code refactoring. Add a number of selftests. - Align IPSec inbound SA lookup to RFC 4301. Only SPI and protocol should be used for an inbound SA lookup. - Support inspecting ref_tracker state via DebugFS. - Don't force bonding advertisement frames tx to ~333 ms boundaries. Add broadcast_neighbor option to send ARP/ND on all bonded links. - Allow providing upcall pid for the 'execute' command in openvswitch. - Remove DCCP support from Netfilter's conntrack. - Disallow multiple packet duplications in the queuing layer. - Prevent use of deprecated iptables code on PREEMPT_RT. Driver API ---------- - Support RSS and hashing configuration over ethtool Netlink. - Add dedicated ethtool callbacks for getting and setting hashing fields. - Add support for power budget evaluation strategy in PSE / Power-over-Ethernet. Generate Netlink events for overcurrent etc. - Support DPLL phase offset monitoring across all device inputs. Support providing clock reference and SYNC over separate DPLL inputs. - Support traffic classes in devlink rate API for bandwidth management. - Remove rtnl_lock dependency from UDP tunnel port configuration. Device drivers -------------- - Add a new Broadcom driver for 800G Ethernet (bnge). - Add a standalone driver for Microchip ZL3073x DPLL. - Remove IBM's NETIUCV device driver. - Ethernet high-speed NICs: - Broadcom (bnxt): - support zero-copy Tx of DMABUF memory - take page size into account for page pool recycling rings - Intel (100G, ice, idpf): - idpf: XDP and AF_XDP support preparations - idpf: add flow steering - add link_down_events statistic - clean up the TSPLL code - preparations for live VM migration - nVidia/Mellanox: - support zero-copy Rx/Tx interfaces (DMABUF and io_uring) - optimize context memory usage for matchers - expose serial numbers in devlink info - support PCIe congestion metrics - Meta (fbnic): - add 25G, 50G, and 100G link modes to phylink - support dumping FW logs - Marvell/Cavium: - support for CN20K generation of the Octeon chips - Amazon: - add HW clock (without timestamping, just hypervisor time access) - Ethernet virtual: - VirtIO net: - support segmentation of UDP-tunnel-encapsulated packets - Google (gve): - support packet timestamping and clock synchronization - Microsoft vNIC: - add handler for device-originated servicing events - allow dynamic MSI-X vector allocation - support Tx bandwidth clamping - Ethernet NICs consumer, and embedded: - AMD: - amd-xgbe: hardware timestamping and PTP clock support - Broadcom integrated MACs (bcmgenet, bcmasp): - use napi_complete_done() return value to support NAPI polling - add support for re-starting auto-negotiation - Broadcom switches (b53): - support BCM5325 switches - add bcm63xx EPHY power control - Synopsys (stmmac): - lots of code refactoring and cleanups - TI: - icssg-prueth: read firmware-names from device tree - icssg: PRP offload support - Microchip: - lan78xx: convert to PHYLINK for improved PHY and MAC management - ksz: add KSZ8463 switch support - Intel: - support similar queue priority scheme in multi-queue and time-sensitive networking (taprio) - support packet pre-emption in both - RealTek (r8169): - enable EEE at 5Gbps on RTL8126 - Airoha: - add PPPoE offload support - MDIO bus controller for Airoha AN7583 - Ethernet PHYs: - support for the IPQ5018 internal GE PHY - micrel KSZ9477 switch-integrated PHYs: - add MDI/MDI-X control support - add RX error counters - add cable test support - add Signal Quality Indicator (SQI) reporting - dp83tg720: improve reset handling and reduce link recovery time - support bcm54811 (and its MII-Lite interface type) - air_en8811h: support resume/suspend - support PHY counters for QCA807x and QCA808x - support WoL for QCA807x - CAN drivers: - rcar_canfd: support for Transceiver Delay Compensation - kvaser: report FW versions via devlink dev info - WiFi: - extended regulatory info support (6 GHz) - add statistics and beacon monitor for Multi-Link Operation (MLO) - support S1G aggregation, improve S1G support - add Radio Measurement action fields - support per-radio RTS threshold - some work around how FIPS affects wifi, which was wrong (RC4 is used by TKIP, not only WEP) - improvements for unsolicited probe response handling - WiFi drivers: - RealTek (rtw88): - IBSS mode for SDIO devices - RealTek (rtw89): - BT coexistence for MLO/WiFi7 - concurrent station + P2P support - support for USB devices RTL8851BU/RTL8852BU - Intel (iwlwifi): - use embedded PNVM in (to be released) FW images to fix compatibility issues - many cleanups (unused FW APIs, PCIe code, WoWLAN) - some FIPS interoperability - MediaTek (mt76): - firmware recovery improvements - more MLO work - Qualcomm/Atheros (ath12k): - fix scan on multi-radio devices - more EHT/Wi-Fi 7 features - encapsulation/decapsulation offload - Broadcom (brcm80211): - support SDIO 43751 device - Bluetooth: - hci_event: add support for handling LE BIG Sync Lost event - ISO: add socket option to report packet seqnum via CMSG - ISO: support SCM_TIMESTAMPING for ISO TS - Bluetooth drivers: - intel_pcie: support Function Level Reset - nxpuart: add support for 4M baudrate - nxpuart: implement powerup sequence, reset, FW dump, and FW loading Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmiFgLgACgkQMUZtbf5S IrvafxAAnQRwYBoIG+piCILx6z5pRvBGHkmEQ4AQgSCFuq2eO3ubwMFIqEybfma1 5+QFjUZAV3OgGgKRBS2KGWxtSzdiF+/JGV1VOIN67sX3Mm0a2QgjA4n5CgKL0FPr o6BEzjX5XwG1zvGcBNQ5BZ19xUUKjoZQgTtnea8sZ57Fsp5RtRgmYRqoewNvNk/n uImh0NFsDVb0UeOpSzC34VD9l1dJvLGdui4zJAjno/vpvmT1DkXjoK419J/r52SS X+5WgsfJ6DkjHqVN1tIhhK34yWqBOcwGFZJgEnWHMkFIl2FqRfFKMHyqtfLlVnLA mnIpSyz8Sq2AHtx0TlgZ3At/Ri8p5+yYJgHOXcDKyABa8y8Zf4wrycmr6cV9JLuL z54nLEVnJuvfDVDVJjsLYdJXyhMpZFq6+uAItdxKaw8Ugp/QqG4QtoRj+XIHz4ZW z6OohkCiCzTwEISFK+pSTxPS30eOxq43kCspcvuLiwCCStJBRkRb5GdZA4dm7LA+ 1Od4ADAkHjyrFtBqTyyC2scX8UJ33DlAIpAYyIeS6w9Cj9EXxtp1z33IAAAZ03MW jJwIaJuc8bK2fWKMmiG7ucIXjPo4t//KiWlpkwwqLhPbjZgfDAcxq1AC2TLoqHBL y4EOgKpHDCMAghSyiFIAn2JprGcEt8dp+11B0JRXIn4Pm/eYDH8= =lqbe -----END PGP SIGNATURE----- Merge tag 'net-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Wrap datapath globals into net_aligned_data, to avoid false sharing - Preserve MSG_ZEROCOPY in forwarding (e.g. out of a container) - Add SO_INQ and SCM_INQ support to AF_UNIX - Add SIOCINQ support to AF_VSOCK - Add TCP_MAXSEG sockopt to MPTCP - Add IPv6 force_forwarding sysctl to enable forwarding per interface - Make TCP validation of whether packet fully fits in the receive window and the rcv_buf more strict. With increased use of HW aggregation a single "packet" can be multiple 100s of kB - Add MSG_MORE flag to optimize large TCP transmissions via sockmap, improves latency up to 33% for sockmap users - Convert TCP send queue handling from tasklet to BH workque - Improve BPF iteration over TCP sockets to see each socket exactly once - Remove obsolete and unused TCP RFC3517/RFC6675 loss recovery code - Support enabling kernel threads for NAPI processing on per-NAPI instance basis rather than a whole device. Fully stop the kernel NAPI thread when threaded NAPI gets disabled. Previously thread would stick around until ifdown due to tricky synchronization - Allow multicast routing to take effect on locally-generated packets - Add output interface argument for End.X in segment routing - MCTP: add support for gateway routing, improve bind() handling - Don't require rtnl_lock when fetching an IPv6 neighbor over Netlink - Add a new neighbor flag ("extern_valid"), which cedes refresh responsibilities to userspace. This is needed for EVPN multi-homing where a neighbor entry for a multi-homed host needs to be synced across all the VTEPs among which the host is multi-homed - Support NUD_PERMANENT for proxy neighbor entries - Add a new queuing discipline for IETF RFC9332 DualQ Coupled AQM - Add sequence numbers to netconsole messages. Unregister netconsole's console when all net targets are removed. Code refactoring. Add a number of selftests - Align IPSec inbound SA lookup to RFC 4301. Only SPI and protocol should be used for an inbound SA lookup - Support inspecting ref_tracker state via DebugFS - Don't force bonding advertisement frames tx to ~333 ms boundaries. Add broadcast_neighbor option to send ARP/ND on all bonded links - Allow providing upcall pid for the 'execute' command in openvswitch - Remove DCCP support from Netfilter's conntrack - Disallow multiple packet duplications in the queuing layer - Prevent use of deprecated iptables code on PREEMPT_RT Driver API: - Support RSS and hashing configuration over ethtool Netlink - Add dedicated ethtool callbacks for getting and setting hashing fields - Add support for power budget evaluation strategy in PSE / Power-over-Ethernet. Generate Netlink events for overcurrent etc - Support DPLL phase offset monitoring across all device inputs. Support providing clock reference and SYNC over separate DPLL inputs - Support traffic classes in devlink rate API for bandwidth management - Remove rtnl_lock dependency from UDP tunnel port configuration Device drivers: - Add a new Broadcom driver for 800G Ethernet (bnge) - Add a standalone driver for Microchip ZL3073x DPLL - Remove IBM's NETIUCV device driver - Ethernet high-speed NICs: - Broadcom (bnxt): - support zero-copy Tx of DMABUF memory - take page size into account for page pool recycling rings - Intel (100G, ice, idpf): - idpf: XDP and AF_XDP support preparations - idpf: add flow steering - add link_down_events statistic - clean up the TSPLL code - preparations for live VM migration - nVidia/Mellanox: - support zero-copy Rx/Tx interfaces (DMABUF and io_uring) - optimize context memory usage for matchers - expose serial numbers in devlink info - support PCIe congestion metrics - Meta (fbnic): - add 25G, 50G, and 100G link modes to phylink - support dumping FW logs - Marvell/Cavium: - support for CN20K generation of the Octeon chips - Amazon: - add HW clock (without timestamping, just hypervisor time access) - Ethernet virtual: - VirtIO net: - support segmentation of UDP-tunnel-encapsulated packets - Google (gve): - support packet timestamping and clock synchronization - Microsoft vNIC: - add handler for device-originated servicing events - allow dynamic MSI-X vector allocation - support Tx bandwidth clamping - Ethernet NICs consumer, and embedded: - AMD: - amd-xgbe: hardware timestamping and PTP clock support - Broadcom integrated MACs (bcmgenet, bcmasp): - use napi_complete_done() return value to support NAPI polling - add support for re-starting auto-negotiation - Broadcom switches (b53): - support BCM5325 switches - add bcm63xx EPHY power control - Synopsys (stmmac): - lots of code refactoring and cleanups - TI: - icssg-prueth: read firmware-names from device tree - icssg: PRP offload support - Microchip: - lan78xx: convert to PHYLINK for improved PHY and MAC management - ksz: add KSZ8463 switch support - Intel: - support similar queue priority scheme in multi-queue and time-sensitive networking (taprio) - support packet pre-emption in both - RealTek (r8169): - enable EEE at 5Gbps on RTL8126 - Airoha: - add PPPoE offload support - MDIO bus controller for Airoha AN7583 - Ethernet PHYs: - support for the IPQ5018 internal GE PHY - micrel KSZ9477 switch-integrated PHYs: - add MDI/MDI-X control support - add RX error counters - add cable test support - add Signal Quality Indicator (SQI) reporting - dp83tg720: improve reset handling and reduce link recovery time - support bcm54811 (and its MII-Lite interface type) - air_en8811h: support resume/suspend - support PHY counters for QCA807x and QCA808x - support WoL for QCA807x - CAN drivers: - rcar_canfd: support for Transceiver Delay Compensation - kvaser: report FW versions via devlink dev info - WiFi: - extended regulatory info support (6 GHz) - add statistics and beacon monitor for Multi-Link Operation (MLO) - support S1G aggregation, improve S1G support - add Radio Measurement action fields - support per-radio RTS threshold - some work around how FIPS affects wifi, which was wrong (RC4 is used by TKIP, not only WEP) - improvements for unsolicited probe response handling - WiFi drivers: - RealTek (rtw88): - IBSS mode for SDIO devices - RealTek (rtw89): - BT coexistence for MLO/WiFi7 - concurrent station + P2P support - support for USB devices RTL8851BU/RTL8852BU - Intel (iwlwifi): - use embedded PNVM in (to be released) FW images to fix compatibility issues - many cleanups (unused FW APIs, PCIe code, WoWLAN) - some FIPS interoperability - MediaTek (mt76): - firmware recovery improvements - more MLO work - Qualcomm/Atheros (ath12k): - fix scan on multi-radio devices - more EHT/Wi-Fi 7 features - encapsulation/decapsulation offload - Broadcom (brcm80211): - support SDIO 43751 device - Bluetooth: - hci_event: add support for handling LE BIG Sync Lost event - ISO: add socket option to report packet seqnum via CMSG - ISO: support SCM_TIMESTAMPING for ISO TS - Bluetooth drivers: - intel_pcie: support Function Level Reset - nxpuart: add support for 4M baudrate - nxpuart: implement powerup sequence, reset, FW dump, and FW loading" * tag 'net-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1742 commits) dpll: zl3073x: Fix build failure selftests: bpf: fix legacy netfilter options ipv6: annotate data-races around rt->fib6_nsiblings ipv6: fix possible infinite loop in fib6_info_uses_dev() ipv6: prevent infinite loop in rt6_nlmsg_size() ipv6: add a retry logic in net6_rt_notify() vrf: Drop existing dst reference in vrf_ip6_input_dst net/sched: taprio: align entry index attr validation with mqprio net: fsl_pq_mdio: use dev_err_probe selftests: rtnetlink.sh: remove esp4_offload after test vsock: remove unnecessary null check in vsock_getname() igb: xsk: solve negative overflow of nb_pkts in zerocopy mode stmmac: xsk: fix negative overflow of budget in zerocopy mode dt-bindings: ieee802154: Convert at86rf230.txt yaml format net: dsa: microchip: Disable PTP function of KSZ8463 net: dsa: microchip: Setup fiber ports for KSZ8463 net: dsa: microchip: Write switch MAC address differently for KSZ8463 net: dsa: microchip: Use different registers for KSZ8463 net: dsa: microchip: Add KSZ8463 switch support to KSZ DSA driver dt-bindings: net: dsa: microchip: Add KSZ8463 switch support ...	2025-07-30 08:58:55 -07:00
Linus Torvalds	22c5696e3f	Driver core changes for 6.17-rc1 - DEBUGFS - Remove unneeded debugfs_file_{get,put}() instances - Remove last remnants of debugfs_real_fops() - Allow storing non-const void * in struct debugfs_inode_info::aux - SYSFS - Switch back to attribute_group::bin_attrs (treewide) - Switch back to bin_attribute::read()/write() (treewide) - Constify internal references to 'struct bin_attribute' - Support cache-ids for device-tree systems - Add arch hook arch_compact_of_hwid() - Use arch_compact_of_hwid() to compact MPIDR values on arm64 - Rust - Device - Introduce CoreInternal device context (for bus internal methods) - Provide generic drvdata accessors for bus devices - Provide Driver::unbind() callbacks - Use the infrastructure above for auxiliary, PCI and platform - Implement Device::as_bound() - Rename Device::as_ref() to Device::from_raw() (treewide) - Implement fwnode and device property abstractions - Implement example usage in the Rust platform sample driver - Devres - Remove the inner reference count (Arc) and use pin-init instead - Replace Devres::new_foreign_owned() with devres::register() - Require T to be Send in Devres<T> - Initialize the data kept inside a Devres last - Provide an accessor for the Devres associated Device - Device ID - Add support for ACPI device IDs and driver match tables - Split up generic device ID infrastructure - Use generic device ID infrastructure in net::phy - DMA - Implement the dma::Device trait - Add DMA mask accessors to dma::Device - Implement dma::Device for PCI and platform devices - Use DMA masks from the DMA sample module - I/O - Implement abstraction for resource regions (struct resource) - Implement resource-based ioremap() abstractions - Provide platform device accessors for I/O (remap) requests - Misc - Support fallible PinInit types in Revocable - Implement Wrapper<T> for Opaque<T> - Merge pin-init blanket dependencies (for Devres) - Misc - Fix OF node leak in auxiliary_device_create() - Use util macros in device property iterators - Improve kobject sample code - Add device_link_test() for testing device link flags - Fix typo in Documentation/ABI/testing/sysfs-kernel-address_bits - Hint to prefer container_of_const() over container_of() -----BEGIN PGP SIGNATURE----- iHQEABYKAB0WIQS2q/xV6QjXAdC7k+1FlHeO1qrKLgUCaIjkhwAKCRBFlHeO1qrK LpXuAP9RWwfD9ZGgQZ9OsMk/0pZ2mDclaK97jcmI9TAeSxeZMgD1FHnOMTY7oSIi iG7Muq0yLD+A5gk9HUnMUnFNrngWCg== =jgRj -----END PGP SIGNATURE----- Merge tag 'driver-core-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core Pull driver core updates from Danilo Krummrich: "debugfs: - Remove unneeded debugfs_file_{get,put}() instances - Remove last remnants of debugfs_real_fops() - Allow storing non-const void * in struct debugfs_inode_info::aux sysfs: - Switch back to attribute_group::bin_attrs (treewide) - Switch back to bin_attribute::read()/write() (treewide) - Constify internal references to 'struct bin_attribute' Support cache-ids for device-tree systems: - Add arch hook arch_compact_of_hwid() - Use arch_compact_of_hwid() to compact MPIDR values on arm64 Rust: - Device: - Introduce CoreInternal device context (for bus internal methods) - Provide generic drvdata accessors for bus devices - Provide Driver::unbind() callbacks - Use the infrastructure above for auxiliary, PCI and platform - Implement Device::as_bound() - Rename Device::as_ref() to Device::from_raw() (treewide) - Implement fwnode and device property abstractions - Implement example usage in the Rust platform sample driver - Devres: - Remove the inner reference count (Arc) and use pin-init instead - Replace Devres::new_foreign_owned() with devres::register() - Require T to be Send in Devres<T> - Initialize the data kept inside a Devres last - Provide an accessor for the Devres associated Device - Device ID: - Add support for ACPI device IDs and driver match tables - Split up generic device ID infrastructure - Use generic device ID infrastructure in net::phy - DMA: - Implement the dma::Device trait - Add DMA mask accessors to dma::Device - Implement dma::Device for PCI and platform devices - Use DMA masks from the DMA sample module - I/O: - Implement abstraction for resource regions (struct resource) - Implement resource-based ioremap() abstractions - Provide platform device accessors for I/O (remap) requests - Misc: - Support fallible PinInit types in Revocable - Implement Wrapper<T> for Opaque<T> - Merge pin-init blanket dependencies (for Devres) Misc: - Fix OF node leak in auxiliary_device_create() - Use util macros in device property iterators - Improve kobject sample code - Add device_link_test() for testing device link flags - Fix typo in Documentation/ABI/testing/sysfs-kernel-address_bits - Hint to prefer container_of_const() over container_of()" * tag 'driver-core-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core: (84 commits) rust: io: fix broken intra-doc links to `platform::Device` rust: io: fix broken intra-doc link to missing `flags` module rust: io: mem: enable IoRequest doc-tests rust: platform: add resource accessors rust: io: mem: add a generic iomem abstraction rust: io: add resource abstraction rust: samples: dma: set DMA mask rust: platform: implement the `dma::Device` trait rust: pci: implement the `dma::Device` trait rust: dma: add DMA addressing capabilities rust: dma: implement `dma::Device` trait rust: net::phy Change module_phy_driver macro to use module_device_table macro rust: net::phy represent DeviceId as transparent wrapper over mdio_device_id rust: device_id: split out index support into a separate trait device: rust: rename Device::as_ref() to Device::from_raw() arm64: cacheinfo: Provide helper to compress MPIDR value into u32 cacheinfo: Add arch hook to compress CPU h/w id into 32 bits for cache-id cacheinfo: Set cache 'id' based on DT data container_of: Document container_of() is not to be used in new code driver core: auxiliary bus: fix OF node leak ...	2025-07-29 12:15:39 -07:00
KaFai Wan	863aab3d4d	bpf: Add log for attaching tracing programs to functions in deny list Show the rejected function name when attaching tracing programs to functions in deny list. With this change, we know why tracing programs can't attach to functions like __rcu_read_lock() from log. $ ./fentry libbpf: prog '__rcu_read_lock': BPF program load failed: -EINVAL libbpf: prog '__rcu_read_lock': -- BEGIN PROG LOAD LOG -- Attaching tracing programs to function '__rcu_read_lock' is rejected. Suggested-by: Leon Hwang <leon.hwang@linux.dev> Signed-off-by: KaFai Wan <kafai.wan@linux.dev> Acked-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20250724151454.499040-3-kafai.wan@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-28 19:39:29 -07:00
KaFai Wan	a5a6b29a70	bpf: Show precise rejected function when attaching fexit/fmod_ret to __noreturn functions With this change, we know the precise rejected function name when attaching fexit/fmod_ret to __noreturn functions from log. $ ./fexit libbpf: prog 'fexit': BPF program load failed: -EINVAL libbpf: prog 'fexit': -- BEGIN PROG LOAD LOG -- Attaching fexit/fmod_ret to __noreturn function 'do_exit' is rejected. Suggested-by: Leon Hwang <leon.hwang@linux.dev> Signed-off-by: KaFai Wan <kafai.wan@linux.dev> Acked-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20250724151454.499040-2-kafai.wan@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-28 19:39:29 -07:00
Linus Torvalds	13150742b0	Crypto library updates for 6.17 This is the main crypto library pull request for 6.17. The main focus this cycle is on reorganizing the SHA-1 and SHA-2 code, providing high-quality library APIs for SHA-1 and SHA-2 including HMAC support, and establishing conventions for lib/crypto/ going forward: - Migrate the SHA-1 and SHA-512 code (and also SHA-384 which shares most of the SHA-512 code) into lib/crypto/. This includes both the generic and architecture-optimized code. Greatly simplify how the architecture-optimized code is integrated. Add an easy-to-use library API for each SHA variant, including HMAC support. Finally, reimplement the crypto_shash support on top of the library API. - Apply the same reorganization to the SHA-256 code (and also SHA-224 which shares most of the SHA-256 code). This is a somewhat smaller change, due to my earlier work on SHA-256. But this brings in all the same additional improvements that I made for SHA-1 and SHA-512. There are also some smaller changes: - Move the architecture-optimized ChaCha, Poly1305, and BLAKE2s code from arch/$(SRCARCH)/lib/crypto/ to lib/crypto/$(SRCARCH)/. For these algorithms it's just a move, not a full reorganization yet. - Fix the MIPS chacha-core.S to build with the clang assembler. - Fix the Poly1305 functions to work in all contexts. - Fix a performance regression in the x86_64 Poly1305 code. - Clean up the x86_64 SHA-NI optimized SHA-1 assembly code. Note that since the new organization of the SHA code is much simpler, the diffstat of this pull request is negative, despite the addition of new fully-documented library APIs for multiple SHA and HMAC-SHA variants. These APIs will allow further simplifications across the kernel as users start using them instead of the old-school crypto API. (I've already written a lot of such conversion patches, removing over 1000 more lines of code. But most of those will target 6.18 or later.) -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCaIZ93BQcZWJpZ2dlcnNA a2VybmVsLm9yZwAKCRDzXCl4vpKOK8HCAQD3O9P0qd6wscne5XuRwaybzKHQ2AqU OlhlDZWQQEvYAgD/aa6KP/DS+8RKGj0TBn6bACAJyXyDygFXq5a5s9pGzAs= =UmMM -----END PGP SIGNATURE----- Merge tag 'libcrypto-updates-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux Pull crypto library updates from Eric Biggers: "This is the main crypto library pull request for 6.17. The main focus this cycle is on reorganizing the SHA-1 and SHA-2 code, providing high-quality library APIs for SHA-1 and SHA-2 including HMAC support, and establishing conventions for lib/crypto/ going forward: - Migrate the SHA-1 and SHA-512 code (and also SHA-384 which shares most of the SHA-512 code) into lib/crypto/. This includes both the generic and architecture-optimized code. Greatly simplify how the architecture-optimized code is integrated. Add an easy-to-use library API for each SHA variant, including HMAC support. Finally, reimplement the crypto_shash support on top of the library API. - Apply the same reorganization to the SHA-256 code (and also SHA-224 which shares most of the SHA-256 code). This is a somewhat smaller change, due to my earlier work on SHA-256. But this brings in all the same additional improvements that I made for SHA-1 and SHA-512. There are also some smaller changes: - Move the architecture-optimized ChaCha, Poly1305, and BLAKE2s code from arch/$(SRCARCH)/lib/crypto/ to lib/crypto/$(SRCARCH)/. For these algorithms it's just a move, not a full reorganization yet. - Fix the MIPS chacha-core.S to build with the clang assembler. - Fix the Poly1305 functions to work in all contexts. - Fix a performance regression in the x86_64 Poly1305 code. - Clean up the x86_64 SHA-NI optimized SHA-1 assembly code. Note that since the new organization of the SHA code is much simpler, the diffstat of this pull request is negative, despite the addition of new fully-documented library APIs for multiple SHA and HMAC-SHA variants. These APIs will allow further simplifications across the kernel as users start using them instead of the old-school crypto API. (I've already written a lot of such conversion patches, removing over 1000 more lines of code. But most of those will target 6.18 or later)" * tag 'libcrypto-updates-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (67 commits) lib/crypto: arm64/sha512-ce: Drop compatibility macros for older binutils lib/crypto: x86/sha1-ni: Convert to use rounds macros lib/crypto: x86/sha1-ni: Minor optimizations and cleanup crypto: sha1 - Remove sha1_base.h lib/crypto: x86/sha1: Migrate optimized code into library lib/crypto: sparc/sha1: Migrate optimized code into library lib/crypto: s390/sha1: Migrate optimized code into library lib/crypto: powerpc/sha1: Migrate optimized code into library lib/crypto: mips/sha1: Migrate optimized code into library lib/crypto: arm64/sha1: Migrate optimized code into library lib/crypto: arm/sha1: Migrate optimized code into library crypto: sha1 - Use same state format as legacy drivers crypto: sha1 - Wrap library and add HMAC support lib/crypto: sha1: Add HMAC support lib/crypto: sha1: Add SHA-1 library functions lib/crypto: sha1: Rename sha1_init() to sha1_init_raw() crypto: x86/sha1 - Rename conflicting symbol lib/crypto: sha2: Add hmac_sha*_init_usingrawkey() lib/crypto: arm/poly1305: Remove unneeded empty weak function lib/crypto: x86/poly1305: Fix performance regression on short messages ...	2025-07-28 17:58:52 -07:00
Linus Torvalds	7e7bc8335b	vfs-6.17-rc1.bpf -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCjwAKCRCRxhvAZXjc osnVAQCv4rM7sF4yJvGlm1myIJcJy5Sabk2q31qMdI1VHmkcOwD+Mxs7d1aByTS8 /6djhVleq6lcT2LpP9j8YI3Rb+x30QY= =PF3o -----END PGP SIGNATURE----- Merge tag 'vfs-6.17-rc1.bpf' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs bpf updates from Christian Brauner: "These changes allow bpf to read extended attributes from cgroupfs. This is useful in redirecting AF_UNIX socket connections based on cgroup membership of the socket. One use-case is the ability to implement log namespaces in systemd so services and containers are redirected to different journals" * tag 'vfs-6.17-rc1.bpf' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: selftests/kernfs: test xattr retrieval selftests/bpf: Add tests for bpf_cgroup_read_xattr bpf: Mark cgroup_subsys_state->cgroup RCU safe bpf: Introduce bpf_cgroup_read_xattr to read xattr of cgroup's node kernfs: remove iattr_mutex	2025-07-28 14:42:31 -07:00
Suchit Karunakaran	5b4c54ac49	bpf: Fix various typos in verifier.c comments This patch fixes several minor typos in comments within the BPF verifier. No changes in functionality. Signed-off-by: Suchit Karunakaran <suchitkarunakaran@gmail.com> Link: https://lore.kernel.org/r/20250727081754.15986-1-suchitkarunakaran@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-28 10:02:57 -07:00
Paul Chaignon	5dbb19b16a	bpf: Add third round of bounds deduction Commit `d7f0087381` ("bpf: try harder to deduce register bounds from different numeric domains") added a second call to __reg_deduce_bounds in reg_bounds_sync because a single call wasn't enough to converge to a fixed point in terms of register bounds. With patch "bpf: Improve bounds when s64 crosses sign boundary" from this series, Eduard noticed that calling __reg_deduce_bounds twice isn't enough anymore to converge. The first selftest added in "selftests/bpf: Test cross-sign 64bits range refinement" highlights the need for a third call to __reg_deduce_bounds. After instruction 7, reg_bounds_sync performs the following bounds deduction: reg_bounds_sync entry: scalar(smin=-655,smax=0xeffffeee,smin32=-783,smax32=-146) __update_reg_bounds: scalar(smin=-655,smax=0xeffffeee,smin32=-783,smax32=-146) __reg_deduce_bounds: __reg32_deduce_bounds: scalar(smin=-655,smax=0xeffffeee,smin32=-783,smax32=-146,umin32=0xfffffcf1,umax32=0xffffff6e) __reg64_deduce_bounds: scalar(smin=-655,smax=0xeffffeee,smin32=-783,smax32=-146,umin32=0xfffffcf1,umax32=0xffffff6e) __reg_deduce_mixed_bounds: scalar(smin=-655,smax=0xeffffeee,umin=umin32=0xfffffcf1,umax=0xffffffffffffff6e,smin32=-783,smax32=-146,umax32=0xffffff6e) __reg_deduce_bounds: __reg32_deduce_bounds: scalar(smin=-655,smax=0xeffffeee,umin=umin32=0xfffffcf1,umax=0xffffffffffffff6e,smin32=-783,smax32=-146,umax32=0xffffff6e) __reg64_deduce_bounds: scalar(smin=-655,smax=smax32=-146,umin=0xfffffffffffffd71,umax=0xffffffffffffff6e,smin32=-783,umin32=0xfffffcf1,umax32=0xffffff6e) __reg_deduce_mixed_bounds: scalar(smin=-655,smax=smax32=-146,umin=0xfffffffffffffd71,umax=0xffffffffffffff6e,smin32=-783,umin32=0xfffffcf1,umax32=0xffffff6e) __reg_bound_offset: scalar(smin=-655,smax=smax32=-146,umin=0xfffffffffffffd71,umax=0xffffffffffffff6e,smin32=-783,umin32=0xfffffcf1,umax32=0xffffff6e,var_off=(0xfffffffffffffc00; 0x3ff)) __update_reg_bounds: scalar(smin=-655,smax=smax32=-146,umin=0xfffffffffffffd71,umax=0xffffffffffffff6e,smin32=-783,umin32=0xfffffcf1,umax32=0xffffff6e,var_off=(0xfffffffffffffc00; 0x3ff)) In particular, notice how: 1. In the first call to __reg_deduce_bounds, __reg32_deduce_bounds learns new u32 bounds. 2. __reg64_deduce_bounds is unable to improve bounds at this point. 3. __reg_deduce_mixed_bounds derives new u64 bounds from the u32 bounds. 4. In the second call to __reg_deduce_bounds, __reg64_deduce_bounds improves the smax and umin bounds thanks to patch "bpf: Improve bounds when s64 crosses sign boundary" from this series. 5. Subsequent functions are unable to improve the ranges further (only tnums). Yet, a better smin32 bound could be learned from the smin bound. __reg32_deduce_bounds is able to improve smin32 from smin, but for that we need a third call to __reg_deduce_bounds. As discussed in [1], there may be a better way to organize the deduction rules to learn the same information with less calls to the same functions. Such an optimization requires further analysis and is orthogonal to the present patchset. Link: https://lore.kernel.org/bpf/aIKtSK9LjQXB8FLY@mail.gmail.com/ [1] Acked-by: Eduard Zingerman <eddyz87@gmail.com> Co-developed-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/79619d3b42e5525e0e174ed534b75879a5ba15de.1753695655.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-28 10:02:13 -07:00
Paul Chaignon	00bf8d0c6c	bpf: Improve bounds when s64 crosses sign boundary __reg64_deduce_bounds currently improves the s64 range using the u64 range and vice versa, but only if it doesn't cross the sign boundary. This patch improves __reg64_deduce_bounds to cover the case where the s64 range crosses the sign boundary but overlaps with the u64 range on only one end. In that case, we can improve both ranges. Consider the following example, with the s64 range crossing the sign boundary: 0 U64_MAX \| [xxxxxxxxxxxxxx u64 range xxxxxxxxxxxxxx] \| \|----------------------------\|----------------------------\| \|xxxxx s64 range xxxxxxxxx] [xxxxxxx\| 0 S64_MAX S64_MIN -1 The u64 range overlaps only with positive portion of the s64 range. We can thus derive the following new s64 and u64 ranges. 0 U64_MAX \| [xxxxxx u64 range xxxxx] \| \|----------------------------\|----------------------------\| \| [xxxxxx s64 range xxxxx] \| 0 S64_MAX S64_MIN -1 The same logic can probably apply to the s32/u32 ranges, but this patch doesn't implement that change. In addition to the selftests, the __reg64_deduce_bounds change was also tested with Agni, the formal verification tool for the range analysis [1]. Link: https://github.com/bpfverif/agni [1] Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/933bd9ce1f36ded5559f92fdc09e5dbc823fa245.1753695655.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-28 10:02:12 -07:00
Paul Chaignon	5345e64760	bpf: Simplify bounds refinement from s32 During the bounds refinement, we improve the precision of various ranges by looking at other ranges. Among others, we improve the following in this order (other things happen between 1 and 2): 1. Improve u32 from s32 in __reg32_deduce_bounds. 2. Improve s/u64 from u32 in __reg_deduce_mixed_bounds. 3. Improve s/u64 from s32 in __reg_deduce_mixed_bounds. In particular, if the s32 range forms a valid u32 range, we will use it to improve the u32 range in __reg32_deduce_bounds. In __reg_deduce_mixed_bounds, under the same condition, we will use the s32 range to improve the s/u64 ranges. If at (1) we were able to learn from s32 to improve u32, we'll then be able to use that in (2) to improve s/u64. Hence, as (3) happens under the same precondition as (1), it won't improve s/u64 ranges further than (1)+(2) did. Thus, we can get rid of (3). In addition to the extensive suite of selftests for bounds refinement, this patch was also tested with the Agni formal verification tool [1]. Additionally, Eduard mentioned: The argument appears to be as follows: Under precondition `(u32)reg->s32_min <= (u32)reg->s32_max` __reg32_deduce_bounds produces: reg->u32_min = max_t(u32, reg->s32_min, reg->u32_min); reg->u32_max = min_t(u32, reg->s32_max, reg->u32_max); And then first part of __reg_deduce_mixed_bounds assigns: a. reg->umin umax= (reg->umin & ~0xffffffffULL) \| max_t(u32, reg->s32_min, reg->u32_min); b. reg->umax umin= (reg->umax & ~0xffffffffULL) \| min_t(u32, reg->s32_max, reg->u32_max); And then second part of __reg_deduce_mixed_bounds assigns: c. reg->umin umax= (reg->umin & ~0xffffffffULL) \| (u32)reg->s32_min; d. reg->umax umin= (reg->umax & ~0xffffffffULL) \| (u32)reg->s32_max; But assignment (c) is a noop because: max_t(u32, reg->s32_min, reg->u32_min) >= (u32)reg->s32_min Hence RHS(a) >= RHS(c) and umin= does nothing. Also assignment (d) is a noop because: min_t(u32, reg->s32_max, reg->u32_max) <= (u32)reg->s32_max Hence RHS(b) <= RHS(d) and umin= does nothing. Plus the same reasoning for the part dealing with reg->s{min,max}_value: e. reg->smin_value smax= (reg->smin_value & ~0xffffffffULL) \| max_t(u32, reg->s32_min_value, reg->u32_min_value); f. reg->smax_value smin= (reg->smax_value & ~0xffffffffULL) \| min_t(u32, reg->s32_max_value, reg->u32_max_value); vs g. reg->smin_value smax= (reg->smin_value & ~0xffffffffULL) \| (u32)reg->s32_min_value; h. reg->smax_value smin= (reg->smax_value & ~0xffffffffULL) \| (u32)reg->s32_max_value; RHS(e) >= RHS(g) and RHS(f) <= RHS(h), hence smax=,smin= do nothing. This appears to be correct. Also, Shung-Hsi: Beside going through the reasoning, I also played with CBMC a bit to double check that as far as a single run of __reg_deduce_bounds() is concerned (and that the register state matches certain handwavy expectations), the change indeed still preserve the original behavior. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://github.com/bpfverif/agni [1] Link: https://lore.kernel.org/bpf/aIJwnFnFyUjNsCNa@mail.gmail.com	2025-07-27 19:23:29 +02:00
Puranjay Mohan	3ba58312e6	bpf: Move bpf_jit_get_prog_name() to core.c bpf_jit_get_prog_name() will be used by all JITs when enabling support for private stack. This function is currently implemented in the x86 JIT. Move the function to core.c so that other JITs can easily use it in their implementation of private stack. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/20250724120257.7299-2-puranjay@kernel.org	2025-07-26 21:26:51 +02:00
Thomas Weißschuh	b7b3500bd4	umd: Remove usermode driver framework The code is unused since `98e20e5e13` ("bpfilter: remove bpfilter"), therefore remove it. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Link: https://lore.kernel.org/bpf/20250721-remove-usermode-driver-v1-2-0d0083334382@linutronix.de	2025-07-26 21:03:04 +02:00
Thomas Weißschuh	2b03164eee	bpf/preload: Don't select USERMODE_DRIVER The usermode driver framework is not used anymore by the BPF preload code. Fixes: `cb80ddc671` ("bpf: Convert bpf_preload.ko to use light skeleton.") Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/bpf/20250721-remove-usermode-driver-v1-1-0d0083334382@linutronix.de	2025-07-26 21:02:48 +02:00
Samiullah Khawaja	71c52411c5	net: Create separate gro_flush_normal function Move multiple copies of same code snippet doing `gro_flush` and `gro_normal_list` into separate helper function. Signed-off-by: Samiullah Khawaja <skhawaja@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250723013031.2911384-2-skhawaja@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-24 18:34:55 -07:00
Jakub Kicinski	a4f5759b6f	bpf-next-for-netdev -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ6NaUOruQGUkvPdG4raS+Z+3y5EwUCaIJYlAAKCRAraS+Z+3y5 E5MFAQDW29BJyjRbB75oy6RxmFZX+xFmGgmy1XO3w822gIwgzQD/WzhsmFPDYv/F 7iOpLvez6zTySUdTJXJGCTvYJG5EHwU= =U8S4 -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Martin KaFai Lau says: ==================== pull-request: bpf-next 2025-07-24 We've added 3 non-merge commits during the last 3 day(s) which contain a total of 4 files changed, 40 insertions(+), 15 deletions(-). The main changes are: 1) Improved verifier error message for incorrect narrower load from pointer field in ctx, from Paul Chaignon. 2) Disabled migration in nf_hook_run_bpf to address a syzbot report, from Kuniyuki Iwashima. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: selftests/bpf: Test invalid narrower ctx load bpf: Reject narrower access to pointer ctx fields bpf: Disable migration in nf_hook_run_bpf(). ==================== Link: https://patch.msgid.link/20250724173306.3578483-1-martin.lau@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-24 18:02:24 -07:00
Paul Chaignon	e09299225d	bpf: Reject narrower access to pointer ctx fields The following BPF program, simplified from a syzkaller repro, causes a kernel warning: r0 = (u8 )(r1 + 169); exit; With pointer field sk being at offset 168 in __sk_buff. This access is detected as a narrower read in bpf_skb_is_valid_access because it doesn't match offsetof(struct __sk_buff, sk). It is therefore allowed and later proceeds to bpf_convert_ctx_access. Note that for the "is_narrower_load" case in the convert_ctx_accesses(), the insn->off is aligned, so the cnt may not be 0 because it matches the offsetof(struct __sk_buff, sk) in the bpf_convert_ctx_access. However, the target_size stays 0 and the verifier errors with a kernel warning: verifier bug: error during ctx access conversion(1) This patch fixes that to return a proper "invalid bpf_context access off=X size=Y" error on the load instruction. The same issue affects multiple other fields in context structures that allow narrow access. Some other non-affected fields (for sk_msg, sk_lookup, and sockopt) were also changed to use bpf_ctx_range_ptr for consistency. Note this syzkaller crash was reported in the "Closes" link below, which used to be about a different bug, fixed in commit `fce7bd8e38` ("bpf/verifier: Handle BPF_LOAD_ACQ instructions in insn_def_regno()"). Because syzbot somehow confused the two bugs, the new crash and repro didn't get reported to the mailing list. Fixes: `f96da09473` ("bpf: simplify narrower ctx access") Fixes: `0df1a55afa` ("bpf: Warn on internal verifier errors") Reported-by: syzbot+0ef84a7bdf5301d4cbec@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=0ef84a7bdf5301d4cbec Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://patch.msgid.link/3b8dcee67ff4296903351a974ddd9c4dca768b64.1753194596.git.paul.chaignon@gmail.com	2025-07-23 19:33:49 -07:00
Yonghong Song	95993dc303	bpf: Use ERR_CAST instead of ERR_PTR(PTR_ERR(...)) Intel linux test robot reported a warning that ERR_CAST can be used for error pointer casting instead of more-complicated/rarely-used ERR_PTR(PTR_ERR(...)) style. There is no functionality change, but still let us replace two such instances as it improves consistency and readability. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202507201048.bceHy8zX-lkp@intel.com/ Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://patch.msgid.link/20250720164754.3999140-1-yonghong.song@linux.dev	2025-07-21 17:27:09 -07:00
Alexei Starovoitov	beb1097ec8	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc6 Cross-merge BPF and other fixes after downstream PR. No conflicts. Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-18 12:15:59 -07:00
Lorenz Bauer	2e2713ae1a	btf: Fix virt_to_phys() on arm64 when mmapping BTF Breno Leitao reports that arm64 emits the following warning with CONFIG_DEBUG_VIRTUAL: [ 58.896157] virt_to_phys used for non-linear address: 000000009fea9737 (__start_BTF+0x0/0x685530) [ 23.988669] WARNING: CPU: 25 PID: 1442 at arch/arm64/mm/physaddr.c:15 __virt_to_phys (arch/arm64/mm/physaddr.c:?) ... [ 24.075371] Tainted: [E]=UNSIGNED_MODULE, [N]=TEST [ 24.080276] Hardware name: Quanta S7GM 20S7GCU0010/S7G MB (CG1), BIOS 3D22 07/03/2024 [ 24.088295] pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) [ 24.098440] pc : __virt_to_phys (arch/arm64/mm/physaddr.c:?) [ 24.105398] lr : __virt_to_phys (arch/arm64/mm/physaddr.c:?) ... [ 24.197257] Call trace: [ 24.199761] __virt_to_phys (arch/arm64/mm/physaddr.c:?) (P) [ 24.206883] btf_sysfs_vmlinux_mmap (kernel/bpf/sysfs_btf.c:27) [ 24.214264] sysfs_kf_bin_mmap (fs/sysfs/file.c:179) [ 24.218536] kernfs_fop_mmap (fs/kernfs/file.c:462) [ 24.222461] mmap_region (./include/linux/fs.h:? mm/internal.h:167 mm/vma.c:2405 mm/vma.c:2467 mm/vma.c:2622 mm/vma.c:2692) It seems that the memory layout on arm64 maps the kernel image in vmalloc space which is different than x86. This makes virt_to_phys emit the warning. Fix this by translating the address using __pa_symbol as suggested by Breno instead. Reported-by: Breno Leitao <leitao@debian.org> Closes: https://lore.kernel.org/bpf/g2gqhkunbu43awrofzqb4cs4sxkxg2i4eud6p4qziwrdh67q4g@mtw3d3aqfgmb/ Signed-off-by: Lorenz Bauer <lmb@isovalent.com> Tested-by: Breno Leitao <leitao@debian> Fixes: `a539e2a6d5` ("btf: Allow mmap of vmlinux btf") Link: https://lore.kernel.org/r/20250717-vmlinux-mmap-pa-symbol-v1-1-970be6681158@isovalent.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-17 11:33:52 -07:00
Tao Chen	19d18fdfc7	bpf: Add struct bpf_token_info The 'commit `35f96de041` ("bpf: Introduce BPF token object")' added BPF token as a new kind of BPF kernel object. And BPF_OBJ_GET_INFO_BY_FD already used to get BPF object info, so we can also get token info with this cmd. One usage scenario, when program runs failed with token, because of the permission failure, we can report what BPF token is allowing with this API for debugging. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Tao Chen <chen.dylane@linux.dev> Link: https://lore.kernel.org/r/20250716134654.1162635-1-chen.dylane@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-16 18:38:05 -07:00
Feng Yang	62ef449b8d	bpf: Clean up individual BTF_ID code Use BTF_ID_LIST_SINGLE(a, b, c) instead of BTF_ID_LIST(a) BTF_ID(b, c) Signed-off-by: Feng Yang <yangfeng@kylinos.cn> Link: https://lore.kernel.org/r/20250710055419.70544-1-yangfeng59949@163.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-16 18:34:42 -07:00
Ilya Leoshkevich	1f489662fb	bpf: Update iterators.lskel-big-endian.h The last iterators update (commit `515ee52b22` ("bpf: make preloaded map iterators to display map elements count")) missed the big-endian skeleton. Update it by running "make big" with Debian clang version 21.0.0 (++20250706105601+01c97b4953e8-1~exp1~20250706225612.1558). Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Link: https://lore.kernel.org/r/20250710100907.45880-1-iii@linux.ibm.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-16 18:29:18 -07:00
Eric Biggers	9503ca2cca	lib/crypto: sha1: Rename sha1_init() to sha1_init_raw() Rename the existing sha1_init() to sha1_init_raw(), since it conflicts with the upcoming library function. This will later be removed, but this keeps the kernel building for the introduction of the library. Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20250712232329.818226-3-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>	2025-07-14 08:22:31 -07:00
Tao Chen	0eeeebdcc5	bpf: Remove attach_type in bpf_tracing_link Use attach_type in bpf_link, and remove it in bpf_tracing_link. Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20250710032038.888700-7-chen.dylane@linux.dev	2025-07-11 11:01:08 -07:00
Tao Chen	2a76a80c7f	bpf: Remove attach_type in bpf_netns_link Use attach_type in bpf_link, and remove it in bpf_netns_link. And move netns_type field to the end to fill the byte hole. Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20250710032038.888700-6-chen.dylane@linux.dev	2025-07-11 11:01:04 -07:00
Tao Chen	6e816e1c05	bpf: Remove location field in tcx_link Use attach_type in bpf_link to replace the location filed, and remove location field in tcx_link. Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20250710032038.888700-5-chen.dylane@linux.dev	2025-07-11 11:00:57 -07:00
Tao Chen	9b8d543dc2	bpf: Remove attach_type in bpf_cgroup_link Use attach_type in bpf_link, and remove it in bpf_cgroup_link. Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20250710032038.888700-3-chen.dylane@linux.dev	2025-07-11 10:51:55 -07:00
Tao Chen	b725441f02	bpf: Add attach_type field to bpf_link Attach_type will be set when a link is created by user. It is better to record attach_type in bpf_link generically and have it available universally for all link types. So add the attach_type field in bpf_link and move the sleepable field to avoid unnecessary gap padding. Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20250710032038.888700-2-chen.dylane@linux.dev	2025-07-11 10:51:55 -07:00
Paul Chaignon	6279846b9b	bpf: Forget ranges when refining tnum after JSET Syzbot reported a kernel warning due to a range invariant violation on the following BPF program. 0: call bpf_get_netns_cookie 1: if r0 == 0 goto <exit> 2: if r0 & Oxffffffff goto <exit> The issue is on the path where we fall through both jumps. That path is unreachable at runtime: after insn 1, we know r0 != 0, but with the sign extension on the jset, we would only fallthrough insn 2 if r0 == 0. Unfortunately, is_branch_taken() isn't currently able to figure this out, so the verifier walks all branches. The verifier then refines the register bounds using the second condition and we end up with inconsistent bounds on this unreachable path: 1: if r0 == 0 goto <exit> r0: u64=[0x1, 0xffffffffffffffff] var_off=(0, 0xffffffffffffffff) 2: if r0 & 0xffffffff goto <exit> r0 before reg_bounds_sync: u64=[0x1, 0xffffffffffffffff] var_off=(0, 0) r0 after reg_bounds_sync: u64=[0x1, 0] var_off=(0, 0) Improving the range refinement for JSET to cover all cases is tricky. We also don't expect many users to rely on JSET given LLVM doesn't generate those instructions. So instead of improving the range refinement for JSETs, Eduard suggested we forget the ranges whenever we're narrowing tnums after a JSET. This patch implements that approach. Reported-by: syzbot+c711ce17dd78e5d4fdcf@syzkaller.appspotmail.com Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/9d4fd6432a095d281f815770608fdcd16028ce0b.1752171365.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-11 10:45:25 -07:00
Emil Tsalapatis	8fc3d2d8b5	bpf/arena: add bpf_arena_reserve_pages kfunc Add a new BPF arena kfunc for reserving a range of arena virtual addresses without backing them with pages. This prevents the range from being populated using bpf_arena_alloc_pages(). Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250709191312.29840-2-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-11 10:43:54 -07:00
Tao Chen	3413bc0cf1	bpf: Clean code with bpf_copy_to_user() No logic change, use bpf_copy_to_user() to clean code. Signed-off-by: Tao Chen <chen.dylane@linux.dev> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20250703163700.677628-1-chen.dylane@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-07 08:53:59 -07:00
Luis Gerhorst	dadb59104c	bpf: Fix aux usage after do_check_insn() We must terminate the speculative analysis if the just-analyzed insn had nospec_result set. Using cur_aux() here is wrong because insn_idx might have been incremented by do_check_insn(). Therefore, introduce and use insn_aux variable. Also change cur_aux(env)->nospec in case do_check_insn() ever manages to increment insn_idx but still fail. Change the warning to check the insn class (which prevents it from triggering for ldimm64, for which nospec_result would not be problematic) and use verifier_bug_if(). In line with Eduard's suggestion, do not introduce prev_aux() because that requires one to understand that after do_check_insn() call what was current became previous. This would at-least require a comment. Fixes: `d6f1c85f22` ("bpf: Fall back to nospec for Spectre v1") Reported-by: Paul Chaignon <paul.chaignon@gmail.com> Reported-by: Eduard Zingerman <eddyz87@gmail.com> Reported-by: syzbot+dc27c5fb8388e38d2d37@syzkaller.appspotmail.com Link: https://lore.kernel.org/bpf/685b3c1b.050a0220.2303ee.0010.GAE@google.com/ Link: https://lore.kernel.org/bpf/4266fd5de04092aa4971cbef14f1b4b96961f432.camel@gmail.com/ Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250705190908.1756862-2-luis.gerhorst@fau.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-07 08:32:34 -07:00
Kumar Kartikeya Dwivedi	bfa2bb9abd	bpf: Fix improper int-to-ptr cast in dump_stack_cb On 32-bit platforms, we'll try to convert a u64 directly to a pointer type which is 32-bit, which causes the compiler to complain about cast from an integer of a different size to a pointer type. Cast to long before casting to the pointer type to match the pointer width. Reported-by: kernelci.org bot <bot@kernelci.org> Reported-by: Randy Dunlap <rdunlap@infradead.org> Fixes: `d7c431cafc` ("bpf: Add dump_stack() analogue to print to BPF stderr") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20250705053035.3020320-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-07 08:30:15 -07:00
Kumar Kartikeya Dwivedi	116c8f4747	bpf: Fix bounds for bpf_prog_get_file_line linfo loop We may overrun the bounds because linfo and jited_linfo are already advanced to prog->aux->linfo_idx, hence we must only iterate the remaining elements until we reach prog->aux->nr_linfo. Adjust the nr_linfo calculation to fix this. Reported in [0]. [0]: https://lore.kernel.org/bpf/f3527af3b0620ce36e299e97e7532d2555018de2.camel@gmail.com Reported-by: Eduard Zingerman <eddyz87@gmail.com> Fixes: `0e521efaf3` ("bpf: Add function to extract program source info") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250705053035.3020320-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-07 08:30:15 -07:00
Eduard Zingerman	c4aa454c64	bpf: support for void/primitive __arg_untrusted global func params Allow specifying __arg_untrusted for void /char /int /long parameters. Treat such parameters as PTR_TO_MEM\|MEM_RDONLY\|PTR_UNTRUSTED of size zero. Intended usage is as follows: int memcmp(char a __arg_untrusted, char b __arg_untrusted, size_t n) { bpf_for(i, 0, n) { if (a[i] - b[i]) // load at any offset is allowed return a[i] - b[i]; } return 0; } Allocate register id for ARG_PTR_TO_MEM parameters only when PTR_MAYBE_NULL is set. Register id for PTR_TO_MEM is used only to propagate non-null status after conditionals. Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250704230354.1323244-8-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-07 08:25:07 -07:00
Eduard Zingerman	182f7df704	bpf: attribute __arg_untrusted for global function parameters Add support for PTR_TO_BTF_ID \| PTR_UNTRUSTED global function parameters. Anything is allowed to pass to such parameters, as these are read-only and probe read instructions would protect against invalid memory access. Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250704230354.1323244-5-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-07 08:25:06 -07:00
Eduard Zingerman	2d5c91e1cc	bpf: rdonly_untrusted_mem for btf id walk pointer leafs When processing a load from a PTR_TO_BTF_ID, the verifier calculates the type of the loaded structure field based on the load offset. For example, given the following types: struct foo { struct foo a; int b; } *p; The verifier would calculate the type of `p->a` as a pointer to `struct foo`. However, the type of `p->b` is currently calculated as a SCALAR_VALUE. This commit updates the logic for processing PTR_TO_BTF_ID to instead calculate the type of p->b as PTR_TO_MEM\|MEM_RDONLY\|PTR_UNTRUSTED. This change allows further dereferencing of such pointers (using probe memory instructions). Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250704230354.1323244-3-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-07 08:25:06 -07:00
Eduard Zingerman	b9d44bc9fd	bpf: make makr_btf_ld_reg return error for unexpected reg types Non-functional change: mark_btf_ld_reg() expects 'reg_type' parameter to be either SCALAR_VALUE or PTR_TO_BTF_ID. Next commit expands this set, so update this function to fail if unexpected type is passed. Also update callers to propagate the error. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250704230354.1323244-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-07 08:25:06 -07:00
Yonghong Song	82bc4abf28	bpf: Avoid putting struct bpf_scc_callchain variables on the stack Add a 'struct bpf_scc_callchain callchain_buf' field in bpf_verifier_env. This way, the previous bpf_scc_callchain local variables can be replaced by taking address of env->callchain_buf. This can reduce stack usage and fix the following error: kernel/bpf/verifier.c:19921:12: error: stack frame size (1368) exceeds limit (1280) in 'do_check' [-Werror,-Wframe-larger-than] Reported-by: Arnd Bergmann <arnd@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20250703141117.1485108-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-03 19:31:30 -07:00
Yonghong Song	45e9cd38aa	bpf: Reduce stack frame size by using env->insn_buf for bpf insns Arnd Bergmann reported an issue ([1]) where clang compiler (less than llvm18) may trigger an error where the stack frame size exceeds the limit. I can reproduce the error like below: kernel/bpf/verifier.c:24491:5: error: stack frame size (2552) exceeds limit (1280) in 'bpf_check' [-Werror,-Wframe-larger-than] kernel/bpf/verifier.c:19921:12: error: stack frame size (1368) exceeds limit (1280) in 'do_check' [-Werror,-Wframe-larger-than] Use env->insn_buf for bpf insns instead of putting these insns on the stack. This can resolve the above 'bpf_check' error. The 'do_check' error will be resolved in the next patch. [1] https://lore.kernel.org/bpf/20250620113846.3950478-1-arnd@kernel.org/ Reported-by: Arnd Bergmann <arnd@kernel.org> Tested-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20250703141111.1484521-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-03 19:31:29 -07:00
Yonghong Song	3b87251439	bpf: Simplify assignment to struct bpf_insn pointer in do_misc_fixups() In verifier.c, the following code patterns (in two places) struct bpf_insn patch = &insn_buf[0]; can be simplified to struct bpf_insn patch = insn_buf; which is easier to understand. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20250703141106.1483216-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-03 19:31:29 -07:00
Paul Chaignon	032547272e	bpf: Avoid warning on unexpected map for tail call Before handling the tail call in record_func_key(), we check that the map is of the expected type and log a verifier error if it isn't. Such an error however doesn't indicate anything wrong with the verifier. The check for map<>func compatibility is done after record_func_key(), by check_map_func_compatibility(). Therefore, this patch logs the error as a typical reject instead of a verifier error. Fixes: `d2e4c1e6c2` ("bpf: Constant map key tracking for prog array pokes") Fixes: `0df1a55afa` ("bpf: Warn on internal verifier errors") Reported-by: syzbot+efb099d5833bca355e51@syzkaller.appspotmail.com Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/1f395b74e73022e47e04a31735f258babf305420.1751578055.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-03 19:30:54 -07:00
Kumar Kartikeya Dwivedi	ecec5b5743	bpf: Report rqspinlock deadlocks/timeout to BPF stderr Begin reporting rqspinlock deadlocks and timeout to BPF program's stderr. Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250703204818.925464-9-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-03 19:30:07 -07:00
Kumar Kartikeya Dwivedi	e8d0133022	bpf: Report may_goto timeout to BPF stderr Begin reporting may_goto timeouts to BPF program's stderr stream. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250703204818.925464-8-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-03 19:30:07 -07:00
Kumar Kartikeya Dwivedi	d7c431cafc	bpf: Add dump_stack() analogue to print to BPF stderr Introduce a kernel function which is the analogue of dump_stack() printing some useful information and the stack trace. This is not exposed to BPF programs yet, but can be made available in the future. When we have a program counter for a BPF program in the stack trace, also additionally output the filename and line number to make the trace helpful. The rest of the trace can be passed into ./decode_stacktrace.sh to obtain the line numbers for kernel symbols. Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250703204818.925464-7-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-03 19:30:07 -07:00
Kumar Kartikeya Dwivedi	f0c53fd4a7	bpf: Add function to find program from stack trace In preparation of figuring out the closest program that led to the current point in the kernel, implement a function that scans through the stack trace and finds out the closest BPF program when walking down the stack trace. Special care needs to be taken to skip over kernel and BPF subprog frames. We basically scan until we find a BPF main prog frame. The assumption is that if a program calls into us transitively, we'll hit it along the way. If not, we end up returning NULL. Contextually the function will be used in places where we know the program may have called into us. Due to reliance on arch_bpf_stack_walk(), this function only works on x86 with CONFIG_UNWINDER_ORC, arm64, and s390. Remove the warning from arch_bpf_stack_walk as well since we call it outside bpf_throw() context. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250703204818.925464-6-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-03 19:30:06 -07:00
Kumar Kartikeya Dwivedi	d090326860	bpf: Ensure RCU lock is held around bpf_prog_ksym_find Add a warning to ensure RCU lock is held around tree lookup, and then fix one of the invocations in bpf_stack_walker. The program has an active stack frame and won't disappear. Use the opportunity to remove unneeded invocation of is_bpf_text_address. Fixes: `f18b03faba` ("bpf: Implement BPF exceptions") Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250703204818.925464-5-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-03 19:30:06 -07:00
Kumar Kartikeya Dwivedi	0e521efaf3	bpf: Add function to extract program source info Prepare a function for use in future patches that can extract the file info, line info, and the source line number for a given BPF program provided it's program counter. Only the basename of the file path is provided, given it can be excessively long in some cases. This will be used in later patches to print source info to the BPF stream. Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250703204818.925464-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-03 19:30:06 -07:00
Kumar Kartikeya Dwivedi	5ab154f146	bpf: Introduce BPF standard streams Add support for a stream API to the kernel and expose related kfuncs to BPF programs. Two streams are exposed, BPF_STDOUT and BPF_STDERR. These can be used for printing messages that can be consumed from user space, thus it's similar in spirit to existing trace_pipe interface. The kernel will use the BPF_STDERR stream to notify the program of any errors encountered at runtime. BPF programs themselves may use both streams for writing debug messages. BPF library-like code may use BPF_STDERR to print warnings or errors on misuse at runtime. The implementation of a stream is as follows. Everytime a message is emitted from the kernel (directly, or through a BPF program), a record is allocated by bump allocating from per-cpu region backed by a page obtained using alloc_pages_nolock(). This ensures that we can allocate memory from any context. The eventual plan is to discard this scheme in favor of Alexei's kmalloc_nolock() [0]. This record is then locklessly inserted into a list (llist_add()) so that the printing side doesn't require holding any locks, and works in any context. Each stream has a maximum capacity of 4MB of text, and each printed message is accounted against this limit. Messages from a program are emitted using the bpf_stream_vprintk kfunc, which takes a stream_id argument in addition to working otherwise similar to bpf_trace_vprintk. The bprintf buffer helpers are extracted out to be reused for printing the string into them before copying it into the stream, so that we can (with the defined max limit) format a string and know its true length before performing allocations of the stream element. For consuming elements from a stream, we expose a bpf(2) syscall command named BPF_PROG_STREAM_READ_BY_FD, which allows reading data from the stream of a given prog_fd into a user space buffer. The main logic is implemented in bpf_stream_read(). The log messages are queued in bpf_stream::log by the bpf_stream_vprintk kfunc, and then pulled and ordered correctly in the stream backlog. For this purpose, we hold a lock around bpf_stream_backlog_peek(), as llist_del_first() (if we maintained a second lockless list for the backlog) wouldn't be safe from multiple threads anyway. Then, if we fail to find something in the backlog log, we splice out everything from the lockless log, and place it in the backlog log, and then return the head of the backlog. Once the full length of the element is consumed, we will pop it and free it. The lockless list bpf_stream::log is a LIFO stack. Elements obtained using a llist_del_all() operation are in LIFO order, thus would break the chronological ordering if printed directly. Hence, this batch of messages is first reversed. Then, it is stashed into a separate list in the stream, i.e. the backlog_log. The head of this list is the actual message that should always be returned to the caller. All of this is done in bpf_stream_backlog_fill(). From the kernel side, the writing into the stream will be a bit more involved than the typical printk. First, the kernel typically may print a collection of messages into the stream, and parallel writers into the stream may suffer from interleaving of messages. To ensure each group of messages is visible atomically, we can lift the advantage of using a lockless list for pushing in messages. To enable this, we add a bpf_stream_stage() macro, and require kernel users to use bpf_stream_printk statements for the passed expression to write into the stream. Underneath the macro, we have a message staging API, where a bpf_stream_stage object on the stack accumulates the messages being printed into a local llist_head, and then a commit operation splices the whole batch into the stream's lockless log list. This is especially pertinent for rqspinlock deadlock messages printed to program streams. After this change, we see each deadlock invocation as a non-interleaving contiguous message without any confusion on the reader's part, improving their user experience in debugging the fault. While programs cannot benefit from this staged stream writing API, they could just as well hold an rqspinlock around their print statements to serialize messages, hence this is kept kernel-internal for now. Overall, this infrastructure provides NMI-safe any context printing of messages to two dedicated streams. Later patches will add support for printing splats in case of BPF arena page faults, rqspinlock deadlocks, and cond_break timeouts, and integration of this facility into bpftool for dumping messages to user space. [0]: https://lore.kernel.org/bpf/20250501032718.65476-1-alexei.starovoitov@gmail.com Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250703204818.925464-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-03 19:30:06 -07:00
Kumar Kartikeya Dwivedi	0426729f46	bpf: Refactor bprintf buffer support Refactor code to be able to get and put bprintf buffers and use bpf_printf_prepare independently. This will be used in the next patch to implement BPF streams support, particularly as a staging buffer for strings that need to be formatted and then allocated and pushed into a stream. Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250703204818.925464-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-03 19:30:06 -07:00
Tao Chen	803f0700a3	bpf: Show precise link_type for {uprobe,kprobe}_multi fdinfo Alexei suggested, 'link_type' can be more precise and differentiate for human in fdinfo. In fact BPF_LINK_TYPE_KPROBE_MULTI includes kretprobe_multi type, the same as BPF_LINK_TYPE_UPROBE_MULTI, so we can show it more concretely. link_type: kprobe_multi link_id: 1 prog_tag: d2b307e915f0dd37 ... link_type: kretprobe_multi link_id: 2 prog_tag: ab9ea0545870781d ... link_type: uprobe_multi link_id: 9 prog_tag: e729f789e34a8eca ... link_type: uretprobe_multi link_id: 10 prog_tag: 7db356c03e61a4d4 Co-developed-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Tao Chen <chen.dylane@linux.dev> Link: https://lore.kernel.org/r/20250702153958.639852-1-chen.dylane@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-03 19:29:42 -07:00
Ihor Solodrai	5fc5d8fded	bpf: Add bpf_dynptr_memset() kfunc Currently there is no straightforward way to fill dynptr memory with a value (most commonly zero). One can do it with bpf_dynptr_write(), but a temporary buffer is necessary for that. Implement bpf_dynptr_memset() - an analogue of memset() from libc. Signed-off-by: Ihor Solodrai <isolodrai@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250702210309.3115903-2-isolodrai@meta.com	2025-07-03 15:21:20 -07:00
Paul Chaignon	65fdafd676	bpf: Avoid warning on multiple referenced args in call The description of full helper calls in syzkaller [1] and the addition of kernel warnings in commit `0df1a55afa` ("bpf: Warn on internal verifier errors") allowed syzbot to reach a verifier state that was thought to indicate a verifier bug [2]: 12: (85) call bpf_tcp_raw_gen_syncookie_ipv4#204 verifier bug: more than one arg with ref_obj_id R2 2 2 This error can be reproduced with the program from the previous commit: 0: (b7) r2 = 20 1: (b7) r3 = 0 2: (18) r1 = 0xffff92cee3cbc600 4: (85) call bpf_ringbuf_reserve#131 5: (55) if r0 == 0x0 goto pc+3 6: (bf) r1 = r0 7: (bf) r2 = r0 8: (85) call bpf_tcp_raw_gen_syncookie_ipv4#204 9: (95) exit bpf_tcp_raw_gen_syncookie_ipv4 expects R1 and R2 to be ARG_PTR_TO_FIXED_SIZE_MEM (with a size of at least sizeof(struct iphdr) for R1). R0 is a ring buffer payload of 20B and therefore matches this requirement. The verifier reaches the check on ref_obj_id while verifying R2 and rejects the program because the helper isn't supposed to take two referenced arguments. This case is a legitimate rejection and doesn't indicate a kernel bug, so we shouldn't log it as such and shouldn't emit a kernel warning. Link: https://github.com/google/syzkaller/pull/4313 [1] Link: https://lore.kernel.org/all/686491d6.a70a0220.3b7e22.20ea.GAE@google.com/T/ [2] Fixes: `457f44363a` ("bpf: Implement BPF ring buffer and verifier support for it") Fixes: `0df1a55afa` ("bpf: Warn on internal verifier errors") Reported-by: syzbot+69014a227f8edad4d8c6@syzkaller.appspotmail.com Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/cd09afbfd7bef10bbc432d72693f78ffdc1e8ee5.1751463262.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>	2025-07-02 10:43:38 -07:00
Eduard Zingerman	c3b9faac9b	bpf: avoid jump misprediction for PTR_TO_MEM \| PTR_UNTRUSTED Commit `f2362a57ae` ("bpf: allow void* cast using bpf_rdonly_cast()") added a notion of PTR_TO_MEM \| MEM_RDONLY \| PTR_UNTRUSTED type. This simultaneously introduced a bug in jump prediction logic for situations like below: p = bpf_rdonly_cast(..., 0); if (p) a(); else b(); Here verifier would wrongly predict that else branch cannot be taken. This happens because: - Function reg_not_null() assumes that PTR_TO_MEM w/o PTR_MAYBE_NULL flag cannot be zero. - The call to bpf_rdonly_cast() returns a rdonly_untrusted_mem value w/o PTR_MAYBE_NULL flag. Tracking of PTR_MAYBE_NULL flag for untrusted PTR_TO_MEM does not make sense, as the main purpose of the flag is to catch null pointer access errors. Such errors are not possible on load of PTR_UNTRUSTED values and verifier makes sure that PTR_UNTRUSTED can't be passed to helpers or kfuncs. Hence, modify reg_not_null() to assume that nullness of untrusted PTR_TO_MEM is not known. Fixes: `f2362a57ae` ("bpf: allow void* cast using bpf_rdonly_cast()") Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250702073620.897517-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>	2025-07-02 10:43:10 -07:00
Song Liu	5bc9557c9f	bpf: Mark cgroup_subsys_state->cgroup RCU safe Mark struct cgroup_subsys_state->cgroup as safe under RCU read lock. This will enable accessing css->cgroup from a bpf css iterator. Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/20250623063854.1896364-4-song@kernel.org Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-07-02 14:18:20 +02:00
Song Liu	b95ee9049c	bpf: Introduce bpf_cgroup_read_xattr to read xattr of cgroup's node BPF programs, such as LSM and sched_ext, would benefit from tags on cgroups. One common practice to apply such tags is to set xattrs on cgroupfs folders. Introduce kfunc bpf_cgroup_read_xattr, which allows reading cgroup's xattr. Note that, we already have bpf_get_[file\|dentry]_xattr. However, these two APIs are not ideal for reading cgroupfs xattrs, because: 1) These two APIs only works in sleepable contexts; 2) There is no kfunc that matches current cgroup to cgroupfs dentry. bpf_cgroup_read_xattr is generic and can be useful for many program types. It is also safe, because it requires trusted or rcu protected argument (KF_RCU). Therefore, we make it available to all program types. Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/20250623063854.1896364-3-song@kernel.org Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-07-02 14:18:20 +02:00
Paul Chaignon	f824274587	bpf: Reject %p% format string in bprintf-like helpers static const char fmt[] = "%p%"; bpf_trace_printk(fmt, sizeof(fmt)); The above BPF program isn't rejected and causes a kernel warning at runtime: Please remove unsupported %\x00 in format string WARNING: CPU: 1 PID: 7244 at lib/vsprintf.c:2680 format_decode+0x49c/0x5d0 This happens because bpf_bprintf_prepare skips over the second %, detected as punctuation, while processing %p. This patch fixes it by not skipping over punctuation. %\x00 is then processed in the next iteration and rejected. Reported-by: syzbot+e2c932aec5c8a6e1d31c@syzkaller.appspotmail.com Fixes: `48cac3f4a9` ("bpf: Implement formatted output helpers with bstr_printf") Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/a0e06cc479faec9e802ae51ba5d66420523251ee.1751395489.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-01 15:22:46 -07:00
Paul Chaignon	0df1a55afa	bpf: Warn on internal verifier errors This patch is a follow up to commit `1cb0f56d96` ("bpf: WARN_ONCE on verifier bugs"). It generalizes the use of verifier_error throughout the verifier, in particular for logs previously marked "verifier internal error". As a consequence, all of those verifier bugs will now come with a kernel warning (under CONFIG_DBEUG_KERNEL) detectable by fuzzers. While at it, some error messages that were too generic (ex., "bpf verifier is misconfigured") have been reworded. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/aGQqnzMyeagzgkCK@Tunnel Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-01 12:38:30 -07:00
Eduard Zingerman	a5a7b25d75	bpf: guard BTF_ID_FLAGS(bpf_cgroup_read_xattr) with CONFIG_BPF_LSM Function bpf_cgroup_read_xattr is defined in fs/bpf_fs_kfuncs.c, which is compiled only when CONFIG_BPF_LSM is set. Add CONFIG_BPF_LSM check to bpf_cgroup_read_xattr spec in common_btf_ids in kernel/bpf/helpers.c to avoid build failures for configs w/o CONFIG_BPF_LSM. Build failure example: BTF .tmp_vmlinux1.btf.o btf_encoder__tag_kfunc: failed to find kfunc 'bpf_cgroup_read_xattr' in BTF ... WARN: resolve_btfids: unresolved symbol bpf_cgroup_read_xattr make[2]: *** [scripts/Makefile.vmlinux:91: vmlinux.unstripped] Error 255 Fixes: `535b070f4a` ("bpf: Introduce bpf_cgroup_read_xattr to read xattr of cgroup's node") Reported-by: Jake Hillion <jakehillion@meta.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250627175309.2710973-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-27 12:18:42 -07:00
Viktor Malik	5272b51367	bpf: Fix string kfuncs names in doc comments Documentation comments for bpf_strnlen and bpf_strcspn contained incorrect function names. Fixes: `e91370550f` ("bpf: Add kfuncs for read-only string operations") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Closes: https://lore.kernel.org/bpf/20250627174759.3a435f86@canb.auug.org.au/T/#u Signed-off-by: Viktor Malik <vmalik@redhat.com> Link: https://lore.kernel.org/r/20250627082001.237606-1-vmalik@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-27 08:48:06 -07:00
Alexei Starovoitov	48d998af99	Merge branch 'vfs-6.17.bpf' of https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Merge branch 'vfs-6.17.bpf' from vfs tree into bpf-next/master and resolve conflicts. Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-26 19:01:04 -07:00
Alexei Starovoitov	886178a33a	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc3 Cross-merge BPF, perf and other fixes after downstream PRs. It restores BPF CI to green after critical fix commit `bc4394e5e7` ("perf: Fix the throttle error of some clock events") No conflicts. Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-26 09:49:39 -07:00
Viktor Malik	e91370550f	bpf: Add kfuncs for read-only string operations String operations are commonly used so this exposes the most common ones to BPF programs. For now, we limit ourselves to operations which do not copy memory around. Unfortunately, most in-kernel implementations assume that strings are %NUL-terminated, which is not necessarily true, and therefore we cannot use them directly in the BPF context. Instead, we open-code them using __get_kernel_nofault instead of plain dereference to make them safe and limit the strings length to XATTR_SIZE_MAX to make sure the functions terminate. When __get_kernel_nofault fails, functions return -EFAULT. Similarly, when the size bound is reached, the functions return -E2BIG. In addition, we return -ERANGE when the passed strings are outside of the kernel address space. Note that thanks to these dynamic safety checks, no other constraints are put on the kfunc args (they are marked with the "__ign" suffix to skip any verifier checks for them). All of the functions return integers, including functions which normally (in kernel or libc) return pointers to the strings. The reason is that since the strings are generally treated as unsafe, the pointers couldn't be dereferenced anyways. So, instead, we return an index to the string and let user decide what to do with it. This also nicely fits with returning various error codes when necessary (see above). Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Viktor Malik <vmalik@redhat.com> Link: https://lore.kernel.org/r/4b008a6212852c1b056a413f86e3efddac73551c.1750917800.git.vmalik@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-26 09:44:45 -07:00
Anton Protopopov	d83caf7c8d	bpf: add btf_type_is_i{32,64} helpers There are places in BPF code which check if a BTF type is an integer of particular size. This code can be made simpler by using helpers. Add new btf_type_is_i{32,64} helpers, and simplify code in a few files. (Suggested by Eduard for a patch which copy-pasted such a check [1].) v1 -> v2: * export less generic helpers (Eduard) * make subject less generic than in [v1] (Eduard) [1] https://lore.kernel.org/bpf/7edb47e73baa46705119a23c6bf4af26517a640f.camel@gmail.com/ [v1] https://lore.kernel.org/bpf/20250624193655.733050-1-a.s.protopopov@gmail.com/ Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250625151621.1000584-1-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-25 15:15:49 -07:00
Eduard Zingerman	f2362a57ae	bpf: allow void* cast using bpf_rdonly_cast() Introduce support for `bpf_rdonly_cast(v, 0)`, which casts the value `v` to an untyped, untrusted pointer, logically similar to a `void *`. The memory pointed to by such a pointer is treated as read-only. As with other untrusted pointers, memory access violations on loads return zero instead of causing a fault. Technically: - The resulting pointer is represented as a register of type `PTR_TO_MEM \| MEM_RDONLY \| PTR_UNTRUSTED` with size zero. - Offsets within such pointers are not tracked. - Same load instructions are allowed to have both `PTR_TO_MEM \| MEM_RDONLY \| PTR_UNTRUSTED` and `PTR_TO_BTF_ID` as the base pointer types. In such cases, `bpf_insn_aux_data->ptr_type` is considered the weaker of the two: `PTR_TO_MEM \| MEM_RDONLY \| PTR_UNTRUSTED`. The following constraints apply to the new pointer type: - can be used as a base for LDX instructions; - can't be used as a base for ST/STX or atomic instructions; - can't be used as parameter for kfuncs or helpers. These constraints are enforced by existing handling of `MEM_RDONLY` flag and `PTR_TO_MEM` of size zero. Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250625182414.30659-3-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-25 15:13:16 -07:00
Eduard Zingerman	b23e97ffc2	bpf: add bpf_features enum This commit adds a kernel side enum for use in conjucntion with BTF CO-RE bpf_core_enum_value_exists. The goal of the enum is to assist with available BPF features detection. Intended usage looks as follows: if (bpf_core_enum_value_exists(enum bpf_features, BPF_FEAT_<f>)) ... use feature f ... Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250625182414.30659-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-25 15:13:15 -07:00
Song Liu	aced132599	bpf: Add range tracking for BPF_NEG Add range tracking for instruction BPF_NEG. Without this logic, a trivial program like the following will fail volatile bool found_value_b; SEC("lsm.s/socket_connect") int BPF_PROG(test_socket_connect) { if (!found_value_b) return -1; return 0; } with verifier log: "At program exit the register R0 has smin=0 smax=4294967295 should have been in [-4095, 0]". This is because range information is lost in BPF_NEG: 0: R1=ctx() R10=fp0 ; if (!found_value_b) @ xxxx.c:24 0: (18) r1 = 0xffa00000011e7048 ; R1_w=map_value(...) 2: (71) r0 = (u8 )(r1 +0) ; R0_w=scalar(smin32=0,smax=255) 3: (a4) w0 ^= 1 ; R0_w=scalar(smin32=0,smax=255) 4: (84) w0 = -w0 ; R0_w=scalar(range info lost) Note that, the log above is manually modified to highlight relevant bits. Fix this by maintaining proper range information with BPF_NEG, so that the verifier will know: 4: (84) w0 = -w0 ; R0_w=scalar(smin32=-255,smax=0) Also updated selftests based on the expected behavior. Signed-off-by: Song Liu <song@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250625164025.3310203-2-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-25 15:12:17 -07:00
Harishankar Vishwanathan	7a998a7316	bpf, verifier: Improve precision for BPF_ADD and BPF_SUB This patch improves the precison of the scalar(32)_min_max_add and scalar(32)_min_max_sub functions, which update the u(32)min/u(32)_max ranges for the BPF_ADD and BPF_SUB instructions. We discovered this more precise operator using a technique we are developing for automatically synthesizing functions for updating tnums and ranges. According to the BPF ISA [1], "Underflow and overflow are allowed during arithmetic operations, meaning the 64-bit or 32-bit value will wrap". Our patch leverages the wrap-around semantics of unsigned overflow and underflow to improve precision. Below is an example of our patch for scalar_min_max_add; the idea is analogous for all four functions. There are three cases to consider when adding two u64 ranges [dst_umin, dst_umax] and [src_umin, src_umax]. Consider a value x in the range [dst_umin, dst_umax] and another value y in the range [src_umin, src_umax]. (a) No overflow: No addition x + y overflows. This occurs when even the largest possible sum, i.e., dst_umax + src_umax <= U64_MAX. (b) Partial overflow: Some additions x + y overflow. This occurs when the largest possible sum overflows (dst_umax + src_umax > U64_MAX), but the smallest possible sum does not overflow (dst_umin + src_umin <= U64_MAX). (c) Full overflow: All additions x + y overflow. This occurs when both the smallest possible sum and the largest possible sum overflow, i.e., both (dst_umin + src_umin) and (dst_umax + src_umax) are > U64_MAX. The current implementation conservatively sets the output bounds to unbounded, i.e, [umin=0, umax=U64_MAX], whenever there is any possibility of overflow, i.e, in cases (b) and (c). Otherwise it computes tight bounds as [dst_umin + src_umin, dst_umax + src_umax]: if (check_add_overflow(dst_umin, src_reg->umin_value, dst_umin) \|\| check_add_overflow(dst_umax, src_reg->umax_value, dst_umax)) { dst_umin = 0; dst_umax = U64_MAX; } Our synthesis-based technique discovered a more precise operator. Particularly, in case (c), all possible additions x + y overflow and wrap around according to eBPF semantics, and the computation of the output range as [dst_umin + src_umin, dst_umax + src_umax] continues to work. Only in case (b), do we need to set the output bounds to unbounded, i.e., [0, U64_MAX]. Case (b) can be checked by seeing if the minimum possible sum does not overflow and the maximum possible sum does overflow, and when that happens, we set the output to unbounded: min_overflow = check_add_overflow(dst_umin, src_reg->umin_value, dst_umin); max_overflow = check_add_overflow(dst_umax, src_reg->umax_value, dst_umax); if (!min_overflow && max_overflow) { dst_umin = 0; dst_umax = U64_MAX; } Below is an example eBPF program and the corresponding log from the verifier. The current implementation of scalar_min_max_add() sets r3's bounds to [0, U64_MAX] at instruction 5: (0f) r3 += r3, due to conservative overflow handling. 0: R1=ctx() R10=fp0 0: (b7) r4 = 0 ; R4_w=0 1: (87) r4 = -r4 ; R4_w=scalar() 2: (18) r3 = 0xa000000000000000 ; R3_w=0xa000000000000000 4: (4f) r3 \|= r4 ; R3_w=scalar(smin=0xa000000000000000,smax=-1,umin=0xa000000000000000,var_off=(0xa000000000000000; 0x5fffffffffffffff)) R4_w=scalar() 5: (0f) r3 += r3 ; R3_w=scalar() 6: (b7) r0 = 1 ; R0_w=1 7: (95) exit With our patch, r3's bounds after instruction 5 are set to a much more precise [0x4000000000000000,0xfffffffffffffffe]. ... 5: (0f) r3 += r3 ; R3_w=scalar(umin=0x4000000000000000,umax=0xfffffffffffffffe) 6: (b7) r0 = 1 ; R0_w=1 7: (95) exit The logic for scalar32_min_max_add is analogous. For the scalar(32)_min_max_sub functions, the reasoning is similar but applied to detecting underflow instead of overflow. We verified the correctness of the new implementations using Agni [3,4]. We since also discovered that a similar technique has been used to calculate output ranges for unsigned interval addition and subtraction in Hacker's Delight [2]. [1] https://docs.kernel.org/bpf/standardization/instruction-set.html [2] Hacker's Delight Ch.4-2, Propagating Bounds through Add’s and Subtract’s [3] https://github.com/bpfverif/agni [4] https://people.cs.rutgers.edu/~sn349/papers/sas24-preprint.pdf Co-developed-by: Matan Shachnai <m.shachnai@rutgers.edu> Signed-off-by: Matan Shachnai <m.shachnai@rutgers.edu> Co-developed-by: Srinivas Narayana <srinivas.narayana@rutgers.edu> Signed-off-by: Srinivas Narayana <srinivas.narayana@rutgers.edu> Co-developed-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu> Signed-off-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu> Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250623040359.343235-2-harishankar.vishwanathan@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-24 18:37:22 -07:00
Jerome Marchand	2eb7648558	bpf: Specify access type of bpf_sysctl_get_name args The second argument of bpf_sysctl_get_name() helper is a pointer to a buffer that is being written to. However that isn't specify in the prototype. Until commit `37cce22dbd` ("bpf: verifier: Refactor helper access type tracking"), all helper accesses were considered as a possible write access by the verifier, so no big harm was done. However, since then, the verifier might make wrong asssumption about the content of that address which might lead it to make faulty optimizations (such as removing code that was wrongly labeled dead). This is what happens in test_sysctl selftest to the tests related to sysctl_get_name. Add MEM_WRITE flag the second argument of bpf_sysctl_get_name(). Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20250619140603.148942-2-jmarchan@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-23 21:50:44 -07:00
Menglong Dong	c11f34e300	bpf: Make update_prog_stats() always_inline The function update_prog_stats() will be called in the bpf trampoline. In most cases, it will be optimized by the compiler by making it inline. However, we can't rely on the compiler all the time, and just make it __always_inline to reduce the possible overhead. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Link: https://lore.kernel.org/r/20250621045501.101187-1-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-23 09:21:07 -07:00
Song Liu	1504d8c7c7	bpf: Mark cgroup_subsys_state->cgroup RCU safe Mark struct cgroup_subsys_state->cgroup as safe under RCU read lock. This will enable accessing css->cgroup from a bpf css iterator. Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/20250623063854.1896364-4-song@kernel.org Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-06-23 13:03:12 +02:00
Song Liu	535b070f4a	bpf: Introduce bpf_cgroup_read_xattr to read xattr of cgroup's node BPF programs, such as LSM and sched_ext, would benefit from tags on cgroups. One common practice to apply such tags is to set xattrs on cgroupfs folders. Introduce kfunc bpf_cgroup_read_xattr, which allows reading cgroup's xattr. Note that, we already have bpf_get_[file\|dentry]_xattr. However, these two APIs are not ideal for reading cgroupfs xattrs, because: 1) These two APIs only works in sleepable contexts; 2) There is no kfunc that matches current cgroup to cgroupfs dentry. bpf_cgroup_read_xattr is generic and can be useful for many program types. It is also safe, because it requires trusted or rcu protected argument (KF_RCU). Therefore, we make it available to all program types. Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/20250623063854.1896364-3-song@kernel.org Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-06-23 13:03:12 +02:00
Willem de Bruijn	d4adf1c9ee	bpf: Adjust free target to avoid global starvation of LRU map BPF_MAP_TYPE_LRU_HASH can recycle most recent elements well before the map is full, due to percpu reservations and force shrink before neighbor stealing. Once a CPU is unable to borrow from the global map, it will once steal one elem from a neighbor and after that each time flush this one element to the global list and immediately recycle it. Batch value LOCAL_FREE_TARGET (128) will exhaust a 10K element map with 79 CPUs. CPU 79 will observe this behavior even while its neighbors hold 78 * 127 + 1 * 15 == 9921 free elements (99%). CPUs need not be active concurrently. The issue can appear with affinity migration, e.g., irqbalance. Each CPU can reserve and then hold onto its 128 elements indefinitely. Avoid global list exhaustion by limiting aggregate percpu caches to half of map size, by adjusting LOCAL_FREE_TARGET based on cpu count. This change has no effect on sufficiently large tables. Similar to LOCAL_NR_SCANS and lru->nr_scans, introduce a map variable lru->free_target. The extra field fits in a hole in struct bpf_lru. The cacheline is already warm where read in the hot path. The field is only accessed with the lru lock held. Tested-by: Anton Protopopov <a.s.protopopov@gmail.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://lore.kernel.org/r/20250618215803.3587312-1-willemdebruijn.kernel@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-18 18:50:14 -07:00
Al Viro	f5527f0171	bpf: Get rid of redundant 3rd argument of prepare_seq_file() Remove 3rd argument in prepare_seq_file() to clean up the code a bit. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20250615004719.GE3011112@ZenIV Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-17 17:19:41 -07:00
Song Liu	a766cfbbeb	bpf: Mark dentry->d_inode as trusted_or_null LSM hooks such as security_path_mknod() and security_inode_rename() have access to newly allocated negative dentry, which has NULL d_inode. Therefore, it is necessary to do the NULL pointer check for d_inode. Also add selftests that checks the verifier enforces the NULL pointer check. Signed-off-by: Song Liu <song@kernel.org> Reviewed-by: Matt Bobrowski <mattbobrowski@google.com> Link: https://lore.kernel.org/r/20250613052857.1992233-1-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-17 08:40:59 -07:00
Thomas Weißschuh	2fbe82037a	sysfs: treewide: switch back to bin_attribute::read()/write() The bin_attribute argument of bin_attribute::read() is now const. This makes the _new() callbacks unnecessary. Switch all users back. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://lore.kernel.org/r/20250530-sysfs-const-bin_attr-final-v3-3-724bfcf05b99@weissschuh.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-06-17 10:44:13 +02:00
Luis Gerhorst	f66b4aaff2	bpf: Remove redundant free_verifier_state()/pop_stack() This patch removes duplicated code. Eduard points out [1]: Same cleanup cycles are done in push_stack() and push_async_cb(), both functions are only reachable from do_check_common() via do_check() -> do_check_insn(). Hence, I think that cur state should not be freed in push_() functions and pop_stack() loop there is not needed. This would also fix the 'symptom' for [2], but the issue also has a simpler fix which was sent separately. This fix also makes sure the push_() callers always return an error for which error_recoverable_with_nospec(err) is false. This is required because otherwise we try to recover and access the stale `state`. Moving free_verifier_state() and pop_stack(..., pop_log=false) to happen after the bpf_vlog_reset() call in do_check_common() is fine because the pop_stack() call that is moved does not call bpf_vlog_reset() with the pop_log=false parameter. [1] https://lore.kernel.org/all/b6931bd0dd72327c55287862f821ca6c4c3eb69a.camel@gmail.com/ [2] https://lore.kernel.org/all/68497853.050a0220.33aa0e.036a.GAE@google.com/ Reported-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/all/b6931bd0dd72327c55287862f821ca6c4c3eb69a.camel@gmail.com/ Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de> Link: https://lore.kernel.org/r/20250613090157.568349-2-luis.gerhorst@fau.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-13 14:59:30 -07:00
Eduard Zingerman	3157f7e299	bpf: handle jset (if a & b ...) as a jump in CFG computation BPF_JSET is a conditional jump and currently verifier.c:can_jump() does not know about that. This can lead to incorrect live registers and SCC computation. E.g. in the following example: 1: r0 = 1; 2: r2 = 2; 3: if r1 & 0x7 goto +1; 4: exit; 5: r0 = r2; 6: exit; W/o this fix insn_successors(3) will return only (4), a jump to (5) would be missed and r2 won't be marked as alive at (3). Fixes: `14c8552db6` ("bpf: simple DFA-based live registers analysis") Reported-by: syzbot+a36aac327960ff474804@syzkaller.appspotmail.com Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250613175331.3238739-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-13 11:51:19 -07:00
Eduard Zingerman	43736ec3e0	bpf: Include verifier memory allocations in memcg statistics This commit adds __GFP_ACCOUNT flag to verifier induced memory allocations. The intent is to account for all allocations reachable from BPF_PROG_LOAD command, which is needed to track verifier memory consumption in veristat. This includes allocations done in verifier.c, and some allocations in btf.c, functions in log.c do not allocate. There is also a utility function bpf_memcg_flags() which selectively adds GFP_ACCOUNT flag depending on the `cgroup.memory=nobpf` option. As far as I understand [1], the idea is to remove bpf_prog instances and maps from memcg accounting as these objects do not strictly belong to cgroup, hence it should not apply here. (btf_parse_fields() is reachable from both program load and map creation, but allocated record is not persistent as is freed as soon as map_check_btf() exits). [1] https://lore.kernel.org/all/20230210154734.4416-1-laoar.shao@gmail.com/ Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250613072147.3938139-2-eddyz87@gmail.com	2025-06-13 10:29:45 -07:00
Song Liu	fa6932577c	bpf: Initialize used but uninit variable in propagate_liveness() With input changed == NULL, a local variable is used for "changed". Initialize tmp properly, so that it can be used in the following: *changed \|= err > 0; Otherwise, UBSAN will complain: UBSAN: invalid-load in kernel/bpf/verifier.c:18924:4 load of value <some random value> is not a valid value for type '_Bool' Fixes: `dfb2d4c64b` ("bpf: set 'changed' status if propagate_liveness() did any updates") Signed-off-by: Song Liu <song@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250612221100.2153401-1-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-12 16:53:40 -07:00
Luis Gerhorst	3d71b8b9ab	bpf: Fix state use-after-free on push_stack() err Without this, `state->speculative` is used after the cleanup cycles in push_stack() or push_async_cb() freed `env->cur_state` (i.e., `state`). Avoid this by relying on the short-circuit logic to only access `state` if the error is recoverable (and make sure it never is after push_() failed). push_() callers must always return an error for which error_recoverable_with_nospec(err) is false if push_() returns NULL, otherwise we try to recover and access the stale `state`. This is only violated by sanitize_ptr_alu(), thus also fix this case to return -ENOMEM. state->speculative does not make sense if the error path of push_() ran. In that case, `state->speculative && error_recoverable_with_nospec(err)` as a whole should already never evaluate to true (because all cases where push_stack() fails must return -ENOMEM/-EFAULT). As mentioned, this is only violated by the push_stack() call in sanitize_speculative_path() which returns -EACCES without [1] (through REASON_STACK in sanitize_err() after sanitize_ptr_alu()). To fix this, return -ENOMEM for REASON_STACK (which is also the behavior we will have after [1]). Checked that it fixes the syzbot reproducer as expected. [1] https://lore.kernel.org/all/20250603213232.339242-1-luis.gerhorst@fau.de/ Fixes: `d6f1c85f22` ("bpf: Fall back to nospec for Spectre v1") Reported-by: syzbot+b5eb72a560b8149a1885@syzkaller.appspotmail.com Reported-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/all/38862a832b91382cddb083dddd92643bed0723b8.camel@gmail.com/ Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250611210728.266563-1-luis.gerhorst@fau.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-12 16:52:43 -07:00
Eduard Zingerman	0f54ff5470	bpf: include backedges in peak_states stat Count states accumulated in bpf_scc_visit->backedges in env->peak_states. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250611200836.4135542-10-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-12 16:52:43 -07:00
Eduard Zingerman	0e0da5f901	bpf: remove {update,get}_loop_entry functions The previous patch switched read and precision tracking for iterator-based loops from state-graph-based loop tracking to control-flow-graph-based loop tracking. This patch removes the now-unused `update_loop_entry()` and `get_loop_entry()` functions, which were part of the state-graph-based logic. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250611200836.4135542-9-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-12 16:52:43 -07:00
Eduard Zingerman	c9e31900b5	bpf: propagate read/precision marks over state graph backedges Current loop_entry-based exact states comparison logic does not handle the following case: .-> A --. Assume the states are visited in the order A, B, C. \| \| \| Assume that state B reaches a state equivalent to state A. \| v v At this point, state C is not processed yet, so state A '-- B C has not received any read or precision marks from C. As a result, these marks won't be propagated to B. If B has incomplete marks, it is unsafe to use it in states_equal() checks. This commit replaces the existing logic with the following: - Strongly connected components (SCCs) are computed over the program's control flow graph (intraprocedurally). - When a verifier state enters an SCC, that state is recorded as the SCC entry point. - When a verifier state is found equivalent to another (e.g., B to A in the example), it is recorded as a states graph backedge. Backedges are accumulated per SCC. - When an SCC entry state reaches `branches == 0`, read and precision marks are propagated through the backedges (e.g., from A to B, from C to A, and then again from A to B). To support nested subprogram calls, the entry state and backedge list are associated not with the SCC itself but with an object called `bpf_scc_callchain`. A callchain is a tuple `(callsite*, scc_id)`, where `callsite` is the index of a call instruction for each frame except the last. See the comments added in `is_state_visited()` and `compute_scc_callchain()` for more details. Fixes: `2a0992829e` ("bpf: correct loop detection for iterators convergence") Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250611200836.4135542-8-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-12 16:52:43 -07:00
Eduard Zingerman	b5c677d8d9	bpf: move REG_LIVE_DONE check to clean_live_states() The next patch would add some relatively heavy-weight operation to clean_live_states(), this operation can be skipped if REG_LIVE_DONE is set. Move the check from clean_verifier_state() to clean_verifier_state() as a small refactoring commit. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250611200836.4135542-7-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-12 16:52:43 -07:00
Eduard Zingerman	dfb2d4c64b	bpf: set 'changed' status if propagate_liveness() did any updates Add an out parameter to `propagate_liveness()` to record whether any new liveness bits were set during its execution. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250611200836.4135542-6-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-12 16:52:43 -07:00
Eduard Zingerman	23b37d6165	bpf: set 'changed' status if propagate_precision() did any updates Add an out parameter to `propagate_precision()` to record whether any new precision bits were set during its execution. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250611200836.4135542-5-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-12 16:52:43 -07:00

... 4 5 6 7 8 ...

4511 Commits