linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-23 22:52:19 +02:00

Author	SHA1	Message	Date
Amery Hung	136deea435	bpf: Remove gfp_flags plumbing from bpf_local_storage_update() Remove the check that rejects sleepable BPF programs from doing BPF_ANY/BPF_EXIST updates on local storage. This restriction was added in commit `b00fa38a9c` ("bpf: Enable non-atomic allocations in local storage") because kzalloc(GFP_KERNEL) could sleep inside local_storage->lock. This is no longer a concern: all local storage allocations now use kmalloc_nolock() which never sleeps. In addition, since kmalloc_nolock() only accepts __GFP_ACCOUNT, __GFP_ZERO and __GFP_NO_OBJ_EXT, the gfp_flags parameter plumbing from bpf__storage_get() to bpf_local_storage_update() becomes dead code. Remove gfp_flags from bpf_selem_alloc(), bpf_local_storage_alloc() and bpf_local_storage_update(). Drop the hidden 5th argument from bpf__storage_get helpers, and remove the verifier patching that injected GFP_KERNEL/GFP_ATOMIC into the fifth argument. Signed-off-by: Amery Hung <ameryhung@gmail.com> Link: https://lore.kernel.org/r/20260411015419.114016-4-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-10 21:22:32 -07:00
Amery Hung	5063e77588	bpf: Use kmalloc_nolock() universally in local storage Switch to kmalloc_nolock() universally in local storage. Socket local storage didn't move to kmalloc_nolock() when BPF memory allocator was replaced by it for performance reasons. Now that kfree_rcu() supports freeing memory allocated by kmalloc_nolock(), we can move the remaining local storages to use kmalloc_nolock() and cleanup the cluttered free paths. Use kfree() instead of kfree_nolock() in bpf_selem_free_trace_rcu() and bpf_local_storage_free_trace_rcu(). Both callbacks run in process context where spinning is allowed, so kfree_nolock() is unnecessary. Benchmark: ./bench -p 1 local-storage-create --storage-type socket \ --batch-size {16,32,64} The benchmark is a microbenchmark stress-testing how fast local storage can be created. There is no measurable throughput change for socket local storage after switching from kzalloc() to kmalloc_nolock(). Socket local storage batch creation speed diff --------------- ---- ------------------ ---- Baseline 16 433.9 ± 0.6 k/s 32 434.3 ± 1.4 k/s 64 434.2 ± 0.7 k/s After 16 439.0 ± 1.9 k/s +1.2% 32 437.3 ± 2.0 k/s +0.7% 64 435.8 ± 2.5k/s +0.4% Also worth noting that the baseline got a 5% throughput boost when sheaf replaces percpu partial slab recently [0]. [0] https://lore.kernel.org/bpf/20260123-sheaves-for-all-v4-0-041323d506f7@suse.cz/ Signed-off-by: Amery Hung <ameryhung@gmail.com> Link: https://lore.kernel.org/r/20260411015419.114016-3-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-04-10 21:22:32 -07:00
Amery Hung	4b21ea5024	bpf: Add warning to detect memory leak in bpf_selem_unlink_nofail() While very unlikely, local storage theoretically may leak memory of the size of "struct bpf_local_storage" when destroy() fails to grab local_storage->lock and initializes selem->local_storage before other racing map_free() see it. Warn the user to allow debugging the issue instead of leaking the memory silently. Note that test_maps in bpf selftests already stress tested bpf_selem_unlink_nofail() by creating 4096 sockets and then immediately destroying them in multiple threads. With 64 threads, 64 x 4096 socket local storages were created and destroyed during the test and no warning in the function were triggered. Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://patch.msgid.link/20260318224219.615105-1-ameryhung@gmail.com	2026-03-19 12:14:28 -07:00
Amery Hung	350de5b8a9	bpf: Do not allow deleting local storage in NMI Currently, local storage may deadlock when deferring freeing selem or local storage through kfree_rcu(), call_rcu() or call_rcu_tasks_trace() in NMI or reentrant. Since deleting selem in NMI is an unlikely use case, partially mitigate it by returning error when calling from bpf_xxx_storage_delete() helpers in NMI. Note that, it is still possible to deadlock through reentrant. A full mitigation requires returning error when irqs_disabled() is true, which, however is too heavy-handed for bpf_xxx_storage_delete(). The long-term solution requires _nolock versions of call_rcu. Another possible solution is to defer the free through irq_work [0], but it would grow the size of selem, which is non-ideal. The check is only needed in bpf_selem_unlink(), which is used by helpers and syscalls. bpf_selem_unlink_nofail() is fine as it is called during map and owner tear down that never run in NMI or reentrant. [0] https://lore.kernel.org/bpf/20260205190233.912-1-alexei.starovoitov@gmail.com/ Fixes: `a10787e6d5` ("bpf: Enable task local storage for tracing programs") Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://patch.msgid.link/20260319025716.2361065-1-ameryhung@gmail.com	2026-03-19 11:26:05 -07:00
Kumar Kartikeya Dwivedi	baa35b3cb6	bpf: Retire rcu_trace_implies_rcu_gp() from local storage This assumption will always hold going forward, hence just remove the various checks and assume it is true with a comment for the uninformed reader. Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260227224806.646888-5-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-02-27 15:39:00 -08:00
Kumar Kartikeya Dwivedi	f41deee082	bpf: Delay freeing fields in local storage Currently, when use_kmalloc_nolock is false, the freeing of fields for a local storage selem is done eagerly before waiting for the RCU or RCU tasks trace grace period to elapse. This opens up a window where the program which has access to the selem can recreate the fields after the freeing of fields is done eagerly, causing memory leaks when the element is finally freed and returned to the kernel. Make a few changes to address this. First, delay the freeing of fields until after the grace periods have expired using a __bpf_selem_free_rcu wrapper which is eventually invoked after transitioning through the necessary number of grace period waits. Replace usage of the kfree_rcu with call_rcu to be able to take a custom callback. Finally, care needs to be taken to extend the rcu barriers for all cases, and not just when use_kmalloc_nolock is true, as RCU and RCU tasks trace callbacks can be in flight for either case and access the smap field, which is used to obtain the BTF record to walk over special fields in the map value. While we're at it, drop migrate_disable() from bpf_selem_free_rcu, since migration should be disabled for RCU callbacks already. Fixes: `9bac675e63` ("bpf: Postpone bpf_obj_free_fields to the rcu callback") Reviewed-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260227224806.646888-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-02-27 15:39:00 -08:00
Kumar Kartikeya Dwivedi	ae51772b1e	bpf: Lose const-ness of map in map_check_btf() BPF hash map may now use the map_check_btf() callback to decide whether to set a dtor on its bpf_mem_alloc or not. Unlike C++ where members can opt out of const-ness using mutable, we must lose the const qualifier on the callback such that we can avoid the ugly cast. Make the change and adjust all existing users, and lose the comment in hashtab.c. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260227224806.646888-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-02-27 15:39:00 -08:00
Amery Hung	0be08389c7	bpf: Switch to bpf_selem_unlink_nofail in bpf_local_storage_{map_free, destroy} Take care of rqspinlock error in bpf_local_storage_{map_free, destroy}() properly by switching to bpf_selem_unlink_nofail(). Both functions iterate their own RCU-protected list of selems and call bpf_selem_unlink_nofail(). In map_free(), to prevent infinite loop when both map_free() and destroy() fail to remove a selem from b->list (extremely unlikely), switch to hlist_for_each_entry_rcu(). In destroy(), also switch to hlist_for_each_entry_rcu() since we no longer iterate local_storage->list under local_storage->lock. bpf_selem_unlink() now becomes dedicated to helpers and syscalls paths so reuse_now should always be false. Remove it from the argument and hardcode it. Acked-by: Alexei Starovoitov <ast@kernel.org> Co-developed-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-12-ameryhung@gmail.com	2026-02-06 14:47:59 -08:00
Amery Hung	5d800f87d0	bpf: Support lockless unlink when freeing map or local storage Introduce bpf_selem_unlink_nofail() to properly handle errors returned from rqspinlock in bpf_local_storage_map_free() and bpf_local_storage_destroy() where the operation must succeeds. The idea of bpf_selem_unlink_nofail() is to allow an selem to be partially linked and use atomic operation on a bit field, selem->state, to determine when and who can free the selem if any unlink under lock fails. An selem initially is fully linked to a map and a local storage. Under normal circumstances, bpf_selem_unlink_nofail() will be able to grab locks and unlink a selem from map and local storage in sequeunce, just like bpf_selem_unlink(), and then free it after an RCU grace period. However, if any of the lock attempts fails, it will only clear SDATA(selem)->smap or selem->local_storage depending on the caller and set SELEM_MAP_UNLINKED or SELEM_STORAGE_UNLINKED according to the caller. Then, after both map_free() and destroy() see the selem and the state becomes SELEM_UNLINKED, one of two racing caller can succeed in cmpxchg the state from SELEM_UNLINKED to SELEM_TOFREE, ensuring no double free or memory leak. To make sure bpf_obj_free_fields() is done only once and when map is still present, it is called when unlinking an selem from b->list under b->lock. To make sure uncharging memory is done only when the owner is still present in map_free(), block destroy() from returning until there is no pending map_free(). Since smap may not be valid in destroy(), bpf_selem_unlink_nofail() skips bpf_selem_unlink_storage_nolock_misc() when called from destroy(). This is okay as bpf_local_storage_destroy() will return the remaining amount of memory charge tracked by mem_charge to the owner to uncharge. It is also safe to skip clearing local_storage->owner and owner_storage as the owner is being freed and no users or bpf programs should be able to reference the owner and using local_storage. Finally, access of selem, SDATA(selem)->smap and selem->local_storage are racy. Callers will protect these fields with RCU. Acked-by: Alexei Starovoitov <ast@kernel.org> Co-developed-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-11-ameryhung@gmail.com	2026-02-06 14:47:47 -08:00
Amery Hung	c8be3da147	bpf: Prepare for bpf_selem_unlink_nofail() The next patch will introduce bpf_selem_unlink_nofail() to handle rqspinlock errors. bpf_selem_unlink_nofail() will allow an selem to be partially unlinked from map or local storage. Save memory allocation method in selem so that later an selem can be correctly freed even when SDATA(selem)->smap is init to NULL. In addition, keep track of memory charge to the owner in local storage so that later bpf_selem_unlink_nofail() can return the correct memory charge to the owner. Updating local_storage->mem_charge is protected by local_storage->lock. Finally, extract miscellaneous tasks performed when unlinking an selem from local_storage into bpf_selem_unlink_storage_nolock_misc(). It will be reused by bpf_selem_unlink_nofail(). This patch also takes the chance to remove local_storage->smap, which is no longer used since commit `f484f4a3e0` ("bpf: Replace bpf memory allocator with kmalloc_nolock() in local storage"). Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-10-ameryhung@gmail.com	2026-02-06 14:29:22 -08:00
Amery Hung	3417dffb58	bpf: Remove unused percpu counter from bpf_local_storage_map_free Percpu locks have been removed from cgroup and task local storage. Now that all local storage no longer use percpu variables as locks preventing recursion, there is no need to pass them to bpf_local_storage_map_free(). Remove the argument from the function. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-9-ameryhung@gmail.com	2026-02-06 14:29:18 -08:00
Amery Hung	8dabe34b9d	bpf: Change local_storage->lock and b->lock to rqspinlock Change bpf_local_storage::lock and bpf_local_storage_map_bucket::lock from raw_spin_lock to rqspinlock. Finally, propagate errors from raw_res_spin_lock_irqsave() to syscall return or BPF helper return. In bpf_local_storage_destroy(), ignore return from raw_res_spin_lock_irqsave() for now. A later patch will correctly handle errors correctly in bpf_local_storage_destroy() so that it can unlink selems even when failing to acquire locks. For __bpf_local_storage_map_cache(), instead of handling the error, skip updating the cache. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-6-ameryhung@gmail.com	2026-02-06 14:29:04 -08:00
Amery Hung	403e935f91	bpf: Convert bpf_selem_unlink to failable To prepare changing both bpf_local_storage_map_bucket::lock and bpf_local_storage::lock to rqspinlock, convert bpf_selem_unlink() to failable. It still always succeeds and returns 0 until the change happens. No functional change. Open code bpf_selem_unlink_storage() in the only caller, bpf_selem_unlink(), since unlink_map and unlink_storage must be done together after all the necessary locks are acquired. For bpf_local_storage_map_free(), ignore the return from bpf_selem_unlink() for now. A later patch will allow it to unlink selems even when failing to acquire locks. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-5-ameryhung@gmail.com	2026-02-06 14:28:59 -08:00
Amery Hung	fd103ffc57	bpf: Convert bpf_selem_link_map to failable To prepare for changing bpf_local_storage_map_bucket::lock to rqspinlock, convert bpf_selem_link_map() to failable. It still always succeeds and returns 0 until the change happens. No functional change. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-4-ameryhung@gmail.com	2026-02-06 14:28:55 -08:00
Amery Hung	1b7e0cae85	bpf: Convert bpf_selem_unlink_map to failable To prepare for changing bpf_local_storage_map_bucket::lock to rqspinlock, convert bpf_selem_unlink_map() to failable. It still always succeeds and returns 0 for now. Since some operations updating local storage cannot fail in the middle, open-code bpf_selem_unlink_map() to take the b->lock before the operation. There are two such locations: - bpf_local_storage_alloc() The first selem will be unlinked from smap if cmpxchg owner_storage_ptr fails, which should not fail. Therefore, hold b->lock when linking until allocation complete. Helpers that assume b->lock is held by callers are introduced: bpf_selem_link_map_nolock() and bpf_selem_unlink_map_nolock(). - bpf_local_storage_update() The three step update process: link_map(new_selem), link_storage(new_selem), and unlink_map(old_selem) should not fail in the middle. In bpf_selem_unlink(), bpf_selem_unlink_map() and bpf_selem_unlink_storage() should either all succeed or fail as a whole instead of failing in the middle. So, return if unlink_map() failed. Remove the selem_linked_to_map_lockless() check as an selem in the common paths (not bpf_local_storage_map_free() or bpf_local_storage_destroy()), will be unlinked under b->lock and local_storage->lock and therefore no other threads can unlink the selem from map at the same time. In bpf_local_storage_destroy(), ignore the return of bpf_selem_unlink_map() for now. A later patch will allow bpf_local_storage_destroy() to unlink selems even when failing to acquire locks. Note that while this patch removes all callers of selem_linked_to_map(), a later patch that introduces bpf_selem_unlink_nofail() will use it again. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-3-ameryhung@gmail.com	2026-02-06 14:28:48 -08:00
Amery Hung	0ccef7079e	bpf: Select bpf_local_storage_map_bucket based on bpf_local_storage A later bpf_local_storage refactor will acquire all locks before performing any update. To simplified the number of locks needed to take in bpf_local_storage_map_update(), determine the bucket based on the local_storage an selem belongs to instead of the selem pointer. Currently, when a new selem needs to be created to replace the old selem in bpf_local_storage_map_update(), locks of both buckets need to be acquired to prevent racing. This can be simplified if the two selem belongs to the same bucket so that only one bucket needs to be locked. Therefore, instead of hashing selem, hashing the local_storage pointer the selem belongs. Performance wise, this is slightly better as update now requires locking one bucket. It should not change the level of contention on one bucket as the pointers to local storages of selems in a map are just as unique as pointers to selems. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-2-ameryhung@gmail.com	2026-02-06 14:28:43 -08:00
Amery Hung	f484f4a3e0	bpf: Replace bpf memory allocator with kmalloc_nolock() in local storage Replace bpf memory allocator with kmalloc_nolock() to reduce memory wastage due to preallocation. In bpf_selem_free(), an selem now needs to wait for a RCU grace period before being freed when reuse_now == true. Therefore, rcu_barrier() should be always be called in bpf_local_storage_map_free(). In bpf_local_storage_free(), since smap->storage_ma is no longer needed to return the memory, the function is now independent from smap. Remove the outdated comment in bpf_local_storage_alloc(). We already free selem after an RCU grace period in bpf_local_storage_update() when bpf_local_storage_alloc() failed the cmpxchg since commit `c0d63f3091` ("bpf: Add bpf_selem_free()"). Signed-off-by: Amery Hung <ameryhung@gmail.com> Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20251114201329.3275875-5-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-18 16:20:25 -08:00
Amery Hung	39a460c425	bpf: Save memory alloction info in bpf_local_storage Save the memory allocation method used for bpf_local_storage in the struct explicitly so that we don't need to go through the hassle to find out the info. When a later patch replaces BPF memory allocator with kmalloc_noloc(), bpf_local_storage_free() will no longer need smap->storage_ma to return the memory and completely remove the dependency on smap in bpf_local_storage_free(). Signed-off-by: Amery Hung <ameryhung@gmail.com> Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20251114201329.3275875-4-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-18 16:20:25 -08:00
Amery Hung	e76a33e1c7	bpf: Remove smap argument from bpf_selem_free() Since selem already saves a pointer to smap, use it instead of an additional argument in bpf_selem_free(). This requires moving the SDATA(selem)->smap assignment from bpf_selem_link_map() to bpf_selem_alloc() since bpf_selem_free() may be called without the selem being linked to smap in bpf_local_storage_update(). Signed-off-by: Amery Hung <ameryhung@gmail.com> Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20251114201329.3275875-3-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-18 16:20:25 -08:00
Amery Hung	0e854e5535	bpf: Always charge/uncharge memory when allocating/unlinking storage elements Since commit `a96a44aba5` ("bpf: bpf_sk_storage: Fix invalid wait context lockdep report"), {charge,uncharge}_mem are always true when allocating a bpf_local_storage_elem or unlinking a bpf_local_storage_elem from local storage, so drop these arguments. No functional change. Signed-off-by: Amery Hung <ameryhung@gmail.com> Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20251114201329.3275875-2-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-18 16:20:25 -08:00
Anton Protopopov	d83caf7c8d	bpf: add btf_type_is_i{32,64} helpers There are places in BPF code which check if a BTF type is an integer of particular size. This code can be made simpler by using helpers. Add new btf_type_is_i{32,64} helpers, and simplify code in a few files. (Suggested by Eduard for a patch which copy-pasted such a check [1].) v1 -> v2: * export less generic helpers (Eduard) * make subject less generic than in [v1] (Eduard) [1] https://lore.kernel.org/bpf/7edb47e73baa46705119a23c6bf4af26517a640f.camel@gmail.com/ [v1] https://lore.kernel.org/bpf/20250624193655.733050-1-a.s.protopopov@gmail.com/ Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250625151621.1000584-1-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-06-25 15:15:49 -07:00
Hou Tao	d86088e2c3	bpf: Remove migrate_{disable\|enable} from bpf_selem_free() bpf_selem_free() has the following three callers: (1) bpf_local_storage_update It will be invoked through ->map_update_elem syscall or helpers for storage map. Migration has already been disabled in these running contexts. (2) bpf_sk_storage_clone It has already disabled migration before invoking bpf_selem_free(). (3) bpf_selem_free_list bpf_selem_free_list() has three callers: bpf_selem_unlink_storage(), bpf_local_storage_update() and bpf_local_storage_destroy(). The callers of bpf_selem_unlink_storage() includes: storage map ->map_delete_elem syscall, storage map delete helpers and bpf_local_storage_map_free(). These contexts have already disabled migration when invoking bpf_selem_unlink() which invokes bpf_selem_unlink_storage() and bpf_selem_free_list() correspondingly. bpf_local_storage_update() has been analyzed as the first caller above. bpf_local_storage_destroy() is invoked when freeing the local storage for the kernel object. Now cgroup, task, inode and sock storage have already disabled migration before invoking bpf_local_storage_destroy(). After the analyses above, it is safe to remove migrate_{disable\|enable} from bpf_selem_free(). Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-17-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-01-08 18:06:37 -08:00
Hou Tao	7b984359e0	bpf: Remove migrate_{disable\|enable} from bpf_local_storage_free() bpf_local_storage_free() has three callers: 1) bpf_local_storage_alloc() Its caller must have disabled migration. 2) bpf_local_storage_destroy() Its four callers (bpf_{cgrp\|inode\|task\|sk}_storage_free()) have already invoked migrate_disable() before invoking bpf_local_storage_destroy(). 3) bpf_selem_unlink() Its callers include: cgrp/inode/task/sk storage ->map_delete_elem callbacks, bpf_{cgrp\|inode\|task\|sk}_storage_delete() helpers and bpf_local_storage_map_free(). All of these callers have already disabled migration before invoking bpf_selem_unlink(). Therefore, it is OK to remove migrate_{disable\|enable} pair from bpf_local_storage_free(). Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-16-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-01-08 18:06:37 -08:00
Hou Tao	4855a75ebf	bpf: Remove migrate_{disable\|enable} from bpf_local_storage_alloc() These two callers of bpf_local_storage_alloc() are the same as bpf_selem_alloc(): bpf_sk_storage_clone() and bpf_local_storage_update(). The running contexts of these two callers have already disabled migration, therefore, there is no need to add extra migrate_{disable\|enable} pair in bpf_local_storage_alloc(). Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-15-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-01-08 18:06:37 -08:00
Hou Tao	2269b32ab0	bpf: Remove migrate_{disable\|enable} from bpf_selem_alloc() bpf_selem_alloc() has two callers: (1) bpf_sk_storage_clone_elem() bpf_sk_storage_clone() has already disabled migration before invoking bpf_sk_storage_clone_elem(). (2) bpf_local_storage_update() Its callers include: cgrp/task/inode/sock storage ->map_update_elem() callbacks and bpf_{cgrp\|task\|inode\|sk}_storage_get() helpers. These running contexts have already disabled migration Therefore, there is no need to add extra migrate_{disable\|enable} pair in bpf_selem_alloc(). Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-14-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-01-08 18:06:37 -08:00
Hou Tao	4b7e7cd1c1	bpf: Disable migration before calling ops->map_free() The freeing of all map elements may invoke bpf_obj_free_fields() to free the special fields in the map value. Since these special fields may be allocated from bpf memory allocator, migrate_{disable\|enable} pairs are necessary for the freeing of these special fields. To simplify reasoning about when migrate_disable() is needed for the freeing of these special fields, let the caller to guarantee migration is disabled before invoking bpf_obj_free_fields(). Therefore, disabling migration before calling ops->map_free() to simplify the freeing of map values or special fields allocated from bpf memory allocator. After disabling migration in bpf_map_free(), there is no need for additional migration_{disable\|enable} pairs in these ->map_free() callbacks. Remove these redundant invocations. The migrate_{disable\|enable} pairs in the underlying implementation of bpf_obj_free_fields() will be removed by the following patch. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-11-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-01-08 18:06:36 -08:00
Hou Tao	090d7f2e64	bpf: Disable migration in bpf_selem_free_rcu bpf_selem_free_rcu() calls bpf_obj_free_fields() to free the special fields in map value (e.g., kptr). Since kptrs may be allocated from bpf memory allocator, migrate_{disable\|enable} pairs are necessary for the freeing of these kptrs. To simplify reasoning about when migrate_disable() is needed for the freeing of these dynamically-allocated kptrs, let the caller to guarantee migration is disabled before invoking bpf_obj_free_fields(). Therefore, the patch adds migrate_{disable\|enable} pair in bpf_selem_free_rcu(). The migrate_{disable\|enable} pairs in the underlying implementation of bpf_obj_free_fields() will be removed by the following patch. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-10-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-01-08 18:06:36 -08:00
Martin KaFai Lau	8eef6ac4d7	bpf: bpf_local_storage: Always use bpf_mem_alloc in PREEMPT_RT In PREEMPT_RT, kmalloc(GFP_ATOMIC) is still not safe in non preemptible context. bpf_mem_alloc must be used in PREEMPT_RT. This patch is to enforce bpf_mem_alloc in the bpf_local_storage when CONFIG_PREEMPT_RT is enabled. [ 35.118559] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48 [ 35.118566] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1832, name: test_progs [ 35.118569] preempt_count: 1, expected: 0 [ 35.118571] RCU nest depth: 1, expected: 1 [ 35.118577] INFO: lockdep is turned off. ... [ 35.118647] __might_resched+0x433/0x5b0 [ 35.118677] rt_spin_lock+0xc3/0x290 [ 35.118700] ___slab_alloc+0x72/0xc40 [ 35.118723] __kmalloc_noprof+0x13f/0x4e0 [ 35.118732] bpf_map_kzalloc+0xe5/0x220 [ 35.118740] bpf_selem_alloc+0x1d2/0x7b0 [ 35.118755] bpf_local_storage_update+0x2fa/0x8b0 [ 35.118784] bpf_sk_storage_get_tracing+0x15a/0x1d0 [ 35.118791] bpf_prog_9a118d86fca78ebb_trace_inet_sock_set_state+0x44/0x66 [ 35.118795] bpf_trace_run3+0x222/0x400 [ 35.118820] __bpf_trace_inet_sock_set_state+0x11/0x20 [ 35.118824] trace_inet_sock_set_state+0x112/0x130 [ 35.118830] inet_sk_state_store+0x41/0x90 [ 35.118836] tcp_set_state+0x3b3/0x640 There is no need to adjust the gfp_flags passing to the bpf_mem_cache_alloc_flags() which only honors the GFP_KERNEL. The verifier has ensured GFP_KERNEL is passed only in sleepable context. It has been an old issue since the first introduction of the bpf_local_storage ~5 years ago, so this patch targets the bpf-next. bpf_mem_alloc is needed to solve it, so the Fixes tag is set to the commit when bpf_mem_alloc was first used in the bpf_local_storage. Fixes: `08a7ce384e` ("bpf: Use bpf_mem_cache_alloc/free in bpf_local_storage_elem") Reported-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20241218193000.2084281-1-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-12-18 15:36:06 -08:00
Martin KaFai Lau	ba512b00e5	bpf: Add uptr support in the map_value of the task local storage. This patch adds uptr support in the map_value of the task local storage. struct map_value { struct user_data __uptr uptr; }; struct { __uint(type, BPF_MAP_TYPE_TASK_STORAGE); __uint(map_flags, BPF_F_NO_PREALLOC); __type(key, int); __type(value, struct value_type); } datamap SEC(".maps"); A new bpf_obj_pin_uptrs() is added to pin the user page and also stores the kernel address back to the uptr for the bpf prog to use later. It currently does not support the uptr pointing to a user struct across two pages. It also excludes PageHighMem support to keep it simple. As of now, the 32bit bpf jit is missing other more crucial bpf features. For example, many important bpf features depend on bpf kfunc now but so far only one arch (x86-32) supports it which was added by me as an example when kfunc was first introduced to bpf. The uptr can only be stored to the task local storage by the syscall update_elem. Meaning the uptr will not be considered if it is provided by the bpf prog through bpf_task_storage_get(BPF_LOCAL_STORAGE_GET_F_CREATE). This is enforced by only calling bpf_local_storage_update(swap_uptrs==true) in bpf_pid_task_storage_update_elem. Everywhere else will have swap_uptrs==false. This will pump down to bpf_selem_alloc(swap_uptrs==true). It is the only case that bpf_selem_alloc() will take the uptr value when updating the newly allocated selem. bpf_obj_swap_uptrs() is added to swap the uptr between the SDATA(selem)->data and the user provided map_value in "void value". bpf_obj_swap_uptrs() makes the SDATA(selem)->data takes the ownership of the uptr and the user space provided map_value will have NULL in the uptr. The bpf_obj_unpin_uptrs() is called after map->ops->map_update_elem() returning error. If the map->ops->map_update_elem has reached a state that the local storage has taken the uptr ownership, the bpf_obj_unpin_uptrs() will be a no op because the uptr is NULL. A "__"bpf_obj_unpin_uptrs is added to make this error path unpin easier such that it does not have to check the map->record is NULL or not. BPF_F_LOCK is not supported when the map_value has uptr. This can be revisited later if there is a use case. A similar swap_uptrs idea can be considered. The final bit is to do unpin_user_page in the bpf_obj_free_fields(). The earlier patch has ensured that the bpf_obj_free_fields() has gone through the rcu gp when needed. Cc: linux-mm@kvack.org Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Link: https://lore.kernel.org/r/20241023234759.860539-7-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-10-24 10:25:59 -07:00
Martin KaFai Lau	9bac675e63	bpf: Postpone bpf_obj_free_fields to the rcu callback A later patch will enable the uptr usage in the task_local_storage map. This will require the unpin_user_page() to be done after the rcu task trace gp for the cases that the uptr may still be used by a bpf prog. The bpf_obj_free_fields() will be the one doing unpin_user_page(), so this patch is to postpone calling bpf_obj_free_fields() to the rcu callback. The bpf_obj_free_fields() is only required to be done in the rcu callback when bpf->bpf_ma==true and reuse_now==false. bpf->bpf_ma==true case is because uptr will only be enabled in task storage which has already been moved to bpf_mem_alloc. The bpf->bpf_ma==false case can be supported in the future also if there is a need. reuse_now==false when the selem (aka storage) is deleted by bpf prog (bpf_task_storage_delete) or by syscall delete_elem(). In both cases, bpf_obj_free_fields() needs to wait for rcu gp. A few words on reuse_now==true. reuse_now==true when the storage's owner (i.e. the task_struct) is destructing or the map itself is doing map_free(). In both cases, no bpf prog should have a hold on the selem and its uptrs, so there is no need to postpone bpf_obj_free_fields(). reuse_now==true should be the common case for local storage usage where the storage exists throughout the lifetime of its owner (task_struct). The bpf_obj_free_fields() needs to use the map->record. Doing bpf_obj_free_fields() in a rcu callback will require the bpf_local_storage_map_free() to wait for rcu_barrier. An optimization could be only waiting for rcu_barrier when the map has uptr in its map_value. This will require either yet another rcu callback function or adding a bool in the selem to flag if the SDATA(selem)->smap is still valid. This patch chooses to keep it simple and wait for rcu_barrier for maps that use bpf_mem_alloc. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20241023234759.860539-6-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-10-24 10:25:59 -07:00
Martin KaFai Lau	5bd5bab766	bpf: Postpone bpf_selem_free() in bpf_selem_unlink_storage_nolock() In a later patch, bpf_selem_free() will call unpin_user_page() through bpf_obj_free_fields(). unpin_user_page() may take spin_lock. However, some bpf_selem_free() call paths have held a raw_spin_lock. Like this: raw_spin_lock_irqsave() bpf_selem_unlink_storage_nolock() bpf_selem_free() unpin_user_page() spin_lock() To avoid spinlock nested in raw_spinlock, bpf_selem_free() should be done after releasing the raw_spinlock. The "bool reuse_now" arg is replaced with "struct hlist_head *free_selem_list" in bpf_selem_unlink_storage_nolock(). The bpf_selem_unlink_storage_nolock() will append the to-be-free selem at the free_selem_list. The caller of bpf_selem_unlink_storage_nolock() will need to call the new bpf_selem_free_list(free_selem_list, reuse_now) to free the selem after releasing the raw_spinlock. Note that the selem->snode cannot be reused for linking to the free_selem_list because the selem->snode is protected by the raw_spinlock that we want to avoid holding. A new "struct hlist_node free_node;" is union-ized with the rcu_head. Only the first one successfully hlist_del_init_rcu(&selem->snode) will be able to use the free_node. After succeeding hlist_del_init_rcu(&selem->snode), the free_node and rcu_head usage is serialized such that they can share the 16 bytes in a union. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20241023234759.860539-5-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-10-24 10:25:59 -07:00
Martin KaFai Lau	b9a5a07aea	bpf: Add "bool swap_uptrs" arg to bpf_local_storage_update() and bpf_selem_alloc() In a later patch, the task local storage will only accept uptr from the syscall update_elem and will not accept uptr from the bpf prog. The reason is the bpf prog does not have a way to provide a valid user space address. bpf_local_storage_update() and bpf_selem_alloc() are used by both bpf prog bpf_task_storage_get(BPF_LOCAL_STORAGE_GET_F_CREATE) and bpf syscall update_elem. "bool swap_uptrs" arg is added to bpf_local_storage_update() and bpf_selem_alloc() to tell if it is called by the bpf prog or by the bpf syscall. When swap_uptrs==true, it is called by the syscall. The arg is named (swap_)uptrs because the later patch will swap the uptrs between the newly allocated selem and the user space provided map_value. It will make error handling easier in case map->ops->map_update_elem() fails and the caller can decide if it needs to unpin the uptr in the user space provided map_value or the bpf_local_storage_update() has already taken the uptr ownership and will take care of unpinning it also. Only swap_uptrs==false is passed now. The logic to handle the true case will be added in a later patch. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20241023234759.860539-4-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-10-24 10:25:59 -07:00
Mohammad Shehar Yaar Tausif	af253aef18	bpf: fix order of args in call to bpf_map_kvcalloc The original function call passed size of smap->bucket before the number of buckets which raises the error 'calloc-transposed-args' on compilation. Vlastimil Babka added: The order of parameters can be traced back all the way to `6ac99e8f23` ("bpf: Introduce bpf sk local storage") accross several refactorings, and that's why the commit is used as a Fixes: tag. In v6.10-rc1, a different commit `2c321f3f70` ("mm: change inlined allocation helpers to account at the call site") however exposed the order of args in a way that gcc-14 has enough visibility to start warning about it, because (in !CONFIG_MEMCG case) bpf_map_kvcalloc is then a macro alias for kvcalloc instead of a static inline wrapper. To sum up the warning happens when the following conditions are all met: - gcc-14 is used (didn't see it with gcc-13) - commit `2c321f3f70` is present - CONFIG_MEMCG is not enabled in .config - CONFIG_WERROR turns this from a compiler warning to error Fixes: `6ac99e8f23` ("bpf: Introduce bpf sk local storage") Reviewed-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Christian Kujau <lists@nerdbynature.de> Signed-off-by: Mohammad Shehar Yaar Tausif <sheharyaar48@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/r/20240710100521.15061-2-vbabka@suse.cz Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-07-10 15:31:19 -07:00
Rafael Passos	a7de265cb2	bpf: Fix typos in comments Found the following typos in comments, and fixed them: s/unpriviledged/unprivileged/ s/reponsible/responsible/ s/possiblities/possibilities/ s/Divison/Division/ s/precsion/precision/ s/havea/have a/ s/reponsible/responsible/ s/responsibile/responsible/ s/tigher/tighter/ s/respecitve/respective/ Signed-off-by: Rafael Passos <rafael@rcpassos.me> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/6af7deb4-bb24-49e8-b3f1-8dd410597337@smtp-relay.sendinblue.com	2024-04-22 17:48:08 +02:00
Marco Elver	68bc61c26c	bpf: Allow compiler to inline most of bpf_local_storage_lookup() In various performance profiles of kernels with BPF programs attached, bpf_local_storage_lookup() appears as a significant portion of CPU cycles spent. To enable the compiler generate more optimal code, turn bpf_local_storage_lookup() into a static inline function, where only the cache insertion code path is outlined Notably, outlining cache insertion helps avoid bloating callers by duplicating setting up calls to raw_spin_{lock,unlock}_irqsave() (on architectures which do not inline spin_lock/unlock, such as x86), which would cause the compiler produce worse code by deciding to outline otherwise inlinable functions. The call overhead is neutral, because we make 2 calls either way: either calling raw_spin_lock_irqsave() and raw_spin_unlock_irqsave(); or call __bpf_local_storage_insert_cache(), which calls raw_spin_lock_irqsave(), followed by a tail-call to raw_spin_unlock_irqsave() where the compiler can perform TCO and (in optimized uninstrumented builds) turns it into a plain jump. The call to __bpf_local_storage_insert_cache() can be elided entirely if cacheit_lockit is a false constant expression. Based on results from './benchs/run_bench_local_storage.sh' (21 trials, reboot between each trial; x86 defconfig + BPF, clang 16) this produces improvements in throughput and latency in the majority of cases, with an average (geomean) improvement of 8%: +---- Hashmap Control -------------------- \| \| + num keys: 10 \| : <before> \| <after> \| +-+ hashmap (control) sequential get +----------------------+---------------------- \| +- hits throughput \| 14.789 M ops/s \| 14.745 M ops/s ( ~ ) \| +- hits latency \| 67.679 ns/op \| 67.879 ns/op ( ~ ) \| +- important_hits throughput \| 14.789 M ops/s \| 14.745 M ops/s ( ~ ) \| \| + num keys: 1000 \| : <before> \| <after> \| +-+ hashmap (control) sequential get +----------------------+---------------------- \| +- hits throughput \| 12.233 M ops/s \| 12.170 M ops/s ( ~ ) \| +- hits latency \| 81.754 ns/op \| 82.185 ns/op ( ~ ) \| +- important_hits throughput \| 12.233 M ops/s \| 12.170 M ops/s ( ~ ) \| \| + num keys: 10000 \| : <before> \| <after> \| +-+ hashmap (control) sequential get +----------------------+---------------------- \| +- hits throughput \| 7.220 M ops/s \| 7.204 M ops/s ( ~ ) \| +- hits latency \| 138.522 ns/op \| 138.842 ns/op ( ~ ) \| +- important_hits throughput \| 7.220 M ops/s \| 7.204 M ops/s ( ~ ) \| \| + num keys: 100000 \| : <before> \| <after> \| +-+ hashmap (control) sequential get +----------------------+---------------------- \| +- hits throughput \| 5.061 M ops/s \| 5.165 M ops/s (+2.1%) \| +- hits latency \| 198.483 ns/op \| 194.270 ns/op (-2.1%) \| +- important_hits throughput \| 5.061 M ops/s \| 5.165 M ops/s (+2.1%) \| \| + num keys: 4194304 \| : <before> \| <after> \| +-+ hashmap (control) sequential get +----------------------+---------------------- \| +- hits throughput \| 2.864 M ops/s \| 2.882 M ops/s ( ~ ) \| +- hits latency \| 365.220 ns/op \| 361.418 ns/op (-1.0%) \| +- important_hits throughput \| 2.864 M ops/s \| 2.882 M ops/s ( ~ ) \| +---- Local Storage ---------------------- \| \| + num_maps: 1 \| : <before> \| <after> \| +-+ local_storage cache sequential get +----------------------+---------------------- \| +- hits throughput \| 33.005 M ops/s \| 39.068 M ops/s (+18.4%) \| +- hits latency \| 30.300 ns/op \| 25.598 ns/op (-15.5%) \| +- important_hits throughput \| 33.005 M ops/s \| 39.068 M ops/s (+18.4%) \| : \| : <before> \| <after> \| +-+ local_storage cache interleaved get +----------------------+---------------------- \| +- hits throughput \| 37.151 M ops/s \| 44.926 M ops/s (+20.9%) \| +- hits latency \| 26.919 ns/op \| 22.259 ns/op (-17.3%) \| +- important_hits throughput \| 37.151 M ops/s \| 44.926 M ops/s (+20.9%) \| \| + num_maps: 10 \| : <before> \| <after> \| +-+ local_storage cache sequential get +----------------------+---------------------- \| +- hits throughput \| 32.288 M ops/s \| 38.099 M ops/s (+18.0%) \| +- hits latency \| 30.972 ns/op \| 26.248 ns/op (-15.3%) \| +- important_hits throughput \| 3.229 M ops/s \| 3.810 M ops/s (+18.0%) \| : \| : <before> \| <after> \| +-+ local_storage cache interleaved get +----------------------+---------------------- \| +- hits throughput \| 34.473 M ops/s \| 41.145 M ops/s (+19.4%) \| +- hits latency \| 29.010 ns/op \| 24.307 ns/op (-16.2%) \| +- important_hits throughput \| 12.312 M ops/s \| 14.695 M ops/s (+19.4%) \| \| + num_maps: 16 \| : <before> \| <after> \| +-+ local_storage cache sequential get +----------------------+---------------------- \| +- hits throughput \| 32.524 M ops/s \| 38.341 M ops/s (+17.9%) \| +- hits latency \| 30.748 ns/op \| 26.083 ns/op (-15.2%) \| +- important_hits throughput \| 2.033 M ops/s \| 2.396 M ops/s (+17.9%) \| : \| : <before> \| <after> \| +-+ local_storage cache interleaved get +----------------------+---------------------- \| +- hits throughput \| 34.575 M ops/s \| 41.338 M ops/s (+19.6%) \| +- hits latency \| 28.925 ns/op \| 24.193 ns/op (-16.4%) \| +- important_hits throughput \| 11.001 M ops/s \| 13.153 M ops/s (+19.6%) \| \| + num_maps: 17 \| : <before> \| <after> \| +-+ local_storage cache sequential get +----------------------+---------------------- \| +- hits throughput \| 28.861 M ops/s \| 32.756 M ops/s (+13.5%) \| +- hits latency \| 34.649 ns/op \| 30.530 ns/op (-11.9%) \| +- important_hits throughput \| 1.700 M ops/s \| 1.929 M ops/s (+13.5%) \| : \| : <before> \| <after> \| +-+ local_storage cache interleaved get +----------------------+---------------------- \| +- hits throughput \| 31.529 M ops/s \| 36.110 M ops/s (+14.5%) \| +- hits latency \| 31.719 ns/op \| 27.697 ns/op (-12.7%) \| +- important_hits throughput \| 9.598 M ops/s \| 10.993 M ops/s (+14.5%) \| \| + num_maps: 24 \| : <before> \| <after> \| +-+ local_storage cache sequential get +----------------------+---------------------- \| +- hits throughput \| 18.602 M ops/s \| 19.937 M ops/s (+7.2%) \| +- hits latency \| 53.767 ns/op \| 50.166 ns/op (-6.7%) \| +- important_hits throughput \| 0.776 M ops/s \| 0.831 M ops/s (+7.2%) \| : \| : <before> \| <after> \| +-+ local_storage cache interleaved get +----------------------+---------------------- \| +- hits throughput \| 21.718 M ops/s \| 23.332 M ops/s (+7.4%) \| +- hits latency \| 46.047 ns/op \| 42.865 ns/op (-6.9%) \| +- important_hits throughput \| 6.110 M ops/s \| 6.564 M ops/s (+7.4%) \| \| + num_maps: 32 \| : <before> \| <after> \| +-+ local_storage cache sequential get +----------------------+---------------------- \| +- hits throughput \| 14.118 M ops/s \| 14.626 M ops/s (+3.6%) \| +- hits latency \| 70.856 ns/op \| 68.381 ns/op (-3.5%) \| +- important_hits throughput \| 0.442 M ops/s \| 0.458 M ops/s (+3.6%) \| : \| : <before> \| <after> \| +-+ local_storage cache interleaved get +----------------------+---------------------- \| +- hits throughput \| 17.111 M ops/s \| 17.906 M ops/s (+4.6%) \| +- hits latency \| 58.451 ns/op \| 55.865 ns/op (-4.4%) \| +- important_hits throughput \| 4.776 M ops/s \| 4.998 M ops/s (+4.6%) \| \| + num_maps: 100 \| : <before> \| <after> \| +-+ local_storage cache sequential get +----------------------+---------------------- \| +- hits throughput \| 5.281 M ops/s \| 5.528 M ops/s (+4.7%) \| +- hits latency \| 192.398 ns/op \| 183.059 ns/op (-4.9%) \| +- important_hits throughput \| 0.053 M ops/s \| 0.055 M ops/s (+4.9%) \| : \| : <before> \| <after> \| +-+ local_storage cache interleaved get +----------------------+---------------------- \| +- hits throughput \| 6.265 M ops/s \| 6.498 M ops/s (+3.7%) \| +- hits latency \| 161.436 ns/op \| 152.877 ns/op (-5.3%) \| +- important_hits throughput \| 1.636 M ops/s \| 1.697 M ops/s (+3.7%) \| \| + num_maps: 1000 \| : <before> \| <after> \| +-+ local_storage cache sequential get +----------------------+---------------------- \| +- hits throughput \| 0.355 M ops/s \| 0.354 M ops/s ( ~ ) \| +- hits latency \| 2826.538 ns/op \| 2827.139 ns/op ( ~ ) \| +- important_hits throughput \| 0.000 M ops/s \| 0.000 M ops/s ( ~ ) \| : \| : <before> \| <after> \| +-+ local_storage cache interleaved get +----------------------+---------------------- \| +- hits throughput \| 0.404 M ops/s \| 0.403 M ops/s ( ~ ) \| +- hits latency \| 2481.190 ns/op \| 2487.555 ns/op ( ~ ) \| +- important_hits throughput \| 0.102 M ops/s \| 0.101 M ops/s ( ~ ) The on_lookup test in {cgrp,task}_ls_recursion.c is removed because the bpf_local_storage_lookup is no longer traceable and adding tracepoint will make the compiler generate worse code: https://lore.kernel.org/bpf/ZcJmok64Xqv6l4ZS@elver.google.com/ Signed-off-by: Marco Elver <elver@google.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240207122626.3508658-1-elver@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2024-02-11 14:06:24 -08:00
Martin KaFai Lau	55d49f750b	bpf: bpf_sk_storage: Fix the missing uncharge in sk_omem_alloc The commit `c83597fa5d` ("bpf: Refactor some inode/task/sk storage functions for reuse"), refactored the bpf_{sk,task,inode}_storage_free() into bpf_local_storage_unlink_nolock() which then later renamed to bpf_local_storage_destroy(). The commit accidentally passed the "bool uncharge_mem = false" argument to bpf_selem_unlink_storage_nolock() which then stopped the uncharge from happening to the sk->sk_omem_alloc. This missing uncharge only happens when the sk is going away (during __sk_destruct). This patch fixes it by always passing "uncharge_mem = true". It is a noop to the task/inode/cgroup storage because they do not have the map_local_storage_(un)charge enabled in the map_ops. A followup patch will be done in bpf-next to remove the uncharge_mem argument. A selftest is added in the next patch. Fixes: `c83597fa5d` ("bpf: Refactor some inode/task/sk storage functions for reuse") Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230901231129.578493-3-martin.lau@linux.dev	2023-09-06 11:08:14 +02:00
Martin KaFai Lau	a96a44aba5	bpf: bpf_sk_storage: Fix invalid wait context lockdep report './test_progs -t test_local_storage' reported a splat: [ 27.137569] ============================= [ 27.138122] [ BUG: Invalid wait context ] [ 27.138650] 6.5.0-03980-gd11ae1b16b0a #247 Tainted: G O [ 27.139542] ----------------------------- [ 27.140106] test_progs/1729 is trying to lock: [ 27.140713] ffff8883ef047b88 (stock_lock){-.-.}-{3:3}, at: local_lock_acquire+0x9/0x130 [ 27.141834] other info that might help us debug this: [ 27.142437] context-{5:5} [ 27.142856] 2 locks held by test_progs/1729: [ 27.143352] #0: ffffffff84bcd9c0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x4/0x40 [ 27.144492] #1: ffff888107deb2c0 (&storage->lock){..-.}-{2:2}, at: bpf_local_storage_update+0x39e/0x8e0 [ 27.145855] stack backtrace: [ 27.146274] CPU: 0 PID: 1729 Comm: test_progs Tainted: G O 6.5.0-03980-gd11ae1b16b0a #247 [ 27.147550] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 27.149127] Call Trace: [ 27.149490] <TASK> [ 27.149867] dump_stack_lvl+0x130/0x1d0 [ 27.152609] dump_stack+0x14/0x20 [ 27.153131] __lock_acquire+0x1657/0x2220 [ 27.153677] lock_acquire+0x1b8/0x510 [ 27.157908] local_lock_acquire+0x29/0x130 [ 27.159048] obj_cgroup_charge+0xf4/0x3c0 [ 27.160794] slab_pre_alloc_hook+0x28e/0x2b0 [ 27.161931] __kmem_cache_alloc_node+0x51/0x210 [ 27.163557] __kmalloc+0xaa/0x210 [ 27.164593] bpf_map_kzalloc+0xbc/0x170 [ 27.165147] bpf_selem_alloc+0x130/0x510 [ 27.166295] bpf_local_storage_update+0x5aa/0x8e0 [ 27.167042] bpf_fd_sk_storage_update_elem+0xdb/0x1a0 [ 27.169199] bpf_map_update_value+0x415/0x4f0 [ 27.169871] map_update_elem+0x413/0x550 [ 27.170330] __sys_bpf+0x5e9/0x640 [ 27.174065] __x64_sys_bpf+0x80/0x90 [ 27.174568] do_syscall_64+0x48/0xa0 [ 27.175201] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 27.175932] RIP: 0033:0x7effb40e41ad [ 27.176357] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d8 [ 27.179028] RSP: 002b:00007ffe64c21fc8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141 [ 27.180088] RAX: ffffffffffffffda RBX: 00007ffe64c22768 RCX: 00007effb40e41ad [ 27.181082] RDX: 0000000000000020 RSI: 00007ffe64c22008 RDI: 0000000000000002 [ 27.182030] RBP: 00007ffe64c21ff0 R08: 0000000000000000 R09: 00007ffe64c22788 [ 27.183038] R10: 0000000000000064 R11: 0000000000000202 R12: 0000000000000000 [ 27.184006] R13: 00007ffe64c22788 R14: 00007effb42a1000 R15: 0000000000000000 [ 27.184958] </TASK> It complains about acquiring a local_lock while holding a raw_spin_lock. It means it should not allocate memory while holding a raw_spin_lock since it is not safe for RT. raw_spin_lock is needed because bpf_local_storage supports tracing context. In particular for task local storage, it is easy to get a "current" task PTR_TO_BTF_ID in tracing bpf prog. However, task (and cgroup) local storage has already been moved to bpf mem allocator which can be used after raw_spin_lock. The splat is for the sk storage. For sk (and inode) storage, it has not been moved to bpf mem allocator. Using raw_spin_lock or not, kzalloc(GFP_ATOMIC) could theoretically be unsafe in tracing context. However, the local storage helper requires a verifier accepted sk pointer (PTR_TO_BTF_ID), it is hypothetical if that (mean running a bpf prog in a kzalloc unsafe context and also able to hold a verifier accepted sk pointer) could happen. This patch avoids kzalloc after raw_spin_lock to silent the splat. There is an existing kzalloc before the raw_spin_lock. At that point, a kzalloc is very likely required because a lookup has just been done before. Thus, this patch always does the kzalloc before acquiring the raw_spin_lock and remove the later kzalloc usage after the raw_spin_lock. After this change, it will have a charge and then uncharge during the syscall bpf_map_update_elem() code path. This patch opts for simplicity and not continue the old optimization to save one charge and uncharge. This issue is dated back to the very first commit of bpf_sk_storage which had been refactored multiple times to create task, inode, and cgroup storage. This patch uses a Fixes tag with a more recent commit that should be easier to do backport. Fixes: `b00fa38a9c` ("bpf: Enable non-atomic allocations in local storage") Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230901231129.578493-2-martin.lau@linux.dev	2023-09-06 11:07:54 +02:00
Andrii Nakryiko	6c3eba1c5e	bpf: Centralize permissions checks for all BPF map types This allows to do more centralized decisions later on, and generally makes it very explicit which maps are privileged and which are not (e.g., LRU_HASH and LRU_PERCPU_HASH, which are privileged HASH variants, as opposed to unprivileged HASH and HASH_PERCPU; now this is explicit and easy to verify). Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20230613223533.3689589-4-andrii@kernel.org	2023-06-19 14:04:04 +02:00
Alexei Starovoitov	10fd5f70c3	bpf: Handle NULL in bpf_local_storage_free. During OOM bpf_local_storage_alloc() may fail to allocate 'storage' and call to bpf_local_storage_free() with NULL pointer will cause a crash like: [ 271718.917646] BUG: kernel NULL pointer dereference, address: 00000000000000a0 [ 271719.019620] RIP: 0010:call_rcu+0x2d/0x240 [ 271719.216274] bpf_local_storage_alloc+0x19e/0x1e0 [ 271719.250121] bpf_local_storage_update+0x33b/0x740 Fixes: `7e30a8477b` ("bpf: Add bpf_local_storage_free()") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20230412171252.15635-1-alexei.starovoitov@gmail.com	2023-04-12 10:27:50 -07:00
Martin KaFai Lau	6ae9d5e99e	bpf: Use bpf_mem_cache_alloc/free for bpf_local_storage This patch uses bpf_mem_cache_alloc/free for allocating and freeing bpf_local_storage for task and cgroup storage. The changes are similar to the previous patch. A few things that worth to mention for bpf_local_storage: The local_storage is freed when the last selem is deleted. Before deleting a selem from local_storage, it needs to retrieve the local_storage->smap because the bpf_selem_unlink_storage_nolock() may have set it to NULL. Note that local_storage->smap may have already been NULL when the selem created this local_storage has been removed. In this case, call_rcu will be used to free the local_storage. Also, the bpf_ma (true or false) value is needed before calling bpf_local_storage_free(). The bpf_ma can either be obtained from the local_storage->smap (if available) or any of its selem's smap. A new helper check_storage_bpf_ma() is added to obtain bpf_ma for a deleting bpf_local_storage. When bpf_local_storage_alloc getting a reused memory, all fields are either in the correct values or will be initialized. 'cache[]' must already be all NULLs. 'list' must be empty. Others will be initialized. Cc: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230322215246.1675516-4-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-25 19:52:52 -07:00
Martin KaFai Lau	08a7ce384e	bpf: Use bpf_mem_cache_alloc/free in bpf_local_storage_elem This patch uses bpf_mem_alloc for the task and cgroup local storage that the bpf prog can easily get a hold of the storage owner's PTR_TO_BTF_ID. eg. bpf_get_current_task_btf() can be used in some of the kmalloc code path which will cause deadlock/recursion. bpf_mem_cache_alloc is deadlock free and will solve a legit use case in [1]. For sk storage, its batch creation benchmark shows a few percent regression when the sk create/destroy batch size is larger than 32. The sk creation/destruction happens much more often and depends on external traffic. Considering it is hypothetical to be able to cause deadlock with sk storage, it can cross the bridge to use bpf_mem_alloc till a legit (ie. useful) use case comes up. For inode storage, bpf_local_storage_destroy() is called before waiting for a rcu gp and its memory cannot be reused immediately. inode stays with kmalloc/kfree after the rcu [or tasks_trace] gp. A 'bool bpf_ma' argument is added to bpf_local_storage_map_alloc(). Only task and cgroup storage have 'bpf_ma == true' which means to use bpf_mem_cache_alloc/free(). This patch only changes selem to use bpf_mem_alloc for task and cgroup. The next patch will change the local_storage to use bpf_mem_alloc also for task and cgroup. Here is some more details on the changes: * memory allocation: After bpf_mem_cache_alloc(), the SDATA(selem)->data is zero-ed because bpf_mem_cache_alloc() could return a reused selem. It is to keep the existing bpf_map_kzalloc() behavior. Only SDATA(selem)->data is zero-ed. SDATA(selem)->data is the visible part to the bpf prog. No need to use zero_map_value() to do the zeroing because bpf_selem_free(..., reuse_now = true) ensures no bpf prog is using the selem before returning the selem through bpf_mem_cache_free(). For the internal fields of selem, they will be initialized when linking to the new smap and the new local_storage. When 'bpf_ma == false', nothing changes in this patch. It will stay with the bpf_map_kzalloc(). * memory free: The bpf_selem_free() and bpf_selem_free_rcu() are modified to handle the bpf_ma == true case. For the common selem free path where its owner is also being destroyed, the mem is freed in bpf_local_storage_destroy(), the owner (task and cgroup) has gone through a rcu gp. The memory can be reused immediately, so bpf_local_storage_destroy() will call bpf_selem_free(..., reuse_now = true) which will do bpf_mem_cache_free() for immediate reuse consideration. An exception is the delete elem code path. The delete elem code path is called from the helper bpf_*_storage_delete() and the syscall bpf_map_delete_elem(). This path is an unusual case for local storage because the common use case is to have the local storage staying with its owner life time so that the bpf prog and the user space does not have to monitor the owner's destruction. For the delete elem path, the selem cannot be reused immediately because there could be bpf prog using it. It will call bpf_selem_free(..., reuse_now = false) and it will wait for a rcu tasks trace gp before freeing the elem. The rcu callback is changed to do bpf_mem_cache_raw_free() instead of kfree(). When 'bpf_ma == false', it should be the same as before. __bpf_selem_free() is added to do the kfree_rcu and call_tasks_trace_rcu(). A few words on the 'reuse_now == true'. When 'reuse_now == true', it is still racing with bpf_local_storage_map_free which is under rcu protection, so it still needs to wait for a rcu gp instead of kfree(). Otherwise, the selem may be reused by slab for a totally different struct while the bpf_local_storage_map_free() is still using it (as a rcu reader). For the inode case, there may be other rcu readers also. In short, when bpf_ma == false and reuse_now == true => vanilla rcu. [1]: https://lore.kernel.org/bpf/20221118190109.1512674-1-namhyung@kernel.org/ Cc: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230322215246.1675516-3-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-25 19:52:52 -07:00
Martin KaFai Lau	7e30a8477b	bpf: Add bpf_local_storage_free() This patch refactors local_storage freeing logic into bpf_local_storage_free(). It is a preparation work for a later patch that uses bpf_mem_cache_alloc/free. The other kfree(local_storage) cases are also changed to bpf_local_storage_free(..., reuse_now = true). Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230308065936.1550103-12-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-10 11:05:29 -08:00
Martin KaFai Lau	1288aaa278	bpf: Add bpf_local_storage_rcu callback The existing bpf_local_storage_free_rcu is renamed to bpf_local_storage_free_trace_rcu. A new bpf_local_storage_rcu callback is added to do the kfree instead of using kfree_rcu. It is a preparation work for a later patch using bpf_mem_cache_alloc/free. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230308065936.1550103-11-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-10 11:05:29 -08:00
Martin KaFai Lau	c0d63f3091	bpf: Add bpf_selem_free() This patch refactors the selem freeing logic into bpf_selem_free(). It is a preparation work for a later patch using bpf_mem_cache_alloc/free. The other kfree(selem) cases are also changed to bpf_selem_free(..., reuse_now = true). Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230308065936.1550103-10-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-10 11:05:28 -08:00
Martin KaFai Lau	f8ccf30c17	bpf: Add bpf_selem_free_rcu callback Add bpf_selem_free_rcu() callback to do the kfree() instead of using kfree_rcu. It is a preparation work for using bpf_mem_cache_alloc/free in a later patch. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230308065936.1550103-9-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-10 11:05:28 -08:00
Martin KaFai Lau	c609981342	bpf: Remove bpf_selem_free_fields_rcu This patch removes the bpf_selem_free_fields_rcu. The bpf_obj_free_fields() can be done before the call_rcu_trasks_trace() and kfree_rcu(). It is needed when a later patch uses bpf_mem_cache_alloc/free. In bpf hashtab, bpf_obj_free_fields() is also called before calling bpf_mem_cache_free. The discussion can be found in https://lore.kernel.org/bpf/f67021ee-21d9-bfae-6134-4ca542fab843@linux.dev/ Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230308065936.1550103-8-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-10 11:05:28 -08:00
Martin KaFai Lau	a47eabf216	bpf: Repurpose use_trace_rcu to reuse_now in bpf_local_storage This patch re-purpose the use_trace_rcu to mean if the freed memory can be reused immediately or not. The use_trace_rcu is renamed to reuse_now. Other than the boolean test is reversed, it should be a no-op. The following explains the reason for the rename and how it will be used in a later patch. In a later patch, bpf_mem_cache_alloc/free will be used in the bpf_local_storage. The bpf mem allocator will reuse the freed memory immediately. Some of the free paths in bpf_local_storage does not support memory to be reused immediately. These paths are the "delete" elem cases from the bpf_*_storage_delete() helper and the map_delete_elem() syscall. Note that "delete" elem before the owner's (sk/task/cgrp/inode) lifetime ended is not the common usage for the local storage. The common free path, bpf_local_storage_destroy(), can reuse the memory immediately. This common path means the storage stays with its owner until the owner is destroyed. The above mentioned "delete" elem paths that cannot reuse immediately always has the 'use_trace_rcu == true'. The cases that is safe for immediate reuse always have 'use_trace_rcu == false'. Instead of adding another arg in a later patch, this patch re-purpose this arg to reuse_now and have the test logic reversed. In a later patch, 'reuse_now == true' will free to the bpf_mem_cache_free() where the memory can be reused immediately. 'reuse_now == false' will go through the call_rcu_tasks_trace(). Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230308065936.1550103-7-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-10 11:05:28 -08:00
Martin KaFai Lau	fc6652aab6	bpf: Remember smap in bpf_local_storage This patch remembers which smap triggers the allocation of a 'struct bpf_local_storage' object. The local_storage is allocated during the very first selem added to the owner. The smap pointer is needed when using the bpf_mem_cache_free in a later patch because it needs to free to the correct smap's bpf_mem_alloc object. When a selem is being removed, it needs to check if it is the selem that triggers the creation of the local_storage. If it is, the local_storage->smap pointer will be reset to NULL. This NULL reset is done under the local_storage->lock in bpf_selem_unlink_storage_nolock() when a selem is being removed. Also note that the local_storage may not go away even local_storage->smap is NULL because there may be other selem still stored in the local_storage. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230308065936.1550103-6-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-10 11:05:28 -08:00
Martin KaFai Lau	121f31f3e0	bpf: Remove the preceding __ from __bpf_selem_unlink_storage __bpf_selem_unlink_storage is taking the spin lock and there is no name collision also. Having the preceding '__' is confusing when reviewing the later patch. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230308065936.1550103-5-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-10 11:05:28 -08:00
Martin KaFai Lau	62827d612a	bpf: Remove __bpf_local_storage_map_alloc bpf_local_storage_map_alloc() is the only caller of __bpf_local_storage_map_alloc(). The remaining logic in bpf_local_storage_map_alloc() is only a one liner setting the smap->cache_idx. Remove __bpf_local_storage_map_alloc() to simplify code. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230308065936.1550103-4-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-03-10 11:05:28 -08:00

1 2

76 Commits