linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-25 07:33:19 +02:00

Author	SHA1	Message	Date
Hyunwoo Kim	48f6a5356a	net: skbuff: propagate shared-frag marker through frag-transfer helpers Two frag-transfer helpers (__pskb_copy_fclone() and skb_shift()) fail to propagate the SKBFL_SHARED_FRAG bit in skb_shinfo()->flags when moving frags from source to destination. __pskb_copy_fclone() defers the rest of the shinfo metadata to skb_copy_header() after copying frag descriptors, but that helper only carries over gso_{size,segs, type} and never touches skb_shinfo()->flags; skb_shift() moves frag descriptors directly and leaves flags untouched. As a result, the destination skb keeps a reference to the same externally-owned or page-cache-backed pages while reporting skb_has_shared_frag() as false. The mismatch is harmful in any in-place writer that uses skb_has_shared_frag() to decide whether shared pages must be detoured through skb_cow_data(). ESP input is one such writer (esp4.c, esp6.c), and a single nft 'dup to <local>' rule -- or any other nf_dup_ipv4() / xt_TEE caller -- is enough to land a pskb_copy()'d skb in esp_input() with the marker stripped, letting an unprivileged user write into the page cache of a root-owned read-only file via authencesn-ESN stray writes. Set SKBFL_SHARED_FRAG on the destination whenever frag descriptors were actually moved from the source. skb_copy() and skb_copy_expand() share skb_copy_header() too but linearize all paged data into freshly allocated head storage and emerge with nr_frags == 0, so skb_has_shared_frag() returns false on its own; they need no change. The same omission exists in skb_gro_receive() and skb_gro_receive_list(). The former moves the incoming skb's frag descriptors into the accumulator's last sub-skb via two paths (a direct frag-move loop and the head_frag + memcpy path); the latter chains the incoming skb whole onto p's frag_list. Downstream skb_segment() reads only skb_shinfo(p)->flags, and skb_segment_list() reuses each sub-skb's shinfo as the nskb -- both p and lp must carry the marker. The same omission also exists in tcp_clone_payload(), which builds an MTU probe skb by moving frag descriptors from skbs on sk_write_queue into a freshly allocated nskb. The helper falls into the same family and warrants the same fix for consistency; no TCP TX-side in-place writer is currently known to reach a user page through this gap, but a future consumer depending on the marker would regress silently. The same omission exists in skb_segment(): the per-iteration flag merge takes only head_skb's flag, and the inner switch that rebinds frag_skb to list_skb on head_skb-frags exhaustion does not fold the new frag_skb's flag into nskb. Fold frag_skb's flag at both sites so segments drawing frags from frag_list members carry the marker. Fixes: `cef401de7b` ("net: fix possible wrong checksum generation") Fixes: `f4c50a4034` ("xfrm: esp: avoid in-place decrypt on shared skb frags") Suggested-by: Sabrina Dubroca <sd@queasysnail.net> Suggested-by: Sultan Alsawaf <sultan@kerneltoast.com> Suggested-by: Ben Hutchings <ben@decadent.org.uk> Suggested-by: Lin Ma <malin89@huawei.com> Suggested-by: Jingguo Tan <tanjingguo@huawei.com> Suggested-by: Aaron Esau <aaron1esau@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com> Tested-by: Rajat Gupta <rajat.gupta@oss.qualcomm.com> Link: https://patch.msgid.link/ageeJfJHwgzmKXbh@v4bel Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-21 11:31:05 +02:00
William Bowling	f84eca5817	net: skbuff: preserve shared-frag marker during coalescing skb_try_coalesce() can attach paged frags from @from to @to. If @from has SKBFL_SHARED_FRAG set, the resulting @to skb can contain the same externally-owned or page-cache-backed frags, but the shared-frag marker is currently lost. That breaks the invariant relied on by later in-place writers. In particular, ESP input checks skb_has_shared_frag() before deciding whether an uncloned nonlinear skb can skip skb_cow_data(). If TCP receive coalescing has moved shared frags into an unmarked skb, ESP can see skb_has_shared_frag() as false and decrypt in place over page-cache backed frags. Propagate SKBFL_SHARED_FRAG when skb_try_coalesce() transfers paged frags. The tailroom copy path does not need the marker because it copies bytes into @to's linear data rather than transferring frag descriptors. Fixes: `cef401de7b` ("net: fix possible wrong checksum generation") Fixes: `f4c50a4034` ("xfrm: esp: avoid in-place decrypt on shared skb frags") Signed-off-by: William Bowling <vakzz@zellic.io> Reviewed-by: Eric Dumazet <edumazet@google.com> Tested-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://patch.msgid.link/20260513041635.1289541-1-vakzz@zellic.io Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-14 17:22:00 -07:00
Konstantin Khorenko	29b1ee8788	net: add noinline __init __no_profile to skb_extensions_init() for GCOV compatibility With -fprofile-update=atomic in global CFLAGS_GCOV, GCC still cannot constant-fold the skb_ext_total_length() loop when it is inlined into a profiled caller. The existing __no_profile on skb_ext_total_length() itself is insufficient because after __always_inline expansion the code resides in the caller's body, which still carries GCOV instrumentation. Mark skb_extensions_init() with __no_profile so the BUILD_BUG_ON checks can be evaluated at compile time. Also mark it noinline to prevent the compiler from inlining it into skb_init() (which lacks __no_profile), which would re-expose the function body to GCOV instrumentation. Add __init since skb_extensions_init() is only called from __init skb_init(). Previously it was implicitly inlined into the .init.text section; with noinline it would otherwise remain in permanent .text, wasting memory after boot. Build-tested with both CONFIG_GCOV_PROFILE_ALL=y and CONFIG_KCOV_INSTRUMENT_ALL=y. Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com> Link: https://patch.msgid.link/20260410162150.3105738-3-khorenko@virtuozzo.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 15:29:02 -07:00
Konstantin Khorenko	c0b4382c86	net: fix skb_ext_total_length() BUILD_BUG_ON with CONFIG_GCOV_PROFILE_ALL When CONFIG_GCOV_PROFILE_ALL=y is enabled, the kernel fails to build: In file included from <command-line>: In function 'skb_extensions_init', inlined from 'skb_init' at net/core/skbuff.c:5214:2: ././include/linux/compiler_types.h:706:45: error: call to '__compiletime_assert_1490' declared with attribute error: BUILD_BUG_ON failed: skb_ext_total_length() > 255 CONFIG_GCOV_PROFILE_ALL adds -fprofile-arcs -ftest-coverage -fno-tree-loop-im to CFLAGS globally. GCC inserts branch profiling counters into the skb_ext_total_length() loop and, combined with -fno-tree-loop-im (which disables loop invariant motion), cannot constant-fold the result. BUILD_BUG_ON requires a compile-time constant and fails. The issue manifests in kernels with 5+ SKB extension types enabled (e.g., after addition of SKB_EXT_CAN, SKB_EXT_PSP). With 4 extensions GCC can still unroll and fold the loop despite GCOV instrumentation; with 5+ it gives up. Mark skb_ext_total_length() with __no_profile to prevent GCOV from inserting counters into this function. Without counters the loop is "clean" and GCC can constant-fold it even with -fno-tree-loop-im active. This allows BUILD_BUG_ON to work correctly while keeping GCOV profiling for the rest of the kernel. This also removes the CONFIG_KCOV_INSTRUMENT_ALL preprocessor guard introduced by `d6e5794b06`. That guard was added as a precaution because KCOV instrumentation was also suspected of inhibiting constant folding. However, KCOV uses -fsanitize-coverage=trace-pc, which inserts lightweight trace callbacks that do not interfere with GCC's constant folding or loop optimization passes. Only GCOV's -fprofile-arcs combined with -fno-tree-loop-im actually prevents the compiler from evaluating the loop at compile time. The guard is therefore unnecessary and can be safely removed. Fixes: `96ea3a1e2d` ("can: add CAN skb extension infrastructure") Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com> Reviewed-by: Thomas Weissschuh <linux@weissschuh.net> Link: https://patch.msgid.link/20260410162150.3105738-2-khorenko@virtuozzo.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 15:29:02 -07:00
Jiayuan Chen	5758be283f	net: skb: clean up dead code after skb_kfree_head() simplification Since commit `0f42e3f4fe` ("net: skb: fix cross-cache free of KFENCE-allocated skb head"), skb_kfree_head() always calls kfree() and no longer uses end_offset to distinguish between skb_small_head_cache and generic kmalloc caches. Clean up the leftovers: - Remove the unused end_offset parameter from skb_kfree_head() and update all callers. - Remove the SKB_SMALL_HEAD_HEADROOM guard in __skb_unclone_keeptruesize() which was protecting the old skb_kfree_head() logic. - Update the SKB_SMALL_HEAD_CACHE_SIZE comment to reflect that the non-power-of-2 sizing is no longer used for free-path disambiguation. No functional change. Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260410034736.297900-1-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 09:43:49 -07:00
Jakub Kicinski	b6e39e4846	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-7.0-rc8). Conflicts: net/ipv6/seg6_iptunnel.c `c3812651b5` ("seg6: separate dst_cache for input and output paths in seg6 lwtunnel") `78723a62b9` ("seg6: add per-route tunnel source address") https://lore.kernel.org/adZhwtOYfo-0ImSa@sirena.org.uk net/ipv4/icmp.c `fde29fd934` ("ipv4: icmp: fix null-ptr-deref in icmp_build_probe()") `d98adfbdd5` ("ipv4: drop ipv6_stub usage and use direct function calls") https://lore.kernel.org/adO3dccqnr6j-BL9@sirena.org.uk Adjacent changes: drivers/net/ethernet/stmicro/stmmac/chain_mode.c `51f4e090b9` ("net: stmmac: fix integer underflow in chain mode") `6b4286e055` ("net: stmmac: rename STMMAC_GET_ENTRY() -> STMMAC_NEXT_ENTRY()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 13:20:59 -07:00
Jiayuan Chen	0f42e3f4fe	net: skb: fix cross-cache free of KFENCE-allocated skb head SKB_SMALL_HEAD_CACHE_SIZE is intentionally set to a non-power-of-2 value (e.g. 704 on x86_64) to avoid collisions with generic kmalloc bucket sizes. This ensures that skb_kfree_head() can reliably use skb_end_offset to distinguish skb heads allocated from skb_small_head_cache vs. generic kmalloc caches. However, when KFENCE is enabled, kfence_ksize() returns the exact requested allocation size instead of the slab bucket size. If a caller (e.g. bpf_test_init) allocates skb head data via kzalloc() and the requested size happens to equal SKB_SMALL_HEAD_CACHE_SIZE, then slab_build_skb() -> ksize() returns that exact value. After subtracting skb_shared_info overhead, skb_end_offset ends up matching SKB_SMALL_HEAD_HEADROOM, causing skb_kfree_head() to incorrectly free the object to skb_small_head_cache instead of back to the original kmalloc cache, resulting in a slab cross-cache free: kmem_cache_free(skbuff_small_head): Wrong slab cache. Expected skbuff_small_head but got kmalloc-1k Fix this by always calling kfree(head) in skb_kfree_head(). This keeps the free path generic and avoids allocator-specific misclassification for KFENCE objects. Fixes: `bf9f1baa27` ("net: add dedicated kmem_cache for typical/small skb->head") Reported-by: Antonius <antonius@bluedragonsec.com> Closes: https://lore.kernel.org/netdev/CAK8a0jxC5L5N7hq-DT2_NhUyjBxrPocoiDazzsBk4TGgT1r4-A@mail.gmail.com/ Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260403014517.142550-1-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-06 18:46:53 -07:00
Jason Xing	8a4e3ab61d	net: advance skb_defer_disable_key check in napi_consume_skb When net.core.skb_defer_max is adjusted to zero, napi_consume_skb() shouldn't go into that deeper in skb_attempt_defer_free() because it adds an additional pair of local_bh_enable/disable() which is evidently not needed. Advancing the check of the static key saves more cycles and benefits non defer case. Signed-off-by: Jason Xing <kernelxing@tencent.com> Link: https://patch.msgid.link/20260402034114.65766-1-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-06 18:32:04 -07:00
Eric Dumazet	08dc30de1a	net: add skb_defer_disable_key static key Add a static key to bypass skb_attempt_defer_free() steps if net.core.skb_defer_max is set to zero. Main benefit is the atomic_long_inc_return() avoidance. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260311191340.1996888-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-12 19:25:33 -07:00
Eric Dumazet	6466441a5e	net: inline skb_add_rx_frag_netmem() This critical helper (via skb_add_rx_frag()) is mostly used from drivers rx fast path. It is time to inline it, this actually saves space in vmlinux: size vmlinux.old vmlinux text data bss dec hex filename 37350766 23092977 4846992 65290735 3e441ef vmlinux.old 37350600 23092977 4846992 65290569 3e44149 vmlinux Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260226041213.1892561-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 19:20:34 -08:00
Sebastian Andrzej Siewior	983512f3a8	net: Drop the lock in skb_may_tx_timestamp() skb_may_tx_timestamp() may acquire sock::sk_callback_lock. The lock must not be taken in IRQ context, only softirq is okay. A few drivers receive the timestamp via a dedicated interrupt and complete the TX timestamp from that handler. This will lead to a deadlock if the lock is already write-locked on the same CPU. Taking the lock can be avoided. The socket (pointed by the skb) will remain valid until the skb is released. The ->sk_socket and ->file member will be set to NULL once the user closes the socket which may happen before the timestamp arrives. If we happen to observe the pointer while the socket is closing but before the pointer is set to NULL then we may use it because both pointer (and the file's cred member) are RCU freed. Drop the lock. Use READ_ONCE() to obtain the individual pointer. Add a matching WRITE_ONCE() where the pointer are cleared. Link: https://lore.kernel.org/all/20260205145104.iWinkXHv@linutronix.de Fixes: `b245be1f4d` ("net-timestamp: no-payload only sysctl") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260220183858.N4ERjFW6@linutronix.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-24 11:27:29 +01:00
Eric Dumazet	0943404b1f	net: do not delay zero-copy skbs in skb_attempt_defer_free() After the blamed commit, TCP tx zero copy notifications could be arbitrarily delayed and cause regressions in applications waiting for them. Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: `e20dfbad8a` ("net: fix napi_consume_skb() with alien skbs") Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260216193653.627617-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-17 17:06:18 -08:00
Oliver Hartkopp	abf981bb8d	net: skb: allow up to 8 skb extension ids The skb extension ids range from 0 .. 7 to fit their bits as flags into a single byte. The ids are automatically enumnerated in enum skb_ext_id in skbuff.h, where SKB_EXT_NUM is defined as the last value. When having 8 skb extension ids (0 .. 7), SKB_EXT_NUM becomes 8 which is a valid value for SKB_EXT_NUM. Fixes: `96ea3a1e2d` ("can: add CAN skb extension infrastructure") Link: https://lore.kernel.org/netdev/aXoMqaA7b2CqJZNA@strlen.de/ Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20260205-skb_ext-v1-1-9ba992ccee8b@hartkopp.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-06 20:07:24 -08:00
Eric Dumazet	7a4cd71fa4	net: add vlan_get_protocol_offset_inline() helper skb_protocol() is bloated, and forces slow stack canaries in many fast paths. Add vlan_get_protocol_offset_inline() which deals with the non-vlan common cases. __vlan_get_protocol_offset() is now out of line. It returns a vlan_type_depth struct to avoid stack canaries in callers. struct vlan_type_depth { __be16 type; u16 depth; }; $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/2 grow/shrink: 0/22 up/down: 0/-6320 (-6320) Function old new delta vlan_get_protocol_dgram 61 59 -2 __pfx_skb_protocol 16 - -16 __vlan_get_protocol_offset 307 273 -34 tap_get_user 1374 1207 -167 ip_md_tunnel_xmit 1625 1452 -173 tap_sendmsg 940 753 -187 netif_skb_features 1079 866 -213 netem_enqueue 3017 2800 -217 vlan_parse_protocol 271 50 -221 tso_start 567 344 -223 fq_dequeue 1908 1685 -223 skb_network_protocol 434 205 -229 ip6_tnl_xmit 2639 2409 -230 br_dev_queue_push_xmit 474 236 -238 skb_protocol 258 - -258 packet_parse_headers 621 357 -264 __ip6_tnl_rcv 1306 1039 -267 skb_csum_hwoffload_help 515 224 -291 ip_tunnel_xmit 2635 2339 -296 sch_frag_xmit_hook 1582 1233 -349 bpf_skb_ecn_set_ce 868 457 -411 IP6_ECN_decapsulate 1297 768 -529 ip_tunnel_rcv 2121 1489 -632 ipip6_rcv 2572 1922 -650 Total: Before=24892803, After=24886483, chg -0.03% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260204053023.1622775-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-05 16:33:52 +01:00
Oliver Hartkopp	96ea3a1e2d	can: add CAN skb extension infrastructure To remove the private CAN bus skb headroom infrastructure 8 bytes need to be stored in the skb. The skb extensions are a common pattern and an easy and efficient way to hold private data travelling along with the skb. We only need the skb_ext_add() and skb_ext_find() functions to allocate and access CAN specific content as the skb helpers to copy/clone/free skbs automatically take care of skb extensions and their final removal. This patch introduces the complete CAN skb extensions infrastructure: - add struct can_skb_ext in new file include/net/can.h - add include/net/can.h in MAINTAINERS - add SKB_EXT_CAN to skbuff.c and skbuff.h - select SKB_EXTENSIONS in Kconfig when CONFIG_CAN is enabled - check for existing CAN skb extensions in can_rcv() in af_can.c - add CAN skb extensions allocation at every skb_alloc() location - duplicate the skb extensions if cloning outgoing skbs (framelen/gw_hops) - introduce can_skb_ext_add() and can_skb_ext_find() helpers The patch also corrects an indention issue in the original code from 2018: Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202602010426.PnGrYAk3-lkp@intel.com/ Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20260201-can_skb_ext-v8-2-3635d790fe8b@hartkopp.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-05 11:58:39 +01:00
Eric Dumazet	df7388b3d7	net: inline get_netmem() and put_netmem() These helpers are used in network fast paths. Only call out-of-line helpers for netmem case. We might consider inlining __get_netmem() and __put_netmem() in the future. $ scripts/bloat-o-meter -t vmlinux.3 vmlinux.4 add/remove: 6/6 grow/shrink: 22/1 up/down: 2614/-646 (1968) Function old new delta pskb_carve 1669 1894 +225 gro_pull_from_frag0 - 206 +206 get_page 190 380 +190 skb_segment 3561 3747 +186 put_page 595 765 +170 skb_copy_ubufs 1683 1822 +139 __pskb_trim_head 276 401 +125 __pskb_copy_fclone 734 858 +124 skb_zerocopy 1092 1215 +123 pskb_expand_head 892 1008 +116 skb_split 828 940 +112 skb_release_data 297 409 +112 ___pskb_trim 829 941 +112 __skb_zcopy_downgrade_managed 120 226 +106 tcp_clone_payload 530 634 +104 esp_ssg_unref 191 294 +103 dev_gro_receive 1464 1514 +50 __put_netmem - 41 +41 __get_netmem - 41 +41 skb_shift 1139 1175 +36 skb_try_coalesce 681 714 +33 __pfx_put_page 112 144 +32 __pfx_get_page 32 64 +32 __pskb_pull_tail 1137 1168 +31 veth_xdp_get 250 267 +17 __pfx_gro_pull_from_frag0 - 16 +16 __pfx___put_netmem - 16 +16 __pfx___get_netmem - 16 +16 __pfx_put_netmem 16 - -16 __pfx_gro_try_pull_from_frag0 16 - -16 __pfx_get_netmem 16 - -16 put_netmem 114 - -114 get_netmem 130 - -130 napi_gro_frags 929 771 -158 gro_try_pull_from_frag0 196 - -196 Total: Before=22565857, After=22567825, chg +0.01% Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260122045720.1221017-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-25 13:18:53 -08:00
Jakub Kicinski	9abf22075d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.19-rc7). Conflicts: drivers/net/ethernet/huawei/hinic3/hinic3_irq.c `b35a6fd37a` ("hinic3: Add adaptive IRQ coalescing with DIM") `fb2bb2a1eb` ("hinic3: Fix netif_queue_set_napi queue_index input parameter error") https://lore.kernel.org/fc0a7fdf08789a52653e8ad05281a0a849e79206.1768915707.git.zhuyikai1@h-partners.com drivers/net/wireless/ath/ath12k/mac.c drivers/net/wireless/ath/ath12k/wifi7/hw.c `3170757210` ("wifi: ath12k: Fix wrong P2P device link id issue") `c26f294fef` ("wifi: ath12k: Move ieee80211_ops callback to the arch specific module") https://lore.kernel.org/20260114123751.6a208818@canb.auug.org.au Adjacent changes: drivers/net/wireless/ath/ath12k/mac.c `8b8d6ee53d` ("wifi: ath12k: Fix scan state stuck in ABORTING after cancel_remain_on_channel") `914c890d3b` ("wifi: ath12k: Add framework for hardware specific ieee80211_ops registration") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-22 20:14:36 -08:00
Jakub Kicinski	cd18e8ac03	net: add kdoc for napi_consume_skb() Looks like AI reviewers miss that napi_consume_skb() must have a real budget passed to it. Let's see if adding a real kdoc will help them figure this out. Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://patch.msgid.link/20260119224140.1362729-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-20 18:27:50 -08:00
Eric Dumazet	79bfa5fb85	net: fclone allocation small optimization After skb allocation, initial skb->fclone value is 0 (SKB_FCLONE_UNAVAILABLE) We can replace one RMW sequence with a single OR instruction. movzbl 0x7e(%r13),%eax // skb->fclone = SKB_FCLONE_ORIG; and $0xf3,%al or $0x4,%al mov %al,0x7e(%r13) -> or $0x4,0x7e(%r13) // skb->fclone \|= SKB_FCLONE_ORIG; Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260116164402.1872649-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-20 18:25:33 -08:00
Eric Dumazet	3fbb5395c7	net: split kmalloc_reserve() to allow inlining kmalloc_reserve() is too big to be inlined. Put the slow path in a new out-of-line function : kmalloc_pfmemalloc() Then let kmalloc_reserve() set skb->pfmemalloc only when/if the slow path is taken. This makes __alloc_skb() faster : - kmalloc_reserve() is now automatically inlined by both gcc and clang. - No more expensive RMW (skb->pfmemalloc = pfmemalloc). - No more expensive stack canary (for CONFIG_STACKPROTECTOR_STRONG=y). - Removal of two prefetches that were coming too late for modern cpus. Text size increase is quite small compared to the cpu savings (~0.7 %) $ size net/core/skbuff.clang.before.o net/core/skbuff.clang.after.o text data bss dec hex filename 72507 5897 0 78404 13244 net/core/skbuff.clang.before.o 72681 5897 0 78578 132f2 net/core/skbuff.clang.after.o Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260116041359.181104-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-20 18:19:18 -08:00
Eric Dumazet	2db009e4c8	net: minor __alloc_skb() optimization We can directly call __finalize_skb_around() instead of __build_skb_around() because @size is not zero. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260113131017.2310584-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-15 19:52:02 -08:00
Eric Dumazet	220d89df1d	net: add skb->data_len and (skb>end - skb->tail) to skb_dump() While working on a syzbot report, I found that skb_dump() is lacking two important parts : - skb->data_len. - (skb>end - skb->tail) tailroom is zero if skb is not linear. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260112172621.4188700-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-15 19:49:47 -08:00
Eric Dumazet	d4596891e7	net: inline napi_skb_cache_get() clang is inlining it already, gcc (14.2) does not. Small space cost (215 bytes on x86_64) but faster sk_buff allocations. $ scripts/bloat-o-meter -t net/core/skbuff.gcc.before.o net/core/skbuff.gcc.after.o add/remove: 0/1 grow/shrink: 4/1 up/down: 359/-144 (215) Function old new delta __alloc_skb 471 611 +140 napi_build_skb 245 363 +118 napi_alloc_skb 331 416 +85 skb_copy_ubufs 1869 1885 +16 skb_shift 1445 1413 -32 napi_skb_cache_get 112 - -112 Total: Before=59941, After=60156, chg +0.36% Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260112131515.4051589-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-01-15 12:30:28 +01:00
Eric Dumazet	0391ab577c	net: add skbuff_clear() helper clang is unable to inline the memset() calls in net/core/skbuff.c when initializing allocated sk_buff. memset(skb, 0, offsetof(struct sk_buff, tail)); This is unfortunate, because: 1) calling external memset_orig() helper adds a call/ret and typical setup cost. 2) offsetof(struct sk_buff, tail) == 0xb8 = 0x80 + 0x38 On x86_64, memset_orig() performs two 64 bytes clear, then has to loop 7 times to clear the final 56 bytes. skbuff_clear() makes sure the minimal and optimal code is generated. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260109203836.1667441-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-12 19:30:57 -08:00
Mohammad Heib	238e03d046	net: fix memory leak in skb_segment_list for GRO packets When skb_segment_list() is called during packet forwarding, it handles packets that were aggregated by the GRO engine. Historically, the segmentation logic in skb_segment_list assumes that individual segments are split from a parent SKB and may need to carry their own socket memory accounting. Accordingly, the code transfers truesize from the parent to the newly created segments. Prior to commit `ed4cccef64` ("gro: fix ownership transfer"), this truesize subtraction in skb_segment_list() was valid because fragments still carry a reference to the original socket. However, commit `ed4cccef64` ("gro: fix ownership transfer") changed this behavior by ensuring that fraglist entries are explicitly orphaned (skb->sk = NULL) to prevent illegal orphaning later in the stack. This change meant that the entire socket memory charge remained with the head SKB, but the corresponding accounting logic in skb_segment_list() was never updated. As a result, the current code unconditionally adds each fragment's truesize to delta_truesize and subtracts it from the parent SKB. Since the fragments are no longer charged to the socket, this subtraction results in an effective under-count of memory when the head is freed. This causes sk_wmem_alloc to remain non-zero, preventing socket destruction and leading to a persistent memory leak. The leak can be observed via KMEMLEAK when tearing down the networking environment: unreferenced object 0xffff8881e6eb9100 (size 2048): comm "ping", pid 6720, jiffies 4295492526 backtrace: kmem_cache_alloc_noprof+0x5c6/0x800 sk_prot_alloc+0x5b/0x220 sk_alloc+0x35/0xa00 inet6_create.part.0+0x303/0x10d0 __sock_create+0x248/0x640 __sys_socket+0x11b/0x1d0 Since skb_segment_list() is exclusively used for SKB_GSO_FRAGLIST packets constructed by GRO, the truesize adjustment is removed. The call to skb_release_head_state() must be preserved. As documented in commit `cf673ed0e0` ("net: fix fraglist segmentation reference count leak"), it is still required to correctly drop references to SKB extensions that may be overwritten during __copy_skb_header(). Fixes: `ed4cccef64` ("gro: fix ownership transfer") Signed-off-by: Mohammad Heib <mheib@redhat.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260104213101.352887-1-mheib@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-05 17:01:28 -08:00
Jakub Kicinski	4c03592689	net: restore napi_consume_skb()'s NULL-handling Commit `e20dfbad8a` ("net: fix napi_consume_skb() with alien skbs") added a skb->cpu check to napi_consume_skb(), before the point where napi_consume_skb() validated skb is not NULL. Add an explicit check to the early exit condition. Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-27 17:44:22 -08:00
Jason Xing	5d7fc63ab8	net: prefetch the next skb in napi_skb_cache_get() After getting the current skb in napi_skb_cache_get(), the next skb in cache is highly likely to be used soon, so prefetch would be helpful. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20251118070646.61344-5-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-19 20:29:25 -08:00
Jason Xing	2d67b5c5c6	net: use NAPI_SKB_CACHE_FREE to keep 32 as default to do bulk free - Replace NAPI_SKB_CACHE_HALF with NAPI_SKB_CACHE_FREE - Only free 32 skbs in napi_skb_cache_put() Since the first patch adjusting NAPI_SKB_CACHE_SIZE to 128, the number of packets to be freed in the softirq was increased from 32 to 64. Considering a subsequent net_rx_action() calling napi_poll() a few times can easily consume the 64 available slots and we can afford keeping a higher value of sk_buffs in per-cpu storage, decrease NAPI_SKB_CACHE_FREE to 32 like before. So now the logic is 1) keeping 96 skbs, 2) freeing 32 skbs at one time. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20251118070646.61344-4-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-19 20:29:24 -08:00
Jason Xing	01d7385618	net: increase default NAPI_SKB_CACHE_BULK to 32 The previous value 16 is a bit conservative, so adjust it along with NAPI_SKB_CACHE_SIZE, which can minimize triggering memory allocation in napi_skb_cache_get*(). Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20251118070646.61344-3-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-19 20:29:24 -08:00
Jason Xing	3505730d90	net: increase default NAPI_SKB_CACHE_SIZE to 128 After commit `b61785852e` ("net: increase skb_defer_max default to 128") changed the value sysctl_skb_defer_max to avoid many calls to kick_defer_list_purge(), the same situation can be applied to NAPI_SKB_CACHE_SIZE that was proposed in 2016. It's a trade-off between using pre-allocated memory in skb_cache and saving more a bit heavy function calls in the softirq context. With this patch applied, we can have more skbs per-cpu to accelerate the sending path that needs to acquire new skbs. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20251118070646.61344-2-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-19 20:29:24 -08:00
Eric Dumazet	21664814b8	net: use napi_skb_cache even in process context This is a followup of commit `e20dfbad8a` ("net: fix napi_consume_skb() with alien skbs"). Now the per-cpu napi_skb_cache is populated from TX completion path, we can make use of this cache, especially for cpus not used from a driver NAPI poll (primary user of napi_cache). We can use the napi_skb_cache only if current context is not from hard irq. With this patch, I consistently reach 130 Mpps on my UDP tx stress test and reduce SLUB spinlock contention to smaller values. Note there is still some SLUB contention for skb->head allocations. I had to tune /sys/kernel/slab/skbuff_small_head/cpu_partial and /sys/kernel/slab/skbuff_small_head/min_partial depending on the platform taxonomy. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Tested-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251116202717.1542829-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-18 18:25:47 -08:00
Eric Dumazet	294e638259	net: __alloc_skb() cleanup This patch refactors __alloc_skb() to prepare the following one, and does not change functionality. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251116202717.1542829-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-18 18:25:47 -08:00
Eric Dumazet	dac0236075	net: add a new @alloc parameter to napi_skb_cache_get() We want to be able in the series last patch to get an skb from napi_skb_cache from process context, if there is one available. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251116202717.1542829-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-18 18:25:47 -08:00
Eric Dumazet	26b8986a18	net: clear skb->sk in skb_release_head_state() skb_release_head_state() inlines skb_orphan(). We need to clear skb->sk otherwise we can freeze TCP flows on a mostly idle host, because skb_fclone_busy() would return true as long as the packet is not yet processed by skb_defer_free_flush(). Fixes: `1fcf572211` ("net: allow skb_release_head_state() to be called multiple times") Fixes: `e20dfbad8a` ("net: fix napi_consume_skb() with alien skbs") Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Aditya Garg <gargaditya@linux.microsoft.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251111151235.1903659-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-12 06:48:33 -08:00
Jason Xing	8da7bea7db	xsk: add indirect call for xsk_destruct_skb Since Eric proposed an idea about adding indirect call wrappers for UDP and managed to see a huge improvement[1], the same situation can also be applied in xsk scenario. This patch adds an indirect call for xsk and helps current copy mode improve the performance by around 1% stably which was observed with IXGBE at 10Gb/sec loaded. If the throughput grows, the positive effect will be magnified. I applied this patch on top of batch xmit series[2], and was able to see <5% improvement from our internal application which is a little bit unstable though. Use INDIRECT wrappers to keep xsk_destruct_skb static as it used to be when the mitigation config is off. Be aware of the freeing path that can be very hot since the frequency can reach around 2,000,000 times per second with the xdpsock test. [1]: https://lore.kernel.org/netdev/20251006193103.2684156-2-edumazet@google.com/ [2]: https://lore.kernel.org/all/20251021131209.41491-1-kerneljasonxing@gmail.com/ Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20251031103328.95468-1-kerneljasonxing@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-11-11 10:21:08 +01:00
Jakub Sitnicki	290fc0be09	net: Preserve metadata on pskb_expand_head pskb_expand_head() copies headroom, including skb metadata, into the newly allocated head, but then clears the metadata. As a result, metadata is lost when BPF helpers trigger an skb head reallocation. Let the skb metadata remain in the newly created copy of head. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-2-5ceb08a9b37b@cloudflare.com	2025-11-10 10:52:31 -08:00
Eric Dumazet	e20dfbad8a	net: fix napi_consume_skb() with alien skbs There is a lack of NUMA awareness and more generally lack of slab caches affinity on TX completion path. Modern drivers are using napi_consume_skb(), hoping to cache sk_buff in per-cpu caches so that they can be recycled in RX path. Only use this if the skb was allocated on the same cpu, otherwise use skb_attempt_defer_free() so that the skb is freed on the original cpu. This removes contention on SLUB spinlocks and data structures. After this patch, I get ~50% improvement for an UDP tx workload on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues). 80 Mpps -> 120 Mpps. Profiling one of the 32 cpus servicing NIC interrupts : Before: mpstat -P 511 1 1 Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle Average: 511 0.00 0.00 0.00 0.00 0.00 98.00 0.00 0.00 0.00 2.00 31.01% ksoftirqd/511 [kernel.kallsyms] [k] queued_spin_lock_slowpath 12.45% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath 5.60% ksoftirqd/511 [kernel.kallsyms] [k] __slab_free 3.31% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_clean_buf_ring 3.27% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_clean_all 2.95% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_start 2.52% ksoftirqd/511 [kernel.kallsyms] [k] fq_dequeue 2.32% ksoftirqd/511 [kernel.kallsyms] [k] read_tsc 2.25% ksoftirqd/511 [kernel.kallsyms] [k] build_detached_freelist 2.15% ksoftirqd/511 [kernel.kallsyms] [k] kmem_cache_free 2.11% swapper [kernel.kallsyms] [k] __slab_free 2.06% ksoftirqd/511 [kernel.kallsyms] [k] idpf_features_check 2.01% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr 1.97% ksoftirqd/511 [kernel.kallsyms] [k] skb_release_data 1.52% ksoftirqd/511 [kernel.kallsyms] [k] sock_wfree 1.34% swapper [kernel.kallsyms] [k] idpf_tx_clean_buf_ring 1.23% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_all 1.15% ksoftirqd/511 [kernel.kallsyms] [k] dma_unmap_page_attrs 1.11% swapper [kernel.kallsyms] [k] idpf_tx_splitq_start 1.03% swapper [kernel.kallsyms] [k] fq_dequeue 0.94% swapper [kernel.kallsyms] [k] kmem_cache_free 0.93% swapper [kernel.kallsyms] [k] read_tsc 0.81% ksoftirqd/511 [kernel.kallsyms] [k] napi_consume_skb 0.79% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr 0.77% ksoftirqd/511 [kernel.kallsyms] [k] skb_free_head 0.76% swapper [kernel.kallsyms] [k] idpf_features_check 0.72% swapper [kernel.kallsyms] [k] skb_release_data 0.69% swapper [kernel.kallsyms] [k] build_detached_freelist 0.58% ksoftirqd/511 [kernel.kallsyms] [k] skb_release_head_state 0.56% ksoftirqd/511 [kernel.kallsyms] [k] __put_partials 0.55% ksoftirqd/511 [kernel.kallsyms] [k] kmem_cache_free_bulk 0.48% swapper [kernel.kallsyms] [k] sock_wfree After: mpstat -P 511 1 1 Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle Average: 511 0.00 0.00 0.00 0.00 0.00 51.49 0.00 0.00 0.00 48.51 19.10% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr 13.86% swapper [kernel.kallsyms] [k] idpf_tx_clean_buf_ring 10.80% swapper [kernel.kallsyms] [k] skb_attempt_defer_free 10.57% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_all 7.18% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath 6.69% swapper [kernel.kallsyms] [k] sock_wfree 5.55% swapper [kernel.kallsyms] [k] dma_unmap_page_attrs 3.10% swapper [kernel.kallsyms] [k] fq_dequeue 3.00% swapper [kernel.kallsyms] [k] skb_release_head_state 2.73% swapper [kernel.kallsyms] [k] read_tsc 2.48% swapper [kernel.kallsyms] [k] idpf_tx_splitq_start 1.20% swapper [kernel.kallsyms] [k] idpf_features_check 1.13% swapper [kernel.kallsyms] [k] napi_consume_skb 0.93% swapper [kernel.kallsyms] [k] idpf_vport_splitq_napi_poll 0.64% swapper [kernel.kallsyms] [k] native_send_call_func_single_ipi 0.60% swapper [kernel.kallsyms] [k] acpi_processor_ffh_cstate_enter 0.53% swapper [kernel.kallsyms] [k] io_idle 0.43% swapper [kernel.kallsyms] [k] netif_skb_features 0.41% swapper [kernel.kallsyms] [k] __direct_call_cpuidle_state_enter2 0.40% swapper [kernel.kallsyms] [k] native_irq_return_iret 0.40% swapper [kernel.kallsyms] [k] idpf_tx_buf_hw_update 0.36% swapper [kernel.kallsyms] [k] sched_clock_noinstr 0.34% swapper [kernel.kallsyms] [k] handle_softirqs 0.32% swapper [kernel.kallsyms] [k] net_rx_action 0.32% swapper [kernel.kallsyms] [k] dql_completed 0.32% swapper [kernel.kallsyms] [k] validate_xmit_skb 0.31% swapper [kernel.kallsyms] [k] skb_network_protocol 0.29% swapper [kernel.kallsyms] [k] skb_csum_hwoffload_help 0.29% swapper [kernel.kallsyms] [k] x2apic_send_IPI 0.28% swapper [kernel.kallsyms] [k] ktime_get 0.24% swapper [kernel.kallsyms] [k] __qdisc_run Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://patch.msgid.link/20251106202935.1776179-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-07 19:02:39 -08:00
Eric Dumazet	1fcf572211	net: allow skb_release_head_state() to be called multiple times Currently, only skb dst is cleared (thanks to skb_dst_drop()) Make sure skb->destructor, conntrack and extensions are cleared. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20251106202935.1776179-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-07 19:02:39 -08:00
Eric Dumazet	a5cd3a60aa	net: shrink napi_skb_cache_{put,get}() and napi_skb_cache_get_bulk() Following loop in napi_skb_cache_put() is unrolled by the compiler even if CONFIG_KASAN is not enabled: for (i = NAPI_SKB_CACHE_HALF; i < NAPI_SKB_CACHE_SIZE; i++) kasan_mempool_unpoison_object(nc->skb_cache[i], kmem_cache_size(net_hotdata.skbuff_cache)); We have 32 times this sequence, for a total of 384 bytes. 48 8b 3d 00 00 00 00 net_hotdata.skbuff_cache,%rdi e8 00 00 00 00 call kmem_cache_size This is because kmem_cache_size() is not an inline and not const, and kasan_unpoison_object_data() is an inline function. Cache kmem_cache_size() result in a variable, so that the compiler can remove dead code (and variable) when/if CONFIG_KASAN is unset. After this patch, napi_skb_cache_put() is inlined in its callers, and we avoid one kmem_cache_size() call in napi_skb_cache_get() and napi_skb_cache_get_bulk(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20251016182911.1132792-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-20 17:37:57 -07:00
Eric Dumazet	5b2b7dec05	net: add add indirect call wrapper in skb_release_head_state() While stress testing UDP senders on a host with expensive indirect calls, I found cpus processing TX completions where showing a very high cost (20%) in sock_wfree() due to CONFIG_MITIGATION_RETPOLINE=y. Take care of TCP and UDP TX destructors and use INDIRECT_CALL_3() macro. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Tested-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20251014171907.3554413-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-16 16:25:09 -07:00
Eric Dumazet	6de1dec1c1	udp: do not use skb_release_head_state() before skb_attempt_defer_free() Michal reported and bisected an issue after recent adoption of skb_attempt_defer_free() in UDP. The issue here is that skb_release_head_state() is called twice per skb, one time from skb_consume_udp(), then a second time from skb_defer_free_flush() and napi_consume_skb(). As Sabrina suggested, remove skb_release_head_state() call from skb_consume_udp(). Add a DEBUG_NET_WARN_ON_ONCE(skb_nfct(skb)) in skb_attempt_defer_free() Many thanks to Michal, Sabrina, Paolo and Florian for their help. Fixes: `6471658dc6` ("udp: use skb_attempt_defer_free()") Reported-and-bisected-by: Michal Kubecek <mkubecek@suse.cz> Closes: https://lore.kernel.org/netdev/gpjh4lrotyephiqpuldtxxizrsg6job7cvhiqrw72saz2ubs3h@g6fgbvexgl3r/ Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Michal Kubecek <mkubecek@suse.cz> Cc: Sabrina Dubroca <sd@queasysnail.net> Cc: Florian Westphal <fw@strlen.de> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20251015052715.4140493-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-10-16 16:03:07 +02:00
Eric Dumazet	5628f3fe3b	net: add NUMA awareness to skb_attempt_defer_free() Instead of sharing sd->defer_list & sd->defer_count with many cpus, add one pair for each NUMA node. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250928084934.3266948-4-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-30 15:45:53 +02:00
Eric Dumazet	844c9db7f7	net: use llist for sd->defer_list Get rid of sd->defer_lock and adopt llist operations. We optimize skb_attempt_defer_free() for the common case, where the packet is queued. Otherwise sd->defer_count is increasing, until skb_defer_free_flush() clears it. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250928084934.3266948-3-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-30 15:45:53 +02:00
Eric Dumazet	9c94ae6bb0	net: make softnet_data.defer_count an atomic This is preparation work to remove the softnet_data.defer_lock, as it is contended on hosts with large number of cores. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250928084934.3266948-2-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-30 15:45:52 +02:00
Jakub Kicinski	f857478d62	netdevsim: a basic test PSP implementation Provide a PSP implementation for netdevsim. Use psp_dev_encapsulate() and psp_dev_rcv() to do actual encapsulation and decapsulation on skbs, but perform no encryption or decryption. In order to make encryption with a bad key result in a drop on the peer's rx side, we stash our psd's generation number in the first byte of each key before handing to the peer. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Co-developed-by: Daniel Zahka <daniel.zahka@gmail.com> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Link: https://patch.msgid.link/20250927225420.1443468-2-kuba@kernel.org Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-30 15:17:21 +02:00
Jakub Kicinski	203e3beb73	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.17-rc8). Conflicts: drivers/net/can/spi/hi311x.c `6b69680847` ("can: hi311x: fix null pointer dereference when resuming from sleep before interface was enabled") `27ce71e1ce` ("net: WQ_PERCPU added to alloc_workqueue users") https://lore.kernel.org/72ce7599-1b5b-464a-a5de-228ff9724701@kernel.org net/smc/smc_loopback.c drivers/dibs/dibs_loopback.c `a35c04de25` ("net/smc: fix warning in smc_rx_splice() when calling get_page()") `cc21191b58` ("dibs: Move data path to dibs layer") https://lore.kernel.org/74368a5c-48ac-4f8e-a198-40ec1ed3cf5f@kernel.org Adjacent changes: drivers/net/dsa/lantiq/lantiq_gswip.c `c0054b25e2` ("net: dsa: lantiq_gswip: move gswip_add_single_port_br() call to port_setup()") `7a1eaef0a7` ("net: dsa: lantiq_gswip: support model-specific mac_select_pcs()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-25 11:00:59 -07:00
Jason Baron	ca9f9cdc4d	net: allow alloc_skb_with_frags() to use MAX_SKB_FRAGS Currently, alloc_skb_with_frags() will only fill (MAX_SKB_FRAGS - 1) slots. I think it should use all MAX_SKB_FRAGS slots, as callers of alloc_skb_with_frags() will size their allocation of frags based on MAX_SKB_FRAGS. This issue was discovered via a test patch that sets 'order' to 0 in alloc_skb_with_frags(), which effectively tests/simulates high fragmentation. In this case sendmsg() on unix sockets will fail every time for large allocations. If the PAGE_SIZE is 4K, then data_len will request 68K or 17 pages, but alloc_skb_with_frags() can only allocate 64K in this case or 16 pages. Fixes: `09c2c90705` ("net: allow alloc_skb_with_frags() to allocate bigger packets") Signed-off-by: Jason Baron <jbaron@akamai.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250922191957.2855612-1-jbaron@akamai.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-23 16:51:26 -07:00
Jakub Kicinski	ed8a507b74	net: modify core data structures for PSP datapath support Add pointers to psp data structures to core networking structs, and an SKB extension to carry the PSP information from the drivers to the socket layer. Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Co-developed-by: Daniel Zahka <daniel.zahka@gmail.com> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250917000954.859376-4-daniel.zahka@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 12:32:06 +02:00
Pengtao He	8f2c72f225	net: avoid one loop iteration in __skb_splice_bits If len is equal to 0 at the beginning of __splice_segment it returns true directly. But when decreasing len from a positive number to 0 in __splice_segment, it returns false. The __skb_splice_bits needs to call __splice_segment again. Recheck *len if it changes, return true in time. Reduce unnecessary calls to __splice_segment. Signed-off-by: Pengtao He <hept.hept.hept@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250819021551.8361-1-hept.hept.hept@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-20 19:24:17 -07:00
Feng Yang	76d727ae02	skbuff: Add MSG_MORE flag to optimize tcp large packet transmission When using sockmap for forwarding, the average latency for different packet sizes after sending 10,000 packets is as follows: size old(us) new(us) 512 56 55 1472 58 58 1600 106 81 3000 145 105 5000 182 125 Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Feng Yang <yangfeng@kylinos.cn> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250708054053.39551-1-yangfeng59949@163.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-09 19:25:57 -07:00

1 2 3 4 5 ...

1081 Commits