linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-29 17:43:52 +02:00

Author	SHA1	Message	Date
Kuniyuki Iwashima	c570bd25d8	udp: Remove udp_table in struct udp_seq_afinfo. Since UDP and UDP-Lite had dedicated socket hash tables for each, we have had to fetch them from different pointers for procfs or bpf iterator. UDP always has its global or per-netns table in net->ipv4.udp_table and struct udp_seq_afinfo.udp_table is NULL. OTOH, UDP-Lite had only one global table in the pointer. We no longer use the field. Let's remove it and udp_get_table_seq(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260311052020.1213705-12-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-13 18:57:45 -07:00
Kuniyuki Iwashima	5c27385886	udp: Remove struct proto.h.udp_table. Since UDP and UDP-Lite had dedicated socket hash tables for each, we have had to fetch them from different pointers. UDP always has its global or per-netns table in net->ipv4.udp_table and struct proto.h.udp_table is NULL. OTOH, UDP-Lite had only one global table in the pointer. We no longer use the field. Let's remove it and udp_get_table_prot(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260311052020.1213705-11-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-13 18:57:45 -07:00
Kuniyuki Iwashima	74f0cca110	udp: Remove UDPLITE_SEND_CSCOV and UDPLITE_RECV_CSCOV. UDP-Lite supports variable-length checksum and has two socket options, UDPLITE_SEND_CSCOV and UDPLITE_RECV_CSCOV, to control the checksum coverage. Let's remove the support. setsockopt(UDPLITE_SEND_CSCOV / UDPLITE_RECV_CSCOV) was only available for UDP-Lite and returned -ENOPROTOOPT for UDP. Now, the options are handled in ip_setsockopt() and ipv6_setsockopt(), which still return the same error. getsockopt(UDPLITE_SEND_CSCOV / UDPLITE_RECV_CSCOV) was available for UDP and always returned 0, meaning full checksum, but now -ENOPROTOOPT is returned. Given that getsockopt() is meaningless for UDP and even the options are not defined under include/uapi/, this should not be a problem. $ man 7 udplite ... BUGS Where glibc support is missing, the following definitions are needed: #define IPPROTO_UDPLITE 136 #define UDPLITE_SEND_CSCOV 10 #define UDPLITE_RECV_CSCOV 11 Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260311052020.1213705-10-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-13 18:57:45 -07:00
Kuniyuki Iwashima	b2a1d719be	udp: Remove partial csum code in TX. UDP TX paths also have some code for UDP-Lite partial checksum: * udplite_csum() in udp_send_skb() and udp_v6_send_skb() * udplite_getfrag() in udp_sendmsg() and udpv6_sendmsg() Let's remove such code. Now, we can use IPPROTO_UDP directly instead of sk->sk_protocol or fl6->flowi6_proto for csum_tcpudp_magic() and csum_ipv6_magic(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260311052020.1213705-9-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-13 18:57:45 -07:00
Kuniyuki Iwashima	c2539d4f2d	udp: Remove partial csum code in RX. UDP-Lite supports the partial checksum and the coverage is stored in the position of the length field of struct udphdr. In RX paths, udp4_csum_init() / udp6_csum_init() save the value in UDP_SKB_CB(skb)->cscov and set UDP_SKB_CB(skb)->partial_cov to 1 if the coverage is not full. The subsequent processing diverges depending on the value, but such paths are now dead. Also, these functions have some code guarded for UDP: * udp_unicast_rcv_skb / udp6_unicast_rcv_skb * __udp4_lib_rcv() and __udp6_lib_rcv(). Let's remove the partial csum code and the unnecessary guard for UDP-Lite in RX. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260311052020.1213705-8-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-13 18:57:45 -07:00
Kuniyuki Iwashima	7accba6fd1	udp: Remove UDP-Lite SNMP stats. Since UDP and UDP-Lite shared most of the code, we have had to check the protocol every time we increment SNMP stats. Now that the UDP-Lite paths are dead, let's remove UDP-Lite SNMP stats. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260311052020.1213705-6-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-13 18:57:44 -07:00
Kuniyuki Iwashima	56520b398e	ipv4: Retire UDP-Lite. We have deprecated IPv6 UDP-Lite sockets. Let's drop support for IPv4 UDP-Lite sockets as well. Most of the changes are similar to the IPv6 patch: removing udplite.c and udp_impl.h, marking most functions in udp_impl.h as static, moving the prototype for udp_recvmsg() to udp.h, and adding INDIRECT_CALLABLE_SCOPE for it. In addition, the INET_DIAG support for UDP-Lite is dropped. We will remove the remaining dead code in the following patches. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260311052020.1213705-5-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-13 18:57:44 -07:00
Kuniyuki Iwashima	62554a51c5	ipv6: Retire UDP-Lite. As announced in commit `be28c14ac8` ("udplite: Print deprecation notice."), it's time to deprecate UDP-Lite. As a first step, let's drop support for IPv6 UDP-Lite sockets. We will remove the remaining dead code gradually. Along with the removal of udplite.c, most of the functions exposed via udp_impl.h are made static. The prototypes of udpv6_sendmsg() and udpv6_recvmsg() are moved to udp.h, but only udpv6_recvmsg() has INDIRECT_CALLABLE_DECLARE() because udpv6_sendmsg() is exported for rxrpc since commit `ed472b0c87` ("rxrpc: Call udp_sendmsg() directly"). Also, udpv6_recvmsg() needs INDIRECT_CALLABLE_SCOPE for CONFIG_MITIGATION_RETPOLINE=n. Note that udplite.h is included temporarily for udplite_csum(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260311052020.1213705-3-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-13 18:57:44 -07:00
Kuniyuki Iwashima	86a41d957b	udp: Make udp[46]_seq_show() static. Since commit `a3d2599b24` ("ipv{4,6}/udp{,lite}: simplify proc registration"), udp4_seq_show() and udp6_seq_show() are not used in net/ipv4/udplite.c and net/ipv6/udplite.c. Instead, udp_seq_ops and udp6_seq_ops are exposed to UDP-Lite. Let's make udp4_seq_show() and udp6_seq_show() static. udp_seq_ops and udp6_seq_ops are moved to udp_impl.h so that we can make them static when the header is removed. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260311052020.1213705-2-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-13 18:57:44 -07:00
Pablo Neira Ayuso	0548a13b5a	nf_tables: nft_dynset: fix possible stateful expression memleak in error path If cloning the second stateful expression in the element via GFP_ATOMIC fails, then the first stateful expression remains in place without being released. unreferenced object (percpu) 0x607b97e9cab8 (size 16): comm "softirq", pid 0, jiffies 4294931867 hex dump (first 16 bytes on cpu 3): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 backtrace (crc 0): pcpu_alloc_noprof+0x453/0xd80 nft_counter_clone+0x9c/0x190 [nf_tables] nft_expr_clone+0x8f/0x1b0 [nf_tables] nft_dynset_new+0x2cb/0x5f0 [nf_tables] nft_rhash_update+0x236/0x11c0 [nf_tables] nft_dynset_eval+0x11f/0x670 [nf_tables] nft_do_chain+0x253/0x1700 [nf_tables] nft_do_chain_ipv4+0x18d/0x270 [nf_tables] nf_hook_slow+0xaa/0x1e0 ip_local_deliver+0x209/0x330 Fixes: `563125a73a` ("netfilter: nftables: generalize set extension to support for several expressions") Reported-by: Gurpreet Shergill <giki.shergill@proton.me> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-03-13 15:31:15 +01:00
Florian Westphal	598adea720	netfilter: revert nft_set_rbtree: validate open interval overlap This reverts commit `648946966a` ("netfilter: nft_set_rbtree: validate open interval overlap"). There have been reports of nft failing to load valid rulesets after this patch was merged into -stable. I can reproduce several such problems with recent nft versions, including nft 1.1.6 which is widely shipped by distributions. We currently have little choice here. This commit can be resurrected at some point once the nftables fix for the false-positive overlap has appeared in common distros (see e83e32c8d1cd ("mnl: restore create element command with large batches") in nftables.git). Fixes: `648946966a` ("netfilter: nft_set_rbtree: validate open interval overlap") Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-03-13 15:31:14 +01:00
Eric Dumazet	8431c602f5	ip_tunnel: adapt iptunnel_xmit_stats() to NETDEV_PCPU_STAT_DSTATS Blamed commits forgot that vxlan/geneve use udp_tunnel[6]_xmit_skb() which call iptunnel_xmit_stats(). iptunnel_xmit_stats() was assuming tunnels were only using NETDEV_PCPU_STAT_TSTATS. @syncp offset in pcpu_sw_netstats and pcpu_dstats is different. 32bit kernels would either have corruptions or freezes if the syncp sequence was overwritten. This patch also moves pcpu_stat_type closer to dev->{t,d}stats to avoid a potential cache line miss since iptunnel_xmit_stats() needs to read it. Fixes: `6fa6de3022` ("geneve: Handle stats using NETDEV_PCPU_STAT_DSTATS.") Fixes: `be226352e8` ("vxlan: Handle stats using NETDEV_PCPU_STAT_DSTATS.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20260311123110.1471930-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-12 19:24:45 -07:00
Nimrod Oren	15abbe7c82	net: page_pool: scale alloc cache with PAGE_SIZE The current page_pool alloc-cache size and refill values were chosen to match the NAPI budget and to leave headroom for XDP_DROP recycling. These fixed values do not scale well with large pages, as they significantly increase a given page_pool's memory footprint. Scale these values to better balance memory footprint across page sizes, while keeping behavior on 4KB-page systems unchanged. Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Nimrod Oren <noren@nvidia.com> Link: https://patch.msgid.link/20260309081301.103152-1-noren@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-12 18:51:11 -07:00
Jakub Kicinski	72374257ed	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-7.0-rc4). drivers/net/ethernet/mellanox/mlx5/core/en_rx.c `db25c42c2e` ("net/mlx5e: RX, Fix XDP multi-buf frag counting for striding RQ") `dff1c3164a` ("net/mlx5e: SHAMPO, Always calculate page size") https://lore.kernel.org/aa7ORohmf67EKihj@sirena.org.uk drivers/net/ethernet/ti/am65-cpsw-nuss.c `840c9d13cb` ("net: ethernet: ti: am65-cpsw-nuss: Fix rx_filter value for PTP support") `a23c657e33` ("net: ethernet: ti: am65-cpsw: Use also port number to identify timestamps") https://lore.kernel.org/abK3EkIXuVgMyGI7@sirena.org.uk No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-12 12:53:34 -07:00
Eric Dumazet	c38b8f5f79	net: prevent NULL deref in ip[6]tunnel_xmit() Blamed commit missed that both functions can be called with dev == NULL. Also add unlikely() hints for these conditions that only fuzzers can hit. Fixes: `6f1a9140ec` ("net: add xmit recursion limit to tunnel xmit functions") Signed-off-by: Eric Dumazet <edumazet@google.com> CC: Weiming Shi <bestswngs@gmail.com> Link: https://patch.msgid.link/20260312043908.2790803-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-12 16:03:41 +01:00
Eric Dumazet	6f459eda8b	tcp: add tcp_release_cb_cond() helper Majority of tcp_release_cb() calls do nothing at all. Provide tcp_release_cb_cond() helper so that release_sock() can avoid these calls. Also hint the compiler that __release_sock() and wake_up() are rarely called. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-77 (-77) Function old new delta release_sock 258 181 -77 Total: Before=25235790, After=25235713, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260310124451.2280968-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-12 13:22:03 +01:00
Alexander Graf	0de607dc4f	vsock: add G2H fallback for CIDs not owned by H2G transport When no H2G transport is loaded, vsock currently routes all CIDs to the G2H transport (commit `65b422d9b6` ("vsock: forward all packets to the host when no H2G is registered")). Extend that existing behavior: when an H2G transport is loaded but does not claim a given CID, the connection falls back to G2H in the same way. This matters in environments like Nitro Enclaves, where an instance may run nested VMs via vhost-vsock (H2G) while also needing to reach sibling enclaves at higher CIDs through virtio-vsock-pci (G2H). With the old code, any CID > 2 was unconditionally routed to H2G when vhost was loaded, making those enclaves unreachable without setting VMADDR_FLAG_TO_HOST explicitly on every connect. Requiring every application to set VMADDR_FLAG_TO_HOST creates friction: tools like socat, iperf, and others would all need to learn about it. The flag was introduced 6 years ago and I am still not aware of any tool that supports it. Even if there were support, it would be cumbersome to use. The most natural experience is a single CID address space where H2G only wins for CIDs it actually owns, and everything else falls through to G2H, extending the behavior that already exists when H2G is absent. To give user space at least a hint that the kernel applied this logic, automatically set the VMADDR_FLAG_TO_HOST on the remote address so it can determine the path taken via getpeername(). Add a per-network namespace sysctl net.vsock.g2h_fallback (default 1). At 0 it forces strict routing: H2G always wins for CID > VMADDR_CID_HOST, or ENODEV if H2G is not loaded. Signed-off-by: Alexander Graf <graf@amazon.com> Tested-by: syzbot@syzkaller.appspotmail.com Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20260304230027.59857-1-graf@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-12 10:59:36 +01:00
Sabrina Dubroca	d87f8bc47f	xfrm: avoid RCU warnings around the per-netns netlink socket net->xfrm.nlsk is used in 2 types of contexts: - fully under RCU, with rcu_read_lock + rcu_dereference and a NULL check - in the netlink handlers, with requests coming from a userspace socket In the 2nd case, net->xfrm.nlsk is guaranteed to stay non-NULL and the object is alive, since we can't enter the netns destruction path while the user socket holds a reference on the netns. After adding the __rcu annotation to netns_xfrm.nlsk (which silences sparse warnings in the RCU users and __net_init code), we need to tell sparse that the 2nd case is safe. Add a helper for that. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2026-03-12 07:16:02 +01:00
Jakub Kicinski	28b225282d	page_pool: store detach_time as ktime_t to avoid false-negatives While testing other changes in vng I noticed that nl_netdev.page_pool_check flakes. This never happens in real CI. Turns out vng may boot and get to that test in less than a second. page_pool_detached() records the detach time in seconds, so if vng is fast enough detach time is set to 0. Other code treats 0 as "not detached". detach_time is only used to report the state to the user, so it's not a huge deal in practice but let's fix it. Store the raw ktime_t (nanoseconds) instead. A nanosecond value of 0 is practically impossible. Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Fixes: `69cb4952b6` ("net: page_pool: report when page pool was destroyed") Link: https://patch.msgid.link/20260310003907.3540019-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-10 19:03:34 -07:00
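The false negative described in commit 28b225282d above can be reproduced in a few lines of user-space C. This is only a sketch under stated assumptions: the function names are hypothetical, the time since boot is passed in as a plain nanosecond count, and nothing here is the page_pool code itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NSEC_PER_SEC 1000000000ULL

/* Old-style stamp: time since boot truncated to whole seconds. */
uint64_t stamp_in_seconds(uint64_t boot_ns)
{
	return boot_ns / SKETCH_NSEC_PER_SEC;
}

/* New-style stamp: keep the raw nanosecond value. */
uint64_t stamp_in_ns(uint64_t boot_ns)
{
	return boot_ns;
}

/* Consumers treat a zero stamp as "not detached". */
bool reported_as_detached(uint64_t stamp)
{
	return stamp != 0;
}

/* Usage: a detach 0.4 s after boot is lost with second granularity. */
void detach_stamp_example(void)
{
	uint64_t early = 400000000ULL;	/* 0.4 s after boot, in ns */

	assert(!reported_as_detached(stamp_in_seconds(early)));
	assert(reported_as_detached(stamp_in_ns(early)));
}
```

With second granularity, any detach in the first second after boot is indistinguishable from "never detached"; the nanosecond stamp only collides with zero at exactly time zero.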
Fernando Fernandez Mancera	7da62262ec	inet: add ip_local_port_step_width sysctl to improve port usage distribution With the current port selection algorithm, ports after a reserved port range or long time used port are used more often than others [1]. This causes an uneven port usage distribution. This combines with cloud environments blocking connections between the application server and the database server if there was a previous connection with the same source port, leading to connectivity problems between applications on cloud environments. The real issue here is that these firewalls cannot cope with standards-compliant port reuse. This is a workaround for such situations and an improvement on the distribution of ports selected. The proposed solution is to implement a variant of RFC 6056 Algorithm 5. The step size is selected randomly on every connect() call ensuring it is a coprime with respect to the size of the range of ports we want to scan. This way, we can ensure that all ports within the range are scanned before returning an error. To enable this algorithm, the user must configure the new sysctl option "net.ipv4.ip_local_port_step_width". In addition, on graphs generated we can observe that the distribution of source ports is more even with the proposed approach. [2] [1] https://0xffsoftware.com/port_graph_current_alg.html [2] https://0xffsoftware.com/port_graph_random_step_alg.html Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Link: https://patch.msgid.link/20260309023946.5473-2-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-10 18:59:39 -07:00
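The coprimality argument in commit 7da62262ec above can be checked with a small user-space sketch. It is not the kernel implementation: `pick_coprime_step()` and `count_distinct_offsets()` are hypothetical helpers, the random value is passed in by the caller, and bumping the step until it is coprime is just one simple way to satisfy the constraint the commit message describes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Greatest common divisor, Euclid's algorithm. */
static uint32_t gcd_u32(uint32_t a, uint32_t b)
{
	while (b != 0) {
		uint32_t t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/*
 * Hypothetical helper: derive a non-zero step from a random value and
 * bump it until it is coprime with the size of the port range.
 */
uint32_t pick_coprime_step(uint32_t rnd, uint32_t range)
{
	uint32_t step;

	assert(range > 0);
	step = rnd % range;
	if (step == 0)
		step = 1;
	/* Terminates: range - 1 is always coprime with range. */
	while (gcd_u32(step, range) != 1)
		step++;
	return step;
}

/* Count how many distinct offsets are hit in `range` probes. */
uint32_t count_distinct_offsets(uint32_t start, uint32_t step, uint32_t range)
{
	static unsigned char seen[65536];
	uint32_t i, offset, visited = 0;

	assert(range > 0 && range <= 65536);
	memset(seen, 0, sizeof(seen));
	offset = start % range;
	for (i = 0; i < range; i++) {
		if (!seen[offset]) {
			seen[offset] = 1;
			visited++;
		}
		offset = (offset + step % range) % range;
	}
	return visited;
}
```

For a range of 10 ports, a step of 4 only ever reaches 5 offsets, while the coprime step 7 reaches all 10. The same holds for 28232, the size of the default 32768-60999 local port range.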
Erni Sri Satya Vennela	89fe91c659	net: mana: hardening: Validate doorbell ID from GDMA_REGISTER_DEVICE response As a part of MANA hardening for CVM, add validation for the doorbell ID (db_id) received from hardware in the GDMA_REGISTER_DEVICE response to prevent out-of-bounds memory access when calculating the doorbell page address. In mana_gd_ring_doorbell(), the doorbell page address is calculated as: addr = db_page_base + db_page_size * db_index = (bar0_va + db_page_off) + db_page_size * db_index Hardware could return values that cause this address to fall outside the BAR0 MMIO region. In Confidential VM environments, hardware responses cannot be fully trusted. Add the following validations: - Store the BAR0 size (bar0_size) in gdma_context during probe. - Validate the doorbell page offset (db_page_off) read from device registers does not exceed bar0_size during initialization, converting mana_gd_init_registers() to return an error code. - Validate db_id from GDMA_REGISTER_DEVICE response against the maximum number of doorbell pages that fit within BAR0. Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com> Link: https://patch.msgid.link/20260306211212.543376-1-ernis@linux.microsoft.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-10 13:39:51 +01:00
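The bounds checks listed in commit 89fe91c659 above follow from the address formula. A minimal user-space sketch, assuming a hypothetical `struct sketch_db_layout` that carries the three values named in the message (the driver's real structures and exact comparisons may differ):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the values named in the commit message. */
struct sketch_db_layout {
	uint64_t bar0_size;	/* size of the BAR0 MMIO region */
	uint64_t db_page_off;	/* doorbell page offset read from registers */
	uint64_t db_page_size;	/* size of one doorbell page */
};

/* The doorbell area must start inside BAR0. */
bool sketch_db_page_off_valid(const struct sketch_db_layout *l)
{
	return l->db_page_off <= l->bar0_size;
}

/* db_id must index a doorbell page that fits entirely within BAR0. */
bool sketch_db_id_valid(const struct sketch_db_layout *l, uint64_t db_id)
{
	uint64_t max_pages;

	if (l->db_page_size == 0 || !sketch_db_page_off_valid(l))
		return false;
	max_pages = (l->bar0_size - l->db_page_off) / l->db_page_size;
	return db_id < max_pages;
}

/* Usage: 1 MiB BAR0, doorbells at 64 KiB, 4 KiB pages -> 240 pages. */
void sketch_db_example(void)
{
	struct sketch_db_layout l = {
		.bar0_size = 1048576,
		.db_page_off = 65536,
		.db_page_size = 4096,
	};

	assert(sketch_db_id_valid(&l, 239));
	assert(!sketch_db_id_valid(&l, 240));
}
```

Computing the page count by division, rather than multiplying `db_page_size * db_id` first, keeps the check free of integer overflow for untrusted `db_id` values.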
Weiming Shi	6f1a9140ec	net: add xmit recursion limit to tunnel xmit functions Tunnel xmit functions (iptunnel_xmit, ip6tunnel_xmit) lack their own recursion limit. When a bond device in broadcast mode has GRE tap interfaces as slaves, and those GRE tunnels route back through the bond, multicast/broadcast traffic triggers infinite recursion between bond_xmit_broadcast() and ip_tunnel_xmit()/ip6_tnl_xmit(), causing kernel stack overflow. The existing XMIT_RECURSION_LIMIT (8) in the no-qdisc path is not sufficient because tunnel recursion involves route lookups and full IP output, consuming much more stack per level. Use a lower limit of 4 (IP_TUNNEL_RECURSION_LIMIT) to prevent overflow. Add recursion detection using dev_xmit_recursion helpers directly in iptunnel_xmit() and ip6tunnel_xmit() to cover all IPv4/IPv6 tunnel paths including UDP encapsulated tunnels (VXLAN, Geneve, etc.). Move dev_xmit_recursion helpers from net/core/dev.h to public header include/linux/netdevice.h so they can be used by tunnel code. BUG: KASAN: stack-out-of-bounds in blake2s.constprop.0+0xe7/0x160 Write of size 32 at addr ffff88810033fed0 by task kworker/0:1/11 Workqueue: mld mld_ifc_work Call Trace: <TASK> __build_flow_key.constprop.0 (net/ipv4/route.c:515) ip_rt_update_pmtu (net/ipv4/route.c:1073) iptunnel_xmit (net/ipv4/ip_tunnel_core.c:84) ip_tunnel_xmit (net/ipv4/ip_tunnel.c:847) gre_tap_xmit (net/ipv4/ip_gre.c:779) dev_hard_start_xmit (net/core/dev.c:3887) sch_direct_xmit (net/sched/sch_generic.c:347) __dev_queue_xmit (net/core/dev.c:4802) bond_dev_queue_xmit (drivers/net/bonding/bond_main.c:312) bond_xmit_broadcast (drivers/net/bonding/bond_main.c:5279) bond_start_xmit (drivers/net/bonding/bond_main.c:5530) dev_hard_start_xmit (net/core/dev.c:3887) __dev_queue_xmit (net/core/dev.c:4841) ip_finish_output2 (net/ipv4/ip_output.c:237) ip_output (net/ipv4/ip_output.c:438) iptunnel_xmit (net/ipv4/ip_tunnel_core.c:86) gre_tap_xmit (net/ipv4/ip_gre.c:779) dev_hard_start_xmit (net/core/dev.c:3887) sch_direct_xmit (net/sched/sch_generic.c:347) __dev_queue_xmit (net/core/dev.c:4802) bond_dev_queue_xmit (drivers/net/bonding/bond_main.c:312) bond_xmit_broadcast (drivers/net/bonding/bond_main.c:5279) bond_start_xmit (drivers/net/bonding/bond_main.c:5530) dev_hard_start_xmit (net/core/dev.c:3887) __dev_queue_xmit (net/core/dev.c:4841) ip_finish_output2 (net/ipv4/ip_output.c:237) ip_output (net/ipv4/ip_output.c:438) iptunnel_xmit (net/ipv4/ip_tunnel_core.c:86) ip_tunnel_xmit (net/ipv4/ip_tunnel.c:847) gre_tap_xmit (net/ipv4/ip_gre.c:779) dev_hard_start_xmit (net/core/dev.c:3887) sch_direct_xmit (net/sched/sch_generic.c:347) __dev_queue_xmit (net/core/dev.c:4802) bond_dev_queue_xmit (drivers/net/bonding/bond_main.c:312) bond_xmit_broadcast (drivers/net/bonding/bond_main.c:5279) bond_start_xmit (drivers/net/bonding/bond_main.c:5530) dev_hard_start_xmit (net/core/dev.c:3887) __dev_queue_xmit (net/core/dev.c:4841) mld_sendpack mld_ifc_work process_one_work worker_thread </TASK> Fixes: `745e20f1b6` ("net: add a recursion limit in xmit path") Reported-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Weiming Shi <bestswngs@gmail.com> Link: https://patch.msgid.link/20260306160133.3852900-2-bestswngs@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-10 13:30:30 +01:00
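The effect of the depth limit in commit 6f1a9140ec above can be illustrated with a user-space simulation. This is a sketch under stated assumptions, not the kernel code: the counter is a plain global rather than per-CPU state behind the dev_xmit_recursion helpers, `sim_tunnel_xmit()` is a hypothetical stand-in for a tunnel whose route loops back into itself, and the exact comparison used against the limit of 4 in the kernel may differ.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_TUNNEL_RECURSION_LIMIT 4	/* value quoted in the commit */

static unsigned int xmit_depth;		/* per-CPU in the kernel, global here */
static unsigned int deepest_level;	/* high-water mark, for inspection */

/*
 * Simulated tunnel transmit. When routes_back is true the "route"
 * leads straight back into this function, as in the bond/GRE loop.
 * Returns 0 on delivery and -1 when the packet is dropped.
 */
int sim_tunnel_xmit(bool routes_back)
{
	int ret = 0;

	if (xmit_depth >= SKETCH_TUNNEL_RECURSION_LIMIT)
		return -1;	/* drop instead of recursing further */

	xmit_depth++;
	if (xmit_depth > deepest_level)
		deepest_level = xmit_depth;
	if (routes_back)
		ret = sim_tunnel_xmit(true);
	xmit_depth--;
	return ret;
}

unsigned int sim_deepest_level(void)
{
	return deepest_level;
}

unsigned int sim_current_depth(void)
{
	return xmit_depth;
}

/* Usage: a looping route is cut off after four nested levels. */
void sim_tunnel_example(void)
{
	assert(sim_tunnel_xmit(false) == 0);
	assert(sim_tunnel_xmit(true) == -1);
	assert(sim_deepest_level() == SKETCH_TUNNEL_RECURSION_LIMIT);
	assert(sim_current_depth() == 0);
}
```

Without the early return, the looping case would recurse until the stack is exhausted, which is what the KASAN report in the commit message shows.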
Eric Dumazet	d6d4ff335d	tcp: inline tcp_chrono_start() tcp_chrono_start() is small enough, and used in TCP sendmsg() fast path (from tcp_skb_entail()). Note clang is already inlining it from functions in tcp_output.c. Inlining it improves performance and reduces bloat : $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/2 grow/shrink: 1/0 up/down: 1/-84 (-83) Function old new delta tcp_skb_entail 280 281 +1 __pfx_tcp_chrono_start 16 - -16 tcp_chrono_start 68 - -68 Total: Before=25192434, After=25192351, chg -0.00% Note that tcp_chrono_stop() is too big. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20260308123549.2924460-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-09 19:34:00 -07:00
Eric Dumazet	f2db7b80b0	net/sched: refine indirect call mitigation in tc_wrapper.h Some modern cpus disable X86_FEATURE_RETPOLINE feature, even if a direct call can still be beneficial. Even when IBRS is present, an indirect call is more expensive than a direct one: Direct Calls: Compilers can perform powerful optimizations like inlining, where the function body is directly inserted at the call site, eliminating call overhead entirely. Indirect Calls: Inlining is much harder, if not impossible, because the compiler doesn't know the target function at compile time. Techniques like Indirect Call Promotion can help by using profile-guided optimization to turn frequently taken indirect calls into conditional direct calls, but they still add complexity and potential overhead compared to a truly direct call. In this patch, I split tc_skip_wrapper in two different static keys, one for tc_act() (tc_skip_wrapper_act) and one for tc_classify() (tc_skip_wrapper_cls). Then I enable tc_skip_wrapper_cls only if the count of builtin classifiers is above one. I enable tc_skip_wrapper_act only if the count of builtin actions is above one. In our production kernels, we only have CONFIG_NET_CLS_BPF=y and CONFIG_NET_ACT_BPF=y. Others are modules or are not compiled. Tested on AMD Turin cpus, cls_bpf_classify() cost went from 1% down to 0.18%, and FDO will be able to inline it in tcf_classify() for further gains. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Link: https://patch.msgid.link/20260307133601.3863071-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-09 19:31:41 -07:00
Eric Dumazet	e8eb33d650	tcp: move sysctl_tcp_shrink_window to netns_ipv4_read_txrx group Commit `18fd64d254` ("netns-ipv4: reorganize netns_ipv4 fast path variables") missed that __tcp_select_window() is reading net->ipv4.sysctl_tcp_shrink_window. Move this field to netns_ipv4_read_txrx group, as __tcp_select_window() is used both in tx and rx paths. Saves a potential cache line miss. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260307092214.2433548-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-09 19:31:01 -07:00
Eric Dumazet	47e8dbb6e7	net/sched: do not reset queues in graft operations Following typical script is extremely disruptive, because each graft operation calls dev_deactivate() which resets all the queues of the device. QPARAM="limit 100000 flow_limit 1000 buckets 4096" TXQS=64 for ETH in eth1 do tc qd del dev $ETH root 2>/dev/null tc qd add dev $ETH root handle 1: mq for i in `seq 1 $TXQS` do slot=$( printf %x $(( i )) ) tc qd add dev $ETH parent 1:$slot fq $QPARAM done done One can add "ip link set dev $ETH down/up" to reduce the disruption time: QPARAM="limit 100000 flow_limit 1000 buckets 4096" TXQS=64 for ETH in eth1 do ip link set dev $ETH down tc qd del dev $ETH root 2>/dev/null tc qd add dev $ETH root handle 1: mq for i in `seq 1 $TXQS` do slot=$( printf %x $(( i )) ) tc qd add dev $ETH parent 1:$slot fq $QPARAM done ip link set dev $ETH up done Or we can add a @reset_needed flag to dev_deactivate() and dev_deactivate_many(). This flag is set to true at device dismantle or linkwatch_do_dev(), and to false for graft operations. In the future, we might only stop one queue instead of the whole device, ie call dev_deactivate_queue() instead of dev_deactivate(). I think the problem (quadratic behavior) was added in commit `2fb541c862` ("net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc") but this does not look serious enough to deserve risky backports. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Link: https://patch.msgid.link/20260307163430.470644-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-09 18:55:55 -07:00
Eric Dumazet	50636e5ff8	tcp: move tcp_v4_early_demux() to net/ipv4/ip_input.c tcp_v4_early_demux() has a single caller : ip_rcv_finish_core(). Move it to net/ipv4/ip_input.c and mark it static, for possible compiler/linker optimizations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260306131130.654991-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-09 18:50:24 -07:00
Jeff Layton	0fe27e5985	net: change sock.sk_ino and sock_i_ino() to u64 inode->i_ino is being converted to a u64. sock.sk_ino (which caches the inode number) must also be widened to avoid truncation on 32-bit architectures where unsigned long is only 32 bits. Change sk_ino from unsigned long to u64, and update the return type of sock_i_ino() to match. Fix all format strings that print the result of sock_i_ino() (%lu -> %llu), and widen the intermediate variables and function parameters in the diag modules that were using int to hold the inode number. Note that the UAPI socket diag structures (inet_diag_msg.idiag_inode, unix_diag_msg.udiag_ino, etc.) are all __u32 and cannot be changed without breaking the ABI. The assignments to those fields will silently truncate, which is the existing behavior. Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for net/can Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20260304-iino-u64-v3-3-2257ad83d372@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2026-03-06 14:31:26 +01:00
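The two halves of commit 0fe27e5985 above (wider format strings, unchanged 32-bit UAPI fields) can be seen side by side in a user-space sketch. The helper names are hypothetical; `uint32_t` stands in for the `__u32` diag fields and `uint64_t` for the kernel's `u64`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for a __u32 UAPI diag field such as idiag_inode. */
uint32_t sketch_diag_inode_field(uint64_t ino)
{
	return (uint32_t)ino;	/* keeps only the low 32 bits */
}

/* Print the full 64-bit value; %llu pairs with unsigned long long. */
int sketch_format_ino(char *buf, size_t len, uint64_t ino)
{
	return snprintf(buf, len, "%llu", (unsigned long long)ino);
}

/* Usage: an inode number just above 2^32. */
void sketch_ino_example(void)
{
	char buf[32];
	uint64_t ino = 0x100000002ULL;	/* 4294967298 */

	sketch_format_ino(buf, sizeof(buf), ino);
	assert(strcmp(buf, "4294967298") == 0);
	assert(sketch_diag_inode_field(ino) == 2);
}
```

The text output keeps the full value, while the 32-bit field silently reports 2, which is the existing truncating behavior the commit message calls out.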
Ria Thomas	98acd4c1d9	wifi: mac80211: add support for NDP ADDBA/DELBA for S1G S1G defines use of NDP Block Ack (BA) for aggregation, requiring negotiation of NDP ADDBA/DELBA action frames. If the S1G recipient supports HT-immediate block ack, the sender must send an NDP ADDBA Request indicating it expects only NDP BlockAck frames for the agreement. Introduce support for NDP ADDBA and DELBA exchange in mac80211. The implementation negotiates the BA mechanism during setup based on station capabilities and driver support (IEEE80211_HW_SUPPORTS_NDP_BLOCKACK). If negotiation fails due to mismatched expectations, a rejection with status code WLAN_STATUS_REJECTED_NDP_BLOCK_ACK_SUGGESTED is returned as per IEEE 802.11-2024. Trace sample: IEEE 802.11 Wireless Management Fixed parameters Category code: Block Ack (3) Action code: NDP ADDBA Request (0x80) Dialog token: 0x01 Block Ack Parameters: 0x1003, A-MSDUs, Block Ack Policy .... .... .... ...1 = A-MSDUs: Permitted in QoS Data MPDUs .... .... .... ..1. = Block Ack Policy: Immediate Block Ack .... .... ..00 00.. = Traffic Identifier: 0x0 0001 0000 00.. .... = Number of Buffers (1 Buffer = 2304 Bytes): 64 Block Ack Timeout: 0x0000 Block Ack Starting Sequence Control (SSC): 0x0010 .... .... .... 0000 = Fragment: 0 0000 0000 0001 .... = Starting Sequence Number: 1 IEEE 802.11 Wireless Management Fixed parameters Category code: Block Ack (3) Action code: NDP ADDBA Response (0x81) Dialog token: 0x02 Status code: BlockAck negotiation refused because, due to buffer constraints and other unspecified reasons, the recipient prefers to generate only NDP BlockAck frames (0x006d) Block Ack Parameters: 0x1002, Block Ack Policy .... .... .... ...0 = A-MSDUs: Not Permitted .... .... .... ..1. = Block Ack Policy: Immediate Block Ack .... .... ..00 00.. = Traffic Identifier: 0x0 0001 0000 00.. .... = Number of Buffers (1 Buffer = 2304 Bytes): 64 Block Ack Timeout: 0x0000 Signed-off-by: Ria Thomas <ria.thomas@morsemicro.com> Link: https://patch.msgid.link/20260305091304.310990-1-ria.thomas@morsemicro.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2026-03-06 10:52:11 +01:00
Kuniyuki Iwashima	d4d8c6e6fd	tcp: Initialise ehash secrets during connect() and listen(). inet_ehashfn() and inet6_ehashfn() initialise random secrets on the first call by net_get_random_once(). While the init part is patched out using static keys, with CONFIG_STACKPROTECTOR_STRONG=y, this causes a compiler to generate a stack canary due to an automatic variable, unsigned long ___flags, in the DO_ONCE() macro being passed to __do_once_start(). With FDO, this is visible in __inet_lookup_established() and __inet6_lookup_established() too. Let's initialise the secrets by get_random_sleepable_once() in the slow paths: inet_hash() for listen(), and inet_hash_connect() and inet6_hash_connect() for connect(). Note that IPv6 listener will initialise both IPv4 & IPv6 secrets in inet_hash() for IPv4-mapped IPv6 address. With the patch, the stack size is reduced by 16 bytes (___flags + a stack canary) and NOPs for the static key go away. Before: __inet6_lookup_established() ... push %rbx sub $0x38,%rsp # stack is 56 bytes mov %edx,%ebx # sport mov %gs:0x299419f(%rip),%rax # load stack canary mov %rax,0x30(%rsp) and store it onto stack mov 0x440(%rdi),%r15 # net->ipv4.tcp_death_row.hashinfo nop 32: mov %r8d,%ebp # hnum shl $0x10,%ebp # hnum << 16 nop 3d: mov 0x70(%rsp),%r14d # sdif or %ebx,%ebp # INET_COMBINED_PORTS(sport, hnum) mov 0x11a8382(%rip),%eax # inet6_ehashfn() ... After: __inet6_lookup_established() ... push %rbx sub $0x28,%rsp # stack is 40 bytes mov 0x60(%rsp),%ebp # sdif mov %r8d,%r14d # hnum shl $0x10,%r14d # hnum << 16 or %edx,%r14d # INET_COMBINED_PORTS(sport, hnum) mov 0x440(%rdi),%rax # net->ipv4.tcp_death_row.hashinfo mov 0x1194f09(%rip),%r10d # inet6_ehashfn() ... Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260303235424.3877267-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-05 18:50:05 -08:00
Eric Dumazet	46cb1fcdb7	tcp: move tcp_v6_early_demux() to net/ipv6/ip6_input.c tcp_v6_early_demux() has a single caller : ip6_rcv_finish_core(). Move it to net/ipv6/ip6_input.c and mark it static, for possible compiler/linker optimizations. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260304022706.1062459-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-05 18:33:51 -08:00
Jakub Kicinski	0b1324cdd8	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-7.0-rc3). No conflicts. Adjacent changes: net/netfilter/nft_set_rbtree.c `fb7fb40163` ("netfilter: nf_tables: clone set on flush only") `3aea466a43` ("netfilter: nft_set_rbtree: don't disable bh when acquiring tree lock") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-05 12:11:05 -08:00
Larysa Zaremba	75d9228982	libeth, idpf: use truesize as XDP RxQ info frag_size The only user of frag_size field in XDP RxQ info is bpf_xdp_frags_increase_tail(). It clearly expects whole buffer size instead of DMA write size. Different assumptions in idpf driver configuration lead to negative tailroom. To make it worse, buffer sizes are not actually uniform in idpf when splitq is enabled, as there are several buffer queues, so rxq->rx_buf_size is meaningless in this case. Use truesize of the first bufq in AF_XDP ZC, as there is only one. Disable growing tail for regular splitq. Fixes: `ac8a861f63` ("idpf: prepare structures to support XDP") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://patch.msgid.link/20260305111253.2317394-8-larysa.zaremba@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-05 08:02:05 -08:00
Larysa Zaremba	16394d8053	xsk: introduce helper to determine rxq->frag_size rxq->frag_size is basically a step between consecutive strictly aligned frames. In ZC mode, chunk size fits exactly, but if chunks are unaligned, there is no safe way to determine accessible space to grow tailroom. Report frag_size to be zero, if chunks are unaligned, chunk_size otherwise. Fixes: `24ea50127e` ("xsk: support mbuf on ZC RX") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://patch.msgid.link/20260305111253.2317394-3-larysa.zaremba@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-05 08:02:03 -08:00
Jamal Hadi Salim	e2cedd400c	net/sched: act_ife: Fix metalist update behavior Whenever an ife action replace changes the metalist, instead of replacing the old data on the metalist, the current ife code is appending the new metadata. Aside from being inappropriate behavior, this may lead to an unbounded addition of metadata to the metalist which might cause an out of bounds error when running the encode op: [ 138.423369][ C1] ================================================================== [ 138.424317][ C1] BUG: KASAN: slab-out-of-bounds in ife_tlv_meta_encode (net/ife/ife.c:168) [ 138.424906][ C1] Write of size 4 at addr ffff8880077f4ffe by task ife_out_out_bou/255 [ 138.425778][ C1] CPU: 1 UID: 0 PID: 255 Comm: ife_out_out_bou Not tainted 7.0.0-rc1-00169-gfbdfa8da05b6 #624 PREEMPT(full) [ 138.425795][ C1] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 138.425800][ C1] Call Trace: [ 138.425804][ C1] <IRQ> [ 138.425808][ C1] dump_stack_lvl (lib/dump_stack.c:122) [ 138.425828][ C1] print_report (mm/kasan/report.c:379 mm/kasan/report.c:482) [ 138.425839][ C1] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 138.425844][ C1] ? __virt_addr_valid (./arch/x86/include/asm/preempt.h:95 (discriminator 1) ./include/linux/rcupdate.h:975 (discriminator 1) ./include/linux/mmzone.h:2207 (discriminator 1) arch/x86/mm/physaddr.c:54 (discriminator 1)) [ 138.425853][ C1] ? ife_tlv_meta_encode (net/ife/ife.c:168) [ 138.425859][ C1] kasan_report (mm/kasan/report.c:221 mm/kasan/report.c:597) [ 138.425868][ C1] ? ife_tlv_meta_encode (net/ife/ife.c:168) [ 138.425878][ C1] kasan_check_range (mm/kasan/generic.c:186 (discriminator 1) mm/kasan/generic.c:200 (discriminator 1)) [ 138.425884][ C1] __asan_memset (mm/kasan/shadow.c:84 (discriminator 2)) [ 138.425889][ C1] ife_tlv_meta_encode (net/ife/ife.c:168) [ 138.425893][ C1] ? ife_tlv_meta_encode (net/ife/ife.c:171) [ 138.425898][ C1] ?
srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 138.425903][ C1] ife_encode_meta_u16 (net/sched/act_ife.c:57) [ 138.425910][ C1] ? __pfx_do_raw_spin_lock (kernel/locking/spinlock_debug.c:114) [ 138.425916][ C1] ? __asan_memcpy (mm/kasan/shadow.c:105 (discriminator 3)) [ 138.425921][ C1] ? __pfx_ife_encode_meta_u16 (net/sched/act_ife.c:45) [ 138.425927][ C1] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 138.425931][ C1] tcf_ife_act (net/sched/act_ife.c:847 net/sched/act_ife.c:879) To solve this issue, fix the replace behavior by adding the metalist to the ife rcu data structure. Fixes: `aa9fd9a325` ("sched: act: ife: update parameters via rcu handling") Reported-by: Ruitong Liu <cnitlrt@gmail.com> Tested-by: Ruitong Liu <cnitlrt@gmail.com> Co-developed-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260304140603.76500-1-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-05 07:54:08 -08:00
Florian Westphal	9df95785d3	netfilter: nft_set_pipapo: split gc into unlink and reclaim phase Yiming Qian reports a use-after-free in the pipapo set type: Under a large number of expired elements, commit-time GC can run for a very long time in a non-preemptible context, triggering soft lockup warnings and RCU stall reports (local denial of service). We must split GC into an unlink and a reclaim phase. We cannot queue elements for freeing until pointers have been swapped. Expired elements are still exposed to both the packet path and userspace dumpers via the live copy of the data structure. call_rcu() does not protect us: dump operations or element lookups starting after call_rcu has fired can still observe the freed element, unless the commit phase has made enough progress to swap the clone and live pointers before any new reader has picked up the old version. This is a similar approach to the one taken recently for the rbtree backend in commit `35f83a7552` ("netfilter: nft_set_rbtree: don't gc elements on insert"). Fixes: `3c4287f620` ("nf_tables: Add set type for arbitrary concatenation of ranges") Reported-by: Yiming Qian <yimingqian591@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-03-05 13:22:37 +01:00
Pablo Neira Ayuso	fb7fb40163	netfilter: nf_tables: clone set on flush only Syzbot with fault injection triggered a failing memory allocation with GFP_KERNEL which results in a WARN splat: iter.err WARNING: net/netfilter/nf_tables_api.c:845 at nft_map_deactivate+0x34e/0x3c0 net/netfilter/nf_tables_api.c:845, CPU#0: syz.0.17/5992 Modules linked in: CPU: 0 UID: 0 PID: 5992 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2026 RIP: 0010:nft_map_deactivate+0x34e/0x3c0 net/netfilter/nf_tables_api.c:845 Code: 8b 05 86 5a 4e 09 48 3b 84 24 a0 00 00 00 75 62 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc cc e8 63 6d fa f7 90 <0f> 0b 90 43 +80 7c 35 00 00 0f 85 23 fe ff ff e9 26 fe ff ff 89 d9 RSP: 0018:ffffc900045af780 EFLAGS: 00010293 RAX: ffffffff89ca45bd RBX: 00000000fffffff4 RCX: ffff888028111e40 RDX: 0000000000000000 RSI: 00000000fffffff4 RDI: 0000000000000000 RBP: ffffc900045af870 R08: 0000000000400dc0 R09: 00000000ffffffff R10: dffffc0000000000 R11: fffffbfff1d141db R12: ffffc900045af7e0 R13: 1ffff920008b5f24 R14: dffffc0000000000 R15: ffffc900045af920 FS: 000055557a6a5500(0000) GS:ffff888125496000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fb5ea271fc0 CR3: 000000003269e000 CR4: 00000000003526f0 Call Trace: <TASK> __nft_release_table+0xceb/0x11f0 net/netfilter/nf_tables_api.c:12115 nft_rcv_nl_event+0xc25/0xdb0 net/netfilter/nf_tables_api.c:12187 notifier_call_chain+0x19d/0x3a0 kernel/notifier.c:85 blocking_notifier_call_chain+0x6a/0x90 kernel/notifier.c:380 netlink_release+0x123b/0x1ad0 net/netlink/af_netlink.c:761 __sock_release net/socket.c:662 [inline] sock_close+0xc3/0x240 net/socket.c:1455 Restrict set clone to the flush set command in the preparation phase. Add NFT_ITER_UPDATE_CLONE and use it for this purpose, update the rbtree and pipapo backends to only clone the set when this iteration type is used. 
As for the existing NFT_ITER_UPDATE type, update the pipapo backend to use the existing set clone if available, otherwise use the existing set representation. After this update, there is no need to clone a set that is being deleted, this includes bound anonymous set. An alternative approach to NFT_ITER_UPDATE_CLONE is to add a .clone interface and call it from the flush set path. Reported-by: syzbot+4924a0edc148e8b4b342@syzkaller.appspotmail.com Fixes: `3f1d886cc7` ("netfilter: nft_set_pipapo: move cloning of match info to insert/removal path") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-03-05 13:22:37 +01:00
Paolo Abeni	6d32a196be	netfilter pull request nf-next-26-03-04 Merge tag 'nf-next-26-03-04' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Florian Westphal says: ==================== netfilter: updates for net-next The following patchset contains Netfilter updates for net-next, including changes to IPv6 stack and updates to IPVS from Julian Anastasov. 1) ipv6: export fib6_lookup for nft_fib_ipv6 module 2) factor out ipv6_anycast_destination logic so it's usable without dst_entry. These are dependencies for patch 3. 3) switch nft_fib_ipv6 module to no longer need temporary dst_entry object allocations by using fib6_lookup() + RCU. This gets us ~13% higher packet rate in my tests. Patches 4 to 8, from Eric Dumazet, zap sk_callback_lock usage in netfilter. Patch 9 removes another sk_callback_lock instance. Remaining patches, from Julian Anastasov, improve IPVS. Quoting Julian: * Add infrastructure for resizable hash tables based on hlist_bl. * Change the 256-bucket service hash table to be resizable. * Change the global connection table to be per-net and resizable.
* Make connection hashing more secure for setups with multiple services. netfilter pull request nf-next-26-03-04 * tag 'nf-next-26-03-04' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: ipvs: use more keys for connection hashing ipvs: switch to per-net connection table ipvs: use resizable hash table for services ipvs: add resizable hash tables rculist_bl: add hlist_bl_for_each_entry_continue_rcu netfilter: nfnetlink_queue: remove locking in nfqnl_get_sk_secctx netfilter: nfnetlink_queue: no longer acquire sk_callback_lock netfilter: nfnetlink_log: no longer acquire sk_callback_lock netfilter: nft_meta: no longer acquire sk_callback_lock in nft_meta_get_eval_skugid() netfilter: xt_owner: no longer acquire sk_callback_lock in mt_owner() netfilter: nf_log_syslog: no longer acquire sk_callback_lock in nf_log_dump_sk_uid_gid() netfilter: nft_fib_ipv6: switch to fib6_lookup ipv6: make ipv6_anycast_destination logic usable without dst_entry ipv6: export fib6_lookup for nft_fib_ipv6 ==================== Link: https://patch.msgid.link/20260304114921.31042-1-fw@strlen.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-05 11:32:50 +01:00
Eric Dumazet	c66e0f453d	net: use ktime_t in struct scm_timestamping_internal Instead of using struct timespec64 in scm_timestamping_internal, use ktime_t, saving 24 bytes in kernel stack. This makes tcp_update_recv_tstamps() small enough to be inlined. The ktime_t -> timespec64 conversions happen after socket lock has been released in tcp_recvmsg(), and only if the application requested them. $ scripts/bloat-o-meter -t vmlinux.0 vmlinux add/remove: 0/2 grow/shrink: 5/4 up/down: 146/-277 (-131) Function old new delta tcp_zerocopy_receive 2383 2425 +42 mptcp_recvmsg 1565 1607 +42 tcp_recvmsg_locked 3797 3823 +26 put_cmsg_scm_timestamping64 131 149 +18 put_cmsg_scm_timestamping 131 149 +18 __pfx_tcp_update_recv_tstamps 16 - -16 do_tcp_getsockopt 4024 4006 -18 tcp_recv_timestamp 474 430 -44 tcp_zc_handle_leftover 417 371 -46 __sock_recv_timestamp 1087 1031 -56 tcp_update_recv_tstamps 97 - -97 Total: Before=25223788, After=25223657, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://patch.msgid.link/20260304012747.881644-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-04 17:53:34 -08:00
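The stack saving described in c66e0f453d can be illustrated with a small userspace sketch. The type names `ts64_model`, `tstamps_ts64` and `tstamps_ktime`, and the helper `ns_to_ts64_model()`, are hypothetical stand-ins, and the model assumes three timestamp slots with a 16-byte timespec; it is not the kernel's definition.

```c
#include <stdint.h>

/* Hypothetical model of a 64-bit timespec: two 8-byte fields. */
struct ts64_model {
	int64_t tv_sec;
	int64_t tv_nsec;
};

/* Hypothetical "before" and "after" layouts with three timestamp slots. */
struct tstamps_ts64  { struct ts64_model ts[3]; };	/* 3 * 16 bytes */
struct tstamps_ktime { int64_t ts[3]; };		/* 3 * 8 bytes, nanoseconds */

/* Convert a non-negative nanosecond count only when the caller asks for it. */
static inline struct ts64_model ns_to_ts64_model(int64_t ns)
{
	struct ts64_model t;

	t.tv_sec = ns / 1000000000;
	t.tv_nsec = ns % 1000000000;
	return t;
}
```

Under these assumptions the nanosecond layout is 24 bytes smaller, and the conversion cost is paid only on the path that actually reports timestamps.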
Eric Dumazet	165573e41f	tcp: secure_seq: add back ports to TS offset This reverts `28ee1b746f` ("secure_seq: downgrade to per-host timestamp offsets") tcp_tw_recycle went away in 2017. Zhouyan Deng reported off-path TCP source port leakage via SYN cookie side-channel that can be fixed in multiple ways. One of them is to bring back TCP ports in TS offset randomization. As a bonus, we perform a single siphash() computation to provide both an ISN and a TS offset. Fixes: `28ee1b746f` ("secure_seq: downgrade to per-host timestamp offsets") Reported-by: Zhouyan Deng <dengzhouyan_nwpu@163.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Acked-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20260302205527.1982836-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-04 17:44:35 -08:00
Koichiro Den	7f083faf59	net: sched: avoid qdisc_reset_all_tx_gt() vs dequeue race for lockless qdiscs When shrinking the number of real tx queues, netif_set_real_num_tx_queues() calls qdisc_reset_all_tx_gt() to flush qdiscs for queues which will no longer be used. qdisc_reset_all_tx_gt() currently serializes qdisc_reset() with qdisc_lock(). However, for lockless qdiscs, the dequeue path is serialized by qdisc_run_begin/end() using qdisc->seqlock instead, so qdisc_reset() can run concurrently with __qdisc_run() and free skbs while they are still being dequeued, leading to UAF. This can easily be reproduced on e.g. virtio-net by imposing heavy traffic while frequently changing the number of queue pairs: iperf3 -ub0 -c $peer -t 0 & while :; do ethtool -L eth0 combined 1 ethtool -L eth0 combined 2 done With KASAN enabled, this leads to reports like: BUG: KASAN: slab-use-after-free in __qdisc_run+0x133f/0x1760 ... Call Trace: <TASK> ... __qdisc_run+0x133f/0x1760 __dev_queue_xmit+0x248f/0x3550 ip_finish_output2+0xa42/0x2110 ip_output+0x1a7/0x410 ip_send_skb+0x2e6/0x480 udp_send_skb+0xb0a/0x1590 udp_sendmsg+0x13c9/0x1fc0 ... </TASK> Allocated by task 1270 on cpu 5 at 44.558414s: ... alloc_skb_with_frags+0x84/0x7c0 sock_alloc_send_pskb+0x69a/0x830 __ip_append_data+0x1b86/0x48c0 ip_make_skb+0x1e8/0x2b0 udp_sendmsg+0x13a6/0x1fc0 ... Freed by task 1306 on cpu 3 at 44.558445s: ... kmem_cache_free+0x117/0x5e0 pfifo_fast_reset+0x14d/0x580 qdisc_reset+0x9e/0x5f0 netif_set_real_num_tx_queues+0x303/0x840 virtnet_set_channels+0x1bf/0x260 [virtio_net] ethnl_set_channels+0x684/0xae0 ethnl_default_set_doit+0x31a/0x890 ... Serialize qdisc_reset_all_tx_gt() against the lockless dequeue path by taking qdisc->seqlock for TCQ_F_NOLOCK qdiscs, matching the serialization model already used by dev_reset_queue(). Additionally clear QDISC_STATE_NON_EMPTY after reset so the qdisc state reflects an empty queue, avoiding needless re-scheduling. 
Fixes: `6b3ba9146f` ("net: sched: allow qdiscs to handle locking") Signed-off-by: Koichiro Den <den@valinux.co.jp> Link: https://patch.msgid.link/20260228145307.3955532-1-den@valinux.co.jp Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-04 17:43:45 -08:00
Eric Dumazet	a435163d31	net-sysfs: use rps_tag_ptr and remove metadata from rps_dev_flow_table Instead of storing the @log at the beginning of rps_dev_flow_table use 5 low order bits of the rps_tag_ptr to store the log of the size. This removes a potential cache line miss (for light traffic). This allows us to switch to one high-order allocation instead of vmalloc() when CONFIG_RFS_ACCEL is not set. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260302181432.1836150-8-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-04 16:54:10 -08:00
Eric Dumazet	b2cc61857e	net-sysfs: remove rcu field from 'struct rps_dev_flow_table' Remove rps_dev_flow_table_release() in favor of kvfree_rcu_mightsleep(). In the following patch, we will remove the "u8 @log" field and the size of 'struct rps_dev_flow_table' will be a power-of-two. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260302181432.1836150-7-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-04 16:54:10 -08:00
Eric Dumazet	dd378109d2	net-sysfs: use rps_tag_ptr and remove metadata from rps_sock_flow_table Instead of storing the @mask at the beginning of rps_sock_flow_table, use 5 low order bits of the rps_tag_ptr to store the log of the size. This removes a potential cache line miss to fetch @mask. More importantly, we can switch to vmalloc_huge() without wasting memory. Tested with: numactl --interleave=all bash -c "echo 4194304 >/proc/sys/net/core/rps_sock_flow_entries" Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260302181432.1836150-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-04 16:54:09 -08:00
Eric Dumazet	9cde131cdd	net-sysfs: add rps_sock_flow_table_mask() helper In preparation for the following patch, abstract access to the @mask field in 'struct rps_sock_flow_table'. Also clean up rps_sock_flow_sysctl() a bit: - Rename orig_sock_table to o_sock_table. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260302181432.1836150-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-04 16:54:09 -08:00
Eric Dumazet	61753849b8	net-sysfs: remove rcu field from 'struct rps_sock_flow_table' Removing rcu_head (and @mask in a following patch) will allow a power-of-two allocation and thus high-order allocation for better performance. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260302181432.1836150-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-04 16:54:09 -08:00
Eric Dumazet	42a101775b	net: add rps_tag_ptr type and helpers Add a new rps_tag_ptr type to encode a pointer and a size to a power-of-two table. Three helpers are added converting an rps_tag_ptr to: 1) A log of the size. 2) A mask : (size - 1). 3) A pointer to the array. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260302181432.1836150-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-04 16:54:09 -08:00
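The tagged-pointer idea in 42a101775b can be sketched in userspace as follows. This is a minimal model, not the kernel's code: the names `tag_ptr_t`, `tag_ptr_make()`, `tag_ptr_log()`, `tag_ptr_mask()` and `tag_ptr_table()` are hypothetical, and the sketch assumes the table is aligned to at least 32 bytes so that the 5 low-order bits of its address are zero.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model: one word holding a table pointer plus log2(size). */
typedef uintptr_t tag_ptr_t;

#define TAG_LOG_BITS 5
#define TAG_LOG_MASK ((uintptr_t)((1u << TAG_LOG_BITS) - 1))

/* Pack the pointer and the log of the table size into a single word. */
static inline tag_ptr_t tag_ptr_make(void *table, unsigned int log)
{
	assert(((uintptr_t)table & TAG_LOG_MASK) == 0);	/* low bits must be free */
	assert(log <= TAG_LOG_MASK);
	return (uintptr_t)table | log;
}

/* 1) The log of the size. */
static inline unsigned int tag_ptr_log(tag_ptr_t tp)
{
	return (unsigned int)(tp & TAG_LOG_MASK);
}

/* 2) The mask: (size - 1). */
static inline uint32_t tag_ptr_mask(tag_ptr_t tp)
{
	return (uint32_t)((1ull << tag_ptr_log(tp)) - 1);
}

/* 3) The pointer to the array, with the tag bits cleared. */
static inline void *tag_ptr_table(tag_ptr_t tp)
{
	return (void *)(tp & ~TAG_LOG_MASK);
}
```

A reader that loads the tagged word once gets a pointer and a mask that always match each other, which is why no separate size field has to be fetched from the table itself.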
Eric Dumazet	c26b8c4e29	net: fix off-by-one in udp_flow_src_port() / psp_write_headers() udp_flow_src_port() and psp_write_headers() use ip_local_port_range. ip_local_port_range is inclusive: all ports between min and max can be used. Before this patch, if ip_local_port_range was set to 40000-40001, port 40001 would not be used as a source port. Use reciprocal_scale() to help code readability. Not tagged for stable trees, as this change could break user expectations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260302163933.1754393-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-04 16:51:10 -08:00
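The inclusive-range arithmetic in c26b8c4e29 can be sketched as below. `reciprocal_scale()` here is a userspace copy of the well-known multiply-and-shift helper; `pick_src_port()` is a hypothetical function written for illustration and is not the body of udp_flow_src_port().

```c
#include <stdint.h>

/* Map a 32-bit value into [0, ep_ro) with a multiply and a shift. */
static inline uint32_t reciprocal_scale(uint32_t val, uint32_t ep_ro)
{
	return (uint32_t)(((uint64_t)val * ep_ro) >> 32);
}

/* Hypothetical helper: choose a port in the inclusive range [min, max]. */
static inline uint16_t pick_src_port(uint32_t hash, uint16_t min, uint16_t max)
{
	/* The "+ 1" is what makes max itself reachable. */
	uint32_t count = (uint32_t)max - min + 1;

	return (uint16_t)(min + reciprocal_scale(hash, count));
}
```

With the range 40000-40001, hashes in the lower half of the 32-bit space map to 40000 and hashes in the upper half map to 40001; scaling by `max - min` instead would leave 40001 unused.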
Jakub Kicinski	dbbda7dd68	Notable features this time: - cfg80211/mac80211 - finished assoc frame encryption/EPPKE/802.1X-over-auth (also hwsim) - radar detection improvements - 6 GHz incumbent signal detection APIs - multi-link support for FILS, probe response templates and client probing - ath12k: - monitor mode support on IPQ5332 - basic hwmon temperature reporting Merge tag 'wireless-next-2026-03-04' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Johannes Berg says: ==================== Notable features this time: - cfg80211/mac80211 - finished assoc frame encryption/EPPKE/802.1X-over-auth (also hwsim) - radar detection improvements - 6 GHz incumbent signal detection APIs - multi-link support for FILS, probe response templates and client probing - ath12k: - monitor mode support on IPQ5332 - basic hwmon temperature reporting * tag 'wireless-next-2026-03-04' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (38 commits) wifi: UHR: define DPS/DBE/P-EDCA elements and fix size parsing wifi: mac80211_hwsim: change hwsim_class to a const struct wifi: mac80211: give the AP more time for EPPKE as well
wifi: ath12k: Remove the unused argument from the Rx data path wifi: ath12k: Enable monitor mode support on IPQ5332 wifi: ath12k: Set up MLO after SSR wifi: ath11k: Silence remoteproc probe deferral prints wifi: cfg80211: support key installation on non-netdev wdevs wifi: cfg80211: make cluster id an array wifi: mac80211: update outdated comment wifi: mac80211: Advertise IEEE 802.1X authentication support wifi: mac80211: Add support for IEEE 802.1X authentication protocol in non-AP STA mode wifi: cfg80211: add support for IEEE 802.1X Authentication Protocol wifi: mac80211: Advertise EPPKE support based on driver capabilities wifi: mac80211_hwsim: Advertise support for (Re)Association frame encryption wifi: mac80211: Fix AAD/Nonce computation for management frames with MLO wifi: rt2x00: use generic nvmem_cell_get wifi: mac80211: fetch unsolicited probe response template by link ID wifi: mac80211: fetch FILS discovery template by link ID wifi: nl80211: don't allow DFS channels for NAN ... ==================== Link: https://patch.msgid.link/20260304113707.175181-3-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-04 15:30:05 -08:00
Julian Anastasov	f20c73b046	ipvs: use more keys for connection hashing Simon Kirby reported a long time ago that IPVS connection hashing based only on the client address/port (caddr, cport) as hash keys is not suitable for setups that accept traffic on multiple virtual IPs and ports. It can happen for multiple VIP:VPORT services, for single or many fwmark service(s) that match multiple virtual IPs and ports or even for passive FTP with persistence in DR/TUN mode where we expect traffic on multiple ports for the virtual IP. Fix it by adding virtual addresses and ports to the hash function. This causes the traffic from NAT real servers to clients to use a second hashing for the in->out direction. As a result: - the IN direction from client will use hash node hn0 where the source/dest addresses and ports used by client will be used as hash keys - the OUT direction from NAT real servers will use hash node hn1 for the traffic from real server to client - the persistence templates are hashed only with parameters based on the IN direction, so they now will also use the virtual address, port and fwmark from the service. OLD: - all methods: c_list node: proto, caddr:cport - persistence templates: c_list node: proto, caddr_net:0 - persistence engine templates: c_list node: per-PE, PE-SIP uses jhash NEW: - all methods: hn0 node (dir 0): proto, caddr:cport -> vaddr:vport - MASQ method: hn1 node (dir 1): proto, daddr:dport -> caddr:cport - persistence templates: hn0 node (dir 0): proto, caddr_net:0 -> vaddr:vport_or_0 proto, caddr_net:0 -> fwmark:0 - persistence engine templates: hn0 node (dir 0): as before Also reorder the ip_vs_conn fields, so that hash nodes are on the same read-mostly cache line while write-mostly fields are on a separate cache line. Reported-by: Simon Kirby <sim@hostway.ca> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-03-04 11:45:45 +01:00
Julian Anastasov	2fa7cc9c70	ipvs: switch to per-net connection table Use per-net resizable hash table for connections. The global table is slow to walk when using many namespaces. The table can be resized in the range of [256 - ip_vs_conn_tab_size]. Table is attached only while services are present. Resizing is done by delayed work based on load (the number of connections). Add a hash_key field into the connection to store the table ID in the highest bit and the entry's hash value in the lowest bits. The lowest part of the hash value is used as bucket ID, the remaining part is used to filter the entries in the bucket before matching the keys and, as a result, helps the lookup operation to access only one cache line. By knowing the table ID and bucket ID for an entry, we can unlink it without calculating the hash value and doing lookup by keys. We need only to validate the saved hash_key under lock. For better security, switch from jhash to siphash for the default connection hashing, but the persistence engines may use their own function. Keeping the hash table loaded with entries well below its size (12%) avoids collisions for 96+% of the conns. ip_vs_conn_fill_cport() now will rehash the connection with proper locking because unhash+hash is not safe for RCU readers. To invalidate the templates, setting just dport to 0xffff is enough, no need to rehash them. As a result, ip_vs_conn_unhash() is now unused and removed. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-03-04 11:45:45 +01:00
Julian Anastasov	840aac3d90	ipvs: use resizable hash table for services Make the hash table for services resizable in the bit range of 4-20. Table is attached only while services are present. Resizing is done by delayed work based on load (the number of hashed services). Table grows when load increases 2+ times (above 12.5% with lfactor=-3) and shrinks 8+ times when load decreases 16+ times (below 0.78%). Switch to jhash hashing to reduce the collisions for multiple services. Add a hash_key field into the service to store the table ID in the highest bit and the entry's hash value in the lowest bits. The lowest part of the hash value is used as bucket ID, the remaining part is used to filter the entries in the bucket before matching the keys and, as a result, helps the lookup operation to access only one cache line. By knowing the table ID and bucket ID for an entry, we can unlink it without calculating the hash value and doing lookup by keys. We need only to validate the saved hash_key under lock. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-03-04 11:45:45 +01:00
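The hash_key layout described in the two IPVS commits above can be modelled roughly as follows. This is a sketch under assumptions: the key is taken to be 32 bits wide, and the helper names `make_hash_key()`, `key_table_id()`, `key_bucket()` and `key_prefilter()` are hypothetical, not the IPVS API.

```c
#include <stdint.h>

#define KEY_TABLE_ID_BIT 0x80000000u	/* assumed: table ID lives in the top bit */

/* Combine a one-bit table ID with the remaining bits of the hash value. */
static inline uint32_t make_hash_key(unsigned int table_id, uint32_t hash)
{
	return ((uint32_t)(table_id & 1u) << 31) | (hash & ~KEY_TABLE_ID_BIT);
}

/* Which table the entry was hashed into. */
static inline unsigned int key_table_id(uint32_t key)
{
	return key >> 31;
}

/* The lowest bits select the bucket; bits is the table size in bits (4-20). */
static inline uint32_t key_bucket(uint32_t key, unsigned int bits)
{
	return key & ((1u << bits) - 1);
}

/* Cheap pre-filter: compare stored keys before looking at the real keys. */
static inline int key_prefilter(uint32_t stored, uint32_t wanted)
{
	return stored == wanted;
}
```

Because the stored key already names the table and the bucket, an entry can be unlinked without rehashing its keys; only the saved key needs to be revalidated under the bucket lock.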
Julian Anastasov	b655388111	ipvs: add resizable hash tables Add infrastructure for resizable hash tables based on hlist_bl which we will use in followup patches. The tables allow RCU lookups during resizing, bucket modifications are protected with per-bucket bit lock and additional custom locking, the tables are resized when load reaches thresholds determined based on load factor parameter. Compared to other implementations we rely on: * fast entry removal by using node unlinking without pre-lookup * entry rehashing when hash key changes * entries can contain multiple hash nodes * custom locking depending on different contexts * adjustable load factor to customize the grow/shrink process Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-03-04 11:45:45 +01:00
Florian Westphal	831fb31b76	ipv6: make ipv6_anycast_destination logic usable without dst_entry nft_fib_ipv6 uses ipv6_anycast_destination(), but upcoming patch removes the dst_entry usage in favor of fib6_result. Move the 'plen > 127' logic to a new helper and call it from the existing one. Signed-off-by: Florian Westphal <fw@strlen.de>	2026-03-04 11:45:44 +01:00
Yung Chih Su	4ee7fa6cf7	net: ipv4: fix ARM64 alignment fault in multipath hash seed `struct sysctl_fib_multipath_hash_seed` contains two u32 fields (user_seed and mp_seed), making it an 8-byte structure with a 4-byte alignment requirement. In `fib_multipath_hash_from_keys()`, the code evaluates the entire struct atomically via `READ_ONCE()`: mp_seed = READ_ONCE(net->ipv4.sysctl_fib_multipath_hash_seed).mp_seed; While this silently works on GCC by falling back to unaligned regular loads which the ARM64 kernel tolerates, it causes a fatal kernel panic when compiled with Clang and LTO enabled. Commit `e35123d83e` ("arm64: lto: Strengthen READ_ONCE() to acquire when CONFIG_LTO=y") strengthens `READ_ONCE()` to use Load-Acquire instructions (`ldar` / `ldapr`) to prevent compiler reordering bugs under Clang LTO. Since the macro evaluates the full 8-byte struct, Clang emits a 64-bit `ldar` instruction. ARM64 architecture strictly requires `ldar` to be naturally aligned, thus executing it on a 4-byte aligned address triggers a strict Alignment Fault (FSC = 0x21). Fix the read side by moving the `READ_ONCE()` directly to the `u32` member, which emits a safe 32-bit `ldar Wn`. Furthermore, Eric Dumazet pointed out that `WRITE_ONCE()` on the entire struct in `proc_fib_multipath_hash_set_seed()` is also flawed. Analysis shows that Clang splits this 8-byte write into two separate 32-bit `str` instructions. While this avoids an alignment fault, it destroys atomicity and exposes a tear-write vulnerability. Fix this by explicitly splitting the write into two 32-bit `WRITE_ONCE()` operations. Finally, add the missing `READ_ONCE()` when reading `user_seed` in `proc_fib_multipath_hash_seed()` to ensure proper pairing and concurrency safety. 
Fixes: `4ee2a8cace` ("net: ipv4: Add a sysctl to set multipath hash seed") Signed-off-by: Yung Chih Su <yuuchihsu@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260302060247.7066-1-yuuchihsu@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-03 17:20:37 -08:00
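The per-member access pattern described in 4ee7fa6cf7 can be sketched in userspace C. The macros `READ_ONCE_U32()` and `WRITE_ONCE_U32()` are simplified, hypothetical stand-ins for the kernel's READ_ONCE()/WRITE_ONCE(), and `struct hash_seed`, `seed_read_mp()` and `seed_write()` are illustrative names; the point is only that each access covers one naturally aligned 32-bit member rather than the whole 8-byte struct.

```c
#include <stdint.h>

/* Two u32 fields: 8 bytes in size, but only 4-byte alignment is required. */
struct hash_seed {
	uint32_t user_seed;
	uint32_t mp_seed;
};

_Static_assert(sizeof(struct hash_seed) == 8, "two u32 members expected");

/* Simplified stand-ins: a single volatile 32-bit access per call. */
#define READ_ONCE_U32(x)	(*(const volatile uint32_t *)&(x))
#define WRITE_ONCE_U32(x, v)	(*(volatile uint32_t *)&(x) = (v))

/* Read side: load only the member that is needed. */
static inline uint32_t seed_read_mp(const struct hash_seed *s)
{
	return READ_ONCE_U32(s->mp_seed);
}

/* Write side: two explicit 32-bit stores instead of one struct-wide store. */
static inline void seed_write(struct hash_seed *s, uint32_t user, uint32_t mp)
{
	WRITE_ONCE_U32(s->user_seed, user);
	WRITE_ONCE_U32(s->mp_seed, mp);
}
```

The two members are still updated separately, so a reader may observe one new and one old value; the sketch only shows that each individual access is aligned and cannot be torn.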
Dipayaan Roy	2b12ffb669	net: mana: Trigger VF reset/recovery on health check failure due to HWC timeout The GF stats periodic query is used as a mechanism to monitor HWC health. If this HWC command times out, it is a strong indication that the device/SoC is in a faulty state and requires recovery. Today, when a timeout is detected, the driver marks hwc_timeout_occurred, clears cached stats, and stops rescheduling the periodic work. However, the device itself is left in the same failing state. Extend the timeout handling path to trigger the existing MANA VF recovery service by queueing a GDMA_EQE_HWC_RESET_REQUEST work item. This is expected to initiate the appropriate recovery flow by attempting a suspend/resume first and, if that fails, triggering a bus rescan. This change is intentionally limited to HWC command timeouts and does not trigger recovery for errors reported by the SoC as a normal command response. Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/aaFShvKnwR5FY8dH@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-03 11:14:22 +01:00
Jiayuan Chen	479d589b40	bpf/bonding: reject vlan+srcmac xmit_hash_policy change when XDP is loaded bond_option_mode_set() already rejects mode changes that would make a loaded XDP program incompatible via bond_xdp_check(). However, bond_option_xmit_hash_policy_set() has no such guard. For 802.3ad and balance-xor modes, bond_xdp_check() returns false when xmit_hash_policy is vlan+srcmac, because the 802.1q payload is usually absent due to hardware offload. This means a user can: 1. Attach a native XDP program to a bond in 802.3ad/balance-xor mode with a compatible xmit_hash_policy (e.g. layer2+3). 2. Change xmit_hash_policy to vlan+srcmac while XDP remains loaded. This leaves bond->xdp_prog set but bond_xdp_check() now returning false for the same device. When the bond is later destroyed, dev_xdp_uninstall() calls bond_xdp_set(dev, NULL, NULL) to remove the program, which hits the bond_xdp_check() guard and returns -EOPNOTSUPP, triggering: WARN_ON(dev_xdp_install(dev, mode, bpf_op, NULL, 0, NULL)) Fix this by rejecting xmit_hash_policy changes to vlan+srcmac when an XDP program is loaded on a bond in 802.3ad or balance-xor mode. commit `39a0876d59` ("net, bonding: Disallow vlan+srcmac with XDP") introduced bond_xdp_check() which returns false for 802.3ad/balance-xor modes when xmit_hash_policy is vlan+srcmac. The check was wired into bond_xdp_set() to reject XDP attachment with an incompatible policy, but the symmetric path -- preventing xmit_hash_policy from being changed to an incompatible value after XDP is already loaded -- was left unguarded in bond_option_xmit_hash_policy_set(). Note: commit `094ee6017e` ("bonding: check xdp prog when set bond mode") later added a similar guard to bond_option_mode_set(), but bond_option_xmit_hash_policy_set() remained unprotected. 
Reported-by: syzbot+5a287bcdc08104bc3132@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/6995aff6.050a0220.2eeac1.014e.GAE@google.com/T/ Fixes: `39a0876d59` ("net, bonding: Disallow vlan+srcmac with XDP") Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Link: https://patch.msgid.link/20260226080306.98766-2-jiayuan.chen@linux.dev Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-03 10:47:37 +01:00
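The guard described in the commit above can be modelled as follows. This is a simplified sketch under stated assumptions: every `toy_*` name is hypothetical, only the 802.3ad/balance-xor rule from the commit message is modelled, and the `-EOPNOTSUPP` return value of the setter is an assumption borrowed from the `bond_xdp_set()` behaviour the message mentions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum toy_mode { TOY_MODE_8023AD, TOY_MODE_XOR, TOY_MODE_ACTIVEBACKUP };
enum toy_policy { TOY_POLICY_LAYER2_3, TOY_POLICY_VLAN_SRCMAC };

struct toy_bond {
	enum toy_mode mode;
	enum toy_policy policy;
	bool xdp_loaded;
};

/* Models only the rule quoted in the commit message: vlan+srcmac is
 * incompatible with XDP in 802.3ad and balance-xor modes. */
static bool toy_xdp_compatible(enum toy_mode mode, enum toy_policy policy)
{
	if (mode == TOY_MODE_8023AD || mode == TOY_MODE_XOR)
		return policy != TOY_POLICY_VLAN_SRCMAC;
	return true;
}

/* Reject a policy change that would strand a loaded XDP program. */
static int toy_set_policy(struct toy_bond *b, enum toy_policy policy)
{
	assert(b);
	if (b->xdp_loaded && !toy_xdp_compatible(b->mode, policy))
		return -EOPNOTSUPP; /* assumed error code */
	b->policy = policy;
	return 0;
}
```

Checking at the point of change keeps the invariant that a loaded program always passes the compatibility check, so the later uninstall path cannot fail.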
Kuniyuki Iwashima	425e080a1c	dccp: Remove inet_hashinfo2_init_mod(). Commit `c92c81df93` ("net: dccp: fix kernel crash on module load") added inet_hashinfo2_init_mod() for DCCP. Commit `22d6c9eebf` ("net: Unexport shared functions for DCCP.") removed EXPORT_SYMBOL_GPL() for it but forgot to remove the function itself. Let's remove inet_hashinfo2_init_mod(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260301063756.1581685-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-02 18:50:28 -08:00
Kuniyuki Iwashima	3c1e53e554	ipmr: Add dedicated mutex for mrt->{mfc_hash,mfc_cache_list}. We will no longer hold RTNL for ipmr_rtm_route() to modify the MFC hash table. Only __dev_get_by_index() in rtm_to_ipmr_mfcc() depends on RTNL; otherwise, we just need protection for mrt->mfc_hash and mrt->mfc_cache_list. Let's add a new mutex for ipmr_mfc_add(), ipmr_mfc_delete(), and mroute_clean_tables() (setsockopt(MRT_FLUSH or MRT_DONE)). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260228221800.1082070-15-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-02 18:49:41 -08:00
Kuniyuki Iwashima	4480d5fa1f	ipmr/ip6mr: Convert net->ipv[46].ipmr_seq to atomic_t. We will no longer hold RTNL for ipmr_mfc_add() and ipmr_mfc_delete(). MFC entry can be loosely connected with VIF by its index for mrt->vif_table[] (stored in mfc_parent), but the two tables are not synchronised. i.e. Even if VIF 1 is removed, MFC for VIF 1 is not automatically removed. The only field that the MFC/VIF interfaces share is net->ipv[46].ipmr_seq, which is protected by RTNL. Adding a new mutex for both just to protect a single field is overkill. Let's convert the field to atomic_t. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260228221800.1082070-14-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-02 18:49:41 -08:00
Kuniyuki Iwashima	1c36d186a0	ipmr: Define net->ipv4.{ipmr_notifier_ops,ipmr_seq} under CONFIG_IP_MROUTE. net->ipv4.ipmr_notifier_ops and net->ipv4.ipmr_seq are used only in net/ipv4/ipmr.c. Let's move these definitions under CONFIG_IP_MROUTE. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260228221800.1082070-13-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-02 18:49:41 -08:00
Eric Dumazet	8341c989ac	net: remove addr_len argument of recvmsg() handlers Use msg->msg_namelen as a place holder instead of a temporary variable, notably in inet[6]_recvmsg(). This removes stack canaries and allows tail-calls. $ scripts/bloat-o-meter -t vmlinux.old vmlinux add/remove: 0/0 grow/shrink: 2/19 up/down: 26/-532 (-506) Function old new delta rawv6_recvmsg 744 767 +23 vsock_dgram_recvmsg 55 58 +3 vsock_connectible_recvmsg 50 47 -3 unix_stream_recvmsg 161 158 -3 unix_seqpacket_recvmsg 62 59 -3 unix_dgram_recvmsg 42 39 -3 tcp_recvmsg 546 543 -3 mptcp_recvmsg 1568 1565 -3 ping_recvmsg 806 800 -6 tcp_bpf_recvmsg_parser 983 974 -9 ip_recv_error 588 576 -12 ipv6_recv_rxpmtu 442 428 -14 udp_recvmsg 1243 1224 -19 ipv6_recv_error 1046 1024 -22 udpv6_recvmsg 1487 1461 -26 raw_recvmsg 465 437 -28 udp_bpf_recvmsg 1027 984 -43 sock_common_recvmsg 103 27 -76 inet_recvmsg 257 175 -82 inet6_recvmsg 257 175 -82 tcp_bpf_recvmsg 663 568 -95 Total: Before=25143834, After=25143328, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260227151120.1346573-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-02 18:17:17 -08:00
Avraham Stern	7c6084d7fa	wifi: cfg80211: support key installation on non-netdev wdevs Currently key installation is only supported for netdev. For NAN, support most key operations (except setting default data key) on wdevs instead of netdevs, and adjust all the APIs and tracing to match. Since nothing currently sets NL80211_EXT_FEATURE_SECURE_NAN, this doesn't change anything (P2P Device already isn't allowed.) Signed-off-by: Avraham Stern <avraham.stern@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20260107150057.69a0cfad95fa.I00efdf3b2c11efab82ef6ece9f393382bcf33ba8@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2026-03-02 11:28:33 +01:00
Miri Korenblit	94d8657392	wifi: cfg80211: make cluster id an array cfg80211_nan_conf::cluster_id is currently a pointer, but there is no real reason to not have it an array. It makes things easier as there is no need to check the pointer validity each time. If a cluster ID wasn't provided by user space it will be randomized. Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20260302091108.2b12e4ccf5bb.Ib16bf5cca55463d4c89e18099cf1dfe4de95d405@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2026-03-02 11:01:02 +01:00
Sai Pratyusha Magam	a536be9231	wifi: mac80211: Fix AAD/Nonce computation for management frames with MLO Per IEEE Std 802.11be-2024, 12.5.2.3.3, if the MPDU is an individually addressed Data frame between an AP MLD and a non-AP MLD associated with the AP MLD, then A1/A2/A3 will be MLD MAC addresses. Otherwise, A1/A2/A3 will be over-the-air link MAC addresses. Currently, during AAD and Nonce computation for software based encryption/decryption cases, mac80211 directly uses the addresses it receives in the skb frame header. However, after the first authentication, management frame addresses for non-AP MLD stations are translated to MLD addresses from over the air link addresses in software. This means that the skb header could contain translated MLD addresses, which when used as is, can lead to incorrect AAD/Nonce computation. In the following manner, ensure that the right set of addresses are used: In the receive path, stash the pre-translated link addresses in ieee80211_rx_data and use them for the AAD/Nonce computations when required. In the transmit path, offload the encryption for a CCMP/GCMP key to the hwsim driver that can then ensure that encryption and hence the AAD/Nonce computations are performed on the frame containing the right set of addresses, i.e., MLD addresses if unicast data frame and link addresses otherwise. To do so, register the set key handler in hwsim driver so mac80211 is aware that it is the driver that would take care of encrypting the frame. 
Offload encryption for a CCMP/GCMP key, while keeping the encryption for WEP/TKIP and the MMIE generation for an AES_CMAC or an AES_GMAC key in the SW crypto in the mac layer. Co-developed-by: Rohan Dutta <quic_drohan@quicinc.com> Signed-off-by: Rohan Dutta <quic_drohan@quicinc.com> Signed-off-by: Sai Pratyusha Magam <sai.magam@oss.qualcomm.com> Link: https://patch.msgid.link/20260226042959.3766157-1-sai.magam@oss.qualcomm.com [only store and apply link_addrs for unicast non-data rather than storing always and applying for !unicast_data] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2026-03-02 09:53:19 +01:00
Sriram R	e098c26b35	wifi: mac80211: fetch unsolicited probe response template by link ID Currently, the unsolicited probe response template is always fetched from the default link of a virtual interface in both Multi-Link Operation (MLO) and non-MLO cases. However, in the MLO case there is a need to fetch the unsolicited probe response template from a specific link instead of the default link. Hence, add support for fetching the unsolicited probe response template based on the link ID from the corresponding link data. Signed-off-by: Sriram R <quic_srirrama@quicinc.com> Co-developed-by: Raj Kumar Bhagat <raj.bhagat@oss.qualcomm.com> Signed-off-by: Raj Kumar Bhagat <raj.bhagat@oss.qualcomm.com> Link: https://patch.msgid.link/20260220-fils-prob-by-link-v1-2-a2746a853f75@oss.qualcomm.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2026-03-02 09:29:15 +01:00
Sriram R	0495b64132	wifi: mac80211: fetch FILS discovery template by link ID Currently, the FILS discovery template is always fetched from the default link of a virtual interface in both Multi-Link Operation (MLO) and non-MLO cases. However, in the MLO case there is a need to fetch the FILS discovery template from a specific link instead of the default link. Hence, add support for fetching the FILS discovery template based on the link ID from the corresponding link data. Signed-off-by: Sriram R <quic_srirrama@quicinc.com> Co-developed-by: Raj Kumar Bhagat <raj.bhagat@oss.qualcomm.com> Signed-off-by: Raj Kumar Bhagat <raj.bhagat@oss.qualcomm.com> Link: https://patch.msgid.link/20260220-fils-prob-by-link-v1-1-a2746a853f75@oss.qualcomm.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2026-03-02 09:29:15 +01:00
Miri Korenblit	033fe322f5	wifi: nl80211/cfg80211: support stations of non-netdev interfaces Currently, a station can only be added to a netdev interface, mainly because there was no need for a station of a non-netdev interface. But for NAN, we will have stations that belong to the NL80211_IFTYPE_NAN interface. Prepare for adding/changing/deleting a station that belongs to a non-netdev interface. This doesn't actually allow such stations - this will be done in a different patch. Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20260219114327.65c9cc96f814.Ic02066b88bb8ad6b21e15cbea8d720280008c83b@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2026-03-02 09:23:03 +01:00
Hari Chandrakanthan	6a584e336c	wifi: cfg80211: add support to handle incumbent signal detected event from mac80211/driver When any incumbent signal is detected by an AP/mesh interface operating in 6 GHz band, FCC mandates the AP/mesh to vacate the channels affected by it [1]. Add a new API cfg80211_incumbent_signal_notify() that can be used by mac80211 or drivers to notify the higher layers about the signal interference event with the interference bitmap in which each bit denotes the affected 20 MHz in the operating channel. Add support for the new nl80211 event and nl80211 attribute as well to notify userspace on the details about the interference event. Userspace is expected to process it and take further action - vacate the channel, or reduce the bandwidth. [1] - https://apps.fcc.gov/kdb/GetAttachment.html?id=nXQiRC%2B4mfiA54Zha%2BrW4Q%3D%3D&desc=987594%20D02%20U-NII%206%20GHz%20EMC%20Measurement%20v03&tracking_number=277034 Signed-off-by: Hari Chandrakanthan <quic_haric@quicinc.com> Signed-off-by: Amith A <amith.a@oss.qualcomm.com> Link: https://patch.msgid.link/20260216032027.2310956-2-amith.a@oss.qualcomm.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2026-03-02 09:14:54 +01:00
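As an illustration of the interference bitmap described in the commit above, the sketch below counts the affected 20 MHz segments. It is a user-space sketch under assumptions: `affected_segments()` is a hypothetical helper, and the mapping of bit `i` to the `i`-th 20 MHz segment of the operating channel is assumed, since the commit message does not specify the bit order.

```c
#include <assert.h>
#include <stdint.h>

/* Count the 20 MHz segments flagged in an interference bitmap.
 * Assumption: bit i set means segment i of the operating channel is
 * affected; nsegs is the channel width divided by 20 MHz. */
static unsigned int affected_segments(uint32_t bitmap, unsigned int nsegs)
{
	unsigned int i, count = 0;

	assert(nsegs <= 32);
	for (i = 0; i < nsegs; i++)
		if (bitmap & (UINT32_C(1) << i))
			count++;
	return count;
}
```

Userspace could use such a count to decide between reducing the bandwidth and vacating the channel, the two reactions the commit message names.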
Janusz Dziedzic	d69cb039ab	wifi: cfg80211: set and report chandef CAC ongoing Allow tracking and checking the CAC state from user space by simply listing the phy channels, e.g. using the iw phy1 channels command. This is done for regular CAC and background CAC. It is important for background CAC, since it can be started from any app (e.g. iw or hostapd). Signed-off-by: Janusz Dziedzic <janusz.dziedzic@gmail.com> Link: https://patch.msgid.link/20260206171830.553879-3-janusz.dziedzic@gmail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2026-03-02 09:10:28 +01:00
Jesper Dangaard Brouer	67713dff63	net: sched: sch_dualpi2: use qdisc_dequeue_drop() for dequeue drops DualPI2 drops packets during dequeue but was using kfree_skb_reason() directly, bypassing trace_qdisc_drop. Convert to qdisc_dequeue_drop() and add QDISC_DROP_L4S_STEP_NON_ECN to the qdisc drop reason enum. - Set TCQ_F_DEQUEUE_DROPS flag in dualpi2_init() - Use enum qdisc_drop_reason in drop_and_retry() - Replace kfree_skb_reason() with qdisc_dequeue_drop() Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/177211351978.3011628.11267023360997620069.stgit@firesoul Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 15:31:35 -08:00
Jesper Dangaard Brouer	9d3e7f9718	net: sched: rename QDISC_DROP_CAKE_FLOOD to QDISC_DROP_FLOOD_PROTECTION Rename QDISC_DROP_CAKE_FLOOD to QDISC_DROP_FLOOD_PROTECTION to use a generic name without embedding the qdisc name. This follows the principle that drop reasons should describe the drop mechanism rather than being tied to a specific qdisc implementation. The flood protection drop reason is used by qdiscs implementing probabilistic drop algorithms (like BLUE) that detect unresponsive flows indicating potential DoS or flood attacks. CAKE uses this via its Cobalt AQM component. Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/177211347537.3011628.13759059534638729639.stgit@firesoul Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 15:31:35 -08:00
Jesper Dangaard Brouer	f30d9073ec	net: sched: rename QDISC_DROP_FQ_* to generic names Rename FQ-specific drop reasons to generic names: - QDISC_DROP_FQ_BAND_LIMIT -> QDISC_DROP_BAND_LIMIT - QDISC_DROP_FQ_HORIZON_LIMIT -> QDISC_DROP_HORIZON_LIMIT This follows the principle that drop reasons should describe the drop mechanism rather than being tied to a specific qdisc implementation. These concepts (priority band limits, timestamp horizon) could apply to other qdiscs as well. Remove the local macro define FQDR() and instead use the full QDISC_DROP_* name to make it easier to navigate code. Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/177211346902.3011628.12523261489552097455.stgit@firesoul Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 15:31:35 -08:00
Jesper Dangaard Brouer	3e28f8ad47	net: sched: sfq: convert to qdisc drop reasons Convert SFQ to use the new qdisc-specific drop reason infrastructure. This patch demonstrates how to convert a flow-based qdisc to use the new enum qdisc_drop_reason. As part of this conversion: - Add QDISC_DROP_MAXFLOWS for flow table exhaustion - Rename FQ_FLOW_LIMIT to generic FLOW_LIMIT, now shared by FQ and SFQ - Use QDISC_DROP_OVERLIMIT for sfq_drop() when overall limit exceeded - Use QDISC_DROP_FLOW_LIMIT for per-flow depth limit exceeded The FLOW_LIMIT reason is now a common drop reason for per-flow limits, applicable to both FQ and SFQ qdiscs. Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/177211345946.3011628.12770616071857185664.stgit@firesoul Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 15:31:34 -08:00
Jesper Dangaard Brouer	ff2998f29f	net: sched: introduce qdisc-specific drop reason tracing Create new enum qdisc_drop_reason and trace_qdisc_drop tracepoint for qdisc layer drop diagnostics with direct qdisc context visibility. The new tracepoint includes qdisc handle, parent, kind (name), and device information. Existing SKB_DROP_REASON_QDISC_DROP is retained for backwards compatibility via kfree_skb_reason(). Convert qdiscs with drop reasons to use the new infrastructure. Change CAKE's cobalt_should_drop() return type from enum skb_drop_reason to enum qdisc_drop_reason to fix implicit enum conversion warnings. Use QDISC_DROP_UNSPEC as the 'not dropped' sentinel instead of SKB_NOT_DROPPED_YET. Both have the same compiled value (0), so the comparison logic remains semantically equivalent. Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/177211345275.3011628.1974310302645218067.stgit@firesoul Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 15:31:34 -08:00
Nikhil P. Rao	60abb0ac11	xsk: Fix fragment node deletion to prevent buffer leak After commit `b692bf9a75` ("xsk: Get rid of xdp_buff_xsk::xskb_list_node"), the list_node field is reused for both the xskb pool list and the buffer free list. This causes a buffer leak, as described below. xp_free() checks if a buffer is already on the free list using list_empty(&xskb->list_node). When list_del() is used to remove a node from the xskb pool list, it doesn't reinitialize the node pointers. This means list_empty() will return false even after the node has been removed, causing xp_free() to incorrectly skip adding the buffer to the free list. Fix this by using list_del_init() instead of list_del() in all fragment handling paths. This ensures the list node is reinitialized after removal, allowing the list_empty() check to work correctly. Fixes: `b692bf9a75` ("xsk: Get rid of xdp_buff_xsk::xskb_list_node") Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Nikhil P. Rao <nikhil.rao@amd.com> Link: https://patch.msgid.link/20260225000456.107806-2-nikhil.rao@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 08:55:11 -08:00
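The difference between the two removal styles in the commit above can be reproduced with a small user-space list. This is a sketch modelled on the kernel's `list_head`, not the kernel implementation: all names below are hypothetical, and unlike the kernel's `list_del()` the plain unlink here does not poison the stale pointers.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal circular doubly linked list node. */
struct list_node {
	struct list_node *next, *prev;
};

static void list_init(struct list_node *n)
{
	n->next = n;
	n->prev = n;
}

/* A node counts as "empty" (on no list) when it points at itself. */
static bool list_is_empty(const struct list_node *n)
{
	return n->next == n;
}

static void list_add_tail_node(struct list_node *n, struct list_node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Unlink only: the removed node keeps its stale next/prev pointers. */
static void list_unlink(struct list_node *n)
{
	assert(n->next && n->prev);
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

/* Unlink and reinitialise, so list_is_empty(n) holds afterwards. */
static void list_unlink_init(struct list_node *n)
{
	list_unlink(n);
	list_init(n);
}
```

A check of the form "is this node already on a list?" therefore gives the right answer only after `list_unlink_init()`, which mirrors why the fix switches to `list_del_init()`.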
Victor Nogueira	11cb63b0d1	net/sched: Only allow act_ct to bind to clsact/ingress qdiscs and shared blocks As Paolo said earlier [1]: "Since the blamed commit below, classify can return TC_ACT_CONSUMED while the current skb being held by the defragmentation engine. As reported by GangMin Kim, if such packet is that may cause a UaF when the defrag engine later on tries to touch again such packet." act_ct was never meant to be used in the egress path; however, some users are attaching it to egress today [2]. Attempting to reach a middle ground, we noticed that, while most qdiscs are not handling TC_ACT_CONSUMED, clsact/ingress qdiscs are. With that in mind, we address the issue by only allowing act_ct to bind to clsact/ingress qdiscs and shared blocks. That way it's still possible to attach act_ct to egress (albeit only with clsact). [1] https://lore.kernel.org/netdev/674b8cbfc385c6f37fb29a1de08d8fe5c2b0fbee.1771321118.git.pabeni@redhat.com/ [2] https://lore.kernel.org/netdev/cc6bfb4a-4a2b-42d8-b9ce-7ef6644fb22b@ovn.org/ Reported-by: GangMin Kim <km.kim1503@gmail.com> Fixes: `3f14b377d0` ("net/sched: act_ct: fix skb leak and crash on ooo frags") CC: stable@vger.kernel.org Signed-off-by: Victor Nogueira <victor@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260225134349.1287037-1-victor@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 19:06:21 -08:00
Eric Dumazet	5151ec54f5	net: use try_cmpxchg() in lock_sock_nested() Add a fast path in lock_sock_nested(), to avoid acquiring the socket spinlock only to set @owned to one: spin_lock_bh(&sk->sk_lock.slock); if (unlikely(sock_owned_by_user_nocheck(sk))) __lock_sock(sk); sk->sk_lock.owned = 1; spin_unlock_bh(&sk->sk_lock.slock); On x86_64 compiler generates something quite efficient: 00000000000077c0 <lock_sock_nested>: 77c0: f3 0f 1e fa endbr64 77c4: e8 00 00 00 00 call __fentry__ 77c9: b9 01 00 00 00 mov $0x1,%ecx 77ce: 31 c0 xor %eax,%eax 77d0: f0 48 0f b1 8f 48 01 00 00 lock cmpxchg %rcx,0x148(%rdi) 77d9: 75 06 jne slow_path 77db: 2e e9 00 00 00 00 cs jmp __x86_return_thunk-0x4 slow_path: ... Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://patch.msgid.link/20260226021215.1764237-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 17:25:45 -08:00
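The fast-path idea in the commit above can be sketched with C11 atomics. This is an illustration under assumptions, not the kernel code: `struct toy_sock_lock` is a hypothetical stand-in that keeps only an owner flag, it does not reproduce the kernel's socket lock layout, and the slow path (taking the spinlock and waiting for the owner) is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the socket lock: only the owner flag. */
struct toy_sock_lock {
	atomic_int owned; /* 0 = free, 1 = owned by a user context */
};

/* Fast path: flip owned from 0 to 1 with a single compare-and-exchange.
 * Returns true when ownership was taken; on false the caller would
 * fall back to the slow path, which is omitted here. */
static bool toy_lock_fast(struct toy_sock_lock *l)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&l->owned, &expected, 1);
}

/* Release ownership; only valid while the lock is owned. */
static void toy_unlock(struct toy_sock_lock *l)
{
	assert(atomic_load(&l->owned) == 1);
	atomic_store(&l->owned, 0);
}
```

In the uncontended case a single atomic instruction does the work that previously needed a spinlock acquire, a store, and a release.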
Eric Dumazet	29252397bc	inet: annotate data-races around isk->inet_num UDP/TCP lookups are using RCU, thus isk->inet_num accesses should use READ_ONCE() and WRITE_ONCE() where needed. Fixes: `3ab5aee7fe` ("net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260225203545.1512417-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 17:16:59 -08:00
Paul Moses	62413a9c3c	net/sched: act_gate: snapshot parameters with RCU on replace The gate action can be replaced while the hrtimer callback or dump path is walking the schedule list. Convert the parameters to an RCU-protected snapshot and swap updates under tcf_lock, freeing the previous snapshot via call_rcu(). When REPLACE omits the entry list, preserve the existing schedule so the effective state is unchanged. Fixes: `a51c328df3` ("net: qos: introduce a gate control flow action") Cc: stable@vger.kernel.org Signed-off-by: Paul Moses <p@1g4.org> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Link: https://patch.msgid.link/20260223150512.2251594-2-p@1g4.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 16:10:36 -08:00
Byungchul Park	fd6dad4e1a	netmem: remove the pp fields from net_iov Now that the pp fields in net_iov have no users, remove them from net_iov and clean up. Signed-off-by: Byungchul Park <byungchul@sk.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20260224061424.11219-1-byungchul@sk.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 19:45:24 -08:00
Jakub Kicinski	0314e382cf	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-7.0-rc2). Conflicts: tools/testing/selftests/drivers/net/hw/rss_ctx.py `19c3a2a81d` ("selftests: drv-net: rss: Generate unique ports for RSS context tests") `ce5a0f4612` ("selftests: drv-net: rss_ctx: test RSS contexts persist after ifdown/up") include/net/inet_connection_sock.h `858d2a4f67` ("tcp: fix potential race in tcp_v6_syn_recv_sock()") `fcd3d039fa` ("tcp: make tcp_v{4,6}_send_check() static") https://lore.kernel.org/aZ8PSFLzBrEU3I89@sirena.org.uk drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c drivers/net/ethernet/mellanox/mlx5/core/en/xsk/pool.c `69050f8d6d` ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types") `bf4afc53b7` ("Convert 'alloc_obj' family to use the new default GFP_KERNEL argument") `8a96b9144f` ("net/mlx5e: Alloc xsk channel param out of mlx5e_open_xsk()") Adjacent changes: net/netfilter/ipvs/ip_vs_ctl.c `c59bd9e62e` ("ipvs: use more counters to avoid service lookups") `bf4afc53b7` ("Convert 'alloc_obj' family to use the new default GFP_KERNEL argument") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 10:23:00 -08:00
Linus Torvalds	b9c8fc2cae	Including fixes from IPsec, Bluetooth and netfilter Current release - regressions: - wifi: fix dev_alloc_name() return value check - rds: fix recursive lock in rds_tcp_conn_slots_available Current release - new code bugs: - vsock: lock down child_ns_mode as write-once Previous releases - regressions: - core: - do not pass flow_id to set_rps_cpu() - consume xmit errors of GSO frames - netconsole: avoid OOB reads, msg is not nul-terminated - netfilter: h323: fix OOB read in decode_choice() - tcp: re-enable acceptance of FIN packets when RWIN is 0 - udplite: fix null-ptr-deref in __udp_enqueue_schedule_skb(). - wifi: brcmfmac: fix potential kernel oops when probe fails - phy: register phy led_triggers during probe to avoid AB-BA deadlock - eth: bnxt_en: fix deleting of Ntuple filters - eth: wan: farsync: fix use-after-free bugs caused by unfinished tasklets - eth: xscale: check for PTP support properly Previous releases - always broken: - tcp: fix potential race in tcp_v6_syn_recv_sock() - kcm: fix zero-frag skb in frag_list on partial sendmsg error - xfrm: - fix race condition in espintcp_close() - always flush state and policy upon NETDEV_UNREGISTER event - bluetooth: - purge error queues in socket destructors - fix response to L2CAP_ECRED_CONN_REQ - eth: mlx5: - fix circular locking dependency in dump - fix "scheduling while atomic" in IPsec MAC address query - eth: gve: fix incorrect buffer cleanup for QPL - eth: team: avoid NETDEV_CHANGEMTU event when unregistering slave - eth: usb: validate USB endpoints Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Merge tag 'net-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from IPsec, Bluetooth and netfilter Current release - regressions: - wifi: fix dev_alloc_name() return value check - rds: fix recursive lock in rds_tcp_conn_slots_available Current release - new code bugs: - vsock: lock down child_ns_mode as write-once Previous releases - regressions: - core: - do not pass flow_id to set_rps_cpu() - consume xmit errors of GSO frames - netconsole: avoid OOB reads, msg is not nul-terminated - netfilter: h323: fix OOB read in decode_choice() - tcp: re-enable acceptance of FIN packets when RWIN is 0 - udplite: fix null-ptr-deref in __udp_enqueue_schedule_skb(). 
- wifi: brcmfmac: fix potential kernel oops when probe fails - phy: register phy led_triggers during probe to avoid AB-BA deadlock - eth: - bnxt_en: fix deleting of Ntuple filters - wan: farsync: fix use-after-free bugs caused by unfinished tasklets - xscale: check for PTP support properly Previous releases - always broken: - tcp: fix potential race in tcp_v6_syn_recv_sock() - kcm: fix zero-frag skb in frag_list on partial sendmsg error - xfrm: - fix race condition in espintcp_close() - always flush state and policy upon NETDEV_UNREGISTER event - bluetooth: - purge error queues in socket destructors - fix response to L2CAP_ECRED_CONN_REQ - eth: - mlx5: - fix circular locking dependency in dump - fix "scheduling while atomic" in IPsec MAC address query - gve: fix incorrect buffer cleanup for QPL - team: avoid NETDEV_CHANGEMTU event when unregistering slave - usb: validate USB endpoints" * tag 'net-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (72 commits) netfilter: nf_conntrack_h323: fix OOB read in decode_choice() dpaa2-switch: validate num_ifs to prevent out-of-bounds write net: consume xmit errors of GSO frames vsock: document write-once behavior of the child_ns_mode sysctl vsock: lock down child_ns_mode as write-once selftests/vsock: change tests to respect write-once child ns mode net/mlx5e: Fix "scheduling while atomic" in IPsec MAC address query net/mlx5: Fix missing devlink lock in SRIOV enable error path net/mlx5: E-switch, Clear legacy flag when moving to switchdev net/mlx5: LAG, disable MPESW in lag_disable_change() net/mlx5: DR, Fix circular locking dependency in dump selftests: team: Add a reference count leak test team: avoid NETDEV_CHANGEMTU event when unregistering slave net: mana: Fix double destroy_workqueue on service rescan PCI path MAINTAINERS: Update maintainer entry for QUALCOMM ETHQOS ETHERNET DRIVER dpll: zl3073x: Remove redundant cleanup in devm_dpll_init() selftests/net: packetdrill: Verify acceptance of FIN 
packets when RWIN is 0 tcp: re-enable acceptance of FIN packets when RWIN is 0 vsock: Use container_of() to get net namespace in sysctl handlers net: usb: kaweth: validate USB endpoints ...	2026-02-26 08:00:13 -08:00
Bobby Eshleman	102eab95f0	vsock: lock down child_ns_mode as write-once Two administrator processes may race when setting child_ns_mode as one process sets child_ns_mode to "local" and then creates a namespace, but another process changes child_ns_mode to "global" between the write and the namespace creation. The first process ends up with a namespace in "global" mode instead of "local". While this can be detected after the fact by reading ns_mode and retrying, it is fragile and error-prone. Make child_ns_mode write-once so that a namespace manager can set it once and be sure it won't change. Writing a different value after the first write returns -EBUSY. This applies to all namespaces, including init_net, where an init process can write "local" to lock all future namespaces into local mode. Fixes: `eafb64f40c` ("vsock: add netns to vsock core") Suggested-by: Daan De Meyer <daan.j.demeyer@gmail.com> Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Co-developed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20260223-vsock-ns-write-once-v3-2-c0cde6959923@meta.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 11:10:03 +01:00
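The write-once semantics described in the commit above can be sketched with a compare-and-exchange. This is a user-space sketch under assumptions: the `NS_MODE_*` constants and `ns_mode_set_once()` are hypothetical, the "unset" sentinel is an invention of this sketch, and accepting a repeated write of the same value follows from the message only stating that a different value returns -EBUSY.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical mode values; NS_MODE_UNSET means "never written". */
enum { NS_MODE_UNSET = 0, NS_MODE_GLOBAL = 1, NS_MODE_LOCAL = 2 };

/* Write-once setter: the first write wins, repeating the same value
 * succeeds, and a different value afterwards returns -EBUSY. */
static int ns_mode_set_once(atomic_int *mode, int new_mode)
{
	int old = NS_MODE_UNSET;

	assert(new_mode != NS_MODE_UNSET);
	if (atomic_compare_exchange_strong(mode, &old, new_mode))
		return 0;
	/* On failure, old holds the value that is already locked in. */
	return old == new_mode ? 0 : -EBUSY;
}
```

Because the compare-and-exchange is atomic, two racing writers cannot both succeed with different values, which removes the window between writing the mode and creating the namespace.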
Florian Westphal	6b94d081f8	netfilter: nf_tables: remove register tracking infrastructure This facility was disabled in commit `9e539c5b6d` ("netfilter: nf_tables: disable expression reduction infra"), because not all nft_exprs guarantee they will update the destination register: some may set NFT_BREAK instead to cancel evaluation of the rule. This has been dead code ever since. There are no plans to salvage this at this time, so remove this. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20260224205048.4718-10-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-25 19:36:26 -08:00
Julian Anastasov	09b71fb459	ipvs: no_cport and dropentry counters can be per-net Change the no_cport counters to be per-net and address family. This should reduce the extra conn lookups done during present NO_CPORT connections. By changing from global to per-net dropentry counters, one net will not affect the drop rate of another net. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20260224205048.4718-7-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-25 19:36:26 -08:00
Julian Anastasov	c59bd9e62e	ipvs: use more counters to avoid service lookups When new connection is created we can lookup for services multiple times to support fallback options. We already have some counters to skip specific lookups because it costs CPU cycles for hash calculation, etc. Add more counters for fwmark/non-fwmark services (fwm_services and nonfwm_services) and make all counters per address family. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20260224205048.4718-6-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-25 19:36:26 -08:00
Julian Anastasov	b24ae1a387	ipvs: use single svc table fwmark based services and non-fwmark based services can be hashed in same service table. This reduces the burden of working with two tables. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20260224205048.4718-4-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-25 19:36:25 -08:00
Jiejian Wu	74455a5b43	ipvs: make ip_vs_svc_table and ip_vs_svc_fwm_table per netns Current ipvs uses one global mutex "__ip_vs_mutex" to keep the global "ip_vs_svc_table" and "ip_vs_svc_fwm_table" safe. But when there are tens of thousands of services from different netns in the table, it takes a long time to look up the table, for example, using "ipvsadm -ln" from different netns simultaneously. We make "ip_vs_svc_table" and "ip_vs_svc_fwm_table" per netns, and we add "service_mutex" per netns to keep these two tables safe instead of the global "__ip_vs_mutex" in current version. To this end, looking up services from different netns simultaneously will not get stuck, shortening the time consumption in large-scale deployment. It can be reproduced using the simple scripts below. init.sh: #!/bin/bash for((i=1;i<=4;i++));do ip netns add ns$i ip netns exec ns$i ip link set dev lo up ip netns exec ns$i sh add-services.sh done add-services.sh: #!/bin/bash for((i=0;i<30000;i++)); do ipvsadm -A -t 10.10.10.10:$((80+$i)) -s rr done runtest.sh: #!/bin/bash for((i=1;i<4;i++));do ip netns exec ns$i ipvsadm -ln > /dev/null & done ip netns exec ns4 ipvsadm -ln > /dev/null Run "sh init.sh" to initiate the network environment. Then run "time ./runtest.sh" to evaluate the time consumption. Our testbed is a 4-core Intel Xeon ECS. The result of the original version is around 8 seconds, while the result of the modified version is only 0.8 seconds. Signed-off-by: Jiejian Wu <jiejian@linux.alibaba.com> Co-developed-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20260224205048.4718-2-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-25 19:36:25 -08:00
Kuniyuki Iwashima	fc1f97929a	bonding: Optimise is_netpoll_tx_blocked(). bond_start_xmit() spends some cycles in is_netpoll_tx_blocked(): if (unlikely(is_netpoll_tx_blocked(dev))) return NETDEV_TX_BUSY; because of the "pushf;pop reg" sequence (aka irqs_disabled()). Let's swap the conditions in is_netpoll_tx_blocked() and convert netpoll_block_tx to a static key. Before: 1.23 │ mov %gs:0x28,%rax 1.24 │ mov %rax,0x18(%rsp) 29.45 │ pushfq 0.50 │ pop %rax 0.47 │ test $0x200,%eax │ ↓ je 1b4 0.49 │ 32: lea 0x980(%rsi),%rbx After: 0.72 │ mov %gs:0x28,%rax 0.81 │ mov %rax,0x18(%rsp) 0.82 │ nop 2.77 │ 2a: lea 0x980(%rsi),%rbx Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260223230749.2376145-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-24 18:13:38 -08:00
Eric Dumazet	539a6cf084	tcp: move inet6_csk_update_pmtu() to tcp_ipv6.c This function is only called from tcp_v6_mtu_reduced() and can be (auto)inlined by the compiler. Note that inet6_csk_route_socket() is no longer (auto)inlined, which is a good thing as it is slow path. $ scripts/bloat-o-meter -t vmlinux.0 vmlinux.1 add/remove: 0/2 grow/shrink: 2/0 up/down: 93/-129 (-36) Function old new delta tcp_v6_mtu_reduced 139 228 +89 inet6_csk_route_socket 486 490 +4 __pfx_inet6_csk_update_pmtu 16 - -16 inet6_csk_update_pmtu 113 - -113 Total: Before=25076512, After=25076476, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260223153047.886683-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-24 17:47:27 -08:00
Eric Dumazet	fcd3d039fa	tcp: make tcp_v{4,6}_send_check() static tcp_v{4,6}_send_check() are only called from tcp_output.c and should be made static so that the compiler does not need to put an out of line copy of them. Remove (struct inet_connection_sock_af_ops) send_check field and use instead @net_header_len. Move @net_header_len close to @queue_xmit for data locality as both are used in TCP tx fast path. $ scripts/bloat-o-meter -t vmlinux.2 vmlinux.3 add/remove: 0/2 grow/shrink: 0/3 up/down: 0/-172 (-172) Function old new delta __tcp_transmit_skb 3426 3423 -3 tcp_v4_send_check 136 132 -4 mptcp_subflow_init 777 763 -14 __pfx_tcp_v6_send_check 16 - -16 tcp_v6_send_check 135 - -135 Total: Before=25143196, After=25143024, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260223100729.3761597-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-24 17:16:09 -08:00
Eric Dumazet	255688652b	tcp: move tcp_v6_send_check() to tcp_output.c Move tcp_v6_send_check() so that __tcp_transmit_skb() can inline it. $ scripts/bloat-o-meter -t vmlinux.1 vmlinux.2 add/remove: 0/0 grow/shrink: 1/0 up/down: 105/0 (105) Function old new delta __tcp_transmit_skb 3321 3426 +105 Total: Before=25143091, After=25143196, chg +0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260223100729.3761597-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-24 17:16:09 -08:00
Eric Dumazet	bd5e5e1d41	tcp: inline __tcp_v4_send_check() Inline __tcp_v4_send_check(), like __tcp_v6_send_check(). Move tcp_v4_send_check() to tcp_output.c close to its fast path caller (__tcp_transmit_skb()). Note __tcp_v4_send_check() is still out-of-line for tcp4_gso_segment() because it is called in an unlikely() section. $ scripts/bloat-o-meter -t vmlinux.0 vmlinux.1 add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-9 (-9) Function old new delta __tcp_v4_send_check 130 121 -9 Total: Before=25143100, After=25143091, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260223100729.3761597-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-24 17:16:09 -08:00
Eric Dumazet	f033335937	udp: move udp6_csum_init() back to net/ipv6/udp.c This function has a single caller in net/ipv6/udp.c. Move it there so that the compiler can decide to (auto)inline it if it prefers to. IBT glue is removed anyway. With clang, we can see it was able to inline it and also inlined one other helper at the same time. UDPLITE removal will also help. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/2 grow/shrink: 1/0 up/down: 840/-785 (55) Function old new delta __udp6_lib_rcv 1247 2087 +840 __pfx_udp6_csum_init 16 - -16 udp6_csum_init 769 - -769 Total: Before=25074399, After=25074454, chg +0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260223093445.3696368-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-24 16:30:40 -08:00
Eric Dumazet	2550def53b	net: __lock_sock() can be static After commit `6511882cdd` ("mptcp: allocate fwd memory separately on the rx and tx path") __lock_sock() can be static again. Make sure __lock_sock() is not inlined, so that lock_sock_nested() no longer needs a stack canary. Add a noinline attribute on lock_sock_nested() so that calls to lock_sock() from net/core/sock.c are not inlined, none of them are fast path to deserve that: - sockopt_lock_sock() - sock_set_reuseport() - sock_set_reuseaddr() - sock_set_mark() - sock_set_keepalive() - sock_no_linger() - sock_bindtoindex() - sk_wait_data() - sock_set_rcvbuf() $ scripts/bloat-o-meter -t vmlinux.old vmlinux add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-312 (-312) Function old new delta __lock_sock 192 188 -4 __lock_sock_fast 239 86 -153 lock_sock_nested 227 72 -155 Total: Before=24888707, After=24888395, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260223092716.3673939-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-24 16:30:33 -08:00
Paolo Abeni	1348659dc9	Merge tag 'for-net-2026-02-23' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Luiz Augusto von Dentz says: ==================== bluetooth pull request for net: - purge error queues in socket destructors - hci_sync: Fix CIS host feature condition - L2CAP: Fix invalid response to L2CAP_ECRED_RECONF_REQ - L2CAP: Fix result of L2CAP_ECRED_CONN_RSP when MTU is too short - L2CAP: Fix response to L2CAP_ECRED_CONN_REQ - L2CAP: Fix not checking output MTU is acceptable on L2CAP_ECRED_CONN_REQ - L2CAP: Fix missing key size check for L2CAP_LE_CONN_REQ - hci_qca: Cleanup on all setup failures * tag 'for-net-2026-02-23' of 
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth: Bluetooth: L2CAP: Fix missing key size check for L2CAP_LE_CONN_REQ Bluetooth: L2CAP: Fix not checking output MTU is acceptable on L2CAP_ECRED_CONN_REQ Bluetooth: Fix CIS host feature condition Bluetooth: L2CAP: Fix response to L2CAP_ECRED_CONN_REQ Bluetooth: hci_qca: Cleanup on all setup failures Bluetooth: purge error queues in socket destructors Bluetooth: L2CAP: Fix result of L2CAP_ECRED_CONN_RSP when MTU is too short Bluetooth: L2CAP: Fix invalid response to L2CAP_ECRED_RECONF_REQ ==================== Link: https://patch.msgid.link/20260223211634.3800315-1-luiz.dentz@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-24 15:03:08 +01:00
Sebastian Andrzej Siewior	983512f3a8	net: Drop the lock in skb_may_tx_timestamp() skb_may_tx_timestamp() may acquire sock::sk_callback_lock. The lock must not be taken in IRQ context, only softirq is okay. A few drivers receive the timestamp via a dedicated interrupt and complete the TX timestamp from that handler. This will lead to a deadlock if the lock is already write-locked on the same CPU. Taking the lock can be avoided. The socket (pointed to by the skb) will remain valid until the skb is released. The ->sk_socket and ->file members will be set to NULL once the user closes the socket, which may happen before the timestamp arrives. If we happen to observe a pointer while the socket is closing but before the pointer is set to NULL, then we may use it because both pointers (and the file's cred member) are RCU freed. Drop the lock. Use READ_ONCE() to obtain the individual pointers. Add a matching WRITE_ONCE() where the pointers are cleared. Link: https://lore.kernel.org/all/20260205145104.iWinkXHv@linutronix.de Fixes: `b245be1f4d` ("net-timestamp: no-payload only sysctl") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260220183858.N4ERjFW6@linutronix.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-24 11:27:29 +01:00
Luiz Augusto von Dentz	c28d2bff70	Bluetooth: L2CAP: Fix result of L2CAP_ECRED_CONN_RSP when MTU is too short Test L2CAP/ECFC/BV-26-C expects the response to L2CAP_ECRED_CONN_REQ with an MTU value < L2CAP_ECRED_MIN_MTU (64) to be L2CAP_CR_LE_INVALID_PARAMS rather than L2CAP_CR_LE_UNACCEPT_PARAMS. Also fix not including the correct number of CIDs in the response, since the spec requires all CIDs being rejected to be included in the response. Link: https://github.com/bluez/bluez/issues/1868 Fixes: `15f02b9105` ("Bluetooth: L2CAP: Add initial code for Enhanced Credit Based Mode") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2026-02-23 15:28:56 -05:00
Luiz Augusto von Dentz	7accb1c432	Bluetooth: L2CAP: Fix invalid response to L2CAP_ECRED_RECONF_REQ This fixes responding with an invalid result caused by checking the wrong size of CID, which should have been (cmd_len - sizeof(*req)), and on top of it the wrong result was used, L2CAP_CR_LE_INVALID_PARAMS, which is invalid/reserved for reconf when running tests like L2CAP/ECFC/BI-03-C: > ACL Data RX: Handle 64 flags 0x02 dlen 14 LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 2 len 6 MTU: 64 MPS: 64 Source CID: 64 < ACL Data TX: Handle 64 flags 0x00 dlen 10 LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2 ! Result: Reserved (0x000c) Result: Reconfiguration failed - one or more Destination CIDs invalid (0x0003) Fix L2CAP/ECFC/BI-04-C which expects L2CAP_RECONF_INVALID_MPS (0x0002) when more than one channel gets its MPS reduced: > ACL Data RX: Handle 64 flags 0x02 dlen 16 LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 2 len 8 MTU: 264 MPS: 99 Source CID: 64 ! Source CID: 65 < ACL Data TX: Handle 64 flags 0x00 dlen 10 LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2 ! Result: Reconfiguration successful (0x0000) Result: Reconfiguration failed - reduction in size of MPS not allowed for more than one channel at a time (0x0002) Fix L2CAP/ECFC/BI-05-C when SCID is invalid (85 unconnected): > ACL Data RX: Handle 64 flags 0x02 dlen 14 LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 2 len 6 MTU: 65 MPS: 64 ! Source CID: 85 < ACL Data TX: Handle 64 flags 0x00 dlen 10 LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2 ! Result: Reconfiguration successful (0x0000) Result: Reconfiguration failed - one or more Destination CIDs invalid (0x0003) Fix L2CAP/ECFC/BI-06-C when MPS < L2CAP_ECRED_MIN_MPS (64): > ACL Data RX: Handle 64 flags 0x02 dlen 14 LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 2 len 6 MTU: 672 ! 
MPS: 63 Source CID: 64 < ACL Data TX: Handle 64 flags 0x00 dlen 10 LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2 ! Result: Reconfiguration failed - reduction in size of MPS not allowed for more than one channel at a time (0x0002) Result: Reconfiguration failed - other unacceptable parameters (0x0004) Fix L2CAP/ECFC/BI-07-C when MPS reduced for more than one channel: > ACL Data RX: Handle 64 flags 0x02 dlen 16 LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 3 len 8 MTU: 84 ! MPS: 71 Source CID: 64 ! Source CID: 65 < ACL Data TX: Handle 64 flags 0x00 dlen 10 LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2 ! Result: Reconfiguration successful (0x0000) Result: Reconfiguration failed - reduction in size of MPS not allowed for more than one channel at a time (0x0002) Link: https://github.com/bluez/bluez/issues/1865 Fixes: `15f02b9105` ("Bluetooth: L2CAP: Add initial code for Enhanced Credit Based Mode") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2026-02-23 15:23:37 -05:00
Linus Torvalds	32a92f8c89	Convert more 'alloc_obj' cases to default GFP_KERNEL arguments This converts some of the visually simpler cases that have been split over multiple lines. I only did the ones that are easy to verify the resulting diff by having just that final GFP_KERNEL argument on the next line. Somebody should probably do a proper coccinelle script for this, but for me the trivial script actually resulted in an assertion failure in the middle of the script. I probably had made it a bit _too_ trivial. So after fighting that for a while I decided to just do some of the syntactically simpler cases with variations of the previous 'sed' scripts. The more syntactically complex multi-line cases would mostly really want whitespace cleanup anyway. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 20:03:00 -08:00
Linus Torvalds	bf4afc53b7	Convert 'alloc_obj' family to use the new default GFP_KERNEL argument This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' | xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 17:09:51 -08:00
Kees Cook	69050f8d6d	treewide: Replace kmalloc with kmalloc_obj for non-scalar types This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(PTR, FAM, COUNT, ...) (where TYPE may also be VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>	2026-02-21 01:02:28 -08:00
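
The three replacement patterns listed in the commit above can be mimicked in userspace to show why a typed return value helps. The following is a rough analogue, not the kernel macros: `malloc_obj`, `malloc_objs`, `malloc_flex` and `flex_alloc` are hypothetical names built on `malloc`/`calloc`, they take no GFP argument, and `malloc_flex` here takes a type where the kernel's `kmalloc_flex` takes a pointer. Note also that `calloc` zeroes memory, which `kmalloc_array` does not.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Single object: the result is already a TYPE *, not a void *. */
#define malloc_obj(TYPE)         ((TYPE *)malloc(sizeof(TYPE)))

/* Array of objects: calloc() rejects COUNT * sizeof(TYPE) overflow. */
#define malloc_objs(TYPE, COUNT) ((TYPE *)calloc((COUNT), sizeof(TYPE)))

/* Helper for flexible-array structs with an explicit overflow check. */
static inline void *flex_alloc(size_t base, size_t elem, size_t count)
{
	if (elem != 0 && count > (SIZE_MAX - base) / elem)
		return NULL; /* base + elem * count would overflow */
	return malloc(base + elem * count);
}

/* Struct with a trailing flexible array member FAM holding COUNT elements. */
#define malloc_flex(TYPE, FAM, COUNT) \
	((TYPE *)flex_alloc(offsetof(TYPE, FAM), \
			    sizeof(((TYPE *)0)->FAM[0]), (COUNT)))

/* Example types used only for illustration. */
struct point { int x, y; };
struct packet { size_t len; unsigned char data[]; };
```

Because each macro yields a `TYPE *`, assigning the result to a pointer of a different struct type produces a compiler diagnostic, whereas a plain `void *` would convert silently.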
Eric Dumazet	858d2a4f67	tcp: fix potential race in tcp_v6_syn_recv_sock() Code in tcp_v6_syn_recv_sock() after the call to tcp_v4_syn_recv_sock() is done too late. After tcp_v4_syn_recv_sock(), the child socket is already visible from TCP ehash table and other cpus might use it. Since newinet->pinet6 is still pointing to the listener ipv6_pinfo bad things can happen as syzbot found. Move the problematic code in tcp_v6_mapped_child_init() and call this new helper from tcp_v4_syn_recv_sock() before the ehash insertion. This allows the removal of one tcp_sync_mss(), since tcp_v4_syn_recv_sock() will call it with the correct context. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Reported-by: syzbot+937b5bbb6a815b3e5d0b@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/69949275.050a0220.2eeac1.0145.GAE@google.com/ Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260217161205.2079883-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-19 14:02:19 -08:00
Linus Torvalds	8bf22c33e7	Merge tag 'net-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from Netfilter. 
Current release - new code bugs: - net: fix backlog_unlock_irq_restore() vs CONFIG_PREEMPT_RT - eth: mlx5e: XSK, Fix unintended ICOSQ change - phy_port: correctly recompute the port's linkmodes - vsock: prevent child netns mode switch from local to global - couple of kconfig fixes for new symbols Previous releases - regressions: - nfc: nci: fix false-positive parameter validation for packet data - net: do not delay zero-copy skbs in skb_attempt_defer_free() Previous releases - always broken: - mctp: ensure our nlmsg responses to user space are zero-initialised - ipv6: ioam: fix heap buffer overflow in __ioam6_fill_trace_data() - fixes for ICMP rate limiting Misc: - intel: fix PCI device ID conflict between i40e and ipw2200" * tag 'net-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (85 commits) net: nfc: nci: Fix parameter validation for packet data net/mlx5e: Use unsigned for mlx5e_get_max_num_channels net/mlx5e: Fix deadlocks between devlink and netdev instance locks net/mlx5e: MACsec, add ASO poll loop in macsec_aso_set_arm_event net/mlx5: Fix misidentification of write combining CQE during poll loop net/mlx5e: Fix misidentification of ASO CQE during poll loop net/mlx5: Fix multiport device check over light SFs bonding: alb: fix UAF in rlb_arp_recv during bond up/down bnge: fix reserving resources from FW eth: fbnic: Advertise supported XDP features. rds: tcp: fix uninit-value in __inet_bind net/rds: Fix NULL pointer dereference in rds_tcp_accept_one octeontx2-af: Fix default entries mcam entry action net/mlx5e: XSK, Fix unintended ICOSQ change ipv6: icmp: icmpv6_xrlim_allow() optimization if net.ipv6.icmp.ratelimit is zero ipv4: icmp: icmpv4_xrlim_allow() optimization if net.ipv4.icmp_ratelimit is zero ipv6: icmp: remove obsolete code in icmpv6_xrlim_allow() inet: move icmp_global_{credit,stamp} to a separate cache line icmp: prevent possible overflow in icmp_global_allow() selftests/net: packetdrill: add ipv4-mapped-ipv6 tests ...	
2026-02-19 10:39:08 -08:00
Eric Dumazet	87b08913a9	inet: move icmp_global_{credit,stamp} to a separate cache line icmp_global_credit was meant to be changed ~1000 times per second, but if an admin sets net.ipv4.icmp_msgs_per_sec to a very high value, icmp_global_credit changes can inflict false sharing to surrounding fields that are read mostly. Move icmp_global_credit and icmp_global_stamp to a separate cacheline aligned group. Fixes: `b056b4cd91` ("icmp: move icmp_global.credit and icmp_global.stamp to per netns storage") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260216142832.3834174-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-18 16:46:36 -08:00
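
The false-sharing fix in the commit above amounts to giving the frequently written fields their own cache line. A minimal sketch of that layout idea in portable C11 follows; it assumes a 64-byte cache line, and the struct and its field names (`netns_model`, `msgs_per_sec`, `msgs_burst`, `other_read_mostly`) are hypothetical stand-ins rather than the kernel's per-netns structure or its cacheline-group macros.

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHELINE_SIZE 64 /* assumption: 64-byte cache lines */

struct netns_model {
	/* Read-mostly configuration, shared by all CPUs. */
	int msgs_per_sec;
	int msgs_burst;

	/* Frequently written state starts on its own cache line. */
	alignas(CACHELINE_SIZE) int global_credit;
	unsigned int global_stamp;

	/* The next read-mostly field is pushed onto a following line. */
	alignas(CACHELINE_SIZE) int other_read_mostly;
};

/* Compile-time check that the hot group begins on a line boundary. */
_Static_assert(offsetof(struct netns_model, global_credit) % CACHELINE_SIZE == 0,
	       "hot fields must start a cache line");
```

With this layout, writes to `global_credit` and `global_stamp` invalidate only their own line, so readers of the surrounding read-mostly fields are not disturbed.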
Linus Torvalds	23b0f90ba8	Merge tag 'sysctl-7.00-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl Pull sysctl updates from Joel Granados: - Remove macros from proc handler converters Replace the proc converter macros with "regular" functions. Though it is more verbose than the macro version, it helps when debugging and better aligns with coding-style.rst. - General cleanup Remove superfluous ctl_table forward declarations. Const qualify the memory_allocation_profiling_sysctl and loadpin_sysctl_table arrays. Add missing kernel doc to proc_dointvec_conv. - Testing This series was run through sysctl selftests/kunit test suite in x86_64. 
And went into linux-next after rc4, giving it a good 3 weeks of testing * tag 'sysctl-7.00-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl: sysctl: replace SYSCTL_INT_CONV_CUSTOM macro with functions sysctl: Replace unidirectional INT converter macros with functions sysctl: Add kernel doc to proc_douintvec_conv sysctl: Replace UINT converter macros with functions sysctl: Add CONFIG_PROC_SYSCTL guards for converter macros sysctl: clarify proc_douintvec_minmax doc sysctl: Return -ENOSYS from proc_douintvec_conv when CONFIG_PROC_SYSCTL=n sysctl: Remove unused ctl_table forward declarations loadpin: Implement custom proc_handler for enforce alloc_tag: move memory_allocation_profiling_sysctls into .rodata sysctl: Add missing kernel-doc for proc_dointvec_conv	2026-02-18 10:45:36 -08:00
Fernando Fernandez Mancera	9e371b0ba7	ipv6: addrconf: reduce default temp_valid_lft to 2 days This is a recommendation from RFC 8981 and it was intended to be changed by commit `969c54646a` ("ipv6: Implement draft-ietf-6man-rfc4941bis") but it only changed the sysctl documentation. Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260214172543.5783-1-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-17 17:12:06 -08:00
Eric Dumazet	452a3eee22	ipv6: fix a race in ip6_sock_set_v6only() It is unlikely that this function will be ever called with isk->inet_num being not zero. Perform the check on isk->inet_num inside the locked section for complete safety. Fixes: `9b115749ac` ("ipv6: add ip6_sock_set_v6only") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de> Link: https://patch.msgid.link/20260216102202.3343588-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-17 16:45:29 -08:00
Qanux	6db8b56eed	ipv6: ioam: fix heap buffer overflow in __ioam6_fill_trace_data() On the receive path, __ioam6_fill_trace_data() uses trace->nodelen to decide how much data to write for each node. It trusts this field as-is from the incoming packet, with no consistency check against trace->type (the 24-bit field that tells which data items are present). A crafted packet can set nodelen=0 while setting type bits 0-21, causing the function to write ~100 bytes past the allocated region (into skb_shared_info), which corrupts adjacent heap memory and leads to a kernel panic. Add a shared helper ioam6_trace_compute_nodelen() in ioam6.c to derive the expected nodelen from the type field, and use it: - in ioam6_iptunnel.c (send path, existing validation) to replace the open-coded computation; - in exthdrs.c (receive path, ipv6_hop_ioam) to drop packets whose nodelen is inconsistent with the type field, before any data is written. Per RFC 9197, bits 12-21 are each short (4-octet) fields, so they are included in IOAM6_MASK_SHORT_FIELDS (changed from 0xff100000 to 0xff1ffc00). Fixes: `9ee11f0fff` ("ipv6: ioam: Data plane support for Pre-allocated Trace") Cc: stable@vger.kernel.org Signed-off-by: Junxi Qian <qjx1298677004@gmail.com> Reviewed-by: Justin Iurman <justin.iurman@gmail.com> Link: https://patch.msgid.link/20260211040412.86195-1-qjx1298677004@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-13 12:24:05 -08:00
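
The consistency check described in the IOAM commit above can be sketched as follows. This is an illustrative model, not the kernel's `ioam6_trace_compute_nodelen()`: the function names and `popcount32` are hypothetical, the short-field mask `0xff1ffc00` comes from the commit message, and the wide-field mask `0x00e00000` (bits 8-10 as 8-octet fields) is an assumption. The 24-bit type is assumed to sit in the upper bits of a 32-bit word with bit 0 as the most significant bit, and `nodelen` is counted in 4-octet units.

```c
#include <stdbool.h>
#include <stdint.h>

#define SHORT_FIELDS_MASK 0xff1ffc00u /* from the commit message: 4-octet fields */
#define WIDE_FIELDS_MASK  0x00e00000u /* assumption: bits 8-10 are 8-octet fields */

/* Count the set bits in v. */
static unsigned int popcount32(uint32_t v)
{
	unsigned int n = 0;

	while (v) {
		v &= v - 1; /* clear the lowest set bit */
		n++;
	}
	return n;
}

/* Expected per-node length, in 4-octet units, implied by the type bits. */
static uint8_t trace_compute_nodelen(uint32_t trace_type)
{
	unsigned int units = popcount32(trace_type & SHORT_FIELDS_MASK) +
			     2 * popcount32(trace_type & WIDE_FIELDS_MASK);

	return (uint8_t)units;
}

/* A packet is acceptable only if its nodelen matches its type field. */
static bool trace_nodelen_is_consistent(uint32_t trace_type, uint8_t nodelen)
{
	return trace_compute_nodelen(trace_type) == nodelen;
}
```

Under these assumptions, a type with bits 0-21 set (`0xfffffc00`) implies 25 units, or 100 octets per node, which lines up with the roughly 100-byte overwrite the commit message describes for a crafted packet claiming `nodelen=0`.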
Linus Torvalds	311aa68319	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma Pull rdma updates from Jason Gunthorpe: "Usual smallish cycle. The NFS biovec work to push it down into RDMA instead of indirecting through a scatterlist is pretty nice to see, been talked about for a long time now. 
- Various code improvements in irdma, rtrs, qedr, ocrdma, irdma, rxe - Small driver improvements and minor bug fixes to hns, mlx5, rxe, mana, mlx5, irdma - Robustness improvements in completion processing for EFA - New query_port_speed() verb to move past limited IBA defined speed steps - Support for SG_GAPS in rts and many other small improvements - Rare list corruption fix in iwcm - Better support different page sizes in rxe - Device memory support for mana - Direct bio vec to kernel MR for use by NFS-RDMA - QP rate limiting for bnxt_re - Remote triggerable NULL pointer crash in siw - DMA-buf exporter support for RDMA mmaps like doorbells" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (66 commits) RDMA/mlx5: Implement DMABUF export ops RDMA/uverbs: Add DMABUF object type and operations RDMA/uverbs: Support external FD uobjects RDMA/siw: Fix potential NULL pointer dereference in header processing RDMA/umad: Reject negative data_len in ib_umad_write IB/core: Extend rate limit support for RC QPs RDMA/mlx5: Support rate limit only for Raw Packet QP RDMA/bnxt_re: Report QP rate limit in debugfs RDMA/bnxt_re: Report packet pacing capabilities when querying device RDMA/bnxt_re: Add support for QP rate limiting MAINTAINERS: Drop RDMA files from Hyper-V section RDMA/uverbs: Add __GFP_NOWARN to ib_uverbs_unmarshall_recv() kmalloc svcrdma: use bvec-based RDMA read/write API RDMA/core: add rdma_rw_max_sge() helper for SQ sizing RDMA/core: add MR support for bvec-based RDMA operations RDMA/core: use IOVA-based DMA mapping for bvec RDMA operations RDMA/core: add bio_vec based RDMA read/write API RDMA/irdma: Use kvzalloc for paged memory DMA address array RDMA/rxe: Fix race condition in QP timer handlers RDMA/mana_ib: Add device‑memory support ...	2026-02-12 17:05:20 -08:00
Daniel Golle	85ee987429	net: dsa: add tag format for MxL862xx switches Add proprietary special tag format for the MaxLinear MXL862xx family of switches. While using the same Ethertype as MaxLinear's GSW1xx switches, the actual tag format differs significantly, hence we need a dedicated tag driver for that. Signed-off-by: Daniel Golle <daniel@makrotopia.org> Link: https://patch.msgid.link/c64e6ddb6c93a4fac39f9ab9b2d8bf551a2b118d.1770433307.git.daniel@makrotopia.org Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-11 11:27:57 +01:00
Eric Dumazet	a6eee39cc2	tcp: populate inet->cork.fl.u.ip6 in tcp_v6_syn_recv_sock() As explained in commit `85d05e2817` ("ipv6: change inet6_sk_rebuild_header() to use inet->cork.fl.u.ip6"): TCP v6 spends a good amount of time rebuilding a fresh fl6 at each transmit in inet6_csk_xmit()/inet6_csk_route_socket(). TCP v4 caches the information in inet->cork.fl.u.ip4 instead. After this patch, passive TCP ipv6 flows have correctly initialized inet->cork.fl.u.ip6 structure. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260206173426.1638518-7-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-10 20:57:50 -08:00
Jakub Kicinski	792aaea994	netfilter pull request nf-next-26-02-06 -----BEGIN PGP SIGNATURE----- iQJdBAABCABHFiEEgKkgxbID4Gn1hq6fcJGo2a1f9gAFAmmGB20bFIAAAAAABAAO bWFudTIsMi41KzEuMTEsMiwyDRxmd0BzdHJsZW4uZGUACgkQcJGo2a1f9gC/tQ/7 B7/akiCP/QeGF7go78PZQlpIGmjtoCOcQ9uxymlmpLkArepcIEkgZ04tFH0FClY6 d3QPfT9iNap222aCQxZwCiaWrXqUNynW7RwH72SkqGmO8JTLKlzW8CQC+yGkyznj FxwRKzB8XO5Ohtw0wED3mzcf9DelsvJpX5rCU5gEjsHZjKA/rEwYgovyM+es+xSx JbHHc2tzLQuDZ1BL7rEW8TJDxmJ2bCsFJHKeIvykk3D2nVg01P0AwhUeIy+7ObV7 bQh7B8DhYwKNLtgZvDi8D6o4nWQvkjfF5BadrWusumDCtIupcwbelpcUeCsUWBqC oCjLMcH7TwmT513RXWMId50z93FWciduCHUGrQt4BxLBZmkQ9kE0iamZVIAAzLl8 VYIM9qb+nUk58jnLFl3xTqW2GetSj/p31bp6e78+SQFvqjie2z9/I+nGBr7A8aAB bNd5vpvHSEg5OP7oKk+Dhr26MiCDowtuzvdC4lYR+loFYoI+a1FS6a1w/kcw9/VA XmR6Y8is+CTy4XYTQZ4klYTVpoTkWa/D/t1CTC4IlELzYS49L6qSyef6m91IWeQ6 Way5+3ZON7sA6SM1PZ/zjsKDxYLo/hQz2+dw6YLVflfY62khvuK2Yc56MQcZEjsH 7x0b3MaKvNn9yqKC+Mk7QZ55nCjV3wyGp3GQ+ClAqZ4= =wU6p -----END PGP SIGNATURE----- Merge tag 'nf-next-26-02-06' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Florian Westphal says: ==================== netfilter: updates for net-next The following patchset contains Netfilter updates for net-next: 1) Fix net-next-only use-after-free bug in nf_tables rbtree set: Expired elements cannot be released right away after unlink anymore because there is no guarantee that the binary-search blob is going to be updated. Spotted by syzkaller. 2) Fix esoteric bug in nf_queue with udp fraglist gro, broken since 6.11. Patch 3 adds extends the nfqueue selftest for this. 4) Use dedicated slab for flowtable entries, currently the -512 cache is used, which is wasteful. From Qingfang Deng. 5) Recent net-next update extended existing test for ip6ip6 tunnels, add the required /config entry. Test still passed by accident because the previous tests network setup gets re-used, so also update the test so it will fail in case the ip6ip6 tunnel interface cannot be added. 
6) Fix 'nft get element mytable myset { 1.2.3.4 }' on big endian platforms, this was broken since code was added in v5.1. 7) Fix nf_tables counter reset support on 32bit platforms, where counter reset may cause huge values to appear due to wraparound. Broken since reset feature was added in v6.11. From Anders Grahn. 8-11) update nf_tables rbtree set type to detect partial overlaps. This will eventually speed up nftables userspace: at this time userspace does a netlink dump of the set content which slows down incremental updates on interval sets. From Pablo Neira Ayuso. * tag 'nf-next-26-02-06' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nft_set_rbtree: validate open interval overlap netfilter: nft_set_rbtree: validate element belonging to interval netfilter: nft_set_rbtree: check for partial overlaps in anonymous sets netfilter: nft_set_rbtree: fix bogus EEXIST with NLM_F_CREATE with null interval netfilter: nft_counter: fix reset of counters on 32bit archs netfilter: nft_set_hash: fix get operation on big endian selftests: netfilter: add IPV6_TUNNEL to config netfilter: flowtable: dedicated slab for flow entry selftests: netfilter: nft_queue.sh: add udp fraglist gro test case netfilter: nfnetlink_queue: do shared-unconfirmed check before segmentation netfilter: nft_set_rbtree: don't gc elements on insert ==================== Link: https://patch.msgid.link/20260206153048.17570-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-10 20:25:38 -08:00
Paolo Abeni	dc010e1b4b	xfrm: reduce struct sec_path size The mentioned struct has a hole and uses unnecessarily wide types to store MAC length and indexes of very small arrays. It's also embedded into the skb_extensions, and the latter, due to recent CAN changes, may exceed the 192 bytes mark (3 cachelines on x86_64 arch) on some reasonable configurations. By reordering the sec_path fields, shrinking xfrm_offload.orig_mac_len to 16 bits and xfrm_offload.{len,olen,verified_cnt} to u8, we can save 16 bytes and keep skb_extensions size under control. Before: struct sec_path { int len; int olen; int verified_cnt; /* XXX 4 bytes hole, try to pack */ struct xfrm_state *xvec[6]; struct xfrm_offload ovec[1]; /* size: 88, cachelines: 2, members: 5 */ /* sum members: 84, holes: 1, sum holes: 4 */ /* last cacheline: 24 bytes */ }; After: struct sec_path { struct xfrm_state *xvec[6]; struct xfrm_offload ovec[1]; /* typedef u8 -> __u8 */ unsigned char len; /* typedef u8 -> __u8 */ unsigned char olen; /* typedef u8 -> __u8 */ unsigned char verified_cnt; /* size: 72, cachelines: 2, members: 5 */ /* padding: 1 */ /* last cacheline: 8 bytes */ }; Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Florian Westphal <fw@strlen.de> Reviewed-by: Steffen Klassert <steffen.klassert@secunet.com> Link: https://patch.msgid.link/83846bd2e3fa08899bd0162e41bfadfec95e82ef.1770398071.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-10 20:21:48 -08:00
Vladimir Oltean	c22ba07c82	net: dsa: eliminate local type for tc policers David Yang is saying that struct flow_action_entry in include/net/flow_offload.h has gained new fields and DSA's struct dsa_mall_policer_tc_entry, derived from that, isn't keeping up. This structure is passed to drivers and they are completely oblivious to the values of fields they don't see. This has happened before, and almost always the solution was to make the DSA layer thinner and use the upstream data structures. Here, the reason why we didn't do that is because struct flow_action_entry :: police is an anonymous structure. That is easily enough fixable, just name those fields "struct flow_action_police" and reference them from DSA. Make the according transformations to the two users (sja1105 and felix): "rate_bytes_per_sec" -> "rate_bytes_ps". Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Co-developed-by: David Yang <mmyangfl@gmail.com> Signed-off-by: David Yang <mmyangfl@gmail.com> Link: https://patch.msgid.link/20260206075427.44733-1-mmyangfl@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-10 15:30:11 +01:00
Alice Mikityanska	35f66ce900	net/ipv6: Remove HBH helpers Now that the HBH jumbo helpers are not used by any driver or GSO, remove them altogether. Signed-off-by: Alice Mikityanska <alice@isovalent.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260205133925.526371-13-alice.kernel@fastmail.im Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-06 20:50:13 -08:00
Alice Mikityanska	b2936b4fd5	net/ipv6: Introduce payload_len helpers The next commits will transition away from using the hop-by-hop extension header to encode packet length for BIG TCP. Add wrappers around ip6->payload_len that return the actual value if it's non-zero, and calculate it from skb->len if payload_len is set to zero (and a symmetrical setter). The new helpers are used wherever the surrounding code supports the hop-by-hop jumbo header for BIG TCP IPv6, or the corresponding IPv4 code uses skb_ip_totlen (e.g., in include/net/netfilter/nf_tables_ipv6.h). No behavioral change in this commit. Signed-off-by: Alice Mikityanska <alice@isovalent.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260205133925.526371-2-alice.kernel@fastmail.im Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-06 20:50:03 -08:00
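The getter/setter pair described in the payload_len commit above can be sketched roughly as follows. This is a user-space illustration, not the kernel helpers: `struct pkt`, `pkt_payload_len()` and `pkt_set_payload_len()` are hypothetical names, and the sketch assumes the buffer starts at the IPv6 header, so the payload is the total length minus the fixed 40-byte header.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

#define IPV6_HDR_LEN 40u	/* fixed IPv6 header size */

/* Hypothetical stand-in for an skb whose data starts at the IPv6 header. */
struct pkt {
	uint32_t len;		/* total bytes, IPv6 header included */
	uint16_t payload_len;	/* on-wire field, network byte order */
};

/* Getter: use the header field unless it is zero, else derive from len. */
static uint32_t pkt_payload_len(const struct pkt *p)
{
	uint32_t len = ntohs(p->payload_len);

	return len ? len : p->len - IPV6_HDR_LEN;
}

/* Setter: store the length if it fits in 16 bits, otherwise store zero. */
static void pkt_set_payload_len(struct pkt *p, uint32_t len)
{
	p->payload_len = htons(len <= UINT16_MAX ? (uint16_t)len : 0);
}

/* Hypothetical usage: a 70000-byte payload does not fit in 16 bits. */
void payload_len_example(void)
{
	struct pkt big = { .len = IPV6_HDR_LEN + 70000u, .payload_len = 0 };

	pkt_set_payload_len(&big, 70000u);
	assert(big.payload_len == 0);		/* field left at zero */
	assert(pkt_payload_len(&big) == 70000u);	/* recovered from len */
}
```

Keeping the zero check inside the accessors means callers never read the raw header field directly, which is what lets later commits change how oversized packets are encoded without touching every call site.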
Eric Dumazet	a35b6e4863	tcp: inline tcp_filter() This helper is already (auto)inlined from IPv4 TCP stack. Make it an inline function to benefit IPv6 as well. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/2 grow/shrink: 1/0 up/down: 30/-49 (-19) Function old new delta tcp_v6_rcv 3448 3478 +30 __pfx_tcp_filter 16 - -16 tcp_filter 33 - -33 Total: Before=24891904, After=24891885, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260205164329.3401481-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-06 20:12:11 -08:00
Qiliang Yuan	7acee67a6b	netns: optimize netns cleaning by batching unhash_nsid calls Currently, unhash_nsid() scans the entire system for each netns being killed, leading to O(L_dying_net * M_alive_net * N_id) complexity, as __peernet2id() also performs a linear search in the IDR. Optimize this to O(M_alive_net * N_id) by batching unhash operations. Move unhash_nsid() out of the per-netns loop in cleanup_net() to perform a single-pass traversal over survivor namespaces. Identify dying peers by an 'is_dying' flag, which is set under net_rwsem write lock after the netns is removed from the global list. This batches the unhashing work and eliminates the O(L_dying_net) multiplier. To minimize the impact on struct net size, 'is_dying' is placed in an existing hole after 'hash_mix' in struct net. Use a restartable idr_get_next() loop for iteration. This avoids the unsafe modification issue inherent to idr_for_each() callbacks and allows dropping the nsid_lock to safely call sleepy rtnl_net_notifyid(). Clean up redundant nsid_lock and simplify the destruction loop now that unhashing is centralized. Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260204074854.3506916-1-realwujing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-06 20:01:31 -08:00
Pablo Neira Ayuso	648946966a	netfilter: nft_set_rbtree: validate open interval overlap Open intervals do not have an end element, in particular an open interval at the end of the set is hard to validate because of it is lacking the end element, and interval validation relies on such end element to perform the checks. This patch adds a new flag field to struct nft_set_elem, this is not an issue because this is a temporary object that is allocated in the stack from the insert/deactivate path. This flag field is used to specify that this is the last element in this add/delete command. The last flag is used, in combination with the start element cookie, to check if there is a partial overlap, eg. Already exists: 255.255.255.0-255.255.255.254 Add interval: 255.255.255.0-255.255.255.255 ~~~~~~~~~~~~~ start element overlap Basically, the idea is to check for an existing end element in the set if there is an overlap with an existing start element. However, the last open interval can come in any position in the add command, the corner case can get a bit more complicated: Already exists: 255.255.255.0-255.255.255.254 Add intervals: 255.255.255.0-255.255.255.255,255.255.255.0-255.255.255.254 ~~~~~~~~~~~~~ start element overlap To catch this overlap, annotate that the new start element is a possible overlap, then report the overlap if the next element is another start element that confirms that previous element in an open interval at the end of the set. For deletions, do not update the start cookie when deleting an open interval, otherwise this can trigger spurious EEXIST when adding new elements. Unfortunately, there is no NFT_SET_ELEM_INTERVAL_OPEN flag which would make easier to detect open interval overlaps. Fixes: `7c84d41416` ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-02-06 13:36:07 +01:00
Florian Westphal	207b3ebacb	netfilter: nfnetlink_queue: do shared-unconfirmed check before segmentation Ulrich reports a regression with nfqueue: If an application did not set the 'F_GSO' capability flag and a gso packet with an unconfirmed nf_conn entry is received all packets are now dropped instead of queued, because the check happens after skb_gso_segment(). In that case, we did have exclusive ownership of the skb and its associated conntrack entry. The elevated use count is due to skb_clone happening via skb_gso_segment(). Move the check so that it's performed vs. the aggregated packet. Then, annotate the individual segments except the first one so we can do a 2nd check at reinject time. For the normal case, where userspace does in-order reinjects, this avoids packet drops: first reinjected segment continues traversal and confirms entry, remaining segments observe the confirmed entry. While at it, simplify nf_ct_drop_unconfirmed(): We only care about unconfirmed entries with a refcnt > 1, there is no need to special-case dying entries. This only happens with UDP. With TCP, the only unconfirmed packet will be the TCP SYN, those aren't aggregated by GRO. Next patch adds a udpgro test case to cover this scenario. Reported-by: Ulrich Weber <ulrich.weber@gmail.com> Fixes: `7d8dc1c7be` ("netfilter: nf_queue: drop packets with cloned unconfirmed conntracks") Signed-off-by: Florian Westphal <fw@strlen.de>	2026-02-06 13:34:55 +01:00
Davide Caratti	a90f6dcefc	net/sched: don't use dynamic lockdep keys with clsact/ingress/noqueue Currently we are registering one dynamic lockdep key for each allocated qdisc, to avoid false deadlock reports when mirred (or TC eBPF) redirects packets to another device while the root lock is acquired [1]. Since dynamic keys are a limited resource, we can save them at least for qdiscs that are not meant to acquire the root lock in the traffic path, or to carry traffic at all, like: - clsact - ingress - noqueue Don't register dynamic keys for the above schedulers, so that we hit MAX_LOCKDEP_KEYS later in our tests. [1] https://github.com/multipath-tcp/mptcp_net-next/issues/451 Changes in v2: - change ordering of spin_lock_init() vs. lockdep_register_key() (Jakub Kicinski) Signed-off-by: Davide Caratti <dcaratti@redhat.com> Link: https://patch.msgid.link/94448f7fa7c4f52d2ce416a4895ec87d456d7417.1770220576.git.dcaratti@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-05 09:32:45 -08:00
Eric Dumazet	22c1264415	tcp: move __reqsk_free() out of line Inlining __reqsk_free() is overkill, let's reclaim 2 Kbytes of text. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 2/4 grow/shrink: 2/14 up/down: 225/-2338 (-2113) Function old new delta __reqsk_free - 114 +114 sock_edemux 18 82 +64 inet_csk_listen_start 233 264 +31 __pfx___reqsk_free - 16 +16 __pfx_reqsk_queue_alloc 16 - -16 __pfx_reqsk_free 16 - -16 reqsk_queue_alloc 46 - -46 tcp_req_err 272 177 -95 reqsk_fastopen_remove 348 253 -95 cookie_bpf_check 157 62 -95 cookie_tcp_reqsk_alloc 387 290 -97 cookie_v4_check 1568 1465 -103 reqsk_free 105 - -105 cookie_v6_check 1519 1412 -107 sock_gen_put 187 78 -109 sock_pfree 212 82 -130 tcp_try_fastopen 1818 1683 -135 tcp_v4_rcv 3478 3294 -184 reqsk_put 306 90 -216 tcp_get_cookie_sock 551 318 -233 tcp_v6_rcv 3404 3141 -263 tcp_conn_request 2677 2384 -293 Total: Before=24887415, After=24885302, chg -0.01% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260204055147.1682705-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-05 09:23:06 -08:00
Eric Dumazet	d5c5391554	inet: move reqsk_queue_alloc() to net/ipv4/inet_connection_sock.c Only called once from inet_csk_listen_start(), it can be static. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260204055147.1682705-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-05 09:23:05 -08:00
David Yang	770e112634	flow_offload: add const qualifiers to function arguments Some functions do not modify the pointed-to data, but lack const qualifiers. Add const qualifiers to the arguments of flow_rule_match_has_control_flags() and flow_cls_offload_flow_rule(). Signed-off-by: David Yang <mmyangfl@gmail.com> Link: https://patch.msgid.link/20260204052839.198602-1-mmyangfl@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-05 16:24:22 +01:00
Oliver Hartkopp	96ea3a1e2d	can: add CAN skb extension infrastructure To remove the private CAN bus skb headroom infrastructure 8 bytes need to be stored in the skb. The skb extensions are a common pattern and an easy and efficient way to hold private data travelling along with the skb. We only need the skb_ext_add() and skb_ext_find() functions to allocate and access CAN specific content as the skb helpers to copy/clone/free skbs automatically take care of skb extensions and their final removal. This patch introduces the complete CAN skb extensions infrastructure: - add struct can_skb_ext in new file include/net/can.h - add include/net/can.h in MAINTAINERS - add SKB_EXT_CAN to skbuff.c and skbuff.h - select SKB_EXTENSIONS in Kconfig when CONFIG_CAN is enabled - check for existing CAN skb extensions in can_rcv() in af_can.c - add CAN skb extensions allocation at every skb_alloc() location - duplicate the skb extensions if cloning outgoing skbs (framelen/gw_hops) - introduce can_skb_ext_add() and can_skb_ext_find() helpers The patch also corrects an indentation issue in the original code from 2018: Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202602010426.PnGrYAk3-lkp@intel.com/ Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20260201-can_skb_ext-v8-2-3635d790fe8b@hartkopp.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-05 11:58:39 +01:00
Randy Dunlap	a34b0e4e21	net/iucv: clean up iucv kernel-doc warnings Fix numerous (many) kernel-doc warnings in iucv.[ch]: - convert function documentation comments to a common (kernel-doc) look, even for static functions (without "/*") - use matching parameter and parameter description names - use better wording in function descriptions (Jakub & AI) - remove duplicate kernel-doc comments from the header file (Jakub) Examples: Warning: include/net/iucv/iucv.h:210 missing initial short description on line: iucv_unregister Warning: include/net/iucv/iucv.h:216 function parameter 'handle' not described in 'iucv_unregister' Warning: include/net/iucv/iucv.h:467 function parameter 'answer' not described in 'iucv_message_send2way' Warning: net/iucv/iucv.c:727 missing initial short description on line: * iucv_cleanup_queue Build-tested with both "make htmldocs" and "make ARCH=s390 defconfig all". Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Alexandra Winter <wintera@linux.ibm.com> Link: https://patch.msgid.link/20260203075248.1177869-1-rdunlap@infradead.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 20:39:58 -08:00
Eric Dumazet	309dd99421	tcp: split tcp_check_space() in two parts tcp_check_space() is fat and not inlined. Move its slow path in (out of line) __tcp_check_space() and make tcp_check_space() an inline function for better TCP performance. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 2/2 grow/shrink: 4/0 up/down: 708/-582 (126) Function old new delta __tcp_check_space - 521 +521 tcp_rcv_established 1860 1916 +56 tcp_rcv_state_process 3342 3384 +42 tcp_event_new_data_sent 248 286 +38 tcp_data_snd_check 71 106 +35 __pfx___tcp_check_space - 16 +16 __pfx_tcp_check_space 16 - -16 tcp_check_space 566 - -566 Total: Before=24896373, After=24896499, chg +0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260203050932.3522221-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 20:37:06 -08:00
Jakub Kicinski	333225e1e9	Some more changes, including pulls from drivers: - ath drivers: small features/cleanups - rtw drivers: mostly refactoring for rtw89 RTL8922DE support - mac80211: use hrtimers for CAC to avoid too long delays - cfg80211/mac80211: some initial UHR (Wi-Fi 8) support -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmmDNuEACgkQ10qiO8sP aADhvA/+J35p2CDkffi1KfZxxx1YdHAAlj1zjhjLzshCMCG3oWzLpOL7se5bgN/C axPLPbeCAXtsRXln083lbwtrSRexPHSVhelPDNtybLPEocQYrksV8a6V3eWXCNTR ymN4iDaO/K0gLkDRKH5T8lwZvJttA6iHi+Fm4ir+dsr0O5vwwe4CuAEPA1SuZ2rh 0lQMz6pEzsxq+sZX3p8SoBwXx147l0n6gwMNIgBTKo1tjZha4oaavdvcqq4zaZWV WCcg4YVA/dWHL0UuwtIF8uQADM43quegBBUFx63QgzfgcnHAnBk2Ckeein/bfvnv XOKlI4UJi1cxTkTJkDOrSn5IwBzVSlBXE3qEUKKnu5G3+ZgfdsnWmSPeTtOndvAE rgbwwZb2SKH1kCvL0FDZTwq/iR9KF60ZfhWIq9Sz7m6VZxJoR8QACHglYCysj2JB B1+oT53EIqP7Ob4s/GN2Yg9M0l4Lv3E6J9g6h3b8yeq9qEXVF8MaVN683rtNpec9 mUqLRlcoToB2W/qvEVESKj8jMvajYZ6TDoO7mSP3paTW3HgMC3wlPJlDc4Q/6h7e LAKEljXlv6ofNGCcCL37l6KATqSZpIZn+tpSqbELIirWlc/rnTIDU2qZRb7MA1e1 3lKdrS6pOXGS1GJr7HWuLb4cX1SukyXNeyIcZJlSFoxG4oDPvwI= =/NUu -----END PGP SIGNATURE----- Merge tag 'wireless-next-2026-02-04' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Johannes Berg says: ==================== Some more changes, including pulls from drivers: - ath drivers: small features/cleanups - rtw drivers: mostly refactoring for rtw89 RTL8922DE support - mac80211: use hrtimers for CAC to avoid too long delays - cfg80211/mac80211: some initial UHR (Wi-Fi 8) support * tag 'wireless-next-2026-02-04' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (59 commits) wifi: brcmsmac: phy: Remove unreachable error handling code wifi: mac80211: Add eMLSR/eMLMR action frame parsing support wifi: mac80211: add initial UHR support wifi: cfg80211: add initial UHR support wifi: ieee80211: add some initial UHR definitions wifi: mac80211: use wiphy_hrtimer_work for CAC timeout wifi: mac80211: correct 
ieee80211-{s1g/eht}.h include guard comments wifi: ath12k: clear stale link mapping of ahvif->links_map wifi: ath12k: Add support TX hardware queue stats wifi: ath12k: Add support RX PDEV stats wifi: ath12k: Fix index decrement when array_len is zero wifi: ath12k: support OBSS PD configuration for AP mode wifi: ath12k: add WMI support for spatial reuse parameter configuration dt-bindings: net: wireless: ath11k-pci: deprecate 'firmware-name' property wifi: ath11k: add usecase firmware handling based on device compatible wifi: ath10k: sdio: add missing lock protection in ath10k_sdio_fw_crashed_dump() wifi: ath10k: fix lock protection in ath10k_wmi_event_peer_sta_ps_state_chg() wifi: ath10k: snoc: support powering on the device via pwrseq wifi: rtw89: pci: warn if SPS OCP happens for RTL8922DE wifi: rtw89: pci: restore LDO setting after device resume ... ==================== Link: https://patch.msgid.link/20260204121143.181112-3-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 20:31:05 -08:00
Chia-Yu Chang	4fa4ac5e58	tcp: accecn: add tcpi_ecn_mode and tcpi_option2 in tcp_info Add 2-bit tcpi_ecn_mode field within tcp_info to indicate which ECN mode is negotiated: ECN_MODE_DISABLED, ECN_MODE_RFC3168, ECN_MODE_ACCECN, or ECN_MODE_PENDING. This is done by utilizing available bits from tcpi_accecn_opt_seen (reduced from 16 bits to 2 bits) and tcpi_accecn_fail_mode (reduced from 16 bits to 4 bits). Also, an extra 24-bit tcpi_options2 field is identified to represent newer options and connection features, as all 8 bits of tcpi_options field have been used. Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Co-developed-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-14-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-03 15:13:25 +01:00
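The bit budget in the tcp_info commit above (two 16-bit fields re-carved into 4 + 2 + 2 + 24 bits) can be illustrated with a small bitfield sketch. Only the field widths come from the commit message; the struct name, the field order, and the numeric values of the mode constants are assumptions made for illustration, and bitfield layout is implementation-defined, so the size check reflects common GCC/Clang behaviour only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical encoding: the commit names the modes, not their values. */
enum ecn_mode_sketch {
	SKETCH_ECN_MODE_DISABLED,
	SKETCH_ECN_MODE_RFC3168,
	SKETCH_ECN_MODE_ACCECN,
	SKETCH_ECN_MODE_PENDING,
};

/* Field order is an assumption; only the widths come from the commit. */
struct tcp_info_tail_sketch {
	uint32_t accecn_fail_mode : 4;	/* was 16 bits */
	uint32_t accecn_opt_seen  : 2;	/* was 16 bits */
	uint32_t ecn_mode         : 2;	/* new */
	uint32_t options2         : 24;	/* new */
};

/* Hypothetical usage: four modes fit exactly in the 2-bit field. */
void ecn_mode_example(void)
{
	struct tcp_info_tail_sketch t = { 0 };

	t.ecn_mode = SKETCH_ECN_MODE_PENDING;
	t.options2 = 0xffffffu;		/* largest 24-bit value */

	assert(t.ecn_mode == SKETCH_ECN_MODE_PENDING);
	assert(t.options2 == 0xffffffu);
}
```

Because the four widths add up to the 32 bits the two original 16-bit fields occupied, a layout like this leaves the overall structure size unchanged.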
Chia-Yu Chang	1247fb19ca	tcp: accecn: detect loss ACK w/ AccECN option and add TCP_ACCECN_OPTION_PERSIST Detect spurious retransmission of a previously sent ACK carrying the AccECN option after the second retransmission. Since this might be caused by the middlebox dropping ACK with options it does not recognize, disable the sending of the AccECN option in all subsequent ACKs. This patch follows Section 3.2.3.2.2 of AccECN spec (RFC9768), and a new field (accecn_opt_sent_w_dsack) is added to indicate that an AccECN option was sent with duplicate SACK info. Also, a new AccECN option sending mode is added to tcp_ecn_option sysctl: (TCP_ECN_OPTION_PERSIST), which ignores the AccECN fallback policy and persistently sends AccECN option once it fits into TCP option space. Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-13-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-03 15:13:25 +01:00
Chia-Yu Chang	2ed661248e	tcp: accecn: fallback outgoing half link to non-AccECN According to Section 3.2.2.1 of AccECN spec (RFC9768), if the Server is in AccECN mode and in SYN-RCVD state, and if it receives a value of zero on a pure ACK with SYN=0 and no SACK blocks, for the rest of the connection the Server MUST NOT set ECT on outgoing packets and MUST NOT respond to AccECN feedback. Nonetheless, as a Data Receiver it MUST NOT disable AccECN feedback. Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-12-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-03 15:13:25 +01:00
Chia-Yu Chang	f326f1f17f	tcp: accecn: retransmit SYN/ACK without AccECN option or non-AccECN SYN/ACK For Accurate ECN, the first SYN/ACK sent by the TCP server shall set the ACE flag (Table 1 of RFC9768) and the AccECN option to complete the capability negotiation. However, if the TCP server needs to retransmit such a SYN/ACK (for example, because it did not receive an ACK acknowledging its SYN/ACK, or received a second SYN requesting AccECN support), the TCP server retransmits the SYN/ACK without the AccECN option. This is because the SYN/ACK may be lost due to congestion, or a middlebox may block the AccECN option. Furthermore, if this retransmission also times out, to expedite connection establishment, the TCP server should retransmit the SYN/ACK with (AE,CWR,ECE) = (0,0,0) and without the AccECN option, while maintaining AccECN feedback mode. This complies with Section 3.2.3.2.2 of the AccECN spec RFC9768. Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-10-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-03 15:13:24 +01:00
Chia-Yu Chang	f1eaea5585	tcp: add TCP_SYNACK_RETRANS synack_type Before this patch, retransmitted SYN/ACK did not have a specific synack_type; however, the upcoming patch needs to distinguish between retransmitted and non-retransmitted SYN/ACK for AccECN negotiation to transmit the fallback SYN/ACK during AccECN negotiation. Therefore, this patch introduces a new synack_type (TCP_SYNACK_RETRANS). Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-9-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-03 15:13:24 +01:00
Chia-Yu Chang	c5ff6b8371	tcp: accecn: handle unexpected AccECN negotiation feedback This follows Sections 3.1.2 and 3.1.3 of the AccECN spec (RFC9768). Section 3.1.2 says an AccECN implementation has no need to recognize or support the Server response labelled 'Nonce' or ECN-nonce feedback more generally, as RFC 3540 has been reclassified as Historic. AccECN is compatible with alternative ECN feedback integrity approaches to the nonce. The SYN/ACK labelled 'Nonce' with (AE,CWR,ECE) = (1,0,1) is reserved for future use. A TCP Client (A) that receives such a SYN/ACK follows the procedure for forward compatibility given in Section 3.1.3. Section 3.1.3 then says that if a TCP Client has sent a SYN requesting AccECN feedback with (AE,CWR,ECE) = (1,1,1) then receives a SYN/ACK with the currently reserved combination (AE,CWR,ECE) = (1,0,1) but it does not have logic specific to such a combination, the Client MUST enable AccECN mode as if the SYN/ACK confirmed that the Server supported AccECN and as if it fed back that the IP-ECN field on the SYN had arrived unchanged. Fixes: `3cae34274c` ("tcp: accecn: AccECN negotiation"). Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-7-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-03 15:13:24 +01:00
Chia-Yu Chang	e68c28f22f	tcp: disable RFC3168 fallback identifier for CC modules When AccECN is not successfully negotiated for a TCP flow, it falls back to classic ECN (RFC3168) by default. However, an L4S service will fall back to non-ECN. This patch lets a congestion control module control whether it should skip the fallback to classic ECN after an unsuccessful AccECN negotiation. A new CA module flag (TCP_CONG_NO_FALLBACK_RFC3168) identifies this behavior expected by the CA. Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-6-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-03 15:13:24 +01:00
Chia-Yu Chang	100f946b8d	tcp: ECT_1_NEGOTIATION and NEEDS_ACCECN identifiers Two flags for congestion control (CC) module are added in this patch related to AccECN negotiation. First, a new flag (TCP_CONG_NEEDS_ACCECN) defines that the CC expects to negotiate AccECN functionality using the ECE, CWR and AE flags in the TCP header. Second, during ECN negotiation, ECT(0) in the IP header is used. This patch enables CC to control whether ECT(0) or ECT(1) should be used on a per-segment basis. A new flag (TCP_CONG_ECT_1_NEGOTIATION) defines the expected ECT value in the IP header by the CA when not-yet initialized for the connection. The detailed AccECN negotiaotn can be found in IETF RFC9768. Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com> Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com> Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-5-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-03 15:13:24 +01:00
Geliang Tang	2d85088d46	tcp: export tcp_splice_state Export struct tcp_splice_state and tcp_splice_data_recv() in net/tcp.h so that they can be used by MPTCP in the next patch. Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Acked-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260130-net-next-mptcp-splice-v2-3-31332ba70d7f@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-02 18:15:32 -08:00
Eric Dumazet	b409a7f717	ipv6: colocate inet6_cork in inet_cork_full All inet6_cork users also use one inet_cork_full. Reduce number of parameters and increase data locality. This saves ~275 bytes of code on x86_64. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260130210303.3888261-9-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-02 17:49:30 -08:00
Eric Dumazet	8776c4ef3a	inet: add dst4_mtu() and dst6_mtu() helpers With CONFIG_MITIGATION_RETPOLINE=y dst_mtu() is a bit fat, because it is generic. Indeed, clang does not always inline it. Add dst4_mtu() and dst6_mtu() helpers for callers that expect either ipv4_mtu() or ip6_mtu() to be called. These helpers are always inlined. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260130210303.3888261-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-02 17:49:29 -08:00
Eric Dumazet	1bc46dd209	ipv6: pass proto by value to ipv6_push_nfrag_opts() and ipv6_push_frag_opts() With CONFIG_STACKPROTECTOR_STRONG=y, it is better to avoid passing a pointer to an automatic variable. Change these exported functions to return 'u8 proto' instead of void. - ipv6_push_nfrag_opts() - ipv6_push_frag_opts() For instance, replace ipv6_push_frag_opts(skb, opt, &proto); with: proto = ipv6_push_frag_opts(skb, opt, proto); Note that even after this change, ip6_xmit() has to use a stack canary because of @first_hop variable. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260130210303.3888261-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-02 17:49:28 -08:00
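The calling-convention change described in 1bc46dd209 can be sketched in plain C. The two helpers below are hypothetical stand-ins, not the kernel functions: they only model "write the previous protocol into the new header, then make the new header the head of the chain". The by-value variant lets the caller keep `proto` in a register instead of exposing the address of an automatic variable.

```c
#include <stdint.h>

/* Hypothetical stand-ins for an extension-header writer; not kernel code. */

/* Old style: the caller has to pass the address of its local 'proto'. */
static void push_opt_by_pointer(uint8_t *hdr_nexthdr, uint8_t *proto,
				uint8_t this_hdr)
{
	*hdr_nexthdr = *proto;	/* new header points at the previous protocol */
	*proto = this_hdr;	/* the chain now starts with the new header */
}

/* New style: 'proto' is passed by value and the updated value is returned. */
static uint8_t push_opt_by_value(uint8_t *hdr_nexthdr, uint8_t proto,
				 uint8_t this_hdr)
{
	*hdr_nexthdr = proto;
	return this_hdr;
}
```

A caller of the second form writes `proto = push_opt_by_value(&nexthdr, proto, hdr_type);`, mirroring the `proto = ipv6_push_frag_opts(skb, opt, proto);` form quoted in the commit message.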
Eric Dumazet	82f35bec11	net: l3mdev: use skb_dst_dev_rcu() in l3mdev_l3_out() Extend the RCU section a bit so that we can use the safer skb_dst_dev_rcu() helper. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260130191906.3781856-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-02 17:09:11 -08:00
Lorenzo Bianconi	0d95280a2d	wifi: mac80211: Add eMLSR/eMLMR action frame parsing support Introduce support in AP mode for parsing of the Operating Mode Notification frame sent by the client to enable/disable MLO eMLSR or eMLMR if supported by both the AP and the client. Add drv_set_eml_op_mode mac80211 callback in order to configure underlay driver with eMLSR/eMLMR info. Tested-by: Christian Marangi <ansuelsmth@gmail.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://patch.msgid.link/20260129-mac80211-emlsr-v4-1-14bdadf57380@kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2026-02-02 10:11:18 +01:00
Johannes Berg	a108511471	wifi: mac80211: add initial UHR support Add support for making UHR connections and accepting AP stations with UHR support. Link: https://patch.msgid.link/20260130164259.7185980484eb.Ieec940b58dbf8115dab7e1e24cb5513f52c8cb2f@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2026-02-02 10:11:08 +01:00
Johannes Berg	072e6f7f41	wifi: cfg80211: add initial UHR support Add initial support for making UHR connections (or suppressing that), adding UHR capable stations on the AP side, encoding and decoding UHR MCSes (except rate calculation for the new MCSes 17, 19, 20 and 23) as well as regulatory support. Link: https://patch.msgid.link/20260130164259.54cc12fbb307.I26126bebd83c7ab17e99827489f946ceabb3521f@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2026-02-02 10:11:07 +01:00
Ethan Nelson-Moore	82fff3b055	net: ax25: remove plumbing for never-implemented DAMA Master support The AX25_DAMA_MASTER option has been unimplemented and marked broken ever since it was introduced in 2007 in commit `954b2e7f4c` ("[NET] AX.25 Kconfig and docs updates and fixes"). At this point, it is very unlikely it will be implemented. Remove it. Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com> Link: https://patch.msgid.link/20260129080908.44710-1-enelsonmoore@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-30 19:19:39 -08:00
Eric Dumazet	ed9b70040d	tcp: reduce tcp sockets size by one cache line By default, when a kmem_cache is created with SLAB_TYPESAFE_BY_RCU, slub has to use extra storage for the freelist pointer after each object, because slub assumes that any bit in the object can be used by RCU readers. Because proto_register() is also using SLAB_HWCACHE_ALIGN, this forces slub to use one extra cache line per object. We can instead put the slub freelist anywhere in the object, granted the concurrent RCU readers are not supposed to use the pointer value. Add a new (struct sock)sk_freeptr field, in an union with sk_rcu: No RCU readers would need to look at sk_rcu, which is only used at free phase. Tested: grep . /sys/kernel/slab/TCP/{object_size,slab_size,objs_per_slab} grep . /sys/kernel/slab/TCPv6/{object_size,slab_size,objs_per_slab} Before: /sys/kernel/slab/TCP/object_size:2368 /sys/kernel/slab/TCP/slab_size:2432 /sys/kernel/slab/TCP/objs_per_slab:13 /sys/kernel/slab/TCPv6/object_size:2496 /sys/kernel/slab/TCPv6/slab_size:2560 /sys/kernel/slab/TCPv6/objs_per_slab:12 After this patch, we can pack one more TCPv6 object per slab, and object_size == slab_size. /sys/kernel/slab/TCP/object_size:2368 /sys/kernel/slab/TCP/slab_size:2368 /sys/kernel/slab/TCP/objs_per_slab:13 /sys/kernel/slab/TCPv6/object_size:2496 /sys/kernel/slab/TCPv6/slab_size:2496 /sys/kernel/slab/TCPv6/objs_per_slab:13 Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260129153458.4163797-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-30 17:15:51 -08:00
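The before/after figures in ed9b70040d can be reproduced with simple arithmetic. The sketch below rests on assumptions that the commit message does not state: 64-byte cache lines, an 8-byte freelist pointer, and a 32 KiB (order-3) slab. The helper names are hypothetical.

```c
#include <stddef.h>

/* Assumed values (typical x86_64 defaults, not taken from the commit). */
#define CACHE_LINE   64u	/* SLAB_HWCACHE_ALIGN granularity */
#define FREEPTR_SIZE 8u		/* sizeof(void *) on a 64-bit kernel */
#define SLAB_BYTES   32768u	/* order-3 slab of 4 KiB pages */

/* Round n up to the next multiple of align (align must be a power of two). */
static size_t round_up_pow2(size_t n, size_t align)
{
	return (n + align - 1) & ~(align - 1);
}

/* Per-object footprint when the freelist pointer is stored after the object. */
static size_t slab_size_freeptr_outside(size_t object_size)
{
	return round_up_pow2(object_size + FREEPTR_SIZE, CACHE_LINE);
}

/* Per-object footprint when the freelist pointer lives inside the object. */
static size_t slab_size_freeptr_inside(size_t object_size)
{
	return round_up_pow2(object_size, CACHE_LINE);
}

/* Number of whole objects that fit into one slab. */
static size_t objs_per_slab(size_t slab_size)
{
	return SLAB_BYTES / slab_size;
}
```

Under these assumptions, a 2496-byte TCPv6 object occupies 2560 bytes with the pointer stored outside (12 objects per slab) and 2496 bytes with it stored inside (13 objects per slab), which matches the numbers reported in the commit.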
Jakub Kicinski	303c1a66a2	Merge tag 'wireless-next-2026-01-29' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Johannes Berg says: ==================== Another fairly large set of changes, notably: - cfg80211/mac80211 - most of EPPKE/802.1X over auth frames support - additional FTM capabilities - split up drop reasons better, removing generic RX_DROP - NAN cleanups/fixes - ath11k: - support for Channel Frequency Response measurement - ath12k: - support for the QCC2072 chipset - iwlwifi: - partial NAN support - UNII-9 support - some UHR/802.11bn FW APIs - remove most of MLO/EHT from iwlmvm (such devices use iwlmld) - rtw89: - preparations for RTL8922DE support * tag 'wireless-next-2026-01-29' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (184 commits) wifi: iwlegacy: add missing mutex protection in il4965_store_tx_power() wifi: iwlegacy: add missing mutex protection in il3945_store_measurement() wifi: mac80211: use u64_stats_t with u64_stats_sync properly wifi: p54: Fix memory leak in p54_beacon_update() wifi: cfg80211: treat deprecated INDOOR_SP_AP_OLD control value as LPI mode wifi: rtw88: sdio: Migrate to use sdio specific shutdown function wifi: rsi: sdio: Migrate to use sdio specific shutdown function sdio: Provide a bustype shutdown function wifi: nl80211/cfg80211: support operating as RSTA in PMSR FTM request wifi: nl80211/cfg80211: add negotiated burst period to FTM result wifi: nl80211/cfg80211: clarify periodic FTM parameters for non-EDCA based ranging wifi: nl80211/cfg80211: add new FTM capabilities wifi: iwlwifi: rename struct iwl_mcc_allowed_ap_type_cmd::offset_map wifi: iwlwifi: mvm: Remove link_id from time_events wifi: iwlwifi: mld: change cluster_id type to u8 array wifi: iwlwifi: support V13 of iwl_lari_config_change_cmd wifi: iwlwifi: split bios_value_u32 to separate the header wifi: iwlwifi: uefi: cache the DSM functions wifi: iwlwifi: acpi: cache the DSM functions wifi: iwlwifi: mvm: Cleanup MLO code ... ==================== Link: https://patch.msgid.link/20260129110136.176980-39-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-29 19:17:43 -08:00
Eric Dumazet	b1cd687e3e	ipv6: optimize fl6_update_dst() fl6_update_dst() is called for every TCP (and others) transmit, and is a nop for common cases. Split it in two parts : 1) fl6_update_dst() inline helper, small and fast. 2) __fl6_update_dst() for the exception, out of line. Small size increase to get better TX performance. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 2/2 grow/shrink: 8/0 up/down: 296/-125 (171) Function old new delta __fl6_update_dst - 104 +104 rawv6_sendmsg 2244 2284 +40 udpv6_sendmsg 3013 3043 +30 tcp_v6_connect 1514 1534 +20 cookie_v6_check 1501 1519 +18 ip6_datagram_dst_update 673 690 +17 inet6_sk_rebuild_header 499 516 +17 inet6_csk_route_socket 507 524 +17 inet6_csk_route_req 343 360 +17 __pfx___fl6_update_dst - 16 +16 __pfx_fl6_update_dst 16 - -16 fl6_update_dst 109 - -109 Total: Before=22570304, After=22570475, chg +0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260128185548.3738781-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-29 18:47:21 -08:00
Jakub Kicinski	a010fe8d86	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.19-rc8). No adjacent changes, conflicts: drivers/net/ethernet/spacemit/k1_emac.c `2c84959167` ("net: spacemit: Check for netif_carrier_ok() in emac_stats_update()") `f66086798f` ("net: spacemit: Remove broken flow control support") https://lore.kernel.org/aXjAqZA3iEWD_DGM@sirena.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-29 17:28:54 -08:00
Luiz Augusto von Dentz	6c3ea155e5	Bluetooth: L2CAP: Fix not tracking outstanding TX ident This attempts to properly track outstanding requests by using struct ida and allocating from it in l2cap_get_ident using ida_alloc_range, which reuses ids as they are freed; upon completion the id is released using ida_free. This fixes the qualification test case L2CAP/COS/CED/BI-29-C, which checks whether the host stack is still able to work after 256 attempts to connect; passing it requires the Ident field to use the full range of possible values. Link: https://github.com/bluez/bluez/issues/1829 Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>	2026-01-29 13:36:35 -05:00
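The allocate-lowest-free, release-on-completion behaviour described in 6c3ea155e5 can be sketched in userspace C. This is not the kernel's IDA: `struct ident_pool`, `ident_alloc()` and `ident_free()` are hypothetical names, and the 1..255 range is an assumption based on the one-byte Ident field rather than something the commit message states.

```c
#include <stdbool.h>

#define IDENT_MIN 1	/* assumed lower bound; 0 is treated as invalid */
#define IDENT_MAX 255	/* assumed upper bound of a one-byte ident */

/* Hypothetical userspace stand-in for struct ida: one flag per ident. */
struct ident_pool {
	bool used[IDENT_MAX + 1];
};

/* Return the lowest free ident in [IDENT_MIN, IDENT_MAX], or -1 if none. */
static int ident_alloc(struct ident_pool *pool)
{
	for (int id = IDENT_MIN; id <= IDENT_MAX; id++) {
		if (!pool->used[id]) {
			pool->used[id] = true;
			return id;
		}
	}
	return -1;
}

/* Release an ident once the request it tracked has completed. */
static void ident_free(struct ident_pool *pool, int id)
{
	if (id >= IDENT_MIN && id <= IDENT_MAX)
		pool->used[id] = false;
}
```

Because an ident only becomes available again after `ident_free()`, an outstanding request can never collide with a newly issued one, and the allocator fails cleanly once every value in the range is in flight.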
Luiz Augusto von Dentz	0e2a6af810	Bluetooth: Fix using PHYs bitfields as PHY value This renames the PHY fields in bt_iso_io_qos to PHYs (plural) since it represents a bitfield where multiple PHYs can be set and make the same change also to HCI_OP_LE_SET_CIG_PARAMS since both c_phy and p_phy fields are bitfields. This also fixes the assumption that hci_evt_le_cis_established PHYs fields are compatible with bt_iso_io_qos, they are not, the fields in hci_evt_le_cis_established represent just a single PHY value so they need to be converted to bitfield when set in bt_iso_io_qos. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2026-01-29 13:27:47 -05:00
Luiz Augusto von Dentz	132c0779d4	Bluetooth: L2CAP: Add support for setting BT_PHY This enables client to use setsockopt(BT_PHY) to set the connection packet type/PHY: Example setting BT_PHY_BR_1M_1SLOT: < HCI Command: Change Conne.. (0x01|0x000f) plen 4 Handle: 1 Address: 00:AA:01:01:00:00 (Intel Corporation) Packet type: 0x331e 2-DH1 may not be used 3-DH1 may not be used DM1 may be used DH1 may be used 2-DH3 may not be used 3-DH3 may not be used 2-DH5 may not be used 3-DH5 may not be used > HCI Event: Command Status (0x0f) plen 4 Change Connection Packet Type (0x01|0x000f) ncmd 1 Status: Success (0x00) > HCI Event: Connection Packet Typ.. (0x1d) plen 5 Status: Success (0x00) Handle: 1 Address: 00:AA:01:01:00:00 (Intel Corporation) Packet type: 0x331e 2-DH1 may not be used 3-DH1 may not be used DM1 may be used DH1 may be used 2-DH3 may not be used 3-DH3 may not be used 2-DH5 may not be used Example setting BT_PHY_LE_1M_TX and BT_PHY_LE_1M_RX: < HCI Command: LE Set PHY (0x08|0x0032) plen 7 Handle: 1 Address: 00:AA:01:01:00:00 (Intel Corporation) All PHYs preference: 0x00 TX PHYs preference: 0x01 LE 1M RX PHYs preference: 0x01 LE 1M PHY options preference: Reserved (0x0000) > HCI Event: Command Status (0x0f) plen 4 LE Set PHY (0x08|0x0032) ncmd 1 Status: Success (0x00) > HCI Event: LE Meta Event (0x3e) plen 6 LE PHY Update Complete (0x0c) Status: Success (0x00) Handle: 1 Address: 00:AA:01:01:00:00 (Intel Corporation) TX PHY: LE 1M (0x01) RX PHY: LE 1M (0x01) Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2026-01-29 13:25:34 -05:00
Naga Bhavani Akella	fe05e3c059	Bluetooth: hci_sync: Add LE Channel Sounding HCI Command/event structures 1. Implement LE Event Mask to include events required for LE Channel Sounding 2. Enable Channel Sounding feature bit in the LE Host Supported Features command 3. Define HCI command and event structures necessary for LE Channel Sounding functionality Signed-off-by: Naga Bhavani Akella <naga.akella@oss.qualcomm.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2026-01-29 13:24:48 -05:00
Luiz Augusto von Dentz	129d1ef3c5	Bluetooth: hci_conn: Fix using conn->le_{tx,rx}_phy as supported PHYs conn->le_{tx,rx}_phy is not actually a bitfield: it is set by HCI_EV_LE_PHY_UPDATE_COMPLETE and corresponds to the current PHY in use, not to what is supported by the controller. This introduces different fields (conn->le_{tx,rx}_def_phys) to track what PHYs are supported by the connection. Fixes: `eab2404ba7` ("Bluetooth: Add BT_PHY socket option") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2026-01-29 13:21:40 -05:00
Scott Mitchell	e19079adcd	netfilter: nfnetlink_queue: optimize verdict lookup with hash table The current implementation uses a linear list to find queued packets by ID when processing verdicts from userspace. With large queue depths and out-of-order verdicting, this O(n) lookup becomes a significant bottleneck, causing userspace verdict processing to dominate CPU time. Replace the linear search with a hash table for O(1) average-case packet lookup by ID. A global rhashtable spanning all network namespaces attributes hash bucket memory to kernel but is subject to fixed upper bound. Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-01-29 09:52:07 +01:00
Kuniyuki Iwashima	d2492688bb	nfc: nci: Fix race between rfkill and nci_unregister_device(). syzbot reported the splat below [0] without a repro. It indicates that struct nci_dev.cmd_wq had been destroyed before nci_close_device() was called via rfkill. nci_dev.cmd_wq is only destroyed in nci_unregister_device(), which (I think) was called from virtual_ncidev_close() when syzbot close()d an fd of virtual_ncidev. The problem is that nci_unregister_device() destroys nci_dev.cmd_wq first and then calls nfc_unregister_device(), which removes the device from rfkill by rfkill_unregister(). So, the device is still visible via rfkill even after nci_dev.cmd_wq is destroyed. Let's unregister the device from rfkill first in nci_unregister_device(). Note that we cannot call nfc_unregister_device() before nci_close_device() because 1) nfc_unregister_device() calls device_del() which frees all memory allocated by devm_kzalloc() and linked to ndev->conn_info_list 2) nci_rx_work() could try to queue nci_conn_info to ndev->conn_info_list which could be leaked Thus, nfc_unregister_device() is split into two functions so we can remove rfkill interfaces only before nci_close_device(). 
[0]: DEBUG_LOCKS_WARN_ON(1) WARNING: kernel/locking/lockdep.c:238 at hlock_class kernel/locking/lockdep.c:238 [inline], CPU#0: syz.0.8675/6349 WARNING: kernel/locking/lockdep.c:238 at check_wait_context kernel/locking/lockdep.c:4854 [inline], CPU#0: syz.0.8675/6349 WARNING: kernel/locking/lockdep.c:238 at __lock_acquire+0x39d/0x2cf0 kernel/locking/lockdep.c:5187, CPU#0: syz.0.8675/6349 Modules linked in: CPU: 0 UID: 0 PID: 6349 Comm: syz.0.8675 Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/13/2026 RIP: 0010:hlock_class kernel/locking/lockdep.c:238 [inline] RIP: 0010:check_wait_context kernel/locking/lockdep.c:4854 [inline] RIP: 0010:__lock_acquire+0x3a4/0x2cf0 kernel/locking/lockdep.c:5187 Code: 18 00 4c 8b 74 24 08 75 27 90 e8 17 f2 fc 02 85 c0 74 1c 83 3d 50 e0 4e 0e 00 75 13 48 8d 3d 43 f7 51 0e 48 c7 c6 8b 3a de 8d <67> 48 0f b9 3a 90 31 c0 0f b6 98 c4 00 00 00 41 8b 45 20 25 ff 1f RSP: 0018:ffffc9000c767680 EFLAGS: 00010046 RAX: 0000000000000001 RBX: 0000000000040000 RCX: 0000000000080000 RDX: ffffc90013080000 RSI: ffffffff8dde3a8b RDI: ffffffff8ff24ca0 RBP: 0000000000000003 R08: ffffffff8fef35a3 R09: 1ffffffff1fde6b4 R10: dffffc0000000000 R11: fffffbfff1fde6b5 R12: 00000000000012a2 R13: ffff888030338ba8 R14: ffff888030338000 R15: ffff888030338b30 FS: 00007fa5995f66c0(0000) GS:ffff8881256f8000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f7e72f842d0 CR3: 00000000485a0000 CR4: 00000000003526f0 Call Trace: <TASK> lock_acquire+0x106/0x330 kernel/locking/lockdep.c:5868 touch_wq_lockdep_map+0xcb/0x180 kernel/workqueue.c:3940 __flush_workqueue+0x14b/0x14f0 kernel/workqueue.c:3982 nci_close_device+0x302/0x630 net/nfc/nci/core.c:567 nci_dev_down+0x3b/0x50 net/nfc/nci/core.c:639 nfc_dev_down+0x152/0x290 net/nfc/core.c:161 nfc_rfkill_set_block+0x2d/0x100 net/nfc/core.c:179 rfkill_set_block+0x1d2/0x440 net/rfkill/core.c:346 rfkill_fop_write+0x461/0x5a0 
net/rfkill/core.c:1301 vfs_write+0x29a/0xb90 fs/read_write.c:684 ksys_write+0x150/0x270 fs/read_write.c:738 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fa59b39acb9 Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fa5995f6028 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 00007fa59b615fa0 RCX: 00007fa59b39acb9 RDX: 0000000000000008 RSI: 0000200000000080 RDI: 0000000000000007 RBP: 00007fa59b408bf7 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007fa59b616038 R14: 00007fa59b615fa0 R15: 00007ffc82218788 </TASK> Fixes: `6a2968aaf5` ("NFC: basic NCI protocol implementation") Reported-by: syzbot+f9c5fd1a0874f9069dce@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/695e7f56.050a0220.1c677c.036c.GAE@google.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260127040411.494931-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-28 19:32:26 -08:00
Eric Dumazet	d5fb143dbe	tcp: move tcp_rack_advance() to tcp_input.c tcp_rack_advance() is called from tcp_ack() and tcp_sacktag_one(). Moving it to tcp_input.c allows the compiler to inline it and save both space and cpu cycles in TCP fast path. $ scripts/bloat-o-meter -t vmlinux.1 vmlinux.2 add/remove: 0/2 grow/shrink: 1/1 up/down: 98/-132 (-34) Function old new delta tcp_ack 5741 5839 +98 tcp_sacktag_one 407 395 -12 __pfx_tcp_rack_advance 16 - -16 tcp_rack_advance 104 - -104 Total: Before=22572680, After=22572646, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260127032147.3498272-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-28 19:31:51 -08:00
Eric Dumazet	629a68865a	tcp: move tcp_rack_update_reo_wnd() to tcp_input.c tcp_rack_update_reo_wnd() is called only once from tcp_ack() Move it to tcp_input.c so that it can be inlined by the compiler to save space and cpu cycles. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/2 grow/shrink: 1/0 up/down: 110/-153 (-43) Function old new delta tcp_ack 5631 5741 +110 __pfx_tcp_rack_update_reo_wnd 16 - -16 tcp_rack_update_reo_wnd 137 - -137 Total: Before=22572723, After=22572680, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260127032147.3498272-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-28 19:31:51 -08:00
Konstantin Taranov	a01745ccf7	RDMA/mana_ib: Add device‑memory support Introduce a basic DM implementation that enables creating and registering device memory, and using the associated memory keys for networking operations. Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com> Link: https://patch.msgid.link/20260127082649.429018-1-kotaranov@linux.microsoft.com Signed-off-by: Leon Romanovsky <leon@kernel.org>	2026-01-27 09:16:11 -05:00
Pagadala Yesu Anjaneyulu	fd5bfcf430	wifi: cfg80211: treat deprecated INDOOR_SP_AP_OLD control value as LPI mode Although value 4 (INDOOR_SP_AP_OLD) is deprecated in IEEE standards, existing APs may still use this control value. Since this value is based on the old specification, we cannot trust such APs implement proper power controls. Therefore, move IEEE80211_6GHZ_CTRL_REG_INDOOR_SP_AP_OLD case from SP_AP to LPI_AP power type handling to prevent potential power limit violations. Signed-off-by: Pagadala Yesu Anjaneyulu <pagadala.yesu.anjaneyulu@intel.com> Reviewed-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20260111163601.6b5a36d3601e.I1704ee575fd25edb0d56f48a0a3169b44ef72ad0@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2026-01-27 13:42:26 +01:00
Avraham Stern	853800c746	wifi: nl80211/cfg80211: support operating as RSTA in PMSR FTM request Add an option to operate as the RSTA in an FTM measurement request. When requested, the device will dwell on the requested channel until the peer starts the FTM negotiation. This option is only valid for trigger-based/non trigger-based measurement with LMR feedback which will allow the RSTA to receive the results of the measurement. Signed-off-by: Avraham Stern <avraham.stern@intel.com> Reviewed-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20260111190221.1f95fc0afab4.Iae2d32783b8e7c4a29089fec0f4c6bce94d303cc@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2026-01-27 13:40:38 +01:00
Avraham Stern	cfd46d1c6f	wifi: nl80211/cfg80211: add negotiated burst period to FTM result The FTM result includes some of the periodic measurement negotiated parameters (like the burst duration and number of bursts), but it doesn't include the burst period. Add it to the FTM result notification. Signed-off-by: Avraham Stern <avraham.stern@intel.com> Reviewed-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20260111190221.e0778f86edef.I3c98c1933eb639963bc3ffdef81a8788b59f2188@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2026-01-27 13:40:36 +01:00
Avraham Stern	853ce6943c	wifi: nl80211/cfg80211: clarify periodic FTM parameters for non-EDCA based ranging Periodic FTM request attributes are defined based on the periodic parameters used in EDCA-based ranging negotiation. However, non-EDCA based ranging (trigger-based/non-trigger-based) does not include periodic parameters in the negotiation protocol, even though upper layers may still request periodic measurements. Clarify the semantics of periodic ranging attributes when used with non-EDCA based ranging. Signed-off-by: Avraham Stern <avraham.stern@intel.com> Reviewed-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20260111190221.b89cb3f68e1a.I7a9d8c6d1c66c77f1b43120a841101c96c3f19ad@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2026-01-27 13:40:30 +01:00
Avraham Stern	86c6b6e4d1	wifi: nl80211/cfg80211: add new FTM capabilities Add new capabilities to the PMSR FTM capabilities list. The new capabilities include 6 GHz support, supported number of spatial streams and supported number of LTF repetitions. Signed-off-by: Avraham Stern <avraham.stern@intel.com> Tested-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20260111190221.bf43785c18f6.Ic98cf9790ddee84bf88e5720b93c46c23af3c96c@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2026-01-27 13:40:25 +01:00
Bobby Eshleman	eafb64f40c	vsock: add netns to vsock core Add netns logic to vsock core. Additionally, modify transport hook prototypes to be used by later transport-specific patches (e.g., *_seqpacket_allow()). Namespaces are supported primarily by changing socket lookup functions (e.g., vsock_find_connected_socket()) to take into account the socket namespace and the namespace mode before considering a candidate socket a "match". This patch also introduces the sysctl /proc/sys/net/vsock/ns_mode to report the mode and /proc/sys/net/vsock/child_ns_mode to set the mode for new namespaces. Add netns functionality (initialization, passing to transports, procfs, etc...) to the af_vsock socket layer. Later patches that add netns support to transports depend on this patch. This patch changes the allocation of random ports for connectible vsocks in order to avoid leaking the random port range starting point to other namespaces. dgram_allow(), stream_allow(), and seqpacket_allow() callbacks are modified to take a vsk in order to perform logic on namespace modes. In future patches, the net will also be used for socket lookups in these functions. Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260121-vsock-vmtest-v16-1-2859a7512097@meta.com Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-01-27 10:45:38 +01:00
Eric Dumazet	df7388b3d7	net: inline get_netmem() and put_netmem() These helpers are used in network fast paths. Only call out-of-line helpers for netmem case. We might consider inlining __get_netmem() and __put_netmem() in the future. $ scripts/bloat-o-meter -t vmlinux.3 vmlinux.4 add/remove: 6/6 grow/shrink: 22/1 up/down: 2614/-646 (1968) Function old new delta pskb_carve 1669 1894 +225 gro_pull_from_frag0 - 206 +206 get_page 190 380 +190 skb_segment 3561 3747 +186 put_page 595 765 +170 skb_copy_ubufs 1683 1822 +139 __pskb_trim_head 276 401 +125 __pskb_copy_fclone 734 858 +124 skb_zerocopy 1092 1215 +123 pskb_expand_head 892 1008 +116 skb_split 828 940 +112 skb_release_data 297 409 +112 ___pskb_trim 829 941 +112 __skb_zcopy_downgrade_managed 120 226 +106 tcp_clone_payload 530 634 +104 esp_ssg_unref 191 294 +103 dev_gro_receive 1464 1514 +50 __put_netmem - 41 +41 __get_netmem - 41 +41 skb_shift 1139 1175 +36 skb_try_coalesce 681 714 +33 __pfx_put_page 112 144 +32 __pfx_get_page 32 64 +32 __pskb_pull_tail 1137 1168 +31 veth_xdp_get 250 267 +17 __pfx_gro_pull_from_frag0 - 16 +16 __pfx___put_netmem - 16 +16 __pfx___get_netmem - 16 +16 __pfx_put_netmem 16 - -16 __pfx_gro_try_pull_from_frag0 16 - -16 __pfx_get_netmem 16 - -16 put_netmem 114 - -114 get_netmem 130 - -130 napi_gro_frags 929 771 -158 gro_try_pull_from_frag0 196 - -196 Total: Before=22565857, After=22567825, chg +0.01% Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260122045720.1221017-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-25 13:18:53 -08:00
Eric Dumazet	87918dd4ea	net: inline net_is_devmem_iov() 1) Inline this small helper to reduce code size and decrease cpu costs. 2) Constify its argument. 3) Move it to include/net/netmem.h, as a prereq for the following patch. $ scripts/bloat-o-meter -t vmlinux.2 vmlinux.3 add/remove: 0/2 grow/shrink: 0/4 up/down: 0/-158 (-158) Function old new delta validate_xmit_skb 866 857 -9 __pfx_net_is_devmem_iov 16 - -16 net_is_devmem_iov 22 - -22 get_netmem 152 130 -22 put_netmem 140 114 -26 tcp_recvmsg_locked 3860 3797 -63 Total: Before=22566015, After=22565857, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260122045720.1221017-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-25 13:18:53 -08:00
Eric Dumazet	f6c3665b6d	bonding: annotate data-races around slave->last_rx slave->last_rx and slave->target_last_arp_rx[...] can be read and written locklessly. Add READ_ONCE() and WRITE_ONCE() annotations. syzbot reported: BUG: KCSAN: data-race in bond_rcv_validate / bond_rcv_validate write to 0xffff888149f0d428 of 8 bytes by interrupt on cpu 1: bond_rcv_validate+0x202/0x7a0 drivers/net/bonding/bond_main.c:3335 bond_handle_frame+0xde/0x5e0 drivers/net/bonding/bond_main.c:1533 __netif_receive_skb_core+0x5b1/0x1950 net/core/dev.c:6039 __netif_receive_skb_one_core net/core/dev.c:6150 [inline] __netif_receive_skb+0x59/0x270 net/core/dev.c:6265 netif_receive_skb_internal net/core/dev.c:6351 [inline] netif_receive_skb+0x4b/0x2d0 net/core/dev.c:6410 ... write to 0xffff888149f0d428 of 8 bytes by interrupt on cpu 0: bond_rcv_validate+0x202/0x7a0 drivers/net/bonding/bond_main.c:3335 bond_handle_frame+0xde/0x5e0 drivers/net/bonding/bond_main.c:1533 __netif_receive_skb_core+0x5b1/0x1950 net/core/dev.c:6039 __netif_receive_skb_one_core net/core/dev.c:6150 [inline] __netif_receive_skb+0x59/0x270 net/core/dev.c:6265 netif_receive_skb_internal net/core/dev.c:6351 [inline] netif_receive_skb+0x4b/0x2d0 net/core/dev.c:6410 br_netif_receive_skb net/bridge/br_input.c:30 [inline] NF_HOOK include/linux/netfilter.h:318 [inline] ... value changed: 0x0000000100005365 -> 0x0000000100005366 Fixes: `f5b2b966f0` ("[PATCH] bonding: Validate probe replies in ARP monitor") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://patch.msgid.link/20260122162914.2299312-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-23 13:55:56 -08:00
Jakub Kicinski	8e3245cb30	net: add queue config validation callback I imagine (tm) that as the number of per-queue configuration options grows some of them may conflict for certain drivers. While the drivers can obviously do all the validation locally doing so is fairly inconvenient as the config is fed to drivers piecemeal via different ops (for different params and NIC-wide vs per-queue). Add a centralized callback for validating the queue config in queue ops. The callback gets invoked before memory provider is installed, and in the future should also be called when ring params are modified. The validation is done after each layer of configuration. Since we can't fail MP un-binding we must make sure that the config is valid both before and after MP overrides are applied. This is moot for now since the sets of MP and device configs are disjoint. It will matter significantly in the future, so adding it now so that we don't forget. Link: https://patch.msgid.link/20260122005113.2476634-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-23 11:49:02 -08:00
Jakub Kicinski	fc1a78a25c	net: use netdev_queue_config() for mp restart We should follow the prepare/commit approach for queue configuration. The qcfg struct should be added to dev->cfg rather than directly to queue objects so that we can clone and discard the pending config easily. Remove the qcfg in struct netdev_rx_queue, and switch remaining callers to netdev_queue_config(). netdev_queue_config() will construct the qcfg on the fly based on device defaults and state of the queue. ndo_default_qcfg becomes optional because having the callback itself does not have any meaningful semantics to us. Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Link: https://patch.msgid.link/20260122005113.2476634-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-23 11:49:02 -08:00
Jakub Kicinski	b9ac2c60a3	net: introduce a trivial netdev_queue_config() We may choose to extend or reimplement the logic which renders the per-queue config. The drivers should not poke directly into the queue state. Add a helper for drivers to use when they want to query the config for a specific queue. Link: https://patch.msgid.link/20260122005113.2476634-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-23 11:49:01 -08:00
Paolo Abeni	0c09e89f6c	geneve: expose gso partial features for tunnel offload GSO partial features for tunnels do not require any kind of support from the underlying device: we can safely add them to the geneve UDP tunnel. The only point of attention is the skb required features propagation in the device xmit op: partial features must be stripped, except for UDP_TUNNEL*. Keep partial features disabled by default. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/d851ca8e928cf05d68310bcbaeaa5e9e0b01e058.1769011015.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-23 11:31:14 -08:00
Jakub Kicinski	9abf22075d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.19-rc7). Conflicts: drivers/net/ethernet/huawei/hinic3/hinic3_irq.c `b35a6fd37a` ("hinic3: Add adaptive IRQ coalescing with DIM") `fb2bb2a1eb` ("hinic3: Fix netif_queue_set_napi queue_index input parameter error") https://lore.kernel.org/fc0a7fdf08789a52653e8ad05281a0a849e79206.1768915707.git.zhuyikai1@h-partners.com drivers/net/wireless/ath/ath12k/mac.c drivers/net/wireless/ath/ath12k/wifi7/hw.c `3170757210` ("wifi: ath12k: Fix wrong P2P device link id issue") `c26f294fef` ("wifi: ath12k: Move ieee80211_ops callback to the arch specific module") https://lore.kernel.org/20260114123751.6a208818@canb.auug.org.au Adjacent changes: drivers/net/wireless/ath/ath12k/mac.c `8b8d6ee53d` ("wifi: ath12k: Fix scan state stuck in ABORTING after cancel_remain_on_channel") `914c890d3b` ("wifi: ath12k: Add framework for hardware specific ieee80211_ops registration") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-22 20:14:36 -08:00
Eric Dumazet	bc1f0b1c98	tcp: move tcp_rate_check_app_limited() to tcp.c tcp_rate_check_app_limited() is used from tcp_sendmsg_locked() fast path and from other callers. Move it to tcp.c so that it can be inlined in tcp_sendmsg_locked(). Small increase of code, for better TCP performance. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/0 grow/shrink: 1/0 up/down: 87/0 (87) Function old new delta tcp_sendmsg_locked 4217 4304 +87 Total: Before=22566462, After=22566549, chg +0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20260121095923.3134639-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-22 18:28:48 -08:00
Eric Dumazet	b814bdcecd	tcp: move tcp_rate_gen to tcp_input.c This function is called from one caller only, in TCP fast path. Move it to tcp_input.c so that compiler can inline it. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/2 grow/shrink: 1/0 up/down: 226/-300 (-74) Function old new delta tcp_ack 5405 5631 +226 __pfx_tcp_rate_gen 16 - -16 tcp_rate_gen 284 - -284 Total: Before=22566536, After=22566462, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20260121095923.3134639-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-22 18:28:48 -08:00
Pablo Neira Ayuso	f175b46d91	netfilter: nf_tables: add .abort_skip_removal flag for set types The pipapo set backend is the only user of the .abort interface so far. To speed up pipapo abort path, removals are skipped. The follow up patch updates the rbtree to build an array of ordered elements, then use binary search. This needs a new .abort interface but, unlike pipapo, it also needs to undo/remove elements. Add a flag and use it from the pipapo set backend. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-01-22 17:18:13 +01:00
Jakub Kicinski	9146fe2829	Another set of updates: - various small fixes for ath10k/ath12k/mwifiex/rsi - cfg80211 fix for HE bitrate overflow - mac80211 fixes - S1G beacon handling in scan - skb tailroom handling for HW encryption - CSA fix for multi-link - handling of disabled links during association -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmlyArkACgkQ10qiO8sP aACZxA/+N/q+DAHhVgqETwqOh80WAFTSuhDZsUXc6PNtFkOHvuZaeHePU+fn8hco +CkUVEnWYoNgUiaVg1697PJubp0psN7H+3+cq8kn8C/oNB2YnEV2kPCk4x7R8LCj TP6mad/Fb+I6Ct7XaCFymCS49eP4Bju7UBgYgLTiKJYvabA+Jim7LavZr8j0Tvra h9TNeA0I8+dgGppAWLTssrnsxp65xfSdq71mtRCFUrpEUHjzCl589PEv6BYcIRwv N50pm6Am5KJ1TZn5sVSYVfKiiG7UtL/pbXbsM5Cj/54yIFIgmE3bGI5MGAXRlWNG o/d/bo0rJg1xppipyZDEN5OJS6S0ijyC5TNKFFRX6IU2eZ8jJs7CWrQw6L8hWCbY G+lnVSh4yPAzLEk80S/zBHQccAPXnONtm+cFyPsPab79oAboxQVauuDdH1t5cxbQ 1HRn0RopyfEoPLmxsCcCSVcdF+hDwRZxAUO735Opnz/amDJNjBKgXTKezXSubfPH 5hvoAs/VZh7xSyJmEViDO5gavW1SX4nKlUixLYLUrXyq4i+eZkEIvqlCkIWuN9EW y/leooWqFzoXY3K8QL8nKQyHWg10WJL1l56tVz+Y+YAh8TvqgWWTB/O/u3g+/ZqO eDkaeKbbexUy6iyN0/TMb2RNnTERO0xOepMmIu3sw0HXhptkl1Q= =njuT -----END PGP SIGNATURE----- Merge tag 'wireless-2026-11-22' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Johannes Berg says: ==================== Another set of updates: - various small fixes for ath10k/ath12k/mwifiex/rsi - cfg80211 fix for HE bitrate overflow - mac80211 fixes - S1G beacon handling in scan - skb tailroom handling for HW encryption - CSA fix for multi-link - handling of disabled links during association * tag 'wireless-2026-11-22' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: wifi: cfg80211: ignore link disabled flag from userspace wifi: mac80211: apply advertised TTLM from association response wifi: mac80211: parse all TTLM entries wifi: mac80211: don't increment crypto_tx_tailroom_needed_cnt twice wifi: mac80211: don't perform DA check on S1G beacon wifi: ath12k: Fix wrong P2P device link id issue wifi: ath12k: fix 
dead lock while flushing management frames wifi: ath12k: Fix scan state stuck in ABORTING after cancel_remain_on_channel wifi: ath12k: cancel scan only on active scan vdev wifi: mwifiex: Fix a loop in mwifiex_update_ampdu_rxwinsize() wifi: mac80211: correctly check if CSA is active wifi: cfg80211: Fix bitrate calculation overflow for HE rates wifi: rsi: Fix memory corruption due to not set vif driver data size wifi: ath12k: don't force radio frequency check in freq_to_idx() wifi: ath12k: fix dma_free_coherent() pointer wifi: ath10k: fix dma_free_coherent() pointer ==================== Link: https://patch.msgid.link/20260122110248.15450-3-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-22 07:54:31 -08:00
Tonghao Zhang	11ea9b8a88	net: bonding: use workqueue to make sure peer notify updated in lacp mode The rtnl lock might be locked, preventing ad_cond_set_peer_notif() from acquiring the lock and updating send_peer_notif. This patch addresses the issue by using a workqueue. Since updating send_peer_notif does not require high real-time performance, such delayed updates are entirely acceptable. In fact, checking this value and using it in multiple places, all operations are protected at the same time by rtnl lock, such as - read send_peer_notif - send_peer_notif-- - bond_should_notify_peers By the way, rtnl lock is still required, when accessing bond.params.* for updating send_peer_notif. In lacp mode, resetting send_peer_notif in workqueue is safe, simple and effective way. Additionally, this patch introduces bond_peer_notify_may_events(), which is used to check whether an event should be sent. This function will be used in both patch 1 and 2. Cc: Jay Vosburgh <jv@jvosburgh.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Andrew Lunn <andrew+netdev@lunn.ch> Cc: Nikolay Aleksandrov <razor@blackwall.org> Cc: Hangbin Liu <liuhangbin@gmail.com> Cc: Jason Xing <kerneljasonxing@gmail.com> Suggested-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Tonghao Zhang <tonghao@bamaicloud.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://patch.msgid.link/f95accb5db0b10ce3ed2f834fc70f716c9abbb9c.1768709239.git.tonghao@bamaicloud.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-01-22 11:20:33 +01:00
Jakub Kicinski	a7c708dc0d	netfilter pull request nf-next-26-01-20 -----BEGIN PGP SIGNATURE----- iQJdBAABCABHFiEEgKkgxbID4Gn1hq6fcJGo2a1f9gAFAmlvnjQbFIAAAAAABAAO bWFudTIsMi41KzEuMTEsMiwyDRxmd0BzdHJsZW4uZGUACgkQcJGo2a1f9gBGtQ// fpGuA96XbcQVHDAkOYYAsjkHk4DPJvTdL4m4Pnw5SO3m5lVq0kw5Cp+6drv3q/Pd pMTuAUliVDOlK7wYemsThv/DgzqSO93uxrqTeX2J4tb/TgVYA9040aAfvWKo76iE GxSqfM255cMOAJ2zpBbgP3WfwiklGBlB7phPDTP3yoxbwH6TdtDCxcpVJ+M8wVN4 7CsRY3P8ZWZR1lW2V7sHERRABdsVfRmEtlFCEP+WARKHtkTyMZeqRsUtk3f50iRB AeEm3Uryj+q5s2Uof+uO8Lu0RxaBezJID4JksbWXEc/bsxaGKroPXx1qUsruwAJP 1TW+HL2yJx1xIydinoKFSD6PE7as0LeRvwCLFNbOqTrGefpPFX7sIdQNb/qMh1IN JpU0O0cwhWPYKjXD8pGcVNqTFs9CABRGSZBRUkSKMhWqwF1Hu0habF4nL70QkCqv FuhrelmNY/pDn7X5EQRII7cZMAxEL2lFtv+HERwZH2uZvDdMjE6Fu/+NdGQT4bCe d95dlnd1UkpMhI2CPsDKACXb5aqA5apWb7+2F5WcXzlI7XmNcHJksY8OVkjB0r7p +6IeBAYLPEUv+PYsR6g7vf6pAHA6I/axkMoK4vFXRnU3POnrVyZh3JHL/CChokWj cy8BYZukYSqvOnEoPJRpWkwO8opWDZmT5gIUSqHgKGk= =QWao -----END PGP SIGNATURE----- Merge tag 'nf-next-26-01-20' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Florian Westphal says: ==================== Subject: netfilter: updates for net-next 1) Speed up nftables transactions after earlier transaction failed. Due to a (harmeless) bug we remained in slow paranoia mode until a successful transaction completes. 2) Allow generic tracker to resolve clashes, this avoids very rare packet drops. From Yuto Hamaguchi. 3) Increase the cleanup budget to 64 entries in nf_conncount to reap more entries in one go, from Fernando Fernandez Mancera. 4) Allow icmp trackers to resolve clashes, this avoids very rare initial packet drop with test cases that have high-frequency pings. After this all trackers except tcp and sctp allow clash resolution. 5) Disentangle netfilter headers, don't include nftables/xtables headers in subsystems that are unrelated. 6) Don't rely on implicit includes coming from nf_conntrack_proto_gre.h. 
7) Allow nfnetlink_queue nfq instance struct to get accounted via memcg, from Scott Mitchell. 8) Reject bogus xt target/match data upfront via netlink policy in nft_compat interface rather than relying on x_tables API to do it. 9) Fix nf_conncount breakage when trying to limit loopback flows via prerouting rule, from Fernando Fernandez Mancera. This is a recent breakage but not seen as urgent enough to rush this via net tree at this late stage in development cycle. 10) Fix a possible off-by-one when parsing tcp option in xtables tcpmss match. Also handled via -next due to late stage in development cycle. * tag 'nf-next-26-01-20' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: xt_tcpmss: check remaining length before reading optlen netfilter: nf_conncount: fix tracking of connections from localhost netfilter: nft_compat: add more restrictions on netlink attributes netfilter: nfnetlink_queue: nfqnl_instance GFP_ATOMIC -> GFP_KERNEL_ACCOUNT allocation netfilter: nf_conntrack: don't rely on implicit includes netfilter: don't include xt and nftables.h in unrelated subsystems netfilter: nf_conntrack: enable icmp clash support netfilter: nf_conncount: increase the connection clean up limit to 64 netfilter: nf_conntrack: Add allow_clash to generic protocol handler netfilter: nf_tables: reset table validation state on abort ==================== Link: https://patch.msgid.link/20260120191803.22208-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-21 20:23:12 -08:00
Eric Dumazet	b8d9b7daf0	gro: inline tcp6_gro_complete() Remove one function call from GRO stack for native IPv6 + TCP packets. $ scripts/bloat-o-meter -t vmlinux.2 vmlinux.3 add/remove: 0/0 grow/shrink: 1/1 up/down: 298/-5 (293) Function old new delta ipv6_gro_complete 435 733 +298 tcp6_gro_complete 311 306 -5 Total: Before=22593532, After=22593825, chg +0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260120164903.1912995-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-21 19:28:32 -08:00
Eric Dumazet	87737cd76e	gro: inline tcp6_gro_receive() FDO/LTO are unable to inline tcp6_gro_receive() from ipv6_gro_receive(). Make sure tcp6_check_fraglist_gro() is called only when needed, so that compiler can leave it out-of-line. $ scripts/bloat-o-meter -t vmlinux.1 vmlinux.2 add/remove: 2/0 grow/shrink: 3/1 up/down: 1123/-253 (870) Function old new delta ipv6_gro_receive 1069 1846 +777 tcp6_check_fraglist_gro - 272 +272 ipv6_offload_init 218 274 +56 __pfx_tcp6_check_fraglist_gro - 16 +16 ipv6_gro_complete 433 435 +2 tcp6_gro_receive 959 706 -253 Total: Before=22592662, After=22593532, chg +0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260120164903.1912995-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-21 19:28:32 -08:00
Eric Dumazet	a4674aa58b	tcp: preserve const qualifier in tcp_rsk() and inet_rsk() We can change tcp_rsk() and inet_rsk() to propagate their argument const qualifier thanks to container_of_const(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260120125353.1470456-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-21 19:20:04 -08:00
Eric Dumazet	670ade3bfa	tcp: move tcp_rate_skb_delivered() to tcp_input.c tcp_rate_skb_delivered() is only called from tcp_input.c. Move it there and make it static. Both gcc and clang are (auto)inlining it, TCP performance is increased at a small space cost. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/2 grow/shrink: 3/0 up/down: 509/-187 (322) Function old new delta tcp_sacktag_walk 1682 1867 +185 tcp_ack 5230 5405 +175 tcp_shifted_skb 437 586 +149 __pfx_tcp_rate_skb_delivered 16 - -16 tcp_rate_skb_delivered 171 - -171 Total: Before=22566192, After=22566514, chg +0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20260118123204.2315993-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-20 19:03:09 -08:00
Jakub Kicinski	677a51790b	Merge tag 'net-queue-rx-buf-len-v9' of https://github.com/isilence/linux Pavel Begunkov says: ==================== Add support for providers with large rx buffer Many modern NICs support configurable receive buffer lengths, and zcrx and memory providers can use buffers larger than 4K to improve performance. When paired with hw-gro larger rx buffer sizes can drastically reduce the number of buffers traversing the stack and save a lot of processing time. It also allows to give to users larger contiguous chunks of data. Single stream benchmarks showed up to ~30% CPU util improvement. E.g. comparison for 4K vs 32K buffers using a 200Gbit NIC: packets=23987040 (MB=2745098), rps=199559 (MB/s=22837) CPU %usr %nice %sys %iowait %irq %soft %idle 0 1.53 0.00 27.78 2.72 1.31 66.45 0.22 packets=24078368 (MB=2755550), rps=200319 (MB/s=22924) CPU %usr %nice %sys %iowait %irq %soft %idle 0 0.69 0.00 8.26 31.65 1.83 57.00 0.57 This series adds net infrastructure for memory providers configuring the size and implements it for bnxt. It's an opt-in feature for drivers, they should advertise support for the parameter in the qops and must check if the hardware supports the given size. It's limited to memory providers as it drastically simplifies implementation. It doesn't affect the fast path zcrx uAPI, and the user exposed parameter is defined in zcrx terms, which allows it to be flexible and adjusted in the future. 
A liburing example can be found at [2] full branch: [1] https://github.com/isilence/linux.git zcrx/large-buffers-v8 Liburing example: [2] https://github.com/isilence/liburing.git zcrx/rx-buf-len * tag 'net-queue-rx-buf-len-v9' of https://github.com/isilence/linux: io_uring/zcrx: document area chunking parameter selftests: iou-zcrx: test large chunk sizes eth: bnxt: support qcfg provided rx page size eth: bnxt: adjust the fill level of agg queues with larger buffers eth: bnxt: store rx buffer size per queue net: pass queue rx page size from memory provider net: add bare bone queue configs net: reduce indent of struct netdev_queue_mgmt_ops members net: memzero mp params when closing a queue ==================== Link: https://patch.msgid.link/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-20 18:10:04 -08:00
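One plausible reading of the "~30% CPU util" figure in the cover letter above is the drop in busy CPU time between the two quoted runs. The sketch below is worked arithmetic under two assumptions that the message does not state: that the first row is the 4K run and the second the 32K run, and that "busy" means usr + nice + sys + irq + soft (excluding iowait and idle). The `busy` helper is hypothetical.

```python
# CPU rows from the cover letter: (usr, nice, sys, iowait, irq, soft, idle)
RUN_A = (1.53, 0.00, 27.78, 2.72, 1.31, 66.45, 0.22)   # assumed: 4K buffers
RUN_B = (0.69, 0.00, 8.26, 31.65, 1.83, 57.00, 0.57)   # assumed: 32K buffers

def busy(row):
    """Percentage of time the CPU was executing (not idle, not in iowait)."""
    usr, nice, sys_, iowait, irq, soft, idle = row
    return usr + nice + sys_ + irq + soft

saved = busy(RUN_A) - busy(RUN_B)
print(round(busy(RUN_A), 2), round(busy(RUN_B), 2), round(saved, 2))
```

Under these assumptions the busy share falls from about 97% to about 68%, roughly 29 percentage points, while the reported throughput stays nearly unchanged (22837 vs 22924 MB/s).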
Jakub Kicinski	8766d61a1d	Revert "Merge branch 'netkit-support-for-io_uring-zero-copy-and-af_xdp'" This reverts commit `77b9c4a438`, reversing changes made to 4515ec4ad58a37e70a9e1256c0b993958c9b7497: `931420a2fc` ("selftests/net: Add netkit container tests") `ab771c938d` ("selftests/net: Make NetDrvContEnv support queue leasing") `6be87fbb27` ("selftests/net: Add env for container based tests") `61d99ce3df` ("selftests/net: Add bpf skb forwarding program") `920da36341` ("netkit: Add xsk support for af_xdp applications") `eef51113f8` ("netkit: Add netkit notifier to check for unregistering devices") `b5ef109d22` ("netkit: Implement rtnl_link_ops->alloc and ndo_queue_create") `b5c3fa4a0b` ("netkit: Add single device mode for netkit") `0073d2fd67` ("xsk: Proxy pool management for leased queues") `1ecea95dd3` ("xsk: Extend xsk_rcv_check validation") `804bf334d0` ("net: Proxy netdev_queue_get_dma_dev for leased queues") `0caa9a8dde` ("net: Proxy net_mp_{open,close}_rxq for leased queues") `ff8889ff91` ("net, ethtool: Disallow leased real rxqs to be resized") `9e2103f361` ("net: Add lease info to queue-get response") `31127dedde` ("net: Implement netdev_nl_queue_create_doit") `a5546e18f7` ("net: Add queue-create operation") The series will conflict with io_uring work, and the code needs more polish. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-20 18:06:01 -08:00
Florian Westphal	d00453b6e3	netfilter: nf_conntrack: don't rely on implicit includes Several netfilter compilation units rely on implicit includes coming from nf_conntrack_proto_gre.h. Clean this up and add the required dependencies where needed. nf_conntrack.h requires net_generic() helper. Place various gre/ppp/vlan includes to where they are needed. Signed-off-by: Florian Westphal <fw@strlen.de>	2026-01-20 16:23:37 +01:00
Florian Westphal	910d271227	netfilter: don't include xt and nftables.h in unrelated subsystems conntrack, xtables and nftables are distinct subsystems, don't use them in other subsystems. Signed-off-by: Florian Westphal <fw@strlen.de>	2026-01-20 16:23:37 +01:00
Fernando Fernandez Mancera	21d033e472	netfilter: nf_conncount: increase the connection clean up limit to 64 After the optimization to only perform one GC per jiffy, a new problem was introduced. If more than 8 new connections are tracked per jiffy the list won't be cleaned up fast enough possibly reaching the limit wrongly. In order to prevent this issue, only skip the GC if it was already triggered during the same jiffy and the increment is lower than the clean up limit. In addition, increase the clean up limit to 64 connections to avoid triggering GC too often and do more effective GCs. This has been tested using a HTTP server and several performance tools while having nft_connlimit/xt_connlimit or OVS limit configured. Output of slowhttptest + OVS limit at 52000 connections: slow HTTP test status on 340th second: initializing: 0 pending: 432 connected: 51998 error: 0 closed: 0 service available: YES Fixes: `d265929930` ("netfilter: nf_conncount: reduce unnecessary GC") Reported-by: Aleksandra Rukomoinikova <ARukomoinikova@k2.cloud> Closes: https://lore.kernel.org/netfilter/b2064e7b-0776-4e14-adb6-c68080987471@k2.cloud/ Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-01-20 16:23:37 +01:00
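The GC rule described in the nf_conncount commit above can be sketched as a small predicate. This is an illustration of the condition as worded in the commit message, not the kernel code. The function name, parameter names, and the way the "increment" is tracked are all hypothetical.

```python
CLEANUP_LIMIT = 64  # clean-up limit named in the commit message

def should_skip_gc(now_jiffy, last_gc_jiffy, added_since_gc, limit=CLEANUP_LIMIT):
    """Skip GC only if it already ran in this jiffy AND fewer than
    `limit` connections were added since that run."""
    return now_jiffy == last_gc_jiffy and added_since_gc < limit

# Same jiffy, few new connections: the earlier GC is still good enough.
print(should_skip_gc(1000, 1000, 5))     # True
# Same jiffy, but the list grew by the limit or more: run GC again.
print(should_skip_gc(1000, 1000, 64))    # False
# A new jiffy always allows GC.
print(should_skip_gc(1001, 1000, 0))     # False
```

The second case is the one the commit targets: a burst of new connections within a single jiffy no longer waits for the next jiffy before the list is cleaned.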
David Wei	0caa9a8dde	net: Proxy net_mp_{open,close}_rxq for leased queues When a process in a container wants to setup a memory provider, it will use the virtual netdev and a leased rxq, and call net_mp_{open,close}_rxq to try and restart the queue. At this point, proxy the queue restart on the real rxq in the physical netdev. For memory providers (io_uring zero-copy rx and devmem), it causes the real rxq in the physical netdev to be filled from a memory provider that has DMA mapped memory from a process within a container. Signed-off-by: David Wei <dw@davidwei.uk> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260115082603.219152-6-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-01-20 11:58:49 +01:00
Daniel Borkmann	9e2103f361	net: Add lease info to queue-get response Populate nested lease info to the queue-get response that returns the ifindex, queue id with type and optionally netns id if the device resides in a different netns. Example with ynl client: # ip a [...] 4: enp10s0f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp/id:24 qdisc mq state UP group default qlen 1000 link/ether e8:eb:d3:a3:43:f6 brd ff:ff:ff:ff:ff:ff inet 10.0.0.2/24 scope global enp10s0f0np0 valid_lft forever preferred_lft forever inet6 fe80::eaeb:d3ff:fea3:43f6/64 scope link proto kernel_ll valid_lft forever preferred_lft forever [...] # ethtool -i enp10s0f0np0 driver: mlx5_core [...] # ./pyynl/cli.py \ --spec ~/netlink/specs/netdev.yaml \ --do queue-get \ --json '{"ifindex": 4, "id": 15, "type": "rx"}' {'id': 15, 'ifindex': 4, 'lease': {'ifindex': 8, 'netns-id': 0, 'queue': {'id': 1, 'type': 'rx'}}, 'napi-id': 8227, 'type': 'rx', 'xsk': {}} # ip netns list foo (id: 0) # ip netns exec foo ip a [...] 8: nk@NONE: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff inet6 fe80::200:ff:fe00:0/64 scope link proto kernel_ll valid_lft forever preferred_lft forever [...] # ip netns exec foo ethtool -i nk driver: netkit [...] # ip netns exec foo ls /sys/class/net/nk/queues/ rx-0 rx-1 tx-0 # ip netns exec foo ./pyynl/cli.py \ --spec ~/netlink/specs/netdev.yaml \ --do queue-get \ --json '{"ifindex": 8, "id": 1, "type": "rx"}' {'id': 1, 'ifindex': 8, 'type': 'rx'} Note that the caller of netdev_nl_queue_fill_one() holds the netdevice lock. For the queue-get we do not lock both devices. When queues get {un,}leased, both devices are locked, thus if __netif_get_rx_queue_peer() returns true, the peer pointer points to a valid device. The netns-id is fetched via peernet2id_alloc() similarly as done in OVS. 
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Co-developed-by: David Wei <dw@davidwei.uk> Signed-off-by: David Wei <dw@davidwei.uk> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260115082603.219152-4-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-01-20 11:58:49 +01:00
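The queue-get replies quoted in the commit above are printed as Python dict literals, so the nested lease info can be pulled out with the standard library. A minimal sketch follows; the `lease_target` helper is hypothetical, and the key names are taken from the example output rather than from the netlink spec.

```python
import ast

# Reply for the physical device's rx queue 15, copied from the commit message.
LEASED = ("{'id': 15, 'ifindex': 4, 'lease': {'ifindex': 8, 'netns-id': 0, "
          "'queue': {'id': 1, 'type': 'rx'}}, 'napi-id': 8227, 'type': 'rx', 'xsk': {}}")
# Reply seen from inside the netns for the virtual device's queue: no lease key.
PLAIN = "{'id': 1, 'ifindex': 8, 'type': 'rx'}"

def lease_target(reply):
    """Return (ifindex, queue type, queue id, netns-id) of the lease, or None."""
    lease = reply.get("lease")
    if lease is None:
        return None
    queue = lease["queue"]
    # netns-id is described as optional, so use .get() for it.
    return (lease["ifindex"], queue["type"], queue["id"], lease.get("netns-id"))

print(lease_target(ast.literal_eval(LEASED)))   # (8, 'rx', 1, 0)
print(lease_target(ast.literal_eval(PLAIN)))    # None
```

`ast.literal_eval` is used instead of `json.loads` because the quoted output uses single quotes, which is not valid JSON.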
Daniel Borkmann	31127dedde	net: Implement netdev_nl_queue_create_doit Implement netdev_nl_queue_create_doit which creates a new rx queue in a virtual netdev and then leases it to a rx queue in a physical netdev. Example with ynl client: # ./pyynl/cli.py \ --spec ~/netlink/specs/netdev.yaml \ --do queue-create \ --json '{"ifindex": 8, "type": "rx", "lease": {"ifindex": 4, "queue": {"type": "rx", "id": 15}}}' {'id': 1} Note that the netdevice locking order is always from the virtual to the physical device. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Co-developed-by: David Wei <dw@davidwei.uk> Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260115082603.219152-3-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-01-20 11:58:49 +01:00
Benjamin Berg	50b359896f	wifi: cfg80211: ignore link disabled flag from userspace When the AP has an advertised TID to Link Mapping (TTLM) it shall include the element in the association response. As such, when this element is present it needs to be used for the currently dormant links. See Draft P802.11REVmf_D1.0 section 35.3.7.2.3 ("Negotiation of TTLM") for the details. The flag is also not usable in case userspace wants to specify a negotiated TTLM during association. Note that for the link reconfiguration case, mac80211 did not use the information. Draft P802.11REVmf_D1.0 states in section 35.3.6.4 ("Link reconfiguration to the setup links") that we "shall operate with all the TIDs mapped to the newly added links ..." All this means that the flag is not needed. The implementation should parse the information from the association response. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Reviewed-by: Johannes Berg <johannes.berg@intel.com> Reviewed-by: Ilan Peer <ilan.peer@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20260118093904.754e057896a5.Ifd06f5ef839a93bfd54d0593dc932870f95f3242@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2026-01-20 10:02:01 +01:00
Eric Dumazet	03e9d91dd6	ipv6: annotate data-races in ip6_multipath_hash_{policy,fields}() Add missing READ_ONCE() when reading sysctl values. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260115094141.3124990-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-19 09:56:42 -08:00
Eric Dumazet	3681282530	ipv6: annotate data-race in ipv6_can_nonlocal_bind() Add a missing READ_ONCE(), and add const qualifiers to the two parameters. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260115094141.3124990-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-19 09:56:42 -08:00
Eric Dumazet	ded139b59b	ipv6: annotate data-races from ip6_make_flowlabel() Use READ_ONCE() to read sysctl values in ip6_make_flowlabel() and ip6_make_flowlabel() Add a const qualifier to 'struct net' parameters. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260115094141.3124990-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-19 09:56:42 -08:00
Eric Dumazet	e82a347d92	ipv6: add sysctl_ipv6_flowlabel group Group together following struct netns_sysctl_ipv6 fields: - flowlabel_consistency - auto_flowlabels - flowlabel_state_ranges After this patch, ip6_make_flowlabel() uses a single cache line to fetch auto_flowlabels and flowlabel_state_ranges (instead of two before the patch). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260115094141.3124990-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-19 09:56:42 -08:00
Eric Dumazet	f10ab9d3a7	tcp: move tcp_rate_skb_sent() to tcp_output.c It is only called from __tcp_transmit_skb() and __tcp_retransmit_skb(). Move it in tcp_output.c and make it static. clang compiler is now able to inline it from __tcp_transmit_skb(). gcc compiler inlines it in the two callers, which is also fine. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20260114165109.1747722-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-17 15:43:16 -08:00
Jakub Kicinski	c27022497d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.19-rc6). No conflicts, or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-15 18:02:48 -08:00

19299 Commits