Once xsk_skb_init_misc() has been called on an skb, its destructor is
set to xsk_destruct_skb(), which submits the descriptor address(es) to
the completion queue and advances the CQ producer. If such an skb is
subsequently freed via kfree_skb() along an error path - before the
skb has ever been handed to the driver - the destructor still runs and
submits a bogus, half-initialized address to the CQ.
Postpone this init until the allocation of the first frag has completed
successfully. Before the init, the skb can be safely freed by
kfree_skb().
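A minimal sketch of the resulting ordering (xsk_skb_init_misc() is the
helper named above; its argument list and the surrounding allocation code
are illustrative here, not the exact upstream diff):

	skb = sock_alloc_send_skb(&xs->sk, hr + len + tr, 1, &err);
	if (unlikely(!skb))
		return ERR_PTR(err);

	err = copy_first_frag(skb, desc);	/* illustrative placeholder */
	if (unlikely(err)) {
		/* destructor not yet set: no bogus CQ submission */
		kfree_skb(skb);
		return ERR_PTR(err);
	}

	/* only now arm the completion-queue destructor */
	xsk_skb_init_misc(skb, xs);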
Closes: https://lore.kernel.org/all/20260419045822.843BFC2BCAF@smtp.kernel.org/
Fixes: c30d084960 ("xsk: avoid overwriting skb fields for multi-buffer traffic")
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20260502200722.53960-6-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When xsk_build_skb() processes multi-buffer packets in copy mode, the
first descriptor stores data into the skb linear area without adding
any frags, so nr_frags stays at 0. The caller then sets xs->skb = skb
to accumulate subsequent descriptors.
If a continuation descriptor fails (e.g. alloc_page returns NULL with
-EAGAIN), we jump to free_err where the condition:
	if (skb && !skb_shinfo(skb)->nr_frags)
		kfree_skb(skb);
evaluates to true because nr_frags is still 0 (the first descriptor
used the linear area, not frags). This frees the skb while xs->skb
still points to it, creating a dangling pointer. On the next transmit
attempt or socket close, xs->skb is dereferenced, causing a
use-after-free or double-free.
Fix by using a !xs->skb check to handle the first-frag situation, ensuring
we only free skbs that were freshly allocated in this call
(xs->skb is NULL) and never free an in-progress multi-buffer skb that
the caller still references.
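A sketch of the adjusted error path (label and variable names as used in
the text above; not necessarily the literal upstream diff):

free_err:
	/* Only free an skb that was freshly allocated in this call; an
	 * in-progress multi-buffer skb is still referenced via xs->skb.
	 */
	if (!xs->skb && skb)
		kfree_skb(skb);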
Closes: https://lore.kernel.org/all/20260415082654.21026-4-kerneljasonxing@gmail.com/
Fixes: 6b9c129c2f ("xsk: remove @first_frag from xsk_build_skb()")
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20260502200722.53960-5-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When a first descriptor (xs->skb == NULL) triggers -EOVERFLOW in
xsk_build_skb_zerocopy() (e.g., MAX_SKB_FRAGS exceeded), the
free_err -EOVERFLOW handler unconditionally dereferences xs->skb
via xsk_inc_num_desc(xs->skb) and xsk_drop_skb(xs->skb), causing
a NULL pointer dereference.
Fix this by guarding the existing xsk_inc_num_desc()/xsk_drop_skb()
calls with an xs->skb check (for the continuation case), and add
an else branch for the first-descriptor case that manually cancels
the one reserved CQ slot and increments invalid_descs by one to
account for the single invalid descriptor.
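A sketch of the guarded handler (xsk_inc_num_desc() and xsk_drop_skb() are
the calls named above; the first-descriptor branch is shown only as
comments since the exact CQ-cancel helper is not spelled out here):

	if (err == -EOVERFLOW) {
		if (xs->skb) {
			/* continuation: drop the whole in-progress skb */
			xsk_inc_num_desc(xs->skb);
			xsk_drop_skb(xs->skb);
		} else {
			/* first descriptor: no skb to drop; cancel the one
			 * reserved CQ slot and bump invalid_descs by one
			 */
		}
	}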
Fixes: cf24f5a5fe ("xsk: add support for AF_XDP multi-buffer on Tx path")
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20260502200722.53960-4-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix it by explicitly calling kfree_skb() before returning to the
caller.
How to reproduce it in virtio_net:
1. the current skb is the first one (which means xs->skb is NULL) and
hits the MAX_SKB_FRAGS limit.
2. xsk_build_skb_zerocopy() returns -EOVERFLOW.
3. the caller xsk_build_skb() clears skb with 'skb = NULL;'. This
is why the bug can be triggered.
4. there is no longer any chance to free this skb.
Note that if in this case the xs->skb is not NULL, xsk_build_skb() will
call xsk_drop_skb(xs->skb) to do the right thing.
Fixes: cf24f5a5fe ("xsk: add support for AF_XDP multi-buffer on Tx path")
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20260502200722.53960-3-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
skb_checksum_help() is a common helper that writes the folded
16-bit checksum back via skb->data + csum_start + csum_offset,
i.e. it relies on the skb's linear head and fails (with WARN_ONCE
and -EINVAL) when skb_headlen() is 0.
AF_XDP generic xmit takes two very different paths depending on the
netdev. Drivers that advertise IFF_TX_SKB_NO_LINEAR (e.g. virtio_net)
skip the "copy payload into a linear head" step on purpose as a
performance optimisation: xsk_build_skb_zerocopy() only attaches UMEM
pages as frags and never calls skb_put(), so skb_headlen() stays 0
for the whole skb. For these skbs there is simply no linear area for
skb_checksum_help() to write the csum into - the sw-csum fallback is
structurally inapplicable.
Catch this and reject the combination at setup time: rejecting at
bind() converts the silent per-packet failure into a synchronous,
actionable -EOPNOTSUPP. HW csum and
launch_time metadata on IFF_TX_SKB_NO_LINEAR drivers are unaffected
because they do not call skb_checksum_help().
Without the patch, every descriptor carrying 'XDP_TX_METADATA |
XDP_TXMD_FLAGS_CHECKSUM' produces:
1) a WARN_ONCE "offset (N) >= skb_headlen() (0)" from skb_checksum_help(),
2) sendmsg() returning -EINVAL without consuming the descriptor
(invalid_descs is not incremented),
3) a wedged TX ring: __xsk_generic_xmit() does not advance the
consumer on non-EOVERFLOW errors, so the next sendmsg() re-reads
the same descriptor and re-hits the same WARN until the socket
is closed.
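A sketch of the kind of bind-time check described (XDP_UMEM_TX_SW_CSUM and
IFF_TX_SKB_NO_LINEAR are existing flags, but the exact placement and the
field tested are assumptions of this sketch, not the literal patch):

	if ((dev->priv_flags & IFF_TX_SKB_NO_LINEAR) &&
	    (xs->umem->flags & XDP_UMEM_TX_SW_CSUM))
		return -EOPNOTSUPP;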
Closes: https://lore.kernel.org/all/20260419045822.843BFC2BCAF@smtp.kernel.org/#t
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Fixes: 30c3055f9c ("xsk: wrap generic metadata handling onto separate function")
Link: https://patch.msgid.link/20260502200722.53960-2-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
fq_codel_dump_class_stats() acquires the qdisc spinlock only when
requested to follow the flow->head chain.
As we did in sch_cake recently, add the missing READ_ONCE()/WRITE_ONCE()
annotations.
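The pattern, sketched on one of the touched fields (fq_codel_flow's
'dropped' counter; the full patch annotates the other fields the same
way):

	/* writer side, under the qdisc lock */
	WRITE_ONCE(flow->dropped, flow->dropped + 1);

	/* lockless reader in fq_codel_dump_class_stats() */
	qs.drops = READ_ONCE(flow->dropped);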
Fixes: edb09eb17e ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260504163842.1162001-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'nf-26-05-05' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
IPVS fixes for net
The following batch contains IPVS fixes for net to address issues
from the latest net-next pull request.
Julian Anastasov made the following summary:
1-3) Fixes for the recently added resizable hash tables
4) dest from trash can be leaked if ip_vs_start_estimator() fails
5) fixed races and locking for the estimation kthreads
6) fix for wrong roundup_pow_of_two() usage in the resizable hash
tables
7-8) v2 of the changes from Waiman Long to properly guard against
the housekeeping_cpumask() updates:
https://lore.kernel.org/netfilter-devel/20260331165015.2777765-1-longman@redhat.com/
I added the missing Fixes tag. The original description:
Since commit 041ee6f372 ("kthread: Rely on HK_TYPE_DOMAIN for preferred
affinity management"), the HK_TYPE_KTHREAD housekeeping cpumask may no
longer be correct in showing the actual CPU affinity of kthreads that
have no predefined CPU affinity. As the ipvs networking code is still
using HK_TYPE_KTHREAD, we need to make HK_TYPE_KTHREAD reflect the
reality.
This patch series makes HK_TYPE_KTHREAD an alias of HK_TYPE_DOMAIN
and uses RCU to protect access to the HK_TYPE_KTHREAD housekeeping
cpumask.
Julian plans to post an nf-next patch to limit the number of connections
by using a "conn_max" sysctl. He and Simon Horman agreed that the lack
of a connection limit is an old problem and not a stopper for this
patchset.
* tag 'nf-26-05-05' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
sched/isolation: Make HK_TYPE_KTHREAD an alias of HK_TYPE_DOMAIN
ipvs: Guard access of HK_TYPE_KTHREAD cpumask with RCU
ipvs: fix shift-out-of-bounds in ip_vs_rht_desired_size
ipvs: fix races around est_mutex and est_cpulist
ipvs: do not leak dest after get from dest trash
ipvs: fix the spin_lock usage for RT build
ipvs: fix races around the conn_lfactor and svc_lfactor sysctl vars
ipvs: fixes for the new ip_vs_status info
====================
Link: https://patch.msgid.link/20260505001648.360569-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This basically defaulted to m until recently, since IPV6 defaulted to
m. Since IPV6 was changed to a boolean with a default of y, IPV6_SIT
started defaulting to built-in as well. This results in a surprise
sit0 device by default for defconfig (and defconfig-derived config)
users at boot. For me, this broke an (admittedly non-robust) script.
Preserve the behaviour of most configs by not building this module,
which is probably seldom used compared to IPv6 as a whole, into the
kernel.
Fixes: 309b905dee ("ipv6: convert CONFIG_IPV6 to built-in only and clean up Kconfigs")
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Link: https://patch.msgid.link/20260503192515.290900-2-hi@alyssa.is
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A zerocopy send can fail after user pages have been pinned but before
the message is attached to the sending socket.
The purge path currently infers zerocopy state from rm->m_rs, so an
unqueued message can be cleaned up as if it owned normal payload pages.
However, zerocopy ownership is really determined by the presence of
op_mmp_znotifier, regardless of whether the message has reached the
socket queue.
Capture op_mmp_znotifier up front in rds_message_purge() and use it as
the cleanup discriminator. If the message is already associated with a
socket, keep the existing completion path. Otherwise, drop the pinned
page accounting directly and release the notifier before putting the
payload pages.
This keeps early send failure cleanup consistent with the zerocopy
lifetime rules without changing the normal queued completion path.
Fixes: 0cebaccef3 ("rds: zerocopy Tx support.")
Cc: stable@kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Co-developed-by: Xiao Liu <lx24@stu.ynu.edu.cn>
Signed-off-by: Xiao Liu <lx24@stu.ynu.edu.cn>
Signed-off-by: Nan Li <tonanli66@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Reviewed-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/d2ea98a6313d5467bac00f7c9fef8c7acddb9258.1777550074.git.tonanli66@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
vports are used concurrently and protected by RCU, so netdev_put()
must happen after the RCU grace period - either in an RCU callback or
after synchronize_net(). The rtnl_delete_link() must happen under
RTNL and so can't be executed in RCU context. Calling synchronize_net()
while holding RTNL is not a good idea for performance and system
stability under load in general, so calling netdev_put() from an RCU
callback is the right solution here.
However,
when the device is deleted, rtnl_unlock() will call netdev_run_todo()
and block until all the references are gone. In the current code this
means that we never reach the call_rcu() and the vport is never freed
and the reference is never released, causing a self-deadlock on device
removal.
Fix that by moving the call_rcu() before the rtnl_unlock(), so the
scheduled RCU callback will be executed when synchronize_net() is
called from rtnl_unlock()->netdev_run_todo() while the RTNL itself
is already released.
Fixes: 6931d21f87 ("openvswitch: defer tunnel netdev_put to RCU release")
Cc: stable@vger.kernel.org
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Aaron Conole <aconole@redhat.com>
Link: https://patch.msgid.link/20260430233848.440994-2-i.maximets@ovn.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When a tunnel vport is created it first creates the tunnel device, e.g.,
with geneve_dev_create_fb(), then it calls ovs_netdev_link() to take a
reference and link it to the device that represents openvswitch datapath.
The creation of the device is happening under RTNL, but then RTNL is
released and re-acquired to find the device by name. It is technically
possible for the tunnel device to be re-named or deleted within that
window while RTNL is not held, and some other device created in its
place. This will cause a non-tunnel device to be referenced in the
vport and tunnel-specific functions used on it, e.g. vxlan_get_options()
that directly casts the private netdev data into a struct vxlan_dev
causing an invalid memory access:
BUG: KASAN: slab-use-after-free in vxlan_get_options+0x323/0x3a0
vxlan_get_options+0x323/0x3a0
ovs_vport_cmd_new+0x6e3/0xd30
Fix that by taking a reference to the just created device before
releasing RTNL. This ensures that the device in the vport is always
the one that was just created. The search by name is only needed
for a standard vport-netdev that links pre-existing devices, so that
functionality and device type checks are moved to netdev_create().
It is also awkward that ovs_netdev_link() takes ownership of the vport
and destroys it on failure. It doesn't know the type of the port it is
dealing with, so we need to pass down an indicator that it's a tunnel,
so the link can be properly deleted on failure.
It's possible to refactor the logic to make ovs_netdev_link() do
only the linking part and let the callers perform a proper destruction,
but that would be much more code for each legacy tunnel port type, so it
is not worth it for the bug fix.
Fixes: 614732eaa1 ("openvswitch: Use regular VXLAN net_device device")
Reported-by: Yuan Tan <tanyuan98@outlook.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Reported-by: Yang Yang <n05ec@lzu.edu.cn>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Link: https://patch.msgid.link/20260430213349.407991-1-i.maximets@ovn.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
MSG_SPLICE_PAGES can attach pages from a pipe directly to an skb. TCP
marks such skbs with SKBFL_SHARED_FRAG after skb_splice_from_iter(),
so later paths that may modify packet data can first make a private
copy. The IPv4/IPv6 datagram append paths did not set this flag when
splicing pages into UDP skbs.
That leaves an ESP-in-UDP packet made from shared pipe pages looking
like an ordinary uncloned nonlinear skb. ESP input then takes the no-COW
fast path for uncloned skbs without a frag_list and decrypts in place
over data that is not owned privately by the skb.
Mark IPv4/IPv6 datagram splice frags with SKBFL_SHARED_FRAG, matching
TCP. Also make ESP input fall back to skb_cow_data() when the flag is
present, so ESP does not decrypt externally backed frags in place.
Private nonlinear skb frags still use the existing fast path.
This intentionally does not change ESP output. In esp_output_head(),
the path that appends the ESP trailer to existing skb tailroom without
calling skb_cow_data() is not reachable for nonlinear skbs:
skb_tailroom() returns zero when skb->data_len is nonzero, while ESP
tailen is positive. Thus ESP output will either use the separate
destination-frag path or fall back to skb_cow_data().
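A sketch of the datagram-append side of the change, mirroring what TCP
already does after skb_splice_from_iter() (exact placement inside the
IPv4/IPv6 append helpers is not shown here):

		err = skb_splice_from_iter(skb, &msg->msg_iter, copy,
					   sk->sk_allocation);
		if (err < 0)
			goto error;
		copy = err;

		/* frags now point at caller-owned pipe pages */
		skb_shinfo(skb)->flags |= SKBFL_SHARED_FRAG;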
Fixes: cac2661c53 ("esp4: Avoid skb_cow_data whenever possible")
Fixes: 03e2a30f6a ("esp6: Avoid skb_cow_data whenever possible")
Fixes: 7da0dde684 ("ip, udp: Support MSG_SPLICE_PAGES")
Fixes: 6d8192bd69 ("ip6, udp6: Support MSG_SPLICE_PAGES")
Reported-by: Hyunwoo Kim <imv4bel@gmail.com>
Reported-by: Kuan-Ting Chen <h3xrabbit@gmail.com>
Tested-by: Hyunwoo Kim <imv4bel@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Kuan-Ting Chen <h3xrabbit@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
psp_dev_rcv() unconditionally removes a fixed PSP_ENCAP_HLEN, even
when psph->hdrlen indicates that the PSP header carries optional
fields. A frame whose PSP header advertises a non-zero VC or any
extension would therefore be silently mis-decapsulated: option bytes
would spill into the inner packet head and downstream parsing would
fail on a corrupted skb.
Compute the full PSP header length from psph->hdrlen, pull the
optional bytes into the linear region, and strip the whole header
when decapsulating. Optional fields (VC, ...) are still ignored,
just discarded with the rest of the header instead of leaking.
crypt_offset and the VIRT flag are intentionally not validated here
- callers know their device's PSP implementation and can decide.
Both in-tree callers gate on hardware-validated PSP, so this is a
correctness fix rather than a reachable corruption path under
current configurations.
Fixes: 0eddb8023c ("psp: provide decapsulation and receive helper for drivers")
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Daniel Zahka <daniel.zahka@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: David Carlier <devnexen@gmail.com>
Link: https://patch.msgid.link/20260502141945.14484-1-devnexen@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
I was misled by the synchronize_rcu() call [1] in
netdev_name_node_alt_destroy(), which gave a false sense of RCU safety
at delete time.
We have to use list_del_rcu() to not confuse potential readers
in rtnl_prop_list_size().
[1] This synchronize_rcu() call was later removed in commit 723de3ebef
("net: free altname using an RCU callback").
Fixes: 9f30831390 ("net: add rcu safety to rtnl_prop_list_size()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260502124102.499204-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
mptcp_setsockopt_all_sf() was missing a call to sockopt_seq_inc(). This
is required to avoid missing synchronization for newer subflows
created later on.
This helper is called each time a socket option is set on subflows, and
future ones will need to inherit this option after their creation.
Fixes: 51c5fd09e1 ("mptcp: add TCP_MAXSEG sockopt support")
Cc: stable@vger.kernel.org
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260501-net-mptcp-misc-fixes-7-1-rc3-v1-4-b70118df778e@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The skb cb field containing the timestamp presence flag is cleared
before that information is read. Cache the value before the MPTCP CB
initialization.
Fixes: 36b122baf6 ("mptcp: add subflow_v(4,6)_send_synack()")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260501-net-mptcp-misc-fixes-7-1-rc3-v1-3-b70118df778e@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When HMAC validation fails on a received ACK + MP_JOIN in
subflow_syn_recv_sock(), the subflow is reset with reason
MPTCP_RST_EPROHIBIT ("Administratively prohibited"). This is
incorrect: HMAC validation failure is an MPTCP protocol-level
error, not an administrative policy denial.
The mirror site on the client, in subflow_finish_connect(), already
uses MPTCP_RST_EMPTCP ("MPTCP-specific error") for the same kind of
HMAC failure on the SYN/ACK + MP_JOIN. Use the same reason on the
server side for symmetry and accuracy.
Suggested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Fixes: 443041deb5 ("mptcp: fix NULL pointer in can_accept_new_subflow")
Cc: stable@vger.kernel.org
Signed-off-by: Shardul Bankar <shardul.b@mpiricsoftware.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260501-net-mptcp-misc-fixes-7-1-rc3-v1-2-b70118df778e@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In subflow_finish_connect(), HMAC validation of the server's HMAC
in SYN/ACK + MP_JOIN increments MPTCP_MIB_JOINACKMAC ("HMAC was
wrong on ACK + MP_JOIN") on failure. The function processes the
SYN/ACK, not the ACK; the matching MPTCP_MIB_JOINSYNACKMAC counter
("HMAC was wrong on SYN/ACK + MP_JOIN") exists but is not
incremented anywhere in the tree.
The mirror site on the server, subflow_syn_recv_sock(), already
uses JOINACKMAC correctly for ACK HMAC failure. Use JOINSYNACKMAC
at the SYN/ACK validation site so each counter reflects the packet
whose HMAC actually failed.
Suggested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Fixes: fc518953bc ("mptcp: add and use MIB counter infrastructure")
Cc: stable@vger.kernel.org
Signed-off-by: Shardul Bankar <shardul.b@mpiricsoftware.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260501-net-mptcp-misc-fixes-7-1-rc3-v1-1-b70118df778e@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
virtio_transport_inc_rx_pkt() checks vvs->rx_bytes + len > vvs->buf_alloc.
virtio_transport_recv_enqueue() skips coalescing for packets
with VIRTIO_VSOCK_SEQ_EOM.
If fed with packets with len == 0 and VIRTIO_VSOCK_SEQ_EOM,
a very large number of packets can be queued
because vvs->rx_bytes stays at 0.
Fix this by also accounting for an estimate of the skb metadata size:
(number of skbs in the queue) * SKB_TRUESIZE(0)
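A sketch of the adjusted credit check (virtio_vsock_sock field names as in
include/linux/virtio_vsock.h; using skb_queue_len() of rx_queue as the skb
count is an assumption of this sketch):

	if (vvs->rx_bytes + len +
	    skb_queue_len(&vvs->rx_queue) * SKB_TRUESIZE(0) > vvs->buf_alloc)
		return false;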
Fixes: 0777061657 ("virtio/vsock: don't use skbuff state to account credit")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: "Eugenio Pérez" <eperezma@redhat.com>
Cc: virtualization@lists.linux.dev
Link: https://patch.msgid.link/20260430122653.554058-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
egress_dev() formats np->dev_mac via snprintf() but receives buf as
a bare char *, so it cannot derive the buffer size from the pointer. The
size argument was hardcoded to MAC_ADDR_STR_LEN (3 * ETH_ALEN - 1 = 17),
which is wrong in two ways:
1) misleading kernel log output on the MAC-selected target path
(np->dev_name[0] == '\0'); for example "aa:bb:cc:dd:ee:ff doesn't
exist, aborting" was logged as "aa:bb:cc:dd:ee:f doesn't exist,
aborting".
2) the second argument of snprintf is the size of the buffer, not the
size of what you want to write.
Add a bufsz parameter to egress_dev() and pass sizeof(buf) from each
caller, matching the standard snprintf() idiom and removing the
hardcoded size from the helper.
Every caller already declares "char buf[MAC_ADDR_STR_LEN + 1]" so the
formatted MAC continues to fit.
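A sketch of the resulting shape (the helper's return value and other
parameters are elided):

	/* inside egress_dev(..., char *buf, size_t bufsz) */
	snprintf(buf, bufsz, "%pM", np->dev_mac);

	/* at each call site */
	char buf[MAC_ADDR_STR_LEN + 1];

	egress_dev(np, buf, sizeof(buf));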
Tested by booting with
netconsole=6665@/aa:bb:cc:dd:ee:ff,6666@10.0.0.1/00:11:22:33:44:55
on a kernel without a matching device. Pre-fix dmesg shows
"aa:bb:cc:dd:ee:f doesn't exist, aborting"; post-fix shows the full
"aa:bb:cc:dd:ee:ff doesn't exist, aborting".
Fixes: f8a10bed32 ("netconsole: allow selection of egress interface via MAC address")
Cc: stable@vger.kernel.org
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260501-netpoll_snprintf_fix-v1-1-84b0566e6597@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Igor Ushakov reported that unix_gc() could run with gc_in_progress
being false if the work is scheduled again while it is already running:
Thread 1                  Thread 2                   Thread 3
--------                  --------                   --------
                          unix_schedule_gc()         unix_schedule_gc()
                          `- if (!gc_in_progress)    `- if (!gc_in_progress)
                             |- gc_in_progress = true   |
                             `- queue_work()            |
unix_gc() <------------------/                          |
|                                                       |- gc_in_progress = true
...                                                     `- queue_work()
|                                                       |
`- gc_in_progress = false                               |
                                                        |
unix_gc() <---------------------------------------------'
|
... /* gc_in_progress == false */
|
`- gc_in_progress = false
unix_peek_fpl() relies on gc_in_progress so that MSG_PEEK does not
confuse the GC.
Let's set gc_in_progress to true in unix_gc().
Fixes: 8b90a9f819 ("af_unix: Run GC on only one CPU.")
Reported-by: Igor Ushakov <sysroot314@gmail.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260501073945.1884564-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The ip_vs_ctl.c file and the associated ip_vs.h file are the only places
in the kernel where the HK_TYPE_KTHREAD cpumask is retrieved and used.
Now that the HK_TYPE_KTHREAD/HK_TYPE_DOMAIN cpumask can be changed at run
time, we need to use RCU to guard access to this cpumask to avoid a
potential UAF problem, as the returned cpumask may be freed before it
is used.
We can replace HK_TYPE_KTHREAD by HK_TYPE_DOMAIN as they are aliases
of each other, but keeping the HK_TYPE_KTHREAD name can highlight the
fact that it is the kthread initiated by ipvs that is being controlled.
Fixes: 03ff735101 ("cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Sashiko reports races and a possible crash around
the usage of est_cpulist_valid and sysctl_est_cpulist.
The problem is that we do not lock est_mutex in some
places, which can lead to wrong write ordering and,
as a result, problems when calling cpumask_weight()
and cpumask_empty().
Fix them by moving the est_max_threads read/write under
locked est_mutex. Do the same for one ip_vs_est_reload_start()
call to protect the cpumask_empty() usage of sysctl_est_cpulist.
To remove the chance of deadlock while stopping the
estimation kthreads, keep the data structure for kthread 0
even after the last estimator is removed and do not hold mutexes
while stopping this task. Now we will use a new flag 'needed'
to know when kthread 0 should run. The kthreads above 0
do not use mutexes, so stop them under est_mutex because
their kthread data still can be destroyed if they do not
serve estimators. Now all kthreads will be started by
the est_reload_work to properly serialize the stop/start
for kthread 0.
Reduce the use of service_mutex in ip_vs_est_calc_phase()
because under est_mutex we can safely walk est_kt_arr to
stop the kthreads above slot 0.
As ip_vs_stop_estimator() for tot_stats should be called
under service_mutex, do it early in the netns exit path
in ip_vs_flush() to avoid locking the mutex again later.
It still should be called in ip_vs_control_net_cleanup_sysctl()
when we are called during a netns init error. Use -2 for ktid
as an indicator that the estimator was already stopped.
Finally, fix a use-after-free for kd->est_row in
ip_vs_est_calc_phase(). est->ktrow should simply switch to
a delay value while the estimator is linked to est_temp_list.
Link: https://sashiko.dev/#/patchset/20260331165015.2777765-1-longman%40redhat.com
Link: https://sashiko.dev/#/patchset/20260420171308.87192-1-ja%40ssi.bg
Link: https://sashiko.dev/#/patchset/20260422125123.40658-1-ja%40ssi.bg
Link: https://sashiko.dev/#/patchset/20260424175858.54752-1-ja%40ssi.bg
Link: https://sashiko.dev/#/patchset/20260425103918.7447-1-ja%40ssi.bg
Fixes: f0be83d542 ("ipvs: add est_cpulist and est_nice sysctl vars")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Sashiko warns about a leaked dest if ip_vs_start_estimator()
fails in ip_vs_add_dest(). Add ip_vs_trash_put_dest() to
put the dest back into the dest trash.
Link: https://sashiko.dev/#/patchset/20260428175725.72050-1-ja%40ssi.bg
Fixes: 705dd34440 ("ipvs: use kthreads for stats estimation")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
syzbot reports a sleeping function called from invalid context [1].
The recently added code for resizable hash tables uses
hlist_bl bit locks in combination with spin_lock for
the connection fields (cp->lock).
Fix the following problems:
* avoid using spin_lock(&cp->lock) under a locked bit lock
because it sleeps on PREEMPT_RT
* as the recent changes call ip_vs_conn_hash() only for a newly
allocated connection, the spin_lock can be removed there because
the connection is not yet linked to the table and does not need
cp->lock protection.
* the lock can also be removed from ip_vs_conn_unlink() where we
are the last connection user.
* the last place that is fixed is ip_vs_conn_fill_cport(),
where the cp->lock is now taken before the other locks to
ensure other packets do not modify cp->flags in a non-atomic
way. Here we make sure cport and flags are changed only once
if two or more packets race to fill the cport. Also, we fill
the cport early, so that if we race with resizing there will be
a valid cport key for the hashing. Add a warning if too many
hash table changes occur within our RCU read-side critical
section, which is an error condition, but a minor one because the
connection can still expire gracefully. Still, restore the
cport to 0 to allow a retransmitted packet to properly fill
the cport. Problems reported by Sashiko.
[1]:
BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 16, name: ktimers/0
preempt_count: 2, expected: 0
RCU nest depth: 3, expected: 3
8 locks held by ktimers/0/16:
#0: ffffffff8de5f260 (local_bh){.+.+}-{1:3}, at: __local_bh_disable_ip+0x3c/0x420 kernel/softirq.c:163
#1: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: __local_bh_disable_ip+0x3c/0x420 kernel/softirq.c:163
#2: ffff8880b8826360 (&base->expiry_lock){+...}-{3:3}, at: spin_lock include/linux/spinlock_rt.h:45 [inline]
#2: ffff8880b8826360 (&base->expiry_lock){+...}-{3:3}, at: timer_base_lock_expiry kernel/time/timer.c:1502 [inline]
#2: ffff8880b8826360 (&base->expiry_lock){+...}-{3:3}, at: __run_timer_base+0x120/0x9f0 kernel/time/timer.c:2384
#3: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire include/linux/rcupdate.h:300 [inline]
#3: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
#3: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: __rt_spin_lock kernel/locking/spinlock_rt.c:50 [inline]
#3: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: rt_spin_lock+0x1e0/0x400 kernel/locking/spinlock_rt.c:57
#4: ffffc90000157a80 ((&cp->timer)){+...}-{0:0}, at: call_timer_fn+0xd4/0x5e0 kernel/time/timer.c:1745
#5: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire include/linux/rcupdate.h:300 [inline]
#5: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
#5: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: ip_vs_conn_unlink net/netfilter/ipvs/ip_vs_conn.c:315 [inline]
#5: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: ip_vs_conn_expire+0x257/0x2390 net/netfilter/ipvs/ip_vs_conn.c:1260
#6: ffffffff8de5f260 (local_bh){.+.+}-{1:3}, at: __local_bh_disable_ip+0x3c/0x420 kernel/softirq.c:163
#7: ffff888068d4c3f0 (&cp->lock#2){+...}-{3:3}, at: spin_lock include/linux/spinlock_rt.h:45 [inline]
#7: ffff888068d4c3f0 (&cp->lock#2){+...}-{3:3}, at: ip_vs_conn_unlink net/netfilter/ipvs/ip_vs_conn.c:324 [inline]
#7: ffff888068d4c3f0 (&cp->lock#2){+...}-{3:3}, at: ip_vs_conn_expire+0xd4a/0x2390 net/netfilter/ipvs/ip_vs_conn.c:1260
Preemption disabled at:
[<ffffffff898a6358>] bit_spin_lock include/linux/bit_spinlock.h:38 [inline]
[<ffffffff898a6358>] hlist_bl_lock+0x18/0x110 include/linux/list_bl.h:149
CPU: 0 UID: 0 PID: 16 Comm: ktimers/0 Tainted: G W L syzkaller #0 PREEMPT_{RT,(full)}
Tainted: [W]=WARN, [L]=SOFTLOCKUP
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/18/2026
Call Trace:
<TASK>
dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
__might_resched+0x329/0x480 kernel/sched/core.c:9162
__rt_spin_lock kernel/locking/spinlock_rt.c:48 [inline]
rt_spin_lock+0xc2/0x400 kernel/locking/spinlock_rt.c:57
spin_lock include/linux/spinlock_rt.h:45 [inline]
ip_vs_conn_unlink net/netfilter/ipvs/ip_vs_conn.c:324 [inline]
ip_vs_conn_expire+0xd4a/0x2390 net/netfilter/ipvs/ip_vs_conn.c:1260
call_timer_fn+0x192/0x5e0 kernel/time/timer.c:1748
expire_timers kernel/time/timer.c:1799 [inline]
__run_timers kernel/time/timer.c:2374 [inline]
__run_timer_base+0x6a3/0x9f0 kernel/time/timer.c:2386
run_timer_base kernel/time/timer.c:2395 [inline]
run_timer_softirq+0xb7/0x170 kernel/time/timer.c:2405
handle_softirqs+0x1de/0x6d0 kernel/softirq.c:622
__do_softirq kernel/softirq.c:656 [inline]
run_ktimerd+0x69/0x100 kernel/softirq.c:1151
smpboot_thread_fn+0x541/0xa50 kernel/smpboot.c:160
kthread+0x388/0x470 kernel/kthread.c:436
ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
</TASK>
Reported-by: syzbot+504e778ddaecd36fdd17@syzkaller.appspotmail.com
Link: https://sashiko.dev/#/patchset/20260415200216.79699-1-ja%40ssi.bg
Link: https://sashiko.dev/#/patchset/20260420165539.85174-4-ja%40ssi.bg
Link: https://sashiko.dev/#/patchset/20260422135823.50489-4-ja%40ssi.bg
Fixes: 2fa7cc9c70 ("ipvs: switch to per-net connection table")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Sashiko warns that the new sysctl vars can be changed
after the hash tables are destroyed and their respective
resizing works canceled, leading to mod_delayed_work()
being called for canceled works.
Solve this in different ways. conn_tab can be present even
without services and is destroyed only on netns exit, so use
disable_delayed_work_sync() to disable the work instead of
adding more synchronization mechanisms.
As for the svc_table, it is destroyed when the services
are deleted, so we must be sure that netns exit is not
called yet (the check for 'enable') and that the work is
not canceled, by checking everything under the same mutex lock.
Also, use WRITE_ONCE when updating the sysctl vars as we
already read them with READ_ONCE.
Link: https://sashiko.dev/#/patchset/20260410112352.23599-1-fw%40strlen.de
Fixes: 8d7de5477e ("ipvs: add conn_lfactor and svc_lfactor sysctl vars")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Sashiko reports some problems with the recently added
/proc/net/ip_vs_status:
* ip_vs_status_show() as a table reader may run long after the
conn_tab and svc_table tables are released. While ip_vs_conn_flush()
properly changes the conn_tab_changes counter when conn_tab is removed,
ip_vs_del_service() and ip_vs_flush() were missing such a change for
the svc_table_changes counter. As a result, readers like
ip_vs_dst_event() and ip_vs_status_show() may continue to use
a freed table after a cond_resched_rcu() call.
* While counting the buckets in ip_vs_status_show(), make sure we
traverse only the needed number of entries in the chain. This also
prevents a possible overflow of the 'count' variable.
* Add a check for 'loops' to prevent infinite loops while restarting
the traversal on table change.
* While IP_VS_CONN_TAB_MAX_BITS is 20 on 32-bit platforms and
there is no risk of overflow when multiplying the number of
conn_tab buckets by 100, prefer the div_u64() helper to make
the subsequent division safer.
* Use 0440 permissions for ip_vs_status to restrict the
info to root only, due to the exported hash distribution
information.
Link: https://sashiko.dev/#/patchset/20260410112352.23599-1-fw%40strlen.de
Fixes: 9a9ccef907 ("ipvs: add ip_vs_status info")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The HT check now only applies in strict mode since APs
were found to be broken. Mark it as such.
Fixes: 711a9c018a ("wifi: mac80211: skip ieee80211_verify_sta_ht_mcs_support check in non-strict mode")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
tls_sw_splice_read() uses len when advancing rxm->offset / rxm->full_len
after skb_splice_bits(), rather than copied (the actual number of bytes
successfully spliced into the pipe). When the destination pipe cannot
accept all the requested bytes, splice_to_pipe() returns fewer bytes
than len, and 'len - copied' of data is effectively skipped over.
Fixes: e062fe99cc ("tls: splice_read: fix accessing pre-processed records")
Link: https://patch.msgid.link/20260429222944.2139041-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
cake_dump_class_stats() runs without the qdisc spinlock being held.
In this first patch, I add READ_ONCE()/WRITE_ONCE() annotations for:
- flow->head
- flow->dropped
- b->backlogs[]
Fixes: 046f6fd5da ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260430061610.3503483-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After commit 5e72ce3e39 ("net: ipv6: Use link netns in newlink() of
rtnl_link_ops"), ip6erspan_newlink() correctly resolves the per-netns
ip6gre hash via link_net. ip6erspan_changelink() was not converted in
that series and still uses dev_net(dev), which diverges from the
device's creation netns after IFLA_NET_NS_FD migration.
This re-inserts the tunnel into the wrong per-netns hash. The
original netns keeps a stale entry. When that netns is later
destroyed, ip6gre_exit_rtnl_net() walks the stale entry, producing a
slab-use-after-free reported by KASAN, followed by a kernel BUG at
net/core/dev.c (LIST_POISON1) in unregister_netdevice_many_notify().
Reachable from an unprivileged user namespace (unshare --user
--map-root-user --net).
ip6gre_changelink() earlier in the same file already uses the cached
t->net; only ip6erspan_changelink() has the wrong shape.
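A sketch of the fix direction (net_generic()/ip6gre_net_id as used
elsewhere in net/ipv6/ip6_gre.c; the exact upstream diff may differ):

	struct ip6_tnl *t = netdev_priv(dev);
	/* was: net_generic(dev_net(dev), ip6gre_net_id) */
	struct ip6gre_net *ign = net_generic(t->net, ip6gre_net_id);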
Fixes: 2d665034f2 ("net: ip6_gre: Fix ip6erspan hlen calculation")
Cc: stable@vger.kernel.org # v5.15+
Signed-off-by: Maoyi Xie <maoyi.xie@ntu.edu.sg>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260430103318.3206018-1-maoyi.xie@ntu.edu.sg
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When sfb has children (e.g. a qfq qdisc) whose peek() callback is
qdisc_peek_dequeued(), we could get a kernel panic. When the parent of such
qdiscs (e.g. illustrated in patch #3 as tbf) wants to retrieve an skb from
its child (sfb in this case), it will do the following:
1a. do a peek() - and when sensing there's an skb the child can offer, then
- the child in this case (sfb) calls its child's (qfq) peek.
qfq does the right thing and will return the gso_skb queue packet.
Note: if there wasn't a gso_skb entry then qfq will store it there.
1b. invoke a dequeue() on the child (sfb). And herein lies the problem.
- sfb will call the child's dequeue() which will essentially just
try to grab something off qfq's queue.
[ 127.594489][ T453] KASAN: null-ptr-deref in range [0x0000000000000048-0x000000000000004f]
[ 127.594741][ T453] CPU: 2 UID: 0 PID: 453 Comm: ping Not tainted 7.1.0-rc1-00035-gac961974495b-dirty #793 PREEMPT(full)
[ 127.595059][ T453] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 127.595254][ T453] RIP: 0010:qfq_dequeue+0x35c/0x1650 [sch_qfq]
[ 127.595461][ T453] Code: 00 fc ff df 80 3c 02 00 0f 85 17 0e 00 00 4c 8d 73 48 48 89 9d b8 02 00 00 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 76 0c 00 00 48 b8 00 00 00 00 00 fc ff df 4c 8b
[ 127.596081][ T453] RSP: 0018:ffff88810e5af440 EFLAGS: 00010216
[ 127.596337][ T453] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: dffffc0000000000
[ 127.596623][ T453] RDX: 0000000000000009 RSI: 0000001880000000 RDI: ffff888104fd82b0
[ 127.596917][ T453] RBP: ffff888104fd8000 R08: ffff888104fd8280 R09: 1ffff110211893a3
[ 127.597165][ T453] R10: 1ffff110211893a6 R11: 1ffff110211893a7 R12: 0000001880000000
[ 127.597404][ T453] R13: ffff888104fd82b8 R14: 0000000000000048 R15: 0000000040000000
[ 127.597644][ T453] FS: 00007fc380cbfc40(0000) GS:ffff88816f2a8000(0000) knlGS:0000000000000000
[ 127.597956][ T453] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 127.598160][ T453] CR2: 00005610aa9890a8 CR3: 000000010369e000 CR4: 0000000000750ef0
[ 127.598390][ T453] PKRU: 55555554
[ 127.598509][ T453] Call Trace:
[ 127.598629][ T453] <TASK>
[ 127.598718][ T453] ? mark_held_locks+0x40/0x70
[ 127.598890][ T453] ? srso_alias_return_thunk+0x5/0xfbef5
[ 127.599053][ T453] sfb_dequeue+0x88/0x4d0
[ 127.599174][ T453] ? ktime_get+0x137/0x230
[ 127.599328][ T453] ? srso_alias_return_thunk+0x5/0xfbef5
[ 127.599480][ T453] ? qdisc_peek_dequeued+0x7b/0x350 [sch_qfq]
[ 127.599670][ T453] ? srso_alias_return_thunk+0x5/0xfbef5
[ 127.599831][ T453] tbf_dequeue+0x6b1/0x1098 [sch_tbf]
[ 127.599988][ T453] __qdisc_run+0x169/0x1900
The right thing to do in #1b is to grab the skb off the gso_skb queue.
This patchset fixes that issue by changing #1b to use the
qdisc_dequeue_peeked() helper instead.
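A sketch of the change in sfb_dequeue() (the red_dequeue() fix in the
companion patch is analogous; field names as in sch_sfb.c):

	struct sfb_sched_data *q = qdisc_priv(sch);
	struct Qdisc *child = q->qdisc;
	struct sk_buff *skb;

	/* was: skb = child->dequeue(child); which bypasses any skb that
	 * qdisc_peek_dequeued() stashed in the child's gso_skb queue
	 */
	skb = qdisc_dequeue_peeked(child);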
Fixes: e13e02a3c6 ("net_sched: SFB flow scheduler")
Signed-off-by: Victor Nogueria <victor@mojatatu.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260430152957.194015-3-jhs@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When the red qdisc has children (e.g. a qfq qdisc) whose peek() callback is
qdisc_peek_dequeued(), we could get a kernel panic. When the parent of such
qdiscs (e.g. illustrated in patch #3 as tbf) wants to retrieve an skb from
its child (red in this case), it will do the following:
1a. do a peek() - and when sensing there's an skb the child can offer, then
- the child in this case (red) calls its child's (qfq) peek.
qfq does the right thing and will return the gso_skb queue packet.
Note: if there wasn't a gso_skb entry then qfq will store it there.
1b. invoke a dequeue() on the child (red). And herein lies the problem.
- red will call the child's dequeue() which will essentially just
try to grab something off qfq's queue.
[ 78.667668][ T363] KASAN: null-ptr-deref in range [0x0000000000000048-0x000000000000004f]
[ 78.667927][ T363] CPU: 1 UID: 0 PID: 363 Comm: ping Not tainted 7.1.0-rc1-00033-g46f74a3f7d57-dirty #790 PREEMPT(full)
[ 78.668263][ T363] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 78.668486][ T363] RIP: 0010:qfq_dequeue+0x446/0xc90 [sch_qfq]
[ 78.668718][ T363] Code: 54 c0 e8 dd 90 00 f1 48 c7 c7 e0 03 54 c0 48 89 de e8 ce 90 00 f1 48 8d 7b 48 b8 ff ff 37 00 48 89 fa 48 c1 e0 2a 48 c1 ea 03 <80> 3c 02 00 74 05 e8 ef a1 e1 f1 48 8b 7b 48 48 8d 54 24 58 48 8d
[ 78.669312][ T363] RSP: 0018:ffff88810de573e0 EFLAGS: 00010216
[ 78.669533][ T363] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 78.669790][ T363] RDX: 0000000000000009 RSI: 0000000000000004 RDI: 0000000000000048
[ 78.670044][ T363] RBP: ffff888110dc4000 R08: ffffffffb1b0885a R09: fffffbfff6ba9078
[ 78.670297][ T363] R10: 0000000000000003 R11: ffff888110e31c80 R12: 0000001880000000
[ 78.670560][ T363] R13: ffff888110dc4150 R14: ffff888110dc42b8 R15: 0000000000000200
[ 78.670814][ T363] FS: 00007f66a8f09c40(0000) GS:ffff888163428000(0000) knlGS:0000000000000000
[ 78.671110][ T363] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 78.671324][ T363] CR2: 000055db4c6a30a8 CR3: 000000010da67000 CR4: 0000000000750ef0
[ 78.671585][ T363] PKRU: 55555554
[ 78.671713][ T363] Call Trace:
[ 78.671843][ T363] <TASK>
[ 78.671936][ T363] ? __pfx_qfq_dequeue+0x10/0x10 [sch_qfq]
[ 78.672148][ T363] ? __pfx__printk+0x10/0x10
[ 78.672322][ T363] ? srso_alias_return_thunk+0x5/0xfbef5
[ 78.672496][ T363] ? lockdep_hardirqs_on_prepare+0xa8/0x1a0
[ 78.672706][ T363] ? srso_alias_return_thunk+0x5/0xfbef5
[ 78.672875][ T363] ? trace_hardirqs_on+0x19/0x1a0
[ 78.673047][ T363] red_dequeue+0x65/0x270 [sch_red]
[ 78.673217][ T363] ? srso_alias_return_thunk+0x5/0xfbef5
[ 78.673385][ T363] tbf_dequeue.cold+0xb0/0x70c [sch_tbf]
[ 78.673566][ T363] __qdisc_run+0x169/0x1900
The right thing to do in #1b is to grab the skb off the gso_skb queue.
This patchset fixes that issue by changing #1b to use the
qdisc_dequeue_peeked() method instead.
Fixes: 77be155cba ("pkt_sched: Add peek emulation for non-work-conserving qdiscs.")
Reported-by: Manas <ghandatmanas@gmail.com>
Reported-by: Rakshit Awasthi <rakshitawasthi17@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260430152957.194015-2-jhs@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When using IPv6 ECMP routes, if a netdev listed as a nexthop experiences
a carrier change event (e.g., a bond device generating a NETDEV_CHANGE
event after its slaves go linkdown), established connections utilizing
that nexthop fail to fail over to other available nexthops. Instead,
these connections stall or drop.
This happens because the IPv6 FIB code does not invalidate the socket's
cached destination when a NETDEV_CHANGE event occurs. While
fib6_ifdown() correctly marks the nexthop with RTNH_F_LINKDOWN, it
leaves the route's serial number unchanged. As a result, sockets with a
previously cached dst do not realize the route is no longer viable and
continue to try using the non-functional nexthop.
This behavior contrasts with IPv4, which actively flushes cached
destinations on a NETDEV_CHANGE event (see fib_netdev_event() in
net/ipv4/fib_frontend.c).
Fix this by updating the route serial number in fib6_ifdown() when
setting RTNH_F_LINKDOWN. This invalidates stale cached destinations,
forcing sockets to perform a new route lookup and fail over to a
functioning nexthop.
Fixes: 51ebd31815 ("ipv6: add support of equal cost multipath (ECMP)")
Signed-off-by: Sagarika Sharma <sharmasagarika@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260430200909.527827-2-sharmasagarika@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
My prior patch missed a few READ_ONCE()/WRITE_ONCE() annotations.
Fixes: 5154561d9b ("net/sched: sch_pie: annotate data-races in pie_dump_stats()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260430080056.35104-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rtnl_fill_vfinfo() declares struct ifla_vf_broadcast on the stack
without initialisation:
	struct ifla_vf_broadcast vf_broadcast;
The struct contains a single fixed 32-byte field:
	/* include/uapi/linux/if_link.h */
	struct ifla_vf_broadcast {
		__u8 broadcast[32];
	};
The function then copies dev->broadcast into it using dev->addr_len
as the length:
	memcpy(vf_broadcast.broadcast, dev->broadcast, dev->addr_len);
On Ethernet devices (the overwhelming majority of SR-IOV NICs)
dev->addr_len is 6, so only the first 6 bytes of broadcast[] are
written. The remaining 26 bytes retain whatever was previously on
the kernel stack. The full struct is then handed to userspace via:
	nla_put(skb, IFLA_VF_BROADCAST,
		sizeof(vf_broadcast), &vf_broadcast)
leaking up to 26 bytes of uninitialised kernel stack per VF per
RTM_GETLINK request, repeatable.
The other vf_* structs in the same function are explicitly zeroed
for exactly this reason - see the memset() calls for ivi,
vf_vlan_info, node_guid and port_guid a few lines above.
vf_broadcast was simply missed when it was added.
Reachability: any unprivileged local process can open AF_NETLINK /
NETLINK_ROUTE without capabilities and send RTM_GETLINK with an
IFLA_EXT_MASK attribute carrying RTEXT_FILTER_VF. The kernel walks
each VF and emits IFLA_VF_BROADCAST, leaking 26 bytes of stack per
VF per request. Stack residue at this call site can include return
addresses and transient sensitive data; KASAN with stack
instrumentation, or KMSAN, will flag the nla_put() when reproduced.
Zero the on-stack struct before the partial memcpy, matching the
existing pattern used for the other vf_* structs in the same
function.
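That is:

	struct ifla_vf_broadcast vf_broadcast;

	memset(&vf_broadcast, 0, sizeof(vf_broadcast));
	memcpy(vf_broadcast.broadcast, dev->broadcast, dev->addr_len);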
Fixes: 75345f888f ("ipoib: show VF broadcast address")
Cc: stable@vger.kernel.org
Signed-off-by: Kai Zen <kai.aizen.dev@gmail.com>
Link: https://patch.msgid.link/3c506e8f936e52b57620269b55c348af05d413a2.1777557228.git.kai.aizen.dev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Yiming Qian reported:
<quote>
`ipmr_cache_report()` allocates a report skb with `alloc_skb(128,
GFP_ATOMIC)` and appends a `struct igmphdr` using `skb_put()`. In the
non-`IGMPMSG_WHOLEPKT` path it initializes only:
- `igmp->type`
- `igmp->code`
but does not initialize:
- `igmp->csum`
- `igmp->group`
Later, `igmpmsg_netlink_event()` copies the bytes after `sizeof(struct
igmpmsg)` into the `IPMRA_CREPORT_PKT` netlink attribute and emits
`RTM_NEWCACHEREPORT` on `RTNLGRP_IPV4_MROUTE_R`.
As a result, 6 bytes of stale heap data from the skb head are
disclosed to userspace.
</quote>
Let's use skb_put_zero() instead of skb_put() to fix this bug.
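That is, in ipmr_cache_report():

	/* was: igmp = skb_put(skb, sizeof(struct igmphdr)); */
	igmp = skb_put_zero(skb, sizeof(struct igmphdr));

so csum and group no longer carry stale heap bytes.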
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260430070611.4004529-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'nf-26-05-01' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following batch contains Netfilter fixes for net:
1) Replace skb_try_make_writable() by skb_ensure_writable() in
nft_fwd_netdev and the flowtable to deal with uncloned packets
having their network header in paged fragments.
2) Drop packet if output device does not exist and ensure sufficient
headroom in nft_fwd_netdev before transmitting the skb.
3) Use the existing dup recursion counter in nft_fwd_netdev for the
neigh_xmit variant, from Weiming Shi.
4) Add a .check_hooks interface to x_tables to detach the control plane
hook check based on the match/target configuration. Then, update
nft_compat to use .check_hooks from the .validate path; this fixes a
lack of hook validation for several matches/targets.
5) Fix incorrect .usersize in xt_CT, from Florian Westphal.
6) Fix a memleak with netdev tables in dormant state,
from Florian Westphal.
7) Several patches to check if the packet is a fragment, then skip
layer 4 inspection, for x_tables and nf_tables; as well as common
nf_socket infrastructure. The xt_hashlimit match drops fragments
to stay consistent with the existing approach when failing to parse
the layer 4 protocol header.
8) Ensure sufficient headroom in the flowtable before transmitting
the skb.
9) Fix the flowtable inline vlan approach for double-tagged vlan:
Reverse the iteration over .encap[] since it represents the
encapsulation as seen from the ingress path. Postpone pushing the
layer 2 header so the output device is available to calculate the
needed headroom. Finally, add and use nf_flow_vlan_push() to fix it.
10) Fix flowtable inline pppoe with GSO packets. Moreover, use
FLOW_OFFLOAD_XMIT_DIRECT to fill in the destination hardware
address since the neighbour cache does not exist for pppoe.
11) Use skb_pull_rcsum() to decapsulate vlan and pppoe headers; for
double-tagged vlan in particular this should provide some benefits
in certain scenarios.
More notes regarding 9-11):
- sashiko is also signalling to use it for IPIP headers, but that needs
more adjustments, such as setting skb->protocol after removing the IPIP
header; a follow-up will come in a separate patch.
- I plan to submit selftests to cover double-tagged-vlan. As for pppoe,
it should be possible but that would mandate a few userspace dependencies.
This has been semi-automatically tested by me and by reporters who
described double-tagged vlan and pppoe as currently broken in the
flowtable.
* tag 'nf-26-05-01' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: flowtable: use skb_pull_rcsum() to pop vlan/pppoe header
netfilter: flowtable: fix inline pppoe encapsulation in xmit path
netfilter: flowtable: fix inline vlan encapsulation in xmit path
netfilter: flowtable: ensure sufficient headroom in xmit path
netfilter: xtables: fix L4 header parsing for non-first fragments
netfilter: nf_tables: skip L4 header parsing for non-first fragments
netfilter: nf_socket: skip socket lookup for non-first fragments
netfilter: nf_tables: fix netdev hook allocation memleak with dormant tables
netfilter: xt_CT: fix usersize for v1 and v2 revision
netfilter: nft_compat: run xt_check_hooks_{match,target}() from .validate
netfilter: x_tables: add .check_hooks to matches and targets
netfilter: nft_fwd_netdev: use recursion counter in neigh egress path
netfilter: nft_fwd_netdev: add device and headroom validate with neigh forwarding
netfilter: replace skb_try_make_writable() by skb_ensure_writable()
====================
Link: https://patch.msgid.link/20260501122237.296262-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This adjusts the checksum, if required, after pulling the layer 2
header, either the pppoe header or the inner vlan header in the
double-tagged vlan packets.
Fixes: 4cd91f7c29 ("netfilter: flowtable: add vlan support")
Fixes: 72efd585f7 ("netfilter: flowtable: add pppoe support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
ipv6_{skip_exthdr,find_hdr}() and ip6_{tnl_parse_tlv_enc_lim,
protocol_deliver_rcu}() iterate over IPv6 extension headers until they
find a non-extension-header protocol or run out of packet data. The
loops have no iteration counter, relying solely on the packet length
to bound them. For a crafted packet with 8-byte extension headers
filling a 64KB jumbogram, this means a worst case of up to ~8k
iterations with a skb_header_pointer call each. ipv6_skip_exthdr(),
for example, is used where it parses the inner quoted packet inside
an incoming ICMPv6 error:
- icmpv6_rcv
- checksum validation
- case ICMPV6_DEST_UNREACH
- icmpv6_notify
- pskb_may_pull() <- pull inner IPv6 header
- ipv6_skip_exthdr() <- iterates here
- pskb_may_pull()
- ipprot->err_handler() <- sk lookup
The per-iteration cost of ipv6_skip_exthdr itself is generally
light, but skb_header_pointer becomes more costly on reassembled
packets: the first ~1232 bytes of the inner packet are in the skb's
linear area, but the remaining ~63KB are in the frag_list where
skb_copy_bits is needed to read data.
Initially, the idea was to add a configurable limit via a new
sysctl knob with default 8, in line with knobs from commit
47d3d7ac65 ("ipv6: Implement limits on Hop-by-Hop and Destination
options"), but two reasons eventually argued against it:
- It adds to UAPI that needs to be maintained forever, and
upcoming work is restricting extension header ordering anyway,
leaving little reason for another sysctl knob
- exthdrs_core.c is always built-in even when CONFIG_IPV6=n,
where struct net has no .ipv6 member, so the read site would
need an ifdef'd fallback to a constant anyway
Therefore, just use a constant (IP6_MAX_EXT_HDRS_CNT). All four
extension header walking functions are now bound by this limit.
Note that the check in ip6_protocol_deliver_rcu() happens right
before the goto resubmit, such that we don't have to have a test
for ipv6_ext_hdr() in the fast-path.
There's an ongoing IETF draft-iurman-6man-eh-occurrences to enforce
IPv6 extension headers ordering and occurrence. The latter also
discusses security implications. As per RFC8200 section 4.1, the
occurrence rules for extension headers provide a practical upper
bound which is 8. In order to be conservative, let's define
IP6_MAX_EXT_HDRS_CNT as 12 to leave enough room for quirky setups.
In the unlikely event that this is still not enough, then we might
need to reconsider a sysctl.
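A minimal sketch of such a bounded walk (illustrative only, not the actual
ipv6_skip_exthdr() body; it ignores AH's different length encoding and
fragment-offset handling):

	static int skip_exthdrs_bounded(const struct sk_buff *skb, int start,
					u8 *nexthdrp)
	{
		u8 nexthdr = *nexthdrp;
		unsigned int cnt = 0;

		while (ipv6_ext_hdr(nexthdr)) {
			struct ipv6_opt_hdr _hdr, *hp;

			/* bail out on "no next header" or too many ext headers */
			if (nexthdr == NEXTHDR_NONE || ++cnt > IP6_MAX_EXT_HDRS_CNT)
				return -1;
			hp = skb_header_pointer(skb, start, sizeof(_hdr), &_hdr);
			if (!hp)
				return -1;
			start += nexthdr == NEXTHDR_FRAGMENT ? 8 : ipv6_optlen(hp);
			nexthdr = hp->nexthdr;
		}
		*nexthdrp = nexthdr;
		return start;
	}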
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Justin Iurman <justin.iurman@gmail.com>
Link: https://patch.msgid.link/20260429154648.809751-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Address two issues in the inline pppoe encapsulation:
- Add needs_gso_segment flag to segment PPPoE packets in software
given that there is no GSO support for this.
- Use FLOW_OFFLOAD_XMIT_DIRECT since neighbour cache is not available
in point-to-point device, use the hardware address that is obtained
via flowtable path discovery (ie. fill_forward_path).
Fixes: 18d27bed08 ("netfilter: flowtable: inline pppoe encapsulation in xmit path")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Several issues in the inline vlan support:
- The layer 2 encapsulation representation in the tuple takes encap[0] as
the outer header and encap[1] as the inner header as seen from the ingress
path. Reverse the encap loop to push first the inner then the outer vlan
header.
- Postpone pushing the layer 2 header until the destination device is known.
This allows calculating the needed headroom via LL_RESERVED_SPACE to
accommodate the layer 2 headers.
- Add and use nf_flow_vlan_push() as suggested by Eric Woudstra, this
is a simplified version of skb_vlan_push() for egress path only.
Fixes: c653d5a78f ("netfilter: flowtable: inline vlan encapsulation in xmit path")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In our test cases, we typically feed a packet sequence into the routing
code, then inspect the device's TXed skbs to assert specific behaviours.
Using dev_queue_xmit() for our TX path introduces a fair bit of
complexity between the test packet sequence and the test device's
ndo_start_xmit callback; which may mean that the skbs have not hit the
device at the point we're inspecting the TXed skb list.
Use dev_direct_xmit() instead, as we want as direct a path as possible
here, and the test dev does not need any queueing, scheduling or flow
control.
Fixes: 6ab578739a ("net: mctp: test: move TX packetqueue from dst to dev")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202604281320.525eee17-lkp@intel.com
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260429-dev-mctp-test-fixes-v1-2-1127b7425809@codeconstruct.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Invalid sockaddr padding will cause bind() to fail; ensure we have a
zeroed address in the testcase.
Fixes: 0d8647bc74 ("net: mctp: don't require a route for null-EID ingress")
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260429-dev-mctp-test-fixes-v1-1-1127b7425809@codeconstruct.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Check for headroom and call skb_expand_head() like in the IP output
path to ensure there is sufficient headroom for the mac header when
forwarding this packet as suggested by sashiko.
Fixes: b5964aac51 ("netfilter: flowtable: consolidate xmit path")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Multiple targets and matches rely on the L4 header to operate. For
fragmented packets, every fragment carries the transport protocol
identifier, but only the first fragment contains the L4 header.
As the 'raw' table can be configured to run at priority -450 (before
defragmentation at -400), the target/match can be reached before
reassembly. In this case, non-first fragments have their payload
incorrectly parsed as a TCP/UDP header. This is of course a
misconfiguration scenario; in most cases it just leads to unreliable
behavior for fragmented traffic.
Add a fragment check to ensure target/match only evaluates unfragmented
packets or the first fragment in the stream.
Fixes: 902d6a4c2a ("netfilter: nf_defrag: Skip defrag if NOTRACK is set")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The tproxy, osf and exthdr (SCTP) expressions rely on the presence of
transport layer headers to perform socket lookups, fingerprint matching,
or chunk extraction. For fragmented packets, while the IP protocol
remains constant across all fragments, only the first fragment contains
the actual L4 header.
The expressions could be attached to a chain with a priority lower than
-400, bypassing defragmentation, or could be used in stateless
environments where defragmentation does not happen at all. This could
result in garbage data being used for the matching.
Add a check for pkt->fragoff so only unfragmented packets or the first
fragment is processed.
Fixes: 133dc203d7 ("netfilter: nft_exthdr: Support SCTP chunks")
Fixes: 4ed8eb6570 ("netfilter: nf_tables: Add native tproxy support")
Fixes: b96af92d6e ("netfilter: nf_tables: implement Passive OS fingerprint module in nft_osf")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Both nft_socket and xt_socket rely on L4 headers to perform socket
lookup in the slow path. For fragmented packets, while the IP protocol
remains constant across all fragments, only the first fragment contains
the actual L4 header.
As the expression/match could be attached to a chain with a priority
lower than -400, it could bypass defragmentation.
Add a check for fragmentation in the lookup functions directly so the
problem is handled for both nft_socket and xt_socket at the same time.
In addition, future users of the functions would not need to care about
this.
Fixes: 902d6a4c2a ("netfilter: nf_defrag: Skip defrag if NOTRACK is set")
Fixes: 554ced0a6e ("netfilter: nf_tables: add support for native socket matching")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When tls_set_device_offload_rx() fails at tls_dev_add(), the error path
calls tls_sw_free_resources_rx() to clean up the SW context that was
initialized by tls_set_sw_offload(). This function calls
tls_sw_release_resources_rx() (which stops the strparser via
tls_strp_stop()) and tls_sw_free_ctx_rx() (which kfrees the context),
but never frees the anchor skb that was allocated by alloc_skb(0) in
tls_strp_init().
Note that tls_sw_free_resources_rx() is exclusively used for this
"failed to start offload" code path, there's no other caller.
The leak did not exist before commit 84c61fe1a7 ("tls: rx: do not use
the standard strparser"), because the standard strparser doesn't try
to pre-allocate an skb.
The normal close path in tls_sk_proto_close() handles cleanup by calling
tls_sw_strparser_done() (which calls tls_strp_done()) after dropping
the socket lock, because tls_strp_done() does cancel_work_sync() and
the strparser work handler takes the socket lock.
Fixes: 84c61fe1a7 ("tls: rx: do not use the standard strparser")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20260428231559.1358502-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
sashiko says:
could the related code in __nf_tables_abort() leak the struct nft_hook objects when the table is dormant?
In __nf_tables_abort(), when rolling back a NEWCHAIN transaction that
updates hooks, the code conditionally unregisters and frees the hooks only
if the table is not dormant [..]
if (!(table->flags & NFT_TABLE_F_DORMANT)) {
nft_netdev_unregister_hooks(net,
&nft_trans_chain_hooks(trans),
true);
}
...
nft_trans_destroy(trans);
Unfortunately netdev family mixes hook registration and allocation.
Push table struct down and only check for the flag to unregister.
Fixes: 216e7bf740 ("netfilter: nf_tables: skip netdev hook unregistration if table is dormant")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
While resurrecting the conntrack-tool test cases I found following bug:
In:
iptables -I OUTPUT -t raw -p 13 -j CT --timeout test-generic
Out:
[0:0] -A OUTPUT -p 13 -j CT --timeout test
Data after the first four bytes of the timeout policy name is never
copied to userspace because it's treated as kernel-only.
Fixes: ec23189049 ("xtables: extend matches and targets with .usersize")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Several matches and one target check that the hook is correct from
checkentry(); however, the basechain is only available from
nft_table_validate().
This patch uses xt_check_hooks_{match,target}() from the nft_compat
expression .validate path.
It also sets the table in the nft_ctx struct in nft_table_validate(),
which this change requires.
Based on patch from Florian Westphal.
Fixes: 0ca743a559 ("netfilter: nf_tables: add compatibility layer for x_tables")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add a new .check_hooks interface for checking if the match/target is
used from the validate hook according to its configuration.
Move existing conditional hook check based on the match/target
configuration from .checkentry to .check_hooks for the following
matches/targets:
- addrtype
- devgroup
- physdev
- policy
- set
- TCPMSS
- SET
This is a preparation patch to fix nft_compat; no functional changes
are intended.
Based on patch from Florian Westphal.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When page_pool_create_percpu() fails on page_pool_list(), it falls
through to its err_uninit: label, which calls page_pool_uninit().
At that point page_pool_init() has already taken two references
when the user requested PP_FLAG_ALLOW_UNREADABLE_NETMEM:
pool->mp_ops->init(pool)
static_branch_inc(&page_pool_mem_providers);
Neither is undone by page_pool_uninit(); both are only undone by
__page_pool_destroy() (success-side teardown). The error path
therefore leaks the per-provider reference taken by mp_ops->init
(io_zcrx_ifq->refs in the io_uring zcrx provider, the dmabuf
binding refcount in the devmem provider) plus one increment of
the page_pool_mem_providers static branch on every failure of
xa_alloc_cyclic() inside page_pool_list().
The leaked io_zcrx_ifq->refs in turn pins everything
io_zcrx_ifq_free() would release on cleanup: ifq->user (uid),
ifq->mm_account (mmdrop), ifq->dev (device refcount),
ifq->netdev_tracker (netdev refcount), and the rbuf region.
The leaked static branch increment forces all subsequent
page_pool_alloc_netmems() and page_pool_return_page() callers to
take the slow mp_ops branch for the lifetime of the kernel.
Reachable via the io_uring zcrx path:
io_uring_register(IORING_REGISTER_ZCRX_IFQ) /* CAP_NET_ADMIN */
-> __io_uring_register
-> io_register_zcrx
-> zcrx_register_netdev
-> netif_mp_open_rxq
-> driver ndo_queue_mem_alloc
-> page_pool_create_percpu
-> page_pool_init succeeds (mp_ops->init runs, branch++)
-> page_pool_list fails (xa_alloc_cyclic -ENOMEM)
-> goto err_uninit <-- leak
The same shape applies to the devmem dmabuf provider via
mp_dmabuf_devmem_init()/mp_dmabuf_devmem_destroy().
Restore the cleanup symmetry by moving the mp_ops->destroy() and
static_branch_dec() calls out of __page_pool_destroy() and into
page_pool_uninit(), so page_pool_uninit() is again the strict
inverse of page_pool_init(). page_pool_uninit() has only two
callers (the err_uninit: path and __page_pool_destroy()), so this
preserves the single-call invariant on the success path while
fixing the err path. The error path of page_pool_init() itself
still skips the mp_ops cleanup correctly: mp_ops->init is the
last action that takes a reference before page_pool_init() returns
0, so when it returns an error neither the refcount nor the static
branch has been touched.
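A hedged sketch of where the moved teardown ends up (symbol names as used
in the description above; not the literal hunk):

	static void page_pool_uninit(struct page_pool *pool)
	{
		/* ... existing teardown kept as-is ... */

		if (pool->mp_ops) {
			pool->mp_ops->destroy(pool);
			static_branch_dec(&page_pool_mem_providers);
		}
	}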
Triggering the bug requires xa_alloc_cyclic() to fail with -ENOMEM,
which under normal GFP_KERNEL retry behaviour is rare. It is
deterministic under CONFIG_FAULT_INJECTION with fail_page_alloc /
xa fault injection, or under sustained memory pressure. The leak
is silent: there is no warning, and the released kernel build
continues running with a permanently-incremented static branch.
Fixes: 0f92140468 ("memory-provider: dmabuf devmem memory provider")
Signed-off-by: Hasan Basbunar <basbunarhasan@gmail.com>
Link: https://patch.msgid.link/20260428170739.34881-1-basbunarhasan@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
VMBUS ring buffers must be page aligned. Therefore, the current value of
24K presents a challenge on ARM64 kernels (with 64K pages). So, use
VMBUS_RING_SIZE() to ensure they are always aligned and large enough to
hold all of the relevant data.
Cc: stable@vger.kernel.org
Fixes: 77ffe33363 ("hv_sock: use HV_HYP_PAGE_SIZE for Hyper-V communication")
Tested-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20260428125339.13963-1-hamzamahfooz@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit db359fccf2 ("mm: introduce a new page type for page pool in
page type") added a page_type field to struct net_iov at the same
offset as struct page::page_type, so that page_pool_set_pp_info() can
call __SetPageNetpp() uniformly on both pages and net_iovs.
The page-type API requires the field to hold the UINT_MAX "no type"
sentinel before a type can be set; for real struct page that invariant
is established by the page allocator on free. struct net_iov is not
allocated through the page allocator, so the field is left as zero
(io_uring zcrx, which uses __GFP_ZERO) or as slab garbage (devmem,
which uses kvmalloc_objs() without zeroing). When the page pool then
calls page_pool_set_pp_info() on a freshly-bound niov,
__SetPageNetpp()'s VM_BUG_ON_PAGE(page->page_type != UINT_MAX) fires
and the kernel BUGs. Triggered in selftests by io_uring zcrx setup
through the fbnic queue restart path:
kernel BUG at ./include/linux/page-flags.h:1062!
RIP: 0010:page_pool_set_pp_info (./include/linux/page-flags.h:1062
net/core/page_pool.c:716)
Call Trace:
<TASK>
net_mp_niov_set_page_pool (net/core/page_pool.c:1360)
io_pp_zc_alloc_netmems (io_uring/zcrx.c:1089 io_uring/zcrx.c:1110)
fbnic_fill_bdq (./include/net/page_pool/helpers.h:160
drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:906)
__fbnic_nv_restart (drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:2470
drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:2874)
fbnic_queue_start (drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:2903)
netdev_rx_queue_reconfig (net/core/netdev_rx_queue.c:137)
__netif_mp_open_rxq (net/core/netdev_rx_queue.c:234)
io_register_zcrx (io_uring/zcrx.c:818 io_uring/zcrx.c:903)
__io_uring_register (io_uring/register.c:931)
__do_sys_io_uring_register (io_uring/register.c:1029)
do_syscall_64 (arch/x86/entry/syscall_64.c:63
arch/x86/entry/syscall_64.c:94)
</TASK>
The same path is reachable through devmem dmabuf binding via
netdev_nl_bind_rx_doit() -> net_devmem_bind_dmabuf_to_queue().
Add a net_iov_init() helper that stamps ->owner, ->type and the
->page_type sentinel, and use it from both the devmem and io_uring
zcrx niov init loops.
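A sketch of what such a helper could look like (the exact signature and
parameter types are assumptions; the stamped fields are the ones named above):

	static void net_iov_init(struct net_iov *niov, struct net_iov_area *owner,
				 enum net_iov_type type)
	{
		niov->owner = owner;
		niov->type = type;
		/* page-type API expects the "no type" sentinel before __SetPageNetpp() */
		niov->page_type = UINT_MAX;
	}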
Fixes: db359fccf2 ("mm: introduce a new page type for page pool in page type")
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Acked-by: Byungchul Park <byungchul@sk.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/20260428025320.853452-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
nft_fwd_neigh can be used in egress chains (NF_NETDEV_EGRESS). When the
forwarding rule targets the same device or two devices forward to each
other, neigh_xmit() triggers dev_queue_xmit() which re-enters
nf_hook_egress(), causing infinite recursion and stack overflow.
Move the nf_get_nf_dup_skb_recursion() accessor and NF_RECURSION_LIMIT
to the shared header nf_dup_netdev.h as a static inline, so that
nft_fwd_netdev can use the recursion counter directly without exported
function call overhead. Guard neigh_xmit() with the same recursion
limit already used in nf_do_netdev_egress().
[ Updated to cache the nf_get_nf_dup_skb_recursion pointer. --pablo ]
Fixes: f87b9464d1 ("netfilter: nft_fwd_netdev: Support egress hook")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The ttl field has been decremented already and evaluation of this rule
would proceed; just drop the packet instead if there is no destination
device to forward it to. This is exactly what nf_dup already does
in this case.
Moreover, check for headroom and call skb_expand_head() like in the IP
output path to ensure there is sufficient headroom when forwarding this
via neigh_xmit().
Fixes: d32de98ea7 ("netfilter: nft_fwd_netdev: allow to forward packets via neighbour layer")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
skb_try_make_writable() only works on clones, and uncloned packets might
have their network header in paged fragments.
nft_fwd needs to work for both the ingress and egress hooks; in the egress
hook skb->data points to the mac header, so use skb_network_offset()
to include the mac header. The flowtable is fine since it already uses
the transport offset.
Fixes: d32de98ea7 ("netfilter: nft_fwd_netdev: allow to forward packets via neighbour layer")
Fixes: 7d20868717 ("netfilter: nf_flow_table: move ipv4 offload hook code to nf_flow_table")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
KASAN reproduces a slab-use-after-free in __xfrm_state_delete()'s
hlist_del_rcu calls under syzkaller load on linux-6.12.y stable
(reproduced on 6.12.47, also reachable via the same code path on
torvalds/master and on the ipsec tree). Nine unique signatures cluster
in the xfrm_state lifecycle, the load-bearing one being:
BUG: KASAN: slab-use-after-free in __hlist_del include/linux/list.h:990 [inline]
BUG: KASAN: slab-use-after-free in hlist_del_rcu include/linux/rculist.h:516 [inline]
BUG: KASAN: slab-use-after-free in __xfrm_state_delete net/xfrm/xfrm_state.c
Write of size 8 at addr ffff8881198bcb70 by task kworker/u8:9/435
Workqueue: netns cleanup_net
Call Trace:
__hlist_del / hlist_del_rcu
__xfrm_state_delete
xfrm_state_delete
xfrm_state_flush
xfrm_state_fini
ops_exit_list
cleanup_net
The other observed signatures hit the same slab object from
__xfrm_state_lookup, xfrm_alloc_spi, __xfrm_state_insert and an OOB
write variant of __xfrm_state_delete, all on the byseq/byspi
hash chains.
__xfrm_state_delete() guards its byseq and byspi unhashes with
value-based predicates:
if (x->km.seq)
hlist_del_rcu(&x->byseq);
if (x->id.spi)
hlist_del_rcu(&x->byspi);
while everywhere else in the file (e.g. state_cache, state_cache_input)
the safer hlist_unhashed() check is used. xfrm_alloc_spi() sets
x->id.spi = newspi inside xfrm_state_lock and then immediately inserts
into byspi, but a path that observes x->id.spi != 0 outside of
xfrm_state_lock can still skip-or-hit the byspi unhash inconsistently
with whether x is actually on the list. The same holds for x->km.seq
versus byseq, and the bydst/bysrc unhashes have no predicate at all,
so a second __xfrm_state_delete() on the same object writes through
LIST_POISON pprev.
The defensive change here (a minimal sketch follows the list):
- Use hlist_del_init_rcu() instead of hlist_del_rcu() on bydst,
bysrc, byseq and byspi so a second deletion is a no-op rather
than a write through LIST_POISON pprev. The byseq/byspi nodes
are already initialised in xfrm_state_alloc().
- Test hlist_unhashed() rather than the value predicate for
byseq/byspi, so the unhash decision tracks list state rather than
mutable scalar fields.
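The sketch below only illustrates the pattern, it is not the literal
__xfrm_state_delete() diff:

	if (!hlist_unhashed(&x->byseq))
		hlist_del_init_rcu(&x->byseq);
	if (!hlist_unhashed(&x->byspi))
		hlist_del_init_rcu(&x->byspi);
	/* re-initialised nodes make a second deletion a no-op */
	hlist_del_init_rcu(&x->bydst);
	hlist_del_init_rcu(&x->bysrc);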
Empirical verification: applied this patch on top of v6.12.47, rebuilt,
and re-ran the same syzkaller harness for 1h16m on a previously-crashy
configuration that produced ~100 hits each of slab-use-after-free
Read in xfrm_alloc_spi / Read in __xfrm_state_lookup / Write in
__xfrm_state_delete. After the patch, 7.1M execs across 32 VMs at
~1550 exec/sec produced zero xfrm_state UAF/OOB hits. /proc/slabinfo
confirms the xfrm_state slab is actively allocated and freed during
the run (~143 KiB resident), so the fuzzer is still exercising those
code paths -- they just no longer crash.
Reproduction:
- Linux 6.12.47 x86_64 + KASAN_GENERIC + KASAN_INLINE + KCOV
- syzkaller @ 746545b8b1e4c3a128db8652b340d3df90ce61db
- 32 QEMU/KVM VMs x 2 vCPU on AWS c5.metal bare metal
- 9 unique signatures collected in ~9h, all within xfrm_state
lifecycle
Fixes: fe9f1d8779 ("xfrm: add state hashtable keyed by seq")
Fixes: 7b4dc3600e ("[XFRM]: Do not add a state whose SPI is zero to the SPI hash.")
Reported-by: Michal Kosiorek <mkosiorek121@gmail.com>
Tested-by: Michal Kosiorek <mkosiorek121@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Michal Kosiorek <mkosiorek121@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The compat 64=>32 translation path handles XFRM_MSG_MAPPING, but
xfrm_msg_min[] does not provide the native payload size for this
message type.
Add the missing XFRM_MSG_MAPPING entry so compat translation can size
and translate mapping notifications correctly.
Fixes: 5461fc0c8d ("xfrm/compat: Add 64=>32-bit messages translator")
Cc: stable@kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Signed-off-by: Ruijie Li <ruijieli51@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The SO_LINGER socket option has been supported for a while with MPTCP
sockets [1], but it didn't cause the equivalent of a TCP reset as
expected when enabled and its time was set to 0. This was causing some
behavioural differences with TCP where some connections were not
promptly stopped as expected.
To fix that, an extra condition is checked at close() time before
sending an MP_FASTCLOSE, the MPTCP equivalent of a TCP reset.
Note that backporting up to [1] will be difficult as more changes are
needed to be able to send MP_FASTCLOSE. It seems better to stop at [2],
which was supposed to already imitate TCP.
Validated with MPTCP packetdrill tests [3].
Fixes: 268b123874 ("mptcp: setsockopt: support SO_LINGER") [1]
Fixes: d21f834855 ("mptcp: use fastclose on more edge scenarios") [2]
Cc: stable@vger.kernel.org
Reported-by: Lance Tuller <lance@lance0.com>
Closes: https://github.com/lance0/xfr/pull/67
Link: https://github.com/multipath-tcp/packetdrill/pull/196 [3]
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260427-net-mptcp-misc-fixes-7-1-rc2-v1-3-7432b7f279fa@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Both mptcp_setsockopt_sol_socket_tstamp() and
mptcp_setsockopt_sol_socket_timestamping() iterate over subflows,
acquire the subflow socket lock, but then erroneously pass the MPTCP
msk socket to sock_set_timestamp() / sock_set_timestamping() instead
of the subflow ssk. As a result, the timestamp flags are set on the
wrong socket and have no effect on the actual subflows.
Pass ssk instead of sk to both helpers.
Fixes: 9061f24bf8 ("mptcp: sockopt: propagate timestamp request to subflows")
Cc: stable@vger.kernel.org
Signed-off-by: Gang Yan <yangang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260427-net-mptcp-misc-fixes-7-1-rc2-v1-1-7432b7f279fa@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
cake_dump_stats() runs without qdisc spinlock being held.
In this final patch, I add READ_ONCE()/WRITE_ONCE() annotations
for cparams.target and cparams.interval.
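The pattern is the usual lockless-annotation pairing; a sketch only, with
the field names taken from the description above and the local variables
assumed, not the exact hunk:

	/* writer side (fast path / control path): */
	WRITE_ONCE(b->cparams.target, target);

	/* lockless reader in the stats dump: */
	u64 target_ns = READ_ONCE(b->cparams.target);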
Fixes: 046f6fd5da ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: "Toke Høiland-Jørgensen" <toke@toke.dk>
Link: https://patch.msgid.link/20260427083606.459355-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
cake_dump_stats() runs without qdisc spinlock being held.
In this fourth patch, I add READ_ONCE()/WRITE_ONCE() annotations
for the following fields:
- avg_peak_bandwidth
- buffer_limit
- buffer_max_used
- avg_netoff
- max_netlen
- max_adjlen
- min_netlen
- min_adjlen
- active_queues
- tin_rate_bps
- bytes
- tin_backlog
Other annotations are added in following patch, to ease code review.
Fixes: 046f6fd5da ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://patch.msgid.link/20260427083606.459355-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
cake_dump_stats() runs without qdisc spinlock being held.
In this third patch, I add READ_ONCE()/WRITE_ONCE() annotations
for the following fields:
- packets
- tin_dropped
- tin_ecn_mark
- ack_drops
- peak_delay
- avge_delay
- base_delay
Other annotations are added in following patches, to ease code review.
Fixes: 046f6fd5da ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: "Toke Høiland-Jørgensen" <toke@toke.dk>
Link: https://patch.msgid.link/20260427083606.459355-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
cake_dump_stats() runs without qdisc spinlock being held.
In this second patch, I add READ_ONCE()/WRITE_ONCE() annotations
for the following fields:
- bulk_flow_count
- unresponsive_flow_count
- max_skblen
- flow_quantum
Other annotations are added in following patches, to ease code review.
Fixes: 046f6fd5da ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: "Toke Høiland-Jørgensen" <toke@toke.dk>
Link: https://patch.msgid.link/20260427083606.459355-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
cake_dump_stats() runs without qdisc spinlock being held.
In this first patch, I add READ_ONCE()/WRITE_ONCE() annotations
for the following fields:
- way_hits
- way_misses
- way_collisions
- sparse_flow_count
- decaying_flow_count
Other annotations are added in following patches, to ease code review.
Fixes: 046f6fd5da ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: "Toke Høiland-Jørgensen" <toke@toke.dk>
Link: https://patch.msgid.link/20260427083606.459355-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After an association reaches ESTABLISHED, the peer’s init_tag is already
known from the handshake. Any subsequent INIT with the same init_tag is
not a valid restart, but a delayed or duplicate INIT.
Drop such INIT chunks in sctp_sf_do_unexpected_init() instead of
processing them as new association attempts.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://patch.msgid.link/5788c76c1ee122a3ed00189e88dcf9df1fba226c.1777214801.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
An INIT whose init_tag matches the peer's vtag does not provide new state
information. It indicates either:
- a stale INIT (after INIT-ACK has already been seen on the same side), or
- a retransmitted INIT (after INIT has already been recorded on the same
side).
In both cases, the INIT must not update ct->proto.sctp.init[] state, since
it does not advance the handshake tracking and may otherwise corrupt
INIT/INIT-ACK validation logic.
Allow INIT processing only when the conntrack entry is newly created
(SCTP_CONNTRACK_NONE), or when the init_tag differs from the stored peer
vtag.
Note it skips the check for the ct with old_state SCTP_CONNTRACK_NONE in
nf_conntrack_sctp_packet(), as it is just created in sctp_new(), which
sets ct->proto.sctp.vtag[IP_CT_DIR_REPLY] = ih->init_tag.
Fixes: 9fb9cbb108 ("[NETFILTER]: Add nf_conntrack subsystem.")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/ee56c3e416452b2a40589a2a85245ac2ad5e9f4b.1777214801.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The dev-set and key-rotate netlink operations modify shared device
state (PSP version configuration and cryptographic key material,
respectively) but do not require CAP_NET_ADMIN. The only access
control is psp_dev_check_access() which merely verifies netns
membership.
Fixes: 00c94ca2b9 ("psp: base PSP device support")
Reviewed-by: Daniel Zahka <daniel.zahka@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260427195856.401223-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
psp_assoc_device_get_locked() obtains a psp_dev reference via
psp_dev_get_for_sock() (which uses psp_dev_tryget() under RCU);
it then acquires psd->lock and drops the reference. Before
the lock is taken, psp_dev_unregister() can run to completion:
take psd->lock, clear out state, unlock, drop the registration
reference.
The expectation is that the lock prevents device unregistration,
but much like with netdevs special care has to be taken when
"upgrading" a reference to a locked device. Add the missing
check if device is still alive. psp_dev_is_registered() exists
already but had no callers, which makes me wonder if I either
forgot to add this or lost the check during refactoring...
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Fixes: 6b46ca260e ("net: psp: add socket security association code")
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260427190606.366101-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'nf-26-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) IEEE1394 ARP payload contains no target hardware address in the
ARP packet. Apparently, arp_tables was never updated to deal with
IEEE1394 ARP properly. To deal with this, return no match in case
the target hardware address selector is used, either for inverse or
normal match. Moreover, arpt_mangle disallows mangling of the target
hardware and IP address because it is not worth adjusting the offset
calculation to fix this; we suspect there are no users of arp_tables
for this family.
2) Use list_del_rcu() to delete device hooks in nf_tables, this hook
list is RCU protected, concurrent netlink dump readers can be
walking on this list, fix it by adding a helper function and use it
for consistency. From Florian Westphal.
3) Add list_splice_rcu(), this is useful for joining the local list of
new device hooks to the RCU protected hook list in chain and
flowtable. Reviewed by Paul E. McKenney.
4) Use list_splice_rcu() to publish the new device hooks in chain and
flowtable to fix concurrent netlink dump traversal.
5) Add a new hook transaction object to track device hook deletions.
The current approach moves device hooks to be deleted around during
the preparation phase, this breaks concurrent RCU reader via netlink
dump. This new hook transaction is combined with NFT_HOOK_REMOVE
flag to annotate hooks for removal in the preparation phase.
6) xt_policy inbound policy check in strict mode can lead to
out-of-bounds access of the secpath array due to incorrect iteration
order. The iteration over the secpath needs to be reversed in the
inbound path to match the human-readable policy, which expects inner
in first position and outer in second position; the secpath on inbound
actually stores outer in first position, then inner in second position.
From Jiexun Wang.
7) Fix possible zero shift in nft_bitwise triggering UBSAN splat,
reject zero shift from control plane, from Kai Ma.
8) Replace simple_strtoul() in the conntrack SIP helper since it relies
on nul-terminated strings. From Florian Westphal.
* tag 'nf-26-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_conntrack_sip: don't use simple_strtoul
netfilter: reject zero shift in nft_bitwise
netfilter: xt_policy: fix strict mode inbound policy matching
netfilter: nf_tables: add hook transactions for device deletions
netfilter: nf_tables: join hook list via splice_list_rcu() in commit phase
rculist: add list_splice_rcu() for private lists
netfilter: nf_tables: use list_del_rcu for netlink hooks
netfilter: arp_tables: fix IEEE1394 ARP payload parsing
====================
Link: https://patch.msgid.link/20260428095840.51961-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, mctp_i2c_get_tx_flow_state() is called before the packet length
sanity check. This function marks a new flow as active in the MCTP core.
If the sanity check fails, mctp_i2c_xmit() returns early without calling
mctp_i2c_lock_nest(). This results in a mismatched locking state: the
flow is active, but the I2C bus lock was never acquired for it.
When the flow is later released, mctp_i2c_release_flow() will see the
active state and queue an unlock marker. The TX thread will then
decrement midev->i2c_lock_count from 0, causing it to underflow to -1.
This underflow permanently breaks the driver's locking logic, allowing
future transmissions to occur without holding the I2C bus lock, leading
to bus collisions and potential hardware hangs.
Move the mctp_i2c_get_tx_flow_state() call to after the length sanity
check to ensure we only transition the flow state if we are actually
going to proceed with the transmission and locking.
Fixes: f5b8abf9fc ("mctp i2c: MCTP I2C binding driver")
Signed-off-by: William A. Kennington III <william@wkennington.com>
Acked-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20260423074741.201460-1-william@wkennington.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
seg6_input_core() and rpl_input() call ip6_route_input() which sets a
NOREF dst on the skb, then pass it to dst_cache_set_ip6() invoking
dst_hold() unconditionally.
On PREEMPT_RT, ksoftirqd is preemptible and a higher-priority task can
release the underlying pcpu_rt between the lookup and the caching
through a concurrent FIB lookup on a shared nexthop.
Simplified race sequence:
ksoftirqd/X higher-prio task (same CPU X)
----------- --------------------------------
seg6_input_core(,skb)/rpl_input(skb)
dst_cache_get()
-> miss
ip6_route_input(skb)
-> ip6_pol_route(,skb,flags)
[RT6_LOOKUP_F_DST_NOREF in flags]
-> FIB lookup resolves fib6_nh
[nhid=N route]
-> rt6_make_pcpu_route()
[creates pcpu_rt, refcount=1]
pcpu_rt->sernum = fib6_sernum
[fib6_sernum=W]
-> cmpxchg(fib6_nh.rt6i_pcpu,
NULL, pcpu_rt)
[slot was empty, store succeeds]
-> skb_dst_set_noref(skb, dst)
[dst is pcpu_rt, refcount still 1]
rt_genid_bump_ipv6()
-> bumps fib6_sernum
[fib6_sernum from W to Z]
ip6_route_output()
-> ip6_pol_route()
-> FIB lookup resolves fib6_nh
[nhid=N]
-> rt6_get_pcpu_route()
pcpu_rt->sernum != fib6_sernum
[W <> Z, stale]
-> prev = xchg(rt6i_pcpu, NULL)
-> dst_release(prev)
[prev is pcpu_rt,
refcount 1->0, dead]
dst = skb_dst(skb)
[dst is the dead pcpu_rt]
dst_cache_set_ip6(dst)
-> dst_hold() on dead dst
-> WARN / use-after-free
For the race to occur, ksoftirqd must be preemptible (PREEMPT_RT without
PREEMPT_RT_NEEDS_BH_LOCK) and a concurrent task must be able to release
the pcpu_rt. Shared nexthop objects provide such a path, as two routes
pointing to the same nhid share the same fib6_nh and its rt6i_pcpu
entry.
Fix seg6_input_core() and rpl_input() by calling skb_dst_force() after
ip6_route_input() to force the NOREF dst into a refcounted one before
caching.
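A minimal sketch of the described change (illustrative; slwt->cache and the
surrounding error handling are assumptions, not the exact seg6_input_core()
hunk):

	ip6_route_input(skb);
	/* ip6_route_input() may set a NOREF dst; take a real reference
	 * before it can be stored in the per-cpu dst_cache. */
	skb_dst_force(skb);
	if (!skb_dst(skb)->error) {
		local_bh_disable();
		dst_cache_set_ip6(&slwt->cache, skb_dst(skb),
				  &ipv6_hdr(skb)->saddr);
		local_bh_enable();
	}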
The output path is not affected as ip6_route_output() already returns a
refcounted dst.
Fixes: af4a2209b1 ("ipv6: sr: use dst_cache in seg6_input")
Fixes: a7a29f9c36 ("net: ipv6: add rpl sr tunnel")
Cc: stable@vger.kernel.org
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Justin Iurman <justin.iurman@gmail.com>
Link: https://patch.msgid.link/20260421094735.20997-1-andrea.mayer@uniroma2.it
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
On VXLAN over IPsec egress, xfrm{4,6}_transport_output() blindly
overwrite inner_transport_header (== the inner TCP header saved in VXLAN
iptunnel_handle_offloads() -> skb_reset_inner_headers()) with the
current transport_header (== the VXLAN outer UDP header set by
udp_tunnel_xmit_skb()).
This was a latent bug, harmless until commit [1] added a doff validation
check in qdisc_pkt_len_segs_init() for encapsulated GSO packets. With
the wrong inner_transport_header set by xfrm, qdisc_pkt_len_segs_init()
interprets inner_transport_header as a TCP header, reads doff=0 from the
upper byte of the VNI and drops the packet with DROP_REASON_SKB_BAD_GSO.
Besides the use in GSO to determine the header size of segmented
packets, inner_transport_header might be used by drivers to set up
inner checksum offloading by pointing the HW to the inner transport
header. A quick browse through available drivers shows that mlx5 uses
skb->csum_start specifically for this scenario, while others either
don't support VXLAN over IPsec crypto offload (ixgbe) or the HW is
capable of parsing the packets itself (nfp, Chelsio).
But in all cases, it is more correct to let the inner_transport_header
point to the innermost header instead of overwriting it in xfrm.
So fix this by guarding all four inner header save sites in
xfrm_output.c (xfrm{4,6}_transport_output, xfrm{4,6}_tunnel_encap_add)
with a check for skb->inner_protocol. When inner_protocol is set, a
tunnel layer (VXLAN, Geneve, GRE, etc.) has already saved the correct
inner header offsets and they must not be overwritten. When
inner_protocol is zero, no prior tunnel encapsulation exists and xfrm
must save the inner headers itself. The tunnel mode checks are only
added for completeness, since they aren't strictly required, as
xfrm_output() forces software GSO in tunnel mode before encap.
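A hedged sketch of one of the guarded save sites (the helpers are the
standard skb inner-header setters; this is not the literal xfrm_output.c
hunk):

	if (!skb->inner_protocol) {
		/* no tunnel (VXLAN/Geneve/GRE) layer recorded inner offsets yet */
		skb_set_inner_transport_header(skb, skb_transport_offset(skb));
	}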
This makes the previously added test pass:
# ./tools/testing/selftests/drivers/net/hw/ipsec_vxlan.py
TAP version 13
1..4
ok 1 ipsec_vxlan.test_vxlan_ipsec_crypto_offload.outer_v4_inner_v4
ok 2 ipsec_vxlan.test_vxlan_ipsec_crypto_offload.outer_v4_inner_v6
ok 3 ipsec_vxlan.test_vxlan_ipsec_crypto_offload.outer_v6_inner_v4
ok 4 ipsec_vxlan.test_vxlan_ipsec_crypto_offload.outer_v6_inner_v6
# Totals: pass:4 fail:0 xfail:0 xpass:0 skip:0 error:0
[1] commit 7fb4c19670 ("net: pull headers in qdisc_pkt_len_segs_init()")
Fixes: f1bd7d659e ("xfrm: Add encapsulation header offsets while SKB is not encrypted")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
netpoll_setup() decides whether to auto-populate the local source
address by testing np->local_ip.ip, which only inspects the first 4
bytes of the union inet_addr storage.
For an IPv6 netpoll whose caller-supplied local address has zero
high 32 bits (::1, ::<suffix>, IPv4-mapped ::ffff:a.b.c.d, etc.), this
misdetects the address as unset (which it is not, but its first
4 bytes are empty), calls netpoll_take_ipv6() and overwrites it with
whatever matching link-local/global address the device happens to expose
first.
Introduce a helper netpoll_local_ip_unset() that picks the correct
family-aware test (ipv6_addr_any() for IPv6, !.ip for IPv4) and use it
from netpoll_setup().
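A sketch of such a helper under the assumptions above (struct netpoll's
ipv6 flag and the union inet_addr layout; names approximate):

	static bool netpoll_local_ip_unset(const struct netpoll *np)
	{
	#if IS_ENABLED(CONFIG_IPV6)
		if (np->ipv6)
			return ipv6_addr_any(&np->local_ip.in6);
	#endif
		return np->local_ip.ip == 0;
	}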
Reproducer is something like:
echo "::2" > local_ip
echo 1 > enabled
cat local_ip
# before this fix: 2001:db8::1 (caller-supplied ::2 was clobbered)
# after this fix: ::2
Fixes: b7394d2429 ("netpoll: prepare for ipv6")
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260424-netpoll_fix-v1-1-3a55348c625f@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcp_clamp_probe0_to_user_timeout() computes remaining time in jiffies
using subtraction with an unsigned lvalue. If elapsed probing time
exceeds the configured TCP_USER_TIMEOUT, the underflow yields a large
value.
This ends up re-arming the probe timer for a full backoff interval
instead of expiring immediately, delaying connection teardown beyond
the configured timeout.
Fix this by preventing underflow so user-set timeout expiration is
handled correctly without extending the probe timer.
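A minimal sketch of the clamp (illustrative, with assumed variable names,
not the exact tcp_clamp_probe0_to_user_timeout() change): compute the
remainder only while the elapsed time is still below the user timeout,
otherwise expire right away.

	if (elapsed >= user_timeout)
		return 1;			/* fire the probe timer immediately */
	remaining = user_timeout - elapsed;	/* cannot underflow now */
	return min_t(u32, when, remaining);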
Fixes: 344db93ae3 ("tcp: make TCP_USER_TIMEOUT accurate for zero window probes")
Link: https://lore.kernel.org/r/20260414013634.43997-1-ahacigu.linux@gmail.com
Signed-off-by: Altan Hacigumus <ahacigu.linux@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260424014639.54110-1-ahacigu.linux@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
neigh_xmit always releases the skb, except when no neighbour table is
found. But even the first added user of neigh_xmit (mpls) relied on
neigh_xmit to release the skb (or queue it for tx).
sashiko reported:
If neigh_xmit() is called with an uninitialized neighbor table (for
example, NEIGH_ND_TABLE when IPv6 is disabled), it returns -EAFNOSUPPORT
and bypasses its internal out_kfree_skb error path. Because the return
value of neigh_xmit() is ignored here, does this leak the SKB?
Assume full ownership and remove the last code path that doesn't
xmit or free skb.
Fixes: 4fd3d7d9e8 ("neigh: Add helper function neigh_xmit")
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260424145843.74055-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
With CONFIG_IP_MROUTE_MULTIPLE_TABLES=n, ipmr_fib_lookup()
does not check if net->ipv4.mrt is NULL.
Since default_device_exit_batch() is called after ->exit_rtnl(),
a device could receive IGMP packets and access net->ipv4.mrt
during/after ipmr_rules_exit_rtnl().
If ipmr_rules_exit_rtnl() had already cleared it and freed the
memory, the access would trigger null-ptr-deref or use-after-free.
Let's fix it by using an RCU helper and freeing mrt after an RCU
grace period.
In addition, check_net(net) is added to mroute_clean_tables()
and ipmr_cache_unresolved() to synchronise via mfc_unres_lock.
This prevents ipmr_cache_unresolved() from putting skb into
c->_c.mfc_un.unres.unresolved after mroute_clean_tables()
purges it.
For the same reason, timer_shutdown_sync() is moved after
mroute_clean_tables().
Since rhltable_destroy() holds a mutex internally, rcu_work is
used, and it is placed as the first member because rcu_head
must be placed within a <4K offset. mr_table is already 3864
bytes without rcu_work.
Note that IP6MR is not yet converted to ->exit_rtnl(), so this
change is not needed for now but will be.
Fixes: b22b018674 ("ipmr: Convert ipmr_net_exit_batch() to ->exit_rtnl().")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260423053456.4097409-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
syzbot reported a kernel BUG triggered from pn_socket_sendmsg() via
pn_socket_autobind():
kernel BUG at net/phonet/socket.c:213!
RIP: 0010:pn_socket_autobind net/phonet/socket.c:213 [inline]
RIP: 0010:pn_socket_sendmsg+0x240/0x250 net/phonet/socket.c:421
Call Trace:
sock_sendmsg_nosec+0x112/0x150 net/socket.c:797
__sock_sendmsg net/socket.c:812 [inline]
__sys_sendto+0x402/0x590 net/socket.c:2280
...
pn_socket_autobind() calls pn_socket_bind() with port 0 and, on
-EINVAL, assumes the socket was already bound and asserts that the
port is non-zero:
err = pn_socket_bind(sock, ..., sizeof(struct sockaddr_pn));
if (err != -EINVAL)
return err;
BUG_ON(!pn_port(pn_sk(sock->sk)->sobject));
return 0; /* socket was already bound */
However pn_socket_bind() also returns -EINVAL when sk->sk_state is not
TCP_CLOSE, even when the socket has never been bound and pn_port() is
still 0. In that case the BUG_ON() fires and panics the kernel from a
user-triggerable path.
Treat the "bind returned -EINVAL but pn_port() is still 0" case as a
regular error and propagate -EINVAL to the caller instead of crashing.
Existing callers already translate a non-zero return from
pn_socket_autobind() into -ENOBUFS/-EAGAIN, so returning -EINVAL here
only changes behaviour from panic to a normal errno.
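A sketch of the resulting autobind tail (variable names follow the quote
above; illustrative only):

	err = pn_socket_bind(sock, (struct sockaddr *)&sa, sizeof(sa));
	if (err != -EINVAL)
		return err;
	if (!pn_port(pn_sk(sock->sk)->sobject))
		return -EINVAL;	/* never bound and not bindable now: plain error */
	return 0;		/* socket was already bound */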
Fixes: ba113a94b7 ("Phonet: common socket glue")
Reported-by: syzbot+706f5eb79044e686c794@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=706f5eb79044e686c794
Suggested-by: Remi Denis-Courmont <courmisch@gmail.com>
Signed-off-by: Morduan Zang <zhangdandan@uniontech.com>
Signed-off-by: zhanjun <zhanjun@uniontech.com>
Acked-by: Rémi Denis-Courmont <remi@remlab.net>
Link: https://patch.msgid.link/87A8960A2045AF3C+20260423010557.138124-1-zhangdandan@uniontech.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When a TAPRIO child qdisc is deleted via RTM_DELQDISC, taprio_graft()
is called with new == NULL and stores NULL into q->qdiscs[cl - 1].
Subsequent RTM_GETTCLASS dump operations walk all classes via
taprio_walk() and call taprio_dump_class(), which calls taprio_leaf()
returning the NULL pointer, then dereferences it to read child->handle,
causing a kernel NULL pointer dereference.
The bug is reachable with namespace-scoped CAP_NET_ADMIN on any kernel
with CONFIG_NET_SCH_TAPRIO enabled. On systems with unprivileged user
namespaces enabled, an unprivileged local user can trigger a kernel
panic by creating a taprio qdisc inside a new network namespace,
grafting an explicit child qdisc, deleting it, and requesting a class
dump. The RTM_GETTCLASS dump itself requires no capability.
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000007: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000038-0x000000000000003f]
RIP: 0010:taprio_dump_class (net/sched/sch_taprio.c:2478)
Call Trace:
<TASK>
tc_fill_tclass (net/sched/sch_api.c:1966)
qdisc_class_dump (net/sched/sch_api.c:2326)
taprio_walk (net/sched/sch_taprio.c:2514)
tc_dump_tclass_qdisc (net/sched/sch_api.c:2352)
tc_dump_tclass_root (net/sched/sch_api.c:2370)
tc_dump_tclass (net/sched/sch_api.c:2431)
rtnl_dumpit (net/core/rtnetlink.c:6864)
netlink_dump (net/netlink/af_netlink.c:2325)
rtnetlink_rcv_msg (net/core/rtnetlink.c:6959)
netlink_rcv_skb (net/netlink/af_netlink.c:2550)
</TASK>
Fix this by substituting &noop_qdisc when new is NULL in
taprio_graft(), a common pattern used by other qdiscs (e.g.,
multiq_graft()) to ensure the q->qdiscs[] slots are never NULL.
This makes control-plane dump paths safe without requiring individual
NULL checks.
Since the data-plane paths (taprio_enqueue and taprio_dequeue_from_txq)
previously had explicit NULL guards that would drop/skip the packet
cleanly, update those checks to test for &noop_qdisc instead. Without
this, packets would reach taprio_enqueue_one() which increments the root
qdisc's qlen and backlog before calling the child's enqueue; noop_qdisc
drops the packet but those counters are never rolled back, permanently
inflating the root qdisc's statistics.
After this change *old can be a valid qdisc, NULL, or &noop_qdisc.
Only call qdisc_put(*old) in the first case to avoid decreasing
noop_qdisc's refcount, which was never increased.
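A hedged sketch of the graft logic described above (not the literal
taprio_graft() hunk):

	if (!new)
		new = &noop_qdisc;	/* keep q->qdiscs[] slots non-NULL */
	*old = q->qdiscs[cl - 1];
	q->qdiscs[cl - 1] = new;
	if (*old && *old != &noop_qdisc)
		qdisc_put(*old);	/* never drop noop_qdisc's refcount */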
Fixes: 665338b2a7 ("net/sched: taprio: dump class stats for the actual q->qdiscs[]")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Tested-by: Weiming Shi <bestswngs@gmail.com>
Link: https://patch.msgid.link/20260422161958.2517539-3-bestswngs@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ipv6_rpl_srh_rcv() decompresses an RFC 6554 Source Routing Header, swaps
the next segment into ipv6_hdr->daddr, recompresses, then pulls the old
header and pushes the new one plus the IPv6 header back. The
recompressed header can be larger than the received one when the swap
reduces the common-prefix length the segments share with daddr (CmprI=0,
CmprE>0, seg[0][0] != daddr[0] gives the maximum +8 bytes).
pskb_expand_head() was gated on segments_left == 0, so on earlier
segments the push consumed unchecked headroom. Once skb_push() leaves
fewer than skb->mac_len bytes in front of data,
skb_mac_header_rebuild()'s call to:
skb_set_mac_header(skb, -skb->mac_len);
will store (data - head) - mac_len into the u16 mac_header field, which
wraps to ~65530, and the following memmove() writes mac_len bytes ~64KiB
past skb->head.
A single AF_INET6/SOCK_RAW/IPV6_HDRINCL packet over lo with a two
segment type-3 SRH (CmprI=0, CmprE=15) reaches headroom 8 after one
pass; KASAN reports a 14-byte OOB write in ipv6_rthdr_rcv.
Fix this by expanding the head whenever the remaining room is less than
the push size plus mac_len, and request that much extra so the rebuilt
MAC header fits afterwards.
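A sketch of the headroom condition described above (push_len stands for the
size of the rebuilt SRH plus IPv6 header; illustrative only):

	if (unlikely(skb_headroom(skb) < push_len + skb->mac_len)) {
		if (pskb_expand_head(skb, push_len + skb->mac_len - skb_headroom(skb),
				     0, GFP_ATOMIC))
			goto drop;
	}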
Fixes: 8610c7c6e3 ("net: ipv6: add support for rpl sr exthdr")
Cc: stable <stable@kernel.org>
Reported-by: Anthropic
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://patch.msgid.link/2026042133-gout-unvented-1bd9@gregkh
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
fq_pie_dump_stats() acquires the qdisc spinlock a bit too late.
Move this acquisition before we fill tc_fq_pie_xstats with live data.
Alternative would be to add READ_ONCE() and WRITE_ONCE() annotations,
but the spinlock is needed anyway to scan q->new_flows and q->old_flows.
Fixes: ec97ecf1eb ("net: sched: add Flow Queue PIE packet scheduler")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260423063527.2568262-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
choke_dump_stats() only runs with RTNL held.
It reads fields that can be changed in qdisc fast path.
Add READ_ONCE()/WRITE_ONCE() annotations.
Fixes: edb09eb17e ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260423062839.2524324-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reject requests with negative latency or jitter.
A negative value added to current timestamp (u64) wraps
to an enormous time_to_send, disabling dequeue.
The original UAPI used u32 for these values; the conversion to 64-bit
time values via TCA_NETEM_LATENCY64 and TCA_NETEM_JITTER64
allowed signed values to reach the kernel without validation.
Jitter is already silently clamped by an abs() in netem_change();
that abs() can be removed in a follow-up once this rejection is in
place.
Fixes: 99803171ef ("netem: add uapi to express delay and jitter in nanoseconds")
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260418032027.900913-7-stephen@networkplumber.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
get_slot_next() computes a random delay between min_delay and
max_delay using:
get_random_u32() * (max_delay - min_delay) >> 32
This overflows signed 64-bit arithmetic when the delay range exceeds
approximately 2.1 seconds (2^31 nanoseconds), producing a negative
result that effectively disables slot-based pacing. This is a
realistic configuration for WAN emulation (e.g., slot 1s 5s).
Use mul_u64_u32_shr() which handles the widening multiply without
overflow.
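An illustrative use of the helper for this computation (a sketch that
assumes the slot_config field names; not the exact netem hunk):

	u64 range = q->slot_config.max_delay - q->slot_config.min_delay;

	/* (range * rnd) >> 32 without overflowing signed 64-bit math */
	q->slot.slot_next = ktime_get_ns() + q->slot_config.min_delay +
			    mul_u64_u32_shr(range, get_random_u32(), 32);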
Fixes: 0a9fe5c375 ("netem: slotting with non-uniform distribution")
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260418032027.900913-6-stephen@networkplumber.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reject slot configurations that have no defensible meaning:
- negative min_delay or max_delay
- min_delay greater than max_delay
- negative dist_delay or dist_jitter
- negative max_packets or max_bytes
Negative or out-of-order delays underflow in get_slot_next(),
producing garbage intervals. Negative limits trip the per-slot
accounting (packets_left/bytes_left <= 0) on the first packet of
every slot, defeating the rate-limiting half of the slot feature.
Note that dist_jitter has been silently coerced to its absolute
value by get_slot() since the feature was introduced; rejecting
negatives here converts that silent coercion into -EINVAL. The
abs() can be removed in a follow-up.
Fixes: 836af83b54 ("netem: support delivering packets in delayed time slots")
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260418032027.900913-5-stephen@networkplumber.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
netem_change() unconditionally reseeds the PRNG on every tc change
command. If TCA_NETEM_PRNG_SEED is not specified, a new random seed
is generated, destroying reproducibility for users who set a
deterministic seed on a previous change.
Move the initial random seed generation to netem_init() and only
reseed in netem_change() when TCA_NETEM_PRNG_SEED is explicitly
provided by the user.
Fixes: 4072d97ddc ("netem: add prng attribute to netem_sched_data")
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260418032027.900913-4-stephen@networkplumber.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The queue limit check in netem_enqueue() uses q->t_len which only
counts packets in the internal tfifo. Packets placed in sch->q by
the reorder path (__qdisc_enqueue_head) are not counted, allowing
the total queue occupancy to exceed sch->limit under reordering.
Include sch->q.qlen in the limit check.
Fixes: f8d4bc4550 ("net/sched: netem: account for backlog updates from child qdisc")
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260418032027.900913-3-stephen@networkplumber.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The 4-state Markov chain in loss_4state() has gaps at the boundaries
between transition probability ranges. The comparisons use:
if (rnd < a4)
else if (a4 < rnd && rnd < a1 + a4)
When rnd equals a boundary value exactly, neither branch matches and
no state transition occurs. The redundant lower-bound check (a4 < rnd)
is already implied by being in the else branch.
Remove the unnecessary lower-bound comparisons so the ranges are
contiguous and every random value produces a transition, matching
the GI (General and Intuitive) loss model specification.
This bug goes back to the original implementation of this model.
Fixes: 661b79725f ("netem: revised correlated loss generator")
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260418032027.900913-2-stephen@networkplumber.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ieee80211_invoke_fast_rx() is documented as safe for parallel RX, but
its per-invocation rx_result is declared static. Concurrent callers then
share one instance and can overwrite each other's result between
ieee80211_rx_mesh_data() and the switch on res.
That can make a packet that was queued or consumed by
ieee80211_rx_mesh_data() fall through into ieee80211_rx_8023(), or make
a packet that should continue return as queued.
Make res an automatic variable so each invocation keeps its own result.
Fixes: 3468e1e0c6 ("wifi: mac80211: add mesh fast-rx support")
Cc: stable@vger.kernel.org
Signed-off-by: Catherine <enderaoelyther@gmail.com>
Link: https://patch.msgid.link/20260424131435.83212-2-enderaoelyther@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>