linux

mirror of https://github.com/torvalds/linux.git synced 2026-07-27 09:36:22 +02:00

Author	SHA1	Message	Date
Nikola Z. Ivanov	313a123e1f	ipv6: Change allocation flags to match rcu_read_lock section requirements Since the call to __ip6_del_rt_siblings has been converted under rcu read lock and it only has one call point we should no longer block or yield. Our stack trace from the syzbot reproducer looks as follows: __ip6_del_rt_siblings rtnl_notify (Here we pass gfp_any() -> GFP_KERNEL) nlmsg_notify nlmsg_multicast nlmsg_multicast_filtered netlink_broadcast_filtered (GFP_KERNEL passed from earlier) netlink_broadcast_filtered can yield if GFP_KERNEL is passed, which we do not want to happen. Fix this by changing the allocation flag of rtnl_notify. Also change the flag passed to nlmsg_new. Even though it is not related to the syzbot generated bug it still falls under the same requirements. Reported-by: syzbot+84d4a405ed798b40c96d@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=84d4a405ed798b40c96d Fixes: `bd11ff421d` ("ipv6: Get rid of RTNL for SIOCDELRT and RTM_DELROUTE.") Signed-off-by: Nikola Z. Ivanov <zlatistiv@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260719105759.558050-1-zlatistiv@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-07-23 09:13:13 -07:00
Li RongQing	440e274da4	net: ipv6: fix dif and sdif mismatch in raw6_icmp_error In raw6_icmp_error(), raw_v6_match() is called with inet6_iif(skb) passed to both the 'dif' and 'sdif' arguments. This is a copy-paste or typo error, as the last argument should represent the secondary interface index (sdif). This mismatch breaks ICMPv6 error handling for IPv6 raw sockets in VRF (Virtual Routing and Forwarding) environments. When a raw socket is bound to a VRF master device, raw_v6_match() fails to find a match because it is not given the correct sdif value, causing the socket to miss relevant ICMPv6 error notifications. Fix this by properly passing inet6_sdif(skb) as the last argument to raw_v6_match(). Fixes: `5108ab4bf4` ("net: ipv6: add second dif to raw socket lookups") Signed-off-by: Li RongQing <lirongqing@baidu.com> Reviewed-by: Joe Damato <joe@dama.to> Link: https://patch.msgid.link/20260717143230.1836-1-lirongqing@baidu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-07-23 08:31:44 -07:00
Yun Zhou	675ed582c1	net: gre: fix lltx regression for GRE tunnels with SEQ/CSUM Before commit `00d066a4d4` ("netdev_features: convert NETIF_F_LLTX to dev->lltx"), NETIF_F_LLTX was set unconditionally in both __gre_tunnel_init() and ip6gre_tnl_init_features() alongside GRE_FEATURES: dev->features |= GRE_FEATURES | NETIF_F_LLTX; When that commit converted NETIF_F_LLTX to the dev->lltx flag, it placed 'dev->lltx = true' after the SEQ/CSUM early returns instead of before them. This causes GRE/GRETAP/ip6gre tunnels with SEQ or CSUM+encap to lose lockless TX, reintroducing _xmit_lock acquisition around their ndo_start_xmit. Since GRE xmit re-enters the stack via ip_tunnel_xmit(), holding _xmit_lock risks ABBA deadlock with the underlay device. CPU0 CPU1 ---- ---- lock(&qdisc_xmit_lock_key#6); lock(&qdisc_xmit_lock_key#3); lock(&qdisc_xmit_lock_key#6); lock(&qdisc_xmit_lock_key#3); Fix by moving dev->lltx = true before the early returns in both functions, restoring the original unconditional behavior. Fixes: `00d066a4d4` ("netdev_features: convert NETIF_F_LLTX to dev->lltx") Signed-off-by: Yun Zhou <yun.zhou@windriver.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260713150945.1779628-1-yun.zhou@windriver.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-07-23 13:01:57 +02:00
Michael Bommarito	92d3817649	ila: reload IPv6 header after pskb_may_pull in checksum adjust ila_csum_adjust_transport() caches ip6h = ipv6_hdr(skb) before calling pskb_may_pull(). On a non-linear skb whose transport header sits in a page fragment, pskb_may_pull() can call __pskb_pull_tail() / pskb_expand_head() and free the old skb head, leaving ip6h dangling; the following get_csum_diff(ip6h, p) then reads freed memory. ila_update_ipv6_locator() uses ip6h (and the iaddr derived from it) again after the csum-adjust call and additionally writes the new locator through that pointer. Impact: a remote IPv6 packet routed through a configured ILA csum-adjust-transport route or receive-side mapping triggers a slab-use-after-free in ila_update_ipv6_locator() (KASAN). The route or mapping requires CAP_NET_ADMIN to configure, but trigger packets are unauthenticated once it exists. Reload ip6h after each pskb_may_pull() in ila_csum_adjust_transport() before the csum-diff read. In ila_update_ipv6_locator() only the ILA_CSUM_ADJUST_TRANSPORT case pulls the skb, so reload ip6h and iaddr in that case alone before the destination-address write; the neutral-map modes never pull and keep their cached pointers. Fixes: `33f11d1614` ("ila: Create net/ipv6/ila directory") Cc: stable@vger.kernel.org Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Antoine Tenart <atenart@kernel.org> Link: https://patch.msgid.link/20260714114903.3763420-1-michael.bommarito@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-07-22 14:00:41 -07:00
Yizhou Zhao	e1a9d3cc11	tcp: initialize standalone TCP-AO response padding tcp_v4_send_ack() and tcp_v6_send_response() construct standalone TCP responses with TCP-AO options. The option length carries the actual MAC length, but the TCP header length includes the option rounded up to a four-byte boundary. tcp_ao_hash_hdr() writes the MAC only. Thus, when the MAC length is not four-byte aligned, the one to three bytes after the MAC are left uninitialized and may be transmitted. For the normal TCP-AO hashing mode, those bytes also have to be initialized before computing the MAC. Initialize only the alignment padding in the TCP-AO branches, before hashing the header. Use TCPOPT_NOP, as in the normal TCP-AO output path. This avoids adding work to non-AO TCP responses while preserving a valid authenticated header. Fixes: `decde2586b` ("net/tcp: Add TCP-AO sign to twsk") Fixes: `da7dfaa6d6` ("net/tcp: Consistently align TCP-AO option in the header") Cc: stable@vger.kernel.org Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn> Reported-by: Ao Wang <wangao@seu.edu.cn> Reported-by: Xuewei Feng <fengxw06@126.com> Reported-by: Qi Li <qli01@tsinghua.edu.cn> Reported-by: Ke Xu <xuke@tsinghua.edu.cn> Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260713105631.8616-1-zhaoyz24@mails.tsinghua.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-07-21 15:24:36 -07:00
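The padding rule described in the commit above can be illustrated in user space. This is a minimal sketch, not the kernel patch: `toy_pad_option` and `TOY_TCPOPT_NOP` are hypothetical names, and the only facts taken from the message are that the option is rounded up to a four-byte boundary and that the gap is filled with NOP bytes before hashing.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_TCPOPT_NOP 1 /* TCP "no operation" option kind */

/*
 * Fill the bytes between the real option length and the next
 * four-byte boundary with NOPs, so that no uninitialized byte is
 * hashed or transmitted. Returns the padded length.
 * The caller must provide a buffer of at least the padded size.
 */
size_t toy_pad_option(uint8_t *opt, size_t optlen)
{
    size_t aligned = (optlen + 3) & ~(size_t)3;

    memset(opt + optlen, TOY_TCPOPT_NOP, aligned - optlen);
    return aligned;
}
```

An option that is already four-byte aligned gets no padding, which matches the commit's goal of touching only the alignment gap.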
Eric Dumazet	2c1931a811	tcp: fix TIME_WAIT socket reference leak on PSP policy failure Release the TIME_WAIT socket reference and jump to discard_it upon PSP policy failure in both IPv4 and IPv6 receive paths. This prevents a memory leak of tcp_tw_bucket structures. Fixes: `659a2899a5` ("tcp: add datapath logic for PSP with inline key exchange") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Daniel Zahka <daniel.zahka@gmail.com> Link: https://patch.msgid.link/20260710181317.4060230-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-07-17 11:53:55 +02:00
Paolo Abeni	bc7291793f	netfilter pull request nf-26-07-10 Merge tag 'nf-26-07-10' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Florian Westphal says: ==================== netfilter: updates for net The following patchset contains Netfilter fixes for net. These are fixes for bugs except patches 6 and 9 which fix issues added in last PR and 7.1-rc1. 1) Reject unsupported target families in xt_nat_checkentry(). From Wyatt Feng. 2) Fix inverted time_after() check in ecache_work_evict_list(). Causes pointless work rescheds and thus way longer time to clear the pending event backlog. From Yizhou Zhao. 3) Fix a use-after-free in br_ip6_fragment() caused by a dangling prevhdr pointer. From Xiang Mei. 4) Fix incorrect conntrack zone comparison in nf_conncount tuple deduplication. Pass IP_CT_DIR_ORIGINAL, not zone direction. From Yizhou Zhao. 5) Add bridge tunnel flowtable regression test for a bug that got fixed in the previous PR. From Zhengyang Chen. 6) Use the correct direction when setting up tunnel routes in the flowtable xmit path. From Pablo Neira Ayuso. This fixes a bug added in the previous PR. 7) Reload IP header after potential skb head reallocation in IPVS. 8) Fix incorrect IPv6 transport offsets in TCP application code. Correct the ICMPv6 header offset to ensure proper checksumming with extension headers, from Julian Anastasov. This is a follow-up to the previous PR. 9) Remove null-termination requirement for xt_physdev masks, this broke device names with 15 characters. netfilter pull request nf-26-07-10 * tag 'nf-26-07-10' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: xt_physdev: masks are not c-strings ipvs: fix more places with wrong ipv6 transport offsets ipvs: reload ip header after head reallocation netfilter: flowtable: use correct direction to set up tunnel route selftests: netfilter: add bridge tunnel flowtable regression netfilter: nf_conncount: fix zone comparison in tuple dedup netfilter: bridge: fix stale prevhdr pointer in br_ip6_fragment() netfilter: ecache: fix inverted time_after() check netfilter: xt_nat: reject unsupported target families ==================== Link: https://patch.msgid.link/20260710143733.29741-1-fw@strlen.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-07-17 11:19:37 +02:00
Paolo Abeni	389704eb51	ipsec-2026-07-10 Merge tag 'ipsec-2026-07-10' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2026-07-10 1) xfrm: propagate -EINPROGRESS from validate_xmit_xfrm() Return -EINPROGRESS from xfrm_output_one when validate_xmit_xfrm requeues the packet asynchronously, so the caller doesn't treat it as a real error and free the skb. 2) xfrm: fix stale skb->prev after async crypto steals a GSO segment Re-derive skb->prev from the fragment list after async crypto splits a GSO skb, keeping the linked-list pointers valid. 3) xfrm: nat_keepalive: avoid double free on send error Hold a state ref while the nat_keepalive timer is active and drop the timer before freeing the state, preventing a re-entered free on send error. 4) xfrm: fix sk_dst_cache double-free in xfrm_user_policy() Null the skb dst cache before freeing the policy so a later skb destructor doesn't double-free it. 5) xfrm: cache the offload ifindex for netlink dumps Cache the device ifindex at state-add time and use it for netlink dumps instead of dereferencing dst->dev, which may have changed by the time the dump runs. 6) xfrm: reject optional IPTFS templates in outbound policies Reject outbound policies with an optional IPTFS template, IPTFS must always be used if configured. 7) xfrm: clear mode callbacks after failed mode setup Clear the mode->init_flags and init_state callbacks on the error path after xfrm_init_mode fails, so a partially-initialised mode isn't reused in xfrm_state_construct. 8) xfrm: iptfs: propagate SKBFL_SHARED_FRAG in iptfs_skb_add_frags() Propagate SKBFL_SHARED_FRAG from the original skb to fragments allocated by iptfs_skb_add_frags, keeping shared-fragment accounting correct after IPTFS reassembly. 9) xfrm6: clear dst.dev on error to avoid double netdev_put in xfrm6_fill_dst() Clear dst->dev on the error path of xfrm6_fill_dst() so the caller doesn't release the netdev reference twice via dst_release. 10) xfrm: policy: preallocate inexact bins before xfrm_hash_rebuild reinsert Preallocate all inexact hash bins before existing entries are reinserted during xfrm_hash_rebuild, so reinsertion always hits an existing bin. Please pull or let me know if there are problems. ipsec-2026-07-10 * tag 'ipsec-2026-07-10' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: xfrm: policy: preallocate inexact bins before xfrm_hash_rebuild reinsert xfrm6: clear dst.dev on error to avoid double netdev_put in xfrm6_fill_dst() xfrm: iptfs: propagate SKBFL_SHARED_FRAG in iptfs_skb_add_frags() xfrm: clear mode callbacks after failed mode setup xfrm: reject optional IPTFS templates in outbound policies xfrm: cache the offload ifindex for netlink dumps xfrm: fix sk_dst_cache double-free in xfrm_user_policy() xfrm: nat_keepalive: avoid double free on send error xfrm: fix stale skb->prev after async crypto steals a GSO segment xfrm: propagate -EINPROGRESS from validate_xmit_xfrm() ==================== Link: https://patch.msgid.link/20260710090349.343389-1-steffen.klassert@secunet.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-07-11 12:48:08 +02:00
Xiang Mei (Microsoft)	86f3ce81dd	netfilter: bridge: fix stale prevhdr pointer in br_ip6_fragment() br_ip6_fragment() gets prevhdr, a pointer into the skb head, from ip6_find_1stfragopt(), then calls skb_checksum_help(). For a cloned skb skb_checksum_help() reallocates the head via pskb_expand_head(), leaving prevhdr dangling. It is later dereferenced in ip6_frag_next(), causing a use-after-free write. Save prevhdr's offset before skb_checksum_help() and recompute it after, like commit `ef0efcd3bd` ("ipv6: Fix dangling pointer when ipv6 fragment"). BUG: KASAN: slab-use-after-free in ip6_frag_next (net/ipv6/ip6_output.c:857) Write of size 1 at addr ffff888013ff5016 by task exploit/141 Call Trace: ... kasan_report (mm/kasan/report.c:595) ip6_frag_next (net/ipv6/ip6_output.c:857) br_ip6_fragment (net/ipv6/netfilter.c:212) nf_ct_bridge_post (net/bridge/netfilter/nf_conntrack_bridge.c:407) nf_hook_slow (net/netfilter/core.c:619) br_forward_finish (net/bridge/br_forward.c:66) __br_forward (net/bridge/br_forward.c:115) maybe_deliver (net/bridge/br_forward.c:191) br_flood (net/bridge/br_forward.c:245) br_handle_frame_finish (net/bridge/br_input.c:229) br_handle_frame (net/bridge/br_input.c:442) ... packet_sendmsg (net/packet/af_packet.c:3114) ... do_syscall_64 (arch/x86/entry/syscall_64.c:94) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121) Kernel panic - not syncing: Fatal exception in interrupt Fixes: `764dd163ac` ("netfilter: nf_conntrack_bridge: add support for IPv6") Cc: stable@vger.kernel.org Reported-by: AutonomousCodeSecurity@microsoft.com Signed-off-by: Xiang Mei (Microsoft) <xmei5@asu.edu> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-07-10 16:28:47 +02:00
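The "save the offset, then recompute the pointer" pattern that this commit (and the ila and ipvs fixes in this log) relies on can be sketched outside the kernel. Everything here is a hypothetical stand-in: `toy_buf` plays the role of the skb head and `toy_expand_head` mimics a helper that may move it, loosely modelled on what the message says about pskb_expand_head().

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy buffer whose head may move when it is expanded. */
struct toy_buf {
    unsigned char *head;
    size_t len;
};

/* Always reallocates, so any cached pointer into the old head dangles. */
int toy_expand_head(struct toy_buf *b, size_t extra)
{
    unsigned char *n = malloc(b->len + extra);

    if (!n)
        return -1;
    memcpy(n, b->head, b->len);
    free(b->head);
    b->head = n;
    b->len += extra;
    return 0;
}

/* Write one byte at offset off (off < b->len) after an expansion. */
int toy_write_after_expand(struct toy_buf *b, size_t off, unsigned char v)
{
    unsigned char *p = b->head + off;  /* cached pointer into the head */
    ptrdiff_t saved = p - b->head;     /* remember it as an offset */

    if (toy_expand_head(b, 16))
        return -1;
    p = b->head + saved;               /* recompute from the new head */
    *p = v;
    return 0;
}
```

Skipping the recompute step would leave `p` pointing into freed memory, which is the class of bug the KASAN report above shows.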
Florian Westphal	da5b58478a	netfilter: handle unreadable frags sashiko reports: When an skb with unreadable fragments (such as from devmem TCP, where skb_frags_readable(skb) returns false) is processed by the u32 module, skb_copy_bits() will safely return a negative error code [..] xt_u32: bail out with hotdrop in this case. gather_frags: return -1, just as if we had no fragment header. nfnetlink_queue: restrict to the linear part. nfnetlink_log: restrict to the linear part. v2: - skb_zerocopy helpers don't copy readable flag, i.e. nfnetlink_queue is broken too xt_u32 shouldn't return true if hotdrop was set. Fixes: `65249feb6b` ("net: add support for skbs with unreadable frags") Cc: stable@vger.kernel.org Acked-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-07-08 15:33:44 +02:00
Xiang Mei	3b08fed5b7	netfilter: nf_conntrack_reasm: guard mac_header adjustment after IPv6 defrag nf_ct_frag6_reasm() slides the packet head forward to drop the IPv6 fragment header and then unconditionally advances skb->mac_header: skb->mac_header += sizeof(struct frag_hdr); On the NF_INET_LOCAL_OUT defrag path the skb has no link-layer header yet, so skb->mac_header is still the "not set" sentinel (u16)~0U. Adding sizeof(struct frag_hdr) wraps it to a small value (0xffff + 8 == 7), after which skb_mac_header_was_set() wrongly reports a MAC header is present and skb_mac_header() points into the headroom. The reassembler has done this unconditional add since it was introduced; it was harmless while mac_header was a bare pointer, but wrong once mac_header became a u16 offset whose unset state is the ~0U sentinel tested by skb_mac_header_was_set(). The sibling net/ipv6/reassembly.c does the same relocation and does guard the adjustment; mirror the guard here. Fixes: `9fb9cbb108` ("[NETFILTER]: Add nf_conntrack subsystem.") Cc: stable@vger.kernel.org Reported-by: Weiming Shi <bestswngs@gmail.com> Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-07-08 15:22:03 +02:00
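The wraparound arithmetic in this message (0xffff + 8 == 7 in a u16) can be checked with a small user-space model. The struct and function names below are hypothetical; only the sentinel value, the 8-byte fragment header size, and the guard idea come from the commit text.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_FRAG_HDR_LEN 8u          /* sizeof(struct frag_hdr) per the message */
#define TOY_MAC_HEADER_UNSET 0xffffu /* the (u16)~0U "not set" sentinel */

/* Toy stand-in for the one sk_buff field that matters here. */
struct toy_skb {
    uint16_t mac_header;
};

bool toy_mac_header_was_set(const struct toy_skb *skb)
{
    return skb->mac_header != TOY_MAC_HEADER_UNSET;
}

/* Unguarded add: the sentinel wraps to 7 and then looks "set". */
void toy_adjust_unguarded(struct toy_skb *skb)
{
    skb->mac_header = (uint16_t)(skb->mac_header + TOY_FRAG_HDR_LEN);
}

/* Guarded add: leave the sentinel alone when no MAC header exists. */
void toy_adjust_guarded(struct toy_skb *skb)
{
    if (toy_mac_header_was_set(skb))
        skb->mac_header = (uint16_t)(skb->mac_header + TOY_FRAG_HDR_LEN);
}
```

With the guard in place the unset state survives the adjustment, which is the behavior the message attributes to net/ipv6/reassembly.c.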
Eric Dumazet	9b26518b68	ipv6: mcast: Fix potential UAF in MLD delayed work A race condition exists between device teardown and incoming MLD query processing, leading to a Use-After-Free in the MLD delayed work. During device destruction, the primary reference to inet6_dev is dropped, which can drop its refcount to 0. The actual freeing of inet6_dev memory is deferred via RCU. Concurrently, the packet receive path runs under RCU read lock and obtains the inet6_dev pointer. Because the memory is RCU-protected, CPU-0 can safely dereference inet6_dev even if its refcount has hit 0. However, if CPU-0 calls igmp6_event_query() and schedules delayed work, it attempts to acquire a reference using in6_dev_hold(). This increments the refcount from 0 to 1, triggering a "refcount_t: addition on 0" warning. Since the inet6_dev memory is still scheduled to be freed after the RCU grace period, the device is freed while the work is still scheduled. When the work runs, it accesses the freed memory, causing a kernel panic. Fix this by using refcount_inc_not_zero() (via a new helper in6_dev_hold_safe()) to prevent acquiring a reference if the device is already being destroyed. If the refcount is 0, we do not schedule the work. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260705181756.963063-3-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-07-08 14:41:01 +02:00
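The "take a reference only if the count is not already zero" idea behind refcount_inc_not_zero() can be sketched with C11 atomics. This is an illustration under stated assumptions, not the kernel implementation: `toy_ref` and `toy_ref_inc_not_zero` are hypothetical names, and overflow or saturation handling is omitted.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy reference counter; 0 means the object is already being destroyed. */
struct toy_ref {
    atomic_uint count;
};

/*
 * Increment the counter unless it is zero. Returns true when a
 * reference was taken, false when the object must not be used.
 */
bool toy_ref_inc_not_zero(struct toy_ref *r)
{
    unsigned int old = atomic_load(&r->count);

    while (old != 0) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
            return true;
    }
    return false;
}
```

A caller in the position of the MLD query path would schedule the delayed work only when this returns true, so a 0 to 1 transition never happens.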
Xiang Mei (Microsoft)	136992de9b	xfrm6: clear dst.dev on error to avoid double netdev_put in xfrm6_fill_dst() On the error path where in6_dev_get(dev) returns NULL, xfrm6_fill_dst() releases the device reference with netdev_put() but leaves xdst->u.dst.dev set. dst_destroy() later calls netdev_put(dst->dev) again, so the same net_device reference is released twice, underflowing its refcount (ref_tracker WARNING + "unregister_netdevice: waiting for <dev> to become free"). Clear xdst->u.dst.dev after the netdev_put(), the same way the XFRM device-offload paths xfrm_dev_state_add() and xfrm_dev_policy_add() in net/xfrm/xfrm_device.c NULL ->dev when releasing the reference on error. ref_tracker: reference already released. ref_tracker: allocated in: xfrm6_fill_dst (net/ipv6/xfrm6_policy.c:86) ... udpv6_sendmsg (net/ipv6/udp.c:1696) ... ref_tracker: freed in: xfrm6_fill_dst (net/ipv6/xfrm6_policy.c:90) ... WARNING: lib/ref_tracker.c:322 at ref_tracker_free+0x58b/0x780 dst_destroy (net/core/dst.c:115) rcu_core handle_softirqs ... Fixes: `84c4a9dfbf` ("xfrm6: release dev before returning error") Reported-by: AutonomousCodeSecurity@microsoft.com Signed-off-by: Xiang Mei (Microsoft) <xmei5@asu.edu> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2026-07-06 08:29:08 +02:00
Zhixing Chen	43ccc20b5a	netfilter: ip6tables: mark malformed IPv6 extension headers for hotdrop The ah, hbh and rt matches check that the fixed extension header is present, then use the header length field to derive the advertised extension header length for matching. For the ah match, add the missing advertised-length check. For hbh and rt, update the existing advertised-length checks. In all three cases, set hotdrop to true before returning false when the advertised extension header length exceeds the available skb data. Returning false treats the packet as a rule mismatch. Set hotdrop to true and drop malformed packets so they cannot bypass rules intended to drop packets with these IPv6 extension headers. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Zhixing Chen <running910@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-07-03 14:45:21 +02:00
Pengfei Zhang	9facb861dc	ipv6: fib6: fix NULL deref in fib6_walk_continue() on multi-batch dump inet6_dump_fib() saves its progress in cb->args[1] as a positional index within the current hash chain. Between batches, a concurrent fib6_new_table() can insert a new table at the chain head, shifting all existing entries. The saved index then lands on a different table, causing fib6_dump_table() to set w->root to the wrong table while w->node still points into the previous one. fib6_walk_continue() dereferences w->node->parent (NULL) and panics: BUG: kernel NULL pointer dereference, address: 0000000000000008 RIP: 0010:fib6_walk_continue+0x6e/0x170 Call Trace: <TASK> fib6_dump_table.isra.0+0xc5/0x240 inet6_dump_fib+0xf6/0x420 rtnl_dumpit+0x30/0xa0 netlink_dump+0x15b/0x460 netlink_recvmsg+0x1d6/0x2a0 ____sys_recvmsg+0x17a/0x190 Fix by storing tb->tb6_id in cb->args[1] instead of a positional index. On resume, skip entries until the id matches; a concurrent head-insert can never match the saved id, so the walker always resumes on the correct table. Fixes: `1b43af5480` ("[IPV6]: Increase number of possible routing tables to 2^32") Signed-off-by: Pengfei Zhang <zhangfeionline@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260625070517.965597-1-zhangfeionline@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-29 18:28:41 -07:00
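The difference between resuming a dump by position and resuming by table id can be shown on a toy singly linked chain. The types and helpers are hypothetical; they only model the head insertion described in the message, not the real fib6 hash chains or cb->args handling.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy table entry in a chain where new tables are inserted at the head. */
struct toy_table {
    uint32_t id;
    struct toy_table *next;
};

/* Positional resume: the entry at index idx, or NULL if the chain is shorter. */
struct toy_table *toy_resume_by_index(struct toy_table *head, unsigned int idx)
{
    while (head && idx--)
        head = head->next;
    return head;
}

/* Id-based resume: skip entries until the saved id matches. */
struct toy_table *toy_resume_by_id(struct toy_table *head, uint32_t id)
{
    while (head && head->id != id)
        head = head->next;
    return head;
}
```

After a head insert, the saved index lands one entry too early, while the saved id still finds the table the walker was on.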
Nuoqi Gui	a75d99f46b	seg6: validate SRH length before reading fixed fields seg6_validate_srh() reads fixed SRH fields such as srh->type and srh->hdrlen before checking that the supplied length covers the fixed struct ipv6_sr_hdr fields. The BPF SEG6 encap path reaches this with a BPF program-supplied pointer and length: bpf_lwt_push_encap() and the SEG6 local BPF END_B6 and END_B6_ENCAP actions call bpf_push_seg6_encap(), which forwards the length to seg6_validate_srh() with no minimum-size guard. A 2-byte SEG6 encap header can therefore make the validator read srh->type at offset 2 beyond the caller-supplied buffer. Reject lengths shorter than the fixed SRH at the top of seg6_validate_srh(), before any field is read. This fixes the BPF helper path and keeps the common validator robust. Fixes: `fe94cc290f` ("bpf: Add IPv6 Segment Routing helpers") Signed-off-by: Nuoqi Gui <gnq25@mails.tsinghua.edu.cn> Reviewed-by: Andrea Mayer <andrea.mayer@uniroma2.it> Link: https://patch.msgid.link/20260623-f01-17-seg6-srh-len-v2-1-2edc40e9e3e1@mails.tsinghua.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-26 18:49:37 -07:00
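The ordering this commit enforces, checking the supplied length before touching any fixed field, can be sketched on a raw byte buffer. The constants and `toy_validate_srh` are hypothetical; the byte offsets follow the message's statement that srh->type sits at offset 2, and the type and hdrlen checks are simplified assumptions about what a validator might test next.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SRH_FIXED_LEN 8u /* assumed size of the fixed header fields */
#define TOY_SRH_TYPE 4u      /* assumed routing type for segment routing */

/*
 * Validate a caller-supplied header of len bytes.
 * buf[1] is treated as hdrlen (8-octet units beyond the first 8 octets)
 * and buf[2] as the routing type.
 */
bool toy_validate_srh(const uint8_t *buf, size_t len)
{
    /* Reject short input first, before any field is read. */
    if (len < TOY_SRH_FIXED_LEN)
        return false;

    if (buf[2] != TOY_SRH_TYPE)
        return false;

    /* The advertised length must match what the caller supplied. */
    if (((size_t)buf[1] + 1) * 8 != len)
        return false;

    return true;
}
```

With the length check at the top, a 2-byte input is rejected without reading offset 2 at all.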
Fernando Fernandez Mancera	17dc3b245d	ipv6: fix missing notification for ignore_routes_with_linkdown When changing the ignore_routes_with_linkdown sysctl for a specific interface, the RTM_NEWNETCONF netlink notification was not being emitted to userspace. Fix this by emitting the notification when needed. In addition, fix bogus return value for successful "all" and specific interface write operation leading to a wrong reset of the position pointer. Fixes: `35103d1117` ("net: ipv6 sysctl option to ignore routes when nexthop link is down") Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260622130857.5115-7-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-23 19:12:48 -07:00
Fernando Fernandez Mancera	6a1b50e585	ipv6: fix state corruption during proxy_ndp sysctl restart When handling proxy_ndp, if rtnl_net_trylock() fails, the operation is retried but as the value was already modified by the initial proc_dointvec() call, the restarted syscall will read the newly modified value as the 'old' state. Fix this by taking the RTNL lock before parsing the input value if the operation is a write. Fixes: `c92d5491a6` ("netconf: add support for IPv6 proxy_ndp") Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260622130857.5115-6-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-23 19:12:47 -07:00
Fernando Fernandez Mancera	3e0e51c0ee	ipv6: fix error handling in disable_policy sysctl When writing to the disable_policy sysctl, if proc_dointvec() fails to parse the input, it returns a negative error code. The current implementation is resetting the position argument even if an error occurred during proc_dointvec() and not only during sysctl restart. Fix this by checking the return value of proc_dointvec() and returning early on failure. Fixes: `df789fe752` ("ipv6: Provide ipv6 version of "disable_policy" sysctl") Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260622130857.5115-5-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-23 19:12:47 -07:00
Fernando Fernandez Mancera	058b9b19f9	ipv6: fix error handling in forwarding sysctl When writing to the forwarding sysctl, if proc_dointvec() fails to parse the input, it returns a negative error code. The current implementation is overwriting that error for write operations. This results in a silent failure, it returns a successful write although the configuration was not modified at all. When modifying the "all" variant it can also modify the configuration of existing interfaces to the wrong value. Fix this by checking the return value of proc_dointvec() and returning early on failure. In addition, adjust return code of addrconf_fixup_forwarding() for successful operation. Fixes: `b325fddb7f` ("ipv6: Fix sysctl unregistration deadlock") Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260622130857.5115-4-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-23 19:12:47 -07:00
Fernando Fernandez Mancera	cf4f2b1440	ipv6: fix error handling in ignore_routes_with_linkdown sysctl When writing to the ignore_routes_with_linkdown sysctl, if proc_dointvec() fails to parse the input, it returns a negative error code. The current implementation is overwriting that error for write operations. This results in a silent failure, it returns a successful write although the configuration was not modified at all. When modifying the "all" variant it can also modify the configuration of existing interfaces to the wrong value. Fix this by checking the return value of proc_dointvec() and returning early on failure. Fixes: `35103d1117` ("net: ipv6 sysctl option to ignore routes when nexthop link is down") Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260622130857.5115-3-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-23 19:12:47 -07:00
Fernando Fernandez Mancera	c779441e50	ipv6: fix error handling in disable_ipv6 sysctl When writing to the disable_ipv6 sysctl, if proc_dointvec() fails to parse the input, it returns a negative error code. The current implementation is overwriting that error for write operations. This results in a silent failure, it returns a successful write although the configuration was not modified at all. When modifying the "all" variant it can also modify the configuration of existing interfaces to the wrong value. Fix this by checking the return value of proc_dointvec() and returning early on failure. Fixes: `56d417b12e` ("IPv6: Add 'autoconf' and 'disable_ipv6' module parameters") Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260622130857.5115-2-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-23 19:12:47 -07:00
Jakub Kicinski	e9deb406c1	ipsec-2026-06-22 Merge tag 'ipsec-2026-06-22' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2026-06-22 1) xfrm: use compat translator only for u64 alignment mismatch Gate the XFRM_USER_COMPAT translator on COMPAT_FOR_U64_ALIGNMENT so 32-bit compat tasks on arches whose 32-bit ABI already matches the native 64-bit layout are no longer rejected with -EOPNOTSUPP. From Sanman Pradhan. 2) net: af_key: initialize alg_key_len for IPComp states Initialize the alg_key_len to 0 in the IPComp branch of pfkey_msg2xfrm_state() so an uninitialized value cannot drive xfrm_alg_len() into a slab-out-of-bounds kmemdup during XFRM_MSG_MIGRATE. From Zijing Yin. 3) xfrm: Fix dev use-after-free in xfrm async resumption Stash the original skb->dev and extend the RCU critical section across xfrm_rcv_cb() and transport_finish() to prevent a tunnel-device UAF and original-device refcount leak when a callback replaces skb->dev. From Dong Chenchen. 4) xfrm: Fix xfrm state cache insertion race Move the state-validity check inside xfrm_state_lock in the input state cache insertion path so a state cannot be killed between the check and the insert. From Herbert Xu. 5) xfrm: annotate data-races around xfrm_policy_count[] and xfrm_policy_default[] Add READ_ONCE()/WRITE_ONCE() annotations on xfrm_policy_count and xfrm_policy_default to silence the KCSAN data race reported on net->xfrm.policy_count. From Eric Dumazet. 6) espintcp: use sk_msg_free_partial to fix partial send Replace the manual skmsg accounting in espintcp with sk_msg_free_partial() so the skmsg stays consistent on every iteration and the partial-send accounting bugs go away. From Sabrina Dubroca. 7) xfrm: validate selector family and prefixlen during match Reject mismatched address families in xfrm_selector_match() and bound prefixlen in addr4_match()/addr_match() to prevent the shift-out-of-bounds syzbot reported when an AF_UNSPEC selector with a large prefixlen is matched against an IPv4 flow. From Eric Dumazet. * tag 'ipsec-2026-06-22' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: xfrm: validate selector family and prefixlen during match espintcp: use sk_msg_free_partial to fix partial send xfrm: annotate data-races around xfrm_policy_count[] and xfrm_policy_default[] xfrm: Fix xfrm state cache insertion race xfrm: Fix dev use-after-free in xfrm async resumption net: af_key: initialize alg_key_len for IPComp states xfrm: use compat translator only for u64 alignment mismatch ==================== Link: https://patch.msgid.link/20260622075726.29685-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-23 16:22:24 -07:00
Xiang Mei	46c3b8191a	ipv6: Fix null-ptr-deref in fib6_nh_mtu_change(). fib6_nh_mtu_change() re-fetches idev via __in6_dev_get(arg->dev) and dereferences idev->cnf.mtu6 without a NULL check. addrconf_ifdown() clears dev->ip6_ptr with RCU_INIT_POINTER() after rt6_disable_ip() has released tb6_lock, so the RA-driven MTU walk can observe a NULL idev and oops. The caller rt6_mtu_change_route() guards its own __in6_dev_get(), but this re-fetch is unguarded; nexthop-backed routes survive addrconf_ifdown()'s flush, so the walk still reaches it after ip6_ptr is nulled. Return 0 when idev is NULL, matching rt6_mtu_change_route() and the fib6_mtu() fix in commit `5ad509c1fd` ("ipv6: Fix null-ptr-deref in fib6_mtu()."). Oops: general protection fault, ... KASAN: null-ptr-deref in range [0x00000000000002a8-0x00000000000002af] RIP: 0010:fib6_nh_mtu_change+0x203/0x990 rt6_mtu_change_route+0x141/0x1d0 __fib6_clean_all+0xd0/0x160 rt6_mtu_change+0xb4/0x100 ndisc_router_discovery+0x24b5/0x2cb0 icmpv6_rcv+0x12e9/0x1710 ipv6_rcv+0x39b/0x410 Fixes: `c0b220cf7d` ("ipv6: Refactor exception functions") Reported-by: Weiming Shi <bestswngs@gmail.com> Signed-off-by: Xiang Mei <xmei5@asu.edu> Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260619045334.2427073-1-xmei5@asu.edu Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-22 18:20:09 -07:00
Jakub Kicinski	56abdaebbf	netfilter pull request 26-06-21 -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEjF9xRqF1emXiQiqU1w0aZmrPKyEFAmo3Eg8ACgkQ1w0aZmrP KyGS5xAApJV4Qq8dA9waGk8CEJNtDpTm71VDOWOznKzLSG3hvJKtzK1QkaTreyjj MQeIyRr2mwQGgOsYiNzAmMKMKdCdmcR/6MEjJ/XoNsfxOFitI2uNheZBeBTQXs09 FiQzz1Q+fv+/kBUwD1M/Zx/1/ebwIe1C5Ms/grV5/QCok5CQ6aMm8UMpAULYuQua uaH6+VeLMpahDxWPMv47eCcvLAkB00/dxMtoujNV9HQ5aLaKOYDuOJcn90GeUDFw hSu3nlEjepbF8cxFtDofntnHUwQ0FXVnvvLVCPBwmAXboMXiWx/cb9+HQhRgxV9m 41Sla0q7pEtM2KVuf8mj6jYUUEfjQUZM47utSbuX6Gx7YOphdlHkCbblmTP+lvey TGFJkP+WEWjBxdcT3vbduPfEwkCudBIi5su3Agzqom0U49DenlS9WirVBOnDC/T2 xwwp93NLR/URs+DR4mLlXYNYO42FZpdpy5m9O3HUnQhfpXr3pLzp0POK1c3Kzbss N/2kJ+EuUVSCRkM4gQTwvZ+Os4VhafnLwqKfHdeH7tBQNJr0fVMEK7e85rCd4IX9 S6XPvGk1s9Jcc4Jdk0NhTJhK3Chyb9SyGEgLBWsBRlOARywox5s0sI+kKqpLll2P poFTrFSJOo1y2qOPDFvQl/puJWR3XhKWda8bSfy812bqGnghpPQ= =GGDL -----END PGP SIGNATURE----- Merge tag 'nf-26-06-21' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net. This batches fixes for real crashes with trivial/correctness fixes. There is too a rework of the conntrack expectation timeout strategy to deal with a possible race when removing an expectation. 1) Fix the incorrect flowtable timeout extension for entries in hw offload, from Adrian Bente. This is correcting a defect in the functionality, no crash. 2) Hold reference to device under the fake dst in br_netfilter, from Haoze Xie. This is fixing a possible UaF if the device is removed while packet is sitting in nfqueue. 3) Reject template conntrack in xt_cluster, otherwise access to uninitialize conntrack fields are possible leading to WARN_ON due to unset layer 3 protocol. From Wyatt Feng. 4) Make sure the IPv6 tunnel header is in the linear skb data area before pulling. While at it remove incomplete NEXTHDR_DEST support. From Lorenzo Bianconi. 
This possibly leads to a crash if the IPv4 header is not in the linear area. 5) Use test_bit_acquire in ipset hash set to avoid reordering of subsequent memory accesses. This addresses an LLM-related report; no crash has been observed. From Jozsef Kadlecsik. 6) Use test_bit_acquire in ipset bitmap set too, for the same reason as in the previous patch, from Jozsef Kadlecsik. 7) Call kfree_rcu() after rcu_assign_pointer() to address a possible UaF if kfree_rcu() runs immediately, which to my understanding never happens. Never observed in practice, reported by LLM. Also from Jozsef Kadlecsik. 8) Use disable_delayed_work_sync() instead of cancel_delayed_work_sync() to prevent the ipset GC handler from re-queuing work, as reported by LLM. From Jozsef Kadlecsik. This is for correctness. 9) Restore the check in nft_payload for payload offsets exceeding 2^16. From Florian Westphal. This fixes a silent truncation, not a big deal, but better be assertive and reject it. 10) Validate that NFT_META_BRI_IIFHWADDR can only run from bridge prerouting. From Florian Westphal. Harmless, but it could allow reading bytes from skb->cb. 11) Zero out the destination hardware address during the flowtable path setup, also from Florian. This is a correctness fix; LLM points out that a possible infoleak can happen, but the topology to achieve it is not clear. 12) Skip IPv4 options if present when building the IPv4 reject reply. Otherwise bytes in the IPv4 options header can be sent back to the origin where the ICMP header is expected. Again from Florian Westphal. 13) Replace the timer API for expectations with a GC worker approach. This implicitly fixes a race between nf_ct_remove_expectations() and the expectation timer: removal might fail because timer_del() returns false when the timer has expired and its callback is running concurrently. This fix addresses a crash that has already been reported with a reproducer. 14) Check if br_vlan_get_pvid_rcu() fails, otherwise a 4-byte stack infoleak is possible. From Florian Westphal. 
* tag 'nf-26-06-21' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nft_meta_bridge: fix NFT_META_BRI_IIFPVID stack leak netfilter: nf_conntrack_expect: use conntrack GC to reap expectations netfilter: nf_reject: skip iphdr options when looking for icmp header netfilter: nft_flow_offload: zero device address for non-ether case netfilter: nft_meta_bridge: add validate callback for get operations netfilter: nft_payload: reject offsets exceeding 65535 bytes netfilter: ipset: make sure gc is properly stopped netfilter: ipset: fix order of kfree_rcu() and rcu_assign_pointer() netfilter: ipset: Don't use test_bit() in lockless RCU readers in bitmap types netfilter: ipset: Don't use test_bit() in lockless RCU readers in hash types netfilter: flowtable: fix and simplify IP6IP6 tunnel handling netfilter: xt_cluster: reject template conntracks in hash match netfilter: nf_queue: pin bridge device while NFQUEUE holds fake dst netfilter: flowtable: fix offloaded ct timeout never being extended ==================== Link: https://patch.msgid.link/20260620222738.112506-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-22 10:33:38 -07:00
Jiayuan Chen	9ed19e11d2	ipv6: ioam: fix type confusion of dst_entry IOAM uses a dummy dst_entry(null_dst) to mark that the destination should not be changed after the transformation. This dst is stored in the IOAM lwt state and may be passed to dst_cache_set_ip6(). However, the IPv6 dst cache path eventually calls rt6_get_cookie(), which treats the dst_entry as part of a struct rt6_info. Since the null_dst was embedded directly as a struct dst_entry in struct ioam6_lwt, this resulted in an invalid cast and rt6_get_cookie() reading fields from the wrong object. In practice, the wrong cookie is not used while dst->obsolete is zero, but rt6_get_cookie() may also access per-cpu value when rt->sernum is zero. In this case, rt->sernum aliases ioam6_lwt::cache::reset_ts, which can become zero, making this a potential invalid pointer access. Fix this by embedding a full struct rt6_info for the dummy IPv6 route and passing its dst member to the dst APIs. Fixes: `47ce7c8545` ("net: ipv6: ioam6: fix double reallocation") Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Reviewed-by: Justin Iurman <justin.iurman@gmail.com> Link: https://patch.msgid.link/20260618104336.48934-1-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-21 15:26:40 -07:00
Wongi Lee	736b380e28	ipv6: account for fraggap on the paged allocation path In __ip6_append_data(), when the paged-allocation branch is taken (MSG_MORE / NETIF_F_SG / large fraglen), alloclen and pagedlen are computed as alloclen = fragheaderlen + transhdrlen; pagedlen = datalen - transhdrlen; datalen already includes fraggap (datalen = length + fraggap). When fraggap is non-zero, this is not the first skb and transhdrlen is zero. The fraggap bytes carried over from the previous skb are copied just past the fragment headers in the new skb's linear area. The linear area is therefore undersized by fraggap bytes while pagedlen is overstated by the same amount, and the copy writes past skb->end into the trailing skb_shared_info. An unprivileged user can trigger this via a UDPv6 socket using MSG_MORE together with MSG_SPLICE_PAGES. The bad accounting was introduced by commit `773ba4fe91` ("ipv6: avoid partial copy for zc"). Before commit `ce650a1663` ("udp6: Fix __ip6_append_data()'s handling of MSG_SPLICE_PAGES"), the negative copy value caused -EINVAL to be returned. That later commit allowed MSG_SPLICE_PAGES to proceed in this case, making the corruption triggerable. The non-paged branch sets alloclen to fraglen, which already accounts for fraggap because datalen does. Bring the paged branch in line by adding fraggap to alloclen and subtracting it from pagedlen. After this adjustment, copy no longer collapses to -fraggap on the paged path, so remove the stale comment describing that old arithmetic. Since a negative copy is no longer expected for a valid MSG_SPLICE_PAGES case, remove the MSG_SPLICE_PAGES exception from the negative copy check. 
Fixes: `773ba4fe91` ("ipv6: avoid partial copy for zc") Signed-off-by: Jungwoo Lee <jwlee2217@gmail.com> Signed-off-by: Wongi Lee <qw3rtyp0@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/ajFTqRljatR17fFy@DESKTOP-19IMU7U.localdomain Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-21 15:24:49 -07:00
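The fraggap accounting in the paged-allocation commit can be checked with a small arithmetic model. This is a sketch, not the code from __ip6_append_data(): the struct and function names are hypothetical, and only the two formulas quoted in the commit message (before) and the adjustment it describes (after) are modeled.

```c
#include <assert.h>
#include <stddef.h>

struct paged_sizes {
	size_t alloclen;	/* bytes reserved in the skb linear area */
	size_t pagedlen;	/* bytes expected to live in page frags */
};

/* Accounting before the fix. datalen is assumed to already include fraggap
 * (datalen = length + fraggap), as the commit message states. */
static struct paged_sizes sizes_before(size_t fragheaderlen, size_t transhdrlen,
				       size_t datalen)
{
	struct paged_sizes s;

	assert(datalen >= transhdrlen);
	s.alloclen = fragheaderlen + transhdrlen;
	s.pagedlen = datalen - transhdrlen;
	return s;
}

/* Accounting after the fix: fraggap moves from pagedlen to alloclen. */
static struct paged_sizes sizes_after(size_t fragheaderlen, size_t transhdrlen,
				      size_t datalen, size_t fraggap)
{
	struct paged_sizes s;

	assert(datalen >= transhdrlen + fraggap);
	s.alloclen = fragheaderlen + transhdrlen + fraggap;
	s.pagedlen = datalen - transhdrlen - fraggap;
	return s;
}

/* Bytes actually written to the linear area: the headers plus the fraggap
 * bytes carried over from the previous skb. */
static size_t linear_bytes_written(size_t fragheaderlen, size_t transhdrlen,
				   size_t fraggap)
{
	return fragheaderlen + transhdrlen + fraggap;
}
```

For example, with fragheaderlen = 48, transhdrlen = 0, fraggap = 8 and datalen = 1008, the old formulas reserve 48 linear bytes while 56 are written; the adjusted formulas reserve 56 and leave 1000 paged bytes, with the total unchanged.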
Weiming Shi	d186e94236	ipv6: ndisc: fix NULL deref in accept_untracked_na() accept_untracked_na() re-fetches the inet6_dev with __in6_dev_get(dev) and dereferences idev->cnf.accept_untracked_na without a NULL check, even though its only caller ndisc_recv_na() already fetched and NULL-checked idev for the same device. Both reads of dev->ip6_ptr run in the same RCU read-side critical section, but a concurrent addrconf_ifdown() can clear dev->ip6_ptr between them: lowering the MTU below IPV6_MIN_MTU calls addrconf_ifdown() without the synchronize_net() that orders the unregister path, so the re-fetch returns NULL and oopses: BUG: KASAN: null-ptr-deref in ndisc_recv_na (net/ipv6/ndisc.c:974) Read of size 4 at addr 0000000000000364 Call Trace: <IRQ> ndisc_recv_na (net/ipv6/ndisc.c:974) icmpv6_rcv (net/ipv6/icmp.c:1193) ip6_protocol_deliver_rcu (net/ipv6/ip6_input.c:479) ip6_input_finish (net/ipv6/ip6_input.c:534) ip6_input (net/ipv6/ip6_input.c:545) ip6_mc_input (net/ipv6/ip6_input.c:635) ipv6_rcv (net/ipv6/ip6_input.c:351) </IRQ> It is reachable by an unprivileged user via a network namespace. Pass the caller's already validated idev instead of re-fetching it; the idev stays alive for the whole RCU critical section, so it is safe even after dev->ip6_ptr has been cleared. Fixes: `aaa5f515b1` ("net: ipv6: new accept_untracked_na option to accept na only if in-network") Reported-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Weiming Shi <bestswngs@gmail.com> Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://patch.msgid.link/20260617065512.2529757-2-bestswngs@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-21 15:13:24 -07:00
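The accept_untracked_na() fix replaces a second lookup with the pointer the caller has already validated. A minimal userspace model of that idea follows; all type and function names are hypothetical, the `simulate_ifdown` flag stands in for a concurrent addrconf_ifdown(), and the model relies on a caller-owned object where the kernel relies on the RCU read-side critical section to keep idev alive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical models of inet6_dev and net_device. */
struct idev_model { int accept_untracked_na; };
struct dev_model { struct idev_model *ip6_ptr; };

/* Model of __in6_dev_get(): may return NULL once ip6_ptr is cleared. */
static struct idev_model *dev_get_idev(const struct dev_model *dev)
{
	return dev->ip6_ptr;
}

/* The helper takes the caller's validated idev instead of re-fetching it. */
static bool accept_na(const struct idev_model *idev)
{
	assert(idev);
	return idev->accept_untracked_na != 0;
}

static bool recv_na(struct dev_model *dev, bool simulate_ifdown)
{
	struct idev_model *idev = dev_get_idev(dev);

	if (!idev)
		return false;		/* single NULL check, done once */
	if (simulate_ifdown)
		dev->ip6_ptr = NULL;	/* pointer cleared after the check */
	return accept_na(idev);		/* still safe: no second lookup */
}
```

A helper that called `dev_get_idev()` again at this point would see NULL in the `simulate_ifdown` case, which is the dereference the commit removes.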
Maoyi Xie	27ccb68e7c	net: sit: require CAP_NET_ADMIN in the device netns for changelink ipip6_changelink() operates on at most two netns, dev_net(dev) and the tunnel link netns t->net. They differ once the device is created in or moved to a netns other than the one the request runs in. The rtnl changelink path checks CAP_NET_ADMIN only against dev_net(dev), so a caller privileged there but not in t->net can rewrite a tunnel that lives in t->net. Gate ipip6_changelink() on rtnl_dev_link_net_capable() at its top, before any attribute is parsed. sit was the one tunnel type not covered by the recent series that added this check to the other changelink() handlers. Fixes: `5e6700b3bf` ("sit: add support of x-netns") Link: https://lore.kernel.org/netdev/20260612085941.3158249-1-maoyixie.tju@gmail.com/ Cc: stable@vger.kernel.org Signed-off-by: Maoyi Xie <maoyixie.tju@gmail.com> Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260618070817.3378283-1-maoyixie.tju@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-21 15:10:06 -07:00
Lorenzo Bianconi	f4c2d8668d	netfilter: flowtable: fix and simplify IP6IP6 tunnel handling Fix nf_flow_ip6_tunnel_proto() to use pskb_may_pull() instead of skb_header_pointer() to ensure the outer IPv6 header is in the skb linear data area, which is required for subsequent packet processing. Move ctx->offset update inside the IPPROTO_IPV6 conditional block since it should only be adjusted when an IP6IP6 tunnel is actually detected. Simplify the rx path by removing ipv6_skip_exthdr() and checking ip6h->nexthdr directly, as the flowtable fast path only handles simple IP6IP6 encapsulation without extension headers. Drop the tunnel encapsulation limit destination option support from the tx path to match, since the rx path no longer handles extension headers. Remove the encap_limit parameter from nf_flow_offload_ipv6_forward(), nf_flow_tunnel_ip6ip6_push() and nf_flow_tunnel_v6_push(), along with the ipv6_tel_txoption struct and related headroom/MTU adjustments. Fixes: `d98103575d` ("netfilter: flowtable: Add IP6IP6 rx sw acceleration") Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-06-19 12:22:29 +02:00
Maoyi Xie	e2ac3b242c	net: ip6_vti: require CAP_NET_ADMIN in the device netns for changelink vti6_changelink() operates on at most two netns, dev_net(dev) and the tunnel link netns t->net. They differ once the device is created in or moved to a netns other than the one the request runs in. The rtnl changelink path checks CAP_NET_ADMIN only against dev_net(dev), so a caller privileged there but not in t->net can rewrite a tunnel that lives in t->net. Gate vti6_changelink() on rtnl_dev_link_net_capable() at its top, before any attribute is parsed. Reported-by: Xiao Liang <shaw.leon@gmail.com> Closes: https://lore.kernel.org/netdev/CABAhCOSzP1vaThGV35_VnsRCb=87_CPjPVsTHbq905k8A+BuUg@mail.gmail.com/ Fixes: `61220ab349` ("vti6: Enable namespace changing") Cc: stable@vger.kernel.org Signed-off-by: Maoyi Xie <maoyixie.tju@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260612085941.3158249-7-maoyixie.tju@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-17 16:01:52 -07:00
Maoyi Xie	f00a50876d	net: ip6_gre: require CAP_NET_ADMIN in the device netns for changelink ip6gre_changelink() and ip6erspan_changelink() operate on at most two netns, dev_net(dev) and the tunnel link netns t->net. They differ once the device is created in or moved to a netns other than the one the request runs in. The rtnl changelink path checks CAP_NET_ADMIN only against dev_net(dev), so a caller privileged there but not in t->net can rewrite a tunnel that lives in t->net. Gate both ops on rtnl_dev_link_net_capable() at their top, before any attribute is parsed. Reported-by: Xiao Liang <shaw.leon@gmail.com> Closes: https://lore.kernel.org/netdev/CABAhCOSzP1vaThGV35_VnsRCb=87_CPjPVsTHbq905k8A+BuUg@mail.gmail.com/ Fixes: `690afc165b` ("net: ip6_gre: fix moving ip6gre between namespaces") Cc: stable@vger.kernel.org Signed-off-by: Maoyi Xie <maoyixie.tju@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260612085941.3158249-6-maoyixie.tju@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-17 16:01:52 -07:00
Maoyi Xie	2496fa0b7d	net: ip6_tunnel: require CAP_NET_ADMIN in the device netns for changelink ip6_tnl_changelink() operates on at most two netns, dev_net(dev) and the tunnel link netns t->net. They differ once the device is created in or moved to a netns other than the one the request runs in. The rtnl changelink path checks CAP_NET_ADMIN only against dev_net(dev), so a caller privileged there but not in t->net can rewrite a tunnel that lives in t->net. Gate ip6_tnl_changelink() on rtnl_dev_link_net_capable() at its top, before any attribute is parsed. Reported-by: Xiao Liang <shaw.leon@gmail.com> Closes: https://lore.kernel.org/netdev/CABAhCOSzP1vaThGV35_VnsRCb=87_CPjPVsTHbq905k8A+BuUg@mail.gmail.com/ Fixes: `0bd8762824` ("ip6tnl: add x-netns support") Cc: stable@vger.kernel.org Signed-off-by: Maoyi Xie <maoyixie.tju@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260612085941.3158249-5-maoyixie.tju@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-17 16:01:52 -07:00
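The four changelink commits (sit, ip6_vti, ip6_gre, ip6_tunnel) share one idea: a tunnel can span two network namespaces, so privilege in dev_net(dev) alone is not enough to rewrite it. The sketch below models that gate in userspace. Every name is hypothetical, the boolean field stands in for a CAP_NET_ADMIN check, and returning -EPERM is an assumption; the real semantics of rtnl_dev_link_net_capable() are not reproduced here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model: whether the caller holds CAP_NET_ADMIN in a netns. */
struct netns_model { bool caller_has_net_admin; };

/* A tunnel device lives in dev_net but may be linked into another netns. */
struct tunnel_dev_model {
	const struct netns_model *dev_net;	/* models dev_net(dev) */
	const struct netns_model *link_net;	/* models t->net */
};

/* Require privilege in both namespaces the operation touches. */
static bool link_net_capable(const struct tunnel_dev_model *t)
{
	assert(t && t->dev_net && t->link_net);
	return t->dev_net->caller_has_net_admin &&
	       t->link_net->caller_has_net_admin;
}

/* Gate at the top of the handler, before any attribute is parsed. */
static int changelink_model(const struct tunnel_dev_model *t)
{
	if (!link_net_capable(t))
		return -EPERM;	/* assumed error code for this sketch */
	/* attribute parsing and the tunnel update would follow here */
	return 0;
}
```

When both pointers refer to the same namespace the extra check changes nothing; it only rejects the cross-namespace case where the caller is privileged on one side.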
Jakub Kicinski	d755d45bc0	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Merge in late fixes in preparation for the net-next PR. Conflicts: net/tls/tls_sw.c `406e8a651a` ("net: skmsg: preserve sg.copy across SG transforms") `79511603a6` ("tls: remove dead sockmap (psock) handling from the SW path") drivers/net/ethernet/microsoft/mana/mana_en.c `f8fd56977e` ("net: mana: guard TX wq object destroy with INVALID_MANA_HANDLE check") `d07efe5a6e` ("net: mana: Use per-queue allocation for tx_qp to reduce allocation size") https://lore.kernel.org/ajAPXu-C_PuTgV-a@sirena.org.uk No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-16 14:59:58 -07:00
Neil Spring	658eb69654	tcp: rehash onto different local ECMP path on retransmit timeout Currently sk_rethink_txhash() re-rolls the socket's txhash on RTO, PLB, and spurious-retransmission events, but the cached route is reused and the new hash is not propagated into the ECMP path selection logic. Two changes are needed to make rehash select a different local ECMP path: 1. Add __sk_dst_reset() alongside sk_rethink_txhash() in tcp_write_timeout(), tcp_rcv_spurious_retrans(), and tcp_plb_check_rehash() so the cached dst is invalidated and the next transmit triggers a fresh route lookup. 2. Set fl6->mp_hash from sk_txhash (or tcp_rsk(req)->txhash for SYN/ACK retransmits and syncookies) in tcp_v6_connect(), inet6_sk_rebuild_header(), inet6_csk_route_req(), inet6_csk_route_socket(), tcp_v6_send_response(), and cookie_v6_check() so fib6_select_path() picks a path based on the new hash. The mp_hash override only applies to fib_multipath_hash_policy 0 (the default L3 policy). Its hash includes the flow label, but that is 0 by default -- np->flow_label is unset, and auto_flowlabels only computes the on-wire label later, per packet -- so flows to the same peer share one local path. Keying the hash on sk_txhash makes the local path per-connection and lets a rehash re-select it. Policies 1-3 are left unchanged. The mp_hash assignment is factored into a small helper, ip6_ecmp_set_mp_hash(), shared by inet6_csk_route_req(), inet6_csk_route_socket(), tcp_v6_connect(), inet6_sk_rebuild_header(), tcp_v6_send_response(), and cookie_v6_check(). It applies (txhash >> 1) ?: 1 for policy 0 (the >> 1 keeps mp_hash in the 31-bit range; ?: 1 keeps it non-zero, since 0 would fall back to rt6_multipath_hash()). inet6_csk_route_socket() calls it only for sk_protocol == IPPROTO_TCP so that non-TCP callers (e.g., L2TP via inet6_csk_xmit) fall through to rt6_multipath_hash() and retain their existing flow-key-based ECMP behavior. 
tcp_v6_send_response() also sets mp_hash from the response txhash so that a control packet (a RST from the full socket, or an ACK from a time-wait socket) selects the same local ECMP nexthop as the connection's txhash rather than falling back to the flow hash. The time-wait socket's tw_txhash is copied from sk_txhash when the connection enters TIME_WAIT, so it reflects any rehash that occurred. Setting mp_hash explicitly is necessary because the default ECMP hash derives from fl6->flowlabel via np->flow_label, which is not updated from sk_txhash (REPFLOW is off by default). ip6_make_flowlabel() cannot help either, as it runs after the route lookup. As a consequence, for policy 0 the local ECMP path of an IPv6 TCP flow follows sk_txhash even when fl6->flowlabel is non-zero, e.g. a reflected (REPFLOW) or explicitly set (IPV6_FLOWLABEL_MGR) flow label. This is intentional: only local path selection changes, so rehash can recover from a failed path; the on-wire flow label is unchanged. sk_set_txhash() is moved before ip6_dst_lookup_flow() in tcp_v6_connect() so the initial ECMP path is selected by the same txhash that subsequent route rebuilds will use. This avoids unintended path changes when the cached dst is naturally invalidated (e.g., by PMTU discovery or route changes). The rehash sites (tcp_write_timeout(), tcp_plb_check_rehash(), and tcp_rcv_spurious_retrans()) call __sk_rethink_txhash_reset_dst(), which re-rolls the txhash and, when it changed, drops the cached dst so the next transmit re-runs route selection. The dst reset is guarded by sk->sk_family == AF_INET6 since IPv4 ECMP does not currently use sk_txhash for path selection. For IPv4-mapped IPv6 sockets this produces a redundant dst reset on a cold path (RTO/PLB); the subsequent IPv4 route lookup returns the same result. 
The helper is deliberately separate from sk_rethink_txhash() itself: dst_negative_advice() calls sk_rethink_txhash() before its own dst op, so resetting the dst inside sk_rethink_txhash() would skip that op (e.g. rt6_remove_exception_rt()). For syncookies, cookie_init_sequence() computes the cookie value before route_req() and sets txhash so the SYN-ACK selects the same ECMP path that cookie_v6_check() will use when the full socket is created. cookie_tcp_reqsk_init() derives txhash from the cookie so the full socket's ECMP path matches the SYN-ACK. Both the SYN-ACK assignment in tcp_conn_request() and the full-socket assignment in cookie_tcp_reqsk_init() set txhash from the cookie for IPv4 and IPv6 alike. On IPv6 this drives ECMP path selection; on IPv4, which does not use sk_txhash for ECMP, it only affects TX-queue selection. That selection scales the hash by its high bits (reciprocal_scale()), which are uniform in the keyed secure_tcp_syn_cookie() output -- the MSS index only perturbs the low bits -- so the queue distribution matches net_tx_rndhash(). cookie_init_sequence() is split from the former version that also called tcp_synq_overflow() and incremented SYNCOOKIESSENT; those side effects are now in cookie_record_sent(), called after route_req() succeeds so they are not bumped when route_req() fails. cookie_record_sent() is guarded by CONFIG_SYN_COOKIES to match the guard on tcp_synq_overflow(). route_req() receives 0 as tw_isn for the syncookie path so that tcp_v6_init_req() still saves ireq->pktopts for REPFLOW flowlabel reflection and IPv6 cmsg options. The ecn_ok clear for syncookies without timestamps stays after tcp_ecn_create_request() so it takes precedence. Signed-off-by: Neil Spring <ntspring@meta.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260615042158.1600746-2-ntspring@meta.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-15 15:57:31 -07:00
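The mp_hash derivation described in the rehash commit, (txhash >> 1) ?: 1, can be written out in portable C. This is a sketch of the arithmetic only; the function name is hypothetical and it is not the kernel's ip6_ecmp_set_mp_hash() helper, which also depends on the multipath hash policy.

```c
#include <assert.h>
#include <stdint.h>

/* Derive a non-zero 31-bit multipath hash from a 32-bit txhash.
 * The shift keeps the value in the 31-bit range; the fallback to 1 keeps it
 * non-zero, since the commit message says 0 would fall back to the flow-key
 * hash. The GNU "a ?: b" shorthand is spelled out for portability. */
static uint32_t mp_hash_from_txhash(uint32_t txhash)
{
	uint32_t h = txhash >> 1;

	h = h ? h : 1;
	assert(h >= 1 && h <= 0x7fffffffu);
	return h;
}
```

Both txhash values 0 and 1 map to 1 here, because the shift discards the low bit before the non-zero fallback is applied.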
Eric Dumazet	2bf43d0e2e	tcp: ipv6: clamp default adverting MSS to avoid GSO_BY_FRAGS (0xFFFF) When MTU is large, ip6_default_advmss() can return IPV6_MAXPLEN (65535). This is interpreted by TCP as mss_clamp, allowing the MSS to reach 65535. However, 0xFFFF is also used as a magic value GSO_BY_FRAGS in the kernel. If a TCP packet with gso_size=0xFFFF is passed to skb_segment(), it will be mistakenly treated as GSO_BY_FRAGS, leading to a NULL pointer dereference because local TCP packets do not use frag_list. Fix this by returning min(IPV6_MAXPLEN, GSO_BY_FRAGS - 1) (65534) from ip6_default_advmss() when MTU is large. Also update the stale comment in ip6_default_advmss() which suggested that IPV6_MAXPLEN is returned to mean "any MSS". Fixes: `3953c46c3a` ("sk_buff: allow segmenting based on frag sizes") Reported-by: syzbot+ebdb22d461c904fc3cb2@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/6a2c3193.8812e0fc.3c3fa4.0001.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260612162517.83394-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-15 12:51:04 -07:00
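The clamp in the advmss commit can be illustrated with a simplified model. The constants follow the commit message (IPV6_MAXPLEN is 65535 and GSO_BY_FRAGS is 0xFFFF); the function itself is a hypothetical reduction of ip6_default_advmss() that assumes fixed 40-byte IPv6 and 20-byte TCP headers and omits the minimum-advmss floor.

```c
#include <assert.h>

#define IPV6_MAXPLEN	65535u
#define GSO_BY_FRAGS	0xFFFFu
#define IPV6_HDR_LEN	40u	/* assumed fixed IPv6 header size */
#define TCP_HDR_LEN	20u	/* assumed TCP header without options */

static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Simplified model: derive the advertised MSS from the MTU and, for very
 * large MTUs, return a value that can never equal GSO_BY_FRAGS. */
static unsigned int default_advmss_model(unsigned int mtu)
{
	unsigned int mss;

	assert(mtu >= IPV6_HDR_LEN + TCP_HDR_LEN);
	mss = mtu - IPV6_HDR_LEN - TCP_HDR_LEN;
	if (mss > IPV6_MAXPLEN - TCP_HDR_LEN)
		mss = min_uint(IPV6_MAXPLEN, GSO_BY_FRAGS - 1);
	return mss;
}
```

Ordinary MTUs pass through unchanged (1500 gives 1440), while any MTU large enough to hit the cap now yields 65534 instead of the magic value 65535.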
Jakub Kicinski	431662b642	ipsec-next-2026-06-12 -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmortdsACgkQrB3Eaf9P W7eC8g/+JjaC75xnVJUZLnVqeyasXYs9HcAppT4IUPVKFm0iEitjE7smbUf1Zapp 1dyjVIAp1eqahJ8wY3u5OOq16UTkg38bEUXtlm5Txyr8EUOALtWE5DIHhT/LYns8 rDfVht9V+OuzjddYRwpQY/HYPC4WlxT1WOVDoKhJDDDiAph0JyyRLDlanp59Cmdh lxOHQ0ogh76SbUyaslWSYqBdaNq84DaV2fwNxpll27/b2Qix064Q+ZM4IrpulI8b H8Fb5u+R/3hGO+/TJPbOeRl6exw7aeU03+Gu9h6UFpQTlP9uzas8g/7xxlmJr7wq LcAz5JyMKapdSfoxI6Z2lXM0AtyM+B4iqHoQRKplF4QRjzCsHnmw0SSbv/nXwzrU 6/qHHRqtioKNGg99ptChBT36L0bLV78WE6TdLHW/fxGxLIoJWGgyltAZ9/TNH/es dcPEuecYl0vxz2kspXCiRQGTGI8gDPaTaOiiNvQfhOb9rT+767oqeUvxuW7WRAW+ STSShHZqDYVTNtHR7dl0X1BtEqHLryyHMi4gsB+NpOP+snZaUmRpTTVdYdl/GRK1 WAgYkgFsGThwR4fZUtDHOAt5VVyonvA//q+yYzgzi8RtpFxsGDQsANneD8m7tQL0 iTmx0vBR+AVgaiHdhsVN/rQDLdMbv+W3n03Kx1kk3+I4udTkVuE= =bils -----END PGP SIGNATURE----- Merge tag 'ipsec-next-2026-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2026-06-12 1) Replace the open-coded manual cleanup in xfrm_add_policy() error path with xfrm_policy_destroy() for consistency with xfrm_policy_construct(). From Deepanshu Kartikey. 2) Limit XFRMA_TFCPAD to a sensible maximum (max IP length, 64k) since u32 is excessive for traffic flow confidentiality padding. From David Ahern. 3) Add a new netlink message XFRM_MSG_MIGRATE_STATE that allows migrating individual IPsec SAs independently of their policies. The existing XFRM_MSG_MIGRATE is tightly coupled to policy+SA migration, lacks SPI for unique SA identification, and cannot express reqid changes or migrate Transport mode selectors. The new interface identifies the SA via SPI and mark, supports reqid changes, address family changes, encap removal, and uses an atomic create+install flow under x->lock to prevent SN/IV reuse during AEAD SA migration. From Antony Antony. 
* tag 'ipsec-next-2026-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next: xfrm: add documentation for XFRM_MSG_MIGRATE_STATE xfrm: restrict netlink attributes for XFRM_MSG_MIGRATE_STATE xfrm: add XFRM_MSG_MIGRATE_STATE for single SA migration xfrm: make xfrm_dev_state_add xuo parameter const xfrm: extract address family and selector validation helpers xfrm: refactor XFRMA_MTIMER_THRESH validation into a helper xfrm: move encap and xuo into struct xfrm_migrate xfrm: add error messages to state migration xfrm: add state synchronization after migration xfrm: check family before comparing addresses in migrate xfrm: split xfrm_state_migrate into create and install functions xfrm: rename reqid in xfrm_migrate xfrm: fix NAT-related field inheritance in SA migration xfrm: allow migration from UDP encapsulated to non-encapsulated ESP xfrm: add extack to xfrm_init_state xfrm: remove redundant assignments xfrm: Reject excessive values for XFRMA_TFCPAD xfrm: cleanup error path in xfrm_add_policy() ==================== Link: https://patch.msgid.link/20260612074725.1760473-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-13 13:16:39 -07:00
Ido Schimmel	d25e7e9d8a	ipv6: Honor oif when choosing nexthop for locally generated traffic Commit `741a11d9e4` ("net: ipv6: Add RT6_LOOKUP_F_IFACE flag if oif is set") made the kernel honor the oif parameter when specified as part of output route lookup: # ip route add 2001:db8:1::/64 dev dummy1 # ip route add ::/0 dev dummy2 # ip route get 2001:db8:1::1 oif dummy2 fibmatch default dev dummy2 metric 1024 pref medium Due to regression reports, the behavior was partially reverted in commit `d46a9d678e` ("net: ipv6: Dont add RT6_LOOKUP_F_IFACE flag if saddr set") to only honor the oif if source address is not specified: # ip route get 2001:db8:1::1 from 2001:db8:2::1 oif dummy2 fibmatch 2001:db8:1::/64 dev dummy1 metric 1024 pref medium That is, when source address is specified, the kernel will choose the most specific route even if its nexthop device does not match the specified oif. This creates a problem for multipath routes. After looking up a route, when source address is not specified, the kernel will choose a nexthop whose nexthop device matches the specified oif: # sysctl -wq net.ipv6.conf.all.forwarding=1 # ip route add 2001:db8:10::/64 nexthop via fe80::1 dev dummy1 nexthop via fe80::2 dev dummy2 # for i in {1..100}; do ip route get 2001:db8:10::${i} oif dummy2; done \| grep -o dummy[0-9] \| sort \| uniq -c 100 dummy2 But will disregard the oif when source address is specified despite the fact that a matching nexthop exists: # for i in {1..100}; do ip route get 2001:db8:10::${i} from 2001:db8:2::1 oif dummy2; done \| grep -o dummy[0-9] \| sort \| uniq -c 53 dummy1 47 dummy2 This behavior differs from IPv4: # ip address add 192.0.2.1/32 dev lo # ip route add 198.51.100.0/24 nexthop via inet6 fe80::1 dev dummy1 nexthop via inet6 fe80::2 dev dummy2 # for i in {1..100}; do ip route get 198.51.100.${i} from 192.0.2.1 oif dummy2; done \| grep -o dummy[0-9] \| sort \| uniq -c 100 dummy2 What happens is that fib6_table_lookup() returns a route with a matching 
nexthop device (assuming it exists): # perf record -e fib6:fib6_table_lookup -- bash -c "for i in {1..100}; do ip route get 2001:db8:10::${i} from 2001:db8:2::1 oif dummy2; done > /dev/null" # perf script \| grep -o dummy[0-9] \| sort \| uniq -c 100 dummy2 But it is later overwritten during path selection in fib6_select_path() which instead chooses a nexthop according to the calculated hash. Solve this by telling fib6_select_path() to skip path selection if we have an oif match during output route lookup (iif being LOOPBACK_IFINDEX). Behavior after the change: # sysctl -wq net.ipv6.conf.all.forwarding=1 # ip route add 2001:db8:10::/64 nexthop via fe80::1 dev dummy1 nexthop via fe80::2 dev dummy2 # for i in {1..100}; do ip route get 2001:db8:10::${i} from 2001:db8:2::1 oif dummy2; done \| grep -o dummy[0-9] \| sort \| uniq -c 100 dummy2 Note that enabling forwarding is only needed because we did not add neighbor entries for the gateway addresses. When forwarding is disabled and CONFIG_IPV6_ROUTER_PREF is not enabled in kernel config, the kernel will treat non-existing neighbor entries as errors and perform round-robin between the nexthops: # sysctl -wq net.ipv6.conf.all.forwarding=0 # for i in {1..100}; do ip route get 2001:db8:10::${i} from 2001:db8:2::1 oif dummy2; done \| grep -o dummy[0-9] \| sort \| uniq -c 50 dummy1 50 dummy2 Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260611154605.992528-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-12 17:53:47 -07:00
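The decision described in the oif commit, skipping hash-based path selection when the lookup already produced a nexthop matching the requested oif on an output route lookup, can be modeled as follows. The names are hypothetical and the hash-to-nexthop mapping is a plain modulo chosen for the sketch; the kernel's fib6_select_path() does not select nexthops this way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical nexthop model: only the egress interface index matters. */
struct nh_model { int ifindex; };

/* matched is the nexthop picked by the table lookup. Keep it when this is
 * an output lookup with an oif that the nexthop's device matches; otherwise
 * fall back to hash-based selection over the group. */
static const struct nh_model *select_path_model(const struct nh_model *nhs,
						size_t n,
						const struct nh_model *matched,
						int oif, bool output_lookup,
						uint32_t hash)
{
	assert(nhs && n > 0 && matched);
	if (output_lookup && oif && matched->ifindex == oif)
		return matched;		/* skip path selection */
	return &nhs[hash % n];		/* illustrative hash mapping only */
}
```

In this model an output lookup with oif set to the second nexthop's interface always returns that nexthop, regardless of the hash, which mirrors the "100 dummy2" result shown in the commit message.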
Ido Schimmel	484bb9d164	ipv6: Select best matching nexthop object in fib6_table_lookup() Currently, when using multipath routes without nexthop objects, fib6_table_lookup() selects the nexthop with the highest score. This means that when both a source address and an oif are specified, the nexthop that is chosen is the one that matches in terms of oif: # sysctl -wq net.ipv6.conf.all.forwarding=1 # ip address add 2001:db8:2::1/64 dev lo # ip route add 2001:db8:10::/64 nexthop via fe80::1 dev dummy1 nexthop via fe80::2 dev dummy2 # perf record -e fib6:fib6_table_lookup -- bash -c "for i in {1..100}; do ip route get 2001:db8:10::${i} from 2001:db8:2::1 oif dummy1; done > /dev/null" # perf script \| grep -o dummy[0-9] \| sort \| uniq -c 100 dummy1 # perf record -e fib6:fib6_table_lookup -- bash -c "for i in {1..100}; do ip route get 2001:db8:10::${i} from 2001:db8:2::1 oif dummy2; done > /dev/null" # perf script \| grep -o dummy[0-9] \| sort \| uniq -c 100 dummy2 When using nexthop objects, fib6_table_lookup() selects the first matching nexthop and not necessarily the one with the highest score: # ip nexthop add id 1 via fe80::1 dev dummy1 # ip nexthop add id 2 via fe80::2 dev dummy2 # ip nexthop add id 3 group 1/2 # ip route add 2001:db8:20::/64 nhid 3 # perf record -e fib6:fib6_table_lookup -- bash -c "for i in {1..100}; do ip route get 2001:db8:20::${i} from 2001:db8:2::1 oif dummy1; done > /dev/null" # perf script \| grep -o dummy[0-9] \| sort \| uniq -c 100 dummy1 # perf record -e fib6:fib6_table_lookup -- bash -c "for i in {1..100}; do ip route get 2001:db8:20::${i} from 2001:db8:2::1 oif dummy2; done > /dev/null" # perf script \| grep -o dummy[0-9] \| sort \| uniq -c 100 dummy1 This is not very significant right now because the nexthop is later overwritten during path selection in fib6_select_path(). However, the next patch is going to skip path selection when we have an oif match during output route lookup. 
As a preparation for this change, align the nexthop object behavior with the legacy one and make sure that fib6_table_lookup() always selects the best matching nexthop. Do that by always returning 0 from rt6_nh_find_match() in order not to terminate the loop in nexthop_for_each_fib6_nh() and storing in arg->nh the best matching nexthop so far. Behavior after the change: # perf record -e fib6:fib6_table_lookup -- bash -c "for i in {1..100}; do ip route get 2001:db8:20::${i} from 2001:db8:2::1 oif dummy1; done > /dev/null" # perf script \| grep -o dummy[0-9] \| sort \| uniq -c 100 dummy1 # perf record -e fib6:fib6_table_lookup -- bash -c "for i in {1..100}; do ip route get 2001:db8:20::${i} from 2001:db8:2::1 oif dummy2; done > /dev/null" # perf script \| grep -o dummy[0-9] \| sort \| uniq -c 100 dummy2 Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20260611154605.992528-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-12 17:53:47 -07:00
Yuyang Huang	1ea2f885a7	ipv6: mcast: annotate igmp6 timer expiry race /proc/net/igmp6 walks IPv6 multicast memberships under RCU and reads mca_work.timer.expires to print the remaining multicast timer. The delayed-work timer can be updated concurrently. Annotate the intentional lockless procfs snapshot with READ_ONCE(). Signed-off-by: Yuyang Huang <sigefriedhyy@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260609081113.7613-3-sigefriedhyy@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-12 17:12:08 -07:00
Yuyang Huang	d0dc208808	ipv6: mcast: annotate data-races around mca_flags /proc/net/igmp6 walks IPv6 multicast memberships under RCU and prints mca_flags without holding idev->mc_lock. The multicast paths update the field while holding idev->mc_lock. Annotate this intentional lockless snapshot with READ_ONCE() and the matching writers with WRITE_ONCE(). Signed-off-by: Yuyang Huang <sigefriedhyy@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260609081113.7613-2-sigefriedhyy@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-12 17:12:07 -07:00
Dong Chenchen	8045c0df98	xfrm: Fix dev use-after-free in xfrm async resumption xfrm async resumption holds the skb->dev refcnt until after transport_finish. However, xfrm_rcv_cb may change skb->dev to the tunnel dev without taking a device reference, as vti_rcv_cb does. The subsequent async resumption will then decrement the tunnel device's reference count, which leads to a use-after-free of the tunnel dev and a refcnt leak of the original dev, as shown below: unregister_netdevice: waiting for vti1 to become free. Usage count = -2 Stash the original skb->dev to fix the refcnt imbalance. The new skb->dev set by xfrm_rcv_cb can race with device teardown. Extend RCU protection over xfrm_rcv_cb and transport_finish to prevent races. Fixes: `1c428b0384` ("xfrm: hold dev ref until after transport_finish NF_HOOK") Reported-by: Xu Chunxiao <xuchunxiao3@huawei.com> Signed-off-by: Dong Chenchen <dongchenchen2@huawei.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2026-06-12 08:39:59 +02:00
Eric Dumazet	f6033078a9	ip6_tunnel: annotate data-races around t->err_count and t->err_time ip6_tnl_xmit() and ipip6_tunnel_xmit() run locklessly (dev->lltx == true). ip6gre_err() and ipip6_err() also run locklessly. We need to add READ_ONCE() and WRITE_ONCE() annotations around t->err_count and t->err_time. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260610171458.1359630-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-11 19:12:40 -07:00
Kuniyuki Iwashima	65440a8ce4	ipmr: Convert mr_table.cache_resolve_queue_len to u32. mr_table.cache_resolve_queue_len is always updated under spin_lock_bh(&mfc_unres_lock). Let's convert it to u32. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260609222013.1550355-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-11 15:17:30 -07:00
Jakub Kicinski	dad4d4b92a	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-7.1-rc8). Conflicts: drivers/net/ethernet/wangxun/txgbe/txgbe_aml.c `f67aead16e` ("net: txgbe: rework service event handling") `57d39faed4` ("net: txgbe: improve functions of AML 40G devices") net/rds/info.c `512db8267b` ("rds: mark snapshot pages dirty in rds_info_getsockopt()") `6e94eeb2a2` ("rds: convert to getsockopt_iter") Adjacent changes: include/net/sock.h `1ee90b77b7` ("net: guard timestamp cmsgs to real error queue skbs") `f0de88303d` ("net: make is_skb_wmem() available to modules") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-11 14:33:35 -07:00
Eric Dumazet	6e12a8894e	ip6_tunnel: do not use dst6_mtu() in ip4ip6_err() and ip6erspan_tunnel_xmit() This is a minor performance / conceptual fix. 1) ip6erspan_tunnel_xmit() ERSPAN tunnel can mirror both IPv4 and IPv6 traffic, skb (the packet being mirrored) can be an IPv4 packet, and thus dst can be an IPv4 destination entry Use dst_mtu() which contains generic logic for both families. 2) ip4ip6_err() skb2 has been prepared as an IPv4 packet, and its destination is an IPv4 route. dst6_mtu() is optimized for IPv6 destinations and uses INDIRECT_CALL_1 to call ip6_mtu() directly if the ops match. We should use dst4_mtu() instead. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260609091337.2672441-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-06-11 15:38:49 +02:00
Paolo Abeni	64ced6c088	netfilter pull request 26-06-10 -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEjF9xRqF1emXiQiqU1w0aZmrPKyEFAmopiokACgkQ1w0aZmrP KyGBoA//W6EXg8MCgQGojXZ4+41fPpJ6opw7t65e26/CIwTBgtOiX/no4WOJcOk/ J6WBk2sHIg+RgPPzzOLfzAeyVf7MfbnXvBnEMselFGOtd1o5LeikdcskuMIQ2GP/ v7a3C49Xueyp5ChjdArTbYcaREnAFVnmPGdsaE90D64SHNlmuiyIsRtEut+XUfuz Tl+JNCV2RCdSAl36MqnwAF8sBpj9ndnekNz8E5UONwgA7sowPM6KsGbYqCJrQXxa qDqbj9fme8BmLWc8gwQ37yD5iBGxc7LdVlyavIJ9uSi2Gp/jnDSsuyUm5WDo84wm R65jYmkXh9ZHr8OU/qYg5Mb0LBsSfNRzcE4WZIqb71WQ45IqA53kXam/2W0CuflN g9tPfAqboI9A11Mq3LSwwFWfa3L2xJIi1LFpzGThNTZa9vFoqhXZG0PCqXBKiDdo KsEmU4vtMFqga8tOFDJMgKHAxr/jPWhEZp1ZgIy3xAM2s9xC56Qe1g8mwcBFeZR7 n/YEFGpWFa83EnXGow1gdhUQHu+PYbiwQQy1tSVaZ+kg5WhA3ahnJfIopS2qIrHq NN1RP8m85YjSZGzF/gvq+BEuuYZsebeaFxYLDha0zrhqXA3vlRZvVxoySlJtW/bk m/05GsIb7ySpAyOjejJ9fokymzjD1op+0o69AnEqvg1IXUmE0oM= =Rg2q -----END PGP SIGNATURE----- Merge tag 'nf-26-06-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Revalidate bridge ports, add missing NULL checks to fetch the bridge device by the port. From Florian Westphal. 2) Fix netdevice refcount leak in the error path of nft_fwd hardware offload function, also from Florian. 3) Unregister helper expectfn callback on conntrack helper module removal, otherwise dangling pointer remains in place, from Weiming Shi. 4) Fix possible pointer infoleak in getsockopt() IPT_SO_GET_ENTRIES, From Kyle Zeng. 5) Validate that device MAC header is present before nf_syslog accesses it. From Xiang Mei. 6-8) Three patches to address a possible infoleak of stale stack data in three nf_tables expressions, due to mismatch in the _init() and _eval() function which is possible since `14fb07130c`. From Davide Ornaghi and Florian Westphal. 
netfilter pull request 26-06-10 * tag 'nf-26-06-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nft_meta_bridge: fix stale stack leak via IIFHWADDR register netfilter: nft_fib: fix stale stack leak via the OIFNAME register netfilter: nft_exthdr: fix register tracking for F_PRESENT flag netfilter: nf_log: validate MAC header was set before dumping it netfilter: x_tables: avoid leaking percpu counter pointers netfilter: nf_conntrack: destroy stale expectfn expectations on unregister netfilter: nf_tables_offload: drop device refcount on error netfilter: revalidate bridge ports ==================== Link: https://patch.msgid.link/20260610161629.214092-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-06-11 12:30:00 +02:00
Paolo Abeni	29899ec61a	ipsec-2026-06-10 -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmopbrsACgkQrB3Eaf9P W7cGkg//ZWSTsofbJTwxfGpS5/bkDplup/5evevcbBBJWXfPVM1ImK+oJ3M/hisF /Xx7PDwaw13qdDeH/ZCWRnKAWuEPkt8Y5j51R5QDJKhk3V7h8jAG1sRyQtnE2Z9M JyKmNFJB2cyCYQnpMHfRDW+TMtgCeriKL+rxxRlSB9bbEC/64pLWAeQc9SliSjUj M6I3jHHEkQVG2PzqzViIk8B2GATSTVEyHZMHlPAVAEgaIfNeTX1TzaIbyuaM2yoP /hYPPTsNvaTmXS+4kLg7zEoLEeOdYdhSD538L7GWnkijUbSIkyd+dpKSfoNMy8An /JjY+2GAW1+y2VpS6YbQzE0Q2Bs9ZPU6B5RCETC1drgpxQDBDas5HxwmNA/aKJ/j brMQDL412K8oZrEls/t3wDuP1sdHDZQNP/Zn2OysJX2Hi2rIObE6Xm/tRTyMkrio CU2EX+fXHuLsz+2Hh2Pw3X4G+JgNrPmEJhHRn/Bv0Sl5/fzNP6Jp1vQg6oC/7uoa 2h/hxSmeJObBcg79sLmVfrpjzFoJ6fqQ4Z7JIj6qtIxnMvI74DxRJFJHk54XL7Qk 6cLZlwXnea0Eqa71mu18sCO0MAep9Cui+A0+CH5OplHm6sdzufWEuoQx9BpGV7ce X14GM7KpWzo1LHCXYg4IOcf9JJdV8+YULubWy7SC/9xHjW7ETQY= =vhay -----END PGP SIGNATURE----- Merge tag 'ipsec-2026-06-10' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2026-06-10 1) xfrm: iptfs: preserve shared-frag marker in iptfs_consume_frags() Propagate SKBFL_SHARED_FRAG when paged fragments are moved between skbs so ESP can decide whether in-place crypto is safe. 2) xfrm: iptfs: fix use-after-free on first_skb in __input_process_payload Replace the unlocked read of xtfs->ra_newskb with a local flag so a concurrent reassembly can no longer free first_skb between spin_unlock and the post-loop check. 3) xfrm: policy: fix use-after-free on inexact bin in xfrm_policy_bysel_ctx() Prune the inexact bin under xfrm_policy_lock so a concurrent xfrm_hash_rebuild() can no longer free it before xfrm_policy_kill() dereferences it. 4) xfrm: iptfs: fix ABBA deadlock in iptfs_destroy_state() Move hrtimer_cancel() for the output and drop timers ahead of their spinlocks, breaking the softirq/lock cycle that could deadlock against the timer callbacks on SMP. 
5) xfrm: espintcp: do not reuse an in-progress partial send Fail a new send when espintcp_push_msgs() returns with emsg->len still set, so a blocking caller can no longer overwrite ctx->partial while a previous transfer still owns it. 6) esp: fix page frag reference leak on skb_to_sgvec failure Add a flag to esp_ssg_unref() to unconditionally unref the source scatterlist, releasing the old page references that are otherwise leaked when the second skb_to_sgvec() in esp_output_tail() fails. Please pull or let me know if there are problems. ipsec-2026-06-10 * tag 'ipsec-2026-06-10' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: esp: fix page frag reference leak on skb_to_sgvec failure xfrm: espintcp: do not reuse an in-progress partial send xfrm: iptfs: fix ABBA deadlock in iptfs_destroy_state() xfrm: policy: fix use-after-free on inexact bin in xfrm_policy_bysel_ctx() xfrm: iptfs: fix use-after-free on first_skb in __input_process_payload xfrm: iptfs: preserve shared-frag marker in iptfs_consume_frags() ==================== Link: https://patch.msgid.link/20260610140800.2562818-1-steffen.klassert@secunet.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-06-11 12:00:49 +02:00
Ido Schimmel	b70c687b7c	ipv6: Fix a potential NPD in cleanup_prefix_route() addrconf_get_prefix_route() can return the fib6_null_entry sentinel entry which has a NULL fib6_table pointer. Therefore, before setting the route's expiration time, check that we are not working with this entry, as otherwise a NPD will be triggered [1]. Note that the other callers of addrconf_get_prefix_route() are not susceptible to this bug: 1. addrconf_prefix_rcv(): Requests a route with the 'RTF_ADDRCONF \| RTF_PREFIX_RT' flags which are not set on fib6_null_entry. 2. modify_prefix_route(): Fixed by commit `a747e02430` ("ipv6: avoid possible NULL deref in modify_prefix_route()"). 3. __ipv6_ifa_notify(): Calls ip6_del_rt() which specifically checks for fib6_null_entry and returns an error. [1] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000006: 0000 [#1] SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037] [...] Call Trace: <TASK> __kasan_check_byte (mm/kasan/common.c:573) lock_acquire.part.0 (kernel/locking/lockdep.c:5842 (discriminator 1)) _raw_spin_lock_bh (kernel/locking/spinlock.c:182 (discriminator 1)) cleanup_prefix_route (net/ipv6/addrconf.c:1280) ipv6_del_addr (net/ipv6/addrconf.c:1342) inet6_addr_del.isra.0 (net/ipv6/addrconf.c:3119) inet6_rtm_deladdr (net/ipv6/addrconf.c:4812) rtnetlink_rcv_msg (net/core/rtnetlink.c:6997) netlink_rcv_skb (net/netlink/af_netlink.c:2555) netlink_unicast (net/netlink/af_netlink.c:1344) netlink_sendmsg (net/netlink/af_netlink.c:1899) __sock_sendmsg (net/socket.c:802 (discriminator 4)) ____sys_sendmsg (net/socket.c:2698) ___sys_sendmsg (net/socket.c:2752) __sys_sendmsg (net/socket.c:2784) do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121) Fixes: `5eb902b8e7` ("net/ipv6: Remove expired routes with a separated list of routes.") Reported-by: Ji'an Zhou <eilaimemedsnaimel@gmail.com> Reviewed-by: David 
Ahern <dahern@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260609145448.768318-1-idosch@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-06-11 11:57:11 +02:00
Davide Ornaghi	ab185e0c4f	netfilter: nft_fib: fix stale stack leak via the OIFNAME register For NFT_FIB_RESULT_OIFNAME the destination register is declared with len = IFNAMSIZ (four 32-bit registers), but on the lookup-fail, RTN_LOCAL and oif-mismatch paths nft_fib{4,6}_eval() only writes one register via "dest = 0". The remaining three registers are left as whatever was on the stack in nft_do_chain()'s struct nft_regs, and a downstream expression that loads the register span can leak that uninitialised kernel stack to userspace. The NFTA_FIB_F_PRESENT existence check has the same shape: it is only meaningful for NFT_FIB_RESULT_OIF, yet it was accepted for any result type while the eval stores a single byte via nft_reg_store8(), leaving the rest of the declared span stale. Fix both: - replace the bare "dest = 0" in the eval with nft_fib_store_result(), which strscpy_pad()s the whole IFNAMSIZ for OIFNAME (and is already used on the other early-return path), and - restrict NFTA_FIB_F_PRESENT to NFT_FIB_RESULT_OIF and declare its destination as a single u8, so the marked span matches the one byte the eval writes. Fixes: `f6d0cbcf09` ("netfilter: nf_tables: add fib expression") Suggested-by: Florian Westphal <fw@strlen.de> Cc: stable@vger.kernel.org Signed-off-by: Davide Ornaghi <d.ornaghi97@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-06-10 18:00:19 +02:00


9127 Commits