linux

mirror of https://github.com/torvalds/linux.git synced 2026-07-27 17:47:41 +02:00

Author	SHA1	Message	Date
Florian Westphal	f468c48d48	netfilter: xt_physdev: masks are not c-strings ... and must not be subjected to the 'nul terminated' constraint. If the interface name is 15 characters long, the mask is 16-bytes '0xff' (to cover for \0) and the valid device name is rejected. Fixes: `8df772afc9` ("netfilter: x_physdev: reject empty or not-nul terminated device names") Cc: stable@vger.kernel.org Closes: https://bugs.launchpad.net/neutron/+bug/2159935 Signed-off-by: Florian Westphal <fw@strlen.de>	2026-07-10 16:28:47 +02:00
Julian Anastasov	b3fe4cbd58	ipvs: fix more places with wrong ipv6 transport offsets Sashiko reports for more incorrect IPv6 transport offsets. The app code for TCP was assuming IPv4 network header even after the ipvsh argument was provided. This can cause problems with apps over IPv6. As for the only official app in the kernel tree (FTP) this problem is harmless because we use Netfilter to mangle the FTP ports and we do not adjust the TCP seq numbers. Also, provide correct offset of the ICMPV6 header in ip_vs_out_icmp_v6() for correct checksum checks when the IPv6 packet has extension headers. Fixes: `d12e12299a` ("ipvs: add ipv6 support to ftp") Fixes: `2a3b791e6e` ("IPVS: Add/adjust Netfilter hook functions and helpers for v6") Cc: stable@vger.kernel.org Link: https://sashiko.dev/#/patchset/20260706101624.69471-1-zhaoyz24%40mails.tsinghua.edu.cn Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-07-10 16:28:47 +02:00
Florian Westphal	a2f57827bf	ipvs: reload ip header after head reallocation __ip_vs_get_out_rt() calls skb_ensure_writable() which may reallocate skb->head. Fixes: `8d8e20e2d7` ("ipvs: Decrement ttl") Cc: stable@vger.kernel.org Assisted-by: Claude:claude-sonnet-4-6 Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-07-10 16:28:47 +02:00
Pablo Neira Ayuso	90941d9c92	netfilter: flowtable: use correct direction to set up tunnel route The layer 2 encapsulation and layer 3 tunnel information in the xmit path is taken from the other tuple, because the tunnel information that is included in the tuple for hashtable lookups is also used to perform the egress encapsulation in the transmit path. This patch uses the correct direction when setting up the tunnel, the original proposed patch to address this fix uses the reversed direction. While at it, remove the redundant check to call dst_release() to drop the reference on the dst that was obtained from the forward path, which is not useful in the direct xmit path unless tunneling is performed. Fixes: `fa7395c02d` ("netfilter: flowtable: support IPIP tunnel with direct xmit") Cc: stable@vger.kernel.org Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-07-10 16:28:47 +02:00
Yizhou Zhao	f62c41b491	netfilter: nf_conncount: fix zone comparison in tuple dedup The "already exists" dedup logic in __nf_conncount_add() decides whether a connection has already been counted and can be skipped instead of incrementing the connlimit count. It compares the conntrack zone of a list entry with the zone of the connection being added using nf_ct_zone_id() and nf_ct_zone_equal(), passing conn->zone.dir or zone->dir as the direction argument. Those helpers take enum ip_conntrack_dir values: IP_CT_DIR_ORIGINAL is 0 and IP_CT_DIR_REPLY is 1. However, zone->dir is a u8 bitmask: NF_CT_ZONE_DIR_ORIG is 1, NF_CT_ZONE_DIR_REPL is 2 and NF_CT_DEFAULT_ZONE_DIR is 3. Passing that bitmask as the enum direction shifts the meaning of every non-zero value. An ORIG-only zone passes 1 and is tested as REPLY, while REPL-only and default zones pass 2 or 3 and test bits beyond the valid direction range. In those cases nf_ct_zone_id() can fall back to NF_CT_DEFAULT_ZONE_ID instead of using the real zone id, so different zones can be treated as equal and dedup collapses to tuple equality alone. nf_conncount stores and compares the original-direction tuple for a connection. If an skb already has an attached conntrack entry, get_ct_or_tuple_from_skb() explicitly copies ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple, regardless of the packet's ctinfo. Therefore the zone comparison in the tuple dedup path must use IP_CT_DIR_ORIGINAL as well; the zone direction bitmask describes where a zone id applies, not which direction this conncount tuple represents. Fix the two dedup comparisons by passing IP_CT_DIR_ORIGINAL directly. Do not special-case NF_CT_DEFAULT_ZONE_DIR and do not compare raw zone ids: using the existing helpers with IP_CT_DIR_ORIGINAL preserves the direction-aware NF_CT_DEFAULT_ZONE_ID fallback. A default bidirectional zone contains the ORIG bit, so it naturally returns the real zone id; reply-only zones continue to fall back for original-direction tuple comparisons. 
Fixes: `21ba8847f8` ("netfilter: nf_conncount: Fix garbage collection with zones") Fixes: `b36e4523d4` ("netfilter: nf_conncount: fix garbage collection confirm race") Cc: stable@vger.kernel.org Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn> Reported-by: Ao Wang <wangao@seu.edu.cn> Reported-by: Xuewei Feng <fengxw06@126.com> Reported-by: Qi Li <qli01@tsinghua.edu.cn> Reported-by: Ke Xu <xuke@tsinghua.edu.cn> Assisted-by: Claude-Code:GLM-5.2 Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-07-10 16:28:47 +02:00
Yizhou Zhao	b06163ce52	netfilter: ecache: fix inverted time_after() check ecache_work_evict_list() redelivers DESTROY events for conntracks that were moved to the per-netns dying_list after event delivery failed. It sets a 10ms deadline: stop = jiffies + ECACHE_MAX_JIFFIES but then tests: time_after(stop, jiffies) This condition is true while the deadline is still in the future, so the worker returns STATE_RESTART after the first successful redelivery in the usual case. ecache_work() maps STATE_RESTART to delay 0, which turns the redelivery path into one dying conntrack per workqueue dispatch and makes the sent > 16 batching/cond_resched() path effectively unreachable. A conntrack netlink listener whose receive queue is congested can make DESTROY event delivery fail with -ENOBUFS. With sustained conntrack churn, entries then accumulate on the dying_list and are only drained at the degraded one-entry-per-dispatch rate once delivery succeeds again, wasting CPU on back-to-back workqueue reschedules and prolonging conntrack memory/resource pressure. In a KASAN QEMU test with CONFIG_NF_CONNTRACK_EVENTS=y and nf_conntrack.enable_hooks=1, a congested DESTROY listener caused 8192 nf_ct_delete() calls to return false and move entries to the dying_list. After closing the listener, the unfixed kernel needed 7670 ecache_work() entries to destroy 7669 conntracks. With this change, the same 8192 entries were destroyed by 2 ecache_work() entries. Swap the comparison so the worker restarts only after the deadline has expired. 
Fixes: `2ed3bf188b` ("netfilter: ecache: use dedicated list for event redelivery") Cc: stable@vger.kernel.org Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn> Reported-by: Ao Wang <wangao@seu.edu.cn> Reported-by: Xuewei Feng <fengxw06@126.com> Reported-by: Qi Li <qli01@tsinghua.edu.cn> Reported-by: Ke Xu <xuke@tsinghua.edu.cn> Assisted-by: Claude-Code:GLM-5.2 Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-07-10 16:28:47 +02:00
Wyatt Feng	5d1a224093	netfilter: xt_nat: reject unsupported target families xt_nat SNAT and DNAT target handlers assume IP-family conntrack state is present and can dereference a NULL pointer when instantiated from an unsupported family through nft_compat. A bridge-family compat rule can therefore trigger a NULL-dereference in nf_nat_setup_info(). Reject non-IP families in xt_nat_checkentry() so unsupported targets cannot be installed. Keep NFPROTO_INET allowed for valid inet NAT compat users and leave the runtime fast path unchanged. [ The crash was fixed via `9dbba7e694` ("netfilter: nft_compat: ebtables emulation must reject non-bridge targets"), so this patch is no longer critical. Nevertheless, NAT is only relevant for ipv4/ipv6, so this extra family check is a good idea in any case. ] Fixes: `c7232c9979` ("netfilter: add protocol independent NAT core") Cc: stable@vger.kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Zhengchuan Liang <zcliangcn@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Assisted-by: Codex:GPT-5.4 Signed-off-by: Wyatt Feng <bronzed_45_vested@icloud.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-07-10 16:28:47 +02:00
Julian Anastasov	3f7a535ff0	ipvs: ensure inner headers in ICMP errors are in headroom Sashiko points out that after stripping the outer headers with pskb_pull() we should ensure the inner IP headers in ICMP errors from tunnels are present in the skb headroom for functions like ipv4_update_pmtu(), icmp_send() and IP_VS_DBG(). Also, add more checks for the length of the inner headers. Fixes: `f2edb9f770` ("ipvs: implement passive PMTUD for IPIP packets") Link: https://sashiko.dev/#/patchset/20260702073430.67680-1-zhaoyz24%40mails.tsinghua.edu.cn Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-07-08 15:33:44 +02:00
Yizhou Zhao	2f75c0faa3	ipvs: use parsed transport offset in SCTP state lookup set_sctp_state() reads the SCTP chunk header again in order to drive the IPVS SCTP state table. For IPv6 it computes the offset with sizeof(struct ipv6hdr), while the surrounding IPVS code uses iph.len from ip_vs_fill_iph_skb(), where ipv6_find_hdr() has already skipped extension headers and found the real transport header. This makes the state machine read from the wrong offset for IPv6 SCTP packets that carry extension headers. For example, an INIT packet with an 8-byte destination options header can be scheduled correctly by sctp_conn_schedule(), but set_sctp_state() reads the first byte of the SCTP verification tag as a DATA chunk type. The connection then moves from NONE to ESTABLISHED instead of INIT1, gets the longer established timeout, and updates the active/inactive destination counters incorrectly. This happens even though the SCTP handshake has not completed. Use the parsed transport offset passed down from ip_vs_set_state() for the SCTP chunk-header lookup. For IPv4 and IPv6 packets without extension headers this preserves the existing offset. Fixes: `2906f66a56` ("ipvs: SCTP Trasport Loadbalancing Support") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/netdev/20260705123040.35755-1-zhaoyz24@mails.tsinghua.edu.cn/ Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn> Reported-by: Ao Wang <wangao@seu.edu.cn> Reported-by: Xuewei Feng <fengxw06@126.com> Reported-by: Qi Li <qli01@tsinghua.edu.cn> Reported-by: Ke Xu <xuke@tsinghua.edu.cn> Assisted-by: Claude Code:GLM-5.2 Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-07-08 15:33:44 +02:00
Yizhou Zhao	2500fa3958	ipvs: use parsed transport offset in TCP state lookup TCP state handling reparses the skb to find the TCP header. For IPv6 it uses sizeof(struct ipv6hdr), while the surrounding IPVS code already parsed the packet with ip_vs_fill_iph_skb() and has the real transport-header offset in iph.len. This makes TCP state handling look at the wrong bytes when an IPv6 packet carries extension headers. Use the parsed transport offset passed down from ip_vs_set_state() when reading the TCP header. For IPv4 and for IPv6 packets without extension headers, the passed offset matches the previous value. Fixes: `0bbdd42b7e` ("IPVS: Extend protocol DNAT/SNAT and state handlers") Link: https://lore.kernel.org/netdev/20260705125659.37744-1-zhaoyz24@mails.tsinghua.edu.cn/ Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn> Reported-by: Ao Wang <wangao@seu.edu.cn> Reported-by: Xuewei Feng <fengxw06@126.com> Reported-by: Qi Li <qli01@tsinghua.edu.cn> Reported-by: Ke Xu <xuke@tsinghua.edu.cn> Assisted-by: Claude Code:GLM-5.2 Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-07-08 15:33:44 +02:00
Yizhou Zhao	bae7ce7baf	ipvs: pass parsed transport offset to state handlers IPVS callers already parse the packet into struct ip_vs_iphdr before updating connection state. For IPv6 this records the real transport-header offset after extension headers in iph.len. Pass this parsed transport offset through ip_vs_set_state() and the protocol state_transition() callback so protocol handlers can use the same packet context as scheduling and NAT handling. This patch only changes the common callback plumbing and adapts the protocol callback signatures; TCP and SCTP start using the value in follow-up patches. Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-07-08 15:33:44 +02:00
Florian Westphal	da5b58478a	netfilter: handle unreadable frags sashiko reports: When an skb with unreadable fragments (such as from devmem TCP, where skb_frags_readable(skb) returns false) is processed by the u32 module, skb_copy_bits() will safely return a negative error code [..] xt_u32: bail out with hotdrop in this case. gather_frags: return -1, just as if we had no fragment header. nfnetlink_queue: restrict to the linear part. nfnetlink_log: restrict to the linear part. v2: - skb_zerocopy helpers don't copy readable flag, i.e. nfnetlink_queue is broken too xt_u32 shouldn't return true if hotdrop was set. Fixes: `65249feb6b` ("net: add support for skbs with unreadable frags") Cc: stable@vger.kernel.org Acked-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-07-08 15:33:44 +02:00
Pablo Neira Ayuso	fa7395c02d	netfilter: flowtable: support IPIP tunnel with direct xmit The combination of IPIP tunnel with direct xmit, eg. bridge device, breaks because no dst_entry is provided to check the skb headroom and to set the iph->frag_off field. This leads to invalid dst usage and can trigger a crash in the tunnel transmit path. Fix this by moving dst_cache and dst_cookie out of the runtime union so that they can be shared by neighbour, xfrm, and direct tunnel flows. For FLOW_OFFLOAD_XMIT_DIRECT tuples carrying tunnel metadata, preserve route state in these shared fields and release it through the common dst release path. Since dst_entry is now available to the three supported xmit modes and dst_release() already deals with NULL dst, remove the xmit type check in nft_flow_dst_release(). Moreover, skip the check if the dst entry is NULL in nf_flow_dst_check() which is now the case for the direct xmit case. Based on patch from Rein Wei <n05ec@lzu.edu.cn>. Fixes: `d30301ba4b` ("netfilter: flowtable: Add IPIP tx sw acceleration") Cc: stable@vger.kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Reported-by: Zhengyang Chen <chzhengyang2023@lzu.edu.cn> Reported-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-07-08 15:33:44 +02:00
Pablo Neira Ayuso	6c5dcab95f	netfilter: flowtable: IPIP tunnel hardware offload is not yet support No driver supports IPIP tunnels yet, so give up early on setting up the hardware offload for this scenario. This patch adds a stub that can be enhanced to cover more configurations that are currently not supported. As of now, the offload work is enqueued to the worker, then ignored if the hardware offload configuration is not supported. Check the NF_FLOW_HW flag to know if offload was already attempted once for this entry, so that it is not retried on refresh when unsupported. Move the NF_FLOW_HW flag check to nf_flow_offload_add(). If the NF_FLOW_HW flag is unset, the _del and _stats variants are never called. This can be updated later on to skip queueing the hardware offload work when hardware offload does not support the configuration. Fixes: `d98103575d` ("netfilter: flowtable: Add IP6IP6 rx sw acceleration") Fixes: `ab427db178` ("netfilter: flowtable: Add IPIP rx sw acceleration") Cc: stable@vger.kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Reported-by: Zhengyang Chen <chzhengyang2023@lzu.edu.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-07-08 15:33:44 +02:00
Pablo Neira Ayuso	c328b90c17	netfilter: flowtable: use dst in this direction when pushing IPIP header When pushing the IPIP header, the route of the other direction is used to calculate the headroom, use the route in this direction. Accessing the other tuple to set the IP source and destination is fine because this tuple does not provide such information to avoid storing redundant information. However, this tuple already provides the dst for this direction, this went unnoticed because this bug affects headroom and iph->frag_off only at this stage. Fixes: `d30301ba4b` ("netfilter: flowtable: Add IPIP tx sw acceleration") Fixes: `93cf357fa7` ("netfilter: flowtable: Add IP6IP6 tx sw acceleration") Cc: stable@vger.kernel.org Acked-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-07-08 15:33:44 +02:00
Jozsef Kadlecsik	724f32699a	netfilter: ipset: allocate the proper memory for the generic hash structure Because a single create function is emitted for every hash type, from the IPv4 and IPv6 generic hash structure definitions the last one, i.e. the IPv6 was in effect for IPv4 too. Use the proper size when allocating the structure. Comment properly that because create() refers to elements of the generic hash structure, all referred ones must come before the IPv4/IPv6 dependent 'next' member. Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-07-08 15:33:44 +02:00
Jozsef Kadlecsik	672321302e	netfilter: ipset: cleanup the add/del backlog when resize failed Sashiko pointed out that the add/del backlog was not cleaned up when resize failed. Fix it in the corresponding error path. Also, make sure that the add/del backlog is htable-specific so when resize creates a new htable, old/new backlog can't be mixed up. Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-07-08 15:33:43 +02:00
Jozsef Kadlecsik	cffcf57bf0	netfilter: ipset: exclude gc when resize is in progress Zhengchuan Liang and Eulgyu Kim reported that because resize does not copy the comment extension into the resized set but uses its pointer, ongoing gc can free the extension in the original set, which then results in a stale pointer in the resized one. The proposed patch was to recreate the extensions for every element in the resized set. That is both expensive and wastes memory, so it is better to exclude gc when a resize in progress is detected: resizing will destroy the original set anyway, so doing gc on it is unnecessary. Introduce a new spinlock to exclude parallel gc and resize. Because we just set and check a bool value, there's no need for the parameter to be atomic_t; also rename it for better readability. Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Reported-by: Zhengchuan Liang <zcliangcn@gmail.com> Reported-by: Eulgyu Kim <eulgyukim@snu.ac.kr> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-07-08 15:33:43 +02:00
Jozsef Kadlecsik	5d0c22e736	netfilter: ipset: mark the rcu locked areas properly When we bump the uref counter, there's no need to keep the rcu lock because the referred hash table can't disappear. Also, for the same reason, in mtype_gc we need the rcu lock and not a spinlock. Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-07-08 15:33:43 +02:00
Tamaki Yanagawa	e6107a4c74	netfilter: nft_lookup: fix catchall element handling with inverted lookups nft_lookup_eval() decides whether a lookup matched (`found`) from the direct set lookup and priv->invert before falling back to the catchall element used by interval sets (e.g. nft_set_rbtree) for the open-ended default range. Since `found` is never recomputed after `ext` is replaced by the catchall lookup, inverted lookups (NFT_LOOKUP_F_INV, "!= @set") can wrongly match or wrongly skip the catchall element, producing the wrong verdict. Fold the catchall lookup into `ext` before computing `found`, matching the order already used by nft_objref_map_eval(). Fixes: `aaa31047a6` ("netfilter: nftables: add catch-all set element support") Signed-off-by: Tamaki Yanagawa <ty@000ty.net> Assisted-by: Claude:claude-sonnet-5 Signed-off-by: Florian Westphal <fw@strlen.de>	2026-07-08 15:33:37 +02:00
Wyatt Feng	1b47026fb4	netfilter: xt_connmark: reject invalid shift parameters Revision 2 of the CONNMARK target accepts user-controlled shift parameters and applies them to 32-bit mark values in connmark_tg_shift(). A shift_bits value of 32 or more triggers an undefined-shift bug when the rule is evaluated. Invalid shift_dir values are also accepted and silently fall back to the left-shift path. Reject invalid revision-2 shift parameters in connmark_tg_check() so malformed rules fail at installation time, before they can reach the packet path. Fixes: `472a73e007` ("netfilter: xt_conntrack: Support bit-shifting for CONNMARK & MARK targets.") Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Zhengchuan Liang <zcliangcn@gmail.com> Reported-by: Xin Liu <dstsmallbird@foxmail.com> Assisted-by: Codex:GPT-5.4 Signed-off-by: Wyatt Feng <bronzed_45_vested@icloud.com> Reviewed-by: Ren Wei <enjou1224z@gmail.com> Reviewed-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-07-03 14:45:21 +02:00
Yizhou Zhao	2975324d16	ipvs: reset full ip_vs_seq structs in ip_vs_conn_new Commit `9a05475ceb` ("ipvs: avoid kmem_cache_zalloc in ip_vs_conn_new") changed ip_vs_conn_new() to allocate an ip_vs_conn object with kmem_cache_alloc(). The function then initializes many fields explicitly, but only resets in_seq.delta and out_seq.delta in the two struct ip_vs_seq members. That leaves init_seq and previous_delta uninitialized. This is normally harmless while the corresponding IP_VS_CONN_F_IN_SEQ or IP_VS_CONN_F_OUT_SEQ flag is clear. For connections learned from a sync message, however, ip_vs_proc_conn() preserves those flags from IP_VS_CONN_F_BACKUP_MASK and passes opt=NULL when the message omits IPVS_OPT_SEQ_DATA. In that case the new connection can be hashed with SEQ flags set but with the rest of in_seq/out_seq still containing stale slab data. When a packet for such a connection is later handled by an IPVS application helper, vs_fix_seq() and vs_fix_ack_seq() use previous_delta and init_seq to rewrite TCP sequence numbers. A malformed sync message can therefore make forwarded packets carry stale slab bytes in their TCP seq/ack numbers, and can also corrupt the forwarded TCP flow. Reset both struct ip_vs_seq members completely before publishing the connection. This matches the existing "reset struct ip_vs_seq" comment and keeps the sequence-adjustment gates inactive unless valid sequence data is installed later. Fixes: `9a05475ceb` ("ipvs: avoid kmem_cache_zalloc in ip_vs_conn_new") Cc: stable@vger.kernel.org Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn> Reported-by: Ao Wang <wangao@seu.edu.cn> Reported-by: Xuewei Feng <fengxw06@126.com> Reported-by: Qi Li <qli01@tsinghua.edu.cn> Reported-by: Ke Xu <xuke@tsinghua.edu.cn> Assisted-by: Claude-Code:GLM-5.2 Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-07-03 14:45:21 +02:00
Yizhou Zhao	6b335af0d0	ipvs: fix PMTU for GUE/GRE tunnel ICMP errors When an ICMP Fragmentation Needed error is received for a tunneled IPVS connection, ip_vs_in_icmp() recomputes the MTU that the original packet can use by subtracting the tunnel overhead from the reported next-hop MTU. The current code always subtracts sizeof(struct iphdr), which is only the IPIP overhead. For GUE and GRE tunnels, ipvs_udp_decap() and ipvs_gre_decap() already compute the additional tunnel header length, but that value is scoped to the decapsulation block and is lost before the ICMP_FRAG_NEEDED handling. As a result, the ICMP error sent back to the client advertises an MTU that is too large, so PMTUD can fail to converge for GUE/GRE-tunneled real servers. With a reported next-hop MTU of 1400, a GUE tunnel currently returns 1380 to the client. The correct value is 1368: 1400 - sizeof(struct iphdr) - sizeof(struct udphdr) - sizeof(struct guehdr) Hoist the tunnel header length into the main ip_vs_in_icmp() scope and subtract sizeof(struct iphdr) + ulen in the Fragmentation Needed path. The IPIP path keeps ulen as 0, so its existing 1400 - 20 = 1380 result is unchanged. Fixes: `508f744c0d` ("ipvs: strip udp tunnel headers from icmp errors") Cc: stable@vger.kernel.org Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn> Reported-by: Ao Wang <wangao@seu.edu.cn> Reported-by: Xuewei Feng <fengxw06@126.com> Reported-by: Qi Li <qli01@tsinghua.edu.cn> Reported-by: Ke Xu <xuke@tsinghua.edu.cn> Assisted-by: Claude-Code:GLM-5.2 Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-07-03 14:45:21 +02:00
Pablo Neira Ayuso	d63611cbe8	netfilter: nft_set_rbtree: get command skips end element with open interval The get command on intervals provides partial matches such as subranges for usability reasons. However, an open interval has no closing end element. If the closing element matches within the range of the open interval, i.e. its closest match is the start element of the open range, then return 0 but offer no matching element to userspace through netlink as a special case. Userspace provides at least a matching start element in this case and the closing end element matching the open interval is ignored. Another possibility is to report the matching start element of the open interval for this end interval. However, this results in duplicated matches being listed in userspace because userspace does not expect a start element as response to an end element. Fixes: `2aa34191f0` ("netfilter: nft_set_rbtree: use binary search array in get command") Reported-by: Melbin K Mathew <mlbnkm1@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-07-03 14:45:21 +02:00
Pablo Neira Ayuso	278296b69f	netfilter: nfnetlink_cthelper: cap to maximum number of expectation per master on updates Really cap it to NF_CT_EXPECT_MAX_CNT (255) on updates. The commit ("netfilter: nfnetlink_cthelper: cap to maximum number of expectation per master") only covers creation of helpers, not updates. Fixes: `397c830097` ("netfilter: nf_conntrack_helper: cap maximum number of expectation at helper registration") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-07-03 14:45:21 +02:00
Feng Wu	444853cd43	netfilter: xt_rateest: fix u64 truncation in xt_rateest_mt() On links faster than ~34 Gbps, where byte rate may exceed 2^32-1 (~ 4.3 GBps), the comparison result becomes incorrect because the truncated value no longer reflects the actual estimator rate. Fix by changing the local variables to u64. Fixes: `1c0d32fde5` ("net_sched: gen_estimator: complete rewrite of rate estimators") Signed-off-by: Feng Wu <wufengwufengwufeng@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-07-03 14:45:21 +02:00
Wyatt Feng	64cdf7d30a	netfilter: xt_u32: reject invalid shift counts u32_match_it() executes rule-supplied shift operands on a 32-bit value. A malformed u32 rule can provide a shift count of 32 or more, triggering an undefined shift out-of-bounds during packet evaluation. Validate XT_U32_LEFTSH and XT_U32_RIGHTSH operands in u32_mt_checkentry() and reject malformed rules before they reach the packet path. Fixes: `1b50b8a371` ("[NETFILTER]: Add u32 match") Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Zhengchuan Liang <zcliangcn@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Assisted-by: Codex:GPT-5.4 Signed-off-by: Wyatt Feng <bronzed_45_vested@icloud.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-07-03 14:45:21 +02:00
Florian Westphal	77e43bcb7e	netfilter: nf_nat_sip: reload possible stale data pointer quoting sashiko: ------------------------------------------------------------------------ [..] noticed a potential memory bug and header corruption involving the SIP NAT helper. In net/netfilter/nf_nat_sip.c:nf_nat_sip(): if (skb_ensure_writable(skb, skb->len)) { nf_ct_helper_log(skb, ct, "cannot mangle packet"); return NF_DROP; } uh = (void *)skb->data + protoff; uh->dest = ct_sip_info->forced_dport; if (!nf_nat_mangle_udp_packet(skb, ct, ctinfo, protoff, 0, 0, NULL, 0)) { If a cloned or fragmented SKB is reallocated by skb_ensure_writable(), the old data buffer is freed. However, nf_nat_sip() fails to update dptr to point to the new buffer. It also appears to use nf_nat_mangle_udp_packet() on what could be a TCP packet, which would overwrite the sequence number with a checksum update. ------------------------------------------------------------------------ nf_conntrack_sip linearizes skbs, hence no fragmented skb can be seen. But clones are possible, so rebuild dptr. Disable nf_nat_mangle_udp_packet() branch for TCP streams. It doesn't look like this can ever happen, else we should have received bug reports about this, so just check the conntrack is UDP and drop otherwise. The calling conntrack_sip set ->forced_dport for SIP_HDR_VIA_UDP messages, so I don't think this is ever expected to be true for a TCP stream. Fixes: `7266507d89` ("netfilter: nf_ct_sip: support Cisco 7941/7945 IP phones") Cc: stable@vger.kernel.org Assisted-by: Claude:claude-sonnet-4-6 Signed-off-by: Florian Westphal <fw@strlen.de>	2026-07-03 14:45:20 +02:00
Linus Torvalds	87320be9f0	Including fixes from netfilter and batman-adv. Current release - new code bugs: - netfilter: cthelper: cap to maximum number of expectation per master Previous releases - regressions: - netpoll: fix a use-after-free on shutdown path - tcp: restore RCU grace period in tcp_ao_destroy_sock - ipv6: fix NULL deref in fib6_walk_continue() on multi-batch dump - batman-adv: dat: ensure accessible eth_hdr proto field - eth: virtio_net: disable cb when NAPI is busy-polled - eth: lan743x: Initialize eth_syslock spinlock before use Previous releases - always broken: - netfilter: - nft_set_pipapo: don't leak bad clone into future transaction - sched: - sch_teql: Introduce slaves_lock to avoid race condition and UAF - replace direct dequeue call with peek and qdisc_dequeue_peeked - sctp: add INIT verification after cookie unpacking - tipc: fix out-of-bounds read in broadcast Gap ACK blocks - seg6: validate SRH length before reading fixed fields - eth: mlx5e: fix use-after-free of metadata_dst on RX SC delete - eth: enetc: check the number of BDs needed for xdp_frame - eth: fbnic: don't cache shinfo across skb realloc Signed-off-by: Paolo Abeni <pabeni@redhat.com> Merge tag 'net-7.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from netfilter and batman-adv. Current release - new code bugs: - netfilter: cthelper: cap to maximum number of expectation per master Previous releases - regressions: - netpoll: fix a use-after-free on shutdown path - tcp: restore RCU grace period in tcp_ao_destroy_sock - ipv6: fix NULL deref in fib6_walk_continue() on multi-batch dump - batman-adv: dat: ensure accessible eth_hdr proto field - eth: - virtio_net: disable cb when NAPI is busy-polled - lan743x: Initialize eth_syslock spinlock before use Previous releases - always broken: - netfilter: - nft_set_pipapo: don't leak bad clone into future transaction - sched: - sch_teql: Introduce slaves_lock to avoid race condition and UAF - replace direct dequeue call with peek and qdisc_dequeue_peeked - sctp: add INIT verification after cookie unpacking - tipc: fix out-of-bounds read in broadcast Gap ACK blocks - seg6: validate SRH length before reading fixed fields - eth: - mlx5e: fix use-after-free of metadata_dst on RX SC delete - enetc: check the number of BDs needed for xdp_frame - fbnic: don't cache shinfo across skb realloc" * tag 'net-7.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (58 commits) net/mlx5: HWS, fix matcher leak on resize target setup failure net/sched: hhf: clear heavy-hitter state on reset net/sched: dualpi2: clear stale classification on filter miss net/sched: act_bpf: use rcu_dereference_bh() to read the filter selftests: drv-net: tso: don't touch dangerous feature bits cxgb4: Fix decode strings dump for T6 adapters virtio_net: disable cb when NAPI is busy-polled sctp: fix addr_wq_timer race in sctp_free_addr_wq() selftests: net: bump default cmd() timeout to 20 seconds bridge: stp: Fix a potential use-after-free when
deleting a bridge net/sched: sch_teql: Introduce slaves_lock to avoid race condition and UAF net: gianfar: dispose irq mappings on probe failure and device removal net: lan743x: Initialize eth_syslock spinlock before use net: libwx: fix VMDQ mask for 1-queue mode net: airoha: fix max receive size configuration fsl/fman: Free init resources on KeyGen failure in fman_init() netfilter: nftables: restrict checkum update offset netfilter: nftables: restrict linklayer and network header writes netfilter: nfnetlink_queue: restrict writes to network header netfilter: nft_fib: reject fib expression on the netdev egress hook ...	2026-07-02 06:01:12 -10:00
Florian Westphal	c3716a3c43	netfilter: nftables: restrict checkum update offset After previous patch, writes to network header are restricted. However, there is another way to manipulate the l3 header: The checksum update function. Restrict this for network header writes, only the ipv4 header is allowed. This needs run-time checks because BRIDGE, INET, NETDEV families can carry l3 headers other than IP. checksum updates to the udp/tcp (l4) headers are not restricted. Signed-off-by: Florian Westphal <fw@strlen.de>	2026-06-30 06:45:53 +02:00
Florian Westphal	df07998dfd	netfilter: nftables: restrict linklayer and network header writes Don't permit arbitrary writes to linklayer and network header data. Several spots in network stack trust header validation performed in ipv4/ipv6 before PRE_ROUTING hook. For linklayer, allow writes for netdev ingress. For other hooks, only allow link layer writes that do not spill into network header. For network header, check the offset/length combinations: - changing dscp requires store at offset 0 for checksum fixups, so make sure ip version + length field isn't altered. - ip6 dscp starts directly after the version field, so make sure it remains 6. Several of these checks could already be done at rule insertion time. Risk is that this might cause ruleset load failures for existing rulesets. With this change such writes are silently skipped and packet passes unchanged. Transport and inner header bases are not checked / restricted. Signed-off-by: Florian Westphal <fw@strlen.de>	2026-06-30 06:45:53 +02:00
Florian Westphal	54f34607d1	netfilter: nfnetlink_queue: restrict writes to network header nfnetlink_queue doesn't allow selective replacements of some part of the payload, only complete replacement. If the new data is shorter, skb is trimmed, otherwise expanded. Add minimal validation of the new ip/ipv6 header. Check total len matches skb length. Disallow ip option modifications. IPv6 extension headers are also disabled. IP options and exthdrs could be allowed later after validation pass or ip option recompile. Transport header is not checked. Bridge modifications are rejected. Given userspace doesn't even receive L2 headers, use is limited and I don't think there are any users of bridge nfnetlink_queue, let alone users that modify payload. Arp isn't supported at all. Signed-off-by: Florian Westphal <fw@strlen.de>	2026-06-30 06:45:45 +02:00
Theodor Arsenij Larionov-Trichkine	d07955dd34	netfilter: nft_fib: reject fib expression on the netdev egress hook A fib expression in a netdev egress base chain dereferences nft_in(pkt), NULL on the transmit path, causing a NULL pointer dereference at eval. nft_fib_validate() masks the hook with NF_INET_* values, but netdev hook numbers are a separate enum that aliases them (NF_NETDEV_EGRESS == NF_INET_LOCAL_IN), so an egress chain passes validation and then faults. Add nft_fib_netdev_validate() that limits each result/flag to the netdev hook where the device it reads exists: the input-device cases (OIF, OIFNAME, ADDRTYPE with F_IIF) to ingress, the output-device case (ADDRTYPE with F_OIF) to egress, ADDRTYPE with no device flag to both. Also restrict nft_fib_validate() to NFPROTO_IPV4/IPV6/INET so its NF_INET_* masks are not applied to another family's hooks. Fixes: `42df6e1d22` ("netfilter: Introduce egress hook") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/netfilter-devel/ajxsjcDOnwllMfoR@strlen.de/ Signed-off-by: Theodor Arsenij Larionov-Trichkine <theodorlarionov@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-06-30 06:37:12 +02:00
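The enum aliasing behind this bug can be sketched in a few lines of C. The `demo_*` names below are hypothetical and only mirror the relationship the commit message states (the netdev egress hook number equals the inet local-in hook number); they are not the kernel's definitions, and `demo_validate_family()` is only a rough analogue of restricting validation to the families whose hook numbering the mask was written for.

```c
/* Hypothetical hook numbering that mirrors the aliasing the commit describes. */
enum demo_inet_hook { DEMO_INET_PRE_ROUTING, DEMO_INET_LOCAL_IN, DEMO_INET_FORWARD };
enum demo_netdev_hook { DEMO_NETDEV_INGRESS, DEMO_NETDEV_EGRESS };
enum demo_family { DEMO_FAMILY_INET, DEMO_FAMILY_NETDEV };

/* Mask-only check: knows nothing about which enum 'hook' came from. */
static int demo_validate_mask(unsigned int hook, unsigned int allowed)
{
	return ((1u << hook) & allowed) ? 0 : -1;
}

/* Family-aware check: inet hook masks are applied to inet hooks only. */
static int demo_validate_family(enum demo_family family, unsigned int hook,
				unsigned int allowed)
{
	if (family != DEMO_FAMILY_INET)
		return -1;
	return demo_validate_mask(hook, allowed);
}
```

With `allowed = 1u << DEMO_INET_LOCAL_IN`, the mask-only check accepts `DEMO_NETDEV_EGRESS` because both constants have the same numeric value, while the family-aware check rejects it.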
Pablo Neira Ayuso	bf5355cfde	netfilter: nfnetlink_cthelper: cap to maximum number of expectation per master If userspace helper policy updates sets maximum number of expectation to zero, cap it to NF_CT_EXPECT_MAX_CNT (255) on updates too. Fixes: `397c830097` ("netfilter: nf_conntrack_helper: cap maximum number of expectation at helper registration") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-06-30 06:37:12 +02:00
Pablo Neira Ayuso	e5e24a365a	netfilter: nf_conntrack_sip: validate skb_dst() before accessing it tc ingress and openvswitch do not guarantee routing information to be available. These subsystems use the conntrack helper infrastructure, and the SIP helper relies on the skb_dst() to be present if sip_external_media is set to 1 (which is disabled by default as a module parameter). This effectively disables the sip_external_media toggle for these subsystems without resulting in a crash. Fixes: `cae3a26275` ("openvswitch: Allow attaching helpers to ct action") Fixes: `b57dc7c13e` ("net/sched: Introduce action ct") Cc: stable@vger.kernel.org Reported-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-06-30 06:37:11 +02:00
Xiang Mei	7cd9103283	netfilter: ipset: fix race between dump and ip_set_list resize The release path of ip_set_dump_do() and ip_set_dump_done() read inst->ip_set_list via ip_set_ref_netlink(), a plain rcu_dereference_raw() of the array pointer. These run from netlink_recvmsg() without the nfnl mutex and without an RCU read-side critical section. A concurrent ip_set_create() can grow the array: it publishes the new array, calls synchronize_net() and then kvfree()s the old one. Since the dump paths read the array outside any RCU reader, synchronize_net() does not wait for them and the old array can be freed while they still index into it, causing a use-after-free. The dumped set itself stays pinned via set->ref_netlink, so only the array load needs protecting. Take rcu_read_lock() around it, matching ip_set_get_byname() and __ip_set_put_byindex(). BUG: KASAN: slab-use-after-free in ip_set_dump_do (net/netfilter/ipset/ip_set_core.c:1697) Read of size 8 at addr ffff88800b5c4018 by task exploit/150 Call Trace: ... kasan_report (mm/kasan/report.c:595) ip_set_dump_do (net/netfilter/ipset/ip_set_core.c:1697) netlink_dump (net/netlink/af_netlink.c:2325) netlink_recvmsg (net/netlink/af_netlink.c:1976) sock_recvmsg (net/socket.c:1159) __sys_recvfrom (net/socket.c:2315) ... Oops: general protection fault, probably for non-canonical address ... KASAN NOPTI KASAN: maybe wild-memory-access in range [0x02d6...d0-0x02d6...d7] RIP: 0010:ip_set_dump_do (net/netfilter/ipset/ip_set_core.c:1698) Kernel panic - not syncing: Fatal exception Fixes: `8a02bdd50b` ("netfilter: ipset: Fix calling ip_set() macro at dumping") Cc: stable@vger.kernel.org Reported-by: Weiming Shi <bestswngs@gmail.com> Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Xiang Mei <xmei5@asu.edu> Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-06-30 06:37:11 +02:00
Florian Westphal	47e65eff50	netfilter: nft_set_pipapo: don't leak bad clone into future transaction On memory allocation failure the cloned nft_pipapo_match can enter a bad state: - some fields can have their lookup tables resized while others did not - bits might have been toggled - scratch map can be undersized which also means m->bsize_max can be lower than what is required This means that the next insertion in the same batch can trigger out-of-bounds writes. Furthermore, a failure in the first can result in the bad clone to leak into the next transaction because the abort callback is never executed in this case (the upper layer saw an error and no attempt to allocate a transactional request was made). Record a state for the nft_pipapo_match structure: - NEW (pristine clone) - MOD (modified clone with good state) - ERR (potentially bogus content) Then make it so that deletes and insertions fail when the clone entered ERR state. In case the very first insert attempt results in an error, free the clone right away. Fixes: `3c4287f620` ("nf_tables: Add set type for arbitrary concatenation of ranges") Cc: stable@vger.kernel.org Reported-and-tested-by: Seesee <cjc000013@gmail.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de>	2026-06-30 06:37:07 +02:00
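The NEW/MOD/ERR bookkeeping described above can be sketched as a small state machine. This is an illustration under assumptions, not the pipapo code: `struct demo_clone`, `demo_insert()` and the `fail` parameter (which simulates an allocation failure part-way through an update) are hypothetical, and the "free the clone right away on a failed first insert" step is not modelled.

```c
/* Hypothetical clone states, named after the ones in the commit message. */
enum demo_clone_state { DEMO_NEW, DEMO_MOD, DEMO_ERR };

struct demo_clone {
	enum demo_clone_state state;
	int nelems;
};

/*
 * Insert one element into the clone. 'fail' simulates an allocation
 * failure that may leave the clone in an inconsistent state.
 * Returns 0 on success, -1 on error.
 */
static int demo_insert(struct demo_clone *c, int fail)
{
	if (c->state == DEMO_ERR)
		return -1;	/* refuse to touch a possibly bogus clone */
	if (fail) {
		c->state = DEMO_ERR;
		return -1;
	}
	c->nelems++;
	c->state = DEMO_MOD;
	return 0;
}
```

Once a clone reaches the error state, every later insertion fails early instead of operating on tables that may be only partly resized.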
Florian Westphal	241ccd2fed	netfilter: nf_conntrack_expect: zero at allocation time There are occasional LLM hints wrt. leaking uninitialized data to userspace via ctnetlink. Just zero at allocation time, expectations are not frequently used these days. Intentionally keeps _init as-is because we could theoretically support re-init, so add the missing exp->dir there. Signed-off-by: Florian Westphal <fw@strlen.de>	2026-06-30 05:31:11 +02:00
Linus Torvalds	4edcdefd40	bpf-fixes Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull bpf fixes from Alexei Starovoitov: - Fix effective prog array index with BPF_F_PREORDER (Amery Hung) - Zero-initialize the fib lookup flow struct (Avinash Duduskar) - Disable xfrm_decode_session hook attachment (Bradley Morgan) - Allow type tag BTF records to succeed other modifier records (Emil Tsalapatis) - Fix build_id caching in stack_map_get_build_id_offset() (Ihor Solodrai) - Add missing access_ok call to copy_user_syms (Jiri Olsa) - Fix stack slot index in nospec checks (Nuoqi Gui) - Preserve pointer spill metadata during half-slot cleanup (Nuoqi Gui) - Fix partial copy of non-linear test_run output (Sun Jian) - Fix BPF_PROG_ASSOC_STRUCT_OPS last field check (Thiébaud Weksteen) - Reset register bounds before narrowing retval range (Tristan Madani) - Fix vmlinux BTF leak in bpftool cgroup commands (Yichong Chen) - Guard error writes in conntrack kfuncs (Yiyang Chen) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Disable xfrm_decode_session hook attachment selftests/bpf: Add test for stale
bounds on LSM retval context load bpf: Reset register bounds before narrowing retval range in check_mem_access() selftests/bpf: Cover small conntrack opts error writes bpf: Guard conntrack opts error writes selftests/bpf: Cover half-slot cleanup of pointer spills bpf: Preserve pointer spill metadata during half-slot cleanup selftests/bpf: Test cgroup link replace with BPF_F_PREORDER bpf: Fix effective prog array index with BPF_F_PREORDER bpf: Fix BPF_PROG_ASSOC_STRUCT_OPS last field check bpf: zero-initialize the fib lookup flow struct bpftool: Fix vmlinux BTF leak in cgroup commands bpf: Add missing access_ok call to copy_user_syms bpf: Allow type tag BTF records to succeed other modifier records bpf: Emit verbose message when prog-specific btf_struct_access rejects a write bpf: Fix build_id caching in stack_map_get_build_id_offset() bpf: Fix partial copy of non-linear test_run output selftests/bpf: Cover stack nospec slot indexing bpf: Fix stack slot index in nospec checks	2026-06-25 14:09:26 -07:00
Pablo Neira Ayuso	397c830097	netfilter: nf_conntrack_helper: cap maximum number of expectation at helper registration On helper registration, the maximum number of expectations cannot go over NF_CT_EXPECT_MAX_CNT (255), but zero can be specified, in which case nf_conntrack_expect_max applies. Turn zero into NF_CT_EXPECT_MAX_CNT; otherwise, expectation LRU eviction on insertion is disabled. Moreover, expand this sanity check to all expectation classes. This max_expected policy is only tunable since userspace helpers are available, set Fixes: tag to the commit that adds such infrastructure. Remove the check for p->max_expected given this field must always be non-zero after this patch. Fixes: `12f7a50533` ("netfilter: add user-space connection tracking helper infrastructure") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-06-23 13:10:48 +02:00
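The policy in this commit can be summarised as a small normalisation step. The sketch below is an assumption-laden illustration: `demo_normalize_max_expected()` and `DEMO_EXPECT_MAX_CNT` are hypothetical names, the value 255 is the one quoted in the commit message, and treating an over-limit value as an error is a guess at the registration behaviour rather than a statement of what the kernel does.

```c
#define DEMO_EXPECT_MAX_CNT 255u	/* value quoted in the commit message */

/*
 * Normalise a per-class expectation limit at registration time.
 * Returns 0 and updates *max_expected, or -1 if the value is above the cap.
 */
static int demo_normalize_max_expected(unsigned int *max_expected)
{
	if (*max_expected > DEMO_EXPECT_MAX_CNT)
		return -1;
	if (*max_expected == 0)
		*max_expected = DEMO_EXPECT_MAX_CNT;	/* zero no longer means "no per-helper limit" */
	return 0;
}
```

After this step the limit is always non-zero, which is why a later "is a limit set?" check becomes redundant.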
Florian Westphal	6fb421bd07	netfilter: nft_ct: expectation timeouts are passed in milliseconds Userspace passes '5000' in case user asks for 5 seconds. Allowing for sub-second expectation lifetimes makes sense to me. so fix up the kernel side instead of munging nft to send a value rounded up to next second. Also note that this violates nft convention of passing integers in network byte order, but we can't change this anymore. Fixes: `857b46027d` ("netfilter: nft_ct: add ct expectations support") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-06-23 13:10:47 +02:00
Pablo Neira Ayuso	be57dd9c1c	netfilter: nf_conntrack_expect: run expectation eviction with no helper Run expectation eviction if no helper is specified to deal with the nft_ct expectation support. Cap the maximum expectation limit per master conntrack to NF_CT_EXPECT_MAX_CNT (255). Fixes: `857b46027d` ("netfilter: nft_ct: add ct expectations support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-06-23 13:10:34 +02:00
Pablo Neira Ayuso	979c13114c	netfilter: nf_conntrack_expect: store master_tuple in expectation Store master conntrack tuple in the expectation since exp->master might refer to a different conntrack when accessed from rcu read side lock area due to typesafe rcu rules. Fixes: `02a3231b6d` ("netfilter: nf_conntrack_expect: store netns and zone in expectation") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-06-23 13:10:34 +02:00
Florian Westphal	57f940017a	netfilter: conntrack: add deprecation warnings for irc and pptp trackers IRC Direct client-to-client requires plaintext. IRC over TLS should be preferred, making this helper ineffective. Add a deprecation warning and update the help text to better reflect that this is needed for the DCC extension, not IRC itself. PPTP is esoteric these days and it is the only helper that requires the destroy callback in the conntrack helper API. Removal would simplify the conntrack core. Both helpers are IPv4 only. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-06-23 13:10:34 +02:00
Pablo Neira Ayuso	aaa0cd698f	netfilter: ctnetlink: do not allow to reset helper on existing conntrack This feature allows to reset a helper for an existing conntrack, but it is not safe. This requires a synchronize_rcu() call after resetting the helper, which is going to be expensive for a large batch of conntrack entries. This also needs to call the .destroy callback to release the GRE/PPTP mappings to fix it. This feature antedates the creation of the conntrack-tools and I cannot find a good use-case for this. Given that I cannot find any user in the netfilter.org userspace tree, I prefer to remove this feature. Fixes: `c1d10adb4a` ("[NETFILTER]: Add ctnetlink port for nf_conntrack") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-06-23 08:11:22 +02:00
Florian Westphal	9dbba7e694	netfilter: nft_compat: ebtables emulation must reject non-bridge targets xtables targets return netfilter verdicts: NF_ACCEPT, NF_DROP, and so on. ebtables targets return incompatible verdicts: EBT_ACCEPT, EBT_DROP, ... We cannot allow fallback to NFPROTO_UNSPEC. ebtables doesn't permit this since `11ff7288be` ("netfilter: ebtables: reject non-bridge targets") but that commit missed the nft_compat layer. Reported-by: Ren Wei <n05ec@lzu.edu.cn> Reported-by: Wyatt Feng <bronzed_45_vested@icloud.com> Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Zhengchuan Liang <zcliangcn@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Fixes: `0ca743a559` ("netfilter: nf_tables: add compatibility layer for x_tables") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-06-23 08:11:22 +02:00
Runyu Xiao	11d4bc4e26	netfilter: nft_synproxy: stop bypassing the priv->info snapshot nft_synproxy_eval_v4() and nft_synproxy_eval_v6() already take a whole-object READ_ONCE() snapshot of the shared priv->info state before building the SYNACK reply, but nft_synproxy_tcp_options() still masks opts->options with priv->info.options from the live shared object. When a named synproxy object is updated concurrently with SYN traffic, the eval path can then mix mss and timestamp handling from the local snapshot with an options mask taken from a newer configuration, so one SYNACK no longer reflects a coherent synproxy configuration. Use info->options so nft_synproxy_tcp_options() stays on the same local snapshot that the eval path already copied from priv->info. Fixes: `ee394f96ad` ("netfilter: nft_synproxy: add synproxy stateful object support") Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn> Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-06-23 08:11:22 +02:00
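The snapshot discipline this commit restores can be shown with plain structs. This is a single-threaded sketch with hypothetical names (`struct demo_info`, `struct demo_priv`, `demo_tcp_options()`, `demo_eval()`); the plain struct copy stands in for the whole-object READ_ONCE() snapshot the commit message mentions, and a concurrent update is simulated by modifying the shared object after the copy.

```c
/* Hypothetical configuration object and its shared container. */
struct demo_info {
	unsigned int options;
	unsigned int mss;
};

struct demo_priv {
	struct demo_info info;	/* shared, may be updated at any time */
};

/* The option mask comes from the caller's snapshot, never from the live object. */
static unsigned int demo_tcp_options(const struct demo_info *snap,
				     unsigned int pkt_opts)
{
	return pkt_opts & snap->options;
}

/* Take one snapshot and derive every value of the reply from it. */
static unsigned int demo_eval(const struct demo_priv *priv,
			      unsigned int pkt_opts, unsigned int *mss)
{
	struct demo_info snap = priv->info;	/* the kernel uses READ_ONCE() here */

	*mss = snap.mss;
	return demo_tcp_options(&snap, pkt_opts);
}
```

Because both the MSS and the option mask come from `snap`, a configuration update that lands in between cannot produce a reply that mixes old and new settings.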
Lorenzo Bianconi	84460b6443	netfilter: flowtable: Validate iph->ihl in nf_flow_ip4_tunnel_proto() Add sanity check for iph->ihl field in nf_flow_ip4_tunnel_proto() before using it to compute the header size, avoiding out-of-bounds access with malformed IP headers. While at it, use iph->protocol instead of the hardcoded IPPROTO_IPIP constant when setting ctx->tun.proto and reference ctx->tun.hdr_size when updating ctx->offset. Fixes: `ab427db178` ("netfilter: flowtable: Add IPIP rx sw acceleration") Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-06-23 08:11:21 +02:00
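A header-length sanity check of the kind this commit adds might look like the following in userspace. It is a sketch under assumptions: `demo_ipv4_hdr_len()` is a hypothetical helper working on a raw byte buffer, and the exact conditions in the kernel patch may differ. What it relies on is standard IPv4: the IHL field is the low nibble of the first byte, counts 32-bit words, and is at least 5.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the IPv4 header length in bytes, or -1 if the header is
 * malformed or does not fit into the 'len' bytes available at 'pkt'.
 */
static int demo_ipv4_hdr_len(const uint8_t *pkt, size_t len)
{
	unsigned int ihl;

	if (len < 20)		/* minimum IPv4 header size */
		return -1;
	ihl = pkt[0] & 0x0f;	/* header length in 32-bit words */
	if (ihl < 5)
		return -1;	/* shorter than the fixed header: malformed */
	if ((size_t)ihl * 4 > len)
		return -1;	/* claims more bytes than are present */
	return (int)(ihl * 4);
}
```

Only after such a check is it safe to use the computed size as an offset to the inner header.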
Fernando Fernandez Mancera	c8b6f36f76	netfilter: nf_conncount: prevent connlimit drops for early confirmed ct Commit `69894e5b4c` ("netfilter: nft_connlimit: update the count if add was skipped") introduced a regression where packets for valid connections are dropped when using connlimit for soft-limiting scenarios. The issue occurs when a new connection reuses a socket currently in the TIME_WAIT state. In this scenario, the connection tracking entry is evaluated as already confirmed. Previously, __nf_conncount_add() assumed that if a connection was confirmed and did not originate from the loopback interface, it should skip the addition and return -EEXIST. Skipping the addition triggers a garbage collection run that cleans up the TIME_WAIT connection. Consequently, the active connection count drops to 0, which xt_connlimit mishandles, leading to the false rejection of the perfectly valid new connection. Fix this by replacing the interface check with protocol-agnostic state checks. We now skip the tree insertion and preserve the lockless garbage collection optimization only if the connection is IPS_ASSURED. This allows early-confirmed setup packets (such as reused TIME_WAIT sockets or locally generated SYN-ACKs) to be properly evaluated and counted without falsely dropping. The goto check_connections path is maintained to ensure these setup packets are deduplicated correctly. This has been tested with slowhttptest and HTTP server configured locally to ensure we are not breaking soft-limiting scenarios for local or external connections. In addition, it was tested with an OVS zone limit too.
Fixes: `69894e5b4c` ("netfilter: nft_connlimit: update the count if add was skipped") Reported-by: Alejandro Olivan Alvarez <alejandro.olivan.alvarez@gmail.com> Closes: https://lore.kernel.org/netfilter-devel/177349610461.3071718.4083978280323144323@eldamar.lan/ Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-06-23 08:11:21 +02:00
Mathias Krause	069cfe3de2	netfilter: nf_nat: avoid invalid nat_net pointer use on failed nf_nat_init() We ran into below KASAN splat, which is mostly uninteresting, beside for having nf_nat_register_fn() in the call chain as a cause for the offending access: ================================================================== BUG: KASAN: slab-out-of-bounds in nf_nat_register_fn+0x5f9/0x640 Read of size 8 at addr ffff890031e54c20 by task iptables/9510 CPU: 0 UID: 0 PID: 9510 Comm: iptables Not tainted 6.18.18-grsec-full-20260320181326 #1 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 Call Trace: <TASK> […] dump_stack_lvl+0xee/0x160 ffff88004117eeb8 […] print_report+0x6e/0x640 ffff88004117eee0 […] ? __phys_addr+0x8e/0x140 ffff88004117eef0 […] ? kasan_addr_to_slab+0x51/0xe0 ffff88004117ef08 […] ? complete_report_info+0xec/0x1c0 ffff88004117ef20 […] ? nf_nat_register_fn+0x5f9/0x640 ffff88004117ef48 […] kasan_report+0xbc/0x140 ffff88004117ef50 […] ? nf_nat_register_fn+0x5f9/0x640 ffff88004117ef90 […] nf_nat_register_fn+0x5f9/0x640 ffff88004117eff8 […] ? nf_nat_icmp_reply_translation+0x6e0/0x6e0 ffff88004117f070 […] nf_tables_register_hook.part.0+0xa0/0x220 ffff88004117f080 […] nf_tables_addchain.constprop.0+0x1054/0x1fc0 ffff88004117f0b8 […] ? nft_chain_lookup.part.0+0x4ce/0xac0 ffff88004117f130 […] ? nf_tables_abort+0x3d80/0x3d80 ffff88004117f190 […] ? nf_tables_dumpreset_obj+0x100/0x100 ffff88004117f1c8 […] ? nft_table_lookup.part.0+0x255/0x300 ffff88004117f310 […] ? nf_tables_newchain+0x21a4/0x2fa0 ffff88004117f358 […] nf_tables_newchain+0x21a4/0x2fa0 ffff88004117f360 […] ? nf_tables_addchain.constprop.0+0x1fc0/0x1fc0 ffff88004117f458 […] ? nla_get_range_signed+0x4a0/0x4a0 ffff88004117f488 […] ? lock_acquire+0x16f/0x320 ffff88004117f490 […] ? find_held_lock+0x3b/0xe0 ffff88004117f4b0 […] ? __nla_parse+0x45/0x80 ffff88004117f500 […] nfnetlink_rcv_batch+0xbca/0x19a0 ffff88004117f550 […] ? 
nfnetlink_net_exit_batch+0x120/0x120 ffff88004117f618 […] ? __sanitizer_cov_trace_switch+0x63/0xe0 ffff88004117f720 […] ? gr_acl_handle_mmap+0x1c4/0x320 ffff88004117f7c0 […] ? nla_get_range_signed+0x4a0/0x4a0 ffff88004117f7e8 […] ? gr_is_capable+0x6f/0xe0 ffff88004117f830 […] ? __nla_parse+0x45/0x80 ffff88004117f860 […] ? skb_pull+0x103/0x1a0 ffff88004117f880 […] nfnetlink_rcv+0x3db/0x4a0 ffff88004117f8b0 […] ? nfnetlink_rcv_batch+0x19a0/0x19a0 ffff88004117f8d8 […] ? netlink_lookup+0xe2/0x240 ffff88004117f900 […] netlink_unicast+0x74b/0xb00 ffff88004117f930 […] ? netlink_attachskb+0xb20/0xb20 ffff88004117f980 […] ? __check_object_size+0x3e/0xaa0 ffff88004117f998 […] ? security_netlink_send+0x51/0x160 ffff88004117f9c8 […] netlink_sendmsg+0xa03/0x1200 ffff88004117f9f8 […] ? netlink_unicast+0xb00/0xb00 ffff88004117fa70 […] ? netlink_unicast+0xb00/0xb00 ffff88004117fac8 […] ? ____sys_sendmsg+0xe2a/0x1040 ffff88004117faf8 […] ____sys_sendmsg+0xe2a/0x1040 ffff88004117fb00 […] ? kernel_recvmsg+0x300/0x300 ffff88004117fb60 […] ? reacquire_held_locks+0xe9/0x260 ffff88004117fbc8 […] ___sys_sendmsg+0x138/0x200 ffff88004117fbf8 […] ? do_recvmmsg+0x7e0/0x7e0 ffff88004117fc30 […] ? lockdep_hardirqs_on_prepare+0x101/0x1e0 ffff88004117fc50 […] ? lock_acquire+0x16f/0x320 ffff88004117fd20 […] ? lock_acquire+0x16f/0x320 ffff88004117fd58 […] ? find_held_lock+0x3b/0xe0 ffff88004117fd70 […] __sys_sendmsg+0x17a/0x260 ffff88004117fdc8 […] ? __sys_sendmsg_sock+0x80/0x80 ffff88004117fdf0 […] ? syscall_trace_enter+0x15e/0x2c0 ffff88004117fe98 […] do_syscall_64+0x7d/0x400 ffff88004117fec8 […] entry_SYSCALL_64_safe_stack+0x4a/0x60 ffff88004117fef8 </TASK> ================================================================== The out-of-bounds report, though, is a red herring as it is for an access that shouldn't have happened in the first place. When nf_nat_init() fails to register its BPF kfuncs, it'll unwind and, among others, call unregister_pernet_subsys() to deregister its per-net ops. 
This makes the previously allocated net id available for reuse by the next caller of register_pernet_subsys(), in our case, synproxy. However, 'nat_net_id' will still hold the previously allocated value. If nf_nat.o gets built as a module, all this doesn't matter. A failed initialization routine makes the module fail to load and any dependent module won't be able to load either. However, if nf_nat.o is built-in, a failing init won't /completely/ make its functionality unavailable to dependent modules, namely the code and static data is still there, free to be called by modules like nft_chain_nat.ko. Case in point, nft_chain_nat registers hooks that'll call into nf_nat which, in our case, failed to initialize and therefore won't have a valid net id nor related net_nat object any more. Code in nf_nat, namely nf_nat_register_fn() and nf_nat_unregister_fn(), still making use of the reallocated net id, lead to a type confusion as the call to net_generic() will no longer return memory belonging to an object suited to fit 'struct nat_net' but 'struct synproxy_net' instead. The latter is only 24 bytes on 64-bit systems, much smaller than struct nat_net which is 176 bytes, perfectly explaining the OOB KASAN report. Detect and handle a failed nf_nat_init() by testing the 'nf_nat_hook' pointer which will be reset to NULL on initialization errors to prevent the usage of an invalid nat_net pointer. As this check is only needed when nf_nat.o is built-in, guard it by '#ifndef MODULE...'. Fixes: `cbc1dd5b65` ("netfilter: nf_nat: Fix possible memory leak in nf_nat_init()") Signed-off-by: Mathias Krause <minipli@grsecurity.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2026-06-23 08:11:21 +02:00


7423 Commits