linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-30 01:53:29 +02:00

Author	SHA1	Message	Date
Jakub Kicinski	b6e39e4846	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-7.0-rc8). Conflicts: net/ipv6/seg6_iptunnel.c `c3812651b5` ("seg6: separate dst_cache for input and output paths in seg6 lwtunnel") `78723a62b9` ("seg6: add per-route tunnel source address") https://lore.kernel.org/adZhwtOYfo-0ImSa@sirena.org.uk net/ipv4/icmp.c `fde29fd934` ("ipv4: icmp: fix null-ptr-deref in icmp_build_probe()") `d98adfbdd5` ("ipv4: drop ipv6_stub usage and use direct function calls") https://lore.kernel.org/adO3dccqnr6j-BL9@sirena.org.uk Adjacent changes: drivers/net/ethernet/stmicro/stmmac/chain_mode.c `51f4e090b9` ("net: stmmac: fix integer underflow in chain mode") `6b4286e055` ("net: stmmac: rename STMMAC_GET_ENTRY() -> STMMAC_NEXT_ENTRY()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-09 13:20:59 -07:00
Paolo Abeni	d2000361e4	mptcp: better mptcp-level RTT estimator The current MPTCP-level RTT estimator has several issues. On high speed links, the MPTCP-level receive buffer auto-tuning happens with a frequency well above the TCP-level's one. That in turn can cause excessive/unneeded receive buffer increase. On such links, the initial rtt_us value is considerably higher than the actual delay, and the current mptcp_rcv_space_adjust() updates msk->rcvq_space.rtt_us with a period equal to the such field previous value. If the initial rtt_us is 40ms, its first update will happen after 40ms, even if the subflows see actual RTT orders of magnitude lower. Additionally: - setting the msk RTT to the maximum among all the subflows RTTs makes DRS constantly overshooting the rcvbuf size when a subflow has considerable higher latency than the other(s). - during unidirectional bulk transfers with multiple active subflows, the TCP-level RTT estimator occasionally sees considerably higher value than the real link delay, i.e. when the packet scheduler reacts to an incoming ACK on given subflow pushing data on a different subflow. - currently inactive but still open subflows (i.e. switched to backup mode) are always considered when computing the msk-level RTT. Address the all the issues above with a more accurate RTT estimation strategy: the MPTCP-level RTT is set to the minimum of all the subflows actually feeding data into the MPTCP receive buffer, using a small sliding window. While at it, also use EWMA to compute the msk-level scaling_ratio, to that MPTCP can avoid traversing the subflow list is mptcp_rcv_space_adjust(). Use some care to avoid updating msk and ssk level fields too often. Fixes: `a6b118febb` ("mptcp: add receive buffer auto-tuning") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260407-net-next-mptcp-reduce-rbuf-v2-1-0d1d135bf6f6@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-08 19:32:00 -07:00
Jiayuan Chen	9b55b25390	mptcp: fix slab-use-after-free in __inet_lookup_established The ehash table lookups are lockless and rely on SLAB_TYPESAFE_BY_RCU to guarantee socket memory stability during RCU read-side critical sections. Both tcp_prot and tcpv6_prot have their slab caches created with this flag via proto_register(). However, MPTCP's mptcp_subflow_init() copies tcpv6_prot into tcpv6_prot_override during inet_init() (fs_initcall, level 5), before inet6_init() (module_init/device_initcall, level 6) has called proto_register(&tcpv6_prot). At that point, tcpv6_prot.slab is still NULL, so tcpv6_prot_override.slab remains NULL permanently. This causes MPTCP v6 subflow child sockets to be allocated via kmalloc (falling into kmalloc-4k) instead of the TCPv6 slab cache. The kmalloc-4k cache lacks SLAB_TYPESAFE_BY_RCU, so when these sockets are freed without SOCK_RCU_FREE (which is cleared for child sockets by design), the memory can be immediately reused. Concurrent ehash lookups under rcu_read_lock can then access freed memory, triggering a slab-use-after-free in __inet_lookup_established. Fix this by splitting the IPv6-specific initialization out of mptcp_subflow_init() into a new mptcp_subflow_v6_init(), called from mptcp_proto_v6_init() before protocol registration. This ensures tcpv6_prot_override.slab correctly inherits the SLAB_TYPESAFE_BY_RCU slab cache. Fixes: `b19bc2945b` ("mptcp: implement delegated actions") Cc: stable@vger.kernel.org Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260406031512.189159-1-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-08 19:30:08 -07:00
Gang Yan	7fb2f5f964	mptcp: preserve MSG_EOR semantics in sendmsg path Extend MPTCP's sendmsg handling to recognize and honor the MSG_EOR flag, which marks the end of a record for application-level message boundaries. Data fragments tagged with MSG_EOR are explicitly marked in the mptcp_data_frag structure and skb context to prevent unintended coalescing with subsequent data chunks. This ensures the intent of applications using MSG_EOR is preserved across MPTCP subflows, maintaining consistent message segmentation behavior. Signed-off-by: Gang Yan <yangang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260403-net-next-mptcp-msg_eor-misc-v1-2-b0b33bea3fed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-06 19:14:26 -07:00
Gang Yan	00d46be3c3	mptcp: reduce 'overhead' from u16 to u8 The 'overhead' in struct mptcp_data_frag can safely use u8, as it represents 'alignment + sizeof(mptcp_data_frag)'. With a maximum alignment of 7('ALIGN(1, sizeof(long)) - 1'), the overhead is at most 47, well below U8_MAX and validated with BUILD_BUG_ON(). This patch also adds a field named 'unused' for further extensions. Signed-off-by: Gang Yan <yangang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260403-net-next-mptcp-msg_eor-misc-v1-1-b0b33bea3fed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-06 19:14:26 -07:00
Geliang Tang	e5e2e43002	mptcp: allow overridden write_space to be invoked Future extensions with psock will override their own sk->sk_write_space callback. This patch ensures that the overridden sk_write_space can be invoked by MPTCP. INDIRECT_CALL is used to keep the default path optimised. Note that sk->sk_write_space was never called directly with MPTCP sockets, so changing it to sk_stream_write_space in the init, and using it from mptcp_write_space() is not supposed to change the current behaviour. This patch is shared early to ease discussions around future RFC and avoid confusions with this "fix" that is needed for different future extensions. Suggested-by: Paolo Abeni <pabeni@redhat.com> Co-developed-by: Gang Yan <yangang@kylinos.cn> Signed-off-by: Gang Yan <yangang@kylinos.cn> Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260206-net-next-mptcp-write_space-override-v2-1-e0b12be818c6@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-10 19:54:21 -08:00
Matthieu Baerts (NGI0)	d7e712b66f	mptcp: pm: align endpoint flags size with the NL specs The MPTCP Netlink specs describe the 'flags' as a u32 type. Internally, a u8 type was used. Using a u8 is currently fine, because only the 5 first bits are used. But there is also no reason not to be aligns with the specs, and to stick to a u8. Especially because there is a whole of 3 bytes after in both mptcp_pm_local and mptcp_pm_addr_entry structures. Also, setting it to a u32 will allow future flags, just in case. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-5-31ec8bfc56d1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 18:45:09 -08:00
Paolo Abeni	5c4dcc52c6	mptcp: consolidate rcv space init MPTCP uses several calls of the mptcp_rcv_space_init() helper to initialize the receive space, with a catch-up call in mptcp_rcv_space_adjust(). Drop all the other strictly not needed invocations and move constant fields initialization at socket init/reset time. This removes a bit of complexity from mptcp DRS code. No functional changes intended. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-3-31ec8bfc56d1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 18:45:09 -08:00
Paolo Abeni	70274765fe	mptcp: fix receive space timestamp initialization MPTCP initialize the receive buffer stamp in mptcp_rcv_space_init(), using the provided subflow stamp. Such helper is invoked in several places; for passive sockets, space init happened at clone time. In such scenario, MPTCP ends-up accesses the subflow stamp before its initialization, leading to quite randomic timing for the first receive buffer auto-tune event, as the timestamp for newly created subflow is not refreshed there. Fix the issue moving the stamp initialization out of the mentioned helper, at the data transfer start, and always using a fresh timestamp. Fixes: `013e3179db` ("mptcp: fix rcv space initialization") Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-2-31ec8bfc56d1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 18:45:09 -08:00
Paolo Abeni	86730ac255	mptcp: ensure context reset on disconnect() After the blamed commit below, if the MPC subflow is already in TCP_CLOSE status or has fallback to TCP at mptcp_disconnect() time, mptcp_do_fastclose() skips setting the `send_fastclose flag` and the later __mptcp_close_ssk() does not reset anymore the related subflow context. Any later connection will be created with both the `request_mptcp` flag and the msk-level fallback status off (it is unconditionally cleared at MPTCP disconnect time), leading to a warning in subflow_data_ready(): WARNING: CPU: 26 PID: 8996 at net/mptcp/subflow.c:1519 subflow_data_ready (net/mptcp/subflow.c:1519 (discriminator 13)) Modules linked in: CPU: 26 UID: 0 PID: 8996 Comm: syz.22.39 Not tainted 6.18.0-rc7-05427-g11fc074f6c36 #1 PREEMPT(voluntary) Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 RIP: 0010:subflow_data_ready (net/mptcp/subflow.c:1519 (discriminator 13)) Code: 90 0f 0b 90 90 e9 04 fe ff ff e8 b7 1e f5 fe 89 ee bf 07 00 00 00 e8 db 19 f5 fe 83 fd 07 0f 84 35 ff ff ff e8 9d 1e f5 fe 90 <0f> 0b 90 e9 27 ff ff ff e8 8f 1e f5 fe 4c 89 e7 48 89 de e8 14 09 RSP: 0018:ffffc9002646fb30 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff88813b218000 RCX: ffffffff825c8435 RDX: ffff8881300b3580 RSI: ffffffff825c8443 RDI: 0000000000000005 RBP: 000000000000000b R08: ffffffff825c8435 R09: 000000000000000b R10: 0000000000000005 R11: 0000000000000007 R12: ffff888131ac0000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 FS: 00007f88330af6c0(0000) GS:ffff888a93dd2000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f88330aefe8 CR3: 000000010ff59000 CR4: 0000000000350ef0 Call Trace: <TASK> tcp_data_ready (net/ipv4/tcp_input.c:5356) tcp_data_queue (net/ipv4/tcp_input.c:5445) tcp_rcv_state_process (net/ipv4/tcp_input.c:7165) tcp_v4_do_rcv (net/ipv4/tcp_ipv4.c:1955) __release_sock (include/net/sock.h:1158 (discriminator 6) net/core/sock.c:3180 (discriminator 6)) release_sock (net/core/sock.c:3737) mptcp_sendmsg (net/mptcp/protocol.c:1763 net/mptcp/protocol.c:1857) inet_sendmsg (net/ipv4/af_inet.c:853 (discriminator 7)) __sys_sendto (net/socket.c:727 (discriminator 15) net/socket.c:742 (discriminator 15) net/socket.c:2244 (discriminator 15)) __x64_sys_sendto (net/socket.c:2247) do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1)) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) RIP: 0033:0x7f883326702d Address the issue setting an explicit `fastclosing` flag at fastclose time, and checking such flag after mptcp_do_fastclose(). Fixes: `ae15506024` ("mptcp: fix duplicate reset on fastclose") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251212-net-mptcp-subflow_data_ready-warn-v1-2-d1f9fd1c36c8@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-12-23 09:12:25 +01:00
Paolo Abeni	71154bbe49	mptcp: fallback earlier on simult connection Syzkaller reports a simult-connect race leading to inconsistent fallback status: WARNING: CPU: 3 PID: 33 at net/mptcp/subflow.c:1515 subflow_data_ready+0x40b/0x7c0 net/mptcp/subflow.c:1515 Modules linked in: CPU: 3 UID: 0 PID: 33 Comm: ksoftirqd/3 Not tainted syzkaller #0 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:subflow_data_ready+0x40b/0x7c0 net/mptcp/subflow.c:1515 Code: 89 ee e8 78 61 3c f6 40 84 ed 75 21 e8 8e 66 3c f6 44 89 fe bf 07 00 00 00 e8 c1 61 3c f6 41 83 ff 07 74 09 e8 76 66 3c f6 90 <0f> 0b 90 e8 6d 66 3c f6 48 89 df e8 e5 ad ff ff 31 ff 89 c5 89 c6 RSP: 0018:ffffc900006cf338 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff888031acd100 RCX: ffffffff8b7f2abf RDX: ffff88801e6ea440 RSI: ffffffff8b7f2aca RDI: 0000000000000005 RBP: 0000000000000000 R08: 0000000000000005 R09: 0000000000000007 R10: 0000000000000004 R11: 0000000000002c10 R12: ffff88802ba69900 R13: 1ffff920000d9e67 R14: ffff888046f81800 R15: 0000000000000004 FS: 0000000000000000(0000) GS:ffff8880d69bc000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000560fc0ca1670 CR3: 0000000032c3a000 CR4: 0000000000352ef0 Call Trace: <TASK> tcp_data_queue+0x13b0/0x4f90 net/ipv4/tcp_input.c:5197 tcp_rcv_state_process+0xfdf/0x4ec0 net/ipv4/tcp_input.c:6922 tcp_v6_do_rcv+0x492/0x1740 net/ipv6/tcp_ipv6.c:1672 tcp_v6_rcv+0x2976/0x41e0 net/ipv6/tcp_ipv6.c:1918 ip6_protocol_deliver_rcu+0x188/0x1520 net/ipv6/ip6_input.c:438 ip6_input_finish+0x1e4/0x4b0 net/ipv6/ip6_input.c:489 NF_HOOK include/linux/netfilter.h:318 [inline] NF_HOOK include/linux/netfilter.h:312 [inline] ip6_input+0x105/0x2f0 net/ipv6/ip6_input.c:500 dst_input include/net/dst.h:471 [inline] ip6_rcv_finish net/ipv6/ip6_input.c:79 [inline] NF_HOOK include/linux/netfilter.h:318 [inline] NF_HOOK include/linux/netfilter.h:312 [inline] ipv6_rcv+0x264/0x650 net/ipv6/ip6_input.c:311 __netif_receive_skb_one_core+0x12d/0x1e0 net/core/dev.c:5979 __netif_receive_skb+0x1d/0x160 net/core/dev.c:6092 process_backlog+0x442/0x15e0 net/core/dev.c:6444 __napi_poll.constprop.0+0xba/0x550 net/core/dev.c:7494 napi_poll net/core/dev.c:7557 [inline] net_rx_action+0xa9f/0xfe0 net/core/dev.c:7684 handle_softirqs+0x216/0x8e0 kernel/softirq.c:579 run_ksoftirqd kernel/softirq.c:968 [inline] run_ksoftirqd+0x3a/0x60 kernel/softirq.c:960 smpboot_thread_fn+0x3f7/0xae0 kernel/smpboot.c:160 kthread+0x3c2/0x780 kernel/kthread.c:463 ret_from_fork+0x5d7/0x6f0 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 </TASK> The TCP subflow can process the simult-connect syn-ack packet after transitioning to TCP_FIN1 state, bypassing the MPTCP fallback check, as the sk_state_change() callback is not invoked for * -> FIN_WAIT1 transitions. That will move the msk socket to an inconsistent status and the next incoming data will hit the reported splat. Close the race moving the simult-fallback check at the earliest possible stage - that is at syn-ack generation time. About the fixes tags: [2] was supposed to also fix this issue introduced by [3]. [1] is required as a dependence: it was not explicitly marked as a fix, but it is one and it has already been backported before [3]. In other words, this commit should be backported up to [3], including [2] and [1] if that's not already there. Fixes: `23e89e8ee7` ("tcp: Don't drop SYN+ACK for simultaneous connect().") [1] Fixes: `4fd19a3070` ("mptcp: fix inconsistent state on fastopen race") [2] Fixes: `1e777f39b4` ("mptcp: add MSG_FASTOPEN sendmsg flag support") [3] Cc: stable@vger.kernel.org Reported-by: syzbot+0ff6b771b4f7a5bce83b@syzkaller.appspotmail.com Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/586 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251212-net-mptcp-subflow_data_ready-warn-v1-1-d1f9fd1c36c8@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-12-23 09:12:25 +01:00
Eric Dumazet	08dfe37023	tcp: introduce icsk->icsk_keepalive_timer sk->sk_timer has been used for TCP keepalives. Keepalive timers are not in fast path, we want to use sk->sk_timer storage for retransmit timers, for better cache locality. Create icsk->icsk_keepalive_timer and change keepalive code to no longer use sk->sk_timer. Added space is reclaimed in the following patch. This includes changes to MPTCP, which was also using sk_timer. Alias icsk->mptcp_tout_timer and icsk->icsk_keepalive_timer for inet_sk_diag_fill() sake. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251124175013.1473655-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-25 19:28:29 -08:00
Paolo Abeni	6228efe0cc	mptcp: leverage the backlog for RX packet processing When the msk socket is owned or the msk receive buffer is full, move the incoming skbs in a msk level backlog list. This avoid traversing the joined subflows and acquiring the subflow level socket lock at reception time, improving the RX performances. When processing the backlog, use the fwd alloc memory borrowed from the incoming subflow. skbs exceeding the msk receive space are not dropped; instead they are kept into the backlog until the receive buffer is freed. Dropping packets already acked at the TCP level is explicitly discouraged by the RFC and would corrupt the data stream for fallback sockets. Special care is needed to avoid adding skbs to the backlog of a closed msk and to avoid leaving dangling references into the backlog at subflow closing time. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-14-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 19:49:43 -08:00
Paolo Abeni	ee458a3f31	mptcp: introduce mptcp-level backlog We are soon using it for incoming data processing. MPTCP can't leverage the sk_backlog, as the latter is processed before the release callback, and such callback for MPTCP releases and re-acquire the socket spinlock, breaking the sk_backlog processing assumption. Add a skb backlog list inside the mptcp sock struct, and implement basic helper to transfer packet to and purge such list. Packets in the backlog are memory accounted and still use the incoming subflow receive memory, to allow back-pressure. The backlog size is implicitly bounded to the sum of subflows rcvbuf. When a subflow is closed, references from the backlog to such sock are removed. No packet is currently added to the backlog, so no functional changes intended here. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-13-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 19:49:43 -08:00
Paolo Abeni	9db5b3cec4	mptcp: borrow forward memory from subflow In the MPTCP receive path, we release the subflow allocated fwd memory just to allocate it again shortly after for the msk. That could increases the failures chances, especially when we will add backlog processing, with other actions could consume the just released memory before the msk socket has a chance to do the rcv allocation. Replace the skb_orphan() call with an open-coded variant that explicitly borrows, the fwd memory from the subflow socket instead of releasing it. The borrowed memory does not have PAGE_SIZE granularity; rounding to the page size will make the fwd allocated memory higher than what is strictly required and could make the incoming subflow fwd mem consistently negative. Instead, keep track of the accumulated frag and borrow the full page at subflow close time. This allow removing the last drop in the TCP to MPTCP transition and the associated, now unused, MIB. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-12-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 19:49:42 -08:00
Paolo Abeni	0eeb372dee	mptcp: handle first subflow closing consistently Currently, as soon as the PM closes a subflow, the msk stops accepting data from it, even if the TCP socket could be still formally open in the incoming direction, with the notable exception of the first subflow. The root cause of such behavior is that code currently piggy back two separate semantic on the subflow->disposable bit: the subflow context must be released and that the subflow must stop accepting incoming data. The first subflow is never disposed, so it also never stop accepting incoming data. Use a separate bit to mark the latter status and set such bit in __mptcp_close_ssk() for all subflows. Beyond making per subflow behaviour more consistent this will also simplify the next patch. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-11-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 19:49:42 -08:00
Paolo Abeni	9d82959603	mptcp: make mptcp_destroy_common() static Such function is only used inside protocol.c, there is no need to expose it to the whole stack. Note that the function definition most be moved earlier to avoid forward declaration. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-9-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 19:49:42 -08:00
Paolo Abeni	68c7c38671	mptcp: fix memcg accounting for passive sockets The passive sockets never got proper memcg accounting: the msk socket is associated with the memcg at accept time, but the passive subflows never got it right. At accept time, traverse the subflows list and associate each of them with the msk memcg, and try to do the same at join completion time, if the msk has been already accepted. Fixes: `cf7da0d66c` ("mptcp: Create SUBFLOW socket for incoming connections") Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/298 Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/597 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-4-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 19:49:41 -08:00
Paolo Abeni	bd92dd8e03	mptcp: factor-out cgroup data inherit helper MPTCP will soon need the same functionality for passive sockets, factor them out in a common helper. No functional change intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-2-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-24 19:49:41 -08:00
Jakub Kicinski	9e203721ec	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.18-rc7). No conflicts, adjacent changes: tools/testing/selftests/net/af_unix/Makefile `e1bb28bf13` ("selftest: af_unix: Add test for SO_PEEK_OFF.") `45a1cd8346` ("selftests: af_unix: Add tests for ECONNRESET and EOF semantics") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-20 09:13:26 -08:00
Paolo Abeni	fff0c87996	mptcp: decouple mptcp fastclose from tcp close With the current fastclose implementation, the mptcp_do_fastclose() helper is in charge of two distinct actions: send the fastclose reset and cleanup the subflows. Formally decouple the two steps, ensuring that mptcp explicitly closes all the subflows after the mentioned helper. This will make the upcoming fix simpler, and allows dropping the 2nd argument from mptcp_destroy_common(). The Fixes tag is then the same as in the next commit to help with the backports. Fixes: `d21f834855` ("mptcp: use fastclose on more edge scenarios") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251118-net-mptcp-misc-fixes-6-18-rc6-v1-5-806d3781c95f@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-19 20:07:15 -08:00
Paolo Abeni	4f102d747c	mptcp: avoid unneeded subflow-level drops The rcv window is shared among all the subflows. Currently, MPTCP sync the TCP-level rcv window with the MPTCP one at tcp_transmit_skb() time. The above means that incoming data may sporadically observe outdated TCP-level rcv window and being wrongly dropped by TCP. Address the issue checking for the edge condition before queuing the data at TCP level, and eventually syncing the rcv window as needed. Note that the issue is actually present from the very first MPTCP implementation, but backports older than the blamed commit below will range from impossible to useless. Before: $ nstat -n; sleep 1; nstat -z TcpExtBeyondWindow TcpExtBeyondWindow 14 0.0 After: $ nstat -n; sleep 1; nstat -z TcpExtBeyondWindow TcpExtBeyondWindow 0 0.0 Fixes: `fa3fe2b150` ("mptcp: track window announced to peer") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251118-net-mptcp-misc-fixes-6-18-rc6-v1-2-806d3781c95f@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-19 20:07:14 -08:00
Matthieu Baerts (NGI0)	f88191c7f3	mptcp: pm: in-kernel: record fullmesh endp nb Instead of iterating over all endpoints, under RCU read lock, just to check if one of them as the fullmesh flag, we can keep a counter of fullmesh endpoint, similar to what is done with the other flags. This counter is now checked, before iterating over all endpoints. Similar to the other counters, this new one is also exposed. A userspace app can then know when it is being used in a fullmesh mode, with potentially (too) many subflows. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251101-net-next-mptcp-fm-endp-nb-bind-v1-1-b4166772d6bb@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-04 17:15:06 -08:00
Paolo Abeni	27b0e701d3	mptcp: drop bogus optimization in __mptcp_check_push() Accessing the transmit queue without owning the msk socket lock is inherently racy, hence __mptcp_check_push() could actually quit early even when there is pending data. That in turn could cause unexpected tx lock and timeout. Dropping the early check avoids the race, implicitly relaying on later tests under the relevant lock. With such change, all the other mptcp_send_head() call sites are now under the msk socket lock and we can additionally drop the now unneeded annotation on the transmit head pointer accesses. Fixes: `6e628cd3a8` ("mptcp: use mptcp release_cb for delayed tasks") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251028-net-mptcp-send-timeout-v1-1-38ffff5a9ec8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 17:44:28 -07:00
Paolo Abeni	e118cdc34d	mptcp: rcvbuf auto-tuning improvement Apply to the MPTCP auto-tuning the same improvements introduced for the TCP protocol by the merge commit `2da35e4b4d` ("Merge branch 'tcp-receive-side-improvements'"). The main difference is that TCP subflow and the main MPTCP socket need to account separately for OoO: MPTCP does not care for TCP-level OoO and vice versa, as a consequence do not reflect MPTCP-level rcvbuf increase due to OoO packets at the subflow level. This refeactor additionally allow dropping the msk receive buffer update at receive time, as the latter only intended to cope with subflow receive buffer increase due to OoO packets. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/487 Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/559 Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250927-net-next-mptcp-rcv-path-imp-v1-3-5da266aa9c1a@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-29 18:23:35 -07:00
Matthieu Baerts (NGI0)	539f6b9de3	mptcp: pm: in-kernel: add laminar endpoints Currently, upon the reception of an ADD_ADDR (and when the fullmesh flag is not used), the in-kernel PM will create new subflows using the local address the routing configuration will pick. It would be easier to pick local addresses from a selected list of endpoints, and use it only once, than relying on routing rules. Use case: both the client (C) and the server (S) have two addresses (a and b). The client establishes the connection between C(a) and S(a). Once established, the server announces its additional address S(b). Once received, the client connects to it using its second address C(b). Compared to a situation without the 'laminar' endpoint for C(b), the client didn't use this address C(b) to establish a subflow to the server's primary address S(a). So at the end, we have: C S C(a) --- S(a) C(b) --- S(b) In case of a 3rd address on each side (C(c) and S(c)), upon the reception of an ADD_ADDR with S(c), the client should not pick C(b) because it has already been used. C(c) should then be used. Note that this situation is currently possible if C doesn't add any endpoint, but configure the routing in order to pick C(b) for the route to S(b), and pick C(c) for the route to S(c). That doesn't sound very practical because it means knowing in advance the IP addresses that will be used and announced by the server. 'laminar', like the idea of laminar flows: the different subflows don't mix with each other on an endpoint, unlike the "turbulent" way traffic is mixed by 'fullmesh'. In the code, the new endpoint type is added. Similar to the other subflow types, an MPTCP_INFO counter is added. While at it, hole are now commented in struct mptcp_info, to remember next time that these holes can no longer be used. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/503 Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250925-net-next-mptcp-c-flag-laminar-v1-15-ad126cc47c6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-26 17:44:05 -07:00
Matthieu Baerts (NGI0)	4984fe6254	mptcp: pm: in-kernel: reduce pernet struct size All the 'unsigned int' variables from the 'pm_nl_pernet' structure are bounded to MPTCP_PM_ADDR_MAX, currently set to 8. The endpoint ID is also bounded by the protocol to 8-bit. MPTCP_PM_ADDR_MAX, if extended later, will never over 8-bit. So no need to use 'unsigned int' variables, 'u8' is enough. Note that the exposed counters in MPTCP_INFO are already limited to 8-bit, same for pm->extra_subflows, and others. So it seems even better to limit them to 8-bit. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250925-net-next-mptcp-c-flag-laminar-v1-13-ad126cc47c6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-26 17:44:05 -07:00
Matthieu Baerts (NGI0)	e7757b6d3a	mptcp: pm: in-kernel: rename 'local_addr_max' to 'endp_subflow_max' A few variables linked to the in-kernel Path-Manager are confusing, and it would help current and future developers, to clarify them. One of them is 'local_addr_max', which in fact represents the maximum number of 'subflow' endpoints that can be used to create new subflows, and not the number of local addresses that have been used to create subflows. While at it, add an additional name for the corresponding variable in MPTCP INFO: mptcpi_endp_subflow_max. Not to break the current uAPI, the new name is added as a 'define' pointing to the former name. This will then also help userspace devs. Also move the variable and function next to the other 'endp_X_max' ones. No functional changes intended. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250925-net-next-mptcp-c-flag-laminar-v1-9-ad126cc47c6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-26 17:44:05 -07:00
Matthieu Baerts (NGI0)	37712d84df	mptcp: pm: in-kernel: rename 'add_addr_accept_max' to 'limit_add_addr_accepted' A few variables linked to the in-kernel Path-Manager are confusing, and it would help current and future developers, to clarify them. One of them is 'add_addr_accept_max', which in fact represents the limit of ADD_ADDR that can be accepted: the limit set via 'ip mptcp limit add_addr_accepted X' for example. It is not linked to the maximum number of accepted ADD_ADDR. While at it, add an additional name for the corresponding variable in MPTCP INFO: mptcpi_limit_add_addr_accepted. Not to break the current uAPI, the new name is added as a 'define' pointing to the former name. This will then also help userspace devs. No functional changes intended. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250925-net-next-mptcp-c-flag-laminar-v1-8-ad126cc47c6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-26 17:44:04 -07:00
Matthieu Baerts (NGI0)	45cae57066	mptcp: pm: in-kernel: rename 'add_addr_signal_max' to 'endp_signal_max' A few variables linked to the in-kernel Path-Manager are confusing, and it would help current and future developers, to clarify them. One of them is 'add_addr_signal_max', which in fact represents the maximum number of 'signal' endpoints that can be used to announced addresses, and not the number of ADD_ADDR that can be signalled. While at it, add an additional name for the corresponding variable in MPTCP INFO: mptcpi_endp_signal_max. Not to break the current uAPI, the new name is added as a 'define' pointing to the former name. This will then also help userspace devs. No functional changes intended. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250925-net-next-mptcp-c-flag-laminar-v1-7-ad126cc47c6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-26 17:44:04 -07:00
Matthieu Baerts (NGI0)	3eb3c9a959	mptcp: pm: in-kernel: rename 'subflows_max' to 'limit_extra_subflows' A few variables linked to the in-kernel Path-Manager are confusing, and it would help current and future developers, to clarify them. One of them is 'subflows_max', which in fact represents the limit of extra subflows: the limit set via 'ip mptcp limit subflows X' for example. It is not linked to the maximum number of created / possible subflows. While at it, add an additional name for the corresponding variable in MPTCP INFO: mptcpi_limit_extra_subflows. Not to break the current uAPI, the new name is added as a 'define' pointing to the former name. This will then also help userspace devs. No functional changes intended. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250925-net-next-mptcp-c-flag-laminar-v1-6-ad126cc47c6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-26 17:44:04 -07:00
Matthieu Baerts (NGI0)	c5273f6ca1	mptcp: pm: rename 'subflows' to 'extra_subflows' A few variables linked to the Path-Managers are confusing, and it would help current and future developers, to clarify them. One of them is 'subflows', which in fact represents the number of extra subflows: all the additional subflows created after the initial one, and not the total number of subflows. While at it, add an additional name for the corresponding variable in MPTCP INFO: mptcpi_extra_subflows. Not to break the current uAPI, the new name is added as a 'define' pointing to the former name. This will then also help userspace devs. No functional changes intended. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250925-net-next-mptcp-c-flag-laminar-v1-5-ad126cc47c6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-26 17:44:04 -07:00
Matthieu Baerts (NGI0)	4b1ff850e0	mptcp: pm: in-kernel: usable client side with C-flag When servers set the C-flag in their MP_CAPABLE to tell clients not to create subflows to the initial address and port, clients will likely not use their other endpoints. That's because the in-kernel path-manager uses the 'subflow' endpoints to create subflows only to the initial address and port. If the limits have not been modified to accept ADD_ADDR, the client doesn't try to establish new subflows. If the limits accept ADD_ADDR, the routing routes will be used to select the source IP. The C-flag is typically set when the server is operating behind a legacy Layer 4 load balancer, or using anycast IP address. Clients having their different 'subflow' endpoints setup, don't end up creating multiple subflows as expected, and causing some deployment issues. A special case is then added here: when servers set the C-flag in the MPC and directly sends an ADD_ADDR, this single ADD_ADDR is accepted. The 'subflows' endpoints will then be used with this new remote IP and port. This exception is only allowed when the ADD_ADDR is sent immediately after the 3WHS, and makes the client switching to the 'fully established' mode. After that, 'select_local_address()' will not be able to find any subflows, because 'id_avail_bitmap' will be filled in mptcp_pm_create_subflow_or_signal_addr(), when switching to 'fully established' mode. Fixes: `df377be387` ("mptcp: add deny_join_id0 in mptcp_options_received") Cc: stable@vger.kernel.org Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/536 Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250925-net-next-mptcp-c-flag-laminar-v1-1-ad126cc47c6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-26 17:44:03 -07:00
Kuniyuki Iwashima	1068b48ed1	mptcp: Use tcp_under_memory_pressure() in mptcp_epollin_ready(). Some conditions used in mptcp_epollin_ready() are the same as tcp_under_memory_pressure(). We will modify tcp_under_memory_pressure() in the later patch. Let's use tcp_under_memory_pressure() instead. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Link: https://patch.msgid.link/20250815201712.1745332-3-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 19:20:58 -07:00
Paolo Abeni	829fec0244	mptcp: remove pr_fallback() We can now track fully the fallback status of a given connection via the relevant mibs, the mentioned helper is redundant. Remove it completely. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250723-net-next-mptcp-track-fallbacks-v1-2-a83cce08f2d5@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-25 11:29:04 -07:00
Paolo Abeni	c65c2e3bae	mptcp: track fallbacks accurately via mibs Add the mibs required to cover the few possible fallback causes still lacking suck info. Move the relevant mib increment into the fallback helper, so that no eventual future fallback operation will miss a paired mib increment. Additionally track failed fallback via its own mib, such mib is incremented only when a fallback mandated by the protocol fails - due to racing subflow creation. While at the above, rename an existing helper to reduce long lines problems all along. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250723-net-next-mptcp-track-fallbacks-v1-1-a83cce08f2d5@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-25 11:29:04 -07:00
Geliang Tang	51c5fd09e1	mptcp: add TCP_MAXSEG sockopt support The TCP_MAXSEG socket option is currently not supported by MPTCP, mainly because it has never been requested before. But there are still valid use-cases, e.g. with HAProxy. This patch adds its support in MPTCP by propagating the value to all subflows. The get part looks at the value on the first subflow, to be as closed as possible to TCP. Only one value can be returned for the cached MSS, so this can come only from one subflow. Similar to mptcp_setsockopt_first_sf_only(), a generic helper mptcp_setsockopt_all_subflows() is added to set sockopt for each subflows of the mptcp socket. Add a new member for struct mptcp_sock to store the TCP_MAXSEG value, and return this value in getsockopt. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/515 Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250719-net-next-mptcp-tcp_maxseg-v2-3-8c910fbc5307@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-21 17:48:33 -07:00
Paolo Abeni	def5b7b264	mptcp: plug races between subflow fail and subflow creation We have races similar to the one addressed by the previous patch between subflow failing and additional subflow creation. They are just harder to trigger. The solution is similar. Use a separate flag to track the condition 'socket state prevent any additional subflow creation' protected by the fallback lock. The socket fallback makes such flag true, and also receiving or sending an MP_FAIL option. The field 'allow_infinite_fallback' is now always touched under the relevant lock, we can drop the ONCE annotation on write. Fixes: `478d770008` ("mptcp: send out MP_FAIL when data checksum fails") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250714-net-mptcp-fallback-races-v1-2-391aff963322@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-15 17:31:25 -07:00
Paolo Abeni	f8a1d9b18c	mptcp: make fallback action and fallback decision atomic Syzkaller reported the following splat: WARNING: CPU: 1 PID: 7704 at net/mptcp/protocol.h:1223 __mptcp_do_fallback net/mptcp/protocol.h:1223 [inline] WARNING: CPU: 1 PID: 7704 at net/mptcp/protocol.h:1223 mptcp_do_fallback net/mptcp/protocol.h:1244 [inline] WARNING: CPU: 1 PID: 7704 at net/mptcp/protocol.h:1223 check_fully_established net/mptcp/options.c:982 [inline] WARNING: CPU: 1 PID: 7704 at net/mptcp/protocol.h:1223 mptcp_incoming_options+0x21a8/0x2510 net/mptcp/options.c:1153 Modules linked in: CPU: 1 UID: 0 PID: 7704 Comm: syz.3.1419 Not tainted 6.16.0-rc3-gbd5ce2324dba #20 PREEMPT(voluntary) Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:__mptcp_do_fallback net/mptcp/protocol.h:1223 [inline] RIP: 0010:mptcp_do_fallback net/mptcp/protocol.h:1244 [inline] RIP: 0010:check_fully_established net/mptcp/options.c:982 [inline] RIP: 0010:mptcp_incoming_options+0x21a8/0x2510 net/mptcp/options.c:1153 Code: 24 18 e8 bb 2a 00 fd e9 1b df ff ff e8 b1 21 0f 00 e8 ec 5f c4 fc 44 0f b7 ac 24 b0 00 00 00 e9 54 f1 ff ff e8 d9 5f c4 fc 90 <0f> 0b 90 e9 b8 f4 ff ff e8 8b 2a 00 fd e9 8d e6 ff ff e8 81 2a 00 RSP: 0018:ffff8880a3f08448 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff8880180a8000 RCX: ffffffff84afcf45 RDX: ffff888090223700 RSI: ffffffff84afdaa7 RDI: 0000000000000001 RBP: ffff888017955780 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: ffff8880180a8910 R14: ffff8880a3e9d058 R15: 0000000000000000 FS: 00005555791b8500(0000) GS:ffff88811c495000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000110c2800b7 CR3: 0000000058e44000 CR4: 0000000000350ef0 Call Trace: <IRQ> tcp_reset+0x26f/0x2b0 net/ipv4/tcp_input.c:4432 tcp_validate_incoming+0x1057/0x1b60 net/ipv4/tcp_input.c:5975 tcp_rcv_established+0x5b5/0x21f0 net/ipv4/tcp_input.c:6166 tcp_v4_do_rcv+0x5dc/0xa70 net/ipv4/tcp_ipv4.c:1925 tcp_v4_rcv+0x3473/0x44a0 net/ipv4/tcp_ipv4.c:2363 ip_protocol_deliver_rcu+0xba/0x480 net/ipv4/ip_input.c:205 ip_local_deliver_finish+0x2f1/0x500 net/ipv4/ip_input.c:233 NF_HOOK include/linux/netfilter.h:317 [inline] NF_HOOK include/linux/netfilter.h:311 [inline] ip_local_deliver+0x1be/0x560 net/ipv4/ip_input.c:254 dst_input include/net/dst.h:469 [inline] ip_rcv_finish net/ipv4/ip_input.c:447 [inline] NF_HOOK include/linux/netfilter.h:317 [inline] NF_HOOK include/linux/netfilter.h:311 [inline] ip_rcv+0x514/0x810 net/ipv4/ip_input.c:567 __netif_receive_skb_one_core+0x197/0x1e0 net/core/dev.c:5975 __netif_receive_skb+0x1f/0x120 net/core/dev.c:6088 process_backlog+0x301/0x1360 net/core/dev.c:6440 __napi_poll.constprop.0+0xba/0x550 net/core/dev.c:7453 napi_poll net/core/dev.c:7517 [inline] net_rx_action+0xb44/0x1010 net/core/dev.c:7644 handle_softirqs+0x1d0/0x770 kernel/softirq.c:579 do_softirq+0x3f/0x90 kernel/softirq.c:480 </IRQ> <TASK> __local_bh_enable_ip+0xed/0x110 kernel/softirq.c:407 local_bh_enable include/linux/bottom_half.h:33 [inline] inet_csk_listen_stop+0x2c5/0x1070 net/ipv4/inet_connection_sock.c:1524 mptcp_check_listen_stop.part.0+0x1cc/0x220 net/mptcp/protocol.c:2985 mptcp_check_listen_stop net/mptcp/mib.h:118 [inline] __mptcp_close+0x9b9/0xbd0 net/mptcp/protocol.c:3000 mptcp_close+0x2f/0x140 net/mptcp/protocol.c:3066 inet_release+0xed/0x200 net/ipv4/af_inet.c:435 inet6_release+0x4f/0x70 net/ipv6/af_inet6.c:487 __sock_release+0xb3/0x270 net/socket.c:649 sock_close+0x1c/0x30 net/socket.c:1439 __fput+0x402/0xb70 fs/file_table.c:465 task_work_run+0x150/0x240 kernel/task_work.c:227 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline] exit_to_user_mode_loop+0xd4/0xe0 kernel/entry/common.c:114 exit_to_user_mode_prepare include/linux/entry-common.h:330 [inline] syscall_exit_to_user_mode_work include/linux/entry-common.h:414 [inline] syscall_exit_to_user_mode include/linux/entry-common.h:449 [inline] do_syscall_64+0x245/0x360 arch/x86/entry/syscall_64.c:100 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fc92f8a36ad Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffcf52802d8 EFLAGS: 00000246 ORIG_RAX: 00000000000001b4 RAX: 0000000000000000 RBX: 00007ffcf52803a8 RCX: 00007fc92f8a36ad RDX: 0000000000000000 RSI: 000000000000001e RDI: 0000000000000003 RBP: 00007fc92fae7ba0 R08: 0000000000000001 R09: 0000002800000000 R10: 00007fc92f700000 R11: 0000000000000246 R12: 00007fc92fae5fac R13: 00007fc92fae5fa0 R14: 0000000000026d00 R15: 0000000000026c51 </TASK> irq event stamp: 4068 hardirqs last enabled at (4076): [<ffffffff81544816>] __up_console_sem+0x76/0x80 kernel/printk/printk.c:344 hardirqs last disabled at (4085): [<ffffffff815447fb>] __up_console_sem+0x5b/0x80 kernel/printk/printk.c:342 softirqs last enabled at (3096): [<ffffffff840e1be0>] local_bh_enable include/linux/bottom_half.h:33 [inline] softirqs last enabled at (3096): [<ffffffff840e1be0>] inet_csk_listen_stop+0x2c0/0x1070 net/ipv4/inet_connection_sock.c:1524 softirqs last disabled at (3097): [<ffffffff813b6b9f>] do_softirq+0x3f/0x90 kernel/softirq.c:480 Since we need to track the 'fallback is possible' condition and the fallback status separately, there are a few possible races open between the check and the actual fallback action. Add a spinlock to protect the fallback related information and use it close all the possible related races. While at it also remove the too-early clearing of allow_infinite_fallback in __mptcp_subflow_connect(): the field will be correctly cleared by subflow_finish_connect() if/when the connection will complete successfully. If fallback is not possible, as per RFC, reset the current subflow. Since the fallback operation can now fail and return value should be checked, rename the helper accordingly. Fixes: `0530020a7c` ("mptcp: track and update contiguous data status") Cc: stable@vger.kernel.org Reported-by: Matthieu Baerts <matttbe@kernel.org> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/570 Reported-by: syzbot+5cf807c20386d699b524@syzkaller.appspotmail.com Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/555 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250714-net-mptcp-fallback-races-v1-1-391aff963322@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-15 17:31:25 -07:00
Sebastian Andrzej Siewior	82d9e6b9a0	mptcp: Use nested-BH locking for hmac_storage mptcp_delegated_actions is a per-CPU variable and relies on disabled BH for its locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT this data structure requires explicit locking. Add a local_lock_t to the data structure and use local_lock_nested_bh() for locking. This change adds only lockdep coverage and does not alter the functional behaviour for !PREEMPT_RT. Cc: Matthieu Baerts <matttbe@kernel.org> Cc: Mat Martineau <martineau@kernel.org> Cc: Geliang Tang <geliang@kernel.org> Cc: mptcp@lists.linux.dev Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20250512092736.229935-13-bigeasy@linutronix.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-05-15 15:23:31 +02:00
Geliang Tang	760ff07669	mptcp: sched: split validation part A new interface .validate has been added in struct bpf_struct_ops recently. This patch prepares a future struct_ops support by implementing it as a new helper mptcp_validate_scheduler() for struct mptcp_sched_ops. In this helper, check whether the required ops "get_subflow" of struct mptcp_sched_ops has been implemented. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250413-net-next-mptcp-sched-mib-sft-misc-v2-2-0f83a4350150@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-15 08:21:46 -07:00
Geliang Tang	fa3ee9dd80	mptcp: sysctl: add available_path_managers Similarly to net.mptcp.available_schedulers, this patch adds a new one net.mptcp.available_path_managers to list the available path managers. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250313-net-next-mptcp-pm-ops-intro-v1-11-f4e4a88efc50@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-03-20 10:14:49 +01:00
Geliang Tang	595c26d122	mptcp: sysctl: set path manager by name Similar to net.mptcp.scheduler, a new net.mptcp.path_manager sysctl knob is added to determine which path manager will be used by each newly created MPTCP socket by setting the name of it. Dealing with an explicit name is easier than with a number, especially when more PMs will be introduced. This sysctl knob makes the old one "pm_type" deprecated. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250313-net-next-mptcp-pm-ops-intro-v1-8-f4e4a88efc50@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-03-20 10:14:48 +01:00
Geliang Tang	770170b418	mptcp: pm: register in-kernel and userspace PM This patch defines the original in-kernel netlink path manager as a new struct mptcp_pm_ops named "mptcp_pm_kernel", and register it in mptcp_pm_kernel_register(). And define the userspace path manager as a new struct mptcp_pm_ops named "mptcp_pm_userspace", and register it in mptcp_pm_init(). To ensure that there's always a valid path manager available, the default path manager "mptcp_pm_kernel" will be skipped in mptcp_pm_unregister(). Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250313-net-next-mptcp-pm-ops-intro-v1-7-f4e4a88efc50@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-03-20 10:14:48 +01:00
Geliang Tang	1305b0c22e	mptcp: pm: define struct mptcp_pm_ops In order to allow users to develop their own BPF-based path manager, this patch defines a struct ops "mptcp_pm_ops" for an MPTCP path manager, which contains a set of interfaces. Currently only init() and release() interfaces are included, subsequent patches will add others step by step. Add a set of functions to register, unregister, find and validate a given path manager struct ops. "list" is used to add this path manager to mptcp_pm_list list when it is registered. "name" is used to identify this path manager. mptcp_pm_find() uses "name" to find a path manager on the list. mptcp_pm_unregister is not used in this set, but will be invoked in .unreg of struct bpf_struct_ops. mptcp_pm_validate() will be invoked in .validate of struct bpf_struct_ops. That's why they are exported. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250313-net-next-mptcp-pm-ops-intro-v1-6-f4e4a88efc50@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-03-20 10:14:48 +01:00
Geliang Tang	eff5b1578e	mptcp: pm: add struct_group in mptcp_pm_data This patch adds a "struct_group(reset, ...)" in struct mptcp_pm_data to simplify the reset, and make sure we don't miss any. Suggested-by: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250313-net-next-mptcp-pm-ops-intro-v1-5-f4e4a88efc50@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-03-20 10:14:48 +01:00
Matthieu Baerts (NGI0)	b97d6b6820	mptcp: pm: split netlink and in-kernel init The registration of mptcp_genl_family is useful for both the in-kernel and the userspace PM. It should then be done in pm_netlink.c. On the other hand, the registration of the in-kernel pernet subsystem is specific to the in-kernel PM, and should stay there in pm_kernel.c. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250313-net-next-mptcp-pm-ops-intro-v1-1-f4e4a88efc50@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-03-20 10:14:48 +01:00
Matthieu Baerts (NGI0)	2e7e6e9cda	mptcp: pm: move Netlink PM helpers to pm_netlink.c Before this patch, the PM code was dispersed in different places: - pm.c had common code for all PMs, but also Netlink specific code that will not be needed with the future BPF path-managers. - pm_netlink.c had common Netlink code. To clarify the code, a reorganisation is suggested here, only by moving code around, and small helper renaming to avoid confusions: - pm_netlink.c now only contains common PM Netlink code: - PM events: this code was already there - shared helpers around Netlink code that were already there as well - shared Netlink commands code from pm.c - pm.c now no longer contain Netlink specific code. - protocol.h has been updated accordingly: - mptcp_nl_fill_addr() no longer need to be exported. The code around the PM is now less confusing, which should help for the maintenance in the long term. This will certainly impact future backports, but because other cleanups have already done recently, and more are coming to ease the addition of a new path-manager controlled with BPF (struct_ops), doing that now seems to be a good time. Also, many issues around the PM have been fixed a few months ago while increasing the code coverage in the selftests, so such big reorganisation can be done with more confidence now. No behavioural changes intended. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-15-abef20ada03b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-03-10 13:35:50 -07:00
Matthieu Baerts (NGI0)	e4c28e3d5c	mptcp: pm: move generic PM helpers to pm.c Before this patch, the PM code was dispersed in different places: - pm.c had common code for all PMs - pm_netlink.c was supposed to be about the in-kernel PM, but also had exported common helpers, callbacks used by the different PMs, NL events for PM userspace daemon, etc. quite confusing. - pm_userspace.c had userspace PM only code, but using specific in-kernel PM helpers To clarify the code, a reorganisation is suggested here, only by moving code around, and (un)exporting functions: - helpers used from both PMs and not linked to Netlink - callbacks used by different PMs, e.g. ADD_ADDR management - some helpers have been marked as 'static' - protocol.h has been updated accordingly - (while at it, a needless if before a kfree(), spot by checkpatch in mptcp_remove_anno_list_by_saddr(), has been removed) The code around the PM is now less confusing, which should help for the maintenance in the long term. This will certainly impact future backports, but because other cleanups have already done recently, and more are coming to ease the addition of a new path-manager controlled with BPF (struct_ops), doing that now seems to be a good time. Also, many issues around the PM have been fixed a few months ago while increasing the code coverage in the selftests, so such big reorganisation can be done with more confidence now. No behavioural changes intended. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-13-abef20ada03b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-03-10 13:35:50 -07:00
Matthieu Baerts (NGI0)	a146731272	mptcp: pm: export mptcp_remote_address In a following commit, the 'remote_address' helper will need to be used from different files. It is then exported, and prefixed with 'mptcp_', similar to 'mptcp_local_address'. No behavioural changes intended. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-11-abef20ada03b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-03-10 13:35:50 -07:00

1 2 3 4 5 ...

436 Commits