linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-24 07:03:03 +02:00

Author	SHA1	Message	Date
Kuniyuki Iwashima	f07f4ea53e	mptcp: Initialise rcv_mss before calling tcp_send_active_reset() in mptcp_do_fastclose(). syzbot reported divide-by-zero in __tcp_select_window() by MPTCP socket. [0] We had a similar issue for the bare TCP and fixed in commit `499350a5a6` ("tcp: initialize rcv_mss to TCP_MIN_MSS instead of 0"). Let's apply the same fix to mptcp_do_fastclose(). [0]: Oops: divide error: 0000 [#1] SMP KASAN PTI CPU: 0 UID: 0 PID: 6068 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025 RIP: 0010:__tcp_select_window+0x824/0x1320 net/ipv4/tcp_output.c:3336 Code: ff ff ff 44 89 f1 d3 e0 89 c1 f7 d1 41 01 cc 41 21 c4 e9 a9 00 00 00 e8 ca 49 01 f8 e9 9c 00 00 00 e8 c0 49 01 f8 44 89 e0 99 <f7> 7c 24 1c 41 29 d4 48 bb 00 00 00 00 00 fc ff df e9 80 00 00 00 RSP: 0018:ffffc90003017640 EFLAGS: 00010293 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88807b469e40 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffffc90003017730 R08: ffff888033268143 R09: 1ffff1100664d028 R10: dffffc0000000000 R11: ffffed100664d029 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 FS: 000055557faa0500(0000) GS:ffff888126135000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f64a1912ff8 CR3: 0000000072122000 CR4: 00000000003526f0 Call Trace: <TASK> tcp_select_window net/ipv4/tcp_output.c:281 [inline] __tcp_transmit_skb+0xbc7/0x3aa0 net/ipv4/tcp_output.c:1568 tcp_transmit_skb net/ipv4/tcp_output.c:1649 [inline] tcp_send_active_reset+0x2d1/0x5b0 net/ipv4/tcp_output.c:3836 mptcp_do_fastclose+0x27e/0x380 net/mptcp/protocol.c:2793 mptcp_disconnect+0x238/0x710 net/mptcp/protocol.c:3253 mptcp_sendmsg_fastopen+0x2f8/0x580 net/mptcp/protocol.c:1776 mptcp_sendmsg+0x1774/0x1980 net/mptcp/protocol.c:1855 sock_sendmsg_nosec net/socket.c:727 [inline] __sock_sendmsg+0xe5/0x270 net/socket.c:742 __sys_sendto+0x3bd/0x520 net/socket.c:2244 __do_sys_sendto net/socket.c:2251 [inline] __se_sys_sendto net/socket.c:2247 [inline] __x64_sys_sendto+0xde/0x100 net/socket.c:2247 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f66e998f749 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffff9acedb8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 00007f66e9be5fa0 RCX: 00007f66e998f749 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003 RBP: 00007ffff9acee10 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001 R13: 00007f66e9be5fa0 R14: 00007f66e9be5fa0 R15: 0000000000000006 </TASK> Fixes: `ae15506024` ("mptcp: fix duplicate reset on fastclose") Reported-by: syzbot+3a92d359bc2ec6255a33@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/69260882.a70a0220.d98e3.00b4.GAE@google.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20251125195331.309558-1-kuniyu@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-11-27 13:10:16 +01:00
Paolo Abeni	27fd028601	mptcp: clear scheduled subflows on retransmit When __mptcp_retrans() kicks-in, it schedules one or more subflows for retransmission, but such subflows could be actually left alone if there is no more data to retransmit and/or in case of concurrent fallback. Scheduled subflows could be processed much later in time, i.e. when new data will be transmitted, leading to bad subflow selection. Explicitly clear all scheduled subflows before leaving the retransmission function. Fixes: `ee2708aeda` ("mptcp: use get_retrans wrapper") Cc: stable@vger.kernel.org Reported-by: Filip Pokryvka <fpokryvk@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251125-net-mptcp-clear-sched-rtx-v1-1-1cea4ad2165f@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-26 18:24:35 -08:00
Linus Torvalds	8e621c9a33	Including fixes from IPsec and wireless. Previous releases - regressions: - prevent NULL deref in generic_hwtstamp_ioctl_lower(), newer APIs don't populate all the pointers in the request - phylink: add missing supported link modes for the fixed-link - mptcp: fix false positive warning in mptcp_pm_nl_rm_addr Previous releases - always broken: - openvswitch: remove never-working support for setting NSH fields - xfrm: number of fixes for error paths of xfrm_state creation/ modification/deletion - xfrm: fixes for offload - fix the determination of the protocol of the inner packet - don't push locally generated packets directly to L2 tunnel mode offloading, they still need processing from the standard xfrm path - mptcp: fix a couple of corner cases in fallback and fastclose handling - wifi: rtw89: hw_scan: prevent connections from getting stuck, work around apparent bug in FW by tweaking messages we send - af_unix: fix duplicate data if PEEK w/ peek_offset needs to wait - veth: more robust handing of race to avoid txq getting stuck - eth: ps3_gelic_net: handle skb allocation failures Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmkfQv8ACgkQMUZtbf5S IruptQ//aa3wFISt/xfCxaygNnQAL1gxWbKg15cLnv88S5r5wrh+Mf/r+qHkT0IG qaxPToztcTKNGzRqMI/7x/OEXjCdfEmJtNcb5JfYRxGkYVcje81bhHrxZN+nLLcj jx1xD7rXCjfFNDVFSl+OzQkoTZtfGphX+JgCTKLPZ8t4qR/UhU1IwYgpoaiK0uCT gitXJPHrDGJLHud0ilaXbn+q0ZH9ZU08QSRhsaAq7C6YPafUseEr2fuQYgFr1kkp mtzFm84bEnRSneQ5+noVzoc5llbu3vf3Wd9Y5tr5sBaBjh5OpH/kK0kuPy6Lzhvd 8y05jl8kyASMtRBoTvmpYOdi3xUNbge5AwJN3FIo4KPCmyr1bQoNoQVIivUpoiGz Zvv7QcceP7E5057CRuhi06krld/QaScHkmhUododbqJKmL8NWaSoXLRO7oYjB4kU 1MuAwOJWVpsqnTUEbMw7dHJPZFTqRkLYv6oAmIqb/AWzTcqg9OXjLEc+OzUqWEl6 mHv/SonSjAM81m3GxCxtNtH2ErKwdMIgI/ado39+KvksKTH1bhN9ZpmuyyktzpCj PBn9FVdd7zyct92u5xXtsKVkDWKGyD419Z2lFHC0FaAsNQmPrSqzi+U50pAtie5M JWABm8hsSdia4dTDrdldMgM50dzuBd5nSceY7XCqPA8nZ5+Af/8= =89Xl -----END PGP SIGNATURE----- Merge tag 'net-6.18-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from IPsec and wireless. Previous releases - regressions: - prevent NULL deref in generic_hwtstamp_ioctl_lower(), newer APIs don't populate all the pointers in the request - phylink: add missing supported link modes for the fixed-link - mptcp: fix false positive warning in mptcp_pm_nl_rm_addr Previous releases - always broken: - openvswitch: remove never-working support for setting NSH fields - xfrm: number of fixes for error paths of xfrm_state creation/ modification/deletion - xfrm: fixes for offload - fix the determination of the protocol of the inner packet - don't push locally generated packets directly to L2 tunnel mode offloading, they still need processing from the standard xfrm path - mptcp: fix a couple of corner cases in fallback and fastclose handling - wifi: rtw89: hw_scan: prevent connections from getting stuck, work around apparent bug in FW by tweaking messages we send - af_unix: fix duplicate data if PEEK w/ peek_offset needs to wait - veth: more robust handing of race to avoid txq getting stuck - eth: ps3_gelic_net: handle skb allocation failures" * tag 'net-6.18-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (47 commits) vsock: Ignore signal/timeout on connect() if already established be2net: pass wrb_params in case of OS2BMC l2tp: reset skb control buffer on xmit net: dsa: microchip: lan937x: Fix RGMII delay tuning selftests: mptcp: add a check for 'add_addr_accepted' mptcp: fix address removal logic in mptcp_pm_nl_rm_addr selftests: mptcp: join: userspace: longer timeout selftests: mptcp: join: endpoints: longer timeout selftests: mptcp: join: fastclose: remove flaky marks mptcp: fix duplicate reset on fastclose mptcp: decouple mptcp fastclose from tcp close mptcp: do not fallback when OoO is present mptcp: fix premature close in case of fallback mptcp: avoid unneeded subflow-level drops mptcp: fix ack generation for fallback msk wifi: rtw89: hw_scan: Don't let the operating channel be last net: phylink: add missing supported link modes for the fixed-link selftest: af_unix: Add test for SO_PEEK_OFF. af_unix: Read sk_peek_offset() again after sleeping in unix_stream_read_generic(). net/mlx5: Clean up only new IRQ glue on request_irq() failure ...	2025-11-20 08:52:07 -08:00
Gang Yan	92e239e36d	mptcp: fix address removal logic in mptcp_pm_nl_rm_addr Fix inverted WARN_ON_ONCE condition that prevented normal address removal counter updates. The current code only executes decrement logic when the counter is already 0 (abnormal state), while normal removals (counter > 0) are ignored. Signed-off-by: Gang Yan <yangang@kylinos.cn> Fixes: `6361139185` ("mptcp: pm: remove '_nl' from mptcp_pm_nl_rm_addr_received") Cc: stable@vger.kernel.org Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251118-net-mptcp-misc-fixes-6-18-rc6-v1-10-806d3781c95f@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-19 20:07:15 -08:00
Paolo Abeni	ae15506024	mptcp: fix duplicate reset on fastclose The CI reports sporadic failures of the fastclose self-tests. The root cause is a duplicate reset, not carrying the relevant MPTCP option. In the failing scenario the bad reset is received by the peer before the fastclose one, preventing the reception of the latter. Indeed there is window of opportunity at fastclose time for the following race: mptcp_do_fastclose __mptcp_close_ssk __tcp_close() tcp_set_state() [1] tcp_send_active_reset() [2] After [1] the stack will send reset to in-flight data reaching the now closed port. Such reset may race with [2]. Address the issue explicitly sending a single reset on fastclose before explicitly moving the subflow to close status. Fixes: `d21f834855` ("mptcp: use fastclose on more edge scenarios") Cc: stable@vger.kernel.org Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/596 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251118-net-mptcp-misc-fixes-6-18-rc6-v1-6-806d3781c95f@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-19 20:07:15 -08:00
Paolo Abeni	fff0c87996	mptcp: decouple mptcp fastclose from tcp close With the current fastclose implementation, the mptcp_do_fastclose() helper is in charge of two distinct actions: send the fastclose reset and cleanup the subflows. Formally decouple the two steps, ensuring that mptcp explicitly closes all the subflows after the mentioned helper. This will make the upcoming fix simpler, and allows dropping the 2nd argument from mptcp_destroy_common(). The Fixes tag is then the same as in the next commit to help with the backports. Fixes: `d21f834855` ("mptcp: use fastclose on more edge scenarios") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251118-net-mptcp-misc-fixes-6-18-rc6-v1-5-806d3781c95f@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-19 20:07:15 -08:00
Paolo Abeni	1bba3f219c	mptcp: do not fallback when OoO is present In case of DSS corruption, the MPTCP protocol tries to avoid the subflow reset if fallback is possible. Such corruptions happen in the receive path; to ensure fallback is possible the stack additionally needs to check for OoO data, otherwise the fallback will break the data stream. Fixes: `e32d262c89` ("mptcp: handle consistently DSS corruption") Cc: stable@vger.kernel.org Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/598 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251118-net-mptcp-misc-fixes-6-18-rc6-v1-4-806d3781c95f@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-19 20:07:14 -08:00
Paolo Abeni	17393fa7b7	mptcp: fix premature close in case of fallback I'm observing very frequent self-tests failures in case of fallback when running on a CONFIG_PREEMPT kernel. The root cause is that subflow_sched_work_if_closed() closes any subflow as soon as it is half-closed and has no incoming data pending. That works well for regular subflows - MPTCP needs bi-directional connectivity to operate on a given subflow - but for fallback socket is race prone. When TCP peer closes the connection before the MPTCP one, subflow_sched_work_if_closed() will schedule the MPTCP worker to gracefully close the subflow, and shortly after will do another schedule to inject and process a dummy incoming DATA_FIN. On CONFIG_PREEMPT kernel, the MPTCP worker can kick-in and close the fallback subflow before subflow_sched_work_if_closed() is able to create the dummy DATA_FIN, unexpectedly interrupting the transfer. Address the issue explicitly avoiding closing fallback subflows on when the peer is only half-closed. Note that, when the subflow is able to create the DATA_FIN before the worker invocation, the worker will change the msk state before trying to close the subflow and will skip the latter operation as the msk will not match anymore the precondition in __mptcp_close_subflow(). Fixes: `f09b0ad55a` ("mptcp: close subflow when receiving TCP+FIN") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251118-net-mptcp-misc-fixes-6-18-rc6-v1-3-806d3781c95f@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-19 20:07:14 -08:00
Paolo Abeni	4f102d747c	mptcp: avoid unneeded subflow-level drops The rcv window is shared among all the subflows. Currently, MPTCP sync the TCP-level rcv window with the MPTCP one at tcp_transmit_skb() time. The above means that incoming data may sporadically observe outdated TCP-level rcv window and being wrongly dropped by TCP. Address the issue checking for the edge condition before queuing the data at TCP level, and eventually syncing the rcv window as needed. Note that the issue is actually present from the very first MPTCP implementation, but backports older than the blamed commit below will range from impossible to useless. Before: $ nstat -n; sleep 1; nstat -z TcpExtBeyondWindow TcpExtBeyondWindow 14 0.0 After: $ nstat -n; sleep 1; nstat -z TcpExtBeyondWindow TcpExtBeyondWindow 0 0.0 Fixes: `fa3fe2b150` ("mptcp: track window announced to peer") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251118-net-mptcp-misc-fixes-6-18-rc6-v1-2-806d3781c95f@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-19 20:07:14 -08:00
Paolo Abeni	5e15395f6d	mptcp: fix ack generation for fallback msk mptcp_cleanup_rbuf() needs to know the last most recent, mptcp-level rcv_wnd sent, and such information is tracked into the msk->old_wspace field, updated at ack transmission time by mptcp_write_options(). Fallback socket do not add any mptcp options, such helper is never invoked, and msk->old_wspace value remain stale. That in turn makes ack generation at recvmsg() time quite random. Address the issue ensuring mptcp_write_options() is invoked even for fallback sockets, and just update the needed info in such a case. The issue went unnoticed for a long time, as mptcp currently overshots the fallback socket receive buffer autotune significantly. It is going to change in the near future. Fixes: `e3859603ba` ("mptcp: better msk receive window updates") Cc: stable@vger.kernel.org Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/594 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251118-net-mptcp-misc-fixes-6-18-rc6-v1-1-806d3781c95f@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-19 20:07:14 -08:00
Eric Dumazet	426358d9be	mptcp: fix a race in mptcp_pm_del_add_timer() mptcp_pm_del_add_timer() can call sk_stop_timer_sync(sk, &entry->add_timer) while another might have free entry already, as reported by syzbot. Add RCU protection to fix this issue. Also change confusing add_timer variable with stop_timer boolean. syzbot report: BUG: KASAN: slab-use-after-free in __timer_delete_sync+0x372/0x3f0 kernel/time/timer.c:1616 Read of size 4 at addr ffff8880311e4150 by task kworker/1:1/44 CPU: 1 UID: 0 PID: 44 Comm: kworker/1:1 Not tainted syzkaller #0 PREEMPT_{RT,(full)} Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/02/2025 Workqueue: events mptcp_worker Call Trace: <TASK> dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0xca/0x240 mm/kasan/report.c:482 kasan_report+0x118/0x150 mm/kasan/report.c:595 __timer_delete_sync+0x372/0x3f0 kernel/time/timer.c:1616 sk_stop_timer_sync+0x1b/0x90 net/core/sock.c:3631 mptcp_pm_del_add_timer+0x283/0x310 net/mptcp/pm.c:362 mptcp_incoming_options+0x1357/0x1f60 net/mptcp/options.c:1174 tcp_data_queue+0xca/0x6450 net/ipv4/tcp_input.c:5361 tcp_rcv_established+0x1335/0x2670 net/ipv4/tcp_input.c:6441 tcp_v4_do_rcv+0x98b/0xbf0 net/ipv4/tcp_ipv4.c:1931 tcp_v4_rcv+0x252a/0x2dc0 net/ipv4/tcp_ipv4.c:2374 ip_protocol_deliver_rcu+0x221/0x440 net/ipv4/ip_input.c:205 ip_local_deliver_finish+0x3bb/0x6f0 net/ipv4/ip_input.c:239 NF_HOOK+0x30c/0x3a0 include/linux/netfilter.h:318 NF_HOOK+0x30c/0x3a0 include/linux/netfilter.h:318 __netif_receive_skb_one_core net/core/dev.c:6079 [inline] __netif_receive_skb+0x143/0x380 net/core/dev.c:6192 process_backlog+0x31e/0x900 net/core/dev.c:6544 __napi_poll+0xb6/0x540 net/core/dev.c:7594 napi_poll net/core/dev.c:7657 [inline] net_rx_action+0x5f7/0xda0 net/core/dev.c:7784 handle_softirqs+0x22f/0x710 kernel/softirq.c:622 __do_softirq kernel/softirq.c:656 [inline] __local_bh_enable_ip+0x1a0/0x2e0 kernel/softirq.c:302 mptcp_pm_send_ack net/mptcp/pm.c:210 [inline] mptcp_pm_addr_send_ack+0x41f/0x500 net/mptcp/pm.c:-1 mptcp_pm_worker+0x174/0x320 net/mptcp/pm.c:1002 mptcp_worker+0xd5/0x1170 net/mptcp/protocol.c:2762 process_one_work kernel/workqueue.c:3263 [inline] process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3346 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3427 kthread+0x711/0x8a0 kernel/kthread.c:463 ret_from_fork+0x4bc/0x870 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 </TASK> Allocated by task 44: kasan_save_stack mm/kasan/common.c:56 [inline] kasan_save_track+0x3e/0x80 mm/kasan/common.c:77 poison_kmalloc_redzone mm/kasan/common.c:400 [inline] __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:417 kasan_kmalloc include/linux/kasan.h:262 [inline] __kmalloc_cache_noprof+0x1ef/0x6c0 mm/slub.c:5748 kmalloc_noprof include/linux/slab.h:957 [inline] mptcp_pm_alloc_anno_list+0x104/0x460 net/mptcp/pm.c:385 mptcp_pm_create_subflow_or_signal_addr+0xf9d/0x1360 net/mptcp/pm_kernel.c:355 mptcp_pm_nl_fully_established net/mptcp/pm_kernel.c:409 [inline] __mptcp_pm_kernel_worker+0x417/0x1ef0 net/mptcp/pm_kernel.c:1529 mptcp_pm_worker+0x1ee/0x320 net/mptcp/pm.c:1008 mptcp_worker+0xd5/0x1170 net/mptcp/protocol.c:2762 process_one_work kernel/workqueue.c:3263 [inline] process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3346 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3427 kthread+0x711/0x8a0 kernel/kthread.c:463 ret_from_fork+0x4bc/0x870 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 Freed by task 6630: kasan_save_stack mm/kasan/common.c:56 [inline] kasan_save_track+0x3e/0x80 mm/kasan/common.c:77 __kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:587 kasan_save_free_info mm/kasan/kasan.h:406 [inline] poison_slab_object mm/kasan/common.c:252 [inline] __kasan_slab_free+0x5c/0x80 mm/kasan/common.c:284 kasan_slab_free include/linux/kasan.h:234 [inline] slab_free_hook mm/slub.c:2523 [inline] slab_free mm/slub.c:6611 [inline] kfree+0x197/0x950 mm/slub.c:6818 mptcp_remove_anno_list_by_saddr+0x2d/0x40 net/mptcp/pm.c:158 mptcp_pm_flush_addrs_and_subflows net/mptcp/pm_kernel.c:1209 [inline] mptcp_nl_flush_addrs_list net/mptcp/pm_kernel.c:1240 [inline] mptcp_pm_nl_flush_addrs_doit+0x593/0xbb0 net/mptcp/pm_kernel.c:1281 genl_family_rcv_msg_doit+0x215/0x300 net/netlink/genetlink.c:1115 genl_family_rcv_msg net/netlink/genetlink.c:1195 [inline] genl_rcv_msg+0x60e/0x790 net/netlink/genetlink.c:1210 netlink_rcv_skb+0x208/0x470 net/netlink/af_netlink.c:2552 genl_rcv+0x28/0x40 net/netlink/genetlink.c:1219 netlink_unicast_kernel net/netlink/af_netlink.c:1320 [inline] netlink_unicast+0x846/0xa10 net/netlink/af_netlink.c:1346 netlink_sendmsg+0x805/0xb30 net/netlink/af_netlink.c:1896 sock_sendmsg_nosec net/socket.c:727 [inline] __sock_sendmsg+0x21c/0x270 net/socket.c:742 ____sys_sendmsg+0x508/0x820 net/socket.c:2630 ___sys_sendmsg+0x21f/0x2a0 net/socket.c:2684 __sys_sendmsg net/socket.c:2716 [inline] __do_sys_sendmsg net/socket.c:2721 [inline] __se_sys_sendmsg net/socket.c:2719 [inline] __x64_sys_sendmsg+0x1a1/0x260 net/socket.c:2719 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f Cc: stable@vger.kernel.org Fixes: `00cfd77b90` ("mptcp: retransmit ADD_ADDR when timeout") Reported-by: syzbot+2a6fbf0f0530375968df@syzkaller.appspotmail.com Closes: https://lore.kernel.org/691ad3c3.a70a0220.f6df1.0004.GAE@google.com Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Geliang Tang <geliang@kernel.org> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251117100745.1913963-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-18 18:33:01 -08:00
Eric Dumazet	035bca3f01	mptcp: fix race condition in mptcp_schedule_work() syzbot reported use-after-free in mptcp_schedule_work() [1] Issue here is that mptcp_schedule_work() schedules a work, then gets a refcount on sk->sk_refcnt if the work was scheduled. This refcount will be released by mptcp_worker(). [A] if (schedule_work(...)) { [B] sock_hold(sk); return true; } Problem is that mptcp_worker() can run immediately and complete before [B] We need instead : sock_hold(sk); if (schedule_work(...)) return true; sock_put(sk); [1] refcount_t: addition on 0; use-after-free. WARNING: CPU: 1 PID: 29 at lib/refcount.c:25 refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:25 Call Trace: <TASK> __refcount_add include/linux/refcount.h:-1 [inline] __refcount_inc include/linux/refcount.h:366 [inline] refcount_inc include/linux/refcount.h:383 [inline] sock_hold include/net/sock.h:816 [inline] mptcp_schedule_work+0x164/0x1a0 net/mptcp/protocol.c:943 mptcp_tout_timer+0x21/0xa0 net/mptcp/protocol.c:2316 call_timer_fn+0x17e/0x5f0 kernel/time/timer.c:1747 expire_timers kernel/time/timer.c:1798 [inline] __run_timers kernel/time/timer.c:2372 [inline] __run_timer_base+0x648/0x970 kernel/time/timer.c:2384 run_timer_base kernel/time/timer.c:2393 [inline] run_timer_softirq+0xb7/0x180 kernel/time/timer.c:2403 handle_softirqs+0x22f/0x710 kernel/softirq.c:622 __do_softirq kernel/softirq.c:656 [inline] run_ktimerd+0xcf/0x190 kernel/softirq.c:1138 smpboot_thread_fn+0x542/0xa60 kernel/smpboot.c:160 kthread+0x711/0x8a0 kernel/kthread.c:463 ret_from_fork+0x4bc/0x870 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 Cc: stable@vger.kernel.org Fixes: `3b1d6210a9` ("mptcp: implement and use MPTCP-level retransmission") Reported-by: syzbot+355158e7e301548a1424@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/6915b46f.050a0220.3565dc.0028.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251113103924.3737425-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-14 18:12:35 -08:00
Jiayuan Chen	c77b3b79a9	mptcp: Fix proto fallback detection with BPF The sockmap feature allows bpf syscall from userspace, or based on bpf sockops, replacing the sk_prot of sockets during protocol stack processing with sockmap's custom read/write interfaces. ''' tcp_rcv_state_process() syn_recv_sock()/subflow_syn_recv_sock() tcp_init_transfer(BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB) bpf_skops_established <== sockops bpf_sock_map_update(sk) <== call bpf helper tcp_bpf_update_proto() <== update sk_prot ''' When the server has MPTCP enabled but the client sends a TCP SYN without MPTCP, subflow_syn_recv_sock() performs a fallback on the subflow, replacing the subflow sk's sk_prot with the native sk_prot. ''' subflow_syn_recv_sock() subflow_ulp_fallback() subflow_drop_ctx() mptcp_subflow_ops_undo_override() ''' Then, this subflow can be normally used by sockmap, which replaces the native sk_prot with sockmap's custom sk_prot. The issue occurs when the user executes accept::mptcp_stream_accept::mptcp_fallback_tcp_ops(). Here, it uses sk->sk_prot to compare with the native sk_prot, but this is incorrect when sockmap is used, as we may incorrectly set sk->sk_socket->ops. This fix uses the more generic sk_family for the comparison instead. Additionally, this also prevents a WARNING from occurring: result from ./scripts/decode_stacktrace.sh: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 337 at net/mptcp/protocol.c:68 mptcp_stream_accept \ (net/mptcp/protocol.c:4005) Modules linked in: ... PKRU: 55555554 Call Trace: <TASK> do_accept (net/socket.c:1989) __sys_accept4 (net/socket.c:2028 net/socket.c:2057) __x64_sys_accept (net/socket.c:2067) x64_sys_call (arch/x86/entry/syscall_64.c:41) do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) RIP: 0033:0x7f87ac92b83d ---[ end trace 0000000000000000 ]--- Fixes: `0b4f33def7` ("mptcp: fix tcp fallback crash") Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Cc: <stable@vger.kernel.org> Link: https://patch.msgid.link/20251111060307.194196-3-jiayuan.chen@linux.dev	2025-11-13 13:11:15 -08:00
Jiayuan Chen	fbade4bd08	mptcp: Disallow MPTCP subflows from sockmap The sockmap feature allows bpf syscall from userspace, or based on bpf sockops, replacing the sk_prot of sockets during protocol stack processing with sockmap's custom read/write interfaces. ''' tcp_rcv_state_process() subflow_syn_recv_sock() tcp_init_transfer(BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB) bpf_skops_established <== sockops bpf_sock_map_update(sk) <== call bpf helper tcp_bpf_update_proto() <== update sk_prot ''' Consider two scenarios: 1. When the server has MPTCP enabled and the client also requests MPTCP, the sk passed to the BPF program is a subflow sk. Since subflows only handle partial data, replacing their sk_prot is meaningless and will cause traffic disruption. 2. When the server has MPTCP enabled but the client sends a TCP SYN without MPTCP, subflow_syn_recv_sock() performs a fallback on the subflow, replacing the subflow sk's sk_prot with the native sk_prot. ''' subflow_ulp_fallback() subflow_drop_ctx() mptcp_subflow_ops_undo_override() ''' Subsequently, accept::mptcp_stream_accept::mptcp_fallback_tcp_ops() converts the subflow to plain TCP. For the first case, we should prevent it from being combined with sockmap by setting sk_prot->psock_update_sk_prot to NULL, which will be blocked by sockmap's own flow. For the second case, since subflow_syn_recv_sock() has already restored sk_prot to native tcp_prot/tcpv6_prot, no further action is needed. Fixes: `cec37a6e41` ("mptcp: Handle MP_CAPABLE options for outgoing connections") Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Cc: <stable@vger.kernel.org> Link: https://patch.msgid.link/20251111060307.194196-2-jiayuan.chen@linux.dev	2025-11-13 09:15:41 -08:00
Paolo Abeni	fe11dfa109	mptcp: zero window probe mib Explicitly account for MPTCP-level zero windows probe, to catch hopefully earlier issues alike the one addressed by the previous patch. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Tested-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251028-net-mptcp-send-timeout-v1-4-38ffff5a9ec8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 17:44:28 -07:00
Paolo Abeni	a824084b98	mptcp: restore window probe Since commit `72377ab2d6` ("mptcp: more conservative check for zero probes") the MPTCP-level zero window probe check is always disabled, as the TCP-level write queue always contains at least the newly allocated skb. Refine the relevant check tacking in account that the above condition and that such skb can have zero length. Fixes: `72377ab2d6` ("mptcp: more conservative check for zero probes") Cc: stable@vger.kernel.org Reported-by: Geliang Tang <geliang@kernel.org> Closes: https://lore.kernel.org/d0a814c364e744ca6b836ccd5b6e9146882e8d42.camel@kernel.org Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Tested-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251028-net-mptcp-send-timeout-v1-3-38ffff5a9ec8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 17:44:28 -07:00
Paolo Abeni	8e04ce45a8	mptcp: fix MSG_PEEK stream corruption If a MSG_PEEK \| MSG_WAITALL read operation consumes all the bytes in the receive queue and recvmsg() need to waits for more data - i.e. it's a blocking one - upon arrival of the next packet the MPTCP protocol will start again copying the oldest data present in the receive queue, corrupting the data stream. Address the issue explicitly tracking the peeked sequence number, restarting from the last peeked byte. Fixes: `ca4fb89257` ("mptcp: add MSG_PEEK support") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251028-net-mptcp-send-timeout-v1-2-38ffff5a9ec8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 17:44:28 -07:00
Paolo Abeni	27b0e701d3	mptcp: drop bogus optimization in __mptcp_check_push() Accessing the transmit queue without owning the msk socket lock is inherently racy, hence __mptcp_check_push() could actually quit early even when there is pending data. That in turn could cause unexpected tx lock and timeout. Dropping the early check avoids the race, implicitly relaying on later tests under the relevant lock. With such change, all the other mptcp_send_head() call sites are now under the msk socket lock and we can additionally drop the now unneeded annotation on the transmit head pointer accesses. Fixes: `6e628cd3a8` ("mptcp: use mptcp release_cb for delayed tasks") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251028-net-mptcp-send-timeout-v1-1-38ffff5a9ec8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 17:44:28 -07:00
Eric Dumazet	aa251c8463	tcp: fix too slow tcp_rcvbuf_grow() action While the blamed commits apparently avoided an overshoot, they also limited how fast a sender can increase BDP at each RTT. This is not exactly a revert, we do not add the 16 * tp->advmss cushion we had, and we are keeping the out_of_order_queue contribution. Do the same in mptcp_rcvbuf_grow(). Tested: emulated 50ms rtt (tcp_stream --tcp-tx-delay 50000), cubic 20 second flow. net.ipv4.tcp_rmem set to "4096 131072 67000000" perf record -a -e tcp:tcp_rcvbuf_grow sleep 20 perf script Before: We can see we fail to roughly double RWIN at each RTT. Sender is RWIN limited while CWND is ramping up (before getting tcp_wmem limited). tcp_stream 33793 [010] 825.717525: tcp:tcp_rcvbuf_grow: time=100869 rtt_us=50428 copied=49152 inq=0 space=40960 ooo=0 scaling_ratio=219 rcvbuf=131072 rcv_ssthresh=103970 window_clamp=112128 rcv_wnd=106496 tcp_stream 33793 [010] 825.768966: tcp:tcp_rcvbuf_grow: time=51447 rtt_us=50362 copied=86016 inq=0 space=49152 ooo=0 scaling_ratio=219 rcvbuf=131072 rcv_ssthresh=107474 window_clamp=112128 rcv_wnd=106496 tcp_stream 33793 [010] 825.821539: tcp:tcp_rcvbuf_grow: time=52577 rtt_us=50243 copied=114688 inq=0 space=86016 ooo=0 scaling_ratio=219 rcvbuf=201096 rcv_ssthresh=167377 window_clamp=172031 rcv_wnd=167936 tcp_stream 33793 [010] 825.871781: tcp:tcp_rcvbuf_grow: time=50248 rtt_us=50237 copied=167936 inq=0 space=114688 ooo=0 scaling_ratio=219 rcvbuf=268129 rcv_ssthresh=224722 window_clamp=229375 rcv_wnd=225280 tcp_stream 33793 [010] 825.922475: tcp:tcp_rcvbuf_grow: time=50698 rtt_us=50183 copied=241664 inq=0 space=167936 ooo=0 scaling_ratio=219 rcvbuf=392617 rcv_ssthresh=331217 window_clamp=335871 rcv_wnd=323584 tcp_stream 33793 [010] 825.973326: tcp:tcp_rcvbuf_grow: time=50855 rtt_us=50213 copied=339968 inq=0 space=241664 ooo=0 scaling_ratio=219 rcvbuf=564986 rcv_ssthresh=478674 window_clamp=483327 rcv_wnd=462848 tcp_stream 33793 [010] 826.023970: tcp:tcp_rcvbuf_grow: time=50647 rtt_us=50248 copied=491520 inq=0 space=339968 ooo=0 scaling_ratio=219 rcvbuf=794811 rcv_ssthresh=671778 window_clamp=679935 rcv_wnd=651264 tcp_stream 33793 [010] 826.074612: tcp:tcp_rcvbuf_grow: time=50648 rtt_us=50227 copied=700416 inq=0 space=491520 ooo=0 scaling_ratio=219 rcvbuf=1149124 rcv_ssthresh=974881 window_clamp=983039 rcv_wnd=942080 tcp_stream 33793 [010] 826.125452: tcp:tcp_rcvbuf_grow: time=50845 rtt_us=50225 copied=987136 inq=8192 space=700416 ooo=0 scaling_ratio=219 rcvbuf=1637502 rcv_ssthresh=1392674 window_clamp=1400831 rcv_wnd=1339392 tcp_stream 33793 [010] 826.175698: tcp:tcp_rcvbuf_grow: time=50250 rtt_us=50198 copied=1347584 inq=0 space=978944 ooo=0 scaling_ratio=219 rcvbuf=2288672 rcv_ssthresh=1949729 window_clamp=1957887 rcv_wnd=1945600 tcp_stream 33793 [010] 826.225947: tcp:tcp_rcvbuf_grow: time=50252 rtt_us=50240 copied=1945600 inq=0 space=1347584 ooo=0 scaling_ratio=219 rcvbuf=3150516 rcv_ssthresh=2687010 window_clamp=2695167 rcv_wnd=2691072 tcp_stream 33793 [010] 826.276175: tcp:tcp_rcvbuf_grow: time=50233 rtt_us=50224 copied=2691072 inq=0 space=1945600 ooo=0 scaling_ratio=219 rcvbuf=4548617 rcv_ssthresh=3883041 window_clamp=3891199 rcv_wnd=3887104 tcp_stream 33793 [010] 826.326403: tcp:tcp_rcvbuf_grow: time=50233 rtt_us=50229 copied=3887104 inq=0 space=2691072 ooo=0 scaling_ratio=219 rcvbuf=6291456 rcv_ssthresh=5370482 window_clamp=5382144 rcv_wnd=5373952 tcp_stream 33793 [010] 826.376723: tcp:tcp_rcvbuf_grow: time=50323 rtt_us=50218 copied=5373952 inq=0 space=3887104 ooo=0 scaling_ratio=219 rcvbuf=9087658 rcv_ssthresh=7755537 window_clamp=7774207 rcv_wnd=7757824 tcp_stream 33793 [010] 826.426991: tcp:tcp_rcvbuf_grow: time=50274 rtt_us=50196 copied=7757824 inq=180224 space=5373952 ooo=0 scaling_ratio=219 rcvbuf=12563759 rcv_ssthresh=10729233 window_clamp=10747903 rcv_wnd=10575872 tcp_stream 33793 [010] 826.477229: tcp:tcp_rcvbuf_grow: time=50241 rtt_us=50078 copied=10731520 inq=180224 space=7577600 ooo=0 scaling_ratio=219 rcvbuf=17715667 rcv_ssthresh=15136529 window_clamp=15155199 rcv_wnd=14983168 tcp_stream 33793 [010] 826.527482: tcp:tcp_rcvbuf_grow: time=50258 rtt_us=50153 copied=15138816 inq=360448 space=10551296 ooo=0 scaling_ratio=219 rcvbuf=24667870 rcv_ssthresh=21073410 window_clamp=21102591 rcv_wnd=20766720 tcp_stream 33793 [010] 826.577712: tcp:tcp_rcvbuf_grow: time=50234 rtt_us=50228 copied=21073920 inq=0 space=14778368 ooo=0 scaling_ratio=219 rcvbuf=34550339 rcv_ssthresh=29517041 window_clamp=29556735 rcv_wnd=29519872 tcp_stream 33793 [010] 826.627982: tcp:tcp_rcvbuf_grow: time=50275 rtt_us=50220 copied=29519872 inq=540672 space=21073920 ooo=0 scaling_ratio=219 rcvbuf=49268707 rcv_ssthresh=42090625 window_clamp=42147839 rcv_wnd=41627648 tcp_stream 33793 [010] 826.678274: tcp:tcp_rcvbuf_grow: time=50296 rtt_us=50185 copied=42053632 inq=761856 space=28979200 ooo=0 scaling_ratio=219 rcvbuf=67000000 rcv_ssthresh=57238168 window_clamp=57316406 rcv_wnd=56606720 tcp_stream 33793 [010] 826.728627: tcp:tcp_rcvbuf_grow: time=50357 rtt_us=50128 copied=43913216 inq=851968 space=41291776 ooo=0 scaling_ratio=219 rcvbuf=67000000 rcv_ssthresh=57290728 window_clamp=57316406 rcv_wnd=56524800 tcp_stream 33793 [010] 827.131364: tcp:tcp_rcvbuf_grow: time=50239 rtt_us=50127 copied=43843584 inq=655360 space=43061248 ooo=0 scaling_ratio=219 rcvbuf=67000000 rcv_ssthresh=57290728 window_clamp=57316406 rcv_wnd=56696832 tcp_stream 33793 [010] 827.181613: tcp:tcp_rcvbuf_grow: time=50254 rtt_us=50115 copied=43843584 inq=524288 space=43188224 ooo=0 scaling_ratio=219 rcvbuf=67000000 rcv_ssthresh=57290728 window_clamp=57316406 rcv_wnd=56807424 tcp_stream 33793 [010] 828.339635: tcp:tcp_rcvbuf_grow: time=50283 rtt_us=50110 copied=43843584 inq=458752 space=43319296 ooo=0 scaling_ratio=219 rcvbuf=67000000 rcv_ssthresh=57290728 window_clamp=57316406 rcv_wnd=56864768 tcp_stream 33793 [010] 828.440350: tcp:tcp_rcvbuf_grow: time=50404 rtt_us=50099 copied=43843584 inq=393216 space=43384832 ooo=0 scaling_ratio=219 rcvbuf=67000000 rcv_ssthresh=57290728 window_clamp=57316406 rcv_wnd=56922112 tcp_stream 33793 [010] 829.195106: tcp:tcp_rcvbuf_grow: time=50154 rtt_us=50077 copied=43843584 inq=196608 space=43450368 ooo=0 scaling_ratio=219 rcvbuf=67000000 rcv_ssthresh=57290728 window_clamp=57316406 rcv_wnd=57090048 After: It takes few steps to increase RWIN. Sender is no longer RWIN limited. tcp_stream 50826 [010] 935.634212: tcp:tcp_rcvbuf_grow: time=100788 rtt_us=50315 copied=49152 inq=0 space=40960 ooo=0 scaling_ratio=219 rcvbuf=131072 rcv_ssthresh=103970 window_clamp=112128 rcv_wnd=106496 tcp_stream 50826 [010] 935.685642: tcp:tcp_rcvbuf_grow: time=51437 rtt_us=50361 copied=86016 inq=0 space=49152 ooo=0 scaling_ratio=219 rcvbuf=160875 rcv_ssthresh=132969 window_clamp=137623 rcv_wnd=131072 tcp_stream 50826 [010] 935.738299: tcp:tcp_rcvbuf_grow: time=52660 rtt_us=50256 copied=139264 inq=0 space=86016 ooo=0 scaling_ratio=219 rcvbuf=502741 rcv_ssthresh=411497 window_clamp=430079 rcv_wnd=413696 tcp_stream 50826 [010] 935.788544: tcp:tcp_rcvbuf_grow: time=50249 rtt_us=50233 copied=307200 inq=0 space=139264 ooo=0 scaling_ratio=219 rcvbuf=728690 rcv_ssthresh=618717 window_clamp=623371 rcv_wnd=618496 tcp_stream 50826 [010] 935.838796: tcp:tcp_rcvbuf_grow: time=50258 rtt_us=50202 copied=618496 inq=0 space=307200 ooo=0 scaling_ratio=219 rcvbuf=2450338 rcv_ssthresh=1855709 window_clamp=2096187 rcv_wnd=1859584 tcp_stream 50826 [010] 935.889140: tcp:tcp_rcvbuf_grow: time=50347 rtt_us=50166 copied=1261568 inq=0 space=618496 ooo=0 scaling_ratio=219 rcvbuf=4376503 rcv_ssthresh=3725291 window_clamp=3743961 rcv_wnd=3706880 tcp_stream 50826 [010] 935.939435: tcp:tcp_rcvbuf_grow: time=50300 rtt_us=50185 copied=2478080 inq=24576 space=1261568 ooo=0 scaling_ratio=219 rcvbuf=9082648 rcv_ssthresh=7733731 window_clamp=7769921 rcv_wnd=7692288 tcp_stream 50826 [010] 935.989681: tcp:tcp_rcvbuf_grow: time=50251 rtt_us=50221 copied=4915200 inq=114688 space=2453504 ooo=0 scaling_ratio=219 rcvbuf=16574936 rcv_ssthresh=14108110 window_clamp=14179339 rcv_wnd=14024704 tcp_stream 50826 [010] 936.039967: tcp:tcp_rcvbuf_grow: time=50289 rtt_us=50279 copied=9830400 inq=114688 space=4800512 ooo=0 scaling_ratio=219 rcvbuf=32695050 rcv_ssthresh=27896187 window_clamp=27969593 rcv_wnd=27815936 tcp_stream 50826 [010] 936.090172: tcp:tcp_rcvbuf_grow: time=50211 rtt_us=50200 copied=19841024 inq=114688 space=9715712 ooo=0 scaling_ratio=219 rcvbuf=67000000 rcv_ssthresh=57245176 window_clamp=57316406 rcv_wnd=57163776 tcp_stream 50826 [010] 936.140430: tcp:tcp_rcvbuf_grow: time=50262 rtt_us=50197 copied=39501824 inq=114688 space=19726336 ooo=0 scaling_ratio=219 rcvbuf=67000000 rcv_ssthresh=57245176 window_clamp=57316406 rcv_wnd=57163776 tcp_stream 50826 [010] 936.190527: tcp:tcp_rcvbuf_grow: time=50101 rtt_us=50071 copied=43655168 inq=262144 space=39387136 ooo=0 scaling_ratio=219 rcvbuf=67000000 rcv_ssthresh=57259192 window_clamp=57316406 rcv_wnd=57032704 tcp_stream 50826 [010] 936.240719: tcp:tcp_rcvbuf_grow: time=50197 rtt_us=50057 copied=43843584 inq=262144 space=43393024 ooo=0 scaling_ratio=219 rcvbuf=67000000 rcv_ssthresh=57259192 window_clamp=57316406 rcv_wnd=57032704 tcp_stream 50826 [010] 936.341271: tcp:tcp_rcvbuf_grow: time=50297 rtt_us=50123 copied=43843584 inq=131072 space=43581440 ooo=0 scaling_ratio=219 rcvbuf=67000000 rcv_ssthresh=57259192 window_clamp=57316406 rcv_wnd=57147392 tcp_stream 50826 [010] 936.642503: tcp:tcp_rcvbuf_grow: time=50131 rtt_us=50084 copied=43843584 inq=0 space=43712512 ooo=0 scaling_ratio=219 rcvbuf=67000000 rcv_ssthresh=57259192 window_clamp=57316406 rcv_wnd=57262080 Fixes: `65c5287892` ("tcp: fix sk_rcvbuf overshoot") Fixes: `e118cdc34d` ("mptcp: rcvbuf auto-tuning improvement") Reported-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/589 Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20251028-net-tcp-recv-autotune-v3-4-74b43ba4c84c@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 17:30:19 -07:00
Eric Dumazet	b1e014a1f3	tcp: add newval parameter to tcp_rcvbuf_grow() This patch has no functional change, and prepares the following one. tcp_rcvbuf_grow() will need to have access to tp->rcvq_space.space old and new values. Change mptcp_rcvbuf_grow() in a similar way. Signed-off-by: Eric Dumazet <edumazet@google.com> [ Moved 'oldval' declaration to the next patch to avoid warnings at build time. ] Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20251028-net-tcp-recv-autotune-v3-3-74b43ba4c84c@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 17:30:19 -07:00
Paolo Abeni	a6f0459aad	mptcp: fix subflow rcvbuf adjust The mptcp PM can add subflow to the conn_list before tcp_init_transfer(). Calling tcp_rcvbuf_grow() on such subflow is not correct as later init will overwrite the update. Fix the issue calling tcp_rcvbuf_grow() only after init buffer initialization. Fixes: `e118cdc34d` ("mptcp: rcvbuf auto-tuning improvement") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251028-net-tcp-recv-autotune-v3-1-74b43ba4c84c@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 17:30:18 -07:00
Matthieu Baerts (NGI0)	e84cb860ac	mptcp: pm: in-kernel: C-flag: handle late ADD_ADDR The special C-flag case expects the ADD_ADDR to be received when switching to 'fully-established'. But for various reasons, the ADD_ADDR could be sent after the "4th ACK", and the special case doesn't work. On NIPA, the new test validating this special case for the C-flag failed a few times, e.g. 102 default limits, server deny join id 0 syn rx [FAIL] got 0 JOIN[s] syn rx expected 2 Server ns stats (...) MPTcpExtAddAddrTx 1 MPTcpExtEchoAdd 1 Client ns stats (...) MPTcpExtAddAddr 1 MPTcpExtEchoAddTx 1 synack rx [FAIL] got 0 JOIN[s] synack rx expected 2 ack rx [FAIL] got 0 JOIN[s] ack rx expected 2 join Rx [FAIL] see above syn tx [FAIL] got 0 JOIN[s] syn tx expected 2 join Tx [FAIL] see above I had a suspicion about what the issue could be: the ADD_ADDR might have been received after the switch to the 'fully-established' state. The issue was not easy to reproduce. The packet capture shown that the ADD_ADDR can indeed be sent with a delay, and the client would not try to establish subflows to it as expected. A simple fix is not to mark the endpoints as 'used' in the C-flag case, when looking at creating subflows to the remote initial IP address and port. In this case, there is no need to try. Note: newly added fullmesh endpoints will still continue to be used as expected, thanks to the conditions behind mptcp_pm_add_addr_c_flag_case. Fixes: `4b1ff850e0` ("mptcp: pm: in-kernel: usable client side with C-flag") Cc: stable@vger.kernel.org Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251020-net-mptcp-c-flag-late-add-addr-v1-1-8207030cb0e8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-21 17:36:45 -07:00
Paolo Abeni	59701b1870	mptcp: minor move_skbs_to_msk() cleanup Such function is called only by __mptcp_data_ready(), which in turn is always invoked when msk is not owned by the user: we can drop the redundant, related check. Additionally mptcp needs to propagate the socket error only for current subflow. Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250927-net-next-mptcp-rcv-path-imp-v1-7-5da266aa9c1a@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-29 18:23:36 -07:00
Paolo Abeni	68c7af988b	mptcp: factor out a basic skb coalesce helper The upcoming patch will introduced backlog processing for MPTCP socket, and we want to leverage coalescing in such data path. Factor out the relevant bits not touching memory accounting to deal with such use-case. Co-developed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250927-net-next-mptcp-rcv-path-imp-v1-6-5da266aa9c1a@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-29 18:23:35 -07:00
Paolo Abeni	c4ebc4ee4e	mptcp: remove unneeded mptcp_move_skb() Since commit `b7535cfed2` ("mptcp: drop legacy code around RX EOF"), sk_shutdown can't change during the main recvmsg loop, we can drop the related race breaker. Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250927-net-next-mptcp-rcv-path-imp-v1-5-5da266aa9c1a@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-29 18:23:35 -07:00
Paolo Abeni	9a0afe0db4	mptcp: introduce the mptcp_init_skb helper Factor out all the skb initialization step in a new helper and use it. Note that this change moves the MPTCP CB initialization earlier: we can do such step as soon as the skb leaves the subflow socket receive queues. Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250927-net-next-mptcp-rcv-path-imp-v1-4-5da266aa9c1a@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-29 18:23:35 -07:00
Paolo Abeni	e118cdc34d	mptcp: rcvbuf auto-tuning improvement Apply to the MPTCP auto-tuning the same improvements introduced for the TCP protocol by the merge commit `2da35e4b4d` ("Merge branch 'tcp-receive-side-improvements'"). The main difference is that TCP subflow and the main MPTCP socket need to account separately for OoO: MPTCP does not care for TCP-level OoO and vice versa, as a consequence do not reflect MPTCP-level rcvbuf increase due to OoO packets at the subflow level. This refeactor additionally allow dropping the msk receive buffer update at receive time, as the latter only intended to cope with subflow receive buffer increase due to OoO packets. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/487 Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/559 Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250927-net-next-mptcp-rcv-path-imp-v1-3-5da266aa9c1a@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-29 18:23:35 -07:00
Paolo Abeni	9aa59323f2	mptcp: leverage skb deferral free Usage of the skb deferral API is straight-forward; with multiple subflows actives this allow moving part of the received application load into multiple CPUs. Also fix a typo in the related comment. Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250927-net-next-mptcp-rcv-path-imp-v1-1-5da266aa9c1a@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-29 18:23:34 -07:00
Matthieu Baerts (NGI0)	539f6b9de3	mptcp: pm: in-kernel: add laminar endpoints Currently, upon the reception of an ADD_ADDR (and when the fullmesh flag is not used), the in-kernel PM will create new subflows using the local address the routing configuration will pick. It would be easier to pick local addresses from a selected list of endpoints, and use it only once, than relying on routing rules. Use case: both the client (C) and the server (S) have two addresses (a and b). The client establishes the connection between C(a) and S(a). Once established, the server announces its additional address S(b). Once received, the client connects to it using its second address C(b). Compared to a situation without the 'laminar' endpoint for C(b), the client didn't use this address C(b) to establish a subflow to the server's primary address S(a). So at the end, we have: C S C(a) --- S(a) C(b) --- S(b) In case of a 3rd address on each side (C(c) and S(c)), upon the reception of an ADD_ADDR with S(c), the client should not pick C(b) because it has already been used. C(c) should then be used. Note that this situation is currently possible if C doesn't add any endpoint, but configure the routing in order to pick C(b) for the route to S(b), and pick C(c) for the route to S(c). That doesn't sound very practical because it means knowing in advance the IP addresses that will be used and announced by the server. 'laminar', like the idea of laminar flows: the different subflows don't mix with each other on an endpoint, unlike the "turbulent" way traffic is mixed by 'fullmesh'. In the code, the new endpoint type is added. Similar to the other subflow types, an MPTCP_INFO counter is added. While at it, hole are now commented in struct mptcp_info, to remember next time that these holes can no longer be used. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/503 Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250925-net-next-mptcp-c-flag-laminar-v1-15-ad126cc47c6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-26 17:44:05 -07:00
Matthieu Baerts (NGI0)	f596293314	mptcp: pm: in-kernel: compare IDs instead of addresses When receiving an ADD_ADDR right after the 3WHS, the connection will switch to 'fully established'. It means the MPTCP worker will be called to treat two events, in this order: ADD_ADDR_RECEIVED, PM_ESTABLISHED. The MPTCP endpoints cannot have the ID 0, because it is reserved to the address and port used by the initial subflow. To be able to deal with this case in different places, msk->mpc_endpoint_id contains the endpoint ID linked to the initial subflow. This variable was only set when treating the first PM_ESTABLISHED event, after ADD_ADDR_RECEIVED. That's why in fill_local_addresses_vec(), the endpoint addresses were compared with the one of the initial subflow, instead of only comparing the IDs. Instead, msk->mpc_endpoint_id is now set when treating ADD_ADDR_RECEIVED as well, if needed, then the IDs can be compared. To be able to do so, the code doing that is now in a dedicated helper, and called from the functions linked to the two actions. While at it, mptcp_endp_get_local_id() has also been moved up, next to this new helper, because they are linked, and to be able to use it in fill_local_addresses_vec() in the next commit. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250925-net-next-mptcp-c-flag-laminar-v1-14-ad126cc47c6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-26 17:44:05 -07:00
Matthieu Baerts (NGI0)	4984fe6254	mptcp: pm: in-kernel: reduce pernet struct size All the 'unsigned int' variables from the 'pm_nl_pernet' structure are bounded to MPTCP_PM_ADDR_MAX, currently set to 8. The endpoint ID is also bounded by the protocol to 8-bit. MPTCP_PM_ADDR_MAX, if extended later, will never over 8-bit. So no need to use 'unsigned int' variables, 'u8' is enough. Note that the exposed counters in MPTCP_INFO are already limited to 8-bit, same for pm->extra_subflows, and others. So it seems even better to limit them to 8-bit. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250925-net-next-mptcp-c-flag-laminar-v1-13-ad126cc47c6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-26 17:44:05 -07:00
Matthieu Baerts (NGI0)	db9a0e3858	mptcp: pm: in-kernel: remove stale_loss_cnt It is currently not used. It was in fact never used since its introduction in commit `ff5a0b421c` ("mptcp: faster active backup recovery"). It was probably initially added to struct pm_nl_pernet during the development of this commit, before being added to struct mptcp_pernet in ctrl.c, but not removed from the first place. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250925-net-next-mptcp-c-flag-laminar-v1-12-ad126cc47c6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-26 17:44:05 -07:00
Matthieu Baerts (NGI0)	e9aa044f4a	mptcp: pm: in-kernel: rename 'addrs' to 'endpoints' A few variables linked to the in-kernel Path-Manager are confusing, and it would help current and future developers, to clarify them. One of them is 'addrs', which in fact represents the number of declared endpoints, and not only the 'signal' endpoints. No functional changes intended. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250925-net-next-mptcp-c-flag-laminar-v1-11-ad126cc47c6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-26 17:44:05 -07:00
Matthieu Baerts (NGI0)	35e71e43a5	mptcp: pm: in-kernel: rename 'local_addr_list' to 'endp_list' A few variables linked to the in-kernel Path-Manager are confusing, and it would help current and future developers, to clarify them. One of them is 'local_addr_list', which in fact represents the list of endpoints, and not only the 'subflow' endpoints. No functional changes intended. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250925-net-next-mptcp-c-flag-laminar-v1-10-ad126cc47c6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-26 17:44:05 -07:00
Matthieu Baerts (NGI0)	e7757b6d3a	mptcp: pm: in-kernel: rename 'local_addr_max' to 'endp_subflow_max' A few variables linked to the in-kernel Path-Manager are confusing, and it would help current and future developers, to clarify them. One of them is 'local_addr_max', which in fact represents the maximum number of 'subflow' endpoints that can be used to create new subflows, and not the number of local addresses that have been used to create subflows. While at it, add an additional name for the corresponding variable in MPTCP INFO: mptcpi_endp_subflow_max. Not to break the current uAPI, the new name is added as a 'define' pointing to the former name. This will then also help userspace devs. Also move the variable and function next to the other 'endp_X_max' ones. No functional changes intended. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250925-net-next-mptcp-c-flag-laminar-v1-9-ad126cc47c6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-26 17:44:05 -07:00
Matthieu Baerts (NGI0)	37712d84df	mptcp: pm: in-kernel: rename 'add_addr_accept_max' to 'limit_add_addr_accepted' A few variables linked to the in-kernel Path-Manager are confusing, and it would help current and future developers, to clarify them. One of them is 'add_addr_accept_max', which in fact represents the limit of ADD_ADDR that can be accepted: the limit set via 'ip mptcp limit add_addr_accepted X' for example. It is not linked to the maximum number of accepted ADD_ADDR. While at it, add an additional name for the corresponding variable in MPTCP INFO: mptcpi_limit_add_addr_accepted. Not to break the current uAPI, the new name is added as a 'define' pointing to the former name. This will then also help userspace devs. No functional changes intended. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250925-net-next-mptcp-c-flag-laminar-v1-8-ad126cc47c6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-26 17:44:04 -07:00
Matthieu Baerts (NGI0)	45cae57066	mptcp: pm: in-kernel: rename 'add_addr_signal_max' to 'endp_signal_max' A few variables linked to the in-kernel Path-Manager are confusing, and it would help current and future developers, to clarify them. One of them is 'add_addr_signal_max', which in fact represents the maximum number of 'signal' endpoints that can be used to announced addresses, and not the number of ADD_ADDR that can be signalled. While at it, add an additional name for the corresponding variable in MPTCP INFO: mptcpi_endp_signal_max. Not to break the current uAPI, the new name is added as a 'define' pointing to the former name. This will then also help userspace devs. No functional changes intended. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250925-net-next-mptcp-c-flag-laminar-v1-7-ad126cc47c6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-26 17:44:04 -07:00
Matthieu Baerts (NGI0)	3eb3c9a959	mptcp: pm: in-kernel: rename 'subflows_max' to 'limit_extra_subflows' A few variables linked to the in-kernel Path-Manager are confusing, and it would help current and future developers, to clarify them. One of them is 'subflows_max', which in fact represents the limit of extra subflows: the limit set via 'ip mptcp limit subflows X' for example. It is not linked to the maximum number of created / possible subflows. While at it, add an additional name for the corresponding variable in MPTCP INFO: mptcpi_limit_extra_subflows. Not to break the current uAPI, the new name is added as a 'define' pointing to the former name. This will then also help userspace devs. No functional changes intended. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250925-net-next-mptcp-c-flag-laminar-v1-6-ad126cc47c6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-26 17:44:04 -07:00
Matthieu Baerts (NGI0)	c5273f6ca1	mptcp: pm: rename 'subflows' to 'extra_subflows' A few variables linked to the Path-Managers are confusing, and it would help current and future developers, to clarify them. One of them is 'subflows', which in fact represents the number of extra subflows: all the additional subflows created after the initial one, and not the total number of subflows. While at it, add an additional name for the corresponding variable in MPTCP INFO: mptcpi_extra_subflows. Not to break the current uAPI, the new name is added as a 'define' pointing to the former name. This will then also help userspace devs. No functional changes intended. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250925-net-next-mptcp-c-flag-laminar-v1-5-ad126cc47c6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-26 17:44:04 -07:00
Matthieu Baerts (NGI0)	a845b2bbf2	mptcp: pm: in-kernel: refactor fill_remote_addresses_vec Before this modification, this function was quite long with many levels of indentations. Each case can be split in a dedicated function: fullmesh, non-fullmesh. To remove one level of indentation, msk->pm.subflows >= subflows_max is now checked after having added one subflow, and stops the loop if it is no longer possible to add new subflows. This is fine to do this because this function should only be called if msk->pm.subflows < subflows_max. No functional changes intended. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250925-net-next-mptcp-c-flag-laminar-v1-4-ad126cc47c6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-26 17:44:04 -07:00
Matthieu Baerts (NGI0)	8dc63ade45	mptcp: pm: in-kernel: refactor fill_local_addresses_vec Before this modification, this function was quite long with many levels of indentations. Each case can be split in a dedicated function: fullmesh, C flag, any. No functional changes intended. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250925-net-next-mptcp-c-flag-laminar-v1-3-ad126cc47c6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-26 17:44:04 -07:00
Matthieu Baerts (NGI0)	4b1ff850e0	mptcp: pm: in-kernel: usable client side with C-flag When servers set the C-flag in their MP_CAPABLE to tell clients not to create subflows to the initial address and port, clients will likely not use their other endpoints. That's because the in-kernel path-manager uses the 'subflow' endpoints to create subflows only to the initial address and port. If the limits have not been modified to accept ADD_ADDR, the client doesn't try to establish new subflows. If the limits accept ADD_ADDR, the routing routes will be used to select the source IP. The C-flag is typically set when the server is operating behind a legacy Layer 4 load balancer, or using anycast IP address. Clients having their different 'subflow' endpoints setup, don't end up creating multiple subflows as expected, and causing some deployment issues. A special case is then added here: when servers set the C-flag in the MPC and directly sends an ADD_ADDR, this single ADD_ADDR is accepted. The 'subflows' endpoints will then be used with this new remote IP and port. This exception is only allowed when the ADD_ADDR is sent immediately after the 3WHS, and makes the client switching to the 'fully established' mode. After that, 'select_local_address()' will not be able to find any subflows, because 'id_avail_bitmap' will be filled in mptcp_pm_create_subflow_or_signal_addr(), when switching to 'fully established' mode. Fixes: `df377be387` ("mptcp: add deny_join_id0 in mptcp_options_received") Cc: stable@vger.kernel.org Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/536 Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250925-net-next-mptcp-c-flag-laminar-v1-1-ad126cc47c6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-26 17:44:03 -07:00
Matthieu Baerts (NGI0)	1be5b82c45	mptcp: remove unused returned value of check_data_fin When working on a fix modifying mptcp_check_data_fin(), I noticed the returned value was no longer used. It looks like it was used for 3 days, between commit `7ed90803a2` ("mptcp: send explicit ack on delayed ack_seq incr") and commit `ea4ca586b1` ("mptcp: refine MPTCP-level ack scheduling"). This returned value can be safely removed. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250919-net-next-mptcp-server-side-flag-v1-6-a97a5d561a8b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-22 11:51:25 -07:00
Matthieu Baerts (NGI0)	c8bc168f5f	mptcp: pm: netlink: deprecate server-side attribute Now that such info is in the 'flags' attribute, it is time to deprecate the dedicated 'server-side' attribute. It will be removed in a few versions. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250919-net-next-mptcp-server-side-flag-v1-3-a97a5d561a8b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-22 11:51:24 -07:00
Matthieu Baerts (NGI0)	3d7ae91107	mptcp: pm: netlink: announce server-side flag Now that the 'flags' attribute is used, it seems interesting to add one flag for 'server-side', a boolean value. This is duplicating the info from the dedicated 'server-side' attribute, but it will be deprecated in the next commit, and removed in a few versions. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250919-net-next-mptcp-server-side-flag-v1-2-a97a5d561a8b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-22 11:51:24 -07:00
Matthieu Baerts (NGI0)	c9809f03c1	mptcp: pm: netlink: only add server-side attr when true This attribute is a boolean. No need to add it to set it to 'false'. Indeed, the default value when this attribute is not set is naturally 'false'. A few bytes can then be saved by not adding this attribute if the connection is not on the server side. This prepares the future deprecation of its attribute, in favour of a new flag. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250919-net-next-mptcp-server-side-flag-v1-1-a97a5d561a8b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-22 11:51:24 -07:00
Matthieu Baerts (NGI0)	833d4313bc	mptcp: reset blackhole on success with non-loopback ifaces When a first MPTCP connection gets successfully established after a blackhole period, 'active_disable_times' was supposed to be reset when this connection was done via any non-loopback interfaces. Unfortunately, the opposite condition was checked: only reset when the connection was established via a loopback interface. Fixing this by simply looking at the opposite. This is similar to what is done with TCP FastOpen, see tcp_fastopen_active_disable_ofo_check(). This patch is a follow-up of a previous discussion linked to commit `893c49a78d` ("mptcp: Use __sk_dst_get() and dst_dev_rcu() in mptcp_active_enable()."), see [1]. Fixes: `27069e7cb3` ("mptcp: disable active MPTCP in case of blackhole") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/4209a283-8822-47bd-95b7-87e96d9b7ea3@kernel.org [1] Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250918-net-next-mptcp-blackhole-reset-loopback-v1-1-bf5818326639@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-19 07:06:19 -07:00
Jakub Kicinski	f2cdc4c22b	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.17-rc7). No conflicts. Adjacent changes: drivers/net/ethernet/mellanox/mlx5/core/en/fs.h `9536fbe10c` ("net/mlx5e: Add PSP steering in local NIC RX") `7601a0a462` ("net/mlx5e: Add a miss level for ipsec crypto offload") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-18 11:26:06 -07:00
Kuniyuki Iwashima	893c49a78d	mptcp: Use __sk_dst_get() and dst_dev_rcu() in mptcp_active_enable(). mptcp_active_enable() is called from subflow_finish_connect(), which is icsk->icsk_af_ops->sk_rx_dst_set() and it's not always under RCU. Using sk_dst_get(sk)->dev could trigger UAF. Let's use __sk_dst_get() and dst_dev_rcu(). Fixes: `27069e7cb3` ("mptcp: disable active MPTCP in case of blackhole") Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916214758.650211-8-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:10:22 -07:00
Kuniyuki Iwashima	108a86c71c	mptcp: Call dst_release() in mptcp_active_enable(). mptcp_active_enable() calls sk_dst_get(), which returns dst with its refcount bumped, but forgot dst_release(). Let's add missing dst_release(). Cc: stable@vger.kernel.org Fixes: `27069e7cb3` ("mptcp: disable active MPTCP in case of blackhole") Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916214758.650211-7-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:10:22 -07:00

1 2 3 4 5 ...

1359 Commits