linux

mirror of https://github.com/torvalds/linux.git synced 2026-07-27 17:47:41 +02:00

Author	SHA1	Message	Date
Stefano Garzarella	2a12c05aef	vsock/virtio: collapse receive queue under memory pressure When many small packets accumulate in the receive queue, the skb overhead can exceed buf_alloc even while the payload is within bounds. This causes virtio_transport_inc_rx_pkt() to reject packets, leading to connection resets during large transfers under backpressure. The issue was reported by Brien, who has a reproducer, but it is also easily reproducible with iperf-vsock [1] using a small packet size: iperf3 --vsock -c $CID -l 129 which fails immediately when commit `059b7dbd20` ("vsock/virtio: fix potential unbounded skb queue") is applied without this patch. Inspired by TCP's tcp_collapse() which solves a similar problem, add virtio_transport_collapse_rx_queue() that walks the receive queue and re-copies data into compact linear skbs to reduce the overhead. The collapse is triggered proactively when the number of queued skbs is close to exceeding the overhead budget. A pre-scan counts the eligible bytes to size each allocation precisely, avoiding waste for isolated small packets. Partially consumed skbs are kept as-is to preserve buf_used/fwd_cnt accounting, EOM-marked skbs to maintain SEQPACKET message boundaries, and skbs already larger than the collapse target because they already have a good data-to-overhead ratio. Walking a large queue may take a significant amount of time and cache misses, causing traffic burstiness. To limit this, the collapse stops once enough room is freed for this packet and the next one, but may opportunistically free more to fill each collapsed skb to capacity. 
[1] https://github.com/stefano-garzarella/iperf-vsock Fixes: `059b7dbd20` ("vsock/virtio: fix potential unbounded skb queue") Cc: stable@vger.kernel.org Reported-by: Brien Oberstein <brienpub@gmail.com> Closes: https://lore.kernel.org/netdev/618701dd023e$063de350$12b9a9f0$@gmail.com/ Tested-by: Brien Oberstein <brienpub@gmail.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260708102904.50732-2-sgarzare@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-07-21 12:11:06 +02:00
Raf Dickson	27fc25bb82	vsock: fold sk_acceptq_removed() into vsock_remove_pending() Callers of vsock_remove_pending() must also call sk_acceptq_removed() to keep sk_ack_backlog consistent. Move the call into vsock_remove_pending() itself to make it automatic and prevent future callers from forgetting it. Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Raf Dickson <rafdog35@gmail.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260612045216.105796-5-rafdog35@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-13 10:39:26 -07:00
Raf Dickson	6f6f9b65a9	vsock: fold sk_acceptq_added() into vsock_enqueue_accept() virtio and hyperv call sk_acceptq_added() immediately before vsock_enqueue_accept(). Move the call into vsock_enqueue_accept() itself so callers cannot forget it and the accounting is consistent. Suggested-by: Paolo Abeni <pabeni@redhat.com> Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Raf Dickson <rafdog35@gmail.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260612045216.105796-4-rafdog35@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-13 10:38:28 -07:00
Raf Dickson	a6fd2cfdcd	vsock: fold sk_acceptq_added() into vsock_add_pending() Move sk_acceptq_added() into vsock_add_pending() so callers cannot forget it. vmci is the only transport using the pending list and is updated accordingly. Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Raf Dickson <rafdog35@gmail.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260612045216.105796-3-rafdog35@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-13 10:38:28 -07:00
Raf Dickson	77eee18939	vsock: introduce vsock_pending_to_accept() helper Add vsock_pending_to_accept() to move a socket directly from the pending list to the accept queue in a single operation, avoiding the sock_put/sock_hold dance and the sk_acceptq_removed()/ sk_acceptq_added() pair that would otherwise be needed when calling vsock_remove_pending() followed by vsock_enqueue_accept(). Use it in vmci_transport_recv_connecting_server() where a completed handshake transitions the socket from pending to accept queue. Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Raf Dickson <rafdog35@gmail.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260612045216.105796-2-rafdog35@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-13 10:38:27 -07:00
Raf Dickson	4ff2e84ff1	vsock: use sk_acceptq_is_full() helper in all transports Replace the open-coded backlog check with sk_acceptq_is_full(). The helper uses > instead of >=, which is the correct comparison per commit `64a146513f` ("[NET]: Revert incorrect accept queue backlog changes."), and adds READ_ONCE() for proper memory ordering. Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Raf Dickson <rafdog35@gmail.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Link: https://patch.msgid.link/20260612045842.122207-1-rafdog35@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-13 10:36:37 -07:00
Jakub Kicinski	8d72997dab	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-7.1-rc7). Silent conflicts: net/wireless/nl80211.c `cb9959ab5f` ("wifi: cfg80211: enforce HE/EHT cap/oper consistency") `a384ae9699` ("wifi: cfg80211: move AP HT/VHT/... operation to beacon info") https://lore.kernel.org/aiGJDaHV4UlCexIQ@sirena.org.uk Conflicts: drivers/net/wireless/intel/iwlwifi/mld/ap.c `a342c99cb7` ("wifi: iwlwifi: mld: honor BSS_CHANGED_BEACON_ENABLED") `9bf1b409af` ("wifi: iwlwifi: mld: send tx power constraints before link activation") https://lore.kernel.org/ah2bfedhV45ZxMO8@sirena.org.uk drivers/net/wireless/intel/iwlwifi/pcie/drv.c `093305d801` ("wifi: iwlwifi: pcie: simplify the resume flow if fast resume is not used") `e2323929a6` ("wifi: iwlwifi: pcie: add debug print for resume flow if powered off") https://lore.kernel.org/ah2bfedhV45ZxMO8@sirena.org.uk Adjacent changes: drivers/net/ethernet/airoha/airoha_eth.c `b38cae85d1` ("net: airoha: Fix use-after-free in metadata dst teardown") `ec6c391bcc` ("net: airoha: Introduce airoha_gdm_dev struct") drivers/net/ethernet/microchip/lan743x_main.c `8173d22b21` ("net: lan743x: permit VLAN-tagged packets up to configured MTU") `e3c6508a46` ("net: lan743x: avoid netdev-based logging before netdev registration") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-06-04 15:29:04 -07:00
Raf Dickson	c05fa14db4	vsock/vmci: fix sk_ack_backlog leak on failed handshake When vmci_transport_recv_connecting_server() returns an error, vmci_transport_recv_listen() calls vsock_remove_pending() but never calls sk_acceptq_removed(). This leaves sk_ack_backlog incremented permanently. Repeated handshake failures (malformed packets, queue pair alloc failure, event subscribe failure) cause sk_ack_backlog to climb toward sk_max_ack_backlog. Once it reaches the limit the listener permanently refuses all new connections with -ECONNREFUSED, a silent denial of service requiring a process restart to recover. The two existing sk_acceptq_removed() calls in af_vsock.c do not cover this path: line 764 checks vsock_is_pending() which returns false after vsock_remove_pending(), and line 1889 is only reached on successful accept(). Fix by balancing sk_acceptq_added() with sk_acceptq_removed() on the error path. Fixes: `d021c34405` ("VSOCK: Introduce VM Sockets") Cc: stable@vger.kernel.org Signed-off-by: Raf Dickson <rafdog35@gmail.com> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20260526104356.469928-1-rafdog35@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-06-04 13:08:02 +02:00
Paolo Abeni	c2c0486c56	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Conflicts: drivers/net/ethernet/microsoft/mana/mana_en.c: `17bfe0a8c0` ("net: mana: Add NULL guards in teardown path to prevent panic on attach failure") `d07efe5a6e` ("net: mana: Use per-queue allocation for tx_qp to reduce allocation size") Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-06-01 13:42:12 +02:00
Jingguo Tan	1e584c304c	vsock/virtio: bind uarg before filling zerocopy skb virtio_transport_send_pkt_info() allocates or reuses the zerocopy uarg before entering the send loop, but virtio_transport_alloc_skb() still fills the skb before it inherits that uarg. When fixed-buffer vectored zerocopy hits MAX_SKB_FRAGS, io_sg_from_iter() may partially attach managed frags and return -EMSGSIZE. The rollback path calls kfree_skb() to free an skb that carries SKBFL_MANAGED_FRAG_REFS but no uarg, so skb_release_data() falls through to ordinary frag unref. Pass the uarg into virtio_transport_alloc_skb() and bind it immediately before virtio_transport_fill_skb(). This keeps control or no-payload skbs untouched while ensuring success and rollback share one lifetime rule. Fixes: `581512a6dc` ("vsock/virtio: MSG_ZEROCOPY flag support") Signed-off-by: Lin Ma <malin89@huawei.com> Signed-off-by: Rongzhen Cui <cuirongzhen@huawei.com> Signed-off-by: Jingguo Tan <tanjingguo@huawei.com> Acked-by: Arseniy Krasnov <avkrasnov@salutedevices.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20260527023301.1075581-1-malin89@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-29 12:38:00 -07:00
Stefano Garzarella	e03f0b53b4	Revert "vsock/virtio: fix skb overhead overflow on 32-bit builds" This reverts commit `4157501b9a` ("vsock/virtio: fix skb overhead overflow on 32-bit builds"). The fix was semantically correct (although it would have been better to use mul_u32_u32(), as David pointed out), but in practice we are estimating the memory used to allocate the SKBs, and this will never cause a 32-bit variable to overflow on a 32-bit system, since the memory would have run out long before that. On 64-bit, SKB_TRUESIZE() already evaluates to size_t, so the multiplication is already in 64-bit arithmetic without the cast. Let's revert this to avoid unnecessary 64-bit multiplies on the per-packet receive path on 32-bit systems. Reported-by: David Laight <david.laight.linux@gmail.com> Closes: https://lore.kernel.org/netdev/20260523173557.5cc4f4f6@pumpkin Suggested-by: "Michael S. Tsirkin" <mst@redhat.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: David Laight <david.laight.linux@gmail.com> Link: https://patch.msgid.link/20260527171046.130211-1-sgarzare@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-28 14:44:21 -07:00
Jakub Kicinski	d44646fc9e	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-7.1-rc6). Conflicts: drivers/net/phy/air_en8811h.c `d895767c33` ("net: phy: air_en8811h: add AN8811HB MCU assert/deassert support") `dddfadd751` ("net: phy: Add Airoha phy library for shared code") `5226bb6634` ("net: phy: air_phy_lib: Factorize BuckPBus register accessors") `e08f0ea6da` ("net: phy: Rename Airoha common BuckPBus register accessors") net/sched/sch_netem.c `a2f6ed7b48` ("net/sched: netem: add per-impairment extended statistics") `9552b11e3e` ("net/sched: fix packet loop on netem when duplicate is on") Adjacent changes: drivers/dpll/zl3073x/core.c `c1224569ce` ("dpll: zl3073x: make frequency monitor a per-device attribute") `54e65df8cf` ("dpll: zl3073x: report FFO as DPLL vs input reference offset") net/iucv/af_iucv.c `347fdd4df8` ("af_iucv: convert to getsockopt_iter") `3589d20a66` ("net/iucv: fix locking in .getsockopt") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-28 14:02:21 -07:00
Stefano Garzarella	4157501b9a	vsock/virtio: fix skb overhead overflow on 32-bit builds On 32-bit architectures, both skb_queue_len() and SKB_TRUESIZE(0) evaluate to 32-bit values. The multiplication can overflow before being assigned to the u64 skb_overhead variable, making the skb overhead check ineffective. Cast skb_queue_len() to u64 so the multiplication is always performed in 64-bit arithmetic. This issue was reported by Sashiko while reviewing another patch. Fixes: `059b7dbd20` ("vsock/virtio: fix potential unbounded skb queue") Closes: https://sashiko.dev/#/patchset/20260518090656.134588-1-sgarzare%40redhat.com Cc: stable@vger.kernel.org Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://patch.msgid.link/20260521124732.125771-1-sgarzare@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-22 19:05:10 -07:00
Ziyu Zhang	aae9d8a552	vsock: keep poll shutdown state consistent vsock_poll() reads vsk->peer_shutdown before taking the socket lock to set EPOLLHUP and EPOLLRDHUP, then reads it again after taking the lock to report EOF readability. A shutdown packet can update peer_shutdown while poll is waiting for the lock, so one poll invocation can report EOF readability without the corresponding HUP/RDHUP bits. For connectible sockets, take one peer_shutdown snapshot after lock_sock() and use it for all peer-shutdown-derived poll bits. For datagram sockets, which do not take lock_sock() in poll(), take one lockless READ_ONCE() snapshot and pair it with WRITE_ONCE() on the writer side. This keeps the peer-shutdown-derived bits internally consistent for each poll pass. Fixes: `d021c34405` ("VSOCK: Introduce VM Sockets") Signed-off-by: Ziyu Zhang <ziyuzhang201@gmail.com> Link: https://patch.msgid.link/20260519165636.62542-1-ziyuzhang201@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-22 11:27:57 -07:00
Jakub Kicinski	6a20b34fe3	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-7.1-rc5). No conflicts, adjacent changes: drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c `cc199cd1b9` ("net/mlx5e: Reduce branches in napi poll") `c326f9c689` ("net/mlx5e: xsk: Fix unlocked writing to ICOSQ") drivers/net/ethernet/mellanox/mlx5/core/eswitch.c `c6df9a65cb` ("net/mlx5: Skip disabled vports when setting max TX speed") `1fba57c914` ("net/mlx5: Add VHCA_ID page management mode support") net/mac80211/mlme.c `a6e6ccd5bd` ("wifi: mac80211: consume only present negotiated TTLM maps") `49e62ec6eb` ("wifi: mac80211: move frame RX handling to type files") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-21 15:09:02 -07:00
Stefano Garzarella	c6087c5aaa	vsock/virtio: fix skb overhead accounting to preserve full buf_alloc After commit `059b7dbd20` ("vsock/virtio: fix potential unbounded skb queue"), virtio_transport_inc_rx_pkt() subtracts per-skb overhead from buf_alloc when checking whether a new packet fits. This reduces the effective receive buffer below what the user configured via SO_VM_SOCKETS_BUFFER_SIZE, causing legitimate data packets to be silently dropped and applications that rely on the full buffer size to deadlock. Also, the reduced space is not communicated to the remote peer, so its credit calculation assumes more credit than the receiver will actually accept, causing data loss (there is no retransmission). With this approach we currently have failures in tools/testing/vsock/vsock_test.c. Test 18 sometimes fails, while test 22 always fails in this way: 18 - SOCK_STREAM MSG_ZEROCOPY...hash mismatch 22 - SOCK_STREAM virtio credit update + SO_RCVLOWAT...send failed: Resource temporarily unavailable Fix by allowing at most `buf_alloc * 2` as the total budget for payload plus skb overhead in virtio_transport_inc_rx_pkt(), similar to how SO_RCVBUF is doubled to reserve space for sk_buff metadata. This preserves the full buf_alloc for payload under normal operation, while still bounding the skb queue growth. With this patch, all tests in tools/testing/vsock/vsock_test.c are now passing again. Fixes: `059b7dbd20` ("vsock/virtio: fix potential unbounded skb queue") Cc: stable@vger.kernel.org Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20260518090656.134588-3-sgarzare@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-21 13:14:01 +02:00
Stefano Garzarella	a4f0b00178	vsock/virtio: reset connection on receiving queue overflow When there is no more space to queue an incoming packet, the packet is silently dropped. This causes data loss without any notification to either peer, since there is no retransmission. Under normal circumstances, this should never happen. However, it could happen if the other peer doesn't respect the credit, or if the skb overhead, which we recently began to take into account with commit `059b7dbd20` ("vsock/virtio: fix potential unbounded skb queue"), is too high. Fix this by resetting the connection and setting the local socket error to ENOBUFS when virtio_transport_recv_enqueue() can no longer queue a packet, so both peers are explicitly notified of the failure rather than silently losing data. Fixes: `ae6fcfbf5f` ("vsock/virtio: discard packets if credit is not respected") Cc: stable@vger.kernel.org Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20260518090656.134588-2-sgarzare@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-21 13:14:01 +02:00
Minh Nguyen	99e22ddf4e	vsock/vmci: fix UAF when peer resets connection during handshake vmci_transport_recv_connecting_server() returned err = 0 for a peer RST in its default switch arm: err = pkt->type == VMCI_TRANSPORT_PACKET_TYPE_RST ? 0 : -EINVAL; That made vmci_transport_recv_listen() skip vsock_remove_pending(), leaving the pending socket on the listener's pending_links with sk_state = TCP_CLOSE while destroy: still dropped the explicit reference taken before schedule_delayed_work(). One second later vsock_pending_work() observed is_pending=true and performed full cleanup: vsock_remove_pending() then the two trailing sock_put(sk) calls -- the first reached refcount 0 and __sk_freed the socket, and the second wrote into the freed object: BUG: KASAN: slab-use-after-free in refcount_warn_saturate Write of size 4 at addr ffff88800b1cac80 by task kworker Workqueue: events vsock_pending_work Treat peer RST like any other unexpected packet type (err = -EINVAL). All destroy: arms now return err < 0, so vmci_transport_recv_listen() removes pending from pending_links synchronously and vsock_pending_work() takes the is_pending=false / !rejected branch, dropping only its own work reference. This also closes the multi-packet race Sashiko reported on v2: pending is removed from the list before any subsequent packet can find it. The pre-existing sk_acceptq_removed() gap on the err < 0 path of vmci_transport_recv_listen() that Sashiko also noted is not introduced or changed by this patch. Tested on lts-6.12.79 with KASAN: 52/100 unpatched -> 0/100 patched. Fixes: `d021c34405` ("VSOCK: Introduce VM Sockets") Cc: stable@vger.kernel.org Signed-off-by: Minh Nguyen <minhnguyen.080505@gmail.com> Acked-by: Bryan Tan <bryan-bt.tan@broadcom.com> Link: https://patch.msgid.link/20260519102310.237181-1-minhnguyen.080505@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-20 19:11:18 -07:00
Stefano Garzarella	ae38d91791	vsock/virtio: fix zerocopy completion for multi-skb sends When a large message is fragmented into multiple skbs, the zerocopy uarg is only allocated and attached to the last skb in the loop. Non-final skbs carry pinned user pages with no completion tracking, so the kernel has no way to notify userspace when those pages are safe to reuse. If the loop breaks early the uarg is never allocated at all, leaking pinned pages with no completion notification. Fix this by following the approach used by TCP: allocate the zerocopy uarg (if not provided by the caller) before the send loop and attach it to every skb via skb_zcopy_set(), which takes a reference per skb. Each skb's completion properly decrements the refcount, and the notification only fires after the last skb is freed. On failure, if no data was sent, the uarg is cleanly aborted via net_zcopy_put_abort(). This issue was initially discovered by sashiko while reviewing commit `1cb36e2522` ("vsock/virtio: fix MSG_ZEROCOPY pinned-pages accounting") but was pre-existing. Fixes: `581512a6dc` ("vsock/virtio: MSG_ZEROCOPY flag support") Closes: https://sashiko.dev/#/patchset/20260420132051.217589-1-sgarzare%40redhat.com Reported-by: Maher Azzouzi <maherazz04@gmail.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Arseniy Krasnov <avkrasnov@salutedevices.com> Link: https://patch.msgid.link/20260514092948.268720-1-sgarzare@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-15 17:38:15 -07:00
Jakub Kicinski	878492af7d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-7.1-rc4). No conflicts, or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-14 10:08:06 -07:00
Stefano Garzarella	3a3e3d90cb	vsock/virtio: fix empty payload in tap skb for non-linear buffers For non-linear skbs, virtio_transport_build_skb() goes through virtio_transport_copy_nonlinear_skb() to copy the original payload in the new skb to be delivered to the vsockmon tap device. This manually initializes an iov_iter but does not set iov_iter.count. Since the iov_iter is zero-initialized, the copy length is zero and no payload is actually copied to the monitor interface, leaving data un-initialized. Fix this by removing the linear vs non-linear split and using skb_copy_datagram_iter() with iov_iter_kvec() for all cases, as vhost-vsock already does. This handles both linear and non-linear skbs, properly initializes the iov_iter, and removes the now unused virtio_transport_copy_nonlinear_skb(). While touching this code, let's also check the return value of skb_copy_datagram_iter(), even though it's unlikely to fail. Fixes: `4b0bf10eb0` ("vsock/virtio: non-linear skb handling for tap") Reported-by: Yiqi Sun <sunyiqixm@gmail.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Arseniy Krasnov <avkrasnov@rulkc.org> Link: https://patch.msgid.link/20260508164411.261440-3-sgarzare@redhat.com Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-12 12:52:15 +02:00
Stefano Garzarella	5f344d809e	vsock/virtio: fix length and offset in tap skb for split packets virtio_transport_build_skb() builds a new skb to be delivered to the vsockmon tap device. To build the new skb, it uses the original skb data length as payload length, but as the comment notes, the original packet stored in the skb may have been split in multiple packets, so we need to use the length in the header, which is correctly updated before the packet is delivered to the tap, and the offset for the data. This was also similar to what we did before commit `71dc9ec9ac` ("virtio/vsock: replace virtio_vsock_pkt with sk_buff") where we probably missed something during the skb conversion. Also update the comment above, which was left stale by the skb conversion and still mentioned a buffer pointer that no longer exists. Fixes: `71dc9ec9ac` ("virtio/vsock: replace virtio_vsock_pkt with sk_buff") Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Arseniy Krasnov <avkrasnov@rulkc.org> Link: https://patch.msgid.link/20260508164411.261440-2-sgarzare@redhat.com Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-05-12 12:52:15 +02:00
Jakub Kicinski	6a4c4656b0	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-7.1-rc3). Conflicts: net/ipv4/igmp.c `726fa7da2d` ("ipv4: igmp: get rid of IGMPV3_{QQIC,MRC} and simplify calculation") `c6bebaa744` ("ipv4: igmp: annotate data-races in igmp_heard_query()") https://lore.kernel.org/a7365e4873340f7a5e30411207de3bf9@kernel.org Adjacent changes: net/psp/psp_main.c `30cb24f97d` ("psp: strip variable-length PSP header in psp_dev_rcv()") `c2b22277ad` ("psp: validate IPv4 header fields in psp_dev_rcv()") net/sched/sch_fq_codel.c `f83e07b292` ("net/sched: sch_fq_codel: annotate data-races from fq_codel_dump_class_stats()") `3f3aa77ff1` ("net/sched: add qstats_cpu_drop_inc() helper") net/wireless/pmsr.c `0f3c0a1973` ("wifi: nl80211: fix NL80211_PMSR_FTM_REQ_ATTR_FTMS_PER_BURST usage") `410aa47fd9` ("wifi: cfg80211: allow suppressing FTM result reporting for PD requests") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-07 11:19:07 -07:00
Eric Dumazet	059b7dbd20	vsock/virtio: fix potential unbounded skb queue virtio_transport_inc_rx_pkt() checks vvs->rx_bytes + len > vvs->buf_alloc. virtio_transport_recv_enqueue() skips coalescing for packets with VIRTIO_VSOCK_SEQ_EOM. If fed with packets with len == 0 and VIRTIO_VSOCK_SEQ_EOM, a very large number of packets can be queued because vvs->rx_bytes stays at 0. Fix this by estimating the skb metadata size: (Number of skbs in the queue) * SKB_TRUESIZE(0) Fixes: `0777061657` ("virtio/vsock: don't use skbuff state to account credit") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Stefano Garzarella <sgarzare@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: "Eugenio Pérez" <eperezma@redhat.com> Cc: virtualization@lists.linux.dev Link: https://patch.msgid.link/20260430122653.554058-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-04 19:12:37 -07:00
Breno Leitao	e21bf72954	vsock: convert to getsockopt_iter Convert AF_VSOCK's getsockopt implementation to use the new getsockopt_iter callback with sockopt_t. The single vsock_connectible_getsockopt() callback is shared by both vsock_stream_ops and vsock_seqpacket_ops, so both proto_ops are updated to use .getsockopt_iter. Key changes: - Replace (char __user *optval, int __user *optlen) with sockopt_t *opt - Use opt->optlen for buffer length (input) and returned size (output) - Use copy_to_iter() instead of put_user()/copy_to_user() Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260501-getsock_one-v1-2-810ce23ea70e@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-05-04 19:02:30 -07:00
Hamza Mahfooz	b31681206e	hv_sock: fix ARM64 support VMBUS ring buffers must be page aligned. Therefore, the current value of 24K presents a challenge on ARM64 kernels (with 64K pages). So, use VMBUS_RING_SIZE() to ensure they are always aligned and large enough to hold all of the relevant data. Cc: stable@vger.kernel.org Fixes: `77ffe33363` ("hv_sock: use HV_HYP_PAGE_SIZE for Hyper-V communication") Tested-by: Dexuan Cui <decui@microsoft.com> Reviewed-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20260428125339.13963-1-hamzamahfooz@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-29 17:30:45 -07:00
Dexuan Cui	3d1f20727a	hv_sock: Return -EIO for malformed/short packets Commit `f631529589` fixes a regression, however it fails to report an error for malformed/short packets -- normally we should never see such packets, but let's report an error for them just in case. Fixes: `f631529589` ("hv_sock: Report EOF instead of -EIO for FIN") Cc: stable@vger.kernel.org Signed-off-by: Dexuan Cui <decui@microsoft.com> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20260423064811.1371749-1-decui@microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-23 10:53:16 -07:00
Stefano Garzarella	1cb36e2522	vsock/virtio: fix MSG_ZEROCOPY pinned-pages accounting virtio_transport_init_zcopy_skb() uses iter->count as the size argument for msg_zerocopy_realloc(), which in turn passes it to mm_account_pinned_pages() for RLIMIT_MEMLOCK accounting. However, this function is called after virtio_transport_fill_skb() has already consumed the iterator via __zerocopy_sg_from_iter(), so on the last skb, iter->count will be 0, skipping the RLIMIT_MEMLOCK enforcement. Pass pkt_len (the total bytes being sent) as an explicit parameter to virtio_transport_init_zcopy_skb() instead of reading the already-consumed iter->count. This matches TCP and UDP, which both call msg_zerocopy_realloc() with the original message size. Fixes: `581512a6dc` ("vsock/virtio: MSG_ZEROCOPY flag support") Reported-by: Yiming Qian <yimingqian591@gmail.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260420132051.217589-1-sgarzare@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-23 13:03:21 +02:00
Dexuan Cui	f631529589	hv_sock: Report EOF instead of -EIO for FIN Commit `f0c5827d07` unluckily causes a regression for the FIN packet, and the final read syscall gets an error rather than 0. Ideally, we would want to fix hvs_channel_readable_payload() so that it could return 0 in the FIN scenario, but it's not good for the hv_sock driver to use the VMBus ringbuffer's cached priv_read_index, which is internal data in the VMBus driver. Fix the regression in hv_sock by returning 0 rather than -EIO. Fixes: `f0c5827d07` ("hv_sock: Return the readable bytes in hvs_stream_has_data()") Cc: stable@vger.kernel.org Reported-by: Ben Hillis <Ben.Hillis@microsoft.com> Reported-by: Mitchell Levy <levymitchell0@gmail.com> Signed-off-by: Dexuan Cui <decui@microsoft.com> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20260416191433.840637-1-decui@microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-20 11:44:43 -07:00
Luigi Leonardi	080f22f5d3	vsock/virtio: fix MSG_PEEK ignoring skb offset when calculating bytes to copy `virtio_transport_stream_do_peek()` does not account for the skb offset when computing the number of bytes to copy. This means that, after a partial recv() that advances the offset, a peek requesting more bytes than are available in the sk_buff causes `skb_copy_datagram_iter()` to go past the valid payload, resulting in a -EFAULT. The dequeue path already handles this correctly. Apply the same logic to the peek path. Fixes: `0df7cd3c13` ("vsock/virtio/vhost: read data from non-linear skb") Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Acked-by: Arseniy Krasnov <avkrasnov@salutedevices.com> Signed-off-by: Luigi Leonardi <leonardi@redhat.com> Link: https://patch.msgid.link/20260415-fix_peek-v4-1-8207e872759e@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-16 19:34:22 -07:00
Dudu Lu	52bcb57a4e	vsock/virtio: fix accept queue count leak on transport mismatch virtio_transport_recv_listen() calls sk_acceptq_added() before vsock_assign_transport(). If vsock_assign_transport() fails or selects a different transport, the error path returns without calling sk_acceptq_removed(), permanently incrementing sk_ack_backlog. After approximately backlog+1 such failures, sk_acceptq_is_full() returns true, causing the listener to reject all new connections. Fix by moving sk_acceptq_added() to after the transport validation, matching the pattern used by vmci_transport and hyperv_transport. Fixes: `c0cfa2d8a7` ("vsock: add multi-transports support") Signed-off-by: Dudu Lu <phx0fer@gmail.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://patch.msgid.link/20260413131409.19022-1-phx0fer@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-16 15:13:25 +02:00
Jakub Kicinski	35c2c39832	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Merge in late fixes in preparation for the net-next PR. Conflicts: include/net/sch_generic.h `a6bd339dbb` ("net_sched: fix skb memory leak in deferred qdisc drops") `ff2998f29f` ("net: sched: introduce qdisc-specific drop reason tracing") https://lore.kernel.org/adz0iX85FHMz0HdO@sirena.org.uk drivers/net/ethernet/airoha/airoha_eth.c `1acdfbdb51` ("net: airoha: Fix VIP configuration for AN7583 SoC") `bf3471e6e6` ("net: airoha: Make flow control source port mapping dependent on nbq parameter") Adjacent changes: drivers/net/ethernet/airoha/airoha_ppe.c `f44218cd5e` ("net: airoha: Reset PPE cpu port configuration in airoha_ppe_hw_init()") `7da62262ec` ("inet: add ip_local_port_step_width sysctl to improve port usage distribution") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-14 12:04:00 -07:00
Norbert Szetei	d114bfdc9b	vsock: fix buffer size clamping order In vsock_update_buffer_size(), the buffer size was being clamped to the maximum first, and then to the minimum. If a user sets a minimum buffer size larger than the maximum, the minimum check overrides the maximum check, inverting the constraint. This breaks the intended socket memory boundaries by allowing the vsk->buffer_size to grow beyond the configured vsk->buffer_max_size. Fix this by checking the minimum first, and then the maximum. This ensures the buffer size never exceeds the buffer_max_size. Fixes: `b9f2b0ffde` ("vsock: handle buffer_size sockopts in the core") Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Norbert Szetei <norbert@doyensec.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/180118C5-8BCF-4A63-A305-4EE53A34AB9C@doyensec.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 14:31:50 -07:00
Luigi Leonardi	006679268a	vsock/virtio: remove unnecessary call to `virtio_transport_get_ops` `virtio_transport_send_pkt_info` gets all the transport information from the parameter `t_ops`. There is no need to call `virtio_transport_get_ops()`. Remove it. Acked-by: Arseniy Krasnov <avkrasnov@salutedevices.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Luigi Leonardi <leonardi@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20260408-remove_parameter-v2-1-e00f31cf7a17@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-12 09:57:01 -07:00
Laurence Rowe	98f28d8d6e	vsock: avoid timeout for non-blocking accept() with empty backlog A common pattern in epoll network servers is to eagerly accept all pending connections from the non-blocking listening socket after epoll_wait indicates the socket is ready by calling accept in a loop until EAGAIN is returned indicating that the backlog is empty. Scheduling a timeout for a non-blocking accept with an empty backlog meant AF_VSOCK sockets used by epoll network servers incurred hundreds of microseconds of additional latency per accept loop compared to AF_INET or AF_UNIX sockets. Signed-off-by: Laurence Rowe <laurencerowe@gmail.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20260402204918.130395-1-laurencerowe@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-06 18:29:01 -07:00
Jakub Kicinski	8ffb33d770	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-7.0-rc7). Conflicts: net/vmw_vsock/af_vsock.c `b18c833888` ("vsock: initialize child_ns_mode_locked in vsock_net_init()") `0de607dc4f` ("vsock: add G2H fallback for CIDs not owned by H2G transport") Adjacent changes: drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c `ceee35e567` ("bnxt_en: Refactor some basic ring setup and adjustment logic") `57cdfe0dc7` ("bnxt_en: Resize RSS contexts on channel count change") drivers/net/wireless/intel/iwlwifi/mld/mac80211.c `4d56037a02` ("wifi: iwlwifi: mld: block EMLSR during TDLS connections") `687a95d204` ("wifi: iwlwifi: mld: correctly set wifi generation data") drivers/net/wireless/intel/iwlwifi/mld/scan.h `b6045c899e` ("wifi: iwlwifi: mld: Refactor scan command handling") `ec66ec6a5a` ("wifi: iwlwifi: mld: Fix MLO scan timing") drivers/net/wireless/intel/iwlwifi/mvm/fw.c `078df640ef` ("wifi: iwlwifi: mld: add support for iwl_mcc_allowed_ap_type_cmd v 2") `323156c354` ("wifi: iwlwifi: mvm: don't send a 6E related command when not supported") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-02 11:03:13 -07:00
Stefano Garzarella	b18c833888	vsock: initialize child_ns_mode_locked in vsock_net_init() The `child_ns_mode_locked` field lives in `struct net`, which persists across vsock module reloads. When the module is unloaded and reloaded, `vsock_net_init()` resets `mode` and `child_ns_mode` back to their default values, but does not reset `child_ns_mode_locked`. The stale lock from the previous module load causes subsequent writes to `child_ns_mode` to silently fail: `vsock_net_set_child_mode()` sees the old lock, skips updating the actual value, and returns success when the requested mode matches the stale lock. The sysctl handler reports no error, but `child_ns_mode` remains unchanged. Steps to reproduce: $ modprobe vsock $ echo local > /proc/sys/net/vsock/child_ns_mode $ cat /proc/sys/net/vsock/child_ns_mode local $ modprobe -r vsock $ modprobe vsock $ echo local > /proc/sys/net/vsock/child_ns_mode $ cat /proc/sys/net/vsock/child_ns_mode global <--- expected "local" Fix this by initializing `child_ns_mode_locked` to 0 (unlocked) in `vsock_net_init()`, so the write-once mechanism works correctly after module reload. Fixes: `102eab95f0` ("vsock: lock down child_ns_mode as write-once") Reported-by: Jin Liu <jinl@redhat.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260401092153.28462-1-sgarzare@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-02 08:18:56 -07:00
Kexin Sun	88c07dff9f	hv_sock: update outdated comment for renamed vsock_stream_recvmsg() The function vsock_stream_recvmsg() was renamed to vsock_connectible_recvmsg() by commit `a9e29e5511` ("af_vsock: update functions for connectible socket"). Update the comment accordingly. Assisted-by: unnamed:deepseek-v3.2 coccinelle Signed-off-by: Kexin Sun <kexinsun@smail.nju.edu.cn> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260321105753.6751-1-kexinsun@smail.nju.edu.cn Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-24 13:16:58 +01:00
Alexander Graf	0de607dc4f	vsock: add G2H fallback for CIDs not owned by H2G transport When no H2G transport is loaded, vsock currently routes all CIDs to the G2H transport (commit `65b422d9b6` ("vsock: forward all packets to the host when no H2G is registered")). Extend that existing behavior: when an H2G transport is loaded but does not claim a given CID, the connection falls back to G2H in the same way. This matters in environments like Nitro Enclaves, where an instance may run nested VMs via vhost-vsock (H2G) while also needing to reach sibling enclaves at higher CIDs through virtio-vsock-pci (G2H). With the old code, any CID > 2 was unconditionally routed to H2G when vhost was loaded, making those enclaves unreachable without setting VMADDR_FLAG_TO_HOST explicitly on every connect. Requiring every application to set VMADDR_FLAG_TO_HOST creates friction: tools like socat, iperf, and others would all need to learn about it. The flag was introduced 6 years ago and I am still not aware of any tool that supports it. Even if there was support, it would be cumbersome to use. The most natural experience is a single CID address space where H2G only wins for CIDs it actually owns, and everything else falls through to G2H, extending the behavior that already exists when H2G is absent. To give user space at least a hint that the kernel applied this logic, automatically set the VMADDR_FLAG_TO_HOST on the remote address so it can determine the path taken via getpeername(). Add a per-network namespace sysctl net.vsock.g2h_fallback (default 1). At 0 it forces strict routing: H2G always wins for CID > VMADDR_CID_HOST, or ENODEV if H2G is not loaded. Signed-off-by: Alexander Graf <graf@amazon.com> Tested-by: syzbot@syzkaller.appspotmail.com Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20260304230027.59857-1-graf@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-12 10:59:36 +01:00
Eric Dumazet	8341c989ac	net: remove addr_len argument of recvmsg() handlers Use msg->msg_namelen as a place holder instead of a temporary variable, notably in inet[6]_recvmsg(). This removes stack canaries and allows tail-calls. $ scripts/bloat-o-meter -t vmlinux.old vmlinux add/remove: 0/0 grow/shrink: 2/19 up/down: 26/-532 (-506) Function old new delta rawv6_recvmsg 744 767 +23 vsock_dgram_recvmsg 55 58 +3 vsock_connectible_recvmsg 50 47 -3 unix_stream_recvmsg 161 158 -3 unix_seqpacket_recvmsg 62 59 -3 unix_dgram_recvmsg 42 39 -3 tcp_recvmsg 546 543 -3 mptcp_recvmsg 1568 1565 -3 ping_recvmsg 806 800 -6 tcp_bpf_recvmsg_parser 983 974 -9 ip_recv_error 588 576 -12 ipv6_recv_rxpmtu 442 428 -14 udp_recvmsg 1243 1224 -19 ipv6_recv_error 1046 1024 -22 udpv6_recvmsg 1487 1461 -26 raw_recvmsg 465 437 -28 udp_bpf_recvmsg 1027 984 -43 sock_common_recvmsg 103 27 -76 inet_recvmsg 257 175 -82 inet6_recvmsg 257 175 -82 tcp_bpf_recvmsg 663 568 -95 Total: Before=25143834, After=25143328, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260227151120.1346573-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-02 18:17:17 -08:00
Linus Torvalds	b9c8fc2cae	Merge tag 'net-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from IPsec, Bluetooth and netfilter Current release - regressions: - wifi: fix dev_alloc_name() return value check - rds: fix recursive lock in rds_tcp_conn_slots_available Current release - new code bugs: - vsock: lock down child_ns_mode as write-once Previous releases - regressions: - core: - do not pass flow_id to set_rps_cpu() - consume xmit errors of GSO frames - netconsole: avoid OOB reads, msg is not nul-terminated - netfilter: h323: fix OOB read in decode_choice() - tcp: re-enable acceptance of FIN packets when RWIN is 0 - udplite: fix null-ptr-deref in __udp_enqueue_schedule_skb(). 
- wifi: brcmfmac: fix potential kernel oops when probe fails - phy: register phy led_triggers during probe to avoid AB-BA deadlock - eth: - bnxt_en: fix deleting of Ntuple filters - wan: farsync: fix use-after-free bugs caused by unfinished tasklets - xscale: check for PTP support properly Previous releases - always broken: - tcp: fix potential race in tcp_v6_syn_recv_sock() - kcm: fix zero-frag skb in frag_list on partial sendmsg error - xfrm: - fix race condition in espintcp_close() - always flush state and policy upon NETDEV_UNREGISTER event - bluetooth: - purge error queues in socket destructors - fix response to L2CAP_ECRED_CONN_REQ - eth: - mlx5: - fix circular locking dependency in dump - fix "scheduling while atomic" in IPsec MAC address query - gve: fix incorrect buffer cleanup for QPL - team: avoid NETDEV_CHANGEMTU event when unregistering slave - usb: validate USB endpoints" * tag 'net-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (72 commits) netfilter: nf_conntrack_h323: fix OOB read in decode_choice() dpaa2-switch: validate num_ifs to prevent out-of-bounds write net: consume xmit errors of GSO frames vsock: document write-once behavior of the child_ns_mode sysctl vsock: lock down child_ns_mode as write-once selftests/vsock: change tests to respect write-once child ns mode net/mlx5e: Fix "scheduling while atomic" in IPsec MAC address query net/mlx5: Fix missing devlink lock in SRIOV enable error path net/mlx5: E-switch, Clear legacy flag when moving to switchdev net/mlx5: LAG, disable MPESW in lag_disable_change() net/mlx5: DR, Fix circular locking dependency in dump selftests: team: Add a reference count leak test team: avoid NETDEV_CHANGEMTU event when unregistering slave net: mana: Fix double destroy_workqueue on service rescan PCI path MAINTAINERS: Update maintainer entry for QUALCOMM ETHQOS ETHERNET DRIVER dpll: zl3073x: Remove redundant cleanup in devm_dpll_init() selftests/net: packetdrill: Verify acceptance of FIN 
packets when RWIN is 0 tcp: re-enable acceptance of FIN packets when RWIN is 0 vsock: Use container_of() to get net namespace in sysctl handlers net: usb: kaweth: validate USB endpoints ...	2026-02-26 08:00:13 -08:00
Bobby Eshleman	102eab95f0	vsock: lock down child_ns_mode as write-once Two administrator processes may race when setting child_ns_mode as one process sets child_ns_mode to "local" and then creates a namespace, but another process changes child_ns_mode to "global" between the write and the namespace creation. The first process ends up with a namespace in "global" mode instead of "local". While this can be detected after the fact by reading ns_mode and retrying, it is fragile and error-prone. Make child_ns_mode write-once so that a namespace manager can set it once and be sure it won't change. Writing a different value after the first write returns -EBUSY. This applies to all namespaces, including init_net, where an init process can write "local" to lock all future namespaces into local mode. Fixes: `eafb64f40c` ("vsock: add netns to vsock core") Suggested-by: Daan De Meyer <daan.j.demeyer@gmail.com> Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Co-developed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20260223-vsock-ns-write-once-v3-2-c0cde6959923@meta.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-26 11:10:03 +01:00
Greg Kroah-Hartman	5cc619583c	vsock: Use container_of() to get net namespace in sysctl handlers current->nsproxy should not be accessed directly as syzbot has found that it could be NULL at times, causing crashes. Fix up the af_vsock sysctl handlers to use container_of() to deal with the current net namespace instead of attempting to rely on current. This is the same type of change done in commit `7f5611cbc4` ("rds: sysctl: rds_tcp_{rcv,snd}buf: avoid using current->nsproxy") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Fixes: `eafb64f40c` ("vsock: add netns to vsock core") Link: https://patch.msgid.link/2026022318-rearview-gallery-ae13@gregkh Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-25 18:59:18 -08:00
Linus Torvalds	bf4afc53b7	Convert 'alloc_obj' family to use the new default GFP_KERNEL argument This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' | xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 17:09:51 -08:00
Kees Cook	69050f8d6d	treewide: Replace kmalloc with kmalloc_obj for non-scalar types This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(PTR, FAM, COUNT, ...) (where TYPE may also be VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>	2026-02-21 01:02:28 -08:00
Linus Torvalds	8bf22c33e7	Merge tag 'net-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from Netfilter. 
Current release - new code bugs: - net: fix backlog_unlock_irq_restore() vs CONFIG_PREEMPT_RT - eth: mlx5e: XSK, Fix unintended ICOSQ change - phy_port: correctly recompute the port's linkmodes - vsock: prevent child netns mode switch from local to global - couple of kconfig fixes for new symbols Previous releases - regressions: - nfc: nci: fix false-positive parameter validation for packet data - net: do not delay zero-copy skbs in skb_attempt_defer_free() Previous releases - always broken: - mctp: ensure our nlmsg responses to user space are zero-initialised - ipv6: ioam: fix heap buffer overflow in __ioam6_fill_trace_data() - fixes for ICMP rate limiting Misc: - intel: fix PCI device ID conflict between i40e and ipw2200" * tag 'net-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (85 commits) net: nfc: nci: Fix parameter validation for packet data net/mlx5e: Use unsigned for mlx5e_get_max_num_channels net/mlx5e: Fix deadlocks between devlink and netdev instance locks net/mlx5e: MACsec, add ASO poll loop in macsec_aso_set_arm_event net/mlx5: Fix misidentification of write combining CQE during poll loop net/mlx5e: Fix misidentification of ASO CQE during poll loop net/mlx5: Fix multiport device check over light SFs bonding: alb: fix UAF in rlb_arp_recv during bond up/down bnge: fix reserving resources from FW eth: fbnic: Advertise supported XDP features. rds: tcp: fix uninit-value in __inet_bind net/rds: Fix NULL pointer dereference in rds_tcp_accept_one octeontx2-af: Fix default entries mcam entry action net/mlx5e: XSK, Fix unintended ICOSQ change ipv6: icmp: icmpv6_xrlim_allow() optimization if net.ipv6.icmp.ratelimit is zero ipv4: icmp: icmpv4_xrlim_allow() optimization if net.ipv4.icmp_ratelimit is zero ipv6: icmp: remove obsolete code in icmpv6_xrlim_allow() inet: move icmp_global_{credit,stamp} to a separate cache line icmp: prevent possible overflow in icmp_global_allow() selftests/net: packetdrill: add ipv4-mapped-ipv6 tests ...	
2026-02-19 10:39:08 -08:00
Stefano Garzarella	6a997f38bd	vsock: prevent child netns mode switch from local to global A "local" namespace can change its `child_ns_mode` sysctl to "global", allowing nested namespaces to access global CIDs. This can be exploited by an unprivileged user who gained CAP_NET_ADMIN through a user namespace. Prevent this by rejecting writes that attempt to set `child_ns_mode` to "global" when the current namespace's mode is "local". Fixes: `eafb64f40c` ("vsock: add netns to vsock core") Cc: bobbyeshleman@meta.com Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260212205916.97533-3-sgarzare@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-13 12:28:38 -08:00
Stefano Garzarella	9dd391493a	vsock: fix child netns mode initialization When a new network namespace is created, vsock_net_init() correctly initializes the namespace's mode by reading the parent's `child_ns_mode` via vsock_net_child_mode(). However, the `child_ns_mode` of the new namespace was always hardcoded to VSOCK_NET_MODE_GLOBAL, regardless of its own mode. This means that if a parent namespace has `child_ns_mode` set to "local", the child namespace correctly gets mode "local", but its `child_ns_mode` is reset to "global". As a result, further nested namespaces will incorrectly get mode "global" instead of inheriting "local", breaking the expected propagation of the mode through nested namespaces. Fix this by initializing `child_ns_mode` to the namespace's own mode, so the setting propagates correctly through all levels of nesting. Fixes: `eafb64f40c` ("vsock: add netns to vsock core") Cc: bobbyeshleman@meta.com Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260212205916.97533-2-sgarzare@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-13 12:28:38 -08:00
Linus Torvalds	a353e7260b	Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost Pull virtio updates from Michael Tsirkin: - in-order support in virtio core - multiple address space support in vduse - fixes, cleanups all over the place, notably dma alignment fixes for non-cache-coherent systems * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (59 commits) vduse: avoid adding implicit padding vhost: fix caching attributes of MMIO regions by setting them explicitly vdpa/mlx5: update MAC address handling in mlx5_vdpa_set_attr() vdpa/mlx5: reuse common function for MAC address updates vdpa/mlx5: update mlx_features with driver state check crypto: virtio: Replace package id with numa node id crypto: virtio: Remove duplicated virtqueue_kick in virtio_crypto_skcipher_crypt_req crypto: virtio: Add spinlock protection with virtqueue notification Documentation: Add documentation for VDUSE Address Space IDs vduse: bump version number vduse: add vq group asid support vduse: merge tree search logic of IOTLB_GET_FD and IOTLB_GET_INFO ioctls vduse: take out allocations from vduse_dev_alloc_coherent vduse: remove unused vaddr parameter of vduse_domain_free_coherent vduse: 
refactor vdpa_dev_add for goto err handling vhost: forbid change vq groups ASID if DRIVER_OK is set vdpa: document set_group_asid thread safety vduse: return internal vq group struct as map token vduse: add vq group support vduse: add v1 API definition ...	2026-02-13 12:02:18 -08:00
Arnd Bergmann	e25dbf561e	vmw_vsock: bypass false-positive Wnonnull warning with gcc-16 The gcc-16.0.1 snapshot produces a false-positive warning that turns into a build failure with CONFIG_WERROR: In file included from arch/x86/include/asm/string.h:6, from net/vmw_vsock/vmci_transport.c:10: In function 'vmci_transport_packet_init', inlined from '__vmci_transport_send_control_pkt.constprop' at net/vmw_vsock/vmci_transport.c:198:2: arch/x86/include/asm/string_32.h:150:25: error: argument 2 null where non-null expected because argument 3 is nonzero [-Werror=nonnull] 150 | #define memcpy(t, f, n) __builtin_memcpy(t, f, n) | ^~~~~~~~~~~~~~~~~~~~~~~~~ net/vmw_vsock/vmci_transport.c:164:17: note: in expansion of macro 'memcpy' 164 | memcpy(&pkt->u.wait, wait, sizeof(pkt->u.wait)); | ^~~~~~ arch/x86/include/asm/string_32.h:150:25: note: in a call to built-in function '__builtin_memcpy' net/vmw_vsock/vmci_transport.c:164:17: note: in expansion of macro 'memcpy' 164 | memcpy(&pkt->u.wait, wait, sizeof(pkt->u.wait)); | ^~~~~~ This seems relatively harmless, and it is so far the only instance of this warning I have found. The __vmci_transport_send_control_pkt function is called either with wait=NULL or with one of the type values that pass 'wait' into memcpy() here, but not from the same caller. Replacing the memcpy with a struct assignment is otherwise the same but avoids the warning. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Bryan Tan <bryan-bt.tan@broadcom.com> Link: https://patch.msgid.link/20260203163406.2636463-1-arnd@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 18:40:31 -08:00
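
The memcpy-to-struct-assignment change described in e25dbf561e can be illustrated with a minimal sketch. The types and function names below (`struct wait_info`, `struct packet`, `set_wait_memcpy`, `set_wait_assign`) are hypothetical stand-ins and are not the real definitions from net/vmw_vsock/vmci_transport.c; the sketch only assumes, as the commit message states, that the copied object is a struct member of a union inside the packet.

```c
#include <string.h>

/* Hypothetical stand-in for the payload that the real code copies. */
struct wait_info {
	unsigned long long generation;
	unsigned long long offset;
};

/* Hypothetical stand-in for the packet; only the union layout matters here. */
struct packet {
	int type;
	union {
		unsigned long long size;
		struct wait_info wait;
	} u;
};

/* Old form: byte-wise copy; 'wait' is passed to memcpy() as a pointer argument. */
static void set_wait_memcpy(struct packet *pkt, const struct wait_info *wait)
{
	memcpy(&pkt->u.wait, wait, sizeof(pkt->u.wait));
}

/* New form: struct assignment copies the same members without calling memcpy(). */
static void set_wait_assign(struct packet *pkt, const struct wait_info *wait)
{
	pkt->u.wait = *wait;
}
```

Both helpers require a non-NULL `wait`; the assignment form simply leaves no memcpy() call whose pointer argument the compiler's nonnull check could flag.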


480 Commits