When netif_is_rxfh_configured() is true (i.e., the user has explicitly
configured the RSS indirection table), virtnet_set_queues() skips the
RSS update path and falls through to the VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET
command to change the number of queue pairs. However, it does not update
vi->rss_trailer.max_tx_vq to reflect the new queue_pairs value.
This causes a mismatch between vi->curr_queue_pairs and
vi->rss_trailer.max_tx_vq. Any subsequent RSS reconfiguration (e.g.,
via ethtool -X) calls virtnet_commit_rss_command(), which sends the
stale max_tx_vq to the device, silently reverting the queue count.
Reproduction:
1. User configured RSS
ethtool -X eth0 equal 8
2. VQ_PAIRS_SET path; max_tx_vq stays 16
ethtool -L eth0 combined 12
3. RSS commit uses max_tx_vq=16 instead of 12
ethtool -X eth0 equal 4
Fix this by updating vi->rss_trailer.max_tx_vq after a successful
VQ_PAIRS_SET command when RSS is enabled, keeping it in sync with
curr_queue_pairs.
Fixes: 50bfcaedd7 ("virtio_net: Update rss when set queue")
Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/20260416212121.29073-1-brett.creeley@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rss_max_key_size in the virtio spec is the maximum key size supported by
the device, not a mandatory size the driver must use. Also the value 40
is a spec minimum, not a spec maximum.
The current code rejects RSS and can fail probe when the device reports a
larger rss_max_key_size than the driver buffer limit. Instead, clamp the
effective key length to min(device rss_max_key_size, NETDEV_RSS_KEY_LEN)
and keep RSS enabled.
This keeps probe working on devices that advertise larger maximum key sizes
while respecting the netdev RSS key buffer size limit.
Fixes: 3f7d9c1964 ("virtio_net: Add hash_key_length check")
Cc: stable@vger.kernel.org
Signed-off-by: Srujana Challa <schalla@marvell.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/20260326142344.1171317-1-schalla@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Negotiating VIRTIO_NET_F_CTRL_GUEST_OFFLOADS indicates the device
allows control over offload support, but the offloads that can be
controlled may have nothing to do with GRO (e.g., if neither GUEST_TSO4
nor GUEST_TSO6 is supported).
In such a setup, reporting NETIF_F_GRO_HW as available for the device
is too optimistic and misleading to the user.
Improve the situation by masking off NETIF_F_GRO_HW unless the device
possesses actual GRO-related offload capabilities. Out of an abundance
of caution, this does not change the current behaviour for hardware with
just v6 or just v4 GRO: current interfaces do not allow distinguishing
between v6/v4 GRO, so we can't expose them to userspace precisely.
Signed-off-by: Di Zhu <zhud@hygon.cn>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20260323041730.986351-1-zhud@hygon.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
receive_buf() reads the virtio header through buf before
page_pool_dma_sync_for_cpu() runs in receive_small() or
receive_mergeable(). The header buffer is thus unsynchronized at the
point where flags and, for mergeable buffers, num_buffers are consumed.
Omar Elghoul reported that on s390x Secure Execution this showed up as
greatly reduced virtio-net performance together with "bad gso" and
"bad csum" messages in dmesg. This is because with SE sync actually
copies data, so the header is uninitialized.
Move the sync into receive_buf() so the
header is synchronized before any access through buf.
Tool use: Cursor with GPT-5.4 drafted the initial code move from prompt:
"in drivers/net/virtio_net.c, move page_pool_dma_sync_for_cpu on receive
path to before memory is accessed through buf".
The result and the commit log were reviewed and edited manually.
Fixes: 24fbd3967f ("virtio_net: add page_pool support for buffer allocation")
Reported-by: Omar Elghoul <oelghoul@linux.ibm.com>
Tested-by: Srikanth Aithal <sraithal@amd.com>
Tested-by: Omar Elghoul <oelghoul@linux.ibm.com>
Link: https://lore.kernel.org/r/20260323150136.14452-1-oelghoul@linux.ibm.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Vishwanath Seshagiri <vishs@meta.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/f4caa9be9e5addae7851c012cab0a733be7f0974.1774365273.git.mst@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A UAF issue occurs when the virtio_net driver is configured with napi_tx=N
and the device's IFF_XMIT_DST_RELEASE flag is cleared
(e.g., during the configuration of tc route filter rules).
When IFF_XMIT_DST_RELEASE is removed from the net_device, the network stack
expects the driver to hold the reference to skb->dst until the packet
is fully transmitted and freed. In virtio_net with napi_tx=N,
skbs may remain in the virtio transmit ring for an extended period.
If the network namespace is destroyed while these skbs are still pending,
the corresponding dst_ops structure has freed. When a subsequent packet
is transmitted, free_old_xmit() is triggered to clean up old skbs.
It then calls dst_release() on the skb associated with the stale dst_entry.
Since the dst_ops (referenced by the dst_entry) has already been freed,
a UAF kernel paging request occurs.
fix it by adds skb_dst_drop(skb) in start_xmit to explicitly release
the dst reference before the skb is queued in virtio_net.
Call Trace:
Unable to handle kernel paging request at virtual address ffff80007e150000
CPU: 2 UID: 0 PID: 6236 Comm: ping Kdump: loaded Not tainted 7.0.0-rc1+ #6 PREEMPT
...
percpu_counter_add_batch+0x3c/0x158 lib/percpu_counter.c:98 (P)
dst_release+0xe0/0x110 net/core/dst.c:177
skb_release_head_state+0xe8/0x108 net/core/skbuff.c:1177
sk_skb_reason_drop+0x54/0x2d8 net/core/skbuff.c:1255
dev_kfree_skb_any_reason+0x64/0x78 net/core/dev.c:3469
napi_consume_skb+0x1c4/0x3a0 net/core/skbuff.c:1527
__free_old_xmit+0x164/0x230 drivers/net/virtio_net.c:611 [virtio_net]
free_old_xmit drivers/net/virtio_net.c:1081 [virtio_net]
start_xmit+0x7c/0x530 drivers/net/virtio_net.c:3329 [virtio_net]
...
Reproduction Steps:
NETDEV="enp3s0"
config_qdisc_route_filter() {
tc qdisc del dev $NETDEV root
tc qdisc add dev $NETDEV root handle 1: prio
tc filter add dev $NETDEV parent 1:0 \
protocol ip prio 100 route to 100 flowid 1:1
ip route add 192.168.1.100/32 dev $NETDEV realm 100
}
test_ns() {
ip netns add testns
ip link set $NETDEV netns testns
ip netns exec testns ifconfig $NETDEV 10.0.32.46/24
ip netns exec testns ping -c 1 10.0.32.1
ip netns del testns
}
config_qdisc_route_filter
test_ns
sleep 2
test_ns
Fixes: f2fc6a5458 ("[NETNS][IPV6] route6 - move ip6_dst_ops inside the network namespace")
Cc: stable@vger.kernel.org
Signed-off-by: xietangxin <xietangxin@yeah.net>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Fixes: 0287587884 ("net: better IFF_XMIT_DST_RELEASE support")
Link: https://patch.msgid.link/20260312025406.15641-1-xietangxin@yeah.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The commit be50da3e9d ("net: virtio_net: implement exact header length
guest feature") introduces support for the VIRTIO_NET_F_GUEST_HDRLEN
feature in virtio-net.
This feature requires virtio-net to set hdr_len to the actual header
length of the packet when transmitting, the number of
bytes from the start of the packet to the beginning of the
transport-layer payload.
However, in practice, hdr_len was being set using skb_headlen(skb),
which is clearly incorrect. This commit fixes that issue.
Fixes: be50da3e9d ("net: virtio_net: implement exact header length guest feature")
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Link: https://patch.msgid.link/20260320021818.111741-2-xuanzhuo@linux.alibaba.com
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Use page_pool for RX buffer allocation in mergeable and small buffer
modes to enable page recycling and avoid repeated page allocator calls.
skb_mark_for_recycle() enables page reuse in the network stack.
Big packets mode is unchanged because it uses page->private for linked
list chaining of multiple pages per buffer, which conflicts with
page_pool's internal use of page->private.
Implement conditional DMA premapping using virtqueue_dma_dev():
- When non-NULL (vhost, virtio-pci): use PP_FLAG_DMA_MAP with page_pool
handling DMA mapping, submit via virtqueue_add_inbuf_premapped()
- When NULL (VDUSE, direct physical): page_pool handles allocation only,
submit via virtqueue_add_inbuf_ctx()
This preserves the DMA premapping optimization from commit 31f3cd4e57
("virtio-net: rq submits premapped per-buffer") while adding page_pool
support as a prerequisite for future zero-copy features (devmem TCP,
io_uring ZCRX).
Page pools are created in probe and destroyed in remove (not open/close),
following existing driver behavior where RX buffers remain in virtqueues
across interface state changes.
Signed-off-by: Vishwanath Seshagiri <vishs@meta.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/20260310183107.2822016-1-vishs@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When XDP_USE_NEED_WAKEUP is used and the fill ring is empty so no buffer
is allocated on RX side, allow RX NAPI to be descheduled. This avoids
wasting CPU cycles on polling. Users will be notified and they need to
make a wakeup call after refilling the ring.
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Link: https://patch.msgid.link/20260304154317.7506-1-minhquangbui99@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
Use the new TRAILING_OVERLAP() helper to fix a misalignment bug
along with the following warning:
drivers/net/virtio_net.c:429:46: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
This helper creates a union between a flexible-array member (FAM)
and a set of members that would otherwise follow it (in this case
`u8 rss_hash_key_data[VIRTIO_NET_RSS_MAX_KEY_SIZE];`). This
overlays the trailing members (rss_hash_key_data) onto the FAM
(hash_key_data) while keeping the FAM and the start of MEMBERS aligned.
The static_assert() ensures this alignment remains.
Notice that due to tail padding in flexible `struct
virtio_net_rss_config_trailer`, `rss_trailer.hash_key_data`
(at offset 83 in struct virtnet_info) and `rss_hash_key_data` (at
offset 84 in struct virtnet_info) are misaligned by one byte. See
below:
struct virtio_net_rss_config_trailer {
__le16 max_tx_vq; /* 0 2 */
__u8 hash_key_length; /* 2 1 */
__u8 hash_key_data[]; /* 3 0 */
/* size: 4, cachelines: 1, members: 3 */
/* padding: 1 */
/* last cacheline: 4 bytes */
};
struct virtnet_info {
...
struct virtio_net_rss_config_trailer rss_trailer; /* 80 4 */
/* XXX last struct has 1 byte of padding */
u8 rss_hash_key_data[40]; /* 84 40 */
...
/* size: 832, cachelines: 13, members: 48 */
/* sum members: 801, holes: 8, sum holes: 31 */
/* paddings: 2, sum paddings: 5 */
};
After changes, those members are correctly aligned at offset 795:
struct virtnet_info {
...
union {
struct virtio_net_rss_config_trailer rss_trailer; /* 792 4 */
struct {
unsigned char __offset_to_hash_key_data[3]; /* 792 3 */
u8 rss_hash_key_data[40]; /* 795 40 */
}; /* 792 43 */
}; /* 792 44 */
...
/* size: 840, cachelines: 14, members: 47 */
/* sum members: 801, holes: 8, sum holes: 35 */
/* padding: 4 */
/* paddings: 1, sum paddings: 4 */
/* last cacheline: 8 bytes */
};
As a result, the RSS key passed to the device is shifted by 1
byte: the last byte is cut off, and instead a (possibly
uninitialized) byte is added at the beginning.
As a last note `struct virtio_net_rss_config_hdr *rss_hdr;` is also
moved to the end, since it seems those three members should stick
around together. :)
Cc: stable@vger.kernel.org
Fixes: ed3100e90d ("virtio_net: Use new RSS config structs")
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/aWIItWq5dV9XTTCJ@kspp
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The delayed refill worker is removed which makes virtnet_rx_pause/resume
quite the same as __virtnet_rx_pause/resume. So remove
__virtnet_rx_pause/resume and move the code to virtnet_rx_pause/resume.
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Link: https://patch.msgid.link/20260106150438.7425-4-minhquangbui99@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since we switched to retry refilling receive buffer in NAPI poll instead
of delayed worker, remove all now unused delayed refill worker code.
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Link: https://patch.msgid.link/20260106150438.7425-3-minhquangbui99@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When we fail to refill the receive buffers, we schedule a delayed worker
to retry later. However, this worker creates some concurrency issues.
For example, when the worker runs concurrently with virtnet_xdp_set,
both need to temporarily disable queue's NAPI before enabling again.
Without proper synchronization, a deadlock can happen when
napi_disable() is called on an already disabled NAPI. That
napi_disable() call will be stuck and so will the subsequent
napi_enable() call.
To simplify the logic and avoid further problems, we will instead retry
refilling in the next NAPI poll.
Fixes: 4bc12818b3 ("virtio-net: disable delayed refill when pausing rx")
Reported-by: Paolo Abeni <pabeni@redhat.com>
Closes: https://lore.kernel.org/526b5396-459d-4d02-8635-a222d07b46d7@redhat.com
Cc: stable@vger.kernel.org
Suggested-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/20260106150438.7425-2-minhquangbui99@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit a2fb4bc4e2 ("net: implement virtio helpers to handle UDP
GSO tunneling.") inadvertently altered checksum offload behavior
for guests not using UDP GSO tunneling.
Before, tun_put_user called tun_vnet_hdr_from_skb, which passed
has_data_valid = true to virtio_net_hdr_from_skb.
After, tun_put_user began calling tun_vnet_hdr_tnl_from_skb instead,
which passes has_data_valid = false into both call sites.
This caused virtio hdr flags to not include VIRTIO_NET_HDR_F_DATA_VALID
for SKBs where skb->ip_summed == CHECKSUM_UNNECESSARY. As a result,
guests are forced to recalculate checksums unnecessarily.
Restore the previous behavior by ensuring has_data_valid = true is
passed in the !tnl_gso_type case, but only from tun side, as
virtio_net_hdr_tnl_from_skb() is used also by the virtio_net driver,
which in turn must not use VIRTIO_NET_HDR_F_DATA_VALID on tx.
cc: stable@vger.kernel.org
Fixes: a2fb4bc4e2 ("net: implement virtio helpers to handle UDP GSO tunneling.")
Signed-off-by: Jon Kohler <jon@nutanix.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20251125222754.1737443-1-jon@nutanix.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch refines and strengthens the statistics collection of TX queue
wake/stop events introduced by commit c39add9b24 ("virtio_net: Add TX
stopped and wake counters").
Previously, the driver only recorded partial wake/stop statistics
for TX queues. Some wake events triggered by 'skb_xmit_done()' or resume
operations were not counted, which made the per-queue metrics incomplete.
Signed-off-by: Liming Wu <liming.wu@jaguarmicro.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/20251120015320.1418-1-liming.wu@jaguarmicro.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cross-merge networking fixes after downstream PR (net-6.18-rc6).
No conflicts, adjacent changes in:
drivers/net/phy/micrel.c
96a9178a29 ("net: phy: micrel: lan8814 fix reset of the QSGMII interface")
61b7ade9ba ("net: phy: micrel: Add support for non PTP SKUs for lan8814")
and a trivial one in tools/testing/selftests/drivers/net/Makefile.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The purpose of commit 703eec1b24 ("virtio_net: fixing XDP for fully
checksummed packets handling") is to record the flags in advance, as
their value may be overwritten in the XDP case. However, the flags
recorded under big mode are incorrect, because in big mode, the passed
buf does not point to the rx buffer, but rather to the page of the
submitted buffer. This commit fixes this issue.
For the small mode, the commit c11a49d58a ("virtio_net: Fix mismatched
buf address when unmapping for small packets") fixed it.
Tested-by: Alyssa Ross <hi@alyssa.is>
Fixes: 703eec1b24 ("virtio_net: fixing XDP for fully checksummed packets handling")
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/20251111090828.23186-1-xuanzhuo@linux.alibaba.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Cross-merge networking fixes after downstream PR (net-6.18-rc5).
Conflicts:
drivers/net/wireless/ath/ath12k/mac.c
9222582ec5 ("Revert "wifi: ath12k: Fix missing station power save configuration"")
6917e268c4 ("wifi: ath12k: Defer vdev bring-up until CSA finalize to avoid stale beacon")
https://lore.kernel.org/11cece9f7e36c12efd732baa5718239b1bf8c950.camel@sipsolutions.net
Adjacent changes:
drivers/net/ethernet/intel/Kconfig
b1d16f7c00 ("libie: depend on DEBUG_FS when building LIBIE_FWLOG")
93f53db9f9 ("ice: switch to Page Pool")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since commit 4959aebba8 ("virtio-net: use mtu size as buffer length
for big packets"), when guest gso is off, the allocated size for big
packets is not MAX_SKB_FRAGS * PAGE_SIZE anymore but depends on
negotiated MTU. The number of allocated frags for big packets is stored
in vi->big_packets_num_skbfrags.
Because the host announced buffer length can be malicious (e.g. the host
vhost_net driver's get_rx_bufs is modified to announce incorrect
length), we need a check in virtio_net receive path. Currently, the
check is not adapted to the new change which can lead to NULL page
pointer dereference in the below while loop when receiving length that
is larger than the allocated one.
This commit fixes the received length check corresponding to the new
change.
Fixes: 4959aebba8 ("virtio-net: use mtu size as buffer length for big packets")
Cc: stable@vger.kernel.org
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Tested-by: Lei Yang <leiyang@redhat.com>
Link: https://patch.msgid.link/20251030144438.7582-1-minhquangbui99@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Changing alignment of header would mean it's no longer safe to cast a
2 byte aligned pointer between formats. Use two 16 bit fields to make
it 2 byte aligned as previously.
This fixes the performance regression since
commit ("virtio_net: enable gso over UDP tunnel support.") as it uses
virtio_net_hdr_v1_hash_tunnel which embeds
virtio_net_hdr_v1_hash. Pktgen in guest + XDP_DROP on TAP + vhost_net
shows the TX PPS is recovered from 2.4Mpps to 4.45Mpps.
Fixes: 56a06bd40f ("virtio_net: enable gso over UDP tunnel support.")
Cc: stable@vger.kernel.org
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
Link: https://patch.msgid.link/20251031060551.126-1-jasowang@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix the spelling error of "separate".
Signed-off-by: Chu Guangqing <chuguangqing@inspur.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/20251103074305.4727-1-chuguangqing@inspur.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In virtio-net, we have not yet supported multi-buffer XDP packet in
zerocopy mode when there is a binding XDP program. However, in that
case, when receiving multi-buffer XDP packet, we skip the XDP program
and return XDP_PASS. As a result, the packet is passed to normal network
stack which is an incorrect behavior (e.g. a XDP program for packet
count is installed, multi-buffer XDP packet arrives and does go through
XDP program. As a result, the packet count does not increase but the
packet is still received from network stack).This commit instead returns
XDP_ABORTED in that case.
Fixes: 99c861b44e ("virtio_net: xsk: rx: support recv merge mode")
Cc: stable@vger.kernel.org
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Link: https://patch.msgid.link/20251022155630.49272-1-minhquangbui99@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Just fixes and cleanups this time around. The mapping cleanups are
preparing the ground for new features, though.
In order patches were almost there but I feel they didn't
spend enough time in next yet.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQFDBAABCgAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmjdEQQPHG1zdEByZWRo
YXQuY29tAAoJECgfDbjSjVRpsDAH/2yWtj3WWBUmRo5oZ5Rkveebb0oMkm642zGB
nmJ21UdIelvRM1sQoaV0+m6B8mMBDpmxrN3Mg7sSLFMN3xK5DF5QkWEwudJ1RJaq
VVfPyv29tee5mtCve/aG/d+JWipYPdma96Gi8l3UaRXq7TTWBkVAFpxukpK5I1O5
NJJigcxxu6O/gbrR9JxW6HSX9BmV7hsFtsW2HR/C2hXWlTECaJeQJ/ZvooLFhzfZ
pnwFtWjk3D6wYCWquvyE6OhrFDqLsLEW2GgYehL2BRYp/PcLizxewmZL0ghnP7mV
bT5QRHGjxPZIHckBwvVjsIE0eN9at3nl5koBnWbxRYsJau3HgVA=
=4Bdo
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio updates from Michael Tsirkin:
"Just fixes and cleanups this time around. The mapping cleanups are
preparing the ground for new features, though"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
virtio-vdpa: Drop redundant conversion to bool
vduse: Use fixed 4KB bounce pages for non-4KB page size
vduse: switch to use virtio map API instead of DMA API
vdpa: introduce map ops
vdpa: support virtio_map
virtio: introduce map ops in virtio core
virtio_ring: rename dma_handle to map_handle
virtio: introduce virtio_map container union
virtio: rename dma helpers
virtio_ring: switch to use dma_{map|unmap}_page()
virtio_ring: constify virtqueue pointer for DMA helpers
virtio_balloon: Remove redundant __GFP_NOWARN
vhost: vringh: Fix copy_to_iter return value check
vhost: vringh: Modify the return value check
Following patch will introduce virtio mapping function to avoid
abusing DMA API for device that doesn't do DMA. To ease the
introduction, this patch rename "dma" to "map" for the current dma
mapping helpers.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20250821064641.5025-4-jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
Replace the existing virtnet_get_rxnfc callback with a dedicated
virtnet_get_rxrings implementation to provide the number of RX rings
directly via the new ethtool_ops get_rx_ring_count pointer.
This simplifies the RX ring count retrieval and aligns virtio_net with
the new ethtool API for querying RX ring parameters.
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20250917-gxrings-v4-8-dae520e2e1cb@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
xdp_update_skb_shared_info() needs to update skb state which
was maintained in xdp_buff / frame. Pass full flags into it,
instead of breaking it out bit by bit. We will need to add
a bit for unreadable frags (even tho XDP doesn't support
those the driver paths may be common), at which point almost
all call sites would become:
xdp_update_skb_shared_info(skb, num_frags,
sinfo->xdp_frags_size,
MY_PAGE_SIZE * num_frags,
xdp_buff_is_frag_pfmemalloc(xdp),
xdp_buff_is_frag_unreadable(xdp));
Keep a helper for accessing the flags, in case we need to
transform them somehow in the future (e.g. to cover up xdp_buff
vs xdp_frame differences).
While we are touching call callers - rename the helper to
xdp_update_skb_frags_info(), previous name may have implied that
it's shinfo that's updated. We are updating flags in struct sk_buff
based on frags that got attched.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://patch.msgid.link/20250905221539.2930285-2-kuba@kernel.org
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
"Use after free" issue appears in suspend once race occurs when
napi poll scheduls after `netif_device_detach` and before napi disables.
For details, during suspend flow of virtio-net,
the tx queue state is set to "__QUEUE_STATE_DRV_XOFF" by CPU-A.
And at some coincidental times, if a TCP connection is still working,
CPU-B does `virtnet_poll` before napi disable.
In this flow, the state "__QUEUE_STATE_DRV_XOFF"
of tx queue will be cleared. This is not the normal process it expects.
After that, CPU-A continues to close driver then virtqueue is removed.
Sequence likes below:
--------------------------------------------------------------------------
CPU-A CPU-B
----- -----
suspend is called A TCP based on
virtio-net still work
virtnet_freeze
|- virtnet_freeze_down
| |- netif_device_detach
| | |- netif_tx_stop_all_queues
| | |- netif_tx_stop_queue
| | |- set_bit
| | (__QUEUE_STATE_DRV_XOFF,...)
| | softirq rasied
| | |- net_rx_action
| | |- napi_poll
| | |- virtnet_poll
| | |- virtnet_poll_cleantx
| | |- netif_tx_wake_queue
| | |- test_and_clear_bit
| | (__QUEUE_STATE_DRV_XOFF,...)
| |- virtnet_close
| |- virtnet_disable_queue_pair
| |- virtnet_napi_tx_disable
|- remove_vq_common
--------------------------------------------------------------------------
When TCP delayack timer is up, a cpu gets softirq and irq handler
`tcp_delack_timer_handler` will be called, which will finally call
`start_xmit` in virtio net driver.
Then the access to tx virtq will cause panic.
The root cause of this issue is that napi tx
is not disable before `netif_tx_stop_queue`,
once `virnet_poll` schedules in such coincidental time,
the tx queue state will be cleared.
To solve this issue, adjusts the order of
function `virtnet_close` in `virtnet_freeze_down`.
Co-developed-by: Ying Xu <ying123.xu@samsung.com>
Signed-off-by: Ying Xu <ying123.xu@samsung.com>
Signed-off-by: Junnan Wu <junnan01.wu@samsung.com>
Message-Id: <20250812090817.3463403-1-junnan01.wu@samsung.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The deadlock appears in a stack trace like:
virtnet_probe()
rtnl_lock()
virtio_config_changed_work()
netdev_notify_peers()
rtnl_lock()
It happens if the VMM sends a VIRTIO_NET_S_ANNOUNCE request while the
virtio-net driver is still probing.
The config_work in probe() will get scheduled until virtnet_open() enables
the config change notification via virtio_config_driver_enable().
Fixes: df28de7b00 ("virtio-net: synchronize operstate with admin state on up/down")
Signed-off-by: Zigit Zo <zuozhijie@bytedance.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20250716115717.1472430-1-zuozhijie@bytedance.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Consolidate the two nested if conditions for checking tx queue wake
conditions into a single combined condition. This improves code
readability without changing functionality. And move netif_tx_wake_queue
into if condition to reduce unnecessary checks for queue stops.
Signed-off-by: Liming Wu <liming.wu@jaguarmicro.com>
Tested-by: Lei Yang <leiyang@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20250710023208.846-1-liming.wu@jaguarmicro.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni says:
====================
virtio: introduce GSO over UDP tunnel
Some virtualized deployments use UDP tunnel pervasively and are impacted
negatively by the lack of GSO support for such kind of traffic in the
virtual NIC driver.
The virtio_net specification recently introduced support for GSO over
UDP tunnel, this series updates the virtio implementation to support
such a feature.
Currently the kernel virtio support limits the feature space to 64,
while the virtio specification allows for a larger number of features.
Specifically the GSO-over-UDP-tunnel-related virtio features use bits
65-69.
The first four patches in this series rework the virtio and vhost
feature support to cope with up to 128 bits. The limit is set by
a define and could be easily raised in future, as needed.
This implementation choice is aimed at keeping the code churn as
limited as possible. For the same reason, only the virtio_net driver is
reworked to leverage the extended feature space; all other
virtio/vhost drivers are unaffected, but could be upgraded to support
the extended features space in a later time.
The last four patches bring in the actual GSO over UDP tunnel support.
As per specification, some additional fields are introduced into the
virtio net header to support the new offload. The presence of such
fields depends on the negotiated features.
New helpers are introduced to convert the UDP-tunneled skb metadata to
an extended virtio net header and vice versa. Such helpers are used by
the tun and virtio_net driver to cope with the newly supported offloads.
Tested with basic stream transfer with all the possible permutations of
host kernel/qemu/guest kernel with/without GSO over UDP tunnel support.
====================
Link: https://patch.msgid.link/cover.1751874094.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This commit does not do any functional changes. It moves xdp->data
adjustment for buffer other than first buffer to buf_to_xdp() helper so
that the xdp_buff adjustment does not scatter over different functions.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Link: https://patch.msgid.link/20250705075515.34260-1-minhquangbui99@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If the related virtio feature is set, enable transmission and reception
of gso over UDP tunnel packets.
Most of the work is done by the previously introduced helper, just need
to determine the UDP tunnel features inside the virtio_net_hdr and
update accordingly the virtio net hdr size.
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The virtio_net driver needs it to implement GSO over UDP tunnel
offload.
The only missing piece is mapping them to/from the extended
features.
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The `tx_may_stop()` logic stops TX queues if free descriptors
(`sq->vq->num_free`) fall below the threshold of (`MAX_SKB_FRAGS` + 2).
If the total ring size (`ring_num`) is not strictly greater than this
value, queues can become persistently stopped or stop after minimal
use, severely degrading performance.
A single sk_buff transmission typically requires descriptors for:
- The virtio_net_hdr (1 descriptor)
- The sk_buff's linear data (head) (1 descriptor)
- Paged fragments (up to MAX_SKB_FRAGS descriptors)
This patch enforces that the TX ring size ('ring_num') must be strictly
greater than (MAX_SKB_FRAGS + 2). This ensures that the ring is
always large enough to hold at least one maximally-fragmented packet
plus at least one additional slot.
Reported-by: Lei Yang <leiyang@redhat.com>
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20250521092236.661410-4-lvivier@redhat.com
Tested-by: Lei Yang <leiyang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Improve consistency by using everywhere it is needed
'MAX_SKB_FRAGS + 2' rather than '2+MAX_SKB_FRAGS' or
'2 + MAX_SKB_FRAGS'.
No functional change.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20250521092236.661410-3-lvivier@redhat.com
Tested-by: Lei Yang <leiyang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When calling buf_to_xdp, the len argument is the frame data's length
without virtio header's length (vi->hdr_len). We check that len with
xsk_pool_get_rx_frame_size() + vi->hdr_len
to ensure the provided len does not larger than the allocated chunk
size. The additional vi->hdr_len is because in virtnet_add_recvbuf_xsk,
we use part of XDP_PACKET_HEADROOM for virtio header and ask the vhost
to start placing data from
hard_start + XDP_PACKET_HEADROOM - vi->hdr_len
not
hard_start + XDP_PACKET_HEADROOM
But the first buffer has virtio_header, so the maximum frame's length in
the first buffer can only be
xsk_pool_get_rx_frame_size()
not
xsk_pool_get_rx_frame_size() + vi->hdr_len
like in the current check.
This commit adds an additional argument to buf_to_xdp differentiate
between the first buffer and other ones to correctly calculate the maximum
frame's length.
Cc: stable@vger.kernel.org
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Fixes: a4e7ba7027 ("virtio_net: xsk: rx: support recv small mode")
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Link: https://patch.msgid.link/20250630151315.86722-2-minhquangbui99@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Replace the current repeated code to check received length in mergeable
mode with the new check_mergeable_len helper.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20250630144212.48471-4-minhquangbui99@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The truesize is guaranteed not to exceed PAGE_SIZE in
get_mergeable_buf_len(). It is saved in mergeable context, which is not
changeable by the host side, so the check in receive path is quite
redundant.
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Link: https://patch.msgid.link/20250630144212.48471-3-minhquangbui99@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
In xdp_linearize_page, when reading the following buffers from the ring,
we forget to check the received length with the true allocate size. This
can lead to an out-of-bound read. This commit adds that missing check.
Cc: <stable@vger.kernel.org>
Fixes: 4941d472bf ("virtio-net: do not reset during XDP set")
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20250630144212.48471-2-minhquangbui99@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add support for the new rxfh_fields callbacks, instead of de-muxing
the rxnfc calls. This driver does not support flow filtering so
the set_rxnfc callback is completely removed.
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20250611145949.2674086-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
NIPA tests report that the interface statistics reported
via qstat are lower than those reported via ip link.
Looks like this is because some tests flip the queue
count up and down, and we end up with some of the traffic
accounted on disabled queues.
Add up counters from disabled queues.
Fixes: d888f04c09 ("virtio-net: support queue stat")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250507003221.823267-3-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Commit 4bc12818b3 ("virtio-net: disable delayed refill when pausing rx")
fixed a deadlock between reconfig paths and refill work trying to disable
the same NAPI instance. The refill work can't run in parallel with reconfig
because trying to double-disable a NAPI instance causes a stall under the
instance lock, which the reconfig path needs to re-enable the NAPI and
therefore unblock the stalled thread.
There are two cases where we re-enable refill too early. One is in the
virtnet_set_queues() handler. We call it when installing XDP:
virtnet_rx_pause_all(vi);
...
virtnet_napi_tx_disable(..);
...
virtnet_set_queues(..);
...
virtnet_rx_resume_all(..);
We want the work to be disabled until we call virtnet_rx_resume_all(),
but virtnet_set_queues() kicks it before NAPIs were re-enabled.
The other case is a more trivial case of mis-ordering in
__virtnet_rx_resume() found by code inspection.
Taking the spin lock in virtnet_set_queues() (requested during review)
may be unnecessary as we are under rtnl_lock and so are all paths writing
to ->refill_enabled.
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Bui Quang Minh <minhquangbui99@gmail.com>
Fixes: 4bc12818b3 ("virtio-net: disable delayed refill when pausing rx")
Fixes: 413f0271f3 ("net: protect NAPI enablement with netdev_lock()")
Link: https://patch.msgid.link/20250430163758.3029367-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>