linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-13 00:28:54 +02:00

Author	SHA1	Message	Date
Ariful Islam Shoikot	645d044d7e	docs: maintainer-netdev: fix typo in "targeting" Fix spelling mistake "targgeting" -> "targeting" in maintainer-netdev.rst No functional change. Signed-off-by: Ariful Islam Shoikot <islamarifulshoikat@gmail.com> Reviewed-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20260420114554.1026-1-islamarifulshoikat@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-22 20:16:58 -07:00
Bingquan Chen	2c054e17d9	net/packet: fix TOCTOU race on mmap'd vnet_hdr in tpacket_snd() In tpacket_snd(), when PACKET_VNET_HDR is enabled, vnet_hdr points directly into the mmap'd TX ring buffer shared with userspace. The kernel validates the header via __packet_snd_vnet_parse() but then re-reads all fields later in virtio_net_hdr_to_skb(). A concurrent userspace thread can modify the vnet_hdr fields between validation and use, bypassing all safety checks. The non-TPACKET path (packet_snd()) already correctly copies vnet_hdr to a stack-local variable. All other vnet_hdr consumers in the kernel (tun.c, tap.c, virtio_net.c) also use stack copies. The TPACKET TX path is the only caller of virtio_net_hdr_to_skb() that reads directly from user-controlled shared memory. Fix this by copying vnet_hdr from the mmap'd ring buffer to a stack-local variable before validation and use, consistent with the approach used in packet_snd() and all other callers. Fixes: `1d036d25e5` ("packet: tpacket_snd gso and checksum offload") Signed-off-by: Bingquan Chen <patzilla007@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260418112006.78823-1-patzilla007@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-22 20:16:34 -07:00
Kohei Enju	3bfcf39608	net: validate skb->napi_id in RX tracepoints Since commit `2bd82484bb` ("xps: fix xps for stacked devices"), skb->napi_id shares storage with sender_cpu. RX tracepoints using net_dev_rx_verbose_template read skb->napi_id directly and can therefore report sender_cpu values as if they were NAPI IDs. For example, on the loopback path this can report 1 as napi_id, where 1 comes from raw_smp_processor_id() + 1 in the XPS path: # bpftrace -e 'tracepoint:net:netif_rx_entry{ print(args->napi_id); }' # taskset -c 0 ping -c 1 ::1 Report only valid NAPI IDs in these tracepoints and use 0 otherwise. Fixes: `2bd82484bb` ("xps: fix xps for stacked devices") Signed-off-by: Kohei Enju <kohei@enjuk.jp> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://patch.msgid.link/20260420105427.162816-1-kohei@enjuk.jp Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-22 20:15:50 -07:00
Chia-Yu Chang	478ed6b7d2	net/sched: sch_dualpi2: drain both C-queue and L-queue in dualpi2_change() Fix dualpi2_change() to correctly enforce updated limit and memlimit values after a configuration change of the dualpi2 qdisc. Before this patch, dualpi2_change() always attempted to dequeue packets via the root qdisc (C-queue) when reducing backlog or memory usage, and unconditionally assumed that a valid skb will be returned. When traffic classification results in packets being queued in the L-queue while the C-queue is empty, this leads to a NULL skb dereference during limit or memlimit enforcement. This is fixed by first dequeuing from the C-queue path if it is non-empty. Once the C-queue is empty, packets are dequeued directly from the L-queue. Return values from qdisc_dequeue_internal() are checked for both queues. When dequeuing from the L-queue, the parent qdisc qlen and backlog counters are updated explicitly to keep overall qdisc statistics consistent. Fixes: `320d031ad6` ("sched: Struct definition and parsing of dualpi2 qdisc") Reported-by: "Kito Xu (veritas501)" <hxzene@gmail.com> Closes: https://lore.kernel.org/netdev/20260413075740.2234828-1-hxzene@gmail.com/ Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Link: https://patch.msgid.link/20260417152551.71648-1-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-21 15:00:39 +02:00
Lorenzo Bianconi	d647f25452	net: airoha: Fix PPE cpu port configuration for GDM2 loopback path When QoS loopback is enabled for GDM3 or GDM4, incoming packets are forwarded to GDM2. However, the PPE cpu port for GDM2 is not configured in this path, causing traffic originating from GDM3/GDM4, which may be set up as WAN ports backed by QDMA1, to be incorrectly directed to QDMA0 instead. Configure the PPE cpu port for GDM2 when QoS loopback is active on GDM3 or GDM4 to ensure traffic is routed to the correct QDMA instance. Fixes: `9cd451d414` ("net: airoha: Add loopback support for GDM2") Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://patch.msgid.link/20260417-airoha-ppe-cpu-port-for-gdm2-loopback-v1-1-c7a9de0f6f57@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-21 14:46:22 +02:00
Paolo Abeni	edaa48dc2c	Merge branch 'net-sleepable-ndo_set_rx_mode' Stanislav Fomichev says: ==================== net: sleepable ndo_set_rx_mode This series adds a new ndo_set_rx_mode_async callback that enables drivers to handle address list updates in a sleepable context. The current ndo_set_rx_mode is called under the netif_addr_lock spinlock with BHs disabled, which prevents drivers from sleeping. This is problematic for ops-locked drivers that need to sleep. The approach: 1. Add snapshot/reconcile infrastructure for address lists 2. Introduce dev_rx_mode_work that takes snapshots under the lock, drops the lock, calls the driver, then reconciles changes back 3. Move promiscuity handling into the scheduled work as well 4. Convert existing ops-locked drivers to ndo_set_rx_mode_async 5. Add a warning for ops-locked drivers still using ndo_set_rx_mode 6. Add a selftest exercising the team+bridge+macvlan topology that triggers the addr_lock -> ops_lock ordering issue ==================== Link: https://patch.msgid.link/20260416185712.2155425-1-sdf@fomichev.me Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-21 12:50:26 +02:00
Stanislav Fomichev	c4dde411bc	selftests: net: use ip commands instead of teamd in team rx_mode test Replace teamd daemon usage with ip link commands for team device setup. teamd -d daemonizes and returns to the shell before port addition completes, creating a race: the test may create the macvlan (and check for its address on a slave) before teamd has finished adding ports. This makes the test inherently dependent on scheduling timing. Using ip commands makes port addition synchronous, removing the race and making the test deterministic. Cc: Jiri Pirko <jiri@resnulli.us> Cc: Jay Vosburgh <jv@jvosburgh.net> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260416185712.2155425-16-sdf@fomichev.me Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-21 12:50:25 +02:00
Stanislav Fomichev	ee514cdb07	selftests: net: add team_bridge_macvlan rx_mode test Add a test that exercises the ndo_change_rx_flags path through a macvlan -> bridge -> team -> dummy stack. This triggers dev_uc_add under addr_list_lock which flips promiscuity on the lower device. With the new work queue approach, this must not deadlock. Link: https://lore.kernel.org/netdev/20260214033859.43857-1-jiayuan.chen@linux.dev/ Reviewed-by: Breno Leitao <leitao@debian.org> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260416185712.2155425-15-sdf@fomichev.me Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-21 12:50:25 +02:00
Stanislav Fomichev	3cbd229388	net: warn ops-locked drivers still using ndo_set_rx_mode Now that all in-tree ops-locked drivers have been converted to ndo_set_rx_mode_async, add a warning in register_netdevice to catch any remaining or newly added drivers that use ndo_set_rx_mode with ops locking. This ensures future driver authors are guided toward the async path. Also route ops-locked devices through netdev_rx_mode_work even if they lack rx_mode NDOs, to ensure netdev_ops_assert_locked() does not fire on the legacy path where only RTNL is held. Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260416185712.2155425-14-sdf@fomichev.me Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-21 12:50:25 +02:00
Stanislav Fomichev	754b7e1169	netkit: convert to ndo_set_rx_mode_async Convert netkit driver from ndo_set_rx_mode to ndo_set_rx_mode_async. The netkit driver's set_multicast_list is a no-op, presumably for the same reason as the one in dummy (to fake multicast ability). Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260416185712.2155425-13-sdf@fomichev.me Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-21 12:50:25 +02:00
Stanislav Fomichev	4d157e89bd	dummy: convert to ndo_set_rx_mode_async Convert dummy driver from ndo_set_rx_mode to ndo_set_rx_mode_async. The dummy driver's set_multicast_list is a no-op, so the conversion is straightforward: update the signature and the ops assignment. Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260416185712.2155425-12-sdf@fomichev.me Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-21 12:50:25 +02:00
Stanislav Fomichev	8a5df09e70	netdevsim: convert to ndo_set_rx_mode_async Convert netdevsim from ndo_set_rx_mode to ndo_set_rx_mode_async. The callback is a no-op stub so just update the signature and ops struct wiring. Reviewed-by: Breno Leitao <leitao@debian.org> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260416185712.2155425-11-sdf@fomichev.me Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-21 12:50:25 +02:00
Stanislav Fomichev	d071c15b43	iavf: convert to ndo_set_rx_mode_async Convert iavf from ndo_set_rx_mode to ndo_set_rx_mode_async. iavf_set_rx_mode now takes explicit uc/mc list parameters and uses __hw_addr_sync_dev on the snapshots instead of __dev_uc_sync and __dev_mc_sync. The iavf_configure internal caller passes the real lists directly. Cc: Tony Nguyen <anthony.l.nguyen@intel.com> Cc: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260416185712.2155425-10-sdf@fomichev.me Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-21 12:50:25 +02:00
Stanislav Fomichev	a453b5d9b3	bnxt: use snapshot in bnxt_cfg_rx_mode With the introduction of ndo_set_rx_mode_async (as discussed in [1]) we can call bnxt_cfg_rx_mode directly. Convert bnxt_cfg_rx_mode to use uc/mc snapshots and move its call in bnxt_sp_task to the section that resets BNXT_STATE_IN_SP_TASK. Switch to direct call in bnxt_set_rx_mode. Link: https://lore.kernel.org/netdev/CACKFLi=5vj8hPqEUKDd8RTw3au5G+zRgQEqjF+6NZnyoNm90KA@mail.gmail.com/ [1] Cc: Michael Chan <michael.chan@broadcom.com> Cc: Pavan Chebbi <pavan.chebbi@broadcom.com> Reviewed-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260416185712.2155425-9-sdf@fomichev.me Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-21 12:50:25 +02:00
Stanislav Fomichev	f6c53cfa12	bnxt: convert to ndo_set_rx_mode_async Convert bnxt from ndo_set_rx_mode to ndo_set_rx_mode_async. bnxt_set_rx_mode, bnxt_mc_list_updated and bnxt_uc_list_updated now take explicit uc/mc list parameters and iterate with netdev_hw_addr_list_for_each instead of netdev_for_each_{uc,mc}_addr. The bnxt_cfg_rx_mode internal caller passes the real lists under netif_addr_lock_bh. BNXT_RX_MASK_SP_EVENT is still used here, next patch converts to the direct call. Cc: Michael Chan <michael.chan@broadcom.com> Cc: Pavan Chebbi <pavan.chebbi@broadcom.com> Reviewed-by: Michael Chan <michael.chan@broadcom.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260416185712.2155425-8-sdf@fomichev.me Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-21 12:50:25 +02:00
Stanislav Fomichev	5cf06fbdaf	mlx5: convert to ndo_set_rx_mode_async Convert mlx5 from ndo_set_rx_mode to ndo_set_rx_mode_async. The driver's mlx5e_set_rx_mode now receives uc/mc snapshots and calls mlx5e_fs_set_rx_mode_work directly instead of queueing work. mlx5e_sync_netdev_addr and mlx5e_handle_netdev_addr now take explicit uc/mc list parameters and iterate with netdev_hw_addr_list_for_each instead of netdev_for_each_{uc,mc}_addr. Fallback to netdev's uc/mc in a few places and grab addr lock. Cc: Saeed Mahameed <saeedm@nvidia.com> Cc: Tariq Toukan <tariqt@nvidia.com> Cc: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260416185712.2155425-7-sdf@fomichev.me Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-21 12:50:25 +02:00
Stanislav Fomichev	60dd9781e9	fbnic: convert to ndo_set_rx_mode_async Convert fbnic from ndo_set_rx_mode to ndo_set_rx_mode_async. The driver's __fbnic_set_rx_mode() now takes explicit uc/mc list parameters and uses __hw_addr_sync_dev() on the snapshots instead of __dev_uc_sync/__dev_mc_sync on the netdev directly. Update callers in fbnic_up, fbnic_fw_config_after_crash, fbnic_bmc_rpc_check and fbnic_set_mac to pass the real address lists calling __fbnic_set_rx_mode outside the async work path. Cc: Alexander Duyck <alexanderduyck@fb.com> Cc: kernel-team@meta.com Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260416185712.2155425-6-sdf@fomichev.me Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-21 12:50:24 +02:00
Stanislav Fomichev	7ef83bf171	net: move promiscuity handling into netdev_rx_mode_work Move unicast promiscuity tracking into netdev_rx_mode_work so it runs under netdev_ops_lock instead of under the addr_lock spinlock. This is required because __dev_set_promiscuity calls dev_change_rx_flags and __dev_notify_flags, both of which may need to sleep. Change ASSERT_RTNL() to netdev_ops_assert_locked() in __dev_set_promiscuity, netif_set_allmulti and __dev_change_flags since these are now called from the work queue under the ops lock. Link: https://lore.kernel.org/netdev/20260214033859.43857-1-jiayuan.chen@linux.dev/ Fixes: `78cd408356` ("net: add missing instance lock to dev_set_promiscuity") Reported-by: syzbot+2b3391f44313b3983e91@syzkaller.appspotmail.com Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260416185712.2155425-5-sdf@fomichev.me Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-21 12:50:24 +02:00
Stanislav Fomichev	a4c8332781	net: cache snapshot entries for ndo_set_rx_mode_async Add a per-device netdev_hw_addr_list cache (rx_mode_addr_cache) that allows __hw_addr_list_snapshot() and __hw_addr_list_reconcile() to reuse previously allocated entries instead of hitting GFP_ATOMIC on every snapshot cycle. snapshot pops entries from the cache when available, falling back to __hw_addr_create(). reconcile splices both snapshot lists back into the cache via __hw_addr_splice(). The cache is flushed in free_netdev(). Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260416185712.2155425-4-sdf@fomichev.me Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-21 12:50:13 +02:00
Stanislav Fomichev	3554b4345d	net: introduce ndo_set_rx_mode_async and netdev_rx_mode_work Add an ndo_set_rx_mode_async callback that drivers can implement instead of the legacy ndo_set_rx_mode. The legacy callback runs under the netif_addr_lock spinlock with BHs disabled, preventing drivers from sleeping. The async variant runs from a work queue with rtnl_lock and netdev_lock_ops held, in a fully sleepable context. When __dev_set_rx_mode() sees ndo_set_rx_mode_async, it schedules netdev_rx_mode_work instead of calling the driver inline. The work function takes two snapshots of each address list (uc/mc) under the addr_lock, then drops the lock and calls the driver with the work copies. After the driver returns, it reconciles the snapshots back to the real lists under the lock. Add netif_rx_mode_sync() to opportunistically execute the pending workqueue update inline, so that rx mode changes are committed before returning to userspace: - dev_change_flags (SIOCSIFFLAGS / RTM_NEWLINK) - dev_set_promiscuity - dev_set_allmulti - dev_ifsioc SIOCADDMULTI / SIOCDELMULTI - do_setlink (RTM_SETLINK) Note that some deep hierarchies still skip the lower updates via: - dev_uc_sync - dev_mc_sync If we do end up hitting user-visible issues, we can add more calls to netif_rx_mode_sync in specific places. But hopefully we will not need to: the actual user-visible lists are still synced, and it is just the HW state that might be lagging. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260416185712.2155425-3-sdf@fomichev.me Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-21 12:50:03 +02:00
Stanislav Fomichev	db9e726525	net: add address list snapshot and reconciliation infrastructure Introduce __hw_addr_list_snapshot() and __hw_addr_list_reconcile() for use by the upcoming ndo_set_rx_mode_async callback. The async rx_mode path needs to snapshot the device's unicast and multicast address lists under the addr_lock, hand those snapshots to the driver (which may sleep), and then propagate any sync_cnt changes back to the real lists. Two identical snapshots are taken: a work copy for the driver to pass to __hw_addr_sync_dev() and a reference copy to compute deltas against. __hw_addr_list_reconcile() walks the reference snapshot comparing each entry against the work snapshot to determine what the driver synced or unsynced. It then applies those deltas to the real list, handling concurrent modifications: - If the real entry was concurrently removed but the driver synced it to hardware (delta > 0), re-insert a stale entry so the next work run properly unsyncs it from hardware. - If the entry still exists, apply the delta normally. An entry whose refcount drops to zero is removed. # dev_addr_test_snapshot_benchmark: 1024 addrs x 1000 snapshots: 89872802 ns total, 89872 ns/iter # dev_addr_test_snapshot_benchmark.speed: slow Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260416185712.2155425-2-sdf@fomichev.me Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-21 12:50:03 +02:00
Weiming Shi	4c1367a2d7	slip: bound decode() reads against the compressed packet length slhc_uncompress() parses a VJ-compressed TCP header by advancing a pointer through the packet via decode() and pull16(). Neither helper bounds-checks against isize, and decode() masks its return with & 0xffff so it can never return the -1 that callers test for -- those error paths are dead code. A short compressed frame whose change byte requests optional fields lets decode() read past the end of the packet. The over-read bytes are folded into the cached cstate and reflected into subsequent reconstructed packets. Make decode() and pull16() take the packet end pointer and return -1 when exhausted. Add a bounds check before the TCP-checksum read. The existing == -1 tests now do what they were always meant to. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Reported-by: Simon Horman <horms@kernel.org> Closes: https://lore.kernel.org/netdev/20260414134126.758795-2-horms@kernel.org/ Signed-off-by: Weiming Shi <bestswngs@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260416100147.531855-5-bestswngs@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-21 10:18:18 +02:00
Weiming Shi	e76607442d	slip: reject VJ receive packets on instances with no rstate array slhc_init() accepts rslots == 0 as a valid configuration, with the documented meaning of 'no receive compression'. In that case the allocation loop in slhc_init() is skipped, so comp->rstate stays NULL and comp->rslot_limit stays 0 (from the kzalloc of struct slcompress). The receive helpers do not defend against that configuration. slhc_uncompress() dereferences comp->rstate[x] when the VJ header carries an explicit connection ID, and slhc_remember() later assigns cs = &comp->rstate[...] after only comparing the packet's slot number to comp->rslot_limit. Because rslot_limit is 0, slot 0 passes the range check, and the code dereferences a NULL rstate. The configuration is reachable in-tree through PPP. PPPIOCSMAXCID stores its argument in a signed int, and (val >> 16) uses arithmetic shift. Passing 0xffff0000 therefore sign-extends to -1, so val2 + 1 is 0 and ppp_generic.c ends up calling slhc_init(0, 1). Because /dev/ppp open is gated by ns_capable(CAP_NET_ADMIN), the whole path is reachable from an unprivileged user namespace. 
Once the malformed VJ state is installed, any inbound VJ-compressed or VJ-uncompressed frame that selects slot 0 crashes the kernel in softirq context: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] RIP: 0010:slhc_uncompress (drivers/net/slip/slhc.c:519) Call Trace: <TASK> ppp_receive_nonmp_frame (drivers/net/ppp/ppp_generic.c:2466) ppp_input (drivers/net/ppp/ppp_generic.c:2359) ppp_async_process (drivers/net/ppp/ppp_async.c:492) tasklet_action_common (kernel/softirq.c:926) handle_softirqs (kernel/softirq.c:623) run_ksoftirqd (kernel/softirq.c:1055) smpboot_thread_fn (kernel/smpboot.c:160) kthread (kernel/kthread.c:436) ret_from_fork (arch/x86/kernel/process.c:164) </TASK> Reject the receive side on such instances instead of touching rstate. slhc_uncompress() falls through to its existing 'bad' label, which bumps sls_i_error and enters the toss state. slhc_remember() mirrors that with an explicit sls_i_error increment followed by slhc_toss(); the sls_i_runt counter is not used here because a missing rstate is an internal configuration state, not a runt packet. The transmit path is unaffected: the only in-tree caller that picks rslots from userspace (ppp_generic.c) still supplies tslots >= 1, and slip.c always calls slhc_init(16, 16), so comp->tstate remains valid and slhc_compress() continues to work. Fixes: `4ab42d78e3` ("ppp, slip: Validate VJ compression slot parameters completely") Reported-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Weiming Shi <bestswngs@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260415204130.258866-2-bestswngs@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-21 09:51:40 +02:00
Yuan Zhaoming	a663bac71a	net: mctp: fix don't require received header reserved bits to be zero From the MCTP Base specification (DSP0236 v1.2.1), the first byte of the MCTP header contains a 4-bit reserved field and a 4-bit version. On our current receive path, we require those 4 reserved bits to be zero, but the 9500-8i card is non-conformant, and may set these reserved bits. DSP0236 states that the reserved bits must be written as zero, and ignored when read. While the device might not conform to the former, we should accept these messages to conform to the latter. Relax our check on the MCTP version byte to allow non-zero bits in the reserved field. Fixes: `889b7da23a` ("mctp: Add initial routing framework") Signed-off-by: Yuan Zhaoming <yuanzm2@lenovo.com> Cc: stable@vger.kernel.org Acked-by: Jeremy Kerr <jk@codeconstruct.com.au> Link: https://patch.msgid.link/20260417141340.5306-1-yuanzhaoming901030@126.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-20 11:46:57 -07:00
David Carlier	5638504a2a	gtp: disable BH before calling udp_tunnel_xmit_skb() gtp_genl_send_echo_req() runs as a generic netlink doit handler in process context with BH not disabled. It calls udp_tunnel_xmit_skb(), which eventually invokes iptunnel_xmit() — that uses __this_cpu_inc/dec on softnet_data.xmit.recursion to track the tunnel xmit recursion level. Without local_bh_disable(), the task may migrate between dev_xmit_recursion_inc() and dev_xmit_recursion_dec(), breaking the per-CPU counter pairing. The result is stale or negative recursion levels that can later produce false-positive SKB_DROP_REASON_RECURSION_LIMIT drops on either CPU. The other udp_tunnel_xmit_skb() call sites in gtp.c are unaffected: the data path runs under ndo_start_xmit and the echo response handlers run from the UDP encap rx softirq, both with BH already disabled. Fix it by disabling BH around the udp_tunnel_xmit_skb() call, mirroring commit `2cd7e6971f` ("sctp: disable BH before calling udp_tunnel_xmit_skb()"). Fixes: `6f1a9140ec` ("net: add xmit recursion limit to tunnel xmit functions") Cc: stable@vger.kernel.org Signed-off-by: David Carlier <devnexen@gmail.com> Link: https://patch.msgid.link/20260417055408.4667-1-devnexen@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-20 11:46:24 -07:00
Dexuan Cui	f631529589	hv_sock: Report EOF instead of -EIO for FIN Commit `f0c5827d07` unluckily causes a regression for the FIN packet, and the final read syscall gets an error rather than 0. Ideally, we would want to fix hvs_channel_readable_payload() so that it could return 0 in the FIN scenario, but it's not good for the hv_sock driver to use the VMBus ringbuffer's cached priv_read_index, which is internal data in the VMBus driver. Fix the regression in hv_sock by returning 0 rather than -EIO. Fixes: `f0c5827d07` ("hv_sock: Return the readable bytes in hvs_stream_has_data()") Cc: stable@vger.kernel.org Reported-by: Ben Hillis <Ben.Hillis@microsoft.com> Reported-by: Mitchell Levy <levymitchell0@gmail.com> Signed-off-by: Dexuan Cui <decui@microsoft.com> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20260416191433.840637-1-decui@microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-20 11:44:43 -07:00
Lorenzo Bianconi	b94769eb2f	net: airoha: Fix possible TX queue stall in airoha_qdma_tx_napi_poll() Since multiple net_device TX queues can share the same hw QDMA TX queue, there is no guarantee that a net_device TX queue stopped in the xmit path because the hw QDMA TX queue was full still has in-flight packets queued in hw. In this corner case the net_device TX queue will never be re-activated. In order to avoid any potential net_device TX queue stall, we need to wake all the net_device TX queues feeding the same hw QDMA TX queue in the airoha_qdma_tx_napi_poll routine. Fixes: `23020f0493` ("net: airoha: Introduce ethernet support for EN7581 SoC") Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260416-airoha-txq-potential-stall-v2-1-42c732074540@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-20 11:43:40 -07:00
Weiming Shi	2091c6aa0d	openvswitch: cap upcall PID array size and pre-size vport replies The vport netlink reply helpers allocate a fixed-size skb with nlmsg_new(NLMSG_DEFAULT_SIZE, ...) but serialize the full upcall PID array via ovs_vport_get_upcall_portids(). Since ovs_vport_set_upcall_portids() accepts any non-zero multiple of sizeof(u32) with no upper bound, a CAP_NET_ADMIN user can install a PID array large enough to overflow the reply buffer, causing nla_put() to fail with -EMSGSIZE and hitting BUG_ON(err < 0). On systems with unprivileged user namespaces enabled (e.g., Ubuntu default), this is reachable via unshare -Urn since OVS vport mutation operations use GENL_UNS_ADMIN_PERM. kernel BUG at net/openvswitch/datapath.c:2414! Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI CPU: 1 UID: 0 PID: 65 Comm: poc Not tainted 7.0.0-rc7-00195-geb216e422044 #1 RIP: 0010:ovs_vport_cmd_set+0x34c/0x400 Call Trace: <TASK> genl_family_rcv_msg_doit (net/netlink/genetlink.c:1116) genl_rcv_msg (net/netlink/genetlink.c:1194) netlink_rcv_skb (net/netlink/af_netlink.c:2550) genl_rcv (net/netlink/genetlink.c:1219) netlink_unicast (net/netlink/af_netlink.c:1344) netlink_sendmsg (net/netlink/af_netlink.c:1894) __sys_sendto (net/socket.c:2206) __x64_sys_sendto (net/socket.c:2209) do_syscall_64 (arch/x86/entry/syscall_64.c:63) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) </TASK> Kernel panic - not syncing: Fatal exception Reject attempts to set more PIDs than nr_cpu_ids in ovs_vport_set_upcall_portids(), and pre-compute the worst-case reply size in ovs_vport_cmd_msg_size() based on that bound, similar to the existing ovs_dp_cmd_msg_size(). nr_cpu_ids matches the cap already used by the per-CPU dispatch configuration on the datapath side (ovs_dp_cmd_fill_info() serialises at most nr_cpu_ids PIDs), so the two sides stay consistent. 
Fixes: `5cd667b0a4` ("openvswitch: Allow each vport to have an array of 'port_id's.") Reported-by: Xiang Mei <xmei5@asu.edu> Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Weiming Shi <bestswngs@gmail.com> Reviewed-by: Ilya Maximets <i.maximets@ovn.org> Link: https://patch.msgid.link/20260416024653.153456-2-bestswngs@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-20 11:43:04 -07:00
Prathamesh Deshpande	d03fc81a57	net/mlx5: Fix HCA caps leak on notifier init failure mlx5_mdev_init() allocates HCA caps via mlx5_hca_caps_alloc() before calling mlx5_notifiers_init(). If notifier initialization fails, the error path jumps to err_hca_caps and skips mlx5_hca_caps_free(), leaking allocated caps. Add a dedicated unwind label for notifier-init failure that frees HCA caps before continuing the existing cleanup sequence. Fixes: `b6b03097f9` ("net/mlx5: Initialize events outside devlink lock") Signed-off-by: Prathamesh Deshpande <prathameshdeshpande7@gmail.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260415005022.34764-1-prathameshdeshpande7@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-20 11:42:30 -07:00
Qingfang Deng	cc1ff87bce	pppoe: drop PFC frames RFC 2516 Section 7 states that Protocol Field Compression (PFC) is NOT RECOMMENDED for PPPoE. In practice, pppd does not support negotiating PFC for PPPoE sessions, and the current PPPoE driver assumes an uncompressed (2-byte) protocol field. However, the generic PPP layer function ppp_input() is not aware of the negotiation result, and still accepts PFC frames. If a peer with a broken implementation or an attacker sends a frame with a compressed (1-byte) protocol field, the subsequent PPP payload is shifted by one byte. This causes the network header to be 4-byte misaligned, which may trigger unaligned access exceptions on some architectures. To reduce the attack surface, drop PPPoE PFC frames. Introduce ppp_skb_is_compressed_proto() helper function to be used in both ppp_generic.c and pppoe.c to avoid open-coding. Fixes: `7fb1b8ca8f` ("ppp: Move PFC decompression to PPP generic layer") Signed-off-by: Qingfang Deng <qingfang.deng@linux.dev> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260415022456.141758-2-qingfang.deng@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-20 11:35:17 -07:00
Qingfang Deng	d6c19b31a3	flow_dissector: do not dissect PPPoE PFC frames RFC 2516 Section 7 states that Protocol Field Compression (PFC) is NOT RECOMMENDED for PPPoE. In practice, pppd does not support negotiating PFC for PPPoE sessions, and the flow dissector driver has assumed an uncompressed frame until the blamed commit. During the review process of that commit [1], support for PFC was suggested. However, having a compressed (1-byte) protocol field means the subsequent PPP payload is shifted by one byte, causing 4-byte misalignment for the network header and an unaligned access exception on some architectures. The exception can be reproduced by sending a PPPoE PFC frame to an ethernet interface of a MIPS board, with RPS enabled, even if no PPPoE session is active on that interface: $ 0 : 00000000 80c40000 00000000 85144817 $ 4 : 00000008 00000100 80a75758 81dc9bb8 $ 8 : 00000010 8087ae2c 0000003d 00000000 $12 : 000000e0 00000039 00000000 00000000 $16 : 85043240 80a75758 81dc9bb8 00006488 $20 : 0000002f 00000007 85144810 80a70000 $24 : 81d1bda0 00000000 $28 : 81dc8000 81dc9aa8 00000000 805ead08 Hi : 00009d51 Lo : 2163358a epc : 805e91f0 __skb_flow_dissect+0x1b0/0x1b50 ra : 805ead08 __skb_get_hash_net+0x74/0x12c Status: 11000403 KERNEL EXL IE Cause : 40800010 (ExcCode 04) BadVA : 85144817 PrId : 0001992f (MIPS 1004Kc) Call Trace: [<805e91f0>] __skb_flow_dissect+0x1b0/0x1b50 [<805ead08>] __skb_get_hash_net+0x74/0x12c [<805ef330>] get_rps_cpu+0x1b8/0x3fc [<805fca70>] netif_receive_skb_list_internal+0x324/0x364 [<805fd120>] napi_complete_done+0x68/0x2a4 [<8058de5c>] mtk_napi_rx+0x228/0xfec [<805fd398>] __napi_poll+0x3c/0x1c4 [<805fd754>] napi_threaded_poll_loop+0x234/0x29c [<805fd848>] napi_threaded_poll+0x8c/0xb0 [<80053544>] kthread+0x104/0x12c [<80002bd8>] ret_from_kernel_thread+0x14/0x1c Code: 02d51821 1060045b 00000000 <8c640000> 3084000f 2c820005 144001a2 00042080 8e220000 To reduce the attack surface and maintain performance, do not process PPPoE PFC frames.
[1] https://lore.kernel.org/r/20220630231016.GA392@debian.home Fixes: `46126db9c8` ("flow_dissector: Add PPPoE dissectors") Signed-off-by: Qingfang Deng <qingfang.deng@linux.dev> Link: https://patch.msgid.link/20260415022456.141758-1-qingfang.deng@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-20 11:35:16 -07:00
Michael Bommarito	0cf004ffb6	sctp: fix OOB write to userspace in sctp_getsockopt_peer_auth_chunks sctp_getsockopt_peer_auth_chunks() checks that the caller's optval buffer is large enough for the peer AUTH chunk list with if (len < num_chunks) return -EINVAL; but then writes num_chunks bytes to p->gauth_chunks, which lives at offset offsetof(struct sctp_authchunks, gauth_chunks) == 8 inside optval. The check is missing the sizeof(struct sctp_authchunks) = 8-byte header. When the caller supplies len == num_chunks (for any num_chunks > 0) the test passes but copy_to_user() writes sizeof(struct sctp_authchunks) = 8 bytes past the declared buffer. The sibling function sctp_getsockopt_local_auth_chunks() at the next line already has the correct check: if (len < sizeof(struct sctp_authchunks) + num_chunks) return -EINVAL; Align the peer variant with its sibling. Reproducer confirms on v7.0-13-generic: an unprivileged userspace caller that opens a loopback SCTP association with AUTH enabled, queries num_chunks with a short optval, then issues the real getsockopt with len == num_chunks and sentinel bytes painted past the buffer observes those sentinel bytes overwritten with the peer's AUTH chunk type. The bytes written are under the peer's control but land in the caller's own userspace; this is not a kernel memory corruption, but it is a kernel-side contract violation that can silently corrupt adjacent userspace data. Fixes: `65b07e5d0d` ("[SCTP]: API updates to suport SCTP-AUTH extensions.") Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Acked-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/20260416031903.1447072-1-michael.bommarito@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-18 12:16:14 -07:00
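A minimal userspace sketch of the corrected length check, assuming a stand-in struct that mirrors the layout described above (an 8-byte header followed by the chunk list). The struct, field, and function names are illustrative and are not the SCTP UAPI definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the layout described in the commit: 8-byte header, then data. */
struct authchunks {
	uint32_t assoc_id;
	uint32_t number_of_chunks;
	uint8_t chunks[];
};

/* Copy num_chunks bytes into the caller's buffer only if the buffer holds
 * the header plus the chunk list, as the sibling function already checks. */
static int copy_auth_chunks(void *optval, size_t len,
			    const uint8_t *src, uint32_t num_chunks)
{
	struct authchunks *p = optval;

	assert(optval != NULL && src != NULL);
	if (len < sizeof(struct authchunks) + num_chunks)
		return -EINVAL;

	p->number_of_chunks = num_chunks;
	memcpy(p->chunks, src, num_chunks);
	return 0;
}
```

Checking only `len < num_chunks` would accept a buffer of exactly num_chunks bytes even though the write begins 8 bytes into it.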
Marek Vasut	22230e68b2	net: ks8851: Avoid excess softirq scheduling The code injects a packet into netif_rx() repeatedly, which will add it to its internal NAPI and schedule a softirq, and process it. It is more efficient to queue multiple packets and process them all at the local_bh_enable() time. Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Fixes: `e0863634bf` ("net: ks8851: Queue RX packets in IRQ handler instead of disabling BHs") Cc: stable@vger.kernel.org Signed-off-by: Marek Vasut <marex@nabladev.com> Link: https://patch.msgid.link/20260415231020.455298-2-marex@nabladev.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-18 12:14:19 -07:00
Marek Vasut	5c9fcac3c8	net: ks8851: Reinstate disabling of BHs around IRQ handler If the driver executes ks8851_irq() AND a TX packet has been sent, then the driver enables TX queue via netif_wake_queue() which schedules TX softirq to queue packets for this device. If CONFIG_PREEMPT_RT=y is set AND a packet has also been received by the MAC, then ks8851_rx_pkts() calls netdev_alloc_skb_ip_align() to allocate SKBs for the received packets. If netdev_alloc_skb_ip_align() is called with BH enabled, then local_bh_enable() at the end of netdev_alloc_skb_ip_align() will trigger the pending softirq processing, which may ultimately call the .xmit callback ks8851_start_xmit_par(). The ks8851_start_xmit_par() will try to lock struct ks8851_net_par .lock spinlock, which is already locked by ks8851_irq() from which ks8851_start_xmit_par() was called. This leads to a deadlock, which is reported by the kernel, including a trace listed below. If CONFIG_PREEMPT_RT is not set, then since commit `0913ec336a` ("net: ks8851: Fix deadlock with the SPI chip variant") the deadlock can also be triggered without received packet in the RX FIFO. The pending softirqs will be processed on return from spin_unlock_bh(&ks->statelock) in ks8851_irq(), which triggers the deadlock as well. Fix the problem by disabling BH around critical sections, including the IRQ handler, thus preventing the net_tx_action() softirq from triggering during these critical sections. The net_tx_action() softirq is triggered once BH are re-enabled and at the end of the IRQ handler, once all the other IRQ handler actions have been completed. 
__schedule from schedule_rtlock+0x1c/0x34 schedule_rtlock from rtlock_slowlock_locked+0x548/0x904 rtlock_slowlock_locked from rt_spin_lock+0x60/0x9c rt_spin_lock from ks8851_start_xmit_par+0x74/0x1a8 ks8851_start_xmit_par from netdev_start_xmit+0x20/0x44 netdev_start_xmit from dev_hard_start_xmit+0xd0/0x188 dev_hard_start_xmit from sch_direct_xmit+0xb8/0x25c sch_direct_xmit from __qdisc_run+0x1f8/0x4ec __qdisc_run from qdisc_run+0x1c/0x28 qdisc_run from net_tx_action+0x1f0/0x268 net_tx_action from handle_softirqs+0x1a4/0x270 handle_softirqs from __local_bh_enable_ip+0xcc/0xe0 __local_bh_enable_ip from __alloc_skb+0xd8/0x128 __alloc_skb from __netdev_alloc_skb+0x3c/0x19c __netdev_alloc_skb from ks8851_irq+0x388/0x4d4 ks8851_irq from irq_thread_fn+0x24/0x64 irq_thread_fn from irq_thread+0x178/0x28c irq_thread from kthread+0x12c/0x138 kthread from ret_from_fork+0x14/0x28 Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Fixes: `e0863634bf` ("net: ks8851: Queue RX packets in IRQ handler instead of disabling BHs") Cc: stable@vger.kernel.org Signed-off-by: Marek Vasut <marex@nabladev.com> Link: https://patch.msgid.link/20260415231020.455298-1-marex@nabladev.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-18 12:14:19 -07:00
Kuniyuki Iwashima	965dc93481	af_unix: Drop all SCM attributes for SOCKMAP. SOCKMAP can hide inflight fd from AF_UNIX GC. When a socket in SOCKMAP receives skb with inflight fd, sk_psock_verdict_data_ready() looks up the mapped socket and enqueue skb to its psock->ingress_skb. Since neither the old nor the new GC can inspect the psock queue, the hidden skb leaks the inflight sockets. Note that this cannot be detected via kmemleak because inflight sockets are linked to a global list. In addition, SOCKMAP redirect breaks the Tarjan-based GC's assumption that unix_edge.successor is always alive, which is no longer true once skb is redirected, resulting in use-after-free below. [0] Moreover, SOCKMAP does not call scm_stat_del() properly, so unix_show_fdinfo() could report an incorrect fd count. sk_msg_recvmsg() does not support any SCM attributes in the first place. Let's drop all SCM attributes before passing skb to the SOCKMAP layer. [0]: BUG: KASAN: slab-use-after-free in unix_del_edges (net/unix/garbage.c:118 net/unix/garbage.c:181 net/unix/garbage.c:251) Read of size 8 at addr ffff888125362670 by task kworker/56:1/496 CPU: 56 UID: 0 PID: 496 Comm: kworker/56:1 Not tainted 7.0.0-rc7-00263-gb9d8b856689d #3 PREEMPT(lazy) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014 Workqueue: events sk_psock_backlog Call Trace: <TASK> dump_stack_lvl (lib/dump_stack.c:122) print_report (mm/kasan/report.c:379) kasan_report (mm/kasan/report.c:597) unix_del_edges (net/unix/garbage.c:118 net/unix/garbage.c:181 net/unix/garbage.c:251) unix_destroy_fpl (net/unix/garbage.c:317) unix_destruct_scm (./include/net/scm.h:80 ./include/net/scm.h:86 net/unix/af_unix.c:1976) sk_psock_backlog (./include/linux/skbuff.h:?) process_scheduled_works (kernel/workqueue.c:?) worker_thread (kernel/workqueue.c:?) 
kthread (kernel/kthread.c:438) ret_from_fork (arch/x86/kernel/process.c:164) ret_from_fork_asm (arch/x86/entry/entry_64.S:258) </TASK> Allocated by task 955: kasan_save_track (mm/kasan/common.c:58 mm/kasan/common.c:78) __kasan_slab_alloc (mm/kasan/common.c:369) kmem_cache_alloc_noprof (mm/slub.c:4539) sk_prot_alloc (net/core/sock.c:2240) sk_alloc (net/core/sock.c:2301) unix_create1 (net/unix/af_unix.c:1099) unix_create (net/unix/af_unix.c:1169) __sock_create (net/socket.c:1606) __sys_socketpair (net/socket.c:1811) __x64_sys_socketpair (net/socket.c:1863 net/socket.c:1860 net/socket.c:1860) do_syscall_64 (arch/x86/entry/syscall_64.c:?) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) Freed by task 496: kasan_save_track (mm/kasan/common.c:58 mm/kasan/common.c:78) kasan_save_free_info (mm/kasan/generic.c:587) __kasan_slab_free (mm/kasan/common.c:287) kmem_cache_free (mm/slub.c:6165) __sk_destruct (net/core/sock.c:2282 net/core/sock.c:2384) sk_psock_destroy (./include/net/sock.h:?) process_scheduled_works (kernel/workqueue.c:?) worker_thread (kernel/workqueue.c:?) kthread (kernel/kthread.c:438) ret_from_fork (arch/x86/kernel/process.c:164) ret_from_fork_asm (arch/x86/entry/entry_64.S:258) Fixes: `c63829182c` ("af_unix: Implement ->psock_update_sk_prot()") Fixes: `77462de14a` ("af_unix: Add read_sock for stream socket types") Reported-by: Xingyu Jin <xingyuj@google.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260415184830.3988432-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-18 12:12:28 -07:00
KhaiWenTan	8cff9dbe89	net: stmmac: Update default_an_inband before passing value to phylink_config get_interfaces() updates both plat->phy_interfaces and mdio_bus_data->default_an_inband based on a SERDES register read. Because get_interfaces() was called after default_an_inband had already been read, dwmac-intel regressed: phylink_config received an incorrect default_an_inband value. Move the priv->plat->get_interfaces() call so that it runs before priv->plat->default_an_inband is assigned to config->default_an_inband, ensuring that the value passed to phylink_config is correct. Fixes: `d3836052fe` ("net: stmmac: intel: convert speed_mode_2500() to get_interfaces()") Signed-off-by: KhaiWenTan <khai.wen.tan@linux.intel.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/20260416102609.7953-1-khai.wen.tan@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-18 12:10:16 -07:00
Eric Dumazet	f996edd761	ipv6: fix possible UAF in icmpv6_rcv() Caching saddr and daddr before pskb_pull() is problematic since skb->head can change. Remove these temporary variables: - We only access &ipv6_hdr(skb)->saddr and &ipv6_hdr(skb)->daddr when net_dbg_ratelimited() is called in the slow path. - Avoid potential future misuse after pskb_pull() call. Fixes: `4b3418fba0` ("ipv6: icmp: include addresses in debug messages") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de> Reviewed-by: Joe Damato <joe@dama.to> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260416103505.2380753-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-18 12:09:52 -07:00
Jakub Kicinski	dcf6d5e629	Merge branch 'intel-wired-lan-driver-updates-2026-04-14-ice-i40e-iavf-idpf-e1000e' Jacob Keller says: ==================== Intel Wired LAN Driver Updates 2026-04-14 (ice, i40e, iavf, e1000e) Grzegorz updates the logic for adjusting the PTP hardware clock on E830, fixing a bug that prevented adjustments below S32_MAX/MIN nanoseconds. Grzegorz and Zoli update the PCS latency settings for E825 devices at 10GbE and 25GbE, improving the accuracy of timestamps based on data from production hardware. Michal Schmidt fixes a double-free that could happen if a particular error path is taken in ice_xmit_frame_ring(). Guangshuo fixes a double-free that could happen during error paths in the ice_sf_eth_activate() function. Paul Greenwalt fixes the PHY link configuration when the link-down-on-close driver parameter is enabled and new media is inserted. Paul Greenwalt fixes the ICE_AQ_LINK_SPEED_M macro for 200G, enabling 200G link speed advertisement. Keita Morisaki fixes a race condition in the ice Tx timestamp ring cleanup, preventing a possible NULL pointer dereference. Kohei Enju fixes a potential NULL pointer dereference in ice_set_ring_param(). Kohei Enju fixes i40e to stop advertising IFF_SUPP_NOFCS, when the driver does not actually support the feature. Petr fixes the VLAN L2TAG2 mask when the iAVF VF and a PF negotiate use of the legacy Rx descriptor format. Matt fixes the unrolling logic for PTP when the e1000e probe fails after the PTP clock has been registered. A note to stable backports The patches [7/12] ("ice: fix race condition in TX timestamp ring cleanup") and [8/12] ("ice: fix potential NULL pointer deref in error path of ice_set_ringparam()") must be backported together. Otherwise the fix in patch 8 will not work properly. ==================== Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-0-686c33c9828d@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-18 12:01:42 -07:00
Matt Vollrath	aa3f7fe409	e1000e: Unroll PTP in probe error handling If probe fails after registering the PTP clock and its delayed work, these resources must be released. This was not an issue until a 2016 fix moved the e1000e_ptp_init() call before the jump to err_register. Fixes: `aa524b66c5` ("e1000e: don't modify SYSTIM registers during SIOCSHWTSTAMP ioctl") Signed-off-by: Matt Vollrath <tactii@gmail.com> Tested-by: Avigail Dahan <avigailx.dahan@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-12-686c33c9828d@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-18 12:01:41 -07:00
Petr Oros	496d9f9106	iavf: fix wrong VLAN mask for legacy Rx descriptors L2TAG2 The IAVF_RXD_LEGACY_L2TAG2_M mask was incorrectly defined as GENMASK_ULL(63, 32), extracting 32 bits from qw2 instead of the 16-bit VLAN tag. In the legacy Rx descriptor layout, the 2nd L2TAG2 (VLAN tag) occupies bits 63:48 of qw2, not 63:32. The oversized mask causes FIELD_GET to return a 32-bit value where the actual VLAN tag sits in bits 31:16. When this value is passed to iavf_receive_skb() as a u16 parameter, it gets truncated to the lower 16 bits (which contain the 1st L2TAG2, typically zero). As a result, __vlan_hwaccel_put_tag() is never called and software VLAN interfaces on VFs receive no traffic. This affects VFs behind ice PF (VIRTCHNL VLAN v2) when the PF advertises VLAN stripping into L2TAG2_2 and legacy descriptors are used. The flex descriptor path already uses the correct mask (IAVF_RXD_FLEX_L2TAG2_2_M = GENMASK_ULL(63, 48)). Reproducer: 1. Create 2 VFs on ice PF (echo 2 > sriov_numvfs) 2. Disable spoofchk on both VFs 3. Move each VF into a separate network namespace 4. On each VF: create VLAN interface (e.g. vlan 198), assign IP, bring up 5. Set rx-vlan-offload OFF on both VFs 6. Ping between VLAN interfaces -> expect PASS (VLAN tag stays in packet data, kernel matches in-band) 7. Set rx-vlan-offload ON on both VFs 8. Ping between VLAN interfaces -> expect FAIL if bug present (HW strips VLAN tag into descriptor L2TAG2 field, wrong mask extracts bits 47:32 instead of 63:48, truncated to u16 -> zero, __vlan_hwaccel_put_tag() never called, packet delivered to parent interface, not VLAN interface) The reproducer requires legacy Rx descriptors. On modern ice + iavf with full PTP support, flex descriptors are always negotiated and the buggy legacy path is never reached. 
Flex descriptors require all of: - CONFIG_PTP_1588_CLOCK enabled - VIRTCHNL_VF_OFFLOAD_RX_FLEX_DESC granted by PF - PTP capabilities negotiated (VIRTCHNL_VF_CAP_PTP) - VIRTCHNL_1588_PTP_CAP_RX_TSTAMP supported - VIRTCHNL_RXDID_2_FLEX_SQ_NIC present in DDP profile If any condition is not met, iavf_select_rx_desc_format() falls back to legacy descriptors (RXDID=1) and the wrong L2TAG2 mask is hit. Fixes: `2dc8e7c36d` ("iavf: refactor iavf_clean_rx_irq to support legacy and flex descriptors") Signed-off-by: Petr Oros <poros@redhat.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-10-686c33c9828d@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-18 12:01:35 -07:00
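The truncation described in this commit can be reproduced with plain integer arithmetic. The sketch below uses userspace stand-ins for the kernel's GENMASK_ULL() and FIELD_GET() macros (the macro and function names here are assumptions for illustration) and places VLAN ID 198 in bits 63:48 of qw2, as in the reproducer.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for GENMASK_ULL(h, l): bits h..l set, inclusive. */
#define GENMASK_U64(h, l) \
	((~UINT64_C(0) >> (63 - (h))) & (~UINT64_C(0) << (l)))

/* Userspace stand-in for FIELD_GET() with an explicit shift. */
static uint64_t field_get(uint64_t mask, unsigned int shift, uint64_t reg)
{
	assert(shift < 64);
	return (reg & mask) >> shift;
}

/* Oversized mask (63:32): the tag lands in bits 31:16 and is lost
 * when the result is narrowed to 16 bits. */
static uint16_t vlan_tag_wide_mask(uint64_t qw2)
{
	return (uint16_t)field_get(GENMASK_U64(63, 32), 32, qw2);
}

/* Corrected mask (63:48): the tag occupies the low 16 bits of the result. */
static uint16_t vlan_tag_narrow_mask(uint64_t qw2)
{
	return (uint16_t)field_get(GENMASK_U64(63, 48), 48, qw2);
}
```

For a descriptor whose bits 47:32 are zero, the wide mask yields tag 0, which matches the symptom that the VLAN interface never sees the traffic.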
Kohei Enju	a24162f188	i40e: don't advertise IFF_SUPP_NOFCS i40e advertises IFF_SUPP_NOFCS, allowing users to use the SO_NOFCS socket option. However, this option is silently ignored, as the driver does not check skb->no_fcs, and always enables FCS insertion offload. Fix this by removing the advertisement of IFF_SUPP_NOFCS. This behavior can be reproduced with a simple AF_PACKET socket: import socket s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW) s.setsockopt(socket.SOL_SOCKET, 43, 1) # SO_NOFCS s.bind(("eth0", 0)) s.send(b'\xff' * 64) Previously, send() succeeds but the driver ignores SO_NOFCS. With this change, send() fails with -EPROTONOSUPPORT, as expected. Fixes: `41c445ff0f` ("i40e: main driver core") Signed-off-by: Kohei Enju <kohei@enjuk.jp> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-9-686c33c9828d@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-18 12:01:34 -07:00
Kohei Enju	fa28351f97	ice: fix potential NULL pointer deref in error path of ice_set_ringparam() ice_set_ringparam nullifies tstamp_ring of temporary tx_rings, without clearing ICE_TX_RING_FLAGS_TXTIME bit. When ICE_TX_RING_FLAGS_TXTIME is set and the subsequent ice_setup_tx_ring() call fails, a NULL pointer dereference could happen in the unwinding sequence: ice_clean_tx_ring() -> ice_is_txtime_cfg() == true (ICE_TX_RING_FLAGS_TXTIME is set) -> ice_free_tx_tstamp_ring() -> ice_free_tstamp_ring() -> tstamp_ring->desc (NULL deref) Clear ICE_TX_RING_FLAGS_TXTIME bit to avoid the potential issue. Note that this potential issue is found by manual code review. Compile test only since unfortunately I don't have E830 devices. Fixes: `ccde82e909` ("ice: add E830 Earliest TxTime First Offload support") Signed-off-by: Kohei Enju <kohei@enjuk.jp> Reviewed-by: Paul Greenwalt <paul.greenwalt@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-8-686c33c9828d@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-18 12:01:34 -07:00
Keita Morisaki	7c72ec18c2	ice: fix race condition in TX timestamp ring cleanup Fix a race condition between ice_free_tx_tstamp_ring() and ice_tx_map() that can cause a NULL pointer dereference. ice_free_tx_tstamp_ring currently clears the ICE_TX_FLAGS_TXTIME flag after NULLing the tstamp_ring. This could allow a concurrent ice_tx_map call on another CPU to dereference the tstamp_ring, which could lead to a NULL pointer dereference. CPU A:ice_free_tx_tstamp_ring() \| CPU B:ice_tx_map() --------------------------------\|--------------------------------- tx_ring->tstamp_ring = NULL \| \| ice_is_txtime_cfg() -> true \| tstamp_ring = tx_ring->tstamp_ring \| tstamp_ring->count // NULL deref! flags &= ~ICE_TX_FLAGS_TXTIME \| Fix by: 1. Reordering ice_free_tx_tstamp_ring() to clear the flag before NULLing the pointer, with smp_wmb() to ensure proper ordering. 2. Adding smp_rmb() in ice_tx_map() after the flag check to order the flag read before the pointer read, using READ_ONCE() for the pointer, and adding a NULL check as a safety net. 3. Converting tx_ring->flags from u8 to DECLARE_BITMAP() and using atomic bitops (set_bit(), clear_bit(), test_bit()) for all flag operations throughout the driver: - ICE_TX_RING_FLAGS_XDP - ICE_TX_RING_FLAGS_VLAN_L2TAG1 - ICE_TX_RING_FLAGS_VLAN_L2TAG2 - ICE_TX_RING_FLAGS_TXTIME Fixes: `ccde82e909` ("ice: add E830 Earliest TxTime First Offload support") Signed-off-by: Keita Morisaki <kmta1236@gmail.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-7-686c33c9828d@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-18 12:01:34 -07:00
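The ordering idea in fix steps 1 and 2 can be sketched with C11 atomics. This is only an analogy: release/acquire operations stand in for the kernel's smp_wmb(), smp_rmb(), and READ_ONCE(), and all type and function names are hypothetical. The sketch does not address the lifetime of the ring memory itself.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct tstamp_ring {
	unsigned int count;
};

struct tx_ring {
	atomic_bool txtime_enabled;               /* stand-in for the TXTIME flag */
	_Atomic(struct tstamp_ring *) tstamp_ring;
};

/* Teardown side: clear the flag first, then detach the pointer.
 * Returns the detached ring so the caller can release it. */
static struct tstamp_ring *ring_disable_txtime(struct tx_ring *r)
{
	assert(r != NULL);
	atomic_store_explicit(&r->txtime_enabled, false, memory_order_release);
	return atomic_exchange_explicit(&r->tstamp_ring, NULL,
					memory_order_acq_rel);
}

/* Transmit side: check the flag, then load the pointer once and
 * NULL-check it as a safety net before dereferencing. */
static unsigned int ring_tstamp_count(struct tx_ring *r)
{
	struct tstamp_ring *ts;

	assert(r != NULL);
	if (!atomic_load_explicit(&r->txtime_enabled, memory_order_acquire))
		return 0;

	ts = atomic_load_explicit(&r->tstamp_ring, memory_order_acquire);
	if (!ts)
		return 0;
	return ts->count;
}
```

The NULL check matters even with the reordering, because a reader that observed the flag as set may still load the pointer after the teardown side has cleared it.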
Paul Greenwalt	4a3a940059	ice: fix ICE_AQ_LINK_SPEED_M for 200G When setting PHY configuration during driver initialization, 200G link speed is not being advertised even when the PHY is capable. This is because the get PHY capabilities link speed response is being masked by ICE_AQ_LINK_SPEED_M, which does not include the 200G link speed bit. ICE_AQ_LINK_SPEED_200GB is defined as BIT(11), but the mask 0x7FF only covers bits 0-10. Fix ICE_AQ_LINK_SPEED_M to use GENMASK(11, 0) so that it covers all defined link speed bits including 200G. Fixes: `24407a01e5` ("ice: Add 200G speed/phy type use") Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-6-686c33c9828d@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-18 12:01:34 -07:00
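The arithmetic behind this fix is small enough to check directly. The constants below are userspace stand-ins with hypothetical names; only the values (BIT(11), 0x7FF, and GENMASK(11, 0) = 0xFFF) come from the commit message.

```c
#include <assert.h>
#include <stdint.h>

#define LINK_SPEED_200GB  (1u << 11)  /* BIT(11), per the commit message */
#define LINK_SPEED_M_OLD  0x7FFu      /* covers bits 0-10 only */
#define LINK_SPEED_M_NEW  0xFFFu      /* GENMASK(11, 0): bits 0-11 */

/* Apply a link speed mask to the capabilities reported by the PHY. */
static uint16_t mask_link_speeds(uint16_t caps, uint16_t mask)
{
	assert(mask != 0);
	return caps & mask;
}
```

With the old mask a PHY that reports only the 200G bit ends up advertising no speeds at all, while the new mask passes the bit through unchanged.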
Paul Greenwalt	55e74f9ea7	ice: fix PHY config on media change with link-down-on-close Commit `1a3571b593` ("ice: restore PHY settings on media insertion") introduced separate flows for setting PHY configuration on media present: ice_configure_phy() when link-down-on-close is disabled, and ice_force_phys_link_state() when enabled. The latter incorrectly uses the previous configuration even after module change, causing link issues such as wrong speed or no link. Unify PHY configuration into a single ice_phy_cfg() function with a link_en parameter, ensuring PHY capabilities are always fetched fresh from hardware. Fixes: `1a3571b593` ("ice: restore PHY settings on media insertion") Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-5-686c33c9828d@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-18 12:01:34 -07:00
Michal Schmidt	1a303baa71	ice: fix double-free of tx_buf skb If ice_tso() or ice_tx_csum() fails, the error path in ice_xmit_frame_ring() frees the skb, but the 'first' tx_buf still points to it and is marked as valid (ICE_TX_BUF_SKB). 'next_to_use' remains unchanged, so the potential problem will likely fix itself when the next packet is transmitted and the tx_buf gets overwritten. But if there is no next packet and the interface is brought down instead, ice_clean_tx_ring() -> ice_unmap_and_free_tx_buf() will find the tx_buf and free the skb for the second time. The fix is to reset the tx_buf type to ICE_TX_BUF_EMPTY in the error path, so that ice_unmap_and_free_tx_buf() does not free the skb again. Move the initialization of 'first' up, to ensure it's already valid in case we hit the linearization error path. The bug was spotted by AI while I had it looking for something else. It also proposed an initial version of the patch. I reproduced the bug and tested the fix by adding code to inject failures, on a build with KASAN. I looked for similar bugs in related Intel drivers and did not find any. Fixes: `d76a60ba7a` ("ice: Add support for VLANs and offloads") Assisted-by: Claude:claude-4.6-opus-high Cursor Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-4-686c33c9828d@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-18 12:01:33 -07:00
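The ownership rule behind this fix can be sketched in userspace C: whoever frees the buffer must also mark the slot empty, so that a later cleanup pass skips it. The types and functions below are simplified, hypothetical stand-ins for the driver's tx_buf handling, with malloc()/free() in place of skb allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum tx_buf_type { TX_BUF_EMPTY, TX_BUF_SKB };

struct tx_buf {
	enum tx_buf_type type;
	void *skb;
};

static int frees; /* counts frees, for illustration only */

static void free_skb(void *skb)
{
	free(skb);
	frees++;
}

/* Transmit error path: free the buffer AND mark the slot empty. */
static void xmit_error_path(struct tx_buf *first)
{
	assert(first != NULL);
	free_skb(first->skb);
	first->skb = NULL;
	first->type = TX_BUF_EMPTY;
}

/* Ring cleanup: only slots still marked as holding a buffer are freed. */
static void clean_tx_buf(struct tx_buf *buf)
{
	assert(buf != NULL);
	if (buf->type == TX_BUF_SKB)
		free_skb(buf->skb);
	buf->skb = NULL;
	buf->type = TX_BUF_EMPTY;
}
```

Without the type reset in the error path, the cleanup pass would see a slot still marked as valid and free the same pointer a second time.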
Guangshuo Li	9aab1c3d72	ice: fix double free in ice_sf_eth_activate() error path When auxiliary_device_add() fails, ice_sf_eth_activate() jumps to aux_dev_uninit and calls auxiliary_device_uninit(&sf_dev->adev). The device release callback ice_sf_dev_release() frees sf_dev, but the current error path falls through to sf_dev_free and calls kfree(sf_dev) again, causing a double free. Keep kfree(sf_dev) for the auxiliary_device_init() failure path, but avoid falling through to sf_dev_free after auxiliary_device_uninit(). Fixes: `13acc5c4cd` ("ice: subfunction activation and base devlink ops") Cc: stable@vger.kernel.org Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-3-686c33c9828d@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-18 12:01:33 -07:00
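The error-path shape described in this commit can be sketched as follows. Once the release callback owns the object, the unwind path must return after the uninit step instead of falling through to the plain free. All names are hypothetical stand-ins, and the init/add failures are injected through parameters for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct sf_dev {
	int unused;
};

static int release_calls; /* counts release callbacks, for illustration */

/* Stand-in for the device release callback, which performs the final free. */
static void sf_dev_release(struct sf_dev *dev)
{
	release_calls++;
	free(dev);
}

/* Stand-in for the uninit step: dropping the last reference runs release. */
static void dev_uninit(struct sf_dev *dev)
{
	sf_dev_release(dev);
}

static int sf_activate(int init_fails, int add_fails, struct sf_dev **out)
{
	struct sf_dev *dev = calloc(1, sizeof(*dev));
	int err;

	assert(out != NULL);
	if (!dev)
		return -ENOMEM;
	if (init_fails) {
		err = -EINVAL;
		goto sf_dev_free;    /* init failed: we still own dev */
	}
	if (add_fails) {
		err = -EIO;
		goto aux_dev_uninit; /* init succeeded: release owns dev */
	}
	*out = dev;
	return 0;

aux_dev_uninit:
	dev_uninit(dev);
	return err;                  /* do not fall through to the free below */
sf_dev_free:
	free(dev);
	return err;
}
```

Each failure path frees the object exactly once: directly when init failed, and through the release callback when init succeeded but the add step failed.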
Grzegorz Nitka	05567e4052	ice: update PCS latency settings for E825 10G/25Gb modes Update MAC Rx/Tx offset registers settings (PHY_MAC_[RX\|TX]_OFFSET registers) with the data obtained with the latest research. It applies to PCS latency settings for the following speeds/modes: * 10Gb NO-FEC - TX latency changed from 71.25 ns to 73 ns - RX latency changed from -25.6 ns to -28 ns * 25Gb NO-FEC - TX latency changed from 28.17 ns to 33 ns - RX latency changed from -12.45 ns to -12 ns * 25Gb RS-FEC - TX latency changed from 64.5 ns to 69 ns - RX latency changed from -3.6 ns to -3 ns The original data came from simulation and pre-production hardware. The new data measures the actual delays and as such is more accurate. Fixes: `7cab44f1c3` ("ice: Introduce ETH56G PHY model for E825C products") Co-developed-by: Zoltan Fodor <zoltan.fodor@intel.com> Signed-off-by: Zoltan Fodor <zoltan.fodor@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-2-686c33c9828d@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-18 12:01:33 -07:00
Grzegorz Nitka	885c5e5792	ice: fix 'adjust' timer programming for E830 devices Fix incorrect 'adjust the timer' programming sequence for E830 series devices. Only the GLTSYN_SHADJ shadow registers were programmed in the current implementation. According to the specification [1], a write to the GLTSYN_CMD command register is also required, with the CMD field set to the "Adjust the Time" value, for the timer adjustment to take effect. The flow was broken for adjustments within the S32_MAX/MIN range (around +/- 2 seconds). For bigger adjustments, a non-atomic programming flow is used, involving set timer programming. The non-atomic flow is implemented correctly. Testing hints: Run command: phc_ctl /dev/ptpX get adj 2 get Expected result: Returned timestamps differ at least by 2 seconds [1] Intel® Ethernet Controller E830 Datasheet rev 1.3, chapter 9.7.5.4 https://cdrdv2.intel.com/v1/dl/getContent/787353?explicitVersion=true Fixes: `f003075227` ("ice: Implement PTP support for E830 devices") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-1-686c33c9828d@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-18 12:01:33 -07:00
Jakub Kicinski	0916664f99	This batch includes only fixes to the selftest harness: * switch to TAP test orchestration * parse slurped notifications as returned by jq -s * add ovpn_ prefix to helpers and global variables to avoid clashes * fail test in case of netlink notification mismatch * add missing kernel config dependencies * add delay when launching multiple ynl/cli.py listeners -----BEGIN PGP SIGNATURE----- iJEEABYIADkWIQQKU153ubb5unbkl6Gx/ZpNW1HNdwUCaeH1YRsUgAAAAAAEAA5t YW51MiwyLjUrMS4xMiwyLDIACgkQsf2aTVtRzXfi1AD+Me8TMUSJor2+Idiy2/l0 isQlE07BVVj+fi5VZ6RCeXIA/RZb7y7Ct9mG8bvvnSItaH2qpSN5xxqZh91daqs6 lT8K =lb2K -----END PGP SIGNATURE----- Merge tag 'ovpn-net-20260417' of https://github.com/OpenVPN/ovpn-net-next Antonio Quartulli says: ==================== This batch includes only fixes to the selftest harness: * switch to TAP test orchestration * parse slurped notifications as returned by jq -s * add ovpn_ prefix to helpers and global variables to avoid clashes * fail test in case of netlink notification mismatch * add missing kernel config dependencies * add delay when launching multiple ynl/cli.py listeners * tag 'ovpn-net-20260417' of https://github.com/OpenVPN/ovpn-net-next: selftests: ovpn: serialize YNL listener startup selftests: ovpn: align command flow with TAP selftests: ovpn: add prefix to helpers and shared variables selftests: ovpn: flatten slurped notification JSON before filtering selftests: ovpn: fail notification check on mismatch selftests: ovpn: add nftables config dependencies for test-mark ==================== Link: https://patch.msgid.link/20260417090305.2775723-1-antonio@openvpn.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-18 11:44:12 -07:00


1434387 Commits