linux

mirror of https://github.com/torvalds/linux.git synced 2026-05-30 18:13:41 +02:00

Author	SHA1	Message	Date
Petr Oros	34d33313b5	iavf: add VIRTCHNL_OP_ADD_VLAN to success completion handler The V1 ADD_VLAN opcode had no success handler; filters sent via V1 stayed in ADDING state permanently. Add a fallthrough case so V1 filters also transition ADDING -> ACTIVE on PF confirmation. Critically, add an `if (v_retval) break` guard: the error switch in iavf_virtchnl_completion() does NOT return after handling errors, it falls through to the success switch. Without this guard, a PF-rejected ADD would incorrectly mark ADDING filters as ACTIVE, creating a driver/HW mismatch where the driver believes the filter is installed but the PF never accepted it. For V2, this is harmless: iavf_vlan_add_reject() in the error block already kfree'd all ADDING filters, so the success handler finds nothing to transition. Fixes: `968996c070` ("iavf: Fix VLAN_V2 addition/rejection") Signed-off-by: Petr Oros <poros@redhat.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260427-jk-iwl-net-petr-oros-fixes-v1-4-cdcb48303fd8@intel.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-30 11:37:38 +02:00
Petr Oros	bbcbe4ed70	iavf: wait for PF confirmation before removing VLAN filters The VLAN filter DELETE path was asymmetric with the ADD path: ADD waits for PF confirmation (ADD -> ADDING -> ACTIVE), but DELETE immediately frees the filter struct after sending the DEL message without waiting for the PF response. This is problematic because: - If the PF rejects the DEL, the filter remains in HW but the driver has already freed the tracking structure, losing sync. - Race conditions between DEL pending and other operations (add, reset) cannot be properly resolved if the filter struct is already gone. Add IAVF_VLAN_REMOVING state to make the DELETE path symmetric: REMOVE -> REMOVING (send DEL) -> PF confirms -> kfree -> PF rejects -> ACTIVE In iavf_del_vlans(), transition filters from REMOVE to REMOVING instead of immediately freeing them. The new DEL completion handler in iavf_virtchnl_completion() frees filters on success or reverts them to ACTIVE on error. Update iavf_add_vlan() to handle the REMOVING state: if a DEL is pending and the user re-adds the same VLAN, queue it for ADD so it gets re-programmed after the PF processes the DEL. The !VLAN_FILTERING_ALLOWED early-exit path still frees filters directly since no PF message is sent in that case. Also update iavf_del_vlan() to skip filters already in REMOVING state: DEL has been sent to PF and the completion handler will free the filter when PF confirms. Without this guard, the sequence DEL(pending) -> user-del -> second DEL could cause the PF to return an error for the second DEL (filter already gone), causing the completion handler to incorrectly revert a deleted filter back to ACTIVE. Fixes: `968996c070` ("iavf: Fix VLAN_V2 addition/rejection") Signed-off-by: Petr Oros <poros@redhat.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260427-jk-iwl-net-petr-oros-fixes-v1-3-cdcb48303fd8@intel.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-30 11:37:38 +02:00
Petr Oros	f2ce65b9b9	iavf: stop removing VLAN filters from PF on interface down When a VF goes down, the driver currently sends DEL_VLAN to the PF for every VLAN filter (ACTIVE -> DISABLE -> send DEL -> INACTIVE), then re-adds them all on UP (INACTIVE -> ADD -> send ADD -> ADDING -> ACTIVE). This round-trip is unnecessary because: 1. The PF disables the VF's queues via VIRTCHNL_OP_DISABLE_QUEUES, which already prevents all RX/TX traffic regardless of VLAN filter state. 2. The VLAN filters remaining in PF HW while the VF is down is harmless - packets matching those filters have nowhere to go with queues disabled. 3. The DEL+ADD cycle during down/up creates race windows where the VLAN filter list is incomplete. With spoofcheck enabled, the PF enables TX VLAN filtering on the first non-zero VLAN add, blocking traffic for any VLANs not yet re-added. Remove the entire DISABLE/INACTIVE state machinery: - Remove IAVF_VLAN_DISABLE and IAVF_VLAN_INACTIVE enum values - Remove iavf_restore_filters() and its call from iavf_open() - Remove VLAN filter handling from iavf_clear_mac_vlan_filters(), rename it to iavf_clear_mac_filters() - Remove DEL_VLAN_FILTER scheduling from iavf_down() - Remove all DISABLE/INACTIVE handling from iavf_del_vlans() VLAN filters now stay ACTIVE across down/up cycles. Only explicit user removal (ndo_vlan_rx_kill_vid) or PF/VF reset triggers VLAN filter deletion/re-addition. Fixes: `ed1f5b58ea` ("i40evf: remove VLAN filters on close") Signed-off-by: Petr Oros <poros@redhat.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260427-jk-iwl-net-petr-oros-fixes-v1-2-cdcb48303fd8@intel.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-30 11:37:38 +02:00
Petr Oros	70d62b669f	iavf: rename IAVF_VLAN_IS_NEW to IAVF_VLAN_ADDING Rename the IAVF_VLAN_IS_NEW state to IAVF_VLAN_ADDING to better describe what the state represents: an ADD request has been sent to the PF and is waiting for a response. This is a pure rename with no behavioral change, preparing for a cleanup of the VLAN filter state machine. Signed-off-by: Petr Oros <poros@redhat.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260427-jk-iwl-net-petr-oros-fixes-v1-1-cdcb48303fd8@intel.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-30 11:37:38 +02:00
Stanislav Fomichev	d071c15b43	iavf: convert to ndo_set_rx_mode_async Convert iavf from ndo_set_rx_mode to ndo_set_rx_mode_async. iavf_set_rx_mode now takes explicit uc/mc list parameters and uses __hw_addr_sync_dev on the snapshots instead of __dev_uc_sync and __dev_mc_sync. The iavf_configure internal caller passes the real lists directly. Cc: Tony Nguyen <anthony.l.nguyen@intel.com> Cc: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260416185712.2155425-10-sdf@fomichev.me Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-21 12:50:25 +02:00
Petr Oros	496d9f9106	iavf: fix wrong VLAN mask for legacy Rx descriptors L2TAG2 The IAVF_RXD_LEGACY_L2TAG2_M mask was incorrectly defined as GENMASK_ULL(63, 32), extracting 32 bits from qw2 instead of the 16-bit VLAN tag. In the legacy Rx descriptor layout, the 2nd L2TAG2 (VLAN tag) occupies bits 63:48 of qw2, not 63:32. The oversized mask causes FIELD_GET to return a 32-bit value where the actual VLAN tag sits in bits 31:16. When this value is passed to iavf_receive_skb() as a u16 parameter, it gets truncated to the lower 16 bits (which contain the 1st L2TAG2, typically zero). As a result, __vlan_hwaccel_put_tag() is never called and software VLAN interfaces on VFs receive no traffic. This affects VFs behind ice PF (VIRTCHNL VLAN v2) when the PF advertises VLAN stripping into L2TAG2_2 and legacy descriptors are used. The flex descriptor path already uses the correct mask (IAVF_RXD_FLEX_L2TAG2_2_M = GENMASK_ULL(63, 48)). Reproducer: 1. Create 2 VFs on ice PF (echo 2 > sriov_numvfs) 2. Disable spoofchk on both VFs 3. Move each VF into a separate network namespace 4. On each VF: create VLAN interface (e.g. vlan 198), assign IP, bring up 5. Set rx-vlan-offload OFF on both VFs 6. Ping between VLAN interfaces -> expect PASS (VLAN tag stays in packet data, kernel matches in-band) 7. Set rx-vlan-offload ON on both VFs 8. Ping between VLAN interfaces -> expect FAIL if bug present (HW strips VLAN tag into descriptor L2TAG2 field, wrong mask extracts bits 47:32 instead of 63:48, truncated to u16 -> zero, __vlan_hwaccel_put_tag() never called, packet delivered to parent interface, not VLAN interface) The reproducer requires legacy Rx descriptors. On modern ice + iavf with full PTP support, flex descriptors are always negotiated and the buggy legacy path is never reached. Flex descriptors require all of: - CONFIG_PTP_1588_CLOCK enabled - VIRTCHNL_VF_OFFLOAD_RX_FLEX_DESC granted by PF - PTP capabilities negotiated (VIRTCHNL_VF_CAP_PTP) - VIRTCHNL_1588_PTP_CAP_RX_TSTAMP supported - VIRTCHNL_RXDID_2_FLEX_SQ_NIC present in DDP profile If any condition is not met, iavf_select_rx_desc_format() falls back to legacy descriptors (RXDID=1) and the wrong L2TAG2 mask is hit. Fixes: `2dc8e7c36d` ("iavf: refactor iavf_clean_rx_irq to support legacy and flex descriptors") Signed-off-by: Petr Oros <poros@redhat.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260416-iwl-net-submission-2026-04-14-v2-10-686c33c9828d@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-18 12:01:35 -07:00
Aleksandr Loktionov	3f3a2aefbc	iavf: fix kernel-doc comment style in iavf_ethtool.c iavf_ethtool.c contains 31 kernel-doc comment blocks using the legacy `*/` terminator instead of the correct single `/`. Two function headers also use a colon separator (`iavf_get_channels:`, `iavf_set_channels:`) instead of the ` - ` dash required by kernel-doc. Additionally several comments embed their return-value descriptions in the body paragraph, producing `scripts/kernel-doc -Wreturn` warnings. Void functions that incorrectly say "Returns ..." are also rephrased. Fix all issues across the full file: - Replace every `*/` terminator with `/`. - Change `function_name:` doc headers to `function_name -`. - Move inline "Returns ..." sentences into dedicated `Return:` sections for non-void functions (iavf_get_msglevel, iavf_get_rxnfc, iavf_set_channels, iavf_get_rxfh_key_size, iavf_get_rxfh_indir_size, iavf_get_rxfh, iavf_set_rxfh). - Rephrase body descriptions in void functions that incorrectly said "Returns ..." (iavf_get_drvinfo, iavf_get_ringparam, iavf_get_coalesce). - Remove boilerplate body text for iavf_get_rxfh_key_size and iavf_get_rxfh_indir_size; the `Return:` line now conveys the same information without the vague "Returns the table size." sentence. Suggested-by: Anthony L. Nguyen <anthony.l.nguyen@intel.com> Suggested-by: Leszek Pepiak <leszek.pepiak@intel.com> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Breno Leitao <leitao@debian.org> Reviewed-by: Joe Damato <joe@dama.to> Link: https://patch.msgid.link/20260409093020.3808687-1-aleksandr.loktionov@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-10 16:05:44 -07:00
Kohei Enju	fecacfc95f	iavf: fix out-of-bounds writes in iavf_get_ethtool_stats() iavf incorrectly uses real_num_tx_queues for ETH_SS_STATS. Since the value could change in runtime, we should use num_tx_queues instead. Moreover iavf_get_ethtool_stats() uses num_active_queues while iavf_get_sset_count() and iavf_get_stat_strings() use real_num_tx_queues, which triggers out-of-bounds writes when we do "ethtool -L" and "ethtool -S" simultaneously [1]. For example when we change channels from 1 to 8, Thread 3 could be scheduled before Thread 2, and out-of-bounds writes could be triggered in Thread 3: Thread 1 (ethtool -L) Thread 2 (work) Thread 3 (ethtool -S) iavf_set_channels() ... iavf_alloc_queues() -> num_active_queues = 8 iavf_schedule_finish_config() iavf_get_sset_count() real_num_tx_queues: 1 -> buffer for 1 queue iavf_get_ethtool_stats() num_active_queues: 8 -> out-of-bounds! iavf_finish_config() -> real_num_tx_queues = 8 Use immutable num_tx_queues in all related functions to avoid the issue. [1] BUG: KASAN: vmalloc-out-of-bounds in iavf_add_one_ethtool_stat+0x200/0x270 Write of size 8 at addr ffffc900031c9080 by task ethtool/5800 CPU: 1 UID: 0 PID: 5800 Comm: ethtool Not tainted 6.19.0-enjuk-08403-g8137e3db7f1c #241 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x6f/0xb0 print_report+0x170/0x4f3 kasan_report+0xe1/0x180 iavf_add_one_ethtool_stat+0x200/0x270 iavf_get_ethtool_stats+0x14c/0x2e0 __dev_ethtool+0x3d0c/0x5830 dev_ethtool+0x12d/0x270 dev_ioctl+0x53c/0xe30 sock_do_ioctl+0x1a9/0x270 sock_ioctl+0x3d4/0x5e0 __x64_sys_ioctl+0x137/0x1c0 do_syscall_64+0xf3/0x690 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f7da0e6e36d ... </TASK> The buggy address belongs to a 1-page vmalloc region starting at 0xffffc900031c9000 allocated at __dev_ethtool+0x3cc9/0x5830 The buggy address belongs to the physical page: page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88813a013de0 pfn:0x13a013 flags: 0x200000000000000(node=0\|zone=2) raw: 0200000000000000 0000000000000000 dead000000000122 0000000000000000 raw: ffff88813a013de0 0000000000000000 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffffc900031c8f80: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 ffffc900031c9000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >ffffc900031c9080: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 ^ ffffc900031c9100: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 ffffc900031c9180: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 Fixes: `64430f70ba` ("iavf: Fix displaying queue statistics shown by ethtool") Signed-off-by: Kohei Enju <kohei@enjuk.jp> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2026-03-23 13:29:50 -07:00
Petr Oros	fc9c69be59	iavf: fix VLAN filter lost on add/delete race When iavf_add_vlan() finds an existing filter in IAVF_VLAN_REMOVE state, it transitions the filter to IAVF_VLAN_ACTIVE assuming the pending delete can simply be cancelled. However, there is no guarantee that iavf_del_vlans() has not already processed the delete AQ request and removed the filter from the PF. In that case the filter remains in the driver's list as IAVF_VLAN_ACTIVE but is no longer programmed on the NIC. Since iavf_add_vlans() only picks up filters in IAVF_VLAN_ADD state, the filter is never re-added, and spoof checking drops all traffic for that VLAN. CPU0 CPU1 Workqueue ---- ---- --------- iavf_del_vlan(vlan 100) f->state = REMOVE schedule AQ_DEL_VLAN iavf_add_vlan(vlan 100) f->state = ACTIVE iavf_del_vlans() f is ACTIVE, skip iavf_add_vlans() f is ACTIVE, skip Filter is ACTIVE in driver but absent from NIC. Transition to IAVF_VLAN_ADD instead and schedule IAVF_FLAG_AQ_ADD_VLAN_FILTER so iavf_add_vlans() re-programs the filter. A duplicate add is idempotent on the PF. Fixes: `0c0da0e951` ("iavf: refactor VLAN filter states") Signed-off-by: Petr Oros <poros@redhat.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2026-03-17 13:48:02 -07:00
Petr Oros	fdadbf6e84	iavf: fix incorrect reset handling in callbacks Three driver callbacks schedule a reset and wait for its completion: ndo_change_mtu(), ethtool set_ringparam(), and ethtool set_channels(). Waiting for reset in ndo_change_mtu() and set_ringparam() was added by commit `c2ed2403f1` ("iavf: Wait for reset in callbacks which trigger it") to fix a race condition where adding an interface to bonding immediately after MTU or ring parameter change failed because the interface was still in __RESETTING state. The same commit also added waiting in iavf_set_priv_flags(), which was later removed by commit `53844673d5` ("iavf: kill "legacy-rx" for good"). Waiting in set_channels() was introduced earlier by commit `4e5e6b5d9d` ("iavf: Fix return of set the new channel count") to ensure the PF has enough time to complete the VF reset when changing channel count, and to return correct error codes to userspace. Commit `ef490bbb22` ("iavf: Add net_shaper_ops support") added net_shaper_ops to iavf, which required reset_task to use _locked NAPI variants (napi_enable_locked, napi_disable_locked) that need the netdev instance lock. Later, commit `7e4d784f58` ("net: hold netdev instance lock during rtnetlink operations") and commit `2bcf4772e4` ("net: ethtool: try to protect all callback with netdev instance lock") started holding the netdev instance lock during ndo and ethtool callbacks for drivers with net_shaper_ops. Finally, commit `120f28a6f3` ("iavf: get rid of the crit lock") replaced the driver's crit_lock with netdev_lock in reset_task, causing incorrect behavior: the callback holds netdev_lock and waits for reset_task, but reset_task needs the same lock: Thread 1 (callback) Thread 2 (reset_task) ------------------- --------------------- netdev_lock() [blocked on workqueue] ndo_change_mtu() or ethtool op iavf_schedule_reset() iavf_wait_for_reset() iavf_reset_task() waiting... netdev_lock() <- blocked This does not strictly deadlock because iavf_wait_for_reset() uses wait_event_interruptible_timeout() with a 5-second timeout. The wait eventually times out, the callback returns an error to userspace, and after the lock is released reset_task completes the reset. This leads to incorrect behavior: userspace sees an error even though the configuration change silently takes effect after the timeout. Fix this by extracting the reset logic from iavf_reset_task() into a new iavf_reset_step() function that expects netdev_lock to be already held. The three callbacks now call iavf_reset_step() directly instead of scheduling the work and waiting, performing the reset synchronously in the caller's context which already holds netdev_lock. This eliminates both the incorrect error reporting and the need for iavf_wait_for_reset(), which is removed along with the now-unused reset_waitqueue. The workqueue-based iavf_reset_task() becomes a thin wrapper that acquires netdev_lock and calls iavf_reset_step(), preserving its use for PF-initiated resets. The callbacks may block for several seconds while iavf_reset_step() polls hardware registers, but this is acceptable since netdev_lock is a per-device mutex and only serializes operations on the same interface. v3: - Remove netif_running() guard from iavf_set_channels(). Unlike set_ringparam where descriptor counts are picked up by iavf_open() directly, num_req_queues is only consumed during iavf_reinit_interrupt_scheme() in the reset path. Skipping the reset on a down device would silently discard the channel count change. - Remove dead reset_waitqueue code (struct field, init, and all wake_up calls) since iavf_wait_for_reset() was the only consumer. Fixes: `120f28a6f3` ("iavf: get rid of the crit lock") Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Petr Oros <poros@redhat.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2026-03-10 09:08:31 -07:00
Petr Oros	efc54fb13d	iavf: fix PTP use-after-free during reset Commit `7c01dbfc8a` ("iavf: periodically cache PHC time") introduced a worker to cache PHC time, but failed to stop it during reset or disable. This creates a race condition where `iavf_reset_task()` or `iavf_disable_vf()` free adapter resources (AQ) while the worker is still running. If the worker triggers `iavf_queue_ptp_cmd()` during teardown, it accesses freed memory/locks, leading to a crash. Fix this by calling `iavf_ptp_release()` before tearing down the adapter. This ensures `ptp_clock_unregister()` synchronously cancels the worker and cleans up the chardev before the backing resources are destroyed. Fixes: `7c01dbfc8a` ("iavf: periodically cache PHC time") Signed-off-by: Petr Oros <poros@redhat.com> Reviewed-by: Ivan Vecera <ivecera@redhat.com> Acked-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2026-03-10 09:08:31 -07:00
Kohei Enju	b848521701	iavf: fix netdev->max_mtu to respect actual hardware limit iavf sets LIBIE_MAX_MTU as netdev->max_mtu, ignoring vf_res->max_mtu from PF [1]. This allows setting an MTU beyond the actual hardware limit, causing TX queue timeouts [2]. Set correct netdev->max_mtu using vf_res->max_mtu from the PF. Note that currently PF drivers such as ice/i40e set the frame size in vf_res->max_mtu, not MTU. Convert vf_res->max_mtu to MTU before setting netdev->max_mtu. [1] # ip -j -d link show $DEV \| jq '.[0].max_mtu' 16356 [2] iavf 0000:00:05.0 enp0s5: NETDEV WATCHDOG: CPU: 1: transmit queue 0 timed out 5692 ms iavf 0000:00:05.0 enp0s5: NIC Link is Up Speed is 10 Gbps Full Duplex iavf 0000:00:05.0 enp0s5: NETDEV WATCHDOG: CPU: 6: transmit queue 3 timed out 5312 ms iavf 0000:00:05.0 enp0s5: NIC Link is Up Speed is 10 Gbps Full Duplex ... Fixes: `5fa4caff59` ("iavf: switch to Page Pool") Signed-off-by: Kohei Enju <kohei@enjuk.jp> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2026-03-03 13:06:04 -08:00
Linus Torvalds	32a92f8c89	Convert more 'alloc_obj' cases to default GFP_KERNEL arguments This converts some of the visually simpler cases that have been split over multiple lines. I only did the ones that are easy to verify the resulting diff by having just that final GFP_KERNEL argument on the next line. Somebody should probably do a proper coccinelle script for this, but for me the trivial script actually resulted in an assertion failure in the middle of the script. I probably had made it a bit _too_ trivial. So after fighting that far a while I decided to just do some of the syntactically simpler cases with variations of the previous 'sed' scripts. The more syntactically complex multi-line cases would mostly really want whitespace cleanup anyway. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 20:03:00 -08:00
Linus Torvalds	323bbfcf1e	Convert 'alloc_flex' family to use the new default GFP_KERNEL argument This is the exact same thing as the 'alloc_obj()' version, only much smaller because there are a lot fewer users of the alloc_flex() interface. As with alloc_obj() version, this was done entirely with mindless brute force, using the same script, except using 'flex' in the pattern rather than 'objs'. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 17:09:51 -08:00
Linus Torvalds	bf4afc53b7	Convert 'alloc_obj' family to use the new default GFP_KERNEL argument This was done entirely with mindless brute force, using git grep -l '\<k[vmz]alloc_objs(., GFP_KERNEL)' \| xargs sed -i 's/$alloc_objs(.*$, GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 17:09:51 -08:00
Kees Cook	69050f8d6d	treewide: Replace kmalloc with kmalloc_obj for non-scalar types This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(PTR, FAM, COUNT, ...) (where TYPE may also be VAR) The resulting allocations no longer return "void ", instead returning "TYPE ". Signed-off-by: Kees Cook <kees@kernel.org>	2026-02-21 01:02:28 -08:00
Kohei Enju	6daa2893f3	iavf: fix off-by-one issues in iavf_config_rss_reg() There are off-by-one bugs when configuring RSS hash key and lookup table, causing out-of-bounds reads to memory [1] and out-of-bounds writes to device registers. Before commit `43a3d9ba34` ("i40evf: Allow PF driver to configure RSS"), the loop upper bounds were: i <= I40E_VFQF_{HKEY,HLUT}_MAX_INDEX which is safe since the value is the last valid index. That commit changed the bounds to: i <= adapter->rss_{key,lut}_size / 4 where `rss_{key,lut}_size / 4` is the number of dwords, so the last valid index is `(rss_{key,lut}_size / 4) - 1`. Therefore, using `<=` accesses one element past the end. Fix the issues by using `<` instead of `<=`, ensuring we do not exceed the bounds. [1] KASAN splat about rss_key_size off-by-one BUG: KASAN: slab-out-of-bounds in iavf_config_rss+0x619/0x800 Read of size 4 at addr ffff888102c50134 by task kworker/u8:6/63 CPU: 0 UID: 0 PID: 63 Comm: kworker/u8:6 Not tainted 6.18.0-rc2-enjuk-tnguy-00378-g3005f5b77652-dirty #156 PREEMPT(voluntary) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 Workqueue: iavf iavf_watchdog_task Call Trace: <TASK> dump_stack_lvl+0x6f/0xb0 print_report+0x170/0x4f3 kasan_report+0xe1/0x1a0 iavf_config_rss+0x619/0x800 iavf_watchdog_task+0x2be7/0x3230 process_one_work+0x7fd/0x1420 worker_thread+0x4d1/0xd40 kthread+0x344/0x660 ret_from_fork+0x249/0x320 ret_from_fork_asm+0x1a/0x30 </TASK> Allocated by task 63: kasan_save_stack+0x30/0x50 kasan_save_track+0x14/0x30 __kasan_kmalloc+0x7f/0x90 __kmalloc_noprof+0x246/0x6f0 iavf_watchdog_task+0x28fc/0x3230 process_one_work+0x7fd/0x1420 worker_thread+0x4d1/0xd40 kthread+0x344/0x660 ret_from_fork+0x249/0x320 ret_from_fork_asm+0x1a/0x30 The buggy address belongs to the object at ffff888102c50100 which belongs to the cache kmalloc-64 of size 64 The buggy address is located 0 bytes to the right of allocated 52-byte region [ffff888102c50100, ffff888102c50134) The buggy address belongs to the physical page: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x102c50 flags: 0x200000000000000(node=0\|zone=2) page_type: f5(slab) raw: 0200000000000000 ffff8881000418c0 dead000000000122 0000000000000000 raw: 0000000000000000 0000000080200020 00000000f5000000 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888102c50000: 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc ffff888102c50080: 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc >ffff888102c50100: 00 00 00 00 00 00 04 fc fc fc fc fc fc fc fc fc ^ ffff888102c50180: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc ffff888102c50200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc Fixes: `43a3d9ba34` ("i40evf: Allow PF driver to configure RSS") Signed-off-by: Kohei Enju <enjuk@amazon.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-12-17 09:36:02 -08:00
Jakub Kicinski	4de4454299	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Merge in late fixes in preparation for the net-next PR. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-12-02 15:37:53 -08:00
Alok Tiwari	57bb13d7eb	iavf: clarify VLAN add/delete log messages and lower log level The current dev_warn messages for too many VLAN changes are confusing and one place incorrectly references "add" instead of "delete" VLANs due to copy-paste errors. - Use dev_info instead of dev_warn to lower the log level. - Rephrase the message to: "virtchnl: Too many VLAN [add\|delete] ([v1\|v2]) requests; splitting into multiple messages to PF\n". Suggested-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20251125223632.1857532-12-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-27 18:34:21 -08:00
Michal Schmidt	1e43ebcd51	iavf: Implement settime64 with -EOPNOTSUPP ptp_clock_settime() assumes every ptp_clock has implemented settime64(). Stub it with -EOPNOTSUPP to prevent a NULL dereference. The fix is similar to commit `329d050bbe` ("gve: Implement settime64 with -EOPNOTSUPP"). Fixes: `d734223b2f` ("iavf: add initial framework for registering PTP clock") Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Tim Hostetler <thostet@google.com> Link: https://patch.msgid.link/20251126094850.2842557-1-mschmidt@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-27 17:47:28 -08:00
Breno Leitao	fe0a3d7d1d	iavf: extract GRXRINGS from .get_rxnfc Commit `84eaf4359c` ("net: ethtool: add get_rx_ring_count callback to optimize RX ring queries") added specific support for GRXRINGS callback, simplifying .get_rxnfc. Remove the handling of GRXRINGS in .get_rxnfc() by moving it to the new .get_rx_ring_count(). This simplifies the RX ring count retrieval and aligns iavf with the new ethtool API for querying RX ring parameters. Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Link: https://patch.msgid.link/20251125-gxring_intel-v2-2-f55cd022d28b@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-26 17:09:09 -08:00
Aleksandr Loktionov	3da28eb277	iavf: add RSS support for GTP protocol via ethtool Extend the iavf driver to support Receive Side Scaling (RSS) configuration for GTP (GPRS Tunneling Protocol) flows using ethtool. The implementation introduces new RSS flow segment headers and hash field definitions for various GTP encapsulations, including: - GTPC - GTPU (IP, Extension Header, Uplink, Downlink) - TEID-based hashing The ethtool interface is updated to parse and apply these new flow types and hash fields, enabling fine-grained traffic distribution for GTP-based mobile workloads. This enhancement improves performance and scalability for virtualized network functions (VNFs) and user plane functions (UPFs) in 5G and LTE deployments. Reviewed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-11-06 14:19:43 -08:00
Aleksandr Loktionov	c4f7a6672f	iavf: fix proper type for error code in iavf_resume() The variable 'err' in iavf_resume() is used to store the return value of different functions, which return an int. Currently, 'err' is declared as u32, which is semantically incorrect and misleading. In the Linux kernel, u32 is typically reserved for fixed-width data used in hardware interfaces or protocol structures. Using it for a generic error code may confuse reviewers or developers into thinking the value is hardware-related or size-constrained. Replace u32 with int to reflect the actual usage and improve code clarity and semantic correctness. No functional change. Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-09-19 08:42:08 -07:00
Michal Swiatkowski	43a1130632	iavf: use libie_aq_str There is no need to store the err string in hw->err_str. Simplify it and use common helper. hw->err_str is still used for other purpouse. It should be marked that previously for unknown error the numeric value was passed as a string. Now the "LIBIE_AQ_RC_UNKNOWN" is used for such cases. Add libie_aminq module in iavf Kconfig. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-07-24 09:43:40 -07:00
Michal Swiatkowski	0eb61b3569	iavf: use libie adminq descriptors Use libie_aq_desc instead of iavf_aq_desc. Do needed changes to allow clean build Use libie_aq_raw() wherever it can be used. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-07-24 09:28:26 -07:00
Byungchul Park	c8d6830e32	iavf: access ->pp through netmem_desc instead of page To eliminate the use of struct page in page pool, the page pool users should use netmem descriptor and APIs instead. Make iavf access ->pp through netmem_desc instead of page. Signed-off-by: Byungchul Park <byungchul@sk.com> Link: https://patch.msgid.link/20250721021835.63939-9-byungchul@sk.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-23 17:46:55 -07:00
Jakub Kicinski	189bd9c873	Merge branch '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== libeth: add libeth_xdp helper lib Alexander Lobakin says: Time to add XDP helpers infra to libeth to greatly simplify adding XDP to idpf and iavf, as well as improve and extend XDP in ice and i40e. Any vendor is free to reuse helpers. If this happens, I'm fine with moving the folder of out intel/. The helpers greatly simplify building xdp_buff, running a prog, handling the verdict, implement XDP_TX, .ndo_xdp_xmit, XDP buffer completion. Same applies to XSk (with XSk xmit instead of .ndo_xdp_xmit, plus stuff like XSk wakeup). They are entirely generic with no HW definitions or assumptions. HW-specific stuff like parsing Rx desc / filling Tx desc is passed from the driver as inline callbacks. For now, key assumptions that optimize performance / avoid code bloat, but might not fit every driver in driver/net/: * netmem holding the buffers are always order-0; * driver has separate XDP Tx queues, doesn't use stack queues for that. For best efficiency, you may want to have nr_cpu_ids XDP queues, but less (queue sharing) is also supported; * XDP Tx queues are interrupt-less and use "lazy" cleaning only when there are less than 1/4 free Tx descriptors of the queue size; * main target platforms are 64-bit, although 32-bit is also fully supported, but the code might be not as optimized for them. Library code already supports multi-buffer for all kinds of Tx and both header split and no split for Rx and Tx. Frags can come from devmem/io_uring etc., direct `struct page ` is used only for header buffers for which it's always true. Drivers are free to pass their own Rx hints and XSK xmit hints ops. XDP_TX and ndo_xdp_xmit use onstack bulk for the frames to be sent and send them by batches of 16 buffers. This eats ~280 bytes on the stack, but gives good boosts and allow to greatly optimize the main sending function leaving it without any error/exception paths. XSk xmit fills Tx descriptors in the loop unrolled by 8. This was proven to improve perf on ice and i40e. XDP_TX and ndo_xdp_xmit doesn't use unrolling as I wasn't able to get any improvements in those scenenarios from this, while +1 Kb for their sending functions for nothing doesn't sound reasonable. XSk wakeup, instead of traditionally used "SW interrupts" provided by NICs, uses IPI to schedule NAPI on the CPU corresponding to the given queue pair. It gives better control over CPU distribution and in general performs way better than "SW interrupts", plus allows us to not pass any HW-specific callbacks there. The code is built the way that all callbacks passed from drivers get inlined; in general, most of hotpath gets inlined. Everything slow/exception lands to .c files in the libeth folder, doesn't create copies in the drivers themselves and doesn't overloat hotpath. Sure, inlining means that hotpath will be compiled into every driver that uses the lib, but the core code is written in one place, so no copying of bugs happens. Fixed once -- works everywhere. The last commit might look like sorta hack, but it gives really good boosts and decreases object code size, plus there are checks that all those wider accesses are fully safe, so I don't feel anything bad about it. An example of using libeth_xdp can be found either on my GitHub or on the mailing lists here ("XDP for idpf"). Macros for building driver XDP functions lead to that some implementations (XDP_TX, ndo_xdp_xmit etc.) consist of really only a few lines. '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: libeth: xdp, xsk: access adjacent u32s as u64 where applicable libeth: xsk: add XSkFQ refill and XSk wakeup helpers libeth: xsk: add XSk Rx processing support libeth: xsk: add XSk xmit functions libeth: xsk: add XSk XDP_TX sending helpers libeth: xdp: add RSS hash hint and XDP features setup helpers libeth: xdp: add templates for building driver-side callbacks libeth: xdp: add XDP prog run and verdict result handling libeth: xdp: add helpers for preparing/processing &libeth_xdp_buff libeth: xdp: add XDPSQ cleanup timers libeth: xdp: add XDPSQ locking helpers libeth: xdp: add XDPSQE completion helpers libeth: xdp: add .ndo_xdp_xmit() helpers libeth: xdp: add XDP_TX buffers sending libeth: support native XDP and register memory model libeth: convert to netmem libeth, libie: clean symbol exports up a little ==================== Link: https://patch.msgid.link/20250616201639.710420-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-06-17 18:50:57 -07:00
Jakub Kicinski	2c5f2ad1d9	eth: iavf: migrate to new RXFH callbacks Migrate to new callbacks added by commit `9bb00786fc` ("net: ethtool: add dedicated callbacks for getting and setting rxfh fields"). I'm deleting all the boilerplate kdoc from the affected functions. It is somewhere between pointless and incorrect, just a burden for people refactoring the code. Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Joe Damato <joe@dama.to> Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20250614180907.4167714-8-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-06-16 18:14:53 -07:00
Alexander Lobakin	6ad5ff6e72	libeth: convert to netmem Back when the libeth Rx core was initially written, devmem was a draft and netmem_ref didn't exist in the mainline. Now that it's here, make libeth MP-agnostic before introducing any new code or any new library users. When it's known that the created PP/FQ is for header buffers, use faster "unsafe" underscored netmem <--> virt accessors as netmem_is_net_iov() is always false in that case, but consumes some cycles (bit test + true branch). Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-06-16 11:40:14 -07:00
Jakub Kicinski	535de52801	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.16-rc2). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-06-12 10:09:10 -07:00
Ahmed Zaki	0c6f463143	iavf: fix reset_task for early reset event If a reset event is received from the PF early in the init cycle, the state machine hangs for about 25 seconds. Reproducer: echo 1 > /sys/class/net/$PF0/device/sriov_numvfs ip link set dev $PF0 vf 0 mac $NEW_MAC The log shows: [792.620416] ice 0000:5e:00.0: Enabling 1 VFs [792.738812] iavf 0000:5e:01.0: enabling device (0000 -> 0002) [792.744182] ice 0000:5e:00.0: Enabling 1 VFs with 17 vectors and 16 queues per VF [792.839964] ice 0000:5e:00.0: Setting MAC 52:54:00:00:00:11 on VF 0. VF driver will be reinitialized [813.389684] iavf 0000:5e:01.0: Failed to communicate with PF; waiting before retry [818.635918] iavf 0000:5e:01.0: Hardware came out of reset. Attempting reinit. [818.766273] iavf 0000:5e:01.0: Multiqueue Enabled: Queue pair count = 16 Fix it by scheduling the reset task and making the reset task capable of resetting early in the init cycle. Fixes: `ef8693eb90` ("i40evf: refactor reset handling") Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com> Tested-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-06-10 09:10:47 -07:00
Ahmed Zaki	b0ca7dc0e7	iavf: convert to NAPI IRQ affinity API Commit `bd7c00605e` ("net: move aRFS rmap management and CPU affinity to core") allows the drivers to delegate the IRQ affinity to the NAPI instance. However, the driver needs to use a persistent NAPI config and explicitly set/unset the NAPI<->IRQ association. Convert to the new IRQ affinity API. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-06-09 09:56:18 -07:00
Jacob Keller	141d0c9037	net: intel: move RSS packet classifier types to libie The Intel i40e, iavf, and ice drivers all include a definition of the packet classifier filter types used to program RSS hash enable bits. For i40e, these bits are used for both the PF and VF to configure the PFQF_HENA and VFQF_HENA registers. For ice and iAVF, these bits are used to communicate the desired hash enable filter over virtchnl via its struct virtchnl_rss_hashena. The virtchnl.h header makes no mention of where the bit definitions reside. Maintaining a separate copy of these bits across three drivers is cumbersome. Move the definition to libie as a new pctype.h header file. Each driver can include this, and drop its own definition. The ice implementation also defined a ICE_AVF_FLOW_FIELD_INVALID, intending to use this to indicate when there were no hash enable bits set. This is confusing, since the enumeration is using bit positions. A value of 0 should indicate the first bit. Instead, rewrite the code that uses ICE_AVF_FLOW_FIELD_INVALID to just check if the avf_hash is zero. From context this should be clear that we're checking if none of the bits are set. The values are kept as bit positions instead of encoding the BIT_ULL directly into their value. While most users will simply use BIT_ULL immediately, i40e uses the macros both with BIT_ULL and test_bit/set_bit calls. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-06-09 09:56:18 -07:00
Jacob Keller	78b2d9908b	net: intel: rename 'hena' to 'hashcfg' for clarity i40e, ice, and iAVF all use 'hena' as a shorthand for the "hash enable" configuration. This comes originally from the X710 datasheet 'xxQF_HENA' registers. In the context of the registers the meaning is fairly clear. However, on its own, hena is a weird name that can be more difficult to understand. This is especially true in ice. The E810 hardware doesn't even have registers with HENA in the name. Replace the shorthand 'hena' with 'hashcfg'. This makes it clear the variables deal with the Hash configuration, not just a single boolean on/off for all hashing. Do not update the register names. These come directly from the datasheet for X710 and X722, and it is more important that the names can be searched. Suggested-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-06-09 09:56:18 -07:00
Przemek Kitszel	120f28a6f3	iavf: get rid of the crit lock Get rid of the crit lock. That frees us from the error prone logic of try_locks. Thanks to netdev_lock() by Jakub it is now easy, and in most cases we were protected by it already - replace crit lock by netdev lock when it was not the case. Lockdep reports that we should cancel the work under crit_lock [splat1], and that was the scheme we have mostly followed since [1] by Slawomir. But when that is done we still got into deadlocks [splat2]. So instead we should look at the bigger problem, namely "weird locking/scheduling" of the iavf. The first step to fix that is to remove the crit lock. I will followup with a -next series that simplifies scheduling/tasks. Cancel the work without netdev lock (weird unlock+lock scheme), to fix the [splat2] (which would be totally ugly if we would kept the crit lock). Extend protected part of iavf_watchdog_task() to include scheduling more work. Note that the removed comment in iavf_reset_task() was misplaced, it belonged to inside of the removed if condition, so it's gone now. [splat1] - w/o this patch - The deadlock during VF removal: WARNING: possible circular locking dependency detected sh/3825 is trying to acquire lock: ((work_completion)(&(&adapter->watchdog_task)->work)){+.+.}-{0:0}, at: start_flush_work+0x1a1/0x470 but task is already holding lock: (&adapter->crit_lock){+.+.}-{4:4}, at: iavf_remove+0xd1/0x690 [iavf] which lock already depends on the new lock. [splat2] - when cancelling work under crit lock, w/o this series, see [2] for the band aid attempt WARNING: possible circular locking dependency detected sh/3550 is trying to acquire lock: ((wq_completion)iavf){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x26/0x90 but task is already holding lock: (&dev->lock){+.+.}-{4:4}, at: iavf_remove+0xa6/0x6e0 [iavf] which lock already depends on the new lock. [1] `fc2e6b3b13` ("iavf: Rework mutexes for better synchronisation") [2] https://github.com/pkitszel/linux/commit/52dddbfc2bb60294083f5711a158a Fixes: `d1639a1731` ("iavf: fix a deadlock caused by rtnl and driver's lock circular dependencies") Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-06-03 09:48:03 -07:00
Przemek Kitszel	05702b5c94	iavf: sprinkle netdev_assert_locked() annotations Lockdep annotations help in general, but here it is extra good, as next commit will remove crit lock. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-06-03 09:48:03 -07:00
Przemek Kitszel	257a8241ad	iavf: extract iavf_watchdog_step() out of iavf_watchdog_task() Finish up easy refactor of watchdog_task, total for this + prev two commits is: 1 file changed, 47 insertions(+), 82 deletions(-) Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-06-03 09:48:03 -07:00
Przemek Kitszel	ecb4cd0461	iavf: simplify watchdog_task in terms of adminq task scheduling Simplify the decision whether to schedule adminq task. The condition is the same, but it is executed in more scenarios. Note that movement of watchdog_done label makes this commit a bit surprising. (Hence not squashing it to anything bigger). Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-06-03 09:48:03 -07:00
Przemek Kitszel	099418da91	iavf: centralize watchdog requeueing itself Centralize the unlock(critlock); unlock(netdev); queue_delayed_work(watchog_task); pattern to one place. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-06-03 09:48:03 -07:00
Przemek Kitszel	dba35a4bb4	iavf: iavf_suspend(): take RTNL before netdev_lock() Fix an obvious violation of lock ordering. Jakub's [1] added netdev_lock() call that is wrong ordered wrt RTNL, but the Fixes tag points to crit_lock being wrongly placed (by lockdep standards). Actual reason we got it wrong is dated back to critical section managed by pure flag checks, which is with us since the very beginning. [1] `afc664987a` ("eth: iavf: extend the netdev_lock usage") Fixes: `5ac49f3c27` ("iavf: use mutexes for locking of critical sections") Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-06-03 09:48:03 -07:00
Jakub Kicinski	8ef890df40	net: move misc netdev_lock flavors to a separate header Move the more esoteric helpers for netdev instance lock to a dedicated header. This avoids growing netdevice.h to infinity and makes rebuilding the kernel much faster (after touching the header with the helpers). The main netdev_lock() / netdev_unlock() functions are used in static inlines in netdevice.h and will probably be used most commonly, so keep them in netdevice.h. Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250307183006.2312761-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-03-08 09:06:50 -08:00
Stanislav Fomichev	c4f0f30b42	net: hold netdev instance lock during nft ndo_setup_tc Introduce new dev_setup_tc for nft ndo_setup_tc paths. Reviewed-by: Eric Dumazet <edumazet@google.com> Cc: Saeed Mahameed <saeed@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250305163732.2766420-3-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-03-06 12:59:43 -08:00
Stanislav Fomichev	d4c22ec680	net: hold netdev instance lock during ndo_open/ndo_stop For the drivers that use shaper API, switch to the mode where core stack holds the netdev lock. This affects two drivers: * iavf - already grabs netdev lock in ndo_open/ndo_stop, so mostly remove these * netdevsim - switch to _locked APIs to avoid deadlock iavf_close diff is a bit confusing, the existing call looks like this: iavf_close() { netdev_lock() .. netdev_unlock() wait_event_timeout(down_waitqueue) } I change it to the following: netdev_lock() iavf_close() { .. netdev_unlock() wait_event_timeout(down_waitqueue) netdev_lock() // reusing this lock call } netdev_unlock() Since I'm reusing existing netdev_lock call, so it looks like I only add netdev_unlock. Cc: Saeed Mahameed <saeed@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250305163732.2766420-2-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-03-06 12:59:43 -08:00
Jakub Kicinski	357660d759	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.14-rc5). Conflicts: drivers/net/ethernet/cadence/macb_main.c `fa52f15c74` ("net: cadence: macb: Synchronize stats calculations") `75696dd0fd` ("net: cadence: macb: Convert to get_stats64") https://lore.kernel.org/20250224125848.68ee63e5@canb.auug.org.au Adjacent changes: drivers/net/ethernet/intel/ice/ice_sriov.c `79990cf5e7` ("ice: Fix deinitializing VF in error path") `a203163274` ("ice: simplify VF MSI-X managing") net/ipv4/tcp.c `18912c5206` ("tcp: devmem: don't write truncated dmabuf CMSGs to userspace") `297d389e9e` ("net: prefix devmem specific helpers") net/mptcp/subflow.c `8668860b0a` ("mptcp: reset when MPTCP opts are dropped after join") `c3349a22c2` ("mptcp: consolidate subflow cleanup") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-02-27 10:20:58 -08:00
Jacob Keller	c6124f6fd3	iavf: fix circular lock dependency with netdev_lock We have recently seen reports of lockdep circular lock dependency warnings when loading the iAVF driver: [ 1504.790308] ====================================================== [ 1504.790309] WARNING: possible circular locking dependency detected [ 1504.790310] 6.13.0 #net_next_rt.c2933b2befe2.el9 Not tainted [ 1504.790311] ------------------------------------------------------ [ 1504.790312] kworker/u128:0/13566 is trying to acquire lock: [ 1504.790313] ffff97d0e4738f18 (&dev->lock){+.+.}-{4:4}, at: register_netdevice+0x52c/0x710 [ 1504.790320] [ 1504.790320] but task is already holding lock: [ 1504.790321] ffff97d0e47392e8 (&adapter->crit_lock){+.+.}-{4:4}, at: iavf_finish_config+0x37/0x240 [iavf] [ 1504.790330] [ 1504.790330] which lock already depends on the new lock. [ 1504.790330] [ 1504.790330] [ 1504.790330] the existing dependency chain (in reverse order) is: [ 1504.790331] [ 1504.790331] -> #1 (&adapter->crit_lock){+.+.}-{4:4}: [ 1504.790333] __lock_acquire+0x52d/0xbb0 [ 1504.790337] lock_acquire+0xd9/0x330 [ 1504.790338] mutex_lock_nested+0x4b/0xb0 [ 1504.790341] iavf_finish_config+0x37/0x240 [iavf] [ 1504.790347] process_one_work+0x248/0x6d0 [ 1504.790350] worker_thread+0x18d/0x330 [ 1504.790352] kthread+0x10e/0x250 [ 1504.790354] ret_from_fork+0x30/0x50 [ 1504.790357] ret_from_fork_asm+0x1a/0x30 [ 1504.790361] [ 1504.790361] -> #0 (&dev->lock){+.+.}-{4:4}: [ 1504.790364] check_prev_add+0xf1/0xce0 [ 1504.790366] validate_chain+0x46a/0x570 [ 1504.790368] __lock_acquire+0x52d/0xbb0 [ 1504.790370] lock_acquire+0xd9/0x330 [ 1504.790371] mutex_lock_nested+0x4b/0xb0 [ 1504.790372] register_netdevice+0x52c/0x710 [ 1504.790374] iavf_finish_config+0xfa/0x240 [iavf] [ 1504.790379] process_one_work+0x248/0x6d0 [ 1504.790381] worker_thread+0x18d/0x330 [ 1504.790383] kthread+0x10e/0x250 [ 1504.790385] ret_from_fork+0x30/0x50 [ 1504.790387] ret_from_fork_asm+0x1a/0x30 [ 1504.790389] [ 1504.790389] other info that might help us debug this: [ 1504.790389] [ 1504.790389] Possible unsafe locking scenario: [ 1504.790389] [ 1504.790390] CPU0 CPU1 [ 1504.790391] ---- ---- [ 1504.790391] lock(&adapter->crit_lock); [ 1504.790393] lock(&dev->lock); [ 1504.790394] lock(&adapter->crit_lock); [ 1504.790395] lock(&dev->lock); [ 1504.790397] [ 1504.790397] * DEADLOCK * This appears to be caused by the change in commit `5fda3f3534` ("net: make netdev_lock() protect netdev->reg_state"), which added a netdev_lock() in register_netdevice. The iAVF driver calls register_netdevice() from iavf_finish_config(), as a final stage of its state machine post-probe. It currently takes the RTNL lock, then the netdev lock, and then the device critical lock. This pattern is used throughout the driver. Thus there is a strong dependency that the crit_lock should not be acquired before the net device lock. The change to register_netdevice creates an ABBA lock order violation because the iAVF driver is holding the crit_lock while calling register_netdevice, which then takes the netdev_lock. It seems likely that future refactors could result in netdev APIs which hold the netdev_lock while calling into the driver. This means that we should not re-order the locks so that netdev_lock is acquired after the device private crit_lock. Instead, notice that we already release the netdev_lock prior to calling the register_netdevice. This flow only happens during the early driver initialization as we transition through the __IAVF_STARTUP, __IAVF_INIT_VERSION_CHECK, __IAVF_INIT_GET_RESOURCES, etc. Analyzing the places where we take crit_lock in the driver there are two sources: a) several of the work queue tasks including adminq_task, watchdog_task, reset_task, and the finish_config task. b) various callbacks which ultimately stem back to .ndo operations or ethtool operations. The latter cannot be triggered until after the netdevice registration is completed successfully. The iAVF driver uses alloc_ordered_workqueue, which is an unbound workqueue that has a max limit of 1, and thus guarantees that only a single work item on the queue is executing at any given time, so none of the other work threads could be executing due to the ordered workqueue guarantees. The iavf_finish_config() function also does not do anything else after register_netdevice, unless it fails. It seems unlikely that the driver private crit_lock is protecting anything that register_netdevice() itself touches. Thus, to fix this ABBA lock violation, lets simply release the adapter->crit_lock as well as netdev_lock prior to calling register_netdevice(). We do still keep holding the RTNL lock as required by the function. If we do fail to register the netdevice, then we re-acquire the adapter critical lock to finish the transition back to __IAVF_INIT_CONFIG_ADAPTER. This ensures every call where both netdev_lock and the adapter->crit_lock are acquired under the same ordering. Fixes: `afc664987a` ("eth: iavf: extend the netdev_lock usage") Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20250224190647.3601930-5-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-02-25 19:09:40 -08:00
Gal Pressman	ecdff89338	ethtool: Symmetric OR-XOR RSS hash Add an additional type of symmetric RSS hash type: OR-XOR. The "Symmetric-OR-XOR" algorithm transforms the input as follows: (SRC_IP \| DST_IP, SRC_IP ^ DST_IP, SRC_PORT \| DST_PORT, SRC_PORT ^ DST_PORT) Change 'cap_rss_sym_xor_supported' to 'supported_input_xfrm', a bitmap of supported RXH_XFRM_* types. Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Edward Cree <ecree.xilinx@gmail.com> Link: https://patch.msgid.link/20250224174416.499070-2-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-02-25 18:31:04 -08:00
Jacob Keller	48ccdcd87e	iavf: add support for Rx timestamps to hotpath Add support for receive timestamps to the Rx hotpath. This support only works when using the flexible descriptor format, so make sure that we request this format by default if we have receive timestamp support available in the PTP capabilities. In order to report the timestamps to userspace, we need to perform timestamp extension. The Rx descriptor does actually contain the "40 bit" timestamp. However, upper 32 bits which contain nanoseconds are conveniently stored separately in the descriptor. We could extract the 32bits and lower 8 bits, then perform a bitwise OR to calculate the 40bit value. This makes no sense, because the timestamp extension algorithm would simply discard the lower 8 bits anyways. Thus, implement timestamp extension as iavf_ptp_extend_32b_timestamp(), and extract and forward only the 32bits of nominal nanoseconds. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Reviewed-by: Sunil Goutham <sgoutham@marvell.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-02-14 10:58:08 -08:00
Jacob Keller	51534239ef	iavf: handle set and get timestamps ops Add handlers for the .ndo_hwtstamp_get and .ndo_hwtstamp_set ops which allow userspace to request timestamp enablement for the device. This support allows standard Linux applications to request the timestamping desired. As with other devices that support timestamping all packets, the driver will upgrade any request for timestamping of a specific type of packet to HWTSTAMP_FILTER_ALL. The current configuration is stored, so that it can be retrieved by calling .ndo_hwtstamp_get The Tx timestamps are not implemented yet so calling set ops for Tx path will end with EOPNOTSUPP error code. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Co-developed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-02-14 10:58:08 -08:00
Mateusz Polchlopek	8447357e7b	iavf: Implement checking DD desc field Rx timestamping introduced in PF driver caused the need of refactoring the VF driver mechanism to check packet fields. The function to check errors in descriptor has been removed and from now only previously set struct fields are being checked. The field DD (descriptor done) needs to be checked at the very beginning, before extracting other fields. Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-02-14 10:58:08 -08:00
Jacob Keller	2dc8e7c36d	iavf: refactor iavf_clean_rx_irq to support legacy and flex descriptors Using VIRTCHNL_VF_OFFLOAD_FLEX_DESC, the iAVF driver is capable of negotiating to enable the advanced flexible descriptor layout. Add the flexible NIC layout (RXDID=2) as a member of the Rx descriptor union. Also add bit position definitions for the status and error indications that are needed. The iavf_clean_rx_irq function needs to extract a few fields from the Rx descriptor, including the size, rx_ptype, and vlan_tag. Move the extraction to a separate function that decodes the fields into a structure. This will reduce the burden for handling multiple descriptor types by keeping the relevant extraction logic in one place. To support handling an additional descriptor format with minimal code duplication, refactor Rx checksum handling so that the general logic is separated from the bit calculations. Introduce an iavf_rx_desc_decoded structure which holds the relevant bits decoded from the Rx descriptor. This will enable implementing flexible descriptor handling without duplicating the general logic twice. Introduce an iavf_extract_flex_rx_fields, iavf_flex_rx_hash, and iavf_flex_rx_csum functions which operate on the flexible NIC descriptor format instead of the legacy 32 byte format. Based on the negotiated RXDID, select the correct function for processing the Rx descriptors. With this change, the Rx hot path should be functional when using either the default legacy 32byte format or when we switch to the flexible NIC layout. Modify the Rx hot path to add support for the flexible descriptor format and add request enabling Rx timestamps for all queues. As in ice, make sure we bump the checksum level if the hardware detected a packet type which could have an outer checksum. This is important because hardware only verifies the inner checksum. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Co-developed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-02-14 10:58:08 -08:00

1 2 3 4 5 ...

389 Commits