mirror of
https://github.com/torvalds/linux.git
synced 2026-05-13 00:28:54 +02:00
master
10175 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
9e5dead140 |
ice: add dpll peer notification for paired SMA and U.FL pins
SMA and U.FL pins share physical signal paths in pairs (SMA1/U.FL1 and
SMA2/U.FL2). When one pin's state changes via a PCA9575 GPIO write,
the paired pin's state also changes, but no notification is sent for
the peer pin. Userspace consumers monitoring the peer via dpll netlink
subscribe never learn about the update.
Add ice_dpll_sw_pin_notify_peer() which sends a change notification for
the paired SW pin. Call it from ice_dpll_pin_sma_direction_set(),
ice_dpll_sma_pin_state_set(), and ice_dpll_ufl_pin_state_set() after
pf->dplls.lock is released. Use __dpll_pin_change_ntf() because
dpll_lock is still held by the dpll netlink layer (dpll_pin_pre_doit).
Fixes:
|
||
|
|
1a41b58fd4 |
ice: fix missing dpll notifications for SW pins
The SMA/U.FL pin redesign (commit |
||
|
|
6f9d8393c9 |
ice: fix SMA and U.FL pin state changes affecting paired pin
SMA and U.FL pins share physical signal paths in pairs (SMA1/U.FL1 and
SMA2/U.FL2) controlled by the PCA9575 GPIO expander. Each pair can
only have one active pin at a time: SMA1 output and U.FL1 output share
the same CGU output, SMA2 input and U.FL2 input share the same CGU
input. The PCA9575 register bits determine which connector in each
pair owns the signal path.
The driver does not account for this pairing in two places:
ice_dpll_ufl_pin_state_set() modifies PCA9575 bits and disables the
backing CGU pin without checking whether the U.FL pin is currently
active. Disconnecting an already inactive U.FL pin flips bits that
the paired SMA pin relies on, breaking its connection.
ice_dpll_sma_direction_set() does not propagate direction changes to
the paired U.FL pin. For SMA2/U.FL2 the ICE_SMA2_UFL2_RX_DIS bit is
never managed, so U.FL2 stays disconnected after SMA2 switches to
output. For both pairs the backing CGU pin of the U.FL side is never
enabled when a direction change activates it, so userspace sees the
pin as disconnected even though the routing is correct.
Fix by guarding the U.FL disconnect path against inactive pins and by
updating the paired U.FL pin fully on SMA direction changes: manage
ICE_SMA2_UFL2_RX_DIS for the SMA2/U.FL2 pair and enable the backing
CGU pin whenever the peer becomes active.
Fixes:
|
||
|
|
56a643aed0 |
ice: fix missing SMA pin initialization in DPLL subsystem
The DPLL SMA/U.FL pin redesign introduced ice_dpll_sw_pin_frequency_get()
which gates frequency reporting on the pin's active flag. This flag is
determined by ice_dpll_sw_pins_update() from the PCA9575 GPIO expander
state. Before the redesign, SMA pins were exposed as direct HW
input/output pins and ice_dpll_frequency_get() returned the CGU
frequency unconditionally — the PCA9575 state was never consulted.
The PCA9575 powers on with all outputs high, setting ICE_SMA1_DIR_EN,
ICE_SMA1_TX_EN, ICE_SMA2_DIR_EN and ICE_SMA2_TX_EN. Nothing in the
driver writes the register during initialization, so
ice_dpll_sw_pins_update() sees all pins as inactive and
ice_dpll_sw_pin_frequency_get() permanently returns 0 Hz for every
SW pin.
Fix this by writing a default SMA configuration in
ice_dpll_init_info_sw_pins(): clear all SMA bits, then set SMA1 and
SMA2 as active inputs (DIR_EN=0) with U.FL1 output and U.FL2 input
disabled. Each SMA/U.FL pair shares a physical signal path so only
one pin per pair can be active at a time. U.FL pins still report
frequency 0 after this fix: U.FL1 (output-only) is disabled by
ICE_SMA1_TX_EN which keeps the TX output buffer off, and U.FL2
(input-only) is disabled by ICE_SMA2_UFL2_RX_DIS. They can be
activated by changing the corresponding SMA pin direction via dpll
netlink.
Fixes:
|
||
|
|
70ad216411 |
ice: fix infinite recursion in ice_cfg_tx_topo via ice_init_dev_hw
On certain E810 configurations where firmware supports Tx scheduler topology switching (tx_sched_topo_comp_mode_en), ice_cfg_tx_topo() may need to apply a new 5-layer or 9-layer topology from the DDP package. If the AQ command to set the topology fails (e.g. due to invalid DDP data or firmware limitations), the global configuration lock must still be cleared via a CORER reset. Commit |
||
|
|
54ef024879 |
ice: fix NULL pointer dereference in ice_reset_all_vfs()
ice_reset_all_vfs() ignores the return value of ice_vf_rebuild_vsi().
When the VSI rebuild fails (e.g. during NVM firmware update via
nvmupdate64e), ice_vsi_rebuild() tears down the VSI on its error path,
leaving txq_map and rxq_map as NULL. The subsequent unconditional call
to ice_vf_post_vsi_rebuild() leads to a NULL pointer dereference in
ice_ena_vf_q_mappings() when it accesses vsi->txq_map[0].
The single-VF reset path in ice_reset_vf() already handles this
correctly by checking the return value of ice_vf_reconfig_vsi() and
skipping ice_vf_post_vsi_rebuild() on failure.
Apply the same pattern to ice_reset_all_vfs(): check the return value
of ice_vf_rebuild_vsi() and skip ice_vf_post_vsi_rebuild() and
ice_eswitch_attach_vf() on failure. The VF is left safely disabled
(ICE_VF_STATE_INIT not set, VFGEN_RSTAT not set to VFACTIVE) and can
be recovered via a VFLR triggered by a PCI reset of the VF
(sysfs reset or driver rebind).
Note that this patch does not prevent the VF VSI rebuild from failing
during NVM update — the underlying cause is firmware being in a
transitional state while the EMP reset is processed, which can cause
Admin Queue commands (ice_add_vsi, ice_cfg_vsi_lan) to fail. This
patch only prevents the subsequent NULL pointer dereference that
crashes the kernel when the rebuild does fail.
crash> bt
PID: 50795 TASK: ff34c9ee708dc680 CPU: 1 COMMAND: "kworker/u512:5"
#0 [ff72159bcfe5bb50] machine_kexec at ffffffffaa8850ee
#1 [ff72159bcfe5bba8] __crash_kexec at ffffffffaaa15fba
#2 [ff72159bcfe5bc68] crash_kexec at ffffffffaaa16540
#3 [ff72159bcfe5bc70] oops_end at ffffffffaa837eda
#4 [ff72159bcfe5bc90] page_fault_oops at ffffffffaa893997
#5 [ff72159bcfe5bce8] exc_page_fault at ffffffffab528595
#6 [ff72159bcfe5bd10] asm_exc_page_fault at ffffffffab600bb2
[exception RIP: ice_ena_vf_q_mappings+0x79]
RIP: ffffffffc0a85b29 RSP: ff72159bcfe5bdc8 RFLAGS: 00010206
RAX: 00000000000f0000 RBX: ff34c9efc9c00000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000010 RDI: ff34c9efc9c00000
RBP: ff34c9efc27d4828 R8: 0000000000000093 R9: 0000000000000040
R10: ff34c9efc27d4828 R11: 0000000000000040 R12: 0000000000100000
R13: 0000000000000010 R14: R15:
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ff72159bcfe5bdf8] ice_sriov_post_vsi_rebuild at ffffffffc0a85e2e [ice]
#8 [ff72159bcfe5be08] ice_reset_all_vfs at ffffffffc0a920b4 [ice]
#9 [ff72159bcfe5be48] ice_service_task at ffffffffc0a31519 [ice]
#10 [ff72159bcfe5be88] process_one_work at ffffffffaa93dca4
#11 [ff72159bcfe5bec8] worker_thread at ffffffffaa93e9de
#12 [ff72159bcfe5bf18] kthread at ffffffffaa946663
#13 [ff72159bcfe5bf50] ret_from_fork at ffffffffaa8086b9
The panic occurs attempting to dereference the NULL pointer in RDX at
ice_sriov.c:294, which loads vsi->txq_map (offset 0x4b8 in ice_vsi).
The faulting VSI is an allocated slab object but not fully initialized
after a failed ice_vsi_rebuild():
crash> struct ice_vsi 0xff34c9efc27d4828
netdev = 0x0,
rx_rings = 0x0,
tx_rings = 0x0,
q_vectors = 0x0,
txq_map = 0x0,
rxq_map = 0x0,
alloc_txq = 0x10,
num_txq = 0x10,
alloc_rxq = 0x10,
num_rxq = 0x10,
The nvmupdate64e process was performing NVM firmware update:
crash> bt 0xff34c9edd1a30000
PID: 49858 TASK: ff34c9edd1a30000 CPU: 1 COMMAND: "nvmupdate64e"
#0 [ff72159bcd617618] __schedule at ffffffffab5333f8
#4 [ff72159bcd617750] ice_sq_send_cmd at ffffffffc0a35347 [ice]
#5 [ff72159bcd6177a8] ice_sq_send_cmd_retry at ffffffffc0a35b47 [ice]
#6 [ff72159bcd617810] ice_aq_send_cmd at ffffffffc0a38018 [ice]
#7 [ff72159bcd617848] ice_aq_read_nvm at ffffffffc0a40254 [ice]
#8 [ff72159bcd6178b8] ice_read_flat_nvm at ffffffffc0a4034c [ice]
#9 [ff72159bcd617918] ice_devlink_nvm_snapshot at ffffffffc0a6ffa5 [ice]
dmesg:
ice 0000:13:00.0: firmware recommends not updating fw.mgmt, as it
may result in a downgrade. continuing anyways
ice 0000:13:00.1: ice_init_nvm failed -5
ice 0000:13:00.1: Rebuild failed, unload and reload driver
Fixes:
|
||
|
|
34d33313b5 |
iavf: add VIRTCHNL_OP_ADD_VLAN to success completion handler
The V1 ADD_VLAN opcode had no success handler; filters sent via V1
stayed in ADDING state permanently. Add a fallthrough case so V1
filters also transition ADDING -> ACTIVE on PF confirmation.
Critically, add an `if (v_retval) break` guard: the error switch in
iavf_virtchnl_completion() does NOT return after handling errors,
it falls through to the success switch. Without this guard, a
PF-rejected ADD would incorrectly mark ADDING filters as ACTIVE,
creating a driver/HW mismatch where the driver believes the filter
is installed but the PF never accepted it.
For V2, this is harmless: iavf_vlan_add_reject() in the error
block already kfree'd all ADDING filters, so the success handler
finds nothing to transition.
Fixes:
|
||
|
|
bbcbe4ed70 |
iavf: wait for PF confirmation before removing VLAN filters
The VLAN filter DELETE path was asymmetric with the ADD path: ADD
waits for PF confirmation (ADD -> ADDING -> ACTIVE), but DELETE
immediately frees the filter struct after sending the DEL message
without waiting for the PF response.
This is problematic because:
- If the PF rejects the DEL, the filter remains in HW but the driver
has already freed the tracking structure, losing sync.
- Race conditions between DEL pending and other operations
(add, reset) cannot be properly resolved if the filter struct
is already gone.
Add IAVF_VLAN_REMOVING state to make the DELETE path symmetric:
REMOVE -> REMOVING (send DEL) -> PF confirms -> kfree
-> PF rejects -> ACTIVE
In iavf_del_vlans(), transition filters from REMOVE to REMOVING
instead of immediately freeing them. The new DEL completion handler
in iavf_virtchnl_completion() frees filters on success or reverts
them to ACTIVE on error.
Update iavf_add_vlan() to handle the REMOVING state: if a DEL is
pending and the user re-adds the same VLAN, queue it for ADD so
it gets re-programmed after the PF processes the DEL.
The !VLAN_FILTERING_ALLOWED early-exit path still frees filters
directly since no PF message is sent in that case.
Also update iavf_del_vlan() to skip filters already in REMOVING
state: DEL has been sent to PF and the completion handler will
free the filter when PF confirms. Without this guard, the sequence
DEL(pending) -> user-del -> second DEL could cause the PF to return
an error for the second DEL (filter already gone), causing the
completion handler to incorrectly revert a deleted filter back to
ACTIVE.
Fixes:
|
||
|
|
f2ce65b9b9 |
iavf: stop removing VLAN filters from PF on interface down
When a VF goes down, the driver currently sends DEL_VLAN to the PF for
every VLAN filter (ACTIVE -> DISABLE -> send DEL -> INACTIVE), then
re-adds them all on UP (INACTIVE -> ADD -> send ADD -> ADDING ->
ACTIVE). This round-trip is unnecessary because:
1. The PF disables the VF's queues via VIRTCHNL_OP_DISABLE_QUEUES,
which already prevents all RX/TX traffic regardless of VLAN filter
state.
2. The VLAN filters remaining in PF HW while the VF is down is
harmless - packets matching those filters have nowhere to go with
queues disabled.
3. The DEL+ADD cycle during down/up creates race windows where the
VLAN filter list is incomplete. With spoofcheck enabled, the PF
enables TX VLAN filtering on the first non-zero VLAN add, blocking
traffic for any VLANs not yet re-added.
Remove the entire DISABLE/INACTIVE state machinery:
- Remove IAVF_VLAN_DISABLE and IAVF_VLAN_INACTIVE enum values
- Remove iavf_restore_filters() and its call from iavf_open()
- Remove VLAN filter handling from iavf_clear_mac_vlan_filters(),
rename it to iavf_clear_mac_filters()
- Remove DEL_VLAN_FILTER scheduling from iavf_down()
- Remove all DISABLE/INACTIVE handling from iavf_del_vlans()
VLAN filters now stay ACTIVE across down/up cycles. Only explicit
user removal (ndo_vlan_rx_kill_vid) or PF/VF reset triggers VLAN
filter deletion/re-addition.
Fixes:
|
||
|
|
70d62b669f |
iavf: rename IAVF_VLAN_IS_NEW to IAVF_VLAN_ADDING
Rename the IAVF_VLAN_IS_NEW state to IAVF_VLAN_ADDING to better describe what the state represents: an ADD request has been sent to the PF and is waiting for a response. This is a pure rename with no behavioral change, preparing for a cleanup of the VLAN filter state machine. Signed-off-by: Petr Oros <poros@redhat.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260427-jk-iwl-net-petr-oros-fixes-v1-1-cdcb48303fd8@intel.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
||
|
|
e728258deb |
Including fixes from Netfilter.
Steady stream of fixes. Last two weeks feel comparable to the two
weeks before the merge window. Lots of AI-aided bug discovery.
A newer big source is Sashiko/Gemini (Roman Gushchin's system),
which points out issues in existing code during patch review
(maybe 25% of fixes here likely originating from Sashiko).
Nice thing is these are often fixed by the respective maintainers,
not drive-bys.
Current release - new code bugs:
- kconfig: MDIO_PIC64HPSC should depend on ARCH_MICROCHIP
Previous releases - regressions:
- add async ndo_set_rx_mode and switch drivers which we promised
to be called under the per-netdev mutex to it
- dsa: remove duplicate netdev_lock_ops() for conduit ethtool ops
- hv_sock: report EOF instead of -EIO for FIN
- vsock/virtio: fix MSG_PEEK calculation on bytes to copy
Previous releases - always broken:
- ipv6: fix possible UAF in icmpv6_rcv()
- icmp: validate reply type before using icmp_pointers
- af_unix: drop all SCM attributes for SOCKMAP
- netfilter: fix a number of bugs in the osf (OS fingerprinting)
- eth: intel: fix timestamp interrupt configuration for E825C
Misc:
- bunch of data-race annotations
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmnqkmMACgkQMUZtbf5S
Iruqig/+NSg/YwEkZLbSaW+0LqNMIdOVZPdves97YAvNRdcKvgAPB5I13/G+koCz
bRpmtdDLYTkfMFLaM582DO6XeO3Hsz/BrRRuRbyEz7lTi7PtxTEs1J+6W6NxGOQ2
30f3J7OGudGlinsFV9VkJe81rvFbKZFZ9fGPmOcVzzzfLvT3rrt20iVvMOyM+PpD
H0ixFW+myescEx6AQoGcVs/sDveJ4bpLpNG3p4gADh3Laj9HKSl00kudCIOQ1Kdy
SEHsSZs3A87ueOnGwIBl/x24zVWGTGHyKcmc5ENPUSIaNGOWzmBxvfhb5dZ989RQ
HQix+FMue21k4JypYwrdhU3MAnMDPLk+FDp4XJuwJ5I/caNLZXS2geIlnXOI5IFJ
ojuq4pF5njoWtvkWGvxxRM+shIMiDUYUK+k9xTMqmge88O9ahGIAYb2qyKL+P6Sl
mMuSRcArk6pw3lPbUA4u1wEaU52IdxRJDPQA/Ai3O5UVTfemJO/VqawQfuBE274g
KZXG4x0lwE+LSyoguTnSqhMCJk1ZXAeHjtpz1Yo3CEHOwCH9MxEEL/dldAXWZiWN
K0nLcUQ8fg3GnmOEzYw1gzDVJrgkR1eIrh6OCpw+UGCg0Af0HE6C6QBL9q59YhQw
DjLJAUNM8puBNIh9paCsHf1aIcFpPXBcR5dKoufCQx41x1OOqew=
=knNy
-----END PGP SIGNATURE-----
Merge tag 'net-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from Netfilter.
Steady stream of fixes. Last two weeks feel comparable to the two
weeks before the merge window. Lots of AI-aided bug discovery. A newer
big source is Sashiko/Gemini (Roman Gushchin's system), which points
out issues in existing code during patch review (maybe 25% of fixes
here likely originating from Sashiko). Nice thing is these are often
fixed by the respective maintainers, not drive-bys.
Current release - new code bugs:
- kconfig: MDIO_PIC64HPSC should depend on ARCH_MICROCHIP
Previous releases - regressions:
- add async ndo_set_rx_mode and switch drivers which we promised to
be called under the per-netdev mutex to it
- dsa: remove duplicate netdev_lock_ops() for conduit ethtool ops
- hv_sock: report EOF instead of -EIO for FIN
- vsock/virtio: fix MSG_PEEK calculation on bytes to copy
Previous releases - always broken:
- ipv6: fix possible UAF in icmpv6_rcv()
- icmp: validate reply type before using icmp_pointers
- af_unix: drop all SCM attributes for SOCKMAP
- netfilter: fix a number of bugs in the osf (OS fingerprinting)
- eth: intel: fix timestamp interrupt configuration for E825C
Misc:
- bunch of data-race annotations"
* tag 'net-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (148 commits)
rxrpc: Fix error handling in rxgk_extract_token()
rxrpc: Fix re-decryption of RESPONSE packets
rxrpc: Fix rxrpc_input_call_event() to only unshare DATA packets
rxrpc: Fix missing validation of ticket length in non-XDR key preparsing
rxgk: Fix potential integer overflow in length check
rxrpc: Fix conn-level packet handling to unshare RESPONSE packets
rxrpc: Fix potential UAF after skb_unshare() failure
rxrpc: Fix rxkad crypto unalignment handling
rxrpc: Fix memory leaks in rxkad_verify_response()
net: rds: fix MR cleanup on copy error
m68k: mvme147: Make me the maintainer
net: txgbe: fix firmware version check
selftests/bpf: check epoll readiness during reuseport migration
tcp: call sk_data_ready() after listener migration
vhost_net: fix sleeping with preempt-disabled in vhost_net_busy_poll()
ipv6: Cap TLV scan in ip6_tnl_parse_tlv_enc_lim
tipc: fix double-free in tipc_buf_append()
llc: Return -EINPROGRESS from llc_ui_connect()
ipv4: icmp: validate reply type before using icmp_pointers
selftests/net: packetdrill: cover RFC 5961 5.2 challenge ACK on both edges
...
|
||
|
|
1f75dbc53f |
ice: fix ice_ptp_read_tx_hwtstamp_status_eth56g
The ice_ptp_read_tx_hwtstamp_status_eth56g function calls
ice_read_phy_eth56g with a PHY index. However the function actually expects
a port index. This causes the function to read the wrong PHY_PTP_INT_STATUS
registers, and effectively makes the status wrong for the second set of
ports from 4 to 7.
The ice_read_phy_eth56g function uses the provided port index to determine
which PHY device to read. We could refactor the entire chain to take a PHY
index, but this would impact many code sites. Instead, multiply the PHY
index by the number of ports, so that we read from the first port of each
PHY.
Fixes:
|
||
|
|
359dc1d413 |
ice: fix ready bitmap check for non-E822 devices
The E800 hardware (apart from E810) has a ready bitmap for the PHY indicating which timestamp slots currently have an outstanding timestamp waiting to be read by software. This bitmap is checked in multiple places using the ice_get_phy_tx_tstamp_ready(): * ice_ptp_process_tx_tstamp() calls it to determine which timestamps to attempt reading from the PHY * ice_ptp_tx_tstamps_pending() calls it in a loop at the end of the miscellaneous IRQ to check if new timestamps came in while the interrupt handler was executing. * ice_ptp_maybe_trigger_tx_interrupt() calls it in the auxiliary work task to trigger a software interrupt in the event that the hardware logic gets stuck. For E82X devices, multiple PHYs share the same block, and the parameter passed to the ready bitmap is a block number associated with the given port. For E825-C devices, the PHYs have their own independent blocks and do not share, so the parameter passed needs to be the port number. For E810 devices, the ice_get_phy_tx_tstamp_ready() always returns all 1s regardless of what port, since this hardware does not have a ready bitmap. Finally, for E830 devices, each PF has its own ready bitmap accessible via register, and the block parameter is unused. The first call correctly uses the Tx timestamp tracker block parameter to check the appropriate timestamp block. This works because the tracker is setup correctly for each timestamp device type. The second two callers behave incorrectly for all device types other than the older E822 devices. They both iterate in a loop using ICE_GET_QUAD_NUM() which is a macro only used by E822 devices. This logic is incorrect for devices other than the E822 devices. For E810 the calls would always return true, causing E810 devices to always attempt to trigger a software interrupt even when they have no reason to. For E830, this results in duplicate work as the ready bitmap is checked once per number of quads. Finally, for E825-C, this results in the pending checks failing to detect timestamps on ports other than the first two. Fix this by introducing a new hardware API function to ice_ptp_hw.c, ice_check_phy_tx_tstamp_ready(). This function will check if any timestamps are available and returns a positive value if any timestamps are pending. For E810, the function always returns false, so that the re-trigger checks never happen. For E830, check the ready bitmap just once. For E82x hardware, check each quad. Finally, for E825-C, check every port. The interface function returns an integer to enable reporting of error code if the driver is unable read the ready bitmap. This enables callers to handle this case properly. The previous implementation assumed that timestamps are available if they failed to read the bitmap. This is problematic as it could lead to continuous software IRQ triggering if the PHY timestamp registers somehow become inaccessible. This change is especially important for E825-C devices, as the missing checks could leave a window open where a new timestamp could arrive while the existing timestamps aren't completed. As a result, the hardware threshold logic would not trigger a new interrupt. Without the check, the timestamp is left unhandled, and new timestamps will not cause an interrupt again until the timestamp is handled. Since both the interrupt check and the backup check in the auxiliary task do not function properly, the device may have Tx timestamps permanently stuck failing on a given port. The faulty checks originate from commit |
||
|
|
3ec46e157c |
ice: perform PHY soft reset for E825C ports at initialization
In some cases the PHY timestamp block of the E825C can become stuck. This
is known to occur if the software writes 0 to the Tx timestamp threshold,
and with older versions of the ice driver the threshold configuration is
buggy and can race in such that hardware briefly operates with a zero
threshold enabled. There are no other known ways to trigger this behavior,
but once it occurs, the hardware is not recovered by normal reset, a driver
reload, or even a warm power cycle of the system. A cold power cycle is
sufficient to recover hardware, but this is extremely invasive and can
result in significant downtime on customer deployments.
The PHY for each port has a timestamping block which has its own reset
functionality accessible by programming the PHY_REG_GLOBAL register.
Writing to the PHY_REG_GLOBAL_SOFT_RESET_BIT triggers the hardware to
perform a complete reset of the timestamping block of the PHY. This
includes clearing the timestamp status for the port, clearing all
outstanding timestamps in the memory bank, and resetting the PHY timer.
The new ice_ptp_phy_soft_reset_eth56g() function toggles the
PHY_REG_GLOBAL soft reset bit with the required delays, ensuring the
PHY is properly reinitialized without requiring a full device reset.
The sequence clears the reset bit, asserts it, then clears it again,
with short waits between transitions to allow hardware stabilization.
Call this function in the new ice_ptp_init_phc_e825c(), implementing the
E825C device specific variant of the ice_ptp_init_phc(). Note that if
ice_ptp_init_phc() fails, PTP functionality may be disabled, but the driver
will still load to allow basic functionality to continue.
This causes the clock owning PF driver to perform a PHY soft reset for
every port during initialization. This ensures the driver begins life in a
known functional state regardless of how it was previously programmed.
This ensures that we properly reconfigure the hardware after a device reset
or when loading the driver, even if it was previously misconfigured with an
out-of-date or modified driver.
Fixes:
|
||
|
|
c0a575a801 |
ice: fix timestamp interrupt configuration for E825C
The E825C ice_phy_cfg_intr_eth56g() function is responsible for programming
the PHY interrupt for a given port. This function writes to the
PHY_REG_TS_INT_CONFIG register of the port. The register is responsible for
configuring whether the port interrupt logic is enabled, as well as
programming the threshold of waiting timestamps that will trigger an
interrupt from this port.
This threshold value must not be programmed to zero while the interrupt is
enabled. Doing so puts the port in a misconfigured state where the PHY
timestamp interrupt for the quad of connected ports will become stuck.
This occurs, because a threshold of zero results in the timestamp interrupt
status for the port becoming stuck high. The four ports in the connected
quad have their timestamp status indicators muxed together. A new interrupt
cannot be generated until the timestamp status indicators return low for
all four ports.
Normally, the timestamp status for a port will clear once there are fewer
timestamps in that ports timestamp memory bank than the threshold. A
threshold of zero makes this impossible, so the timestamp status for the
port does not clear.
The ice driver never intentionally programs the threshold to zero, indeed
the driver always programs it to a value of 1, intending to get an
interrupt immediately as soon as even a single packet is waiting for a
timestamp.
However, there is a subtle flaw in the programming logic in the
ice_phy_cfg_intr_eth56g() function. Due to the way that the hardware
handles enabling the PHY interrupt. If the threshold value is modified at
the same time as the interrupt is enabled, the HW PHY state machine might
enable the interrupt before the new threshold value is actually updated.
This leaves a potential race condition caused by the hardware logic where
a PHY timestamp interrupt might be triggered before the non-zero threshold
is written, resulting in the PHY timestamp logic becoming stuck.
Once the PHY timestamp status is stuck high, it will remain stuck even
after attempting to reprogram the PHY block by changing its threshold or
disabling the interrupt. Even a typical PF or CORE reset will not reset the
particular block of the PHY that becomes stuck. Even a warm power cycle is
not guaranteed to cause the PHY block to reset, and a cold power cycle is
required.
Prevent this by always writing the PHY_REG_TS_INT_CONFIG in two stages.
First write the threshold value with the interrupt disabled, and only write
the enable bit after the threshold has been programmed. When disabling the
interrupt, leave the threshold unchanged. Additionally, re-read the
register after writing it to guarantee that the write to the PHY has been
flushed upon exit of the function.
While we're modifying this function implementation, explicitly reject
programming a threshold of 0 when enabling the interrupt. No caller does
this today, but the consequences of doing so are significant. An explicit
rejection in the code makes this clear.
Fixes:
|
||
|
|
d071c15b43 |
iavf: convert to ndo_set_rx_mode_async
Convert iavf from ndo_set_rx_mode to ndo_set_rx_mode_async. iavf_set_rx_mode now takes explicit uc/mc list parameters and uses __hw_addr_sync_dev on the snapshots instead of __dev_uc_sync and __dev_mc_sync. The iavf_configure internal caller passes the real lists directly. Cc: Tony Nguyen <anthony.l.nguyen@intel.com> Cc: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260416185712.2155425-10-sdf@fomichev.me Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
||
|
|
aa3f7fe409 |
e1000e: Unroll PTP in probe error handling
If probe fails after registering the PTP clock and its delayed work,
these resources must be released.
This was not an issue until a 2016 fix moved the e1000e_ptp_init() call
before the jump to err_register.
Fixes:
|
||
|
|
496d9f9106 |
iavf: fix wrong VLAN mask for legacy Rx descriptors L2TAG2
The IAVF_RXD_LEGACY_L2TAG2_M mask was incorrectly defined as
GENMASK_ULL(63, 32), extracting 32 bits from qw2 instead of the
16-bit VLAN tag. In the legacy Rx descriptor layout, the 2nd L2TAG2
(VLAN tag) occupies bits 63:48 of qw2, not 63:32.
The oversized mask causes FIELD_GET to return a 32-bit value where the
actual VLAN tag sits in bits 31:16. When this value is passed to
iavf_receive_skb() as a u16 parameter, it gets truncated to the lower
16 bits (which contain the 1st L2TAG2, typically zero). As a result,
__vlan_hwaccel_put_tag() is never called and software VLAN interfaces
on VFs receive no traffic.
This affects VFs behind ice PF (VIRTCHNL VLAN v2) when the PF
advertises VLAN stripping into L2TAG2_2 and legacy descriptors are
used.
The flex descriptor path already uses the correct mask
(IAVF_RXD_FLEX_L2TAG2_2_M = GENMASK_ULL(63, 48)).
Reproducer:
1. Create 2 VFs on ice PF (echo 2 > sriov_numvfs)
2. Disable spoofchk on both VFs
3. Move each VF into a separate network namespace
4. On each VF: create VLAN interface (e.g. vlan 198), assign IP,
bring up
5. Set rx-vlan-offload OFF on both VFs
6. Ping between VLAN interfaces -> expect PASS
(VLAN tag stays in packet data, kernel matches in-band)
7. Set rx-vlan-offload ON on both VFs
8. Ping between VLAN interfaces -> expect FAIL if bug present
(HW strips VLAN tag into descriptor L2TAG2 field, wrong mask
extracts bits 47:32 instead of 63:48, truncated to u16 -> zero,
__vlan_hwaccel_put_tag() never called, packet delivered to parent
interface, not VLAN interface)
The reproducer requires legacy Rx descriptors. On modern ice + iavf
with full PTP support, flex descriptors are always negotiated and the
buggy legacy path is never reached. Flex descriptors require all of:
- CONFIG_PTP_1588_CLOCK enabled
- VIRTCHNL_VF_OFFLOAD_RX_FLEX_DESC granted by PF
- PTP capabilities negotiated (VIRTCHNL_VF_CAP_PTP)
- VIRTCHNL_1588_PTP_CAP_RX_TSTAMP supported
- VIRTCHNL_RXDID_2_FLEX_SQ_NIC present in DDP profile
If any condition is not met, iavf_select_rx_desc_format() falls back
to legacy descriptors (RXDID=1) and the wrong L2TAG2 mask is hit.
Fixes:
|
||
|
|
a24162f188 |
i40e: don't advertise IFF_SUPP_NOFCS
i40e advertises IFF_SUPP_NOFCS, allowing users to use the SO_NOFCS
socket option. However, this option is silently ignored, as the driver
does not check skb->no_fcs, and always enables FCS insertion offload.
Fix this by removing the advertisement of IFF_SUPP_NOFCS.
This behavior can be reproduced with a simple AF_PACKET socket:
import socket
s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW)
s.setsockopt(socket.SOL_SOCKET, 43, 1) # SO_NOFCS
s.bind(("eth0", 0))
s.send(b'\xff' * 64)
Previously, send() succeeds but the driver ignores SO_NOFCS.
With this change, send() fails with -EPROTONOSUPPORT, as expected.
Fixes:
|
||
|
|
fa28351f97 |
ice: fix potential NULL pointer deref in error path of ice_set_ringparam()
ice_set_ringparam nullifies tstamp_ring of temporary tx_rings, without
clearing ICE_TX_RING_FLAGS_TXTIME bit.
When ICE_TX_RING_FLAGS_TXTIME is set and the subsequent
ice_setup_tx_ring() call fails, a NULL pointer dereference could happen
in the unwinding sequence:
ice_clean_tx_ring()
-> ice_is_txtime_cfg() == true (ICE_TX_RING_FLAGS_TXTIME is set)
-> ice_free_tx_tstamp_ring()
-> ice_free_tstamp_ring()
-> tstamp_ring->desc (NULL deref)
Clear ICE_TX_RING_FLAGS_TXTIME bit to avoid the potential issue.
Note that this potential issue is found by manual code review.
Compile test only since unfortunately I don't have E830 devices.
Fixes:
|
||
|
|
7c72ec18c2 |
ice: fix race condition in TX timestamp ring cleanup
Fix a race condition between ice_free_tx_tstamp_ring() and ice_tx_map()
that can cause a NULL pointer dereference.
ice_free_tx_tstamp_ring currently clears the ICE_TX_FLAGS_TXTIME flag
after NULLing the tstamp_ring. This could allow a concurrent ice_tx_map
call on another CPU to dereference the tstamp_ring, which could lead to
a NULL pointer dereference.
CPU A:ice_free_tx_tstamp_ring() | CPU B:ice_tx_map()
--------------------------------|---------------------------------
tx_ring->tstamp_ring = NULL |
| ice_is_txtime_cfg() -> true
| tstamp_ring = tx_ring->tstamp_ring
| tstamp_ring->count // NULL deref!
flags &= ~ICE_TX_FLAGS_TXTIME |
Fix by:
1. Reordering ice_free_tx_tstamp_ring() to clear the flag before
NULLing the pointer, with smp_wmb() to ensure proper ordering.
2. Adding smp_rmb() in ice_tx_map() after the flag check to order the
flag read before the pointer read, using READ_ONCE() for the
pointer, and adding a NULL check as a safety net.
3. Converting tx_ring->flags from u8 to DECLARE_BITMAP() and using
atomic bitops (set_bit(), clear_bit(), test_bit()) for all flag
operations throughout the driver:
- ICE_TX_RING_FLAGS_XDP
- ICE_TX_RING_FLAGS_VLAN_L2TAG1
- ICE_TX_RING_FLAGS_VLAN_L2TAG2
- ICE_TX_RING_FLAGS_TXTIME
Fixes:
|
||
|
|
4a3a940059 |
ice: fix ICE_AQ_LINK_SPEED_M for 200G
When setting PHY configuration during driver initialization, 200G link
speed is not being advertised even when the PHY is capable. This is
because the get PHY capabilities link speed response is being masked by
ICE_AQ_LINK_SPEED_M, which does not include the 200G link speed bit.
ICE_AQ_LINK_SPEED_200GB is defined as BIT(11), but the mask 0x7FF only
covers bits 0-10. Fix ICE_AQ_LINK_SPEED_M to use GENMASK(11, 0) so
that it covers all defined link speed bits including 200G.
Fixes:
|
||
|
|
55e74f9ea7 |
ice: fix PHY config on media change with link-down-on-close
Commit |
||
|
|
1a303baa71 |
ice: fix double-free of tx_buf skb
If ice_tso() or ice_tx_csum() fail, the error path in
ice_xmit_frame_ring() frees the skb, but the 'first' tx_buf still points
to it and is marked as valid (ICE_TX_BUF_SKB).
'next_to_use' remains unchanged, so the potential problem will
likely fix itself when the next packet is transmitted and the tx_buf
gets overwritten. But if there is no next packet and the interface is
brought down instead, ice_clean_tx_ring() -> ice_unmap_and_free_tx_buf()
will find the tx_buf and free the skb for the second time.
The fix is to reset the tx_buf type to ICE_TX_BUF_EMPTY in the error
path, so that ice_unmap_and_free_tx_buf().
Move the initialization of 'first' up, to ensure it's already valid in
case we hit the linearization error path.
The bug was spotted by AI while I had it looking for something else.
It also proposed an initial version of the patch.
I reproduced the bug and tested the fix by adding code to inject
failures, on a build with KASAN.
I looked for similar bugs in related Intel drivers and did not find any.
Fixes:
|
||
|
|
9aab1c3d72 |
ice: fix double free in ice_sf_eth_activate() error path
When auxiliary_device_add() fails, ice_sf_eth_activate() jumps to
aux_dev_uninit and calls auxiliary_device_uninit(&sf_dev->adev).
The device release callback ice_sf_dev_release() frees sf_dev, but
the current error path falls through to sf_dev_free and calls
kfree(sf_dev) again, causing a double free.
Keep kfree(sf_dev) for the auxiliary_device_init() failure path, but
avoid falling through to sf_dev_free after auxiliary_device_uninit().
Fixes:
|
||
|
|
05567e4052 |
ice: update PCS latency settings for E825 10G/25Gb modes
Update MAC Rx/Tx offset registers settings (PHY_MAC_[RX|TX]_OFFSET
registers) with the data obtained with the latest research. It applies
to PCS latency settings for the following speeds/modes:
* 10Gb NO-FEC
- TX latency changed from 71.25 ns to 73 ns
- RX latency changed from -25.6 ns to -28 ns
* 25Gb NO-FEC
- TX latency changed from 28.17 ns to 33 ns
- RX latency changed from -12.45 ns to -12 ns
* 25Gb RS-FEC
- TX latency changed from 64.5 ns to 69 ns
- RX latency changed from -3.6 ns to -3 ns
The original data came from simulation and pre-production hardware.
The new data measures the actual delays and as such is more accurate.
Fixes:
|
||
|
|
885c5e5792 |
ice: fix 'adjust' timer programming for E830 devices
Fix incorrect 'adjust the timer' programming sequence for E830 devices
series. Only shadow registers GLTSYN_SHADJ were programmed in the
current implementation. According to the specification [1], write to
command GLTSYN_CMD register is also required with CMD field set to
"Adjust the Time" value, for the timer adjustment to take the effect.
The flow was broken for the adjustment less than S32_MAX/MIN range
(around +/- 2 seconds). For bigger adjustment, non-atomic programming
flow is used, involving set timer programming. Non-atomic flow is
implemented correctly.
Testing hints:
Run command:
phc_ctl /dev/ptpX get adj 2 get
Expected result:
Returned timestamps differ at least by 2 seconds
[1] Intel® Ethernet Controller E830 Datasheet rev 1.3, chapter 9.7.5.4
https://cdrdv2.intel.com/v1/dl/getContent/787353?explicitVersion=true
Fixes:
|
||
|
|
40286d6379 |
pci-v7.1-changes
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmnfwfMUHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vxwIRAAlN1h5er8aFDbjON5YMXBZqlQmzaC
bjlUHgwm7HkdErTFozyuqhE8QUO1kCm4uMQzeyJdfY9nRWqMDOuKYxMD5j0exk+o
4tbbJg6Xx4dq7Qrawy9PhxyQm/PDAcvs+FRRlGala+qq9o3fxPDOAZVDE/1C8qFQ
Jd7GGd7NZn/NN4xrqST4RQHjO8fwaMwmksWCStsb79kfesQWP6kLADGfIMcWxNUB
2s+oTnK6Hw0tkBv56n6i8mbb0EzS3/RN1daTevGAta1rmfUVVtWGRZ4paMvv0Owi
Rl5+O5Jz6/c1qiXZbUqu5CRQPIy7Dr3JPvURcZX6qbsV8PzWXZr0Wi+geWefGOnp
55y+3OT0vdBGAuXLJhrcU7Clzq9D/TZOt8oTI8IFArUfDlmrAIdozPn7gr+VGre5
QuKymSk3XWtyIbe4o8UeZ4f9g0y6ZY1XvtvB7K1tze+OOmqlkfq966+z8aZuGOKx
ZvAU/NIat5H02EgB4dEVOP8R5vPZlXGT0RLGl1JWRypPWyZDbVVA3z927qRQG5md
IsVq8WaIrB1zyl9g37lZeEaYwP/qCIQsHkMGPYcP4wdOQEV9AQqi5pmjMXnWyQJD
PR1nvmTKW7USRCJ+pz8xPhZh0cj3ENaddORTD3I/0CGVV0y452bU/5rr4T+K04bK
PCJBpxTIDuWDwXc=
=FFRz
-----END PGP SIGNATURE-----
Merge tag 'pci-v7.1-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull pci updates from Bjorn Helgaas:
"Enumeration:
- Allow TLP Processing Hints to be enabled for RCiEPs (George Abraham
P)
- Enable AtomicOps only if we know the Root Port supports them (Gerd
Bayer)
- Don't enable AtomicOps for RCiEPs since none of them need Atomic
Ops and we can't tell whether the Root Complex would support them
(Gerd Bayer)
- Leave Precision Time Measurement disabled until a driver enables it
to avoid PCIe errors (Mika Westerberg)
- Make pci_set_vga_state() fail if bridge doesn't support VGA
routing, i.e., PCI_BRIDGE_CTL_VGA is not writable, and return
errors to vga_get() callers including userspace via
/dev/vga_arbiter (Simon Richter)
- Validate max-link-speed from DT in j721e, brcmstb, mediatek-gen3,
rzg3s drivers (where the actual controller constraints are known),
and remove validation from the generic OF DT accessor (Hans Zhang)
- Remove pc110pad driver (no longer useful after 486 CPU support
removed) and no_pci_devices() (pc110pad was the last user) (Dmitry
Torokhov, Heiner Kallweit)
Resource management:
- Prevent assigning space to unimplemented bridge windows; previously
we mistakenly assumed prefetchable window existed and assigned
space and put a BAR there (Ahmed Naseef)
- Avoid shrinking bridge windows to fit in the initial Root Port
window; fixes one problem with devices with large BARs connected
via switches, e.g., Thunderbolt (Ilpo Järvinen)
- Pass full extent of empty space, not just the aligned space, to
resource_alignf callback so free space before the requested
alignment can be used (Ilpo Järvinen)
- Place small resources before larger ones for better utilization of
address space (Ilpo Järvinen)
- Fix alignment calculation for resource size larger than align,
e.g., bridge windows larger than the 1MB required alignment (Ilpo
Järvinen)
Reset:
- Update slot handling so all ARI functions are treated as being in
the same slot. They're all reset by Secondary Bus Reset, but
previously drivers of ARI functions that appeared to be on a
non-zero device weren't notified and fatal hardware errors could
result (Keith Busch)
- Make sysfs reset_subordinate hotplug safe to avoid spurious hotplug
events (Keith Busch)
- Hide Secondary Bus Reset ('bus') from sysfs reset_methods if masked
by CXL because it has no effect (Vidya Sagar)
- Avoid FLR for AMD NPU device, where it causes the device to hang
(Lizhi Hou)
Error handling:
- Clear only error bits in PCIe Device Status to avoid accidentally
clearing Emergency Power Reduction Detected (Shuai Xue)
- Check for AER errors even in devices without drivers (Lukas Wunner)
- Initialize ratelimit info so DPC and EDR paths log AER error
information (Kuppuswamy Sathyanarayanan)
Power control:
- Add UPD720201/UPD720202 USB 3.0 xHCI Host Controller .compatible so
generic pwrctrl driver can control it (Neil Armstrong)
Hotplug:
- Set LED_HW_PLUGGABLE for NPEM hotplug-capable ports so LED core
doesn't complain when setting brightness fails because the endpoint
is gone (Richard Cheng)
Peer-to-peer DMA:
- Allow wildcards in list of host bridges that support peer-to-peer
DMA between hierarchy domains and add all Google SoCs (Jacob
Moroni)
Endpoint framework:
- Advertise dynamic inbound mapping support in pci-epf-test and
update host pci_endpoint_test to skip doorbell testing if not
advertised by endpoint (Koichiro Den)
- Return 0, not remaining timeout, when MHI eDMA ops complete so
mhi_ep_ring_add_element() doesn't interpret non-zero as failure
(Daniel Hodges)
- Remove vntb and ntb duplicate resource teardown that leads to oops
when .allow_link() fails or .drop_link() is called (Koichiro Den)
- Disable vntb delayed work before clearing BAR mappings and
doorbells to avoid oops caused by doing the work after resources
have been torn down (Koichiro Den)
- Add a way to describe reserved subregions within BARs, e.g.,
platform-owned fixed register windows, and use it for the RK3588
BAR4 DMA ctrl window (Koichiro Den)
- Add BAR_DISABLED for BARs that will never be available to an EPF
driver, and change some BAR_RESERVED annotations to BAR_DISABLED
(Niklas Cassel)
- Add NTB .get_dma_dev() callback for cases where DMA API requires a
different device, e.g., vNTB devices (Koichiro Den)
- Add reserved region types for MSI-X Table and PBA so Endpoint
controllers can them as describe hardware-owned regions in a
BAR_RESERVED BAR (Manikanta Maddireddy)
- Make Tegra194/234 BAR0 programmable and remove 1MB size limit
(Manikanta Maddireddy)
- Expose Tegra BAR2 (MSI-X) and BAR4 (DMA) as 64-bit BAR_RESERVED
(Manikanta Maddireddy)
- Add Tegra194 and Tegra234 device table entries to pci_endpoint_test
(Manikanta Maddireddy)
- Skip the BAR subrange selftest if there are not enough inbound
window resources to run the test (Christian Bruel)
New native PCIe controller drivers:
- Add DT binding and driver for Andes QiLai SoC PCIe host controller
(Randolph Lin)
- Add DT binding and driver for ESWIN PCIe Root Complex (Senchuan
Zhang)
Baikal T-1 PCIe controller driver:
- Remove driver since it never quite became usable (Andy Shevchenko)
Cadence PCIe controller driver:
- Implement byte/word config reads with dword (32-bit) reads because
some Cadence controllers don't support sub-dword accesses (Aksh
Garg)
CIX Sky1 PCIe controller driver:
- Add 'power-domains' to DT binding for SCMI power domain (Gary Yang)
Freescale i.MX6 PCIe controller driver:
- Add i.MX94 and i.MX943 to fsl,imx6q-pcie-ep DT binding (Richard
Zhu)
- Delay instead of polling for L2/L3 Ready after PME_Turn_off when
suspending i.MX6SX because LTSSM registers are inaccessible
(Richard Zhu)
- Separate PERST# assertion (for resetting endpoints) from core reset
(for resetting the RC itself) to prepare for new DTs with PERST#
GPIO in per-Root Port nodes (Sherry Sun)
- Retain Root Port MSI capability on i.MX7D, i.MX8MM, and i.MX8MQ so
MSI from downstream devices will work (Richard Zhu)
- Fix i.MX95 reference clock source selection when internal refclk is
used (Franz Schnyder)
Freescale Layerscape PCIe controller driver:
- Allow building as a removable module (Sascha Hauer)
MediaTek PCIe Gen3 controller driver:
- Use dev_err_probe() to simplify error paths and make deferred probe
messages visible in /sys/kernel/debug/devices_deferred (Chen-Yu
Tsai)
- Power off device if setup fails (Chen-Yu Tsai)
- Integrate new pwrctrl API to enable power control for WiFi/BT
adapters on mainboard or in PCIe or M.2 slots (Chen-Yu Tsai)
NVIDIA Tegra194 PCIe controller driver:
- Poll less aggressively and non-atomically for PME_TO_Ack during
transition to L2 (Vidya Sagar)
- Disable LTSSM after transition to Detect on surprise link down to
stop toggling between Polling and Detect (Manikanta Maddireddy)
- Don't force the device into the D0 state before L2 when suspending
or shutting down the controller (Vidya Sagar)
- Disable PERST# IRQ only in Endpoint mode because it's not
registered in Root Port mode (Manikanta Maddireddy)
- Handle 'nvidia,refclk-select' as optional (Vidya Sagar)
- Disable direct speed change in Endpoint mode so link speed change
is controlled by the host (Vidya Sagar)
- Set LTR values before link up to avoid bogus LTR messages with 0
latency (Vidya Sagar)
- Allow system suspend when the Endpoint link is down (Vidya Sagar)
- Use DWC IP core version, not Tegra custom values, to avoid DWC core
version check warnings (Manikanta Maddireddy)
- Apply ECRC workaround to devices based on DesignWare 5.00a as well
as 4.90a (Manikanta Maddireddy)
- Disable PM Substate L1.2 in Endpoint mode to work around Tegra234
erratum (Vidya Sagar)
- Delay post-PERST# cleanup until core is powered on to avoid CBB
timeout (Manikanta Maddireddy)
- Assert CLKREQ# so switches that forward it to their downstream side
can bring up those links successfully (Vidya Sagar)
- Calibrate pipe to UPHY for Endpoint mode to reset stale PLL state
from any previous bad link state (Vidya Sagar)
- Remove IRQF_ONESHOT flag from Endpoint interrupt registration so
DMA driver and Endpoint controller driver can share the interrupt
line (Vidya Sagar)
- Enable DMA interrupt to support DMA in both Root Port and Endpoint
modes (Vidya Sagar)
- Enable hardware link retraining after link goes down in Endpoint
mode (Vidya Sagar)
- Add DT binding and driver support for core clock monitoring (Vidya
Sagar)
Qualcomm PCIe controller driver:
- Advertise 'Hot-Plug Capable' and set 'No Command Completed Support'
since Qcom Root Ports support hotplug events like DL_Up/Down and
can accept writes to Slot Control without delays between writes
(Krishna Chaitanya Chundru)
Renesas R-Car PCIe controller driver:
- Mark Endpoint BAR0 and BAR2 as Resizable (Koichiro Den)
- Reduce EPC BAR alignment requirement to 4K (Koichiro Den)
Renesas RZ/G3S PCIe controller driver:
- Add RZ/G3E to DT binding and to driver (John Madieu)
- Assert (not deassert) resets in probe error path (John Madieu)
- Assert resets in suspend path in reverse order they were deasserted
during probe (John Madieu)
- Rework inbound window algorithm to prevent mapping more than
intended region and enforce alignment on size, to prepare for
RZ/G3E support (John Madieu)
Rockchip DesignWare PCIe controller driver:
- Add tracepoints for PCIe controller LTSSM transitions and link rate
changes (Shawn Lin)
- Trace LTSSM events collected by the dw-rockchip debug FIFO (Shawn
Lin)
SOPHGO PCIe controller driver:
- Disable ASPM L0s and L1 on Sophgo 2042 PCIe Root Ports that
advertise support for them (Yao Zi)
Synopsys DesignWare PCIe controller driver:
- Continue with system suspend even if an Endpoint doesn't respond
with PME_TO_Ack message (Manivannan Sadhasivam)
- Set Endpoint MSI-X Table Size in the correct function of a
multi-function device when configuring MSI-X, not in Function 0
(Aksh Garg)
- Set Max Link Width and Max Link Speed for all functions of a
multi-function device, not just Function 0 (Aksh Garg)
- Expose PCIe event counters in groups 5-7 in debugfs (Hans Zhang)
Miscellaneous:
- Warn only once about invalid ACS kernel parameter format (Richard
Cheng)
- Suppress FW_BUG warning when writing sysfs 'numa_node' with the
current value (Li RongQing)
- Drop redundant 'depends on PCI' from Kconfig (Julian Braha)"
* tag 'pci-v7.1-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (165 commits)
PCI/P2PDMA: Add Google SoCs to the P2P DMA host bridge list
PCI/P2PDMA: Allow wildcard Device IDs in host bridge list
PCI: sg2042: Avoid L0s and L1 on Sophgo 2042 PCIe Root Ports
PCI: cadence: Add flags for disabling ASPM capability for broken Root Ports
PCI: tegra194: Add core monitor clock support
dt-bindings: PCI: tegra194: Add monitor clock support
PCI: tegra194: Enable hardware hot reset mode in Endpoint mode
PCI: tegra194: Enable DMA interrupt
PCI: tegra194: Remove IRQF_ONESHOT flag during Endpoint interrupt registration
PCI: tegra194: Calibrate pipe to UPHY for Endpoint mode
PCI: tegra194: Assert CLKREQ# explicitly by default
PCI: tegra194: Fix CBB timeout caused by DBI access before core power-on
PCI: tegra194: Disable L1.2 capability of Tegra234 EP
PCI: dwc: Apply ECRC workaround to DesignWare 5.00a as well
PCI: tegra194: Use DWC IP core version
PCI: tegra194: Free up Endpoint resources during remove()
PCI: tegra194: Allow system suspend when the Endpoint link is not up
PCI: tegra194: Set LTR message request before PCIe link up in Endpoint mode
PCI: tegra194: Disable direct speed change for Endpoint mode
PCI: tegra194: Use devm_gpiod_get_optional() to parse "nvidia,refclk-select"
...
|
||
|
|
91a4855d6c |
Networking changes for 7.1.
Core & protocols
----------------
- Support HW queue leasing, allowing containers to be granted access
to HW queues for zero-copy operations and AF_XDP.
- Number of code moves to help the compiler with inlining.
Avoid output arguments for returning drop reason where possible.
- Rework drop handling within qdiscs to include more metadata
about the reason and dropping qdisc in the tracepoints.
- Remove the rtnl_lock use from IP Multicast Routing.
- Pack size information into the Rx Flow Steering table pointer
itself. This allows making the table itself a flat array of u32s,
thus making the table allocation size a power of two.
- Report TCP delayed ack timer information via socket diag.
- Add ip_local_port_step_width sysctl to allow distributing the randomly
selected ports more evenly throughout the allowed space.
- Add support for per-route tunsrc in IPv6 segment routing.
- Start work of switching sockopt handling to iov_iter.
- Improve dynamic recvbuf sizing in MPTCP, limit burstiness and avoid
buffer size drifting up.
- Support MSG_EOR in MPTCP.
- Add stp_mode attribute to the bridge driver for STP mode selection.
This addresses concerns about call_usermodehelper() usage.
- Remove UDP-Lite support (as announced in 2023).
- Remove support for building IPv6 as a module.
Remove the now unnecessary function calling indirection.
Cross-tree stuff
----------------
- Move Michael MIC code from generic crypto into wireless,
it's considered insecure but some WiFi networks still need it.
Netfilter
---------
- Switch nft_fib_ipv6 module to no longer need temporary dst_entry
object allocations by using fib6_lookup() + RCU.
Florian W reports this gets us ~13% higher packet rate.
- Convert IPVS's global __ip_vs_mutex to per-net service_mutex and
switch the service tables to be per-net. Convert some code that
walks the service lists to use RCU instead of the service_mutex.
- Add more opinionated input validation to lower security exposure.
- Make IPVS hash tables to be per-netns and resizable.
Wireless
--------
- Finished assoc frame encryption/EPPKE/802.1X-over-auth.
- Radar detection improvements.
- Add 6 GHz incumbent signal detection APIs.
- Multi-link support for FILS, probe response templates and
client probing.
- New APIs and mac80211 support for NAN (Neighbor Aware Networking,
aka Wi-Fi Aware) so less work must be in firmware.
Driver API
----------
- Add numerical ID for devlink instances (to avoid having to create
fake bus/device pairs just to have an ID). Support shared devlink
instances which span multiple PFs.
- Add standard counters for reporting pause storm events
(implement in mlx5 and fbnic).
- Add configuration API for completion writeback buffering
(implement in mana).
- Support driver-initiated change of RSS context sizes.
- Support DPLL monitoring input frequency (implement in zl3073x).
- Support per-port resources in devlink (implement in mlx5).
Misc
----
- Expand the YAML spec for Netfilter.
Drivers
-------
- Software:
- macvlan: support multicast rx for bridge ports with shared source
MAC address
- team: decouple receive and transmit enablement for IEEE 802.3ad
LACP "independent control"
- Ethernet high-speed NICs:
- nVidia/Mellanox:
- support high order pages in zero-copy mode (for payload
coalescing)
- support multiple packets in a page (for systems with 64kB pages)
- Broadcom 25-400GE (bnxt):
- implement XDP RSS hash metadata extraction
- add software fallback for UDP GSO, lowering the IOMMU cost
- Broadcom 800GE (bnge):
- add link status and configuration handling
- add various HW and SW statistics
- Marvell/Cavium:
- NPC HW block support for cn20k
- Huawei (hinic3):
- add mailbox / control queue
- add rx VLAN offload
- add driver info and link management
- Ethernet NICs:
- Marvell/Aquantia:
- support reading SFP module info on some AQC100 cards
- Realtek PCI (r8169):
- add support for RTL8125cp
- Realtek USB (r8152):
- support for the RTL8157 5Gbit chip
- add 2500baseT EEE status/configuration support
- Ethernet NICs embedded and off-the-shelf IP:
- Synopsys (stmmac):
- cleanup and reorganize SerDes handling and PCS support
- cleanup descriptor handling and per-platform data
- cleanup and consolidate MDIO defines and handling
- shrink driver memory use for internal structures
- improve Tx IRQ coalescing
- improve TCP segmentation handling
- add support for Spacemit K3
- Cadence (macb):
- support PHYs that have inband autoneg disabled with GEM
- support IEEE 802.3az EEE
- rework usrio capabilities and handling
- AMD (xgbe):
- improve power management for S0i3
- improve TX resilience for link-down handling
- Virtual:
- Google cloud vNIC:
- support larger ring sizes in DQO-QPL mode
- improve HW-GRO handling
- support UDP GSO for DQO format
- PCIe NTB:
- support queue count configuration
- Ethernet PHYs:
- automatically disable PHY autonomous EEE if MAC is in charge
- Broadcom:
- add BCM84891/BCM84892 support
- Micrel:
- support for LAN9645X internal PHY
- Realtek:
- add RTL8224 pair order support
- support PHY LEDs on RTL8211F-VD
- support spread spectrum clocking (SSC)
- Maxlinear:
- add PHY-level statistics via ethtool
- Ethernet switches:
- Maxlinear (mxl862xx):
- support for bridge offloading
- support for VLANs
- support driver statistics
- Bluetooth:
- large number of fixes and new device IDs
- Mediatek:
- support MT6639 (MT7927)
- support MT7902 SDIO
- WiFi:
- Intel (iwlwifi):
- UNII-9 and continuing UHR work
- MediaTek (mt76):
- mt7996/mt7925 MLO fixes/improvements
- mt7996 NPU support (HW eth/wifi traffic offload)
- Qualcomm (ath12k):
- monitor mode support on IPQ5332
- basic hwmon temperature reporting
- support IPQ5424
- Realtek:
- add USB RX aggregation to improve performance
- add USB TX flow control by tracking in-flight URBs
- Cellular:
- IPA v5.2 support
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmnelNoACgkQMUZtbf5S
IrtWFw//WyiXuEiGawVQONnbu1dtR+3nw/cvNpSYi0IM66vbRUB9n+9fxm2MIyG4
4jI/c/X/fxIvUxEqGez3yPn5P7KqkQR8WRYwkxrMYKRpXeukN0IDk5Euew5DskCe
wtBKNJOQWKdKXff0bLQoJ9dHWYuJ2IMRVil5M3fhUbeUOXeyJD7Yn1w2ICvJAbj+
T/Hw7sEtchNaHp6h6SbaQfahkUFHQG5peNoETkZF4UDF6ALGY29WH91GXeO2lrgN
IxX203KtaavV0oU8T0oixZgOc57Ns081YfFL/F1JP2HV6lgkwhuq+zxCrRTi1c9M
HPTXgwD7Z80Y74nM3YTLrPfoMOP8GLBZgdV3rUpwmteM26+gMTm+O1zHUur5ZoGy
D6TaMFguPTIqiRyrARa9xY/J6r9TQkc2Wfu4bIuPndKFg8xPoepuEObODnh0+5Hg
4j4pdFhIo2huENhSg7kVb/yl+1q68SFwM3RqTmx+OhCa0AyjcKIKgt/UBhismdnG
r8obxzb+nXeJc2rRDuwNMwlBlcMSbep27uGt64zeHMMXVhTVqOoytNaL/X/ZpH2m
A0DscUrpHvb36IoDPtanc6irP+JOh5Xe7Nw5qhkgwsMc7hlf8SyyHB4OUBBaz1qA
ETSnHlfwklRmXSpWqH2LyGXjdOQpDKP46+h0W3dttMD2/cRBqYo=
=EhQZ
-----END PGP SIGNATURE-----
Merge tag 'net-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core & protocols:
- Support HW queue leasing, allowing containers to be granted access
to HW queues for zero-copy operations and AF_XDP
- Number of code moves to help the compiler with inlining. Avoid
output arguments for returning drop reason where possible
- Rework drop handling within qdiscs to include more metadata about
the reason and dropping qdisc in the tracepoints
- Remove the rtnl_lock use from IP Multicast Routing
- Pack size information into the Rx Flow Steering table pointer
itself. This allows making the table itself a flat array of u32s,
thus making the table allocation size a power of two
- Report TCP delayed ack timer information via socket diag
- Add ip_local_port_step_width sysctl to allow distributing the
randomly selected ports more evenly throughout the allowed space
- Add support for per-route tunsrc in IPv6 segment routing
- Start work of switching sockopt handling to iov_iter
- Improve dynamic recvbuf sizing in MPTCP, limit burstiness and avoid
buffer size drifting up
- Support MSG_EOR in MPTCP
- Add stp_mode attribute to the bridge driver for STP mode selection.
This addresses concerns about call_usermodehelper() usage
- Remove UDP-Lite support (as announced in 2023)
- Remove support for building IPv6 as a module. Remove the now
unnecessary function calling indirection
Cross-tree stuff:
- Move Michael MIC code from generic crypto into wireless, it's
considered insecure but some WiFi networks still need it
Netfilter:
- Switch nft_fib_ipv6 module to no longer need temporary dst_entry
object allocations by using fib6_lookup() + RCU.
Florian W reports this gets us ~13% higher packet rate
- Convert IPVS's global __ip_vs_mutex to per-net service_mutex and
switch the service tables to be per-net. Convert some code that
walks the service lists to use RCU instead of the service_mutex
- Add more opinionated input validation to lower security exposure
- Make IPVS hash tables to be per-netns and resizable
Wireless:
- Finished assoc frame encryption/EPPKE/802.1X-over-auth
- Radar detection improvements
- Add 6 GHz incumbent signal detection APIs
- Multi-link support for FILS, probe response templates and client
probing
- New APIs and mac80211 support for NAN (Neighbor Aware Networking,
aka Wi-Fi Aware) so less work must be in firmware
Driver API:
- Add numerical ID for devlink instances (to avoid having to create
fake bus/device pairs just to have an ID). Support shared devlink
instances which span multiple PFs
- Add standard counters for reporting pause storm events (implement
in mlx5 and fbnic)
- Add configuration API for completion writeback buffering (implement
in mana)
- Support driver-initiated change of RSS context sizes
- Support DPLL monitoring input frequency (implement in zl3073x)
- Support per-port resources in devlink (implement in mlx5)
Misc:
- Expand the YAML spec for Netfilter
Drivers
- Software:
- macvlan: support multicast rx for bridge ports with shared
source MAC address
- team: decouple receive and transmit enablement for IEEE 802.3ad
LACP "independent control"
- Ethernet high-speed NICs:
- nVidia/Mellanox:
- support high order pages in zero-copy mode (for payload
coalescing)
- support multiple packets in a page (for systems with 64kB
pages)
- Broadcom 25-400GE (bnxt):
- implement XDP RSS hash metadata extraction
- add software fallback for UDP GSO, lowering the IOMMU cost
- Broadcom 800GE (bnge):
- add link status and configuration handling
- add various HW and SW statistics
- Marvell/Cavium:
- NPC HW block support for cn20k
- Huawei (hinic3):
- add mailbox / control queue
- add rx VLAN offload
- add driver info and link management
- Ethernet NICs:
- Marvell/Aquantia:
- support reading SFP module info on some AQC100 cards
- Realtek PCI (r8169):
- add support for RTL8125cp
- Realtek USB (r8152):
- support for the RTL8157 5Gbit chip
- add 2500baseT EEE status/configuration support
- Ethernet NICs embedded and off-the-shelf IP:
- Synopsys (stmmac):
- cleanup and reorganize SerDes handling and PCS support
- cleanup descriptor handling and per-platform data
- cleanup and consolidate MDIO defines and handling
- shrink driver memory use for internal structures
- improve Tx IRQ coalescing
- improve TCP segmentation handling
- add support for Spacemit K3
- Cadence (macb):
- support PHYs that have inband autoneg disabled with GEM
- support IEEE 802.3az EEE
- rework usrio capabilities and handling
- AMD (xgbe):
- improve power management for S0i3
- improve TX resilience for link-down handling
- Virtual:
- Google cloud vNIC:
- support larger ring sizes in DQO-QPL mode
- improve HW-GRO handling
- support UDP GSO for DQO format
- PCIe NTB:
- support queue count configuration
- Ethernet PHYs:
- automatically disable PHY autonomous EEE if MAC is in charge
- Broadcom:
- add BCM84891/BCM84892 support
- Micrel:
- support for LAN9645X internal PHY
- Realtek:
- add RTL8224 pair order support
- support PHY LEDs on RTL8211F-VD
- support spread spectrum clocking (SSC)
- Maxlinear:
- add PHY-level statistics via ethtool
- Ethernet switches:
- Maxlinear (mxl862xx):
- support for bridge offloading
- support for VLANs
- support driver statistics
- Bluetooth:
- large number of fixes and new device IDs
- Mediatek:
- support MT6639 (MT7927)
- support MT7902 SDIO
- WiFi:
- Intel (iwlwifi):
- UNII-9 and continuing UHR work
- MediaTek (mt76):
- mt7996/mt7925 MLO fixes/improvements
- mt7996 NPU support (HW eth/wifi traffic offload)
- Qualcomm (ath12k):
- monitor mode support on IPQ5332
- basic hwmon temperature reporting
- support IPQ5424
- Realtek:
- add USB RX aggregation to improve performance
- add USB TX flow control by tracking in-flight URBs
- Cellular:
- IPA v5.2 support"
* tag 'net-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1561 commits)
net: pse-pd: fix kernel-doc function name for pse_control_find_by_id()
wireguard: device: use exit_rtnl callback instead of manual rtnl_lock in pre_exit
wireguard: allowedips: remove redundant space
tools: ynl: add sample for wireguard
wireguard: allowedips: Use kfree_rcu() instead of call_rcu()
MAINTAINERS: Add netkit selftest files
selftests/net: Add additional test coverage in nk_qlease
selftests/net: Split netdevsim tests from HW tests in nk_qlease
tools/ynl: Make YnlFamily closeable as a context manager
net: airoha: Add missing PPE configurations in airoha_ppe_hw_init()
net: airoha: Fix VIP configuration for AN7583 SoC
net: caif: clear client service pointer on teardown
net: strparser: fix skb_head leak in strp_abort_strp()
net: usb: cdc-phonet: fix skb frags[] overflow in rx_complete()
selftests/bpf: add test for xdp_master_redirect with bond not up
net, bpf: fix null-ptr-deref in xdp_master_redirect() for down master
net: airoha: Remove PCE_MC_EN_MASK bit in REG_FE_PCE_CFG configuration
sctp: disable BH before calling udp_tunnel_xmit_skb()
sctp: fix missing encap_port propagation for GSO fragments
net: airoha: Rely on net_device pointer in ETS callbacks
...
|
||
|
|
a970ed1881 |
bitmap updates for v7.1
- new API: bitmap_weight_from() and bitmap_weighted_xor() (Yury); - drop unused __find_nth_andnot_bit() (Yury); - new tests and test improvements (Andy, Akinobu, Yury); - fixes for count_zeroes API (Yury); - cleanup bitmap_print_to_pagebuf() mess (Yury); - documentation updates (Andy, Kai, Kit). -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEEi8GdvG6xMhdgpu/4sUSA/TofvsgFAmnb8vkACgkQsUSA/Tof vsjzKgv/RI6HDkwRgjT/jPVAZzaNFrdoL0nIQ1ZriyE70b/0HtjMzbQBO0P3Vmsa 5k13Nus0eBi9CeEAK0NvjQXy8NRj4E7favqF3faV7l4+J6STHpOKeHZglUAj00CG +23WGInz+TS5RBjXnvT00wuTAVQjT6dvYng9606psVDF/nlh8ZtXmYDjLauseoUH a1EEKwLGXbk3/MhDgVq/R5RvZoNscL4Hky7QWMZiqLutwF8EDrZotF142tfbxkmW mu+2Bn1W66F+8A42HJBDRevcuvsRzMggP2kXxDk50XNL1zTN9f/4iE0r+/5x8UVF s3WiGnuLSkRIK4osey12Z9BAtGJTn3gTPvIPYOWvRiJHskOa1yvGSgcvmzc53x0Q FZgDq1JkBDsF3OZceSjGIp9QOqg+YJArlzun+mNxLbfnahEbhx21Z/ls65vLJCae ENIPAzet5Fxa8mZeJIyiV0zR05DcV+g64FOhcGJ7al4fRWtYVP8qa9FAyGFMV4L2 JL4xHuRO =pEBo -----END PGP SIGNATURE----- Merge tag 'bitmap-for-v7.1' of https://github.com/norov/linux Pull bitmap updates from Yury Norov: - new API: bitmap_weight_from() and bitmap_weighted_xor() (Yury) - drop unused __find_nth_andnot_bit() (Yury) - new tests and test improvements (Andy, Akinobu, Yury) - fixes for count_zeroes API (Yury) - cleanup bitmap_print_to_pagebuf() mess (Yury) - documentation updates (Andy, Kai, Kit). * tag 'bitmap-for-v7.1' of https://github.com/norov/linux: (24 commits) bitops: Update kernel-doc for sign_extendXX() powerpc/xive: simplify xive_spapr_debug_show() thermal: intel: switch cpumask_get() to using cpumask_print_to_pagebuf() coresight: don't use bitmap_print_to_pagebuf() lib/prime_numbers: drop temporary buffer in dump_primes() drm/xe: switch xe_pagefault_queue_init() to using bitmap_weighted_or() ice: use bitmap_empty() in ice_vf_has_no_qs_ena ice: use bitmap_weighted_xor() in ice_find_free_recp_res_idx() bitmap: introduce bitmap_weighted_xor() bitmap: add test_zero_nbits() bitmap: exclude nbits == 0 cases from bitmap test bitmap: test bitmap_weight() for more asm-generic/bitops: Fix a comment typo in instrumented-atomic.h bitops: fix kernel-doc parameter name for parity8() lib: count_zeros: unify count_{leading,trailing}_zeros() lib: count_zeros: fix 32/64-bit inconsistency in count_trailing_zeros() lib: crypto: fix comments for count_leading_zeros() x86/topology: use bitmap_weight_from() bitmap: add bitmap_weight_from() lib/find_bit_benchmark: avoid clearing randomly filled bitmap in test_find_first_bit() ... |
||
|
|
3f3a2aefbc |
iavf: fix kernel-doc comment style in iavf_ethtool.c
iavf_ethtool.c contains 31 kernel-doc comment blocks using the legacy `**/` terminator instead of the correct single `*/`. Two function headers also use a colon separator (`iavf_get_channels:`, `iavf_set_channels:`) instead of the ` - ` dash required by kernel-doc. Additionally several comments embed their return-value descriptions in the body paragraph, producing `scripts/kernel-doc -Wreturn` warnings. Void functions that incorrectly say "Returns ..." are also rephrased. Fix all issues across the full file: - Replace every `**/` terminator with `*/`. - Change `function_name:` doc headers to `function_name -`. - Move inline "Returns ..." sentences into dedicated `Return:` sections for non-void functions (iavf_get_msglevel, iavf_get_rxnfc, iavf_set_channels, iavf_get_rxfh_key_size, iavf_get_rxfh_indir_size, iavf_get_rxfh, iavf_set_rxfh). - Rephrase body descriptions in void functions that incorrectly said "Returns ..." (iavf_get_drvinfo, iavf_get_ringparam, iavf_get_coalesce). - Remove boilerplate body text for iavf_get_rxfh_key_size and iavf_get_rxfh_indir_size; the `Return:` line now conveys the same information without the vague "Returns the table size." sentence. Suggested-by: Anthony L. Nguyen <anthony.l.nguyen@intel.com> Suggested-by: Leszek Pepiak <leszek.pepiak@intel.com> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Breno Leitao <leitao@debian.org> Reviewed-by: Joe Damato <joe@dama.to> Link: https://patch.msgid.link/20260409093020.3808687-1-aleksandr.loktionov@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
b6e39e4846 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-7.0-rc8). Conflicts: net/ipv6/seg6_iptunnel.c |
||
|
|
82e68aa4a6 |
ice: use bitmap_empty() in ice_vf_has_no_qs_ena
bitmap_empty() is more verbose and efficient, as it stops traversing
{r,t}xq_ena as soon as the 1st set bit found.
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Yury Norov <ynorov@nvidia.com>
|
||
|
|
bdeaa653ae |
ice: use bitmap_weighted_xor() in ice_find_free_recp_res_idx()
Use the right helper and save one bitmaps traverse. Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Yury Norov <ynorov@nvidia.com> |
||
|
|
d3baa34a47 |
e1000: check return value of e1000_read_eeprom
[Why]
e1000_set_eeprom() performs a read-modify-write operation when the write
range is not word-aligned. This requires reading the first and last words
of the range from the EEPROM to preserve the unmodified bytes.
However, the code does not check the return value of e1000_read_eeprom().
If the read fails, the operation continues using uninitialized data from
eeprom_buff. This results in corrupted data being written back to the
EEPROM for the boundary words.
Add the missing error checks and abort the operation if reading fails.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes:
|
||
|
|
b1e0672403 |
igb: remove napi_synchronize() in igb_down()
When an AF_XDP zero-copy application terminates abruptly (e.g., kill -9), the XSK buffer pool is destroyed but NAPI polling continues. igb_clean_rx_irq_zc() repeatedly returns the full budget, preventing napi_complete_done() from clearing NAPI_STATE_SCHED. igb_down() calls napi_synchronize() before napi_disable() for each queue vector. napi_synchronize() spins waiting for NAPI_STATE_SCHED to clear, which never happens. igb_down() blocks indefinitely, the TX watchdog fires, and the TX queue remains permanently stalled. napi_disable() already handles this correctly: it sets NAPI_STATE_DISABLE. After a full-budget poll, __napi_poll() checks napi_disable_pending(). If set, it forces completion and clears NAPI_STATE_SCHED, breaking the loop that napi_synchronize() cannot. napi_synchronize() was added in commit |
||
|
|
4821d563cd |
ixgbevf: add missing negotiate_features op to Hyper-V ops table
Commit |
||
|
|
d8ae40dc20 |
ixgbe: stop re-reading flash on every get_drvinfo for e610
ixgbe_get_drvinfo() calls ixgbe_refresh_fw_version() on every ethtool
query for e610 adapters. That ends up in ixgbe_discover_flash_size(),
which bisects the full 16 MB NVM space issuing one ACI command per
step (~20 ms each, ~24 steps total = ~500 ms).
Profiling on an idle E610-XAT2 system with telegraf scraping ethtool
stats every 10 seconds:
kretprobe:ixgbe_get_drvinfo took 527603 us
kretprobe:ixgbe_get_drvinfo took 523978 us
kretprobe:ixgbe_get_drvinfo took 552975 us
kretprobe:ice_get_drvinfo took 3 us
kretprobe:igb_get_drvinfo took 2 us
kretprobe:i40e_get_drvinfo took 5 us
The half-second stall happens under the RTNL lock, causing visible
latency on ip-link and friends.
The FW version can only change after an EMPR reset. All flash data is
already populated at probe time and the cached adapter->eeprom_id is
what get_drvinfo should be returning. The only place that needs to
trigger a re-read is ixgbe_devlink_reload_empr_finish(), right after
the EMPR completes and new firmware is running. Additionally, refresh
the FW version in ixgbe_reinit_locked() so that any PF that undergoes a
reinit after an EMPR (e.g. triggered by another PF's devlink reload)
also picks up the new version in adapter->eeprom_id.
ixgbe_devlink_info_get() keeps its refresh call for explicit
"devlink dev info" queries, which is fine given those are user-initiated.
Fixes:
|
||
|
|
bf6dbadb72 |
ice: fix PTP timestamping broken by SyncE code on E825C
The E825C SyncE support added in commit |
||
|
|
bb3f21edc7 |
ice: ptp: don't WARN when controlling PF is unavailable
In VFIO passthrough setups, it is possible to pass through only a PF
which doesn't own the source timer. In that case the PTP controlling PF
(adapter->ctrl_pf) is never initialized in the VM, so ice_get_ctrl_ptp()
returns NULL and triggers WARN_ON() in ice_ptp_setup_pf().
Since this is an expected behavior in that configuration, replace
WARN_ON() with an informational message and return -EOPNOTSUPP.
Fixes:
|
||
|
|
8e2a2420e2 |
idpf: set the payload size before calling the async handler
Set the payload size before forwarding the reply to the async handler.
Without this, xn->reply_sz will be 0 and idpf_mac_filter_async_handler()
will never get past the size check.
Fixes:
|
||
|
|
d086fae650 |
idpf: improve locking around idpf_vc_xn_push_free()
Protect the set_bit() operation for the free_xn bitmask in
idpf_vc_xn_push_free(), to make the locking consistent with rest of the
code and avoid potential races in that logic.
Fixes:
|
||
|
|
5914781182 |
idpf: fix PREEMPT_RT raw/bh spinlock nesting for async VC handling
Switch from using the completion's raw spinlock to a local lock in the
idpf_vc_xn struct. The conversion is safe because complete/_all() are
called outside the lock and there is no reason to share the completion
lock in the current logic. This avoids invalid wait context reported by
the kernel due to the async handler taking BH spinlock:
[ 805.726977] =============================
[ 805.726991] [ BUG: Invalid wait context ]
[ 805.727006] 7.0.0-rc2-net-devq-031026+ #28 Tainted: G S OE
[ 805.727026] -----------------------------
[ 805.727038] kworker/u261:0/572 is trying to lock:
[ 805.727051] ff190da6a8dbb6a0 (&vport_config->mac_filter_list_lock){+...}-{3:3}, at: idpf_mac_filter_async_handler+0xe9/0x260 [idpf]
[ 805.727099] other info that might help us debug this:
[ 805.727111] context-{5:5}
[ 805.727119] 3 locks held by kworker/u261:0/572:
[ 805.727132] #0: ff190da6db3e6148 ((wq_completion)idpf-0000:83:00.0-mbx){+.+.}-{0:0}, at: process_one_work+0x4b5/0x730
[ 805.727163] #1: ff3c6f0a6131fe50 ((work_completion)(&(&adapter->mbx_task)->work)){+.+.}-{0:0}, at: process_one_work+0x1e5/0x730
[ 805.727191] #2: ff190da765190020 (&x->wait#34){+.+.}-{2:2}, at: idpf_recv_mb_msg+0xc8/0x710 [idpf]
[ 805.727218] stack backtrace:
...
[ 805.727238] Workqueue: idpf-0000:83:00.0-mbx idpf_mbx_task [idpf]
[ 805.727247] Call Trace:
[ 805.727249] <TASK>
[ 805.727251] dump_stack_lvl+0x77/0xb0
[ 805.727259] __lock_acquire+0xb3b/0x2290
[ 805.727268] ? __irq_work_queue_local+0x59/0x130
[ 805.727275] lock_acquire+0xc6/0x2f0
[ 805.727277] ? idpf_mac_filter_async_handler+0xe9/0x260 [idpf]
[ 805.727284] ? _printk+0x5b/0x80
[ 805.727290] _raw_spin_lock_bh+0x38/0x50
[ 805.727298] ? idpf_mac_filter_async_handler+0xe9/0x260 [idpf]
[ 805.727303] idpf_mac_filter_async_handler+0xe9/0x260 [idpf]
[ 805.727310] idpf_recv_mb_msg+0x1c8/0x710 [idpf]
[ 805.727317] process_one_work+0x226/0x730
[ 805.727322] worker_thread+0x19e/0x340
[ 805.727325] ? __pfx_worker_thread+0x10/0x10
[ 805.727328] kthread+0xf4/0x130
[ 805.727333] ? __pfx_kthread+0x10/0x10
[ 805.727336] ret_from_fork+0x32c/0x410
[ 805.727345] ? __pfx_kthread+0x10/0x10
[ 805.727347] ret_from_fork_asm+0x1a/0x30
[ 805.727354] </TASK>
Fixes:
|
||
|
|
9ebcf66cd6 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-7.0-rc6). No conflicts, or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
b5e5797e3c |
idpf: only assign num refillqs if allocation was successful
As reported by AI review [1], if the refillqs allocation fails, refillqs
will be NULL but num_refillqs will be non-zero. The release function
will then dereference refillqs since it thinks the refillqs are present,
resulting in a NULL ptr dereference.
Only assign the num refillqs if the allocation was successful. This will
prevent the release function from entering the loop and accessing
refillqs.
[1] https://lore.kernel.org/netdev/20260227035625.2632753-1-kuba@kernel.org/
Fixes:
|
||
|
|
1eb0db7e39 |
idpf: clear stale cdev_info ptr
Deinit calls idpf_idc_deinit_core_aux_device to free the cdev_info
memory, but leaves the adapter->cdev_info field with a stale pointer
value. This will bypass subsequent "if (!cdev_info)" checks if cdev_info
is not reallocated. For example, if idc_init fails after a reset,
cdev_info will already have been freed during the reset handling, but it
will not have been reallocated. The next reset or rmmod will result in a
crash.
[ +0.000008] BUG: kernel NULL pointer dereference, address: 00000000000000d0
[ +0.000033] #PF: supervisor read access in kernel mode
[ +0.000020] #PF: error_code(0x0000) - not-present page
[ +0.000017] PGD 2097dfa067 P4D 0
[ +0.000017] Oops: Oops: 0000 [#1] SMP NOPTI
...
[ +0.000018] RIP: 0010:device_del+0x3e/0x3d0
[ +0.000010] Call Trace:
[ +0.000010] <TASK>
[ +0.000012] idpf_idc_deinit_core_aux_device+0x36/0x70 [idpf]
[ +0.000034] idpf_vc_core_deinit+0x3e/0x180 [idpf]
[ +0.000035] idpf_remove+0x40/0x1d0 [idpf]
[ +0.000035] pci_device_remove+0x42/0xb0
[ +0.000020] device_release_driver_internal+0x19c/0x200
[ +0.000024] driver_detach+0x48/0x90
[ +0.000018] bus_remove_driver+0x6d/0x100
[ +0.000023] pci_unregister_driver+0x2e/0xb0
[ +0.000022] __do_sys_delete_module.isra.0+0x18c/0x2b0
[ +0.000025] ? kmem_cache_free+0x2c2/0x390
[ +0.000023] do_syscall_64+0x107/0x7d0
[ +0.000023] entry_SYSCALL_64_after_hwframe+0x76/0x7e
Pass the adapter struct into idpf_idc_deinit_core_aux_device instead and
clear the cdev_info ptr.
Fixes:
|
||
|
|
fecacfc95f |
iavf: fix out-of-bounds writes in iavf_get_ethtool_stats()
iavf incorrectly uses real_num_tx_queues for ETH_SS_STATS. Since the
value could change in runtime, we should use num_tx_queues instead.
Moreover iavf_get_ethtool_stats() uses num_active_queues while
iavf_get_sset_count() and iavf_get_stat_strings() use
real_num_tx_queues, which triggers out-of-bounds writes when we do
"ethtool -L" and "ethtool -S" simultaneously [1].
For example when we change channels from 1 to 8, Thread 3 could be
scheduled before Thread 2, and out-of-bounds writes could be triggered
in Thread 3:
Thread 1 (ethtool -L) Thread 2 (work) Thread 3 (ethtool -S)
iavf_set_channels()
...
iavf_alloc_queues()
-> num_active_queues = 8
iavf_schedule_finish_config()
iavf_get_sset_count()
real_num_tx_queues: 1
-> buffer for 1 queue
iavf_get_ethtool_stats()
num_active_queues: 8
-> out-of-bounds!
iavf_finish_config()
-> real_num_tx_queues = 8
Use immutable num_tx_queues in all related functions to avoid the issue.
[1]
BUG: KASAN: vmalloc-out-of-bounds in iavf_add_one_ethtool_stat+0x200/0x270
Write of size 8 at addr ffffc900031c9080 by task ethtool/5800
CPU: 1 UID: 0 PID: 5800 Comm: ethtool Not tainted 6.19.0-enjuk-08403-g8137e3db7f1c #241 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x6f/0xb0
print_report+0x170/0x4f3
kasan_report+0xe1/0x180
iavf_add_one_ethtool_stat+0x200/0x270
iavf_get_ethtool_stats+0x14c/0x2e0
__dev_ethtool+0x3d0c/0x5830
dev_ethtool+0x12d/0x270
dev_ioctl+0x53c/0xe30
sock_do_ioctl+0x1a9/0x270
sock_ioctl+0x3d4/0x5e0
__x64_sys_ioctl+0x137/0x1c0
do_syscall_64+0xf3/0x690
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f7da0e6e36d
...
</TASK>
The buggy address belongs to a 1-page vmalloc region starting at 0xffffc900031c9000 allocated at __dev_ethtool+0x3cc9/0x5830
The buggy address belongs to the physical page: page: refcount:1 mapcount:0 mapping:0000000000000000
index:0xffff88813a013de0 pfn:0x13a013
flags: 0x200000000000000(node=0|zone=2)
raw: 0200000000000000 0000000000000000 dead000000000122 0000000000000000
raw: ffff88813a013de0 0000000000000000 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffffc900031c8f80: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
ffffc900031c9000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffffc900031c9080: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
^
ffffc900031c9100: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
ffffc900031c9180: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
Fixes:
|
||
|
|
2526e440df |
ice: use ice_update_eth_stats() for representor stats
ice_repr_get_stats64() and __ice_get_ethtool_stats() call
ice_update_vsi_stats() on the VF's src_vsi. This always returns early
because ICE_VSI_DOWN is permanently set for VF VSIs - ice_up() is never
called on them since queues are managed by iavf through virtchnl.
In __ice_get_ethtool_stats() the original code called
ice_update_vsi_stats() for all VSIs including representors, iterated
over ice_gstrings_vsi_stats[] to populate the data, and then bailed out
with an early return before the per-queue ring stats section. That early
return was necessary because representor VSIs have no rings on the PF
side - the rings belong to the VF driver (iavf), so accessing per-queue
stats would be invalid.
Move the representor handling to the top of __ice_get_ethtool_stats()
and call ice_update_eth_stats() directly to read the hardware GLV_*
counters. This matches ice_get_vf_stats() which already uses
ice_update_eth_stats() for the same VF VSI in legacy mode. Apply the
same fix to ice_repr_get_stats64().
Note that ice_gstrings_vsi_stats[] contains five software ring counters
(rx_buf_failed, rx_page_failed, tx_linearize, tx_busy, tx_restart) that
are always zero for representors since the PF never processes packets on
VF rings. This is pre-existing behavior unchanged by this patch.
Fixes:
|
||
|
|
ad85de0fc0 |
ice: fix inverted ready check for VF representors
Commit |
||
|
|
c7fcd269e1 |
ice: set max queues in alloc_etherdev_mqs()
When allocating netdevice using alloc_etherdev_mqs() the maximum
supported queues number should be passed. The vsi->alloc_txq/rxq is
storing current number of queues, not the maximum ones.
Use the same function for getting max Tx and Rx queues which is used
during ethtool -l call to set maximum number of queues during netdev
allocation.
Reproduction steps:
$ethtool -l $pf # says current 16, max 64
$ethtool -S $pf # fine
$ethtool -L $pf combined 40 # crash
[491187.472594] Call Trace:
[491187.472829] <TASK>
[491187.473067] netif_set_xps_queue+0x26/0x40
[491187.473305] ice_vsi_cfg_txq+0x265/0x3d0 [ice]
[491187.473619] ice_vsi_cfg_lan_txqs+0x68/0xa0 [ice]
[491187.473918] ice_vsi_cfg_lan+0x2b/0xa0 [ice]
[491187.474202] ice_vsi_open+0x71/0x170 [ice]
[491187.474484] ice_vsi_recfg_qs+0x17f/0x230 [ice]
[491187.474759] ? dev_get_min_mp_channel_count+0xab/0xd0
[491187.474987] ice_set_channels+0x185/0x3d0 [ice]
[491187.475278] ethnl_set_channels+0x26f/0x340
Fixes:
|