Commit Graph

332 Commits

Author SHA1 Message Date
Linus Torvalds
91a4855d6c Networking changes for 7.1.
Core & protocols
 ----------------
 
  - Support HW queue leasing, allowing containers to be granted access
    to HW queues for zero-copy operations and AF_XDP.
 
  - Number of code moves to help the compiler with inlining.
    Avoid output arguments for returning drop reason where possible.
 
  - Rework drop handling within qdiscs to include more metadata
    about the reason and dropping qdisc in the tracepoints.
 
  - Remove the rtnl_lock use from IP Multicast Routing.
 
  - Pack size information into the Rx Flow Steering table pointer
    itself. This allows making the table itself a flat array of u32s,
    thus making the table allocation size a power of two.
 
  - Report TCP delayed ack timer information via socket diag.
 
  - Add ip_local_port_step_width sysctl to allow distributing the randomly
    selected ports more evenly throughout the allowed space.
 
  - Add support for per-route tunsrc in IPv6 segment routing.
 
  - Start work of switching sockopt handling to iov_iter.
 
  - Improve dynamic recvbuf sizing in MPTCP, limit burstiness and avoid
    buffer size drifting up.
 
  - Support MSG_EOR in MPTCP.
 
  - Add stp_mode attribute to the bridge driver for STP mode selection.
    This addresses concerns about call_usermodehelper() usage.
 
  - Remove UDP-Lite support (as announced in 2023).
 
  - Remove support for building IPv6 as a module.
    Remove the now unnecessary function calling indirection.
 
 Cross-tree stuff
 ----------------
 
  - Move Michael MIC code from generic crypto into wireless,
    it's considered insecure but some WiFi networks still need it.
 
 Netfilter
 ---------
 
  - Switch nft_fib_ipv6 module to no longer need temporary dst_entry
    object allocations by using fib6_lookup() + RCU.
    Florian W reports this gets us ~13% higher packet rate.
 
  - Convert IPVS's global __ip_vs_mutex to per-net service_mutex and
    switch the service tables to be per-net. Convert some code that
    walks the service lists to use RCU instead of the service_mutex.
 
  - Add more opinionated input validation to lower security exposure.
 
  - Make IPVS hash tables to be per-netns and resizable.
 
 Wireless
 --------
 
  - Finished assoc frame encryption/EPPKE/802.1X-over-auth.
 
  - Radar detection improvements.
 
  - Add 6 GHz incumbent signal detection APIs.
 
  - Multi-link support for FILS, probe response templates and
    client probing.
 
  - New APIs and mac80211 support for NAN (Neighbor Aware Networking,
    aka Wi-Fi Aware) so less work must be in firmware.
 
 Driver API
 ----------
 
  - Add numerical ID for devlink instances (to avoid having to create
    fake bus/device pairs just to have an ID). Support shared devlink
    instances which span multiple PFs.
 
  - Add standard counters for reporting pause storm events
    (implement in mlx5 and fbnic).
 
  - Add configuration API for completion writeback buffering
    (implement in mana).
 
  - Support driver-initiated change of RSS context sizes.
 
  - Support DPLL monitoring input frequency (implement in zl3073x).
 
  - Support per-port resources in devlink (implement in mlx5).
 
 Misc
 ----
 
  - Expand the YAML spec for Netfilter.
 
 Drivers
 -------
 
  - Software:
    - macvlan: support multicast rx for bridge ports with shared source
      MAC address
    - team: decouple receive and transmit enablement for IEEE 802.3ad
      LACP "independent control"
 
  - Ethernet high-speed NICs:
    - nVidia/Mellanox:
      - support high order pages in zero-copy mode (for payload
        coalescing)
      - support multiple packets in a page (for systems with 64kB pages)
    - Broadcom 25-400GE (bnxt):
      - implement XDP RSS hash metadata extraction
      - add software fallback for UDP GSO, lowering the IOMMU cost
    - Broadcom 800GE (bnge):
      - add link status and configuration handling
      - add various HW and SW statistics
    - Marvell/Cavium:
      - NPC HW block support for cn20k
    - Huawei (hinic3):
      - add mailbox / control queue
      - add rx VLAN offload
      - add driver info and link management
 
  - Ethernet NICs:
    - Marvell/Aquantia:
      - support reading SFP module info on some AQC100 cards
    - Realtek PCI (r8169):
      - add support for RTL8125cp
    - Realtek USB (r8152):
      - support for the RTL8157 5Gbit chip
      - add 2500baseT EEE status/configuration support
 
  - Ethernet NICs embedded and off-the-shelf IP:
    - Synopsys (stmmac):
      - cleanup and reorganize SerDes handling and PCS support
      - cleanup descriptor handling and per-platform data
      - cleanup and consolidate MDIO defines and handling
      - shrink driver memory use for internal structures
      - improve Tx IRQ coalescing
      - improve TCP segmentation handling
      - add support for Spacemit K3
    - Cadence (macb):
      - support PHYs that have inband autoneg disabled with GEM
      - support IEEE 802.3az EEE
      - rework usrio capabilities and handling
    - AMD (xgbe):
      - improve power management for S0i3
      - improve TX resilience for link-down handling
 
  - Virtual:
    - Google cloud vNIC:
      - support larger ring sizes in DQO-QPL mode
      - improve HW-GRO handling
      - support UDP GSO for DQO format
    - PCIe NTB:
      - support queue count configuration
 
  - Ethernet PHYs:
    - automatically disable PHY autonomous EEE if MAC is in charge
    - Broadcom:
      - add BCM84891/BCM84892 support
    - Micrel:
      - support for LAN9645X internal PHY
    - Realtek:
      - add RTL8224 pair order support
      - support PHY LEDs on RTL8211F-VD
      - support spread spectrum clocking (SSC)
    - Maxlinear:
      - add PHY-level statistics via ethtool
 
  - Ethernet switches:
    - Maxlinear (mxl862xx):
      - support for bridge offloading
      - support for VLANs
      - support driver statistics
 
  - Bluetooth:
    - large number of fixes and new device IDs
    - Mediatek:
      - support MT6639 (MT7927)
      - support MT7902 SDIO
 
  - WiFi:
    - Intel (iwlwifi):
      - UNII-9 and continuing UHR work
    - MediaTek (mt76):
      - mt7996/mt7925 MLO fixes/improvements
      - mt7996 NPU support (HW eth/wifi traffic offload)
    - Qualcomm (ath12k):
      - monitor mode support on IPQ5332
      - basic hwmon temperature reporting
      - support IPQ5424
    - Realtek:
      - add USB RX aggregation to improve performance
      - add USB TX flow control by tracking in-flight URBs
 
  - Cellular:
    - IPA v5.2 support
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmnelNoACgkQMUZtbf5S
 IrtWFw//WyiXuEiGawVQONnbu1dtR+3nw/cvNpSYi0IM66vbRUB9n+9fxm2MIyG4
 4jI/c/X/fxIvUxEqGez3yPn5P7KqkQR8WRYwkxrMYKRpXeukN0IDk5Euew5DskCe
 wtBKNJOQWKdKXff0bLQoJ9dHWYuJ2IMRVil5M3fhUbeUOXeyJD7Yn1w2ICvJAbj+
 T/Hw7sEtchNaHp6h6SbaQfahkUFHQG5peNoETkZF4UDF6ALGY29WH91GXeO2lrgN
 IxX203KtaavV0oU8T0oixZgOc57Ns081YfFL/F1JP2HV6lgkwhuq+zxCrRTi1c9M
 HPTXgwD7Z80Y74nM3YTLrPfoMOP8GLBZgdV3rUpwmteM26+gMTm+O1zHUur5ZoGy
 D6TaMFguPTIqiRyrARa9xY/J6r9TQkc2Wfu4bIuPndKFg8xPoepuEObODnh0+5Hg
 4j4pdFhIo2huENhSg7kVb/yl+1q68SFwM3RqTmx+OhCa0AyjcKIKgt/UBhismdnG
 r8obxzb+nXeJc2rRDuwNMwlBlcMSbep27uGt64zeHMMXVhTVqOoytNaL/X/ZpH2m
 A0DscUrpHvb36IoDPtanc6irP+JOh5Xe7Nw5qhkgwsMc7hlf8SyyHB4OUBBaz1qA
 ETSnHlfwklRmXSpWqH2LyGXjdOQpDKP46+h0W3dttMD2/cRBqYo=
 =EhQZ
 -----END PGP SIGNATURE-----

Merge tag 'net-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core & protocols:

   - Support HW queue leasing, allowing containers to be granted access
     to HW queues for zero-copy operations and AF_XDP

   - Number of code moves to help the compiler with inlining. Avoid
     output arguments for returning drop reason where possible

   - Rework drop handling within qdiscs to include more metadata about
     the reason and dropping qdisc in the tracepoints

   - Remove the rtnl_lock use from IP Multicast Routing

   - Pack size information into the Rx Flow Steering table pointer
     itself. This allows making the table itself a flat array of u32s,
     thus making the table allocation size a power of two

   - Report TCP delayed ack timer information via socket diag

   - Add ip_local_port_step_width sysctl to allow distributing the
     randomly selected ports more evenly throughout the allowed space

   - Add support for per-route tunsrc in IPv6 segment routing

   - Start work of switching sockopt handling to iov_iter

   - Improve dynamic recvbuf sizing in MPTCP, limit burstiness and avoid
     buffer size drifting up

   - Support MSG_EOR in MPTCP

   - Add stp_mode attribute to the bridge driver for STP mode selection.
     This addresses concerns about call_usermodehelper() usage

   - Remove UDP-Lite support (as announced in 2023)

   - Remove support for building IPv6 as a module. Remove the now
     unnecessary function calling indirection

  Cross-tree stuff:

   - Move Michael MIC code from generic crypto into wireless, it's
     considered insecure but some WiFi networks still need it

  Netfilter:

   - Switch nft_fib_ipv6 module to no longer need temporary dst_entry
     object allocations by using fib6_lookup() + RCU.

     Florian W reports this gets us ~13% higher packet rate

   - Convert IPVS's global __ip_vs_mutex to per-net service_mutex and
     switch the service tables to be per-net. Convert some code that
     walks the service lists to use RCU instead of the service_mutex

   - Add more opinionated input validation to lower security exposure

   - Make IPVS hash tables to be per-netns and resizable

  Wireless:

   - Finished assoc frame encryption/EPPKE/802.1X-over-auth

   - Radar detection improvements

   - Add 6 GHz incumbent signal detection APIs

   - Multi-link support for FILS, probe response templates and client
     probing

   - New APIs and mac80211 support for NAN (Neighbor Aware Networking,
     aka Wi-Fi Aware) so less work must be in firmware

  Driver API:

   - Add numerical ID for devlink instances (to avoid having to create
     fake bus/device pairs just to have an ID). Support shared devlink
     instances which span multiple PFs

   - Add standard counters for reporting pause storm events (implement
     in mlx5 and fbnic)

   - Add configuration API for completion writeback buffering (implement
     in mana)

   - Support driver-initiated change of RSS context sizes

   - Support DPLL monitoring input frequency (implement in zl3073x)

   - Support per-port resources in devlink (implement in mlx5)

  Misc:

   - Expand the YAML spec for Netfilter

  Drivers

   - Software:
      - macvlan: support multicast rx for bridge ports with shared
        source MAC address
      - team: decouple receive and transmit enablement for IEEE 802.3ad
        LACP "independent control"

   - Ethernet high-speed NICs:
      - nVidia/Mellanox:
         - support high order pages in zero-copy mode (for payload
           coalescing)
         - support multiple packets in a page (for systems with 64kB
           pages)
      - Broadcom 25-400GE (bnxt):
         - implement XDP RSS hash metadata extraction
         - add software fallback for UDP GSO, lowering the IOMMU cost
      - Broadcom 800GE (bnge):
         - add link status and configuration handling
         - add various HW and SW statistics
      - Marvell/Cavium:
         - NPC HW block support for cn20k
      - Huawei (hinic3):
         - add mailbox / control queue
         - add rx VLAN offload
         - add driver info and link management

   - Ethernet NICs:
      - Marvell/Aquantia:
         - support reading SFP module info on some AQC100 cards
      - Realtek PCI (r8169):
         - add support for RTL8125cp
      - Realtek USB (r8152):
         - support for the RTL8157 5Gbit chip
         - add 2500baseT EEE status/configuration support

   - Ethernet NICs embedded and off-the-shelf IP:
      - Synopsys (stmmac):
         - cleanup and reorganize SerDes handling and PCS support
         - cleanup descriptor handling and per-platform data
         - cleanup and consolidate MDIO defines and handling
         - shrink driver memory use for internal structures
         - improve Tx IRQ coalescing
         - improve TCP segmentation handling
         - add support for Spacemit K3
      - Cadence (macb):
         - support PHYs that have inband autoneg disabled with GEM
         - support IEEE 802.3az EEE
         - rework usrio capabilities and handling
      - AMD (xgbe):
         - improve power management for S0i3
         - improve TX resilience for link-down handling

   - Virtual:
      - Google cloud vNIC:
         - support larger ring sizes in DQO-QPL mode
         - improve HW-GRO handling
         - support UDP GSO for DQO format
      - PCIe NTB:
         - support queue count configuration

   - Ethernet PHYs:
      - automatically disable PHY autonomous EEE if MAC is in charge
      - Broadcom:
         - add BCM84891/BCM84892 support
      - Micrel:
         - support for LAN9645X internal PHY
      - Realtek:
         - add RTL8224 pair order support
         - support PHY LEDs on RTL8211F-VD
         - support spread spectrum clocking (SSC)
      - Maxlinear:
         - add PHY-level statistics via ethtool

   - Ethernet switches:
      - Maxlinear (mxl862xx):
         - support for bridge offloading
         - support for VLANs
         - support driver statistics

   - Bluetooth:
      - large number of fixes and new device IDs
      - Mediatek:
         - support MT6639 (MT7927)
         - support MT7902 SDIO

   - WiFi:
      - Intel (iwlwifi):
         - UNII-9 and continuing UHR work
      - MediaTek (mt76):
         - mt7996/mt7925 MLO fixes/improvements
         - mt7996 NPU support (HW eth/wifi traffic offload)
      - Qualcomm (ath12k):
         - monitor mode support on IPQ5332
         - basic hwmon temperature reporting
         - support IPQ5424
      - Realtek:
         - add USB RX aggregation to improve performance
         - add USB TX flow control by tracking in-flight URBs

   - Cellular:
      - IPA v5.2 support"

* tag 'net-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1561 commits)
  net: pse-pd: fix kernel-doc function name for pse_control_find_by_id()
  wireguard: device: use exit_rtnl callback instead of manual rtnl_lock in pre_exit
  wireguard: allowedips: remove redundant space
  tools: ynl: add sample for wireguard
  wireguard: allowedips: Use kfree_rcu() instead of call_rcu()
  MAINTAINERS: Add netkit selftest files
  selftests/net: Add additional test coverage in nk_qlease
  selftests/net: Split netdevsim tests from HW tests in nk_qlease
  tools/ynl: Make YnlFamily closeable as a context manager
  net: airoha: Add missing PPE configurations in airoha_ppe_hw_init()
  net: airoha: Fix VIP configuration for AN7583 SoC
  net: caif: clear client service pointer on teardown
  net: strparser: fix skb_head leak in strp_abort_strp()
  net: usb: cdc-phonet: fix skb frags[] overflow in rx_complete()
  selftests/bpf: add test for xdp_master_redirect with bond not up
  net, bpf: fix null-ptr-deref in xdp_master_redirect() for down master
  net: airoha: Remove PCE_MC_EN_MASK bit in REG_FE_PCE_CFG configuration
  sctp: disable BH before calling udp_tunnel_xmit_skb()
  sctp: fix missing encap_port propagation for GSO fragments
  net: airoha: Rely on net_device pointer in ETS callbacks
  ...
2026-04-14 18:36:10 -07:00
Christian Brauner
e3b2cf6e5d kernfs: pass struct ns_common instead of const void * for namespace tags
kernfs has historically used const void * to pass around namespace tags
used for directory-level namespace filtering. The only current user of
this is sysfs network namespace tagging where struct net pointers are
cast to void *.

Replace all const void * namespace parameters with const struct
ns_common * throughout the kernfs, sysfs, and kobject namespace layers.
This includes the kobj_ns_type_operations callbacks, kobject_namespace(),
and all sysfs/kernfs APIs that accept or return namespace tags.

Passing struct ns_common is needed because various codepaths require
access to the underlying namespace. A struct ns_common can always be
converted back to the concrete namespace type (e.g., struct net) via
container_of() or to_ns_common() in the reverse direction.

This is a preparatory change for switching to ns_id-based directory
iteration to prevent a KASLR pointer leak through the current use of
raw namespace pointers as hash seeds and comparison keys.

Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-04-09 14:36:52 +02:00
Yury Norov
de69301dc2 net-sysfs: switch xps_queue_show() to sysfs_emit()
Switch the function to use the proper sysfs_emit("%pb").

Suggested-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Yury Norov <ynorov@nvidia.com>
Link: https://patch.msgid.link/20260319201713.941956-3-ynorov@nvidia.com
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-03-24 11:08:17 +01:00
Eric Dumazet
a435163d31 net-sysfs: use rps_tag_ptr and remove metadata from rps_dev_flow_table
Instead of storing the @log at the beginning of rps_dev_flow_table
use 5 low order bits of the rps_tag_ptr to store the log of the size.

This removes a potential cache line miss (for light traffic).

This allows us to switch to one high-order allocation instead of vmalloc()
when CONFIG_RFS_ACCEL is not set.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260302181432.1836150-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-04 16:54:10 -08:00
Eric Dumazet
b2cc61857e net-sysfs: remove rcu field from 'struct rps_dev_flow_table'
Remove rps_dev_flow_table_release() in favor of kvfree_rcu_mightsleep().

In the following pach, we will remove "u8 @log" field
and 'struct rps_dev_flow_table' size will be a power-of-two.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260302181432.1836150-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-04 16:54:10 -08:00
Eric Dumazet
68b6394a22 net-sysfs: get rid of rps_dev_flow_lock
Use unrcu_pointer() and xchg() in store_rps_dev_flow_table_cnt()
instead of a dedicated spinlock.

Make a similar change in rx_queue_release(), so that both
functions use a similar construct and synchronization.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260302181432.1836150-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-04 16:54:10 -08:00
Frederic Weisbecker
662ff1aac8 net: Keep ignoring isolated cpuset change
RPS cpumask can be overriden through sysfs/syctl. The boot defined
isolated CPUs are then excluded from that cpumask.

However HK_TYPE_DOMAIN will soon integrate cpuset isolated
CPUs updates and the RPS infrastructure needs more thoughts to be able
to propagate such changes and synchronize against them.

Keep handling only what was passed through "isolcpus=" for now.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Simon Horman <horms@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: netdev@vger.kernel.org
2026-02-03 15:23:33 +01:00
Linus Torvalds
07fdad3a93 Networking changes for 6.18.
Core & protocols
 ----------------
 
  - Improve drop account scalability on NUMA hosts for RAW and UDP sockets
    and the backlog, almost doubling the Pps capacity under DoS.
 
  - Optimize the UDP RX performance under stress, reducing contention,
    revisiting the binary layout of the involved data structs and
    implementing NUMA-aware locking. This improves UDP RX performance by
    an additional 50%, even more under extreme conditions.
 
  - Add support for PSP encryption of TCP connections; this mechanism has
    some similarities with IPsec and TLS, but offers superior HW offloads
    capabilities.
 
  - Ongoing work to support Accurate ECN for TCP. AccECN allows more than
    one congestion notification signal per RTT and is a building block for
    Low Latency, Low Loss, and Scalable Throughput (L4S).
 
  - Reorganize the TCP socket binary layout for data locality, reducing
    the number of touched cachelines in the fastpath.
 
  - Refactor skb deferral free to better scale on large multi-NUMA hosts,
    this improves TCP and UDP RX performances significantly on such HW.
 
  - Increase the default socket memory buffer limits from 256K to 4M to
    better fit modern link speeds.
 
  - Improve handling of setups with a large number of nexthop, making dump
    operating scaling linearly and avoiding unneeded synchronize_rcu() on
    delete.
 
  - Improve bridge handling of VLAN FDB, storing a single entry per bridge
    instead of one entry per port; this makes the dump order of magnitude
    faster on large switches.
 
  - Restore IP ID correctly for encapsulated packets at GSO segmentation
    time, allowing GRO to merge packets in more scenarios.
 
  - Improve netfilter matching performance on large sets.
 
  - Improve MPTCP receive path performance by leveraging recently
    introduced core infrastructure (skb deferral free) and adopting recent
    TCP autotuning changes.
 
  - Allow bridges to redirect to a backup port when the bridge port is
    administratively down.
 
  - Introduce MPTCP 'laminar' endpoint that con be used only once per
    connection and simplify common MPTCP setups.
 
  - Add RCU safety to dst->dev, closing a lot of possible races.
 
  - A significant crypto library API for SCTP, MPTCP and IPv6 SR, reducing
    code duplication.
 
  - Supports pulling data from an skb frag into the linear area of an XDP
    buffer.
 
 Things we sprinkled into general kernel code
 --------------------------------------------
 
  - Generate netlink documentation from YAML using an integrated
    YAML parser.
 
 Driver API
 ----------
 
  - Support using IPv6 Flow Label in Rx hash computation and RSS queue
    selection.
 
  - Introduce API for fetching the DMA device for a given queue, allowing
    TCP zerocopy RX on more H/W setups.
 
  - Make XDP helpers compatible with unreadable memory, allowing more
    easily building DevMem-enabled drivers with a unified XDP/skbs
    datapath.
 
  - Add a new dedicated ethtool callback enabling drivers to provide the
    number of RX rings directly, improving efficiency and clarity in RX
    ring queries and RSS configuration.
 
  - Introduce a burst period for the health reporter, allowing better
    handling of multiple errors due to the same root cause.
 
  - Support for DPLL phase offset exponential moving average, controlling
    the average smoothing factor.
 
 Device drivers
 --------------
 
  - Add a new Huawei driver for 3rd gen NIC (hinic3).
 
  - Add a new SpacemiT driver for K1 ethernet MAC.
 
  - Add a generic abstraction for shared memory communication devices
    (dibps)
 
  - Ethernet high-speed NICs:
    - nVidia/Mellanox:
      - Use multiple per-queue doorbell, to avoid MMIO contention issues
      - support adjacent functions, allowing them to delegate their
        SR-IOV VFs to sibling PFs
      - support RSS for IPSec offload
      - support exposing raw cycle counters in PTP and mlx5
      - support for disabling host PFs.
    - Intel (100G, ice, idpf):
      - ice: support for SRIOV VFs over an Active-Active link aggregate
      - ice: support for firmware logging via debugfs
      - ice: support for Earliest TxTime First (ETF) hardware offload
      - idpf: support basic XDP functionalities and XSk
    - Broadcom (bnxt):
      - support Hyper-V VF ID
      - dynamic SRIOV resource allocations for RoCE
    - Meta (fbnic):
      - support queue API, zero-copy Rx and Tx
      - support basic XDP functionalities
      - devlink health support for FW crashes and OTP mem corruptions
      - expand hardware stats coverage to FEC, PHY, and Pause
    - Wangxun:
      - support ethtool coalesce options
      - support for multiple RSS contexts
 
  - Ethernet virtual:
    - Macsec:
      - replace custom netlink attribute checks with policy-level checks
    - Bonding:
      - support aggregator selection based on port priority
    - Microsoft vNIC:
      - use page pool fragments for RX buffers instead of full pages to
        improve memory efficiency
 
  - Ethernet NICs consumer, and embedded:
    - Qualcomm: support Ethernet function for IPQ9574 SoC
    - Airoha: implement wlan offloading via NPU
    - Freescale
      - enetc: add NETC timer PTP driver and add PTP support
      - fec: enable the Jumbo frame support for i.MX8QM
    - Renesas (R-Car S4): support HW offloading for layer 2 switching
      - support for RZ/{T2H, N2H} SoCs
    - Cadence (macb): support TAPRIO traffic scheduling
    - TI:
      - support for Gigabit ICSS ethernet SoC (icssm-prueth)
    - Synopsys (stmmac): a lot of cleanups
 
  - Ethernet PHYs:
    - Support 10g-qxgmi phy-mode for AQR412C, Felix DSA and Lynx PCS
      driver
    - Support bcm63268 GPHY power control
    - Support for Micrel lan8842 PHY and PTP
    - Support for Aquantia AQR412 and AQR115
 
  - CAN:
    - a large CAN-XL preparation work
    - reorganize raw_sock and uniqframe struct to minimize memory usage
    - rcar_canfd: update the CAN-FD handling
 
  - WiFi:
    - extended Neighbor Awareness Networking (NAN) support
    - S1G channel representation cleanup
    - improve S1G support
 
  - WiFi drivers:
    - Intel (iwlwifi):
      - major refactor and cleanup
    - Broadcom (brcm80211):
      - support for AP isolation
    - RealTek (rtw88/89) rtw88/89:
      - preparation work for RTL8922DE support
    - MediaTek (mt76):
      - HW restart improvements
      - MLO support
    - Qualcomm/Atheros (ath10k_
      - GTK rekey fixes
 
  - Bluetooth drivers:
    - btusb: support for several new IDs for MT7925
    - btintel: support for BlazarIW core
    - btintel_pcie: support for _suspend() / _resume()
    - btintel_pcie: support for Scorpious, Panther Lake-H484 IDs
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmjdBdsSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkWnAP/2Jz0ExnYMDxLxkkKqMte+3JNeF/KNjB
 zVIslYN/5ekkd1TMXyDLFlWHqUOUMSF9+cK5ZRMHG6/lzOueSzFuuuDFsWs6Kf2f
 q7Q9MzlXhR9++YCsdES1uS5x3PPjMInzo2ZivMFa6fUDLLFSzeAOKL9k+RS0EggU
 VlXv2Wkm73R0O6KAssgDsHke9cnNz+F0DzhQ0S3qkyZF9tS5NrDeUl7fZ47XZgwb
 ZuXdEzKmTTepo2XvZGxJoF2D7nekWFlHhLjEPpDJtET19nwhukCry41/FplJwlKR
 dWsYkqLkrOEQKFQ+q++/5c3BgFOG+IrQLbP5bGQF1tZRMl4Dqp9PDxK5fKJfccbS
 0VY3Y2qWOSYejT2Ji2kEHR5E5rPyFm3Y5A4Rnpz0yEHj14vL2v4zf7CZRkCyyDfD
 doIZXFGkM0+N7QeXLEN833cje9zjaXuP9GAE+2bb+wAWAZAqof4KX8JgHh+y5Rwm
 pvUtvFxmEtntlMwNBap8aT3FquGtfZncU8pzAE5kvWjuMvyF0NVWiE5rB2kSQM/X
 NLmUdvDyjwwJqthXAJG0fl+a0mNJ/kOAqSOKJDJrfKDGWa+ovwY0iY06SpK0wIbO
 Wz7tpMk5MSlYXW8xWMlbyhvvU6T9xuoQ2KV4QTdMxc6Ir3sNX6YkQr+gjQjxB0gx
 ST2QF6lZeWFh
 =w2Kz
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Paolo Abeni:
 "Core & protocols:

   - Improve drop account scalability on NUMA hosts for RAW and UDP
     sockets and the backlog, almost doubling the Pps capacity under DoS

   - Optimize the UDP RX performance under stress, reducing contention,
     revisiting the binary layout of the involved data structs and
     implementing NUMA-aware locking. This improves UDP RX performance
     by an additional 50%, even more under extreme conditions

   - Add support for PSP encryption of TCP connections; this mechanism
     has some similarities with IPsec and TLS, but offers superior HW
     offloads capabilities

   - Ongoing work to support Accurate ECN for TCP. AccECN allows more
     than one congestion notification signal per RTT and is a building
     block for Low Latency, Low Loss, and Scalable Throughput (L4S)

   - Reorganize the TCP socket binary layout for data locality, reducing
     the number of touched cachelines in the fastpath

   - Refactor skb deferral free to better scale on large multi-NUMA
     hosts, this improves TCP and UDP RX performances significantly on
     such HW

   - Increase the default socket memory buffer limits from 256K to 4M to
     better fit modern link speeds

   - Improve handling of setups with a large number of nexthop, making
     dump operating scaling linearly and avoiding unneeded
     synchronize_rcu() on delete

   - Improve bridge handling of VLAN FDB, storing a single entry per
     bridge instead of one entry per port; this makes the dump order of
     magnitude faster on large switches

   - Restore IP ID correctly for encapsulated packets at GSO
     segmentation time, allowing GRO to merge packets in more scenarios

   - Improve netfilter matching performance on large sets

   - Improve MPTCP receive path performance by leveraging recently
     introduced core infrastructure (skb deferral free) and adopting
     recent TCP autotuning changes

   - Allow bridges to redirect to a backup port when the bridge port is
     administratively down

   - Introduce MPTCP 'laminar' endpoint that con be used only once per
     connection and simplify common MPTCP setups

   - Add RCU safety to dst->dev, closing a lot of possible races

   - A significant crypto library API for SCTP, MPTCP and IPv6 SR,
     reducing code duplication

   - Supports pulling data from an skb frag into the linear area of an
     XDP buffer

  Things we sprinkled into general kernel code:

   - Generate netlink documentation from YAML using an integrated YAML
     parser

  Driver API:

   - Support using IPv6 Flow Label in Rx hash computation and RSS queue
     selection

   - Introduce API for fetching the DMA device for a given queue,
     allowing TCP zerocopy RX on more H/W setups

   - Make XDP helpers compatible with unreadable memory, allowing more
     easily building DevMem-enabled drivers with a unified XDP/skbs
     datapath

   - Add a new dedicated ethtool callback enabling drivers to provide
     the number of RX rings directly, improving efficiency and clarity
     in RX ring queries and RSS configuration

   - Introduce a burst period for the health reporter, allowing better
     handling of multiple errors due to the same root cause

   - Support for DPLL phase offset exponential moving average,
     controlling the average smoothing factor

  Device drivers:

   - Add a new Huawei driver for 3rd gen NIC (hinic3)

   - Add a new SpacemiT driver for K1 ethernet MAC

   - Add a generic abstraction for shared memory communication
     devices (dibps)

   - Ethernet high-speed NICs:
      - nVidia/Mellanox:
         - Use multiple per-queue doorbell, to avoid MMIO contention
           issues
         - support adjacent functions, allowing them to delegate their
           SR-IOV VFs to sibling PFs
         - support RSS for IPSec offload
         - support exposing raw cycle counters in PTP and mlx5
         - support for disabling host PFs.
      - Intel (100G, ice, idpf):
         - ice: support for SRIOV VFs over an Active-Active link
           aggregate
         - ice: support for firmware logging via debugfs
         - ice: support for Earliest TxTime First (ETF) hardware offload
         - idpf: support basic XDP functionalities and XSk
      - Broadcom (bnxt):
         - support Hyper-V VF ID
         - dynamic SRIOV resource allocations for RoCE
      - Meta (fbnic):
         - support queue API, zero-copy Rx and Tx
         - support basic XDP functionalities
         - devlink health support for FW crashes and OTP mem corruptions
         - expand hardware stats coverage to FEC, PHY, and Pause
      - Wangxun:
         - support ethtool coalesce options
         - support for multiple RSS contexts

   - Ethernet virtual:
      - Macsec:
         - replace custom netlink attribute checks with policy-level
           checks
      - Bonding:
         - support aggregator selection based on port priority
      - Microsoft vNIC:
         - use page pool fragments for RX buffers instead of full pages
           to improve memory efficiency

   - Ethernet NICs consumer, and embedded:
      - Qualcomm: support Ethernet function for IPQ9574 SoC
      - Airoha: implement wlan offloading via NPU
      - Freescale
         - enetc: add NETC timer PTP driver and add PTP support
         - fec: enable the Jumbo frame support for i.MX8QM
      - Renesas (R-Car S4):
         - support HW offloading for layer 2 switching
         - support for RZ/{T2H, N2H} SoCs
      - Cadence (macb): support TAPRIO traffic scheduling
      - TI:
         - support for Gigabit ICSS ethernet SoC (icssm-prueth)
      - Synopsys (stmmac): a lot of cleanups

   - Ethernet PHYs:
      - Support 10g-qxgmi phy-mode for AQR412C, Felix DSA and Lynx PCS
        driver
      - Support bcm63268 GPHY power control
      - Support for Micrel lan8842 PHY and PTP
      - Support for Aquantia AQR412 and AQR115

   - CAN:
      - a large CAN-XL preparation work
      - reorganize raw_sock and uniqframe struct to minimize memory
        usage
      - rcar_canfd: update the CAN-FD handling

   - WiFi:
      - extended Neighbor Awareness Networking (NAN) support
      - S1G channel representation cleanup
      - improve S1G support

   - WiFi drivers:
      - Intel (iwlwifi):
         - major refactor and cleanup
      - Broadcom (brcm80211):
         - support for AP isolation
      - RealTek (rtw88/89) rtw88/89:
         - preparation work for RTL8922DE support
      - MediaTek (mt76):
         - HW restart improvements
         - MLO support
      - Qualcomm/Atheros (ath10k):
         - GTK rekey fixes

   - Bluetooth drivers:
      - btusb: support for several new IDs for MT7925
      - btintel: support for BlazarIW core
      - btintel_pcie: support for _suspend() / _resume()
      - btintel_pcie: support for Scorpious, Panther Lake-H484 IDs"

* tag 'net-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1536 commits)
  net: stmmac: Add support for Allwinner A523 GMAC200
  dt-bindings: net: sun8i-emac: Add A523 GMAC200 compatible
  Revert "Documentation: net: add flow control guide and document ethtool API"
  octeontx2-pf: fix bitmap leak
  octeontx2-vf: fix bitmap leak
  net/mlx5e: Use extack in set rxfh callback
  net/mlx5e: Introduce mlx5e_rss_params for RSS configuration
  net/mlx5e: Introduce mlx5e_rss_init_params
  net/mlx5e: Remove unused mdev param from RSS indir init
  net/mlx5: Improve QoS error messages with actual depth values
  net/mlx5e: Prevent entering switchdev mode with inconsistent netns
  net/mlx5: HWS, Generalize complex matchers
  net/mlx5: Improve write-combining test reliability for ARM64 Grace CPUs
  selftests/net: add tcp_port_share to .gitignore
  Revert "net/mlx5e: Update and set Xon/Xoff upon MTU set"
  net: add NUMA awareness to skb_attempt_defer_free()
  net: use llist for sd->defer_list
  net: make softnet_data.defer_count an atomic
  selftests: drv-net: psp: add tests for destroying devices
  selftests: drv-net: psp: add test for auto-adjusting TCP MSS
  ...
2025-10-02 15:17:01 -07:00
Christian Brauner
83914de1c1
net-sysfs: use check_net()
Don't directly acess the namespace count. There's even a dedicated
helper for this.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19 16:22:37 +02:00
Krishna Kumar
97bcc5b6f4 net: Prevent RPS table overwrite of active flows
This patch fixes an issue where two different flows on the same RXq
produce the same hash resulting in continuous flow overwrites.

Flow #1: A packet for Flow #1 comes in, kernel calls the steering
         function. The driver gives back a filter id. The kernel saves
	 this filter id in the selected slot. Later, the driver's
	 service task checks if any filters have expired and then
	 installs the rule for Flow #1.
Flow #2: A packet for Flow #2 comes in. It goes through the same steps.
         But this time, the chosen slot is being used by Flow #1. The
	 driver gives a new filter id and the kernel saves it in the
	 same slot. When the driver's service task runs, it runs through
	 all the flows, checks if Flow #1 should be expired, the kernel
	 returns True as the slot has a different filter id, and then
	 the driver installs the rule for Flow #2.
Flow #1: Another packet for Flow #1 comes in. The same thing repeats.
         The slot is overwritten with a new filter id for Flow #1.

This causes a repeated cycle of flow programming for missed packets,
wasting CPU cycles while not improving performance. This problem happens
at higher rates when the RPS table is small, but tests show it still
happens even with 12,000 connections and an RPS size of 16K per queue
(global table size = 144x16K = 64K).

This patch prevents overwriting an rps_dev_flow entry if it is active.
The intention is that it is better to do aRFS for the first flow instead
of hurting all flows on the same hash. Without this, two (or more) flows
on one RX queue with the same hash can keep overwriting each other. This
causes the driver to reprogram the flow repeatedly.

Changes:
  1. Add a new 'hash' field to struct rps_dev_flow.
  2. Add rps_flow_is_active(): a helper function to check if a flow is
     active or not, extracted from rps_may_expire_flow(). It is further
     simplified as per reviewer feedback.
  3. In set_rps_cpu():
     - Avoid overwriting by programming a new filter if:
        - The slot is not in use, or
        - The slot is in use but the flow is not active, or
        - The slot has an active flow with the same hash, but target CPU
          differs.
     - Save the hash in the rps_dev_flow entry.
  4. rps_may_expire_flow(): Use earlier extracted rps_flow_is_active().

Testing & results:
  - Driver: ice (E810 NIC), Kernel: net-next
  - #CPUs = #RXq = 144 (1:1)
  - Number of flows: 12K
  - Eight RPS settings from 256 to 32768. Though RPS=256 is not ideal,
    it is still sufficient to cover 12K flows (256*144 rx-queues = 64K
    global table slots)
  - Global Table Size = 144 * RPS (effectively equal to 256 * RPS)
  - Each RPS test duration = 8 mins (org code) + 8 mins (new code).
  - Metrics captured on client

Legend for following tables:
Steer-C: #times ndo_rx_flow_steer() was Called by set_rps_cpu()
Steer-L: #times ice_arfs_flow_steer() Looped over aRFS entries
Add:     #times driver actually programmed aRFS (ice_arfs_build_entry())
Del:     #times driver deleted the flow (ice_arfs_del_flow_rules())
Units:   K = 1,000 times, M = 1 million times

  |-------|---------|------|     Org Code    |---------|---------|
  | RPS   | Latency | CPU  | Add    |  Del   | Steer-C | Steer-L |
  |-------|---------|------|--------|--------|---------|---------|
  | 256   | 227.0   | 93.2 | 1.6M   | 1.6M   | 121.7M  | 267.6M  |
  | 512   | 225.9   | 94.1 | 11.5M  | 11.2M  | 65.7M   | 199.6M  |
  | 1024  | 223.5   | 95.6 | 16.5M  | 16.5M  | 27.1M   | 187.3M  |
  | 2048  | 222.2   | 96.3 | 10.5M  | 10.5M  | 12.5M   | 115.2M  |
  | 4096  | 223.9   | 94.1 | 5.5M   | 5.5M   | 7.2M    | 65.9M   |
  | 8192  | 224.7   | 92.5 | 2.7M   | 2.7M   | 3.0M    | 29.9M   |
  | 16384 | 223.5   | 92.5 | 1.3M   | 1.3M   | 1.4M    | 13.9M   |
  | 32768 | 219.6   | 93.2 | 838.1K | 838.1K | 965.1K  | 8.9M    |
  |-------|---------|------|   New Code      |---------|---------|
  | 256   | 201.5   | 99.1 | 13.4K  | 5.0K   | 13.7K   | 75.2K   |
  | 512   | 202.5   | 98.2 | 11.2K  | 5.9K   | 11.2K   | 55.5K   |
  | 1024  | 207.3   | 93.9 | 11.5K  | 9.7K   | 11.5K   | 59.6K   |
  | 2048  | 207.5   | 96.7 | 11.8K  | 11.1K  | 15.5K   | 79.3K   |
  | 4096  | 206.9   | 96.6 | 11.8K  | 11.7K  | 11.8K   | 63.2K   |
  | 8192  | 205.8   | 96.7 | 11.9K  | 11.8K  | 11.9K   | 63.9K   |
  | 16384 | 200.9   | 98.2 | 11.9K  | 11.9K  | 11.9K   | 64.2K   |
  | 32768 | 202.5   | 98.0 | 11.9K  | 11.9K  | 11.9K   | 64.2K   |
  |-------|---------|------|--------|--------|---------|---------|

Some observations:
  1. Overall Latency improved: (1790.19-1634.94)/1790.19*100 = 8.67%
  2. Overall CPU increased:    (777.32-751.49)/751.45*100    = 3.44%
  3. Flow Management (add/delete) remained almost constant at ~11K
     compared to values in millions.

Signed-off-by: Krishna Kumar <krikku@gmail.com>
Link: https://patch.msgid.link/20250825031005.3674864-2-krikku@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-27 18:24:13 -07:00
Stanislav Fomichev
5d4d84618e net: s/dev_set_threaded/netif_set_threaded/
Commit cc34acd577 ("docs: net: document new locking reality")
introduced netif_ vs dev_ function semantics: the former expects locked
netdev, the latter takes care of the locking. We don't strictly
follow this semantics on either side, but there are more dev_xxx handlers
now that don't fit. Rename them to netif_xxx where appropriate.

Note that one dev_set_threaded call still remains in mt76 for debugfs file.

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250717172333.1288349-7-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-18 17:27:47 -07:00
Stanislav Fomichev
af1d017377 net: s/dev_get_mac_address/netif_get_mac_address/
Commit cc34acd577 ("docs: net: document new locking reality")
introduced netif_ vs dev_ function semantics: the former expects locked
netdev, the latter takes care of the locking. We don't strictly
follow this semantics on either side, but there are more dev_xxx handlers
now that don't fit. Rename them to netif_xxx where appropriate.

netif_get_mac_address is used only by tun/tap, so move it into
NETDEV_INTERNAL namespace.

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250717172333.1288349-3-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-18 17:27:46 -07:00
Stanislav Fomichev
ffea116834 net: s/dev_get_port_parent_id/netif_get_port_parent_id/
Commit cc34acd577 ("docs: net: document new locking reality")
introduced netif_ vs dev_ function semantics: the former expects locked
netdev, the latter takes care of the locking. We don't strictly
follow this semantics on either side, but there are more dev_xxx handlers
now that don't fit. Rename them to netif_xxx where appropriate.

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250717172333.1288349-2-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-18 17:27:46 -07:00
Eric Dumazet
6058099da5 net: remove RTNL use for /proc/sys/net/core/rps_default_mask
Use a dedicated mutex instead.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250702061558.1585870-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07 18:42:12 -07:00
Yajun Deng
0c17270f9b net: sysfs: Implement is_visible for phys_(port_id, port_name, switch_id)
phys_port_id_show, phys_port_name_show and phys_switch_id_show would
return -EOPNOTSUPP if the netdev didn't implement the corresponding
method.

There is no point in creating these files if they are unsupported.

Put these attributes in netdev_phys_group and implement the is_visible
method. make phys_(port_id, port_name, switch_id) invisible if the netdev
dosen't implement the corresponding method.

Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250612142707.4644-1-yajun.deng@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-14 11:26:59 -07:00
Jakub Kicinski
0a65dcf624 net: designate queue counts as "double ops protected" by instance lock
Drivers which opt into instance lock protection of ops should
only call set_real_num_*_queues() under the instance lock.
This means that queue counts are double protected (writes
are under both rtnl_lock and instance lock, readers under
either).

Some readers may still be under the rtnl_lock, however, so for
now we need double protection of writers.

OTOH queue API paths are only under the protection of the instance
lock, so we need to validate that the instance is actually locking
ops, otherwise the input checks we do against queue count are racy.

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250324224537.248800-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25 10:06:49 -07:00
Eric Dumazet
f3483c8e1d net: rfs: hash function change
RFS is using two kinds of hash tables.

First one is controlled by /proc/sys/net/core/rps_sock_flow_entries = 2^N
and using the N low order bits of the l4 hash is good enough.

Then each RX queue has its own hash table, controlled by
/sys/class/net/eth1/queues/rx-$q/rps_flow_cnt = 2^X

Current hash function, using the X low order bits is suboptimal,
because RSS is usually using Func(hash) = (hash % power_of_two);

For example, with 32 RX queues, 6 low order bits have no entropy
for a given queue.

Switch this hash function to hash_32(hash, log) to increase
chances to use all possible slots and reduce collisions.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250321171309.634100-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25 08:24:13 -07:00
Stanislav Fomichev
8033d2aef5 Revert "net: replace dev_addr_sem with netdev instance lock"
This reverts commit df43d8bf10.

Cc: Kohei Enju <enjuk@amazon.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Fixes: df43d8bf10 ("net: replace dev_addr_sem with netdev instance lock")
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250312190513.1252045-2-sdf@fomichev.me
Tested-by: Lei Yang <leiyang@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-19 18:52:00 +01:00
Jakub Kicinski
8ef890df40 net: move misc netdev_lock flavors to a separate header
Move the more esoteric helpers for netdev instance lock to
a dedicated header. This avoids growing netdevice.h to infinity
and makes rebuilding the kernel much faster (after touching
the header with the helpers).

The main netdev_lock() / netdev_unlock() functions are used
in static inlines in netdevice.h and will probably be used
most commonly, so keep them in netdevice.h.

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250307183006.2312761-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-08 09:06:50 -08:00
Stanislav Fomichev
df43d8bf10 net: replace dev_addr_sem with netdev instance lock
Lockdep reports possible circular dependency in [0]. Instead of
fixing the ordering, replace global dev_addr_sem with netdev
instance lock. Most of the paths that set/get mac are RTNL
protected. Two places where it's not, convert to explicit
locking:
- sysfs address_show
- dev_get_mac_address via dev_ioctl

0: https://netdev-3.bots.linux.dev/vmksft-forwarding-dbg/results/993321/24-router-bridge-1d-lag-sh/stderr

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250305163732.2766420-12-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-06 12:59:45 -08:00
Stanislav Fomichev
ad7c7b2172 net: hold netdev instance lock during sysfs operations
Most of them are already covered by the converted dev_xxx APIs.
Add the locking wrappers for the remaining ones.

Cc: Saeed Mahameed <saeed@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250305163732.2766420-9-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-06 12:59:44 -08:00
Antoine Tenart
38d41cf575 net-sysfs: remove unused initial ret values
In some net-sysfs functions the ret value is initialized but never used
as it is always overridden. Remove those.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Link: https://patch.msgid.link/20250226174644.311136-1-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-27 17:38:42 -08:00
Eric Dumazet
75bc3dab4e net-sysfs: restore behavior for not running devices
modprobe dummy dumdummies=1

Old behavior :

$ cat /sys/class/net/dummy0/carrier
cat: /sys/class/net/dummy0/carrier: Invalid argument

After blamed commit, an empty string is reported.

$ cat /sys/class/net/dummy0/carrier
$

In this commit, I restore the old behavior for carrier,
speed and duplex attributes.

Fixes: 79c61899b5 ("net-sysfs: remove rtnl_trylock from device attributes")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Marco Leogrande <leogrande@google.com>
Reviewed-by: Antoine Tenart <atenart@kernel.org>
Link: https://patch.msgid.link/20250221051223.576726-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24 14:21:23 -08:00
Antoine Tenart
b0b6fcfa6a net-sysfs: remove rtnl_trylock from queue attributes
Similar to the commit removing remove rtnl_trylock from device
attributes we here apply the same technique to networking queues.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Link: https://patch.msgid.link/20250204170314.146022-5-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05 17:49:08 -08:00
Antoine Tenart
7e54f85c60 net-sysfs: prevent uncleared queues from being re-added
With the (upcoming) removal of the rtnl_trylock/restart_syscall logic
and because of how Tx/Rx queues are implemented (and their
requirements), it might happen that a queue is re-added before having
the chance to be cleared. In such rare case, do not complete the queue
addition operation.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Link: https://patch.msgid.link/20250204170314.146022-4-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05 17:49:08 -08:00
Antoine Tenart
b7ecc1de51 net-sysfs: move queue attribute groups outside the default groups
Rx/tx queues embed their own kobject for registering their per-queue
sysfs files. The issue is they're using the kobject default groups for
this and entirely rely on the kobject refcounting for releasing their
sysfs paths.

In order to remove rtnl_trylock calls we need sysfs files not to rely on
their associated kobject refcounting for their release. Thus we here
move queues sysfs files from the kobject default groups to their own
groups which can be removed separately.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Link: https://patch.msgid.link/20250204170314.146022-3-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05 17:49:08 -08:00
Antoine Tenart
79c61899b5 net-sysfs: remove rtnl_trylock from device attributes
There is an ABBA deadlock between net device unregistration and sysfs
files being accessed[1][2]. To prevent this from happening all paths
taking the rtnl lock after the sysfs one (actually kn->active refcount)
use rtnl_trylock and return early (using restart_syscall)[3], which can
make syscalls to spin for a long time when there is contention on the
rtnl lock[4].

There are not many possibilities to improve the above:
- Rework the entire net/ locking logic.
- Invert two locks in one of the paths — not possible.

But here it's actually possible to drop one of the locks safely: the
kernfs_node refcount. More details in the code itself, which comes with
lots of comments.

Note that we check the device is alive in the added sysfs_rtnl_lock
helper to disallow sysfs operations to run after device dismantle has
started. This also help keeping the same behavior as before. Because of
this calls to dev_isalive in sysfs ops were removed.

[1] https://lore.kernel.org/netdev/49A4D5D5.5090602@trash.net/
[2] https://lore.kernel.org/netdev/m14oyhis31.fsf@fess.ebiederm.org/
[3] https://lore.kernel.org/netdev/20090226084924.16cb3e08@nehalam/
[4] https://lore.kernel.org/all/20210928125500.167943-1-atenart@kernel.org/T/

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Link: https://patch.msgid.link/20250204170314.146022-2-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05 17:49:07 -08:00
Jakub Kicinski
e7ed2ba757 net: protect NAPI config fields with netdev_lock()
Protect the following members of netdev and napi by netdev_lock:
 - defer_hard_irqs,
 - gro_flush_timeout,
 - irq_suspend_timeout.

The first two are written via sysfs (which this patch switches
to new lock), and netdev genl which holds both netdev and rtnl locks.

irq_suspend_timeout is only written by netdev genl.

Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-11-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15 19:13:34 -08:00
Jakub Kicinski
1bb86cf8f4 net: protect threaded status of NAPI with netdev_lock()
Now that NAPI instances can't come and go without holding
netdev->lock we can trivially switch from rtnl_lock() to
netdev_lock() for setting netdev->threaded via sysfs.

Note that since we do not lock netdev_lock around sysfs
calls in the core we don't have to "trylock" like we do
with rtnl_lock.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15 19:13:34 -08:00
Joe Damato
acb8d4ed56 net: napi: Make gro_flush_timeout per-NAPI
Allow per-NAPI gro_flush_timeout setting.

The existing sysfs parameter is respected; writes to sysfs will write to
all NAPI structs for the device and the net_device gro_flush_timeout
field. Reads from sysfs will read from the net_device field.

The ability to set gro_flush_timeout on specific NAPI instances will be
added in a later commit, via netdev-genl.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20241011184527.16393-4-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-14 17:54:29 -07:00
Joe Damato
f15e3b3ddb net: napi: Make napi_defer_hard_irqs per-NAPI
Add defer_hard_irqs to napi_struct in preparation for per-NAPI
settings.

The existing sysfs parameter is respected; writes to sysfs will write to
all NAPI structs for the device and the net_device defer_hard_irq field.
Reads from sysfs show the net_device field.

The ability to set defer_hard_irqs on specific NAPI instances will be
added in a later commit, via netdev-genl.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20241011184527.16393-2-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-14 17:54:28 -07:00
Zijun Hu
8f08854199 net: sysfs: Fix weird usage of class's namespace relevant fields
Device class has two namespace relevant fields which are associated by
the following usage:

struct class {
	...
	const struct kobj_ns_type_operations *ns_type;
	const void *(*namespace)(const struct device *dev);
	...
}
if (dev->class && dev->class->ns_type)
	dev->class->namespace(dev);

The usage looks weird since it checks @ns_type but calls namespace()
it is found for all existing class definitions that the other filed is
also assigned once one is assigned in current kernel tree, so fix this
weird usage by checking @namespace to call namespace().

Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-09-09 10:30:52 +01:00
Jakub Kicinski
502cc061de Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

drivers/net/phy/phy_device.c
  2560db6ede ("net: phy: Fix missing of_node_put() for leds")
  1dce520abd ("net: phy: Use for_each_available_child_of_node_scoped()")
https://lore.kernel.org/20240904115823.74333648@canb.auug.org.au

Adjacent changes:

drivers/net/ethernet/xilinx/xilinx_axienet.h
drivers/net/ethernet/xilinx/xilinx_axienet_main.c
  858430db28 ("net: xilinx: axienet: Fix race in axienet_stop")
  76abb5d675 ("net: xilinx: axienet: Add statistics support")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-05 20:37:20 -07:00
Joe Damato
08062af0a5 net: napi: Prevent overflow of napi_defer_hard_irqs
In commit 6f8b12d661 ("net: napi: add hard irqs deferral feature")
napi_defer_irqs was added to net_device and napi_defer_irqs_count was
added to napi_struct, both as type int.

This value never goes below zero, so there is not reason for it to be a
signed int. Change the type for both from int to u32, and add an
overflow check to sysfs to limit the value to S32_MAX.

The limit of S32_MAX was chosen because the practical limit before this
patch was S32_MAX (anything larger was an overflow) and thus there are
no behavioral changes introduced. If the extra bit is needed in the
future, the limit can be raised.

Before this patch:

$ sudo bash -c 'echo 2147483649 > /sys/class/net/eth4/napi_defer_hard_irqs'
$ cat /sys/class/net/eth4/napi_defer_hard_irqs
-2147483647

After this patch:

$ sudo bash -c 'echo 2147483649 > /sys/class/net/eth4/napi_defer_hard_irqs'
bash: line 0: echo: write error: Numerical result out of range

Similarly, /sys/class/net/XXXXX/tx_queue_len is defined as unsigned:

include/linux/netdevice.h:      unsigned int            tx_queue_len;

And has an overflow check:

dev_change_tx_queue_len(..., unsigned long new_len):

  if (new_len != (unsigned int)new_len)
          return -ERANGE;

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240904153431.307932-1-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-05 18:42:56 -07:00
Breno Leitao
77461c1081 net: dqs: Do not use extern for unused dql_group
When CONFIG_DQL is not enabled, dql_group should be treated as a dead
declaration. However, its current extern declaration assumes the linker
will ignore it, which is generally true across most compiler and
architecture combinations.

But in certain cases, the linker still attempts to resolve the extern
struct, even when the associated code is dead, resulting in a linking
error. For instance the following error in loongarch64:

>> loongarch64-linux-ld: net-sysfs.c:(.text+0x589c): undefined reference to `dql_group'

Modify the declaration of the dead object to be an empty declaration
instead of an extern. This change will prevent the linker from
attempting to resolve an undefined reference.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409012047.eCaOdfQJ-lkp@intel.com/
Fixes: 74293ea1c4 ("net: sysfs: Do not create sysfs for non BQL device")
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Simon Horman <horms@kernel.org> # build-tested
Link: https://patch.msgid.link/20240902101734.3260455-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-03 12:01:38 -07:00
Alexander Lobakin
00d066a4d4 netdev_features: convert NETIF_F_LLTX to dev->lltx
NETIF_F_LLTX can't be changed via Ethtool and is not a feature,
rather an attribute, very similar to IFF_NO_QUEUE (and hot).
Free one netdev_features_t bit and make it a "hot" private flag.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-09-03 11:36:43 +02:00
Jamie Bainbridge
a699781c79 ethtool: check device is present when getting link settings
A sysfs reader can race with a device reset or removal, attempting to
read device state when the device is not actually present. eg:

     [exception RIP: qed_get_current_link+17]
  #8 [ffffb9e4f2907c48] qede_get_link_ksettings at ffffffffc07a994a [qede]
  #9 [ffffb9e4f2907cd8] __rh_call_get_link_ksettings at ffffffff992b01a3
 #10 [ffffb9e4f2907d38] __ethtool_get_link_ksettings at ffffffff992b04e4
 #11 [ffffb9e4f2907d90] duplex_show at ffffffff99260300
 #12 [ffffb9e4f2907e38] dev_attr_show at ffffffff9905a01c
 #13 [ffffb9e4f2907e50] sysfs_kf_seq_show at ffffffff98e0145b
 #14 [ffffb9e4f2907e68] seq_read at ffffffff98d902e3
 #15 [ffffb9e4f2907ec8] vfs_read at ffffffff98d657d1
 #16 [ffffb9e4f2907f00] ksys_read at ffffffff98d65c3f
 #17 [ffffb9e4f2907f38] do_syscall_64 at ffffffff98a052fb

 crash> struct net_device.state ffff9a9d21336000
    state = 5,

state 5 is __LINK_STATE_START (0b1) and __LINK_STATE_NOCARRIER (0b100).
The device is not present, note lack of __LINK_STATE_PRESENT (0b10).

This is the same sort of panic as observed in commit 4224cfd7fb
("net-sysfs: add check for netdevice being present to speed_show").

There are many other callers of __ethtool_get_link_ksettings() which
don't have a device presence check.

Move this check into ethtool to protect all callers.

Fixes: d519e17e2d ("net: export device speed and duplex via sysfs")
Fixes: 4224cfd7fb ("net-sysfs: add check for netdevice being present to speed_show")
Signed-off-by: Jamie Bainbridge <jamie.bainbridge@gmail.com>
Link: https://patch.msgid.link/8bae218864beaa44ed01628140475b9bf641c5b0.1724393671.git.jamie.bainbridge@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-26 14:03:02 -07:00
Alexander Lobakin
13cabc47f8 netdevice: define and allocate &net_device _properly_
In fact, this structure contains a flexible array at the end, but
historically its size, alignment etc., is calculated manually.
There are several instances of the structure embedded into other
structures, but also there's ongoing effort to remove them and we
could in the meantime declare &net_device properly.
Declare the array explicitly, use struct_size() and store the array
size inside the structure, so that __counted_by() can be applied.
Don't use PTR_ALIGN(), as SLUB itself tries its best to ensure the
allocated buffer is aligned to what the user expects.
Also, change its alignment from %NETDEV_ALIGN to the cacheline size
as per several suggestions on the netdev ML.

bloat-o-meter for vmlinux:

free_netdev                                  445     440      -5
netdev_freemem                                24       -     -24
alloc_netdev_mqs                            1481    1450     -31

On x86_64 with several NICs of different vendors, I was never able to
get a &net_device pointer not aligned to the cacheline size after the
change.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20240710113036.2125584-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-11 18:11:31 -07:00
Eric Dumazet
c1742dcb6b net: no longer acquire RTNL in threaded_show()
dev->threaded can be read locklessly, if we add
corresponding READ_ONCE()/WRITE_ONCE() annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240502173926.2010646-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-03 15:14:01 -07:00
Heiner Kallweit
9382b4f338 net: constify net_class
AFAICS all users of net_class take a const struct class * argument.
Therefore fully constify net_class.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-15 10:46:01 +01:00
Eric Dumazet
db77cdc696 net: dqs: use sysfs_emit() in favor of sprintf()
Commit 6025b9135f ("net: dqs: add NIC stall detector based on BQL")
added three sysfs files.

Use the recommended sysfs_emit() helper.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Breno Leitao <leitao@debian.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-08 11:04:08 +01:00
Jakub Kicinski
6025b9135f net: dqs: add NIC stall detector based on BQL
softnet_data->time_squeeze is sometimes used as a proxy for
host overload or indication of scheduling problems. In practice
this statistic is very noisy and has hard to grasp units -
e.g. is 10 squeezes a second to be expected, or high?

Delaying network (NAPI) processing leads to drops on NIC queues
but also RTT bloat, impacting pacing and CA decisions.
Stalls are a little hard to detect on the Rx side, because
there may simply have not been any packets received in given
period of time. Packet timestamps help a little bit, but
again we don't know if packets are stale because we're
not keeping up or because someone (*cough* cgroups)
disabled IRQs for a long time.

We can, however, use Tx as a proxy for Rx stalls. Most drivers
use combined Rx+Tx NAPIs so if Tx gets starved so will Rx.
On the Tx side we know exactly when packets get queued,
and completed, so there is no uncertainty.

This patch adds stall checks to BQL. Why BQL? Because
it's a convenient place to add such checks, already
called by most drivers, and it has copious free space
in its structures (this patch adds no extra cache
references or dirtying to the fast path).

The algorithm takes one parameter - max delay AKA stall
threshold and increments a counter whenever NAPI got delayed
for at least that amount of time. It also records the length
of the longest stall.

To be precise every time NAPI has not polled for at least
stall thrs we check if there were any Tx packets queued
between last NAPI run and now - stall_thrs/2.

Unlike the classic Tx watchdog this mechanism does not
ignore stalls caused by Tx being disabled, or loss of link.
I don't think the check is worth the complexity, and
stall is a stall, whether due to host overload, flow
control, link down... doesn't matter much to the application.

We have been running this detector in production at Meta
for 2 years, with the threshold of 8ms. It's the lowest
value where false positives become rare. There's still
a constant stream of reported stalls (especially without
the ksoftirqd deferral patches reverted), those who like
their stall metrics to be 0 may prefer higher value.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-08 10:23:26 +00:00
Eric Dumazet
490a79faf9 net: introduce include/net/rps.h
Move RPS related structures and helpers from include/linux/netdevice.h
and include/net/sock.h to a new include file.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-18-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07 21:12:43 -08:00
Breno Leitao
74293ea1c4 net: sysfs: Do not create sysfs for non BQL device
Creation of sysfs entries is expensive, mainly for workloads that
constantly creates netdev and netns often.

Do not create BQL sysfs entries for devices that don't need,
basically those that do not have a real queue, i.e, devices that has
NETIF_F_LLTX and IFF_NO_QUEUE, such as `lo` interface.

This will remove the /sys/class/net/eth0/queues/tx-X/byte_queue_limits/
directory for these devices.

In the example below, eth0 has the `byte_queue_limits` directory but not
`lo`.

	# ls /sys/class/net/lo/queues/tx-0/
	traffic_class  tx_maxrate  tx_timeout  xps_cpus  xps_rxqs

	# ls /sys/class/net/eth0/queues/tx-0/byte_queue_limits/
	hold_time  inflight  limit  limit_max  limit_min

This also removes the #ifdefs, since we can also use netdev_uses_bql() to
check if the config is enabled. (as suggested by Jakub).

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240216094154.3263843-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-19 12:30:44 -08:00
Eric Dumazet
e154bb7a6e net-sysfs: convert netstat_show() to RCU
dev_get_stats() can be called from RCU, there is no need
to acquire dev_base_lock.

Change dev_isalive() comment to reflect we no longer use
dev_base_lock from net/core/net-sysfs.c

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-14 11:20:13 +00:00
Eric Dumazet
004d138364 net-sysfs: convert dev->operstate reads to lockless ones
operstate_show() can omit dev_base_lock acquisition only
to read dev->operstate.

Annotate accesses to dev->operstate.

Writers still acquire dev_base_lock for mutual exclusion.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-14 11:20:13 +00:00
Eric Dumazet
c7d52737e7 net-sysfs: use dev_addr_sem to remove races in address_show()
Using dev_base_lock is not preventing from reading garbage.

Use dev_addr_sem instead.

v4: place dev_addr_sem extern in net/core/dev.h (Jakub Kicinski)
 Link: https://lore.kernel.org/netdev/20240212175845.10f6680a@kernel.org/

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-14 11:20:13 +00:00
Eric Dumazet
12692e3df2 net-sysfs: convert netdev_show() to RCU
Make clear dev_isalive() can be called with RCU protection.

Then convert netdev_show() to RCU, to remove dev_base_lock
dependency.

Also add RCU to broadcast_show().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-14 11:20:13 +00:00
Eric Dumazet
1c07dbb0cc net: annotate data-races around dev->name_assign_type
name_assign_type_show() runs locklessly, we should annotate
accesses to dev->name_assign_type.

Alternative would be to grab devnet_rename_sem semaphore
from name_assign_type_show(), but this would not bring
more accuracy.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-14 11:20:13 +00:00
Johannes Berg
bf17b36ccd net: sysfs: fix locking in carrier read
My previous patch added a call to linkwatch_sync_dev(),
but that of course needs to be called under RTNL, which
I missed earlier, but now saw RCU warnings from.

Fix that by acquiring the RTNL in a similar fashion to
how other files do it here.

Fixes: facd15dfd6 ("net: core: synchronize link-watch when carrier is queried")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20231206172122.859df6ba937f.I9c80608bcfbab171943ff4942b52dbd5e97fe06e@changeid
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-08 16:10:17 -08:00