Commit Graph

19301 Commits

Author SHA1 Message Date
Luiz Augusto von Dentz
c28d2bff70 Bluetooth: L2CAP: Fix result of L2CAP_ECRED_CONN_RSP when MTU is too short
Test L2CAP/ECFC/BV-26-C expect the response to L2CAP_ECRED_CONN_REQ with
and MTU value < L2CAP_ECRED_MIN_MTU (64) to be L2CAP_CR_LE_INVALID_PARAMS
rather than L2CAP_CR_LE_UNACCEPT_PARAMS.

Also fix not including the correct number of CIDs in the response since
the spec requires all CIDs being rejected to be included in the
response.

Link: https://github.com/bluez/bluez/issues/1868
Fixes: 15f02b9105 ("Bluetooth: L2CAP: Add initial code for Enhanced Credit Based Mode")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2026-02-23 15:28:56 -05:00
Luiz Augusto von Dentz
7accb1c432 Bluetooth: L2CAP: Fix invalid response to L2CAP_ECRED_RECONF_REQ
This fixes responding with an invalid result caused by checking the
wrong size of CID which should have been (cmd_len - sizeof(*req)) and
on top of it the wrong result was use L2CAP_CR_LE_INVALID_PARAMS which
is invalid/reserved for reconf when running test like L2CAP/ECFC/BI-03-C:

> ACL Data RX: Handle 64 flags 0x02 dlen 14
      LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 2 len 6
        MTU: 64
        MPS: 64
        Source CID: 64
< ACL Data TX: Handle 64 flags 0x00 dlen 10
      LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2
!        Result: Reserved (0x000c)
         Result: Reconfiguration failed - one or more Destination CIDs invalid (0x0003)

Fiix L2CAP/ECFC/BI-04-C which expects L2CAP_RECONF_INVALID_MPS (0x0002)
when more than one channel gets its MPS reduced:

> ACL Data RX: Handle 64 flags 0x02 dlen 16
      LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 2 len 8
        MTU: 264
        MPS: 99
        Source CID: 64
!       Source CID: 65
< ACL Data TX: Handle 64 flags 0x00 dlen 10
      LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2
!        Result: Reconfiguration successful (0x0000)
         Result: Reconfiguration failed - reduction in size of MPS not allowed for more than one channel at a time (0x0002)

Fix L2CAP/ECFC/BI-05-C when SCID is invalid (85 unconnected):

> ACL Data RX: Handle 64 flags 0x02 dlen 14
      LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 2 len 6
        MTU: 65
        MPS: 64
!        Source CID: 85
< ACL Data TX: Handle 64 flags 0x00 dlen 10
      LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2
!        Result: Reconfiguration successful (0x0000)
         Result: Reconfiguration failed - one or more Destination CIDs invalid (0x0003)

Fix L2CAP/ECFC/BI-06-C when MPS < L2CAP_ECRED_MIN_MPS (64):

> ACL Data RX: Handle 64 flags 0x02 dlen 14
      LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 2 len 6
        MTU: 672
!       MPS: 63
        Source CID: 64
< ACL Data TX: Handle 64 flags 0x00 dlen 10
      LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2
!       Result: Reconfiguration failed - reduction in size of MPS not allowed for more than one channel at a time (0x0002)
        Result: Reconfiguration failed - other unacceptable parameters (0x0004)

Fix L2CAP/ECFC/BI-07-C when MPS reduced for more than one channel:

> ACL Data RX: Handle 64 flags 0x02 dlen 16
      LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 3 len 8
        MTU: 84
!       MPS: 71
        Source CID: 64
!        Source CID: 65
< ACL Data TX: Handle 64 flags 0x00 dlen 10
      LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2
!       Result: Reconfiguration successful (0x0000)
        Result: Reconfiguration failed - reduction in size of MPS not allowed for more than one channel at a time (0x0002)

Link: https://github.com/bluez/bluez/issues/1865
Fixes: 15f02b9105 ("Bluetooth: L2CAP: Add initial code for Enhanced Credit Based Mode")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2026-02-23 15:23:37 -05:00
Linus Torvalds
32a92f8c89 Convert more 'alloc_obj' cases to default GFP_KERNEL arguments
This converts some of the visually simpler cases that have been split
over multiple lines.  I only did the ones that are easy to verify the
resulting diff by having just that final GFP_KERNEL argument on the next
line.

Somebody should probably do a proper coccinelle script for this, but for
me the trivial script actually resulted in an assertion failure in the
middle of the script.  I probably had made it a bit _too_ trivial.

So after fighting that far a while I decided to just do some of the
syntactically simpler cases with variations of the previous 'sed'
scripts.

The more syntactically complex multi-line cases would mostly really want
whitespace cleanup anyway.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 20:03:00 -08:00
Linus Torvalds
bf4afc53b7 Convert 'alloc_obj' family to use the new default GFP_KERNEL argument
This was done entirely with mindless brute force, using

    git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
        xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'

to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.

Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.

For the same reason the 'flex' versions will be done as a separate
conversion.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Kees Cook
69050f8d6d treewide: Replace kmalloc with kmalloc_obj for non-scalar types
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-21 01:02:28 -08:00
Eric Dumazet
858d2a4f67 tcp: fix potential race in tcp_v6_syn_recv_sock()
Code in tcp_v6_syn_recv_sock() after the call to tcp_v4_syn_recv_sock()
is done too late.

After tcp_v4_syn_recv_sock(), the child socket is already visible
from TCP ehash table and other cpus might use it.

Since newinet->pinet6 is still pointing to the listener ipv6_pinfo
bad things can happen as syzbot found.

Move the problematic code in tcp_v6_mapped_child_init()
and call this new helper from tcp_v4_syn_recv_sock() before
the ehash insertion.

This allows the removal of one tcp_sync_mss(), since
tcp_v4_syn_recv_sock() will call it with the correct
context.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: syzbot+937b5bbb6a815b3e5d0b@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/69949275.050a0220.2eeac1.0145.GAE@google.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260217161205.2079883-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-19 14:02:19 -08:00
Linus Torvalds
8bf22c33e7 Including fixes from Netfilter.
Current release - new code bugs:
 
  - net: fix backlog_unlock_irq_restore() vs CONFIG_PREEMPT_RT
 
  - eth: mlx5e: XSK, Fix unintended ICOSQ change
 
  - phy_port: correctly recompute the port's linkmodes
 
  - vsock: prevent child netns mode switch from local to global
 
  - couple of kconfig fixes for new symbols
 
 Previous releases - regressions:
 
  - nfc: nci: fix false-positive parameter validation for packet data
 
  - net: do not delay zero-copy skbs in skb_attempt_defer_free()
 
 Previous releases - always broken:
 
  - mctp: ensure our nlmsg responses to user space are zero-initialised
 
  - ipv6: ioam: fix heap buffer overflow in __ioam6_fill_trace_data()
 
  - fixes for ICMP rate limiting
 
 Misc:
 
  - intel: fix PCI device ID conflict between i40e and ipw2200
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmmXUh8ACgkQMUZtbf5S
 IrufYA//ZVj+4gvegqKwKZYXNBndVW00GGTYqaILbaenK1olUVUelVB91eV2Klc/
 dXCeKG/MgEPuT89IjkPzVr2Yg4x6uhjcQL1rsahORn+GuQfSI/P8y7ysDOPnHVeM
 Rtsg1m8z3EizJcHPeAJe7nEqFzfvZ2m+FCEGe++z8BYaUZUVApytgpIWOHO/aB+p
 t13bCNzd05XxPphMl610T00Fncj2jCVDHILMgTB5rmFmkeJuQwNrRGXQSoQame46
 +g+yCZjT0eVTrBaH1EUssWfrOT3VJj3BEee6gSp7k9mxMkbW18i8shBgmxS+EHjk
 u19wwBzSrHK+JY1UExim+1E/rZisQVmEE1Gs0ALedxAu9zC/Julzfa2/+BFsc0j7
 QTXd4jukG3aTPIX8v3TV2Igu0j+bAT4WdpzvnsXXBMVKy7wFYMd1+aSOLyFH2W9L
 qRbg50oUATcsz77bZt6YUTJEgua4HXNYGtn15FMZOR7HJVR2L44Q5TK5mQxGp5iM
 GabeKMzg6bsjE98STM3nbWks3pIb9ptIk++i0913eSqKgn84bDPtp3Gabfgle2SJ
 8gjKS61K8rDt5x8StXVod7oGQ4asL8RJyOtE/avgbWUu9BNH8/oKqsE6TQrpXauv
 1ndiyim/mPe4fBCxkVAi2+uq5/ph9z8XyleESz9VYwyL3Rl4nsg=
 =qSCj
 -----END PGP SIGNATURE-----

Merge tag 'net-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from Netfilter.

  Current release - new code bugs:

   - net: fix backlog_unlock_irq_restore() vs CONFIG_PREEMPT_RT

   - eth: mlx5e: XSK, Fix unintended ICOSQ change

   - phy_port: correctly recompute the port's linkmodes

   - vsock: prevent child netns mode switch from local to global

   - couple of kconfig fixes for new symbols

  Previous releases - regressions:

   - nfc: nci: fix false-positive parameter validation for packet data

   - net: do not delay zero-copy skbs in skb_attempt_defer_free()

  Previous releases - always broken:

   - mctp: ensure our nlmsg responses to user space are zero-initialised

   - ipv6: ioam: fix heap buffer overflow in __ioam6_fill_trace_data()

   - fixes for ICMP rate limiting

  Misc:

   - intel: fix PCI device ID conflict between i40e and ipw2200"

* tag 'net-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (85 commits)
  net: nfc: nci: Fix parameter validation for packet data
  net/mlx5e: Use unsigned for mlx5e_get_max_num_channels
  net/mlx5e: Fix deadlocks between devlink and netdev instance locks
  net/mlx5e: MACsec, add ASO poll loop in macsec_aso_set_arm_event
  net/mlx5: Fix misidentification of write combining CQE during poll loop
  net/mlx5e: Fix misidentification of ASO CQE during poll loop
  net/mlx5: Fix multiport device check over light SFs
  bonding: alb: fix UAF in rlb_arp_recv during bond up/down
  bnge: fix reserving resources from FW
  eth: fbnic: Advertise supported XDP features.
  rds: tcp: fix uninit-value in __inet_bind
  net/rds: Fix NULL pointer dereference in rds_tcp_accept_one
  octeontx2-af: Fix default entries mcam entry action
  net/mlx5e: XSK, Fix unintended ICOSQ change
  ipv6: icmp: icmpv6_xrlim_allow() optimization if net.ipv6.icmp.ratelimit is zero
  ipv4: icmp: icmpv4_xrlim_allow() optimization if net.ipv4.icmp_ratelimit is zero
  ipv6: icmp: remove obsolete code in icmpv6_xrlim_allow()
  inet: move icmp_global_{credit,stamp} to a separate cache line
  icmp: prevent possible overflow in icmp_global_allow()
  selftests/net: packetdrill: add ipv4-mapped-ipv6 tests
  ...
2026-02-19 10:39:08 -08:00
Eric Dumazet
87b08913a9 inet: move icmp_global_{credit,stamp} to a separate cache line
icmp_global_credit was meant to be changed ~1000 times per second,
but if an admin sets net.ipv4.icmp_msgs_per_sec to a very high value,
icmp_global_credit changes can inflict false sharing to surrounding
fields that are read mostly.

Move icmp_global_credit and icmp_global_stamp to a separate
cacheline aligned group.

Fixes: b056b4cd91 ("icmp: move icmp_global.credit and icmp_global.stamp to per netns storage")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260216142832.3834174-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-18 16:46:36 -08:00
Linus Torvalds
23b0f90ba8 Summary
* Removed macros from proc handler converters
 
   Replace the proc converter macros with "regular" functions. Though it is more
   verbose than the macro version, it helps when debugging and better aligns with
   coding-style.rst.
 
 * General cleanup
 
   Remove superfluous ctl_table forward declarations. Const qualify the
   memory_allocation_profiling_sysctl and loadpin_sysctl_table arrays. Add
   missing kernel doc to proc_dointvec_conv.
 
 * Testing
 
   This series was run through sysctl selftests/kunit test suite in
   x86_64. And went into linux-next after rc4, giving it a good 3 weeks of
   testing
 -----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEErkcJVyXmMSXOyyeQupfNUreWQU8FAmmUabYACgkQupfNUreW
 QU8y2Qv/d2y35uQPRDh0HKWKWXJy41C2RJzd/rFCWJPCwo150whTSHIHkWYnu76g
 10QblBXQmXi9TVqFnJ7Il7PWgqkMPjzA13tfT9eXNWU8j2OB/mcVKNl9X4wm/jWi
 QxtGmBsIQ/nxb2pUzMCykzgfc5mLi2NQ8qhZ5bOnq7UW3zdYmzEqx+tRdvIacyIk
 adComi5v8xUDqyEbVFaBovuX2WHQkPyBMnD64nwWG93JpNG/+9PxGzv/DNUXY11Y
 epVOfSoKdJbSLjYoHEPEhT0aHjSydq3QHru7uF6wzKOFTfHej/XkXXbUnFXPO2Pn
 c5J0u/HziYG5eN2QTqGfrhECZYuCFPemtUozltbcgGebkl1wKH+k9K5vsCaz/mhk
 ihUC3mui++W/n9B9HJRYh1XeEpk6C1pWERCOx27XFZ25fSek2YO6ZWkT0q+gceC0
 t4+eIFSGJ3OzheJgHNK9XhTMWiQPmHyA6brXYGx4WeRvJFLpVddPF7k3Z89zIAu/
 Fut7FGTH
 =0Z+I
 -----END PGP SIGNATURE-----

Merge tag 'sysctl-7.00-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl

Pull sysctl updates from Joel Granados:

 - Remove macros from proc handler converters

   Replace the proc converter macros with "regular" functions. Though it
   is more verbose than the macro version, it helps when debugging and
   better aligns with coding-style.rst.

 - General cleanup

   Remove superfluous ctl_table forward declarations. Const qualify the
   memory_allocation_profiling_sysctl and loadpin_sysctl_table arrays.
   Add missing kernel doc to proc_dointvec_conv.

 - Testing

   This series was run through sysctl selftests/kunit test suite in
   x86_64. And went into linux-next after rc4, giving it a good 3 weeks
   of testing

* tag 'sysctl-7.00-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
  sysctl: replace SYSCTL_INT_CONV_CUSTOM macro with functions
  sysctl: Replace unidirectional INT converter macros with functions
  sysctl: Add kernel doc to proc_douintvec_conv
  sysctl: Replace UINT converter macros with functions
  sysctl: Add CONFIG_PROC_SYSCTL guards for converter macros
  sysctl: clarify proc_douintvec_minmax doc
  sysctl: Return -ENOSYS from proc_douintvec_conv when CONFIG_PROC_SYSCTL=n
  sysctl: Remove unused ctl_table forward declarations
  loadpin: Implement custom proc_handler for enforce
  alloc_tag: move memory_allocation_profiling_sysctls into .rodata
  sysctl: Add missing kernel-doc for proc_dointvec_conv
2026-02-18 10:45:36 -08:00
Fernando Fernandez Mancera
9e371b0ba7 ipv6: addrconf: reduce default temp_valid_lft to 2 days
This is a recommendation from RFC 8981 and it was intended to be changed
by commit 969c54646a ("ipv6: Implement draft-ietf-6man-rfc4941bis")
but it only changed the sysctl documentation.

Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260214172543.5783-1-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-17 17:12:06 -08:00
Eric Dumazet
452a3eee22 ipv6: fix a race in ip6_sock_set_v6only()
It is unlikely that this function will be ever called
with isk->inet_num being not zero.

Perform the check on isk->inet_num inside the locked section
for complete safety.

Fixes: 9b115749ac ("ipv6: add ip6_sock_set_v6only")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Link: https://patch.msgid.link/20260216102202.3343588-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-17 16:45:29 -08:00
Qanux
6db8b56eed ipv6: ioam: fix heap buffer overflow in __ioam6_fill_trace_data()
On the receive path, __ioam6_fill_trace_data() uses trace->nodelen
to decide how much data to write for each node. It trusts this field
as-is from the incoming packet, with no consistency check against
trace->type (the 24-bit field that tells which data items are
present). A crafted packet can set nodelen=0 while setting type bits
0-21, causing the function to write ~100 bytes past the allocated
region (into skb_shared_info), which corrupts adjacent heap memory
and leads to a kernel panic.

Add a shared helper ioam6_trace_compute_nodelen() in ioam6.c to
derive the expected nodelen from the type field, and use it:

  - in ioam6_iptunnel.c (send path, existing validation) to replace
    the open-coded computation;
  - in exthdrs.c (receive path, ipv6_hop_ioam) to drop packets whose
    nodelen is inconsistent with the type field, before any data is
    written.

Per RFC 9197, bits 12-21 are each short (4-octet) fields, so they
are included in IOAM6_MASK_SHORT_FIELDS (changed from 0xff100000 to
0xff1ffc00).

Fixes: 9ee11f0fff ("ipv6: ioam: Data plane support for Pre-allocated Trace")
Cc: stable@vger.kernel.org
Signed-off-by: Junxi Qian <qjx1298677004@gmail.com>
Reviewed-by: Justin Iurman <justin.iurman@gmail.com>
Link: https://patch.msgid.link/20260211040412.86195-1-qjx1298677004@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-13 12:24:05 -08:00
Linus Torvalds
311aa68319 RDMA v7.0 merge window
Usual smallish cycle:
 
 - Various code improvements in irdma, rtrs, qedr, ocrdma, irdma, rxe
 
 - Small driver improvements and minor bug fixes to hns, mlx5, rxe, mana,
   mlx5, irdma
 
 - Robusness improvements in completion processing for EFA
 
 - New query_port_speed() verb to move past limited IBA defined speed steps
 
 - Support for SG_GAPS in rts and many other small improvements
 
 - Rare list corruption fix in iwcm
 
 - Better support different page sizes in rxe
 
 - Device memory support for mana
 
 - Direct bio vec to kernel MR for use by NFS-RDMA
 
 - QP rate limiting for bnxt_re
 
 - Remote triggerable NULL pointer crash in siw
 
 - DMA-buf exporter support for RDMA mmaps like doorbells
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCaY44vgAKCRCFwuHvBreF
 YfiZAP91cMZfogN7r1FMD75xDZu55dI3Jvy8OaixyRxlWLGPcQEAjritdL0o7fZp
 YrD1OXNS/1XG//rPBVw7xj+54Aa8hAU=
 =AVcu
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "Usual smallish cycle. The NFS biovec work to push it down into RDMA
  instead of indirecting through a scatterlist is pretty nice to see,
  been talked about for a long time now.

   - Various code improvements in irdma, rtrs, qedr, ocrdma, irdma, rxe

   - Small driver improvements and minor bug fixes to hns, mlx5, rxe,
     mana, mlx5, irdma

   - Robusness improvements in completion processing for EFA

   - New query_port_speed() verb to move past limited IBA defined speed
     steps

   - Support for SG_GAPS in rts and many other small improvements

   - Rare list corruption fix in iwcm

   - Better support different page sizes in rxe

   - Device memory support for mana

   - Direct bio vec to kernel MR for use by NFS-RDMA

   - QP rate limiting for bnxt_re

   - Remote triggerable NULL pointer crash in siw

   - DMA-buf exporter support for RDMA mmaps like doorbells"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (66 commits)
  RDMA/mlx5: Implement DMABUF export ops
  RDMA/uverbs: Add DMABUF object type and operations
  RDMA/uverbs: Support external FD uobjects
  RDMA/siw: Fix potential NULL pointer dereference in header processing
  RDMA/umad: Reject negative data_len in ib_umad_write
  IB/core: Extend rate limit support for RC QPs
  RDMA/mlx5: Support rate limit only for Raw Packet QP
  RDMA/bnxt_re: Report QP rate limit in debugfs
  RDMA/bnxt_re: Report packet pacing capabilities when querying device
  RDMA/bnxt_re: Add support for QP rate limiting
  MAINTAINERS: Drop RDMA files from Hyper-V section
  RDMA/uverbs: Add __GFP_NOWARN to ib_uverbs_unmarshall_recv() kmalloc
  svcrdma: use bvec-based RDMA read/write API
  RDMA/core: add rdma_rw_max_sge() helper for SQ sizing
  RDMA/core: add MR support for bvec-based RDMA operations
  RDMA/core: use IOVA-based DMA mapping for bvec RDMA operations
  RDMA/core: add bio_vec based RDMA read/write API
  RDMA/irdma: Use kvzalloc for paged memory DMA address array
  RDMA/rxe: Fix race condition in QP timer handlers
  RDMA/mana_ib: Add device‑memory support
  ...
2026-02-12 17:05:20 -08:00
Daniel Golle
85ee987429 net: dsa: add tag format for MxL862xx switches
Add proprietary special tag format for the MaxLinear MXL862xx family of
switches. While using the same Ethertype as MaxLinear's GSW1xx switches,
the actual tag format differs significantly, hence we need a dedicated
tag driver for that.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/c64e6ddb6c93a4fac39f9ab9b2d8bf551a2b118d.1770433307.git.daniel@makrotopia.org
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-11 11:27:57 +01:00
Eric Dumazet
a6eee39cc2 tcp: populate inet->cork.fl.u.ip6 in tcp_v6_syn_recv_sock()
As explained in commit 85d05e2817 ("ipv6: change inet6_sk_rebuild_header()
to use inet->cork.fl.u.ip6"):

TCP v6 spends a good amount of time rebuilding a fresh fl6 at each
transmit in inet6_csk_xmit()/inet6_csk_route_socket().

TCP v4 caches the information in inet->cork.fl.u.ip4 instead.

After this patch, passive TCP ipv6 flows have correctly initialized
inet->cork.fl.u.ip6 structure.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260206173426.1638518-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-10 20:57:50 -08:00
Jakub Kicinski
792aaea994 netfilter pull request nf-next-26-02-06
-----BEGIN PGP SIGNATURE-----
 
 iQJdBAABCABHFiEEgKkgxbID4Gn1hq6fcJGo2a1f9gAFAmmGB20bFIAAAAAABAAO
 bWFudTIsMi41KzEuMTEsMiwyDRxmd0BzdHJsZW4uZGUACgkQcJGo2a1f9gC/tQ/7
 B7/akiCP/QeGF7go78PZQlpIGmjtoCOcQ9uxymlmpLkArepcIEkgZ04tFH0FClY6
 d3QPfT9iNap222aCQxZwCiaWrXqUNynW7RwH72SkqGmO8JTLKlzW8CQC+yGkyznj
 FxwRKzB8XO5Ohtw0wED3mzcf9DelsvJpX5rCU5gEjsHZjKA/rEwYgovyM+es+xSx
 JbHHc2tzLQuDZ1BL7rEW8TJDxmJ2bCsFJHKeIvykk3D2nVg01P0AwhUeIy+7ObV7
 bQh7B8DhYwKNLtgZvDi8D6o4nWQvkjfF5BadrWusumDCtIupcwbelpcUeCsUWBqC
 oCjLMcH7TwmT513RXWMId50z93FWciduCHUGrQt4BxLBZmkQ9kE0iamZVIAAzLl8
 VYIM9qb+nUk58jnLFl3xTqW2GetSj/p31bp6e78+SQFvqjie2z9/I+nGBr7A8aAB
 bNd5vpvHSEg5OP7oKk+Dhr26MiCDowtuzvdC4lYR+loFYoI+a1FS6a1w/kcw9/VA
 XmR6Y8is+CTy4XYTQZ4klYTVpoTkWa/D/t1CTC4IlELzYS49L6qSyef6m91IWeQ6
 Way5+3ZON7sA6SM1PZ/zjsKDxYLo/hQz2+dw6YLVflfY62khvuK2Yc56MQcZEjsH
 7x0b3MaKvNn9yqKC+Mk7QZ55nCjV3wyGp3GQ+ClAqZ4=
 =wU6p
 -----END PGP SIGNATURE-----

Merge tag 'nf-next-26-02-06' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

Florian Westphal says:

====================
netfilter: updates for net-next

The following patchset contains Netfilter updates for *net-next*:

1) Fix net-next-only use-after-free bug in nf_tables rbtree set:
   Expired elements cannot be released right away after unlink anymore
   because there is no guarantee that the binary-search blob is going to
   be updated.  Spotted by syzkaller.

2) Fix esoteric bug in nf_queue with udp fraglist gro, broken since
   6.11. Patch 3 adds extends the nfqueue selftest for this.

4) Use dedicated slab for flowtable entries, currently the -512 cache
   is used, which is wasteful.  From Qingfang Deng.

5) Recent net-next update extended existing test for ip6ip6 tunnels, add
   the required /config entry.  Test still passed by accident because the
   previous tests network setup gets re-used, so also update the test so
   it will fail in case the ip6ip6 tunnel interface cannot be added.

6) Fix 'nft get element mytable myset { 1.2.3.4 }' on big endian
   platforms, this was broken since code was added in v5.1.

7) Fix nf_tables counter reset support on 32bit platforms, where counter
   reset may cause huge values to appear due to wraparound.
   Broken since reset feature was added in v6.11.  From Anders Grahn.

8-11) update nf_tables rbtree set type to detect partial
   operlaps.  This will eventually speed up nftables userspace: at this
   time userspace does a netlink dump of the set content which slows down
   incremental updates on interval sets.  From Pablo Neira Ayuso.

* tag 'nf-next-26-02-06' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: nft_set_rbtree: validate open interval overlap
  netfilter: nft_set_rbtree: validate element belonging to interval
  netfilter: nft_set_rbtree: check for partial overlaps in anonymous sets
  netfilter: nft_set_rbtree: fix bogus EEXIST with NLM_F_CREATE with null interval
  netfilter: nft_counter: fix reset of counters on 32bit archs
  netfilter: nft_set_hash: fix get operation on big endian
  selftests: netfilter: add IPV6_TUNNEL to config
  netfilter: flowtable: dedicated slab for flow entry
  selftests: netfilter: nft_queue.sh: add udp fraglist gro test case
  netfilter: nfnetlink_queue: do shared-unconfirmed check before segmentation
  netfilter: nft_set_rbtree: don't gc elements on insert
====================

Link: https://patch.msgid.link/20260206153048.17570-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-10 20:25:38 -08:00
Paolo Abeni
dc010e1b4b xfrm: reduce struct sec_path size
The mentioned struct has an hole and uses unnecessary wide type to
store MAC length and indexes of very small arrays.

It's also embedded into the skb_extensions, and the latter, due
to recent CAN changes, may exceeds the 192 bytes mark (3 cachelines
on x86_64 arch) on some reasonable configurations.

Reordering and the sec_path fields, shrinking xfrm_offload.orig_mac_len
to 16 bits and xfrm_offload.{len,olen,verified_cnt} to u8, we can save
16 bytes and keep skb_extensions size under control.

Before:

struct sec_path {
	int                        len;
	int                        olen;
	int                        verified_cnt;

	/* XXX 4 bytes hole, try to pack */$
	struct xfrm_state *        xvec[6];
	struct xfrm_offload ovec[1];

	/* size: 88, cachelines: 2, members: 5 */
	/* sum members: 84, holes: 1, sum holes: 4 */
	/* last cacheline: 24 bytes */
};

After:

struct sec_path {
	struct xfrm_state *        xvec[6];
	struct xfrm_offload        ovec[1];
	/* typedef u8 -> __u8 */ unsigned char              len;
	/* typedef u8 -> __u8 */ unsigned char              olen;
	/* typedef u8 -> __u8 */ unsigned char              verified_cnt;

	/* size: 72, cachelines: 2, members: 5 */
	/* padding: 1 */
	/* last cacheline: 8 bytes */
};

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Steffen Klassert <steffen.klassert@secunet.com>
Link: https://patch.msgid.link/83846bd2e3fa08899bd0162e41bfadfec95e82ef.1770398071.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-10 20:21:48 -08:00
Vladimir Oltean
c22ba07c82 net: dsa: eliminate local type for tc policers
David Yang is saying that struct flow_action_entry in
include/net/flow_offload.h has gained new fields and DSA's struct
dsa_mall_policer_tc_entry, derived from that, isn't keeping up.
This structure is passed to drivers and they are completely oblivious to
the values of fields they don't see.

This has happened before, and almost always the solution was to make the
DSA layer thinner and use the upstream data structures. Here, the reason
why we didn't do that is because struct flow_action_entry :: police is
an anonymous structure.

That is easily enough fixable, just name those fields "struct
flow_action_police" and reference them from DSA.

Make the according transformations to the two users (sja1105 and felix):
"rate_bytes_per_sec" -> "rate_bytes_ps".

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Co-developed-by: David Yang <mmyangfl@gmail.com>
Signed-off-by: David Yang <mmyangfl@gmail.com>
Link: https://patch.msgid.link/20260206075427.44733-1-mmyangfl@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-10 15:30:11 +01:00
Alice Mikityanska
35f66ce900 net/ipv6: Remove HBH helpers
Now that the HBH jumbo helpers are not used by any driver or GSO, remove
them altogether.

Signed-off-by: Alice Mikityanska <alice@isovalent.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260205133925.526371-13-alice.kernel@fastmail.im
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-06 20:50:13 -08:00
Alice Mikityanska
b2936b4fd5 net/ipv6: Introduce payload_len helpers
The next commits will transition away from using the hop-by-hop
extension header to encode packet length for BIG TCP. Add wrappers
around ip6->payload_len that return the actual value if it's non-zero,
and calculate it from skb->len if payload_len is set to zero (and a
symmetrical setter).

The new helpers are used wherever the surrounding code supports the
hop-by-hop jumbo header for BIG TCP IPv6, or the corresponding IPv4 code
uses skb_ip_totlen (e.g., in include/net/netfilter/nf_tables_ipv6.h).

No behavioral change in this commit.

Signed-off-by: Alice Mikityanska <alice@isovalent.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260205133925.526371-2-alice.kernel@fastmail.im
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-06 20:50:03 -08:00
Eric Dumazet
a35b6e4863 tcp: inline tcp_filter()
This helper is already (auto)inlined from IPv4 TCP stack.

Make it an inline function to benefit IPv6 as well.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/2 grow/shrink: 1/0 up/down: 30/-49 (-19)
Function                                     old     new   delta
tcp_v6_rcv                                  3448    3478     +30
__pfx_tcp_filter                              16       -     -16
tcp_filter                                    33       -     -33
Total: Before=24891904, After=24891885, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260205164329.3401481-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-06 20:12:11 -08:00
Qiliang Yuan
7acee67a6b netns: optimize netns cleaning by batching unhash_nsid calls
Currently, unhash_nsid() scans the entire system for each netns being
killed, leading to O(L_dying_net * M_alive_net * N_id) complexity, as
__peernet2id() also performs a linear search in the IDR.

Optimize this to O(M_alive_net * N_id) by batching unhash operations. Move
unhash_nsid() out of the per-netns loop in cleanup_net() to perform a
single-pass traversal over survivor namespaces.

Identify dying peers by an 'is_dying' flag, which is set under net_rwsem
write lock after the netns is removed from the global list. This batches
the unhashing work and eliminates the O(L_dying_net) multiplier.

To minimize the impact on struct net size, 'is_dying' is placed in an
existing hole after 'hash_mix' in struct net.

Use a restartable idr_get_next() loop for iteration. This avoids the
unsafe modification issue inherent to idr_for_each() callbacks and allows
dropping the nsid_lock to safely call sleepy rtnl_net_notifyid().

Clean up redundant nsid_lock and simplify the destruction loop now that
unhashing is centralized.

Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260204074854.3506916-1-realwujing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-06 20:01:31 -08:00
Pablo Neira Ayuso
648946966a netfilter: nft_set_rbtree: validate open interval overlap
Open intervals do not have an end element, in particular an open
interval at the end of the set is hard to validate because of it is
lacking the end element, and interval validation relies on such end
element to perform the checks.

This patch adds a new flag field to struct nft_set_elem, this is not an
issue because this is a temporary object that is allocated in the stack
from the insert/deactivate path. This flag field is used to specify that
this is the last element in this add/delete command.

The last flag is used, in combination with the start element cookie, to
check if there is a partial overlap, eg.

   Already exists:   255.255.255.0-255.255.255.254
   Add interval:     255.255.255.0-255.255.255.255
                     ~~~~~~~~~~~~~
             start element overlap

Basically, the idea is to check for an existing end element in the set
if there is an overlap with an existing start element.

However, the last open interval can come in any position in the add
command, the corner case can get a bit more complicated:

   Already exists:   255.255.255.0-255.255.255.254
   Add intervals:    255.255.255.0-255.255.255.255,255.255.255.0-255.255.255.254
                     ~~~~~~~~~~~~~
             start element overlap

To catch this overlap, annotate that the new start element is a possible
overlap, then report the overlap if the next element is another start
element that confirms that previous element in an open interval at the
end of the set.

For deletions, do not update the start cookie when deleting an open
interval, otherwise this can trigger spurious EEXIST when adding new
elements.

Unfortunately, there is no NFT_SET_ELEM_INTERVAL_OPEN flag which would
make easier to detect open interval overlaps.

Fixes: 7c84d41416 ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-06 13:36:07 +01:00
Florian Westphal
207b3ebacb netfilter: nfnetlink_queue: do shared-unconfirmed check before segmentation
Ulrich reports a regression with nfqueue:

If an application did not set the 'F_GSO' capability flag and a gso
packet with an unconfirmed nf_conn entry is received all packets are
now dropped instead of queued, because the check happens after
skb_gso_segment().  In that case, we did have exclusive ownership
of the skb and its associated conntrack entry.  The elevated use
count is due to skb_clone happening via skb_gso_segment().

Move the check so that its peformed vs. the aggregated packet.

Then, annotate the individual segments except the first one so we
can do a 2nd check at reinject time.

For the normal case, where userspace does in-order reinjects, this avoids
packet drops: first reinjected segment continues traversal and confirms
entry, remaining segments observe the confirmed entry.

While at it, simplify nf_ct_drop_unconfirmed(): We only care about
unconfirmed entries with a refcnt > 1, there is no need to special-case
dying entries.

This only happens with UDP.  With TCP, the only unconfirmed packet will
be the TCP SYN, those aren't aggregated by GRO.

Next patch adds a udpgro test case to cover this scenario.

Reported-by: Ulrich Weber <ulrich.weber@gmail.com>
Fixes: 7d8dc1c7be ("netfilter: nf_queue: drop packets with cloned unconfirmed conntracks")
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-06 13:34:55 +01:00
Davide Caratti
a90f6dcefc net/sched: don't use dynamic lockdep keys with clsact/ingress/noqueue
Currently we are registering one dynamic lockdep key for each allocated
qdisc, to avoid false deadlock reports when mirred (or TC eBPF) redirects
packets to another device while the root lock is acquired [1].
Since dynamic keys are a limited resource, we can save them at least for
qdiscs that are not meant to acquire the root lock in the traffic path,
or to carry traffic at all, like:

 - clsact
 - ingress
 - noqueue

Don't register dynamic keys for the above schedulers, so that we hit
MAX_LOCKDEP_KEYS later in our tests.

[1] https://github.com/multipath-tcp/mptcp_net-next/issues/451

Changes in v2:
 - change ordering of spin_lock_init() vs. lockdep_register_key()
   (Jakub Kicinski)

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Link: https://patch.msgid.link/94448f7fa7c4f52d2ce416a4895ec87d456d7417.1770220576.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-05 09:32:45 -08:00
Eric Dumazet
22c1264415 tcp: move __reqsk_free() out of line
Inlining __reqsk_free() is overkill, let's reclaim 2 Kbytes of text.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 2/4 grow/shrink: 2/14 up/down: 225/-2338 (-2113)
Function                                     old     new   delta
__reqsk_free                                   -     114    +114
sock_edemux                                   18      82     +64
inet_csk_listen_start                        233     264     +31
__pfx___reqsk_free                             -      16     +16
__pfx_reqsk_queue_alloc                       16       -     -16
__pfx_reqsk_free                              16       -     -16
reqsk_queue_alloc                             46       -     -46
tcp_req_err                                  272     177     -95
reqsk_fastopen_remove                        348     253     -95
cookie_bpf_check                             157      62     -95
cookie_tcp_reqsk_alloc                       387     290     -97
cookie_v4_check                             1568    1465    -103
reqsk_free                                   105       -    -105
cookie_v6_check                             1519    1412    -107
sock_gen_put                                 187      78    -109
sock_pfree                                   212      82    -130
tcp_try_fastopen                            1818    1683    -135
tcp_v4_rcv                                  3478    3294    -184
reqsk_put                                    306      90    -216
tcp_get_cookie_sock                          551     318    -233
tcp_v6_rcv                                  3404    3141    -263
tcp_conn_request                            2677    2384    -293
Total: Before=24887415, After=24885302, chg -0.01%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260204055147.1682705-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-05 09:23:06 -08:00
Eric Dumazet
d5c5391554 inet: move reqsk_queue_alloc() to net/ipv4/inet_connection_sock.c
Only called once from inet_csk_listen_start(), it can be static.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260204055147.1682705-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-05 09:23:05 -08:00
David Yang
770e112634 flow_offload: add const qualifiers to function arguments
Some functions do not modify the pointed-to data, but lack const
qualifiers. Add const qualifiers to the arguments of
flow_rule_match_has_control_flags() and flow_cls_offload_flow_rule().

Signed-off-by: David Yang <mmyangfl@gmail.com>
Link: https://patch.msgid.link/20260204052839.198602-1-mmyangfl@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-05 16:24:22 +01:00
Oliver Hartkopp
96ea3a1e2d can: add CAN skb extension infrastructure
To remove the private CAN bus skb headroom infrastructure 8 bytes need to
be stored in the skb. The skb extensions are a common pattern and an easy
and efficient way to hold private data travelling along with the skb. We
only need the skb_ext_add() and skb_ext_find() functions to allocate and
access CAN specific content as the skb helpers to copy/clone/free skbs
automatically take care of skb extensions and their final removal.

This patch introduces the complete CAN skb extensions infrastructure:
- add struct can_skb_ext in new file include/net/can.h
- add include/net/can.h in MAINTAINERS
- add SKB_EXT_CAN to skbuff.c and skbuff.h
- select SKB_EXTENSIONS in Kconfig when CONFIG_CAN is enabled
- check for existing CAN skb extensions in can_rcv() in af_can.c
- add CAN skb extensions allocation at every skb_alloc() location
- duplicate the skb extensions if cloning outgoing skbs (framelen/gw_hops)
- introduce can_skb_ext_add() and can_skb_ext_find() helpers

The patch also corrects an indention issue in the original code from 2018:
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202602010426.PnGrYAk3-lkp@intel.com/
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://patch.msgid.link/20260201-can_skb_ext-v8-2-3635d790fe8b@hartkopp.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-05 11:58:39 +01:00
Randy Dunlap
a34b0e4e21 net/iucv: clean up iucv kernel-doc warnings
Fix numerous (many) kernel-doc warnings in iucv.[ch]:

- convert function documentation comments to a common (kernel-doc) look,
  even for static functions (without "/**")
- use matching parameter and parameter description names
- use better wording in function descriptions (Jakub & AI)
- remove duplicate kernel-doc comments from the header file (Jakub)

Examples:

Warning: include/net/iucv/iucv.h:210 missing initial short description
 on line: * iucv_unregister
Warning: include/net/iucv/iucv.h:216 function parameter 'handle' not
 described in 'iucv_unregister'
Warning: include/net/iucv/iucv.h:467 function parameter 'answer' not
 described in 'iucv_message_send2way'
Warning: net/iucv/iucv.c:727 missing initial short description on line:
 * iucv_cleanup_queue

Build-tested with both "make htmldocs" and "make ARCH=s390 defconfig all".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/20260203075248.1177869-1-rdunlap@infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-04 20:39:58 -08:00
Eric Dumazet
309dd99421 tcp: split tcp_check_space() in two parts
tcp_check_space() is fat and not inlined.

Move its slow path in (out of line) __tcp_check_space()
and make tcp_check_space() an inline function for better TCP performance.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 2/2 grow/shrink: 4/0 up/down: 708/-582 (126)
Function                                     old     new   delta
__tcp_check_space                              -     521    +521
tcp_rcv_established                         1860    1916     +56
tcp_rcv_state_process                       3342    3384     +42
tcp_event_new_data_sent                      248     286     +38
tcp_data_snd_check                            71     106     +35
__pfx___tcp_check_space                        -      16     +16
__pfx_tcp_check_space                         16       -     -16
tcp_check_space                              566       -    -566
Total: Before=24896373, After=24896499, chg +0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260203050932.3522221-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-04 20:37:06 -08:00
Jakub Kicinski
333225e1e9 Some more changes, including pulls from drivers:
- ath drivers: small features/cleanups
  - rtw drivers: mostly refactoring for rtw89 RTL8922DE support
  - mac80211: use hrtimers for CAC to avoid too long delays
  - cfg80211/mac80211: some initial UHR (Wi-Fi 8) support
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmmDNuEACgkQ10qiO8sP
 aADhvA/+J35p2CDkffi1KfZxxx1YdHAAlj1zjhjLzshCMCG3oWzLpOL7se5bgN/C
 axPLPbeCAXtsRXln083lbwtrSRexPHSVhelPDNtybLPEocQYrksV8a6V3eWXCNTR
 ymN4iDaO/K0gLkDRKH5T8lwZvJttA6iHi+Fm4ir+dsr0O5vwwe4CuAEPA1SuZ2rh
 0lQMz6pEzsxq+sZX3p8SoBwXx147l0n6gwMNIgBTKo1tjZha4oaavdvcqq4zaZWV
 WCcg4YVA/dWHL0UuwtIF8uQADM43quegBBUFx63QgzfgcnHAnBk2Ckeein/bfvnv
 XOKlI4UJi1cxTkTJkDOrSn5IwBzVSlBXE3qEUKKnu5G3+ZgfdsnWmSPeTtOndvAE
 rgbwwZb2SKH1kCvL0FDZTwq/iR9KF60ZfhWIq9Sz7m6VZxJoR8QACHglYCysj2JB
 B1+oT53EIqP7Ob4s/GN2Yg9M0l4Lv3E6J9g6h3b8yeq9qEXVF8MaVN683rtNpec9
 mUqLRlcoToB2W/qvEVESKj8jMvajYZ6TDoO7mSP3paTW3HgMC3wlPJlDc4Q/6h7e
 LAKEljXlv6ofNGCcCL37l6KATqSZpIZn+tpSqbELIirWlc/rnTIDU2qZRb7MA1e1
 3lKdrS6pOXGS1GJr7HWuLb4cX1SukyXNeyIcZJlSFoxG4oDPvwI=
 =/NUu
 -----END PGP SIGNATURE-----

Merge tag 'wireless-next-2026-02-04' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Johannes Berg says:

====================
Some more changes, including pulls from drivers:
 - ath drivers: small features/cleanups
 - rtw drivers: mostly refactoring for rtw89 RTL8922DE support
 - mac80211: use hrtimers for CAC to avoid too long delays
 - cfg80211/mac80211: some initial UHR (Wi-Fi 8) support

* tag 'wireless-next-2026-02-04' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (59 commits)
  wifi: brcmsmac: phy: Remove unreachable error handling code
  wifi: mac80211: Add eMLSR/eMLMR action frame parsing support
  wifi: mac80211: add initial UHR support
  wifi: cfg80211: add initial UHR support
  wifi: ieee80211: add some initial UHR definitions
  wifi: mac80211: use wiphy_hrtimer_work for CAC timeout
  wifi: mac80211: correct ieee80211-{s1g/eht}.h include guard comments
  wifi: ath12k: clear stale link mapping of ahvif->links_map
  wifi: ath12k: Add support TX hardware queue stats
  wifi: ath12k: Add support RX PDEV stats
  wifi: ath12k: Fix index decrement when array_len is zero
  wifi: ath12k: support OBSS PD configuration for AP mode
  wifi: ath12k: add WMI support for spatial reuse parameter configuration
  dt-bindings: net: wireless: ath11k-pci: deprecate 'firmware-name' property
  wifi: ath11k: add usecase firmware handling based on device compatible
  wifi: ath10k: sdio: add missing lock protection in ath10k_sdio_fw_crashed_dump()
  wifi: ath10k: fix lock protection in ath10k_wmi_event_peer_sta_ps_state_chg()
  wifi: ath10k: snoc: support powering on the device via pwrseq
  wifi: rtw89: pci: warn if SPS OCP happens for RTL8922DE
  wifi: rtw89: pci: restore LDO setting after device resume
  ...
====================

Link: https://patch.msgid.link/20260204121143.181112-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-04 20:31:05 -08:00
Chia-Yu Chang
4fa4ac5e58 tcp: accecn: add tcpi_ecn_mode and tcpi_option2 in tcp_info
Add 2-bit tcpi_ecn_mode feild within tcp_info to indicate which ECN
mode is negotiated: ECN_MODE_DISABLED, ECN_MODE_RFC3168, ECN_MODE_ACCECN,
or ECN_MODE_PENDING. This is done by utilizing available bits from
tcpi_accecn_opt_seen (reduced from 16 bits to 2 bits) and
tcpi_accecn_fail_mode (reduced from 16 bits to 4 bits).

Also, an extra 24-bit tcpi_options2 field is identified to represent
newer options and connection features, as all 8 bits of tcpi_options
field have been used.

Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Co-developed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260131222515.8485-14-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-03 15:13:25 +01:00
Chia-Yu Chang
1247fb19ca tcp: accecn: detect loss ACK w/ AccECN option and add TCP_ACCECN_OPTION_PERSIST
Detect spurious retransmission of a previously sent ACK carrying the
AccECN option after the second retransmission. Since this might be caused
by the middlebox dropping ACK with options it does not recognize, disable
the sending of the AccECN option in all subsequent ACKs. This patch
follows Section 3.2.3.2.2 of AccECN spec (RFC9768), and a new field
(accecn_opt_sent_w_dsack) is added to indicate that an AccECN option was
sent with duplicate SACK info.

Also, a new AccECN option sending mode is added to tcp_ecn_option sysctl:
(TCP_ECN_OPTION_PERSIST), which ignores the AccECN fallback policy and
persistently sends AccECN option once it fits into TCP option space.

Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260131222515.8485-13-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-03 15:13:25 +01:00
Chia-Yu Chang
2ed661248e tcp: accecn: fallback outgoing half link to non-AccECN
According to Section 3.2.2.1 of AccECN spec (RFC9768), if the Server
is in AccECN mode and in SYN-RCVD state, and if it receives a value of
zero on a pure ACK with SYN=0 and no SACK blocks, for the rest of the
connection the Server MUST NOT set ECT on outgoing packets and MUST
NOT respond to AccECN feedback. Nonetheless, as a Data Receiver it
MUST NOT disable AccECN feedback.

Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260131222515.8485-12-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-03 15:13:25 +01:00
Chia-Yu Chang
f326f1f17f tcp: accecn: retransmit SYN/ACK without AccECN option or non-AccECN SYN/ACK
For Accurate ECN, the first SYN/ACK sent by the TCP server shall set
the ACE flag (Table 1 of RFC9768) and the AccECN option to complete the
capability negotiation. However, if the TCP server needs to retransmit
such a SYN/ACK (for example, because it did not receive an ACK
acknowledging its SYN/ACK, or received a second SYN requesting AccECN
support), the TCP server retransmits the SYN/ACK without the AccECN
option. This is because the SYN/ACK may be lost due to congestion, or a
middlebox may block the AccECN option. Furthermore, if this retransmission
also times out, to expedite connection establishment, the TCP server
should retransmit the SYN/ACK with (AE,CWR,ECE) = (0,0,0) and without the
AccECN option, while maintaining AccECN feedback mode.

This complies with Section 3.2.3.2.2 of the AccECN spec RFC9768.

Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260131222515.8485-10-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-03 15:13:24 +01:00
Chia-Yu Chang
f1eaea5585 tcp: add TCP_SYNACK_RETRANS synack_type
Before this patch, retransmitted SYN/ACK did not have a specific
synack_type; however, the upcoming patch needs to distinguish between
retransmitted and non-retransmitted SYN/ACK for AccECN negotiation to
transmit the fallback SYN/ACK during AccECN negotiation. Therefore, this
patch introduces a new synack_type (TCP_SYNACK_RETRANS).

Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260131222515.8485-9-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-03 15:13:24 +01:00
Chia-Yu Chang
c5ff6b8371 tcp: accecn: handle unexpected AccECN negotiation feedback
According to Sections 3.1.2 and 3.1.3 of AccECN spec (RFC9768).

In Section 3.1.2, it says an AccECN implementation has no need to
recognize or support the Server response labelled 'Nonce' or ECN-nonce
feedback more generally, as RFC 3540 has been reclassified as Historic.
AccECN is compatible with alternative ECN feedback integrity approaches
to the nonce. The SYN/ACK labelled 'Nonce' with (AE,CWR,ECE) = (1,0,1)
is reserved for future use. A TCP Client (A) that receives such a SYN/ACK
follows the procedure for forward compatibility given in Section 3.1.3.

Then in Section 3.1.3, it says if a TCP Client has sent a SYN requesting
AccECN feedback with (AE,CWR,ECE) = (1,1,1) then receives a SYN/ACK with
the currently reserved combination (AE,CWR,ECE) = (1,0,1) but it does not
have logic specific to such a combination, the Client MUST enable AccECN
mode as if the SYN/ACK onfirmed that the Server supported AccECN and as
if it fed back that the IP-ECN field on the SYN had arrived unchanged.

Fixes: 3cae34274c ("tcp: accecn: AccECN negotiation").
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260131222515.8485-7-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-03 15:13:24 +01:00
Chia-Yu Chang
e68c28f22f tcp: disable RFC3168 fallback identifier for CC modules
When AccECN is not successfully negociated for a TCP flow, it defaults
fallback to classic ECN (RFC3168). However, L4S service will fallback
to non-ECN.

This patch enables congestion control module to control whether it
should not fallback to classic ECN after unsuccessful AccECN negotiation.
A new CA module flag (TCP_CONG_NO_FALLBACK_RFC3168) identifies this
behavior expected by the CA.

Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260131222515.8485-6-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-03 15:13:24 +01:00
Chia-Yu Chang
100f946b8d tcp: ECT_1_NEGOTIATION and NEEDS_ACCECN identifiers
Two flags for congestion control (CC) module are added in this patch
related to AccECN negotiation. First, a new flag (TCP_CONG_NEEDS_ACCECN)
defines that the CC expects to negotiate AccECN functionality using the
ECE, CWR and AE flags in the TCP header.

Second, during ECN negotiation, ECT(0) in the IP header is used. This
patch enables CC to control whether ECT(0) or ECT(1) should be used on
a per-segment basis. A new flag (TCP_CONG_ECT_1_NEGOTIATION) defines the
expected ECT value in the IP header by the CA when not-yet initialized
for the connection.

The detailed AccECN negotiaotn can be found in IETF RFC9768.

Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260131222515.8485-5-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-03 15:13:24 +01:00
Geliang Tang
2d85088d46 tcp: export tcp_splice_state
Export struct tcp_splice_state and tcp_splice_data_recv() in net/tcp.h
so that they can be used by MPTCP in the next patch.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Acked-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260130-net-next-mptcp-splice-v2-3-31332ba70d7f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-02 18:15:32 -08:00
Eric Dumazet
b409a7f717 ipv6: colocate inet6_cork in inet_cork_full
All inet6_cork users also use one inet_cork_full.

Reduce number of parameters and increase data locality.

This saves ~275 bytes of code on x86_64.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260130210303.3888261-9-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-02 17:49:30 -08:00
Eric Dumazet
8776c4ef3a inet: add dst4_mtu() and dst6_mtu() helpers
With CONFIG_MITIGATION_RETPOLINE=y dst_mtu() is a bit fat,
because it is generic.

Indeed, clang does not always inline it.

Add dst4_mtu() and dst6_mtu() helpers for callers that
expect either ipv4_mtu() or ip6_mtu() to be called.

These helpers are always inlined.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260130210303.3888261-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-02 17:49:29 -08:00
Eric Dumazet
1bc46dd209 ipv6: pass proto by value to ipv6_push_nfrag_opts() and ipv6_push_frag_opts()
With CONFIG_STACKPROTECTOR_STRONG=y, it is better to avoid passing
a pointer to an automatic variable.

Change these exported functions to return 'u8 proto'
instead of void.

- ipv6_push_nfrag_opts()
- ipv6_push_frag_opts()

For instance, replace
	ipv6_push_frag_opts(skb, opt, &proto);
with:
	proto = ipv6_push_frag_opts(skb, opt, proto);

Note that even after this change, ip6_xmit() has to use a stack canary
because of @first_hop variable.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260130210303.3888261-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-02 17:49:28 -08:00
Eric Dumazet
82f35bec11 net: l3mdev: use skb_dst_dev_rcu() in l3mdev_l3_out()
Extend the RCU section a bit so that we can use the safer
skb_dst_dev_rcu() helper.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260130191906.3781856-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-02 17:09:11 -08:00
Lorenzo Bianconi
0d95280a2d wifi: mac80211: Add eMLSR/eMLMR action frame parsing support
Introduce support in AP mode for parsing of the Operating Mode Notification
frame sent by the client to enable/disable MLO eMLSR or eMLMR if supported
by both the AP and the client.
Add drv_set_eml_op_mode mac80211 callback in order to configure underlay
driver with eMLSR/eMLMR info.

Tested-by: Christian Marangi <ansuelsmth@gmail.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260129-mac80211-emlsr-v4-1-14bdadf57380@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-02-02 10:11:18 +01:00
Johannes Berg
a108511471 wifi: mac80211: add initial UHR support
Add support for making UHR connections and accepting AP
stations with UHR support.

Link: https://patch.msgid.link/20260130164259.7185980484eb.Ieec940b58dbf8115dab7e1e24cb5513f52c8cb2f@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-02-02 10:11:08 +01:00
Johannes Berg
072e6f7f41 wifi: cfg80211: add initial UHR support
Add initial support for making UHR connections (or suppressing
that), adding UHR capable stations on the AP side, encoding
and decoding UHR MCSes (except rate calculation for the new
MCSes 17, 19, 20 and 23) as well as regulatory support.

Link: https://patch.msgid.link/20260130164259.54cc12fbb307.I26126bebd83c7ab17e99827489f946ceabb3521f@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-02-02 10:11:07 +01:00
Ethan Nelson-Moore
82fff3b055 net: ax25: remove plumbing for never-implemented DAMA Master support
The AX25_DAMA_MASTER option has been unimplemented and marked broken
ever since it was introduced in 2007 in commit 954b2e7f4c ("[NET]
AX.25 Kconfig and docs updates and fixes"). At this point, it is very
unlikely it will be implemented. Remove it.

Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
Link: https://patch.msgid.link/20260129080908.44710-1-enelsonmoore@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-30 19:19:39 -08:00
Eric Dumazet
ed9b70040d tcp: reduce tcp sockets size by one cache line
By default, when a kmem_cache is created with SLAB_TYPESAFE_BY_RCU,
slub has to use extra storage for the freelist pointer after each
object, because slub assumes that any bit in the object
can be used by RCU readers.

Because proto_register() is also using SLAB_HWCACHE_ALIGN,
this forces slub to use one extra cache line per object.

We can instead put the slub freelist anywhere in the object,
granted the concurrent RCU readers are not supposed to
use the pointer value.

Add a new (struct sock)sk_freeptr field, in an union
with sk_rcu: No RCU readers would need to look at sk_rcu,
which is only used at free phase.

Tested:

grep . /sys/kernel/slab/TCP/{object_size,slab_size,objs_per_slab}
grep . /sys/kernel/slab/TCPv6/{object_size,slab_size,objs_per_slab}

Before:

/sys/kernel/slab/TCP/object_size:2368
/sys/kernel/slab/TCP/slab_size:2432
/sys/kernel/slab/TCP/objs_per_slab:13

/sys/kernel/slab/TCPv6/object_size:2496
/sys/kernel/slab/TCPv6/slab_size:2560
/sys/kernel/slab/TCPv6/objs_per_slab:12

After this patch, we can pack one more TCPv6 object per slab,
and object_size == slab_size.

/sys/kernel/slab/TCP/object_size:2368
/sys/kernel/slab/TCP/slab_size:2368
/sys/kernel/slab/TCP/objs_per_slab:13

/sys/kernel/slab/TCPv6/object_size:2496
/sys/kernel/slab/TCPv6/slab_size:2496
/sys/kernel/slab/TCPv6/objs_per_slab:13

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260129153458.4163797-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-30 17:15:51 -08:00
Jakub Kicinski
303c1a66a2 Another fairly large set of changes, notably:
- cfg80211/mac80211
     - most of EPPKE/802.1X over auth frames support
     - additional FTM capabilities
     - split up drop reasons better, removing generic RX_DROP
     - NAN cleanups/fixes
  - ath11k:
     - support for Channel Frequency Response measurement
  - ath12k:
     - support for the QCC2072 chipset
  - iwlwifi:
     - partial NAN support
     - UNII-9 support
     - some UHR/802.11bn FW APIs
     - remove most of MLO/EHT from iwlmvm
       (such devices use iwlmld)
  - rtw89:
     - preparations for RTL8922DE support
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAml7PQcACgkQ10qiO8sP
 aAAKiA/6AnyNxa0bX2VFsWYW6KJYnJBVNLlP2ghkV3uIWtJoZdXuQO+W8/cy9Cng
 yrhfPzNfT+2hqmxasxI0tND3H3tW9CqcwX80J84eP9JCpYuPept9uGpSxPQoQl5J
 Q2k9gX1NlO/SEa8/mOFDT4EmH0bQobxiN84kxSg6Riaazkj6ZjHVVm/3PgzNhxlA
 v77m5thlhopzYxKn38qA19E9uHSLcY7XwkeYOZDf00Zhgot29lmDeHOf39IH+HvI
 +a20q6tW59D7iX2IUyvLnWzFV1iEcJ6ONF/hYJ0r3TlfmX/NDWfOQxx87K8M1Tqh
 sMa+FGrFdqloE1aYi1l+9m6Wu30pHmh7vhlgskPffPmvG+RkCEQCg1Me7eoFOzTB
 81K2CMJ34Cp9se+QdiBtY5GpRPZIOlFmY6ZVyZIoEXHkn6r0R94e6dsMZuFcqjv1
 y1dzv7BnraVMAQcqwkE9pQtq6LeJoHl2OUT2JzjbKhQhivMf9YubPBZ2QC1LZdMg
 NYEX4XSeJ/etpUk1MZFnm5wOw545tMi3U2sAhpYWbE6UBPDrQBvYADqd3lq3DmWe
 BdCDHTbqMnAJ3C0xFEKTYTmVF8IoFt6eOclFUPw4Uhq+YmU9x8wx1yBQbF9TjyKU
 a/rDCahmryj5gwD0QFJKhdQjfKaQFVNZWZqaKaokM84+8kIdA2U=
 =70Rs
 -----END PGP SIGNATURE-----

Merge tag 'wireless-next-2026-01-29' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Johannes Berg says:

====================
Another fairly large set of changes, notably:
 - cfg80211/mac80211
    - most of EPPKE/802.1X over auth frames support
    - additional FTM capabilities
    - split up drop reasons better, removing generic RX_DROP
    - NAN cleanups/fixes
 - ath11k:
    - support for Channel Frequency Response measurement
 - ath12k:
    - support for the QCC2072 chipset
 - iwlwifi:
    - partial NAN support
    - UNII-9 support
    - some UHR/802.11bn FW APIs
    - remove most of MLO/EHT from iwlmvm
      (such devices use iwlmld)
 - rtw89:
    - preparations for RTL8922DE support

* tag 'wireless-next-2026-01-29' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (184 commits)
  wifi: iwlegacy: add missing mutex protection in il4965_store_tx_power()
  wifi: iwlegacy: add missing mutex protection in il3945_store_measurement()
  wifi: mac80211: use u64_stats_t with u64_stats_sync properly
  wifi: p54: Fix memory leak in p54_beacon_update()
  wifi: cfg80211: treat deprecated INDOOR_SP_AP_OLD control value as LPI mode
  wifi: rtw88: sdio: Migrate to use sdio specific shutdown function
  wifi: rsi: sdio: Migrate to use sdio specific shutdown function
  sdio: Provide a bustype shutdown function
  wifi: nl80211/cfg80211: support operating as RSTA in PMSR FTM request
  wifi: nl80211/cfg80211: add negotiated burst period to FTM result
  wifi: nl80211/cfg80211: clarify periodic FTM parameters for non-EDCA based ranging
  wifi: nl80211/cfg80211: add new FTM capabilities
  wifi: iwlwifi: rename struct iwl_mcc_allowed_ap_type_cmd::offset_map
  wifi: iwlwifi: mvm: Remove link_id from time_events
  wifi: iwlwifi: mld: change cluster_id type to u8 array
  wifi: iwlwifi: support V13 of iwl_lari_config_change_cmd
  wifi: iwlwifi: split bios_value_u32 to separate the header
  wifi: iwlwifi: uefi: cache the DSM functions
  wifi: iwlwifi: acpi: cache the DSM functions
  wifi: iwlwifi: mvm: Cleanup MLO code
  ...
====================

Link: https://patch.msgid.link/20260129110136.176980-39-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-29 19:17:43 -08:00
Eric Dumazet
b1cd687e3e ipv6: optimize fl6_update_dst()
fl6_update_dst() is called for every TCP (and others) transmit,
and is a nop for common cases.

Split it in two parts :

1) fl6_update_dst() inline helper, small and fast.

2) __fl6_update_dst() for the exception, out of line.

Small size increase to get better TX performance.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 2/2 grow/shrink: 8/0 up/down: 296/-125 (171)
Function                                     old     new   delta
__fl6_update_dst                               -     104    +104
rawv6_sendmsg                               2244    2284     +40
udpv6_sendmsg                               3013    3043     +30
tcp_v6_connect                              1514    1534     +20
cookie_v6_check                             1501    1519     +18
ip6_datagram_dst_update                      673     690     +17
inet6_sk_rebuild_header                      499     516     +17
inet6_csk_route_socket                       507     524     +17
inet6_csk_route_req                          343     360     +17
__pfx___fl6_update_dst                         -      16     +16
__pfx_fl6_update_dst                          16       -     -16
fl6_update_dst                               109       -    -109
Total: Before=22570304, After=22570475, chg +0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260128185548.3738781-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-29 18:47:21 -08:00
Jakub Kicinski
a010fe8d86 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.19-rc8).

No adjacent changes, conflicts:

drivers/net/ethernet/spacemit/k1_emac.c
  2c84959167 ("net: spacemit: Check for netif_carrier_ok() in emac_stats_update()")
  f66086798f ("net: spacemit: Remove broken flow control support")
https://lore.kernel.org/aXjAqZA3iEWD_DGM@sirena.org.uk

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-29 17:28:54 -08:00
Luiz Augusto von Dentz
6c3ea155e5 Bluetooth: L2CAP: Fix not tracking outstanding TX ident
This attempts to proper track outstanding request by using struct ida
and allocating from it in l2cap_get_ident using ida_alloc_range which
would reuse ids as they are free, then upon completion release
the id using ida_free.

This fixes the qualification test case L2CAP/COS/CED/BI-29-C which
attempts to check if the host stack is able to work after 256 attempts
to connect which requires Ident field to use the full range of possible
values in order to pass the test.

Link: https://github.com/bluez/bluez/issues/1829
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
2026-01-29 13:36:35 -05:00
Luiz Augusto von Dentz
0e2a6af810 Bluetooth: Fix using PHYs bitfields as PHY value
This renames the PHY fields in bt_iso_io_qos to PHYs (plural) since it
represents a bitfield where multiple PHYs can be set and make the same
change also to HCI_OP_LE_SET_CIG_PARAMS since both c_phy and p_phy
fields are bitfields.

This also fixes the assumption that hci_evt_le_cis_established PHYs
fields are compatible with bt_iso_io_qos, they are not, the fields in
hci_evt_le_cis_established represent just a single PHY value so they
need to be converted to bitfield when set in bt_iso_io_qos.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2026-01-29 13:27:47 -05:00
Luiz Augusto von Dentz
132c0779d4 Bluetooth: L2CAP: Add support for setting BT_PHY
This enables client to use setsockopt(BT_PHY) to set the connection
packet type/PHY:

Example setting BT_PHY_BR_1M_1SLOT:

< HCI Command: Change Conne.. (0x01|0x000f) plen 4
        Handle: 1 Address: 00:AA:01:01:00:00 (Intel Corporation)
        Packet type: 0x331e
          2-DH1 may not be used
          3-DH1 may not be used
          DM1 may be used
          DH1 may be used
          2-DH3 may not be used
          3-DH3 may not be used
          2-DH5 may not be used
          3-DH5 may not be used
> HCI Event: Command Status (0x0f) plen 4
      Change Connection Packet Type (0x01|0x000f) ncmd 1
        Status: Success (0x00)
> HCI Event: Connection Packet Typ.. (0x1d) plen 5
        Status: Success (0x00)
        Handle: 1 Address: 00:AA:01:01:00:00 (Intel Corporation)
        Packet type: 0x331e
          2-DH1 may not be used
          3-DH1 may not be used
          DM1 may be used
          DH1 may be used
          2-DH3 may not be used
          3-DH3 may not be used
          2-DH5 may not be used

Example setting BT_PHY_LE_1M_TX and BT_PHY_LE_1M_RX:

< HCI Command: LE Set PHY (0x08|0x0032) plen 7
        Handle: 1 Address: 00:AA:01:01:00:00 (Intel Corporation)
        All PHYs preference: 0x00
        TX PHYs preference: 0x01
          LE 1M
        RX PHYs preference: 0x01
          LE 1M
        PHY options preference: Reserved (0x0000)
> HCI Event: Command Status (0x0f) plen 4
      LE Set PHY (0x08|0x0032) ncmd 1
        Status: Success (0x00)
> HCI Event: LE Meta Event (0x3e) plen 6
      LE PHY Update Complete (0x0c)
        Status: Success (0x00)
        Handle: 1 Address: 00:AA:01:01:00:00 (Intel Corporation)
        TX PHY: LE 1M (0x01)
        RX PHY: LE 1M (0x01)

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2026-01-29 13:25:34 -05:00
Naga Bhavani Akella
fe05e3c059 Bluetooth: hci_sync: Add LE Channel Sounding HCI Command/event structures
1. Implement LE Event Mask to include events required for
   LE Channel Sounding
2. Enable Channel Sounding feature bit in the
   LE Host Supported Features command
3. Define HCI command and event structures necessary for
   LE Channel Sounding functionality

Signed-off-by: Naga Bhavani Akella <naga.akella@oss.qualcomm.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2026-01-29 13:24:48 -05:00
Luiz Augusto von Dentz
129d1ef3c5 Bluetooth: hci_conn: Fix using conn->le_{tx,rx}_phy as supported PHYs
conn->le_{tx,rx}_phy is not actually a bitfield as it set by
HCI_EV_LE_PHY_UPDATE_COMPLETE it is actually correspond to the current
PHY in use not what is supported by the controller, so this introduces
different fields (conn->le_{tx,rx}_def_phys) to track what PHYs are
supported by the connection.

Fixes: eab2404ba7 ("Bluetooth: Add BT_PHY socket option")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2026-01-29 13:21:40 -05:00
Scott Mitchell
e19079adcd netfilter: nfnetlink_queue: optimize verdict lookup with hash table
The current implementation uses a linear list to find queued packets by
ID when processing verdicts from userspace. With large queue depths and
out-of-order verdicting, this O(n) lookup becomes a significant
bottleneck, causing userspace verdict processing to dominate CPU time.

Replace the linear search with a hash table for O(1) average-case
packet lookup by ID. A global rhashtable spanning all network
namespaces attributes hash bucket memory to kernel but is subject to
fixed upper bound.

Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-29 09:52:07 +01:00
Kuniyuki Iwashima
d2492688bb nfc: nci: Fix race between rfkill and nci_unregister_device().
syzbot reported the splat below [0] without a repro.

It indicates that struct nci_dev.cmd_wq had been destroyed before
nci_close_device() was called via rfkill.

nci_dev.cmd_wq is only destroyed in nci_unregister_device(), which
(I think) was called from virtual_ncidev_close() when syzbot close()d
an fd of virtual_ncidev.

The problem is that nci_unregister_device() destroys nci_dev.cmd_wq
first and then calls nfc_unregister_device(), which removes the
device from rfkill by rfkill_unregister().

So, the device is still visible via rfkill even after nci_dev.cmd_wq
is destroyed.

Let's unregister the device from rfkill first in nci_unregister_device().

Note that we cannot call nfc_unregister_device() before
nci_close_device() because

  1) nfc_unregister_device() calls device_del() which frees
     all memory allocated by devm_kzalloc() and linked to
     ndev->conn_info_list

  2) nci_rx_work() could try to queue nci_conn_info to
     ndev->conn_info_list which could be leaked

Thus, nfc_unregister_device() is split into two functions so we
can remove rfkill interfaces only before nci_close_device().

[0]:
DEBUG_LOCKS_WARN_ON(1)
WARNING: kernel/locking/lockdep.c:238 at hlock_class kernel/locking/lockdep.c:238 [inline], CPU#0: syz.0.8675/6349
WARNING: kernel/locking/lockdep.c:238 at check_wait_context kernel/locking/lockdep.c:4854 [inline], CPU#0: syz.0.8675/6349
WARNING: kernel/locking/lockdep.c:238 at __lock_acquire+0x39d/0x2cf0 kernel/locking/lockdep.c:5187, CPU#0: syz.0.8675/6349
Modules linked in:
CPU: 0 UID: 0 PID: 6349 Comm: syz.0.8675 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/13/2026
RIP: 0010:hlock_class kernel/locking/lockdep.c:238 [inline]
RIP: 0010:check_wait_context kernel/locking/lockdep.c:4854 [inline]
RIP: 0010:__lock_acquire+0x3a4/0x2cf0 kernel/locking/lockdep.c:5187
Code: 18 00 4c 8b 74 24 08 75 27 90 e8 17 f2 fc 02 85 c0 74 1c 83 3d 50 e0 4e 0e 00 75 13 48 8d 3d 43 f7 51 0e 48 c7 c6 8b 3a de 8d <67> 48 0f b9 3a 90 31 c0 0f b6 98 c4 00 00 00 41 8b 45 20 25 ff 1f
RSP: 0018:ffffc9000c767680 EFLAGS: 00010046
RAX: 0000000000000001 RBX: 0000000000040000 RCX: 0000000000080000
RDX: ffffc90013080000 RSI: ffffffff8dde3a8b RDI: ffffffff8ff24ca0
RBP: 0000000000000003 R08: ffffffff8fef35a3 R09: 1ffffffff1fde6b4
R10: dffffc0000000000 R11: fffffbfff1fde6b5 R12: 00000000000012a2
R13: ffff888030338ba8 R14: ffff888030338000 R15: ffff888030338b30
FS:  00007fa5995f66c0(0000) GS:ffff8881256f8000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7e72f842d0 CR3: 00000000485a0000 CR4: 00000000003526f0
Call Trace:
 <TASK>
 lock_acquire+0x106/0x330 kernel/locking/lockdep.c:5868
 touch_wq_lockdep_map+0xcb/0x180 kernel/workqueue.c:3940
 __flush_workqueue+0x14b/0x14f0 kernel/workqueue.c:3982
 nci_close_device+0x302/0x630 net/nfc/nci/core.c:567
 nci_dev_down+0x3b/0x50 net/nfc/nci/core.c:639
 nfc_dev_down+0x152/0x290 net/nfc/core.c:161
 nfc_rfkill_set_block+0x2d/0x100 net/nfc/core.c:179
 rfkill_set_block+0x1d2/0x440 net/rfkill/core.c:346
 rfkill_fop_write+0x461/0x5a0 net/rfkill/core.c:1301
 vfs_write+0x29a/0xb90 fs/read_write.c:684
 ksys_write+0x150/0x270 fs/read_write.c:738
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fa59b39acb9
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fa5995f6028 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007fa59b615fa0 RCX: 00007fa59b39acb9
RDX: 0000000000000008 RSI: 0000200000000080 RDI: 0000000000000007
RBP: 00007fa59b408bf7 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fa59b616038 R14: 00007fa59b615fa0 R15: 00007ffc82218788
 </TASK>

Fixes: 6a2968aaf5 ("NFC: basic NCI protocol implementation")
Reported-by: syzbot+f9c5fd1a0874f9069dce@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/695e7f56.050a0220.1c677c.036c.GAE@google.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260127040411.494931-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-28 19:32:26 -08:00
Eric Dumazet
d5fb143dbe tcp: move tcp_rack_advance() to tcp_input.c
tcp_rack_advance() is called from tcp_ack() and tcp_sacktag_one().

Moving it to tcp_input.c allows the compiler to inline it and save
both space and cpu cycles in TCP fast path.

$ scripts/bloat-o-meter -t vmlinux.1 vmlinux.2
add/remove: 0/2 grow/shrink: 1/1 up/down: 98/-132 (-34)
Function                                     old     new   delta
tcp_ack                                     5741    5839     +98
tcp_sacktag_one                              407     395     -12
__pfx_tcp_rack_advance                        16       -     -16
tcp_rack_advance                             104       -    -104
Total: Before=22572680, After=22572646, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260127032147.3498272-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-28 19:31:51 -08:00
Eric Dumazet
629a68865a tcp: move tcp_rack_update_reo_wnd() to tcp_input.c
tcp_rack_update_reo_wnd() is called only once from tcp_ack()

Move it to tcp_input.c so that it can be inlined by the compiler
to save space and cpu cycles.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/2 grow/shrink: 1/0 up/down: 110/-153 (-43)
Function                                     old     new   delta
tcp_ack                                     5631    5741    +110
__pfx_tcp_rack_update_reo_wnd                 16       -     -16
tcp_rack_update_reo_wnd                      137       -    -137
Total: Before=22572723, After=22572680, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260127032147.3498272-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-28 19:31:51 -08:00
Konstantin Taranov
a01745ccf7 RDMA/mana_ib: Add device‑memory support
Introduce a basic DM implementation that enables creating and
registering device memory, and using the associated memory keys
for networking operations.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://patch.msgid.link/20260127082649.429018-1-kotaranov@linux.microsoft.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-01-27 09:16:11 -05:00
Pagadala Yesu Anjaneyulu
fd5bfcf430 wifi: cfg80211: treat deprecated INDOOR_SP_AP_OLD control value as LPI mode
Although value 4 (INDOOR_SP_AP_OLD) is deprecated in IEEE standards,
existing APs may still use this control value. Since this value is
based on the old specification, we cannot trust such APs implement
proper power controls.
Therefore, move IEEE80211_6GHZ_CTRL_REG_INDOOR_SP_AP_OLD case
from SP_AP to LPI_AP power type handling to prevent potential
power limit violations.

Signed-off-by: Pagadala Yesu Anjaneyulu <pagadala.yesu.anjaneyulu@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260111163601.6b5a36d3601e.I1704ee575fd25edb0d56f48a0a3169b44ef72ad0@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-01-27 13:42:26 +01:00
Avraham Stern
853800c746 wifi: nl80211/cfg80211: support operating as RSTA in PMSR FTM request
Add an option to operate as the RSTA in an FTM measurement request.
When requested, the device will dwell on the requested channel until
the peer starts the FTM negotiation. This option is only valid for
trigger-based/non trigger-based measurement with LMR feedback which
will allow the RSTA to receive the results of the measurement.

Signed-off-by: Avraham Stern <avraham.stern@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260111190221.1f95fc0afab4.Iae2d32783b8e7c4a29089fec0f4c6bce94d303cc@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-01-27 13:40:38 +01:00
Avraham Stern
cfd46d1c6f wifi: nl80211/cfg80211: add negotiated burst period to FTM result
The FTM result includes some of the periodic measurement negotiated
parameters (like the burst duration and number of bursts), but it
doesn't include the burst period. Add it to the FTM result
notification.

Signed-off-by: Avraham Stern <avraham.stern@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260111190221.e0778f86edef.I3c98c1933eb639963bc3ffdef81a8788b59f2188@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-01-27 13:40:36 +01:00
Avraham Stern
853ce6943c wifi: nl80211/cfg80211: clarify periodic FTM parameters for non-EDCA based ranging
Periodic FTM request attributes are defined based on the periodic
parameters used in EDCA-based ranging negotiation. However, non-EDCA
based ranging (trigger-based/non-trigger-based) does not include
periodic parameters in the negotiation protocol, even though upper
layers may still request periodic measurements.

Clarify the semantics of periodic ranging attributes when used with
non-EDCA based ranging.

Signed-off-by: Avraham Stern <avraham.stern@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260111190221.b89cb3f68e1a.I7a9d8c6d1c66c77f1b43120a841101c96c3f19ad@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-01-27 13:40:30 +01:00
Avraham Stern
86c6b6e4d1 wifi: nl80211/cfg80211: add new FTM capabilities
Add new capabilities to the PMSR FTM capabilities list. The new
capabilities include 6 GHz support, supported number of spatial streams
and supported number of LTF repetitions.

Signed-off-by: Avraham Stern <avraham.stern@intel.com>
Tested-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260111190221.bf43785c18f6.Ic98cf9790ddee84bf88e5720b93c46c23af3c96c@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-01-27 13:40:25 +01:00
Bobby Eshleman
eafb64f40c vsock: add netns to vsock core
Add netns logic to vsock core. Additionally, modify transport hook
prototypes to be used by later transport-specific patches (e.g.,
*_seqpacket_allow()).

Namespaces are supported primarily by changing socket lookup functions
(e.g., vsock_find_connected_socket()) to take into account the socket
namespace and the namespace mode before considering a candidate socket a
"match".

This patch also introduces the sysctl /proc/sys/net/vsock/ns_mode to
report the mode and /proc/sys/net/vsock/child_ns_mode to set the mode
for new namespaces.

Add netns functionality (initialization, passing to transports, procfs,
etc...) to the af_vsock socket layer. Later patches that add netns
support to transports depend on this patch.

This patch changes the allocation of random ports for connectible vsocks
in order to avoid leaking the random port range starting point to other
namespaces.

dgram_allow(), stream_allow(), and seqpacket_allow() callbacks are
modified to take a vsk in order to perform logic on namespace modes. In
future patches, the net will also be used for socket
lookups in these functions.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260121-vsock-vmtest-v16-1-2859a7512097@meta.com
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-27 10:45:38 +01:00
Eric Dumazet
df7388b3d7 net: inline get_netmem() and put_netmem()
These helpers are used in network fast paths.

Only call out-of-line helpers for netmem case.

We might consider inlining __get_netmem() and __put_netmem()
in the future.

$ scripts/bloat-o-meter -t vmlinux.3 vmlinux.4
add/remove: 6/6 grow/shrink: 22/1 up/down: 2614/-646 (1968)
Function                                     old     new   delta
pskb_carve                                  1669    1894    +225
gro_pull_from_frag0                            -     206    +206
get_page                                     190     380    +190
skb_segment                                 3561    3747    +186
put_page                                     595     765    +170
skb_copy_ubufs                              1683    1822    +139
__pskb_trim_head                             276     401    +125
__pskb_copy_fclone                           734     858    +124
skb_zerocopy                                1092    1215    +123
pskb_expand_head                             892    1008    +116
skb_split                                    828     940    +112
skb_release_data                             297     409    +112
___pskb_trim                                 829     941    +112
__skb_zcopy_downgrade_managed                120     226    +106
tcp_clone_payload                            530     634    +104
esp_ssg_unref                                191     294    +103
dev_gro_receive                             1464    1514     +50
__put_netmem                                   -      41     +41
__get_netmem                                   -      41     +41
skb_shift                                   1139    1175     +36
skb_try_coalesce                             681     714     +33
__pfx_put_page                               112     144     +32
__pfx_get_page                                32      64     +32
__pskb_pull_tail                            1137    1168     +31
veth_xdp_get                                 250     267     +17
__pfx_gro_pull_from_frag0                      -      16     +16
__pfx___put_netmem                             -      16     +16
__pfx___get_netmem                             -      16     +16
__pfx_put_netmem                              16       -     -16
__pfx_gro_try_pull_from_frag0                 16       -     -16
__pfx_get_netmem                              16       -     -16
put_netmem                                   114       -    -114
get_netmem                                   130       -    -130
napi_gro_frags                               929     771    -158
gro_try_pull_from_frag0                      196       -    -196
Total: Before=22565857, After=22567825, chg +0.01%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260122045720.1221017-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25 13:18:53 -08:00
Eric Dumazet
87918dd4ea net: inline net_is_devmem_iov()
1) Inline this small helper to reduce code size and decrease cpu costs.
2) Constify its argument.
3) Move it to include/net/netmem.h, as a prereq for the following patch.

$ scripts/bloat-o-meter -t vmlinux.2 vmlinux.3
add/remove: 0/2 grow/shrink: 0/4 up/down: 0/-158 (-158)
Function                                     old     new   delta
validate_xmit_skb                            866     857      -9
__pfx_net_is_devmem_iov                       16       -     -16
net_is_devmem_iov                             22       -     -22
get_netmem                                   152     130     -22
put_netmem                                   140     114     -26
tcp_recvmsg_locked                          3860    3797     -63
Total: Before=22566015, After=22565857, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260122045720.1221017-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25 13:18:53 -08:00
Eric Dumazet
f6c3665b6d bonding: annotate data-races around slave->last_rx
slave->last_rx and slave->target_last_arp_rx[...] can be read and written
locklessly. Add READ_ONCE() and WRITE_ONCE() annotations.

syzbot reported:

BUG: KCSAN: data-race in bond_rcv_validate / bond_rcv_validate

write to 0xffff888149f0d428 of 8 bytes by interrupt on cpu 1:
  bond_rcv_validate+0x202/0x7a0 drivers/net/bonding/bond_main.c:3335
  bond_handle_frame+0xde/0x5e0 drivers/net/bonding/bond_main.c:1533
  __netif_receive_skb_core+0x5b1/0x1950 net/core/dev.c:6039
  __netif_receive_skb_one_core net/core/dev.c:6150 [inline]
  __netif_receive_skb+0x59/0x270 net/core/dev.c:6265
  netif_receive_skb_internal net/core/dev.c:6351 [inline]
  netif_receive_skb+0x4b/0x2d0 net/core/dev.c:6410
...

write to 0xffff888149f0d428 of 8 bytes by interrupt on cpu 0:
  bond_rcv_validate+0x202/0x7a0 drivers/net/bonding/bond_main.c:3335
  bond_handle_frame+0xde/0x5e0 drivers/net/bonding/bond_main.c:1533
  __netif_receive_skb_core+0x5b1/0x1950 net/core/dev.c:6039
  __netif_receive_skb_one_core net/core/dev.c:6150 [inline]
  __netif_receive_skb+0x59/0x270 net/core/dev.c:6265
  netif_receive_skb_internal net/core/dev.c:6351 [inline]
  netif_receive_skb+0x4b/0x2d0 net/core/dev.c:6410
  br_netif_receive_skb net/bridge/br_input.c:30 [inline]
  NF_HOOK include/linux/netfilter.h:318 [inline]
...

value changed: 0x0000000100005365 -> 0x0000000100005366

Fixes: f5b2b966f0 ("[PATCH] bonding: Validate probe replies in ARP monitor")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Link: https://patch.msgid.link/20260122162914.2299312-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 13:55:56 -08:00
Jakub Kicinski
8e3245cb30 net: add queue config validation callback
I imagine (tm) that as the number of per-queue configuration
options grows some of them may conflict for certain drivers.
While the drivers can obviously do all the validation locally
doing so is fairly inconvenient as the config is fed to drivers
piecemeal via different ops (for different params and NIC-wide
vs per-queue).

Add a centralized callback for validating the queue config
in queue ops. The callback gets invoked before memory provider
is installed, and in the future should also be called when ring
params are modified.

The validation is done after each layer of configuration.
Since we can't fail MP un-binding we must make sure that
the config is valid both before and after MP overrides are
applied. This is moot for now since the set of MP and device
configs are disjoint. It will matter significantly in the future,
so adding it now so that we don't forget..

Link: https://patch.msgid.link/20260122005113.2476634-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 11:49:02 -08:00
Jakub Kicinski
fc1a78a25c net: use netdev_queue_config() for mp restart
We should follow the prepare/commit approach for queue configuration.
The qcfg struct should be added to dev->cfg rather than directly to
queue objects so that we can clone and discard the pending config
easily.

Remove the qcfg in struct netdev_rx_queue, and switch remaining callers
to netdev_queue_config(). netdev_queue_config() will construct the qcfg
on the fly based on device defaults and state of the queue.

ndo_default_qcfg becomes optional because having the callback itself
does not have any meaningful semantics to us.

Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Link: https://patch.msgid.link/20260122005113.2476634-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 11:49:02 -08:00
Jakub Kicinski
b9ac2c60a3 net: introduce a trivial netdev_queue_config()
We may choose to extend or reimplement the logic which renders
the per-queue config. The drivers should not poke directly into
the queue state. Add a helper for drivers to use when they want
to query the config for a specific queue.

Link: https://patch.msgid.link/20260122005113.2476634-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 11:49:01 -08:00
Paolo Abeni
0c09e89f6c geneve: expose gso partial features for tunnel offload
GSO partial features for tunnels do not require any kind of support from
the underlying device: we can safely add them to the geneve UDP tunnel.

The only point of attention is the skb required features propagation in
the device xmit op: partial features must be stripped, except for
UDP_TUNNEL*.

Keep partial features disabled by default.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/d851ca8e928cf05d68310bcbaeaa5e9e0b01e058.1769011015.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 11:31:14 -08:00
Jakub Kicinski
9abf22075d Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.19-rc7).

Conflicts:

drivers/net/ethernet/huawei/hinic3/hinic3_irq.c
  b35a6fd37a ("hinic3: Add adaptive IRQ coalescing with DIM")
  fb2bb2a1eb ("hinic3: Fix netif_queue_set_napi queue_index input parameter error")
https://lore.kernel.org/fc0a7fdf08789a52653e8ad05281a0a849e79206.1768915707.git.zhuyikai1@h-partners.com

drivers/net/wireless/ath/ath12k/mac.c
drivers/net/wireless/ath/ath12k/wifi7/hw.c
  3170757210 ("wifi: ath12k: Fix wrong P2P device link id issue")
  c26f294fef ("wifi: ath12k: Move ieee80211_ops callback to the arch specific module")
https://lore.kernel.org/20260114123751.6a208818@canb.auug.org.au

Adjacent changes:

drivers/net/wireless/ath/ath12k/mac.c
  8b8d6ee53d ("wifi: ath12k: Fix scan state stuck in ABORTING after cancel_remain_on_channel")
  914c890d3b ("wifi: ath12k: Add framework for hardware specific ieee80211_ops registration")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-22 20:14:36 -08:00
Eric Dumazet
bc1f0b1c98 tcp: move tcp_rate_check_app_limited() to tcp.c
tcp_rate_check_app_limited() is used from tcp_sendmsg_locked()
fast path and from other callers.

Move it to tcp.c so that it can be inlined in tcp_sendmsg_locked().

Small increase of code, for better TCP performance.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 1/0 up/down: 87/0 (87)
Function                                     old     new   delta
tcp_sendmsg_locked                          4217    4304     +87
Total: Before=22566462, After=22566549, chg +0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20260121095923.3134639-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-22 18:28:48 -08:00
Eric Dumazet
b814bdcecd tcp: move tcp_rate_gen to tcp_input.c
This function is called from one caller only, in TCP fast path.

Move it to tcp_input.c so that compiler can inline it.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/2 grow/shrink: 1/0 up/down: 226/-300 (-74)
Function                                     old     new   delta
tcp_ack                                     5405    5631    +226
__pfx_tcp_rate_gen                            16       -     -16
tcp_rate_gen                                 284       -    -284
Total: Before=22566536, After=22566462, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20260121095923.3134639-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-22 18:28:48 -08:00
Pablo Neira Ayuso
f175b46d91 netfilter: nf_tables: add .abort_skip_removal flag for set types
The pipapo set backend is the only user of the .abort interface so far.
To speed up pipapo abort path, removals are skipped.

The follow up patch updates the rbtree to use to build an array of
ordered elements, then use binary search. This needs a new .abort
interface but, unlike pipapo, it also need to undo/remove elements.

Add a flag and use it from the pipapo set backend.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-22 17:18:13 +01:00
Jakub Kicinski
9146fe2829 Another set of updates:
- various small fixes for ath10k/ath12k/mwifiex/rsi
  - cfg80211 fix for HE bitrate overflow
  - mac80211 fixes
    - S1G beacon handling in scan
    - skb tailroom handling for HW encryption
    - CSA fix for multi-link
    - handling of disabled links during association
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmlyArkACgkQ10qiO8sP
 aACZxA/+N/q+DAHhVgqETwqOh80WAFTSuhDZsUXc6PNtFkOHvuZaeHePU+fn8hco
 +CkUVEnWYoNgUiaVg1697PJubp0psN7H+3+cq8kn8C/oNB2YnEV2kPCk4x7R8LCj
 TP6mad/Fb+I6Ct7XaCFymCS49eP4Bju7UBgYgLTiKJYvabA+Jim7LavZr8j0Tvra
 h9TNeA0I8+dgGppAWLTssrnsxp65xfSdq71mtRCFUrpEUHjzCl589PEv6BYcIRwv
 N50pm6Am5KJ1TZn5sVSYVfKiiG7UtL/pbXbsM5Cj/54yIFIgmE3bGI5MGAXRlWNG
 o/d/bo0rJg1xppipyZDEN5OJS6S0ijyC5TNKFFRX6IU2eZ8jJs7CWrQw6L8hWCbY
 G+lnVSh4yPAzLEk80S/zBHQccAPXnONtm+cFyPsPab79oAboxQVauuDdH1t5cxbQ
 1HRn0RopyfEoPLmxsCcCSVcdF+hDwRZxAUO735Opnz/amDJNjBKgXTKezXSubfPH
 5hvoAs/VZh7xSyJmEViDO5gavW1SX4nKlUixLYLUrXyq4i+eZkEIvqlCkIWuN9EW
 y/leooWqFzoXY3K8QL8nKQyHWg10WJL1l56tVz+Y+YAh8TvqgWWTB/O/u3g+/ZqO
 eDkaeKbbexUy6iyN0/TMb2RNnTERO0xOepMmIu3sw0HXhptkl1Q=
 =njuT
 -----END PGP SIGNATURE-----

Merge tag 'wireless-2026-11-22' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless

Johannes Berg says:

====================
Another set of updates:
 - various small fixes for ath10k/ath12k/mwifiex/rsi
 - cfg80211 fix for HE bitrate overflow
 - mac80211 fixes
   - S1G beacon handling in scan
   - skb tailroom handling for HW encryption
   - CSA fix for multi-link
   - handling of disabled links during association

* tag 'wireless-2026-11-22' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
  wifi: cfg80211: ignore link disabled flag from userspace
  wifi: mac80211: apply advertised TTLM from association response
  wifi: mac80211: parse all TTLM entries
  wifi: mac80211: don't increment crypto_tx_tailroom_needed_cnt twice
  wifi: mac80211: don't perform DA check on S1G beacon
  wifi: ath12k: Fix wrong P2P device link id issue
  wifi: ath12k: fix dead lock while flushing management frames
  wifi: ath12k: Fix scan state stuck in ABORTING after cancel_remain_on_channel
  wifi: ath12k: cancel scan only on active scan vdev
  wifi: mwifiex: Fix a loop in mwifiex_update_ampdu_rxwinsize()
  wifi: mac80211: correctly check if CSA is active
  wifi: cfg80211: Fix bitrate calculation overflow for HE rates
  wifi: rsi: Fix memory corruption due to not set vif driver data size
  wifi: ath12k: don't force radio frequency check in freq_to_idx()
  wifi: ath12k: fix dma_free_coherent() pointer
  wifi: ath10k: fix dma_free_coherent() pointer
====================

Link: https://patch.msgid.link/20260122110248.15450-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-22 07:54:31 -08:00
Tonghao Zhang
11ea9b8a88 net: bonding: use workqueue to make sure peer notify updated in lacp mode
The rtnl lock might be locked, preventing ad_cond_set_peer_notif() from
acquiring the lock and updating send_peer_notif. This patch addresses
the issue by using a workqueue. Since updating send_peer_notif does
not require high real-time performance, such delayed updates are entirely
acceptable.

In fact, checking this value and using it in multiple places, all operations
are protected at the same time by rtnl lock, such as
- read send_peer_notif
- send_peer_notif--
- bond_should_notify_peers

By the way, rtnl lock is still required, when accessing bond.params.* for
updating send_peer_notif. In lacp mode, resetting send_peer_notif in
workqueue is safe, simple and effective way.

Additionally, this patch introduces bond_peer_notify_may_events(), which
is used to check whether an event should be sent. This function will be
used in both patch 1 and 2.

Cc: Jay Vosburgh <jv@jvosburgh.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Andrew Lunn <andrew+netdev@lunn.ch>
Cc: Nikolay Aleksandrov <razor@blackwall.org>
Cc: Hangbin Liu <liuhangbin@gmail.com>
Cc: Jason Xing <kerneljasonxing@gmail.com>
Suggested-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Tonghao Zhang <tonghao@bamaicloud.com>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/f95accb5db0b10ce3ed2f834fc70f716c9abbb9c.1768709239.git.tonghao@bamaicloud.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-22 11:20:33 +01:00
Jakub Kicinski
a7c708dc0d netfilter pull request nf-next-26-01-20
-----BEGIN PGP SIGNATURE-----
 
 iQJdBAABCABHFiEEgKkgxbID4Gn1hq6fcJGo2a1f9gAFAmlvnjQbFIAAAAAABAAO
 bWFudTIsMi41KzEuMTEsMiwyDRxmd0BzdHJsZW4uZGUACgkQcJGo2a1f9gBGtQ//
 fpGuA96XbcQVHDAkOYYAsjkHk4DPJvTdL4m4Pnw5SO3m5lVq0kw5Cp+6drv3q/Pd
 pMTuAUliVDOlK7wYemsThv/DgzqSO93uxrqTeX2J4tb/TgVYA9040aAfvWKo76iE
 GxSqfM255cMOAJ2zpBbgP3WfwiklGBlB7phPDTP3yoxbwH6TdtDCxcpVJ+M8wVN4
 7CsRY3P8ZWZR1lW2V7sHERRABdsVfRmEtlFCEP+WARKHtkTyMZeqRsUtk3f50iRB
 AeEm3Uryj+q5s2Uof+uO8Lu0RxaBezJID4JksbWXEc/bsxaGKroPXx1qUsruwAJP
 1TW+HL2yJx1xIydinoKFSD6PE7as0LeRvwCLFNbOqTrGefpPFX7sIdQNb/qMh1IN
 JpU0O0cwhWPYKjXD8pGcVNqTFs9CABRGSZBRUkSKMhWqwF1Hu0habF4nL70QkCqv
 FuhrelmNY/pDn7X5EQRII7cZMAxEL2lFtv+HERwZH2uZvDdMjE6Fu/+NdGQT4bCe
 d95dlnd1UkpMhI2CPsDKACXb5aqA5apWb7+2F5WcXzlI7XmNcHJksY8OVkjB0r7p
 +6IeBAYLPEUv+PYsR6g7vf6pAHA6I/axkMoK4vFXRnU3POnrVyZh3JHL/CChokWj
 cy8BYZukYSqvOnEoPJRpWkwO8opWDZmT5gIUSqHgKGk=
 =QWao
 -----END PGP SIGNATURE-----

Merge tag 'nf-next-26-01-20' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

Florian Westphal says:

====================
Subject: netfilter: updates for net-next

1) Speed up nftables transactions after earlier transaction failed.
   Due to a (harmeless) bug we remained in slow paranoia mode until
   a successful transaction completes.

2) Allow generic tracker to resolve clashes, this avoids very rare
   packet drops.  From Yuto Hamaguchi.

3) Increase the cleanup budget to 64 entries in nf_conncount to reap
   more entries in one go, from Fernando Fernandez Mancera.

4) Allow icmp trackers to resolve clashes, this avoids very rare
   initial packet drop with test cases that have high-frequency pings.
   After this all trackers except tcp and sctp allow clash resolution.

5) Disentangle netfilter headers, don't include nftables/xtables headers
   in subsystems that are unrelated.

6) Don't rely on implicit includes coming from nf_conntrack_proto_gre.h.

7) Allow nfnetlink_queue nfq instance struct to get accounted via memcg,
   from Scott Mitchell.

8) Reject bogus xt target/match data upfront via netlink policiy in
   nft_compat interface rather than relying on x_tables API to do it.

9) Fix nf_conncount breakage when trying to limit loopback flows via
   prerouting rule, from Fernando Fernandez Mancera.
   This is a recent breakage but not seen as urgent enough to rush this
   via net tree at this late stage in development cycle.

10) Fix a possible off-by-one when parsing tcp option in xtables tcpmss
    match.  Also handled via -next due to late stage in development
    cycle.

* tag 'nf-next-26-01-20' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: xt_tcpmss: check remaining length before reading optlen
  netfilter: nf_conncount: fix tracking of connections from localhost
  netfilter: nft_compat: add more restrictions on netlink attributes
  netfilter: nfnetlink_queue: nfqnl_instance GFP_ATOMIC -> GFP_KERNEL_ACCOUNT allocation
  netfilter: nf_conntrack: don't rely on implicit includes
  netfilter: don't include xt and nftables.h in unrelated subsystems
  netfilter: nf_conntrack: enable icmp clash support
  netfilter: nf_conncount: increase the connection clean up limit to 64
  netfilter: nf_conntrack: Add allow_clash to generic protocol handler
  netfilter: nf_tables: reset table validation state on abort
====================

Link: https://patch.msgid.link/20260120191803.22208-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-21 20:23:12 -08:00
Eric Dumazet
b8d9b7daf0 gro: inline tcp6_gro_complete()
Remove one function call from GRO stack for native IPv6 + TCP packets.

$ scripts/bloat-o-meter -t vmlinux.2 vmlinux.3
add/remove: 0/0 grow/shrink: 1/1 up/down: 298/-5 (293)
Function                                     old     new   delta
ipv6_gro_complete                            435     733    +298
tcp6_gro_complete                            311     306      -5
Total: Before=22593532, After=22593825, chg +0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260120164903.1912995-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-21 19:28:32 -08:00
Eric Dumazet
87737cd76e gro: inline tcp6_gro_receive()
FDO/LTO are unable to inline tcp6_gro_receive() from ipv6_gro_receive()

Make sure tcp6_check_fraglist_gro() is only called only when needed,
so that compiler can leave it out-of-line.

$ scripts/bloat-o-meter -t vmlinux.1 vmlinux.2
add/remove: 2/0 grow/shrink: 3/1 up/down: 1123/-253 (870)
Function                                     old     new   delta
ipv6_gro_receive                            1069    1846    +777
tcp6_check_fraglist_gro                        -     272    +272
ipv6_offload_init                            218     274     +56
__pfx_tcp6_check_fraglist_gro                  -      16     +16
ipv6_gro_complete                            433     435      +2
tcp6_gro_receive                             959     706    -253
Total: Before=22592662, After=22593532, chg +0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260120164903.1912995-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-21 19:28:32 -08:00
Eric Dumazet
a4674aa58b tcp: preserve const qualifier in tcp_rsk() and inet_rsk()
We can change tcp_rsk() and inet_rsk() to propagate their argument
const qualifier thanks to container_of_const().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260120125353.1470456-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-21 19:20:04 -08:00
Eric Dumazet
670ade3bfa tcp: move tcp_rate_skb_delivered() to tcp_input.c
tcp_rate_skb_delivered() is only called from tcp_input.c.
Move it there and make it static.

Both gcc and clang are (auto)inlining it, TCP performance
is increased at a small space cost.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/2 grow/shrink: 3/0 up/down: 509/-187 (322)
Function                                     old     new   delta
tcp_sacktag_walk                            1682    1867    +185
tcp_ack                                     5230    5405    +175
tcp_shifted_skb                              437     586    +149
__pfx_tcp_rate_skb_delivered                  16       -     -16
tcp_rate_skb_delivered                       171       -    -171
Total: Before=22566192, After=22566514, chg +0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20260118123204.2315993-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-20 19:03:09 -08:00
Jakub Kicinski
677a51790b Merge tag 'net-queue-rx-buf-len-v9' of https://github.com/isilence/linux
Pavel Begunkov says:

====================
Add support for providers with large rx buffer

Many modern NICs support configurable receive buffer lengths, and zcrx and
memory providers can use buffers larger than 4K to improve performance.
When paired with hw-gro larger rx buffer sizes can drastically reduce
the number of buffers traversing the stack and save a lot of processing
time. It also allows to give to users larger contiguous chunks of data.

Single stream benchmarks showed up to ~30% CPU util improvement.
E.g. comparison for 4K vs 32K buffers using a 200Gbit NIC:

packets=23987040 (MB=2745098), rps=199559 (MB/s=22837)
CPU    %usr   %nice    %sys %iowait    %irq   %soft   %idle
  0    1.53    0.00   27.78    2.72    1.31   66.45    0.22
packets=24078368 (MB=2755550), rps=200319 (MB/s=22924)
CPU    %usr   %nice    %sys %iowait    %irq   %soft   %idle
  0    0.69    0.00    8.26   31.65    1.83   57.00    0.57

This series adds net infrastructure for memory providers configuring
the size and implements it for bnxt. It's an opt-in feature for drivers,
they should advertise support for the parameter in the qops and must check
if the hardware supports the given size. It's limited to memory providers
as it drastically simplifies implementation. It doesn't affect the fast
path zcrx uAPI, and the user exposed parameter is defined in zcrx terms,
which allows it to be flexible and adjusted in the future.

A liburing example can be found at [2]

full branch:
[1] https://github.com/isilence/linux.git zcrx/large-buffers-v8
Liburing example:
[2] https://github.com/isilence/liburing.git zcrx/rx-buf-len

* tag 'net-queue-rx-buf-len-v9' of https://github.com/isilence/linux:
  io_uring/zcrx: document area chunking parameter
  selftests: iou-zcrx: test large chunk sizes
  eth: bnxt: support qcfg provided rx page size
  eth: bnxt: adjust the fill level of agg queues with larger buffers
  eth: bnxt: store rx buffer size per queue
  net: pass queue rx page size from memory provider
  net: add bare bone queue configs
  net: reduce indent of struct netdev_queue_mgmt_ops members
  net: memzero mp params when closing a queue
====================

Link: https://patch.msgid.link/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-20 18:10:04 -08:00
Jakub Kicinski
8766d61a1d Revert "Merge branch 'netkit-support-for-io_uring-zero-copy-and-af_xdp'"
This reverts commit 77b9c4a438, reversing
changes made to 4515ec4ad58a37e70a9e1256c0b993958c9b7497:

 931420a2fc ("selftests/net: Add netkit container tests")
 ab771c938d ("selftests/net: Make NetDrvContEnv support queue leasing")
 6be87fbb27 ("selftests/net: Add env for container based tests")
 61d99ce3df ("selftests/net: Add bpf skb forwarding program")
 920da36341 ("netkit: Add xsk support for af_xdp applications")
 eef51113f8 ("netkit: Add netkit notifier to check for unregistering devices")
 b5ef109d22 ("netkit: Implement rtnl_link_ops->alloc and ndo_queue_create")
 b5c3fa4a0b ("netkit: Add single device mode for netkit")
 0073d2fd67 ("xsk: Proxy pool management for leased queues")
 1ecea95dd3 ("xsk: Extend xsk_rcv_check validation")
 804bf334d0 ("net: Proxy netdev_queue_get_dma_dev for leased queues")
 0caa9a8dde ("net: Proxy net_mp_{open,close}_rxq for leased queues")
 ff8889ff91 ("net, ethtool: Disallow leased real rxqs to be resized")
 9e2103f361 ("net: Add lease info to queue-get response")
 31127dedde ("net: Implement netdev_nl_queue_create_doit")
 a5546e18f7 ("net: Add queue-create operation")

The series will conflict with io_uring work, and the code needs more
polish.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-20 18:06:01 -08:00
Florian Westphal
d00453b6e3 netfilter: nf_conntrack: don't rely on implicit includes
several netfilter compilation units rely on implicit includes
coming from nf_conntrack_proto_gre.h.

Clean this up and add the required dependencies where needed.

nf_conntrack.h requires net_generic() helper.
Place various gre/ppp/vlan includes to where they are needed.

Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-20 16:23:37 +01:00
Florian Westphal
910d271227 netfilter: don't include xt and nftables.h in unrelated subsystems
conntrack, xtables and nftables are distinct subsystems, don't use them
in other subystems.

Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-20 16:23:37 +01:00
Fernando Fernandez Mancera
21d033e472 netfilter: nf_conncount: increase the connection clean up limit to 64
After the optimization to only perform one GC per jiffy, a new problem
was introduced. If more than 8 new connections are tracked per jiffy the
list won't be cleaned up fast enough possibly reaching the limit
wrongly.

In order to prevent this issue, only skip the GC if it was already
triggered during the same jiffy and the increment is lower than the
clean up limit. In addition, increase the clean up limit to 64
connections to avoid triggering GC too often and do more effective GCs.

This has been tested using a HTTP server and several
performance tools while having nft_connlimit/xt_connlimit or OVS limit
configured.

Output of slowhttptest + OVS limit at 52000 connections:

 slow HTTP test status on 340th second:
 initializing:        0
 pending:             432
 connected:           51998
 error:               0
 closed:              0
 service available:   YES

Fixes: d265929930 ("netfilter: nf_conncount: reduce unnecessary GC")
Reported-by: Aleksandra Rukomoinikova <ARukomoinikova@k2.cloud>
Closes: https://lore.kernel.org/netfilter/b2064e7b-0776-4e14-adb6-c68080987471@k2.cloud/
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-20 16:23:37 +01:00
David Wei
0caa9a8dde net: Proxy net_mp_{open,close}_rxq for leased queues
When a process in a container wants to setup a memory provider, it will
use the virtual netdev and a leased rxq, and call net_mp_{open,close}_rxq
to try and restart the queue. At this point, proxy the queue restart on
the real rxq in the physical netdev.

For memory providers (io_uring zero-copy rx and devmem), it causes the
real rxq in the physical netdev to be filled from a memory provider that
has DMA mapped memory from a process within a container.

Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260115082603.219152-6-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-20 11:58:49 +01:00
Daniel Borkmann
9e2103f361 net: Add lease info to queue-get response
Populate nested lease info to the queue-get response that returns the
ifindex, queue id with type and optionally netns id if the device
resides in a different netns.

Example with ynl client:

  # ip a
  [...]
  4: enp10s0f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp/id:24 qdisc mq state UP group default qlen 1000
    link/ether e8:eb:d3:a3:43:f6 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.2/24 scope global enp10s0f0np0
       valid_lft forever preferred_lft forever
    inet6 fe80::eaeb:d3ff:fea3:43f6/64 scope link proto kernel_ll
       valid_lft forever preferred_lft forever
  [...]

  # ethtool -i enp10s0f0np0
  driver: mlx5_core
  [...]

  # ./pyynl/cli.py \
      --spec ~/netlink/specs/netdev.yaml \
      --do queue-get \
      --json '{"ifindex": 4, "id": 15, "type": "rx"}'
  {'id': 15,
   'ifindex': 4,
   'lease': {'ifindex': 8, 'netns-id': 0, 'queue': {'id': 1, 'type': 'rx'}},
   'napi-id': 8227,
   'type': 'rx',
   'xsk': {}}

  # ip netns list
  foo (id: 0)

  # ip netns exec foo ip a
  [...]
  8: nk@NONE: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
      link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
      inet6 fe80::200:ff:fe00:0/64 scope link proto kernel_ll
         valid_lft forever preferred_lft forever
  [...]

  # ip netns exec foo ethtool -i nk
  driver: netkit
  [...]

  # ip netns exec foo ls /sys/class/net/nk/queues/
  rx-0  rx-1  tx-0

  # ip netns exec foo ./pyynl/cli.py \
      --spec ~/netlink/specs/netdev.yaml \
      --do queue-get \
      --json '{"ifindex": 8, "id": 1, "type": "rx"}'
  {'id': 1, 'ifindex': 8, 'type': 'rx'}

Note that the caller of netdev_nl_queue_fill_one() holds the netdevice
lock. For the queue-get we do not lock both devices. When queues get
{un,}leased, both devices are locked, thus if __netif_get_rx_queue_peer()
returns true, the peer pointer points to a valid device. The netns-id
is fetched via peernet2id_alloc() similarly as done in OVS.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260115082603.219152-4-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-20 11:58:49 +01:00
Daniel Borkmann
31127dedde net: Implement netdev_nl_queue_create_doit
Implement netdev_nl_queue_create_doit which creates a new rx queue in a
virtual netdev and then leases it to a rx queue in a physical netdev.

Example with ynl client:

  # ./pyynl/cli.py \
      --spec ~/netlink/specs/netdev.yaml \
      --do queue-create \
      --json '{"ifindex": 8, "type": "rx", "lease": {"ifindex": 4, "queue": {"type": "rx", "id": 15}}}'
  {'id': 1}

Note that the netdevice locking order is always from the virtual to
the physical device.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260115082603.219152-3-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-20 11:58:49 +01:00
Benjamin Berg
50b359896f wifi: cfg80211: ignore link disabled flag from userspace
When the AP has an advertised TID to Link Mapping (TTLM) it shall
include the element in the association response. As such, when this
element is present it needs to be used for the currently dormant links.
See Draft P802.11REVmf_D1.0 section 35.3.7.2.3 ("Negotiation of TTLM")
for the details. The flag is also not usable in case userspace wants to
specify a negotiated TTLM during association.

Note that for the link reconfiguration case, mac80211 did not use the
information. Draft P802.11REVmf_D1.0 states in section 35.3.6.4 ("Link
reconfiguration to the setup links) that we "shall operate with all the
TIDs mapped to the newly added links ..."

All this means that the flag is not needed. The implementation should
parse the information from the association response.

Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260118093904.754e057896a5.Ifd06f5ef839a93bfd54d0593dc932870f95f3242@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-01-20 10:02:01 +01:00
Eric Dumazet
03e9d91dd6 ipv6: annotate data-races in ip6_multipath_hash_{policy,fields}()
Add missing READ_ONCE() when reading sysctl values.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260115094141.3124990-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-19 09:56:42 -08:00
Eric Dumazet
3681282530 ipv6: annotate date-race in ipv6_can_nonlocal_bind()
Add a missing READ_ONCE(), and add const qualifiers to the two parameters.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260115094141.3124990-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-19 09:56:42 -08:00
Eric Dumazet
ded139b59b ipv6: annotate data-races from ip6_make_flowlabel()
Use READ_ONCE() to read sysctl values in ip6_make_flowlabel()
and ip6_make_flowlabel()

Add a const qualifier to 'struct net' parameters.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260115094141.3124990-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-19 09:56:42 -08:00
Eric Dumazet
e82a347d92 ipv6: add sysctl_ipv6_flowlabel group
Group together following struct netns_sysctl_ipv6 fields:

- flowlabel_consistency
- auto_flowlabels
- flowlabel_state_ranges

After this patch, ip6_make_flowlabel() uses a single cache line to fetch
auto_flowlabels and flowlabel_state_ranges (instead of two before the patch).

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260115094141.3124990-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-19 09:56:42 -08:00
Eric Dumazet
f10ab9d3a7 tcp: move tcp_rate_skb_sent() to tcp_output.c
It is only called from __tcp_transmit_skb() and __tcp_retransmit_skb().

Move it in tcp_output.c and make it static.

clang compiler is now able to inline it from __tcp_transmit_skb().

gcc compiler inlines it in the two callers, which is also fine.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20260114165109.1747722-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-17 15:43:16 -08:00
Jakub Kicinski
c27022497d Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.19-rc6).

No conflicts, or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-15 18:02:48 -08:00
Shahar Shitrit
cfbc8b6bab net: Introduce netif_xmit_timeout_ms() helper
Introduce a new helper function netif_xmit_timeout_ms() to check
if a TX queue is stopped and has timed out and report the timeout
duration. This makes the timeout logic reusable, and will be used
in several places in subsequent patches.

Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Yael Chemla <ychemla@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1768209383-1546791-2-git-send-email-tariqt@nvidia.com
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-15 11:55:05 +01:00
Jason Xing
a2cb2e23b2 xsk: move cq_cached_prod_lock to avoid touching a cacheline in sending path
We (Paolo and I) noticed that in the sending path touching an extra
cacheline due to cq_cached_prod_lock will impact the performance. After
moving the lock from struct xsk_buff_pool to struct xsk_queue, the
performance is increased by ~5% which can be observed by xdpsock.

An alternative approach [1] can be using atomic_try_cmpxchg() to have the
same effect. But unfortunately I don't have evident performance numbers to
prove the atomic approach is better than the current patch. The advantage
is to save the contention time among multiple xsks sharing the same pool
while the disadvantage is losing good maintenance. The full discussion can
be found at the following link.

[1]: https://lore.kernel.org/all/20251128134601.54678-1-kerneljasonxing@gmail.com/

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Link: https://patch.msgid.link/20260104012125.44003-3-kerneljasonxing@gmail.com
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-15 10:07:45 +01:00
Kavita Kavita
63e7e3b643 wifi: mac80211: allow key installation before association
Currently, mac80211 allows key installation only after association
completes. However, Enhanced Privacy Protection Key Exchange (EPPKE)
requires key installation before association to enable encryption and
decryption of (Re)Association Request and Response frames.

Add support to install keys prior to association when the peer is an
Enhanced Privacy Protection (EPP) peer that requires encryption and
decryption of (Re)Association Request and Response frames.

Introduce a new boolean parameter "epp_peer" in the "ieee80211_sta"
profile to indicate that the peer supports the Enhanced Privacy
Protection Key Exchange (EPPKE) protocol. For non-AP STA mode, it
is set when the authentication algorithm is WLAN_AUTH_EPPKE during
station profile initialization. For AP mode, it is set during
NL80211_CMD_NEW_STA and NL80211_CMD_ADD_LINK_STA.

When "epp_peer" parameter is set, mac80211 now accepts keys before
association and enables encryption of the (Re)Association
Request/Response frames.

Co-developed-by: Sai Pratyusha Magam <sai.magam@oss.qualcomm.com>
Signed-off-by: Sai Pratyusha Magam <sai.magam@oss.qualcomm.com>
Signed-off-by: Kavita Kavita <kavita.kavita@oss.qualcomm.com>
Link: https://patch.msgid.link/20260114111900.2196941-6-kavita.kavita@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-01-14 14:34:16 +01:00
Sai Pratyusha Magam
6ee3a22c61 wifi: nl80211: Add support for EPP peer indication
Introduce a new netlink attribute NL80211_ATTR_EPP_PEER
to be used with NL80211_CMD_NEW_STA and
NL80211_CMD_ADD_LINK_STA for the userspace to indicate
that a non-AP STA is an Enhanced Privacy Protection (EPP)
peer.

Co-developed-by: Rohan Dutta <quic_drohan@quicinc.com>
Signed-off-by: Rohan Dutta <quic_drohan@quicinc.com>
Signed-off-by: Sai Pratyusha Magam <sai.magam@oss.qualcomm.com>
Signed-off-by: Kavita Kavita <kavita.kavita@oss.qualcomm.com>
Link: https://patch.msgid.link/20260114111900.2196941-5-kavita.kavita@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-01-14 14:34:16 +01:00
Dipayaan Roy
3b194343c2 net: mana: Implement ndo_tx_timeout and serialize queue resets per port.
Implement .ndo_tx_timeout for MANA so any stalled TX queue can be detected
and a device-controlled port reset for all queues can be scheduled to a
ordered workqueue. The reset for all queues on stall detection is
recomended by hardware team.

Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Link: https://patch.msgid.link/20260112130552.GA11785@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13 19:14:36 -08:00
Pavel Begunkov
c0b709bf43 net: pass queue rx page size from memory provider
Allow memory providers to configure rx queues with a custom receive
page size. It's passed in struct pp_memory_provider_params, which is
copied into the queue, so it's preserved across queue restarts. Then,
it's propagated to the driver in a new queue config parameter.

Drivers should explicitly opt into using it by setting
QCFG_RX_PAGE_SIZE, in which case they should implement ndo_default_qcfg,
validate the size on queue restart and honour the current config in case
of a reset.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
2026-01-14 02:13:36 +00:00
Pavel Begunkov
efcb9a4d32 net: add bare bone queue configs
We'll need to pass extra parameters when allocating a queue for memory
providers. Define a new structure for queue configurations, and pass it
to qapi callbacks. It's empty for now, actual parameters will be added
in following patches.

Configurations should persist across resets, and for that they're
default-initialised on device registration and stored in struct
netdev_rx_queue. We also add a new qapi callback for defaulting a given
config. It must be implemented if a driver wants to use queue configs
and is optional otherwise.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
2026-01-14 02:13:36 +00:00
Jakub Kicinski
92d76cf96d net: reduce indent of struct netdev_queue_mgmt_ops members
Trivial change, reduce the indent. I think the original is copied
from real NDOs. It's unnecessarily deep, makes passing struct args
problematic.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
2026-01-14 02:13:36 +00:00
Toke Høiland-Jørgensen
8b27fd66f5 net/sched: Export mq functions for reuse
To enable the cake_mq qdisc to reuse code from the mq qdisc, export a
bunch of functions from sch_mq. Split common functionality out from some
functions so it can be composed with other code, and export other
functions wholesale. To discourage wanton reuse, put the symbols into a
new NET_SCHED_INTERNAL namespace, and a sch_priv.h header file.

No functional change intended.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20260109-mq-cake-sub-qdisc-v8-1-8d613fece5d8@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-13 11:54:29 +01:00
Eric Dumazet
ffe4ccd359 net: add net.core.qdisc_max_burst
In blamed commit, I added a check against the temporary queue
built in __dev_xmit_skb(). Idea was to drop packets early,
before any spinlock was acquired.

if (unlikely(defer_count > READ_ONCE(q->limit))) {
	kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP);
	return NET_XMIT_DROP;
}

It turned out that HTB Qdisc has a zero q->limit.
HTB limits packets on a per-class basis.
Some of our tests became flaky.

Add a new sysctl : net.core.qdisc_max_burst to control
how many packets can be stored in the temporary lockless queue.

Also add a new QDISC_BURST_DROP drop reason to better diagnose
future issues.

Thanks Neal !

Fixes: 100dfa74ca ("net: dev_queue_xmit() llist adoption")
Reported-and-bisected-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20260107104159.3669285-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-13 10:12:11 +01:00
Heiner Kallweit
c4277d21ab net: phy: realtek: add dummy PHY driver for RTL8127ATF
RTL8127ATF supports a SFP+ port for fiber modules (10GBASE-SR/LR/ER/ZR and
DAC). The list of supported modes was provided by Realtek. According to the
r8127 vendor driver also 1G modules are supported, but this needs some more
complexity in the driver, and only 10G mode has been tested so far.
Therefore mainline support will be limited to 10G for now.
The SFP port signals are hidden in the chip IP and driven by firmware.
Therefore mainline SFP support can't be used here.
This PHY driver is used by the RTL8127ATF support in r8169.
RTL8127ATF reports the same PHY ID as the TP version. Therefore use a dummy
PHY ID.  This PHY driver is used by the RTL8127ATF support in r8169.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/e3d55162-210a-4fab-9abf-99c6954eee10@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-12 19:29:11 -08:00
Jakub Kicinski
669aa3e3fa First set of changes for the current -next cycle, of note:
- ath12k gets an overhaul to support multi-wiphy device
    wiphy and pave the way for future device support in the
    same driver (rather than splitting to ath13k)
  - mac80211 gets some better iteration macros
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmllQ8YACgkQ10qiO8sP
 aAA5Eg/+Nk1NNSP0+69YdXZECE+cGPoXVUsLZSEalgd7SZ8jKkCfVVshGAhimFp0
 qMcRgkjGsQkVrOlKcCdX9GkQTQnI2HIvxfqXaE0pB39KelB3lKnBycD47FuC6OVE
 mjNZYpUhNs9wuKs8uP+BRO9L/41C/5uE8hyTR3cf2bqkhx+FAasdYZaVFenpaOJ2
 lmH5GkeAjxEYfyYk/7I70ixtIZ4oDISj99W97rfqSiTQx7VEOD8NdWZirUpLbpt1
 UDR+rCapJQ1wl9p8riSE09hzJALKKVI9YHIDWvfTI81pO+Xt1eyf1wWu0ewz2t6P
 5tLk4LChFZeUqV6oJqTyRYWVlSQ8d9wVFNjPF0dJCiqmh48oz4BCFCWE5dJHgh8Q
 LN8LErrjTBBbgldwbQm7HRHb7llt0MRCmW2qJKKSU4aIBbEC/1sFu0w6smNIPCst
 GJ+fCBYugFAHZ0cft468FW/TSs49zlpGZShRcy22/Ll2iuGp1+TA5nmMAcfF4kdf
 GoPIO2c4gdnHUAp046czquU4KnbtzI0AWLdti7jAHSdVNv1+u3uTgudKVh+PGqIj
 BjOfJvUYCDqK9vkLyZXA0iMfKOdKjSOM+IWT0elYjr4MUy/E9fKK28AmXsze4IzB
 iNMHowJFlQUEi+/26NxO1+8kZpf1x05rxDGcszvECoDiow2TWJg=
 =F1n0
 -----END PGP SIGNATURE-----

Merge tag 'wireless-next-2026-01-12' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Johannes Berg says:

====================
First set of changes for the current -next cycle, of note:

 - ath12k gets an overhaul to support multi-wiphy device
   wiphy and pave the way for future device support in
   the same driver (rather than splitting to ath13k)

 - mac80211 gets some better iteration macros

* tag 'wireless-next-2026-01-12' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (120 commits)
  wifi: mac80211: remove width argument from ieee80211_parse_bitrates
  wifi: mac80211_hwsim: remove NAN by default
  wifi: mac80211: improve station iteration ergonomics
  wifi: mac80211: improve interface iteration ergonomics
  wifi: cfg80211: include S1G_NO_PRIMARY flag when sending channel
  wifi: mac80211: unexport ieee80211_get_bssid()
  wl1251: Replace strncpy with strscpy in wl1251_acx_fw_version
  wifi: iwlegacy: 3945-rs: remove redundant pointer check in il3945_rs_tx_status() and il3945_rs_get_rate()
  wifi: mac80211: don't send an unused argument to ieee80211_check_combinations
  wifi: libertas: fix WARNING in usb_tx_block
  wifi: mwifiex: Allocate dev name earlier for interface workqueue name
  wifi: wlcore: sdio: Use pm_ptr instead of #ifdef CONFIG_PM
  wifi: cfg80211: Fix use_for flag update on BSS refresh
  wifi: brcmfmac: rename function that frees vif
  wifi: brcmfmac: fix/add kernel-doc comments
  wifi: mac80211: Update csa_finalize to use link_id
  wifi: cfg80211: add cfg80211_stop_link() for per-link teardown
  wifi: ath12k: Skip DP peer creation for scan vdev
  wifi: ath12k: move firmware stats request outside of atomic context
  wifi: ath12k: add the missing RCU lock in ath12k_dp_tx_free_txbuf()
  ...
====================

Link: https://patch.msgid.link/20260112185836.378736-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-12 17:02:02 -08:00
Johannes Berg
f813117f20 wifi: mac80211: improve station iteration ergonomics
Right now, the only way to iterate stations is to declare an
iterator function, possibly data structure to use, and pass all
that to the iteration helper function. This is annoying, and
there's really no inherent need for it.

Add a new for_each_station() macro that does the iteration in
a more ergonomic way. To avoid even more exported functions, do
the old ieee80211_iterate_stations_mtx() as an inline using the
new way, which may also let the compiler optimise it a bit more,
e.g. via inlining the iterator function.

Link: https://patch.msgid.link/20260108143431.d2b641f6f6af.I4470024f7404446052564b15bcf8b3f1ada33655@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-01-12 19:48:17 +01:00
Johannes Berg
6b3bafa2bd wifi: mac80211: improve interface iteration ergonomics
Right now, the only way to iterate interfaces is to declare an
iterator function, possibly data structure to use, and pass all
that to the iteration helper function. This is annoying, and
there's really no inherent need for it, except it was easier to
implement with the iflist mutex, but that's not used much now.

Add a new for_each_interface() macro that does the iteration in
a more ergonomic way. To avoid even more exported functions, do
the old ieee80211_iterate_active_interfaces_mtx() as an inline
using the new way, which may also let the compiler optimise it
a bit more, e.g. via inlining the iterator function.

Also provide for_each_active_interface() for the common case of
just iterating active interfaces.

Link: https://patch.msgid.link/20260108143431.f2581e0c381a.Ie387227504c975c109c125b3c57f0bb3fdab2835@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-01-12 19:48:17 +01:00
Eric Dumazet
872ac785e7 ipv4: ip_tunnel: spread netdev_lockdep_set_classes()
Inspired by yet another syzbot report.

IPv6 tunnels call netdev_lockdep_set_classes() for each tunnel type,
while IPv4 currently centralizes netdev_lockdep_set_classes() call from
ip_tunnel_init().

Make ip_tunnel_init() a macro, so that we have different lockdep
classes per tunnel type.

Fixes: 0bef512012 ("net: add netdev_lockdep_set_classes() to virtual drivers")
Reported-by: syzbot+1240b33467289f5ab50b@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/695d439f.050a0220.1c677c.0347.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260106172426.1760721-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-08 18:02:35 -08:00
Manish Dharanenthiran
dc4b176cce wifi: cfg80211: add cfg80211_stop_link() for per-link teardown
Currently, whenever cfg80211_stop_iface() is called, the entire iface
is stopped. However, there could be a need in AP/P2P_GO mode, where
one would like to stop a single link in MLO operation instead of the
whole MLD interface.

Hence, introduce cfg80211_stop_link() to allow drivers to tear down
only a specified AP/P2P_GO link during MLO operation. Passing -1
preserves the existing behavior of stopping the whole interface. Make
cfg80211_stop_iface() call this function by passing -1 to keep the
default behavior the same, that is, to stop all links and use
cfg80211_stop_link() with the desired link_id for AP/P2P_GO mode, to
stop only that link.

This brings no behavioral change for single-link/non-MLO interfaces,
and enables drivers to stop an AP/P2P_GO link without disrupting other
links on the same interface.

Signed-off-by: Manish Dharanenthiran <manish.dharanenthiran@oss.qualcomm.com>
Link: https://patch.msgid.link/20251127-stop_link-v2-1-43745846c5fd@qti.qualcomm.com
[make cfg80211_stop_iface() inline]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-01-08 13:11:01 +01:00
Daniel Sedlak
55ffb0b14a tcp: clarify tcp_congestion_ops functions comments
The optional and required hints in the tcp_congestion_ops are information
for the user of this interface to signalize its importance when
implementing these functions.

However, cong_avoid comment incorrectly tells that it is required,
in reality congestion control must provide one of either cong_avoid or
cong_control.

In addition, min_tso_segs has not had any comment optional/required
hints. So mark it as optional since it is used only in BBR.

Co-developed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Daniel Sedlak <daniel.sedlak@cdn77.com>
Link: https://patch.msgid.link/20260105115533.1151442-1-daniel.sedlak@cdn77.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-06 17:35:17 -08:00
Eric Dumazet
e9cd04b281 udp: udplite is unlikely
Add some unlikely() annotations to speed up the fast path,
at least with clang compiler.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260105101719.2378881-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-06 17:06:03 -08:00
Gustavo A. R. Silva
c86af46b9c ipv4/inet_sock.h: Avoid thousands of -Wflex-array-member-not-at-end warnings
Use DEFINE_RAW_FLEX() to avoid thousands of -Wflex-array-member-not-at-end
warnings.

Remove struct ip_options_data, and adjust the rest of the code so that
flexible-array member struct ip_options_rcu::opt.__data[] ends last
in struct icmp_bxm.

Compensate for this by using the DEFINE_RAW_FLEX() helper to define each
on-stack struct instance that contained struct ip_options_data as a member,
and to define struct ip_options_rcu with a fixed on-stack size for its
nested flexible-array member opt.__data[].

Also, add a couple of code comments to prevent people from adding members
to a struct after another member that contains a flexible array.

With these changes, fix 2600 warnings of the following type:

include/net/inet_sock.h:65:33: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://patch.msgid.link/aVteBadWA6AbTp7X@kspp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-06 17:02:52 -08:00
Joel Granados
f7386f545e sysctl: Remove unused ctl_table forward declarations
Remove superfluous forward declarations of ctl_table from header files
where they are no longer needed. These declarations were left behind
after sysctl code refactoring and cleanup.

Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Muchun Song <muchun.song@linux.dev>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
2026-01-05 13:54:41 +01:00
Vladimir Oltean
06e219f6a7 net: dsa: properly keep track of conduit reference
Problem description
-------------------

DSA has a mumbo-jumbo of reference handling of the conduit net device
and its kobject which, sadly, is just wrong and doesn't make sense.

There are two distinct problems.

1. The OF path, which uses of_find_net_device_by_node(), never releases
   the elevated refcount on the conduit's kobject. Nominally, the OF and
   non-OF paths should result in objects having identical reference
   counts taken, and it is already suspicious that
   dsa_dev_to_net_device() has a put_device() call which is missing in
   dsa_port_parse_of(), but we can actually even verify that an issue
   exists. With CONFIG_DEBUG_KOBJECT_RELEASE=y, if we run this command
   "before" and "after" applying this patch:

(unbind the conduit driver for net device eno2)
echo 0000:00:00.2 > /sys/bus/pci/drivers/fsl_enetc/unbind

we see these lines in the output diff which appear only with the patch
applied:

kobject: 'eno2' (ffff002009a3a6b8): kobject_release, parent 0000000000000000 (delayed 1000)
kobject: '109' (ffff0020099d59a0): kobject_release, parent 0000000000000000 (delayed 1000)

2. After we find the conduit interface one way (OF) or another (non-OF),
   it can get unregistered at any time, and DSA remains with a long-lived,
   but in this case stale, cpu_dp->conduit pointer. Holding the net
   device's underlying kobject isn't actually of much help, it just
   prevents it from being freed (but we never need that kobject
   directly). What helps us to prevent the net device from being
   unregistered is the parallel netdev reference mechanism (dev_hold()
   and dev_put()).

Actually we actually use that netdev tracker mechanism implicitly on
user ports since commit 2f1e8ea726 ("net: dsa: link interfaces with
the DSA master to get rid of lockdep warnings"), via netdev_upper_dev_link().
But time still passes at DSA switch probe time between the initial
of_find_net_device_by_node() code and the user port creation time, time
during which the conduit could unregister itself and DSA wouldn't know
about it.

So we have to run of_find_net_device_by_node() under rtnl_lock() to
prevent that from happening, and release the lock only with the netdev
tracker having acquired the reference.

Do we need to keep the reference until dsa_unregister_switch() /
dsa_switch_shutdown()?
1: Maybe yes. A switch device will still be registered even if all user
   ports failed to probe, see commit 86f8b1c01a ("net: dsa: Do not
   make user port errors fatal"), and the cpu_dp->conduit pointers
   remain valid.  I haven't audited all call paths to see whether they
   will actually use the conduit in lack of any user port, but if they
   do, it seems safer to not rely on user ports for that reference.
2. Definitely yes. We support changing the conduit which a user port is
   associated to, and we can get into a situation where we've moved all
   user ports away from a conduit, thus no longer hold any reference to
   it via the net device tracker. But we shouldn't let it go nonetheless
   - see the next change in relation to dsa_tree_find_first_conduit()
   and LAG conduits which disappear.
   We have to be prepared to return to the physical conduit, so the CPU
   port must explicitly keep another reference to it. This is also to
   say: the user ports and their CPU ports may not always keep a
   reference to the same conduit net device, and both are needed.

As for the conduit's kobject for the /sys/class/net/ entry, we don't
care about it, we can release it as soon as we hold the net device
object itself.

History and blame attribution
-----------------------------

The code has been refactored so many times, it is very difficult to
follow and properly attribute a blame, but I'll try to make a short
history which I hope to be correct.

We have two distinct probing paths:
- one for OF, introduced in 2016 in commit 83c0afaec7 ("net: dsa: Add
  new binding implementation")
- one for non-OF, introduced in 2017 in commit 71e0bbde0d ("net: dsa:
  Add support for platform data")

These are both complete rewrites of the original probing paths (which
used struct dsa_switch_driver and other weird stuff, instead of regular
devices on their respective buses for register access, like MDIO, SPI,
I2C etc):
- one for OF, introduced in 2013 in commit 5e95329b70 ("dsa: add
  device tree bindings to register DSA switches")
- one for non-OF, introduced in 2008 in commit 91da11f870 ("net:
  Distributed Switch Architecture protocol support")

except for tiny bits and pieces like dsa_dev_to_net_device() which were
seemingly carried over since the original commit, and used to this day.

The point is that the original probing paths received a fix in 2015 in
the form of commit 679fb46c57 ("net: dsa: Add missing master netdev
dev_put() calls"), but the fix never made it into the "new" (dsa2)
probing paths that can still be traced to today, and the fixed probing
path was later deleted in 2019 in commit 93e86b3bc8 ("net: dsa: Remove
legacy probing support").

That is to say, the new probing paths were never quite correct in this
area.

The existence of the legacy probing support which was deleted in 2019
explains why dsa_dev_to_net_device() returns a conduit with elevated
refcount (because it was supposed to be released during
dsa_remove_dst()). After the removal of the legacy code, the only user
of dsa_dev_to_net_device() calls dev_put(conduit) immediately after this
function returns. This pattern makes no sense today, and can only be
interpreted historically to understand why dev_hold() was there in the
first place.

Change details
--------------

Today we have a better netdev tracking infrastructure which we should
use. Logically netdev_hold() belongs in common code
(dsa_port_parse_cpu(), where dp->conduit is assigned), but there is a
tradeoff to be made with the rtnl_lock() section which would become a
bit too long if we did that - dsa_port_parse_cpu() also calls
request_module(). So we duplicate a bit of logic in order for the
callers of dsa_port_parse_cpu() to be the ones responsible of holding
the conduit reference and releasing it on error. This shortens the
rtnl_lock() section significantly.

In the dsa_switch_probe() error path, dsa_switch_release_ports() will be
called in a number of situations, one being where dsa_port_parse_cpu()
maybe didn't get the chance to run at all (a different port failed
earlier, etc). So we have to test for the conduit being NULL prior to
calling netdev_put().

There have still been so many transformations to the code since the
blamed commits (rename master -> conduit, commit 0650bf52b3 ("net:
dsa: be compatible with masters which unregister on shutdown")), that it
only makes sense to fix the code using the best methods available today
and see how it can be backported to stable later. I suspect the fix
cannot even be backported to kernels which lack dsa_switch_shutdown(),
and I suspect this is also maybe why the long-lived conduit reference
didn't make it into the new DSA probing paths at the time (problems
during shutdown).

Because dsa_dev_to_net_device() has a single call site and has to be
changed anyway, the logic was just absorbed into the non-OF
dsa_port_parse().

Tested on the ocelot/felix switch and on dsa_loop, both on the NXP
LS1028A with CONFIG_DEBUG_KOBJECT_RELEASE=y.

Reported-by: Ma Ke <make24@iscas.ac.cn>
Closes: https://lore.kernel.org/netdev/20251214131204.4684-1-make24@iscas.ac.cn/
Fixes: 83c0afaec7 ("net: dsa: Add new binding implementation")
Fixes: 71e0bbde0d ("net: dsa: Add support for platform data")
Reviewed-by: Jonas Gorski <jonas.gorski@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20251215150236.3931670-1-vladimir.oltean@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-12-23 10:32:08 +01:00
Linus Torvalds
7b8e9264f5 Including fixes from netfilter and CAN.
Current release - regressions:
 
   - netfilter: nf_conncount: fix leaked ct in error paths
 
   - sched: act_mirred: fix loop detection
 
   - sctp: fix potential deadlock in sctp_clone_sock()
 
   - can: fix build dependency
 
   - eth: mlx5e: do not update BQL of old txqs during channel reconfiguration
 
 Previous releases - regressions:
 
   - sched: ets: always remove class from active list before deleting it
 
   - inet: frags: flush pending skbs in fqdir_pre_exit()
 
   - netfilter:  nf_nat: remove bogus direction check
 
   - mptcp:
     - schedule rtx timer only after pushing data
     - avoid deadlock on fallback while reinjecting
 
   - can: gs_usb: fix error handling
 
   - eth: mlx5e:
     - avoid unregistering PSP twice
     - fix double unregister of HCA_PORTS component
 
   - eth: bnxt_en: fix XDP_TX path
 
   - eth: mlxsw: fix use-after-free when updating multicast route stats
 
 Previous releases - always broken:
 
   - ethtool: avoid overflowing userspace buffer on stats query
 
   - openvswitch: fix middle attribute validation in push_nsh() action
 
   - eth: mlx5: fw_tracer, validate format string parameters
 
   - eth: mlxsw: spectrum_router: fix neighbour use-after-free
 
   - eth: ipvlan: ignore PACKET_LOOPBACK in handle_mode_l2()
 
 Misc:
 
   - Jozsef Kadlecsik retires from maintaining netfilter
 
   - tools: ynl: fix build on systems with old kernel headers
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmlEPS0SHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkHLEP/3TnVI9qLivOGsZ48bn5UUcIaIlX0wiw
 i5gwpUhbtW3zcLSphO0Nh/CmProme6dFhaMOOkk48bKAdIxRpOMbCC20bfYDLyd0
 ZJTUQqheKpuI23vOnhfs2TTOkcqz6cM7txUq681taQ8Nfo8Dbf0fgOO2HT5ljRD+
 JXlxrdvicZDtve68sSdnsAj5M15EPQGKTPMyPkymBmCKnCMQK9SQKaeTEtaW6NIO
 yM0KqDSNoulDc/6LYMhx8DTtyE7yTiTxwe2NixjdyYljXsk0KiRDirCzxnZuSgh8
 oh+cFLdFDq4mwdYEKjgC5c3ifQyLEpZvwzlY5MKoobsVnT5SbigeSK53l5rEkO3V
 sM84lITHfqTJvlid0AF/ixEc6iWwV7nGRBHh2FXNbfoIKt45eF77jPi9YFsq6Z95
 vlCzYIbY0f2L1y3mPvZbzGQbh2Z12b5kyK8QA1j7SK+zNzxgXbf4+ZtYxHg7O3Ne
 gecmIpTKXMWodaZyfsRQPjR/F6UIlMqsgl9Ci9bfUw+XwL8x+7bJxQAQz+yVjLla
 ng14BItiYKBavcPZBjYlhGKqD1fzGhVZqQecrCkF0VTbMusRd9RcwytU1NG4QGDx
 V5aL28ht85KtMednEWOBkrg+PeXnNyZHzLAf2Xtx3UkaGgiDC8G4IxUv3orlOLD1
 sFPfZnSiGCof
 =6pla
 -----END PGP SIGNATURE-----

Merge tag 'net-6.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from netfilter and CAN.

  Current release - regressions:

   - netfilter: nf_conncount: fix leaked ct in error paths

   - sched: act_mirred: fix loop detection

   - sctp: fix potential deadlock in sctp_clone_sock()

   - can: fix build dependency

   - eth: mlx5e: do not update BQL of old txqs during channel
     reconfiguration

  Previous releases - regressions:

   - sched: ets: always remove class from active list before deleting it

   - inet: frags: flush pending skbs in fqdir_pre_exit()

   - netfilter: nf_nat: remove bogus direction check

   - mptcp:
      - schedule rtx timer only after pushing data
      - avoid deadlock on fallback while reinjecting

   - can: gs_usb: fix error handling

   - eth:
      - mlx5e:
         - avoid unregistering PSP twice
         - fix double unregister of HCA_PORTS component
      - bnxt_en: fix XDP_TX path
      - mlxsw: fix use-after-free when updating multicast route stats

  Previous releases - always broken:

   - ethtool: avoid overflowing userspace buffer on stats query

   - openvswitch: fix middle attribute validation in push_nsh() action

   - eth:
      - mlx5: fw_tracer, validate format string parameters
      - mlxsw: spectrum_router: fix neighbour use-after-free
      - ipvlan: ignore PACKET_LOOPBACK in handle_mode_l2()

  Misc:

   - Jozsef Kadlecsik retires from maintaining netfilter

   - tools: ynl: fix build on systems with old kernel headers"

* tag 'net-6.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (83 commits)
  net: hns3: add VLAN id validation before using
  net: hns3: using the num_tqps to check whether tqp_index is out of range when vf get ring info from mbx
  net: hns3: using the num_tqps in the vf driver to apply for resources
  net: enetc: do not transmit redirected XDP frames when the link is down
  selftests/tc-testing: Test case exercising potential mirred redirect deadlock
  net/sched: act_mirred: fix loop detection
  sctp: Clear inet_opt in sctp_v6_copy_ip_options().
  sctp: Fetch inet6_sk() after setting ->pinet6 in sctp_clone_sock().
  net/handshake: duplicate handshake cancellations leak socket
  net/mlx5e: Don't include PSP in the hard MTU calculations
  net/mlx5e: Do not update BQL of old txqs during channel reconfiguration
  net/mlx5e: Trigger neighbor resolution for unresolved destinations
  net/mlx5e: Use ip6_dst_lookup instead of ipv6_dst_lookup_flow for MAC init
  net/mlx5: Serialize firmware reset with devlink
  net/mlx5: fw_tracer, Handle escaped percent properly
  net/mlx5: fw_tracer, Validate format string parameters
  net/mlx5: Drain firmware reset in shutdown callback
  net/mlx5: fw reset, clear reset requested on drain_fw_reset
  net: dsa: mxl-gsw1xx: manually clear RANEG bit
  net: dsa: mxl-gsw1xx: fix .shutdown driver operation
  ...
2025-12-19 07:55:35 +12:00
Florian Westphal
8e1a1bc4f5 netfilter: nf_tables: avoid chain re-validation if possible
Hamza Mahfooz reports cpu soft lock-ups in
nft_chain_validate():

 watchdog: BUG: soft lockup - CPU#1 stuck for 27s! [iptables-nft-re:37547]
[..]
 RIP: 0010:nft_chain_validate+0xcb/0x110 [nf_tables]
[..]
  nft_immediate_validate+0x36/0x50 [nf_tables]
  nft_chain_validate+0xc9/0x110 [nf_tables]
  nft_immediate_validate+0x36/0x50 [nf_tables]
  nft_chain_validate+0xc9/0x110 [nf_tables]
  nft_immediate_validate+0x36/0x50 [nf_tables]
  nft_chain_validate+0xc9/0x110 [nf_tables]
  nft_immediate_validate+0x36/0x50 [nf_tables]
  nft_chain_validate+0xc9/0x110 [nf_tables]
  nft_immediate_validate+0x36/0x50 [nf_tables]
  nft_chain_validate+0xc9/0x110 [nf_tables]
  nft_immediate_validate+0x36/0x50 [nf_tables]
  nft_chain_validate+0xc9/0x110 [nf_tables]
  nft_table_validate+0x6b/0xb0 [nf_tables]
  nf_tables_validate+0x8b/0xa0 [nf_tables]
  nf_tables_commit+0x1df/0x1eb0 [nf_tables]
[..]

Currently nf_tables will traverse the entire table (chain graph), starting
from the entry points (base chains), exploring all possible paths
(chain jumps).  But there are cases where we could avoid revalidation.

Consider:
1  input -> j2 -> j3
2  input -> j2 -> j3
3  input -> j1 -> j2 -> j3

Then the second rule does not need to revalidate j2, and, by extension j3,
because this was already checked during validation of the first rule.
We need to validate it only for rule 3.

This is needed because chain loop detection also ensures we do not exceed
the jump stack: Just because we know that j2 is cycle free, its last jump
might now exceed the allowed stack size.  We also need to update all
reachable chains with the new largest observed call depth.

Care has to be taken to revalidate even if the chain depth won't be an
issue: chain validation also ensures that expressions are not called from
invalid base chains.  For example, the masquerade expression can only be
called from NAT postrouting base chains.

Therefore we also need to keep record of the base chain context (type,
hooknum) and revalidate if the chain becomes reachable from a different
hook location.

Reported-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
Closes: https://lore.kernel.org/netfilter-devel/20251118221735.GA5477@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net/
Tested-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
2025-12-15 15:02:44 +01:00
Jakub Kicinski
006a5035b4 inet: frags: flush pending skbs in fqdir_pre_exit()
We have been seeing occasional deadlocks on pernet_ops_rwsem since
September in NIPA. The stuck task was usually modprobe (often loading
a driver like ipvlan), trying to take the lock as a Writer.
lockdep does not track readers for rwsems so the read wasn't obvious
from the reports.

On closer inspection the Reader holding the lock was conntrack looping
forever in nf_conntrack_cleanup_net_list(). Based on past experience
with occasional NIPA crashes I looked thru the tests which run before
the crash and noticed that the crash follows ip_defrag.sh. An immediate
red flag. Scouring thru (de)fragmentation queues reveals skbs sitting
around, holding conntrack references.

The problem is that since conntrack depends on nf_defrag_ipv6,
nf_defrag_ipv6 will load first. Since nf_defrag_ipv6 loads first its
netns exit hooks run _after_ conntrack's netns exit hook.

Flush all fragment queue SKBs during fqdir_pre_exit() to release
conntrack references before conntrack cleanup runs. Also flush
the queues in timer expiry handlers when they discover fqdir->dead
is set, in case packet sneaks in while we're running the pre_exit
flush.

The commit under Fixes is not exactly the culprit, but I think
previously the timer firing would eventually unblock the spinning
conntrack.

Fixes: d5dd88794a ("inet: fix various use-after-free in defrags units")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251207010942.1672972-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-12-10 01:15:27 -08:00
Jakub Kicinski
1231eec699 inet: frags: add inet_frag_queue_flush()
Instead of exporting inet_frag_rbtree_purge() which requires that
caller takes care of memory accounting, add a new helper. We will
need to call it from a few places in the next patch.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251207010942.1672972-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-12-10 01:15:27 -08:00
Linus Torvalds
bbbf7f3284 - fix a bug with O_APPEND in cached mode causing data to be written multiple times on server
- use kvmalloc for trans_fd to avoid problems with large msize and fragmented memory
 This should hopefully be used in more transports when time allows
 - convert to new mount API
 - minor cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAmk1dvwACgkQq06b7GqY
 5nCRWA//a4qCTs/8FRS7N0Mz5Jg84VZ2JPnVN6iydLKbFDkgUL8JXI723VmApb6D
 wR21yRm7VuWpnGVdfPF6BtjZV7cYXEEwukfLqkXtOwx/WaRREKpN0sOciXMd0htg
 ZgnhhrabCOiSAYJYb9/29sNwhvfweQi0BeAFdIEAPrVonFUYXRzFS0v4AiBCs5PY
 n8X6aoshViAG05MZycB2VYaCxT45I+8YNCXtSsT/uX+3BP1FuRYMRluAYCLu/TU8
 oKyFjkpIri01211OEORx6gs5CDeCv0LpELfk5EW2QF/mz4oW3/4bAchg22NgNV6x
 0OCbgTqwlSJVETCZfZso/TV8efMlk1rLxSA0xjQY9r1lA26BTubrNadnC2W2nhSv
 GpPbDu6s3Cj3WD7P2InGtXxzUIZDCm8kHfHjzbbzgwOS6jl8SaDSnc6J2HtKNESL
 T55hqzqzv4POFQKrgznaQcaDW7imftOA+9xv+k+j5DTDKS9LLxiKS7+dyyzYAyyX
 sCjOd/T7JMvXIp4TxwbROn2+VBVYVO3ZaaKdK8e+qVKvWooP3iorGXAg0O0wZYJV
 tLE5zioZiRngzhqAczDIpJAe5Qd/SIi6W+sAOpKLSvVVKGf/akr0wH0KxI5z7ox8
 uBStVXTOoh52qQr0L7nnBIUJ2VJLJUt9TCzwwvaLM5B1TyPgxh8=
 =7ST9
 -----END PGP SIGNATURE-----

Merge tag '9p-for-6.19-rc1' of https://github.com/martinetd/linux

Pull 9p updates from Dominique Martinet:

 - fix a bug with O_APPEND in cached mode causing data to be written
   multiple times on server

 - use kvmalloc for trans_fd to avoid problems with large msize and
   fragmented memory This should hopefully be used in more transports
   when time allows

 - convert to new mount API

 - minor cleanups

* tag '9p-for-6.19-rc1' of https://github.com/martinetd/linux:
  9p: fix new mount API cache option handling
  9p: fix cache/debug options printing in v9fs_show_options
  9p: convert to the new mount API
  9p: create a v9fs_context structure to hold parsed options
  net/9p: move structures and macros to header files
  fs/fs_parse: add back fsparam_u32hex
  fs/9p: delete unnnecessary condition
  fs/9p: Don't open remote file with APPEND mode when writeback cache is used
  net/9p: cleanup: change p9_trans_module->def to bool
  9p: Use kvmalloc for message buffers on supported transports
2025-12-07 08:29:09 -08:00
Linus Torvalds
7203ca412f Significant patch series in this merge are as follows:
- The 10 patch series "__vmalloc()/kvmalloc() and no-block support" from
   Uladzislau Rezki reworks the vmalloc() code to support non-blocking
   allocations (GFP_ATOIC, GFP_NOWAIT).
 
 - The 2 patch series "ksm: fix exec/fork inheritance" from xu xin fixes
   a rare case where the KSM MMF_VM_MERGE_ANY prctl state is not inherited
   across fork/exec.
 
 - The 4 patch series "mm/zswap: misc cleanup of code and documentations"
   from SeongJae Park does some light maintenance work on the zswap code.
 
 - The 5 patch series "mm/page_owner: add debugfs files 'show_handles'
   and 'show_stacks_handles'" from Mauricio Faria de Oliveira enhances the
   /sys/kernel/debug/page_owner debug feature.  It adds unique identifiers
   to differentiate the various stack traces so that userspace monitoring
   tools can better match stack traces over time.
 
 - The 2 patch series "mm/page_alloc: pcp->batch cleanups" from Joshua
   Hahn makes some minor alterations to the page allocator's per-cpu-pages
   feature.
 
 - The 2 patch series "Improve UFFDIO_MOVE scalability by removing
   anon_vma lock" from Lokesh Gidra addresses a scalability issue in
   userfaultfd's UFFDIO_MOVE operation.
 
 - The 2 patch series "kasan: cleanups for kasan_enabled() checks" from
   Sabyrzhan Tasbolatov performs some cleanup in the KASAN code.
 
 - The 2 patch series "drivers/base/node: fold node register and
   unregister functions" from Donet Tom cleans up the NUMA node handling
   code a little.
 
 - The 4 patch series "mm: some optimizations for prot numa" from Kefeng
   Wang provides some cleanups and small optimizations to the NUMA
   allocation hinting code.
 
 - The 5 patch series "mm/page_alloc: Batch callers of
   free_pcppages_bulk" from Joshua Hahn addresses long lock hold times at
   boot on large machines.  These were causing (harmless) softlockup
   warnings.
 
 - The 2 patch series "optimize the logic for handling dirty file folios
   during reclaim" from Baolin Wang removes some now-unnecessary work from
   page reclaim.
 
 - The 10 patch series "mm/damon: allow DAMOS auto-tuned for per-memcg
   per-node memory usage" from SeongJae Park enhances the DAMOS auto-tuning
   feature.
 
 - The 2 patch series "mm/damon: fixes for address alignment issues in
   DAMON_LRU_SORT and DAMON_RECLAIM" from Quanmin Yan fixes DAMON_LRU_SORT
   and DAMON_RECLAIM with certain userspace configuration.
 
 - The 15 patch series "expand mmap_prepare functionality, port more
   users" from Lorenzo Stoakes enhances the new(ish)
   file_operations.mmap_prepare() method and ports additional callsites
   from the old ->mmap() over to ->mmap_prepare().
 
 - The 8 patch series "Fix stale IOTLB entries for kernel address space"
   from Lu Baolu fixes a bug (and possible security issue on non-x86) in
   the IOMMU code.  In some situations the IOMMU could be left hanging onto
   a stale kernel pagetable entry.
 
 - The 4 patch series "mm/huge_memory: cleanup __split_unmapped_folio()"
   from Wei Yang cleans up and optimizes the folio splitting code.
 
 - The 5 patch series "mm, swap: misc cleanup and bugfix" from Kairui
   Song implements some cleanups and a minor fix in the swap discard code.
 
 - The 8 patch series "mm/damon: misc documentation fixups" from SeongJae
   Park does as advertised.
 
 - The 9 patch series "mm/damon: support pin-point targets removal" from
   SeongJae Park permits userspace to remove a specific monitoring target
   in the middle of the current targets list.
 
 - The 2 patch series "mm: MISC follow-up patches for linux/pgalloc.h"
   from Harry Yoo implements a couple of cleanups related to mm header file
   inclusion.
 
 - The 2 patch series "mm/swapfile.c: select swap devices of default
   priority round robin" from Baoquan He improves the selection of swap
   devices for NUMA machines.
 
 - The 3 patch series "mm: Convert memory block states (MEM_*) macros to
   enums" from Israel Batista changes the memory block labels from macros
   to enums so they will appear in kernel debug info.
 
 - The 3 patch series "ksm: perform a range-walk to jump over holes in
   break_ksm" from Pedro Demarchi Gomes addresses an inefficiency when KSM
   unmerges an address range.
 
 - The 22 patch series "mm/damon/tests: fix memory bugs in kunit tests"
   from SeongJae Park fixes leaks and unhandled malloc() failures in DAMON
   userspace unit tests.
 
 - The 2 patch series "some cleanups for pageout()" from Baolin Wang
   cleans up a couple of minor things in the page scanner's
   writeback-for-eviction code.
 
 - The 2 patch series "mm/hugetlb: refactor sysfs/sysctl interfaces" from
   Hui Zhu moves hugetlb's sysfs/sysctl handling code into a new file.
 
 - The 9 patch series "introduce VM_MAYBE_GUARD and make it sticky" from
   Lorenzo Stoakes makes the VMA guard regions available in /proc/pid/smaps
   and improves the mergeability of guarded VMAs.
 
 - The 2 patch series "mm: perform guard region install/remove under VMA
   lock" from Lorenzo Stoakes reduces mmap lock contention for callers
   performing VMA guard region operations.
 
 - The 2 patch series "vma_start_write_killable" from Matthew Wilcox
   starts work in permitting applications to be killed when they are
   waiting on a read_lock on the VMA lock.
 
 - The 11 patch series "mm/damon/tests: add more tests for online
   parameters commit" from SeongJae Park adds additional userspace testing
   of DAMON's "commit" feature.
 
 - The 9 patch series "mm/damon: misc cleanups" from SeongJae Park does
   that.
 
 - The 2 patch series "make VM_SOFTDIRTY a sticky VMA flag" from Lorenzo
   Stoakes addresses the possible loss of a VMA's VM_SOFTDIRTY flag when
   that VMA is merged with another.
 
 - The 16 patch series "mm: support device-private THP" from Balbir Singh
   introduces support for Transparent Huge Page (THP) migration in zone
   device-private memory.
 
 - The 3 patch series "Optimize folio split in memory failure" from Zi
   Yan optimizes folio split operations in the memory failure code.
 
 - The 2 patch series "mm/huge_memory: Define split_type and consolidate
   split support checks" from Wei Yang provides some more cleanups in the
   folio splitting code.
 
 - The 16 patch series "mm: remove is_swap_[pte, pmd]() + non-swap
   entries, introduce leaf entries" from Lorenzo Stoakes cleans up our
   handling of pagetable leaf entries by introducing the concept of
   'software leaf entries', of type softleaf_t.
 
 - The 4 patch series "reparent the THP split queue" from Muchun Song
   reparents the THP split queue to its parent memcg.  This is in
   preparation for addressing the long-standing "dying memcg" problem,
   wherein dead memcg's linger for too long, consuming memory resources.
 
 - The 3 patch series "unify PMD scan results and remove redundant
   cleanup" from Wei Yang does a little cleanup in the hugepage collapse
   code.
 
 - The 6 patch series "zram: introduce writeback bio batching" from
   Sergey Senozhatsky improves zram writeback efficiency by introducing
   batched bio writeback support.
 
 - The 4 patch series "memcg: cleanup the memcg stats interfaces" from
   Shakeel Butt cleans up our handling of the interrupt safety of some
   memcg stats.
 
 - The 4 patch series "make vmalloc gfp flags usage more apparent" from
   Vishal Moola cleans up vmalloc's handling of incoming GFP flags.
 
 - The 6 patch series "mm: Add soft-dirty and uffd-wp support for RISC-V"
   from Chunyan Zhang teches soft dirty and userfaultfd write protect
   tracking to use RISC-V's Svrsw60t59b extension.
 
 - The 5 patch series "mm: swap: small fixes and comment cleanups" from
   Youngjun Park fixes a small bug and cleans up some of the swap code.
 
 - The 4 patch series "initial work on making VMA flags a bitmap" from
   Lorenzo Stoakes starts work on converting the vma struct's flags to a
   bitmap, so we stop running out of them, especially on 32-bit.
 
 - The 2 patch series "mm/swapfile: fix and cleanup swap list iterations"
   from Youngjun Park addresses a possible bug in the swap discard code and
   cleans things up a little.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaTEb0wAKCRDdBJ7gKXxA
 jjfIAP94W4EkCCwNOupnChoG+YWw/JW21anXt5NN+i5svn1yugEAwzvv6A+cAFng
 o+ug/fyrfPZG7PLp2R8WFyGIP0YoBA4=
 =IUzS
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

  "__vmalloc()/kvmalloc() and no-block support" (Uladzislau Rezki)
     Rework the vmalloc() code to support non-blocking allocations
     (GFP_ATOIC, GFP_NOWAIT)

  "ksm: fix exec/fork inheritance" (xu xin)
     Fix a rare case where the KSM MMF_VM_MERGE_ANY prctl state is not
     inherited across fork/exec

  "mm/zswap: misc cleanup of code and documentations" (SeongJae Park)
     Some light maintenance work on the zswap code

  "mm/page_owner: add debugfs files 'show_handles' and 'show_stacks_handles'" (Mauricio Faria de Oliveira)
     Enhance the /sys/kernel/debug/page_owner debug feature by adding
     unique identifiers to differentiate the various stack traces so
     that userspace monitoring tools can better match stack traces over
     time

  "mm/page_alloc: pcp->batch cleanups" (Joshua Hahn)
     Minor alterations to the page allocator's per-cpu-pages feature

  "Improve UFFDIO_MOVE scalability by removing anon_vma lock" (Lokesh Gidra)
     Address a scalability issue in userfaultfd's UFFDIO_MOVE operation

  "kasan: cleanups for kasan_enabled() checks" (Sabyrzhan Tasbolatov)

  "drivers/base/node: fold node register and unregister functions" (Donet Tom)
     Clean up the NUMA node handling code a little

  "mm: some optimizations for prot numa" (Kefeng Wang)
     Cleanups and small optimizations to the NUMA allocation hinting
     code

  "mm/page_alloc: Batch callers of free_pcppages_bulk" (Joshua Hahn)
     Address long lock hold times at boot on large machines. These were
     causing (harmless) softlockup warnings

  "optimize the logic for handling dirty file folios during reclaim" (Baolin Wang)
     Remove some now-unnecessary work from page reclaim

  "mm/damon: allow DAMOS auto-tuned for per-memcg per-node memory usage" (SeongJae Park)
     Enhance the DAMOS auto-tuning feature

  "mm/damon: fixes for address alignment issues in DAMON_LRU_SORT and DAMON_RECLAIM" (Quanmin Yan)
     Fix DAMON_LRU_SORT and DAMON_RECLAIM with certain userspace
     configuration

  "expand mmap_prepare functionality, port more users" (Lorenzo Stoakes)
     Enhance the new(ish) file_operations.mmap_prepare() method and port
     additional callsites from the old ->mmap() over to ->mmap_prepare()

  "Fix stale IOTLB entries for kernel address space" (Lu Baolu)
     Fix a bug (and possible security issue on non-x86) in the IOMMU
     code. In some situations the IOMMU could be left hanging onto a
     stale kernel pagetable entry

  "mm/huge_memory: cleanup __split_unmapped_folio()" (Wei Yang)
     Clean up and optimize the folio splitting code

  "mm, swap: misc cleanup and bugfix" (Kairui Song)
     Some cleanups and a minor fix in the swap discard code

  "mm/damon: misc documentation fixups" (SeongJae Park)

  "mm/damon: support pin-point targets removal" (SeongJae Park)
     Permit userspace to remove a specific monitoring target in the
     middle of the current targets list

  "mm: MISC follow-up patches for linux/pgalloc.h" (Harry Yoo)
     A couple of cleanups related to mm header file inclusion

  "mm/swapfile.c: select swap devices of default priority round robin" (Baoquan He)
     improve the selection of swap devices for NUMA machines

  "mm: Convert memory block states (MEM_*) macros to enums" (Israel Batista)
     Change the memory block labels from macros to enums so they will
     appear in kernel debug info

  "ksm: perform a range-walk to jump over holes in break_ksm" (Pedro Demarchi Gomes)
     Address an inefficiency when KSM unmerges an address range

  "mm/damon/tests: fix memory bugs in kunit tests" (SeongJae Park)
     Fix leaks and unhandled malloc() failures in DAMON userspace unit
     tests

  "some cleanups for pageout()" (Baolin Wang)
     Clean up a couple of minor things in the page scanner's
     writeback-for-eviction code

  "mm/hugetlb: refactor sysfs/sysctl interfaces" (Hui Zhu)
     Move hugetlb's sysfs/sysctl handling code into a new file

  "introduce VM_MAYBE_GUARD and make it sticky" (Lorenzo Stoakes)
     Make the VMA guard regions available in /proc/pid/smaps and
     improves the mergeability of guarded VMAs

  "mm: perform guard region install/remove under VMA lock" (Lorenzo Stoakes)
     Reduce mmap lock contention for callers performing VMA guard region
     operations

  "vma_start_write_killable" (Matthew Wilcox)
     Start work on permitting applications to be killed when they are
     waiting on a read_lock on the VMA lock

  "mm/damon/tests: add more tests for online parameters commit" (SeongJae Park)
     Add additional userspace testing of DAMON's "commit" feature

  "mm/damon: misc cleanups" (SeongJae Park)

  "make VM_SOFTDIRTY a sticky VMA flag" (Lorenzo Stoakes)
     Address the possible loss of a VMA's VM_SOFTDIRTY flag when that
     VMA is merged with another

  "mm: support device-private THP" (Balbir Singh)
     Introduce support for Transparent Huge Page (THP) migration in zone
     device-private memory

  "Optimize folio split in memory failure" (Zi Yan)

  "mm/huge_memory: Define split_type and consolidate split support checks" (Wei Yang)
     Some more cleanups in the folio splitting code

  "mm: remove is_swap_[pte, pmd]() + non-swap entries, introduce leaf entries" (Lorenzo Stoakes)
     Clean up our handling of pagetable leaf entries by introducing the
     concept of 'software leaf entries', of type softleaf_t

  "reparent the THP split queue" (Muchun Song)
     Reparent the THP split queue to its parent memcg. This is in
     preparation for addressing the long-standing "dying memcg" problem,
     wherein dead memcg's linger for too long, consuming memory
     resources

  "unify PMD scan results and remove redundant cleanup" (Wei Yang)
     A little cleanup in the hugepage collapse code

  "zram: introduce writeback bio batching" (Sergey Senozhatsky)
     Improve zram writeback efficiency by introducing batched bio
     writeback support

  "memcg: cleanup the memcg stats interfaces" (Shakeel Butt)
     Clean up our handling of the interrupt safety of some memcg stats

  "make vmalloc gfp flags usage more apparent" (Vishal Moola)
     Clean up vmalloc's handling of incoming GFP flags

  "mm: Add soft-dirty and uffd-wp support for RISC-V" (Chunyan Zhang)
     Teach soft dirty and userfaultfd write protect tracking to use
     RISC-V's Svrsw60t59b extension

  "mm: swap: small fixes and comment cleanups" (Youngjun Park)
     Fix a small bug and clean up some of the swap code

  "initial work on making VMA flags a bitmap" (Lorenzo Stoakes)
     Start work on converting the vma struct's flags to a bitmap, so we
     stop running out of them, especially on 32-bit

  "mm/swapfile: fix and cleanup swap list iterations" (Youngjun Park)
     Address a possible bug in the swap discard code and clean things
     up a little

[ This merge also reverts commit ebb9aeb980 ("vfio/nvgrace-gpu:
  register device memory for poison handling") because it looks
  broken to me, I've asked for clarification   - Linus ]

* tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
  mm: fix vma_start_write_killable() signal handling
  mm/swapfile: use plist_for_each_entry in __folio_throttle_swaprate
  mm/swapfile: fix list iteration when next node is removed during discard
  fs/proc/task_mmu.c: fix make_uffd_wp_huge_pte() huge pte handling
  mm/kfence: add reboot notifier to disable KFENCE on shutdown
  memcg: remove inc/dec_lruvec_kmem_state helpers
  selftests/mm/uffd: initialize char variable to Null
  mm: fix DEBUG_RODATA_TEST indentation in Kconfig
  mm: introduce VMA flags bitmap type
  tools/testing/vma: eliminate dependency on vma->__vm_flags
  mm: simplify and rename mm flags function for clarity
  mm: declare VMA flags by bit
  zram: fix a spelling mistake
  mm/page_alloc: optimize lowmem_reserve max lookup using its semantic monotonicity
  mm/vmscan: skip increasing kswapd_failures when reclaim was boosted
  pagemap: update BUDDY flag documentation
  mm: swap: remove scan_swap_map_slots() references from comments
  mm: swap: change swap_alloc_slow() to void
  mm, swap: remove redundant comment for read_swap_cache_async
  mm, swap: use SWP_SOLIDSTATE to determine if swap is rotational
  ...
2025-12-05 13:52:43 -08:00
Jakub Kicinski
4a18b6cd7c bluetooth-next pull request for net-next:
core:
 
  - HCI: Add initial support for PAST
  - hci_core: Introduce HCI_CONN_FLAG_PAST
  - ISO: Add support to bind to trigger PAST
  - HCI: Always use the identity address when initializing a connection
  - ISO: Attempt to resolve broadcast address
  - MGMT: Allow use of Set Device Flags without Add Device
  - ISO: Fix not updating BIS sender source address
  - HCI: Add support for LL Extended Feature Set
 
  driver:
 
  - btusb: Add new VID/PID 2b89/6275 for RTL8761BUV
  - btusb: MT7920: Add VID/PID 0489/e135
  - btusb: MT7922: Add VID/PID 0489/e170
  - btusb: Add new VID/PID 13d3/3533 for RTL8821CE
  - btusb: Add new VID/PID 0x0489/0xE12F for RTL8852BE-VT
  - btusb: Add new VID/PID 0x13d3/0x3618 for RTL8852BE-VT
  - btusb: Add new VID/PID 0x13d3/0x3619 for RTL8852BE-VT
  - btusb: Reclassify Qualcomm WCN6855 debug packets
  - btintel_pcie: Introduce HCI Driver protocol
  - btintel_pcie: Support for S4 (Hibernate)
  - btintel_pcie: Suspend/Resume: Controller doorbell interrupt handling
  - dt-bindings: net: Convert Marvell 8897/8997 bindings to DT schema
  - btbcm: Use kmalloc_array() to prevent overflow
  - btrtl: Add the support for RTL8761CUV
  - hci_h5: avoid sending two SYNC messages
  - hci_h5: implement CRC data integrity
 
 MAINTAINERS:
 
  - Add Bartosz Golaszewski as Qualcomm hci_qca maintainer
 -----BEGIN PGP SIGNATURE-----
 
 iQJNBAABCgA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmkuCjwZHGx1aXoudm9u
 LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKdOvD/9ReyEoUaUZ6aKC7TO66fT0
 eo/sga036O3k9oeeECsEBOOgwOy86gbR7uGol/vV6mocm/Duql/pNqMrgC28Cxhy
 u+aGX7sf3s8Pc/e82Syx6u+QPKa6xz3Hnx1UdCM/HXb0nOkJUm3VHFmshFxmXE9g
 xyMWKtLp1DrXOZNLauR/p8fgAEhGQs8muICuWT/SrEbXZ4+coQoz2h6279IA9FGZ
 2qj/pcLIBFk0qePkiaZ5LJGjF0P0+S9uX9XlhF9yIsLIckH8qAbPsHpD1+RsbkId
 R4WYxIBTVeeUhvWQezodsZJa8HRFQendCeBq8QhOo6fEcprpxrb4/8NS6hipblfy
 rTeyKUoPmPZ/5/nvx9pbRSqqdVOVT/pEOzD5o66bppX7s7Xq3k/WIRqCKpM4znIp
 iZyUlaoX6J5R+SHaOLVzRun6oUzXnIHbIGbWopfQBtMgpDajTcxQJpsTWdFQW4El
 RP+N9xoF+OBa4pKumKCe11pzUJOkBislDIAjNvp8wmevfSfkmo8CbqK5WwfL0rar
 VeBDlkfPYLVNiJ30WKwmLC4ymRvzAFhC0R6LoDPw8OWTZ7VBc7ReCcHDp8GBWbew
 pKe7jMWCCOo1q9zx0KBAwghQ6ZbABnK6N6Fg/Wpb/kfz7YkqnsRWRJkI2BTotEdu
 +bIZ2MwuJ59ROnfPy0tNBw==
 =ndah
 -----END PGP SIGNATURE-----

Merge tag 'for-net-next-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next

Luiz Augusto von Dentz says:

====================
bluetooth-next pull request for net-next:

core:

 - HCI: Add initial support for PAST
 - hci_core: Introduce HCI_CONN_FLAG_PAST
 - ISO: Add support to bind to trigger PAST
 - HCI: Always use the identity address when initializing a connection
 - ISO: Attempt to resolve broadcast address
 - MGMT: Allow use of Set Device Flags without Add Device
 - ISO: Fix not updating BIS sender source address
 - HCI: Add support for LL Extended Feature Set

 driver:

 - btusb: Add new VID/PID 2b89/6275 for RTL8761BUV
 - btusb: MT7920: Add VID/PID 0489/e135
 - btusb: MT7922: Add VID/PID 0489/e170
 - btusb: Add new VID/PID 13d3/3533 for RTL8821CE
 - btusb: Add new VID/PID 0x0489/0xE12F for RTL8852BE-VT
 - btusb: Add new VID/PID 0x13d3/0x3618 for RTL8852BE-VT
 - btusb: Add new VID/PID 0x13d3/0x3619 for RTL8852BE-VT
 - btusb: Reclassify Qualcomm WCN6855 debug packets
 - btintel_pcie: Introduce HCI Driver protocol
 - btintel_pcie: Support for S4 (Hibernate)
 - btintel_pcie: Suspend/Resume: Controller doorbell interrupt handling
 - dt-bindings: net: Convert Marvell 8897/8997 bindings to DT schema
 - btbcm: Use kmalloc_array() to prevent overflow
 - btrtl: Add the support for RTL8761CUV
 - hci_h5: avoid sending two SYNC messages
 - hci_h5: implement CRC data integrity

MAINTAINERS:

 - Add Bartosz Golaszewski as Qualcomm hci_qca maintainer

* tag 'for-net-next-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (29 commits)
  Bluetooth: btusb: Add new VID/PID 13d3/3533 for RTL8821CE
  Bluetooth: HCI: Add support for LL Extended Feature Set
  drivers/bluetooth: btbcm: Use kmalloc_array() to prevent overflow
  Bluetooth: btintel_pcie: Introduce HCI Driver protocol
  Bluetooth: btusb: add new custom firmwares
  Bluetooth: btusb: Add new VID/PID 0x13d3/0x3619 for RTL8852BE-VT
  Bluetooth: btusb: Add new VID/PID 0x13d3/0x3618 for RTL8852BE-VT
  Bluetooth: btusb: Add new VID/PID 0x0489/0xE12F for RTL8852BE-VT
  Bluetooth: iso: fix socket matching ambiguity between BIS and CIS
  Bluetooth: MAINTAINERS: Add Bartosz Golaszewski as Qualcomm hci_qca maintainer
  Bluetooth: btrtl: Add the support for RTL8761CUV
  Bluetooth: Remove redundant pm_runtime_mark_last_busy() calls
  dt-bindings: net: Convert Marvell 8897/8997 bindings to DT schema
  Bluetooth: btusb: Reclassify Qualcomm WCN6855 debug packets
  Bluetooth: btusb: Add new VID/PID 2b89/6275 for RTL8761BUV
  Bluetooth: btintel_pcie: Suspend/Resume: Controller doorbell interrupt handling
  Bluetooth: btintel_pcie: Support for S4 (Hibernate)
  Bluetooth: btusb: MT7922: Add VID/PID 0489/e170
  Bluetooth: btusb: MT7920: Add VID/PID 0489/e135
  Bluetooth: ISO: Fix not updating BIS sender source address
  ...
====================

Link: https://patch.msgid.link/20251201213818.97249-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-12-01 17:10:52 -08:00
Vladimir Oltean
0e75bfe340 net: dsa: add simple HSR offload helpers
It turns out that HSR offloads are so fine-grained that many DSA
switches can do a small part even though they weren't specifically
designed for the protocols supported by that driver (HSR and PRP).

Specifically NETIF_F_HW_HSR_DUP - it is simple packet duplication on
transmit, towards all (aka 2) ports members of the HSR device.

For many DSA switches, we know how to duplicate a packet, even though we
never typically use that feature. The transmit port mask from the
tagging protocol can have multiple bits set, and the switch should send
the packet once to every port with a bit set from that mask.

Nonetheless, not all tagging protocols are like this, and sometimes the
port is a single numeric value rather than a bit mask. For that reason,
and also because switches can sometimes change tagging protocols for
different ones, we need to make HSR offload helpers opt-in.

For devices that can do nothing else HSR-specific, we introduce
dsa_port_simple_hsr_join() and dsa_port_simple_hsr_leave(). These
functions monitor when two user ports of the same switch are part of the
same HSR device, and when that condition is true, they toggle the
NETIF_F_HW_HSR_DUP feature flag of both net devices.

Normally only dsa_port_simple_hsr_join() and dsa_port_simple_hsr_leave()
are needed. The dsa_port_simple_hsr_validate() helper is just to see
what kind of configuration could be offloadable using the generic
helpers. This is used by switch drivers which are not currently using
the right tagging protocol to offload this HSR ring, but could in
principle offload it after changing the tagger.

Suggested-by: David Yang <mmyangfl@gmail.com>
Cc: "Alvin Šipraga" <alsi@bang-olufsen.dk>
Cc: Chester A. Unal" <chester.a.unal@arinc9.com>
Cc: "Clément Léger" <clement.leger@bootlin.com>
Cc: Daniel Golle <daniel@makrotopia.org>
Cc: DENG Qingfang <dqfext@gmail.com>
Cc: Florian Fainelli <florian.fainelli@broadcom.com>
Cc: George McCollister <george.mccollister@gmail.com>
Cc: Hauke Mehrtens <hauke@hauke-m.de>
Cc: Jonas Gorski <jonas.gorski@gmail.com>
Cc: Kurt Kanzenbach <kurt@linutronix.de>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Sean Wang <sean.wang@mediatek.com>
Cc: UNGLinuxDriver@microchip.com
Cc: Woojung Huh <woojung.huh@microchip.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20251130131657.65080-6-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-12-01 16:45:07 -08:00
Long Li
9bf66036d6 net: mana: Handle hardware recovery events when probing the device
When MANA is being probed, it's possible that hardware is in recovery
mode and the device may get GDMA_EQE_HWC_RESET_REQUEST over HWC in the
middle of the probe. Detect such condition and go through the recovery
service procedure.

Signed-off-by: Long Li <longli@microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/1764193552-9712-1-git-send-email-longli@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-12-01 13:53:53 -08:00
Luiz Augusto von Dentz
a106e50be7 Bluetooth: HCI: Add support for LL Extended Feature Set
This adds support for emulating LL Extended Feature Set introduced in 6.0
that adds the following:

Commands:

 - HCI_LE_Read_All_Local_Supported_­Features(0x2087)(Feature:47,1)
 - HCI_LE_Read_All_Remote_Features(0x2088)(Feature:47,2)

Events:

 - HCI_LE_Read_All_Remote_Features_Complete(0x2b)(Mask bit:42)

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-12-01 16:21:16 -05:00
Luiz Augusto von Dentz
14b06c3a88 Bluetooth: HCI: Always use the identity address when initializing a connection
This makes sure hci_conn is initialized with the identity address if
a matching IRK exists which avoids the trouble of having to do it at
multiple places which seems to be missing (e.g. CIS, BIS and PA).

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-12-01 16:00:06 -05:00
Luiz Augusto von Dentz
d3413703d5 Bluetooth: ISO: Add support to bind to trigger PAST
This makes it possible to bind to a different destination address
after being connected (BT_CONNECTED, BT_CONNECT2) which then triggers
PAST Sender proceedure to transfer the PA Sync to the destination
address.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-12-01 16:00:04 -05:00
Luiz Augusto von Dentz
c530569adc Bluetooth: hci_core: Introduce HCI_CONN_FLAG_PAST
This introduces a new device flag so userspace can indicate if it
wants to enable PAST Receiver for a specific device.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-12-01 15:58:54 -05:00
Luiz Augusto von Dentz
33b2835f0b Bluetooth: HCI: Add initial support for PAST
This adds PAST related commands (HCI_OP_LE_PAST,
HCI_OP_LE_PAST_SET_INFO and HCI_OP_LE_PAST_PARAMS) and events
(HCI_EV_LE_PAST_RECEIVED) along with handling of PAST sender and
receiver features bits including new MGMG settings (
HCI_EV_LE_PAST_RECEIVED and MGMT_SETTING_PAST_RECEIVER) which
userspace can use to determine if PAST is supported by the
controller.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-12-01 15:58:54 -05:00
Jakub Kicinski
840a64710e netfilter pull request 25-11-28
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEjF9xRqF1emXiQiqU1w0aZmrPKyEFAmko6jsACgkQ1w0aZmrP
 KyGIGBAAkmLtKNMnouv2eOJjJb50ERQ1cYvKG3zSI5GrOnkYvfS3MfU5rLuBR/ee
 L/xRpgNZdXMFAu1nkpFbNIoSwpOe3JaUuizlzLwTRYRmtZeRlGfvzDqiY4CDYKU1
 7gBP0EMeTeF0SJRntU6S+zoTY7Xru5w40u5wVTnm0etiigwklv4EgixnzuSLSdkz
 Av3KLE0BN85cNs6onZ6s4N4dEpIyQ7Ln0imdFiJOLvg42lM6uVNfXB6CxUIo/tIC
 VzY9vQ5rTfhcNx3lRbaJaDOE6k01x+RsBM15AkkAlafLMfvRIH4zK9qiV9tfT6c+
 t7md70+7w6j7zB9sXuI1tSMOCMvtxYfB49RJVomasEJ8J7VZ+x/7vaFYSfvydEVb
 hy1v9jOuViWWCEQhswLwQw/Xl42MVCE/zReHHBAxIC+I7nAZgEYqOCtYYPex3gZq
 l5gfiJhWqdg5yOuQepZkNo5TaFbkANgFcDuUp8IfWsbwZ2xdIIqIbHVNmenr0UuS
 4ml+t8is/rsLi/gHoKfmfbG64wG1reVcRpVxWQljr9ePkg+04fRtesaOG44k/R+i
 wdUxHL4D4WV2SnNHznw8J12tgbsIc/VgwU0EFEUxUahc18quxaumZTVuL7enbFw1
 3qgN+9qQ5ONDuABR9fedFGIoCFmOVkZLXgJnLgTC7bbZ6v0GvSM=
 =exaR
 -----END PGP SIGNATURE-----

Merge tag 'nf-next-25-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following batch contains Netfilter updates for net-next:

0) Add sanity check for maximum encapsulations in bridge vlan,
   reported by the new AI robot.

1) Move the flowtable path discovery code to its own file, the
   nft_flow_offload.c mixes the nf_tables evaluation with the path
   discovery logic, just split this in two for clarity.

2) Consolidate flowtable xmit path by using dev_queue_xmit() and the
   real device behind the layer 2 vlan/pppoe device. This allows to
   inline encapsulation. After this update, hw_ifidx can be removed
   since both ifidx and hw_ifidx now point to the same device.

3) Support for IPIP encapsulation in the flowtable, extend selftest
   to cover for this new layer 3 offload, from Lorenzo Bianconi.

4) Push down the skb into the conncount API to fix duplicates in the
   conncount list for packets with non-confirmed conntrack entries,
   this is due to an optimization introduced in d265929930
   ("netfilter: nf_conncount: reduce unnecessary GC").
   From Fernando Fernandez Mancera.

5) In conncount, disable BH when performing garbage collection
   to consolidate existing behaviour in the conncount API, also
   from Fernando.

6) A matching packet with a confirmed conntrack invokes GC if
   conncount reaches the limit in an attempt to release slots.
   This allows the existing extensions to be used for real conntrack
   counting, not just limiting new connections, from Fernando.

7) Support for updating ct count objects in nf_tables, from Fernando.

8) Extend nft_flowtables.sh selftest to send IPv6 TCP traffic,
   from Lorenzo Bianconi.

9) Fixes for UAPI kernel-doc documentation, from Randy Dunlap.

* tag 'nf-next-25-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: nf_tables: improve UAPI kernel-doc comments
  netfilter: ip6t_srh: fix UAPI kernel-doc comments format
  selftests: netfilter: nft_flowtable.sh: Add the capability to send IPv6 TCP traffic
  netfilter: nft_connlimit: add support to object update operation
  netfilter: nft_connlimit: update the count if add was skipped
  netfilter: nf_conncount: make nf_conncount_gc_list() to disable BH
  netfilter: nf_conncount: rework API to use sk_buff directly
  selftests: netfilter: nft_flowtable.sh: Add IPIP flowtable selftest
  netfilter: flowtable: Add IPIP tx sw acceleration
  netfilter: flowtable: Add IPIP rx sw acceleration
  netfilter: flowtable: use tuple address to calculate next hop
  netfilter: flowtable: remove hw_ifidx
  netfilter: flowtable: inline pppoe encapsulation in xmit path
  netfilter: flowtable: inline vlan encapsulation in xmit path
  netfilter: flowtable: consolidate xmit path
  netfilter: flowtable: move path discovery infrastructure to its own file
  netfilter: flowtable: check for maximum number of encapsulations in bridge vlan
====================

Link: https://patch.msgid.link/20251128002345.29378-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-28 20:08:39 -08:00
Jakub Kicinski
2c80116b50 Apart from the usual small things just driver updates:
- mt76:
    - WED support for >32-bit DMA
    - airoha NPU support
    - regdomain improvements
    - continued WiFi7/MLO work
  - rtw89
    - support USB devices RTL8852AU and RTL8852CU
    - initial work for RTL8922DE
    - improved injection support
  - rtl8xxxu: 40 MHz connection fixes/support
  - brcmfmac: Acer A1 840 tablet quirk
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmkoKZoACgkQ10qiO8sP
 aADi1g//e4/kTZzl8j09V/PU+2xPQ6dqNBwsYjwowl4CPusWEJqny0M5nOs9F1ob
 5lpVY3rMl4S6D5yUHY9B1fBkAgj3xuky4Udm0KONpwiGMexIn1CjlND5Qa2XW2fz
 BaHMoCI6RXzdgQoQqWQNtyxvstsb5PXfAE8h3avO+uoFhfm9zdvmWLw4cjgy76qo
 YAcUhwgIntc3oouDajOahwnxNDR2ZmZ+ATwDsmoqzhOtvTLoARnZD4+tLD1VFe+L
 yW3FdrbQYYOAdRyYiIcCIiLfr9AvqeEluCYy06J4Viafkf8io84IgijTuxM8tHpp
 spA0RA0LWNwcaYG6xf07VwjwbuhnhJEZEAfapEqhF7R6zcH7ZA6Y3vLB9JhB9bPX
 UnOb+kLrqiwnKHyHbcyW8uVFPj4D9vl9xDKM0wGCKFrv14q4Cwy/uIWW5Vy6GJnh
 Iyft0RxG83jU4x3uSx9Ywss/ByfhBuRChrBpy3ud6hf5D5dtbnH2310kBvmNMla0
 G+y2/EDjmC4uFAglKS7CwoYHYE6KJclg1hxX8jZoKq5EoKBT+n7/uM8a6vDa5lW5
 l1Sa3nJHfHHbQCQKc8jSHoC543rYMid36bJpUtnaWse35cN7bQI2v1EFuOmQD4zO
 W/OSJnH7roGUaFu9k76cCQjXWgF4NEgmRGlnvSrPEJ5YnPDJ6VU=
 =7c9f
 -----END PGP SIGNATURE-----

Merge tag 'wireless-next-2025-11-27' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Johannes Berg says:

====================
Apart from the usual small things just driver updates:
 - mt76:
   - WED support for >32-bit DMA
   - airoha NPU support
   - regdomain improvements
   - continued WiFi7/MLO work
 - rtw89
   - support USB devices RTL8852AU and RTL8852CU
   - initial work for RTL8922DE
   - improved injection support
 - rtl8xxxu: 40 MHz connection fixes/support
 - brcmfmac: Acer A1 840 tablet quirk

* tag 'wireless-next-2025-11-27' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (152 commits)
  wifi: mac80211: allow sharing identical chanctx for S1G interfaces
  wifi: nl80211: vendor-cmd: intel: fix a blank kernel-doc line warning
  wifi: cfg80211: include s1g_primary_2mhz when comparing chandefs
  wifi: cfg80211: include s1g_primary_2mhz when sending chandef
  wifi: ieee80211: correct FILS status codes
  mt76: mt7615: Fix memory leak in mt7615_mcu_wtbl_sta_add()
  wifi: mt76: mt792x: fix wifi init fail by setting MCU_RUNNING after CLC load
  wifi: mt76: Strip whitespace from build ddate
  wifi: mt76: mt7996: Add missing locking in mt7996_mac_sta_rc_work()
  wifi: mt76: mt7996: skip ieee80211_iter_keys() on scanning link remove
  wifi: mt76: mt7996: skip deflink accounting for offchannel links
  wifi: mt76: Move mt76_abort_scan out of mt76_reset_device()
  wifi: mt76: mt7996: move mt7996_update_beacons under mt76 mutex
  wifi: mt76: mt7996: grab mt76 mutex in mt7996_mac_sta_event()
  wifi: mt76: mt7925: ensure the 6GHz A-MPDU density cap from the hardware.
  wifi: mt76: mt7996: fix EMI rings for RRO
  wifi: mt76: mt7996: fix using wrong phy to start in mt7996_mac_restart()
  wifi: mt76: mt7996: fix MLO set key and group key issues
  wifi: mt76: mt7996: fix MLD group index assignment
  wifi: mt76: mt7996: use correct link_id when filling TXD and TXP
  ...
====================

Link: https://patch.msgid.link/20251127103806.17776-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-28 19:34:21 -08:00
Fernando Fernandez Mancera
be102eb6a0 netfilter: nf_conncount: rework API to use sk_buff directly
When using nf_conncount infrastructure for non-confirmed connections a
duplicated track is possible due to an optimization introduced since
commit d265929930 ("netfilter: nf_conncount: reduce unnecessary GC").

In order to fix this introduce a new conncount API that receives
directly an sk_buff struct.  It fetches the tuple and zone and the
corresponding ct from it. It comes with both existing conncount variants
nf_conncount_count_skb() and nf_conncount_add_skb(). In addition remove
the old API and adjust all the users to use the new one.

This way, for each sk_buff struct it is possible to check if there is a
ct present and already confirmed. If so, skip the add operation.

Fixes: d265929930 ("netfilter: nf_conncount: reduce unnecessary GC")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-11-28 00:05:49 +00:00
Lorenzo Bianconi
ab427db178 netfilter: flowtable: Add IPIP rx sw acceleration
Introduce sw acceleration for rx path of IPIP tunnels relying on the
netfilter flowtable infrastructure. Subsequent patches will add sw
acceleration for IPIP tunnels tx path.
This series introduces basic infrastructure to accelerate other tunnel
types (e.g. IP6IP6).
IPIP rx sw acceleration can be tested running the following scenario where
the traffic is forwarded between two NICs (eth0 and eth1) and an IPIP
tunnel is used to access a remote site (using eth1 as the underlay device):

ETH0 -- TUN0 <==> ETH1 -- [IP network] -- TUN1 (192.168.100.2)

$ip addr show
6: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:00:22:33:11:55 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.2/24 scope global eth0
       valid_lft forever preferred_lft forever
7: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:11:22:33:11:55 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.1/24 scope global eth1
       valid_lft forever preferred_lft forever
8: tun0@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1480 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ipip 192.168.1.1 peer 192.168.1.2
    inet 192.168.100.1/24 scope global tun0
       valid_lft forever preferred_lft forever

$ip route show
default via 192.168.100.2 dev tun0
192.168.0.0/24 dev eth0 proto kernel scope link src 192.168.0.2
192.168.1.0/24 dev eth1 proto kernel scope link src 192.168.1.1
192.168.100.0/24 dev tun0 proto kernel scope link src 192.168.100.1

$nft list ruleset
table inet filter {
        flowtable ft {
                hook ingress priority filter
                devices = { eth0, eth1 }
        }

        chain forward {
                type filter hook forward priority filter; policy accept;
                meta l4proto { tcp, udp } flow add @ft
        }
}

Reproducing the scenario described above using veths I got the following
results:
- TCP stream received from the IPIP tunnel:
  - net-next: (baseline)		~ 71Gbps
  - net-next + IPIP flowtbale support:	~101Gbps

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-11-28 00:00:38 +00:00
Pablo Neira Ayuso
030feea309 netfilter: flowtable: remove hw_ifidx
hw_ifidx was originally introduced to store the real netdevice as a
requirement for the hardware offload support in:

 73f97025a9 ("netfilter: nft_flow_offload: use direct xmit if hardware offload is enabled")

Since ("netfilter: flowtable: consolidate xmit path"), ifidx and
hw_ifidx points to the real device in the xmit path, remove it.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-11-28 00:00:22 +00:00
Pablo Neira Ayuso
b5964aac51 netfilter: flowtable: consolidate xmit path
Use dev_queue_xmit() for the XMIT_NEIGH case. Store the interface index
of the real device behind the vlan/pppoe device, this introduces  an
extra lookup for the real device in the xmit path because rt->dst.dev
provides the vlan/pppoe device.

XMIT_NEIGH now looks more similar to XMIT_DIRECT but the check for stale
dst and the neighbour lookup still remain in place which is convenient
to deal with network topology changes.

Note that nft_flow_route() needs to relax the check for _XMIT_NEIGH so
the existing basic xfrm offload (which only works in one direction) does
not break.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-11-27 23:59:56 +00:00
Pablo Neira Ayuso
93d7a7ed07 netfilter: flowtable: move path discovery infrastructure to its own file
This file contains the path discovery that is run from the forward chain
for the packet offloading the flow into the flowtable. This consists
of a series of calls to dev_fill_forward_path() for each device stack.

More topologies may be supported in the future, so move this code to its
own file to separate it from the nftables flow_offload expression.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-11-27 23:59:43 +00:00
Jakub Kicinski
db4029859d Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:

net/xdp/xsk.c
  0ebc27a4c6 ("xsk: avoid data corruption on cq descriptor number")
  8da7bea7db ("xsk: add indirect call for xsk_destruct_skb")
  30ed05adca ("xsk: use a smaller new lock for shared pool case")
https://lore.kernel.org/20251127105450.4a1665ec@canb.auug.org.au
https://lore.kernel.org/eb4eee14-7e24-4d1b-b312-e9ea738fefee@kernel.org

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-27 12:19:08 -08:00
Eric Dumazet
9a5e5334ad tcp: remove icsk->icsk_retransmit_timer
Now sk->sk_timer is no longer used by TCP keepalive, we can use
its storage for TCP and MPTCP retransmit timers for better
cache locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251124175013.1473655-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25 19:28:29 -08:00
Eric Dumazet
08dfe37023 tcp: introduce icsk->icsk_keepalive_timer
sk->sk_timer has been used for TCP keepalives.

Keepalive timers are not in fast path, we want to use sk->sk_timer
storage for retransmit timers, for better cache locality.

Create icsk->icsk_keepalive_timer and change keepalive
code to no longer use sk->sk_timer.

Added space is reclaimed in the following patch.

This includes changes to MPTCP, which was also using sk_timer.

Alias icsk->mptcp_tout_timer and icsk->icsk_keepalive_timer
for inet_sk_diag_fill() sake.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251124175013.1473655-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25 19:28:29 -08:00
Eric Dumazet
27e8257a86 net: move sk_dst_pending_confirm and sk_pacing_status to sock_read_tx group
These two fields are mostly read in TCP tx path, move them
in an more appropriate group for better cache locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251124175013.1473655-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25 19:28:29 -08:00
Eric Dumazet
3a6e8fd0bf tcp: rename icsk_timeout() to tcp_timeout_expires()
In preparation of sk->tcp_timeout_timer introduction,
rename icsk_timeout() helper and change its argument to plain
'const struct sock *sk'.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251124175013.1473655-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25 19:28:28 -08:00
Eric Dumazet
191ff13e42 net_sched: add qdisc_dequeue_drop() helper
Some qdisc like cake, codel, fq_codel might drop packets
in their dequeue() method.

This is currently problematic because dequeue() runs with
the qdisc spinlock held. Freeing skbs can be extremely expensive.

Add qdisc_dequeue_drop() method and a new TCQ_F_DEQUEUE_DROPS
so that these qdiscs can opt-in to defer the skb frees
after the socket spinlock is released.

TCQ_F_DEQUEUE_DROPS is an attempt to not penalize other qdiscs
with an extra cache line miss.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251121083256.674562-14-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-25 16:10:32 +01:00
Eric Dumazet
0170d7f47c net_sched: add tcf_kfree_skb_list() helper
Using kfree_skb_list_reason() to free list of skbs from qdisc
operations seems wrong as each skb might have a different drop reason.

Cleanup __dev_xmit_skb() to call tcf_kfree_skb_list() once
in preparation of the following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251121083256.674562-13-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-25 16:10:32 +01:00
Eric Dumazet
ad50d5a3fc net_sched: add Qdisc_read_mostly and Qdisc_write groups
It is possible to reorg Qdisc to avoid always dirtying 2 cache lines in
fast path by reducing this to a single dirtied cache line.

In current layout, we change only four/six fields in the first cache line:
 - q.spinlock
 - q.qlen
 - bstats.bytes
 - bstats.packets
 - some Qdisc also change q.next/q.prev

In the second cache line we change in the fast path:
 - running
 - state
 - qstats.backlog

        /* --- cacheline 2 boundary (128 bytes) --- */
        struct sk_buff_head        gso_skb __attribute__((__aligned__(64))); /*  0x80  0x18 */
        struct qdisc_skb_head      q;                    /*  0x98  0x18 */
        struct gnet_stats_basic_sync bstats __attribute__((__aligned__(16))); /*  0xb0  0x10 */

        /* --- cacheline 3 boundary (192 bytes) --- */
        struct gnet_stats_queue    qstats;               /*  0xc0  0x14 */
        bool                       running;              /*  0xd4   0x1 */

        /* XXX 3 bytes hole, try to pack */

        unsigned long              state;                /*  0xd8   0x8 */
        struct Qdisc *             next_sched;           /*  0xe0   0x8 */
        struct sk_buff_head        skb_bad_txq;          /*  0xe8  0x18 */
        /* --- cacheline 4 boundary (256 bytes) --- */

Reorganize things to have a first cache line mostly read,
then a mostly written one.

This gives a ~3% increase of performance under tx stress.

Note that there is an additional hole because @qstats now spans over a third cache line.

	/* --- cacheline 2 boundary (128 bytes) --- */
	__u8                       __cacheline_group_begin__Qdisc_read_mostly[0] __attribute__((__aligned__(64))); /*  0x80     0 */
	struct sk_buff_head        gso_skb;              /*  0x80  0x18 */
	struct Qdisc *             next_sched;           /*  0x98   0x8 */
	struct sk_buff_head        skb_bad_txq;          /*  0xa0  0x18 */
	__u8                       __cacheline_group_end__Qdisc_read_mostly[0]; /*  0xb8     0 */

	/* XXX 8 bytes hole, try to pack */

	/* --- cacheline 3 boundary (192 bytes) --- */
	__u8                       __cacheline_group_begin__Qdisc_write[0] __attribute__((__aligned__(64))); /*  0xc0     0 */
	struct qdisc_skb_head      q;                    /*  0xc0  0x18 */
	unsigned long              state;                /*  0xd8   0x8 */
	struct gnet_stats_basic_sync bstats __attribute__((__aligned__(16))); /*  0xe0  0x10 */
	bool                       running;              /*  0xf0   0x1 */

	/* XXX 3 bytes hole, try to pack */

	struct gnet_stats_queue    qstats;               /*  0xf4  0x14 */
	/* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */
	__u8                       __cacheline_group_end__Qdisc_write[0]; /* 0x108     0 */

	/* XXX 56 bytes hole, try to pack */

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251121083256.674562-8-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-25 16:10:32 +01:00
Eric Dumazet
2773cb0b31 net_sched: use qdisc_skb_cb(skb)->pkt_segs in bstats_update()
Avoid up to two cache line misses in qdisc dequeue() to fetch
skb_shinfo(skb)->gso_segs/gso_size while qdisc spinlock is held.

This gives a 5 % improvement in a TX intensive workload.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251121083256.674562-6-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-25 16:10:32 +01:00
Eric Dumazet
b2a38f6df9 net_sched: make room for (struct qdisc_skb_cb)->pkt_segs
Add a new u16 field, next to pkt_len : pkt_segs

This will cache shinfo->gso_segs to speed up qdisc deqeue().

Move slave_dev_queue_mapping at the end of qdisc_skb_cb,
and move three bits from tc_skb_cb :
- post_ct
- post_ct_snat
- post_ct_dnat

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251121083256.674562-2-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-25 16:10:31 +01:00
Lachlan Hodges
cba1ba11c1 wifi: cfg80211: include s1g_primary_2mhz when comparing chandefs
When comparing chandefs, ensure we include s1g_primary_2mhz.

Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com>
Link: https://patch.msgid.link/20251125025927.245280-3-lachlan.hodges@morsemicro.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-11-25 10:31:28 +01:00
Paolo Abeni
075b19c211 net: factor-out _sk_charge() helper
Move out of __inet_accept() the code dealing charging newly
accepted socket to memcg. MPTCP will soon use it to on a per
subflow basis, in different contexts.

No functional changes intended.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Geliang Tang <geliang@kernel.org>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-1-1f34b6c1e0b1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24 19:49:40 -08:00
Eric Dumazet
4fe5a00ec7 net: sched: fix TCF_LAYER_TRANSPORT handling in tcf_get_base_ptr()
syzbot reported that tcf_get_base_ptr() can be called while transport
header is not set [1].

Instead of returning a dangling pointer, return NULL.

Fix tcf_get_base_ptr() callers to handle this NULL value.

[1]
 WARNING: CPU: 1 PID: 6019 at ./include/linux/skbuff.h:3071 skb_transport_header include/linux/skbuff.h:3071 [inline]
 WARNING: CPU: 1 PID: 6019 at ./include/linux/skbuff.h:3071 tcf_get_base_ptr include/net/pkt_cls.h:539 [inline]
 WARNING: CPU: 1 PID: 6019 at ./include/linux/skbuff.h:3071 em_nbyte_match+0x2d8/0x3f0 net/sched/em_nbyte.c:43
Modules linked in:
CPU: 1 UID: 0 PID: 6019 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
Call Trace:
 <TASK>
  tcf_em_match net/sched/ematch.c:494 [inline]
  __tcf_em_tree_match+0x1ac/0x770 net/sched/ematch.c:520
  tcf_em_tree_match include/net/pkt_cls.h:512 [inline]
  basic_classify+0x115/0x2d0 net/sched/cls_basic.c:50
  tc_classify include/net/tc_wrapper.h:197 [inline]
  __tcf_classify net/sched/cls_api.c:1764 [inline]
  tcf_classify+0x4cf/0x1140 net/sched/cls_api.c:1860
  multiq_classify net/sched/sch_multiq.c:39 [inline]
  multiq_enqueue+0xfd/0x4c0 net/sched/sch_multiq.c:66
  dev_qdisc_enqueue+0x4e/0x260 net/core/dev.c:4118
  __dev_xmit_skb net/core/dev.c:4214 [inline]
  __dev_queue_xmit+0xe83/0x3b50 net/core/dev.c:4729
  packet_snd net/packet/af_packet.c:3076 [inline]
  packet_sendmsg+0x3e33/0x5080 net/packet/af_packet.c:3108
  sock_sendmsg_nosec net/socket.c:727 [inline]
  __sock_sendmsg+0x21c/0x270 net/socket.c:742
  ____sys_sendmsg+0x505/0x830 net/socket.c:2630

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: syzbot+f3a497f02c389d86ef16@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6920855a.a70a0220.2ea503.0058.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20251121154100.1616228-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24 18:53:14 -08:00
Daniel Zahka
2a367002ed devlink: support default values for param-get and param-set
Support querying and resetting to default param values.

Introduce two new devlink netlink attrs:
DEVLINK_ATTR_PARAM_VALUE_DEFAULT and
DEVLINK_ATTR_PARAM_RESET_DEFAULT. The former is used to contain an
optional parameter value inside of the param_value nested
attribute. The latter is used in param-set requests from userspace to
indicate that the driver should reset the param to its default value.

To implement this, two new functions are added to the devlink driver
api: devlink_param::get_default() and
devlink_param::reset_default(). These callbacks allow drivers to
implement default param actions for runtime and permanent cmodes. For
driverinit params, the core latches the last value set by a driver via
devl_param_driverinit_value_set(), and uses that as the default value
for a param.

Because default parameter values are optional, it would be impossible
to discern whether or not a param of type bool has default value of
false or not provided if the default value is encoded using a netlink
flag type. For this reason, when a DEVLINK_PARAM_TYPE_BOOL has an
associated default value, the default value is encoded using a u8
type.

Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20251119025038.651131-4-daniel.zahka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-20 19:01:22 -08:00
Daniel Zahka
011d133bb9 devlink: pass extack through to devlink_param::get()
Allow devlink_param::get() handlers to report error messages via
extack. This function is called in a few different contexts, but not
all of them will have an valid extack to use.

When devlink_param::get() is called from param_get_doit or
param_get_dumpit contexts, pass the extack through so that drivers can
report errors when retrieving param values. devlink_param::get() is
called from the context of devlink_param_notify(), pass NULL in for
the extack.

Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20251119025038.651131-2-daniel.zahka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-20 19:01:22 -08:00
Eric Dumazet
ecfea98b7d tcp: add net.ipv4.tcp_rcvbuf_low_rtt
This is a follow up of commit aa251c8463 ("tcp: fix too slow
tcp_rcvbuf_grow() action") which brought again the issue that I tried
to fix in commit 65c5287892 ("tcp: fix sk_rcvbuf overshoot")

We also recently increased tcp_rmem[2] to 32 MB in commit 572be9bf9d
("tcp: increase tcp_rmem[2] to 32 MB")

Idea of this patch is to not let tcp_rcvbuf_grow() grow sk->sk_rcvbuf
too fast for small RTT flows. If sk->sk_rcvbuf is too big, this can
force NIC driver to not recycle pages from their page pool, and also
can cause cache evictions for DDIO enabled cpus/NIC, as receivers
are usually slower than senders.

Add net.ipv4.tcp_rcvbuf_low_rtt sysctl, set by default to 1000 usec (1 ms)

If RTT if smaller than the sysctl value, use the RTT/tcp_rcvbuf_low_rtt
ratio to control sk_rcvbuf inflation.

Tested:

Pair of hosts with a 200Gbit IDPF NIC. Using netperf/netserver

Client initiates 8 TCP bulk flows, asking netserver to use CPU #10 only.

super_netperf 8 -H server -T,10 -l 30

On server, use perf -e tcp:tcp_rcvbuf_grow while test is running.

Before:

sysctl -w net.ipv4.tcp_rcvbuf_low_rtt=1
perf record -a -e tcp:tcp_rcvbuf_grow sleep 30 ; perf script|tail -20|cut -c30-230
 1153.051201: tcp:tcp_rcvbuf_grow: time=398 rtt_us=382 copied=6905856 inq=180224 space=6115328 ooo=0 scaling_ratio=240 rcvbuf=27666235 rcv_ssthresh=25878235 window_clamp=25937095 rcv_wnd=25600000 famil
 1153.138752: tcp:tcp_rcvbuf_grow: time=446 rtt_us=413 copied=5529600 inq=180224 space=4505600 ooo=0 scaling_ratio=240 rcvbuf=23068672 rcv_ssthresh=21571860 window_clamp=21626880 rcv_wnd=21286912 famil
 1153.361484: tcp:tcp_rcvbuf_grow: time=415 rtt_us=380 copied=7061504 inq=204800 space=6725632 ooo=0 scaling_ratio=240 rcvbuf=27666235 rcv_ssthresh=25878235 window_clamp=25937095 rcv_wnd=25600000 famil
 1153.457642: tcp:tcp_rcvbuf_grow: time=483 rtt_us=421 copied=5885952 inq=720896 space=4407296 ooo=0 scaling_ratio=240 rcvbuf=23763511 rcv_ssthresh=22223271 window_clamp=22278291 rcv_wnd=21430272 famil
 1153.466002: tcp:tcp_rcvbuf_grow: time=308 rtt_us=281 copied=3244032 inq=180224 space=2883584 ooo=0 scaling_ratio=240 rcvbuf=44854314 rcv_ssthresh=41992059 window_clamp=42050919 rcv_wnd=41713664 famil
 1153.747792: tcp:tcp_rcvbuf_grow: time=394 rtt_us=332 copied=4460544 inq=585728 space=3063808 ooo=0 scaling_ratio=240 rcvbuf=44854314 rcv_ssthresh=41992059 window_clamp=42050919 rcv_wnd=41373696 famil
 1154.260747: tcp:tcp_rcvbuf_grow: time=652 rtt_us=226 copied=10977280 inq=737280 space=9486336 ooo=0 scaling_ratio=240 rcvbuf=31165538 rcv_ssthresh=29197743 window_clamp=29217691 rcv_wnd=28368896 fami
 1154.375019: tcp:tcp_rcvbuf_grow: time=461 rtt_us=443 copied=7573504 inq=507904 space=6856704 ooo=0 scaling_ratio=240 rcvbuf=27666235 rcv_ssthresh=25878235 window_clamp=25937095 rcv_wnd=25288704 famil
 1154.463072: tcp:tcp_rcvbuf_grow: time=494 rtt_us=408 copied=7983104 inq=200704 space=7065600 ooo=0 scaling_ratio=240 rcvbuf=27666235 rcv_ssthresh=25878235 window_clamp=25937095 rcv_wnd=25579520 famil
 1154.474658: tcp:tcp_rcvbuf_grow: time=507 rtt_us=459 copied=5586944 inq=540672 space=4718592 ooo=0 scaling_ratio=240 rcvbuf=17852266 rcv_ssthresh=16692999 window_clamp=16736499 rcv_wnd=16056320 famil
 1154.584657: tcp:tcp_rcvbuf_grow: time=494 rtt_us=427 copied=8126464 inq=204800 space=7782400 ooo=0 scaling_ratio=240 rcvbuf=27666235 rcv_ssthresh=25878235 window_clamp=25937095 rcv_wnd=25600000 famil
 1154.702117: tcp:tcp_rcvbuf_grow: time=480 rtt_us=406 copied=5734400 inq=180224 space=5349376 ooo=0 scaling_ratio=240 rcvbuf=23068672 rcv_ssthresh=21571860 window_clamp=21626880 rcv_wnd=21286912 famil
 1155.941595: tcp:tcp_rcvbuf_grow: time=717 rtt_us=670 copied=11042816 inq=3784704 space=7159808 ooo=0 scaling_ratio=240 rcvbuf=19581357 rcv_ssthresh=18333222 window_clamp=18357522 rcv_wnd=14614528 fam
 1156.384735: tcp:tcp_rcvbuf_grow: time=529 rtt_us=473 copied=9011200 inq=180224 space=7258112 ooo=0 scaling_ratio=240 rcvbuf=19581357 rcv_ssthresh=18333222 window_clamp=18357522 rcv_wnd=18018304 famil
 1157.821676: tcp:tcp_rcvbuf_grow: time=529 rtt_us=272 copied=8224768 inq=602112 space=6545408 ooo=0 scaling_ratio=240 rcvbuf=67000000 rcv_ssthresh=62793576 window_clamp=62812500 rcv_wnd=62115840 famil
 1158.906379: tcp:tcp_rcvbuf_grow: time=710 rtt_us=445 copied=11845632 inq=540672 space=10240000 ooo=0 scaling_ratio=240 rcvbuf=31165538 rcv_ssthresh=29205935 window_clamp=29217691 rcv_wnd=28536832 fam
 1164.600160: tcp:tcp_rcvbuf_grow: time=841 rtt_us=430 copied=12976128 inq=1290240 space=11304960 ooo=0 scaling_ratio=240 rcvbuf=31165538 rcv_ssthresh=29212591 window_clamp=29217691 rcv_wnd=27856896 fa
 1165.163572: tcp:tcp_rcvbuf_grow: time=845 rtt_us=800 copied=12632064 inq=540672 space=7921664 ooo=0 scaling_ratio=240 rcvbuf=27666235 rcv_ssthresh=25912795 window_clamp=25937095 rcv_wnd=25260032 fami
 1165.653464: tcp:tcp_rcvbuf_grow: time=388 rtt_us=309 copied=4493312 inq=180224 space=3874816 ooo=0 scaling_ratio=240 rcvbuf=44854314 rcv_ssthresh=41995899 window_clamp=42050919 rcv_wnd=41713664 famil
 1166.651211: tcp:tcp_rcvbuf_grow: time=556 rtt_us=553 copied=6328320 inq=540672 space=5554176 ooo=0 scaling_ratio=240 rcvbuf=23068672 rcv_ssthresh=21571860 window_clamp=21626880 rcv_wnd=20946944 famil

After:

sysctl -w net.ipv4.tcp_rcvbuf_low_rtt=1000
perf record -a -e tcp:tcp_rcvbuf_grow sleep 30 ; perf script|tail -20|cut -c30-230
 1457.053149: tcp:tcp_rcvbuf_grow: time=128 rtt_us=24 copied=1441792 inq=40960 space=1269760 ooo=0 scaling_ratio=240 rcvbuf=2960741 rcv_ssthresh=2605474 window_clamp=2775694 rcv_wnd=2568192 family=AF_I
 1458.000778: tcp:tcp_rcvbuf_grow: time=128 rtt_us=31 copied=1441792 inq=24576 space=1400832 ooo=0 scaling_ratio=240 rcvbuf=3060163 rcv_ssthresh=2810042 window_clamp=2868902 rcv_wnd=2674688 family=AF_I
 1458.088059: tcp:tcp_rcvbuf_grow: time=190 rtt_us=110 copied=3227648 inq=385024 space=2781184 ooo=0 scaling_ratio=240 rcvbuf=6728240 rcv_ssthresh=6252705 window_clamp=6307725 rcv_wnd=5799936 family=AF
 1458.148549: tcp:tcp_rcvbuf_grow: time=232 rtt_us=129 copied=3956736 inq=237568 space=2842624 ooo=0 scaling_ratio=240 rcvbuf=6731333 rcv_ssthresh=6252705 window_clamp=6310624 rcv_wnd=5918720 family=AF
 1458.466861: tcp:tcp_rcvbuf_grow: time=193 rtt_us=83 copied=2949120 inq=180224 space=2457600 ooo=0 scaling_ratio=240 rcvbuf=5751438 rcv_ssthresh=5357689 window_clamp=5391973 rcv_wnd=5054464 family=AF_
 1458.775476: tcp:tcp_rcvbuf_grow: time=257 rtt_us=127 copied=4304896 inq=352256 space=3346432 ooo=0 scaling_ratio=240 rcvbuf=8067131 rcv_ssthresh=7523275 window_clamp=7562935 rcv_wnd=7061504 family=AF
 1458.776631: tcp:tcp_rcvbuf_grow: time=200 rtt_us=96 copied=3260416 inq=143360 space=2768896 ooo=0 scaling_ratio=240 rcvbuf=6397256 rcv_ssthresh=5938567 window_clamp=5997427 rcv_wnd=5828608 family=AF_
 1459.707973: tcp:tcp_rcvbuf_grow: time=215 rtt_us=96 copied=2506752 inq=163840 space=1388544 ooo=0 scaling_ratio=240 rcvbuf=3068867 rcv_ssthresh=2768282 window_clamp=2877062 rcv_wnd=2555904 family=AF_
 1460.246494: tcp:tcp_rcvbuf_grow: time=231 rtt_us=80 copied=3756032 inq=204800 space=3117056 ooo=0 scaling_ratio=240 rcvbuf=7288091 rcv_ssthresh=6773725 window_clamp=6832585 rcv_wnd=6471680 family=AF_
 1460.714596: tcp:tcp_rcvbuf_grow: time=270 rtt_us=110 copied=4714496 inq=311296 space=3719168 ooo=0 scaling_ratio=240 rcvbuf=8957739 rcv_ssthresh=8339020 window_clamp=8397880 rcv_wnd=7933952 family=AF
 1462.029977: tcp:tcp_rcvbuf_grow: time=101 rtt_us=19 copied=1105920 inq=40960 space=1036288 ooo=0 scaling_ratio=240 rcvbuf=2338970 rcv_ssthresh=2091684 window_clamp=2192784 rcv_wnd=1986560 family=AF_I
 1462.802385: tcp:tcp_rcvbuf_grow: time=89 rtt_us=45 copied=1069056 inq=0 space=1064960 ooo=0 scaling_ratio=240 rcvbuf=2338970 rcv_ssthresh=2091684 window_clamp=2192784 rcv_wnd=2035712 family=AF_INET6
 1462.918648: tcp:tcp_rcvbuf_grow: time=105 rtt_us=33 copied=1441792 inq=180224 space=1069056 ooo=0 scaling_ratio=240 rcvbuf=2383282 rcv_ssthresh=2091684 window_clamp=2234326 rcv_wnd=1896448 family=AF_
 1463.222533: tcp:tcp_rcvbuf_grow: time=273 rtt_us=144 copied=4603904 inq=385024 space=3469312 ooo=0 scaling_ratio=240 rcvbuf=8422564 rcv_ssthresh=7891053 window_clamp=7896153 rcv_wnd=7409664 family=AF
 1466.519312: tcp:tcp_rcvbuf_grow: time=130 rtt_us=23 copied=1343488 inq=0 space=1261568 ooo=0 scaling_ratio=240 rcvbuf=2780158 rcv_ssthresh=2493778 window_clamp=2606398 rcv_wnd=2494464 family=AF_INET6
 1466.681003: tcp:tcp_rcvbuf_grow: time=128 rtt_us=21 copied=1441792 inq=12288 space=1343488 ooo=0 scaling_ratio=240 rcvbuf=2932027 rcv_ssthresh=2578555 window_clamp=2748775 rcv_wnd=2568192 family=AF_I
 1470.689959: tcp:tcp_rcvbuf_grow: time=255 rtt_us=122 copied=3932160 inq=204800 space=3551232 ooo=0 scaling_ratio=240 rcvbuf=8182038 rcv_ssthresh=7647384 window_clamp=7670660 rcv_wnd=7442432 family=AF
 1471.754154: tcp:tcp_rcvbuf_grow: time=188 rtt_us=95 copied=2138112 inq=577536 space=1429504 ooo=0 scaling_ratio=240 rcvbuf=3113650 rcv_ssthresh=2806426 window_clamp=2919046 rcv_wnd=2248704 family=AF_
 1476.813542: tcp:tcp_rcvbuf_grow: time=269 rtt_us=99 copied=3088384 inq=180224 space=2564096 ooo=0 scaling_ratio=240 rcvbuf=6219470 rcv_ssthresh=5771893 window_clamp=5830753 rcv_wnd=5509120 family=AF_
 1477.738309: tcp:tcp_rcvbuf_grow: time=166 rtt_us=54 copied=1777664 inq=180224 space=1417216 ooo=0 scaling_ratio=240 rcvbuf=3117118 rcv_ssthresh=2874958 window_clamp=2922298 rcv_wnd=2613248 family=AF_

We can see sk_rcvbuf values are much smaller, and that rtt_us (estimation of rtt
from a receiver point of view) is kept small, instead of being bloated.

No difference in throughput.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Tested-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/20251119084813.3684576-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-20 17:44:23 -08:00
Eric Dumazet
6d5dea6824 tcp: tcp_moderate_rcvbuf is only used in rx path
sysctl_tcp_moderate_rcvbuf is only used from tcp_rcvbuf_grow().

Move it to netns_ipv4_read_rx group.

Remove various CACHELINE_ASSERT_GROUP_SIZE() from netns_ipv4_struct_check(),
as they have no real benefit but cause pain for all changes.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251119084813.3684576-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-20 17:44:23 -08:00
Pauli Virtanen
79a2d4678b Bluetooth: hci_core: lookup hci_conn on RX path on protocol side
The hdev lock/lookup/unlock/use pattern in the packet RX path doesn't
ensure hci_conn* is not concurrently modified/deleted. This locking
appears to be leftover from before conn_hash started using RCU
commit bf4c632524 ("Bluetooth: convert conn hash to RCU")
and not clear if it had purpose since then.

Currently, there are code paths that delete hci_conn* from elsewhere
than the ordered hdev->workqueue where the RX work runs in. E.g.
commit 5af1f84ed1 ("Bluetooth: hci_sync: Fix UAF on hci_abort_conn_sync")
introduced some of these, and there probably were a few others before
it.  It's better to do the locking so that even if these run
concurrently no UAF is possible.

Move the lookup of hci_conn and associated socket-specific conn to
protocol recv handlers, and do them within a single critical section
to cover hci_conn* usage and lookup.

syzkaller has reported a crash that appears to be this issue:

    [Task hdev->workqueue]          [Task 2]
                                    hci_disconnect_all_sync
    l2cap_recv_acldata(hcon)
                                      hci_conn_get(hcon)
                                      hci_abort_conn_sync(hcon)
                                        hci_dev_lock
      hci_dev_lock
                                        hci_conn_del(hcon)
      v-------------------------------- hci_dev_unlock
                                      hci_conn_put(hcon)
      conn = hcon->l2cap_data (UAF)

Fixes: 5af1f84ed1 ("Bluetooth: hci_sync: Fix UAF on hci_abort_conn_sync")
Reported-by: syzbot+d32d77220b92eddd89ad@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=d32d77220b92eddd89ad
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-11-20 17:01:09 -05:00
Chris Lu
4015b97976 Bluetooth: btusb: mediatek: Fix kernel crash when releasing mtk iso interface
When performing reset tests and encountering abnormal card drop issues
that lead to a kernel crash, it is necessary to perform a null check
before releasing resources to avoid attempting to release a null pointer.

<4>[   29.158070] Hardware name: Google Quigon sku196612/196613 board (DT)
<4>[   29.158076] Workqueue: hci0 hci_cmd_sync_work [bluetooth]
<4>[   29.158154] pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
<4>[   29.158162] pc : klist_remove+0x90/0x158
<4>[   29.158174] lr : klist_remove+0x88/0x158
<4>[   29.158180] sp : ffffffc0846b3c00
<4>[   29.158185] pmr_save: 000000e0
<4>[   29.158188] x29: ffffffc0846b3c30 x28: ffffff80cd31f880 x27: ffffff80c1bdc058
<4>[   29.158199] x26: dead000000000100 x25: ffffffdbdc624ea3 x24: ffffff80c1bdc4c0
<4>[   29.158209] x23: ffffffdbdc62a3e6 x22: ffffff80c6c07000 x21: ffffffdbdc829290
<4>[   29.158219] x20: 0000000000000000 x19: ffffff80cd3e0648 x18: 000000031ec97781
<4>[   29.158229] x17: ffffff80c1bdc4a8 x16: ffffffdc10576548 x15: ffffff80c1180428
<4>[   29.158238] x14: 0000000000000000 x13: 000000000000e380 x12: 0000000000000018
<4>[   29.158248] x11: ffffff80c2a7fd10 x10: 0000000000000000 x9 : 0000000100000000
<4>[   29.158257] x8 : 0000000000000000 x7 : 7f7f7f7f7f7f7f7f x6 : 2d7223ff6364626d
<4>[   29.158266] x5 : 0000008000000000 x4 : 0000000000000020 x3 : 2e7325006465636e
<4>[   29.158275] x2 : ffffffdc11afeff8 x1 : 0000000000000000 x0 : ffffffdc11be4d0c
<4>[   29.158285] Call trace:
<4>[   29.158290]  klist_remove+0x90/0x158
<4>[   29.158298]  device_release_driver_internal+0x20c/0x268
<4>[   29.158308]  device_release_driver+0x1c/0x30
<4>[   29.158316]  usb_driver_release_interface+0x70/0x88
<4>[   29.158325]  btusb_mtk_release_iso_intf+0x68/0xd8 [btusb (HASH:e8b6 5)]
<4>[   29.158347]  btusb_mtk_reset+0x5c/0x480 [btusb (HASH:e8b6 5)]
<4>[   29.158361]  hci_cmd_sync_work+0x10c/0x188 [bluetooth (HASH:a4fa 6)]
<4>[   29.158430]  process_scheduled_works+0x258/0x4e8
<4>[   29.158441]  worker_thread+0x300/0x428
<4>[   29.158448]  kthread+0x108/0x1d0
<4>[   29.158455]  ret_from_fork+0x10/0x20
<0>[   29.158467] Code: 91343000 940139d1 f9400268 927ff914 (f9401297)
<4>[   29.158474] ---[ end trace 0000000000000000 ]---
<0>[   29.167129] Kernel panic - not syncing: Oops: Fatal exception
<2>[   29.167144] SMP: stopping secondary CPUs
<4>[   29.167158] ------------[ cut here ]------------

Fixes: ceac1cb025 ("Bluetooth: btusb: mediatek: add ISO data transmission functions")
Signed-off-by: Chris Lu <chris.lu@mediatek.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-11-20 16:51:14 -05:00
Jakub Kicinski
9e203721ec Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.18-rc7).

No conflicts, adjacent changes:

tools/testing/selftests/net/af_unix/Makefile
  e1bb28bf13 ("selftest: af_unix: Add test for SO_PEEK_OFF.")
  45a1cd8346 ("selftests: af_unix: Add tests for ECONNRESET and EOF semantics")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-20 09:13:26 -08:00
Pagadala Yesu Anjaneyulu
a77f0ad44f wifi: cfg80211: Add support for 6GHz AP role not relevant AP type
Add IEEE80211_6GHZ_CTRL_REG_AP_ROLE_NOT_RELEVANT
and map it to IEEE80211_REG_LPI_AP for safe regulatory compliance
when AP role classification is not applicable.
Use LPI as safe fallback to prevent power limit violations.

Signed-off-by: Pagadala Yesu Anjaneyulu <pagadala.yesu.anjaneyulu@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20251112110828.856283677cc7.I36138a34847c3b4e680974bf347dde844448f3bc@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-11-20 10:25:10 +01:00
Aditya Garg
45120304e8 net: mana: Drop TX skb on post_work_request failure and unmap resources
Drop TX packets when posting the work request fails and ensure DMA
mappings are always cleaned up.

Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/1763464269-10431-3-git-send-email-gargaditya@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-19 20:11:57 -08:00
Aditya Garg
934fa943b5 net: mana: Handle SKB if TX SGEs exceed hardware limit
The MANA hardware supports a maximum of 30 scatter-gather entries (SGEs)
per TX WQE. Exceeding this limit can cause TX failures.
Add ndo_features_check() callback to validate SKB layout before
transmission. For GSO SKBs that would exceed the hardware SGE limit, clear
NETIF_F_GSO_MASK to enforce software segmentation in the stack.
Add a fallback in mana_start_xmit() to linearize non-GSO SKBs that still
exceed the SGE limit.

Also, Add ethtool counter for SKBs linearized

Co-developed-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/1763464269-10431-2-git-send-email-gargaditya@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-19 20:11:57 -08:00
Jakub Kicinski
c3995fc1a8 ipsec-2025-11-18
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmkcEO8ACgkQrB3Eaf9P
 W7eXjA//ReWvgmIwM87WjEwI0E8y/ChS3GwWOMKo2XVwntkuctW+gvTfKn7WDMcs
 AuqbhCpoRdA1a3rEUWNBKoMT1PYmWHt4oElC2vEodIKcvrtVpOukyHQg5zaOTRni
 TCiXUD5kojyCC3YX8J2VXnIsvmHl/0Wo2iEd9MBivOkKXh7UGy/azOqPMhwmQBHx
 Ds37Mj86tRPylEaVtW9Js7BWTBWBCg5TpUJbJY8DvaYP1TBFduao2ExMo2dFPeYC
 495N856k+Pa1OVqW6Ss40I59UXmXbs5WcUd8mOhleqxUaAQoaUqSfQwdw0UErS+2
 lttuMH1pnNgpkWMgusXWgs8lxXiwbH74eIthtR6/9k/B80eKaQ5Rwp8sAZ0DV+8M
 FoL7PBHWQzWvc+/L+8zJ0g78mv5+ymvSdkl2ZQXPJiJF1hdZ31RGQAwlPDYqrq63
 WNu19dKwXzASWR/YBXO9vw7pdjljs8BXZcTMNDZcS3FgWonI47nTIpy0vjx9vinm
 4KzaIpg+cjEt1SNrO45sPoBmoMj642aEHtkAEhR47U8FHQTBW2/9l/WdpIJhYhjb
 IrVdVw32Fo55HJby14YlwODPpUJ0t/UcI32KdTXd5kI+UqqyeiIxdtLfaXiNTDGJ
 RQ80mTeG9AKxfcD7LGK73ndWJxBb+2C+6MPQN7+AF3rh1bFcGQ8=
 =h/E5
 -----END PGP SIGNATURE-----

Merge tag 'ipsec-2025-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec

Steffen Klassert says:

====================
pull request (net): ipsec 2025-11-18

1) Misc fixes for xfrm_state creation/modification/deletion.
   Patchset from Sabrina Dubroca.

2) Fix inner packet family determination for xfrm offloads.
   From Jianbo Liu.

3) Don't push locally generated packets directly to L2 tunnel
   mode offloading, they still need processing from the standard
   xfrm path. From Jianbo Liu.

4) Fix memory leaks in xfrm_add_acquire for policy offloads and policy
   security contexts. From Zilin Guan.

* tag 'ipsec-2025-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
  xfrm: fix memory leak in xfrm_add_acquire()
  xfrm: Prevent locally generated packets from direct output in tunnel mode
  xfrm: Determine inner GSO type from packet inner protocol
  xfrm: Check inner packet family directly from skb_dst
  xfrm: check all hash buckets for leftover states during netns deletion
  xfrm: set err and extack on failure to create pcpu SA
  xfrm: call xfrm_dev_state_delete when xfrm_state_migrate fails to add the state
  xfrm: make state as DEAD before final put when migrate fails
  xfrm: also call xfrm_state_delete_tunnel at destroy time for states that were never added
  xfrm: drop SA reference in xfrm_state_update if dir doesn't match
====================

Link: https://patch.msgid.link/20251118085344.2199815-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-18 17:58:44 -08:00
Erni Sri Satya Vennela
be4f1d67ec net: mana: Add standard counter rx_missed_errors
Report standard counter stats->rx_missed_errors
using hc_rx_discards_no_wqe from the hardware.

Add a global workqueue to periodically run
mana_query_gf_stats every 2 seconds to get the latest
info in eth_stats and define a driver capability flag
to notify hardware of the periodic queries.

To avoid repeated failures and log flooding, the workqueue
is not rescheduled if mana_query_gf_stats fails on HWC timeout
error and the stats are reset to 0. Other errors are transient
which will not need a VF reset for recovery.

Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/1763120599-6331-3-git-send-email-ernis@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-17 19:52:30 -08:00
Erni Sri Satya Vennela
e275d9091c net: mana: Move hardware counter stats from per-port to per-VF context
Move hardware counter (HC) statistics from mana_port_context to
mana_context to enable sharing stats across multiple network ports
on the same MANA VF. Previously, each network port queried
hardware counters independently using MANA_QUERY_GF_STAT command
(GF = Generic Function stats from GDMA hardware), resulting in
redundant queries when multiple ports existed on the same device.

Isolate hardware counter stats by introducing mana_ethtool_hc_stats
in mana_context and update the code to ensure all stats are properly
reported via ethtool -S <interface>, maintaining consistency with
previous behavior.

Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/1763120599-6331-2-git-send-email-ernis@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-17 19:52:30 -08:00
Shakeel Butt
d929525c2e memcg: net: track network throttling due to memcg memory pressure
The kernel can throttle network sockets if the memory cgroup associated
with the corresponding socket is under memory pressure.  The throttling
actions include clamping the transmit window, failing to expand receive or
send buffers, aggressively prune out-of-order receive queue, FIN deferred
to a retransmitted packet and more.  Let's add memcg metric to track such
throttling actions.

At the moment memcg memory pressure is defined through vmpressure and in
future it may be defined using PSI or we may add more flexible way for the
users to define memory pressure, maybe through ebpf.  However the
potential throttling actions will remain the same, so this newly
introduced metric will continue to track throttling actions irrespective
of how memcg memory pressure is defined.

Link: https://lkml.kernel.org/r/20251016161035.86161-1-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Daniel Sedlak <daniel.sedlak@cdn77.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kacinski <kuba@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-16 17:28:06 -08:00
Yue Haibing
06ac470658 sctp: Remove unused declaration sctp_auth_init_hmacs()
Commit bf40785fa4 ("sctp: Use HMAC-SHA1 and HMAC-SHA256 library for chunk
authentication") removed the implementation but leave declaration.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20251113114501.32905-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-14 18:00:34 -08:00
Eric Dumazet
6d650ae928 tcp: gro: inline tcp_gro_pull_header()
tcp_gro_pull_header() is used in GRO fast path, inline it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251113140358.58242-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-14 18:00:08 -08:00
Heiner Kallweit
4aa73c6051 net: dsa: remove definition of struct dsa_switch_driver
Since 93e86b3bc8 ("net: dsa: Remove legacy probing support")
this struct has no user any longer.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://patch.msgid.link/4053a98f-052f-4dc1-a3d4-ed9b3d3cc7cb@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-13 17:40:22 -08:00
Jakub Kicinski
c99ebb6132 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.18-rc6).

No conflicts, adjacent changes in:

drivers/net/phy/micrel.c
  96a9178a29 ("net: phy: micrel: lan8814 fix reset of the QSGMII interface")
  61b7ade9ba ("net: phy: micrel: Add support for non PTP SKUs for lan8814")

and a trivial one in tools/testing/selftests/drivers/net/Makefile.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-13 12:35:38 -08:00
Linus Torvalds
d0309c0543 Including fixes from Bluetooth and Wireless. No known outstanding
regressions.
 
 Current release - regressions:
 
   - eth: bonding: fix mii_status when slave is down
 
   - eth: mlx5e: fix missing error assignment in mlx5e_xfrm_add_state()
 
 Previous releases - regressions:
 
   - sched: limit try_bulk_dequeue_skb() batches
 
   - ipv4: route: prevent rt_bind_exception() from rebinding stale fnhe
 
   - af_unix: initialise scc_index in unix_add_edge()
 
   - netpoll: fix incorrect refcount handling causing incorrect cleanup
 
   - bluetooth: don't hold spin lock over sleeping functions
 
   - hsr: Fix supervision frame sending on HSRv0
 
   - sctp: prevent possible shift out-of-bounds
 
   - tipc: fix use-after-free in tipc_mon_reinit_self().
 
   - dsa: tag_brcm: do not mark link local traffic as offloaded
 
   - eth: virtio-net: fix incorrect flags recording in big mode
 
 Previous releases - always broken:
 
   - sched: initialize struct tc_ife to fix kernel-infoleak
 
   - wifi:
     - mac80211: reject address change while connecting
     - iwlwifi: avoid toggling links due to wrong element use
 
   - bluetooth: cancel mesh send timer when hdev removed
 
   - strparser: fix signed/unsigned mismatch bug
 
   - handshake: fix memory leak in tls_handshake_accept()
 
 Misc:
 
   - selftests: mptcp: fix some flaky tests
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmkV/O8SHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkeBQQALgJv+sfJ/8H+BA3US7BU9kwydXtJneo
 mHKnse3NCiV3kXyJu7p+CIBU4xP8LMMmpVFZQ0dPKnbFyfGWjCX+UOEMU4NBnhG7
 dfSGWAxR8iS0gUi/5K4dT55LZjeQ392Zu2OqrBRjAAYG0s5vXbAcCx0jUhfmG+ZD
 rJupFYuRuF7W3UvWC/TQviY4oJ0goaZnrm6Y5ADX9rblPlzD5xgOTqZzSuUpLxQM
 Q37IGjW3F12FrNmqabC3MBQcNfWNpqUgkDfFxMJbxGPDe9CjvSAAmMVNmDDkr+EH
 eL5+M1cdH+L9RRBNSQ4SX9dsNwhyyU04wrEADGf39jcgw/ACXI+t0laj/Hm0O1xg
 vMOtqiQYSe8lVVdCswmi6BdYxBsNU6l2gx0evq/qGztFbDW9s5rn26uXy5CIqlJa
 04k1lmT5EeMpE9opwxPJpw+5LyjjtCsD6AGOlLb8DU6cbXKVUhxCZVHdDNnfxt4J
 ZfQo8aUx6X3vnDGMzPWEZbYMqd4va6hVPTJdUuqk1enuE6KfhKMhWbP5D9a/qiqM
 lhinWYdmBP6bKJlxdfsH7kgwhfuoi/jT54VYu33l5LmrOk7+tO7gLJcRhNZW9jsl
 KkJ+Wk1JQ7/oiQI8tcUCc+LEwwY54F+34HAHjFLNELW+bp/vvoMEVeZKku2YM3Gy
 xW+7WYdrx2RK
 =JAyr
 -----END PGP SIGNATURE-----

Merge tag 'net-6.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from Bluetooth and Wireless. No known outstanding
  regressions.

  Current release - regressions:

   - eth:
      - bonding: fix mii_status when slave is down
      - mlx5e: fix missing error assignment in mlx5e_xfrm_add_state()

  Previous releases - regressions:

   - sched: limit try_bulk_dequeue_skb() batches

   - ipv4: route: prevent rt_bind_exception() from rebinding stale fnhe

   - af_unix: initialise scc_index in unix_add_edge()

   - netpoll: fix incorrect refcount handling causing incorrect cleanup

   - bluetooth: don't hold spin lock over sleeping functions

   - hsr: Fix supervision frame sending on HSRv0

   - sctp: prevent possible shift out-of-bounds

   - tipc: fix use-after-free in tipc_mon_reinit_self().

   - dsa: tag_brcm: do not mark link local traffic as offloaded

   - eth: virtio-net: fix incorrect flags recording in big mode

  Previous releases - always broken:

   - sched: initialize struct tc_ife to fix kernel-infoleak

   - wifi:
      - mac80211: reject address change while connecting
      - iwlwifi: avoid toggling links due to wrong element use

   - bluetooth: cancel mesh send timer when hdev removed

   - strparser: fix signed/unsigned mismatch bug

   - handshake: fix memory leak in tls_handshake_accept()

  Misc:

   - selftests: mptcp: fix some flaky tests"

* tag 'net-6.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (60 commits)
  hsr: Follow standard for HSRv0 supervision frames
  hsr: Fix supervision frame sending on HSRv0
  virtio-net: fix incorrect flags recording in big mode
  ipv4: route: Prevent rt_bind_exception() from rebinding stale fnhe
  wifi: iwlwifi: mld: always take beacon ies in link grading
  wifi: iwlwifi: mvm: fix beacon template/fixed rate
  wifi: iwlwifi: fix aux ROC time event iterator usage
  net_sched: limit try_bulk_dequeue_skb() batches
  selftests: mptcp: join: properly kill background tasks
  selftests: mptcp: connect: trunc: read all recv data
  selftests: mptcp: join: userspace: longer transfer
  selftests: mptcp: join: endpoints: longer transfer
  selftests: mptcp: join: rm: set backup flag
  selftests: mptcp: connect: fix fallback note due to OoO
  ethtool: fix incorrect kernel-doc style comment in ethtool.h
  mlx5: Fix default values in create CQ
  Bluetooth: btrtl: Avoid loading the config file on security chips
  net/mlx5e: Fix potentially misleading debug message
  net/mlx5e: Fix wraparound in rate limiting for values above 255 Gbps
  net/mlx5e: Fix maxrate wraparound in threshold between units
  ...
2025-11-13 11:20:25 -08:00
Jakub Kicinski
e949824730 More -next material, notably:
- split ieee80211.h file, it's way too big
  - mac80211: initial chanctx work towards NAN
  - mac80211: MU-MIMO sniffer improvements
  - ath12k: statistics improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmkUdDsACgkQ10qiO8sP
 aABeDhAAnBtrADnRQYo5BMbV6/7JsJL3GV6Zj5/Nrq6si8TE8vquf4kevUbISpgk
 Mi2tQ0UxFTW6LckT3LTOxzQSZzaPANSPO9AQm/q9/BLtAsdPpgc8yHQTkZlkiatF
 dS+WZFSZpF8hisKmkYCDvnggaqipnJUvnwtIY6xxSZftn3J4h6B9Wp+XjyCLtDxC
 2DvztdQJ3oqYBFsSpk6J0gA0/4lF+jlVmZ+DPpVSYlCRJivFAqcpwdx3vdh4Pib3
 dUNgh/MJuPv01RIA783TCsHBnOKPxuD5OfusQzkXdj33yX2bcPL6M2s57FgnFf8q
 l4B9+R/Q7/8ohp+qMOd4S+SteFa/7WlbJ+5UjJ71y7xScBKOaRZMf6wKKqSZozP1
 zOB4AxlC7COrL3tsljC0Vun9CgBL4Ov/XBe7G2WTOUIrZ2KOU328/3atndbLeVJg
 knwsiNdKJCJJpTkO3zHzaYfDhDghSaINj1fl67hZV7s3Jj7u4lAD8HfV/9CKoMRd
 X1ltgB84u/nFc2aL2fGQbQg7NJLVqIvyx6iyss6K58nNwQMf0ZFUJOihgkSMCRK3
 t4qQrXVdorWnRvioA/roHICkGBZdZw53Jz+0EltRxsjTfmzkki6EbXeGhWtRzlSo
 5Bx1L4vtK7143nMlD2H/JuwDopD1fOxAM8L3sTlqJE8nu/3plPA=
 =N5Tm
 -----END PGP SIGNATURE-----

Merge tag 'wireless-next-2025-11-12' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Johannes Berg says:

====================
More -next material, notably:
 - split ieee80211.h file, it's way too big
 - mac80211: initial chanctx work towards NAN
 - mac80211: MU-MIMO sniffer improvements
 - ath12k: statistics improvements

* tag 'wireless-next-2025-11-12' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (26 commits)
  wifi: cw1200: Fix potential memory leak in cw1200_bh_rx_helper()
  wifi: mac80211: make monitor link info check more specific
  wifi: mac80211: track MU-MIMO configuration on disabled interfaces
  wifi: cfg80211/mac80211: Add fallback mechanism for INDOOR_SP connection
  wifi: cfg80211/mac80211: clean up duplicate ap_power handling
  wifi: cfg80211: use a C99 initializer in wiphy_register
  wifi: cfg80211: fix doc of struct key_params
  wifi: mac80211: remove unnecessary vlan NULL check
  wifi: mac80211: pass frame type to element parsing
  wifi: mac80211: remove "disabling VHT" message
  wifi: mac80211: add and use chanctx usage iteration
  wifi: mac80211: simplify ieee80211_recalc_chanctx_min_def() API
  wifi: mac80211: remove chanctx to link back-references
  wifi: mac80211: make link iteration safe for 'break'
  wifi: mac80211: fix EHT typo
  wifi: cfg80211: fix EHT typo
  wifi: ieee80211: split NAN definitions out
  wifi: ieee80211: split P2P definitions out
  wifi: ieee80211: split S1G definitions out
  wifi: ieee80211: split EHT definitions out
  ...
====================

Link: https://patch.msgid.link/20251112115126.16223-4-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-12 09:33:24 -08:00
Luiz Augusto von Dentz
485e0626e5 Bluetooth: hci_event: Fix not handling PA Sync Lost event
This handles PA Sync Lost event which previously was assumed to be
handled with BIG Sync Lost but their lifetime are not the same thus why
there are 2 different events to inform when each sync is lost.

Fixes: b2a5f2e1c1 ("Bluetooth: hci_event: Add support for handling LE BIG Sync Lost event")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-11-11 08:55:18 -05:00
Pagadala Yesu Anjaneyulu
b54cf0f449 wifi: cfg80211/mac80211: Add fallback mechanism for INDOOR_SP connection
Implement fallback to LPI mode when SP mode is not permitted
by regulatory constraints for INDOOR_SP connections.
Limit fallback mechanism to client mode.

Signed-off-by: Pagadala Yesu Anjaneyulu <pagadala.yesu.anjaneyulu@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20251110140806.8b43201a34ae.I37fc7bb5892eb9d044d619802e8f2095fde6b296@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-11-11 11:05:00 +01:00
Pagadala Yesu Anjaneyulu
e18efacc9c wifi: cfg80211/mac80211: clean up duplicate ap_power handling
Move duplicated ap_power type handling code to an inline
function in cfg80211.

Signed-off-by: Pagadala Yesu Anjaneyulu <pagadala.yesu.anjaneyulu@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20251110140806.959948da1cb5.I893b5168329fb3232f249c182a35c99804112da6@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-11-11 11:05:00 +01:00
Jason Xing
8da7bea7db xsk: add indirect call for xsk_destruct_skb
Since Eric proposed an idea about adding indirect call wrappers for
UDP and managed to see a huge improvement[1], the same situation can
also be applied in xsk scenario.

This patch adds an indirect call for xsk and helps current copy mode
improve the performance by around 1% stably which was observed with
IXGBE at 10Gb/sec loaded. If the throughput grows, the positive effect
will be magnified. I applied this patch on top of batch xmit series[2],
and was able to see <5% improvement from our internal application
which is a little bit unstable though.

Use INDIRECT wrappers to keep xsk_destruct_skb static as it used to
be when the mitigation config is off.

Be aware of the freeing path that can be very hot since the frequency
can reach around 2,000,000 times per second with the xdpsock test.

[1]: https://lore.kernel.org/netdev/20251006193103.2684156-2-edumazet@google.com/
[2]: https://lore.kernel.org/all/20251021131209.41491-1-kerneljasonxing@gmail.com/

Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20251031103328.95468-1-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-11 10:21:08 +01:00
Jakub Kicinski
7fc2bf8d30 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ6NaUOruQGUkvPdG4raS+Z+3y5EwUCaRJJFwAKCRAraS+Z+3y5
 E8eZAQDWQ3D76HOlLK8tBAQ8aSpxwsfr7fpheiMSCEI5r7O5sQEA5LuHrBvEb/Cr
 zfU7DxhEQEoVeJMTSia2hEJKWhoNpQ0=
 =n2Ap
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Martin KaFai Lau says:

====================
pull-request: bpf-next 2025-11-10

We've added 19 non-merge commits during the last 3 day(s) which contain
a total of 22 files changed, 1345 insertions(+), 197 deletions(-).

The main changes are:

1) Preserve skb metadata after a TC BPF program has changed the skb,
   from Jakub Sitnicki.
   This allows a TC program at the end of a TC filter chain to still see
   the skb metadata, even if another TC program at the front of the chain
   has changed the skb using BPF helpers.

2) Initial af_smc bpf_struct_ops support to control the smc specific
   syn/synack options, from D. Wythe.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  bpf/selftests: Add selftest for bpf_smc_hs_ctrl
  net/smc: bpf: Introduce generic hook for handshake flow
  bpf: Export necessary symbols for modules with struct_ops
  selftests/bpf: Cover skb metadata access after bpf_skb_change_proto
  selftests/bpf: Cover skb metadata access after change_head/tail helper
  selftests/bpf: Cover skb metadata access after bpf_skb_adjust_room
  selftests/bpf: Cover skb metadata access after vlan push/pop helper
  selftests/bpf: Expect unclone to preserve skb metadata
  selftests/bpf: Dump skb metadata on verification failure
  selftests/bpf: Verify skb metadata in BPF instead of userspace
  bpf: Make bpf_skb_change_head helper metadata-safe
  bpf: Make bpf_skb_change_proto helper metadata-safe
  bpf: Make bpf_skb_adjust_room metadata-safe
  bpf: Make bpf_skb_vlan_push helper metadata-safe
  bpf: Make bpf_skb_vlan_pop helper metadata-safe
  vlan: Make vlan_remove_tag return nothing
  bpf: Unclone skb head on bpf_dynptr_write to skb metadata
  net: Preserve metadata on pskb_expand_head
  net: Helper to move packet data and metadata after skb_push/pull
====================

Link: https://patch.msgid.link/20251110232427.3929291-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-10 16:43:51 -08:00
Kuniyuki Iwashima
73edb26b06 sctp: Don't inherit do_auto_asconf in sctp_clone_sock().
syzbot reported list_del(&sp->auto_asconf_list) corruption
in sctp_destroy_sock().

The repro calls setsockopt(SCTP_AUTO_ASCONF, 1) to a SCTP
listener, calls accept(), and close()s the child socket.

setsockopt(SCTP_AUTO_ASCONF, 1) sets sp->do_auto_asconf
to 1 and links sp->auto_asconf_list to a per-netns list.

Both fields are placed after sp->pd_lobby in struct sctp_sock,
and sctp_copy_descendant() did not copy the fields before the
cited commit.

Also, sctp_clone_sock() did not set them explicitly.

In addition, sctp_auto_asconf_init() is called from
sctp_sock_migrate(), but it initialises the fields only
conditionally.

The two fields relied on __GFP_ZERO added in sk_alloc(),
but sk_clone() does not use it.

Let's clear newsp->do_auto_asconf in sctp_clone_sock().

[0]:
list_del corruption. prev->next should be ffff8880799e9148, but was ffff8880799e8808. (prev=ffff88803347d9f8)
kernel BUG at lib/list_debug.c:64!
Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 0 UID: 0 PID: 6008 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/02/2025
RIP: 0010:__list_del_entry_valid_or_report+0x15a/0x190 lib/list_debug.c:62
Code: e8 7b 26 71 fd 43 80 3c 2c 00 74 08 4c 89 ff e8 7c ee 92 fd 49 8b 17 48 c7 c7 80 0a bf 8b 48 89 de 4c 89 f9 e8 07 c6 94 fc 90 <0f> 0b 4c 89 f7 e8 4c 26 71 fd 43 80 3c 2c 00 74 08 4c 89 ff e8 4d
RSP: 0018:ffffc90003067ad8 EFLAGS: 00010246
RAX: 000000000000006d RBX: ffff8880799e9148 RCX: b056988859ee6e00
RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000
RBP: dffffc0000000000 R08: ffffc90003067807 R09: 1ffff9200060cf00
R10: dffffc0000000000 R11: fffff5200060cf01 R12: 1ffff1100668fb3f
R13: dffffc0000000000 R14: ffff88803347d9f8 R15: ffff88803347d9f8
FS:  00005555823e5500(0000) GS:ffff88812613e000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000200000000480 CR3: 00000000741ce000 CR4: 00000000003526f0
Call Trace:
 <TASK>
 __list_del_entry_valid include/linux/list.h:132 [inline]
 __list_del_entry include/linux/list.h:223 [inline]
 list_del include/linux/list.h:237 [inline]
 sctp_destroy_sock+0xb4/0x370 net/sctp/socket.c:5163
 sk_common_release+0x75/0x310 net/core/sock.c:3961
 sctp_close+0x77e/0x900 net/sctp/socket.c:1550
 inet_release+0x144/0x190 net/ipv4/af_inet.c:437
 __sock_release net/socket.c:662 [inline]
 sock_close+0xc3/0x240 net/socket.c:1455
 __fput+0x44c/0xa70 fs/file_table.c:468
 task_work_run+0x1d4/0x260 kernel/task_work.c:227
 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
 exit_to_user_mode_loop+0xe9/0x130 kernel/entry/common.c:43
 exit_to_user_mode_prepare include/linux/irq-entry-common.h:225 [inline]
 syscall_exit_to_user_mode_work include/linux/entry-common.h:175 [inline]
 syscall_exit_to_user_mode include/linux/entry-common.h:210 [inline]
 do_syscall_64+0x2bd/0xfa0 arch/x86/entry/syscall_64.c:100
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Fixes: 16942cf4d3 ("sctp: Use sk_clone() in sctp_accept().")
Reported-by: syzbot+ba535cb417f106327741@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/690d2185.a70a0220.22f260.000e.GAE@google.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20251106223418.1455510-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-10 16:22:09 -08:00
D. Wythe
15f295f556 net/smc: bpf: Introduce generic hook for handshake flow
The introduction of IPPROTO_SMC enables eBPF programs to determine
whether to use SMC based on the context of socket creation, such as
network namespaces, PID and comm name, etc.

As a subsequent enhancement, to introduce a new generic hook that
allows decisions on whether to use SMC or not at runtime, including
but not limited to local/remote IP address or ports.

User can write their own implememtion via bpf_struct_ops now to choose
whether to use SMC or not before TCP 3rd handshake to be comleted.

Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Link: https://patch.msgid.link/20251107035632.115950-3-alibuda@linux.alibaba.com
2025-11-10 11:19:41 -08:00
Chien Wong
473235677a wifi: cfg80211: fix doc of struct key_params
The seq in struct key_params is for many ciphers, including CCMP, GCMP,
CMAC, GMAC. In addition to get_key(), it is also used when setting keys.

Signed-off-by: Chien Wong <m@xv97.com>
Link: https://patch.msgid.link/20251107142332.181308-1-m@xv97.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-11-10 10:39:14 +01:00
Johannes Berg
1a1cad924e wifi: mac80211: fix EHT typo
This is clearly EHT, not ETH, fix the typo.

Link: https://patch.msgid.link/20251105153958.12a04517f7ec.Idcf800817fa30605b1002c3d2287cad016e7aea7@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-11-10 10:38:37 +01:00
Johannes Berg
30b6089aad wifi: cfg80211: fix EHT typo
This is clearly EHT, not ETH, fix the typo.

Link: https://patch.msgid.link/20251105153958.e9d4af3b768e.I5f3378326837e3f62928a2f1fd3403f29cea069b@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-11-10 10:38:36 +01:00
Jakub Kicinski
a0c3aefb08 Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2025-11-06 (i40, ice, iavf)

Mohammad Heib introduces a new devlink parameter, max_mac_per_vf, for
controlling the maximum number of MAC address filters allowed by a VF. This
allows administrators to control the VF behavior in a more nuanced manner.

Aleksandr and Przemek add support for Receive Side Scaling of GTP to iAVF
for VFs running on E800 series ice hardware. This improves performance and
scalability for virtualized network functions in 5G and LTE deployments.

* '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  iavf: add RSS support for GTP protocol via ethtool
  ice: Extend PTYPE bitmap coverage for GTP encapsulated flows
  ice: improve TCAM priority handling for RSS profiles
  ice: implement GTP RSS context tracking and configuration
  ice: add virtchnl definitions and static data for GTP RSS
  ice: add flow parsing for GTP and new protocol field support
  i40e: support generic devlink param "max_mac_per_vf"
  devlink: Add new "max_mac_per_vf" generic device param
====================

Link: https://patch.msgid.link/20251106225321.1609605-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-07 19:15:36 -08:00
Jakub Kicinski
f05d26198c psp: add stats from psp spec to driver facing api
Provide a driver api for reporting device statistics required by the
"Implementation Requirements" section of the PSP Architecture
Specification. Use a warning to ensure drivers report stats required
by the spec.

Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20251106002608.1578518-4-daniel.zahka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-07 18:53:57 -08:00
Jakub Kicinski
dae4a92399 psp: report basic stats from the core
Track and report stats common to all psp devices from the core. A
'stale-event' is when the core marks the rx state of an active
psp_assoc as incapable of authenticating psp encapsulated data.

Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20251106002608.1578518-2-daniel.zahka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-07 18:53:56 -08:00
Eric Dumazet
416dd649f3 tcp: add net.ipv4.tcp_comp_sack_rtt_percent
TCP SACK compression has been added in 2018 in commit
5d9f4262b7 ("tcp: add SACK compression").

It is working great for WAN flows (with large RTT).
Wifi in particular gets a significant boost _when_ ACK are suppressed.

Add a new sysctl so that we can tune the very conservative 5 % value
that has been used so far in this formula, so that small RTT flows
can benefit from this feature.

delay = min ( 5 % of RTT, 1 ms)

This patch adds new tcp_comp_sack_rtt_percent sysctl
to ease experiments and tuning.

Given that we cap the delay to 1ms (tcp_comp_sack_delay_ns sysctl),
set the default value to 33 %.

Quoting Neal Cardwell ( https://lore.kernel.org/netdev/CADVnQymZ1tFnEA1Q=vtECs0=Db7zHQ8=+WCQtnhHFVbEOzjVnQ@mail.gmail.com/ )

The rationale for 33% is basically to try to facilitate pipelining,
where there are always at least 3 ACKs and 3 GSO/TSO skbs per SRTT, so
that the path can maintain a budget for 3 full-sized GSO/TSO skbs "in
flight" at all times:

+ 1 skb in the qdisc waiting to be sent by the NIC next
+ 1 skb being sent by the NIC (being serialized by the NIC out onto the wire)
+ 1 skb being received and aggregated by the receiver machine's
aggregation mechanism (some combination of LRO, GRO, and sack
compression)

Note that this is basically the same magic number (3) and the same
rationales as:

(a) tcp_tso_should_defer() ensuring that we defer sending data for no
longer than cwnd/tcp_tso_win_divisor (where tcp_tso_win_divisor = 3),
and
(b) bbr_quantization_budget() ensuring that cwnd is at least 3 GSO/TSO
skbs to maintain pipelining and full throughput at low RTTs

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20251106115236.3450026-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-07 18:41:44 -08:00
Kuniyuki Iwashima
1e9d3005e0 tcp: Apply max RTO to non-TFO SYN+ACK.
Since commit 54a378f434 ("tcp: add the ability to control
max RTO"), TFO SYN+ACK RTO is capped by the TFO full sk's
inet_csk(sk)->icsk_rto_max.

The value is inherited from the parent listener.

Let's apply the same cap to non-TFO SYN+ACK.

Note that req->rsk_listener is always non-NULL when we call
tcp_reqsk_timeout() in reqsk_timer_handler() or tcp_check_req().

It could be NULL for SYN cookie req, but we do not use
req->timeout then.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251106003357.273403-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-07 18:05:26 -08:00
Kuniyuki Iwashima
207ce0f6bc tcp: Remove timeout arg from reqsk_timeout().
reqsk_timeout() is always called with @timeout being TCP_RTO_MAX.

Let's remove the arg.

As a prep for the next patch, reqsk_timeout() is moved to tcp.h
and renamed to tcp_reqsk_timeout().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251106003357.273403-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-07 18:05:26 -08:00
Kuniyuki Iwashima
3ce5dd8161 tcp: Remove timeout arg from reqsk_queue_hash_req().
inet_csk_reqsk_queue_hash_add() is no longer shared by DCCP.

We do not need to pass req->timeout down to reqsk_queue_hash_req().

Let's move tcp_timeout_init() from tcp_conn_request() to
reqsk_queue_hash_req().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251106003357.273403-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-07 18:05:25 -08:00
Kuniyuki Iwashima
be88c549e9 tcp: Call tcp_syn_ack_timeout() directly.
Since DCCP has been removed, we do not need to use
request_sock_ops.syn_ack_timeout().

Let's call tcp_syn_ack_timeout() directly.

Now other function pointers of request_sock_ops are
protocol-dependent.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251106003357.273403-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-07 18:05:25 -08:00
Daniel Borkmann
24ab8efb9a xsk: Move NETDEV_XDP_ACT_ZC into generic header
Move NETDEV_XDP_ACT_ZC into xdp_sock_drv.h header such that external code
can reuse it, and rename it into more generic NETDEV_XDP_ACT_XSK.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20251031212103.310683-7-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-06 16:46:11 -08:00
Daniel Golle
c6230446b1 net: dsa: add tagging driver for MaxLinear GSW1xx switch family
Add support for a new DSA tagging protocol driver for the MaxLinear
GSW1xx switch family. The GSW1xx switches use a proprietary 8-byte
special tag inserted between the source MAC address and the EtherType
field to indicate the source and destination ports for frames
traversing the CPU port.

Implement the tag handling logic to insert the special tag on transmit
and parse it on receive.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Reviewed-by: Alexander Sverdlin <alexander.sverdlin@siemens.com>
Tested-by: Alexander Sverdlin <alexander.sverdlin@siemens.com>
Link: https://patch.msgid.link/0e973ebfd9433c30c96f50670da9e9449a0d98f2.1762170107.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-06 14:16:17 -08:00
Mohammad Heib
9352d40c8b devlink: Add new "max_mac_per_vf" generic device param
Add a new device generic parameter to controls the maximum
number of MAC filters allowed per VF.

For example, to limit a VF to 3 MAC addresses:
 $ devlink dev param set pci/0000:3b:00.0 name max_mac_per_vf \
        value 3 \
        cmode runtime

Signed-off-by: Mohammad Heib <mheib@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-11-06 12:57:31 -08:00
Linus Torvalds
c90841db35 hardening fixes for v6.18-rc5
- Introduce __nocfi_generic for arm32 Clang (Nathan Chancellor)
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRSPkdeREjth1dHnSE2KwveOeQkuwUCaQz8RwAKCRA2KwveOeQk
 u51dAP9YhHIttRev7rdRwDWwrlhO+i2fps0vq8G9S6keSndr+AEAv8BtQexexPhI
 1oSNLeB/kVUVp5dQWjni48IgMxaG4A8=
 =zFGT
 -----END PGP SIGNATURE-----

Merge tag 'hardening-v6.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull hardening fixes from Kees Cook:
 "This is a work-around for a (now fixed) corner case in the arm32 build
  with Clang KCFI enabled.

   - Introduce __nocfi_generic for arm32 Clang (Nathan Chancellor)"

* tag 'hardening-v6.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  libeth: xdp: Disable generic kCFI pass for libeth_xdp_tx_xmit_bulk()
  ARM: Select ARCH_USES_CFI_GENERIC_LLVM_PASS
  compiler_types: Introduce __nocfi_generic
2025-11-06 11:54:59 -08:00
Jakub Kicinski
1ec9871fbb Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.18-rc5).

Conflicts:

drivers/net/wireless/ath/ath12k/mac.c
  9222582ec5 ("Revert "wifi: ath12k: Fix missing station power save configuration"")
  6917e268c4 ("wifi: ath12k: Defer vdev bring-up until CSA finalize to avoid stale beacon")
https://lore.kernel.org/11cece9f7e36c12efd732baa5718239b1bf8c950.camel@sipsolutions.net

Adjacent changes:

drivers/net/ethernet/intel/Kconfig
  b1d16f7c00 ("libie: depend on DEBUG_FS when building LIBIE_FWLOG")
  93f53db9f9 ("ice: switch to Page Pool")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-06 09:27:40 -08:00
Raju Rangoju
6b47af35a6 net: selftests: export packet creation helpers for driver use
Export the network selftest packet creation infrastructure to allow
network drivers to reuse the existing selftest framework instead of
duplicating packet creation code.

Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20251031111811.775434-1-Raju.Rangoju@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-06 13:38:11 +01:00
Kees Cook
449f68f8ff net: Convert proto callbacks from sockaddr to sockaddr_unsized
Convert struct proto pre_connect(), connect(), bind(), and bind_add()
callback function prototypes from struct sockaddr to struct sockaddr_unsized.
This does not change per-implementation use of sockaddr for passing around
an arbitrarily sized sockaddr struct. Those will be addressed in future
patches.

Additionally removes the no longer referenced struct sockaddr from
include/net/inet_common.h.

No binary changes expected.

Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20251104002617.2752303-5-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04 19:10:33 -08:00
Kees Cook
85cb0757d7 net: Convert proto_ops connect() callbacks to use sockaddr_unsized
Update all struct proto_ops connect() callback function prototypes from
"struct sockaddr *" to "struct sockaddr_unsized *" to avoid lying to the
compiler about object sizes. Calls into struct proto handlers gain casts
that will be removed in the struct proto conversion patch.

No binary changes expected.

Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20251104002617.2752303-3-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04 19:10:32 -08:00
Kees Cook
0e50474fa5 net: Convert proto_ops bind() callbacks to use sockaddr_unsized
Update all struct proto_ops bind() callback function prototypes from
"struct sockaddr *" to "struct sockaddr_unsized *" to avoid lying to the
compiler about object sizes. Calls into struct proto handlers gain casts
that will be removed in the struct proto conversion patch.

No binary changes expected.

Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20251104002617.2752303-2-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04 19:10:32 -08:00
Jason Xing
30ed05adca xsk: use a smaller new lock for shared pool case
- Split cq_lock into two smaller locks: cq_prod_lock and
  cq_cached_prod_lock
- Avoid disabling/enabling interrupts in the hot xmit path

In either xsk_cq_cancel_locked() or xsk_cq_reserve_locked() function,
the race condition is only between multiple xsks sharing the same
pool. They are all in the process context rather than interrupt context,
so now the small lock named cq_cached_prod_lock can be used without
handling interrupts.

While cq_cached_prod_lock ensures the exclusive modification of
@cached_prod, cq_prod_lock in xsk_cq_submit_addr_locked() only cares
about @producer and corresponding @desc. Both of them don't necessarily
be consistent with @cached_prod protected by cq_cached_prod_lock.
That's the reason why the previous big lock can be split into two
smaller ones. Please note that SPSC rule is all about the global state
of producer and consumer that can affect both layers instead of local
or cached ones.

Frequently disabling and enabling interrupt are very time consuming
in some cases, especially in a per-descriptor granularity, which now
can be avoided after this optimization, even when the pool is shared by
multiple xsks.

With this patch, the performance number[1] could go from 1,872,565 pps
to 1,961,009 pps. It's a minor rise of around 5%.

[1]: taskset -c 1 ./xdpsock -i enp2s0f1 -q 0 -t -S -s 64

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20251030000646.18859-3-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-04 16:10:53 +01:00
Kuniyuki Iwashima
e833eb2516 mpls: Protect net->mpls.platform_label with a per-netns mutex.
MPLS (re)uses RTNL to protect net->mpls.platform_label,
but the lock does not need to be RTNL at all.

Let's protect net->mpls.platform_label with a dedicated
per-netns mutex.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20251029173344.2934622-13-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-03 17:40:53 -08:00
Kuniyuki Iwashima
d8f9581e1b ipv6: Add in6_dev_rcu().
rcu_dereference_rtnl() does not clearly tell whether the caller
is under RCU or RTNL.

Let's add in6_dev_rcu() to make it easy to remove __in6_dev_get()
in the future.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20251029173344.2934622-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-03 17:40:46 -08:00
Eric Sandeen
1f3e4142c0 9p: convert to the new mount API
Convert 9p to the new mount API. This patch consolidates all parsing
into fs/9p/v9fs.c, which stores all results into a filesystem context
which can be passed to the various transports as needed.

Some of the parsing helper functions such as get_cache_mode() have been
eliminated in favor of using the new mount API's enum param type,
for simplicity.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Message-ID: <20251010214222.1347785-5-sandeen@redhat.com>
[ Dominique: handled source explicitly as per follow-up discussion ]
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2025-11-03 16:49:53 +09:00
Eric Sandeen
075e8bd412 9p: create a v9fs_context structure to hold parsed options
This patch creates a new v9fs_context structure which includes
new p9_session_opts and p9_client_opts structures, as well as
re-using the existing p9_fd_opts and p9_rdma_opts to store options
during parsing. The new structure will be used in the next
commit to pass all parsed options to the appropriate transports.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Message-ID: <20251010214222.1347785-4-sandeen@redhat.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2025-11-03 16:49:53 +09:00
Eric Sandeen
c44393d841 net/9p: move structures and macros to header files
With the new mount API all option parsing will need to happen
in fs/v9fs.c, so move some existing data structures and macros
to header files to facilitate this. Rename some to reflect
the transport they are used for (rdma, fd, etc), for clarity.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Message-ID: <20251010214222.1347785-3-sandeen@redhat.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2025-11-03 16:49:53 +09:00
Dominique Martinet
eeaf38a798 net/9p: cleanup: change p9_trans_module->def to bool
'->def' is only ever used as a true/false flag

Reported-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Message-ID: <20251103-v9fs_trans_def_bool-v1-1-f33dc7ed9e81@codewreck.org>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2025-11-03 16:49:29 +09:00
Pierre Barre
e21d451a82 9p: Use kvmalloc for message buffers on supported transports
While developing a 9P server (https://github.com/Barre/ZeroFS) and
testing it under high-load, I was running into allocation failures.
The failures occur even with plenty of free memory available because
kmalloc requires contiguous physical memory.

This results in errors like:
ls: page allocation failure: order:7, mode:0x40c40(GFP_NOFS|__GFP_COMP)

This patch introduces a transport capability flag (supports_vmalloc)
that indicates whether a transport can work with vmalloc'd buffers
(non-physically contiguous memory). Transports requiring DMA should
leave this flag as false.

The fd-based transports (tcp, unix, fd) set this flag to true, and
p9_fcall_init will use kvmalloc instead of kmalloc for these
transports. This allows the allocator to fall back to vmalloc when
contiguous physical memory is not available.

Additionally, if kmem_cache_alloc fails, the code falls back to
kvmalloc for transports that support it.

Signed-off-by: Pierre Barre <pierre@barre.sh>
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Message-ID: <d2017c29-11fb-44a5-bd0f-4204329bbefb@app.fastmail.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2025-11-03 16:41:24 +09:00
Haiyang Zhang
54133f9b4b net: mana: Support HW link state events
Handle the NIC hardware link state events received from the HW
channel, then set the proper link state accordingly.

And, add a feature bit, GDMA_DRV_CAP_FLAG_1_HW_VPORT_LINK_AWARE,
to inform the NIC hardware this handler exists.

Our MANA NIC only sends out the link state down/up messages
when we need to let the VM rerun DHCP client and change IP
address. So, add netif_carrier_on() in the probe(), let the NIC
show the right initial state in /sys/class/net/ethX/operstate.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/1761770601-16920-1-git-send-email-haiyangz@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-31 15:56:53 -07:00
Jakub Kicinski
284987ab6c bluetooth pull request for net:
- btrtl: Fix memory leak in rtlbt_parse_firmware_v2()
  - MGMT: Fix OOB access in parse_adv_monitor_pattern()
  - hci_event: validate skb length for unknown CC opcode
 -----BEGIN PGP SIGNATURE-----
 
 iQJNBAABCgA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmkE7S4ZHGx1aXoudm9u
 LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKYSVD/0a4pWtXaDvUPjss1Y4nW06
 yjw1JOnD4alZRlUhueAxv+BHekhLV9ZhQ5h13+ceF2x7rX0+/7aZ8PXvTe9Iy9ta
 aV/GUVoT/D4yeZcsJnbFjriA8OI7E+bnLK2WhaTBVbnZhNYPJFyEyJkAC3oP5nab
 FmDgHDqX0Mu1ifLmimSgpneSmC9SpKV8zMqS44NXzlHvtD7WUaIXvO/13Zh/cKaJ
 /XoRTADvxLXgPSMuRqjtyCHMdzf5IAYvbiVynQlmkqoeKZHUPkcCltpYiPJz9Qrn
 +iRKsbEzwO5PU0fJZYE1mrhsogEvzUY/R7a1GSZZzl8FMKfS7YSynGFlw/biiryu
 sY/eOKh1MHriBdSdD4eE8NPglYHcBKzEzb4td80UtL5M80QDiOkru4t5qDLnM73R
 xEj5WNvryMcAz+wnBSmyb8zS8VEmFQRORfftAbQNEObMLwgQpPQlqhLMjCIRG0o3
 2+VLeszeLBuE84lZOMIC5CYvmkaZKbKDd2Ef8mo2tBqq7MEM8VMwe4UmnswAfumI
 B0VzIo7qNNVgNwjxJ09MDX/IRfGpzug1D9jujb6vfzS0IOJ4UpV1HeI2DsnY/eor
 aJbg9sN0qchDn+cKz0lx4U2I1Kp10Eoz730vg5/cRorYBVqhMYs4M8vaeC9pcojb
 rvvs7RN2y8En1Ujsnf4YNg==
 =5z+f
 -----END PGP SIGNATURE-----

Merge tag 'for-net-2025-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth

Luiz Augusto von Dentz says:

====================
bluetooth pull request for net:

 - btrtl: Fix memory leak in rtlbt_parse_firmware_v2()
 - MGMT: Fix OOB access in parse_adv_monitor_pattern()
 - hci_event: validate skb length for unknown CC opcode

* tag 'for-net-2025-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
  Bluetooth: MGMT: Fix OOB access in parse_adv_monitor_pattern()
  Bluetooth: btrtl: Fix memory leak in rtlbt_parse_firmware_v2()
  Bluetooth: hci_event: validate skb length for unknown CC opcode
====================

Link: https://patch.msgid.link/20251031170959.590470-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-31 12:33:08 -07:00
Jakub Kicinski
b7904323e7 Couple of new fixes:
- ath10k: revert a patch that had caused issues on some devices
  - cfg80211/mac80211: use hrtimers for some things where the
                       precise timing matters
  - zd1211rw: fix a long-standing potential leak
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmkDQiwACgkQ10qiO8sP
 aAD8mg//RrRLyJIrVzC/c5TtiVT/ziwyMPMe6l6aRBSN+H/15hX+QfsDt29x+BgX
 CyN9io7Tix3Szshos3XPSrWjKrRy2JYDDGQbVW1npxzAs/E4O5rhETm3SWymHkJh
 geF3W9NDbF7OQIzm6jBp7/i5Im97LU7Y6c9Pd6FxPucKBvFahvxqyDvrD2MAa6/j
 RDr9de7l7X02xymU8Z2h4qNjXQ1fbLWio1UMr+xPxPvPmXsR4aYfGb3xESryHS1y
 a2Xnui1Apzv0Bxf0j0CrsqPoZmSSMhhfB5NMclTsZluPZySRmFygoPt5fjoeaw/Q
 8Ga4nkzUW2ZMZ9CJiB8qFX8dSQ+k658AyVY0XAQkZFey4e0kPwwOG/f1xBoAeIW5
 VcvLbxIZ2Ao3K2qAg5f2bfPyjcp1t+I4iFb/WKzTzX5EcuHcxB5IfoqYfIAkLfgP
 ada9iCJzrVv0+aBCbpKvDm+BAnV/H1sQVRtPQKH09+P8Dylj/klVRYemBMzG9zH2
 zSrkllJrESKUI7Y0fDszt4G9HmWuX9+y5eKhHrLwYkUutqIJsdkW9hbe+EQGUIke
 bOL5nH5hhQDaOa5hpoTQRUKYZb+4tbttKCa9Bii+1M1NgPlQq22Ud4wBdGWqvfqF
 dFHx7vmbVAqrTl8RzXtpLj/UVbgVNhGJAIbD6IAc3DyZC4BvZJs=
 =3oN9
 -----END PGP SIGNATURE-----

Merge tag 'wireless-2025-10-30' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless

Johannes Berg says:

====================
Couple of new fixes:

 - ath10k: revert a patch that had caused issues on some devices
 - cfg80211/mac80211: use hrtimers for some things where the
                      precise timing matters
 - zd1211rw: fix a long-standing potential leak

* tag 'wireless-2025-10-30' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
  wifi: zd1211rw: fix potential memory leak in __zd_usb_enable_rx()
  wifi: mac80211: use wiphy_hrtimer_work for csa.switch_work
  wifi: mac80211: use wiphy_hrtimer_work for ml_reconf_work
  wifi: mac80211: use wiphy_hrtimer_work for ttlm_work
  wifi: cfg80211: add an hrtimer based delayed work item
  Revert "wifi: ath10k: avoid unnecessary wait for service ready message"
====================

Link: https://patch.msgid.link/20251030104919.12871-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-31 12:30:33 -07:00
Ilia Gavrilov
8d59fba493 Bluetooth: MGMT: Fix OOB access in parse_adv_monitor_pattern()
In the parse_adv_monitor_pattern() function, the value of
the 'length' variable is currently limited to HCI_MAX_EXT_AD_LENGTH(251).
The size of the 'value' array in the mgmt_adv_pattern structure is 31.
If the value of 'pattern[i].length' is set in the user space
and exceeds 31, the 'patterns[i].value' array can be accessed
out of bound when copied.

Increasing the size of the 'value' array in
the 'mgmt_adv_pattern' structure will break the userspace.
Considering this, and to avoid OOB access revert the limits for 'offset'
and 'length' back to the value of HCI_MAX_AD_LENGTH.

Found by InfoTeCS on behalf of Linux Verification Center
(linuxtesting.org) with SVACE.

Fixes: db08722fc7 ("Bluetooth: hci_core: Fix missing instances using HCI_MAX_AD_LENGTH")
Cc: stable@vger.kernel.org
Signed-off-by: Ilia Gavrilov <Ilia.Gavrilov@infotecs.ru>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-10-31 12:43:05 -04:00
Jakub Kicinski
1a2352ad82 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.18-rc4).

No conflicts, adjacent changes:

drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
  ded9813d17 ("net: stmmac: Consider Tx VLAN offload tag length for maxSDU")
  26ab9830be ("net: stmmac: replace has_xxxx with core_type")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-31 06:46:03 -07:00
Jakub Kicinski
12a7c6a993 netfilter pull request nf-next-25-10-30
-----BEGIN PGP SIGNATURE-----
 
 iQJBBAABCAArFiEEgKkgxbID4Gn1hq6fcJGo2a1f9gAFAmkDUZINHGZ3QHN0cmxl
 bi5kZQAKCRBwkajZrV/2AK4uD/9NS2Q+uqGlyAnsKrmetRBuhCMRdJaHnrniD5+N
 xnsPrVpyL3EdioOehoXZNkbJ2JXTz/xl4bqZ4Zx4Z6vCVpkws2et5aU9lvm39n9s
 yIhSkthuI0buRYrTzsHO21bHBPQU4gpqOXvbcFotm3QOj2wF2NaNthrDhyWUjtiY
 tPmkSEWu/mjbjMEt1VU3ec3rb3fbe4i6tMW+Y4bTwH7dAopk7Rhw26NexGVUY8fO
 +BzenYi3XTktOrevYnlAJXmsavJjFkSsRffe76PtewNNY2WTCY+st341IXmOVASL
 OzL+9qvY3xIyevClUKx/WEKPGZae+V0Buxq6U6SijDLgbDF19A7R6zMId/ILsexm
 zuAnAdhXdP1gRl4cHAs/AQGD72gqXpVaIRVJE7NsRz3+rvOXzvYRvBhIVQAdvg6v
 pnR8PQg30jHPmPPRruarwH6O4x4jGptX3ASQtIf9Fii37vbDhZmJIt3j2OCaEuFP
 09ImQSwjvizD+aXvf+GtsWD/ZS2PNhE6m0jpJKI26grngODSpKPfNA54uwHmzXDM
 +1Zy+b6AxFjLi5HoAu+Cq7a+ktFKYVCRsunfovzVdvM+63AXoP9e2aRqWnoJUxvb
 sB4eAccoDE1W0CDDktz36bVt92ZTz4zmPO7HoT/oNyvkH8ye6tYpOPfXVWrUirkQ
 MNFqmw==
 =woNE
 -----END PGP SIGNATURE-----

Merge tag 'nf-next-25-10-30' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

Florian Westphal says:

====================
netfilter: updates for net-next

1) Convert nf_tables 'nft_set_iter' usage to use C99 struct
   initialization, from Fernando Fernandez Mancera.
2) Disallow nf_conntrack_max=0.  This was an (undocumented)
   historic inheritance from ip_conntrack (ipv4 only nf_conntrack
   predecessor).  Doing so will simplify future changes to make
   this pernet-tuneable.
3) Fix a typo in conntrack.h comment, from Weibiao Tu.

* tag 'nf-next-25-10-30' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: fix typo in nf_conntrack_l4proto.h comment
  netfilter: conntrack: disable 0 value for conntrack_max setting
  netfilter: nf_tables: use C99 struct initializer for nft_set_iter
====================

Link: https://patch.msgid.link/20251030121954.29175-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-30 17:57:07 -07:00
Jakub Kicinski
1659b441b6 Not that many changes this time:
- mac80211:
    - improved VHT radiotap reporting
    - S1G improvements
    - multi-radio monitor improvements
    - HT action frame handling on 6 GHz
    - mesh rate tracking improvements
    - CSA handling improvements
  - cfg80211: multi-radio debugfs
  - rt2x00: improvements for embedded platforms
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmkDQ2gACgkQ10qiO8sP
 aABNfw/+INSkvH4yc85aMYliAhPfwS+C6ZuWxsRg5Wq/WGTTtqI2jhZFeBfupQ+J
 HpfpDeT7ZxLTcu1Q8rSmwPxwfvDN13+bP/nsP73zFXTejWYMXHRpMOtM2ypDnT8S
 CcU70H/KW85WM1undX99zDbuYhhkV/u3fNHXW989N6/Liirjnrecyegabp8UNfTq
 waEkPkJdb32KHXinccPhG9IC3MQANyRo+0DwDdupQkIj881IKIhaOg4MyfNOw9+A
 23+8WJuL6OtIQ0Z/p/iZ1erFYPq0FAbHO1ua5FU4aN693f72EeYgGwfBNuQfTet5
 Fq0Ii/Lp5gRiwft+/OzX3JMTsVJRV/leEdPtbFpRpZpEp8nS3Kxeel2zCf48ZkQU
 QFEP2rSNO2ezRqYN0sekOBJOMOrAhEz0olTu7//ooXGsdXUtd/Olsvi7slXyRdD1
 9FoeX6wBIlwl81+klGRe6G3J4xOOoePC23czSAEYH8DI1HFvYrFUgwzGWKkmIiBh
 WG3tWo0N7X9zLKxul3SheKu2QcMXaSU4zeSBVLudrpBIvSGboA0TYhfFVvRGdgEJ
 Of2kjEd3zpSS1+eQQ2PBvuPrijCcIrUGKzw98LHRNcnsCrgWozztAmtykYEfnHFH
 /HM0QanW8cn6v+5vON18jLkg5aYw9ovUQ4Gxz5MPbNfIpmzPxPo=
 =vyhh
 -----END PGP SIGNATURE-----

Merge tag 'wireless-next-2025-10-30' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Johannes Berg says:

====================
Not that many changes this time:

 - mac80211:
   - improved VHT radiotap reporting
   - S1G improvements
   - multi-radio monitor improvements
   - HT action frame handling on 6 GHz
   - mesh rate tracking improvements
   - CSA handling improvements
 - cfg80211: multi-radio debugfs
 - rt2x00: improvements for embedded platforms

* tag 'wireless-next-2025-10-30' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next:
  wifi: mac80211: Allow HT Action frame processing on 6 GHz when HE is supported
  wifi: rt2x00: add nvmem eeprom support
  wifi: mac80211: add RX flag to report radiotap VHT information
  net: wireless: Remove redundant pm_runtime_mark_last_busy() calls
  wifi: cfg80211: Add parameters to radio-specific debugfs directories
  wifi: cfg80211: Add debugfs support for multi-radio wiphy
  wifi: mac80211: fix missing RX bitrate update for mesh forwarding path
  wifi: cfg80211: default S1G chandef width to 1MHz
  wifi: mac80211: get probe response chan via ieee80211_get_channel_khz
  wifi: mac80211: reset CRC valid after CSA
  wifi: mac80211_hwsim: advertise puncturing feature support
  wifi: cfg80211/mac80211: validate radio frequency range for monitor mode
  wifi: rt2x00: check retval for of_get_mac_address
====================

Link: https://patch.msgid.link/20251030105355.13216-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-30 17:38:37 -07:00
Halil Pasic
aef3cdb47b net/smc: make wr buffer count configurable
Think SMC_WR_BUF_CNT_SEND := SMC_WR_BUF_CNT used in send context and
SMC_WR_BUF_CNT_RECV := 3 * SMC_WR_BUF_CNT used in recv context. Those
get replaced with lgr->max_send_wr and lgr->max_recv_wr respective.

Please note that although with the default sysctl values
qp_attr.cap.max_send_wr ==  qp_attr.cap.max_recv_wr is maintained but
can not be assumed to be generally true any more. I see no downside to
that, but my confidence level is rather modest.

Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
Reviewed-by: Sidraya Jayagond <sidraya@linux.ibm.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Tested-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
Link: https://patch.msgid.link/20251027224856.2970019-2-pasic@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-10-30 13:31:43 +01:00
caivive (Weibiao Tu)
57347d58a4 netfilter: fix typo in nf_conntrack_l4proto.h comment
In the comment for nf_conntrack_l4proto.h, the word "nfnetink" was
incorrectly spelled. It has been corrected to "nfnetlink".

Fixes a typo to enhance readability and ensure consistency.

Signed-off-by: caivive (Weibiao Tu) <cavivie@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
2025-10-30 12:52:45 +01:00
Jianbo Liu
61fafbee6c xfrm: Determine inner GSO type from packet inner protocol
The GSO segmentation functions for ESP tunnel mode
(xfrm4_tunnel_gso_segment and xfrm6_tunnel_gso_segment) were
determining the inner packet's L2 protocol type by checking the static
x->inner_mode.family field from the xfrm state.

This is unreliable. In tunnel mode, the state's actual inner family
could be defined by x->inner_mode.family or by
x->inner_mode_iaf.family. Checking only the former can lead to a
mismatch with the actual packet being processed, causing GSO to create
segments with the wrong L2 header type.

This patch fixes the bug by deriving the inner mode directly from the
packet's inner protocol stored in XFRM_MODE_SKB_CB(skb)->protocol.

Instead of replicating the code, this patch modifies the
xfrm_ip2inner_mode helper function. It now correctly returns
&x->inner_mode if the selector family (x->sel.family) is already
specified, thereby handling both specific and AF_UNSPEC cases
appropriately.

With this change, ESP GSO can use xfrm_ip2inner_mode to get the
correct inner mode. It doesn't affect existing callers, as the updated
logic now mirrors the checks they were already performing externally.

Fixes: 26dbd66eab ("esp: choose the correct inner protocol for GSO on inter address family tunnels")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-10-30 11:52:31 +01:00
Benjamin Berg
db82ddeaf4 wifi: mac80211: add RX flag to report radiotap VHT information
mac80211 already reports some basic information in the radiotap header
with the known fields declared by the driver. However, drivers may want
to report more accurate information and in that case the full VHT
radiotap structure needs to be provided.

Add a new RX_FLAG_RADIOTAP_VHT which is set when the VHT information
should be pulled from the skb. Update the code to fill in the VHT fields
to only do so when requested by the driver or if the information has not
yet been set. This way the driver can fully control the information if
it chooses so.

Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20251027142118.0bad1c307a21.I2cf285c20a822698039603f2af00ed9c548f2ee0@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-10-30 08:38:51 +01:00
Nathan Chancellor
c57f5fee54 libeth: xdp: Disable generic kCFI pass for libeth_xdp_tx_xmit_bulk()
When building drivers/net/ethernet/intel/idpf/xsk.c for ARCH=arm with
CONFIG_CFI=y using a version of LLVM prior to 22.0.0, there is a
BUILD_BUG_ON failure:

  $ cat arch/arm/configs/repro.config
  CONFIG_BPF_SYSCALL=y
  CONFIG_CFI=y
  CONFIG_IDPF=y
  CONFIG_XDP_SOCKETS=y

  $ make -skj"$(nproc)" ARCH=arm LLVM=1 clean defconfig repro.config drivers/net/ethernet/intel/idpf/xsk.o
  In file included from drivers/net/ethernet/intel/idpf/xsk.c:4:
  include/net/libeth/xsk.h:205:2: error: call to '__compiletime_assert_728' declared with 'error' attribute: BUILD_BUG_ON failed: !__builtin_constant_p(tmo == libeth_xsktmo)
    205 |         BUILD_BUG_ON(!__builtin_constant_p(tmo == libeth_xsktmo));
        |         ^
  ...

libeth_xdp_tx_xmit_bulk() indirectly calls libeth_xsk_xmit_fill_buf()
but these functions are marked as __always_inline so that the compiler
can turn these indirect calls into direct ones and see that the tmo
parameter to __libeth_xsk_xmit_fill_buf_md() is ultimately libeth_xsktmo
from idpf_xsk_xmit().

Unfortunately, the generic kCFI pass in LLVM expands the kCFI bundles
from the indirect calls in libeth_xdp_tx_xmit_bulk() in such a way that
later optimizations cannot turn these calls into direct ones, making the
BUILD_BUG_ON fail because it cannot be proved at compile time that tmo
is libeth_xsktmo.

Disable the generic kCFI pass for libeth_xdp_tx_xmit_bulk() to ensure
these indirect calls can always be turned into direct calls to avoid
this error.

Closes: https://github.com/ClangBuiltLinux/linux/issues/2124
Fixes: 9705d6552f ("idpf: implement Rx path for AF_XDP")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Acked-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20251025-idpf-fix-arm-kcfi-build-error-v1-3-ec57221153ae@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
2025-10-29 20:04:55 -07:00
Shahar Shitrit
c15d5c62ab net: tls: Cancel RX async resync request on rcd_delta overflow
When a netdev issues a RX async resync request for a TLS connection,
the TLS module handles it by logging record headers and attempting to
match them to the tcp_sn provided by the device. If a match is found,
the TLS module approves the tcp_sn for resynchronization.

While waiting for a device response, the TLS module also increments
rcd_delta each time a new TLS record is received, tracking the distance
from the original resync request.

However, if the device response is delayed or fails (e.g due to
unstable connection and device getting out of tracking, hardware
errors, resource exhaustion etc.), the TLS module keeps logging and
incrementing, which can lead to a WARN() when rcd_delta exceeds the
threshold.

To address this, introduce tls_offload_rx_resync_async_request_cancel()
to explicitly cancel resync requests when a device response failure is
detected. Call this helper also as a final safeguard when rcd_delta
crosses its threshold, as reaching this point implies that earlier
cancellation did not occur.

Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1761508983-937977-3-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-29 18:32:18 -07:00
Shahar Shitrit
34892cfec0 net: tls: Change async resync helpers argument
Update tls_offload_rx_resync_async_request_start() and
tls_offload_rx_resync_async_request_end() to get a struct
tls_offload_resync_async parameter directly, rather than
extracting it from struct sock.

This change aligns the function signatures with the upcoming
tls_offload_rx_resync_async_request_cancel() helper, which
will be introduced in a subsequent patch.

Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1761508983-937977-2-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-29 18:32:17 -07:00
Ido Schimmel
d12d04d221 ipv6: icmp: Add RFC 5837 support
Add the ability to append the incoming IP interface information to
ICMPv6 error messages in accordance with RFC 5837 and RFC 4884. This is
required for more meaningful traceroute results in unnumbered networks.

The feature is disabled by default and controlled via a new sysctl
("net.ipv6.icmp.errors_extension_mask") which accepts a bitmask of ICMP
extensions to append to ICMP error messages. Currently, only a single
value is supported, but the interface and the implementation should be
able to support more extensions, if needed.

Clone the skb and copy the relevant data portions before modifying the
skb as the caller of icmp6_send() still owns the skb after the function
returns. This should be fine since by default ICMP error messages are
rate limited to 1000 per second and no more than 1 per second per
specific host.

Trim or pad the packet to 128 bytes before appending the ICMP extension
structure in order to be compatible with legacy applications that assume
that the ICMP extension structure always starts at this offset (the
minimum length specified by RFC 4884).

Since commit 20e1954fe2 ("ipv6: RFC 4884 partial support for SIT/GRE
tunnels") it is possible for icmp6_send() to be called with an skb that
already contains ICMP extensions. This can happen when we receive an
ICMPv4 message with extensions from a tunnel and translate it to an
ICMPv6 message towards an IPv6 host in the overlay network. I could not
find an RFC that supports this behavior, but it makes sense to not
overwrite the original extensions that were appended to the packet.
Therefore, avoid appending extensions if the length field in the
provided ICMPv6 header is already filled.

Export netdev_copy_name() using EXPORT_IPV6_MOD_GPL() to make it
available to IPv6 when it is built as a module.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251027082232.232571-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-29 18:28:30 -07:00
Ido Schimmel
f0e7036fc9 ipv4: icmp: Add RFC 5837 support
Add the ability to append the incoming IP interface information to
ICMPv4 error messages in accordance with RFC 5837 and RFC 4884. This is
required for more meaningful traceroute results in unnumbered networks.

The feature is disabled by default and controlled via a new sysctl
("net.ipv4.icmp_errors_extension_mask") which accepts a bitmask of ICMP
extensions to append to ICMP error messages. Currently, only a single
value is supported, but the interface and the implementation should be
able to support more extensions, if needed.

Clone the skb and copy the relevant data portions before modifying the
skb as the caller of __icmp_send() still owns the skb after the function
returns. This should be fine since by default ICMP error messages are
rate limited to 1000 per second and no more than 1 per second per
specific host.

Trim or pad the packet to 128 bytes before appending the ICMP extension
structure in order to be compatible with legacy applications that assume
that the ICMP extension structure always starts at this offset (the
minimum length specified by RFC 4884).

Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251027082232.232571-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-29 18:28:29 -07:00
Eric Dumazet
b1e014a1f3 tcp: add newval parameter to tcp_rcvbuf_grow()
This patch has no functional change, and prepares the following one.

tcp_rcvbuf_grow() will need to have access to tp->rcvq_space.space
old and new values.

Change mptcp_rcvbuf_grow() in a similar way.

Signed-off-by: Eric Dumazet <edumazet@google.com>
[ Moved 'oldval' declaration to the next patch to avoid warnings at
 build time. ]
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20251028-net-tcp-recv-autotune-v3-3-74b43ba4c84c@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-29 17:30:19 -07:00
Christophe JAILLET
294bfe0343 sctp: Constify struct sctp_sched_ops
'struct sctp_sched_ops' is not modified in these drivers.

Constifying this structure moves some data to a read-only section, so
increases overall security, especially when the structure holds some
function pointers.

On a x86_64, with allmodconfig, as an example:
Before:
======
   text	   data	    bss	    dec	    hex	filename
   8019	    568	      0	   8587	   218b	net/sctp/stream_sched_fc.o

After:
=====
   text	   data	    bss	    dec	    hex	filename
   8275	    312	      0	   8587	   218b	net/sctp/stream_sched_fc.o

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://patch.msgid.link/dce03527eb7b7cc8a3c26d5cdac12bafe3350135.1761377890.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-28 17:50:55 -07:00
Bobby Eshleman
8443c31608 net: netmem: remove NET_IOV_MAX from net_iov_type enum
Remove the NET_IOV_MAX workaround from the net_iov_type enum. This entry
was previously added to force the enum size to unsigned long to satisfy
the NET_IOV_ASSERT_OFFSET static assertions.

After commit f3d85c9ee5 ("netmem: introduce struct netmem_desc
mirroring struct page") this approach became unnecessary by placing the
net_iov_type after the netmem_desc. Placing the net_iov_type after
netmem_desc results in the net_iov_type size having no effect on the
position or layout of the fields that mirror the struct page.

The layout before this patch:

struct net_iov {
	union {
		struct netmem_desc desc;                 /*     0    48 */
		struct {
			long unsigned int _flags;        /*     0     8 */
			long unsigned int pp_magic;      /*     8     8 */
			struct page_pool * pp;           /*    16     8 */
			long unsigned int _pp_mapping_pad; /*    24     8 */
			long unsigned int dma_addr;      /*    32     8 */
			atomic_long_t pp_ref_count;      /*    40     8 */
		};                                       /*     0    48 */
	};                                               /*     0    48 */
	struct net_iov_area *      owner;                /*    48     8 */
	enum net_iov_type          type;                 /*    56     8 */

	/* size: 64, cachelines: 1, members: 3 */
};

The layout after this patch:

struct net_iov {
	union {
		struct netmem_desc desc;                 /*     0    48 */
		struct {
			long unsigned int _flags;        /*     0     8 */
			long unsigned int pp_magic;      /*     8     8 */
			struct page_pool * pp;           /*    16     8 */
			long unsigned int _pp_mapping_pad; /*    24     8 */
			long unsigned int dma_addr;      /*    32     8 */
			atomic_long_t pp_ref_count;      /*    40     8 */
		};                                       /*     0    48 */
	};                                               /*     0    48 */
	struct net_iov_area *      owner;                /*    48     8 */
	enum net_iov_type          type;                 /*    56     4 */

	/* size: 64, cachelines: 1, members: 3 */
	/* padding: 4 */
};

Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20251024-b4-devmem-remove-niov-max-v1-1-ba72c68bc869@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-28 17:41:46 -07:00
Benjamin Berg
7ceba45a66 wifi: cfg80211: add an hrtimer based delayed work item
The normal timer mechanism assume that timeout further in the future
need a lower accuracy. As an example, the granularity for a timer
scheduled 4096 ms in the future on a 1000 Hz system is already 512 ms.
This granularity is perfectly sufficient for e.g. timeouts, but there
are other types of events that will happen at a future point in time and
require a higher accuracy.

Add a new wiphy_hrtimer_work type that uses an hrtimer internally. The
API is almost identical to the existing wiphy_delayed_work and it can be
used as a drop-in replacement after minor adjustments. The work will be
scheduled relative to the current time with a slack of 1 millisecond.

CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20251028125710.7f13a2adc5eb.I01b5af0363869864b0580d9c2a1770bafab69566@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-10-28 14:56:30 +01:00
Yue Haibing
6f147c8328 net/sched: Remove unused typedef psched_tdiff_t
Since commit 051d442098 ("net/sched: Retire CBQ qdisc")
this is not used anymore.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Link: https://patch.msgid.link/20251024025145.4069583-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-27 18:05:54 -07:00
Kuniyuki Iwashima
71068e2e1b sctp: Remove sctp_copy_sock() and sctp_copy_descendant().
Now, sctp_accept() and sctp_do_peeloff() use sk_clone(), and
we no longer need sctp_copy_sock() and sctp_copy_descendant().

Let's remove them.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20251023231751.4168390-9-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-27 18:04:59 -07:00
Kuniyuki Iwashima
c49ed521f1 sctp: Remove sctp_pf.create_accept_sk().
sctp_v[46]_create_accept_sk() are no longer used.

Let's remove sctp_pf.create_accept_sk().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20251023231751.4168390-7-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-27 18:04:58 -07:00
Kuniyuki Iwashima
151b98d10e net: Add sk_clone().
sctp_accept() will use sk_clone_lock(), but it will be called
with the parent socket locked, and sctp_migrate() acquires the
child lock later.

Let's add no lock version of sk_clone_lock().

Note that lockdep complains if we simply use bh_lock_sock_nested().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20251023231751.4168390-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-27 18:04:57 -07:00
Wilfred Mallawa
82cb5be6ad net/tls: support setting the maximum payload size
During a handshake, an endpoint may specify a maximum record size limit.
Currently, the kernel defaults to TLS_MAX_PAYLOAD_SIZE (16KB) for the
maximum record size. Meaning that, the outgoing records from the kernel
can exceed a lower size negotiated during the handshake. In such a case,
the TLS endpoint must send a fatal "record_overflow" alert [1], and
thus the record is discarded.

Upcoming Western Digital NVMe-TCP hardware controllers implement TLS
support. For these devices, supporting TLS record size negotiation is
necessary because the maximum TLS record size supported by the controller
is less than the default 16KB currently used by the kernel.

Currently, there is no way to inform the kernel of such a limit. This patch
adds support to a new setsockopt() option `TLS_TX_MAX_PAYLOAD_LEN` that
allows for setting the maximum plaintext fragment size. Once set, outgoing
records are no larger than the size specified. This option can be used to
specify the record size limit.

[1] https://www.rfc-editor.org/rfc/rfc8449

Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20251022001937.20155-1-wilfred.opensource@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-27 16:13:42 -07:00
Roopni Devanathan
7cc986c04a wifi: cfg80211: Add debugfs support for multi-radio wiphy
In multi-radio wiphy architecture, where a single wiphy can have
multiple radios tied to it, radio specific configuration parameters
and global wiphy parameters are maintained for the entire physical
device and common to all radios. But, each radio in a wiphy can have
different values for each radio configuration parameter, like RTS
threshold. With the current debugfs directory structure, the values
of global wiphy configuration parameters can be viewed, but, values
of individual radio configuration parameters cannot be viewed, as
radio specific configuration parameters are not maintained, separately.

To address this, in addition to maintaining global wiphy configuration
parameters common to all radios, create separate debugfs directories
for each radio in a wiphy to maintain parameters corresponding to that
radio in this directory.

In implementation, maintain a dentry structure in wiphy_radio_cfg, a
structure  containing radio configurations of a wiphy. This struct is
maintained to denote per-radio configurations of a wiphy. Create
separate directories representing each radio within phy#X directory in
debugfs during wiphy registration.

Sample directory structure with this change:
ls /sys/kernel/debug/ieee80211/phy0/radio
radio0/ radio1/ radio2/

Signed-off-by: Roopni Devanathan <quic_rdevanat@quicinc.com>
Link: https://patch.msgid.link/20251024044649.483557-2-quic_rdevanat@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-10-27 09:18:41 +01:00
Ryder Lee
a392cde88d wifi: cfg80211/mac80211: validate radio frequency range for monitor mode
In multi-radio devices, it is possible to have an MLD AP and a monitor
interface active at the same time. In such cases, monitor mode may not
be able to specify a fixed channel and could end up capturing frames
from all radios, including those outside the intended frequency bands.

This patch adds frequency validation for monitor mode. Received frames
are now only processed if their frequency fall within the allowed ranges
of the radios specified by the interface's radio_mask.

This prevents monitor mode from capturing frames outside the supported radio.

Signed-off-by: Ryder Lee <ryder.lee@mediatek.com>
Link: https://patch.msgid.link/700b8284e845d96654eb98431f8eeb5a81503862.1758647858.git.ryder.lee@mediatek.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-10-27 09:16:23 +01:00
Kuniyuki Iwashima
3064d0fe02 neighbour: Convert rwlock of struct neigh_table to spinlock.
Only neigh_for_each() and neigh_seq_start/stop() are on the
reader side of neigh_table.lock.

Let's convert rwlock to the plain spinlock.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251022054004.2514876-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-24 17:57:20 -07:00
Kuniyuki Iwashima
35d7c70870 neighbour: Annotate access to neigh_parms fields.
NEIGH_VAR() is read locklessly in the fast path, and IPv6 ndisc uses
NEIGH_VAR_SET() locklessly.

The next patch will convert neightbl_dump_info() to RCU.

Let's annotate accesses to neigh_param with READ_ONCE() and WRITE_ONCE().

Note that ndisc_ifinfo_sysctl_change() uses &NEIGH_VAR() and we cannot
use '&' with READ_ONCE(), so NEIGH_VAR_PTR() is introduced.

Note also that NEIGH_VAR_INIT() does not need WRITE_ONCE() as it is before
parms is published.  Also, the only user hippi_neigh_setup_dev() is no
longer called since commit e3804cbebb ("net: remove COMPAT_NET_DEV_OPS"),
which looks wrong, but probably no one uses HIPPI and RoadRunner.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251022054004.2514876-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-24 17:57:20 -07:00
Luiz Augusto von Dentz
751463ceef Bluetooth: hci_core: Fix tracking of periodic advertisement
Periodic advertising enabled flag cannot be tracked by the enabled
flag since advertising and periodic advertising each can be
enabled/disabled separately from one another causing the states to be
inconsistent when for example an advertising set is disabled its
enabled flag is set to false which is then used for periodic which has
not being disabled.

Fixes: eca0ae4aea ("Bluetooth: Add initial implementation of BIS connections")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-10-24 10:31:59 -04:00
Frédéric Danis
76e20da0bd Revert "Bluetooth: L2CAP: convert timeouts to secs_to_jiffies()"
This reverts commit c9d84da18d. It
replaces in L2CAP calls to msecs_to_jiffies() to secs_to_jiffies()
and updates the constants accordingly. But the constants are also
used in LCAP Configure Request and L2CAP Configure Response which
expect values in milliseconds.
This may prevent correct usage of L2CAP channel.

To fix it, keep those constants in milliseconds and so revert this
change.

Fixes: c9d84da18d ("Bluetooth: L2CAP: convert timeouts to secs_to_jiffies()")
Signed-off-by: Frédéric Danis <frederic.danis@collabora.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-10-24 10:30:30 -04:00
Pauli Virtanen
e8785404de Bluetooth: MGMT: fix crash in set_mesh_sync and set_mesh_complete
There is a BUG: KASAN: stack-out-of-bounds in set_mesh_sync due to
memcpy from badly declared on-stack flexible array.

Another crash is in set_mesh_complete() due to double list_del via
mgmt_pending_valid + mgmt_pending_remove.

Use DEFINE_FLEX to declare the flexible array right, and don't memcpy
outside bounds.

As mgmt_pending_valid removes the cmd from list, use mgmt_pending_free,
and also report status on error.

Fixes: 302a1f674c ("Bluetooth: MGMT: Fix possible UAFs")
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-10-24 10:21:37 -04:00
Luiz Augusto von Dentz
0d92808024 Bluetooth: HCI: Fix tracking of advertisement set/instance 0x00
This fixes the state tracking of advertisement set/instance 0x00 which
is considered a legacy instance and is not tracked individually by
adv_instances list, previously it was assumed that hci_dev itself would
track it via HCI_LE_ADV but that is a global state not specifc to
instance 0x00, so to fix it a new flag is introduced that only tracks the
state of instance 0x00.

Fixes: 1488af7b8b ("Bluetooth: hci_sync: Fix hci_resume_advertising_sync")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-10-24 10:21:07 -04:00
Yue Haibing
114573962a net/sched: Remove unused inline helper qdisc_from_priv()
Since commit fb38306ceb ("net/sched: Retire ATM qdisc"), this is
not used and can be removed.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Link: https://patch.msgid.link/20251021114626.3148894-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-22 19:20:48 -07:00
David Yang
ca4709843b net: dsa: tag_yt921x: add support for Motorcomm YT921x tags
Add support for Motorcomm YT921x tags, which includes a proper
configurable ethertype field (default to 0x9988).

Signed-off-by: David Yang <mmyangfl@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20251017060859.326450-3-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-21 18:25:30 -07:00
Eric Dumazet
3ff9bcecce net: avoid extra access to sk->sk_wmem_alloc in sock_wfree()
UDP TX packets destructor is sock_wfree().

It suffers from a cache line bouncing in sock_def_write_space_wfree().

Instead of reading sk->sk_wmem_alloc after we just did an atomic RMW
on it, use __refcount_sub_and_test() to get the old value for free,
and pass the new value to sock_def_write_space_wfree().

Add __sock_writeable() helper.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251017133712.2842665-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-10-21 15:56:21 +02:00
Randy Dunlap
3701572931 nl802154: fix some kernel-doc warnings
Correct multiple kernel-doc warnings in nl802154.h:

- Fix a typo on one enum name to avoid a kernel-doc warning.
- Drop 2 enum descriptions that are no longer needed.
- Mark 2 internal enums as "private:" so that kernel-doc is not needed
  for them.

Warning: nl802154.h:239 Enum value 'NL802154_CAP_ATTR_MAX_MAXBE' not described in enum 'nl802154_wpan_phy_capability_attr'
Warning: nl802154.h:239 Excess enum value '%NL802154_CAP_ATTR_MIN_CCA_ED_LEVEL' description in 'nl802154_wpan_phy_capability_attr'
Warning: nl802154.h:239 Excess enum value '%NL802154_CAP_ATTR_MAX_CCA_ED_LEVEL' description in 'nl802154_wpan_phy_capability_attr'
Warning: nl802154.h:369 Enum value '__NL802154_CCA_OPT_ATTR_AFTER_LAST' not described in enum 'nl802154_cca_opts'
Warning: nl802154.h:369 Enum value 'NL802154_CCA_OPT_ATTR_MAX' not described in enum 'nl802154_cca_opts'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20251016035917.1148012-1-rdunlap@infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-20 17:13:40 -07:00
Jakub Kicinski
e90576829c bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ6NaUOruQGUkvPdG4raS+Z+3y5EwUCaPFVxwAKCRAraS+Z+3y5
 E1yDAP98nxKLBFb9auLj8AV1zUNl4lKEpOIhH7ceXaDT7ucsagEAwvSbGAN5sun/
 M0+sOpKEZwF8JBoRU8L61uYPYSu7XAk=
 =2K9O
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Martin KaFai Lau says:

====================
pull-request: bpf-next 2025-10-16

We've added 6 non-merge commits during the last 1 day(s) which contain
a total of 18 files changed, 577 insertions(+), 38 deletions(-).

The main changes are:

1) Bypass the global per-protocol memory accounting either by setting
   a netns sysctl or using bpf_setsockopt in a bpf program,
   from Kuniyuki Iwashima.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  selftests/bpf: Add test for sk->sk_bypass_prot_mem.
  bpf: Introduce SK_BPF_BYPASS_PROT_MEM.
  bpf: Support bpf_setsockopt() for BPF_CGROUP_INET_SOCK_CREATE.
  net: Introduce net.core.bypass_prot_mem sysctl.
  net: Allow opt-out from global protocol memory accounting.
  tcp: Save lock_sock() for memcg in inet_csk_accept().
====================

Link: https://patch.msgid.link/20251016204539.773707-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-17 17:20:42 -07:00
Eric Biggers
37a183d3b7 tcp: Convert tcp-md5 to use MD5 library instead of crypto_ahash
Make tcp-md5 use the MD5 library API (added in 6.18) instead of the
crypto_ahash API.  This is much simpler and also more efficient:

- The library API just operates on struct md5_ctx.  Just allocate this
  struct on the stack instead of using a pool of pre-allocated
  crypto_ahash and ahash_request objects.

- The library API accepts standard pointers and doesn't require
  scatterlists.  So, for hashing the headers just use an on-stack buffer
  instead of a pool of pre-allocated kmalloc'ed scratch buffers.

- The library API never fails.  Therefore, checking for MD5 hashing
  errors is no longer necessary.  Update tcp_v4_md5_hash_skb(),
  tcp_v6_md5_hash_skb(), tcp_v4_md5_hash_hdr(), tcp_v6_md5_hash_hdr(),
  tcp_md5_hash_key(), tcp_sock_af_ops::calc_md5_hash, and
  tcp_request_sock_ops::calc_md5_hash to return void instead of int.

- The library API provides direct access to the MD5 code, eliminating
  unnecessary overhead such as indirect function calls and scatterlist
  management.  Microbenchmarks of tcp_v4_md5_hash_skb() on x86_64 show a
  speedup from 7518 to 7041 cycles (6% fewer) with skb->len == 1440, or
  from 1020 to 678 cycles (33% fewer) with skb->len == 140.

Since tcp_sigpool_hash_skb_data() can no longer be used, add a function
tcp_md5_hash_skb_data() which is specialized to MD5.  Of course, to the
extent that this duplicates any code, it's well worth it.

To preserve the existing behavior of TCP-MD5 support being disabled when
the kernel is booted with "fips=1", make tcp_md5_do_add() check
fips_enabled itself.  Previously it relied on the error from
crypto_alloc_ahash("md5") being bubbled up.  I don't know for sure that
this is actually needed, but this preserves the existing behavior.

Tested with bidirectional TCP-MD5, both IPv4 and IPv6, between a kernel
that includes this commit and a kernel that doesn't include this commit.

(Side note: please don't use TCP-MD5!  It's cryptographically weak.  But
as long as Linux supports it, it might as well be implemented properly.)

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20251014215836.115616-1-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-17 17:14:54 -07:00
Xuanqiang Luo
1532ed0d07 inet: Avoid ehash lookup race in inet_ehash_insert()
Since ehash lookups are lockless, if one CPU performs a lookup while
another concurrently deletes and inserts (removing reqsk and inserting sk),
the lookup may fail to find the socket, an RST may be sent.

The call trace map is drawn as follows:
   CPU 0                           CPU 1
   -----                           -----
				inet_ehash_insert()
                                spin_lock()
                                sk_nulls_del_node_init_rcu(osk)
__inet_lookup_established()
	(lookup failed)
                                __sk_nulls_add_node_rcu(sk, list)
                                spin_unlock()

As both deletion and insertion operate on the same ehash chain, this patch
introduces a new sk_nulls_replace_node_init_rcu() helper functions to
implement atomic replacement.

Fixes: 5e0724d027 ("tcp/dccp: fix hashdance race for passive sessions")
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251015020236.431822-3-xuanqiang.luo@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-17 16:08:43 -07:00
Kuniyuki Iwashima
1c17f4373d ipv6: Move ipv6_fl_list from ipv6_pinfo to inet_sock.
In {tcp6,udp6,raw6}_sock, struct ipv6_pinfo is always placed at
the beginning of a new cache line because

  1. __alignof__(struct tcp_sock) is 64 due to ____cacheline_aligned
     of __cacheline_group_begin(tcp_sock_write_tx)

  2. __alignof__(struct udp_sock) is 64 due to ____cacheline_aligned
     of struct numa_drop_counters

  3. in raw6_sock, struct numa_drop_counters is placed before
     struct ipv6_pinfo

.  struct ipv6_pinfo is 136 bytes, but the last cache line is
only used by ipv6_fl_list:

  $ pahole -C ipv6_pinfo vmlinux
  struct ipv6_pinfo {
  ...
  	/* --- cacheline 2 boundary (128 bytes) --- */
  	struct ipv6_fl_socklist *  ipv6_fl_list;         /*   128     8 */

  	/* size: 136, cachelines: 3, members: 23 */

Let's move ipv6_fl_list from struct ipv6_pinfo to struct inet_sock
to save a full cache line for {tcp6,udp6,raw6}_sock.

Now, struct ipv6_pinfo is 128 bytes, and {tcp6,udp6,raw6}_sock have
64 bytes less, while {tcp,udp,raw}_sock retain the same size.

Before:

  # grep -E "^(RAW|UDP[^L\-]|TCP)" /proc/slabinfo | awk '{print $1, "\t", $4}'
  RAWv6 	 1408
  UDPv6 	 1472
  TCPv6 	 2560
  RAW 		 1152
  UDP	 	 1280
  TCP 		 2368

After:

  # grep -E "^(RAW|UDP[^L\-]|TCP)" /proc/slabinfo | awk '{print $1, "\t", $4}'
  RAWv6 	 1344
  UDPv6 	 1408
  TCPv6 	 2496
  RAW 		 1152
  UDP	 	 1280
  TCP 		 2368

Also, ipv6_fl_list and inet_flags (SNDFLOW bit) are placed in the
same cache line.

  $ pahole -C inet_sock vmlinux
  ...
  	/* --- cacheline 11 boundary (704 bytes) was 56 bytes ago --- */
  	struct ipv6_pinfo *        pinet6;               /*   760     8 */
  	/* --- cacheline 12 boundary (768 bytes) --- */
  	struct ipv6_fl_socklist *  ipv6_fl_list;         /*   768     8 */
  	unsigned long              inet_flags;           /*   776     8 */

Doc churn is due to the insufficient Type column (only 1 space short).

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251014224210.2964778-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-17 16:06:52 -07:00
Eric Dumazet
100dfa74ca net: dev_queue_xmit() llist adoption
Remove busylock spinlock and use a lockless list (llist)
to reduce spinlock contention to the minimum.

Idea is that only one cpu might spin on the qdisc spinlock,
while others simply add their skb in the llist.

After this patch, we get a 300 % improvement on heavy TX workloads.
- Sending twice the number of packets per second.
- While consuming 50 % less cycles.

Note that this also allows in the future to submit batches
to various qdisc->enqueue() methods.

Tested:

- Dual Intel(R) Xeon(R) 6985P-C  (480 hyper threads).
- 100Gbit NIC, 30 TX queues with FQ packet scheduler.
- echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
- 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"

Before:

16 Mpps (41 Mpps if each thread is pinned to a different cpu)

vmstat 2 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
243  0      0 2368988672  51036 1100852    0    0   146     1  242   60  0  9 91  0  0
244  0      0 2368988672  51036 1100852    0    0   536    10 487745 14718  0 52 48  0  0
244  0      0 2368988672  51036 1100852    0    0   512     0 503067 46033  0 52 48  0  0
244  0      0 2368988672  51036 1100852    0    0   512     0 494807 12107  0 52 48  0  0
244  0      0 2368988672  51036 1100852    0    0   702    26 492845 10110  0 52 48  0  0

Lock contention (1 second sample taken on 8 cores)
perf lock record -C0-7 sleep 1; perf lock contention
 contended   total wait     max wait     avg wait         type   caller

    442111      6.79 s     162.47 ms     15.35 us     spinlock   dev_hard_start_xmit+0xcd
      5961      9.57 ms      8.12 us      1.60 us     spinlock   __dev_queue_xmit+0x3a0
       244    560.63 us      7.63 us      2.30 us     spinlock   do_softirq+0x5b
        13     25.09 us      3.21 us      1.93 us     spinlock   net_tx_action+0xf8

If netperf threads are pinned, spinlock stress is very high.
perf lock record -C0-7 sleep 1; perf lock contention
 contended   total wait     max wait     avg wait         type   caller

    964508      7.10 s     147.25 ms      7.36 us     spinlock   dev_hard_start_xmit+0xcd
       201    268.05 us      4.65 us      1.33 us     spinlock   __dev_queue_xmit+0x3a0
        12     26.05 us      3.84 us      2.17 us     spinlock   do_softirq+0x5b

@__dev_queue_xmit_ns:
[256, 512)            21 |                                                    |
[512, 1K)            631 |                                                    |
[1K, 2K)           27328 |@                                                   |
[2K, 4K)          265392 |@@@@@@@@@@@@@@@@                                    |
[4K, 8K)          417543 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
[8K, 16K)         826292 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K)        733822 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
[32K, 64K)         19055 |@                                                   |
[64K, 128K)        17240 |@                                                   |
[128K, 256K)       25633 |@                                                   |
[256K, 512K)           4 |                                                    |

After:

29 Mpps (57 Mpps if each thread is pinned to a different cpu)

vmstat 2 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
78  0      0 2369573632  32896 1350988    0    0    22     0  331  254  0  8 92  0  0
75  0      0 2369573632  32896 1350988    0    0    22    50 425713 280199  0 23 76  0  0
104  0      0 2369573632  32896 1350988    0    0   290     0 430238 298247  0 23 76  0  0
86  0      0 2369573632  32896 1350988    0    0   132     0 428019 291865  0 24 76  0  0
90  0      0 2369573632  32896 1350988    0    0   502     0 422498 278672  0 23 76  0  0

perf lock record -C0-7 sleep 1; perf lock contention
 contended   total wait     max wait     avg wait         type   caller

      2524    116.15 ms    486.61 us     46.02 us     spinlock   __dev_queue_xmit+0x55b
      5821    107.18 ms    371.67 us     18.41 us     spinlock   dev_hard_start_xmit+0xcd
      2377      9.73 ms     35.86 us      4.09 us     spinlock   ___slab_alloc+0x4e0
       923      5.74 ms     20.91 us      6.22 us     spinlock   ___slab_alloc+0x5c9
       121      3.42 ms    193.05 us     28.24 us     spinlock   net_tx_action+0xf8
         6    564.33 us    167.60 us     94.05 us     spinlock   do_softirq+0x5b

If netperf threads are pinned (~54 Mpps)
perf lock record -C0-7 sleep 1; perf lock contention
     32907    316.98 ms    195.98 us      9.63 us     spinlock   dev_hard_start_xmit+0xcd
      4507     61.83 ms    212.73 us     13.72 us     spinlock   __dev_queue_xmit+0x554
      2781     23.53 ms     40.03 us      8.46 us     spinlock   ___slab_alloc+0x5c9
      3554     18.94 ms     34.69 us      5.33 us     spinlock   ___slab_alloc+0x4e0
       233      9.09 ms    215.70 us     38.99 us     spinlock   do_softirq+0x5b
       153    930.66 us     48.67 us      6.08 us     spinlock   net_tx_action+0xfd
        84    331.10 us     14.22 us      3.94 us     spinlock   ___slab_alloc+0x5c9
       140    323.71 us      9.94 us      2.31 us     spinlock   ___slab_alloc+0x4e0

@__dev_queue_xmit_ns:
[128, 256)       1539830 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |
[256, 512)       2299558 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K)         483936 |@@@@@@@@@@                                          |
[1K, 2K)          265345 |@@@@@@                                              |
[2K, 4K)          145463 |@@@                                                 |
[4K, 8K)           54571 |@                                                   |
[8K, 16K)          10270 |                                                    |
[16K, 32K)          9385 |                                                    |
[32K, 64K)          7749 |                                                    |
[64K, 128K)        26799 |                                                    |
[128K, 256K)        2665 |                                                    |
[256K, 512K)         665 |                                                    |

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20251014171907.3554413-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-16 16:25:10 -07:00
Eric Dumazet
526f5fb112 net: sched: claim one cache line in Qdisc
Replace state2 field with a boolean.

Move it to a hole between qstats and state so that
we shrink Qdisc by a full cache line.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20251014171907.3554413-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-16 16:25:10 -07:00
Eric Dumazet
178ca30889 Revert "net/sched: Fix mirred deadlock on device recursion"
This reverts commits 0f022d32c3
and 44180feacc.

Prior patch in this series implemented loop detection
in act_mirred, we can remove q->owner to save some cycles
in the fast path.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20251014171907.3554413-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-16 16:25:10 -07:00
Kuniyuki Iwashima
b46ab63181 net: Introduce net.core.bypass_prot_mem sysctl.
If a socket has sk->sk_bypass_prot_mem flagged, the socket opts out
of the global protocol memory accounting.

Let's control the flag by a new sysctl knob.

The flag is written once during socket(2) and is inherited to child
sockets.

Tested with a script that creates local socket pairs and send()s a
bunch of data without recv()ing.

Setup:

  # mkdir /sys/fs/cgroup/test
  # echo $$ >> /sys/fs/cgroup/test/cgroup.procs
  # sysctl -q net.ipv4.tcp_mem="1000 1000 1000"
  # ulimit -n 524288

Without net.core.bypass_prot_mem, charged to tcp_mem & memcg

  # python3 pressure.py &
  # cat /sys/fs/cgroup/test/memory.stat | grep sock
  sock 22642688 <-------------------------------------- charged to memcg
  # cat /proc/net/sockstat| grep TCP
  TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 5376 <-- charged to tcp_mem
  # ss -tn | head -n 5
  State Recv-Q Send-Q Local Address:Port  Peer Address:Port
  ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:53188
  ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:49972
  ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:53868
  ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:53554
  # nstat | grep Pressure || echo no pressure
  TcpExtTCPMemoryPressures        1                  0.0

With net.core.bypass_prot_mem=1, charged to memcg only:

  # sysctl -q net.core.bypass_prot_mem=1
  # python3 pressure.py &
  # cat /sys/fs/cgroup/test/memory.stat | grep sock
  sock 2757468160 <------------------------------------ charged to memcg
  # cat /proc/net/sockstat | grep TCP
  TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 0 <- NOT charged to tcp_mem
  # ss -tn | head -n 5
  State Recv-Q Send-Q  Local Address:Port  Peer Address:Port
  ESTAB 111000 0           127.0.0.1:36019    127.0.0.1:49026
  ESTAB 110000 0           127.0.0.1:36019    127.0.0.1:45630
  ESTAB 110000 0           127.0.0.1:36019    127.0.0.1:44870
  ESTAB 111000 0           127.0.0.1:36019    127.0.0.1:45274
  # nstat | grep Pressure || echo no pressure
  no pressure

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://patch.msgid.link/20251014235604.3057003-4-kuniyu@google.com
2025-10-16 12:04:47 -07:00
Kuniyuki Iwashima
7c268eaeec net: Allow opt-out from global protocol memory accounting.
Some protocols (e.g., TCP, UDP) implement memory accounting for socket
buffers and charge memory to per-protocol global counters pointed to by
sk->sk_proto->memory_allocated.

Sometimes, system processes do not want that limitation.  For a similar
purpose, there is SO_RESERVE_MEM for sockets under memcg.

Also, by opting out of the per-protocol accounting, sockets under memcg
can avoid paying costs for two orthogonal memory accounting mechanisms.
A microbenchmark result is in the subsequent bpf patch.

Let's allow opt-out from the per-protocol memory accounting if
sk->sk_bypass_prot_mem is true.

sk->sk_bypass_prot_mem and sk->sk_prot are placed in the same cache
line, and sk_has_account() always fetches sk->sk_prot before accessing
sk->sk_bypass_prot_mem, so there is no extra cache miss for this patch.

The following patches will set sk->sk_bypass_prot_mem to true, and
then, the per-protocol memory accounting will be skipped.

Note that this does NOT disable memcg, but rather the per-protocol one.

Another option not to use the hole in struct sock_common is create
sk_prot variants like tcp_prot_bypass, but this would complicate
SOCKMAP logic, tcp_bpf_prots etc.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://patch.msgid.link/20251014235604.3057003-3-kuniyu@google.com
2025-10-16 12:04:47 -07:00
Jakub Kicinski
55db64ddd6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.18-rc2).

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-16 11:06:28 -07:00
Eric Dumazet
e5b670e543 net: remove obsolete WARN_ON(refcount_read(&sk->sk_refcnt) == 1)
sk->sk_refcnt has been converted to refcount_t in 2017.

__sock_put(sk) being refcount_dec(&sk->sk_refcnt), it will complain
loudly if the current refcnt is 1 (or less) in a non racy way.

We can remove four WARN_ON() in favor of the generic refcount_dec()
check.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Xuanqiang Luo<luoxuanqiang@kylinos.cn>
Link: https://patch.msgid.link/20251014140605.2982703-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-15 17:18:38 -07:00
Eric Dumazet
4a7708443d net: allow busy connected flows to switch tx queues
This is a followup of commit 726e9e8b94 ("tcp: refine
skb->ooo_okay setting") and of prior commit in this series
("net: control skb->ooo_okay from skb_set_owner_w()")

skb->ooo_okay might never be set for bulk flows that always
have at least one skb in a qdisc queue of NIC queue,
especially if TX completion is delayed because of a stressed cpu.

The so-called "strange attractors" has caused many performance
issues (see for instance 9b462d02d6 ("tcp: TCP Small Queues
and strange attractors")), we need to do better.

We have tried very hard to avoid reorders because TCP was
not dealing with them nicely a decade ago.

Use the new net.core.txq_reselection_ms sysctl to let
flows follow XPS and select a more efficient queue.

After this patch, we no longer have to make sure threads
are pinned to cpus, they now can be migrated without
adding too much spinlock/qdisc/TX completion pressure anymore.

TX completion part was problematic, because it added false sharing
on various socket fields, but also added false sharing and spinlock
contention in mm layers. Calling skb_orphan() from ndo_start_xmit()
is not an option unfortunately.

Note for later:

1) move sk->sk_tx_queue_mapping closer
to sk_tx_queue_mapping_jiffies for better cache locality.

2) Study if 9b462d02d6 ("tcp: TCP Small Queues
and strange attractors") could be revised.

Tested:

Used a host with 32 TX queues, shared by groups of 8 cores.
XPS setup :

echo ff >/sys/class/net/eth1/queue/tx-0/xps_cpus
echo ff00 >/sys/class/net/eth1/queue/tx-1/xps_cpus
echo ff0000 >/sys/class/net/eth1/queue/tx-2/xps_cpus
echo ff000000 >/sys/class/net/eth1/queue/tx-3/xps_cpus
echo ff,00000000 >/sys/class/net/eth1/queue/tx-4/xps_cpus
echo ff00,00000000 >/sys/class/net/eth1/queue/tx-5/xps_cpus
echo ff0000,00000000 >/sys/class/net/eth1/queue/tx-6/xps_cpus
echo ff000000,00000000 >/sys/class/net/eth1/queue/tx-7/xps_cpus
...

Launched a tcp_stream with 15 threads and 1000 flows, initially affined to core 0-15

taskset -c 0-15 tcp_stream -T15 -F1000 -l1000 -c -H target_host

Checked that only queues 0 and 1 are used as instructed by XPS :
tc -s qdisc show dev eth1|grep backlog|grep -v "backlog 0b 0p"
 backlog 123489410b 1890p
 backlog 69809026b 1064p
 backlog 52401054b 805p

Then force each thread to run on cpu 1,9,17,25,33,41,49,57,65,73,81,89,97,105,113,121

C=1;PID=`pidof tcp_stream`;for P in `ls /proc/$PID/task`; do taskset -pc $C $P; C=$(($C + 8));done

Set txq_reselection_ms to 1000
echo 1000 > /proc/sys/net/core/txq_reselection_ms

Check that the flows have migrated nicely:

tc -s qdisc show dev eth1|grep backlog|grep -v "backlog 0b 0p"
 backlog 130508314b 1916p
 backlog 8584380b 126p
 backlog 8584380b 126p
 backlog 8379990b 123p
 backlog 8584380b 126p
 backlog 8487484b 125p
 backlog 8584380b 126p
 backlog 8448120b 124p
 backlog 8584380b 126p
 backlog 8720640b 128p
 backlog 8856900b 130p
 backlog 8584380b 126p
 backlog 8652510b 127p
 backlog 8448120b 124p
 backlog 8516250b 125p
 backlog 7834950b 115p

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251013152234.842065-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-15 09:04:22 -07:00
Eric Dumazet
2ddef3462b net: add /proc/sys/net/core/txq_reselection_ms control
Add a new sysctl to control how often a queue reselection
can happen even if a flow has a persistent queue of skbs
in a Qdisc or NIC queue.

A value of zero means the feature is disabled.

Default is 1000 (1 second).

This sysctl is used in the following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251013152234.842065-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-15 09:04:21 -07:00
Eric Dumazet
6ddb811a57 net: add SK_WMEM_ALLOC_BIAS constant
sk->sk_wmem_alloc is initialized to 1, and sk_wmem_alloc_get()
takes care of this initial value.

Add SK_WMEM_ALLOC_BIAS define to not spread this magic value.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251013152234.842065-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-15 09:04:21 -07:00
Eric Dumazet
1c51450f1a tcp: better handle TCP_TX_DELAY on established flows
Some applications uses TCP_TX_DELAY socket option after TCP flow
is established.

Some metrics need to be updated, otherwise TCP might take time to
adapt to the new (emulated) RTT.

This patch adjusts tp->srtt_us, tp->rtt_min, icsk_rto
and sk->sk_pacing_rate.

This is best effort, and for instance icsk_rto is reset
without taking backoff into account.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251013145926.833198-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-15 08:56:30 -07:00
Byungchul Park
53615ad26e netmem: replace __netmem_clear_lsb() with netmem_to_nmdesc()
Now that we have struct netmem_desc, it'd better access the pp fields
via struct netmem_desc rather than struct net_iov.

Introduce netmem_to_nmdesc() for safely converting netmem_ref to
netmem_desc regardless of the type underneath e.i. netmem_desc, net_iov.

While at it, remove __netmem_clear_lsb() and make netmem_to_nmdesc()
used instead.

Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Byungchul Park <byungchul@sk.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20251013044133.69472-1-byungchul@sk.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-10-14 13:37:26 +02:00
Dmitry Safonov
21f4d45eba net/ip6_tunnel: Prevent perpetual tunnel growth
Similarly to ipv4 tunnel, ipv6 version updates dev->needed_headroom, too.
While ipv4 tunnel headroom adjustment growth was limited in
commit 5ae1e9922b ("net: ip_tunnel: prevent perpetual headroom growth"),
ipv6 tunnel yet increases the headroom without any ceiling.

Reflect ipv4 tunnel headroom adjustment limit on ipv6 version.

Credits to Francesco Ruggeri, who was originally debugging this issue
and wrote local Arista-specific patch and a reproducer.

Fixes: 8eb30be035 ("ipv6: Create ip6_tnl_xmit")
Cc: Florian Westphal <fw@strlen.de>
Cc: Francesco Ruggeri <fruggeri05@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Link: https://patch.msgid.link/20251009-ip6_tunnel-headroom-v2-1-8e4dbd8f7e35@arista.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-13 17:43:46 -07:00
Jakub Kicinski
7a0f94361f net: psp: don't assume reply skbs will have a socket
Rx path may be passing around unreferenced sockets, which means
that skb_set_owner_edemux() may not set skb->sk and PSP will crash:

  KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
  RIP: 0010:psp_reply_set_decrypted (./include/net/psp/functions.h:132 net/psp/psp_sock.c:287)
    tcp_v6_send_response.constprop.0 (net/ipv6/tcp_ipv6.c:979)
    tcp_v6_send_reset (net/ipv6/tcp_ipv6.c:1140 (discriminator 1))
    tcp_v6_do_rcv (net/ipv6/tcp_ipv6.c:1683)
    tcp_v6_rcv (net/ipv6/tcp_ipv6.c:1912)

Fixes: 659a2899a5 ("tcp: add datapath logic for PSP with inline key exchange")
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251001022426.2592750-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-03 10:23:50 -07:00
Linus Torvalds
07fdad3a93 Networking changes for 6.18.
Core & protocols
 ----------------
 
  - Improve drop account scalability on NUMA hosts for RAW and UDP sockets
    and the backlog, almost doubling the Pps capacity under DoS.
 
  - Optimize the UDP RX performance under stress, reducing contention,
    revisiting the binary layout of the involved data structs and
    implementing NUMA-aware locking. This improves UDP RX performance by
    an additional 50%, even more under extreme conditions.
 
  - Add support for PSP encryption of TCP connections; this mechanism has
    some similarities with IPsec and TLS, but offers superior HW offloads
    capabilities.
 
  - Ongoing work to support Accurate ECN for TCP. AccECN allows more than
    one congestion notification signal per RTT and is a building block for
    Low Latency, Low Loss, and Scalable Throughput (L4S).
 
  - Reorganize the TCP socket binary layout for data locality, reducing
    the number of touched cachelines in the fastpath.
 
  - Refactor skb deferral free to better scale on large multi-NUMA hosts,
    this improves TCP and UDP RX performances significantly on such HW.
 
  - Increase the default socket memory buffer limits from 256K to 4M to
    better fit modern link speeds.
 
  - Improve handling of setups with a large number of nexthop, making dump
    operating scaling linearly and avoiding unneeded synchronize_rcu() on
    delete.
 
  - Improve bridge handling of VLAN FDB, storing a single entry per bridge
    instead of one entry per port; this makes the dump order of magnitude
    faster on large switches.
 
  - Restore IP ID correctly for encapsulated packets at GSO segmentation
    time, allowing GRO to merge packets in more scenarios.
 
  - Improve netfilter matching performance on large sets.
 
  - Improve MPTCP receive path performance by leveraging recently
    introduced core infrastructure (skb deferral free) and adopting recent
    TCP autotuning changes.
 
  - Allow bridges to redirect to a backup port when the bridge port is
    administratively down.
 
  - Introduce MPTCP 'laminar' endpoint that con be used only once per
    connection and simplify common MPTCP setups.
 
  - Add RCU safety to dst->dev, closing a lot of possible races.
 
  - A significant crypto library API for SCTP, MPTCP and IPv6 SR, reducing
    code duplication.
 
  - Supports pulling data from an skb frag into the linear area of an XDP
    buffer.
 
 Things we sprinkled into general kernel code
 --------------------------------------------
 
  - Generate netlink documentation from YAML using an integrated
    YAML parser.
 
 Driver API
 ----------
 
  - Support using IPv6 Flow Label in Rx hash computation and RSS queue
    selection.
 
  - Introduce API for fetching the DMA device for a given queue, allowing
    TCP zerocopy RX on more H/W setups.
 
  - Make XDP helpers compatible with unreadable memory, allowing more
    easily building DevMem-enabled drivers with a unified XDP/skbs
    datapath.
 
  - Add a new dedicated ethtool callback enabling drivers to provide the
    number of RX rings directly, improving efficiency and clarity in RX
    ring queries and RSS configuration.
 
  - Introduce a burst period for the health reporter, allowing better
    handling of multiple errors due to the same root cause.
 
  - Support for DPLL phase offset exponential moving average, controlling
    the average smoothing factor.
 
 Device drivers
 --------------
 
  - Add a new Huawei driver for 3rd gen NIC (hinic3).
 
  - Add a new SpacemiT driver for K1 ethernet MAC.
 
  - Add a generic abstraction for shared memory communication devices
    (dibps)
 
  - Ethernet high-speed NICs:
    - nVidia/Mellanox:
      - Use multiple per-queue doorbell, to avoid MMIO contention issues
      - support adjacent functions, allowing them to delegate their
        SR-IOV VFs to sibling PFs
      - support RSS for IPSec offload
      - support exposing raw cycle counters in PTP and mlx5
      - support for disabling host PFs.
    - Intel (100G, ice, idpf):
      - ice: support for SRIOV VFs over an Active-Active link aggregate
      - ice: support for firmware logging via debugfs
      - ice: support for Earliest TxTime First (ETF) hardware offload
      - idpf: support basic XDP functionalities and XSk
    - Broadcom (bnxt):
      - support Hyper-V VF ID
      - dynamic SRIOV resource allocations for RoCE
    - Meta (fbnic):
      - support queue API, zero-copy Rx and Tx
      - support basic XDP functionalities
      - devlink health support for FW crashes and OTP mem corruptions
      - expand hardware stats coverage to FEC, PHY, and Pause
    - Wangxun:
      - support ethtool coalesce options
      - support for multiple RSS contexts
 
  - Ethernet virtual:
    - Macsec:
      - replace custom netlink attribute checks with policy-level checks
    - Bonding:
      - support aggregator selection based on port priority
    - Microsoft vNIC:
      - use page pool fragments for RX buffers instead of full pages to
        improve memory efficiency
 
  - Ethernet NICs consumer, and embedded:
    - Qualcomm: support Ethernet function for IPQ9574 SoC
    - Airoha: implement wlan offloading via NPU
    - Freescale
      - enetc: add NETC timer PTP driver and add PTP support
      - fec: enable the Jumbo frame support for i.MX8QM
    - Renesas (R-Car S4): support HW offloading for layer 2 switching
      - support for RZ/{T2H, N2H} SoCs
    - Cadence (macb): support TAPRIO traffic scheduling
    - TI:
      - support for Gigabit ICSS ethernet SoC (icssm-prueth)
    - Synopsys (stmmac): a lot of cleanups
 
  - Ethernet PHYs:
    - Support 10g-qxgmi phy-mode for AQR412C, Felix DSA and Lynx PCS
      driver
    - Support bcm63268 GPHY power control
    - Support for Micrel lan8842 PHY and PTP
    - Support for Aquantia AQR412 and AQR115
 
  - CAN:
    - a large CAN-XL preparation work
    - reorganize raw_sock and uniqframe struct to minimize memory usage
    - rcar_canfd: update the CAN-FD handling
 
  - WiFi:
    - extended Neighbor Awareness Networking (NAN) support
    - S1G channel representation cleanup
    - improve S1G support
 
  - WiFi drivers:
    - Intel (iwlwifi):
      - major refactor and cleanup
    - Broadcom (brcm80211):
      - support for AP isolation
    - RealTek (rtw88/89) rtw88/89:
      - preparation work for RTL8922DE support
    - MediaTek (mt76):
      - HW restart improvements
      - MLO support
    - Qualcomm/Atheros (ath10k_
      - GTK rekey fixes
 
  - Bluetooth drivers:
    - btusb: support for several new IDs for MT7925
    - btintel: support for BlazarIW core
    - btintel_pcie: support for _suspend() / _resume()
    - btintel_pcie: support for Scorpious, Panther Lake-H484 IDs
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmjdBdsSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkWnAP/2Jz0ExnYMDxLxkkKqMte+3JNeF/KNjB
 zVIslYN/5ekkd1TMXyDLFlWHqUOUMSF9+cK5ZRMHG6/lzOueSzFuuuDFsWs6Kf2f
 q7Q9MzlXhR9++YCsdES1uS5x3PPjMInzo2ZivMFa6fUDLLFSzeAOKL9k+RS0EggU
 VlXv2Wkm73R0O6KAssgDsHke9cnNz+F0DzhQ0S3qkyZF9tS5NrDeUl7fZ47XZgwb
 ZuXdEzKmTTepo2XvZGxJoF2D7nekWFlHhLjEPpDJtET19nwhukCry41/FplJwlKR
 dWsYkqLkrOEQKFQ+q++/5c3BgFOG+IrQLbP5bGQF1tZRMl4Dqp9PDxK5fKJfccbS
 0VY3Y2qWOSYejT2Ji2kEHR5E5rPyFm3Y5A4Rnpz0yEHj14vL2v4zf7CZRkCyyDfD
 doIZXFGkM0+N7QeXLEN833cje9zjaXuP9GAE+2bb+wAWAZAqof4KX8JgHh+y5Rwm
 pvUtvFxmEtntlMwNBap8aT3FquGtfZncU8pzAE5kvWjuMvyF0NVWiE5rB2kSQM/X
 NLmUdvDyjwwJqthXAJG0fl+a0mNJ/kOAqSOKJDJrfKDGWa+ovwY0iY06SpK0wIbO
 Wz7tpMk5MSlYXW8xWMlbyhvvU6T9xuoQ2KV4QTdMxc6Ir3sNX6YkQr+gjQjxB0gx
 ST2QF6lZeWFh
 =w2Kz
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Paolo Abeni:
 "Core & protocols:

   - Improve drop account scalability on NUMA hosts for RAW and UDP
     sockets and the backlog, almost doubling the Pps capacity under DoS

   - Optimize the UDP RX performance under stress, reducing contention,
     revisiting the binary layout of the involved data structs and
     implementing NUMA-aware locking. This improves UDP RX performance
     by an additional 50%, even more under extreme conditions

   - Add support for PSP encryption of TCP connections; this mechanism
     has some similarities with IPsec and TLS, but offers superior HW
     offloads capabilities

   - Ongoing work to support Accurate ECN for TCP. AccECN allows more
     than one congestion notification signal per RTT and is a building
     block for Low Latency, Low Loss, and Scalable Throughput (L4S)

   - Reorganize the TCP socket binary layout for data locality, reducing
     the number of touched cachelines in the fastpath

   - Refactor skb deferral free to better scale on large multi-NUMA
     hosts, this improves TCP and UDP RX performances significantly on
     such HW

   - Increase the default socket memory buffer limits from 256K to 4M to
     better fit modern link speeds

   - Improve handling of setups with a large number of nexthop, making
     dump operating scaling linearly and avoiding unneeded
     synchronize_rcu() on delete

   - Improve bridge handling of VLAN FDB, storing a single entry per
     bridge instead of one entry per port; this makes the dump order of
     magnitude faster on large switches

   - Restore IP ID correctly for encapsulated packets at GSO
     segmentation time, allowing GRO to merge packets in more scenarios

   - Improve netfilter matching performance on large sets

   - Improve MPTCP receive path performance by leveraging recently
     introduced core infrastructure (skb deferral free) and adopting
     recent TCP autotuning changes

   - Allow bridges to redirect to a backup port when the bridge port is
     administratively down

   - Introduce MPTCP 'laminar' endpoint that con be used only once per
     connection and simplify common MPTCP setups

   - Add RCU safety to dst->dev, closing a lot of possible races

   - A significant crypto library API for SCTP, MPTCP and IPv6 SR,
     reducing code duplication

   - Supports pulling data from an skb frag into the linear area of an
     XDP buffer

  Things we sprinkled into general kernel code:

   - Generate netlink documentation from YAML using an integrated YAML
     parser

  Driver API:

   - Support using IPv6 Flow Label in Rx hash computation and RSS queue
     selection

   - Introduce API for fetching the DMA device for a given queue,
     allowing TCP zerocopy RX on more H/W setups

   - Make XDP helpers compatible with unreadable memory, allowing more
     easily building DevMem-enabled drivers with a unified XDP/skbs
     datapath

   - Add a new dedicated ethtool callback enabling drivers to provide
     the number of RX rings directly, improving efficiency and clarity
     in RX ring queries and RSS configuration

   - Introduce a burst period for the health reporter, allowing better
     handling of multiple errors due to the same root cause

   - Support for DPLL phase offset exponential moving average,
     controlling the average smoothing factor

  Device drivers:

   - Add a new Huawei driver for 3rd gen NIC (hinic3)

   - Add a new SpacemiT driver for K1 ethernet MAC

   - Add a generic abstraction for shared memory communication
     devices (dibps)

   - Ethernet high-speed NICs:
      - nVidia/Mellanox:
         - Use multiple per-queue doorbell, to avoid MMIO contention
           issues
         - support adjacent functions, allowing them to delegate their
           SR-IOV VFs to sibling PFs
         - support RSS for IPSec offload
         - support exposing raw cycle counters in PTP and mlx5
         - support for disabling host PFs.
      - Intel (100G, ice, idpf):
         - ice: support for SRIOV VFs over an Active-Active link
           aggregate
         - ice: support for firmware logging via debugfs
         - ice: support for Earliest TxTime First (ETF) hardware offload
         - idpf: support basic XDP functionalities and XSk
      - Broadcom (bnxt):
         - support Hyper-V VF ID
         - dynamic SRIOV resource allocations for RoCE
      - Meta (fbnic):
         - support queue API, zero-copy Rx and Tx
         - support basic XDP functionalities
         - devlink health support for FW crashes and OTP mem corruptions
         - expand hardware stats coverage to FEC, PHY, and Pause
      - Wangxun:
         - support ethtool coalesce options
         - support for multiple RSS contexts

   - Ethernet virtual:
      - Macsec:
         - replace custom netlink attribute checks with policy-level
           checks
      - Bonding:
         - support aggregator selection based on port priority
      - Microsoft vNIC:
         - use page pool fragments for RX buffers instead of full pages
           to improve memory efficiency

   - Ethernet NICs consumer, and embedded:
      - Qualcomm: support Ethernet function for IPQ9574 SoC
      - Airoha: implement wlan offloading via NPU
      - Freescale
         - enetc: add NETC timer PTP driver and add PTP support
         - fec: enable the Jumbo frame support for i.MX8QM
      - Renesas (R-Car S4):
         - support HW offloading for layer 2 switching
         - support for RZ/{T2H, N2H} SoCs
      - Cadence (macb): support TAPRIO traffic scheduling
      - TI:
         - support for Gigabit ICSS ethernet SoC (icssm-prueth)
      - Synopsys (stmmac): a lot of cleanups

   - Ethernet PHYs:
      - Support 10g-qxgmi phy-mode for AQR412C, Felix DSA and Lynx PCS
        driver
      - Support bcm63268 GPHY power control
      - Support for Micrel lan8842 PHY and PTP
      - Support for Aquantia AQR412 and AQR115

   - CAN:
      - a large CAN-XL preparation work
      - reorganize raw_sock and uniqframe struct to minimize memory
        usage
      - rcar_canfd: update the CAN-FD handling

   - WiFi:
      - extended Neighbor Awareness Networking (NAN) support
      - S1G channel representation cleanup
      - improve S1G support

   - WiFi drivers:
      - Intel (iwlwifi):
         - major refactor and cleanup
      - Broadcom (brcm80211):
         - support for AP isolation
      - RealTek (rtw88/89) rtw88/89:
         - preparation work for RTL8922DE support
      - MediaTek (mt76):
         - HW restart improvements
         - MLO support
      - Qualcomm/Atheros (ath10k):
         - GTK rekey fixes

   - Bluetooth drivers:
      - btusb: support for several new IDs for MT7925
      - btintel: support for BlazarIW core
      - btintel_pcie: support for _suspend() / _resume()
      - btintel_pcie: support for Scorpious, Panther Lake-H484 IDs"

* tag 'net-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1536 commits)
  net: stmmac: Add support for Allwinner A523 GMAC200
  dt-bindings: net: sun8i-emac: Add A523 GMAC200 compatible
  Revert "Documentation: net: add flow control guide and document ethtool API"
  octeontx2-pf: fix bitmap leak
  octeontx2-vf: fix bitmap leak
  net/mlx5e: Use extack in set rxfh callback
  net/mlx5e: Introduce mlx5e_rss_params for RSS configuration
  net/mlx5e: Introduce mlx5e_rss_init_params
  net/mlx5e: Remove unused mdev param from RSS indir init
  net/mlx5: Improve QoS error messages with actual depth values
  net/mlx5e: Prevent entering switchdev mode with inconsistent netns
  net/mlx5: HWS, Generalize complex matchers
  net/mlx5: Improve write-combining test reliability for ARM64 Grace CPUs
  selftests/net: add tcp_port_share to .gitignore
  Revert "net/mlx5e: Update and set Xon/Xoff upon MTU set"
  net: add NUMA awareness to skb_attempt_defer_free()
  net: use llist for sd->defer_list
  net: make softnet_data.defer_count an atomic
  selftests: drv-net: psp: add tests for destroying devices
  selftests: drv-net: psp: add test for auto-adjusting TCP MSS
  ...
2025-10-02 15:17:01 -07:00
Paolo Abeni
f1455695d2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.17-rc8).

Conflicts:

tools/testing/selftests/drivers/net/bonding/Makefile
  87951b5664 selftests: bonding: add test for passive LACP mode
  c2377f1763 selftests: bonding: add test for LACP actor port priority

Adjacent changes:

drivers/net/ethernet/cadence/macb.h
  fca3dc859b net: macb: remove illusion about TBQPH/RBQPH being per-queue
  89934dbf16 net: macb: Add TAPRIO traffic scheduling support

drivers/net/ethernet/cadence/macb_main.c
  fca3dc859b net: macb: remove illusion about TBQPH/RBQPH being per-queue
  89934dbf16 net: macb: Add TAPRIO traffic scheduling support

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-10-01 10:14:49 +02:00
Linus Torvalds
ae28ed4578 bpf-next-6.18
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmjZH40ACgkQ6rmadz2v
 bTrG7w//X/5CyDoKIYJCqynYRdMtfqYuCe8Jhud4p5++iBVqkDyS6Y8EFLqZVyg/
 UHTqaSE4Nz8/pma0WSjhUYn6Chs1AeH+Rw/g109SovE/YGkek2KNwY3o2hDrtPMX
 +oD0my8qF2HLKgEyteXXyZ5Ju+AaF92JFiGko4/wNTX8O99F9nyz2pTkrctS9Vl9
 VwuTxrEXpmhqrhP3WCxkfNfcbs9HP+AALpgOXZKdMI6T4KI0N1gnJ0ZWJbiXZ8oT
 tug0MTPkNRidYMl0wHY2LZ6ZG8Q3a7Sgc+M0xFzaHGvGlJbBg1HjsDMtT6j34CrG
 TIVJ/O8F6EJzAnQ5Hio0FJk8IIgMRgvng5Kd5GXidU+mE6zokTyHIHOXitYkBQNH
 Hk+lGA7+E2cYqUqKvB5PFoyo+jlucuIH7YwrQlyGfqz+98n65xCgZKcmdVXr0hdB
 9v3WmwJFtVIoPErUvBC3KRANQYhFk4eVk1eiGV/20+eIVyUuNbX6wqSWSA9uEXLy
 n5fm/vlk4RjZmrPZHxcJ0dsl9LTF1VvQQHkgoC1Sz/Cc+jA6k4I+ECVHAqEbk36p
 1TUF52yPOD2ViaJKkj+962JaaaXlUn6+Dq7f1GMP6VuyHjz4gsI3mOo4XarqNdWd
 c7TnYmlGO/cGwqd4DdbmWiF1DDsrBcBzdbC8+FgffxQHLPXGzUg=
 =LeQi
 -----END PGP SIGNATURE-----

Merge tag 'bpf-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Pull bpf updates from Alexei Starovoitov:

 - Support pulling non-linear xdp data with bpf_xdp_pull_data() kfunc
   (Amery Hung)

   Applied as a stable branch in bpf-next and net-next trees.

 - Support reading skb metadata via bpf_dynptr (Jakub Sitnicki)

   Also a stable branch in bpf-next and net-next trees.

 - Enforce expected_attach_type for tailcall compatibility (Daniel
   Borkmann)

 - Replace path-sensitive with path-insensitive live stack analysis in
   the verifier (Eduard Zingerman)

   This is a significant change in the verification logic. More details,
   motivation, long term plans are in the cover letter/merge commit.

 - Support signed BPF programs (KP Singh)

   This is another major feature that took years to materialize.

   Algorithm details are in the cover letter/marge commit

 - Add support for may_goto instruction to s390 JIT (Ilya Leoshkevich)

 - Add support for may_goto instruction to arm64 JIT (Puranjay Mohan)

 - Fix USDT SIB argument handling in libbpf (Jiawei Zhao)

 - Allow uprobe-bpf program to change context registers (Jiri Olsa)

 - Support signed loads from BPF arena (Kumar Kartikeya Dwivedi and
   Puranjay Mohan)

 - Allow access to union arguments in tracing programs (Leon Hwang)

 - Optimize rcu_read_lock() + migrate_disable() combination where it's
   used in BPF subsystem (Menglong Dong)

 - Introduce bpf_task_work_schedule*() kfuncs to schedule deferred
   execution of BPF callback in the context of a specific task using the
   kernel’s task_work infrastructure (Mykyta Yatsenko)

 - Enforce RCU protection for KF_RCU_PROTECTED kfuncs (Kumar Kartikeya
   Dwivedi)

 - Add stress test for rqspinlock in NMI (Kumar Kartikeya Dwivedi)

 - Improve the precision of tnum multiplier verifier operation
   (Nandakumar Edamana)

 - Use tnums to improve is_branch_taken() logic (Paul Chaignon)

 - Add support for atomic operations in arena in riscv JIT (Pu Lehui)

 - Report arena faults to BPF error stream (Puranjay Mohan)

 - Search for tracefs at /sys/kernel/tracing first in bpftool (Quentin
   Monnet)

 - Add bpf_strcasecmp() kfunc (Rong Tao)

 - Support lookup_and_delete_elem command in BPF_MAP_STACK_TRACE (Tao
   Chen)

* tag 'bpf-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (197 commits)
  libbpf: Replace AF_ALG with open coded SHA-256
  selftests/bpf: Add stress test for rqspinlock in NMI
  selftests/bpf: Add test case for different expected_attach_type
  bpf: Enforce expected_attach_type for tailcall compatibility
  bpftool: Remove duplicate string.h header
  bpf: Remove duplicate crypto/sha2.h header
  libbpf: Fix error when st-prefix_ops and ops from differ btf
  selftests/bpf: Test changing packet data from kfunc
  selftests/bpf: Add stacktrace map lookup_and_delete_elem test case
  selftests/bpf: Refactor stacktrace_map case with skeleton
  bpf: Add lookup_and_delete_elem for BPF_MAP_STACK_TRACE
  selftests/bpf: Fix flaky bpf_cookie selftest
  selftests/bpf: Test changing packet data from global functions with a kfunc
  bpf: Emit struct bpf_xdp_sock type in vmlinux BTF
  selftests/bpf: Task_work selftest cleanup fixes
  MAINTAINERS: Delete inactive maintainers from AF_XDP
  bpf: Mark kfuncs as __noclone
  selftests/bpf: Add kprobe multi write ctx attach test
  selftests/bpf: Add kprobe write ctx attach test
  selftests/bpf: Add uprobe context ip register change test
  ...
2025-09-30 17:58:11 -07:00
Eric Dumazet
5628f3fe3b net: add NUMA awareness to skb_attempt_defer_free()
Instead of sharing sd->defer_list & sd->defer_count with
many cpus, add one pair for each NUMA node.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250928084934.3266948-4-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-30 15:45:53 +02:00
Hangbin Liu
5b66169f6b bonding: fix xfrm offload feature setup on active-backup mode
The active-backup bonding mode supports XFRM ESP offload. However, when
a bond is added using command like `ip link add bond0 type bond mode 1
miimon 100`, the `ethtool -k` command shows that the XFRM ESP offload is
disabled. This occurs because, in bond_newlink(), we change bond link
first and register bond device later. So the XFRM feature update in
bond_option_mode_set() is not called as the bond device is not yet
registered, leading to the offload feature not being set successfully.

To resolve this issue, we can modify the code order in bond_newlink() to
ensure that the bond device is registered first before changing the bond
link parameters. This change will allow the XFRM ESP offload feature to be
correctly enabled.

Fixes: 007ab53455 ("bonding: fix feature flag setting at init time")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20250925023304.472186-1-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-30 09:55:11 +02:00
Eric Dumazet
7d452516b6 Revert "net: group sk_backlog and sk_receive_queue"
This reverts commit 4effb335b5.

This was a benefit for UDP flood case, which was later greatly improved
with commits 6471658dc6 ("udp: use skb_attempt_defer_free()")
and b650bf0977 ("udp: remove busylock and add per NUMA queues").

Apparently blamed commit added a regression for RAW sockets, possibly
because they do not use the dual RX queue strategy that UDP has.

sock_queue_rcv_skb_reason() and RAW recvmsg() compete for sk_receive_buf
and sk_rmem_alloc changes, and them being in the same
cache line reduce performance.

Fixes: 4effb335b5 ("net: group sk_backlog and sk_receive_queue")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202509281326.f605b4eb-lkp@intel.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: David Ahern <dsahern@kernel.org>
Cc: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250929182112.824154-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:30:32 -07:00
Paolo Abeni
a755677974 tcp: make tcp_rcvbuf_grow() accessible to mptcp code
To leverage the auto-tuning improvements brought by commit 2da35e4b4d
("Merge branch 'tcp-receive-side-improvements'"), the MPTCP stack need
to access the mentioned helper.

Acked-by: Geliang Tang <geliang@kernel.org>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250927-net-next-mptcp-rcv-path-imp-v1-2-5da266aa9c1a@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29 18:23:35 -07:00
Linus Torvalds
18b19abc37 namespace-6.18-rc1
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZQgQAKCRCRxhvAZXjc
 oiFXAQCpbLvkWbld9wLgxUBhq+q+kw5NvGxzpvqIhXwJB9F9YAEA44/Wevln4xGx
 +kRUbP+xlRQqenIYs2dLzVHzAwAdfQ4=
 =EO4Y
 -----END PGP SIGNATURE-----

Merge tag 'namespace-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull namespace updates from Christian Brauner:
 "This contains a larger set of changes around the generic namespace
  infrastructure of the kernel.

  Each specific namespace type (net, cgroup, mnt, ...) embedds a struct
  ns_common which carries the reference count of the namespace and so
  on.

  We open-coded and cargo-culted so many quirks for each namespace type
  that it just wasn't scalable anymore. So given there's a bunch of new
  changes coming in that area I've started cleaning all of this up.

  The core change is to make it possible to correctly initialize every
  namespace uniformly and derive the correct initialization settings
  from the type of the namespace such as namespace operations, namespace
  type and so on. This leaves the new ns_common_init() function with a
  single parameter which is the specific namespace type which derives
  the correct parameters statically. This also means the compiler will
  yell as soon as someone does something remotely fishy.

  The ns_common_init() addition also allows us to remove ns_alloc_inum()
  and drops any special-casing of the initial network namespace in the
  network namespace initialization code that Linus complained about.

  Another part is reworking the reference counting. The reference
  counting was open-coded and copy-pasted for each namespace type even
  though they all followed the same rules. This also removes all open
  accesses to the reference count and makes it private and only uses a
  very small set of dedicated helpers to manipulate them just like we do
  for e.g., files.

  In addition this generalizes the mount namespace iteration
  infrastructure introduced a few cycles ago. As reminder, the vfs makes
  it possible to iterate sequentially and bidirectionally through all
  mount namespaces on the system or all mount namespaces that the caller
  holds privilege over. This allow userspace to iterate over all mounts
  in all mount namespaces using the listmount() and statmount() system
  call.

  Each mount namespace has a unique identifier for the lifetime of the
  systems that is exposed to userspace. The network namespace also has a
  unique identifier working exactly the same way. This extends the
  concept to all other namespace types.

  The new nstree type makes it possible to lookup namespaces purely by
  their identifier and to walk the namespace list sequentially and
  bidirectionally for all namespace types, allowing userspace to iterate
  through all namespaces. Looking up namespaces in the namespace tree
  works completely locklessly.

  This also means we can move the mount namespace onto the generic
  infrastructure and remove a bunch of code and members from struct
  mnt_namespace itself.

  There's a bunch of stuff coming on top of this in the future but for
  now this uses the generic namespace tree to extend a concept
  introduced first for pidfs a few cycles ago. For a while now we have
  supported pidfs file handles for pidfds. This has proven to be very
  useful.

  This extends the concept to cover namespaces as well. It is possible
  to encode and decode namespace file handles using the common
  name_to_handle_at() and open_by_handle_at() apis.

  As with pidfs file handles, namespace file handles are exhaustive,
  meaning it is not required to actually hold a reference to nsfs in
  able to decode aka open_by_handle_at() a namespace file handle.
  Instead the FD_NSFS_ROOT constant can be passed which will let the
  kernel grab a reference to the root of nsfs internally and thus decode
  the file handle.

  Namespaces file descriptors can already be derived from pidfds which
  means they aren't subject to overmount protection bugs. IOW, it's
  irrelevant if the caller would not have access to an appropriate
  /proc/<pid>/ns/ directory as they could always just derive the
  namespace based on a pidfd already.

  It has the same advantage as pidfds. It's possible to reliably and for
  the lifetime of the system refer to a namespace without pinning any
  resources and to compare them trivially.

  Permission checking is kept simple. If the caller is located in the
  namespace the file handle refers to they are able to open it otherwise
  they must hold privilege over the owning namespace of the relevant
  namespace.

  The namespace file handle layout is exposed as uapi and has a stable
  and extensible format. For now it simply contains the namespace
  identifier, the namespace type, and the inode number. The stable
  format means that userspace may construct its own namespace file
  handles without going through name_to_handle_at() as they are already
  allowed for pidfs and cgroup file handles"

* tag 'namespace-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (65 commits)
  ns: drop assert
  ns: move ns type into struct ns_common
  nstree: make struct ns_tree private
  ns: add ns_debug()
  ns: simplify ns_common_init() further
  cgroup: add missing ns_common include
  ns: use inode initializer for initial namespaces
  selftests/namespaces: verify initial namespace inode numbers
  ns: rename to __ns_ref
  nsfs: port to ns_ref_*() helpers
  net: port to ns_ref_*() helpers
  uts: port to ns_ref_*() helpers
  ipv4: use check_net()
  net: use check_net()
  net-sysfs: use check_net()
  user: port to ns_ref_*() helpers
  time: port to ns_ref_*() helpers
  pid: port to ns_ref_*() helpers
  ipc: port to ns_ref_*() helpers
  cgroup: port to ns_ref_*() helpers
  ...
2025-09-29 11:20:29 -07:00
Linus Torvalds
722df25ddf kernel-6.18-rc1.clone3
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZgMQAKCRCRxhvAZXjc
 ornXAP954dZjz+OJw6lJLCf0j9TXJOczGHvK3oW5ZD9KnqtTdwEA7p1A6WMOKJyl
 8VtTgCS0yNt8QlznUnsSDfVm0jXVGAY=
 =tUXG
 -----END PGP SIGNATURE-----

Merge tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull copy_process updates from Christian Brauner:
 "This contains the changes to enable support for clone3() on nios2
  which apparently is still a thing.

  The more exciting part of this is that it cleans up the inconsistency
  in how the 64-bit flag argument is passed from copy_process() into the
  various other copy_*() helpers"

[ Fixed up rv ltl_monitor 32-bit support as per Sasha Levin in the merge ]

* tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  nios2: implement architecture-specific portion of sys_clone3
  arch: copy_thread: pass clone_flags as u64
  copy_process: pass clone_flags as u64 across calltree
  copy_sighand: Handle architectures where sizeof(unsigned long) < sizeof(u64)
2025-09-29 10:36:50 -07:00
Gustavo A. R. Silva
be812ace03 Bluetooth: Avoid a couple dozen -Wflex-array-member-not-at-end warnings
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it, globally.

Use the __struct_group() helper to fix 31 instances of the following
type of warnings:

30 net/bluetooth/mgmt_config.c:16:33: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
1 net/bluetooth/mgmt_config.c:22:33: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-09-27 11:37:43 -04:00
Luiz Augusto von Dentz
9eb1433188 Bluetooth: Add function and line information to bt_dbg
When enabling debug via CONFIG_BT_FEATURE_DEBUG include function and
line information by default otherwise it is hard to make any sense of
which function the logs comes from.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-09-27 11:37:01 -04:00
Luiz Augusto von Dentz
c9beb36c14 Bluetooth: hci_core: Detect if an ISO link has stalled
This attempts to detect if an ISO link has been waiting for an ISO
buffer for longer than the maximum allowed transport latency then
proceed to use hci_link_tx_to which prints an error and disconnects.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-09-27 11:37:01 -04:00
Luiz Augusto von Dentz
339a87883a Bluetooth: ISO: Use sk_sndtimeo as conn_timeout
This aligns the usage of socket sk_sndtimeo as conn_timeout when
initiating a connection and then use it when scheduling the
resulting HCI command, similar to what has been done in bf98feea5b
("Bluetooth: hci_conn: Always use sk_timeo as conn_timeout").

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-09-27 11:37:01 -04:00
Thorsten Blum
d4e99db3d9 Bluetooth: Annotate struct hci_drv_rp_read_info with __counted_by_le()
Add the __counted_by_le() compiler attribute to the flexible array
member 'supported_commands' to improve access bounds-checking via
CONFIG_UBSAN_BOUNDS and CONFIG_FORTIFY_SOURCE.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-09-27 11:37:00 -04:00
Jakub Kicinski
94aced6ed9 Quite a bit more things, including pull requests from drivers:
- mt76: MLO support, HW restart improvements
  - rtw88/89: small features, prep for RTL8922DE support
  - ath10k: GTK rekey fixes
  - cfg80211/mac80211:
    - additions for more NAN support
    - S1G channel representation cleanup
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmjU+BcACgkQ10qiO8sP
 aADryQ//aarXIxRw6irA9d0IuiDdCAnTpYGSnbSCzWi4I51lnL5VU9bH0BynGHMu
 VSuQokuG6Fp+rMD0CPVAhIlTj2TxueItSN1Dl0aLMzKXHSNFd9QmhyLc6XL+ei0M
 d8bw1UEjdWxjfEUQHocqP8uCgLAZqOwIlJhvs4HGBrkwNSoMMinjltQMtPNdnqX6
 lXHKok717I3kTERBLSVE6m+1OkLizz1inswmvkccJM5JBQ2Y0iCkRewEsXJxfL50
 V1LAKgZOy15OPBja3o+tUSkZBYu7GHGirWmColZ8mO0nVOdWaH42XtsCaVA/UMOk
 FCeVetfbPlFlQ5huo4rvPDPqUKOjAkVIp10dal/1+mdCvvlyKAnhYslqxxdlS+cY
 uwHV9/GGFZkUj2cbx0ARq16BD0Co0tGfEbjeb0dkltjuekfH1jEpC0l4LjM/VDYX
 F+UJXeVBe1926g6RjP5Akj7NdGtz8/ND9bo+hpKW7EA5milzwFpDGTSXUAqiBomk
 KpGqjeVvLlXmwaAZtZldePEYrZvYGtJmR8mVZA/B6RzVerN1p1ttE+wQgCCLCbXm
 vwYLOPYJqBLJCXP1Xr0VIAGJELO4Z/0nu/Cil9jE8CDx/PSysOWRwWa+V5OowdTB
 dxC2GHKnEq1Bc2qPQJPeT+bV1NVQXMzaT/kTQ1+vPaKX2P0oRRo=
 =3HjM
 -----END PGP SIGNATURE-----

Merge tag 'wireless-next-2025-09-25' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Johannes Berg says:

====================
Quite a bit more things, including pull requests from drivers:

 - mt76: MLO support, HW restart improvements
 - rtw88/89: small features, prep for RTL8922DE support
 - ath10k: GTK rekey fixes
 - cfg80211/mac80211:
   - additions for more NAN support
   - S1G channel representation cleanup

* tag 'wireless-next-2025-09-25' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (167 commits)
  wifi: libertas: add WQ_UNBOUND to alloc_workqueue users
  Revert "wifi: libertas: WQ_PERCPU added to alloc_workqueue users"
  wifi: libertas: WQ_PERCPU added to alloc_workqueue users
  wifi: cfg80211: fix width unit in cfg80211_radio_chandef_valid()
  wifi: ath11k: HAL SRNG: don't deinitialize and re-initialize again
  wifi: ath12k: enforce CPU endian format for all QMI data
  wifi: ath12k: Use 1KB Cache Flush Command for QoS TID Descriptors
  wifi: ath12k: Fix flush cache failure during RX queue update
  wifi: ath12k: Add Retry Mechanism for REO RX Queue Update Failures
  wifi: ath12k: Refactor REO command to use ath12k_dp_rx_tid_rxq
  wifi: ath12k: Refactor RX TID buffer cleanup into helper function
  wifi: ath12k: Refactor RX TID deletion handling into helper function
  wifi: ath12k: Increase DP_REO_CMD_RING_SIZE to 256
  wifi: cfg80211: remove IEEE80211_CHAN_{1,2,4,8,16}MHZ flags
  wifi: rtw89: avoid circular locking dependency in ser_state_run()
  wifi: rtw89: fix leak in rtw89_core_send_nullfunc()
  wifi: rtw89: avoid possible TX wait initialization race
  wifi: rtw89: fix use-after-free in rtw89_core_tx_kick_off_and_wait()
  wifi: ath12k: Fix peer lookup in ath12k_dp_mon_rx_deliver_msdu()
  wifi: mac80211: fix Rx packet handling when pubsta information is not available
  ...
====================

Link: https://patch.msgid.link/20250925232341.4544-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-26 14:27:28 -07:00
Jakub Kicinski
203e3beb73 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.17-rc8).

Conflicts:

drivers/net/can/spi/hi311x.c
  6b69680847 ("can: hi311x: fix null pointer dereference when resuming from sleep before interface was enabled")
  27ce71e1ce ("net: WQ_PERCPU added to alloc_workqueue users")
https://lore.kernel.org/72ce7599-1b5b-464a-a5de-228ff9724701@kernel.org

net/smc/smc_loopback.c
drivers/dibs/dibs_loopback.c
  a35c04de25 ("net/smc: fix warning in smc_rx_splice() when calling get_page()")
  cc21191b58 ("dibs: Move data path to dibs layer")
https://lore.kernel.org/74368a5c-48ac-4f8e-a198-40ec1ed3cf5f@kernel.org

Adjacent changes:

drivers/net/dsa/lantiq/lantiq_gswip.c
  c0054b25e2 ("net: dsa: lantiq_gswip: move gswip_add_single_port_br() call to port_setup()")
  7a1eaef0a7 ("net: dsa: lantiq_gswip: support model-specific mac_select_pcs()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-25 11:00:59 -07:00
Richard Gobert
f095a358fa net: gro: remove unnecessary df checks
Currently, packets with fixed IDs will be merged only if their
don't-fragment bit is set. This restriction is unnecessary since
packets without the don't-fragment bit will be forwarded as-is even
if they were merged together. The merged packets will be segmented
into their original forms before being forwarded, either by GSO or
by TSO. The IDs will also remain identical unless NETIF_F_TSO_MANGLEID
is set, in which case the IDs can become incrementing, which is also fine.

Clean up the code by removing the unnecessary don't-fragment checks.

Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250923085908.4687-5-richardbgobert@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-25 12:42:49 +02:00
Richard Gobert
21f7484220 net: gro: only merge packets with incrementing or fixed outer ids
Only merge encapsulated packets if their outer IDs are either
incrementing or fixed, just like for inner IDs and IDs of non-encapsulated
packets.

Add another ip_fixedid bit for a total of two bits: one for outer IDs (and
for unencapsulated packets) and one for inner IDs.

This commit preserves the current behavior of GSO where only the IDs of the
inner-most headers are restored correctly.

Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250923085908.4687-3-richardbgobert@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-25 12:42:49 +02:00
Richard Gobert
25c550464a net: gro: remove is_ipv6 from napi_gro_cb
Remove is_ipv6 from napi_gro_cb and use sk->sk_family instead.
This frees up space for another ip_fixedid bit that will be added
in the next commit.

udp_sock_create always creates either a AF_INET or a AF_INET6 socket,
so using sk->sk_family is reliable. In IPv6-FOU, cfg->ipv6_v6only is
always enabled.

Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250923085908.4687-2-richardbgobert@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-25 12:42:49 +02:00
Jakub Kicinski
5e3fee34f6 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ6NaUOruQGUkvPdG4raS+Z+3y5EwUCaNNwBQAKCRAraS+Z+3y5
 E8heAQDdJTR9rwAL7gD79cldlHP5PTmjyidLIoFG/efaGSbN1AD9EdvrykDU4xOG
 aGaO8TooGUZf7vAL8tIFuMeydYvi/gM=
 =Qu4T
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Martin KaFai Lau says:

====================
pull-request: bpf-next 2025-09-23

We've added 9 non-merge commits during the last 33 day(s) which contain
a total of 10 files changed, 480 insertions(+), 53 deletions(-).

The main changes are:

1) A new bpf_xdp_pull_data kfunc that supports pulling data from
   a frag into the linear area of a xdp_buff, from Amery Hung.

   This includes changes in the xdp_native.bpf.c selftest, which
   Nimrod's future work depends on.

   It is a merge from a stable branch 'xdp_pull_data' which has
   also been merged to bpf-next.

   There is a conflict with recent changes in 'include/net/xdp.h'
   in the net-next tree that will need to be resolved.

2) A compiler warning fix when CONFIG_NET=n in the recent dynptr
   skb_meta support, from Jakub Sitnicki.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  selftests: drv-net: Pull data before parsing headers
  selftests/bpf: Test bpf_xdp_pull_data
  bpf: Support specifying linear xdp packet data size for BPF_PROG_TEST_RUN
  bpf: Make variables in bpf_prog_test_run_xdp less confusing
  bpf: Clear packet pointers after changing packet data in kfuncs
  bpf: Support pulling non-linear xdp data
  bpf: Allow bpf_xdp_shrink_data to shrink a frag from head and tail
  bpf: Clear pfmemalloc flag when freeing all fragments
  bpf: Return an error pointer for skb metadata when CONFIG_NET=n
====================

Link: https://patch.msgid.link/20250924050303.2466356-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-24 10:22:37 -07:00
Eric Dumazet
b650bf0977 udp: remove busylock and add per NUMA queues
busylock was protecting UDP sockets against packet floods,
but unfortunately was not protecting the host itself.

Under stress, many cpus could spin while acquiring the busylock,
and NIC had to drop packets. Or packets would be dropped
in cpu backlog if RPS/RFS were in place.

This patch replaces the busylock by intermediate
lockless queues. (One queue per NUMA node).

This means that fewer number of cpus have to acquire
the UDP receive queue lock.

Most of the cpus can either:
- immediately drop the packet.
- or queue it in their NUMA aware lockless queue.

Then one of the cpu is chosen to process this lockless queue
in a batch.

The batch only contains packets that were cooked on the same
NUMA node, thus with very limited latency impact.

Tested:

DDOS targeting a victim UDP socket, on a platform with 6 NUMA nodes
(Intel(R) Xeon(R) 6985P-C)

Before:

nstat -n ; sleep 1 ; nstat | grep Udp
Udp6InDatagrams                 1004179            0.0
Udp6InErrors                    3117               0.0
Udp6RcvbufErrors                3117               0.0

After:
nstat -n ; sleep 1 ; nstat | grep Udp
Udp6InDatagrams                 1116633            0.0
Udp6InErrors                    14197275           0.0
Udp6RcvbufErrors                14197275           0.0

We can see this host can now proces 14.2 M more packets per second
while under attack, and the victim socket can receive 11 % more
packets.

I used a small bpftrace program measuring time (in us) spent in
__udp_enqueue_schedule_skb().

Before:

@udp_enqueue_us[398]:
[0]                24901 |@@@                                                 |
[1]                63512 |@@@@@@@@@                                           |
[2, 4)            344827 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)            244673 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                |
[8, 16)            54022 |@@@@@@@@                                            |
[16, 32)          222134 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                   |
[32, 64)          232042 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |
[64, 128)           4219 |                                                    |
[128, 256)           188 |                                                    |

After:

@udp_enqueue_us[398]:
[0]              5608855 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1]              1111277 |@@@@@@@@@@                                          |
[2, 4)            501439 |@@@@                                                |
[4, 8)            102921 |                                                    |
[8, 16)            29895 |                                                    |
[16, 32)           43500 |                                                    |
[32, 64)           31552 |                                                    |
[64, 128)            979 |                                                    |
[128, 256)            13 |                                                    |

Note that the remaining bottleneck for this platform is in
udp_drops_inc() because we limited struct numa_drop_counters
to only two nodes so far.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250922104240.2182559-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-23 16:38:39 -07:00
Martin KaFai Lau
34f033a6c9 Merge branch 'bpf-next/xdp_pull_data' into 'bpf-next/master'
Merge the xdp_pull_data stable branch into the master branch. No conflict.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2025-09-23 16:23:58 -07:00
Martin KaFai Lau
55d5a5154d Merge branch 'bpf-next/xdp_pull_data' into 'bpf-next/net'
Merge the xdp_pull_data stable branch into the net branch. No conflict.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2025-09-23 15:46:52 -07:00
Amery Hung
dea1526fba bpf: Allow bpf_xdp_shrink_data to shrink a frag from head and tail
Move skb_frag_t adjustment into bpf_xdp_shrink_data() and extend its
functionality to be able to shrink an xdp fragment from both head and
tail. In a later patch, bpf_xdp_pull_data() will reuse it to shrink an
xdp fragment from head.

Additionally, in bpf_xdp_frags_shrink_tail(), breaking the loop when
bpf_xdp_shrink_data() returns false (i.e., not releasing the current
fragment) is not necessary as the loop condition, offset > 0, has the
same effect. Remove the else branch to simplify the code.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20250922233356.3356453-3-ameryhung@gmail.com
2025-09-23 13:35:12 -07:00
Amery Hung
8f12d1137c bpf: Clear pfmemalloc flag when freeing all fragments
It is possible for bpf_xdp_adjust_tail() to free all fragments. The
kfunc currently clears the XDP_FLAGS_HAS_FRAGS bit, but not
XDP_FLAGS_FRAGS_PF_MEMALLOC. So far, this has not caused a issue when
building sk_buff from xdp_buff since all readers of xdp_buff->flags
use the flag only when there are fragments. Clear the
XDP_FLAGS_FRAGS_PF_MEMALLOC bit as well to make the flags correct.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20250922233356.3356453-2-ameryhung@gmail.com
2025-09-23 13:35:11 -07:00
Julian Ruess
a612dbe8d0 dibs: Move event handling to dibs layer
Add defines for all event types and subtypes an ism device is known to
produce as it can be helpful for debugging purposes.

Introduces a generic 'struct dibs_event' and adopt ism device driver
and smc-d client accordingly. Tolerate and ignore other type and subtype
values to enable future device extensions.

SMC-D and ISM are now independent.
struct ism_dev can be moved to drivers/s390/net/ism.h.

Note that in smc, the term 'ism' is still used. Future patches could
replace that with 'dibs' or 'smc-d' as appropriate.

Signed-off-by: Julian Ruess <julianr@linux.ibm.com>
Co-developed-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-15-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-23 11:13:22 +02:00
Alexandra Winter
cc21191b58 dibs: Move data path to dibs layer
Use struct dibs_dmb instead of struct smc_dmb and move the corresponding
client tables to dibs_dev. Leave driver specific implementation details
like sba in the device drivers.

Register and unregister dmbs via dibs_dev_ops. A dmb is dedicated to a
single client, but a dibs device can have dmbs for more than one client.

Trigger dibs clients via dibs_client_ops->handle_irq(), when data is
received into a dmb. For dibs_loopback replace scheduling an smcd receive
tasklet with calling dibs_client_ops->handle_irq().

For loopback devices attach_dmb(), detach_dmb() and move_data() need to
access the dmb tables, so move those to dibs_dev_ops in this patch as well.

Remove remaining definitions of smc_loopback as they are no longer
required, now that everything is in dibs_loopback.

Note that struct ism_client and struct ism_dev are still required in smc
until a follow-on patch moves event handling to dibs. (Loopback does not
use events).

Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-14-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-23 11:13:22 +02:00
Alexandra Winter
719c3b67bb dibs: Move query_remote_gid() to dibs_dev_ops
Provide the dibs_dev_ops->query_remote_gid() in ism and dibs_loopback
dibs_devices. And call it in smc dibs_client.

Reviewed-by: Julian Ruess <julianr@linux.ibm.com>
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-13-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-23 11:13:22 +02:00
Alexandra Winter
92a0f7bb08 dibs: Move vlan support to dibs_dev_ops
It can be debated how much benefit definition of vlan ids for dibs devices
brings, as the dmbs are accessible only by a single peer anyhow. But ism
provides vlan support and smcd exploits it, so move it to dibs layer as an
optional feature.

smcd_loopback simply ignores all vlan settings, do the same in
dibs_loopback.

SMC-D and ISM have a method to use the invalid VLAN ID 1FFF
(ISM_RESERVED_VLANID), to indicate that both communication peers support
routable SMC-Dv2. Tolerate it in dibs, but move it to SMC only.

Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-12-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-23 11:13:22 +02:00
Alexandra Winter
05e68d8ded dibs: Local gid for dibs devices
Define a uuid_t GID attribute to identify a dibs device.

SMC uses 64 Bit and 128 Bit Global Identifiers (GIDs) per device, that
need to be sent via the SMC protocol. Because the smc code uses integers,
network endianness and host endianness need to be considered. Avoid this
in the dibs layer by using uuid_t byte arrays. Future patches could change
SMC to use uuid_t. For now conversion helper functions are introduced.

ISM devices provide 64 Bit GIDs. Map them to dibs uuid_t GIDs like this:
 _________________________________________
| 64 Bit ISM-vPCI GID | 00000000_00000000 |
 -----------------------------------------
If interpreted as UUID [1], this would be interpreted as the UIID variant,
that is reserved for NCS backward compatibility. So it will not collide
with UUIDs that were generated according to the standard.

smc_loopback already uses version 4 UUIDs as 128 Bit GIDs, move that to
dibs loopback. A temporary change to smc_lo_query_rgid() is required,
that will be moved to dibs_loopback with a follow-on patch.

Provide gid of a dibs device as sysfs read-only attribute.

Link: https://datatracker.ietf.org/doc/html/rfc4122 [1]
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Julian Ruess <julianr@linux.ibm.com>
Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-11-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-23 11:13:22 +02:00
Julian Ruess
845c334a01 dibs: Move struct device to dibs_dev
Move struct device from ism_dev and smc_lo_dev to dibs_dev, and define a
corresponding release function. Free ism_dev in ism_remove() and smc_lo_dev
in smc_lo_dev_remove().

Replace smcd->ops->get_dev(smcd) by using dibs->dev directly.

An alternative design would be to embed dibs_dev as a field in ism_dev and
do the same for other dibs device driver specific structs. However that
would have the disadvantage that each dibs device driver needs to allocate
dibs_dev and each dibs device driver needs a different device release
function. The advantage would be that ism_dev and other device driver
specific structs would be covered by device reference counts.

Signed-off-by: Julian Ruess <julianr@linux.ibm.com>
Co-developed-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-9-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-23 11:13:22 +02:00
Alexandra Winter
69baaac936 dibs: Define dibs_client_ops and dibs_dev_ops
Move the device add() and remove() functions from ism_client to
dibs_client_ops and call add_dev()/del_dev() for ism devices and
dibs_loopback devices. dibs_client_ops->add_dev() = smcd_register_dev() for
the smc_dibs_client. This is the first step to handle ism and loopback
devices alike (as dibs devices) in the smc dibs client.

Define dibs_dev->ops and move smcd_ops->get_chid to
dibs_dev_ops->get_fabric_id() for ism and loopback devices. See below for
why this needs to be in the same patch as dibs_client_ops->add_dev().

The following changes contain intermediate steps, that will be obsoleted by
follow-on patches, once more functionality has been moved to dibs:

Use different smcd_ops and max_dmbs for ism and loopback. Follow-on patches
will change SMC-D to directly use dibs_ops instead of smcd_ops.

In smcd_register_dev() it is now necessary to identify a dibs_loopback
device before smcd_dev and smcd_ops->get_chid() are available. So provide
dibs_dev_ops->get_fabric_id() in this patch and evaluate it in
smc_ism_is_loopback().

Call smc_loopback_init() in smcd_register_dev() and call
smc_loopback_exit() in smcd_unregister_dev() to handle the functionality
that is still in smc_loopback. Follow-on patches will move all smc_loopback
code to dibs_loopback.

In smcd_[un]register_dev() use only ism device name, this will be replaced
by dibs device name by a follow-on patch.

End of changes with intermediate parts.

Allocate an smcd event workqueue for all dibs devices, although
dibs_loopback does not generate events.

Use kernel memory instead of devres memory for smcd_dev and smcd->conn.
Since commit a72178cfe8 ("net/smc: Fix dependency of SMC on ISM") an ism
device and its driver can have a longer lifetime than the smc module, so
smc should not rely on devres to free its resources [1]. It is now the
responsibility of the smc client to free smcd and smcd->conn for all dibs
devices, ism devices as well as loopback. Call client->ops->del_dev() for
all existing dibs devices in dibs_unregister_client(), so all device
related structures can be freed in the client.

When dibs_unregister_client() is called in the context of smc_exit() or
smc_core_reboot_event(), these functions have already called
smc_lgrs_shutdown() which calls smc_smcd_terminate_all(smcd) and sets
going_away. This is done a second time in smcd_unregister_dev(). This is
analogous to how smcr is handled in these functions, by calling first
smc_lgrs_shutdown() and then smc_ib_unregister_client() >
smc_ib_remove_dev(), so leave it that way. It may be worth investigating,
whether smc_lgrs_shutdown() is still required or useful.

Remove CONFIG_SMC_LO. CONFIG_DIBS_LO now controls whether a dibs loopback
device exists or not.

Link: https://www.kernel.org/doc/Documentation/driver-model/devres.txt [1]
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-8-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-23 11:13:22 +02:00
Alexandra Winter
884eee8e43 net/smc: Remove error handling of unregister_dmb()
smcd_buf_free() calls smc_ism_unregister_dmb(lgr->smcd, buf_desc) and
then unconditionally frees buf_desc.

Remove the cleaning up of fields of buf_desc in
smc_ism_unregister_dmb(), because it is not helpful.

This removes the only usage of ISM_ERROR from the smc module. So move it
to drivers/s390/net/ism.h.

Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Link: https://patch.msgid.link/20250918110500.1731261-2-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-23 11:13:21 +02:00